KMS Key Provider — Design Spec

Date: 2026-04-23
Status: Draft (autonomous design, user approved)
Related: Completes the KeyProvider abstraction introduced in PR #24 (notifications PHI-at-rest encryption). Enables the notifications service to use AWS KMS for DEK wrapping in production.


1. Scope, Goals, Non-Goals

Goal

Implement a new KmsKeyProvider in @sa-platform/common alongside the existing LocalKeyProvider. Wire the notifications service to optionally use it via its existing NOTIFICATIONS_KEY_PROVIDER=kms config switch. Clinical-api stays on LocalKeyProvider in this PR — migrating it involves separate perf and ops considerations.

Why

LocalKeyProvider works for dev and for production when the operator is comfortable managing a raw 256-bit KEK as an env var. Real prod deployments want AWS KMS — IAM-gated key access, audited usage, rotation handled by AWS. The KeyProvider interface was designed for this swap; we're filling it in.

In scope

  • Add KmsKeyProvider implementation in packages/common/src/crypto/kms-key-provider.ts.
  • Export from @sa-platform/common's public surface.
  • Add @aws-sdk/client-kms as a dependency of packages/common.
  • Update services/notifications/src/crypto/crypto.module.ts factory to branch on config.keyProvider:
      • 'kms' → construct KmsKeyProvider({ keyId, region })
      • anything else (default 'local') → current LocalKeyProvider
  • Add NOTIFICATIONS_KMS_KEY_ID and AWS_REGION config fields to notifications' AppConfigService. AWS_REGION is optional-with-default ('eu-west-2') in dev, required in prod via prodRequired if keyProvider === 'kms'.
  • .env.example entries documenting the new vars.
  • Unit tests for KmsKeyProvider with mocked @aws-sdk/client-kms.

Out of scope (explicit)

  • Migrating clinical-api to KMS. Clinical-api unwraps DEKs per-patient, which at scale means a KMS call per patient operation. Needs caching design + ops readiness. Separate PR.
  • KEK caching pattern for per-request performance. Notifications' service-wide DEK unwraps once at boot; irrelevant here. Clinical-api migration will need it.
  • Actual KMS CMK provisioning via Terraform/CloudFormation. Infra concern, not this PR.
  • Integration tests using real KMS. CI uses LocalKeyProvider (same as today). Real-KMS smoke verification is manual at deploy time.
  • DEK rotation. Separate follow-up.

Success criteria

  • KmsKeyProvider wraps and unwraps DEKs correctly against a mocked KMS client.
  • Notifications service boots with NOTIFICATIONS_KEY_PROVIDER=kms + valid NOTIFICATIONS_KMS_KEY_ID and uses KmsKeyProvider at runtime.
  • Notifications service boots with NOTIFICATIONS_KEY_PROVIDER=local (default) exactly as today — zero behaviour change.
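  • KmsKeyProvider surfaces typed errors that distinguish 4xx responses (invalid key, access denied; not retryable) from 5xx responses (retryable). Failures with no HTTP status, such as network errors, are classified 'unknown'.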
  • @sa-platform/common exports the new class.
  • Full monorepo typecheck + test suites green.

2. Architecture

KeyProvider interface — unchanged

export interface KeyProvider {
  wrapKey(plainDek: Buffer): Promise<string>;
  unwrapKey(wrappedDek: string): Promise<Buffer>;
}

Return type of wrapKey is a string — opaque to the caller, each provider chooses its own format. No interface change required.

KmsKeyProvider shape

import { EncryptCommand, DecryptCommand, KMSClient } from '@aws-sdk/client-kms';

export interface KmsKeyProviderOptions {
  keyId: string; // CMK ARN or alias (e.g. 'alias/sa-platform-notifications')
  region: string; // e.g. 'eu-west-2'
}

export class KmsProviderError extends Error {
  constructor(
    message: string,
    public readonly classification: 'client' | 'server' | 'unknown',
    public readonly cause?: unknown,
  ) {
    super(message);
    this.name = 'KmsProviderError';
  }
}

export class KmsKeyProvider implements KeyProvider {
  private readonly client: KMSClient;

  constructor(private readonly options: KmsKeyProviderOptions) {
    this.client = new KMSClient({ region: options.region });
  }

  async wrapKey(plainDek: Buffer): Promise<string> {
    try {
      const result = await this.client.send(
        new EncryptCommand({ KeyId: this.options.keyId, Plaintext: plainDek }),
      );
      if (!result.CiphertextBlob) {
        throw new KmsProviderError('KMS Encrypt returned no ciphertext', 'unknown');
      }
      return Buffer.from(result.CiphertextBlob).toString('base64');
    } catch (err) {
      throw this.classifyAndRethrow(err, 'wrapKey');
    }
  }

  async unwrapKey(wrappedDek: string): Promise<Buffer> {
    try {
      const ciphertext = Buffer.from(wrappedDek, 'base64');
      const result = await this.client.send(new DecryptCommand({ CiphertextBlob: ciphertext }));
      if (!result.Plaintext) {
        throw new KmsProviderError('KMS Decrypt returned no plaintext', 'unknown');
      }
      return Buffer.from(result.Plaintext);
    } catch (err) {
      throw this.classifyAndRethrow(err, 'unwrapKey');
    }
  }

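  // Maps an SDK failure to a KmsProviderError. Returns the error (the call sites do the throw).
  private classifyAndRethrow(err: unknown, operation: string): KmsProviderError {
    // Already classified (e.g. the missing-ciphertext guard): pass through unchanged.
    if (err instanceof KmsProviderError) return err;
    const e = err as { $metadata?: { httpStatusCode?: number }; name?: string };
    const status = e?.$metadata?.httpStatusCode;
    // No HTTP status (network failure, non-SDK error) stays 'unknown'.
    let classification: 'client' | 'server' | 'unknown' = 'unknown';
    if (status && status >= 500) classification = 'server';
    else if (status && status >= 400 && status < 500) classification = 'client';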
    return new KmsProviderError(
      `KMS ${operation} failed (status=${status ?? 'unknown'}, name=${e?.name ?? 'unknown'})`,
      classification,
      err,
    );
  }
}

Key decisions:

  • Decrypt doesn't pass KeyId. For symmetric CMKs, KMS Decrypt identifies the key from metadata in the ciphertext blob, so a DEK wrapped under a different CMK still unwraps (cross-CMK support). If the service's IAM role can't decrypt with that CMK, Decrypt throws.
  • Eager SDK init in constructor. KMSClient construction is cheap, so there is no need to defer it.
  • Error classification at the boundary helps callers retry transients (server) but not client errors (bad key, access denied).
  • Wrapped format is pure base64 of the KMS blob — distinguishable from LocalKeyProvider's iv:ciphertext:authTag format (3 colon-separated parts) if we ever need to tell them apart at runtime. Not needed in this PR, but a useful property.
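
The error classification described above lets a caller retry only transient failures. The following is a minimal sketch, not part of this PR: withKmsRetry is a hypothetical helper, the attempt count and backoff values are illustrative, and KmsProviderError is redeclared locally (without the cause field) so the snippet runs on its own.

```typescript
type Classification = 'client' | 'server' | 'unknown';

// Local stand-in mirroring the KmsProviderError shape from this spec,
// so the sketch runs without @sa-platform/common.
class KmsProviderError extends Error {
  constructor(message: string, public readonly classification: Classification) {
    super(message);
    this.name = 'KmsProviderError';
    // Keeps `instanceof` working if this is compiled down to ES5.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Hypothetical helper: retries only 'server' failures with exponential backoff.
// 'client' and 'unknown' errors are rethrown immediately.
async function withKmsRetry<T>(
  op: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      const retryable = err instanceof KmsProviderError && err.classification === 'server';
      if (!retryable || attempt >= maxAttempts) throw err;
      // Backoff doubles each attempt: baseDelayMs, then 2x, then 4x.
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

A caller would wrap a provider call, for example withKmsRetry(() => provider.unwrapKey(wrappedDek)). Whether 'unknown' errors should also be retried is a policy choice this sketch leaves out.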

Notifications crypto module — conditional factory

services/notifications/src/crypto/crypto.module.ts:

@Module({
  providers: [
    CryptoService,
    NotificationsDekResolver,
    NotificationCryptoHelper,
    {
      provide: KEY_PROVIDER,
      useFactory: (config: AppConfigService) => {
        const { keyProvider, kmsKeyId, awsRegion, localKekHex } = config.config;
        if (keyProvider === 'kms') {
          if (!kmsKeyId) {
            throw new Error(
              'NOTIFICATIONS_KMS_KEY_ID is required when NOTIFICATIONS_KEY_PROVIDER=kms',
            );
          }
          return new KmsKeyProvider({ keyId: kmsKeyId, region: awsRegion });
        }
        return new LocalKeyProvider(localKekHex);
      },
      inject: [AppConfigService],
    },
  ],
  exports: [NotificationCryptoHelper],
})
export class NotificationsCryptoModule {}

Notifications config additions

AppConfigService.config:

// existing
keyProvider: 'local' | 'kms';

// new
kmsKeyId: string | undefined; // required only when keyProvider === 'kms'
awsRegion: string; // default 'eu-west-2'

Construction (using prodRequired from PR #26 where relevant):

kmsKeyId: process.env.NOTIFICATIONS_KMS_KEY_ID || undefined,
awsRegion: process.env.AWS_REGION ?? 'eu-west-2',

kmsKeyId is NOT always required — only when keyProvider === 'kms'. The factory checks that explicitly, so we don't put it through prodRequired. If someone sets NOTIFICATIONS_KEY_PROVIDER=kms in prod without NOTIFICATIONS_KMS_KEY_ID, the service fails at module init with a clear message.
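
The config rules above can be sketched without the framework. This is an illustration under assumptions: loadKeyProviderConfig and assertKmsConfig are hypothetical names, the real AppConfigService is not shown in this spec, and the env argument stands in for process.env.

```typescript
// Hypothetical, framework-free sketch of the config rules described above.
type KeyProviderKind = 'local' | 'kms';

interface KeyProviderConfig {
  keyProvider: KeyProviderKind;
  kmsKeyId: string | undefined; // required only when keyProvider === 'kms'
  awsRegion: string; // defaults to 'eu-west-2'
}

type Env = Record<string, string | undefined>;

function loadKeyProviderConfig(env: Env): KeyProviderConfig {
  // Anything other than 'kms' falls back to the default 'local' provider.
  const keyProvider: KeyProviderKind =
    env['NOTIFICATIONS_KEY_PROVIDER'] === 'kms' ? 'kms' : 'local';
  return {
    keyProvider,
    // `||` maps an empty string (e.g. a blank .env entry) to undefined.
    kmsKeyId: env['NOTIFICATIONS_KMS_KEY_ID'] || undefined,
    // `??` only applies the default when the variable is unset.
    awsRegion: env['AWS_REGION'] ?? 'eu-west-2',
  };
}

// Mirrors the factory's guard: fail fast, naming the missing variable.
function assertKmsConfig(config: KeyProviderConfig): void {
  if (config.keyProvider === 'kms' && !config.kmsKeyId) {
    throw new Error(
      'NOTIFICATIONS_KMS_KEY_ID is required when NOTIFICATIONS_KEY_PROVIDER=kms',
    );
  }
}
```

Note that with `??`, an AWS_REGION set to an empty string is passed through as-is rather than replaced by the default; whether that matters depends on how the deployment populates its environment.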


3. The Actual Changes

Files created

packages/common/src/crypto/kms-key-provider.ts
packages/common/src/crypto/kms-key-provider.spec.ts

Files modified

packages/common/package.json                          # add @aws-sdk/client-kms dependency
packages/common/src/index.ts                          # export KmsKeyProvider + KmsProviderError + KmsKeyProviderOptions

services/notifications/src/config/app-config.ts       # add kmsKeyId, awsRegion
services/notifications/src/crypto/crypto.module.ts    # conditional factory
services/notifications/.env.example                   # document NOTIFICATIONS_KMS_KEY_ID + AWS_REGION

pnpm-lock.yaml                                        # regenerated

Files NOT touched

  • packages/common/src/crypto/local-key-provider.ts — unchanged.
  • packages/common/src/crypto/key-provider.interface.ts — unchanged.
  • packages/common/src/crypto/crypto.module.ts — the shared CryptoModule's defaults stay LocalKeyProvider. Services using CryptoModule.forRoot with custom config get whatever they configure.
  • services/clinical-api/** — stays on LocalKeyProvider. No migration in this PR.
  • All integration tests — use LocalKeyProvider via their existing env setup.

4. Unit tests

kms-key-provider.spec.ts:

Mock @aws-sdk/client-kms at the module level (jest.mock('@aws-sdk/client-kms')). Assert KMSClient.send call shapes.

Cases:

  1. wrapKey happy path. Mock send to return { CiphertextBlob: Buffer.from([0xab, 0xcd]) }. Assert the base64 of [0xab, 0xcd] is returned. Assert EncryptCommand was constructed with the configured keyId and the plain DEK.
  2. wrapKey server error (500 status). Mock throws with { $metadata: { httpStatusCode: 500 }, name: 'InternalError' }. Assert throws KmsProviderError with classification: 'server'.
  3. wrapKey client error (4xx). Mock throws with { $metadata: { httpStatusCode: 400 }, name: 'InvalidKeyUsageException' }. Assert throws KmsProviderError with classification: 'client'.
  4. wrapKey returns no ciphertext. Mock returns {}. Assert throws KmsProviderError with 'KMS Encrypt returned no ciphertext'.
  5. unwrapKey happy path. Round-trip: base64-encode a sentinel blob, mock send to return { Plaintext: Buffer.from('SECRET') }, assert returned Buffer is Buffer.from('SECRET'). Assert DecryptCommand was constructed without KeyId and with the correct CiphertextBlob buffer.
  6. unwrapKey client error (e.g. access denied). Mock throws with $metadata.httpStatusCode = 400, name: 'AccessDeniedException'. Assert classification: 'client'.
  7. unwrapKey returns no plaintext. Mock returns {}. Assert throws KmsProviderError with 'KMS Decrypt returned no plaintext'.
  8. Network / unknown error. Mock throws with no $metadata. Assert classification: 'unknown'.
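
As a sanity check on the classifications expected in cases 2, 3, 6 and 8, the status rule can be exercised on its own, without Jest or the AWS SDK. In this sketch, classifyStatus is a hypothetical standalone copy of the rule inside classifyAndRethrow, and the error objects are the mock shapes listed above.

```typescript
type Classification = 'client' | 'server' | 'unknown';

// Standalone copy of the status-to-classification rule from classifyAndRethrow.
function classifyStatus(err: unknown): Classification {
  const status = (err as { $metadata?: { httpStatusCode?: number } } | null)?.$metadata
    ?.httpStatusCode;
  if (status && status >= 500) return 'server';
  if (status && status >= 400 && status < 500) return 'client';
  return 'unknown';
}

// Mock error shapes taken from cases 2, 3, 6 and 8.
const classificationCases: Array<{ label: string; err: unknown; expected: Classification }> = [
  {
    label: 'case 2: server error',
    err: { $metadata: { httpStatusCode: 500 }, name: 'InternalError' },
    expected: 'server',
  },
  {
    label: 'case 3: invalid key usage',
    err: { $metadata: { httpStatusCode: 400 }, name: 'InvalidKeyUsageException' },
    expected: 'client',
  },
  {
    label: 'case 6: access denied',
    err: { $metadata: { httpStatusCode: 400 }, name: 'AccessDeniedException' },
    expected: 'client',
  },
  {
    label: 'case 8: no $metadata',
    err: new Error('socket hang up'),
    expected: 'unknown',
  },
];

// Labels of any cases whose classification differs from the expectation.
const classificationFailures = classificationCases
  .filter((c) => classifyStatus(c.err) !== c.expected)
  .map((c) => c.label);
```

The real spec file would assert on the KmsProviderError thrown by wrapKey and unwrapKey; this table only checks the status mapping those assertions rely on.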

5. Risks & rollback

Risk | Likelihood | Mitigation
@aws-sdk/client-kms in packages/common bloats bundle size for services that don't use KMS | Low-Medium | AWS SDK v3 is modular; @aws-sdk/client-kms is ~1.5MB. Only loaded if a consumer instantiates KmsKeyProvider. Shared package already depends on @aws-sdk/client-sesv2 in notifications-specific code paths, same pattern.
Prod deployment forgets NOTIFICATIONS_KMS_KEY_ID after flipping NOTIFICATIONS_KEY_PROVIDER=kms | Low | Factory throws at module init with clear message naming the missing var.
IAM role lacks kms:Encrypt / kms:Decrypt on the CMK | Operational | Wrapped error surfaces as KmsProviderError with classification: 'client'. Clear failure at first PHI send or boot-time unwrap.
Switching notifications' existing wrapped DEK from local to kms requires re-generating | Expected | Operator generates a fresh DEK under KMS when making the switch. Documented below.

Rollback: revert the PR. No persisted state, no schema changes. If notifications has already been re-wrapped with KMS in prod, reverting means re-generating a local-wrapped DEK.

Operator runbook addition (not part of this PR but worth noting)

To switch from local to kms for an existing deployment:

  1. Pre-provision a KMS CMK with the notifications service's IAM role granted kms:Encrypt + kms:Decrypt.
  2. Set NOTIFICATIONS_KEY_PROVIDER=kms + NOTIFICATIONS_KMS_KEY_ID=<arn> + AWS_REGION=<region>.
  3. Generate a fresh DEK wrapped with KMS via a one-off script: create the DEK with randomBytes(32), wrap it with wrapKey on a KmsKeyProvider instance, and set the result as NOTIFICATIONS_WRAPPED_DEK.
  4. Redeploy. All new PHI-encrypted writes will use the KMS-wrapped DEK.

Old local-wrapped notifications in the DB become unreadable. Acceptable per the "forward-only" stance from PR #24 (near-zero PHI volume historically).
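
The one-off script in step 3 could look roughly like the sketch below. It is an illustration, not the actual script: generateWrappedDek is a hypothetical name, the KeyProvider interface is repeated from section 2 so the snippet is self-contained, and the round-trip check is an added safeguard that assumes the caller also holds kms:Decrypt.

```typescript
import { randomBytes } from 'node:crypto';

// Same interface as the KeyProvider in section 2, repeated here for self-containment.
interface KeyProvider {
  wrapKey(plainDek: Buffer): Promise<string>;
  unwrapKey(wrappedDek: string): Promise<Buffer>;
}

// Hypothetical helper for step 3: generate a fresh 256-bit DEK and wrap it.
// Confirms the wrapped value unwraps to the same bytes before returning it.
async function generateWrappedDek(provider: KeyProvider): Promise<string> {
  const plainDek = randomBytes(32);
  try {
    const wrapped = await provider.wrapKey(plainDek);
    const roundTrip = await provider.unwrapKey(wrapped);
    if (!roundTrip.equals(plainDek)) {
      throw new Error('Wrapped DEK did not round-trip through the key provider');
    }
    return wrapped;
  } finally {
    // Best-effort scrub of the plaintext DEK from this buffer.
    plainDek.fill(0);
  }
}
```

In the real script the provider would be a KmsKeyProvider built with the target keyId and region, and the returned string would be set as NOTIFICATIONS_WRAPPED_DEK.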


6. Out of scope — follow-ups

  • Clinical-api migration to KMS with per-request DEK cache + request-context threading. Bigger piece of work.
  • Terraform / infra code to provision the KMS CMK with correct IAM.
  • DEK rotation — KMS makes this easier (generate new plain DEK, re-wrap, swap env var) but needs an operator workflow.
  • CMK rotation — AWS KMS rotates the underlying key material annually if enabled; ciphertexts from old versions still decrypt. No code change needed.