KMS Key Provider: Design Spec
Date: 2026-04-23
Status: Draft — autonomous design, user approved
Related: Completes the KeyProvider abstraction introduced in PR #24 (notifications PHI-at-rest encryption). Enables the notifications service to use AWS KMS for DEK wrapping in production.
1. Scope, Goals, Non-Goals
Goal
Implement a new KmsKeyProvider in @sa-platform/common alongside the existing LocalKeyProvider. Wire the notifications service to optionally use it via its existing NOTIFICATIONS_KEY_PROVIDER=kms config switch. Clinical-api stays on LocalKeyProvider in this PR — migrating it involves separate perf and ops considerations.
Why
LocalKeyProvider works for dev and for production when the operator is comfortable managing a raw 256-bit KEK as an env var. Real prod deployments want AWS KMS — IAM-gated key access, audited usage, rotation handled by AWS. The KeyProvider interface was designed for this swap; we're filling it in.
In scope
- Add KmsKeyProvider implementation in packages/common/src/crypto/kms-key-provider.ts.
- Export from @sa-platform/common's public surface.
- Add @aws-sdk/client-kms as a dependency of packages/common.
- Update services/notifications/src/crypto/crypto.module.ts factory to branch on config.keyProvider:
  - 'kms' → construct KmsKeyProvider({ keyId, region })
  - anything else (default 'local') → current LocalKeyProvider
- Add NOTIFICATIONS_KMS_KEY_ID and AWS_REGION config fields to notifications' AppConfigService. AWS_REGION is optional-with-default ('eu-west-2') in dev, required in prod via prodRequired if keyProvider === 'kms'.
- .env.example entries documenting the new vars.
- Unit tests for KmsKeyProvider with mocked @aws-sdk/client-kms.
Out of scope (explicit)
- Migrating clinical-api to KMS. Clinical-api unwraps DEKs per-patient, which at scale means a KMS call per patient operation. Needs caching design + ops readiness. Separate PR.
- KEK caching pattern for per-request performance. Notifications' service-wide DEK unwraps once at boot; irrelevant here. Clinical-api migration will need it.
- Actual KMS CMK provisioning via Terraform/CloudFormation. Infra concern, not this PR.
- Integration tests using real KMS. CI uses LocalKeyProvider (same as today). Real-KMS smoke verification is manual at deploy time.
- DEK rotation. Separate follow-up.
Success criteria
- KmsKeyProvider wraps and unwraps DEKs correctly against a mocked KMS client.
- Notifications service boots with NOTIFICATIONS_KEY_PROVIDER=kms + valid NOTIFICATIONS_KMS_KEY_ID and uses KmsKeyProvider at runtime.
- Notifications service boots with NOTIFICATIONS_KEY_PROVIDER=local (default) exactly as today, with zero behaviour change.
- KmsKeyProvider surfaces typed errors that distinguish 4xx (invalid key, access denied; not retryable) from 5xx (retryable). Errors with no HTTP status, such as network failures, are classified as unknown.
- @sa-platform/common exports the new class.
- Full monorepo typecheck + test suites green.
2. Architecture
KeyProvider interface (unchanged)
export interface KeyProvider {
  wrapKey(plainDek: Buffer): Promise<string>;
  unwrapKey(wrappedDek: string): Promise<Buffer>;
}
Return type of wrapKey is a string — opaque to the caller, each provider chooses its own format. No interface change required.
KmsKeyProvider shape
import { EncryptCommand, DecryptCommand, KMSClient } from '@aws-sdk/client-kms';
import type { KeyProvider } from './key-provider.interface';

export interface KmsKeyProviderOptions {
  keyId: string; // CMK ARN or alias (e.g. 'alias/sa-platform-notifications')
  region: string; // e.g. 'eu-west-2'
}

export class KmsProviderError extends Error {
  constructor(
    message: string,
    public readonly classification: 'client' | 'server' | 'unknown',
    public readonly cause?: unknown,
  ) {
    super(message);
    this.name = 'KmsProviderError';
  }
}

export class KmsKeyProvider implements KeyProvider {
  private readonly client: KMSClient;

  constructor(private readonly options: KmsKeyProviderOptions) {
    this.client = new KMSClient({ region: options.region });
  }

  async wrapKey(plainDek: Buffer): Promise<string> {
    try {
      const result = await this.client.send(
        new EncryptCommand({ KeyId: this.options.keyId, Plaintext: plainDek }),
      );
      if (!result.CiphertextBlob) {
        throw new KmsProviderError('KMS Encrypt returned no ciphertext', 'unknown');
      }
      // Wrapped format: base64 of the raw KMS ciphertext blob.
      return Buffer.from(result.CiphertextBlob).toString('base64');
    } catch (err) {
      throw this.classifyAndRethrow(err, 'wrapKey');
    }
  }

  async unwrapKey(wrappedDek: string): Promise<Buffer> {
    try {
      const ciphertext = Buffer.from(wrappedDek, 'base64');
      // No KeyId: KMS resolves the CMK from the ciphertext blob.
      const result = await this.client.send(new DecryptCommand({ CiphertextBlob: ciphertext }));
      if (!result.Plaintext) {
        throw new KmsProviderError('KMS Decrypt returned no plaintext', 'unknown');
      }
      return Buffer.from(result.Plaintext);
    } catch (err) {
      throw this.classifyAndRethrow(err, 'unwrapKey');
    }
  }

  // Returns (does not throw) a KmsProviderError; callers throw the result.
  private classifyAndRethrow(err: unknown, operation: string): KmsProviderError {
    // Errors raised inside the try blocks above are already typed; pass them through.
    if (err instanceof KmsProviderError) return err;
    const e = err as { $metadata?: { httpStatusCode?: number }; name?: string };
    const status = e?.$metadata?.httpStatusCode;
    let classification: 'client' | 'server' | 'unknown' = 'unknown';
    if (status && status >= 500) classification = 'server';
    else if (status && status >= 400 && status < 500) classification = 'client';
    return new KmsProviderError(
      `KMS ${operation} failed (status=${status ?? 'unknown'}, name=${e?.name ?? 'unknown'})`,
      classification,
      err,
    );
  }
}
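The classification field on KmsProviderError is what a caller would branch on when deciding whether to retry. The following is a minimal sketch of such a caller-side wrapper, not part of this PR. The helper name withKmsRetry, its parameters, and the backoff values are assumptions for illustration, and the error check is structural so the sketch stays self-contained. The AWS SDK v3 client also applies its own retries by default, so a wrapper like this would sit on top of that behaviour.

```typescript
// Structural stand-in for KmsProviderError so this sketch has no imports.
type KmsClassification = 'client' | 'server' | 'unknown';

interface ClassifiedError {
  classification: KmsClassification;
}

function isClassified(err: unknown): err is ClassifiedError {
  return typeof err === 'object' && err !== null && 'classification' in err;
}

// Hypothetical helper: retries only errors classified as 'server'.
// 'client' and 'unknown' errors are rethrown immediately.
async function withKmsRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      const retryable = isClassified(err) && err.classification === 'server';
      if (!retryable || attempt === maxAttempts) break;
      // Exponential backoff: baseDelayMs, then 2x, 4x, ...
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

A caller could then write something like withKmsRetry(() => provider.unwrapKey(wrappedDek)), where provider is any KeyProvider.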
Key decisions:
- Decrypt doesn't pass KeyId. For symmetric keys, KMS's Decrypt can infer the CMK from the ciphertext blob, which keeps cross-CMK support working. If the service's IAM role can't decrypt with that CMK, Decrypt throws.
- Eager SDK init in constructor. KMSClient construction is cheap; no need to defer.
- Error classification at the boundary helps callers retry transients (server) but not client errors (bad key, access denied).
- Wrapped format is pure base64 of the KMS blob, distinguishable from LocalKeyProvider's iv:ciphertext:authTag format (3 colon-separated parts) if we ever need to tell them apart at runtime. Not needed in this PR, but a useful property.
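To illustrate the last point, here is a rough sketch of how the two wrapped formats could be told apart at runtime. The function name detectWrappedFormat is hypothetical and nothing in this PR needs it. It relies only on the two format descriptions above: three colon-separated parts for LocalKeyProvider, and padded standard base64 (which never contains a colon) for KmsKeyProvider.

```typescript
type WrappedFormat = 'local' | 'kms' | 'unknown';

// Hypothetical helper, shown only to illustrate the format property.
function detectWrappedFormat(wrapped: string): WrappedFormat {
  // LocalKeyProvider output: iv:ciphertext:authTag (three colon-separated parts).
  if (wrapped.split(':').length === 3) return 'local';

  // KmsKeyProvider output: Buffer#toString('base64'), i.e. padded standard
  // base64 whose length is a multiple of 4.
  const looksLikeBase64 =
    wrapped.length > 0 &&
    wrapped.length % 4 === 0 &&
    /^[A-Za-z0-9+\/]+={0,2}$/.test(wrapped);
  if (looksLikeBase64) return 'kms';

  return 'unknown';
}
```

This is a shape check only; it does not prove that a value was produced by either provider.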
Notifications crypto module: conditional factory
services/notifications/src/crypto/crypto.module.ts:
@Module({
  providers: [
    CryptoService,
    NotificationsDekResolver,
    NotificationCryptoHelper,
    {
      provide: KEY_PROVIDER,
      useFactory: (config: AppConfigService) => {
        const { keyProvider, kmsKeyId, awsRegion, localKekHex } = config.config;
        if (keyProvider === 'kms') {
          if (!kmsKeyId) {
            throw new Error(
              'NOTIFICATIONS_KMS_KEY_ID is required when NOTIFICATIONS_KEY_PROVIDER=kms',
            );
          }
          return new KmsKeyProvider({ keyId: kmsKeyId, region: awsRegion });
        }
        return new LocalKeyProvider(localKekHex);
      },
      inject: [AppConfigService],
    },
  ],
  exports: [NotificationCryptoHelper],
})
export class NotificationsCryptoModule {}
Notifications config additions
AppConfigService.config:
// existing
keyProvider: 'local' | 'kms';
// new
kmsKeyId: string | undefined; // required only when keyProvider === 'kms'
awsRegion: string; // default 'eu-west-2'
Construction (using prodRequired from PR #26 where relevant):
kmsKeyId: process.env.NOTIFICATIONS_KMS_KEY_ID || undefined,
awsRegion: process.env.AWS_REGION ?? 'eu-west-2',
kmsKeyId is NOT always required — only when keyProvider === 'kms'. The factory checks that explicitly, so we don't put it through prodRequired. If someone sets NOTIFICATIONS_KEY_PROVIDER=kms in prod without NOTIFICATIONS_KMS_KEY_ID, the service fails at module init with a clear message.
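The combined effect of the config defaults and the factory check can be sketched as one pure function over the environment. This is an illustration under assumptions, not the actual AppConfigService: the function name resolveKeyProviderConfig and the KeyProviderConfig type are hypothetical, and in the real code the missing-key check lives in the module factory rather than in config construction.

```typescript
type KeyProviderKind = 'local' | 'kms';

interface KeyProviderConfig {
  keyProvider: KeyProviderKind;
  kmsKeyId: string | undefined;
  awsRegion: string;
}

// Hypothetical helper mirroring the defaults and the kms-only requirement.
function resolveKeyProviderConfig(
  env: Record<string, string | undefined>,
): KeyProviderConfig {
  // Anything other than 'kms' falls back to the local provider.
  const keyProvider: KeyProviderKind =
    env['NOTIFICATIONS_KEY_PROVIDER'] === 'kms' ? 'kms' : 'local';
  // Empty string is treated the same as unset.
  const kmsKeyId = env['NOTIFICATIONS_KMS_KEY_ID'] || undefined;
  const awsRegion = env['AWS_REGION'] ?? 'eu-west-2';

  // The key id is required only when the kms provider is selected.
  if (keyProvider === 'kms' && !kmsKeyId) {
    throw new Error(
      'NOTIFICATIONS_KMS_KEY_ID is required when NOTIFICATIONS_KEY_PROVIDER=kms',
    );
  }
  return { keyProvider, kmsKeyId, awsRegion };
}
```

With an empty environment this yields the local provider and the default region; selecting kms without a key id fails fast with the same message the factory uses.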
3. The Actual Changes
Files created
packages/common/src/crypto/kms-key-provider.ts
packages/common/src/crypto/kms-key-provider.spec.ts
Files modified
packages/common/package.json # add @aws-sdk/client-kms dependency
packages/common/src/index.ts # export KmsKeyProvider + KmsProviderError + KmsKeyProviderOptions
services/notifications/src/config/app-config.ts # add kmsKeyId, awsRegion
services/notifications/src/crypto/crypto.module.ts # conditional factory
services/notifications/.env.example # document NOTIFICATIONS_KMS_KEY_ID + AWS_REGION
pnpm-lock.yaml # regenerated
Files NOT touched
- packages/common/src/crypto/local-key-provider.ts: unchanged.
- packages/common/src/crypto/key-provider.interface.ts: unchanged.
- packages/common/src/crypto/crypto.module.ts: the shared CryptoModule's defaults stay LocalKeyProvider. Services using CryptoModule.forRoot with custom config get whatever they configure.
- services/clinical-api/**: stays on LocalKeyProvider. No migration in this PR.
- All integration tests: use LocalKeyProvider via their existing env setup.
4. Unit tests
kms-key-provider.spec.ts:
Mock @aws-sdk/client-kms at the module level (jest.mock('@aws-sdk/client-kms')). Assert KMSClient.send call shapes.
Cases:
- wrapKey happy path. Mock send to return { CiphertextBlob: Buffer.from([0xab, 0xcd]) }. Assert the base64 of [0xab, 0xcd] is returned. Assert EncryptCommand was constructed with the configured keyId and the plain DEK.
- wrapKey server error (500 status). Mock throws with { $metadata: { httpStatusCode: 500 }, name: 'InternalError' }. Assert throws KmsProviderError with classification: 'server'.
- wrapKey client error (4xx). Mock throws with { $metadata: { httpStatusCode: 400 }, name: 'InvalidKeyUsageException' }. Assert throws KmsProviderError with classification: 'client'.
- wrapKey returns no ciphertext. Mock returns {}. Assert throws KmsProviderError with 'KMS Encrypt returned no ciphertext'.
- unwrapKey happy path. Round-trip: base64-encode a sentinel blob, mock send to return { Plaintext: Buffer.from('SECRET') }, assert returned Buffer is Buffer.from('SECRET'). Assert DecryptCommand was constructed without KeyId and with the correct CiphertextBlob buffer.
- unwrapKey client error (e.g. access denied). Mock throws with $metadata.httpStatusCode = 400, name: 'AccessDeniedException'. Assert classification: 'client'.
- unwrapKey returns no plaintext. Mock returns {}. Assert throws KmsProviderError with 'KMS Decrypt returned no plaintext'.
- Network / unknown error. Mock throws with no $metadata. Assert classification: 'unknown'.
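The error cases above reduce to a mapping from mock error shape to expected classification. As a sketch, the status check inside classifyAndRethrow can be restated as a standalone function and paired with the same mock shapes. The name classifyKmsError is hypothetical; the real logic stays private to KmsKeyProvider and would be exercised through wrapKey and unwrapKey with a mocked client.

```typescript
type Classification = 'client' | 'server' | 'unknown';

// Hypothetical standalone restatement of the status check in classifyAndRethrow.
function classifyKmsError(err: unknown): Classification {
  const status = (
    err as { $metadata?: { httpStatusCode?: number } } | null | undefined
  )?.$metadata?.httpStatusCode;
  if (typeof status !== 'number') return 'unknown';
  if (status >= 500) return 'server';
  if (status >= 400) return 'client';
  return 'unknown';
}

// Mock error shapes taken from the test cases, paired with the expected result.
const classificationCases: Array<[unknown, Classification]> = [
  [{ $metadata: { httpStatusCode: 500 }, name: 'InternalError' }, 'server'],
  [{ $metadata: { httpStatusCode: 400 }, name: 'InvalidKeyUsageException' }, 'client'],
  [{ $metadata: { httpStatusCode: 400 }, name: 'AccessDeniedException' }, 'client'],
  [new Error('socket hang up'), 'unknown'], // no $metadata at all
];
```

A table like classificationCases maps naturally onto a parameterised test, one row per case.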
5. Risks & rollback
| Risk | Likelihood | Mitigation |
|---|---|---|
| @aws-sdk/client-kms in packages/common bloats bundle size for services that don't use KMS | Low-Medium | AWS SDK v3 is modular; @aws-sdk/client-kms is ~1.5MB. Only loaded if a consumer instantiates KmsKeyProvider. Shared package already depends on @aws-sdk/client-sesv2 in notifications-specific code paths, same pattern. |
| Prod deployment forgets NOTIFICATIONS_KMS_KEY_ID after flipping NOTIFICATIONS_KEY_PROVIDER=kms | Low | Factory throws at module init with clear message naming the missing var. |
| IAM role lacks kms:Encrypt / kms:Decrypt on the CMK | Operational | Wrapped error surfaces as KmsProviderError with classification: 'client'. Clear failure at first PHI send or boot-time unwrap. |
| Switching notifications' existing wrapped DEK from local to kms requires re-generating | Expected | Operator generates a fresh DEK under KMS when making the switch. Documented below. |
Rollback: revert the PR. No persisted state, no schema changes. If notifications has already been re-wrapped with KMS in prod, reverting means re-generating a local-wrapped DEK.
Operator runbook addition (not part of this PR but worth noting)
To switch from local to kms for an existing deployment:
- Pre-provision a KMS CMK with the notifications service's IAM role granted kms:Encrypt + kms:Decrypt.
- Set NOTIFICATIONS_KEY_PROVIDER=kms + NOTIFICATIONS_KMS_KEY_ID=<arn> + AWS_REGION=<region>.
- Generate a fresh DEK wrapped with KMS via a one-off script: call KmsKeyProvider.wrapKey(randomBytes(32)) and set the result as NOTIFICATIONS_WRAPPED_DEK.
- Redeploy. All new PHI-encrypted writes will use the KMS-wrapped DEK.
Old local-wrapped notifications in the DB become unreadable. Acceptable per the "forward-only" stance from PR #24 (near-zero PHI volume historically).
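The DEK-generation step in the runbook could look roughly like the sketch below. It is written against a minimal structural interface so it runs without AWS access; in a real one-off script the provider would be a KmsKeyProvider constructed with the target key id and region. The names generateWrappedDek and KeyWrapper are assumptions for illustration.

```typescript
import { randomBytes } from 'node:crypto';

// Minimal structural view of the KeyProvider interface from section 2.
interface KeyWrapper {
  wrapKey(plainDek: Buffer): Promise<string>;
}

// Hypothetical one-off helper: creates a fresh 256-bit DEK and returns it
// wrapped by the given provider. The plaintext DEK never leaves this function.
async function generateWrappedDek(provider: KeyWrapper): Promise<string> {
  const plainDek = randomBytes(32);
  try {
    return await provider.wrapKey(plainDek);
  } finally {
    // Best-effort scrub of the plaintext key material.
    plainDek.fill(0);
  }
}
```

The returned string is the value an operator would place in NOTIFICATIONS_WRAPPED_DEK.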
6. Out of scope: follow-ups
- Clinical-api migration to KMS with per-request DEK cache + request-context threading. Bigger piece of work.
- Terraform / infra code to provision the KMS CMK with correct IAM.
- DEK rotation — KMS makes this easier (generate new plain DEK, re-wrap, swap env var) but needs an operator workflow.
- CMK rotation — AWS KMS rotates the underlying key material annually if enabled; ciphertexts from old versions still decrypt. No code change needed.