Prod-Secret Enforcement — Design Spec¶
Date: 2026-04-23
Status: Draft — autonomous design, user approved
Related: PR #23 review flagged ADMIN_API_SECRET + SERVICE_AUTH_TOKEN silent dev-default fall-through as a platform-wide concern. This spec generalises the fix to every dev-default secret across all services.
1. Scope, Goals, Non-Goals¶
Goal¶
Prevent any service from silently booting in production with a hardcoded dev-default secret. In NODE_ENV=production, each service's AppConfigService throws at construction if any secret env var is unset. In dev/test, keep the current dev defaults unchanged so nobody's local workflow breaks.
Why¶
Every service's AppConfigService currently has rows like:
adminApiSecret: process.env.ADMIN_API_SECRET ?? 'admin-dev-secret',
In production, if the env var is not set, the service silently falls back to the dev value — an obvious security hole. A misconfigured prod deployment ships with a public "admin" token and no alarm fires.
In scope¶
Add a prodRequired(name, devDefault) private method to every AppConfigService. Swap the 14 known-dev-default secrets listed below from process.env.X ?? 'default' to this.prodRequired('X', 'default').
Secrets covered:
| Service | Secrets |
|---|---|
auth |
ADMIN_API_SECRET |
clinical-api |
ADMIN_API_SECRET, JWT_SIGNING_SECRET, ENCRYPTION_MASTER_KEY, KMS_CMK_ARN, S3_ACCESS_KEY_ID, S3_SECRET_ACCESS_KEY |
consent |
ADMIN_API_SECRET |
notifications |
ADMIN_API_SECRET, SERVICE_AUTH_TOKEN, SLACK_BOT_TOKEN, LOCAL_KEK_HEX, NOTIFICATIONS_WRAPPED_DEK |
user-management |
ADMIN_API_SECRET, SERVICE_API_SECRET, COGNITO_USER_POOL_ID |
For empty-string defaults (JWT_SIGNING_SECRET ?? '', ENCRYPTION_MASTER_KEY ?? '', KMS_CMK_ARN ?? '', COGNITO_USER_POOL_ID ?? ''): same treatment. The dev fallback is '' (preserves current no-op-when-unset behaviour), prod throws.
Out of scope¶
- Operational config (ports, regions, bucket names, log level, TTLs, issuer names): fine as dev defaults in prod too. Not treated as secrets.
- Migration to schema-validation library (zod/joi): overkill for a one-line helper.
- Secret sourcing from cloud secret managers (AWS Secrets Manager, Vault, etc.): out of scope. Env vars remain the interface.
- Secret rotation.
- Extracting the helper to
@sa-platform/common: 5 near-identical copies is fine. Extract later if the pattern entrenches.
Success criteria¶
- Running any service with
NODE_ENV=productionand a listed secret unset throws atAppConfigServiceconstruction with a message naming the missing var. - Running any service with
NODE_ENV=test(or unset) and a listed secret unset succeeds with the dev default (no behaviour change). - One unit test per service covering both branches.
- Full monorepo typecheck + test suite green.
- CI (runs with
NODE_ENV=testby default) unaffected.
2. Mechanism¶
Each service's AppConfigService gets this private method:
private prodRequired(name: string, devDefault: string): string {
const value = process.env[name];
if (value) return value;
if (process.env.NODE_ENV === 'production') {
throw new Error(`${name} is required in production`);
}
return devDefault;
}
Call sites swap from:
adminApiSecret: process.env.ADMIN_API_SECRET ?? 'admin-dev-secret',
to:
adminApiSecret: this.prodRequired('ADMIN_API_SECRET', 'admin-dev-secret'),
No other changes to AppConfigService. required() (hard-required at all times, e.g. DATABASE_URL) stays as-is. parsePort() and other helpers untouched.
Production detection¶
Exact-equal process.env.NODE_ENV === 'production'. Standard Node.js convention. Any staging-type environment running real workloads should set NODE_ENV=production — there is no middle ground.
Failure mode¶
Throws at AppConfigService construction → NestJS DI initialisation fails → service doesn't start → Docker/k8s container restarts loop with a clear error in logs. Loud and early.
3. The Actual Changes¶
Files modified (5)¶
services/auth/src/config/app-config.ts
services/clinical-api/src/config/app-config.ts
services/consent/src/config/app-config.ts
services/notifications/src/config/app-config.ts
services/user-management/src/config/app-config.ts
Plus 5 spec files (one per service) to cover the new behaviour:
services/auth/src/config/app-config.spec.ts # new or extend existing
services/clinical-api/src/config/app-config.spec.ts # new or extend existing
services/consent/src/config/app-config.spec.ts # new or extend existing
services/notifications/src/config/app-config.spec.ts # new or extend existing
services/user-management/src/config/app-config.spec.ts # new or extend existing
Check during implementation whether each service already has an app-config.spec.ts. If yes, extend it; if no, create a minimal one.
Files NOT touched¶
.env.examplefiles — current dev defaults are still valid for dev, no documentation change needed.- Integration-test setup — they run with
NODE_ENV=test, unaffected. - CI workflow — runs
NODE_ENV=testby default, unaffected. - Dockerfiles / docker-compose — production deployments set
NODE_ENV=productionand inject real secrets via env; this PR is the contract that makes forgetting that throw loudly instead of silently succeeding.
Helper duplication¶
5 copies of the prodRequired method. Acceptable — 4 lines each, trivial to keep in sync manually. If we add a 6th service and notice the copies drift, extract to @sa-platform/common as a follow-up.
4. Test strategy¶
One unit test per service, structured as two cases:
describe('AppConfigService', () => {
const ORIGINAL_ENV = process.env;
afterEach(() => {
process.env = { ...ORIGINAL_ENV };
});
it('throws in production when ADMIN_API_SECRET is unset', () => {
process.env = { ...ORIGINAL_ENV, NODE_ENV: 'production', DATABASE_URL: 'mysql://x' };
delete process.env.ADMIN_API_SECRET;
expect(() => new AppConfigService()).toThrow(/ADMIN_API_SECRET is required in production/);
});
it('uses dev default when NODE_ENV is not production', () => {
process.env = { ...ORIGINAL_ENV, NODE_ENV: 'test', DATABASE_URL: 'mysql://x' };
delete process.env.ADMIN_API_SECRET;
const config = new AppConfigService();
expect(config.config.adminApiSecret).toBe('admin-dev-secret');
});
});
Each service tests one representative secret. Testing all 14 exhaustively is overkill — the helper is identical per-service, and one case per service proves the wiring.
Integration/smoke¶
No new integration tests. The existing integration suites run under NODE_ENV=test and already validate the dev-default path end-to-end. The prod-throw path is verified at the unit level because starting a real service with missing env vars in integration tests is fiddly for little marginal value.
5. Risks & rollback¶
| Risk | Likelihood | Mitigation |
|---|---|---|
Someone's local dev setup breaks because NODE_ENV=production was set accidentally |
Very low | Dev machines shouldn't set NODE_ENV=production. If someone does, the error is self-explanatory and the fix is one env-var unset. |
| A prod deployment with secrets correctly set still throws due to a whitespace/empty-string quirk | Low | The helper checks if (value) (truthy) — empty strings fall through to the prod check. Matches current implicit behaviour where empty env vars get replaced by ?? defaults. |
| Forgotten secret in the audit | Low | The list in Section 1 was derived by grepping every process.env.X ?? '...' in every service's app-config.ts. If I missed one, it still gets the dev default in prod (same as today) — no regression. |
| Helper drift across the 5 copies | Medium long-term | Accepted. Extract to @sa-platform/common when drift becomes a problem. |
Rollback: revert the PR. No persisted state, no schema changes.
6. Out of scope — follow-ups¶
- Extract
prodRequiredto@sa-platform/common. - Adopt a schema-validation library (zod) for a more expressive config model.
- Integrate with a cloud secret manager (Secrets Manager, Vault).
- Add a CI job that boots every service with
NODE_ENV=productionand no secrets, asserts it throws — cheap insurance that nobody accidentally removes the guard later.