Skip to content

Prod-Secret Enforcement — Design Spec

Date: 2026-04-23 Status: Draft — autonomous design, user approved Related: PR #23 review flagged ADMIN_API_SECRET + SERVICE_AUTH_TOKEN silent dev-default fall-through as a platform-wide concern. This spec generalises the fix to every dev-default secret across all services.


1. Scope, Goals, Non-Goals

Goal

Prevent any service from silently booting in production with a hardcoded dev-default secret. In NODE_ENV=production, each service's AppConfigService throws at construction if any secret env var is unset. In dev/test, keep the current dev defaults unchanged so nobody's local workflow breaks.

Why

Every service's AppConfigService currently has rows like:

adminApiSecret: process.env.ADMIN_API_SECRET ?? 'admin-dev-secret',

In production, if the env var is not set, the service silently falls back to the dev value — an obvious security hole. A misconfigured prod deployment ships with a public "admin" token and no alarm fires.

In scope

Add a prodRequired(name, devDefault) private method to every AppConfigService. Swap the 14 known-dev-default secrets listed below from process.env.X ?? 'default' to this.prodRequired('X', 'default').

Secrets covered:

Service Secrets
auth ADMIN_API_SECRET
clinical-api ADMIN_API_SECRET, JWT_SIGNING_SECRET, ENCRYPTION_MASTER_KEY, KMS_CMK_ARN, S3_ACCESS_KEY_ID, S3_SECRET_ACCESS_KEY
consent ADMIN_API_SECRET
notifications ADMIN_API_SECRET, SERVICE_AUTH_TOKEN, SLACK_BOT_TOKEN, LOCAL_KEK_HEX, NOTIFICATIONS_WRAPPED_DEK
user-management ADMIN_API_SECRET, SERVICE_API_SECRET, COGNITO_USER_POOL_ID

For empty-string defaults (JWT_SIGNING_SECRET ?? '', ENCRYPTION_MASTER_KEY ?? '', KMS_CMK_ARN ?? '', COGNITO_USER_POOL_ID ?? ''): same treatment. The dev fallback is '' (preserves current no-op-when-unset behaviour), prod throws.

Out of scope

  • Operational config (ports, regions, bucket names, log level, TTLs, issuer names): fine as dev defaults in prod too. Not treated as secrets.
  • Migration to schema-validation library (zod/joi): overkill for a one-line helper.
  • Secret sourcing from cloud secret managers (AWS Secrets Manager, Vault, etc.): out of scope. Env vars remain the interface.
  • Secret rotation.
  • Extracting the helper to @sa-platform/common: 5 near-identical copies is fine. Extract later if the pattern entrenches.

Success criteria

  • Running any service with NODE_ENV=production and a listed secret unset throws at AppConfigService construction with a message naming the missing var.
  • Running any service with NODE_ENV=test (or unset) and a listed secret unset succeeds with the dev default (no behaviour change).
  • One unit test per service covering both branches.
  • Full monorepo typecheck + test suite green.
  • CI (runs with NODE_ENV=test by default) unaffected.

2. Mechanism

Each service's AppConfigService gets this private method:

private prodRequired(name: string, devDefault: string): string {
  const value = process.env[name];
  if (value) return value;
  if (process.env.NODE_ENV === 'production') {
    throw new Error(`${name} is required in production`);
  }
  return devDefault;
}

Call sites swap from:

adminApiSecret: process.env.ADMIN_API_SECRET ?? 'admin-dev-secret',

to:

adminApiSecret: this.prodRequired('ADMIN_API_SECRET', 'admin-dev-secret'),

No other changes to AppConfigService. required() (hard-required at all times, e.g. DATABASE_URL) stays as-is. parsePort() and other helpers untouched.

Production detection

Exact-equal process.env.NODE_ENV === 'production'. Standard Node.js convention. Any staging-type environment running real workloads should set NODE_ENV=production — there is no middle ground.

Failure mode

Throws at AppConfigService construction → NestJS DI initialisation fails → service doesn't start → Docker/k8s container restarts loop with a clear error in logs. Loud and early.


3. The Actual Changes

Files modified (5)

services/auth/src/config/app-config.ts
services/clinical-api/src/config/app-config.ts
services/consent/src/config/app-config.ts
services/notifications/src/config/app-config.ts
services/user-management/src/config/app-config.ts

Plus 5 spec files (one per service) to cover the new behaviour:

services/auth/src/config/app-config.spec.ts              # new or extend existing
services/clinical-api/src/config/app-config.spec.ts      # new or extend existing
services/consent/src/config/app-config.spec.ts           # new or extend existing
services/notifications/src/config/app-config.spec.ts     # new or extend existing
services/user-management/src/config/app-config.spec.ts   # new or extend existing

Check during implementation whether each service already has an app-config.spec.ts. If yes, extend it; if no, create a minimal one.

Files NOT touched

  • .env.example files — current dev defaults are still valid for dev, no documentation change needed.
  • Integration-test setup — they run with NODE_ENV=test, unaffected.
  • CI workflow — runs NODE_ENV=test by default, unaffected.
  • Dockerfiles / docker-compose — production deployments set NODE_ENV=production and inject real secrets via env; this PR is the contract that makes forgetting that throw loudly instead of silently succeeding.

Helper duplication

5 copies of the prodRequired method. Acceptable — 4 lines each, trivial to keep in sync manually. If we add a 6th service and notice the copies drift, extract to @sa-platform/common as a follow-up.


4. Test strategy

One unit test per service, structured as two cases:

describe('AppConfigService', () => {
  const ORIGINAL_ENV = process.env;

  afterEach(() => {
    process.env = { ...ORIGINAL_ENV };
  });

  it('throws in production when ADMIN_API_SECRET is unset', () => {
    process.env = { ...ORIGINAL_ENV, NODE_ENV: 'production', DATABASE_URL: 'mysql://x' };
    delete process.env.ADMIN_API_SECRET;
    expect(() => new AppConfigService()).toThrow(/ADMIN_API_SECRET is required in production/);
  });

  it('uses dev default when NODE_ENV is not production', () => {
    process.env = { ...ORIGINAL_ENV, NODE_ENV: 'test', DATABASE_URL: 'mysql://x' };
    delete process.env.ADMIN_API_SECRET;
    const config = new AppConfigService();
    expect(config.config.adminApiSecret).toBe('admin-dev-secret');
  });
});

Each service tests one representative secret. Testing all 14 exhaustively is overkill — the helper is identical per-service, and one case per service proves the wiring.

Integration/smoke

No new integration tests. The existing integration suites run under NODE_ENV=test and already validate the dev-default path end-to-end. The prod-throw path is verified at the unit level because starting a real service with missing env vars in integration tests is fiddly for little marginal value.


5. Risks & rollback

Risk Likelihood Mitigation
Someone's local dev setup breaks because NODE_ENV=production was set accidentally Very low Dev machines shouldn't set NODE_ENV=production. If someone does, the error is self-explanatory and the fix is one env-var unset.
A prod deployment with secrets correctly set still throws due to a whitespace/empty-string quirk Low The helper checks if (value) (truthy) — empty strings fall through to the prod check. Matches current implicit behaviour where empty env vars get replaced by ?? defaults.
Forgotten secret in the audit Low The list in Section 1 was derived by grepping every process.env.X ?? '...' in every service's app-config.ts. If I missed one, it still gets the dev default in prod (same as today) — no regression.
Helper drift across the 5 copies Medium long-term Accepted. Extract to @sa-platform/common when drift becomes a problem.

Rollback: revert the PR. No persisted state, no schema changes.


6. Out of scope — follow-ups

  • Extract prodRequired to @sa-platform/common.
  • Adopt a schema-validation library (zod) for a more expressive config model.
  • Integrate with a cloud secret manager (Secrets Manager, Vault).
  • Add a CI job that boots every service with NODE_ENV=production and no secrets, asserts it throws — cheap insurance that nobody accidentally removes the guard later.