Notifications Service — Design Spec

Date: 2026-04-21
Status: Draft — pending review
Build order position: Step 4 in the SA Platform build order (auth → user-management → consent → notifications → everything else)
Related: docs/superpowers/specs/2026-04-19-sa-platform-design.md


1. Scope, Goals, Non-Goals

What this service is

services/notifications/ — a standalone NestJS service that owns notification templates, per-user preferences, the delivery pipeline, and the audit trail for platform notifications. It exposes a REST API and subscribes to domain events on Redis pub/sub. Provider configuration is single-tenant for v1.

Channels in v1

  • Email via AWS SES
  • Slack via the Slack Web API (platform bot token)

SMS and push are explicit non-goals for v1. The provider adapter interface is designed so additional channels drop in without reworking the pipeline.

What it does

  • Stores templates (seeded from code; per-org runtime editing is reserved for post-v1)
  • Stores NotificationSubscription rows mapping domain events → templates → recipients
  • Accepts ad-hoc sends via REST (POST /notifications)
  • Resolves recipients (user IDs via user-management REST, or pass-through channel IDs)
  • Enforces per-user preferences (category opt-out; transactional category always sends)
  • Enforces PHI channel policy (email may carry limited PHI; Slack never)
  • Queues sends, retries transient failures, persists delivery status
  • Emits domain events (notification.sent, notification.failed, notification.suppressed) for audit and ops consumers

What it deliberately does not do in v1

  • No SMS or push channels
  • No inbound provider webhooks (bounce/complaint ingestion)
  • No scheduled / delayed sends — the caller is responsible for firing at the right time
  • No per-org provider credentials (platform-level SES and Slack only)
  • No preference-centre UI — preferences are managed via API
  • No template WYSIWYG editor — templates are seeded from code
  • No digest / batching logic

2. Architecture & Data Flow

Service layout

Follows the established monorepo pattern.

services/notifications/
├── prisma/                        # own schema, own MySQL database `notifications`
├── src/
│   ├── app.module.ts
│   ├── templates/                 # template CRUD + seed runner
│   ├── subscriptions/             # NotificationSubscription CRUD
│   ├── preferences/               # user preference API
│   ├── notifications/             # public API (POST /notifications, GET status)
│   ├── events/                    # Redis pub/sub consumer, maps event → subscription → queue
│   ├── delivery/                  # queue writer + worker + provider adapters
│   │   ├── queue.service.ts       # BullMQ wrapper
│   │   ├── worker.service.ts      # consumes queue, orchestrates send
│   │   ├── renderer.service.ts    # Handlebars render + PHI enforcement
│   │   ├── providers/
│   │   │   ├── email.ses.ts
│   │   │   ├── slack.ts
│   │   │   └── provider.interface.ts
│   │   └── retry.policy.ts
│   ├── recipients/                # resolves userIds via user-management client
│   └── audit/                     # emits domain events on send outcome
└── test/

Ingress paths — two ways a notification starts

  1. Event-driven. events/ subscribes to domain events on Redis, looks up matching NotificationSubscription rows, materialises one Notification row per resolved recipient, and enqueues each one to BullMQ.
  2. API-driven. POST /notifications with a template code + recipient + variables writes a Notification row and enqueues.

Delivery pipeline (same for both ingress paths)

Notification row (status=pending)
  → BullMQ queue (Redis)
  → Worker picks up
  → Preference check (skip if opted out + not transactional)
  → Recipient resolution (call user-management if userId; or use passed channel)
  → Suppression-list check (resolved address)
  → Render template (Handlebars) + PHI enforcement
  → Provider adapter .send()
  → Update status (sent / failed / suppressed) + provider message ID
  → Emit domain event (notification.sent | .failed | .suppressed)
  → On transient failure: exponential backoff retry, max 5 attempts

Dependencies

  • Redis — queue (BullMQ) + domain event pub/sub
  • MySQL — state (own database notifications)
  • user-management REST API — recipient contact info
  • AWS SES — email delivery
  • Slack Web API — Slack delivery

Where this fits with existing platform patterns

| Concern | Approach |
| --- | --- |
| Auth | @sa-platform/auth-client JWT guards, platform scope convention |
| Crypto | @sa-platform/common CryptoService for encrypting PHI-flagged persisted data |
| ID generation | @sa-platform/common UUID v7 |
| Redis | @sa-platform/common RedisModule |
| Domain events (in/out) | @sa-platform/common EventModule (same Redis pub/sub channels) |
| Prisma | Own schema, own client output path (pattern from auth/user-management/consent) |

What's new versus existing services

BullMQ is the first use of a persistent queue in the platform. Every other pattern reuses established shared modules.


3. Data Model

Seven tables in the notifications service's own Prisma schema / MySQL database (notifications).

Template — canonical content definitions

Field Type Notes
id String (uuid v7)
code String Slug, e.g. case-ready-practitioner
channel enum email | slack
category enum transactional | clinical | marketing | operational
phi Boolean If true, template may carry PHI. Slack channel + phi:true is rejected at send time
subject String? Email only
bodyTemplate String Handlebars source
locale String Default en-GB
version Int Incremented when body/subject changes
source enum seeded | custom seeded rows are rewritten on deploy; custom reserved for post-v1 per-org edits
orgId String? null = platform default; non-null = per-org override (post-v1 but modelled now)
createdAt DateTime
updatedAt DateTime

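Unique index: (code, channel, locale, orgId). MySQL treats NULL values in a unique index as distinct, so this index alone does not prevent duplicate platform-default rows (orgId = null); the seed runner has to guard against those itself.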
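The org-override model can be illustrated with a small lookup sketch. It assumes the candidate rows are already loaded in memory; TemplateRow is a trimmed, hypothetical subset of the fields above and findTemplate is an illustrative name, not part of the spec.

```typescript
// Trimmed, hypothetical subset of the Template fields listed above.
interface TemplateRow {
  code: string;
  channel: "email" | "slack";
  locale: string;
  orgId: string | null; // null = platform default
  version: number;
  bodyTemplate: string;
}

// Prefer the org-specific row; otherwise fall back to the platform default.
function findTemplate(
  rows: TemplateRow[],
  code: string,
  channel: TemplateRow["channel"],
  locale: string,
  orgId: string,
): TemplateRow | undefined {
  const candidates = rows.filter(
    (r) => r.code === code && r.channel === channel && r.locale === locale,
  );
  return (
    candidates.find((r) => r.orgId === orgId) ??
    candidates.find((r) => r.orgId === null)
  );
}
```

In v1 only platform rows exist, so the first branch would never match; the sketch only shows that the fallback can be expressed without a schema change.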

NotificationSubscription — event → template routing rules

Field Type Notes
id String (uuid v7)
eventType String e.g. clinical.case.completed
templateCode String FK-by-convention to Template.code
recipientRule Json Describes how to derive recipient from the event payload. Examples: { type: 'user', path: 'assignedPractitionerId' }, { type: 'slackChannel', channel: '#clinical-ops' }
orgId String? null = applies to all orgs
enabled Boolean
createdAt DateTime
updatedAt DateTime

Index: (eventType, orgId, enabled).
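As a rough illustration of how a recipientRule might be applied to an event payload, the sketch below covers the two example rules from the table. The dotted-path lookup, the Recipient shape, and the name resolveRecipient are assumptions made for the example; the spec does not fix the path syntax.

```typescript
type RecipientRule =
  | { type: "user"; path: string }
  | { type: "slackChannel"; channel: string };

interface Recipient {
  recipientType: "user" | "slackChannel";
  recipientRef: string;
}

// Resolves a rule against an event payload. Returns null when the payload
// has no non-empty string at the given path. Dotted paths such as
// "case.ownerId" are an assumption of this sketch.
function resolveRecipient(
  rule: RecipientRule,
  payload: Record<string, unknown>,
): Recipient | null {
  if (rule.type === "slackChannel") {
    return { recipientType: "slackChannel", recipientRef: rule.channel };
  }
  let current: unknown = payload;
  for (const key of rule.path.split(".")) {
    if (typeof current !== "object" || current === null) return null;
    current = (current as Record<string, unknown>)[key];
  }
  return typeof current === "string" && current.length > 0
    ? { recipientType: "user", recipientRef: current }
    : null;
}
```

Returning null for a missing field leaves the caller free to decide whether that is logged, skipped, or recorded as a failed Notification.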

Notification — one row per notification and recipient (core state row); individual provider calls are recorded in DeliveryAttempt

Field Type Notes
id String (uuid v7)
templateCode String
templateVersion Int Version of Template used
channel enum email | slack
recipientType enum user | slackChannel | email
recipientRef String User ID, Slack channel ID, or raw email address
resolvedAddress String? Populated at worker time (email address or Slack channel ID). Encrypted at rest if phi: true
orgId String
category enum (matches Template.category)
phi Boolean Copied from template at creation time
variables Json | encrypted string Variables merged into template. Encrypted via CryptoService if phi: true
status enum pending | queued | sending | sent | failed | suppressed
attempts Int
providerMessageId String?
errorCode String? e.g. phi_policy_violation, invalid_recipient, provider_5xx
errorDetail String?
correlationId String? Carries through from caller
triggerSource String event:<eventId> or api:<clientId>
createdAt DateTime
sentAt DateTime?
updatedAt DateTime

Indexes: (orgId, status, createdAt), (recipientType, recipientRef, createdAt).

NotificationPreference — per-user category opt-outs

Field Type Notes
userId String
orgId String
category enum
channel enum
optedOut Boolean
updatedAt DateTime

Composite primary key: (userId, category, channel).

The transactional category is always sent, regardless of preference rows.
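The opt-out rule can be sketched as a pure function. The row shape mirrors the table above without orgId and updatedAt; treating a missing row as "not opted out" is an assumption of this sketch, and isSuppressedByPreference is an illustrative name.

```typescript
type Category = "transactional" | "clinical" | "marketing" | "operational";
type Channel = "email" | "slack";

interface PreferenceRow {
  userId: string;
  category: Category;
  channel: Channel;
  optedOut: boolean;
}

// True when a send to this user should be suppressed by preference.
// A missing row is treated as "not opted out" (assumption of this sketch).
function isSuppressedByPreference(
  prefs: PreferenceRow[],
  userId: string,
  category: Category,
  channel: Channel,
): boolean {
  if (category === "transactional") return false; // always sent
  const row = prefs.find(
    (p) => p.userId === userId && p.category === category && p.channel === channel,
  );
  return row?.optedOut === true;
}
```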

DeliveryAttempt — audit of each provider call

Field Type Notes
id String (uuid v7)
notificationId String (FK)
attemptNumber Int
providerName String e.g. ses, slack
request Json Redacted (no PHI, no full body — keep metadata only)
response Json
success Boolean
errorCode String?
latencyMs Int
createdAt DateTime

Index: notificationId.
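One way to keep the request column metadata-only is an allowlist, sketched below. The key names in SAFE_REQUEST_KEYS and the bodyLength field are hypothetical choices for the example, not part of the spec.

```typescript
// Hypothetical allowlist: only these metadata keys are persisted.
const SAFE_REQUEST_KEYS = [
  "channel",
  "templateCode",
  "templateVersion",
  "correlationId",
] as const;

// Builds the value stored in DeliveryAttempt.request. Anything not on the
// allowlist (recipient address, subject, body) is dropped.
function redactRequest(request: Record<string, unknown>): Record<string, unknown> {
  const redacted: Record<string, unknown> = {};
  for (const key of SAFE_REQUEST_KEYS) {
    if (key in request) redacted[key] = request[key];
  }
  // Keep the size of the body rather than its content.
  const body = request["body"];
  if (typeof body === "string") redacted["bodyLength"] = body.length;
  return redacted;
}
```

An allowlist fails closed: a new field added to the provider input is not persisted until someone adds it deliberately.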

EventInbox — idempotency + replay protection for incoming domain events

Field Type Notes
eventId String (PK)
eventType String
receivedAt DateTime
processedAt DateTime?
notificationIds Json Notification IDs this event produced

Detects redelivered events so we do not fan out notifications twice.

SuppressionList — hard suppressions

Field Type Notes
id String
address String Email address or Slack channel ID
channel enum
reason String Manually added in v1 (bounce-webhook ingestion is post-v1)
createdAt DateTime

Unique: (address, channel). Worker checks this table before sending and marks Notification as suppressed on hit.

Cross-service FK note

No FK to users/orgs across the service boundary (user-management owns those); userId / orgId are opaque identifiers here, consistent with the platform pattern.


4. API Surface & Event Subscriptions

All endpoints sit behind @sa-platform/auth-client JWT guards. Scopes follow the platform convention notifications:*.

Notifications (core)

Method Path Scope Purpose
POST /notifications notifications:send Enqueue a send. Body: { templateCode, recipient: {type, ref}, variables, correlationId? }. Returns { id, status }
GET /notifications/:id notifications:read Fetch status + delivery attempts
GET /notifications notifications:read Paginated list with filters (orgId, recipientRef, status, from, to)
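For illustration, the POST /notifications body could be typed and validated roughly as follows. The recipient types are taken from the Notification.recipientType enum; the error strings and the validateSendRequest name are assumptions, and a real NestJS service would more likely use its usual DTO validation.

```typescript
type RecipientType = "user" | "slackChannel" | "email";

// Shape of the request body described in the table above.
interface SendRequest {
  templateCode: string;
  recipient: { type: RecipientType; ref: string };
  variables: Record<string, unknown>;
  correlationId?: string;
}

const RECIPIENT_TYPES: readonly string[] = ["user", "slackChannel", "email"];

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a list of validation errors; an empty list means the body is valid.
function validateSendRequest(body: unknown): string[] {
  if (!isRecord(body)) return ["body must be an object"];
  const errors: string[] = [];
  const templateCode = body["templateCode"];
  if (typeof templateCode !== "string" || templateCode === "") {
    errors.push("templateCode is required");
  }
  const recipient = body["recipient"];
  if (!isRecord(recipient)) {
    errors.push("recipient is required");
  } else {
    const type = recipient["type"];
    const ref = recipient["ref"];
    if (typeof type !== "string" || RECIPIENT_TYPES.indexOf(type) === -1) {
      errors.push("recipient.type must be user, slackChannel or email");
    }
    if (typeof ref !== "string" || ref === "") {
      errors.push("recipient.ref is required");
    }
  }
  if (!isRecord(body["variables"])) errors.push("variables must be an object");
  const correlationId = body["correlationId"];
  if (correlationId !== undefined && typeof correlationId !== "string") {
    errors.push("correlationId must be a string");
  }
  return errors;
}
```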

Templates (admin)

Method Path Scope Purpose
GET /admin/templates notifications:admin List, filter by channel / category / orgId
GET /admin/templates/:code notifications:admin Fetch single (most recent version)
PUT /admin/templates/:code notifications:admin Update (creates new version). Editing a source: seeded template is allowed but the response body includes a warnings: ["will be overwritten on next deploy"] field. Per-org custom overrides are reserved for post-v1
POST /admin/templates/:code/render-preview notifications:admin Dry-run render with sample variables. Returns rendered subject + body + PHI validation result. No send

Subscriptions (admin)

Method Path Scope Purpose
GET /admin/subscriptions notifications:admin List, filter by eventType, orgId
POST /admin/subscriptions notifications:admin Create. Body: eventType, templateCode, recipientRule, orgId nullable, enabled
PATCH /admin/subscriptions/:id notifications:admin Update enabled / template / rule
DELETE /admin/subscriptions/:id notifications:admin Delete

Preferences

Method Path Scope Purpose
GET /preferences/:userId notifications:preferences:read Read preferences. Caller must be the user themselves or admin (enforced via @sa-platform/auth-client actor)
PUT /preferences/:userId notifications:preferences:write Body: [{ category, channel, optedOut }]

Suppressions (admin)

Method Path Scope
GET /admin/suppressions notifications:admin
POST /admin/suppressions notifications:admin
DELETE /admin/suppressions/:id notifications:admin

Queue inspection (admin)

Method Path Scope Purpose
GET /admin/queue/failed notifications:admin Paginated view of BullMQ failed jobs (retained for 7 days)

Health

Method Path Purpose
GET /health Liveness
GET /health/ready Readiness — checks MySQL + Redis + outbound provider reachability

Event subscriptions (pub/sub ingress)

The notifications service subscribes to the wildcard pattern events:* on Redis pub/sub (matching the platform pattern).

For each inbound event:

  1. Check EventInbox. If eventId already processed, skip (idempotency).
  2. Query NotificationSubscription for matching eventType + orgId (falling back to platform rules with orgId = null).
  3. For each matching subscription:
    • Resolve recipient via recipientRule (JSON-path into the event payload, or literal channel ID).
    • Create a Notification row (status: pending, triggerSource: event:<eventId>).
    • Enqueue to BullMQ.
  4. Record EventInbox row with the list of created notification IDs.
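The inbound-event steps above can be sketched with in-memory stand-ins. Two points are assumptions rather than spec: that "falling back" means platform rules apply only when the org has no rule of its own, and that the event carries an orgId. InMemoryInbox, matchSubscriptions, and handleEvent are illustrative names.

```typescript
interface Subscription {
  id: string;
  eventType: string;
  orgId: string | null; // null = platform rule
  enabled: boolean;
}

// Org-specific rules win; platform rules (orgId = null) apply only when the
// org has none. This reading of "falling back" is an assumption.
function matchSubscriptions(
  all: Subscription[],
  eventType: string,
  orgId: string,
): Subscription[] {
  const active = all.filter((s) => s.enabled && s.eventType === eventType);
  const orgRules = active.filter((s) => s.orgId === orgId);
  return orgRules.length > 0 ? orgRules : active.filter((s) => s.orgId === null);
}

// In-memory stand-in for the EventInbox table.
class InMemoryInbox {
  private readonly processed = new Map<string, string[]>();
  has(eventId: string): boolean {
    return this.processed.has(eventId);
  }
  record(eventId: string, notificationIds: string[]): void {
    this.processed.set(eventId, notificationIds);
  }
}

// Returns the notification IDs created for this delivery of the event
// (empty when the event was already processed).
function handleEvent(
  inbox: InMemoryInbox,
  subs: Subscription[],
  event: { eventId: string; eventType: string; orgId: string },
): string[] {
  if (inbox.has(event.eventId)) return []; // idempotency: seen before
  const ids = matchSubscriptions(subs, event.eventType, event.orgId).map(
    (s) => `${event.eventId}:${s.id}`, // placeholder ID; the spec uses UUID v7
  );
  inbox.record(event.eventId, ids);
  return ids;
}
```

In the real service the inbox check and the row inserts would need to share a transaction (or rely on the eventId primary key) so that two concurrent deliveries cannot both pass the check.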

Domain events emitted (on events:notifications.*)

  • notification.sent — { notificationId, templateCode, channel, orgId, correlationId } (PHI-free by design).
  • notification.failed — same shape plus errorCode, finalAttempt (bool).
  • notification.suppressed — emitted when preference or suppression-list rejects the send.

These are for audit consumers and ops dashboards, not for re-triggering sends.
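A possible typing of the three outcome events is sketched below. The payload of notification.suppressed is not spelled out in this spec, so giving it the base shape is an assumption; NotificationRow is a hypothetical slice of the Notification table used only to show which fields are left out.

```typescript
type Channel = "email" | "slack";

// Fields shared by all three outcome events (PHI-free by design).
interface OutcomeBase {
  notificationId: string;
  templateCode: string;
  channel: Channel;
  orgId: string;
  correlationId: string | null;
}

type OutcomeEvent =
  | (OutcomeBase & { type: "notification.sent" })
  | (OutcomeBase & { type: "notification.failed"; errorCode: string; finalAttempt: boolean })
  | (OutcomeBase & { type: "notification.suppressed" });

// Hypothetical slice of a Notification row, including fields that must not leak.
interface NotificationRow {
  id: string;
  templateCode: string;
  channel: Channel;
  orgId: string;
  correlationId: string | null;
  resolvedAddress: string | null;
  variables: Record<string, unknown>;
}

// Copies only the allowed fields, so variables and addresses never reach the bus.
function baseOf(row: NotificationRow): OutcomeBase {
  return {
    notificationId: row.id,
    templateCode: row.templateCode,
    channel: row.channel,
    orgId: row.orgId,
    correlationId: row.correlationId,
  };
}

function failedEvent(
  row: NotificationRow,
  errorCode: string,
  finalAttempt: boolean,
): OutcomeEvent {
  return { ...baseOf(row), type: "notification.failed", errorCode, finalAttempt };
}
```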


5. Delivery Pipeline

Queue — BullMQ on Redis

Single queue notifications:send. Jobs carry only { notificationId } — the MySQL row is the source of truth, the job is a pointer. This keeps jobs small and lets the worker re-read fresh state on each attempt (e.g. pick up a suppression added between enqueue and send).

Queue config:

  • Concurrency: 10 workers per service instance (tunable via env).
  • Job options: attempts: 5, backoff: { type: 'exponential', delay: 2000 }, i.e. one initial attempt followed by up to four retries delayed by 2s, 4s, 8s, and 16s.
  • Final failure marks the Notification as failed and emits notification.failed with finalAttempt: true.
  • Dead-letter: after attempts are exhausted, the job stays on the BullMQ failed list in Redis with 7-day retention for ops inspection. No auto-replay.
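The retry timing can be worked out with a few lines of arithmetic. The sketch assumes BullMQ's built-in exponential strategy, where the delay before the n-th retry is delay × 2^(n−1), and that attempts counts the initial try; backoffDelayMs and retrySchedule are illustrative helpers, not BullMQ APIs.

```typescript
// Delay before retry number `retry` (1-based), doubling from the base delay.
function backoffDelayMs(retry: number, baseDelayMs = 2000): number {
  return baseDelayMs * 2 ** (retry - 1);
}

// With `attempts` total tries there are `attempts - 1` retries.
function retrySchedule(attempts: number, baseDelayMs = 2000): number[] {
  return Array.from({ length: Math.max(attempts - 1, 0) }, (_, i) =>
    backoffDelayMs(i + 1, baseDelayMs),
  );
}

// Total time spent waiting between attempts, excluding provider latency.
function totalBackoffMs(attempts: number, baseDelayMs = 2000): number {
  return retrySchedule(attempts, baseDelayMs).reduce((sum, d) => sum + d, 0);
}
```

With attempts: 5 and delay: 2000, retrySchedule(5) gives [2000, 4000, 8000, 16000], so a job that keeps failing transiently reaches its final failure after about 30 seconds of backoff plus provider latency.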

Worker flow (per job)

  1. Load Notification by ID. If status ∉ {pending, queued}, skip (idempotent — handles re-delivery and manual retries).
  2. Mark status: sending, increment attempts.
  3. Preference gate (only when recipientType === 'user'): look up NotificationPreference(userId = recipientRef, category, channel). If optedOut AND category ≠ transactional, set status: suppressed, emit notification.suppressed, done. Preferences don't apply to slackChannel or raw-email recipients — those addresses aren't owned by a user in this service.
  4. Recipient resolution:
    • recipientType: user — call user-management GET /users/:id/contact via a typed client authenticated with @sa-platform/auth-client. No caching — contact info can change.
    • recipientType: slackChannel — use recipientRef directly.
    • recipientType: email — use recipientRef directly (ad-hoc API usage).
    • Store resolvedAddress on the row before continuing (encrypted if phi: true).
  5. Suppression-list gate: check SuppressionList(resolvedAddress, channel). Hit → status: suppressed, emit notification.suppressed, done.
  6. Render template:
    • Load Template by (code, channel, locale, orgId) with platform fallback (orgId = null).
    • Record templateVersion on the row.
    • Handlebars render of subject and bodyTemplate with variables.
  7. PHI enforcement: if Template.phi === true and channel === 'slack', abort immediately — set status: failed, errorCode: phi_policy_violation, emit notification.failed. This is a code-level guarantee; no code path allows PHI templates to reach Slack.
  8. Provider adapter call:
    • EmailProvider.send({ to, subject, body, correlationId }) — SES via @aws-sdk/client-sesv2, tagged via an SES_CONFIGURATION_SET.
    • SlackProvider.send({ channel, blocks, correlationId }) — @slack/web-api with the platform bot token.
  9. Record DeliveryAttempt row (redacted request payload — never log full PHI).
  10. On provider success: status: sent, providerMessageId, sentAt = now, emit notification.sent.
  11. On provider failure, classify via retry.policy.ts:
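    • transient (5xx, 429 rate-limit, network errors) → reset status to queued, then throw to let BullMQ retry. Without the reset the row stays in sending and step 1 skips the retried job.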
    • permanent (4xx validation, invalid recipient, auth errors) → status: failed, errorCode, emit notification.failed, do not re-throw.
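The gates that run before the provider call can be summarised as one pure decision function. This is a minimal sketch: GateInput bundles the results of the database lookups into booleans, and the names evaluateGates, GateInput, and GateResult are hypothetical. It deliberately leaves out recipient resolution, rendering, and persistence.

```typescript
type Status = "pending" | "queued" | "sending" | "sent" | "failed" | "suppressed";

interface GateInput {
  status: Status;
  channel: "email" | "slack";
  category: "transactional" | "clinical" | "marketing" | "operational";
  recipientType: "user" | "slackChannel" | "email";
  phi: boolean;
  optedOut: boolean; // result of the NotificationPreference lookup
  addressSuppressed: boolean; // result of the SuppressionList lookup
}

type GateResult =
  | { action: "skip" }
  | { action: "suppress"; reason: "preference" | "suppression_list" }
  | { action: "fail"; errorCode: "phi_policy_violation" }
  | { action: "send" };

function evaluateGates(n: GateInput): GateResult {
  // Stale or already-handled job: do nothing.
  if (n.status !== "pending" && n.status !== "queued") return { action: "skip" };
  // Preferences apply only to user recipients and never to transactional sends.
  if (n.recipientType === "user" && n.optedOut && n.category !== "transactional") {
    return { action: "suppress", reason: "preference" };
  }
  if (n.addressSuppressed) return { action: "suppress", reason: "suppression_list" };
  // PHI templates must never reach Slack.
  if (n.phi && n.channel === "slack") {
    return { action: "fail", errorCode: "phi_policy_violation" };
  }
  return { action: "send" };
}
```

Keeping the decision separate from I/O makes the ordering of the gates easy to unit-test without Redis or MySQL.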

Provider adapter interface (provider.interface.ts)

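interface NotificationProvider {
  // Channel this adapter delivers to.
  channel: 'email' | 'slack';
  // Performs one delivery attempt against the provider.
  send(input: ProviderSendInput): Promise<ProviderSendResult>;
  // Maps a provider error to a retry decision (see retry.policy.ts).
  classify(error: unknown): 'transient' | 'permanent';
}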

Two implementations for v1 (SES and Slack); SMS and push adapters slot into this later without worker changes.
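To show how an adapter might satisfy this interface, here is an in-memory implementation of the kind the test suite could use. ProviderSendInput and ProviderSendResult are named but not defined in this spec, so the shapes below are hypothetical, as is the rule that errors without a numeric status count as network errors. The interface is repeated so the block is self-contained.

```typescript
// Hypothetical shapes; the spec names these types but does not define them.
interface ProviderSendInput {
  to: string;
  subject?: string;
  body: string;
  correlationId?: string;
}
interface ProviderSendResult {
  providerMessageId: string;
}

interface NotificationProvider {
  channel: "email" | "slack";
  send(input: ProviderSendInput): Promise<ProviderSendResult>;
  classify(error: unknown): "transient" | "permanent";
}

// Classification following the retry policy: 5xx and 429 are transient,
// other statuses are permanent, and errors without a numeric status are
// treated as network errors (assumption of this sketch).
function classifyByStatus(error: unknown): "transient" | "permanent" {
  const status =
    typeof error === "object" && error !== null
      ? (error as { status?: unknown }).status
      : undefined;
  if (typeof status !== "number") return "transient";
  return status === 429 || status >= 500 ? "transient" : "permanent";
}

// In-memory adapter that records what it was asked to send.
class InMemoryEmailProvider implements NotificationProvider {
  readonly channel = "email" as const;
  readonly sent: ProviderSendInput[] = [];

  async send(input: ProviderSendInput): Promise<ProviderSendResult> {
    this.sent.push(input);
    return { providerMessageId: `mem-${this.sent.length}` };
  }

  classify(error: unknown): "transient" | "permanent" {
    return classifyByStatus(error);
  }
}
```

Because the worker only depends on the interface, swapping this adapter in for SES or Slack in tests requires no change to the worker code.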

Seed templates (code-owned, synced at deploy)

A templates/seeds/ directory holds .template.ts files, each exporting { code, channel, category, phi, subject, body, locale }.

On service boot (idempotent), the seed runner upserts Template rows with source: seeded, bumping version if body or subject changed. This is how new templates ship with code while remaining editable in the database.
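The version-bump rule can be sketched against an in-memory store. The field list is trimmed from the seed export shape above, the Map keyed by code, channel, and locale stands in for the Template table, and upsertSeed is an illustrative name.

```typescript
// Trimmed from the seed export shape ({ code, channel, ..., subject, body, locale }).
interface SeedTemplate {
  code: string;
  channel: "email" | "slack";
  locale: string;
  subject: string | null;
  body: string;
}

interface StoredTemplate extends SeedTemplate {
  version: number;
  source: "seeded" | "custom";
}

// Upserts one seed. The version starts at 1 and is bumped only when the
// subject or body differs from what is stored, so repeated boots are no-ops.
function upsertSeed(
  store: Map<string, StoredTemplate>,
  seed: SeedTemplate,
): StoredTemplate {
  const key = `${seed.code}|${seed.channel}|${seed.locale}`;
  const existing = store.get(key);
  const changed =
    !existing || existing.body !== seed.body || existing.subject !== seed.subject;
  const next: StoredTemplate = {
    ...seed,
    source: "seeded",
    version: existing ? existing.version + (changed ? 1 : 0) : 1,
  };
  store.set(key, next);
  return next;
}
```

Running the same seed twice leaves the version unchanged, which is the idempotency property the seed-runner unit tests are meant to cover.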

Observability

  • Structured logs per attempt: correlationId, notificationId, channel, provider, outcome, latencyMs.
  • Prometheus metrics:
    • Counter notifications_sent_total{channel,category,status}
    • Histogram notification_delivery_seconds{channel}
  • Failed-queue inspection endpoint for ops (GET /admin/queue/failed, scope notifications:admin).

6. Testing Strategy

Follows the existing services' approach — unit tests against mocked collaborators, integration tests against real infrastructure via Testcontainers.

Unit tests (~50 tests)

  • templates/ — seed runner upserts idempotently; version bumps only when content changes; render-preview returns PHI validation result.
  • subscriptions/ — recipientRule resolution against varied event payloads (user path, literal channel, missing field).
  • preferences/ — category opt-out enforcement; transactional override; per-user / per-channel precedence.
  • events/consumer — idempotency via EventInbox; fan-out per matching subscription; malformed event payload handling.
  • delivery/renderer — Handlebars rendering; PHI enforcement (Slack + phi: true → throws with phi_policy_violation); locale fallback.
  • delivery/retry.policy — classification matrix (5xx, 429, 4xx, network, auth errors).
  • delivery/worker — stateful sequencing (pending → sending → sent); stale-job skip; suppression-list gate.
  • delivery/providers/* — SES + Slack adapter happy paths and error classification, with SDK clients mocked.
  • recipients/client — user-management client happy path; 404 → permanent failure; 5xx → transient.

Integration tests (~20 tests, real MySQL + Redis, provider SDKs mocked at the client boundary)

  • Full event → notification → worker → sent flow (emit event on Redis, assert Notification row, assert provider mock called, assert notification.sent emitted).
  • Idempotency — same event delivered twice produces one notification set.
  • Retry behaviour — transient failure triggers BullMQ retry; eventual success.
  • Permanent failure — 4xx short-circuits retries; marks failed.
  • Preference opt-out — non-transactional template is suppressed; transactional is not.
  • Suppression list — address on the list short-circuits to suppressed.
  • PHI guard — Slack template with phi: true enqueued → failed with phi_policy_violation.
  • Recipient resolution — user-management mock returns contact, worker uses it; 404 marks failed.
  • Template versioning — edit a seeded template's body → version bumps; new Notifications record the new version.
  • Admin API — CRUD on subscriptions and templates end-to-end, with notifications:admin scope enforcement.
  • Health — /health/ready reports red when Redis or MySQL is unreachable.

Infrastructure for integration

  • test/testcontainer.ts pattern (same as the consent service): Redis + MySQL via Testcontainers.
  • SES + Slack SDK clients mocked at the client level (no mocked HTTP stack); provider adapter tests verify we call the SDKs correctly.
  • CI — extend .github/workflows/ci.yml to run notifications Prisma generate + unit + integration tests, alongside the other services.

Out of scope for v1 tests

  • Load / performance tests.
  • Real-provider smoke tests (SES sandbox, Slack bot in a dev workspace) — candidate for a post-v1 nightly job.
  • Chaos / fault-injection against BullMQ.

7. Open Questions / Deferred Decisions

These are deliberately deferred and called out so a future plan doesn't treat them as oversights.

  • Per-org template overrides. Schema supports Template.orgId non-null and source: custom. No API surface yet — reserved for post-v1 when a preference-centre or white-label requirement actually lands.
  • Per-org provider credentials. Platform-level only in v1. When needed, a ProviderConfig(orgId, channel, credentials) table and a per-send lookup step can be added without changing the adapter interface.
  • Inbound provider webhooks (SES bounce / complaint, Slack event API). Planned for when email volume justifies sender-reputation hygiene. Entry point will update Notification.status (e.g. delivered, bounced) and auto-insert SuppressionList rows on hard bounces.
  • Scheduled sends. Not in v1. When added: a scheduledFor field on Notification, a time-based dispatcher, and cancellation semantics tied to the trigger source.
  • SMS and push. Non-goals for v1. Drop in via the provider adapter interface.
  • Preference centre UI. A product concern, not a platform concern. API is ready; UI is out of scope here.