Notifications Service: Design Spec
Date: 2026-04-21
Status: Draft — pending review
Build order position: Step 4 in the SA Platform build order (auth → user-management → consent → notifications → everything else)
Related: docs/superpowers/specs/2026-04-19-sa-platform-design.md
1. Scope, Goals, Non-Goals
What this service is
`services/notifications/` is a standalone NestJS service that owns notification templates, per-user preferences, the delivery pipeline, and the audit trail for platform notifications. It exposes a REST API and subscribes to domain events on Redis pub/sub. Provider configuration is single-tenant for v1.
Channels in v1
- Email via AWS SES
- Slack via the Slack Web API (platform bot token)
SMS and push are explicit non-goals for v1. The provider adapter interface is designed so additional channels drop in without reworking the pipeline.
What it does
- Stores templates (seeded from code; per-org runtime editing is reserved for post-v1)
- Stores `NotificationSubscription` rows mapping domain events → templates → recipients
- Accepts ad-hoc sends via REST (`POST /notifications`)
- Resolves recipients (user IDs via `user-management` REST, or pass-through channel IDs)
- Enforces per-user preferences (category opt-out; transactional category always sends)
- Enforces PHI channel policy (email may carry limited PHI; Slack never)
- Queues sends, retries transient failures, persists delivery status
- Emits domain events (`notification.sent`, `notification.failed`, `notification.suppressed`) for audit and ops consumers
What it deliberately does not do in v1
- No SMS or push channels
- No inbound provider webhooks (bounce/complaint ingestion)
- No scheduled / delayed sends — the caller is responsible for firing at the right time
- No per-org provider credentials (platform-level SES and Slack only)
- No preference-centre UI — preferences are managed via API
- No template WYSIWYG editor — templates are seeded from code
- No digest / batching logic
2. Architecture & Data Flow
Service layout
The service follows the established monorepo pattern.
services/notifications/
├── prisma/ # own schema, own MySQL database `notifications`
├── src/
│ ├── app.module.ts
│ ├── templates/ # template CRUD + seed runner
│ ├── subscriptions/ # NotificationSubscription CRUD
│ ├── preferences/ # user preference API
│ ├── notifications/ # public API (POST /notifications, GET status)
│ ├── events/ # Redis pub/sub consumer, maps event → subscription → queue
│ ├── delivery/ # queue writer + worker + provider adapters
│ │ ├── queue.service.ts # BullMQ wrapper
│ │ ├── worker.service.ts # consumes queue, orchestrates send
│ │ ├── renderer.service.ts # Handlebars render + PHI enforcement
│ │ ├── providers/
│ │ │ ├── email.ses.ts
│ │ │ ├── slack.ts
│ │ │ └── provider.interface.ts
│ │ └── retry.policy.ts
│ ├── recipients/ # resolves userIds via user-management client
│ └── audit/ # emits domain events on send outcome
└── test/
Ingress paths: two ways a notification starts
- Event-driven. `events/` subscribes to domain events on Redis, looks up matching `NotificationSubscription` rows, materialises a `Notification` row per resolved recipient, and enqueues to BullMQ.
- API-driven. `POST /notifications` with a template code, recipient, and variables writes a `Notification` row and enqueues.
Delivery pipeline (same for both ingress paths)
Notification row (status=pending)
→ BullMQ queue (Redis)
→ Worker picks up
→ Preference check (skip if opted out + not transactional)
→ Recipient resolution (call user-management if userId; or use passed channel)
→ Suppression-list check (resolved address)
→ Render template (Handlebars) + PHI enforcement
→ Provider adapter .send()
→ Update status (sent / failed / suppressed) + provider message ID
→ Emit domain event (notification.sent | .failed | .suppressed)
→ On transient failure: exponential backoff retry, max 5 attempts
Dependencies
- Redis: queue (BullMQ) + domain event pub/sub
- MySQL: state (own database `notifications`)
- user-management REST API: recipient contact info
- AWS SES: email delivery
- Slack Web API: Slack delivery
Where this fits with existing platform patterns
| Concern | Approach |
|---|---|
| Auth | @sa-platform/auth-client JWT guards, platform scope convention |
| Crypto | @sa-platform/common CryptoService for encrypting PHI-flagged persisted data |
| ID generation | @sa-platform/common UUID v7 |
| Redis | @sa-platform/common RedisModule |
| Domain events (in/out) | @sa-platform/common EventModule (same Redis pub/sub channels) |
| Prisma | Own schema, own client output path (pattern from auth/user-management/consent) |
What's new versus existing services
BullMQ is the first use of a persistent queue in the platform. Every other pattern reuses established shared modules.
3. Data Model
Seven tables in the notifications service's own Prisma schema / MySQL database (notifications).
Template: canonical content definitions
| Field | Type | Notes |
|---|---|---|
| `id` | String (uuid v7) | |
| `code` | String | Slug, e.g. `case-ready-practitioner` |
| `channel` | enum `email` \| `slack` | |
| `category` | enum `transactional` \| `clinical` \| `marketing` \| `operational` | |
| `phi` | Boolean | If true, the template may carry PHI. Slack channel + `phi: true` is rejected at send time |
| `subject` | String? | Email only |
| `bodyTemplate` | String | Handlebars source |
| `locale` | String | Default `en-GB` |
| `version` | Int | Incremented when body/subject changes |
| `source` | enum `seeded` \| `custom` | `seeded` rows are rewritten on deploy; `custom` is reserved for post-v1 per-org edits |
| `orgId` | String? | null = platform default; non-null = per-org override (post-v1 but modelled now) |
| `createdAt` | DateTime | |
| `updatedAt` | DateTime | |

Unique index: (code, channel, locale, orgId).
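The `phi` / `channel` rule in the table can be expressed as a small guard. The following is a minimal sketch, not the service's actual code: `assertPhiPolicy` and `PhiPolicyViolation` are hypothetical names, and only the `phi_policy_violation` error code comes from this spec.

```typescript
type Channel = 'email' | 'slack';

interface TemplatePolicyInput {
  channel: Channel;
  phi: boolean;
}

// Hypothetical error type; only the error code string is taken from the spec.
class PhiPolicyViolation extends Error {
  readonly errorCode = 'phi_policy_violation';

  constructor(channel: Channel) {
    super(`PHI-flagged template cannot be sent via ${channel}`);
  }
}

// Throws when a PHI-flagged template targets a channel that must never carry PHI.
// Email is allowed to carry limited PHI, so only Slack is rejected here.
function assertPhiPolicy(template: TemplatePolicyInput): void {
  if (template.phi && template.channel === 'slack') {
    throw new PhiPolicyViolation(template.channel);
  }
}
```

Keeping the check in one function means the worker and the render-preview endpoint could share the same rule, assuming both paths call it.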
NotificationSubscription: event → template routing rules
| Field | Type | Notes |
|---|---|---|
| `id` | String (uuid v7) | |
| `eventType` | String | e.g. `clinical.case.completed` |
| `templateCode` | String | FK-by-convention to `Template.code` |
| `recipientRule` | Json | Describes how to derive the recipient from the event payload. Examples: `{ type: 'user', path: 'assignedPractitionerId' }`, `{ type: 'slackChannel', channel: '#clinical-ops' }` |
| `orgId` | String? | null = applies to all orgs |
| `enabled` | Boolean | |
| `createdAt` | DateTime | |
| `updatedAt` | DateTime | |

Index: (eventType, orgId, enabled).
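To make the two `recipientRule` shapes concrete, here is a hedged sketch of how a rule might be applied to an event payload. The function names are hypothetical, and the dot-separated path syntax is an assumption; the spec only says the rule is a JSON-path into the payload.

```typescript
type RecipientRule =
  | { type: 'user'; path: string }
  | { type: 'slackChannel'; channel: string };

interface ResolvedRecipient {
  recipientType: 'user' | 'slackChannel';
  recipientRef: string;
}

// Walks a dot-separated path (assumed syntax) into an arbitrary payload.
function readPath(payload: unknown, path: string): unknown {
  let current: unknown = payload;
  for (const key of path.split('.')) {
    if (typeof current !== 'object' || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Returns null when a 'user' rule points at a field the event does not carry.
function resolveRecipient(rule: RecipientRule, payload: unknown): ResolvedRecipient | null {
  if (rule.type === 'slackChannel') {
    // Literal channel: the payload is not consulted at all.
    return { recipientType: 'slackChannel', recipientRef: rule.channel };
  }
  const value = readPath(payload, rule.path);
  return typeof value === 'string' && value.length > 0
    ? { recipientType: 'user', recipientRef: value }
    : null;
}
```

Returning `null` rather than throwing leaves the caller free to decide whether a missing field skips the subscription or records a failure; the spec does not settle that choice.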
Notification: one row per send request (core state row)
| Field | Type | Notes |
|---|---|---|
| `id` | String (uuid v7) | |
| `templateCode` | String | |
| `templateVersion` | Int | Version of `Template` used |
| `channel` | enum `email` \| `slack` | |
| `recipientType` | enum `user` \| `slackChannel` \| `email` | |
| `recipientRef` | String | User ID, Slack channel ID, or raw email address |
| `resolvedAddress` | String? | Populated at worker time (email address or Slack channel ID). Encrypted at rest if `phi: true` |
| `orgId` | String | |
| `category` | enum (matches `Template.category`) | |
| `phi` | Boolean | Copied from the template at creation time |
| `variables` | Json, or encrypted string | Variables merged into the template. Encrypted via CryptoService if `phi: true` |
| `status` | enum `pending` \| `queued` \| `sending` \| `sent` \| `failed` \| `suppressed` | |
| `attempts` | Int | |
| `providerMessageId` | String? | |
| `errorCode` | String? | e.g. `phi_policy_violation`, `invalid_recipient`, `provider_5xx` |
| `errorDetail` | String? | |
| `correlationId` | String? | Carries through from the caller |
| `triggerSource` | String | `event:<eventId>` or `api:<clientId>` |
| `createdAt` | DateTime | |
| `sentAt` | DateTime? | |
| `updatedAt` | DateTime | |

Indexes: (orgId, status, createdAt), (recipientType, recipientRef, createdAt).
NotificationPreference: per-user category opt-outs
| Field | Type | Notes |
|---|---|---|
| `userId` | String | |
| `orgId` | String | |
| `category` | enum | |
| `channel` | enum | |
| `optedOut` | Boolean | |
| `updatedAt` | DateTime | |

Composite primary key: (userId, category, channel).
The `transactional` category is always sent regardless of preference rows.
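The opt-out rule, including the transactional override, fits in a few lines. This is an illustrative sketch over an in-memory array, not the service's Prisma query; `isSuppressedByPreference` is a hypothetical name, and treating a missing row as "not opted out" is an assumption.

```typescript
type Category = 'transactional' | 'clinical' | 'marketing' | 'operational';
type Channel = 'email' | 'slack';

interface PreferenceRow {
  userId: string;
  category: Category;
  channel: Channel;
  optedOut: boolean;
}

// True when the send should be suppressed for this user, category and channel.
function isSuppressedByPreference(
  prefs: readonly PreferenceRow[],
  userId: string,
  category: Category,
  channel: Channel,
): boolean {
  // Transactional sends ignore preference rows entirely.
  if (category === 'transactional') return false;
  const row = prefs.find(
    (p) => p.userId === userId && p.category === category && p.channel === channel,
  );
  // Assumption: no row means the user has not opted out.
  return row?.optedOut ?? false;
}
```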
DeliveryAttempt: audit of each provider call
| Field | Type | Notes |
|---|---|---|
| `id` | String (uuid v7) | |
| `notificationId` | String (FK) | |
| `attemptNumber` | Int | |
| `providerName` | String | e.g. `ses`, `slack` |
| `request` | Json | Redacted (no PHI, no full body; metadata only) |
| `response` | Json | |
| `success` | Boolean | |
| `errorCode` | String? | |
| `latencyMs` | Int | |
| `createdAt` | DateTime | |

Index: notificationId.
EventInbox: idempotency + replay protection for incoming domain events
| Field | Type | Notes |
|---|---|---|
| `eventId` | String (PK) | |
| `eventType` | String | |
| `receivedAt` | DateTime | |
| `processedAt` | DateTime? | |
| `notificationIds` | Json | Notification IDs this event produced |

Detects redelivered events so we do not fan out notifications twice.
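The inbox behaviour can be illustrated with an in-memory stand-in. This is only a sketch of the idempotency idea: `InMemoryEventInbox` is a hypothetical class, and a real implementation would presumably rely on the `eventId` primary key and a database transaction rather than a `Map`.

```typescript
interface InboxEntry {
  eventId: string;
  eventType: string;
  receivedAt: Date;
  processedAt: Date | null;
  notificationIds: string[];
}

class InMemoryEventInbox {
  private readonly entries = new Map<string, InboxEntry>();

  // Runs fanOut once per eventId; a redelivered event returns the recorded IDs
  // without creating notifications again.
  handle(eventId: string, eventType: string, fanOut: () => string[]): string[] {
    const seen = this.entries.get(eventId);
    if (seen) return seen.notificationIds;

    const receivedAt = new Date();
    const notificationIds = fanOut();
    this.entries.set(eventId, {
      eventId,
      eventType,
      receivedAt,
      processedAt: new Date(),
      notificationIds,
    });
    return notificationIds;
  }
}
```

Delivering the same `eventId` twice calls `fanOut` only on the first delivery, which is the property the table is meant to guarantee.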
SuppressionList: hard suppressions
| Field | Type | Notes |
|---|---|---|
| `id` | String | |
| `address` | String | Email address or Slack channel ID |
| `channel` | enum | |
| `reason` | String | Manually added in v1 (bounce-webhook ingestion is post-v1) |
| `createdAt` | DateTime | |

Unique: (address, channel). The worker checks this table before sending and marks the `Notification` as `suppressed` on a hit.
Cross-service FK note
No FK to users/orgs across the service boundary (user-management owns those); userId / orgId are opaque identifiers here, consistent with the platform pattern.
4. API Surface & Event Subscriptions
All endpoints sit behind @sa-platform/auth-client JWT guards. Scopes follow the platform convention notifications:*.
Notifications (core)
| Method | Path | Scope | Purpose |
|---|---|---|---|
| POST | `/notifications` | `notifications:send` | Enqueue a send. Body: `{ templateCode, recipient: {type, ref}, variables, correlationId? }`. Returns `{ id, status }` |
| GET | `/notifications/:id` | `notifications:read` | Fetch status + delivery attempts |
| GET | `/notifications` | `notifications:read` | Paginated list with filters (orgId, recipientRef, status, from, to) |
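As an illustration of the `POST /notifications` body shape, here is a dependency-free validation sketch. The names `SendRequest` and `parseSendRequest` are hypothetical, and the service itself may well validate through NestJS DTOs and pipes instead; the allowed recipient types are taken from the `recipientType` enum in the data model.

```typescript
type RecipientType = 'user' | 'slackChannel' | 'email';

interface SendRequest {
  templateCode: string;
  recipient: { type: RecipientType; ref: string };
  variables: Record<string, unknown>;
  correlationId?: string;
}

type ParseResult = { ok: true; value: SendRequest } | { ok: false; error: string };

const RECIPIENT_TYPES: readonly string[] = ['user', 'slackChannel', 'email'];

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === 'object' && v !== null && !Array.isArray(v);
}

const fail = (error: string): ParseResult => ({ ok: false, error });

// Checks the body field by field and returns a typed value or a readable error.
function parseSendRequest(body: unknown): ParseResult {
  if (!isRecord(body)) return fail('body must be an object');
  const { templateCode, recipient, variables, correlationId } = body;

  if (typeof templateCode !== 'string' || templateCode === '') {
    return fail('templateCode is required');
  }
  if (!isRecord(recipient)) return fail('recipient must be an object');
  const { type, ref } = recipient;
  if (typeof type !== 'string' || RECIPIENT_TYPES.indexOf(type) === -1) {
    return fail('recipient.type must be user, slackChannel or email');
  }
  if (typeof ref !== 'string' || ref === '') return fail('recipient.ref is required');
  if (!isRecord(variables)) return fail('variables must be an object');
  if (correlationId !== undefined && typeof correlationId !== 'string') {
    return fail('correlationId must be a string when present');
  }

  const value: SendRequest = {
    templateCode,
    recipient: { type: type as RecipientType, ref },
    variables,
  };
  if (typeof correlationId === 'string') value.correlationId = correlationId;
  return { ok: true, value };
}
```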
Templates (admin)
| Method | Path | Scope | Purpose |
|---|---|---|---|
| GET | `/admin/templates` | `notifications:admin` | List, filter by channel / category / orgId |
| GET | `/admin/templates/:code` | `notifications:admin` | Fetch single (most recent version) |
| PUT | `/admin/templates/:code` | `notifications:admin` | Update (creates new version). Editing a `source: seeded` template is allowed, but the response body includes a `warnings: ["will be overwritten on next deploy"]` field. Per-org custom overrides are reserved for post-v1 |
| POST | `/admin/templates/:code/render-preview` | `notifications:admin` | Dry-run render with sample variables. Returns rendered subject + body + PHI validation result. No send |
Subscriptions (admin)
| Method | Path | Scope | Purpose |
|---|---|---|---|
| GET | `/admin/subscriptions` | `notifications:admin` | List, filter by eventType, orgId |
| POST | `/admin/subscriptions` | `notifications:admin` | Create. Body: eventType, templateCode, recipientRule, orgId (nullable), enabled |
| PATCH | `/admin/subscriptions/:id` | `notifications:admin` | Update enabled / template / rule |
| DELETE | `/admin/subscriptions/:id` | `notifications:admin` | Delete |
Preferences
| Method | Path | Scope | Purpose |
|---|---|---|---|
| GET | `/preferences/:userId` | `notifications:preferences:read` | Read preferences. Caller must be the user themselves or an admin (enforced via the `@sa-platform/auth-client` actor) |
| PUT | `/preferences/:userId` | `notifications:preferences:write` | Body: `[{ category, channel, optedOut }]` |
Suppressions (admin)
| Method | Path | Scope |
|---|---|---|
| GET | `/admin/suppressions` | `notifications:admin` |
| POST | `/admin/suppressions` | `notifications:admin` |
| DELETE | `/admin/suppressions/:id` | `notifications:admin` |
Queue inspection (admin)
| Method | Path | Scope | Purpose |
|---|---|---|---|
| GET | `/admin/queue/failed` | `notifications:admin` | Paginated view of BullMQ failed jobs (retained for 7 days) |
Health
| Method | Path | Purpose |
|---|---|---|
| GET | `/health` | Liveness |
| GET | `/health/ready` | Readiness: checks MySQL + Redis + outbound provider reachability |
Event subscriptions (pub/sub ingress)
The service subscribes to the wildcard pattern `events:*` on Redis pub/sub (matching the platform pattern).
For each inbound event:
- Check `EventInbox`. If `eventId` is already processed, skip (idempotency).
- Query `NotificationSubscription` for matching `eventType` + `orgId` (falling back to platform rules with `orgId = null`).
- For each matching subscription:
  - Resolve the recipient via `recipientRule` (JSON-path into the event payload, or literal channel ID).
  - Create a `Notification` row (`status: pending`, `triggerSource: event:<eventId>`).
  - Enqueue to BullMQ.
- Record an `EventInbox` row with the list of created notification IDs.
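The subscription lookup with platform fallback might look like the sketch below. One point is an explicit assumption: here org-specific rules replace the platform rules (`orgId = null`) rather than being merged with them; the spec's wording allows either reading. `matchSubscriptions` is a hypothetical name operating on in-memory rows.

```typescript
interface SubscriptionRow {
  id: string;
  eventType: string;
  templateCode: string;
  orgId: string | null;
  enabled: boolean;
}

// Returns the enabled subscriptions for an event, preferring org-specific rows
// and falling back to platform rows (orgId === null) only when the org has none.
function matchSubscriptions(
  rows: readonly SubscriptionRow[],
  eventType: string,
  orgId: string,
): SubscriptionRow[] {
  const active = rows.filter((r) => r.enabled && r.eventType === eventType);
  const orgSpecific = active.filter((r) => r.orgId === orgId);
  return orgSpecific.length > 0 ? orgSpecific : active.filter((r) => r.orgId === null);
}
```

If the intended semantics are a union of org and platform rules, the last line would instead return both filtered sets together.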
Domain events emitted (on `events:notifications.*`)
- `notification.sent`: `{ notificationId, templateCode, channel, orgId, correlationId }` (PHI-free by design).
- `notification.failed`: same shape plus `errorCode`, `finalAttempt` (bool).
- `notification.suppressed`: emitted when a preference or the suppression list rejects the send.
These are for audit consumers and ops dashboards, not for re-triggering sends.
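One way to keep the emitted payload PHI-free is to build it field by field from the row. The sketch below assumes a trimmed-down row type (`NotificationRow`, a hypothetical name chosen to avoid the DOM's global `Notification`) and a hypothetical `toSentEvent` helper; the payload fields are the ones listed for `notification.sent`.

```typescript
// Subset of the Notification row that matters for this example.
interface NotificationRow {
  id: string;
  templateCode: string;
  channel: 'email' | 'slack';
  orgId: string;
  correlationId: string | null;
  resolvedAddress: string | null; // never emitted
  variables: Record<string, unknown>; // may contain PHI, never emitted
}

interface NotificationSentEvent {
  notificationId: string;
  templateCode: string;
  channel: 'email' | 'slack';
  orgId: string;
  correlationId: string | null;
}

// Copies fields explicitly instead of spreading the row, so a column added to
// the table later cannot leak into the event by accident.
function toSentEvent(row: NotificationRow): NotificationSentEvent {
  return {
    notificationId: row.id,
    templateCode: row.templateCode,
    channel: row.channel,
    orgId: row.orgId,
    correlationId: row.correlationId,
  };
}
```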
5. Delivery Pipeline
Queue: BullMQ on Redis
Single queue notifications:send. Jobs carry only { notificationId } — the MySQL row is the source of truth, the job is a pointer. This keeps jobs small and lets the worker re-read fresh state on each attempt (e.g. pick up a suppression added between enqueue and send).
Queue config:
- Concurrency: 10 workers per service instance (tunable via env).
- Job options: `attempts: 5`, `backoff: { type: 'exponential', delay: 2000 }`, i.e. up to four retries after the first attempt, delayed by 2s, 4s, 8s, and 16s.
- Final failure marks the `Notification` as `failed` and emits `notification.failed` with `finalAttempt: true`.
- Dead-letter: after attempts are exhausted, the job stays on the BullMQ failed list in Redis with 7-day retention for ops inspection. No auto-replay.
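The retry schedule implied by these job options can be worked out with a few lines of arithmetic. The sketch assumes the built-in exponential strategy computes `delay × 2^(attemptsMade − 1)`; `retryDelaysMs` is a hypothetical helper for illustration, not a BullMQ API.

```typescript
// Delays (in ms) between attempts for a job with `attempts` total tries.
// With N attempts there are N - 1 retries, hence N - 1 delays.
function retryDelaysMs(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseDelayMs * 2 ** (retry - 1));
  }
  return delays;
}

const schedule = retryDelaysMs(5, 2000);
console.log(schedule); // [ 2000, 4000, 8000, 16000 ]
console.log(schedule.reduce((sum, d) => sum + d, 0)); // 30000
```

Under that assumption a job that fails every attempt is given up roughly 30 seconds after its first failure, plus the time spent inside each attempt.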
Worker flow (per job)
- Load `Notification` by ID. If `status ∉ {pending, queued}`, skip (idempotent; handles re-delivery and manual retries).
- Mark `status: sending`, increment `attempts`.
- Preference gate (only when `recipientType === 'user'`): look up `NotificationPreference(userId = recipientRef, category, channel)`. If `optedOut` and `category ≠ transactional`, set `status: suppressed`, emit `notification.suppressed`, done. Preferences don't apply to `slackChannel` or raw-`email` recipients, because those addresses aren't owned by a user in this service.
- Recipient resolution:
  - `recipientType: user`: call user-management `GET /users/:id/contact` via a typed client authenticated with `@sa-platform/auth-client`. No caching, since contact info can change.
  - `recipientType: slackChannel`: use `recipientRef` directly.
  - `recipientType: email`: use `recipientRef` directly (ad-hoc API usage).
  - Store `resolvedAddress` on the row before continuing (encrypted if `phi: true`).
- Suppression-list gate: check `SuppressionList(resolvedAddress, channel)`. Hit → `status: suppressed`, emit `notification.suppressed`, done.
- Render template:
  - Load `Template` by `(code, channel, locale, orgId)` with platform fallback (`orgId = null`).
  - Record `templateVersion` on the row.
  - Handlebars render of `subject` and `bodyTemplate` with `variables`.
  - PHI enforcement: if `Template.phi === true` and `channel === 'slack'`, abort immediately: set `status: failed`, `errorCode: phi_policy_violation`, emit `notification.failed`. This is a code-level guarantee; no code path allows PHI templates to reach Slack.
- Provider adapter call:
  - `EmailProvider.send({ to, subject, body, correlationId })`: SES via `@aws-sdk/client-sesv2`, tagged via an `SES_CONFIGURATION_SET`.
  - `SlackProvider.send({ channel, blocks, correlationId })`: `@slack/web-api` with the platform bot token.
  - Record a `DeliveryAttempt` row (redacted request payload; never log full PHI).
- On provider success: `status: sent`, `providerMessageId`, `sentAt = now`, emit `notification.sent`.
- On provider failure, classify via `retry.policy.ts`:
  - `transient` (5xx, 429 rate-limit, network errors) → throw to let BullMQ retry. As written, the retried job would find the row in `sending` and be skipped by the first step's status guard, so the row needs to return to `queued` before the throw.
  - `permanent` (4xx validation, invalid recipient, auth errors) → `status: failed`, `errorCode`, emit `notification.failed`, do not re-throw.
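The transient/permanent split described in the last step could be sketched as follows. The `ProviderFailure` shape and `classifyFailure` name are hypothetical, and treating an error with no status code and no network flag as permanent is an assumption the spec does not make either way.

```typescript
type FailureClass = 'transient' | 'permanent';

// Hypothetical normalised view of a provider error.
interface ProviderFailure {
  httpStatus?: number;
  networkError?: boolean;
}

function classifyFailure(failure: ProviderFailure): FailureClass {
  // Connection resets, timeouts, DNS failures and similar: worth retrying.
  if (failure.networkError) return 'transient';

  const code = failure.httpStatus;
  // Assumption: an unrecognised error is not retried.
  if (code === undefined) return 'permanent';

  // 429 is the one 4xx that signals "try again later"; every 5xx is retried.
  if (code === 429 || code >= 500) return 'transient';

  // Remaining 4xx: validation, invalid recipient, auth errors.
  return 'permanent';
}
```

A worker would throw on `'transient'` so the queue schedules another attempt, and record a final failure on `'permanent'` without re-throwing.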
Provider adapter interface (`provider.interface.ts`)
interface NotificationProvider {
channel: 'email' | 'slack';
send(input: ProviderSendInput): Promise<ProviderSendResult>;
classify(error: unknown): 'transient' | 'permanent';
}
Two implementations for v1 (SES and Slack); SMS and push adapters slot into this later without worker changes.
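To show how an adapter plugs into this interface, here is a recording fake of the kind a worker unit test might use. The field lists of `ProviderSendInput` and `ProviderSendResult` are assumptions (the spec does not define them), and `RecordingEmailProvider` is a hypothetical test double, not the SES adapter.

```typescript
// Assumed shapes; the spec names these types but does not define their fields.
interface ProviderSendInput {
  to: string;
  subject?: string;
  body: string;
  correlationId?: string;
}

interface ProviderSendResult {
  providerMessageId: string;
}

interface NotificationProvider {
  channel: 'email' | 'slack';
  send(input: ProviderSendInput): Promise<ProviderSendResult>;
  classify(error: unknown): 'transient' | 'permanent';
}

// In-memory fake: records every send and hands back a predictable message ID.
class RecordingEmailProvider implements NotificationProvider {
  readonly channel = 'email' as const;
  readonly sent: ProviderSendInput[] = [];

  async send(input: ProviderSendInput): Promise<ProviderSendResult> {
    this.sent.push(input);
    return { providerMessageId: `fake-${this.sent.length}` };
  }

  // The fake never fails, so the classification result is arbitrary here.
  classify(_error: unknown): 'transient' | 'permanent' {
    return 'permanent';
  }
}
```

Because the worker depends only on `NotificationProvider`, swapping this fake for a real adapter should require no change to the orchestration code.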
Seed templates (code-owned, synced at deploy)
A templates/seeds/ directory holds .template.ts files, each exporting { code, channel, category, phi, subject, body, locale }.
On service boot (idempotent), the seed runner upserts Template rows with source: seeded, bumping version if body or subject changed. This is how new templates ship with code while remaining editable in the database.
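The version-bump rule can be sketched as a pure function over the stored row and the seed. The type and function names are hypothetical, and starting new templates at version 1 is an assumption; the real runner would wrap this in a Prisma upsert.

```typescript
interface SeedTemplate {
  code: string;
  channel: 'email' | 'slack';
  category: 'transactional' | 'clinical' | 'marketing' | 'operational';
  phi: boolean;
  subject: string | null;
  body: string;
  locale: string;
}

interface StoredTemplate extends SeedTemplate {
  version: number;
  source: 'seeded' | 'custom';
}

// Computes the row to write for a seed, given the currently stored row (if any).
// The version only moves when subject or body changed, so repeated boots with
// unchanged seeds leave the version as it was.
function applySeed(existing: StoredTemplate | undefined, seed: SeedTemplate): StoredTemplate {
  if (!existing) {
    return { ...seed, version: 1, source: 'seeded' }; // assumed starting version
  }
  const contentChanged = existing.body !== seed.body || existing.subject !== seed.subject;
  return {
    ...seed,
    source: 'seeded',
    version: contentChanged ? existing.version + 1 : existing.version,
  };
}
```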
Observability
- Structured logs per attempt: `correlationId`, `notificationId`, `channel`, `provider`, `outcome`, `latencyMs`.
- Prometheus metrics:
  - Counter `notifications_sent_total{channel,category,status}`
  - Histogram `notification_delivery_seconds{channel}`
- Failed-queue inspection endpoint for ops (`GET /admin/queue/failed`, scope `notifications:admin`).
6. Testing Strategy
The strategy follows the existing services' approach: unit tests against mocked collaborators, integration tests against real infrastructure via Testcontainers.
Unit tests (~50 tests)
- `templates/`: seed runner upserts idempotently; version bumps only when content changes; `render-preview` returns PHI validation result.
- `subscriptions/`: `recipientRule` resolution against varied event payloads (user path, literal channel, missing field).
- `preferences/`: category opt-out enforcement; transactional override; per-user / per-channel precedence.
- `events/` consumer: idempotency via `EventInbox`; fan-out per matching subscription; malformed event payload handling.
- `delivery/renderer`: Handlebars rendering; PHI enforcement (Slack + `phi: true` → throws with `phi_policy_violation`); locale fallback.
- `delivery/retry.policy`: classification matrix (5xx, 429, 4xx, network, auth errors).
- `delivery/worker`: stateful sequencing (pending → sending → sent); stale-job skip; suppression-list gate.
- `delivery/providers/*`: SES + Slack adapter happy paths and error classification, with SDK clients mocked.
- `recipients/` client: user-management client happy path; 404 → permanent failure; 5xx → transient.
Integration tests (~20 tests, real MySQL + Redis, provider SDKs mocked at the client boundary)
- Full event → notification → worker → `sent` flow (emit event on Redis, assert `Notification` row, assert provider mock called, assert `notification.sent` emitted).
- Idempotency: same event delivered twice produces one notification set.
- Retry behaviour: transient failure triggers BullMQ retry; eventual success.
- Permanent failure: 4xx short-circuits retries; marks `failed`.
- Preference opt-out: non-transactional template is suppressed; transactional is not.
- Suppression list: address on the list short-circuits to `suppressed`.
- PHI guard: Slack template with `phi: true` enqueued → `failed` with `phi_policy_violation`.
- Recipient resolution: user-management mock returns contact, worker uses it; 404 marks `failed`.
- Template versioning: edit a seeded template's body → version bumps; new `Notification`s record the new version.
- Admin API: CRUD on subscriptions and templates end-to-end, with `notifications:admin` scope enforcement.
- Health: `/health/ready` reports red when Redis or MySQL is unreachable.
Infrastructure for integration
- `test/testcontainer.ts` pattern (same as the consent service): Redis + MySQL via Testcontainers.
- SES + Slack SDK clients mocked at the client level (no mocked HTTP stack); provider adapter tests verify we call the SDKs correctly.
- CI: extend `.github/workflows/ci.yml` to run notifications Prisma generate + unit + integration tests, alongside the other services.
Out of scope for v1 tests
- Load / performance tests.
- Real-provider smoke tests (SES sandbox, Slack bot in a dev workspace) — candidate for a post-v1 nightly job.
- Chaos / fault-injection against BullMQ.
7. Open Questions / Deferred Decisions
These are deliberately deferred and called out so a future plan doesn't treat them as oversights.
- Per-org template overrides. Schema supports `Template.orgId` non-null and `source: custom`. No API surface yet; reserved for post-v1 when a preference-centre or white-label requirement actually lands.
- Per-org provider credentials. Platform-level only in v1. When needed, a `ProviderConfig(orgId, channel, credentials)` table and a per-send lookup step can be added without changing the adapter interface.
- Inbound provider webhooks (SES bounce / complaint, Slack event API). Planned for when email volume justifies sender-reputation hygiene. The entry point will update `Notification.status` (e.g. `delivered`, `bounced`) and auto-insert `SuppressionList` rows on hard bounces.
- Scheduled sends. Not in v1. When added: a `scheduledFor` field on `Notification`, a time-based dispatcher, and cancellation semantics tied to the trigger source.
- SMS and push. Non-goals for v1. Drop in via the provider adapter interface.
- Preference centre UI. A product concern, not a platform concern. API is ready; UI is out of scope here.