Notifications Follow-ups Bundle — Design Spec¶
Date: 2026-04-22 Status: Draft — autonomous design, user review pending Related: Notifications service (PR #21), auth-client AuthModule global (PR #23)
1. Scope, Goals, Non-Goals¶
Goal¶
Three small, discrete items flagged in prior PRs that together tidy up deferred notifications work. One bundled PR because each item is too small for its own spec, and together they form a coherent "notifications polish" change.
The three items¶
-
slackUserIdfield on user-management's User model. The notifications worker (PR #21) resolves a user's Slack ID from user-management's/v1/users/:id/contactendpoint, but user-management's schema has noslackUserIdcolumn. Today any Slack DM to a user fails with permanentrecipient_not_found. Adding the column + surfacing it through the contact endpoint unblocks the path. -
Periodic re-enqueue sweep for stuck
queuednotifications. PR #23's final review flagged an at-least-once edge: if the notifications service process crashes between an event-consumer transaction commit and the subsequent Redisqueue.enqueuecall, theNotificationrow exists withstatus: queuedbut has no BullMQ job. Currently it stays stuck. A small periodic sweep finds such rows and re-enqueues them. -
Simplify integration-test auth bypass. The notifications integration tests in
test/integration/helpers.tsoverrideJwksServicewith a permissive mock because, at the time of PR #21, NestJSoverrideGuard(AuthGuard)didn't work —AuthGuardwas registered asAPP_GUARDwhichoverrideGuarddoesn't touch. PR #23 markedAuthModuleasglobal: true, which should now allow cleaner test setup. Verify and simplify.
In scope¶
Item 1: slackUserId on user-management¶
- Add
slackUserId String? @db.VarChar(255)nullable column toservices/user-management/prisma/schema.prisma'sUsermodel. - Migration
add_slack_user_id_to_user. - Update
GET /v1/users/:id/contactto returnslackUserIdin the response (already-defined shape from PR #21's Task 16; just populate from the new column). - Update the unit/integration tests for the endpoint to verify
slackUserIdround-trips.
Item 2: Re-enqueue sweep¶
- New
services/notifications/src/delivery/re-enqueue.service.ts— periodically scansNotificationrows wherestatus = 'queued' AND createdAt < now() - 10 minutesAND the BullMQ queue doesn't already have a job for the ID. - Scheduled via NestJS
@Interval(from@nestjs/schedule) or BullMQ's repeatable-job mechanism. Pick the lower-friction option — NestJS schedule adds a new dep but is simpler than a repeatable job for this case. - On each run: find candidates, for each check BullMQ
queue.getJob(notificationId)to avoid double-enqueueing, re-enqueue those that need it, log count. - Admin-facing endpoint
POST /v1/admin/queue/re-enqueue-stuckfor manual runs (useful for ops during incidents — still admin-guarded). - Unit tests: the service's logic with mocked Prisma + BullMQ. Integration test: create a stuck
queuedrow without enqueuing, run the sweep, assert it gets enqueued.
Item 3: Auth test simplification¶
- Read
services/notifications/test/integration/helpers.ts. - Verify that with
AuthModulenow global (PR #23), NestJS'sTest.createTestingModule(...).overrideGuard(AuthGuard)works as documented. - Replace the
overrideProvider(JwksService).useValue({ verifyToken: async () => ({...}) })workaround withoverrideGuard(AuthGuard).useValue({ canActivate: () => true }). - Keep the per-test actor-context header injection — that's orthogonal to guard bypass.
- All 46 existing notifications integration tests continue passing.
Out of scope¶
- Prod-secret enforcement for
ADMIN_API_SECRETandSERVICE_AUTH_TOKEN— flagged in PR #23 review, deferred to a separate platform-hardening PR that'll touch every service'sapp-config.ts. Bundling it here would conflict with PR #24's notifications changes and muddies the review. - Migrating
auth/consentto the newprisma-clientgenerator. - KMS key provider.
- Patient-side preference centre UI.
Success criteria¶
slackUserIdpresent on user-management User model, exposed via contact endpoint, covered by unit + integration tests.- A Notification row in
queuedstate older than 10 minutes with no BullMQ job is re-enqueued by the sweep within one interval window. Zero double-enqueues verified. helpers.tsin notifications integration tests usesoverrideGuard(AuthGuard)— noJwksServicemock. All 46 + any new tests still green.- Full monorepo typecheck + tests green.
2. Architecture — three slices¶
2.1 slackUserId¶
Trivial. Schema adds the column, migration applies ALTER TABLE user ADD COLUMN slack_user_id VARCHAR(255) NULL;. The service returns it in the contact response; no other logic changes.
Verify by: notifications worker resolving a Slack DM recipient now correctly succeeds when the target user has a slackUserId, rather than throwing UserNotFoundError with "contact missing slackUserId".
2.2 Re-enqueue sweep¶
┌─────────────────────────────┐
│ @Interval('re-enqueue', 5m)│
│ ReEnqueueService.sweep() │
└──────────┬──────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ prisma.notification.findMany({ │
│ where: { │
│ status: 'queued', │
│ createdAt: { lt: now - 10 minutes } │
│ }, │
│ select: { id: true }, │
│ take: 100, │
│ }) │
└──────────┬──────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ for each candidate: │
│ existing = queue.getJob(candidate.id) │
│ if !existing: queue.enqueue(candidate.id) │
│ metric: notifications_reenqueued_total++ │
└─────────────────────────────────────────────────────┘
Implementation notes:
- The
createdAt < 10mthreshold gives BullMQ's normal retry policy time to act before we interfere. Notifications withstatus: queuedthat fail to process land instatus: sending(per PR #23's atomic claim), soqueuedrows older than 10 minutes genuinely indicate missing jobs. queue.getJob(id)in BullMQ v5 accepts a job ID and returns the job if present. BullMQ job IDs are strings; we pass the notification ID.- Take up to 100 per sweep to avoid long-running batches under load.
- Cron library:
@nestjs/scheduleis already a widely-used NestJS utility. Adding it as a dependency is acceptable. Alternative (no new dep):setIntervalinOnModuleInit, clear inOnModuleDestroy. Simpler, no new library — I'll go with that.
Admin endpoint POST /v1/admin/queue/re-enqueue-stuck:
- Body:
{ limit?: number }— optional override of the batch limit. - Returns:
{ candidatesFound: N, reenqueued: M }. - Admin-auth guard, scope
notifications:admin.
2.3 Test auth simplification¶
helpers.ts currently looks roughly like:
const testModule = await Test.createTestingModule({ imports: [AppModule] })
.overrideProvider(JwksService)
.useValue({ verifyToken: async () => ({ sub: 'test', ...actorContext }) })
.compile();
Target:
const testModule = await Test.createTestingModule({ imports: [AppModule] })
.overrideGuard(AuthGuard)
.useValue({ canActivate: () => true })
.compile();
If overrideGuard still doesn't work for some reason (e.g. AuthGuard is being inserted by decorators not DI), fall back to the existing workaround with a note and close the task as DONE_WITH_CONCERNS.
Either way, the actor-context header injection — which populates req.actorContext via ActorContextMiddleware — is unchanged. Tests that need admin permissions still set the header; tests that need self-access still set it.
3. The Actual Changes¶
Files created¶
services/notifications/src/delivery/re-enqueue.service.ts
services/notifications/src/delivery/re-enqueue.service.spec.ts
services/notifications/src/delivery/re-enqueue.controller.ts # admin POST endpoint
services/notifications/test/integration/re-enqueue.integration-spec.ts
services/user-management/prisma/migrations/<ts>_add_slack_user_id_to_user/migration.sql
Files modified¶
services/user-management/prisma/schema.prisma # + slackUserId String? @db.VarChar(255)
services/user-management/src/users/users.service.ts # select + return slackUserId
services/user-management/src/users/users.service.spec.ts
services/user-management/test/integration/users.integration-spec.ts
services/notifications/src/delivery/delivery.module.ts # register ReEnqueueService + controller
services/notifications/test/integration/helpers.ts # overrideGuard instead of JwksService mock
Files NOT touched¶
- Notifications
worker.service.ts,event-consumer.service.ts,notifications.service.ts— re-enqueue is additive, existing flow unchanged. - Notifications schema — no new columns.
- Other services.
No schema changes on notifications¶
The re-enqueue sweep reads the existing Notification.status + Notification.createdAt — no schema edits needed.
4. Verification Plan¶
Local verification sequence¶
pnpm install— no new deps if we usesetInterval; if we add@nestjs/schedulethat's one new devDep.pnpm --filter @sa-platform/user-management prisma:migrate:dev --name add_slack_user_id_to_user.pnpm --filter @sa-platform/user-management prisma:generate.pnpm turbo run typecheck— green.pnpm turbo run test— green, including new unit tests:users.service.spec.ts— getContact returns slackUserId.re-enqueue.service.spec.ts— (unit, mocked Prisma + BullMQ):- Finds stuck rows older than threshold.
- Skips rows that already have a BullMQ job.
- Enqueues rows missing a job.
- Respects
limitparam. - Returns
{ candidatesFound, reenqueued }counts.
pnpm --filter @sa-platform/user-management test:integration— contact endpoint returnsslackUserId.pnpm --filter @sa-platform/notifications test:integration:- Existing 46 tests still green.
- Simplified helpers produce correct actor-context behaviour.
- 1 new test: create a stuck
queuedrow withcreatedAtbackdated 15 minutes ago, NO BullMQ job — runPOST /v1/admin/queue/re-enqueue-stuck— assert response{ candidatesFound: 1, reenqueued: 1 }, verify the BullMQ queue now has the job. - Manual smoke: start notifications service, confirm
/health/ready200. The internal sweep interval logs but does nothing (no stuck rows).
Rollback plan¶
Standard revert. Migration is safe to revert (column drop — we can re-run the migration forward if desired). No runtime state relies on the column existing until the notifications worker actually needs to read it from a contact response.
5. Risks¶
| Risk | Likelihood | Mitigation |
|---|---|---|
Re-enqueue causes a race — BullMQ worker picks up job, then sweep fires, sees row still in queued momentarily, double-enqueues |
Low | Worker's atomic claim (PR #23 C1 fix) transitions status queued → sending before any send. Sweep filters by status: 'queued' only, so sending/sent/failed/suppressed are excluded. The 10-minute threshold also prevents racing normal path. |
@nestjs/schedule adds dependency weight |
N/A | Using setInterval in OnModuleInit — no new dep. |
overrideGuard still doesn't work, tests break |
Medium | Attempt the simplification; if it doesn't work, revert just item 3 and keep the rest. Report as DONE_WITH_CONCERNS. |
| Admin endpoint for manual sweep gets abused | Low | Admin-guard protected. Rate-limiting is a platform concern out of scope here. |
| Migration adds nullable column — no data loss, no lock concerns for small tables | Very low | MySQL 8 adds nullable columns online. Non-issue at current data volume. |
6. Out of scope — still pending for future work¶
- Prod-secret enforcement (separate platform-hardening PR).
- Backfill for pre-encryption PHI rows — near-zero volume, forward-only still fine.
- KMS key provider.
- DEK rotation.
- Per-org DEK scoping.