Skip to content

Notifications Follow-ups Bundle — Design Spec

Date: 2026-04-22 Status: Draft — autonomous design, user review pending Related: Notifications service (PR #21), auth-client AuthModule global (PR #23)


1. Scope, Goals, Non-Goals

Goal

Three small, discrete items flagged in prior PRs that together tidy up deferred notifications work. One bundled PR because each item is too small for its own spec, and together they form a coherent "notifications polish" change.

The three items

  1. slackUserId field on user-management's User model. The notifications worker (PR #21) resolves a user's Slack ID from user-management's /v1/users/:id/contact endpoint, but user-management's schema has no slackUserId column. Today any Slack DM to a user fails with permanent recipient_not_found. Adding the column + surfacing it through the contact endpoint unblocks the path.

  2. Periodic re-enqueue sweep for stuck queued notifications. PR #23's final review flagged an at-least-once edge: if the notifications service process crashes between an event-consumer transaction commit and the subsequent Redis queue.enqueue call, the Notification row exists with status: queued but has no BullMQ job. Currently it stays stuck. A small periodic sweep finds such rows and re-enqueues them.

  3. Simplify integration-test auth bypass. The notifications integration tests in test/integration/helpers.ts override JwksService with a permissive mock because, at the time of PR #21, NestJS overrideGuard(AuthGuard) didn't work — AuthGuard was registered as APP_GUARD which overrideGuard doesn't touch. PR #23 marked AuthModule as global: true, which should now allow cleaner test setup. Verify and simplify.

In scope

Item 1: slackUserId on user-management

  • Add slackUserId String? @db.VarChar(255) nullable column to services/user-management/prisma/schema.prisma's User model.
  • Migration add_slack_user_id_to_user.
  • Update GET /v1/users/:id/contact to return slackUserId in the response (already-defined shape from PR #21's Task 16; just populate from the new column).
  • Update the unit/integration tests for the endpoint to verify slackUserId round-trips.

Item 2: Re-enqueue sweep

  • New services/notifications/src/delivery/re-enqueue.service.ts — periodically scans Notification rows where status = 'queued' AND createdAt < now() - 10 minutes AND the BullMQ queue doesn't already have a job for the ID.
  • Scheduled via NestJS @Interval (from @nestjs/schedule) or BullMQ's repeatable-job mechanism. Pick the lower-friction option — NestJS schedule adds a new dep but is simpler than a repeatable job for this case.
  • On each run: find candidates, for each check BullMQ queue.getJob(notificationId) to avoid double-enqueueing, re-enqueue those that need it, log count.
  • Admin-facing endpoint POST /v1/admin/queue/re-enqueue-stuck for manual runs (useful for ops during incidents — still admin-guarded).
  • Unit tests: the service's logic with mocked Prisma + BullMQ. Integration test: create a stuck queued row without enqueuing, run the sweep, assert it gets enqueued.

Item 3: Auth test simplification

  • Read services/notifications/test/integration/helpers.ts.
  • Verify that with AuthModule now global (PR #23), NestJS's Test.createTestingModule(...).overrideGuard(AuthGuard) works as documented.
  • Replace the overrideProvider(JwksService).useValue({ verifyToken: async () => ({...}) }) workaround with overrideGuard(AuthGuard).useValue({ canActivate: () => true }).
  • Keep the per-test actor-context header injection — that's orthogonal to guard bypass.
  • All 46 existing notifications integration tests continue passing.

Out of scope

  • Prod-secret enforcement for ADMIN_API_SECRET and SERVICE_AUTH_TOKEN — flagged in PR #23 review, deferred to a separate platform-hardening PR that'll touch every service's app-config.ts. Bundling it here would conflict with PR #24's notifications changes and muddies the review.
  • Migrating auth/consent to the new prisma-client generator.
  • KMS key provider.
  • Patient-side preference centre UI.

Success criteria

  • slackUserId present on user-management User model, exposed via contact endpoint, covered by unit + integration tests.
  • A Notification row in queued state older than 10 minutes with no BullMQ job is re-enqueued by the sweep within one interval window. Zero double-enqueues verified.
  • helpers.ts in notifications integration tests uses overrideGuard(AuthGuard) — no JwksService mock. All 46 + any new tests still green.
  • Full monorepo typecheck + tests green.

2. Architecture — three slices

2.1 slackUserId

Trivial. Schema adds the column, migration applies ALTER TABLE user ADD COLUMN slack_user_id VARCHAR(255) NULL;. The service returns it in the contact response; no other logic changes.

Verify by: notifications worker resolving a Slack DM recipient now correctly succeeds when the target user has a slackUserId, rather than throwing UserNotFoundError with "contact missing slackUserId".

2.2 Re-enqueue sweep

┌─────────────────────────────┐
│  @Interval('re-enqueue', 5m)│
│  ReEnqueueService.sweep()   │
└──────────┬──────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────┐
│ prisma.notification.findMany({                      │
│   where: {                                          │
│     status: 'queued',                               │
│     createdAt: { lt: now - 10 minutes }             │
│   },                                                │
│   select: { id: true },                             │
│   take: 100,                                        │
│ })                                                  │
└──────────┬──────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────┐
│ for each candidate:                                 │
│   existing = queue.getJob(candidate.id)             │
│   if !existing: queue.enqueue(candidate.id)         │
│   metric: notifications_reenqueued_total++          │
└─────────────────────────────────────────────────────┘

Implementation notes:

  • The createdAt < 10m threshold gives BullMQ's normal retry policy time to act before we interfere. Notifications with status: queued that fail to process land in status: sending (per PR #23's atomic claim), so queued rows older than 10 minutes genuinely indicate missing jobs.
  • queue.getJob(id) in BullMQ v5 accepts a job ID and returns the job if present. BullMQ job IDs are strings; we pass the notification ID.
  • Take up to 100 per sweep to avoid long-running batches under load.
  • Cron library: @nestjs/schedule is already a widely-used NestJS utility. Adding it as a dependency is acceptable. Alternative (no new dep): setInterval in OnModuleInit, clear in OnModuleDestroy. Simpler, no new library — I'll go with that.

Admin endpoint POST /v1/admin/queue/re-enqueue-stuck:

  • Body: { limit?: number } — optional override of the batch limit.
  • Returns: { candidatesFound: N, reenqueued: M }.
  • Admin-auth guard, scope notifications:admin.

2.3 Test auth simplification

helpers.ts currently looks roughly like:

const testModule = await Test.createTestingModule({ imports: [AppModule] })
  .overrideProvider(JwksService)
  .useValue({ verifyToken: async () => ({ sub: 'test', ...actorContext }) })
  .compile();

Target:

const testModule = await Test.createTestingModule({ imports: [AppModule] })
  .overrideGuard(AuthGuard)
  .useValue({ canActivate: () => true })
  .compile();

If overrideGuard still doesn't work for some reason (e.g. AuthGuard is being inserted by decorators not DI), fall back to the existing workaround with a note and close the task as DONE_WITH_CONCERNS.

Either way, the actor-context header injection — which populates req.actorContext via ActorContextMiddleware — is unchanged. Tests that need admin permissions still set the header; tests that need self-access still set it.


3. The Actual Changes

Files created

services/notifications/src/delivery/re-enqueue.service.ts
services/notifications/src/delivery/re-enqueue.service.spec.ts
services/notifications/src/delivery/re-enqueue.controller.ts       # admin POST endpoint
services/notifications/test/integration/re-enqueue.integration-spec.ts

services/user-management/prisma/migrations/<ts>_add_slack_user_id_to_user/migration.sql

Files modified

services/user-management/prisma/schema.prisma                       # + slackUserId String? @db.VarChar(255)
services/user-management/src/users/users.service.ts                 # select + return slackUserId
services/user-management/src/users/users.service.spec.ts
services/user-management/test/integration/users.integration-spec.ts

services/notifications/src/delivery/delivery.module.ts              # register ReEnqueueService + controller
services/notifications/test/integration/helpers.ts                  # overrideGuard instead of JwksService mock

Files NOT touched

  • Notifications worker.service.ts, event-consumer.service.ts, notifications.service.ts — re-enqueue is additive, existing flow unchanged.
  • Notifications schema — no new columns.
  • Other services.

No schema changes on notifications

The re-enqueue sweep reads the existing Notification.status + Notification.createdAt — no schema edits needed.


4. Verification Plan

Local verification sequence

  1. pnpm install — no new deps if we use setInterval; if we add @nestjs/schedule that's one new devDep.
  2. pnpm --filter @sa-platform/user-management prisma:migrate:dev --name add_slack_user_id_to_user.
  3. pnpm --filter @sa-platform/user-management prisma:generate.
  4. pnpm turbo run typecheck — green.
  5. pnpm turbo run test — green, including new unit tests:
  6. users.service.spec.ts — getContact returns slackUserId.
  7. re-enqueue.service.spec.ts — (unit, mocked Prisma + BullMQ):
    • Finds stuck rows older than threshold.
    • Skips rows that already have a BullMQ job.
    • Enqueues rows missing a job.
    • Respects limit param.
    • Returns { candidatesFound, reenqueued } counts.
  8. pnpm --filter @sa-platform/user-management test:integration — contact endpoint returns slackUserId.
  9. pnpm --filter @sa-platform/notifications test:integration:
  10. Existing 46 tests still green.
  11. Simplified helpers produce correct actor-context behaviour.
  12. 1 new test: create a stuck queued row with createdAt backdated 15 minutes ago, NO BullMQ job — run POST /v1/admin/queue/re-enqueue-stuck — assert response { candidatesFound: 1, reenqueued: 1 }, verify the BullMQ queue now has the job.
  13. Manual smoke: start notifications service, confirm /health/ready 200. The internal sweep interval logs but does nothing (no stuck rows).

Rollback plan

Standard revert. Migration is safe to revert (column drop — we can re-run the migration forward if desired). No runtime state relies on the column existing until the notifications worker actually needs to read it from a contact response.


5. Risks

Risk Likelihood Mitigation
Re-enqueue causes a race — BullMQ worker picks up job, then sweep fires, sees row still in queued momentarily, double-enqueues Low Worker's atomic claim (PR #23 C1 fix) transitions status queued → sending before any send. Sweep filters by status: 'queued' only, so sending/sent/failed/suppressed are excluded. The 10-minute threshold also prevents racing normal path.
@nestjs/schedule adds dependency weight N/A Using setInterval in OnModuleInit — no new dep.
overrideGuard still doesn't work, tests break Medium Attempt the simplification; if it doesn't work, revert just item 3 and keep the rest. Report as DONE_WITH_CONCERNS.
Admin endpoint for manual sweep gets abused Low Admin-guard protected. Rate-limiting is a platform concern out of scope here.
Migration adds nullable column — no data loss, no lock concerns for small tables Very low MySQL 8 adds nullable columns online. Non-issue at current data volume.

6. Out of scope — still pending for future work

  • Prod-secret enforcement (separate platform-hardening PR).
  • Backfill for pre-encryption PHI rows — near-zero volume, forward-only still fine.
  • KMS key provider.
  • DEK rotation.
  • Per-org DEK scoping.