Orchestrator Service — Design Spec¶
Date: 2026-04-26 Status: Draft for review Author: Jim Holmes (with brainstorming session) Tracking issue / PR: TBD
1. Service overview & boundaries¶
1.1 Purpose¶
The orchestrator is a cross-module workflow engine for clinical cases. It owns case-flow definitions and runtime state. It drives modules in active mode and observes them in client-driven mode. The architecture is generic; v1 ships only clinical step kinds and clinical workflow templates, but the foundation supports non-clinical workflows (onboarding, billing, prescription approval) without redesign.
The motivation: clinical-api today does two different jobs — owns case data (system of record) and drives workflow (control plane). Separating them lets each evolve at its own cadence. Workflow rules churn rapidly as orgs come online with new contracts; case-data shape evolves slowly. Mixing them puts workflow churn in the same service that owns 25 Prisma tables and 60 endpoints, which gets fragile.
1.2 What the orchestrator owns¶
- Workflow definitions (per-org/per-product, versioned, immutable once published)
- Workflow templates (SA-shipped starter flows; orgs clone-and-customize)
- Workflow instances (one per case, with snapshot of definition at instantiation)
- Step state + history per instance (status, attempts, timestamps, outputs)
- Per-step retry + timeout scheduling (BullMQ-backed)
- Reason code registry (cross-cutting; halt reasons, step failure reasons, decline reasons)
1.3 What it does not own¶
- Case data — clinical-api stays system of record for cases, findings, diagnoses
- AI inference — ai-review
- Reviewer queue + decision capture — human-review (designed separately)
- Consent records, audit, retention — consent service, clinical-api retention/audit modules
- Auth — auth service + auth-client package
1.4 Modes¶
Two modes coexist; mode is set per workflow definition (default) with optional per-case override at case creation.
- Active: orchestrator emits
*.requestedevents; modules consume; modules emit*.completed/*.failed; orchestrator advances state. - Client-driven: client calls module REST directly; modules still emit
*.completed; orchestrator silently observes and tracks state.
The dual-mode requirement forces module APIs to remain first-class and self-contained — the orchestrator is one consumer, never a gatekeeper. SDK power users, customers integrating their own workflow tools, and SA's internal QA tooling all retain low-level control without going through the orchestrator.
1.5 Tenancy¶
Org-scoped, same patterns as every other service. Cross-tenant isolation via existing tenancy interceptors in @sa-platform/auth-client.
1.6 Deployment¶
- Service path:
services/orchestrator/ - Port: 3006 (next free)
- NestJS, TypeScript, Prisma, MySQL, Redis (Streams + BullMQ) — same stack as ai-review, notifications.
- Standalone Prisma database (
orchestrator).
1.7 v1 explicit non-goals¶
- Non-clinical workflows (architecture supports; no step kinds or templates ship)
- Visual / BPMN authoring UI (admin REST only; UI is a separate frontend track)
- Step kinds:
parallel,wait_for_histology,delay,loop,manual_intervention,assign_reviewer - Workflow migration of in-flight cases on definition update (admin "re-snapshot to latest" is v2)
- Backfill of existing cases on orchestrator launch
- Managed event bus (SNS/SQS, EventBridge); Redis Streams in v1
- Cross-region or multi-cluster deploys
- BPMN import / interop
- Approval workflows for definition publishing (
orchestrator:publishis single-actor in v1) - Per-step authorization (any
orchestrator:intervenecan intervene on any step) - Workflow-level SLA alerting (basic metrics ship; alert rules deferred to ops)
2. Data model & DB schema¶
Own MySQL database (orchestrator), Prisma-managed. Eight tables.
2.1 WorkflowDefinition¶
Templates and per-org definitions, versioned and append-only.
id UUID PK
org_id UUID NULL -- NULL for SA-shipped templates
product_id UUID NULL -- NULL for org-wide or template
name VARCHAR
version INT -- monotonic per (org_id, product_id)
is_template BOOLEAN
status ENUM(draft, active, archived)
default_mode ENUM(active, client_driven)
definition JSON -- full DAG: steps, transitions, retry/timeout
workflow_timeout_seconds INT -- outer safety net (default 30 days)
created_by VARCHAR -- admin user id
created_at, updated_at, published_at
UNIQUE(org_id, product_id, version)
INDEX(org_id, product_id, status)
2.2 WorkflowDefinitionRevision¶
Change history for audit.
id UUID PK
workflow_definition_id UUID FK
changed_by VARCHAR
diff JSON -- before/after of definition payload
reason TEXT
created_at
2.3 WorkflowInstance¶
One row per case.
id UUID PK
case_id UUID UNIQUE -- 1:1 with clinical-api Case
org_id UUID
product_id UUID
definition_id UUID FK
definition_snapshot JSON -- frozen copy at instantiation
mode ENUM(active, client_driven)
status ENUM(running, halted, completed, cancelled)
halt_reason VARCHAR NULL -- references ReasonCode.code (scope=halt)
halt_step_id VARCHAR NULL -- step that triggered halt
started_at, deadline_at, completed_at
context JSON -- accumulated step outputs
INDEX(org_id, status, deadline_at)
INDEX(case_id)
2.4 WorkflowInstanceCurrentStep¶
Normalized "what's in flight" — one row per active step (multiple if v2 introduces parallel).
instance_id UUID FK
step_id VARCHAR
step_row_id UUID FK to WorkflowStep
PRIMARY KEY(instance_id, step_id)
INDEX(instance_id)
2.5 WorkflowStep¶
Per-step state per attempt; retries get separate rows so history is reconstructable. Latest row per (instance_id, step_id) is "current."
id UUID PK
instance_id UUID FK
step_id VARCHAR -- matches step id in definition
step_kind VARCHAR -- denormalized for query convenience
status ENUM(pending, in_progress, completed, failed, timed_out, skipped)
attempt INT -- 1-based
started_at, completed_at
timeout_at DATETIME NULL
input JSON -- params + relevant context slice when dispatched
output JSON NULL -- what the completion event reported
error JSON NULL -- failure detail
correlation_id VARCHAR -- ties to *.requested/*.completed event pair
INDEX(instance_id, step_id)
INDEX(correlation_id)
INDEX(timeout_at) WHERE status = 'in_progress'
2.6 WorkflowEvent¶
Raw inbound event log per instance. Lets us debug "we never advanced from step X" by checking whether the completion event arrived at all.
id UUID PK
instance_id UUID FK NULL -- null if event arrived without matching instance
event_type VARCHAR
envelope JSON
received_at
processed BOOLEAN
INDEX(instance_id, received_at)
2.7 WorkflowIntervention¶
Audit trail for admin actions on instances.
id UUID PK
instance_id UUID FK
action ENUM(retry_step, cancel, supersede, halt, resume)
performed_by VARCHAR
reason TEXT
before_state JSON -- snapshot of relevant fields
after_state JSON
created_at
INDEX(instance_id, created_at)
2.8 ReasonCode¶
Cross-cutting reason code registry. Per-tenant configurable with system defaults.
id UUID PK
org_id UUID NULL -- NULL = system default (overridable per org)
scope ENUM(halt, step_failure, human_decline)
code VARCHAR -- e.g. 'image_quality', 'consent_missing'
label VARCHAR
description TEXT NULL
requires_note BOOLEAN
applicable_step_kinds JSON NULL -- optional filter
active BOOLEAN
UNIQUE(org_id, scope, code)
INDEX(scope, active)
2.9 Key schema design choices¶
definition_snapshotis the source of truth for a running workflow, not a join toWorkflowDefinition. Definitions can be archived without breaking running cases.WorkflowSteprows accrue per attempt — retries get separate rows.contextis a single JSON blob on the instance, accumulated as steps complete. Nested by step id ({ ai_review: {confidence: 0.62, ...} }); conditions reference dotted paths.correlation_idmatches the event envelope's correlation id; same one that flows throughai_review.requested→ai_review.completedtoday.ReasonCodeis in the orchestrator (not a separate service) because the orchestrator already owns halt reasons and is the natural cross-cutting authority. Modules query via REST.
3. Workflow definition schema¶
3.1 Top-level shape¶
Stored as WorkflowDefinition.definition JSON.
{
"name": "AI + customer review with confidence escalation",
"description": "Default flow for derm clinics with sub-0.7 confidence escalation to clinician.",
"default_mode": "active",
"workflow_timeout_seconds": 2592000, // 30 days outer safety net
"start_step": "consent_gate",
"steps": {
"consent_gate": {
"kind": "gate_on_consent",
"params": { "types": ["ai_analysis"] },
"timeout_seconds": 60,
"max_retries": 0,
"transitions": { "on_pass": "image_check", "on_fail": "halt_consent" },
},
"image_check": {
"kind": "require_images",
"params": { "min_count": 3, "image_types": ["DERMOSCOPIC", "MACROSCOPIC"] },
"timeout_seconds": 60,
"max_retries": 0,
"transitions": { "on_pass": "ai_review", "on_fail": "halt_images" },
},
"ai_review": {
"kind": "run_ai_review",
"params": {},
"timeout_seconds": 900,
"max_retries": 3,
"retry_backoff": "exponential",
"transitions": { "on_complete": "branch_confidence", "on_failure": "halt_ai" },
},
"branch_confidence": {
"kind": "condition",
"expr": "ai_review.confidence < 0.7",
"transitions": { "on_true": "customer_review", "on_false": "emit_done" },
},
"customer_review": {
"kind": "request_human_review",
"params": { "tier": "customer_clinician" },
"timeout_seconds": 86400,
"transitions": { "on_complete": "emit_done" },
},
"emit_done": { "kind": "emit_final", "transitions": { "on_complete": "TERMINAL" } },
"halt_consent": { "kind": "halt", "params": { "reason_code": "consent_missing" } },
"halt_images": { "kind": "halt", "params": { "reason_code": "insufficient_images" } },
"halt_ai": { "kind": "halt", "params": { "reason_code": "ai_review_failed" } },
},
}
3.2 Transition model¶
Each step's transitions is a map of {outcome → next_step_id}. Outcomes are step-kind-specific. Reserved target id "TERMINAL" indicates successful workflow completion. Halt is its own step kind, not a transition target keyword — keeps reason codes explicit and queryable.
3.3 Step kind contracts¶
| Kind | Params | Outcomes | Notes |
|---|---|---|---|
gate_on_consent |
types: string[] |
on_pass, on_fail |
Queries consent service for org+patient combo |
require_images |
min_count: int, image_types?: string[] |
on_pass, on_fail |
Count check only; quality is ai-review's job |
run_ai_review |
(none) | on_complete, on_failure |
Failure ≠ low confidence; failure = AI couldn't run (image quality, derm error) |
request_human_review |
tier: 'customer_clinician' \| 'sa_qa' |
on_complete |
Decline by reviewer is internal re-queue, not a transition |
condition |
expr: string |
on_true, on_false |
Pure branch; no module call |
emit_webhook |
event_name: string, payload_template?: object |
on_complete |
Fire-and-forget; treated as instantly complete |
emit_final |
include_diagnoses?: bool (default true) |
on_complete (must target TERMINAL) |
Emits case.workflow.completed with full context |
set_case_status |
status: string |
on_complete |
Calls clinical-api to override case status |
halt |
reason_code: string, note?: string |
(terminal) | Reason code must exist in ReasonCode registry with scope=halt |
3.4 Condition expression DSL¶
Subset of JS, sandboxed evaluation. Surface deliberately small (~150 LOC parser/evaluator).
- Comparisons:
<,<=,==,!=,>=,>,in - Boolean:
&&,||,!, parens - Dotted path access into context:
ai_review.confidence - Literals: numbers, strings, booleans, arrays
- Helpers:
sample(p)(deterministic percase_id— same case + same definition always yields the same outcome, for reproducibility/audit),now(),len(x),coalesce(a, b) - Not allowed: function calls beyond the helper set, loops, assignments, arbitrary code
Example expressions:
ai_review.confidence < 0.7len(ai_review.diagnoses) > 1sample(0.1) && ai_review.confidence < 0.9consent_gate.granted && image_check.count >= 3
3.5 Validation rules at publish time¶
start_stepreferences an existing step.- Every step's
transitionstargets reference existing steps orTERMINAL. - No unreachable steps (DAG reachable from
start_step). - No cycles (workflows are DAGs in v1; loops are v2).
- Every condition expression parses successfully.
- Every
haltstep'sreason_codeexists inReasonCoderegistry with scope=halt. - Every
request_human_reviewreferences a valid tier. emit_finalexists and is reachable; at least one path leads toTERMINAL.- Per-step
timeout_seconds≤workflow_timeout_seconds.
3.6 Workflow context¶
Built up as steps complete. Each step's output is merged under its step id:
{
"case_id": "uuid",
"org_id": "uuid",
"product_id": "uuid",
"consent_gate": { "granted": true, "missing_types": [] },
"image_check": { "count": 3, "passed": true },
"ai_review": { "confidence": 0.62, "diagnoses": [...], "raw_response_id": "uuid" },
"branch_confidence": { "result": true },
"customer_review": { "decision": "override", "diagnoses": [...], "reviewer_id": "uuid" }
}
Module step outputs are validated against per-kind output schema before merging. Catches contract drift early.
4. Event contracts¶
4.1 Shared package¶
New @sa-platform/events package contains:
- Event envelope type + Zod validator
- Per-event-type payload schemas (one schema per event)
- Helper:
publishEvent(stream, envelope)andconsumeEvents(stream, handler)— thin wrappers around ioredis Streams - Migration target for ai-review's currently inline event types
4.2 Standard envelope¶
type EventEnvelope<T> = {
event_id: string; // UUID, unique per emit
event_type: string; // e.g. "ai_review.completed"
schema_version: string; // "v1"
occurred_at: string; // ISO8601
correlation_id: string; // ties request → completion → next request
causation_id?: string; // event_id of the event that caused this one
case_id: string;
org_id: string;
product_id: string;
payload: T;
};
correlation_id is the contract that links a step's *.requested to its *.completed/*.failed. The orchestrator generates it when emitting *.requested; modules echo it back in their completion event. Retry attempt N gets a new correlation_id; only the latest is "current" for the step.
4.3 Event catalogue (v1)¶
Workflow lifecycle (orchestrator emits):
workflow.started { instance_id, definition_id, mode }
workflow.halted { instance_id, halt_step_id, reason_code, reason_note? }
workflow.completed { instance_id, completed_at }
workflow.cancelled { instance_id, cancelled_by, reason? }
Consent gate (orchestrator emits → consent consumes):
consent.check.requested { types: string[], patient_id }
consent.check.completed { granted: bool, missing_types: string[] }
Image check (orchestrator emits → clinical-api consumes):
case.images_check.requested { min_count: int, image_types?: string[] }
case.images_check.completed { count, passed: bool, image_ids: uuid[] }
AI review (existing — extends current ai_review.* contracts):
ai_review.requested { images: [...], product_id } — unchanged shape
ai_review.completed { confidence, diagnoses, raw_response_id } — existing
ai_review.failed { reason_code, retryable: bool, error?: string } — extended
Human review (designed separately in human-review spec):
human_review.requested { tier: 'customer_clinician' | 'sa_qa', context_snapshot }
human_review.completed { decision: 'confirm' | 'override', diagnoses?, reviewer_id }
human_review.failed { reason_code, retryable: bool }
Set case status (orchestrator emits → clinical-api consumes):
case.set_status.requested { status: string }
case.set_status.completed { previous_status, new_status }
Webhook emit (orchestrator emits → events module consumes):
workflow.webhook_emit.requested { event_name, payload }
workflow.webhook_emit.completed { delivery_id }
Final emit (orchestrator emits → clinical-api consumes):
case.workflow.completed { final_diagnoses: [...], context_snapshot, definition_id }
Note: orchestrator passes the full context but does not make the clinical decision.
clinical-api's existing supersession logic ("most recent in source-rank order wins")
determines the final diagnosis.
4.4 Transport¶
Redis Streams (already in stack via ai-review's BullMQ + ioredis setup). Each event type is its own stream; consumers subscribe via consumer groups. Stream retention 7 days for replay. For compliance-grade audit replay we already have the audit-logging service.
4.5 Consumer/publisher matrix¶
| Service | Consumes | Publishes |
|---|---|---|
| orchestrator | case.created, *.completed, *.failed for every step kind |
workflow.*, *.requested for every step kind |
| ai-review | ai_review.requested |
ai_review.completed, ai_review.failed |
| human-review | human_review.requested |
human_review.completed, human_review.failed |
| consent | consent.check.requested |
consent.check.completed |
| clinical-api | case.images_check.requested, case.set_status.requested, case.workflow.completed |
case.images_check.completed, case.set_status.completed, case.created |
| events (existing) | workflow.webhook_emit.requested |
workflow.webhook_emit.completed, external HMAC-signed webhooks |
4.6 Idempotency¶
- Modules use
correlation_idas idempotency key when handling*.requested. Duplicate correlation_id is acked silently. Existing pattern in ai-review. - Orchestrator uses
event_idas dedup key on incoming*.completedevents.WorkflowEvent.processedflag prevents double state advancement.
4.7 Migration¶
ai_review.failedpayload extended withreason_code: string; olderrorfield kept for one release.clinical-apicontinues to emitai_review.requestedfrom its existingai-review-trigger.service.ts(preserves client-driven mode and orchestrator-bypassed flows). Theai_review.requestedconsumer in ai-review accepts either source.
5. REST API surface¶
Standard NestJS controllers, scope-guarded via @sa-platform/auth-client. All endpoints multi-tenant via existing org-context interceptor.
5.1 Auth scopes¶
Registered in auth service's scope registry:
orchestrator:read → list/get definitions, instances, templates, reason codes
orchestrator:read-instances → read workflow instances (granted to clinical roles too)
orchestrator:write → create/update draft definitions and reason codes
orchestrator:publish → publish/archive definitions
orchestrator:intervene → retry/cancel/supersede halted or in-flight instances
orchestrator:admin → cross-org access (SA panel + ops staff)
5.2 Workflow definitions¶
POST /workflow-definitions Create new draft (from scratch or template clone)
GET /workflow-definitions?org_id=&product_id=&status=
GET /workflow-definitions/:id
PATCH /workflow-definitions/:id Edit draft only
POST /workflow-definitions/:id/publish Validate + activate; previous active → archived
POST /workflow-definitions/:id/archive
POST /workflow-definitions/:id/clone Copy as new draft (commonly from templates)
GET /workflow-definitions/:id/revisions Change history
POST /workflow-definitions/:id/validate Dry-run validation
5.3 Workflow templates¶
GET /workflow-templates List SA-shipped starter flows
GET /workflow-templates/:id Full definition JSON
5.4 Workflow instances¶
GET /workflow-instances?case_id=&org_id=&status=&mode=
GET /workflow-instances/:id Full state: current step(s), context, definition snapshot ref
GET /workflow-instances/:id/steps Step history with attempts
GET /workflow-instances/:id/events Raw inbound event log
GET /workflow-instances/:id/next-step SDK advisory (see 5.6)
POST /workflow-instances/:id/retry-step Re-emit current step's *.requested
POST /workflow-instances/:id/cancel Terminate workflow with reason
POST /workflow-instances/:id/supersede Archive current; new instance against same case (latest active definition)
POST /workflow-instances/:id/halt Manual halt with reason code
POST /workflow-instances/:id/resume Resume halted instance from current step
5.5 Reason codes¶
GET /reason-codes?scope=&org_id=&active=
POST /reason-codes Create org-specific reason
PATCH /reason-codes/:id
DELETE /reason-codes/:id Soft-delete (active=false)
5.6 Health¶
GET /health Liveness
GET /health/ready Readiness incl. Redis Streams reachability
5.7 Mode signaling at case creation¶
No orchestrator endpoint for instance creation — always event-driven from case.created. Clinical-api stays the single front door for case creation.
clinical-api emits case.created with optional orchestration_mode field. Orchestrator listens, creates WorkflowInstance with mode resolved as:
final_mode = case_creation_request.orchestration_mode
?? workflow_definition.default_mode
?? 'active'
5.8 SDK advisory query¶
GET /workflow-instances/:id/next-step returns:
{
"instance_id": "uuid",
"case_id": "uuid",
"status": "running",
"current_step": {
"id": "ai_review",
"kind": "run_ai_review",
"params": {},
"input_context": {
/* relevant slice */
},
"expected_outputs": ["confidence", "diagnoses", "raw_response_id"],
},
"next_action": {
"type": "call_module",
"module": "ai-review",
"endpoint": "POST /reviews",
"payload_hint": {
/* prefilled from context */
},
"completion_event": "ai_review.completed",
},
"definition_summary": {
"name": "...",
"remaining_steps": ["ai_review", "branch_confidence", "customer_review", "emit_done"],
},
}
5.9 Intervention semantics¶
- retry-step: increments attempt counter, generates new
correlation_id, re-emits*.requested. Allowed when current step isfailed/timed_outor workflow ishaltedbecause of that step. Bounded by overall workflow deadline. - supersede: archives current instance (
status='cancelled', cancelled_reason='superseded_by:<new_id>'); creates new instance against same case using latest active definition. - cancel: terminal, no resume. Records cancelled_by + reason.
- halt (manual): force-halt running instance with reason code.
- resume: only on halted instances; re-emits current step's
*.requested.
All intervention actions are recorded in WorkflowIntervention.
5.10 Webhook emission for terminal state changes¶
cancel, supersede, halt, and natural workflow completion all emit corresponding workflow.* events to the events service for HMAC-signed external webhook delivery (existing webhook flow). Clients in client-driven mode can subscribe to these. Clients can also poll GET /workflow-instances/:id for state at any time.
6. Module-side changes¶
All changes are additive — no breaking changes to existing module APIs.
6.1 @sa-platform/events (new package)¶
- Event envelope type + Zod validator
- Per-event-type payload schemas
- Helpers:
publishEvent,consumeEvents - Migration target for ai-review's existing inline event types
6.2 clinical-api modifications¶
| Change | Why |
|---|---|
Add case.created publisher in cases.service.ts (on POST /cases success) |
Orchestrator listens to instantiate workflow |
Add orchestration_mode?: 'active' \| 'client_driven' field to CreateCaseDto |
Per-case mode override at creation |
Add case.images_check.requested consumer + case.images_check.completed publisher |
New step kind: require_images |
Add case.set_status.requested consumer + case.set_status.completed publisher |
New step kind: set_case_status |
Add case.workflow.completed consumer |
Orchestrator's final emit; clinical-api projects diagnoses (existing logic handles supersession) |
ai-review-trigger.service.ts keeps emitting ai_review.requested for backward compat |
No regression for existing flow / client-driven mode |
6.3 ai-review modifications¶
| Change | Why |
|---|---|
ai_review.failed payload extended with reason_code: string referencing ReasonCode registry; old error field kept one release |
Standardize failure taxonomy |
Migrate inline event types to @sa-platform/events |
Single source of truth |
Otherwise zero changes (existing ai_review.requested consumer accepts any emitter) |
6.4 consent modifications¶
| Change | Why |
|---|---|
Add consent.check.requested consumer + consent.check.completed publisher |
New step kind: gate_on_consent |
| REST endpoints unchanged — still callable directly in client-driven mode |
6.5 auth modifications¶
| Change | Why |
|---|---|
| Register orchestrator scopes (see 5.1) | Auth for orchestrator endpoints |
6.6 events / webhooks (in clinical-api)¶
| Change | Why |
|---|---|
Register new webhook event types: workflow.started, workflow.halted, workflow.completed, workflow.cancelled, workflow.intervened |
External clients subscribe |
6.7 audit (in clinical-api)¶
| Change | Why |
|---|---|
Add workflow.* events to audit-recorded event types |
Compliance — workflow lifecycle alongside data-access logs |
6.8 human-review (designed separately)¶
| Change | Why |
|---|---|
human_review.requested consumer; human_review.completed + human_review.failed publishers from day one |
Native orchestrator integration |
6.9 No changes needed in¶
notifications,user-management— don't participate in v1 clinical workflow
6.10 Migration approach¶
- Each module's changes ship as a separate task in the orchestrator implementation plan, with mocked event-bus tests.
- Backwards-compat:
ai_review.requestedkeeps both old and new shapes for one release; old shape removed in a follow-up. - No backfill of in-flight cases at orchestrator launch. Orchestrator only picks up
case.createdevents from launch onward; existing in-flight cases finish on the legacy clinical-api-driven flow.
7. v1 workflow templates¶
Six canonical starter flows shipped as WorkflowDefinition seed rows (is_template: true, org_id: NULL). Orgs clone via POST /workflow-definitions/:id/clone.
| Template | Steps | Use case |
|---|---|---|
ai_only |
consent_gate → require_images → run_ai_review → emit_final | AI triage only; no human review |
ai_plus_clinician |
consent_gate → require_images → run_ai_review → customer_review → emit_final | Standard regulated flow; every case gets clinician review |
ai_with_confidence_escalation |
consent_gate → require_images → run_ai_review → branch(ai.confidence < 0.7) → customer_review → emit_final |
Cost-efficient; clinician sees only low-confidence cases |
ai_plus_clinician_plus_sa_qa_sample |
…customer_review → branch(sample(0.1)) → sa_qa → emit_final |
Full triple-tier with 10% SA audit |
clinician_only |
consent_gate → require_images → customer_review → emit_final | Pure tele-derm, AI bypassed |
ai_escalation_plus_sa_sample |
consent_gate → require_images → run_ai_review → branch(confidence) → customer_review → branch(sample) → sa_qa → emit_final | Full compliance flow with both gates |
Onboarding flow: customer-success runs clone against the most-fitting template, customizes (min image count, confidence threshold, sample rate, timeouts), publishes.
8. Scheduling, retries, observability¶
8.1 Scheduling¶
BullMQ (already in stack) for delayed jobs.
- Per-step timeouts: when a step's
*.requestedis emitted, a BullMQ job is enqueued with delay =timeout_seconds. If the job fires before*.completedarrives, the step is markedtimed_outand workflow halts. - Per-workflow deadline: similar BullMQ job at
deadline_at; halts workflow. - Retry backoff: implemented via BullMQ's delayed-retry with configured
retry_backoff(exponential,linear,fixed).
8.2 Retry boundary¶
Orchestrator retries are workflow-step retries, not module-internal retries. ai-review still does its own DERM retries internally; orchestrator only retries when ai_review.failed arrives with retryable: true. Modules signal retryability in the failure event.
8.3 Observability¶
Structured logs via NestJS Logger; correlation_id in every log line.
Metrics (Prometheus-shaped counters/histograms):
workflow_instance_started_total{org,product,definition,mode}workflow_instance_completed_total{org,terminal_state}workflow_step_duration_seconds{kind}histogramworkflow_halt_total{reason_code}workflow_intervention_total{action}
Health endpoints: /health (liveness), /health/ready (incl. Redis Streams reachability)
Audit trail: existing audit service consumes workflow.* events plus interventions.
9. Testing strategy¶
9.1 Unit tests (~50)¶
- Workflow definition validator (all rules in 3.5)
- Expression evaluator (sandboxed exec safety, edge cases)
- State-machine transitions
- Retry counter logic
- ReasonCode registry (per-org override resolution)
- Idempotency dedup
- Snapshot freezing on instantiation
9.2 Integration tests (~10)¶
Real MySQL + Redis via testcontainers. Module event responses mocked with msw-style stubs (existing pattern from ai-review).
- Happy path through
ai_with_confidence_escalationtemplate - Step timeout halts the workflow
- Retry succeeds on second attempt
- Definition publish + clone + activation
- Client-driven mode: orchestrator observes external
ai_review.completedand advances state without emittingai_review.requested - Intervention: retry-step on halted workflow
- Intervention: supersede creates new instance against latest definition
- Definition archival doesn't break running case (snapshot still works)
- Reason-code registry per-org override
- Idempotency: duplicate
*.completeddoesn't double-advance
9.3 No live external services in CI¶
msw-style mocks for module event responses; testcontainers for MySQL + Redis.
10. Out-of-scope / deferred¶
Captured here to prevent scope creep:
- Non-clinical workflows (architecture supports; no step kinds or templates ship)
- Visual workflow editor UI (admin REST only; UI is a separate frontend track)
- Step kinds:
parallel,wait_for_histology,delay,loop,manual_intervention,assign_reviewer - Workflow migration of in-flight cases on definition update (admin re-snapshot is v2)
- Backfill of existing cases on orchestrator launch
- Managed event bus (SNS/SQS, EventBridge) — Redis Streams in v1
- Cross-region or multi-cluster orchestrator deploys
- BPMN import / interop
- Approval workflows for definition publishing — single-actor in v1
- Per-step authorization — any
orchestrator:intervenecan intervene on any step - Workflow-level SLA monitoring/alerting — basic metrics ship; alert rules are an ops concern
11. Open questions for implementation plan¶
- Exact MySQL JSON column performance at scale —
WorkflowInstance.contextmay grow large for long workflows; consider TEXT + manual JSON parse vs JSON column. Implementation plan should benchmark. - BullMQ timeout-job cleanup on step completion — need explicit job removal when step finishes early, or accept harmless no-op fires.
- Definition validator implementation: hand-rolled vs Zod vs JSON Schema (ajv). Lean Zod for consistency with rest of codebase.
- Expression evaluator library choice:
expr-evalvs custom recursive-descent. Custom is ~150 LOC and removes a dep; lean custom.