Orchestrator Service — Design Spec

Date: 2026-04-26
Status: Draft for review
Author: Jim Holmes (with brainstorming session)
Tracking issue / PR: TBD


1. Service overview & boundaries

1.1 Purpose

The orchestrator is a cross-module workflow engine for clinical cases. It owns case-flow definitions and runtime state. It drives modules in active mode and observes them in client-driven mode. The architecture is generic; v1 ships only clinical step kinds and clinical workflow templates, but the foundation supports non-clinical workflows (onboarding, billing, prescription approval) without redesign.

The motivation: clinical-api today does two different jobs. It owns case data (system of record) and it drives workflow (control plane). Separating them lets each evolve at its own cadence. Workflow rules churn rapidly as orgs come online with new contracts; case-data shape evolves slowly. Mixing them puts workflow churn in the same service that owns 25 Prisma tables and 60 endpoints, which makes that service fragile.

1.2 What the orchestrator owns

  • Workflow definitions (per-org/per-product, versioned, immutable once published)
  • Workflow templates (SA-shipped starter flows; orgs clone-and-customize)
  • Workflow instances (one per case, with snapshot of definition at instantiation)
  • Step state + history per instance (status, attempts, timestamps, outputs)
  • Per-step retry + timeout scheduling (BullMQ-backed)
  • Reason code registry (cross-cutting; halt reasons, step failure reasons, decline reasons)

1.3 What it does not own

  • Case data — clinical-api stays system of record for cases, findings, diagnoses
  • AI inference — ai-review
  • Reviewer queue + decision capture — human-review (designed separately)
  • Consent records, audit, retention — consent service, clinical-api retention/audit modules
  • Auth — auth service + auth-client package

1.4 Modes

Two modes coexist; mode is set per workflow definition (default) with optional per-case override at case creation.

  • Active: orchestrator emits *.requested events; modules consume; modules emit *.completed/*.failed; orchestrator advances state.
  • Client-driven: client calls module REST directly; modules still emit *.completed; orchestrator silently observes and tracks state.

The dual-mode requirement forces module APIs to remain first-class and self-contained — the orchestrator is one consumer, never a gatekeeper. SDK power users, customers integrating their own workflow tools, and SA's internal QA tooling all retain low-level control without going through the orchestrator.

1.5 Tenancy

Org-scoped, same patterns as every other service. Cross-tenant isolation via existing tenancy interceptors in @sa-platform/auth-client.

1.6 Deployment

  • Service path: services/orchestrator/
  • Port: 3006 (next free)
  • NestJS, TypeScript, Prisma, MySQL, Redis (Streams + BullMQ) — same stack as ai-review, notifications.
  • Standalone Prisma database (orchestrator).

1.7 v1 explicit non-goals

  • Non-clinical workflows (architecture supports; no step kinds or templates ship)
  • Visual / BPMN authoring UI (admin REST only; UI is a separate frontend track)
  • Step kinds: parallel, wait_for_histology, delay, loop, manual_intervention, assign_reviewer
  • Workflow migration of in-flight cases on definition update (admin "re-snapshot to latest" is v2)
  • Backfill of existing cases on orchestrator launch
  • Managed event bus (SNS/SQS, EventBridge); Redis Streams in v1
  • Cross-region or multi-cluster deploys
  • BPMN import / interop
  • Approval workflows for definition publishing (orchestrator:publish is single-actor in v1)
  • Per-step authorization (any orchestrator:intervene can intervene on any step)
  • Workflow-level SLA alerting (basic metrics ship; alert rules deferred to ops)

2. Data model & DB schema

Own MySQL database (orchestrator), Prisma-managed. Eight tables.

2.1 WorkflowDefinition

Templates and per-org definitions, versioned and append-only.

id                          UUID PK
org_id                      UUID NULL              -- NULL for SA-shipped templates
product_id                  UUID NULL              -- NULL for org-wide or template
name                        VARCHAR
version                     INT                    -- monotonic per (org_id, product_id)
is_template                 BOOLEAN
status                      ENUM(draft, active, archived)
default_mode                ENUM(active, client_driven)
definition                  JSON                   -- full DAG: steps, transitions, retry/timeout
workflow_timeout_seconds    INT                    -- outer safety net (default 30 days)
created_by                  VARCHAR                -- admin user id
created_at, updated_at, published_at
UNIQUE(org_id, product_id, version)
INDEX(org_id, product_id, status)

2.2 WorkflowDefinitionRevision

Change history for audit.

id                          UUID PK
workflow_definition_id      UUID FK
changed_by                  VARCHAR
diff                        JSON                   -- before/after of definition payload
reason                      TEXT
created_at

2.3 WorkflowInstance

One row per case.

id                          UUID PK
case_id                     UUID UNIQUE            -- 1:1 with clinical-api Case
org_id                      UUID
product_id                  UUID
definition_id               UUID FK
definition_snapshot         JSON                   -- frozen copy at instantiation
mode                        ENUM(active, client_driven)
status                      ENUM(running, halted, completed, cancelled)
halt_reason                 VARCHAR NULL           -- references ReasonCode.code (scope=halt)
halt_step_id                VARCHAR NULL           -- step that triggered halt
started_at, deadline_at, completed_at
context                     JSON                   -- accumulated step outputs
INDEX(org_id, status, deadline_at)
INDEX(case_id)

2.4 WorkflowInstanceCurrentStep

Normalized "what's in flight" — one row per active step (multiple if v2 introduces parallel).

instance_id                 UUID FK
step_id                     VARCHAR
step_row_id                 UUID FK to WorkflowStep
PRIMARY KEY(instance_id, step_id)
INDEX(instance_id)

2.5 WorkflowStep

Per-step state per attempt; retries get separate rows so history is reconstructable. Latest row per (instance_id, step_id) is "current."

id                          UUID PK
instance_id                 UUID FK
step_id                     VARCHAR                -- matches step id in definition
step_kind                   VARCHAR                -- denormalized for query convenience
status                      ENUM(pending, in_progress, completed, failed, timed_out, skipped)
attempt                     INT                    -- 1-based
started_at, completed_at
timeout_at                  DATETIME NULL
input                       JSON                   -- params + relevant context slice when dispatched
output                      JSON NULL              -- what the completion event reported
error                       JSON NULL              -- failure detail
correlation_id              VARCHAR                -- ties to *.requested/*.completed event pair
INDEX(instance_id, step_id)
INDEX(correlation_id)
INDEX(status, timeout_at)                          -- MySQL has no partial indexes; composite serves the in_progress timeout scan

2.6 WorkflowEvent

Raw inbound event log per instance. Lets us debug "we never advanced from step X" by checking whether the completion event arrived at all.

id                          UUID PK
instance_id                 UUID FK NULL           -- null if event arrived without matching instance
event_type                  VARCHAR
envelope                    JSON
received_at
processed                   BOOLEAN
INDEX(instance_id, received_at)

2.7 WorkflowIntervention

Audit trail for admin actions on instances.

id                          UUID PK
instance_id                 UUID FK
action                      ENUM(retry_step, cancel, supersede, halt, resume)
performed_by                VARCHAR
reason                      TEXT
before_state                JSON                   -- snapshot of relevant fields
after_state                 JSON
created_at
INDEX(instance_id, created_at)

2.8 ReasonCode

Cross-cutting reason code registry. Per-tenant configurable with system defaults.

id                          UUID PK
org_id                      UUID NULL              -- NULL = system default (overridable per org)
scope                       ENUM(halt, step_failure, human_decline)
code                        VARCHAR                -- e.g. 'image_quality', 'consent_missing'
label                       VARCHAR
description                 TEXT NULL
requires_note               BOOLEAN
applicable_step_kinds       JSON NULL              -- optional filter
active                      BOOLEAN
UNIQUE(org_id, scope, code)
INDEX(scope, active)
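
The per-org override described for ReasonCode can be sketched as a lookup that prefers an org-specific row and falls back to the system default. This is a minimal in-memory illustration, not the service's implementation: `resolveReasonCode` and the `registry` sample rows are hypothetical, and the rule that an inactive org row hides the system default is an assumption the spec does not state.

```typescript
type ReasonScope = "halt" | "step_failure" | "human_decline";

// Subset of the ReasonCode columns needed for resolution.
interface ReasonCode {
  org_id: string | null; // null = system default
  scope: ReasonScope;
  code: string;
  label: string;
  active: boolean;
}

// Prefer the org-specific row; otherwise fall back to the system default.
// Assumption: an org row fully replaces the default, even when it is inactive.
function resolveReasonCode(
  rows: ReasonCode[],
  orgId: string,
  scope: ReasonScope,
  code: string,
): ReasonCode | undefined {
  const matches = rows.filter((r) => r.scope === scope && r.code === code);
  const chosen =
    matches.find((r) => r.org_id === orgId) ??
    matches.find((r) => r.org_id === null);
  return chosen !== undefined && chosen.active ? chosen : undefined;
}

// Hypothetical sample data: one system default and one org override.
const registry: ReasonCode[] = [
  { org_id: null, scope: "halt", code: "consent_missing", label: "Consent missing", active: true },
  { org_id: "org-a", scope: "halt", code: "consent_missing", label: "No AI consent on file", active: true },
];
```

With this data, `org-a` resolves to its own label while any other org receives the system default.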

2.9 Key schema design choices

  1. definition_snapshot is the source of truth for a running workflow, not a join to WorkflowDefinition. Definitions can be archived without breaking running cases.
  2. WorkflowStep rows accrue per attempt — retries get separate rows.
  3. context is a single JSON blob on the instance, accumulated as steps complete. Nested by step id ({ ai_review: {confidence: 0.62, ...} }); conditions reference dotted paths.
  4. correlation_id matches the event envelope's correlation id; it is the same id that flows through ai_review.requested → ai_review.completed today.
  5. ReasonCode is in the orchestrator (not a separate service) because the orchestrator already owns halt reasons and is the natural cross-cutting authority. Modules query via REST.

3. Workflow definition schema

3.1 Top-level shape

Stored as WorkflowDefinition.definition JSON.

{
  "name": "AI + customer review with confidence escalation",
  "description": "Default flow for derm clinics with sub-0.7 confidence escalation to clinician.",
  "default_mode": "active",
  "workflow_timeout_seconds": 2592000, // 30 days outer safety net
  "start_step": "consent_gate",
  "steps": {
    "consent_gate": {
      "kind": "gate_on_consent",
      "params": { "types": ["ai_analysis"] },
      "timeout_seconds": 60,
      "max_retries": 0,
      "transitions": { "on_pass": "image_check", "on_fail": "halt_consent" },
    },
    "image_check": {
      "kind": "require_images",
      "params": { "min_count": 3, "image_types": ["DERMOSCOPIC", "MACROSCOPIC"] },
      "timeout_seconds": 60,
      "max_retries": 0,
      "transitions": { "on_pass": "ai_review", "on_fail": "halt_images" },
    },
    "ai_review": {
      "kind": "run_ai_review",
      "params": {},
      "timeout_seconds": 900,
      "max_retries": 3,
      "retry_backoff": "exponential",
      "transitions": { "on_complete": "branch_confidence", "on_failure": "halt_ai" },
    },
    "branch_confidence": {
      "kind": "condition",
      "expr": "ai_review.confidence < 0.7",
      "transitions": { "on_true": "customer_review", "on_false": "emit_done" },
    },
    "customer_review": {
      "kind": "request_human_review",
      "params": { "tier": "customer_clinician" },
      "timeout_seconds": 86400,
      "transitions": { "on_complete": "emit_done" },
    },
    "emit_done": { "kind": "emit_final", "transitions": { "on_complete": "TERMINAL" } },
    "halt_consent": { "kind": "halt", "params": { "reason_code": "consent_missing" } },
    "halt_images": { "kind": "halt", "params": { "reason_code": "insufficient_images" } },
    "halt_ai": { "kind": "halt", "params": { "reason_code": "ai_review_failed" } },
  },
}

3.2 Transition model

Each step's transitions is a map of {outcome → next_step_id}. Outcomes are step-kind-specific. Reserved target id "TERMINAL" indicates successful workflow completion. Halt is its own step kind, not a transition target keyword — keeps reason codes explicit and queryable.
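
As an illustration of the transition model, the sketch below resolves the next step from a step id and an outcome. The types and the `resolveNext` function are hypothetical names for this example; the treatment of a missing outcome as an error is an assumption, since publish-time validation would normally rule it out.

```typescript
interface StepDef {
  kind: string;
  transitions?: Record<string, string>; // outcome -> next step id
}

interface Definition {
  start_step: string;
  steps: Record<string, StepDef>;
}

const TERMINAL = "TERMINAL";

type Next =
  | { type: "step"; stepId: string }
  | { type: "terminal" }
  | { type: "halted" };

// Look up where the workflow goes after `stepId` finished with `outcome`.
function resolveNext(def: Definition, stepId: string, outcome: string): Next {
  const step = def.steps[stepId];
  if (!step) throw new Error(`unknown step: ${stepId}`);
  // halt is a step kind, not a transition target, so it has no outgoing edge.
  if (step.kind === "halt") return { type: "halted" };
  const target = step.transitions ? step.transitions[outcome] : undefined;
  if (target === undefined) {
    throw new Error(`step ${stepId} has no transition for outcome ${outcome}`);
  }
  return target === TERMINAL ? { type: "terminal" } : { type: "step", stepId: target };
}

// Trimmed-down flow in the shape of section 3.1.
const flow: Definition = {
  start_step: "ai_review",
  steps: {
    ai_review: { kind: "run_ai_review", transitions: { on_complete: "emit_done", on_failure: "halt_ai" } },
    emit_done: { kind: "emit_final", transitions: { on_complete: "TERMINAL" } },
    halt_ai: { kind: "halt" },
  },
};
```

Here `resolveNext(flow, "ai_review", "on_failure")` points at the `halt_ai` step, and reaching `halt_ai` itself yields the halted result rather than another step.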

3.3 Step kind contracts

| Kind | Params | Outcomes | Notes |
| --- | --- | --- | --- |
| gate_on_consent | types: string[] | on_pass, on_fail | Queries consent service for org+patient combo |
| require_images | min_count: int, image_types?: string[] | on_pass, on_fail | Count check only; quality is ai-review's job |
| run_ai_review | (none) | on_complete, on_failure | Failure ≠ low confidence; failure = AI couldn't run (image quality, derm error) |
| request_human_review | tier: 'customer_clinician' \| 'sa_qa' | on_complete | Decline by reviewer is internal re-queue, not a transition |
| condition | expr: string | on_true, on_false | Pure branch; no module call |
| emit_webhook | event_name: string, payload_template?: object | on_complete | Fire-and-forget; treated as instantly complete |
| emit_final | include_diagnoses?: bool (default true) | on_complete (must target TERMINAL) | Emits case.workflow.completed with full context |
| set_case_status | status: string | on_complete | Calls clinical-api to override case status |
| halt | reason_code: string, note?: string | (terminal) | Reason code must exist in ReasonCode registry with scope=halt |

3.4 Condition expression DSL

Subset of JS, sandboxed evaluation. Surface deliberately small (~150 LOC parser/evaluator).

  • Comparisons: <, <=, ==, !=, >=, >, in
  • Boolean: &&, ||, !, parens
  • Dotted path access into context: ai_review.confidence
  • Literals: numbers, strings, booleans, arrays
  • Helpers: sample(p) (deterministic per case_id — same case + same definition always yields the same outcome, for reproducibility/audit), now(), len(x), coalesce(a, b)
  • Not allowed: function calls beyond the helper set, loops, assignments, arbitrary code

Example expressions:

  • ai_review.confidence < 0.7
  • len(ai_review.diagnoses) > 1
  • sample(0.1) && ai_review.confidence < 0.9
  • consent_gate.granted && image_check.count >= 3
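
One way the deterministic `sample(p)` helper could work is to hash a stable seed into the unit interval and compare it with `p`. The sketch below uses a 32-bit FNV-1a hash purely as an example; the hash choice, the `SampleSeed` shape, and the inclusion of the step id in the seed (so that two sample branches in one workflow decide independently) are all assumptions, not part of the spec.

```typescript
// 32-bit FNV-1a over UTF-16 code units. Deterministic, not cryptographic.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Hypothetical seed: the spec only requires "same case + same definition".
interface SampleSeed {
  caseId: string;
  definitionId: string;
  stepId: string;
}

// Returns true for roughly a fraction `p` of seeds, and always the same
// answer for the same seed.
function sample(p: number, seed: SampleSeed): boolean {
  const key = `${seed.caseId}:${seed.definitionId}:${seed.stepId}`;
  const unit = fnv1a32(key) / 0x100000000; // value in [0, 1)
  return unit < p;
}
```

Because no random source or clock is involved, re-evaluating the same condition during an audit replay gives the same branch.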

3.5 Validation rules at publish time

  1. start_step references an existing step.
  2. Every step's transitions targets reference existing steps or TERMINAL.
  3. No unreachable steps (DAG reachable from start_step).
  4. No cycles (workflows are DAGs in v1; loops are v2).
  5. Every condition expression parses successfully.
  6. Every halt step's reason_code exists in ReasonCode registry with scope=halt.
  7. Every request_human_review references a valid tier.
  8. emit_final exists and is reachable; at least one path leads to TERMINAL.
  9. Per-step timeout_seconds ≤ workflow_timeout_seconds.
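
The structural rules (1 through 4, plus the TERMINAL part of rule 8) lend themselves to a single graph walk. The following is a rough sketch under the assumption that a definition has the shape shown in 3.1; `validateGraph` and its error strings are hypothetical, and the remaining rules (expressions, reason codes, tiers, timeouts) are left out.

```typescript
interface StepDef {
  kind: string;
  transitions?: Record<string, string>;
}

interface Definition {
  start_step: string;
  steps: Record<string, StepDef>;
}

// Returns a list of human-readable problems; empty means the graph is valid.
function validateGraph(def: Definition): string[] {
  const errors: string[] = [];
  const ids = Object.keys(def.steps);
  const known = new Set(ids);

  // Rule 1: start_step must exist.
  if (!known.has(def.start_step)) errors.push(`start_step "${def.start_step}" does not exist`);

  // Rule 2: every transition target is a known step or TERMINAL.
  ids.forEach((id) => {
    const transitions = def.steps[id].transitions ?? {};
    Object.keys(transitions).forEach((outcome) => {
      const target = transitions[outcome];
      if (target !== "TERMINAL" && !known.has(target)) {
        errors.push(`${id}.${outcome} targets unknown step "${target}"`);
      }
    });
  });

  // Rules 3 and 4: depth-first walk from start_step. A step met again while
  // still "visiting" closes a cycle; a step never visited is unreachable.
  const state = new Map<string, "visiting" | "done">();
  const flags = { reachesTerminal: false };
  const visit = (id: string): void => {
    const seen = state.get(id);
    if (seen === "done") return;
    if (seen === "visiting") {
      errors.push(`cycle through "${id}"`);
      return;
    }
    state.set(id, "visiting");
    const transitions = def.steps[id].transitions ?? {};
    Object.keys(transitions).forEach((outcome) => {
      const target = transitions[outcome];
      if (target === "TERMINAL") flags.reachesTerminal = true;
      else if (known.has(target)) visit(target);
    });
    state.set(id, "done");
  };
  if (known.has(def.start_step)) visit(def.start_step);

  ids.forEach((id) => {
    if (!state.has(id)) errors.push(`step "${id}" is unreachable`);
  });
  if (!flags.reachesTerminal) errors.push("no path leads to TERMINAL");
  return errors;
}
```

Returning every problem at once, rather than throwing on the first, suits the dry-run validate endpoint, where an admin wants the full list in one call.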

3.6 Workflow context

Built up as steps complete. Each step's output is merged under its step id:

{
  "case_id": "uuid",
  "org_id": "uuid",
  "product_id": "uuid",
  "consent_gate": { "granted": true, "missing_types": [] },
  "image_check": { "count": 3, "passed": true },
  "ai_review": { "confidence": 0.62, "diagnoses": [...], "raw_response_id": "uuid" },
  "branch_confidence": { "result": true },
  "customer_review": { "decision": "override", "diagnoses": [...], "reviewer_id": "uuid" }
}

Module step outputs are validated against per-kind output schema before merging. Catches contract drift early.
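
To make the merge-and-validate step concrete, here is a small sketch that checks a step's output, nests it under the step id, and resolves a dotted path the way a condition expression would. The hand-written `outputValidators` map stands in for the per-kind output schemas (which may well be Zod schemas in practice); it and the function names are assumptions for illustration only.

```typescript
type Context = Record<string, unknown>;
type OutputValidator = (output: unknown) => string[]; // returns validation errors

// Hypothetical per-kind validators; only one kind is shown.
const outputValidators: Record<string, OutputValidator> = {
  run_ai_review: (o) => {
    const out = o as { confidence?: unknown } | null | undefined;
    return out && typeof out.confidence === "number" ? [] : ["confidence must be a number"];
  },
};

// Validate the output for its step kind, then nest it under the step id.
// Returns a new context object and leaves the input untouched.
function mergeStepOutput(ctx: Context, stepId: string, kind: string, output: unknown): Context {
  const validate = outputValidators[kind];
  const errors = validate ? validate(output) : [];
  if (errors.length > 0) {
    throw new Error(`invalid ${kind} output: ${errors.join("; ")}`);
  }
  return { ...ctx, [stepId]: output };
}

// Resolve a dotted path such as "ai_review.confidence"; undefined if any
// segment is missing.
function getPath(ctx: Context, path: string): unknown {
  return path.split(".").reduce<unknown>(
    (node, key) =>
      node !== null && typeof node === "object"
        ? (node as Record<string, unknown>)[key]
        : undefined,
    ctx,
  );
}
```

Rejecting a malformed output before the merge keeps a bad payload from silently turning `ai_review.confidence < 0.7` into a comparison against `undefined`.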


4. Event contracts

4.1 Shared package

New @sa-platform/events package contains:

  • Event envelope type + Zod validator
  • Per-event-type payload schemas (one schema per event)
  • Helper: publishEvent(stream, envelope) and consumeEvents(stream, handler) — thin wrappers around ioredis Streams
  • Migration target for ai-review's currently inline event types

4.2 Standard envelope

type EventEnvelope<T> = {
  event_id: string; // UUID, unique per emit
  event_type: string; // e.g. "ai_review.completed"
  schema_version: string; // "v1"
  occurred_at: string; // ISO8601
  correlation_id: string; // ties request → completion → next request
  causation_id?: string; // event_id of the event that caused this one
  case_id: string;
  org_id: string;
  product_id: string;
  payload: T;
};

correlation_id is the contract that links a step's *.requested to its *.completed/*.failed. The orchestrator generates it when emitting *.requested; modules echo it back in their completion event. Retry attempt N gets a new correlation_id; only the latest is "current" for the step.
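
The correlation rule can be illustrated with a small envelope builder: every emitted `*.requested` gets a fresh correlation_id, and a completion counts only if it echoes the id of the latest attempt. `buildRequested`, `isCurrentCompletion`, and the injected `newId` generator are hypothetical helpers for this sketch; in a real service `newId` would presumably be a UUID generator.

```typescript
type EventEnvelope<T> = {
  event_id: string;
  event_type: string;
  schema_version: string;
  occurred_at: string;
  correlation_id: string;
  causation_id?: string;
  case_id: string;
  org_id: string;
  product_id: string;
  payload: T;
};

interface CaseRef {
  case_id: string;
  org_id: string;
  product_id: string;
}

// Build a *.requested envelope with a fresh correlation_id per attempt.
function buildRequested<T>(
  eventType: string,
  ref: CaseRef,
  payload: T,
  newId: () => string, // assumed id source, injected so the sketch stays testable
  causedBy?: EventEnvelope<unknown>,
): EventEnvelope<T> {
  const envelope: EventEnvelope<T> = {
    event_id: newId(),
    event_type: eventType,
    schema_version: "v1",
    occurred_at: new Date().toISOString(),
    correlation_id: newId(),
    case_id: ref.case_id,
    org_id: ref.org_id,
    product_id: ref.product_id,
    payload,
  };
  if (causedBy) envelope.causation_id = causedBy.event_id;
  return envelope;
}

// A completion is "current" only if it echoes the latest attempt's id.
function isCurrentCompletion(
  completion: EventEnvelope<unknown>,
  currentCorrelationId: string,
): boolean {
  return completion.correlation_id === currentCorrelationId;
}
```

A late completion from attempt 1 therefore fails the check once attempt 2 has been dispatched, which is what keeps a stale result from advancing the step.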

4.3 Event catalogue (v1)

Workflow lifecycle (orchestrator emits):
  workflow.started        { instance_id, definition_id, mode }
  workflow.halted         { instance_id, halt_step_id, reason_code, reason_note? }
  workflow.completed      { instance_id, completed_at }
  workflow.cancelled      { instance_id, cancelled_by, reason? }

Consent gate (orchestrator emits → consent consumes):
  consent.check.requested { types: string[], patient_id }
  consent.check.completed { granted: bool, missing_types: string[] }

Image check (orchestrator emits → clinical-api consumes):
  case.images_check.requested { min_count: int, image_types?: string[] }
  case.images_check.completed { count, passed: bool, image_ids: uuid[] }

AI review (existing — extends current ai_review.* contracts):
  ai_review.requested     { images: [...], product_id }       — unchanged shape
  ai_review.completed     { confidence, diagnoses, raw_response_id }   — existing
  ai_review.failed        { reason_code, retryable: bool, error?: string }   — extended

Human review (designed separately in human-review spec):
  human_review.requested  { tier: 'customer_clinician' | 'sa_qa', context_snapshot }
  human_review.completed  { decision: 'confirm' | 'override', diagnoses?, reviewer_id }
  human_review.failed     { reason_code, retryable: bool }

Set case status (orchestrator emits → clinical-api consumes):
  case.set_status.requested  { status: string }
  case.set_status.completed  { previous_status, new_status }

Webhook emit (orchestrator emits → events module consumes):
  workflow.webhook_emit.requested  { event_name, payload }
  workflow.webhook_emit.completed  { delivery_id }

Final emit (orchestrator emits → clinical-api consumes):
  case.workflow.completed  { final_diagnoses: [...], context_snapshot, definition_id }

  Note: orchestrator passes the full context but does not make the clinical decision.
  clinical-api's existing supersession logic ("most recent in source-rank order wins")
  determines the final diagnosis.

4.4 Transport

Redis Streams (already in stack via ai-review's BullMQ + ioredis setup). Each event type is its own stream; consumers subscribe via consumer groups. Stream retention 7 days for replay. For compliance-grade audit replay we already have the audit-logging service.

4.5 Consumer/publisher matrix

| Service | Consumes | Publishes |
| --- | --- | --- |
| orchestrator | case.created, *.completed, *.failed for every step kind | workflow.*, *.requested for every step kind |
| ai-review | ai_review.requested | ai_review.completed, ai_review.failed |
| human-review | human_review.requested | human_review.completed, human_review.failed |
| consent | consent.check.requested | consent.check.completed |
| clinical-api | case.images_check.requested, case.set_status.requested, case.workflow.completed | case.images_check.completed, case.set_status.completed, case.created |
| events (existing) | workflow.webhook_emit.requested | workflow.webhook_emit.completed, external HMAC-signed webhooks |

4.6 Idempotency

  • Modules use correlation_id as idempotency key when handling *.requested. Duplicate correlation_id is acked silently. Existing pattern in ai-review.
  • Orchestrator uses event_id as dedup key on incoming *.completed events. WorkflowEvent.processed flag prevents double state advancement.
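
Both idempotency rules reduce to "act only the first time a key is seen". The sketch below shows that idea with an in-memory set; a real service would need a durable store (for example a unique key in the database), so the `SeenKeys` class and the two handler functions are illustrative assumptions only.

```typescript
interface InboundEvent {
  event_id: string;
  event_type: string;
  correlation_id: string;
}

// In-memory stand-in for a durable dedup store.
class SeenKeys {
  private readonly seen = new Set<string>();

  // Returns true the first time a key is offered, false on every repeat.
  firstTime(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

// Module side: a *.requested is handled once per correlation_id.
function shouldHandleRequest(store: SeenKeys, event: InboundEvent): boolean {
  return store.firstTime(event.correlation_id);
}

// Orchestrator side: a *.completed advances state once per event_id.
function shouldAdvance(store: SeenKeys, event: InboundEvent): boolean {
  return store.firstTime(event.event_id);
}
```

A redelivered event is still acknowledged to the stream, but the second call returns false and the state machine does not move again.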

4.7 Migration

  • ai_review.failed payload extended with reason_code: string; old error field kept for one release.
  • clinical-api continues to emit ai_review.requested from its existing ai-review-trigger.service.ts (preserves client-driven mode and orchestrator-bypassed flows). The ai_review.requested consumer in ai-review accepts either source.

5. REST API surface

Standard NestJS controllers, scope-guarded via @sa-platform/auth-client. All endpoints multi-tenant via existing org-context interceptor.

5.1 Auth scopes

Registered in auth service's scope registry:

orchestrator:read              → list/get definitions, instances, templates, reason codes
orchestrator:read-instances    → read workflow instances (granted to clinical roles too)
orchestrator:write             → create/update draft definitions and reason codes
orchestrator:publish           → publish/archive definitions
orchestrator:intervene         → retry/cancel/supersede halted or in-flight instances
orchestrator:admin             → cross-org access (SA panel + ops staff)

5.2 Workflow definitions

POST   /workflow-definitions                      Create new draft (from scratch or template clone)
GET    /workflow-definitions?org_id=&product_id=&status=
GET    /workflow-definitions/:id
PATCH  /workflow-definitions/:id                  Edit draft only
POST   /workflow-definitions/:id/publish          Validate + activate; previous active → archived
POST   /workflow-definitions/:id/archive
POST   /workflow-definitions/:id/clone            Copy as new draft (commonly from templates)
GET    /workflow-definitions/:id/revisions        Change history
POST   /workflow-definitions/:id/validate         Dry-run validation

5.3 Workflow templates

GET    /workflow-templates                        List SA-shipped starter flows
GET    /workflow-templates/:id                    Full definition JSON

5.4 Workflow instances

GET    /workflow-instances?case_id=&org_id=&status=&mode=
GET    /workflow-instances/:id                    Full state: current step(s), context, definition snapshot ref
GET    /workflow-instances/:id/steps              Step history with attempts
GET    /workflow-instances/:id/events             Raw inbound event log
GET    /workflow-instances/:id/next-step          SDK advisory (see 5.8)
POST   /workflow-instances/:id/retry-step         Re-emit current step's *.requested
POST   /workflow-instances/:id/cancel             Terminate workflow with reason
POST   /workflow-instances/:id/supersede          Archive current; new instance against same case (latest active definition)
POST   /workflow-instances/:id/halt               Manual halt with reason code
POST   /workflow-instances/:id/resume             Resume halted instance from current step

5.5 Reason codes

GET    /reason-codes?scope=&org_id=&active=
POST   /reason-codes                              Create org-specific reason
PATCH  /reason-codes/:id
DELETE /reason-codes/:id                          Soft-delete (active=false)

5.6 Health

GET    /health                                    Liveness
GET    /health/ready                              Readiness incl. Redis Streams reachability

5.7 Mode signaling at case creation

No orchestrator endpoint for instance creation — always event-driven from case.created. Clinical-api stays the single front door for case creation.

clinical-api emits case.created with optional orchestration_mode field. Orchestrator listens, creates WorkflowInstance with mode resolved as:

final_mode = case_creation_request.orchestration_mode
          ?? workflow_definition.default_mode
          ?? 'active'
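
The same precedence expressed as a typed function, as a minimal sketch. `resolveMode` is a hypothetical name; the only behaviour shown is the fallback chain already given above.

```typescript
type Mode = "active" | "client_driven";

// Per-case override wins, then the definition default, then 'active'.
function resolveMode(
  requestedMode?: Mode | null,
  definitionDefault?: Mode | null,
): Mode {
  return requestedMode ?? definitionDefault ?? "active";
}
```

Because `??` only falls through on null or undefined, an explicit `client_driven` on the case is never overridden by the definition default.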

5.8 SDK advisory query

GET /workflow-instances/:id/next-step returns:

{
  "instance_id": "uuid",
  "case_id": "uuid",
  "status": "running",
  "current_step": {
    "id": "ai_review",
    "kind": "run_ai_review",
    "params": {},
    "input_context": {
      /* relevant slice */
    },
    "expected_outputs": ["confidence", "diagnoses", "raw_response_id"],
  },
  "next_action": {
    "type": "call_module",
    "module": "ai-review",
    "endpoint": "POST /reviews",
    "payload_hint": {
      /* prefilled from context */
    },
    "completion_event": "ai_review.completed",
  },
  "definition_summary": {
    "name": "...",
    "remaining_steps": ["ai_review", "branch_confidence", "customer_review", "emit_done"],
  },
}

5.9 Intervention semantics

  • retry-step: increments attempt counter, generates new correlation_id, re-emits *.requested. Allowed when current step is failed/timed_out or workflow is halted because of that step. Bounded by overall workflow deadline.
  • supersede: archives current instance (status='cancelled', cancelled_reason='superseded_by:<new_id>'); creates new instance against same case using latest active definition.
  • cancel: terminal, no resume. Records cancelled_by + reason.
  • halt (manual): force-halt running instance with reason code.
  • resume: only on halted instances; re-emits current step's *.requested.

All intervention actions are recorded in WorkflowIntervention.

5.10 Webhook emission for terminal state changes

cancel, supersede, halt, and natural workflow completion all emit corresponding workflow.* events to the events service for HMAC-signed external webhook delivery (existing webhook flow). Clients in client-driven mode can subscribe to these. Clients can also poll GET /workflow-instances/:id for state at any time.


6. Module-side changes

All changes are additive — no breaking changes to existing module APIs.

6.1 @sa-platform/events (new package)

  • Event envelope type + Zod validator
  • Per-event-type payload schemas
  • Helpers: publishEvent, consumeEvents
  • Migration target for ai-review's existing inline event types

6.2 clinical-api modifications

| Change | Why |
| --- | --- |
| Add case.created publisher in cases.service.ts (on POST /cases success) | Orchestrator listens to instantiate workflow |
| Add orchestration_mode?: 'active' \| 'client_driven' field to CreateCaseDto | Per-case mode override at creation |
| Add case.images_check.requested consumer + case.images_check.completed publisher | New step kind: require_images |
| Add case.set_status.requested consumer + case.set_status.completed publisher | New step kind: set_case_status |
| Add case.workflow.completed consumer | Orchestrator's final emit; clinical-api projects diagnoses (existing logic handles supersession) |
| ai-review-trigger.service.ts keeps emitting ai_review.requested for backward compat | No regression for existing flow / client-driven mode |

6.3 ai-review modifications

| Change | Why |
| --- | --- |
| ai_review.failed payload extended with reason_code: string referencing ReasonCode registry; old error field kept one release | Standardize failure taxonomy |
| Migrate inline event types to @sa-platform/events | Single source of truth |

Otherwise zero changes (existing ai_review.requested consumer accepts any emitter).

6.4 consent modifications

| Change | Why |
| --- | --- |
| Add consent.check.requested consumer + consent.check.completed publisher | New step kind: gate_on_consent |

REST endpoints unchanged; they remain callable directly in client-driven mode.

6.5 auth modifications

| Change | Why |
| --- | --- |
| Register orchestrator scopes (see 5.1) | Auth for orchestrator endpoints |

6.6 events / webhooks (in clinical-api)

| Change | Why |
| --- | --- |
| Register new webhook event types: workflow.started, workflow.halted, workflow.completed, workflow.cancelled, workflow.intervened | External clients subscribe |

6.7 audit (in clinical-api)

| Change | Why |
| --- | --- |
| Add workflow.* events to audit-recorded event types | Compliance: workflow lifecycle alongside data-access logs |

6.8 human-review (designed separately)

| Change | Why |
| --- | --- |
| human_review.requested consumer; human_review.completed + human_review.failed publishers from day one | Native orchestrator integration |

6.9 No changes needed in

  • notifications, user-management — don't participate in v1 clinical workflow

6.10 Migration approach

  • Each module's changes ship as a separate task in the orchestrator implementation plan, with mocked event-bus tests.
  • Backwards-compat: ai_review.failed carries both the old error field and the new reason_code for one release; the old field is removed in a follow-up.
  • No backfill of in-flight cases at orchestrator launch. Orchestrator only picks up case.created events from launch onward; existing in-flight cases finish on the legacy clinical-api-driven flow.

7. v1 workflow templates

Six canonical starter flows shipped as WorkflowDefinition seed rows (is_template: true, org_id: NULL). Orgs clone via POST /workflow-definitions/:id/clone.

| Template | Steps | Use case |
| --- | --- | --- |
| ai_only | consent_gate → require_images → run_ai_review → emit_final | AI triage only; no human review |
| ai_plus_clinician | consent_gate → require_images → run_ai_review → customer_review → emit_final | Standard regulated flow; every case gets clinician review |
| ai_with_confidence_escalation | consent_gate → require_images → run_ai_review → branch(ai_review.confidence < 0.7) → customer_review → emit_final | Cost-efficient; clinician sees only low-confidence cases |
| ai_plus_clinician_plus_sa_qa_sample | …customer_review → branch(sample(0.1)) → sa_qa → emit_final | Full triple-tier with 10% SA audit |
| clinician_only | consent_gate → require_images → customer_review → emit_final | Pure tele-derm, AI bypassed |
| ai_escalation_plus_sa_sample | consent_gate → require_images → run_ai_review → branch(confidence) → customer_review → branch(sample) → sa_qa → emit_final | Full compliance flow with both gates |

Onboarding flow: customer-success runs clone against the most-fitting template, customizes (min image count, confidence threshold, sample rate, timeouts), publishes.


8. Scheduling, retries, observability

8.1 Scheduling

BullMQ (already in stack) for delayed jobs.

  • Per-step timeouts: when a step's *.requested is emitted, a BullMQ job is enqueued with delay = timeout_seconds. If the job fires before *.completed arrives, the step is marked timed_out and workflow halts.
  • Per-workflow deadline: similar BullMQ job at deadline_at; halts workflow.
  • Retry backoff: implemented via BullMQ's delayed-retry with configured retry_backoff (exponential, linear, fixed).
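
The delays handed to the scheduler can be computed with two pure functions, sketched below. The base delay of 1 second and the 5 minute cap are assumptions chosen for illustration; the spec names only the three strategies. The resulting millisecond value is what would be passed as the `delay` option when enqueuing a BullMQ job.

```typescript
type Backoff = "exponential" | "linear" | "fixed";

// Delay before the next attempt. `failedAttempt` is the 1-based number of
// the attempt that just failed. baseMs and maxMs are illustrative defaults.
function retryDelayMs(
  strategy: Backoff,
  failedAttempt: number,
  baseMs: number = 1000,
  maxMs: number = 300000,
): number {
  const raw =
    strategy === "exponential"
      ? baseMs * Math.pow(2, failedAttempt - 1)
      : strategy === "linear"
        ? baseMs * failedAttempt
        : baseMs;
  return Math.min(raw, maxMs);
}

// Delay for the per-step timeout job, derived from the definition's
// timeout_seconds.
function timeoutDelayMs(timeoutSeconds: number): number {
  return timeoutSeconds * 1000;
}
```

With these defaults an exponential policy waits 1 s, 2 s, then 4 s after the first three failures, and the cap keeps a high attempt count from scheduling a retry past any reasonable step timeout.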

8.2 Retry boundary

Orchestrator retries are workflow-step retries, not module-internal retries. ai-review still does its own DERM retries internally; orchestrator only retries when ai_review.failed arrives with retryable: true. Modules signal retryability in the failure event.
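
The retry boundary can be summarised as a decision function evaluated when a `*.failed` event arrives. This sketch assumes that max_retries counts re-attempts after the first try (so max_retries: 3 allows four attempts in total) and that no retry is scheduled once the workflow deadline has passed; `decideOnFailure` and its parameter names are hypothetical.

```typescript
interface FailedPayload {
  reason_code: string;
  retryable: boolean;
}

interface StepPolicy {
  max_retries: number;
}

type FailureDecision = "retry" | "halt";

// `failedAttempt` is the 1-based attempt that produced the failure event.
function decideOnFailure(
  payload: FailedPayload,
  failedAttempt: number,
  policy: StepPolicy,
  nowMs: number,
  workflowDeadlineMs: number,
): FailureDecision {
  if (!payload.retryable) return "halt"; // module says retrying won't help
  if (failedAttempt > policy.max_retries) return "halt"; // retry budget spent
  if (nowMs >= workflowDeadlineMs) return "halt"; // outer deadline reached
  return "retry";
}
```

Keeping the module's `retryable` flag as the first check means the orchestrator never second-guesses a module that has already exhausted its own internal retries.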

8.3 Observability

Structured logs via NestJS Logger; correlation_id in every log line.

Metrics (Prometheus-shaped counters/histograms):

  • workflow_instance_started_total{org,product,definition,mode}
  • workflow_instance_completed_total{org,terminal_state}
  • workflow_step_duration_seconds{kind} histogram
  • workflow_halt_total{reason_code}
  • workflow_intervention_total{action}

Health endpoints: /health (liveness), /health/ready (incl. Redis Streams reachability)

Audit trail: existing audit service consumes workflow.* events plus interventions.


9. Testing strategy

9.1 Unit tests (~50)

  • Workflow definition validator (all rules in 3.5)
  • Expression evaluator (sandboxed exec safety, edge cases)
  • State-machine transitions
  • Retry counter logic
  • ReasonCode registry (per-org override resolution)
  • Idempotency dedup
  • Snapshot freezing on instantiation

9.2 Integration tests (~10)

Real MySQL + Redis via testcontainers. Module event responses mocked with msw-style stubs (existing pattern from ai-review).

  1. Happy path through ai_with_confidence_escalation template
  2. Step timeout halts the workflow
  3. Retry succeeds on second attempt
  4. Definition publish + clone + activation
  5. Client-driven mode: orchestrator observes external ai_review.completed and advances state without emitting ai_review.requested
  6. Intervention: retry-step on halted workflow
  7. Intervention: supersede creates new instance against latest definition
  8. Definition archival doesn't break running case (snapshot still works)
  9. Reason-code registry per-org override
  10. Idempotency: duplicate *.completed doesn't double-advance

9.3 No live external services in CI

msw-style mocks for module event responses; testcontainers for MySQL + Redis.


10. Out-of-scope / deferred

Captured here to prevent scope creep:

  • Non-clinical workflows (architecture supports; no step kinds or templates ship)
  • Visual workflow editor UI (admin REST only; UI is a separate frontend track)
  • Step kinds: parallel, wait_for_histology, delay, loop, manual_intervention, assign_reviewer
  • Workflow migration of in-flight cases on definition update (admin re-snapshot is v2)
  • Backfill of existing cases on orchestrator launch
  • Managed event bus (SNS/SQS, EventBridge) — Redis Streams in v1
  • Cross-region or multi-cluster orchestrator deploys
  • BPMN import / interop
  • Approval workflows for definition publishing — single-actor in v1
  • Per-step authorization — any orchestrator:intervene can intervene on any step
  • Workflow-level SLA monitoring/alerting — basic metrics ship; alert rules are an ops concern

11. Open questions for implementation plan

  1. Exact MySQL JSON column performance at scale — WorkflowInstance.context may grow large for long workflows; consider TEXT + manual JSON parse vs JSON column. Implementation plan should benchmark.
  2. BullMQ timeout-job cleanup on step completion — need explicit job removal when step finishes early, or accept harmless no-op fires.
  3. Definition validator implementation: hand-rolled vs Zod vs JSON Schema (ajv). Lean Zod for consistency with rest of codebase.
  4. Expression evaluator library choice: expr-eval vs custom recursive-descent. Custom is ~150 LOC and removes a dep; lean custom.