Orchestrator Service — Design Spec

Date: 2026-04-26
Status: Draft for review
Author: Jim Holmes (with brainstorming session)
Tracking issue / PR: TBD

1. Service overview & boundaries

1.1 Purpose

The orchestrator is a cross-module workflow engine for clinical cases. It owns case-flow definitions and runtime state. It drives modules in active mode and observes them in client-driven mode. The architecture is generic; v1 ships only clinical step kinds and clinical workflow templates, but the foundation supports non-clinical workflows (onboarding, billing, prescription approval) without redesign.

The motivation: clinical-api today does two different jobs. It owns case data (system of record) and it drives workflow (control plane). Separating them lets each evolve at its own cadence. Workflow rules churn rapidly as orgs come online with new contracts; case-data shape evolves slowly. Mixing them puts workflow churn in the same service that owns 25 Prisma tables and 60 endpoints, which makes that service fragile to change.

1.2 What the orchestrator owns

Workflow definitions (per-org/per-product, versioned, immutable once published)
Workflow templates (SA-shipped starter flows; orgs clone-and-customize)
Workflow instances (one per case, with snapshot of definition at instantiation)
Step state + history per instance (status, attempts, timestamps, outputs)
Per-step retry + timeout scheduling (BullMQ-backed)
Reason code registry (cross-cutting; halt reasons, step failure reasons, decline reasons)

1.3 What it does not own

Case data — clinical-api stays system of record for cases, findings, diagnoses
AI inference — ai-review
Reviewer queue + decision capture — human-review (designed separately)
Consent records, audit, retention — consent service, clinical-api retention/audit modules
Auth — auth service + auth-client package

1.4 Modes

Two modes coexist; mode is set per workflow definition (default) with optional per-case override at case creation.

Active: orchestrator emits *.requested events; modules consume; modules emit *.completed/*.failed; orchestrator advances state.
Client-driven: client calls module REST directly; modules still emit *.completed; orchestrator silently observes and tracks state.

The dual-mode requirement forces module APIs to remain first-class and self-contained — the orchestrator is one consumer, never a gatekeeper. SDK power users, customers integrating their own workflow tools, and SA's internal QA tooling all retain low-level control without going through the orchestrator.

1.5 Tenancy

Org-scoped, same patterns as every other service. Cross-tenant isolation via existing tenancy interceptors in @sa-platform/auth-client.

1.6 Deployment

Service path: services/orchestrator/
Port: 3006 (next free)
NestJS, TypeScript, Prisma, MySQL, Redis (Streams + BullMQ) — same stack as ai-review, notifications.
Standalone Prisma database (orchestrator).

1.7 v1 explicit non-goals

Non-clinical workflows (architecture supports; no step kinds or templates ship)
Visual / BPMN authoring UI (admin REST only; UI is a separate frontend track)
Step kinds: parallel, wait_for_histology, delay, loop, manual_intervention, assign_reviewer
Workflow migration of in-flight cases on definition update (admin "re-snapshot to latest" is v2)
Backfill of existing cases on orchestrator launch
Managed event bus (SNS/SQS, EventBridge); Redis Streams in v1
Cross-region or multi-cluster deploys
BPMN import / interop
Approval workflows for definition publishing (orchestrator:publish is single-actor in v1)
Per-step authorization (any orchestrator:intervene can intervene on any step)
Workflow-level SLA alerting (basic metrics ship; alert rules deferred to ops)

2. Data model & DB schema

Own MySQL database (orchestrator), Prisma-managed. Eight tables.

2.1 `WorkflowDefinition`

Templates and per-org definitions, versioned and append-only.

id                          UUID PK
org_id                      UUID NULL              -- NULL for SA-shipped templates
product_id                  UUID NULL              -- NULL for org-wide or template
name                        VARCHAR
version                     INT                    -- monotonic per (org_id, product_id)
is_template                 BOOLEAN
status                      ENUM(draft, active, archived)
default_mode                ENUM(active, client_driven)
definition                  JSON                   -- full DAG: steps, transitions, retry/timeout
workflow_timeout_seconds    INT                    -- outer safety net (default 30 days)
created_by                  VARCHAR                -- admin user id
created_at, updated_at, published_at
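UNIQUE(org_id, product_id, version)                -- MySQL unique indexes treat NULLs as distinct, so template / org-wide rows (NULL org_id or product_id) need an application-level uniqueness check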
INDEX(org_id, product_id, status)

2.2 `WorkflowDefinitionRevision`

Change history for audit.

id                          UUID PK
workflow_definition_id      UUID FK
changed_by                  VARCHAR
diff                        JSON                   -- before/after of definition payload
reason                      TEXT
created_at

2.3 `WorkflowInstance`

One row per case.

id                          UUID PK
case_id                     UUID UNIQUE            -- 1:1 with clinical-api Case
org_id                      UUID
product_id                  UUID
definition_id               UUID FK
definition_snapshot         JSON                   -- frozen copy at instantiation
mode                        ENUM(active, client_driven)
status                      ENUM(running, halted, completed, cancelled)
halt_reason                 VARCHAR NULL           -- references ReasonCode.code (scope=halt)
halt_step_id                VARCHAR NULL           -- step that triggered halt
started_at, deadline_at, completed_at
context                     JSON                   -- accumulated step outputs
INDEX(org_id, status, deadline_at)
INDEX(case_id)

2.4 `WorkflowInstanceCurrentStep`

Normalized "what's in flight" — one row per active step (multiple if v2 introduces parallel).

instance_id                 UUID FK
step_id                     VARCHAR
step_row_id                 UUID FK to WorkflowStep
PRIMARY KEY(instance_id, step_id)
INDEX(instance_id)

2.5 `WorkflowStep`

Per-step state per attempt; retries get separate rows so history is reconstructable. Latest row per (instance_id, step_id) is "current."

id                          UUID PK
instance_id                 UUID FK
step_id                     VARCHAR                -- matches step id in definition
step_kind                   VARCHAR                -- denormalized for query convenience
status                      ENUM(pending, in_progress, completed, failed, timed_out, skipped)
attempt                     INT                    -- 1-based
started_at, completed_at
timeout_at                  DATETIME NULL
input                       JSON                   -- params + relevant context slice when dispatched
output                      JSON NULL              -- what the completion event reported
error                       JSON NULL              -- failure detail
correlation_id              VARCHAR                -- ties to *.requested/*.completed event pair
INDEX(instance_id, step_id)
INDEX(correlation_id)
INDEX(status, timeout_at)                          -- MySQL has no partial (WHERE-filtered) indexes; the timeout sweep filters on status = 'in_progress'

2.6 `WorkflowEvent`

Raw inbound event log per instance. Lets us debug "we never advanced from step X" by checking whether the completion event arrived at all.

id                          UUID PK
instance_id                 UUID FK NULL           -- null if event arrived without matching instance
event_type                  VARCHAR
envelope                    JSON
received_at
processed                   BOOLEAN
INDEX(instance_id, received_at)

2.7 `WorkflowIntervention`

Audit trail for admin actions on instances.

id                          UUID PK
instance_id                 UUID FK
action                      ENUM(retry_step, cancel, supersede, halt, resume)
performed_by                VARCHAR
reason                      TEXT
before_state                JSON                   -- snapshot of relevant fields
after_state                 JSON
created_at
INDEX(instance_id, created_at)

2.8 `ReasonCode`

Cross-cutting reason code registry. Per-tenant configurable with system defaults.

id                          UUID PK
org_id                      UUID NULL              -- NULL = system default (overridable per org)
scope                       ENUM(halt, step_failure, human_decline)
code                        VARCHAR                -- e.g. 'image_quality', 'consent_missing'
label                       VARCHAR
description                 TEXT NULL
requires_note               BOOLEAN
applicable_step_kinds       JSON NULL              -- optional filter
active                      BOOLEAN
UNIQUE(org_id, scope, code)                        -- NULL org_id rows (system defaults) are not deduplicated by MySQL; enforce in the service
INDEX(scope, active)
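
A minimal sketch of how "system default, overridable per org" could resolve at lookup time. The row shape `ReasonCodeRow`, the function name `resolveReasonCode`, and the rule that an inactive org row hides the system default are assumptions for illustration, not decisions made in this spec.

```typescript
// Hypothetical in-memory view of the ReasonCode columns used for resolution.
type ReasonScope = "halt" | "step_failure" | "human_decline";

interface ReasonCodeRow {
  orgId: string | null; // null = system default
  scope: ReasonScope;
  code: string;
  label: string;
  active: boolean;
}

// Prefer the org's own row; fall back to the system default (orgId === null).
// Assumption: an inactive org row suppresses the default instead of falling through.
function resolveReasonCode(
  rows: ReasonCodeRow[],
  orgId: string,
  scope: ReasonScope,
  code: string,
): ReasonCodeRow | undefined {
  const matches = rows.filter((r) => r.scope === scope && r.code === code);
  const orgRow = matches.find((r) => r.orgId === orgId);
  const chosen = orgRow ?? matches.find((r) => r.orgId === null);
  return chosen !== undefined && chosen.active ? chosen : undefined;
}
```

The same function can back the publish-time check that a halt step's reason_code exists with scope=halt.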

2.9 Key schema design choices

definition_snapshot is the source of truth for a running workflow, not a join to WorkflowDefinition. Definitions can be archived without breaking running cases.
WorkflowStep rows accrue per attempt — retries get separate rows.
context is a single JSON blob on the instance, accumulated as steps complete. Nested by step id ({ ai_review: {confidence: 0.62, ...} }); conditions reference dotted paths.
correlation_id matches the event envelope's correlation id; same one that flows through ai_review.requested → ai_review.completed today.
ReasonCode is in the orchestrator (not a separate service) because the orchestrator already owns halt reasons and is the natural cross-cutting authority. Modules query via REST.

3. Workflow definition schema

3.1 Top-level shape

Stored as WorkflowDefinition.definition JSON. In the example, workflow_timeout_seconds is 2592000 (the 30-day outer safety net).

{
  "name": "AI + customer review with confidence escalation",
  "description": "Default flow for derm clinics with sub-0.7 confidence escalation to clinician.",
  "default_mode": "active",
  "workflow_timeout_seconds": 2592000,
  "start_step": "consent_gate",
  "steps": {
    "consent_gate": {
      "kind": "gate_on_consent",
      "params": { "types": ["ai_analysis"] },
      "timeout_seconds": 60,
      "max_retries": 0,
      "transitions": { "on_pass": "image_check", "on_fail": "halt_consent" }
    },
    "image_check": {
      "kind": "require_images",
      "params": { "min_count": 3, "image_types": ["DERMOSCOPIC", "MACROSCOPIC"] },
      "timeout_seconds": 60,
      "max_retries": 0,
      "transitions": { "on_pass": "ai_review", "on_fail": "halt_images" }
    },
    "ai_review": {
      "kind": "run_ai_review",
      "params": {},
      "timeout_seconds": 900,
      "max_retries": 3,
      "retry_backoff": "exponential",
      "transitions": { "on_complete": "branch_confidence", "on_failure": "halt_ai" }
    },
    "branch_confidence": {
      "kind": "condition",
      "expr": "ai_review.confidence < 0.7",
      "transitions": { "on_true": "customer_review", "on_false": "emit_done" }
    },
    "customer_review": {
      "kind": "request_human_review",
      "params": { "tier": "customer_clinician" },
      "timeout_seconds": 86400,
      "transitions": { "on_complete": "emit_done" }
    },
    "emit_done": { "kind": "emit_final", "transitions": { "on_complete": "TERMINAL" } },
    "halt_consent": { "kind": "halt", "params": { "reason_code": "consent_missing" } },
    "halt_images": { "kind": "halt", "params": { "reason_code": "insufficient_images" } },
    "halt_ai": { "kind": "halt", "params": { "reason_code": "ai_review_failed" } }
  }
}

3.2 Transition model

Each step's transitions is a map of {outcome → next_step_id}. Outcomes are step-kind-specific. Reserved target id "TERMINAL" indicates successful workflow completion. Halt is its own step kind, not a transition target keyword — keeps reason codes explicit and queryable.
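
The transition lookup can be sketched as a pure function over the definition JSON. The type names and the `advanceStep` function are hypothetical; only the `transitions` map, the `TERMINAL` keyword, and the `halt` step kind come from this spec.

```typescript
// Minimal slice of the definition JSON needed to follow a transition.
interface TransitionStep {
  kind: string;
  transitions?: Record<string, string>; // outcome -> next step id
}

interface TransitionDefinition {
  steps: Record<string, TransitionStep>;
}

type AdvanceResult =
  | { type: "goto"; stepId: string }
  | { type: "completed" }
  | { type: "halted"; stepId: string };

// Resolve where the workflow goes after `current` reports `outcome`.
function advanceStep(def: TransitionDefinition, current: string, outcome: string): AdvanceResult {
  const step = def.steps[current];
  if (step === undefined) throw new Error(`unknown step: ${current}`);
  const target = step.transitions?.[outcome];
  if (target === undefined) throw new Error(`step ${current} has no transition for ${outcome}`);
  if (target === "TERMINAL") return { type: "completed" };
  const next = def.steps[target];
  if (next === undefined) throw new Error(`unknown target step: ${target}`);
  // Halt is a step kind, so it is detected on the target rather than by keyword.
  return next.kind === "halt" ? { type: "halted", stepId: target } : { type: "goto", stepId: target };
}
```

Because halt is resolved from the target step, the reason code stays in that step's params and remains queryable.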

3.3 Step kind contracts

Kind	Params	Outcomes	Notes
`gate_on_consent`	`types: string[]`	`on_pass`, `on_fail`	Queries consent service for org+patient combo
`require_images`	`min_count: int`, `image_types?: string[]`	`on_pass`, `on_fail`	Count check only; quality is ai-review's job
`run_ai_review`	(none)	`on_complete`, `on_failure`	Failure ≠ low confidence; failure = AI couldn't run (image quality, derm error)
`request_human_review`	`tier: 'customer_clinician' | 'sa_qa'`	`on_complete`	Decline by reviewer is internal re-queue, not a transition
`condition`	`expr: string`	`on_true`, `on_false`	Pure branch; no module call
`emit_webhook`	`event_name: string`, `payload_template?: object`	`on_complete`	Fire-and-forget; treated as instantly complete
`emit_final`	`include_diagnoses?: bool` (default true)	`on_complete` (must target `TERMINAL`)	Emits `case.workflow.completed` with full context
`set_case_status`	`status: string`	`on_complete`	Calls clinical-api to override case status
`halt`	`reason_code: string`, `note?: string`	(terminal)	Reason code must exist in `ReasonCode` registry with scope=halt

3.4 Condition expression DSL

Subset of JS, sandboxed evaluation. Surface deliberately small (~150 LOC parser/evaluator).

Comparisons: <, <=, ==, !=, >=, >, in
Boolean: &&, ||, !, parens
Dotted path access into context: ai_review.confidence
Literals: numbers, strings, booleans, arrays
Helpers: sample(p) (deterministic per case_id — same case + same definition always yields the same outcome, for reproducibility/audit), now(), len(x), coalesce(a, b)
Not allowed: function calls beyond the helper set, loops, assignments, arbitrary code

Example expressions:

ai_review.confidence < 0.7
len(ai_review.diagnoses) > 1
sample(0.1) && ai_review.confidence < 0.9
consent_gate.granted && image_check.count >= 3
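
A rough sketch of the custom recursive-descent option, covering only part of the surface above: comparisons, &&, ||, !, parens, dotted paths, and number/string/boolean literals. The `in` operator, array literals, and the helpers are left out to keep it short. All names here (`evaluateExpr`, `tokenizeExpr`, `lookupPath`) are hypothetical, and single-quoted string literals are an assumption.

```typescript
type ExprContext = Record<string, unknown>;
type ExprToken =
  | { t: "num"; v: number }
  | { t: "str"; v: string }
  | { t: "id"; v: string }
  | { t: "op"; v: string };

const EXPR_TOKEN = /\s+|\d+(?:\.\d+)?|'[^']*'|[A-Za-z_][\w.]*|&&|\|\||[<>=!]=|[<>!()]/g;

function tokenizeExpr(src: string): ExprToken[] {
  const parts: string[] = src.match(EXPR_TOKEN) ?? [];
  // If the matcher skipped any character, the input is outside the grammar.
  if (parts.join("") !== src) throw new Error(`invalid expression: ${src}`);
  const tokens: ExprToken[] = [];
  for (const p of parts) {
    if (/^\s/.test(p)) continue;
    if (/^\d/.test(p)) tokens.push({ t: "num", v: Number(p) });
    else if (p.charAt(0) === "'") tokens.push({ t: "str", v: p.slice(1, -1) });
    else if (/^[A-Za-z_]/.test(p)) tokens.push({ t: "id", v: p });
    else tokens.push({ t: "op", v: p });
  }
  return tokens;
}

function asNum(v: unknown): number {
  if (typeof v !== "number") throw new Error(`expected a number, got ${typeof v}`);
  return v;
}

const EXPR_COMPARE: Record<string, (a: unknown, b: unknown) => boolean> = {
  "<": (a, b) => asNum(a) < asNum(b),
  "<=": (a, b) => asNum(a) <= asNum(b),
  ">": (a, b) => asNum(a) > asNum(b),
  ">=": (a, b) => asNum(a) >= asNum(b),
  "==": (a, b) => a === b,
  "!=": (a, b) => a !== b,
};

// Own-property lookup only, so paths cannot reach prototype members.
function lookupPath(ctx: ExprContext, path: string): unknown {
  let cur: unknown = ctx;
  for (const key of path.split(".")) {
    if (typeof cur !== "object" || cur === null) return undefined;
    if (!Object.prototype.hasOwnProperty.call(cur, key)) return undefined;
    cur = (cur as Record<string, unknown>)[key];
  }
  return cur;
}

function evaluateExpr(src: string, ctx: ExprContext): boolean {
  const toks = tokenizeExpr(src);
  let pos = 0;
  const eatOp = (v: string): boolean => {
    const t = toks[pos];
    if (t !== undefined && t.t === "op" && t.v === v) { pos += 1; return true; }
    return false;
  };
  function parseOr(): unknown {
    let left = parseAnd();
    while (eatOp("||")) { const right = parseAnd(); left = Boolean(left) || Boolean(right); }
    return left;
  }
  function parseAnd(): unknown {
    let left = parseCmp();
    while (eatOp("&&")) { const right = parseCmp(); left = Boolean(left) && Boolean(right); }
    return left;
  }
  function parseCmp(): unknown {
    const left = parseUnary();
    const t = toks[pos];
    const cmp = t !== undefined && t.t === "op" ? EXPR_COMPARE[t.v] : undefined;
    if (cmp === undefined) return left;
    pos += 1;
    return cmp(left, parseUnary());
  }
  function parseUnary(): unknown {
    if (eatOp("!")) return !parseUnary();
    if (eatOp("(")) {
      const inner = parseOr();
      if (!eatOp(")")) throw new Error("expected ')'");
      return inner;
    }
    const t = toks[pos];
    pos += 1;
    if (t === undefined) throw new Error("unexpected end of expression");
    if (t.t === "num" || t.t === "str") return t.v;
    if (t.t === "id") {
      if (t.v === "true") return true;
      if (t.v === "false") return false;
      return lookupPath(ctx, t.v);
    }
    throw new Error(`unexpected token ${t.v}`);
  }
  const result = parseOr();
  if (pos < toks.length) throw new Error("unexpected trailing tokens");
  return Boolean(result);
}
```

Nothing in the grammar produces a call or an assignment, so "not allowed" constructs fail at tokenizing or parsing rather than at evaluation. In this sketch a numeric comparison against a missing path throws; whether that should instead evaluate to false is a decision for the implementation plan.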

3.5 Validation rules at publish time

start_step references an existing step.
Every step's transitions targets reference existing steps or TERMINAL.
No unreachable steps (DAG reachable from start_step).
No cycles (workflows are DAGs in v1; loops are v2).
Every condition expression parses successfully.
Every halt step's reason_code exists in ReasonCode registry with scope=halt.
Every request_human_review references a valid tier.
emit_final exists and is reachable; at least one path leads to TERMINAL.
Per-step timeout_seconds ≤ workflow_timeout_seconds.
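
The structural rules above (start step, transition targets, reachability, cycles, a path to TERMINAL) can be sketched as one depth-first pass. This is a hand-rolled illustration with hypothetical names (`validateGraph`, `GraphDefinition`); expression parsing, reason-code, tier, and timeout checks are omitted.

```typescript
interface GraphStep {
  kind: string;
  transitions?: Record<string, string>; // outcome -> target step id
}

interface GraphDefinition {
  start_step: string;
  steps: Record<string, GraphStep>;
}

// Returns a list of human-readable problems; an empty list means the graph rules pass.
function validateGraph(def: GraphDefinition): string[] {
  const errors: string[] = [];
  const ids = Object.keys(def.steps);
  const has = (id: string): boolean => Object.prototype.hasOwnProperty.call(def.steps, id);
  const targetsOf = (id: string): string[] => {
    const transitions: Record<string, string> = def.steps[id]?.transitions ?? {};
    return Object.keys(transitions).map((outcome) => transitions[outcome] as string);
  };

  if (!has(def.start_step)) errors.push(`start_step "${def.start_step}" does not exist`);
  for (const id of ids) {
    for (const target of targetsOf(id)) {
      if (target !== "TERMINAL" && !has(target)) {
        errors.push(`step "${id}" targets unknown step "${target}"`);
      }
    }
  }

  // Depth-first walk from start_step: a step seen again while still "visiting" is a cycle.
  const visitState = new Map<string, "visiting" | "done">();
  const found = { cycle: false, terminal: false };
  const visit = (id: string): void => {
    if (!has(id)) return;
    const seen = visitState.get(id);
    if (seen === "visiting") found.cycle = true;
    if (seen !== undefined) return;
    visitState.set(id, "visiting");
    for (const target of targetsOf(id)) {
      if (target === "TERMINAL") found.terminal = true;
      else visit(target);
    }
    visitState.set(id, "done");
  };
  visit(def.start_step);

  if (found.cycle) errors.push("workflow contains a cycle");
  for (const id of ids) {
    if (!visitState.has(id)) errors.push(`step "${id}" is unreachable from start_step`);
  }
  if (!found.terminal) errors.push("no path leads to TERMINAL");
  return errors;
}
```

Returning every problem at once, rather than failing on the first, suits the dry-run validate endpoint, where an admin wants the full list.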

3.6 Workflow context

Built up as steps complete. Each step's output is merged under its step id:

{
  "case_id": "uuid",
  "org_id": "uuid",
  "product_id": "uuid",
  "consent_gate": { "granted": true, "missing_types": [] },
  "image_check": { "count": 3, "passed": true },
  "ai_review": { "confidence": 0.62, "diagnoses": [...], "raw_response_id": "uuid" },
  "branch_confidence": { "result": true },
  "customer_review": { "decision": "override", "diagnoses": [...], "reviewer_id": "uuid" }
}

Module step outputs are validated against per-kind output schema before merging. Catches contract drift early.
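
A sketch of validate-then-merge, under the assumption that each step kind registers a check for its output shape. The `outputChecks` table and the hand-written predicates are hypothetical stand-ins for the real per-kind schemas.

```typescript
type WorkflowContext = Record<string, unknown>;
type StepOutput = Record<string, unknown>;
type OutputCheck = (output: StepOutput) => boolean;

// Hypothetical per-kind checks; kinds without an entry are merged unchecked in this sketch.
const outputChecks: Record<string, OutputCheck> = {
  run_ai_review: (o) => typeof o["confidence"] === "number" && Array.isArray(o["diagnoses"]),
  require_images: (o) => typeof o["count"] === "number" && typeof o["passed"] === "boolean",
};

// Returns a new context with the output nested under the step id; the input is not mutated.
function mergeStepOutput(
  ctx: WorkflowContext,
  stepId: string,
  stepKind: string,
  output: StepOutput,
): WorkflowContext {
  const check = outputChecks[stepKind];
  if (check !== undefined && !check(output)) {
    throw new Error(`output of step "${stepId}" does not match the ${stepKind} output schema`);
  }
  return { ...ctx, [stepId]: output };
}
```

Keying by step id rather than step kind is what lets condition expressions use paths such as ai_review.confidence even when a definition contains two steps of the same kind.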

4. Event contracts

4.1 Shared package

New @sa-platform/events package contains:

Event envelope type + Zod validator
Per-event-type payload schemas (one schema per event)
Helpers: publishEvent(stream, envelope) and consumeEvents(stream, handler), thin wrappers around ioredis Streams
Migration target for ai-review's currently inline event types

4.2 Standard envelope

type EventEnvelope<T> = {
  event_id: string; // UUID, unique per emit
  event_type: string; // e.g. "ai_review.completed"
  schema_version: string; // "v1"
  occurred_at: string; // ISO8601
  correlation_id: string; // ties request → completion → next request
  causation_id?: string; // event_id of the event that caused this one
  case_id: string;
  org_id: string;
  product_id: string;
  payload: T;
};

correlation_id is the contract that links a step's *.requested to its *.completed/*.failed. The orchestrator generates it when emitting *.requested; modules echo it back in their completion event. Retry attempt N gets a new correlation_id; only the latest is "current" for the step.
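
The correlation contract can be sketched with two small builders. The function names, the injected `IdGen`, and the choice to set `causation_id` on the completion to the request's `event_id` are assumptions for illustration; the envelope type is repeated so the block stands alone.

```typescript
type EventEnvelope<T> = {
  event_id: string;
  event_type: string;
  schema_version: string;
  occurred_at: string;
  correlation_id: string;
  causation_id?: string;
  case_id: string;
  org_id: string;
  product_id: string;
  payload: T;
};

type CaseScope = Pick<EventEnvelope<unknown>, "case_id" | "org_id" | "product_id">;
type IdGen = () => string; // injected so the sketch needs no UUID library

// Orchestrator side: every attempt gets a fresh correlation_id.
function makeRequested<T>(
  newId: IdGen, scope: CaseScope, eventType: string, payload: T, occurredAt: string,
): EventEnvelope<T> {
  return {
    event_id: newId(),
    event_type: eventType,
    schema_version: "v1",
    occurred_at: occurredAt,
    correlation_id: newId(),
    case_id: scope.case_id,
    org_id: scope.org_id,
    product_id: scope.product_id,
    payload,
  };
}

// Module side: echo the request's correlation_id and point causation_id at the request.
function makeCompleted<Req, Res>(
  newId: IdGen, request: EventEnvelope<Req>, eventType: string, payload: Res, occurredAt: string,
): EventEnvelope<Res> {
  return {
    event_id: newId(),
    event_type: eventType,
    schema_version: request.schema_version,
    occurred_at: occurredAt,
    correlation_id: request.correlation_id,
    causation_id: request.event_id,
    case_id: request.case_id,
    org_id: request.org_id,
    product_id: request.product_id,
    payload,
  };
}

// Only a completion carrying the step's latest correlation_id may advance it.
function isCurrentAttempt(currentCorrelationId: string, incoming: EventEnvelope<unknown>): boolean {
  return incoming.correlation_id === currentCorrelationId;
}
```

A late completion from attempt 1 that arrives after attempt 2 was dispatched fails the `isCurrentAttempt` check and can be logged to WorkflowEvent without advancing state.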

4.3 Event catalogue (v1)

Workflow lifecycle (orchestrator emits):
  workflow.started        { instance_id, definition_id, mode }
  workflow.halted         { instance_id, halt_step_id, reason_code, reason_note? }
  workflow.completed      { instance_id, completed_at }
  workflow.cancelled      { instance_id, cancelled_by, reason? }

Consent gate (orchestrator emits → consent consumes):
  consent.check.requested { types: string[], patient_id }
  consent.check.completed { granted: bool, missing_types: string[] }

Image check (orchestrator emits → clinical-api consumes):
  case.images_check.requested { min_count: int, image_types?: string[] }
  case.images_check.completed { count, passed: bool, image_ids: uuid[] }

AI review (existing — extends current ai_review.* contracts):
  ai_review.requested     { images: [...], product_id }       — unchanged shape
  ai_review.completed     { confidence, diagnoses, raw_response_id }   — existing
  ai_review.failed        { reason_code, retryable: bool, error?: string }   — extended

Human review (designed separately in human-review spec):
  human_review.requested  { tier: 'customer_clinician' | 'sa_qa', context_snapshot }
  human_review.completed  { decision: 'confirm' | 'override', diagnoses?, reviewer_id }
  human_review.failed     { reason_code, retryable: bool }

Set case status (orchestrator emits → clinical-api consumes):
  case.set_status.requested  { status: string }
  case.set_status.completed  { previous_status, new_status }

Webhook emit (orchestrator emits → events module consumes):
  workflow.webhook_emit.requested  { event_name, payload }
  workflow.webhook_emit.completed  { delivery_id }

Final emit (orchestrator emits → clinical-api consumes):
  case.workflow.completed  { final_diagnoses: [...], context_snapshot, definition_id }

  Note: orchestrator passes the full context but does not make the clinical decision.
  clinical-api's existing supersession logic ("most recent in source-rank order wins")
  determines the final diagnosis.

4.4 Transport

Redis Streams (already in stack via ai-review's BullMQ + ioredis setup). Each event type is its own stream; consumers subscribe via consumer groups. Stream retention 7 days for replay. For compliance-grade audit replay we already have the audit-logging service.

4.5 Consumer/publisher matrix

Service	Consumes	Publishes
orchestrator	`case.created`, `*.completed`, `*.failed` for every step kind	`workflow.*`, `*.requested` for every step kind, `case.workflow.completed`
ai-review	`ai_review.requested`	`ai_review.completed`, `ai_review.failed`
human-review	`human_review.requested`	`human_review.completed`, `human_review.failed`
consent	`consent.check.requested`	`consent.check.completed`
clinical-api	`case.images_check.requested`, `case.set_status.requested`, `case.workflow.completed`	`case.images_check.completed`, `case.set_status.completed`, `case.created`
events (existing)	`workflow.webhook_emit.requested`	`workflow.webhook_emit.completed`, external HMAC-signed webhooks

4.6 Idempotency

Modules use correlation_id as idempotency key when handling *.requested. Duplicate correlation_id is acked silently. Existing pattern in ai-review.
Orchestrator uses event_id as dedup key on incoming *.completed events. WorkflowEvent.processed flag prevents double state advancement.
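
Both dedup rules follow the same first-time-seen pattern with a different key. The sketch below uses an in-memory set as a stand-in for durable storage (the spec points at WorkflowEvent.processed on the orchestrator side); `SeenKeys` and the handler names are hypothetical.

```typescript
// In-memory stand-in for a durable dedup store.
class SeenKeys {
  private readonly keys = new Set<string>();

  // True the first time a key is offered, false on every repeat.
  firstTime(key: string): boolean {
    if (this.keys.has(key)) return false;
    this.keys.add(key);
    return true;
  }
}

interface DedupEnvelope {
  event_id: string;
  correlation_id: string;
  event_type: string;
}

type HandleResult = "processed" | "duplicate";

// Module side: a repeated *.requested with the same correlation_id is acked without rework.
function handleRequested(seen: SeenKeys, ev: DedupEnvelope, doWork: () => void): HandleResult {
  if (!seen.firstTime(ev.correlation_id)) return "duplicate";
  doWork();
  return "processed";
}

// Orchestrator side: a redelivered *.completed with the same event_id must not advance twice.
function handleCompleted(seen: SeenKeys, ev: DedupEnvelope, advance: () => void): HandleResult {
  if (!seen.firstTime(ev.event_id)) return "duplicate";
  advance();
  return "processed";
}
```

In a real consumer the key check and the state change would need to commit together; otherwise a crash between the two could drop or repeat an advancement.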

4.7 Migration

ai_review.failed payload extended with reason_code: string; old error field kept for one release.
clinical-api continues to emit ai_review.requested from its existing ai-review-trigger.service.ts (preserves client-driven mode and orchestrator-bypassed flows). The ai_review.requested consumer in ai-review accepts either source.

5. REST API surface

Standard NestJS controllers, scope-guarded via @sa-platform/auth-client. All endpoints multi-tenant via existing org-context interceptor.

5.1 Auth scopes

Registered in auth service's scope registry:

orchestrator:read              → list/get definitions, instances, templates, reason codes
orchestrator:read-instances    → read workflow instances (granted to clinical roles too)
orchestrator:write             → create/update draft definitions and reason codes
orchestrator:publish           → publish/archive definitions
orchestrator:intervene         → retry/cancel/supersede/halt/resume halted or in-flight instances
orchestrator:admin             → cross-org access (SA panel + ops staff)

5.2 Workflow definitions

POST   /workflow-definitions                      Create new draft (from scratch or template clone)
GET    /workflow-definitions?org_id=&product_id=&status=
GET    /workflow-definitions/:id
PATCH  /workflow-definitions/:id                  Edit draft only
POST   /workflow-definitions/:id/publish          Validate + activate; previous active → archived
POST   /workflow-definitions/:id/archive
POST   /workflow-definitions/:id/clone            Copy as new draft (commonly from templates)
GET    /workflow-definitions/:id/revisions        Change history
POST   /workflow-definitions/:id/validate         Dry-run validation

5.3 Workflow templates

GET    /workflow-templates                        List SA-shipped starter flows
GET    /workflow-templates/:id                    Full definition JSON

5.4 Workflow instances

GET    /workflow-instances?case_id=&org_id=&status=&mode=
GET    /workflow-instances/:id                    Full state: current step(s), context, definition snapshot ref
GET    /workflow-instances/:id/steps              Step history with attempts
GET    /workflow-instances/:id/events             Raw inbound event log
GET    /workflow-instances/:id/next-step          SDK advisory (see 5.8)
POST   /workflow-instances/:id/retry-step         Re-emit current step's *.requested
POST   /workflow-instances/:id/cancel             Terminate workflow with reason
POST   /workflow-instances/:id/supersede          Archive current; new instance against same case (latest active definition)
POST   /workflow-instances/:id/halt               Manual halt with reason code
POST   /workflow-instances/:id/resume             Resume halted instance from current step

5.5 Reason codes

GET    /reason-codes?scope=&org_id=&active=
POST   /reason-codes                              Create org-specific reason
PATCH  /reason-codes/:id
DELETE /reason-codes/:id                          Soft-delete (active=false)

5.6 Health

GET    /health                                    Liveness
GET    /health/ready                              Readiness incl. Redis Streams reachability

5.7 Mode signaling at case creation

No orchestrator endpoint for instance creation — always event-driven from case.created. Clinical-api stays the single front door for case creation.

clinical-api emits case.created with optional orchestration_mode field. Orchestrator listens, creates WorkflowInstance with mode resolved as:

final_mode = case_creation_request.orchestration_mode
          ?? workflow_definition.default_mode
          ?? 'active'

5.8 SDK advisory query

GET /workflow-instances/:id/next-step returns:

{
  "instance_id": "uuid",
  "case_id": "uuid",
  "status": "running",
  "current_step": {
    "id": "ai_review",
    "kind": "run_ai_review",
    "params": {},
    "input_context": { /* relevant slice */ },
    "expected_outputs": ["confidence", "diagnoses", "raw_response_id"]
  },
  "next_action": {
    "type": "call_module",
    "module": "ai-review",
    "endpoint": "POST /reviews",
    "payload_hint": { /* prefilled from context */ },
    "completion_event": "ai_review.completed"
  },
  "definition_summary": {
    "name": "...",
    "remaining_steps": ["ai_review", "branch_confidence", "customer_review", "emit_done"]
  }
}

5.9 Intervention semantics

retry-step: increments attempt counter, generates new correlation_id, re-emits *.requested. Allowed when current step is failed/timed_out or workflow is halted because of that step. Bounded by overall workflow deadline.
supersede: archives current instance (status='cancelled', cancelled_reason='superseded_by:<new_id>'); creates new instance against same case using latest active definition.
cancel: terminal, no resume. Records cancelled_by + reason.
halt (manual): force-halt running instance with reason code.
resume: only on halted instances; re-emits current step's *.requested.

All intervention actions are recorded in WorkflowIntervention.
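
The preconditions above can be sketched as a single guard. The `InstanceView` shape and `canIntervene` name are hypothetical. The rules for retry_step, resume, and halt follow the text; treating cancel and supersede as allowed on any non-terminal instance is an assumption drawn from the scope description in 5.1.

```typescript
type InstanceStatus = "running" | "halted" | "completed" | "cancelled";
type CurrentStepStatus = "pending" | "in_progress" | "completed" | "failed" | "timed_out" | "skipped";
type InterventionAction = "retry_step" | "cancel" | "supersede" | "halt" | "resume";

// Hypothetical read model combining WorkflowInstance with its current WorkflowStep row.
interface InstanceView {
  status: InstanceStatus;
  haltStepId: string | null;
  currentStepId: string;
  currentStepStatus: CurrentStepStatus;
  deadlineAt: Date;
}

function canIntervene(action: InterventionAction, inst: InstanceView, now: Date): boolean {
  const terminal = inst.status === "completed" || inst.status === "cancelled";
  switch (action) {
    case "retry_step": {
      const stepFailed = inst.currentStepStatus === "failed" || inst.currentStepStatus === "timed_out";
      const haltedByStep = inst.status === "halted" && inst.haltStepId === inst.currentStepId;
      const beforeDeadline = now.getTime() < inst.deadlineAt.getTime();
      return !terminal && (stepFailed || haltedByStep) && beforeDeadline;
    }
    case "resume":
      return inst.status === "halted";
    case "halt":
      return inst.status === "running";
    case "cancel":
    case "supersede":
      return !terminal; // assumption: any halted or in-flight instance qualifies
  }
}
```

Keeping the guard pure makes it straightforward to cover in the state-machine unit tests before the controller and the WorkflowIntervention write are wired in.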

5.10 Webhook emission for terminal state changes

cancel, supersede, halt, and natural workflow completion all emit corresponding workflow.* events to the events service for HMAC-signed external webhook delivery (existing webhook flow). Clients in client-driven mode can subscribe to these. Clients can also poll GET /workflow-instances/:id for state at any time.

6. Module-side changes

All changes are additive — no breaking changes to existing module APIs.

6.1 `@sa-platform/events` (new package)

Event envelope type + Zod validator
Per-event-type payload schemas
Helpers: publishEvent, consumeEvents
Migration target for ai-review's existing inline event types

6.2 `clinical-api` modifications

Change	Why
Add `case.created` publisher in `cases.service.ts` (on POST /cases success)	Orchestrator listens to instantiate workflow
Add `orchestration_mode?: 'active' | 'client_driven'` field to `CreateCaseDto`	Per-case mode override at creation
Add `case.images_check.requested` consumer + `case.images_check.completed` publisher	New step kind: `require_images`
Add `case.set_status.requested` consumer + `case.set_status.completed` publisher	New step kind: `set_case_status`
Add `case.workflow.completed` consumer	Orchestrator's final emit; clinical-api projects diagnoses (existing logic handles supersession)
`ai-review-trigger.service.ts` keeps emitting `ai_review.requested` for backward compat	No regression for existing flow / client-driven mode

6.3 `ai-review` modifications

Change	Why
`ai_review.failed` payload extended with `reason_code: string` referencing `ReasonCode` registry; old `error` field kept one release	Standardize failure taxonomy
Migrate inline event types to `@sa-platform/events`	Single source of truth
Otherwise zero changes (existing `ai_review.requested` consumer accepts any emitter)

6.4 `consent` modifications

Change	Why
Add `consent.check.requested` consumer + `consent.check.completed` publisher	New step kind: `gate_on_consent`
REST endpoints unchanged — still callable directly in client-driven mode

6.5 `auth` modifications

Change	Why
Register orchestrator scopes (see 5.1)	Auth for orchestrator endpoints

6.6 `events` / webhooks (in clinical-api)

Change	Why
Register new webhook event types: `workflow.started`, `workflow.halted`, `workflow.completed`, `workflow.cancelled`, `workflow.intervened`	External clients subscribe

6.7 `audit` (in clinical-api)

Change	Why
Add `workflow.*` events to audit-recorded event types	Compliance — workflow lifecycle alongside data-access logs

6.8 `human-review` (designed separately)

Change	Why
`human_review.requested` consumer; `human_review.completed` + `human_review.failed` publishers from day one	Native orchestrator integration

6.9 No changes needed in

notifications, user-management — don't participate in v1 clinical workflow

6.10 Migration approach

Each module's changes ship as a separate task in the orchestrator implementation plan, with mocked event-bus tests.
Backwards-compat: ai_review.failed carries both the old error field and the new reason_code for one release; the old field is removed in a follow-up. The ai_review.requested shape is unchanged.
No backfill of in-flight cases at orchestrator launch. Orchestrator only picks up case.created events from launch onward; existing in-flight cases finish on the legacy clinical-api-driven flow.

7. v1 workflow templates

Six canonical starter flows shipped as WorkflowDefinition seed rows (is_template: true, org_id: NULL). Orgs clone via POST /workflow-definitions/:id/clone.

Template	Steps	Use case
`ai_only`	consent_gate → require_images → run_ai_review → emit_final	AI triage only; no human review
`ai_plus_clinician`	consent_gate → require_images → run_ai_review → customer_review → emit_final	Standard regulated flow; every case gets clinician review
`ai_with_confidence_escalation`	consent_gate → require_images → run_ai_review → branch(`ai_review.confidence < 0.7`) → customer_review → emit_final	Cost-efficient; clinician sees only low-confidence cases
`ai_plus_clinician_plus_sa_qa_sample`	…customer_review → branch(`sample(0.1)`) → sa_qa → emit_final	Full triple-tier with 10% SA audit
`clinician_only`	consent_gate → require_images → customer_review → emit_final	Pure tele-derm, AI bypassed
`ai_escalation_plus_sa_sample`	consent_gate → require_images → run_ai_review → branch(confidence) → customer_review → branch(sample) → sa_qa → emit_final	Full compliance flow with both gates

Onboarding flow: customer-success clones the best-fitting template, customizes it (min image count, confidence threshold, sample rate, timeouts), then publishes.

8. Scheduling, retries, observability

8.1 Scheduling

BullMQ (already in stack) for delayed jobs.

Per-step timeouts: when a step's *.requested is emitted, a BullMQ job is enqueued with delay = timeout_seconds. If the job fires before *.completed arrives, the step is marked timed_out and workflow halts.
Per-workflow deadline: similar BullMQ job at deadline_at; halts workflow.
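Retry backoff: implemented via BullMQ delayed jobs with the configured retry_backoff (exponential, linear, fixed). BullMQ's built-in backoff strategies cover fixed and exponential; linear needs a custom strategy or an orchestrator-computed delay.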
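
A sketch of the timeout job's handler, written so that a job firing after the step already finished (or after a retry replaced the attempt) is a harmless no-op. The types and `onTimeoutFired` are hypothetical, and carrying the correlation_id in the job data is an assumption.

```typescript
type TrackedStepStatus = "pending" | "in_progress" | "completed" | "failed" | "timed_out" | "skipped";

// Hypothetical view of the current WorkflowStep row.
interface TrackedStep {
  correlationId: string;
  status: TrackedStepStatus;
}

// Hypothetical delayed-job payload: identifies the attempt the timeout was scheduled for.
interface TimeoutJobData {
  correlationId: string;
}

// Returns true when the step was timed out, false when the fire was a no-op.
function onTimeoutFired(step: TrackedStep, job: TimeoutJobData): boolean {
  const sameAttempt = step.correlationId === job.correlationId;
  if (!sameAttempt || step.status !== "in_progress") return false;
  step.status = "timed_out";
  return true;
}
```

With this guard in place, removing the delayed job on early completion becomes an optimization rather than a correctness requirement.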

8.2 Retry boundary

Orchestrator retries are workflow-step retries, not module-internal retries. ai-review still does its own DERM retries internally; orchestrator only retries when ai_review.failed arrives with retryable: true. Modules signal retryability in the failure event.
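
The retry boundary can be sketched as a decision on each *.failed event, paired with a delay calculation. Names are hypothetical. Two assumptions are labeled in the code: max_retries counts retries after the first attempt (so 3 means up to 4 attempts), and the base delay is a per-step setting that this spec does not define.

```typescript
type BackoffKind = "exponential" | "linear" | "fixed";

interface StepRetryPolicy {
  max_retries: number;
  retry_backoff: BackoffKind;
  base_delay_ms: number; // assumption: not a field in the definition schema above
}

interface StepFailedPayload {
  reason_code: string;
  retryable: boolean;
}

type FailureDecision =
  | { type: "retry"; attempt: number; delayMs: number }
  | { type: "follow_on_failure" };

// Delay before retry number `retryNumber` (1 = first retry).
function backoffDelayMs(kind: BackoffKind, retryNumber: number, baseMs: number): number {
  switch (kind) {
    case "fixed":
      return baseMs;
    case "linear":
      return baseMs * retryNumber;
    case "exponential":
      return baseMs * Math.pow(2, retryNumber - 1);
  }
}

// `attempt` is the 1-based attempt that just failed.
function decideOnFailure(
  failure: StepFailedPayload,
  policy: StepRetryPolicy,
  attempt: number,
): FailureDecision {
  const retriesUsed = attempt - 1;
  // Assumption: max_retries counts retries, not total attempts.
  if (!failure.retryable || retriesUsed >= policy.max_retries) {
    return { type: "follow_on_failure" };
  }
  const retryNumber = retriesUsed + 1;
  return {
    type: "retry",
    attempt: attempt + 1,
    delayMs: backoffDelayMs(policy.retry_backoff, retryNumber, policy.base_delay_ms),
  };
}
```

A non-retryable failure skips the retry budget entirely and takes the step's on_failure transition, which in the 3.1 example leads to a halt step.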

8.3 Observability

Structured logs via NestJS Logger; correlation_id in every log line.

Metrics (Prometheus-shaped counters/histograms):

workflow_instance_started_total{org,product,definition,mode}
workflow_instance_completed_total{org,terminal_state}
workflow_step_duration_seconds{kind} histogram
workflow_halt_total{reason_code}
workflow_intervention_total{action}

Health endpoints: /health (liveness), /health/ready (incl. Redis Streams reachability)

Audit trail: existing audit service consumes workflow.* events plus interventions.

9. Testing strategy

9.1 Unit tests (~50)

Workflow definition validator (all rules in 3.5)
Expression evaluator (sandboxed exec safety, edge cases)
State-machine transitions
Retry counter logic
ReasonCode registry (per-org override resolution)
Idempotency dedup
Snapshot freezing on instantiation

9.2 Integration tests (~10)

Real MySQL + Redis via testcontainers. Module event responses mocked with msw-style stubs (existing pattern from ai-review).

Happy path through ai_with_confidence_escalation template
Step timeout halts the workflow
Retry succeeds on second attempt
Definition publish + clone + activation
Client-driven mode: orchestrator observes external ai_review.completed and advances state without emitting ai_review.requested
Intervention: retry-step on halted workflow
Intervention: supersede creates new instance against latest definition
Definition archival doesn't break running case (snapshot still works)
Reason-code registry per-org override
Idempotency: duplicate *.completed doesn't double-advance

9.3 No live external services in CI

msw-style mocks for module event responses; testcontainers for MySQL + Redis.

10. Out-of-scope / deferred

Captured here to prevent scope creep:

Non-clinical workflows (architecture supports; no step kinds or templates ship)
Visual workflow editor UI (admin REST only; UI is a separate frontend track)
Step kinds: parallel, wait_for_histology, delay, loop, manual_intervention, assign_reviewer
Workflow migration of in-flight cases on definition update (admin re-snapshot is v2)
Backfill of existing cases on orchestrator launch
Managed event bus (SNS/SQS, EventBridge) — Redis Streams in v1
Cross-region or multi-cluster orchestrator deploys
BPMN import / interop
Approval workflows for definition publishing — single-actor in v1
Per-step authorization — any orchestrator:intervene can intervene on any step
Workflow-level SLA monitoring/alerting — basic metrics ship; alert rules are an ops concern

11. Open questions for implementation plan

Exact MySQL JSON column performance at scale — WorkflowInstance.context may grow large for long workflows; consider TEXT + manual JSON parse vs JSON column. Implementation plan should benchmark.
BullMQ timeout-job cleanup on step completion — need explicit job removal when step finishes early, or accept harmless no-op fires.
Definition validator implementation: hand-rolled vs Zod vs JSON Schema (ajv). Lean Zod for consistency with rest of codebase.
Expression evaluator library choice: expr-eval vs custom recursive-descent. Custom is ~150 LOC and removes a dep; lean custom.