Clinical Data Model: Design
Date: 2026-04-15
Status: Draft for review
Scope: v1 core data model, API, authentication, admin site, encryption, ingestion pipelines, infrastructure, documentation, testing.
1. Purpose and scope
1.1 Purpose
Provide a single, consistent, reusable data model and set of APIs that underpin all of our dermatology clinical platforms (initially OV2 and AIDA). The service owns clinical records in a secure, auditable, multi-tenant form, exposes them via versioned REST APIs, and provides an admin site for managing tenants, client credentials, consent types, image processing policies, diagnosis code mappings, patient merge review, and retention.
1.2 Initial platforms served
OV2 and AIDA are the first known consumers, not the only ones the platform will ever have: it must be designed so that future dermatology products (and potentially adjacent clinical domains) can be onboarded without a code deploy, simply by provisioning them through the admin site. No OV2 or AIDA assumption is allowed to leak into the schema, the API contracts, or the service code. The specifics below describe what those two products need on day one; the spec sections that follow describe the general-purpose platform that happens to satisfy them.
- OV2 — skin cancer diagnosis using AI and human teledermatology, based on dermoscopic and macroscopic imagery of skin lesions. Supports multi-lesion-per-image AI detection, multiple diagnoses per lesion (AI + human + histopathology ground truth), AI training consent, and late-arriving histology.
- AIDA — human-led teledermatology for non-cancer skin conditions (rashes, patches, blemishes), using macroscopic imagery only.
Both platforms share the same data model via a unified skin_finding abstraction with typed extensions. The same abstraction is designed to accommodate future products whose clinical shape is not yet known.
1.3 Extensibility for future products
A new product must be onboardable through the admin site alone, with no code change, no schema migration, and no service redeploy. To make that true, the following things are per-product configurable data rather than hard-coded behaviour:
- Authentication method and credentials: a new product's api_clients are provisioned via the admin site; their auth method (oauth2_client_credentials, external_jwt, future types) is a data-driven strategy chain.
- Actor-context JWT verification: the product's JWKS URL, issuer, and audience are stored on the product row and verified dynamically.
- Required consent types: consent types are per-organisation configurable and referenced from the product row; a new product chooses which already-configured types it requires at case creation, or asks an org admin to define new types.
- Diagnosis code mappings: AI label-to-SNOMED-CT mappings live in a per-product-referenced mapping table, maintained in the admin site. A new AI model's label space is a data edit, not a code change.
- Image processing policies: EXIF retention rules, derivative specs, max sizes, and virus-scan behaviour are per-product JSON edited via the admin site.
- Scope sets and rate limits: api_client scopes and rate limits are configured per client in the admin site, not enumerated in code.
- Webhook subscriptions: event types a product wants to consume are chosen at subscription time from the emitted event catalogue.
The shape of the clinical data itself is also designed for extension. Concretely:
- skin_finding.finding_type is an open-ended classifier with a built-in other value. New finding types can be added with a migration that appends to the enum or, if we choose a string-typed column at implementation time, with no migration at all. The spec does not mandate enum vs string; implementation should favour whichever is easiest to extend without downtime.
- Typed extension tables (like lesion_extension) are the pattern for product-specific structured fields. A new product needing product-specific structured data for its findings adds a new extension table (1:1 with skin_finding) rather than bloating the base or creating a parallel hierarchy. This is a schema change, but it's additive and follows the expand/contract discipline.
- Clinical context on cases (clinical_context_json) is a typed JSON blob whose schema is product-scoped; the admin site allows a product to register a JSON Schema for its clinical context, which the API validates against on write. New products bring their own schema without touching the core tables.
- Medications, diagnoses, consents, images, histology all use the hybrid "structured fields + free-text fallback + coded vocabulary" pattern so a new product that uses a different vocabulary or a different workflow can still participate without schema pressure.
What is explicitly not extensible without a code change (and therefore what new products must conform to):
- The authentication and authorization model.
- The tenancy model (organisation → product → api_client).
- The encryption model (application-level field encryption with per-patient DEKs).
- The async status resource shape and the long-poll mechanism.
- The audit log shape and its object-locked S3 archive contract.
- The webhook payload contract (references only, no PHI).
- The RFC 7807 error shape and the correlation-id threading.
These are the load-bearing invariants of the platform and changing them is a platform evolution, not a product onboarding.
1.4 Naming discipline
No table, column, route, scope, event type, metric, log field, config key, module name, or IAM role may contain ov2, aida, or any other product-specific token. CI lints for this. Product-specific data lives in rows, not names. When examples are needed in documentation, they are labelled as examples of a specific product's current usage rather than as the canonical shape.
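The CI lint mentioned above could be as small as the following sketch. It is an illustration under assumptions, not the project's actual lint: the function names, the default token list, and the segment-splitting rule are all hypothetical.

```typescript
// Hypothetical default token list; a real lint would read this from config.
const BANNED_PRODUCT_TOKENS: readonly string[] = ["ov2", "aida"];

// Split an identifier into lowercase word segments on separators and camelCase
// boundaries, e.g. "aidaImages" -> ["aida", "images"].
function identifierSegments(identifier: string): string[] {
  return identifier
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    .split(/[^A-Za-z0-9]+/)
    .filter((segment) => segment.length > 0)
    .map((segment) => segment.toLowerCase());
}

// Return every identifier that has a banned product token as a whole segment.
function findProductTokenLeaks(
  identifiers: readonly string[],
  banned: readonly string[] = BANNED_PRODUCT_TOKENS,
): string[] {
  const bannedSet = new Set(banned.map((token) => token.toLowerCase()));
  return identifiers.filter((identifier) =>
    identifierSegments(identifier).some((segment) => bannedSet.has(segment)),
  );
}
```

Matching whole segments rather than substrings avoids false positives such as paidAmount, at the cost of missing tokens fused into a single lowercase word (for example ov2case); a stricter lint might check both.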
1.5 In scope for v1
Core clinical data model; REST API for clinical operations; admin site; multi-method authentication (OAuth2 client credentials and external JWT via JWKS) plus staff SSO for admin; per-tenant multi-product data isolation; application-level field encryption for PHI with per-patient data keys and crypto-shredding; image and histology ingestion pipelines via S3 + async workers; webhook outbound events with polling fallback and long-poll blocking mode; audit logging with object-locked S3 archive; versioned consent with configurable types; configurable retention and GDPR erasure; comprehensive auto-generated and hand-written documentation; Terraform-managed AWS infrastructure; multi-region deployments (UK and US) with separate per-region instances; medications and prescriptions modelled to FHIR-compatible shape without the translation layer itself.
1.6 Out of scope for v1
- FHIR / HL7 / DICOM translation layers — deferred to follow-on sub-projects. The data model is shaped to be compatible but does not perform translation.
- Cross-region patient sharing — regions are fully independent.
- Appointments, scheduling, billing, pharmacy dispensing, e-prescribing gateways.
- Clinical decision support, drug interaction checking.
- A full practitioner registry — actor identity is stored as inline snapshots on records, not as first-class entities.
- Lab result SFTP/file-drop ingestion paths — initial lab integration is REST-only.
1.7 Regulatory context
The platform must satisfy both UK requirements (UK GDPR, NHS DSPT alignment, 8+ year retention minimums) and US requirements (HIPAA Security Rule, 6+ year retention minimums). It is deployed as independent regional stacks to satisfy data residency, using BAA-covered AWS services only.
2. System overview and deployment topology
2.1 Architectural shape
Monolithic NestJS codebase split into two ECS Fargate services at deploy time:
- clinical-api: the public clinical surface consumed by OV2, AIDA, and lab integrations. Exposes /v1/... endpoints.
- admin-api: the restricted admin surface consumed by the admin Next.js SPA and internal tooling. Exposes /admin/v1/... endpoints. Reachable only via a corporate VPN / allow-listed IPs at the ALB level.
Both services share a single database, a single schema, and a single codebase, but have distinct task roles, security groups, ingress rules, autoscaling policies, and deploy cadences. Independent deployability means an admin-api regression cannot take the clinical path down.
2.2 Per-region deployment
Each region (UK and US initially) gets its own complete, independent stack. No operational data crosses regions. Cross-region encrypted Aurora snapshot replication exists for disaster recovery only.
2.3 Within-region stack
- VPC — public, private-app, and private-data subnets across 3 AZs; NAT gateways per AZ.
- Public ALB → clinical-api ECS service; TLS 1.2+; HSTS; AWS WAF rate-based rules.
- Restricted ALB → admin-api ECS service; allow-listed to corporate VPN and internal CIDRs.
- ECS Fargate cluster hosting clinical-api and admin-api, spread across 3 AZs.
- Aurora MySQL cluster — writer + 2 readers across AZs, behind RDS Proxy, storage encryption via dedicated KMS key, point-in-time recovery, automated backups, cross-region encrypted snapshot replication for DR.
- ElastiCache Redis cluster — multi-AZ, used for idempotency keys, rate limit counters, and the long-poll pub/sub channel.
- S3 buckets: clinical-images, histology-reports, audit-log (object-locked, versioned), admin-static, docs, terraform-state (object-locked). All SSE-KMS with distinct per-bucket keys. All deny unencrypted PUTs via bucket policy.
- KMS keys: separate CMKs for Aurora storage, application field encryption (per-patient DEKs wrap under this), image bucket, histology bucket, audit bucket, Terraform state.
- SQS queues + DLQs — image-ingestion, histology-ingestion, webhook-delivery.
- Lambda workers — image-ingestion, histology-ingestion, webhook-delivery. Triggered by S3 events (ingestion) or SQS messages (delivery).
- EventBridge — routes S3 ObjectCreated events to the appropriate ingestion queue.
- CloudFront + S3 — admin SPA static assets and docs site.
- Secrets Manager + SSM Parameter Store — runtime credentials and configuration.
- CloudWatch + CloudTrail + AWS Config — logs, metrics, audit trails for the infrastructure itself, and compliance rule monitoring.
2.4 High availability
- Aurora multi-AZ cluster with automatic failover (<30s typical), RDS Proxy for transparent connection management across failovers.
- ECS services on 3 AZs with minimum-healthy-percent = 100, maximum-percent = 200, ALB connection draining, shallow liveness + deep readiness health checks.
- Blue/green deploys via CodeDeploy for clinical-api (the critical path); rolling deploys for admin-api.
- Expand/contract schema migrations enforced by CI so no deploy ever requires downtime.
- Graceful degradation — webhook failures go to DLQ without blocking API responses; lab ingestion failures queue rather than reject; image ingestion failures quarantine rather than lose data.
- Cross-region Aurora snapshot replication for disaster recovery; 4h RTO, ≤5min RPO via PITR.
2.5 Technology stack summary
- Service language/framework: Node.js + NestJS (TypeScript).
- ORM / migrations: Prisma or TypeORM with explicit migration files (decision at implementation time; either supports the expand/contract discipline).
- Database: Aurora MySQL 8-compatible.
- Admin site: Next.js SPA (static export) served from CloudFront + S3, calling admin-api.
- Infrastructure: Terraform, organised as modules + per-environment stacks.
- CI/CD: CodePipeline + CodeBuild + CodeDeploy. Independent pipelines for clinical-api, admin-api, admin SPA, docs, and infrastructure.
3. Data model
All tables include id (UUID v7, time-ordered), organisation_id (tenant scope), created_at, updated_at, and a soft-delete deleted_at. PHI fields are encrypted at the application level; deterministic hashes are stored for fields that need equality lookup.
3.1 Tenancy and access control
- organisation: top-level tenant. name, region, status, retention_policy_id.
- product: a registered client product (OV2, AIDA, …). Belongs to an organisation. code, display_name, image_processing_policy_json, required_consent_type_codes, diagnosis_code_mapping_id, actor_context_jwks_url, actor_context_issuer, actor_context_audience.
- api_client: credentials issued to a product. auth_method ∈ {oauth2_client_credentials, external_jwt}, auth_config_json, scopes[], allowed_ip_ranges[], status. Multiple api_clients per product allowed.
- webhook_subscription: belongs to an api_client. target_url, event_types[], signing_secret_kms_ciphertext, status, last_delivery_status.
3.2 Patient identity
- patient: longitudinal anchor per organisation. Encrypted: given_name, family_name, dob, sex_at_birth, gender_identity, postal_code, email, phone. Plaintext lookup: email_hash, postal_code_hash, dob_hash. status ∈ {active, provisional, merged}, merged_into_patient_id, encrypted_dek.
- patient_identifier: strong identifiers (NHS number, MRN, CHI, insurance id, etc.). (scheme, value_ciphertext, value_hash). Unique index on (organisation_id, scheme, value_hash).
- patient_merge_candidate: fuzzy match review queue. provisional_patient_id, candidate_ranked_json, match_features_json, status ∈ {pending_review, merged, dismissed, auto_merged_high_confidence}.
- patient_merge_history: merge audit + unmerge support. surviving_patient_id, absorbed_patient_id, absorbed_snapshot_ciphertext, actor_snapshot, merged_at, reason, retention_expires_at.
3.3 Cases and findings
- case: an assessment event. Belongs to patient + product. external_reference (product's own case id, unique per org+product), opened_at, clinical_context_json, status ∈ {open, awaiting_histology, completed}. Status never blocks histology addition. created_by_actor inline snapshot fields.
- skin_finding: unified abstraction for lesion / rash / patch / blemish / other. Belongs to a case. finding_type, body_site_code, body_site_free_text, body_map_x, body_map_y, body_map_orientation, parent_finding_id (lineage, same-patient only), clinical_notes_ciphertext, created_by_actor.
- lesion_extension: 1:1 optional extension for finding_type = 'lesion'. diameter_mm_long_axis, diameter_mm_short_axis, elevation, pigmentation, OV2-specific structured fields.
- body_site: reference table seeded from a SNOMED CT body-site subset curated for dermatology.
3.4 Images
- image: metadata only; bytes in S3. s3_bucket, s3_key, content_hash_sha256, mime_type, capture_type ∈ {dermoscopic, macroscopic, other}, captured_at, uploaded_by_actor, exif_retained_json, width_px, height_px, ingestion_status ∈ {pending, processing, processed, quarantined, failed}, ingestion_error_code.
- image_derivative: derived artifacts (thumbnail, web, EXIF-stripped master). image_id, derivative_name, s3_bucket, s3_key, content_hash_sha256, width_px, height_px, format.
- finding_image: many-to-many join. finding_id, image_id, bbox_x1, bbox_y1, bbox_x2, bbox_y2, bbox_coord_system ∈ {pixel, normalized}, bbox_image_width_px, bbox_image_height_px, bbox_source ∈ {ai_detection, human_annotation}, bbox_confidence, is_primary. Normalized coordinates are the source of truth; pixel coordinates are cached against the canonical image dimensions.
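To illustrate the rule that normalized coordinates are authoritative and pixel coordinates are a derived cache, here is a minimal conversion sketch. The type and function names are hypothetical, and rounding to the nearest pixel is an assumption; the spec does not state a rounding rule.

```typescript
// Normalized box: every coordinate is a fraction of the image size in [0, 1].
interface NormalizedBBox { x1: number; y1: number; x2: number; y2: number; }
// Pixel box derived against the canonical image dimensions.
interface PixelBBox { x1: number; y1: number; x2: number; y2: number; }

function toPixelBBox(box: NormalizedBBox, widthPx: number, heightPx: number): PixelBBox {
  const coords = [box.x1, box.y1, box.x2, box.y2];
  const outOfRange = coords.some((v) => !Number.isFinite(v) || v < 0 || v > 1);
  if (outOfRange || box.x1 > box.x2 || box.y1 > box.y2) {
    throw new RangeError("normalized bbox must satisfy 0 <= x1 <= x2 <= 1 and 0 <= y1 <= y2 <= 1");
  }
  // Assumption: round to the nearest whole pixel.
  return {
    x1: Math.round(box.x1 * widthPx),
    y1: Math.round(box.y1 * heightPx),
    x2: Math.round(box.x2 * widthPx),
    y2: Math.round(box.y2 * heightPx),
  };
}
```

Because the pixel values are a pure function of the normalized box and the canonical dimensions, they can be recomputed at any time if a derivative is regenerated at a different size.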
3.5 Diagnoses and histology
- diagnosis: belongs to a finding. source ∈ {ai, human_clinician, histopathology}, free_text_ciphertext, code_system, code_value, code_display, confidence, diagnosed_at, actor_snapshot, notes_ciphertext.
- diagnosis_code_mapping: per-org mapping table (source_system, source_code) → (target_system, target_code, target_display). Used to resolve AI labels to SNOMED CT at ingestion time.
- histology_report: structured histology result. Belongs to a case; optionally to a finding. report_type, result_summary_ciphertext, received_at, reporting_lab_snapshot_json.
- report_file: attached histology file metadata. histology_report_id, s3_bucket, s3_key, mime_type, content_hash_sha256, size_bytes, uploaded_by_actor, virus_scan_status.
3.6 Consent
- consent_type: per-org configurable. code, display_name, description, legal_basis, required_for_case_creation, active.
- consent_text_version: immutable versioned text. consent_type_id, version, effective_from, body, locale.
- consent_record: versioned per patient per type. patient_id, consent_type_id, consent_text_version_id, status ∈ {granted, denied, withdrawn}, captured_at, captured_via_case_id, actor_snapshot. Latest row for (patient, type) is the current effective state.
3.7 Medications
- medication_code: reference vocabulary rows. (system, code, display) with ingredient, form, strength. Seeded per region (dm+d for UK, RxNorm for US, ATC international).
- medication_statement: belongs to a case; optionally to a finding. statement_type ∈ {prescribed, reported, administered}, medication_code_system, medication_code_value, medication_code_display, medication_free_text, dosage_text, dose_amount, dose_unit, route, frequency, duration_days, as_needed, start_date, end_date, status ∈ {active, completed, on_hold, stopped, cancelled, entered_in_error}, reason_free_text, reason_diagnosis_id, prescriber_actor_snapshot, notes_ciphertext.
- medication_statement_history: append-only log of status and dosage changes.
Shape is FHIR-MedicationRequest / MedicationStatement compatible.
3.8 Audit and retention
- audit_log: hot cache of recent audit entries. event_type, entity_type, entity_id, actor_snapshot, before_ciphertext, after_ciphertext, timestamp, correlation_id, s3_archive_key. Rolling in-DB retention (default 90 days); object-locked S3 is the durable system of record.
- retention_policy: per-org rules, evaluated nightly. Triggers soft-delete or crypto-shred on expiry. Supports legal holds that block both.
3.9 Deliberate omissions from v1
No scheduling, billing, pharmacy dispensing, practitioner registry (actor identity lives inline on records), FHIR/HL7/DICOM translation tables, or DICOM-specific metadata.
4. API surface
4.1 Conventions
- URL-prefixed versioning (/v1/...).
- JSON over HTTPS only; TLS 1.2+; HSTS.
- OpenAPI 3 auto-generated from NestJS decorators; published at /v1/openapi.json.
- Cursor-based pagination (?cursor=, next_cursor in responses).
- RFC 7807 Problem+JSON errors with correlation_id and violations[] for validation failures. PHI never appears in error detail strings.
- X-Correlation-Id header on every request; generated if absent; echoed on every response, log line, audit row, webhook payload.
- Idempotency-Key header accepted on all non-idempotent writes; retention 24h in Redis.
- Per-api_client rate limits enforced at WAF + in-app; configurable in the admin panel; separate limits for interactive vs ingestion endpoints.
- URL-prefix versioning for breaking changes; /v1 deprecation surfaced via Sunset / Deprecation headers plus admin panel notices plus webhook events; 12-month support window by default.
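As an illustration of the error convention above, the sketch below builds an RFC 7807 body for a validation failure. The type, title, status, detail, and instance members come from RFC 7807; correlation_id and violations[] come from this spec. The shape of each violation, the 422 status, the problem type URI, and the function name are assumptions.

```typescript
// Assumed violation shape: names the offending field, never echoes its value.
interface Violation { field: string; message: string; }

interface ProblemDetails {
  type: string;
  title: string;
  status: number;
  detail?: string;
  instance?: string;
  correlation_id: string;
  violations?: Violation[];
}

function validationProblem(correlationId: string, violations: Violation[]): ProblemDetails {
  return {
    type: "https://example.invalid/problems/validation-failed", // hypothetical URI
    title: "Validation failed",
    status: 422, // assumption: the spec does not fix the status code
    // The detail string is built from counts only, so no PHI can leak into it.
    detail: `${violations.length} field(s) failed validation`,
    correlation_id: correlationId,
    violations,
  };
}
```

Such a body would be served with the application/problem+json media type defined by RFC 7807.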
4.2 Authentication headers
- Authorization: api_client credential. Either an OAuth2 client-credentials access token issued by our /v1/oauth/token endpoint, or an external JWT verified against a configured JWKS URL.
- X-Actor-Context: short-lived JWT (≤5 min) signed by the calling product, verified against the product.actor_context_jwks_url. Claims: external_user_id, display_name, role, professional_id, professional_id_type. Snapshotted onto records and audit entries. Optional for lab api_clients.
4.3 Resource routes (clinical-api, /v1)
Patients
- POST /patients creates or upserts (strong-identifier match → fuzzy fallback → provisional). Response reports matched_existing / created / provisional_pending_review.
- GET /patients/{id}; PATCH /patients/{id}.
- GET /patients/{id}/cases.
- POST /patients/search.
Cases
- POST /cases; GET /cases/{id}; PATCH /cases/{id}.
- POST /cases/{id}/findings.
Findings
- GET /findings/{id}; PATCH /findings/{id}.
- POST /findings/{id}/diagnoses.
- POST /findings/{id}/images/{image_id} attaches an image with bbox + source.
- POST /findings/{id}/medications.
- POST /findings/{id}/lineage.
Images (two-phase upload)
- POST /images:initiate returns a pre-signed S3 PUT URL, image id, and status URL.
- GET /images/{id} returns metadata + time-limited pre-signed GET URLs.
- GET /images/{id}/status returns the async operation status (see §4.5).
Histology
- POST /histology:initiate accepts a structured result and returns pre-signed file upload URLs.
- GET /histology/{id}; PATCH /histology/{id}.
- GET /histology/{id}/status.
Consents
- GET /patients/{id}/consents; POST /patients/{id}/consents.
- GET /consents/types.
Medications
- GET /medications/{id}; PATCH /medications/{id}; GET /medications/{id}/history.
Events
- GET /events?since_cursor= is the polling alternative to webhooks.
4.4 Admin routes (/admin/v1, restricted ALB)
Staff SSO only; no api_client credentials accepted.
- /admin/v1/organisations, /products, /api-clients, /webhook-subscriptions
- /admin/v1/consent-types, /consent-text-versions
- /admin/v1/diagnosis-code-mappings
- /admin/v1/image-processing-policies
- /admin/v1/patient-merge-candidates, /patient-merge-history (unmerge within retention window)
- /admin/v1/retention-policies, /legal-holds
- /admin/v1/audit (hot DB cache + ranged S3 archive queries)
4.5 Async status model
Three complementary mechanisms give clients deterministic visibility into async operations.
1. Synchronous by default — everything that can run inline does.
2. Status resource — async writes return 202 Accepted with a Location header pointing at /v1/{resource}/{id}/status. The status resource has a uniform shape:
{
"resource_type": "image",
"resource_id": "01H...",
"status": "processing",
"stage": "exif_processing",
"stages_completed": ["uploaded", "virus_scanning"],
"stages_remaining": ["exif_processing", "deriving", "complete"],
"progress_percent": 40,
"updated_at": "2026-04-15T12:34:56Z",
"next_poll_after_ms": 2000,
"terminal": false,
"error": null,
"correlation_id": "..."
}
Terminal states: processed, quarantined, failed, cancelled. Failure populates an RFC 7807 error object with a machine-readable code.
3. Long-poll blocking mode — any async or status endpoint accepts ?wait=true&timeout_ms=<1..30000>. The request holds open until the resource reaches a terminal state (returns immediately with terminal state) or the timeout elapses (returns current state with a fresh next_poll_after_ms). Backed by Redis pub/sub keyed on resource id; no DB polling. Max 30s to respect ALB idle timeouts.
Clients can mix blocking, polling, and webhook subscriptions freely. Webhooks are the push layer; polling is always available.
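A client-side view of this model might look like the sketch below. The field names mirror the status resource shown above and the terminal states are those listed in this section; the helper names and the choice to clamp an out-of-range timeout (rather than reject it) are assumptions.

```typescript
// Terminal states as listed in this section.
const TERMINAL_STATES = ["processed", "quarantined", "failed", "cancelled"] as const;

// Subset of the status resource that a polling client needs.
interface AsyncStatus {
  resource_type: string;
  resource_id: string;
  status: string;
  next_poll_after_ms: number;
  terminal: boolean;
}

function isTerminalState(state: string): boolean {
  return (TERMINAL_STATES as readonly string[]).includes(state);
}

// Build a long-poll URL, clamping timeout_ms into the documented 1..30000 range.
function longPollUrl(statusUrl: string, timeoutMs: number): string {
  const clamped = Math.min(30000, Math.max(1, Math.floor(timeoutMs)));
  const separator = statusUrl.includes("?") ? "&" : "?";
  return `${statusUrl}${separator}wait=true&timeout_ms=${clamped}`;
}

// Delay before the next plain poll; zero once the resource is terminal.
function nextPollDelayMs(current: AsyncStatus): number {
  return current.terminal ? 0 : current.next_poll_after_ms;
}
```

A client would typically call the long-poll URL in a loop and stop as soon as the returned resource reports terminal: true, falling back to plain polling with nextPollDelayMs if long-polling is unavailable.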
5. Authentication, authorization, tenancy, encryption
5.1 API client authentication
Pluggable strategy chain behind a single AuthGuard. Each api_client declares its method:
- OAuth2 client credentials: built-in /v1/oauth/token endpoint issues 15-min JWT access tokens scoped to the api_client's scopes. Secrets stored as argon2id hashes with grace-period rotation (two secrets valid simultaneously for N hours).
- External JWT (JWKS): api_client declares a JWKS URL + expected iss / aud. Service fetches and caches JWKS (1h refresh), verifies signature, maps subject to api_client. No secret stored locally.
Adding new methods (mTLS, SigV4) is a new strategy class with no schema change.
5.2 Actor context (end-user identity)
Separate from api_client auth. Every clinical request must carry an X-Actor-Context JWT signed by the calling product, verified against product.actor_context_jwks_url. Optional for lab api_clients.
5.3 Authorization
Fine-grained scopes: patients:{read,write}, cases:{read,write}, images:{read,write}, histology:{read,write}, consents:{read,write}, medications:{read,write}, events:read, plus cross-cutting cross_product_read and patient_merge_review.
Enforcement layers, top-down: auth guard → scope decorator → tenancy guard → row-level DB check.
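The top-down ordering of the enforcement layers can be sketched as a single decision function that reports the first layer to refuse. This is a simplification under assumptions: the types and names are hypothetical, the real layers are separate NestJS guards and decorators, and the row-level DB check is not modelled here.

```typescript
interface CallerContext {
  authenticated: boolean;
  scopes: readonly string[];
  organisationId: string;
}

interface TargetResource { organisationId: string; }

type AccessDecision =
  | { allowed: true }
  | { allowed: false; deniedBy: "auth" | "scope" | "tenancy" };

// Evaluate layers in the documented order and stop at the first refusal.
function authorize(ctx: CallerContext, requiredScope: string, target: TargetResource): AccessDecision {
  if (!ctx.authenticated) return { allowed: false, deniedBy: "auth" };
  if (!ctx.scopes.includes(requiredScope)) return { allowed: false, deniedBy: "scope" };
  if (target.organisationId !== ctx.organisationId) return { allowed: false, deniedBy: "tenancy" };
  return { allowed: true };
}
```

Evaluating the cheapest checks first means an unauthenticated or under-scoped request is refused before any tenant data is touched.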
5.4 Admin authentication
Staff SSO via OIDC against the corporate IdP. MFA enforced at the IdP; the app requires amr claim evidence for high-privilege operations (issuing api_clients, merging patients, retention changes). Internal roles in admin_user table per-org. All admin actions audited identically to clinical actions.
5.5 Tenancy enforcement
Every tenant-scoped table carries organisation_id as a not-null indexed column. A query interceptor injects organisation_id = :ctx_org on every query originating from clinical-api, using the authenticated client's org from request context. Bypass requires an explicit "cross-org" decorator permitted only in admin-api code paths and logged. Cross-product access is controlled separately by the cross_product_read scope with an identical product_id interceptor.
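The interceptor behaviour described above could be approximated as a pure function over a filter object, as in this sketch. The filter representation, the function name, and the productScoped flag are assumptions; the actual interceptor would hook into whichever ORM is chosen at implementation time.

```typescript
type WhereClause = Record<string, unknown>;

interface TenantContext {
  organisationId: string;
  productId: string;
  scopes: readonly string[];
}

// Force organisation_id onto every filter; add product_id for product-scoped
// tables unless the caller holds cross_product_read.
function applyTenantScope(where: WhereClause, ctx: TenantContext, productScoped: boolean): WhereClause {
  if ("organisation_id" in where && where.organisation_id !== ctx.organisationId) {
    throw new Error("filter targets a different organisation");
  }
  const scoped: WhereClause = { ...where, organisation_id: ctx.organisationId };
  if (productScoped && !ctx.scopes.includes("cross_product_read")) {
    scoped.product_id = ctx.productId;
  }
  return scoped;
}
```

Rejecting a conflicting organisation_id, rather than silently overwriting it, makes an attempted cross-org query visible in logs instead of returning an empty result.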
5.6 Encryption at rest — three layers
Layer 1 — Aurora storage encryption. Dedicated KMS CMK. Encrypts data, backups, snapshots, and cross-region replicas. KMS annual rotation.
Layer 2 — Application-level envelope encryption for PHI fields. Dedicated "field encryption" CMK. Per-patient data keys (DEKs) wrapped by the CMK, stored as patient.encrypted_dek. PHI fields encrypted with the DEK using AES-256-GCM with per-row random IVs. Deterministic encryption (AES-256-SIV with per-field salt) used only for fields requiring equality lookup (email_hash, postal_code_hash, patient_identifier.value_hash); salts differ per field to prevent cross-field correlation. DEKs cached in memory for request lifetime only; never persisted unwrapped. Crypto-shredding on GDPR erasure destroys the patient's wrapped DEK, rendering all encrypted PHI permanently unreadable including in backups and audit archives. Structural records survive for compliance.
Layer 3 — S3 SSE-KMS with separate CMKs per bucket type. Bucket policies deny unencrypted PUTs. Pre-signed URLs scoped per-object, ≤5 min TTL.
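The Layer 2 envelope scheme can be sketched with Node's built-in crypto module. This is an illustration only: a locally generated key stands in for the field-encryption CMK (in the real design the wrap and unwrap steps would be KMS calls), the function names are hypothetical, and the deterministic lookup hashes are not shown.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

interface SealedBox { iv: Buffer; tag: Buffer; ciphertext: Buffer; }

// AES-256-GCM with a fresh random 96-bit IV per call.
function seal(key: Buffer, plaintext: Buffer): SealedBox {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv, tag: cipher.getAuthTag(), ciphertext };
}

function unseal(key: Buffer, box: SealedBox): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, box.iv);
  decipher.setAuthTag(box.tag);
  // final() throws if the key is wrong or the data was tampered with.
  return Buffer.concat([decipher.update(box.ciphertext), decipher.final()]);
}

// ASSUMPTION: local stand-in for the KMS CMK; never do this in production.
const wrappingKey = randomBytes(32);

// Create a per-patient DEK and return only its wrapped form for storage.
function newEncryptedDek(): SealedBox {
  return seal(wrappingKey, randomBytes(32));
}

function encryptField(encryptedDek: SealedBox, value: string): SealedBox {
  const dek = unseal(wrappingKey, encryptedDek); // unwrapped in memory only
  return seal(dek, Buffer.from(value, "utf8"));
}

function decryptField(encryptedDek: SealedBox, box: SealedBox): string {
  const dek = unseal(wrappingKey, encryptedDek);
  return unseal(dek, box).toString("utf8");
}
```

In this model, crypto-shredding corresponds to discarding the stored encryptedDek: without it no DEK can be unwrapped, so every field sealed under that DEK fails to decrypt.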
5.7 Encryption in transit
TLS 1.2+ everywhere: ALB listeners, ECS → RDS Proxy → Aurora, service → KMS/SQS/S3 (AWS SDK enforced), outbound webhook calls (reject targets without valid TLS).
5.8 Key and secret management
- Aurora storage CMK — KMS automatic annual rotation.
- Field-encryption CMK — manual yearly rotation; old versions retained for decryption; lazy re-wrap on access plus background forced re-wrap before retirement.
- S3 bucket CMKs — automatic annual rotation.
- Webhook signing secrets and OAuth2 client secrets — admin-panel-driven rotation with overlap windows.
- Secrets Manager for DB credentials and system secrets with automatic rotation where supported.
- ECS task roles scoped per service: clinical-api and admin-api have distinct IAM roles and cannot read each other's secrets or KMS keys.
6. Ingestion pipelines
6.1 Image ingestion
Pattern: client uploads directly to S3 via pre-signed URL; S3 ObjectCreated event drives processing; client observes status via the async status model in §4.5.
- POST /v1/images:initiate inserts an image row (status=pending) and returns a pre-signed PUT URL and status URL.
- Client PUTs bytes directly to S3.
- S3 ObjectCreated → EventBridge → image-ingestion SQS → image-ingestion Lambda.
- Lambda stages, each idempotent on content hash: read object head, virus scan (ClamAV layer, quarantine on detect), read EXIF and apply per-product retention policy (write retained fields, strip original in derivative), probe dimensions, generate derivatives per per-product policy, update image row to processed, emit image.processed webhook.
- Failure: virus detected → quarantined; unsupported mime / corrupt → failed; transient KMS/S3/DB → SQS retry with exponential backoff, max 5 attempts, then DLQ + operator alarm.
No client-side finalize call. Processing is driven by the S3 event, so a client that crashes mid-upload leaves behind only a pending image row and no half-processed state. A nightly sweep reaps pending rows >24h old without corresponding S3 objects.
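The failure handling in the pipeline above amounts to a small routing table, sketched here. The failure categories and the five-attempt limit come from the text; the type names, the 30-second base delay, and the doubling factor are assumptions, since the spec says only "exponential backoff".

```typescript
type FailureKind = "virus_detected" | "unsupported_mime" | "corrupt" | "transient";

type FailureOutcome =
  | { action: "set_status"; ingestionStatus: "quarantined" | "failed" }
  | { action: "retry"; delaySeconds: number }
  | { action: "dead_letter" };

const MAX_INGESTION_ATTEMPTS = 5;

// attempt is the 1-based number of the attempt that just failed.
function routeIngestionFailure(kind: FailureKind, attempt: number, baseDelaySeconds = 30): FailureOutcome {
  switch (kind) {
    case "virus_detected":
      return { action: "set_status", ingestionStatus: "quarantined" };
    case "unsupported_mime":
    case "corrupt":
      return { action: "set_status", ingestionStatus: "failed" };
    case "transient":
      if (attempt >= MAX_INGESTION_ATTEMPTS) return { action: "dead_letter" };
      // Assumed schedule: base delay doubled for each prior attempt.
      return { action: "retry", delaySeconds: baseDelaySeconds * 2 ** (attempt - 1) };
  }
}
```

Only transient failures are retried; the permanent categories set a terminal ingestion_status immediately so the client's status resource resolves without waiting for retries to exhaust.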
EXIF policy — per-product JSON in product.image_processing_policy_json. Fields to retain, derivative specs, virus scan flag, max size. Edited via admin panel with structured form validation.
6.2 Histology ingestion
Same two-phase pattern, supporting two client shapes:
- Lab integration: lab api_client calls POST /v1/histology:initiate with structured result JSON plus a manifest of attachments. Response includes histology_report_id, resolved case reference, pre-signed PUT URLs, optional expected_content_hashes the worker verifies.
- Manual upload by clinician: product backend makes the same call on the clinician's behalf (api_client + actor context both present).
Late binding to findings. Histology arriving months after a case is expected. If the source supplies a finding reference, we attach there; otherwise the report belongs to the case and can be reassigned to a finding later via PATCH /v1/histology/{id}. Case status never blocks histology attachment.
Lab identity and scope. Labs are dedicated api_clients with histology:write scope only, provisioned per organisation via the admin panel. Labs reference cases by the opaque external_reference supplied out-of-band; they cannot enumerate cases.
6.3 Webhook delivery
Publish events to webhook-delivery SQS; webhook-delivery Lambda fans out to matching subscriptions. For each subscription: build PHI-free payload (references only), HMAC-SHA256 sign with the subscription secret into X-Webhook-Signature, POST with 5s connect / 10s total timeout.
On 2xx: record success. On 4xx (except 429): record permanent failure, alert subscription owner via admin panel, no retry. On 5xx / 429 / network: exponential backoff retry (1m, 5m, 30m, 2h, 12h — total 6 attempts over ~15h), then DLQ + admin panel alert.
Payload contains no PHI. Only references (organisation, product, resource type, resource id, event id, timestamp, correlation id). Clients fetch detail via the authenticated API — which means the client must still have the right scope to see it. At-least-once delivery; clients dedupe on event_id.
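Signing and receiver-side verification of the webhook body might look like the following sketch, using Node's built-in crypto module. HMAC-SHA256 and the retry delays come from the text above; hex encoding of the X-Webhook-Signature value, signing the raw request body, and the function names are assumptions.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sender side: compute the value placed in X-Webhook-Signature (assumed hex).
function signWebhookBody(secret: string, rawBody: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Receiver side: recompute and compare in constant time.
function verifyWebhookSignature(secret: string, rawBody: string, signatureHex: string): boolean {
  const expected = Buffer.from(signWebhookBody(secret, rawBody), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on unequal lengths, so check length first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}

// Retry delays from the schedule above, in minutes: 1m, 5m, 30m, 2h, 12h.
const RETRY_DELAYS_MINUTES: readonly number[] = [1, 5, 30, 120, 720];

// Total wait across all retries, in hours.
function totalRetryWindowHours(): number {
  return RETRY_DELAYS_MINUTES.reduce((sum, minutes) => sum + minutes, 0) / 60;
}
```

Receivers should verify against the exact bytes received, before JSON parsing, and then dedupe on event_id because delivery is at-least-once.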
Event types emitted in v1:
- patient.created, patient.merged, patient.merge_candidate_flagged
- case.created, case.updated, case.histology.attached
- finding.created, finding.updated, finding.lineage_linked
- diagnosis.added
- image.processed, image.quarantined, image.failed
- medication.recorded, medication.status_changed
- consent.changed
New event types are additive and do not require a version bump.
7. Consent, audit, retention
7.1 Consent
Versioned, typed, per-patient. Consent types are configurable per organisation via the admin panel; each has a code, display name, description, GDPR legal basis, and required_for_case_creation flag. Multiple consents may be captured at case creation time. Consent text is versioned and immutable once published; patients consent against a specific text version. Current effective consent for a (patient, type) is the latest row in consent_record. Withdrawal is a new row with status=withdrawn.
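The "latest row wins" rule can be sketched as a fold over a patient's consent rows. The record fields mirror consent_record; the function name is hypothetical, and ordering by captured_at is an assumption (the spec says only "latest row", which could equally mean insertion order via the time-ordered id).

```typescript
type ConsentStatus = "granted" | "denied" | "withdrawn";

interface ConsentRow {
  patientId: string;
  consentTypeId: string;
  status: ConsentStatus;
  capturedAt: string; // ISO 8601 timestamp
}

// Return the current effective row for (patient, type), or undefined if none.
function currentConsent(rows: readonly ConsentRow[], patientId: string, consentTypeId: string): ConsentRow | undefined {
  let latest: ConsentRow | undefined;
  for (const row of rows) {
    if (row.patientId !== patientId || row.consentTypeId !== consentTypeId) continue;
    // Assumption: "latest" means greatest captured_at; ties go to the later row.
    if (!latest || Date.parse(row.capturedAt) >= Date.parse(latest.capturedAt)) latest = row;
  }
  return latest;
}
```

Because withdrawal is just another row, the history of grants and withdrawals stays intact and the effective state is always derivable from it.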
Consent for care and consent for AI training are separable under GDPR — the schema enforces distinct consent_type rows, and the admin panel validates that they cannot be bundled in the UI. Consent withdrawal is distinct from GDPR erasure; withdrawing AI training consent excludes the images from future training runs but does not delete the images or invalidate already-trained models (those are governed separately).
7.2 Audit logging
Every write logs an event with actor snapshot, before/after diff (PHI-encrypted where relevant), entity, correlation id, and timestamp. Sensitive reads (patient fetch, image download, diagnosis view) are also logged. The in-DB audit_log table is a hot cache with rolling retention (default 90 days); every entry is also streamed asynchronously to an object-locked, versioned S3 bucket with a separate KMS key, which is the durable system of record for compliance. A compromised database cannot tamper with historical audit. Admin actions are audited identically to API actions.
7.3 Retention and erasure
Per-organisation configurable retention policy, evaluated nightly. Default rules align with UK NHS (8+ years) and US HIPAA (6+ years) minimums; specific deployments may extend. Legal holds block both soft-delete and crypto-shred on affected records.
GDPR erasure is implemented via crypto-shredding: destroy the patient's wrapped DEK (stored as patient.encrypted_dek, per §5.6). All field-encrypted PHI becomes permanently unreadable in the live database, in backups, and in audit archives. Structural records (case occurred, diagnosis of type X was made, image existed) survive for audit and ML integrity with only opaque ids; they can no longer reveal identity. S3 image and report objects belonging to the erased patient are deleted outright. Merge history is crypto-shredded alongside the patient's main data key.
8. Patient identity matching
Within an organisation, a patient seen in multiple products should be one record where possible.
- Strong-identifier match. If the request supplies a strong identifier (NHS number in UK, MRN/other in US) via patient_identifier, match exactly on (scheme, value_hash). Match → use existing patient.
- Fuzzy fallback when no strong identifier or no strong-id match. Score on name + DOB + postal code + any available identifiers. High-confidence unique match → use existing patient.
- Ambiguous or multiple candidates → create a provisional patient record immediately so ingestion never blocks, and enqueue the match candidates in patient_merge_candidate for admin review.
- No match → create new patient, assume unique.
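The decision ladder above can be sketched as a pure function. The thresholds, the score scale, and all type names are assumptions for illustration; the strong-identifier lookup and the fuzzy scoring themselves are taken as inputs rather than implemented here.

```typescript
// Hypothetical types; scores are assumed to be normalised to [0, 1].
interface MatchCandidate {
  patientId: string;
  score: number;
}

type MatchOutcome =
  | { kind: "existing"; patientId: string; via: "strong_id" | "fuzzy" }
  | { kind: "provisional"; candidates: MatchCandidate[] } // candidates go to the merge queue
  | { kind: "new" };

const HIGH_CONFIDENCE = 0.9; // illustrative threshold
const CANDIDATE_FLOOR = 0.6; // below this a candidate is not considered a match at all

function resolvePatient(
  strongIdMatch: string | undefined, // patient id found via (scheme, value_hash), if any
  fuzzyCandidates: MatchCandidate[],
): MatchOutcome {
  // 1. An exact strong-identifier match always wins.
  if (strongIdMatch !== undefined) {
    return { kind: "existing", patientId: strongIdMatch, via: "strong_id" };
  }

  // 2. Fuzzy fallback: keep only plausible candidates.
  const plausible = fuzzyCandidates.filter((c) => c.score >= CANDIDATE_FLOOR);
  if (plausible.length === 0) return { kind: "new" };

  // 3. A single plausible candidate with high confidence is accepted outright.
  const only = plausible[0];
  if (plausible.length === 1 && only !== undefined && only.score >= HIGH_CONFIDENCE) {
    return { kind: "existing", patientId: only.patientId, via: "fuzzy" };
  }

  // 4. Anything else is ambiguous: create a provisional record and queue for review.
  return { kind: "provisional", candidates: plausible };
}
```

Because the function never waits on a reviewer, ingestion proceeds in every branch; only the provisional outcome produces work for the admin merge queue.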
Admin merge flow. The admin panel surfaces the merge queue. An admin can merge a provisional patient into an existing one, cascading a re-point of all related records (cases, findings, images, consents, medications, histology) with full audit. Merges are reversible within a retention window via patient_merge_history, which holds an encrypted snapshot of the absorbed record sufficient to restore. Matching can be re-run on demand against the live data.
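A reduced, in-memory sketch of a reversible merge follows. It re-points only cases, whereas the real cascade covers every related record type, and it stores just the ids needed to undo the merge instead of an encrypted snapshot. All names are hypothetical.

```typescript
// Hypothetical, simplified shapes for illustration.
interface CaseRow {
  id: string;
  patientId: string;
}

interface MergeHistoryEntry {
  absorbedPatientId: string;
  survivingPatientId: string;
  repointedCaseIds: string[]; // exactly the rows this merge changed
}

// Re-point every case of the absorbed patient and record what was changed.
function mergePatients(
  cases: CaseRow[],
  absorbedPatientId: string,
  survivingPatientId: string,
): MergeHistoryEntry {
  const repointedCaseIds: string[] = [];
  for (const c of cases) {
    if (c.patientId === absorbedPatientId) {
      c.patientId = survivingPatientId;
      repointedCaseIds.push(c.id);
    }
  }
  return { absorbedPatientId, survivingPatientId, repointedCaseIds };
}

// Undo restores only the rows recorded in the history entry, so cases that
// always belonged to the surviving patient are left untouched.
function undoMerge(cases: CaseRow[], entry: MergeHistoryEntry): void {
  for (const c of cases) {
    if (entry.repointedCaseIds.indexOf(c.id) !== -1) {
      c.patientId = entry.absorbedPatientId;
    }
  }
}
```

Recording the affected row ids at merge time is what makes the undo safe: after a merge the two patients' cases are otherwise indistinguishable.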
Post-hoc matching is supported — a provisional record can be reviewed and merged long after ingestion. Ingestion is never blocked by the matching pipeline.
9. Documentation¶
Docs-as-code; everything lives in the repo, generated from sources of truth where possible, published on every main merge, versioned alongside the API.
9.1 Content¶
- API reference — auto-generated from NestJS OpenAPI decorators; rendered with Scalar or ReDoc; OpenAPI spec also published as a downloadable artifact for SDK generation.
- Integration guides — getting started, image upload, histology results, patient identity matching, consent, webhooks, error handling. One complete worked example per product (OV2, AIDA).
- Admin site documentation — one page per admin capability with step-by-step runbooks and Playwright-generated screenshots kept fresh by CI.
- Data dictionary and ERD — generated from the schema on every build; entity relationship diagram rendered inline.
- Architecture documentation — this design doc, ADRs in docs/adr/, system and network diagrams from Mermaid sources, runtime topology generated from Terraform state.
- Operational runbooks — one per on-call scenario (Aurora failover, deployment rollback, KMS rotation, DLQ drain, ingestion backlog, webhook failure, quarantined image review, legal hold, DR restore, forgotten-admin recovery). Structured: symptoms / diagnosis / remediation / verification / follow-up.
- Compliance and security documentation — data flow diagrams for DPIA / HIPAA risk assessment, encryption model explainer, audit log shape, tenant isolation explainer, incident response outline.
- Developer documentation — local dev setup (docker-compose with MySQL, Minio, Localstack), testing strategy, migration authoring guide, contribution guide.
9.2 Delivery¶
- Stack: MkDocs Material (or Docusaurus if preferred). Plugins for OpenAPI ingestion and auto-generated schema artifacts.
- CI pipeline: on every main merge — extract OpenAPI, regenerate data dictionary, regenerate admin screenshots, build the site, publish to s3://docs-<env>-<region>/ served via CloudFront.
- Visibility tiers:
  - Internal (admin, runbooks, compliance, dev) — behind corporate VPN / SSO.
  - Partner (API reference, integration guides, public changelog) — served publicly with per-partner credential gates at CloudFront for sensitive sections.
- Versioning: version switcher; old versions retained as long as the corresponding API version is supported.
9.3 Quality gates¶
- CI fails if OpenAPI examples don't match their declared types.
- CI fails if a new route lacks a @ApiOperation docstring.
- Link checker, spellcheck, markdown lint on every build.
- CHANGELOG.md update enforced on every PR that changes external behaviour.
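One way the route documentation gate could be checked is against the generated OpenAPI JSON rather than the source code. The sketch below assumes that a route without `@ApiOperation` shows up in the generated document as an operation with no `summary`; the types are a small hand-written subset of OpenAPI, and the function name is hypothetical.

```typescript
// Minimal subset of the OpenAPI document shape needed for this check.
interface OpenApiOperation {
  summary?: string;
  description?: string;
}
type OpenApiPathItem = { [method: string]: OpenApiOperation | undefined };
interface OpenApiDoc {
  paths: { [path: string]: OpenApiPathItem };
}

const HTTP_METHODS = ["get", "put", "post", "delete", "patch", "options", "head"];

// Returns "METHOD /path" for every operation without a non-empty summary.
function undocumentedRoutes(doc: OpenApiDoc): string[] {
  const missing: string[] = [];
  for (const path of Object.keys(doc.paths).sort()) {
    const item = doc.paths[path];
    if (item === undefined) continue;
    for (const method of HTTP_METHODS) {
      const op = item[method];
      if (op === undefined) continue; // method not defined on this path
      if (op.summary === undefined || op.summary.trim() === "") {
        missing.push(`${method.toUpperCase()} ${path}`);
      }
    }
  }
  return missing;
}
```

A CI step would fail the build when the returned list is non-empty and print the entries so the author knows which routes to document.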
10. Testing and quality¶
10.1 Test pyramid¶
- Unit — fast, isolated. Business logic, validation, scope enforcement, tenancy guards, encryption round-trips (mocked KMS), consent evaluation, matching and scoring, retry/backoff math, webhook signing. Target >85% line coverage on business logic.
- Integration — against real MySQL 8 (Testcontainers), Minio, Localstack. No DB or storage mocking. Covers full request-to-DB-to-side-effect flows, schema migrations forward and backward, tenant isolation under concurrent access, idempotency-key behaviour, end-to-end encryption, crypto-shred effect (destroy DEK, confirm unrecoverability). This is the load-bearing test tier.
- Contract — OpenAPI is the contract. CI validates request/response schemas against the spec, runs Schemathesis property-based fuzzing (or Dredd contract checks) against the running service, and snapshot-tests the OpenAPI JSON so accidental breaking changes require explicit approval.
- End-to-end — full-stack docker-compose scenarios run on main builds: OV2 flow, AIDA flow, patient merge flow, crypto-shred lifecycle.
- Admin SPA — Playwright against admin-api; same tests generate the docs screenshots.
- Load / stress — weekly k6 runs against staging; realistic mixes; baseline metrics tracked over time.
- Chaos / failure — quarterly game days against staging: kill Aurora writer, kill random ECS tasks mid-request, throttle KMS, fill a DLQ.
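For the contract tier, the OpenAPI snapshot test can be complemented by a targeted check for the most obvious breaking change, a removed operation. The sketch below compares two path-to-methods maps extracted from the previous and current specs; the map shape and function name are assumptions, and schema-level changes (removed fields, narrowed types) are out of scope.

```typescript
// path -> lowercase HTTP methods defined on it (a reduced view of an OpenAPI spec)
type OperationMap = { [path: string]: string[] };

// Lists every operation present in the previous spec but absent from the current one.
function removedOperations(previous: OperationMap, current: OperationMap): string[] {
  const removed: string[] = [];
  for (const path of Object.keys(previous).sort()) {
    const currentMethods = current[path] ?? [];
    for (const method of previous[path] ?? []) {
      if (currentMethods.indexOf(method) === -1) {
        removed.push(`${method.toUpperCase()} ${path}`);
      }
    }
  }
  return removed;
}
```

Added operations are ignored on purpose, since they are backwards compatible for existing clients.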
10.2 Security testing¶
CodeQL SAST on every PR; Dependabot + automated low-risk security patches; Trivy container scanning on push; gitleaks pre-commit + CI; weekly OWASP ZAP against staging; annual third-party penetration test covering auth, tenancy isolation, admin panel, and ingestion paths.
10.3 Schema migration testing¶
CI applies every new migration against a copy of the current production schema snapshot and verifies:
- Expand-phase migrations run without locking, without dropping columns still in use, compatible with the previous app version.
- The previous app version's integration tests pass against the intermediate schema.
- Contract-phase migrations only run after both app versions have been verified against the intermediate schema.
This is the single biggest source of outage risk and the most important gate.
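Part of the expand-phase check can be approximated with a static lint over the migration SQL. The sketch below is deliberately naive: it splits on semicolons (so it would misread semicolons inside string literals), the pattern list is illustrative rather than exhaustive, and it does not detect locking behaviour. The second function encodes the contract-phase precondition from the list above; both function names are hypothetical.

```typescript
// Statements that remove or rename structures the previous app version may still use.
const DESTRUCTIVE_PATTERNS: RegExp[] = [
  /\bDROP\s+(TABLE|COLUMN)\b/i,
  /\bRENAME\s+(TABLE|COLUMN)\b/i,
];

// Returns the statements in an expand-phase migration that look destructive.
function destructiveStatements(sql: string): string[] {
  return sql
    .split(";")
    .map((statement) => statement.trim())
    .filter(
      (statement) =>
        statement.length > 0 && DESTRUCTIVE_PATTERNS.some((pattern) => pattern.test(statement)),
    );
}

// Contract-phase migrations may run only after both app versions have been
// verified against the intermediate schema.
function contractPhaseAllowed(verification: {
  previousAppPassed: boolean;
  newAppPassed: boolean;
}): boolean {
  return verification.previousAppPassed && verification.newAppPassed;
}
```

A lint like this is a cheap first filter only; running the previous app version's integration tests against the intermediate schema remains the authoritative check.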
10.4 CI/CD pipeline¶
Per-service:
- PR build — lint, typecheck, unit, integration, contract, SAST, secret scan, docs build.
- Main merge — all of the above plus E2E, container build, Trivy scan, ECR push, deploy to dev, smoke tests.
- Release to staging — gated manual approval; migration test against staging schema; expand-phase applied; app deployed; staging E2E pass.
- Release to production — gated manual approval plus 24h staging soak; blue/green (clinical-api) or rolling (admin-api) deploy; automatic rollback on health-check or SLO breach; contract-phase migration runs only after the new version is stable for a configured window.
Staging mirrors production topology and is always at or ahead of production version.
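The production release gate can be expressed as a function that returns the outstanding blockers. This is a sketch with hypothetical names; it models only the manual approval, the 24h staging soak, and the staging E2E result mentioned above.

```typescript
// Hypothetical release candidate record.
interface ReleaseCandidate {
  approvedBy?: string; // set once a human approves the production release
  stagingDeployedAt: Date;
  stagingE2ePassed: boolean;
}

const SOAK_MS = 24 * 60 * 60 * 1000; // 24h staging soak

// Returns the reasons a release cannot go to production yet; empty means go.
function productionBlockers(rc: ReleaseCandidate, now: Date): string[] {
  const blockers: string[] = [];
  if (rc.approvedBy === undefined) blockers.push("manual approval missing");
  if (!rc.stagingE2ePassed) blockers.push("staging E2E not passed");
  if (now.getTime() - rc.stagingDeployedAt.getTime() < SOAK_MS) {
    blockers.push("24h staging soak incomplete");
  }
  return blockers;
}
```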
11. Infrastructure as code¶
11.1 Repository layout¶
infra/
  modules/
    network/
    aurora/
    ecs-service/
    s3-bucket/
    kms-keys/
    sqs-queue/
    lambda-worker/
    cloudfront-site/
    secrets/
    monitoring/
  stacks/
    shared/
  environments/
    dev-uk/ dev-us/
    staging-uk/ staging-us/
    prod-uk/ prod-us/
Each environment holds its own Terraform state; no cross-environment references; no shared state.
11.2 State and secrets¶
- S3 state backend per region, versioning + object-lock, KMS-encrypted, DynamoDB lock table.
- Sensitive values generated inside AWS and referenced by ARN where possible (e.g. Aurora master password in Secrets Manager). Terraform knows ARNs, not plaintext.
- Dedicated KMS key for state file encryption with a tight policy (CI/CD roles + SRE).
11.3 Module principles¶
- Production-safe defaults (encryption on, public access off, deletion protection on, versioning on). Opting out requires explicit variables with justifying comments.
- No environment-specific logic inside modules.
- Clean output chaining — modules output ARNs/ids downstream modules consume; environment stacks read as short declarative composition.
- Module versions pinned per environment so upgrades are intentional (dev first, prod last).
11.4 In and out of Terraform¶
In Terraform: VPC, subnets, routes, NAT, security groups; ALBs + ACM + WAF; Aurora cluster + parameter groups + RDS Proxy; ECS cluster + tasks + services + roles; ECR repositories with image scanning; S3 buckets (clinical-images, histology-reports, audit-log, admin-static, docs, terraform-state, backups); KMS keys with scoped policies; SQS queues + DLQs; Lambda functions (code deployed via CI); ElastiCache Redis; EventBridge rules; Route53; CloudFront distributions for admin SPA and docs; CloudWatch log groups / metric filters / alarms / dashboards; Secrets Manager schemas (not values); IAM roles and policies; cross-region backup replication; AWS Config rules.
Out of Terraform: application code deployments (CodePipeline/CodeDeploy); database schema (app migration tool run as pre-deploy ECS task); secret values (Secrets Manager rotation or admin panel); data seeding (application-level seed scripts).
11.5 Bootstrap and DR¶
- Bootstrap module creates the state backend once per account (solves the chicken-and-egg).
- DR runbook: terraform apply + restore from cross-region snapshot. Tested annually against a disposable account. RTO 4h; RPO ≤5 min.
11.6 Infra CI¶
PR build: terraform fmt → validate → tflint → checkov → plan against each environment → post plan output on the PR. main merge auto-applies to dev. Staging and prod require manual approval. Nightly terraform plan against prod detects drift and alerts.
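For the nightly drift check, one option is to run `terraform plan -detailed-exitcode`, which exits with 0 when there is no diff, 1 on error, and 2 when changes are present. Whether the pipeline uses this flag is an assumption; the sketch below only shows how a wrapper might map the exit code to an alerting decision, with hypothetical names.

```typescript
type DriftStatus = "in_sync" | "drift_detected" | "plan_failed";

// Maps the exit code of `terraform plan -detailed-exitcode` to a drift status.
function classifyPlanExit(exitCode: number): DriftStatus {
  if (exitCode === 0) return "in_sync"; // no changes
  if (exitCode === 2) return "drift_detected"; // plan succeeded and found a diff
  return "plan_failed"; // 1 (error) or anything unexpected
}

// Both drift and a failed plan deserve attention: a plan that cannot run
// means drift can no longer be ruled out.
function shouldAlert(status: DriftStatus): boolean {
  return status !== "in_sync";
}
```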
12. Open questions and future work¶
Deferred to follow-on projects:
- FHIR R4 translation layer (Patient, Encounter, Observation, Condition, Media, DiagnosticReport, Consent, MedicationRequest, MedicationStatement).
- HL7 v2 translation (mainly for lab integration with partners that can't hit REST).
- DICOM support — the current platforms do not produce DICOM; the data model will not store DICOM-specific metadata in v1.
- SFTP / file-drop histology ingestion path for labs that cannot integrate with a REST API.
- Event streaming via EventBridge for clients that want bus-based consumption rather than webhooks.
- Multi-factor authentication on api_client credentials (currently MFA applies to staff SSO only).
- A practitioner registry if clinician records grow beyond the inline actor snapshot model.
Decisions deferred to implementation time (not architecturally load-bearing):
- Prisma vs TypeORM (both support expand/contract migrations and work well with NestJS).
- MkDocs Material vs Docusaurus for the docs site.
- Scalar vs ReDoc for the OpenAPI reference renderer.
Things that need to be validated with implementation partners before build starts:
- The exact shape of the actor-context JWT claims — OV2 and AIDA must be able to issue these, so the claim set should be reviewed with their teams.
- The EXIF retention defaults per product — the initial policies should be agreed with clinical stakeholders before go-live.
- The seed set for body_site (SNOMED CT subset) and diagnosis_code_mapping starting rows (AI label → SNOMED CT mappings for OV2).
- The legal retention periods per region (the defaults should be reviewed by compliance before prod launch).
13. Summary¶
A single NestJS codebase deployed as two ECS services (clinical-api, admin-api) against an Aurora MySQL cluster, with S3-backed binary storage, Lambda-driven async ingestion, per-patient field-encrypted PHI with crypto-shred erasure, versioned REST APIs with uniform async status and webhook delivery, a Next.js admin SPA behind a restricted ALB, comprehensive docs-as-code, rigorous expand/contract schema migration gating, and Terraform-managed multi-region AWS infrastructure (UK and US initially) with per-region isolation and cross-region DR. Zero-downtime deployments are a first-class constraint enforced across both the infrastructure (Aurora multi-AZ, RDS Proxy, 3-AZ ECS, blue/green deploys) and the development discipline (expand/contract migrations, CI-gated).
This spec is scoped to v1. FHIR / HL7 / DICOM translation is deliberately deferred to follow-on projects that will layer onto the stable data model established here.