Deploying services runbook

How to deploy the platform's backend services (and SPAs) to a bootstrapped AWS environment with infra/scripts/deploy.sh. For reverting a bad deploy, see the Deployment rollback runbook.

Model

There is no automated CD: deploys are run manually by an operator with a script. CI only tests, lints, and builds; merging to main does not roll anything. deploy.sh does three things:

1. Builds every (or each changed) service image, tags it with the current git short SHA, and pushes it to ECR.
2. Builds and uploads the SPAs (admin-ui, simulator) to S3 and invalidates CloudFront.
3. Runs terraform apply so the ECS task definitions reference the new image tag(s); ECS then rolls the affected services.

The live /docs (Swagger) for each service is generated at boot, so new endpoints appear only after a redeploy.

Prerequisites

The environment has been bootstrapped (infra/scripts/bootstrap.sh).
terraform >= 1.6, docker, pnpm, jq, and the AWS CLI installed.
An AWS profile for the target account (aws configure --profile <name>).
A tfvars file for the environment under infra/terraform/aws/tfvars/ (gitignored per environment; only example.tfvars is tracked). For non-prod, set expose_api_docs = true there to serve Swagger.
The matching Terraform workspace selected: terraform -chdir=infra/terraform/aws workspace select <name>.
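
The tool prerequisites can be checked before starting. This is a minimal sketch, not part of deploy.sh: missing_tools is a hypothetical helper, and aws is assumed to be the AWS CLI's binary name on your PATH. It checks presence only, not versions such as terraform >= 1.6.

```shell
# Hypothetical helper: print each named tool that is not on PATH, one per line.
# Runs in a subshell (parentheses) so the loop variable does not leak.
missing_tools() (
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
)

# Usage (tool names from the prerequisites list; `aws` assumed for the AWS CLI):
#   missing="$(missing_tools terraform docker pnpm jq aws)"
#   [ -z "$missing" ] || { echo "missing tools: $missing" >&2; exit 1; }
```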

Deploy from a checkout of the code you want live (the image tag is git rev-parse --short HEAD).

Full deploy

Rebuilds, pushes, and rolls all 9 backend services plus the SPAs. Use it the first time, or whenever a shared change (a packages/* change, pnpm-lock.yaml, or the Dockerfile) means everything needs rebuilding.

./infra/scripts/deploy.sh --var-file tfvars/sandbox.tfvars

It prints an AWS account summary and asks for a typed y before pushing. The build installs the shared dependency layer once (reused across all images), skips the Chromium download, and produces ~390MB images.

Incremental deploy (`--changed`)

Deploys only the services that have changed since they went live. This is much faster and uses far less disk than a full deploy.

./infra/scripts/deploy.sh --var-file tfvars/sandbox.tfvars --changed

For each service it reads the live image's git-SHA tag from ECS, then git diffs the repo since that SHA over:

the service's own directory (services/<name>/), and
shared packages (packages/common, auth-client, events, tsconfig, eslint-config), and
build inputs (pnpm-lock.yaml, pnpm-workspace.yaml, root package.json, infra/docker/service.Dockerfile, infra/docker/.npmrc).

Any change means that service is rebuilt, pushed, and rolled (a shared-package, lockfile, or Dockerfile change fans out to every service). The script prints a per-service DEPLOY/skip plan first and exits early if nothing changed. Under the hood it sets a per-service image_tags entry and runs terraform apply with -target restricted to those services' task-definition and service resources, so the rest are untouched.
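
To illustrate the per-service decision, the sketch below classifies a list of changed paths for one service. It is not deploy.sh's actual code: service_changed is a hypothetical helper, and it assumes the shared packages all live under packages/ (for example packages/auth-client), which the list above does not spell out.

```shell
# Hypothetical helper: read changed file paths on stdin and return 0 if any of
# them would trigger a rebuild of service $1, otherwise return 1.
service_changed() {
  svc="$1"
  while IFS= read -r changed_file; do
    case "$changed_file" in
      # The service's own directory.
      "services/$svc/"*) return 0 ;;
      # Shared packages (locations under packages/ are an assumption).
      packages/common/*|packages/auth-client/*|packages/events/*) return 0 ;;
      packages/tsconfig/*|packages/eslint-config/*) return 0 ;;
      # Build inputs at the repo root and under infra/docker/.
      pnpm-lock.yaml|pnpm-workspace.yaml|package.json) return 0 ;;
      infra/docker/service.Dockerfile|infra/docker/.npmrc) return 0 ;;
    esac
  done
  return 1
}
```

For example, with live_sha holding the tag read from ECS (a hypothetical variable), `git diff --name-only "$live_sha" HEAD | service_changed clinical-api && echo DEPLOY || echo skip` prints a plan line for a service assumed to be named clinical-api.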

Run a full deploy at least once first so every service has a known SHA tag live for --changed to diff against. --changed covers backend services only (it skips SPAs); deploy SPA changes with a full deploy. It cannot be combined with --services or --skip-tf.

Flags

Flag	Effect
`--var-file <path>`	tfvars for the closing `terraform apply` (required unless `--skip-tf`).
`--changed`	Incremental: rebuild/roll only services changed vs their live image.
`--services a,b`	Limit the build to a subset. Only safe with `--skip-tf` (image/SPA-only); for a partial roll use `--changed`.
`--skip-spas`	Skip the SPA build + upload.
`--skip-tf`	Push images/SPAs only, no `terraform apply` / ECS roll.
`--profile <name>`	Pick the AWS profile non-interactively (skips the menu).
`-y`, `--yes`	Skip the profile menu + account confirmation (pair with `--profile`).
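
The combination rules stated in this runbook can be sketched as a pre-check. check_flags is a hypothetical wrapper-side helper, not deploy.sh's own validation; it only encodes the rules written here (--var-file required unless --skip-tf, --changed not combinable with --services or --skip-tf, --services only safe with --skip-tf).

```shell
# Hypothetical pre-check: return 0 if the given deploy.sh flags follow the
# documented rules, otherwise print the reason to stderr and return 1.
check_flags() {
  var_file=0; changed=0; services=0; skip_tf=0
  for arg in "$@"; do
    case "$arg" in
      --var-file) var_file=1 ;;
      --changed)  changed=1 ;;
      --services) services=1 ;;
      --skip-tf)  skip_tf=1 ;;
    esac
  done
  if [ "$changed" -eq 1 ] && { [ "$services" -eq 1 ] || [ "$skip_tf" -eq 1 ]; }; then
    echo "--changed cannot be combined with --services or --skip-tf" >&2
    return 1
  fi
  if [ "$skip_tf" -eq 0 ] && [ "$var_file" -eq 0 ]; then
    echo "--var-file is required unless --skip-tf is set" >&2
    return 1
  fi
  if [ "$services" -eq 1 ] && [ "$skip_tf" -eq 0 ]; then
    echo "--services without --skip-tf still applies the global image_tag" >&2
    return 1
  fi
  return 0
}

# Usage:
#   check_flags "$@" && ./infra/scripts/deploy.sh "$@"
```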

Verify after a deploy

ECS services reach a steady state (new task set running, old drained).
GET https://<host>/<prefix>/health returns 200 (e.g. /v1/clinical-api/health).
If expose_api_docs = true: Swagger at https://<host>/v1/clinical-api/docs (and /v1/auth/docs, /api/docs, …) reflects the deployed code.
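
The first two checks can be scripted. This is a sketch under assumptions: health_url, check_health, and wait_stable are hypothetical helpers, and the ECS cluster and service names are environment-specific values this runbook does not list. aws ecs wait services-stable is the AWS CLI's built-in waiter and gives up after its own timeout.

```shell
# Build the health URL from a host and a service prefix such as v1/clinical-api.
# A leading slash on the prefix is tolerated.
health_url() {
  printf 'https://%s/%s/health\n' "$1" "${2#/}"
}

# Succeed only if the health endpoint answers HTTP 200.
check_health() {
  code="$(curl -sS -o /dev/null -w '%{http_code}' "$(health_url "$1" "$2")")" || return 1
  [ "$code" = "200" ]
}

# Block until ECS reports the service stable.
# $1 = cluster name, $2 = service name (both placeholders for your environment).
wait_stable() {
  aws ecs wait services-stable --cluster "$1" --services "$2"
}

# Usage (all values are placeholders):
#   wait_stable <cluster> <service> && check_health <host> v1/clinical-api
```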

Troubleshooting

ERR_PNPM_ENOSPC: no space left on device — the Docker build host is out of disk. Reclaim safely (never touch volumes — they hold local dev DB/Minio/Redis data):

docker builder prune -af      # build cache
docker image prune -af        # unused images (old SHA-tagged builds)

A full deploy uses roughly 6–8GB; old SHA-tagged images + build cache accumulate across deploys and aren't auto-pruned, so run the above periodically.
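
A quick free-space check before a full deploy can catch this early. The sketch below is an assumption-laden example: free_kb and kb_at_least are hypothetical helpers, the 8 GiB threshold is taken from the upper end of the estimate above, and /var/lib/docker is only the default Docker data root on Linux (on Docker Desktop the data lives inside a VM, so host df is not meaningful there).

```shell
# Free space in KiB on the filesystem holding path $1
# (4th column of POSIX-format df output).
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if $1 KiB is at least $2 GiB.
kb_at_least() {
  [ "$1" -ge $(( $2 * 1024 * 1024 )) ]
}

# Usage:
#   kb_at_least "$(free_kb /var/lib/docker)" 8 || echo "low disk: prune first" >&2
#   docker system df    # space used by images, containers, and build cache
```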

Transient npm / corepack errors (e.g. ETIMEDOUT, corepack failing to fetch pnpm) — the build retries pnpm install and corepack through short blips; just re-run the deploy (it resumes from cache). If npm is persistently unreliable, consider a registry mirror/cache.

--services X rolled the wrong thing / failed — --services only limits the build; without --skip-tf the terraform apply still uses the global image_tag and would roll everything. Use --changed for partial deploys.

Notes

image_tag is global for full deploys; image_tags (a per-service map) is set by --changed. Both live in infra/terraform/aws/variables.tf.
The slack-agent deploys independently (its own image tag).
Per-environment tfvars are gitignored; only example.tfvars is tracked.