Deploying services runbook

How to deploy the platform's backend services (and SPAs) to a bootstrapped AWS environment with infra/scripts/deploy.sh. For reverting a bad deploy, see the Deployment rollback runbook.

Model

There is no automated CD: an operator deploys by running a script manually. CI only tests, lints, and builds; merging to main does not roll anything. deploy.sh:

  1. Builds every (or each changed) service image, tags it with the current git short SHA, and pushes to ECR.
  2. Builds + uploads the SPAs (admin-ui, simulator) to S3 and invalidates CloudFront.
  3. Runs terraform apply so the ECS task definitions reference the new image tag(s); ECS then rolls the affected services.

The live /docs (Swagger) for each service is generated at boot, so new endpoints appear only after a redeploy.

Prerequisites

  • The environment has been bootstrapped (infra/scripts/bootstrap.sh).
  • terraform >= 1.6, docker, pnpm, jq, and the AWS CLI installed.
  • An AWS profile for the target account (aws configure --profile <name>).
  • A tfvars file for the environment under infra/terraform/aws/tfvars/ (gitignored per environment; only example.tfvars is tracked). For non-prod, set expose_api_docs = true there to serve Swagger.
  • The matching Terraform workspace selected: terraform -chdir=infra/terraform/aws workspace select <name>.

Deploy from a checkout of the code you want live (the image tag is git rev-parse --short HEAD).
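The prerequisites above can be checked before a deploy with a small preflight. This is a minimal sketch, not part of deploy.sh: the function names (version_ge, missing_tools, preflight) are hypothetical, and the version comparison assumes a sort that supports -V.

```shell
# Preflight sketch (hypothetical helpers, not taken from deploy.sh).

# Succeeds if version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Prints the name of every listed tool that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Checks the tools, the terraform version, and shows what will be deployed.
preflight() {
  local missing tf_version
  missing=$(missing_tools terraform docker pnpm jq aws git)
  if [ -n "$missing" ]; then
    printf 'missing tools: %s\n' "$(printf '%s' "$missing" | tr '\n' ' ')" >&2
    return 1
  fi
  tf_version=$(terraform version -json | jq -r '.terraform_version')
  if ! version_ge "$tf_version" "1.6.0"; then
    echo "terraform $tf_version is older than 1.6" >&2
    return 1
  fi
  echo "terraform workspace: $(terraform -chdir=infra/terraform/aws workspace show)"
  echo "image tag will be:   $(git rev-parse --short HEAD)"
}
```

Run preflight from the repository root; it only reads state and never pushes or applies anything.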

Full deploy

Rebuilds, pushes, and rolls all 9 backend services plus the SPAs. Use it the first time, or whenever a shared change (a packages/* change, pnpm-lock.yaml, or the Dockerfile) means everything needs rebuilding.

./infra/scripts/deploy.sh --var-file tfvars/sandbox.tfvars

It prints an AWS account summary and asks for a typed y before pushing. The build installs the shared dependency layer once (reused across all images), skips the Chromium download, and produces ~390MB images.

Incremental deploy (--changed)

Deploys only the services that changed since they went live. This is much faster and uses far less disk than a full deploy.

./infra/scripts/deploy.sh --var-file tfvars/sandbox.tfvars --changed

For each service it reads the live image's git-SHA tag from ECS, then git diffs the repo since that SHA over:

  • the service's own directory (services/<name>/),
  • shared packages (packages/common, auth-client, events, tsconfig, eslint-config), and
  • build inputs (pnpm-lock.yaml, pnpm-workspace.yaml, root package.json, infra/docker/service.Dockerfile, infra/docker/.npmrc).

If any of these paths changed, that service is rebuilt, pushed, and rolled (a shared-package, lockfile, or Dockerfile change fans out to every service). The script prints a per-service DEPLOY/skip plan first and exits early if nothing changed. Under the hood it sets a per-service image_tags entry and runs terraform apply with -target restricted to those services' task-definition and service resources, so the rest are untouched.
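The path rules above can be illustrated with a small filter over the output of git diff --name-only. This is a sketch of the decision only, not the code in deploy.sh: service_affected is a hypothetical name, and it assumes the shared packages listed above all live directly under packages/.

```shell
# Hypothetical helper, not taken from deploy.sh.
# Reads repo-relative changed paths on stdin and succeeds if any of them
# affects service $1. It always consumes all input, so the producer on the
# left of the pipe is never cut off early.
service_affected() {
  local service="$1" path hit=1
  while IFS= read -r path; do
    case "$path" in
      # The service's own directory.
      "services/$service/"*) hit=0 ;;
      # Shared packages (assumed to sit under packages/).
      packages/common/*|packages/auth-client/*|packages/events/*) hit=0 ;;
      packages/tsconfig/*|packages/eslint-config/*) hit=0 ;;
      # Build inputs: these fan out to every service.
      pnpm-lock.yaml|pnpm-workspace.yaml|package.json) hit=0 ;;
      infra/docker/service.Dockerfile|infra/docker/.npmrc) hit=0 ;;
    esac
  done
  return "$hit"
}
```

With a live SHA in hand, a plan line for one service could then be produced like this (the service name is only an example):

    git diff --name-only "$live_sha" HEAD | service_affected clinical-api && echo DEPLOY || echo skip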

Run a full deploy at least once first so every service has a known SHA tag live for --changed to diff against. --changed covers backend services only (it skips SPAs); deploy SPA changes with a full deploy. It cannot be combined with --services or --skip-tf.
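To see which SHA tag a service currently has live, the tag can be read back from its ECS task definition. The sketch below is one way to do that with the AWS CLI, not necessarily how deploy.sh does it; the function names are hypothetical, the cluster and ECS service names must be supplied by the caller, and it assumes the service image is the first container in the task definition.

```shell
# Extracts the tag from an image reference of the form <registry>/<repo>:<tag>.
image_tag_of() {
  printf '%s\n' "${1##*:}"
}

# Prints the image tag of the task definition an ECS service currently uses.
# $1 = cluster name, $2 = ECS service name (both are caller-supplied).
live_image_tag() {
  local cluster="$1" service="$2" task_def image
  task_def=$(aws ecs describe-services --cluster "$cluster" --services "$service" \
    --query 'services[0].taskDefinition' --output text) || return 1
  # Assumption: the service container is the first container definition.
  image=$(aws ecs describe-task-definition --task-definition "$task_def" \
    --query 'taskDefinition.containerDefinitions[0].image' --output text) || return 1
  image_tag_of "$image"
}
```

Before relying on --changed, it is worth confirming that the returned tag is a commit present in the checkout, for example with git cat-file -e "<tag>^{commit}".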

Flags

Flag               Effect
--var-file <path>  tfvars for the closing terraform apply (required unless --skip-tf).
--changed          Incremental: rebuild/roll only services changed vs their live image.
--services a,b     Limit the build to a subset. Only safe with --skip-tf (image/SPA-only); for a partial roll use --changed.
--skip-spas        Skip the SPA build + upload.
--skip-tf          Push images/SPAs only, no terraform apply / ECS roll.
--profile <name>   Pick the AWS profile non-interactively (skips the menu).
-y, --yes          Skip the profile menu + account confirmation (pair with --profile).
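The flag constraints described in this runbook can be summarised as a small checker. This is a sketch of the rules only, not deploy.sh's actual argument parser; check_flags is a hypothetical name, and it treats --services without --skip-tf as an error even though the runbook only calls it unsafe.

```shell
# Sketch of the documented flag rules (not deploy.sh's real parser).
# Prints the first violated rule and returns 1; returns 0 if the
# combination is allowed.
check_flags() {
  local changed=0 services=0 skip_tf=0 var_file=0 arg
  for arg in "$@"; do
    case "$arg" in
      --changed)  changed=1 ;;
      --services) services=1 ;;
      --skip-tf)  skip_tf=1 ;;
      --var-file) var_file=1 ;;
    esac
  done
  if [ "$changed" -eq 1 ] && [ "$services" -eq 1 ]; then
    echo "--changed cannot be combined with --services"; return 1
  fi
  if [ "$changed" -eq 1 ] && [ "$skip_tf" -eq 1 ]; then
    echo "--changed cannot be combined with --skip-tf"; return 1
  fi
  if [ "$skip_tf" -eq 0 ] && [ "$var_file" -eq 0 ]; then
    echo "--var-file is required unless --skip-tf is given"; return 1
  fi
  if [ "$services" -eq 1 ] && [ "$skip_tf" -eq 0 ]; then
    echo "--services without --skip-tf still rolls every service"; return 1
  fi
}
```

For example, check_flags --var-file tfvars/sandbox.tfvars --changed passes, while check_flags --changed --skip-tf reports the conflict.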

Verify after a deploy

  • ECS services reach a steady state (new task set running, old drained).
  • GET https://<host>/<prefix>/health returns 200 (e.g. /v1/clinical-api/health).
  • If expose_api_docs = true: Swagger at https://<host>/v1/clinical-api/docs (and /v1/auth/docs, /api/docs, …) reflects the deployed code.
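The first two checks above can be scripted. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and the cluster name, ECS service name, and host are placeholders you pass in. It uses aws ecs wait services-stable, which polls until the service is stable and exits non-zero if that does not happen within its polling limit.

```shell
# Builds the health URL for a host and a URL prefix such as /v1/clinical-api.
health_url() {
  printf 'https://%s/%s/health\n' "$1" "${2#/}"
}

# Succeeds only if the health endpoint answers with HTTP 200.
check_health() {
  local code
  code=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 \
    "$(health_url "$1" "$2")") || return 1
  [ "$code" = "200" ]
}

# Waits for the ECS service to reach a steady state, then probes its health.
# $1 = cluster, $2 = ECS service name, $3 = host, $4 = URL prefix.
verify_service() {
  aws ecs wait services-stable --cluster "$1" --services "$2" || return 1
  check_health "$3" "$4"
}
```

A call might look like verify_service <cluster> <service> <host> /v1/clinical-api, run once per rolled service.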

Troubleshooting

ERR_PNPM_ENOSPC: no space left on device — the Docker build host is out of disk. Reclaim safely (never touch volumes — they hold local dev DB/Minio/Redis data):

docker builder prune -af      # build cache
docker image prune -af        # unused images (old SHA-tagged builds)

A full deploy uses roughly 6–8GB; old SHA-tagged images + build cache accumulate across deploys and aren't auto-pruned, so run the above periodically.
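Given the 6–8GB figure above, a free-space check before a full deploy can catch the problem early. This is a sketch, not part of deploy.sh: the function names are hypothetical, the default path /var/lib/docker assumes a Linux host with Docker's default data root, and the threshold is treated as GiB.

```shell
# Prints free space, in KiB, on the filesystem that holds path $1.
free_kb() {
  df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Succeeds if $1 KiB of free space covers $2 GiB.
has_headroom() {
  [ "$1" -ge $(( $2 * 1024 * 1024 )) ]
}

# Warns and shows Docker's disk usage when there is too little room.
# $1 = path to check (assumed default: /var/lib/docker), $2 = GiB needed.
disk_preflight() {
  local path="${1:-/var/lib/docker}" need_gb="${2:-8}"
  if ! has_headroom "$(free_kb "$path")" "$need_gb"; then
    echo "less than ${need_gb}GB free on $path; prune before deploying" >&2
    docker system df   # breakdown of images, containers, volumes, build cache
    return 1
  fi
}
```

On Docker Desktop the images live inside a VM disk, so point disk_preflight at a path that reflects that disk or rely on docker system df instead.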

Transient npm / corepack errors (e.g. ETIMEDOUT, corepack failing to fetch pnpm) — the build retries pnpm install and corepack through short blips; just re-run the deploy (it resumes from cache). If npm is persistently unreliable, consider a registry mirror/cache.

--services X rolled the wrong thing / failed: --services only limits the build; without --skip-tf, the terraform apply still uses the global image_tag and rolls every service. Use --changed for partial deploys.

Notes

  • image_tag is global for full deploys; image_tags (a per-service map) is set by --changed. Both live in infra/terraform/aws/variables.tf.
  • The slack-agent deploys independently (its own image tag).
  • Per-environment tfvars are gitignored; only example.tfvars is tracked.