Deploying services runbook
How to deploy the platform's backend services (and SPAs) to a bootstrapped AWS
environment with infra/scripts/deploy.sh. For reverting a bad deploy, see the
Deployment rollback runbook.
Model
There is no automated CD — deploys are a manual, operator-run script. CI only
tests/lints/builds; merging to main does not roll anything. deploy.sh:
- Builds every (or each changed) service image, tags it with the current git short SHA, and pushes to ECR.
- Builds + uploads the SPAs (admin-ui, simulator) to S3 and invalidates CloudFront.
- Runs terraform apply so the ECS task definitions reference the new image tag(s); ECS then rolls the affected services.
The live /docs (Swagger) for each service is generated at boot, so new
endpoints appear only after a redeploy.
Prerequisites
- The environment has been bootstrapped (infra/scripts/bootstrap.sh).
- terraform >= 1.6, docker, pnpm, jq, and the AWS CLI installed.
- An AWS profile for the target account (aws configure --profile <name>).
- A tfvars file for the environment under infra/terraform/aws/tfvars/ (gitignored per environment; only example.tfvars is tracked). For non-prod, set expose_api_docs = true there to serve Swagger.
- The matching Terraform workspace selected: terraform -chdir=infra/terraform/aws workspace select <name>.
Deploy from a checkout of the code you want live (the image tag is
git rev-parse --short HEAD).
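The tool prerequisites can be checked in one pass before starting. The following is a minimal sketch and not part of deploy.sh: the function name missing_tools is hypothetical, aws is assumed to be the AWS CLI binary name, and the terraform version constraint is not checked.

```shell
#!/bin/sh
# Hypothetical preflight helper (not part of deploy.sh): print every
# tool from the argument list that is not found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage (commented out so sourcing this file has no side effects):
#   missing=$(missing_tools terraform docker pnpm jq aws git)
#   [ -z "$missing" ] || { echo "missing tools: $missing" >&2; exit 1; }
#   echo "image tag will be: $(git rev-parse --short HEAD)"
```

Printing the short SHA up front is a cheap way to confirm the checkout is the one you intend to make live.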
Full deploy
Rebuilds, pushes, and rolls all 9 backend services plus the SPAs. Use it the
first time, or whenever a shared change (a packages/* change, pnpm-lock.yaml,
or the Dockerfile) means everything needs rebuilding.
./infra/scripts/deploy.sh --var-file tfvars/sandbox.tfvars
It prints an AWS account summary and asks for a typed y before pushing. The
build installs the shared dependency layer once (reused across all images),
skips the Chromium download, and produces ~390MB images.
Incremental deploy (--changed)
Deploys only the services that changed since they went live, which is much faster and uses far less disk than a full deploy.
./infra/scripts/deploy.sh --var-file tfvars/sandbox.tfvars --changed
For each service it reads the live image's git-SHA tag from ECS, then
git diffs the repo since that SHA over:
- the service's own directory (services/<name>/), and
- shared packages (packages/common, auth-client, events, tsconfig, eslint-config), and
- build inputs (pnpm-lock.yaml, pnpm-workspace.yaml, root package.json, infra/docker/service.Dockerfile, infra/docker/.npmrc).
Any change means that service is rebuilt, pushed, and rolled (a shared-package
or lockfile/Dockerfile change fans out to every service). It prints a
per-service DEPLOY/skip plan first and exits early if nothing changed. Under
the hood it sets a per-service image_tags entry and passes -target to
terraform apply for only those services' task-definition and service
resources, so the rest are untouched.
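The per-service decision described above can be sketched as a path filter over the output of git diff. This is an illustration of the idea, not the actual deploy.sh code: the function names needs_deploy and image_tag_of are hypothetical, and it assumes every shared package lives directly under packages/.

```shell
#!/bin/sh
# Sketch of the --changed decision for one service (assumed logic).
# Reads changed file paths on stdin, one per line, and succeeds (exit 0)
# as soon as one of them should trigger a redeploy of service $1.
needs_deploy() {
  svc=$1
  while IFS= read -r path; do
    case "$path" in
      # the service's own directory
      "services/$svc/"*) return 0 ;;
      # shared packages (locations under packages/ are an assumption)
      packages/common/*|packages/auth-client/*|packages/events/*) return 0 ;;
      packages/tsconfig/*|packages/eslint-config/*) return 0 ;;
      # build inputs; "package.json" only matches the root file
      pnpm-lock.yaml|pnpm-workspace.yaml|package.json) return 0 ;;
      infra/docker/service.Dockerfile|infra/docker/.npmrc) return 0 ;;
    esac
  done
  return 1
}

# The tag of an image reference is everything after the last colon,
# e.g. ".../clinical-api:abc1234" -> "abc1234".
image_tag_of() {
  printf '%s\n' "${1##*:}"
}

# Usage (placeholders in <>; the service name is only an example):
#   image=$(aws ecs describe-task-definition --task-definition <task-def> \
#     --query 'taskDefinition.containerDefinitions[0].image' --output text)
#   live_sha=$(image_tag_of "$image")
#   if git diff --name-only "$live_sha" HEAD | needs_deploy clinical-api; then
#     echo DEPLOY; else echo skip; fi
```

Keeping the filter separate from the git and AWS calls makes the fan-out rule (any shared package or build input hits every service) easy to check by hand.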
Run a full deploy at least once first so every service has a known SHA tag
live for --changed to diff against. --changed covers backend services only
(it skips SPAs); deploy SPA changes with a full deploy. It cannot be combined
with --services or --skip-tf.
Flags
| Flag | Effect |
|---|---|
| --var-file <path> | tfvars for the closing terraform apply (required unless --skip-tf). |
| --changed | Incremental: rebuild/roll only services changed vs their live image. |
| --services a,b | Limit the build to a subset. Only safe with --skip-tf (image/SPA-only); for a partial roll use --changed. |
| --skip-spas | Skip the SPA build + upload. |
| --skip-tf | Push images/SPAs only, no terraform apply / ECS roll. |
| --profile <name> | Pick the AWS profile non-interactively (skips the menu). |
| -y, --yes | Skip the profile menu + account confirmation (pair with --profile). |
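The combination rules in the table above can be written down as a small check. This is a sketch for illustration, not the argument parser in deploy.sh: check_flags is a hypothetical name, and treating --services without --skip-tf as an error is a choice made here; the real script may only warn or may allow it.

```shell
#!/bin/sh
# Sketch of the documented flag rules (not the real deploy.sh parser).
# Prints the first violated rule and returns 1; returns 0 silently
# when the combination is allowed.
check_flags() {
  has_var_file=0 has_changed=0 has_services=0 has_skip_tf=0
  for arg in "$@"; do
    case "$arg" in
      --var-file) has_var_file=1 ;;
      --changed)  has_changed=1 ;;
      --services) has_services=1 ;;
      --skip-tf)  has_skip_tf=1 ;;
    esac
  done
  if [ "$has_changed" -eq 1 ] && [ "$has_services" -eq 1 ]; then
    echo "--changed cannot be combined with --services"; return 1
  fi
  if [ "$has_changed" -eq 1 ] && [ "$has_skip_tf" -eq 1 ]; then
    echo "--changed cannot be combined with --skip-tf"; return 1
  fi
  if [ "$has_var_file" -eq 0 ] && [ "$has_skip_tf" -eq 0 ]; then
    echo "--var-file is required unless --skip-tf is given"; return 1
  fi
  # Assumption: this sketch rejects the unsafe combination outright.
  if [ "$has_services" -eq 1 ] && [ "$has_skip_tf" -eq 0 ]; then
    echo "--services without --skip-tf still rolls every service"; return 1
  fi
  return 0
}
```

For example, check_flags --var-file tfvars/sandbox.tfvars --changed passes, while check_flags --changed --skip-tf is rejected.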
Verify after a deploy
- ECS services reach a steady state (new task set running, old drained).
- GET https://<host>/<prefix>/health returns 200 (e.g. /v1/clinical-api/health).
- If expose_api_docs = true: Swagger at https://<host>/v1/clinical-api/docs (and /v1/auth/docs, /api/docs, …) reflects the deployed code.
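The checks above can be scripted. The following is a minimal sketch under assumptions: health_url and check_health are hypothetical helper names, and the cluster, service, and host values are placeholders you would fill in for your environment.

```shell
#!/bin/sh
# Build the health-check URL for a service prefix such as v1/clinical-api.
health_url() {
  host=$1
  prefix=${2#/}          # tolerate a leading slash
  prefix=${prefix%/}     # and a trailing one
  printf 'https://%s/%s/health\n' "$host" "$prefix"
}

# Succeed only if the health endpoint answers with HTTP 200.
check_health() {
  code=$(curl -sS -o /dev/null -w '%{http_code}' "$(health_url "$1" "$2")") || return 1
  [ "$code" = "200" ]
}

# Usage (placeholders in <>):
#   aws ecs wait services-stable --cluster <cluster> --services <service>
#   check_health <host> /v1/clinical-api && echo healthy
```

aws ecs wait services-stable blocks until the listed services reach a steady state or the waiter times out, so running it before the HTTP check avoids probing tasks that are still draining.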
Troubleshooting
ERR_PNPM_ENOSPC: no space left on device — the Docker build host is out of
disk. Reclaim safely (never touch volumes — they hold local dev DB/Minio/Redis
data):
docker builder prune -af # build cache
docker image prune -af # unused images (old SHA-tagged builds)
A full deploy uses roughly 6–8GB; old SHA-tagged images + build cache accumulate across deploys and aren't auto-pruned, so run the above periodically.
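A quick headroom check before a full deploy can catch this early. The sketch below is an assumption-laden illustration: has_headroom and free_kb are hypothetical helpers, the 8 GB threshold is taken from the upper end of the estimate above, and /var/lib/docker is only the usual Linux default for Docker's data directory (on Docker Desktop the data lives inside a VM, so this check is rough at best).

```shell
#!/bin/sh
# Succeed if available_kb covers required_gb (1 GB counted as 1024*1024 KB).
has_headroom() {
  available_kb=$1
  required_gb=$2
  [ "$available_kb" -ge $((required_gb * 1024 * 1024)) ]
}

# Free space, in KB, on the filesystem that holds the given path.
# With -P, df prints one line per filesystem; column 4 is "Available".
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Usage (path is an assumption; adjust for your build host):
#   has_headroom "$(free_kb /var/lib/docker)" 8 \
#     || echo "less than 8GB free: prune before a full deploy" >&2
#   docker system df   # shows what images and build cache currently occupy
```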
Transient npm / corepack errors (e.g. ETIMEDOUT, corepack failing to fetch
pnpm): the build already retries pnpm install and corepack through short
blips. If it still fails, re-run the deploy (it resumes from cache). If npm is
persistently unreliable, consider a registry mirror/cache.
--services X rolled the wrong thing / failed — --services only limits the
build; without --skip-tf the terraform apply still uses the global
image_tag and would roll everything. Use --changed for partial deploys.
Notes
- image_tag is global for full deploys; image_tags (a per-service map) is set by --changed. Both live in infra/terraform/aws/variables.tf.
- The slack-agent deploys independently (its own image tag).
- Per-environment tfvars are gitignored; only example.tfvars is tracked.