Deployment rollback runbook

When to use this runbook

Use this runbook when a recent service deployment has introduced a regression (elevated error rates, a failing health check, or a P0/P1 incident) and the fastest safe path to recovery is to revert to the previous image.

Steps

1. Confirm the deployment is the cause. Compare the timestamp at which the errors started with the deployment timestamp, so that you do not roll back a healthy deployment in response to an unrelated infrastructure fault.
2. Identify the previous stable image tag. Check the CI/CD pipeline history for the service and note the image tag or commit SHA of the last successful deployment.
3. Trigger the rollback. In the orchestration platform, redeploy the service using the previous stable image tag. This is typically a one-command operation in your deployment tooling (e.g. update the ECS task definition, trigger a Kubernetes rollout undo, or re-run the deployment pipeline with IMAGE_TAG=<previous> overridden).
4. Verify recovery. Once the rollback deployment completes, confirm:
   - GET /health/ready returns 200
   - Error rates return to baseline
   - Any active incidents are resolved or marked mitigated
5. Preserve the bad image. Do not delete the failing image tag. It may be needed for root-cause analysis.
6. Open a post-mortem ticket. Record what failed, what the rollback reverted, and what changes are needed before re-deploying the reverted code.
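
The timestamp comparison and the readiness check from the steps above can be scripted. The following is a minimal sketch, not part of any existing tooling: the function names, the 30-minute correlation window, and the retry settings are all assumptions chosen for illustration, and only the GET /health/ready path comes from this runbook.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timedelta, timezone


def deployment_is_plausible_cause(deployed_at, errors_started_at,
                                  window=timedelta(minutes=30)):
    """Return True if errors began at or after the deployment and within `window` of it.

    The 30-minute default is an illustrative assumption; tune it per service.
    """
    delta = errors_started_at - deployed_at
    return timedelta(0) <= delta <= window


def http_status(url, timeout=5.0):
    """Return the HTTP status code for a GET on `url`, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # Non-2xx responses raise HTTPError; the status code is still useful.
        return exc.code
    except OSError:
        # Covers URLError, connection failures, and timeouts.
        return 0


def wait_until_ready(url, fetch_status=http_status, attempts=10,
                     delay_seconds=3.0, sleep=time.sleep):
    """Poll `url` until it returns 200 or `attempts` is exhausted."""
    for attempt in range(attempts):
        if fetch_status(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

A call might look like `wait_until_ready("https://<service-host>/health/ready")`, where the host is a placeholder for the service's real address. A passing readiness check covers only the first verification item; error rates and incident status still need to be confirmed separately.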

Database migrations

If the failing deployment included a Prisma migration, a rollback of the application code alone may not be sufficient. See database-migration.md for migration rollback guidance.
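
One way to check whether the failing deployment included a migration is to list the files that changed between the previous stable commit and the failing one, then look for entries under the migrations directory. The sketch below assumes the repository uses Prisma's default `prisma/migrations` layout and that both commit SHAs are available locally; the function names are hypothetical.

```python
import subprocess
from pathlib import PurePosixPath

# Prisma's default migrations location; an assumption about this repository.
MIGRATIONS_DIR = "prisma/migrations"


def changed_migrations(changed_paths, migrations_dir=MIGRATIONS_DIR):
    """Return the sorted names of migration folders touched by `changed_paths`."""
    root = PurePosixPath(migrations_dir)
    names = set()
    for raw in changed_paths:
        try:
            relative = PurePosixPath(raw).relative_to(root)
        except ValueError:
            continue  # Path is outside the migrations directory.
        # Migration files live one level down: <migrations_dir>/<name>/migration.sql.
        # Files directly in the directory (e.g. migration_lock.toml) are skipped.
        if len(relative.parts) >= 2:
            names.add(relative.parts[0])
    return sorted(names)


def changed_paths_between(previous_ref, failing_ref, repo_dir="."):
    """List files that differ between two git refs, using `git diff --name-only`."""
    result = subprocess.run(
        ["git", "diff", "--name-only", previous_ref, failing_ref],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()
```

If `changed_migrations(changed_paths_between(previous_sha, failing_sha))` returns a non-empty list, treat the rollback as involving a schema change and follow database-migration.md before assuming the code rollback is sufficient.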

If this doesn't work

If the rollback deployment itself fails, escalate to the infrastructure team.
If the rollback resolves the symptoms but the same issue reappears, the root cause may be a data or configuration change rather than the code.

Owners

See ../README.md for the service-to-owner map.