Deployment rollback runbook
When to use this runbook
A recent service deployment has introduced a regression — elevated error rates, a failing health check, or a P0/P1 incident — and the fastest safe path to recovery is to revert to the previous image.
Steps
- Confirm the deployment is the cause. Compare the error-start timestamp with the deployment timestamp. Ensure you are not rolling back a healthy deployment in response to an unrelated infrastructure fault.
- Identify the previous stable image tag. Check the CI/CD pipeline history for the service. Note the image tag or commit SHA of the last successful deployment.
- Trigger the rollback. In the orchestration platform, redeploy the service using the previous stable image tag. This is typically a one-command operation in your deployment tooling (e.g. update the ECS task definition, trigger a Kubernetes rollout undo, or re-run the deployment pipeline with `IMAGE_TAG=<previous>` overridden).
- Verify recovery. Once the rollback deployment completes, confirm:
  - `GET /health/ready` returns 200
  - Error rates return to baseline
  - Any active incidents are resolved or marked mitigated
- Preserve the bad image. Do not delete the failing image tag. It may be needed for root-cause analysis.
- Open a post-mortem ticket. Record what failed, what the rollback covered, and what changes are needed before re-deploying the reverted code.
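The "confirm the cause" and "verify recovery" checks above can be expressed as two small predicates. The following is a minimal Python sketch, not part of the original runbook. The function names, the 30-minute correlation window, and the 10% tolerance over baseline are illustrative assumptions. The timestamps, status code, and error rates are expected to come from your own CI/CD history and monitoring system.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical defaults (assumptions); tune them for your service.
CORRELATION_WINDOW = timedelta(minutes=30)
BASELINE_TOLERANCE = 0.10  # accept an error rate up to 10% above baseline


def deployment_is_plausible_cause(deployed_at, errors_started_at,
                                  window=CORRELATION_WINDOW):
    """Return True if errors began at or after the deployment, within `window`."""
    delta = errors_started_at - deployed_at
    return timedelta(0) <= delta <= window


def rollback_recovered(ready_status, error_rate, baseline_error_rate,
                       tolerance=BASELINE_TOLERANCE):
    """Return True if the readiness check passed and errors are near baseline.

    ready_status: HTTP status code returned by GET /health/ready.
    error_rate, baseline_error_rate: fractions in [0, 1] from monitoring.
    """
    healthy = ready_status == 200
    near_baseline = error_rate <= baseline_error_rate * (1 + tolerance)
    return healthy and near_baseline


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    deployed = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    errors = datetime(2024, 1, 1, 12, 7, tzinfo=timezone.utc)
    print(deployment_is_plausible_cause(deployed, errors))
    print(rollback_recovered(200, 0.0105, 0.010))
```

A timestamp match only makes the deployment a plausible cause. It does not rule out an unrelated infrastructure fault that happened at the same time, so treat a `True` result as a prompt to proceed, not as proof.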
Database migrations
If the failing deployment included a Prisma migration, a rollback of the application code alone may not be sufficient, because the previous image would then run against the already-migrated schema. See database-migration.md for migration rollback guidance.
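One way to check whether the failing deployment included a migration is to inspect the files changed between the previous stable commit and the failing one. The sketch below assumes the default Prisma layout, where each migration lives at `prisma/migrations/<name>/migration.sql`. The function name is hypothetical, and the list of changed files would typically come from something like `git diff --name-only <previous> <failing>`.

```python
from pathlib import PurePosixPath


def changed_prisma_migrations(changed_files, migrations_dir="prisma/migrations"):
    """Return the names of migration directories touched by the changed files.

    Assumes the default Prisma layout:
    prisma/migrations/<timestamp>_<name>/migration.sql
    """
    root = PurePosixPath(migrations_dir)
    found = set()
    for name in changed_files:
        path = PurePosixPath(name)
        if path.name != "migration.sql":
            continue
        # The migration directory must sit directly under the migrations root.
        if path.parent.parent == root:
            found.add(path.parent.name)
    return sorted(found)


if __name__ == "__main__":
    # Made-up file list, for illustration only.
    files = [
        "src/server.ts",
        "prisma/schema.prisma",
        "prisma/migrations/20240101120000_add_users/migration.sql",
    ]
    print(changed_prisma_migrations(files))
```

An empty result suggests the deployment was code-only. A non-empty result means the migration rollback guidance applies before or alongside the image rollback.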
If this doesn't work
- If the rollback deployment itself fails, escalate to the infrastructure team.
- If the rollback resolves the symptoms but the same issue reappears, the root cause may be a data or configuration change rather than the code.
Owners
See ../README.md for the service-to-owner map.