Service-down triage runbook
When to use this runbook
Use this runbook when a service is unhealthy, is consistently returning 5xx errors, or has triggered an alert indicating degraded health.
Steps
- Identify the service. Check the alert or health-check dashboard. Note the service name and how long it has been degraded.
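- Check upstream dependencies. Confirm MySQL and Redis are reachable from within the service's network. A database or cache outage typically causes every service that depends on it to fail at the same time, so simultaneous failures across several services point to a shared dependency.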
- Call the health endpoint. GET /health/ready on the affected service returns 200 when the Prisma and Redis connections are healthy. A non-200 response points to infrastructure (the database or Redis connection) rather than application code.
- Check service logs. Look for unhandled exceptions, out-of-memory kills, database connection exhaustion, or Redis connection drops immediately before the degradation started.
- Check recent deployments. If the failure began shortly after a deploy, that deploy is the most likely cause. Correlate the deployment timestamp with the first error timestamp.
- Restart the service via the orchestration platform (ECS task replacement, Kubernetes rollout restart, etc.). If the service recovers, monitor for recurrence.
- Roll back the most recent deploy if the failure correlates with it and a restart does not recover the service. See deployment-rollback.md.
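
The dependency and health-endpoint checks in the steps above can be scripted. The following is a minimal sketch, not part of the service's tooling: the function names (`tcp_reachable`, `ready_status`, `triage_hint`, `run_triage`) are hypothetical, and it assumes the script runs from a host inside the service's network so that reachability results match what the service sees. Only the /health/ready path comes from this runbook.

```python
import socket
import urllib.error
import urllib.request
from typing import Optional, Tuple


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


def ready_status(base_url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status of GET /health/ready, or None if there is no answer."""
    url = base_url.rstrip("/") + "/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises HTTPError for 4xx/5xx responses
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, or name resolution failure


def triage_hint(mysql_ok: bool, redis_ok: bool, status: Optional[int]) -> str:
    """Map the check results to the next runbook step (wording is illustrative)."""
    if not (mysql_ok and redis_ok):
        return "dependency unreachable: escalate to the infrastructure team"
    if status is None:
        return "service not answering: check logs and recent deploys, then restart"
    if status != 200:
        return "readiness failing: inspect the service's database and Redis connections"
    return "readiness OK: check service logs and recent deployments"


def run_triage(service_url: str,
               mysql_addr: Tuple[str, int],
               redis_addr: Tuple[str, int]) -> str:
    """Run the dependency checks, probe the health endpoint, and return a hint."""
    mysql_ok = tcp_reachable(*mysql_addr)
    redis_ok = tcp_reachable(*redis_addr)
    return triage_hint(mysql_ok, redis_ok, ready_status(service_url))
```

A call might look like `run_triage("http://orders.internal:8080", ("mysql.internal", 3306), ("redis.internal", 6379))`, where the hostnames and the service port are placeholders; 3306 and 6379 are the default MySQL and Redis ports and may differ in your environment. A successful TCP connection only shows that the port is open, so it does not replace checking the database or cache itself.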
If this doesn't work
- Escalate to the on-call engineer (rotation defined in your incident management tool).
- If the MySQL or Redis cluster is the root cause, escalate to the infrastructure team.
- Open an incident ticket and link the relevant service logs and alert screenshots.
Owners
See ../README.md for the service-to-owner map.