Service-down triage runbook

When to use this runbook

Use this runbook when a service is unhealthy, is consistently returning 5xx errors, or has triggered an alert indicating degraded health.

Identify the service. Check the alert or health-check dashboard. Note the service name and how long it has been degraded.
Check upstream dependencies. Confirm MySQL and Redis are reachable from within the service's network. A database or cache outage will cause all services to fail simultaneously.
Call the health endpoint. GET /health/ready on the affected service returns 200 when Prisma and Redis connections are healthy. A non-200 response narrows the fault to infrastructure rather than application code.
Check service logs. Look for unhandled exceptions, out-of-memory kills, database connection exhaustion, or Redis connection drops immediately before the degradation started.
Check recent deployments. If the failure began shortly after a deploy, that deploy is the most likely cause. Correlate the deployment timestamp with the first error timestamp.
Restart the service via the orchestration platform (ECS task replacement, Kubernetes rollout restart, etc.). If the service recovers, monitor for recurrence.
Roll back the most recent deploy if the failure correlates with it and a restart does not recover the service. See deployment-rollback.md.
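
The dependency, health-endpoint, and deployment checks above can be scripted. The following is a minimal sketch using only the Python standard library, not part of the service itself. The function names, the timeouts, and the 30-minute correlation window are illustrative assumptions; only the /health/ready path comes from this runbook.

```python
import socket
from datetime import datetime, timedelta
from typing import Optional
from urllib import error, request


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    This only proves the port accepts connections; it does not authenticate
    or run a query against MySQL or Redis.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def readiness_status(base_url: str, timeout: float = 3.0) -> Optional[int]:
    """GET <base_url>/health/ready and return the HTTP status code.

    Returns None when the service cannot be reached at all.
    """
    try:
        with request.urlopen(f"{base_url}/health/ready", timeout=timeout) as resp:
            return resp.status
    except error.HTTPError as exc:
        # The service answered, but with a non-2xx status (for example 503).
        return exc.code
    except (error.URLError, OSError):
        return None


def deploy_is_suspect(
    deployed_at: datetime,
    first_error_at: datetime,
    window: timedelta = timedelta(minutes=30),  # assumed window, tune per service
) -> bool:
    """Return True if the first error occurred at or after the deploy,
    and no later than `window` after it."""
    return timedelta(0) <= first_error_at - deployed_at <= window
```

Run the reachability checks from inside the service's network, for example `tcp_reachable("mysql.internal", 3306)` and `tcp_reachable("redis.internal", 6379)`. The hostnames are placeholders, and the ports are the MySQL and Redis defaults, which your environment may override. Timestamps passed to `deploy_is_suspect` should both be in the same timezone.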

If the steps above do not restore the service, escalate to the on-call engineer (rotation defined in your incident management tool).
If the MySQL or Redis cluster is the root cause, escalate to the infrastructure team.
Open an incident ticket and link the relevant service logs and alert screenshots.

See ../README.md for the service-to-owner map.