Postmortem

On three separate occasions between Feb 10 and Feb 12, we suffered major outages across all of our application server endpoints, which power our web dashboard and CLI.

Timeline

We experienced elevated levels of 504 (Gateway Timeout) responses at our application load balancer during the windows described below (all times EST/UTC-5).

Note: A 504 error from the load balancer indicates that our application server did not return a response to a forwarded request within 60 seconds.
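
As an illustration only, the sketch below shows how this timeout could be inspected, assuming the load balancer is an AWS Application Load Balancer queried with boto3; the ARN is a hypothetical placeholder.

```python
# Sketch: inspecting the load balancer timeout that governs 504s.
# Assumes an AWS ALB; the ARN below is a hypothetical placeholder.
import boto3

elbv2 = boto3.client("elbv2")

attrs = elbv2.describe_load_balancer_attributes(
    LoadBalancerArn=(
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
        "loadbalancer/app/example-alb/0123456789abcdef"
    )
)

for attr in attrs["Attributes"]:
    if attr["Key"] == "idle_timeout.timeout_seconds":
        # Requests that run longer than this without a response surface as 504s.
        print(f"Load balancer timeout: {attr['Value']} seconds")
```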

Our application server, Subwoofer, handles API requests from both the Graphite CLI and the Graphite web dashboard.

Starting at 12:54pm on Feb 10, average response times as measured by our application load balancer began increasing slowly from the baseline. This degraded the performance of our application server endpoints, but their core functionality was initially unaffected. The degradation continued until 2:06pm, when our load balancer began emitting 504 errors. At 2:21pm our on-call engineer raised the number of application server tasks from 30 → 40, after which average response times decreased quickly; service was fully restored by 2:33pm.

[Figure: average response times at the load balancer during the Feb 10 incident]
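
For illustration, here is a minimal sketch of this kind of manual task-count override, assuming the application server runs as an AWS ECS service managed with boto3; the cluster and service names are hypothetical placeholders.

```python
# Sketch: manually raising the application server task count during an incident.
# Assumes an AWS ECS service; cluster and service names are hypothetical.
import boto3

ecs = boto3.client("ecs")

ecs.update_service(
    cluster="production",   # hypothetical cluster name
    service="subwoofer",    # hypothetical service name
    desiredCount=40,        # raised from 30 during the incident
)
```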

Our application server continued running at the elevated number of tasks until 2:36pm the next day, Feb 11, when our on-call engineer rolled back the autoscaling override, decreasing the number of tasks from 40 → 30. At that point, average response times began climbing just as they had in the previous incident. At 2:52pm our on-call engineer reintroduced the override, bringing our application server back to 40 tasks, and average response times returned to baseline levels by 3:30pm.

[Figure: average response times at the load balancer during the Feb 11 incident]
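
A minimal sketch of what such an autoscaling override can look like, assuming ECS Service Auto Scaling configured through the Application Auto Scaling API with boto3; the resource identifier and capacity values are illustrative.

```python
# Sketch: pinning the autoscaling floor so the service cannot scale back down.
# Assumes ECS Service Auto Scaling; the resource ID is a hypothetical placeholder.
import boto3

autoscaling = boto3.client("application-autoscaling")

autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/production/subwoofer",   # hypothetical
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=40,   # keep at least 40 tasks running
    MaxCapacity=60,
)
```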

On Feb 12 at 12:58pm, the number of application server tasks decreased from 40 → 30 again, this time as a side effect of an Infrastructure-as-Code (IaC) change applied by another engineer. As on the previous days, average response times immediately climbed until 1:16pm, when our team increased the autoscaling limit again from 30 → 40. Unlike on the previous days, this improved average response times but did not fully restore them to baseline levels. Two manual deploys of our application server, at 1:18pm and 1:30pm respectively, briefly improved average response times because the number of deployed tasks was temporarily doubled during each deploy. At 1:43pm our team further increased the autoscaling limit from 40 → 60, which fully restored service by 1:51pm.
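
Finally, a rough sketch of how the return to baseline could be confirmed from load balancer metrics, assuming an AWS ALB publishing to CloudWatch and queried with boto3; the dimension value is a hypothetical placeholder.

```python
# Sketch: checking average response time at the load balancer over recent hours.
# Assumes an AWS ALB publishing to CloudWatch; the dimension value is hypothetical.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/example-alb/0123456789abcdef"}],
    StartTime=now - timedelta(hours=3),
    EndTime=now,
    Period=300,                 # 5-minute buckets
    Statistics=["Average"],
)

for point in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
    print(point["Timestamp"].isoformat(), f"{point['Average']:.3f}s")
```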