Major Disruption of Service on US East 1 CSG Primary Hub
Incident Report for Xponent
Resolved
On the CSG Branded stack in US-East-1, the internal job processing queue was overwhelmed by jobs from multiple sources. The primary cause was a misconfigured queue listener that had been erroneously set to allow 1.4 million messages per minute; at the time of configuration, the desired rate was 14,000 messages per minute, and the actual rate was about 9,000. This queue listener began receiving over 30,000 messages per minute, while the stack was capped at roughly 20,000 messages per minute across all users.
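To illustrate the class of misconfiguration described above, here is a minimal sketch of a consumer-side rate cap where a typo turns 14,000 messages per minute into 1.4 million. This is hypothetical code, not the actual listener; the limiter design and constant names are assumptions.

```python
# Hypothetical sketch: a fixed-window rate limiter whose cap was deployed
# with two extra zeros (1,400,000 instead of the intended 14,000).
import time

DESIRED_MSGS_PER_MIN = 14_000           # the intended cap
MISCONFIGURED_MSGS_PER_MIN = 1_400_000  # the value actually deployed (assumed typo)

class RateLimiter:
    """Allow at most `limit_per_min` messages in each 60-second window."""
    def __init__(self, limit_per_min: int):
        self.limit = limit_per_min
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 60:
            self.window_start, self.count = now, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

# With the misconfigured cap the limiter effectively never throttles, so a
# burst of 30,000+ msgs/min passes straight into a stack capped at ~20,000.
limiter = RateLimiter(MISCONFIGURED_MSGS_PER_MIN)
```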

The incident was discovered by our normal monitoring as the queue depth began growing at approximately 5:30 AM. The team was watching the queue and working to scale the stack in time, but it was overwhelmed before they could respond, and the EC2 instance running our internal queuing process rebooted itself.
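The report does not describe the monitoring stack itself; the following is one possible shape for a queue-depth check against the RabbitMQ management API. The queue name, credentials, and alert threshold are all assumptions.

```python
# Illustrative queue-depth check (assumed queue name, credentials, and
# threshold). Polls the RabbitMQ management HTTP API for the message count.
import requests

RABBIT_MGMT = "http://localhost:15672/api/queues/%2F/internal-jobs"  # assumed
DEPTH_ALERT_THRESHOLD = 100_000  # assumed alert level

def queue_depth() -> int:
    resp = requests.get(RABBIT_MGMT, auth=("guest", "guest"), timeout=5)
    resp.raise_for_status()
    return resp.json()["messages"]  # ready + unacknowledged messages

if queue_depth() > DEPTH_ALERT_THRESHOLD:
    print("ALERT: queue depth exceeds threshold; scale consumers now")
```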

The server failed to restart the queue daemon on reboot, and the team began building a second server. Before the second server was complete, the cause of the startup failure was identified and corrected, RabbitMQ was restarted, and service was restored.
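The report does not say why the daemon failed to come back after the reboot. As a hedged sketch of the kind of boot-time guard that prevents this failure mode, the snippet below verifies that rabbitmq-server is enabled and running under systemd and brings it up if not; a systemd-based EC2 host is assumed.

```python
# Hypothetical boot-time guard: ensure the RabbitMQ service is enabled at
# boot and currently running. Assumes a systemd-based host.
import subprocess

def unit_state(unit: str, verb: str) -> str:
    """Run `systemctl <verb> <unit>` and return its stdout, e.g. 'active'."""
    result = subprocess.run(["systemctl", verb, unit],
                            capture_output=True, text=True)
    return result.stdout.strip()

if unit_state("rabbitmq-server", "is-enabled") != "enabled":
    subprocess.run(["systemctl", "enable", "rabbitmq-server"], check=True)
if unit_state("rabbitmq-server", "is-active") != "active":
    subprocess.run(["systemctl", "start", "rabbitmq-server"], check=True)
```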

During the outage our system was completely unresponsive. Queued jobs and database feeds were all delayed for the duration of the outage.

Any traffic sent to api.csgjourney.com for graph executions would have received a timeout, as the graph executions powering those URLs were not running.
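For callers of the API, a retry with exponential backoff is one way to ride out short interruptions like this one. This is a client-side sketch, not part of the report; the function name and retry parameters are illustrative.

```python
# Illustrative client-side handling of timeouts from the graph-execution API.
import time
import requests

def post_with_backoff(url: str, payload: dict, attempts: int = 5):
    """POST with exponential backoff: wait 1s, 2s, 4s, ... between retries."""
    for attempt in range(attempts):
        try:
            return requests.post(url, json=payload, timeout=10)
        except requests.exceptions.Timeout:
            time.sleep(2 ** attempt)
    raise TimeoutError(f"{url} unreachable after {attempts} attempts")
```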
Posted Apr 14, 2023 - 10:00 UTC