On Feb 29th, 2024, between 10:45 am EST and 11:30 am EST, Circle experienced intermittent instability for about 1 hour with roughly 41 minutes of downtime.
The issue stemmed from a surge in platform usage, triggering an overwhelming number of tasks to update user engagement statistics, ultimately leading to downtime.
We apologize for this incident and are taking the necessary steps to ensure this doesn’t happen again.
To immediately address the issue, we paused the default task queue, isolated the job that caused the issue to a separate queue, and paused that queue. After restarting both background and web services, the platform was restored.
Going forward, we’re implementing a dedicated pod with minimal concurrency to handle spikes in updating user engagement statistics in the future.