Partial Outage

Incident Report for Circle

Postmortem

Summary

On Feb 29, 2024, between 10:45 am EST and 11:30 am EST (15:45 to 16:30 UTC), Circle experienced about 1 hour of intermittent instability, including roughly 41 minutes of downtime.

A surge in platform usage triggered an overwhelming number of tasks to update user engagement statistics, which ultimately caused the downtime.

We apologize for this incident and are taking the necessary steps to ensure this doesn’t happen again.

Resolution and Recovery

To immediately address the issue, we paused the default task queue, isolated the job that caused the issue to a separate queue, and paused that queue. After restarting both background and web services, the platform was restored.

Going forward, we’re implementing a dedicated pod with minimal concurrency to handle spikes in tasks that update user engagement statistics.
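The isolation approach described above can be sketched in a few lines. This is an illustration only, not Circle's actual implementation: the report does not name the job system in use, so the Python thread pools below stand in for the queues, and `TaskRouter`, `ConcurrencyProbe`, `ENGAGEMENT_JOB`, and the worker counts are all hypothetical. The idea shown is that engagement-statistics jobs go to their own pool with a concurrency of 1, so a spike in those jobs queues up behind a single worker instead of competing with everything else on the default queue.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical job name; the report does not say what the job is called.
ENGAGEMENT_JOB = "update_engagement_stats"


class ConcurrencyProbe:
    """Records the peak number of tasks running at the same time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self._running += 1
            self.peak = max(self.peak, self._running)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self._running -= 1
        return False


class TaskRouter:
    """Routes engagement jobs to a dedicated, low-concurrency pool."""

    def __init__(self, default_workers=8, engagement_workers=1):
        # Assumed sizes: a wide default pool and a single-worker isolated pool.
        self._default = ThreadPoolExecutor(max_workers=default_workers)
        self._engagement = ThreadPoolExecutor(max_workers=engagement_workers)

    def enqueue(self, job_name, fn, *args):
        # Only the engagement job is isolated; everything else uses the default pool.
        pool = self._engagement if job_name == ENGAGEMENT_JOB else self._default
        return pool.submit(fn, *args)

    def shutdown(self):
        self._default.shutdown(wait=True)
        self._engagement.shutdown(wait=True)


def run_demo(n_jobs=20):
    """Simulates a spike of engagement jobs and returns (peak concurrency, results)."""
    probe = ConcurrencyProbe()

    def update_stats(user_id):
        with probe:
            time.sleep(0.005)  # stand-in for the real statistics update
        return user_id

    router = TaskRouter()
    futures = [router.enqueue(ENGAGEMENT_JOB, update_stats, i) for i in range(n_jobs)]
    results = [f.result() for f in futures]
    router.shutdown()
    return probe.peak, results


if __name__ == "__main__":
    peak, results = run_demo()
    print(f"jobs completed: {len(results)}, peak concurrency: {peak}")
```

The trade-off in this sketch is that a spike in engagement jobs takes longer to drain, but it can no longer occupy the workers that serve other tasks.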

Posted Feb 29, 2024 - 20:05 UTC

Resolved

This incident has been resolved.

Posted Feb 29, 2024 - 17:27 UTC

Update

Service is operational. We're monitoring for any further issues.

Posted Feb 29, 2024 - 16:47 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Feb 29, 2024 - 16:29 UTC

Update

We are continuing to work on a fix for this issue.

Posted Feb 29, 2024 - 16:15 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Feb 29, 2024 - 16:08 UTC

Monitoring

A fix has been implemented and we're monitoring the results.

Posted Feb 29, 2024 - 15:55 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Feb 29, 2024 - 15:45 UTC

This incident affected: Communities (Sign-up & Login, Posts & Comments, Notifications, Direct Messaging & Chat, Paywalls & Member Billing, Live Streams & Rooms, Courses, Events, Workflows, Analytics), Developer API (REST API), and Apps (Circle iOS App, Circle Android App, Branded Apps).