We would like to sincerely apologize for the outage which occurred yesterday, February 28th, from 9:48AM PST to 3:30PM PST. We understand how much you depend on us to do your work in your ministry, and we take that responsibility very seriously.
A full summary of the day, including the steps we plan to take, follows.
Thank you for letting us serve your church.
From 9:48AM PST to 3:30PM PST, requests to Services, Check-ins, Giving, People, Resources, Registrations, and Groups were met with slow response times, and many features in Services and People were unavailable. The root cause was an inability to read files from, or write files to, file storage at our hosting provider (Amazon Web Services).
9:48AM PST We began receiving alerts of increased errors across several applications.
9:55AM PST We identified the source of the errors as our inability to interact with the file storage system.
10:00AM PST We began disabling features that relied on file storage in an attempt to keep Services and People available.
12:00PM PST Other AWS services began to fail as a result of the sheer size of the outage. In an attempt to reduce load on our background servers we issued a restart to some of them. The restart failed, and we were unable to replace those servers the way we normally would.
12:10PM PST Faced with the loss of some very important servers, we decided to take the sites down to prevent damage to user data while we processed a large backlog of queued tasks.
1:15PM PST We were able to fix one of the background servers and re-purpose some other servers to help process background tasks.
1:35PM PST We felt comfortable enough with the current server load to bring the sites back up with some features still disabled.
3:30PM PST The underlying issues were resolved by our hosting provider and we were able to restore all degraded features to full functionality.
We were unable to handle the volume of errors that resulted from being unable to read files from, or write files to, AWS.
We were able to restore full functionality only at 3:30PM PST, once our hosting provider had resolved the underlying issue.
After a complete review and analysis of our performance yesterday, we will take the following actions to improve our resilience to failures in the third-party products we rely on.
We are doing a complete review of all third-party integrations, and will take steps to deal more gracefully with changes in the availability of service providers.
We will modify our internal architecture to better handle slow response times from different parts of our system.
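As an illustration of the kind of change described in these action items, the sketch below shows one common pattern for isolating a slow or failing dependency: a circuit breaker. It is a minimal example under stated assumptions, not a description of our actual implementation. The names CircuitBreaker, CircuitOpenError, and fetch_attachment, and the thresholds used, are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are skipped because the dependency looks unhealthy."""


class CircuitBreaker:
    """Hypothetical sketch: stop calling a dependency after repeated failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # failures tolerated before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of waiting on a dependency that is down.
                raise CircuitOpenError("dependency temporarily disabled")
            # Cool-down elapsed: allow a trial call; one failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def fetch_attachment(breaker, download, key):
    """Return the file contents, or None if file storage is unavailable."""
    try:
        return breaker.call(download, key)
    except Exception:
        # Degrade the feature rather than failing the whole request.
        return None
```

With a wrapper like this, a page that normally shows an attachment could render without it while file storage is unavailable, instead of tying up a server waiting on repeated failing requests.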
Planning Center is committed to continually improving our technology and operational processes to prevent outages. We appreciate your patience and again apologize for the impact to you and your church.
Thank you for your business and continued support.