Slowness in All apps
Incident Report for Planning Center
Postmortem

We would like to sincerely apologize for the outage which occurred yesterday, February 28th, from 9:48AM PST to 3:30PM PST. We understand how much you depend on us to do your work in your ministry, and we take that responsibility very seriously.

A full summary of the day, including the steps we plan to take, follows.

Thank you for letting us serve your church.

Issue Summary

From 9:48AM PST to 3:30PM PST, requests to Services, Check-ins, Giving, People, Resources, Registrations, and Groups were met with slow response times, and many features in Services and People were unusable. The root cause was an inability to read files from, or write files to, the file storage system of our hosting provider (Amazon Web Services).

Timeline

9:48AM PST We began receiving alerts of increased errors across several applications.

9:55AM PST We identified the source of the errors as our inability to interact with the file storage system.

10:00AM PST We began disabling features that relied on file storage in an attempt to keep Services and People available.

12:00PM PST Other AWS services began to fail as a result of the sheer size of the outage. In an attempt to reduce load on our background servers we issued a restart to some of them. The restart failed, and we were unable to replace those servers the way we normally would.

12:10PM PST Faced with the loss of some very important servers, we made the decision to take the sites down in order to prevent damage to user data while we processed a large backlog of queued tasks.

1:15PM PST We were able to fix one of the background servers and re-purpose some other servers to help process background tasks.

1:35PM PST We felt comfortable enough with the current server load to bring the sites back up with some features still disabled.

3:30PM PST The underlying issues were resolved by our hosting provider and we were able to restore all degraded features to full functionality.

Root Cause

Our applications were unable to handle the volume of errors that resulted when we could not read files from, or write files to, AWS.

Resolution and Recovery

We were able to restore full functionality at 3:30PM PST, only once our hosting provider had resolved the underlying issue.

Corrective and Preventative Measures

After a complete review and analysis of our performance yesterday, the following actions will be taken to improve our resilience to failures in the third-party products we rely on.

  • We are doing a complete review of all third-party integrations, and will take steps to deal more gracefully with changes in the availability of service providers.

  • We will modify our internal architecture to better handle slow response times from different parts of our system.
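One common pattern for this kind of resilience is a circuit breaker: after a dependency fails several times in a row, callers stop sending it requests for a cool-down period and fall back to a degraded response instead of piling up slow, failing calls. The Python sketch below is illustrative only and is not a description of Planning Center's actual code. The class name, the thresholds, and the wrapped function are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the dependency looks unavailable."""


class CircuitBreaker:
    """Minimal circuit breaker (illustrative; names and defaults are assumptions)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds while open
        self.clock = clock                # injectable so the logic can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of waiting on a dependency known to be down.
                raise CircuitOpenError("dependency marked unavailable")
            # Cool-down elapsed: let one trial call through (half-open state).
            # A single further failure will reopen the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success resets the failure count.
        self.failures = 0
        return result
```

A caller would wrap each request to the storage provider in `breaker.call(...)` and treat `CircuitOpenError` as a signal to show a degraded page (for example, hiding file-dependent features) rather than waiting on a request that is likely to fail.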

Planning Center is committed to continually improving our technology and operational processes to prevent outages. We appreciate your patience and again apologize for the impact to you and your church.

Thank you for your business and continued support.

Posted Mar 01, 2017 - 17:24 PST

Resolved
All services are restored to full functionality.
Posted Feb 28, 2017 - 15:41 PST
Update
Though we're back up, file uploads, chord charts, printing plans, & activity feeds are still down.
Posted Feb 28, 2017 - 13:55 PST
Update
Most of Planning Center is back up, though file uploads still won't work.
Posted Feb 28, 2017 - 13:35 PST
Update
Unfortunately we're having to put the site into maintenance mode. We'll get it back as soon as possible.
Posted Feb 28, 2017 - 12:10 PST
Update
Amazon recently posted:

Increased Error Rates
Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue.
http://status.aws.amazon.com
Posted Feb 28, 2017 - 11:48 PST
Update
Our current issues are tied to Amazon AWS. You can follow their status at http://status.aws.amazon.com
Posted Feb 28, 2017 - 11:47 PST
Update
The Activity tab in PCO People has been disabled, which seems to speed other things up.
Posted Feb 28, 2017 - 10:27 PST
Update
We've disabled the Activity tab in Planning Center People. This should help speed up other things.
Posted Feb 28, 2017 - 10:25 PST
Update
Planning Center lyrics & chord PDFs are generated on Amazon S3. We've turned off generation of Lyric & Chord PDFs to try and speed up the rest of Services. However, the S3 outage will affect anything related to files. You won't be able to access any files until Amazon restores S3 service.
Posted Feb 28, 2017 - 10:13 PST
Monitoring
Many websites are experiencing difficulties at the moment. Amazon S3 helps power a huge number of sites, not just Planning Center. We're trying to turn off some features in Planning Center Services that rely heavily on S3 to try and reduce the requests we send them. This will hopefully allow other processes to speed up.
Posted Feb 28, 2017 - 10:06 PST
Identified
The current slowness seems to be caused by issues with Amazon's S3 file storage. We're trying to work around it.
Posted Feb 28, 2017 - 09:59 PST
Investigating
We're experiencing some slowness in all Planning Center apps and are investigating the cause.
Posted Feb 28, 2017 - 09:48 PST