Issue Summary

On Sunday, March 20, 2016, from approximately 8:20AM PDT to 9:10AM PDT, requests to Services and Check-ins were met with high response times that caused many requests to time out, interrupting service to users of Services, Services Live, and Check-ins. The root cause was a shift in traffic from our East Data Center (DC) to our West DC. Despite recent upgrades, the West DC web servers were overburdened and unable to keep up with incoming requests.

Timeline

7:32AM PDT Alert triggered for MySQL replication errors on west MySQL server.

7:38AM PDT Alert was acknowledged and work began.

7:38AM PDT Alert triggered for MySQL replication errors on east MySQL server.

7:38AM PDT Alert acknowledged.

7:48AM PDT West to East replication restored.

7:54AM PDT West to East replication fails again.

7:58AM PDT Database support ticket opened.

7:58AM PDT Decision to transfer traffic East to West is made.

8:06AM PDT Support was notified of the issue.

8:20AM PDT Traffic begins to shift East to West, increasing response times in the West DC.

8:33AM PDT Decision made to revert settings to send traffic back to both data centers.

8:51AM PDT Changes successfully reverted, and traffic begins to balance, bringing down response times.

9:05AM PDT Database support repairs the broken replication.

9:10AM PDT Response times completely recover.

9:41AM PDT Work commenced checking and repairing data integrity in MySQL.

3:17PM PDT MySQL data clean up is completed with no lost data.

Root Cause

At 7:58AM PDT, a decision was made to shift the traffic going to the East DC to the West DC. Even after adding a web server to our West DC, it was unable to consistently handle the surge in incoming requests.

Resolution and Recovery

At 8:33AM PDT the error of this initial decision was recognized. Work commenced to re-balance traffic.

From this point to 8:44AM PDT we struggled to get our DNS provider to reset to our original configuration.

From 8:44AM PDT to 8:51AM PDT work to revert the changes halted as we considered the implications of that action.

At 8:51AM PDT changes were reverted and response times immediately improved.

At 9:10AM PDT response times completely recovered.

Corrective and Preventative Measures

After a complete review and analysis of the outage, the following actions are being taken to address the underlying causes of the issue, to help prevent recurrence, and to improve response times:

Remove the ability for traffic to be shifted to an individual DC when requests are above a threshold that it cannot handle.
Document the correct procedure to deal with replication errors so there is no confusion on the correct actions to take.
Improve our internal communication to reduce the lag between recognition of mistakes and corrective actions.
Continue to work aggressively on increasing our infrastructure's capacity and resilience.
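
The first measure above, blocking a shift to a single DC that cannot handle the load, could be sketched as a pre-flight capacity check. This is only an illustration, not our actual tooling: the `DataCenter` type, the `can_absorb` and `shift_traffic` functions, the 80% headroom figure, and all request rates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCenter:
    name: str
    current_rps: float   # requests per second currently served (hypothetical metric)
    capacity_rps: float  # assumed maximum sustainable requests per second


def can_absorb(target: DataCenter, source: DataCenter, headroom: float = 0.8) -> bool:
    """Return True if target could serve its own traffic plus all of source's
    traffic while staying at or below headroom * capacity."""
    projected = target.current_rps + source.current_rps
    return projected <= headroom * target.capacity_rps


def shift_traffic(source: DataCenter, target: DataCenter) -> str:
    """Refuse the shift when the target DC cannot absorb the combined load."""
    if not can_absorb(target, source):
        raise RuntimeError(
            f"refusing to shift {source.name} -> {target.name}: "
            "projected load exceeds safe capacity"
        )
    # A real implementation would update DNS or load balancer settings here.
    return f"shift {source.name} -> {target.name} approved"
```

With a guard like this, a shift requested during peak traffic fails fast with an error instead of overloading the remaining DC.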

Planning Center Online is committed to continually improving our technology and operational processes to prevent outages. We appreciate your patience and, again, apologize for the impact to you and your church. We thank you for your business and continued support.

Posted Mar 21, 2016 - 13:40 PDT

Resolved

Everything is back to normal.

Posted Mar 20, 2016 - 09:34 PDT

Update

Response times are back towards normal, but you may still see some errors while our data centers catch up.

Posted Mar 20, 2016 - 09:09 PDT

Identified

The situation has been identified and addressed. Performance is improving and should be fully back to normal shortly.

Posted Mar 20, 2016 - 08:57 PDT

Investigating

We're experiencing some slower than normal response times for some customers. We're investigating now.

Posted Mar 20, 2016 - 08:45 PDT