On Sunday, March 20, 2016, from approximately 8:20AM PDT to 9:10AM PDT, requests to Services and Check-ins experienced high response times, and many requests timed out, interrupting service to users of Services, Services Live, and Check-ins. The root cause was a shift in traffic from our East Data Center (DC) to our West DC. Despite recent upgrades, the West DC web servers were overburdened and unable to keep up with incoming requests.
7:32AM PDT Alert triggered for MySQL replication errors on west MySQL server.
7:38AM PDT Alert was acknowledged and work began.
7:38AM PDT Alert triggered for MySQL replication errors on east MySQL server.
7:38AM PDT Alert acknowledged.
7:48AM PDT West to East replication restored.
7:54AM PDT West to East replication fails again.
7:58AM PDT Database support ticket opened.
7:58AM PDT Decision to transfer traffic East to West is made.
8:06AM PDT Support was notified of the issue.
8:20AM PDT Traffic begins to shift East to West, increasing response times in the West DC.
8:33AM PDT Decision made to revert settings to send traffic back to both data centers.
8:51AM PDT Changes successfully reverted, and traffic begins to balance, bringing down response times.
9:05AM PDT Database support repairs the broken replication.
9:10AM PDT Response times completely recover.
9:41AM PDT Work commenced checking and repairing data integrity in MySQL.
3:17PM PDT MySQL data clean up is completed with no lost data.
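The intervals implied by the timeline above can be worked out with a short script. This is only an illustrative sketch: the timestamps are copied from the timeline, while the event labels and the helper name `minutes_between` are our own naming for this example and not part of any tooling used during the incident.

```python
from datetime import datetime

# Timestamps copied from the timeline above (March 20, 2016, all PDT).
# The dictionary keys are illustrative labels chosen for this sketch.
EVENTS = {
    "first_alert": "7:32AM",
    "shift_decision": "7:58AM",
    "traffic_shift_begins": "8:20AM",
    "revert_decision": "8:33AM",
    "revert_applied": "8:51AM",
    "full_recovery": "9:10AM",
}

TIME_FORMAT = "%I:%M%p"  # 12-hour clock with AM/PM suffix


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes elapsed between two labeled timeline events."""
    started = datetime.strptime(EVENTS[start], TIME_FORMAT)
    ended = datetime.strptime(EVENTS[end], TIME_FORMAT)
    return int((ended - started).total_seconds() // 60)


if __name__ == "__main__":
    print("User-facing impact:", minutes_between("traffic_shift_begins", "full_recovery"), "min")
    print("Shift to revert decision:", minutes_between("traffic_shift_begins", "revert_decision"), "min")
    print("Revert decision to revert applied:", minutes_between("revert_decision", "revert_applied"), "min")
```

By this arithmetic, the user-facing impact lasted 50 minutes, of which 18 minutes fell between the decision to revert and the revert taking effect.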
At 7:58AM PDT, a decision was made to shift the traffic going to the East DC to the West DC. Even with an additional web server recently added, the West DC was unable to consistently handle the surge in incoming requests.
At 8:33AM PDT the error of this initial decision was recognized. Work commenced to re-balance traffic.
From this point to 8:44AM PDT we struggled to get our DNS provider to reset to our original configuration.
From 8:44AM PDT to 8:51AM PDT, work to revert changes halted as we considered the implications of that action.
At 8:51AM PDT, changes were reverted and response times immediately improved.
At 9:10AM PDT, response times completely recovered.
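The capacity problem described in this section can be expressed as a simple pre-failover check. The sketch below is hypothetical: the function name `can_absorb_failover`, the 20% headroom default, and every number in the usage example are assumptions made for illustration. None of them are measurements from this incident or a description of our actual tooling.

```python
def can_absorb_failover(
    surviving_capacity_rps: float,
    surviving_load_rps: float,
    shifted_load_rps: float,
    headroom: float = 0.2,
) -> bool:
    """Return True if the surviving data center can take the shifted traffic.

    All arguments are requests per second. `headroom` is the fraction of
    capacity deliberately kept spare (an assumed policy value, not a
    figure from the incident).
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_capacity = surviving_capacity_rps * (1 - headroom)
    return surviving_load_rps + shifted_load_rps <= usable_capacity


if __name__ == "__main__":
    # Hypothetical numbers only: a site rated for 1000 rps already serving 600 rps.
    print(can_absorb_failover(1000, 600, 500))  # shifting 500 rps exceeds usable capacity
    print(can_absorb_failover(1000, 600, 100))  # shifting 100 rps fits within headroom
```

Running a check of this kind before redirecting traffic would flag a shift that the receiving data center cannot sustain, which is the situation the West DC encountered here.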
After a complete review and analysis of the outage, the following actions are being taken to address the underlying causes of the issue, to help prevent recurrence, and to improve response times:
Planning Center Online is committed to continually improving our technology and operational processes to prevent outages. We appreciate your patience and, again, apologize for the impact to you and your church. We thank you for your business and continued support.