Major Outage
Incident Report for Ovrture
Postmortem

Good afternoon Ovrture users,

Happy Friday to you all!

This document details the cause of Ovrture’s outage on March 14, 2019, the events immediately following it, and the steps we are taking to mitigate the impact of similar outages in the future.

On March 14, 2019, all Ovrture microsites experienced an outage that affected the ability of users to edit microsites and of prospects/donors to access them. The outage lasted from 3:18 PM to 3:42 PM.

The underlying cause of the outage was that Chef on the new backup server changed the filesystem permissions of the EFS mount point to be too restrictive, which blocked the web servers from accessing the microsite files. When Chef ran on one of the application servers, it would fix the problem. Once we identified the root cause, the immediate fix was simple: we changed the Chef recipe on the backup server to use permissions compatible with what the application servers need.
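To illustrate what "too restrictive" means here, the following is a minimal sketch of how a directory's mode bits decide whether another account can use it. It assumes, purely for illustration, that the web server process is neither the owner of the mount point nor in its group; the modes 0o755 and 0o700 and the function name `others_can_traverse` are hypothetical and are not taken from our actual configuration.

```python
import stat


def others_can_traverse(mode: int) -> bool:
    """Return True if accounts outside the owner and group can both
    list (read bit) and enter (execute bit) a directory with this mode."""
    needed = stat.S_IROTH | stat.S_IXOTH
    return (mode & needed) == needed


# Hypothetical modes: 0o755 is open to other accounts, 0o700 is owner-only.
print(others_can_traverse(0o755))
print(others_can_traverse(0o700))
```

Under that assumption, a recipe that tightens a shared directory to an owner-only mode would lock out every other account that reads from it, which matches the behavior we saw.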

Things we will improve to make sure issues like this do not happen again:
-In retrospect, it was a mistake for the backup server to mount the EFS filesystem in read/write mode. A read-only mode is available, and it would have prevented any changes (foreseen or otherwise) from propagating. We will also verify that the backup server has only read-only access to S3 and the RDS database.
-Our Nagios monitoring needs to be beefed up. StatusPage picked up the failure, but our Nagios checks did not. We had not added the new health checks to our system. We should also add some additional internal health and system checks.
-We see the need for a more segmented rollout of Chef changes. We currently test against staging machines before deploying to production, but this experience shows that this process is not sufficient. We need to look at how we might segment Chef rollouts to force a lag between propagation to staging and propagation to production. Coupled with better monitoring, this should help prevent a bad configuration from making it into production.
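As one example of the kind of internal check we have in mind, here is a minimal sketch of a Nagios-style check that reports whether a filesystem is mounted read-only. It is an illustration under stated assumptions, not our production check: the function name `check_read_only` and the `/mnt/efs` path in the usage note are hypothetical, and the input is assumed to be text in the Linux `/proc/mounts` format (device, mount point, type, comma-separated options). The return codes follow the standard Nagios plugin convention of 0 for OK, 2 for CRITICAL, and 3 for UNKNOWN.

```python
from typing import Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_read_only(mounts_text: str, mount_point: str) -> Tuple[int, str]:
    """Inspect mount-table text and report whether mount_point is read-only.

    Returns a (status_code, message) pair in the Nagios plugin style.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip malformed lines and entries for other mount points.
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        options = fields[3].split(",")
        if "ro" in options:
            return OK, f"OK - {mount_point} is mounted read-only"
        return CRITICAL, f"CRITICAL - {mount_point} is mounted read/write"
    # The mount point is missing entirely, which is also worth flagging.
    return UNKNOWN, f"UNKNOWN - {mount_point} not found in mount table"
```

A wrapper script on the backup server could read `/proc/mounts`, call `check_read_only(text, "/mnt/efs")`, print the message, and exit with the returned status code so that Nagios can alert on it.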

We sincerely apologize for the impact of March 14th’s service disruption on your applications. We take great pride in the reliability that Ovrture offers, but we also recognize that we can do more to improve it. You can be confident that we will continue to work diligently to improve the service and ensure that outages like this have the least possible effect on our customers.

Onward,

Gideon and the Ovrture Team

Posted Mar 14, 2019 - 19:50 UTC

Resolved
This incident has been resolved.
Posted Mar 14, 2019 - 19:42 UTC
Update
We are continuing to investigate this issue.
Posted Mar 14, 2019 - 19:41 UTC
Investigating
We are currently investigating this issue.
Posted Mar 14, 2019 - 19:30 UTC