Major Outage

Incident Report for Ovrture

Postmortem

Good afternoon Ovrture users,

This document details the cause of Ovrture’s incident on January 28, 2020, the events that immediately followed, and the steps we are taking to mitigate the impact of similar outages in the future.

On January 28, 2020, there was a major outage affecting microsites, login pages, and the database. The outage lasted from 6:00 AM to 9:00 AM.

The Ovrture development team was the first to notice the issue, through a notification from the Ovrture status page. Ovrture investigated the issue and notified the Ovrture development operations team, who began looking into the outage immediately. The outage was traced to a race condition in the reboot process between EFS, Tomcat, and Ubuntu 18's new security separation layers, which made the reboot sequence unpredictable. The regular nightly updates applied last night were themselves fine; it was the reboot following them that caused the failure. Subsequent reboots hit the same issue.

The underlying issues have been fixed and the race condition on reboot has been resolved. Reboots are now safe.
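
For readers curious about this class of problem, here is a minimal sketch of one generic way to avoid a boot-time race: wait until a shared file system is actually mounted before starting the service that depends on it. This is an illustration only and not a description of the fix applied in production; the function name `wait_for_mount`, its parameters, and the timeout values are all assumptions made for the example.

```python
import os
import time


def wait_for_mount(path, timeout=60.0, interval=1.0, is_ready=os.path.ismount):
    """Poll until `path` passes the readiness check or `timeout` seconds elapse.

    Hypothetical helper for illustration. `is_ready` defaults to
    os.path.ismount, which reports whether `path` is a mount point.
    Returns True if the check succeeded, False if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready(path):
            return True
        if time.monotonic() >= deadline:
            return False
        # Back off briefly before checking again.
        time.sleep(interval)
```

A startup wrapper could call `wait_for_mount` on the shared mount point (for example a hypothetical `/mnt/efs`) and launch the application server only when it returns `True`, so that the start order no longer depends on timing during reboot.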

Steps we are taking to make sure issues like this do not happen again:

We will be improving the production environment's infrastructure to ensure this specific issue does not happen again.
We will reduce downtime by developing an "off-hours" notification protocol.
We will be auditing and possibly adjusting our Incident Response Plan (IRP) to provide better early warning, coordination, and resolution of issues.
To prevent slowdowns during debugging of future incidents, our Nagios site is now accessible from all of the Ovrture VPNs.

We are very sorry for any inconvenience this incident may have caused on January 28, and we will continue to work hard to make sure something like this doesn’t happen again.

Onward,

Gideon and the Ovrture Team

Posted Jan 28, 2020 - 22:07 UTC

Resolved

This incident has been resolved.

Posted Jan 28, 2020 - 14:14 UTC

Update

This incident has been resolved. You can now use the platform as usual. Thank you for your patience as this was resolved. We will continue to monitor the situation and will publish a postmortem in the next 24 hours.

Posted Jan 28, 2020 - 14:13 UTC

Update

We are continuing to investigate the issue. There are some live sites/reports that are operational at this point.

Posted Jan 28, 2020 - 13:19 UTC

Update

We are continuing to investigate the issue. Live sites/reports are operational.

Posted Jan 28, 2020 - 13:17 UTC

Update

We are continuing to investigate the issue. Live sites/reports are operational.

Posted Jan 28, 2020 - 13:15 UTC

Investigating

We are currently investigating the issue.

Posted Jan 28, 2020 - 12:50 UTC