Discourse Outage: Root Cause Analysis

Hi all,

Yesterday, at 15:30 UK time, our Discourse VM went offline. All subsequent attempts to rescue it failed, and we’ve now restored the site from backup. Sadly, that backup is a week old, so the past week’s posts have been lost. This post details what happened, and what we’ll be doing to try to ensure it doesn’t happen again.

What happened?

  • 15:30 - VM becomes unresponsive
  • 15:35 - I try to log in via the web console, but it only shows errors
  • 15:40 - I hard-reboot the VM, but it doesn’t reboot. I open a support ticket
  • 15:56 - Support advises this is a router issue and that the site will return shortly
  • 16:00 - Since this is a router issue, I decide to use the outage to perform a hardware upgrade on the VM
    • This involves “archiving” the VM, changing the hardware spec, and then un-archiving
  • 16:05 - The VM becomes stuck in its archiving process and reports its status as ‘Error’

The VM then stayed in this state until 09:45 this morning, when Scaleway discovered a critical error on the underlying hypervisor and terminated the VM. Sadly, this destroyed the local SSD.

Backups?

Sadly, we don’t have any SSD snapshots on the Scaleway platform - until just a few days ago there was no hot-snapshot support, and I didn’t want to take the site down for cold snapshots.

We were taking daily backups of Discourse itself, using Discourse’s own tools and then copying the files to another host (all automated). However, due to this bug, we started to see daily outages when the backup ran at 3am, leaving the site down until around 8am when someone could log in and kick it. To avoid this, we switched to manual backups, which I would do at least weekly.

The final issue was that the cronjob which copies the files to the other machine runs in the early morning. So while I did take a backup as usual yesterday morning, the timing was the worst possible - the cronjob had not yet copied the backup offsite.
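
One obvious mitigation here is to push each backup offsite as soon as it’s taken, rather than waiting for a separate early-morning cronjob. Here’s a minimal sketch of that idea - the backup directory assumes a standard Docker-based Discourse install, and the destination host and path are placeholders, not our real configuration:

```python
#!/usr/bin/env python3
"""Copy the newest Discourse backup offsite as soon as it exists.

A sketch only: the backup directory assumes a standard Docker-based
Discourse install, and the destination host/path are placeholders.
"""
import pathlib
import subprocess

BACKUP_DIR = pathlib.Path("/var/discourse/shared/standalone/backups/default")
DEST = "backups@offsite-host:/srv/discourse-backups/"  # placeholder


def push_latest_backup():
    # Pick the most recently modified backup archive.
    backups = sorted(BACKUP_DIR.glob("*.tar.gz"), key=lambda p: p.stat().st_mtime)
    if not backups:
        raise SystemExit(f"no backups found in {BACKUP_DIR}")
    latest = backups[-1]
    # rsync is safe to re-run and can resume partial transfers.
    subprocess.run(["rsync", "-a", "--partial", str(latest), DEST], check=True)
    print("Copied", latest.name, "to", DEST)


if __name__ == "__main__":
    push_latest_backup()
```

Chaining the copy onto the backup itself would close the window where the only copy of a backup lives on the VM’s local SSD.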

Conclusions

At the heart of this is a really unfortunate set of coincidences: a hypervisor failure, a Docker bug preventing regular automated backups, and timing such that the latest manual backup hadn’t yet been copied offsite. Still, there are lessons to be learned…

  • Thanks to our sponsors at Scaleway, hot snapshots have been enabled on our account.
    • I will be setting up a regular API-based cronjob to take these snapshots (see the sketch after this list)
  • The backup job inside Discourse will be re-enabled, but set to run at 10am
    • This means I can be present if the Docker bug continues to cause issues
  • We will also be looking at using the mail archive to re-import the old posts, so hopefully we can recover most of the data.
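
For the snapshot cronjob, the plan is roughly the following. This is a sketch only: it assumes the Scaleway Instance API’s snapshot endpoint, and the zone, volume ID, and environment variable names are placeholders rather than our real configuration:

```python
#!/usr/bin/env python3
"""Take a daily snapshot of the Discourse VM's volume.

A minimal sketch: assumes the Scaleway Instance API's snapshot endpoint;
the zone, volume ID, and env var names below are placeholders.
"""
import datetime
import os

import requests

ZONE = "fr-par-1"  # placeholder zone
VOLUME_ID = "00000000-0000-0000-0000-000000000000"  # placeholder volume ID
URL = f"https://api.scaleway.com/instance/v1/zones/{ZONE}/snapshots"


def take_snapshot():
    name = "discourse-" + datetime.date.today().isoformat()
    resp = requests.post(
        URL,
        headers={"X-Auth-Token": os.environ["SCW_SECRET_KEY"]},
        json={"name": name, "volume_id": VOLUME_ID},
        timeout=30,
    )
    resp.raise_for_status()
    # Assumes the API returns the new snapshot under a "snapshot" key.
    print("Created snapshot:", resp.json()["snapshot"]["id"])


if __name__ == "__main__":
    take_snapshot()
```

Running this from cron on a machine other than the Discourse VM itself means a dead VM can’t block its own snapshots.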

The site appears to be functional again - I have tested inbound mail delivery, and outbound mail appears to be going out. Please do report any issues to me, either by PM or in the #site-feedback category.

Once again, please accept my apologies for the situation. It’s incredibly upsetting to me to lose people’s data, and I really hope we don’t see such an unfortunate set of circumstances again.

Greg
