Information on the June 16th outage.

Published by QueUp on

Hey everyone. After what felt like an eternity, we are back up and running. We are deeply frustrated with how long it took us to restore the site, and over the coming weeks we will be looking at how to improve our response times to incidents like this.

What happened?

In the weeks leading up to June 16th, we had recurring issues with our infrastructure that directly affected all our systems, particularly the ones that store data. We were not, and still are not, sure of the root cause, but we believe the failure was in the software we were using.

On June 16th we were alerted to multiple issues within our infrastructure: some crucial services had stopped running, and some of our data volumes had become unhealthy. Under normal circumstances, if a data volume becomes unhealthy, our storage controller automatically starts rebuilding it; however, due to other failures within our system, this process could not complete.

Regrettably, this issue extended to our backups, and we were unable to recover any of our automatically generated backups.

We have successfully restored the site from a backup created on May 30th. However, this means anything created or changed between May 30th and June 16th has been lost, including changes to playlists, new rooms, and new accounts.

What happens next?

To offer some reassurance, we have changed how some of our back-end applications are run and hosted, moving to a setup that has previously worked for us without any issues. We will continue to monitor our systems to ensure they operate as smoothly as possible.

We are truly sorry for the huge inconvenience caused by this downtime and we will do everything possible to learn from this incident and improve as we move forward.
