Words by David Darke, May 19, 2018
Back in early 2017, the popular Git hosting service GitLab.com ran into trouble and very quickly discovered that a serious mistake had been made:
We accidentally deleted production data and might have to restore from backup. Google Doc with live notes https://t.co/EVRbHzYlk8
- GitLab.com Status (@gitlabstatus) February 1, 2017
A realisation like this doesn't just affect how an intensively used platform runs; it can cause a great deal of embarrassment and even breed mistrust among customers. I suspect many businesses would say they were having "technical issues" without ever describing the problem or how they fixed it.
GitLab did something amazing: they documented everything publicly. This level of company transparency is something we really aspire to at Atomic Smash, so we thought it would be valuable to talk through a recent infrastructure review we performed, with an emphasis on coping with disasters.
A WordPress website that is in active development or maintenance has multiple moving parts, but these can be easily defined, and a backup strategy can be formulated to make sure everything has been accounted for. These usually consist of:

- The source code (WordPress core, plugins, and themes)
- The media library (uploaded images, documents, and videos)
- The database
- The server environment itself

Here is what we do to cover all the bases for our customers:
This is the code that actually runs the site: the WordPress core, plus the plugins and themes that make the site unique. We use industry-standard methods to pull in website dependencies (PHP Composer) and version control (Git) to maintain and share code between developers.
The core code behind WordPress and its open-source dependencies is stored on GitHub but pulled in through a service called WordPress Packagist. The actual custom code behind the site is stored and backed up remotely on our self-hosted instance of GitLab. Whenever a developer makes ANY change to the site, it is recorded line by line, stored locally, and synced to GitLab.
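As a rough illustration of this setup, a project manifest pulling WordPress and its plugins in via Composer and WordPress Packagist might look something like the sketch below. The specific package names and version constraints here are assumptions for the example (the `johnpbloch/wordpress` core package and `wpackagist-plugin/*` names are common community conventions), not necessarily what we use:

```json
{
  "repositories": [
    {
      "type": "composer",
      "url": "https://wpackagist.org"
    }
  ],
  "require": {
    "johnpbloch/wordpress": "^4.9",
    "wpackagist-plugin/wordpress-seo": "^7.0",
    "wpackagist-theme/twentyseventeen": "*"
  }
}
```

With a manifest like this, `composer install` rebuilds the open-source parts of the site from scratch, which is why only the custom code needs to live in our own Git repositories.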
These are images, documents, and videos that have been uploaded to the WordPress media library. They essentially sit in a directory structure on the live server (usually under 'wp-content/uploads').
Every day the complete media library is synced to an external Amazon S3 bucket. This acts as a daily backup for all media on the site in an external environment with a different hosting company.
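A daily sync like this can be driven by a small script run from cron. The sketch below builds an `aws s3 sync` invocation in Python; the bucket name, prefix, and function names are illustrative placeholders, not our actual tooling:

```python
import subprocess


def build_sync_command(uploads_dir, bucket, prefix):
    """Build the AWS CLI invocation that mirrors the uploads directory to S3."""
    return [
        "aws", "s3", "sync",
        uploads_dir,
        f"s3://{bucket}/{prefix}",
    ]


def sync_media(uploads_dir="wp-content/uploads",
               bucket="example-site-backups", prefix="media"):
    """Run the sync; `aws s3 sync` only uploads new or changed files."""
    subprocess.run(build_sync_command(uploads_dir, bucket, prefix), check=True)
```

Because `s3 sync` is incremental, running it daily stays cheap even for large media libraries.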
The database stores pages, posts, and other written content for the site. We use MySQL to run our WordPress sites, so there is a fully configured MySQL server in the same environment as the website.
Every six hours an export (backup) of the database is made. Each export is stored on the live server and transferred to Amazon S3, which, like the media, keeps all the site's content backed up in an external environment. We usually store up to seven days of six-hourly database backups (28 backups in total). These durations are completely flexible; for sites that change very frequently, we sometimes store hourly backups for 30 days instead.
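The retention side of this schedule can be sketched as a small pruning helper: given the timestamps of existing exports, return the ones that have aged out of the window. The function name and timestamp handling here are illustrative assumptions, not our production code:

```python
from datetime import datetime, timedelta


def backups_to_delete(backup_times, now, keep_days=7):
    """Return the timestamps of backups older than the retention window.

    backup_times: list of datetime objects, one per stored export.
    keep_days: how many days of history to retain (7 days of 6-hourly
    exports is roughly the 28-backup window described above).
    """
    cutoff = now - timedelta(days=keep_days)
    return [t for t in backup_times if t < cutoff]
```

Changing the retention policy (say, 30 days of hourly exports) is then just a matter of changing `keep_days` and the cron schedule that produces the exports.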
DigitalOcean (our hosting provider of choice) takes an "environment" backup every week: an exact mirror of the entire server, no more than seven days old. Because of the gap between backups, the more tailored solutions above are preferable when it comes to restoring a site, but there is still value in having a backup of everything in situ.
When measuring the strength of a backup strategy, it's very common to consider only the live website and the environment it sits on. That part is usually very easy to back up and restore, yet many other failures can occur away from the live environment that still impact a client's site or business.
Here are some emergency scenarios, how our backup strategy copes with them, and the consequences that might follow. They are listed in order of likelihood, from quite possible through to practically impossible.
If the live database is corrupted or lost, recovery is easy: we grab the latest database backup from S3 and apply it to the site. We would then also investigate why this happened.
If the whole live server is destroyed, we would grab an environment backup from within DigitalOcean (which could be up to seven days old), then use the database and media backups on S3 to bring the site to its most recent possible state.
If a developer's machine dies, the source code can be retrieved from the external Git service onto a new machine. There is no effect on the live website.
If every development machine is lost at once, again, the source code can be retrieved from the external Git service. Apart from the delay in getting new machines (and a new office to put them in), there is no effect on the live server.
If our GitLab server dies: we use GitLab for our external Git repositories, it runs on DigitalOcean, and it has its own environment backups. We can use these to restore the software, users, and issues tied to all our projects. This backup, however, can be up to a week old. Luckily, any code changes also exist on developer machines and on the live servers, so they can be pulled from there.
If the GitLab instance is lost entirely, the live website isn't affected, yet our ability to modify it is. All source code would have to be retrieved from the live environment; we use Capistrano for our deployments, which puts a copy of the Git repository on the live server. However, data such as recorded support tickets tied to projects would be lost. More on this below.
If both the live server and our Git service are lost, we would first get the live site up and running from an environment backup and merge in the latest content from S3, then rebuild our development environments from the Git repository that exists on the live server.
This is the worst-case scenario. In this final instance, we would rely solely on the environment backups and run the risk of content being up to seven days old. We would also have to rebuild our development environment from the live Git repository inside the environment backup.
From this infrastructure review, the one area we hadn't really considered before is the extra data stored in our instance of GitLab. The most valuable to us would be the support tickets tied to projects. We could backtrack through the automated emails GitLab sends when issues are opened and responded to, yet that is obviously far from ideal. So we will add an extra external backup (on Amazon S3) of the GitLab instance to make sure we are covered.
The weekly environment backups of a server can always be up to seven days old, but DigitalOcean also has a 'snapshot' function that can be triggered on a more regular basis.
This functionality is also available via their API, so it is entirely possible to write a small script that forces a 'snapshot' backup every day. We are going to implement such a script shortly.
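A sketch of what such a script could look like, using DigitalOcean's public API v2 droplet-actions endpoint. The droplet ID, snapshot name, and token handling are placeholders for the example; in practice you would run this from cron with a real API token:

```python
import json
import urllib.request

API_BASE = "https://api.digitalocean.com/v2"


def snapshot_request(droplet_id, name, token):
    """Build the authenticated POST request that asks DigitalOcean
    to take a snapshot of a droplet."""
    payload = json.dumps({"type": "snapshot", "name": name}).encode()
    return urllib.request.Request(
        f"{API_BASE}/droplets/{droplet_id}/actions",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def take_snapshot(droplet_id, name, token):
    """Send the snapshot request and return DigitalOcean's JSON response."""
    with urllib.request.urlopen(snapshot_request(droplet_id, name, token)) as resp:
        return json.load(resp)
```

Scheduled daily, this closes most of the seven-day gap left by the weekly environment backups.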
We hope this sort of thought process for making sure our WordPress sites are safe is useful when you perform your own backup review.