Active Interview’s backup method – 7 Rules for Data Protection

Marco Arment, founder of Instapaper, recently wrote about the backup method he uses for that service.

First of all, props to Marco for writing that post.  It’s a selfless act to blog about your technical processes and methods, especially when it opens you up to “armchair hacker” criticism.  Sharing these kinds of methods is really valuable to us DIY-ers / lean start-up guys and gals.  It’s in that context, and in the context of horror stories like Ma.gnolia (total data loss), GitHub (testing vs. production db wires crossed) and CodingHorror.com (images lost), that Active Interview just audited our data protection & backup methods.  These are smart people running awesome sites – if it can happen to them, it’s worth stepping back for a few days to review what we’re doing to prevent a similar scenario from happening to us.

Our service at Active Interview does require an additional layer of backup that (I think) is different from Instapaper, since our application works around not just data stored in a DB, but also a large number of files associated with candidate interviews (videos, video thumbnails, resumes) as well as our customers’ file data (company logos, etc.).


Our constraints:

1. We outsource our storage and server infrastructure to Amazon (read our AWS Case Study).
2. We’re a lean start-up. Backups are managed by developers (and we like it that way).
3. We use mostly open-source / commodity technologies.


7 Rules for Data Protection (and how we achieve them)

1. All user-generated file data should live in at least two secure places at once.

Even though we use S3 for our user-generated content data store, and S3 has data redundancy baked in, we go a step further and back up data “off-site” from Amazon with another vendor at a non-Amazon data center.  We’re hedging against not just our vendor’s technical issues, but also operator error on our end.  Whenever we move data between servers, all transfers occur over SSH (password-less) with scp/rsync.  Rsync has a variety of friendly options to help keep you from overwriting or deleting files on your target server, which is a nice precaution.  We regularly ping our backup server and use a Webmin system check to test that it’s there and healthy.
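As an illustration (hostnames and paths below are placeholders, not our real layout), the kind of careful rsync invocation we mean looks something like this:

    # No --delete, so nothing is ever removed on the target; --backup/--backup-dir
    # keep a dated copy on the receiver of anything that would be overwritten;
    # add -n (--dry-run) to preview a transfer before running it for real.
    rsync -avz -e ssh \
          --backup --backup-dir=/backups/displaced/$(date +%Y%m%d) \
          /data/user-content/ backup@offsite.example.com:/backups/user-content/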

2. Database data should live in two secure places at once.

At the simplest level, this can be solved via Master/Slave replication.  This works great if the master dies, but isn’t going to help if erroneous code deletes data on the Master, as that DELETE will quickly replicate over to the Slave.  Binlogging as Marco describes can solve this problem in simple cases, as you can rewind/edit/replay the binlog.  Just in case we can’t restore properly from rewinding/replaying the binlog, we also do aggressive (hourly) snapshots of the database.  Note that you should flush/lock the database before you do a SQL dump, to ensure data consistency.  With mysqldump you can use --flush-logs and --lock-all-tables.  It’s also important to recognize that mysqldump is only a realistic option when your database is fairly small.  Dumping data is fairly quick with large DBs, but it’s the import time that will kill you.  For backing up and restoring sizable databases, look into a binary snapshot tool like InnoDB Hot Backup.
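For illustration, an hourly dump along these lines (the database name and paths are placeholders; credentials come from the backup user’s ~/.my.cnf rather than the command line):

    # --lock-all-tables takes a global read lock so the dump is consistent;
    # --flush-logs rotates the binary log at the same point, so binlog and dump line up.
    mysqldump --flush-logs --lock-all-tables appdb \
        | gzip > /backups/appdb-$(date +%Y%m%d%H).sql.gz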

As we take our hourly db SQL dumps, we copy them over to an S3 bucket dedicated to backup files.  And then we rsync the same db backup file to the off-site (non-Amazon) server.
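Sketched out (bucket, host and paths are again placeholders, and s3cmd is shown only as one way to push a file to S3):

    DUMP=/backups/appdb-$(date +%Y%m%d%H).sql.gz
    s3cmd put "$DUMP" s3://ai-db-backups/hourly/                      # copy to the dedicated backup bucket
    rsync -az -e ssh "$DUMP" backup@offsite.example.com:/backups/db/  # and to the off-site server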

3. In case of disaster, be able to restore everything to a running state in less than 3 hours.

There are several ways we can do this, depending on the nature of the disaster.

  • If it’s a DB loss, we can flip over to the slave server, or restore from the hourly snapshot, which ensures we lose at most one hour of data.
  • If it’s user-generated content loss from S3, we can rsync from our off-site backup server.
  • If our EC2 instance goes down for some reason, we can bounce back with our original server drive attached to a new instance.  This is possible because we’re using an EBS root volume (you should probably be using this too, instead of EC2’s original “instance store”).  EC2 was initially designed such that your root data drive would not survive instance termination, accidental or otherwise.  With an EBS root drive, the data drive exists independently of the EC2 instance, and you are able to detach/re-attach an EBS volume to any instance.
  • If for some reason the “live” production EBS drive cannot be re-attached to a new EC2 instance, not to worry.  We capture EBS volume snapshots of the entire drive twice a day, with 14 days of archives, using this excellent, lightweight & cron-able script from Eric Hammond: http://alestic.com/2009/09/ec2-consistent-snapshot (see the example crontab entry after this list).  Note that for your EBS snapshot to be bullet-proof using that script, you should use an XFS filesystem, which allows locking.  Eric’s script can be used with or without XFS, though if you don’t use XFS, you run the risk of not having a consistent file system on EBS volume restore.
  • And finally, if all of the above fail to provide a path to restore the web-application server, we have literally every step needed to provision a new instance, from scratch, fully documented in a Python Fabric script.  This “fabfile” automates the process of provisioning a new server with all necessary packages, user accounts, services and dependencies with one simple command: “fab -h <target>”.  Capistrano, Bundler and RVM provide the final automation for easy deployment and automatic dependency management of our Rails application layer.  If you are on Rails 2.x, start using RVM/Bundler today!
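For reference, a crontab entry along these lines would drive the twice-daily snapshot.  The volume ID, mount point and schedule are placeholders, and the exact option names should be checked against the script’s own documentation:

    # Illustrative only: placeholder volume ID, mount point and schedule.
    # The script freezes the XFS filesystem (and can flush/lock MySQL via --mysql)
    # while the EBS snapshot is initiated, then thaws it.  AWS credentials are
    # supplied via the script's key-file options, not shown here.
    0 3,15 * * * ec2-consistent-snapshot --xfs-filesystem /vol --description "twice-daily EBS snapshot" vol-xxxxxxxx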

4. Document & audit the recovery process at least once a month.

We document everything needed to recover any part of our data or infrastructure in a Google doc shared among the dev team.  The responsibility to audit the backup/restore process rotates across team members each month.  We run through the restore process every month… the cost of firing up a new EC2 instance for that afternoon is ~$3.  Cheaper if you use spot pricing.  :)

5. Database permissions should be as granular as possible.

This means the primary webapp does NOT use root access.  Using root is a pretty common development practice that frighteningly seems to creep into production setups.  Spend the extra 60 seconds setting up a non-root account with appropriate privileges.  Since we use DB backup scripts, those scripts have their own users with read-only (SELECT only) rights.   Remote access from the office to the production DB is allowed, but only through password-less SSH, and again only with a SELECT-only db account.  Basically, if you are going to insert/delete/edit anything in the production DB, you’re going to have to go through root access from the shell.  We logwatch all server activity to monitor for any suspicious behavior.
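To make that concrete, here is roughly what the account separation looks like in MySQL.  The user names, host patterns and passwords are placeholders, not our real accounts; note also that a dump user driving the --flush-logs / --lock-all-tables options from rule 2 needs LOCK TABLES and RELOAD on top of SELECT:

    mysql -u root -p <<'SQL'
    -- Application account: data access only, no schema or admin privileges
    CREATE USER 'webapp'@'localhost' IDENTIFIED BY 'strong-app-password';
    GRANT SELECT, INSERT, UPDATE, DELETE ON appdb.* TO 'webapp'@'localhost';

    -- Backup account: read-only, plus what a locking mysqldump needs
    CREATE USER 'backup'@'localhost' IDENTIFIED BY 'strong-backup-password';
    GRANT SELECT, LOCK TABLES ON appdb.* TO 'backup'@'localhost';
    GRANT RELOAD ON *.* TO 'backup'@'localhost';   -- RELOAD is a global privilege

    -- Office/reporting account: SELECT only
    CREATE USER 'readonly'@'localhost' IDENTIFIED BY 'strong-readonly-password';
    GRANT SELECT ON appdb.* TO 'readonly'@'localhost';
    FLUSH PRIVILEGES;
    SQL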

6. File-level/S3 permissions should be as granular as possible.

Pretty obvious, but you don’t want system users that don’t need it to have write or execute access to any part of the application or data directories on the web server.  You’re playing with fire if you “chmod 777”.  Similarly, when files are copied over to S3 buckets, the appropriate API call is made to set the correct ACLs on that S3 object.
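As a sketch (paths, users and bucket/object names are placeholders, and s3cmd is shown only as one possible client; the point is just that the ACL gets set explicitly):

    # Tighten filesystem permissions on the app directory (illustrative names).
    chown -R deploy:www-data /var/www/app
    chmod -R o-rwx /var/www/app   # strip read/write/execute for "other", i.e. anyone outside the owner/group

    # When a backup object lands in S3, explicitly mark it private.
    s3cmd setacl --acl-private s3://ai-db-backups/hourly/appdb-latest.sql.gz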

7.  Prevent production data leakage.

As part of Marco’s posted method, backups are also sent to his home computer.  I strongly disagree with that process.  As mentioned in many comments on HN, this is a direct vector for data leakage.  Even if your service doesn’t appear to store anything confidential, IMO there is a reasonable and implicit expectation from consumers that their data is not residing on a home or laptop computer.  Since people reuse passwords all the time, all it takes is a lost/stolen table of passwords mapped to web-mail addresses to create some real havoc, despite the application not seeming to store anything confidential.  Even many encrypted passwords are vulnerable given enough time.

Unrelated to backups: it’s reasonable that developers may want production data on their development boxes to play with, which may be laptops they take home with them.  In this case, we’re thinking of writing a script that either scrubs or randomizes the data when it gets pulled down from production into a dev environment.
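Something along these lines is what we have in mind – the table and column names here are hypothetical, since the script doesn’t exist yet:

    # Hypothetical scrub: replace anything personally identifiable with junk
    # values after the dump is imported into the dev database.
    mysql -u root -p appdb_development <<'SQL'
    UPDATE users
       SET email = CONCAT('user', id, '@example.com'),
           encrypted_password = SHA1(CONCAT('scrubbed-', id));
    UPDATE candidates SET resume_text = NULL;   -- free-text fields that may hold personal data
    SQL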

So there you have it – 7 simple rules for data protection.  Here is a corresponding diagram of our described backup architecture:
[Diagram: Active Interview backup architecture]
