Using Google Cloud Storage for Cassandra Disaster Recovery

Learn how to back up your Apache Cassandra data, upload the data to Google Cloud Storage, and then retrieve and restore the data. This tutorial provides a backup/restore script that streamlines the end-to-end disaster recovery process. Although the tutorial focuses on backing up and restoring Cassandra data, the concepts apply broadly to many backup use cases.

Prerequisites

This tutorial was developed and tested on Google Compute Engine virtual machine instances running Debian Linux and Cassandra version 2.1; however, the information is applicable to any Linux environment supported by the Cloud SDK.

Obtaining the Cassandra backup/restore script

This tutorial utilizes a custom Cassandra backup/restore script. Download the script from the project's GitHub repository and unzip it to a local directory, or clone the repository by running the following command:

git clone https://github.com/GoogleCloudPlatform/cassandra-cloud-backup.git

Understanding the Cassandra backup/restore script

This section describes the design of the Cassandra backup/restore script, describes its key features, and outlines some of its limitations.

Cassandra backup types

The Cassandra backup/restore script provides a common interface for managing Cassandra's two built-in methods of backing up data: snapshots and incremental backups. These two methods are complementary:

  • A snapshot is a complete backup of all on-disk data files stored in your Cassandra data directory. The Cassandra backup/restore script takes snapshots via Cassandra's nodetool utility, storing them locally as follows:

    <cassandra_data_dir>/<keyspace_dir>/<table_dir>/snapshots
    
  • An incremental backup backs up only the data that has been added or changed since the last snapshot or incremental backup. After you enable incremental backups for your Cassandra installation, Cassandra takes them automatically. Incremental backups are stored locally in the following directory:

    <cassandra_data_dir>/<keyspace_dir>/<table_dir>/backups
    

Both methods can be used without explicitly stopping or adversely affecting a running Cassandra instance. Both methods also execute very quickly, as Cassandra creates hard links to the active data files in both cases.
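The hard-link behavior is easy to see in isolation. The following sketch (illustrative only; the file names are made up) shows that linking a file copies no data, which is why both backup methods complete almost instantly:

```shell
# A snapshot "copies" an SSTable by hard-linking it: both directory
# entries share one inode, so no data is duplicated.
rm -f /tmp/Data.db /tmp/snapshot-Data.db
printf 'sstable contents' > /tmp/Data.db
ln /tmp/Data.db /tmp/snapshot-Data.db

# Both names resolve to the same inode number:
[ "$(stat -c %i /tmp/Data.db)" = "$(stat -c %i /tmp/snapshot-Data.db)" ] \
  && echo "same inode"
```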

The nodetool utility

The Cassandra backup/restore script wraps and extends the snapshotting functionality of the nodetool utility, using nodetool snapshot to take snapshots and nodetool clearsnapshot to remove stale snapshots from a machine's local storage.

The gsutil utility

The Cassandra backup/restore script uses gsutil, included in the Cloud SDK, to upload files to and download files from Cloud Storage. Cloud Storage provides secure, redundant storage, making it an ideal target for backups.

Preparing your environment

Before trying the backup/restore script, make sure that your Cloud Platform account and test environment are configured properly.

Verify that nodetool and gsutil are in your PATH variable

Verify that the paths for nodetool and gsutil have been added to your PATH environment variable. The backup/restore script is built around these utilities and will fail if they cannot be accessed.
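One quick way to confirm this is a small PATH check. The helper below is hypothetical (it is not part of the backup/restore script):

```shell
# check_tools: print a warning for each named utility not found on PATH.
check_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
  done
}

# Prints nothing when both utilities are resolvable:
check_tools nodetool gsutil
```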

Create a Cloud Storage bucket

To store your backups in Cloud Storage, you first need to create a Cloud Storage bucket by running the following command. Bucket names must be globally unique across Cloud Platform. To ensure that your bucket name is unique, consider using your project name as a namespace, as shown here:

gsutil mb gs://<project>-cassandra-backups

Set up a service account

Next, configure the Cloud SDK to authenticate with Cloud Platform using a service account. Service accounts are designed to authenticate on behalf of a service or application rather than a user, making them an appropriate choice for a production environment.

To set up a service account:

  1. Go to Creating a service account and follow the steps to create a new service account. Make sure to click the Furnish a new private key checkbox, which triggers the download of a private key file when you create the account. You'll need this key file to activate your service account.

  2. In your Cassandra test environment, run the following command to activate your service account. Replace <service_account> with the email address of your service account, and replace <full_path_to_key_file> with the absolute path of your private key:

    gcloud auth activate-service-account <service_account> --key-file <full_path_to_key_file>
    
  3. Set your service account as the active account. Replace <service_account> with the email address of your service account:

    gcloud config set account <service_account>
    

The script will now run gsutil as your service account.

Enable incremental backups in Cassandra

The Cassandra backup/restore script assumes that you have enabled incremental backups. If you haven't enabled them yet, you can do so as follows:

  1. Edit your cassandra.yaml file. This file is usually located in the /etc/cassandra directory.
  2. Change the value of incremental_backups to true.
  3. Restart Cassandra:

    sudo service cassandra restart
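If you prefer to script the edit in step 2, the following sketch flips the setting in place. It assumes the incremental_backups key is present and uncommented in cassandra.yaml; review the change before restarting Cassandra:

```shell
# enable_incremental: set incremental_backups to true in the given
# cassandra.yaml. Assumes the key exists and is not commented out.
enable_incremental() {
  sed -i 's/^incremental_backups:.*/incremental_backups: true/' "$1"
}

# On a Debian node (run with root privileges, then restart Cassandra):
#   enable_incremental /etc/cassandra/cassandra.yaml
```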
    

Managing backups with the Cassandra backup/restore script

When you use the Cassandra backup/restore script to copy your node's local snapshots and incremental backups to your Cloud Storage bucket, the script organizes them in the bucket by node name, backup type, and timestamp:

<bucket_root>
    /<node_1_name>
        /snpsht
            /2016-02-02_02:34:56
                [files]
                ...
            /2016-02-01_01:23:45
            ...
        /incr
            /2016-02-02_04:32:10
            /2016-02-02_03:45:01
            ...
    /<node_2_name>
    ...
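As an illustration, the destination prefix for any given backup follows directly from that layout. Note that backup_path is a hypothetical helper; the script's internal naming logic may differ:

```shell
# backup_path: build the Cloud Storage prefix for one backup, following
# the <bucket>/<node>/<type>/<timestamp> layout shown above.
backup_path() {
  bucket=$1; node=$2; kind=$3               # kind is "snpsht" or "incr"
  stamp=$(date '+%Y-%m-%d_%H:%M:%S')
  echo "gs://${bucket}/${node}/${kind}/${stamp}"
}

# Example: backup_path my-backups node1 snpsht
#   -> gs://my-backups/node1/snpsht/<current timestamp>
```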

The syntax of the Cassandra backup/restore script is as follows:

./cassandra-cloud-backup.sh [options] command

Take and upload snapshots

The Cassandra backup/restore script defaults to the backup command, which creates a new snapshot of all of your keyspaces. The paths of the snapshot files are written to a text file in /cassandra/backups, and the files are then uploaded to your Cloud Storage bucket.

The script also provides flags that let you perform a variety of post-snapshot tasks:

  • -z and -j allow you to compress the snapshot using gzip (tar.gz) or bzip2 (tar.bz2), respectively.
  • -c and -C configure the script to remove old local snapshot files and old local incremental backup files, respectively. These flags are particularly useful after you've taken a snapshot: as soon as you have a fresh snapshot, your old snapshot and incremental backup files are obsolete, especially if they've already been backed up to Cloud Storage.
  • -b gs://<bucket> configures the script to upload your snapshot to Cloud Storage after it's been taken and processed.

In addition, the -v flag configures the script to provide verbose output as it runs.

For example, if you want to take a snapshot, compress it with bzip2, upload it to Cloud Storage, and clean out any old snapshots and incremental files on your local disk—all while monitoring progress by using the verbose-output flag (-v)—you can run the following command. Replace <bucket> with your Cloud Storage bucket name:

./cassandra-cloud-backup.sh -b gs://<bucket> -vCcj backup

Upload current incremental backups

You can tell the script to upload the latest incremental backup files by adding the -i flag. When you use this flag, the script will also create a timestamp file in the target backup directory—by default, /cassandra/backups—each time the script runs. The script uses this timestamp file to log which files it has already copied, and will skip these files the next time the script is run.
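The bookkeeping can be pictured as a find against the timestamp marker. This is a simplified sketch, not the script's actual implementation:

```shell
# incr_new_files: list backup files modified since the previous run's
# marker file; if no marker exists yet, list everything.
incr_new_files() {
  backups_dir=$1; marker=$2
  if [ -f "$marker" ]; then
    find "$backups_dir" -type f -newer "$marker"
  else
    find "$backups_dir" -type f
  fi
}
```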

For example, to compress the newest incremental backup files with bzip2 and upload them to Cloud Storage—all while monitoring progress by using the verbose-output flag (-v)—you can run the following. Replace <bucket> with your Cloud Storage bucket name:

./cassandra-cloud-backup.sh -b gs://<bucket> -vji backup

Perform a dry run

You can perform a dry run of an operation or series of operations by using the -n, or no-op, flag. For example, to perform a dry run of the example snapshot command provided earlier, you can run the following command, replacing <bucket> with your Cloud Storage bucket name:

./cassandra-cloud-backup.sh -b gs://<bucket> -vnCcj backup

Though the no-op flag ensures that no operations are performed, it does create one file: a list of the snapshots and incremental backups that would have been uploaded to Cloud Storage. By default, this file is output to your temporary backup directory, /cassandra/backups. You can write it to a different directory by adding the -d <directory_path> flag.

Create a cron job

After you've worked out the backup operations that you want your script to perform, you should schedule two cron jobs: one to handle snapshotting and snapshot management, and one to handle incremental backup management.

To schedule your cron jobs:

  1. Open crontab:

    crontab -e
    
  2. Add your snapshot-related cron job.

    The following example runs at 1:30 AM (system time). It takes a snapshot, clears out old snapshots and incremental backups, compresses the snapshot with bzip2, and uploads it to a Cloud Storage bucket. It also writes the script's verbose output to a log file under /var/log/cassandra.

    Before adding this cron job, make sure to replace <bucket> with your bucket name:

    30 1 * * * /path_to_scripts/cassandra-cloud-backup.sh -b gs://<bucket> -vCcj -d /cassandra/backups > /var/log/cassandra/$(date +%Y%m%d%H%M%S)-fbackup.log 2>&1
    
  3. Add your incremental-backup-related cron job.

    The following example runs every hour. It copies any new incremental backup files, compresses them with bzip2, and uploads them to a Cloud Storage bucket. It also writes the script's verbose output to a log file under /var/log/cassandra.

    Before adding this cron job, make sure to replace <bucket> with your bucket name:

    0 * * * * /path_to_scripts/cassandra-cloud-backup.sh -b gs://<bucket> -vji -d /cassandra/backups > /var/log/cassandra/$(date +%Y%m%d%H%M%S)-ibackup.log 2>&1
    

Restoring Cassandra data

This section briefly describes how to restore Cassandra data in a multi-node cluster.

Prepare your cluster

Before you can restore your nodes, you need to perform some initial tasks:

  1. Stop the Cassandra service on all of the nodes in your cluster.
  2. Clear out your Cassandra data, commit logs, and cache files on all of the nodes in the cluster.
  3. Look up your Cassandra seed nodes in cassandra.yaml. These nodes can be found under the key seeds, and are defined as a comma-delimited list of IPs.
  4. Log into, or establish an SSH connection with, the first seed node.
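Step 2 above can be scripted. The sketch below assumes the Debian package's default directory layout under /var/lib/cassandra, so verify the paths for your installation first:

```shell
# clear_node: empty the data, commit log, and saved-cache directories
# under the given Cassandra root. Run only after stopping Cassandra.
clear_node() {
  root=$1
  rm -rf "$root"/data/* "$root"/commitlog/* "$root"/saved_caches/*
}

# On a Debian node (with root privileges):
#   sudo service cassandra stop
#   clear_node /var/lib/cassandra
```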

Set up your seed node

Next, prepare to restore your seed node:

  1. Copy your service account's private key to the node.
  2. Install and authenticate the Cloud SDK on the node.
  3. Set up gsutil to use your service account.
  4. Copy your recovery script to the node.

Now that you've set up your node, you can use the Cassandra backup/restore script to restore your data.

Restore your Cassandra data

To restore Cassandra data to your node with the Cassandra backup/restore script, you use the script's restore command:

./cassandra-cloud-backup.sh [options] restore

Common configuration options for the restore command include:

  • -b gs://<bucket> (required), which specifies which Cloud Storage bucket to retrieve the backup files from.
  • -d <directory_path>, which specifies the temporary local directory to which the script will download and decompress your backups. By default, this directory is set to /cassandra/backups.
  • -k, which tells the script to retain your retrieved backup files after you've restored them.
  • -f, which forces the script to skip any user prompts.
  • -v, which provides verbose output.

When you use the restore command, the script automatically detects whether the backup files in the bucket are compressed. When the files are ready to be restored, the script safely flushes the node's logs and then stops the Cassandra service.

The restore command only supports restoring a snapshot. If you want to restore from an incremental backup, you must first restore the last snapshot, then restore each subsequent incremental backup in chronological order.
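Because the backup directories are named with sortable timestamps, a plain lexicographic sort gives the required chronological restore order. For example:

```shell
# The YYYY-MM-DD_HH:MM:SS naming sorts lexicographically in time order,
# so sorting the incremental directory names yields the restore sequence.
printf '%s\n' \
  2016-02-02_04:32:10 \
  2016-02-01_01:23:45 \
  2016-02-02_03:45:01 | sort
# Output:
#   2016-02-01_01:23:45
#   2016-02-02_03:45:01
#   2016-02-02_04:32:10
```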

You can view an inventory of available snapshots and incremental backups for the current machine by running the following command:

./cassandra-cloud-backup.sh -b gs://<bucket> inventory

Clean up your seed node

After you've restored your data, update any configuration files that aren't automatically restored by your script and restart the Cassandra service.

Start Cassandra on your other nodes

Start the Cassandra service on the rest of the nodes in your cluster. If your Cassandra cluster has additional seed nodes, you must start the service on these nodes before you start the service on non-seed nodes.

Setting up automatic file pruning in Cloud Storage

Older backups eventually outlive their usefulness and need to be removed. To help automate this process, you can use the Cloud Storage object lifecycle management feature to delete backup files automatically after a set period.

To configure lifecycle management for the objects in your bucket:

  1. Create a new JSON file called lifecycle.json.
  2. Paste the following JSON into the file. This configuration will delete files in your Cloud Storage bucket after 30 days:

    {
     "lifecycle": {
       "rule": [{
         "action": { "type": "Delete" },
         "condition": { "age": 30 }
       }]
     }
    }
    
  3. Set the lifecycle configuration for your Cloud Storage bucket. Replace <bucket> with the name of your bucket:

    gsutil lifecycle set lifecycle.json gs://<bucket>
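Because a malformed policy file will be rejected, it can help to validate the JSON locally before running gsutil lifecycle set. The sketch below uses Python's json.tool and assumes python3 is available:

```shell
# Write the 30-day deletion policy from step 2 and confirm it parses as
# JSON before handing it to gsutil.
cat > /tmp/lifecycle.json <<'EOF'
{
  "lifecycle": {
    "rule": [{
      "action": { "type": "Delete" },
      "condition": { "age": 30 }
    }]
  }
}
EOF
python3 -m json.tool /tmp/lifecycle.json >/dev/null && echo "lifecycle.json is valid"
```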
    

Next steps

Learn best practices for using gsutil in production

Scripting Production Transfers describes several ways to optimize your gsutil usage for large-scale production tasks.

Check out the nodetool documentation

The Cassandra backup/restore script featured in this tutorial uses a small subset of nodetool's functionality to streamline a general-purpose disaster recovery workflow. If your needs are more specific, consider reviewing the nodetool documentation and modifying the script to fit your needs.

Learn about disaster recovery on Google Cloud Platform

For general advice for developing a disaster recovery plan on Cloud Platform, see Designing a Disaster Recovery Plan with Google Cloud Platform. For a discussion of specific disaster recovery use cases, with example implementations on Google Cloud Platform, check out the Disaster Recovery Cookbook.

