Migrating Data from HBase to Cloud Bigtable

This article describes considerations and processes for migrating data from an Apache HBase cluster to a Cloud Bigtable cluster on Google Cloud.

Before you begin this migration, you should consider performance implications, Bigtable schema design, implications for your approach to authentication and authorization, and the Bigtable feature set.

Performance implications

Under a typical workload, Bigtable delivers highly predictable performance. For more information, see Understanding Bigtable Performance.

Bigtable schema design

Designing a Bigtable schema is not like designing a schema for a relational database. Before designing your schema, review the concepts laid out in Designing Your Schema.

You must also keep the schema below the recommended limits for size. As a guideline, keep single rows below 100 MB and single values below 10 MB. There may be scenarios in which you'll need to store large values. Storing large values can impact performance, because extracting large values requires time as well as memory. Evaluate those scenarios on a case-by-case basis.

Authentication and authorization

Before you design access control for Bigtable, review the existing HBase authentication and authorization processes.

Bigtable uses Google Cloud's standard mechanisms for authentication and Identity and Access Management (IAM) to provide access control, so you convert your existing HBase authorization to IAM. You can map the existing Hadoop groups that provide access control mechanisms for HBase to different service accounts.

Bigtable allows you to control access at the project, instance, and table levels. For more information, see Access Control.
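As a hedged sketch of such a mapping, the command below grants a service account (standing in for an HBase-era Hadoop group) read/write access to Bigtable data; the project ID and service account name are placeholders to adapt to your environment:

```shell
# Grant a service account data read/write access to Bigtable tables
# in a project. [PROJECT_ID] and the service account email are
# placeholders; roles/bigtable.user allows data access but not
# instance administration.
gcloud projects add-iam-policy-binding [PROJECT_ID] \
    --member="serviceAccount:analytics-team@[PROJECT_ID].iam.gserviceaccount.com" \
    --role="roles/bigtable.user"
```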

Migrating HBase to Bigtable

To migrate your data from HBase to Bigtable, you export the data as a series of Hadoop sequence files. This is a Hadoop file format consisting of binary key/value pairs, and it is the format produced by HBase's Export tool.

To migrate the HBase table to Bigtable, follow these steps:

  1. Collect details from HBase.
  2. Export HBase tables to sequence files.
  3. Move the sequence files to Cloud Storage.
  4. Import the sequence files into Bigtable using Dataflow.
  5. Validate the move.

Plan the migration: collect details from HBase

To prepare for the migration, gather the following information from the existing HBase cluster, because you will need this information to build the destination table.

  • List of tables
  • Row counts
  • Cell counts
  • Column family details (including Time to Live, maximum number of versions)

An easy way to collect these details on a source table is to use the following script, which leaves the result on HDFS:

#!/usr/bin/env bash
# TABLENAME is the source HBase table name.
# EXPORTDIR is the export directory, which will be located on HDFS.
TABLENAME=[TABLE_NAME]
EXPORTDIR=[EXPORT_DIRECTORY]
hadoop fs -mkdir -p ${EXPORTDIR}
hbase shell << EOQ | hadoop fs -put - ${EXPORTDIR}/${TABLENAME}-schema.txt
describe '${TABLENAME}'
EOQ
hbase shell << EOQ | hadoop fs -put - ${EXPORTDIR}/${TABLENAME}-splits.txt
get_splits '${TABLENAME}'
EOQ

Export the HBase table to Cloud Storage

When you know the basics of the HBase tables to be migrated, you need to export each table to sequence files and move them to Cloud Storage. Before taking any action to migrate online data, complete the following steps to make sure that your cluster meets the prerequisites for accessing Google Cloud:

  • Install the Cloud Storage connector

    If you want to migrate online data using distcp, you must install and configure the Cloud Storage connector. First, identify the HDFS file system that manages the data you want to migrate. Next, determine which client node in your Hadoop cluster has access to this file system. Finally, install the connector on the client node. For detailed installation steps, see Installing the Cloud Storage connector.

  • Install the Cloud SDK

    To migrate data with either distcp or gsutil, install the Cloud SDK on the Hadoop cluster client node where the migration will be initiated. For detailed installation steps, see the Cloud SDK documentation.

Export the HBase table to HDFS

Next, export the HBase table you want to migrate to a location in your Hadoop cluster. Assume that your source HBase table name is [TABLE_NAME], available to the commands below as ${TABLENAME}, and that the target directory ${EXPORTDIR} is under your user directory in HDFS. Export the HBase table as sequence files using the following commands:

hadoop fs -mkdir -p ${EXPORTDIR}
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
    -Dmapred.output.compress=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -Dhbase.client.scanner.caching=100 \
    -Dmapred.map.tasks.speculative.execution=false \
    -Dmapred.reduce.tasks.speculative.execution=false \
    ${TABLENAME} ${EXPORTDIR}

Migrate the sequence files from HDFS to Cloud Storage

The next step is to move the sequence files to a Cloud Storage bucket. Depending upon the size of the data, the number of files, source of the data, and the available bandwidth, you can choose the appropriate option to move sequence files to Cloud Storage: Transfer Appliance, distcp, gsutil, or Storage Transfer Service.

Using Transfer Appliance

Use Transfer Appliance to migrate your data when:

  • You want to control the outgoing bandwidth on a schedule.
  • The size of the data is greater than 20 TB.

With this option, you do not need to buy or provision extra network bandwidth to Google. The end-to-end transfer (appliance shipment time, data rehydration, and so on) averages an effective rate of 100 Mbps.

To move data with Transfer Appliance, copy the sequence files to the appliance and then ship it back to Google. Google loads the data onto Google Cloud. For more information, see the Transfer Appliance documentation.

Using distcp

Use distcp to migrate your data when:

  • More than 100 Mbps bandwidth is available for the migration.
  • The Cloud Storage connector and the Cloud SDK can be installed on the source Hadoop environment.
  • Managing a new Hadoop job to perform the data migration is acceptable.
  • The size of the data is less than 20 TB.

To move data with distcp, use the Hadoop cluster client node that is configured with the Cloud Storage connector to submit a MapReduce job that copies the sequence files to Cloud Storage.
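A minimal sketch of such a distcp invocation, assuming ${EXPORTDIR} is the HDFS directory holding the exported sequence files and [BUCKET] is a placeholder for your Cloud Storage bucket:

```shell
# Copy the exported sequence files from HDFS to Cloud Storage.
# distcp runs as a MapReduce job, so this distributes the copy
# across the cluster. The gs:// scheme requires the Cloud Storage
# connector to be installed and configured.
hadoop distcp ${EXPORTDIR} gs://[BUCKET]
```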


Using gsutil

Use gsutil to migrate your data when:

  • More than 100 Mbps bandwidth is available for the migration.
  • The Cloud SDK can be installed on the source Hadoop environment.
  • Managing a new Hadoop job to perform the data migration is not acceptable.
  • The size of the data is less than 10 TB.

To move data with gsutil, use your Hadoop cluster client node to initiate the data migration:

gsutil -m cp -r ${EXPORTDIR} gs://[BUCKET]

Using Storage Transfer Service

Use Storage Transfer Service to migrate your data when:

  • Your data source is an Amazon S3 bucket, an HTTP/HTTPS location, or a Cloud Storage bucket.
  • The size of the data is less than 10 TB.

Storage Transfer Service has options that make data transfers and synchronization between data sources and data sinks easier. For example, you can:

  • Schedule one-time transfer operations or recurring transfer operations.
  • Delete existing objects in the destination bucket if they don't have a corresponding object in the source.
  • Delete source objects after transferring them.
  • Schedule periodic synchronization from data source to data sink with advanced filters based on file creation dates, file-name filters, and the times of day you prefer to import data.

For more information, see the Storage Transfer Service documentation.

Create the destination table

The next step is to create the destination table in Bigtable.

First, you use the gcloud command-line tool to install the Bigtable client tool cbt.

gcloud components update
gcloud components install cbt

Next, you create a table in Bigtable with the appropriate column families from your previous discovery effort.

Using the existing splits, presplit the destination table as you create it. This will improve the bulk load performance.

For example, if you find that the existing splits are:

'15861', '29374', '38173', '180922', '203294', '335846', '641111', '746477', '807307', '871053', '931689', '1729462', '1952670', '4356485', '4943705', '5968738', '6917370', '8993145', '10624362', '11309714', '12056747', '12772074', '14370672', '16583264', '18835454', '21194008', '22021148', '23702800', '25532516', '55555555'
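If the splits were captured one key per line (as the discovery script above writes them to HDFS), they can be joined into the comma-separated list that cbt expects. A minimal sketch, with the file path and sample keys as placeholders:

```shell
# Join split keys stored one per line into a comma-separated list
# suitable for cbt createtable's "splits=" argument. The file path
# and the three sample keys are placeholders.
splits_file=/tmp/my-new-table-splits.txt
printf '%s\n' 15861 29374 38173 > "${splits_file}"   # sample data

# paste -s joins all lines of the file; -d, uses a comma delimiter.
SPLITS=$(paste -sd, "${splits_file}")
echo "splits=${SPLITS}"   # → splits=15861,29374,38173
```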

Then set up a default project and a Bigtable instance for the cbt tool for your user account like this:

cat > ${HOME}/.cbtrc << EOF
project = [YOUR-GCP-PROJECT]
instance = [YOUR-BIGTABLE-INSTANCE]
EOF

Create these splits in the destination table:

cbt -instance my-instance createtable my-new-table \
    "splits=15861,29374,38173,180922,203294,335846,641111,746477,807307,871053,931689,1729462,1952670,4356485,4943705,5968738,6917370,8993145,10624362,11309714,12056747,12772074,14370672,16583264,18835454,21194008,22021148,23702800,25532516,55555555"

Create column families in the destination table to match the ones you discovered earlier. For example, if you discovered that there are two column families, cf1 and cf2, create the column family cf1 on Bigtable like this:

cbt createfamily my-new-table cf1

Create the column family cf2 like this:

cbt createfamily my-new-table cf2

After creating the column families, it is important to update each column family's garbage collection policy, including the maximum age and maximum number of versions for values in that column family. You must do this even if you used HBase's default settings for your HBase table, because Bigtable's native tools use a different default setting than HBase.

cbt setgcpolicy [TABLE] [FAMILY] (maxage=[D] | maxversions=[N])
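For example, assuming the discovery step found HBase's default of one version on cf1 and a 30-day TTL on cf2 (both values are assumptions to replace with what you actually discovered), the policies might be set like this:

```shell
# Keep at most one version per cell in cf1 (matching HBase's
# default VERSIONS => 1).
cbt setgcpolicy my-new-table cf1 maxversions=1

# Expire values in cf2 after 30 days (assuming a 30-day TTL was
# discovered on the source column family).
cbt setgcpolicy my-new-table cf2 maxage=30d
```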

Import the HBase data into Bigtable using Dataflow

For details of how to import data into Bigtable, see Importing and exporting data in the Bigtable documentation.

Keep the following tips in mind:

  • To improve the performance of data loading, be sure to set maxNumWorkers. This value helps to ensure that the import job has enough compute power to complete in a reasonable amount of time, but not so much that it would overwhelm the Bigtable cluster.

  • During the import, monitor the Bigtable cluster's CPU usage. The CPU usage section of the Bigtable monitoring documentation provides more information about this topic. If the CPU utilization across the Bigtable cluster is too high, you might need to add nodes. It can take up to 20 minutes for the cluster to provide the performance benefit of additional nodes.
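As a sketch of how these options fit together, a sequence-file import of this kind is typically launched with the Bigtable Beam import tool. Every name, flag, and value below is an assumption to adapt to your environment and tool version; check the Importing and exporting data guide for the exact invocation:

```shell
# Hedged sketch: launch a Dataflow job that imports sequence files
# into Bigtable. The jar name, flag names, and all bracketed values
# are assumptions, not a definitive invocation.
java -jar bigtable-beam-import-[VERSION]-shaded.jar import \
    --runner=dataflow \
    --project=[PROJECT_ID] \
    --bigtableInstanceId=my-instance \
    --bigtableTableId=my-new-table \
    --sourcePattern='gs://[BUCKET]/[EXPORT_DIRECTORY]/part-*' \
    --tempLocation=gs://[BUCKET]/jar-temp \
    --maxNumWorkers=[WORKER_COUNT]
```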

For more information about monitoring the Bigtable instance, see Monitoring a Bigtable Instance.

Verify the imported data inside Bigtable

To validate the imported data, you can use a couple of different checks:

  • Check that the row counts match. The Dataflow job reports the total row count, and this value must match the row count of the source HBase table.
  • Spot-check with specific row queries. You can pick a specific set of row keys from the source table and query them in the destination table to make sure they match:
cbt lookup [destination-table] [rowkey1]
cbt lookup [destination-table] [rowkey2]
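The row-count check can also be scripted. A minimal sketch with a hypothetical helper; in practice the two values would come from the Dataflow job report (or `count '[TABLE_NAME]'` in the hbase shell) and `cbt count my-new-table`:

```shell
# Hypothetical helper: compare source and destination row counts
# and fail loudly on a mismatch.
compare_counts() {
  if [ "$1" -eq "$2" ]; then
    echo "row counts match: $1"
  else
    echo "MISMATCH: source=$1 destination=$2" >&2
    return 1
  fi
}

# Sample values standing in for the real counts.
compare_counts 14370672 14370672   # → row counts match: 14370672
```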

What's next