Migrating Data from HBase to Cloud Bigtable

This article describes considerations and processes for migrating the data from an Apache HBase cluster to a Cloud Bigtable cluster on Google Cloud Platform (GCP).

Before you begin this migration, you should consider performance implications, Cloud Bigtable schema design, implications for your approach to authentication and authorization, and the Cloud Bigtable feature set.

Performance implications

Under a typical workload, Cloud Bigtable delivers highly predictable performance. When everything is running smoothly, you can expect to achieve the following performance for each node in your Cloud Bigtable cluster, depending on which type of storage your cluster uses.

Storage type   Reads                              Writes                            Scans
SSD            10,000 rows per second @ 6 ms   or 10,000 rows per second @ 6 ms     220 MB/s
HDD            500 rows per second @ 200 ms    or 10,000 rows per second @ 50 ms    180 MB/s

The estimates shown in the table are based on rows that contain 1 KB of data. They also reflect a read-only or write-only workload. Performance for a mixed workload of reads and writes will vary.

These performance numbers are guidelines, not hard and fast rules. Per-node performance may vary based on your workload and the typical value size of a request or response. For more information, see Understanding Cloud Bigtable Performance.
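
As a rough illustration of how the per-node guidelines translate into cluster size, the following sketch divides a target throughput by the per-node figure and rounds up. The helper name nodes_needed and the target of 45,000 rows per second are assumptions chosen for illustration. Real sizing also depends on storage volume, workload mix, and value size.

```shell
#!/usr/bin/env bash
# Rough node estimate: ceil(target rows per second / per-node guideline).
# nodes_needed is a hypothetical helper, not part of any Cloud Bigtable tool.
nodes_needed() {
  local target_rows_per_sec="$1"
  local per_node_rows_per_sec="$2"
  echo $(( (target_rows_per_sec + per_node_rows_per_sec - 1) / per_node_rows_per_sec ))
}

# Example: 45,000 reads per second on SSD at the 10,000 rows/s guideline.
nodes_needed 45000 10000   # prints 5
# The same read load on HDD at the 500 rows/s guideline.
nodes_needed 45000 500     # prints 90
```

Treat the result as a starting point and confirm it against a test load before the migration.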

Cloud Bigtable schema design

Designing a Cloud Bigtable schema is not like designing a schema for a relational database. Before designing your schema, review the concepts laid out in Designing Your Schema.

You must also keep the schema below the recommended limits for size. As a guideline, keep single rows below 100 MB and single values below 10 MB. There may be scenarios in which you'll need to store large values. Storing large values can impact performance, because extracting large values requires time as well as memory. Evaluate those scenarios on a case-by-case basis.

Authentication and authorization

Before you design access control for Cloud Bigtable, review the existing HBase authentication and authorization processes.

Cloud Bigtable uses GCP's standard mechanisms for authentication and uses Cloud Identity and Access Management (Cloud IAM) to provide access control, so you need to convert your existing HBase authorization to Cloud IAM. You can map the existing Hadoop groups that provide access control mechanisms for HBase to different service accounts.

While Cloud Bigtable allows you to control access at the instance level, it does not provide fine-grained control at the table level. One way to provide table-level granularity is to group tables that have similar access patterns under a Cloud Bigtable instance. But this approach could mean that you will need to use multiple Cloud Bigtable instances to migrate all your tables.

For more information, see Access Control.
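
One way to picture the mapping from Hadoop groups to service accounts is a small lookup that emits one instance-level IAM binding per group. This is a sketch only: the project ID, instance ID, group names, service account names, and role choices below are all hypothetical. The script prints the gcloud commands for review instead of running them.

```shell
#!/usr/bin/env bash
PROJECT="my-project"      # assumed project ID
INSTANCE="my-instance"    # assumed Cloud Bigtable instance ID

# Hypothetical mapping: Hadoop group -> "service-account-name:IAM role".
binding_for_group() {
  case "$1" in
    hbase-readers) echo "sa-hbase-readers:roles/bigtable.reader" ;;
    hbase-writers) echo "sa-hbase-writers:roles/bigtable.user" ;;
    *) return 1 ;;
  esac
}

# Print (do not run) the gcloud command that grants the mapped role.
print_binding_command() {
  local group="$1" mapping account role
  mapping="$(binding_for_group "${group}")" || {
    echo "no mapping for ${group}" >&2
    return 1
  }
  account="${mapping%%:*}"
  role="${mapping#*:}"
  echo "gcloud bigtable instances add-iam-policy-binding ${INSTANCE}" \
       "--member=serviceAccount:${account}@${PROJECT}.iam.gserviceaccount.com" \
       "--role=${role}"
}

print_binding_command hbase-readers
```

Because access is granted per instance, each group of tables with a distinct access pattern would need its own instance and its own set of bindings.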

Migrating HBase to Cloud Bigtable

To migrate your data from HBase to Cloud Bigtable, you export the data as a series of Hadoop sequence files. This is a file format used by HBase consisting of binary key/value pairs.

To migrate the HBase table to Cloud Bigtable, follow these steps:

  1. Collect details from HBase.
  2. Export HBase tables to sequence files.
  3. Move the sequence files to Cloud Storage.
  4. Import the sequence files into Cloud Bigtable using Cloud Dataflow.
  5. Validate the move.

Plan the migration: collect details from HBase

To prepare for the migration, gather the following information from the existing HBase cluster, because you will need this information to build the destination table.

  • List of tables
  • Row counts
  • Cell counts
  • Column family details (including Time to Live, maximum number of versions)

An easy way to collect these details on a source table is to use the following script, which leaves the result on HDFS:

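#!/usr/bin/env bash
# Usage: pass the source HBase table name and an HDFS export directory.
# TABLENAME is the source HBase table name.
# EXPORTDIR is a directory on HDFS that receives the results.
TABLENAME="$1"
EXPORTDIR="$2"
hadoop fs -mkdir -p "${EXPORTDIR}"
# The pipe follows the heredoc operator; the EOQ terminator must stand alone on its line.
hbase shell << EOQ | hadoop fs -put - "${EXPORTDIR}/${TABLENAME}-schema.json"
describe '${TABLENAME}'
EOQ
hbase shell << EOQ | hadoop fs -put - "${EXPORTDIR}/${TABLENAME}-splits.txt"
get_splits '${TABLENAME}'
EOQ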
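
The column family details in the saved describe output (name, maximum number of versions, Time to Live) are needed again when you create the destination table. The following sketch pulls them out of a single column family line with sed. The sample line is illustrative only, the helper name family_attr is hypothetical, and the exact describe format varies between HBase versions, so check the pattern against your own output.

```shell
#!/usr/bin/env bash
# Extract one attribute value from a single column family line of
# `describe` output. The leading "[{ ]" keeps VERSIONS from matching
# inside MIN_VERSIONS.
family_attr() {
  local key="$1" line="$2"
  sed -n "s/.*[{ ]${key} => '\([^']*\)'.*/\1/p" <<< "${line}"
}

# Illustrative line only; real output carries more attributes.
LINE="{NAME => 'cf1', VERSIONS => '3', TTL => 'FOREVER', MIN_VERSIONS => '0'}"
echo "$(family_attr NAME "${LINE}") $(family_attr VERSIONS "${LINE}") $(family_attr TTL "${LINE}")"
# prints: cf1 3 FOREVER
```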

Export the HBase table to Cloud Storage

Once you know the basics of the HBase tables to be migrated, export the tables to sequence files and move them to Cloud Storage. Before you migrate any online data, complete the following steps to make sure that your cluster meets the prerequisites for accessing GCP:

  • Install the Cloud Storage connector

    If you want to migrate online data using distcp, you must install and configure the Cloud Storage connector. First, identify the HDFS file system that manages the data you want to migrate. Next, determine which client node in your Hadoop cluster has access to this file system. Finally, install the connector on the client node. For detailed installation steps, see Installing the Cloud Storage connector.

  • Install the Cloud SDK

    To migrate data with either distcp or gsutil, install the Cloud SDK on the Hadoop cluster client node where the migration will be initiated. For detailed installation steps, see the Cloud SDK documentation.

Export the HBase table to HDFS

Next, export the HBase table you want to migrate to somewhere in your Hadoop cluster. Let's assume your HBase table name is [MY_NEW_TABLE]. The target directory is under your user directory in HDFS. Export the HBase table as sequence files using the following commands:

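# Assumes TABLENAME, EXPORTDIR, and MAXVERSIONS are already set in your shell.
hadoop fs -mkdir -p ${EXPORTDIR}
# Export takes positional arguments: <tablename> <outputdir> [<versions>].
# The output directory must not exist before the job runs.
bin/hbase org.apache.hadoop.hbase.mapreduce.Export my-new-table \
    /user/hbase-${TABLENAME} \
    ${MAXVERSIONS}
# An export with gzip-compressed output, a raw scan, and speculative execution
# disabled. The trailing positional arguments were missing from the original
# command; the output subdirectory name used here is an assumption.
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
    -Dmapred.output.compress=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -DRAW_SCAN=true \
    -Dhbase.client.scanner.caching=100 \
    -Dmapred.map.tasks.speculative.execution=false \
    -Dmapred.reduce.tasks.speculative.execution=false \
    ${TABLENAME} ${EXPORTDIR}/${TABLENAME} ${MAXVERSIONS}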

Migrate the sequence files from HDFS to Cloud Storage

The next step is to move the sequence files to a Cloud Storage bucket. Depending upon the size of the data, the number of files, source of the data, and the available bandwidth, you can choose the appropriate option to move sequence files to Cloud Storage: Transfer Appliance, distcp, gsutil, or Storage Transfer Service.

Using Transfer Appliance

Use Transfer Appliance to migrate your data when:

  • You want to control the outgoing bandwidth on a schedule.
  • The size of the data is greater than 20 TB.

With this option, you do not need to buy or provision extra network capacity with Google. Measured end to end (including appliance shipment time, rehydration, and so on), the effective transfer rate averages 100 Mbps.

To move data with Transfer Appliance, copy the sequence files to it and then ship the appliance back to Google. Google loads the data onto GCP. For more information, see this Transfer Appliance documentation.
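
To see why data size and bandwidth drive the choice of transfer option, it can help to estimate how long a network copy would take. The sketch below is plain arithmetic under stated assumptions: decimal units (1 TB = 10^12 bytes), a link that is fully available to the transfer, and a hypothetical helper named transfer_days.

```shell
#!/usr/bin/env bash
# Whole days (rounded up) needed to move size_tb terabytes
# at mbps megabits per second, using decimal units.
# transfer_days is a hypothetical helper for illustration only.
transfer_days() {
  local size_tb="$1" mbps="$2"
  # bits = size_tb * 8 * 10^12; bits per second = mbps * 10^6
  local seconds=$(( size_tb * 8 * 1000000 / mbps ))
  echo $(( (seconds + 86399) / 86400 ))
}

transfer_days 20 100    # prints 19
transfer_days 20 1000   # prints 2
```

At 100 Mbps, 20 TB takes roughly 19 days of sustained transfer, which is in the same range as the end-to-end appliance rate.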

Using distcp

Use distcp to migrate your data when:

  • More than 100 Mbps bandwidth is available for the migration.
  • The Cloud Storage connector and the Cloud SDK can be installed on the source Hadoop environment.
  • Managing a new Hadoop job to perform the data migration is acceptable.
  • The size of the data is less than 20 TB.

To move data with distcp, use your Hadoop cluster client node configured with the Cloud Storage connector to submit a MapReduce job that copies the sequence files to Cloud Storage.
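
A minimal sketch of such a job follows. The NameNode host and port, the HDFS export directory, the bucket name, and the destination path are all placeholders that you would replace with your own values. The script assembles the hadoop distcp command and prints it for review.

```shell
#!/usr/bin/env bash
# All values below are placeholders; replace them before running.
NAMENODE="namenode.example.com"   # assumed NameNode host
NAMENODE_PORT="8020"              # assumed NameNode RPC port
EXPORTDIR="/user/hbase-export"    # assumed HDFS directory with the sequence files
BUCKET="my-bucket"                # assumed Cloud Storage bucket

# Source is an hdfs:// URI; destination is a gs:// URI served by the
# Cloud Storage connector.
DISTCP_CMD=(hadoop distcp
  "hdfs://${NAMENODE}:${NAMENODE_PORT}${EXPORTDIR}"
  "gs://${BUCKET}/hbase-export")

# Print the command for review. To run it, use: "${DISTCP_CMD[@]}"
echo "${DISTCP_CMD[*]}"
```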


Using gsutil

Use gsutil to migrate your data when:

  • More than 100 Mbps bandwidth is available for the migration.
  • The Cloud SDK can be installed on the source Hadoop environment.
  • Managing a new Hadoop job to perform the data migration is not acceptable.
  • The size of the data is less than 10 TB.

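To move data with gsutil, use your Hadoop cluster client node to initiate the data migration. Because gsutil reads from the local file system, the sequence files must be available under a local path on that node:

gsutil -m cp -r ${EXPORTDIR} gs://[BUCKET]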

Using Storage Transfer Service

Use Storage Transfer Service to migrate your data when:

  • Your data source is an Amazon S3 bucket, an HTTP/HTTPS location, or a Cloud Storage bucket.
  • The size of the data is less than 10 TB.

Storage Transfer Service has options that make data transfers and synchronization between data sources and data sinks easier. For example, you can:

  • Schedule one-time transfer operations or recurring transfer operations.
  • Delete existing objects in the destination bucket if they don't have a corresponding object in the source.
  • Delete source objects after transferring them.
  • Schedule periodic synchronization from data source to data sink with advanced filters based on file creation dates, file-name filters, and the times of day you prefer to import data.

For more information, see this Storage Transfer Service documentation.

Create the destination table

The next step is to create the destination table in Cloud Bigtable.

First, you use the gcloud command-line tool to install the Cloud Bigtable client tool cbt.

gcloud components update
gcloud components install cbt

Next, you create a table in Cloud Bigtable with the appropriate column families from your previous discovery effort.

Using the existing splits, presplit the destination table as you create it. This will improve the bulk load performance.

For example, if you find that the existing splits are:

'15861', '29374', '38173', '180922', '203294', '335846', '641111', '746477', '807307', '871053', '931689', '1729462', '1952670', '4356485', '4943705', '5968738', '6917370', '8993145', '10624362', '11309714', '12056747', '12772074', '14370672', '16583264', '18835454', '21194008', '22021148', '23702800', '25532516', '55555555'

Then set up a default project and a Cloud Bigtable instance for the cbt tool for your user account like this:

cat > ${HOME}/.cbtrc << EOF
project = [YOUR-GCP-PROJECT]
instance = [YOUR-BIGTABLE-INSTANCE]
EOF

Create these splits in the destination table:

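cbt -instance my-instance createtable my-new-table \
    splits=15861,29374,38173,180922,203294,335846,641111,746477,807307,871053,931689,1729462,1952670,4356485,4943705,5968738,6917370,8993145,10624362,11309714,12056747,12772074,14370672,16583264,18835454,21194008,22021148,23702800,25532516,55555555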
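
The cbt createtable command accepts a splits= argument with bare, comma-separated row keys, while the split list shown earlier is quoted and space-separated. The following sketch converts one form to the other. The helper name splits_to_arg is hypothetical, and the approach assumes that the split keys themselves contain no commas, quotes, or whitespace.

```shell
#!/usr/bin/env bash
# Delete single quotes and whitespace from a quoted, comma-separated
# split list, leaving bare keys separated by commas.
splits_to_arg() {
  tr -d "' \t\n"
}

SPLITS="$(echo "'15861', '29374', '38173'" | splits_to_arg)"
echo "splits=${SPLITS}"   # prints splits=15861,29374,38173
```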

Create column families in the destination table to match the ones you discovered earlier. For example, if you discovered that there are two column families, cf1 and cf2, create the column family cf1 on Cloud Bigtable like this:

cbt createfamily my-new-table cf1

Create the column family cf2 like this:

cbt createfamily my-new-table cf2

After creating the column families, it is important to update each column family's garbage collection policy, including the maximum age and maximum number of versions for values in that column family. You must do this even if you used HBase's default settings for your HBase table, because Cloud Bigtable's native tools use a different default setting than HBase.

cbt setgcpolicy [TABLE] [FAMILY] ( maxage=[D]| maxversions=[N] )
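
As a sketch of how the HBase settings might be carried over, the helper below builds a policy string from a column family's Time to Live and maximum number of versions. Several things here are assumptions: the helper name gc_policy is hypothetical, the TTL is rounded up to whole hours, and combining the two conditions with "or" relies on a cbt version that supports combined policies. The commands are printed for review instead of being run.

```shell
#!/usr/bin/env bash
# Build a cbt garbage collection policy string.
#   ttl_seconds:  HBase TTL in seconds, or FOREVER for no age limit
#   max_versions: HBase VERSIONS setting
# gc_policy is a hypothetical helper, not part of cbt.
gc_policy() {
  local ttl_seconds="$1" max_versions="$2"
  if [ "${ttl_seconds}" = "FOREVER" ]; then
    echo "maxversions=${max_versions}"
  else
    # Round up to whole hours so that data is never removed early.
    local hours=$(( (ttl_seconds + 3599) / 3600 ))
    echo "maxage=${hours}h or maxversions=${max_versions}"
  fi
}

# Print the commands for review before running them.
echo "cbt setgcpolicy my-new-table cf1 $(gc_policy FOREVER 1)"
echo "cbt setgcpolicy my-new-table cf2 $(gc_policy 86400 3)"
```

With "or", a cell becomes eligible for garbage collection when either condition is met, which mirrors how HBase applies TTL and version limits together.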

Import the HBase data into Cloud Bigtable using Cloud Dataflow

There are two ways to import data to Cloud Bigtable. See Importing Sequence Files in the Cloud Bigtable docs for details.

Keep the following tips in mind:

  • To improve the performance of data loading, be sure to set maxNumWorkers. This value helps to ensure that the import job has enough compute power to complete in a reasonable amount of time, but not so much that it would overwhelm the Cloud Bigtable cluster.

  • During the import, you should monitor the Cloud Bigtable cluster's CPU usage. The CPU usage section of the Cloud Bigtable monitoring document provides more information about this topic. If the CPU utilization across the Cloud Bigtable cluster is too high, you might need to add nodes. It may take up to 20 minutes for the cluster to provide the performance benefit of the additional nodes.

For more information about monitoring the Cloud Bigtable instance, see Monitoring a Cloud Bigtable Instance.

Verify the imported data inside Cloud Bigtable

To validate the imported data, you can use a couple of different checks:

  • Checking that the row counts match. The Cloud Dataflow job reports the total row count. This value must match the row count of the source HBase table.
  • Spot-checking with specific row queries. Pick a specific set of row keys from the source table and query them on the destination table to make sure that they match:
cbt lookup [destination-table] [rowkey1]
cbt lookup [destination-table] [rowkey2]
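
The row count check can be scripted once both numbers are known. The sketch below assumes that you have already obtained the source count from HBase and the destination count (for example, from the Cloud Dataflow job report or from cbt count). The helper name counts_match and the sample numbers are hypothetical.

```shell
#!/usr/bin/env bash
# Compare the source HBase row count with the destination row count.
# Returns non-zero on a mismatch so that it can gate later steps.
counts_match() {
  local source_rows="$1" destination_rows="$2"
  if [ "${source_rows}" -eq "${destination_rows}" ]; then
    echo "OK: ${source_rows} rows in both tables"
  else
    echo "MISMATCH: source=${source_rows} destination=${destination_rows}"
    return 1
  fi
}

counts_match 1000000 1000000
```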
