Migrate from HBase on Google Cloud

This page describes considerations and processes for migrating to Bigtable from an Apache HBase cluster that is hosted on a Google Cloud service, such as Dataproc or Compute Engine.

For guidance on migrating from an external Apache HBase environment to Bigtable, see Migrating Data from HBase to Bigtable. To learn about online migration, see Replicate from HBase to Bigtable.

Why migrate from HBase on Google Cloud to Bigtable

Reasons why you might choose this migration path include the following:

  • You can leave your client application where it is currently deployed, changing only the connection configuration.
  • Your data remains in the Google Cloud ecosystem.
  • You can continue using the HBase API if you want to. The Cloud Bigtable HBase client for Java is a fully supported extension of the Apache HBase library for Java.
  • You want the benefits of using a managed service to store your data.

Considerations

This section suggests a few things to review and think about before you begin your migration.

Bigtable schema design

In most cases, you can use the same schema design in Bigtable as you do in HBase. If you want to change your schema or if your use case is changing, review the concepts laid out in Designing your schema before you migrate your data.

Preparation and testing

Before you migrate your data, make sure that you understand the differences between HBase and Bigtable. Spend some time learning how to configure the connection between your application and Bigtable. Additionally, you might want to perform system and functional testing before the migration to validate the application or service.
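
For example, if you use the Cloud Bigtable HBase client for Java, switching the connection usually means pointing the client's hbase-site.xml at your Bigtable instance instead of at ZooKeeper. The following is a minimal sketch only; the property names reflect the 1.x client, so verify them against the client documentation for the version you use:

cat > hbase-site.xml <<EOF
<configuration>
  <!-- Route HBase API calls to Bigtable instead of an HBase cluster -->
  <property>
    <name>hbase.client.connection.impl</name>
    <value>com.google.cloud.bigtable.hbase1_x.BigtableConnection</value>
  </property>
  <property>
    <name>google.bigtable.project.id</name>
    <value>PROJECT_ID</value>
  </property>
  <property>
    <name>google.bigtable.instance.id</name>
    <value>BIGTABLE_INSTANCE_ID</value>
  </property>
</configuration>
EOF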

Migration steps

To migrate your data from HBase to Bigtable, you take an HBase snapshot and import the data directly from the HBase cluster into Bigtable. These steps are for a single HBase cluster and are described in detail in the next several sections.

  1. Stop sending writes to HBase.
  2. Create destination tables in Bigtable.
  3. Take HBase snapshots and import them to Bigtable.
  4. Validate the imported data.
  5. Update the application to send reads and writes to Bigtable.


Before you begin

  1. Install the Google Cloud CLI or use the Cloud Shell.

  2. Create a Cloud Storage bucket to store your validation output data. Create the bucket in the same location where you plan to run your Dataproc job (see the example commands after this list).

  3. Identify the Hadoop cluster that you are migrating from. You must run the jobs for your migration on a Dataproc 1.x cluster that has network connectivity to the HBase cluster's Namenode and Datanodes. Take note of the HBase cluster's ZooKeeper Quorum address and Namenode URI, which are required for the migration scripts.

  4. Create a version 1.x Dataproc cluster on the same network as the source HBase cluster. You use this cluster to run the import and validation jobs (see the example commands after this list).

  5. Create a Bigtable instance to store your new tables. At least one cluster in the Bigtable instance must be in the same region as the Dataproc cluster, for example us-central1 (see the example commands after this list).

  6. Get the Schema Translation tool:

    wget BIGTABLE_HBASE_TOOLS_URL
    

    Replace BIGTABLE_HBASE_TOOLS_URL with the URL of the latest JAR with dependencies available in the tool's Maven repository. The URL is similar to https://repo1.maven.org/maven2/com/google/cloud/bigtable/bigtable-hbase-1.x-tools/2.6.0/bigtable-hbase-1.x-tools-2.6.0-jar-with-dependencies.jar.

    To find the URL or to manually download the JAR, do the following:

    1. Go to the repository.
    2. Click Browse to view the repository files.
    3. Click the most recent version number.
    4. Identify the JAR with dependencies file (usually at the top).
    5. Either right-click and copy the URL, or click to download the file.
  7. Get the MapReduce tool, which you use for the import and validation jobs:

    wget BIGTABLE_MAPREDUCE_URL
    

    Replace BIGTABLE_MAPREDUCE_URL with the URL of the latest shaded-byo JAR available in the tool's Maven repository. The URL is similar to https://repo1.maven.org/maven2/com/google/cloud/bigtable/bigtable-hbase-1.x-mapreduce/2.6.0/bigtable-hbase-1.x-mapreduce-2.6.0-shaded-byo-hadoop.jar.

    To find the URL or to manually download the JAR, do the following:

    1. Go to the repository.
    2. Click the most recent version number.
    3. Click Downloads.
    4. Mouse over shaded-byo-hadoop.jar.
    5. Either right-click and copy the URL, or click to download the file.
  8. Set the following environment variables:

    #Google Cloud

    export PROJECT_ID=PROJECT_ID
    export REGION=REGION

    #Cloud Bigtable

    export BIGTABLE_INSTANCE_ID=BIGTABLE_INSTANCE_ID

    #Dataproc

    export DATAPROC_CLUSTER_ID=DATAPROC_CLUSTER_ID

    #Cloud Storage

    export BUCKET_NAME="gs://BUCKET_NAME"
    export STORAGE_DIRECTORY="$BUCKET_NAME/hbase-migration"

    #HBase

    export ZOOKEEPER_QUORUM=ZOOKEEPER_QUORUM
    export ZOOKEEPER_PORT=2181
    export ZOOKEEPER_QUORUM_AND_PORT="$ZOOKEEPER_QUORUM:$ZOOKEEPER_PORT"
    export MIGRATION_SOURCE_NAMENODE_URI=MIGRATION_SOURCE_NAMENODE_URI
    export MIGRATION_SOURCE_TMP_DIRECTORY=${MIGRATION_SOURCE_NAMENODE_URI}/tmp
    export MIGRATION_SOURCE_DIRECTORY=${MIGRATION_SOURCE_NAMENODE_URI}/hbase

    #JAR files

    export TRANSLATE_JAR=TRANSLATE_JAR
    export MAPREDUCE_JAR=MAPREDUCE_JAR
    
    

    Replace the placeholders with the values for your migration.

    Google Cloud:

    • PROJECT_ID: the Google Cloud project that your Bigtable instance is in
    • REGION: the region that contains the Dataproc cluster that will run the import and validation jobs.

    Bigtable:

    • BIGTABLE_INSTANCE_ID: the identifier of the Bigtable instance that you are importing your data to

    Dataproc:

    • DATAPROC_CLUSTER_ID: the ID of the Dataproc cluster that will run the import and validation jobs

    Cloud Storage:

    • BUCKET_NAME: the name of the Cloud Storage bucket where you are storing your snapshots

    HBase:

    • ZOOKEEPER_QUORUM: the ZooKeeper host that the tool will connect to, in the format host1.myownpersonaldomain.com
    • MIGRATION_SOURCE_NAMENODE_URI: the URI for your HBase cluster's Namenode, in the format hdfs://host1.myownpersonaldomain.com:8020

    JAR files

    • TRANSLATE_JAR: the name and version number of the Bigtable HBase tools JAR file that you downloaded from Maven. The value should look something like bigtable-hbase-1.x-tools-2.6.0-jar-with-dependencies.jar.
    • MAPREDUCE_JAR: the name and version number of the Bigtable HBase MapReduce JAR file that you downloaded from Maven. The value should look something like bigtable-hbase-1.x-mapreduce-2.6.0-shaded-byo-hadoop.jar.
  9. (Optional) To confirm that the variables were set correctly, run the printenv command to view all environment variables.
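
The following commands sketch how you might create the resources described in steps 2, 4, and 5. The image version, network, worker count, zone, and node count shown here are placeholders for illustration only; adjust them to your environment and verify the flags against the current gcloud reference.

# Step 2: Cloud Storage bucket for validation output
gcloud storage buckets create gs://BUCKET_NAME \
    --project=PROJECT_ID \
    --location=REGION

# Step 4: Dataproc 1.x cluster on the same network as the source HBase cluster
gcloud dataproc clusters create DATAPROC_CLUSTER_NAME \
    --project=PROJECT_ID \
    --region=REGION \
    --image-version=1.5-debian10 \
    --network=NETWORK_NAME \
    --num-workers=NUM_WORKERS

# Step 5: Bigtable instance with a cluster in the same region as the Dataproc cluster
gcloud bigtable instances create BIGTABLE_INSTANCE_ID \
    --display-name=BIGTABLE_INSTANCE_ID \
    --cluster-config=id=CLUSTER_ID,zone=ZONE,nodes=NUM_NODES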

Stop sending writes to HBase

Before you take snapshots of your HBase tables, stop sending writes to your HBase cluster.
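
How you stop writes depends on your application. If you can take a table offline for the duration of the migration, one option (shown here only as an illustrative sketch, not a required part of the procedure) is to disable the table from the HBase shell, which blocks reads as well as writes:

echo "disable 'HBASE_TABLE_NAME'" | hbase shell -n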

Create destination tables in Bigtable

The next step is to create a destination table in your Bigtable instance for each HBase table that you are migrating. Use an account that has bigtable.tables.create permission for the instance.

This guide uses the Bigtable Schema Translation tool, which automatically creates the table for you. However, if you don't want your Bigtable schema to exactly match the HBase schema, you can create a table using the cbt CLI or the Google Cloud console.
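
For example, a minimal cbt sketch for creating a table manually might look like the following; the table name, column family, and garbage-collection policy are placeholders for illustration:

cbt -project $PROJECT_ID -instance $BIGTABLE_INSTANCE_ID \
    createtable TABLE_NAME "families=cf1:maxversions=1"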

The Bigtable Schema Translation tool captures the schema of the HBase table, including the table name, column families, garbage collection policies, and splits. Then it creates a similar table in Bigtable.

For each table that you want to import, run the following command to copy the schema from HBase to Bigtable:

java \
 -Dgoogle.bigtable.project.id=$PROJECT_ID \
 -Dgoogle.bigtable.instance.id=$BIGTABLE_INSTANCE_ID \
 -Dgoogle.bigtable.table.filter=TABLE_NAME \
 -Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM \
 -Dhbase.zookeeper.property.clientPort=$ZOOKEEPER_PORT \
 -jar $TRANSLATE_JAR

Replace TABLE_NAME with the name of the HBase table that you are importing. The Schema Translation tool uses this name for your new Bigtable table.

Optionally, you can replace TABLE_NAME with a regular expression, such as ".*", that matches all the tables that you want to create, and then run the command only once, as shown in the following example.
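
For example, the following variant of the command translates the schema for every table in one run:

java \
 -Dgoogle.bigtable.project.id=$PROJECT_ID \
 -Dgoogle.bigtable.instance.id=$BIGTABLE_INSTANCE_ID \
 -Dgoogle.bigtable.table.filter=".*" \
 -Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM \
 -Dhbase.zookeeper.property.clientPort=$ZOOKEEPER_PORT \
 -jar $TRANSLATE_JAR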

Take HBase table snapshots and import them to Bigtable

Complete the following for each table that you plan to migrate to Bigtable.

  1. Run the following command:

    echo "snapshot 'HBASE_TABLE_NAME', 'HBASE_SNAPSHOT_NAME'" | hbase shell -n
    

    Replace the following:

    • HBASE_TABLE_NAME: the name of the HBase table that you are migrating to Bigtable.
    • HBASE_SNAPSHOT_NAME: the unique name for the new snapshot
  2. Import the snapshot by running the following command:

    gcloud dataproc jobs submit hadoop \
        --cluster $DATAPROC_CLUSTER_ID \
        --region $REGION \
        --project $PROJECT_ID \
        --jar $MAPREDUCE_JAR \
        -- \
        import-snapshot \
        -Dgoogle.bigtable.project.id=$PROJECT_ID \
        -Dgoogle.bigtable.instance.id=$BIGTABLE_INSTANCE_ID \
        HBASE_SNAPSHOT_NAME \
        $MIGRATION_SOURCE_DIRECTORY \
        BIGTABLE_TABLE_NAME \
        $MIGRATION_SOURCE_TMP_DIRECTORY
    

    Replace the following:

    • HBASE_SNAPSHOT_NAME: the name that you assigned to the snapshot of the table that you are importing
    • BIGTABLE_TABLE_NAME: the name of the Bigtable table that you are importing to

    After you run the command, the tool restores the HBase snapshot on the source cluster and then imports it. It can take several minutes for the process of restoring the snapshot to finish, depending on the size of the snapshot.

The following additional options are available when you import the data:

  • Set client-based timeouts for the buffered mutator requests (default 600000ms). See the following example:

    -Dgoogle.bigtable.rpc.use.timeouts=true
    -Dgoogle.bigtable.mutate.rpc.timeout.ms=600000
    
  • Consider latency-based throttling, which can reduce the impact that the import batch job might have on other workloads. Throttling should be tested for your migration use case. See the following example:

    -Dgoogle.bigtable.buffered.mutator.throttling.enable=true
    -Dgoogle.bigtable.buffered.mutator.throttling.threshold.ms=100
    
  • Modify the number of map tasks that read a single HBase region (default 2 map tasks per region). See the following example:

    -Dgoogle.bigtable.import.snapshot.splits.per.region=3
    
  • Set additional MapReduce configurations as properties. See the following example:

    -Dmapreduce.map.maxattempts=4
    -Dmapreduce.map.speculative=false
    -Dhbase.snapshot.thread.pool.max=20
    

Keep the following tips in mind when you import:

  • To improve the performance of data loading, be sure to have enough Dataproc cluster workers to run map import tasks in parallel. By default, an n1-standard-8 Dataproc worker will run eight import tasks. Having enough workers ensures that the import job has enough compute power to complete in a reasonable amount of time, but not so much power that it overwhelms the Bigtable instance.
    • If you are not also using the Bigtable instance for another workload, multiply the number of nodes in your Bigtable instance by 3, and then divide by 8 (the number of import tasks per n1-standard-8 Dataproc worker). Use the result as the number of Dataproc workers. For example, a 16-node Bigtable instance works out to (16 × 3) / 8 = 6 Dataproc workers.
    • If you are using the instance for another workload at the same time that you are importing your HBase data, reduce the value of Dataproc workers or increase the number of Bigtable nodes to meet the workloads' requirements.
  • During the import, monitor the Bigtable instance's CPU usage. If the CPU utilization across the Bigtable instance is too high, you might need to add more nodes. Adding nodes improves CPU utilization immediately, but it can take up to 20 minutes after they are added for the cluster to reach optimal performance.

For more information about monitoring the Bigtable instance, see Monitoring a Bigtable instance.

Validate the imported data in Bigtable

Next, validate the data migration by performing a hash comparison between the source and destination tables to gain confidence in the integrity of the migrated data. First, run the hash-table job to generate hashes of row ranges on the source table. Then complete the validation by running the sync-table job to compute and match hashes from Bigtable against the source.

  1. To create hashes to use for validation, run the following command for each table that you are migrating:

    gcloud dataproc jobs submit hadoop \
      --project $PROJECT_ID \
      --cluster $DATAPROC_CLUSTER_ID \
      --region $REGION \
      --jar $MAPREDUCE_JAR \
      -- \
      hash-table \
      -Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM_AND_PORT \
      HBASE_TABLE_NAME \
      $STORAGE_DIRECTORY/HBASE_TABLE_NAME/hash-output/
    

    Replace HBASE_TABLE_NAME with the name of the HBase table that you created the snapshot for.

  2. Run the following in the command shell:

    gcloud dataproc jobs submit hadoop \
      --project $PROJECT_ID \
      --cluster $DATAPROC_CLUSTER_ID \
      --region $REGION \
      --jar $MAPREDUCE_JAR \
      -- \
      sync-table \
      --sourcezkcluster=$ZOOKEEPER_QUORUM_AND_PORT:/hbase \
      --targetbigtableproject=$PROJECT_ID \
      --targetbigtableinstance=$BIGTABLE_INSTANCE_ID \
      $STORAGE_DIRECTORY/HBASE_TABLE_NAME/hash-output/ \
      HBASE_TABLE_NAME \
      BIGTABLE_TABLE_NAME
    

    Replace the following:

    • HBASE_TABLE_NAME: the name of the HBase table that you are importing from
    • BIGTABLE_TABLE_NAME: the name of the Bigtable table that you are importing to

You can optionally add --dryrun=false to the command if you want to enable synchronization between the source and target for diverging hash ranges.

When the sync-table job is complete, the counters for the job are displayed in the Google Cloud console where the job was executed. If the import job successfully imported all of the data, HASHES_MATCHED shows a value and HASHES_NOT_MATCHED is 0.

If HASHES_NOT_MATCHED shows a value, you can re-run sync-table in debug mode to emit the diverging ranges and cell-level details such as Source missing cell, Target missing cell, or Different values. To enable debug mode, configure --properties mapreduce.map.log.level=DEBUG. After the job runs, use Cloud Logging and search for the expression jsonPayload.class="org.apache.hadoop.hbase.mapreduce.SyncTable" to review diverging cells.
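
For example, a debug-mode re-run of the validation might look like the following sketch; the dry run default is unchanged, so nothing is written to the target table:

gcloud dataproc jobs submit hadoop \
  --project $PROJECT_ID \
  --cluster $DATAPROC_CLUSTER_ID \
  --region $REGION \
  --jar $MAPREDUCE_JAR \
  --properties mapreduce.map.log.level=DEBUG \
  -- \
  sync-table \
  --sourcezkcluster=$ZOOKEEPER_QUORUM_AND_PORT:/hbase \
  --targetbigtableproject=$PROJECT_ID \
  --targetbigtableinstance=$BIGTABLE_INSTANCE_ID \
  $STORAGE_DIRECTORY/HBASE_TABLE_NAME/hash-output/ \
  HBASE_TABLE_NAME \
  BIGTABLE_TABLE_NAME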

You can try the import job again or use SyncTable to synchronize the source and target tables by setting dryrun=false. Review HBase SyncTable and additional configuration options before proceeding.

SyncTable results in Cloud Logging

Update the application to send reads and writes to Bigtable

After you've validated the data for each table in the cluster, you can configure your applications to route all their traffic to Bigtable and then retire the HBase cluster.

When your migration is complete, you can delete the snapshots.
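
For example, one way to delete a snapshot is from the HBase shell; a sketch, assuming the snapshot name used earlier in this guide:

echo "delete_snapshot 'HBASE_SNAPSHOT_NAME'" | hbase shell -n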

What's next