Migrate from HBase on Google Cloud
This page describes considerations and processes for migrating to Bigtable from an Apache HBase cluster that is hosted on a Google Cloud service, such as Dataproc or Compute Engine.
For guidance on migrating from an external Apache HBase environment to Bigtable, see Migrating Data from HBase to Bigtable. To learn about online migration, see Replicate from HBase to Bigtable.
Why migrate from HBase on Google Cloud to Bigtable
Reasons why you might choose this migration path include the following:
- You can leave your client application where it is currently deployed, changing only the connection configuration.
- Your data remains in the Google Cloud ecosystem.
- You can continue using the HBase API if you want to. The Cloud Bigtable HBase client for Java is a fully supported extension of the Apache HBase library for Java.
- You want the benefits of using a managed service to store your data.
Considerations
This section suggests a few things to review and think about before you begin your migration.
Bigtable schema design
In most cases, you can use the same schema design in Bigtable as you do in HBase. If you want to change your schema or if your use case is changing, review the concepts laid out in Designing your schema before you migrate your data.
Preparation and testing
Before you migrate your data, make sure that you understand the differences between HBase and Bigtable. You should spend some time learning how to configure your connection to connect your application to Bigtable. Additionally, you might want to perform system and functional testing prior to the migration to validate the application or service.
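For example, you can write a small smoke test that opens a connection through the Cloud Bigtable HBase client and reads a single row before you begin the migration. The following minimal sketch assumes the bigtable-hbase client dependency is on your classpath; the project ID, instance ID, table name, and row key are placeholders.

import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BigtableConnectionSmokeTest {
  public static void main(String[] args) throws Exception {
    // Hypothetical identifiers -- replace with your own project, instance, and table.
    String projectId = "my-project";
    String instanceId = "my-bigtable-instance";
    String tableId = "my-table";

    // BigtableConfiguration swaps in the Bigtable HBase client for the connection;
    // the rest of the code is the plain HBase API.
    try (Connection connection = BigtableConfiguration.connect(projectId, instanceId);
        Table table = connection.getTable(TableName.valueOf(tableId))) {
      Result result = table.get(new Get(Bytes.toBytes("some-row-key")));
      System.out.println("Row found: " + !result.isEmpty());
    }
  }
}

If a test like this fails, confirm your authentication setup and instance ID before you continue with the migration.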
Migration steps
To migrate your data from HBase to Bigtable, you take an HBase snapshot and import the data directly from the HBase cluster into Bigtable. These steps are for a single HBase cluster and are described in detail in the next several sections.
- Stop sending writes to HBase.
- Create destination tables in Bigtable.
- Take HBase snapshots and import them to Bigtable.
- Validate the imported data.
- Update the application to send reads and writes to Bigtable.
Before you begin
Install the Google Cloud CLI or use the Cloud Shell.
Create a Cloud Storage bucket to store your validation output data. Create the bucket in the same location that you plan to run your Dataproc job in.
Identify the Hadoop cluster that you are migrating from. You must run the jobs for your migration on a Dataproc 1.x cluster that has network connectivity to the HBase cluster's Namenode and Datanodes. Take note of the HBase cluster's ZooKeeper Quorum address and Namenode URI, which are required for the migration scripts.
Create a Dataproc cluster version 1.x on the same network as the source HBase cluster. You use this cluster to run the import and validation jobs.
Create a Bigtable instance to store your new tables. At least one cluster in the Bigtable instance must also be in the same region as the Dataproc cluster. Example:
us-central1
Get the Schema Translation tool:
wget BIGTABLE_HBASE_TOOLS_URL
Replace BIGTABLE_HBASE_TOOLS_URL with the URL of the latest JAR with dependencies available in the tool's Maven repository. The file name is similar to https://repo1.maven.org/maven2/com/google/cloud/bigtable/bigtable-hbase-1.x-tools/2.6.0/bigtable-hbase-1.x-tools-2.6.0-jar-with-dependencies.jar.
To find the URL or to manually download the JAR, do the following:
- Go to the repository.
- Click Browse to view the repository files.
- Click the most recent version number.
- Identify the JAR with dependencies file (usually at the top).
- Either right-click and copy the URL, or click to download the file.
Get the MapReduce tool, which you use for the import and validation jobs:
wget BIGTABLE_MAPREDUCE_URL
Replace BIGTABLE_MAPREDUCE_URL with the URL of the latest shaded-byo JAR available in the tool's Maven repository. The file name is similar to https://repo1.maven.org/maven2/com/google/cloud/bigtable/bigtable-hbase-1.x-mapreduce/2.6.0/bigtable-hbase-1.x-mapreduce-2.6.0-shaded-byo-hadoop.jar.
To find the URL or to manually download the JAR, do the following:
- Go to the repository.
- Click the most recent version number.
- Click Downloads.
- Mouse over shaded-byo-hadoop.jar.
- Either right-click and copy the URL, or click to download the file.
Set the following environment variables:
#Google Cloud
export PROJECT_ID=PROJECT_ID
export REGION=REGION

#Cloud Bigtable
export BIGTABLE_INSTANCE_ID=BIGTABLE_INSTANCE_ID

#Dataproc
export DATAPROC_CLUSTER_ID=DATAPROC_CLUSTER_NAME

#Cloud Storage
export BUCKET_NAME="gs://BUCKET_NAME"
export STORAGE_DIRECTORY="$BUCKET_NAME/hbase-migration"

#HBase
export ZOOKEEPER_QUORUM=ZOOKEEPER_QUORUM
export ZOOKEEPER_PORT=2181
export ZOOKEEPER_QUORUM_AND_PORT="$ZOOKEEPER_QUORUM:$ZOOKEEPER_PORT"
export MIGRATION_SOURCE_NAMENODE_URI=MIGRATION_SOURCE_NAMENODE_URI
export MIGRATION_SOURCE_TMP_DIRECTORY=${MIGRATION_SOURCE_NAMENODE_URI}/tmp
export MIGRATION_SOURCE_DIRECTORY=${MIGRATION_SOURCE_NAMENODE_URI}/hbase

#JAR files
export TRANSLATE_JAR=TRANSLATE_JAR
export MAPREDUCE_JAR=MAPREDUCE_JAR
Replace the placeholders with the values for your migration.

Google Cloud:
- PROJECT_ID: the Google Cloud project that your Bigtable instance is in
- REGION: the region that contains the Dataproc cluster that will run the import and validation jobs

Bigtable:
- BIGTABLE_INSTANCE_ID: the identifier of the Bigtable instance that you are importing your data to

Dataproc:
- DATAPROC_CLUSTER_NAME: the name of the Dataproc cluster that will run the import and validation jobs

Cloud Storage:
- BUCKET_NAME: the name of the Cloud Storage bucket where you are storing your snapshots

HBase:
- ZOOKEEPER_QUORUM: the ZooKeeper host that the tool will connect to, in the format host1.myownpersonaldomain.com
- MIGRATION_SOURCE_NAMENODE_URI: the URI for your HBase cluster's Namenode, in the format hdfs://host1.myownpersonaldomain.com:8020

JAR files:
- TRANSLATE_JAR: the name and version number of the bigtable hbase tools JAR file that you downloaded from Maven. The value should look something like bigtable-hbase-1.x-tools-2.6.0-jar-with-dependencies.jar.
- MAPREDUCE_JAR: the name and version number of the bigtable hbase mapreduce JAR file that you downloaded from Maven. The value should look something like bigtable-hbase-1.x-mapreduce-2.6.0-shaded-byo-hadoop.jar.
(Optional) To confirm that the variables were set correctly, run the printenv command to view all environment variables.
Stop sending writes to HBase
Before you take snapshots of your HBase tables, stop sending writes to your HBase cluster.
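How you stop writes depends on how your applications are deployed. One option, in addition to pausing your writers, is to mark each HBase table read-only so that the cluster rejects further mutations. The following minimal sketch uses the standard HBase 1.x admin API; the table name is a placeholder, and depending on your HBase version and configuration you might need to disable the table before modifying it.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class MarkTableReadOnly {
  public static void main(String[] args) throws Exception {
    // Assumes hbase-site.xml for the source cluster is on the classpath.
    Configuration conf = HBaseConfiguration.create();
    String tableId = "my-table"; // hypothetical table name

    try (Connection connection = ConnectionFactory.createConnection(conf);
        Admin admin = connection.getAdmin()) {
      TableName tableName = TableName.valueOf(tableId);
      HTableDescriptor descriptor = admin.getTableDescriptor(tableName);
      descriptor.setReadOnly(true); // reject further mutations on this table
      // Depending on your HBase version, you might need to disable the table
      // before modifying it and re-enable it afterward.
      admin.modifyTable(tableName, descriptor);
    }
  }
}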
Create destination tables in Bigtable
The next step is to create a destination table in your Bigtable
instance for each HBase table that you are migrating. Use an account that has
bigtable.tables.create
permission for the instance.
This guide uses the Bigtable Schema Translation tool,
which automatically creates the table for you. However, if you don't want your
Bigtable schema to exactly match the HBase schema, you can
create a table using the
cbt
CLI
or the Google Cloud console.
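If you take that route and prefer to create the destination table programmatically, the following minimal sketch shows one way to do it with the HBase Admin API through the Bigtable HBase client; the project ID, instance ID, table name, and column family are placeholders, and the garbage collection setting shown is only an example.

import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

public class CreateDestinationTable {
  public static void main(String[] args) throws Exception {
    // Hypothetical identifiers -- replace with your own values.
    String projectId = "my-project";
    String instanceId = "my-bigtable-instance";

    try (Connection connection = BigtableConfiguration.connect(projectId, instanceId);
        Admin admin = connection.getAdmin()) {
      HTableDescriptor table = new HTableDescriptor(TableName.valueOf("my-table"));
      HColumnDescriptor family = new HColumnDescriptor("cf1");
      family.setMaxVersions(1); // keep only the latest cell version
      table.addFamily(family);
      admin.createTable(table);
    }
  }
}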
The Bigtable Schema Translation tool captures the schema of the HBase table, including the table name, column families, garbage collection policies, and splits. Then it creates a similar table in Bigtable.
For each table that you want to import, run the following to copy the schema from HBase to Bigtable.
java \
-Dgoogle.bigtable.project.id=$PROJECT_ID \
-Dgoogle.bigtable.instance.id=$BIGTABLE_INSTANCE_ID \
-Dgoogle.bigtable.table.filter=TABLE_NAME \
-Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM \
-Dhbase.zookeeper.property.clientPort=$ZOOKEEPER_PORT \
-jar $TRANSLATE_JAR
Replace TABLE_NAME
with the name of the HBase table
that you are importing. The Schema Translation tool uses this name for
your new Bigtable table.
You can also optionally replace TABLE_NAME
with a
regular expression, such as ".*", that captures all the tables that you want to
create, and then run the command only once.
Take HBase table snapshots and import them to Bigtable
Complete the following for each table that you plan to migrate to Bigtable.
Run the following command:
echo "snapshot 'HBASE_TABLE_NAME', 'HBASE_SNAPSHOT_NAME'" | hbase shell -n
Replace the following:
- HBASE_TABLE_NAME: the name of the HBase table that you are migrating to Bigtable
- HBASE_SNAPSHOT_NAME: the unique name for the new snapshot
Import the snapshot by running the following command:
gcloud dataproc jobs submit hadoop \
    --cluster $DATAPROC_CLUSTER_ID \
    --region $REGION \
    --project $PROJECT_ID \
    --jar $MAPREDUCE_JAR \
    -- \
    import-snapshot \
    -Dgoogle.bigtable.project.id=$PROJECT_ID \
    -Dgoogle.bigtable.instance.id=$BIGTABLE_INSTANCE_ID \
    HBASE_SNAPSHOT_NAME \
    $MIGRATION_SOURCE_DIRECTORY \
    BIGTABLE_TABLE_NAME \
    $MIGRATION_SOURCE_TMP_DIRECTORY
Replace the following:
- HBASE_SNAPSHOT_NAME: the name that you assigned to the snapshot of the table that you are importing
- BIGTABLE_TABLE_NAME: the name of the Bigtable table that you are importing to
After you run the command, the tool restores the HBase snapshot on the source cluster and then imports it. It can take several minutes for the process of restoring the snapshot to finish, depending on the size of the snapshot.
The following additional options are available when you import the data:
Set client-based timeouts for the buffered mutator requests (default 600000ms). See the following example:
-Dgoogle.bigtable.rpc.use.timeouts=true -Dgoogle.bigtable.mutate.rpc.timeout.ms=600000
Consider latency-based throttling, which can reduce the impact that the import batch job might have on other workloads. Throttling should be tested for your migration use case. See the following example:
-Dgoogle.bigtable.buffered.mutator.throttling.enable=true -Dgoogle.bigtable.buffered.mutator.throttling.threshold.ms=100
Modify the number of map tasks that read a single HBase region (default 2 map tasks per region). See the following example:
-Dgoogle.bigtable.import.snapshot.splits.per.region=3
Set additional MapReduce configurations as properties. See the following example:
-Dmapreduce.map.maxattempts=4 -Dmapreduce.map.speculative=false -Dhbase.snapshot.thread.pool.max=20
Keep the following tips in mind when you import:
- To improve the performance of data loading, be sure to have enough
Dataproc cluster workers to run map import tasks in parallel. By
default, an n1-standard-8 Dataproc worker will run eight import
tasks. Having enough workers ensures that the import job has enough compute
power to complete in a reasonable amount of time, but not so much power that it
overwhelms the Bigtable instance.
- If you are not also using the Bigtable instance for another workload, multiply the number of nodes in your Bigtable instance by 3, and then divide by 8 (assuming n1-standard-8 Dataproc workers). Use the result as the number of Dataproc workers.
- If you are using the instance for another workload at the same time that you are importing your HBase data, reduce the value of Dataproc workers or increase the number of Bigtable nodes to meet the workloads' requirements.
- During the import, you should monitor the Bigtable instance's CPU usage. If the CPU utilization across the Bigtable instance is too high, you might need to add additional nodes. Adding nodes improves CPU utilization immediately, but it can take up to 20 minutes after the nodes are added for the cluster to reach optimal performance.
For more information about monitoring the Bigtable instance, see Monitoring a Bigtable instance.
Validate the imported data in Bigtable
Next, validate the data migration by performing a hash comparison between the
source and destination table to gain confidence with the integrity of the
migrated data. First, run the hash-table
job to generate hashes of row ranges
on the source table. Then complete the validation by running the sync-table
job to compute and match hashes from Bigtable with the source.
To create hashes to use for validation, run the following command for each table that you are migrating:
gcloud dataproc jobs submit hadoop \
    --project $PROJECT_ID \
    --cluster $DATAPROC_CLUSTER_ID \
    --region $REGION \
    --jar $MAPREDUCE_JAR \
    -- \
    hash-table \
    -Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM_AND_PORT \
    HBASE_TABLE_NAME \
    $STORAGE_DIRECTORY/HBASE_TABLE_NAME/hash-output/
Replace HBASE_TABLE_NAME with the name of the HBase table that you created the snapshot for.

Run the following in the command shell:
gcloud dataproc jobs submit hadoop \
    --project $PROJECT_ID \
    --cluster $DATAPROC_CLUSTER_ID \
    --region $REGION \
    --jar $MAPREDUCE_JAR \
    -- \
    sync-table \
    --sourcezkcluster=$ZOOKEEPER_QUORUM_AND_PORT:/hbase \
    --targetbigtableproject=$PROJECT_ID \
    --targetbigtableinstance=$BIGTABLE_INSTANCE_ID \
    $STORAGE_DIRECTORY/HBASE_TABLE_NAME/hash-output/ \
    HBASE_TABLE_NAME \
    BIGTABLE_TABLE_NAME
Replace the following:
- HBASE_TABLE_NAME: the name of the HBase table that you are importing from
- BIGTABLE_TABLE_NAME: the name of the Bigtable table that you are importing to
You can optionally add --dryrun=false
to the command if you want to enable
synchronization between the source and target for diverging hash ranges.
When the sync-table
job is complete, the counters for the job are displayed in
the Google Cloud console where the job was executed. If the import job
successfully imports all of the data, HASHES_MATCHED has a value and
HASHES_NOT_MATCHED is 0.
If HASHES_NOT_MATCHED
shows a value, you can re-run sync-table
in debug
mode to emit the diverging ranges and cell level details such as
Source missing cell
, Target missing cell
, or Different values
. To enable
debug mode, configure --properties mapreduce.map.log.level=DEBUG
. After the
job is executed, use Cloud Logging and search for the expression
jsonPayload.class="org.apache.hadoop.hbase.mapreduce.SyncTable"
to review
diverging cells.
You can try the import job again or use SyncTable to synchronize the source and
target tables by setting dryrun=false
. Review HBase SyncTable and additional configuration options before proceeding.
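As an additional, informal check, you can also spot-check individual rows through the HBase API by reading the same cell from the source table and from Bigtable and comparing the values. The following minimal sketch assumes the bigtable-hbase client dependency is on your classpath; the identifiers, table name, row key, and column are placeholders.

import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SpotCheckRow {
  public static void main(String[] args) throws Exception {
    // Hypothetical values -- replace with your own IDs, table name, row key, and column.
    String projectId = "my-project";
    String instanceId = "my-bigtable-instance";
    String tableId = "my-table";
    byte[] rowKey = Bytes.toBytes("some-row-key");
    byte[] family = Bytes.toBytes("cf1");
    byte[] qualifier = Bytes.toBytes("col1");

    // The HBase connection assumes hbase-site.xml for the source cluster is on the classpath.
    Configuration hbaseConf = HBaseConfiguration.create();

    try (Connection hbase = ConnectionFactory.createConnection(hbaseConf);
        Connection bigtable = BigtableConfiguration.connect(projectId, instanceId);
        Table source = hbase.getTable(TableName.valueOf(tableId));
        Table target = bigtable.getTable(TableName.valueOf(tableId))) {
      byte[] sourceValue = source.get(new Get(rowKey)).getValue(family, qualifier);
      byte[] targetValue = target.get(new Get(rowKey)).getValue(family, qualifier);
      System.out.println("Cell values match: " + Arrays.equals(sourceValue, targetValue));
    }
  }
}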
Update the application to send reads and writes to Bigtable
After you've validated the data for each table in the cluster, you can configure your applications to route all their traffic to Bigtable, and then deprecate the HBase cluster.
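In many applications, the only code change needed is how the HBase Connection is created; the rest of the HBase API calls stay the same. The following minimal sketch shows the before-and-after, assuming the bigtable-hbase client dependency is on your classpath; the project and instance IDs are placeholders.

import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ConnectionCutover {

  // Before the cutover: the application connects to the HBase cluster,
  // with hbase-site.xml on the classpath.
  static Connection connectToHBase() throws IOException {
    return ConnectionFactory.createConnection(HBaseConfiguration.create());
  }

  // After the cutover: the same HBase API code connects to Bigtable through
  // the Bigtable HBase client. The IDs shown here are placeholders.
  static Connection connectToBigtable() {
    return BigtableConfiguration.connect("my-project", "my-bigtable-instance");
  }
}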
When your migration is complete, you can delete the snapshots.