Exporting Data as Sequence Files

This page explains how to export a table from HBase or Cloud Bigtable as a series of Hadoop sequence files.

If you're migrating from HBase, you can export your table from HBase, then import the table into Cloud Bigtable.

If you're backing up or moving a Cloud Bigtable table, you can export your table from Cloud Bigtable, then import the table back into Cloud Bigtable.

Exporting a table from HBase

Identifying the table's column families

When you export a table, you should record a list of column families that the table uses. You will need this information when you import the table into Cloud Bigtable.

To get a list of column families in your table:

  1. Log into your HBase server.
  2. Start the HBase shell:

    hbase shell
    
  3. Use the describe command to get information about the table you plan to export:

    describe '[TABLE_NAME]'
    

    The describe command prints detailed information about the table's column families.
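
    For example, for a hypothetical table named my-table with a single column family named cf1, the output might look similar to the following; the exact attributes and their values vary with your HBase version and table settings:

    Table my-table is ENABLED
    my-table
    COLUMN FAMILIES DESCRIPTION
    {NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '1', COMPRESSION => 'NONE',
    TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536',
    REPLICATION_SCOPE => '0'}
    1 row(s) in 0.1330 seconds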

Exporting sequence files

The HBase server provides a utility that exports a table as a series of Hadoop sequence files. See the HBase documentation for instructions on using this utility.

To reduce transfer time, you can export compressed sequence files from HBase. The Cloud Bigtable importer supports both compressed and uncompressed sequence files.
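
As a sketch, an export with block-compressed output typically looks like the following. The -D properties are standard Hadoop output-compression settings; verify the exact property names against your HBase and Hadoop versions, and replace the values in brackets with the appropriate values:

hbase org.apache.hadoop.hbase.mapreduce.Export \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
    -D mapreduce.output.fileoutputformat.compress.type=BLOCK \
    '[TABLE_NAME]' [OUTPUT_DIR]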

Copying sequence files to Cloud Storage

Use the gsutil tool to copy the exported sequence files to a Cloud Storage bucket, replacing values in brackets with the appropriate values:

gsutil cp [SEQUENCE_FILES] gs://[BUCKET_PATH]

See the gsutil documentation for details about the gsutil cp command.
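
If the export consists of many files, the -m flag tells gsutil to copy them in parallel, which can substantially reduce transfer time. For example, assuming the exported files are in a local directory named hbase-export (a hypothetical path):

gsutil -m cp hbase-export/* gs://[BUCKET_PATH]/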

Exporting a table from Cloud Bigtable

Cloud Bigtable provides a utility that uses Cloud Dataflow to export a table as a series of Hadoop sequence files.

Identifying the table's column families

When you export a table, you should record a list of column families that the table uses. You will need this information when you import the table.

To get a list of column families in your table:

  1. Install the cbt tool:

    gcloud components update
    gcloud components install cbt
    
  2. Use the ls command to get a list of column families in the table you plan to export:

    cbt -instance [INSTANCE_ID] ls [TABLE_NAME]
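
    The cbt tool also needs to know which project to use. You can either pass -project [PROJECT_ID] with each command, or store default values in a .cbtrc file in your home directory. For example, with both values in .cbtrc, you can omit the flags entirely:

    echo "project = [PROJECT_ID]" >> ~/.cbtrc
    echo "instance = [INSTANCE_ID]" >> ~/.cbtrc
    cbt ls [TABLE_NAME]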
    

Creating a Cloud Storage bucket

You can store your exported table in an existing Cloud Storage bucket or in a new bucket. To create a new bucket, use the gsutil tool, replacing [BUCKET_NAME] with the appropriate value:

gsutil mb gs://[BUCKET_NAME]

See the gsutil documentation for details about the gsutil mb command.
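
To keep the export traffic within one region, consider creating the bucket in the same region as your Cloud Bigtable instance. For example, for an instance in a us-east1 zone, you can pass a location with the -l flag:

gsutil mb -l us-east1 gs://[BUCKET_NAME]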

Exporting sequence files

To export the table as a series of sequence files:

  1. Download the import/export JAR file, which includes all of the required dependencies:

    curl -f -O https://repo1.maven.org/maven2/com/google/cloud/bigtable/bigtable-dataflow-import/1.0.0-pre3/bigtable-dataflow-import-1.0.0-pre3-shaded.jar
    
  2. Run the following command to export the table, replacing values in brackets with the appropriate values. Make sure that [CLOUD_STORAGE_EXPORT_PATH] and [CLOUD_STORAGE_STAGING_PATH] are Cloud Storage paths that do not yet exist. For [5x_NUMBER_OF_NODES], use a value equal to 5 times the number of nodes in your Cloud Bigtable instance:

    java -jar bigtable-dataflow-import-1.0.0-pre3-shaded.jar export \
        --runner=BlockingDataflowPipelineRunner \
        --bigtableProjectId=[PROJECT_ID] \
        --bigtableInstanceId=[INSTANCE_ID] \
        --bigtableTableId=[TABLE_ID] \
        --destination=gs://[CLOUD_STORAGE_EXPORT_PATH] \
        --project=[PROJECT_ID] \
        --stagingLocation=gs://[CLOUD_STORAGE_STAGING_PATH] \
        --maxNumWorkers=[5x_NUMBER_OF_NODES] \
        --zone=[ZONE]
    

    For example, if your Cloud Bigtable instance has three nodes:

    java -jar bigtable-dataflow-import-1.0.0-pre3-shaded.jar export \
        --runner=BlockingDataflowPipelineRunner \
        --bigtableProjectId=my-project \
        --bigtableInstanceId=my-instance \
        --bigtableTableId=my-table \
        --destination=gs://my-export-bucket/my-table \
        --project=my-project \
        --stagingLocation=gs://my-export-bucket/jar-staging \
        --maxNumWorkers=15 \
        --zone=us-east1-b
    

    The export job saves your table to the Cloud Storage bucket as a set of Hadoop sequence files. You can use the Google Cloud Platform Console to monitor the export job while it runs.

    When the job is complete, it prints the message Job finished with status DONE to the console.
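
    You can then list the export path to confirm that the sequence files were written. For example, using the values from the sample command above:

    gsutil ls gs://my-export-bucket/my-table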

What's next

Learn how to import sequence files into Cloud Bigtable.
