Importing Data from Sequence Files

This page explains how to import a series of Hadoop sequence files into Cloud Bigtable. You must create the Hadoop sequence files by exporting a table from HBase or Cloud Bigtable.

You can also use a Cloud Dataflow template to import sequence files from Cloud Storage.
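
For example, a template-based import from the command line might look like the following. This is a sketch that assumes the public GCS_SequenceFile_to_Cloud_Bigtable template and its parameter names; check the template documentation for the exact template path and parameters:

gcloud dataflow jobs run my-import-job \
    --gcs-location gs://dataflow-templates/latest/GCS_SequenceFile_to_Cloud_Bigtable \
    --parameters bigtableProject=[PROJECT_ID],bigtableInstanceId=[INSTANCE_ID],bigtableTableId=[TABLE_ID],sourcePattern='gs://[BUCKET_NAME]/[EXPORT_PATH]/part-*'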

If you need to import CSV data, see Import a CSV File into a Cloud Bigtable Table.

Before you begin

Before you import a table into Cloud Bigtable, you need to complete the following tasks:

  1. Export a table, either from HBase or from Cloud Bigtable.

    The import process is the same regardless of whether you exported your table from HBase or Cloud Bigtable.

  2. Check how much storage the original table uses, and make sure your Cloud Bigtable cluster has enough nodes for that amount of storage.

    For details about how many nodes you need, see Storage utilization per node.
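
    For example, you can check the current node count for each cluster in your instance with the gcloud tool (an optional check; the exact output columns depend on your gcloud version):

    gcloud bigtable clusters list --instances=[INSTANCE_ID]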

Creating a new Cloud Bigtable table

To import your data, you must create a new, empty table with the same column families as the exported table.

To create the new table:

  1. Install the cbt tool:

    gcloud components update
    gcloud components install cbt
    
  2. Use the createtable command to create the table:

    cbt -instance [INSTANCE_ID] createtable [TABLE_NAME]
    
  3. Use the createfamily command as many times as necessary to create all of the column families:

    cbt -instance [INSTANCE_ID] createfamily [TABLE_NAME] [FAMILY_NAME]
    

    For example, if your table is called my-new-table and you want to add the column families cf1 and cf2:

    cbt -instance my-instance createfamily my-new-table cf1
    cbt -instance my-instance createfamily my-new-table cf2
    
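
    To confirm that the column families were created, you can list them with the cbt tool (an optional check):

    cbt -instance my-instance ls my-new-table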

Importing the table

Cloud Bigtable provides a utility that uses Cloud Dataflow to import a table from a series of Hadoop sequence files.

To import the table:

  1. Download the import/export JAR file, which includes all of the required dependencies:

    curl -f -O https://repo1.maven.org/maven2/com/google/cloud/bigtable/bigtable-beam-import/1.6.0/bigtable-beam-import-1.6.0-shaded.jar
    
  2. Run the following command to import the table, replacing the bracketed placeholders with appropriate values. For [TEMP_PATH], use a Cloud Storage path that does not yet exist, or the same path you used when you exported the table:

    java -jar bigtable-beam-import-1.6.0-shaded.jar import \
        --runner=dataflow \
        --project=[PROJECT_ID] \
        --bigtableInstanceId=[INSTANCE_ID] \
        --bigtableTableId=[TABLE_ID] \
        --sourcePattern='gs://[BUCKET_NAME]/[EXPORT_PATH]/part-*' \
        --tempLocation=gs://[BUCKET_NAME]/[TEMP_PATH] \
        --maxNumWorkers=[3x_NUMBER_OF_NODES] \
        --zone=[DATAFLOW_JOB_ZONE]
    

    For example, if the clusters in your Cloud Bigtable instance have 3 nodes:

    java -jar bigtable-beam-import-1.6.0-shaded.jar import \
        --runner=dataflow \
        --project=my-project \
        --bigtableInstanceId=my-instance \
        --bigtableTableId=my-new-table \
        --sourcePattern='gs://my-export-bucket/my-table/part-*' \
        --tempLocation=gs://my-export-bucket/jar-temp \
        --maxNumWorkers=9 \
        --zone=us-east1-c
    

    The import job loads the Hadoop sequence files into your Cloud Bigtable table. You can use the Google Cloud Platform Console to monitor the import job while it runs.

    When the job is complete, it prints the message Job finished with status DONE to the console.
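
    As an alternative to the Console, you can list active Cloud Dataflow jobs from the command line (a quick optional check):

    gcloud dataflow jobs list --status=active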

Checking the results of the import process

You can verify that the table was imported by using the cbt tool to count the number of rows in the table:

cbt -instance [INSTANCE_ID] count [TABLE_NAME]

The command prints the total number of rows in the table. Verify that the total number of rows is consistent with the number of rows in the exported table.

What's next

Learn how to export sequence files from HBase or Cloud Bigtable.
