Importing Data from Sequence Files

This page explains how to import a series of Hadoop sequence files into Cloud Bigtable. You must create the Hadoop sequence files by exporting a table from HBase or Cloud Bigtable.

Before you start

Before you import a table into Cloud Bigtable, you need to export a table, either from HBase or from Cloud Bigtable.

The import process is the same regardless of whether you exported your table from HBase or Cloud Bigtable.

Creating a new Cloud Bigtable table

Before importing the table, you need to create a new, empty table with the same column families as the exported table.

To create the new table:

  1. Install the cbt tool:

    gcloud components update
    gcloud components install cbt
    
  2. Use the createtable command to create the table:

    cbt -instance [INSTANCE_ID] createtable [TABLE_NAME]
    
  3. Use the createfamily command as many times as necessary to create all of the column families:

    cbt -instance [INSTANCE_ID] createfamily [TABLE_NAME] [FAMILY_NAME]
    

    For example, if your table is called my-table, and you want to add the column families cf1 and cf2:

    cbt -instance my-instance createfamily my-table cf1
    cbt -instance my-instance createfamily my-table cf2
    

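As an optional check before importing, you can confirm that the new table has the expected column families by listing them with the cbt tool:

cbt -instance [INSTANCE_ID] ls [TABLE_NAME]

The output lists each column family in the table. If a family is missing, run the createfamily command again before you import.
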
Importing the table

Cloud Bigtable provides a utility that uses Cloud Dataflow to import a table from a series of Hadoop sequence files.

To import the table:

  1. Download the import/export JAR file, which includes all of the required dependencies:

    curl -f -O https://repo1.maven.org/maven2/com/google/cloud/bigtable/bigtable-dataflow-import/1.0.0-pre2/bigtable-dataflow-import-1.0.0-pre2-shaded.jar
    
  2. Run the following command to import the table, replacing the bracketed placeholders with your own values. For [CLOUD_STORAGE_STAGING_PATH], use a Cloud Storage path that does not yet exist, or the same path you used when you exported the table:

    java -jar bigtable-dataflow-import-1.0.0-pre2-shaded.jar import \
        --runner=BlockingDataflowPipelineRunner \
        --bigtableProjectId=[PROJECT_ID] \
        --bigtableInstanceId=[INSTANCE_ID] \
        --bigtableTableId=[TABLE_ID] \
        --filePattern='gs://[CLOUD_STORAGE_EXPORT_PATH]/*' \
        --project=[PROJECT_ID] \
        --stagingLocation=gs://[CLOUD_STORAGE_STAGING_PATH] \
        --maxNumWorkers=[3x_NUMBER_OF_NODES] \
        --zone=[ZONE]
    

    For example, if your Cloud Bigtable instance has three nodes:

    java -jar bigtable-dataflow-import-1.0.0-pre2-shaded.jar import \
        --runner=BlockingDataflowPipelineRunner \
        --bigtableProjectId=my-project \
        --bigtableInstanceId=my-instance \
        --bigtableTableId=my-table \
        --filePattern='gs://my-export-bucket/my-table/*' \
        --project=my-project \
        --stagingLocation=gs://my-export-bucket/jar-staging \
        --maxNumWorkers=9 \
        --zone=us-east1-b
    

    The import job loads the Hadoop sequence files into your Cloud Bigtable table. You can use the Google Cloud Platform Console to monitor the import job while it runs.

    When the job is complete, it prints the message Job finished with status DONE to the console.
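
    If you prefer the command line, you can also check the job's status with the gcloud tool. The command below is a general-purpose Cloud Dataflow listing command rather than part of the import utility, and its flags may vary slightly between gcloud versions:

    gcloud dataflow jobs list --status=active

    The import job appears in this list while it is running; once it is no longer listed as active, check the results as described in the next section.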

Checking the results of the import process

You can verify that the table was imported by using the cbt tool to count the number of rows in the table:

cbt -instance [INSTANCE_ID] count [TABLE_NAME]

The command prints the total number of rows in the table. Verify that the total number of rows is consistent with the number of rows in the exported table.
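
To spot-check the contents, you can also read a small number of rows with the cbt tool. The count= argument below simply limits how many rows are printed; adjust it as needed:

cbt -instance [INSTANCE_ID] read [TABLE_NAME] count=5

Compare the printed row keys, column families, and values against the source data to confirm that the import preserved your data as expected.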

What's next

Learn how to export sequence files from HBase or Cloud Bigtable.
