Import a CSV file into a Cloud Bigtable table

Author(s): @thebilly, Published: 2018-06-26
This tutorial walks you through importing data into a Cloud Bigtable table. Using Cloud Dataflow, we will take a CSV file, map each row to a table row, and use the headers as column qualifiers, all placed under the same column family for simplicity.
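
The pipeline's actual source code lives in the repository you will clone below. To make the mapping concrete, here is a minimal sketch (not the tutorial's exact code) of a Beam transform that turns one CSV line into one Bigtable mutation. It assumes a naive comma split, the first field as the row key, the remaining headers as column qualifiers under the 'csv' column family created later in this tutorial, and a hypothetical class name CsvLineToMutationFn:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical sketch: maps a line like "rowkey,valA,valB" to a Put on
// row "rowkey" with csv:a=valA and csv:b=valB (for headers {"a", "b"}).
class CsvLineToMutationFn extends DoFn<String, Mutation> {
  private static final byte[] FAMILY = Bytes.toBytes("csv");
  private final String[] headers;  // qualifiers for the fields after the row key

  CsvLineToMutationFn(String[] headers) { this.headers = headers; }

  @ProcessElement
  public void processElement(ProcessContext c) {
    String[] fields = c.element().split(",");     // naive split; real CSVs need a proper parser
    Put put = new Put(Bytes.toBytes(fields[0]));  // first field is the row key
    for (int i = 1; i < fields.length; i++) {
      put.addColumn(FAMILY, Bytes.toBytes(headers[i - 1]), Bytes.toBytes(fields[i]));
    }
    c.output(put);
  }
}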

Prerequisites

Install software and download sample code

Make sure you have the following software installed:

- A Java 8 (or later) JDK
- Apache Maven
- Git

If you haven't used Maven before, check out this 5-minute quickstart.
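
You can confirm the tools are available on your path by checking their versions:

java -version
mvn -version
git --version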

Set up your Google Cloud Platform project

  1. Create a project in the GCP Console.
  2. Enable billing for your project.
  3. Install the Google Cloud SDK if you do not already have it. Make sure you initialize the SDK.

Upload your CSV

You can use your own CSV file or the provided example file, sample.csv.

Remove and store the headers

The import method used here can't handle headers automatically. Before uploading your file, make a copy of the comma-separated list of headers, and remove the header row from the CSV if you don't want it imported into your table.
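
For example, assuming a local file named sample-with-headers.csv (a hypothetical name), standard shell commands can capture the header row and produce a headerless copy. The first command prints the headers (save them for the -Dheaders flag below); the second writes everything after them to sample.csv:

head -n 1 sample-with-headers.csv
tail -n +2 sample-with-headers.csv > sample.csv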

Upload the CSV file

Upload the headerless CSV file to a new or existing Cloud Storage bucket.
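
For example, using gsutil from the Cloud SDK (YOUR_BUCKET is a placeholder; the first command is only needed if you are creating a new bucket):

gsutil mb gs://YOUR_BUCKET
gsutil cp sample.csv gs://YOUR_BUCKET/sample.csv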

Prepare your Cloud Bigtable table for data import

Follow the steps in the cbt quickstart to create a Cloud Bigtable instance and install cbt, the command-line tool for Cloud Bigtable. You can use an existing instance if you want.
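
The cbt tool reads default values from a ~/.cbtrc file, so you don't have to pass your project and instance on every command. For example:

echo project = YOUR_PROJECT_ID > ~/.cbtrc
echo instance = YOUR_INSTANCE_ID >> ~/.cbtrc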

Use an existing table or create a table:

cbt createtable my-table

The Cloud Dataflow job inserts data into column family 'csv'. Create that column family in your table:

cbt createfamily my-table csv

You can verify this worked by running:

cbt ls my-table

Expect to see the following:

Family Name  GC Policy
-----------  ---------
csv          [default]

Run the Cloud Dataflow job

Cloud Dataflow is a fully managed, serverless service for transforming and enriching data in both stream (real-time) and batch (historical) modes. We are using it as a quick, easy way to process the CSV concurrently and perform writes to our table at large scale. You also pay only for what you use, which keeps costs down.
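
For orientation, here is a minimal sketch, under the same assumptions as above, of how such a pipeline can be wired together with the Cloud Bigtable Beam connector: read lines from Cloud Storage with TextIO, convert each to a mutation, and write to the table with CloudBigtableIO. The real job in the repository also handles pipeline options and header parsing; CsvImportSketch is a hypothetical class name, and CsvLineToMutationFn is the sketch from earlier:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.ParDo;
import com.google.cloud.bigtable.beam.CloudBigtableIO;
import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;

public class CsvImportSketch {
  public static void main(String[] args) {
    // Where the mutations should land; placeholders as elsewhere in this tutorial.
    CloudBigtableTableConfiguration config =
        new CloudBigtableTableConfiguration.Builder()
            .withProjectId("YOUR_PROJECT_ID")
            .withInstanceId("YOUR_INSTANCE_ID")
            .withTableId("my-table")
            .build();

    // A real job would create the pipeline from DataflowPipelineOptions.
    Pipeline p = Pipeline.create();
    p.apply("ReadCsv", TextIO.read().from("gs://YOUR_BUCKET/sample.csv"))
     .apply("ToMutations", ParDo.of(new CsvLineToMutationFn(new String[] {"a", "b"})))
     .apply("WriteToBigtable", CloudBigtableIO.writeToTable(config));
    p.run().waitUntilFinish();
  }
}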

Clone the repo

Clone the following repository and change to the directory for this tutorial's code:

git clone https://github.com/GoogleCloudPlatform/cloud-bigtable-examples.git
cd cloud-bigtable-examples/java/dataflow-connector-examples/

Start the Dataflow job

mvn package exec:exec -DCsvImport -Dbigtable.projectID=YOUR_PROJECT_ID -Dbigtable.instanceID=YOUR_INSTANCE_ID \
-DinputFile="YOUR_FILE" -Dbigtable.table="YOUR_TABLE_ID" -Dheaders="YOUR_HEADERS"

Replace YOUR_PROJECT_ID, YOUR_INSTANCE_ID, YOUR_FILE, YOUR_TABLE_ID, and YOUR_HEADERS with appropriate values.

Here is an example command:

mvn package exec:exec -DCsvImport -Dbigtable.projectID=YOUR_PROJECT_ID -Dbigtable.instanceID=YOUR_INSTANCE_ID \
-DinputFile="gs://YOUR_BUCKET/sample.csv" -Dbigtable.table="my-table" -Dheaders="rowkey,a,b"

Note: If you see an error saying "Unable to get application default credentials.", you likely need to set up application default credentials as outlined here. If you are setting up a custom service account, be sure to assign the roles this job needs. For testing purposes, you can use Bigtable Administrator, Dataflow Admin, and Storage Admin.
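
For local testing, one way to set up application default credentials is with the Cloud SDK:

gcloud auth application-default login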

Monitor your job

In the Dataflow console, monitor the newly created job's status and check whether it runs without errors.

Verify your data was inserted

Run the following command to see the data for the first five rows (sorted lexicographically by row key) of your Cloud Bigtable table and verify that the output matches the data in the CSV file:

cbt read my-table count=5

Expect an output similar to the following:

1
  csv:a                                    @ 2018/07/09-13:42:39.364000
    "A5"
  csv:b                                    @ 2018/07/09-13:42:39.364000
    "B2"
----------------------------------------
10
  csv:a                                    @ 2018/07/09-13:42:38.022000
    "A3"
  csv:b                                    @ 2018/07/09-13:42:38.022000
    "B4"
----------------------------------------
2
  csv:a                                    @ 2018/07/09-13:42:39.365000
    "A4"
  csv:b                                    @ 2018/07/09-13:42:39.365000
    "B8"
----------------------------------------
3
  csv:a                                    @ 2018/07/09-13:42:39.366000
    "A8"
  csv:b                                    @ 2018/07/09-13:42:39.366000
    "B0"
----------------------------------------
4
  csv:a                                    @ 2018/07/09-13:42:39.367000
    "A4"
  csv:b                                    @ 2018/07/09-13:42:39.367000
    "B4"

Next steps

See more tutorials tagged with Cloud Bigtable, Dataflow, and Java.
