title: Import a CSV file into a Cloud Bigtable table with Dataflow
description: Learn how to import a CSV file into a Cloud Bigtable table.
author: billyjacobson
tags: Cloud Bigtable, Dataflow, Java
date_published: 2018-06-26

Billy Jacobson | Developer Programs Engineer | Google

Contributed by Google employees.

Important: You can now use the Cloud Bigtable CLI to import CSV files, using the cbt import command. For details, see Easy CSV importing into Cloud Bigtable. You can still follow the tutorial on this page to use Dataflow to import a CSV file if you need to have a customized CSV import.

This tutorial walks you through importing data into a Cloud Bigtable table. Using Dataflow, you take a CSV file and map each line of the file to a table row. The headers become column qualifiers, all placed under the same column family for simplicity.

Prerequisites

Install software and download sample code

Make sure that you have the following software installed:

- Git
- Java SE 11
- Apache Maven 3.6.x or later

These software packages are already installed in Cloud Shell.

If you haven't used Maven before, check out this 5-minute quickstart.

Set up your Google Cloud project

1. Create a project in the Cloud Console.
2. Enable billing for your project.
3. Install the Cloud SDK if you do not already have it. Make sure you initialize the SDK.

Upload your CSV

You can use your own CSV file or the example provided.

Remove and store the headers

The import method used in this tutorial can't handle the header row automatically. Before uploading your file, make a copy of the comma-separated list of headers (you pass it to the Dataflow job later) and remove the header row from the CSV file if you don't want it imported into your table.
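
One way to do both steps from a shell is sketched below with the standard `head` and `tail` utilities. The file names `sample.csv` and `sample-headerless.csv`, the scratch directory, and the inline sample data are placeholders for illustration only; substitute your own file.

```shell
# Work in a scratch directory so the sketch doesn't touch your real files.
WORKDIR="$(mktemp -d)"

# Placeholder CSV with a header row (replace with your own file).
printf 'rowkey,a,b\n1,A5,B2\n2,A4,B8\n' > "$WORKDIR/sample.csv"

# Store the comma-separated header list for later use.
HEADERS="$(head -n 1 "$WORKDIR/sample.csv")"

# Copy every line except the first into a new, headerless file.
tail -n +2 "$WORKDIR/sample.csv" > "$WORKDIR/sample-headerless.csv"

echo "Headers: $HEADERS"
```

Keeping the header list in a variable makes it easy to reuse as the value of the `headers` property when you start the Dataflow job.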

Upload the CSV file

Upload the headerless CSV file to a new or existing Cloud Storage bucket.

Prepare your Cloud Bigtable table for data import

Follow the steps in the cbt quickstart to create a Cloud Bigtable instance and install the command-line tool for Cloud Bigtable. You can use an existing instance if you want.

Create a table:
```
cbt createtable my-table
```
Create the csv column family in your table:
```
cbt createfamily my-table csv
```
The Dataflow job inserts data into the column family csv.

Verify that the creation worked:
```
cbt ls my-table
```
The output should be the following:
```
Family Name GC Policy
----------- ---------
csv   [default]
```

Run the Dataflow job

Dataflow is a fully managed, serverless service for transforming and enriching data in stream (real-time) and batch (historical) modes. This tutorial uses Dataflow as a quick way to process the CSV file concurrently and perform writes to the table at a large scale. You pay only for what you use, which keeps costs down.

Clone the repository

Clone the following repository and change to the directory for this tutorial's code:

```
git clone https://github.com/GoogleCloudPlatform/cloud-bigtable-examples.git
cd cloud-bigtable-examples/java/dataflow-connector-examples/
```

Start the Dataflow job

```
mvn package exec:exec -DCsvImport -Dbigtable.projectID=YOUR_PROJECT_ID -Dbigtable.instanceID=YOUR_INSTANCE_ID \
-DinputFile="YOUR_FILE" -Dbigtable.table="YOUR_TABLE_ID" -Dheaders="YOUR_HEADERS"
```

Replace YOUR_PROJECT_ID, YOUR_INSTANCE_ID, YOUR_FILE, YOUR_TABLE_ID, and YOUR_HEADERS with appropriate values.

Here is an example command:

```
mvn package exec:exec -DCsvImport -Dbigtable.projectID=YOUR_PROJECT_ID -Dbigtable.instanceID=YOUR_INSTANCE_ID \
-DinputFile="gs://YOUR_BUCKET/sample.csv" -Dbigtable.table="my-table" -Dheaders="rowkey,a,b"
```

The first column is always used as the row key.
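
To make the mapping concrete, here is a minimal local sketch in `awk` of how a headerless CSV line and the header list could pair up. It only illustrates the idea and does not reproduce the Dataflow job's code. The helper name `map_rows`, the sample lines, and the `rowkey family:qualifier=value` output format are made up for this example, and the naive comma split assumes that no field contains a quoted comma.

```shell
# Header list as it would be passed to the job; the first entry names the row key column.
HEADERS="rowkey,a,b"

# Hypothetical helper: reads headerless CSV lines on stdin and prints one
# line per cell in the form "rowkey csv:qualifier=value".
map_rows() {
  awk -F',' -v headers="$HEADERS" '
    BEGIN { n = split(headers, h, ",") }
    {
      # Field 1 is the row key; fields 2..n become cells in column family "csv".
      for (i = 2; i <= n; i++) {
        printf "%s csv:%s=%s\n", $1, h[i], $i
      }
    }'
}

MAPPED="$(printf '1,A5,B2\n2,A4,B8\n' | map_rows)"
echo "$MAPPED"
```

Each input line yields one cell per remaining header, so the order of the header list must match the order of the columns in the file.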

If you see the error "Unable to get application default credentials.", you likely need to set up application credentials as outlined here. If you are setting up a custom service account, be sure to assign the roles that this job needs. For testing purposes, you can use Bigtable Administrator, Dataflow Admin, and Storage Admin.

Monitor your job

In the Dataflow console, monitor the status of the newly created job and check for any errors.

Verify your data was inserted

Run the following command to see the data for the first five rows (sorted lexicographically by row key) of your Cloud Bigtable table and verify that the output matches the data in the CSV file:

```
cbt read my-table count=5
```

Expect an output similar to the following:

```
1
  csv:a                                    @ 2018/07/09-13:42:39.364000
    "A5"
  csv:b                                    @ 2018/07/09-13:42:39.364000
    "B2"
----------------------------------------
10
  csv:a                                    @ 2018/07/09-13:42:38.022000
    "A3"
  csv:b                                    @ 2018/07/09-13:42:38.022000
    "B4"
----------------------------------------
2
  csv:a                                    @ 2018/07/09-13:42:39.365000
    "A4"
  csv:b                                    @ 2018/07/09-13:42:39.365000
    "B8"
----------------------------------------
3
  csv:a                                    @ 2018/07/09-13:42:39.366000
    "A8"
  csv:b                                    @ 2018/07/09-13:42:39.366000
    "B0"
----------------------------------------
4
  csv:a                                    @ 2018/07/09-13:42:39.367000
    "A4"
  csv:b                                    @ 2018/07/09-13:42:39.367000
    "B4"
```
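
Row key 10 appears between 1 and 2 in this output because row keys are compared as strings, not as numbers. The following sketch reproduces that ordering locally with `sort`; the list of keys is sample input chosen to match the output above, and `LC_ALL=C` is used here to force a plain byte-wise comparison.

```shell
# Sort sample row keys byte-wise, the way strings (not numbers) are ordered.
SORTED="$(printf '1\n2\n3\n4\n10\n' | LC_ALL=C sort | paste -sd' ' -)"
echo "$SORTED"
```

If you need numeric order when reading rows back, a common approach is to zero-pad numeric keys (for example, `01`, `02`, `10`) before importing.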

Next steps

Explore the Cloud Bigtable documentation.
