Bigtable to Cloud Storage Avro template

The Bigtable to Cloud Storage Avro template is a pipeline that reads data from a Bigtable table and writes it to a Cloud Storage bucket in Avro format. You can use the template to move data from Bigtable to Cloud Storage.

Pipeline requirements

The Bigtable table must exist.
The output Cloud Storage bucket must exist before running the pipeline.

Template parameters

Required parameters

bigtableProjectId: The ID of the Google Cloud project that contains the Bigtable instance that you want to read data from.
bigtableInstanceId: The ID of the Bigtable instance that contains the table.
bigtableTableId: The ID of the Bigtable table to export.
outputDirectory: The Cloud Storage path where data is written. For example, gs://mybucket/somefolder.
filenamePrefix: The prefix of the Avro filename. For example, output-. Defaults to: part.

Optional parameters

bigtableAppProfileId: The ID of the Bigtable application profile to use for the export. If you don't specify an app profile, Bigtable uses the instance's default app profile: https://cloud.google.com/bigtable/docs/app-profiles#default-app-profile.

Run the template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Cloud Bigtable to Avro Files on Cloud Storage template .
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates-REGION_NAME/VERSION/Cloud_Bigtable_to_GCS_Avro \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
outputDirectory=OUTPUT_DIRECTORY,\
filenamePrefix=FILENAME_PREFIX

Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates-REGION_NAME/latest/
- the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates-REGION_NAME/
Caution: The latest version of templates might update with breaking changes. Your production environments should use templates kept in the most recent dated parent folder to prevent these breaking changes from affecting your production workflows.
REGION_NAME: the region where you want to deploy your Dataflow job—for example, us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
INSTANCE_ID: the ID of the Bigtable instance that contains the table
TABLE_ID: the ID of the Bigtable table to export
OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
FILENAME_PREFIX: the prefix of the Avro filename, for example, output-

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates-LOCATION/VERSION/Cloud_Bigtable_to_GCS_Avro
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "outputDirectory": "OUTPUT_DIRECTORY",
       "filenamePrefix": "FILENAME_PREFIX",
   },
   "environment": { "zone": "us-central1-f" }
}