The Bigtable to Cloud Storage Parquet template is a pipeline that reads data from a Bigtable table and writes it to a Cloud Storage bucket in Parquet format. You can use the template to move data from Bigtable to Cloud Storage.
Pipeline requirements
- The Bigtable table must exist.
- The output Cloud Storage bucket must exist before running the pipeline.
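One way to verify these prerequisites from the command line is sketched below; the project ID, instance ID, and bucket name are placeholders, and the cbt CLI must be installed separately.

# Confirm the Bigtable table exists (placeholder project and instance IDs).
cbt -project=my-project -instance=my-instance ls

# Create the output bucket if it doesn't already exist (placeholder bucket name).
gcloud storage buckets create gs://my-output-bucket --location=us-central1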
Template parameters
Required parameters
- bigtableProjectId : The ID of the Google Cloud project that contains the Cloud Bigtable instance that you want to read data from.
- bigtableInstanceId : The ID of the Cloud Bigtable instance that contains the table.
- bigtableTableId : The ID of the Cloud Bigtable table to export.
- outputDirectory : The path and filename prefix for writing output files. Must end with a slash. DateTime formatting is used to parse the directory path for date and time formatters. For example: gs://your-bucket/your-path/.
- filenamePrefix : The prefix of the Parquet file name. For example, "table1-". Defaults to: part.
Optional parameters
- numShards : The maximum number of output shards produced when writing. A higher number of shards means higher throughput for writing to Cloud Storage, but potentially higher data aggregation cost across shards when processing output Cloud Storage files. The default value is decided by Dataflow.
- bigtableAppProfileId : The ID of the Bigtable application profile to use for the export. If you don't specify an app profile, Bigtable uses the instance's default app profile: https://cloud.google.com/bigtable/docs/app-profiles#default-app-profile.
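If you aren't sure which app profile IDs exist on the instance, one way to list them is with the gcloud CLI, as sketched below; the instance ID is a placeholder.

# List the app profiles defined on a Bigtable instance (placeholder instance ID).
gcloud bigtable app-profiles list --instance=my-instance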
Run the template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Cloud Bigtable to Parquet Files on Cloud Storage template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates-REGION_NAME/VERSION/Cloud_Bigtable_to_GCS_Parquet \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
outputDirectory=OUTPUT_DIRECTORY,\
filenamePrefix=FILENAME_PREFIX,\
numShards=NUM_SHARDS
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
- REGION_NAME: the region where you want to deploy your Dataflow job, for example, us-central1
- BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
- INSTANCE_ID: the ID of the Bigtable instance that contains the table
- TABLE_ID: the ID of the Bigtable table to export
- OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
- FILENAME_PREFIX: the prefix of the Parquet filename, for example, output-
- NUM_SHARDS: the number of Parquet files to output, for example, 1
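For example, a fully substituted command might look like the following; the job name, project, instance, table, and bucket names are placeholders for illustration.

gcloud dataflow jobs run bigtable-parquet-export-001 \
    --gcs-location gs://dataflow-templates-us-central1/latest/Cloud_Bigtable_to_GCS_Parquet \
    --region us-central1 \
    --parameters \
bigtableProjectId=my-project,\
bigtableInstanceId=my-instance,\
bigtableTableId=my-table,\
outputDirectory=gs://my-output-bucket/exports/,\
filenamePrefix=table1-,\
numShards=1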
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates-LOCATION/VERSION/Cloud_Bigtable_to_GCS_Parquet
{
  "jobName": "JOB_NAME",
  "parameters": {
    "bigtableProjectId": "BIGTABLE_PROJECT_ID",
    "bigtableInstanceId": "INSTANCE_ID",
    "bigtableTableId": "TABLE_ID",
    "outputDirectory": "OUTPUT_DIRECTORY",
    "filenamePrefix": "FILENAME_PREFIX",
    "numShards": "NUM_SHARDS"
  },
  "environment": { "zone": "us-central1-f" }
}
Replace the following:
- PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
- LOCATION: the region where you want to deploy your Dataflow job, for example, us-central1
- BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
- INSTANCE_ID: the ID of the Bigtable instance that contains the table
- TABLE_ID: the ID of the Bigtable table to export
- OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
- FILENAME_PREFIX: the prefix of the Parquet filename, for example, output-
- NUM_SHARDS: the number of Parquet files to output, for example, 1
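One way to send this request is with curl and an access token from the gcloud CLI, as sketched below; the job name, project ID, region, and parameter values are placeholders for illustration.

# Launch the template with curl; all values below are placeholders.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "jobName": "bigtable-parquet-export-001",
        "parameters": {
          "bigtableProjectId": "my-project",
          "bigtableInstanceId": "my-instance",
          "bigtableTableId": "my-table",
          "outputDirectory": "gs://my-output-bucket/exports/",
          "filenamePrefix": "table1-",
          "numShards": "1"
        },
        "environment": { "zone": "us-central1-f" }
      }' \
  "https://dataflow.googleapis.com/v1b3/projects/my-project/locations/us-central1/templates:launch?gcsPath=gs://dataflow-templates-us-central1/latest/Cloud_Bigtable_to_GCS_Parquet"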
What's next
- Learn about Dataflow templates.
- See the list of Google-provided templates.