Google-provided Dataflow batch templates

Google provides a set of open source Dataflow templates.

These Dataflow templates can help you solve large data tasks, including data import, data export, data backup, data restore, and bulk API operations, all without using a dedicated development environment. The templates are built on Apache Beam and use Dataflow to transform the data.

For general information about templates, see Dataflow templates. For a list of all Google-provided templates, see Get started with Google-provided templates.

This guide documents batch templates.

BigQuery to Cloud Storage TFRecords

The BigQuery to Cloud Storage TFRecords template is a pipeline that reads data from a BigQuery query and writes it to a Cloud Storage bucket in TFRecord format. You can specify the training, testing, and validation percentage splits. By default, the split is 1, or 100%, for the training set and 0, or 0%, for the testing and validation sets. When you set the dataset split, the sum of the training, testing, and validation percentages must add up to 1, or 100% (for example, 0.6+0.2+0.2). Dataflow automatically determines the optimal number of shards for each output dataset.

Requirements for this pipeline:

  • The BigQuery dataset and table must exist.
  • The output Cloud Storage bucket must exist before pipeline execution. Training, testing, and validation subdirectories do not need to preexist and are autogenerated.

Template parameters

Parameter Description
readQuery A BigQuery SQL query that extracts data from the source. For example, select * from dataset1.sample_table.
outputDirectory The top-level Cloud Storage path prefix at which to write the training, testing, and validation TFRecord files. For example, gs://mybucket/output. Subdirectories for resulting training, testing, and validation TFRecord files are automatically generated from outputDirectory. For example, gs://mybucket/output/train
trainingPercentage (Optional) The percentage of query data allocated to training TFRecord files. The default value is 1, or 100%.
testingPercentage (Optional) The percentage of query data allocated to testing TFRecord files. The default value is 0, or 0%.
validationPercentage (Optional) The percentage of query data allocated to validation TFRecord files. The default value is 0, or 0%.
outputSuffix (Optional) The file suffix for the training, testing, and validation TFRecord files that are written. The default value is .tfrecord.

Running the BigQuery to Cloud Storage TFRecord files template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the BigQuery to TFRecords template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_BigQuery_to_GCS_TensorFlow_Records \
    --region REGION_NAME \
    --parameters \
readQuery=READ_QUERY,\
outputDirectory=OUTPUT_DIRECTORY,\
trainingPercentage=TRAINING_PERCENTAGE,\
testingPercentage=TESTING_PERCENTAGE,\
validationPercentage=VALIDATION_PERCENTAGE,\
outputSuffix=OUTPUT_FILENAME_SUFFIX

Replace the following:

  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • REGION_NAME: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • READ_QUERY: the BigQuery query to run
  • OUTPUT_DIRECTORY: the Cloud Storage path prefix for output datasets
  • TRAINING_PERCENTAGE: the decimal percentage split for the training dataset
  • TESTING_PERCENTAGE: the decimal percentage split for the testing dataset
  • VALIDATION_PERCENTAGE: the decimal percentage split for the validation dataset
  • OUTPUT_FILENAME_SUFFIX: the preferred output TensorFlow Record file suffix
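
For example, an illustrative invocation with placeholder project, dataset, and bucket names (replace them with your own resources) splits the query results 70/15/15 across training, testing, and validation, so the three percentages sum to 1 as required:

gcloud dataflow jobs run bq-to-tfrecords-example \
    --gcs-location gs://dataflow-templates/latest/Cloud_BigQuery_to_GCS_TensorFlow_Records \
    --region us-central1 \
    --parameters \
readQuery="select * from dataset1.sample_table",\
outputDirectory=gs://my-bucket/output,\
trainingPercentage=0.7,\
testingPercentage=0.15,\
validationPercentage=0.15,\
outputSuffix=.tfrecord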

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_BigQuery_to_GCS_TensorFlow_Records
{
   "jobName": "JOB_NAME",
   "parameters": {
       "readQuery":"READ_QUERY",
       "outputDirectory":"OUTPUT_DIRECTORY",
       "trainingPercentage":"TRAINING_PERCENTAGE",
       "testingPercentage":"TESTING_PERCENTAGE",
       "validationPercentage":"VALIDATION_PERCENTAGE",
       "outputSuffix":"OUTPUT_FILENAME_SUFFIX"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • LOCATION: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • READ_QUERY: the BigQuery query to run
  • OUTPUT_DIRECTORY: the Cloud Storage path prefix for output datasets
  • TRAINING_PERCENTAGE: the decimal percentage split for the training dataset
  • TESTING_PERCENTAGE: the decimal percentage split for the testing dataset
  • VALIDATION_PERCENTAGE: the decimal percentage split for the validation dataset
  • OUTPUT_FILENAME_SUFFIX: the preferred output TensorFlow Record file suffix
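
For example, a request like the following could be sent with curl, using gcloud to obtain an access token. The project ID, region, and parameter values shown are placeholders; the same pattern applies to the other classic template API examples on this page.

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "jobName": "bq-to-tfrecords-example",
        "parameters": {
          "readQuery": "select * from dataset1.sample_table",
          "outputDirectory": "gs://my-bucket/output",
          "trainingPercentage": "0.7",
          "testingPercentage": "0.15",
          "validationPercentage": "0.15"
        },
        "environment": { "zone": "us-central1-f" }
      }' \
  "https://dataflow.googleapis.com/v1b3/projects/my-project/locations/us-central1/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_BigQuery_to_GCS_TensorFlow_Records"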

BigQuery export to Parquet (via Storage API)

The BigQuery export to Parquet template is a batch pipeline that reads data from a BigQuery table and writes it to a Cloud Storage bucket in Parquet format. This template utilizes the BigQuery Storage API to export the data.

Requirements for this pipeline:

  • The input BigQuery table must exist before running the pipeline.
  • The output Cloud Storage bucket must exist before running the pipeline.

Template parameters

Parameter Description
tableRef The BigQuery input table location. For example, <my-project>:<my-dataset>.<my-table>.
bucket The Cloud Storage folder in which to write the Parquet files. For example, gs://mybucket/exports.
numShards (Optional) The number of output file shards. The default value is 1.
fields (Optional) A comma-separated list of fields to select from the input BigQuery table.

Running the BigQuery to Cloud Storage Parquet template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the BigQuery export to Parquet (via Storage API) template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/BigQuery_to_Parquet \
    --region=REGION_NAME \
    --parameters \
tableRef=BIGQUERY_TABLE,\
bucket=OUTPUT_DIRECTORY,\
numShards=NUM_SHARDS,\
fields=FIELDS

Replace the following:

  • PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • REGION_NAME: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • BIGQUERY_TABLE: your BigQuery table name
  • OUTPUT_DIRECTORY: your Cloud Storage folder for output files
  • NUM_SHARDS: the desired number of output file shards
  • FIELDS: the comma-separated list of fields to select from the input BigQuery table
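
For example, an illustrative invocation with placeholder project, table, and bucket names might look like the following; the optional fields parameter is omitted here, and the output is written as four Parquet shards:

gcloud beta dataflow flex-template run bq-to-parquet-example \
    --project=my-project \
    --template-file-gcs-location=gs://dataflow-templates/latest/flex/BigQuery_to_Parquet \
    --region=us-central1 \
    --parameters \
tableRef=my-project:my_dataset.my_table,\
bucket=gs://my-bucket/exports,\
numShards=4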

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "tableRef": "BIGQUERY_TABLE",
          "bucket": "OUTPUT_DIRECTORY",
          "numShards": "NUM_SHARDS",
          "fields": "FIELDS"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/BigQuery_to_Parquet",
   }
}

Replace the following:

  • PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • LOCATION: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • BIGQUERY_TABLE: your BigQuery table name
  • OUTPUT_DIRECTORY: your Cloud Storage folder for output files
  • NUM_SHARDS: the desired number of output file shards
  • FIELDS: the comma-separated list of fields to select from the input BigQuery table
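
As an illustration, the same launch request could be sent with curl. Note that Flex Templates use the flexTemplates:launch endpoint and wrap the job configuration in launch_parameter, unlike the classic templates:launch calls elsewhere on this page. The project, region, and parameter values here are placeholders.

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "launch_parameter": {
          "jobName": "bq-to-parquet-example",
          "parameters": {
            "tableRef": "my-project:my_dataset.my_table",
            "bucket": "gs://my-bucket/exports",
            "numShards": "4"
          },
          "containerSpecGcsPath": "gs://dataflow-templates/latest/flex/BigQuery_to_Parquet"
        }
      }' \
  "https://dataflow.googleapis.com/v1b3/projects/my-project/locations/us-central1/flexTemplates:launch"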

BigQuery to Elasticsearch

The BigQuery to Elasticsearch template is a batch pipeline that ingests data from a BigQuery table into Elasticsearch as documents. The template can either read the entire table or read specific records using a supplied query.

Requirements for this pipeline

  • The source BigQuery table must exist.
  • An Elasticsearch host, on a Google Cloud instance or on Elastic Cloud, running Elasticsearch version 7.0 or later. The host must be accessible from the Dataflow worker machines.

Template parameters

Parameter Description
connectionUrl Elasticsearch URL in the format https://hostname:[port] or specify CloudID if using Elastic Cloud.
apiKey Base64 Encoded API key used for authentication.
index The Elasticsearch index that the requests are issued to. For example, my-index.
inputTableSpec (Optional) BigQuery table to read from to insert into Elasticsearch. Either table or query must be provided. For example, projectId:datasetId.tablename.
query (Optional) SQL query to pull data from BigQuery. Either table or query must be provided.
useLegacySql (Optional) Set to true to use legacy SQL (only applicable if supplying query). Default: false.
batchSize (Optional) Batch size in number of documents. Default: 1000.
batchSizeBytes (Optional) Batch size in number of bytes. Default: 5242880 (5 MB).
maxRetryAttempts (Optional) Max retry attempts, must be > 0. Default: no retries.
maxRetryDuration (Optional) Max retry duration in milliseconds, must be > 0. Default: no retries.
propertyAsIndex (Optional) A property in the document being indexed whose value will specify _index metadata to be included with document in bulk request (takes precedence over an _index UDF). Default: none.
propertyAsId (Optional) A property in the document being indexed whose value will specify _id metadata to be included with document in bulk request (takes precedence over an _id UDF). Default: none.
javaScriptIndexFnGcsPath (Optional) The Cloud Storage path to the JavaScript UDF source for a function that will specify _index metadata to be included with document in bulk request. Default: none.
javaScriptIndexFnName (Optional) UDF JavaScript function name for function that will specify _index metadata to be included with document in bulk request. Default: none.
javaScriptIdFnGcsPath (Optional) The Cloud Storage path to the JavaScript UDF source for a function that will specify _id metadata to be included with document in bulk request. Default: none.
javaScriptIdFnName (Optional) UDF JavaScript function name for function that will specify _id metadata to be included with document in bulk request. Default: none.
javaScriptTypeFnGcsPath (Optional) The Cloud Storage path to the JavaScript UDF source for a function that will specify _type metadata to be included with document in bulk request. Default: none.
javaScriptTypeFnName (Optional) UDF JavaScript function name for function that will specify _type metadata to be included with document in bulk request. Default: none.
javaScriptIsDeleteFnGcsPath (Optional) The Cloud Storage path to JavaScript UDF source for function that will determine if document should be deleted rather than inserted or updated. The function should return string value "true" or "false". Default: none.
javaScriptIsDeleteFnName (Optional) UDF JavaScript function name for function that will determine if document should be deleted rather than inserted or updated. The function should return string value "true" or "false". Default: none.
usePartialUpdate (Optional) Whether to use partial updates (update rather than create or index, allowing partial docs) with Elasticsearch requests. Default: false.
bulkInsertMethod (Optional) Whether to use INDEX (index, allows upserts) or CREATE (create, errors on duplicate _id) with Elasticsearch bulk requests. Default: CREATE.

Running the BigQuery to Elasticsearch template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the BigQuery to Elasticsearch template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/BigQuery_to_Elasticsearch \
    --parameters \
inputTableSpec=INPUT_TABLE_SPEC,\
connectionUrl=CONNECTION_URL,\
apiKey=APIKEY,\
index=INDEX

Replace the following:

  • PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • REGION_NAME: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • INPUT_TABLE_SPEC: your BigQuery table name.
  • CONNECTION_URL: your Elasticsearch URL.
  • APIKEY: your base64 encoded API key for authentication.
  • INDEX: your Elasticsearch index.
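
For example, an illustrative run with placeholder cluster URL, API key, and query values reads with a SQL query instead of a table, routes document IDs from a field by using the optional propertyAsId parameter, and lowers the batch size:

gcloud beta dataflow flex-template run bq-to-elasticsearch-example \
    --project=my-project \
    --region=us-central1 \
    --template-file-gcs-location=gs://dataflow-templates/latest/flex/BigQuery_to_Elasticsearch \
    --parameters \
query="SELECT * FROM my_dataset.articles WHERE published = true",\
connectionUrl=https://my-cluster.es.example.com:9243,\
apiKey=BASE64_ENCODED_API_KEY,\
index=articles,\
propertyAsId=id,\
batchSize=500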

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "inputTableSpec": "INPUT_TABLE_SPEC",
          "connectionUrl": "CONNECTION_URL",
          "apiKey": "APIKEY",
          "index": "INDEX"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/BigQuery_to_Elasticsearch",
   }
}

Replace the following:

  • PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • LOCATION: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • INPUT_TABLE_SPEC: your BigQuery table name.
  • CONNECTION_URL: your Elasticsearch URL.
  • APIKEY: your base64 encoded API key for authentication.
  • INDEX: your Elasticsearch index.

BigQuery to MongoDB

The BigQuery to MongoDB template is a batch pipeline that reads rows from a BigQuery table and writes them to MongoDB as documents. Currently, each row is stored as a single document.

Requirements for this pipeline

  • The source BigQuery table must exist.
  • The target MongoDB instance should be accessible from the Dataflow worker machines.

Template parameters

Parameter Description
mongoDbUri The MongoDB connection URI in the format mongodb+srv://<username>:<password>@<host>.
database Database in MongoDB to store the collection. For example: my-db.
collection Name of the collection in the MongoDB database. For example: my-collection.
inputTableSpec BigQuery table to read from. For example, bigquery-project:dataset.input_table.

Running the BigQuery to MongoDB template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the BigQuery to MongoDB template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/BigQuery_to_MongoDB \
    --parameters \
inputTableSpec=INPUT_TABLE_SPEC,\
mongoDbUri=MONGO_DB_URI,\
database=DATABASE,\
collection=COLLECTION

Replace the following:

  • PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • REGION_NAME: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • INPUT_TABLE_SPEC: your source BigQuery table name.
  • MONGO_DB_URI: your MongoDB URI.
  • DATABASE: your MongoDB database.
  • COLLECTION: your MongoDB collection.
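
For example, an illustrative run might look like the following; the table, connection URI, database, and collection values are placeholders to replace with your own:

gcloud beta dataflow flex-template run bq-to-mongodb-example \
    --project=my-project \
    --region=us-central1 \
    --template-file-gcs-location=gs://dataflow-templates/latest/flex/BigQuery_to_MongoDB \
    --parameters \
inputTableSpec=my-project:my_dataset.orders,\
mongoDbUri=mongodb+srv://user:password@cluster0.example.mongodb.net,\
database=my-db,\
collection=orders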

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "inputTableSpec": "INPUT_TABLE_SPEC",
          "mongoDbUri": "MONGO_DB_URI",
          "database": "DATABASE",
          "collection": "COLLECTION"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/BigQuery_to_MongoDB"
   }
}

Replace the following:

  • PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • LOCATION: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • INPUT_TABLE_SPEC: your source BigQuery table name.
  • MONGO_DB_URI: your MongoDB URI.
  • DATABASE: your MongoDB database.
  • COLLECTION: your MongoDB collection.

Bigtable to Cloud Storage Avro

The Bigtable to Cloud Storage Avro template is a pipeline that reads data from a Bigtable table and writes it to a Cloud Storage bucket in Avro format. You can use the template to move data from Bigtable to Cloud Storage.

Requirements for this pipeline:

  • The Bigtable table must exist.
  • The output Cloud Storage bucket must exist before running the pipeline.

Template parameters

Parameter Description
bigtableProjectId The ID of the Google Cloud project of the Bigtable instance that you want to read data from.
bigtableInstanceId The ID of the Bigtable instance that contains the table.
bigtableTableId The ID of the Bigtable table to export.
outputDirectory The Cloud Storage path where data is written. For example, gs://mybucket/somefolder.
filenamePrefix The prefix of the Avro filename. For example, output-.

Running the Bigtable to Cloud Storage Avro file template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Cloud Bigtable to Avro Files on Cloud Storage template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Avro \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
outputDirectory=OUTPUT_DIRECTORY,\
filenamePrefix=FILENAME_PREFIX

Replace the following:

  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • REGION_NAME: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to export
  • OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
  • FILENAME_PREFIX: the prefix of the Avro filename, for example, output-
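
For example, an illustrative export with placeholder project, instance, table, and bucket names writes Avro files prefixed with output- under gs://my-bucket/bigtable-export:

gcloud dataflow jobs run bigtable-to-avro-example \
    --gcs-location gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_Avro \
    --region us-central1 \
    --parameters \
bigtableProjectId=my-project,\
bigtableInstanceId=my-instance,\
bigtableTableId=my-table,\
outputDirectory=gs://my-bucket/bigtable-export,\
filenamePrefix=output-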

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Avro
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "outputDirectory": "OUTPUT_DIRECTORY",
       "filenamePrefix": "FILENAME_PREFIX",
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • LOCATION: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to export
  • OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
  • FILENAME_PREFIX: the prefix of the Avro filename, for example, output-

Bigtable to Cloud Storage Parquet

The Bigtable to Cloud Storage Parquet template is a pipeline that reads data from a Bigtable table and writes it to a Cloud Storage bucket in Parquet format. You can use the template to move data from Bigtable to Cloud Storage.

Requirements for this pipeline:

  • The Bigtable table must exist.
  • The output Cloud Storage bucket must exist before running the pipeline.

Template parameters

Parameter Description
bigtableProjectId The ID of the Google Cloud project of the Bigtable instance that you want to read data from.
bigtableInstanceId The ID of the Bigtable instance that contains the table.
bigtableTableId The ID of the Bigtable table to export.
outputDirectory The Cloud Storage path where data is written. For example, gs://mybucket/somefolder.
filenamePrefix The prefix of the Parquet filename. For example, output-.
numShards The number of output file shards. For example, 2.

Running the Bigtable to Cloud Storage Parquet file template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Cloud Bigtable to Parquet Files on Cloud Storage template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Parquet \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
outputDirectory=OUTPUT_DIRECTORY,\
filenamePrefix=FILENAME_PREFIX,\
numShards=NUM_SHARDS

Replace the following:

  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • REGION_NAME: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to export
  • OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
  • FILENAME_PREFIX: the prefix of the Parquet filename, for example, output-
  • NUM_SHARDS: the number of Parquet files to output, for example, 1

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Parquet
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "outputDirectory": "OUTPUT_DIRECTORY",
       "filenamePrefix": "FILENAME_PREFIX",
       "numShards": "NUM_SHARDS"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • LOCATION: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to export
  • OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
  • FILENAME_PREFIX: the prefix of the Parquet filename, for example, output-
  • NUM_SHARDS: the number of Parquet files to output, for example, 1

Bigtable to Cloud Storage SequenceFile

The Bigtable to Cloud Storage SequenceFile template is a pipeline that reads data from a Bigtable table and writes the data to a Cloud Storage bucket in SequenceFile format. You can use the template to copy data from Bigtable to Cloud Storage.

Requirements for this pipeline:

  • The Bigtable table must exist.
  • The output Cloud Storage bucket must exist before running the pipeline.

Template parameters

Parameter Description
bigtableProject The ID of the Google Cloud project of the Bigtable instance that you want to read data from.
bigtableInstanceId The ID of the Bigtable instance that contains the table.
bigtableTableId The ID of the Bigtable table to export.
bigtableAppProfileId The ID of the Bigtable application profile to be used for the export. If you do not specify an app profile, Bigtable uses the instance's default app profile.
destinationPath The Cloud Storage path where data is written. For example, gs://mybucket/somefolder.
filenamePrefix The prefix of the SequenceFile filename. For example, output-.

Running the Bigtable to Cloud Storage SequenceFile template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Cloud Bigtable to SequenceFile Files on Cloud Storage template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_SequenceFile \
    --region REGION_NAME \
    --parameters \
bigtableProject=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
bigtableAppProfileId=APPLICATION_PROFILE_ID,\
destinationPath=DESTINATION_PATH,\
filenamePrefix=FILENAME_PREFIX

Replace the following:

  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • REGION_NAME: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to export
  • APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to be used for the export
  • DESTINATION_PATH: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
  • FILENAME_PREFIX: the prefix of the SequenceFile filename, for example, output-
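
For example, an illustrative export that pins reads to a specific app profile; all resource names here are placeholders, and you can drop bigtableAppProfileId to use the instance's default app profile:

gcloud dataflow jobs run bigtable-to-sequencefile-example \
    --gcs-location gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_SequenceFile \
    --region us-central1 \
    --parameters \
bigtableProject=my-project,\
bigtableInstanceId=my-instance,\
bigtableTableId=my-table,\
bigtableAppProfileId=batch-export-profile,\
destinationPath=gs://my-bucket/sequencefile-export,\
filenamePrefix=output-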

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_SequenceFile
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProject": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "bigtableAppProfileId": "APPLICATION_PROFILE_ID",
       "destinationPath": "DESTINATION_PATH",
       "filenamePrefix": "FILENAME_PREFIX",
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • LOCATION: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to export
  • APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to be used for the export
  • DESTINATION_PATH: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
  • FILENAME_PREFIX: the prefix of the SequenceFile filename, for example, output-

Datastore to Cloud Storage Text [Deprecated]

This template is deprecated and will be removed in Q1 2022. Migrate to the Firestore to Cloud Storage Text template.

The Datastore to Cloud Storage Text template is a batch pipeline that reads Datastore entities and writes them to Cloud Storage as text files. You can provide a function to process each entity as a JSON string. If you don't provide such a function, every line in the output file will be a JSON-serialized entity.

Requirements for this pipeline:

Datastore must be set up in the project before running the pipeline.

Template parameters

Parameter Description
datastoreReadGqlQuery A GQL query that specifies which entities to grab. For example, SELECT * FROM MyKind.
datastoreReadProjectId The Google Cloud project ID of the Datastore instance that you want to read data from.
datastoreReadNamespace The namespace of the requested entities. To use the default namespace, leave this parameter blank.
javascriptTextTransformGcsPath (Optional) The Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use. For example, gs://my-bucket/my-udfs/my_file.js.
javascriptTextTransformFunctionName (Optional) The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
textWritePrefix The Cloud Storage path prefix to specify where the data is written. For example, gs://mybucket/somefolder/.

Running the Datastore to Cloud Storage Text template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Datastore to Text Files on Cloud Storage template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Datastore_to_GCS_Text \
    --region REGION_NAME \
    --parameters \
datastoreReadGqlQuery="SELECT * FROM DATASTORE_KIND",\
datastoreReadProjectId=DATASTORE_PROJECT_ID,\
datastoreReadNamespace=DATASTORE_NAMESPACE,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
textWritePrefix=gs://BUCKET_NAME/output/

Replace the following:

  • JOB_NAME: a unique job name of your choice
  • REGION_NAME: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • BUCKET_NAME: the name of your Cloud Storage bucket
  • DATASTORE_PROJECT_ID: the Cloud project ID where the Datastore instance exists
  • DATASTORE_KIND: the type of your Datastore entities
  • DATASTORE_NAMESPACE: the namespace of your Datastore entities
  • JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use

    For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.

  • PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use—for example, gs://my-bucket/my-udfs/my_file.js

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Datastore_to_GCS_Text
{
   "jobName": "JOB_NAME",
   "parameters": {
       "datastoreReadGqlQuery": "SELECT * FROM DATASTORE_KIND"
       "datastoreReadProjectId": "DATASTORE_PROJECT_ID",
       "datastoreReadNamespace": "DATASTORE_NAMESPACE",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
       "textWritePrefix": "gs://BUCKET_NAME/output/"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • LOCATION: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • BUCKET_NAME: the name of your Cloud Storage bucket
  • DATASTORE_PROJECT_ID: the Cloud project ID where the Datastore instance exists
  • DATASTORE_KIND: the type of your Datastore entities
  • DATASTORE_NAMESPACE: the namespace of your Datastore entities
  • JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use

    For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.

  • PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use—for example, gs://my-bucket/my-udfs/my_file.js

Firestore to Cloud Storage Text

The Firestore to Cloud Storage Text template is a batch pipeline that reads Firestore entities and writes them to Cloud Storage as text files. You can provide a function to process each entity as a JSON string. If you don't provide such a function, every line in the output file will be a JSON-serialized entity.

Requirements for this pipeline:

Firestore must be set up in the project before running the pipeline.

Template parameters

Parameter Description
firestoreReadGqlQuery A GQL query that specifies which entities to grab. For example, SELECT * FROM MyKind.
firestoreReadProjectId The Google Cloud project ID of the Firestore instance that you want to read data from.
firestoreReadNamespace The namespace of the requested entities. To use the default namespace, leave this parameter blank.
javascriptTextTransformGcsPath (Optional) The Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use. For example, gs://my-bucket/my-udfs/my_file.js.
javascriptTextTransformFunctionName (Optional) The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
textWritePrefix The Cloud Storage path prefix to specify where the data is written. For example, gs://mybucket/somefolder/.

Running the Firestore to Cloud Storage Text template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Firestore to Text Files on Cloud Storage template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Firestore_to_GCS_Text \
    --region REGION_NAME \
    --parameters \
firestoreReadGqlQuery="SELECT * FROM FIRESTORE_KIND",\
firestoreReadProjectId=FIRESTORE_PROJECT_ID,\
firestoreReadNamespace=FIRESTORE_NAMESPACE,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
textWritePrefix=gs://BUCKET_NAME/output/

Replace the following:

  • JOB_NAME: a unique job name of your choice
  • REGION_NAME: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • BUCKET_NAME: the name of your Cloud Storage bucket
  • FIRESTORE_PROJECT_ID: the Cloud project ID where the Firestore instance exists
  • FIRESTORE_KIND: the type of your Firestore entities
  • FIRESTORE_NAMESPACE: the namespace of your Firestore entities
  • JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use

    For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.

  • PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use—for example, gs://my-bucket/my-udfs/my_file.js
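
For example, an illustrative export of all entities of a hypothetical kind named Task, with the project and bucket names as placeholders; the optional UDF parameters are omitted, and the namespace parameter is left unset to read from the default namespace:

gcloud dataflow jobs run firestore-to-text-example \
    --gcs-location gs://dataflow-templates/latest/Firestore_to_GCS_Text \
    --region us-central1 \
    --parameters \
firestoreReadGqlQuery="SELECT * FROM Task",\
firestoreReadProjectId=my-project,\
textWritePrefix=gs://my-bucket/output/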

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Firestore_to_GCS_Text
{
   "jobName": "JOB_NAME",
   "parameters": {
       "firestoreReadGqlQuery": "SELECT * FROM FIRESTORE_KIND"
       "firestoreReadProjectId": "FIRESTORE_PROJECT_ID",
       "firestoreReadNamespace": "FIRESTORE_NAMESPACE",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
       "textWritePrefix": "gs://BUCKET_NAME/output/"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • LOCATION: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • BUCKET_NAME: the name of your Cloud Storage bucket
  • FIRESTORE_PROJECT_ID: the Cloud project ID where the Firestore instance exists
  • FIRESTORE_KIND: the type of your Firestore entities
  • FIRESTORE_NAMESPACE: the namespace of your Firestore entities
  • JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use

    For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.

  • PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use—for example, gs://my-bucket/my-udfs/my_file.js

Cloud Spanner to Cloud Storage Avro

The Cloud Spanner to Avro Files on Cloud Storage template is a batch pipeline that exports a whole Cloud Spanner database to Cloud Storage in Avro format. Exporting a Cloud Spanner database creates a folder in the bucket you select. The folder contains:

  • A spanner-export.json file.
  • A TableName-manifest.json file for each table in the database you exported.
  • One or more TableName.avro-#####-of-##### files.

For example, exporting a database with two tables, Singers and Albums, creates the following file set:

  • Albums-manifest.json
  • Albums.avro-00000-of-00002
  • Albums.avro-00001-of-00002
  • Singers-manifest.json
  • Singers.avro-00000-of-00003
  • Singers.avro-00001-of-00003
  • Singers.avro-00002-of-00003
  • spanner-export.json

Requirements for this pipeline:

  • The Cloud Spanner database must exist.
  • The output Cloud Storage bucket must exist.
  • In addition to the IAM roles necessary to run Dataflow jobs, you must also have the appropriate IAM roles for reading your Cloud Spanner data and writing to your Cloud Storage bucket.

Template parameters

Parameter Description
instanceId The instance ID of the Cloud Spanner database that you want to export.
databaseId The database ID of the Cloud Spanner database that you want to export.
outputDir The Cloud Storage path you want to export Avro files to. The export job creates a new directory under this path that contains the exported files.
snapshotTime (Optional) The timestamp that corresponds to the version of the Cloud Spanner database that you want to read. The timestamp must be specified as per RFC 3339 UTC "Zulu" format. For example, 1990-12-31T23:59:60Z. The timestamp must be in the past and Maximum timestamp staleness applies.
tableNames (Optional) A comma-separated list of tables specifying the subset of the Cloud Spanner database to export. The list must include all related tables (parent tables and foreign key referenced tables). If related tables are not explicitly listed, set the shouldExportRelatedTables flag for a successful export.
shouldExportRelatedTables (Optional) A flag used in conjunction with the tableNames parameter to include all related tables in the export.
spannerProjectId (Optional) The Google Cloud Project ID of the Cloud Spanner database that you want to read data from.

Running the Cloud Spanner to Avro Files on Cloud Storage template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.

    For the job to show up in the Spanner Instances page of the Google Cloud console, the job name must match the following format:

    cloud-spanner-export-SPANNER_INSTANCE_ID-SPANNER_DATABASE_NAME

    Replace the following:

    • SPANNER_INSTANCE_ID: your Spanner instance's ID
    • SPANNER_DATABASE_NAME: your Spanner database's name
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Cloud Spanner to Avro Files on Cloud Storage template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro \
    --region REGION_NAME \
    --staging-location GCS_STAGING_LOCATION \
    --parameters \
instanceId=INSTANCE_ID,\
databaseId=DATABASE_ID,\
outputDir=GCS_DIRECTORY

Replace the following:

  • JOB_NAME: a unique job name of your choice

    For the job to show in the Cloud Spanner portion of the Google Cloud console, the job name must match the format cloud-spanner-export-INSTANCE_ID-DATABASE_ID.

  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • REGION_NAME: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • GCS_STAGING_LOCATION: the path for writing temporary files; for example, gs://mybucket/temp
  • INSTANCE_ID: your Cloud Spanner instance ID
  • DATABASE_ID: your Cloud Spanner database ID
  • GCS_DIRECTORY: the Cloud Storage path that the Avro files are exported to
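
For example, an illustrative export of the placeholder database my-database in instance my-instance; the job name follows the cloud-spanner-export-INSTANCE_ID-DATABASE_ID convention so that the job also appears in the Cloud Spanner portion of the console:

gcloud dataflow jobs run cloud-spanner-export-my-instance-my-database \
    --gcs-location gs://dataflow-templates/latest/Cloud_Spanner_to_GCS_Avro \
    --region us-central1 \
    --staging-location gs://my-bucket/temp \
    --parameters \
instanceId=my-instance,\
databaseId=my-database,\
outputDir=gs://my-bucket/spanner-export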

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro
{
   "jobName": "JOB_NAME",
   "parameters": {
       "instanceId": "INSTANCE_ID",
       "databaseId": "DATABASE_ID",
       "outputDir": "gs://GCS_DIRECTORY"
   }
}

Replace the following:

  • PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice

    For the job to show in the Cloud Spanner portion of the Google Cloud console, the job name must match the format cloud-spanner-export-INSTANCE_ID-DATABASE_ID.

  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • LOCATION: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • GCS_STAGING_LOCATION: the path for writing temporary files; for example, gs://mybucket/temp
  • INSTANCE_ID: your Cloud Spanner instance ID
  • DATABASE_ID: your Cloud Spanner database ID
  • GCS_DIRECTORY: the Cloud Storage path that the Avro files are exported to

Cloud Spanner to Cloud Storage Text

The Cloud Spanner to Cloud Storage Text template is a batch pipeline that reads data from a Cloud Spanner table and writes it to Cloud Storage as CSV text files.

Requirements for this pipeline:

  • The input Spanner table must exist before running the pipeline.

Template parameters

Parameter Description
spannerProjectId The Google Cloud Project ID of the Cloud Spanner database that you want to read data from.
spannerDatabaseId The database ID of the requested table.
spannerInstanceId The instance ID of the requested table.
spannerTable The table to read the data from.
textWritePrefix The directory where output text files are written. Add / at the end. For example, gs://mybucket/somefolder/.
spannerSnapshotTime (Optional) The timestamp that corresponds to the version of the Cloud Spanner database that you want to read. The timestamp must be specified as per RFC 3339 UTC "Zulu" format. For example, 1990-12-31T23:59:60Z. The timestamp must be in the past and Maximum timestamp staleness applies.

Running the Cloud Spanner to Cloud Storage Text template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Cloud Spanner to Text Files on Cloud Storage template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Spanner_to_GCS_Text \
    --region REGION_NAME \
    --parameters \
spannerProjectId=SPANNER_PROJECT_ID,\
spannerDatabaseId=DATABASE_ID,\
spannerInstanceId=INSTANCE_ID,\
spannerTable=TABLE_ID,\
textWritePrefix=gs://BUCKET_NAME/output/

Replace the following:

  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • REGION_NAME: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • SPANNER_PROJECT_ID: the Cloud project ID of the Spanner database from which you want to read data
  • DATABASE_ID: the Spanner database ID
  • BUCKET_NAME: the name of your Cloud Storage bucket
  • INSTANCE_ID: the Spanner instance ID
  • TABLE_ID: the Spanner table ID
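
For example, a filled-in command with hypothetical values, including the optional spannerSnapshotTime parameter, might look like the following. The snapshot timestamp must be in the past and within the maximum staleness window, as noted in the parameter table.

gcloud dataflow jobs run spanner-to-text-example \
    --gcs-location gs://dataflow-templates/latest/Spanner_to_GCS_Text \
    --region us-central1 \
    --parameters \
spannerProjectId=my-spanner-project,\
spannerDatabaseId=my-database,\
spannerInstanceId=my-instance,\
spannerTable=Singers,\
spannerSnapshotTime=2021-10-01T12:00:00Z,\
textWritePrefix=gs://my-bucket/output/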

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Spanner_to_GCS_Text
{
   "jobName": "JOB_NAME",
   "parameters": {
       "spannerProjectId": "SPANNER_PROJECT_ID",
       "spannerDatabaseId": "DATABASE_ID",
       "spannerInstanceId": "INSTANCE_ID",
       "spannerTable": "TABLE_ID",
       "textWritePrefix": "gs://BUCKET_NAME/output/"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • LOCATION: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • SPANNER_PROJECT_ID: the Cloud project ID of the Spanner database from which you want to read data
  • DATABASE_ID: the Spanner database ID
  • BUCKET_NAME: the name of your Cloud Storage bucket
  • INSTANCE_ID: the Spanner instance ID
  • TABLE_ID: the Spanner table ID

Cloud Storage Avro to Bigtable

The Cloud Storage Avro to Bigtable template is a pipeline that reads data from Avro files in a Cloud Storage bucket and writes the data to a Bigtable table. You can use the template to copy data from Cloud Storage to Bigtable.

Requirements for this pipeline:

  • The Bigtable table must exist and have the same column families as exported in the Avro files.
  • The input Avro files must exist in a Cloud Storage bucket before running the pipeline.
  • Bigtable expects a specific schema from the input Avro files.

Template parameters

Parameter Description
bigtableProjectId The ID of the Google Cloud project of the Bigtable instance that you want to write data to.
bigtableInstanceId The ID of the Bigtable instance that contains the table.
bigtableTableId The ID of the Bigtable table to import.
inputFilePattern The Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.

Running the Cloud Storage Avro file to Bigtable template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Avro Files on Cloud Storage to Cloud Bigtable template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Bigtable \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
inputFilePattern=INPUT_FILE_PATTERN

Replace the following:

  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • REGION_NAME: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to import
  • INPUT_FILE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Bigtable
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "inputFilePattern": "INPUT_FILE_PATTERN"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • LOCATION: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to import
  • INPUT_FILE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*

Cloud Storage Avro to Cloud Spanner

The Cloud Storage Avro files to Cloud Spanner template is a batch pipeline that reads Avro files that were exported from Cloud Spanner and stored in Cloud Storage, and imports them into a Cloud Spanner database.

Requirements for this pipeline:

  • The target Cloud Spanner database must exist and must be empty.
  • You must have read permissions for the Cloud Storage bucket and write permissions for the target Cloud Spanner database.
  • The input Cloud Storage path must exist, and it must include a spanner-export.json file that contains a JSON description of files to import.
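
The spanner-export.json manifest is created automatically by the Cloud Spanner to Cloud Storage Avro export template, so you normally do not write it by hand. Purely as an illustrative sketch (the authoritative file is the one produced by the export), it maps each exported table to its Avro data files, along these lines:

{
  "tables": [
    {
      "name": "Singers",
      "dataFiles": ["Singers.avro-00000-of-00002", "Singers.avro-00001-of-00002"]
    }
  ]
}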

Template parameters

Parameter Description
instanceId The instance ID of the Cloud Spanner database.
databaseId The database ID of the Cloud Spanner database.
inputDir The Cloud Storage path where the Avro files are imported from.

Running the Cloud Storage Avro to Cloud Spanner template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.

    For the job to show up in the Spanner Instances page of the Google Cloud console, the job name must match the following format:

    cloud-spanner-import-SPANNER_INSTANCE_ID-SPANNER_DATABASE_NAME

    Replace the following:

    • SPANNER_INSTANCE_ID: your Spanner instance's ID
    • SPANNER_DATABASE_NAME: your Spanner database's name
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Avro Files on Cloud Storage to Cloud Spanner template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner \
    --region REGION_NAME \
    --staging-location GCS_STAGING_LOCATION \
    --parameters \
instanceId=INSTANCE_ID,\
databaseId=DATABASE_ID,\
inputDir=GCS_DIRECTORY

Replace the following:

  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • REGION_NAME: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • GCS_STAGING_LOCATION: the path for writing temporary files; for example, gs://mybucket/temp
  • INSTANCE_ID: the ID of the Spanner instance that contains the database
  • DATABASE_ID: the ID of the Spanner database to import to
  • GCS_DIRECTORY: the Cloud Storage path where the Avro files are imported from, for example, gs://mybucket/somefolder
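
As an illustration, a command with hypothetical values, using a job name that follows the cloud-spanner-import-SPANNER_INSTANCE_ID-SPANNER_DATABASE_NAME convention described in the console steps so that the job appears on the Spanner Instances page, might look like this:

gcloud dataflow jobs run cloud-spanner-import-my-instance-my-database \
    --gcs-location gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Spanner \
    --region us-central1 \
    --staging-location gs://my-bucket/temp \
    --parameters \
instanceId=my-instance,\
databaseId=my-database,\
inputDir=gs://my-bucket/spanner-export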

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner
{
   "jobName": "JOB_NAME",
   "parameters": {
       "instanceId": "INSTANCE_ID",
       "databaseId": "DATABASE_ID",
       "inputDir": "gs://GCS_DIRECTORY"
   },
   "environment": {
       "machineType": "n1-standard-2"
   }
}

Replace the following:

  • PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • LOCATION: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • INSTANCE_ID: the ID of the Spanner instance that contains the database
  • DATABASE_ID: the ID of the Spanner database to import to
  • GCS_DIRECTORY: the Cloud Storage path where the Avro files are imported from, for example, gs://mybucket/somefolder

Cloud Storage Parquet to Bigtable

The Cloud Storage Parquet to Bigtable template is a pipeline that reads data from Parquet files in a Cloud Storage bucket and writes the data to a Bigtable table. You can use the template to copy data from Cloud Storage to Bigtable.

Requirements for this pipeline:

  • The Bigtable table must exist and have the same column families as exported in the Parquet files.
  • The input Parquet files must exist in a Cloud Storage bucket before running the pipeline.
  • Bigtable expects a specific schema from the input Parquet files.

Template parameters

Parameter Description
bigtableProjectId The ID of the Google Cloud project of the Bigtable instance that you want to write data to.
bigtableInstanceId The ID of the Bigtable instance that contains the table.
bigtableTableId The ID of the Bigtable table to import.
inputFilePattern The Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.

Running the Cloud Storage Parquet file to Bigtable template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Parquet Files on Cloud Storage to Cloud Bigtable template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Parquet_to_Cloud_Bigtable \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
inputFilePattern=INPUT_FILE_PATTERN

Replace the following:

  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • REGION_NAME: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to import
  • INPUT_FILE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Parquet_to_Cloud_Bigtable
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "inputFilePattern": "INPUT_FILE_PATTERN"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • LOCATION: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to import
  • INPUT_FILE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*

Cloud Storage SequenceFile to Bigtable

The Cloud Storage SequenceFile to Bigtable template is a pipeline that reads data from SequenceFiles in a Cloud Storage bucket and writes the data to a Bigtable table. You can use the template to copy data from Cloud Storage to Bigtable.

Requirements for this pipeline:

  • The Bigtable table must exist.
  • The input SequenceFiles must exist in a Cloud Storage bucket before running the pipeline.
  • The input SequenceFiles must have been exported from Bigtable or HBase.

Template parameters

Parameter Description
bigtableProject The ID of the Google Cloud project of the Bigtable instance that you want to write data to.
bigtableInstanceId The ID of the Bigtable instance that contains the table.
bigtableTableId The ID of the Bigtable table to import.
bigtableAppProfileId The ID of the Bigtable application profile to be used for the import. If you do not specify an app profile, Bigtable uses the instance's default app profile.
sourcePattern The Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.

Running the Cloud Storage SequenceFile to Bigtable template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the SequenceFile Files on Cloud Storage to Cloud Bigtable template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_SequenceFile_to_Cloud_Bigtable \
    --region REGION_NAME \
    --parameters \
bigtableProject=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
bigtableAppProfileId=APPLICATION_PROFILE_ID,\
sourcePattern=SOURCE_PATTERN

Replace the following:

  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • REGION_NAME: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to import
  • APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to be used for the import
  • SOURCE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*
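
As a concrete sketch with hypothetical project, instance, table, app profile, and bucket names:

gcloud dataflow jobs run sequencefile-to-bigtable-example \
    --gcs-location gs://dataflow-templates/latest/GCS_SequenceFile_to_Cloud_Bigtable \
    --region us-central1 \
    --parameters \
bigtableProject=my-bigtable-project,\
bigtableInstanceId=my-instance,\
bigtableTableId=my-table,\
bigtableAppProfileId=my-app-profile,\
sourcePattern=gs://my-bucket/sequencefiles/part-*

If you omit bigtableAppProfileId, Bigtable uses the instance's default app profile, as noted in the parameter table.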

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_SequenceFile_to_Cloud_Bigtable
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProject": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "bigtableAppProfileId": "APPLICATION_PROFILE_ID",
       "sourcePattern": "SOURCE_PATTERN"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • LOCATION: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to import
  • APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to be used for the import
  • SOURCE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*

Cloud Storage Text to BigQuery

The Cloud Storage Text to BigQuery template is a batch pipeline that reads text files stored in Cloud Storage, transforms them using a JavaScript user-defined function (UDF) that you provide, and appends the result to a BigQuery table.

Requirements for this pipeline:

  • Create a JSON file that describes your BigQuery schema.

    Ensure that there is a top-level JSON array named BigQuery Schema and that each of its entries follows the pattern {"name": "COLUMN_NAME", "type": "DATA_TYPE"}.

    The Cloud Storage Text to BigQuery batch template doesn't support importing data into STRUCT (Record) fields in the target BigQuery table.

    The following JSON describes an example BigQuery schema:

    {
      "BigQuery Schema": [
        {
          "name": "location",
          "type": "STRING"
        },
        {
          "name": "name",
          "type": "STRING"
        },
        {
          "name": "age",
          "type": "STRING"
        },
        {
          "name": "color",
          "type": "STRING"
        },
        {
          "name": "coffee",
          "type": "STRING"
        }
      ]
    }
    
  • Create a JavaScript (.js) file containing the user-defined function (UDF) that supplies the logic to transform the lines of text. Your function must return a JSON string.

    For example, this function splits each line of a CSV file and returns a JSON string after transforming the values.

    function transform(line) {
      // Split the CSV line into its individual field values.
      var values = line.split(',');

      // Map each value to the corresponding column in the BigQuery schema.
      var obj = new Object();
      obj.location = values[0];
      obj.name = values[1];
      obj.age = values[2];
      obj.color = values[3];
      obj.coffee = values[4];

      // Return the row as a JSON string.
      var jsonString = JSON.stringify(obj);
      return jsonString;
    }
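
    For instance, given the hypothetical input line Seattle,Alice,30,blue,latte, the function returns the following JSON string, which matches the example schema above:

    {"location":"Seattle","name":"Alice","age":"30","color":"blue","coffee":"latte"}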

Template parameters

Parameter Description
javascriptTextTransformFunctionName The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
JSONPath The gs:// path to the JSON file that defines your BigQuery schema, stored in Cloud Storage. For example, gs://path/to/my/schema.json.
javascriptTextTransformGcsPath The Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use. For example, gs://my-bucket/my-udfs/my_file.js.
inputFilePattern The gs:// path to the text in Cloud Storage you'd like to process. For example, gs://path/to/my/text/data.txt.
outputTable The BigQuery table name you want to create to store your processed data in. If you reuse an existing BigQuery table, the data is appended to the destination table. For example, my-project-name:my-dataset.my-table.
bigQueryLoadingTemporaryDirectory The temporary directory for the BigQuery loading process. For example, gs://my-bucket/my-files/temp_dir.

Running the Cloud Storage Text to BigQuery template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to BigQuery (Batch) template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Text_to_BigQuery \
    --region REGION_NAME \
    --parameters \
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
JSONPath=PATH_TO_BIGQUERY_SCHEMA_JSON,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
inputFilePattern=PATH_TO_TEXT_DATA,\
outputTable=BIGQUERY_TABLE,\
bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS

Replace the following:

  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • REGION_NAME: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use

    For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.

  • PATH_TO_BIGQUERY_SCHEMA_JSON: the Cloud Storage path to the JSON file containing the schema definition
  • PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use—for example, gs://my-bucket/my-udfs/my_file.js
  • PATH_TO_TEXT_DATA: your Cloud Storage path to your text dataset
  • BIGQUERY_TABLE: your BigQuery table name
  • PATH_TO_TEMP_DIR_ON_GCS: your Cloud Storage path to the temp directory

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Text_to_BigQuery
{
   "jobName": "JOB_NAME",
   "parameters": {
       "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
       "JSONPath": "PATH_TO_BIGQUERY_SCHEMA_JSON",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "inputFilePattern":"PATH_TO_TEXT_DATA",
       "outputTable":"BIGQUERY_TABLE",
       "bigQueryLoadingTemporaryDirectory": "PATH_TO_TEMP_DIR_ON_GCS"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

    • latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/
    • the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/
  • LOCATION: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use

    For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.

  • PATH_TO_BIGQUERY_SCHEMA_JSON: the Cloud Storage path to the JSON file containing the schema definition
  • PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use—for example, gs://my-bucket/my-udfs/my_file.js
  • PATH_TO_TEXT_DATA: your Cloud Storage path to your text dataset
  • BIGQUERY_TABLE: your BigQuery table name
  • PATH_TO_TEMP_DIR_ON_GCS: your Cloud Storage path to the temp directory

Cloud Storage Text to Datastore [Deprecated]

This template is deprecated and will be removed in Q1 2022. Migrate to the Cloud Storage Text to Firestore template instead.

The Cloud Storage Text to Datastore template is a batch pipeline that reads from text files stored in Cloud Storage and writes JSON-encoded entities to Datastore. Each line in the input text files must be in the specified JSON format.

Requirements for this pipeline:

  • Datastore must be enabled in the destination project.

Template parameters

Parameter Description
textReadPattern A Cloud Storage path pattern that specifies the location of your text data files. For example, gs://mybucket/somepath/*.json.
javascriptTextTransformGcsPath (Optional) The Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use. For example, gs://my-bucket/my-udfs/my_file.js.
javascriptTextTransformFunctionName (Optional) The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
datastoreWriteProjectId The ID of the Google Cloud project where the Datastore entities are written.
datastoreHintNumWorkers (Optional) Hint for the expected number of workers in the Datastore ramp-up throttling step. The default is 500.
errorWritePath The error log output file to use for write failures that occur during processing. For example, gs://bucket-name/errors.txt.

Running the Cloud Storage Text to Datastore template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to Datastore template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template: