Google-provided batch templates

Google provides a set of open-source Cloud Dataflow templates. For general information about templates, see the Overview page. For a list of all Google-provided templates, see the Get started with Google-provided templates page.

This page documents batch templates:

Cloud Bigtable to Cloud Storage Avro

The Cloud Bigtable to Cloud Storage Avro template is a pipeline that reads data from a Cloud Bigtable table and writes it to a Cloud Storage bucket in Avro format. You can use the template to move data from Cloud Bigtable to Cloud Storage.

Requirements for this pipeline:

  • The Cloud Bigtable table must exist.
  • The output Cloud Storage bucket must exist prior to running the pipeline.
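
If the destination bucket does not exist yet, you can create one with gsutil before launching the job. This is a minimal sketch; the bucket name, project, and location below are placeholders rather than values tied to this template:

# Create a bucket to hold the exported Avro files (all names are hypothetical).
gsutil mb -p my-project -l us-central1 gs://my-export-bucket/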

Template parameters

Parameter Description
bigtableProjectId The ID of the GCP project of the Cloud Bigtable instance that you want to read data from.
bigtableInstanceId The ID of the Cloud Bigtable instance that contains the table.
bigtableTableId The ID of the Cloud Bigtable table to export.
outputDirectory Cloud Storage path where data should be written. For example, gs://mybucket/somefolder.
filenamePrefix The prefix of the Avro file name. For example, output-.

Running the Cloud Bigtable to Cloud Storage Avro template

CONSOLE

Run from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click Create job from template.
  3. Select the Cloud Bigtable to Cloud Storage Avro template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/

You must replace the following values in this example:

  • Replace [YOUR_PROJECT_ID] with your project ID.
  • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace [PROJECT_ID] with the ID of the GCP project of the Cloud Bigtable instance that you want to read data from.
  • Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
  • Replace [TABLE_ID] with the ID of the Cloud Bigtable table to export.
  • Replace [OUTPUT_DIRECTORY] with Cloud Storage path where data should be written. For example, gs://mybucket/somefolder.
  • Replace [FILENAME_PREFIX] with the prefix of the Avro file name. For example, output-.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location gs://dataflow-templates/latest/ \
    --parameters bigtableProjectId=[PROJECT_ID],bigtableInstanceId=[INSTANCE_ID],bigtableTableId=[TABLE_ID],outputDirectory=[OUTPUT_DIRECTORY],filenamePrefix=[FILENAME_PREFIX]

API

Run from the REST API

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

  • Replace [YOUR_PROJECT_ID] with your project ID.
  • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace [PROJECT_ID] with the ID of the GCP project of the Cloud Bigtable instance that you want to read data from.
  • Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
  • Replace [TABLE_ID] with the ID of the Cloud Bigtable table to export.
  • Replace [OUTPUT_DIRECTORY] with Cloud Storage path where data should be written. For example, gs://mybucket/somefolder.
  • Replace [FILENAME_PREFIX] with the prefix of the Avro file name. For example, output-.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/
{
   "jobName": "[JOB_NAME]",
   "parameters": {
       "bigtableProjectId": "[PROJECT_ID]",
       "bigtableInstanceId": "[INSTANCE_ID]",
       "bigtableTableId": "[TABLE_ID]",
       "outputDirectory": "[OUTPUT_DIRECTORY]",
       "filenamePrefix": "[FILENAME_PREFIX]",
   },
   "environment": { "zone": "us-central1-f" }
}
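
The JSON body above can be sent with any HTTP client. As one hedged illustration (assuming the gcloud CLI is installed and authenticated, and that request.json is a local file containing the body shown above), a curl invocation might look like this:

# Launch the template by POSTing the JSON body saved in request.json (file name is hypothetical).
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @request.json \
    "https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/"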

Cloud Bigtable to Cloud Storage SequenceFile

The Cloud Bigtable to Cloud Storage SequenceFile template is a pipeline that reads data from a Cloud Bigtable table and writes the data to a Cloud Storage bucket in SequenceFile format. You can use the template to copy data from Cloud Bigtable to Cloud Storage.

Requirements for this pipeline:

  • The Cloud Bigtable table must exist.
  • The output Cloud Storage bucket must exist prior to running the pipeline.

Template parameters

Parameter Description
bigtableProject The ID of the GCP project of the Cloud Bigtable instance that you want to read data from.
bigtableInstanceId The ID of the Cloud Bigtable instance that contains the table.
bigtableTableId The ID of the Cloud Bigtable table to export.
bigtableAppProfileId The ID of the Cloud Bigtable application profile to be used for the export. If you do not specify an app profile, Cloud Bigtable uses the instance's default app profile.
destinationPath Cloud Storage path where data should be written. For example, gs://mybucket/somefolder.
filenamePrefix The prefix of the SequenceFile file name. For example, output-.

Running the Cloud Bigtable to Cloud Storage SequenceFile template

CONSOLE

Run from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click Create job from template.
  3. Select the Cloud Bigtable to Cloud Storage SequenceFile template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/

You must replace the following values in this example:

  • Replace [YOUR_PROJECT_ID] with your project ID.
  • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace [PROJECT_ID] with the ID of the GCP project of the Cloud Bigtable instance that you want to read data from.
  • Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
  • Replace [TABLE_ID] with the ID of the Cloud Bigtable table to export.
  • Replace [APPLICATION_PROFILE_ID] with the ID of the Cloud Bigtable application profile to be used for the export.
  • Replace [DESTINATION_PATH] with Cloud Storage path where data should be written. For example, gs://mybucket/somefolder.
  • Replace [FILENAME_PREFIX] with the prefix of the SequenceFile file name. For example, output-.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location gs://dataflow-templates/latest/ \
    --parameters bigtableProject=[PROJECT_ID],bigtableInstanceId=[INSTANCE_ID],bigtableTableId=[TABLE_ID],bigtableAppProfileId=[APPLICATION_PROFILE_ID],destinationPath=[DESTINATION_PATH],filenamePrefix=[FILENAME_PREFIX]

API

Run from the REST API

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

  • Replace [YOUR_PROJECT_ID] with your project ID.
  • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace [PROJECT_ID] with the ID of the GCP project of the Cloud Bigtable instance that you want to read data from.
  • Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
  • Replace [TABLE_ID] with the ID of the Cloud Bigtable table to export.
  • Replace [APPLICATION_PROFILE_ID] with the ID of the Cloud Bigtable application profile to be used for the export.
  • Replace [DESTINATION_PATH] with Cloud Storage path where data should be written. For example, gs://mybucket/somefolder.
  • Replace [FILENAME_PREFIX] with the prefix of the SequenceFile file name. For example, output-.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/
{
   "jobName": "[JOB_NAME]",
   "parameters": {
       "bigtableProject": "[PROJECT_ID]",
       "bigtableInstanceId": "[INSTANCE_ID]",
       "bigtableTableId": "[TABLE_ID]",
       "bigtableAppProfileId": "[APPLICATION_PROFILE_ID]",
       "destinationPath": "[DESTINATION_PATH]",
       "filenamePrefix": "[FILENAME_PREFIX]",
   },
   "environment": { "zone": "us-central1-f" }
}

Cloud Datastore to Cloud Storage Text

The Cloud Datastore to Cloud Storage Text template is a batch pipeline that reads Cloud Datastore entities and writes them to Cloud Storage as text files. You can provide a function to process each entity as a JSON string. If you don't provide such a function, every line in the output file will be a JSON-serialized entity.

Requirements for this pipeline:

  • Cloud Datastore must be set up in the project prior to running the pipeline.

Template parameters

Parameter Description
datastoreReadGqlQuery A GQL query that specifies which entities to grab. For example, SELECT * FROM MyKind.
datastoreReadProjectId The GCP project ID of the Cloud Datastore instance that you want to read data from.
datastoreReadNamespace The namespace of the requested entities. To use the default namespace, leave this parameter blank.
javascriptTextTransformGcsPath A Cloud Storage path that contains all your JavaScript code. For example, gs://mybucket/mytransforms/*.js. If you don't want to provide a function, leave this parameter blank.
javascriptTextTransformFunctionName Name of the JavaScript function to be called. For example, if your JavaScript function is function myTransform(inJson) { ...dostuff...} then the function name is myTransform. If you don't want to provide a function, leave this parameter blank.
textWritePrefix The Cloud Storage path prefix to specify where the data should be written. For example, gs://mybucket/somefolder/.

Running the Cloud Datastore to Cloud Storage Text template

CONSOLE

Run from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click Create job from template.
  3. Select the Cloud Datastore to Cloud Storage Text template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Datastore_to_GCS_Text

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
  • Replace YOUR_DATASTORE_KIND with the kind of your Datastore entities.
  • Replace YOUR_DATASTORE_NAMESPACE with the namespace of your Datastore entities.
  • Replace YOUR_JAVASCRIPT_FUNCTION with your JavaScript function name.
  • Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the .js file containing your JavaScript code.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Datastore_to_GCS_Text \
    --parameters \
datastoreReadGqlQuery="SELECT * FROM YOUR_DATASTORE_KIND",\
datastoreReadProjectId=YOUR_PROJECT_ID,\
datastoreReadNamespace=YOUR_DATASTORE_NAMESPACE,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=YOUR_JAVASCRIPT_FUNCTION,\
textWritePrefix=gs://YOUR_BUCKET_NAME/output/
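
For illustration only, here is the same command with hypothetical values filled in (the project, kind, namespace, bucket, and UDF names below are invented placeholders, not values from this page):

gcloud dataflow jobs run datastore-export-20190101 \
    --gcs-location gs://dataflow-templates/latest/Datastore_to_GCS_Text \
    --parameters \
datastoreReadGqlQuery="SELECT * FROM Users",\
datastoreReadProjectId=my-project,\
datastoreReadNamespace=my-namespace,\
javascriptTextTransformGcsPath=gs://my-bucket/udfs/entity_to_json.js,\
javascriptTextTransformFunctionName=transformEntity,\
textWritePrefix=gs://my-bucket/output/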

API

Run from the REST API

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Datastore_to_GCS_Text

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
  • Replace YOUR_DATASTORE_KIND with the kind of your Datastore entities.
  • Replace YOUR_DATASTORE_NAMESPACE with the namespace of your Datastore entities.
  • Replace YOUR_JAVASCRIPT_FUNCTION with your JavaScript function name.
  • Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the .js file containing your JavaScript code.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Datastore_to_GCS_Text
{
   "jobName": "JOB_NAME",
   "parameters": {
       "datastoreReadGqlQuery": "SELECT * FROM YOUR_DATASTORE_KIND"
       "datastoreReadProjectId": "YOUR_PROJECT_ID",
       "datastoreReadNamespace": "YOUR_DATASTORE_NAMESPACE",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "javascriptTextTransformFunctionName": "YOUR_JAVASCRIPT_FUNCTION",
       "textWritePrefix": "gs://YOUR_BUCKET_NAME/output/"
   },
   "environment": { "zone": "us-central1-f" }
}

Cloud Spanner to Cloud Storage Avro

The Cloud Spanner to Cloud Storage Avro template is a batch pipeline that exports a whole Cloud Spanner database to Cloud Storage in Avro format. Exporting a Cloud Spanner database creates a folder in the bucket you select. The folder contains:

  • A spanner-export.json file.
  • A TableName-manifest.json file for each table in the database you exported.
  • One or more TableName.avro-#####-of-##### files.

For example, exporting a database with two tables, Singers and Albums, creates the following file set:

  • Albums-manifest.json
  • Albums.avro-00000-of-00002
  • Albums.avro-00001-of-00002
  • Singers-manifest.json
  • Singers.avro-00000-of-00003
  • Singers.avro-00001-of-00003
  • Singers.avro-00002-of-00003
  • spanner-export.json
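
After an export job finishes, you can confirm that the expected files were written by listing the output folder with gsutil; the bucket and folder names below are placeholders:

# List the contents of the export folder created by the job (path is hypothetical).
gsutil ls gs://my-bucket/my-export-folder/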

Requirements for this pipeline:

  • The Cloud Spanner database must exist.
  • The output Cloud Storage bucket must exist.
  • In addition to the Cloud IAM roles necessary to run Cloud Dataflow jobs, you must also have the appropriate Cloud IAM roles for reading your Cloud Spanner data and writing to your Cloud Storage bucket.

Template parameters

Parameter Description
instanceId The instance ID of the Cloud Spanner database that you want to export.
databaseId The database ID of the Cloud Spanner database that you want to export.
outputDir The Cloud Storage path you want to export Avro files to. The export job creates a new directory under this path that contains the exported files.

Running the template

CONSOLE

Run from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click Create job from template.
  3. Select the Cloud Spanner to Cloud Storage Avro template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field.
    • Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • The job name must match the format cloud-spanner-export-[YOUR_INSTANCE_ID]-[YOUR_DATABASE_ID] to show up in the Cloud Spanner portion of the GCP Console.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro

You must replace the following values in this example:

  • Replace [DATAFLOW_REGION] with the region where you want the Cloud Dataflow job to run (such as us-central1).
  • Replace [YOUR_INSTANCE_ID] with your Cloud Spanner instance ID.
  • Replace [YOUR_DATABASE_ID] with your Cloud Spanner database ID.
  • Replace [YOUR_GCS_DIRECTORY] with the Cloud Storage path that the Avro files should be exported to.
  • Replace [JOB_NAME] with a job name of your choice.
    • The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • The job name must match the format cloud-spanner-export-[YOUR_INSTANCE_ID]-[YOUR_DATABASE_ID] to show up in the Cloud Spanner portion of the GCP Console.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location='gs://dataflow-templates/[VERSION]/Cloud_Spanner_to_GCS_Avro' \
    --region=[DATAFLOW_REGION] \
    --parameters='instanceId=[YOUR_INSTANCE_ID],databaseId=[YOUR_DATABASE_ID],outputDir=[YOUR_GCS_DIRECTORY]'
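
As a sketch with hypothetical values substituted (the instance my-instance, database mydb, region, and bucket below are placeholders), note how the job name follows the cloud-spanner-export-[YOUR_INSTANCE_ID]-[YOUR_DATABASE_ID] convention so the job appears in the Cloud Spanner portion of the GCP Console:

gcloud dataflow jobs run cloud-spanner-export-my-instance-mydb \
    --gcs-location='gs://dataflow-templates/latest/Cloud_Spanner_to_GCS_Avro' \
    --region=us-central1 \
    --parameters='instanceId=my-instance,databaseId=mydb,outputDir=gs://my-bucket'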

API

Run from the REST API

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro

Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

  • Replace [YOUR_PROJECT_ID] with your project ID.
  • Replace [DATAFLOW_REGION] with the region where you want the Cloud Dataflow job to run (such as us-central1).
  • Replace [YOUR_INSTANCE_ID] with your Cloud Spanner instance ID.
  • Replace [YOUR_DATABASE_ID] with your Cloud Spanner database ID.
  • Replace [YOUR_GCS_DIRECTORY] with the Cloud Storage path that the Avro files should be exported to.
  • Replace [JOB_NAME] with a job name of your choice.
    • The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • The job name must match the format cloud-spanner-export-[YOUR_INSTANCE_ID]-[YOUR_DATABASE_ID] to show up in the Cloud Spanner portion of the GCP Console.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/locations/[DATAFLOW_REGION]/templates:launch?gcsPath=gs://dataflow-templates/[VERSION]/Cloud_Spanner_to_GCS_Avro
{
   "jobName": "[JOB_NAME]",
   "parameters": {
       "instanceId": "[YOUR_INSTANCE_ID]",
       "databaseId": "[YOUR_DATABASE_ID]",
       "outputDir": "gs://[YOUR_GCS_DIRECTORY]"
   }
}

Cloud Spanner to Cloud Storage Text

The Cloud Spanner to Cloud Storage Text template is a batch pipeline that reads data from a Cloud Spanner table, optionally transforms the data via a JavaScript User Defined Function (UDF) that you provide, and writes it to Cloud Storage as CSV text files.

Requirements for this pipeline:

  • The input Cloud Spanner table must exist prior to running the pipeline.

Template parameters

Parameter Description
spannerProjectId The ID of the GCP project of the Cloud Spanner database that you want to read data from.
spannerDatabaseId The database ID of the requested table.
spannerInstanceId The instance ID of the requested table.
spannerTable The Cloud Spanner table to export.
textWritePrefix The Cloud Storage output directory where the output text files are written. Add a / at the end. For example, gs://mybucket/somefolder/.
javascriptTextTransformGcsPath [Optional] A Cloud Storage path that contains all your JavaScript code. For example, gs://mybucket/mytransforms/*.js. If you don't want to provide a function, leave this parameter blank.
javascriptTextTransformFunctionName [Optional] Name of the JavaScript function to be called. For example, if your JavaScript function is function myTransform(inJson) { ...dostuff...} then the function name is myTransform. If you don't want to provide a function, leave this parameter blank.

Running the Cloud Spanner to Cloud Storage Text template

CONSOLE

Run from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click Create job from template.
  3. Select the Cloud Spanner to Cloud Storage Text template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Spanner_to_GCS_Text

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_DATABASE_ID with the Cloud Spanner database id.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
  • Replace YOUR_INSTANCE_ID with the Cloud Spanner instance id.
  • Replace YOUR_TABLE_ID with the Cloud Spanner table id.
  • Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the .js file containing your JavaScript code.
  • Replace YOUR_JAVASCRIPT_FUNCTION with your JavaScript function name.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Spanner_to_GCS_Text \
    --parameters \
spannerProjectId=YOUR_PROJECT_ID,\
spannerDatabaseId=YOUR_DATABASE_ID,\
spannerInstanceId=YOUR_INSTANCE_ID,\
spannerTable=YOUR_TABLE_ID,\
textWritePrefix=gs://YOUR_BUCKET_NAME/output/,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=YOUR_JAVASCRIPT_FUNCTION
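
Because the two UDF parameters are optional, you can omit them to export the table as plain CSV without a transform. A hedged sketch with placeholder values (project, instance, database, table, and bucket names are invented):

gcloud dataflow jobs run spanner-table-export \
    --gcs-location gs://dataflow-templates/latest/Spanner_to_GCS_Text \
    --parameters \
spannerProjectId=my-project,\
spannerDatabaseId=mydb,\
spannerInstanceId=my-instance,\
spannerTable=Singers,\
textWritePrefix=gs://my-bucket/output/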

API

Run from the REST API

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Spanner_to_GCS_Text

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_DATABASE_ID with the Cloud Spanner database id.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
  • Replace YOUR_INSTANCE_ID with the Cloud Spanner instance id.
  • Replace YOUR_TABLE_ID with the Cloud Spanner table id.
  • Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the .js file containing your JavaScript code.
  • Replace YOUR_JAVASCRIPT_FUNCTION with your JavaScript function name.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Spanner_to_GCS_Text
{
   "jobName": "JOB_NAME",
   "parameters": {
       "spannerProjectId": "YOUR_PROJECT_ID",
       "spannerDatabaseId": "YOUR_DATABASE_ID",
       "spannerInstanceId": "YOUR_INSTANCE_ID",
       "spannerTable": "YOUR_TABLE_ID",
       "textWritePrefix": "gs://YOUR_BUCKET_NAME/output/",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "javascriptTextTransformFunctionName": "YOUR_JAVASCRIPT_FUNCTION"
   },
   "environment": { "zone": "us-central1-f" }
}

Cloud Storage Avro to Cloud Bigtable

The Cloud Storage Avro to Cloud Bigtable template is a pipeline that reads data from Avro files in a Cloud Storage bucket and writes the data to a Cloud Bigtable table. You can use the template to copy data from Cloud Storage to Cloud Bigtable.

Requirements for this pipeline:

  • The Cloud Bigtable table must exist and have the same column families as exported in the Avro files.
  • The input Avro files must exist in a Cloud Storage bucket prior to running the pipeline.
  • Cloud Bigtable expects a specific schema from the input Avro files.

Template parameters

Parameter Description
bigtableProjectId The ID of the GCP project of the Cloud Bigtable instance that you want to write data to.
bigtableInstanceId The ID of the Cloud Bigtable instance that contains the table.
bigtableTableId The ID of the Cloud Bigtable table to import.
inputFilePattern Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.

Running the Cloud Storage Avro to Cloud Bigtable template

CONSOLE

Run from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click Create job from template.
  3. Select the Cloud Storage Avro to Cloud Bigtable template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Bigtable

You must replace the following values in this example:

  • Replace [YOUR_PROJECT_ID] with your project ID.
  • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace [PROJECT_ID] with the ID of the GCP project of the Cloud Bigtable instance that you want to write data to.
  • Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
  • Replace [TABLE_ID] with the ID of the Cloud Bigtable table to import.
  • Replace [INPUT_FILE_PATTERN] with Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Bigtable \
    --parameters bigtableProjectId=[PROJECT_ID],bigtableInstanceId=[INSTANCE_ID],bigtableTableId=[TABLE_ID],inputFilePattern=[INPUT_FILE_PATTERN]

API

Run from the REST API

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Bigtable

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

  • Replace [YOUR_PROJECT_ID] with your project ID.
  • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace [PROJECT_ID] with the ID of the GCP project of the Cloud Bigtable instance that you want to write data to.
  • Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
  • Replace [TABLE_ID] with the ID of the Cloud Bigtable table to import.
  • Replace [INPUT_FILE_PATTERN] with Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Bigtable
{
   "jobName": "[JOB_NAME]",
   "parameters": {
       "bigtableProjectId": "[PROJECT_ID]",
       "bigtableInstanceId": "[INSTANCE_ID]",
       "bigtableTableId": "[TABLE_ID]",
       "inputFilePattern": "[INPUT_FILE_PATTERN]",
   },
   "environment": { "zone": "us-central1-f" }
}

Cloud Storage Avro to Cloud Spanner

The Cloud Storage Avro files to Cloud Spanner template is a batch pipeline that reads Avro files from Cloud Storage and imports them to a Cloud Spanner database.

Requirements for this pipeline:

  • The target Cloud Spanner database must exist and must be empty.
  • You must have read permissions for the Cloud Storage bucket and write permissions for the target Cloud Spanner database.
  • The input Cloud Storage path must exist, and it must include a spanner-export.json file that contains a JSON description of files to import.

Template parameters

Parameter Description
instanceId The instance ID of the Cloud Spanner database.
databaseId The database ID of the Cloud Spanner database.
inputDir The Cloud Storage path where the Avro files should be imported from.

Running the template

CONSOLE

Run from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click Create job from template.
  3. Select the Cloud Storage Avro to Cloud Spanner template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field.
    • Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • The job name must match the format cloud-spanner-import-[YOUR_INSTANCE_ID]-[YOUR_DATABASE_ID] to show up in the Cloud Spanner portion of the GCP Console.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner

You must replace the following values in this example:

  • Replace [YOUR_PROJECT_ID] with your project ID.
  • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace [DATAFLOW_REGION] with the region where you want the Cloud Dataflow job to run (such as us-central1).
  • Replace [YOUR_INSTANCE_ID] with your Cloud Spanner instance ID.
  • Replace [YOUR_DATABASE_ID] with your Cloud Spanner database ID.
  • Replace [YOUR_GCS_DIRECTORY] with the Cloud Storage path that the Avro files should be imported from.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location='gs://dataflow-templates/[VERSION]/GCS_Avro_to_Cloud_Spanner' \
    --region=[DATAFLOW_REGION] \
    --parameters='instanceId=[YOUR_INSTANCE_ID],databaseId=[YOUR_DATABASE_ID],inputDir=[YOUR_GCS_DIRECTORY]'
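
After launching the import, you can check on the job from the command line. A small sketch (the region value is a placeholder):

# List active Cloud Dataflow jobs in the chosen region.
gcloud dataflow jobs list --region=us-central1 --status=active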

API

Run from the REST API

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner

Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

  • Replace [YOUR_PROJECT_ID] with your project ID.
  • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace [DATAFLOW_REGION] with the region where you want the Cloud Dataflow job to run (such as us-central1).
  • Replace [YOUR_INSTANCE_ID] with your Cloud Spanner instance ID.
  • Replace [YOUR_DATABASE_ID] with your Cloud Spanner database ID.
  • Replace [YOUR_GCS_DIRECTORY] with the Cloud Storage path that the Avro files should be imported from.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/locations/[DATAFLOW_REGION]/templates:launch?gcsPath=gs://dataflow-templates/[VERSION]/GCS_Avro_to_Cloud_Spanner
{
   "jobName": "[JOB_NAME]",
   "parameters": {
       "instanceId": "[YOUR_INSTANCE_ID]",
       "databaseId": "[YOUR_DATABASE_ID]",
       "inputDir": "gs://[YOUR_GCS_DIRECTORY]"
   },
   "environment": {
       "machineType": "n1-standard-2"
   }
}

Cloud Storage SequenceFile to Cloud Bigtable

The Cloud Storage SequenceFile to Cloud Bigtable template is a pipeline that reads data from SequenceFiles in a Cloud Storage bucket and writes the data to a Cloud Bigtable table. You can use the template to copy data from Cloud Storage to Cloud Bigtable.

Requirements for this pipeline:

  • The Cloud Bigtable table must exist.
  • The input SequenceFiles must exist in a Cloud Storage bucket prior to running the pipeline.
  • The input SequenceFiles must have been exported from Cloud Bigtable or HBase.

Template parameters

Parameter Description
bigtableProject The ID of the GCP project of the Cloud Bigtable instance that you want to write data to.
bigtableInstanceId The ID of the Cloud Bigtable instance that contains the table.
bigtableTableId The ID of the Cloud Bigtable table to import.
bigtableAppProfileId The ID of the Cloud Bigtable application profile to be used for the import. If you do not specify an app profile, Cloud Bigtable uses the instance's default app profile.
sourcePattern Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.

Running the Cloud Storage SequenceFile to Cloud Bigtable template

CONSOLE

Run from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click Create job from template.
  3. Select the Cloud Storage SequenceFile to Cloud Bigtable template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_SequenceFile_to_Cloud_Bigtable

You must replace the following values in this example:

  • Replace [YOUR_PROJECT_ID] with your project ID.
  • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace [PROJECT_ID] with the ID of the GCP project of the Cloud Bigtable instance that you want to write data to.
  • Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
  • Replace [TABLE_ID] with the ID of the Cloud Bigtable table to import.
  • Replace [APPLICATION_PROFILE_ID] with the ID of the Cloud Bigtable application profile to be used for the import.
  • Replace [SOURCE_PATTERN] with Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location gs://dataflow-templates/latest/GCS_SequenceFile_to_Cloud_Bigtable \
    --parameters bigtableProject=[PROJECT_ID],bigtableInstanceId=[INSTANCE_ID],bigtableTableId=[TABLE_ID],bigtableAppProfileId=[APPLICATION_PROFILE_ID],sourcePattern=[SOURCE_PATTERN]

API

Run from the REST API

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_SequenceFile_to_Cloud_Bigtable

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

  • Replace [YOUR_PROJECT_ID] with your project ID.
  • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace [PROJECT_ID] with the ID of the GCP project of the Cloud Bigtable instance that you want to write data to.
  • Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
  • Replace [TABLE_ID] with the ID of the Cloud Bigtable table to import.
  • Replace [APPLICATION_PROFILE_ID] with the ID of the Cloud Bigtable application profile to be used for the import.
  • Replace [SOURCE_PATTERN] with Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_SequenceFile_to_Cloud_Bigtable
{
   "jobName": "[JOB_NAME]",
   "parameters": {
       "bigtableProject": "[PROJECT_ID]",
       "bigtableInstanceId": "[INSTANCE_ID]",
       "bigtableTableId": "[TABLE_ID]",
       "bigtableAppProfileId": "[APPLICATION_PROFILE_ID]",
       "sourcePattern": "[SOURCE_PATTERN]",
   },
   "environment": { "zone": "us-central1-f" }
}

Cloud Storage Text to BigQuery

The Cloud Storage Text to BigQuery pipeline is a batch pipeline that allows you to read text files stored in Cloud Storage, transform them using a JavaScript User Defined Function (UDF) that you provide, and output the result to BigQuery.

IMPORTANT: If you reuse an existing BigQuery table, the table will be overwritten.

Requirements for this pipeline:

  • Create a JSON file that describes your BigQuery schema.

    Ensure that there is a top level JSON array titled BigQuery Schema and that its contents follow the pattern {"name": "COLUMN_NAME", "type": "DATA_TYPE"}. For example:

    {
      "BigQuery Schema": [
        {
          "name": "location",
          "type": "STRING"
        },
        {
          "name": "name",
          "type": "STRING"
        },
        {
          "name": "age",
          "type": "STRING"
        },
        {
          "name": "color",
          "type": "STRING"
        },
        {
          "name": "coffee",
          "type": "STRING"
        }
      ]
    }
    
  • Create a JavaScript (.js) file with your UDF function that supplies the logic to transform the lines of text. Note that your function must return a JSON string.

    For example, this function splits each line of a CSV file and returns a JSON string after transforming the values.

    function transform(line) {
      var values = line.split(',');

      var obj = new Object();
      obj.location = values[0];
      obj.name = values[1];
      obj.age = values[2];
      obj.color = values[3];
      obj.coffee = values[4];
      var jsonString = JSON.stringify(obj);

      return jsonString;
    }
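
Both the schema file and the UDF file must be staged in Cloud Storage before you launch the job. A minimal staging sketch, where the local file names and the bucket are placeholders:

# Copy the BigQuery schema definition and the UDF to Cloud Storage (all names are hypothetical).
gsutil cp bigquery_schema.json gs://my-bucket/schemas/bigquery_schema.json
gsutil cp transform_udf.js gs://my-bucket/udfs/transform_udf.js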
    

Template parameters

Parameter Description
javascriptTextTransformFunctionName The name of the function you want to call from your .js file.
JSONPath The gs:// path to the JSON file that defines your BigQuery schema, stored in Cloud Storage. For example, gs://path/to/my/schema.json.
javascriptTextTransformGcsPath The gs:// path to the JavaScript file that defines your UDF. For example, gs://path/to/my/javascript_function.js.
inputFilePattern The gs:// path to the text in Cloud Storage you'd like to process. For example, gs://path/to/my/text/data.txt.
outputTable The BigQuery table name you want to create to store your processed data in. If you reuse an existing BigQuery table, the table will be overwritten. For example, my-project-name:my-dataset.my-table.
bigQueryLoadingTemporaryDirectory Temporary directory for BigQuery loading process. For example, gs://my-bucket/my-files/temp_dir.

Running the Cloud Storage Text to BigQuery template

CONSOLE

Run from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click Create job from template.
  3. Select the Cloud Storage Text to BigQuery template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_BigQuery

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_JAVASCRIPT_FUNCTION with the name of your UDF.
  • Replace PATH_TO_BIGQUERY_SCHEMA_JSON with the Cloud Storage path to the JSON file containing the schema definition.
  • Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the .js file containing your JavaScript code.
  • Replace PATH_TO_YOUR_TEXT_DATA with your Cloud Storage path to your text dataset.
  • Replace BIGQUERY_TABLE with your BigQuery table name.
  • Replace PATH_TO_TEMP_DIR_ON_GCS with your Cloud Storage path to the temp directory.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
    --parameters \
javascriptTextTransformFunctionName=YOUR_JAVASCRIPT_FUNCTION,\
JSONPath=PATH_TO_BIGQUERY_SCHEMA_JSON,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
inputFilePattern=PATH_TO_YOUR_TEXT_DATA,\
outputTable=BIGQUERY_TABLE,\
bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS
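
Once the job completes, you can verify that the destination table was created, for example with the bq command-line tool. The project, dataset, and table names below are placeholders:

# Show the schema of the table the pipeline wrote to.
bq show --schema --format=prettyjson my-project:my_dataset.my_table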

API

Run from the REST API

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_BigQuery

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_JAVASCRIPT_FUNCTION with the name of your UDF.
  • Replace PATH_TO_BIGQUERY_SCHEMA_JSON with the Cloud Storage path to the JSON file containing the schema definition.
  • Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the .js file containing your JavaScript code.
  • Replace PATH_TO_YOUR_TEXT_DATA with your Cloud Storage path to your text dataset.
  • Replace BIGQUERY_TABLE with your BigQuery table name.
  • Replace PATH_TO_TEMP_DIR_ON_GCS with your Cloud Storage path to the temp directory.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Text_to_BigQuery
{
   "jobName": "JOB_NAME",
   "parameters": {
       "javascriptTextTransformFunctionName": "YOUR_JAVASCRIPT_FUNCTION",
       "JSONPath": "PATH_TO_BIGQUERY_SCHEMA_JSON",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "inputFilePattern":"PATH_TO_YOUR_TEXT_DATA",
       "outputTable":"BIGQUERY_TABLE",
       "bigQueryLoadingTemporaryDirectory": "PATH_TO_TEMP_DIR_ON_GCS"
   },
   "environment": { "zone": "us-central1-f" }
}

Cloud Storage Text to Cloud Datastore

The Cloud Storage Text to Cloud Datastore template is a batch pipeline that reads from text files stored in Cloud Storage and writes JSON-encoded entities to Cloud Datastore. Each line in the input text files must be in the JSON format specified at https://cloud.google.com/datastore/docs/reference/rest/v1/Entity.

Requirements for this pipeline:

  • Datastore must be enabled in the destination project.

Template parameters

Parameter Description
textReadPattern A Cloud Storage file path pattern that specifies the location of your text data files. For example, gs://mybucket/somepath/*.json.
javascriptTextTransformGcsPath A Cloud Storage path pattern that contains all your JavaScript code. For example, gs://mybucket/mytransforms/*.js. If you don't want to provide a function, leave this parameter blank.
javascriptTextTransformFunctionName Name of the JavaScript function to be called. For example, if your JavaScript function is function myTransform(inJson) { ...dostuff...} then the function name is myTransform. If you don't want to provide a function, leave this parameter blank.
datastoreWriteProjectId The ID of the GCP project to write the Cloud Datastore entities to.
errorWritePath The error log output file to use for write failures that occur during processing. For example, gs://bucket-name/errors.txt.

Running the Cloud Storage Text to Datastore template

CONSOLE

Run from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click Create job from template.
  3. Select the Cloud Storage Text to Datastore template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_Datastore

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace PATH_TO_INPUT_TEXT_FILES with the input files pattern on Cloud Storage.
  • Replace YOUR_JAVASCRIPT_FUNCTION with your JavaScript function name.
  • Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the .js file containing your JavaScript code.
  • Replace ERROR_FILE_WRITE_PATH with your desired path to error file on Cloud Storage.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_Datastore \
    --parameters \
textReadPattern=PATH_TO_INPUT_TEXT_FILES,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=YOUR_JAVASCRIPT_FUNCTION,\
datastoreWriteProjectId=YOUR_PROJECT_ID,\
errorWritePath=ERROR_FILE_WRITE_PATH
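
If some entities fail to write, the pipeline records them under errorWritePath, and you can inspect that location afterwards. The exact output file naming is determined by the pipeline, so the wildcard below is an assumption, and the bucket is a placeholder:

# List and read any error output written by the job.
gsutil ls gs://my-bucket/errors*
gsutil cat gs://my-bucket/errors*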

API

Run from the REST API

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_Datastore

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace PATH_TO_INPUT_TEXT_FILES with the input files pattern on Cloud Storage.
  • Replace YOUR_JAVASCRIPT_FUNCTION with your JavaScript function name.
  • Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the .js file containing your JavaScript code.
  • Replace ERROR_FILE_WRITE_PATH with your desired path to error file on Cloud Storage.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Text_to_Datastore
{
   "jobName": "JOB_NAME",
   "parameters": {
       "textReadPattern": "PATH_TO_INPUT_TEXT_FILES",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "javascriptTextTransformFunctionName": "YOUR_JAVASCRIPT_FUNCTION",
       "datastoreWriteProjectId": "YOUR_PROJECT_ID",
       "errorWritePath": "ERROR_FILE_WRITE_PATH"
   },
   "environment": { "zone": "us-central1-f" }
}

Cloud Storage Text to Cloud Pub/Sub (Batch)

This template creates a batch pipeline that reads records from text files stored in Cloud Storage and publishes them to a Cloud Pub/Sub topic. You can use the template to publish records from a newline-delimited file containing JSON records, or from a CSV file, to a Cloud Pub/Sub topic for real-time processing. You can also use this template to replay data to Cloud Pub/Sub.

Note that this template does not set any timestamp on the individual records, so the event time equals the publishing time during execution. If your pipeline relies on an accurate event time for processing, do not use this pipeline.

Requirements for this pipeline:

  • The files to read need to be in newline-delimited JSON or CSV format. Records spanning multiple lines in the source files may cause issues downstream as each line within the files will be published as a message to Cloud Pub/Sub.
  • The Cloud Pub/Sub topic must exist prior to running the pipeline.
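
The destination topic has to exist before the pipeline runs; if needed, you can create one with gcloud (the topic name below is a placeholder):

# Create the Cloud Pub/Sub topic the pipeline will publish to.
gcloud pubsub topics create my-topic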

Template parameters

Parameter Description
inputFilePattern The input file pattern to read from. For example, gs://bucket-name/files/*.json.
outputTopic The Cloud Pub/Sub topic to write to. The name must be in the format projects/<project-id>/topics/<topic-name>.

Running the Cloud Storage Text to Cloud Pub/Sub (Batch) template

CONSOLE

Run from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click Create job from template.
  3. Select the Cloud Storage Text to Cloud Pub/Sub (Batch) template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_PubSub

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_TOPIC_NAME with your Cloud Pub/Sub topic name.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_Cloud_PubSub \
    --parameters \
inputFilePattern=gs://YOUR_BUCKET_NAME/files/*.json,\
outputTopic=projects/YOUR_PROJECT_ID/topics/YOUR_TOPIC_NAME
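
To spot-check that messages arrived, you can pull from a subscription attached to the topic. This assumes a subscription named my-subscription already exists; the subscription is not created by the template:

# Pull a few published messages for inspection (subscription name is hypothetical).
gcloud pubsub subscriptions pull my-subscription --limit=5 --auto-ack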

API

Run from the REST API

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_PubSub

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_TOPIC_NAME with your Cloud Pub/Sub topic name.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Text_to_Cloud_PubSub
{
   "jobName": "JOB_NAME",
   "parameters": {
       "inputFilePattern": "gs://YOUR_BUCKET_NAME/files/*.json",
       "outputTopic": "projects/YOUR_PROJECT_ID/topics/YOUR_TOPIC_NAME"
   },
   "environment": { "zone": "us-central1-f" }
}

Cloud Storage Text to Cloud Spanner

The Cloud Storage Text to Cloud Spanner template is a batch pipeline that reads CSV text files from Cloud Storage and imports them to a Cloud Spanner database.

Requirements for this pipeline:

  • The target Cloud Spanner database and table must exist.
  • You must have read permissions for the Cloud Storage bucket and write permissions for the target Cloud Spanner database.
  • The input Cloud Storage path containing the CSV files must exist.
  • You must create an import manifest file containing a JSON description of the CSV files, and you must store that manifest in Cloud Storage.
  • If the target Cloud Spanner database already has a schema, any columns specified in the manifest file must have the same data types as their corresponding columns in the target database's schema.
  • The manifest file must be encoded in ASCII or UTF-8 (an example of the manifest format appears after this list).

  • Text files to be imported must be in CSV format, with ASCII or UTF-8 encoding. We recommend not using a byte order mark (BOM) in UTF-8 encoded files.
  • Data must match one of the following types:
    • INT64
    • FLOAT64
    • BOOL
    • STRING
    • DATE
    • TIMESTAMP
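
The import manifest itself is a small JSON file stored in Cloud Storage. The sketch below shows its general shape based on the Cloud Spanner CSV import documentation; the table, column, file, and bucket names are placeholders, and you should verify the exact field names against the current template reference:

# Write a minimal manifest describing one table and its CSV files, then stage it (names are hypothetical).
cat > import-manifest.json <<'EOF'
{
  "tables": [
    {
      "table_name": "Singers",
      "file_patterns": ["gs://my-bucket/csv/Singers*.csv"],
      "columns": [
        {"column_name": "SingerId", "type_name": "INT64"},
        {"column_name": "FirstName", "type_name": "STRING"}
      ]
    }
  ]
}
EOF
gsutil cp import-manifest.json gs://my-bucket/csv/import-manifest.json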

Template parameters

Parameter Description
instanceId The instance ID of the Cloud Spanner database.
databaseId The database ID of the Cloud Spanner database.
importManifest The path in Cloud Storage to the import manifest file.
columnDelimiter The column delimiter that the source file uses. The default value is ,.
fieldQualifier The character that should surround any value in the source file that contains the columnDelimiter. The default value is ".
trailingDelimiter Specifies whether the lines in the source files have trailing delimiters (that is, if the columnDelimiter character appears at the end of each line, after the last column value). The default value is true.
escape The escape character the source file uses. By default, this parameter is not set and the template does not use the escape character.
nullString The string that represents a NULL value. By default, this parameter is not set and the template does not use the null string.
dateFormat The format used to parse date columns. By default, the pipeline tries to parse the date columns as yyyy-M-d[' 00:00:00'], for example, as 2019-01-31 or 2019-1-1 00:00:00. If your date format is different, specify the format using the java.time.format.DateTimeFormatter patterns.
timestampFormat The format used to parse timestamp columns. If the timestamp is a long integer, it is parsed as Unix epoch time. Otherwise, it is parsed as a string using the java.time.format.DateTimeFormatter.ISO_INSTANT format. For other cases, specify your own pattern string; for example, use MMM dd yyyy HH:mm:ss.SSSVV for timestamps in the form "Jan 21 1998 01:02:03.456+08:00".

If you need to use customized date or timestamp formats, make sure they're valid java.time.format.DateTimeFormatter patterns. The following table shows additional examples of customized formats for date and timestamp columns:

Type Input value Format Remark
DATE 2011-3-31 By default, the template can parse this format. You don't need to specify the dateFormat parameter.
DATE 2011-3-31 00:00:00 By default, the template can parse this format. You don't need to specify the format. If you like, you can use yyyy-M-d' 00:00:00'.
DATE 01 Apr, 18 dd MMM, yy
DATE Wednesday, April 3, 2019 AD EEEE, LLLL d, yyyy G
TIMESTAMP 2019-01-02T11:22:33Z, 2019-01-02T11:22:33.123Z, or 2019-01-02T11:22:33.12356789Z The default format ISO_INSTANT can parse this type of timestamp. You don't need to provide the timestampFormat parameter.
TIMESTAMP 1568402363 By default, the template can parse this type of timestamp and treat it as the Unix epoch time.
TIMESTAMP Tue, 3 Jun 2008 11:05:30 GMT EEE, d MMM yyyy HH:mm:ss VV
TIMESTAMP 2018/12/31 110530.123PST yyyy/MM/dd HHmmss.SSSz
TIMESTAMP 2019-01-02T11:22:33Z or 2019-01-02T11:22:33.123Z yyyy-MM-dd'T'HH:mm:ss[.SSS]VV If the input column is a mix of 2019-01-02T11:22:33Z and 2019-01-02T11:22:33.123Z, the default format can parse this type of timestamp, so you don't need to provide your own format parameter. However, you can use yyyy-MM-dd'T'HH:mm:ss[.SSS]VV to handle both cases explicitly. Note that you cannot use yyyy-MM-dd'T'HH:mm:ss[.SSS]'Z', because the postfix 'Z' must be parsed as a time-zone ID, not a character literal. Internally, the timestamp column is converted to a java.time.Instant, so it must be specified in UTC or have time-zone information associated with it. A local datetime, such as 2019-01-02 11:22:33, cannot be parsed as a valid java.time.Instant.

Running the template

Console

Run in the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Go to the Cloud Dataflow page
  3. Click Create job from template.
  4. Cloud Platform Console Create Job From Template Button
  5. Select the Cloud Storage Text to Cloud Spanner template from the Cloud Dataflow template drop-down menu.
  6. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  7. Enter your parameter values in the provided parameter fields.
  8. Click Run Job.

gcloud

Run with the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_Spanner

Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

  • Replace [DATAFLOW_REGION] with the region where you want the Cloud Dataflow job to run (such as us-central1).
  • Replace [YOUR_INSTANCE_ID] with your Cloud Spanner instance ID.
  • Replace [YOUR_DATABASE_ID] with your Cloud Spanner database ID.
  • Replace [GCS_PATH_TO_IMPORT_MANIFEST] with the Cloud Storage path to your import manifest file.
  • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location='gs://dataflow-templates/[VERSION]/GCS_Text_to_Cloud_Spanner' \
    --region=[DATAFLOW_REGION] \
    --parameters='instanceId=[YOUR_INSTANCE_ID],databaseId=[YOUR_DATABASE_ID],importManifest=[GCS_PATH_TO_IMPORT_MANIFEST]'
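
If your files use one of the customized formats shown in the table above, you can also pass the optional dateFormat and timestampFormat parameters in the same command. For example, the following (illustrative) invocation adds a timestampFormat for values such as 2018/12/31 110530.123PST:

gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location='gs://dataflow-templates/[VERSION]/GCS_Text_to_Cloud_Spanner' \
    --region=[DATAFLOW_REGION] \
    --parameters='instanceId=[YOUR_INSTANCE_ID],databaseId=[YOUR_DATABASE_ID],importManifest=[GCS_PATH_TO_IMPORT_MANIFEST],timestampFormat=yyyy/MM/dd HHmmss.SSSz'

Note that format patterns containing commas (such as dd MMM, yy) conflict with the comma-separated --parameters syntax, so they may require gcloud's alternate delimiter syntax or the REST API instead.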

API

Run with the REST API

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_Spanner

Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

  • Replace [YOUR_PROJECT_ID] with your project ID.
  • Replace [DATAFLOW_REGION] with the region where you want the Cloud Dataflow job to run (such as us-central1).
  • Replace [YOUR_INSTANCE_ID] with your Cloud Spanner instance ID.
  • Replace [YOUR_DATABASE_ID] with your Cloud Spanner database ID.
  • Replace [GCS_PATH_TO_IMPORT_MANIFEST] with the Cloud Storage path to your import manifest file.
  • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/locations/[DATAFLOW_REGION]/templates:launch?gcsPath=gs://dataflow-templates/[VERSION]/GCS_Text_to_Cloud_Spanner

{
   "jobName": "[JOB_NAME]",
   "parameters": {
       "instanceId": "[YOUR_INSTANCE_ID]",
       "databaseId": "[YOUR_DATABASE_ID]",
       "importManifest": "[GCS_PATH_TO_IMPORT_MANIFEST]"
   },
   "environment": {
       "machineType": "n1-standard-2"
   }
}

Java Database Connectivity (JDBC) to BigQuery

The JDBC to BigQuery template is a batch pipeline that copies data from a relational database table into an existing BigQuery table. This pipeline uses JDBC to connect to the relational database. You can use this template to copy data from any relational database with available JDBC drivers into BigQuery. For an extra layer of protection, you can also pass in a Cloud KMS key along with Base64-encoded username, password, and connection string parameters that are encrypted with that key. See the Cloud KMS API encryption endpoint for additional details on encrypting your username, password, and connection string parameters.
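
As an illustration of one way to produce these encrypted values (the project, key ring, and key names below are hypothetical placeholders), you can call the Cloud KMS encrypt endpoint on the Base64-encoded plaintext and pass the ciphertext field from the response as the parameter value:

# Base64-encode the plaintext value (here, the connection string), then encrypt it with Cloud KMS.
PLAINTEXT_B64=$(echo -n "jdbc:mysql://some-host:3306/sampledb" | base64 -w 0)

curl -s -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d "{\"plaintext\": \"${PLAINTEXT_B64}\"}" \
    "https://cloudkms.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/global/keyRings/YOUR_KEY_RING/cryptoKeys/YOUR_KEY:encrypt"

# The "ciphertext" field in the JSON response is the Base64-encoded value to pass as
# connectionURL (and likewise for username and password), and the same key is referenced
# through the KMSEncryptionKey parameter.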

Requirements for this pipeline:

  • The JDBC drivers for the relational database must be available.
  • The BigQuery table must exist prior to pipeline execution.
  • The BigQuery table must have a compatible schema.
  • The relational database must be accessible from the subnet where Cloud Dataflow runs.

Template parameters

Parameter Description
driverJars Comma-separated list of driver JAR files. For example, gs://<my-bucket>/driver_jar1.jar,gs://<my-bucket>/driver_jar2.jar.
driverClassName The JDBC driver class name. For example, com.mysql.jdbc.Driver.
connectionURL The JDBC connection URL string. For example, jdbc:mysql://some-host:3306/sampledb. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key.
query Query to be run on the source to extract the data. For example, select * from sampledb.sample_table.
outputTable The BigQuery output table location, in the format of <my-project>:<my-dataset>.<my-table>.
bigQueryLoadingTemporaryDirectory Temporary directory for BigQuery loading process. For example, gs://<my-bucket>/my-files/temp_dir.
connectionProperties [Optional] Properties string to use for the JDBC connection. For example, unicode=true&characterEncoding=UTF-8.
username [Optional] Username to be used for the JDBC connection. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key.
password [Optional] Password to be used for the JDBC connection. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key.
KMSEncryptionKey [Optional] Cloud KMS encryption key to decrypt the username, password, and connection string. If a Cloud KMS key is passed in, the username, password, and connection string must all be passed in encrypted.

Running the JDBC to BigQuery template

CONSOLE

Run from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Go to the Cloud Dataflow page
  3. Click Create job from template.
  4. Cloud Platform Console Create Job From Template Button
  5. Select the JDBC to BigQuery template from the Cloud Dataflow template drop-down menu.
  6. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  7. Enter your parameter values in the provided parameter fields.
  8. Click Run Job.

GCLOUD

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Jdbc_to_BigQuery

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace DRIVER_PATHS with the comma-separated Cloud Storage path(s) of the JDBC driver(s).
  • Replace DRIVER_CLASS_NAME with the JDBC driver class name.
  • Replace JDBC_CONNECTION_URL with the JDBC connection URL.
  • Replace SOURCE_SQL_QUERY with the SQL query to be run on the source database.
  • Replace YOUR_DATASET with your BigQuery dataset, and replace YOUR_TABLE_NAME with your BigQuery table name.
  • Replace PATH_TO_TEMP_DIR_ON_GCS with your Cloud Storage path to the temp directory.
  • Replace CONNECTION_PROPERTIES with the JDBC connection properties if required.
  • Replace CONNECTION_USERNAME with the JDBC connection username.
  • Replace CONNECTION_PASSWORD with the JDBC connection password.
  • Replace KMS_ENCRYPTION_KEY with the Cloud KMS Encryption Key.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Jdbc_to_BigQuery \
    --parameters \
driverJars=DRIVER_PATHS,\
driverClassName=DRIVER_CLASS_NAME,\
connectionURL=JDBC_CONNECTION_URL,\
query=SOURCE_SQL_QUERY,\
outputTable=YOUR_PROJECT_ID:YOUR_DATASET.YOUR_TABLE_NAME,\
bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS,\
connectionProperties=CONNECTION_PROPERTIES,\
username=CONNECTION_USERNAME,\
password=CONNECTION_PASSWORD,\
KMSEncryptionKey=KMS_ENCRYPTION_KEY
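
For example, using the illustrative values from the parameter descriptions above (the bucket, dataset, and table names are hypothetical, and the optional connection properties and Cloud KMS parameters are omitted), a MySQL export might look like the following. The entire --parameters value is quoted so that the spaces in the query survive shell word splitting:

gcloud dataflow jobs run jdbc-to-bigquery-example \
    --gcs-location gs://dataflow-templates/latest/Jdbc_to_BigQuery \
    --parameters='driverJars=gs://my-bucket/drivers/mysql-connector-java.jar,driverClassName=com.mysql.jdbc.Driver,connectionURL=jdbc:mysql://some-host:3306/sampledb,query=select * from sampledb.sample_table,outputTable=my-project:my_dataset.my_table,bigQueryLoadingTemporaryDirectory=gs://my-bucket/temp_dir,username=my-user,password=my-password'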

API

Run from the REST API

When running this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Jdbc_to_BigQuery

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace DRIVER_PATHS with the comma-separated Cloud Storage path(s) of the JDBC driver(s).
  • Replace DRIVER_CLASS_NAME with the JDBC driver class name.
  • Replace JDBC_CONNECTION_URL with the JDBC connection URL.
  • Replace SOURCE_SQL_QUERY with the SQL query to be run on the source database.
  • Replace YOUR_DATASET with your BigQuery dataset, and replace YOUR_TABLE_NAME with your BigQuery table name.
  • Replace PATH_TO_TEMP_DIR_ON_GCS with your Cloud Storage path to the temp directory.
  • Replace CONNECTION_PROPERTIES with the JDBC connection properties if required.
  • Replace CONNECTION_USERNAME with the JDBC connection username.
  • Replace CONNECTION_PASSWORD with the JDBC connection password.
  • Replace KMS_ENCRYPTION_KEY with the Cloud KMS Encryption Key.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Jdbc_to_BigQuery
{
   "jobName": "JOB_NAME",
   "parameters": {
       "driverJars": "DRIVER_PATHS",
       "driverClassName": "DRIVER_CLASS_NAME",
       "connectionURL": "JDBC_CONNECTION_URL",
       "query": "SOURCE_SQL_QUERY",
       "outputTable": "YOUR_PROJECT_ID:YOUR_DATASET.YOUR_TABLE_NAME",
       "bigQueryLoadingTemporaryDirectory": "PATH_TO_TEMP_DIR_ON_GCS",
       "connectionProperties": "CONNECTION_PROPERTIES",
       "username": "CONNECTION_USERNAME",
       "password": "CONNECTION_PASSWORD",
       "KMSEncryptionKey":"KMS_ENCRYPTION_KEY"
   },
   "environment": { "zone": "us-central1-f" }
}