Google-Provided Templates

Google provides a set of open-source Cloud Dataflow templates. The following templates are available:

WordCount

The WordCount template is a batch pipeline that reads text from Cloud Storage, tokenizes the text lines into individual words, and performs a frequency count on each of the words. For more information about WordCount, see WordCount Example Pipeline.

Template parameters

Parameter Description
inputFile The Cloud Storage input file path.
output The Cloud Storage output file path and prefix.

Executing the WordCount template

CONSOLE

Execute from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click CREATE JOB FROM TEMPLATE.
  3. Select the WordCount template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Execute from the gcloud command-line tool

Note: To use the gcloud command-line tool to execute templates, you must have Cloud SDK version 138.0.0 or higher.

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/wordcount/template_file

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/wordcount/template_file \
    --parameters \
inputFile=gs://dataflow-samples/shakespeare/kinglear.txt,\
output=gs://YOUR_BUCKET_NAME/output/my_output
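
After the job finishes, you can inspect the results directly in Cloud Storage. The following is a minimal sketch using gsutil; the exact output file names depend on how the pipeline shards its output, so the wildcard is an assumption:

gsutil ls gs://YOUR_BUCKET_NAME/output/
gsutil cat gs://YOUR_BUCKET_NAME/output/my_output*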

API

Execute from the REST API

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/wordcount/template_file

To execute this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/wordcount/template_file
{
    "jobName": "JOB_NAME",
    "parameters": {
       "inputFile" : "gs://dataflow-samples/shakespeare/kinglear.txt",
       "output": "gs://YOUR_BUCKET_NAME/output/my_output"
    },
    "environment": { "zone": "us-central1-f" }
}
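
If you prefer to issue the request from the command line, the following curl sketch sends the same launch request. It assumes you have the Cloud SDK installed so that gcloud auth print-access-token can supply an OAuth token:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "jobName": "JOB_NAME",
        "parameters": {
          "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
          "output": "gs://YOUR_BUCKET_NAME/output/my_output"
        },
        "environment": { "zone": "us-central1-f" }
      }' \
  "https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/wordcount/template_file"

The same pattern applies to the other templates on this page; only the gcsPath value and the parameters change.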

Cloud Bigtable to Cloud Storage SequenceFile

The Cloud Bigtable to Cloud Storage SequenceFile template is a pipeline that reads data from a Cloud Bigtable table and writes it to a Cloud Storage bucket in SequenceFile format. You can use the template as a quick solution to move data from Cloud Bigtable to Cloud Storage.

Requirements for this pipeline:

  • The Cloud Bigtable table must exist.
  • The output Cloud Storage bucket must exist prior to pipeline execution.

Template parameters

Parameter Description
bigtableProject The ID of the GCP project of the Cloud Bigtable instance that you want to read data from.
bigtableInstanceId The ID of the Cloud Bigtable instance that contains the table.
bigtableTableId The ID of the Cloud Bigtable table to export.
bigtableAppProfileId The ID of the Cloud Bigtable application profile to be used for the export. If you do not specify an app profile, Cloud Bigtable uses the instance's default app profile.
destinationPath Cloud Storage path where data should be written. For example, gs://mybucket/somefolder.
filenamePrefix The prefix of the SequenceFile file name. For example, output-.

Executing the Cloud Bigtable to Cloud Storage SequenceFile template

CONSOLE

Execute from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click CREATE JOB FROM TEMPLATE.
  3. Select the Cloud Bigtable to Cloud Storage SequenceFile template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Execute from the gcloud command-line tool

Note: To use the gcloud command-line tool to execute templates, you must have Cloud SDK version 138.0.0 or higher.

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_SequenceFile

You must replace the following values in this example:

  • Replace [YOUR_PROJECT_ID] with your project ID.
  • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace [PROJECT_ID] with the ID of the GCP project of the Cloud Bigtable instance that you want to read data from.
  • Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
  • Replace [TABLE_ID] with the ID of the Cloud Bigtable table to export.
  • Replace [APPLICATION_PROFILE_ID] with the ID of the Cloud Bigtable application profile to be used for the export.
  • Replace [DESTINATION_PATH] with Cloud Storage path where data should be written. For example, gs://mybucket/somefolder.
  • Replace [FILENAME_PREFIX] with the prefix of the SequenceFile file name. For example, output-.
  • Replace [YOUR_BUCKET_NAME] with the name of your Cloud Storage bucket.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_SequenceFile \
    --parameters bigtableProject=[PROJECT_ID],bigtableInstanceId=[INSTANCE_ID],bigtableTableId=[TABLE_ID],bigtableAppProfileId=[APPLICATION_PROFILE_ID],destinationPath=[DESTINATION_PATH],filenamePrefix=[FILENAME_PREFIX]
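
Once the job completes, a quick way to confirm that the SequenceFiles were written is to list the destination path. This is a sketch; the wildcard assumes the output files start with the prefix you passed in filenamePrefix:

gsutil ls [DESTINATION_PATH]/[FILENAME_PREFIX]*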

API

Execute from the REST API

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_SequenceFile

To execute this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

Use this example request as documented in Using the REST API. You must specify a tempLocation where you have write permissions, and you must replace the following values in this example:

  • Replace [YOUR_PROJECT_ID] with your project ID.
  • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace [PROJECT_ID] with the ID of the GCP project of the Cloud Bigtable instance that you want to read data from.
  • Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
  • Replace [TABLE_ID] with the ID of the Cloud Bigtable table to export.
  • Replace [APPLICATION_PROFILE_ID] with the ID of the Cloud Bigtable application profile to be used for the export.
  • Replace [DESTINATION_PATH] with Cloud Storage path where data should be written. For example, gs://mybucket/somefolder.
  • Replace [FILENAME_PREFIX] with the prefix of the SequenceFile file name. For example, output-.
  • Replace [YOUR_BUCKET_NAME] with the name of your Cloud Storage bucket.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_SequenceFile
{
   "jobName": "[JOB_NAME]",
   "parameters": {
       "bigtableProject": "[PROJECT_ID]",
       "bigtableInstanceId": "[INSTANCE_ID]",
       "bigtableTableId": "[TABLE_ID]",
       "bigtableAppProfileId": "[APPLICATION_PROFILE_ID]",
       "destinationPath": "[DESTINATION_PATH]",
       "filenamePrefix": "[FILENAME_PREFIX]",
   },
   "environment": { "zone": "us-central1-f" }
}

Cloud Pub/Sub to BigQuery

The Cloud Pub/Sub to BigQuery template is a streaming pipeline that reads JSON-formatted messages from a Cloud Pub/Sub topic and writes them to a BigQuery table. You can use the template as a quick solution to move Cloud Pub/Sub data to BigQuery. The template reads JSON-formatted messages from Cloud Pub/Sub and converts them to BigQuery elements.

Requirements for this pipeline:

  • The Cloud Pub/Sub messages must be in JSON format. For example, messages formatted as {"k1":"v1", "k2":"v2"} may be inserted into a BigQuery table with two columns, named k1 and k2, with string data type. (A quick test publish is sketched after this list.)
  • The BigQuery output table must exist prior to pipeline execution.
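
To satisfy the first requirement, you can publish a test message to the topic with gcloud. The message below mirrors the two-column example above; the topic name is a placeholder:

gcloud pubsub topics publish YOUR_TOPIC_NAME \
    --message='{"k1":"v1", "k2":"v2"}'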

Template parameters

Parameter Description
inputTopic The Cloud Pub/Sub input topic to read from, in the format of projects/<project>/topics/<topic>.
outputTableSpec The BigQuery output table location, in the format of <my-project>:<my-dataset>.<my-table>

Executing the Cloud Pub/Sub to BigQuery template

CONSOLE

Execute from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click CREATE JOB FROM TEMPLATE.
  3. Select the Cloud Pub/Sub to BigQuery template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Execute from the gcloud command-line tool

Note: To use the gcloud command-line tool to execute templates, you must have Cloud SDK version 138.0.0 or higher.

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/PubSub_to_BigQuery

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_TOPIC_NAME with your Cloud Pub/Sub topic name.
  • Replace YOUR_DATASET with your BigQuery dataset, and replace YOUR_TABLE_NAME with your BigQuery table name.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
    --parameters \
inputTopic=projects/YOUR_PROJECT_ID/topics/YOUR_TOPIC_NAME,\
outputTableSpec=YOUR_PROJECT_ID:YOUR_DATASET.YOUR_TABLE_NAME
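
Once the streaming job is running and messages are arriving, one way to spot-check the destination table is with the bq command-line tool. This is a sketch; it assumes the bq tool is installed and that the table already contains rows:

bq query --use_legacy_sql=false \
    'SELECT * FROM `YOUR_PROJECT_ID.YOUR_DATASET.YOUR_TABLE_NAME` LIMIT 10'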

API

Execute from the REST API

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/PubSub_to_BigQuery

To execute this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_TOPIC_NAME with your Cloud Pub/Sub topic name.
  • Replace YOUR_DATASET with your BigQuery dataset, and replace YOUR_TABLE_NAME with your BigQuery table name.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/PubSub_to_BigQuery
{
   "jobName": "JOB_NAME",
   "parameters": {
       "inputTopic": "projects/YOUR_PROJECT_ID/topics/YOUR_TOPIC_NAME",
       "outputTableSpec": "YOUR_PROJECT_ID:YOUR_DATASET.YOUR_TABLE_NAME"
   },
   "environment": { "zone": "us-central1-f" }
}

Cloud Storage Text to Cloud Pub/Sub

The Cloud Storage Text to Cloud Pub/Sub template is a batch pipeline that reads records from text files stored in Cloud Storage and publishes them to a Cloud Pub/Sub topic. The template can be used to publish records in a newline-delimited file containing JSON records or CSV file to a Cloud Pub/Sub topic for real-time processing. You can use this template to replay data to Cloud Pub/Sub.

Note that this template does not set a timestamp on the individual records, so the event time is equal to the publishing time during execution. If your pipeline relies on an accurate event time for processing, you should not use this template.

Requirements for this pipeline:

  • The files to read need to be in newline-delimited JSON or CSV format. Records spanning multiple lines in the source files may cause issues downstream as each line within the files will be published as a message to Cloud Pub/Sub.
  • The Cloud Pub/Sub topic must exist prior to execution.
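
If the topic does not exist yet, you can create it, and optionally a subscription to verify the published records, with gcloud. YOUR_SUBSCRIPTION_NAME is a placeholder introduced here for illustration:

gcloud pubsub topics create YOUR_TOPIC_NAME
gcloud pubsub subscriptions create YOUR_SUBSCRIPTION_NAME --topic=YOUR_TOPIC_NAME

After the pipeline runs, pulling a few messages confirms that the records were published:

gcloud pubsub subscriptions pull YOUR_SUBSCRIPTION_NAME --auto-ack --limit=5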

Template parameters

Parameter Description
inputFilePattern The input file pattern to read from. For example, gs://bucket-name/files/*.json.
outputTopic The Cloud Pub/Sub topic to write to. The name should be in the format of projects/<project-id>/topics/<topic-name>.

Executing the Cloud Storage Text to Cloud Pub/Sub template

CONSOLE

Execute from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click CREATE JOB FROM TEMPLATE.
  3. Select the Cloud Storage Text to Cloud Pub/Sub template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Execute from the gcloud command-line tool

Note: To use the gcloud command-line tool to execute templates, you must have Cloud SDK version 138.0.0 or higher.

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_PubSub

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_TOPIC_NAME with your Cloud Pub/Sub topic name.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_Cloud_PubSub \
    --parameters \
inputFilePattern=gs://YOUR_BUCKET_NAME/files/*.json,\
outputTopic=projects/YOUR_PROJECT_ID/topics/YOUR_TOPIC_NAME

API

Execute from the REST API

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_PubSub

To execute this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_TOPIC_NAME with your Cloud Pub/Sub topic name.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Text_to_Cloud_PubSub
{
   "jobName": "JOB_NAME",
   "parameters": {
       "inputFilePattern": "gs://YOUR_BUCKET_NAME/files/*.json",
       "outputTopic": "projects/YOUR_PROJECT_ID/topics/YOUR_TOPIC_NAME"
   },
   "environment": { "zone": "us-central1-f" }
}

Cloud Pub/Sub to Cloud Storage Text

The Cloud Pub/Sub to Cloud Storage Text template is a streaming pipeline that reads records from Cloud Pub/Sub and saves them as a series of Cloud Storage files in text format. The template can be used as a quick way to save data in Cloud Pub/Sub for future use. By default, the template generates a new file every 5 minutes.

Requirements for this pipeline:

  • The Cloud Pub/Sub topic must exist prior to execution.
  • The messages published to the topic must be in text format.
  • The messages published to the topic must not contain any newlines. Note that each Cloud Pub/Sub message is saved as a single line in the output file.

Template parameters

Parameter Description
inputTopic The Cloud Pub/Sub topic to read the input from. The topic name should be in the format projects/<project-id>/topics/<topic-name>.
outputDirectory The path and filename prefix for writing output files. For example, gs://bucket-name/path/. This value must end in a slash.
outputFilenamePrefix The prefix to place on each windowed file. For example, output-
outputFilenameSuffix The suffix to place on each windowed file, typically a file extension such as .txt or .csv.
shardTemplate The shard template defines the dynamic portion of each windowed file. By default, the pipeline uses a single shard for output to the file system within each window. This means that all data will land into a single file per window. The shardTemplate defaults to W-P-SS-of-NN where W is the window date range, P is the pane info, S is the shard number, and N is the number of shards. In case of a single file, the SS-of-NN portion of the shardTemplate will be 00-of-01.
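
As an illustration of how these parameters combine, with outputDirectory=gs://YOUR_BUCKET_NAME/output/, outputFilenamePrefix=output-, the default shardTemplate, and outputFilenameSuffix=.txt, each window produces a file of the form below, where the W, P, SS, and NN portions are filled in at runtime:

gs://YOUR_BUCKET_NAME/output/output-W-P-SS-of-NN.txt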

Executing the Cloud Pub/Sub to Cloud Storage Text template

CONSOLE

Execute from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click CREATE JOB FROM TEMPLATE.
  3. Select the Cloud Pub/Sub to Cloud Storage Text template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Execute from the gcloud command-line tool

Note: To use the gcloud command-line tool to execute templates, you must have Cloud SDK version 138.0.0 or higher.

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cloud_PubSub_to_GCS_Text

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_TOPIC_NAME with your Cloud Pub/Sub topic name.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text \
    --parameters \
inputTopic=projects/YOUR_PROJECT_ID/topics/YOUR_TOPIC_NAME,\
outputDirectory=gs://YOUR_BUCKET_NAME/output/,\
outputFilenamePrefix=output-,\
outputFilenameSuffix=.txt

API

Execute from the REST API

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cloud_PubSub_to_GCS_Text

To execute this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_TOPIC_NAME with your Cloud Pub/Sub topic name.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text
{
   "jobName": "JOB_NAME",
   "parameters": {
       "inputTopic": "projects/YOUR_PROJECT_ID/topics/YOUR_TOPIC_NAME"
       "outputDirectory": "gs://YOUR_BUCKET_NAME/output/",
       "outputFilenamePrefix": "output-",
       "outputFilenameSuffix": ".txt",
   },
   "environment": { "zone": "us-central1-f" }
}

Cloud Datastore to Cloud Storage Text

The Cloud Datastore to Cloud Storage Text template is a batch pipeline that reads Cloud Datastore entities and writes them to Cloud Storage as text files. You can provide a function to process each entity as a JSON string. If you don't provide such a function, every line in the output file will be a JSON-serialized entity.

Requirements for this pipeline:

Cloud Datastore must be set up in the project prior to execution.

Template parameters

Parameter Description
datastoreReadGqlQuery A GQL query that specifies which entities to grab. For example, SELECT * FROM MyKind.
datastoreReadProjectId The GCP project ID of the Cloud Datastore instance that you want to read data from.
datastoreReadNamespace The namespace of the requested entities. To use the default namespace, leave this parameter blank.
javascriptTextTransformGcsPath A Cloud Storage path that contains all your Javascript code. For example, gs://mybucket/mytransforms/*.js. If you don't want to provide a function, leave this parameter blank.
javascriptTextTransformFunctionName Name of the Javascript function to be called. For example, if your Javascript function is function myTransform(inJson) { ...dostuff...} then the function name is myTransform. If you don't want to provide a function, leave this parameter blank.
textWritePrefix The Cloud Storage path prefix to specify where the data should be written. For example, gs://mybucket/somefolder/.

Executing the Cloud Datastore to Cloud Storage Text template

CONSOLE

Execute from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click CREATE JOB FROM TEMPLATE.
  3. Select the Cloud Datastore to Cloud Storage Text template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Execute from the gcloud command-line tool

Note: To use the gcloud command-line tool to execute templates, you must have Cloud SDK version 138.0.0 or higher.

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Datastore_to_GCS_Text

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
  • Replace YOUR_DATASTORE_KIND with the kind of your Datastore entities.
  • Replace YOUR_DATASTORE_NAMESPACE with the namespace of your Datastore entities.
  • Replace YOUR_JAVASCRIPT_FUNCTION with your Javascript function name.
  • Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the .js file containing your Javascript code.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Datastore_to_GCS_Text \
    --parameters \
datastoreReadGqlQuery="SELECT * FROM YOUR_DATASTORE_KIND",\
datastoreReadProjectId=YOUR_PROJECT_ID,\
datastoreReadNamespace=YOUR_DATASTORE_NAMESPACE,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=YOUR_JAVASCRIPT_FUNCTION,\
textWritePrefix=gs://YOUR_BUCKET_NAME/output/

API

Execute from the REST API

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Datastore_to_GCS_Text

To execute this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
  • Replace YOUR_DATASTORE_KIND with the kind of your Datastore entities.
  • Replace YOUR_DATASTORE_NAMESPACE with the namespace of your Datastore entities.
  • Replace YOUR_JAVASCRIPT_FUNCTION with your Javascript function name.
  • Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the .js file containing your Javascript code.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Datastore_to_GCS_Text
{
   "jobName": "JOB_NAME",
   "parameters": {
       "datastoreReadGqlQuery": "SELECT * FROM YOUR_DATASTORE_KIND"
       "datastoreReadProjectId": "YOUR_PROJECT_ID",
       "datastoreReadNamespace": "YOUR_DATASTORE_NAMESPACE",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "javascriptTextTransformFunctionName": "YOUR_JAVASCRIPT_FUNCTION",
       "textWritePrefix": "gs://YOUR_BUCKET_NAME/output/"
   },
   "environment": { "zone": "us-central1-f" }
}

Cloud Storage Text to BigQuery

The Cloud Storage Text to BigQuery pipeline is a batch pipeline that allows you to read text files stored in Cloud Storage, transform them using a Javascript User Defined Function (UDF) that you provide, and output the result to BigQuery.

IMPORTANT: If you reuse an existing BigQuery table, the table will be overwritten.

Requirements for this pipeline:

  • Create a JSON file that describes your BigQuery schema.

    Ensure that there is a top-level JSON array titled "BigQuery Schema" and that its contents follow the pattern {"name": "COLUMN NAME", "type": "DATA TYPE"}. For example:

    {
      "BigQuery Schema": [
        {
          "name": "location",
          "type": "STRING"
        },
        {
          "name": "name",
          "type": "STRING"
        },
        {
          "name": "age",
          "type": "STRING"
        },
        {
          "name": "color",
          "type": "STRING"
        },
        {
          "name": "coffee",
          "type": "STRING"
        }
      ]
    }
    
  • Create a Javascript (.js) file with your UDF function that supplies the logic to transform the lines of text. Note that your function must return a JSON string.

    For example, this function splits each line of a CSV file and returns a JSON string after transforming the values.

    function transform(line) {
      var values = line.split(',');

      var obj = new Object();
      obj.location = values[0];
      obj.name = values[1];
      obj.age = values[2];
      obj.color = values[3];
      obj.coffee = values[4];
      var jsonString = JSON.stringify(obj);

      return jsonString;
    }
    

Template parameters

Parameter Description
javascriptTextTransformFunctionName The name of the function you want to call from your .js file.
JSONPath The gs:// path to the JSON file that defines your BigQuery Schema, stored in Cloud Storage. For example, gs://path/to/my/schema.json.
javascriptTextTransformGcsPath The gs:// path to the Javascript file that defines your UDF. For example, gs://path/to/my/javascript_function.js.
inputFilePattern The gs:// path to the text in Cloud Storage you’d like to process. For example, gs://path/to/my/text/data.txt.
outputTable The BigQuery table name you want to create to store your processed data in. If you reuse an existing BigQuery table, the table will be overwritten. For example, my-project-name:my-dataset.my-table.
bigQueryLoadingTemporaryDirectory Temporary directory for BigQuery loading process. For example, gs://my-bucket/my-files/temp_dir.
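
Both JSONPath and javascriptTextTransformGcsPath refer to objects in Cloud Storage, so the schema file and the UDF file must be uploaded before you launch the job. A minimal way to stage them with gsutil (the local file names and bucket paths here are placeholders):

gsutil cp bigquery_schema.json gs://YOUR_BUCKET_NAME/schemas/bigquery_schema.json
gsutil cp transform.js gs://YOUR_BUCKET_NAME/udfs/transform.js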

Executing the Cloud Storage Text to BigQuery template

CONSOLE

Execute from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click CREATE JOB FROM TEMPLATE.
  3. Select the Cloud Storage Text to BigQuery template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Execute from the gcloud command-line tool

Note: To use the gcloud command-line tool to execute templates, you must have Cloud SDK version 138.0.0 or higher.

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_BigQuery

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_JAVASCRIPT_FUNCTION with the name of your UDF.
  • Replace PATH_TO_BIGQUERY_SCHEMA_JSON with the Cloud Storage path to the JSON file containing the schema definition.
  • Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the .js file containing your Javascript code.
  • Replace PATH_TO_YOUR_TEXT_DATA with your Cloud Storage path to your text dataset.
  • Replace BIGQUERY_TABLE with your BigQuery table name.
  • Replace PATH_TO_TEMP_DIR_ON_GCS with your Cloud Storage path to the temp directory.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
    --parameters \
javascriptTextTransformFunctionName=YOUR_JAVASCRIPT_FUNCTION,\
JSONPath=PATH_TO_BIGQUERY_SCHEMA_JSON,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
inputFilePattern=PATH_TO_YOUR_TEXT_DATA,\
outputTable=BIGQUERY_TABLE,\
bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS

API

Execute from the REST API

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_BigQuery

To execute this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_JAVASCRIPT_FUNCTION with the name of your UDF.
  • Replace PATH_TO_BIGQUERY_SCHEMA_JSON with the Cloud Storage path to the JSON file containing the schema definition.
  • Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the .js file containing your Javascript code.
  • Replace PATH_TO_YOUR_TEXT_DATA with your Cloud Storage path to your text dataset.
  • Replace BIGQUERY_TABLE with your BigQuery table name.
  • Replace PATH_TO_TEMP_DIR_ON_GCS with your Cloud Storage path to the temp directory.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Text_to_BigQuery
{
   "jobName": "JOB_NAME",
   "parameters": {
       "javascriptTextTransformFunctionName": "YOUR_JAVASCRIPT_FUNCTION",
       "JSONPath": "PATH_TO_BIGQUERY_SCHEMA_JSON",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "inputFilePattern":"PATH_TO_YOUR_TEXT_DATA",
       "outputTable":"BIGQUERY_TABLE",
       "bigQueryLoadingTemporaryDirectory": "PATH_TO_TEMP_DIR_ON_GCS"
   },
   "environment": { "zone": "us-central1-f" }
}

Cloud Storage Text to Cloud Datastore

The Cloud Storage Text to Cloud Datastore template is a batch pipeline that reads from text files stored in Cloud Storage and writes JSON-encoded entities to Cloud Datastore. Each line in the input text files must be in the JSON format specified at https://cloud.google.com/datastore/docs/reference/rest/v1/Entity.

Requirements for this pipeline:

  • Datastore must be enabled in the destination project.

Template parameters

Parameter Description
textReadPattern A Cloud Storage file path pattern that specifies the location of your text data files. For example, gs://mybucket/somepath/*.json.
javascriptTextTransformGcsPath A Cloud Storage path pattern that contains all your Javascript code. For example, gs://mybucket/mytransforms/*.js. If you don't want to provide a function, leave this parameter blank.
javascriptTextTransformFunctionName Name of the Javascript function to be called. For example, if your Javascript function is function myTransform(inJson) { ...dostuff...} then the function name is myTransform. If you don't want to provide a function, leave this parameter blank.
datastoreWriteProjectId The ID of the GCP project to write the Cloud Datastore entities to.
errorWritePath The error log output file to use for write failures that occur during processing. For example, gs://bucket-name/errors.txt.

Executing the Cloud Storage Text to Cloud Datastore template

CONSOLE

Execute from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click CREATE JOB FROM TEMPLATE.
  3. Select the Cloud Storage Text to Cloud Datastore template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Execute from the gcloud command-line tool

Note: To use the gcloud command-line tool to execute templates, you must have Cloud SDK version 138.0.0 or higher.

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_Datastore

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace PATH_TO_INPUT_TEXT_FILES with the input files pattern on Cloud Storage.
  • Replace YOUR_JAVASCRIPT_FUNCTION with your Javascript function name.
  • Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the .js file containing your Javascript code.
  • Replace ERROR_FILE_WRITE_PATH with your desired path to error file on Cloud Storage.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_Datastore \
    --parameters \
textReadPattern=PATH_TO_INPUT_TEXT_FILES,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=YOUR_JAVASCRIPT_FUNCTION,\
datastoreWriteProjectId=YOUR_PROJECT_ID,\
errorWritePath=ERROR_FILE_WRITE_PATH

API

Execute from the REST API

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_Datastore

To execute this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace PATH_TO_INPUT_TEXT_FILES with the input files pattern on Cloud Storage.
  • Replace YOUR_JAVASCRIPT_FUNCTION with your Javascript function name.
  • Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the .js file containing your Javascript code.
  • Replace ERROR_FILE_WRITE_PATH with your desired path to error file on Cloud Storage.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Text_to_Datastore
{
   "jobName": "JOB_NAME",
   "parameters": {
       "textReadPattern": "PATH_TO_INPUT_TEXT_FILES",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "javascriptTextTransformFunctionName": "YOUR_JAVASCRIPT_FUNCTION",
       "datastoreWriteProjectId": "YOUR_PROJECT_ID",
       "errorWritePath": "ERROR_FILE_WRITE_PATH"
   },
   "environment": { "zone": "us-central1-f" }
}

Cloud Storage SequenceFile to Cloud Bigtable

The Cloud Storage SequenceFile to Cloud Bigtable template is a pipeline that reads data from SequenceFiles in a Cloud Storage bucket and writes it to a Cloud Bigtable table. You can use the template as a quick solution to move data from Cloud Storage to Cloud Bigtable.

Requirements for this pipeline:

  • The Cloud Bigtable table must exist.
  • The input SequenceFile must exist in a Cloud Storage bucket prior to pipeline execution.
  • The input SequenceFile must have been exported from Cloud Bigtable or HBase.

Template parameters

Parameter Description
bigtableProject The ID of the GCP project of the Cloud Bigtable instance that you want to write data to.
bigtableInstanceId The ID of the Cloud Bigtable instance that contains the table.
bigtableTableId The ID of the Cloud Bigtable table to import.
bigtableAppProfileId The ID of the Cloud Bigtable application profile to be used for the import. If you do not specify an app profile, Cloud Bigtable uses the instance's default app profile.
sourcePattern Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.

Executing the Cloud Storage SequenceFile to Cloud Bigtable template

CONSOLE

Execute from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click CREATE JOB FROM TEMPLATE.
  3. Select the Cloud Storage SequenceFile to Cloud Bigtable template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Execute from the gcloud command-line tool

Note: To use the gcloud command-line tool to execute templates, you must have Cloud SDK version 138.0.0 or higher.

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_SequenceFile_to_Cloud_Bigtable

You must replace the following values in this example:

  • Replace [YOUR_PROJECT_ID] with your project ID.
  • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace [PROJECT_ID] with the ID of the GCP project of the Cloud Bigtable instance that you want to write data to.
  • Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
  • Replace [TABLE_ID] with the ID of the Cloud Bigtable table to import.
  • Replace [APPLICATION_PROFILE_ID] with the ID of the Cloud Bigtable application profile to be used for the import.
  • Replace [SOURCE_PATTERN] with Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.
  • Replace [YOUR_BUCKET_NAME] with the name of your Cloud Storage bucket.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location gs://dataflow-templates/latest/GCS_SequenceFile_to_Cloud_Bigtable \
    --parameters bigtableProject=[PROJECT_ID],bigtableInstanceId=[INSTANCE_ID],bigtableTableId=[TABLE_ID],bigtableAppProfileId=[APPLICATION_PROFILE_ID],sourcePattern=[SOURCE_PATTERN]

API

Execute from the REST API

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_SequenceFile_to_Cloud_Bigtable

To execute this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

Use this example request as documented in Using the REST API. You must specify a tempLocation where you have write permissions, and you must replace the following values in this example:

  • Replace [YOUR_PROJECT_ID] with your project ID.
  • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace [PROJECT_ID] with the ID of the GCP project of the Cloud Bigtable instance that you want to write data to.
  • Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
  • Replace [TABLE_ID] with the ID of the Cloud Bigtable table to import.
  • Replace [APPLICATION_PROFILE_ID] with the ID of the Cloud Bigtable application profile to be used for the import.
  • Replace [SOURCE_PATTERN] with Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.
  • Replace [YOUR_BUCKET_NAME] with the name of your Cloud Storage bucket.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_SequenceFile_to_Cloud_Bigtable
{
   "jobName": "[JOB_NAME]",
   "parameters": {
       "bigtableProject": "[PROJECT_ID]",
       "bigtableInstanceId": "[INSTANCE_ID]",
       "bigtableTableId": "[TABLE_ID]",
       "bigtableAppProfileId": "[APPLICATION_PROFILE_ID]",
       "sourcePattern": "[SOURCE_PATTERN]",
   },
   "environment": { "zone": "us-central1-f" }
}

Bulk Compress Cloud Storage Files

The Bulk Compress Cloud Storage Files template is a batch pipeline that compresses files on Cloud Storage to a specified location. This template can be useful when you need to compress large batches of files as part of a periodic archival process. The supported compression modes are BZIP2, DEFLATE, GZIP, and ZIP. Files output to the destination location follow a naming scheme of the original filename appended with the compression mode extension (.bzip2, .deflate, .gz, or .zip).

Any errors that occur during the compression process are written to the failure file in CSV format (filename, error message). If no failures occur during execution, the error file is still created but contains no error records.
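
After a run, you can verify the results with gsutil. The paths below are placeholders that mirror the example command later in this section, and the exact compressed file names are an assumption based on the naming scheme described above:

gsutil ls gs://YOUR_BUCKET_NAME/compressed/
gsutil cat gs://YOUR_BUCKET_NAME/failed/failure.csv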

Requirements for this pipeline:

  • The compression format must be one of the following: BZIP2, DEFLATE, GZIP, ZIP.
  • The output directory must exist prior to pipeline execution.

Template parameters

Parameter Description
inputFilePattern The input file pattern to read from. For example, gs://bucket-name/uncompressed/*.txt.
outputDirectory The output location to write to. For example, gs://bucket-name/compressed/.
outputFailureFile The error log output file to use for write failures that occur during the compression process. For example, gs://bucket-name/compressed/failed.csv. If there are no failures, the file is still created but will be empty. The file contents are in CSV format (Filename, Error) and consist of one line for each file that fails compression.
compression The compression algorithm used to compress the matched files. Must be one of: BZIP2, DEFLATE, GZIP, ZIP

Executing the Bulk Compress Cloud Storage Files template

CONSOLE

Execute from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click CREATE JOB FROM TEMPLATE.
  3. Select the Bulk Compress Cloud Storage Files template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Execute from the gcloud command-line tool

Note: To use the gcloud command-line tool to execute templates, you must have Cloud SDK version 138.0.0 or higher.

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Bulk_Compress_GCS_Files

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
  • Replace COMPRESSION with your choice of compression algorithm.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Bulk_Compress_GCS_Files \
    --parameters \
inputFilePattern=gs://YOUR_BUCKET_NAME/uncompressed/*.txt,\
outputDirectory=gs://YOUR_BUCKET_NAME/compressed,\
outputFailureFile=gs://YOUR_BUCKET_NAME/failed/failure.csv,\
compression=COMPRESSION

API

Execute from the REST API

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Bulk_Compress_GCS_Files

To execute this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
  • Replace COMPRESSION with your choice of compression algorithm.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Bulk_Compress_GCS_Files
{
   "jobName": "JOB_NAME",
   "parameters": {
       "inputFilePattern": "gs://YOUR_BUCKET_NAME/uncompressed/*.txt",
       "outputDirectory": "gs://YOUR_BUCKET_NAME/compressed",
       "outputFailureFile": "gs://YOUR_BUCKET_NAME/failed/failure.csv",
       "compression": "COMPRESSION"
   },
   "environment": { "zone": "us-central1-f" }
}

Bulk Decompress Cloud Storage Files

The Bulk Decompress Cloud Storage Files template is a batch pipeline that decompresses files on Cloud Storage to a specified location. This functionality is useful when you want to use compressed data to minimize network bandwidth costs during a migration, but would like to maximize analytical processing speed by operating on uncompressed data after migration. The pipeline automatically handles multiple compression modes during a single execution and determines the decompression mode to use based on the file extension (.bzip2, .deflate, .gz, .zip).

Requirements for this pipeline:

  • The files to decompress must be in one of the following formats: Bzip2, Deflate, Gzip, Zip.
  • The output directory must exist prior to pipeline execution.

Template parameters

Parameter Description
inputFilePattern The input file pattern to read from. For example, gs://bucket-name/compressed/*.gz.
outputDirectory The output location to write to. For example, gs://bucket-name/decompressed.
outputFailureFile The error log output file to use for write failures that occur during the decompression process. For example, gs://bucket-name/decompressed/failed.csv. If there are no failures, the file is still created but will be empty. The file contents are in CSV format (Filename, Error) and consist of one line for each file that fails decompression.

Executing the Bulk Decompress Cloud Storage Files template

CONSOLE

Execute from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click CREATE JOB FROM TEMPLATE.
  3. Select the Bulk Decompress Cloud Storage Files template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Execute from the gcloud command-line tool

Note: To use the gcloud command-line tool to execute templates, you must have Cloud SDK version 138.0.0 or higher.

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Bulk_Decompress_GCS_Files

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
  • Replace OUTPUT_FAILURE_FILE_PATH with your choice of path to the file containing failure information.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Bulk_Decompress_GCS_Files \
    --parameters \
inputFilePattern=gs://YOUR_BUCKET_NAME/compressed/*.gz,\
outputDirectory=gs://YOUR_BUCKET_NAME/decompressed,\
outputFailureFile=OUTPUT_FAILURE_FILE_PATH

API

Execute from the REST API

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Bulk_Decompress_GCS_Files

To execute this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
  • Replace OUTPUT_FAILURE_FILE_PATH with your choice of path to the file containing failure information.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Bulk_Decompress_GCS_Files
{
   "jobName": "JOB_NAME",
   "parameters": {
       "inputFilePattern": "gs://YOUR_BUCKET_NAME/compressed/*.gz",
       "outputDirectory": "gs://YOUR_BUCKET_NAME/decompressed",
       "outputFailureFile": "OUTPUT_FAILURE_FILE_PATH"
   },
   "environment": { "zone": "us-central1-f" }
}

Cloud Datastore Bulk Delete

The Cloud Datastore Bulk Delete template is a pipeline that reads entities from Cloud Datastore with a given GQL query and then deletes all matching entities in the selected target project. The pipeline can optionally pass the JSON-encoded Datastore entities to your Javascript UDF, which you can use to filter out entities by returning null values.

Requirements for this pipeline:

  • Cloud Datastore must be set up in the project prior to execution.
  • If reading and deleting from separate Datastore instances, the Dataflow Controller Service Account must have permission to read from one instance and delete from the other.

Template parameters

Parameter Description
datastoreReadGqlQuery GQL Query which specifies which entities to match for deletion. e.g: "SELECT * FROM MyKind".
datastoreReadProjectId GCP Project Id of the Cloud Datastore instance from which you want to read entities (using your GQL Query) that are used for matching.
datastoreDeleteProjectId GCP Project Id of the Cloud Datastore instance from which to delete matching entities. This can be the same as datastoreReadProjectId if you want to read and delete within the same Cloud Datastore instance.
datastoreReadNamespace [Optional] Namespace of requested Entities. Set as "" for default namespace.
javascriptTextTransformGcsPath [Optional] A Cloud Storage path which contains all your Javascript code. e.g: "gs://mybucket/mytransforms/*.js". If you don't want to use a UDF leave this field blank.
javascriptTextTransformFunctionName [Optional] Name of the Function to be called. If this function returns a value of undefined or null for a given Datastore Entity, then that Entity will not be deleted. If you have the javascript code of: "function myTransform(inJson) { ...dostuff...}" then your function name is "myTransform". If you don't want to use a UDF leave this field blank.

Executing the Cloud Datastore Bulk Delete template

CONSOLE

Execute from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click CREATE JOB FROM TEMPLATE.
  3. Select the Cloud Datastore Bulk Delete template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field. Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Execute from the gcloud command-line tool

Note: To use the gcloud command-line tool to execute templates, you must have Cloud SDK version 138.0.0 or higher.

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Datastore_to_Datastore_Delete

You must replace the following values in this example:

  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace GQL_QUERY with the query you'll use to match entities for deletion.
  • Replace DATASTORE_READ_AND_DELETE_PROJECT_ID with your Datastore instance project id. This example will both read and delete from the same Datastore instance.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Datastore_to_Datastore_Delete \
    --parameters \
datastoreReadGqlQuery="GQL_QUERY",\
datastoreReadProjectId=DATASTORE_READ_AND_DELETE_PROJECT_ID,\
datastoreDeleteProjectId=DATASTORE_READ_AND_DELETE_PROJECT_ID
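
If you also want to filter entities with a UDF before deletion, the same command can pass the two optional parameters described above. This is a sketch; PATH_TO_JAVASCRIPT_UDF_FILE and YOUR_JAVASCRIPT_FUNCTION are placeholders for your own Cloud Storage path and function name:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Datastore_to_Datastore_Delete \
    --parameters \
datastoreReadGqlQuery="GQL_QUERY",\
datastoreReadProjectId=DATASTORE_READ_AND_DELETE_PROJECT_ID,\
datastoreDeleteProjectId=DATASTORE_READ_AND_DELETE_PROJECT_ID,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=YOUR_JAVASCRIPT_FUNCTION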

API

Execute from the REST API

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Datastore_to_Datastore_Delete

To execute this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

You must replace the following values in this example:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace JOB_NAME with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
  • Replace GQL_QUERY with the query you'll use to match entities for deletion.
  • Replace DATASTORE_READ_AND_DELETE_PROJECT_ID with your Datastore instance project id. This example will both read and delete from the same Datastore instance.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Datastore_to_Datastore_Delete
{
   "jobName": "JOB_NAME",
   "parameters": {
       "datastoreReadGqlQuery": "GQL_QUERY",
       "datastoreReadProjectId": "READ_PROJECT_ID",
       "datastoreDeleteProjectId": "DELETE_PROJECT_ID"
   },
   "environment": { "zone": "us-central1-f" }
}

Cloud Spanner to Cloud Storage Avro

The Cloud Spanner to Cloud Storage Avro template is a batch pipeline that exports a whole Cloud Spanner database to Cloud Storage in Avro format. Exporting a Cloud Spanner database creates a folder in the bucket you select. The folder contains:

  • A spanner-export.json file.
  • A TableName-manifest.json file for each table in the database you exported.
  • One or more TableName.avro-#####-of-##### files.

For example, exporting a database with two tables, Singers and Albums, creates the following file set:

  • Albums-manifest.json
  • Albums.avro-00000-of-00002
  • Albums.avro-00001-of-00002
  • Singers-manifest.json
  • Singers.avro-00000-of-00003
  • Singers.avro-00001-of-00003
  • Singers.avro-00002-of-00003
  • spanner-export.json

Requirements for this pipeline:

  • The Cloud Spanner database must exist.
  • The output Cloud Storage bucket must exist.
  • In addition to the Cloud IAM roles necessary to run Cloud Dataflow jobs, you must also have the appropriate Cloud IAM roles for reading your Cloud Spanner data and writing to your Cloud Storage bucket.

Template parameters

Parameter Description
instanceId The instance ID of the Cloud Spanner database that you want to export.
databaseId The database ID of the Cloud Spanner database that you want to export.
outputDir The Cloud Storage path you want to export Avro files to. The export job creates a new directory under this path that contains the exported files.

Executing the template

CONSOLE

Execute from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click CREATE JOB FROM TEMPLATE.
  3. Select the Cloud Spanner to Cloud Storage Avro template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field.
    • Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • The job name must match the format cloud-spanner-export-[YOUR_INSTANCE_ID]-[YOUR_DATABASE_ID] to show up in the Cloud Spanner portion of the GCP Console.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Execute from the gcloud command-line tool

Note: To use the gcloud command-line tool to execute templates, you must have Cloud SDK version 138.0.0 or higher.

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro

You must replace the following values in this example:

  • Replace [DATAFLOW_REGION] with the region where you want the Cloud Dataflow job to run (such as us-central1).
  • Replace [YOUR_INSTANCE_ID] with your Cloud Spanner instance ID.
  • Replace [YOUR_DATABASE_ID] with your Cloud Spanner database ID.
  • Replace [YOUR_GCS_DIRECTORY] with the Cloud Storage path that the Avro files should be exported to.
  • Replace [JOB_NAME] with a job name of your choice.
    • The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • The job name must match the format cloud-spanner-export-[YOUR_INSTANCE_ID]-[YOUR_DATABASE_ID] to show up in the Cloud Spanner portion of the GCP Console.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location='gs://dataflow-templates/[VERSION]/Cloud_Spanner_to_GCS_Avro' \
    --region=[DATAFLOW_REGION] \
    --parameters='instanceId=[YOUR_INSTANCE_ID],databaseId=[YOUR_DATABASE_ID],outputDir=[YOUR_GCS_DIRECTORY]'

API

Execute from the REST API

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro

Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

  • Replace [YOUR_PROJECT_ID] with your project ID.
  • Replace [DATAFLOW_REGION] with the region where you want the Cloud Dataflow job to run (such as us-central1).
  • Replace [YOUR_INSTANCE_ID] with your Cloud Spanner instance ID.
  • Replace [YOUR_DATABASE_ID] with your Cloud Spanner database ID.
  • Replace [YOUR_GCS_DIRECTORY] with the Cloud Storage path that the Avro files should be exported to.
  • Replace [JOB_NAME] with a job name of your choice.
    • The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • The job name must match the format cloud-spanner-export-[YOUR_INSTANCE_ID]-[YOUR_DATABASE_ID] to show up in the Cloud Spanner portion of the GCP Console.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/locations/[DATAFLOW_REGION]/templates:launch?gcsPath=gs://dataflow-templates/[VERSION]/Cloud_Spanner_to_GCS_Avro
{
   "jobName": "[JOB_NAME]",
   "parameters": {
       "instanceId": "[YOUR_INSTANCE_ID]",
       "databaseId": "[YOUR_DATABASE_ID]",
       "outputDir": "gs://[YOUR_GCS_DIRECTORY]"
   }
}
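
As with the other templates, the regional templates.launch request can also be made from a client library. The sketch below is a minimal, hypothetical Python example using google-api-python-client, Application Default Credentials, and the latest template version; the project, region, instance, database, and bucket names are placeholders.

# Minimal sketch: launch the Cloud Spanner to Cloud Storage Avro template on a
# regional endpoint with google-api-python-client. All names are placeholders.
from googleapiclient.discovery import build

project_id = "my-project-id"
region = "us-central1"

dataflow = build("dataflow", "v1b3")
response = dataflow.projects().locations().templates().launch(
    projectId=project_id,
    location=region,
    gcsPath="gs://dataflow-templates/latest/Cloud_Spanner_to_GCS_Avro",
    body={
        # Using the cloud-spanner-export-[INSTANCE]-[DATABASE] format makes the
        # job visible in the Cloud Spanner portion of the GCP Console.
        "jobName": "cloud-spanner-export-my-instance-my-database",
        "parameters": {
            "instanceId": "my-instance",
            "databaseId": "my-database",
            "outputDir": "gs://my-export-bucket/spanner-export",
        },
    },
).execute()
print(response["job"]["id"])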

Template source code

Java: SDK 2.x

This template's source code is in the GoogleCloudPlatform/DataflowTemplates repository on GitHub.

Python

Python source code is not available.

Cloud Storage Avro to Cloud Spanner

The Cloud Storage Avro files to Cloud Spanner template is a batch pipeline that reads Avro files from Cloud Storage and imports them to a Cloud Spanner database.

Requirements for this pipeline:

  • The target Cloud Spanner database must exist and must be empty.
  • You need to have read permissions for the Cloud Storage bucket and write permissions for the target Cloud Spanner database.
  • The input Cloud Storage path must exist, and it must include a spanner-export.json file that contains a JSON description of files to import (a quick check for this file is sketched after this list).
  • You must have exported the Avro and JSON files from a Cloud Spanner database.
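
The manifest requirement can be verified before launching the job by checking for spanner-export.json under the input path. The snippet below is a minimal sketch assuming the google-cloud-storage client library; the bucket and path are hypothetical placeholders.

# Minimal sketch: verify that the spanner-export.json manifest exists under
# the input path before launching the import. Bucket and path are placeholders.
from google.cloud import storage

bucket_name = "my-export-bucket"               # placeholder bucket
export_prefix = "spanner-export/my-database"   # placeholder inputDir path

client = storage.Client()
manifest = client.bucket(bucket_name).blob(f"{export_prefix}/spanner-export.json")
if not manifest.exists():
    raise RuntimeError("spanner-export.json not found under the input path")
print("Manifest found; safe to launch the import template.")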

Template parameters

Parameter Description
instanceId The instance ID of the Cloud Spanner database.
databaseId The database ID of the Cloud Spanner database.
inputDir The Cloud Storage path where the Avro files should be imported from.

Executing the template

CONSOLE

Execute from the Google Cloud Platform Console
  1. Go to the Cloud Dataflow page in the GCP Console.
  2. Click CREATE JOB FROM TEMPLATE.
  3. Select the Cloud Storage Avro to Cloud Spanner template from the Cloud Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field.
    • Your job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • The job name must match the format cloud-spanner-import-[YOUR_INSTANCE_ID]-[YOUR_DATABASE_ID] to show up in the Cloud Spanner portion of the GCP Console.
  5. Enter your parameter values in the provided parameter fields.
  6. Click Run Job.

GCLOUD

Execute from the gcloud command-line tool

Note: To use the gcloud command-line tool to execute templates, you must have Cloud SDK version 138.0.0 or higher.

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner

You must replace the following values in this example:

  • Replace [DATAFLOW_REGION] with the region where you want the Cloud Dataflow job to run (such as us-central1).
  • Replace [YOUR_INSTANCE_ID] with your Cloud Spanner instance ID.
  • Replace [YOUR_DATABASE_ID] with your Cloud Spanner database ID.
  • Replace [YOUR_GCS_DIRECTORY] with the Cloud Storage path that the Avro files should be imported from.
  • Replace [JOB_NAME] with a job name of your choice.
    • The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • The job name must match the format cloud-spanner-import-[YOUR_INSTANCE_ID]-[YOUR_DATABASE_ID] to show up in the Cloud Spanner portion of the GCP Console.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location='gs://dataflow-templates/[VERSION]/GCS_Avro_to_Cloud_Spanner' \
    --region=[DATAFLOW_REGION] \
    --parameters='instanceId=[YOUR_INSTANCE_ID],databaseId=[YOUR_DATABASE_ID],inputDir=[YOUR_GCS_DIRECTORY]'

API

Execute from the REST API

When executing this template, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner

Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

  • Replace [YOUR_PROJECT_ID] with your project ID.
  • Replace [DATAFLOW_REGION] with the region where you want the Cloud Dataflow job to run (such as us-central1).
  • Replace [YOUR_INSTANCE_ID] with your Cloud Spanner instance ID.
  • Replace [YOUR_DATABASE_ID] with your Cloud Spanner database ID.
  • Replace [YOUR_GCS_DIRECTORY] with the Cloud Storage path that the Avro files should be imported from.
  • Replace [JOB_NAME] with a job name of your choice.
    • The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • The job name must match the format cloud-spanner-import-[YOUR_INSTANCE_ID]-[YOUR_DATABASE_ID] to show up in the Cloud Spanner portion of the GCP Console.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/locations/[DATAFLOW_REGION]/templates:launch?gcsPath=gs://dataflow-templates/[VERSION]/GCS_Avro_to_Cloud_Spanner
{
   "jobName": "[JOB_NAME]",
   "parameters": {
       "instanceId": "[YOUR_INSTANCE_ID]",
       "databaseId": "[YOUR_DATABASE_ID]",
       "inputDir": "gs://[YOUR_GCS_DIRECTORY]"
   },
   "environment": {
       "machineType": "n1-standard-2"
   }
}
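
After a launch request returns, the job's progress can be tracked by polling projects.locations.jobs.get until the job reaches a terminal state. The sketch below is a minimal, hypothetical Python example using google-api-python-client; the project, region, and job ID are placeholders.

# Minimal sketch: poll the state of a launched job on a regional endpoint.
# Project, region, and job ID are placeholders; assumes Application Default
# Credentials.
import time

from googleapiclient.discovery import build

project_id = "my-project-id"
region = "us-central1"
job_id = "JOB_ID_RETURNED_BY_TEMPLATES_LAUNCH"

dataflow = build("dataflow", "v1b3")
while True:
    job = dataflow.projects().locations().jobs().get(
        projectId=project_id, location=region, jobId=job_id
    ).execute()
    state = job.get("currentState", "JOB_STATE_UNKNOWN")
    print(state)
    if state in ("JOB_STATE_DONE", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"):
        break
    time.sleep(30)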

Template source code

Java: SDK 2.x

This template's source code is in the GoogleCloudPlatform/DataflowTemplates repository on GitHub.

Python

Python source code is not available.
