Google-Provided Templates

Google provides a set of Google Cloud Dataflow templates. The following templates are available:

WordCount

The WordCount template is a batch pipeline that reads text from Google Cloud Storage, tokenizes the text lines into individual words, and performs a frequency count on each of the words. For more information about WordCount, see WordCount Example Pipeline.
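
The template's logic follows the standard Beam WordCount pattern. As a rough illustration (not the template's actual source), the following Apache Beam Python SDK sketch reads text, tokenizes it, and counts word frequencies; the input and output paths are placeholders:

    import re

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def run():
        # Pass runner, project, temp_location, etc. through PipelineOptions as needed.
        with beam.Pipeline(options=PipelineOptions()) as p:
            (p
             | 'Read' >> beam.io.ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
             | 'Tokenize' >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
             | 'Count' >> beam.combiners.Count.PerElement()
             | 'Format' >> beam.MapTuple(lambda word, count: '%s: %d' % (word, count))
             | 'Write' >> beam.io.WriteToText('gs://your-bucket-name/output/my_output'))

    if __name__ == '__main__':
        run()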

Cloud Storage path to template

If you're executing this template using the REST API, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/wordcount/template_file

Template parameters

Parameter Description
inputFile The Cloud Storage input file path.
output The Cloud Storage output file path and prefix.

Executing the template

Note: Executing Google-provided templates with the gcloud command-line tool is not currently supported.

  • Execute from the Google Cloud Platform Console
  • Execute from the REST API

    Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

    • Replace [YOUR_PROJECT_ID] with your project ID.
    • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • Replace [YOUR_BUCKET_NAME] with the name of your Cloud Storage bucket.
    POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/wordcount/template_file
    {
        "jobName": "[JOB_NAME]",
        "parameters": {
           "inputFile" : "gs://dataflow-samples/shakespeare/kinglear.txt",
           "output": "gs://[YOUR_BUCKET_NAME]/output/my_output"
        },
        "environment": {
           "tempLocation": "gs://[YOUR_BUCKET_NAME]/temp",
           "zone": "us-central1-f"
        }
    }
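
If you prefer to issue the launch request from code rather than a raw HTTP call, the following is a minimal sketch using the google-api-python-client library (pip install google-api-python-client) with Application Default Credentials; the project ID, job name, and bucket name are placeholders:

    from googleapiclient.discovery import build

    project = 'your-project-id'   # placeholder
    bucket = 'your-bucket-name'   # placeholder

    # Build a client for the Dataflow v1b3 API and call projects.templates.launch.
    dataflow = build('dataflow', 'v1b3')
    request = dataflow.projects().templates().launch(
        projectId=project,
        gcsPath='gs://dataflow-templates/wordcount/template_file',
        body={
            'jobName': 'wordcount-example',
            'parameters': {
                'inputFile': 'gs://dataflow-samples/shakespeare/kinglear.txt',
                'output': 'gs://{}/output/my_output'.format(bucket),
            },
            'environment': {
                'tempLocation': 'gs://{}/temp'.format(bucket),
                'zone': 'us-central1-f',
            },
        },
    )
    print(request.execute())

The same pattern works for the other templates on this page; only the gcsPath, parameters, and jobName change.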
    

Cloud Pub/Sub to BigQuery

The Cloud Pub/Sub to BigQuery template is a streaming pipeline that reads JSON-formatted messages from a Cloud Pub/Sub topic, converts them to BigQuery elements, and writes them to a BigQuery table. You can use the template as a quick solution to move Cloud Pub/Sub data to BigQuery.

Requirements for this pipeline:

  • The Cloud Pub/Sub messages must be in JSON format.
  • The BigQuery output table must exist prior to pipeline execution.

Cloud Storage path to template

If you're executing this template using the REST API, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/latest/PubSub_to_BigQuery

Template parameters

Parameter Description
inputTopic The Cloud Pub/Sub input topic to read from, in the format of projects/<project>/topics/<topic>.
outputTableSpec The BigQuery output table location, in the format of <my-project>:<my-dataset>.<my-table>

Executing the template

Note: Executing Google-provided templates with the gcloud command-line tool is not currently supported.

  • Execute from the Google Cloud Platform Console
  • Execute from the REST API

    Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

    • Replace [YOUR_PROJECT_ID] with your project ID.
    • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • Replace [YOUR_TOPIC_NAME] with your Cloud Pub/Sub topic name.
    • Replace [YOUR_DATASET] with your BigQuery dataset, and replace [YOUR_TABLE_NAME] with your BigQuery table name.
    • Replace [YOUR_BUCKET_NAME] with the name of your Cloud Storage bucket.
    POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/PubSub_to_BigQuery
    {
       "jobName": "[JOB_NAME]",
       "parameters": {
           "topic": "projects/[YOUR_PROJECT_ID]/topics/[YOUR_TOPIC_NAME]",
           "table": "[YOUR_PROJECT_ID]:[YOUR_DATASET].[YOUR_TABLE_NAME]"
       },
       "environment": {
           "tempLocation": "gs://[YOUR_BUCKET_NAME]/temp",
           "zone": "us-central1-f"
       }
    }
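
Once the job is running, you can verify the pipeline end to end by publishing a JSON-formatted test message to the input topic. A minimal sketch with the google-cloud-pubsub client library (pip install google-cloud-pubsub); the project, topic, and field names are placeholders, and the JSON fields are assumed to correspond to columns of the output table:

    import json

    from google.cloud import pubsub_v1

    project_id = 'your-project-id'   # placeholder
    topic_name = 'your-topic-name'   # placeholder

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_name)

    # The message body must be valid JSON; field names are placeholders.
    message = {'k1': 'v1', 'k2': 'v2'}
    future = publisher.publish(topic_path, json.dumps(message).encode('utf-8'))
    print(future.result())  # prints the Pub/Sub message ID on success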
    

Cloud Storage Text to Cloud Pub/Sub

The Cloud Storage Text to Cloud Pub/Sub template is a batch pipeline that reads records from text files stored in Cloud Storage and publishes them to a Cloud Pub/Sub topic. The template can be used to publish records in a newline-delimited file containing JSON records or CSV file to a Cloud Pub/Sub topic for real-time processing. You can use this template to replay data to Cloud Pub/Sub.

Note that this template does not set any timestamp on the individual records, so the event time will be equal to the publishing time during execution. If your processing relies on an accurate event time, you should not use this template.

Requirements for this pipeline:

  • The files to read must be in newline-delimited JSON or CSV format. Each line in the source files is published as a separate Cloud Pub/Sub message, so records that span multiple lines may cause issues downstream (see the sketch after this list).
  • The Cloud Pub/Sub topic must exist prior to execution.
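
Because each line of the input files becomes one Cloud Pub/Sub message, the input should be prepared as newline-delimited records. A minimal sketch of uploading such a file with the google-cloud-storage client library (pip install google-cloud-storage); the bucket name, object name, and record fields are placeholders:

    import json

    from google.cloud import storage

    bucket_name = 'your-bucket-name'   # placeholder

    # One JSON record per line (newline-delimited JSON).
    records = [{'id': 1, 'name': 'alpha'}, {'id': 2, 'name': 'beta'}]
    payload = '\n'.join(json.dumps(r) for r in records) + '\n'

    client = storage.Client()
    blob = client.bucket(bucket_name).blob('files/records.json')
    blob.upload_from_string(payload, content_type='application/json')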

Cloud Storage path to template

If you're executing this template using the REST API, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/latest/GCS_Text_to_Cloud_PubSub

Template parameters

Parameter Description
inputFilePattern The input file pattern to read from. For example, gs://bucket-name/files/*.json.
outputTopic The Cloud Pub/Sub output topic to write to. The name should be in the format of projects/<project-id>/topics/<topic-name>.

Executing the template

Note: Executing Google-provided templates with the gcloud command-line tool is not currently supported.

  • Execute from the Google Cloud Platform Console
  • Execute from the REST API

    Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

    • Replace [YOUR_PROJECT_ID] with your project ID.
    • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • Replace [YOUR_TOPIC_NAME] with your Cloud Pub/Sub topic name.
    • Replace [YOUR_BUCKET_NAME] with the name of your Cloud Storage bucket.
    POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Text_to_Cloud_PubSub
    {
       "jobName": "[JOB_NAME]",
       "parameters": {
           "inputFilePattern": "gs://[YOUR_BUCKET_NAME]/files/*.json",
           "outputTopic": "projects/[YOUR_PROJECT_ID]/topics/[YOUR_TOPIC_NAME]"
       },
       "environment": {
           "tempLocation": "gs://[YOUR_BUCKET_NAME]/temp",
           "zone": "us-central1-f"
       }
    }
    

Cloud Pub/Sub to Cloud Storage Text

The Cloud Pub/Sub to Cloud Storage Text template is a streaming pipeline that reads records from Cloud Pub/Sub and saves them as a series of Cloud Storage files in text format. The template can be used as a quick way to save data in Cloud Pub/Sub for future use. By default, the template generates a new file every 5 minutes.

Requirements for this pipeline:

  • The Cloud Pub/Sub topic must exist prior to execution.
  • The messages published to the topic must be in text format.
  • The messages published to the topic must not contain any newlines. Note that each Cloud Pub/Sub message is saved as a single line in the output file.

Cloud Storage path to template

If you're executing this template using the REST API, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text

Template parameters

Parameter Description
inputTopic The Cloud Pub/Sub topic to read the input from. The topic name should be in the format projects/<project-id>/topics/<topic-name>.
outputDirectory The path and filename prefix for writing output files. For example, gs://bucket-name/path/. This value must end in a slash.
outputFilenamePrefix The prefix to place on each windowed file. For example, output-
outputFilenameSuffix The suffix to place on each windowed file, typically a file extension such as .txt or .csv.
shardTemplate The shard template defines the dynamic portion of each windowed file. By default, the pipeline uses a single shard for output to the file system within each window. This means that all data will land into a single file per window. The shardTemplate defaults to W-P-SS-of-NN where W is the window date range, P is the pane info, S is the shard number, and N is the number of shards. In case of a single file, the SS-of-NN portion of the shardTemplate will be 00-of-01.
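
With the default shardTemplate, a window produces a single file, so with the example request below each window's output is written to a path like gs://[YOUR_BUCKET_NAME]/output/output-<window>-<pane>-00-of-01.txt. A minimal sketch for listing those files with the google-cloud-storage client library (pip install google-cloud-storage); the bucket name is a placeholder and the prefix matches the example request that follows:

    from google.cloud import storage

    bucket_name = 'your-bucket-name'   # placeholder

    client = storage.Client()
    # outputDirectory gs://<bucket>/output/ plus outputFilenamePrefix output-
    for blob in client.list_blobs(bucket_name, prefix='output/output-'):
        print(blob.name, blob.size)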

Executing the template

Note: Executing Google-provided templates with the gcloud command-line tool is not currently supported.

  • Execute from the Google Cloud Platform Console
  • Execute from the REST API

    Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

    • Replace [YOUR_PROJECT_ID] with your project ID.
    • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • Replace [YOUR_TOPIC_NAME] with your Cloud Pub/Sub topic name.
    • Replace [YOUR_BUCKET_NAME] with the name of your Cloud Storage bucket.
    POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text
    {
       "jobName": "[JOB_NAME]",
       "parameters": {
           "inputTopic": "projects/[YOUR_PROJECT_ID]/topics/[YOUR_TOPIC_NAME]"
           "outputDirectory": "gs://[YOUR_BUCKET_NAME]/output/",
           "outputFilenamePrefix": "output-",
           "outputFilenameSuffix": ".txt",
       },
       "environment": {
           "tempLocation": "gs://[YOUR_BUCKET_NAME]/temp",
           "zone": "us-central1-f"
       }
    }
    

Cloud Datastore to Cloud Storage Text

The Cloud Datastore to Cloud Storage Text template is a batch pipeline that reads Cloud Datastore entities and writes them to Cloud Storage as text files. You can provide a function to process each entity as a JSON string. If you don't provide such a function, every line in the output file will be a JSON-serialized entity.
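
Since every output line is a JSON-serialized entity (when no transform function is provided), you can read the exported files back and parse them line by line. A minimal sketch with the google-cloud-storage client library (pip install google-cloud-storage); the bucket and object names are placeholders, as the pipeline chooses the output file names under the textWritePrefix you supply:

    import json

    from google.cloud import storage

    bucket_name = 'your-bucket-name'                 # placeholder
    object_name = 'output/entities-00000-of-00001'   # placeholder

    client = storage.Client()
    text = client.bucket(bucket_name).blob(object_name).download_as_text()
    for line in text.splitlines():
        entity = json.loads(line)   # one JSON-serialized entity per line
        print(entity)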

Requirements for this pipeline:

  • Cloud Datastore must be set up in the project prior to execution.

Cloud Storage path to template

If you're executing this template using the REST API, you'll need the Cloud Storage path to the template:

gs://dataflow-templates/latest/Datastore_to_GCS_Text

Template parameters

Parameter Description
datastoreReadGqlQuery A GQL query that specifies which entities to read. For example, SELECT * FROM MyKind.
datastoreReadProjectId The Cloud Platform project ID of the Cloud Datastore instance that you want to read data from.
datastoreReadNamespace The namespace of the requested entities. To use the default namespace, leave this parameter blank.
javascriptTextTransformGcsPath A Cloud Storage path pattern to the JavaScript code containing your transform function. For example, gs://mybucket/mytransforms/*.js. If you don't want to provide a function, leave this parameter blank.
javascriptTextTransformFunctionName The name of the JavaScript function to call. For example, if your JavaScript function is function myTransform(inJson) { ...dostuff... }, then the function name is myTransform. If you don't want to provide a function, leave this parameter blank.
textWritePrefix The Cloud Storage path prefix to specify where the data should be written. For example, gs://mybucket/somefolder/.

Executing the template

Note: Executing Google-provided templates with the gcloud command-line tool is not currently supported.

  • Execute from the Google Cloud Platform Console
  • Execute from the REST API

    Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:

    • Replace [YOUR_PROJECT_ID] with your project ID.
    • Replace [JOB_NAME] with a job name of your choice. The job name must match the regular expression [a-z]([-a-z0-9]{0,38}[a-z0-9])? to be valid.
    • Replace [YOUR_BUCKET_NAME] with the name of your Cloud Storage bucket.
    • Replace [YOUR_DATASTORE_KIND] with the kind of your Datastore entities.
    • Replace [YOUR_DATASTORE_NAMESPACE] with the namespace of your Datastore entities.
    • Replace [YOUR_JAVASCRIPT_FUNCTION] with your JavaScript function name.
    • Replace [PATH_TO_JAVASCRIPT_UDF_FILE] with the Cloud Storage path to the .js file containing your JavaScript code.
    POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/Datastore_to_GCS_Text
    {
       "jobName": "[JOB_NAME]",
       "parameters": {
           "datastoreReadGqlQuery": "SELECT * FROM [YOUR_DATASTORE_KIND]"
           "datastoreReadProjectId": "[YOUR_PROJECT_ID]",
           "datastoreReadNamespace": "[YOUR_DATASTORE_NAMESPACE]",
           "javascriptTextTransformGcsPath": "[PATH_TO_JAVASCRIPT_UDF_FILE]",
           "javascriptTextTransformFunctionName": "[YOUR_JAVASCRIPT_FUNCTION]",
           "textWritePrefix": "gs://[YOUR_BUCKET_NAME]/output/"
       },
       "environment": {
           "tempLocation": "gs://[YOUR_BUCKET_NAME]/temp",
           "zone": "us-central1-f"
       }
    }
    
