Get started with Google-provided templates

Google provides a set of open source Dataflow templates. For general information about templates, see the Overview page. To get started, use the WordCount template. See other Google-provided templates:

  • Streaming templates: templates for processing data continuously
  • Batch templates: templates for processing data in bulk
  • Utility templates

WordCount

The WordCount template is a batch pipeline that reads text from Cloud Storage, tokenizes the text lines into individual words, and performs a frequency count on each of the words. For more information about WordCount, see WordCount Example Pipeline.
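As a rough local illustration of the tokenize-and-count logic described above, the following shell sketch splits text into words and counts each one. It is not the template's implementation: it assumes a word is a run of ASCII letters, and the function name count_words is made up for this example.

```shell
# Local approximation of the WordCount logic (assumption: a "word" is a run
# of ASCII letters; the real pipeline runs on Dataflow, not in a shell).
count_words() {
  # 1. Turn every run of non-letter characters into a single newline.
  # 2. Drop empty lines, sort so identical words are adjacent, then count.
  # 3. Print each result as "word: count".
  tr -cs 'A-Za-z' '\n' | sed '/^$/d' | LC_ALL=C sort | uniq -c | awk '{print $2 ": " $1}'
}

printf 'to be or not to be\n' | count_words
# prints:
# be: 2
# not: 1
# or: 1
# to: 2
```

The sketch is case-sensitive, so "King" and "king" are counted as different words.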

Template parameters

Parameter    Description
inputFile    The Cloud Storage input file's path.
output       The Cloud Storage output file's path and prefix.

Running the WordCount template

Console

  1. Go to the Dataflow Create job from template page.
  2. In the Job name field, enter a unique job name.
  3. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  4. From the Dataflow template drop-down menu, select the Word Count template.
  5. In the provided parameter fields, enter your parameter values.
  6. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Word_Count \
    --region REGION_NAME \
    --parameters inputFile=gs://dataflow-samples/shakespeare/kinglear.txt,output=gs://BUCKET_NAME/output/my_output

Replace the following:

  • JOB_NAME: a unique job name of your choice
  • REGION_NAME: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • BUCKET_NAME: the name of your Cloud Storage bucket
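One way to fill in the placeholders is to keep them in shell variables. The sketch below assumes a bucket named my-bucket and a timestamp-based job name; both are illustrative values, not requirements. The parameter list is built as a single string with no whitespace, because whitespace inside it would split the value into separate shell arguments. As written, the script only prints what it would launch; the function run_wordcount (a name invented for this example) issues the actual command.

```shell
# Hypothetical values; replace them with your own.
JOB_NAME="wordcount-$(date +%Y%m%d-%H%M%S)"   # timestamp keeps the name unique
REGION_NAME="us-central1"
BUCKET_NAME="my-bucket"

# Comma-separated key=value pairs with no whitespace anywhere in the string.
PARAMS="inputFile=gs://dataflow-samples/shakespeare/kinglear.txt,output=gs://${BUCKET_NAME}/output/my_output"

# Launches the template with the values above (requires the gcloud CLI).
run_wordcount() {
  gcloud dataflow jobs run "$JOB_NAME" \
    --gcs-location gs://dataflow-templates/latest/Word_Count \
    --region "$REGION_NAME" \
    --parameters "$PARAMS"
}

# Dry run: show what would be launched.
echo "Would launch ${JOB_NAME} in ${REGION_NAME} with: ${PARAMS}"
# To launch for real, uncomment the next line:
# run_wordcount
```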

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/latest/Word_Count
{
    "jobName": "JOB_NAME",
    "parameters": {
        "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
        "output": "gs://BUCKET_NAME/output/my_output"
    },
    "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: the ID of the Google Cloud project where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • LOCATION: the regional endpoint where you want to deploy your Dataflow job—for example, us-central1
  • BUCKET_NAME: the name of your Cloud Storage bucket
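The request above could be sent from a shell with curl, as sketched below. This is one possible approach, not the only one. It assumes the gcloud CLI is installed and authenticated, so that gcloud auth print-access-token can supply a bearer token. The project, bucket, and job names are hypothetical, and launch_wordcount is a helper name invented for this example. As written, the script only prints the request; calling launch_wordcount sends it.

```shell
# Hypothetical values; replace them with your own.
PROJECT_ID="my-project"
LOCATION="us-central1"
JOB_NAME="wordcount-example"
BUCKET_NAME="my-bucket"

URL="https://dataflow.googleapis.com/v1b3/projects/${PROJECT_ID}/locations/${LOCATION}/templates:launch?gcsPath=gs://dataflow-templates/latest/Word_Count"

# Request body, matching the JSON shown above.
# The zone is kept from the example and belongs to the us-central1 region.
BODY=$(cat <<EOF
{
  "jobName": "${JOB_NAME}",
  "parameters": {
    "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
    "output": "gs://${BUCKET_NAME}/output/my_output"
  },
  "environment": { "zone": "us-central1-f" }
}
EOF
)

# Sends the POST request (assumes curl and an authenticated gcloud CLI).
launch_wordcount() {
  curl -sS -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d "$BODY" \
    "$URL"
}

# Dry run: print the request instead of sending it.
echo "POST $URL"
echo "$BODY"
# To send the request, uncomment the next line:
# launch_wordcount
```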