Google provides a set of open source Dataflow templates. For general information about templates, see the Overview page. To get started, use the WordCount template. See other Google-provided templates:
Streaming templates
Templates for processing data continuously:
- Pub/Sub Subscription to BigQuery
- Pub/Sub Topic to BigQuery
- Pub/Sub Avro to BigQuery
- Pub/Sub Proto to BigQuery
- Pub/Sub to Pub/Sub
- Pub/Sub to Splunk
- Pub/Sub to Cloud Storage Avro
- Pub/Sub to Cloud Storage Text
- Pub/Sub to MongoDB
- Pub/Sub to Elasticsearch
- Cloud Storage Text to BigQuery (Stream)
- Cloud Spanner change streams to BigQuery
- Cloud Spanner change streams to Cloud Storage
- Cloud Storage Text to Pub/Sub (Stream)
- Data Masking/Tokenization from Cloud Storage to BigQuery (using Cloud DLP)
- Change Data Capture to BigQuery (Stream)
- Apache Kafka to BigQuery
- Datastream to BigQuery
- Datastream to PostgreSQL
- Datastream to Cloud Spanner
- Pub/Sub to Java Database Connectivity (JDBC)
Batch templates
Templates for processing data in bulk:
- BigQuery to Cloud Storage TFRecords
- BigQuery export to Parquet (via Storage API)
- BigQuery to Elasticsearch
- Bigtable to Cloud Storage Avro
- Bigtable to Cloud Storage Parquet
- Bigtable to Cloud Storage SequenceFiles
- Datastore to Cloud Storage Text [Deprecated]
- Firestore (Datastore mode) to Cloud Storage Text
- Cloud Spanner to Cloud Storage Avro
- Cloud Spanner to Cloud Storage Text
- Cloud Storage Avro to Bigtable
- Cloud Storage Avro to Cloud Spanner
- Cloud Storage Parquet to Bigtable
- Cloud Storage SequenceFiles to Bigtable
- Cloud Storage Text to BigQuery
- Cloud Storage Text to Datastore [Deprecated]
- Cloud Storage Text to Firestore (Datastore mode)
- Cloud Storage Text to Pub/Sub (Batch)
- Cloud Storage Text to Cloud Spanner
- Cloud Storage to Elasticsearch
- Java Database Connectivity (JDBC) to BigQuery
- Java Database Connectivity (JDBC) to Pub/Sub
- Apache Cassandra to Bigtable
Utility templates
WordCount
The WordCount template is a batch pipeline that reads text from Cloud Storage, tokenizes the text lines into individual words, and performs a frequency count on each of the words. For more information about WordCount, see WordCount Example Pipeline.
Template parameters
| Parameter | Description |
|---|---|
| inputFile | The Cloud Storage input file's path. |
| output | The Cloud Storage output file's path and prefix. |
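The shape of this pipeline can be sketched with the Apache Beam Java SDK. The following is an illustrative sketch, not the template's actual source; it exposes the inputFile and output parameters as ValueProvider options, which is how classic templates accept parameter values at launch time.

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCountSketch {

  // ValueProvider-backed options let values be supplied at template launch
  // time, mirroring the inputFile and output parameters described above.
  public interface Options extends PipelineOptions {
    @Description("Cloud Storage path of the input file, e.g. gs://bucket/input.txt")
    ValueProvider<String> getInputFile();
    void setInputFile(ValueProvider<String> value);

    @Description("Cloud Storage path and prefix for the output files")
    ValueProvider<String> getOutput();
    void setOutput(ValueProvider<String> value);
  }

  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
        // Tokenize each line into individual words.
        .apply("SplitIntoWords",
            FlatMapElements.into(TypeDescriptors.strings())
                .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        .apply("DropEmpty", Filter.by((String word) -> !word.isEmpty()))
        // Count the occurrences of each distinct word.
        .apply("CountWords", Count.perElement())
        .apply("FormatResults",
            MapElements.into(TypeDescriptors.strings())
                .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        .apply("WriteCounts", TextIO.write().to(options.getOutput()));

    p.run();
  }
}
```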
Running the WordCount template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Word Count template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Word_Count \
    --region REGION_NAME \
    --parameters \
    inputFile=gs://dataflow-samples/shakespeare/kinglear.txt,\
    output=gs://BUCKET_NAME/output/my_output
Replace the following:
- JOB_NAME: a unique job name of your choice
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- BUCKET_NAME: the name of your Cloud Storage bucket
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/latest/Word_Count
{
  "jobName": "JOB_NAME",
  "parameters": {
    "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
    "output": "gs://BUCKET_NAME/output/my_output"
  },
  "environment": { "zone": "us-central1-f" }
}
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- BUCKET_NAME: the name of your Cloud Storage bucket
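If you want to send this request from code rather than from a REST client, the sketch below shows one way to do it in Java. It assumes application default credentials, the google-auth-library-oauth2-http dependency, and a recent JDK (java.net.http and text blocks); the class name and the hard-coded values are placeholders for illustration.

```java
import com.google.auth.oauth2.GoogleCredentials;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class LaunchWordCountTemplate {
  public static void main(String[] args) throws Exception {
    // Placeholder values; substitute your own project, location, and bucket.
    String projectId = "PROJECT_ID";
    String location = "LOCATION";
    String url = "https://dataflow.googleapis.com/v1b3/projects/" + projectId
        + "/locations/" + location
        + "/templates:launch?gcsPath=gs://dataflow-templates/latest/Word_Count";

    // Request body mirroring the REST example above.
    String body = """
        {
          "jobName": "JOB_NAME",
          "parameters": {
            "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
            "output": "gs://BUCKET_NAME/output/my_output"
          },
          "environment": { "zone": "us-central1-f" }
        }""";

    // Obtain an OAuth access token from application default credentials.
    GoogleCredentials credentials = GoogleCredentials.getApplicationDefault()
        .createScoped(List.of("https://www.googleapis.com/auth/cloud-platform"));
    credentials.refreshIfExpired();
    String token = credentials.getAccessToken().getTokenValue();

    HttpRequest request = HttpRequest.newBuilder(URI.create(url))
        .header("Authorization", "Bearer " + token)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode());
    System.out.println(response.body());
  }
}
```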
Extend templates
You can build your own pipelines by extending the open source Dataflow templates. For example, in a template that uses a fixed window duration, data that arrives outside of the window might be discarded. To avoid this behavior, use the template code as a base, and modify the code to invoke the .withAllowedLateness operation.
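As a sketch of that kind of change in the Beam Java SDK, the snippet below applies fixed windowing with an allowed-lateness setting before an aggregation. The five-minute window, two-hour allowed lateness, and late-firing trigger are illustrative values, not settings taken from any specific template.

```java
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class AllowedLatenessSketch {

  /** Windows the input and keeps late-arriving elements instead of dropping them. */
  static PCollection<KV<String, Long>> countWithAllowedLateness(PCollection<String> events) {
    return events
        .apply("FixedWindowsWithLateness",
            Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
                // Without this, elements arriving after the watermark passes the
                // end of the window are discarded.
                .withAllowedLateness(Duration.standardHours(2))
                // Emit an on-time pane at the watermark, then refine the result
                // as late data arrives.
                .triggering(
                    AfterWatermark.pastEndOfWindow()
                        .withLateFirings(
                            AfterProcessingTime.pastFirstElementInPane()
                                .plusDelayOf(Duration.standardMinutes(1))))
                .accumulatingFiredPanes())
        .apply("CountPerWindow", Count.perElement());
  }
}
```

Because the panes accumulate, each late firing re-emits an updated per-window count rather than only the late elements.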