This document provides an overview of pipeline deployment and highlights some of the operations you can perform on a deployed pipeline.
Run your pipeline
After you've created and tested your Apache Beam pipeline, run your pipeline. You can run your pipeline locally, which lets you test and debug your Apache Beam pipeline, or on Dataflow, a data processing system available for running Apache Beam pipelines.
Run locally
Run your pipeline locally:
Java
The following example code, taken from the quickstart, shows how to run the WordCount pipeline locally. To learn more, see how to run your Java pipeline locally.
In your terminal, run the following command:
mvn compile exec:java \
  -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Dexec.args="--output=counts"
Python
The following example code, taken from the quickstart, shows how to run the WordCount pipeline locally. To learn more, see how to run your Python pipeline locally.
In your terminal, run the following command:
python -m apache_beam.examples.wordcount \
  --output outputs
Go
The following example code, taken from the quickstart, shows how to run the WordCount pipeline locally. To learn more, see how to run your Go pipeline locally.
In your terminal, run the following command:
go run wordcount.go --input gs://dataflow-samples/shakespeare/kinglear.txt \
  --output outputs
Learn how to run your pipeline locally, on your machine, using the direct runner.
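For example, with the Python SDK you can select the direct runner explicitly on the command line; it is also the default runner when none is specified:
python -m apache_beam.examples.wordcount \
  --output outputs \
  --runner DirectRunner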
Run on Dataflow
Run your pipeline on Dataflow:
Java
The following example code, taken from the quickstart, shows how to run the WordCount pipeline on Dataflow. To learn more, see how to run your Java pipeline on Dataflow.
In your terminal, run the following command from your word-count-beam directory:
mvn -Pdataflow-runner compile exec:java \
  -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Dexec.args="--project=PROJECT_ID \
  --gcpTempLocation=gs://BUCKET_NAME/temp/ \
  --output=gs://BUCKET_NAME/output \
  --runner=DataflowRunner \
  --region=REGION"
Replace the following:
PROJECT_ID: your Google Cloud project ID
BUCKET_NAME: the name of your Cloud Storage bucket
REGION: a Dataflow regional endpoint, like us-central1
Python
The following example code, taken from the quickstart, shows how to run the WordCount pipeline on Dataflow. To learn more, see how to run your Python pipeline on Dataflow.
In your terminal, run the following command:
python -m apache_beam.examples.wordcount \
  --region DATAFLOW_REGION \
  --input gs://dataflow-samples/shakespeare/kinglear.txt \
  --output gs://STORAGE_BUCKET/results/outputs \
  --runner DataflowRunner \
  --project PROJECT_ID \
  --temp_location gs://STORAGE_BUCKET/tmp/
Replace the following:
DATAFLOW_REGION: the regional endpoint where you want to deploy the Dataflow job, for example europe-west1. The --region flag overrides the default region that is set in the metadata server, your local client, or environment variables.
STORAGE_BUCKET: the name of the Cloud Storage bucket that you copied earlier
PROJECT_ID: the Google Cloud project ID that you copied earlier
Go
The following example code, taken from the quickstart, shows how to run the WordCount pipeline on Dataflow. To learn more, see how to run your Go pipeline on Dataflow.
In your terminal, run the following command:
go run wordcount.go --input gs://dataflow-samples/shakespeare/kinglear.txt \
  --output gs://STORAGE_BUCKET/results/outputs \
  --runner dataflow \
  --project PROJECT_ID \
  --region DATAFLOW_REGION \
  --staging_location gs://STORAGE_BUCKET/binaries/
Replace the following:
STORAGE_BUCKET: the Cloud Storage bucket name
PROJECT_ID: the Google Cloud project ID
DATAFLOW_REGION: the regional endpoint where you want to deploy the Dataflow job, for example europe-west1. For a list of available locations, see Dataflow locations. The --region flag overrides the default region that is set in the metadata server, your local client, or environment variables.
Learn how to run your pipeline on the Dataflow service, using the Dataflow runner.
When you run your pipeline on Dataflow, Dataflow turns your Apache Beam pipeline code into a Dataflow job. Dataflow fully manages Google Cloud services for you, such as Compute Engine and Cloud Storage, to run your Dataflow job, and it automatically spins up and tears down the necessary resources. You can learn more about how Dataflow turns your Apache Beam code into a Dataflow job in Pipeline lifecycle.
Set pipeline options
You can control some aspects of how Dataflow runs your job by setting pipeline options in your Apache Beam pipeline code. For example, you can use pipeline options to set whether your pipeline runs on worker virtual machines, on the Dataflow service backend, or locally.
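For example, here is a minimal sketch, using the Python SDK, of setting pipeline options in code; the project, region, and bucket values are placeholders:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# A minimal sketch with placeholder values. Swap the runner for
# "DirectRunner" to run the same pipeline locally instead of on Dataflow.
options = PipelineOptions(
    runner="DataflowRunner",
    project="PROJECT_ID",
    region="us-central1",
    temp_location="gs://BUCKET_NAME/temp/",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Create words" >> beam.Create(["to", "be", "or", "not", "to", "be"])
        | "Count words" >> beam.combiners.Count.PerElement()
        | "Print" >> beam.Map(print)
    )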
Monitor your job
Dataflow provides visibility into your jobs through tools like the Dataflow monitoring interface and the Dataflow command-line interface.
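For example, the Dataflow command-line interface is available through the gcloud CLI. A minimal sketch of listing your jobs and inspecting one of them, where JOB_ID and DATAFLOW_REGION are placeholders:
gcloud dataflow jobs list --region=DATAFLOW_REGION
gcloud dataflow jobs describe JOB_ID --region=DATAFLOW_REGION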
Access worker VMs
You can view the VM instances for a given pipeline by using the Google Cloud console. From there, you can use SSH to access each instance. However, after your job either completes or fails, the Dataflow service automatically shuts down and cleans up the VM instances.
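For example, a sketch of connecting to a worker VM with the gcloud CLI, assuming you use the instance name and zone shown for your job in the Google Cloud console (all values here are placeholders):
gcloud compute ssh INSTANCE_NAME --zone=ZONE --project=PROJECT_ID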
Job optimizations
In addition to managing Google Cloud resources, Dataflow automatically performs and optimizes many aspects of distributed parallel processing for you.
Parallelization and distribution
Dataflow automatically partitions your data and distributes your worker code to Compute Engine instances for parallel processing. For more information, see parallelization and distribution.
Fusion and combine optimizations
Dataflow uses your pipeline code to create an execution graph that represents your pipeline's PCollections and transforms, and optimizes the graph for the most efficient performance and resource usage. Dataflow also automatically optimizes potentially costly operations, such as data aggregations. For more information, see Fusion optimization and Combine optimization.
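For example, a per-key aggregation like the following sketch in the Python SDK is the kind of step that Dataflow can combine-optimize, pre-combining values on each worker before the data is shuffled; the key-value elements are illustrative placeholders:
import apache_beam as beam

# A minimal sketch: a per-key sum is a potentially costly aggregation
# that Dataflow can optimize. Without a runner option, this runs locally.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create([("a", 1), ("b", 2), ("a", 3)])
        | "Sum per key" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )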
Automatic tuning features
The Dataflow service includes several features that provide on-the-fly adjustment of resource allocation and data partitioning. These features help Dataflow execute your job as quickly and efficiently as possible. These features include the following:
Streaming Engine
By default, the Dataflow pipeline runner executes the steps of your streaming pipeline entirely on worker virtual machines, consuming worker CPU, memory, and Persistent Disk storage. Dataflow's Streaming Engine moves pipeline execution out of the worker VMs and into the Dataflow service backend. For more information, see Streaming Engine.
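For example, assuming the Python SDK's --enable_streaming_engine pipeline option, a sketch of opting a streaming pipeline into Streaming Engine looks like the following; the project, region, and bucket values are placeholders:
from apache_beam.options.pipeline_options import PipelineOptions

# A minimal sketch, assuming the Python SDK's enable_streaming_engine
# option; placeholder project, region, and bucket values.
options = PipelineOptions(
    runner="DataflowRunner",
    project="PROJECT_ID",
    region="DATAFLOW_REGION",
    temp_location="gs://BUCKET_NAME/temp/",
    streaming=True,
    enable_streaming_engine=True,
)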
Dataflow Flexible Resource Scheduling
Dataflow FlexRS reduces batch processing costs by using advanced scheduling techniques, the Dataflow Shuffle service, and a combination of preemptible virtual machine (VM) instances and regular VMs. By running preemptible VMs and regular VMs in parallel, Dataflow improves the user experience if Compute Engine stops preemptible VM instances during a system event. FlexRS helps to ensure that the pipeline continues to make progress and that you do not lose previous work when Compute Engine preempts your preemptible VMs. For more information about FlexRS, see Using Flexible Resource Scheduling in Dataflow.
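For example, assuming the Python SDK's --flexrs_goal option, a sketch of submitting the earlier batch WordCount job with FlexRS enabled looks like the following; the placeholders are the same as before:
python -m apache_beam.examples.wordcount \
  --region DATAFLOW_REGION \
  --input gs://dataflow-samples/shakespeare/kinglear.txt \
  --output gs://STORAGE_BUCKET/results/outputs \
  --runner DataflowRunner \
  --project PROJECT_ID \
  --temp_location gs://STORAGE_BUCKET/tmp/ \
  --flexrs_goal COST_OPTIMIZED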
Dataflow Shielded VM
Starting on June 1, 2022, the Dataflow service uses Shielded VM for all workers. To learn more about Shielded VM capabilities, see Shielded VM.