Dataflow is a fully managed service for running stream and batch data processing pipelines. It provides automated provisioning and management of compute resources and delivers consistent, reliable, exactly-once processing of your data. Dataflow supports many data processing use cases, including stream analytics, real-time AI, sensor and log data processing, and other workflows involving data transformation.
Create a pipeline or use a template
To create a data pipeline for Dataflow, you use Apache Beam, an open source unified model for defining both batch and streaming data processing pipelines. The Apache Beam programming model simplifies the mechanics of large-scale parallel data processing. Using one of the Apache Beam SDKs, you build a program that defines your pipeline. Then you run the pipeline on Dataflow.
The Apache Beam model provides useful abstractions that hide from you the low-level details of distributed processing, such as coordinating individual workers and sharding datasets. Dataflow fully manages these low-level details. This lets you concentrate on the logical composition of your data processing job, rather than the physical orchestration of parallel processing. You can focus on what your job needs to do instead of exactly how that job is executed.
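As a minimal sketch of this model, the Beam Python pipeline below counts words in a public sample file and is submitted to Dataflow simply by selecting the Dataflow runner in its pipeline options; the project, region, and bucket values are placeholders you would replace with your own.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values; replace with your own project, region, and bucket.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://dataflow-samples/shakespeare/kinglear.txt")
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "Count" >> beam.combiners.Count.PerElement()
        | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/wordcount")
    )
```

The same code runs locally if you omit the Dataflow-specific options, which is why the logical pipeline definition stays separate from where it executes.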
If you don't want to author your own pipeline, you can use one of the Google-provided Dataflow templates. These templates define common data transformations, such as reading streaming data from Pub/Sub and writing it to Cloud Storage, or reading streaming data from Apache Kafka and writing it to BigQuery. You can also create your own custom Dataflow templates to share your pipelines across a team or organization.
Dataflow templates allow you to package a Dataflow pipeline for deployment. Anyone with the correct permissions can then use the template to deploy a pipeline. Templates separate pipeline design from deployment. For example, a developer can create a template, and a data scientist can deploy the template later, using parameters to customize the job at runtime.
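As a hedged example, a Google-provided classic template such as the Pub/Sub-to-Cloud Storage text template can be launched through the Dataflow REST API with the Python API client. The template path and parameter names below follow that template's documentation, and the project, topic, and bucket values are placeholders.

```python
from googleapiclient.discovery import build

# Placeholder values; replace with your own project, region, topic, and bucket.
project = "my-project"
region = "us-central1"

dataflow = build("dataflow", "v1b3")
request = dataflow.projects().locations().templates().launch(
    projectId=project,
    location=region,
    # Google-provided classic template: stream Pub/Sub messages to text files.
    gcsPath="gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text",
    body={
        "jobName": "pubsub-to-gcs-example",
        "parameters": {
            "inputTopic": f"projects/{project}/topics/my-topic",
            "outputDirectory": "gs://my-bucket/output/",
            "outputFilenamePrefix": "events-",
        },
    },
)
response = request.execute()
print(response["job"]["id"])
```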
Dataflow features
Dataflow provides many features to help you run secure, reliable, cost-effective data pipelines at scale. This section outlines some of the things you can do with Dataflow.
Scale out with Horizontal Autoscaling
With Horizontal Autoscaling enabled, Dataflow automatically chooses the appropriate number of worker instances required to run your job. Dataflow can also dynamically reallocate workers during runtime, adding or removing them to account for the characteristics of your job.
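For example, with the Beam Python SDK you can bound how far Horizontal Autoscaling scales through worker options; the worker counts below are illustrative.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative values; Dataflow scales between the initial and maximum worker counts.
options = PipelineOptions(
    runner="DataflowRunner",
    autoscaling_algorithm="THROUGHPUT_BASED",  # autoscaling mode
    num_workers=2,        # initial worker count
    max_num_workers=50,   # upper bound for Horizontal Autoscaling
)
```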
Scale up with Vertical Autoscaling
Vertical Autoscaling enables Dataflow to dynamically scale the memory available to workers up or down to fit the requirements of the job. It's designed to make jobs resilient to out-of-memory errors and to maximize pipeline efficiency. Dataflow monitors your pipeline, detects situations where workers have too little or too much memory, and replaces them with new workers that have more or less memory.
Run serverless pipelines
You can run serverless pipelines using Dataflow Prime. Dataflow Prime is a serverless data processing platform based on Dataflow. Dataflow Prime uses a compute and state-separated architecture and includes features designed to improve efficiency and increase productivity. Pipelines using Dataflow Prime benefit from automated and optimized resource management, reduced operational costs, and improved diagnostics capabilities.
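A sketch of opting in to Dataflow Prime from the Beam Python SDK, assuming a pipeline already configured for the Dataflow runner, is to pass the enable_prime service option; Vertical Autoscaling is then managed by the service for Prime jobs.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Opt the job in to Dataflow Prime.
options = PipelineOptions(
    runner="DataflowRunner",
    dataflow_service_options=["enable_prime"],
)
```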
Optimize resources by pipeline stage
Right Fitting, a Dataflow Prime feature, creates a separate pool of resources for each pipeline stage, optimized for that stage's requirements, which reduces resource waste.
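Right fitting is driven by Apache Beam resource hints on individual transforms. As a sketch, a memory-intensive step might request more RAM than the default worker shape; the hint value and transform below are placeholders, and the hints take effect when the job runs on Dataflow Prime.

```python
import apache_beam as beam

def score(record):
    # Placeholder for a memory-intensive transformation.
    return record

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([{"id": 1}, {"id": 2}])
        # The resource hint lets Dataflow Prime right-fit this stage,
        # for example by giving its workers more memory.
        | "Score" >> beam.Map(score).with_resource_hints(min_ram="16GB")
    )
```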
Monitor jobs
Use the monitoring interface to see and interact with Dataflow jobs. The monitoring interface shows a list of your Dataflow jobs, a graphical representation of each pipeline, details about each job's status, links to info about the Google Cloud services running your pipeline, any errors or warnings that occur during a job, and additional diagnostics and metrics.
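Besides the console UI, job status can also be read programmatically. The sketch below assumes the google-api-python-client library and placeholder project and region values, and lists jobs with their current state through the Dataflow REST API.

```python
from googleapiclient.discovery import build

# Placeholder values; replace with your own project and region.
project = "my-project"
region = "us-central1"

dataflow = build("dataflow", "v1b3")
response = dataflow.projects().locations().jobs().list(
    projectId=project, location=region
).execute()

# Print each job's name and current state (for example, JOB_STATE_RUNNING).
for job in response.get("jobs", []):
    print(job["name"], job["currentState"])
```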
Visualize job performance
Job Visualizer, part of the Execution details tab in the Dataflow console, lets you see performance metrics for a Dataflow job and optimize the job's performance by finding inefficient code, including parallelization bottlenecks. You can also see the list of steps associated with each stage of the pipeline.
Separate streaming resources from storage
Streaming Engine separates compute from state storage for streaming pipelines. It moves parts of pipeline execution out of the worker VMs and into the Dataflow service, which significantly improves autoscaling and reduces data latency.
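In the Beam Python SDK, a streaming job can opt in to Streaming Engine with a single pipeline option (in some regions and configurations it is already the default); this is a minimal sketch.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Streaming job that offloads state storage and shuffle to the Streaming Engine service.
options = PipelineOptions(
    runner="DataflowRunner",
    streaming=True,
    enable_streaming_engine=True,
)
```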
Shuffle data efficiently
Dataflow Shuffle moves the shuffle operation, used for grouping and joining data, out of the worker VMs and into the Dataflow service for batch pipelines. Batch pipelines scale seamlessly to hundreds of terabytes, without any tuning required.
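Dataflow Shuffle is the default for batch jobs in supported regions. Where it needs to be requested explicitly, one way (a sketch, using the experiments pipeline option) is shown below.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Batch job that runs grouping and joining in the Dataflow Shuffle service
# instead of on worker VM disks.
options = PipelineOptions(
    runner="DataflowRunner",
    experiments=["shuffle_mode=service"],
)
```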
Reduce batch processing costs
Flexible Resource Scheduling (FlexRS) reduces batch processing costs by using advanced scheduling techniques, the Dataflow Shuffle service, and a combination of preemptible VM instances and regular VMs.
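FlexRS is requested through a single pipeline option; as a sketch with the Beam Python SDK:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# FlexRS batch job: Dataflow may delay the job start to schedule it cost-effectively.
options = PipelineOptions(
    runner="DataflowRunner",
    flexrs_goal="COST_OPTIMIZED",  # or "SPEED_OPTIMIZED"
)
```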
Run pipelines from notebooks
Iteratively build pipelines from the ground up with Vertex AI notebooks and run the jobs on Dataflow. You can author Apache Beam pipelines step-by-step by inspecting pipeline graphs in a read-eval-print-loop (REPL) workflow. Using notebooks, you can write pipelines in an intuitive environment with the latest data science and machine learning frameworks.
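Inside a notebook, the Apache Beam interactive runner supports this REPL-style workflow; the sketch below builds a small in-memory pipeline and inspects an intermediate PCollection.

```python
import apache_beam as beam
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

# Build the pipeline incrementally with the interactive runner.
pipeline = beam.Pipeline(InteractiveRunner())
words = pipeline | beam.Create(["dataflow", "beam", "dataflow"])
counts = words | beam.combiners.Count.PerElement()

# Inspect intermediate results directly in the notebook.
ib.show(counts)          # renders the PCollection
df = ib.collect(counts)  # materializes it as a pandas DataFrame
```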
Get smart recommendations
Optimize pipelines based on recommendations informed by machine learning. The recommendations can help you improve job performance, reduce cost, and troubleshoot errors.
Protect pipelines with customer-managed encryption keys
A customer-managed encryption key (CMEK) enables encryption of data at rest with a key that you can control through Cloud KMS. You can create a batch or streaming pipeline that is protected with a CMEK, and you can access CMEK-protected data in sources and sinks.
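In the Beam Python SDK, the key is passed as a pipeline option; the key path below is a placeholder in the standard Cloud KMS resource format.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder Cloud KMS key; Dataflow uses it to protect pipeline data at rest.
options = PipelineOptions(
    runner="DataflowRunner",
    dataflow_kms_key=(
        "projects/my-project/locations/us-central1/"
        "keyRings/my-keyring/cryptoKeys/my-key"
    ),
)
```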
Specify networks and subnetworks
Dataflow's integration with VPC Service Controls provides additional security for your data processing environment by improving your ability to mitigate the risk of data exfiltration. You can specify a network, a subnetwork, or both when you run Dataflow jobs.
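For example, with the Beam Python SDK the network and subnetwork are worker options; the names below are placeholders, and the subnetwork uses the regional resource path form.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder VPC network and subnetwork for the Dataflow workers.
options = PipelineOptions(
    runner="DataflowRunner",
    network="my-vpc",
    subnetwork="regions/us-central1/subnetworks/my-subnet",
)
```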
Configure private IPs
Turning off public IPs allows you to better secure your data processing infrastructure. By not using public IP addresses for your Dataflow workers, you also lower the number of public IP addresses you consume against your Google Cloud project quota.
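With the Beam Python SDK, public IPs are turned off through a worker option, as in the sketch below; the workers then need Private Google Access or equivalent routing on their subnetwork to reach Google APIs.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Workers get only internal IP addresses; outbound access must be provided
# by the VPC (for example, Private Google Access on the subnetwork).
options = PipelineOptions(
    runner="DataflowRunner",
    use_public_ips=False,
)
```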
Get started
To get started with Dataflow, try one of the quickstarts:
- Create a Dataflow pipeline using Java
- Create a Dataflow pipeline using Python
- Create a Dataflow pipeline using Go
- Create a streaming pipeline using a Dataflow template