Dataflow
Fast, unified stream and batch data processing

Dataflow is a fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing. With its serverless approach to resource provisioning and management, you have access to virtually limitless capacity to solve your biggest data processing challenges, while paying only for what you use.

  • Automated provisioning and management of processing resources
  • Horizontal autoscaling of worker resources to maximize resource utilization
  • Unified streaming and batch programming model
  • OSS community-driven innovation with the Apache Beam SDK (see the pipeline sketch after this list)
  • Reliable and consistent exactly-once processing
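
As a rough illustration of the unified model, the sketch below builds one Beam pipeline (Python SDK) whose transform chain is identical for batch and streaming; only the source changes. Project, bucket, and path names are placeholders.

```python
# A minimal unified pipeline sketch (Apache Beam Python SDK): the transform
# chain is the same for batch and streaming; only the source changes.
# Project, bucket, and path names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",              # use "DirectRunner" to test locally
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
        # For streaming, swap the source, e.g.:
        # | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "KeyByUser" >> beam.Map(lambda fields: (fields[0], 1))
        | "CountPerUser" >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda user, n: f"{user},{n}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts")
    )
```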

Fast streaming data analytics

Dataflow enables fast, simplified streaming data pipeline development with lower data latency.

Simplify operations and management

Allow teams to focus on programming instead of managing server clusters as Dataflow’s serverless approach removes operational overhead from data engineering workloads.

Reduce total cost of ownership

Resource autoscaling paired with cost-optimized batch processing capabilities means Dataflow offers virtually limitless capacity to manage your seasonal and spiky workloads without overspending.

Key features

Automated resource management and dynamic work rebalancing

Dataflow automates provisioning and management of processing resources to minimize latency and maximize utilization so that you do not need to spin up instances or reserve them by hand. Work partitioning is also automated and optimized to dynamically rebalance lagging work. No need to chase down “hot keys” or preprocess your input data.

Horizontal autoscaling

Horizontal autoscaling of worker resources for optimum throughput results in better overall price-to-performance.

Flexible resource scheduling pricing for batch processing

For processing with flexibility in job scheduling time, such as overnight jobs, flexible resource scheduling (FlexRS) offers a lower price for batch processing. These flexible jobs are placed into a queue with a guarantee that they will be retrieved for execution within a six-hour window.
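
For illustration, a FlexRS batch job can be requested from the Python SDK with the flexrs_goal pipeline option; the job below is a minimal sketch, and the project and bucket names are placeholders.

```python
# A minimal FlexRS batch job sketch (Python SDK): the flexrs_goal option
# requests the delayed, cost-optimized scheduling described above.
# Project and bucket names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    flexrs_goal="COST_OPTIMIZED",  # queue the job for discounted execution
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/nightly/*.csv")
        | "Count" >> beam.combiners.Count.Globally()
        | "Print" >> beam.Map(print)
    )
```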

Customer stories

Highlights

  • Synthesized 30+ years of unstructured news data to assess qualitative business impact of key events

  • Defined complex network graphs to uncover hidden relationships and insights

  • Prototype Knowledge Graph delivered with ease in 10 weeks


Documentation

Tutorial
Dataflow quickstart using Python

Set up your Google Cloud project and Python development environment, get the Apache Beam SDK, and run and modify the WordCount example on the Dataflow service.
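
A condensed version of that WordCount, as a sketch; the full example ships with the SDK as apache_beam.examples.wordcount.

```python
# A condensed WordCount in the spirit of the quickstart. Install the SDK with
# `pip install 'apache-beam[gcp]'`, run locally first, then set the runner
# to DataflowRunner.
import re

import apache_beam as beam

with beam.Pipeline() as p:  # defaults to the local DirectRunner
    (
        p
        | "Read" >> beam.io.ReadFromText(
            "gs://dataflow-samples/shakespeare/kinglear.txt")
        | "Split" >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
        | "Count" >> beam.combiners.Count.PerElement()
        | "Format" >> beam.MapTuple(lambda word, n: f"{word}: {n}")
        | "Write" >> beam.io.WriteToText("counts")
    )
```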

Tutorial
Using Dataflow SQL

Create a SQL query and deploy a Dataflow job to run your SQL query from the Dataflow SQL UI.

Tutorial
Installing the Apache Beam SDK

Install the Apache Beam SDK so that you can run your pipelines on the Dataflow service.

Tutorial
Machine learning with Apache Beam and TensorFlow

Preprocess, train, and make predictions on a molecular energy machine learning model, using Apache Beam, Dataflow, and TensorFlow.
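
As a sketch of the preprocessing stage only (not the tutorial's actual code), a Beam pipeline can parse raw records into tf.train.Example protos and write TFRecords for TensorFlow training; the field names and paths below are placeholders.

```python
# A sketch of Beam-based preprocessing for TensorFlow: parse raw CSV records
# into tf.train.Example protos and write TFRecords for training.
# Field names and paths are placeholders.
import apache_beam as beam
import tensorflow as tf

def to_example(line):
    """Turn a 'feature,label' CSV line into a serialized tf.train.Example."""
    feature_val, label = line.split(",")
    example = tf.train.Example(features=tf.train.Features(feature={
        "feature": tf.train.Feature(
            float_list=tf.train.FloatList(value=[float(feature_val)])),
        "label": tf.train.Feature(
            float_list=tf.train.FloatList(value=[float(label)])),
    }))
    return example.SerializeToString()

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/*.csv")
        | "ToExample" >> beam.Map(to_example)
        | "Write" >> beam.io.WriteToTFRecord("gs://my-bucket/train/data")
    )
```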

Common use cases

Stream analytics

Stream analytics from Google Cloud makes data more organized, useful, and accessible from the instant it’s generated. Built on the autoscaling infrastructure of Dataflow along with Pub/Sub and BigQuery, our streaming solution provisions the resources you need to ingest, process, and analyze fluctuating volumes of real-time data for real-time business insights. This abstracted provisioning reduces complexity and makes stream analytics accessible to both data analysts and data engineers.

[Architecture diagram: Ingest → Enrich → Analyze → Activate. Events from edge, mobile, web, and IoT sources are pushed to a Pub/Sub topic; a streaming or batch Dataflow job (Apache Beam SDK) enriches the stream, with batch Dataflow handling backfill/reprocessing; BigQuery, Bigtable, and AI Platform analyze it; Data Studio, third-party BI tools, and Cloud Functions activate the results. Creation flow: configure the source to push event messages to a Pub/Sub topic; create the topic and subscription; deploy the Dataflow job using templates, the CLI, or notebooks; create the dataset, tables, and models that receive the stream; build real-time dashboards and call external APIs.]
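
A minimal sketch of the ingest-enrich-analyze path above, assuming a JSON event stream; the topic, table, and field names are placeholders.

```python
# A sketch of the ingest → enrich → analyze path: read JSON events from
# Pub/Sub, count them per type in one-minute windows, and stream the
# aggregates into BigQuery. Topic, table, and field names are placeholders.
import json

import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # add runner/project/region for Dataflow

with beam.Pipeline(options=options) as p:
    (
        p
        | "Ingest" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/events")
        | "Parse" >> beam.Map(json.loads)
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        | "KeyByType" >> beam.Map(lambda event: (event["event_type"], 1))
        | "CountPerType" >> beam.CombinePerKey(sum)
        | "ToRow" >> beam.MapTuple(
            lambda event_type, n: {"event_type": event_type, "event_count": n})
        | "Analyze" >> beam.io.WriteToBigQuery(
            "my-project:analytics.event_counts",
            schema="event_type:STRING,event_count:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```
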
Sensor and log data processing

Unlock business insights from your global device network with an intelligent IoT platform.

Real-time AI

Dataflow brings streaming events to Google Cloud’s AI Platform and TensorFlow Extended (TFX) to enable predictive analytics, fraud detection, real-time personalization, and other advanced analytics use cases. TFX uses Dataflow and Apache Beam as the distributed data processing engine to enable several aspects of the ML life cycle, all supported with CI/CD for ML through Kubeflow Pipelines.
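
As one illustrative pattern (not TFX's own pipeline code), newer Beam releases include a RunInference transform that applies a trained model to a stream; the model path and topic below are placeholders.

```python
# A sketch of streaming inference with Beam's RunInference transform
# (available in newer Beam releases). Illustrative pattern only;
# the SavedModel path and topic are placeholders.
import json

import numpy as np
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.tensorflow_inference import TFModelHandlerNumpy
from apache_beam.options.pipeline_options import PipelineOptions

# Points at a TensorFlow SavedModel directory (placeholder path).
model_handler = TFModelHandlerNumpy("gs://my-bucket/models/fraud/")

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/transactions")
        | "ToFeatures" >> beam.Map(
            lambda msg: np.array(json.loads(msg)["features"], dtype=np.float32))
        | "Predict" >> RunInference(model_handler)  # batches inputs, runs the model
        | "Emit" >> beam.Map(lambda result: print(result.inference))
    )
```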

All features

Autoscaling: Autoscaling lets the Dataflow service automatically choose the appropriate number of worker instances required to run your job. The Dataflow service may also dynamically reallocate more or fewer workers during runtime to account for the characteristics of your job. (See the options sketch after this list.)
Streaming Engine: Streaming Engine separates compute from state storage and moves parts of pipeline execution out of the worker VMs and into the Dataflow service back end, significantly improving autoscaling and data latency.
Dataflow Shuffle: Service-based Dataflow Shuffle moves the shuffle operation, used for grouping and joining data, out of the worker VMs and into the Dataflow service back end for batch pipelines. Batch pipelines scale seamlessly, without any tuning required, into hundreds of terabytes.
Dataflow SQL: Dataflow SQL lets you use your SQL skills to develop streaming Dataflow pipelines right from the BigQuery web UI. You can join streaming data from Pub/Sub with files in Cloud Storage or tables in BigQuery, write results into BigQuery, and build real-time dashboards using Google Sheets or other BI tools.
Flexible Resource Scheduling (FlexRS): Dataflow FlexRS reduces batch processing costs by using advanced scheduling techniques, the Dataflow Shuffle service, and a combination of preemptible virtual machine (VM) instances and regular VMs.
Dataflow templates: Dataflow templates let you easily share your pipelines with team members and across your organization, or take advantage of many Google-provided templates to implement simple but useful data processing tasks.
Inline monitoring: Dataflow inline monitoring lets you interact with your jobs and directly access job metrics. You can also set up alerts for conditions such as stale data and high system latency.
Customer-managed encryption keys: You can create a batch or streaming pipeline that is protected with a customer-managed encryption key (CMEK), or access CMEK-protected data in sources and sinks.
Dataflow VPC Service Controls: Dataflow’s integration with VPC Service Controls provides additional security for your data processing environment by improving your ability to mitigate the risk of data exfiltration.
Private IPs: Turning off public IPs allows you to better secure your data processing infrastructure. By not using public IP addresses for your Dataflow workers, you also lower the number of public IP addresses consumed against your Google Cloud project quota.
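
As a sketch of how several of these features are switched on from the Python SDK, assuming placeholder project, bucket, subnetwork, and key names:

```python
# A sketch of turning on several of the features above via pipeline options
# in the Python SDK; project, bucket, subnetwork, and key names are
# placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    enable_streaming_engine=True,               # Streaming Engine
    autoscaling_algorithm="THROUGHPUT_BASED",   # horizontal autoscaling
    use_public_ips=False,                       # private-IP workers
    subnetwork="regions/us-central1/subnetworks/my-subnet",
    dataflow_kms_key=(                          # customer-managed encryption key
        "projects/my-project/locations/us-central1/"
        "keyRings/my-ring/cryptoKeys/my-key"
    ),
)
```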

Pricing

Dataflow jobs are billed in per-second increments, based on the actual use of Dataflow batch or streaming workers. Jobs that consume additional Google Cloud resources, such as Cloud Storage or Pub/Sub, are each billed per that service’s pricing.

View pricing details

Partners

Google Cloud partners and third-party developers have built integrations with Dataflow that make it quick and easy to run powerful data processing tasks at any scale.