Data Transformation with Cloud Dataflow
- Automated Resource Management
- Cloud Dataflow automates provisioning and management of processing resources to minimize latency and maximize utilization; no more spinning up instances by hand or reserving them.
- Dynamic Work Rebalancing
- Automated and optimized work partitioning dynamically rebalances lagging work. No need to chase down “hot keys” or pre-process your input data.
- Reliable & Consistent Exactly-once Processing
- Cloud Dataflow provides built-in support for fault-tolerant execution that is consistent and correct regardless of data size, cluster size, processing pattern, or pipeline complexity.
- Horizontal Auto-scaling
- Horizontal auto-scaling of worker resources for optimum throughput results in better overall price-to-performance.
- Unified Programming Model
- The Apache Beam SDK offers equally rich MapReduce-like operations, powerful data windowing, and fine-grained correctness control for streaming and batch data alike.
- Community-driven Innovation
- Developers wishing to extend the Cloud Dataflow programming model can fork and/or contribute to Apache Beam.
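To make the "unified programming model" idea concrete, here is a minimal sketch in plain Python. It deliberately does not use the Apache Beam API. The names `window_start`, `windowed_sums`, `batch`, and `stream` are hypothetical and exist only for this illustration. It shows one fixed-window, per-key sum applied unchanged to a bounded list and to events that arrive one at a time and out of order.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Illustrative event shape (an assumption): (key, value, event time in seconds).
Event = Tuple[str, int, float]


def window_start(timestamp: float, size: float) -> float:
    """Return the start of the fixed window that contains `timestamp`."""
    return timestamp - (timestamp % size)


def windowed_sums(events: Iterable[Event], size: float) -> Dict[Tuple[str, float], int]:
    """Sum values per (key, window start): a map step followed by a per-key reduce."""
    totals: Dict[Tuple[str, float], int] = defaultdict(int)
    for key, value, timestamp in events:
        totals[(key, window_start(timestamp, size))] += value
    return dict(totals)


# Bounded ("batch") input: every event is available up front.
batch = [("a", 1, 0.0), ("b", 2, 10.0), ("a", 3, 59.0), ("a", 4, 61.0)]


def stream() -> Iterator[Event]:
    """Stream-style input: the same events, yielded one at a time and out of order."""
    yield ("a", 4, 61.0)
    yield ("a", 1, 0.0)
    yield ("a", 3, 59.0)
    yield ("b", 2, 10.0)


# The same function handles both inputs, with 60-second windows.
batch_result = windowed_sums(batch, size=60.0)
stream_result = windowed_sums(stream(), size=60.0)
print(batch_result == stream_result)
```

Because events are grouped by event time rather than arrival order, both inputs give the same totals. This sketch only emits results once its finite input is exhausted. A real streaming pipeline must also decide when a window is complete enough to emit, which is what Beam's watermarks and triggers address.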
Cloud Dataflow vs. Cloud Dataproc: Which should you use?
|WORKLOADS||CLOUD DATAPROC||CLOUD DATAFLOW|
|Stream processing (ETL)|| ||check|
|Batch processing (ETL)||check||check|
|Iterative processing and notebooks||check|| |
|Machine learning with Spark ML||check|| |
|Preprocessing for machine learning|| ||check (with Cloud ML Engine)|
“Running our pipelines on Cloud Dataflow lets us focus on programming without having to worry about deploying and maintaining instances running our code (a hallmark of GCP overall).” - Jibran Saithi, Lead Architect, Qubit
|Cloud Dataflow Worker Type||vCPU||Storage - Standard Persistent Disk||Storage - SSD Persistent Disk||Shuffle Data Processed [3]|
3 Service-based Cloud Dataflow Shuffle is currently available in beta for batch pipelines in the us-central1 (Iowa) and europe-west1 (Belgium) regions only. It will become available in other regions in the future.
4 See Cloud Dataflow Pricing for more information about Shuffle Data Processed.