Cloud Dataproc

A faster, easier, more cost-effective way to run Apache Spark and Apache Hadoop


Cloud-native Apache Hadoop & Apache Spark

Cloud Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way. Operations that used to take hours or days take seconds or minutes instead, and you pay only for the resources you use (with per-second billing). Cloud Dataproc also easily integrates with other Google Cloud Platform (GCP) services, giving you a powerful and complete platform for data processing, analytics, and machine learning.


Fast & Scalable Data Processing

Create Cloud Dataproc clusters quickly and resize them at any time—from three to hundreds of nodes—so you don't have to worry about your data pipelines outgrowing your clusters. You have more time to focus on insights, with less time lost to infrastructure—each cluster action takes less than 90 seconds on average.
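For example, a cluster can be created and later resized with the Cloud SDK; the cluster name and region below are placeholder values:

```shell
# Create a Cloud Dataproc cluster with 3 worker nodes
# ("my-cluster" and the region are illustrative placeholders).
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --num-workers=3

# Resize the same cluster to 10 workers at any time.
gcloud dataproc clusters update my-cluster \
    --region=us-central1 \
    --num-workers=10
```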


Affordable Pricing

Adopting Google Cloud Platform pricing principles, Cloud Dataproc has a low cost and an easy-to-understand price structure, based on actual use and measured by the second. Also, Cloud Dataproc clusters can include lower-cost preemptible instances, giving you powerful clusters at an even lower total cost.
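As a sketch of what per-second billing means in practice, suppose a hypothetical rate of $0.01 per vCPU per hour (an illustrative number, not the published price): a 32-vCPU cluster run for 45 minutes is billed for exactly 2,700 seconds, not rounded up to a full hour.

```shell
# Illustrative per-second billing arithmetic.
# RATE is a hypothetical fee per vCPU-hour, not the published price.
RATE=0.01
VCPUS=32        # total vCPUs across the cluster
SECS=2700       # 45 minutes, billed per second
awk -v r="$RATE" -v v="$VCPUS" -v s="$SECS" \
    'BEGIN { printf "$%.2f\n", r * v * s / 3600 }'
# prints $0.24
```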


Open Source Ecosystem

You can use Spark and Hadoop tools, libraries, and documentation with Cloud Dataproc. Cloud Dataproc provides frequent updates to native versions of Spark, Hadoop, Pig, and Hive, so you can get started without learning new tools or APIs, and move existing projects or ETL pipelines without redevelopment.
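Because standard Spark tooling is preserved, an existing Spark job can be submitted unchanged; for example, running the SparkPi class from the example jar that ships with Spark (the cluster name is a placeholder):

```shell
# Submit an existing Spark job to a running cluster without modification.
# "my-cluster" is a placeholder; the example jar ships with Spark.
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000
```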


Cloud Dataproc Features

Cloud Dataproc is a managed Apache Spark and Apache Hadoop service that is fast, easy to use, and low cost.

Automated Cluster Management
Managed deployment, logging, and monitoring let you focus on your data, not on your cluster. Cloud Dataproc clusters are stable, scalable, and speedy.
Resizable Clusters
Create and scale clusters quickly with various virtual machine types, disk sizes, number of nodes, and networking options.
Integrated
Built-in integration with Cloud Storage, BigQuery, Bigtable, Stackdriver Logging, and Stackdriver Monitoring gives you a complete and robust data platform.
Versioning
Image versioning allows you to switch between different versions of Apache Spark, Apache Hadoop, and other tools.
Highly Available
Run clusters with multiple master nodes and set jobs to restart on failure to ensure your clusters and jobs are highly available.
Developer Tools
Multiple ways to manage a cluster, including an easy-to-use Web UI, the Cloud SDK, RESTful APIs, and SSH access.
Initialization Actions
Run initialization actions to install or customize the settings and libraries you need when your cluster is created.
Automatic or Manual Configuration
Cloud Dataproc automatically configures hardware and software on clusters for you while also allowing for manual control.
Flexible Virtual Machines
Clusters can use custom machine types and preemptible virtual machines so they are the perfect size for your needs.
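Several of these features can be combined in a single cluster-creation command; a sketch (the bucket path, script name, and machine types below are illustrative, not prescribed values):

```shell
# Create a cluster combining custom machine types, preemptible
# workers, and an initialization action.
# Bucket path, script name, and machine types are placeholders.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-highmem-8 \
    --num-workers=3 \
    --num-preemptible-workers=4 \
    --initialization-actions=gs://my-bucket/install-libs.sh
```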

Cloud Dataflow versus Cloud Dataproc: Which should you use?

Cloud Dataproc and Cloud Dataflow can both be used for data processing, and there’s overlap in their batch and streaming capabilities. How do you decide which product is a better fit for your environment?

Cloud Dataproc

Cloud Dataproc is good for environments dependent on specific Apache big data components:

  • Tools/packages
  • Pipelines
  • Skill sets of existing resources

Cloud Dataflow

Cloud Dataflow is typically the preferred option for greenfield environments:

  • Less operational overhead
  • Unified approach to development of batch or streaming pipelines
  • Uses Apache Beam
  • Supports pipeline portability across Cloud Dataflow, Apache Spark, and Apache Flink as runtimes

Recommended Workloads

WORKLOADS                                              CLOUD DATAPROC   CLOUD DATAFLOW
Stream processing (ETL)                                                        ✓
Batch processing (ETL)                                        ✓                ✓
Iterative processing and notebooks                            ✓
Machine learning with Spark ML                                ✓
Preprocessing for machine learning (with AI Platform)                          ✓

Cloud Dataproc Pricing

Cloud Dataproc incurs a small incremental fee per virtual CPU in the Compute Engine instances used in your cluster.¹

Machine Type            Virtual CPUs
Standard Machines       1-64
High Memory Machines    2-64
High CPU Machines       2-64
Custom Machines         Based on vCPU and memory usage
If you pay in a currency other than USD, the prices listed in your currency on Cloud Platform SKUs apply.

¹ Cloud Dataproc incurs a small incremental fee per virtual CPU in the Compute Engine instances used in your cluster while the cluster is operational. Other resources used by Cloud Dataproc, including Compute Engine network, BigQuery, and Cloud Bigtable, are billed as they are consumed. For detailed pricing information, view the pricing guide.
