Dataproc

A faster, easier, more cost-effective way to run Apache Spark and Apache Hadoop

Try It Free

Cloud-native Apache Hadoop & Apache Spark

Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way. Operations that used to take hours or days complete in seconds or minutes, and you pay only for the resources you use, with per-second billing. Dataproc also integrates easily with other Google Cloud Platform (GCP) services, giving you a complete and powerful platform for data processing, analytics, and machine learning.
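To make that concrete, here is a minimal sketch of cluster creation with the google-cloud-dataproc Python client library; the project ID, region, cluster name, and machine types are placeholders, not recommendations.

```python
# Minimal sketch: create a Dataproc cluster with the Python client.
# Project, region, cluster name, and machine types are placeholders.
from google.cloud import dataproc_v1

project_id = "your-project-id"   # placeholder
region = "us-central1"

# Dataproc clients must target the regional endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "example-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Cluster ready: {operation.result().cluster_name}")
```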

Fast & Scalable Data Processing

Create Dataproc clusters quickly and resize them at any time—from three to hundreds of nodes—so you don't have to worry about your data pipelines outgrowing your clusters. You have more time to focus on insights, with less time lost to infrastructure—each cluster action takes less than 90 seconds on average.
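As an illustration, a resize is a single in-place update; a sketch with the Python client, where the project and cluster names are placeholders:

```python
# Sketch: grow an existing cluster's worker pool to 10 nodes.
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

operation = client.update_cluster(
    request={
        "project_id": "your-project-id",     # placeholder
        "region": region,
        "cluster_name": "example-cluster",   # placeholder
        "cluster": {"config": {"worker_config": {"num_instances": 10}}},
        # Only the field named in the mask is modified.
        "update_mask": {"paths": ["config.worker_config.num_instances"]},
    }
)
operation.result()  # blocks until the resize completes
```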

Affordable Pricing

In keeping with Google Cloud Platform pricing principles, Dataproc has a low cost and an easy-to-understand price structure based on actual use, measured by the second. Dataproc clusters can also include lower-cost preemptible instances, committed use discounts, and sustained use discounts, giving you powerful clusters at an even lower total cost.
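As a rough, hypothetical illustration of the incremental fee, assuming the published rate of $0.010 per vCPU-hour (Compute Engine VM and disk charges, and any discounts, are billed separately):

```python
# Back-of-the-envelope Dataproc fee, assuming the published $0.010/vCPU-hour
# rate. Compute Engine VM and disk costs are billed separately.
vcpus_per_node = 4   # e.g., n1-standard-4
nodes = 6            # 1 master + 5 workers
hours = 0.5          # a 30-minute job, billed per second

dataproc_fee = 0.010 * vcpus_per_node * nodes * hours
print(f"Dataproc fee: ${dataproc_fee:.2f}")   # -> $0.12
```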

Open Source Ecosystem

You can use Spark and Hadoop tools, libraries, and documentation with Dataproc. Dataproc provides frequent updates to native versions of Spark, Hadoop, Pig, and Hive, so you can get started without the need to learn new tools or APIs, and move existing projects or ETL pipelines without redevelopment.

Dataproc Features

Dataproc is a managed Apache Spark and Apache Hadoop service that is fast, easy to use, and low cost.

Automated Cluster Management
Managed deployment, logging, and monitoring let you focus on your data, not on your cluster. Dataproc clusters are stable, scalable, and speedy.
Resizable Clusters
Create and scale clusters quickly with various virtual machine types, disk sizes, number of nodes, and networking options.
Autoscaling Clusters
Dataproc Autoscaling automates cluster resource management, automatically adding and removing cluster workers (nodes) as load changes.
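A sketch of what defining a policy can look like with the Python client; the bounds, factors, and names below are arbitrary placeholders:

```python
# Sketch: define an autoscaling policy with placeholder bounds and factors.
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

policy = {
    "id": "example-policy",   # placeholder
    "worker_config": {"min_instances": 2, "max_instances": 20},
    "basic_algorithm": {
        "yarn_config": {
            # React to half of the pending/available YARN memory delta.
            "scale_up_factor": 0.5,
            "scale_down_factor": 0.5,
            "graceful_decommission_timeout": {"seconds": 3600},
        },
        "cooldown_period": {"seconds": 300},
    },
}

client.create_autoscaling_policy(
    parent=f"projects/your-project-id/regions/{region}", policy=policy
)
```

A cluster opts in by pointing config.autoscaling_config.policy_uri at the created policy's resource name.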
Cloud Integrated
Built-in integration with Cloud Storage, BigQuery, Cloud Bigtable, Stackdriver Logging, Stackdriver Monitoring, and AI Hub gives you a complete and robust data platform.
Versioning
Image versioning allows you to switch between different versions of Apache Spark, Apache Hadoop, and other tools.
Highly Available
Run clusters in high availability mode with multiple master nodes, and set jobs to restart on failure to ensure your clusters and jobs are highly available.
Enterprise Security
When you create a Dataproc cluster, you can enable Hadoop Secure Mode via Kerberos by adding a Security Configuration. GCP and Dataproc also offer additional security features that help protect your data. Some of the most commonly used GCP-specific security features with Dataproc include default at-rest encryption, OS Login, VPC Service Controls, and Customer-Managed Encryption Keys (CMEK).
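As a hypothetical fragment (all resource names are placeholders), enabling Kerberos amounts to adding a security_config block to the config map of a create-cluster request like the one sketched earlier:

```python
# Hypothetical fragment to merge into the "config" map of a create-cluster
# request; the bucket path and KMS key below are placeholders.
security_config = {
    "kerberos_config": {
        "enable_kerberos": True,
        # GCS URI of the KMS-encrypted root principal password.
        "root_principal_password_uri": "gs://your-bucket/kerberos-root-password.encrypted",
        "kms_key_uri": (
            "projects/your-project-id/locations/global/"
            "keyRings/your-ring/cryptoKeys/your-key"
        ),
    }
}
```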
Cluster Scheduled Deletion
To help avoid incurring charges for an inactive cluster, you can use Dataproc's scheduled deletion, which provides options to delete a cluster after a specified cluster idle period, at a specified future time, or after a specified time period.
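Scheduled deletion is expressed in the cluster's lifecycle configuration; a hypothetical fragment (the durations are arbitrary) for the same create-cluster config map:

```python
# Hypothetical fragment for the "config" map of a create-cluster request.
lifecycle_config = {
    "idle_delete_ttl": {"seconds": 1800},   # delete after 30 idle minutes
    # Alternatively: "auto_delete_time" (a fixed timestamp) or
    # "auto_delete_ttl" (a duration measured from cluster creation).
}
```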
Automatic or Manual Configuration
Dataproc automatically configures hardware and software, but also gives you manual control.
Developer Tools
Multiple ways to manage a cluster, including an easy-to-use web UI, the Cloud SDK, RESTful APIs, and SSH access.
Initialization Actions
Run initialization actions to install or customize the settings and libraries you need when your cluster is created.
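An initialization action is an executable staged in Cloud Storage that runs on each node as it is created; a hypothetical fragment for the create-cluster config map (the script URI is a placeholder):

```python
# Hypothetical fragment for the "config" map of a create-cluster request:
# run a staged script on every node as it comes up.
initialization_actions = [
    {
        "executable_file": "gs://your-bucket/scripts/install-deps.sh",  # placeholder
        "execution_timeout": {"seconds": 600},
    }
]
```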
Optional Components
Use optional components to install and configure additional components on the cluster. Optional components are integrated with Dataproc components, and offer fully configured environments for Zeppelin, Druid, Presto, and other open source software components related to the Apache Hadoop and Apache Spark ecosystem.
Custom Images
Dataproc clusters can be provisioned with a custom image that includes your pre-installed Linux operating system packages.
Flexible Virtual Machines
Clusters can use custom machine types and preemptible virtual machines to make them the perfect size for your needs.
Component Gateway and Notebook Access
Dataproc Component Gateway enables secure, one-click access to Dataproc default and optional component web interfaces running on the cluster.
Workflow Templates
Dataproc workflow templates provide a flexible and easy-to-use mechanism for managing and executing workflows. A Workflow Template is a reusable workflow configuration that defines a graph of jobs with information on where to run those jobs.
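A sketch of instantiating an inline two-step workflow (a teragen feeding a terasort from the stock Hadoop examples jar) on an ephemeral managed cluster; the project, cluster name, and zone are placeholders:

```python
# Sketch: run a two-step workflow on an ephemeral managed cluster.
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.WorkflowTemplateServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

examples_jar = "file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar"
template = {
    "placement": {
        # The cluster is created for the workflow and deleted when it finishes.
        "managed_cluster": {
            "cluster_name": "workflow-cluster",   # placeholder
            "config": {"gce_cluster_config": {"zone_uri": "us-central1-a"}},
        }
    },
    "jobs": [
        {
            "step_id": "teragen",
            "hadoop_job": {"main_jar_file_uri": examples_jar,
                           "args": ["teragen", "1000", "hdfs:///gen/"]},
        },
        {
            "step_id": "terasort",
            "prerequisite_step_ids": ["teragen"],   # runs only after teragen
            "hadoop_job": {"main_jar_file_uri": examples_jar,
                           "args": ["terasort", "hdfs:///gen/", "hdfs:///sort/"]},
        },
    ],
}

operation = client.instantiate_inline_workflow_template(
    request={"parent": f"projects/your-project-id/regions/{region}",
             "template": template}
)
operation.result()  # blocks until the workflow completes
```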

Dataproc Pricing

Dataproc charges a small incremental fee per virtual CPU in the Compute Engine instances used in your cluster, billed by the second.

Featured Blogs

Read the latest blogs to better understand open source data processing in the cloud

Highlights from Next ’19

Watch how customers use Dataproc to lower costs and make data-driven decisions in their organizations

Cloud Dataproc's Newest Features
How Customers Are Migrating Hadoop to Google Cloud Platform
Democratizing Dataproc

Get started

Learn and build

New to GCP? Get started with any GCP product for free with a $300 credit.

Need more help?

Our experts will help you build the right solution or find the right partner for your needs.
