Dataproc

Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks. Use Dataproc for data lake modernization, ETL, and secure data science, at planet scale, fully integrated with Google Cloud, at a fraction of the cost.

  • Open: Build fully managed open source clusters on demand, integrated with data lake metastore

  • Flexible: Mix-and-match OSS and GCP-native services with tight integrations across the platform

  • Governed: Enable scalable, secure, cost-controlled open source data science with Dataproc Hub

  • Secure: Configure advanced security such as Kerberos, Apache Ranger and Personal Authentication

  • Cost-effective: Realize 57% lower TCO compared to on-prem data lakes with per-second pricing

Benefits

Modernize your open source data processing

Whether you need VMs or Kubernetes, extra memory for Presto, or even GPUs, Dataproc can help accelerate your data and analytics processing by spinning up purpose-built environments on demand.

Cost-effective and secure OSS for data science

With autoscaling, idle cluster deletion, per-second pricing, integrated security, and more, Dataproc Hub can cost-effectively scale open source data science while minimizing management costs and security risks.

Advanced security, compliance, and governance

Manage and enforce user authorization and authentication using existing Kerberos and Apache Ranger policies or Personal Cluster Authentication. Define permissions without having to set up a network node.

Key features

Fully managed and automated big data open source software

Fully managed deployment, logging, and monitoring let you focus on your data and analytics, not on your infrastructure. Reduce TCO of Apache Spark management by up to 57%. Data scientists and engineers interact with Dataproc using familiar tools such as Jupyter and Zeppelin notebooks that can be templated in Dataproc Hub. The Dataproc Jobs API makes it easy to incorporate big data processing into custom applications, while Dataproc Metastore eliminates the need to run your own Hive metastore or catalog service.
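As a rough illustration of incorporating the Dataproc Jobs API into a custom application, the sketch below describes a PySpark job as a plain dictionary and submits it with the `google-cloud-dataproc` Python client. The cluster name, bucket path, and helper function names are hypothetical placeholders, and the request shape reflects our understanding of the client library, so check it against the current API reference before relying on it.

```python
from typing import Any, Dict, List, Optional


def build_pyspark_job(
    cluster_name: str,
    main_python_file_uri: str,
    args: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Describe a PySpark job as a plain dict (hypothetical helper)."""
    job: Dict[str, Any] = {
        # Which existing cluster should run the job.
        "placement": {"cluster_name": cluster_name},
        # Location of the driver script, typically in Cloud Storage.
        "pyspark_job": {"main_python_file_uri": main_python_file_uri},
    }
    if args:
        job["pyspark_job"]["args"] = list(args)
    return job


def submit_job(project_id: str, region: str, job: Dict[str, Any]):
    """Submit the job and block until it finishes.

    Requires the google-cloud-dataproc package and valid credentials.
    """
    # Imported lazily so build_pyspark_job works without the client library.
    from google.cloud import dataproc_v1

    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    operation = client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    return operation.result()


# Placeholder names, for illustration only.
example_job = build_pyspark_job(
    "example-cluster", "gs://example-bucket/wordcount.py", ["--limit", "10"]
)
```

Keeping the job description separate from the submission call lets an application build and validate jobs without network access, then submit them only where credentials are available.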

Containerize Apache Spark jobs with Kubernetes

Run your Apache Spark jobs with Dataproc on Kubernetes, which uses Google Kubernetes Engine (GKE) to provide job portability and isolation.

Enterprise security integrated with Google Cloud

When you create a Dataproc cluster, you can enable Hadoop Secure Mode via Kerberos by adding a Security Configuration. Additionally, the Google Cloud-specific security features most commonly used with Dataproc include default at-rest encryption, OS Login, VPC Service Controls, and customer-managed encryption keys (CMEK).

The best of open source with the best of Google Cloud

Dataproc lets you keep the open source tools, algorithms, and programming languages that you use today while making it easy to apply them to cloud-scale datasets. At the same time, Dataproc has out-of-the-box integration with the rest of the Google Cloud analytics, database, and AI ecosystem. Data scientists and engineers can quickly access data and build data applications by connecting Dataproc to BigQuery, AI Platform, Cloud Spanner, Pub/Sub, or Data Fusion.

Documentation

APIs & Libraries
Dataproc initialization actions

Add other OSS projects to your Dataproc clusters with pre-built initialization actions.

APIs & Libraries
Open source connectors

Libraries and tools for Apache Hadoop interoperability.

Use cases

Use case
Move your Hadoop and Spark clusters to the cloud

Enterprises are migrating their existing on-premises Apache Hadoop and Spark clusters over to Dataproc to manage costs and unlock the power of elastic scale. With Dataproc, enterprises get a fully managed, purpose-built cluster that can autoscale to support any data or analytics processing job. 

Use case
Data science on Dataproc

Create your ideal data science environment by spinning up a purpose-built Dataproc cluster. Integrate open source software like Apache Spark, NVIDIA RAPIDS, and Jupyter notebooks with Google Cloud AI services and GPUs to help accelerate your machine learning and AI development. 

All features

Resizable clusters: Create and scale clusters quickly with various virtual machine types, disk sizes, number of nodes, and networking options.
Autoscaling clusters: Dataproc autoscaling provides a mechanism for automating cluster resource management and enables automatic addition and subtraction of cluster workers (nodes).
Cloud integrated: Built-in integration with Cloud Storage, BigQuery, Cloud Bigtable, Cloud Logging, Cloud Monitoring, and AI Hub, giving you a more complete and robust data platform.
Versioning: Image versioning allows you to switch between different versions of Apache Spark, Apache Hadoop, and other tools.
Highly available: Run clusters in high availability mode with multiple master nodes and set jobs to restart on failure to help ensure your clusters and jobs are highly available.
Cluster scheduled deletion: To help avoid incurring charges for an inactive cluster, you can use Dataproc's scheduled deletion, which provides options to delete a cluster after a specified cluster idle period, at a specified future time, or after a specified time period.
Automatic or manual configuration: Dataproc automatically configures hardware and software but also gives you manual control.
Developer tools: Multiple ways to manage a cluster, including an easy-to-use web UI, the Cloud SDK, RESTful APIs, and SSH access.
Initialization actions: Run initialization actions to install or customize the settings and libraries you need when your cluster is created.
Optional components: Use optional components to install and configure additional components on the cluster. Optional components are integrated with Dataproc components and offer fully configured environments for Zeppelin, Druid, Presto, and other open source software components related to the Apache Hadoop and Apache Spark ecosystem.
Custom images: Dataproc clusters can be provisioned with a custom image that includes your pre-installed Linux operating system packages.
Flexible virtual machines: Clusters can use custom machine types and preemptible virtual machines to make them the perfect size for your needs.
Component Gateway and notebook access: Dataproc Component Gateway enables secure, one-click access to Dataproc default and optional component web interfaces running on the cluster.
Workflow templates: Dataproc workflow templates provide a flexible and easy-to-use mechanism for managing and executing workflows. A workflow template is a reusable workflow configuration that defines a graph of jobs with information on where to run those jobs.
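
Several of the features listed above (high availability mode, scheduled deletion after an idle period, initialization actions, and Component Gateway) can be expressed as settings in a cluster definition. The sketch below assembles such a definition as a plain Python dictionary. The helper name, default machine type, and example values are hypothetical, and the field names follow our understanding of the Dataproc cluster configuration, so verify them against the API reference before use.

```python
from typing import Any, Dict, Optional, Sequence


def build_cluster(
    project_id: str,
    cluster_name: str,
    workers: int = 2,
    machine_type: str = "n1-standard-4",  # assumed default, pick your own
    high_availability: bool = False,
    idle_delete_seconds: Optional[int] = None,
    init_action_uris: Sequence[str] = (),
    enable_component_gateway: bool = False,
) -> Dict[str, Any]:
    """Assemble a cluster definition as a plain dict (hypothetical helper)."""
    config: Dict[str, Any] = {
        # High availability mode runs three master nodes instead of one.
        "master_config": {
            "num_instances": 3 if high_availability else 1,
            "machine_type_uri": machine_type,
        },
        "worker_config": {
            "num_instances": workers,
            "machine_type_uri": machine_type,
        },
    }
    if idle_delete_seconds is not None:
        # Scheduled deletion: delete the cluster after this idle period.
        config["lifecycle_config"] = {
            "idle_delete_ttl": {"seconds": idle_delete_seconds}
        }
    if init_action_uris:
        # Scripts that run on each node when the cluster is created.
        config["initialization_actions"] = [
            {"executable_file": uri} for uri in init_action_uris
        ]
    if enable_component_gateway:
        # Component Gateway access to component web interfaces.
        config["endpoint_config"] = {"enable_http_port_access": True}
    return {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": config,
    }


# Placeholder values, for illustration only.
cluster = build_cluster(
    "example-project",
    "example-cluster",
    high_availability=True,
    idle_delete_seconds=1800,
    init_action_uris=["gs://example-bucket/install-libs.sh"],
)
```

A dictionary like this could then be passed to a cluster creation call in a client library or serialized for the REST API; that step is omitted here because it requires credentials and a real project.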

Pricing

Dataproc pricing is based on the number of vCPUs and the length of time that they run. Although pricing is shown as an hourly rate, we charge down to the second, so you only pay for what you use.

Example: A 6-node cluster (1 master + 5 workers) with 4 vCPUs per node that runs for 2 hours would cost $0.48. Dataproc charge = number of vCPUs * hours * Dataproc price = 24 * 2 * $0.01 = $0.48
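
The same arithmetic can be written as a small Python function. This is a sketch of the formula only (vCPUs multiplied by hours multiplied by the Dataproc price), using the $0.01 per vCPU-hour figure from the example; the function and parameter names are our own, and the result covers the Dataproc charge alone.

```python
def dataproc_charge(
    nodes: int,
    vcpus_per_node: int,
    seconds: float,
    price_per_vcpu_hour: float = 0.01,  # rate taken from the example above
) -> float:
    """Return the Dataproc charge in dollars for one cluster run."""
    total_vcpus = nodes * vcpus_per_node
    # Billing is per second, so convert the run time to fractional hours.
    hours = seconds / 3600
    return total_vcpus * hours * price_per_vcpu_hour


# 1 master + 5 workers, 4 vCPUs each, running for 2 hours.
charge = dataproc_charge(nodes=6, vcpus_per_node=4, seconds=2 * 3600)
print(f"${charge:.2f}")  # prints "$0.48"
```

Because the run time is passed in seconds, the same function covers runs shorter than an hour, for example a 15-minute job on the same cluster.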

Please see the pricing page for details.

Partners

Dataproc integrates with key partners to complement your existing investments and skill sets.