Dataproc

Dataproc makes open source data and analytics processing fast, easy, and more secure in the cloud.

  • Spin up an autoscaling cluster in 90 seconds on custom machines
  • Build fully managed Apache Spark, Apache Hadoop, Presto, and other open source software (OSS) clusters
  • Containerize OSS jobs with Kubernetes
  • Only pay for the resources you use and lower the total cost of ownership of OSS
  • Get encryption and unified security built into every cluster
  • Accelerate data science experimentation with purpose-built clusters that can include Jupyter notebooks, Apache Spark, and machines with GPUs

Build custom OSS clusters on custom machines faster

Whether you need extra memory for Presto or GPUs for Apache Spark machine learning, Dataproc can help accelerate your data and analytics processing by spinning up a purpose-built cluster in 90 seconds.
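
As a rough illustration of what "purpose-built" can mean, the sketch below assembles two cluster descriptions as plain Python dictionaries: one with high-memory workers for Presto and one with a GPU on each worker for Apache Spark machine learning. The field names follow the Dataproc API's cluster resource as understood here. The project ID, cluster names, machine types, GPU type, and the `build_cluster` helper are all assumptions made for illustration, and nothing is sent to Google Cloud.

```python
def build_cluster(project_id, cluster_name, worker_count=2,
                  worker_machine_type="n1-standard-4",
                  gpu_type=None, gpus_per_worker=0):
    """Return a cluster description as a plain dict (hypothetical helper)."""
    worker_config = {
        "num_instances": worker_count,
        "machine_type_uri": worker_machine_type,
    }
    # Attach accelerators only when a GPU type and a positive count are given.
    if gpu_type and gpus_per_worker > 0:
        worker_config["accelerators"] = [
            {
                "accelerator_type_uri": gpu_type,
                "accelerator_count": gpus_per_worker,
            }
        ]
    return {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {
                "num_instances": 1,
                "machine_type_uri": "n1-standard-4",  # assumed master size
            },
            "worker_config": worker_config,
        },
    }


# Extra memory for Presto: high-memory workers, no GPUs (example values).
presto_cluster = build_cluster(
    "example-project", "presto-highmem", worker_machine_type="n1-highmem-8"
)

# GPUs for Spark machine learning: one assumed GPU type per worker.
spark_gpu_cluster = build_cluster(
    "example-project", "spark-ml-gpu",
    gpu_type="nvidia-tesla-t4", gpus_per_worker=1,
)
```

A dictionary like this could then be handed to the `google-cloud-dataproc` client's `ClusterControllerClient.create_cluster` call; check the current API reference before relying on any individual field.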

Easy and affordable cluster management

With autoscaling, idle cluster deletion, per-second pricing, workflows, cluster resizing, and more, Dataproc can help reduce the total cost of ownership of OSS on day one so you can focus your time and resources elsewhere.

Security built in by default

Encryption by default helps ensure no piece of data is unprotected. With the Jobs API and Component Gateway, you can define user permissions for clusters with Cloud IAM, without having to set up networking or gateway nodes.

Key features

Automated cluster management

Managed deployment, logging, and monitoring let you focus on your data, not on your cluster. Dataproc clusters are stable, scalable, and speedy.

Containerize OSS jobs

When you build your OSS jobs (e.g., Apache Spark) on Dataproc, you can quickly containerize them with Kubernetes and deploy them anywhere a GKE cluster lives.

Enterprise security

When you create a Dataproc cluster, you can enable Hadoop Secure Mode via Kerberos by adding a Security Configuration. With the Jobs API and Component Gateway, you can define specific user permissions for clusters based on Cloud IAM without having to set up complicated networking or gateway nodes. Google Cloud and Dataproc also offer additional security features that help protect your data. The Google Cloud-specific security features most commonly used with Dataproc include default at-rest encryption, OS Login, VPC Service Controls, and customer-managed encryption keys (CMEK).

Customer stories

Story features

  • Migrated on-premises Apache Hadoop to Google Cloud

  • 226 models running in production

  • Two months to roll out Google Cloud to first country

Industry

  • Telecommunications

Documentation

Best Practice
Dataproc how-to guides

Learn how to build and manage a Dataproc cluster.

APIs & Libraries
Dataproc initialization actions

Add other OSS projects to your Dataproc clusters with pre-built initialization actions.

APIs & Libraries
Open source connectors

Libraries and tools for Apache Hadoop interoperability.

Use cases

use case
Move your Hadoop and Spark clusters to the cloud

Enterprises are migrating their existing on-premises Apache Hadoop and Spark clusters over to Dataproc to manage costs and unlock the power of elastic scale. With Dataproc, enterprises get a fully managed, purpose-built cluster that can autoscale to support any data or analytics processing job.

use case
Data science on Dataproc

Create your ideal data science environment by spinning up a purpose-built Dataproc cluster. Integrate open source software like Apache Spark, NVIDIA RAPIDS, and Jupyter notebooks with Google Cloud AI services and GPUs to help accelerate your machine learning and AI development.

All features

Resizable clusters

Create and scale clusters quickly with various virtual machine types, disk sizes, number of nodes, and networking options.

Autoscaling clusters

Dataproc autoscaling provides a mechanism for automating cluster resource management and enables automatic addition and subtraction of cluster workers (nodes).

Cloud integrated

Built-in integration with Cloud Storage, BigQuery, Cloud Bigtable, Cloud Logging, Cloud Monitoring, and AI Hub, giving you a more complete and robust data platform.

Versioning

Image versioning allows you to switch between different versions of Apache Spark, Apache Hadoop, and other tools.

Highly available

Run clusters in high availability mode with multiple master nodes and set jobs to restart on failure to help ensure your clusters and jobs are highly available.

Cluster scheduled deletion

To help avoid incurring charges for an inactive cluster, you can use Dataproc's scheduled deletion, which provides options to delete a cluster after a specified cluster idle period, at a specified future time, or after a specified time period.

Automatic or manual configuration

Dataproc automatically configures hardware and software but also gives you manual control.

Developer tools

Multiple ways to manage a cluster, including an easy-to-use web UI, the Cloud SDK, RESTful APIs, and SSH access.

Initialization actions

Run initialization actions to install or customize the settings and libraries you need when your cluster is created.

Optional components

Use optional components to install and configure additional components on the cluster. Optional components are integrated with Dataproc components and offer fully configured environments for Zeppelin, Druid, Presto, and other open source software components related to the Apache Hadoop and Apache Spark ecosystem.

Custom images

Dataproc clusters can be provisioned with a custom image that includes your pre-installed Linux operating system packages.

Flexible virtual machines

Clusters can use custom machine types and preemptible virtual machines to make them the perfect size for your needs.

Component Gateway and notebook access

Dataproc Component Gateway enables secure, one-click access to Dataproc default and optional component web interfaces running on the cluster.

Workflow templates

Dataproc workflow templates provide a flexible and easy-to-use mechanism for managing and executing workflows. A workflow template is a reusable workflow configuration that defines a graph of jobs with information on where to run those jobs.
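
To make the "graph of jobs" idea concrete, here is a minimal sketch that models a workflow as a list of steps with prerequisites and derives a valid run order using Python's standard `graphlib` module. The `step_id` and `prerequisite_step_ids` keys mirror the naming used for ordered jobs in the Dataproc API as understood here. The job names and the `execution_order` helper are hypothetical, and the ordering logic is an illustration only, not a description of how Dataproc schedules jobs internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each step lists the steps that must finish first.
jobs = [
    {"step_id": "ingest", "prerequisite_step_ids": []},
    {"step_id": "clean", "prerequisite_step_ids": ["ingest"]},
    {"step_id": "aggregate", "prerequisite_step_ids": ["clean"]},
    {"step_id": "report", "prerequisite_step_ids": ["clean", "aggregate"]},
]


def execution_order(jobs):
    """Return step IDs in an order that respects every prerequisite.

    Raises graphlib.CycleError if the prerequisites form a cycle.
    """
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {
        job["step_id"]: set(job.get("prerequisite_step_ids", []))
        for job in jobs
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(jobs)
print(order)
```

Because the template only declares dependencies, the same definition can be reused across runs, and steps with no dependency on each other could in principle run in parallel.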

Pricing

Dataproc pricing is based on the size of the Dataproc cluster and how long it runs. Although the pricing page lists hourly rates, we charge by the second, so you pay only for what you use. See the pricing page for details.
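
The effect of per-second billing can be sketched with a little arithmetic. The example below assumes that cluster size is measured in total vCPUs and uses a made-up hourly rate. Both are assumptions for illustration, so take real rates and billing rules from the pricing page.

```python
SECONDS_PER_HOUR = 3600


def dataproc_fee(total_vcpus, runtime_seconds, hourly_rate_per_vcpu):
    """Fee for a cluster billed by the second at an hourly list rate."""
    return total_vcpus * hourly_rate_per_vcpu * runtime_seconds / SECONDS_PER_HOUR


# Hypothetical rate in dollars per vCPU per hour; not an actual price.
HYPOTHETICAL_RATE = 0.01

# A 24-vCPU cluster that runs for 25 minutes.
per_second_fee = dataproc_fee(24, 25 * 60, HYPOTHETICAL_RATE)

# The same cluster if it were charged for a full hour instead.
full_hour_fee = dataproc_fee(24, SECONDS_PER_HOUR, HYPOTHETICAL_RATE)

print(f"25 minutes, billed by the second: ${per_second_fee:.2f}")
print(f"Same job, billed as a full hour:  ${full_hour_fee:.2f}")
```

Under these assumed numbers, the 25-minute job costs well under half of what a full-hour charge would, which is why short-lived clusters benefit most from per-second billing.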

Partners

Dataproc integrates with key partners to complement your existing investments and skill sets.

Take the next step

Get $300 in credits to learn and build on Google Cloud for up to 12 months.