Dataproc

Dataproc is a fully managed and highly scalable service for running Apache Hadoop, Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks. Use Dataproc for data lake modernization, ETL, and secure data science, at scale, integrated with Google Cloud, at a fraction of the cost.

  • Open: Run open source data analytics at scale, with enterprise-grade security

  • Flexible: Use serverless, or manage clusters on Google Compute Engine and Kubernetes

  • Intelligent: Enable data users through integrations with Vertex AI, BigQuery, and Dataplex 

  • Secure: Configure advanced security such as Kerberos, Apache Ranger, and Personal Authentication

  • Cost-effective: Realize 54% lower TCO than on-prem data lakes, with per-second pricing

Benefits

Modernize your open source data processing

Whether you need VMs or Kubernetes, extra memory for Presto, or even GPUs, Dataproc can help accelerate your data and analytics processing through on-demand purpose-built or serverless environments.

Intelligent and seamless OSS for data science

Enable data scientists and data analysts to seamlessly perform data science jobs through native integrations with BigQuery, Dataplex, and Vertex AI.

Advanced security, compliance, and governance

Enforce fine-grained row- and column-level access controls with Dataproc, BigLake, and Dataplex. Manage and enforce user authorization and authentication using existing Kerberos and Apache Ranger policies.

Key features

Fully managed and automated big data open source software

Serverless deployment, logging, and monitoring let you focus on your data and analytics, not on your infrastructure. Reduce TCO of Apache Spark management by up to 54%. Enable data scientists and engineers to build and train models 5X faster, compared to traditional notebooks, through integration with Vertex AI Workbench. The Dataproc Jobs API makes it easy to incorporate big data processing into custom applications, while Dataproc Metastore eliminates the need to run your own Hive metastore or catalog service.
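As a rough illustration of what incorporating the Dataproc Jobs API into a custom application can look like, the sketch below assembles the URL and JSON body for a `jobs:submit` REST request that runs a PySpark job on an existing cluster. It only builds the request and does not send it or handle authentication. The project, region, cluster, and bucket names are hypothetical placeholders, and the helper function is an assumption made for this example, not part of any Google library.

```python
import json


def build_submit_job_request(project_id, region, cluster_name,
                             main_python_file_uri, args=None):
    """Build (url, body) for a Dataproc jobs:submit REST call.

    Hypothetical helper for illustration: it only assembles the request
    and never contacts the service.
    """
    url = (
        "https://dataproc.googleapis.com/v1/"
        f"projects/{project_id}/regions/{region}/jobs:submit"
    )
    body = {
        "job": {
            # Which existing cluster should run the job.
            "placement": {"clusterName": cluster_name},
            # A PySpark job: the driver script lives in Cloud Storage.
            "pysparkJob": {
                "mainPythonFileUri": main_python_file_uri,
                "args": list(args or []),
            },
        }
    }
    return url, body


# All identifiers below are placeholders, not real resources.
url, body = build_submit_job_request(
    project_id="example-project",
    region="us-central1",
    cluster_name="example-cluster",
    main_python_file_uri="gs://example-bucket/jobs/word_count.py",
    args=["gs://example-bucket/input/"],
)

print(url)
print(json.dumps(body, indent=2))
```

An application would send this body as an authenticated POST request, or use one of the Google Cloud client libraries instead of building the request by hand.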

Containerize Apache Spark jobs with Kubernetes

Run your Apache Spark jobs with Dataproc on Kubernetes, which uses Google Kubernetes Engine (GKE) to provide job portability and isolation.

Enterprise security integrated with Google Cloud

When you create a Dataproc cluster, you can enable Hadoop Secure Mode via Kerberos by adding a Security Configuration. Additionally, the Google Cloud-specific security features most commonly used with Dataproc include default at-rest encryption, OS Login, VPC Service Controls, and customer-managed encryption keys (CMEK).
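To make the Security Configuration more concrete, here is a minimal sketch of a cluster definition with Kerberos enabled, written as the kind of JSON body the Dataproc cluster creation REST API accepts. The field names follow the API's `securityConfig` and `kerberosConfig` structure as we understand it, so check them against the current API reference before use. The helper function and every resource name are hypothetical.

```python
import json


def build_kerberized_cluster(project_id, cluster_name,
                             kms_key_uri, root_password_uri):
    """Return a cluster definition with Hadoop Secure Mode enabled.

    Hypothetical helper for illustration: it builds a dictionary and
    does not create any cluster.
    """
    return {
        "projectId": project_id,
        "clusterName": cluster_name,
        "config": {
            "securityConfig": {
                "kerberosConfig": {
                    # Turns on Kerberos for the cluster.
                    "enableKerberos": True,
                    # Cloud KMS key used to encrypt the root password.
                    "kmsKeyUri": kms_key_uri,
                    # Cloud Storage object holding the encrypted
                    # root principal password.
                    "rootPrincipalPasswordUri": root_password_uri,
                }
            }
        },
    }


# Placeholder values only; none of these resources exist.
cluster = build_kerberized_cluster(
    project_id="example-project",
    cluster_name="secure-cluster",
    kms_key_uri=(
        "projects/example-project/locations/global/"
        "keyRings/example-ring/cryptoKeys/example-key"
    ),
    root_password_uri="gs://example-bucket/kerberos/root-password.encrypted",
)

print(json.dumps(cluster, indent=2))
```

Keeping the password encrypted in Cloud Storage and referencing it by URI means the secret itself never appears in the cluster definition.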

The best of open source with the best of Google Cloud

Dataproc lets you keep the open source tools, algorithms, and programming languages that you use today, and makes it easy to apply them to cloud-scale datasets. At the same time, Dataproc has out-of-the-box integration with the rest of the Google Cloud analytics, database, and AI ecosystem. Data scientists and engineers can quickly access data and build data applications by connecting Dataproc to BigQuery, Vertex AI, Cloud Spanner, Pub/Sub, or Data Fusion.
