Dataproc overview
Accessing clusters
-
Cluster web interfaces
Available web interfaces for Dataproc's open source components, and how to connect to them.
-
Component gateway
Use the component gateway to connect to cluster components.
-
Workforce identity federation
Allow workforce access to the Dataproc component gateway.
-
Network configuration
Configure your cluster's network.
-
Connect using SSH to a cluster
Use SSH to connect to a cluster node.
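As a quick sketch of the SSH workflow above: Dataproc master nodes follow the `CLUSTER_NAME-m` naming convention, so you can connect with `gcloud compute ssh`. The cluster name, zone, and project below are placeholders.

```shell
# Connect to the master node of a hypothetical cluster "my-cluster".
# Dataproc names the master node with an "-m" suffix.
gcloud compute ssh my-cluster-m \
    --zone=us-central1-a \
    --project=my-project
```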
Components
-
Overview
Overview of cluster components.
-
Anaconda optional component
Install the Anaconda component on your cluster.
-
Docker optional component
Install the Docker component on your cluster.
-
Flink optional component
Install the Flink component on your cluster.
-
HBase optional component (Beta)
Install the HBase component on your cluster.
-
Hive WebHCat optional component
Install the Hive WebHCat component on your cluster.
-
Hudi optional component
Install the Hudi component on your cluster.
-
Jupyter optional component
Install the Jupyter component on your cluster.
-
Presto optional component
Install the Presto component on your cluster.
-
Ranger optional component
Install the Ranger component on your cluster.
-
Using Ranger with Kerberos
Use the Ranger component with Kerberos on your cluster.
-
Back up and restore a Ranger schema
Follow the steps to back up and restore a Ranger schema.
-
Solr optional component
Install the Solr component on your cluster.
-
Trino optional component
Install the Trino component on your cluster.
-
Zeppelin optional component
Install the Zeppelin component on your cluster.
-
Zookeeper optional component
Install the Zookeeper component on your cluster.
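The optional components listed above are installed at cluster creation time. A minimal sketch, assuming a hypothetical cluster name and region, that adds Jupyter and Zeppelin and enables the component gateway for browser access:

```shell
# Create a cluster with the Jupyter and Zeppelin optional components
# and enable the component gateway (names are illustrative).
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --optional-components=JUPYTER,ZEPPELIN \
    --enable-component-gateway
```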
Compute options
-
Supported machine types
Dataproc lets you specify custom machine types for special workloads.
-
GPU clusters
Use Graphics Processing Units (GPUs) with your Dataproc clusters.
-
Local solid state drives
Attach local SSDs to Dataproc clusters.
-
Minimum CPU platform
Specify a minimum CPU platform for your Dataproc cluster.
-
Persistent Solid State Drive (PD-SSD) boot disks
Create clusters with persistent SSD boot disks.
-
Secondary workers - preemptible and non-preemptible VMs
Understand and use preemptible and non-preemptible secondary workers in your Dataproc cluster.
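The compute options above are set as flags on cluster creation. A hedged example combining a custom machine-type choice with non-preemptible secondary workers (cluster name, region, and machine types are placeholders):

```shell
# Create a cluster with specific machine types and two
# non-preemptible secondary workers.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --master-machine-type=n2-standard-4 \
    --worker-machine-type=n2-standard-8 \
    --num-secondary-workers=2 \
    --secondary-worker-type=non-preemptible
```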
Configuring and running jobs
-
Life of a job
Understand Dataproc job throttling.
-
Troubleshoot job delays
Understand and avoid common causes of job delays.
-
Persistent History Server
Learn about the Dataproc Persistent History Server.
-
Restartable jobs
Create jobs that restart on failure. Great for long-running and streaming jobs.
-
Run a Spark job on Dataproc on GKE
Create a Dataproc on GKE virtual cluster, then run a Spark job on the virtual cluster.
-
Customize Spark job runtime environment with Docker on YARN
Use a Docker image to customize your Spark job environment.
-
Run Spark jobs with DataprocFileOutputCommitter
Run Spark jobs with Dataproc's enhanced, configurable version of the open source FileOutputCommitter.
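Restartable jobs, mentioned above, are requested at submission time with a failure budget. A sketch, assuming an existing cluster and using the Spark examples jar shipped on Dataproc images:

```shell
# Submit a restartable Spark job: Dataproc restarts it on failure,
# up to 5 times per hour.
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --max-failures-per-hour=5 \
    -- 1000
```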
Configuring clusters
-
Autoscaling clusters
Learn how to use autoscaling to automatically resize clusters to meet the demands of user workloads.
-
Auto Zone placement
Let Dataproc select a zone for your cluster.
-
Cluster caching
Use cluster caching to improve performance.
-
Cluster metadata
Learn about Dataproc's cluster metadata and how to set your own custom metadata.
-
Cluster properties
Use configuration properties for Dataproc open source components.
-
Cluster rotation
Rotate clusters that are part of a cluster pool.
-
Enhanced Flexibility Mode
Keep jobs running by changing where intermediate data is saved.
-
Flexible VMs
Specify VM types that you can use in your cluster if your requested VMs are unavailable.
-
High availability mode
Increase the resilience of HDFS and YARN to service unavailability.
-
Initialization actions
Specify actions to run on all or some cluster nodes on setup.
-
Network configuration
Configure your cluster's network.
-
Scaling clusters
Increase or decrease the number of worker nodes in a cluster, even while jobs are running.
-
Scheduled deletion
Delete your cluster after a specified period or at a specified time.
-
Security configuration
Enable cluster security features.
-
Confidential compute
Create a cluster with Confidential VMs.
-
Customer managed encryption keys (CMEK)
Manage encryption keys for Dataproc cluster and job data.
-
Ranger Cloud Storage plugin
Use the Ranger Cloud Storage plugin with Dataproc.
-
Dataproc service accounts
Understand Dataproc service accounts.
-
Single node clusters
Create lightweight sandbox clusters with only one node.
-
Sole tenant node clusters
Create clusters on sole tenant nodes.
-
Staging and temp buckets
Learn about the Dataproc staging and temp buckets.
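Scheduled deletion, listed above, is configured with idle and age limits at creation time. A minimal sketch with placeholder values:

```shell
# Delete the cluster after 2 hours of inactivity, or 6 hours after
# creation, whichever comes first.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --max-idle=2h \
    --max-age=6h
```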
Connectors
-
BigQuery connector
Use the BigQuery connector for Apache Hadoop on your Dataproc clusters.
-
BigQuery connector code samples
View BigQuery connector code samples.
-
Bigtable with Dataproc
Use the Bigtable Apache HBase-compatible API with your Dataproc clusters.
-
Cloud Storage connector
Use the Cloud Storage connector.
-
Hive BigQuery connector
Learn about the Hive BigQuery connector.
-
Pub/Sub Lite with Dataproc
Use Pub/Sub Lite with Dataproc.
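As an illustration of connector usage, a PySpark job that reads BigQuery can pull in the Spark BigQuery connector jar at submit time. The script path and cluster name are placeholders; the jar path points at Google's public `spark-lib` bucket:

```shell
# Submit a PySpark job with the Spark BigQuery connector on the
# classpath (script location and cluster name are illustrative).
gcloud dataproc jobs submit pyspark gs://my-bucket/read_bq.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
```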
Identity and Access Management (IAM)
-
Dataproc permissions and IAM roles
Set up IAM roles to allow users and groups to access your project's Dataproc resources.
-
Dataproc principals and roles
Understand Dataproc principals and the roles required to create, manage, and run tasks on a cluster.
-
Dataproc Granular IAM
Set up granular cluster-specific permissions.
-
Dataproc Personal Cluster Authentication
Set up personal cluster authentication.
-
Dataproc service account based multi-tenancy
Set up multi-tenant clusters.
-
Manage Dataproc resources using custom constraints
Set up custom constraints to manage Dataproc resources.
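Project-level Dataproc access, as covered above, is granted through IAM role bindings. A sketch granting the predefined Dataproc Editor role (project and principal are placeholders):

```shell
# Grant a user the Dataproc Editor role on a project.
gcloud projects add-iam-policy-binding my-project \
    --member=user:alice@example.com \
    --role=roles/dataproc.editor
```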
Dataproc regional endpoints
Versioning
-
Overview
Software versions used on Dataproc clusters and how to select them.
-
2.1.x release versions
Dataproc image version 2.1.
-
2.0.x release versions
Dataproc image version 2.0.
-
1.5.x release versions
Dataproc image version 1.5.
-
1.4.x release versions
Dataproc image version 1.4.
-
Dataproc cluster image version lists
Lists of versions currently supported in Dataproc clusters.
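Image versions from the lists above are selected at cluster creation with the `--image-version` flag. A minimal sketch pinning a 2.1 image (cluster name and region are placeholders):

```shell
# Pin the cluster to a specific Dataproc image version.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --image-version=2.1-debian11
```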
Workflow Templates
-
Overview
Learn about workflow templates.
-
Monitoring and debugging workflows
How to monitor and debug workflows.
-
Parameterization
Learn how to parameterize your workflow templates.
-
Use YAML files
Learn how to use YAML files in your workflow.
-
Use cluster selectors
Learn how to use cluster selectors in your workflow.
-
Use inline workflows
Learn how to create and run inline workflows.
-
Use workflows
Learn how to set up and run workflows.
-
Workflow scheduling solutions
Run workflows with Cloud Scheduler, Cloud Functions, and Cloud Composer.
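The workflow template lifecycle described above (create, configure a cluster, add jobs, instantiate) can be sketched end to end; all names are placeholders:

```shell
# 1. Create an empty workflow template.
gcloud dataproc workflow-templates create my-template --region=us-central1

# 2. Attach a managed cluster that is created for each run and
#    deleted afterward.
gcloud dataproc workflow-templates set-managed-cluster my-template \
    --region=us-central1 \
    --cluster-name=my-wf-cluster \
    --master-machine-type=n2-standard-4

# 3. Add a Spark job step to the template.
gcloud dataproc workflow-templates add-job spark \
    --workflow-template=my-template \
    --region=us-central1 \
    --step-id=compute-pi \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar

# 4. Run the workflow.
gcloud dataproc workflow-templates instantiate my-template --region=us-central1
```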