Dataproc overview
Accessing clusters
Components
-
Overview
Overview of cluster components.
-
Anaconda optional component
Install the Anaconda component on your cluster.
-
Docker optional component
Install the Docker component on your cluster.
-
Druid optional component (Alpha)
Install the Druid component on your cluster.
-
Flink optional component
Install the Flink component on your cluster.
-
HBase optional component (Beta)
Install the HBase component on your cluster.
-
Hive WebHCat optional component
Install the Hive WebHCat component on your cluster.
-
Jupyter optional component
Install the Jupyter component on your cluster.
-
Presto optional component
Install the Presto component on your cluster.
-
Ranger optional component
Install the Ranger component on your cluster.
-
Using Ranger with Kerberos
Use the Ranger component with Kerberos on your cluster.
-
Solr optional component
Install the Solr component on your cluster.
-
Zeppelin optional component
Install the Zeppelin component on your cluster.
-
Zookeeper optional component
Install the Zookeeper component on your cluster.
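Each optional component above is enabled at cluster creation time. As an illustrative sketch (the cluster name and region are placeholders, not values from this page):

```shell
# Hypothetical example: create a cluster with the Jupyter and Zookeeper
# optional components enabled. The component gateway flag exposes the
# components' web interfaces.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --optional-components=JUPYTER,ZOOKEEPER \
    --enable-component-gateway
```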
Compute options
-
Supported machine types
Dataproc lets you specify custom machine types for special workloads.
-
GPU Clusters
Use Graphics Processing Units (GPUs) with your Dataproc clusters.
-
Local Solid State Drives
Attach local SSDs to Dataproc clusters.
-
Minimum CPU platform
Specify a minimum CPU platform for your Dataproc cluster.
-
Persistent Solid State Drive (PD-SSD) boot disks
Create clusters with persistent SSD boot disks.
-
Secondary workers - preemptible and non-preemptible VMs
Understand and use preemptible and non-preemptible secondary workers in your Dataproc cluster.
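The secondary-worker options above can be combined when creating a cluster. A minimal sketch, assuming a hypothetical cluster name and region:

```shell
# Hypothetical example: create a cluster with two non-preemptible
# secondary workers in addition to the default primary workers.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --num-secondary-workers=2 \
    --secondary-worker-type=non-preemptible
```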
Configuring and running jobs
-
Life of a job
Understand Dataproc job throttling.
-
Persistent History Server (Alpha)
Learn about the Dataproc Persistent History Server.
-
Restartable jobs
Create jobs that restart on failure. Great for long-running and streaming jobs.
-
Dataproc jobs on GKE (Beta)
Run Dataproc jobs on Google Kubernetes Engine.
Configuring clusters
-
Autoscaling clusters
Learn how to use autoscaling to automatically resize clusters to meet the demands of user workloads.
-
Auto Zone placement
Let Dataproc select a zone for your cluster.
-
Cluster metadata
Learn about Dataproc's cluster metadata and how to set your own custom metadata.
-
Cluster properties
Configuration properties for Dataproc's open source components and how to access them.
-
Customer managed encryption keys (CMEK)
Manage encrypted keys for Dataproc cluster and job data.
-
Enhanced Flexibility Mode (Beta)
Keep jobs running by saving intermediate data to a distributed file system.
-
High availability mode
Increase the resilience of HDFS and YARN to service unavailability.
-
Initialization actions
Specify actions to run on all or some cluster nodes on setup.
-
Network configuration
Configure your cluster's network.
-
Scaling Clusters
Increase or decrease the number of worker nodes in a cluster, even while jobs are running.
-
Scheduled Deletion
Delete your cluster after a specified period or at a specified time.
-
Security Configuration
Enable cluster security features.
-
Service accounts
Understand Dataproc service accounts.
-
Single node clusters
Create lightweight sandbox clusters with only one node.
-
Sole tenant node clusters
Create clusters on sole tenant nodes.
-
Staging and temp buckets
Learn about the Dataproc staging and temp buckets.
Connectors
-
BigQuery connector
Use the BigQuery connector for Apache Hadoop on your Dataproc clusters.
-
BigQuery connector code samples
View the BigQuery code samples.
-
Cloud Bigtable with Dataproc
Use the Cloud Bigtable Apache HBase-compatible API with your Dataproc clusters.
-
Cloud Storage connector
Use the Cloud Storage connector for Hadoop on your Dataproc clusters.
-
Installing the Cloud Storage connector
Use the Cloud Storage connector on other (non-Dataproc) clusters.
Identity and Access Management (IAM)
-
Dataproc permissions and IAM roles
Set up IAM roles to allow users and groups to access your project's Dataproc resources.
-
Dataproc principals and roles
Understand Dataproc principals and the roles required to create, manage, and run tasks on a cluster.
-
Dataproc Granular IAM
Set up granular cluster-specific permissions.
-
Dataproc Personal Cluster Authentication
Set up personal cluster authentication.
-
Dataproc service account based multi-tenancy (Beta)
Set up multi-tenant clusters.
Dataproc Regional endpoints
Versioning
-
Overview
Software versions used on Dataproc clusters and how to select them.
-
2.0.x release versions
Dataproc image version 2.0.
-
1.5.x release versions
Dataproc image version 1.5.
-
1.4.x release versions
Dataproc image version 1.4.
-
1.3.x release versions
Dataproc image version 1.3.
-
Image version list
List of versions currently supported in Dataproc clusters.
Workflow Templates
-
Overview
Learn about workflow templates.
-
Monitoring and debugging workflows
How to monitor and debug workflows.
-
Parameterization
Learn how to parameterize your workflow templates.
-
Using YAML files
Learn how to use YAML files in your workflow.
-
Using cluster selectors
Learn how to use cluster selectors in your workflow.
-
Using inline workflows
Learn how to create and run inline workflows.
-
Using workflows
Learn how to set up and run workflows.
-
Workflow scheduling solutions
Run workflows with Cloud Scheduler, Cloud Functions, and Cloud Composer.
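Workflow templates like those described above can be run directly from a YAML definition. A minimal sketch, assuming a hypothetical file name and region:

```shell
# Hypothetical example: instantiate a workflow from a YAML template file
# without first registering the template as a resource.
gcloud dataproc workflow-templates instantiate-from-file \
    --file=my-workflow.yaml \
    --region=us-central1
```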