This document discusses Dataproc best practices that can help you run reliable, efficient, and insightful data processing jobs on Dataproc clusters in production environments.
Specify cluster image versions
Dataproc uses image versions to bundle operating system, big data components, and Google Cloud connectors into a package that is deployed on a cluster. If you don't specify an image version when creating a cluster, Dataproc defaults to the most recent stable image version.
For production environments, associate your cluster with a specific major.minor Dataproc image version, as shown in the following gcloud CLI command:
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --image-version=2.0
Dataproc resolves the major.minor version to the latest sub-minor version (for example, 2.0 is resolved to 2.0.x).
Note: If you need to rely on a specific sub-minor version for your cluster, you can specify it, for example, --image-version=2.0.x. See How versioning works for more information.
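To confirm the sub-minor version that a cluster resolved to, you can inspect the cluster's software configuration, for example (CLUSTER_NAME and REGION are placeholders for an existing cluster, and the resolved version is reported in the config.softwareConfig.imageVersion field):
gcloud dataproc clusters describe CLUSTER_NAME \
    --region=REGION \
    --format="value(config.softwareConfig.imageVersion)"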
Dataproc preview image versions
New minor versions of Dataproc images are available in a preview version prior to release in the standard minor image version track. Use a preview image to test and validate your jobs against a new minor image version prior to adopting the standard minor image version in production. See Dataproc versioning for more information.
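For example, a sketch of creating a short-lived test cluster on the preview image track (CLUSTER_NAME and REGION are placeholders; the exact preview image version string to use is listed on the Dataproc versioning page, and the bare preview value shown here is an assumption):
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --image-version=preview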
Use custom images when necessary
If you have dependencies to add to the cluster, such as native Python libraries, security hardening software, or virus protection software, create a custom image from the latest image in your target minor image version track. This practice allows you to meet dependency requirements when you create clusters using your custom image. When you rebuild your custom image to update dependency requirements, use the latest available sub-minor image version within the minor image track.
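For example, a sketch of creating a cluster from a custom image by using the --image flag (the project ID and image name are placeholders for a custom image you have already built):
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --image=projects/PROJECT_ID/global/images/CUSTOM_IMAGE_NAME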
Submit jobs to the Dataproc service
Submit jobs to the Dataproc service with a jobs.submit call using the gcloud CLI or the Google Cloud console. Set job and cluster permissions by granting Dataproc roles. Use custom roles to separate cluster access from job submit permissions.
Benefits of submitting jobs to the Dataproc service:
- No complicated networking settings required - the API is widely reachable
- Easy to manage IAM permissions and roles
- Track job status easily - no Dataproc job metadata to complicate results.
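For example, a minimal PySpark job submission through the Dataproc service with the gcloud CLI (the bucket path, job script, cluster, and region names are placeholders):
gcloud dataproc jobs submit pyspark gs://BUCKET_NAME/jobs/word_count.py \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    -- gs://BUCKET_NAME/input/ gs://BUCKET_NAME/output/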
In production, run jobs that only depend on cluster-level dependencies at a fixed minor image version (for example, --image-version=2.0). Bundle dependencies with jobs when the jobs are submitted. Submitting an uber jar to Spark or MapReduce is a common way to do this.
- Example: If a job jar depends on args4j and spark-sql, with args4j specific to the job and spark-sql a cluster-level dependency, bundle args4j in the job's uber jar, then submit the uber jar as shown in the following command sketch.
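A sketch of submitting such an uber jar with the gcloud CLI (the jar path, main class, cluster, and region are placeholders); the args4j classes travel inside the uber jar, while spark-sql is supplied by the cluster image:
gcloud dataproc jobs submit spark \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --class=com.example.MyJob \
    --jars=gs://BUCKET_NAME/jars/my-job-uber.jar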
Control initialization action locations
Initialization actions allow you to automatically run scripts or install components when you create a Dataproc cluster (see the dataproc-initialization-actions GitHub repository for common Dataproc initialization actions). When using cluster initialization actions in a production environment, copy initialization scripts to Cloud Storage rather than sourcing them from a public repository. This practice avoids running initialization scripts that are subject to modification by others.
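For example, a sketch of copying an initialization script to your own bucket and then referencing that copy at cluster creation (the script, bucket, cluster, and region names are placeholders):
gcloud storage cp INIT_ACTION.sh gs://BUCKET_NAME/init-actions/INIT_ACTION.sh
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --initialization-actions=gs://BUCKET_NAME/init-actions/INIT_ACTION.sh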
Monitor Dataproc release notes
Dataproc regularly releases new sub-minor image versions. View or subscribe to Dataproc release notes to be aware of the latest Dataproc image version releases and other announcements, changes, and fixes.
View the staging bucket to investigate failures
Look at your cluster's staging bucket to investigate cluster and job error messages. Typically, the staging bucket Cloud Storage location is shown in error messages, as in the following sample error message:
ERROR: (gcloud.dataproc.clusters.create) Operation ... failed: ... - Initialization action failed. Failed action ... see output in: gs://dataproc-<BUCKET_ID>-us-central1/google-cloud-dataproc-metainfo/<CLUSTER_ID>/dataproc-initialization-script-0_output
Use the gcloud CLI to view staging bucket contents:
gcloud storage cat gs://STAGING_BUCKET
Sample output:
+ readonly RANGER_VERSION=1.2.0
...
Ranger admin password not set. Please use metadata flag - default-password
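If an error message doesn't include the staging bucket path, you can look up a cluster's staging bucket from its configuration, for example (CLUSTER_NAME and REGION are placeholders; the bucket name is reported in the cluster's config.configBucket field):
gcloud dataproc clusters describe CLUSTER_NAME \
    --region=REGION \
    --format="value(config.configBucket)"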
Get support
Google Cloud supports your production OSS workloads and helps you meet your business SLAs through tiers of support. Also, Google Cloud Consulting Services can provide guidance on best practices for your team's production deployments.
For more information
Read the Google Cloud blog Dataproc best practices guide.
View Democratizing Dataproc on YouTube.