This document discusses Dataproc best practices that can help you run reliable, efficient, and insightful data processing jobs on Dataproc clusters in production environments.
Specify cluster image versions
Dataproc uses image versions to bundle the operating system, big data components, and Google Cloud connectors into a package that is deployed on a cluster. If you don't specify an image version when creating a cluster, Dataproc defaults to the most recent stable image version.
For production environments, associate your cluster with a specific major.minor Dataproc image version, as shown in the following gcloud CLI command.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=region \
    --image-version=2.0
Dataproc resolves the major.minor version to the latest sub-minor version (2.0 is resolved to 2.0.x). Note: If you need to rely on a specific sub-minor version for your cluster, you can specify it: for example, --image-version=2.0.x. See How versioning works for more information.
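A deployment script can enforce the pinning rule before it creates a cluster. The following is a minimal sketch, not part of Dataproc tooling: `is_pinned_minor` and `create_cluster_cmd` are hypothetical helpers, and the cluster name and region are placeholder values. The script only prints the create command; it does not run it.

```shell
# Hypothetical helper (not part of the gcloud CLI): succeeds only when the
# argument is a bare major.minor version such as 2.0.
is_pinned_minor() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+$'
}

# Prints the cluster-create command for a pinned version.
# Arguments: cluster name, region, image version.
# Returns 1 (and prints nothing on stdout) when the version is empty or
# is not in major.minor form.
create_cluster_cmd() {
  if ! is_pinned_minor "$3"; then
    echo "error: image version '$3' is not in major.minor form" >&2
    return 1
  fi
  echo "gcloud dataproc clusters create $1 --region=$2 --image-version=$3"
}

# Placeholder values for illustration only.
create_cluster_cmd my-cluster us-central1 2.0
```

A team that pins full sub-minor versions instead would need to loosen the pattern in `is_pinned_minor`.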
Dataproc preview image versions
New minor versions of Dataproc images are available in a preview version prior to release in the standard minor image version track. Use a preview image to test and validate your jobs against a new minor image version prior to adopting the standard minor image version in production. See Dataproc versioning for more information.
Use custom images when necessary
If you have dependencies to add to the cluster, such as native Python libraries, or security hardening or virus protection software, create a custom image from the latest image in your target minor image version track. This practice allows you to meet dependency requirements when you create clusters using your custom image. When you rebuild your custom image to update dependency requirements, use the latest available sub-minor image version within the minor image track.
Submit jobs to the Dataproc service
Submit jobs to the Dataproc service with a jobs.submit call using the gcloud CLI or the Google Cloud console. Set job and cluster permissions by granting Dataproc roles. Use custom roles to separate cluster access from job submit permissions.
Benefits of submitting jobs to the Dataproc service:
- No complicated networking settings required - the API is widely reachable
- Easy to manage IAM permissions and roles
- Track job status easily - no Dataproc job metadata to complicate results.
In production, run jobs that depend only on cluster-level dependencies at a fixed minor image version (for example, --image-version=2.0). Bundle dependencies with jobs when the jobs are submitted. Submitting an uber jar to Spark or MapReduce is a common way to do this.
- Example: If a job jar depends on args4j and spark-sql, with args4j specific to the job and spark-sql a cluster-level dependency, bundle args4j in the job's uber jar.
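Submitting such an uber jar through the Dataproc service might look like the following sketch. The cluster name, region, bucket, jar path, and main class are hypothetical placeholders, and `run` is a helper defined here that prints the command by default (set `DRY_RUN=0` to execute it against a real project).

```shell
# Dry-run wrapper: prints the command unless DRY_RUN=0 is set, so the sketch
# can be read and tried without touching a real project.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# All values below are hypothetical placeholders.
CLUSTER_NAME="my-cluster"
REGION="us-central1"
# Uber jar that bundles args4j; spark-sql is expected to come from the
# cluster image and is not bundled.
UBER_JAR="gs://my-bucket/jobs/my-job-uber.jar"
MAIN_CLASS="com.example.MyJob"

# Submits the job to the Dataproc service; arguments after the function name
# are passed through to the job itself.
submit_job() {
  run gcloud dataproc jobs submit spark \
    --cluster="$CLUSTER_NAME" \
    --region="$REGION" \
    --class="$MAIN_CLASS" \
    --jars="$UBER_JAR" \
    -- "$@"
}

submit_job --input gs://my-bucket/input
```

In a Maven or sbt build, keeping spark-sql out of the uber jar usually means declaring it as a provided dependency.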
Control initialization action locations
Initialization actions allow you to automatically run scripts or install components when you create a Dataproc cluster (see the dataproc-initialization-actions GitHub repository for common Dataproc initialization actions). When using cluster initialization actions in a production environment, copy initialization scripts to Cloud Storage rather than sourcing them from a public repository. This practice avoids running initialization scripts that are subject to modification by others.
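One way to follow this practice is to copy a script into a bucket you control and point the cluster at that copy. The sketch below assumes the script is published in the regional public bucket `goog-dataproc-initialization-actions-REGION` under `python/pip-install.sh`; verify that path before relying on it. The destination bucket and cluster name are hypothetical, and `run` prints each command by default instead of executing it.

```shell
# Dry-run wrapper: prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

REGION="us-central1"
# Assumption: location of the public copy of the initialization action.
PUBLIC_SCRIPT="gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh"
# Hypothetical bucket that only your team can write to.
PRIVATE_SCRIPT="gs://my-bucket/init-actions/pip-install.sh"

# Step 1: copy the script into the controlled bucket.
stage_init_action() {
  run gcloud storage cp "$PUBLIC_SCRIPT" "$PRIVATE_SCRIPT"
}

# Step 2: create the cluster from the private copy, not the public one.
create_cluster() {
  run gcloud dataproc clusters create my-cluster \
    --region="$REGION" \
    --image-version=2.0 \
    --initialization-actions="$PRIVATE_SCRIPT"
}

stage_init_action
create_cluster
```

Review the copied script before use, and repeat the copy deliberately when you want to pick up upstream changes.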
Monitor Dataproc release notes
Dataproc regularly releases new sub-minor image versions. View or subscribe to Dataproc release notes to be aware of the latest Dataproc image version releases and other announcements, changes, and fixes.
View the staging bucket to investigate failures
Look at your cluster's staging bucket to investigate cluster and job error messages. Typically, the staging bucket Cloud Storage location is shown in error messages, as in the gs:// path in the following sample error message:
ERROR: (gcloud.dataproc.clusters.create) Operation ... failed: ... - Initialization action failed. Failed action ... see output in: gs://dataproc-<BUCKETID>-us-central1/google-cloud-dataproc-metainfo/CLUSTERID/<CLUSTER_ID>\dataproc-initialization-script-0_output
Use the gcloud CLI to view staging bucket contents:
gcloud storage cat gs://STAGING_BUCKET
Sample output:
+ readonly RANGER_VERSION=1.2.0 ... Ranger admin password not set. Please use metadata flag - default-password
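When the error text is captured in a log, the Cloud Storage path can be pulled out with standard text tools. The following is a small sketch under stated assumptions: `extract_gcs_uri` is a hypothetical helper, and the error text is a shortened, made-up variant of the sample error message above. The script prints the follow-up command; it does not run it.

```shell
# Prints the first gs:// URI found in the text read from standard input.
# Prints nothing when the input contains no gs:// URI.
extract_gcs_uri() {
  grep -o 'gs://[^[:space:]]*' | head -n 1
}

# Hypothetical error text for illustration.
ERROR_TEXT='Initialization action failed. Failed action ... see output in: gs://my-staging-bucket/google-cloud-dataproc-metainfo/CLUSTER_UUID/dataproc-initialization-script-0_output'

OUTPUT_URI="$(printf '%s\n' "$ERROR_TEXT" | extract_gcs_uri)"

# Print the command that would display the failed script's output.
echo "gcloud storage cat $OUTPUT_URI"
```

The pattern stops at the first whitespace character, so punctuation that directly follows a path in other messages would need to be trimmed separately.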
Get support
Google Cloud supports your production OSS workloads and helps you meet your business SLAs through tiers of support. Also, Google Cloud Consulting Services can provide guidance on best practices for your team's production deployments.
For more information
Read the Google Cloud blog Dataproc best practices guide.
View Democratizing Dataproc on YouTube.