Use the gcpdiag tool
gcpdiag is an open source tool. It is not an officially supported Google Cloud product. You can use the gcpdiag tool to help you identify and fix Google Cloud project issues. For more information, see the gcpdiag project on GitHub.
The gcpdiag tool helps you discover Dataproc cluster creation issues by performing the following checks:
- Stockout errors: Evaluates Logs Explorer logs to discover stockouts in regions and zones.
- Insufficient quota: Checks quota availability in the Dataproc cluster project.
- Incomplete network configuration: Performs network connectivity tests, including checks for necessary firewall rules and external and internal IP configuration. If the cluster has been deleted, the gcpdiag tool cannot perform a network connectivity check.
- Incorrect cross-project configuration: Checks for cross-project service accounts and reviews additional roles and organization policies enforcement.
- Missing shared Virtual Private Cloud network IAM roles: If the Dataproc cluster uses a Shared VPC network, checks for the addition of required service account roles.
- Initialization action failures: Evaluates Logs Explorer logs to discover initialization action script failures and timeouts.
For a list of gcpdiag cluster-creation steps, see Potential steps.
Run the gcpdiag command
You can run the gcpdiag command from Cloud Shell in the Google Cloud console or within a Docker container.
Google Cloud console
- Complete and then copy the following command.
- Open the Google Cloud console and activate Cloud Shell.
- Paste the copied command.
- Run the gcpdiag command, which downloads the gcpdiag Docker image, and then performs diagnostic checks. If applicable, follow the output instructions to fix failed checks.
gcpdiag runbook dataproc/cluster-creation \
--parameter project_id=PROJECT_ID \
--parameter cluster_name=CLUSTER_NAME \
--parameter OPTIONAL_FLAGS
Docker
You can run gcpdiag using a wrapper that starts gcpdiag in a Docker container. Docker or Podman must be installed.
- Copy and run the following command on your local workstation.
curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag && chmod +x gcpdiag
- Execute the gcpdiag command.
./gcpdiag runbook dataproc/cluster-creation \
--parameter project_id=PROJECT_ID \
--parameter cluster_name=CLUSTER_NAME \
--parameter OPTIONAL_FLAGS
View available parameters for this runbook.
Replace the following:
- PROJECT_ID: The ID of the project containing the resource
- CLUSTER_NAME: The name of the target Dataproc cluster in your project
- OPTIONAL_FLAGS: Add one or more of the following optional parameters. These parameters are required if the cluster has been deleted.
  - cluster_uuid: The UUID of the target Dataproc cluster in your project
  - service_account: The Dataproc cluster VM service account
  - subnetwork: The Dataproc cluster subnetwork full URI path
  - internal_ip_only: True or False
  - cross_project: The cross-project ID if the Dataproc cluster uses a VM service account in another project
Useful flags:
- --universe-domain: If applicable, the Trusted Partner Sovereign Cloud domain hosting the resource
- --parameter or -p: Runbook parameters
For a list and description of all gcpdiag tool flags, see the gcpdiag usage instructions.
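For example, if the cluster has already been deleted, a run might supply the optional parameters explicitly. The following invocation is a sketch; the parameter values are illustrative placeholders that you replace with your own resource names:
./gcpdiag runbook dataproc/cluster-creation \
    --parameter project_id=PROJECT_ID \
    --parameter cluster_name=CLUSTER_NAME \
    --parameter cluster_uuid=CLUSTER_UUID \
    --parameter service_account=SA_NAME@PROJECT_ID.iam.gserviceaccount.com \
    --parameter subnetwork=https://www.googleapis.com/compute/v1/projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME \
    --parameter internal_ip_only=True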
Understand and fix cluster creation errors
This section lists Dataproc error messages and their common causes and solutions.
Operation timed out: Only 0 out of 2 minimum required datanodes/node managers running.
Cause: The controller node is unable to create the cluster because it cannot communicate with worker nodes.
Solution:
- Check firewall rule warnings.
- Make sure the correct firewall rules are in place. For more information, see Overview of the default Dataproc firewall rules.
- Perform a connectivity test in the Google Cloud console to determine what is blocking communication between the controller and worker nodes.
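For the connectivity test in the preceding step, you can use Connectivity Tests in Network Intelligence Center. The following gcloud command is a sketch; the test name, project, zone, node names, and port are placeholders that you adjust for your cluster:
gcloud network-management connectivity-tests create dataproc-master-to-worker \
    --source-instance=projects/PROJECT_ID/zones/ZONE/instances/CLUSTER_NAME-m \
    --destination-instance=projects/PROJECT_ID/zones/ZONE/instances/CLUSTER_NAME-w-0 \
    --protocol=TCP \
    --destination-port=PORT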
Required compute.subnetworks.use permission for projects/{projectId}/regions/{region}/subnetworks/{subnetwork}
Cause: This error can occur when you attempt to set up a Dataproc cluster using a VPC network in another project and the Dataproc Service Agent service account does not have the necessary permissions on the Shared VPC project that hosts the network.
Solution: Follow the steps listed in Create a cluster that uses a VPC network in another project.
The zone projects/zones/{zone} does not have enough resources available to fulfill the request (resource type: compute)
Cause: The zone being used to create the cluster does not have sufficient resources.
Solution:
- Create the cluster in a different zone.
- Use the Dataproc Auto Zone placement feature.
Quota Exceeded errors
Insufficient CPUS/CPUS_ALL_REGIONS quota
Insufficient 'DISKS_TOTAL_GB' quota
Insufficient 'IN_USE_ADDRESSES' quota
Cause: Your CPU, disk, or IP address request exceeds your available quota.
Solution: Request additional quota from the Google Cloud console.
Initialization action failed
Cause: The initialization action provided during cluster creation failed to install.
Solution:
- See initialization actions considerations and guidelines.
- Examine the output logs. The error message should provide a link to the logs in Cloud Storage.
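The error message typically includes a Cloud Storage URI for the initialization action output. As a sketch, you can view that output from Cloud Shell with a command like the following, replacing the path with the one from your error message:
gcloud storage cat gs://PATH_TO_INIT_ACTION_OUTPUT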
Failed to initialize node CLUSTER-NAME-m. ... See output in: <gs://PATH_TO_STARTUP_SCRIPT_OUTPUT>
Cause: The Dataproc cluster controller node failed to initialize.
Solution:
- Review the startup script output logs listed in the error message (gs://PATH_TO_STARTUP_SCRIPT_OUTPUT) and verify the cause of the failed node initialization.
- Causes can include Dataproc cluster network configuration issues and failed installation of Python package dependencies.
- If the issue is not resolved after you review the startup-script logs, fix any user-side issues, and then retry cluster creation with exponential backoff. If the issue persists, contact Google Cloud support.
Cluster creation failed: IP address space exhausted
Cause: IP address space needed to provision the requested cluster nodes is unavailable.
Solution:
- Create a cluster on a different subnetwork or network.
- Reduce usage on the network to free IP address space.
- Wait until sufficient IP space becomes available on the network.
Initialization script error message: The repository REPO_NAME no longer has a Release file
Cause: The Debian oldstable backports repository was purged.
Solution:
Add the following code before the code that runs apt-get in your initialization script.
oldstable=$(curl -s https://deb.debian.org/debian/dists/oldstable/Release | awk '/^Codename/ {print $2}');
stable=$(curl -s https://deb.debian.org/debian/dists/stable/Release | awk '/^Codename/ {print $2}');

matched_files="$(grep -rsil '\-backports' /etc/apt/sources.list*)"

if [[ -n "$matched_files" ]]; then
  for filename in "$matched_files"; do
    grep -e "$oldstable-backports" -e "$stable-backports" "$filename" || \
      sed -i -e 's/^.*-backports.*$//' "$filename"
  done
fi
Timeout waiting for instance DATAPROC_CLUSTER_VM_NAME to report in or Network is unreachable: dataproccontrol-REGION.googleapis.com
Cause: These error messages indicate that the networking setup of your Dataproc cluster is incomplete: the route to the default internet gateway or required firewall rules might be missing.
Solution:
To troubleshoot this issue, you can create the following Connectivity Tests:
- Create a Connectivity Test between two Dataproc cluster VMs. The outcome of this test will help you understand whether the ingress or egress allow firewall rules of your network apply to the cluster VMs correctly.
- Create a Connectivity Test between a Dataproc cluster VM and a current Dataproc control API IP address. To get a current Dataproc control API IP address, use the following command:
dig dataproccontrol-REGION.googleapis.com A
Use any of the IPv4 addresses in the answer section of the output.
The outcome of the Connectivity Test will help you understand whether the route to the default internet gateway and the egress allow firewall are properly configured.
Based on the outcomes of the Connectivity Tests:
- Add a route to the internet to your cluster VPC network: 0.0.0.0/0 for IPv4 and ::/0 for IPv6 with --next-hop-gateway=default-internet-gateway, as shown in the example after this list.
- Add firewall rules for access control.
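The following commands are a sketch of these fixes for an IPv4-only VPC network; NETWORK_NAME, SUBNET_RANGE, and the route and rule names are placeholders you replace with your own values:
# Add a default route to the internet through the default internet gateway.
gcloud compute routes create default-internet-route \
    --network=NETWORK_NAME \
    --destination-range=0.0.0.0/0 \
    --next-hop-gateway=default-internet-gateway

# Allow internal traffic between cluster VMs in the subnet range.
gcloud compute firewall-rules create allow-dataproc-internal \
    --network=NETWORK_NAME \
    --allow=tcp,udp,icmp \
    --source-ranges=SUBNET_RANGE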
Error due to update
Cause: The cluster accepted a job submitted to the Dataproc service, but was unable to scale up or down manually or through autoscaling. This error can also be caused by a non-standard cluster configuration.
Solution:
Cluster reset: Open a support ticket, include a diagnostic tar file, and ask for the cluster to be reset to a RUNNING state.
New cluster: Recreate the cluster with the same configuration. This solution can be faster than a support-provided reset.
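To recreate the cluster with the same configuration, one approach is to export the existing cluster configuration and import it into a new cluster. The following commands are a sketch; cluster-config.yaml and NEW_CLUSTER_NAME are placeholders:
# Export the configuration of the existing cluster to a YAML file.
gcloud dataproc clusters export CLUSTER_NAME \
    --region=REGION \
    --destination=cluster-config.yaml

# Create a new cluster from the exported configuration.
gcloud dataproc clusters import NEW_CLUSTER_NAME \
    --region=REGION \
    --source=cluster-config.yaml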
Cluster troubleshooting tips
This section provides additional guidance on troubleshooting common issues that can prevent the creation of Dataproc clusters.
When a Dataproc cluster fails to provision, it often produces a generic error message or reports a PENDING or PROVISIONING status before failing. The key to diagnosing and solving cluster failure issues is to examine cluster logs and assess common failure points.
Common symptoms and error messages
The following are common symptoms and error messages associated with cluster creation failures:
- Cluster status remains PENDING or PROVISIONING for an extended period.
- Cluster transitions to the ERROR state.
- Generic API errors during cluster creation, such as Operation timed out.
- Logged or API response error messages, such as:
  - RESOURCE_EXHAUSTED: related to CPU, disk, or IP address quotas
  - Instance failed to start
  - Permission denied
  - Unable to connect to service_name.googleapis.com or Could not reach required Google APIs
  - Connection refused or network unreachable
- Errors related to initialization actions failing, such as script execution errors and file not found.
Review cluster logs
An important initial step when diagnosing cluster creation failures is reviewing the detailed cluster logs available in Cloud Logging.
- Go to Logs Explorer: Open the Logs Explorer in the Google Cloud console.
- Filter for Dataproc clusters (see the sample query after these steps):
  - In the Resource drop-down, select Cloud Dataproc Cluster.
  - Enter your cluster_name and project_id. You can also filter by location (region).
- Examine Log Entries:
  - Look for ERROR or WARNING level messages that occur close to the time of the cluster creation failure.
  - Pay attention to logs from master-startup, worker-startup, and agent components for insights into VM-level or Dataproc agent issues.
  - For insight into VM boot-time issues, filter logs by resource.type="gce_instance", and look for messages from the instance names associated with your cluster nodes, such as CLUSTER_NAME-m or CLUSTER_NAME-w-0. Serial console logs can reveal network configuration issues, disk problems, and script failures that occur early in the VM lifecycle.
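As a starting point, the following Logs Explorer query is a sketch that surfaces error-level entries for a specific cluster; replace CLUSTER_NAME and PROJECT_ID with your values:
resource.type="cloud_dataproc_cluster"
resource.labels.cluster_name="CLUSTER_NAME"
resource.labels.project_id="PROJECT_ID"
severity>=ERROR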
Common cluster failure causes and troubleshooting tips
This section outlines common reasons why Dataproc cluster creation might fail and provides tips to help you troubleshoot those failures.
Insufficient IAM permissions
The VM service account that your Dataproc cluster uses must have appropriate IAM roles to provision Compute Engine instances, access Cloud Storage buckets, write logs, and interact with other Google Cloud services.
- Required Worker role: Verify that the VM service account has the Dataproc Worker role (roles/dataproc.worker). This role has the minimum permissions required for Dataproc to manage cluster resources.
- Data access permissions: If your jobs read from or write to Cloud Storage or BigQuery, the service account needs related roles, such as Storage Object Viewer, Storage Object Creator, or Storage Object Admin for Cloud Storage, or BigQuery Data Viewer or BigQuery Editor for BigQuery.
- Logging permissions: The service account must have a role with permissions needed to write logs to Cloud Logging, such as the Logging Writer role.
Troubleshooting tips:
- Identify service account: Determine the VM service account that your cluster is configured to use. If not specified, the default is the Compute Engine default service account.
- Verify IAM roles: Go to the IAM & Admin > IAM page in the Google Cloud console, find the cluster VM service account, and then verify that it has the roles needed for cluster operations. Grant any missing roles.
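For example, granting the Dataproc Worker role to the cluster VM service account can be done with a command like the following; SERVICE_ACCOUNT_EMAIL is a placeholder for the account you identified:
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/dataproc.worker"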
Resource quotas exceeded
Dataproc clusters consume resources from Compute Engine and other Google Cloud services. Exceeding project or regional quotas can cause cluster creation failures.
- Common Dataproc quotas to check:
  - CPUs (regional)
  - DISKS_TOTAL_GB (regional)
  - IN_USE_ADDRESSES (regional for internal IPs, global for external IPs)
  - Dataproc API quotas, such as ClusterOperationRequestsPerMinutePerProjectPerRegion.
Troubleshooting tips:
- Review quotas: Go to the IAM & Admin > Quotas page in the Google Cloud console. Filter by "Service" for "Compute Engine API" and "Dataproc API."
- Check usage vs. limit: Identify any quotas that are at or near their limits.
- If necessary, request a quota increase.
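You can also inspect regional Compute Engine quota usage from the command line. The following command is a sketch that lists quota metrics, usage, and limits for a region; REGION and PROJECT_ID are placeholders:
gcloud compute regions describe REGION \
    --project=PROJECT_ID \
    --flatten="quotas[]" \
    --format="table(quotas.metric, quotas.usage, quotas.limit)"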
Network configuration issues
Network configuration issues, such as incorrect VPC network, subnet, firewall, or DNS configuration, are a common cause of cluster creation failures. Cluster instances must be able to communicate with each other and with Google APIs.
- VPC network and subnet:
- Verify that the cluster VPC network and subnet exist and are configured correctly.
- Verify that the subnet has a sufficient range of available IP addresses.
- Private Google Access (PGA): If cluster VMs have internal IP addresses and need to reach Google APIs for Cloud Storage, Cloud Logging, and other operations, verify that Private Google Access is enabled on the subnet. By default, Dataproc clusters created with 2.2+ image versions provision VMs with internal-only IP addresses with Private Google Access enabled on the cluster regional subnet.
- Private Service Connect (PSC): If you are using Private Service Connect to access Google APIs, verify that the necessary Private Service Connect endpoints are correctly configured for the Google APIs that Dataproc depends on, such as dataproc.googleapis.com, storage.googleapis.com, compute.googleapis.com, and logging.googleapis.com. DNS entries for the APIs must resolve to private IP addresses. Note that using Private Service Connect does not eliminate the need to use VPC peering to communicate with other customer-managed VPC networks.
- VPC Peering: If your cluster communicates with resources in other VPC networks, such as shared VPC host projects or other customer VPCs, verify that VPC peering is correctly configured and routes are propagating.
Firewall rules:
- Default rules: Verify that default firewall rules, such as allow-internal or allow-ssh, are not overly restrictive.
- Custom rules: If custom firewall rules are in place, verify that they allow needed communication paths:
  - Internal communication within the cluster (between -m and -w nodes).
  - Outbound traffic from cluster VMs to Google APIs, using public IPs and an internet gateway, Private Google Access, or Private Service Connect endpoints.
  - Traffic to any external data sources or services that your jobs depend on.
DNS resolution: Confirm that cluster instances can correctly resolve DNS names for Google APIs and any internal or external services.
Troubleshooting tips:
- Review network configuration: Inspect the VPC network and subnet settings where the cluster is being deployed.
- Check firewall rules: Review firewall rules in the VPC network or shared VPC host project.
- Test connectivity: Launch a temporary Compute Engine VM in the cluster subnet and run the following checks (see the example commands after this list):
  - ping or curl to external Google API domains, such as storage.googleapis.com.
  - nslookup to verify DNS resolution to expected IP addresses (Private Google Access or Private Service Connect).
  - Run Google Cloud connectivity tests to diagnose paths from a test VM to relevant endpoints.
Initialization action failures
Dataproc initialization actions are scripts that run on cluster VMs during cluster creation. Errors in these scripts can prevent cluster startup.
Troubleshooting tips:
- Examine logs for initialization action errors: Look for log entries related to init-actions or startup-script for the cluster instances in Cloud Logging.
- Check script paths and permissions: Verify that initialization action scripts are correctly located in Cloud Storage and that the cluster VM service account has the Storage Object Viewer role needed to read Cloud Storage scripts.
- Debug script logic: Test script logic on a separate Compute Engine VM that mimics the cluster environment to identify errors. Add verbose logging to the script, as shown in the sketch after this list.
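One common way to add verbose logging is to enable shell tracing at the top of the initialization script. This is a generic sketch, not a Dataproc-specific requirement; PACKAGE_NAME is a placeholder:
#!/bin/bash
# Fail fast and print each command as it runs so errors are easy to locate
# in the startup-script and init-actions logs.
set -euxo pipefail

# Your initialization steps, for example:
apt-get update
apt-get install -y PACKAGE_NAME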
Regional resource availability (stockouts)
Occasionally, a machine type or resource in a region or zone
experiences temporary unavailability (stockout). Typically, this results
in RESOURCE_EXHAUSTED
errors unrelated to project quota issues.
Troubleshooting tips:
- Try a different zone or region: Attempt to create the cluster in a different zone within the same region, or in a different region.
- Use Auto Zone placement: Use the Dataproc Auto Zone placement feature to automatically select a zone with capacity.
- Adjust machine type: If using a custom or specialized machine type, try a standard machine type to see if that resolves the issue.
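To use Auto Zone placement from the gcloud CLI, omit the zone when you create the cluster and let Dataproc choose a zone in the region with available capacity. A minimal sketch:
# Omitting --zone lets Dataproc Auto Zone placement pick a zone in REGION.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION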
Next steps
If you continue to experience cluster failure issues:
- Contact Cloud Customer Care. Describe the cluster failure issue and the troubleshooting steps you have taken, and provide the following information:
- Cluster diagnostic data
- Output from the following command:
gcloud dataproc clusters describe CLUSTER_NAME \
    --region=REGION
- Exported logs for the failed cluster.
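One way to gather the cluster diagnostic data mentioned in the preceding list is to run the Dataproc diagnose command, which writes a diagnostic archive to Cloud Storage:
gcloud dataproc clusters diagnose CLUSTER_NAME \
    --region=REGION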