Troubleshoot cluster creation issues

Use the gcpdiag tool

gcpdiag is an open source tool. It is not an officially supported Google Cloud product. You can use the gcpdiag tool to help you identify and fix Google Cloud project issues. For more information, see the gcpdiag project on GitHub.

The gcpdiag tool helps you discover Dataproc cluster creation issues by performing the following checks:

  • Stockout errors: Evaluates Logs Explorer logs to discover stockouts in regions and zones.
  • Insufficient quota: Checks quota availability in the Dataproc cluster project.
  • Incomplete network configuration: Performs network connectivity tests, including checks for necessary firewall rules and external and internal IP configuration. If the cluster has been deleted, the gcpdiag tool cannot perform a network connectivity check.
  • Incorrect cross-project configuration: Checks for cross-project service accounts and reviews additional roles and organization policies enforcement.
  • Missing shared Virtual Private Cloud network IAM roles: If the Dataproc cluster uses a Shared VPC network, checks for the addition of required service account roles.
  • Initialization action failures: Evaluates Logs Explorer logs to discover initialization action script failures and timeouts.

For a list of gcpdiag cluster-creation steps, see Potential steps.

Run the gcpdiag command

You can run the gcpdiag command from Cloud Shell in the Google Cloud console or within a Docker container.

Google Cloud console

  1. Complete and then copy the following command.
    gcpdiag runbook dataproc/cluster-creation \
        --parameter project_id=PROJECT_ID \
        --parameter cluster_name=CLUSTER_NAME \
        --parameter OPTIONAL_FLAGS
  2. Open the Google Cloud console and activate Cloud Shell.
  3. Paste the copied command.
  4. Run the gcpdiag command, which downloads the gcpdiag docker image, and then performs diagnostic checks. If applicable, follow the output instructions to fix failed checks.

Docker

You can run gcpdiag using a wrapper that starts gcpdiag in a Docker container. Docker or Podman must be installed.

  1. Copy and run the following command on your local workstation.
    curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag && chmod +x gcpdiag
  2. Execute the gcpdiag command.
    ./gcpdiag runbook dataproc/cluster-creation \
        --parameter project_id=PROJECT_ID \
        --parameter cluster_name=CLUSTER_NAME \
        --parameter OPTIONAL_FLAGS

View available parameters for this runbook.

Replace the following:

    • PROJECT_ID: The ID of the project containing the resource
    • CLUSTER_NAME: The name of the target Dataproc cluster in your project
    • OPTIONAL_FLAGS: Add one or more of the following optional parameters. These parameters are required if the cluster has been deleted (see the example after this list).
      • cluster_uuid: The UUID of the target Dataproc cluster in your project
      • service_account: The Dataproc cluster VM service account
      • subnetwork: The Dataproc cluster subnetwork full URI path
      • internal_ip_only: True or False
      • cross_project: The cross-project ID if the Dataproc cluster uses a VM service account in another project
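
For example, for a cluster that has already been deleted, a run that supplies the optional parameters might look like the following sketch. All values shown are placeholders to adapt to your project; they only illustrate the command format documented above.

    ./gcpdiag runbook dataproc/cluster-creation \
        --parameter project_id=example-project \
        --parameter cluster_name=example-cluster \
        --parameter cluster_uuid=CLUSTER_UUID \
        --parameter service_account=cluster-sa@example-project.iam.gserviceaccount.com \
        --parameter subnetwork=https://www.googleapis.com/compute/v1/projects/example-project/regions/us-central1/subnetworks/example-subnet \
        --parameter internal_ip_only=True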

Useful flags: For a list and description of all gcpdiag tool flags, see the gcpdiag usage instructions.

Understand and fix cluster creation errors

This section lists Dataproc error messages with their common causes and solutions.

  • Operation timed out: Only 0 out of 2 minimum required datanodes/node managers running.

    Cause: The controller node is unable to create the cluster because it cannot communicate with worker nodes.

    Solution: Verify that your VPC firewall rules allow internal communication between the cluster VMs (for example, between the -m and -w nodes), and then recreate the cluster. For more guidance, see Network configuration issues later on this page.

  • Required compute.subnetworks.use permission for projects/{projectId}/regions/{region}/subnetworks/{subnetwork}

    Cause: This error can occur when you attempt to set up a Dataproc cluster using a VPC network in another project and the Dataproc Service Agent service account does not have the necessary permissions on the Shared VPC project that is hosting the network.

    Solution: Follow the steps listed in Create a cluster that uses a VPC network in another project.
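
    The linked guide lists the service accounts that need the Compute Network User role on the host project. As a sketch, the grant for the Dataproc Service Agent might look like the following; the service agent address format shown (service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com) is an assumption to confirm against that guide.

    # Grant the Network User role on the Shared VPC host project.
    gcloud projects add-iam-policy-binding HOST_PROJECT_ID \
        --member="serviceAccount:service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com" \
        --role="roles/compute.networkUser"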

  • The zone projects/zones/{zone} does not have enough resources available to fulfill the request (resource type:compute)

    Cause: The zone being used to create the cluster does not have sufficient resources.

    Solution: Create the cluster in a different zone or region, or use the Dataproc Auto Zone placement feature to let Dataproc select a zone with available capacity. For more information, see Regional resource availability (stockouts) later on this page.

  • Quota Exceeded errors

    Insufficient CPUS/CPUS_ALL_REGIONS quota
    Insufficient 'DISKS_TOTAL_GB' quota
    Insufficient 'IN_USE_ADDRESSES' quota

    Cause: Your CPU, disk, or IP address request exceeds your available quota.

    Solution: Request additional quota from the Google Cloud console.
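
    To compare current usage against regional limits before requesting an increase, you can inspect the region's quota metrics. This is a minimal check; REGION and PROJECT_ID are placeholders.

    # The quotas section lists metric, limit, and usage for CPUS,
    # DISKS_TOTAL_GB, IN_USE_ADDRESSES, and other regional quotas.
    gcloud compute regions describe REGION --project=PROJECT_ID --format="yaml(quotas)"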

  • Initialization action failed

    Cause: The initialization action provided during cluster creation failed to install.

    Solution: Review the initialization action log output for the cluster nodes in Cloud Logging, fix the script error, and then recreate the cluster. For more guidance, see Initialization action failures later on this page.

  • Failed to initialize node CLUSTER-NAME-m. ... See output in: <gs://PATH_TO_STARTUP_SCRIPT_OUTPUT>

    Cause: Dataproc cluster controller node failed to be initialized.

    Solution: Review the startup script output at the Cloud Storage path listed in the error message to identify and fix the cause of the controller node initialization failure, then recreate the cluster.

  • Cluster creation failed: IP address space exhausted

    Cause: IP address space needed to provision the requested cluster nodes is unavailable.

    Solution:

    • Create a cluster on a different subnetwork or network.
    • Reduce usage on the network to free IP address space.
    • Wait until sufficient IP space becomes available on the network.
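
    To check whether the subnetwork's primary range is the bottleneck, you can view its CIDR range; each cluster VM consumes an address from this range. SUBNET_NAME, REGION, and PROJECT_ID are placeholders.

    # Show the primary IP range of the cluster subnetwork.
    gcloud compute networks subnets describe SUBNET_NAME \
        --region=REGION --project=PROJECT_ID \
        --format="value(ipCidrRange)"
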
  • Initialization script error message: The repository REPO_NAME no longer has a Release file

    Cause: The Debian oldstable backports repository was purged.

    Solution:

    Add the following code before the code that runs apt-get in your initialization script.

    # Look up the current Debian codenames for the stable and oldstable releases.
    oldstable=$(curl -s https://deb.debian.org/debian/dists/oldstable/Release | awk '/^Codename/ {print $2}');
    stable=$(curl -s https://deb.debian.org/debian/dists/stable/Release | awk '/^Codename/ {print $2}');

    # Find apt source files that reference a -backports repository.
    matched_files="$(grep -rsil '\-backports' /etc/apt/sources.list*)"
    if [[ -n "$matched_files" ]]; then
      # Iterate over each matched file (unquoted so the newline-separated
      # list splits into individual file names).
      for filename in $matched_files; do
        # Keep entries that point at the current stable or oldstable backports;
        # blank out any other (purged) -backports entries.
        grep -e "$oldstable-backports" -e "$stable-backports" "$filename" || \
          sed -i -e 's/^.*-backports.*$//' "$filename"
      done
    fi
    
  • Timeout waiting for instance DATAPROC_CLUSTER_VM_NAME to report in or Network is unreachable: dataproccontrol-REGION.googleapis.com

    Cause: These error messages indicate that the networking setup of your Dataproc cluster is incomplete: you may be missing the route to the default internet gateway or firewall rules.

    Solution:

    To troubleshoot this issue, you can create the following Connectivity Tests:

    • Create a Connectivity Test between two Dataproc cluster VMs. The outcome of this test will help you understand whether the ingress or egress allow firewall rules of your network apply to the cluster VMs correctly.
    • Create a Connectivity Test between a Dataproc cluster VM and a current Dataproc control API IP address. To get a current Dataproc control API IP address, use the following command:
    dig dataproccontrol-REGION.googleapis.com A
    

    Use any of the IPv4 addresses in the answer section of the output.
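
    For example, you could create the second test with the gcloud CLI as sketched below. The test name dataproc-control-test is arbitrary, and CONTROL_API_IP is one of the IPv4 addresses returned by the dig command.

    # Create a Connectivity Test from the cluster controller VM to the
    # Dataproc control API address on port 443.
    gcloud network-management connectivity-tests create dataproc-control-test \
        --source-instance=projects/PROJECT_ID/zones/ZONE/instances/CLUSTER_NAME-m \
        --destination-ip-address=CONTROL_API_IP \
        --destination-port=443 \
        --protocol=TCP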

    The outcome of the Connectivity Test will help you understand whether the route to the default internet gateway and the egress allow firewall are properly configured.

    Based on the outcomes of the Connectivity Tests, add or correct the missing routes or firewall rules that the tests identify, and then recreate the cluster.

  • Error due to update

    Cause: The cluster accepted a job submitted to the Dataproc service, but was unable to scale up or down manually or through autoscaling. This error can also be caused by a non-standard cluster configuration.

    Solution:

    • Cluster reset: Open a support ticket, include a diagnostic tar file, and ask for the cluster to be reset to a RUNNING state.

    • New cluster: Recreate the cluster with the same configuration. This solution can be faster than a support-provided reset.

Cluster troubleshooting tips

This section provides additional guidance on troubleshooting common issues that can prevent the creation of Dataproc clusters.

When a Dataproc cluster fails to provision, it often produces a generic error message or reports a PENDING or PROVISIONING status before failing. The key to diagnosing and solving cluster failure issues is to examine cluster logs and assess common failure points.

Common symptoms and error messages

The following are common symptoms and error messages associated with cluster creation failures:

  • Cluster status remains PENDING or PROVISIONING for an extended period.
  • Cluster transitions to ERROR state.
  • Generic API errors during cluster creation, such as Operation timed out.
  • Logged or API response error messages, such as:

    • RESOURCE_EXHAUSTED: related to CPU, disk, or IP address quotas
    • Instance failed to start
    • Permission denied
    • Unable to connect to service_name.googleapis.com or Could not reach required Google APIs
    • Connection refused or network unreachable
    • Errors related to initialization actions failing, such as script execution errors and file not found.

Review cluster logs

An important initial step when diagnosing cluster creation failures is reviewing the detailed cluster logs available in Cloud Logging.

  1. Go to Logs Explorer: Open the Logs Explorer in the Google Cloud console.
  2. Filter for Dataproc clusters:
    • In the Resource drop-down, select Cloud Dataproc Cluster.
    • Enter your cluster_name and project_id. You can also filter by location (region).
  3. Examine Log Entries:
    • Look for ERROR or WARNING level messages that occur close to the time of the cluster creation failure.
    • Pay attention to logs from master-startup, worker-startup, and agent components for insights into VM-level or Dataproc agent issues.
    • For insight into VM boot-time issues, filter logs by resource.type="gce_instance", and look for messages from the instance names associated with your cluster nodes, such as CLUSTER_NAME-m or CLUSTER_NAME-w-0. Serial console logs can reveal network configuration issues, disk problems, and script failures that occur early in the VM lifecycle.
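
For example, a Logs Explorer query like the following returns error-level entries for a specific cluster; replace CLUSTER_NAME and REGION with your values.

    resource.type="cloud_dataproc_cluster"
    resource.labels.cluster_name="CLUSTER_NAME"
    resource.labels.region="REGION"
    severity>=ERROR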

Common cluster failure causes and troubleshooting tips

This section outlines common reasons why Dataproc cluster creation might fail and provides tips to help you diagnose and resolve them.

Insufficient IAM permissions

The VM service account that your Dataproc cluster uses must have appropriate IAM roles to provision Compute Engine instances, access Cloud Storage buckets, write logs, and interact with other Google Cloud services.

  • Required Worker role: Verify that the VM service account has the Dataproc Worker role (roles/dataproc.worker). This role has the minimum permissions required for Dataproc to manage cluster resources.
  • Data access permissions: If your jobs read from or write to Cloud Storage or BigQuery, the service account needs related roles, such as Storage Object Viewer, Storage Object Creator, or Storage Object Admin for Cloud Storage, or BigQuery Data Viewer or BigQuery Editor for BigQuery.
  • Logging permissions: The service account must have a role with permissions needed to write logs to Cloud Logging, such as the Logging Writer role.

Troubleshooting tips:

  • Identify service account: Determine the VM service account that your cluster is configured to use. If not specified, the default is the Compute Engine default service account.

  • Verify IAM roles: Go to the IAM & Admin > IAM page in the Google Cloud console, find the cluster VM service account, and then verify that it has the necessary roles needed for cluster operations. Grant any missing roles.
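
For example, the following gcloud commands sketch these checks. CLUSTER_NAME, REGION, PROJECT_ID, and SERVICE_ACCOUNT_EMAIL are placeholders.

    # Find the VM service account the cluster is configured to use
    # (an empty result means the Compute Engine default service account).
    gcloud dataproc clusters describe CLUSTER_NAME --region=REGION \
        --format="value(config.gceClusterConfig.serviceAccount)"

    # List the roles granted to that service account in the project.
    gcloud projects get-iam-policy PROJECT_ID \
        --flatten="bindings[].members" \
        --filter="bindings.members:SERVICE_ACCOUNT_EMAIL" \
        --format="table(bindings.role)"

    # Grant the Dataproc Worker role if it is missing.
    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
        --role="roles/dataproc.worker"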

Resource quotas exceeded

Dataproc clusters consume resources from Compute Engine and other Google Cloud services. Exceeding project or regional quotas can cause cluster creation failures.

  • Common Dataproc quotas to check:
    • CPUs (regional)
    • DISKS_TOTAL_GB (regional)
    • IN_USE_ADDRESSES (regional for internal IPs, global for external IPs)
    • Dataproc API quotas, such as ClusterOperationRequestsPerMinutePerProjectPerRegion.

Troubleshooting tips:

  • Review quotas: Go to the IAM & Admin > Quotas page in the Google Cloud console. Filter by "Service" for "Compute Engine API" and "Dataproc API."
  • Check usage vs. limit: Identify any quotas that are at or near their limits.
  • If necessary, request a quota increase.

Network configuration issues

Network configuration issues, such as incorrect VPC network, subnet, firewall, or DNS configuration, are a common cause of cluster creation failures. Cluster instances must be able to communicate with each other and with Google APIs.

  • VPC network and subnet:
    • Verify that the cluster VPC network and subnet exist and are configured correctly.
    • Verify that the subnet has a sufficient range of available IP addresses.
  • Private Google Access (PGA): If cluster VMs have internal IP addresses and need to reach Google APIs for Cloud Storage, Cloud Logging, and other operations, verify that Private Google Access is enabled on the subnet. By default, Dataproc clusters created with 2.2 and later image versions provision VMs with internal-only IP addresses and enable Private Google Access on the cluster's regional subnet.
  • Private Service Connect (PSC): If you are using Private Service Connect to access Google APIs, verify that necessary Private Service Connect endpoints are correctly configured for the Google APIs that Dataproc depends on, such as dataproc.googleapis.com, storage.googleapis.com, compute.googleapis.com, and logging.googleapis.com. DNS entries for the APIs must resolve to private IP addresses. Note that using Private Service Connect does not eliminate the need to use VPC peering to communicate with other customer-managed VPC networks.
  • VPC Peering: If your cluster communicates with resources in other VPC networks, such as shared VPC host projects or other customer VPCs, verify that VPC peering is correctly configured and routes are propagating.
  • Firewall rules:

    • Default rules: Verify that default firewall rules, such as allow-internal or allow-ssh, are not overly restrictive.
    • Custom rules: If custom firewall rules are in place, verify that they allow needed communication paths:

      • Internal communication within the cluster (between -m and -w nodes).
      • Outbound traffic from cluster VMs to Google APIs, whether through external IP addresses and the default internet gateway, Private Google Access, or Private Service Connect endpoints.
      • Traffic to any external data sources or services that your jobs depend on.

  • DNS resolution: Confirm that cluster instances can correctly resolve DNS names for Google APIs and any internal or external services.

Troubleshooting tips:

  • Review network configuration: Inspect the VPC network and subnet settings where the cluster is being deployed.
  • Check firewall rules: Review firewall rules in the VPC network or shared VPC host project.
  • Test connectivity: Launch a temporary Compute Engine VM in the cluster subnet and perform the following checks, for example as sketched after this list:
    • Use ping or curl to reach external Google API domains, such as storage.googleapis.com.
    • Use nslookup to verify that DNS resolves Google API domains to the expected IP addresses (public, Private Google Access, or Private Service Connect).
    • Run Google Cloud Connectivity Tests to diagnose paths from a test VM to relevant endpoints.
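
For example, from the temporary VM you might run the following. REGION is a placeholder. Any HTTP status code from curl indicates the endpoint is reachable, while a timeout points to a routing or firewall problem.

    # Check that Google API hostnames resolve, either to public addresses or to
    # Private Google Access / Private Service Connect addresses if configured.
    nslookup storage.googleapis.com
    nslookup dataproccontrol-REGION.googleapis.com

    # Check HTTPS reachability to the Cloud Storage API endpoint.
    curl -sS -o /dev/null -w "%{http_code}\n" https://storage.googleapis.com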

Initialization action failures

Dataproc initialization actions are scripts that run on cluster VMs during cluster creation. Errors in these scripts can prevent cluster startup.

Troubleshooting tips:

  • Examine logs for initialization action errors: Look for log entries related to init-actions or startup-script for the cluster instances in Cloud Logging.
  • Check script paths and permissions: Verify that initialization action scripts are correctly located in Cloud Storage and that the cluster VM service account has the Storage Object Viewer role needed to read Cloud Storage scripts (see the example commands after this list).
  • Debug script logic: Test script logic on a separate Compute Engine VM that mimics the cluster environment to identify errors. Add verbose logging to the script.
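
For example, the following commands sketch how to list the configured initialization actions and confirm that the script object exists. CLUSTER_NAME, REGION, and the gs:// path are placeholders.

    # List the initialization actions configured on the cluster.
    gcloud dataproc clusters describe CLUSTER_NAME --region=REGION \
        --format="yaml(config.initializationActions)"

    # Confirm the script object exists and is readable.
    gcloud storage ls gs://BUCKET_NAME/PATH/init-action.sh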

Regional resource availability (stockouts)

Occasionally, a machine type or resource in a region or zone experiences temporary unavailability (stockout). Typically, this results in RESOURCE_EXHAUSTED errors unrelated to project quota issues.

Troubleshooting tips:

  • Try a different zone or region: Attempt to create the cluster in a different zone within the same region, or in a different region.
  • Use Auto Zone placement: Use the Dataproc Auto Zone placement feature to automatically select a zone with capacity (see the example after this list).
  • Adjust machine type: If using a custom or specialized machine type, try a standard machine type to see if that resolves the issue.
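
For example, omitting the --zone flag when creating a cluster lets Dataproc Auto Zone placement choose a zone in the region with available capacity; the cluster and region names are placeholders.

    # No --zone flag: Dataproc Auto Zone placement selects the zone.
    gcloud dataproc clusters create CLUSTER_NAME \
        --region=REGION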

Next steps

If you continue to experience cluster failure issues:

  • Contact Cloud Customer Care. Describe the cluster failure issue and troubleshooting steps taken. Also provide the following information:
    • Cluster diagnostic data
    • Output from the following command:
      gcloud dataproc clusters describe CLUSTER_NAME \
          --region=REGION
      
    • Exported logs for the failed cluster.