Troubleshoot cluster creation errors

This page provides information to help you diagnose and fix Dataproc cluster creation errors.

Use the gcpdiag tool

gcpdiag is an open-source tool that can help identify and fix problems with Google Cloud projects. gcpdiag is not an officially supported Google Cloud product. For more information, see the gcpdiag project on GitHub.

The gcpdiag tool performs the following checks to help you discover Dataproc cluster creation issues:

  • Stockout errors: Evaluates Logs Explorer logs to discover stockouts in regions and zones.
  • Insufficient quota: Checks quota availability in the Dataproc cluster project.
  • Incomplete network configuration: Performs network connectivity tests, including checks for necessary firewall rules and external and internal IP configuration. If the cluster has been deleted, the gcpdiag tool cannot perform a network connectivity check.
  • Incorrect cross-project configuration: Checks for cross-project service accounts and reviews additional roles and organization policies enforcement.
  • Missing shared VPC IAM roles: If the Dataproc cluster uses a Shared VPC network, checks for the addition of required service account roles.
  • Initialization action failures: Evaluates Logs Explorer logs to discover initialization action script failures and timeouts.

For a list of gcpdiag cluster-creation steps, see Potential steps.

Cloud Shell

  1. Copy and run the following command in Cloud Shell:
    gcpdiag runbook dataproc/cluster-creation \
      --project=PROJECT_ID \
      --parameter cluster_name=CLUSTER_NAME \
      --parameter OPTIONAL_FLAGS

Google Cloud console

  1. Copy the following command:
    gcpdiag runbook dataproc/cluster-creation \
      --project=PROJECT_ID \
      --parameter cluster_name=CLUSTER_NAME \
      --parameter OPTIONAL_FLAGS
  2. Open the Google Cloud console and activate Cloud Shell.
  3. Paste the copied command and run it. The gcpdiag command downloads the gcpdiag Docker image, and then performs diagnostic checks. Follow the instructions to fix failed checks.

Docker

You can run gcpdiag using a wrapper that starts gcpdiag in a Docker container. This works on any machine with Docker or Podman installed.

  1. Copy and run the following command on your local workstation:
    curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag && chmod +x gcpdiag
  2. Execute the gcpdiag command:
    ./gcpdiag runbook dataproc/cluster-creation \
      --project=PROJECT_ID \
      --parameter cluster_name=CLUSTER_NAME \
      --parameter OPTIONAL_FLAGS

View all available parameters for this runbook.

Replace the following:

  • PROJECT_ID: The ID of the project containing the resource.
  • CLUSTER_NAME: The name of the target Dataproc cluster within your project.
  • OPTIONAL_FLAGS: Optional. One or more of the following additional parameters, which you can provide if the cluster has been deleted:

      • cluster_uuid: The UUID of the target Dataproc cluster within your project.
      • service_account: The service account used to create the Dataproc cluster.
      • subnetwork: The full URI of the Dataproc cluster subnetwork.
      • internal_ip_only: True or False.
      • cross_project: The ID of the project that hosts the service account, if the service account is not in the same project as the Dataproc cluster.
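For example, a complete invocation for a deleted cluster might look like the following; the project ID, cluster name, and parameter values shown are hypothetical placeholders:

```shell
# Hypothetical example: diagnose creation failures for a deleted cluster.
gcpdiag runbook dataproc/cluster-creation \
  --project=example-project \
  --parameter cluster_name=example-cluster \
  --parameter cluster_uuid=12345678-1234-1234-1234-123456789abc \
  --parameter service_account=dataproc-sa@example-project.iam.gserviceaccount.com \
  --parameter internal_ip_only=True
```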

Useful flags:

  • --project: to define the PROJECT_ID.
  • --universe-domain: if applicable, used to define the Trusted Partner Sovereign Cloud domain hosting the resource.
  • --parameter or -p: to define parameters for a runbook.

For more information on available flags, see the Usage instructions for gcpdiag.

Fix cluster creation error messages

This section lists Dataproc error messages and their common causes and solutions.

  • Operation timed out: Only 0 out of 2 minimum required datanodes/node managers running.

    Cause: The master node is unable to create the cluster because it cannot communicate with worker nodes.

    Solution: Verify that firewall rules on the cluster VPC network allow internal communication (TCP, UDP, and ICMP) among cluster VMs, and review the worker node logs in Logging for startup failures.

  • Required compute.subnetworks.use permission for projects/{projectId}/regions/{region}/subnetworks/{subnetwork}

    Cause: This error can occur when you attempt to set up a Dataproc cluster using a VPC network in another project and the Dataproc Service Agent service account does not have the necessary permissions on the Shared VPC project that is hosting the network.

    Solution: Follow the steps listed in Create a cluster that uses a VPC network in another project.
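    As a sketch of one of those steps, the following command grants the Dataproc service agent of the cluster project the Compute Network User role on the Shared VPC host project; HOST_PROJECT_ID and CLUSTER_PROJECT_NUMBER are placeholders:

    ```shell
    # Grant the Dataproc service agent the Compute Network User role
    # on the Shared VPC host project (placeholder values shown).
    gcloud projects add-iam-policy-binding HOST_PROJECT_ID \
      --member="serviceAccount:service-CLUSTER_PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com" \
      --role="roles/compute.networkUser"
    ```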

  • The zone projects/zones/{zone} does not have enough resources available to fulfill the request (resource type:compute)

    Cause: The zone being used to create the cluster does not have sufficient resources.

    Solution: Create the cluster in a different zone, or omit the zone from the cluster creation request so that Dataproc Auto Zone placement selects a zone with available resources.

  • Quota Exceeded errors

    Insufficient CPUS/CPUS_ALL_REGIONS quota
    Insufficient 'DISKS_TOTAL_GB' quota
    Insufficient 'IN_USE_ADDRESSES' quota

    Cause: Your CPU, disk, or IP address request exceeds your available quota.

    Solution: Request additional quota from the Google Cloud console.
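    To see which quota is exhausted before requesting an increase, you can inspect per-region compute quotas from the command line; replace us-central1 with your cluster region:

    ```shell
    # List compute quotas for a region with current usage and limits.
    gcloud compute regions describe us-central1 \
      --flatten="quotas[]" \
      --format="table(quotas.metric,quotas.usage,quotas.limit)"
    ```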

  • Initialization action failed

    Cause: The initialization action provided during cluster creation failed to install.

    Solution: Examine the initialization script output in /var/log/dataproc-initialization-script-X.log on the cluster nodes to identify the failure, fix the script, and then recreate the cluster.

  • Failed to initialize node {cluster-name}: {component}

    Cause: A Dataproc component failed to initialize.

    Solution: Review the initialization logs for the failed component on the affected node to identify and fix the cause of the failure.

  • Cluster creation failed: IP address space exhausted

    Cause: IP address space needed to provision the requested cluster nodes is unavailable.

    Solution:

    • Create a cluster on a different subnetwork or network.
    • Reduce usage on the network to free IP address space.
    • Wait until sufficient IP space becomes available on the network.
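    When choosing a subnetwork, note that Google Cloud reserves four addresses in each primary IPv4 subnet range, so a subnet must be large enough for the planned cluster nodes plus that overhead. A minimal bash sketch (the function name is ours) estimates the assignable capacity of a prefix length:

    ```shell
    # usable_ips PREFIX_LEN
    # Prints the number of VM-assignable IPv4 addresses in a subnet of the
    # given prefix length. Google Cloud reserves 4 addresses per primary
    # range (network, gateway, second-to-last, and broadcast).
    usable_ips() {
      local prefix=$1
      local total=$(( 1 << (32 - prefix) ))
      echo $(( total - 4 ))
    }

    usable_ips 24   # a /24 leaves 252 assignable addresses
    ```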
  • Initialization script error message: The repository REPO_NAME no longer has a Release file

    Cause: The Debian oldstable backports repository was purged.

    Solution:

    Add the following code before the code that runs apt-get in your initialization script.

    # Determine the current Debian stable and oldstable codenames.
    oldstable=$(curl -s https://deb.debian.org/debian/dists/oldstable/Release | awk '/^Codename/ {print $2}')
    stable=$(curl -s https://deb.debian.org/debian/dists/stable/Release | awk '/^Codename/ {print $2}')

    # Find apt source files that reference a -backports repository.
    matched_files="$(grep -rsil '\-backports' /etc/apt/sources.list*)"
    if [[ -n "$matched_files" ]]; then
      # Iterate over each matched file; the expansion is intentionally
      # unquoted so the list of filenames is word-split.
      for filename in $matched_files; do
        # If the file already points at the current stable or oldstable
        # backports repository, leave it alone; otherwise blank out the
        # stale -backports entries.
        grep -e "$oldstable-backports" -e "$stable-backports" "$filename" || \
          sed -i -e 's/^.*-backports.*$//' "$filename"
      done
    fi
    
  • Network is unreachable: dataproccontrol-REGION.googleapis.com/...*

    Cause: The Dataproc cluster VPC network might not have a required route to the internet. The Dataproc agent running on cluster VMs needs a route to the internet to access the Dataproc control API to get jobs and report status. The system-generated default route to the internet might have been deleted.

    Solution:

    Add a route to the internet to your cluster VPC network: 0.0.0.0/0 for IPv4 and ::/0 for IPv6 with --next-hop-gateway=default-internet-gateway. Add firewall rules for access control.
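    As a sketch, the following command recreates a default IPv4 route to the internet; the route name default-internet-route and NETWORK_NAME are placeholders:

    ```shell
    # Recreate a default route to the internet for the cluster VPC network.
    gcloud compute routes create default-internet-route \
      --network=NETWORK_NAME \
      --destination-range=0.0.0.0/0 \
      --next-hop-gateway=default-internet-gateway
    ```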

Additional references