Troubleshoot cluster creation issues

Use the gcpdiag tool

gcpdiag is an open source tool. It is not an officially supported Google Cloud product. You can use the gcpdiag tool to help you identify and fix Google Cloud project issues. For more information, see the gcpdiag project on GitHub.

The gcpdiag tool performs the following checks to help you discover Dataproc cluster creation issues:

  • Stockout errors: Evaluates Logs Explorer logs to discover stockouts in regions and zones.
  • Insufficient quota: Checks quota availability in the Dataproc cluster project.
  • Incomplete network configuration: Performs network connectivity tests, including checks for necessary firewall rules and external and internal IP configuration. If the cluster has been deleted, the gcpdiag tool cannot perform a network connectivity check.
  • Incorrect cross-project configuration: Checks for cross-project service accounts and reviews additional roles and organization policy enforcement.
  • Missing Shared VPC IAM roles: If the Dataproc cluster uses a Shared VPC network, checks that the required service account roles have been added.
  • Initialization action failures: Evaluates Logs Explorer logs to discover initialization action script failures and timeouts.

For a list of gcpdiag cluster-creation steps, see Potential steps.

Run the gcpdiag command

You can run the gcpdiag command from Cloud Shell in the Google Cloud console or within a Docker container.

Google Cloud console

  1. Complete and then copy the following command.
    GOOGLE_AUTH_TOKEN=GOOGLE_AUTH_TOKEN \
      gcpdiag runbook dataproc/cluster-creation \
        --parameter project_id=PROJECT_ID \
        --parameter cluster_name=CLUSTER_NAME \
        --parameter OPTIONAL_FLAGS \
        --auto --reason=REASON
  2. Open the Google Cloud console and activate Cloud Shell.
  3. Paste the copied command.
  4. Run the gcpdiag command, which downloads the gcpdiag Docker image and then performs diagnostic checks. If applicable, follow the output instructions to fix failed checks.

Docker

You can run gcpdiag using a wrapper that starts gcpdiag in a Docker container. Docker or Podman must be installed.

  1. Copy and run the following command on your local workstation.
    curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag && chmod +x gcpdiag
  2. Execute the gcpdiag command.
    ./gcpdiag runbook dataproc/cluster-creation \
        --parameter project_id=PROJECT_ID \
        --parameter cluster_name=CLUSTER_NAME \
        --parameter OPTIONAL_FLAGS

View available parameters for this runbook.

Replace the following:

    • PROJECT_ID: The ID of the project containing the resource
    • CLUSTER_NAME: The name of the target Dataproc cluster in your project
    • OPTIONAL_FLAGS: Add one or more of the following optional parameters. These parameters are required if the cluster has been deleted.
      • cluster_uuid: The UUID of the target Dataproc cluster in your project
      • service_account: The Dataproc cluster VM service account
      • subnetwork: The Dataproc cluster subnetwork full URI path
      • internal_ip_only: True or False
      • cross_project: The cross-project ID if the Dataproc cluster uses a VM service account in another project
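As an illustration of how the placeholders fit together, the following sketch assembles the Docker wrapper command with two optional parameters. The project, cluster, and UUID values are hypothetical, and passing each optional parameter through its own --parameter flag is an assumption modeled on the two required parameters in the command above. The script only prints the command so that you can review it before running it.

```shell
# Hypothetical example values; replace them with your own.
project_id="my-project"
cluster_name="my-cluster"
# Optional parameters as space-separated key=value pairs (example values).
optional_params="cluster_uuid=00000000-0000-0000-0000-000000000000 internal_ip_only=True"

# Start with the required arguments, then add one --parameter flag per pair.
set -- runbook dataproc/cluster-creation \
  --parameter "project_id=${project_id}" \
  --parameter "cluster_name=${cluster_name}"
for pair in $optional_params; do
  set -- "$@" --parameter "$pair"
done

# Print the command for review. To run it, use: ./gcpdiag "$@"
gcpdiag_command="./gcpdiag $*"
echo "$gcpdiag_command"
```

Printing the command first makes it easier to spot a missing or misspelled parameter, which matters most for a deleted cluster, where the optional parameters are required.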

Useful flags:

For a list and description of all gcpdiag tool flags, see the gcpdiag usage instructions.

Understand and fix cluster creation errors

This section lists Dataproc error messages along with their common causes and solutions.

  • Operation timed out: Only 0 out of 2 minimum required datanodes/node managers running.

    Cause: The master node is unable to create the cluster because it cannot communicate with worker nodes.

    Solution:

  • Required compute.subnetworks.use permission for projects/{projectId}/regions/{region}/subnetworks/{subnetwork}

    Cause: This error can occur when you attempt to set up a Dataproc cluster using a VPC network in another project and the Dataproc Service Agent service account does not have the necessary permissions on the Shared VPC project that is hosting the network.

    Solution: Follow the steps listed in Create a cluster that uses a VPC network in another project.

  • The zone projects/zones/{zone} does not have enough resources available to fulfill the request (resource type:compute)

    Cause: The zone being used to create the cluster does not have sufficient resources.

    Solution:

  • Quota Exceeded errors

    Insufficient CPUS/CPUS_ALL_REGIONS quota
    Insufficient 'DISKS_TOTAL_GB' quota
    Insufficient 'IN_USE_ADDRESSES' quota

    Cause: Your CPU, disk, or IP address request exceeds your available quota.

    Solution: Request additional quota from the Google Cloud console.
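
Before you request more quota, it can help to estimate how far the request exceeds what is available. The following is a minimal sketch of that arithmetic for the CPUS quota. The cluster shape and the quota numbers are hypothetical; the real limit and usage values are listed in the output of the gcloud command named in the comment.

```shell
# Hypothetical cluster shape: 1 master and 4 workers with 4 vCPUs each.
master_count=1
worker_count=4
vcpus_per_node=4

# Hypothetical regional CPUS quota values. The real numbers appear in the
# "quotas" section of: gcloud compute regions describe REGION --project=PROJECT_ID
cpus_limit=24
cpus_usage=8

# vCPUs the cluster needs, and vCPUs still free under the quota.
requested=$(( (master_count + worker_count) * vcpus_per_node ))
available=$(( cpus_limit - cpus_usage ))

if [ "$requested" -gt "$available" ]; then
  quota_status="insufficient: need ${requested} vCPUs, ${available} available"
else
  quota_status="ok: need ${requested} vCPUs, ${available} available"
fi
echo "$quota_status"
```

The same comparison applies to the DISKS_TOTAL_GB and IN_USE_ADDRESSES quotas, using total disk size and the number of addresses in place of vCPUs.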

  • Initialization action failed

    Cause: The initialization action provided during cluster creation failed to install.

    Solution:

  • Failed to initialize node {cluster-name}: {component}

    Cause: A Dataproc component failed to initialize.

    Solution: Refer to:

  • Cluster creation failed: IP address space exhausted

    Cause: IP address space needed to provision the requested cluster nodes is unavailable.

    Solution:

    • Create a cluster on a different subnetwork or network.
    • Reduce usage on the network to free IP address space.
    • Wait until sufficient IP space becomes available on the network.
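
To judge whether a subnetwork can hold the requested nodes, you can estimate its usable address count from the prefix length of its primary IPv4 range. The following is a rough sketch with hypothetical inputs. It accounts for the four addresses that Google Cloud reserves in each primary IPv4 subnet range, but it does not account for addresses that other resources on the subnetwork already use.

```shell
# Hypothetical inputs: a /28 primary subnet range and a 14-node cluster.
prefix_length=28
nodes_requested=14

# Google Cloud reserves four addresses in each primary IPv4 subnet range.
reserved=4
total=$(( 1 << (32 - prefix_length) ))
usable=$(( total - reserved ))

if [ "$nodes_requested" -gt "$usable" ]; then
  ip_status="too small: ${nodes_requested} nodes requested, ${usable} usable addresses"
else
  ip_status="fits: ${nodes_requested} nodes requested, ${usable} usable addresses"
fi
echo "$ip_status"
```

With these example values, the /28 range provides 12 usable addresses, so it cannot hold 14 nodes.
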
  • Initialization script error message: The repository REPO_NAME no longer has a Release file

    Cause: The Debian oldstable backports repository was purged.

    Solution:

    Add the following code before the code that runs apt-get in your initialization script.

    oldstable=$(curl -s https://deb.debian.org/debian/dists/oldstable/Release | awk '/^Codename/ {print $2}');
    stable=$(curl -s https://deb.debian.org/debian/dists/stable/Release | awk '/^Codename/ {print $2}');
    
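    # List the APT source files that mention a backports repository.
    matched_files="$(grep -rsil '\-backports' /etc/apt/sources.list*)"
    if [[ -n "$matched_files" ]]; then
      # Read one filename per line. Quoting the whole list in a for loop
      # would pass every filename to grep and sed as a single argument.
      while IFS= read -r filename; do
        # Keep files that reference the current stable or oldstable backports;
        # otherwise blank out the backports lines.
        grep -e "$oldstable-backports" -e "$stable-backports" "$filename" || \
          sed -i -e 's/^.*-backports.*$//' "$filename"
      done <<< "$matched_files"
    fi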
    
  • Network is unreachable: dataproccontrol-REGION.googleapis.com/...*

    Cause: The Dataproc cluster VPC network might not have a required route to the internet. The Dataproc agent running on cluster VMs needs a route to the internet to access the Dataproc control API to get jobs and report status. The system-generated default route to the internet might have been deleted.

    Solution:

    Add a route to the internet to your cluster VPC network: 0.0.0.0/0 for IPv4 and ::/0 for IPv6 with --next-hop-gateway=default-internet-gateway. Add firewall rules for access control.
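
As a sketch of the IPv4 case, the following script assembles a gcloud compute routes create command for a default route through the default internet gateway. The network and route names are hypothetical. The script only prints the command so that you can review it before creating the route.

```shell
# Hypothetical names; replace them with your own.
network="my-cluster-network"
route_name="default-internet-route"

# Assemble the gcloud command for an IPv4 default route through the
# default internet gateway.
set -- gcloud compute routes create "$route_name" \
  --network="$network" \
  --destination-range=0.0.0.0/0 \
  --next-hop-gateway=default-internet-gateway

# Print the command for review. To create the route, run "$@" instead.
route_command="$*"
echo "$route_command"
```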