Use the gcpdiag tool
gcpdiag is an open source tool. It is not an officially supported Google Cloud product.
You can use the gcpdiag tool to help you identify and fix Google Cloud project issues. For more information, see the gcpdiag project on GitHub.
The gcpdiag tool helps you discover Dataproc cluster creation issues by performing the following checks:
- Stockout errors: Evaluates Logs Explorer logs to discover stockouts in regions and zones.
- Insufficient quota: Checks quota availability in the Dataproc cluster project.
- Incomplete network configuration: Performs network connectivity tests, including checks for necessary firewall rules and external and internal IP configuration. If the cluster has been deleted, the gcpdiag tool cannot perform a network connectivity check.
- Incorrect cross-project configuration: Checks for cross-project service accounts and reviews additional roles and the enforcement of organization policies.
- Missing shared VPC IAM roles: If the Dataproc cluster uses a Shared VPC network, checks for the addition of required service account roles.
- Initialization action failures: Evaluates Logs Explorer logs to discover initialization action script failures and timeouts.
For a list of gcpdiag cluster-creation steps, see Potential steps.
Run the gcpdiag command
You can run the gcpdiag command from Cloud Shell in the Google Cloud console or within a Docker container.
Google Cloud console
- Complete and then copy the following command.
- Open the Google Cloud console and activate Cloud Shell.
- Paste the copied command.
- Run the gcpdiag command, which downloads the gcpdiag Docker image and then performs diagnostic checks. If applicable, follow the output instructions to fix failed checks.
GOOGLE_AUTH_TOKEN=GOOGLE_AUTH_TOKEN \
gcpdiag runbook dataproc/cluster-creation \
--parameter project_id=PROJECT_ID \
--parameter cluster_name=CLUSTER_NAME \
--parameter OPTIONAL_FLAGS \
--auto --reason=REASON
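For example, a minimal invocation might populate the token from your active gcloud credentials. The project ID, cluster name, and reason below are hypothetical placeholders:
# Supply an access token from the active gcloud account (hypothetical values).
GOOGLE_AUTH_TOKEN=$(gcloud auth print-access-token) \
gcpdiag runbook dataproc/cluster-creation \
    --parameter project_id=my-project \
    --parameter cluster_name=my-cluster \
    --auto --reason="investigate cluster creation failure"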
Docker
You can run gcpdiag using a wrapper that starts gcpdiag in a Docker container. Docker or Podman must be installed.
- Copy and run the following command on your local workstation.
curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag && chmod +x gcpdiag
- Execute the gcpdiag command.
./gcpdiag runbook dataproc/cluster-creation \
    --parameter project_id=PROJECT_ID \
    --parameter cluster_name=CLUSTER_NAME \
    --parameter OPTIONAL_FLAGS
View available parameters for this runbook.
Replace the following:
- PROJECT_ID: The ID of the project containing the resource
- CLUSTER_NAME: The name of the target Dataproc cluster in your project
- OPTIONAL_FLAGS: One or more of the following optional parameters. These parameters are required if the cluster has been deleted.
  - cluster_uuid: The UUID of the target Dataproc cluster in your project
  - service_account: The Dataproc cluster VM service account
  - subnetwork: The Dataproc cluster subnetwork full URI path
  - internal_ip_only: True or False
  - cross_project: The cross-project ID if the Dataproc cluster uses a VM service account in another project
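For example, a run against a deleted cluster might pass the optional parameters explicitly; all values below are hypothetical placeholders:
# Diagnose a cluster that no longer exists (hypothetical values).
./gcpdiag runbook dataproc/cluster-creation \
    --parameter project_id=my-project \
    --parameter cluster_name=my-cluster \
    --parameter cluster_uuid=12345678-90ab-cdef-1234-567890abcdef \
    --parameter service_account=my-sa@my-project.iam.gserviceaccount.com \
    --parameter subnetwork=https://www.googleapis.com/compute/v1/projects/my-project/regions/us-central1/subnetworks/my-subnet \
    --parameter internal_ip_only=True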
Useful flags:
- --universe-domain: If applicable, the Trusted Partner Sovereign Cloud domain hosting the resource
- --parameter or -p: Runbook parameters
For a list and description of all gcpdiag tool flags, see the gcpdiag usage instructions.
Understand and fix cluster creation errors
This section lists Dataproc error messages and their common causes and solutions.
Operation timed out: Only 0 out of 2 minimum required datanodes/node managers running.
Cause: The master node is unable to create the cluster because it cannot communicate with worker nodes.
Solution:
- Check firewall rule warnings.
- Make sure the correct firewall rules are in place. For more information, see Overview of the default Dataproc firewall rules.
- Perform a connectivity test in the Google Cloud console to determine what is blocking communication between the master and worker nodes.
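If you prefer the command line to the console, the following sketch shows one way to check both items; the network, project, zone, and instance names are hypothetical:
# List firewall rules on the cluster network (hypothetical network name).
gcloud compute firewall-rules list --filter="network:my-network"

# Create a connectivity test from the master to a worker node.
gcloud network-management connectivity-tests create master-to-worker \
    --source-instance=projects/my-project/zones/us-central1-a/instances/my-cluster-m \
    --destination-instance=projects/my-project/zones/us-central1-a/instances/my-cluster-w-0 \
    --protocol=TCP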
Required compute.subnetworks.use permission for projects/{projectId}/regions/{region}/subnetworks/{subnetwork}
Cause: This error can occur when you attempt to set up a Dataproc cluster using a VPC network in another project and the Dataproc Service Agent service account does not have the necessary permissions on the Shared VPC project that hosts the network.
Solution: Follow the steps listed in Create a cluster that uses a VPC network in another project.
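As a sketch, granting the Compute Network User role to the Dataproc Service Agent on the Shared VPC host project might look like the following; the host project ID and service project number are hypothetical:
# Grant the service project's Dataproc Service Agent access to the host network.
gcloud projects add-iam-policy-binding host-project-id \
    --member="serviceAccount:service-123456789012@dataproc-accounts.iam.gserviceaccount.com" \
    --role="roles/compute.networkUser"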
The zone projects/zones/{zone} does not have enough resources available to fulfill the request (resource type: compute)
Cause: The zone being used to create the cluster does not have sufficient resources.
Solution:
- Create the cluster in a different zone.
- Use the Dataproc Auto Zone placement feature.
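For example, Auto Zone placement is used when you omit the zone at cluster creation; the cluster name and region below are hypothetical:
# Omitting --zone lets Dataproc pick a zone with available resources.
gcloud dataproc clusters create my-cluster --region=us-central1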
Quota Exceeded errors
Insufficient CPUS/CPUS_ALL_REGIONS quota
Insufficient 'DISKS_TOTAL_GB' quota
Insufficient 'IN_USE_ADDRESSES' quota
Cause: Your CPU, disk, or IP address request exceeds your available quota.
Solution: Request additional quota from the Google Cloud console.
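To identify which quota is exhausted before filing the request, you can inspect regional quota usage from the command line, as in this sketch (the region is hypothetical):
# Show usage versus limit for each regional Compute Engine quota.
gcloud compute regions describe us-central1 \
    --flatten="quotas[]" \
    --format="table(quotas.metric,quotas.usage,quotas.limit)"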
Initialization action failed
Cause: The initialization action provided during cluster creation failed.
Solution:
- See initialization actions considerations and guidelines.
- Examine the output logs. The error message should provide a link to the logs in Cloud Storage.
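Dataproc stages initialization action output in the cluster's Cloud Storage staging bucket. A sketch of reading one node's output, with a hypothetical bucket name, cluster UUID, and node name:
# Print the first initialization script's output for the master node.
gcloud storage cat gs://my-staging-bucket/google-cloud-dataproc-metainfo/12345678-90ab-cdef-1234-567890abcdef/my-cluster-m/dataproc-initialization-script-0_output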
Failed to initialize node {cluster-name}: {component}
Cause: A Dataproc component failed to initialize.
Solution: Refer to:
Cluster creation failed: IP address space exhausted
Cause: IP address space needed to provision the requested cluster nodes is unavailable.
Solution:
- Create a cluster on a different subnetwork or network.
- Reduce usage on the network to free IP address space.
- Wait until sufficient IP space becomes available on the network.
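If you control the subnetwork, another option (not listed above) is to expand its primary IP range; the subnet name, region, and prefix length below are hypothetical:
# Widen the subnet's primary range to free additional addresses.
gcloud compute networks subnets expand-ip-range my-subnet \
    --region=us-central1 \
    --prefix-length=20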
Initialization script error message: The repository REPO_NAME no longer has a Release file
Cause: The Debian oldstable backports repository was purged.
Solution:
Add the following code before the code that runs apt-get in your initialization script.
oldstable=$(curl -s https://deb.debian.org/debian/dists/oldstable/Release | awk '/^Codename/ {print $2}');
stable=$(curl -s https://deb.debian.org/debian/dists/stable/Release | awk '/^Codename/ {print $2}');

matched_files="$(grep -rsil '\-backports' /etc/apt/sources.list*)"

if [[ -n "$matched_files" ]]; then
  for filename in $matched_files; do
    grep -e "$oldstable-backports" -e "$stable-backports" "$filename" || \
      sed -i -e 's/^.*-backports.*$//' "$filename"
  done
fi
Network is unreachable: dataproccontrol-REGION.googleapis.com/...*
Cause: The Dataproc cluster VPC network might not have a required route to the internet. The Dataproc agent running on cluster VMs needs a route to the internet to access the Dataproc control API to get jobs and report status. The system-generated default route to the internet might have been deleted.
Solution:
Add a route to the internet to your cluster VPC network: 0.0.0.0/0 for IPv4 and ::/0 for IPv6 with --next-hop-gateway=default-internet-gateway. Add firewall rules for access control.
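A sketch of re-creating the default IPv4 route; the route and network names are hypothetical:
# Restore a default route to the internet for the cluster's VPC network.
gcloud compute routes create default-internet-route \
    --network=my-network \
    --destination-range=0.0.0.0/0 \
    --next-hop-gateway=default-internet-gateway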