Dataproc Cluster in error state after VM deletion

Problem

A Dataproc cluster has entered the ERROR state, and a few of its VMs are shown as Not found rather than Terminated. This can be seen under the VM Instances tab of the affected Dataproc cluster.

The Stop cluster command was used and failed with an error, which can be seen in the Stackdriver logs. A subsequent attempt to start the cluster with the following command also fails:

gcloud dataproc clusters start <cluster-name> --region=<region>

Error:

ERROR: (gcloud.dataproc.clusters.start) FAILED_PRECONDITION: Cluster '<cluster-name>' must be stopped before it can be started, current cluster state is 'ERROR'

Under the VM Instances section of the Dataproc cluster, the worker VM grap-sb2-w-0 is not in the Terminated state like the others; it is shown as Not found.

Environment

  • Dataproc Cluster

Solution

  1. Do not delete cluster VMs directly in Compute Engine. To remove VMs, scale down the cluster with the Dataproc API update operation, as described in Manage a cluster.
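
As an illustration of the step above, here is a minimal sketch of scaling down through gcloud, which calls the Dataproc update API. The helper name scale_workers and the DRY_RUN switch are hypothetical conveniences added for this example; the cluster name, region, and worker count are placeholders to replace with your own values.

```shell
# Hypothetical helper: resize the primary worker group through the Dataproc API
# instead of deleting worker VMs by hand in Compute Engine.
# Usage: scale_workers <cluster-name> <region> <num-workers>
# Set DRY_RUN=1 to print the command instead of running it.
scale_workers() {
  if [ "$#" -ne 3 ]; then
    echo "usage: scale_workers <cluster-name> <region> <num-workers>" >&2
    return 2
  fi
  # Rebuild the positional parameters as the full gcloud command line.
  set -- gcloud dataproc clusters update "$1" --region="$2" --num-workers="$3"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

For example, running scale_workers <cluster-name> <region> 2 would ask Dataproc to shrink the cluster to two primary workers, so that Dataproc itself removes the surplus VMs and keeps its own view of the cluster consistent.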

Cause

The Compute Engine logs contain entries for the VMs shown as Not found, and these entries confirm that the VMs were deleted manually by a user.

<timestamp> I Compute Engine stop <region>:<hostname-worker-vm> <user>
<timestamp> I Compute Engine delete <region>:<hostname-worker-vm> <user>

Deleting VMs manually can leave the cluster in the ERROR state. The Start button for the cluster is greyed out in the UI, and the gcloud command to start the cluster returns an error. The stop operation dispatches a request to stop each worker VM, but Dataproc is not aware that these VMs were deleted, so the operation cannot complete.
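
The log check described in this section can also be done from the command line. The sketch below assumes that the deletions are recorded in Cloud Audit Logs under a method name containing compute.instances.delete; the helper name find_vm_deletions, the DRY_RUN switch, and the 7-day window are hypothetical choices for this example, not part of the original investigation.

```shell
# Hypothetical helper: list recent Compute Engine instance deletions and who made them.
# Usage: find_vm_deletions <project-id>
# Set DRY_RUN=1 to print the command instead of running it.
find_vm_deletions() {
  if [ "$#" -ne 1 ]; then
    echo "usage: find_vm_deletions <project-id>" >&2
    return 2
  fi
  # Assumption: deletions appear as audit log entries on gce_instance resources.
  vm_delete_filter='resource.type="gce_instance" AND protoPayload.methodName:"compute.instances.delete"'
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "gcloud logging read '$vm_delete_filter' --project=$1 --freshness=7d"
  else
    gcloud logging read "$vm_delete_filter" --project="$1" --freshness=7d \
      --format='table(timestamp, protoPayload.resourceName, protoPayload.authenticationInfo.principalEmail)'
  fi
}
```

If the output lists the missing worker VMs together with a user account, that supports the conclusion that the VMs were removed outside of Dataproc.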