Life of a Dataproc Job

This page describes the sequence of steps involved in the submission, execution, and completion of a Dataproc job. It also discusses job throttling and debugging.

Dataproc jobs flow

  1. User submits job to Dataproc.
  2. Job waits to be acquired by the dataproc agent.
    • If the job is acquired, JobStatus.State is marked as RUNNING.
    • If the job is not acquired due to agent failure, Compute Engine network failure, or other cause, the job is marked ERROR.
  3. Once a job is acquired by the agent, the agent verifies that there are sufficient resources available on the Dataproc cluster's master node to start the driver.
  4. If sufficient resources are available, the dataproc agent starts the job driver process.
    • At this stage, typically one or more applications are running in Apache Hadoop YARN. However, YARN applications may not start until the driver finishes scanning Cloud Storage directories or performing other start-up job tasks.
  5. The dataproc agent periodically sends updates to Dataproc on job progress, cluster metrics, and YARN applications associated with the job (see Job monitoring and debugging).
  6. YARN application(s) complete.
    • The job continues to be reported as RUNNING while the driver performs any job completion tasks, such as materializing collections.
    • An unhandled or uncaught failure in the Main thread can leave the driver in a zombie state (marked as RUNNING without information as to the cause of the failure).
  7. Driver exits. dataproc agent reports completion to Dataproc.
    • Dataproc reports job as DONE.
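The flow above can be sketched as a small state-transition table. This is an illustrative sketch only: the state names follow JobStatus.State, but the event names and the simplification of the acquire/failure paths are assumptions.

```python
# Illustrative sketch of the Dataproc job states described above.
# State names follow JobStatus.State; events and transitions are simplified.
TRANSITIONS = {
    "PENDING": {
        "acquired": "RUNNING",       # step 2: agent acquires the job
        "not_acquired": "ERROR",     # agent failure, network failure, etc.
    },
    "RUNNING": {
        "driver_exited": "DONE",     # step 7: agent reports completion
        "unhandled_failure": "RUNNING",  # zombie driver: still reported RUNNING
    },
}

def next_state(state, event):
    """Return the next job state for an event, or the same state if no transition applies."""
    return TRANSITIONS.get(state, {}).get(event, state)
```

Note that an unhandled driver failure maps RUNNING back to RUNNING, matching the zombie-state caveat in step 6.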

Job concurrency

You can configure the maximum number of concurrent Dataproc jobs with the dataproc:dataproc.scheduler.max-concurrent-jobs cluster property when you create a cluster. If this property value is not set, the upper limit on concurrent jobs is calculated as max((masterMemoryMb - 3584) / masterMemoryMbPerJob, 5). masterMemoryMb is determined by the master VM's machine type. masterMemoryMbPerJob is 1024 by default, but is configurable at cluster creation with the dataproc:dataproc.scheduler.driver-size-mb cluster property.
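The default calculation can be sketched as follows. The helper name and the example machine size are illustrative, and integer division is an assumption about how the fraction is handled.

```python
def default_max_concurrent_jobs(master_memory_mb, master_memory_mb_per_job=1024):
    """Sketch of the default limit when dataproc:dataproc.scheduler.max-concurrent-jobs
    is unset: max((masterMemoryMb - 3584) / masterMemoryMbPerJob, 5).
    Integer division is assumed here."""
    return max((master_memory_mb - 3584) // master_memory_mb_per_job, 5)

# For example, an n1-standard-4 master has 15360 MB of memory:
# (15360 - 3584) // 1024 = 11 concurrent jobs by default.
```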

Job monitoring and debugging

Use the Google Cloud CLI, Dataproc REST API, and Google Cloud console to analyze and debug Dataproc jobs.

gcloud command

To examine a running job's status:

gcloud dataproc jobs describe job-id \
    --region=region

To view job driver output, see View job output.


REST API

Call jobs.get to examine a job's JobStatus.State, JobStatus.Substate, JobStatus.details, and YarnApplication fields.
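A sketch of pulling those fields out of a jobs.get response. The response dict below is hand-written for illustration; the field names follow the Dataproc REST API, but the values are made up.

```python
def summarize_job(job):
    """Extract the status fields of interest from a jobs.get response dict."""
    status = job.get("status", {})
    return {
        "state": status.get("state"),
        "substate": status.get("substate"),
        "details": status.get("details"),
        "yarnApplications": [
            (app.get("name"), app.get("state"), app.get("progress"))
            for app in job.get("yarnApplications", [])
        ],
    }

# Example response fragment (illustrative values only):
job = {
    "status": {"state": "RUNNING", "details": "Agent reported job success"},
    "yarnApplications": [{"name": "app-1", "state": "RUNNING", "progress": 0.5}],
}
```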


Console

To view job driver output, see View job output.

To view the dataproc agent log in Logging, select Dataproc Cluster→Cluster Name→Cluster UUID from the Logs Explorer cluster selector. Then use the logs selector to select google.dataproc.agent logs.
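The same selection can be expressed as a Logs Explorer query. This is a sketch: substitute your own project ID and cluster values, and note that the resource label names shown here are assumptions.

```
resource.type="cloud_dataproc_cluster"
resource.labels.cluster_name="cluster-name"
resource.labels.cluster_uuid="cluster-uuid"
log_name="projects/project-id/logs/google.dataproc.agent"
```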

View job logs in Logging

If a job fails, you can access job logs in Logging.

Determining who submitted a job

Looking up the details of a job will show who submitted that job in the submittedBy field. For example, this job output shows user@domain submitted the example job to a cluster.

placement:
  clusterName: cluster-name
  clusterUuid: cluster-uuid
reference:
  jobId: job-uuid
  projectId: project
status:
  state: DONE
  stateStartTime: '2018-11-01T00:53:37.599Z'
statusHistory:
- state: PENDING
  stateStartTime: '2018-11-01T00:33:41.387Z'
- state: SETUP_DONE
  stateStartTime: '2018-11-01T00:33:41.765Z'
- details: Agent reported job success
  state: RUNNING
  stateStartTime: '2018-11-01T00:33:42.146Z'
submittedBy: user@domain
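As a sketch, the same field can be pulled out of a parsed describe response. The dict below mirrors the example output above; the helper name is illustrative.

```python
def submitted_by(job):
    """Return who submitted a job, from a parsed jobs describe response."""
    return job.get("submittedBy", "unknown")

# Mirrors the example output above:
job = {"state": "DONE", "submittedBy": "user@domain"}
```

With the gcloud CLI, `--format='value(submittedBy)'` on the describe command prints just that field.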

Error messages

  • Task was not acquired: Indicates that the Dataproc agent on the master node was unable to acquire the task from the control plane. This often happens due to out-of-memory (OOM) or network issues. If the job ran successfully previously and you have not changed network configuration settings, OOM is the most likely cause, often the result of the submission of many concurrently running jobs or jobs whose drivers consume significant memory (for example, jobs that load large datasets into memory).
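As a rough illustration of the OOM scenario, using the sizing knobs from the Job concurrency section: the function name, the accounting, and the numbers are all hypothetical — the real agent's bookkeeping is internal.

```python
def can_acquire(master_memory_mb, running_drivers, driver_size_mb=1024, reserved_mb=3584):
    """Rough, hypothetical sketch: is there headroom on the master node
    for one more job driver? Uses the same constants as the default
    concurrency formula (3584 MB reserved, 1024 MB per driver)."""
    used = reserved_mb + running_drivers * driver_size_mb
    return master_memory_mb - used >= driver_size_mb

# On a 15360 MB master, an 11th driver still fits, but a 12th does not —
# consistent with the default limit of 11 concurrent jobs for that size.
```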