Life of a Dataproc Job

This page describes the sequence of steps involved in the submission, execution, and completion of a Dataproc job. It also discusses job throttling and debugging.

Dataproc jobs flow

  1. User submits job to Dataproc.
  2. Job waits to be acquired by the dataproc agent.
    • If the job is acquired, JobStatus.State is marked as RUNNING.
    • If the job is not acquired due to agent failure, Compute Engine network failure, or other cause, the job is marked ERROR.
  3. Once a job is acquired by the agent, the agent verifies that there are sufficient resources available on the Dataproc cluster's master node to start the driver.
  4. If sufficient resources are available, the dataproc agent starts the job driver process.
    • At this stage, typically one or more applications are running in Apache Hadoop YARN. However, YARN applications may not start until the driver finishes scanning Cloud Storage directories or performing other start-up job tasks.
  5. The dataproc agent periodically sends updates to Dataproc on job progress, cluster metrics, and YARN applications associated with the job (see Job monitoring and debugging).
  6. YARN application(s) complete.
    • The job continues to be reported as RUNNING while the driver performs any job completion tasks, such as materializing collections.
    • An unhandled or uncaught failure in the Main thread can leave the driver in a zombie state (marked as RUNNING without information as to the cause of the failure).
  7. Driver exits. dataproc agent reports completion to Dataproc.
    • Dataproc reports job as DONE.
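The flow above can be sketched as a small state-transition table. This is an illustrative sketch only: the state names follow JobStatus.State, but the event names and the simplification of the acquire/failure paths are assumptions.

```python
# Illustrative sketch of the Dataproc job states described above.
# State names follow JobStatus.State; events and transitions are simplified.
TRANSITIONS = {
    "PENDING": {
        "acquired": "RUNNING",       # step 2: agent acquires the job
        "not_acquired": "ERROR",     # agent failure, network failure, etc.
    },
    "RUNNING": {
        "driver_exited": "DONE",     # step 7: agent reports completion
        "unhandled_failure": "RUNNING",  # zombie driver: still reported RUNNING
    },
}

def next_state(state, event):
    """Return the next job state for an event, or the same state if no transition applies."""
    return TRANSITIONS.get(state, {}).get(event, state)
```

Note that an unhandled driver failure maps RUNNING back to RUNNING, matching the zombie-state caveat in step 6.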

Job concurrency

You can configure the maximum number of concurrent Dataproc jobs with the dataproc:dataproc.scheduler.max-concurrent-jobs cluster property when you create a cluster. If this property value is not set, the upper limit on concurrent jobs is calculated as max((masterMemoryMb - 3584) / masterMemoryMbPerJob, 5). masterMemoryMb is determined by the master VM's machine type. masterMemoryMbPerJob is 1024 by default, but is configurable at cluster creation with the dataproc:dataproc.scheduler.driver-size-mb cluster property.
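The default calculation can be sketched as follows. The helper name and the example machine size are illustrative, and integer division is an assumption about how the fraction is handled.

```python
def default_max_concurrent_jobs(master_memory_mb, master_memory_mb_per_job=1024):
    """Sketch of the default limit when dataproc:dataproc.scheduler.max-concurrent-jobs
    is unset: max((masterMemoryMb - 3584) / masterMemoryMbPerJob, 5).
    Integer division is assumed here."""
    return max((master_memory_mb - 3584) // master_memory_mb_per_job, 5)

# For example, an n1-standard-4 master has 15360 MB of memory:
# (15360 - 3584) // 1024 = 11 concurrent jobs by default.
```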

Job monitoring and debugging

Use the Google Cloud CLI, Dataproc REST API, and Google Cloud console to analyze and debug Dataproc jobs.

gcloud command

To examine a running job's status:

gcloud dataproc jobs describe job-id \
    --region=region

To view job driver output, see View job output.


REST API

Call jobs.get to examine a job's JobStatus.State, JobStatus.Substate, JobStatus.details, and YarnApplication fields.
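A sketch of pulling those fields out of a jobs.get response. The response dict below is hand-written for illustration; the field names follow the Dataproc REST API, but the values are made up.

```python
def summarize_job(job):
    """Extract the status fields of interest from a jobs.get response dict."""
    status = job.get("status", {})
    return {
        "state": status.get("state"),
        "substate": status.get("substate"),
        "details": status.get("details"),
        "yarnApplications": [
            (app.get("name"), app.get("state"), app.get("progress"))
            for app in job.get("yarnApplications", [])
        ],
    }

# Example response fragment (illustrative values only):
job = {
    "status": {"state": "RUNNING", "details": "Agent reported job success"},
    "yarnApplications": [{"name": "app-1", "state": "RUNNING", "progress": 0.5}],
}
```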


Console

To view job driver output, see View job output.

To view the dataproc agent log in Logging, select Dataproc Cluster→Cluster Name→Cluster UUID from the Logs Explorer cluster selector. Then use the logs selector to select google.dataproc.agent logs.
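The same selection can be expressed as a Logs Explorer query. This is a sketch: substitute your own project ID and cluster values, and note that the resource label names shown here are assumptions.

```
resource.type="cloud_dataproc_cluster"
resource.labels.cluster_name="cluster-name"
resource.labels.cluster_uuid="cluster-uuid"
log_name="projects/project-id/logs/google.dataproc.agent"
```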

View job logs in Logging

If a job fails, you can access job logs in Logging.

Determining who submitted a job

Looking up the details of a job will show who submitted that job in the submittedBy field. For example, this job output shows user@domain submitted the example job to a cluster.

placement:
  clusterName: cluster-name
  clusterUuid: cluster-uuid
reference:
  jobId: job-uuid
  projectId: project
status:
  state: DONE
  stateStartTime: '2018-11-01T00:53:37.599Z'
statusHistory:
- state: PENDING
  stateStartTime: '2018-11-01T00:33:41.387Z'
- state: SETUP_DONE
  stateStartTime: '2018-11-01T00:33:41.765Z'
- details: Agent reported job success
  state: RUNNING
  stateStartTime: '2018-11-01T00:33:42.146Z'
submittedBy: user@domain
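As a sketch, the same field can be pulled out of a parsed describe response. The dict below mirrors the example output above; the helper name is illustrative.

```python
def submitted_by(job):
    """Return who submitted a job, from a parsed jobs describe response."""
    return job.get("submittedBy", "unknown")

# Mirrors the example output above:
job = {"state": "DONE", "submittedBy": "user@domain"}
```

With the gcloud CLI, `--format='value(submittedBy)'` on the describe command prints just that field.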

Error messages

  • Task was not acquired: Indicates that the Dataproc agent on the master node was unable to acquire the task from the control plane. This often happens due to out-of-memory (OOM) or network issues. If the job ran successfully previously and you have not changed network configuration settings, OOM is the most likely cause, often the result of the submission of many concurrently running jobs or jobs whose drivers consume significant memory (for example, jobs that load large datasets into memory).
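As a rough illustration of the OOM scenario, using the sizing knobs from the Job concurrency section: the function name, the accounting, and the numbers are all hypothetical — the real agent's bookkeeping is internal.

```python
def can_acquire(master_memory_mb, running_drivers, driver_size_mb=1024, reserved_mb=3584):
    """Rough, hypothetical sketch: is there headroom on the master node
    for one more job driver? Uses the same constants as the default
    concurrency formula (3584 MB reserved, 1024 MB per driver)."""
    used = reserved_mb + running_drivers * driver_size_mb
    return master_memory_mb - used >= driver_size_mb

# On a 15360 MB master, an 11th driver still fits, but a 12th does not —
# consistent with the default limit of 11 concurrent jobs for that size.
```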