Restartable jobs

By default, Dataproc jobs do not automatically restart on failure. You can optionally configure jobs to restart on failure. When you set a job to restart, you specify the maximum number of retries per hour (the upper limit is 10 retries per hour).

Restarting jobs mitigates common types of job failure, including out-of-memory issues and unexpected Compute Engine virtual machine reboots. Restartable jobs are especially useful for long-running and streaming jobs. For example, you can restart Spark streaming jobs running on Dataproc clusters to ensure that the streaming jobs are resilient. Restartable jobs can be created through the Google Cloud Console, the Cloud SDK gcloud command-line tool, or the Dataproc REST API.

Restartable job semantics

Any type of Dataproc job can be marked as restartable. This applies to jobs submitted through the Google Cloud Console, Cloud SDK gcloud command-line tool, or the Dataproc REST API.

The following semantics apply to reporting the success or failure of jobs:

  • A job is reported successful if the driver terminates with code 0.
  • A job is reported failed if:
    • The driver terminates with a non-zero code more than 4 times in 10 minutes.
    • The driver terminates with a non-zero code and has exceeded the max_failures_per_hour setting.
  • A job will be restarted if the driver exits with a non-zero code, is not thrashing (that is, has not exceeded the 10-minute failure threshold above), and is within the max_failures_per_hour setting.
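The semantics above can be sketched as a small decision function. This is an illustrative Python sketch of the rules as stated, not Dataproc's actual implementation; the thrashing threshold (more than 4 non-zero exits in 10 minutes) and the hourly limit come from the list above, and whether a failure exactly at the limit counts as "exceeded" is an assumption here:

```python
from datetime import datetime, timedelta

THRASH_LIMIT = 4                      # non-zero exits in the window before thrashing
THRASH_WINDOW = timedelta(minutes=10)

def should_restart(exit_code, failure_times, now, max_failures_per_hour):
    """Decide whether a restartable job's driver should be restarted.

    exit_code             -- the driver's most recent exit code
    failure_times         -- datetimes of non-zero exits, including the current one
    max_failures_per_hour -- the job's max_failures_per_hour setting (at most 10)
    """
    if exit_code == 0:
        return False  # driver exited with code 0: job is reported successful
    recent = [t for t in failure_times if now - t <= THRASH_WINDOW]
    if len(recent) > THRASH_LIMIT:
        return False  # thrashing: more than 4 non-zero exits in 10 minutes
    last_hour = [t for t in failure_times if now - t <= timedelta(hours=1)]
    if len(last_hour) > max_failures_per_hour:
        return False  # exceeded max_failures_per_hour: job is reported failed
    return True       # non-zero exit, not thrashing, within the hourly limit
```

For example, a driver that has failed five times within ten minutes is considered thrashing and is not restarted, even if it is still within its hourly budget.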

Known limitations

The following limitations apply to restartable jobs on Dataproc.

  • You must design your jobs to gracefully handle restarting. For example, if your job writes to a directory, your job will need to account for the fact that the directory may already exist when your job is restarted.
  • Apache Spark streaming jobs that checkpoint can be restarted after failure, but these jobs will not report YARN status.
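As an example of the first limitation, a job that writes output files can be made restart-safe by tolerating state left behind by a previous attempt. This is an illustrative Python sketch; the directory and file names are hypothetical:

```python
import os

def write_results(output_dir, lines):
    """Write job output so a restarted run does not fail on leftover state."""
    # Tolerate a directory left behind by a previous, failed attempt.
    os.makedirs(output_dir, exist_ok=True)
    # Write to a temporary file, then rename it into place. The rename is
    # atomic on POSIX filesystems, so a restarted run never reads (or trips
    # over) a half-written result file.
    tmp_path = os.path.join(output_dir, "part-00000.tmp")
    final_path = os.path.join(output_dir, "part-00000")
    with open(tmp_path, "w") as f:
        f.write("\n".join(lines))
    os.replace(tmp_path, final_path)
```

Running this twice against the same directory succeeds, whereas a naive `os.mkdir` plus direct write would raise an error on the second (restarted) attempt.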

Creating and using restartable jobs


You can specify the maximum number of times a job can be restarted per hour when submitting the job using the gcloud command-line tool (the maximum value is 10 retries per hour).

gcloud dataproc jobs submit job-type \
    --region=region \
    --max-failures-per-hour=number \
    ... other args


You can set jobs to restart through the Dataproc REST API. Use the jobs.submit API to set the number of times a job can be restarted on failure. You can set the maximum number of failures for a job by specifying a JobScheduling object and setting the max_failures_per_hour field (the maximum value is 10 retries per hour).


POST /v1/projects/project-id/regions/us-central1/jobs:submit/
{
  "projectId": "project-id",
  "job": {
    "placement": {
      "clusterName": "example-cluster"
    },
    "reference": {
      "jobId": "cea7ae0b...."
    },
    "sparkJob": {
      "args": [
        ...
      ],
      "mainClass": "org.apache.spark.examples.SparkPi",
      "jarFileUris": [
        ...
      ]
    },
    "scheduling": {
      "maxFailuresPerHour": 5
    }
  }
}
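The same request body can be assembled programmatically before posting it to jobs.submit. This Python sketch builds only the JSON payload (no request is sent); the project ID, cluster name, main class, and jar path are example values, and the 0–10 range check mirrors the documented upper limit:

```python
import json

def build_submit_body(project_id, cluster_name, max_failures_per_hour):
    """Build a jobs.submit request body for a restartable Spark job."""
    if not 0 <= max_failures_per_hour <= 10:
        raise ValueError("max_failures_per_hour must be between 0 and 10")
    return {
        "projectId": project_id,
        "job": {
            "placement": {"clusterName": cluster_name},
            "sparkJob": {
                "args": ["1000"],
                "mainClass": "org.apache.spark.examples.SparkPi",
                "jarFileUris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
            },
            # JobScheduling: restart the driver up to this many times per hour.
            "scheduling": {"maxFailuresPerHour": max_failures_per_hour},
        },
    }

body = build_submit_body("my-project", "example-cluster", 5)
print(json.dumps(body, indent=2))
```

Omitting the "scheduling" object entirely gives the default behavior described above: the job is not restarted on failure.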


You can also submit restartable jobs from the Google Cloud Console by specifying the maximum restarts per hour on the Dataproc Submit a job page (the maximum value is 10 times per hour).