Restartable jobs

By default, Dataproc jobs do not automatically restart on failure. You can optionally configure jobs to restart on failure; when you do, you specify the maximum number of retries per hour (the maximum value is 10 retries per hour) and/or the maximum number of total retries (the maximum value is 240 total retries).

Restarting jobs mitigates common types of job failure, including out-of-memory issues and unexpected Compute Engine virtual machine reboots. Restartable jobs are particularly useful for long-running and streaming jobs. For example, you can restart Spark streaming jobs running on Dataproc clusters to ensure that the streaming jobs are resilient.

Restartable job semantics

The following semantics apply to reporting the success or failure of jobs:

  • A job is reported successful if the driver terminates with code 0.
  • A job is reported failed if:
    • The driver terminates with a non-zero code more than 4 times in 10 minutes.
    • The driver terminates with a non-zero code and has exceeded the max_failures_per_hour or max_failures_total setting.
  • A job will be restarted if the driver exits with a non-zero code, is not thrashing, and is within the max_failures_per_hour and max_failures_total settings.
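
To make these rules concrete, the following sketch models the restart decision in Python. It is only an illustration of the documented semantics, not Dataproc code, and the function name and inputs are invented for the example.

from datetime import timedelta

def should_restart(failure_times, max_failures_per_hour, max_failures_total):
    """Illustrative model of the rules above; not Dataproc source code.

    failure_times: datetimes of every non-zero driver exit so far,
    ordered oldest to newest, including the exit that just occurred.
    """
    now = failure_times[-1]

    # Thrashing: more than 4 non-zero exits within 10 minutes.
    if sum(now - t <= timedelta(minutes=10) for t in failure_times) > 4:
        return False

    # The per-hour limit has been exceeded.
    if sum(now - t <= timedelta(hours=1) for t in failure_times) > max_failures_per_hour:
        return False

    # The total limit has been exceeded.
    if len(failure_times) > max_failures_total:
        return False

    # Otherwise, the job is restarted.
    return True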

Job design considerations

  • Design your jobs to gracefully handle restarting. For example, if your job writes to a directory, your job should account for the possibility that the directory will already exist when the job is restarted (see the sketch after this list).
  • Apache Spark streaming jobs that checkpoint can be restarted after failure, but they will not report YARN status.
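
As a minimal sketch of the first point, the following PySpark snippet (with hypothetical paths and job logic) writes its output in overwrite mode so that a restarted attempt is not tripped up by a directory left behind by the previous attempt:

from pyspark.sql import SparkSession

# Hypothetical output location; replace with your own bucket and path.
OUTPUT_PATH = "gs://your-bucket/example-output"

spark = SparkSession.builder.appName("restart-tolerant-job").getOrCreate()

df = spark.range(0, 1000).withColumnRenamed("id", "value")

# mode("overwrite") makes the write idempotent: if an earlier attempt already
# created OUTPUT_PATH, the restarted job replaces it instead of failing
# because the directory exists.
df.write.mode("overwrite").parquet(OUTPUT_PATH)

Similarly, a Structured Streaming query can be made restartable by giving it a durable checkpoint location (again, hypothetical paths):

query = (
    spark.readStream.format("rate").load()
    .writeStream.format("parquet")
    .option("path", "gs://your-bucket/stream-output")
    .option("checkpointLocation", "gs://your-bucket/stream-checkpoints")
    .start()
)
query.awaitTermination()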

Creating and using restartable jobs

You can specify the maximum number of times a job can be restarted per hour and the maximum number of total retries when submitting the job through the gcloud CLI, the Dataproc REST API, or the Google Cloud console.

Example: If you want to allow your job to retry up to 10 times, but no more than 5 times in one hour, set max-failures-total to 10 and max-failures-per-hour to 5.
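
You can also set the same fields when submitting a job through the Dataproc client libraries. The following is a minimal sketch using the Python client library (google-cloud-dataproc) that mirrors this example (5 retries per hour, 10 retries total); the project, region, and cluster names are placeholders, and the exact client surface can vary between library versions:

from google.cloud import dataproc_v1

# Placeholder values for illustration.
project_id = "your-project-id"
region = "us-central1"
cluster_name = "example-cluster"

# Use the regional Dataproc endpoint.
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": cluster_name},
    "spark_job": {
        "main_class": "org.apache.spark.examples.SparkPi",
        "jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        "args": ["1000"],
    },
    # Restart settings: at most 5 retries in any hour, and 10 retries in total.
    "scheduling": {"max_failures_per_hour": 5, "max_failures_total": 10},
}

submitted = client.submit_job(
    request={"project_id": project_id, "region": region, "job": job}
)
print("Submitted job:", submitted.reference.job_id)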

gcloud

Specify the maximum number of times a job can be restarted per hour (the max value is 10 retries per hour) and/or the maximum number of total retries (max value is 240 total retries) using the --max-failures-per-hour and --max-failures-total flags, respectively.

gcloud dataproc jobs submit job-type \
    --region=region \
    --max-failures-per-hour=number \
    --max-failures-total=number \
    ... other args

REST API

Specify the maximum number of times a job can be restarted per hour (the maximum value is 10 retries per hour) and/or the maximum number of total retries (the maximum value is 240 total retries) by setting the Job.JobScheduling maxFailuresPerHour and/or maxFailuresTotal fields, respectively.

Example

POST /v1/projects/project-id/regions/us-central1/jobs:submit/
{
  "projectId": "project-id",
  "job": {
    "placement": {
      "clusterName": "example-cluster"
    },
    "reference": {
      "jobId": "cea7ae0b...."
    },
    "sparkJob": {
      "args": [
        "1000"
      ],
      "mainClass": "org.apache.spark.examples.SparkPi",
      "jarFileUris": [
        "file:///usr/lib/spark/examples/jars/spark-examples.jar"
      ]
    },
    "scheduling": {
      "maxFailuresPerHour": 5,
      "maxFailuresTotal": 10
    }
  }
}

Console

You can submit restartable jobs by specifying the maximum restarts per hour on the Dataproc Submit a job page (the maximum value is 10 restarts per hour). Currently, the max restarts total setting is not available in the Google Cloud console.