By default, Dataproc jobs do not restart automatically on failure. You can use optional settings to configure jobs to restart on failure. When you configure a job to restart, you specify the maximum number of retries per hour (the maximum value is 10 retries per hour), the maximum number of total retries (the maximum value is 240 total retries), or both.
Restarting jobs mitigates common types of job failure, including out-of-memory issues and unexpected Compute Engine virtual machine reboots. Restartable jobs are particularly useful for long-running and streaming jobs. For example, you can restart Spark streaming jobs running on Dataproc clusters to ensure that the streaming jobs are resilient.
Restartable job semantics
The following semantics apply to reporting the success or failure of jobs:
- A job is reported successful if the driver terminates with code 0.
- A job is reported failed if:
  - The driver terminates with a non-zero code more than 4 times in 10 minutes.
  - The driver terminates with a non-zero code, and has exceeded the max_failures_per_hour or the max_failures_total setting.
- A job will be restarted if the driver exits with a non-zero code, is not thrashing, and is within the max_failures_per_hour and max_failures_total settings.
Job design considerations
- Design your jobs to gracefully handle restarting. For example, if your job writes to a directory, your job should accommodate the possibility that the directory will already exist when the job is restarted (see the sketch after this list).
- Apache Spark streaming jobs that checkpoint can be restarted after failure, but they will not report Yarn status.
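For example, the following minimal PySpark sketch writes its output in overwrite mode so that an output directory left behind by a failed earlier attempt does not break the restarted run. The bucket paths, input files, and application name are hypothetical placeholders, not part of the Dataproc documentation.

# restartable_wordcount.py -- illustrative sketch only; paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("restartable-wordcount").getOrCreate()

# Read input and compute a simple aggregation.
lines = spark.read.text("gs://my-bucket/input/*.txt")
counts = lines.groupBy("value").count()

# "overwrite" tolerates output left over from a previous failed attempt,
# so the job behaves the same on its first run and on every restart.
counts.write.mode("overwrite").parquet("gs://my-bucket/output/")

spark.stop()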
Creating and using restartable jobs
You can specify the maximum number of times a job can be restarted per hour and the maximum number of total retries when submitting the job through the gcloud CLI, the Dataproc REST API, or the Google Cloud console.
Example: If you want to allow your job to retry up to 10 times, but no more than 5 times in one hour, set max-failures-total to 10 and max-failures-per-hour to 5.
gcloud
Specify the maximum number of times a job can be restarted per hour (the maximum value is 10 retries per hour) and/or the maximum number of total retries (the maximum value is 240 total retries) using the --max-failures-per-hour and --max-failures-total flags, respectively.
gcloud dataproc jobs submit job type \
    --region=region \
    --max-failures-per-hour=number \
    --max-failures-total=number \
    ... other args
REST API
Specify the maximum number of times a job can be restarted per hour (the maximum value is 10 retries per hour) and/or the maximum number of total retries (the maximum value is 240 total retries) by setting the Job.JobScheduling maxFailuresPerHour and/or maxFailuresTotal fields, respectively.
Example
POST /v1/projects/project-id/regions/us-central1/jobs:submit/
{
  "projectId": "project-id",
  "job": {
    "placement": {
      "clusterName": "example-cluster"
    },
    "reference": {
      "jobId": "cea7ae0b...."
    },
    "sparkJob": {
      "args": [
        "1000"
      ],
      "mainClass": "org.apache.spark.examples.SparkPi",
      "jarFileUris": [
        "file:///usr/lib/spark/examples/jars/spark-examples.jar"
      ]
    },
    "scheduling": {
      "maxFailuresPerHour": 5,
      "maxFailuresTotal": 10
    }
  }
}
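If you submit jobs with the Cloud Client Libraries rather than calling the REST API directly, you set the same scheduling fields on the job. The following is a minimal Python sketch, assuming the google-cloud-dataproc package; the project ID and cluster name are hypothetical placeholders, and the snake_case field names mirror the REST fields shown above.

# submit_restartable_job.py -- illustrative sketch; project, region, and
# cluster names are hypothetical placeholders.
from google.cloud import dataproc_v1

region = "us-central1"
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "example-cluster"},
    "spark_job": {
        "main_class": "org.apache.spark.examples.SparkPi",
        "jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        "args": ["1000"],
    },
    # Restart settings: at most 5 retries in any hour, 10 retries in total.
    "scheduling": {"max_failures_per_hour": 5, "max_failures_total": 10},
}

submitted = job_client.submit_job(
    request={"project_id": "my-project", "region": region, "job": job}
)
print(f"Submitted job {submitted.reference.job_id}")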
Console
You can submit restartable jobs by specifying the max restarts per hour on the Dataproc Submit a job page (the maximum value is 10 times per hour). Currently, the max restarts total setting is not available in the Google Cloud console.