By default, Cloud Dataproc jobs do not automatically restart on failure. You can optionally configure a job to restart on failure; when you do, you specify the maximum number of retries per hour (the upper limit is 10 retries per hour).
Restarting jobs mitigates common types of job failure, including out-of-memory issues and unexpected Google Compute Engine virtual machine reboots. Restartable jobs are especially useful for long-running and streaming jobs. For example, you can restart Spark streaming jobs running on Google Cloud Dataproc clusters to ensure that the streaming jobs are resilient. Restartable jobs can be created through the Google Cloud SDK gcloud command-line tool or the Cloud Dataproc REST API.
Restartable job semantics
Any type of Cloud Dataproc job can be marked as restartable. This applies to jobs submitted through the Google Cloud Console, Google Cloud SDK gcloud command-line tool, or the Cloud Dataproc REST API.
The following semantics apply to reporting the success or failure of jobs:
- A job is reported successful if the driver terminates with code 0.
- A job is reported failed if:
  - The driver terminates with a non-zero code more than 4 times in 10 minutes.
  - The driver terminates with a non-zero code and has exceeded the max_failures_per_hour setting.
- A job will be restarted if the driver exits with a non-zero code, is not thrashing, and is within the max_failures_per_hour setting.
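For example, if max_failures_per_hour is set to 5, a driver that exits with a non-zero code is restarted after each of its first 5 failures in a given hour (provided the failures are not frequent enough to count as thrashing); a sixth non-zero exit within that hour causes the job to be reported as failed.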
Known limitations
The following limitations apply to restartable jobs on Cloud Dataproc.
- You must design your jobs to handle restarting gracefully. For example, if your job writes to an output directory, it must account for the possibility that the directory already exists when the job is restarted (see the sketch after this list).
- Apache Spark streaming jobs that checkpoint can be restarted after failure, but these jobs will not report YARN status.
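The following is a minimal PySpark sketch of one way to make an output write restart-safe; the bucket path and the computation are hypothetical stand-ins, not part of Cloud Dataproc itself.

# A restart may find output left behind by a previous attempt, so write
# with overwrite mode instead of the default error-if-exists behavior.
from pyspark.sql import SparkSession

OUTPUT_PATH = "gs://my-bucket/output/daily-report"  # hypothetical output location

spark = SparkSession.builder.appName("restartable-example").getOrCreate()

df = spark.range(1000)  # stand-in for the real job's computation

# Overwrite any partial output written by an earlier attempt of this job.
df.write.mode("overwrite").parquet(OUTPUT_PATH)

spark.stop()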
Creating and using restartable jobs
gcloud
You can specify the maximum number of times a job can be restarted per hour when submitting the job using the gcloud command-line tool (the maximum value is 10 retries per hour).

gcloud dataproc jobs submit <job type> args --max-failures-per-hour number
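For example, the following command submits the SparkPi job used in the REST API example below and allows up to 5 driver restarts per hour (the region and cluster name mirror that example):

gcloud dataproc jobs submit spark \
    --region=us-central1 \
    --cluster=example-cluster \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --max-failures-per-hour=5 \
    -- 1000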
REST API
You can set jobs to restart through the Cloud Dataproc REST API. Use the jobs.submit API and specify a JobScheduling object, setting its max_failures_per_hour field to the number of times the job can be restarted on failure (the maximum value is 10 retries per hour).
Example
POST /v1/projects/project-id/regions/us-central1/jobs:submit/
{
  "projectId": "project-id",
  "job": {
    "placement": {
      "clusterName": "example-cluster"
    },
    "reference": {
      "jobId": "cea7ae0b...."
    },
    "sparkJob": {
      "args": [
        "1000"
      ],
      "mainClass": "org.apache.spark.examples.SparkPi",
      "jarFileUris": [
        "file:///usr/lib/spark/examples/jars/spark-examples.jar"
      ]
    },
    "scheduling": {
      "maxFailuresPerHour": 5
    }
  }
}
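The same request can also be made programmatically. The following is a minimal sketch using the google-cloud-dataproc Python client library (an assumption; this page itself only covers gcloud, the REST API, and the Console). The project, region, and cluster values mirror the example above.

# Minimal sketch: submit a restartable Spark job with the Python client library.
from google.cloud import dataproc_v1

def submit_restartable_spark_job(project_id: str, region: str, cluster_name: str) -> None:
    # The client must target the regional endpoint that matches the cluster's region.
    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    job = {
        "placement": {"cluster_name": cluster_name},
        "spark_job": {
            "main_class": "org.apache.spark.examples.SparkPi",
            "jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
            "args": ["1000"],
        },
        # Allow up to 5 driver restarts per hour (the upper limit is 10).
        "scheduling": {"max_failures_per_hour": 5},
    }
    result = client.submit_job(
        request={"project_id": project_id, "region": region, "job": job}
    )
    print(f"Submitted job {result.reference.job_id}")

submit_restartable_spark_job("project-id", "us-central1", "example-cluster")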
Console
You can submit restartable jobs by specifying the maximum number of restarts per hour on the Cloud Dataproc Submit a job page (the maximum value is 10 restarts per hour).