Restartable jobs

By default, Cloud Dataproc jobs do not automatically restart on failure. You can optionally set jobs to restart on failure; when you do, you specify the maximum number of retries per hour (the upper limit is 10 retries per hour).

Restarting jobs mitigates common types of job failure, including out-of-memory issues and unexpected Google Compute Engine virtual machine reboots. Restartable jobs are especially useful for long-running and streaming jobs. For example, you can restart Spark streaming jobs running on Google Cloud Dataproc clusters to make them resilient to transient failures. Restartable jobs can be created through the Google Cloud Platform Console, the Google Cloud SDK gcloud command-line tool, or the Cloud Dataproc REST API.

Restartable job semantics

Any type of Cloud Dataproc job can be marked as restartable. This applies to jobs submitted through the Google Cloud Platform Console, Google Cloud SDK gcloud command-line tool, or the Cloud Dataproc REST API.

The following semantics apply to reporting the success or failure of jobs (a rough sketch of this logic appears after the list):

  • A job is reported successful if the driver terminates with code 0.
  • A job is reported failed if:
    • The driver terminates with a non-zero code more than 4 times in 10 minutes.
    • The driver terminates with a non-zero code and has exceeded the max_failures_per_hour setting.
  • A job will be restarted if the driver exits with a non-zero code, is not thrashing, and is within the max_failures_per_hour setting.
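
These rules can be summarized with the following rough model (a minimal sketch for illustration only, not Cloud Dataproc's actual implementation; the function name and its counters are hypothetical):

def job_outcome(exit_code, nonzero_exits_last_10_min, nonzero_exits_last_hour,
                max_failures_per_hour):
    """Rough model of how the outcome of a restartable job is reported."""
    if exit_code == 0:
        return "SUCCEEDED"      # driver terminated with code 0
    if nonzero_exits_last_10_min > 4:
        return "FAILED"         # thrashing: too many failures too quickly
    if nonzero_exits_last_hour > max_failures_per_hour:
        return "FAILED"         # exceeded the max_failures_per_hour setting
    return "RESTARTED"          # non-zero exit, not thrashing, within the limit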

Known limitations

The following limitations apply to restartable jobs on Cloud Dataproc.

  • You must design your jobs to gracefully handle restarting. For example, if your job writes to a directory, it must account for the fact that the directory may already exist when the job is restarted (see the sketch after this list).
  • Apache Spark streaming jobs that checkpoint can be restarted after failure, but these jobs will not report YARN status.
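
For example, the following minimal PySpark sketch handles the first limitation, assuming a job that writes results to a Cloud Storage path (the bucket name, path, and computation are hypothetical). Writing with overwrite mode lets a restarted attempt replace output left behind by a previous, failed attempt instead of failing because the directory already exists:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("restart-safe-job").getOrCreate()

df = spark.range(1000)  # placeholder computation

# Overwrite mode makes the write idempotent across restarts: a restarted
# attempt replaces any partial output from an earlier attempt.
df.write.mode("overwrite").parquet("gs://my-bucket/output/results")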

Creating and using restartable jobs

gcloud

You can specify the maximum number of times a job can be restarted per hour when submitting the job using the gcloud command-line tool (the maximum value is 10 retries per hour).

gcloud dataproc jobs submit <job type> <job args> --max-failures-per-hour <number>
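
For example, the following command submits the SparkPi example bundled with the cluster's Spark installation and allows up to 5 restarts per hour (the cluster name example-cluster is an assumption; depending on your configuration you may also need to pass a --region flag):

gcloud dataproc jobs submit spark --cluster example-cluster \
    --class org.apache.spark.examples.SparkPi \
    --jars file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --max-failures-per-hour 5 \
    -- 1000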

REST API

You can set jobs to restart through the Cloud Dataproc REST API by using the jobs.submit API. Specify a JobScheduling object and set its max_failures_per_hour field to the maximum number of times the job can be restarted on failure (the maximum value is 10 retries per hour).

Example

POST /v1/projects/project-id/regions/global/jobs:submit/
{
  "projectId": "project-id",
  "job": {
    "placement": {
      "clusterName": "example-cluster"
    },
    "reference": {
      "jobId": "cea7ae0b...."
    },
    "sparkJob": {
      "args": [
        "1000"
      ],
      "mainClass": "org.apache.spark.examples.SparkPi",
      "jarFileUris": [
        "file:///usr/lib/spark/examples/jars/spark-examples.jar"
      ]
    },
    "scheduling": {
      "maxFailuresPerHour": 5
    }
  }
}
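
For example, if the request body above is saved in a local file named request.json (a hypothetical filename), the request can be sent with curl, using gcloud to obtain an access token:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @request.json \
    "https://dataproc.googleapis.com/v1/projects/project-id/regions/global/jobs:submit"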

Console

You can submit restartable jobs by specifying the max restarts per hour on the Cloud Dataproc Submit a job page (the maximum value is 10 times per hour).
