This page describes how to automatically retry tasks after all or some failures.
A Batch job fails when at least one of its tasks fails, which can happen for various reasons. By default, each task in a job only runs once; if a task fails, it is not retried. However, some issues that cause a task to fail can be easily resolved just by retrying the task. In these cases, configuring the job to automatically retry tasks can substantially help reduce troubleshooting friction and the overall run time of your jobs.
Automatic retries are well-suited to loosely coupled (independent) tasks and can help with a variety of issues. For example, automatic task retries can resolve time-sensitive issues like the following:
- preemption of Spot VMs
- VM maintenance events and host errors
- transient networking errors
You can configure automatic task retries for each task when you create a job. Specifically, for each task, you can use one of the following configuration options:
- No retries (default): By default, each task is not retried when it fails.
- Retry tasks for all failures: You can configure the maximum number of times to automatically retry failed tasks. You can specify between 0 (default) and 10 retries.
- Retry tasks for some failures: You can configure different task actions—either automatic retry or fail without retry—for specific failures. The opposite action is taken for all unspecified failures. Specific failures can each be identified by an exit code that is defined by your application or Batch.
Before you begin
- If you haven't used Batch before, review Get started with Batch and enable Batch by completing the prerequisites for projects and users.
- To get the permissions that you need to create a job, ask your administrator to grant you the following IAM roles:
  - Batch Job Editor (roles/batch.jobsEditor) on the project
  - Service Account User (roles/iam.serviceAccountUser) on the job's service account, which by default is the default Compute Engine service account
  For more information about granting roles, see Manage access to projects, folders, and organizations.
  You might also be able to get the required permissions through custom roles or other predefined roles.
Retry tasks for all failures
You can define the maximum number of automatic retries (maxRetryCount field) for a job's failed tasks using the gcloud CLI or Batch API.
gcloud
Create a JSON file that specifies the job's configuration details and the maxRetryCount field.
For example, to create a basic script job that specifies the maximum retries for failed tasks, create a JSON file with the following contents:
{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "script": {
              "text": "echo Hello world from task ${BATCH_TASK_INDEX}"
            }
          }
        ],
        "maxRetryCount": MAX_RETRY_COUNT
      },
      "taskCount": 3
    }
  ],
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
Replace MAX_RETRY_COUNT with the maximum number of retries for each task. For a job to be able to retry failed tasks, this value must be set to an integer between 1 and 10. If the maxRetryCount field is not specified, the default value is 0, which means that no tasks are retried.
To create and run the job, use the gcloud batch jobs submit command:
gcloud batch jobs submit JOB_NAME \
  --location LOCATION \
  --config JSON_CONFIGURATION_FILE
Replace the following:
- JOB_NAME: the name of the job.
- LOCATION: the location of the job.
- JSON_CONFIGURATION_FILE: the path for a JSON file with the job's configuration details.
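If you generate configuration files programmatically, the JSON shown in this section can be produced by a short script. The following is a minimal Python sketch, not part of any Batch tooling: the function name build_retry_job_config is hypothetical, and the range check simply mirrors the 0 to 10 limit described on this page.

```python
import json


def build_retry_job_config(max_retry_count: int, task_count: int = 3) -> dict:
    """Return a job configuration dict mirroring the JSON example on this page.

    Hypothetical helper for illustration; it is not provided by Batch.
    """
    # Mirror the documented limit: 0 (default, no retries) through 10.
    if not isinstance(max_retry_count, int) or not 0 <= max_retry_count <= 10:
        raise ValueError("maxRetryCount must be an integer between 0 and 10")
    return {
        "taskGroups": [
            {
                "taskSpec": {
                    "runnables": [
                        {
                            "script": {
                                "text": "echo Hello world from task ${BATCH_TASK_INDEX}"
                            }
                        }
                    ],
                    "maxRetryCount": max_retry_count,
                },
                "taskCount": task_count,
            }
        ],
        "logsPolicy": {"destination": "CLOUD_LOGGING"},
    }


if __name__ == "__main__":
    # Print the configuration; redirect the output to a file and pass that
    # file's path as JSON_CONFIGURATION_FILE when you submit the job.
    print(json.dumps(build_retry_job_config(3), indent=2))
```

Rejecting out-of-range values before the file is written surfaces a bad retry count locally rather than at submission time.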
API
Make a POST request to the jobs.create method that specifies the maxRetryCount field.
For example, to create a basic script job that specifies the maximum retries for failed tasks, make the following request:
POST https://batch.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/jobs?job_id=JOB_NAME
{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "script": {
              "text": "echo Hello world from task ${BATCH_TASK_INDEX}"
            }
          }
        ],
        "maxRetryCount": MAX_RETRY_COUNT
      },
      "taskCount": 3
    }
  ],
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
Replace the following:
- PROJECT_ID: the project ID of your project.
- LOCATION: the location of the job.
- JOB_NAME: the name of the job.
- MAX_RETRY_COUNT: the maximum number of retries for each task. For a job to be able to retry failed tasks, this value must be set to an integer between 1 and 10. If the maxRetryCount field is not specified, the default value is 0, which means that no tasks are retried.
Retry tasks for some failures
You can define how you want a job to handle different task failures by using lifecycle policies (lifecyclePolicies[] field).
A lifecycle policy consists of an action (action field), action condition (actionCondition field), and exit code (exitCodes[] field). The specified action is taken whenever the action condition, a specific exit code, occurs.
You can specify one of the following actions:
- RETRY_TASK: retry tasks that fail with the exit codes specified in the exitCodes[] field. Tasks that fail with any unspecified exit codes are not retried.
- FAIL_TASK: do not retry tasks that fail with the exit codes specified in the exitCodes[] field. Tasks that fail with any unspecified exit codes are retried.
Notably, any tasks that fail with unspecified exit codes take the opposite action, so a lifecycle policy always causes some exit codes to be retried and others to be failed. Consequently, for the lifecycle policy to work as expected, you also need to define the maximum number of automatic retries (maxRetryCount field) to allow the job to automatically retry failed tasks at least once.
Each exit code represents a specific failure that is defined either by your application or Batch. The exit codes from 50001 to 59999 are reserved and defined by Batch. For more information about the reserved exit codes, see Troubleshooting.
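The interaction between the action, the listed exit codes, and the retry limit can be easier to follow as code. The following Python sketch is a simplified model of the rules described above for a single lifecycle policy; it is not how Batch is implemented, and the function name should_retry and its parameters are hypothetical.

```python
from typing import Iterable, Optional


def should_retry(
    exit_code: int,
    retries_used: int,
    max_retry_count: int,
    action: Optional[str] = None,
    exit_codes: Iterable[int] = (),
) -> bool:
    """Simplified model of the retry rules described on this page."""
    # No retries remain (or maxRetryCount is 0, the default): the task fails.
    if retries_used >= max_retry_count:
        return False
    # No lifecycle policy: every failure is retried while retries remain.
    if action is None:
        return True
    listed = exit_code in set(exit_codes)
    if action == "RETRY_TASK":
        # Listed exit codes are retried; unspecified ones are not.
        return listed
    if action == "FAIL_TASK":
        # Listed exit codes are not retried; unspecified ones are.
        return not listed
    raise ValueError(f"unknown action: {action!r}")


if __name__ == "__main__":
    # Assumed policy: retry only exit code 50001, with up to 3 retries.
    for code in (50001, 1):
        print(code, should_retry(code, 0, 3, "RETRY_TASK", [50001]))
```

In this model, a lifecycle policy combined with a maxRetryCount of 0 never retries anything, which is why the retry limit must be at least 1 for the policy to have an effect.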
You can configure a job to retry or fail tasks after specific failures using the gcloud CLI or Batch API.
gcloud
Create a JSON file that specifies the job's configuration details, the maxRetryCount field, and the lifecyclePolicies[] subfields.
To create a basic script job that retries failed tasks only for some exit codes, create a JSON file with the following contents:
{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "script": {
              "text": "echo Hello world from task ${BATCH_TASK_INDEX}"
            }
          }
        ],
        "maxRetryCount": MAX_RETRY_COUNT,
        "lifecyclePolicies": [
          {
            "action": "ACTION",
            "actionCondition": {
              "exitCodes": [EXIT_CODES]
            }
          }
        ]
      }
    }
  ],
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
Replace the following:
- MAX_RETRY_COUNT: the maximum number of retries for each task. For a job to be able to retry failed tasks, this value must be set to an integer between 1 and 10. If the maxRetryCount field is not specified, the default value is 0, which means that no tasks are retried.
- ACTION: the action, either RETRY_TASK or FAIL_TASK, that you want for tasks that fail with the specified exit codes. Tasks that fail with unspecified exit codes take the other action.
- EXIT_CODES: a comma-separated list of one or more exit codes that you want to trigger the specified action (for example, 50001, 50002).
  Each exit code can be defined by your application or Batch. The exit codes from 50001 to 59999 are reserved by Batch. For more information about the reserved exit codes, see Troubleshooting.
For example, the following job only retries tasks that fail due to the preemption of Spot VMs.
{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "script": {
              "text": "sleep 30"
            }
          }
        ],
        "maxRetryCount": 3,
        "lifecyclePolicies": [
          {
            "action": "RETRY_TASK",
            "actionCondition": {
              "exitCodes": [50001]
            }
          }
        ]
      }
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        "policy": {
          "machineType": "e2-standard-4",
          "provisioningModel": "SPOT"
        }
      }
    ]
  }
}
To create and run the job, use the gcloud batch jobs submit command:
gcloud batch jobs submit JOB_NAME \
  --location LOCATION \
  --config JSON_CONFIGURATION_FILE
Replace the following:
- JOB_NAME: the name of the job.
- LOCATION: the location of the job.
- JSON_CONFIGURATION_FILE: the path for a JSON file with the job's configuration details.
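If you assemble the lifecyclePolicies[] entry in a script, it can help to validate the action and exit codes before writing the configuration file. The following Python sketch shows one way to do that; build_lifecycle_policy is a hypothetical helper, and its checks only reflect the constraints stated on this page (one of two actions, one or more exit codes).

```python
import json

# The two actions described on this page.
VALID_ACTIONS = {"RETRY_TASK", "FAIL_TASK"}


def build_lifecycle_policy(action, exit_codes):
    """Return one lifecyclePolicies[] entry as a dict (hypothetical helper)."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"action must be one of {sorted(VALID_ACTIONS)}")
    codes = list(exit_codes)
    # The page calls for one or more exit codes, each an integer.
    if not codes or not all(isinstance(code, int) for code in codes):
        raise ValueError("exit_codes must be a non-empty list of integers")
    return {"action": action, "actionCondition": {"exitCodes": codes}}


if __name__ == "__main__":
    # Fragment of a taskSpec matching the Spot VM example in this section.
    task_spec_fragment = {
        "maxRetryCount": 3,
        "lifecyclePolicies": [build_lifecycle_policy("RETRY_TASK", [50001])],
    }
    print(json.dumps(task_spec_fragment, indent=2))
```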
API
Make a POST request to the jobs.create method that specifies the maxRetryCount field and lifecyclePolicies[] subfields.
To create a basic script job that retries failed tasks only for some exit codes, make the following request:
POST https://batch.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/jobs?job_id=JOB_NAME
{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "script": {
              "text": "echo Hello world from task ${BATCH_TASK_INDEX}"
            }
          }
        ],
        "maxRetryCount": MAX_RETRY_COUNT,
        "lifecyclePolicies": [
          {
            "action": "ACTION",
            "actionCondition": {
              "exitCodes": [EXIT_CODES]
            }
          }
        ]
      }
    }
  ],
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
Replace the following:
- PROJECT_ID: the project ID of your project.
- LOCATION: the location of the job.
- JOB_NAME: the name of the job.
- MAX_RETRY_COUNT: the maximum number of retries for each task. For a job to be able to retry failed tasks, this value must be set to an integer between 1 and 10. If the maxRetryCount field is not specified, the default value is 0, which means that no tasks are retried.
- ACTION: the action, either RETRY_TASK or FAIL_TASK, that you want for tasks that fail with the specified exit codes. Tasks that fail with unspecified exit codes take the other action.
- EXIT_CODES: a comma-separated list of one or more exit codes that you want to trigger the specified action (for example, 50001, 50002).
  Each exit code can be defined by your application or Batch. The exit codes from 50001 to 59999 are reserved by Batch. For more information about the reserved exit codes, see Troubleshooting.
For example, the following job only retries tasks that fail due to the preemption of Spot VMs.
POST https://batch.googleapis.com/v1/projects/example-project/locations/us-central1/jobs?job_id=example-job
{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "script": {
              "text": "sleep 30"
            }
          }
        ],
        "maxRetryCount": 3,
        "lifecyclePolicies": [
          {
            "action": "RETRY_TASK",
            "actionCondition": {
              "exitCodes": [50001]
            }
          }
        ]
      }
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        "policy": {
          "machineType": "e2-standard-4",
          "provisioningModel": "SPOT"
        }
      }
    ]
  }
}
Modify task behavior based on the number of retries
Optionally, after you've enabled automatic retries for a task as described in the previous sections on this page, you can update your runnables to use the BATCH_TASK_RETRY_ATTEMPT predefined environment variable.
The BATCH_TASK_RETRY_ATTEMPT variable describes the number of times that this task has already been attempted. Use the BATCH_TASK_RETRY_ATTEMPT variable in your runnables if you want a task to behave differently based on the number of retries. For example, when a task is being retried, you might want to confirm which commands were already successfully executed in the previous attempt. For more information, see Predefined environment variables.
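As an illustration, a runnable written in Python might read the variable and skip work that an earlier attempt already finished. This is a minimal sketch under several assumptions: the step names, the marker file path, and both helper functions are hypothetical, and the marker would need to live on storage that survives a retry (a retried task might not run on the same VM).

```python
import os


def retry_attempt(environ=None) -> int:
    """Return BATCH_TASK_RETRY_ATTEMPT as an int, or 0 if it is not set."""
    env = os.environ if environ is None else environ
    return int(env.get("BATCH_TASK_RETRY_ATTEMPT", "0"))


def steps_to_run(attempt: int, download_done: bool) -> list:
    """Choose which (hypothetical) steps this attempt should execute."""
    steps = []
    # Always download on the first attempt; on a retry, download again only
    # if the previous attempt did not record that the download finished.
    if attempt == 0 or not download_done:
        steps.append("download")
    steps.append("process")
    return steps


if __name__ == "__main__":
    attempt = retry_attempt()
    # Hypothetical marker file written by an earlier attempt.
    done = os.path.exists("/mnt/shared/download.done")
    print(f"attempt {attempt}: running {steps_to_run(attempt, done)}")
```

Reading the variable through a function that accepts a mapping keeps the logic easy to exercise outside of Batch, where the variable is not set.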
What's next
- If you have issues creating or running a job, see Troubleshooting.
- View jobs and tasks.
- Learn about more job creation options.