Automate task retries

This page describes how to automatically retry tasks after all or some failures.

A Batch job fails when at least one of its tasks fails, which can happen for various reasons. By default, each task in a job only runs once; if a task fails, it is not retried. However, some issues that cause a task to fail can be easily resolved just by retrying the task. In these cases, configuring the job to automatically retry tasks can substantially help reduce troubleshooting friction and the overall run time of your jobs.

Automatic retries are well suited to loosely coupled (independent) tasks and can help with a variety of issues. For example, automatic task retries can resolve transient, time-sensitive issues such as tasks that fail due to the preemption of Spot VMs.

You can configure automatic task retries for each task when you create a job. Specifically, for each task, you can use one of the following configuration options:

  • By default, each task is not retried when it fails.
  • Retry tasks for all failures: You can configure the maximum number of times to automatically retry failed tasks. You can specify between 0 (the default) and 10 retries.
  • Retry tasks for some failures: You can configure different task actions—either automatic retry or fail without retry—for specific failures. The opposite action is taken for all unspecified failures. Specific failures can each be identified by an exit code that is defined by your application or Batch. Both options are sketched after this list.
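
Both options are set in a task group's taskSpec, as shown in the following minimal sketch. The values are placeholders for illustration only; complete examples appear in the sections later on this page.

"taskSpec": {
  "maxRetryCount": 3,
  "lifecyclePolicies": [
    {
      "action": "RETRY_TASK",
      "actionCondition": {
        "exitCodes": [50001]
      }
    }
  ]
}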

Before you begin

  1. If you haven't used Batch before, review Get started with Batch and enable Batch by completing the prerequisites for projects and users.
  2. To get the permissions that you need to create a job, ask your administrator to grant you the required IAM roles.

    For more information about granting roles, see Manage access to projects, folders, and organizations.

    You might also be able to get the required permissions through custom roles or other predefined roles.

Retry tasks for all failures

You can define the maximum number of automatic retries (maxRetryCount field) for a job's failed tasks using the gcloud CLI or Batch API.

gcloud

  1. Create a JSON file that specifies the job's configuration details and the maxRetryCount field.

    For example, to create a basic script job that specifies the maximum retries for failed tasks, create a JSON file with the following contents:

    {
      "taskGroups": [
        {
          "taskSpec": {
            "runnables": [
              {
                "script": {
                  "text": "echo Hello world from task ${BATCH_TASK_INDEX}"
                }
              }
            ],
            "maxRetryCount": MAX_RETRY_COUNT
          },
          "taskCount": 3
        }
      ],
      "logsPolicy": {
        "destination": "CLOUD_LOGGING"
      }
    }
    

    Replace MAX_RETRY_COUNT with the maximum number of retries for each task. For a job to be able to retry failed tasks, this value must be an integer between 1 and 10. If the maxRetryCount field is not specified, the default value is 0, which means that failed tasks are not retried.

  2. To create and run the job, use the gcloud batch jobs submit command:

    gcloud batch jobs submit JOB_NAME \
      --location LOCATION \
      --config JSON_CONFIGURATION_FILE
    

    Replace the following:

    • JOB_NAME: the name of the job.

    • LOCATION: the location of the job.

    • JSON_CONFIGURATION_FILE: the path for a JSON file with the job's configuration details.
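
    For example, if you saved the configuration file from step 1 as hello-world-retries.json and want to run a job named example-retry-job in us-central1 (both names are hypothetical), the command might look like the following:

    gcloud batch jobs submit example-retry-job \
      --location us-central1 \
      --config hello-world-retries.json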

API

Make a POST request to the jobs.create method that specifies the maxRetryCount field.

For example, to create a basic script job that specifies the maximum retries for failed tasks, make the following request:

POST https://batch.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/jobs?job_id=JOB_NAME

{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "script": {
              "text": "echo Hello world from task ${BATCH_TASK_INDEX}"
            }
          }
        ],
        "maxRetryCount": MAX_RETRY_COUNT
      },
      "taskCount": 3
    }
  ],
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}

Replace the following:

  • PROJECT_ID: the project ID of your project.

  • LOCATION: the location of the job.

  • JOB_NAME: the name of the job.

  • MAX_RETRY_COUNT: the maximum number of retries for each task. For a job to be able to retry failed tasks, this value must be an integer between 1 and 10. If the maxRetryCount field is not specified, the default value is 0, which means that failed tasks are not retried.
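
For example, the following request creates a job that retries each failed task up to 3 times. The project ID, location, and job name shown are example placeholders:

POST https://batch.googleapis.com/v1/projects/example-project/locations/us-central1/jobs?job_id=example-job

{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "script": {
              "text": "echo Hello world from task ${BATCH_TASK_INDEX}"
            }
          }
        ],
        "maxRetryCount": 3
      },
      "taskCount": 3
    }
  ],
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}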

Retry tasks for some failures

You can define how you want a job to handle different task failures by using lifecycle policies (lifecyclePolicies[] field).

A lifecycle policy consists of an action (action field), an action condition (actionCondition field), and exit codes (exitCodes[] field). The specified action is taken whenever the action condition—one of the specified exit codes—occurs. You can specify one of the following actions:

  • RETRY_TASK: retry tasks that fail with the exit codes specified in the exitCodes[] field. Tasks that fail with any unspecified exit codes are not retried.
  • FAIL_TASK: do not retry tasks that fail with the exit codes specified in the exitCodes[] field. Tasks that fail with any unspecified exit codes are retried.

Notably, tasks that fail with unspecified exit codes always take the opposite action, so whichever action you choose, some exit codes result in retries and others result in failures. Consequently, for a lifecycle policy to work as expected, you also need to set the maximum number of automatic retries (maxRetryCount field) to at least 1 so that the job can automatically retry failed tasks.

Each exit code represents a specific failure that is defined either by your application or Batch. The exit codes from 50001 to 59999 are reserved and defined by Batch. For more information about the reserved exit codes, see Troubleshooting.
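
For example, suppose that your application signals a specific, recoverable failure with its own exit code. In the following sketch, the exit code 70 and the do_work command are hypothetical placeholders; the lifecycle policy retries only tasks that fail with that exit code, and maxRetryCount allows up to 2 retries:

"taskSpec": {
  "runnables": [
    {
      "script": {
        "text": "do_work || exit 70  # 'do_work' is a placeholder for your command"
      }
    }
  ],
  "maxRetryCount": 2,
  "lifecyclePolicies": [
    {
      "action": "RETRY_TASK",
      "actionCondition": {
        "exitCodes": [70]
      }
    }
  ]
}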

You can configure a job to retry or fail tasks after specific failures by using the gcloud CLI or the Batch API.

gcloud

  1. Create a JSON file that specifies the job's configuration details, the maxRetryCount field, and the lifecyclePolicies[] subfields.

    To create a basic script job that retries failed tasks only for some exit codes, create a JSON file with the following contents:

    {
      "taskGroups": [
        {
          "taskSpec": {
            "runnables": [
              {
                "script": {
                  "text": "echo Hello world from task ${BATCH_TASK_INDEX}"
                }
              }
            ],
            "maxRetryCount": MAX_RETRY_COUNT,
            "lifecyclePolicies": [
              {
                "action": "ACTION",
                "actionCondition": {
                   "exitCodes": [EXIT_CODES]
                }
              }
            ]
          }
        }
      ],
      "logsPolicy": {
        "destination": "CLOUD_LOGGING"
      }
    }
    

    Replace the following:

    • MAX_RETRY_COUNT: the maximum number of retries for each task. For a job to be able to retry failed tasks, this value must be an integer between 1 and 10. If the maxRetryCount field is not specified, the default value is 0, which means that failed tasks are not retried.

    • ACTION: the action, either RETRY_TASK or FAIL_TASK, that you want for tasks that fail with the specified exit codes. Tasks that fail with unspecified exit codes take the other action.

    • EXIT_CODES: a comma-separated list of one or more exit codes that you want to trigger the specified action—for example, 50001, 50002.

      Each exit code can be defined by your application or Batch. The exit codes from 50001 to 59999 are reserved by Batch. For more information about the reserved exit codes, see Troubleshooting.

    For example, the following job retries tasks only when they fail due to the preemption of Spot VMs, which is indicated by exit code 50001.

    {
      "taskGroups": [
        {
          "taskSpec": {
            "runnables": [
              {
                "script": {
                  "text": "sleep 30"
                }
              }
            ],
            "maxRetryCount": 3,
            "lifecyclePolicies": [
              {
                 "action": "RETRY_TASK",
                 "actionCondition": {
                   "exitCodes": [50001]
                }
              }
            ]
          }
        }
      ],
      "allocationPolicy": {
        "instances": [
          {
            "policy": {
              "machineType": "e2-standard-4",
              "provisioningModel": "SPOT"
            }
          }
        ]
      }
    }
    
  2. To create and run the job, use the gcloud batch jobs submit command:

    gcloud batch jobs submit JOB_NAME \
      --location LOCATION \
      --config JSON_CONFIGURATION_FILE
    

    Replace the following:

    • JOB_NAME: the name of the job.

    • LOCATION: the location of the job.

    • JSON_CONFIGURATION_FILE: the path for a JSON file with the job's configuration details.

API

Make a POST request to the jobs.create method that specifies the maxRetryCount field and lifecyclePolicies[] subfields.

To create a basic script job that retries failed tasks only for some exit codes, make the following request:

POST https://batch.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/jobs?job_id=JOB_NAME

{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "script": {
              "text": "echo Hello world from task ${BATCH_TASK_INDEX}"
            }
          }
        ],
        "maxRetryCount": MAX_RETRY_COUNT,
        "lifecyclePolicies": [
          {
            "action": "ACTION",
            "actionCondition": {
                "exitCodes": [EXIT_CODES]
            }
          }
        ]
      }
    }
  ],
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}

Replace the following:

  • PROJECT_ID: the project ID of your project.

  • LOCATION: the location of the job.

  • JOB_NAME: the name of the job.

  • MAX_RETRY_COUNT: the maximum number of retries for each task. For a job to be able to retry failed tasks, this value must be an integer between 1 and 10. If the maxRetryCount field is not specified, the default value is 0, which means that failed tasks are not retried.

  • ACTION: the action, either RETRY_TASK or FAIL_TASK, that you want for tasks that fail with the specified exit codes. Tasks that fail with unspecified exit codes take the other action.

  • EXIT_CODES: a comma-separated list of one or more exit codes that you want to trigger the specified action—for example, 50001, 50002.

    Each exit code can be defined by your application or Batch. The exit codes from 50001 to 59999 are reserved by Batch. For more information about the reserved exit codes, see Troubleshooting.

For example, the following job retries tasks only when they fail due to the preemption of Spot VMs, which is indicated by exit code 50001.

POST https://batch.googleapis.com/v1/projects/example-project/locations/us-central1/jobs?job_id=example-job

{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "script": {
              "text": "sleep 30"
            }
          }
        ],
        "maxRetryCount": 3,
        "lifecyclePolicies": [
          {
             "action": "RETRY_TASK",
             "actionCondition": {
               "exitCodes": [50001]
            }
          }
        ]
      }
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        "policy": {
          "machineType": "e2-standard-4",
          "provisioningModel": "SPOT"
        }
      }
    ]
  }
}

Modify task behavior based on the number of retries

Optionally, after you enable automatic retries for a task as described in the previous sections on this page, you can update the task's runnables to use the BATCH_TASK_RETRY_ATTEMPT predefined environment variable, which describes the number of times that the task has already been attempted. Use this variable in your runnables if you want a task to behave differently based on the number of retries. For example, when a task is being retried, you might want to confirm which commands were already successfully executed in the previous attempt. For more information, see Predefined environment variables.
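
For example, the following runnable is a minimal sketch that prints a different message on retries than on a task's first attempt by checking the BATCH_TASK_RETRY_ATTEMPT variable (the script defaults the value to 0 in case the variable is unset):

"runnables": [
  {
    "script": {
      "text": "if [ ${BATCH_TASK_RETRY_ATTEMPT:-0} -gt 0 ]; then echo Task ${BATCH_TASK_INDEX} is on retry attempt ${BATCH_TASK_RETRY_ATTEMPT}.; else echo Task ${BATCH_TASK_INDEX} is running for the first time.; fi"
    }
  }
]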

What's next