This page describes known issues that you might run into while using Batch.
If you need further help using Batch, see the Troubleshooting documentation or Get support.
Timeout logs don't indicate whether a task's or a runnable's timeout was exceeded
When a job fails due to exceeding a timeout, the logs associated with the job don't indicate whether the failure was caused by the relevant task's timeout or the relevant runnable's timeout.
To work around this issue, set different timeout values for tasks and runnables. Then, you can identify whether a failure was caused by exceeding the timeout of the relevant task or runnable by using the following procedure:
Identify the task, runnable, and time of an exceeded-timeout failure.
Find a log that mentions the exceeded-timeout exit code, 50005. This log has a textPayload that's similar to the following message:
Task task/JOB_UID-group0-TASK_INDEX/0/0 runnable RUNNABLE_INDEX...exitCode 50005
From that log, record TASK_INDEX as the failed task, RUNNABLE_INDEX as the failed runnable, and the log's timestamp value as the time of the exceeded-timeout failure.
Identify the start time of the failed task.
Find the status event that mentions the following message:
Task state is updated from ASSIGNED to RUNNING
From that status event, record the eventTime field as the start time of the failed task.
Calculate the failed task's total run time, \({failedTaskRunTime}\), by using the following formula:
\[{failedTaskRunTime}={failureTime}-{failedTaskStartTime}\]
Replace the following values:
- \({failureTime}\): the time of the exceeded-timeout failure.
- \({failedTaskStartTime}\): the start time of the failed task.
Identify the exceeded timeout:
If \({failedTaskRunTime}\) matches the timeout that you configured for the failed task, then that failed task's timeout was exceeded and caused the failure.
Otherwise, the timeout that you configured for the failed runnable was exceeded and caused the failure.
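For example, the following Python sketch shows this comparison. It assumes that you've copied the failure time from the log's timestamp field and the task's start time from the status event's eventTime field; the timestamp, timeout, and tolerance values are placeholders for illustration only.

# A minimal sketch of the comparison described in this procedure.
# Replace the placeholder timestamps and timeouts with the values from
# your logs and your job's configuration.
from datetime import datetime, timezone

failure_time = datetime(2024, 5, 1, 12, 40, 0, tzinfo=timezone.utc)            # the log's timestamp
failed_task_start_time = datetime(2024, 5, 1, 12, 30, 0, tzinfo=timezone.utc)  # the status event's eventTime

failed_task_run_time = (failure_time - failed_task_start_time).total_seconds()

task_timeout_seconds = 600   # the timeout that you configured for the failed task
TOLERANCE_SECONDS = 30       # illustrative allowance for scheduling and logging delays

if abs(failed_task_run_time - task_timeout_seconds) <= TOLERANCE_SECONDS:
    print("The failed task's timeout was exceeded and caused the failure.")
else:
    print("The failed runnable's timeout was exceeded and caused the failure.")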
Jobs consuming reservations might be delayed or prevented
When you try to create and run a job that consumes Compute Engine reservations, Batch might incorrectly delay or prevent the job from running. Specifically, Batch requires your project to have sufficient Compute Engine resource quota even when that quota is already taken up by unconsumed reservations.
Work around the issue
To work around this issue for a job, add a label with the name goog-batch-skip-quota-check and the value true to the job-level labels field. This label causes Batch to skip verifying your project's resource quotas before trying to create the job.
For example, to prevent or resolve this issue for a basic script job that can consume reservations, create and run a job with the following JSON configuration:
{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "script": {
              "text": "echo Hello world from task ${BATCH_TASK_INDEX}"
            }
          }
        ]
      },
      "taskCount": 3
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        VM_RESOURCES
      }
    ]
  },
  "labels": {
    "goog-batch-skip-quota-check": "true"
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
Replace VM_RESOURCES with the VM resources that match the reservation that you want the job to consume.
For more instructions, see Create and run a job that can consume reserved VMs and Define custom labels for the job.
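If you create jobs programmatically instead of with a JSON configuration, the same label applies. The following is a minimal sketch that uses the Python client library (google-cloud-batch); the project, region, job name, and allocation policy details are placeholders that you need to adapt to your reservation.

# A minimal sketch of setting the goog-batch-skip-quota-check label when
# creating a job with the google-cloud-batch Python client library.
# project_id, region, and job_name are placeholders.
from google.cloud import batch_v1

def create_job_skipping_quota_check(project_id: str, region: str, job_name: str) -> batch_v1.Job:
    client = batch_v1.BatchServiceClient()

    runnable = batch_v1.Runnable()
    runnable.script = batch_v1.Runnable.Script()
    runnable.script.text = "echo Hello world from task ${BATCH_TASK_INDEX}"

    task = batch_v1.TaskSpec()
    task.runnables = [runnable]

    group = batch_v1.TaskGroup()
    group.task_count = 3
    group.task_spec = task

    job = batch_v1.Job()
    job.task_groups = [group]
    # Configure job.allocation_policy to match the reservation that you want
    # the job to consume (omitted in this sketch).
    # The job-level label that causes Batch to skip the resource-quota check:
    job.labels = {"goog-batch-skip-quota-check": "true"}
    job.logs_policy = batch_v1.LogsPolicy()
    job.logs_policy.destination = batch_v1.LogsPolicy.Destination.CLOUD_LOGGING

    request = batch_v1.CreateJobRequest()
    request.parent = f"projects/{project_id}/locations/{region}"
    request.job_id = job_name
    request.job = job
    return client.create_job(request)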
Identify the issue
This issue is not indicated by any specific error message. Instead, this issue can happen in the following circumstances:
If your project reserves all of the resources that it has quota for, this issue prevents any job that specifies those resources from running.
For example, suppose your project has the following:
- A maximum quota for H100 GPUs of 16.
- An unconsumed, single-project reservation for 2 a3-highgpu-8g VMs, which reserves 16 H100 GPUs total.
In this scenario, this issue prevents your project from scheduling and running any job that is correctly configured to consume any of the reserved H100 GPUs.
If your project reserves some of the resources that it has quota for, this issue might delay jobs that specify those resources or prevent them from running.
For example, suppose your project has the following:
- A maximum quota for H100 GPUs of 16.
- An unconsumed, single-project reservation for 1 a3-highgpu-8g VM, which reserves 8 H100 GPUs total.
- An a3-highgpu-8g VM that is configured to not consume any reservations and is occasionally deleted and then recreated. (This VM uses 8 unreserved H100 GPUs when it exists.)
In this scenario, this issue allows your project to schedule and start running a job that is correctly configured to consume any of the reserved H100 GPUs only when the unreserved a3-highgpu-8g VM does not exist.
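The exact quota check that Batch performs isn't documented, but the following sketch is a rough model of the accounting implied by the preceding examples. The quota_headroom helper and its numbers are illustrative only; they aren't part of the Batch API.

# Illustrative only: a rough model of why a job can be blocked even though
# the GPUs that it needs are reserved. The numbers match the second example.
def quota_headroom(quota: int, reserved_unconsumed: int, used_unreserved: int) -> int:
    """GPUs that the quota check appears to treat as available for a new job."""
    return quota - reserved_unconsumed - used_unreserved

# Only the unconsumed reservation counts against the 16-GPU quota:
print(quota_headroom(quota=16, reserved_unconsumed=8, used_unreserved=0))  # 8 -> the job can start
# The unreserved a3-highgpu-8g VM exists and uses 8 more GPUs:
print(quota_headroom(quota=16, reserved_unconsumed=8, used_unreserved=8))  # 0 -> the job is blocked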
Jobs might fail when specifying Compute Engine (or custom) VM OS images with outdated kernels
A job might fail if it specifies a Compute Engine VM OS image that does not have the latest kernel version. This issue also affects any custom images based on Compute Engine VM OS images. The Compute Engine public images that cause this issue are not easily identified and are subject to change at any time.
This issue is not indicated by a specific error message. Instead, consider this issue if you have a job that fails unexpectedly and specifies a Compute Engine VM OS image or similar custom image.
To prevent or resolve this issue, you can do the following:
- Whenever possible, use Batch images or custom images based on Batch images, which aren't affected by this issue (see the sketch after this list).
- If you can't use a Batch image, try the latest version of your preferred Compute Engine image. Generally, newer versions of Compute Engine images are more likely to have the latest kernel version than older versions.
- If the latest version of a specific image doesn't work, you might need to try a different OS or create a custom image. For example, if the latest version of Debian 11 doesn't work, you can try to create a custom image from a Compute Engine VM that runs Debian 11 and that you've updated to use the latest kernel version.
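To follow the first recommendation programmatically, the following sketch shows one way to pin a job's VMs to a Batch image by using the Python client library (google-cloud-batch). The machine type is a placeholder, and the batch-debian value is an assumption based on the Batch image families; see the OS environment overview for the image values that apply to your case.

# A minimal sketch of selecting a Batch-provided OS image for a job's VMs
# with the google-cloud-batch Python client library. The machine type and
# image value are placeholders to adapt to your job.
from google.cloud import batch_v1

boot_disk = batch_v1.AllocationPolicy.Disk()
boot_disk.image = "batch-debian"       # a Batch image family, not a raw Compute Engine image (assumption)

policy = batch_v1.AllocationPolicy.InstancePolicy()
policy.machine_type = "e2-standard-4"  # placeholder machine type
policy.boot_disk = boot_disk

instances = batch_v1.AllocationPolicy.InstancePolicyOrTemplate()
instances.policy = policy

allocation_policy = batch_v1.AllocationPolicy()
allocation_policy.instances = [instances]
# Assign allocation_policy to your job as job.allocation_policy before creating it.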
This issue is caused by an outdated kernel version in the VM OS image. When a job specifies any VM OS image that is not from Batch or based on a Batch image, Batch installs required packages on the job's VMs after they start. The required packages can vary between jobs and change over time, and they might require your VM OS image to have the latest kernel version. The issue occurs when updating the kernel version requires the VM to reboot, which causes the package installation, and consequently the job, to fail.
For more information about VM OS images, see Overview of the OS environment for a job's VMs.
Jobs using GPUs and VM OS images with outdated kernels might fail only when automatically installing drivers
This issue is closely related to Jobs might fail when specifying Compute Engine (or custom) VM OS images with outdated kernels. Specifically, jobs that both specify a Compute Engine (or custom) VM OS image without the latest kernel and use GPUs might fail only if you try to install GPU drivers automatically. For these jobs, you might also resolve the failures just by installing GPU drivers manually.
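As a rough sketch of the difference between the two approaches, the following shows the automatic-installation toggle as exposed by the Python client library (google-cloud-batch). The machine type is a placeholder, and disabling the toggle means that your job or image must provide the GPU drivers itself.

# A minimal sketch of the GPU driver installation toggle with the
# google-cloud-batch Python client library. The machine type is a placeholder.
from google.cloud import batch_v1

instances = batch_v1.AllocationPolicy.InstancePolicyOrTemplate()
instances.policy = batch_v1.AllocationPolicy.InstancePolicy(
    machine_type="a2-highgpu-1g",  # placeholder GPU machine type
)
# Automatic installation (True) can fail on VM OS images with outdated kernels.
# Set this to False and install the GPU drivers manually, for example in a
# runnable of the job or in a custom image.
instances.install_gpu_drivers = False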
For more information about GPUs, see Create and run a job that uses GPUs.