Known issues

This page describes known issues that you might run into while using Batch.

If you need further help using Batch, see the Troubleshooting documentation or Get support.

Timeout logs don't indicate whether a task's or a runnable's timeout was exceeded

When a job fails due to exceeding a timeout, the logs associated with the job don't indicate whether the failure was caused by the relevant task's timeout or the relevant runnable's timeout.

To work around this issue, set different timeout values for tasks and runnables. Then, you can use the following procedure to determine whether a failure was caused by exceeding the task's timeout or the runnable's timeout:

  1. Identify the task, runnable, and time of an exceeded-timeout failure.

    1. View logs for the job.

    2. Find a log that mentions the exceeded-timeout exit code, 50005. This log has a textPayload that's similar to the following message:

      Task task/JOB_UID-group0-TASK_INDEX/0/0 runnable RUNNABLE_INDEX...exitCode 50005
      

      From that log, record TASK_INDEX as the failed task, RUNNABLE_INDEX as the failed runnable, and the log's timestamp value as the time of the exceeded-timeout failure.

  2. Identify the start time of the failed task.

    1. View the status events of the failed task.

    2. Find the status event that mentions the following message:

      Task state is updated from ASSIGNED to RUNNING
      

      From that status event, record the eventTime field as the start time of the failed task.

  3. Calculate the failed task's total run time, \({failedTaskRunTime}\), by using the following formula:

    \[{failedTaskRunTime}={failureTime}-{failedTaskStartTime}\]

    Replace the following values:

    • \({failureTime}\): the time of the exceeded-timeout failure.
    • \({failedTaskStartTime}\): the start time of the failed task.
  4. Identify the exceeded timeout, as illustrated in the sketch after this procedure:

    • If \({failedTaskRunTime}\) matches the timeout that you configured for the failed task, then that failed task's timeout was exceeded and caused the failure.

    • Otherwise, the timeout that you configured for the failed runnable was exceeded and caused the failure.
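
If you prefer to script steps 3 and 4, the comparison looks like the following minimal Python sketch. The timestamps and timeout values are illustrative placeholders: copy the real timestamps from the log entry and the status event, and use the timeout values that you configured for the task (its maxRunDuration field) and for the runnable (its timeout field).

# Minimal sketch of steps 3 and 4. All values below are placeholders.
from datetime import datetime

failure_time = datetime.fromisoformat("2024-05-01T12:20:10+00:00")     # log's timestamp
task_start_time = datetime.fromisoformat("2024-05-01T12:00:10+00:00")  # status event's eventTime

task_timeout_s = 1200     # timeout that you configured for the failed task
runnable_timeout_s = 900  # timeout that you configured for the failed runnable

failed_task_run_time_s = (failure_time - task_start_time).total_seconds()

# Allow some slack, because the failure is logged slightly after the
# timeout is enforced.
slack_s = 30
if abs(failed_task_run_time_s - task_timeout_s) <= slack_s:
    print("The failed task's timeout was exceeded.")
else:
    print("The failed runnable's timeout was exceeded.")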

Jobs consuming reservations might be delayed or prevented from running

When you try to create and run a job that consumes Compute Engine reservations, Batch might incorrectly delay or prevent the job from running. Specifically, Batch requires projects to have sufficient Compute Engine resource quota even when that quota is already counted against unconsumed reservations.

Work around the issue

To work around this issue for a job, add a label named goog-batch-skip-quota-check with the value true to the job-level labels field. This label causes Batch to skip verifying your project's resource quotas before trying to create the job.

For example, to prevent or resolve this issue for a basic script job that can consume reservations, create and run a job with the following JSON configuration:

{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "script": {
              "text": "echo Hello world from task ${BATCH_TASK_INDEX}"
            }
          }
        ]
      },
      "taskCount": 3
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        VM_RESOURCES
      }
    ]
  },
  "labels": {
    "goog-batch-skip-quota-check": "true"
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}

Replace VM_RESOURCES with the VM resources that match the reservation that you want the job to consume.

For more instructions, see Create and run a job that can consume reserved VMs and Define custom labels for the job.

Identify the issue

This issue is not indicated by any specific error message. Instead, this issue can happen in the following circumstances (modeled in the sketch after this list):

  • If your project reserves all of the resources that it has quota for, this issue prevents any jobs that specify those resources from running.

    For example, suppose your project has the following:

    • A maximum quota for H100 GPUs of 16.
    • An unconsumed, single-project reservation for 2 a3-highgpu-8g VMs, which reserves 16 H100 GPUs total.

    In this scenario, this issue prevents your project from scheduling and running any job that is correctly configured to consume any of the reserved H100 GPUs.

  • If your project reserves some of the resources that it has quota for, this issue might prevent or delay jobs that specify those resources.

    For example, suppose your project has the following:

    • A maximum quota for H100 GPUs of 16.
    • An unconsumed, single-project reservation for 1 a3-highgpu-8g VM, which reserves 8 H100 GPUs total.
    • An a3-highgpu-8g VM that is configured not to consume any reservations and is occasionally deleted and then recreated. (This VM uses 8 unreserved H100 GPUs while it exists.)

    In this scenario, this issue allows your project to schedule and start running a job that is correctly configured to consume the reserved H100 GPUs only while the a3-highgpu-8g VM does not exist.
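
The following minimal Python sketch models the faulty quota check described in this section; it is an illustration, not Batch's actual implementation. The numbers match the two H100 examples above.

# Model of the incorrect quota check: unconsumed reservations are counted
# against quota, even for jobs that would consume those very reservations.
def job_passes_quota_check(gpu_quota, reserved_unconsumed_gpus, in_use_gpus, requested_gpus):
    apparently_available = gpu_quota - reserved_unconsumed_gpus - in_use_gpus
    return requested_gpus <= apparently_available

# First scenario: quota 16, all 16 GPUs reserved -> every job is blocked.
print(job_passes_quota_check(16, 16, 0, 8))  # False: job is prevented

# Second scenario: quota 16, 8 GPUs reserved, plus an unreserved VM that
# uses 8 GPUs while it exists.
print(job_passes_quota_check(16, 8, 8, 8))   # False: VM exists, job is delayed
print(job_passes_quota_check(16, 8, 0, 8))   # True: VM deleted, job can run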

Jobs might fail when specifying Compute Engine (or custom) VM OS images with outdated kernels

A job might fail if it specifies a Compute Engine VM OS image that doesn't have the latest kernel version. This issue also affects any custom images based on Compute Engine VM OS images. The Compute Engine public images that cause this issue are not easily identified and are subject to change at any time.

This issue is not indicated by a specific error message. Instead, consider this issue if you have a job that fails unexpectedly and specifies a Compute Engine VM OS image or similar custom image.

To prevent or resolve this issue, you can do the following:

  1. Whenever possible, use Batch images or custom images based on Batch images, which aren't affected by this issue.
  2. If you can't use a Batch image, try the latest version of your preferred Compute Engine image. Generally, newer versions of Compute Engine images are more likely to have the latest kernel version than older versions.
  3. If the latest version of a specific image doesn't work, you might need to try a different OS or create a custom image. For example, if the latest version of Debian 11 doesn't work, you can try to create a custom image from a Compute Engine VM that runs Debian 11 and that you've updated to use the latest kernel version, as shown in the sketch after this list.
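
For example, the following sketch uses the google-cloud-compute Python client library to create a custom image from the boot disk of a VM that you've already updated to the latest kernel and stopped. The project, zone, and disk names are placeholders; adapt them to your environment.

# Sketch: create a custom image from the boot disk of an updated, stopped VM.
from google.cloud import compute_v1

project = "my-project"                 # placeholder
zone = "us-central1-a"                 # placeholder
source_disk = "updated-debian-11-vm"   # boot disk of the updated, stopped VM

image = compute_v1.Image()
image.name = "debian-11-latest-kernel"
image.source_disk = f"projects/{project}/zones/{zone}/disks/{source_disk}"

client = compute_v1.ImagesClient()
operation = client.insert(project=project, image_resource=image)
operation.result()  # wait for the image to be created
print(f"Created custom image: {image.name}")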

This issue is caused by an outdated kernel version in the VM OS image, which can force the VM to reboot. When a job specifies any VM OS image that is not from Batch or based on a Batch image, Batch installs required packages on the job's VMs after they start. The required packages vary between jobs and change over time, and they might require the VM OS image to have the latest kernel version. The issue appears when updating the kernel version requires the VM to reboot, which causes the package installation, and consequently the job, to fail.

For more information about VM OS images, see Overview of the OS environment for a job's VMs.

Jobs using GPUs and VM OS images with outdated kernels might fail only when automatically installing drivers

This issue is closely related to Jobs might fail when specifying Compute Engine (or custom) VM OS images with outdated kernels. Specifically, jobs that both specify a Compute Engine (or custom) VM OS image without the latest kernel and use GPUs might fail only when GPU drivers are installed automatically. For these jobs, you might be able to resolve the failures by installing GPU drivers manually instead.
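
For example, assuming the google-cloud-batch Python client library, a sketch like the following turns off automatic driver installation (the install_gpu_drivers field) and installs drivers in a runnable instead. The machine type, GPU type, and installer command are illustrative placeholders, not a verified driver-installation procedure.

# Sketch: a GPU job that installs drivers manually instead of automatically.
from google.cloud import batch_v1

def create_gpu_job_with_manual_drivers(project: str, region: str, job_id: str) -> batch_v1.Job:
    client = batch_v1.BatchServiceClient()

    # First runnable: install GPU drivers manually (placeholder command).
    install_drivers = batch_v1.Runnable()
    install_drivers.script = batch_v1.Runnable.Script(
        text="sudo ./install_gpu_driver.sh  # placeholder for your installer"
    )

    # Second runnable: the actual workload.
    workload = batch_v1.Runnable()
    workload.script = batch_v1.Runnable.Script(text="nvidia-smi")

    task = batch_v1.TaskSpec(runnables=[install_drivers, workload])
    group = batch_v1.TaskGroup(task_spec=task, task_count=1)

    accelerator = batch_v1.AllocationPolicy.Accelerator(
        type_="nvidia-tesla-t4", count=1  # placeholder GPU type
    )
    policy = batch_v1.AllocationPolicy.InstancePolicy(
        machine_type="n1-standard-4", accelerators=[accelerator]
    )
    instances = batch_v1.AllocationPolicy.InstancePolicyOrTemplate(
        policy=policy,
        install_gpu_drivers=False,  # skip automatic driver installation
    )
    allocation = batch_v1.AllocationPolicy(instances=[instances])

    job = batch_v1.Job(task_groups=[group], allocation_policy=allocation)
    return client.create_job(
        parent=f"projects/{project}/locations/{region}", job=job, job_id=job_id
    )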

For more information about GPUs, see Create and run a job that uses GPUs.