Troubleshooting

This page shows you how to resolve issues with Batch.

If you are troubleshooting a job that doesn't have an obvious error message, check whether the job's history contains any error messages by viewing status events before reviewing this document.

For more information about troubleshooting a job, also see the following documents:

Job creation errors

If you can't create a job, it might be due to one of the errors in this section.

Insufficient quota

Issue

One of the following issues occurs when you try to create a job:

  • When the job is in the QUEUED state, the following issue appears in the statusEvents field:

    Quota checking process decides to delay scheduling for the job JOB_UID due to inadequate quotas [Quota: QUOTA_NAME, limit: QUOTA_LIMIT, usage: QUOTA_CURRENT_USAGE, wanted: WANTED_QUOTA.].
    

    This issue indicates that the job has been delayed because the current usage (QUOTA_CURRENT_USAGE) and limit (QUOTA_LIMIT) of the QUOTA_NAME quota cannot accommodate the job's requested usage (WANTED_QUOTA).

  • When the job is in the QUEUED, SCHEDULED, or FAILED state, one of the following issues appears in the statusEvents field:

    RESOURCE_NAME creation failed:
    Quota QUOTA_NAME exceeded. Limit: QUOTA_LIMIT in region REGION
    
    RESOURCE_NAME creation failed:
    Quota QUOTA_NAME exceeded. Limit: QUOTA_LIMIT in zone ZONE
    

    This issue indicates that creating a resource failed because the request exceeded your QUOTA_NAME quota, which has a limit of QUOTA_LIMIT in the specified location.

Solution

To resolve the issue, do the following:

  • If the job was delayed, wait for more quota to become available as other usage is released.

  • If the job failed due to insufficient quota or if these delays persist, try to prevent insufficient quota by doing any of the following:

    • Create jobs that use less of that quota or a different quota. For example, specify a different allowed location or resource type for the job, or split your quota usage across additional projects.

    • Request a higher quota limit for your project from Google Cloud.

For more information, see Batch quotas and limits and Work with quotas.
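As a sketch, you can review a region's current Compute Engine quota usage and limits with the gcloud CLI; the region name below is an example, so substitute a location where your job can run:

```shell
# Show usage and limits for each Compute Engine quota in a region.
# Replace us-central1 with the region where your job runs.
gcloud compute regions describe us-central1 \
    --flatten="quotas[]" \
    --format="table(quotas.metric,quotas.usage,quotas.limit)"
```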

Insufficient permissions to act as the service account

Issue

The following issue occurs when you try to create a job:

  • If the job does not use an instance template, the issue appears as the following:

    caller does not have access to act as the specified service account: SERVICE_ACCOUNT_NAME
    
  • If the job uses an instance template, the issue appears as the following:

    Error: code - CODE_SERVICE_ACCOUNT_MISMATCH, description - The service account specified in the instance template INSTANCE_TEMPLATE_SERVICE_ACCOUNT doesn't match the service account specified in the job JOB_SERVICE_ACCOUNT for JOB_UID, project PROJECT_NUMBER
    

This issue usually occurs because the user creating the job does not have sufficient permissions to act as the service account used by the job, which is controlled by the iam.serviceAccounts.actAs permission.

Solution

To resolve the issue, do the following:

  1. If the job uses an instance template, verify that the service account specified in the instance template matches the service account specified in the job's definition.
  2. Make sure that the user who is creating the job has been granted the Service Account User role (roles/iam.serviceAccountUser) on the service account specified for the job. For more information, see Manage access.
  3. Recreate the job.
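For example, assuming the job's service account is SERVICE_ACCOUNT_EMAIL and the user creating the job is USER_EMAIL (both placeholders), you can grant the role with a command like the following:

```shell
# Grant the user permission to act as the job's service account.
# Replace SERVICE_ACCOUNT_EMAIL and USER_EMAIL with your own values.
gcloud iam service-accounts add-iam-policy-binding SERVICE_ACCOUNT_EMAIL \
    --member="user:USER_EMAIL" \
    --role="roles/iam.serviceAccountUser"
```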

Repeated networks

Issue

The following issue occurs when you try to create a job:

Networks must be distinct for NICs in the same InstanceTemplate

This issue occurs because you specified the network for a job more than once.

Solution

To resolve the issue, recreate the job and specify each network only once.

For more information, see Specify the network for a job.
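For example, a job definition that specifies the network once, in the allocationPolicy field, might look like the following sketch; the project, network, and subnetwork names are placeholders:

```json
{
  "allocationPolicy": {
    "network": {
      "networkInterfaces": [
        {
          "network": "projects/PROJECT_ID/global/networks/NETWORK_NAME",
          "subnetwork": "projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME"
        }
      ]
    }
  }
}
```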

Invalid network for VPC Service Controls

Issue

The following issue occurs when you try to create a job:

no_external_ip_address field is invalid. VPC Service Controls is enabled for the project, so external ip address must be disabled for the job. Please set no_external_ip_address field to be true

Solution

This issue occurs because you are attempting to create and run a job with VMs that have external IP addresses in a VPC Service Controls service perimeter.

To resolve the issue, create a job that blocks external access for all VMs.

For more information about how to configure networking for a job in a VPC Service Controls service perimeter, see Use VPC Service Controls with Batch.
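For example, a job that blocks external access for all VMs sets noExternalIpAddress to true in each network interface, as in the following sketch; the project, network, and subnetwork names are placeholders:

```json
{
  "allocationPolicy": {
    "network": {
      "networkInterfaces": [
        {
          "network": "projects/PROJECT_ID/global/networks/NETWORK_NAME",
          "subnetwork": "projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME",
          "noExternalIpAddress": true
        }
      ]
    }
  }
}
```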

Job failure errors

If you have issues with a job that is not running correctly or failed for unclear reasons, it might be due to one of the errors in this section or one of the exit codes in the following Task failure exit codes section.

No logs in Cloud Logging

Issue

You need to debug a job, but no logs appear for the job in Cloud Logging.

This issue often occurs for the following reasons:

  • The Cloud Logging API is not enabled for your project. Even if you correctly configure everything else for a job's logs, it won't produce logs if the service is not enabled for your project.
  • The job's service account does not have permission to write logs. A job can't produce logs without sufficient permissions.
  • The job was not configured to produce logs. To produce logs in Cloud Logging, a job needs to have Cloud Logging enabled. The job's runnables should also be configured to write any information that you want to appear in logs to the standard output (stdout) and standard error (stderr) streams. For more information, see Analyze a job by using logs.
  • Tasks did not run. Logs cannot be produced until tasks have been assigned resources and start running.
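For example, a job that is configured to produce logs in Cloud Logging sets the logsPolicy field as in the following sketch:

```json
{
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
```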

Solution

To resolve this issue, do the following:

  1. Make sure that the Cloud Logging API is enabled for your project.
  2. Make sure the service account for the job has the Logs Writer (roles/logging.logWriter) IAM role. For more information, see Enable Batch for a project.
  3. View the details of the job using the gcloud CLI or Batch API. The job details can help you understand why the job did not produce logs and might provide information that you hoped to get from logs. For example, do the following:
    1. To verify that logging is enabled, review the job's logsPolicy field.
    2. To verify that the job ran successfully, review the job's status field.
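The first and third steps above can be sketched with the gcloud CLI as follows; the job name and location are placeholders:

```shell
# Enable the Cloud Logging API for the current project.
gcloud services enable logging.googleapis.com

# View the job's details, including its logsPolicy and status fields.
# Replace JOB_NAME and LOCATION with your own values.
gcloud batch jobs describe JOB_NAME --location=LOCATION
```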

No service agent reporting

Issue

The following issue appears in the statusEvents field for a job that is not running properly or failed before VMs were created:

No VM has agent reporting correctly within time window NUMBER_OF_SECONDS seconds, VM state for instance VM_NAME is TIMESTAMP,agent,start

The issue indicates that none of a job's VMs are reporting to the Batch service agent.

This issue often occurs for the following reasons:

  • The job's VMs do not have sufficient permissions. A job's VMs require specific permissions to report their state to the Batch service agent. You can provide these permissions for a job's VMs by granting the Batch Agent Reporter role (roles/batch.agentReporter) to the job's service account.
  • The job's VMs have network issues. A job's VMs require network access to communicate with the Batch service agent.
  • The job's VMs are using an outdated Batch VM OS image or a VM OS image with outdated Batch service agent software. The job's VMs require software in their VM OS image that provides the current dependencies for reporting to the Batch service agent.

Solution

To resolve the issue, do the following:

  1. Verify that the job's VMs have the permissions required to report their state to the Batch service agent.

    1. To identify the job's service account, view the details of the job using the gcloud CLI or Batch API. If no service account is listed, the job uses the Compute Engine default service account by default.
    2. Confirm that the job's service account has permissions for the Batch Agent Reporter role (roles/batch.agentReporter). For more information, see Manage access and Restricting service account usage.

      For example, to grant the Compute Engine default service account the required permissions, use the following command:

      gcloud projects add-iam-policy-binding PROJECT_ID \
        --role roles/batch.agentReporter \
        --member serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com
      

      Replace PROJECT_ID with your project ID and PROJECT_NUMBER with your project number.

  2. Verify that the job's VMs have proper network access. For more information, see Batch networking overview and Troubleshoot common networking issues.

  3. If you specified the VM OS image for the job, verify that the VM OS image is currently supported.

    1. If you enabled Cloud Logging for the job, you can identify this issue by checking for any of the following agent logs (batch_agent_logs). For more information, see Analyze a job using logs.

      • Log for outdated Batch service agent software error:

        rpc error: code = FailedPrecondition, desc = Invalid resource state for BATCH_AGENT_VERSION: outdated Batch agent version used.
        

        The BATCH_AGENT_VERSION is the version of software for communicating with the Batch service agent that the job uses—for example, cloud-batch-agent_20221103.00_p00.

      • Log for outdated Batch VM OS image error:

        rpc error: code = FailedPrecondition, desc = Invalid resource state for BATCH_VM_OS_IMAGE_NAME: outdated Batch image version.
        

        The BATCH_VM_OS_IMAGE_NAME is the specific version of a VM OS image from Batch that the job uses—for example, batch-debian-11-20220909-00-p00.

    2. You can resolve this issue by using a newer VM OS image. If the job uses a custom image, recreate the custom image based on the latest version of a supported public image.

      For more information, see Supported VM OS images and View VM OS images.

  4. Recreate the job.
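As a sketch, the image check in step 3 can use the gcloud CLI to list the VM OS images that Batch currently provides:

```shell
# List the VM OS images provided by Batch, which are stored in the
# batch-custom-image images project.
gcloud compute images list --project=batch-custom-image
```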

Constraint violated for VM external IP addresses

Issue

The following issue appears in the statusEvents field for a failed job:

Instance VM_NAME creation failed: Constraint constraints/compute.vmExternalIpAccess violated for project PROJECT_NUMBER.
Add instance VM_NAME to the constraint to use external IP with it.

This issue occurs because your project, folder, or organization has set the compute.vmExternalIpAccess organizational policy constraint so that only allowlisted VMs can use external IP addresses.

Solution

To resolve the issue, recreate the job and either block external access for all VMs or ask your administrator to add the job's VMs to the allowlist for the constraint.

Constraint violated for trusted images

Issue

The following issue appears in the statusEvents field for a failed job:

Instance VM_NAME creation failed: Constraint constraints/compute.trustedImageProjects violated for project PROJECT_ID. Use of images from project batch-custom-image is prohibited.

Solution

This issue occurs because your project has set the trusted images (compute.trustedImageProjects) policy constraint so that images from Batch, which are in the batch-custom-image images project, are not allowed.

To resolve the issue, do at least one of the following:

  • Recreate the job to specify a VM OS image that is already allowed by the trusted images policy constraint.
  • Ask your administrator to modify the trusted images policy constraint to allow VM OS images from the batch-custom-image images project. For instructions, see Control access to VM OS images for Batch.

Job failed while using an instance template

Issue

The following issue appears in the statusEvents field for a failed job that uses an instance template:

INVALID_FIELD_VALUE,BACKEND_ERROR

This issue occurs due to an unspecified problem with the job's instance template.

Solution

To debug the issue further, do the following:

  1. Create a MIG using the instance template and observe if errors occur with more details.
  2. Optional: To try to find more information, view the long-running operation that is creating the MIG in the Google Cloud console.

    Go to Compute Engine Operations

Task failure exit codes

When a specific task in a job fails, the task returns a nonzero exit code. Depending on how you configure the ignoreExitStatus field, a failed task might or might not cause a job to fail.

In addition to any exit codes that you define in a runnable, Batch has several reserved exit codes, including the following.
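For example, the following sketch of a runnable sets ignoreExitStatus so that a nonzero exit code from the runnable doesn't fail the task:

```json
{
  "script": {
    "text": "exit 1"
  },
  "ignoreExitStatus": true
}
```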

VM preemption (50001)

Issue

The following issue appears in the statusEvents field for a job:

Task state is updated from PRE-STATE to FAILED on zones/ZONE/instances/INSTANCE_ID due to Spot Preemption with exit code 50001.

This issue occurs when a Spot VM for the job is preempted during run time.

Solution

To resolve the issue, do one of the following:

  • Retry the task either by using automated task retries or manually re-running the job.
  • To guarantee there is no preemption, use VMs with the standard provisioning model instead.
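For example, the following job-definition sketch automates task retries specifically for the 50001 preemption exit code; the field values are illustrative:

```json
{
  "taskGroups": [
    {
      "taskSpec": {
        "maxRetryCount": 3,
        "lifecyclePolicies": [
          {
            "action": "RETRY_TASK",
            "actionCondition": {
              "exitCodes": [50001]
            }
          }
        ]
      }
    }
  ]
}
```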

VM reporting timeout (50002)

Issue

The following issue appears in the statusEvents field for a job:

Task state is updated from PRE-STATE to FAILED on zones/ZONE/instances/INSTANCE_ID due to Batch no longer receives VM updates with exit code 50002.

This issue occurs when a backend timeout is reached because Batch is no longer receiving updates from a VM for the job.

Solution

To resolve this issue, retry the task either by using automated task retries or manually re-running the job.

VM rebooted during execution (50003)

Issue

The following issue appears in the statusEvents field for a job:

Task state is updated from PRE-STATE to FAILED on zones/ZONE/instances/INSTANCE_ID due to VM is rebooted during task execution with exit code 50003.

This issue occurs when a VM for a job unexpectedly reboots during run time.

Solution

To resolve this issue, retry the task either by using automated task retries or manually re-running the job.

VM and task are unresponsive (50004)

Issue

The following issue appears in the statusEvents field for a job:

Task state is updated from PRE-STATE to FAILED on zones/ZONE/instances/INSTANCE_ID due to tasks cannot be canceled with exit code 50004.

This issue occurs when a task reaches the unresponsive time limit and cannot be canceled.

Solution

To resolve this issue, retry the task either by using automated task retries or manually re-running the job.

Task runs over the maximum runtime (50005)

Issue

The following issue appears in the statusEvents field for a job:

Task state is updated from PRE-STATE to FAILED on zones/ZONE/instances/INSTANCE_ID due to task runs over the maximum runtime with exit code 50005.

This issue occurs in the following cases:

  • A task runs longer than the time limit specified in the job's maxRunDuration field.
  • A runnable runs longer than the time limit specified in its timeout field.

To identify specifically which time limit was exceeded, view logs for the job and find a log that mentions the 50005 exit code. The textPayload field of this log indicates where and when the time limit was exceeded.

Solution

To resolve the issue, attempt to verify the total run time required by the task or runnable that exceeded the time limit. Then, do one of the following:

  • If you only occasionally expect this error, such as for a task or runnable with an inconsistent run time, you can try to recreate the job and configure it to automate task retries to try to increase the success rate.

  • Otherwise, if the task or runnable consistently and intentionally needs more time to finish running than the current timeout allows, set a longer timeout.
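For example, the following sketch sets a longer timeout for both a task and one of its runnables; the durations and script are illustrative:

```json
{
  "taskGroups": [
    {
      "taskSpec": {
        "maxRunDuration": "7200s",
        "runnables": [
          {
            "script": {
              "text": "sleep 30"
            },
            "timeout": "3600s"
          }
        ]
      }
    }
  ]
}
```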

VM recreated during execution (50006)

Issue

The following issue appears in the statusEvents field for a job:

Task state is updated from PRE-STATE to FAILED on zones/ZONE/instances/INSTANCE_ID due to VM is recreated during task execution with exit code 50006.

This issue occurs when a VM for a job is unexpectedly recreated during run time.

Solution

To resolve this issue, retry the task either by using automated task retries or manually re-running the job.

What's next