Troubleshooting

This page shows you how to resolve issues with Batch.

If you are troubleshooting a job that doesn't have an obvious error message, check whether the job's history contains any error messages by viewing status events before reviewing this document.

For more information about troubleshooting a job, also see the following documents:

Job creation errors

If you can't create a job, it might be due to one of the errors in this section.

Insufficient quota

Issue

One of the following issues occurs when you try to create a job:

  • When the job is in the QUEUED state, the following issue appears in the statusEvents field:

    Quota checking process decides to delay scheduling for the job JOB_UID due to inadequate quotas [Quota: QUOTA_NAME, limit: QUOTA_LIMIT, usage: QUOTA_CURRENT_USAGE, wanted: WANTED_QUOTA.].
    

    This issue indicates that the job has been delayed because the current usage (QUOTA_CURRENT_USAGE) and limit (QUOTA_LIMIT) of the QUOTA_NAME quota cannot accommodate the job's requested usage (WANTED_QUOTA).

  • When the job is in the QUEUED, SCHEDULED, or FAILED state, one of the following issues appears in the statusEvents field:

    RESOURCE_NAME creation failed:
    Quota QUOTA_NAME exceeded. Limit: QUOTA_LIMIT in region REGION
    
    RESOURCE_NAME creation failed:
    Quota QUOTA_NAME exceeded. Limit: QUOTA_LIMIT in zone ZONE
    

    This issue indicates that creating a resource failed because the request exceeded your QUOTA_NAME quota, which has a limit of QUOTA_LIMIT in the specified location.

Solution

To resolve the issue, do the following:

  • If the job was delayed, wait for more quota to become available as other usage is released.

  • If the job failed due to insufficient quota or if these delays persist, try to prevent insufficient quota by doing any of the following:

    • Create jobs that use less of that quota or a different quota. For example, specify a different allowed location or resource type for the job, or split your quota usage across additional projects.

    • Request a higher quota limit for your project from Google Cloud.

For more information, see Batch quotas and limits and Work with quotas.
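As a sketch, you can review a region's current Compute Engine quota usage and limits with the gcloud CLI; the region name below is an example, so substitute a location where your job can run:

```shell
# Show usage and limits for each Compute Engine quota in a region.
# Replace us-central1 with the region where your job runs.
gcloud compute regions describe us-central1 \
    --flatten="quotas[]" \
    --format="table(quotas.metric,quotas.usage,quotas.limit)"
```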

Insufficient permissions to act as the service account

Issue

The following issue occurs when you try to create a job:

  • If the job does not use an instance template, the issue appears as the following:

    caller does not have access to act as the specified service account: SERVICE_ACCOUNT_NAME
    
  • If the job uses an instance template, the issue appears as the following:

    Error: code - CODE_SERVICE_ACCOUNT_MISMATCH, description - The service account specified in the instance template INSTANCE_TEMPLATE_SERVICE_ACCOUNT doesn't match the service account specified in the job JOB_SERVICE_ACCOUNT for JOB_UID, project PROJECT_NUMBER
    

This issue usually occurs because the user creating the job does not have sufficient permissions to act as the service account used by the job, which is controlled by the iam.serviceAccounts.actAs permission.

Solution

To resolve the issue, do the following:

  1. If the job uses an instance template, verify that the service account specified in the instance template matches the service account specified in the job's definition.
  2. Make sure that the user who is creating the job has been granted the Service Account User role (roles/iam.serviceAccountUser) on the service account specified for the job. For more information, see Manage access.
  3. Recreate the job.
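For example, assuming the job's service account is SERVICE_ACCOUNT_EMAIL and the user creating the job is USER_EMAIL (both placeholders), you can grant the role with a command like the following:

```shell
# Grant the user permission to act as the job's service account.
# Replace SERVICE_ACCOUNT_EMAIL and USER_EMAIL with your own values.
gcloud iam service-accounts add-iam-policy-binding SERVICE_ACCOUNT_EMAIL \
    --member="user:USER_EMAIL" \
    --role="roles/iam.serviceAccountUser"
```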

Repeated networks

Issue

The following issue occurs when you try to create a job:

Networks must be distinct for NICs in the same InstanceTemplate

This issue occurs because you specified the network for a job more than once.

Solution

To resolve the issue, recreate the job and specify each network only once.

For more information, see Specify the network for a job.
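For example, a job definition that specifies the network once, in the allocationPolicy field, might look like the following sketch; the project, network, and subnetwork names are placeholders:

```json
{
  "allocationPolicy": {
    "network": {
      "networkInterfaces": [
        {
          "network": "projects/PROJECT_ID/global/networks/NETWORK_NAME",
          "subnetwork": "projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME"
        }
      ]
    }
  }
}
```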

Invalid network for VPC Service Controls

Issue

The following issue occurs when you try to create a job:

no_external_ip_address field is invalid. VPC Service Controls is enabled for the project, so external ip address must be disabled for the job. Please set no_external_ip_address field to be true

Solution

This issue occurs because you are attempting to create and run a job with VMs that have external IP addresses in a VPC Service Controls service perimeter.

To resolve the issue, create a job that blocks external access for all VMs.

For more information about how to configure networking for a job in a VPC Service Controls service perimeter, see Use VPC Service Controls with Batch.
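For example, a job that blocks external access for all VMs sets noExternalIpAddress to true in each network interface, as in the following sketch; the project, network, and subnetwork names are placeholders:

```json
{
  "allocationPolicy": {
    "network": {
      "networkInterfaces": [
        {
          "network": "projects/PROJECT_ID/global/networks/NETWORK_NAME",
          "subnetwork": "projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME",
          "noExternalIpAddress": true
        }
      ]
    }
  }
}
```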

Job failure errors

If you have issues with a job that is not running correctly or failed for unclear reasons, it might be due to one of the errors in this section or one of the exit codes in the following Task failure exit codes section.

No logs in Cloud Logging

Issue

You need to debug a job, but no logs appear for the job in Cloud Logging.

This issue often occurs for the following reasons:

  • The Cloud Logging API is not enabled for your project. Even if you correctly configure everything else for a job's logs, it won't produce logs if the service is not enabled for your project.
  • The job's service account does not have permission to write logs. A job can't produce logs without sufficient permissions.
  • The job was not configured to produce logs. To produce logs in Cloud Logging, a job needs to have Cloud Logging enabled. The job's runnables should also be configured to write any information that you want to appear in logs to the standard output (stdout) and standard error (stderr) streams. For more information, see Analyze a job by using logs.
  • Tasks did not run. Logs cannot be produced until tasks have been assigned resources and start running.
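For example, a job that is configured to produce logs in Cloud Logging sets the logsPolicy field as in the following sketch:

```json
{
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
```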

Solution

To resolve this issue, do the following:

  1. Make sure that the Cloud Logging API is enabled for your project.
  2. Make sure the service account for the job has the Logs Writer (roles/logging.logWriter) IAM role. For more information, see Enable Batch for a project.
  3. View the details of the job using the gcloud CLI or Batch API. The job details can help you understand why the job did not produce logs and might provide information that you hoped to get from logs. For example, do the following:
    1. To verify that logging is enabled, review the job's logsPolicy field.
    2. To verify that the job ran successfully, review the job's status field.
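The first and third steps above can be sketched with the gcloud CLI as follows; the job name and location are placeholders:

```shell
# Enable the Cloud Logging API for the current project.
gcloud services enable logging.googleapis.com

# View the job's details, including its logsPolicy and status fields.
# Replace JOB_NAME and LOCATION with your own values.
gcloud batch jobs describe JOB_NAME --location=LOCATION
```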

No service agent reporting

Issue

The following issue appears in the statusEvents field for a job that is not running properly or failed before VMs were created:

No VM has agent reporting correctly within time window NUMBER_OF_SECONDS seconds, VM state for instance VM_NAME is TIMESTAMP,agent,start

The issue indicates that none of a job's VMs are reporting to the Batch service agent.

This issue often occurs for the following reasons:

  • The job's VMs do not have sufficient permissions. A job's VMs require specific permissions to report their state to the Batch service agent. You can provide these permissions for a job's VMs by granting the Batch Agent Reporter role (roles/batch.agentReporter) to the job's service account.
  • The job's VMs have network issues. A job's VMs require network access to communicate with the Batch service agent.
  • The job's VMs are using an outdated Batch VM OS image or a VM OS image with outdated Batch service agent software. The job's VMs require software in their VM OS image that provides the current dependencies for reporting to the Batch service agent.

Solution

To resolve the issue, do the following:

  1. Verify that the job's VMs have the permissions required to report their state to the Batch service agent.

    1. To identify the job's service account, view the details of the job using the gcloud CLI or Batch API. If no service account is listed, the job uses the Compute Engine default service account by default.
    2. Confirm that the job's service account has permissions for the Batch Agent Reporter role (roles/batch.agentReporter). For more information, see Manage access and Restricting service account usage.

      For example, to grant the Compute Engine default service account the required permissions, use the following command:

      gcloud projects add-iam-policy-binding PROJECT_ID \
        --role roles/batch.agentReporter \
        --member serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com
      

      Replace PROJECT_ID with your project ID and PROJECT_NUMBER with your project number.

  2. Verify that the job's VMs have proper network access. For more information, see Batch networking overview and Troubleshoot common networking issues.

  3. If you specified the VM OS image for the job, verify that the VM OS image is currently supported.

    1. If you enabled Cloud Logging for the job, you can identify this issue by checking for any of the following agent logs (batch_agent_logs). For more information, see Analyze a job using logs.

      • Log for outdated Batch service agent software error:

        rpc error: code = FailedPrecondition, desc = Invalid resource state for BATCH_AGENT_VERSION: outdated Batch agent version used.
        

        The BATCH_AGENT_VERSION is the version of software for communicating with the Batch service agent that the job uses—for example, cloud-batch-agent_20221103.00_p00.

      • Log for outdated Batch VM OS image error:

        rpc error: code = FailedPrecondition, desc = Invalid resource state for BATCH_VM_OS_IMAGE_NAME: outdated Batch image version.
        

        The BATCH_VM_OS_IMAGE_NAME is the specific version of a VM OS image from Batch that the job uses—for example, batch-debian-11-20220909-00-p00.

    2. You can resolve this issue by using a newer VM OS image. If the job uses a custom image, recreate the custom image based on the latest version of a supported public image.

      For more information, see Supported VM OS images and View VM OS images.

  4. Recreate the job.
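As a sketch, the image check in step 3 can use the gcloud CLI to list the VM OS images that Batch currently provides:

```shell
# List the VM OS images provided by Batch, which are stored in the
# batch-custom-image images project.
gcloud compute images list --project=batch-custom-image
```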

Constraint violated for VM external IP addresses

Issue

The following issue appears in the statusEvents field for a failed job:

Instance VM_NAME creation failed: Constraint constraints/compute.vmExternalIpAccess violated for project PROJECT_NUMBER.
Add instance VM_NAME to the constraint to use external IP with it.

This issue occurs because your project, folder, or organization has set the compute.vmExternalIpAccess organizational policy constraint so that only allowlisted VMs can use external IP addresses.

Solution

To resolve the issue, recreate the job and either block external access for all VMs or ask your administrator to add the job's VMs to the allowlist for the constraint.

Constraint violated for trusted images

Issue

The following issue appears in the statusEvents field for a failed job:

Instance VM_NAME creation failed: Constraint constraints/compute.trustedImageProjects violated for project PROJECT_ID. Use of images from project batch-custom-image is prohibited.

Solution

This issue occurs because your project has set the trusted images (compute.trustedImageProjects) policy constraint so that images from Batch, which are in the batch-custom-image images project, are not allowed.

To resolve the issue, do at least one of the following:

  • Recreate the job to specify a VM OS image that is already allowed by the trusted images policy constraint.
  • Ask your administrator to modify the trusted images policy constraint to allow VM OS images from the batch-custom-image images project. For instructions, see Control access to VM OS images for Batch.

Job failed while using an instance template

Issue

The following issue appears in the statusEvents field for a failed job that uses an instance template:

INVALID_FIELD_VALUE,BACKEND_ERROR

This issue occurs due to an unspecified problem with the job's instance template.

Solution

To debug the issue further, do the following:

  1. Create a MIG using the instance template and observe if errors occur with more details.
  2. Optional: To try to find more information, view the long-running operation that is creating the MIG in the Google Cloud console.

    Go to Compute Engine Operations

Task failure exit codes

When a specific task in a job fails, the task returns a nonzero exit code. Depending on how you configure the ignoreExitStatus field, a failed task might or might not cause a job to fail.

In addition to any exit codes that you define in a runnable, Batch has several reserved exit codes, including the following.
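For example, the following sketch of a runnable sets ignoreExitStatus so that a nonzero exit code from the runnable doesn't fail the task:

```json
{
  "script": {
    "text": "exit 1"
  },
  "ignoreExitStatus": true
}
```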

VM preemption (50001)

Issue

The following issue appears in the statusEvents field for a job:

Task state is updated from PRE-STATE to FAILED on zones/ZONE/instances/INSTANCE_ID due to Spot Preemption with exit code 50001.

This issue occurs when a Spot VM for the job is preempted during run time.

Solution

To resolve the issue, do one of the following:

  • Retry the task either by using automated task retries or manually re-running the job.
  • To guarantee there is no preemption, use VMs with the standard provisioning model instead.
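For example, the following job-definition sketch automates task retries specifically for the 50001 preemption exit code; the field values are illustrative:

```json
{
  "taskGroups": [
    {
      "taskSpec": {
        "maxRetryCount": 3,
        "lifecyclePolicies": [
          {
            "action": "RETRY_TASK",
            "actionCondition": {
              "exitCodes": [50001]
            }
          }
        ]
      }
    }
  ]
}
```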

VM reporting timeout (50002)

Issue

The following issue appears in the statusEvents field for a job:

Task state is updated from PRE-STATE to FAILED on zones/ZONE/instances/INSTANCE_ID due to Batch no longer receives VM updates with exit code 50002.

This issue occurs when a backend timeout is reached because Batch is no longer receiving updates from a VM for the job.

Solution

To resolve this issue, retry the task either by using automated task retries or manually re-running the job.

VM rebooted during execution (50003)

Issue

The following issue appears in the statusEvents field for a job:

Task state is updated from PRE-STATE to FAILED on zones/ZONE/instances/INSTANCE_ID due to VM is rebooted during task execution with exit code 50003.

This issue occurs when a VM for a job unexpectedly reboots during run time.

Solution

To resolve this issue, retry the task either by using automated task retries or manually re-running the job.

VM and task are unresponsive (50004)

Issue

The following issue appears in the statusEvents field for a job:

Task state is updated from PRE-STATE to FAILED on zones/ZONE/instances/INSTANCE_ID due to tasks cannot be canceled with exit code 50004.

This issue occurs when a task reaches the unresponsive time limit and cannot be canceled.

Solution

To resolve this issue, retry the task either by using automated task retries or manually re-running the job.

Task runs over the maximum runtime (50005)

Issue

The following issue appears in the statusEvents field for a job:

Task state is updated from PRE-STATE to FAILED on zones/ZONE/instances/INSTANCE_ID due to task runs over the maximum runtime with exit code 50005.

This issue occurs in the following cases:

  • A task runs longer than the time limit specified in the job's maxRunDuration field.
  • A runnable runs longer than the time limit specified in its timeout field.

To identify specifically which time limit was exceeded, view logs for the job and find a log that mentions the 50005 exit code. The textPayload field of this log indicates where and when the time limit was exceeded.

Solution

To resolve the issue, attempt to verify the total run time required by the task or runnable that exceeded the time limit. Then, do one of the following:

  • If you only occasionally expect this error, such as for a task or runnable with an inconsistent run time, you can try to recreate the job and configure it to automate task retries to try to increase the success rate.

  • Otherwise, if the task or runnable consistently and intentionally needs more time to finish running than the current timeout allows, set a longer timeout.
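For example, the following sketch sets a longer timeout for both a task and one of its runnables; the durations and script are illustrative:

```json
{
  "taskGroups": [
    {
      "taskSpec": {
        "maxRunDuration": "7200s",
        "runnables": [
          {
            "script": {
              "text": "sleep 30"
            },
            "timeout": "3600s"
          }
        ]
      }
    }
  ]
}
```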

VM recreated during execution (50006)

Issue

The following issue appears in the statusEvents field for a job:

Task state is updated from PRE-STATE to FAILED on zones/ZONE/instances/INSTANCE_ID due to VM is recreated during task execution with exit code 50006.

This issue occurs when a VM for a job is unexpectedly recreated during run time.

Solution

To resolve this issue, retry the task either by using automated task retries or manually re-running the job.

What's next