Job creation and execution overview

This document explains the execution process and creation options for jobs. Batch jobs let you run batch-processing workloads on Google Cloud. To learn about the components of a job and the prerequisites for using Batch, see Get started with Batch.

How job creation and execution works

To use Batch, you create a job that specifies your workload and its requirements, and then Batch automatically runs it.

The details of how job creation and execution works are described in the following sections:

  • Job lifecycle: understand the states that a job progresses through from creation to deletion.
  • Job queuing and scheduling: understand the factors that affect how long it takes for a job to start running.
  • Job execution: understand how a job's tasks run on its resources during execution.

Job lifecycle

This section describes the lifecycle of a job and its tasks from creation to deletion.

For each workload you want to run on Batch, you go through the following basic process:

  1. Create a job: You define the workload you want to run by specifying a job's runnables, tasks, and any other requirements. The details for creating a job are introduced in the Job creation options section in this document.
  2. Monitor and troubleshoot the job: When you finish creating a job, the job is automatically queued, scheduled, and executed on the specified resources. You can view the details of a created job or any of its tasks to see the current state. If needed, you can cancel a job (Preview) to stop it or prevent it from running. After a job starts running or finishes, you can also monitor and analyze the job using logs. If a job fails, you can troubleshoot it using any error messages, status events, or logs to diagnose the issue before recreating the job.
  3. Delete or export the job: A job's information in Batch remains available until you or Google Cloud deletes it. Google Cloud automatically deletes a job 60 days after it finishes. Before then, you can optionally delete the job yourself, or, if you need to retain the information, you can export the job's information in Batch before the job is deleted. Any information about a job that's stored in other Google Cloud services isn't affected when the job is deleted and has separate retention policies. For example, a job's logs are automatically retained and deleted according to the Cloud Logging retention policy.

After you create a job, it progresses through the following states:

  1. Queued (QUEUED): the job request has been admitted and is waiting in the queue. The job remains in the queue until the required resources are available and the jobs ahead of it have been assessed.
  2. Scheduled (SCHEDULED): the job has been selected from the queue to start running and the resources are being allocated.
  3. Running (RUNNING): the resources for the job have been created successfully and its tasks can start running.

    When a job is running, each of its tasks progresses through the following states:

    1. Pending (PENDING): the task is waiting for a VM to run on.
    2. Assigned (ASSIGNED): the task has been assigned a VM to run on.
    3. Running (RUNNING): the task is running on a VM.
    4. A task finishes in one of the following states:

      • Succeeded (SUCCEEDED): the task succeeded because each of its runnables met one of the following conditions:

        • The runnable succeeded (returned an exit code of zero).
        • The runnable failed (returned a nonzero exit code) but was a noncritical runnable (you enabled the runnable's ignoreExitStatus field).
        • The runnable didn't finish but was a background runnable (you enabled the runnable's background field).
      • Failed (FAILED): the task failed and stopped running because at least one runnable didn't meet the preceding conditions.

    The resources for the job are deleted before the job finishes.

  4. A job finishes in one of the following states:

    • Succeeded (SUCCEEDED): the job succeeded because all of its tasks succeeded.
    • Failed (FAILED): the job failed and stopped running because at least one of its tasks failed.
    • Cancelled (CANCELLED): a user cancelled the job (Preview) before it succeeded or failed.

For more information, see job states and task states in the reference documentation.
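
The success and failure conditions above can be sketched in Python. This is only an illustrative model, not Batch's implementation: the Runnable class and the function names are hypothetical, and only the ignoreExitStatus and background fields and the state names come from this document.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Runnable:
    """Hypothetical model of a runnable's outcome (not a Batch API type)."""
    exit_code: Optional[int]          # None means the runnable didn't finish
    ignore_exit_status: bool = False  # mirrors the ignoreExitStatus field
    background: bool = False          # mirrors the background field


def runnable_ok(runnable: Runnable) -> bool:
    """Return True if the runnable meets one of the three success conditions."""
    if runnable.exit_code is None:
        # An unfinished runnable is acceptable only if it is a background runnable.
        return runnable.background
    # Assumption: a finished runnable must exit with zero unless it is noncritical.
    return runnable.exit_code == 0 or runnable.ignore_exit_status


def task_state(runnables: List[Runnable]) -> str:
    """A task succeeds only if every one of its runnables is acceptable."""
    return "SUCCEEDED" if all(runnable_ok(r) for r in runnables) else "FAILED"


def job_state(task_states: List[str], cancelled: bool = False) -> str:
    """A job succeeds only if all of its tasks succeeded and it wasn't cancelled."""
    if cancelled:
        return "CANCELLED"
    return "SUCCEEDED" if all(s == "SUCCEEDED" for s in task_states) else "FAILED"
```

For example, a task whose runnables are Runnable(0), Runnable(3, ignore_exit_status=True), and Runnable(None, background=True) evaluates to SUCCEEDED in this model, while adding a Runnable(1) with no flags makes it FAILED.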

Job queuing and scheduling

Generally, jobs are more likely to run and finish sooner if they are smaller and require only a few common resources. For the example jobs in Batch documentation, which are typically very small and use minimal resources, you might see them finish running in as little as a few minutes.

Specifically, the time a job takes to finish queuing and scheduling varies for different jobs and at different times based on the following factors:

  • User-specified job prerequisites: any prerequisites that you require to be met before the job is scheduled.

    By default, a job doesn't have any prerequisites. Optionally, you can specify that a job can't be scheduled until one or more existing jobs have succeeded or failed. For more information, see Schedule dependent jobs (Preview).

  • Job priority: the priority for a job relative to the priorities of other jobs in your project.

    You can optionally specify the priority of a job by including the --priority flag for gcloud CLI or the priority JSON field. You can define the priority of a job as a number between 0 (lowest priority) and 99 (highest priority). Setting a higher priority can help a job to run sooner than lower-priority jobs in your project.

    If you don't configure the priority of a job, it defaults to using the lowest priority, 0. If two queued jobs have the same priority, the job that was created first has the higher priority.

  • Job resource availability: the availability of the job's required resources within the allowed locations.

    First, a job can't run if you specify any resources that aren't offered in the job's allowed locations. When this happens, the job fails with a zone availability error.

    Second, a job is more likely to be delayed or to fail with resource availability errors if any of its required resources have low capacity relative to the current demand. As a result, your job might run sooner when you require fewer, more-common resources and don't restrict the job from running in any zones in a region.

    For more information about the resources for a job, see Job execution in this document. For more information about the locations you can specify for a Batch job and its resources, see the Locations page.

  • Quotas and limits: the thresholds that your project has for Google Cloud resources and requests.

    A job cannot run if it exceeds a limit or your project's quota for any of the required resources or requests. When this happens, Batch might delay a job and retry it at a later time or fail the job and display a related error.

    You can help prevent delays and errors for your job by creating jobs that comply with all relevant limits and ensuring your project has enough relevant quota. For more information, see Batch quotas and limits.
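
The priority rule described in the preceding list (higher numbers first, then earlier creation time) can be sketched as a sort. This models only that rule; the QueuedJob class and its create_time field are hypothetical, and real scheduling also depends on prerequisites, resource availability, and quota.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QueuedJob:
    """Hypothetical stand-in for a queued job (not a Batch API type)."""
    name: str
    priority: int = 0         # 0 (lowest) to 99 (highest); defaults to 0
    create_time: float = 0.0  # hypothetical creation timestamp, in seconds


def queue_order(jobs: List[QueuedJob]) -> List[QueuedJob]:
    """Order jobs by priority (highest first), then by creation time (earliest first)."""
    for job in jobs:
        if not 0 <= job.priority <= 99:
            raise ValueError(f"priority for {job.name} must be between 0 and 99")
    return sorted(jobs, key=lambda job: (-job.priority, job.create_time))


jobs = [
    QueuedJob("a", priority=0, create_time=1.0),
    QueuedJob("b", priority=50, create_time=3.0),
    QueuedJob("c", priority=50, create_time=2.0),
    QueuedJob("d", priority=99, create_time=4.0),
]
print([job.name for job in queue_order(jobs)])  # prints ['d', 'c', 'b', 'a']
```

Jobs b and c share priority 50, so c, which was created first, is ordered ahead of b.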

Job execution

The time a job takes to be executed can vary based on task scheduling and the job's resources.

Task scheduling

When a job runs, its tasks are scheduled according to the scheduling policy (schedulingPolicy) field, which lets you specify one of the following options:

  • As soon as possible (AS_SOON_AS_POSSIBLE) (default): tasks run as soon as resources are available and can run in parallel. The number of tasks that run at a time depends on the parallel tasks per VM allowed by the job's resources and other configuration options as explained in Job resources in this document.
  • In order (IN_ORDER): tasks run one at a time in increasing index order.

Job resources

Each Batch job runs on a regional managed instance group (MIG), which is a group of one or more matching Compute Engine virtual machine (VM) instances that are each located in one of the included zones. Each VM has dedicated hardware for CPU cores (specifically virtual CPUs (vCPUs)) and memory—which affect the performance of your job—and a boot disk—which stores an operating system (OS) image and instructions for running your job.

During a job's run time, Batch automatically creates and deletes resources that meet your specifications. When you create a job, you configure its resources by specifying the following:

  • Compute resources per task: unless the default values are sufficient, you must specify the compute resources—vCPUs, memory, and (if required) extra boot disk storage—required for each task to run. For more information, see the compute resources per task (computeResource) fields.

  • VM resources: optionally, you can also specify the job's VMs—such as the machine type and OS, and additional resources such as GPUs and storage volumes—by using the VM resource policy (instances[].policy) fields or the alternative instances[].instanceTemplate field. If you don't define these fields, Batch selects compatible VMs and doesn't add any additional resources.
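
As a rough illustration of these two kinds of settings, the following Python sketch builds a job configuration as a dictionary and prints it as JSON. The computeResource, cpuMilli, taskCount, schedulingPolicy, and instances[].policy names come from this document. The surrounding nesting (taskGroups, taskSpec, allocationPolicy), the memoryMib and machineType fields, the runnable, and all of the values are assumptions for illustration, so check them against the Batch reference documentation before use.

```python
import json

# Illustrative job configuration; the nesting and values are assumptions.
job = {
    "taskGroups": [
        {
            "taskCount": 8,
            "schedulingPolicy": "AS_SOON_AS_POSSIBLE",
            "taskSpec": {
                "runnables": [
                    {"script": {"text": "echo Hello from a Batch task"}}
                ],
                # Compute resources per task: 2 vCPUs and 2 GiB of memory.
                "computeResource": {"cpuMilli": 2000, "memoryMib": 2048},
            },
        }
    ],
    # VM resources: optional; omit to let Batch select compatible VMs.
    "allocationPolicy": {
        "instances": [{"policy": {"machineType": "e2-standard-4"}}]
    },
}

print(json.dumps(job, indent=2))
```

Leaving out the allocationPolicy block corresponds to the default behavior described above, where Batch selects compatible VMs without any additional resources.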

The number of VMs and the number of tasks that can run simultaneously on each VM vary for different jobs based on the task scheduling and your specified hardware requirements. If you specify for a job's tasks to run IN_ORDER, the job has one VM and only runs one task at a time. Otherwise, if a job's tasks run AS_SOON_AS_POSSIBLE, then you can estimate the number of VMs and the number of simultaneous tasks using the following formula:

\[{vmsPerJob}=\frac{taskCount}{parallelTasksPerVm}\]

This formula has the following values:

  • \({vmsPerJob}\): the maximum number of VMs for a job. The actual number of VMs created for a job might be smaller than this. For example, Batch might create fewer VMs if it expects that running the job on fewer resources is faster than waiting for more resources. This value is also limited by the concurrent VMs per job limits.
  • \({taskCount}\): the total number of tasks for the job, which you define using the task count (taskCount) field.
  • \({parallelTasksPerVm}\): the maximum number of tasks that can run on a VM simultaneously.

    This value is determined by all of the following criteria:

    • The minimum value is 1 task.

    • The maximum value is the smaller of 20 tasks and, if defined, the value of the max parallel tasks per job (parallelism) field.

    • If the maximum parallel tasks per VM (taskCountPerNode) field is defined, that value is used.

      Otherwise, if taskCountPerNode is undefined, Batch decides a value by dividing the total number of compute resources (specifically vCPUs) per VM by the amount required for each task:

      \[{parallelTasksPerVm}=\frac{vcpusPerVm}{vcpusPerTask}\]

      This formula has the following values:

      • \({vcpusPerVm}\): the total number of vCPUs per VM, which is determined by the machine type of your job's VMs.

      • \({vcpusPerTask}\): the number of vCPUs per task, which is determined by converting the units of the vCPUs per task (cpuMilli) field. The cpuMilli field is specified in milliCPU units, so a value of 2000 equals 2 vCPUs.
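
The two formulas above can be combined into a small estimator for jobs whose tasks run AS_SOON_AS_POSSIBLE. This is a sketch under stated assumptions, not Batch's actual algorithm: the function names are hypothetical, fractional results are rounded down for tasks per VM and up for VMs, and the minimum and maximum bounds are applied even when taskCountPerNode is set.

```python
import math
from typing import Optional

MAX_PARALLEL_TASKS_PER_VM = 20  # upper bound stated in this document


def parallel_tasks_per_vm(
    vcpus_per_vm: int,
    cpu_milli_per_task: int,
    task_count_per_node: Optional[int] = None,
    parallelism: Optional[int] = None,
) -> int:
    """Estimate how many tasks can run on one VM at the same time."""
    if task_count_per_node is not None:
        value = task_count_per_node
    else:
        # 1000 milliCPU equals 1 vCPU; integer division rounds down (assumption).
        value = (vcpus_per_vm * 1000) // cpu_milli_per_task
    upper = MAX_PARALLEL_TASKS_PER_VM
    if parallelism is not None:
        upper = min(upper, parallelism)
    # Clamp to the documented minimum (1) and maximum (assumed to apply in all cases).
    return max(1, min(value, upper))


def max_vms_per_job(task_count: int, tasks_per_vm: int) -> int:
    """Estimate the maximum number of VMs, rounding up (assumption)."""
    return math.ceil(task_count / tasks_per_vm)


tasks_per_vm = parallel_tasks_per_vm(vcpus_per_vm=4, cpu_milli_per_task=2000)
print(tasks_per_vm)                      # prints 2
print(max_vms_per_job(8, tasks_per_vm))  # prints 4
```

In this example, a 4-vCPU VM fits two tasks that each need 2 vCPUs, so a job with 8 tasks uses at most 4 VMs. The actual number of VMs can still be lower, as noted in the description of vmsPerJob.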

Job creation options

Create and run a basic job explains the fundamentals, including how to define a runnable using either a script or container image and how to configure predefined and custom environment variables.

After you understand the fundamentals for job creation, consider creating a job that uses one or more of the following additional configuration options:

  • Control access for a job:

    • Control access for a job using a custom service account explains how to specify a job's service account, which influences the resources and applications that a job's VMs can access. If you don't specify a custom service account, jobs default to using the Compute Engine default service account.

    • Networking overview provides an overview of when and how you can customize the networking configuration for a job, including specifying the job's network, blocking external connections, and protecting data and resources by using VPC Service Controls.

    • Protect sensitive data using Secret Manager explains how to securely define sensitive data, such as custom environment variables and login credentials, by using Secret Manager secrets to specify encrypted information when you create a job.

  • Configure additional options for a job:

    • Configure task communication using an MPI library explains how to configure a job with interdependent tasks that communicate with each other across different VMs by using a Message Passing Interface (MPI) library. A common use case for MPI is tightly coupled, high performance computing (HPC) workloads.

    • Customize the resources that a job runs on:

      • Define job resources using a VM instance template explains how to specify a Compute Engine VM template to define a job's resources when you create a job. This is an alternative to specifying a job's resources directly by using the instances[].policy field.

      • Use GPUs for a job explains how to define a job that uses one or more graphics processing units (GPUs). Common use cases for jobs that use GPUs include intensive data processing or machine learning (ML) workloads.

      • Use storage volumes for a job explains how to define a job that can access one or more external storage volumes. Storage options include new or existing persistent disks, new local SSDs, existing Cloud Storage buckets, and an existing network file system (NFS) such as a Filestore file share.

      • VM OS environment overview provides an overview of when and how you can customize the VM operating system (OS) environment for a job, including the job's VM OS image and boot disks.

    • Optimize various aspects of a job:

      • Improve monitoring and analysis:

      • Schedule dependent jobs (Preview) explains how to specify a job that doesn't run until one or more existing dependency jobs have succeeded or failed. If you have a workload with varying resource requirements, you can reduce costs and quota usage by separating the types of VMs used for low-demand operations (like data preparation) and compute-intensive operations (like data processing).

      • Automate task retries explains how to automatically retry a job's tasks after all or specified failures. Automated retries can help reduce troubleshooting friction and the overall run time required for jobs that experience temporary errors. For example, use automatic retries for a job that runs on Spot VMs, which provide significant discounts but might not always be available and can be preempted at any time.

      • Limit run times using timeouts explains how to limit the amount of time that a task or runnable is permitted to run. By preventing excess run times, you might reduce unexpected costs and delays.

      • Ensure resource availability using VM reservations explains how to configure a job that can run on reserved VMs. Using reserved VMs can help you minimize a job's scheduling time, prevent resource availability errors, and optimize costs.

      • Reduce latency:

        • Colocate VMs to reduce latency explains how to reduce network latency between a job's VMs by requiring the VMs to be located physically close to each other. This performance benefit can be especially useful for jobs that have frequent network communications across VMs, such as tasks that communicate using MPI libraries.

        • Use Image streaming explains how to improve job startup time by streaming container images from Artifact Registry.

  • Use additional services to create and run jobs:

What's next