Individual job tasks or even job executions can fail for a variety of reasons. This page contains best practices to handle these failures, centered around task restarts and job checkpointing.
Use task retries
Individual job tasks can fail for a variety of reasons, including issues with application dependencies, quotas, or even internal system events. Often such issues are transient and the task will succeed after a retry.
By default, each task will automatically retry up to 3 times. This helps ensure a job will run to completion even if it encounters transient task failures. You can also customize the maximum number of retries. However, if you do change the default, you should specify at least one retry.
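If you deploy with the gcloud CLI, the retry limit can be set with the `--max-retries` flag when creating or updating a job. A minimal sketch, assuming a job named `my-job` (a hypothetical name):

```shell
# Allow each task up to 3 automatic retries on failure.
gcloud run jobs update my-job --max-retries=3
```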
Plan for job task restarts
Make your jobs idempotent, so that a task restart does not result in corrupt or duplicate output. That is, write repeatable logic that produces the same output for a given set of inputs, no matter how many times or when it runs.
Write your output to a different location than the input data, leaving the input data intact. This way, if the job runs again, it can repeat the process from the beginning and get the same result.
Avoid duplicating output data: reuse the same unique identifier for each output, or check whether the output already exists before writing it. Duplicate data represents collection-level data corruption.
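One way to apply this is to key each output by a deterministic identifier and skip the write if that output already exists. A minimal sketch, using a local directory as a stand-in for Cloud Storage (the function name and file layout are illustrative, not part of any API):

```python
import json
import os
import tempfile

def write_result_idempotently(output_dir: str, record_id: str, result: dict) -> bool:
    """Write a result keyed by a deterministic ID, skipping if it already exists.

    Returns True if the result was written, False if a previous attempt
    already produced it.
    """
    path = os.path.join(output_dir, f"{record_id}.json")
    if os.path.exists(path):
        # A previous attempt already produced this output; do nothing.
        return False
    # Write to a temporary file and rename, so a crash mid-write
    # never leaves a partial output behind.
    fd, tmp_path = tempfile.mkstemp(dir=output_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(result, f)
    os.replace(tmp_path, path)
    return True
```

Calling this twice with the same `record_id` writes the output exactly once, so a restarted task can safely repeat its work.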
Use checkpointing
Where possible, checkpoint your jobs so that if a task restarts after a failure, it can pick up where it left off, instead of restarting work at the beginning. Doing this will speed up your jobs as well as minimize unnecessary costs.
Periodically write partial results and an indication of progress to a persistent storage location such as Cloud Storage or a database. When your task starts, look for partial results; if any are found, begin processing where they left off.
If your job does not lend itself to checkpointing, consider breaking it up into smaller chunks and running a larger number of tasks.
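The checkpointing pattern above can be sketched in a few lines. This example uses a local JSON file as a stand-in for a Cloud Storage object or database row; the function and file names are illustrative assumptions:

```python
import json
import os

CHECKPOINT_PATH = "checkpoint.json"  # stand-in for a Cloud Storage object

def load_checkpoint(path: str = CHECKPOINT_PATH) -> int:
    """Return the index of the next item to process, or 0 if starting fresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["next_index"]
    return 0

def save_checkpoint(next_index: int, path: str = CHECKPOINT_PATH) -> None:
    with open(path, "w") as f:
        json.dump({"next_index": next_index}, f)

def run_task(items, process, checkpoint_every: int = 100,
             path: str = CHECKPOINT_PATH) -> None:
    """Process items, periodically recording progress so a restart can resume."""
    start = load_checkpoint(path)
    for i in range(start, len(items)):
        process(items[i])  # the per-item work
        if (i + 1) % checkpoint_every == 0:
            save_checkpoint(i + 1, path)
    save_checkpoint(len(items), path)
```

If the task fails and restarts, `load_checkpoint` skips the items that were already processed, so only the work since the last checkpoint is repeated.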
Checkpointing example 1: calculating Pi
If you have a job that executes a recursive algorithm, such as calculating Pi to many decimal places, and uses parallelism set to a value of 1:
- Write your progress every 10 minutes, or whatever your lost work tolerance allows, to a `pi-progress.txt` Cloud Storage object.
- When a task starts, query the `pi-progress.txt` object and load the value as a starting place. Use that value as the initial input to your function.
- Write your final result to Cloud Storage as an object named `pi-complete.txt` to avoid duplication via parallel or repeated execution, or `pi-complete-DATE.txt` to differentiate by completion date.
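The steps above can be sketched as follows, using the Leibniz series as a simple (if slowly converging) way to approximate Pi and a local file as a stand-in for the `pi-progress.txt` Cloud Storage object; the names and the series choice are illustrative assumptions:

```python
import json
import os
import time

PROGRESS_PATH = "pi-progress.txt"  # stand-in for the Cloud Storage object

def compute_pi(total_terms: int, checkpoint_seconds: float = 600.0) -> float:
    """Sum the Leibniz series for Pi, checkpointing the partial sum periodically."""
    term, partial = 0, 0.0
    # Resume from the checkpoint if a previous attempt left one behind.
    if os.path.exists(PROGRESS_PATH):
        with open(PROGRESS_PATH) as f:
            state = json.load(f)
        term, partial = state["term"], state["partial"]
    last_save = time.monotonic()
    while term < total_terms:
        partial += (-1) ** term * 4.0 / (2 * term + 1)
        term += 1
        # Every checkpoint_seconds (10 minutes here), persist progress.
        if time.monotonic() - last_save >= checkpoint_seconds:
            with open(PROGRESS_PATH, "w") as f:
                json.dump({"term": term, "partial": partial}, f)
            last_save = time.monotonic()
    return partial
```

A restarted task picks up at the saved `term` rather than term 0, so at most `checkpoint_seconds` of work is repeated.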
Checkpointing example 2: processing 10,000 records from Cloud SQL
If you have a job processing 10,000 records in a relational database such as Cloud SQL:
- Retrieve records to be processed with a SQL query such as `SELECT * FROM example_table LIMIT 10000`.
- Write out updated records in batches of 100 so significant processing work is not lost on interruption.
- When records are written, note which ones have been processed. You might add a boolean column `processed` to the table, set to `1` only once processing is confirmed.
- When a task starts, the query used to retrieve items for processing should add the condition `processed = 0`.
- In addition to clean retries, this technique also supports breaking up work into smaller tasks, such as by modifying your query to select 100 records at a time, `LIMIT 100 OFFSET $CLOUD_RUN_TASK_INDEX*100`, and running 100 tasks to process all 10,000 records. `CLOUD_RUN_TASK_INDEX` is a built-in environment variable present inside the container running Cloud Run jobs.

Using all these pieces together, the final query might look like this:

`SELECT * FROM example_table WHERE processed = 0 LIMIT 100 OFFSET $CLOUD_RUN_TASK_INDEX*100`
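Putting the query, the batch size, and the `processed` flag together, one task's work might look like the following sketch. It uses sqlite3 as a lightweight stand-in for Cloud SQL, and the table schema and function name are illustrative assumptions:

```python
import os
import sqlite3

def process_shard(conn: sqlite3.Connection, batch_size: int = 100) -> int:
    """Process one task's slice of unprocessed records and mark them done.

    Reads CLOUD_RUN_TASK_INDEX (defaulting to 0 outside Cloud Run) to pick
    which 100-record slice this task owns. Returns the number processed.
    """
    task_index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", "0"))
    offset = task_index * batch_size
    rows = conn.execute(
        "SELECT id FROM example_table WHERE processed = 0 LIMIT ? OFFSET ?",
        (batch_size, offset),
    ).fetchall()
    for (record_id,) in rows:
        # ... do the real per-record work here ...
        conn.execute(
            "UPDATE example_table SET processed = 1 WHERE id = ?", (record_id,)
        )
    # Commit the batch so work is durable before the task can be interrupted.
    conn.commit()
    return len(rows)
```

Because each record is flagged as it is committed, a retried task re-queries with `processed = 0` and skips work that already succeeded.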
What's next
- To create a Cloud Run job, see Create jobs.
- To execute a job, see Execute jobs.
- To execute a job on a schedule, see Execute jobs on a schedule.