Retry events

An event can be rejected for multiple reasons. For example, the event receiver service might be temporarily unavailable due to an outage, the service might encounter an error while processing an event, or the service's resources might become exhausted. Transient errors like these can be retried.

An event can also fail to be delivered to the event receiver. For example, the event might not match the expected schema that is configured, or the mediation of the event might fail before the event message can be routed to its final destination. Such cases result in persistent errors.

Transient errors

Eventarc Advanced lets you handle transient errors by retrying them. Retryable errors include those with the following error codes:

  • HTTP 408 Request Timeout
  • HTTP 409 Conflict
  • HTTP 429 Too Many Requests
  • HTTP 500 Internal Server Error
  • HTTP 502 Bad Gateway
  • HTTP 503 Service Unavailable
  • HTTP 504 Gateway Timeout
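As a sketch, this retry decision can be expressed as a simple membership check (the function name is illustrative and not part of any Eventarc API):

```python
# Status codes that Eventarc Advanced treats as transient (retryable).
RETRYABLE_STATUS_CODES = {408, 409, 429, 500, 502, 503, 504}

def is_retryable(status_code: int) -> bool:
    """Return True if an HTTP status code maps to a transient error."""
    return status_code in RETRYABLE_STATUS_CODES
```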

Persistent errors

In contrast to transient errors, persistent errors include the following:

  • Errors that occur when the number of configured retries is exhausted
  • Errors that occur when an event fails before it can be routed to its destination
  • Errors with an error code that is considered non-retryable; that is, any error code other than those listed for transient errors

You can manually identify persistent errors and handle them appropriately.

Retry transient errors

Eventarc Advanced uses an exponential backoff delay to handle errors that can be retried. The default retry policy starts with a one-second delay, and the delay is doubled after each failed attempt (up to a maximum of 60 seconds and five attempts).
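The resulting delay schedule can be sketched as follows. This is a minimal illustration of the default policy described above, not Eventarc's internal implementation:

```python
def backoff_delays(min_delay=1.0, max_delay=60.0, max_attempts=5, factor=2.0):
    """Yield the delay before each retry under exponential backoff.

    Mirrors the default Eventarc Advanced policy: start at min_delay,
    double after each failed attempt, and cap each delay at max_delay.
    """
    delay = min_delay
    for _ in range(max_attempts - 1):  # the first attempt has no delay
        yield min(delay, max_delay)
        delay *= factor

print(list(backoff_delays()))  # default policy: [1.0, 2.0, 4.0, 8.0]
```

Setting min_delay equal to max_delay yields the same constant delay before every retry, which is the linear backoff described later in this section.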

You can change the default retry policy using the Google Cloud console or the gcloud beta eventarc pipelines update command.

Note that the default backoff factor of 2 can't be changed.

Console

  1. In the Google Cloud console, go to the Eventarc > Pipelines page.

    Go to Pipelines

  2. Click the name of the pipeline.

  3. In the Pipeline details page, click Edit.

  4. On the Edit pipeline page, in the Retry policy section, modify the following fields:

    • Max attempts: the maximum number of delivery attempts; default is 5. Must be a positive integer. If set to 1, no retry policy is applied and only one attempt is made to deliver a message.
    • Min delay (seconds): the initial delay in seconds; default is 1 second. Must be between 1 and 600.
    • Max delay (seconds): the maximum delay in seconds; default is 60 seconds. Must be between 1 and 600.

    You can configure a linear backoff by setting the minimum and maximum delays to the same value.

  5. Click Save.

gcloud

gcloud beta eventarc pipelines update PIPELINE_NAME \
    --min-retry-delay=MIN_DELAY \
    --max-retry-delay=MAX_DELAY \
    --max-retry-attempts=MAX_ATTEMPTS

Replace the following:

  • PIPELINE_NAME: the ID or fully qualified identifier of the pipeline.
  • MIN_DELAY: the initial delay in seconds; default is 1 second. Must be between 1 and 600.
  • MAX_DELAY: the maximum delay in seconds; default is 60 seconds. Must be between 1 and 600.
  • MAX_ATTEMPTS: the maximum number of delivery attempts; default is 5. Must be a positive integer. If set to 1, no retry policy is applied and only one attempt is made to deliver a message.

The following example configures a linear backoff by setting the minimum and maximum delays to the same value:

gcloud beta eventarc pipelines update my-pipeline \
    --min-retry-delay=4 \
    --max-retry-delay=4 \
    --max-retry-attempts=5

Archive messages to handle persistent errors

You can write messages to a BigQuery table as they are received. This lets you manually identify persistent errors and handle them appropriately.

The following provides an overview of the steps required to archive your event messages, identify persistent errors, and retry the affected events.

  1. Create a bus. Configure the bus appropriately; for example, to publish events from Google sources.
  2. Create a Pub/Sub topic. This Pub/Sub topic will be the target destination for your pipeline.
  3. Create a BigQuery subscription for the Pub/Sub topic. A BigQuery subscription is a type of export subscription that writes messages to an existing BigQuery table as they are received. Alternatively, you can create the table when you create the BigQuery subscription.
  4. Create a pipeline and an enrollment that route every message received by the bus (using --cel-match="true") to the Pub/Sub topic. Configure a retry policy for the pipeline.

    For example, the following commands create a pipeline and an enrollment:

    gcloud beta eventarc pipelines create my-archive-pipeline \
        --destinations=pubsub_topic='my-archive-topic',network_attachment='my-network-attachment' \
        --min-retry-delay=1 \
        --max-retry-delay=20 \
        --max-retry-attempts=6 \
        --location=us-central1
    
    gcloud beta eventarc enrollments create my-archive-enrollment \
        --cel-match="true" \
        --destination-pipeline=my-archive-pipeline \
        --message-bus=my-message-bus \
        --message-bus-project=my-google-cloud-project \
        --location=us-central1
    
  5. Route your pipeline logs to another BigQuery dataset.

    You should now have two separate BigQuery datasets: one that stores every message received by your Eventarc Advanced bus and one that stores your pipeline logs.

  6. To identify messages that have failed, use a query statement to join both BigQuery datasets on the message_uid field.

  7. After identifying any failed messages, you can publish them to your bus again using the Eventarc Publishing API. For example, you can deploy a Cloud Run service or job to read the messages from BigQuery and publish them directly to the Eventarc Advanced bus.
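The join in step 6 and the replay in step 7 can be sketched as follows. The table names, the log schema (a severity column alongside message_uid), and the republish_event helper are assumptions for illustration only; adapt them to your actual schema and use the Eventarc Publishing API for the publish call itself.

```python
# Sketch: find archived messages whose pipeline logs show a failure,
# then hand each one to a (hypothetical) republish helper.
# Table names and the log schema below are assumptions, not Eventarc defaults.

def build_failed_messages_query(archive_table: str, logs_table: str) -> str:
    """Build a query joining the archive and pipeline-log tables on message_uid."""
    return f"""
        SELECT archive.*
        FROM `{archive_table}` AS archive
        JOIN `{logs_table}` AS logs
          ON archive.message_uid = logs.message_uid
        WHERE logs.severity = 'ERROR'
    """

def republish_event(message: dict) -> None:
    """Hypothetical helper: publish a message back to the Eventarc Advanced
    bus via the Eventarc Publishing API (not implemented in this sketch)."""
    raise NotImplementedError

query = build_failed_messages_query(
    "my-project.my_dataset.archive", "my-project.my_dataset.pipeline_logs")
# In a Cloud Run service or job, you would run this query with the
# BigQuery client library and call republish_event() for each row.
```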

Make event handlers idempotent

Event handlers that can be retried should be idempotent. Follow these general guidelines:

  • Many external APIs let you supply an idempotency key as a parameter. If you are using such an API, you should use the event source and ID as the idempotency key. (Producers must ensure that source + id is unique for each distinct event.)
  • Additionally, you can use a CloudEvents attribute, xgooglemessageuid, to provide idempotency. The value of this attribute is the same as the message_uid field in Eventarc Advanced messages. It uniquely identifies the action of publishing an event. For example, if the same event is published twice to a bus, each event will have a different xgooglemessageuid value when sent to an event handler.
  • Idempotency works well with at-least-once delivery because it makes retries safe. Combining idempotency with retries is therefore a general best practice for writing reliable code.
  • Make sure that your code is internally idempotent. For example:
    • Make sure that mutations can happen more than once without changing the outcome.
    • Query database state in a transaction before mutating the state.
    • Make sure that all side effects are themselves idempotent.
  • Impose a transactional check outside your service, independent of the code. For example, persist state somewhere recording that a given event ID has already been processed.
  • Deal with duplicate calls out-of-band. For example, have a separate cleanup process that cleans up after duplicate calls.
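As a sketch of the transactional-check guideline, the following handler records processed event keys and skips duplicates, deduplicating on the CloudEvents source and id pair that producers must keep unique per event. The in-memory set, handler name, and event shape are illustrative assumptions; in production you would use a durable, transactional store:

```python
# Illustrative idempotent handler: dedupe on the CloudEvents
# source + id pair, which producers must keep unique per event.
processed_keys = set()  # in production: a durable, transactional store

def handle_event(event: dict) -> str:
    """Process an event exactly once; skip any duplicate delivery."""
    key = (event["source"], event["id"])
    if key in processed_keys:
        return "skipped-duplicate"
    # ... perform the (idempotent) side effects here ...
    processed_keys.add(key)
    return "processed"
```

A retried delivery of the same event then becomes a no-op rather than a repeated side effect.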

What's next