This document provides some common troubleshooting tips for Pub/Sub pull subscriptions. Read more about pull subscriptions in the Pull subscriber guide.
To monitor your Pub/Sub subscription effectively, start by looking at the delivery latency health score (subscription/delivery_latency_health_score) to check which factors contribute to unexpected or increased latency.
Oldest unacked message age keeps increasing
The oldest_unacked_message_age is a critical metric for monitoring the health of Pub/Sub subscriptions. It measures the age, in seconds, of the oldest message in a subscription's backlog that has not yet been acknowledged (acked) by a subscriber. This metric offers valuable insights into potential processing delays or bottlenecks.
Monitoring message backlog ensures timely and efficient message processing. By tracking the oldest unacked message age, you can proactively identify situations where consumers are falling behind. This practice allows for early intervention to address potential issues related to degraded performance.
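For example, the following sketch reads recent values of this metric through the Cloud Monitoring API with the Python client library (google-cloud-monitoring); the project and subscription IDs and the one-hour window are placeholders.

```python
import time

from google.cloud import monitoring_v3

project_id = "my-project"            # placeholder
subscription_id = "my-subscription"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

# List the oldest_unacked_message_age time series for one subscription.
results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": (
            'metric.type = "pubsub.googleapis.com/subscription/oldest_unacked_message_age" '
            f'AND resource.labels.subscription_id = "{subscription_id}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    for point in series.points:
        # The metric is a gauge measured in seconds.
        print(point.interval.end_time, point.value.int64_value, "seconds")
```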
Some of the common backlog issues that you can investigate include:
Client configuration issues
When both the oldest_unacked_message_age and num_undelivered_messages metrics increase simultaneously, it could mean that the subscribers are not keeping up with the message volume. In this situation, focus your investigation on the subscriber components:
Client health: Analyze resource utilization on machines hosting subscriber clients, like CPU, memory, and network bandwidth. Look for pressure points that might impede processing efficiency.
Client code: Review recent code changes and examine error logs. Bugs or inefficiencies in the subscriber code can significantly impact message processing rates. Note that there could be issues with specific messages. For example, multiple messages might need to access the same row in a database simultaneously. This behavior can lead to contention and high latency.
Quota limitations: Verify that you haven't exceeded any Pub/Sub quotas or limitations imposed by your hosting service. If the subscribers are hosted in Google Cloud, review Compute Engine or GKE resource quotas to prevent potential bottlenecks.
Subscriber negatively acknowledged the messages
When a subscriber negatively acknowledges (nacks) a message, it signals to Pub/Sub that the message couldn't be processed successfully. Pub/Sub then attempts to redeliver the same message. Repeated nacks for a message lead to duplicates and potentially a long delay in message delivery.
Note that nacking a message doesn't guarantee that the next pull fetches a different message. Pub/Sub's redelivery policy might continue to redeliver nacked messages before new ones. Therefore, don't rely on nacks as a method of filtering or skipping specific messages. Instead, set a retry policy, preferably with exponential backoff, to back off on individual messages that are likely to be processable later but need a little more time before redelivery.
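As an illustration, the following sketch attaches an exponential backoff retry policy to an existing subscription with the Python client library (google-cloud-pubsub); the project and subscription IDs and the backoff bounds are placeholder values to adjust for your workload.

```python
from google.cloud import pubsub_v1
from google.protobuf import duration_pb2

project_id = "my-project"            # placeholder
subscription_id = "my-subscription"  # placeholder

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

# Exponential backoff between redelivery attempts: 10s minimum, 600s maximum.
retry_policy = pubsub_v1.types.RetryPolicy(
    minimum_backoff=duration_pb2.Duration(seconds=10),
    maximum_backoff=duration_pb2.Duration(seconds=600),
)

update = pubsub_v1.types.Subscription(
    name=subscription_path, retry_policy=retry_policy
)

with subscriber:
    subscription = subscriber.update_subscription(
        request={"subscription": update, "update_mask": {"paths": ["retry_policy"]}}
    )

print(f"Updated retry policy on {subscription.name}")
```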
If you need to intentionally skip certain messages, the recommended approach is to ack them, even if you won't process them. This removes them from the subscription, avoids unnecessary redeliveries, and reduces resource consumption. Leaving messages unacknowledged, whether intentionally or not, creates backlog issues and duplicate deliveries.
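A minimal sketch of this pattern with the Python high-level client library follows; the skip condition (a publisher-set "skip" attribute) is purely hypothetical and stands in for your own logic.

```python
from google.cloud import pubsub_v1


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # Hypothetical skip condition: a publisher-set "skip" attribute.
    if message.attributes.get("skip") == "true":
        # Ack even though the message isn't processed, so it leaves the
        # backlog instead of being redelivered.
        message.ack()
        return

    print(f"Processing {message.message_id}")  # placeholder for real work
    message.ack()

# Pass this callback to subscriber.subscribe(subscription_path, callback=callback).
```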
High delivery latency
Delivery latency in Pub/Sub is the time it takes for a message from a publisher to reach a subscriber. Some possible causes of a high delivery latency are described in the next sections.
Not enough subscribers
For clients using StreamingPull, to achieve consistently low latency, maintain multiple open StreamingPull connections to your subscription. Without active subscriber connections, Pub/Sub cannot deliver messages promptly. A single stream can be a single point of failure, increasing the risk of delays. The subscription/open_streaming_pulls metric provides visibility into the number of active streaming connections. Use this metric to make sure you consistently have enough streams to handle incoming messages.
For clients using unary pull, to achieve consistently low latency, maintain multiple outstanding pull requests to your subscription. Infrequent requests let messages accumulate in the backlog, which increases latency. Keeping multiple requests outstanding minimizes gaps in connectivity and improves delivery latency.
The high-level client library is recommended for cases where you require high throughput and low latency with minimal operational overhead and processing cost. By default, the high-level client library uses the StreamingPull API, since it tends to be a better choice for minimizing latency. The high-level client libraries contain prebuilt functions and classes that handle the underlying API calls for authentication, throughput and latency optimization, message formatting, and other features.
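For reference, a minimal receive loop with the high-level Python client library, which uses StreamingPull under the hood, might look like the following sketch; the project and subscription IDs and the timeout are placeholders. Running additional subscriber clients, for example in separate processes or on separate machines, is one way to keep multiple streams open.

```python
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

project_id = "my-project"            # placeholder
subscription_id = "my-subscription"  # placeholder

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    print(f"Received {message.message_id}")
    message.ack()


# Opens a StreamingPull connection and dispatches messages to the callback.
streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)

with subscriber:
    try:
        # Block until an error occurs or the timeout elapses.
        streaming_pull_future.result(timeout=60)
    except TimeoutError:
        streaming_pull_future.cancel()
        streaming_pull_future.result()  # Wait for the cancellation to complete.
```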
Client configuration issues
See Client configuration issues.
High backlog
Note that a backlog of unacked messages in a Pub/Sub subscription inherently increases end-to-end latency because messages are not processed immediately by subscribers.
Ordering keys and exactly-once delivery
Ordering keys and exactly-once delivery are valuable features, but they require additional coordination within Pub/Sub to ensure correct delivery. This coordination can reduce availability and increase latency. While the difference is minimal in the steady state, any necessary coordination steps could result in temporary increases in latency or an increased error rate. If ordering is enabled, messages with an ordering key cannot be delivered until earlier messages with the same ordering key are acked.
Consider whether message ordering or exactly-once delivery are absolutely essential for your application. If low latency is your top priority, minimizing the use of these features could help reduce message processing delays.
Increase in message size
A sudden increase in message size can increase the transfer time between Pub/Sub and your client and slow down message processing on the client side.
If you observe an increase in delivery latency, check message sizes using the topic/message_sizes metric, grouped by topic_id. Correlate any spikes in message size with observed performance issues.
Missing messages
If you suspect messages are not being successfully delivered to your subscriber, one of the following reasons might be a contributing factor.
Message distribution in Pub/Sub subscriptions with multiple consumers
In Pub/Sub, messages might be distributed unevenly across consumers. This behavior occurs because Pub/Sub distributes messages among active consumers for efficiency. Sometimes, a single consumer might receive fewer messages than expected, or a different subset of messages than other consumers.
Note that messages might already be outstanding to other clients, so a backlog of unacknowledged messages doesn't necessarily mean you'll receive those messages on your next pull request. Be aware that a consumer might be someone using pull in the Google Cloud console or Google Cloud CLI, or someone running a custom subscriber locally to check messages.
For unary pull clients, you might observe some pull requests returning zero messages. As discussed in the Not enough subscribers section, it is recommended to maintain multiple outstanding pull requests, since some requests can return fewer than the configured maximum number of messages, or even zero messages.
If you suspect any of these behaviors, investigate whether there are multiple consumers concurrently attached to the subscription and inspect them.
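To see the zero-message behavior described above for unary pull, the following sketch issues a single synchronous pull with the Python client library; the project and subscription IDs are placeholders, and in practice you would keep several such requests outstanding.

```python
from google.cloud import pubsub_v1

project_id = "my-project"            # placeholder
subscription_id = "my-subscription"  # placeholder

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

with subscriber:
    response = subscriber.pull(
        request={"subscription": subscription_path, "max_messages": 10}
    )

    if not response.received_messages:
        # A zero-message response doesn't necessarily mean the backlog is empty.
        print("No messages returned on this request.")

    ack_ids = []
    for received in response.received_messages:
        print(f"Received {received.message.message_id}")
        ack_ids.append(received.ack_id)

    if ack_ids:
        subscriber.acknowledge(
            request={"subscription": subscription_path, "ack_ids": ack_ids}
        )
```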
Filter on the subscription
Check if the subscription has a filter attached to it. If so, you only receive the messages that match the filter. The Pub/Sub service automatically acknowledges the messages that don't match the filter. Consider how filters affect backlog metrics.
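One way to check for a filter is to read the subscription's configuration, for example with the Python client library as in the following sketch; the IDs are placeholders.

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

with subscriber:
    subscription = subscriber.get_subscription(
        request={"subscription": subscription_path}
    )

# An empty string means the subscription has no filter attached.
print(f"Filter: {subscription.filter!r}")
```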
Using the option returnImmediately
If your client is using unary pull, check whether the returnImmediately field is set to true. This is a deprecated field that tells the Pub/Sub service to respond to the pull request immediately, even if there are no messages to return. As a result, pull requests can return 0 messages even when there is a backlog.
Dealing with duplicates
Message duplication in Pub/Sub often occurs when subscribers can't ack messages within the ack deadline. This causes the messages to be redelivered, creating the impression of duplicates. You can measure the rate at which subscribers miss the ack deadline using the subscription/expired_ack_deadlines_count metric. Learn more about how to Monitor acknowledgment deadline expiration.
To reduce the duplication rate, extend the message acknowledgment deadline:
- Client libraries handle deadline extension automatically, but there are default limits on the maximum extension that you can configure (see the sketch after this list).
- If you are building your own client library, use the modifyAckDeadline method to extend the acknowledgment deadline.
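For instance, with the Python high-level client library the total extension window is part of the flow control settings; the sketch below assumes the library's FlowControl max_lease_duration option and uses an arbitrary placeholder value.

```python
from google.cloud import pubsub_v1

# Cap how long the client library keeps extending a message's ack deadline.
# max_lease_duration is assumed here; 600 seconds is an arbitrary placeholder.
flow_control = pubsub_v1.types.FlowControl(max_lease_duration=600)

# Pass this object to subscriber.subscribe(..., flow_control=flow_control),
# as shown in the flow control sketch later in this section.
```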
If messages are pulled in the subscriber faster than they can be processed and acked, some messages might expire and require deadline extensions. However, if the subscriber remains overwhelmed, repeated deadline extensions eventually fail. In the worst-case scenario, this can lead to a subscriber overflowing with duplicates, exacerbating the backlog. Expiring duplicates then generate new duplicates.
To avoid overwhelming the subscriber, reduce the number of messages that the subscriber pulls at a time. This way the subscriber has fewer messages to process within the deadline. Fewer messages expire and fewer messages are redelivered.
To reduce the number of messages that the subscriber pulls at a time, you need to reduce the maximum number of outstanding messages setting in your subscriber's flow control configuration. There is no one-size-fits-all value, so you must adjust the maximum outstanding messages limit based on your throughput and subscriber capacity. Consider that each application processes messages differently and takes a different amount of time to ack a message.
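A minimal sketch of this adjustment with the Python high-level client library follows; the limit of 50 outstanding messages and the IDs are placeholders to tune for your own throughput and processing time.

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

# Allow at most 50 messages to be outstanding (pulled but not yet acked).
flow_control = pubsub_v1.types.FlowControl(max_messages=50)


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    message.ack()  # placeholder for real processing


streaming_pull_future = subscriber.subscribe(
    subscription_path, callback=callback, flow_control=flow_control
)

with subscriber:
    streaming_pull_future.result()  # Block and process messages indefinitely.
```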
Forcing retries
To force Pub/Sub to retry a message, send a nack request. If you are not using the high-level client libraries, send a modifyAckDeadline request with ackDeadlineSeconds set to 0.
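Both options might look like the following sketch with the Python client library; the IDs are placeholders.

```python
from google.cloud import pubsub_v1


# Option 1: with the high-level client library, nack inside your callback.
def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    message.nack()  # Ask Pub/Sub to redeliver this message.


# Option 2: without the high-level library, set the ack deadline to 0.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

with subscriber:
    response = subscriber.pull(
        request={"subscription": subscription_path, "max_messages": 5}
    )
    ack_ids = [received.ack_id for received in response.received_messages]
    if ack_ids:
        subscriber.modify_ack_deadline(
            request={
                "subscription": subscription_path,
                "ack_ids": ack_ids,
                "ack_deadline_seconds": 0,  # 0 forces redelivery
            }
        )
```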
Ordering keys
When Pub/Sub redelivers a message with an ordering key, it also redelivers all subsequent messages with the same ordering key, even if they were previously acknowledged. This is done to preserve the order of the sequence. However, there's no strict guarantee that dependent messages are only sent after the successful acking of prior messages in the sequence.
Subscriber is nacking the messages
See Subscriber negatively acknowledged the messages.
Troubleshooting a StreamingPull subscription
Relationship between the request latency metric and end-to-end delivery latency
For StreamingPull, the metric serviceruntime.googleapis.com/api/request_latencies represents the time for which the stream is open. The metric is not helpful for determining end-to-end delivery latency.
Instead of using the request latency metric, use the delivery latency health score to check which factors are contributing to an increased end-to-end delivery latency.
StreamingPull connections close with a non-OK status
StreamingPull streams always close with a non-OK status. Unlike an error status for unary RPCs, this status for StreamingPull is just an indication that the stream is disconnected. The requests are not failing. Therefore, while the StreamingPull API might have a surprising 100% error rate, this behavior is by design.
Since StreamingPull streams always close with an error, it isn't helpful to examine stream termination metrics while diagnosing errors. Rather, focus on the StreamingPull response metric subscription/streaming_pull_response_count, grouped by response_code or response_class.
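For example, the following sketch reads this metric from the Cloud Monitoring API grouped by response_code, using the Python client library (google-cloud-monitoring); the project ID, time window, and alignment period are placeholders.

```python
import time

from google.cloud import monitoring_v3

project_id = "my-project"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

# Sum the response rate across streams, keeping one series per response_code.
aggregation = monitoring_v3.Aggregation(
    {
        "alignment_period": {"seconds": 300},
        "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
        "cross_series_reducer": monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
        "group_by_fields": ["metric.label.response_code"],
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": 'metric.type = "pubsub.googleapis.com/subscription/streaming_pull_response_count"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        "aggregation": aggregation,
    }
)

for series in results:
    code = series.metric.labels.get("response_code", "unknown")
    for point in series.points:
        print(code, point.interval.end_time, point.value.double_value)
```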
Look for these errors:
Failed precondition errors can occur if there are messages in the subscription backlog that are encrypted with a disabled Cloud KMS key. To resume pulling, restore access to the key.
Unavailable errors can occur when Pub/Sub is unable to process a request. This is most likely a transient condition and the client library retries the requests. No action on your part is necessary if you are using a client library.
Not found errors can occur when the subscription is deleted or if it never existed in the first place. The latter case happens if you provided an invalid subscription path.