This document provides some common troubleshooting tips for Pub/Sub pull subscriptions. Read more about pull subscriptions in the Pull subscriber guide.
To monitor your Pub/Sub subscription effectively, start by looking at the delivery latency health score (subscription/delivery_latency_health_score) to check which factors contribute to unexpected or increased latency.
Oldest unacked message age keeps increasing
The oldest_unacked_message_age is a critical metric for monitoring the health of Pub/Sub subscriptions. It measures the age, in seconds, of the oldest message in a subscription's backlog that has not yet been acknowledged (acked) by a subscriber. This metric offers valuable insights into potential processing delays or bottlenecks.
Monitoring message backlog ensures timely and efficient message processing. By tracking the oldest unacked message age, you can proactively identify situations where consumers are falling behind. This practice allows for early intervention to address potential issues related to degraded performance.
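For example, the following sketch reads recent values of this metric through the Cloud Monitoring API with the Python client library (google-cloud-monitoring); the project and subscription IDs and the one-hour window are placeholders.

```python
import time

from google.cloud import monitoring_v3

project_id = "my-project"            # placeholder
subscription_id = "my-subscription"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

# List the oldest_unacked_message_age time series for one subscription.
results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": (
            'metric.type = "pubsub.googleapis.com/subscription/oldest_unacked_message_age" '
            f'AND resource.labels.subscription_id = "{subscription_id}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    for point in series.points:
        # The metric is a gauge measured in seconds.
        print(point.interval.end_time, point.value.int64_value, "seconds")
```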
Some of the common backlog issues that you can investigate include:
Client configuration issues
When both the oldest_unacked_message_age and num_undelivered_messages metrics increase simultaneously, it could mean that the subscribers are not keeping up with the message volume. In this situation, focus your investigation on the subscriber components:
Client health: Analyze resource utilization on machines hosting subscriber clients, like CPU, memory, and network bandwidth. Look for pressure points that might impede processing efficiency.
Client code: Review recent code changes and examine error logs. Bugs or inefficiencies in the subscriber code can significantly impact message processing rates. Note that there could be issues with specific messages. For example, multiple messages might need to access the same row in a database simultaneously. This behavior can lead to contention and high latency.
Quota limitations: Verify that you haven't exceeded any Pub/Sub quotas or limitations imposed by your hosting service. If the subscribers are hosted in Google Cloud, review Compute Engine or GKE resource quotas to prevent potential bottlenecks.
Subscriber negatively acknowledged the messages
When a subscriber negatively acknowledges (nacks) a message, it signals to Pub/Sub that the message couldn't be processed successfully. Pub/Sub then attempts to redeliver the same message. Repeated nacks for a message lead to duplicates and potentially a long delay in message delivery.
Note that nacking a message doesn't guarantee that the next pull fetches a different message. Pub/Sub's redelivery policy might continue to redeliver nacked messages before new ones. Therefore, don't rely on nacks as a method of filtering or skipping specific messages. Instead, set a retry policy, preferably with exponential backoff, to back off on individual messages that are likely to be processable later but need a little more time before redelivery.
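As an illustration, the following sketch attaches an exponential backoff retry policy to an existing subscription with the Python client library (google-cloud-pubsub); the project and subscription IDs and the backoff bounds are placeholder values to adjust for your workload.

```python
from google.cloud import pubsub_v1
from google.protobuf import duration_pb2

project_id = "my-project"            # placeholder
subscription_id = "my-subscription"  # placeholder

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

# Exponential backoff between redelivery attempts: 10s minimum, 600s maximum.
retry_policy = pubsub_v1.types.RetryPolicy(
    minimum_backoff=duration_pb2.Duration(seconds=10),
    maximum_backoff=duration_pb2.Duration(seconds=600),
)

update = pubsub_v1.types.Subscription(
    name=subscription_path, retry_policy=retry_policy
)

with subscriber:
    subscription = subscriber.update_subscription(
        request={"subscription": update, "update_mask": {"paths": ["retry_policy"]}}
    )

print(f"Updated retry policy on {subscription.name}")
```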
If you need to intentionally skip certain messages, the recommended approach is to ack them, even if you won't process them. This removes them from the subscription, avoids unnecessary redeliveries, and reduces resource consumption. Leaving messages unacknowledged, whether intentionally or not, creates backlog issues and duplicate deliveries.
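A minimal sketch of this pattern with the Python high-level client library follows; the skip condition (a publisher-set "skip" attribute) is purely hypothetical and stands in for your own logic.

```python
from google.cloud import pubsub_v1


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # Hypothetical skip condition: a publisher-set "skip" attribute.
    if message.attributes.get("skip") == "true":
        # Ack even though the message isn't processed, so it leaves the
        # backlog instead of being redelivered.
        message.ack()
        return

    print(f"Processing {message.message_id}")  # placeholder for real work
    message.ack()

# Pass this callback to subscriber.subscribe(subscription_path, callback=callback).
```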
High delivery latency
Delivery latency in Pub/Sub is the time it takes for a message from a publisher to reach a subscriber. Some possible causes of a high delivery latency are described in the next sections.
Not enough subscribers
For clients using StreamingPull, to achieve consistently low latency, maintain multiple open StreamingPull connections to your subscription. Without active subscriber connections, Pub/Sub cannot deliver messages promptly. A single stream can be a single point of failure, increasing the risk of delays. The subscription/open_streaming_pulls metric provides visibility into the number of active streaming connections. Use this metric to make sure you consistently have enough streams to handle incoming messages.
For clients using unary pull, to achieve consistently low latency, maintain multiple outstanding pull requests to your subscription. Infrequent requests let messages accumulate in the backlog, which increases latency. Keeping multiple requests outstanding minimizes gaps in connectivity and improves delivery latency.
The high-level client library is recommended for cases where you require high throughput and low latency with minimal operational overhead and processing cost. By default, the high-level client library uses the StreamingPull API, since it tends to be a better choice for minimizing latency. The high-level client libraries contain prebuilt functions and classes that handle the underlying API calls for authentication, throughput and latency optimization, message formatting, and other features.
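For reference, a minimal receive loop with the high-level Python client library, which uses StreamingPull under the hood, might look like the following sketch; the project and subscription IDs and the timeout are placeholders. Running additional subscriber clients, for example in separate processes or on separate machines, is one way to keep multiple streams open.

```python
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

project_id = "my-project"            # placeholder
subscription_id = "my-subscription"  # placeholder

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    print(f"Received {message.message_id}")
    message.ack()


# Opens a StreamingPull connection and dispatches messages to the callback.
streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)

with subscriber:
    try:
        # Block until an error occurs or the timeout elapses.
        streaming_pull_future.result(timeout=60)
    except TimeoutError:
        streaming_pull_future.cancel()
        streaming_pull_future.result()  # Wait for the cancellation to complete.
```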
Client configuration issues
See Client configuration issues.
High backlog
Note that a backlog of unacked messages in a Pub/Sub subscription inherently increases end-to-end latency because messages are not processed immediately by subscribers.
Ordering keys and exactly-once delivery
Ordering keys and exactly-once delivery are valuable features, but they require additional coordination within Pub/Sub to ensure correct delivery. This coordination can reduce availability and increase latency. While the difference is minimal in the steady state, any necessary coordination steps could result in temporary increases in latency or an increased error rate. If ordering is enabled, messages with an ordering key cannot be delivered until earlier messages with the same ordering key are acked.
Consider whether message ordering or exactly-once delivery are absolutely essential for your application. If low latency is your top priority, minimizing the use of these features could help reduce message processing delays.
Increase in message size
A sudden increase in message size can increase the transfer time between Pub/Sub and your client and slow down message processing on the client side.
If you observe an increase in delivery latency, check message sizes using the topic/message_sizes metric, grouped by topic_id. Correlate any spikes in message size with observed performance issues.
Missing messages
If you suspect messages are not being successfully delivered to your subscriber, one of the following reasons might be a contributing factor.
Message distribution in Pub/Sub subscriptions with multiple consumers
In Pub/Sub, messages might be distributed unevenly across consumers. This behavior occurs because Pub/Sub distributes messages among active consumers for efficiency. Sometimes, a single consumer might receive fewer messages than expected, or a different subset of messages than other consumers.
Note that messages might already be outstanding to other clients, so a backlog of unacknowledged messages doesn't necessarily mean you'll receive those messages on your next pull request. Be aware that a consumer might be someone using pull in the Google Cloud console or Google Cloud CLI, or someone running a custom subscriber locally to check messages.
For unary pull clients, you might observe some pull requests returning zero messages. As discussed in the Not enough subscribers section, it is recommended to maintain multiple outstanding pull requests, since some requests can return fewer than the configured maximum number of messages, or even zero messages.
If you suspect any of these behaviors, investigate whether there are multiple consumers concurrently attached to the subscription and inspect them.
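To see the zero-message behavior described above for unary pull, the following sketch issues a single synchronous pull with the Python client library; the project and subscription IDs are placeholders, and in practice you would keep several such requests outstanding.

```python
from google.cloud import pubsub_v1

project_id = "my-project"            # placeholder
subscription_id = "my-subscription"  # placeholder

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

with subscriber:
    response = subscriber.pull(
        request={"subscription": subscription_path, "max_messages": 10}
    )

    if not response.received_messages:
        # A zero-message response doesn't necessarily mean the backlog is empty.
        print("No messages returned on this request.")

    ack_ids = []
    for received in response.received_messages:
        print(f"Received {received.message.message_id}")
        ack_ids.append(received.ack_id)

    if ack_ids:
        subscriber.acknowledge(
            request={"subscription": subscription_path, "ack_ids": ack_ids}
        )
```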
Filter on the subscription
Check if the subscription has a filter attached to it. If so, you only receive the messages that match the filter. The Pub/Sub service automatically acknowledges the messages that don't match the filter. Consider how filters affect backlog metrics.
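One way to check for a filter is to read the subscription's configuration, for example with the Python client library as in the following sketch; the IDs are placeholders.

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

with subscriber:
    subscription = subscriber.get_subscription(
        request={"subscription": subscription_path}
    )

# An empty string means the subscription has no filter attached.
print(f"Filter: {subscription.filter!r}")
```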
Using the option returnImmediately
If your client is using unary pull, check whether the returnImmediately field is set to true. This is a deprecated field that tells the Pub/Sub service to respond to the pull request immediately, even if there are no messages to return. As a result, pull requests can return 0 messages even when there is a backlog.
Dealing with duplicates
Message duplication in Pub/Sub often occurs when subscribers can't ack messages within the ack deadline. This causes the messages to be redelivered, creating the impression of duplicates. You can measure the rate at which subscribers miss the ack deadline using the subscription/expired_ack_deadlines_count metric. Learn more about how to Monitor acknowledgment deadline expiration.
To reduce the duplication rate, extend the message acknowledgment deadline:
- Client libraries handle deadline extension automatically, but there are default limits on the maximum extension that you can configure (see the sketch after this list).
- If you are building your own client library, use the modifyAckDeadline method to extend the acknowledgment deadline.
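For instance, with the Python high-level client library the total extension window is part of the flow control settings; the sketch below assumes the library's FlowControl max_lease_duration option and uses an arbitrary placeholder value.

```python
from google.cloud import pubsub_v1

# Cap how long the client library keeps extending a message's ack deadline.
# max_lease_duration is assumed here; 600 seconds is an arbitrary placeholder.
flow_control = pubsub_v1.types.FlowControl(max_lease_duration=600)

# Pass this object to subscriber.subscribe(..., flow_control=flow_control),
# as shown in the flow control sketch later in this section.
```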
If messages are pulled in the subscriber faster than they can be processed and acked, some messages might expire and require deadline extensions. However, if the subscriber remains overwhelmed, repeated deadline extensions eventually fail. In the worst-case scenario, this can lead to a subscriber overflowing with duplicates, exacerbating the backlog. Expiring duplicates then generate new duplicates.
To avoid overwhelming the subscriber, reduce the number of messages that the subscriber pulls at a time. This way the subscriber has fewer messages to process within the deadline. Fewer messages expire and fewer messages are redelivered.
To reduce the number of messages that the subscriber pulls at a time, you need to reduce the maximum number of outstanding messages setting in your subscriber's flow control configuration. There is no one-size-fits-all value, so you must adjust the maximum outstanding messages limit based on your throughput and subscriber capacity. Consider that each application processes messages differently and takes a different amount of time to ack a message.
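A minimal sketch of this adjustment with the Python high-level client library follows; the limit of 50 outstanding messages and the IDs are placeholders to tune for your own throughput and processing time.

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

# Allow at most 50 messages to be outstanding (pulled but not yet acked).
flow_control = pubsub_v1.types.FlowControl(max_messages=50)


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    message.ack()  # placeholder for real processing


streaming_pull_future = subscriber.subscribe(
    subscription_path, callback=callback, flow_control=flow_control
)

with subscriber:
    streaming_pull_future.result()  # Block and process messages indefinitely.
```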
Forcing retries
To force Pub/Sub to retry a message, send a nack request. If you are not using the high-level client libraries, send a modifyAckDeadline request with ackDeadlineSeconds set to 0.
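Both options might look like the following sketch with the Python client library; the IDs are placeholders.

```python
from google.cloud import pubsub_v1


# Option 1: with the high-level client library, nack inside your callback.
def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    message.nack()  # Ask Pub/Sub to redeliver this message.


# Option 2: without the high-level library, set the ack deadline to 0.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

with subscriber:
    response = subscriber.pull(
        request={"subscription": subscription_path, "max_messages": 5}
    )
    ack_ids = [received.ack_id for received in response.received_messages]
    if ack_ids:
        subscriber.modify_ack_deadline(
            request={
                "subscription": subscription_path,
                "ack_ids": ack_ids,
                "ack_deadline_seconds": 0,  # 0 forces redelivery
            }
        )
```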
Ordering keys
When Pub/Sub redelivers a message with an ordering key, it also redelivers all subsequent messages with the same ordering key, even if they were previously acknowledged. This is done to preserve the order of the sequence. However, there's no strict guarantee that dependent messages are only sent after the successful acking of prior messages in the sequence.
Subscriber is nacking the messages
See Subscriber negatively acknowledged the messages.
Troubleshooting a StreamingPull subscription
Relationship between the request latency metric and end-to-end delivery latency
For StreamingPull, the metric serviceruntime.googleapis.com/api/request_latencies represents the time for which the stream is open. The metric is not helpful for determining end-to-end delivery latency.
Instead of using the request latency metric, use the delivery latency health score to check which factors are contributing to an increased end-to-end delivery latency.
StreamingPull connections close with a non-OK status
StreamingPull streams always close with a non-OK status. Unlike an error status for unary RPCs, this status for StreamingPull is just an indication that the stream is disconnected. The requests are not failing. Therefore, while the StreamingPull API might have a surprising 100% error rate, this behavior is by design.
Since StreamingPull streams always close with an error, it isn't helpful to examine stream termination metrics while diagnosing errors. Rather, focus on the StreamingPull response metric subscription/streaming_pull_response_count, grouped by response_code or response_class.
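For example, the following sketch reads this metric from the Cloud Monitoring API grouped by response_code, using the Python client library (google-cloud-monitoring); the project ID, time window, and alignment period are placeholders.

```python
import time

from google.cloud import monitoring_v3

project_id = "my-project"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

# Sum the response rate across streams, keeping one series per response_code.
aggregation = monitoring_v3.Aggregation(
    {
        "alignment_period": {"seconds": 300},
        "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
        "cross_series_reducer": monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
        "group_by_fields": ["metric.label.response_code"],
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": 'metric.type = "pubsub.googleapis.com/subscription/streaming_pull_response_count"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        "aggregation": aggregation,
    }
)

for series in results:
    code = series.metric.labels.get("response_code", "unknown")
    for point in series.points:
        print(code, point.interval.end_time, point.value.double_value)
```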
Look for these errors:
Failed precondition errors can occur if there are messages in the subscription backlog that are encrypted with a disabled Cloud KMS key. To resume pulling, restore access to the key.
Unavailable errors can occur when Pub/Sub is unable to process a request. This is most likely a transient condition and the client library retries the requests. No action on your part is necessary if you are using a client library.
Not found errors can occur when the subscription is deleted or if it never existed in the first place. The latter case happens if you provided an invalid subscription path.