The Pub/Sub API exports metrics via Cloud Monitoring. Cloud Monitoring lets you create monitoring dashboards and alerting policies, or access the metrics programmatically.
Viewing metrics
To view Cloud Monitoring dashboards or to define alerting policies, go to Monitoring in the Google Cloud console.
You can also use the Cloud Monitoring API to query and view metrics for subscriptions and topics.
Metrics and resource types
- To see the metrics that Pub/Sub reports to Cloud Monitoring, view the Pub/Sub Metrics List in the Cloud Monitoring documentation.
- To see the details for the pubsub_topic, pubsub_subscription, or pubsub_snapshot monitored resource types, view Monitored Resource Types in the Cloud Monitoring documentation.
Ensuring you don't run out of quota
For a given project, you can use the IAM & admin quotas dashboard to view current quotas and usage.
You can view your historical quota usage using quota metrics such as the following:
- serviceruntime.googleapis.com/quota/rate/net_usage
- serviceruntime.googleapis.com/quota/limit
These metrics use the consumer_quota monitored resource type. For more quota-related metrics, see the Metrics List.
For example, the following Monitoring Query Language query creates a chart with the fraction of publisher quota being used in each region:
fetch consumer_quota
| filter resource.service == 'pubsub.googleapis.com'
| { metric serviceruntime.googleapis.com/quota/rate/net_usage
| filter metric.quota_metric == 'pubsub.googleapis.com/regionalpublisher'
| align delta_gauge(1m)
| group_by [metric.quota_metric, resource.location],
sum(value.net_usage)
; metric serviceruntime.googleapis.com/quota/limit
| filter metric.quota_metric == 'pubsub.googleapis.com/regionalpublisher'
| group_by [metric.quota_metric, resource.location],
sliding(1m), max(val()) }
| ratio
If you anticipate your usage exceeding the default quota limits, create alerting policies for all the relevant quotas. These alerts should fire when your usage reaches some fraction of the limit. For example, the following Monitoring Query Language query will trigger your alerting policy when any Pub/Sub quota exceeds 80% usage:
fetch consumer_quota
| filter resource.service == 'pubsub.googleapis.com'
| { metric serviceruntime.googleapis.com/quota/rate/net_usage
| align delta_gauge(1m)
| group_by [metric.quota_metric, resource.location],
sum(value.net_usage)
; metric serviceruntime.googleapis.com/quota/limit
| group_by [metric.quota_metric, resource.location],
sliding(1m), max(val()) }
| ratio
| every 1m
| condition gt(val(), 0.8 '1')
For more customized monitoring of and alerting on quota metrics, see Using quota metrics.
See Quotas and limits for more information about quotas.
Keeping subscribers healthy
Monitoring the backlog
To ensure that your subscribers are keeping up with the flow of messages, create a dashboard that shows the following backlog metrics, aggregated by resource, for all your subscriptions:
- subscription/num_undelivered_messages to see the number of unacknowledged messages
- subscription/oldest_unacked_message_age to see the age of the oldest unacknowledged message in the subscription's backlog
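For example, the following Monitoring Query Language query (a minimal sketch; adjust the aggregation to suit your dashboard) creates a chart with the backlog size of each subscription in the project. A similar query on subscription/oldest_unacked_message_age charts the backlog age:
# Backlog size (unacknowledged messages) per subscription
fetch pubsub_subscription
| metric 'pubsub.googleapis.com/subscription/num_undelivered_messages'
| group_by [resource.subscription_id], max(val())
| every 1m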
Create alerting policies that will fire when these values are unusually large in the context of your system. For instance, the absolute number of unacknowledged messages is not necessarily meaningful. A backlog of a million messages might be acceptable for a million message-per-second subscription, but unacceptable for a one message-per-second subscription.
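As a sketch, an alerting policy for backlog age could use a query like the following; the 30-minute threshold is only an example and should be tuned to your system:
# Fire when the oldest unacknowledged message in any subscription is older than 30 minutes (example threshold)
fetch pubsub_subscription
| metric 'pubsub.googleapis.com/subscription/oldest_unacked_message_age'
| group_by [resource.subscription_id], max(val())
| every 1m
| condition gt(val(), 1800 's')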
Symptoms | Problem | Solutions |
---|---|---|
Both the oldest_unacked_message_age and num_undelivered_messages are growing in tandem. | Subscribers are not keeping up with the message volume. | Add more subscriber threads or machines so that subscribers can keep up with the message volume, and check for bugs that prevent your code from acknowledging messages promptly. |
A steady, small backlog size combined with a steadily growing oldest_unacked_message_age may indicate a small number of messages that cannot be processed. | Stuck messages | Examine your application logs to understand whether some messages are causing your code to crash. It's unlikely, but possible, that the offending messages are stuck on Pub/Sub rather than in your client. Raise a support case once you are confident your code successfully processes each message. |
The oldest_unacked_message_age exceeds the subscription's message retention duration. | Permanent data loss | Set up an alert that fires well in advance of the subscription's message retention duration lapsing. |
Monitoring ack deadline expiration
To reduce the end-to-end latency of message delivery, Pub/Sub gives subscriber clients a limited amount of time, known as the "ack deadline", to acknowledge a given message before redelivering it. If your subscribers take too long to acknowledge messages, Pub/Sub redelivers them and the subscribers see duplicate messages. This can happen for a number of reasons:
- Your subscribers are under-provisioned (you need more threads or machines).
- Each message takes longer to process than the message acknowledgement deadline. Google Cloud Client Libraries generally extend the deadline for individual messages up to a configurable maximum. However, a maximum extension deadline is also in effect for the libraries.
- Some messages consistently crash the client.
It can be useful to measure the rate at which subscribers miss the ack deadline. The specific metric depends on the subscription type:
- Pull and StreamingPull: subscription/expired_ack_deadlines_count
- Push: subscription/push_request_count filtered by response_code != "success"
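For example, the following Monitoring Query Language query (a sketch, where '$SUBSCRIPTION' is a placeholder for your subscription ID) creates a chart with the rate of expired ack deadlines for a pull or StreamingPull subscription:
# Rate of ack deadline expirations for one subscription
fetch pubsub_subscription
| metric 'pubsub.googleapis.com/subscription/expired_ack_deadlines_count'
| filter (resource.subscription_id == '$SUBSCRIPTION')
| align rate(1m)
| every 1m
| group_by [], sum(val())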
Excessive ack deadline expiration rates can result in costly inefficiencies in your system. You pay for every redelivery and for attempting to process each message repeatedly. Conversely, a small expiration rate (for example, 0.1-1%) might be healthy.
Monitoring message throughput
Pull and StreamingPull subscribers may receive batches of messages in each pull response; push subscriptions receive a single message in each push request. You can monitor the batch message throughput being processed by your subscribers with these metrics:
- Pull: subscription/pull_request_count (note that this metric may also include pull requests that returned no messages)
- StreamingPull: subscription/streaming_pull_response_count
You can monitor the individual (i.e. unbatched) message throughput being processed by your subscribers with this metric:
- Pull and StreamingPull: subscription/sent_message_count filtered by the delivery_type of interest
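For example, the following Monitoring Query Language query (a sketch; '$SUBSCRIPTION' is a placeholder for your subscription ID, and the 'pull' delivery type is only an example) creates a chart with the rate of individual messages sent to a subscription:
# Individual (unbatched) message throughput for one subscription
fetch pubsub_subscription
| metric 'pubsub.googleapis.com/subscription/sent_message_count'
| filter (resource.subscription_id == '$SUBSCRIPTION')
| filter metric.delivery_type == 'pull'  # example value; use the delivery type of interest
| align rate(1m)
| every 1m
| group_by [], sum(val())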
Monitoring push subscriptions
For push subscriptions, you should also monitor these metrics:
- subscription/push_request_count: Group the metric by response_code and subscription_id (see the sample query after this list). Since Pub/Sub push subscriptions use response codes as implicit message acknowledgements, it is important to monitor push request response codes. Because push subscriptions exponentially back off when they encounter timeouts or errors, your backlog can grow quickly based on how your endpoint responds. Consider setting an alert for high error rates (create a metric filtered by response class), since those rates lead to slower delivery and a growing backlog. However, push request counts are likely to be more useful as a tool for investigating growing backlog size and age.
- subscription/num_outstanding_messages: Pub/Sub generally limits the number of outstanding messages. You should aim for fewer than 1000 outstanding messages in most situations. As a rule, once the throughput reaches a rate on the order of ten thousand messages per second, the service adjusts the limit in increments of 1000 based on the overall throughput of the subscription. No specific guarantees are made beyond the maximum value, so 1000 is a good guide.
- subscription/push_request_latencies: This metric helps you understand your push endpoint's response latency distribution. Because of the limit on the number of outstanding messages, endpoint latency affects subscription throughput. If it takes 100 milliseconds to process each message, your throughput limit is likely to be 10 messages per second.
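For example, the following Monitoring Query Language query (a sketch; '$SUBSCRIPTION' is a placeholder for your subscription ID) creates a chart with the push request rate broken down by response code, which makes error responses from your endpoint easy to spot:
# Push request rate per response code for one subscription
fetch pubsub_subscription
| metric 'pubsub.googleapis.com/subscription/push_request_count'
| filter (resource.subscription_id == '$SUBSCRIPTION')
| align rate(1m)
| every 1m
| group_by [metric.response_code], sum(val())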
To access higher outstanding message limits, push subscribers must acknowledge more than 99% of the messages they receive.
You can calculate the fraction of messages that subscribers acknowledge using the Monitoring Query Language. The following MQL query creates a chart with the fraction of messages that subscribers acknowledge on a subscription:
fetch pubsub_subscription
| metric 'pubsub.googleapis.com/subscription/push_request_count'
| filter
(resource.subscription_id == '$SUBSCRIPTION')
| filter_ratio_by [], metric.response_class == 'ack'
| every 1m
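To alert on this fraction rather than chart it, you can extend the same query with a condition; the 0.99 threshold below mirrors the 99% guidance above and is only an example:
# Fire when less than 99% of push requests are acknowledged (example threshold)
fetch pubsub_subscription
| metric 'pubsub.googleapis.com/subscription/push_request_count'
| filter (resource.subscription_id == '$SUBSCRIPTION')
| filter_ratio_by [], metric.response_class == 'ack'
| every 1m
| condition lt(val(), 0.99 '1')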
Monitoring subscriptions with filters
Pub/Sub automatically acknowledges the messages that don't match a filter. You can monitor the number, size, and cost of these messages.
To monitor the number of messages that don't match a filter, use the subscription/ack_message_count metric with the delivery_type label and the filter value.
To monitor the size and cost of messages that don't match a filter, use the subscription/byte_cost metric with the operation_type label and the filter_drop value. For more information about the fees for these messages, see the Pub/Sub pricing page.
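For example, the following Monitoring Query Language query (a sketch) creates a chart with the per-subscription byte rate of messages dropped because they don't match the filter; an analogous query on the subscription/ack_message_count metric, filtered to the filter delivery type, charts the message count instead:
# Byte rate of messages dropped by subscription filters
fetch pubsub_subscription
| metric 'pubsub.googleapis.com/subscription/byte_cost'
| filter metric.operation_type == 'filter_drop'
| align rate(1m)
| every 1m
| group_by [resource.subscription_id], sum(val())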
Monitoring forwarded undeliverable messages
To monitor undeliverable messages that Pub/Sub forwards to a dead-letter topic, use the subscription/dead_letter_message_count metric. This metric shows the number of undeliverable messages that Pub/Sub forwards from a subscription.
To verify that Pub/Sub is forwarding undeliverable messages, you can compare the subscription/dead_letter_message_count metric with the topic/send_request_count metric for the dead-letter topic that Pub/Sub forwards these messages to.
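For example, the following Monitoring Query Language query (a sketch; '$SUBSCRIPTION' is a placeholder for the subscription that has the dead-letter topic attached) creates a chart with the rate at which Pub/Sub forwards undeliverable messages:
# Rate of messages forwarded to the dead-letter topic
fetch pubsub_subscription
| metric 'pubsub.googleapis.com/subscription/dead_letter_message_count'
| filter (resource.subscription_id == '$SUBSCRIPTION')
| align rate(1m)
| every 1m
| group_by [], sum(val())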
You can also attach a subscription to the dead-letter topic and then monitor the forwarded undeliverable messages on this subscription via the following metrics:
- subscription/num_undelivered_messages: the number of forwarded messages that have accumulated in the subscription
- subscription/oldest_unacked_message_age: the age of the oldest forwarded message in the subscription
Keeping publishers healthy
The primary goal of a publisher is to persist message data quickly. Monitor this performance using topic/send_request_count, grouped by response_code. This metric gives you an indication of whether Pub/Sub is healthy and accepting requests.
A background rate of retryable errors (significantly lower than 1%) should not be a cause for concern, since most Google Cloud Client Libraries retry message failures. You should investigate error rates that are greater than 1%. Because non-retryable codes are handled by your application (rather than by the client library), you should examine response codes. If your publisher application does not have a good way of signaling an unhealthy state, consider setting an alert on the topic/send_request_count metric.
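For example, the following Monitoring Query Language query (a sketch; '$TOPIC' is a placeholder for your topic ID) creates a chart with the publish request rate broken down by response code, which can also serve as the basis for such an alert:
# Publish request rate per response code for one topic
fetch pubsub_topic
| metric 'pubsub.googleapis.com/topic/send_request_count'
| filter (resource.topic_id == '$TOPIC')
| align rate(1m)
| every 1m
| group_by [metric.response_code], sum(val())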
It is equally important to track failed publish requests in your publish client. While client libraries generally retry failed requests, they do not guarantee publication. Refer to Publishing messages for ways to detect permanent publish failures when using Google Cloud Client Libraries. At a minimum, your publisher application should log permanent publish errors. If you log those errors to Cloud Logging, you can set up a logs-based metric with an alerting policy.
Monitoring message throughput
Publishers may send messages in batches. You can monitor the message throughput being sent by your publishers with these metrics:
- topic/send_request_count: the volume of batch messages being sent by publishers
- A count of topic/message_sizes: the volume of individual (i.e. unbatched) messages being sent by publishers. You can calculate a count of messages being sent by applying a count aggregator to this metric, or by using the Monitoring Query Language. The following MQL query creates a chart with the volume of individual messages sent on a topic:
fetch pubsub_topic
| metric 'pubsub.googleapis.com/topic/message_sizes'
| filter (resource.topic_id == '$TOPIC')
| align delta(1m)
| every 1m
| group_by [], [row_count: row_count()]