Go to Stackdriver in the Google Cloud Platform Console to view Stackdriver monitoring dashboards or to define Stackdriver alerts. You can also use the Stackdriver Monitoring API to query and view metrics for subscriptions and topics.
Metrics and resource types
- To see the usage metrics that Cloud Pub/Sub reports to Stackdriver, view the Metrics List in the Stackdriver documentation.
- To see the details for the pubsub_snapshot monitored resource type, view Monitored Resource Types in the Stackdriver documentation.
Monitor topic or subscription quota utilization
You can use the APIs and services quotas dashboard to monitor the current utilization for a given topic or subscription.
Note that these metrics are reported in bytes, whereas quota is measured in kilobytes.
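Because the metrics report bytes while quota is measured in kilobytes, convert before comparing the two. The sketch below illustrates the conversion; the metric value and quota figure are made up, and it assumes 1 kilobyte = 1,000 bytes:

```python
def quota_utilization(metric_bytes: int, quota_kb: int) -> float:
    """Return utilization as a fraction, converting the byte-based
    metric to kilobytes to match the quota's unit (1 kB = 1,000 bytes)."""
    return (metric_bytes / 1000) / quota_kb

# Example: 450 MB published against a hypothetical 600,000 kB quota.
utilization = quota_utilization(450_000_000, 600_000)
print(f"{utilization:.0%}")  # 75%
```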
Keeping your subscribers healthy
Monitoring the backlog
To ensure that your subscribers are keeping up with the flow of messages, create a dashboard that shows the following metrics, aggregated by resource, for all your subscriptions:
Create alerts that will fire when these values are unusually large in the context of your system. For instance, the absolute number of undelivered messages is not necessarily meaningful. A backlog of a million messages might be acceptable for a million message-per-second subscription, but highly problematic for a one message-per-second subscription.
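One way to make "unusually large" concrete is to divide the backlog by the subscription's acknowledgement rate, giving an estimate of how many seconds of queued work the backlog represents. This is an illustrative heuristic, not part of any Cloud Pub/Sub API:

```python
def backlog_seconds(undelivered_messages: int, ack_rate_per_sec: float) -> float:
    """Estimate seconds of queued work: backlog divided by processing rate."""
    if ack_rate_per_sec <= 0:
        return float("inf")  # nothing is being processed: treat as unbounded
    return undelivered_messages / ack_rate_per_sec

# A million-message backlog is one second of work at 1M msg/s ...
print(backlog_seconds(1_000_000, 1_000_000))  # 1.0
# ... but over eleven days of work at 1 msg/s.
print(backlog_seconds(1_000_000, 1))  # 1000000.0
```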
| Symptom | Mitigation |
| --- | --- |
| Subscribers not keeping up with message volume | Add more subscriber threads or machines so that processing capacity matches the incoming message rate. |
| Stuck messages | If there is a steady, small backlog size combined with a steadily growing oldest unacknowledged message age, a small number of messages may be impossible to process. Examine your application logs to understand whether some messages are causing your code to crash. It's unlikely, but possible, that the offending messages are stuck on Cloud Pub/Sub rather than in your client. Raise a support case once you are confident your code successfully processes each message. |
| Permanent data loss | Set up an alert that fires well in advance of the subscription's message retention time. |
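For the permanent data loss row, the alert threshold can be derived from the subscription's message retention period (commonly seven days). The 75% margin below is an arbitrary illustration, not a recommendation from the service:

```python
SEVEN_DAYS = 7 * 24 * 3600  # a common message retention period, in seconds

def alert_threshold(retention_seconds: int, margin: float = 0.75) -> int:
    """Fire when the oldest unacked message reaches a chosen fraction of
    the retention period, leaving time to react before data is dropped."""
    return int(retention_seconds * margin)

print(alert_threshold(SEVEN_DAYS))  # 453600 (5.25 days)
```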
Monitoring message expiration
If your subscribers take too long to acknowledge a message, it expires and is re-delivered. This can happen for a number of reasons:
- Your subscriptions are under-provisioned (you need more threads or more machines).
- Each message takes longer to process than the maximum message acknowledgement deadline. While Google Cloud Platform Client Libraries generally extend this deadline automatically for individual messages, a maximum extension deadline applies there as well.
- Some messages consistently crash the client.
It can be useful to measure the rate at which messages expire. The specific metric depends on the subscription type:
- response_code != "success"
- response_class != "success"
Excessive message expiration rates can indicate costly inefficiencies in your system. You pay for every redelivery and also for attempting to process each message repeatedly. Conversely, a small expiration rate (for example, 0.1-1%) might be healthy, since minimizing overall end-to-end latency of data processing requires aggressively expiring messages to avoid them being stuck on subscriber client instances.
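Given ack-operation counts over a monitoring window, the expiration rate and a health check against the 0.1-1% band can be computed as follows; the function names and thresholds here are illustrative, not part of the service:

```python
def expiration_rate(expired_ops: int, total_ops: int) -> float:
    """Fraction of ack operations whose deadline expired in the window."""
    return expired_ops / total_ops if total_ops else 0.0

def is_healthy(rate: float, high: float = 0.01) -> bool:
    """A small, nonzero rate (roughly 0.1-1%) is often acceptable."""
    return rate <= high

rate = expiration_rate(expired_ops=40, total_ops=10_000)
print(rate, is_healthy(rate))  # 0.004 True
```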
Monitoring push subscribers
For push subscriptions, you should also monitor these metrics:
Group the request metric by subscription_id. Since Cloud Pub/Sub push subscriptions use response codes as implicit message acknowledgements, it is important to monitor push request response codes. Because push subscriptions exponentially back off when they encounter timeouts or error codes, your backlog can grow quickly based on how the webhook responds.
It might be advisable to set an alert for high error rates (create a metric filtered by response class), since those rates lead to slower delivery and a growing backlog. However, push request counts are likely to be more useful as a tool for investigating growing backlog size and age.
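To see how quickly an erroring webhook can inflate the backlog, the following simulation contrasts a healthy endpoint with one whose failures trigger exponential backoff. The backoff parameters and batch size are invented for illustration and do not reflect Cloud Pub/Sub's actual retry policy:

```python
def simulate_backlog(seconds: float, publish_rate: float, batch: int = 100,
                     error: bool = True, base: float = 1.0, cap: float = 64.0) -> int:
    """Backlog size after `seconds`, when each push attempt delivers up to
    `batch` messages and errors double the delay between attempts (capped)."""
    backlog, t, delay = 0.0, 0.0, base
    while t < seconds:
        backlog += publish_rate * delay          # arrivals during the wait
        backlog = max(0.0, backlog - batch)      # one attempt's deliveries
        t += delay
        delay = min(delay * 2, cap) if error else base
    return int(backlog)

# Healthy webhook: an attempt every second keeps up with 100 msg/s.
print(simulate_backlog(120, 100, error=False))  # 0
# Erroring webhook: backoff throttles attempts and the backlog balloons.
print(simulate_backlog(120, 100, error=True))  # 12000
```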
Cloud Pub/Sub generally limits the number of outstanding messages. You should aim to keep this number under 1,000 in most situations. Generally, the service adjusts the limit in increments of 1,000, based on the overall throughput of the subscription, once throughput reaches on the order of ten thousand messages per second. No specific guarantees are made beyond the minimum value, so 1,000 is a good guide.
This metric helps you understand your messages' latency distribution. Because of the limit on the number of outstanding messages, webhook latency affects subscription throughput. If it takes 100 seconds to process each message, your throughput limit is likely to be 10 messages per second.
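The relationship between the outstanding-message limit, webhook latency, and throughput is Little's law; the numbers here mirror the example above:

```python
def max_throughput(outstanding_limit: int, latency_seconds: float) -> float:
    """Little's law: concurrency divided by latency bounds throughput."""
    return outstanding_limit / latency_seconds

# 1,000 outstanding messages at 100 s each caps throughput at 10 msg/s.
print(max_throughput(1000, 100))  # 10.0
```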
Keeping publishers healthy
The primary goal of a publisher is to persist message data quickly. Monitor this with the
topic/send_request_count metric, grouped by
response_code. This metric gives you an indication of whether
Cloud Pub/Sub is healthy and accepting requests. A background
rate of retryable errors (significantly lower than 1%) is routine, since most
GCP Client Libraries retry message failures. You should
investigate error rates that are greater than 1%. Response codes are important
to examine (as a grouping dimension
for the metric) since non-retryable codes are handled by your
application rather than the client library. If your publisher application
does not have a good way of signaling an unhealthy state, it might be advisable
to alert on increased publish request error counts.
It is equally important to track failed publish requests in your publish client. While client libraries generally retry failed requests, they do not guarantee publication. Refer to Publishing messages for ways to detect permanent publish failures when using GCP Client Libraries. At a minimum, your publisher should log permanent publish errors. If you log those errors to Stackdriver Logging, you can set up a logs-based metric with an alert.
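As a sketch of the "log permanent publish errors" advice, a thin wrapper around a publish callable can do the logging. Here, publish_fn is a stand-in for whatever client-library call you use (for example, a function that invokes publish and blocks on the returned future, which raises on permanent failure):

```python
import logging

logger = logging.getLogger("publisher")

def publish_with_logging(publish_fn, data: bytes):
    """Invoke a publish callable and log, rather than swallow, permanent
    failures, so a logs-based metric and alert can be built on them."""
    try:
        return publish_fn(data)
    except Exception:
        logger.exception("permanent publish failure for %d-byte message",
                         len(data))
        raise
```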