Dataflow job metrics

You can view charts in the Job metrics tab of the Dataflow web interface. Each metric is organized into the following dashboards:

Overview metrics

Streaming metrics (streaming pipelines only)

Resource metrics

Input metrics

Output metrics

To access additional information in these charts, click the Expand chart legend toggle.

The legend toggle button is located near the Create alerting policy
button.

Some of these charts are specific to streaming pipelines. For a list of scenarios where these metrics can be useful for debugging, see the using streaming Dataflow monitoring metrics section.

To write metrics data, a user-managed service account requires the IAM permission monitoring.timeSeries.create, which is included with the Dataflow Worker role. For more details about usage of these metrics outside of the Dataflow web interface, see the complete list of Dataflow metrics.

The Dataflow service reports Reserved CPU Time after jobs complete. For unbounded (streaming) jobs, Reserved CPU Time is reported only after a job is cancelled or fails. Therefore, the job metrics don't include Reserved CPU Time for running streaming jobs.

Stage and worker metrics

The following sections provide details about the stage and worker metrics available in the monitoring interface.

Autoscaling

The Dataflow service automatically chooses the number of worker instances required to run your autoscaling job. The number of worker instances can change over time according to the job requirements.

A data visualization showing number of workers in a pipeline.

To see the history of autoscaling changes, click the More History button. A table with information about the worker history of your pipeline displays.

A table showing the worker history of a pipeline.

Throughput

Throughput is the volume of data that is processed at any point in time. This per-step metric is displayed as the number of elements per second. To see this metric in bytes per second, view the Throughput (bytes/sec) chart farther down the page.

A data visualization showing throughput of four steps in a
pipeline.

Worker error log count

The Worker error log count shows you the rate of errors observed across all workers at any point in time.

A summary of each logged error and the number of times it occurred.

Data freshness (with and without Streaming Engine)

The data freshness metric shows the difference in seconds between the timestamp on the data element and the time that the event is processed in your pipeline. The data element receives a timestamp when an event occurs on the element, such as a click event on a website or ingestion by Pub/Sub. The output watermark is the time that the data is processed.

At any time, the Dataflow job is processing multiple elements. The data points in the data freshness chart show the element with the largest delay relative to its event time. Therefore, the same line in the chart displays data for multiple elements. Each data point in the line displays data for the slowest element at that stage in the pipeline.
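As a sketch of this definition (a hypothetical helper, not Dataflow's internal implementation), the freshness value at any point in time is driven by the oldest event timestamp that has not yet been processed:

```python
# Hypothetical sketch of the data freshness definition; not Dataflow's
# internal implementation.

def data_freshness_seconds(now, unprocessed_event_timestamps):
    """Freshness is driven by the slowest (oldest) unprocessed element.

    now: current processing time, in seconds since the epoch.
    unprocessed_event_timestamps: event times of elements the stage has
    not yet processed, in seconds since the epoch.
    """
    if not unprocessed_event_timestamps:
        return 0.0  # nothing pending, so the watermark can catch up to now
    # The chart plots the largest delay: the oldest pending element.
    return now - min(unprocessed_event_timestamps)

# At t=1000, elements with event times 940, 990, and 998 are pending.
print(data_freshness_seconds(1000, [940, 990, 998]))  # -> 60
```

Even though many elements are in flight, only the slowest one (event time 940) determines the plotted value.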

If some input data has not yet been processed, the output watermark might be delayed, which affects data freshness. A significant difference between the watermark time and the event time might indicate a slow or stuck operation.

For recently updated streaming jobs, job state and watermark information might be unavailable. The Update operation makes several changes that take a few minutes to propagate to the Dataflow monitoring interface. Try refreshing the monitoring interface 5 minutes after updating your job.

For more information, see Watermarks and late data in the Apache Beam documentation.

The dashboard includes the following two charts:

  • Data freshness by stages
  • Data freshness

A data visualization showing data freshness in a
streaming pipeline.

In the preceding image, the highlighted area shows a substantial difference between the event time and the output watermark time, indicating a slow operation.

High data freshness values (that is, values indicating that data is less fresh) might be caused by:

  • Performance Bottlenecks: If your pipeline has stages with high system latency or logs indicating stuck transforms, the pipeline might have performance issues that could raise data freshness. Follow these guidelines for troubleshooting slow pipelines to investigate further.
  • Data Source Bottlenecks: If your data sources have growing backlogs, the event timestamps of your elements might diverge from the watermark as they wait to be processed. Large backlogs are often caused either by performance bottlenecks, or data source issues which are best detected by monitoring the sources used by your pipeline.
    • Note: Unordered sources such as Pub/Sub can produce stuck watermarks even while outputting at a high rate. This situation occurs because elements are not output in timestamp order, and the watermark is based on the minimum unprocessed timestamp.
  • Frequent Retries: If you see errors indicating that elements are failing to process and being retried, older timestamps from retried elements might be raising the data freshness value. The list of common Dataflow errors can help you troubleshoot.
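The stuck-watermark behavior of unordered sources can be sketched as follows (a hypothetical model, not Dataflow's implementation): because the watermark tracks the minimum unprocessed event timestamp, a single stale element pins it even while newer elements are processed at a high rate.

```python
# Hypothetical sketch: one stale pending element pins the watermark while
# newer elements complete out of order at a high rate.

def watermark(pending_event_timestamps):
    """The watermark is the minimum unprocessed event timestamp."""
    return min(pending_event_timestamps) if pending_event_timestamps else None

pending = {100, 205, 206, 207}  # event times; 100 is one stale element
print(watermark(pending))       # -> 100

for done in (205, 206, 207):    # newer elements complete out of order...
    pending.discard(done)
print(watermark(pending))       # -> still 100: the watermark looks stuck

pending.discard(100)            # ...only now can the watermark advance
print(watermark(pending))       # -> None (no pending elements)
```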

System latency (with and without Streaming Engine)

System latency is the current maximum number of seconds that an item of data has been processing or awaiting processing. This metric indicates how long an element waits inside any one source in the pipeline. The maximum duration is adjusted after processing. The following cases are additional considerations:

  • For multiple sources and sinks, system latency is the maximum amount of time that an element waits inside a source before it's written to all sinks.
  • Sometimes, a source does not provide a value for the time period for which an element waits inside the source. In addition, the element might not have metadata to define its event time. In this scenario, system latency is calculated from the time the pipeline first receives the element.
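The definition above, including the fallback for sources that don't report wait time or event-time metadata, can be sketched as follows (hypothetical field names for illustration):

```python
# Hypothetical sketch of the system latency definition; the field names
# 'source_wait_start' and 'received_by_pipeline' are illustrative only.

def system_latency_seconds(now, elements):
    """Return the maximum time any element has been waiting or processing.

    elements: dicts with 'received_by_pipeline' (seconds since the epoch)
    and an optional 'source_wait_start' reported by the source.
    """
    latencies = []
    for element in elements:
        start = element.get('source_wait_start')
        if start is None:
            # Fallback: measure from when the pipeline first saw the element.
            start = element['received_by_pipeline']
        latencies.append(now - start)
    # System latency is the current maximum across pending elements.
    return max(latencies, default=0.0)

elements = [
    {'source_wait_start': 950, 'received_by_pipeline': 970},
    {'source_wait_start': None, 'received_by_pipeline': 990},
]
print(system_latency_seconds(1000, elements))  # -> 50
```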

The dashboard includes the following two charts:

  • System latency by stages
  • System latency

A data visualization showing system latency in a
streaming pipeline.

Backlog

The Backlog dashboard provides information about elements waiting to be processed. The dashboard includes the following two charts:

  • Backlog seconds (Streaming Engine only)
  • Backlog bytes (with and without Streaming Engine)

The Backlog seconds chart shows an estimate of the amount of time in seconds needed to consume the current backlog if no new data arrives and throughput doesn't change. The estimated backlog time is calculated from both the throughput and the backlog bytes from the input source that still need to be processed. This metric is used by the streaming autoscaling feature to determine when to scale up or down.
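The estimate described above can be sketched as a simple ratio (a hypothetical simplification; the service's actual calculation may weigh these inputs differently):

```python
# Hypothetical sketch of the backlog-seconds estimate: the time to drain the
# current backlog if no new data arrives and throughput stays constant.

def estimated_backlog_seconds(backlog_bytes, throughput_bytes_per_sec):
    if backlog_bytes == 0:
        return 0.0
    if throughput_bytes_per_sec <= 0:
        return float("inf")  # backlog exists but nothing is being consumed
    return backlog_bytes / throughput_bytes_per_sec

# 50 MB of backlog draining at 2 MB/s:
print(estimated_backlog_seconds(50_000_000, 2_000_000))  # -> 25.0
```

A growing ratio signals that throughput isn't keeping up with input, which is exactly the condition streaming autoscaling reacts to.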

A data visualization showing the backlog seconds chart in a
streaming pipeline.

The Backlog bytes chart shows the amount of known unprocessed input for a stage in bytes. This metric compares the remaining bytes to be consumed by each stage against upstream stages. For this metric to report accurately, each source ingested by the pipeline must be configured correctly. Built-in sources such as Pub/Sub and BigQuery are supported out of the box; however, custom sources require some extra implementation. For more details, see autoscaling for custom unbounded sources.

A data visualization showing the backlog bytes chart in a
streaming pipeline.

Processing (Streaming Engine only)

The Processing dashboard provides information about active user operations. The dashboard includes the following two charts:

  • User processing latencies heatmap
  • User processing latencies by stage

The User processing latencies heatmap shows the maximum user operation latencies over the 50th, 95th, and 99th percentile distributions. You can use the heatmap to see whether any long-tail operations are causing high overall system latency or are negatively affecting overall data freshness.

Having high latencies in the 99th percentile is often less impactful on the job than having high latencies in the 50th percentile. To fix an upstream issue before it becomes a problem downstream, set an alerting policy for high latencies in the 50th percentile.
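To make the percentile views concrete, here is a minimal nearest-rank percentile sketch (illustrative only; the monitoring interface computes its distributions server-side):

```python
import math

# Hypothetical sketch: nearest-rank percentiles over a window of per-element
# operation latencies, mirroring the 50th/95th/99th percentile views.

def percentile(samples, p):
    """Nearest-rank p-th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Latencies in ms with a single long-tail operation.
samples = [5, 7, 8, 9, 10, 12, 15, 20, 40, 300]
print(percentile(samples, 50))  # -> 10: the typical element is fine
print(percentile(samples, 99))  # -> 300: the long tail shows up here
```

In this example the 50th percentile is healthy while the 99th is dominated by one slow operation, which is the long-tail pattern the heatmap helps you spot.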

A data visualization showing the user processing latencies heatmap chart for a
streaming pipeline.

The User processing latencies by stage chart shows the 99th percentile for every active user operation broken down by stage. If user code is causing a bottleneck, this chart shows which stage contains the bottleneck. You can use the following steps to debug the pipeline:

  1. Use the chart to find a stage with an unusually high latency.

  2. On the job details page, in the Execution details tab, for Graph view, select Stage workflow. In the Stage workflow graph, find the stage that has unusually high latency.

  3. To find the associated user operations, in the graph, click the node for that stage.

  4. To find additional details, navigate to Cloud Profiler, and use Cloud Profiler to debug the stack trace at the correct time range. Look for the user operations that you identified in the previous step.

A data visualization showing the user processing latencies by stage chart for a
streaming pipeline.

Parallelism (Streaming Engine only)

The Parallel processing chart shows the approximate number of keys in use for data processing for each stage. Dataflow scales based on the parallelism of a pipeline. In Dataflow, the parallelism of a pipeline is an estimate of the number of threads needed to most efficiently process data at any given time. Processing for any given key is serialized, so the total number of keys for a stage represents the maximum available parallelism at that stage. Parallelism metrics can be useful for finding hot keys or bottlenecks for slow or stuck pipelines.
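The relationship between keys and parallelism can be sketched as follows (a hypothetical helper; the hot-key threshold is an illustrative assumption, not a Dataflow rule):

```python
from collections import Counter

# Hypothetical sketch: because processing per key is serialized, the number
# of distinct keys bounds a stage's parallelism, and a key receiving far
# more elements than average is a candidate hot key.

def stage_parallelism(keyed_elements, hot_factor=2):
    """keyed_elements: iterable of (key, element) pairs for one stage.

    hot_factor is an illustrative threshold: a key is flagged when it holds
    more than hot_factor times the mean element count per key.
    """
    counts = Counter(key for key, _ in keyed_elements)
    parallelism = len(counts)  # maximum available parallelism at this stage
    mean = sum(counts.values()) / parallelism
    hot_keys = [k for k, c in counts.items() if c > hot_factor * mean]
    return parallelism, hot_keys

events = [("user-1", i) for i in range(1000)] + [("user-2", 0), ("user-3", 0)]
print(stage_parallelism(events))  # -> (3, ['user-1'])
```

Here only three keys exist, so at most three threads can make progress, and almost all elements pile up behind one key: a hot-key bottleneck.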

A data visualization showing the parallel processing chart in a
streaming pipeline.

Persistence (Streaming Engine only)

The Persistence dashboard provides information about the rate at which persistent storage is written and read by a particular pipeline stage in bytes per second. The dashboard includes the following two charts:

  • Storage write
  • Storage read

The speed of write and read operations is limited by the maximum IOPS (input/output operations per second) of the selected disk. To determine if the current disk is causing a bottleneck, review the IOPS of the disks that the workers are using. For more information about the performance limits for persistent disks, see Performance limits.

A data visualization showing the storage write chart for a
streaming pipeline.

Duplicates (Streaming Engine only)

The Duplicates chart shows the number of messages processed by a particular stage that have been filtered out as duplicates. Dataflow supports many sources and sinks that guarantee at-least-once delivery, which can result in duplicates. Because Dataflow guarantees exactly-once delivery, duplicates are automatically filtered out, so downstream stages don't reprocess the same elements, and state and outputs are not affected. You can optimize the pipeline for resources and performance by reducing the number of duplicates produced in each stage.

A data visualization showing the duplicates chart in a
streaming pipeline.

Timers (Streaming Engine only)

The Timers dashboard provides information about the number of timers pending and the number of timers already processed in a particular pipeline stage. Because windows rely on timers, this metric lets you track the progress of windows.

The dashboard includes the following two charts:

  • Timers pending by stage
  • Timers processing by stage

These charts show the rate at which windows are pending or processing at a specific point in time. The Timers pending by stage chart indicates how many windows are delayed due to bottlenecks. The Timers processing by stage chart indicates how many windows are currently collecting elements.

These charts display all job timers, so if timers are used elsewhere in your code, those timers also appear in these charts.

A data visualization showing the number of timers pending in a particular stage.

A data visualization showing the number of timers already processed in a particular stage.

CPU utilization

CPU utilization is the amount of CPU used divided by the amount of CPU available for processing. This per-worker metric is displayed as a percentage. The dashboard includes the following four charts:

  • CPU utilization (All workers)
  • CPU utilization (Stats)
  • CPU utilization (Top 4)
  • CPU utilization (Bottom 4)
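The four views can be sketched from per-worker utilization samples (a hypothetical helper; worker names and values are illustrative):

```python
# Hypothetical sketch of the four dashboard views derived from per-worker
# CPU utilization (CPU used divided by CPU available, as a percentage).

def cpu_utilization_views(per_worker_pct):
    """per_worker_pct: dict of worker name -> utilization percentage."""
    ordered = sorted(per_worker_pct.items(), key=lambda kv: kv[1],
                     reverse=True)
    return {
        "all_workers": per_worker_pct,  # one line per worker
        "mean": sum(per_worker_pct.values()) / len(per_worker_pct),
        "top_4": ordered[:4],           # busiest workers
        "bottom_4": ordered[-4:],       # most idle workers
    }

workers = {"w-1": 90.0, "w-2": 80.0, "w-3": 70.0,
           "w-4": 60.0, "w-5": 50.0, "w-6": 40.0}
views = cpu_utilization_views(workers)
print(views["mean"])      # -> 65.0
print(views["top_4"][0])  # -> ('w-1', 90.0)
```

Comparing the Top 4 and Bottom 4 views is a quick way to spot skew: a large gap suggests work is unevenly distributed across workers.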

An animated data visualization showing CPU utilization for one Dataflow
worker.

Memory utilization

Memory utilization is the estimated amount of memory used by the workers in bytes per second. The dashboard includes the following two charts:

  • Max worker memory utilization (estimated bytes per second)
  • Memory utilization (estimated bytes per second)

The Max worker memory utilization chart provides information about the workers that use the most memory in the Dataflow job at each point in time. If, at different points during a job, the worker using the maximum amount of memory changes, the same line in the chart displays data for multiple workers. Each data point in the line displays data for the worker using the maximum amount of memory at that time. The chart compares the estimated memory used by the worker to the memory limit in bytes.
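The way a single chart line can mix workers can be sketched as follows (a hypothetical model of the selection, not Dataflow's implementation):

```python
# Hypothetical sketch of how one chart line can mix workers: at each sample
# time, the plotted point is whichever worker is using the most memory.

def max_memory_series(samples_by_time):
    """samples_by_time: {timestamp: {worker_name: bytes_used}}.

    Returns one series of (timestamp, worker, bytes) points; the worker can
    differ from point to point along the same line.
    """
    series = []
    for t in sorted(samples_by_time):
        worker, used = max(samples_by_time[t].items(),
                           key=lambda kv: kv[1])
        series.append((t, worker, used))
    return series

points = max_memory_series({
    0: {"worker-a": 3_000_000, "worker-b": 5_000_000},
    60: {"worker-a": 8_000_000, "worker-b": 4_000_000},
})
print(points)  # -> [(0, 'worker-b', 5000000), (60, 'worker-a', 8000000)]
```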

You can use this chart to troubleshoot out-of-memory (OOM) issues. Worker out-of-memory crashes are not shown on this chart.

The Memory utilization chart shows an estimate of the memory used by all workers in the Dataflow job compared to the memory limit in bytes.

Input and output metrics

If your streaming Dataflow job reads or writes records using Pub/Sub, input metrics and output metrics display.

All input metrics of the same type are combined, and all output metrics are also combined. For example, all Pub/Sub metrics are grouped in one section. Each metric type is organized into a separate section. To change which metrics are displayed, select the section on the left which best represents the metrics you're looking for. The following images show all the available sections.

An example image which shows the separate input and output sections for a Dataflow job.

The following two charts are displayed in both the Input Metrics and Output Metrics sections.

A series of charts showing input and output metrics for a Dataflow job.

Requests per second

Requests per sec is the rate of API requests to read or write data by the source or sink over time. If this rate drops to zero, or decreases significantly for an extended time period relative to expected behavior, then the pipeline might be blocked from performing certain operations. Also, there might be no data to read. In such a case, review the job steps that have a high system watermark. Also, examine the worker logs for errors or indications about slow processing.

A chart showing the number of API requests to read or write data by the source or sink over time.

Response errors per second by error type

Response errors per sec by error type is the rate of failed API requests to read or write data by the source or sink over time. If such errors occur frequently, these API requests might slow down processing. Such failed API requests must be investigated. To help troubleshoot these issues, review the general I/O error code documentation. Also review any specific error code documentation used by the source or sink, such as the Pub/Sub error codes.

A chart showing the rate of failed API requests to read or write data by the source or sink over time.

Use Metrics explorer

The following Dataflow I/O metrics can be viewed in Metrics explorer:

For the complete list of Dataflow metrics, see the Google Cloud metrics documentation.

Upcoming changes to Pub/Sub metrics (Streaming Engine only)

Streaming Engine currently uses Synchronous Pull to consume data from Pub/Sub. To improve performance, we will migrate to Streaming Pull. The existing Requests per sec and Response errors per sec graphs and metrics apply to Synchronous Pull only. Metrics that report the health of Streaming Pull connections, and any errors that terminate those connections, will be added.

You don't need to do anything for this migration. During the migration, a job might use Synchronous Pull for one period, and Streaming Pull for another period. Therefore, the same job might show the Synchronous Pull metrics for one period and the Streaming Pull metrics for another period. After the migration is complete, the Synchronous Pull metrics will be removed from the interface.

The migration also affects the job/pubsub/read_count and job/pubsub/read_latencies metrics in Cloud Monitoring. Those counters aren't incremented while a job is using Streaming Pull.

Streaming jobs that don't use Streaming Engine won't be migrated to Streaming Pull, and this change doesn't affect them. They will continue to display the Synchronous Pull metrics.

Additional information about the Streaming Pull migration can be found on the Streaming with Pub/Sub page.

Contact your account team if you have any questions about this change.