Execution details

Dataflow provides an Execution details tab in its web-based monitoring user interface. This tool can help you optimize performance for your jobs and diagnose why your job might be slow or stuck. This document is for any Dataflow user who needs to inspect the execution details of their Dataflow jobs.

This page provides a high-level summary of the execution details feature and its user interface layout. For troubleshooting details, read Using the Execution details tab.

Terminology

To use execution details effectively, you need to understand how the following key concepts apply to Dataflow jobs:

Dataflow terminology

  • Fusion optimization: The process in which Dataflow fuses multiple steps or transforms to optimize user-submitted pipelines. For more information, read Fusion optimization.
  • Stages: The unit of fused steps in Dataflow pipelines.
  • Last stage: The final node in Dataflow pipelines. A pipeline can have multiple final nodes.

Batch terminology

  • Critical paths: The sequence of stages of a pipeline that contribute to the overall job runtime. For example, this sequence excludes the following stages:
    • Branches of the pipeline that finished earlier than the overall job.
    • Inputs that did not delay downstream processing.
  • Workers: Compute Engine VM instances running a Dataflow job.
  • Work items: The units of work that correspond to a bundle selected by Dataflow.
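The critical path definition above can be made concrete with a small Python sketch. It assumes each stage records a start time, an end time, and the names of its upstream stages; the `Stage` class, the `critical_path` helper, and the sample timings are all hypothetical and are not part of Dataflow. Starting from the stage that finished last, the sketch repeatedly steps back to the upstream stage that finished last, because that is the input the downstream stage had to wait for.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    # Hypothetical record of one fused stage; not a Dataflow API type.
    name: str
    start: float
    end: float
    upstream: list = field(default_factory=list)


def critical_path(stages):
    """Return stage names on the critical path, from first to last."""
    by_name = {s.name: s for s in stages}
    # The path ends at the stage that finished last.
    current = max(stages, key=lambda s: s.end)
    path = [current.name]
    # Walk back through the upstream stage that finished last each time.
    while current.upstream:
        current = max((by_name[n] for n in current.upstream), key=lambda s: s.end)
        path.append(current.name)
    return list(reversed(path))


stages = [
    Stage("F1", 0, 10),
    Stage("F2", 0, 4),                           # input that did not delay F3
    Stage("F3", 10, 25, upstream=["F1", "F2"]),
    Stage("F4", 4, 9, upstream=["F2"]),          # branch that finished early
]
print(critical_path(stages))  # ['F1', 'F3']
```

In this made-up job, F4 is a branch that finished before the overall job, and F2 is an input that did not delay downstream processing, so neither appears on the path.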

Streaming terminology

When to use execution details

The following are common scenarios for using execution details when running Dataflow jobs:

  • Your pipeline is stuck and you want to troubleshoot the issue.
  • Your pipeline is slow and you want to target pipeline optimization.
  • Nothing needs to be fixed, but you want to see the execution details of your pipeline to understand your job.

Enable execution details

The Stage workflow view is automatically enabled for all batch and streaming jobs. Batch and streaming jobs also have a Stage progress view, and batch jobs have an additional Worker progress view.

This feature does not add CPU, network, or other resource usage on your VMs. The execution details are collected by Dataflow's backend monitoring system, which does not affect the performance of the job.

Once you launch your job, you can view the Execution details tab using the Dataflow monitoring UI. For more information, read Accessing the Dataflow monitoring interface.

Use the Execution details tab

The Execution details tab includes four views: Stage progress, Stage info panel (within Stage progress), Stage workflow, and Worker progress. This section walks you through each view and provides examples of successful and unsuccessful Dataflow jobs.

Stage progress for Batch jobs

The Stage progress view for batch jobs shows the execution stages of the job, arranged by their start and end times. Each stage's duration is represented by a bar, so you can identify the longest-running stages of a pipeline by finding the longest bars.

Below each bar is a sparkline that shows the progress of the stage over time. To highlight the stages that contributed to the overall runtime of the job, click the Critical path toggle. You can also use the "Filter Stages" dropdown to select only the stages you are interested in.

An example of the Stage progress view for batch jobs, showing a visualization of the length of time for six different execution stages.

Stage progress for Streaming jobs

The Stage progress view for streaming jobs has two sections. The top half shows a chart of the Data Freshness for each execution stage of the job. Hover over the chart to see the Data Freshness value at that instant.

The bottom half shows the execution stages of the job in topological order: stages with no upstream stages are shown at the top, and their descendants are listed underneath. This view makes it easier to identify stages of a pipeline that take longer than they should. The bars are sized relative to the longest Data Freshness for the entire time domain.
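To make "topological order" concrete, the following Python sketch orders a small, hypothetical stage graph so that every stage appears before the stages that consume its output. The stage names and the `display_order` helper are illustrative assumptions; this page does not describe how the monitoring UI computes its ordering. The sketch uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical stage graph: each stage maps to the set of stages it reads from.
upstream = {
    "F1": set(),
    "F2": {"F1"},
    "F3": {"F1"},
    "F4": {"F2", "F3"},
}


def display_order(upstream):
    """Return stage names so that each stage follows all of its upstream stages."""
    return list(TopologicalSorter(upstream).static_order())


order = display_order(upstream)

# Check the defining property: every upstream stage is listed earlier.
position = {name: index for index, name in enumerate(order)}
for stage, parents in upstream.items():
    assert all(position[parent] < position[stage] for parent in parents)

print(order)
```

The relative order of F2 and F3 is not fixed by the graph, so the sketch checks the ordering property rather than one exact sequence.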

Streaming jobs run until they are cancelled, drained, or updated. Use the time picker above the chart to narrow the domain to a more useful time range. You can also use the "Filter Stages" dropdown to select only the stages you are interested in.

The Stage progress view makes it easier to identify when your streaming job is slow or stuck in two different ways:

  1. The Data Freshness by Stages chart includes anomaly detection, which will automatically display windows of time when the Data Freshness looks unhealthy. The chart will highlight "potential stuckness" when Data Freshness exceeds the 99th percentile for the selected time window. Likewise, the chart will highlight "potential slowness" when Data Freshness exceeds the 95th percentile.
  2. To detect bottlenecks, hover over a time in the chart that looks abnormal. Longer bars indicate slower stages. Alternatively, click the x-axis of the chart to display the data at that instant. A common approach to finding the stage that causes the stuckness or slowness is to find the most upstream (topmost) or the most downstream (bottommost) stage causing the Data Freshness to spike. This approach does not suit all scenarios, and further debugging might be required to pinpoint the exact cause.
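The two percentile thresholds in the first item can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: the `percentile` and `classify` helpers are hypothetical, the nearest-rank percentile method is a choice made here, and the sample values are invented. This page does not say how Dataflow computes the percentiles internally.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1-100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, so no floating-point rounding is involved.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def classify(samples):
    """Label each Data Freshness sample in the selected time window."""
    p95 = percentile(samples, 95)
    p99 = percentile(samples, 99)
    labels = []
    for value in samples:
        if value > p99:
            labels.append("potential stuckness")
        elif value > p95:
            labels.append("potential slowness")
        else:
            labels.append("ok")
    return labels


# Invented Data Freshness samples (in seconds) for one time window.
samples = list(range(1, 101))
labels = classify(samples)
print(labels.count("potential slowness"))   # 4
print(labels.count("potential stuckness"))  # 1
```

Because the thresholds are relative to the selected time window, changing the window with the time picker can change which samples are flagged.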

An example of the Stage progress view for streaming jobs, showing a visualization of the length of time for one execution stage and a possible slowness anomaly.

Stage info panel

The Stage info panel displays a list of steps associated with a fused stage, ranked by descending wall time. The panel opens on the right side of the screen. To open the panel, hover over one of the bars in the Stage progress view and click View details.

An example of the Stage info panel.

Stage workflow

Stage workflow shows the execution stages of the job, represented as a workflow graph. To show only the stages that directly contributed to the overall runtime of the job, click the Critical path toggle.

An example of the Stage workflow view, showing the hierarchy of the different execution stages of a job.

Worker progress

For batch jobs, Worker progress shows the workers for a particular stage. This view is not available for streaming jobs.

Each bar maps to a work item scheduled to a worker. A sparkline that tracks CPU utilization on a worker is located below each worker, making it easier to spot underutilization issues.
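As a rough illustration of what "underutilization" means here, the following Python sketch flags workers whose average CPU utilization falls below a cutoff. The worker names, the sample values, the `underutilized_workers` helper, and the 0.3 threshold are all assumptions chosen for the example; Dataflow does not define a specific threshold on this page, and the sketch does not read data from the monitoring UI.

```python
from statistics import mean


def underutilized_workers(cpu_by_worker, threshold=0.3):
    """Return names of workers whose mean CPU utilization is below threshold.

    cpu_by_worker maps a worker name to a list of utilization samples in [0, 1].
    The 0.3 default is an arbitrary illustrative cutoff.
    """
    return sorted(
        worker
        for worker, samples in cpu_by_worker.items()
        if samples and mean(samples) < threshold
    )


# Invented CPU utilization samples for three hypothetical workers.
cpu_by_worker = {
    "worker-a": [0.85, 0.90, 0.80],
    "worker-b": [0.10, 0.15, 0.05],
    "worker-c": [0.60, 0.20, 0.70],
}
print(underutilized_workers(cpu_by_worker))  # ['worker-b']
```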

Due to the density of this visualization, you must filter this view by pre-selecting a stage. First, identify a stage in the Stage progress view. Hover over that stage and click View workers to enter the Worker progress view.

An example of the Worker progress view. The workers have bars and sparklines that correspond to work item scheduling and CPU utilization.

What's next