Stackdriver provides powerful monitoring, logging, and diagnostics. Cloud Dataflow integration with Stackdriver Monitoring allows you to access Cloud Dataflow job metrics such as Job Status, Element Counts, System Lag (for streaming jobs), and User Counters from the Stackdriver dashboards. You can also employ Stackdriver alerting capabilities to be notified of a variety of conditions, such as long streaming system lag or failed jobs.
Before you begin
You can explore Cloud Dataflow metrics using Stackdriver. Follow the steps in this section to observe the standard metrics provided for each of your Apache Beam pipelines.
Note: Any aggregator you define in your Apache Beam pipeline will be reported by Cloud Dataflow to Stackdriver as a custom metric. Cloud Dataflow will report incremental updates to Stackdriver approximately every 30 seconds. All user metrics will be exported as the “double” data type to avoid conflicts.
Go to the Google Cloud Platform Console and select the Stackdriver Monitoring menu.
Follow the steps in the Google Cloud Platform Console to create a Stackdriver account and start a free trial for Stackdriver.
Go to the Cloud Dataflow dashboard in Stackdriver and navigate to Resources > Metrics Explorer (Beta).
In the Metrics Explorer, select the Dataflow Job resource type.
From the list that appears, select a metric you'd like to observe for one of your jobs.
The following Cloud Dataflow-related metrics are available:
- Job status: Reports the job status (Failed, Successful) as an enum, every 30 seconds and on update. Note: enums cannot be charted or used for alerts, but the value can be retrieved through the Cloud Monitoring UI. For alerting, use the Failed metric, which is set to 1 if a job fails.
- Failed: Set to 1 if a job exits with a failure. Use this metric to alert on and chart the number of failed pipelines.
- Elapsed time: Job elapsed time (measured in seconds), reported every 30 seconds.
- System lag: Max lag across the entire pipeline, reported in seconds.
- Current vCPU count: Current number of virtual CPUs used by the job, updated on value change.
- Total vCPU usage: Total number of vCPU-seconds used by the job, updated on value change.
- Total PD usage: Cumulative total of Persistent Disk used by the job, measured in GB-seconds and updated on value change. Note: there are two types of persistent disk (SSD and HDD). Both are reported under the same metric name and are differentiated by a metric label.
- Total memory usage: Cumulative total memory allocated to the job, measured in GB-seconds and updated on value change.
- Element count: Number of elements per PCollection. Note: This is a per-PCollection metric, not a job-level metric, so it is not yet available for alerting.
- Estimated byte count: Number of bytes processed per PCollection. Note: This is a per-PCollection metric, not a job-level metric, so it is not yet available for alerting.
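When you query these metrics programmatically (for example, through the Monitoring API filter syntax used elsewhere on this page), the job-level metrics above surface under the `dataflow.googleapis.com/job/` prefix. The exact metric-type strings below are assumptions based on the metric names listed here; verify them in the Metrics Explorer before relying on them:

```
# Assumed Stackdriver metric types for Cloud Dataflow (verify in Metrics Explorer)
dataflow.googleapis.com/job/is_failed             # Failed (0 or 1)
dataflow.googleapis.com/job/elapsed_time          # Elapsed time, in seconds
dataflow.googleapis.com/job/system_lag            # System lag, in seconds
dataflow.googleapis.com/job/current_num_vcpus     # Current vCPU count
dataflow.googleapis.com/job/element_count         # Element count (per PCollection)
dataflow.googleapis.com/job/estimated_byte_count  # Estimated byte count (per PCollection)
```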
For example: this example shows a streaming pipeline that reads from a Cloud Pub/Sub topic and writes to BigQuery. It has 5 steps, one of which is PubsubIO.Read. The accompanying image displays the PubsubIO.Read step of the pipeline.
Create alerts and dashboards
Stackdriver not only gives you access to Cloud Dataflow-related metrics; it also lets you create alerts and dashboards, so you can chart time series of metrics and be notified when those metrics reach specified values.
Create groups of resources
You can create resource groups that include multiple Apache Beam pipelines so that you can easily set alerts and build dashboards.
In the Cloud Dataflow dashboard in Stackdriver go to the Groups menu and select Create Group.
Add filter criteria that define the Cloud Dataflow resources included in the group. For example, one of your filter criteria can be the name prefix of your pipelines.
After the group is created, you will be able to see the basic metrics related to resources in that group.
Create alerts for Cloud Dataflow metrics
Stackdriver gives you the ability to create alerts and be notified when a certain metric crosses a specified threshold — for example, when the System Lag of a streaming pipeline rises above a predefined value.
In the Cloud Dataflow dashboard in Stackdriver, go to the Alerting menu and select Policies Overview.
Click on Add Policy.
In the Create new alerting policy page, you can define the alerting conditions and the channels of communication for alerts.
For example, to set an alert on System Lag for the WindowedWordCount Apache Beam pipeline group, select "Dataflow Job" in the Resource Type dropdown, "Group" in the Applies To dropdown, and "System Lag" in the If Metric dropdown.
After you've created an alert, you can review the events related to Cloud Dataflow by navigating to Alerting > Events. Every time an alert is triggered by a Metric Threshold condition, an Incident and a corresponding Event are created in Stackdriver. If you specified a notification mechanism in the alert (email, SMS, etc), you will also receive a notification.
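Alerting policies can also be defined via the Monitoring API instead of the UI. The sketch below shows a minimal policy that fires when a job fails; the metric type `dataflow.googleapis.com/job/is_failed`, the resource type `dataflow_job`, and the `PROJECT_ID`/`CHANNEL_ID` placeholders are assumptions to adapt to your project:

```json
{
  "displayName": "Dataflow job failed",
  "combiner": "OR",
  "conditions": [{
    "displayName": "is_failed above 0",
    "conditionThreshold": {
      "filter": "metric.type=\"dataflow.googleapis.com/job/is_failed\" AND resource.type=\"dataflow_job\"",
      "comparison": "COMPARISON_GT",
      "thresholdValue": 0,
      "duration": "60s"
    }
  }],
  "notificationChannels": ["projects/PROJECT_ID/notificationChannels/CHANNEL_ID"]
}
```

A policy like this mirrors the UI-created alert: the condition corresponds to the threshold you would set in the Create new alerting policy page, and the notification channel corresponds to the email or SMS mechanism mentioned above.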
Build your own custom monitoring dashboard
You can build Stackdriver monitoring dashboards with the most relevant Cloud Dataflow-related charts.
In the Cloud Dataflow dashboard in Stackdriver, go to the Dashboards menu and select Create Dashboard.
Click on Add Chart.
In the Add Chart window, select "Dataflow Job" as the Resource Type, select a metric you want to chart in the Metric Type field, and select a group that contains Apache Beam pipelines in the Filter panel.
You can add as many charts to the dashboard as you like.
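Dashboards can likewise be described declaratively through the Monitoring dashboards API rather than built chart by chart in the UI. The fragment below is a hedged sketch of a one-chart dashboard for System Lag; the metric-type string and the overall layout schema are assumptions to check against the current Dashboards API reference:

```json
{
  "displayName": "Dataflow pipelines",
  "gridLayout": {
    "widgets": [{
      "title": "System lag",
      "xyChart": {
        "dataSets": [{
          "timeSeriesQuery": {
            "timeSeriesFilter": {
              "filter": "metric.type=\"dataflow.googleapis.com/job/system_lag\" resource.type=\"dataflow_job\""
            }
          }
        }]
      }
    }]
  }
}
```

Each additional entry in `widgets` corresponds to one more chart, matching the "add as many charts as you like" workflow in the UI.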
Receive worker VM metrics from Stackdriver Monitoring agent
If you would like to monitor persistent disk, CPU, network, and process metrics from your Cloud Dataflow worker VM instances, you can enable the Stackdriver Monitoring Agent when you run your pipeline. See the list of available Monitoring agent metrics.
To enable the Monitoring agent, pass the corresponding option when running your pipeline.
To disable the Monitoring agent without stopping your pipeline, update your pipeline by launching a replacement job without specifying the option.
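The option name is not spelled out above; at the time of writing, Dataflow exposed agent enablement through an experiments flag. The launch sketch below assumes the flag name `enable_stackdriver_agent_metrics` and the Beam `WindowedWordCount` example class — treat both as assumptions and verify the flag against the current Dataflow documentation:

```shell
# Hedged sketch: launching a Beam Java pipeline on Dataflow with the
# Monitoring agent enabled. The experiment name is an assumption.
mvn compile exec:java \
  -Dexec.mainClass=org.apache.beam.examples.WindowedWordCount \
  -Dexec.args="--runner=DataflowRunner \
               --project=PROJECT_ID \
               --experiments=enable_stackdriver_agent_metrics"
```

Launching a replacement job with the same command, minus the experiments flag, would correspond to the disable step described above.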
To learn more, consider exploring these other resources: