Monitoring and improving your Google Cloud Dataflow pipelines with Google Stackdriver
Learn how to use Google Stackdriver to monitor, chart and create alerts for Google Cloud Dataflow jobs
Stackdriver provides monitoring and diagnostics for applications on Google Cloud Platform (GCP) and AWS. Integrating Cloud Dataflow with Stackdriver Monitoring has been one of the most frequently requested features by customers, and at Google Cloud NEXT '17, we were excited to announce the public beta of Stackdriver Monitoring for Cloud Dataflow. In this post, we’ll provide details about what it is and how to use it.
Summary of features
What does the availability of this beta mean for you as a Cloud Dataflow user? In summary, you can now access Dataflow job metrics such as System Lag (for streaming jobs), Job Status (Failed, Successful) and many others from within Stackdriver, chart them in dashboards and employ Stackdriver alerting capabilities to get notified of a variety of conditions, such as long streaming system lag or failed jobs.
Here's a summary of what you can do:
- Explore Cloud Dataflow metrics: Browse available pipeline metrics (see next section for a list of metrics) and visualize them in charts.
- Chart Cloud Dataflow metrics in Stackdriver Dashboards: Create dashboards and chart time series of metrics.
- Configure alerts: Define thresholds on job or resource group-level metrics and alert when these metrics reach specified values.
- Monitor user-defined metrics: In addition to its built-in metrics, Cloud Dataflow exposes user-defined metrics (SDK Aggregators) as Stackdriver custom counters in the Monitoring UI, available for charting and alerting.
Metrics and dashboards
Exploring metrics and building dashboards is easy with Stackdriver. Navigate to the Cloud Console and select the Stackdriver Monitoring menu.
If you haven't set up your project for Stackdriver, you'll be asked to create a Stackdriver account and start a free trial. Once setup is complete, navigate to the Cloud Dataflow dashboard in Stackdriver.
Select one of the pipelines shown in the dashboard to review its most important metrics. For a streaming pipeline, these are System Lag and the number of processed data elements.
If you would like to see more or less data in your graphs, change the time window for the displayed metrics in the toolbar, e.g., 1d for a single day or 1w for a full week.
The Cloud Dataflow dashboard shows the most important metrics for the type of pipeline you selected. To explore the other available metrics, go to the Resources > Metrics Explorer menu and select the dataflow_job resource type. You should now see a list of Cloud Dataflow-related metrics; choose any of them to chart it in the charting area to the right of the selection list.
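The same selection (resource type plus metric) can also be written as a Monitoring filter string, which is handy if you later want to query the metric programmatically. The snippet below is a minimal sketch: the helper name `dataflow_metric_filter` and the job name `my-streaming-job` are hypothetical, and the metric type `dataflow.googleapis.com/job/system_lag` and the `job_name` resource label are assumptions you should confirm against the metric list for your project.

```python
def dataflow_metric_filter(metric_type, job_name=None):
    """Build a Monitoring filter string for one Cloud Dataflow job metric.

    metric_type: full metric type, e.g. "dataflow.googleapis.com/job/system_lag"
    job_name:    optional job name used to narrow the filter to one pipeline
    """
    clauses = [
        'resource.type = "dataflow_job"',
        'metric.type = "{}"'.format(metric_type),
    ]
    if job_name is not None:
        # Assumes the dataflow_job resource exposes a "job_name" label.
        clauses.append('resource.labels.job_name = "{}"'.format(job_name))
    return " AND ".join(clauses)


# Assumed metric type for System Lag; verify it in Metrics Explorer.
SYSTEM_LAG = "dataflow.googleapis.com/job/system_lag"

# "my-streaming-job" is a placeholder job name.
print(dataflow_metric_filter(SYSTEM_LAG, job_name="my-streaming-job"))
```

Leaving out `job_name` yields a filter that matches the metric across all Cloud Dataflow jobs in the project.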
If you're interested in building your own custom dashboards with the metrics that are most relevant to your own pipelines, you can do that too.
Go to the Dashboards menu and select “Create Dashboard” and then “Add Chart.” In the Add Chart page, select “Dataflow Job” as the Resource Type, select a metric you want to chart in the Metric Type field, and select a group that contains Cloud Dataflow pipelines in the Filter panel.
You can add more charts to your custom dashboard, if you'd like.
Creating alerts
One of our favorite features in Stackdriver is the ability to create alerts and be notified when a certain metric crosses a specified threshold (for example, when System Lag of a streaming pipeline increases above a predefined value).
Navigate to the Alerting area of Stackdriver Monitoring, select Policies Overview and click on Add Policy.
The “Create new Alerting Policy” page allows you to define the alerting conditions and the channels of communication for alerts.
For example, to set an alert on the System Lag for our streaming Cloud Dataflow pipeline:
- Select “Dataflow Job” in the Resource Type picklist.
- Select “Single” in the Applies To picklist.
- Choose the pipeline for which you want alerts from the resources list.
- Pick “System Lag” in the If Metric picklist.
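For readers who prefer to keep alert definitions in code, the choices made in the steps above can be sketched as a plain Python dictionary. This is an illustration under assumptions, not the format the UI produces: the field names follow the `AlertPolicy` resource of the Stackdriver Monitoring API v3 (which may not match what is available to your project), the function name, job name and the 120-second threshold are hypothetical, and notification channels are omitted.

```python
import json


def system_lag_policy(job_name, max_lag_seconds, duration_seconds=300):
    """Describe a threshold alert on System Lag for a single Dataflow job."""
    # Assumed metric type and resource label; verify both for your project.
    condition_filter = (
        'resource.type = "dataflow_job" AND '
        'metric.type = "dataflow.googleapis.com/job/system_lag" AND '
        'resource.labels.job_name = "{}"'.format(job_name)
    )
    return {
        "displayName": "System lag too high: {}".format(job_name),
        "combiner": "OR",
        "conditions": [
            {
                "displayName": "System Lag above {}s".format(max_lag_seconds),
                "conditionThreshold": {
                    "filter": condition_filter,
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": max_lag_seconds,
                    # The lag must stay above the threshold this long to alert.
                    "duration": "{}s".format(duration_seconds),
                },
            }
        ],
    }


# Placeholder job name and threshold, chosen only for illustration.
policy = system_lag_policy("my-streaming-job", max_lag_seconds=120)
print(json.dumps(policy, indent=2))
```

The dictionary mirrors the picklists: the filter combines the resource type, the single job and the metric, while the threshold and duration correspond to the condition settings you enter on the policy page.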
After you've created an alert, you can review the events related to Cloud Dataflow on the Alerting > Events page. Every time an alert is triggered by a Metric Threshold condition, an Incident and a corresponding Event are created in Stackdriver. If you specified a notification mechanism in the alert (email, SMS, pager, etc.), you'll also receive a notification.
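To build intuition for when a Metric Threshold condition opens an incident, here is a toy model in Python. It is not Stackdriver's implementation: it simply assumes that an incident opens once the metric has stayed above the threshold for a minimum duration, and that a new incident can open only after the metric has dropped back below the threshold. The function name and the sample values are made up for illustration.

```python
def find_incidents(samples, threshold, min_duration):
    """Return the breach start time of each incident in a toy threshold model.

    samples:      list of (timestamp_seconds, value), sorted by timestamp
    threshold:    an incident requires value > threshold
    min_duration: seconds the breach must last before an incident opens
    """
    incidents = []
    breach_start = None  # timestamp at which the current breach began
    fired = False        # whether the current breach already opened an incident
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
            if not fired and timestamp - breach_start >= min_duration:
                incidents.append(breach_start)
                fired = True
        else:
            # Metric recovered: reset so a later breach is treated as new.
            breach_start = None
            fired = False
    return incidents


# Made-up System Lag samples: (seconds since start, lag in seconds).
lag = [(0, 10), (60, 200), (120, 250), (180, 300), (240, 20), (300, 400)]
print(find_incidents(lag, threshold=120, min_duration=120))
```

In this sample, the lag first exceeds the threshold at 60 seconds and stays high long enough to open one incident. The spike at 300 seconds has not yet lasted the required duration, so it does not open a second one.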
Next steps
To learn more, see the following resources:
- Stackdriver docs
- Blog post: “Understanding cost-versus-speed tradeoffs in Google Cloud Dataflow batch pipelines”
- Google Cloud NEXT '17 session video: “Stackdriver: monitor, diagnose and fix”
- Google Cloud NEXT '17 session video: “Monitoring and improving your big data applications”