Analyzing errors in Cloud Dataflow with Stackdriver Error Reporting
By Alex Amato, Cloud Dataflow Software Engineer
For each of your Cloud Dataflow jobs, this new integration surfaces:
- A list of errors, each with a summary error message.
- A count of how many times each error occurred.
- A histogram showing when each error occurred.
Stackdriver Error Reporting is a feature of Google Stackdriver, the monitoring, logging, and diagnostics tool for Google Cloud Platform, AWS, and open source services. Error Reporting counts, analyzes, aggregates, and reports crashes as they occur. It also lets you sort and filter errors in various ways to help you understand them and create new alerts.
In this post, you’ll learn by example how Stackdriver Error Reporting helps you monitor and debug your Cloud Dataflow jobs. We'll examine errors in two pipelines: a streaming pipeline that writes rows to a Google BigQuery table, and a batch pipeline that runs hourly, reading the new rows from that table.
Initially, our two pipelines are running normally. The streaming job is constantly running, writing rows to a BigQuery table. The batch pipeline is launched periodically to process a batch of rows from the table.
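The original post doesn't include the pipeline code, but the streaming job might look roughly like the sketch below. The Pub/Sub topic, table name, and schema are all hypothetical placeholders; only `event_to_row` is plain Python.

```python
# Hypothetical sketch of the streaming pipeline: read "user,score" messages
# from Pub/Sub and append them to a BigQuery table. The topic, table, and
# schema are illustrative, not taken from the original post.

def event_to_row(message):
    """Turn a b"user,score" Pub/Sub payload into a BigQuery row dict."""
    user, score = message.decode("utf-8").split(",")
    return {"user": user, "score": int(score)}

def run():
    # Beam/GCP imports are kept local so the sketch can be read (and the
    # helper above tested) without apache-beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | "ReadEvents" >> beam.io.ReadFromPubSub(
               topic="projects/my-project/topics/events")
         | "ParseEvents" >> beam.Map(event_to_row)
         | "WriteRows" >> beam.io.WriteToBigQuery(
               "my-project:my_dataset.scores",
               schema="user:STRING,score:INTEGER"))

# run() is not invoked here; launching it requires a real GCP project.
```

The hourly batch job would be the mirror image, reading the new rows with `beam.io.ReadFromBigQuery` instead of writing them.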
Now, we'll simulate an error that originates outside of the Cloud Dataflow pipeline. We'll delete the BigQuery table, which will cause failures in the pipelines as they try to read and write from this table.
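Dropping the table can be scripted with the BigQuery client library; the project, dataset, and table names below are hypothetical.

```python
# Delete the BigQuery table that both pipelines depend on, simulating an
# externally caused failure. All names are hypothetical placeholders.

def full_table_id(project, dataset, table):
    """Build the 'project.dataset.table' ID the client library expects."""
    return "{}.{}.{}".format(project, dataset, table)

def delete_table(table_id):
    # Local import: actually running this requires google-cloud-bigquery
    # and credentials for the project.
    from google.cloud import bigquery
    client = bigquery.Client()
    client.delete_table(table_id)  # raises NotFound if it's already gone

# Example (not executed here):
#   delete_table(full_table_id("my-project", "my_dataset", "scores"))
```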
The Cloud Dataflow job page now includes a section called Stack Traces. To access it, first make sure that no step is selected in the pipeline graph, then click “Logs” > “Stack Traces”.
The streaming pipeline displays an error indicating a failure to write to BigQuery, along with a count of how many times the error occurred in this job. Similarly, the batch pipeline displays an error indicating a failure to read from the BigQuery table.
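The same stack traces remain available in Stackdriver Logging. A filter along these lines (the job ID is a placeholder) narrows a Dataflow job's logs down to its errors:

```
resource.type="dataflow_step"
resource.labels.job_id="<your-job-id>"
severity>=ERROR
```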
Now we'll look at the Stackdriver Error Reporting UI to see a report of the errors across all Cloud Dataflow jobs in one place. This report can be accessed by clicking on the menu icon in the top left and then selecting “Error Reporting”.
This loads a page which displays a summary of all the errors that occurred in the project. We can see the errors from both pipelines in this view.
Clicking on an error summary loads a detailed view of that error.
Here we can see that the error has been occurring for a few hours. This is because our BigQuery table no longer exists. For more detailed information, we can click on specific errors in the “Recent Samples” section, to view the original stack trace of the error.
After recreating the table, the error histogram shows us that the pipeline has resumed and the errors have stopped.
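Recreating the table can likewise be scripted with the client library, again with a hypothetical table ID and schema matching the earlier sketch.

```python
# Recreate the deleted table so both pipelines can resume. The table ID and
# schema are hypothetical placeholders.

def score_table_fields():
    """Column (name, type) pairs for the hypothetical scores table."""
    return [("user", "STRING"), ("score", "INTEGER")]

def recreate_table(table_id):
    # Local import: actually running this requires google-cloud-bigquery
    # and credentials for the project.
    from google.cloud import bigquery
    client = bigquery.Client()
    schema = [bigquery.SchemaField(name, field_type)
              for name, field_type in score_table_fields()]
    client.create_table(bigquery.Table(table_id, schema=schema))

# Example (not executed here):
#   recreate_table("my-project.my_dataset.scores")
```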
We hope you now have a good grasp of how this new feature helps you monitor Cloud Dataflow jobs. You can also enable email notifications for errors in your project from the Stackdriver Error Reporting UI. Please take a look at the new documentation for additional information.