Queuing unprocessable data for further analysis

In production systems, you need a plan for data that cannot be processed. Validate data in-stream and correct it where possible; where correction isn't possible, we recommend logging the value to a queue (sometimes called a "dead letter" queue) for later analysis. These issues commonly arise when converting data from one format to another, for example when converting JSON strings to Rows.

To address this, we recommend using a multi-output transform to route the elements containing the problematic data into a separate PCollection for further processing. Because this is a common operation that you might need at several points in a pipeline, make the transform generic enough to be reused. To do this, first create an error object that wraps common properties, including the original data. Then create a sink transform that supports multiple destination options, as needed to meet your application requirements.
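The routing logic described above can be sketched in plain Python, independent of any SDK. This is a minimal illustration under stated assumptions, not pipeline code: the names ErrorRecord, route_json, split_outputs, and make_sink are hypothetical, and the two tags stand in for the main and dead-letter outputs that a multi-output transform would produce as separate PCollections.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical tag names for the two outputs.
PARSED = "parsed"
DEAD_LETTER = "dead_letter"


@dataclass(frozen=True)
class ErrorRecord:
    """Hypothetical error object wrapping common properties of a failure."""
    original: str    # the unmodified input, kept for later analysis
    stage: str       # which step of the pipeline rejected the element
    error_type: str  # exception class name
    message: str     # exception message


def route_json(lines: Iterable[str],
               stage: str = "json_to_row") -> Iterator[Tuple[str, Any]]:
    """Yield (tag, value) pairs: parsed dicts on PARSED, failures on DEAD_LETTER."""
    for line in lines:
        try:
            row = json.loads(line)
            if not isinstance(row, dict):
                raise ValueError("expected a JSON object")
            yield PARSED, row
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            yield DEAD_LETTER, ErrorRecord(line, stage, type(exc).__name__, str(exc))


def split_outputs(tagged: Iterable[Tuple[str, Any]]) -> Dict[str, List[Any]]:
    """Group tagged values by tag, mimicking separate output collections."""
    outputs: Dict[str, List[Any]] = {PARSED: [], DEAD_LETTER: []}
    for tag, value in tagged:
        outputs[tag].append(value)
    return outputs


def make_sink(destination: Callable[[ErrorRecord], Any]) -> Callable[[Iterable[ErrorRecord]], int]:
    """Build a generic sink; the destination is any callable taking an ErrorRecord."""
    def sink(errors: Iterable[ErrorRecord]) -> int:
        count = 0
        for error in errors:
            destination(error)
            count += 1
        return count  # number of records written
    return sink
```

For example, split_outputs(route_json(lines)) returns the parsed rows and the error records under separate keys, and make_sink(print) or make_sink(some_list.append) sends the same error records to different destinations without changing the routing code. In an actual pipeline, the destination would more likely be a durable store or a message queue, chosen to fit your application.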

Examples