Processing JSON strings in Dataflow is a common need, for example when processing clickstream information captured from web applications. To process JSON strings, you need to convert them into either Rows or plain old Java objects (POJOs) for the duration of the pipeline processing.
The Apache Beam built-in transform JsonToRow is a good solution for converting JSON strings to Rows. However, if you want a queue for processing unsuccessful messages, you need to build it separately; see Queuing unprocessable data for further analysis.
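One way to capture the lines that fail to parse is JsonToRow's exception-reporting mode, which splits the output into parsed Rows and failed input lines that you can then route to your queue. This is a sketch, assuming a Beam SDK version that provides `JsonToRow.withExceptionReporting` (verify against your SDK's Javadoc); `expectedSchema` is the schema of the incoming JSON, as in the example below.

```java
import org.apache.beam.sdk.transforms.JsonToRow;
import org.apache.beam.sdk.transforms.JsonToRow.ParseResult;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

PCollection<String> json = ...;

// Parse with exception reporting instead of failing the bundle on bad input.
ParseResult result = json.apply("Parse JSON with error reporting",
    JsonToRow.withExceptionReporting(expectedSchema));

// Rows that parsed successfully continue down the main pipeline.
PCollection<Row> rows = result.getResults();

// Lines that failed to parse; write these to your dead-letter queue.
PCollection<Row> failures = result.getFailedToParseLines();
```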
If you need to convert a JSON string to a POJO using AutoValue, register a schema for the type by using the @DefaultSchema(AutoValueSchema.class) annotation, then use the Convert utility class, so you end up with code similar to the following:
```java
PCollection<String> json = ...;

PCollection<MyUserType> users = json
    .apply("Parse JSON to Beam Rows", JsonToRow.withSchema(expectedSchema))
    .apply("Convert to a user type with a compatible schema registered",
        Convert.to(MyUserType.class));
```
For more information, including what different Java types you can infer schemas from, see Creating Schemas.
If JsonToRow doesn't work with your data, Gson is a reasonable alternative. Because Gson is fairly relaxed in its default processing of data, you might need to build more validation into the data conversion process.
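For example, by default Gson silently leaves fields that are missing from the input JSON as null rather than raising an error, so each record should be validated before it moves on through the pipeline. A minimal sketch, assuming the Gson library is on the classpath and using a hypothetical `User` type:

```java
import com.google.gson.Gson;

// Hypothetical user type for illustration.
class User {
  String name;
  Integer age;
}

Gson gson = new Gson();

// "age" is absent from the input: Gson does not complain,
// it simply leaves the field null.
User user = gson.fromJson("{\"name\": \"Ada\"}", User.class);

// Validate explicitly before handing the object to downstream transforms;
// failed records could be routed to an unprocessable-messages queue.
if (user.name == null || user.age == null) {
  // handle the invalid record
}
```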