Parse error in Wrangler with JSON and XML sources containing newlines

Problem

A Wrangler step that reads data in XML or JSON format fails with errors such as the following:

For JSON:

java.io.EOFException: End of input at line 1 column 2
    at com.google.gson.stream.JsonReader.nextNonWhitespace(JsonReader.java:1377) ~[com.google.code.gson.gson-2.2.4.jar:na]
    at com.google.gson.stream.JsonReader.doPeek(JsonReader.java:483) ~[com.google.code.gson.gson-2.2.4.jar:na]

For XML:

Caused by: org.json.JSONException: Mismatched close tag note at 6 [character 7 line 1]
    at org.json.JSONTokener.syntaxError(JSONTokener.java:505) ~[org.json.json-20090211.jar:na]
    at org.json.XML.parse(XML.java:311) ~[org.json.json-20090211.jar:na]

To confirm this issue, run the following command on a sample input JSON or XML file:

 $ cat -e <file_path>

If lines end with ^M$, the input file uses Microsoft Windows line endings (\r\n). A ^M that is not followed by $ indicates old Mac line endings (\r).
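
Where `cat -e` is not available, the same check can be approximated in a few lines of Python. This is only a minimal sketch: the function name `classify_line_endings` and the sample bytes are illustrative assumptions, not part of CDAP or Wrangler.

```python
def classify_line_endings(data: bytes) -> dict:
    """Count each line-ending style found in raw file bytes.

    Hypothetical helper for illustration; not a CDAP or Wrangler API.
    """
    crlf = data.count(b"\r\n")
    # Bare CR and bare LF counts exclude the bytes that belong to a CRLF pair.
    cr = data.count(b"\r") - crlf
    lf = data.count(b"\n") - crlf
    return {"crlf": crlf, "cr": cr, "lf": lf}


# Made-up sample: a small JSON document saved with Windows line endings.
sample = b'{\r\n  "a": 1\r\n}\r\n'
print(classify_line_endings(sample))
```

For a real file, read it in binary mode (for example `open(path, "rb").read()`) so that Python does not translate the line endings before they are counted. A non-zero `crlf` or `cr` count suggests the file is affected.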

Environment

  • CDAP version 6.2.3 or earlier

Solution

  1. Before the Parse as JSON or Parse XML to JSON directive, remove all line breaks from the input data.
  2. To remove them, insert a find and replace step that replaces every match of the regex (\r\n|\r|\n) with an empty string (``).
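
The effect of that regex can be previewed outside Wrangler. The sketch below applies the same pattern with Python's `re` module. It assumes each match is replaced with an empty string, and the helper name `strip_line_breaks` and the sample record are made up for illustration. It shows what the find and replace step does to the text, not how Wrangler implements it.

```python
import json
import re

# Same alternation as in the step above: CRLF first, then bare CR, then bare LF.
LINE_BREAKS = re.compile(r"(\r\n|\r|\n)")


def strip_line_breaks(text: str) -> str:
    """Remove every line break (hypothetical helper, not a Wrangler API)."""
    return LINE_BREAKS.sub("", text)


# Made-up JSON record with Windows line endings.
raw = '{\r\n"name": "note",\r\n"id": 7\r\n}'
flat = strip_line_breaks(raw)
print(flat)              # {"name": "note","id": 7}
print(json.loads(flat))  # {'name': 'note', 'id': 7}
```

If the data contains free text where words are separated only by a line break, replacing with a single space instead of an empty string keeps those words apart.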

Cause

The issue occurs because the Data Fusion Wrangler step cannot properly handle line endings generated by different operating systems. Currently it only handles '' and \n.