XmlSource (Google Cloud Dataflow SDK 1.9.1 API)

Google Cloud Dataflow SDK for Java, version 1.9.1


Class XmlSource<T>

  • Type Parameters:
    T - Type of the objects that represent the records of the XML file. The PCollection generated by this source will be of this type.
    All Implemented Interfaces:
    HasDisplayData, Serializable

    public class XmlSource<T>
    extends FileBasedSource<T>
    A source that can be used to read XML files. This source reads one or more XML files and creates a PCollection of a given type. A Dataflow read transform can be created by passing an XmlSource object to Read.from(). See the example below.

    The XML file must be of the following form, where root and record are XML element names that are defined by the user:

     <root>
       <record> ... </record>
       <record> ... </record>
       <record> ... </record>
       <record> ... </record>
     </root>

    In other words, the XML document must contain a single root element whose children are exclusively record elements. The records may contain arbitrary XML content; however, that content must not contain the start <record> or end </record> tags. This restriction enables reading large XML files in parallel from different offsets in the file.
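
    The restriction on nested record tags means a reader can start at an arbitrary byte offset and scan forward for the next record. The following is a minimal sketch of that idea, not the SDK's implementation; RecordOffsetSketch and findNextRecordStart are hypothetical names, and single-byte characters are assumed:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the Dataflow SDK.
final class RecordOffsetSketch {

  // Returns the index of the first "<recordElement" start tag at or after
  // fromOffset, or -1 if none is found. Assumes single-byte characters.
  static int findNextRecordStart(byte[] data, int fromOffset, String recordElement) {
    byte[] tag = ("<" + recordElement).getBytes(StandardCharsets.ISO_8859_1);
    for (int i = Math.max(0, fromOffset); i + tag.length < data.length; i++) {
      if (startsWithAt(data, i, tag) && endsTagName(data[i + tag.length])) {
        return i;
      }
    }
    return -1;
  }

  private static boolean startsWithAt(byte[] data, int pos, byte[] tag) {
    for (int j = 0; j < tag.length; j++) {
      if (data[pos + j] != tag[j]) {
        return false;
      }
    }
    return true;
  }

  // The byte after the name must end it, so "<record" does not match "<records".
  private static boolean endsTagName(byte b) {
    return b == '>' || b == '/' || b == ' ' || b == '\t' || b == '\n' || b == '\r';
  }

  public static void main(String[] args) {
    byte[] xml = "<root><record>a</record><record>b</record></root>"
        .getBytes(StandardCharsets.ISO_8859_1);
    System.out.println(findNextRecordStart(xml, 0, "record"));  // prints 6
    System.out.println(findNextRecordStart(xml, 7, "record"));  // prints 24
  }
}
```

    If record content could itself contain a <record> tag, a scan like this could not tell a nested tag from the start of a top-level record.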

    Root and/or record elements may additionally contain an arbitrary number of XML attributes. Users must also provide a JAXB annotated Java class that can be used to convert records into Java objects and vice versa using JAXB marshalling/unmarshalling mechanisms. Reading the source will generate a PCollection of the given JAXB annotated Java type. Optionally, users may provide a minimum size for the bundles created from the source.

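    The following example shows how to read from XmlSource in a Dataflow pipeline:

     // Record is a user-defined JAXB annotated class (hypothetical name).
     XmlSource<Record> source = XmlSource.<Record>from(file.toPath().toString())
         .withRootElement("root").withRecordElement("record").withRecordClass(Record.class);
     PCollection<Record> output = p.apply(Read.from(source));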

    Currently, only XML files that use single-byte characters are supported. Using a file that contains multi-byte characters may result in data loss or duplication.
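
    Because of this limitation, it can be useful to check a file before reading it. The following is a minimal sketch, assuming the file is UTF-8 encoded (in UTF-8, any byte with the high bit set belongs to a multi-byte character); SingleByteCheckSketch is a hypothetical helper and not part of the SDK:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical pre-check for illustration; not part of the Dataflow SDK.
final class SingleByteCheckSketch {

  // Returns true if every byte is 7-bit ASCII. For UTF-8 content this means
  // the data contains no multi-byte characters.
  static boolean isSingleByteUtf8(byte[] content) {
    for (byte b : content) {
      if (b < 0) {
        // High bit set: this byte is part of a multi-byte UTF-8 sequence.
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    byte[] ascii = "<record>abc</record>".getBytes(StandardCharsets.UTF_8);
    byte[] accented = "<record>caf\u00e9</record>".getBytes(StandardCharsets.UTF_8);
    System.out.println(isSingleByteUtf8(ascii));     // prints true
    System.out.println(isSingleByteUtf8(accented));  // prints false
  }
}
```

    This check is deliberately conservative: it would also reject files in single-byte encodings such as ISO-8859-1 that use characters above 0x7F.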

    To use XmlSource:

    1. Explicitly declare a dependency on org.codehaus.woodstox:stax2-api
    2. Include a compatible implementation on the classpath at run-time, such as org.codehaus.woodstox:woodstox-core-asl

    These dependencies are declared as optional in the Maven sdk/pom.xml file of Google Cloud Dataflow.


    Permission requirements depend on the PipelineRunner that is used to execute the Dataflow job. Please refer to the documentation of the corresponding PipelineRunner for more details.

    See Also:
    Serialized Form
    • Method Detail

      • from

        public static <T> XmlSource<T> from(String fileOrPatternSpec)
        Creates an XmlSource for a single XML file or a set of XML files defined by a Java "glob" file pattern. Each XML file should be of the form defined in XmlSource.
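
        To get a feel for what a Java "glob" pattern selects, the JDK's own glob matcher can be used. This is only an illustration with hypothetical paths: the actual matching is done by the SDK's file I/O layer, and its behavior (for example, for Cloud Storage paths) may differ from java.nio.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of the Dataflow SDK.
final class GlobSketch {

  // Returns true if the path matches the pattern under java.nio glob rules.
  static boolean matches(String glob, String path) {
    PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
    return matcher.matches(Paths.get(path));
  }

  public static void main(String[] args) {
    // "*" stays within one directory level; "**" crosses directory boundaries.
    System.out.println(matches("data/*.xml", "data/a.xml"));       // prints true
    System.out.println(matches("data/*.xml", "data/sub/a.xml"));   // prints false
    System.out.println(matches("data/**.xml", "data/sub/a.xml"));  // prints true
  }
}
```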
      • withRootElement

        public XmlSource<T> withRootElement(String rootElement)
        Sets the name of the root element of the XML document. This will be used to create a valid starting root element when initiating a bundle of records created from an XML document. This is a required parameter.
      • withRecordElement

        public XmlSource<T> withRecordElement(String recordElement)
        Sets the name of the record element of the XML document. This will be used to determine the offset of the first record of a bundle created from the XML document. This is a required parameter.
      • withRecordClass

        public XmlSource<T> withRecordClass(Class<T> recordClass)
        Sets a JAXB annotated class that can be populated using a record of the provided XML file. This will be used when unmarshalling record objects from the XML file. This is a required parameter.
      • withMinBundleSize

        public XmlSource<T> withMinBundleSize(long minBundleSize)
        Sets a parameter minBundleSize for the minimum bundle size of the source. Please refer to OffsetBasedSource for the definition of minBundleSize. This is an optional parameter.
      • validate

        public void validate()
        Description copied from class: Source
        Checks that this source is valid, before it can be used in a pipeline.

        It is recommended to use Preconditions for implementing this method.

        Overrides:
        validate in class FileBasedSource<T>
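
        The precondition style recommended above can be sketched as follows. XmlReadConfigSketch is a hypothetical stand-in and not the SDK's XmlSource; it uses java.util.Objects in place of a Preconditions utility, and it assumes that the with-methods return modified copies, which this page does not state:

```java
import java.util.Objects;

// Hypothetical stand-in for illustration; not the SDK's XmlSource.
final class XmlReadConfigSketch {
  private final String rootElement;
  private final String recordElement;
  private final Class<?> recordClass;

  private XmlReadConfigSketch(String rootElement, String recordElement, Class<?> recordClass) {
    this.rootElement = rootElement;
    this.recordElement = recordElement;
    this.recordClass = recordClass;
  }

  static XmlReadConfigSketch create() {
    return new XmlReadConfigSketch(null, null, null);
  }

  // Each with-method returns a copy with one field changed (an assumption).
  XmlReadConfigSketch withRootElement(String name) {
    return new XmlReadConfigSketch(name, recordElement, recordClass);
  }

  XmlReadConfigSketch withRecordElement(String name) {
    return new XmlReadConfigSketch(rootElement, name, recordClass);
  }

  XmlReadConfigSketch withRecordClass(Class<?> type) {
    return new XmlReadConfigSketch(rootElement, recordElement, type);
  }

  // Fails fast with a descriptive message when a required parameter is unset.
  void validate() {
    Objects.requireNonNull(rootElement, "rootElement is required; call withRootElement()");
    Objects.requireNonNull(recordElement, "recordElement is required; call withRecordElement()");
    Objects.requireNonNull(recordClass, "recordClass is required; call withRecordClass()");
  }

  public static void main(String[] args) {
    create().withRootElement("root").withRecordElement("record")
        .withRecordClass(String.class).validate();  // passes silently
    try {
      create().withRootElement("root").validate();  // recordElement is missing
    } catch (NullPointerException e) {
      System.out.println(e.getMessage());
    }
  }
}
```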
      • populateDisplayData

        public void populateDisplayData(DisplayData.Builder builder)
        Description copied from class: Source
        Register display data for the given transform or component.

        populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.

        By default, does not register any display data. Implementors may override this method to provide their own display data.

        Specified by:
        populateDisplayData in interface HasDisplayData
        Overrides:
        populateDisplayData in class FileBasedSource<T>
        Parameters:
        builder - The builder to populate with display data.
      • getDefaultOutputCoder

        public Coder<T> getDefaultOutputCoder()
        Description copied from class: Source
        Returns the default Coder to use for the data read from this source.
        Specified by:
        getDefaultOutputCoder in class Source<T>
      • getRootElement

        public String getRootElement()
      • getRecordElement

        public String getRecordElement()
      • getRecordClass

        public Class<T> getRecordClass()