Google Cloud Dataflow SDK for Java, version 1.9.1
Class XmlSource<T>
- java.lang.Object
  - com.google.cloud.dataflow.sdk.io.Source<T>
    - com.google.cloud.dataflow.sdk.io.BoundedSource<T>
      - com.google.cloud.dataflow.sdk.io.OffsetBasedSource<T>
        - com.google.cloud.dataflow.sdk.io.FileBasedSource<T>
          - com.google.cloud.dataflow.sdk.io.XmlSource<T>
- Type Parameters:
  T - Type of the objects that represent the records of the XML file. The PCollection generated by this source will be of this type.
- All Implemented Interfaces:
  HasDisplayData, Serializable
public class XmlSource<T> extends FileBasedSource<T>
A source that can be used to read XML files. This source reads one or more XML files and creates a PCollection of a given type. A Dataflow read transform can be created by passing an XmlSource object to Read.from(), as in the example given below.

The XML file must be of the following form, where root and record are XML element names that are defined by the user:

    <root>
      <record> ... </record>
      <record> ... </record>
      <record> ... </record>
      ...
      <record> ... </record>
    </root>

That is, the XML document must contain a single root element with an inner list consisting entirely of record elements. The records may contain arbitrary XML content; however, that content must not contain the start <record> or end </record> tags. This restriction enables reading from large XML files in parallel from different offsets in the file.

Root and/or record elements may additionally contain an arbitrary number of XML attributes. Additionally, users must provide a class of a JAXB annotated Java type that can be used to convert records into Java objects and vice versa using JAXB marshalling/unmarshalling mechanisms. Reading the source will generate a PCollection of the given JAXB annotated Java type. Optionally, users may provide a minimum size of a bundle that should be created for the source.

The following example shows how to read from XmlSource in a Dataflow pipeline, where Record is the JAXB annotated record class:

    XmlSource<Record> source = XmlSource.<Record>from(file.toPath().toString())
        .withRootElement("root")
        .withRecordElement("record")
        .withRecordClass(Record.class);
    PCollection<Record> output = p.apply(Read.from(source));

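The document shape described above (a single root element whose children are all record elements, with no record tags nested inside a record) can be checked before a file is handed to the source. The following is a minimal sketch using the JDK's StAX API. The class XmlShapeCheck and the method countRecords are hypothetical helpers, not part of the Dataflow SDK, and the sketch reads the whole document from an in-memory string purely for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the Dataflow SDK.
final class XmlShapeCheck {

  /**
   * Counts the record elements directly under the root element.
   * Throws IllegalArgumentException if the document is malformed, has a
   * different root, has a non-record child under the root, or nests a
   * record element inside another record.
   */
  static int countRecords(String xml, String rootElement, String recordElement) {
    try {
      XMLStreamReader reader =
          XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
      int depth = 0;
      int records = 0;
      while (reader.hasNext()) {
        int event = reader.next();
        if (event == XMLStreamConstants.START_ELEMENT) {
          depth++;
          String name = reader.getLocalName();
          if (depth == 1 && !name.equals(rootElement)) {
            throw new IllegalArgumentException("Unexpected root element: " + name);
          }
          if (depth == 2) {
            if (!name.equals(recordElement)) {
              throw new IllegalArgumentException("Unexpected child of root: " + name);
            }
            records++;
          }
          if (depth > 2 && name.equals(recordElement)) {
            throw new IllegalArgumentException("Record element nested inside a record");
          }
        } else if (event == XMLStreamConstants.END_ELEMENT) {
          depth--;
        }
      }
      reader.close();
      return records;
    } catch (XMLStreamException e) {
      throw new IllegalArgumentException("Malformed XML", e);
    }
  }

  public static void main(String[] args) {
    // Attributes on records and arbitrary inner content are allowed.
    String xml = "<root><record id=\"1\"><name>a</name></record><record id=\"2\"/></root>";
    System.out.println(countRecords(xml, "root", "record")); // prints 2
  }
}
```

A check like this is only a pre-flight validation; it does not replace the parsing that the source itself performs.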
Currently, only XML files that use single-byte characters are supported. Using a file that contains multi-byte characters may result in data loss or duplication.
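One way to guard against this limitation is to scan the input for bytes outside the 7-bit ASCII range before reading it. The sketch below assumes the file is UTF-8 encoded, in which case any byte with the high bit set belongs to a multi-byte character. For a single-byte encoding such as ISO-8859-1 the check would be overly strict. The class and method names are hypothetical and not part of the SDK.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Dataflow SDK.
final class SingleByteCheck {

  /** Returns the index of the first byte with the high bit set, or -1 if every byte is 7-bit ASCII. */
  static int firstNonAsciiByte(byte[] data) {
    for (int i = 0; i < data.length; i++) {
      if ((data[i] & 0x80) != 0) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    byte[] ascii = "<root><record>abc</record></root>".getBytes(StandardCharsets.UTF_8);
    // U+00E9 (e with acute accent) is encoded as two bytes in UTF-8.
    byte[] accented = "<root><record>caf\u00e9</record></root>".getBytes(StandardCharsets.UTF_8);
    System.out.println(firstNonAsciiByte(ascii));
    System.out.println(firstNonAsciiByte(accented));
  }
}
```

For a real file, the byte array could come from java.nio.file.Files.readAllBytes, or the same test could be applied to a buffered stream for inputs too large to hold in memory.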
To use XmlSource:
- Explicitly declare a dependency on org.codehaus.woodstox:stax2-api
- Include a compatible implementation on the classpath at run-time, such as org.codehaus.woodstox:woodstox-core-asl
These dependencies have been declared as optional in the Maven sdk/pom.xml file of Google Cloud Dataflow.
Permissions
Permission requirements depend on the PipelineRunner that is used to execute the Dataflow job. Please refer to the documentation of the corresponding PipelineRunners for more details.
- See Also:
  Serialized Form
Nested Class Summary
- Nested classes/interfaces inherited from class com.google.cloud.dataflow.sdk.io.FileBasedSource
  FileBasedSource.FileBasedReader<T>, FileBasedSource.Mode
- Nested classes/interfaces inherited from class com.google.cloud.dataflow.sdk.io.OffsetBasedSource
  OffsetBasedSource.OffsetBasedReader<T>
- Nested classes/interfaces inherited from class com.google.cloud.dataflow.sdk.io.BoundedSource
  BoundedSource.BoundedReader<T>
- Nested classes/interfaces inherited from class com.google.cloud.dataflow.sdk.io.Source
  Source.Reader<T>
Method Summary
Modifier and Type / Method and Description
- protected FileBasedSource<T> createForSubrangeOfFile(String fileName, long start, long end)
  Creates and returns a new FileBasedSource of the same type as the current FileBasedSource backed by a given file and an offset range.
- protected FileBasedSource.FileBasedReader<T> createSingleFileReader(PipelineOptions options)
  Creates and returns an instance of a FileBasedReader implementation for the current source assuming the source represents a single file.
- static <T> XmlSource<T> from(String fileOrPatternSpec)
  Creates an XmlSource for a single XML file or a set of XML files defined by a Java "glob" file pattern.
- Coder<T> getDefaultOutputCoder()
  Returns the default Coder to use for the data read from this source.
- Class<T> getRecordClass()
- String getRecordElement()
- String getRootElement()
- void populateDisplayData(DisplayData.Builder builder)
  Register display data for the given transform or component.
- boolean producesSortedKeys(PipelineOptions options)
  Whether this source is known to produce key/value pairs sorted by lexicographic order on the bytes of the encoded key.
- void validate()
  Checks that this source is valid, before it can be used in a pipeline.
- XmlSource<T> withMinBundleSize(long minBundleSize)
  Sets a parameter minBundleSize for the minimum bundle size of the source.
- XmlSource<T> withRecordClass(Class<T> recordClass)
  Sets a JAXB annotated class that can be populated using a record of the provided XML file.
- XmlSource<T> withRecordElement(String recordElement)
  Sets the name of the record element of the XML document.
- XmlSource<T> withRootElement(String rootElement)
  Sets the name of the root element of the XML document.
- Methods inherited from class com.google.cloud.dataflow.sdk.io.FileBasedSource
  createReader, createSourceForSubrange, expandFilePattern, getEstimatedSizeBytes, getFileOrPatternSpec, getFileOrPatternSpecProvider, getMaxEndOffset, getMode, isSplittable, splitIntoBundles, toString
- Methods inherited from class com.google.cloud.dataflow.sdk.io.OffsetBasedSource
  allowsDynamicSplitting, getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffset
Method Detail
- from
  public static <T> XmlSource<T> from(String fileOrPatternSpec)
  Creates an XmlSource for a single XML file or a set of XML files defined by a Java "glob" file pattern. Each XML file should be of the form defined in XmlSource.
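As an illustration of glob syntax, the sketch below uses the JDK's java.nio.file.PathMatcher to test file names against a pattern. It only demonstrates how a Java glob such as records-*.xml selects names. How XmlSource expands a pattern depends on the file system and runner in use, and the class name and file names here are made up for the example.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical illustration of Java glob matching, not part of the Dataflow SDK.
final class GlobSketch {

  /** Returns true if fileName matches the given glob pattern on the default file system. */
  static boolean matches(String glob, String fileName) {
    PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
    return matcher.matches(Paths.get(fileName));
  }

  public static void main(String[] args) {
    // '*' matches any run of characters within one name; '?' matches exactly one character.
    System.out.println(matches("records-*.xml", "records-01.xml"));
    System.out.println(matches("records-*.xml", "records-01.json"));
    System.out.println(matches("records-?.xml", "records-10.xml"));
  }
}
```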
- withRootElement
  public XmlSource<T> withRootElement(String rootElement)
  Sets the name of the root element of the XML document. This will be used to create a valid starting root element when initiating a bundle of records created from an XML document. This is a required parameter.
- withRecordElement
  public XmlSource<T> withRecordElement(String recordElement)
  Sets the name of the record element of the XML document. This will be used to determine the offset of the first record of a bundle created from the XML document. This is a required parameter.
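To illustrate why the record element name is needed, the sketch below scans a byte array for the first record start tag at or after a given offset, which is roughly what a reader that starts in the middle of a file has to do. This is not the SDK's implementation. RecordOffsetSketch and firstRecordOffset are hypothetical names, and the code assumes single-byte characters and ignores comments and CDATA sections.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the Dataflow SDK's implementation.
final class RecordOffsetSketch {

  /**
   * Returns the byte offset of the first start tag of recordElement at or
   * after start, or -1 if there is none.
   */
  static int firstRecordOffset(byte[] data, int start, String recordElement) {
    byte[] tag = ("<" + recordElement).getBytes(StandardCharsets.US_ASCII);
    // The bound leaves room for the one byte that must follow the tag name.
    for (int i = Math.max(start, 0); i + tag.length < data.length; i++) {
      boolean match = true;
      for (int j = 0; j < tag.length; j++) {
        if (data[i + j] != tag[j]) {
          match = false;
          break;
        }
      }
      if (!match) {
        continue;
      }
      // The name must end here, so that "<records" is not mistaken for "<record".
      byte next = data[i + tag.length];
      if (next == '>' || next == '/' || next == ' ' || next == '\t' || next == '\n' || next == '\r') {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    byte[] data = "<root><record>a</record><record>b</record></root>"
        .getBytes(StandardCharsets.US_ASCII);
    System.out.println(firstRecordOffset(data, 0, "record")); // prints 6
    System.out.println(firstRecordOffset(data, 7, "record")); // prints 24
  }
}
```

Because a scan like this looks for the literal tag, it also shows why record content must not itself contain the record start or end tags.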
- withRecordClass
  public XmlSource<T> withRecordClass(Class<T> recordClass)
  Sets a JAXB annotated class that can be populated using a record of the provided XML file. This will be used when unmarshalling record objects from the XML file. This is a required parameter.
- withMinBundleSize
  public XmlSource<T> withMinBundleSize(long minBundleSize)
  Sets a parameter minBundleSize for the minimum bundle size of the source. Please refer to OffsetBasedSource for the definition of minBundleSize. This is an optional parameter.
- createForSubrangeOfFile
  protected FileBasedSource<T> createForSubrangeOfFile(String fileName, long start, long end)
  Description copied from class: FileBasedSource
  Creates and returns a new FileBasedSource of the same type as the current FileBasedSource backed by a given file and an offset range. When the current source is being split, this method is used to generate new sub-sources. When creating the source, subclasses must call the constructor FileBasedSource.FileBasedSource(String, long, long, long) of FileBasedSource with the corresponding parameter values passed here.
  - Specified by:
    createForSubrangeOfFile in class FileBasedSource<T>
  - Parameters:
    fileName - file backing the new FileBasedSource.
    start - starting byte offset of the new FileBasedSource.
    end - ending byte offset of the new FileBasedSource. May be Long.MAX_VALUE, in which case it will be inferred using FileBasedSource.getMaxEndOffset(com.google.cloud.dataflow.sdk.options.PipelineOptions).
- createSingleFileReader
  protected FileBasedSource.FileBasedReader<T> createSingleFileReader(PipelineOptions options)
  Description copied from class: FileBasedSource
  Creates and returns an instance of a FileBasedReader implementation for the current source assuming the source represents a single file. File patterns will be handled by the FileBasedSource implementation automatically.
  - Specified by:
    createSingleFileReader in class FileBasedSource<T>
- producesSortedKeys
  public boolean producesSortedKeys(PipelineOptions options) throws Exception
  Description copied from class: BoundedSource
  Whether this source is known to produce key/value pairs sorted by lexicographic order on the bytes of the encoded key.
  - Specified by:
    producesSortedKeys in class BoundedSource<T>
  - Throws:
    Exception
- validate
  public void validate()
  Description copied from class: Source
  Checks that this source is valid, before it can be used in a pipeline.
  It is recommended to use Preconditions for implementing this method.
  - Overrides:
    validate in class FileBasedSource<T>
- populateDisplayData
  public void populateDisplayData(DisplayData.Builder builder)
  Description copied from class: Source
  Register display data for the given transform or component.
  populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.
  By default, does not register any display data. Implementors may override this method to provide their own display data.
  - Specified by:
    populateDisplayData in interface HasDisplayData
  - Overrides:
    populateDisplayData in class FileBasedSource<T>
  - Parameters:
    builder - The builder to populate with display data.
  - See Also:
    HasDisplayData
- getDefaultOutputCoder
  public Coder<T> getDefaultOutputCoder()
  Description copied from class: Source
  Returns the default Coder to use for the data read from this source.
  - Specified by:
    getDefaultOutputCoder in class Source<T>
- getRootElement
  public String getRootElement()
- getRecordElement
  public String getRecordElement()