Google Cloud Dataflow SDK for Java, version 1.9.1
Class AvroSource<T>
- java.lang.Object
-
- com.google.cloud.dataflow.sdk.io.Source<T>
-
- com.google.cloud.dataflow.sdk.io.BoundedSource<T>
-
- com.google.cloud.dataflow.sdk.io.OffsetBasedSource<T>
-
- com.google.cloud.dataflow.sdk.io.FileBasedSource<T>
-
- com.google.cloud.dataflow.sdk.io.BlockBasedSource<T>
-
- com.google.cloud.dataflow.sdk.io.AvroSource<T>
-
- Type Parameters:
T
- The type of records to be read from the source.
- All Implemented Interfaces:
- HasDisplayData, Serializable
@Experimental(value=SOURCE_SINK) public class AvroSource<T> extends BlockBasedSource<T>
AFileBasedSource
for reading Avro files.To read a
PCollection
of objects from one or more Avro files, usefrom(java.lang.String)
to specify the path(s) of the files to read. TheAvroSource
that is returned will read objects of typeGenericRecord
with the schema(s) that were written at file creation. To further configure theAvroSource
to read with a user-defined schema, or to return records of a type other thanGenericRecord
, usewithSchema(Schema)
(using an AvroSchema
),withSchema(String)
(using a JSON schema), orwithSchema(Class)
(to return objects of the Avro-generated class specified).An
AvroSource
can be read from using theRead
transform. For example:AvroSource<MyType> source = AvroSource.from(file.toPath()).withSchema(MyType.class); PCollection<MyType> records = Read.from(mySource);
The
readFromFileWithClass(String, Class)
method is a convenience method that returns a read transform. For example:PCollection<MyType> records = AvroSource.readFromFileWithClass(file.toPath(), MyType.class));
This class's implementation is based on the Avro 1.7.7 specification and implements parsing of some parts of Avro Object Container Files. The rationale for doing so is that the Avro API does not provide efficient ways of computing the precise offsets of blocks within a file, which is necessary to support dynamic work rebalancing. However, whenever it is possible to use the Avro API in a way that supports maintaining precise offsets, this class uses the Avro API.
Avro Object Container files store records in blocks. Each block contains a collection of records. Blocks may be encoded (e.g., with bzip2, deflate, snappy, etc.). Blocks are delineated from one another by a 16-byte sync marker.
An
AvroSource
for a subrange of a single file contains records in the blocks such that the start offset of the block is greater than or equal to the start offset of the source and less than the end offset of the source.To use XZ-encoded Avro files, please include an explicit dependency on
xz-1.5.jar
, which has been marked as optional in the Mavensdk/pom.xml
for Google Cloud Dataflow:<dependency> <groupId>org.tukaani</groupId> <artifactId>xz</artifactId> <version>1.5</version> </dependency>
Permissions
Permission requirements depend on the
PipelineRunner
that is used to execute the Dataflow job. Please refer to the documentation of correspondingPipelineRunner
s for more details.- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes Modifier and Type Class and Description static class
AvroSource.AvroReader<T>
ABlockBasedSource.BlockBasedReader
for reading blocks from Avro files.-
Nested classes/interfaces inherited from class com.google.cloud.dataflow.sdk.io.BlockBasedSource
BlockBasedSource.Block<T>, BlockBasedSource.BlockBasedReader<T>
-
Nested classes/interfaces inherited from class com.google.cloud.dataflow.sdk.io.FileBasedSource
FileBasedSource.FileBasedReader<T>, FileBasedSource.Mode
-
Nested classes/interfaces inherited from class com.google.cloud.dataflow.sdk.io.OffsetBasedSource
OffsetBasedSource.OffsetBasedReader<T>
-
Nested classes/interfaces inherited from class com.google.cloud.dataflow.sdk.io.BoundedSource
BoundedSource.BoundedReader<T>
-
Nested classes/interfaces inherited from class com.google.cloud.dataflow.sdk.io.Source
Source.Reader<T>
-
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method and Description BlockBasedSource<T>
createForSubrangeOfFile(String fileName, long start, long end)
Creates aBlockBasedSource
for the specified range in a single file.protected BlockBasedSource.BlockBasedReader<T>
createSingleFileReader(PipelineOptions options)
Creates aBlockBasedReader
.static AvroSource<GenericRecord>
from(String fileNameOrPattern)
Creates anAvroSource
that reads from the given file name or pattern ("glob").AvroCoder<T>
getDefaultOutputCoder()
Returns the defaultCoder
to use for the data read from this source.String
getSchema()
boolean
producesSortedKeys(PipelineOptions options)
Whether this source is known to produce key/value pairs sorted by lexicographic order on the bytes of the encoded key.static <T> Read.Bounded<T>
readFromFileWithClass(String filePattern, Class<T> clazz)
Creates aRead
transform that will read from anAvroSource
that is configured to read records of the given type from a file pattern.void
validate()
Checks that this source is valid, before it can be used in a pipeline.AvroSource<T>
withMinBundleSize(long minBundleSize)
Returns anAvroSource
that's like this one but uses the supplied minimum bundle size.<X> AvroSource<X>
withSchema(Class<X> clazz)
Returns anAvroSource
that's like this one but reads files containing records of the type of the given class.AvroSource<GenericRecord>
withSchema(Schema schema)
Returns anAvroSource
that's like this one but reads files containing records that conform to the given schema.AvroSource<GenericRecord>
withSchema(String schema)
Returns anAvroSource
that's like this one but reads files containing records that conform to the given schema.-
Methods inherited from class com.google.cloud.dataflow.sdk.io.FileBasedSource
createReader, createSourceForSubrange, expandFilePattern, getEstimatedSizeBytes, getFileOrPatternSpec, getFileOrPatternSpecProvider, getMaxEndOffset, getMode, isSplittable, populateDisplayData, splitIntoBundles, toString
-
Methods inherited from class com.google.cloud.dataflow.sdk.io.OffsetBasedSource
allowsDynamicSplitting, getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffset
-
-
-
-
Method Detail
-
readFromFileWithClass
public static <T> Read.Bounded<T> readFromFileWithClass(String filePattern, Class<T> clazz)
Creates aRead
transform that will read from anAvroSource
that is configured to read records of the given type from a file pattern.
-
from
public static AvroSource<GenericRecord> from(String fileNameOrPattern)
Creates anAvroSource
that reads from the given file name or pattern ("glob"). The returned source can be further configured by callingwithSchema(java.lang.String)
to return a type other thanGenericRecord
.
-
withSchema
public AvroSource<GenericRecord> withSchema(String schema)
Returns anAvroSource
that's like this one but reads files containing records that conform to the given schema.Does not modify this object.
-
withSchema
public AvroSource<GenericRecord> withSchema(Schema schema)
Returns anAvroSource
that's like this one but reads files containing records that conform to the given schema.Does not modify this object.
-
withSchema
public <X> AvroSource<X> withSchema(Class<X> clazz)
Returns anAvroSource
that's like this one but reads files containing records of the type of the given class.Does not modify this object.
-
withMinBundleSize
public AvroSource<T> withMinBundleSize(long minBundleSize)
Returns anAvroSource
that's like this one but uses the supplied minimum bundle size. Refer toOffsetBasedSource
for a description ofminBundleSize
and its use.Does not modify this object.
-
validate
public void validate()
Description copied from class:Source
Checks that this source is valid, before it can be used in a pipeline.It is recommended to use
Preconditions
for implementing this method.- Overrides:
validate
in classFileBasedSource<T>
-
createForSubrangeOfFile
public BlockBasedSource<T> createForSubrangeOfFile(String fileName, long start, long end)
Description copied from class:BlockBasedSource
Creates aBlockBasedSource
for the specified range in a single file.- Specified by:
createForSubrangeOfFile
in classBlockBasedSource<T>
- Parameters:
fileName
- file backing the newFileBasedSource
.start
- starting byte offset of the newFileBasedSource
.end
- ending byte offset of the newFileBasedSource
. May be Long.MAX_VALUE, in which case it will be inferred usingFileBasedSource.getMaxEndOffset(com.google.cloud.dataflow.sdk.options.PipelineOptions)
.
-
createSingleFileReader
protected BlockBasedSource.BlockBasedReader<T> createSingleFileReader(PipelineOptions options)
Description copied from class:BlockBasedSource
Creates aBlockBasedReader
.- Specified by:
createSingleFileReader
in classBlockBasedSource<T>
-
producesSortedKeys
public boolean producesSortedKeys(PipelineOptions options) throws Exception
Description copied from class:BoundedSource
Whether this source is known to produce key/value pairs sorted by lexicographic order on the bytes of the encoded key.- Specified by:
producesSortedKeys
in classBoundedSource<T>
- Throws:
Exception
-
getDefaultOutputCoder
public AvroCoder<T> getDefaultOutputCoder()
Description copied from class:Source
Returns the defaultCoder
to use for the data read from this source.- Specified by:
getDefaultOutputCoder
in classSource<T>
-
getSchema
public String getSchema()
-
-