AvroSource (Google Cloud Dataflow SDK 1.9.1 API)

Google Cloud Dataflow SDK for Java, version 1.9.1


Class AvroSource<T>

  • Type Parameters:
    T - The type of records to be read from the source.
    All Implemented Interfaces:
    HasDisplayData, Serializable

    public class AvroSource<T>
    extends BlockBasedSource<T>

    A FileBasedSource for reading Avro files.

    To read a PCollection of objects from one or more Avro files, use from(java.lang.String) to specify the path(s) of the files to read. The AvroSource that is returned will read objects of type GenericRecord with the schema(s) that were written at file creation. To further configure the AvroSource to read with a user-defined schema, or to return records of a type other than GenericRecord, use withSchema(Schema) (using an Avro Schema), withSchema(String) (using a JSON schema), or withSchema(Class) (to return objects of the Avro-generated class specified).

    An AvroSource can be read from using the Read transform. For example:

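     // Assumes pipeline is the Pipeline under construction and file.toPath() yields the path as a String.
     AvroSource<MyType> source = AvroSource.from(file.toPath()).withSchema(MyType.class);
     PCollection<MyType> records = pipeline.apply(Read.from(source));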

    The readFromFileWithClass(String, Class) method is a convenience method that returns a read transform. For example:

     // pipeline: the Pipeline under construction (assumed to be in scope).
     PCollection<MyType> records = pipeline.apply(AvroSource.readFromFileWithClass(file.toPath(), MyType.class));

    This class's implementation is based on the Avro 1.7.7 specification and implements parsing of some parts of Avro Object Container Files. The rationale for doing so is that the Avro API does not provide efficient ways of computing the precise offsets of blocks within a file, which is necessary to support dynamic work rebalancing. However, whenever it is possible to use the Avro API in a way that supports maintaining precise offsets, this class uses the Avro API.

    Avro Object Container Files store records in blocks. Each block contains a collection of records. Blocks may be encoded (e.g., with bzip2, deflate, or snappy). Blocks are delineated from one another by a 16-byte sync marker.
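
    The role of the sync marker can be illustrated with a small scan over an in-memory byte array. This is a minimal sketch, not the parsing code of this class: the class name SyncMarkerScan and its method are hypothetical, the marker is passed in by the caller (in a real file it is read from the file header), and the sketch ignores the unlikely case where the marker bytes also occur inside block data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of the Dataflow SDK.
final class SyncMarkerScan {
    static final int SYNC_SIZE = 16;

    /**
     * Returns the offsets immediately following each occurrence of the
     * 16-byte marker found at or after position from. In a container file,
     * each such offset is where the next block (if any) begins.
     */
    static List<Integer> offsetsAfterMarkers(byte[] data, byte[] marker, int from) {
        if (marker.length != SYNC_SIZE) {
            throw new IllegalArgumentException("sync marker must be 16 bytes");
        }
        List<Integer> result = new ArrayList<>();
        int i = Math.max(from, 0);
        while (i + SYNC_SIZE <= data.length) {
            if (matchesAt(data, marker, i)) {
                result.add(i + SYNC_SIZE);
                i += SYNC_SIZE; // skip past the marker just matched
            } else {
                i++;
            }
        }
        return result;
    }

    // Compares the marker against data starting at pos.
    private static boolean matchesAt(byte[] data, byte[] marker, int pos) {
        for (int k = 0; k < marker.length; k++) {
            if (data[pos + k] != marker[k]) {
                return false;
            }
        }
        return true;
    }
}
```

    Starting the scan at an arbitrary offset skips any marker that begins before that offset, which is why a reader positioned mid-file resumes at the next block boundary rather than in the middle of a block.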

    An AvroSource for a subrange of a single file contains the records of every block whose start offset is greater than or equal to the start offset of the source and less than the end offset of the source.
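
    The ownership rule is a half-open interval test on the block's start offset. The sketch below shows that test in isolation; BlockOwnership, its methods, and the offsets used are hypothetical and are not taken from this class's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of the Dataflow SDK.
final class BlockOwnership {
    /** True when a block starting at blockStart falls in the half-open range [start, end). */
    static boolean owns(long start, long end, long blockStart) {
        return blockStart >= start && blockStart < end;
    }

    /** Returns the block start offsets that a source covering [start, end) would read. */
    static List<Long> blocksFor(long start, long end, long[] blockStarts) {
        List<Long> owned = new ArrayList<>();
        for (long b : blockStarts) {
            if (owns(start, end, b)) {
                owned.add(b);
            }
        }
        return owned;
    }
}
```

    With assumed block starts 20, 95, 100, and 250 and sources covering [0, 100), [100, 200), and [200, 300), the block at 95 goes to the first source even if its bytes run past offset 100, and the block at 100 goes to the second. Because only the start offset is tested, each block is read by exactly one source.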

    To use XZ-encoded Avro files, include an explicit dependency on xz-1.5.jar, which is marked as optional in the Maven sdk/pom.xml for Google Cloud Dataflow.
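
    A declaration along the following lines should work in a project's pom.xml. It assumes the usual Maven Central coordinates for XZ for Java (groupId org.tukaani, artifactId xz); the version matches the xz-1.5.jar named above.

```xml
<!-- Assumed coordinates for the XZ for Java library; verify against your repository. -->
<dependency>
  <groupId>org.tukaani</groupId>
  <artifactId>xz</artifactId>
  <version>1.5</version>
</dependency>
```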



    Permission requirements depend on the PipelineRunner that is used to execute the Dataflow job. Refer to the documentation of the corresponding PipelineRunner for more details.

    See Also:
    Serialized Form