AvroIO (Google Cloud Dataflow SDK 1.9.1 API)

Google Cloud Dataflow SDK for Java, version 1.9.1

com.google.cloud.dataflow.sdk.io

Class AvroIO



  • public class AvroIO
    extends Object
    PTransforms for reading and writing Avro files.

    To read a PCollection from one or more Avro files, use AvroIO.Read, calling AvroIO.Read.from(java.lang.String) with the path of the file(s) to read from (e.g., a local filename or filename pattern if running locally, or a Google Cloud Storage filename or filename pattern of the form "gs://<bucket>/<filepath>"), and optionally AvroIO.Read.named(java.lang.String) to name the pipeline step.

    Calling AvroIO.Read.withSchema(java.lang.Class<T>) (or one of its overloads) is required. To read specific records, such as Avro-generated classes, provide the Avro-generated class. To read GenericRecords, provide either a Schema object or an Avro schema as a JSON-encoded string. An exception is thrown if a record does not match the specified schema.

    For example:

     
     Pipeline p = ...;
    
     // A simple Read of a local file (only runs locally):
     PCollection<AvroAutoGenClass> records =
         p.apply(AvroIO.Read.from("/path/to/file.avro")
                            .withSchema(AvroAutoGenClass.class));
    
     // A Read from a GCS file (runs locally and via the Google Cloud
     // Dataflow service):
     Schema schema = new Schema.Parser().parse(new File("schema.avsc"));
     PCollection<GenericRecord> records =
         p.apply(AvroIO.Read.named("ReadFromAvro")
                            .from("gs://my_bucket/path/to/records-*.avro")
                            .withSchema(schema));
      
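
    The JSON-encoded string form mentioned above can be passed to withSchema directly, without parsing a Schema object first. A minimal sketch, assuming a hypothetical record schema and bucket path chosen for illustration:

     
     // Reading GenericRecords with the schema given as a JSON string
     // (the "Number" schema and the GCS path are hypothetical):
     String jsonSchema =
         "{\"type\": \"record\", \"name\": \"Number\", "
         + "\"fields\": [{\"name\": \"value\", \"type\": \"long\"}]}";
     PCollection<GenericRecord> numbers =
         p.apply(AvroIO.Read.named("ReadNumbers")
                            .from("gs://my_bucket/path/to/numbers-*.avro")
                            .withSchema(jsonSchema));
      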

    To write a PCollection to one or more Avro files, use AvroIO.Write, calling AvroIO.Write.to(java.lang.String) with the path of the file to write to (e.g., a local filename or sharded filename pattern if running locally, or a Google Cloud Storage filename or sharded filename pattern of the form "gs://<bucket>/<filepath>"), and optionally AvroIO.Write.named(java.lang.String) to name the pipeline step.

    Calling AvroIO.Write.withSchema(java.lang.Class<T>) (or one of its overloads) is required. To write specific records, such as Avro-generated classes, provide the Avro-generated class. To write GenericRecords, provide either a Schema object or a schema as a JSON-encoded string. An exception is thrown if a record does not match the specified schema.

    For example:

     
     // A simple Write to a local file (only runs locally):
     PCollection<AvroAutoGenClass> records = ...;
     records.apply(AvroIO.Write.to("/path/to/file.avro")
                               .withSchema(AvroAutoGenClass.class));
    
     // A Write to a sharded GCS file (runs locally and via the Google Cloud
     // Dataflow service):
     Schema schema = new Schema.Parser().parse(new File("schema.avsc"));
     PCollection<GenericRecord> records = ...;
     records.apply(AvroIO.Write.named("WriteToAvro")
                               .to("gs://my_bucket/path/to/numbers")
                               .withSchema(schema)
                               .withSuffix(".avro"));
      
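
    Output sharding can also be controlled explicitly. A sketch assuming the withNumShards and withoutSharding options on AvroIO.Write in this SDK version (the paths are hypothetical):

     
     // Write to a fixed number of shards (hypothetical path):
     records.apply(AvroIO.Write.named("WriteThreeShards")
                               .to("gs://my_bucket/path/to/records")
                               .withSchema(AvroAutoGenClass.class)
                               .withNumShards(3)
                               .withSuffix(".avro"));
    
     // Force a single unsharded output file; note this limits the
     // parallelism of the write on large outputs:
     records.apply(AvroIO.Write.named("WriteSingleFile")
                               .to("gs://my_bucket/path/to/all-records.avro")
                               .withSchema(AvroAutoGenClass.class)
                               .withoutSharding());
      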

    Permissions

    Permission requirements depend on the PipelineRunner used to execute the Dataflow job. Refer to the documentation of the corresponding PipelineRunner for details.
