Google Cloud Dataflow SDK for Java, version 1.9.1
- Type Parameters:
T- the type that will be written to the Sink.
@Experimental(value=SOURCE_SINK) public abstract class Sink<T> extends Object implements Serializable, HasDisplayDataA
Sinkrepresents a resource that can be written to using the
A parallel write to a
Sinkconsists of three phases:
- A sequential initialization phase (e.g., creating a temporary output directory, etc.)
- A parallel write phase where workers write bundles of records
- A sequential finalization phase (e.g., committing the writes, merging output files, etc.)
Sinkand the corresponding write operations requires extending three abstract classes:
Sink: an immutable logical description of the location/resource to write to. Depending on the type of sink, it may contain fields such as the path to an output directory on a filesystem, a database table name, etc. Implementors of
Sinkmust implement two methods:
Validateis called by the Write transform at pipeline creation, and should validate that the Sink can be written to. The createWriteOperation method is also called at pipeline creation, and should return a WriteOperation object that defines how to write to the Sink. Note that implementations of Sink must be serializable and Sinks must be immutable.
Sink.WriteOperation: The WriteOperation implements the initialization and finalization phases of a write. Implementors of
Sink.WriteOperationmust implement corresponding
Sink.WriteOperation.finalize(java.lang.Iterable<WriteT>, com.google.cloud.dataflow.sdk.options.PipelineOptions)methods. A WriteOperation must also implement
Sink.WriteOperation.createWriter(com.google.cloud.dataflow.sdk.options.PipelineOptions)that creates Writers,
Sink.WriteOperation.getWriterResultCoder()that returns a
Coderfor the result of a parallel write, and a
Sink.WriteOperation.getSink()that returns the Sink that the write operation corresponds to. See below for more information about these methods and restrictions on their implementation.
Sink.Writer: A Writer writes a bundle of records. Writer defines four methods:
Sink.Writer.open(java.lang.String), which is called once at the start of writing a bundle;
Sink.Writer.write(T), which writes a single record from the bundle;
Sink.Writer.close(), which is called once at the end of writing a bundle; and
Sink.Writer.getWriteOperation(), which returns the write operation that the writer belongs to.
Sink.WriteOperation.finalize(java.lang.Iterable<WriteT>, com.google.cloud.dataflow.sdk.options.PipelineOptions)are conceptually called once: at the beginning and end of a Write transform. However, implementors must ensure that these methods are idempotent, as they may be called multiple times on different machines in the case of failure/retry or for redundancy.
The finalize method of WriteOperation is passed an Iterable of a writer result type. This writer result type should encode the result of a write and, in most cases, some encoding of the unique bundle id.
All implementations of
Sink.WriteOperationmust be serializable.
WriteOperation may have mutable state. For instance,
Sink.WriteOperation.initialize(com.google.cloud.dataflow.sdk.options.PipelineOptions)may mutate the object state. These mutations will be visible in
In order to ensure fault-tolerance, a bundle may be executed multiple times (e.g., in the event of failure/retry or for redundancy). However, exactly one of these executions will have its result passed to the WriteOperation's finalize method. Each call to
Sink.Writer.open(java.lang.String)is passed a unique bundle id when it is called by the Write transform, so even redundant or retried bundles will have a unique way of identifying their output.
The bundle id should be used to guarantee that a bundle's output is unique. This uniqueness guarantee is important; if a bundle is to be output to a file, for example, the name of the file must be unique to avoid conflicts with other Writers. The bundle id should be encoded in the writer result returned by the Writer and subsequently used by the WriteOperation's finalize method to identify the results of successful writes.
For example, consider the scenario where a Writer writes files containing serialized records and the WriteOperation's finalization step is to merge or rename these output files. In this case, a Writer may use its unique id to name its output file (to avoid conflicts) and return the name of the file it wrote as its writer result. The WriteOperation will then receive an Iterable of output file names that it can then merge or rename using some bundle naming scheme.
Sink.Writers must agree on a writer result type that will be returned by a Writer after it writes a bundle. This type can be a client-defined object or an existing type;
Sink.WriteOperation.getWriterResultCoder()should return a
Coderfor the type.
A note about thread safety: Any use of static members or methods in Writer should be thread safe, as different instances of Writer objects may be created in different threads on the same worker.
- See Also:
- Serialized Form
Nested Class Summary
Nested Classes Modifier and Type Class and Description
Sink.WriteOperationdefines the process of a parallel write of objects to a Sink.
Sink.Writer<T,WriteT>A Writer writes a bundle of elements from a PCollection to a sink.
Constructors Constructor and Description
All Methods Instance Methods Abstract Methods Concrete Methods Modifier and Type Method and Description
createWriteOperation(PipelineOptions options)Returns an instance of a
Sink.WriteOperationthat can write to this Sink.
populateDisplayData(DisplayData.Builder builder)Register display data for the given transform or component.
validate(PipelineOptions options)Ensures that the sink is valid and can be written to before the write operation begins.
public abstract void validate(PipelineOptions options)Ensures that the sink is valid and can be written to before the write operation begins. One should use
Preconditionsto implement this method.
public abstract Sink.WriteOperation<T,?> createWriteOperation(PipelineOptions options)Returns an instance of a
Sink.WriteOperationthat can write to this Sink.
public void populateDisplayData(DisplayData.Builder builder)Register display data for the given transform or component.
populateDisplayData(DisplayData.Builder)is invoked by Pipeline runners to collect display data via
DisplayData.from(HasDisplayData). Implementations may call
super.populateDisplayData(builder)in order to register display data in the current namespace, but should otherwise use
subcomponent.populateDisplayData(builder)to use the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.