Google Cloud Dataflow SDK for Java, version 1.9.1
Class Sink<T>
- java.lang.Object
  - com.google.cloud.dataflow.sdk.io.Sink<T>
- Type Parameters:
  T - the type that will be written to the Sink.
- All Implemented Interfaces:
  HasDisplayData, Serializable
- Direct Known Subclasses:
  DatastoreIO.Sink, FileBasedSink
@Experimental(value=SOURCE_SINK) public abstract class Sink<T> extends Object implements Serializable, HasDisplayData
A Sink represents a resource that can be written to using the Write transform.

A parallel write to a Sink consists of three phases:
- A sequential initialization phase (e.g., creating a temporary output directory, etc.)
- A parallel write phase where workers write bundles of records
- A sequential finalization phase (e.g., committing the writes, merging output files, etc.)

The Write transform can be used in a Dataflow pipeline to perform this write. Specifically, a Write transform can be applied to a PCollection p by:

  p.apply(Write.to(new MySink()));

Implementing a Sink and the corresponding write operations requires extending three abstract classes:
- Sink: an immutable logical description of the location/resource to write to. Depending on the type of sink, it may contain fields such as the path to an output directory on a filesystem, a database table name, etc. Implementors of Sink must implement two methods: validate(com.google.cloud.dataflow.sdk.options.PipelineOptions) and createWriteOperation(com.google.cloud.dataflow.sdk.options.PipelineOptions). validate is called by the Write transform at pipeline creation, and should validate that the Sink can be written to. The createWriteOperation method is also called at pipeline creation, and should return a WriteOperation object that defines how to write to the Sink. Note that implementations of Sink must be serializable and Sinks must be immutable.
- Sink.WriteOperation: The WriteOperation implements the initialization and finalization phases of a write. Implementors of Sink.WriteOperation must implement the corresponding Sink.WriteOperation.initialize(com.google.cloud.dataflow.sdk.options.PipelineOptions) and Sink.WriteOperation.finalize(java.lang.Iterable<WriteT>, com.google.cloud.dataflow.sdk.options.PipelineOptions) methods. A WriteOperation must also implement Sink.WriteOperation.createWriter(com.google.cloud.dataflow.sdk.options.PipelineOptions), which creates Writers; Sink.WriteOperation.getWriterResultCoder(), which returns a Coder for the result of a parallel write; and Sink.WriteOperation.getSink(), which returns the Sink that the write operation corresponds to. See below for more information about these methods and restrictions on their implementation.
- Sink.Writer: A Writer writes a bundle of records. Writer defines four methods: Sink.Writer.open(java.lang.String), which is called once at the start of writing a bundle; Sink.Writer.write(T), which writes a single record from the bundle; Sink.Writer.close(), which is called once at the end of writing a bundle; and Sink.Writer.getWriteOperation(), which returns the write operation that the writer belongs to.

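The division of labor among the three classes can be sketched in plain Java without the SDK on the classpath. Everything below is a hypothetical stand-in: SketchSink, SketchWriteOperation, SketchWriter, and LifecycleSketch are illustrative names, not SDK types; PipelineOptions, Coders, and serialization are omitted; finalize is called finalizeWrite here to avoid overloading Object.finalize(); and the bundles are processed in a simple loop, whereas a real pipeline distributes them across workers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-in for Sink.WriteOperation: owns the sequential initialize and finalize phases. */
class SketchWriteOperation {
  final List<String> log = new ArrayList<>();
  final List<String> committed = new ArrayList<>();

  void initialize() {
    log.add("initialize");
  }

  SketchWriter createWriter() {
    return new SketchWriter(this);
  }

  /** Receives one writer result per successfully written bundle. */
  void finalizeWrite(Iterable<String> writerResults) {
    for (String result : writerResults) {
      committed.add(result);
    }
    log.add("finalize");
  }
}

/** Stand-in for Sink.Writer: writes a single bundle and returns a writer result. */
class SketchWriter {
  private final SketchWriteOperation operation;
  private String bundleId;
  private int recordCount;

  SketchWriter(SketchWriteOperation operation) {
    this.operation = operation;
  }

  void open(String uniqueBundleId) {
    bundleId = uniqueBundleId;
    recordCount = 0;
  }

  void write(String record) {
    recordCount++;
  }

  /** The writer result here is simply "bundleId:recordCount". */
  String close() {
    return bundleId + ":" + recordCount;
  }

  SketchWriteOperation getWriteOperation() {
    return operation;
  }
}

/** Stand-in for Sink: an immutable description that hands out a write operation. */
class SketchSink {
  void validate() {
    // Nothing to check in this sketch.
  }

  SketchWriteOperation createWriteOperation() {
    return new SketchWriteOperation();
  }
}

class LifecycleSketch {
  /** Drives the three phases in order: initialize, write each bundle, finalize. */
  static SketchWriteOperation run(List<List<String>> bundles) {
    SketchSink sink = new SketchSink();
    sink.validate();
    SketchWriteOperation operation = sink.createWriteOperation();
    operation.initialize();
    List<String> results = new ArrayList<>();
    int next = 0;
    for (List<String> bundle : bundles) {
      SketchWriter writer = operation.createWriter();
      writer.open("bundle-" + next++);
      for (String record : bundle) {
        writer.write(record);
      }
      results.add(writer.close());
    }
    operation.finalizeWrite(results);
    return operation;
  }

  public static void main(String[] args) {
    SketchWriteOperation op = run(Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c")));
    System.out.println(op.log + " " + op.committed);
  }
}
```

The point of the sketch is the call order: the write operation brackets the bundle writes, and the only information flowing from writers back to finalization is the collection of writer results.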
WriteOperation
Sink.WriteOperation.initialize(com.google.cloud.dataflow.sdk.options.PipelineOptions) and Sink.WriteOperation.finalize(java.lang.Iterable<WriteT>, com.google.cloud.dataflow.sdk.options.PipelineOptions) are conceptually called once: at the beginning and end of a Write transform. However, implementors must ensure that these methods are idempotent, as they may be called multiple times on different machines in the case of failure/retry or for redundancy.

The finalize method of WriteOperation is passed an Iterable of a writer result type. This writer result type should encode the result of a write and, in most cases, some encoding of the unique bundle id.

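As an illustration of the idempotence requirement, the following sketch shows file-based initialization and finalization steps that can be repeated safely. It uses only java.nio.file and is an assumption about how one might structure such steps, not the SDK's implementation; IdempotentFileOps, its method names, and the directory layout are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Collections;

class IdempotentFileOps {
  /** Safe to repeat: createDirectories does nothing if the directory already exists. */
  static void initialize(Path tempDir) throws IOException {
    Files.createDirectories(tempDir);
  }

  /**
   * Moves each temporary file into outputDir. Safe to repeat: a file that an
   * earlier attempt already moved is no longer present at its temporary path
   * and is skipped.
   */
  static void finalizeWrite(Iterable<Path> tempFiles, Path outputDir) throws IOException {
    Files.createDirectories(outputDir);
    for (Path temp : tempFiles) {
      Path target = outputDir.resolve(temp.getFileName());
      if (Files.exists(temp)) {
        Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("sink-sketch");
    Path tempDir = root.resolve("tmp");
    initialize(tempDir);
    initialize(tempDir); // A retried initialization changes nothing.
    Path part = tempDir.resolve("part-0");
    Files.write(part, "record".getBytes());
    finalizeWrite(Collections.singletonList(part), root.resolve("out"));
    finalizeWrite(Collections.singletonList(part), root.resolve("out")); // Retry is a no-op.
    System.out.println(Files.exists(root.resolve("out").resolve("part-0")));
  }
}
```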
All implementations of Sink.WriteOperation must be serializable.

WriteOperation may have mutable state. For instance, Sink.WriteOperation.initialize(com.google.cloud.dataflow.sdk.options.PipelineOptions) may mutate the object state. These mutations will be visible in Sink.WriteOperation.createWriter(com.google.cloud.dataflow.sdk.options.PipelineOptions) and Sink.WriteOperation.finalize(java.lang.Iterable<WriteT>, com.google.cloud.dataflow.sdk.options.PipelineOptions) because the object will be serialized after initialize and deserialized before these calls. However, it is not serialized again after createWriter is called, as createWriter will be called within workers to create Writers for the bundles that are distributed to these workers. Therefore, createWriter should not mutate the WriteOperation state (as these mutations will not be visible in finalize).

Bundle Ids:
In order to ensure fault-tolerance, a bundle may be executed multiple times (e.g., in the event of failure/retry or for redundancy). However, exactly one of these executions will have its result passed to the WriteOperation's finalize method. Each call to Sink.Writer.open(java.lang.String) is passed a unique bundle id when it is called by the Write transform, so even redundant or retried bundles will have a unique way of identifying their output.

The bundle id should be used to guarantee that a bundle's output is unique. This uniqueness guarantee is important; if a bundle is to be output to a file, for example, the name of the file must be unique to avoid conflicts with other Writers. The bundle id should be encoded in the writer result returned by the Writer and subsequently used by the WriteOperation's finalize method to identify the results of successful writes.

For example, consider the scenario where a Writer writes files containing serialized records and the WriteOperation's finalization step is to merge or rename these output files. In this case, a Writer may use its unique id to name its output file (to avoid conflicts) and return the name of the file it wrote as its writer result. The WriteOperation will then receive an Iterable of output file names that it can then merge or rename using some bundle naming scheme.

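The file-naming scenario above can be sketched with an in-memory map standing in for a file system. This is a minimal illustration under stated assumptions: BundleIdSketch, Store, the "temp-" and "output-" naming schemes, and the cleanup of unreferenced temporary files are all hypothetical choices, not behavior prescribed by the SDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BundleIdSketch {
  /** Hypothetical in-memory "file system": file name to contents. */
  static final class Store {
    final Map<String, String> files = new TreeMap<>();
  }

  /**
   * Writes one bundle to a temporary file named after its unique bundle id
   * and returns that file name as the writer result.
   */
  static String writeBundle(Store store, String bundleId, List<String> records) {
    String tempName = "temp-" + bundleId;
    store.files.put(tempName, String.join("\n", records));
    return tempName;
  }

  /**
   * Renames the files named by the writer results to output-N, then deletes
   * any temporary files left behind by retried or redundant executions.
   */
  static void finalizeWrite(Store store, List<String> writerResults) {
    List<String> sorted = new ArrayList<>(writerResults);
    Collections.sort(sorted); // Fixed order keeps shard numbering deterministic.
    int shard = 0;
    for (String tempName : sorted) {
      String contents = store.files.remove(tempName);
      if (contents != null) {
        store.files.put("output-" + shard, contents);
      }
      shard++;
    }
    store.files.keySet().removeIf(name -> name.startsWith("temp-"));
  }

  public static void main(String[] args) {
    Store store = new Store();
    String first = writeBundle(store, "a1", Arrays.asList("x", "y"));
    writeBundle(store, "a2", Arrays.asList("x", "y")); // Retry of the same bundle.
    String second = writeBundle(store, "b1", Arrays.asList("z"));
    finalizeWrite(store, Arrays.asList(first, second));
    System.out.println(store.files.keySet());
  }
}
```

Because each execution writes under its own bundle id, the retried execution never overwrites the first one, and finalization can tell the winning outputs from the leftovers purely from the writer results it receives.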
Writer Results:
Sink.WriteOperations and Sink.Writers must agree on a writer result type that will be returned by a Writer after it writes a bundle. This type can be a client-defined object or an existing type; Sink.WriteOperation.getWriterResultCoder() should return a Coder for the type.

A note about thread safety: Any use of static members or methods in Writer should be thread safe, as different instances of Writer objects may be created in different threads on the same worker.
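The thread-safety note can be illustrated with a writer-like class that keeps a static counter. This is a sketch, not SDK code: CountingWriter and ThreadSafetySketch are hypothetical names, and the sketch assumes each writer instance is used by one thread at a time, so only the static member needs protection.

```java
import java.util.concurrent.atomic.AtomicLong;

class CountingWriter {
  // Static state is shared by every CountingWriter on the same worker,
  // so it uses an atomic type rather than a plain long.
  private static final AtomicLong TOTAL_RECORDS = new AtomicLong();

  // Instance state belongs to one writer (assumed to be used by one thread).
  private long bundleRecords;

  void write(String record) {
    bundleRecords++;
    TOTAL_RECORDS.incrementAndGet();
  }

  long bundleRecords() {
    return bundleRecords;
  }

  static long totalRecords() {
    return TOTAL_RECORDS.get();
  }
}

class ThreadSafetySketch {
  /** Runs several writers on separate threads and returns how far the shared counter advanced. */
  static long runWriters(int writers, int recordsPerWriter) throws InterruptedException {
    long before = CountingWriter.totalRecords();
    Thread[] threads = new Thread[writers];
    for (int i = 0; i < writers; i++) {
      threads[i] = new Thread(() -> {
        CountingWriter writer = new CountingWriter();
        for (int j = 0; j < recordsPerWriter; j++) {
          writer.write("record");
        }
      });
      threads[i].start();
    }
    for (Thread thread : threads) {
      thread.join();
    }
    return CountingWriter.totalRecords() - before;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(runWriters(8, 10000));
  }
}
```

With a plain static long in place of the AtomicLong, concurrent increments could be lost and the returned total could fall short of writers * recordsPerWriter.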
- See Also:
- Serialized Form

Nested Class Summary
Nested Classes
Modifier and Type    Class and Description
static class         Sink.WriteOperation<T,WriteT>
                     A Sink.WriteOperation defines the process of a parallel write of objects to a Sink.
static class         Sink.Writer<T,WriteT>
                     A Writer writes a bundle of elements from a PCollection to a sink.

Constructor Summary
Constructors
Constructor and Description
Sink()

Method Summary
Modifier and Type                    Method and Description
abstract Sink.WriteOperation<T,?>    createWriteOperation(PipelineOptions options)
                                     Returns an instance of a Sink.WriteOperation that can write to this Sink.
void                                 populateDisplayData(DisplayData.Builder builder)
                                     Register display data for the given transform or component.
abstract void                        validate(PipelineOptions options)
                                     Ensures that the sink is valid and can be written to before the write operation begins.

Method Detail
validate
public abstract void validate(PipelineOptions options)
Ensures that the sink is valid and can be written to before the write operation begins. One should use Preconditions to implement this method.
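
A precondition-style validation might look like the following sketch. To stay self-contained it defines a local checkArgument helper with the same contract as Guava's Preconditions.checkArgument(boolean, Object) instead of importing Guava. ValidateSketch and the outputDirectory field are hypothetical, and a real implementation would take a PipelineOptions argument.

```java
class ValidateSketch {
  /** Local stand-in for Guava's Preconditions.checkArgument(boolean, Object). */
  static void checkArgument(boolean expression, Object errorMessage) {
    if (!expression) {
      throw new IllegalArgumentException(String.valueOf(errorMessage));
    }
  }

  /** Validates a hypothetical sink whose only field is an output directory path. */
  static void validate(String outputDirectory) {
    checkArgument(outputDirectory != null, "outputDirectory must be set");
    checkArgument(!outputDirectory.trim().isEmpty(), "outputDirectory must not be empty");
  }

  public static void main(String[] args) {
    validate("/tmp/out"); // Passes silently.
    try {
      validate("");
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Failing fast here means a misconfigured sink is reported at pipeline creation rather than after workers have started writing.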

createWriteOperation
public abstract Sink.WriteOperation<T,?> createWriteOperation(PipelineOptions options)
Returns an instance of a Sink.WriteOperation that can write to this Sink.

populateDisplayData
public void populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.

populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.

By default, does not register any display data. Implementors may override this method to provide their own display data.
- Specified by:
  populateDisplayData in interface HasDisplayData
- Parameters:
  builder - The builder to populate with display data.
- See Also:
  HasDisplayData