Google Cloud Dataflow SDK for Java, version 1.9.1
Class FileBasedSink.FileBasedWriteOperation<T>
- java.lang.Object
-
- com.google.cloud.dataflow.sdk.io.Sink.WriteOperation<T,FileBasedSink.FileResult>
-
- com.google.cloud.dataflow.sdk.io.FileBasedSink.FileBasedWriteOperation<T>
-
- Type Parameters:
T
- the type of values written to the sink.
- All Implemented Interfaces:
- Serializable
- Direct Known Subclasses:
- XmlSink.XmlWriteOperation
- Enclosing class:
- FileBasedSink<T>
public abstract static class FileBasedSink.FileBasedWriteOperation<T> extends Sink.WriteOperation<T,FileBasedSink.FileResult>
AbstractSink.WriteOperation
that manages the process of writing to aFileBasedSink
.The primary responsibilities of the FileBasedWriteOperation is the management of output files. During a write,
FileBasedSink.FileBasedWriter
s write bundles to temporary file locations. After the bundles have been written,finalize(java.lang.Iterable<com.google.cloud.dataflow.sdk.io.FileBasedSink.FileResult>, com.google.cloud.dataflow.sdk.options.PipelineOptions)
is given a list of the temporary files containing the output bundles.- During finalize, these temporary files are copied to final output locations and named according to a file naming template.
- Finally, any temporary files that were created during the write are removed.
Subclass implementations of FileBasedWriteOperation must implement
createWriter(com.google.cloud.dataflow.sdk.options.PipelineOptions)
to return a concrete FileBasedSinkWriter.Temporary and Output File Naming:
During the write, bundles are written to temporary files using the baseTemporaryFilename that can be provided via the constructor of FileBasedWriteOperation. These temporary files will be named{baseTemporaryFilename}-temp-{bundleId}
, where bundleId is the unique id of the bundle. For example, if baseTemporaryFilename is "gs://my-bucket/my_temp_output", the output for a bundle with bundle id 15723 will be "gs://my-bucket/my_temp_output-temp-15723".Final output files are written to baseOutputFilename with the format
{baseOutputFilename}-0000i-of-0000n.{extension}
where n is the total number of bundles written and extension is the file extension. Both baseOutputFilename and extension are required constructor arguments.Subclass implementations can change the file naming template by supplying a value for
FileBasedSink.fileNamingTemplate
.Temporary Bundle File Handling:
temporaryFileRetention
controls the behavior for managing temporary files. By default, temporary files will be removed. Subclasses can provide a different value to the constructor.Note that in the case of permanent failure of a bundle's write, no clean up of temporary files will occur.
If there are no elements in the PCollection being written, no output will be generated.
- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes Modifier and Type Class and Description static class
FileBasedSink.FileBasedWriteOperation.TemporaryFileRetention
Options for handling of temporary output files.
-
Field Summary
Fields Modifier and Type Field and Description protected ValueProvider<String>
baseTemporaryFilename
Base filename used for temporary output files.protected FileBasedSink<T>
sink
The Sink that this WriteOperation will write to.protected static String
TEMPORARY_FILENAME_SEPARATOR
Name separator for temporary files.protected FileBasedSink.FileBasedWriteOperation.TemporaryFileRetention
temporaryFileRetention
Option to keep or remove temporary output files.
-
Constructor Summary
Constructors Constructor and Description FileBasedWriteOperation(FileBasedSink<T> sink)
Construct a FileBasedWriteOperation using the same base filename for both temporary and output files.FileBasedWriteOperation(FileBasedSink<T> sink, String baseTemporaryFilename)
Construct a FileBasedWriteOperation.FileBasedWriteOperation(FileBasedSink<T> sink, String baseTemporaryFilename, FileBasedSink.FileBasedWriteOperation.TemporaryFileRetention temporaryFileRetention)
Create a new FileBasedWriteOperation.
-
Method Summary
All Methods Static Methods Instance Methods Abstract Methods Concrete Methods Deprecated Methods Modifier and Type Method and Description protected static String
buildTemporaryFilename(String prefix, String suffix)
Build a temporary filename using the temporary filename separator with the given prefix and suffix.protected List<String>
copyToOutputFiles(List<String> filenames, PipelineOptions options)
Copy temporary files to final output filenames using the file naming template.abstract FileBasedSink.FileBasedWriter<T>
createWriter(PipelineOptions options)
Clients must implement to return a subclass ofFileBasedSink.FileBasedWriter
.void
finalize(Iterable<FileBasedSink.FileResult> writerResults, PipelineOptions options)
Finalizes writing by copying temporary output files to their final location and optionally removing temporary files.protected List<String>
generateDestinationFilenames(int numFiles)
Generate output bundle filenames.FileBasedSink<T>
getSink()
Returns the FileBasedSink for this write operation.Coder<FileBasedSink.FileResult>
getWriterResultCoder()
Provides a coder forFileBasedSink.FileResult
.void
initialize(PipelineOptions options)
Initialization of the sink.protected void
removeTemporaryFiles(Collection<String> knownFiles, PipelineOptions options)
Removes temporary output files.protected void
removeTemporaryFiles(PipelineOptions options)
Deprecated.
-
-
-
Field Detail
-
sink
protected final FileBasedSink<T> sink
The Sink that this WriteOperation will write to.
-
temporaryFileRetention
protected final FileBasedSink.FileBasedWriteOperation.TemporaryFileRetention temporaryFileRetention
Option to keep or remove temporary output files.
-
baseTemporaryFilename
protected final ValueProvider<String> baseTemporaryFilename
Base filename used for temporary output files. Default is the baseOutputFilename.
-
TEMPORARY_FILENAME_SEPARATOR
protected static final String TEMPORARY_FILENAME_SEPARATOR
Name separator for temporary files. Temporary files will be named{baseTemporaryFilename}-temp-{bundleId}
.- See Also:
- Constant Field Values
-
-
Constructor Detail
-
FileBasedWriteOperation
public FileBasedWriteOperation(FileBasedSink<T> sink)
Construct a FileBasedWriteOperation using the same base filename for both temporary and output files.- Parameters:
sink
- the FileBasedSink that will be used to configure this write operation.
-
FileBasedWriteOperation
public FileBasedWriteOperation(FileBasedSink<T> sink, String baseTemporaryFilename)
Construct a FileBasedWriteOperation.- Parameters:
sink
- the FileBasedSink that will be used to configure this write operation.baseTemporaryFilename
- the base filename to be used for temporary output files.
-
FileBasedWriteOperation
public FileBasedWriteOperation(FileBasedSink<T> sink, String baseTemporaryFilename, FileBasedSink.FileBasedWriteOperation.TemporaryFileRetention temporaryFileRetention)
Create a new FileBasedWriteOperation.- Parameters:
sink
- the FileBasedSink that will be used to configure this write operation.baseTemporaryFilename
- the base filename to be used for temporary output files.temporaryFileRetention
- defines how temporary files are handled.
-
-
Method Detail
-
buildTemporaryFilename
protected static final String buildTemporaryFilename(String prefix, String suffix)
Build a temporary filename using the temporary filename separator with the given prefix and suffix.
-
createWriter
public abstract FileBasedSink.FileBasedWriter<T> createWriter(PipelineOptions options) throws Exception
Clients must implement to return a subclass ofFileBasedSink.FileBasedWriter
. This method must satisfy the restrictions placed on implementations ofSink.WriteOperation.createWriter(com.google.cloud.dataflow.sdk.options.PipelineOptions)
. Namely, it must not mutate the state of the object.- Specified by:
createWriter
in classSink.WriteOperation<T,FileBasedSink.FileResult>
- Throws:
Exception
-
initialize
public void initialize(PipelineOptions options) throws Exception
Initialization of the sink. Default implementation is a no-op. May be overridden by subclass implementations to perform initialization of the sink at pipeline runtime. This method must be idempotent and is subject to the same implementation restrictions asSink.WriteOperation.initialize(com.google.cloud.dataflow.sdk.options.PipelineOptions)
.- Specified by:
initialize
in classSink.WriteOperation<T,FileBasedSink.FileResult>
- Throws:
Exception
-
finalize
public void finalize(Iterable<FileBasedSink.FileResult> writerResults, PipelineOptions options) throws Exception
Finalizes writing by copying temporary output files to their final location and optionally removing temporary files.Finalization may be overridden by subclass implementations to perform customized finalization (e.g., initiating some operation on output bundles, merging them, etc.).
writerResults
contains the filenames of written bundles.If subclasses override this method, they must guarantee that its implementation is idempotent, as it may be executed multiple times in the case of failure or for redundancy. It is a best practice to attempt to try to make this method atomic.
- Specified by:
finalize
in classSink.WriteOperation<T,FileBasedSink.FileResult>
- Parameters:
writerResults
- the results of writes (FileResult).- Throws:
Exception
-
copyToOutputFiles
protected final List<String> copyToOutputFiles(List<String> filenames, PipelineOptions options) throws IOException
Copy temporary files to final output filenames using the file naming template.Can be called from subclasses that override
finalize(java.lang.Iterable<com.google.cloud.dataflow.sdk.io.FileBasedSink.FileResult>, com.google.cloud.dataflow.sdk.options.PipelineOptions)
.Files will be named according to the file naming template. The order of the output files will be the same as the sorted order of the input filenames. In other words, if the input filenames are ["C", "A", "B"], baseOutputFilename is "file", the extension is ".txt", and the fileNamingTemplate is "-SSS-of-NNN", the contents of A will be copied to file-000-of-003.txt, the contents of B will be copied to file-001-of-003.txt, etc.
- Parameters:
filenames
- the filenames of temporary files.- Returns:
- a list containing the names of final output files.
- Throws:
IOException
-
generateDestinationFilenames
protected final List<String> generateDestinationFilenames(int numFiles)
Generate output bundle filenames.
-
removeTemporaryFiles
@Deprecated protected final void removeTemporaryFiles(PipelineOptions options) throws IOException
Deprecated.UseremoveTemporaryFiles(Collection, PipelineOptions)
instead.- Throws:
IOException
-
removeTemporaryFiles
protected final void removeTemporaryFiles(Collection<String> knownFiles, PipelineOptions options) throws IOException
Removes temporary output files. Uses the temporary filename to find files to remove.Additionally, to partially mitigate the effects of filesystems with eventually-consistent directory matching APIs, takes a list of files that are known to exist - i.e. removes the union of the known files and files that the filesystem says exist in the directory.
Assumes that, if globbing had been strongly consistent, it would have matched all of knownFiles - i.e. on a strongly consistent filesystem, knownFiles can be ignored.
Can be called from subclasses that override
finalize(java.lang.Iterable<com.google.cloud.dataflow.sdk.io.FileBasedSink.FileResult>, com.google.cloud.dataflow.sdk.options.PipelineOptions)
. Note:If finalize is overridden and does not rename or otherwise finalize temporary files, this method will remove them.- Throws:
IOException
-
getWriterResultCoder
public Coder<FileBasedSink.FileResult> getWriterResultCoder()
Provides a coder forFileBasedSink.FileResult
.- Overrides:
getWriterResultCoder
in classSink.WriteOperation<T,FileBasedSink.FileResult>
-
getSink
public FileBasedSink<T> getSink()
Returns the FileBasedSink for this write operation.- Specified by:
getSink
in classSink.WriteOperation<T,FileBasedSink.FileResult>
-
-