Google Cloud Dataflow SDK for Java, version 1.9.1
Class CompressedSource<T>
- java.lang.Object
  - com.google.cloud.dataflow.sdk.io.Source<T>
    - com.google.cloud.dataflow.sdk.io.BoundedSource<T>
      - com.google.cloud.dataflow.sdk.io.OffsetBasedSource<T>
        - com.google.cloud.dataflow.sdk.io.FileBasedSource<T>
          - com.google.cloud.dataflow.sdk.io.CompressedSource<T>
- Type Parameters:
T - The type to read from the compressed file.
- All Implemented Interfaces:
HasDisplayData, Serializable
@Experimental(value=SOURCE_SINK) public class CompressedSource<T> extends FileBasedSource<T>
A Source that reads from compressed files. A CompressedSource wraps a delegate FileBasedSource that is able to read the decompressed file format.
For example, use the following to read from a gzip-compressed XML file:
    XmlSource mySource = XmlSource.from(...);
    PCollection<T> collection = p.apply(Read.from(CompressedSource
        .from(mySource)
        .withDecompression(CompressedSource.CompressionMode.GZIP)));
Supported compression algorithms are CompressedSource.CompressionMode.GZIP and CompressedSource.CompressionMode.BZIP2. User-defined compression types are supported by implementing CompressedSource.DecompressingChannelFactory.
By default, the compression algorithm is selected from those supported in CompressedSource.CompressionMode based on the file name provided to the source: ".bz2" indicates CompressedSource.CompressionMode.BZIP2 and ".gz" indicates CompressedSource.CompressionMode.GZIP. If the file name does not match any of the supported algorithms, it is assumed to be uncompressed data.
- See Also: Serialized Form
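The file-name rule described above can be illustrated with a small standalone sketch. It uses no Dataflow classes: the Mode enum and the detect method are hypothetical stand-ins, and the exact, case-sensitive suffix match is an assumption of this sketch rather than documented SDK behavior.

```java
/** Illustrative only: mirrors the documented ".gz" / ".bz2" file-name rule, not the SDK's code. */
public class CompressionByExtensionSketch {

    /** Hypothetical stand-in for CompressedSource.CompressionMode, plus an explicit uncompressed case. */
    public enum Mode { GZIP, BZIP2, UNCOMPRESSED }

    /** Picks a mode from the file name; names matching neither suffix are treated as uncompressed. */
    public static Mode detect(String fileName) {
        if (fileName.endsWith(".gz")) {
            return Mode.GZIP;
        }
        if (fileName.endsWith(".bz2")) {
            return Mode.BZIP2;
        }
        return Mode.UNCOMPRESSED;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"events.xml.gz", "events.xml.bz2", "events.xml"}) {
            System.out.println(name + " -> " + detect(name));
        }
    }
}
```

A file whose name does not reveal its compression would fall through to the uncompressed case here, which is the situation where an explicit withDecompression call is needed.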
-
-
Nested Class Summary
- static class CompressedSource.CompressedReader<T>: Reader for a CompressedSource.
- static class CompressedSource.CompressionMode: Default compression types supported by the CompressedSource.
- static interface CompressedSource.DecompressingChannelFactory: Factory interface for creating channels that decompress the content of an underlying channel.
- Nested classes/interfaces inherited from class com.google.cloud.dataflow.sdk.io.FileBasedSource: FileBasedSource.FileBasedReader<T>, FileBasedSource.Mode
- Nested classes/interfaces inherited from class com.google.cloud.dataflow.sdk.io.OffsetBasedSource: OffsetBasedSource.OffsetBasedReader<T>
- Nested classes/interfaces inherited from class com.google.cloud.dataflow.sdk.io.BoundedSource: BoundedSource.BoundedReader<T>
- Nested classes/interfaces inherited from class com.google.cloud.dataflow.sdk.io.Source: Source.Reader<T>
-
Method Summary
- protected FileBasedSource<T> createForSubrangeOfFile(String fileName, long start, long end): Creates a CompressedSource for a subrange of a file.
- protected FileBasedSource.FileBasedReader<T> createSingleFileReader(PipelineOptions options): Creates a FileBasedReader to read a single file.
- static <T> CompressedSource<T> from(FileBasedSource<T> sourceDelegate): Creates a CompressedSource from an underlying FileBasedSource.
- CompressedSource.DecompressingChannelFactory getChannelFactory()
- Coder<T> getDefaultOutputCoder(): Returns the delegate source's default output coder.
- protected boolean isSplittable(): Determines whether a single file represented by this source is splittable.
- void populateDisplayData(DisplayData.Builder builder): Register display data for the given transform or component.
- boolean producesSortedKeys(PipelineOptions options): Returns whether the delegate source produces sorted keys.
- static <T> Read.Bounded<T> readFromSource(FileBasedSource<T> sourceDelegate, CompressedSource.DecompressingChannelFactory channelFactory): Creates a Read transform that reads from the underlying FileBasedSource sourceDelegate after decompressing it with a CompressedSource.DecompressingChannelFactory.
- void validate(): Validates that the delegate source is a valid source and that the channel factory is not null.
- CompressedSource<T> withDecompression(CompressedSource.DecompressingChannelFactory channelFactory): Return a CompressedSource that is like this one but will decompress its underlying file with the given CompressedSource.DecompressingChannelFactory.
- Methods inherited from class com.google.cloud.dataflow.sdk.io.FileBasedSource: createReader, createSourceForSubrange, expandFilePattern, getEstimatedSizeBytes, getFileOrPatternSpec, getFileOrPatternSpecProvider, getMaxEndOffset, getMode, splitIntoBundles, toString
- Methods inherited from class com.google.cloud.dataflow.sdk.io.OffsetBasedSource: allowsDynamicSplitting, getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffset
-
Method Detail
-
readFromSource
public static <T> Read.Bounded<T> readFromSource(FileBasedSource<T> sourceDelegate, CompressedSource.DecompressingChannelFactory channelFactory)
Creates a Read transform that reads from the underlying FileBasedSource sourceDelegate after decompressing it with a CompressedSource.DecompressingChannelFactory.
-
from
public static <T> CompressedSource<T> from(FileBasedSource<T> sourceDelegate)
Creates a CompressedSource from an underlying FileBasedSource. The type of compression used will be based on the file name extension unless explicitly configured via withDecompression(com.google.cloud.dataflow.sdk.io.CompressedSource.DecompressingChannelFactory).
-
withDecompression
public CompressedSource<T> withDecompression(CompressedSource.DecompressingChannelFactory channelFactory)
Return a CompressedSource that is like this one but will decompress its underlying file with the given CompressedSource.DecompressingChannelFactory.
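To show what a user-defined decompression factory might do, here is a minimal standalone sketch that wraps a channel in gzip decompression using only java.util.zip and java.nio. The ChannelFactory interface below is a hypothetical local stand-in; its single method, createDecompressingChannel, is an assumption about the shape of CompressedSource.DecompressingChannelFactory, which this page does not spell out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class DecompressingChannelSketch {

    /** Hypothetical local mirror of a decompressing-channel factory; not the SDK interface itself. */
    public interface ChannelFactory {
        ReadableByteChannel createDecompressingChannel(ReadableByteChannel channel) throws IOException;
    }

    /** Wraps the raw channel in a gzip-decoding stream and exposes the result as a channel again. */
    public static final ChannelFactory GZIP =
        channel -> Channels.newChannel(new GZIPInputStream(Channels.newInputStream(channel)));

    /** Helper: gzip-compresses text in memory so the example needs no files. */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    /** Reads a channel to its end and returns the content as UTF-8 text. */
    public static String readAll(ReadableByteChannel channel) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ByteBuffer buffer = ByteBuffer.allocate(64);
        while (channel.read(buffer) != -1) {
            buffer.flip();
            out.write(buffer.array(), 0, buffer.limit());
            buffer.clear();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        ReadableByteChannel raw = Channels.newChannel(new ByteArrayInputStream(gzip("<a>1</a>")));
        try (ReadableByteChannel decompressed = GZIP.createDecompressingChannel(raw)) {
            System.out.println(readAll(decompressed));
        }
    }
}
```

In a real pipeline the equivalent object would be passed to withDecompression; the built-in CompressedSource.CompressionMode.GZIP value already covers the gzip case, so a custom factory is only needed for other formats.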
-
validate
public void validate()
Validates that the delegate source is a valid source and that the channel factory is not null.
- Overrides: validate in class FileBasedSource<T>
-
createForSubrangeOfFile
protected FileBasedSource<T> createForSubrangeOfFile(String fileName, long start, long end)
Creates a CompressedSource for a subrange of a file. Called by superclass to create a source for a single file.
- Specified by: createForSubrangeOfFile in class FileBasedSource<T>
- Parameters:
fileName - file backing the new FileBasedSource.
start - starting byte offset of the new FileBasedSource.
end - ending byte offset of the new FileBasedSource. May be Long.MAX_VALUE, in which case it will be inferred using FileBasedSource.getMaxEndOffset(com.google.cloud.dataflow.sdk.options.PipelineOptions).
-
isSplittable
protected final boolean isSplittable() throws Exception
Determines whether a single file represented by this source is splittable. Returns true if the default decompression factory is in use and it determines from the requested file name that the file is not compressed.
- Overrides: isSplittable in class FileBasedSource<T>
- Throws: Exception
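The splittability rule above amounts to a two-part condition, sketched here without any SDK classes. The usingDefaultFactory flag and the looksCompressed helper are hypothetical names introduced for illustration, and the suffix check assumes the ".gz" and ".bz2" extensions named in the class description.

```java
/** Illustrative only: restates the documented isSplittable condition as plain boolean logic. */
public class SplittableSketch {

    /** Hypothetical helper: true if the name carries one of the documented compressed suffixes. */
    public static boolean looksCompressed(String fileName) {
        return fileName.endsWith(".gz") || fileName.endsWith(".bz2");
    }

    /**
     * Splittable only when the default (file-name-based) factory is in use
     * and the file name does not indicate compression.
     */
    public static boolean isSplittable(boolean usingDefaultFactory, String fileName) {
        return usingDefaultFactory && !looksCompressed(fileName);
    }

    public static void main(String[] args) {
        System.out.println(isSplittable(true, "data.xml"));     // default factory, plain file
        System.out.println(isSplittable(true, "data.xml.gz"));  // default factory, compressed name
        System.out.println(isSplittable(false, "data.xml"));    // explicit factory configured
    }
}
```

Under this reading, configuring an explicit factory through withDecompression makes the source unsplittable even when the file name looks uncompressed.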
-
createSingleFileReader
protected final FileBasedSource.FileBasedReader<T> createSingleFileReader(PipelineOptions options)
Creates a FileBasedReader to read a single file.
Uses the delegate source to create the single-file reader. With the default decompression channel factory, the delegate's reader is not wrapped if the file name does not indicate a compressed file, which allows the source to be split.
- Specified by: createSingleFileReader in class FileBasedSource<T>
-
producesSortedKeys
public final boolean producesSortedKeys(PipelineOptions options) throws Exception
Returns whether the delegate source produces sorted keys.
- Specified by: producesSortedKeys in class BoundedSource<T>
- Throws: Exception
-
populateDisplayData
public void populateDisplayData(DisplayData.Builder builder)
Description copied from class: Source
Register display data for the given transform or component.
populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
- Specified by: populateDisplayData in interface HasDisplayData
- Overrides: populateDisplayData in class FileBasedSource<T>
- Parameters:
builder - The builder to populate with display data.
- See Also: HasDisplayData
-
getDefaultOutputCoder
public final Coder<T> getDefaultOutputCoder()
Returns the delegate source's default output coder.
- Specified by: getDefaultOutputCoder in class Source<T>
-
getChannelFactory
public final CompressedSource.DecompressingChannelFactory getChannelFactory()