The Dataflow SDKs provide an API for reading data from, and writing data to, Google Cloud Bigtable. The BigtableIO source and sink let you read a PCollection of Bigtable Row objects from a given Bigtable table, or write a PCollection of row mutations to one.
Setting Bigtable Options
When you read from or write to Bigtable, you'll need to provide a table ID and a set of Bigtable options. These options contain information necessary to identify the target Bigtable cluster, including:
- Project ID
- Cluster ID
- Zone ID
The easiest way to provide these options is to construct them using the BigtableOptions.Builder class in the package com.google.cloud.bigtable.config:
BigtableOptions.Builder optionsBuilder = new BigtableOptions.Builder()
    .setProjectId("project")
    .setClusterId("cluster")
    .setZoneId("zone");
Reading from Bigtable
To read from Bigtable, apply the BigtableIO.read() transform to your Pipeline object. You'll need to specify the table ID and BigtableOptions using .withTableId and .withBigtableOptions, respectively. By default, BigtableIO.read() scans the entire specified table and returns a PCollection of Bigtable Row objects:
// Scan the entire table.
PCollection<Row> btRows = p.apply("read", BigtableIO.read()
    .withBigtableOptions(optionsBuilder)
    .withTableId("table"));
If you want to scan a subset of the rows in the specified table, you can provide a Bigtable RowFilter object. If you provide a RowFilter, BigtableIO.read() will return only the Rows that match the filter:
// Read only rows that match the specified filter.
RowFilter filter = ...;
PCollection<Row> filteredBtRows = p.apply("filtered read", BigtableIO.read()
    .withBigtableOptions(optionsBuilder)
    .withTableId("table")
    .withRowFilter(filter));
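RowFilter is a protobuf message, so you build it with its generated builder. As a minimal sketch, a filter that matches row keys against a regular expression might look like the following; the exact proto package (com.google.bigtable.v1 or v2) depends on your SDK version, and the "user#.*" pattern is only a placeholder:
// Sketch: match only rows whose keys start with "user#".
// RowFilter comes from the Bigtable protos; the regex is a placeholder.
RowFilter filter = RowFilter.newBuilder()
    .setRowKeyRegexFilter(ByteString.copyFromUtf8("user#.*"))
    .build();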
Writing to Bigtable
To write to Bigtable, apply the BigtableIO.write() transform to the PCollection containing your output data. You'll need to specify the table ID and BigtableOptions using .withTableId and .withBigtableOptions, respectively.
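For example, a write might look like the following sketch, where btWriteData is a placeholder name for a PCollection formatted as described in the next section:
// Write the formatted elements to the specified table.
// btWriteData is a placeholder for a
// PCollection<KV<ByteString, Iterable<Mutation>>> built as described below.
btWriteData.apply("write", BigtableIO.write()
    .withBigtableOptions(optionsBuilder)
    .withTableId("table"));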
Formatting Bigtable Output Data
The BigtableIO data sink performs each write operation as a set of row mutations to the target Bigtable table. As such, you must format your output data as a PCollection<KV<ByteString, Iterable<Mutation>>>. Each element in the PCollection must contain:
- The key of the row to be written, as a ByteString.
- An Iterable of Mutation objects that represent a series of idempotent row mutation operations.
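As a sketch, a single element might be built as follows, assuming the Mutation protos bundled with your SDK version; the row key, column family ("cf"), qualifier, and value shown are placeholders:
// Build one output element: a row key paired with a single SetCell mutation.
// Mutation comes from the Bigtable protos, ByteString from the protobuf
// library, and Collections from java.util.
ByteString rowKey = ByteString.copyFromUtf8("my-row-key");
Mutation setCell = Mutation.newBuilder()
    .setSetCell(Mutation.SetCell.newBuilder()
        .setFamilyName("cf")
        .setColumnQualifier(ByteString.copyFromUtf8("qualifier"))
        .setValue(ByteString.copyFromUtf8("cell-value")))
    .build();
KV<ByteString, Iterable<Mutation>> element =
    KV.of(rowKey, (Iterable<Mutation>) Collections.singletonList(setCell));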