The Cloud Dataflow connector for Cloud Bigtable makes it possible to use Cloud Bigtable in a Cloud Dataflow pipeline. You can use the connector for both batch and streaming operations.
The connector is written in Java and is built on the Cloud Bigtable HBase client for Java. It is compatible with the Dataflow SDK 2.x for Java, which is based on Apache Beam. The connector's source code is on GitHub in the repository googleapis/java-bigtable-hbase.
This page provides an overview of how to use Read and Write transforms with the Cloud Dataflow connector. You can also read full API documentation for the Cloud Dataflow connector.
Adding the connector to a Maven project
To add the Cloud Dataflow connector to a Maven project, add the Maven artifact to your pom.xml file as a dependency:
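A sketch of the dependency entry, assuming the bigtable-hbase-beam artifact published from the googleapis/java-bigtable-hbase repository; the version shown is a placeholder for the current release:

```xml
<dependency>
  <groupId>com.google.cloud.bigtable</groupId>
  <artifactId>bigtable-hbase-beam</artifactId>
  <!-- Placeholder: replace with the current release version -->
  <version>REPLACE_WITH_CURRENT_VERSION</version>
</dependency>
```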
Specifying the Cloud Bigtable configuration
Create an options interface to allow inputs for running your pipeline:
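A minimal sketch of such an interface, using hypothetical option names (bigtableProjectId, bigtableInstanceId, bigtableTableId) that you can adapt to your pipeline:

```java
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;

// Hypothetical option names; adjust them to match your pipeline's inputs.
public interface BigtablePipelineOptions extends PipelineOptions {
  @Description("The Google Cloud project ID for the Bigtable instance.")
  String getBigtableProjectId();
  void setBigtableProjectId(String projectId);

  @Description("The Cloud Bigtable instance ID.")
  String getBigtableInstanceId();
  void setBigtableInstanceId(String instanceId);

  @Description("The Cloud Bigtable table ID.")
  String getBigtableTableId();
  void setBigtableTableId(String tableId);
}
```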
When you read from or write to Cloud Bigtable, you must provide a CloudBigtableConfiguration configuration object. This object specifies the project ID and instance ID for your table, as well as the name of the table itself:
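A minimal sketch, assuming the table-aware CloudBigtableTableConfiguration subclass and the hypothetical option names from the interface above:

```java
import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;

// Build the configuration from the pipeline options defined above.
CloudBigtableTableConfiguration bigtableTableConfig =
    new CloudBigtableTableConfiguration.Builder()
        .withProjectId(options.getBigtableProjectId())
        .withInstanceId(options.getBigtableInstanceId())
        .withTableId(options.getBigtableTableId())
        .build();
```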
For reading, provide a CloudBigtableScanConfiguration configuration object, which allows you to specify an Apache HBase Scan object that limits and filters the results of a read. See Reading from Cloud Bigtable for details.
Reading from Cloud Bigtable
To read from a Cloud Bigtable table, you apply a Read transform to the result of a CloudBigtableIO.read operation. The Read transform returns a PCollection of HBase Result objects, where each element in the PCollection represents a single row in the table.
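A minimal sketch of such a read, assuming the hypothetical BigtablePipelineOptions interface from earlier and an existing Pipeline object named p:

```java
import org.apache.beam.sdk.io.Read;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.hbase.client.Result;
import com.google.cloud.bigtable.beam.CloudBigtableIO;
import com.google.cloud.bigtable.beam.CloudBigtableScanConfiguration;

// Configuration that identifies the table to read.
CloudBigtableScanConfiguration config =
    new CloudBigtableScanConfiguration.Builder()
        .withProjectId(options.getBigtableProjectId())
        .withInstanceId(options.getBigtableInstanceId())
        .withTableId(options.getBigtableTableId())
        .build();

// Each Result in the PCollection represents a single row of the table.
PCollection<Result> rows = p.apply(Read.from(CloudBigtableIO.read(config)));
```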
By default, a CloudBigtableIO.read operation returns all of the rows in your table. You can use an HBase Scan object to limit the read to a range of row keys within your table, or to apply filters to the results of the read. To use a Scan object, include it in your CloudBigtableScanConfiguration.

For example, you can add a Scan that returns only the first key-value pair from each row in your table, which is useful when counting the number of rows in the table:
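A sketch of that configuration, under the same assumptions as the read example above; the HBase FirstKeyOnlyFilter keeps only the first cell of each row:

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import com.google.cloud.bigtable.beam.CloudBigtableScanConfiguration;

// Return only the first key-value pair from each row.
Scan scan = new Scan();
scan.setFilter(new FirstKeyOnlyFilter());

// Pass the Scan to the configuration that CloudBigtableIO.read uses.
CloudBigtableScanConfiguration config =
    new CloudBigtableScanConfiguration.Builder()
        .withProjectId(options.getBigtableProjectId())
        .withInstanceId(options.getBigtableInstanceId())
        .withTableId(options.getBigtableTableId())
        .withScan(scan)
        .build();
```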
Writing to Cloud Bigtable
To write to a Cloud Bigtable table, you apply a CloudBigtableIO.writeToTable operation. You'll need to perform this operation on a PCollection of HBase Mutation objects, which can include Put and Delete objects.
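For example, given a PCollection of Mutation objects (here called mutations, a hypothetical name) and the table configuration built earlier, the write is a single apply:

```java
// Writes every Mutation (Put or Delete) in the collection to the configured table.
mutations.apply(CloudBigtableIO.writeToTable(bigtableTableConfig));
```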
The Cloud Bigtable table must already exist and must have the appropriate column families defined. The Dataflow connector does not create tables and column families on the fly. You can use the cbt command-line tool to create a table and set up column families, or you can do this programmatically.
Before you write to Cloud Bigtable, you must create your Cloud Dataflow pipeline so that puts and deletes can be serialized over the network:
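A minimal sketch of creating the pipeline from the options interface sketched earlier (the BigtablePipelineOptions name is hypothetical):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Parse command-line arguments into the options interface and create the pipeline.
BigtablePipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(BigtablePipelineOptions.class);
Pipeline p = Pipeline.create(options);
```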
In general, you'll need to perform a transform, such as a ParDo, to format your output data into a collection of HBase Put or Delete objects. The following example shows a simple DoFn transform that takes the current value and uses it as the row key for a Put. You can then write the Put objects to Cloud Bigtable.
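A sketch of such a DoFn, assuming the input PCollection contains String row keys and using a hypothetical column family and qualifier:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

static class MutationTransform extends DoFn<String, Mutation> {
  // Hypothetical column family and qualifier; the family must already exist in the table.
  private static final byte[] FAMILY = Bytes.toBytes("cf1");
  private static final byte[] QUALIFIER = Bytes.toBytes("value");

  @ProcessElement
  public void processElement(@Element String rowKey, OutputReceiver<Mutation> out) {
    // Use the incoming element as the row key for the Put.
    Put put = new Put(Bytes.toBytes(rowKey));
    put.addColumn(FAMILY, QUALIFIER, Bytes.toBytes("written-by-dataflow"));
    out.output(put);
  }
}
```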
Here is the full writing example.
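A sketch of the complete pipeline, under the same assumptions as above (the hypothetical BigtablePipelineOptions interface, the MutationTransform sketched earlier, and a column family that already exists in the target table):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.ParDo;
import com.google.cloud.bigtable.beam.CloudBigtableIO;
import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;

public class BigtableWriteExample {
  public static void main(String[] args) {
    // Parse the pipeline options and create the pipeline.
    BigtablePipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(BigtablePipelineOptions.class);
    Pipeline p = Pipeline.create(options);

    // Identify the project, instance, and table to write to.
    CloudBigtableTableConfiguration bigtableTableConfig =
        new CloudBigtableTableConfiguration.Builder()
            .withProjectId(options.getBigtableProjectId())
            .withInstanceId(options.getBigtableInstanceId())
            .withTableId(options.getBigtableTableId())
            .build();

    p.apply(Create.of("row-1", "row-2", "row-3"))    // example row keys
        .apply(ParDo.of(new MutationTransform()))    // format each key as a Put (DoFn sketched above)
        .apply(CloudBigtableIO.writeToTable(bigtableTableConfig));

    p.run().waitUntilFinish();
  }
}
```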