Bigtable HBase Beam connector
To help you use Bigtable in a Dataflow pipeline, two open source Bigtable Beam I/O connectors are available.
If you are migrating from HBase to Bigtable or your application calls the HBase API, use the Bigtable HBase Beam connector (CloudBigtableIO) discussed on this page.
In all other cases, use the Bigtable Beam connector (BigtableIO) in conjunction with the Cloud Bigtable client for Java, which works with the Cloud Bigtable APIs. To get started with that connector, see Bigtable Beam connector.
For more information on the Apache Beam programming model, see the Beam documentation.
Get started with HBase
The Bigtable HBase Beam connector is written in Java and is built on the Bigtable HBase client for Java. It's compatible with the Dataflow SDK 2.x for Java, which is based on Apache Beam. The connector's source code is on GitHub in the repository googleapis/java-bigtable-hbase.
This page provides an overview of how to use Read and Write transforms.
Set up authentication
To use the Java samples on this page in a local development environment, install and initialize the gcloud CLI, and then set up Application Default Credentials with your user credentials.
- Install the Google Cloud CLI.
- To initialize the gcloud CLI, run the following command:
  gcloud init
- If you're using a local shell, then create local authentication credentials for your user account:
  gcloud auth application-default login
  You don't need to do this if you're using Cloud Shell.
For more information, see Set up authentication for a local development environment.
For information about setting up authentication for a production environment, see Set up Application Default Credentials for code running on Google Cloud.
Add the connector to a Maven project
To add the Bigtable HBase Beam connector to a Maven project, add the Maven artifact to your pom.xml file as a dependency:
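The following is a sketch of the dependency entry for the bigtable-hbase-beam artifact; the version shown is a placeholder, so check for the latest release:

<dependency>
  <groupId>com.google.cloud.bigtable</groupId>
  <artifactId>bigtable-hbase-beam</artifactId>
  <!-- Placeholder version; use the latest released version. -->
  <version>2.x.x</version>
</dependency>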
Specify the Bigtable configuration
Create an options interface to allow inputs for running your pipeline:
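The following sketch shows one way to define such an interface; the interface name, option names, and default values (BigtableOptions, getBigtableProjectId, and so on) are illustrative:

// Illustrative options interface; names and defaults are placeholders.
// Uses @Description and @Default from org.apache.beam.sdk.options and
// DataflowPipelineOptions from org.apache.beam.runners.dataflow.options.
public interface BigtableOptions extends DataflowPipelineOptions {

  @Description("The Bigtable project ID; this can differ from your Dataflow project.")
  @Default.String("bigtable-project")
  String getBigtableProjectId();
  void setBigtableProjectId(String bigtableProjectId);

  @Description("The Bigtable instance ID.")
  @Default.String("bigtable-instance")
  String getBigtableInstanceId();
  void setBigtableInstanceId(String bigtableInstanceId);

  @Description("The Bigtable table ID in the instance.")
  @Default.String("my-table")
  String getBigtableTableId();
  void setBigtableTableId(String bigtableTableId);
}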
When you read from or write to Bigtable, you must provide a CloudBigtableConfiguration configuration object. This object specifies the project ID and instance ID for your table, as well as the name of the table itself:
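One way to build it is with CloudBigtableTableConfiguration, which also carries the table ID. A minimal sketch, assuming the illustrative BigtableOptions interface above:

// Sketch: build a table-scoped configuration from the illustrative BigtableOptions.
CloudBigtableTableConfiguration bigtableTableConfig =
    new CloudBigtableTableConfiguration.Builder()
        .withProjectId(options.getBigtableProjectId())
        .withInstanceId(options.getBigtableInstanceId())
        .withTableId(options.getBigtableTableId())
        .build();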
For reading, provide a CloudBigtableScanConfiguration configuration object, which lets you specify an Apache HBase Scan object that limits and filters the results of a read. See Reading from Bigtable for details.
Read from Bigtable
To read from a Bigtable table, you apply a Read transform to the result of a CloudBigtableIO.read operation. The Read transform returns a PCollection of HBase Result objects, where each element in the PCollection represents a single row in the table.
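For example, a pipeline that reads every row and prints its row key might look like the following sketch, which assumes a Pipeline named p created from the illustrative BigtableOptions above:

// Sketch: configure the read, then print each row key.
CloudBigtableScanConfiguration config =
    new CloudBigtableScanConfiguration.Builder()
        .withProjectId(options.getBigtableProjectId())
        .withInstanceId(options.getBigtableInstanceId())
        .withTableId(options.getBigtableTableId())
        .build();

p.apply(Read.from(CloudBigtableIO.read(config)))
    .apply(
        ParDo.of(
            new DoFn<Result, Void>() {
              @ProcessElement
              public void processElement(@Element Result row, OutputReceiver<Void> out) {
                System.out.println(Bytes.toString(row.getRow()));
              }
            }));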
By default, a CloudBigtableIO.read operation returns all of the rows in your table. You can use an HBase Scan object to limit the read to a range of row keys within your table, or to apply filters to the results of the read. To use a Scan object, include it in your CloudBigtableScanConfiguration.
For example, you can add a Scan that returns only the first key-value pair from each row in your table, which is useful when counting the number of rows in the table:
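A sketch of that configuration, using HBase's FirstKeyOnlyFilter and the illustrative BigtableOptions from earlier:

// Sketch: a Scan that returns only the first key-value pair from each row.
Scan scan = new Scan();
scan.setFilter(new FirstKeyOnlyFilter());

// Include the Scan in the read configuration.
CloudBigtableScanConfiguration config =
    new CloudBigtableScanConfiguration.Builder()
        .withProjectId(options.getBigtableProjectId())
        .withInstanceId(options.getBigtableInstanceId())
        .withTableId(options.getBigtableTableId())
        .withScan(scan)
        .build();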
Write to Bigtable
To write to a Bigtable table, you apply a CloudBigtableIO.writeToTable operation. You'll need to perform this operation on a PCollection of HBase Mutation objects, which can include Put and Delete objects.
The Bigtable table must already exist and must have the appropriate column families defined. The Dataflow connector does not create tables and column families on the fly. You can use the cbt CLI to create a table and set up column families, or you can do this programmatically.
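For example, with the cbt CLI you might create a table and a column family as follows; the project, instance, table, and family names are placeholders:

cbt -project=PROJECT_ID -instance=INSTANCE_ID createtable my-table
cbt -project=PROJECT_ID -instance=INSTANCE_ID createfamily my-table stats_summary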
Before you write to Bigtable, you must create your Dataflow pipeline so that puts and deletes can be serialized over the network:
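A minimal sketch, assuming the illustrative BigtableOptions interface shown earlier:

// Sketch: parse the pipeline options and create the pipeline.
BigtableOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(BigtableOptions.class);
Pipeline p = Pipeline.create(options);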
In general, you'll need to perform a transform, such as a ParDo, to format your output data into a collection of HBase Put or Delete objects. The following example shows a DoFn transform that takes the current value and uses it as the row key for a Put. You can then write the Put objects to Bigtable.
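A sketch of such a transform follows; the column family, qualifier, and cell value are placeholders, and the column family is assumed to already exist in the table:

// Sketch: converts each input string into a Put keyed by that string.
static final DoFn<String, Mutation> MUTATION_TRANSFORM =
    new DoFn<String, Mutation>() {
      @ProcessElement
      public void processElement(@Element String rowkey, OutputReceiver<Mutation> out) {
        long timestamp = System.currentTimeMillis();
        Put row = new Put(Bytes.toBytes(rowkey));
        row.addColumn(
            Bytes.toBytes("stats_summary"), // placeholder column family
            Bytes.toBytes("os_build"),      // placeholder column qualifier
            timestamp,
            Bytes.toBytes("android"));      // placeholder cell value
        out.output(row);
      }
    };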
To enable batch write flow control, set BIGTABLE_ENABLE_BULK_MUTATION_FLOW_CONTROL to true. This feature automatically rate-limits traffic for batch write requests and lets Bigtable autoscaling add or remove nodes automatically to handle your Dataflow job.
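One way to pass the setting is through the table configuration; the following is a sketch that assumes the BIGTABLE_ENABLE_BULK_MUTATION_FLOW_CONTROL constant from the Bigtable HBase client's BigtableOptionsFactory:

// Sketch: enable batch write flow control on the table configuration.
CloudBigtableTableConfiguration bigtableTableConfig =
    new CloudBigtableTableConfiguration.Builder()
        .withProjectId(options.getBigtableProjectId())
        .withInstanceId(options.getBigtableInstanceId())
        .withTableId(options.getBigtableTableId())
        .withConfiguration(
            BigtableOptionsFactory.BIGTABLE_ENABLE_BULK_MUTATION_FLOW_CONTROL, "true")
        .build();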
Here is the full writing example, including the variation that enables batch write flow control.
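A condensed sketch of such a pipeline follows. It assumes the illustrative BigtableOptions interface and MUTATION_TRANSFORM DoFn shown earlier; the sample row keys are placeholders, and the commented-out builder call marks where the flow-control setting from the previous snippet would go:

import com.google.cloud.bigtable.beam.CloudBigtableIO;
import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.ParDo;

public class HelloWorldWrite {

  public static void main(String[] args) {
    // Parse the illustrative BigtableOptions and create the pipeline.
    BigtableOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(BigtableOptions.class);
    Pipeline p = Pipeline.create(options);

    // Table configuration; see the previous snippet for the flow-control variation.
    CloudBigtableTableConfiguration bigtableTableConfig =
        new CloudBigtableTableConfiguration.Builder()
            .withProjectId(options.getBigtableProjectId())
            .withInstanceId(options.getBigtableInstanceId())
            .withTableId(options.getBigtableTableId())
            // .withConfiguration(
            //     BigtableOptionsFactory.BIGTABLE_ENABLE_BULK_MUTATION_FLOW_CONTROL, "true")
            .build();

    // Placeholder row keys; replace with your own input source.
    p.apply(Create.of("phone#4c410523#20190501", "phone#4c410523#20190502"))
        .apply(ParDo.of(MUTATION_TRANSFORM))
        .apply(CloudBigtableIO.writeToTable(bigtableTableConfig));

    p.run().waitUntilFinish();
  }
}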