Flink Bigtable connector
Apache Flink is a stream-processing framework that lets you manipulate data in real time. If you have a Bigtable table, you can use the Flink Bigtable connector to stream, serialize, and write data from a specified data source to Bigtable. The connector lets you do the following, using either the Apache Flink Table API or the DataStream API:
- Create a pipeline
- Serialize the values from your data source into Bigtable mutation entries
- Write those entries to your Bigtable table
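The three steps above can be sketched in plain Java, with no dependency on the connector itself. The `Mutation` record and `serialize` helper below are illustrative stand-ins, not connector API; the real connector produces Bigtable mutation entries through its serializers.

```java
import java.util.List;
import java.util.Map;

public class PipelineSketch {
    // Illustrative stand-in for a Bigtable mutation entry; the real
    // connector builds mutation entries through its serializers.
    public record Mutation(String rowKey, String family, String qualifier, String value) {}

    // Step 2: serialize a source record into mutation entries,
    // one per non-key field, all targeting one column family here.
    public static List<Mutation> serialize(Map<String, String> record,
                                           String keyField, String family) {
        return record.entrySet().stream()
                .filter(e -> !e.getKey().equals(keyField))
                .map(e -> new Mutation(record.get(keyField), family,
                        e.getKey(), e.getValue()))
                .toList();
    }

    public static void main(String[] args) {
        // Step 1: a stand-in for the pipeline's data source.
        Map<String, String> record = Map.of("id", "row-1", "name", "ada", "city", "london");
        // Step 3: a real pipeline writes these entries to the table;
        // here we only print them.
        serialize(record, "id", "cf1").forEach(System.out::println);
    }
}
```

In the real connector, steps 1 and 3 are handled by the Flink pipeline and the Bigtable sink; only the serialization logic (step 2) varies with the serializer you choose.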
This document describes the Flink Bigtable connector and what you need to know before you use it. Before you read this document, you should be familiar with Apache Flink, the Bigtable storage model, and Bigtable writes.
To use the connector, you must have a pre-existing Bigtable table to serve as your data sink. You must create the table's column families before you start the pipeline; column families can't be created on write. For more information, see Create and manage tables.
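For example, you can create the table and its column families with the cbt CLI before starting the pipeline. The table and family names below are placeholders, and the commands assume your project and instance are configured in `~/.cbtrc`:

```shell
# Create the destination table, then its column families,
# before the Flink pipeline starts writing.
cbt createtable my-flink-sink
cbt createfamily my-flink-sink cf1
cbt createfamily my-flink-sink cf2
```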
The connector is available on GitHub. For information about installing the connector, see the Flink Bigtable Connector repository. For code samples that demonstrate how to use the connector, see the flink-examples-gcp-bigtable directory.
Serializers
The Flink connector has three built-in serializers that you can use to convert data into Bigtable mutation entries:
- GenericRecordToRowMutationSerializer: for Avro GenericRecord objects
- RowDataToRowMutationSerializer: for Flink RowData objects
- FunctionRowMutationSerializer: for custom serialization logic using a provided function

You can also create your own custom serializer by inheriting from BaseRowMutationSerializer.
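FunctionRowMutationSerializer wraps serialization logic that you supply as a function. The following is a dependency-free sketch of that idea; the `Entry` record and the function's signature are assumptions for illustration, not the connector's actual types:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class FunctionSerializerSketch {
    public record Entry(String rowKey, String family, String qualifier, String value) {}

    // The kind of custom logic you would hand to a function-based
    // serializer: every field except "id" becomes a column in family
    // "cf1", with the qualifier uppercased.
    public static final Function<Map<String, String>, List<Entry>> SERIALIZER = record ->
            record.entrySet().stream()
                    .filter(e -> !e.getKey().equals("id"))
                    .map(e -> new Entry(record.get("id"), "cf1",
                            e.getKey().toUpperCase(), e.getValue()))
                    .toList();

    public static void main(String[] args) {
        System.out.println(SERIALIZER.apply(Map.of("id", "r1", "name", "ada")));
    }
}
```

Because the logic is just a function, you can unit-test it in isolation before plugging it into a pipeline.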
Serialization modes
When you use the Flink connector, you choose one of two serialization modes. The mode specifies how your source data is serialized to represent your Bigtable column families and then written to your Bigtable table. You must use one mode or the other.
Column family mode
In column family mode, all data is written to a single specified column family. Nested fields are not supported.
Nested-rows mode
In nested-rows mode, each top-level field represents a column family. The value of each top-level field (other than the RowKeyField) is a row object with one field for each column in that Bigtable column family. In nested-rows mode, all fields other than the top-level field must be row objects. Double-nested rows are not supported.
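The nested-rows mapping can be illustrated with plain maps standing in for row objects (the `Cell` record and method below are illustrative, not connector API): each top-level key becomes a column family, and each nested key becomes a column qualifier in that family.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class NestedRowsSketch {
    public record Cell(String rowKey, String family, String qualifier, String value) {}

    // Nested-rows mode: each top-level field is a column family, and
    // each field of its nested row object becomes a column (qualifier)
    // in that family.
    public static List<Cell> serializeNested(String rowKey,
                                             Map<String, Map<String, String>> record) {
        List<Cell> cells = new ArrayList<>();
        record.forEach((family, columns) ->
                columns.forEach((qualifier, value) ->
                        cells.add(new Cell(rowKey, family, qualifier, value))));
        return cells;
    }

    public static void main(String[] args) {
        // Two top-level fields -> two column families ("contact", "stats").
        serializeNested("user-1", Map.of(
                "contact", Map.of("name", "ada", "email", "ada@example.com"),
                "stats", Map.of("visits", "42")))
                .forEach(System.out::println);
    }
}
```

By contrast, in column family mode the same three leaf values would all land in one specified family, with no nesting involved.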
Exactly-once processing
In Apache Flink, exactly once means that each data record in a stream is processed exactly one time, preventing any duplicate processing or data loss, even in the event of system failures.
A Bigtable mutateRow mutation is idempotent by default, so a write request that has the same row key, column family, column, timestamp, and value doesn't create a new cell, even if it's retried. This means that when you use Bigtable as the data sink for an Apache Flink framework, you get exactly-once behavior automatically, as long as you don't change the timestamp in retries and the rest of your pipeline also satisfies exactly-once requirements.
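The idempotency argument can be modeled by treating a cell as addressed by (row key, family, qualifier, timestamp): retrying an identical mutation overwrites the same cell instead of creating a new one, while changing the timestamp addresses a different cell. This is an illustrative model only, not the Bigtable client API:

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotentWriteSketch {
    // A cell is addressed by row key, family, qualifier, and timestamp;
    // writing to the same address twice overwrites rather than duplicates.
    public static void mutateRow(Map<String, String> table, String rowKey, String family,
                                 String qualifier, long timestampMicros, String value) {
        table.put(rowKey + "/" + family + ":" + qualifier + "@" + timestampMicros, value);
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        mutateRow(table, "r1", "cf1", "col", 1_000L, "v");
        // Retry with identical fields (for example, after a transient failure):
        mutateRow(table, "r1", "cf1", "col", 1_000L, "v");
        System.out.println(table.size()); // still one cell
    }
}
```

This is why retries must not regenerate the timestamp: a retry with a fresh timestamp addresses a new cell and breaks the exactly-once guarantee.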
For more information on exactly-once semantics, see An overview of end-to-end exactly-once processing in Apache Flink.