Flink Bigtable connector
Apache Flink is a stream-processing framework that lets you manipulate data in real time. If you have a Bigtable table, you can use the Flink Bigtable connector to stream, serialize, and write data from a specified data source to Bigtable. The connector lets you do the following, using either the Apache Flink Table API or the DataStream API:
- Create a pipeline
- Serialize the values from your data source into Bigtable mutation entries
- Write those entries to your Bigtable table
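The three steps above can be sketched in plain Java, with no dependency on the connector itself. The `Mutation` record and `serialize` helper below are illustrative stand-ins, not connector API; the real connector produces Bigtable mutation entries through its serializers.

```java
import java.util.List;
import java.util.Map;

public class PipelineSketch {
    // Illustrative stand-in for a Bigtable mutation entry; the real
    // connector builds mutation entries through its serializers.
    public record Mutation(String rowKey, String family, String qualifier, String value) {}

    // Step 2: serialize a source record into mutation entries,
    // one per non-key field, all targeting one column family here.
    public static List<Mutation> serialize(Map<String, String> record,
                                           String keyField, String family) {
        return record.entrySet().stream()
                .filter(e -> !e.getKey().equals(keyField))
                .map(e -> new Mutation(record.get(keyField), family,
                        e.getKey(), e.getValue()))
                .toList();
    }

    public static void main(String[] args) {
        // Step 1: a stand-in for the pipeline's data source.
        Map<String, String> record = Map.of("id", "row-1", "name", "ada", "city", "london");
        // Step 3: a real pipeline writes these entries to the table;
        // here we only print them.
        serialize(record, "id", "cf1").forEach(System.out::println);
    }
}
```

In the real connector, steps 1 and 3 are handled by the Flink pipeline and the Bigtable sink; only the serialization logic (step 2) varies with the serializer you choose.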
This document describes the Flink Bigtable connector and what you need to know before you use it. Before you read this document, you should be familiar with Apache Flink, the Bigtable storage model, and Bigtable writes.
To use the connector, you must have a pre-existing Bigtable table to serve as your data sink. You must create the table's column families before you start the pipeline; column families can't be created on write. For more information, see Create and manage tables.
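For example, you can create the table and its column families with the cbt CLI before starting the pipeline. The table and family names below are placeholders, and the commands assume your project and instance are configured in `~/.cbtrc`:

```shell
# Create the destination table, then its column families,
# before the Flink pipeline starts writing.
cbt createtable my-flink-sink
cbt createfamily my-flink-sink cf1
cbt createfamily my-flink-sink cf2
```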
The connector is available on GitHub. For information about installing the connector, see the Flink Bigtable Connector repository. For code samples that demonstrate how to use the connector, see the flink-examples-gcp-bigtable directory.
Serializers
The Flink connector has three built-in serializers that you can use to convert data into Bigtable mutation entries:
- GenericRecordToRowMutationSerializer: for Avro GenericRecord objects
- RowDataToRowMutationSerializer: for Flink RowData objects
- FunctionRowMutationSerializer: for custom serialization logic using a provided function

You can also create your own custom serializer by inheriting from BaseRowMutationSerializer.
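FunctionRowMutationSerializer wraps serialization logic that you supply as a function. The following is a dependency-free sketch of that idea; the `Entry` record and the function's signature are assumptions for illustration, not the connector's actual types:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class FunctionSerializerSketch {
    public record Entry(String rowKey, String family, String qualifier, String value) {}

    // The kind of custom logic you would hand to a function-based
    // serializer: every field except "id" becomes a column in family
    // "cf1", with the qualifier uppercased.
    public static final Function<Map<String, String>, List<Entry>> SERIALIZER = record ->
            record.entrySet().stream()
                    .filter(e -> !e.getKey().equals("id"))
                    .map(e -> new Entry(record.get("id"), "cf1",
                            e.getKey().toUpperCase(), e.getValue()))
                    .toList();

    public static void main(String[] args) {
        System.out.println(SERIALIZER.apply(Map.of("id", "r1", "name", "ada")));
    }
}
```

Because the logic is just a function, you can unit-test it in isolation before plugging it into a pipeline.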
Serialization modes
When you use the Flink connector, you choose one of two serialization modes. The mode specifies how your source data is serialized to represent your Bigtable column families and then written to your Bigtable table. You must use one mode or the other.
Column family mode
In column family mode, all data is written to a single specified column family. Nested fields are not supported.
Nested-rows mode
In nested-rows mode, each top-level field represents a column family. The value of each top-level field (other than the RowKeyField) is a row object with one field for each column in that Bigtable column family. In nested-rows mode, all fields other than the top-level field must be row objects. Double-nested rows are not supported.
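The nested-rows mapping can be illustrated with plain maps standing in for row objects (the `Cell` record and method below are illustrative, not connector API): each top-level key becomes a column family, and each nested key becomes a column qualifier in that family.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class NestedRowsSketch {
    public record Cell(String rowKey, String family, String qualifier, String value) {}

    // Nested-rows mode: each top-level field is a column family, and
    // each field of its nested row object becomes a column (qualifier)
    // in that family.
    public static List<Cell> serializeNested(String rowKey,
                                             Map<String, Map<String, String>> record) {
        List<Cell> cells = new ArrayList<>();
        record.forEach((family, columns) ->
                columns.forEach((qualifier, value) ->
                        cells.add(new Cell(rowKey, family, qualifier, value))));
        return cells;
    }

    public static void main(String[] args) {
        // Two top-level fields -> two column families ("contact", "stats").
        serializeNested("user-1", Map.of(
                "contact", Map.of("name", "ada", "email", "ada@example.com"),
                "stats", Map.of("visits", "42")))
                .forEach(System.out::println);
    }
}
```

By contrast, in column family mode the same three leaf values would all land in one specified family, with no nesting involved.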
Exactly-once processing
In Apache Flink, exactly once means that each data record in a stream is processed exactly one time, preventing any duplicate processing or data loss, even in the event of system failures.
A Bigtable mutateRow mutation is idempotent by default, so a write request that has the same row key, column family, column, timestamp, and value doesn't create a new cell, even if it's retried. This means that when you use Bigtable as the data sink for an Apache Flink framework, you get exactly-once behavior automatically, as long as you don't change the timestamp in retries and the rest of your pipeline also satisfies exactly-once requirements.
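The idempotency argument can be modeled by treating a cell as addressed by (row key, family, qualifier, timestamp): retrying an identical mutation overwrites the same cell instead of creating a new one, while changing the timestamp addresses a different cell. This is an illustrative model only, not the Bigtable client API:

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotentWriteSketch {
    // A cell is addressed by row key, family, qualifier, and timestamp;
    // writing to the same address twice overwrites rather than duplicates.
    public static void mutateRow(Map<String, String> table, String rowKey, String family,
                                 String qualifier, long timestampMicros, String value) {
        table.put(rowKey + "/" + family + ":" + qualifier + "@" + timestampMicros, value);
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        mutateRow(table, "r1", "cf1", "col", 1_000L, "v");
        // Retry with identical fields (for example, after a transient failure):
        mutateRow(table, "r1", "cf1", "col", 1_000L, "v");
        System.out.println(table.size()); // still one cell
    }
}
```

This is why retries must not regenerate the timestamp: a retry with a fresh timestamp addresses a new cell and breaks the exactly-once guarantee.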
For more information on exactly-once semantics, see An overview of end-to-end exactly-once processing in Apache Flink.