Aggregate values at write time
If you want to create a counter or aggregate your data in Bigtable at write time, you can use aggregates. Aggregates are Bigtable table cells that aggregate cell values as the data is written. When you add a new value, an aggregation function merges the value with the aggregated value that is already in the cell. Other databases refer to similar capabilities as distributed counters.
You can read and write aggregate values using the cbt CLI and the Bigtable client libraries for C++, Go, and Java. You can also read aggregation results using SQL. You update aggregate cells using methods that send a MutateRow request with either an AddToCell or MergeToCell mutation. MergeToCell lets you merge in an accumulator, and AddToCell lets you add an input.
This document provides an overview of aggregates and describes how to create an aggregate column family. Before you read this document, you should be familiar with the Bigtable overview and Writes.
When to use aggregates
Bigtable aggregates are useful for situations where you care about data for an entity in aggregate and not as individual data points.
Counters
You can build a counter using aggregate cells and increment the value at write time, without the limitations involved in making a ReadModifyWriteRow request.
If you're migrating to Bigtable from databases such as Apache Cassandra or Redis, you can use Bigtable aggregates in places where you previously relied on counters in these systems.
To work through a quickstart that demonstrates how to implement counters using the cbt CLI, see Create and update counters.
Time buckets
You can use time buckets to get aggregate values for a period of time, such as an hour, day, or week. Instead of aggregating data before or after it is written to your table, you add new values to aggregate cells in the table.
For example, if you run a service that helps charities raise money, you might want to know the amount of online donations per day for each campaign, but you don't need to know the exact time of each donation or the amount per hour. In your table, row keys represent charity IDs, and you create an aggregate column family called donations. The column qualifiers in the row are campaign IDs.
As each donation for a campaign is received, its amount is added to the sum in that day's aggregate cell in the campaign's column. Each add request for the cell uses a timestamp truncated to the beginning of the day, so that in effect each request has the same timestamp. Truncating the timestamps ensures that all of the donations from that day are added to the same cell. The next day, all of your requests go into a new cell, using timestamps that are truncated to the new date, and that pattern continues.
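The truncation step can be sketched in a few lines. This is a minimal illustration, assuming timestamps are expressed in microseconds since the Unix epoch and that days are bucketed on UTC boundaries; the function name day_bucket_micros is hypothetical and not part of any Bigtable library.

```python
from datetime import datetime, timezone

MICROS_PER_SECOND = 1_000_000


def day_bucket_micros(event_time: datetime) -> int:
    """Return the UTC start of event_time's day, in microseconds since the epoch.

    Hypothetical helper: event_time is expected to be timezone-aware.
    """
    utc = event_time.astimezone(timezone.utc)
    start_of_day = utc.replace(hour=0, minute=0, second=0, microsecond=0)
    return int(start_of_day.timestamp()) * MICROS_PER_SECOND


# Two donations on the same UTC day share one bucket timestamp,
# so both would be added to the same aggregate cell.
morning = datetime(2024, 5, 1, 9, 15, tzinfo=timezone.utc)
evening = datetime(2024, 5, 1, 22, 40, tzinfo=timezone.utc)
# A donation just after midnight falls into the next day's bucket.
next_day = datetime(2024, 5, 2, 0, 5, tzinfo=timezone.utc)

same_bucket = day_bucket_micros(morning) == day_bucket_micros(evening)
new_bucket = day_bucket_micros(next_day) != day_bucket_micros(morning)
```

If your reporting day follows a local time zone rather than UTC, you would truncate in that zone instead; the choice only needs to be consistent across all writers.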
Depending on your use case, you might choose to create new columns for your new aggregates instead. Depending on the number of buckets that you plan to accumulate, you might consider a different row key design.
For more information on time buckets, see Schema design for time series data.
Streamlining workflows
Aggregates let you aggregate data in your Bigtable table without using any ETL or stream-processing software before or after you write the data. For example, if your application previously published messages to Pub/Sub and then used Dataflow to read the messages and aggregate the data before writing it to Bigtable, you could instead send the data directly to aggregate cells in Bigtable.
Aggregate column families
To create and update aggregate cells, you must have one or more aggregate column families in your table – column families that contain only aggregate cells. You can create them when you create a table, or you can add an aggregate column family to a table that is already in use. When you create the column family, you specify the aggregation type, such as sum.
You can't convert a column family that contains non-aggregate data into an aggregate column family. Columns in aggregate column families can't contain non-aggregate cells, and standard column families can't contain aggregate cells.
To create a new table with an aggregate column family, see Create a table. To add an aggregate column family to a table, see Add column families.
Aggregation types
Bigtable supports the following aggregation types:
Sum
When you write a value to a sum aggregate cell (sum), the cell value is replaced with the sum of the newly added value and the current cell value. The input type that is supported for sums is Int64.
Minimum
When you write a value to a minimum aggregate cell (min), the cell value is replaced with the lower of the newly added value and the current cell value. The input type that is supported for min is Int64.
Maximum
When you write a value to a maximum aggregate cell (max), the cell value is replaced with the higher of the newly added value and the current cell value. The input type that is supported for max is Int64.
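The sum, min, and max behaviors described above can be modeled with a small in-memory sketch. This is a toy illustration, not the Bigtable API: the AGGREGATORS table and the add_to_cell function are hypothetical names, and treating an empty cell as simply taking the first input is an assumption of this model.

```python
from typing import Callable, Dict, Optional

# Hypothetical lookup table: each function merges a new input into the
# current cell value and returns the value that replaces it.
AGGREGATORS: Dict[str, Callable[[int, int], int]] = {
    "sum": lambda current, new: current + new,
    "min": min,
    "max": max,
}


def add_to_cell(kind: str, current: Optional[int], new: int) -> int:
    """Return the cell value after merging `new` into `current`.

    Modeling assumption: an empty cell (None) takes the first input as is.
    """
    if current is None:
        return new
    return AGGREGATORS[kind](current, new)


def replay(kind: str, inputs) -> Optional[int]:
    """Apply a sequence of inputs to an initially empty cell."""
    value = None
    for item in inputs:
        value = add_to_cell(kind, value, item)
    return value


writes = [5, 2, 9]
sum_value = replay("sum", writes)
min_value = replay("min", writes)
max_value = replay("max", writes)
```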
HyperLogLog (HLL)
When you write a value to an HLL aggregate cell (inthll), the value is added to a probabilistic set of all values added since the most recent reset. The cell value represents the state of that set. For more general information about the HLL algorithm, see HyperLogLog.
You can read HLL values using the Zetasketch library. For more information, see the Zetasketch GitHub repository. The input type that is supported for HLL is BYTES.
Timestamps
An aggregate cell is defined by row key, column family, column qualifier, and timestamp. You use the same timestamp each time you add data to the cell. If you send a value to the same row key, column family, and column qualifier but with a different timestamp, a new aggregate cell is created in the column.
Any request sent to an aggregate cell must include a timestamp.
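To make the role of the timestamp concrete, the following toy model keys each aggregate cell by the four-part tuple described above. It is a sketch of the addressing rule only, assuming a sum aggregate; the cells dictionary and the add function are hypothetical and do not correspond to any client library call.

```python
from collections import defaultdict

# Hypothetical in-memory store for sum aggregate cells, keyed by
# (row key, column family, column qualifier, timestamp in microseconds).
cells = defaultdict(int)


def add(row: str, family: str, qualifier: str, timestamp_micros: int, value: int) -> None:
    """Add `value` to the sum cell identified by the four-part key."""
    cells[(row, family, qualifier, timestamp_micros)] += value


DAY_1 = 1_714_521_600_000_000  # illustrative timestamp for one day bucket
DAY_2 = DAY_1 + 86_400 * 1_000_000  # the following day's bucket

# Same timestamp: both values accumulate in one cell.
add("charity#1", "donations", "campaign#7", DAY_1, 25)
add("charity#1", "donations", "campaign#7", DAY_1, 40)
# Different timestamp: a new cell is created in the same column.
add("charity#1", "donations", "campaign#7", DAY_2, 10)

cell_count = len(cells)
```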
Input type
The input type of the value in the write request must match the input type that the column family is created with. For example, if you send a string value to a column family configured for Int64, the request is rejected.
Mutation type
A Bigtable MutateRow request includes the type of mutation, which is a change to the table. The mutation types that you can send to create and update aggregate cells are AddToCell and MergeToCell. In contrast, a non-aggregate write involves a SetCell mutation. You can also use deletion mutations to clear the accumulated value of a cell.
In a replicated table, an aggregate cell converges on the same final value in all clusters within the current replication delay. The final value is the aggregate of all AddToCell mutations sent to that cell in all clusters since the last delete operation, or since the cell was created.
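One way to build intuition for this convergence is that an aggregation such as sum gives the same result regardless of the order in which the inputs are applied. The sketch below checks that property for a small set of adds. It is only an illustration of order independence and makes no claim about how Bigtable replication is implemented; apply_all is a hypothetical helper.

```python
import itertools

# Adds sent to the same sum aggregate cell, possibly through different clusters.
adds = [5, -2, 10, 7]


def apply_all(values) -> int:
    """Apply each add, in the given order, to a cell that starts at zero."""
    total = 0
    for value in values:
        total += value
    return total


# Every possible arrival order produces the same final value.
final_values = {apply_all(order) for order in itertools.permutations(adds)}
```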
Aggregation operations are subject to the same operations limits as other table mutations.
AddToCell
To add data to an aggregate cell, such as when you are incrementing a counter, you send an AddToCell mutation in a MutateRow request. For more information, see AddToCell in the Bigtable Data API reference.
MergeToCell
If you want to copy data between cells, use a MergeToCell mutation. For example, to copy the state from cell A to cell B, you can do something like [DeleteCell(B), MergeToCell(B)] with the value that you read from cell A. For more information, see MergeToCell in the Bigtable Data API reference.
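The reason the delete comes before the merge can be seen in a toy model of a sum aggregate. The sketch below is not the Data API: delete_cell and merge_to_cell are hypothetical functions operating on a plain dictionary, and merging is modeled as adding one accumulated sum to another.

```python
def delete_cell(store: dict, key: str) -> None:
    """Clear the accumulated value of a cell, if the cell exists."""
    store.pop(key, None)


def merge_to_cell(store: dict, key: str, accumulator: int) -> None:
    """Merge an accumulated sum into a cell (toy model of a sum aggregate)."""
    store[key] = store.get(key, 0) + accumulator


# Copy with the delete first: B ends up with exactly A's state.
cells = {"A": 42, "B": 7}
state_a = cells["A"]
delete_cell(cells, "B")
merge_to_cell(cells, "B", state_a)
copied_value = cells["B"]

# Without the delete, A's state is combined with B's existing value instead.
cells_no_delete = {"A": 42, "B": 7}
merge_to_cell(cells_no_delete, "B", cells_no_delete["A"])
combined_value = cells_no_delete["B"]
```

In this model, skipping the delete turns a copy into a combine, which is why the example pairs the two mutations.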
Deletions
Just like with non-aggregated data, you can reset a counter or delete aggregated data using Data API mutations. For more information, see Mutation in the Bigtable Data API reference.
Garbage collection
Aggregate cells are treated like any other cell during garbage collection: if a cell is marked for deletion, the deletion is replicated to all clusters in the instance. For more information, see Replication and garbage collection. If an add request is sent to an aggregate cell that has been removed by garbage collection, a new aggregate cell is created.
What's next
- Work through a quickstart to see how to create and update counters using the cbt CLI.
- See code samples showing how to add a value to an aggregate cell.
- Review concepts related to schema design.