Aggregate values at write time

If you want to create a counter or aggregate your data in Bigtable at write time, you can use aggregates. Aggregates are Bigtable table cells that aggregate cell values as the data is written. When you add a new value, an aggregation function merges the value with the aggregated value that is already in the cell. Other databases refer to similar capabilities as distributed counters.

You can read and write aggregate values using the cbt CLI and the Bigtable client libraries for C++, Go, and Java. You can also read aggregation results using SQL. You update aggregate cells using methods that send a MutateRow request with either an AddToCell or MergeToCell mutation. MergeToCell lets you merge in an accumulator, and AddToCell lets you add an input.

This document provides an overview of aggregates and describes how to create an aggregate column family. Before you read this document, you should be familiar with the Bigtable overview and Writes.

When to use aggregates

Bigtable aggregates are useful for situations where you care about data for an entity in aggregate and not as individual data points.

Counters

You can build a counter using aggregate cells and increment the value at write time, without the limitations involved in making a ReadModifyWriteRow request.

If you're migrating to Bigtable from databases such as Apache Cassandra or Redis, you can use Bigtable aggregates in places where you previously relied on counters in these systems.

To work through a quickstart that demonstrates how to implement counters using the cbt CLI, see Create and update counters.

Time buckets

You can use time buckets to get aggregate values for a period of time, such as an hour, day, or week. Instead of aggregating data before or after it is written to your table, you add new values to aggregate cells in the table.

For example, if you run a service that helps charities raise money, you might want to know the amount of online donations per day for each campaign, but you don't need to know the exact time of each donation or the amount per hour. In your table, row keys represent charity IDs, and you create an aggregate column family called donations. The column qualifiers in the row are campaign IDs.

As each donation for a campaign is received on a given day, its amount is added to the sum in that day's aggregate cell in the campaign's column. Each add request for the cell uses a timestamp truncated to the beginning of the day, so that in effect each request has the same timestamp. Truncating the timestamps ensures that all of the donations from that day are added to the same cell. The next day, all of your requests go into a new cell, using timestamps that are truncated down to the new date, and that pattern continues.
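The truncation step can be sketched in a few lines of Python. This is an illustration only: it assumes timestamps are expressed in microseconds since the Unix epoch and that a "day" means a UTC calendar day, and the helper names `to_micros` and `truncate_to_day_micros` are hypothetical, not part of any Bigtable client library.

```python
from datetime import datetime, timezone

MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_micros(dt: datetime) -> int:
    """Convert a timezone-aware datetime to microseconds since the Unix epoch."""
    delta = dt - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def truncate_to_day_micros(ts_micros: int) -> int:
    """Round a microsecond timestamp down to 00:00:00 UTC of the same day."""
    return ts_micros - (ts_micros % MICROS_PER_DAY)


morning = to_micros(datetime(2024, 3, 5, 9, 15, tzinfo=timezone.utc))
evening = to_micros(datetime(2024, 3, 5, 23, 59, 59, tzinfo=timezone.utc))
next_day = to_micros(datetime(2024, 3, 6, 0, 0, 1, tzinfo=timezone.utc))

# Both March 5 donations share one bucket timestamp, so they would be added
# to the same aggregate cell. The March 6 donation gets a new bucket.
same_bucket = truncate_to_day_micros(morning) == truncate_to_day_micros(evening)
new_bucket = truncate_to_day_micros(next_day) != truncate_to_day_micros(morning)
print(same_bucket, new_bucket)
```

If your buckets should follow a local business day rather than UTC, the truncation would need to account for that time zone before converting to microseconds.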

Depending on your use case, you might choose to create new columns for your new aggregates instead. You might also consider a different row key design, depending on the number of buckets that you plan to accumulate.

For more information on time buckets, see Schema design for time series data.

Streamlining workflows

Aggregates let you aggregate data directly in your Bigtable table, without using ETL or stream-processing software to aggregate it before or after you write it to Bigtable. For example, if your application previously published messages to Pub/Sub and then used Dataflow to read the messages and aggregate the data before writing it to Bigtable, you could instead send the data directly to aggregate cells in Bigtable.

Aggregate column families

To create and update aggregate cells, you must have one or more aggregate column families in your table – column families that contain only aggregate cells. You can create them when you create a table, or you can add an aggregate column family to a table that is already in use. When you create the column family, you specify the aggregation type, such as sum.

You can't convert a column family that contains non-aggregate data into an aggregate column family. Columns in aggregate column families can't contain non-aggregate cells, and standard column families can't contain aggregate cells.

To create a new table with an aggregate column family, see Create a table. To add an aggregate column family to a table, see Add column families.

Aggregation types

Bigtable supports the following aggregation types:

Sum

When you write a value to a sum aggregate cell (sum), the cell value is replaced with the sum of the newly added value and the current cell value. The input type that is supported for sums is Int64.

Minimum

When you write a value to a minimum aggregate cell (min), the cell value is replaced with the lower of the newly added value and the current cell value. The input type that is supported for min is Int64.

Maximum

When you write a value to a maximum aggregate cell (max), the cell value is replaced with the higher of the newly added value and the current cell value. The input type that is supported for max is Int64.

HyperLogLog (HLL)

When you write a value to an HLL aggregate cell (inthll), the value is added to a probabilistic set of all values added since the most recent reset. The cell value represents the state of that set. For more general information about the HLL algorithm, see HyperLogLog.

You can read HLL values using the Zetasketch library. For more information, see the Zetasketch GitHub repository. The input type that is supported for HLL is BYTES.
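The sum, min, and max behavior described in this section can be modeled with a small in-memory sketch. It is not Bigtable code: the `AGGREGATORS` table and `apply_add` function are hypothetical names, and the sketch assumes that the first value written to an empty cell simply becomes the cell value. HLL is left out because its cell state is a probabilistic sketch rather than a single integer.

```python
from typing import Callable, Dict, Optional

# One merge function per aggregation type: (current value, new value) -> result.
AGGREGATORS: Dict[str, Callable[[int, int], int]] = {
    "sum": lambda current, new: current + new,
    "min": min,
    "max": max,
}


def apply_add(kind: str, current: Optional[int], new_value: int) -> int:
    """Return the cell value after writing new_value to a cell holding current.

    current is None when the cell does not exist yet (an assumption of this model).
    """
    if current is None:
        return new_value
    return AGGREGATORS[kind](current, new_value)


# Write the same three inputs to one cell of each aggregation type.
for kind in ("sum", "min", "max"):
    cell = None
    for value in (5, 3, 9):
        cell = apply_add(kind, cell, value)
    print(kind, cell)  # sum 17, min 3, max 9
```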

Timestamps

An aggregate cell is defined by row key, column family, column qualifier, and timestamp. You use the same timestamp each time you add data to the cell. If you send a value to the same row key, column family, and column qualifier but with a different timestamp, a new aggregate cell is created in the column.

Any request sent to an aggregate cell must include a timestamp.
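To make the role of the timestamp concrete, the following minimal sketch models a sum column family as a dictionary keyed by the four parts of the cell identity. The `SumTable` class, its `add_to_cell` method, and the example row, qualifier, and timestamp values are all hypothetical and exist only for illustration.

```python
from typing import Dict, Tuple

# (row key, column family, column qualifier, timestamp in microseconds)
CellKey = Tuple[str, str, str, int]


class SumTable:
    """Toy in-memory model of sum aggregate cells, not a Bigtable client."""

    def __init__(self) -> None:
        self.cells: Dict[CellKey, int] = {}

    def add_to_cell(self, row: str, family: str, qualifier: str,
                    timestamp_micros: int, value: int) -> int:
        """Add value to the cell identified by all four key parts."""
        key = (row, family, qualifier, timestamp_micros)
        self.cells[key] = self.cells.get(key, 0) + value
        return self.cells[key]


table = SumTable()
day_one = 1_709_596_800_000_000          # hypothetical day-bucket timestamp
day_two = day_one + 86_400_000_000       # the following day's bucket

# Same timestamp: both adds accumulate in one cell.
table.add_to_cell("charity-1", "donations", "campaign-42", day_one, 25)
table.add_to_cell("charity-1", "donations", "campaign-42", day_one, 10)

# Different timestamp: a new aggregate cell appears in the same column.
table.add_to_cell("charity-1", "donations", "campaign-42", day_two, 5)

print(len(table.cells))  # 2
```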

Input type

The input type of the value in the write request must match the input type that the column family is created with. For example, if you send a string value to a column family configured for Int64, the request is rejected.
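An application can catch mismatched inputs before sending a request. The check below is a client-side sketch under stated assumptions: the type labels `"int64"` and `"bytes"`, the `InputTypeError` exception, and the `validate_input` function are hypothetical, and the actual rejection is performed by Bigtable itself.

```python
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


class InputTypeError(ValueError):
    """Raised by this sketch when a value does not match the family's input type."""


def validate_input(family_input_type: str, value: object) -> None:
    """Raise InputTypeError if value does not fit the configured input type."""
    if family_input_type == "int64":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise InputTypeError(f"expected an integer, got {type(value).__name__}")
        if not INT64_MIN <= value <= INT64_MAX:
            raise InputTypeError("integer does not fit in a signed 64-bit value")
    elif family_input_type == "bytes":
        if not isinstance(value, (bytes, bytearray)):
            raise InputTypeError(f"expected bytes, got {type(value).__name__}")
    else:
        raise ValueError(f"unknown input type label: {family_input_type}")


validate_input("int64", 42)          # accepted
try:
    validate_input("int64", "42")    # a string sent to an Int64 family
except InputTypeError as err:
    print("rejected:", err)
```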

Mutation type

A Bigtable MutateRow request includes the type of mutation, which is a change to the table. The mutation types that you can send to create and update aggregate cells are AddToCell and MergeToCell. In contrast, a non-aggregate write involves a SetCell mutation. You can also use deletion mutations to clear the accumulated value of a cell.

In a replicated table, an aggregate cell converges on the same final value in all clusters within the current replication delay. The final value is the aggregate of all AddToCell mutations sent to that cell in all clusters since the last delete operation, or since the cell was created.
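One way to build intuition for this convergence is that sum, min, and max give the same result regardless of the order in which inputs are combined. The sketch below checks that property by brute force. It illustrates the arithmetic only and does not model Bigtable replication, and `final_values` is a hypothetical helper name.

```python
from functools import reduce
from itertools import permutations
from typing import Callable, Sequence, Set


def final_values(combine: Callable[[int, int], int], adds: Sequence[int]) -> Set[int]:
    """Return every distinct final value reachable by reordering the adds."""
    return {reduce(combine, order) for order in permutations(adds)}


adds = [7, -2, 11, 4]

# Each aggregation function reaches exactly one final value, whatever the order.
print(final_values(lambda a, b: a + b, adds))  # {20}
print(final_values(min, adds))                 # {-2}
print(final_values(max, adds))                 # {11}

# For contrast, a plain overwrite (keep the latest value) depends on order.
print(len(final_values(lambda a, b: b, adds)) > 1)  # True
```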

Aggregation operations are subject to the same operations limits as other table mutations.

AddToCell

To add data to an aggregate cell, such as when you are incrementing a counter, you send an AddToCell mutation in a MutateRow request. For more information, see AddToCell in the Bigtable Data API reference.

MergeToCell

If you want to copy data between cells, use a MergeToCell mutation. For example, to copy the state from cell A to cell B, you can send a sequence such as [DeleteCell(B), MergeToCell(B)], where the MergeToCell mutation carries the value that you read from cell A. For more information, see MergeToCell in the Bigtable Data API reference.
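The reason for deleting cell B first can be seen in a toy model of sum cells. This sketch is not the Data API: the `SumCells` class and its methods are hypothetical, and it treats the accumulator as a plain integer, whereas a real accumulator is the state that you read from the source cell.

```python
from typing import Dict, Optional


class SumCells:
    """Toy in-memory model of named sum aggregate cells."""

    def __init__(self) -> None:
        self.values: Dict[str, int] = {}

    def merge_to_cell(self, name: str, accumulator: int) -> None:
        """Merge an accumulator into the cell, creating the cell if needed."""
        self.values[name] = self.values.get(name, 0) + accumulator

    def delete_cell(self, name: str) -> None:
        """Clear the accumulated value of the cell."""
        self.values.pop(name, None)

    def read(self, name: str) -> Optional[int]:
        return self.values.get(name)


# Merging alone combines A's state with whatever B already holds.
cells = SumCells()
cells.merge_to_cell("A", 40)
cells.merge_to_cell("B", 15)
cells.merge_to_cell("B", cells.read("A"))
print(cells.read("B"))  # 55, a combination rather than a copy

# Deleting B first and then merging leaves B equal to A.
cells = SumCells()
cells.merge_to_cell("A", 40)
cells.merge_to_cell("B", 15)
state_of_a = cells.read("A")
cells.delete_cell("B")
cells.merge_to_cell("B", state_of_a)
print(cells.read("B"))  # 40, a copy of A
```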

Deletions

As with non-aggregate data, you can reset a counter or delete aggregated data using Data API mutations. For more information, see Mutation in the Bigtable Data API reference.

Garbage collection

Aggregate cells are treated like any other cell during garbage collection: if a cell is marked for deletion, the deletion is replicated to all clusters in the instance. For more information, see Replication and garbage collection. If an add request is sent to an aggregate cell that has been removed by garbage collection, a new aggregate cell is created.