In Cloud Bigtable, garbage-collection policies are set at the column-family level, and you cannot specify a cell-level garbage-collection policy. However, you can simulate a time-to-live (TTL) policy at the cell level by combining your garbage-collection settings with custom cell timestamps. This page explains a few different approaches that you can use.
Before you read this page, you should read the garbage collection overview.
In this approach, set your garbage-collection rule to let data expire after 1 second. Whenever you write data, set the cell's timestamp to the time you want the value to expire. During compaction, the garbage collector removes any cells with a timestamp that is at least 1 second in the past. For example, if you set a cell's timestamp to April 30 at 9:00:00 AM, the cell is garbage-collected sometime after April 30 at 9:00:01 AM. This approach enables you to set different expiration values for different cells in the same column family.
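As a minimal sketch of the write side of this approach, the following Python computes the timestamp to store on a cell so that it equals the desired expiration time. The function name `expiry_timestamp_micros`, the year in the sample date, and the choice to truncate to whole milliseconds are assumptions made for illustration; the sketch uses only the standard library and does not call the Bigtable client.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def expiry_timestamp_micros(write_time: datetime, ttl: timedelta) -> int:
    """Return the cell timestamp to write: the moment the value should expire.

    The result is in microseconds since the Unix epoch, truncated to whole
    milliseconds (an assumed timestamp granularity).
    """
    expires_at = write_time + ttl
    micros = (expires_at - EPOCH) // timedelta(microseconds=1)
    return micros - micros % 1000


# Hypothetical write at 8:00:00 AM UTC on April 30 (year chosen arbitrarily)
# with a 1-hour TTL, so the cell timestamp becomes 9:00:00 AM.
write_time = datetime(2024, 4, 30, 8, 0, 0, tzinfo=timezone.utc)
cell_timestamp = expiry_timestamp_micros(write_time, timedelta(hours=1))
```

With a 1-second age limit on the column family, a cell written with `cell_timestamp` becomes eligible for garbage collection at 9:00:01 AM, which matches the example above.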
- The timestamp has a real meaning: the expiration time.
Every application that writes data to this Cloud Bigtable column family needs to be configured to follow this rule. If you forget and use a default server timestamp on a write, that data expires right away and will be removed during the next compaction.
Because your timestamps aren't "real," you cannot use them for any other purpose, such as determining when a value was written. As a workaround, you can write the real timestamp to a separate column, but this increases the amount of data you store.
You cannot implement this strategy on a column family that already has data with real timestamps. If existing data has real timestamps, or if you accidentally write new data with real timestamps, that data will be removed during the next compaction.
You cannot make multiple cells in the same row and column expire at the same time, because new data overwrites old data that has the same timestamp.
Because garbage collection can take up to a week, expired cells can remain readable until they are removed. Always use filters when you read the data so that expired cells are excluded from the results.
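The read-side check for this approach can be sketched as follows. Because each cell's timestamp is its expiration time, a cell is still valid only while its timestamp is later than the current time. The `unexpired_cells` helper and the `(timestamp, value)` pairs are hypothetical stand-ins for cells read from one column, not part of any client library.

```python
from datetime import datetime, timezone


def unexpired_cells(cells, now):
    """Return only the cells whose timestamp (the expiration time) is after now.

    `cells` is a list of (timestamp, value) pairs standing in for the cells
    read from one column.
    """
    return [(ts, value) for ts, value in cells if ts > now]


# Sample data: one cell expired 30 seconds ago but has not been collected yet.
now = datetime(2024, 4, 30, 9, 0, 30, tzinfo=timezone.utc)
cells = [
    (datetime(2024, 4, 30, 9, 0, 0, tzinfo=timezone.utc), b"expired"),
    (datetime(2024, 4, 30, 10, 0, 0, tzinfo=timezone.utc), b"still valid"),
]
visible = unexpired_cells(cells, now)
```

In a real application, the same condition would typically be applied on the server with a timestamp range filter that starts at the current time, rather than by filtering in client code.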
Let's say you want most of your data to have a default TTL, but you want to set different per-cell expiration values for some of your data.
For example, you might store click events for ten customers in one table. Most of the click events should expire after 2 days, but you have one customer whose click events should expire after an hour, and you have another customer whose click events should expire after 3 days.
In this approach, create your column family with an age limit for garbage collection set to the default TTL. For data you want to expire sooner than the default, set the timestamp to be earlier than the time the data is actually written. For data you want to expire later, set the timestamp to be later than the time the data is actually written.
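The timestamp shift can be worked out with simple arithmetic: a cell expires when its timestamp plus the default TTL has passed, so to get a custom TTL you write the timestamp as the write time plus the difference between the custom TTL and the default TTL. The sketch below applies this to the click-event example; the function name `adjusted_timestamp` and the sample date are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_TTL = timedelta(days=2)  # age limit on the column family, from the example


def adjusted_timestamp(write_time, custom_ttl, default_ttl=DEFAULT_TTL):
    """Shift the write time so that timestamp + default_ttl equals write_time + custom_ttl."""
    return write_time + (custom_ttl - default_ttl)


write_time = datetime(2024, 4, 30, 9, 0, 0, tzinfo=timezone.utc)
one_hour = adjusted_timestamp(write_time, timedelta(hours=1))   # 47 hours earlier
three_days = adjusted_timestamp(write_time, timedelta(days=3))  # 1 day later
default = adjusted_timestamp(write_time, DEFAULT_TTL)           # unchanged
```

For the customer with a 1-hour TTL, the timestamp is set 47 hours before the actual write time; for the customer with a 3-day TTL, it is set 1 day after the actual write time.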
A default TTL is in place for writes that do not have a custom TTL.
This approach can safely be applied to a pre-existing table.
The timestamp is not semantically meaningful because a cell's timestamp might be real or artificial. This means you cannot use the cells' timestamps for any other use case, such as determining when a value was written. As a workaround, you can write the real timestamp to a separate column, but this will increase the amount of data you store.
You can inadvertently write a custom timestamp that clashes with a real timestamp in a given column.
Because garbage collection is asynchronous, you still need to use filters every time you read the data with this strategy.
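For this strategy, the read-side check differs from a plain "timestamp in the future" test: a cell is still valid while its timestamp is newer than the current time minus the default TTL. The following sketch shows that cutoff calculation; `within_default_ttl` and the `(timestamp, value)` pairs are hypothetical stand-ins for cells read from one column.

```python
from datetime import datetime, timedelta, timezone


def within_default_ttl(cells, now, default_ttl):
    """Return the cells whose timestamp is newer than now minus the default TTL."""
    cutoff = now - default_ttl
    return [(ts, value) for ts, value in cells if ts > cutoff]


# Sample data, assuming a 2-day default TTL as in the click-event example.
now = datetime(2024, 5, 2, 9, 30, 0, tzinfo=timezone.utc)
cells = [
    (datetime(2024, 4, 30, 9, 0, 0, tzinfo=timezone.utc), b"past the age limit"),
    (datetime(2024, 5, 1, 9, 0, 0, tzinfo=timezone.utc), b"within the age limit"),
    (datetime(2024, 5, 3, 9, 0, 0, tzinfo=timezone.utc), b"future-dated for a longer TTL"),
]
visible = within_default_ttl(cells, now, timedelta(days=2))
```

Cells that were written with a later timestamp to extend their TTL pass the check, while cells past the age limit are excluded even if compaction has not removed them yet.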