About garbage collection
This page describes how garbage collection works in Cloud Bigtable and covers the following topics:
- Types of garbage collection
- Default garbage collection settings
- When data is deleted
- Changes to garbage collection policies for replicated tables
Overview of garbage collection
Garbage collection is the automatic, ongoing process of removing expired and obsolete data from Bigtable tables. A garbage collection policy is a set of rules you create that state when data in a specific column family is no longer needed.
Garbage collection is a built-in, asynchronous background process. It can take up to a week before data that is eligible for garbage collection is actually deleted. Garbage collection occurs on a fixed schedule that does not vary based on how much data needs to be deleted. Until the data is deleted, it appears in read results. You can filter your reads to exclude this data.
The benefits of garbage collection policies include the following:
- Minimize row size - You always want to prevent rows from growing indefinitely. Large rows negatively affect performance. Ideally, you should never let a row grow beyond 100 MB in size, and the limit is 256 MB. If you don't need to keep old data, or old versions of your current data, using garbage collection can help you minimize the size of each row.
- Keep costs down - Garbage collection ensures that you don't pay to store data that is no longer required or used. You are charged for storage of expired or obsolete data until compaction occurs and data eligible for garbage collection is deleted. This process typically takes a few days but might take up to a week.
You can set garbage collection policies either programmatically or with the cbt tool. Garbage collection policies are set at the column family level.
Each column family in a table has its own garbage collection policy. The garbage collection process looks up the current garbage collection policy for each column family, then deletes data according to the rules in the policy.
In Bigtable, the intersection of a row and a column can have multiple cells, which contain timestamped versions of the value for that intersection.
Each cell has a timestamp. A timestamp is the number of microseconds since the Unix epoch, 1970-01-01 00:00:00 UTC. You can use default timestamps or set them when you send write requests.
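As a rough illustration of the microsecond timestamp format described above, the following Python sketch converts between timezone-aware datetimes and microseconds since the Unix epoch. The helper names `to_micros` and `from_micros` are hypothetical and are not part of any Bigtable client library.

```python
from datetime import datetime, timedelta, timezone

# The Unix epoch as a timezone-aware datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_micros(dt):
    """Return whole microseconds between the Unix epoch and `dt`.

    `dt` must be timezone-aware. (Hypothetical helper for illustration.)
    """
    return (dt - EPOCH) // timedelta(microseconds=1)

def from_micros(micros):
    """Return the timezone-aware UTC datetime for a microsecond timestamp."""
    return EPOCH + timedelta(microseconds=micros)

written_at = datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc)
micros = to_micros(written_at)
```

Integer division of one `timedelta` by another keeps the result an exact integer, which avoids the rounding that floating-point seconds can introduce.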
The timestamp property of a cell can be a "real" timestamp, reflecting the actual time the value for the cell is written, or it can be an "artificial" timestamp. Artificial timestamps include sequential numbers, zeroes, or timestamp-formatted values that are not the actual time the cell was written. Before you use artificial timestamps, review the use cases for artificial timestamps, including the risks of using them.
Types of garbage collection
This section describes the types of garbage collection available in Bigtable. Code samples for each type of garbage collection are at Configuring garbage collection.
Expiring values (age-based)
You can set a garbage collection rule based on the timestamp for each cell. For example, you might not want to keep any cells with timestamps more than 30 days before the current date and time. With this type of garbage collection rule, you set the time to live (TTL) for data. Bigtable looks at each column family during garbage collection and deletes any cells that have expired.
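The age-based rule can be pictured with a small Python model. This is only a sketch of the semantics described above, not Bigtable's implementation or client API. The function name `expired_cells` and the list-of-tuples cell representation are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

def expired_cells(cells, ttl, now):
    """Return the cells whose timestamp is more than `ttl` before `now`.

    `cells` is a list of (timestamp, value) tuples with timezone-aware
    datetime timestamps (a hypothetical, simplified cell representation).
    """
    cutoff = now - ttl
    return [cell for cell in cells if cell[0] < cutoff]

now = datetime(2024, 3, 31, tzinfo=timezone.utc)
cells = [
    (datetime(2024, 3, 30, tzinfo=timezone.utc), "recent"),
    (datetime(2024, 2, 1, tzinfo=timezone.utc), "old"),
]
# With a 30-day TTL, only cells stamped before 2024-03-01 are eligible.
eligible = expired_cells(cells, timedelta(days=30), now)
```

Because the rule compares the cell's timestamp with the current time, a cell written with an artificial timestamp is judged by that timestamp, not by when it was actually written.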
Number of versions
You can set a garbage collection rule that explicitly states the maximum number of cells to keep for all columns in a column family.
For instance, if you want to keep only the latest username and email address for a customer, you can create a column family containing those two columns and set the maximum number of values to 1 for that column family.
In another case, you might want to keep the last five versions of a user's password hash to make sure they don't reuse the password, so you would set the maximum number of versions for the column family containing the password column to 5. When Bigtable looks at the column family during garbage collection, if a sixth cell has been written to the password column, the oldest cell is deleted to keep the number of cells to five.
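The password example can be modeled in a few lines of Python. This is a sketch under stated assumptions: `cells_to_delete` is a hypothetical helper, cells are plain (timestamp, value) tuples for a single column, and small integers stand in for timestamps purely for readability.

```python
def cells_to_delete(cells, max_versions):
    """Return the cells beyond the `max_versions` newest ones in one column.

    `cells` is a list of (timestamp_micros, value) tuples (hypothetical model).
    """
    newest_first = sorted(cells, key=lambda cell: cell[0], reverse=True)
    return newest_first[max_versions:]

# Six versions of a password hash have been written to the column.
password_hashes = [(ts, f"hash-{ts}") for ts in range(1, 7)]

# With a maximum of five versions, only the oldest cell is eligible.
stale = cells_to_delete(password_hashes, max_versions=5)
```

The rule is applied per column, so in this model you would call the helper once for each column in the column family.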
Combinations of expiration and version number rules
You can use a combination of expiration and version number rules for garbage collection. The types of combinations are intersection, union, and nested. For configuration examples, see Garbage collection based on multiple criteria.
An intersection garbage collection policy marks data for deletion when it meets all the criteria in a given set of rules. For example, you might want to delete profiles older than 30 days but always keep at least one for each user. In this case, your intersection policy for the column family containing the profile column would consist of a rule for an expiring value and a rule for the number of versions.
A union garbage collection policy marks data for deletion when it meets any item in a given set of rules. For example, you might want to make sure that you retain a maximum of two page-view records per user but only if they are less than 30 days old. In this case, your union policy is set for an expiring value or a number of versions.
A nested garbage collection policy has a combination of union and intersection rules.
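The difference between intersection, union, and nested policies is easier to see in a toy model. The Python sketch below is an assumption-laden illustration, not the Bigtable API: every function name is hypothetical, a cell is reduced to its age in days, and its rank is its position when the column's cells are sorted newest first.

```python
def too_old(max_age_days):
    """Rule: a cell qualifies when its age exceeds `max_age_days`."""
    return lambda age, rank: age > max_age_days

def beyond_versions(max_versions):
    """Rule: a cell qualifies when it is not among the newest `max_versions`."""
    return lambda age, rank: rank >= max_versions

def intersection(*rules):
    """Mark a cell only when it meets ALL of the rules."""
    return lambda age, rank: all(rule(age, rank) for rule in rules)

def union(*rules):
    """Mark a cell when it meets ANY of the rules."""
    return lambda age, rank: any(rule(age, rank) for rule in rules)

def marked_for_deletion(ages_in_days, policy):
    """Return the ages of the cells in one column that the policy marks."""
    newest_first = sorted(ages_in_days)  # smallest age = newest cell = rank 0
    return [age for rank, age in enumerate(newest_first) if policy(age, rank)]

# Delete profiles older than 30 days, but always keep at least one.
keep_at_least_one = intersection(too_old(30), beyond_versions(1))

# Keep at most two page-view records, and only if under 30 days old.
two_recent_at_most = union(too_old(30), beyond_versions(2))

# Nested: a union that contains an intersection.
nested = union(beyond_versions(3), keep_at_least_one)
```

For example, `marked_for_deletion([45, 60], keep_at_least_one)` marks only the 60-day-old cell, because the 45-day-old cell is the newest version and therefore fails the version rule.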
Default settings for garbage collection
There is no default TTL for a column family. The number of cells retained for a column depends on how you create the column family that the column is in, as explained in the following sections.
HBase client for Java
If you create the column family with the HBase client for Java, the HBase shell, or another tool that uses the HBase client for Java, Bigtable retains only the most recent cell in each column in the column family, unless you change the rule. This default setting is consistent with HBase.
All other client libraries or tools
If you create the column family with any other client library or tool, Bigtable retains an infinite number of cells in each column in the column family. This includes column families created with gcloud and the cbt tool. You must change the garbage collection policy for the column family if you want to limit the number of versions.
When data is deleted
Garbage collection is a continuous process in which Bigtable checks the rules for each column family and deletes expired and obsolete data accordingly. In general, it can take up to a week from the time that data matches the criteria in the rules for the data to actually be deleted. You are not able to change the timing of garbage collection.
Because it can take up to a week for expired data to be deleted, you should never rely solely on garbage collection policies to ensure that read requests return the desired data. Always apply a filter to your read requests that excludes the same values as your garbage collection rules. You can filter by limiting the number of cells per column or by specifying a timestamp range.
For example, let's say that a column family's garbage collection rule is set to keep only the five most recent versions of a profile, and five versions are already stored. After a new version of the profile is written, it might take up to a week for the oldest cell to be deleted. Therefore, to avoid reading the sixth value, you should always filter out everything except the five most recent versions.
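The following Python sketch models why a read-side filter matters while a stale cell is still stored. It is a simplified illustration, not a client library call: `read_column` and its parameters are hypothetical, cells are (timestamp, value) tuples for a single column, and the timestamp range is treated as start-inclusive and end-exclusive as a choice made for this sketch.

```python
def read_column(cells, cells_per_column_limit=None,
                start_micros=None, end_micros=None):
    """Return cells newest first, after applying optional read-side filters.

    `cells` is a list of (timestamp_micros, value) tuples for one column
    (hypothetical model of stored data, including not-yet-collected cells).
    """
    result = sorted(cells, key=lambda cell: cell[0], reverse=True)
    if start_micros is not None:
        result = [cell for cell in result if cell[0] >= start_micros]
    if end_micros is not None:
        result = [cell for cell in result if cell[0] < end_micros]
    if cells_per_column_limit is not None:
        result = result[:cells_per_column_limit]
    return result

# Six profile versions are stored because garbage collection has not run yet.
stored = [(ts, f"profile-v{ts}") for ts in range(1, 7)]

unfiltered = read_column(stored)                         # includes the stale cell
visible = read_column(stored, cells_per_column_limit=5)  # matches the GC rule
```

Keeping the filter aligned with the garbage collection rule means the application sees the same data whether or not the background deletion has happened yet.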
You are charged for storage of expired data until compaction occurs and the data is deleted.
Garbage collection is retroactive: when a new garbage collection policy is set, over the next few days it is applied to all data in the table. If the new policy is more restrictive than the previous policy, old data is deleted as the background work happens, including data that was written before the policy change.
If you want to make sure that data marked for garbage collection is being deleted, you can query your table and compare the data with expected results. You can also monitor table size in the Google Cloud console. A table that never gets smaller might reflect a garbage collection policy that is not working as expected, but remember that garbage collection is executed on a delay.
Replication and garbage collection
Replication can affect garbage collection in a few ways.
Version-based garbage collection and CPU usage
In an instance that uses replication, deletes from version-based garbage collection are replicated to all clusters in the instance in the same way that application requests are replicated. If you rapidly write new cells that cause older cells to become marked for deletion, you might see increased CPU utilization when Bigtable deletes the stale cells and replicates those deletes to other clusters in the instance. Be prepared for this increase in CPU usage if you add a cluster to an instance containing tables that use version-based garbage collection.
Age-based garbage collection, on the other hand, does not increase CPU usage in replicated instances.
Changes to garbage collection policies for replicated tables
If a table is in a single-cluster instance, Bigtable lets you modify or delete a policy for a column family at any time. In instances that use replication, on the other hand, some restrictions apply. These restrictions protect your data.
You can modify a column family's maximum number of versions in a replicated table. However, if you lower the number of versions for a column family, it can take up to a week for all replicated clusters to reflect the new, lower number. Therefore, you should always use filters when reading the data.
Bigtable does not let you increase the TTL for a column family in a replicated table. To see why, consider a case where you want to change a column family's TTL from 30 days to 50 days. Age-based garbage collection can run separately in each cluster. As a result, at the time you change the policy in cluster A, garbage collection might have deleted a 31-day-old value in cluster B. Therefore the 31-day-old value will exist in cluster A but not in cluster B until it's deleted by the new 50-day policy in cluster A. Changing the garbage collection policy in this situation would leave the copies out of sync for almost 20 days.
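The "almost 20 days" figure follows from simple arithmetic: a value that is 31 days old when the TTL becomes 50 days survives in cluster A for another 19 days. The short Python sketch below spells this out. The function name is hypothetical and the model ignores the additional delay before garbage collection actually deletes an eligible cell.

```python
def days_out_of_sync(value_age_days, new_ttl_days):
    """Days until a value that survived in one cluster reaches the new TTL.

    Hypothetical helper: assumes the value was already deleted in the other
    cluster and is deleted in this one as soon as it reaches the new TTL.
    """
    return max(new_ttl_days - value_age_days, 0)

# A 31-day-old value under a TTL raised from 30 to 50 days.
gap = days_out_of_sync(value_age_days=31, new_ttl_days=50)
```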
For the same reason, Bigtable does not let you delete an age-based garbage collection policy for a column family in a replicated table.
What's next
- Explore strategies to simulate cell-level TTL.
- Read about how timestamps that are sequential numbers affect garbage collection.
- Learn how to only keep the most recent value of a column.
- Learn more about storage pricing.
- Look at garbage collection code samples in your preferred programming language.