Using VACUUM to accelerate transaction ID freezing in Cloud SQL for PostgreSQL
PostgreSQL Database Engineer, Google Cloud
PostgreSQL uses transaction IDs (TXIDs, XIDs) to implement Multi-Version Concurrency Control semantics (MVCC). The PostgreSQL documentation explains the role of TXIDs as follows:
PostgreSQL's MVCC transaction semantics depend on being able to compare transaction ID (XID) numbers: a row version with an insertion XID greater than the current transaction's XID is “in the future” and should not be visible to the current transaction. But since transaction IDs have limited size (32 bits), a cluster that runs for a long time (more than 4 billion transactions) would suffer transaction ID wraparound: the XID counter wraps around to zero, and all of a sudden transactions that were in the past appear to be in the future - which means their output become invisible. In short, catastrophic data loss.
To overcome the transaction ID wraparound problem, PostgreSQL uses a vacuum mechanism that freezes committed transaction IDs and releases them for further use. You can think of this mechanism as “recycling” of transaction IDs that keeps the database operating despite using a finite number to store the ID.
Vacuum can operate as a background task called autovacuum (enabled by default), and it can also be invoked manually using the VACUUM command. Autovacuum is designed as a low-priority task that yields to regular workload, so the speed and effectiveness of the process depends on database activity. In some cases, autovacuum might not be able to recycle transaction IDs quickly enough, reaching a point where PostgreSQL initiates a special type of vacuum called “aggressive vacuum” or “anti-wraparound vacuum”.
If TXID utilization continues to increase despite autovacuum’s best efforts, the database stops accepting commands to prevent transaction ID wraparound and consequently data loss.
This blog post demonstrates ways of accelerating the vacuum process on your Cloud SQL PostgreSQL instance to avoid the problems described above.
Note: This post uses vacuum features introduced in PostgreSQL 12. In order to follow the instructions, your database must be running PostgreSQL version 12 or later.
Autovacuum versus manual vacuum for TXID freezing
The following are the main stages of a vacuum operation, as shown in the pg_stat_progress_vacuum view:
Scanning heap: Vacuum is scanning the heap. It will prune and defragment each page if required, and possibly perform TXID freezing activity.
Vacuuming indexes: Vacuum is removing the dead row versions from indexes.
Vacuuming heap: Vacuum is removing dead row versions from the heap (table).
Truncating heap: Vacuum is truncating the heap to return empty pages at the end of the table to the operating system.
The task of freezing old transaction IDs requires the “scanning heap” and “vacuuming heap” phases, but not the “vacuuming indexes” and “truncating heap” phases. Consequently, when running a vacuum for the sole purpose of freezing transaction IDs, it can be beneficial to skip the optional phases. Large tables with multiple indexes are particularly susceptible to spending a considerable amount of time in the “vacuuming indexes” stage, and skipping that stage can considerably reduce vacuum time.
The optional phases are not skipped during automatic vacuuming. You can however cancel an autovacuum operation already in progress, and run a customized manual vacuum instead. You can also use the manual vacuum technique proactively to prevent PostgreSQL from initiating an aggressive autovacuum later.
Note: PostgreSQL 14 introduced a vacuum_failsafe_age parameter that provides equivalent functionality as part of autovacuum. PostgreSQL 14 users can still follow the manual procedure to become familiar with vacuum performance concepts, but the built-in functionality might be a better long-term choice.
Manual vacuum guide
This section will walk you through the following steps:
Check transaction ID utilization in each database.
Identify the autovacuum operations that can be canceled and re-run manually using an optimized VACUUM command.
Run an optimized VACUUM command for each table and monitor its progress.
Optionally, rebuild indexes on each table that was vacuumed manually.
Check database metrics again to verify that transaction ID utilization went down after the manual vacuum.
Note that although vacuum operations do not cause downtime, they do introduce additional load in the database, and their run time is difficult to predict. You may want to use a clone to test and familiarize yourself with the procedure before attempting it on production databases.
Check transaction ID utilization before vacuuming
To determine transaction ID utilization for each database, connect to your Cloud SQL instance as the user
postgres and run the following query:
The output will look like this:
A value of >80% in the
consumed_txid_pct column indicates that tables in that database are in need of vacuuming to recycle transaction IDs. It is likely that autovacuum is already running on those tables, and it can be sped up as described earlier.
Follow the steps in the next section for each database that needs vacuuming.
Choose tables to vacuum manually
Connect to the database and run the following query to list tables that are currently processed by the autovacuum daemon:
The output will look like this:
Each record in the output corresponds to one autovacuum operation running in the database. Review the output and identify any records where the vacuum is in the “vacuuming indexes” phase as shown in the
phase field. This indicates an operation that can potentially be sped up by canceling the autovacuum and performing a manual vacuum instead.
If there are multiple tables eligible for manual vacuuming, focus on the largest ones first. The larger the table, the longer the autovacuum process can take. Therefore, applying this procedure on the largest tables first can produce the biggest gains.
Follow the steps in the next section for each table you identified. Subsequent examples will use information from
RECORD 2 highlighted above.
Vacuum a table manually
Ensure that these prerequisites are met before proceeding:
In order to run a VACUUM command on a table, you must be logged in as the table owner.
In order to cancel an ongoing autovacuum, you need permission to execute the
pg_cancel_backendfunction. The user
postgresholds this permission by default, but other users must be granted access explicitly. To grant access, connect to the instance as
postgresand run the following command:
Once the prerequisites are met, execute the following commands to cancel the ongoing autovacuum and issue a manual vacuum:
You can monitor the progress of your manual vacuum using the
After the vacuum completes, you can optionally reindex the table. Our optimized VACUUM command contained the
INDEX_CLEANUP false clause, which skips the index optimization stage. Bypassing index optimization doesn’t cause any immediate issues, but if you frequently vacuum the same tables with
INDEX_CLEANUP false, it can lead to index bloat in the long term. You may want to REINDEX your table periodically if index bloat becomes a problem. You can read more about index bloat here.
Use the REINDEX command to reindex a table:
Vacuuming tables proactively
The above procedure focuses on tables that are actively processed by autovacuum. You can also vacuum your tables proactively, based on what you know about their size and the volume of workload they receive.
Even when you vacuum your tables routinely and proactively, continue to monitor transaction ID utilization and try to keep it as low as possible across all databases. For context: by default, Postgres starts an aggressive vacuum when transaction ID utilization reaches 10%. This threshold can be configured using the autovacuum_freeze_max_age setting.
If you find that the default autovacuum behavior is not sufficient for your workload (e.g. it’s routinely unable to reclaim transaction IDs quickly enough), you should consider tuning the autovacuum parameters. Finding optimal configuration can involve some trial and error, but a well-tuned autovacuum can reduce or even eliminate the need for proactive manual vacuuming. Note that a more aggressive autovacuum can potentially impact the performance of regular workloads, so the related settings are best changed and validated in small increments.
Checking transaction ID utilization after vacuuming
After you’ve completed manual vacuuming, check transaction ID utilization again:
You should see that the value in
consumed_txid_pct field is now lower than it was before:
If the value is still very high, it might be driven by certain tables that weren’t covered by the prior procedure. You can obtain TXID information at table level to identify tables that still need vacuuming.
For example, this query will show top ten tables ordered by transaction ID utilization:
In this blog post, we have demonstrated the ways of accelerating the vacuum process on your Cloud SQL PostgreSQL instance to avoid the transaction ID wraparound. Here is the link that describes what you can do when your database runs into Transaction ID Wraparound protection in PostgreSQL.