PostgreSQL is a powerful, open source database. Enterprises are increasingly migrating their workloads to Cloud SQL for PostgreSQL from other engines for a number of reasons, including price, performance and to escape vendor lock-in. A critical aspect of PostgreSQL is its Multiversion Concurrency Control (MVCC) implementation and the garbage collector which is called VACUUM. While VACUUM is a vast topic, the objective of this blog is to explain VACUUM internals in an easy to understand Q&A format.
Q. Who is the intended audience?
A. This blog post is intended for database administrators and developers who are curious to know the PostgreSQL MVCC and VACUUM internals.
Q. What is MVCC and why do we need VACUUM?
A. MVCC allows multiple transactions to read and write to the database concurrently without interfering with each other. The most important aspect of MVCC is that “readers never block writers” and “writers never block readers”. PostgreSQL achieves MVCC by creating a new version of a row. Once the updating transaction commits, it makes the new row visible to all new transactions that start after that point, while existing transactions continue to see the old row. PostgreSQL keeps all the row versions in the table data structure. It means an UPDATE query keeps the existing row version (a.k.a. dead tuple) and creates a new version with updated data. Similarly a DELETE query on a row marks that as a dead tuple but does not physically remove it. PostgreSQL keeps track of all these versioned rows and their visibility w.r.t. to a transaction with two columns, xmin and xmax stored in each row.
xmin – defines the transaction id that inserts the tuple.
xmax – defines the transaction id that deletes the tuple.
If there are lots of updates and deletes, it may result in a higher number of dead tuples, which increases storage usage and also impacts query performance. VACUUMing a table reclaims the space occupied by those dead tuples. It also does transaction ID freezing to avoid transaction wraparound. The VACUUM operation is run in the background by an autovacuum launcher process, but we can also run it manually with the VACUUM command.
Q. What is the transaction wraparound?
A. The xmin and xmax columns are of fixed size of 4 bytes, which allows it to store a maximum transaction ID of 2^32 – 1, or approximately 4 billion and hence transactions id need to be recycled with more transactions coming onto a busy database. PostgreSQL autovacuum continuously checks for tables where transaction id age crosses a threshold limit (controlled by autovacuum_freeze_max_age flag) and runs a freeze operation on it. We can also run a manual freeze job in an emergency situation.
Q. What is a visibility map?
A. Each heap relation has a Visibility Map (VM) to keep track of which pages have been modified since the last VACUUM. The visibility map stores two bits per heap page. The first bit, if set, indicates that the tuples on this page are visible to all sessions, or in other words that the page does not contain any tuples that need to be vacuumed. This information can be used by index-only scans to answer queries using only the index tuple and avoids the need to visit the tuple in the heap table to check for visibility. The second bit, if set, means that all tuples on the page have been frozen. That means that even an anti-wraparound VACUUM need not to revisit the page. The VM helps in speeding up the autovacuum/vacuum process for a table.
Q. Does the VACUUM process always clean up dead tuples or freeze transactions?
A. If there are blockers for a VACUUM process, it will not be able to clean up all the dead tuples or freeze transactions. For example, the blocker can be a long running transaction on either a writer instance or a replica, abandoned replication slots, or orphaned prepared transactions.You can follow the Cloud SQL documentation to identify the blocker and remove it.
Q. What is the impact on the database if the VACUUM is not running or is stuck?
A. If VACUUM is not running on a busy database, it will keep accumulating dead tuples depending on the database workload and DML operations, which impacts the queries performance adversely. Without transaction id recycling, the database will eventually hit a wraparound issue and cause a database outage. Cloud SQL provides metrics to monitor dead/live tuples and transaction id utilization.
Q. When should we use VACUUM FULL instead of a plain VACUUM?
A. A plain VACUUM (without FULL) simply reclaims space and makes it available for re-use. This form of the command can operate in parallel with normal reading and writing of the table, as an exclusive lock is not obtained. However, extra space is not returned to the operating system (in most cases), it is made available for re-use within the same table. VACUUM FULL rewrites the entire contents of the table into a new disk file with no extra space, allowing unused space to be returned to the operating system. This form is much slower and requires an ACCESS EXCLUSIVE lock on each table while it is being processed and usage of the table will be blocked until this completes. We may consider this an outage of the table. VACUUM FULL is useful when a particular table is full of dead rows and not expected to become that big again.
Note that plain VACUUM can also take ACCESS EXCLUSIVE lock when truncating empty pages at the end of a table, but usually that is for a short duration.
Q. Can we get the benefits of the VACUUM FULL without locking?
A. Cloud SQL provides pg_repack extension, which helps in removing bloat from tables and indexes. Unlike CLUSTER and VACUUM FULL it works online and holds an exclusive lock on the tables for a short duration during processing. This is useful for a production database where business does not allow longer downtime.
Q. Can we skip the index cleanup to speed up the VACUUM process?
A. PostgreSQL 12 introduced index_cleanup defer feature, this defers cleaning up indexes and hence can be significantly faster. This is handy in emergency situations where we are going to hit wraparound issues soon or are already in an outage situation.
PostgreSQL 14 skips the index vacuuming when failsafe VACUUM triggers, which is controlled by vacuum_failsafe_age flag.
Q. Is there a parallel option to speed up VACUUM?
A. All PostgreSQL versions allow the vacuum of two tables to proceed in parallel. PostgreSQL 13 added parallelism within the vacuum of a single table. One thing to keep in mind is that the heap scan is still done by a single process and there would be max one parallel worker corresponding to a single Index cleanup and the maximum number of VACUUM worker processes is dependent on max_worker_processes, min_parallel_index_scan_size and max_parallel_maintenance_workers flags. This is useful to reduce vacuum run time on a huge table with multiple Indexes.
Q. How can we speed up VACUUM operation?
A. There are a few autovacuum/vacuum flags that can be adjusted as per database workload to reduce VACUUM run time.
autovacuum_vacuum_cost_limit
The accumulated cost that causes the autovacuum process to sleep.
autovacuum_vacuum_cost_delay
The specified amount of time in milliseconds that the autovacuum process sleeps when it exceeds the limit. A lower value will speed up the VACUUM at the expense of more CPU and IO.
autovacuum_work_mem
Specifies the maximum amount of memory to be used by each autovacuum worker process. A high value will have a positive impact on the VACUUM run time. Currently a single VACUUM process can not use more than 1GB of memory.
Other ways to speed up a VACUUM operation are to run the VACUUM with parallel workers or defer the index clean up phase as explained above.
Q. What are the other tuning knobs for VACUUM?
A. There are a few more flags that can be used to tune a VACUUM operation.
autovacuum_max_workers
Maximum number of autovacuum processes (other than the autovacuum launcher) that may be running at any one time.
autovacuum_naptime
Specifies the minimum delay between autovacuum runs on any given database.
autovacuum_freeze_max_age
Specifies the maximum age (in transactions) that a table’s pg_class.relfrozenxid field can attain before an anti wraparound VACUUM. Setting it to a very high value gives less opportunity to finish the VACUUM and causes wraparound issues.
autovacuum_freeze_min_age
Specifies the cutoff age (in transactions) that VACUUM should use to decide whether to freeze row versions. The default is 50 million transactions. For INSERT only tables this can be set to 0 as those tuples will not get updated/deleted in their lifetime.
autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor
These flags control the behavior when an autovacuum should trigger. They can be configured at table level and we can configure a more aggressive VACUUM for large tables.
It is also recommended to set statement_timeout to put a limit on the maximum allowed duration of a SQL statement and idle_in_transaction_session_timeout to put a limit on the maximum allowed duration on a connection with idle in transaction state. Some of the flags can be set at table level so we can control VACUUM based on table usage.
Q. How does VACUUM work when there is aTOASTtable associated?
A. During an UPDATE operation, values of unchanged fields are normally preserved as-is; so an UPDATE of a row with out-of-line values incurs no TOAST costs if none of the out-of-line values change.
For example, consider a table with two columns ID INT, NAME TEXT[] where NAME values are stored out of line in a toast table. So now if we update the ID column, the dead tuple will be generated in the main table only and toast table will remain unchanged
Each TOAST-able data type specifies a default strategy for columns of that data type, but the strategy for a given table column can be altered with ALTER TABLE … SET STORAGE. Another setting TOAST_TUPLE_TARGET can be adjusted for each table using ALTER TABLE … SET (toast_tuple_target = N)to control when TOAST management code triggers .
We can run VACUUM operation on the main table and toast table in parallel. This would be useful in emergency wraparound situations to quickly recover from an outage.
Q. Does VACUUM block/cancel the queries running on a writer instance?
A. A plain auto VACUUM takes a lock of type SHARE UPDATE EXCLUSIVE MODE but the job gets canceled when it sees that another concurrent query requires the lock. In case it is taking ACCESS EXCLUSIVE lock for truncating empty data pages, it will skip this truncate operation in favor of the conflicting query. However, an anti wraparound vacuum blocks conflicting queries so before running any DDL query, it’s recommended to set lock_timeout.
Q. Does VACUUM block/cancel the queries running on a Cloud SQL replica node?
A. When a VACUUM is running and a SELECT query runs on a replica node accessing the same table then VACUUM might be cleaning up the dead tuples needed (for a consistent snapshot) for the replica query and this causes the replica query to be canceled.
To avoid query cancelation issues, set hot_standby_feedback flag on the replica that sends feedback from a hot standby to the writer instance. There are still a few scenarios that cause a replica query to be canceled even after setting this flag. For example, an exclusive lock on the table with a manual DDL operation, or it could be due to the VACUUM job releasing empty data pages.
For such cases, we can use the max_standby_streaming_delay flag, which delays Write Ahead Log (WAL) apply on the replica when there are conflicts; of course, it causes replication delay. We can disable the truncate behavior of a plain VACUUM if this causes a query cancelation issue.
The pg_stat_database_conflicts view will contain one row per database, showing database-wide statistics about query cancels occurring due to conflicts with recovery on standby servers.
Q. Does INSERT only tables need VACUUM?
A. Yes, INSERT only tables need vacuum for few reasons –
Freeze tuples to avoid transaction wraparound issues.
To update visibility map, which prevents query regression involving Index only scans because Index only scan needs to check the visibility map to see if a heap data page is all-visible.
PostgreSQL 13 introduced two flags autovacuum_vacuum_insert_threshold and autovacuum_vacuum_insert_scale_factor, which can be set appropriately to trigger an early autovacuum for insert only tables.
Q. What happens when VACUUM has an unexpected restart or gets terminated?
A. PostgreSQL keeps track of pages containing dead tuples and unfrozen tuples through visibility map to avoid a full table scan. During VACUUM run, this visibility map gets updated frequently so an unexpected restart of the VACUUM operation does not need to process those pages again.
In the case of anti wraparound vacuum to advance relfrozenxid (to reduce table age), VACUUM has to scan all the pages from visibility map in a single successful pass.
Q. What is the HOT feature and how does it help?
A. As UPDATE query creates a new tuple version and gets a new ctid (physical location of the tuple), it seems that the UPDATE query has to update all the respective indices as well. However, PostgreSQL has an optimization called HOT (heap only tuple) that helps in preventing write amplification (due to index page writes) and improves UPDATE query latency. With the HOT feature, old versions of updated rows can be completely removed during normal operations, including SELECTs, instead of requiring periodic vacuum operations. HOT is applicable only when a new tuple fits in the same data page and there are no updates on an index column.
Q. What things to keep in mind w.r.t to VACUUM when designing a table?
A. When we expect a lot of UPDATE operations on a table, we can do the following things:
Reduce the table FILLFACTOR value so the new tuple version fits in the same data page instead of moving it into a new page. Default value of FILLFACTOR is 100, we can reduce it to 90-95 range depending on our use case.
Avoid indexing frequently updated columns whenever possible to get the benefits of the HOT feature.
Use partitioned tables when data size is huge. It’s more efficient to manage and run VACUUM on individual manageable sized partitions instead of running it on a single huge table.
Depending on the columns we are updating, It’s more efficient to have a normalized table design. Let’s take a simple example of designing a table for a social media app (store followers for a user).
Use array to store all the followers ids:
userid int, follower_id int[]
In this design when we add or remove a new follower to a user id, the entire array needs to get updated and this may add a lot of churn to VACUUM operation depending on the number of followers for a user.
Use normalized table design:
userid int, follower_id int
In normalized design, we only need to add/delete that specific row of the table to add or remove a follower.
These recommendations are purely based on VACUUM efficiency but there may be other factors (query performance, storage) to choose one thing over other.
Q. How can we keep track of the VACUUM progress and number of pages to clean/freeze?
A. We can check the VACUUM progress using the catalog view pg_stat_progress_vacuum.
We can look at the visibility map stats using the pg_visibility extension. For example, check how many pages are all_visible and all_frozen in a table.
Note that a few flags mentioned in this blog may not be available at instance level but we can modify them at database or at user level.
Summary
Better understanding of VACUUM in PostgreSQL is important to maintain database performance, manage disk space efficiently, ensure proper database maintenance, and troubleshoot wraparound and performance issues. In this blog post, we learned how vacuum is helpful to reclaim the disk space occupied by dead tuples, preventing fragmentation and how to release space to the operating system. In addition, we learned how to make use of parallelism in vacuuming and understood the important flags that can affect the vacuum performance.
PostgreSQL VACUUM documentation
Optimizing, monitoring, and troubleshooting VACUUM operations in PostgreSQL
Cloud BlogRead More