Introduction
Relation bloat is something you run into all the time in day-to-day operations, and the name alone sounds unfriendly. So what exactly is table bloat, and how do you prevent it?

What Is Table Bloat
Table bloat means that the file-system space occupied by a table's data and indexes keeps growing even though the amount of live data has barely changed. Over time the relation files fill up with holes, wasting large amounts of disk space. In extreme cases, a table holding a single simple row can have a physical file in the megabytes or even gigabytes.
So what is the harm?
Disk usage keeps climbing until, past a certain point, you are forced into an expensive VACUUM FULL (or CLUSTER, or another table-rewriting command). But VACUUM FULL takes an AccessExclusiveLock (lock level 8), which blocks all access: the table cannot be used at all until the rewrite completes.
Scans get slower. Even if every row version is dead, PostgreSQL's sequential scan still reads all the old versions until a vacuum removes the dead tuples.
Why Does Table Bloat Happen
The cause lies in PostgreSQL's particular MVCC implementation. Mainstream MVCC designs come in roughly three flavors:
Oracle and InnoDB: when new data is written, the old data is moved to a separate area such as a rollback (undo) segment; readers reconstruct old versions from there, so bloat in the table itself is largely avoided.
SQL Server: old row versions are written to a dedicated version store in tempdb before the data is changed in place; since old versions live in a separate area, table bloat is likewise largely avoided.
PostgreSQL: when new data is written, the old data is not removed. The new tuple is inserted and the old one is merely marked invalid. Old and new versions live side by side, and until they are cleaned up, the old versions keep occupying space, which is what causes bloat.
Each approach has trade-offs. Compared with the first approach, PostgreSQL's MVCC has the following pros and cons:
Pros
No matter how much work a transaction has done, rollback completes instantly. Oracle, with its rollback segments, must roll back all in-flight transactions during crash recovery after a restart; until that finishes, rows remain locked and the application cannot proceed, and it gets worse if some very large transactions happen to need rolling back.
Data can be updated heavily without constantly worrying, as with Oracle or InnoDB, about exhausting rollback/undo space, and without being haunted by Oracle's "ORA-01555: snapshot too old" errors.
Cons
Old row versions must be cleaned up. Fortunately, v8.3 introduced autovacuum, a multi-process design that can work on several tables at once.
Old versions can increase the number of blocks a query has to scan, slowing it down.
Disk usage keeps growing and storage is not used effectively.
How MVCC Stores Data
Having covered the causes of bloat, it helps to look at how data is stored under MVCC:
INSERT is simple: the tuple is placed into free space on a page.
DELETE marks the tuple as an old version; even after that version becomes invisible to every transaction, its space is not returned to the file system.
UPDATE is effectively DELETE + INSERT: it occupies two tuples' worth of space, and as with DELETE, the old version keeps holding physical space.
Clearly, after a round of inserts, updates, and deletes, old tuple versions inevitably take up a noticeable share of each page, which is why the physical file ends up visibly larger than the actual data.
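As a toy illustration of why updates grow the file, here is a minimal Python sketch (an invented model, not PostgreSQL's actual page layout): each UPDATE marks the old version dead and appends a new one, so the number of stored tuple versions grows while the live row count stays flat.

```python
# Toy model of heap tuple versioning: a "page" is just a list of versions.
page = []                                 # each entry: {'value', 'dead'}

def insert(value):
    page.append({'value': value, 'dead': False})

def update(old_index, new_value):
    page[old_index]['dead'] = True        # old version stays, marked dead
    insert(new_value)                     # new version is appended

insert('v1')
for i in range(5):                        # update the same logical row 5 times
    update(len(page) - 1, f'v{i + 2}')

live = sum(not t['dead'] for t in page)
print(len(page), live)                    # 6 stored versions, 1 live row
```

Until something reclaims the dead entries, the "page" keeps all six versions for what is logically one row.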

To deal with this, PostgreSQL introduced the vacuum mechanism to clean up dead tuples that are no longer needed. As mentioned earlier, vacuum comes in two flavors: plain VACUUM and VACUUM FULL.
A plain VACUUM roughly does the following:
removes the "dead tuples" left behind by UPDATE and DELETE operations
tracks the free space in table blocks and updates the free space map
updates the visibility map, which index-only scans and later vacuums both rely on
"freezes" rows in the table to prevent transaction ID wraparound
together with ANALYZE, keeps planner statistics up to date
As you can see, a plain VACUUM only clears dead tuples and "frees up" space inside the table: future inserts reuse that space first, and new pages are added only once the reclaimed space runs out. But the free space on each page varies in size, so reusing it would require walking the data pages to find it, which is expensive. That is why PostgreSQL maintains the Free Space Map (FSM), a structure recording how much free space each page has, so that free space can be managed, found, and reused efficiently. The FSM appears after the first vacuum and can be inspected with the pg_freespacemap extension. Its structure resembles a B-tree; for details see: PgSQL · 原理介绍 · PostgreSQL中的空闲空间管理
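The FSM's tree-of-maxima idea can be sketched in a few lines of Python (a simplified model assuming a power-of-two page count, not the real on-disk format): each inner node stores the maximum free space beneath it, so finding a page with enough room is a short descent rather than a scan of every page.

```python
# Toy free-space map: a complete binary max-tree over per-page free bytes.
class ToyFSM:
    def __init__(self, free_space):          # free_space: bytes free per page
        self.n = len(free_space)             # assumed to be a power of two
        self.tree = [0] * (2 * self.n)
        self.tree[self.n:] = free_space      # leaves = pages
        for i in range(self.n - 1, 0, -1):   # build max-aggregates bottom-up
            self.tree[i] = max(self.tree[2 * i], self.tree[2 * i + 1])

    def find_page(self, needed):
        """Return a page number with >= needed free bytes, or None."""
        if self.tree[1] < needed:            # root max too small: no page fits
            return None
        i = 1
        while i < self.n:                    # descend toward a fitting leaf
            i = 2 * i if self.tree[2 * i] >= needed else 2 * i + 1
        return i - self.n                    # leaf index -> page number

fsm = ToyFSM([16, 400, 0, 120])
print(fsm.find_page(100))   # page 1 (400 bytes free)
print(fsm.find_page(1000))  # None: no page has that much room
```

The real FSM likewise answers "which page has at least N bytes free" in logarithmic time, which is what makes space reuse cheap.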

In the extreme case, every page ends up with a very low fill ratio; for example, each page holding only three tuples in total, with terrible space utilization.

That is what VACUUM FULL is for. VACUUM FULL rewrites the table, which means its relfilenode changes; so in day-to-day work, when you need to locate a table's underlying file, do not assume it matches the oid in pg_class, look it up via pg_class.relfilenode instead. Also note that VACUUM FULL can temporarily occupy up to twice the table's original disk space, so make sure enough space is reserved.
Speaking of space, here is a small trick: touch or dd a large file, say 5 GB, in advance and leave it sitting there. Deleting it at a critical moment frees space instantly and might just save you.

How to Prevent Table Bloat
Normally, dead tuples get cleaned up by vacuum at a suitable time. So when is the time not suitable? The core check is HeapTupleSatisfiesVacuum, used by the lazy vacuum code in vacuumlazy.c:
/*
* HeapTupleSatisfiesVacuum
*
* Determine the status of tuples for VACUUM purposes. Here, what
* we mainly want to know is if a tuple is potentially visible to *any*
* running transaction. If so, it can't be removed yet by VACUUM.
*
* OldestXmin is a cutoff XID (obtained from GetOldestXmin()). Tuples
* deleted by XIDs >= OldestXmin are deemed "recently dead"; they might
* still be visible to some open transaction, so we can't remove them,
* even if we see that the deleting transaction has committed.
*/
HTSV_Result
HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
Buffer buffer)
...
if (!TransactionIdPrecedes(xmax, OldestXmin))
return HEAPTUPLE_RECENTLY_DEAD;
...
/*
* TransactionIdPrecedes --- is id1 logically < id2?
*/
bool
TransactionIdPrecedes(TransactionId id1, TransactionId id2)
{
/*
* If either ID is a permanent XID then we can just do unsigned
* comparison. If both are normal, do a modulo-2^32 comparison.
*/
int32 diff;
if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
return (id1 < id2);
diff = (int32) (id1 - id2);
return (diff < 0);
}
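The modulo-2^32 comparison above can be mirrored in Python to see how it behaves across wraparound (a sketch; the constant 3 for the first normal XID matches PostgreSQL's FirstNormalTransactionId):

```python
# Sketch of TransactionIdPrecedes using plain uint32 arithmetic.
def transaction_id_precedes(id1: int, id2: int) -> bool:
    """Is id1 logically < id2, modulo 2^32?"""
    FIRST_NORMAL_XID = 3
    if id1 < FIRST_NORMAL_XID or id2 < FIRST_NORMAL_XID:
        # Permanent/special XIDs use plain unsigned comparison.
        return id1 < id2
    diff = (id1 - id2) & 0xFFFFFFFF       # modulo-2^32 subtraction
    if diff >= 0x80000000:                # reinterpret as signed int32
        diff -= 0x100000000
    return diff < 0                       # negative => id1 precedes id2

# Ordinary case: 100 precedes 200.
print(transaction_id_precedes(100, 200))        # True
# Wraparound case: an XID just before wraparound precedes one just after.
print(transaction_id_precedes(2**32 - 10, 20))  # True
```

The signed reinterpretation is what makes the comparison work across the 2^32 boundary: each normal XID "precedes" the 2 billion XIDs ahead of it, regardless of raw numeric order.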
TransactionIdPrecedes compares the order of two transaction IDs. Here xmax is the XID that deleted or updated the tuple, and OldestXmin is the oldest XID that must still be treated as running.
/*
* GetOldestXmin -- returns oldest transaction that was running
* when any current transaction was started.
*
* If rel is NULL or a shared relation, all backends are considered, otherwise
* only backends running in this database are considered.
*
* The flags are used to ignore the backends in calculation when any of the
* corresponding flags is set. Typically, if you want to ignore ones with
* PROC_IN_VACUUM flag, you can use PROCARRAY_FLAGS_VACUUM.
*
* PROCARRAY_SLOTS_XMIN causes GetOldestXmin to ignore the xmin and
* catalog_xmin of any replication slots that exist in the system when
* calculating the oldest xmin.
*
* This is used by VACUUM to decide which deleted tuples must be preserved in
* the passed in table. For shared relations backends in all databases must be
* considered, but for non-shared relations that's not required, since only
* backends in my own database could ever see the tuples in them. Also, we can
* ignore concurrently running lazy VACUUMs because (a) they must be working
* on other tables, and (b) they don't need to do snapshot-based lookups.
...
*/
TransactionId
GetOldestXmin(Relation rel, int flags)
{
ProcArrayStruct *arrayP = procArray;
TransactionId result;
int index;
bool allDbs;
TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
GetOldestXmin takes the minimum over:
backend_xid: the smallest current transaction ID across all backend processes
backend_xmin: the smallest snapshot xmin, taken at transaction start, across all backend processes
replication_slot_xmin: the smallest xmin across all replication slots, sent back periodically by standbys (if hot_standby_feedback is enabled)
replication_slot_catalog_xmin: the smallest catalog_xmin across all replication slots
If old_snapshot_threshold is set, the xmin tracked by the old-snapshot mechanism is also factored into the comparison.
The logic, then, is: if xmax > OldestXmin, HeapTupleSatisfiesVacuum returns HEAPTUPLE_RECENTLY_DEAD, meaning the tuple is kept rather than removed.
/*
* lazy_scan_heap() -- scan an open heap relation
*
* This routine prunes each page in the heap, which will among other
* things truncate dead tuples to dead line pointers, defragment the
* page, and set commit status bits (see heap_page_prune). It also builds
* lists of dead tuples and pages with free space, calculates statistics
* on the number of live tuples in the heap, and marks pages as
* all-visible if appropriate. When done, or when we run low on space for
* dead-tuple TIDs, invoke vacuuming of indexes and call lazy_vacuum_heap
* to reclaim dead line pointers.
*/
static void
lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
Relation *Irel, int nindexes, bool aggressive)
case HEAPTUPLE_RECENTLY_DEAD:
/*
* If tuple is recently deleted then we must not remove it
* from relation.
*/
nkeep += 1;
all_visible = false;
break;
So if the system contains a transaction that was opened long ago and never committed, and that transaction has performed an update and therefore acquired a transaction ID, the computed OldestXmin will be very small. When vacuum makes the check above, the result is usually true, i.e. HEAPTUPLE_RECENTLY_DEAD, so the old tuple version is kept, reclamation cannot complete, and table bloat sets in. So when does a backend_xid come into existence, and when a backend_xmin? The pg_stat_get_activity function fetches both values.
/*
* Returns activity of PG backends.
*/
Datum
pg_stat_get_activity(PG_FUNCTION_ARGS)
{
#define PG_STAT_GET_ACTIVITY_COLS 30
int num_backends = pgstat_fetch_stat_numbackends();
int curr_backend;
int pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0);
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
MemoryContext per_query_ctx;
MemoryContext oldcontext;
...
if (TransactionIdIsValid(local_beentry->backend_xid))
values[15] = TransactionIdGetDatum(local_beentry->backend_xid);
else
nulls[15] = true;
if (TransactionIdIsValid(local_beentry->backend_xmin))
values[16] = TransactionIdGetDatum(local_beentry->backend_xmin);
...
/* ----------
* LocalPgBackendStatus
*
* When we build the backend status array, we use LocalPgBackendStatus to be
* able to add new values to the struct when needed without adding new fields
* to the shared memory. It contains the backend status as a first member.
* ----------
*/
typedef struct LocalPgBackendStatus
{
/*
* Local version of the backend status entry.
*/
PgBackendStatus backendStatus;
/*
* The xid of the current transaction if available, InvalidTransactionId
* if not.
*/
TransactionId backend_xid;
/*
* The xmin of the current session if available, InvalidTransactionId if
* not.
*/
TransactionId backend_xmin;
} LocalPgBackendStatus;
As you can see, backend_xid and backend_xmin show whether a session has acquired a transaction ID, and what its snapshot xmin is.
Open a READ COMMITTED transaction:
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
12756
(1 row)
postgres=# begin;
BEGIN
postgres=*# insert into test values(1,'test');
INSERT 0 1
Query from a new session. Because the transaction made a durable change to the database, it acquired a transaction ID (a real one, not just a virtual XID); backend_xmin reflects the snapshot's xmin:
postgres=# select backend_xid,backend_xmin from pg_stat_activity where pid = 12756;
-[ RECORD 1 ]+--------
backend_xid | 7504371
backend_xmin |
Now start a REPEATABLE READ transaction:
postgres=# begin transaction isolation level repeatable read ;
BEGIN
postgres=*# insert into test values(1,'test');
INSERT 0 1
Querying again shows the snapshot xmin as well. For transactions at REPEATABLE READ or stricter isolation levels, backend_xmin stays constant for the entire transaction:
postgres=# select backend_xid,backend_xmin from pg_stat_activity where pid = 12756;
-[ RECORD 1 ]+--------
backend_xid | 7504372
backend_xmin | 7504372
Now open a SERIALIZABLE transaction:
postgres=# begin transaction isolation level serializable ;
BEGIN
postgres=*#
Merely opening the transaction, without doing anything, leaves backend_xid empty, because an XID is assigned only when it is actually needed:
postgres=# select backend_xid,backend_xmin from pg_stat_activity where pid = 12756;
-[ RECORD 1 ]+-
backend_xid |
backend_xmin |
Have the first session insert a row:
postgres=# begin transaction isolation level serializable ;
BEGIN
postgres=*# insert into test values(1,'test');
INSERT 0 1
Query again:
postgres=# select backend_xid,backend_xmin from pg_stat_activity where pid = 12756;
-[ RECORD 1 ]+--------
backend_xid | 7504374
backend_xmin | 7504374
Which Situations Cause Table Bloat
As described above, a particularly old backend_xid or backend_xmin blocks reclamation and therefore causes table bloat.
A very old backend_xid
Open a transaction and let it hold a backend_xid:
postgres=# create table test_bloat(id int,info text);
CREATE TABLE
postgres=# insert into test_bloat values(1,'test');
INSERT 0 1
postgres=# insert into test_bloat values(2,'test');
INSERT 0 1
postgres=# begin;
BEGIN
postgres=*# select pg_backend_pid();
pg_backend_pid
----------------
13183
(1 row)
postgres=*# select txid_current();
txid_current
--------------
7505695
(1 row)
Have another session update continuously:
postgres=# update test_bloat set info = 'hello';\watch 0.01
Before long, the table has bloated:
postgres=# \dt+ test_bloat
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+------------+-------+----------+-------------+--------+-------------
public | test_bloat | table | postgres | permanent | 536 kB |
(1 row)
As the vacuum output shows, 11352 dead tuples cannot be removed:
postgres=# vacuum VERBOSE test_bloat ;
INFO: vacuuming "public.test_bloat"
INFO: "test_bloat": found 0 removable, 11354 nonremovable row versions in 62 out of 62 pages
DETAIL: 11352 dead row versions cannot be removed yet, oldest xmin: 7505695
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: vacuuming "pg_toast.pg_toast_171002"
INFO: index "pg_toast_171002_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "pg_toast_171002": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 7505695
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM
Where does that number come from?
62 blocks in total
a completely filled block holds 185 tuples
61 * 185 + 69 = 11354 tuples in total; subtracting the 2 live rows from the start leaves 11352 dead tuples
postgres=# select count(*) from heap_page_items(get_raw_page('test_bloat', 0));
count
-------
185
(1 row)
postgres=# select relpages from pg_class where relname = 'test_bloat';
relpages
----------
62
(1 row)
postgres=# select count(*) from heap_page_items(get_raw_page('test_bloat', 61));
count
-------
69
(1 row)
postgres=# select 61 * 185 + 69;
?column?
----------
11354
(1 row)
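The arithmetic can be double-checked with a few lines of Python (numbers taken from the session above):

```python
# Sanity-check the dead-tuple count from the vacuum output.
full_pages = 61              # pages completely filled
tuples_per_full_page = 185   # from heap_page_items on page 0
last_page_tuples = 69        # from heap_page_items on page 61
live_tuples = 2              # the two original rows

total_tuples = full_pages * tuples_per_full_page + last_page_tuples
dead_tuples = total_tuples - live_tuples
print(total_tuples, dead_tuples)  # 11354 11352
```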
So every tuple version newer than the oldest open transaction stays unreclaimable. Once that transaction commits, cleanup proceeds normally:
postgres=# vacuum VERBOSE test_bloat ;
INFO: vacuuming "public.test_bloat"
INFO: "test_bloat": removed 11352 row versions in 61 pages
INFO: "test_bloat": found 11352 removable, 2 nonremovable row versions in 62 out of 62 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 7511374
There were 11228 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: vacuuming "pg_toast.pg_toast_171002"
INFO: index "pg_toast_171002_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "pg_toast_171002": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 7511374
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
VACUUM
A very old backend_xmin
For transactions at REPEATABLE READ or stricter isolation levels, i.e. repeatable read or serializable, backend_xmin stays constant for the entire transaction.
postgres=# truncate table test_bloat ;
TRUNCATE TABLE
postgres=# insert into test_bloat values(1,'test');
INSERT 0 1
postgres=# insert into test_bloat values(2,'test');
INSERT 0 1
postgres=# begin transaction isolation level repeatable read ;
BEGIN
postgres=*# select 1;
?column?
----------
1
(1 row)
postgres=# select backend_xmin,backend_xid from pg_stat_activity where pid = 14428;
backend_xmin | backend_xid
--------------+-------------
7511378 |
(1 row)
Another session updates continuously:
postgres=# update test_bloat set info = 'hello';\watch 0.01
Before long, the table has bloated:
postgres=# \dt+ test_bloat
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+------------+-------+----------+-------------+---------+-------------
public | test_bloat | table | postgres | permanent | 1152 kB |
(1 row)
And as before, nothing can be reclaimed:
postgres=# vacuum verbose test_bloat ;
INFO: vacuuming "public.test_bloat"
INFO: "test_bloat": found 0 removable, 25612 nonremovable row versions in 139 out of 139 pages
DETAIL: 25610 dead row versions cannot be removed yet, oldest xmin: 7511378
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: vacuuming "pg_toast.pg_toast_171002"
INFO: index "pg_toast_171002_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "pg_toast_171002": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 7511378
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM
Another thing worth noting is cursors (which developers are particularly fond of). They have plenty of benefits, but be sure to close them promptly; otherwise backend_xmin persists until the transaction ends, and dead tuples produced in the meantime likewise cannot be reclaimed.
postgres=# begin;
BEGIN
postgres=*# declare mycursor cursor for select id from test where id = 1;
DECLARE CURSOR
postgres=*# select pg_backend_pid();
pg_backend_pid
----------------
14428
(1 row)
postgres=# select backend_xmin,backend_xid from pg_stat_activity where pid = 14428;
backend_xmin | backend_xid
--------------+-------------
7524186 |
(1 row)
The same applies to long-running queries: backend_xmin persists until the SQL statement finishes, and dead tuples produced in that window cannot be reclaimed.
postgres=# begin;
BEGIN
postgres=*# select pg_sleep(10000);
postgres=# select backend_xmin,backend_xid from pg_stat_activity where pid = 14428;
backend_xmin | backend_xid
--------------+-------------
7524186 |
(1 row)
Once the query is cancelled, the xmin is released:
postgres=# begin;
BEGIN
postgres=*# select pg_sleep(10000);
^CCancel request sent
ERROR: canceling statement due to user request
postgres=# select backend_xmin,backend_xid from pg_stat_activity where pid = 14428;
backend_xmin | backend_xid
--------------+-------------
|
(1 row)
Stale replication slots
Replication slots are commonly used for logical replication, CDC, and physical replication:
postgres=# select slot_name,plugin,database,xmin,catalog_xmin,active from pg_replication_slots ;
slot_name | plugin | database | xmin | catalog_xmin | active
-----------+---------------+----------+------+--------------+--------
myslot3 | test_decoding | postgres | | 4488911 | f
myslot | decoder_raw | postgres | | 4491488 | f
myslot2 | decoder_raw | postgres | | 4488911 | f
(3 rows)

A slot's xmin is typically used in streaming replication: with a replication slot plus hot_standby_feedback, the standby periodically sends its xmin back to the primary, so the primary's vacuum preserves the row versions the standby still needs and snapshot conflicts become rare. Other parameters, such as max_standby_streaming_delay and vacuum_defer_cleanup_age, also influence this, but they are beyond the scope here.
So if the standby runs some particularly large queries, the xmin it sends back will be small, and the price is bloat on the primary. catalog_xmin, by contrast, applies to the system catalogs. Above, the oldest catalog_xmin is 4488911, which blocks reclamation:
postgres=# vacuum VERBOSE pg_class;
INFO: vacuuming "pg_catalog.pg_class"
INFO: "pg_class": found 0 removable, 1469 nonremovable row versions in 38 out of 102 pages
DETAIL: 957 dead row versions cannot be removed yet, oldest xmin: 4488911
There were 287 unused item identifiers.
Skipped 0 pages due to buffer pins, 64 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM
XA Transactions
A prepared transaction is independent of any session, survives crashes, and preserves its state. That state is persisted to disk, which is easy to understand: two-phase commit usually coordinates several databases, and once a node has answered a PREPARE TRANSACTION successfully, it must be able to commit later when told to commit, or roll back when told to roll back. PostgreSQL therefore has to persist the relevant information so it can respond at any time.
Since the state is persisted, the transaction can be recovered even after a server crash and restart, and it is kept alive until it is explicitly committed or rolled back. A prepared transaction holds locks and a transaction ID, and so it causes table bloat.
postgres=# begin;
BEGIN
postgres=*# insert into test values(1,'test');
INSERT 0 1
postgres=*# prepare transaction 't1';
PREPARE TRANSACTION
2PC transactions must be committed or rolled back promptly; leaving them open causes many problems: table bloat, held locks, blocked XID-age cleanup, and more.
On streaming-replication standbys, 2PC transactions are replicated as well.
In production, build monitoring around pg_prepared_xacts and alert on prepared transactions that stay open too long.
Temporary tables are not supported in prepared transactions.
Fortunately, PostgreSQL does not enable two-phase commit by default; it is turned on via max_prepared_transactions in postgresql.conf. Unless you know exactly what you are doing, prepared transactions are best avoided.
autovacuum disabled
v8.3 introduced autovacuum, with its multi-process architecture and support for working on several tables at once. Yet many users still turn autovacuum off and schedule vacuum themselves. With autovacuum off and no sensible self-managed vacuum schedule, garbage in tables is never reclaimed in time while new data keeps arriving, and bloat is inevitable.
There is also the track_counts parameter: if it is disabled, PostgreSQL's statistics collector stops updating each relation's dead-row counts, and since autovacuum relies on exactly that information to decide whether to run, autovacuum will never fire.
autovacuum falling behind
Poor I/O
When the database is very busy and I/O is weak, garbage collection slows down and bloat follows. This typically shows up when very large tables undergo a whole-table vacuum (to prevent XID wraparound, or because the table's age exceeds vacuum_freeze_table_age, which forces a full-table scan); the resulting heavy I/O easily bloats that table or others in the meantime.
Late triggering
autovacuum can simply trigger too late. What triggers it? The key knob is autovacuum_vacuum_scale_factor, which defaults to a 20% threshold: a table is vacuumed once the number of updated or deleted tuples exceeds autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * pg_class.reltuples.
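The trigger condition is easy to compute. A small Python sketch, assuming the default autovacuum_vacuum_threshold of 50 and scale factor of 0.2:

```python
# How many dead tuples before autovacuum considers a table?
def autovacuum_trigger_point(reltuples: int,
                             threshold: int = 50,
                             scale_factor: float = 0.2) -> float:
    """threshold + scale_factor * reltuples, as in the docs."""
    return threshold + scale_factor * reltuples

# A million-row table must accumulate ~200k dead tuples first.
print(autovacuum_trigger_point(1_000_000))  # 200050.0
```

This is why large tables feel "under-vacuumed" at the default settings: the bigger the table, the more garbage must pile up before a worker is dispatched, which is a common reason to lower the scale factor per table.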
A related knob is vacuum_defer_cleanup_age, which defers cleanup by a given number of transactions. Its costs:
Cost 1: bloat on the primary, because garbage versions can only be reclaimed after the deferral window has passed.
Cost 2: repeated scans over the same garbage, wasting vacuum-worker CPU (n_dead_tup stays above the trigger threshold, so autovacuum keeps waking workers to attempt reclamation). This is especially visible when autovacuum_naptime and autovacuum_vacuum_scale_factor are both set very small.
Cost 3: if a lot of garbage accumulates during the window, it may all become reclaimable at once when the deferral expires and be collected explosively, generating a burst of WAL and a write-I/O spike.
Workers too busy
If all workers are busy, a table whose garbage crosses the threshold may find no worker available to clean it, so bloat can occur. Note also that a single table cannot be vacuumed in parallel, because vacuum takes a ShareUpdateExclusive lock, which conflicts with itself. And before 9.1, a stuck vacuum could even starve other tables that needed vacuuming.
Long transactions
The database contains long-running SQL or long transactions holding an XID; every dead tuple produced afterwards stays uncleanable.
backend_xid marks a transaction that has acquired a transaction ID, e.g. one performing inserts, updates, deletes, or DDL. It lasts from XID assignment until the transaction ends.
backend_xmin is the snapshot in effect while a statement runs, i.e. the newest committed transactions it can see; think of queries and cursors. It lasts from statement start to statement end, or for a cursor, until the cursor is closed.
Whenever an unfinished statement or an unfinished XID-holding transaction exists, garbage produced during its lifetime cannot be reclaimed, and the database bloats. In effect, take the smallest value among all backend_xid and backend_xmin entries: garbage produced by any transaction newer than that minimum cannot be reclaimed.
Cost-based cleanup
With autovacuum_vacuum_cost_delay enabled, vacuum throttles itself based on I/O cost: once the accumulated cost reaches autovacuum_vacuum_cost_limit, the worker sleeps for the duration given by autovacuum_vacuum_cost_delay. This helps limit vacuum's I/O impact, but on systems where I/O is not a problem there is no reason to enable it, since it only makes garbage collection take longer.
For RAID cards with cache, the cost limit can reasonably be set around 1000; for SSDs, around 10000.
vacuum_cost_page_hit: the estimated cost of vacuuming a buffer found in shared memory, covering locking the buffer pool, the shared hash table lookup, and scanning the page. Default 1.
vacuum_cost_page_miss: the cost of vacuuming a buffer that has to be read from disk, covering the lock, the hash lookup, the disk read, and the scan. Default 10.
vacuum_cost_page_dirty: the estimated extra cost when vacuum dirties a previously clean block, i.e. the additional I/O of flushing the dirty block back to disk. Default 20.
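The throttling behaves like a cost budget. Here is a toy Python sketch (an invented helper with a simplified cost limit of 200; the real worker also handles delay timing and parameter inheritance) that counts how often a stream of page visits would put the worker to sleep:

```python
# Toy model of cost-based vacuum throttling using the default page costs.
COST_PAGE_HIT, COST_PAGE_MISS, COST_PAGE_DIRTY = 1, 10, 20

def count_sleeps(pages, cost_limit=200):
    """pages: iterable of 'hit' | 'miss' | 'dirty'; returns number of sleeps."""
    cost_table = {'hit': COST_PAGE_HIT, 'miss': COST_PAGE_MISS,
                  'dirty': COST_PAGE_DIRTY}
    balance, sleeps = 0, 0
    for kind in pages:
        balance += cost_table[kind]
        if balance >= cost_limit:   # budget exhausted: sleep and reset
            sleeps += 1
            balance = 0
    return sleeps

# 100 disk misses at cost 10 each = 1000 total cost, so 5 sleeps at limit 200.
print(count_sleeps(['miss'] * 100))  # 5
```

This makes the trade-off concrete: raising the cost limit (or lowering the page costs) means fewer sleeps and faster cleanup at the price of more I/O pressure.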
Scheduling period too long
The autovacuum launcher's cycle may be too long. The launcher can be thought of as autovacuum's daemon process, periodically spawning autovacuum worker processes. autovacuum_naptime sets the launcher's basic cycle, and at most autovacuum_max_workers worker processes can run at once.
Outpaced by the workload
Say a table holds 1,000,000 rows, and 10 processes each batch-update 100,000 of them, continuously. Why does this bloat the table? Because a worker process operates at table granularity: only one process reclaims garbage in a given table at a time. Three problems arise in this scenario:
garbage can momentarily be produced faster than it is reclaimed;
the demand for new tuple versions can exceed the free space left in the FSM;
during reclamation, other processes may start concurrent updates (holding their transactions' exclusive locks), leading to tuples that cannot be reclaimed, exactly the problem in the earlier example.
The Evolution of vacuum
Note also that vacuum needs to take locks, and it may skip a page when it cannot obtain the lock. For the optimization history of autovacuum, see: https://www.enterprisedb.com/postgres-tutorials/history-improvements-vacuum-postgresql
In the beginning there was no autovacuum: vacuum had to be run manually from scheduled jobs, which is hard to tune for both write-heavy and write-light databases.
8.3 introduced autovacuum, with a multi-process architecture supporting several tables at once.
8.4 improved the FSM. It used to be fixed-size: free space beyond the configured capacity was simply not tracked, causing bloat. 8.4 made the FSM dynamically extensible, with the fsm file created lazily rather than at relation creation, i.e. at vacuum time, or on the first FSM lookup when inserting a row. The visibility map (VM) was also added, tracking which heap pages contain only tuples visible to all transactions, so that VACUUM can consult the map and skip cleaning those pages.
9.1 let autovacuum skip tables whose lock it cannot obtain. Previously, an autovacuum could sit blocked on one table's lock for a long time, starving other tables that wanted vacuuming (and a single table cannot be vacuumed in parallel), which led to bloat.
9.2 let the system skip pages whose cleanup lock cannot be obtained, unless the page contains rows that must be deleted or frozen.
9.5 reduced the cases in which B-tree indexes keep recently accessed index pages pinned, so vacuum gets stuck waiting on index scans less often.
9.6 added a bit to the visibility map recording not only whether a page is all-visible but also whether it is all-frozen, which greatly speeds up freezing of static data and avoids unnecessary I/O.
11 introduced vacuum_cleanup_index_scale_factor, speeding up vacuum of tables with many inserts and no updates or deletes. Roughly: the vacuum cleanup phase scans the index to refresh index stats (including the meta-page counters) only when (insert_tuples - previous_total_tuples) / previous_total_tuples > vacuum_cleanup_index_scale_factor.
13 introduced parallel index vacuum (the table level still cannot be parallelized), along with autovacuum_vacuum_insert_threshold to defuse the "freeze bomb" after mass inserts, since pure inserts previously triggered only analyze, never vacuum. Notably, vacuum_cleanup_index_scale_factor and autovacuum_vacuum_insert_threshold sit somewhat at odds, the community committers favored the latter mechanism, and vacuum_cleanup_index_scale_factor was removed from v14 onward.
How to Observe Table Bloat
Getting the bloat ratio
SQL I
SELECT
CURRENT_CATALOG AS datname,
nspname,
relname,
bs * tblpages AS size,
CASE WHEN tblpages - est_tblpages_ff > 0 THEN
(tblpages - est_tblpages_ff) / tblpages::float
ELSE
0
END AS ratio
FROM (
SELECT
ceil(reltuples / ((bs - page_hdr) * fillfactor / (tpl_size * 100))) + ceil(toasttuples / 4) AS est_tblpages_ff,
tblpages,
fillfactor,
bs,
tblid,
nspname,
relname,
is_na
FROM (
SELECT
(4 + tpl_hdr_size + tpl_data_size + (2 * ma) - CASE WHEN tpl_hdr_size % ma = 0 THEN
ma
ELSE
tpl_hdr_size % ma
END - CASE WHEN ceil(tpl_data_size)::int % ma = 0 THEN
ma
ELSE
ceil(tpl_data_size)::int % ma
END) AS tpl_size,
(heappages + toastpages) AS tblpages,
heappages,
toastpages,
reltuples,
toasttuples,
bs,
page_hdr,
tblid,
nspname,
relname,
fillfactor,
is_na
FROM (
SELECT
tbl.oid AS tblid,
ns.nspname,
tbl.relname,
tbl.reltuples,
tbl.relpages AS heappages,
coalesce(toast.relpages, 0) AS toastpages,
coalesce(toast.reltuples, 0) AS toasttuples,
coalesce(substring(array_to_string(tbl.reloptions, ' ')
FROM 'fillfactor=([0-9]+)')::smallint, 100) AS fillfactor,
current_setting('block_size')::numeric AS bs,
CASE WHEN version() ~ 'mingw32'
OR version() ~ '64-bit|x86_64|ppc64|ia64|amd64' THEN
8
ELSE
4
END AS ma,
24 AS page_hdr,
23 + CASE WHEN MAX(coalesce(s.null_frac, 0)) > 0 THEN
(7 + count(s.attname)) / 8
ELSE
0::int
END + CASE WHEN bool_or(att.attname = 'oid'
AND att.attnum < 0) THEN
4
ELSE
0
END AS tpl_hdr_size,
sum((1 - coalesce(s.null_frac, 0)) * coalesce(s.avg_width, 0)) AS tpl_data_size,
bool_or(att.atttypid = 'pg_catalog.name'::regtype)
OR sum(
CASE WHEN att.attnum > 0 THEN
1
ELSE
0
END) <> count(s.attname) AS is_na
FROM
pg_attribute AS att
JOIN pg_class AS tbl ON att.attrelid = tbl.oid
JOIN pg_namespace AS ns ON ns.oid = tbl.relnamespace
LEFT JOIN pg_stats AS s ON s.schemaname = ns.nspname
AND s.tablename = tbl.relname
AND s.inherited = FALSE
AND s.attname = att.attname
LEFT JOIN pg_class AS toast ON tbl.reltoastrelid = toast.oid
WHERE
NOT att.attisdropped
AND tbl.relkind = 'r'
AND nspname NOT IN ('pg_catalog', 'information_schema')
GROUP BY
1,
2,
3,
4,
5,
6,
7,
8,
9,
10) AS s) AS s2) AS s3
WHERE
NOT is_na;
SQL II
-- new table bloat query
-- still needs work; is often off by +/- 20%
WITH constants AS (
-- define some constants for sizes of things
-- for reference down the query and easy maintenance
SELECT
current_setting('block_size')::numeric AS bs,
23 AS hdr,
8 AS ma
),
no_stats AS (
-- screen out tables that have attributes
-- which don't have stats, such as JSON
SELECT
table_schema,
table_name,
n_live_tup::numeric AS est_rows,
pg_table_size(relid)::numeric AS table_size
FROM
information_schema.columns
JOIN pg_stat_user_tables AS psut ON table_schema = psut.schemaname
AND table_name = psut.relname
LEFT OUTER JOIN pg_stats ON table_schema = pg_stats.schemaname
AND table_name = pg_stats.tablename
AND column_name = attname
WHERE
attname IS NULL
AND table_schema NOT IN ('pg_catalog', 'information_schema')
GROUP BY
table_schema,
table_name,
relid,
n_live_tup
),
null_headers AS (
-- calculate null header sizes
-- omitting tables which don't have complete stats
-- and attributes which aren't visible
SELECT
hdr + 1 + (sum(
CASE WHEN null_frac <> 0 THEN
1
ELSE
0
END) / 8) AS nullhdr,
SUM((1 - null_frac) * avg_width) AS datawidth,
MAX(null_frac) AS maxfracsum,
schemaname,
tablename,
hdr,
ma,
bs
FROM
pg_stats
CROSS JOIN constants
LEFT OUTER JOIN no_stats ON schemaname = no_stats.table_schema
AND tablename = no_stats.table_name
WHERE
schemaname NOT IN ('pg_catalog', 'information_schema')
AND no_stats.table_name IS NULL
AND EXISTS (
SELECT
1
FROM
information_schema.columns
WHERE
schemaname = columns.table_schema
AND tablename = columns.table_name)
GROUP BY
schemaname,
tablename,
hdr,
ma,
bs
),
data_headers AS (
-- estimate header and row size
SELECT
ma,
bs,
hdr,
schemaname,
tablename,
(datawidth + (hdr + ma - (
CASE WHEN hdr % ma = 0 THEN
ma
ELSE
hdr % ma
END)))::numeric AS datahdr,
(maxfracsum * (nullhdr + ma - (
CASE WHEN nullhdr % ma = 0 THEN
ma
ELSE
nullhdr % ma
END))) AS nullhdr2
FROM
null_headers
),
table_estimates AS (
-- make estimates of how large the table should be
-- based on row and page size
SELECT
schemaname,
tablename,
bs,
reltuples::numeric AS est_rows,
relpages * bs AS table_bytes,
CEIL((reltuples * (datahdr + nullhdr2 + 4 + ma - (
CASE WHEN datahdr % ma = 0 THEN
ma
ELSE
datahdr % ma
END)) / (bs - 20))) * bs AS expected_bytes,
reltoastrelid
FROM
data_headers
JOIN pg_class ON tablename = relname
JOIN pg_namespace ON relnamespace = pg_namespace.oid
AND schemaname = nspname
WHERE
pg_class.relkind = 'r'
),
estimates_with_toast AS (
-- add in estimated TOAST table sizes
-- estimate based on 4 toast tuples per page because we don't have
-- anything better. also append the no_data tables
SELECT
schemaname,
tablename,
TRUE AS can_estimate,
est_rows,
table_bytes + (coalesce(toast.relpages, 0) * bs) AS table_bytes,
expected_bytes + (ceil(coalesce(toast.reltuples, 0) / 4) * bs) AS expected_bytes
FROM
table_estimates
LEFT OUTER JOIN pg_class AS toast ON table_estimates.reltoastrelid = toast.oid
AND toast.relkind = 't'
),
table_estimates_plus AS (
-- add some extra metadata to the table data
-- and calculations to be reused
-- including whether we can't estimate it
-- or whether we think it might be compressed
SELECT
current_database() AS databasename,
schemaname,
tablename,
can_estimate,
est_rows,
CASE WHEN table_bytes > 0 THEN
table_bytes::numeric
ELSE
NULL::numeric
END AS table_bytes,
CASE WHEN expected_bytes > 0 THEN
expected_bytes::numeric
ELSE
NULL::numeric
END AS expected_bytes,
CASE WHEN expected_bytes > 0
AND table_bytes > 0
AND expected_bytes <= table_bytes THEN
(table_bytes - expected_bytes)::numeric
ELSE
0::numeric
END AS bloat_bytes
FROM
estimates_with_toast
UNION ALL
SELECT
current_database() AS databasename,
table_schema,
table_name,
FALSE,
est_rows,
table_size,
NULL::numeric,
NULL::numeric
FROM
no_stats
),
bloat_data AS (
-- do final math calculations and formatting
SELECT
current_database() AS databasename,
schemaname,
tablename,
can_estimate,
table_bytes,
round(table_bytes / (1024 ^ 2)::numeric, 3) AS table_mb,
expected_bytes,
round(expected_bytes / (1024 ^ 2)::numeric, 3) AS expected_mb,
round(bloat_bytes * 100 / table_bytes) AS pct_bloat,
round(bloat_bytes / (1024::numeric ^ 2), 2) AS mb_bloat,
table_bytes,
expected_bytes,
est_rows
FROM
table_estimates_plus)
-- filter output for bloated tables
SELECT
databasename,
schemaname,
tablename,
can_estimate,
est_rows,
pct_bloat,
mb_bloat,
table_mb
FROM
bloat_data
-- this where clause defines which tables actually appear
-- in the bloat chart
-- example below filters for tables which are either 50%
-- bloated and more than 20mb in size, or more than 25%
-- bloated and more than 4GB in size
WHERE (pct_bloat >= 50
AND mb_bloat >= 10)
OR (pct_bloat >= 25
AND mb_bloat >= 1000)
ORDER BY
pct_bloat DESC;
Getting the bloat details
SQL I
SELECT
current_database(),
schemaname,
tablename,
/*reltuples::bigint, relpages::bigint, otta,*/
ROUND((
CASE WHEN otta = 0 THEN
0.0
ELSE
sml.relpages::float / otta
END)::numeric, 1) AS tbloat,
CASE WHEN relpages < otta THEN
0
ELSE
bs * (sml.relpages - otta)::bigint
END AS wastedbytes,
iname,
/*ituples::bigint, ipages::bigint, iotta,*/
ROUND((
CASE WHEN iotta = 0
OR ipages = 0 THEN
0.0
ELSE
ipages::float / iotta
END)::numeric, 1) AS ibloat,
CASE WHEN ipages < iotta THEN
0
ELSE
bs * (ipages - iotta)
END AS wastedibytes
FROM (
SELECT
schemaname,
tablename,
cc.reltuples,
cc.relpages,
bs,
CEIL((cc.reltuples * ((datahdr + ma - (
CASE WHEN datahdr % ma = 0 THEN
ma
ELSE
datahdr % ma
END)) + nullhdr2 + 4)) / (bs - 20::float)) AS otta,
COALESCE(c2.relname, '?') AS iname,
COALESCE(c2.reltuples, 0) AS ituples,
COALESCE(c2.relpages, 0) AS ipages,
COALESCE(CEIL((c2.reltuples * (datahdr - 12)) / (bs - 20::float)), 0) AS iotta -- very rough approximation, assumes all cols
FROM (
SELECT
ma,
bs,
schemaname,
tablename,
(datawidth + (hdr + ma - (
CASE WHEN hdr % ma = 0 THEN
ma
ELSE
hdr % ma
END)))::numeric AS datahdr,
(maxfracsum * (nullhdr + ma - (
CASE WHEN nullhdr % ma = 0 THEN
ma
ELSE
nullhdr % ma
END))) AS nullhdr2
FROM (
SELECT
schemaname,
tablename,
hdr,
ma,
bs,
SUM((1 - null_frac) * avg_width) AS datawidth,
MAX(null_frac) AS maxfracsum,
hdr + (
SELECT
1 + count(*) / 8
FROM
pg_stats s2
WHERE
null_frac <> 0
AND s2.schemaname = s.schemaname
AND s2.tablename = s.tablename) AS nullhdr
FROM
pg_stats s,
(
SELECT
(
SELECT
current_setting('block_size')::numeric) AS bs,
CASE WHEN substring(v, 12, 3) IN ('8.0', '8.1', '8.2') THEN
27
ELSE
23
END AS hdr,
CASE WHEN v ~ 'mingw32' THEN
8
ELSE
4
END AS ma
FROM (
SELECT
version() AS v) AS foo) AS constants
GROUP BY
1,
2,
3,
4,
5) AS foo) AS rs
JOIN pg_class cc ON cc.relname = rs.tablename
JOIN pg_namespace nn ON cc.relnamespace = nn.oid
AND nn.nspname = rs.schemaname
AND nn.nspname <> 'information_schema'
LEFT JOIN pg_index i ON indrelid = cc.oid
LEFT JOIN pg_class c2 ON c2.oid = i.indexrelid) AS sml
ORDER BY
wastedbytes DESC
SQL II
/* WARNING: executed with a non-superuser role, the query inspects only the tables and materialized views (9.3+) you are granted to read.
* This query is compatible with PostgreSQL 9.0 and later
*/
SELECT
current_database(),
schemaname,
tblname,
bs * tblpages AS real_size,
(tblpages - est_tblpages) * bs AS extra_size,
CASE WHEN tblpages - est_tblpages > 0 THEN
100 * (tblpages - est_tblpages) / tblpages::float
ELSE
0
END AS extra_pct,
fillfactor,
CASE WHEN tblpages - est_tblpages_ff > 0 THEN
(tblpages - est_tblpages_ff) * bs
ELSE
0
END AS bloat_size,
CASE WHEN tblpages - est_tblpages_ff > 0 THEN
100 * (tblpages - est_tblpages_ff) / tblpages::float
ELSE
0
END AS bloat_pct,
is_na
-- , tpl_hdr_size, tpl_data_size, (pst).free_percent + (pst).dead_tuple_percent AS real_frag -- (DEBUG INFO)
FROM (
SELECT
ceil(reltuples / ((bs - page_hdr) / tpl_size)) + ceil(toasttuples / 4) AS est_tblpages,
ceil(reltuples / ((bs - page_hdr) * fillfactor / (tpl_size * 100))) + ceil(toasttuples / 4) AS est_tblpages_ff,
tblpages,
fillfactor,
bs,
tblid,
schemaname,
tblname,
heappages,
toastpages,
is_na
-- , tpl_hdr_size, tpl_data_size, pgstattuple(tblid) AS pst -- (DEBUG INFO)
FROM (
SELECT
(4 + tpl_hdr_size + tpl_data_size + (2 * ma) - CASE WHEN tpl_hdr_size % ma = 0 THEN
ma
ELSE
tpl_hdr_size % ma
END - CASE WHEN ceil(tpl_data_size)::int % ma = 0 THEN
ma
ELSE
ceil(tpl_data_size)::int % ma
END) AS tpl_size,
bs - page_hdr AS size_per_block,
(heappages + toastpages) AS tblpages,
heappages,
toastpages,
reltuples,
toasttuples,
bs,
page_hdr,
tblid,
schemaname,
tblname,
fillfactor,
is_na
-- , tpl_hdr_size, tpl_data_size
FROM (
SELECT
tbl.oid AS tblid,
ns.nspname AS schemaname,
tbl.relname AS tblname,
tbl.reltuples,
tbl.relpages AS heappages,
coalesce(toast.relpages, 0) AS toastpages,
coalesce(toast.reltuples, 0) AS toasttuples,
coalesce(substring(array_to_string(tbl.reloptions, ' ')
FROM 'fillfactor=([0-9]+)')::smallint, 100) AS fillfactor,
current_setting('block_size')::numeric AS bs,
CASE WHEN version() ~ 'mingw32'
OR version() ~ '64-bit|x86_64|ppc64|ia64|amd64' THEN
8
ELSE
4
END AS ma,
24 AS page_hdr,
23 + CASE WHEN MAX(coalesce(s.null_frac, 0)) > 0 THEN
(7 + count(s.attname)) / 8
ELSE
0::int
END + CASE WHEN bool_or(att.attname = 'oid'
AND att.attnum < 0) THEN
4
ELSE
0
END AS tpl_hdr_size,
sum((1 - coalesce(s.null_frac, 0)) * coalesce(s.avg_width, 0)) AS tpl_data_size,
bool_or(att.atttypid = 'pg_catalog.name'::regtype)
OR sum(
CASE WHEN att.attnum > 0 THEN
1
ELSE
0
END) <> count(s.attname) AS is_na
FROM
pg_attribute AS att
JOIN pg_class AS tbl ON att.attrelid = tbl.oid
JOIN pg_namespace AS ns ON ns.oid = tbl.relnamespace
LEFT JOIN pg_stats AS s ON s.schemaname = ns.nspname
AND s.tablename = tbl.relname
AND s.inherited = FALSE
AND s.attname = att.attname
LEFT JOIN pg_class AS toast ON tbl.reltoastrelid = toast.oid
WHERE
NOT att.attisdropped
AND tbl.relkind IN ('r', 'm')
GROUP BY
1,
2,
3,
4,
5,
6,
7,
8,
9,
10
ORDER BY
2,
3) AS s) AS s2) AS s3
-- WHERE NOT is_na
-- AND tblpages*((pst).free_percent + (pst).dead_tuple_percent)::float4/100 >= 1
ORDER BY
schemaname,
tblname;
SQL III
SELECT
current_database() AS db,
schemaname,
tablename,
reltuples::bigint AS tups,
relpages::bigint AS pages,
otta,
ROUND(
CASE WHEN otta = 0
OR sml.relpages = 0
OR sml.relpages = otta THEN
0.0
ELSE
sml.relpages / otta::numeric
END, 1) AS tbloat,
CASE WHEN relpages < otta THEN
0
ELSE
relpages::bigint - otta
END AS wastedpages,
CASE WHEN relpages < otta THEN
0
ELSE
bs * (sml.relpages - otta)::bigint
END AS wastedbytes,
CASE WHEN relpages < otta THEN
'0 bytes'::text
ELSE
(bs * (relpages - otta))::bigint || ' bytes'
END AS wastedsize,
iname,
ituples::bigint AS itups,
ipages::bigint AS ipages,
iotta,
ROUND(
CASE WHEN iotta = 0
OR ipages = 0
OR ipages = iotta THEN
0.0
ELSE
ipages / iotta::numeric
END, 1) AS ibloat,
CASE WHEN ipages < iotta THEN
0
ELSE
ipages::bigint - iotta
END AS wastedipages,
CASE WHEN ipages < iotta THEN
0
ELSE
bs * (ipages - iotta)
END AS wastedibytes,
CASE WHEN ipages < iotta THEN
'0 bytes'
ELSE
(bs * (ipages - iotta))::bigint || ' bytes'
END AS wastedisize,
CASE WHEN relpages < otta THEN
CASE WHEN ipages < iotta THEN
0
ELSE
bs * (ipages - iotta::bigint)
END
ELSE
CASE WHEN ipages < iotta THEN
bs * (relpages - otta::bigint)
ELSE
bs * (relpages - otta::bigint + ipages - iotta::bigint)
END
END AS totalwastedbytes
FROM (
SELECT
nn.nspname AS schemaname,
cc.relname AS tablename,
COALESCE(cc.reltuples, 0) AS reltuples,
COALESCE(cc.relpages, 0) AS relpages,
COALESCE(bs, 0) AS bs,
COALESCE(CEIL((cc.reltuples * ((datahdr + ma - (
CASE WHEN datahdr % ma = 0 THEN
ma
ELSE
datahdr % ma
END)) + nullhdr2 + 4)) / (bs - 20::float)), 0) AS otta,
COALESCE(c2.relname, '?') AS iname,
COALESCE(c2.reltuples, 0) AS ituples,
COALESCE(c2.relpages, 0) AS ipages,
COALESCE(CEIL((c2.reltuples * (datahdr - 12)) / (bs - 20::float)), 0) AS iotta -- very rough approximation, assumes ALL cols
FROM
pg_class cc
JOIN pg_namespace nn ON cc.relnamespace = nn.oid
AND nn.nspname <> 'information_schema'
LEFT JOIN (
SELECT
ma,
bs,
foo.nspname,
foo.relname,
(datawidth + (hdr + ma - (
CASE WHEN hdr % ma = 0 THEN
ma
ELSE
hdr % ma
END)))::numeric AS datahdr,
(maxfracsum * (nullhdr + ma - (
CASE WHEN nullhdr % ma = 0 THEN
ma
ELSE
nullhdr % ma
END))) AS nullhdr2
FROM (
SELECT
ns.nspname,
tbl.relname,
hdr,
ma,
bs,
SUM((1 - coalesce(null_frac, 0)) * coalesce(avg_width, 2048)) AS datawidth,
MAX(coalesce(null_frac, 0)) AS maxfracsum,
hdr + (
SELECT
1 + count(*) / 8
FROM
pg_stats s2
WHERE
null_frac <> 0
AND s2.schemaname = ns.nspname
AND s2.tablename = tbl.relname) AS nullhdr
FROM
pg_attribute att
JOIN pg_class tbl ON att.attrelid = tbl.oid
JOIN pg_namespace ns ON ns.oid = tbl.relnamespace
LEFT JOIN pg_stats s ON s.schemaname = ns.nspname
AND s.tablename = tbl.relname
AND s.inherited = FALSE
AND s.attname = att.attname,
(
SELECT
(
SELECT
current_setting('block_size')::numeric) AS bs,
CASE WHEN SUBSTRING(SPLIT_PART(v, ' ', 2)
FROM '#"[0-9]+.[0-9]+#"%' FOR '#') IN ('8.0', '8.1', '8.2') THEN
27
ELSE
23
END AS hdr,
CASE WHEN v ~ 'mingw32'
OR v ~ '64-bit' THEN
8
ELSE
4
END AS ma
FROM (
SELECT
version() AS v) AS foo) AS constants
WHERE
att.attnum > 0
AND tbl.relkind = 'r'
GROUP BY
1,
2,
3,
4,
5) AS foo) AS rs ON cc.relname = rs.relname
AND nn.nspname = rs.nspname
LEFT JOIN pg_index i ON indrelid = cc.oid
LEFT JOIN pg_class c2 ON c2.oid = i.indexrelid) AS sml;
Estimating index bloat
SQL I
-- btree index stats query
-- estimates bloat for btree indexes
WITH btree_index_atts AS (
SELECT
nspname,
indexclass.relname AS index_name,
indexclass.reltuples,
indexclass.relpages,
indrelid,
indexrelid,
indexclass.relam,
tableclass.relname AS tablename,
regexp_split_to_table(indkey::text, ' ')::smallint AS attnum,
indexrelid AS index_oid
FROM
pg_index
JOIN pg_class AS indexclass ON pg_index.indexrelid = indexclass.oid
JOIN pg_class AS tableclass ON pg_index.indrelid = tableclass.oid
JOIN pg_namespace ON pg_namespace.oid = indexclass.relnamespace
JOIN pg_am ON indexclass.relam = pg_am.oid
WHERE
pg_am.amname = 'btree'
AND indexclass.relpages > 0
AND nspname NOT IN ('pg_catalog', 'information_schema')
),
index_item_sizes AS (
SELECT
ind_atts.nspname,
ind_atts.index_name,
ind_atts.reltuples,
ind_atts.relpages,
ind_atts.relam,
indrelid AS table_oid,
index_oid,
current_setting('block_size')::numeric AS bs,
8 AS maxalign,
24 AS pagehdr,
CASE WHEN max(coalesce(pg_stats.null_frac, 0)) = 0 THEN
2
ELSE
6
END AS index_tuple_hdr,
sum((1 - coalesce(pg_stats.null_frac, 0)) * coalesce(pg_stats.avg_width, 1024)) AS nulldatawidth
FROM
pg_attribute
JOIN btree_index_atts AS ind_atts ON pg_attribute.attrelid = ind_atts.indexrelid
AND pg_attribute.attnum = ind_atts.attnum
JOIN pg_stats ON pg_stats.schemaname = ind_atts.nspname
-- stats for regular index columns
AND ((pg_stats.tablename = ind_atts.tablename
AND pg_stats.attname = pg_catalog.pg_get_indexdef(pg_attribute.attrelid, pg_attribute.attnum, TRUE))
-- stats for functional indexes
OR (pg_stats.tablename = ind_atts.index_name
AND pg_stats.attname = pg_attribute.attname))
WHERE
pg_attribute.attnum > 0
GROUP BY
1,
2,
3,
4,
5,
6,
7,
8,
9
),
index_aligned_est AS (
SELECT
maxalign,
bs,
nspname,
index_name,
reltuples,
relpages,
relam,
table_oid,
index_oid,
coalesce(ceil(reltuples * (6 + maxalign - CASE WHEN index_tuple_hdr % maxalign = 0 THEN
maxalign
ELSE
index_tuple_hdr % maxalign
END + nulldatawidth + maxalign - CASE /* Add padding to the data to align on MAXALIGN */
WHEN nulldatawidth::integer % maxalign = 0 THEN
maxalign
ELSE
nulldatawidth::integer % maxalign
END)::numeric / (bs - pagehdr::numeric) + 1), 0) AS expected
FROM
index_item_sizes
),
raw_bloat AS (
SELECT
current_database() AS dbname,
nspname,
pg_class.relname AS table_name,
index_name,
bs * (index_aligned_est.relpages)::bigint AS totalbytes,
expected,
CASE WHEN index_aligned_est.relpages <= expected THEN
0
ELSE
bs * (index_aligned_est.relpages - expected)::bigint
END AS wastedbytes,
CASE WHEN index_aligned_est.relpages <= expected THEN
0
ELSE
bs * (index_aligned_est.relpages - expected)::bigint * 100 / (bs * (index_aligned_est.relpages)::bigint)
END AS realbloat,
pg_relation_size(index_aligned_est.table_oid) AS table_bytes,
stat.idx_scan AS index_scans
FROM
index_aligned_est
JOIN pg_class ON pg_class.oid = index_aligned_est.table_oid
JOIN pg_stat_user_indexes AS stat ON index_aligned_est.index_oid = stat.indexrelid
),
format_bloat AS (
SELECT
dbname AS database_name,
nspname AS schema_name,
table_name,
index_name,
round(realbloat) AS bloat_pct,
round(wastedbytes / (1024 ^ 2)::numeric) AS bloat_mb,
round(totalbytes / (1024 ^ 2)::numeric, 3) AS index_mb,
round(table_bytes / (1024 ^ 2)::numeric, 3) AS table_mb,
index_scans
FROM
raw_bloat)
-- final query outputting the bloated indexes
-- change the where and order by to change
-- what shows up as bloated
SELECT
*
FROM
format_bloat
WHERE (bloat_pct > 50
AND bloat_mb > 10)
ORDER BY
bloat_mb DESC;
SQL II
-- WARNING: when executed with a non-superuser role, this query inspects only indexes on tables you are granted to read.
-- WARNING: rows with is_na = 't' are known to have bad statistics (the "name" type is not supported).
-- This query is compatible with PostgreSQL 8.2 and later.
SELECT
current_database(),
nspname AS schemaname,
tblname,
idxname,
bs * (relpages)::bigint AS real_size,
bs * (relpages - est_pages)::bigint AS extra_size,
100 * (relpages - est_pages)::float / relpages AS extra_pct,
fillfactor,
CASE WHEN relpages > est_pages_ff THEN
bs * (relpages - est_pages_ff)
ELSE
0
END AS bloat_size,
100 * (relpages - est_pages_ff)::float / relpages AS bloat_pct,
is_na
-- , 100-(pst).avg_leaf_density AS pst_avg_bloat, est_pages, index_tuple_hdr_bm, maxalign, pagehdr, nulldatawidth, nulldatahdrwidth, reltuples, relpages -- (DEBUG INFO)
FROM (
SELECT
coalesce(1 + ceil(reltuples / floor((bs - pageopqdata - pagehdr) / (4 + nulldatahdrwidth)::float)), 0 -- ItemIdData size + computed avg size of a tuple (nulldatahdrwidth)
) AS est_pages,
coalesce(1 + ceil(reltuples / floor((bs - pageopqdata - pagehdr) * fillfactor / (100 * (4 + nulldatahdrwidth)::float))), 0) AS est_pages_ff,
bs,
nspname,
tblname,
idxname,
relpages,
fillfactor,
is_na
-- , pgstatindex(idxoid) AS pst, index_tuple_hdr_bm, maxalign, pagehdr, nulldatawidth, nulldatahdrwidth, reltuples -- (DEBUG INFO)
FROM (
SELECT
maxalign,
bs,
nspname,
tblname,
idxname,
reltuples,
relpages,
idxoid,
fillfactor,
(index_tuple_hdr_bm + maxalign - CASE -- Add padding to the index tuple header to align on MAXALIGN
WHEN index_tuple_hdr_bm % maxalign = 0 THEN
maxalign
ELSE
index_tuple_hdr_bm % maxalign
END + nulldatawidth + maxalign - CASE -- Add padding to the data to align on MAXALIGN
WHEN nulldatawidth = 0 THEN
0
WHEN nulldatawidth::integer % maxalign = 0 THEN
maxalign
ELSE
nulldatawidth::integer % maxalign
END)::numeric AS nulldatahdrwidth,
pagehdr,
pageopqdata,
is_na
-- , index_tuple_hdr_bm, nulldatawidth -- (DEBUG INFO)
FROM (
SELECT
n.nspname,
i.tblname,
i.idxname,
i.reltuples,
i.relpages,
i.idxoid,
i.fillfactor,
current_setting('block_size')::numeric AS bs,
CASE -- MAXALIGN: 4 on 32bits, 8 on 64bits (and mingw32 ?)
WHEN version() ~ 'mingw32'
OR version() ~ '64-bit|x86_64|ppc64|ia64|amd64' THEN
8
ELSE
4
END AS maxalign,
/* per page header, fixed size: 20 for 7.X, 24 for others */
24 AS pagehdr,
/* per page btree opaque data */
16 AS pageopqdata,
/* per tuple header: add IndexAttributeBitMapData if some cols are null-able */
CASE WHEN max(coalesce(s.null_frac, 0)) = 0 THEN
2 -- IndexTupleData size
ELSE
2 + ((32 + 8 - 1) / 8) -- IndexTupleData size + IndexAttributeBitMapData size ((max num fields per index + 8 - 1) / 8)
END AS index_tuple_hdr_bm,
/* data len: we subtract the space saved by NULL values, using the null fraction from the stats */
sum((1 - coalesce(s.null_frac, 0)) * coalesce(s.avg_width, 1024)) AS nulldatawidth,
max(
CASE WHEN i.atttypid = 'pg_catalog.name'::regtype THEN
1
ELSE
0
END) > 0 AS is_na
FROM (
SELECT
ct.relname AS tblname,
ct.relnamespace,
ic.idxname,
ic.attpos,
ic.indkey,
ic.indkey[ic.attpos],
ic.reltuples,
ic.relpages,
ic.tbloid,
ic.idxoid,
ic.fillfactor,
coalesce(a1.attnum, a2.attnum) AS attnum,
coalesce(a1.attname, a2.attname) AS attname,
coalesce(a1.atttypid, a2.atttypid) AS atttypid,
CASE WHEN a1.attnum IS NULL THEN
ic.idxname
ELSE
ct.relname
END AS attrelname
FROM (
SELECT
idxname,
reltuples,
relpages,
tbloid,
idxoid,
fillfactor,
indkey,
pg_catalog.generate_series(1, indnatts) AS attpos
FROM (
SELECT
ci.relname AS idxname,
ci.reltuples,
ci.relpages,
i.indrelid AS tbloid,
i.indexrelid AS idxoid,
coalesce(substring(array_to_string(ci.reloptions, ' ')
FROM 'fillfactor=([0-9]+)')::smallint, 90) AS fillfactor,
i.indnatts,
pg_catalog.string_to_array(pg_catalog.textin(pg_catalog.int2vectorout(i.indkey)), ' ')::int[] AS indkey
FROM
pg_catalog.pg_index i
JOIN pg_catalog.pg_class ci ON ci.oid = i.indexrelid
WHERE
ci.relam = (
SELECT
oid
FROM
pg_am
WHERE
amname = 'btree')
AND ci.relpages > 0) AS idx_data) AS ic
JOIN pg_catalog.pg_class ct ON ct.oid = ic.tbloid
LEFT JOIN pg_catalog.pg_attribute a1 ON ic.indkey[ic.attpos] <> 0
AND a1.attrelid = ic.tbloid
AND a1.attnum = ic.indkey[ic.attpos]
LEFT JOIN pg_catalog.pg_attribute a2 ON ic.indkey[ic.attpos] = 0
AND a2.attrelid = ic.idxoid
AND a2.attnum = ic.attpos) i
JOIN pg_catalog.pg_namespace n ON n.oid = i.relnamespace
JOIN pg_catalog.pg_stats s ON s.schemaname = n.nspname
AND s.tablename = i.attrelname
AND s.attname = i.attname
GROUP BY
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11) AS rows_data_stats) AS rows_hdr_pdg_stats) AS relation_stats
ORDER BY
nspname,
tblname,
idxname;
Table and index sizes in the current database
SELECT *,
Pg_size_pretty(total_bytes) AS total,
Pg_size_pretty(index_bytes) AS INDEX,
Pg_size_pretty(toast_bytes) AS toast,
Pg_size_pretty(table_bytes) AS TABLE
FROM (SELECT *,
total_bytes - index_bytes - Coalesce(toast_bytes, 0) AS
table_bytes
FROM (SELECT c.oid,
nspname AS table_schema,
relname AS TABLE_NAME,
c.reltuples AS row_estimate,
Pg_total_relation_size(c.oid) AS total_bytes,
Pg_indexes_size(c.oid) AS index_bytes,
Pg_total_relation_size(reltoastrelid) AS toast_bytes
FROM pg_class c
LEFT JOIN pg_namespace n
ON n.oid = c.relnamespace
WHERE relkind = 'r') a
WHERE table_schema = 'public'
ORDER BY total_bytes DESC) a;
Other SQL resources
https://github.com/pgexperts/pgx_scripts/tree/master/bloat
https://github.com/pgexperts/pgx_scripts/blob/master/bloat/no_stats_table_check.sql
https://github.com/ioguix/pgsql-bloat-estimation/tree/master/table
https://github.com/ioguix/pgsql-bloat-estimation/tree/master/btree
https://wiki.postgresql.org/wiki/Show_database_bloat
Estimating bloat yourself
typedef struct HeapTupleFields
{
    TransactionId t_xmin;       /* inserting xact ID */
    TransactionId t_xmax;       /* deleting or locking xact ID */
    union
    {
        CommandId t_cid;        /* inserting or deleting command ID, or both */
        TransactionId t_xvac;   /* old-style VACUUM FULL xact ID */
    } t_field3;
} HeapTupleFields;

typedef struct DatumTupleFields
{
    int32 datum_len_;           /* varlena header (do not touch directly!) */
    int32 datum_typmod;         /* -1, or identifier of a record type */
    Oid datum_typeid;           /* composite type OID, or RECORDOID */
    /*
     * Note: field ordering is chosen with thought that Oid might someday
     * widen to 64 bits.
     */
} DatumTupleFields;

struct HeapTupleHeaderData
{
    union
    {
        HeapTupleFields t_heap;
        DatumTupleFields t_datum;
    } t_choice;
    ItemPointerData t_ctid;     /* current TID of this or newer tuple */
    /* Fields below here must match MinimalTupleData! */
    uint16 t_infomask2;         /* number of attributes + various flags */
    uint16 t_infomask;          /* various flag bits, see below */
    uint8 t_hoff;               /* sizeof header incl. bitmap, padding */
    /* ^ - 23 bytes - ^ */
    bits8 t_bits[1];            /* bitmap of NULLs -- VARIABLE LENGTH */
    /* MORE DATA FOLLOWS AT END OF STRUCT */
};

typedef HeapTupleHeaderData *HeapTupleHeader;

The tuple header consists of a fixed 23-byte prefix plus an optional NULL bitmap.
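The header arithmetic above can be sketched in a few lines (a toy calculation, not PostgreSQL source; the function names are mine):

```python
# Rough sketch: estimate the on-disk size of a heap tuple header, mirroring
# the 23-byte fixed prefix plus the optional NULL bitmap described above.
# MAXALIGN is 8 on 64-bit platforms.
MAXALIGN = 8
FIXED_HDR = 23  # fixed part of HeapTupleHeaderData, up to t_hoff

def maxalign(n: int) -> int:
    """Round n up to the next multiple of MAXALIGN."""
    return -(-n // MAXALIGN) * MAXALIGN

def heap_tuple_header_size(natts: int, has_nulls: bool) -> int:
    """Aligned header size: 23 bytes, plus ceil(natts/8) bitmap bytes if any
    attribute is NULL, all padded to a MAXALIGN boundary (stored in t_hoff)."""
    size = FIXED_HDR
    if has_nulls:
        size += (natts + 7) // 8  # one bit per attribute
    return maxalign(size)

print(heap_tuple_header_size(3, False))  # no bitmap: 23 padded to 24
print(heap_tuple_header_size(10, True))  # 23 + 2 bitmap bytes, padded to 32
```

This is the same padding logic the estimation queries above encode with their `% ma` CASE expressions.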
PostgreSQL maintains extensive statistics for every relation, and with them the bloat rate of all tables in a database can be estimated quickly and efficiently. The estimate uses statistics on the table and its columns; three metrics are consumed directly:
Average tuple width avgwidth: derived from column-level statistics, used to estimate the space a fully compacted table would need.
Tuple count pg_class.reltuples: also used to estimate the compacted size.
Page count pg_class.relpages: used to measure the space actually in use.
The formula itself is simple: 1 - (reltuples * avgwidth) / (block_size - pageheader) / relpages
Here block_size is the page size (8192 by default) and pageheader is the page header size (24 bytes by default). Page size minus header size is the space actually available for storing tuples, so reltuples * avgwidth gives the estimated total tuple size, and dividing it by the usable space per page gives the number of pages needed to store all tuples compactly. Finally, the expected page count divided by the actual page count is the utilization, and one minus the utilization is the bloat rate.
pg_bloat_check
pg_bloat_check is a script that provides a bloat report for PostgreSQL tables and/or indexes. It requires at least Python 2.6 and the pgstattuple extension.
Because pg_bloat_check scans entire tables, it is more accurate than estimates based on pg_stat_all_tables, but it is slow and puts significant load on the system, so it is not recommended as a routine tool.
[postgres@xiongcc pg_bloat_check]$ ./pg_bloat_check.py -c "host=localhost" -n public
1. public.idx_t2 (i) ...........................................................(89.85%) 19 MB wasted
2. pg_catalog.pg_statistic (t) ...............................................(94.98%) 7256 kB wasted
3. public.test (t) ...........................................................(37.61%) 3255 kB wasted
4. pg_catalog.pg_attribute (t) ...............................................(74.64%) 1785 kB wasted
5. public.test_bloat (t) .....................................................(90.65%) 1008 kB wasted
6. pg_toast.pg_toast_49444 (t) ................................................(49.74%) 955 kB wasted
Real table: public.blog
7. pg_toast.pg_toast_2619 (t) .................................................(96.39%) 694 kB wasted
Real table: pg_catalog.pg_statistic
8. pg_catalog.pg_class (t) ....................................................(84.43%) 689 kB wasted
9. pg_catalog.pg_constraint (t) ...............................................(97.88%) 431 kB wasted
10. pg_catalog.pg_attrdef (t) .................................................(97.63%) 430 kB wasted
11. pg_catalog.pg_depend (t) ..................................................(38.46%) 342 kB wasted
12. pg_catalog.pg_depend_reference_index (i) ..................................(44.77%) 279 kB wasted
13. public.t_idx (i) ...........................................................(5.98%) 263 kB wasted
14. pg_catalog.pg_attribute_relid_attnam_index (i) ............................(51.57%) 239 kB wasted
15. public.t_a_idx (i) .........................................................(9.57%) 237 kB wasted
16. public.bookings_pkey2 (i) ..................................................(9.57%) 237 kB wasted
17. public.bookings_pkey3 (i) ..................................................(9.57%) 237 kB wasted
18. pg_catalog.pg_type (t) ....................................................(68.10%) 223 kB wasted
pg_freespacemap
testdb=# CREATE EXTENSION pg_freespacemap;
CREATE EXTENSION
testdb=# SELECT count(*) as "number of pages",
pg_size_pretty(cast(avg(avail) as bigint)) as "Av. freespace size",
round(100 * avg(avail)/8192 ,2) as "Av. freespace ratio"
FROM pg_freespace('accounts');
number of pages | Av. freespace size | Av. freespace ratio
-----------------+--------------------+---------------------
1640 | 99 bytes | 1.21
(1 row)
testdb=# DELETE FROM accounts WHERE aid %10 != 0 OR aid < 100;
DELETE 90009
testdb=# VACUUM accounts;
VACUUM
testdb=# SELECT count(*) as "number of pages",
pg_size_pretty(cast(avg(avail) as bigint)) as "Av. freespace size",
round(100 * avg(avail)/8192 ,2) as "Av. freespace ratio"
FROM pg_freespace('accounts');
number of pages | Av. freespace size | Av. freespace ratio
-----------------+--------------------+---------------------
1640 | 7124 bytes | 86.97
(1 row)
testdb=# SELECT *, round(100 * avail/8192 ,2) as "freespace ratio"
FROM pg_freespace('accounts');
blkno | avail | freespace ratio
-------+-------+-----------------
0 | 7904 | 96.00
1 | 7520 | 91.00
2 | 7136 | 87.00
3 | 7136 | 87.00
4 | 7136 | 87.00
5 | 7136 | 87.00
....
testdb=# VACUUM FULL accounts;
VACUUM
testdb=# SELECT count(*) as "number of pages",
pg_size_pretty(cast(avg(avail) as bigint)) as "Av. freespace size",
round(100 * avg(avail)/8192 ,2) as "Av. freespace ratio"
FROM pg_freespace('accounts');
number of pages | Av. freespace size | Av. freespace ratio
-----------------+--------------------+---------------------
164 | 0 bytes | 0.00
(1 row)
pgstattuple
The tuple_percent column is the percentage of the relation's total size occupied by live tuple bytes; one minus that value gives the bloat rate.
postgres=# select * from pgstattuple('test_bloat');
-[ RECORD 1 ]------+--------
table_len | 1138688
tuple_count | 2
tuple_len | 68
tuple_percent | 0.01
dead_tuple_count | 0
dead_tuple_len | 0
dead_tuple_percent | 0
free_space | 1032268
free_percent | 90.65
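The percentages in this output follow directly from the raw byte counts; a quick cross-check using the figures shown above:

```python
# Recompute the pgstattuple percentages shown above from the raw byte counts.
table_len = 1_138_688    # total relation size in bytes
tuple_len = 68           # live tuple bytes
free_space = 1_032_268   # reusable free space in bytes

tuple_percent = round(100 * tuple_len / table_len, 2)
free_percent = round(100 * free_space / table_len, 2)
bloat = round(100 - tuple_percent, 2)  # "1 - tuple_percent" as a bloat estimate

print(tuple_percent, free_percent, bloat)  # 0.01 90.65 99.99
```

The gap between free_percent (90.65) and bloat (99.99) is page and tuple header overhead plus unaccounted alignment padding.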

check_postgres
check_postgres (https://bucardo.org/check_postgres/) is a script for monitoring various attributes of your database. It is designed to work with Nagios, MRTG, or in standalone scripts; in short, a health-check script for PostgreSQL.
Example 1: Warn if any table on port 5432 is over 100 MB bloated, and critical if over 200 MB
check_postgres_bloat --port=5432 --warning='100 M' --critical='200 M'
Example 2: Give a critical if table 'orders' on host 'sami' has more than 10 megs of bloat
check_postgres_bloat --host=sami --include=orders --critical='10 MB'
Example 3: Give a critical if table 'q4' on database 'sales' is over 50% bloated
check_postgres_bloat --db=sales --include=q4 --critical='50%'
Example 4: Give a critical if any table is over 20% bloated and has over 150 MB of bloat:
check_postgres_bloat --port=5432 --critical='20% and 150 M'
Example 5: Give a critical if any table is over 40% bloated or has over 500 MB of bloat:
check_postgres_bloat --port=5432 --warning='500 M or 40%'
$ ./check_postgres.pl --action bloat -db pgbench -v -v -v
$ psql -d pgbench -x
pgbench=# SELECT
current_database(), schemaname, tablename,
reltuples::bigint, relpages::bigint, otta
...
iname | accounts_pkey
ituples | 1250000
ipages | 5489
iotta | 16673
ibloat | 0.3
wastedipages | 0
wastedibytes | 0
wastedisize | 0 bytes
How to deal with table bloat
vacuum full
vacuum full, cluster, and other table-rewriting commands all take the level-8 ACCESS EXCLUSIVE lock, which blocks all access. Rewriting the table means its physical location on disk changes, and up to twice the table's disk space may be needed while the rewrite runs; also be aware of the side effects that cluster can bring.


pg_repack
pg_repack (http://reorg.github.io/pg_repack/) builds a copy of the table being rebuilt. It first takes a full snapshot and writes all live tuples into the new table, while a trigger forwards every change made to the original table to the copy; finally it swaps in the compacted copy via a rename. Indexes are handled through PostgreSQL's CREATE/DROP INDEX CONCURRENTLY.
Caveats:
Cancel any running VACUUM jobs before a repack starts.
Before rebuilding an index, cancel any queries that may be using it.
If something goes wrong (for example a forced exit midway), leftover garbage may remain and must be cleaned up by hand.
The rename swap at the end generates a large burst of WAL, which can cause replication lag and cannot be cancelled.
Repacking a very large table requires free disk space at least equal to the table plus its indexes; check this manually and be careful.
pg_squeeze
pg_squeeze (https://github.com/cybertec-postgresql/pg_squeeze) rebuilds tables incrementally using WAL and logical replication, so no triggers are needed, but the table must have a primary key or unique key. pg_squeeze supports automatic rebuilds: it compares user tables against configured thresholds and, when they are exceeded, starts a worker that copies the data into a new table, then takes a lock and switches the relfilenode. It also has a built-in job scheduler, so it can be told to squeeze a table within a certain time window or to trigger a rebuild when certain thresholds are reached.
pgcompacttable
pgcompacttable (https://github.com/dataegret/pgcompacttable) performs well, needs no space reserved for a table copy (though space is still needed to rebuild indexes), and keeps client query performance acceptable while it compacts the table.
Why plain vacuum can also shrink a table
By default, garbage pages at the end of a table can be truncated, returning their space to the filesystem. For example, if the last 100 blocks of a table contain nothing but dead tuples, those 100 blocks can be truncated and the file shrinks. Garbage pages in other positions cannot be reclaimed this way, because moving tuples would change their addresses (ctid); truncating all-dead pages at the end affects no addressing. Starting with v12 this truncation can be controlled: if the table parameter vacuum_truncate is set to false, trailing garbage pages are not reclaimed: alter table test set (vacuum_truncate = off); This is useful, for example, for tables that are about to receive large volumes of new data: there is no point reclaiming the space only to extend the file again, and extending blocks also takes a lock. Moreover, shrinking the file requires an ACCESS EXCLUSIVE lock on the table, so if you want to avoid that lock conflict and the table will keep receiving new writes (which can fill the freed space), set vacuum_truncate to false. For the impact of such lock conflicts, see the earlier article《主从之间延迟过大如何优化》, which recounts a genuinely painful experience.
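The truncation rule above (only the trailing run of all-dead pages can be returned to the filesystem) can be modeled as a toy function; this is illustrative only, not PostgreSQL code:

```python
# Toy model of vacuum's truncation rule: only the run of all-dead pages at
# the *end* of the file can be returned to the filesystem; dead pages in the
# middle stay put (their space is merely recorded in the FSM for reuse).
def pages_after_truncate(pages: list) -> int:
    """pages: 'live' or 'dead' per block; returns the block count after vacuum."""
    n = len(pages)
    while n > 0 and pages[n - 1] == "dead":
        n -= 1
    return n

print(pages_after_truncate(["live", "dead", "dead"]))          # -> 1
print(pages_after_truncate(["dead", "live", "dead", "dead"]))  # -> 2
print(pages_after_truncate(["dead"] * 5))                      # -> 0
```

The second case shows why a table can stay large even when mostly dead: one live tuple near the end pins every page before it.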
postgres=# create table test_bloat(id int,info text);
CREATE TABLE
postgres=# create index t_ix on test_bloat using btree (id);
CREATE INDEX
postgres=# insert into test_bloat select n,'test'||n from generate_series(1,100000) as n;
INSERT 0 100000
postgres=# \dt+ test_bloat
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+------------+-------+----------+-------------+---------+-------------
public | test_bloat | table | postgres | permanent | 4360 kB |
(1 row)
postgres=# \di+ t_ix
List of relations
Schema | Name | Type | Owner | Table | Persistence | Size | Description
--------+------+-------+----------+------------+-------------+---------+-------------
public | t_ix | index | postgres | test_bloat | permanent | 2208 kB |
(1 row)
postgres=# delete from test_bloat ;
DELETE 100000
postgres=# vacuum test_bloat ;
VACUUM
postgres=# \dt+ test_bloat
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+------------+-------+----------+-------------+-------+-------------
public | test_bloat | table | postgres | permanent | 24 kB |
(1 row)
postgres=# \di+ t_ix
List of relations
Schema | Name | Type | Owner | Table | Persistence | Size | Description
--------+------+-------+----------+------------+-------------+---------+-------------
public | t_ix | index | postgres | test_bloat | permanent | 2208 kB |
(1 row)
As you can see, the table shrank to 24 kB because the trailing blocks were returned to the operating system, but the index is still 2208 kB.
postgres=# alter table test_bloat set (vacuum_truncate = off);
ALTER TABLE
postgres=# insert into test_bloat select n,'test'||n from generate_series(1,100000) as n;
INSERT 0 100000
postgres=# \dt+ test_bloat
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+------------+-------+----------+-------------+---------+-------------
public | test_bloat | table | postgres | permanent | 4360 kB |
(1 row)
postgres=# \di+ t_ix
List of relations
Schema | Name | Type | Owner | Table | Persistence | Size | Description
--------+------+-------+----------+------------+-------------+---------+-------------
public | t_ix | index | postgres | test_bloat | permanent | 4392 kB |
(1 row)
postgres=# delete from test_bloat ;
DELETE 100000
postgres=# \dt+ test_bloat
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+------------+-------+----------+-------------+---------+-------------
public | test_bloat | table | postgres | permanent | 4360 kB |
(1 row)
postgres=# \di+ t_ix
List of relations
Schema | Name | Type | Owner | Table | Persistence | Size | Description
--------+------+-------+----------+------------+-------------+---------+-------------
public | t_ix | index | postgres | test_bloat | permanent | 4392 kB |
(1 row)
Why plain vacuum cannot shrink an index
Continuing the example above: why can't the index be shrunk? Index pages are reused differently from heap pages. Because index content is ordered, only items that fit the ordering can be inserted into a given page, whereas a heap tuple can go into any page with free space. An index page, wherever it sits, can never be removed from the file, so once an index grows it cannot shrink again except via vacuum full. A bloated index therefore usually needs to be rebuilt to reduce its size.
From PG 12 on, REINDEX CONCURRENTLY is supported. Plain REINDEX takes an ACCESS EXCLUSIVE lock and blocks everything; REINDEX CONCURRENTLY takes only a SHARE UPDATE EXCLUSIVE lock and does not block reads or writes.
Before PG 12, you can CREATE INDEX CONCURRENTLY a new index and then drop the old one, but note that a failed build leaves an invalid index behind.
pg_repack, pg_squeeze, and similar tools also work.
Index bloat
Indexes behave much like tables: with continuous inserts, updates, and deletes they bloat too, and for B-tree indexes page splits and merges leave holes in index pages. Also, as noted above, index pages are reused differently from heap pages: since index content is ordered, only items that fit the ordering can go into a given page, unlike heap tuples, which fit anywhere with free space. An index page can never be removed from the file regardless of its position, so an index that has grown cannot shrink again except via vacuum full; a bloated index usually has to be rebuilt to reduce its size.
https://www.postgresql.org/docs/14/routine-reindex.html describes when index pages are reclaimed: only when they are completely empty.
In some situations it is worthwhile to rebuild indexes periodically with the REINDEX command or a series of individual rebuilding steps.
B-tree index pages that have become completely empty are reclaimed for re-use. However, there is still a possibility of inefficient use of space: if all but a few index keys on a page have been deleted, the page remains allocated. Therefore, a usage pattern in which most, but not all, keys in each range are eventually deleted will see poor use of space. For such usage patterns, periodic reindexing is recommended.
The potential for bloat in non-B-tree indexes has not been well researched. It is a good idea to periodically monitor the index's physical size when using any non-B-tree index type.
Also, for B-tree indexes, a freshly-constructed index is slightly faster to access than one that has been updated many times because logically adjacent pages are usually also physically adjacent in a newly built index. (This consideration does not apply to non-B-tree indexes.) It might be worthwhile to reindex periodically just to improve access speed.
REINDEX can be used safely and easily in all cases. This command requires an ACCESS EXCLUSIVE lock by default, hence it is often preferable to execute it with its CONCURRENTLY option, which requires only a SHARE UPDATE EXCLUSIVE lock.
Note that even deleting index pages is a rather involved mechanism; see the README in src/backend/access/nbtree for details. In short:
A page can only be deleted once it is completely empty.
Page deletion always begins from the leaf page.
An internal page can only be deleted as part of deleting an entire subtree.
Reclaiming a page does not actually change its state on disk; it is simply recorded in the shared-memory FSM.
We consider deleting an entire page from the btree only when it's become
completely empty of items. (Merging partly-full pages would allow better
space reuse, but it seems impractical to move existing data items left or
right to make this happen --- a scan moving in the opposite direction
might miss the items if so.) Also, we *never* delete the rightmost page
on a tree level (this restriction simplifies the traversal algorithms, as
explained below). Page deletion always begins from an empty leaf page. An
internal page can only be deleted as part of deleting an entire subtree.
This is always a "skinny" subtree consisting of a "chain" of internal pages
plus a single leaf page. There is one page on each level of the subtree,
and each level/page covers the same key space.
A deleted page can only be reclaimed once there is no scan or search that
has a reference to it; until then, it must stay in place with its
right-link undisturbed. We implement this by waiting until all active
snapshots and registered snapshots as of the deletion are gone; which is
overly strong, but is simple to implement within Postgres. When marked
dead, a deleted page is labeled with the next-transaction counter value.
VACUUM can reclaim the page for re-use when this transaction number is
older than RecentGlobalXmin. As collateral damage, this implementation
also waits for running XIDs with no snapshots and for snapshots taken
until the next transaction to allocate an XID commits.
Reclaiming a page doesn't actually change its state on disk --- we simply
record it in the shared-memory free space map, from which it will be
handed out the next time a new page is needed for a page split. The
deleted page's contents will be overwritten by the split operation.
(Note: if we find a deleted page with an extremely old transaction
number, it'd be worthwhile to re-mark it with FrozenTransactionId so that
a later xid wraparound can't cause us to think the page is unreclaimable.
But in more normal situations this would be a waste of a disk write.)
Because we never delete the rightmost page of any level (and in particular
never delete the root), it's impossible for the height of the tree to
decrease. After massive deletions we might have a scenario in which the
tree is "skinny", with several single-page levels below the root.
Operations will still be correct in this case, but we'd waste cycles
descending through the single-page levels. To handle this we use an idea
from Lanin and Shasha: we keep track of the "fast root" level, which is
the lowest single-page level. The meta-data page keeps a pointer to this
level as well as the true root. All ordinary operations initiate their
searches at the fast root not the true root. When we split a page that is
alone on its level or delete the next-to-last page on a level (both cases
are easily detected), we have to make sure that the fast root pointer is
adjusted appropriately. In the split case, we do this work as part of the
atomic update for the insertion into the parent level; in the delete case
as part of the atomic update for the delete (either way, the metapage has
to be the last page locked in the update to avoid deadlock risks). This
avoids race conditions if two such operations are executing concurrently.
Causes of index bloat:
After massive deletions, index pages become sparse, reducing index efficiency.
Before PostgreSQL 9.0, vacuum full itself also left index pages sparse:
In earlier versions, VACUUM FULL compacts tables by moving their rows to earlier positions in the table. The documentation on VACUUM FULL describes how that creates index bloat: "Moving a row requires transiently making duplicate index entries for it (the entry pointing to its new location must be made before the old entry can be removed); so moving a lot of rows this way causes severe index bloat."
Long-running transactions prevent vacuum from cleaning up the table, so the sparse state persists.
Out-of-order writes to indexed columns cause frequent page splits, leaving index pages less than fully packed; that, too, is bloat.
Fortunately, v12 added REINDEX CONCURRENTLY:
From PG 12 on, REINDEX CONCURRENTLY is available. Plain REINDEX takes an ACCESS EXCLUSIVE lock and blocks everything; REINDEX CONCURRENTLY takes only a SHARE UPDATE EXCLUSIVE lock and does not block reads or writes.
Before PG 12, you can CREATE INDEX CONCURRENTLY a new index and then drop the old one, but note that a failed build leaves an invalid index behind.
pg_repack, pg_squeeze, and similar tools also work.
See also the wiki: https://wiki.postgresql.org/wiki/Index_Maintenance

Btree
Anyone who has worked with MySQL knows about index fragmentation; PostgreSQL is similar. In some scenarios users find that after rebuilding an index, it is smaller than before. This usually happens when indexed columns are written out of order, causing frequent page splits, so index pages are never 100% full: low density, hence bloat. Because data inside an index page is ordered, out-of-order writes can split pages, and the more splits, the more holes (pages left partly empty).
Create the index first, then insert in random order:
postgres=# create table t_idx_split(id int);
CREATE TABLE
postgres=# create index idx_t_idx_split on t_idx_split (id);
CREATE INDEX
postgres=# insert into t_idx_split select random()*10000000 from generate_series(1,10000000);
INSERT 0 10000000
postgres=# \di+ idx_t_idx_split
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+-----------------+-------+----------+-------------+--------+-------------
public | idx_t_idx_split | index | postgres | t_idx_split | 280 MB |
(1 row)
Create the index first, then insert in order:
postgres=# truncate t_idx_split ;
TRUNCATE TABLE
postgres=# \di+ idx_t_idx_split
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+-----------------+-------+----------+-------------+------------+-------------
public | idx_t_idx_split | index | postgres | t_idx_split | 8192 bytes |
(1 row)
postgres=# insert into t_idx_split select generate_series(1,10000000);
INSERT 0 10000000
postgres=# \di+ idx_t_idx_split
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+-----------------+-------+----------+-------------+--------+-------------
public | idx_t_idx_split | index | postgres | t_idx_split | 214 MB |
(1 row)
先写入,后建索引
postgres=# drop index idx_t_idx_split ;
DROP INDEX
postgres=# create index idx_t_idx_split on t_idx_split (id);
CREATE INDEX
postgres=# \di+ idx_t_idx_split
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+-----------------+-------+----------+-------------+--------+-------------
public | idx_t_idx_split | index | postgres | t_idx_split | 214 MB |
(1 row)
另外就是运行久了,不断的增删改,也会导致许多碎片
postgres=# create table test(id int);
CREATE TABLE
postgres=# insert into test values(generate_series(1,10000000));
INSERT 0 10000000
postgres=# create index idx_fragmented on test(id);
CREATE INDEX
postgres=# \x
Expanded display is on.
postgres=# SELECT * FROM pgstatindex('idx_fragmented');
-[ RECORD 1 ]------+----------
version | 4
tree_level | 2
index_size | 224641024
root_block_no | 290
internal_pages | 98
leaf_pages | 27323
empty_pages | 0
deleted_pages | 0
avg_leaf_density | 90.09
leaf_fragmentation | 0
初始时leaf_fragmentation为0。接下来大量插入并更新,leaf_fragmentation的碎片率会上升到37.18%:
postgres=# insert into test values(generate_series(1,10000000));
INSERT 0 10000000
postgres=# update test set id =99 where id > 3000000;
UPDATE 14000000
postgres=# SELECT * FROM pgstatindex('idx_fragmented');
-[ RECORD 1 ]------+----------
version | 4
tree_level | 2
index_size | 539639808
root_block_no | 290
internal_pages | 189
leaf_pages | 42739
empty_pages | 0
deleted_pages | 22945
avg_leaf_density | 73.1
leaf_fragmentation | 37.18
reindex之后,即可收缩
postgres=# \di+ idx_fragmented
List of relations
Schema | Name | Type | Owner | Table | Persistence | Size | Description
--------+----------------+-------+----------+-------+-------------+--------+-------------
public | idx_fragmented | index | postgres | test | permanent | 515 MB |
(1 row)
postgres=# reindex index idx_fragmented;
REINDEX
postgres=# \di+ idx_fragmented
List of relations
Schema | Name | Type | Owner | Table | Persistence | Size | Description
--------+----------------+-------+----------+-------+-------------+--------+-------------
public | idx_fragmented | index | postgres | test | permanent | 208 MB |
(1 row)
postgres=# \x
Expanded display is on.
postgres=# SELECT * FROM pgstatindex('idx_fragmented');
-[ RECORD 1 ]------+----------
version | 4
tree_level | 2
index_size | 218529792
root_block_no | 209
internal_pages | 112
leaf_pages | 26563
empty_pages | 0
deleted_pages | 0
avg_leaf_density | 89.98
leaf_fragmentation | 0
GIN
在GIN索引中,如果TID列表非常小,它可以与键值放在同一个页面中,被称为posting list;但如果这个列表很大,就需要一个更高效的数据结构,没错,也是Btree。这样的树位于单独的数据页上,被称为posting tree。所以GIN索引也会膨胀,而且实际上膨胀现象更加明显:GIN通常是对多值类型建的索引,而多值类型的输入顺序更加无法保证,因此GIN的主树索引页往往膨胀得比较厉害。
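可以借助pgstattuple扩展里的pgstatginindex()观察GIN索引pending list的堆积情况,必要时手动合并(示意,索引名idx_gin为假设):

```sql
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- 查看GIN索引pending list的页数与元组数
SELECT * FROM pgstatginindex('idx_gin');

-- 手动将pending list合并进主树(PG9.6+)
SELECT gin_clean_pending_list('idx_gin');
```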

GIST
GIST常用于实现Rtree和RD-tree,假如插入更新的空间数据无序,同样会导致膨胀。

HOT
我们知道,HOT(Heap Only Tuple)是为了减少不必要的索引IO。PostgreSQL默认的存储引擎在更新记录时,会在堆内产生一条新版本元组,旧版本在不再被需要后由vacuum回收;回收旧版本前,需要先回收所有指向这个版本的索引条目。
v8.3以前,每个元组版本都有对应的索引条目,因此更新的写放大比较大。
v8.3开始引入了HOT:更新记录时,如果满足两个条件(新旧元组在同一个页面内,且没有更新任何索引列),就通过heap page内部的link把所有元组版本串起来,索引条目保持不变。

当然,HOT不是万能的,也有限制:
当更新后的元组无法放入老元组所在的页面时,由于新旧元组链不能跨越页面,指向新元组的索引条目仍会被添加到索引页面中。所以可以适当调低表的fillfactor,为HOT更新预留页内空间。
当更新涉及索引列时,原有索引条目中的key无法再定位到正确元组,此时也会在索引页面中插入一条新的索引元组。
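可以通过pg_stat_user_tables观察HOT更新的占比,以此评估是否需要调整fillfactor(示意,表名test沿用前文示例):

```sql
-- n_tup_hot_upd / n_tup_upd 越接近1,说明HOT更新占比越高
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(n_tup_hot_upd::numeric / nullif(n_tup_upd, 0), 2) AS hot_ratio
FROM pg_stat_user_tables
WHERE relname = 'test';

-- 为HOT更新预留页内空间
ALTER TABLE test SET (fillfactor = 90);
```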

系统表膨胀
用户大量使用临时表并频繁创建时,系统表会产生大量dead tuple。PG的临时表是会话级的,需要随用随建,每个会话都要自己建,而且每个临时表都会在pg_class、pg_attribute等元表中留下痕迹,用完后还需要从元表中delete这些元数据,因此pg_attribute、pg_rewrite、pg_class会出现大量的dead tuple。如果业务同时还有长事务在数据库中长期运行,dead tuple无法被及时回收,这就是系统表膨胀的关键点。
postgres=# create temp table mytemp(id int);
CREATE TABLE
postgres=# select count(*) from pg_attribute where attrelid = 'mytemp'::regclass;
count
-------
7
(1 row)
postgres=# select count(*) from pg_class where relname = 'mytemp';
count
-------
1
(1 row)
postgres=# \d mytemp
Table "pg_temp_4.mytemp"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
id | integer | | |
postgres=# \q
[postgres@xiongcc pg_bloat_check]$ psql
psql (13.2)
Type "help" for help.
postgres=# \d mytemp
Did not find any relation named "mytemp".
假如是on commit drop的临时表,更严重,事务提交,临时表就没了,但会在catalog里面留下痕迹
postgres=# begin;
BEGIN
postgres=*# create temp table mytemp2(id int) on commit drop;
CREATE TABLE
postgres=*# select count(*) from pg_attribute where attrelid = 'mytemp2'::regclass;
count
-------
7
(1 row)
postgres=*# select count(*) from pg_class where relname = 'mytemp2';
count
-------
1
(1 row)
postgres=*# commit ;
COMMIT
postgres=# select count(*) from pg_class where relname = 'mytemp2';
count
-------
0
(1 row)
postgres=# select count(*) from pg_attribute where attrelid = 'mytemp2'::regclass;
ERROR: relation "mytemp2" does not exist
LINE 1: ...elect count(*) from pg_attribute where attrelid = 'mytemp2':...
使用pgbench压测一下
[postgres@xiongcc ~]$ cat temp.sql
begin;
create temp table mytemp2(id int) on commit drop;
commit;
[postgres@xiongcc ~]$ pgbench -f temp.sql -T 3000 -c 100 -j 10
压测前:
postgres=# \dt+ pg_class
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
------------+----------+-------+----------+-------------+--------+-------------
pg_catalog | pg_class | table | postgres | permanent | 144 kB |
(1 row)
压测一段时间后,pg_class明显膨胀:
postgres=# \dt+ pg_class
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
------------+----------+-------+----------+-------------+---------+-------------
pg_catalog | pg_class | table | postgres | permanent | 5096 kB |
(1 row)
并且系统表dead tuple的回收同样受OldestXmin制约,复制槽、两阶段提交(2PC)事务等都会阻止OldestXmin推进
postgres=# select slot_name,plugin,database,xmin,catalog_xmin,active from pg_replication_slots ;
slot_name | plugin | database | xmin | catalog_xmin | active
-----------+----------+----------+------+--------------+--------
myslot | wal2json | postgres | | 7558248 | f
(1 row)
postgres=# begin;
BEGIN
postgres=*# insert into test_bloat values(1,'test');
INSERT 0 1
postgres=*# prepare transaction 't1';
PREPARE TRANSACTION
postgres=# select * from pg_prepared_xacts ;
transaction | gid | prepared | owner | database
-------------+-----+-------------------------------+----------+----------
7560548 | t1 | 2021-07-23 22:16:59.381125+08 | postgres | postgres
(1 row)
postgres=# vacuum verbose pg_class;
INFO: vacuuming "pg_catalog.pg_class"
INFO: "pg_class": found 0 removable, 6383 nonremovable row versions in 291 out of 291 pages
DETAIL: 5861 dead row versions cannot be removed yet, oldest xmin: 7558248
There were 11 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM
postgres=# select pg_drop_replication_slot('myslot');
pg_drop_replication_slot
--------------------------
(1 row)
postgres=# vacuum verbose pg_class;
INFO: vacuuming "pg_catalog.pg_class"
INFO: "pg_class": found 0 removable, 6383 nonremovable row versions in 291 out of 291 pages
DETAIL: 5861 dead row versions cannot be removed yet, oldest xmin: 7560548
There were 11 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM
postgres=# rollback prepared 't1';
ROLLBACK PREPARED
postgres=# vacuum verbose pg_class;
INFO: vacuuming "pg_catalog.pg_class"
INFO: scanned index "pg_class_oid_index" to remove 5861 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: scanned index "pg_class_relname_nsp_index" to remove 5861 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: scanned index "pg_class_tblspc_relfilenode_index" to remove 5861 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: "pg_class": removed 5861 row versions in 278 pages
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: index "pg_class_oid_index" now contains 522 row versions in 31 pages
DETAIL: 5861 index row versions were removed.
26 index pages have been deleted, 11 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: index "pg_class_relname_nsp_index" now contains 522 row versions in 57 pages
DETAIL: 5861 index row versions were removed.
50 index pages have been deleted, 21 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: index "pg_class_tblspc_relfilenode_index" now contains 522 row versions in 32 pages
DETAIL: 5861 index row versions were removed.
26 index pages have been deleted, 11 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "pg_class": found 3013 removable, 522 nonremovable row versions in 291 out of 291 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 7566425
There were 11 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "pg_class": truncated 291 to 14 pages
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM
postgres=# begin;
BEGIN
postgres=*# select txid_current();
txid_current
--------------
7566426
(1 row)
另外还有一个有趣的case,先创建XA事务,再创建复制槽:
postgres=# begin;
BEGIN
postgres=*# insert into test_bloat values(1,'test');
INSERT 0 1
postgres=*# prepare transaction 't1';
PREPARE TRANSACTION
postgres=# select pg_create_logical_replication_slot('myslot','wal2json');
---此处卡住
查看阻塞,pg_blocking_pids返回{0},0表示阻塞者不是普通后端进程,正是前面创建的预备事务
postgres=# select pg_blocking_pids(pid),pid,query from pg_stat_activity where pid = 16512;
pg_blocking_pids | pid | query
------------------+-------+-----------------------------------------------------------------
{0} | 16512 | select pg_create_logical_replication_slot('myslot','wal2json');
(1 row)
WAL日志膨胀
哪些WAL不能被回收或不能被重复利用?
从最后一次已正常结束的检查点(检查点开始时刻, 不是结束时刻)开始,所有的REDO文件都不能被回收
归档开启后,所有未归档的REDO。(.ready对应的redo文件)
启用SLOT后,还没有被SLOT消费的REDO文件
设置wal_keep_segments时,最近的wal_keep_segments个WAL文件都会被保留。
什么时候可能膨胀?
archive failed,归档失败,.ready对应的WAL文件无法被回收
user defined archive BUG,用户自定义的归档脚本有BUG,没有正常地将.ready改成.done,使得WAL堆积
wal_keep_segments设置太大,WAL保留过多
max_wal_size设置太大,并且checkpoint_completion_target设置太大,导致检查点跨度很大,保留的WAL文件很多
slot slow(dead),包括physical | logical replication,从restart_lsn开始的所有WAL文件都要被保留
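排查WAL堆积时,可以先查看复制槽的落后程度和归档状态(示意,PG10+可用pg_wal_lsn_diff计算落后字节数):

```sql
-- 复制槽落后多少字节:落后越多,被迫保留的WAL越多
SELECT slot_name, active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;

-- 归档是否在持续失败
SELECT archived_count, failed_count, last_failed_wal
FROM pg_stat_archiver;
```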
最佳实践
一定要开启autovacuum。
提高系统的IO能力,越高越好。
调整触发阈值,让触发阈值和记录数匹配。调小autovacuum_vacuum_scale_factor和autovacuum_analyze_scale_factor。比如我想在有1万条垃圾记录后就触发垃圾回收,那么对于一个1000万的表来说,我应该把autovacuum_vacuum_scale_factor调到千分之一即0.001,而autovacuum_analyze_scale_factor应该调到0.0005。
增加autovacuum_max_workers,同时增加autovacuum_work_mem,同时增加系统内存。例如对于有大量表需要频繁更新的数据库集群,可以将autovacuum_max_workers调整为与CPU核数一致,并将autovacuum_work_mem调整为2GB,同时需要确保系统预留的内存大于autovacuum_max_workers*autovacuum_work_mem。
应用程序设计时,尽量避免如下:
LONG SQL(包括查,增,删,改,DDL所有的SQL),
或者打开游标后不关闭,
或者在不必要的场景使用repeatable read或serializable事务隔离级别,
或者对大的数据库执行pg_dump进行逻辑备份(隐式repeatable read隔离级别的全库备份),
或者长时间不关闭申请了事务号的事务(增,删,改,DDL的SQL),还包括2PC
设置idle_in_transaction_session_timeout参数,控制长事务的存活时间
设置old_snapshot_threshold参数,强制删除为过老的事务快照保留的dead元组。这会导致长事务读取已被删除的tuple时出错。
对于大表,建议使用分区,可以加快vacuum的速度
对于IO没有问题的系统,关闭autovacuum_vacuum_cost_delay。
合理调整autovacuum_naptime参数
应用程序设计时,避免使用大批量的更新,删除操作,可以切分为多个事务进行。
使用大的数据块,对于现代的硬件水平,32KB是比较好的选择,fillfactor实际上不需要太关注,100就可以了,调低它其实没有必要,因为数据库总是有垃圾,也就是说每个块在被更新后实际上都不可能是满的。
万一真的膨胀了,可以通过table rewrite来回收(如vacuum full、cluster),但是需要持排他锁。建议使用pg_reorg或者pg_repack来回收,其原理是新建表后交换relfilenode,可以大大缩短需要持有排他锁的时间。
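前面调整触发阈值的建议,也可以只针对个别大表单独设置,不影响全局配置(示意,表名big_table为假设):

```sql
-- 针对单个大表调低触发阈值:1000万行的表,约1万条垃圾记录即触发vacuum
ALTER TABLE big_table SET (
    autovacuum_vacuum_scale_factor = 0.001,
    autovacuum_analyze_scale_factor = 0.0005
);

-- 确认设置是否生效
SELECT reloptions FROM pg_class WHERE relname = 'big_table';
```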
参考
https://github.com/digoal/blog
https://www.citusdata.com/blog/2017/10/20/monitoring-your-bloat-in-postgres/
https://www.compose.com/articles/postgresql-bloat-origins-monitoring-and-managing/
https://stackoverflow.com/questions/62642234/index-size-after-autovacuum
https://github.com/pgexperts/pgx_scripts/tree/master/bloat
https://github.com/pgexperts/pgx_scripts/blob/master/bloat/no_stats_table_check.sql
https://github.com/ioguix/pgsql-bloat-estimation/tree/master/table
https://github.com/ioguix/pgsql-bloat-estimation/tree/master/btree
https://wiki.postgresql.org/wiki/Show_database_bloat
http://www.louisemeta.com/blog/indexes-btree-algorithms/
https://www.postgresql.org/docs/14/routine-reindex.html