来源：PostgreSQL学徒

前言

这些天研究了一下Greenplum AO表的设计与底层原理，在此也感谢好友 japin 的指导。

正文

前阵子有位客户反馈两个类似表结构的表，其中一个表的数据量远小于另外一个表（百万级和千万级），但百万级的表（后文以A表作为替代）大小远大于千万级的表（后文以B表作为替代）大小。按照以往单机PostgreSQL思维来思考的话，可能是TOAST，也可能是表中存在了很多大字段，大宽表之类的，亦或是数据类型差异、字节对齐等原因造成的，但是经过排查，发现这些并没有太大差异，甚至A表中的json/text字段更多，所以这个思路行不通了。不过怀疑的方向是这个A表经常会insert xxx。

Greenplum存储引擎除了支持类似PostgreSQL的传统堆表，还支持AO表，最开始设计时，AO表被称为Append Only表，只支持追加新元组。在后来的迭代中，也支持了删除和更新元组操作，因此现在AO表指的是Append Optimized，AO表又可细分为行存和列存，此次环境预期异常的是AOCO表（列存）。

AOCO以列的形式进行组织与存储，每一个属性一个文件，因此读取任一列的成本其实是差不多的，通过列存能够减少IO的读取量、减少执行过程中的节点函数调用次数和压缩能力的提升。

Append Optimized，顾名思义，追加，意味着是在文件末尾不断添加新元组，所以不同于传统PostgreSQL通过xmin、xmax以及CLOG来判断可见性，AOCO表通过追加的方式，依靠文件的结束地址EOF来作为可见性判断的依据，只有当事务成功提交后，从原来 EOF 之后新追加的元组才对外可见，否则对外只能看到文件原来的 EOF。

为此，Greenplum也提供了几个额外的系统表，比如类似pg_class地位的pg_appendonly，存放AO表所属的segment文件的元数据信息的pg_aocsseg_< relid >，其中包括段号、EOF等，另外还有pg_aoblkdir等，更多信息可以阅读https://github.com/greenplum-db/gpdb/blob/main/src/backend/access/appendonly/README.md，此处不做过多描述。

让我们分析一下：

postgres=# CREATE TABLE test_single_insert2 (
    col1 varchar(5),
    col2 varchar(5),
    col3 varchar(5))
USING ao_column distributed BY(col2);
CREATE TABLE
postgres=# insert into test_single_insert2 values('hello','hello','hello');
INSERT 0 1
postgres=# select * from pg_appendonly where relid = 'test_single_insert2'::regclass;
-[ RECORD 1 ]---+------
relid           | 24032
blocksize       | 32768
safefswritesize | 0
compresslevel   | 0
checksum        | t
compresstype    | 
columnstore     | t
segrelid        | 24034
blkdirrelid     | 0
blkdiridxid     | 0
visimaprelid    | 24036
visimapidxid    | 24038

由于CN上只是元数据，所以需要去segment节点查看。此处OID指的是AO表的OID，即24032。

postgres=# select * from pg_aoseg.pg_aocsseg_24032;
-[ RECORD 1 ]-+---------------------------------------------------------------------------------------------------------------------------
segno         | 1
tupcount      | 1
varblockcount | 0
vpinfo        | \x000000000300000000000000300000000000000030000000000000003000000000000000300000000000000030000000000000003000000000000000
modcount      | 1
formatversion | 3
state         | 1

segno：段号
tupcount：元组数量
modcount：表被操作次数
formatversion：版本信息
state：状态信息

这几个字段好理解，稍微麻烦一点的是这个vpinfo，并且是个bytea类型的。好在Greenplum的代码文件名类似PostgreSQL，也好找，结构体位于 aocssegfiles.h 中

/*
 * Descriptor of a single AO Col relation file segment.
 *
 * Note that the first three variables should be the same as
 * the AO Row relation file segment (see FileSegInfo). This is
 * implicitly used by the block directory to obtain those info
 * using the same structure -- FileSegInfo.
 */
typedef struct AOCSFileSegInfo
{
 int32  segno;

 /*
  * total number of tuples in the segment. This number includes invisible
  * tuples
  */
 int64  total_tupcount;
 int64  varblockcount;

 /*
  * Number of data modification operations
  */
 int64  modcount;

 /*
  * state of the segno. The state is only maintained on the segments.
  */
 FileSegInfoState state;

 int16  formatversion;

 /* Must be last */
 AOCSVPInfo vpinfo;
} AOCSFileSegInfo;

该结构体正是在该表中看到的信息，vpinfo的结构如下

typedef struct AOCSVPInfoEntry
{
 int64  eof;
 int64  eof_uncompressed;
} AOCSVPInfoEntry;

typedef struct AOCSVPInfo
{
 /* total len.  Have to be the very first */
 int32  _len;
 int32  version;
 int32  nEntry;

 /* Var len array */
 AOCSVPInfoEntry entry[1];
} AOCSVPInfo;

可以看到熟悉的EOF。len顾名思义总长度，nEntry指的是有多少个属性

现在让我们拆解一下这串十六进制数字，根据pg_relation_filepath定位一下磁盘上的文件

这样的话就很明显了

第一行是segno
0x7f20，前面8个字节是tupcount，后面8个字节blockcount
0x7f30，bd，长度，然后是version，然后0x00 00 00 03是nEntry，代表有三个属性
剩下4字节对齐
然后是0x 00 00 00 00 00 00 00 30 (小端序)，代表着EOF
接着类似，代表EOF uncompressed

接下来的字节类似，同样代表着EOF，所以三个列对应文件的EOF都是0x30，也就是十进制的48

postgres=# select pg_relation_filepath('test_single_insert2');
 pg_relation_filepath 
----------------------
 base/13376/66206
(1 row)

[mxadmin@segment2 13376]$ ls -l 66206.1
-rw-------. 1 mxadmin mxadmin 48 10月 24 18:17 66206.1
[mxadmin@segment2 13376]$ ls -l 66206.129
-rw-------. 1 mxadmin mxadmin 48 10月 24 18:17 66206.129
[mxadmin@segment2 13376]$ ls -l 66206.257
-rw-------. 1 mxadmin mxadmin 48 10月 24 18:17 66206.257
[mxadmin@segment2 13376]$ echo $((0x30))
48

可以看到，文件大小确实是48个字节。那让我们再次插入一行数据试下

postgres=# insert into test_single_insert2 values('hello','hello','hello');
INSERT 0 1

postgres=# select * from pg_aoseg.pg_aocsseg_24032;
-[ RECORD 1 ]-+---------------------------------------------------------------------------------------------------------------------------
segno         | 1
tupcount      | 2
varblockcount | 0
vpinfo        | \x000000000300000000000000600000000000000060000000000000006000000000000000600000000000000060000000000000006000000000000000
modcount      | 2
formatversion | 3
state         | 1

可以看到文件大小扩大了一倍

[mxadmin@segment2 13376]$ ls -l 66206.*
-rw-------. 1 mxadmin mxadmin 96 10月 24 23:56 66206.1
-rw-------. 1 mxadmin mxadmin 96 10月 24 23:56 66206.129
-rw-------. 1 mxadmin mxadmin 96 10月 24 23:56 66206.257

因此，现在让我们来比较一下，单个事务插入5行数据，和5个事务插入5行数据后，表的大小差异

单个事务：

postgres=# insert into test_single_insert2 values('hello','hello','hello'),('hello','hello','hello'),('hello','hello','hello'),('hello','hello','hello'),('hello','hello','hello');
INSERT 0 5

postgres=# select vpinfo from pg_aoseg.pg_aocsseg_24032;
-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------
vpinfo | \x000000000300000000000000480000000000000048000000000000004800000000000000480000000000000048000000000000004800000000000000

postgres=# \! ls -l 66618*
-rw-------. 1 mxadmin mxadmin  0 10月 24 23:58 66618
-rw-------. 1 mxadmin mxadmin 72 10月 24 23:59 66618.1
-rw-------. 1 mxadmin mxadmin 72 10月 24 23:59 66618.129
-rw-------. 1 mxadmin mxadmin 72 10月 24 23:59 66618.257

多个事务：

postgres=# truncate table test_single_insert2;
TRUNCATE TABLE
postgres=# insert into test_single_insert2 values('hello','hello','hello');
INSERT 0 1
postgres=# insert into test_single_insert2 values('hello','hello','hello');
INSERT 0 1
postgres=# insert into test_single_insert2 values('hello','hello','hello');
INSERT 0 1
postgres=# insert into test_single_insert2 values('hello','hello','hello');
INSERT 0 1
postgres=# insert into test_single_insert2 values('hello','hello','hello');
INSERT 0 1

postgres=# select vpinfo from pg_aoseg.pg_aocsseg_24032;
-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------
vpinfo | \x000000000300000000000000f000000000000000f000000000000000f000000000000000f000000000000000f000000000000000f000000000000000

postgres=# select pg_relation_filepath('test_single_insert2');
-[ RECORD 1 ]--------+-----------------
pg_relation_filepath | base/13376/66628

postgres=# \! ls -l 66628*
-rw-------. 1 mxadmin mxadmin   0 10月 25 00:01 66628
-rw-------. 1 mxadmin mxadmin 240 10月 25 00:01 66628.1
-rw-------. 1 mxadmin mxadmin 240 10月 25 00:01 66628.129
-rw-------. 1 mxadmin mxadmin 240 10月 25 00:01 66628.257

差异一目了然！虽然都是5行数据，但是EOF和文件大小都要远大于前一种，不难理解，对于AOCO表，每次事务结束后，调用fsync，记录最后一次写入对应的数据块的EOF，并且即使数据块即使只有一条记录，下次再发起事务又会重新追加一个数据块。

这也是为什么官网上有这么一段说明：

“
Single row INSERT statements are not recommended.

AO表适用于数仓中的事实表，数据量不断增长，通常批量加载大批量数据。

小结

至于为什么在vpinfo中没有查询到0xdb，来自japin的解释是

“
由于这个结构体的前面是 _len，他可以与 varlena 这个结构兼容，因此在读取的时候不会把 _len 的值读取出来。
bytea 这种类型，它的前面在存储的时候是包含了 _len 的，但是查询出来的时候就会被忽略掉

过于专业，不过对于我们理解此问题现象并不影响。

当然，也可以通过pageinspect查看(aocsseg也是行存表)，就可以看到0xdb了，即总长度。

postgres=# select t_data FROM heap_page_items(get_raw_page('pg_aoseg.pg_aocsseg_24032', 0)) ;
                                                                                                   t_data                                                                                                   
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 \x010000000000000000000000000000000000000000000000bd000000000300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003000100
 \x010000000000000001000000000000000000000000000000bd000000000300000000000000300000000000000030000000000000003000000000000000300000000000000030000000000000003000000000000000000000010000000000000003000100
(2 rows)

参考

https://docs.vmware.com/en/VMware-Greenplum/7/greenplum-database/admin_guide-ddl-ddl-storage.html?hWord=N4IghgNiBcIIIHkDCCQF8g

https://github.com/greenplum-db/gpdb/blob/main/src/backend/access/appendonly/README.md

Greenplum AO表存储分析

来源：PostgreSQL学徒

前言

正文

小结

参考