PG 聊聊 transaction wraparound

原创大表哥 2023-02-13

1605

大家好，最近和单位同事聊了聊 PG的MVCC 机制，顺便记录了下来一些内容主要围绕PG事务ID的 transaction wraparound 问题。

熟悉PG的朋友们都知道 PG 的MVCC 的实现与 MYSQL, ORACLE的undo 机制不同， PG在数据页上直接插入数据，类似于 insert append 直接追加数据，当数据读取的时候（这里说的隔离级别为read-committed), 通过当前连接的session 的 txid 与数据页中的 t_xmin,t_xmax 来判断比对从而得出可见的数据。

我们这里暂时不讨论具体的数据可见性的比对规则。我们要聊的是 PG全局的事务ID ：txid

这个值可以通过函数 txid_current() 得到

postgres@[local:/tmp]:1992=#45175 SELECT txid_current();
txid_current
--------------
     5303884
(1 row)

复制

TXID 这个数值当初设计的时候是一个无符号的 INT 32 位的数字，大约的数值范围在42亿左右。这个数值的范围很容易用满，所以在PG采用 cycle 的循环利用的方式来重置 TXID.
由于transaction id 是正整数，所以只有无符号32位整型数字的一半，大致20亿左右。

举个例子，我们借用JP 业界知名大神的著作中的一张图来描述一下 transaction wraparound 问题。

有1个表 t1 写入一条记录 tuple1 的时候是 txid = 100 , PG数据库没有自动或者手动的 vacumm freeze, 或者是这条insert 语句的一直没有提交，
这个系统如果很忙， TPS > 500K/s ，我们假设能达到每秒5W笔的事务，消耗20亿transaction id 的容量范围的话, 大致需要 11 个小时（2000000000/50000/3600=11+ hours）
这个时候再次像表t1 中插入一条新的数据 tuple2 ，这个时候假设 transaction ID 会重置成 20，那么就会出现过去的元祖 tuple1 的 txid = 100 大于现在或者未来的元祖 tuple2 的 txid = 20的现象，从而造成了多版本MVCC的可见性的混乱的问题。

为了避免这个 transaction wraparound的问题：
PG 采用 freeze 过时的 txid 的方式来实现，这个会发生在执行命令 vacuum 的时候.会自动根据参数设置来判断是否只能 txid freeze 的操作。如果长时间存在未提交的transacction, 或者存在废弃的replicate slot，那么将不会触发 freeze . 这个数据库会进入到 read-only 的模式下面

FREEZE 的大致是分为2个版本（图依然来自JP的业界大神）

1.版本9.3之前的版本是直接更新元祖 head 中的隐藏列的值 t_xmin , 更新为 2。这样设置t_xmin =2 对于所有的transaction 来说都是可见的。

2.对于版本 9.4开始，会更新位图 t_infomask 标记为 HEAP_XMIN_FROZEN

关于更新 t_infomask 标记为 HEAP_XMIN_FROZEN 的信息，网上的资料很少，目前只能从源代码查看这个位图的定义：

我们查看源码文件 src/include/access/htup_details.h 中的 HEAP_XMIN_FROZEN 的定义： HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID

HEAP_XMIN_COMMITTED 0x0100 -》 t_infomask & 256
HEAP_XMIN_INVALID 0x0200 -》 t_infomask & 512
HEAP_XMIN_FROZEN 的值取决 HEAP_XMIN_COMMITTED和HEAP_XMIN_INVALID 做或运算，其中有一个为真，这个元祖即为冻结状态

#define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */

#define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)

复制

我们测试一下 vacuum freeze 的执行过程的前后，HEAP_XMIN_FROZEN的值的改变情况，
为了到达测试效果，我们暂时先关闭 auto vacuum 这个参数

postgres@[local:/tmp]:1992=#92720 alter system set autovacuum =off;
ALTER SYSTEM

复制

这个参数需要重启数据库：

/opt/postgreSQL/pg15/bin/pg_ctl restart -D /data/postgreSQL/1992/backups
postgres@[local:/tmp]:1992=#102177 show autovacuum;
autovacuum
------------
off
(1 row)

复制

我们来创建一张表tab1，插入20条数据： transaction id = 5303898

postgres@[local:/tmp]:1992=#102177 create table tab1 (id int, name varchar(200));
CREATE TABLE
postgres@[local:/tmp]:1992=#102177 begin;
BEGIN
postgres@[local:/tmp]:1992=#27290 select txid_current();
txid_current
--------------
     5303898
(1 row)



postgres@[local:/tmp]:1992=#102177 insert into tab1 select generate_series(1,20),'jason';
INSERT 0 20
postgres@[local:/tmp]:1992=#102177 commit;
COMMIT

复制

我们用extension pageinpect 来查看一下每条记录关于 transaction id 的隐藏信息：
我们可以看到 txid = 5303898 这2条记录，隐藏列 t_infomask 中的 heap_xmin_committed 为 False, heap_xmin_invalid 为False ,
所以我们想要查看的 t_infomask 中的 HEAP_XMIN_FROZEN 是 heap_xmin_committed | heap_xmin_invalid ，也是 False 的

postgres@[local:/tmp]:1992=#27290 SELECT lp,
postgres-#        t_ctid AS ctid,
postgres-#        t_xmin AS xmin,
postgres-#        t_xmax AS xmax,
postgres-#        (t_infomask & 256)::boolean AS HEAP_XMIN_COMMITTED,
postgres-#        (t_infomask & 512)::boolean AS HEAP_XMIN_INVALID,
postgres-#       ((t_infomask & 256)::boolean or (t_infomask & 512)::boolean)::boolean AS HEAP_XMIN_FROZEN
postgres-# FROM heap_page_item_attrs(
postgres(#         get_raw_page('tab1', 0),
postgres(#         'tab1'
postgres(#      );
lp |  ctid  |  xmin   | xmax | heap_xmin_committed | heap_xmin_invalid | heap_xmin_frozen
----+--------+---------+------+---------------------+-------------------+------------------
 1 | (0,1)  | 5303898 |    0 | f                   | f                 | f
 2 | (0,2)  | 5303898 |    0 | f                   | f                 | f
 3 | (0,3)  | 5303898 |    0 | f                   | f                 | f
 4 | (0,4)  | 5303898 |    0 | f                   | f                 | f
 5 | (0,5)  | 5303898 |    0 | f                   | f                 | f
 6 | (0,6)  | 5303898 |    0 | f                   | f                 | f
 7 | (0,7)  | 5303898 |    0 | f                   | f                 | f
 8 | (0,8)  | 5303898 |    0 | f                   | f                 | f
 9 | (0,9)  | 5303898 |    0 | f                   | f                 | f
10 | (0,10) | 5303898 |    0 | f                   | f                 | f
11 | (0,11) | 5303898 |    0 | f                   | f                 | f
12 | (0,12) | 5303898 |    0 | f                   | f                 | f
13 | (0,13) | 5303898 |    0 | f                   | f                 | f
14 | (0,14) | 5303898 |    0 | f                   | f                 | f
15 | (0,15) | 5303898 |    0 | f                   | f                 | f
16 | (0,16) | 5303898 |    0 | f                   | f                 | f
17 | (0,17) | 5303898 |    0 | f                   | f                 | f
18 | (0,18) | 5303898 |    0 | f                   | f                 | f
19 | (0,19) | 5303898 |    0 | f                   | f                 | f
20 | (0,20) | 5303898 |    0 | f                   | f                 | f
(20 rows)

复制

接下来，我们手动执行命令 autovacuum freeze 一下 table tab1:

postgres@[local:/tmp]:1992=#27290 vacuum FREEZE  tab1;
VACUUM

复制

我们再次查看元祖隐藏列t_infomask中的 heap_xmin_committed，heap_xmin_invalid和HEAP_XMIN_FROZEN 的值：

我们可以看到 heap_xmin_committed，heap_xmin_invalid和HEAP_XMIN_FROZEN 的值均变成了 true.

postgres@[local:/tmp]:1992=#27290 SELECT lp,
postgres-#        t_ctid AS ctid,
postgres-#        t_xmin AS xmin,
postgres-#        t_xmax AS xmax,
postgres-#        (t_infomask & 256)::boolean AS HEAP_XMIN_COMMITTED,
postgres-#        (t_infomask & 512)::boolean AS HEAP_XMIN_INVALID,
postgres-#       ((t_infomask & 256)::boolean or (t_infomask & 512)::boolean)::boolean AS HEAP_XMIN_FROZEN
postgres-# FROM heap_page_item_attrs(
postgres(#         get_raw_page('tab1', 0),
postgres(#         'tab1'
postgres(#      );
lp |  ctid  |  xmin   | xmax | heap_xmin_committed | heap_xmin_invalid | heap_xmin_frozen
----+--------+---------+------+---------------------+-------------------+------------------
 1 | (0,1)  | 5303898 |    0 | t                   | t                 | t
 2 | (0,2)  | 5303898 |    0 | t                   | t                 | t
 3 | (0,3)  | 5303898 |    0 | t                   | t                 | t
 4 | (0,4)  | 5303898 |    0 | t                   | t                 | t
 5 | (0,5)  | 5303898 |    0 | t                   | t                 | t
 6 | (0,6)  | 5303898 |    0 | t                   | t                 | t
 7 | (0,7)  | 5303898 |    0 | t                   | t                 | t
 8 | (0,8)  | 5303898 |    0 | t                   | t                 | t
 9 | (0,9)  | 5303898 |    0 | t                   | t                 | t
10 | (0,10) | 5303898 |    0 | t                   | t                 | t
11 | (0,11) | 5303898 |    0 | t                   | t                 | t
12 | (0,12) | 5303898 |    0 | t                   | t                 | t
13 | (0,13) | 5303898 |    0 | t                   | t                 | t
14 | (0,14) | 5303898 |    0 | t                   | t                 | t
15 | (0,15) | 5303898 |    0 | t                   | t                 | t
16 | (0,16) | 5303898 |    0 | t                   | t                 | t
17 | (0,17) | 5303898 |    0 | t                   | t                 | t
18 | (0,18) | 5303898 |    0 | t                   | t                 | t
19 | (0,19) | 5303898 |    0 | t                   | t                 | t
20 | (0,20) | 5303898 |    0 | t                   | t                 | t
(20 rows)

复制

接下来，我们需要在做一个实验，看看能不能复现PG的 transaction wrapround 的场景：

正如我们理解的 transaction 的一圈长度为 2147483647 ，理论上，如果存在一笔事务长时间不提交或者rollback, session 一直驻留在数据库，
而其他的事务在频繁的操作数据库，类似于OLTP的高负载的情况下，例如 tps = 5W, 大致需要11个小时就可以复现这个情况。

我们如果没有模拟TPS=5W 这么高并发的机器，而且不想等 10多个小时的话，我们可以尝试修改一段源代码，大致修改的逻辑就是在下次拿到 TXID的时候，
比较当前ID和下一个ID的时候，设置一个基础值，这样就很容易使下一个 txid 达到上限。

我们找到源码文件：src/backend/access/transam/transam.c

/*
* TransactionIdFollowsOrEquals --- is id1 logically >= id2?
*/
bool
TransactionIdFollowsOrEquals(TransactionId id1, TransactionId id2)
{
    int32        diff;


    if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
        return (id1 >= id2);

    diff = (int32) (id1 - id2);
    return (diff >= 0);
}

复制

我们先尝试把id1 和 id2的值打印出来： printf("%d --> %d ckpTxidNext \n",id1,id2);

bool
TransactionIdFollowsOrEquals(TransactionId id1, TransactionId id2)
{
    int32        diff;


    if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
        return (id1 >= id2);

    diff = (int32) (id1 - id2);
    printf("%d --> %d ckpTxidNext \n",id1,id2);
    return (diff >= 0);
}

复制

Build 一下项目，重新启动postgres 实例，其实上PG启动的时候就会进行 next txid 的检查

我们尝试登录数据库，产生一个txid: 这个时候我们发现 PG启动时候检验了 next txid 和 xidVacLimit 的比较

postgres=# begin;
BEGIN
postgres=*# select txid_current();
txid_current
--------------
      1893535
(1 row)

复制

1893536 是我们下一个 txid 的值， 200000478 这个值是 xidVacLimit ，
如果下一个 txid > xidVacLimit 的值的话，PG会在后台强制启动 autovacuum 的进程


TransactionId xidVacLimit;    /* start forcing autovacuums here */

复制

比较xidVacLimit，并启动auto vacuum 进程： SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);

if (TransactionIdFollowsOrEquals(xid, ShmemVariableCache->xidVacLimit))
	{
		/*
		 * For safety's sake, we release XidGenLock while sending signals,
		 * warnings, etc.  This is not so much because we care about
		 * preserving concurrency in this situation, as to avoid any
		 * possibility of deadlock while doing get_database_name(). First,
		 * copy all the shared values we'll need in this path.
		 */
		TransactionId xidWarnLimit = ShmemVariableCache->xidWarnLimit;
		TransactionId xidStopLimit = ShmemVariableCache->xidStopLimit;
		TransactionId xidWrapLimit = ShmemVariableCache->xidWrapLimit;
		Oid			oldest_datoid = ShmemVariableCache->oldestXidDB;

		LWLockRelease(XidGenLock);

		/*
		 * To avoid swamping the postmaster with signals, we issue the autovac
		 * request only once per 64K transaction starts.  This still gives
		 * plenty of chances before we get into real trouble.
		 */
		if (IsUnderPostmaster && (xid % 65536) == 0)
			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
复制

这个值的取值逻辑是：
xidVacLimit = oldest_datfrozenxid（478） + autovacuum_freeze_max_age （200000000）;

我们可以通过执行SQL来验证： 1893535

postgres=# show autovacuum_freeze_max_age;
autovacuum_freeze_max_age
---------------------------
200000000
(1 row)


postgres=# SELECT datname, datfrozenxid, age(datfrozenxid)
postgres-# FROM pg_database;
  datname  | datfrozenxid |   age   
-----------+--------------+---------
postgres  |          478 | 1893058
template1 |          478 | 1893058
template0 |          478 | 1893058
(3 rows)

复制

okay. 我们接下来看看启动的时候，第二个检查点的比较：
这个是 next txid 与 xidStopLimit 比较

关于xidStopLimit 这个值，可以粗线条的理解为（int32 整型上限的数据 20亿左右 + oldest_datfrozenxid）- 1000000（1百万为 vacuum 操作预留的 txid ）：

xidStopLimit 细节的可以参考代码：

计算公式为： xidStopLimit = （oldest_datfrozenxid + (MaxTransactionId >> 1);） - 1000000
= 478 + 2147483647 - 1000000
= 2146484125

/*
     * The place where we actually get into deep trouble is halfway around
     * from the oldest potentially-existing XID.  (This calculation is
     * probably off by one or two counts, because the special XIDs reduce the
     * size of the loop a little bit.  But we throw in plenty of slop below,
     * so it doesn't matter.)
     */
    xidWrapLimit = oldest_datfrozenxid + (MaxTransactionId >> 1);
    if (xidWrapLimit < FirstNormalTransactionId)
        xidWrapLimit += FirstNormalTransactionId;


    /*
     * We'll refuse to continue assigning XIDs in interactive mode once we get
     * within 1M transactions of data loss.  This leaves lots of room for the
     * DBA to fool around fixing things in a standalone backend, while not
     * being significant compared to total XID space. (Note that since
     * vacuuming requires one transaction per table cleaned, we had better be
     * sure there's lots of XIDs left...)
     */
    xidStopLimit = xidWrapLimit - 1000000;
    if (xidStopLimit < FirstNormalTransactionId)
        xidStopLimit -= FirstNormalTransactionId;

复制

如果next txid的值，大于了 xidStopLimit 这个极限值，那么数据库变成不可用的状态。也就是说数据库为了避免发生 transaction wraparound, 将整个数据库变成了只读的状态。

我们来修改源代码模拟一下： src/backend/access/transam/transam.c

比较的时候添加一个基础值为 2134590490 = 2136484125（xidStopLimit） - 1893535（TXID） - 100（这个100是预留给我们做实验用的）
这个基础值的作用是类似于提前替我们消耗了 2134590490 多个 txid, 我们只需自己写满预留的100个就可以复现数据库为了防止 transaction wrapround 进行的报错。

重新编译启动项目：这次我们可以看到 2136484125 - 2136484027 = 98 个可用的ID

PG刚才关闭和启动各自消耗了一个TXID，所以还剩98个可以让我们自己消耗掉

我们只需要手动的耗费98个ID即可：

我们打开2个窗口：
session1: 模拟一个不提交的空session
session2: 用PGPLSQL loop一个消耗99 条事务的循环

session1:模拟一个不提交事务的空session

postgres=# begin;
BEGIN
postgres=*# select txid_current();
txid_current
--------------
      1893537
(1 row)

复制

session 2: PGPLSQL loop一个消耗99 条事务的循环

我们成功的模拟出数据库错误：

WARNING: database “postgres” must be vacuumed within 2145590490 transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.

postgres=# DO
postgres-# $$
postgres$# DECLARE
postgres$#    i INTEGER := 1;
postgres$# BEGIN
postgres$#    FOR i IN 1..98 LOOP
postgres$#         begin
postgres$#
postgres$# insert into t1 values (i);
postgres$#
postgres$# commit;
postgres$#
postgres$# end;
postgres$#    END LOOP;
postgres$# END;
postgres$# $$ language plpgsql;
WARNING:  database "postgres" must be vacuumed within 2145590490 transactions
HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.

复制

如何解决：其实错误日志已经告诉了我们：

1.找出问题根源，是否存在提交或者回滚的事务，我们可以用检查事务打开的语句：

select locktype,
relation::regclass as rel,
virtualxid as vxid,
transactionid as xid ,
virtualtransaction as vxid2,
pid,
mode,
granted
from pg_locks;
   locktype    |   rel    | vxid  |   xid   | vxid2 |  pid  |      mode       | granted
---------------+----------+-------+---------+-------+-------+-----------------+---------
relation      | pg_locks |       |         | 4/104 | 45976 | AccessShareLock | t
virtualxid    |          | 4/104 |         | 4/104 | 45976 | ExclusiveLock   | t
virtualxid    |          | 3/8   |         | 3/8   | 45763 | ExclusiveLock   | t
transactionid |          |       | 1893537 | 3/8   | 45763 | ExclusiveLock   | t
(4 rows)

复制

需要手动kill 掉这个 xid = 1893537的后台进程45763 , （生产环境需要和你们的开发人员和客户确认）

postgres=# select pg_terminate_backend(45763);
pg_terminate_backend
----------------------
t
(1 row)

复制

2.手动 vacuum freeze

postgres=# vacuum freeze;
VACUUM

复制

写到最后，

1)其实关于PG TXID 的问题已经在社区内讨论了很久，已经很多次提议升级扩展到到int 64bit:
Add 64-bit XIDs into PostgreSQL 16 - Commitfests https://commitfest.postgresql.org/38/3594/

这个是最新的建议加到PG16中, 当然了之前也有建议加入到PG15版本中的讨论。目前尚无考证会修改这个问题。

2）对于数据库和表的年龄要做好监控，以及表上最后的vacuum, analyze 的时间

3）需要数据库中监控长事务和不（提交|回滚）的短事务，无论是active 的session 跑的大的report的，还是应用开发的代码的有bug，造成的短小事务忘记提交。
可以设置参数 idle_in_transaction_session_timeout （默认是0是不限制超时）来强制kill掉 idle_in_transaction 的session , 此类型的session 多为应用开发代码的bug造成的

4）如果standby 退役了，存在相关的replicate slot 要及时删掉

5）超级大表，可以考虑进行分区，充分利用parallel vacuum来加快大表 freeze, analyze 的速度

6）根据数据库业务负载的情况，定期(比如每周一次,每天夜里某个时刻) 在low business hour的时候，手动进行 vacumm 的操作。

Have a fun 😃 !

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

PG 聊聊 transaction wraparound

评论