AntDB Database FAQ: Usage Issues

AsiaInfo AntDB Database 2023-02-28

uncommitted xmin 1783478473 from before xid cutoff 1848062627 needs to be frozen \N

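Solution

This error occurs during a data freeze operation. Use the details in the error message to identify the affected table, then run the following on the corresponding DN node: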
set xc_maintenance_mode = on;
update ud.dr_gprs_731_1_20210426 set dr_type=(select distinct dr_type from ud.dr_gprs_731_1_20210426 where xmin=1783478473) where xmin= 1783478473;

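Cause

The error may occur when distributed transaction IDs are not synchronized in time during a data freeze (VACUUM FREEZE), or when files under the adb_xact directory in the data directory of the affected DN node are abnormal, missing, or corrupted.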

ERROR: attempted to local committed but global uncommitted transaction, which version is 1749254057 \N

wait session last xid commit time out, which version is 1963201898 \N

Solution

This error occurs when SQL is executed on a CN node.
On the CN node that reported the error, query the adb_snap_state() extension view:
select * from adb_snap_state();
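-- If it is reported as not existing, create the extension in the corresponding database: create extension if not exists adb_snap_state;
Check whether the transaction ID from the error is in the xid_assign: [] set of the query result. If it is, do the following on that CN node:
Find the PID of the CN's postgres: snapshot receiver process and run kill -15 on it (note: it must be kill -15, never -9 or any other signal). Nothing more is needed.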
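
The kill -15 step above can be scripted. What follows is a minimal shell sketch rather than an AntDB tool: the function names are invented for illustration, and it assumes that ps shows the process title exactly as quoted above (postgres: snapshot receiver process). It only lists candidate processes, so that you can confirm which one belongs to the CN before signalling it.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of AntDB.

# Read "PID COMMAND..." lines on stdin and print the PID of every line whose
# command contains the snapshot receiver process title.
filter_snapshot_receiver_pids() {
    awk 'index($0, "postgres: snapshot receiver process") > 0 { print $1 }'
}

# Print the PIDs of snapshot receiver processes running on this host.
# ps finishes inside the command substitution before awk starts, so the
# filter can never match its own command line.
list_snapshot_receiver_pids() {
    ps_out=$(ps -e -o pid= -o args=)
    printf '%s\n' "$ps_out" | filter_snapshot_receiver_pids
}
```

Run list_snapshot_receiver_pids on the CN host, check with ps -fp <pid> that the process belongs to the CN that reported the error, and then run kill -15 <pid>.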

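Normally, triggering a manual transaction sync with kill -15 resolves the problem. If it does not, continue as follows:
Log in to the GTMC node and find the gxid sender database process (for example, PID 30099).
Window 1:
Run the following in gdb: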
gdb -p 30099
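handle SIGUSR1 nostop noprint   -- type this command and press Enter
b gxidsender.c:140  -- type this command and press Enter
command 1   -- type this command and press Enter
p GxidSender ->xcnt=0   -- type this command and press Enter
c   -- type this command and press Enter
end -- type this command and press Enter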

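Window 2:
Open a new terminal and run kill -15 30099 (the gxid sender PID), then switch back to Window 1 and run the c command.
Window 1:
c   -- type this command and press Enter

This forces the transaction state of GTMC and CN back into sync.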

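Cause

The error may be caused by transaction IDs on the CN node and the GC node not being synchronized in time. They can be synchronized manually.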

ERROR could not import the requested snapshot \N

Solution

This error occurs when SQL is executed on a CN node.
On the CN node that reported the error, query the adb_snap_state() extension view:
select * from adb_snap_state();
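-- If it is reported as not existing, create the extension in the corresponding database: create extension if not exists adb_snap_state;
Check whether global_xmin differs substantially from local oldest_xmin and local global_xmin in the query result. A large gap means transaction synchronization between this CN and GTMC is abnormal, and a manual sync must be triggered as follows:
Find the PID of the CN's postgres: snapshot receiver process and run kill -15 on it (note: it must be kill -15, never -9 or any other signal). Nothing more is needed.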

Cause

The error may be caused by transaction IDs on the CN node and the GC node not being synchronized in time. A sync can be triggered manually.

ERROR: attempted to local committed but global uncommitted transaction, which version is xxxx \N

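The error message includes the following hint: you can modfiy guc parameter “waitglobaltransaction” on coordinators to wait the global transaction id committed on agtm

On each DN master, look for the transaction snapshot receiver process, postgres: snapshot receiver process, use ps to find its PID, and run kill -15 on it.

On the CN that reported the error, create a table and write some data to verify that the problem is fixed.

Cause

The transaction state on some nodes is inconsistent with the GTM. On a node, you can check a transaction's state with select adb_xact_status(xid);. Killing the process above makes the node resynchronize its transaction state from the GTM.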

adb:/unibss/dmp/hqy/gprs4/DR_GPRS_201812_A_P1_1.sql:18895: invalid command \N

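Solution

During a bulk import, adb output scrolls past too quickly, so this message is not the original error.
Add the -v ON_ERROR_STOP=1 option to see the original error message.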

adb -p xxx -d xxx -f xxx.sql -v ON_ERROR_STOP=1

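Cause

There are many possible causes, for example the table targeted by the adb import has not been created, or a column depends on a sequence that has not been created.
Rerun the adb import with the option above, identify the original error, and fix it accordingly.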

ERROR: adb_basebackup: could not receive data from WAL stream: server closed the connection unexpectedly

Error in adb_log: terminating walsender process due to replication timeout

Solution

1. Test whether the standby can ssh to the master

ssh datanode_master_ip -p ssh_port

If it still fails after ssh login works, continue with step 2.

2. Test whether the standby can connect to the master with adb

adb -p datanode_master_port -h master_ip -d replication

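If it still fails after adb login works, continue with step 3.

3. Adjust wal_sender_timeout

Change wal_sender_timeout from the default 60s to 0 (0 means no time limit).

About the wal_sender_timeout parameter:
The server terminates replication connections that have been inactive for longer than this setting.
This helps the sending server detect a standby crash or a network outage.
Setting it to 0 disables the timeout mechanism.
This parameter can only be set in postgresql.conf or on the server command line. The default is 60 seconds.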

Other possibly relevant settings

-- Raise wal_keep_segments from 128 to 1024
wal_keep_segments = 1024
-- Enable archive mode
archive_mode = "on"
archive_command = "rsync -a %p /data2/antdb/data/arch/dn1/%f"

Cause

There are many possible causes. Work through the steps above in order.

ERROR: cannot execute INSERT in a read-only transaction

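Solution

By default, AntDB datanode nodes are read-only; only coordinators are read-write.
Here adb is connected to a datanode rather than a coordinator. Point adb at the coordinator with the -p port option.
The PGPORT environment variable may also be set. If it is, adb connects by default to the port that variable points to.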

adb -p xxx -d xxx -f xxx.sql -v ON_ERROR_STOP=1

Cause

Check each item described above in turn.

LOG: checkpoints are occurring too frequently

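Solution

When the database is busy, XLOG segments are reused for new writes before they have been applied.
Before AntDB 7.2: checkpoint_segments is set too low.
AntDB 7.2 and later: max_wal_size is set too low.

Before AntDB 7.2: increase checkpoint_segments (>= 128).
AntDB 7.2 and later: increase max_wal_size (>= 4GB).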

Cause

None

LOG: archive command failed with exit code (X)

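Solution

The disk is out of space or the archive path does not exist,
or the user has no write permission,
or the user's ssh, scp, or rsync command failed.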
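
The first two checks above (path exists, user can write, disk has space) can be run in one go. This is a minimal shell sketch under stated assumptions: check_archive_dir is a hypothetical helper name, and the 1 GiB default free-space threshold is an arbitrary illustrative value, not an AntDB recommendation.

```shell
#!/bin/sh
# check_archive_dir DIR [MIN_FREE_KB]
# Hypothetical helper. Returns 0 if DIR exists, is writable by the current
# user, and has at least MIN_FREE_KB kilobytes free (default 1048576 KB,
# an arbitrary example threshold).
check_archive_dir() {
    dir=$1
    min_free_kb=${2:-1048576}

    if [ ! -d "$dir" ]; then
        echo "archive path does not exist: $dir" >&2
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "no write permission on: $dir" >&2
        return 2
    fi
    # df -P prints a header plus one data line; column 4 is the free space in KB.
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    if [ "$free_kb" -lt "$min_free_kb" ]; then
        echo "low disk space on $dir: $free_kb KB free" >&2
        return 3
    fi
    echo "archive path looks usable: $dir"
}
```

Run it as the operating system user that owns the database processes, passing the target directory used in archive_command. If all three checks pass, run the archive command itself by hand with a sample file to see the ssh, scp, or rsync error directly.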

Cause

Check each item described above in turn.

LOG: number of page slots needed (X) exceeds max_fsm_pages (Y)

Solution

The free space map limit, max_fsm_pages, is too small.
Increase max_fsm_pages and run VACUUM FULL at the same time.

Cause

The free space map limit, max_fsm_pages, is too small.

ERROR: current transaction is aborted, commands ignored until end of transaction block

Solution

Have the application catch this exception in code and issue a rollback manually,
or close the connection and open a new one.
An example:

antdb=# begin;
BEGIN
antdb=# select * from sy01;
                  id                  
--------------------------------------
 adc8775e-4539-4861-9454-ceae45c568f7
(1 row)

antdb=# select * from sy011;
ERROR:  relation "sy011" does not exist
LINE 1: select * from sy011;
                      ^
antdb=# select * from sy011;
ERROR:  current transaction is aborted, commands ignored until end of transaction block
antdb=# rollback ;
ROLLBACK
antdb=# begin;
BEGIN
antdb=# select * from sy01;
                  id                  
--------------------------------------
 adc8775e-4539-4861-9454-ceae45c568f7
(1 row)

antdb=# commit;
COMMIT

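Cause

Unlike Oracle, AntDB does not roll back automatically after an error occurs inside a transaction. The user must issue a rollback manually.
After the manual rollback, the connection can be reused without errors.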

ERROR: operator does not exist: character = integer

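Solution

PostgreSQL 8.3 and later removed many implicit data type casts, so the types being compared must match.
AntDB supports two grammar modes: the default postgres mode and the Oracle-compatible oracle mode.
In oracle grammar mode, AntDB implements its own compatibility for a number of implicit type conversion cases, including the one in this error.
In postgres grammar mode, the error is still raised.
An example: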

antdb=# \d sy02
            Table "public.sy02"
 Column |         Type          | Modifiers 
--------+-----------------------+-----------
 id     | character varying(10) | 

antdb=# set grammar TO postgres;
SET
antdb=# select count(*) from sy02 where id=123;
ERROR:  operator does not exist: character varying = integer
LINE 1: select count(*) from sy02 where id=123;
                                          ^
HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.
antdb=# set grammar TO oracle;
SET
antdb=# select count(*) from sy02 where id=123;
 count 
-------
     0
(1 row)

Cause

For Oracle compatibility, AntDB implements a large share of Oracle's implicit data type conversion cases.
Try the oracle grammar mode first.

canceling statement due to lock timeout

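Solution

A long-running transaction still holds a lock, and a new transaction requests a lock on the same object.
Once the wait reaches the time set by lock_timeout, this error is raised.
Clients should commit or roll back transactions promptly. Long transactions consume a lot of database resources and should be avoided where possible.

-- Check table locks (lock requests that have not been granted)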
select locktype,relation::regclass as relation,virtualxid as vxid,transactionid as xid,virtualtransaction vxid2,pid,mode,granted from adb_locks where granted = 'f';
-- Find long transactions that have been running for more than 5 minutes
select datname,pid,usename,client_addr,query,backend_start,xact_start,now()-xact_start xact_duration,query_start,now()-query_start query_duration,state from adb_stat_activity where state<>$$idle$$ and now()-xact_start > interval $$5 min$$ order by xact_start;
-- Kill the long transaction. There are two methods (PID is the pid value returned by the query above):
Method 1:
SELECT adb_cancel_backend(PID);
(This method can only kill SELECT queries; it has no effect on UPDATE, DELETE, or other DML.)

Method 2:
SELECT adb_terminate_backend(PID);
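This method can kill all kinds of operations (select, update, delete, drop, and so on).

If adb_locks shows no lock information for the table, check each datanode for unfinished two-phase transactions that are left hanging. Query the view: select * from adb_prepared_xacts;
Use the time in the prepared column to judge whether a transaction is abnormal. A transaction is abnormal when it meets the following conditions:

The time in the prepared column is far from the current time, for example longer than the expected run time of a single statement.
The same transactions are still present every time you query.

Such transactions can generally be treated as abnormal. You can check the status of a transaction on each node with select adb_xact_status(50996670);, where the argument is the gid value from adb_prepared_xacts with the leading T removed.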
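
Deriving the argument for adb_xact_status from a gid is a simple string operation. A small shell sketch, assuming gids have the form T followed by the transaction ID as in the examples in this section; both function names are hypothetical:

```shell
#!/bin/sh
# gid_to_xid GID
# Strip one leading "T" from a prepared-transaction gid, e.g. T123 -> 123.
# Assumes the "T<xid>" gid format seen in this section.
gid_to_xid() {
    printf '%s\n' "${1#T}"
}

# xact_status_sql GID
# Print the status query to run on each node for that gid.
xact_status_sql() {
    printf 'select adb_xact_status(%s);\n' "$(gid_to_xid "$1")"
}
```

For example, xact_status_sql T784168121 prints the statement to paste into an adb session on each node.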

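If the transaction has been committed on GTMCOORD, commit it on this node: commit prepared 'T784168121';
If the transaction has not been committed on GTMCOORD, roll it back on this node: rollback prepared 'T784168121';

These operations must be run in the database the transaction belongs to, as given by the database column of adb_prepared_xacts.

The following statement generates the statements for a batch operation: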

select 'rollback prepared '''||gid||''';' 
from adb_prepared_xacts 
where  to_char(prepared,'yyyy-mm-dd hh24:mi') ='2020-01-01 14:30'
and database = 'db1';

Cause

None

INSERT has more target columns than expressions

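Solution

The INSERT lists more target columns than it supplies values for.

Cause

The target columns in the statement do not match the supplied values or the table's columns (too many or too few). Check the statement carefully.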

ERROR: No Datanode defined in cluster

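Solution

Log in to a coordinator, run select * from pgxc_node, and check whether any rows have node_type=D.
Run select pgxc_pool_reload() to reload the pgxc_node information, then rerun the query above.
If there are still no rows with node_type=D, the cluster must be re-initialized.
Alternatively, if monitor all in adbmgr shows every node as running, you can populate the pgxc_node table by hand, although this is more cumbersome.

Steps to re-initialize the cluster:
Log in to adbmgr and run: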

stop all mode fast;
clean all;
init all;

Steps to add the pgxc_node initialization data by hand:
Log in to each coordinator and run:

-- Create the node entry for a coordinator
create node ${node_name} with (type=coordinator, host='${node_ip}', port=${node_port}, primary=false);

-- Create the node entry for the first datanode master (datanode slaves do not need to be initialized)
create node ${node_name} with (type=datanode, host='${node_ip}', port=${node_port}, primary=true);
-- Create the node entries for the other datanode masters (datanode slaves do not need to be initialized)
create node ${node_name} with (type=datanode, host='${node_ip}', port=${node_port}, primary=false);

**Note: this approach is crude and is not recommended.**

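Cause

When init all initialized the cluster, agtm was not initialized properly. As a result, each node failed to obtain a transaction ID from agtm while initializing pgxc_node, which left the pgxc_node table incorrectly initialized.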

ERROR: Cannot create index whose evaluation cannot be enforced to remote nodes

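Solution

Currently, a primary key or unique index cannot be created on columns that do not include the distribution key.
If a primary key is required, include the distribution key in it.
An example: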

antdb=# create table sy01(id int,name text) distribute by hash(name);
CREATE TABLE
antdb=# ALTER TABLE sy01 add constraint pk_sy01_1 primary key (id);
ERROR:  Cannot create index whose evaluation cannot be enforced to remote nodes
antdb=# ALTER TABLE sy01 add constraint pk_sy01_1 primary key (id,name);
ALTER TABLE
antdb=# \d+ sy01
                         Table "public.sy01"
 Column |  Type   | Modifiers | Storage  | Stats target | Description 
--------+---------+-----------+----------+--------------+-------------
 id     | integer | not null  | plain    |              | 
 name   | text    | not null  | extended |              | 
Indexes:
    "pk_sy01_1" PRIMARY KEY, btree (id, name)
Distribute By: HASH(name)
Location Nodes: ALL DATANODES

Cause

None

cannot create foreign key whose evaluation cannot be enforced to Remote nodes

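Solution

Currently, a foreign key cannot be created on a column that is not the distribution key. Options:

Make the child table's foreign key column its distribution key, then create the foreign key.
If the parent table is small, change the parent table to a replicated table.

SQL to reproduce: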

antdb=# create table t_parent (id int primary key,name varchar(30));
create table t_child (id int,name varchar(30)) distribute by hash(name);

CREATE TABLE
antdb=# create table t_child (id int,name varchar(30)) distribute by hash(name);
CREATE TABLE
antdb=# 
antdb=# alter table t_child
postgres-#     add constraint fkey_t_child
postgres-#     foreign key (id) 
postgres-#     references t_parent (id);
ERROR:  Cannot create foreign key whose evaluation cannot be enforced to remote nodes
antdb=#
antdb=# alter table t_child distribute by hash (id);
ALTER TABLE
antdb=# alter table t_child                         
    add constraint fkey_t_child
    foreign key (id) 
    references t_parent (id);
ALTER TABLE

antdb=# drop table t_child;
DROP TABLE
antdb=# create table t_child (id int,name varchar(30)) distribute by hash(name);
CREATE TABLE
antdb=# alter table t_parent distribute by replication;
ALTER TABLE
antdb=# alter table t_child                            
postgres-#     add constraint fkey_t_child
postgres-#     foreign key (id) 
postgres-#     references t_parent (id);
ALTER TABLE
antdb=# 

fe_sendauth: no password supplied

Possible error message:

WARNING:  on coordinator   execute "set FORCE_PARALLEL_MODE = off; 				SELECT adb_PAUSE_CLUSTER();" fail ERROR:  error message from poolmgr:reconnect three thimes , fe_sendauth: no password supplied

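Fix:
Check the hba settings of the coordinators in the cluster for entries that apply md5 authentication to the cluster's internal host IPs.

In adbmgr, run show hba nodename to view a node's hba information.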

FATAL: invalid value for parameter “TimeZone”: “Asia/Shanghai”

Possible error messages:

FATAL: invalid value for parameter "TimeZone": "Asia/Shanghai"
FATAL: invalid value for parameter "TimeZone": "asia/shanghai"
FATAL: invalid value for parameter "TimeZone": "utc"

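Fix:

1. Check whether JDBC's JAVA_OPTS sets the user.timezone parameter. If it does, the value must exactly match, including case, a time zone name that the database supports.

Use the following SQL to list the time zones the database supports, and note the case of each name.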

select * from adb_catalog.adb_timezone_names;

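If JDBC does not set this parameter, continue with step 2.

2. Check the share directory under the AntDB binary directory and confirm that the time zone files under timezone are complete. If any are missing or incomplete, redeploy the required files from a complete node.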

ll $ADBHOME/share/postgresql/timezone
total 232
drwxr-xr-x 2 antdb antdb 4096 Apr 16 15:59 Africa
drwxr-xr-x 6 antdb antdb 4096 Apr 16 15:59 America
drwxr-xr-x 2 antdb antdb 4096 Apr 16 15:59 Antarctica
drwxr-xr-x 2 antdb antdb   25 Apr 16 15:59 Arctic
drwxr-xr-x 2 antdb antdb 4096 Apr 16 15:59 Asia
......
drwxr-xr-x 2 antdb antdb 4096 Apr 16 15:59 US
-rwxr-xr-x 1 antdb antdb  114 Apr 16 15:48 UTC
-rwxr-xr-x 1 antdb antdb 1905 Apr 16 15:48 WET
-rwxr-xr-x 1 antdb antdb 1535 Apr 16 15:48 W-SU
-rwxr-xr-x 1 antdb antdb  114 Apr 16 15:48 Zulu
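
Whether a particular time zone name is present with exactly the right case can also be checked from the shell. The sketch below is illustrative only: tz_file_exists is a hypothetical helper, and it assumes the directory layout shown in the listing above, with one file per zone under region directories.

```shell
#!/bin/sh
# tz_file_exists TZDIR NAME
# Hypothetical helper. Succeeds only if NAME (for example Asia/Shanghai)
# exists under TZDIR with exactly this spelling and case.
tz_file_exists() {
    cur=$1
    rest=$2
    [ -n "$rest" ] || return 1
    while [ -n "$rest" ]; do
        comp=${rest%%/*}              # next path component
        case $rest in
            */*) rest=${rest#*/} ;;   # drop the component just taken
            *)   rest= ;;             # that was the last component
        esac
        # Compare entry names as plain strings, so "asia" never matches
        # "Asia" even on a case-insensitive filesystem.
        found=1
        for entry in "$cur"/*; do
            if [ "${entry##*/}" = "$comp" ]; then
                found=0
                break
            fi
        done
        [ "$found" -eq 0 ] || return 1
        cur=$cur/$comp
    done
    return 0
}
```

For example, tz_file_exists "$ADBHOME/share/postgresql/timezone" "Asia/Shanghai" should succeed on a complete installation, while the lowercase spelling asia/shanghai should fail.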

cannot find the datanode master which oid is “xxxx” in pgxc_node of coordinator

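Solution

Confirm the current master/standby status of the datanode, then check the datanode entries in pgxc_node against it:

If the datanode entries in pgxc_node on the gtm master and cn masters are correct: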

update pgxc_class set nodeoids='xxx yyy zzz' where nodeoids='aaa bbb ccc';

If the datanode entries in pgxc_node are incorrect:

update pgxc_node set node_name='xxxx', node_host='' where oid=xxxx;

Cause

After a datanode master/standby switchover succeeds, the corresponding changes to pgxc_node may have been missed.

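switchover datanode slave fails

Solution

Determine how far the datanode switchover got:

The datanode has not actually switched: resolve the problem reported in the mgr error message, then retry the switchover.

The datanode has actually switched: perform the following operations.

On the mgr node: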

set command_mode = sql;
select oid,* from adb_catalog.mgr_node;
update adb_catalog.mgr_node set nodetype='xxxx', nodesync='xxxx',nodemasternameoid='xxxx' where oid=xxxx; -- update the rows for both the master and the standby datanode

On the gtm master and cn masters:

select oid,* from pgxc_node;
update pgxc_node set node_name='xxxx', node_host='' where oid=xxxx;

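Cause

After a datanode master/standby switchover succeeds, some of the follow-up checks may fail. In that case the switchover cannot be run again, and the metadata has to be corrected by hand.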

Last modified: 2023-02-28 10:54:05
