
openGauss failure recovery case: after the primary goes down unexpectedly, the standby is stuck in Standby Need repair(Disconnected)

Original article by gelyon, 2020-07-15

Environment: openGauss 1.0.0, one primary and one standby

[omm@gsdb01 ~]$ gs_om -t status --detail
[ Cluster State ]

cluster_state : Normal
redistributing : No
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------
1 gsdb01 192.168.0.195 6001 /u01/openGauss/data/db1 P Primary Normal | 2 gsdb02 192.168.0.96 6002 /u01/openGauss/data/db1 S Standby Normal
[omm@gsdb01 ~]$

Simulate a primary outage by stopping the primary instance; the standby goes to Standby Need repair(Disconnected)

[omm@gsdb01 ~]$ gs_ctl stop -D /u01/openGauss/data/db1
[2020-07-15 09:18:53.064][21012][][gs_ctl]: gs_ctl stopped ,datadir is -D "/u01/openGauss/data/db1"
waiting for server to shut down........ done
server stopped
[omm@gsdb01 ~]$ gs_om -t status --detail
[ Cluster State ]

cluster_state : Unavailable
redistributing : No
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------
1 gsdb01 192.168.0.195 6001 /u01/openGauss/data/db1 P Down Manually stopped | 2 gsdb02 192.168.0.96 6002 /u01/openGauss/data/db1 S Standby Need repair(Disconnected)
[omm@gsdb01 ~]$
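
The two fields that matter in this status output are cluster_state and the Need repair(...) reason. As a minimal sketch, assuming the output format shown above, they could be pulled out with two small filters. The function names cluster_state and repair_reasons are hypothetical helpers, not part of openGauss.

```shell
#!/bin/sh
# Hypothetical helpers (not openGauss tools): both read the text printed by
# `gs_om -t status --detail` on stdin and assume the layout shown in this article.

# Print the value of the cluster_state field, e.g. "Normal" or "Unavailable".
cluster_state() {
    awk -F':' '/cluster_state/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Print every "Need repair(...)" reason found, one per line, e.g. "Disconnected".
repair_reasons() {
    grep -o 'Need repair([A-Za-z]*)' | sed 's/^Need repair(\(.*\))$/\1/'
}
```

On a cluster node this might be used as `gs_om -t status --detail | cluster_state`. Note that repair_reasons consumes stdin too, so the status text has to be captured once (for example into a variable) if both values are needed.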

Once the primary host is available again, starting the instance with gs_ctl start -D brings it up in Normal mode by default, not in Primary mode.

[omm@gsdb01 ~]$ gs_ctl start -D /u01/openGauss/data/db1
[2020-07-15 09:35:12.777][21111][][gs_ctl]: gs_ctl started,datadir is -D "/u01/openGauss/data/db1"
[2020-07-15 09:35:12.829][21111][][gs_ctl]: waiting for server to start...
.0 [BACKEND] LOG: Begin to start openGauss Database.
2020-07-15 09:35:12.928 5f0e5d50.1 [unknown] 139956031434496 [unknown] 0 dn_6001_6002 DB001 0 [REDO] LOG: Recovery parallelism, cpu count = 4, max = 4, actual = 4
2020-07-15 09:35:12.928 5f0e5d50.1 [unknown] 139956031434496 [unknown] 0 dn_6001_6002 DB001 0 [REDO] LOG: ConfigRecoveryParallelism, true_max_recovery_parallelism:4, max_recovery_parallelism:4
2020-07-15 09:35:12.928 5f0e5d50.1 [unknown] 139956031434496 [unknown] 0 dn_6001_6002 00000 0 [BACKEND] LOG: Transparent encryption disabled.
2020-07-15 09:35:12.944 5f0e5d50.1 [unknown] 139956031434496 [unknown] 0 dn_6001_6002 00000 0 [BACKEND] LOG: InitNuma numaNodeNum: 1 numa_distribute_mode: none inheritThreadPool: 0.
2020-07-15 09:35:12.944 5f0e5d50.1 [unknown] 139956031434496 [unknown] 0 dn_6001_6002 01000 0 [BACKEND] WARNING: Failed to initialize the memory protect for g_instance.attr.attr_storage.cstore_buffers (1024 Mbytes) or shared memory (4213 Mbytes) is larger.
2020-07-15 09:35:13.024 5f0e5d50.1 [unknown] 139956031434496 [unknown] 0 dn_6001_6002 00000 0 [CACHE] LOG: set data cache size(805306368)
2020-07-15 09:35:13.053 5f0e5d50.1 [unknown] 139956031434496 [unknown] 0 dn_6001_6002 00000 0 [CACHE] LOG: set metadata cache size(268435456)
2020-07-15 09:35:13.302 5f0e5d50.1 [unknown] 139956031434496 [unknown] 0 dn_6001_6002 00000 0 [BACKEND] LOG: gaussdb: fsync file "/u01/openGauss/data/db1/gaussdb.state.temp" success
2020-07-15 09:35:13.302 5f0e5d50.1 [unknown] 139956031434496 [unknown] 0 dn_6001_6002 00000 0 [BACKEND] LOG: create gaussdb state file success: db state(STARTING_STATE), server mode(Normal)
2020-07-15 09:35:13.325 5f0e5d50.1 [unknown] 139956031434496 [unknown] 0 dn_6001_6002 00000 0 [BACKEND] LOG: max_safe_fds = 978, usable_fds = 1000, already_open = 12
2020-07-15 09:35:13.326 5f0e5d50.1 [unknown] 139956031434496 [unknown] 0 dn_6001_6002 00000 0 [BACKEND] LOG: Success to start openGauss Database, please press any key to exit...
[2020-07-15 09:35:13.843][21111][][gs_ctl]: done
[2020-07-15 09:35:13.843][21111][][gs_ctl]: server started (/u01/openGauss/data/db1)
[omm@gsdb01 ~]$ gs_om -t status --detail
[ Cluster State ]

cluster_state : Unavailable
redistributing : No
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------
1 gsdb01 192.168.0.195 6001 /u01/openGauss/data/db1 P Normal Normal | 2 gsdb02 192.168.0.96 6002 /u01/openGauss/data/db1 S Standby Need repair(Disconnected)
[omm@gsdb01 ~]$
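
The startup log itself states which mode the instance came up in: the "create gaussdb state file success" line ends with server mode(Normal). A small sketch for extracting that value from captured startup output follows. The function name started_server_mode is a hypothetical helper, and it assumes the log wording shown in this article.

```shell
#!/bin/sh
# Hypothetical helper (not an openGauss tool): reads captured gs_ctl startup
# output on stdin and prints the first "server mode(...)" value it finds.
# Assumes the log wording shown in this article, e.g.
#   ... db state(STARTING_STATE), server mode(Normal)
started_server_mode() {
    sed -n 's/.*server mode(\([A-Za-z_]*\)).*/\1/p' | head -n 1
}
```

If the startup output was saved to a file (say a hypothetical start.log), `started_server_mode < start.log` would print Normal for the run above, which is an early hint that the instance did not come back as Primary.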

Cluster status as seen from the standby: the cluster is Unavailable, the primary shows Normal, and the standby shows Standby Need repair(Disconnected).

[omm@gsdb02 ~]$ gs_om -t status --detail
[ Cluster State ]

cluster_state : Unavailable
redistributing : No
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------
1 gsdb01 192.168.0.195 6001 /u01/openGauss/data/db1 P Normal Normal | 2 gsdb02 192.168.0.96 6002 /u01/openGauss/data/db1 S Standby Need repair(Disconnected)
[omm@gsdb02 ~]$

The pg_log output shows: server_mode is NORMAL, could not accept HA connection

[BACKEND] LOG: Connecting to remote server :host=192.168.0.195 port=40001 localhost=192.168.0.96 localport=40001 dbname=replication replication=true fallback_application_name=dn_6002 connect_timeout=2
[BACKEND] FATAL: walreceiver could not connect to the remote server,the connection info :host=192.168.0.195 port=40001 localhost=192.168.0.96 localport=40001 : FATAL: the current t_thrd.postmaster_cxt.server_mode is NORMAL, could not accept HA connection.
FATAL: the current t_thrd.postmaster_cxt.server_mode is NORMAL, could not accept HA connection.
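
To confirm this symptom quickly on another system, the log can be scanned for the refusal message. The sketch below assumes the message wording shown above. The function name ha_refusal_mode is a hypothetical helper, and the location of the pg_log file is not given in this article, so the log text is simply read from stdin.

```shell
#!/bin/sh
# Hypothetical helper (not an openGauss tool): reads pg_log text on stdin and
# prints the server_mode named in the first "could not accept HA connection"
# message, or nothing if the message does not occur.
ha_refusal_mode() {
    sed -n 's/.*server_mode is \([A-Z]*\), could not accept HA connection.*/\1/p' | head -n 1
}
```

A non-empty result such as NORMAL indicates that the remote instance is running, but in a mode that does not accept replication connections.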

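Cause: when the primary instance is started in Normal mode, it does not hold the Primary role, so it refuses the HA (replication) connection from the standby's walreceiver and the standby stays in Need repair(Disconnected).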

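Fix: stop the primary and restart it with the primary role (gs_ctl start -D ... -M primary). The primary then reports Primary Normal and the standby changes to Standby Need repair(WAL). Run a build on the standby to rebuild it, after which the cluster returns to Normal.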

[omm@gsdb01 ~]$ gs_ctl stop -D /u01/openGauss/data/db1
[2020-07-15 10:04:28.977][5509][][gs_ctl]: gs_ctl stopped ,datadir is -D "/u01/openGauss/data/db1"
waiting for server to shut down........ done
server stopped
[omm@gsdb01 ~]$ gs_om -t status --detail
[ Cluster State ]

cluster_state : Unavailable
redistributing : No
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------
1 gsdb01 192.168.0.195 6001 /u01/openGauss/data/db1 P Down Manually stopped | 2 gsdb02 192.168.0.96 6002 /u01/openGauss/data/db1 S Standby Need repair(Disconnected)
[omm@gsdb01 ~]$ gs_ctl start -D /u01/openGauss/data/db1 -M primary
[2020-07-15 10:04:49.145][5860][][gs_ctl]: gs_ctl started,datadir is -D "/u01/openGauss/data/db1"
[2020-07-15 10:04:49.197][5860][][gs_ctl]: waiting for server to start...
.0 [BACKEND] LOG: Begin to start openGauss Database.
2020-07-15 10:04:49.298 5f0e6441.1 [unknown] 140154620174080 [unknown] 0 dn_6001_6002 DB001 0 [REDO] LOG: Recovery parallelism, cpu count = 4, max = 4, actual = 4
2020-07-15 10:04:49.298 5f0e6441.1 [unknown] 140154620174080 [unknown] 0 dn_6001_6002 DB001 0 [REDO] LOG: ConfigRecoveryParallelism, true_max_recovery_parallelism:4, max_recovery_parallelism:4
2020-07-15 10:04:49.298 5f0e6441.1 [unknown] 140154620174080 [unknown] 0 dn_6001_6002 00000 0 [BACKEND] LOG: Transparent encryption disabled.
2020-07-15 10:04:49.315 5f0e6441.1 [unknown] 140154620174080 [unknown] 0 dn_6001_6002 00000 0 [BACKEND] LOG: InitNuma numaNodeNum: 1 numa_distribute_mode: none inheritThreadPool: 0.
2020-07-15 10:04:49.315 5f0e6441.1 [unknown] 140154620174080 [unknown] 0 dn_6001_6002 01000 0 [BACKEND] WARNING: Failed to initialize the memory protect for g_instance.attr.attr_storage.cstore_buffers (1024 Mbytes) or shared memory (4213 Mbytes) is larger.
2020-07-15 10:04:49.396 5f0e6441.1 [unknown] 140154620174080 [unknown] 0 dn_6001_6002 00000 0 [CACHE] LOG: set data cache size(805306368)
2020-07-15 10:04:49.425 5f0e6441.1 [unknown] 140154620174080 [unknown] 0 dn_6001_6002 00000 0 [CACHE] LOG: set metadata cache size(268435456)
2020-07-15 10:04:49.679 5f0e6441.1 [unknown] 140154620174080 [unknown] 0 dn_6001_6002 00000 0 [BACKEND] LOG: gaussdb: fsync file "/u01/openGauss/data/db1/gaussdb.state.temp" success
2020-07-15 10:04:49.679 5f0e6441.1 [unknown] 140154620174080 [unknown] 0 dn_6001_6002 00000 0 [BACKEND] LOG: create gaussdb state file success: db state(STARTING_STATE), server mode(Primary)
2020-07-15 10:04:49.702 5f0e6441.1 [unknown] 140154620174080 [unknown] 0 dn_6001_6002 00000 0 [BACKEND] LOG: max_safe_fds = 978, usable_fds = 1000, already_open = 12
2020-07-15 10:04:49.703 5f0e6441.1 [unknown] 140154620174080 [unknown] 0 dn_6001_6002 00000 0 [BACKEND] LOG: Success to start openGauss Database, please press any key to exit...
[2020-07-15 10:04:50.210][5860][][gs_ctl]: done
[2020-07-15 10:04:50.210][5860][][gs_ctl]: server started (/u01/openGauss/data/db1)
[omm@gsdb01 ~]$ gs_om -t status --detail
[ Cluster State ]

cluster_state : Degraded
redistributing : No
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------
1 gsdb01 192.168.0.195 6001 /u01/openGauss/data/db1 P Primary Normal | 2 gsdb02 192.168.0.96 6002 /u01/openGauss/data/db1 S Standby Need repair(WAL)
[omm@gsdb01 ~]$

Cluster status as seen from the standby: the standby is in Standby Need repair(WAL)

[omm@gsdb02 ~]$ gs_om -t status --detail
[ Cluster State ]

cluster_state : Degraded
redistributing : No
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------
1 gsdb01 192.168.0.195 6001 /u01/openGauss/data/db1 P Primary Normal | 2 gsdb02 192.168.0.96 6002 /u01/openGauss/data/db1 S Standby Need repair(WAL)
[omm@gsdb02 ~]$
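
The states seen so far map onto the recovery steps of this case in a fixed way. The sketch below encodes that mapping as a decision helper that only prints a suggestion and runs nothing. The function name next_action is hypothetical, the matching relies on the status layout shown in this article, and it covers only the one-primary, one-standby situation described here.

```shell
#!/bin/sh
# Hypothetical helper (not an openGauss tool): reads `gs_om -t status --detail`
# output on stdin and prints the step this article took for that state.
# $1 is the data directory. It only prints text; it does not run gs_ctl.
next_action() {
    datadir=$1
    status=$(cat)
    if printf '%s\n' "$status" | grep -Eq '[[:space:]]P[[:space:]]+Normal[[:space:]]'; then
        # Primary slot is running in Normal mode: restart it with the primary role.
        echo "on primary: gs_ctl stop -D $datadir && gs_ctl start -D $datadir -M primary"
    elif printf '%s\n' "$status" | grep -Eq '[[:space:]]P[[:space:]]+Down[[:space:]]'; then
        # Primary slot is stopped: start it directly with the primary role.
        echo "on primary: gs_ctl start -D $datadir -M primary"
    elif printf '%s\n' "$status" | grep -Fq 'Need repair(WAL)'; then
        # Primary is back, standby has diverged: rebuild the standby.
        echo "on standby: gs_ctl build -D $datadir"
    elif printf '%s\n' "$status" | grep -Fq 'Need repair'; then
        echo "inspect manually"
    else
        echo "no action"
    fi
}
```

For the status shown just above, `gs_om -t status --detail | next_action /u01/openGauss/data/db1` would suggest the build on the standby.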

Rebuild the standby with gs_ctl build; the cluster state then returns to Normal.

[omm@gsdb02 ~]$ gs_ctl build -D /u01/openGauss/data/db1
[2020-07-15 10:06:07.024][5550][][gs_ctl]: gs_ctl incremental build ,datadir is -D "/u01/openGauss/data/db1"
waiting for server to shut down.... done
server stopped
[2020-07-15 10:06:08.054][5550][dn_6001_6002][gs_rewind]: set gaussdb state file when rewind:db state(BUILDING_STATE), server mode(STANDBY_MODE), build mode(INC_BUILD).
[2020-07-15 10:06:08.079][5550][dn_6001_6002][gs_rewind]: connected to server: host=192.168.0.195 port=40001 dbname=postgres application_name=gs_rewind connect_timeout=5 rw_timeout=10
[2020-07-15 10:06:08.082][5550][dn_6001_6002][gs_rewind]: connect to primary success
[2020-07-15 10:06:08.082][5550][dn_6001_6002][gs_rewind]: get pg_control success
[2020-07-15 10:06:08.082][5550][dn_6001_6002][gs_rewind]: target server was interrupted in mode 2.
[2020-07-15 10:06:08.082][5550][dn_6001_6002][gs_rewind]: sanityChecks success
[2020-07-15 10:06:08.082][5550][dn_6001_6002][gs_rewind]: find last checkpoint at 0/321ADA0 on timeline 1 from control file
[2020-07-15 10:06:08.083][5550][dn_6001_6002][gs_rewind]: The source slot restart_lsn at WAL position 0/321AA08.
[2020-07-15 10:06:08.084][5550][dn_6001_6002][gs_rewind]: The target slot restart_lsn at WAL position 0/300C838.
[2020-07-15 10:06:08.086][5550][dn_6001_6002][gs_rewind]: FindMaxLSN success find max lsn rec (0/321ADA0) success.
[2020-07-15 10:06:08.086][5550][dn_6001_6002][gs_rewind]: servers diverged at WAL position 0/321AA08.
[2020-07-15 10:06:08.086][5550][dn_6001_6002][gs_rewind]: the local diverge xlogfile is 000000010000000000000003, older xlog files will not be copied or removed.
[2020-07-15 10:06:08.086][5550][dn_6001_6002][gs_rewind]: find last common checkpoint at 0/321A980 on timeline 1, cooresponding redo point at 0/321A900
[2020-07-15 10:06:08.086][5550][dn_6001_6002][gs_rewind]: find diverge point success
[2020-07-15 10:06:08.086][5550][dn_6001_6002][gs_rewind]: read checkpoint redo (0/321A900) success before rewinding.
[2020-07-15 10:06:08.086][5550][dn_6001_6002][gs_rewind]: rewinding from checkpoint redo point at 0/321A900 on timeline 1
[2020-07-15 10:06:08.086][5550][dn_6001_6002][gs_rewind]: the CommonAncestor checkpoint xlogfile is 000000010000000000000003,older xlog files will not copy
[2020-07-15 10:06:08.086][5550][dn_6001_6002][gs_rewind]: targetFileStatThread success pid 140402068141824.
[2020-07-15 10:06:08.086][5550][dn_6001_6002][gs_rewind]: reading source file list
[2020-07-15 10:06:08.093][5550][dn_6001_6002][gs_rewind]: targetFileStatThread return success.
[2020-07-15 10:06:08.101][5550][dn_6001_6002][gs_rewind]: reading target file list
[2020-07-15 10:06:08.102][5550][dn_6001_6002][gs_rewind]: traverse target datadir success
[2020-07-15 10:06:08.102][5550][dn_6001_6002][gs_rewind]: reading WAL in target
[2020-07-15 10:06:08.102][5550][dn_6001_6002][gs_rewind]: could not read WAL record at 0/321AE28: invalid record length at 0/321AE28: wanted 32, got 0
[2020-07-15 10:06:08.103][5550][dn_6001_6002][gs_rewind]: calculate totals rewind success
[2020-07-15 10:06:08.103][5550][dn_6001_6002][gs_rewind]: need to copy 283MB (total source directory size is 348MB)
[2020-07-15 10:06:09.873][5550][dn_6001_6002][gs_rewind]: backup target files success
[2020-07-15 10:06:09.905][5550][dn_6001_6002][gs_rewind]: pg_xlog type 1.
[2020-07-15 10:06:09.905][5550][dn_6001_6002][gs_rewind]: remove file pg_tblspc/16407/PG_9.2_201611171_dn_6001_6002/pgsql_tmp, type 1
[2020-07-15 10:06:09.905][5550][dn_6001_6002][gs_rewind]: remove file global/pgstat.stat, type 0
[2020-07-15 10:06:09.905][5550][dn_6001_6002][gs_rewind]: remove file full_backup_label, type 0
[2020-07-15 10:06:09.905][5550][dn_6001_6002][gs_rewind]: remove file build_completed.done, type 0
[2020-07-15 10:06:09.908][5550][dn_6001_6002][gs_rewind]: receiving and unpacking files...
[2020-07-15 10:06:11.680][5550][dn_6001_6002][gs_rewind]: execute file map success
[2020-07-15 10:06:11.680][5550][dn_6001_6002][gs_rewind]: read checkpoint redo (0/321A900) success.
[2020-07-15 10:06:11.680][5550][dn_6001_6002][gs_rewind]: read checkpoint rec (0/321A980) success.
[2020-07-15 10:06:11.681][5550][dn_6001_6002][gs_rewind]: update pg_control file success
[2020-07-15 10:06:11.726][5550][dn_6001_6002][gs_rewind]: update pg_dw file success
[2020-07-15 10:06:11.726][5550][dn_6001_6002][gs_rewind]: creating backup label and updating control file
[2020-07-15 10:06:11.726][5550][dn_6001_6002][gs_rewind]: create backup label success
[2020-07-15 10:06:11.726][5550][dn_6001_6002][gs_rewind]: dn incremental build completed.
[2020-07-15 10:06:11.726][5550][dn_6001_6002][gs_rewind]: fetch MOT checkpoint
[2020-07-15 10:06:11.779][5550][dn_6001_6002][gs_ctl]: waiting for server to start...
.0 [BACKEND] LOG: Begin to start openGauss Database.
2020-07-15 10:06:11.880 5f0e6493.1 [unknown] 139835654893312 [unknown] 0 dn_6001_6002 DB001 0 [REDO] LOG: Recovery parallelism, cpu count = 4, max = 4, actual = 4
2020-07-15 10:06:11.880 5f0e6493.1 [unknown] 139835654893312 [unknown] 0 dn_6001_6002 DB001 0 [REDO] LOG: ConfigRecoveryParallelism, true_max_recovery_parallelism:4, max_recovery_parallelism:4
2020-07-15 10:06:11.880 5f0e6493.1 [unknown] 139835654893312 [unknown] 0 dn_6001_6002 00000 0 [BACKEND] LOG: Transparent encryption disabled.
2020-07-15 10:06:11.898 5f0e6493.1 [unknown] 139835654893312 [unknown] 0 dn_6001_6002 00000 0 [BACKEND] LOG: InitNuma numaNodeNum: 1 numa_distribute_mode: none inheritThreadPool: 0.
2020-07-15 10:06:11.898 5f0e6493.1 [unknown] 139835654893312 [unknown] 0 dn_6001_6002 01000 0 [BACKEND] WARNING: Failed to initialize the memory protect for g_instance.attr.attr_storage.cstore_buffers (1024 Mbytes) or shared memory (4213 Mbytes) is larger.
2020-07-15 10:06:11.977 5f0e6493.1 [unknown] 139835654893312 [unknown] 0 dn_6001_6002 00000 0 [CACHE] LOG: set data cache size(805306368)
2020-07-15 10:06:12.007 5f0e6493.1 [unknown] 139835654893312 [unknown] 0 dn_6001_6002 00000 0 [CACHE] LOG: set metadata cache size(268435456)
2020-07-15 10:06:12.258 5f0e6493.1 [unknown] 139835654893312 [unknown] 0 dn_6001_6002 00000 0 [BACKEND] LOG: gaussdb: fsync file "/u01/openGauss/data/db1/gaussdb.state.temp" success
2020-07-15 10:06:12.259 5f0e6493.1 [unknown] 139835654893312 [unknown] 0 dn_6001_6002 00000 0 [BACKEND] LOG: create gaussdb state file success: db state(STARTING_STATE), server mode(Standby)
2020-07-15 10:06:12.281 5f0e6493.1 [unknown] 139835654893312 [unknown] 0 dn_6001_6002 00000 0 [BACKEND] LOG: max_safe_fds = 978, usable_fds = 1000, already_open = 12
2020-07-15 10:06:12.283 5f0e6493.1 [unknown] 139835654893312 [unknown] 0 dn_6001_6002 00000 0 [BACKEND] LOG: Success to start openGauss Database, please press any key to exit....
[2020-07-15 10:06:13.798][5550][dn_6001_6002][gs_ctl]: done
[2020-07-15 10:06:13.798][5550][dn_6001_6002][gs_ctl]: server started (/u01/openGauss/data/db1)
[omm@gsdb02 ~]$ gs_om -t status --detail
[ Cluster State ]

cluster_state : Normal
redistributing : No
current_az : AZ_ALL

[ Datanode State ]

node node_ip instance state | node node_ip instance state
------------------------------------------------------------------------------------------------------------------------------------------------------------
1 gsdb01 192.168.0.195 6001 /u01/openGauss/data/db1 P Primary Normal | 2 gsdb02 192.168.0.96 6002 /u01/openGauss/data/db1 S Standby Normal
[omm@gsdb02 ~]$
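
In this run the cluster was already Normal when the status was checked right after the build. On a larger database the standby may take longer to catch up, so a polling check can be handy. The following is a sketch only: wait_for_normal is a hypothetical helper, and the retry count and delay are arbitrary values chosen by the caller.

```shell
#!/bin/sh
# Hypothetical helper (not an openGauss tool).
# Usage: wait_for_normal TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to TRIES times and returns 0 as soon as its output contains
# "cluster_state : Normal"; returns 1 if that never happens.
wait_for_normal() {
    tries=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$@" 2>/dev/null | grep -Eq 'cluster_state[[:space:]]*:[[:space:]]*Normal'; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, `wait_for_normal 20 3 gs_om -t status --detail` would poll the status up to 20 times, three seconds apart, and stop at the first Normal result.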
Last modified: 2020-07-16 09:16:22