openGauss
Replacing a Failed Node
Background
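When a node fails or is replaced (with the hostname and IP kept identical to the original host), this procedure restores the failed or replacement node by copying the app binaries and the om files from a healthy node, then uses gs_ctl build (building from a standby) to rejoin the node to the existing cluster.
This was verified in a test environment with no load on the database; test carefully before doing this in production.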
Cluster information
2023-08-04 07:43:24 [line:905] INFO <module> 94105
[ Cluster State ]
cluster_state : Normal
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
---------------------------------------------------------------------------------------
1 pghost3 192.168.56.30 26000 6001 /app/ogdata/data/dn1 P Primary Normal
2 pghost5 192.168.56.50 26000 6002 /app/ogdata/data/dn1 S Standby Normal
3 pghost6 192.168.56.60 26000 6003 /app/ogdata/data/dn1 S Standby Normal
Simulating the failure
root@pghost6 /app# rm -rf ogdata/
root@pghost6 /app# rm -rf opengauss/
root@pghost6 /app# rm -rf ogxlog/
root@pghost6 /app# rm -rf ogarchive/
kill -9 ${GAUSSDB-PID}
Cluster status
omm@pghost3 ~$ gs_om -t status --detail
[ Cluster State ]
cluster_state : Degraded
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
---------------------------------------------------------------------------------------
1 pghost3 192.168.56.30 26000 6001 /app/ogdata/data/dn1 P Primary Normal
2 pghost5 192.168.56.50 26000 6002 /app/ogdata/data/dn1 S Standby Normal
3 pghost6 192.168.56.60 26000 6003 /app/ogdata/data/dn1 S Unknown Unknown
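On a larger cluster it can help to pick the unhealthy rows out of the status output instead of reading it by eye. The following is a minimal sketch, not part of the original procedure. The function name abnormal_instances is hypothetical, and it assumes that datanode rows start with a numeric node id and end with the instance state, as in the output shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `gs_om -t status --detail` output on stdin and
# print "hostname ip state" for every datanode row whose state is not Normal.
# Assumption: datanode rows begin with a numeric node id and end with the state.
abnormal_instances() {
    awk '$1 ~ /^[0-9]+$/ && $NF != "Normal" { print $2, $3, $NF }'
}
```

It could be used as `gs_om -t status --detail | abnormal_instances`; empty output would mean every listed instance reports Normal.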
Recovery
Install python3
Whether python3 needs to be installed depends on the system.
Copy directories and files
Add the node mappings to /etc/hosts
omm@pghost6 ~$ more /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.56.60 pghost6
192.168.56.30 pghost3
192.168.56.50 pghost5
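Before copying anything, it may be worth confirming that every cluster hostname really has a mapping. The sketch below is an illustration only: missing_hosts is a hypothetical helper, and it treats any whole-field match after the address column on a non-comment line as a valid entry.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each hostname that has no entry in a hosts file.
# Usage: missing_hosts HOSTS_FILE HOSTNAME...
missing_hosts() {
    local hosts_file=$1; shift
    local h
    for h in "$@"; do
        # Skip comment lines; look for the hostname as a whole field after the address.
        awk -v h="$h" '
            !/^[ \t]*#/ { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
            END { exit !found }
        ' "$hosts_file" || echo "$h"
    done
}
```

For the cluster in this article the call would be `missing_hosts /etc/hosts pghost3 pghost5 pghost6`, which prints nothing when all three mappings are present.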
Copy the directories from node pghost5 into the corresponding directory on the failed node pghost6.
# Create the corresponding directory on pghost6
root@pghost6 /app# mkdir opengauss
root@pghost6 /app# chown omm: opengauss
# Copy the app and tool directories
omm@pghost5 /app/opengauss$ scp -r app omm pghost6:/app/opengauss/
omm@pghost5 /app/opengauss$ scp -r tool pghost6:/app/opengauss/
Copy pg_hba.conf and postgresql.conf from node pghost5 into the corresponding directory on the failed node pghost6.
root@pghost6 /app# mkdir -p ogdata/data/dn1
root@pghost6 /app# chown omm: ogdata/data/dn1/
omm@pghost5 /app/ogdata/data/dn1$ scp pg_hba.conf postgresql.conf pghost6:/app/ogdata/data/dn1/
Change the corresponding values in postgresql.conf
local_bind_address = '192.168.56.60'
replconninfo1 = 'localhost=192.168.56.60 localport=26001 localheartbeatport=26005 localservice=26004 remotehost=192.168.56.30 remoteport=26001 remoteheartbeatport=26005 remoteservice=26004'
replconninfo2 = 'localhost=192.168.56.60 localport=26001 localheartbeatport=26005 localservice=26004 remotehost=192.168.56.50 remoteport=26001 remoteheartbeatport=26005 remoteservice=26004'
synchronous_standby_names = 'ANY 1(dn_6001,dn_6002)'
log_directory = '/app/opengauss/gaussdb_log/omm/pg_log/dn_6003'
audit_directory = '/app/opengauss/gaussdb_log/omm/pg_audit/dn_6003'
application_name = 'dn_6003'
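The article edits these values by hand. If the same change has to be repeated, a small helper can rewrite one parameter at a time. This is only a sketch under stated assumptions: set_conf is a hypothetical function, it rewrites uncommented "key = value" lines and appends the key when no such line exists, and it assumes the value contains no backslashes (awk would interpret them).

```shell
#!/usr/bin/env bash
# Hypothetical helper: set KEY to VALUE in a postgresql.conf-style file.
# Usage: set_conf FILE KEY VALUE   (VALUE includes its own quotes)
set_conf() {
    local file=$1 key=$2 value=$3 tmp rc
    tmp=$(mktemp) || return 1
    awk -v k="$key" -v v="$value" '
        # Rewrite every active (uncommented) assignment of this key.
        $0 ~ ("^[ \t]*" k "[ \t]*=") { print k " = " v; seen = 1; next }
        { print }
        # Append the setting if the file had no active line for it.
        END { if (!seen) print k " = " v }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps the file owner and mode
    rc=$?
    rm -f "$tmp"
    return $rc
}
```

For example, `set_conf /app/ogdata/data/dn1/postgresql.conf application_name "'dn_6003'"` would apply the last value in the list above. Commented-out lines are left untouched, so review the file afterwards.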
Add the following to .bashrc
PATH="$HOME/.local/bin:$HOME/bin:$PATH"
export PATH
export GPHOME=/app/opengauss/tool
export PATH=$GPHOME/script/gspylib/pssh/bin:$GPHOME/script:$PATH
export LD_LIBRARY_PATH=$GPHOME/lib:$LD_LIBRARY_PATH
export PYTHONPATH=$GPHOME/lib
export GAUSSHOME=/app/opengauss/app/2.0.1
export PATH=$GAUSSHOME/bin:$PATH
export LD_LIBRARY_PATH=$GAUSSHOME/lib:$LD_LIBRARY_PATH
export S3_CLIENT_CRT_FILE=$GAUSSHOME/lib/client.crt
export GAUSS_VERSION=3.0.3
export PGHOST=/app/opengauss/tmp
export GAUSSLOG=/app/opengauss/gaussdb_log/omm
umask 077
export GAUSS_ENV=2
export GS_CLUSTER_NAME=gauss_omm
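The first build attempt later in this article fails because the directory named by PGHOST (/app/opengauss/tmp) does not exist on the rebuilt host. A minimal sketch for creating such directories up front is shown here; ensure_dirs is a hypothetical helper, and which directories to pass beyond PGHOST is an assumption to verify for your own layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create each listed directory if it is missing
# and report the ones that had to be created.
ensure_dirs() {
    local d
    for d in "$@"; do
        if [ ! -d "$d" ]; then
            mkdir -p "$d" && echo "created $d"
        fi
    done
}
```

After sourcing .bashrc as the omm user, `ensure_dirs "$PGHOST"` would create /app/opengauss/tmp when it is absent.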
Synchronize the data with build
# Build from a standby
gs_ctl build -D /app/ogdata/data/dn1 -b standby_full -C "localhost=192.168.56.60 localport=26000 remotehost=192.168.56.50 remoteport=26000"
0 LOG: [Alarm Module]can not read GAUSS_WARNING_TYPE env.
0 LOG: [Alarm Module]Host Name: pghost6
0 LOG: [Alarm Module]Host IP: pghost6. Copy hostname directly in case of taking 10s to use 'gethostbyname' when /etc/hosts does not contain <HOST IP>
0 LOG: [Alarm Module]Cluster Name: gauss_omm
0 LOG: [Alarm Module]Invalid data in AlarmItem file! Read alarm English name failed! line: 57
0 WARNING: failed to open feature control file, please check whether it exists: FileName=gaussdb.version, Errno=2, Errmessage=No such file or directory.
0 WARNING: failed to parse feature control file: gaussdb.version.
0 WARNING: Failed to load the product control file, so gaussdb cannot distinguish product version.
The core dump path is an invalid directory
[2023-08-04 08:36:04.775][68234][][gs_ctl]: gs_ctl standby full build ,datadir is /app/ogdata/data/dn1,conn_str is 'localhost=192.168.56.60 localport=26000 remotehost=192.168.56.50 remoteport=26000'
[2023-08-04 08:36:04.775][68234][][gs_ctl]: fopen build pid file "/app/ogdata/data/dn1/gs_build.pid" success
[2023-08-04 08:36:04.775][68234][][gs_ctl]: fprintf build pid file "/app/ogdata/data/dn1/gs_build.pid" success
[2023-08-04 08:36:04.779][68234][][gs_ctl]: fsync build pid file "/app/ogdata/data/dn1/gs_build.pid" success
[2023-08-04 08:36:04.780][68234][][gs_ctl]: stop failed, killing gaussdb by force ...
[2023-08-04 08:36:04.780][68234][][gs_ctl]: command [ps c -eo pid,euid,cmd | grep gaussdb | grep -v grep | awk '{if($2 == curuid && $1!="-n") print "/proc/"$1"/cwd"}' curuid=`id -u`| xargs ls -l | awk '{if ($NF=="/app/ogdata/data/dn1") print $(NF-2)}' | awk -F/ '{print $3 }' | xargs kill -9 >/dev/null 2>&1 ] path: [/app/ogdata/data/dn1]
[2023-08-04 08:36:04.812][68234][][gs_ctl]: server stopped
[2023-08-04 08:36:04.812][68234][][gs_ctl]: current workdir is (/home/omm).
[2023-08-04 08:36:04.814][68234][][gs_ctl]: set gaussdb state file when standby full build build:db state(BUILDING_STATE), server mode(STANDBY_MODE), build mode(FULL_BUILD).
[2023-08-04 08:36:04.814][68234][dn_6001_6002_6003][gs_ctl]: Get repl_auth_mode is and repl_uuid is
[2023-08-04 08:36:04.915][68234][dn_6001_6002_6003][gs_ctl]: standby build try host(192.168.56.50) port(26000) success
[2023-08-04 08:36:04.915][68234][dn_6001_6002_6003][gs_ctl]: connected to server success, build started.
[2023-08-04 08:36:04.915][68234][dn_6001_6002_6003][gs_ctl]: clear old target dir success
[2023-08-04 08:36:04.915][68234][dn_6001_6002_6003][gs_ctl]: create build tag file success
[2023-08-04 08:36:04.916][68234][dn_6001_6002_6003][gs_ctl]: create build tag file again success
[2023-08-04 08:36:04.916][68234][dn_6001_6002_6003][gs_ctl]: get system identifier success
[2023-08-04 08:36:04.916][68234][dn_6001_6002_6003][gs_ctl]: receiving and unpacking files...
[2023-08-04 08:36:04.916][68234][dn_6001_6002_6003][gs_ctl]: create backup label success
[2023-08-04 08:36:07.391][68234][dn_6001_6002_6003][gs_ctl]: xlog start point: 0/5008718
[2023-08-04 08:36:07.391][68234][dn_6001_6002_6003][gs_ctl]: begin build tablespace list
[2023-08-04 08:36:07.391][68234][dn_6001_6002_6003][gs_ctl]: finish build tablespace list
[2023-08-04 08:36:07.391][68234][dn_6001_6002_6003][gs_ctl]: begin get xlog by xlogstream
[2023-08-04 08:36:07.391][68234][dn_6001_6002_6003][gs_ctl]: starting background WAL receiver
[2023-08-04 08:36:07.391][68234][dn_6001_6002_6003][gs_ctl]: starting walreceiver
[2023-08-04 08:36:07.391][68234][dn_6001_6002_6003][gs_ctl]: begin receive tar files
[2023-08-04 08:36:07.392][68234][dn_6001_6002_6003][gs_ctl]: receiving and unpacking files...
[2023-08-04 08:36:07.424][68234][dn_6001_6002_6003][gs_ctl]: standby build try host(192.168.56.50) port(26000) success
[2023-08-04 08:36:07.429][68234][dn_6001_6002_6003][gs_ctl]: check identify system success
[2023-08-04 08:36:07.437][68234][dn_6001_6002_6003][gs_ctl]: send START_REPLICATION 0/5000000 success
[2023-08-04 08:36:12.641][68234][dn_6001_6002_6003][gs_ctl]: finish receive tar files
[2023-08-04 08:36:12.641][68234][dn_6001_6002_6003][gs_ctl]: xlog end point: 0/5008838
[2023-08-04 08:36:12.642][68234][dn_6001_6002_6003][gs_ctl]: fetching MOT checkpoint
[2023-08-04 08:36:12.820][68234][dn_6001_6002_6003][gs_ctl]: waiting for background process to finish streaming...
[2023-08-04 08:36:18.521][68234][dn_6001_6002_6003][gs_ctl]: starting fsync all files come from source.
[2023-08-04 08:36:26.308][68234][dn_6001_6002_6003][gs_ctl]: finish fsync all files.
[2023-08-04 08:36:26.313][68234][dn_6001_6002_6003][gs_ctl]: build dummy dw file success
[2023-08-04 08:36:26.313][68234][dn_6001_6002_6003][gs_ctl]: rename build status file success
[2023-08-04 08:36:26.321][68234][dn_6001_6002_6003][gs_ctl]: standby full build build completed(/app/ogdata/data/dn1).
[2023-08-04 08:36:26.758][68234][dn_6001_6002_6003][gs_ctl]: waiting for server to start...
.0 LOG: [Alarm Module]can not read GAUSS_WARNING_TYPE env.
0 LOG: [Alarm Module]Host Name: pghost6
0 LOG: [Alarm Module]Host IP: pghost6. Copy hostname directly in case of taking 10s to use 'gethostbyname' when /etc/hosts does not contain <HOST IP>
0 LOG: [Alarm Module]Cluster Name: gauss_omm
0 LOG: [Alarm Module]Invalid data in AlarmItem file! Read alarm English name failed! line: 57
0 WARNING: failed to open feature control file, please check whether it exists: FileName=gaussdb.version, Errno=2, Errmessage=No such file or directory.
0 WARNING: failed to parse feature control file: gaussdb.version.
0 WARNING: Failed to load the product control file, so gaussdb cannot distinguish product version.
The core dump path is an invalid directory
2023-08-04 08:36:26.895 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [REDO] LOG: Recovery parallelism, cpu count = 1, max = 4, actual = 1
2023-08-04 08:36:26.895 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [REDO] LOG: ConfigRecoveryParallelism, true_max_recovery_parallelism:4, max_recovery_parallelism:4
2023-08-04 08:36:27.010 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [BACKEND] LOG: [Alarm Module]can not read GAUSS_WARNING_TYPE env.
2023-08-04 08:36:27.010 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [BACKEND] LOG: [Alarm Module]Host Name: pghost6
2023-08-04 08:36:27.010 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [BACKEND] LOG: [Alarm Module]Host IP: pghost6. Copy hostname directly in case of taking 10s to use 'gethostbyname' when /etc/hosts does not contain <HOST IP>
2023-08-04 08:36:27.010 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [BACKEND] LOG: [Alarm Module]Cluster Name: gauss_omm
2023-08-04 08:36:27.010 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [BACKEND] LOG: [Alarm Module]Invalid data in AlarmItem file! Read alarm English name failed! line: 57
2023-08-04 08:36:27.139 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [BACKEND] LOG: loaded library "security_plugin"
2023-08-04 08:36:27.144 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [BACKEND] WARNING: could not create any HA TCP/IP sockets
2023-08-04 08:36:27.144 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [BACKEND] FATAL: could not create lock file "/app/opengauss/tmp/.s.PGSQL.26000.lock": No such file or directory
[2023-08-04 08:36:27.159][68234][dn_6001_6002_6003][gs_ctl]: waitpid 68257 failed, exitstatus is 256, ret is 2
[2023-08-04 08:36:27.159][68234][dn_6001_6002_6003][gs_ctl]: stopped waiting
[2023-08-04 08:36:27.159][68234][dn_6001_6002_6003][gs_ctl]: could not start server
Examine the log output.
[2023-08-04 08:36:27.159][68234][dn_6001_6002_6003][gs_ctl]: fopen build pid file "/app/ogdata/data/dn1/gs_build.pid" success
[2023-08-04 08:36:27.159][68234][dn_6001_6002_6003][gs_ctl]: fprintf build pid file "/app/ogdata/data/dn1/gs_build.pid" success
[2023-08-04 08:36:27.164][68234][dn_6001_6002_6003][gs_ctl]: fsync build pid file "/app/ogdata/data/dn1/gs_build.pid" success
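The build output is long, and the decisive line is the FATAL entry near the end. As a rough sketch, a filter like the one below can surface it. The name fatal_messages is hypothetical, and the filter assumes that fatal lines contain the literal marker "FATAL:" or "PANIC:" before the message, as in the log above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a gs_ctl/gaussdb log on stdin and print only
# the message text of FATAL or PANIC lines.
fatal_messages() {
    grep -E '(FATAL|PANIC):' | sed -E 's/^.*(FATAL|PANIC): *//'
}
```

One way to use it is to save the build output to a file and then run `fatal_messages < build.log`, where build.log is whatever file name you chose.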
Error 1
could not create lock file "/app/opengauss/tmp/.s.PGSQL.26000.lock": No such file or directory
# Create the /app/opengauss/tmp directory, then run build again.
.........................................
[2023-08-04 08:45:10.215][68303][dn_6001_6002_6003][gs_ctl]: done
[2023-08-04 08:45:10.533][68303][dn_6001_6002_6003][gs_ctl]: server started (/app/ogdata/data/dn1)
[2023-08-04 08:45:10.602][68303][dn_6001_6002_6003][gs_ctl]: fopen build pid file "/app/ogdata/data/dn1/gs_build.pid" success
[2023-08-04 08:45:10.602][68303][dn_6001_6002_6003][gs_ctl]: fprintf build pid file "/app/ogdata/data/dn1/gs_build.pid" success
[2023-08-04 08:45:11.525][68303][dn_6001_6002_6003][gs_ctl]: fsync build pid file "/app/ogdata/data/dn1/gs_build.pid" success
# The log above shows that the build succeeded. Checking the process and the cluster status confirms that the cluster is back to normal.
omm@pghost6 /app/ogdata/data/dn1$ ps x
PID TTY STAT TIME COMMAND
68150 pts/0 S 0:00 -bash
68320 ? Ssl 0:03 /app/opengauss/app/2.0.1/bin/gaussdb -D /app/ogdata/data/dn1 -M standby
68377 pts/0 R+ 0:00 ps x
omm@pghost6 /app/ogdata/data/dn1$ gs_om -t status --detail
[ Cluster State ]
cluster_state : Normal
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
---------------------------------------------------------------------------------------
1 pghost3 192.168.56.30 26000 6001 /app/ogdata/data/dn1 P Primary Normal
2 pghost5 192.168.56.50 26000 6002 /app/ogdata/data/dn1 S Standby Normal
3 pghost6 192.168.56.60 26000 6003 /app/ogdata/data/dn1 S Standby Normal
Remove invalid directories under pg_tblspc
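If the pghost6 node was repaired by first installing a single-node cluster and then running build, check the size of the invalid files under the pg_tblspc directory once the repair succeeds. If they are large, consider deleting them so they do not take up significant disk space.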
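To see how much space is involved before deciding, the entries under pg_tblspc can be listed by size. This is a sketch only: tblspc_usage is a hypothetical helper, and the -L option is used on the assumption that entries there may be symbolic links. The script does not decide which entries are invalid; that judgment stays with the operator.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list entries under DATADIR/pg_tblspc with their
# size in KiB, largest first. Prints nothing if the directory is empty.
tblspc_usage() {
    local datadir=$1
    # -s: one total per entry, -k: KiB units, -L: follow symbolic links
    du -skL "$datadir"/pg_tblspc/* 2>/dev/null | sort -rn
}
```

For the node in this article the call would be `tblspc_usage /app/ogdata/data/dn1`.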
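[Copyright notice] This article is original content by a 墨天轮 (modb.pro) user. Reposts must credit the source (墨天轮), the article link, and the author; otherwise the author and 墨天轮 reserve the right to pursue liability. If you find content on 墨天轮 suspected of plagiarism or infringement, please report it by email to contact@modb.pro with supporting evidence; once verified, 墨天轮 will remove the content immediately.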