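Background
This happened in our department's test environment: a MogDB cluster with one primary and one standby, version 2.1.1. The cluster was healthy before both machines lost power. After the power was restored, the operating systems booted normally, but starting the MogDB cluster failed.
Primary IP: 192.168.137.110
Standby IP: 192.168.137.111
Symptoms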
$ gs_om -t start
Starting cluster.
=========================================
=========================================
[GAUSS-53600]: Can not start the database, the cmd is . /home/omm/.bashrc; python3 '/dbdata/app/tools/script/local/StartInstance.py' -U omm -R /dbdata/app/mogdb -t 300 --security-mode=off, Error:
[FAILURE] master:
[GAUSS-51607] : Failed to start instance. Error: Please check the gs_ctl log for failure details.
[2022-09-08 17:47:43.905][1718][][gs_ctl]: gs_ctl started,datadir is /dbdata/data/db1
[2022-09-08 17:47:51.180][1718][][gs_ctl]: waiting for server to start...
.0 LOG: [Alarm Module]can not read GAUSS_WARNING_TYPE env.
0 LOG: [Alarm Module]Host Name: master
0 LOG: [Alarm Module]Host IP: 192.168.137.110
0 LOG: [Alarm Module]Cluster Name: dbCluster
..2022-09-08 17:47:53.433 6319ba47.1 [unknown] 140561613260352 [unknown] 0 dn_6001_6002 DB010 0 [REDO] LOG: Recovery parallelism, cpu count = 2, max = 4, actual = 2
2022-09-08 17:47:53.433 6319ba47.1 [unknown] 140561613260352 [unknown] 0 dn_6001_6002 DB010 0 [REDO] LOG: ConfigRecoveryParallelism, true_max_recovery_parallelism:4, max_recovery_parallelism:4
Failed to read gaussdb.state: 0Failed to set gaussdb.state with UNKNOWN_STATE[2022-09-08 17:47:54.185][1718][][gs_ctl]: waitpid 1722 failed, exitstatus is 256, ret is 2
[2022-09-08 17:47:54.185][1718][][gs_ctl]: stopped waiting
[2022-09-08 17:47:54.185][1718][][gs_ctl]: could not start server
Examine the log output.
[FAILURE] standby:
[GAUSS-51607] : Failed to start instance. Error: Please check the gs_ctl log for failure details.
[2022-09-08 17:48:00.805][1344][][gs_ctl]: gs_ctl started,datadir is /dbdata/data/db1
[2022-09-08 17:48:02.935][1344][][gs_ctl]: waiting for server to start...
.0 LOG: [Alarm Module]can not read GAUSS_WARNING_TYPE env.
0 LOG: [Alarm Module]Host Name: standby
0 LOG: [Alarm Module]Host IP: 192.168.137.111
0 LOG: [Alarm Module]Cluster Name: dbCluster
2022-09-08 17:48:03.632 6319ba53.1 [unknown] 139726745114176 [unknown] 0 dn_6001_6002 DB010 0 [REDO] LOG: Recovery parallelism, cpu count = 2, max = 4, actual = 2
2022-09-08 17:48:03.632 6319ba53.1 [unknown] 139726745114176 [unknown] 0 dn_6001_6002 DB010 0 [REDO] LOG: ConfigRecoveryParallelism, true_max_recovery_parallelism:4, max_recovery_parallelism:4
Failed to read gaussdb.state: 0Failed to set gaussdb.state with UNKNOWN_STATE[2022-09-08 17:48:03.937][1344][][gs_ctl]: waitpid 1347 failed, exitstatus is 256, ret is 2
[2022-09-08 17:48:03.937][1344][][gs_ctl]: stopped waiting
[2022-09-08 17:48:03.937][1344][][gs_ctl]: could not start server
Examine the log output.
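The start output above mixes error codes, per-node results and kernel log lines, so the parts worth looking up are easy to miss. The following is a minimal sketch, not part of the MogDB tooling, that pulls the GAUSS error codes and the failed node names out of such text. The function name `summarize_start_output` is hypothetical, and the two patterns are assumptions based only on the output shown in this article.

```python
import re

# Assumed pattern: error codes look like [GAUSS-53600] in the output above.
GAUSS_CODE = re.compile(r"\[(GAUSS-\d{5})\]")
# Assumed pattern: per-node failures look like "[FAILURE] master:".
NODE_FAILURE = re.compile(r"\[FAILURE\]\s+(\S+?):")


def summarize_start_output(text):
    """Return (error codes, failed node names), each de-duplicated in order of appearance."""
    codes = list(dict.fromkeys(GAUSS_CODE.findall(text)))
    nodes = list(dict.fromkeys(NODE_FAILURE.findall(text)))
    return codes, nodes
```

Feeding it the captured output of `gs_om -t start` would give the two codes to look up in the manual and the nodes that failed.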
Analysis
1. Check the cluster status
# su - omm
$ gs_om -t status --detail
[ Cluster State ]
cluster_state : Unavailable
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
-----------------------------------------------------------------------------------
1 master 192.168.137.110 26000 6001 /dbdata/data/db1 P Down Manually stopped
2 standby 192.168.137.111 26000 6002 /dbdata/data/db1 S Down Manually stopped
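If this check needs to be repeated, for example after each repair attempt, the overall state can be read from the status text with a small helper. This is only a sketch: `cluster_state` is a hypothetical name, and the `cluster_state : <value>` layout is assumed from the output shown in this article rather than from any documented format.

```python
import re


def cluster_state(status_text):
    """Return the value of the cluster_state field, or None if the field is absent."""
    match = re.search(r"cluster_state\s*:\s*(\S+)", status_text)
    return match.group(1) if match else None
```

For the output above it would return `Unavailable`; a healthy cluster reports `Normal`.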
2. Check the database version
$ gs_ctl --V
gs_ctl (openGauss) 9.2.4
$ mogdb -V
gaussdb (MogDB 2.1.1 build b5f25b20) compiled at 2022-03-21 14:42:30 commit 0 last mr
3. Check the logs
# Find the log directory
cat /dbdata/data/db1/postgresql.conf |grep -i log_dir
log_directory = '/dbdata/log/omm/pg_log/dn_6001' # directory where log files are written,
List the log files
cd /dbdata/log/omm/pg_log/dn_6001
ls -l
-rw------- 1 omm dbgrp 100076 Sep 8 15:47 postgresql-2022-09-08_144356.log
-rw------- 1 omm dbgrp 0 Sep 8 17:04 postgresql-2022-09-08_170442.log
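The newest log file is 0 bytes: nothing is being written to the server log any more, so it gives no further clues.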
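The same observation can be made without reading the listing by eye. The sketch below finds the most recently modified `*.log` file in a directory and reports its size. The helper name `newest_log` is hypothetical, and matching on the `.log` suffix is an assumption taken from the file names in the listing.

```python
from pathlib import Path


def newest_log(log_dir):
    """Return (path, size in bytes) of the most recently modified *.log file, or None."""
    logs = sorted(Path(log_dir).glob("*.log"), key=lambda p: p.stat().st_mtime)
    if not logs:
        return None
    latest = logs[-1]
    return latest, latest.stat().st_size
```

A size of 0 for the newest file, as seen here, means the server stopped before it could write anything to its own log.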
4. Check the official manual
Look up the error codes [GAUSS-53600] and [GAUSS-51607] in the official manual.
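GAUSS-53600: "CA password must contain at least eight characters."
SQLSTATE: none
Cause: internal system error.
Solution: contact a technical support engineer.
GAUSS-51607: "Failed to start %s."
Cause: failed to start the cluster, node or instance.
Solution: 1. Check that the network connection is working. 2. Check that the configuration files are correct.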
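5. Check the source code
The error output mentions the file gaussdb.state. Searching the official manual for gaussdb.state turned up no topic.
Searching the official source code for the message "Failed to read gaussdb.state" leads to the relevant code:
postmaster.cpp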
/*
 * Only update gaussdb.state file's state field.
 *
 * PARAMETERS:
 *      state: INPUT new state
 * RETURN:
 *      true if success, otherwise false.
 *
 * NOTE: unsafe function is not expected here since it is referred in signal handler.
 */
bool SetDBStateFileState(DbState state, bool optional)
{
    /* do nothing while core dump be appeared so early. */
    if (strlen(gaussdb_state_file) > 0) {
        char temppath[MAXPGPATH] = {0};
        GaussState s;
        int len = 0;

        /* zero it in case gaussdb.state doesn't exist. */
        int rc = memset_s(&s, sizeof(GaussState), 0, sizeof(GaussState));
        securec_check_c(rc, "\0", "\0");

        rc = snprintf_s(temppath, MAXPGPATH, MAXPGPATH - 1, "%s.temp", gaussdb_state_file);
        securec_check_intval(rc, , false);

        /* Write the new content into a temp file and rename it at last. */
        int fd = open(gaussdb_state_file, O_RDONLY);
        if (fd == -1) {
            if (errno == ENOENT && optional) {
                write_stderr("gaussdb.state does not exist, and skipt setting since it is optional.");
                return true;
            } else {
                write_stderr("Failed to open gaussdb.state.temp: %d", errno);
                return false;
            }
        }

        /* Read old content from file. */
        len = read(fd, &s, sizeof(GaussState));
        /* sizeof(int) is for current_connect_idx of GaussState */
        if ((len != sizeof(GaussState)) && (len != sizeof(GaussState) - sizeof(int))) {
            write_stderr("Failed to read gaussdb.state: %d", errno);
            (void)close(fd);
            return false;
        }
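The function SetDBStateFileState is defined in postmaster.cpp. When MogDB starts, it reads gaussdb.state in order to update the recorded database state. In this case the number of bytes read matched neither of the two accepted lengths, sizeof(GaussState) or sizeof(GaussState) - sizeof(int), so the function printed the error, returned false, and startup was aborted. The trailing 0 in "Failed to read gaussdb.state: 0" is errno: read() itself did not fail, it simply returned fewer bytes than expected.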
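The length check in the excerpt can be approximated from outside the database before attempting a start. The sketch below is an illustration under stated assumptions, not the kernel's logic: the helper name `classify_state_file` is hypothetical, and the default accepted size of 72 bytes is simply the size of the healthy gaussdb.state shown later in this article. The real value is sizeof(GaussState) for the build in use and may differ between versions.

```python
import os


def classify_state_file(path, accepted_sizes=(72,)):
    """Return 'missing', 'ok' or 'bad-size' for a gaussdb.state file.

    accepted_sizes is an assumption: 72 is the size observed on a healthy
    MogDB 2.1.1 instance in this article, not a documented constant.
    """
    try:
        size = os.path.getsize(path)
    except FileNotFoundError:
        return "missing"
    return "ok" if size in accepted_sizes else "bad-size"
```

A 0-byte file is classified as `bad-size`, which corresponds to the branch that prints "Failed to read gaussdb.state" in the excerpt.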
Solution
1. Inspect gaussdb.state
cd /dbdata/data/db1/
ll gaussdb.state
-rw------- 1 omm dbgrp 0 Sep 8 17:04 gaussdb.state
The permissions and ownership are normal, but a file size of 0 is not.
cat gaussdb.state
The command prints nothing.
vi gaussdb.state
The editor shows an empty file.
2. Replace gaussdb.state
rm -f gaussdb.state
Copy a gaussdb.state from another, healthy MogDB environment to both the primary and the standby.
cp gaussdb.state
ll gaussdb.state
-rw-r--r-- 1 root root 72 Sep 9 15:14 gaussdb.state
chown omm.dbgrp gaussdb.state
ll gaussdb.state
-rw-r--r-- 1 omm dbgrp 72 Sep 9 15:24 gaussdb.state
Inspect the healthy gaussdb.state
cat gaussdb.state
Nothing visible is printed, because the content is binary.
vi gaussdb.state ^B^@^@^@^A^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
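Opening a binary file in vi risks saving it by accident and makes the values hard to read. As a read-only alternative, the sketch below decodes the file as a sequence of 32-bit integers. It assumes little-endian 4-byte words, which fits the `^B^@^@^@^A^@^@^@^A^@^@^@` prefix shown above on an x86 host, and it deliberately does not name the fields, since the GaussState layout is not given in this article. Both function names are hypothetical.

```python
import struct


def dump_words(data):
    """Decode bytes as little-endian signed 32-bit integers; a trailing partial word is ignored."""
    count = len(data) // 4
    return list(struct.unpack("<%di" % count, data[: count * 4]))


def read_state_words(path):
    """Read a file in binary mode and return its content as a list of 32-bit words."""
    with open(path, "rb") as handle:
        return dump_words(handle.read())
```

For the 72-byte file above this yields 18 words, the first three being 2, 1 and 1 and the rest 0.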
3. Start the cluster and check its status
gs_om -t start
Starting cluster.
=========================================
[SUCCESS] master
[SUCCESS] standby
=========================================
Successfully started.
gs_om -t status --detail
[ Cluster State ]
cluster_state : Normal
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
-----------------------------------------------------------------------------------
1 master 192.168.137.110 26000 6001 /dbdata/data/db1 P Primary Normal
2 standby 192.168.137.111 26000 6002 /dbdata/data/db1 S Standby Normal
Further experiments
1. Does the database still start if gaussdb.state is deleted by mistake?
Delete gaussdb.state:
rm -f gaussdb.state
Start the database:
gs_ctl -D /dbdata/data/db1/ start
Startup succeeds, and a new gaussdb.state is generated.
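Since a missing gaussdb.state was regenerated at startup in this experiment, a damaged file could be moved aside instead of being deleted with `rm -f`, so that it remains available for later inspection. The sketch below shows one way to do that. It is an assumption-laden illustration: `move_aside` is a hypothetical helper, it should only be run while the instance is stopped, and it relies on the behaviour observed here on MogDB 2.1.1 rather than on documented behaviour.

```python
import os
import time


def move_aside(path, suffix=None):
    """Rename path to path.<suffix> and return the new name; return None if path is absent."""
    if not os.path.exists(path):
        return None
    suffix = suffix or time.strftime("corrupt-%Y%m%d%H%M%S")
    target = "%s.%s" % (path, suffix)
    if os.path.exists(target):
        # Refuse to overwrite an earlier copy that was already set aside.
        raise FileExistsError(target)
    os.rename(path, target)
    return target
```

After the rename, starting the instance with `gs_ctl -D /dbdata/data/db1/ start` would be expected to create a fresh gaussdb.state, as it did in the experiment above.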
2. Does the database still start if the content of gaussdb.state is modified?
vi gaussdb.state
Delete the existing content, type in a few arbitrary digits, and save.
Start the database:
gs_ctl -D /dbdata/data/db1/ start
The failure described above is reproduced.
Remove the invalid content, put the original byte string back, and save:
^B^@^@^@^A^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
The database then starts successfully.
Summary
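1. Always start and stop the database cleanly and keep the power supply reliable. A sudden power loss can easily corrupt data files and leave the database in an abnormal state.
2. When the database fails to start, work out the cause from the error output or the error logs. The official manual and a keyword search of the official source code are both useful.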
References
MogDB official manual: https://docs.mogdb.io/zh/mogdb/v3.0/overview
openGauss source code: https://gitee.com/opengauss/openGauss-server/blob/master/src/gausskernel/process/postmaster/postmaster.cpp#L877