
PanWeiDB (磐维数据库) 2.0.2 in a virtual machine: the database fails to start at the end of installation

Original by 飞天 · 2024-07-08

Preface

I often get asked this question: "When I run gs_install to install PanWeiDB 2.0.2 inside a virtual machine, why won't the database start? I've already reinstalled it several times..."


Today I ran into it again, so here is the whole troubleshooting process, written up in the hope that it helps others.

Reproducing the problem

Running gs_install inside the virtual machine to install PanWeiDB 2.0.2. The -X option takes an openGauss-style cluster topology XML; a sketch of such a file is shown below, followed by the full gs_install transcript.
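For context, here is a minimal sketch of what a one-primary-two-standby topology file might contain, following openGauss cluster_config.xml conventions. This is not the author's actual panweidb1m2s.xml: every name, path, IP, and port is illustrative, and the exact parameter keys should be checked against the PanWeiDB/openGauss installation documentation.

<?xml version="1.0" encoding="utf-8"?>
<ROOT>
  <CLUSTER>
    <!-- cluster-wide settings: name, member nodes, install/log paths (illustrative) -->
    <PARAM name="clusterName" value="panweidb" />
    <PARAM name="nodeNames" value="node1,node2,node3" />
    <PARAM name="gaussdbAppPath" value="/database/panweidb/app" />
    <PARAM name="gaussdbLogPath" value="/database/panweidb/log" />
    <PARAM name="backIp1s" value="10.0.0.50,10.0.0.52,10.0.0.54" />
  </CLUSTER>
  <DEVICELIST>
    <!-- node1 hosts the CM server primary and the datanode primary;
         node2 and node3 follow the same pattern with their own IPs -->
    <DEVICE sn="node1">
      <PARAM name="name" value="node1" />
      <PARAM name="backIp1" value="10.0.0.50" />
      <PARAM name="sshIp1" value="10.0.0.50" />
      <PARAM name="cmsNum" value="1" />
      <PARAM name="cmServerPortBase" value="15300" />
      <PARAM name="cmServerRelation" value="node1,node2,node3" />
      <PARAM name="cmDir" value="/database/panweidb/cm" />
      <PARAM name="dataNum" value="1" />
      <PARAM name="dataPortBase" value="26000" />
      <PARAM name="dataNode1" value="/database/panweidb/data,node2,/database/panweidb/data,node3,/database/panweidb/data" />
    </DEVICE>
    <!-- ... DEVICE entries for node2 and node3 ... -->
  </DEVICELIST>
</ROOT>

The gs_install transcript: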

[omm@node1 ~]$ gs_install -X /database/panweidb/soft/panweidb1m2s.xml \
> --gsinit-parameter="--encoding=UTF8" \
> --gsinit-parameter="--lc-collate=C" \
> --gsinit-parameter="--lc-ctype=C" \
> --gsinit-parameter="--dbcompatibility=B"
Parsing the configuration file.
Successfully checked gs_uninstall on every node.
Check preinstall on every node.
Successfully checked preinstall on every node.
Creating the backup directory.
Successfully created the backup directory.
begin deploy..
Installing the cluster.
begin prepare Install Cluster..
Checking the installation environment on all nodes.
begin install Cluster..
Installing applications on all nodes.
Successfully installed APP.
begin init Instance..
encrypt cipher and rand files for database.
Please enter password for database:
Please repeat for database:
begin to create CA cert files
The sslcert will be generated in /database/panweidb/app/share/sslcert/om
Create CA files for cm beginning.
Create CA files on directory [/database/panweidb/app_2b900fc/share/sslcert/cm]. file list: ['client.key.rand', 'client.crt', 'cacert.pem', 'server.key', 'server.key.rand', 'server.crt', 'client.key.cipher', 'client.key', 'server.key.cipher']
Non-dss_ssl_enable, no need to create CA for DSS
Cluster installation is completed.
Configuring.
Deleting instances from all nodes.
Successfully deleted instances from all nodes.
Checking node configuration on all nodes.
Initializing instances on all nodes.
Updating instance configuration on all nodes.
Check consistence of memCheck and coresCheck on database nodes.
Successful check consistence of memCheck and coresCheck on all nodes.
Warning: The license file does not exist, so there is no need to copy it to the home directory.
Configuring pg_hba on all nodes.
Configuration is completed.
Starting cluster.
======================================================================
[GAUSS-51607] : Failed to start cluster. Error: 
cm_ctl: checking cluster status.
cm_ctl: checking cluster status.
cm_ctl: checking finished in 610 ms.
cm_ctl: start cluster. 
cm_ctl: start nodeid: 1
cm_ctl: start nodeid: 2
cm_ctl: start nodeid: 3
..........................................................................................................................................................................................................................................................................................................
cm_ctl: start cluster failed in (300)s!

HINT: Maybe the cluster is continually being started in the background.
You can wait for a while and check whether the cluster starts, or increase the value of parameter "-t", e.g -t 600.
The cluster may continue to start in the background.
If you want to see the cluster status, please try command gs_om -t status.
If you want to stop the cluster, please try command gs_om -t stop.
[GAUSS-51607] : Failed to start cluster. Error: 
cm_ctl: checking cluster status.
cm_ctl: checking cluster status.
cm_ctl: checking finished in 610 ms.
cm_ctl: start cluster. 
cm_ctl: start nodeid: 1
cm_ctl: start nodeid: 2
cm_ctl: start nodeid: 3
..........................................................................................................................................................................................................................................................................................................
cm_ctl: start cluster failed in (300)s!

HINT: Maybe the cluster is continually being started in the background.
You can wait for a while and check whether the cluster starts, or increase the value of parameter "-t", e.g -t 600.
[omm@node1 ~]$ 
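Per the hint at the end, the start can simply be retried with a longer timeout, e.g.:

[omm@node1 ~]$ cm_ctl start -t 600

In this case, though, a longer timeout alone would not have helped: as the next section shows, the datanodes were failing outright rather than starting slowly.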

Troubleshooting process

Checking the cluster status from another node in the cluster: the CM Server instances have started normally, but the Datanode State is abnormal.

[omm@node2 ~]$ gs_om -t status --detail
[  CMServer State   ]

node     node_ip         instance                             state
---------------------------------------------------------------------
1  node1 *.*.*.50  1    /database/panweidb/cm/cm_server Primary
2  node2 *.*.*.52  2    /database/panweidb/cm/cm_server Standby
3  node3 *.*.*.54  3    /database/panweidb/cm/cm_server Standby

[   Cluster State   ]

cluster_state   : Unavailable
redistributing  : No
balanced        : No
current_az      : AZ_ALL

[  Datanode State   ]

node     node_ip         instance                     state            
-----------------------------------------------------------------------
1  node1 *.*.*.50  6001 /database/panweidb/data P Down    Unknown
2  node2 *.*.*.52  6002 /database/panweidb/data S Down    Unknown
3  node3 *.*.*.54  6003 /database/panweidb/data S Down    Unknown

Check the database log on the primary datanode

[omm@node1 ~]$ cd /database/panweidb/log/omm/pg_log/dn_6001/
[omm@node1 dn_6001]$ ls -l
total 0

As you can see, no database log has been generated at all.

Try to start the single database instance manually on the primary datanode

[omm@node1 dn_6001]$ gs_ctl start -D /database/panweidb/data
[2024-07-08 19:44:12.084][84774][][gs_ctl]: gs_ctl started,datadir is /database/panweidb/data 
[2024-07-08 19:44:12.135][84774][][gs_ctl]: waiting for server to start...
...<omitted>
...<omitted>
2024-07-08 19:44:12.267 668bd10c.1 [unknown] 139661863719424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  Set max backend reserve memory is: 660 MB, max dynamic memory is: 4192174 MB
2024-07-08 19:44:12.267 668bd10c.1 [unknown] 139661863719424 [unknown] 0 dn_6001_6002_6003 42809  0 [BACKEND] FATAL:  the values of memory out of limit, the database failed to be started, max_process_memory (2048MB) must greater than 2GB + cstore_buffers(512MB) + (udf_memory_limit(200MB) - UDF_DEFAULT_MEMORY(200MB)) + shared_buffers(647MB) + preserved memory(3018MB) = 6225MB, reduce the value of shared_buffers, max_pred_locks_per_transaction, max_connection, wal_buffers..etc will help reduce the size of preserved memory
2024-07-08 19:44:12.275 668bd10c.1 [unknown] 139661863719424 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  FiniNuma allocIndex: 0.
[2024-07-08 19:44:13.136][84774][][gs_ctl]: waitpid 84777 failed, exitstatus is 256, ret is 2

[2024-07-08 19:44:13.136][84774][][gs_ctl]: stopped waiting
[2024-07-08 19:44:13.136][84774][][gs_ctl]: could not start server
Examine the log output.

The key error is: "the database failed to be started, max_process_memory (2048MB) must greater than …".
At this point the cause is clear: the instance cannot start because max_process_memory is smaller than the sum of the memory areas the server needs to reserve.
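Plugging the numbers from the FATAL message into its own formula shows how far short the default falls:

2048 (2GB floor) + 512 (cstore_buffers) + (200 - 200) (udf_memory_limit - UDF_DEFAULT_MEMORY) + 647 (shared_buffers) + 3018 (preserved memory) = 6225 MB

So max_process_memory must be set above 6225MB, which is why the fix below raises it to 7000MB.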

Solution

Option 1:
Use gs_guc to raise max_process_memory above the sum reported in the log.
Option 2:
Shrink the other components instead: shared_buffers, cstore_buffers, max_pred_locks_per_transaction, max_connections, and so on.

Here we go with Option 1 and raise max_process_memory with gs_guc. Option 2 is left for you to try; a sketch of what it might look like follows.
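For reference, Option 2 would reuse the same gs_guc syntax to shrink the individual components. The values below are purely illustrative and would need to be sized against the formula in the FATAL message:

[omm@node1 ~]$ gs_guc set -N all -I all -c "shared_buffers=256MB"
[omm@node1 ~]$ gs_guc set -N all -I all -c "cstore_buffers=128MB"

Applying Option 1: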

[omm@node1 dn_6001]$ gs_guc set -N all -I all -c "max_process_memory=7000MB"
The pw_guc run with the following arguments: [gs_guc -N all -I all -c max_process_memory=7000MB set ].
Begin to perform the total nodes: 3.
Popen count is 3, Popen success count is 3, Popen failure count is 0.
Begin to perform gs_guc for datanodes.
Command count is 3, Command success count is 3, Command failure count is 0.

Total instances: 3. Failed instances: 0.
ALL: Success to perform gs_guc!

max_process_memory only takes effect after a restart, so restart the cluster:
[omm@node1 dn_6001]$ cm_ctl stop
cm_ctl: stop cluster. 
cm_ctl: stop nodeid: 1
cm_ctl: stop nodeid: 2
cm_ctl: stop nodeid: 3
............
cm_ctl: stop cluster successfully.
[omm@node1 dn_6001]$ cm_ctl start
cm_ctl: checking cluster status.
cm_ctl: checking cluster status.
cm_ctl: checking finished in 600 ms.
cm_ctl: start cluster. 
cm_ctl: start nodeid: 1
cm_ctl: start nodeid: 2
cm_ctl: start nodeid: 3
...................................
cm_ctl: start cluster successfully.

Check the cluster status

[omm@node1 dn_6001]$ gs_om -t status --detail
[  CMServer State   ]

node     node_ip         instance                             state
---------------------------------------------------------------------
1  node1 *.*.*.50  1    /database/panweidb/cm/cm_server Primary
2  node2 *.*.*.52  2    /database/panweidb/cm/cm_server Standby
3  node3 *.*.*.54  3    /database/panweidb/cm/cm_server Standby

[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
balanced        : Yes
current_az      : AZ_ALL

[  Datanode State   ]

node     node_ip         instance                     state            
-----------------------------------------------------------------------
1  node1 *.*.*.50  6001 /database/panweidb/data P Primary Normal
2  node2 *.*.*.52  6002 /database/panweidb/data S Standby Normal
3  node3 *.*.*.54  6003 /database/panweidb/data S Standby Normal

At this point, the database cluster has started normally.
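As an optional sanity check, you can confirm the new value is active by connecting with gsql and running SHOW. The port 26000 below is a placeholder; use your cluster's actual datanode port (dataPortBase):

[omm@node1 ~]$ gsql -d postgres -p 26000 -c "SHOW max_process_memory;"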

Summary

Sharing the process of working through a problem is its own reward; I hope everyone takes something away from it.
And I hope we can keep exchanging ideas and improving together!

Last modified: 2024-07-10 10:49:38