暂无图片
暂无图片
8
暂无图片
暂无图片
3
暂无图片

某国产数据库元旦首日挂起,背后原因揭秘

285

大家好,我是 JiekeXu,江湖人称“强哥”,青学会 MOP 技术社区主席,荣获 Oracle ACE Pro 称号,墨天轮 MVP,墨天轮年度“墨力之星”,拥有 Oracle OCP/OCM 认证,MySQL 5.7/8.0 OCP 认证以及 PCA、PCTA、OBCA、OGCA、金仓KCA、KCP 等众多国产数据库认证证书,今天和大家一起来看看 遇到一个比较奇葩的 Oracle 小故障,欢迎关注我的微信公众号“JiekeXu DBA之路”,然后点击右上方三个点“设为星标”置顶,更多干货文章才能第一时间推送,谢谢!

facebook_pro_light_1920 × 1080  副本.png

前 言

新年第一天开门红需要现场支持,还在路上就看到了 Prometheus 告警,不过还好这是我们自己测试环境,只有 DBA 在使用,没有其他影响。一套国产数据库 DM 数据守护(DM Data Watch)发出告警,实例状态异常。

Alert Name: 集群实例状态异常 告警级别: 严重 实例: 192.168.216.249:9201 实例标签: 单机测试2 描述: 主机:192.168.216.249:9201 集群实例状态异常 开始时间: 2025-01-01 08:03:14 结束时间:
复制

正 文

DM 数据守护(DM Data Watch)的实现原理非常简单:将主库(生产库)产生的 Redo 日志传输到备库,备库接收并重新应用 Redo 日志,从而实现备库与主库的数据同步。DM 数据守护的核心思想是监控数据库状态,获取主、备库数据同步情况,为 Redo 日志传输与重演过程中出现的各种异常情况提供一系列的解决方案。DM 数据守护系统结构参考图如下,主要由主库、备库、Redo 日志、Redo 日志传输、Redo 日志重演、守护进程(dmwatcher)、监视器(dmmonitor)组成。

image.png

到达公司以后,登录数据库实例发现是挂起状态,检查节点进程都还在。

[dmdba@jieke-dba-249:~]$ disql /:25238 as sysdba 服务器[LOCALHOST:15236]:处于主库挂起状态 登录使用时间 : 42.517(ms) 密钥过期时间:2024-12-25 disql V8
复制
--主节点 [dmdba@jieke-dba-249:~]$ ps -ef | grep dmdbms dmdba 16235 1 0 2024 ? 00:38:09 /data/dm8/dmdbms/bin/dmap dmdba 18519 1 0 2024 ? 17:15:47 /data/dm8/dmdbms/bin/dmserver path=/data/dm8/data/DAMENG/dm.ini -noconsole mount dmdba 19573 1 0 2024 ? 05:11:28 /data/dm8/dmdbms/bin/dmwatcher path=/data/dm8/data/DAMENG/dmwatcher.ini -noconsole --从节点 [dmdba@jieke-dba-250:~]$ ps -ef | grep dmdbms dmdba 15394 15190 0 10:20 pts/9 00:00:00 grep --color=auto dmdbms dmdba 18709 1 0 2024 ? 13:51:40 /data/dm8/dmdbms/bin/dmserver /data/dm8/data/DAMENG/dm.ini mount dmdba 25400 1 0 2024 ? 00:17:51 /data/dm8/dmdbms/bin/dmap dmdba 27464 1 0 2024 ? 00:21:35 /data/dm8/dmdbms/bin/dmwatcher path=/data/dm8/data/DAMENG/dmwatcher.ini -noconsole
复制

检查 dmwatcher 和 dm_DMSERVER 日志发现不能连接到从节点的私网地址 10.0.10.250,当然还有 License 过期的警告和致命错误,这个不是问题所在,也不影响问题排查。

[dmdba@jieke-dba-249:~]$ cd /data/dm8/dmdbms/log [dmdba@jieke-dba-249:/data/dm8/dmdbms/log]$ vi dm_dmwatcher_DMSERVER_202412.log 2024-12-31 15:17:04.350 [WARNING] dmwatcher P0000019573 T0000000000000007350 Tcp port from dmwatcher vio(7) close, ip:10.0.10.250, port:15436, mid(-1), is confirm(FALSE), from name:DMSERVER1 2024-12-31 15:17:04.350 [INFO] dmwatcher P0000019573 T0000000000000007350 dw2_tcp_port_close_low, reset report_err to TRUE, ip:10.0.10.250, port:15436! 2024-12-31 15:17:04.350 [INFO] dmwatcher P0000019573 T0000000000000007349 dw2_tcp_port_close_low, reset report_err to TRUE, ip:10.0.10.250, port:15436! 2024-12-31 15:17:04.962 [INFO] dmwatcher P0000019573 T0000000000000019577 没有收到远程守护进程(DMSERVER1)消息,原状态为(OPEN),距上次接收消息时间间隔(8)s, 设置远程守护进程为ERROR状态 2024-12-31 15:17:04.963 [INFO] dmwatcher P0000019573 T0000000000000019577 Local instance: 守护进程状态(OPEN) 实例状态(OK) 实例名(DMSERVER) 模式(PRIMARY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(17) FLSN(82822) CLSN(82822) SLSN(82822) SSLSN(82822) 2024-12-31 15:17:04.963 [INFO] dmwatcher P0000019573 T0000000000000019577 Instance: 守护进程状态(ERROR) 实例状态(OK) 实例名(DMSERVER1) 模式(STANDBY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(17) FLSN(82822) CLSN(82822) SLSN(82822) SSLSN(82822) 2024-12-31 15:17:10.061 [ERROR] dmwatcher P0000019573 T0000000000000019579 Can't connect to DM server on '10.0.10.250' port(15436) errno(115) 2024-12-31 15:17:11.062 [INFO] dmwatcher P0000019573 T0000000000000019579 dw2_send_port_set to dmwatcher vio(7) set, mid(-1), to name:DMSERVER1, ip:10.0.10.250 2024-12-31 15:17:11.084 [INFO] dmwatcher P0000019573 T0000000000000028041 远程实例的模式、状态或者归档状态发生变化,原状态是: 2024-12-31 15:17:11.084 [INFO] dmwatcher P0000019573 T0000000000000028041 Instance: 守护进程状态(ERROR) 实例状态(OK) 实例名(DMSERVER1) 模式(STANDBY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(17) FLSN(82822) CLSN(82822) SLSN(82822) SSLSN(82822) 2024-12-31 15:17:11.084 [INFO] dmwatcher P0000019573 T0000000000000028041 远程实例的模式、状态或者归档状态发生变化,新状态是: 2024-12-31 15:17:11.084 [INFO] dmwatcher P0000019573 T0000000000000028041 Instance: 守护进程状态(OPEN) 实例状态(OK) 实例名(DMSERVER1) 模式(STANDBY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(17) FLSN(82822) CLSN(82822) SLSN(82822) SSLSN(82822) 2024-12-31 15:21:50.371 [INFO] dmwatcher P0000019573 T0000000000000028041 socket_err_should_retry errno:110 2024-12-31 15:21:50.372 [INFO] dmwatcher P0000019573 T0000000000000028041 dw2_group_tcp_recv_thread, receive tcp msg failed, close tcp port, vio:7, mid:-1, from_flag:0, ip:10.0.10.250, port:15436, errno:110, error:Failure occurs in data_recv_inet_once, code(110) len(8128, 0), code:-6007 2024-12-31 15:21:50.372 [WARNING] dmwatcher P0000019573 T0000000000000028041 Tcp port from dmwatcher vio(7) close, ip:10.0.10.250, port:15436, mid(-1), is confirm(FALSE), from name:DMSERVER1 2024-12-31 15:21:50.372 [INFO] dmwatcher P0000019573 T0000000000000028041 dw2_tcp_port_close_low, reset report_err to TRUE, ip:10.0.10.250, port:15436! 2024-12-31 15:21:50.372 [INFO] dmwatcher P0000019573 T0000000000000028040 dw2_tcp_port_close_low, reset report_err to TRUE, ip:10.0.10.250, port:15436! 2024-12-31 15:21:51.358 [INFO] dmwatcher P0000019573 T0000000000000019577 没有收到远程守护进程(DMSERVER1)消息,原状态为(OPEN),距上次接收消息时间间隔(8)s, 设置远程守护进程为ERROR状态 2024-12-31 15:21:51.358 [INFO] dmwatcher P0000019573 T0000000000000019577 Local instance: 守护进程状态(OPEN) 实例状态(OK) 实例名(DMSERVER) 模式(PRIMARY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(17) FLSN(82822) CLSN(82822) SLSN(82822) SSLSN(82822) 2024-12-31 15:21:51.358 [INFO] dmwatcher P0000019573 T0000000000000019577 Instance: 守护进程状态(ERROR) 实例状态(OK) 实例名(DMSERVER1) 模式(STANDBY) 实例状态(OPEN) 归档状态(UNKNOWN) POCNT(17) FLSN(82822) CLSN(82822) SLSN(82822) SSLSN(82822) 2024-12-31 15:21:56.116 [ERROR] dmwatcher P0000019573 T0000000000000019579 Can't connect to DM server on '10.0.10.250' port(15436) errno(115) 2024-12-31 15:23:02.772 [INFO] dmwatcher P0000019573 T0000000000000027988 socket_err_should_retry errno:110 2024-12-31 15:23:02.773 [INFO] dmwatcher P0000019573 T0000000000000027988 dw2_group_tcp_recv_thread, receive tcp msg failed, close tcp port, vio:5, mid:1741977666, from_flag:1, ip:::ffff:10.0.10.251, port:48989, errno:110, error:Failure occurs in data_recv_inet_once, code(110) len(8128, 0), code:-6007 [dmdba@jieke-dba-249:/data/dm8/dmdbms/log]$ vim dm_DMSERVER_202412.log 2024-12-31 15:21:50.937 [WARNING] database P0000018519 T0000000000000018662 mal_site_letter_recv code=-6007, errno=110, site(0) recv from site(1) failed, socket handle = 13 2024-12-31 15:21:50.937 [WARNING] database P0000018519 T0000000000000018662 MAL receive site(0) lost connect to site(1), ctl_handle(12), data_handle(13), dsc_handle(0) 2024-12-31 15:21:50.937 [WARNING] database P0000018519 T0000000000000018662 site(0) data_link mal_site_letter_recv from site(1) failed, socket handle = 0, mal sys status is 0, try to get mal_port again 2024-12-31 15:21:50.938 [WARNING] database P0000018519 T0000000000000018661 mal_site_letter_recv code=-6007, errno=107, site(0) recv from site(1) failed, socket handle = 0 2024-12-31 15:21:50.938 [WARNING] database P0000018519 T0000000000000018661 MAL receive site(0) lost connect to site(1), ctl_handle(0), data_handle(0), dsc_handle(0) 2024-12-31 15:21:50.938 [WARNING] database P0000018519 T0000000000000018661 site(0) ctl_link mal_site_letter_recv from site(1) failed, socket handle = 0, mal sys status is 0, try to get mal_port again 2024-12-31 15:21:50.939 [WARNING] database P0000018519 T0000000000000018660 mal_site_tsk_check site(0) connect lost to site(1), socket handle = 0, mal sys status = 0, try get port again 2024-12-31 15:21:50.939 [INFO] database P0000018519 T0000000000000018660 send CMD_MAL_LINK_CHECK(350): (mal_id:9770275204219, stmt_id:9721658, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:0, dest_site:1, build_time:-1) 2024-12-31 15:21:50.939 [INFO] database P0000018519 T0000000000000018660 send CMD_MAL_LINK_CHECK(350): (mal_id:9770275204219, stmt_id:9721660, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:0, dest_site:1, build_time:9061601) 2024-12-31 15:21:50.941 [WARNING] database P0000018519 T0000000000000018659 mal_site_tsk_check site(0) connect lost to site(1), socket handle = 0, mal sys status = 0, try get port again 2024-12-31 15:21:50.941 [INFO] database P0000018519 T0000000000000018659 send CMD_MAL_LINK_CHECK(350): (mal_id:9770275204219, stmt_id:9721659, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:0, dest_site:1, build_time:9061678) 2024-12-31 15:21:50.941 [INFO] database P0000018519 T0000000000000018659 send CMD_MAL_LINK_CHECK(350): (mal_id:9770277353748, stmt_id:9721661, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:0, dest_site:1, build_time:-1) 2024-12-31 15:22:06.361 [ERROR] database P0000018519 T0000000000000018666 Can't connect to DM server on '10.0.10.250' port(15336) errno(115) 2024-12-31 15:24:23.659 [INFO] database P0000018519 T0000000000000018710 checkpoint requested by CKPT_INTERVAL, rlog free space[2147475456], used space[0] 2024-12-31 15:24:23.659 [INFO] database P0000018519 T0000000000000018710 checkpoint generate by ckpt_interval 2024-12-31 15:24:23.660 [INFO] database P0000018519 T0000000000000018670 checkpoint begin, used_space[0], free_space[2147475456]... ...... 2024-12-31 17:13:06.065 [WARNING] database P0000018519 T0000000000000018681 License has already expired. 2024-12-31 17:13:06.065 [FATAL] database P0000018519 T0000000000000018681 License has already expired, contact Dameng Company please. [dmdba@jieke-dba-249:~]$ cd /data/dm8/dmdbms/log [dmdba@jieke-dba-249:/data/dm8/dmdbms/log]$ tail -9000 dm_DMSERVER_202501.log | more 2025-01-01 01:03:36.152 [INFO] database P0000018519 T0000000000000018670 checkpoint end, 0 pages flushed, used_space[0], free_space[2147475456]. 2025-01-01 01:05:38.231 [ERROR] database P0000018519 T0000000000000018671 [realtime arch] mal send to site(DMSERVER1, mal_seqno:1) failed, begin lsn:82823, end lsn:82826, code:-6010 2025-01-01 01:05:38.238 [ERROR] database P0000018519 T0000000000000018671 send realtime archive to instance[DMSERVER1] failed, code = -6010, begin_lsn = 82823, end_lsn = 82826! 2025-01-01 01:05:38.238 [INFO] database P0000018519 T0000000000000018671 rlog4_process_arch_failed, need_suspend:1 2025-01-01 01:05:38.240 [INFO] database P0000018519 T0000000000000018671 Send archive log to remote instance failed, switch all ep to SUSPEND status success! 2025-01-01 01:06:36.373 [INFO] database P0000018519 T0000000000000018710 checkpoint requested by CKPT_INTERVAL, rlog free space[2147471360], used space[4096] 2025-01-01 01:06:36.373 [INFO] database P0000018519 T0000000000000018710 checkpoint generate by ckpt_interval 2025-01-01 01:06:36.374 [INFO] database P0000018519 T0000000000000018670 checkpoint begin, used_space[4096], free_space[2147471360]...
复制

企业微信截图_173572590270.png

发现 dmwatcher 2024-12-31 15:17 左右就出现了错误,不能通过私网 IP 连接到从节点,但 dm_SERVER 直到 2025-01-01 01:05 左右才报不能归档到从节点,挂起实例。

当登录到 250 从节点后,实例是正常的,也可以正常登录。去查看 dm_SERVER 实例日志,发现昨天 2024-12-31 15:17 左右有 MAL 系统连接丢失的报错。

2024-12-31 15:17:09.981 [WARNING] database P0000018709 T0000000000000018764 mal_site_letter_recv code=-6007, errno=104, site(1) recv from site(0) failed, socket handle = 8 2024-12-31 15:17:09.999 [WARNING] database P0000018709 T0000000000000018764 MAL receive site(1) lost connect to site(0), ctl_handle(8), data_handle(9), dsc_handle(0) 2024-12-31 15:17:09.999 [WARNING] database P0000018709 T0000000000000018763 mal_site_tsk_check site(1) connect lost to site(0), socket handle = 0, mal sys status = 0, try get port again 2024-12-31 15:17:09.999 [WARNING] database P0000018709 T0000000000000018765 mal_site_letter_recv code=-6007, errno=0, site(1) recv from site(0) failed, socket handle = 0 2024-12-31 15:17:09.999 [WARNING] database P0000018709 T0000000000000018762 mal_site_tsk_check site(1) connect lost to site(0), socket handle = 0, mal sys status = 0, try get port again 2024-12-31 15:17:10.018 [WARNING] database P0000018709 T0000000000000018764 site(1) ctl_link mal_site_letter_recv from site(0) failed, socket handle = 0, mal sys status is 0, try to get mal_port again 2024-12-31 15:17:10.018 [WARNING] database P0000018709 T0000000000000018765 MAL receive site(1) lost connect to site(0), ctl_handle(0), data_handle(0), dsc_handle(0) 2024-12-31 15:17:10.021 [INFO] database P0000018709 T0000000000000018762 send CMD_MAL_LINK_CHECK(350): (mal_id:9770275204219, stmt_id:9721324, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:0, dest_site:0, build_time:123990) 2024-12-31 15:17:10.019 [INFO] database P0000018709 T0000000000000018763 send CMD_MAL_LINK_CHECK(350): (mal_id:9770275204219, stmt_id:9721323, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:0, dest_site:0, build_time:-1) 2024-12-31 15:17:10.017 [ERROR] database P0000018709 T0000000000000018778 self_site(1) to dest_site(0) port_closed, return EC_CONNECT_LOST
复制

这个时候想到了昨天下午好像有同事对 10.0 网段的网卡做了什么操作,一看居然都没有10.0 网段的网卡,幸好今天他也来值班,说是网卡子网掩码有问题导致有应用测试连不了,所以关闭了此网卡。。。赶紧让他去改了子网掩码重新 up 此网卡,不一会儿查看日志实例就恢复了。

-- 修改 eth1 子网掩码后启动此网卡 cd /etc/sysconfig/network-scripts/ vim ifcfg-eth1 ifup ifcfg-eth1
复制

恢复日志

2025-01-01 10:14:12.512 [INFO] database P0000018519 T0000000000000018670 checkpoint end, 0 pages flushed, used_space[8192], free_space[2147467264]. 2025-01-01 10:14:13.076 [INFO] database P0000018519 T0000000000000018666 mal_site_ctl_link_create startup from mal_site(0) to mal_site(1)! 2025-01-01 10:14:13.076 [INFO] database P0000018519 T0000000000000018666 mal_site_magic_gen site_magic[1604], src_site:0, dst_site:1 2025-01-01 10:14:13.076 [INFO] database P0000018519 T0000000000000018666 site[0] mal_site_ctl_port_set to site[1, IP: 10.0.10.250, port_num: 15336], socket handle = 12, site_magic = 1604, link_seq = 3 2025-01-01 10:14:13.076 [INFO] database P0000018519 T0000000000000018660 mal_site_port_get site_magic:1604, src_site:0, dst_site:1 2025-01-01 10:14:13.076 [INFO] database P0000018519 T0000000000000018661 mal_site_port_get site_magic:1604, src_site:0, dst_site:1 2025-01-01 10:14:13.079 [INFO] database P0000018519 T0000000000000018665 site[0] mal_site_data_port_set from site[1, IP: 10.0.10.250, port_num: 15336], socket handle = 13, site_magic = 1604, link_seq = 3
复制

看到告警恢复也恢复了。虽然这套环境没人用,但至少得健康呀,新年第一天就得健健康康,才算完美!

Alert Name: 集群实例状态异常 告警级别: 严重 实例: 192.168.216.249:9201 实例标签: 单机测试2 描述: 主机:192.168.216.249:9201 集群实例状态异常 开始时间: 2025-01-01 08:03:14 结束时间: 2025-01-01 10:18:14
复制

总 结

DMDW 主备实例集群通过 MAL 系统来发送日志和同步信息,而 MAL 系统使用 10.0 网段的地址来通信,如果私网通信受阻或者中断,则会导致 DMDW 主备实例出现故障主库挂起,无法提供写操作。因为达梦数据库在Suspend 状态下,会限制 REDO 日志刷盘,虽可以访问数据库对象,甚至可以修改数据,但是一旦执行 Commit 等操作触发 REDO 日志写盘时,当前操作就会被挂起。

所以我们要防止因磁盘空间或者网络故障导致的远程实例无法归档的情况发生,当实时归档失败时,Primary 实例将试图切换成 Suspend 状态,防止后续的日志写入。如果一旦写入日志,当主备库切换时,备库就有可能没有收到最后那次的 RLOG_BUF,导致主库上多一段日志,很容易造成主备库数据不一致。当实例成功切换为 Suspend 状态时,可直接退出,强制丢弃多余的日志,避免主备库数据不一致。

全文完,希望可以帮到正在阅读的你,如果觉得有帮助,可以分享给你身边的朋友,同事,你关心谁就分享给谁,一起学习共同进步~~~

欢迎关注我的公众号【JiekeXu DBA之路】,一起学习新知识!
——————————————————————————
公众号:JiekeXu DBA之路
墨天轮:https://www.modb.pro/u/4347
CSDN :https://blog.csdn.net/JiekeXu
ITPUB:https://blog.itpub.net/69968215
腾讯云:https://cloud.tencent.com/developer/user/5645107
——————————————————————————
facebook_pro_light_1920 × 1080  副本.png

最后修改时间:2025-01-06 09:54:32
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
1人已赞赏
Y
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论

吼吼哈嘿
暂无图片
1月前
评论
暂无图片 0
某国产数据库元旦首日挂起,背后原因揭秘
1月前
暂无图片 点赞
评论
鲁鲁
暂无图片
2月前
评论
暂无图片 0
某国产数据库元旦首日挂起,背后原因揭秘
2月前
暂无图片 点赞
评论
星星之火
暂无图片
2月前
评论
暂无图片 0
将主库(生产库)产生的 Redo 日志传输到备库,备库接收并重新应用 Redo 日志,从而实现备库与主库的数据同步。
2月前
暂无图片 点赞
评论