
0034.E Monitoring InfiniBand Switches

rundba 2021-05-10

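InfiniBand is used extensively in Oracle engineered systems (Exadata, BDA, PCA, and others). A handful of common commands are enough to confirm its health and to troubleshoot problems.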

1. Sensor and Power Supply Monitoring

The monitoring command is:

    [root@dm01sw-ib1 ~]# showunhealthy
    OK - No unhealthy sensors

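The result should be OK - No unhealthy sensors. If it is not, run env_test on the InfiniBand switch to find the specific sensor that is failing. Note that showunhealthy does not check the switch's power supply units (PSUs); to check the power supply state, run the following command: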

      [root@dm01sw-ib1 ~]# checkpower
      PSU 0 present OK
      PSU 1 present OK
      All PSUs OK

The command above should return All PSUs OK.
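Both checks reduce to looking for one known-good string, so they are easy to script. The following is a minimal sketch, assuming the output of each command has already been captured as text (how it is collected, for example over SSH, is left out). The function names `sensors_healthy` and `power_ok` are hypothetical; the strings they look for are the ones in the sample output above.

```python
def sensors_healthy(showunhealthy_output: str) -> bool:
    """Return True when showunhealthy reports no unhealthy sensors."""
    return "No unhealthy sensors" in showunhealthy_output


def power_ok(checkpower_output: str) -> bool:
    """Return True when checkpower prints the 'All PSUs OK' summary line."""
    lines = [line.strip() for line in checkpower_output.splitlines()]
    return "All PSUs OK" in lines
```

A monitoring job could raise an alert whenever either function returns False and then fall back to env_test for the details.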

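2. Port and Link Monitoring

We sometimes also need to inspect the link, transmit, receive, relay, and buffer error counters of each port on the InfiniBand network. Running the following command on a database node or on an InfiniBand switch does this: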

        [root@dm01sw-ib1 ~]# ibqueryerrors.pl -s RcvSwRelayErrors,RcvRemotePhysErrors,XmtDiscards,XmtConstraintErrors,RcvConstraintErrors, ExcBufOverrunErrors,VL15Dropped
        Suppressing:RcvSwRelayErrors,RcvRemotePhysErrors,XmtDiscards,XmtConstraintErrors,RcvConstraintErrors
        Errors for 0x00212846901ea0a0 "SUN DCS 36P QDR dm01sw-ib3 10.242.65.9"
        GUID 0x00212846901ea0a0 port 17:[VL15Dropped == 4]
        GUID 0x00212846901ea0a0 port 25:[RcvErrors == 218]
        GUID 0x00212846901ea0a0 port 27:[RcvErrors == 144]
        GUID 0x00212846901ea0a0 port 28:[RcvErrors == 188]
        GUID 0x00212846901ea0a0 port 30:[ExcBufOverrunErrors == 1] [RcvErrors == 678] [LinkRecovers == 1]
        GUID 0x00212846901ea0a0 port 31:[VL15Dropped == 13]
        Errors for 0x002128468eada0a0 "SUN DCS 36P QDR dm01sw-ib2 10.242.65.8"
        GUID 0x002128468eada0a0 port 7:[ExcBufOverrunErrors == 3] [RcvErrors == 1299] [LinkRecovers == 3]
        GUID 0x002128468eada0a0 port 9:[RcvErrors == 225]
        GUID 0x002128468eada0a0 port 10:[ExcBufOverrunErrors == 3] [RcvErrors == 1434] [LinkRecovers == 3]
        GUID 0x002128468eada0a0 port 12:[ExcBufOverrunErrors == 4] [RcvErrors == 2382] [LinkRecovers == 4]
        GUID 0x002128468eada0a0 port 13:[LinkDowned == 1]
        GUID 0x002128468eada0a0 port 14:[LinkDowned == 1]
        GUID 0x002128468eada0a0 port 15:[LinkDowned == 1]
        GUID 0x002128468eada0a0 port 16:[LinkDowned == 1]
        GUID 0x002128468eada0a0 port 17:[LinkDowned == 1]
        GUID 0x002128468eada0a0 port 31:[LinkDowned == 1]
        Errors for 0x002128469566a0a0 "SUN DCS 36P QDR dm01sw-ib1 10.242.65.7"
        GUID 0x002128469566a0a0 port 19:[LinkDowned == 3]
        GUID 0x002128469566a0a0 port 21:[LinkDowned == 2]
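Output like this gets long on a multi-switch fabric, so it helps to turn it into a structure that can be filtered. The following is a minimal sketch, assuming each port's counters sit on a single `GUID ... port N:[Name == value] ...` line as in the listing above. `parse_ibqueryerrors` and `ports_over` are hypothetical helper names, and any threshold passed in is an arbitrary choice, not a recommended limit.

```python
import re

# One "GUID <hex> port <n>:" prefix followed by one or more "[Name == value]" groups.
PORT_LINE = re.compile(r"GUID\s+(0x[0-9a-fA-F]+)\s+port\s+(\d+):\s*(.*)")
COUNTER = re.compile(r"\[(\w+)\s*==\s*(\d+)\]")


def parse_ibqueryerrors(text):
    """Map (guid, port) -> {counter name: value} for every port line found."""
    result = {}
    for line in text.splitlines():
        match = PORT_LINE.search(line)
        if not match:
            continue  # "Suppressing:" and "Errors for ..." lines carry no counters
        key = (match.group(1), int(match.group(2)))
        result[key] = {name: int(value) for name, value in COUNTER.findall(match.group(3))}
    return result


def ports_over(parsed, counter, threshold):
    """List (guid, port, value) for ports whose counter exceeds the threshold."""
    hits = [
        (guid, port, counters[counter])
        for (guid, port), counters in parsed.items()
        if counters.get(counter, 0) > threshold
    ]
    return sorted(hits)
```

For example, `ports_over(parse_ibqueryerrors(output), "RcvErrors", 1000)` would list only the ports with more than 1000 receive errors.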

Run ibstatus on the database nodes and storage nodes to query the state of the local InfiniBand ports:

          # ibstatus
          Infiniband device 'mlx4_0' port 1 status:
          default gid:fe80:0000:0000:0000:0021:2800:01a1:3fed
          base lid:0x2
          sm lid:0x1
          state: 4:ACTIVE
          phys state:5:LinkUp
          rate: 40Gb/sec (4X QDR)
          link_layer: IB
          Infiniband device 'mlx4_0' port 2 status:
          default gid:fe80:0000:0000:0000:0021:2800:01a1:3fee
          base lid:0x5
          sm lid:0x1
          state: 4:ACTIVE
          phys state:5:LinkUp
          rate: 40Gb/sec (4X QDR)
          link_layer: IB

The expected healthy values are:

            state: 4:ACTIVE
            phys state:5:LinkUp
            rate: 40Gb/sec (4X QDR)
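The three expected values can be checked mechanically for every port. The following is a minimal sketch, assuming ibstatus output in the layout shown above (a `Infiniband device '...' port N status:` header followed by `field: value` lines). `parse_ibstatus`, `unhealthy_ports`, and the `EXPECTED` table are hypothetical names; the expected rate is the QDR value from this article and would differ on other hardware.

```python
import re

HEADER = re.compile(r"Infiniband device '([^']+)' port (\d+) status:")

# Healthy values taken from the sample output; whitespace is ignored when comparing.
EXPECTED = {
    "state": "4:ACTIVE",
    "phys state": "5:LinkUp",
    "rate": "40Gb/sec (4X QDR)",
}


def parse_ibstatus(text):
    """Return {(device, port): {field: value}} from ibstatus output."""
    ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = (header.group(1), int(header.group(2)))
            ports[current] = {}
        elif current is not None and ":" in line:
            # Split on the first colon only: values such as "4:ACTIVE" contain colons.
            field, value = line.split(":", 1)
            ports[current][field.strip()] = value.strip()
    return ports


def unhealthy_ports(ports):
    """List (port key, field, actual value) for every field that deviates."""
    problems = []
    for key, fields in ports.items():
        for field, wanted in EXPECTED.items():
            actual = fields.get(field)
            if actual is None or actual.replace(" ", "") != wanted.replace(" ", ""):
                problems.append((key, field, actual))
    return problems
```

An empty list from `unhealthy_ports` means every port matched all three expected values.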

Check the state of the InfiniBand network interfaces:

              # ifconfig ib0
              ib0 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
              UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
              RX packets:133506 errors:0 dropped:0 overruns:0 frame:0
              TX packets:114796 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1024
              RX bytes:33936833 (32.3 MiB) TX bytes:33524268 (31.9 MiB)
              # ifconfig ib1
              ib1 Link encap:InfiniBand HWaddr 80:00:00:49:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
              UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
              RX packets:1702 errors:0 dropped:1702 overruns:0 frame:0
              TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1024
              RX bytes:194784 (190.2 KiB) TX bytes:0 (0.0 b)
              # ifconfig bondib0
              bondib0 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
              inet addr:192.168.10.9 Bcast:192.168.11.255 Mask:255.255.252.0
              inet6 addr:fe80:221:2800:1a1:ffd/64 Scope:Link
              UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
              RX packets:135278 errors:0 dropped:1702 overruns:0 frame:0
              TX packets:114847 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:34150057 (32.5 MiB) TX bytes:33540576 (31.9 MiB)

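On the bonded interface the flags should read UP BROADCAST RUNNING MASTER MULTICAST (the slave interfaces ib0 and ib1 show SLAVE instead of MASTER). The errors, dropped, and overruns values also need to be monitored closely.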
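Watching errors, dropped, and overruns by eye does not scale, so the counters can be extracted and filtered. The following is a minimal sketch, assuming the older net-tools ifconfig layout shown above (`RX packets:N errors:N dropped:N overruns:N ...`); newer ifconfig versions print these fields differently and would need a different pattern. `parse_ifconfig_counters` and `nonzero_problems` are hypothetical names.

```python
import re

COUNTER_LINE = re.compile(
    r"^(RX|TX) packets:(\d+) errors:(\d+) dropped:(\d+) overruns:(\d+)"
)


def parse_ifconfig_counters(text):
    """Return {"RX": {...}, "TX": {...}} for one interface's ifconfig output."""
    counters = {}
    for raw in text.splitlines():
        match = COUNTER_LINE.match(raw.strip())
        if match:
            packets, errors, dropped, overruns = (int(v) for v in match.groups()[1:])
            counters[match.group(1)] = {
                "packets": packets,
                "errors": errors,
                "dropped": dropped,
                "overruns": overruns,
            }
    return counters


def nonzero_problems(counters):
    """Return {(direction, counter): value} for every non-zero problem counter."""
    return {
        (direction, name): value
        for direction, fields in counters.items()
        for name, value in fields.items()
        if name != "packets" and value > 0
    }
```

Applied to the ib1 output above, this reports only the RX dropped counter (1702), the same figure that appears on bondib0.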

Use rds-ping and ping to test whether the links between the database nodes and all storage nodes are working:

                # rds-ping 192.168.10.9
                1:61 usec
                2:55 usec
                3:53 usec……
                # ping 192.168.10.9
                PING 192.168.10.9 (192.168.10.9) 56(84) bytes of data.
                64 bytes from 192.168.10.9:icmp_seq=1 ttl=64 time=1.80 ms
                64 bytes from 192.168.10.9:icmp_seq=2 ttl=64 time=0.078 ms
                64 bytes from 192.168.10.9:icmp_seq=3 ttl=64 time=0.083 ms
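When the same test is repeated against every storage node, it is convenient to reduce each rds-ping run to a list of latencies. The following is a minimal sketch, assuming reply lines of the form `sequence: latency usec` as in the output above. `parse_rds_ping` and `slow_replies` are hypothetical names, and the latency limit is an arbitrary value the caller picks, not a documented threshold.

```python
import re

# Matches "1:61 usec" as well as variants with extra spaces such as "  1: 61 usec".
REPLY = re.compile(r"^\s*(\d+):\s*(\d+)\s*usec")


def parse_rds_ping(text):
    """Return the reply latencies in microseconds, in the order printed."""
    latencies = []
    for line in text.splitlines():
        match = REPLY.match(line)
        if match:
            latencies.append(int(match.group(2)))
    return latencies


def slow_replies(latencies, limit_usec):
    """Return the latencies above the caller-chosen limit."""
    return [value for value in latencies if value > limit_usec]
```

An empty result from `parse_rds_ping` means no replies were seen at all, which is worth treating as a failure in its own right.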

View the port error counters:

                  # perfquery
                  # Port counters: Lid 40 port 1 (CapMask:0x1400)
                  PortSelect:......................1
                  CounterSelect:...................0x0000
                  SymbolErrorCounter:..............0 #####
                  LinkErrorRecoveryCounter:........0
                  LinkDownedCounter:...............0 #####
                  PortRcvErrors:...................0 #####
                  PortRcvRemotePhysicalErrors:.....0
                  PortRcvSwitchRelayErrors:........0
                  PortXmitDiscards:................0
                  PortXmitConstraintErrors:........0
                  PortRcvConstraintErrors:.........0
                  CounterSelect2:..................0x00
                  LocalLinkIntegrityErrors:........0 #####
                  ExcessiveBufferOverrunErrors:....0 #####
                  VL15Dropped:.....................0
                  PortXmitData:....................4294967295
                  PortRcvData:.....................4294967295
                  PortXmitPkts:....................648093271
                  PortRcvPkts:.....................285784546

Monitor SymbolErrorCounter, LinkDownedCounter, PortRcvErrors, LocalLinkIntegrityErrors, and ExcessiveBufferOverrunErrors (the lines marked ##### above). Under normal conditions these counters should not increase.
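Because the criterion is growth rather than an absolute value, the natural check is to compare two perfquery snapshots taken some time apart. The following is a minimal sketch, assuming `Name:....value` lines as in the output above (dot or ellipsis leaders are both accepted). `parse_perfquery` and `grown_counters` are hypothetical names; the `WATCHED` tuple lists the five counters named in the text.

```python
import re

WATCHED = (
    "SymbolErrorCounter",
    "LinkDownedCounter",
    "PortRcvErrors",
    "LocalLinkIntegrityErrors",
    "ExcessiveBufferOverrunErrors",
)

# Counter name, a colon, leader characters ('.' or the ellipsis U+2026), then digits.
COUNTER_LINE = re.compile(r"^(\w+):[.\u2026\s]*(\d+)")


def parse_perfquery(text):
    """Return {counter name: integer value} for the watched counters only."""
    values = {}
    for raw in text.splitlines():
        match = COUNTER_LINE.match(raw.strip())
        if match and match.group(1) in WATCHED:
            values[match.group(1)] = int(match.group(2))
    return values


def grown_counters(before_text, after_text):
    """Return {name: (old, new)} for watched counters that increased."""
    before = parse_perfquery(before_text)
    after = parse_perfquery(after_text)
    return {
        name: (before.get(name, 0), value)
        for name, value in after.items()
        if value > before.get(name, 0)
    }
```

Any non-empty result from `grown_counters` points at a port worth investigating; the sampling interval is left to the operator.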

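3. Web Monitoring

We can also log in to the InfiniBand switch's web management interface and configure SNMP under Configuration -> System Management Access -> SNMP, so that the switch is brought into the network management and monitoring platform.

In this web console, System Monitoring holds all system-related monitoring information, including sensors and event logs.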

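4. Throughput Testing

The /opt/oracle.SupportTools/ibdiagtools directory on the database nodes provides a set of diagnostic tools for detecting InfiniBand faults; the most commonly used are verify_topology and infinicheck.

infinicheck checks the maximum throughput of the InfiniBand network. Run it only when the system is idle, otherwise it may affect production workloads. Run it with -z first to clean up the files generated by a previous run.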

                    # /opt/oracle.SupportTools/ibdiagtools/infinicheck -z
                    # /opt/oracle.SupportTools/ibdiagtools/infinicheck



