暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

性能下降的“隐形杀手” 如何通过临时文件捕获“真凶”

GBASE数据库 2025-03-27
106


日常系统运维过程中,小明发现某一时段系统出报表的性能总是异常下降。经过初步排查,他发现是后台数据库返回结果的速度变慢,导致前台业务响应速度下降。

这个系统后台是使用GBase 8a MPP Cluster建设的数据仓库,存储了百TB级的数据,公司众多分析类业务都是这款数据仓库在支持,排查难度可想而知。几次尝试自行解决无果后,小明果断求助GBASE技术团队。

技术专家小吉接到求助后迅速赶到现场。通过和小明沟通,他了解到系统性能下降总是发生在特定时间段。于是,在性能再次下降时,小吉开始监控数据库的CPU、内存、磁盘10等关键指标,很快发现磁盘IO在这段时间异常繁忙。

凭借丰富的经验,小吉迅速做出判断:这很可能是由于某些复杂SQL的中间结果集过于庞大,导致 opt/gnode/tmpdata 目录下产生了大量中间临时文件。这些临时文件不仅会导致查询响应延迟翻倍,严重时还可能占满磁盘空间,使整个集群陷入瘫痪。

果然,在 opt/gnode/tmpdata 目录下,小吉发现了大量正在不断增加的临时文件。更糟糕的是现场并发执行着数十条复杂SQL,传统的通过 show processlist 逐条排查的方法在这种情况下无异于大海捞针。


    suse103:/opt/gnode/tmpdata/cache_gbase # pwd
    /opt/gnode/tmpdata/cache_gbase
    suse103:/opt/gnode/tmpdata/cache_gbase # ll
    total 75497472
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:25 GB_MAT00000001965940xa3d8500.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:25 GB_MAT00000001CCF3A0xa3a61c0.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:25 GB_MAT00000002D97070x38ae3c0.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:25 GB_MAT0000000753F980x9626440.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:25 GB_MAT0000000FB81C50xbd28000.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:25 GB_MAT00000013199B0x17fc7f40.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:25 GB_MAT00000026185130x619a6c0.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:25 GB_MAT000000268BA140x3d14440.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:25 GB_MAT00000030F188E0x9626300.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:25 GB_MAT00000035DD840x17fc7e00.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:25 GB_MAT00000057D15420x38aea00.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:25 GB_MAT0000005C49BAC0x3d12a00.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:25 GB_MAT0000005F38C290xbd28140.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:25 GB_MAT00000069942570x619bac0.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:25 GB_MAT00000069F92D60xa3dbe80.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:25 GB_MAT00000078F35830xa3dbd40.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:26 GB_MAT00000101965940xa3d8500.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:26 GB_MAT00000101CCF3A0xa3a61c0.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:26 GB_MAT00000102D97070x38ae3c0.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:26 GB_MAT00000130F188E0x9626300.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:26 GB_MAT00000135DD840x17fc7e00.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:26 GB_MAT00000157D15420x38aea00.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:26 GB_MAT0000015C49BAC0x3d12a00.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:26 GB_MAT0000015F38C290xbd28140.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:26 GB_MAT00000169942570x619bac0.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:26 GB_MAT00000169F92D60xa3dbe80.express_tmp
    -rw-rw---- 1 gbase gbase 1979711488 Nov 8 14:26 GB_MAT00000178F35830xa3dbd40.express_tmp
    -rw-rw---- 1 gbase gbase 469762048 Nov 8 14:27 GB_MAT00000201965940xa3d8500.express_tmp
    -rw-rw---- 1 gbase gbase 469762048 Nov 8 14:27 GB_MAT00000201CCF3A0xa3a61c0.express_tmp
    -rw-rw---- 1 gbase gbase 402653184 Nov 8 14:27 GB_MAT00000202D97070x38ae3c0.express_tmp
    -rw-rw---- 1 gbase gbase 469762048 Nov 8 14:27 GB_MAT0000020753F980x9626440.express_tmp
    -rw-rw---- 1 gbase gbase 436207616 Nov 8 14:27 GB_MAT0000020FB81C50xbd28000.express_tmp
    -rw-rw---- 1 gbase gbase 436207616 Nov 8 14:27 GB_MAT00000213199B0x17fc7f40.express_tmp
    -rw-rw---- 1 gbase gbase 436207616 Nov 8 14:27 GB_MAT00000226185130x619a6c0.express_tmp
    -rw-rw---- 1 gbase gbase 469762048 Nov 8 14:27 GB_MAT00000269F92D60xa3dbe80.express_tmp
    -rw-rw---- 1 gbase gbase 469762048 Nov 8 14:27 GB_MAT00000278F35830xa3dbd40.express_tmp
    drwxrwx--x 2 gbase gbase 6 Nov 1 17:34 HashJoin
    drwxrwx--x 2 gbase gbase 6 Nov 8 13:58 TrashCan
    drwxrwx--x 3 gbase gbase 21 Nov 1 17:34 tmp_materialized
    suse103:/opt/gnode/tmpdata # pwd
    /opt/gnode/tmpdatasuse103:/opt/gnode/tmpdata # du -sh *
    97G cache_gbase
    suse103:/opt/gnode/tmpdata #




    系统性能异常 GBASE捕获“真凶”




    面对数以万计的临时文件,如何快速锁定“元凶SQL”成为解决问题的关键。

    小吉果断采取行动,他首先查询了performance_schema中的session memory_usage_info表。这个表在GBase 8a数据库的GN层中扮演着重要角色,其中的temp_space字段表示占用tmpdata空间的大小,ID 字段则表示SQL执行的ID(与show processlist 中的第一个字段“ID”相对应)


      gbase> select * from performance_schema.SESSION_MEMORY_USAGE_INFO where TEMP_SPACE>0;
      +---------+-----+---------+------+---------------------+-------------+
      | HOST | ID | CURRENT | PEAK | PEAK_TIMESTAMP | TEMP_SPACE |
      +---------+-----+---------+------+---------------------+-------------+
      | suse103 | 342 | 0 | 0 | 1970-01-01 08:00:00 | 86872424448 |
      +---------+-----+---------+------+---------------------+-------------+
      1 row in set (Elapsed: 00:00:00.00)
      gbase>


      通过查询这个表,小吉获得了占用大量临时空间的SQL的ID。然后,他在gn层的 showprocesslist 中找到了对应的SQL语句。


        suse103:~ # gncli -uroot -e"show full processlist" | grep 342
        342 root 10.10.10.103:35972 ssbm Query 224 init SELECT /*10.10.10.103_179_2_2016-11-08_14:24:21*/ /*+ TID('78') */ `_tmp_rht_n4_179_t9_1_1477982902_s`.`lo_orderkey` AS `lo_orderkey`, `_tmp_rht_n4_179_t9_1_1477982902_s`.`lo_linenumber` AS `lo_linenumber`,
        `_tmp_rht_n4_179_t9_1_1477982902_s`.`lo_custkey` AS `lo_custkey`, `_tmp_rht_n4_179_t9_1_1477982902_s`.`lo_partkey` AS `lo_partkey`,
        `_tmp_rht_n4_179_t9_1_1477982902_s`.`lo_suppkey` AS `lo_suppkey`, `_tmp_rht_n4_179_t9_1_1477982902_s`.`lo_orderdate` AS `lo_orderdate`,
        `_tmp_rht_n4_179_t9_1_1477982902_s`.`lo_orderpriority` AS `lo_orderpriority`, `_tmp_rht_n4_179_t9_1_1477982902_s`.`lo_shippriority` AS `lo_shippriority`,
        `_tmp_rht_n4_179_t9_1_1477982902_s`.`lo_quantity` AS `lo_quantity`, `_tmp_rht_n4_179_t9_1_1477982902_s`.`lo_extendedprice` AS `lo_extendedprice`,
        `_tmp_rht_n4_179_t9_1_1477982902_s`.`lo_ordtotalprice` AS `lo_ordtotalprice`, `_tmp_rht_n4_179_t9_1_1477982902_s`.`lo_discount` AS `lo_discount`,
        `_tmp_rht_n4_179_t9_1_1477982902_s`.`lo_revenue` AS `lo_revenue`, `_tmp_rht_n4_179_t9_1_1477982902_s`.`lo_supplycost` AS `lo_supplycost`,
        `_tmp_rht_n4_179_t9_1_1477982902_s`.`lo_tax` AS `lo_tax`, `_tmp_rht_n4_179_t9_1_1477982902_s`.`lo_commitdate` AS `lo_commitdate`,
        `_tmp_rht_n4_179_t9_1_1477982902_s`.`lo_shipmode` AS `lo_shipmode`, `_tmp_rht_n4_179_t9_2_1477982902_s`.`c1` AS `c1`, `_tmp_rht_n4_179_t9_2_1477982902_s`.`c2` AS `c2`, `_tmp_rht_n4_179_t9_2_1477982902_s`.`c3` AS `c3`, `_tmp_rht_n4_179_t9_2_1477982902_s`.`c4` AS `c4`, `_tmp_rht_n4_179_t9_2_1477982902_s`.`c5` AS `c5`,
        `_tmp_rht_n4_179_t9_2_1477982902_s`.`c6` AS `c6`, `_tmp_rht_n4_179_t9_2_1477982902_s`.`c7` AS `c7`, `_tmp_rht_n4_179_t9_2_1477982902_s`.`c8` AS `c8`,
        `_tmp_rht_n4_179_t9_2_1477982902_s`.`c9` AS `c9`, `_tmp_rht_n4_179_t9_2_1477982902_s`.`c10` AS `c10`, `_tmp_rht_n4_179_t9_2_1477982902_s`.`c11` AS `c11`,
        `_tmp_rht_n4_179_t9_2_1477982902_s`.`c12` AS `c12`, `_tmp_rht_n4_179_t9_2_1477982902_s`.`c13` AS `c13`, `_tmp_rht_n4_179_t9_2_1477982902_s`.`c14` AS `c14`,
        `_tmp_rht_n4_179_t9_2_1477982902_s`.`c15` AS `c15`, `_tmp_rht_n4_179_t9_2_1477982902_s`.`c16` AS `c16`, `_tmp_rht_n4_179_t9_2_1477982902_s`.`c17` AS `c17` FROM `gctmpdb`._tmp_rht_n4_179_t9_1_1477982902_s INNER JOIN `gctmpdb`._tmp_rht_n4_179_t9_2_1477982902_s ON (`_tmp_rht_n4_179_t9_1_1477982902_s`.`lo_linenumber` = `_tmp_rht_n4_179_t9_2_1477982902_s`.`c2`)
        suse103:~ #


        根据node层的SQL语句中的特定标记(如IP地址、时间戳等)或临时表名,他进一步查找到了gc层的SQL session id,从而成功定位到了具体的SQL语句——这个导致系统性能下降的“真凶”。

          suse103:~ # gccli -uroot -e"show full processlist" | grep -v Sleep | grep 179 179 root 127.0.0.1:63945 ssbm Query 571 Sending task to gnodes select * from lineorder a,lineorder b where a.lo_linenumber=b.lo_linenumber


          最终,小吉通过与业务部门深入沟通,了解清楚业务需求后,协助业务部门改写了SQL语句,成功解决了性能下降的问题,使系统性能恢复了正常。




          GBASE小课堂




          现场在执行一些中间结果集较大的sql时,在tmpdata文件夹下会产生大量的中间临时文件,导致磁盘io繁忙,影响集群整体性能。如果现场并发sql较多,通过show processlist排查sql,效率难以保障。如何通过tmpdata下临时文件快速准确找出所对应的sql?

          此时,在GBase 8a数据库GN层,可以查询performance_schema中的session memory_usage_info表。


            gbase> select * from performance_schema.SESSION_MEMORY_USAGE_INFO;
            +---------+-----+---------+------+---------------------+------------+
            | HOST | ID | CURRENT | PEAK | PEAK_TIMESTAMP | TEMP_SPACE |
            +---------+-----+---------+------+---------------------+------------+
            | suse103 | 344 | 0 | 0 | 1970-01-01 08:00:00 | 0 |
            | suse103 | 342 | 0 | 0 | 1970-01-01 08:00:00 | 0 |
            | suse103 | 324 | 0 | 0 | 1970-01-01 08:00:00 | 0 |
            | suse103 | 289 | 0 | 0 | 1970-01-01 08:00:00 | 0 |
            | suse103 | 274 | 0 | 0 | 1970-01-01 08:00:00 | 0 |
            | suse103 | 273 | 0 | 0 | 1970-01-01 08:00:00 | 0 |
            | suse103 | 241 | 0 | 0 | 1970-01-01 08:00:00 | 0 |
            | suse103 | 215 | 0 | 0 | 1970-01-01 08:00:00 | 0 |
            | suse103 | 214 | 0 | 0 | 1970-01-01 08:00:00 | 0 |
            | suse103 | 213 | 0 | 0 | 1970-01-01 08:00:00 | 0 |
            | suse103 | 132 | 0 | 0 | 1970-01-01 08:00:00 | 0 |
            | suse103 | 52 | 0 | 0 | 1970-01-01 08:00:00 | 0 |
            | suse103 | 31 | 0 | 0 | 1970-01-01 08:00:00 | 0 |
            | suse103 | 26 | 0 | 0 | 1970-01-01 08:00:00 | 0 |
            | suse103 | 25 | 0 | 0 | 1970-01-01 08:00:00 | 0 |
            | suse103 | 17 | 0 | 0 | 1970-01-01 08:00:00 | 0 |
            +---------+-----+---------+------+---------------------+------------+
            16 rows in set (Elapsed: 00:00:00.00)
            gbase> select * from performance_schema.SESSION_MEMORY_USAGE_INFO where TEMP_SPACE>0;
            Empty set (Elapsed: 00:00:00.00)


            其中temp_space字段就是表示占用tmpdata空间大小,ID字段表示SQL执行的ID(对应show processlist 中的第一个字段“ID”)。

            在gn层的show processlist中可以找到对应的SQL,根据node层的SQL即可查找对应的gc层的SQL。




            本期供稿 | GBase 8a产品经营部

            本期编辑 | Zoey

            内容审核 | 生态发展部


            文章转载自GBASE数据库,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

            评论