暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

0052.D DorisDB 60亿记录SSB POC测试记录

rundba 2021-06-10
4218

在计算领域,星型模式是最简单的数据集市模式,也是开发数据仓库和维度数据集市最广泛使用的方法。星型模式由一个或多个引用任意数量维度表的事实表组成。

Star schema benchmark(星型架构基准测试),简称SSB,SSB旨在衡量支持经典数据仓库应用程序的数据库产品的性能,它基于TPC-H基准[TPC-H],但形式有所修改。查询也是基于TPC-H查询,但减少了查询次数,方便在不同平台上运行SSB。SSB已被用于衡量 Linux上的许多主要商业数据库及OLAP查询类产品,已被用于但不限于Oracle Autonomous Database、Clickhouse、Tidb、greenplum、Oceanbase、Apache Kylin等。

Star schema benchmark论文详见:https://www.cs.umb.edu/~poneil/StarSchemaB.pdf

        

1. 测试环境

1) 测试基础环境

HOST3台 * VMWEAR虚拟机
CPU8C * Intel(R) Xeon(R) Gold 6242R CPU @ 3.10GHz
MEMORY32G swap 0
NIC10Gb/s
DISKSATA
/poc 该目录,存放压测工具及生成的文件数据,60亿条约占空间600G
/dorisdb数据库数据分布式存放,600G数据会均匀分散到3台主机

2) 测试数据说明

测试表共6张,本次模拟单表lineorder 60亿条记录,lineorder_flat宽表36亿条记录。

表名行数备注
lineorder60亿 SSB商品订单表
customer3000万SSB客户表
part 200万SSB零部件表
supplier200万SSB供应商表
dates2556日期表
lineorder_flat36亿 SSB打平后的宽表

2. 测试数据生成

1) 首先下载ssb-poc工具包并编译

    # cd poc
    # wget http://dorisdb-public.oss-cn-zhangjiakou.aliyuncs.com/ssb-poc-0.9.zip
    # unzip ssb-poc-0.9.zip
    # cd ssb-poc
    # make && make install

    2) 所有相关工具安装到output目录

    进入output目录

      [root@doris1 output]# cd poc/ssb-poc/output

      --如果使用因子100,生成6亿条记录的初始文件,约59G,用时15分钟,本次未执行,仅供参考。

        [root@doris1 output]# bin/gen-ssb.sh 100 data_dir

        本次测试使用因子1000,生成60亿条记录的初始文件,约600G,用时4.5小时.

          [root@doris1 output]# time bin/gen-ssb.sh 1000 data_dir
          /poc/ssb-poc/output/bin
          /poc/ssb-poc/output
          Generating new data set of scale factor 1000

          说明:尝试生成100亿条记录,但该工具超过10000GB的数据文件功能仍在测试中,生成的数据集可能不合规!

            [root@doris1 output]# time bin/gen-ssb.sh 1700 data_dir
            /poc/ssb-poc/output/bin
            /poc/ssb-poc/output
            Generating new data set of scale factor 1700
            NOTE: Data generation for scale factors > 1000 GB is still in development,
            and is not yet supported.


            Your resulting data set MAY NOT BE COMPLIANT!

            建议可以使用小于1000G的数据测试,因子小于1400,可进行测试,本人未做验证。

            3. 创建表结构

            1) 确定测试目录

              [root@doris1 output]# pwd
              /root/ssb-poc/output

              2) 修改配置文件conf/doris.conf

                [root@doris1 output]# vim conf/doris.conf        #即/poc/ssb-poc/output/conf/doris.conf,指定脚本操作的Doris集群地址
                [doris]
                # for mysql cmd
                mysql_host: doris1
                mysql_port: 9030
                mysql_user: root
                mysql_password: demo_415
                doris_db: ssb
                # cluster ports
                http_port: 8030
                be_heartbeat_port: 9050
                broker_port: 8000

                参数说明:

                  [doris]
                  # for mysql cmd
                  mysql_host: doris1 #FE节点主机名,IP地址也可以
                  mysql_port: 9030 #FE端口号
                  mysql_user: root #DorisDB用户名root
                  mysql_password: poc_ssb #root密码
                  doris_db: ssb #生成的测试数据库名为ssb(Star schema benchmark)

                  # cluster ports
                  http_port: 8030 #BE端口
                  be_heartbeat_port: 9050 #BE心跳端口
                  broker_port: 8000 #broker端口号

                  3) 执行建表语句

                  执行脚本建表,ddl_100表示6亿条记录规模,ddl_1000表示60亿条记录规模,ddl_1677约100亿条记录规模,生成数据在output/data_dir目录.

                  --生成6亿条记录规模空表,本次未执行,仅供参考。

                    [root@doris1 output]# time bin/create_db_table.sh ddl_100
                    sql: ssb_create success
                    sql: ssb_flat_create success


                    real 0m0.318s
                    user 0m0.110s
                    sys 0m0.015s

                    生成60亿条记录规模空表,本次测试规模。

                      [root@doris1 output]# time bin/create_db_table.sh ddl_1000
                      sql: ssb_create success
                      sql: ssb_flat_create success


                      real 0m0.423s
                      user 0m0.091s
                      sys 0m0.013s

                      4) 查看生成的表

                        mysql> use ssb;
                        Reading table information for completion of table and column names
                        You can turn off this feature to get a quicker startup with -A


                        Database changed
                        mysql> show tables;
                        +----------------+
                        | Tables_in_ssb |
                        +----------------+
                        | customer |
                        | dates |
                        | lineorder |
                        | lineorder_flat |
                        | part |
                        | supplier |
                        +----------------+
                        6 rows in set (0.00 sec)

                        生成表截图:

                        查看记录数:

                        4. 数据导入

                        1) 导入单表记录,用时41m5.584s

                        使用Stream load导入单表数据,用时41m5.584s,data_dir为前述数据生成目录,将数据导入DorisDB数据库中。

                          [root@doris1 output]# time bin/stream_load.sh data_dir
                          lineorder/customer/dates/part/supplier

                          real 41m5.584s
                          user 0m30.070s
                          sys 8m24.738s

                          其中lineorder导入记录60亿条。其它记录数详见1.2测试数据说明。

                          2) 插入数据到宽表lineorder_flat,用时60m0.412s

                            [root@doris1 output]# time bin/flat_insert.sh
                            sql: ssb_flat_insert start
                            sql: ssb_flat_insert. flat insert error, msg: (1064, 'Cancelled')

                            real 60m0.412s
                            user 0m0.102s
                            sys 0m0.029s

                            60亿宽表插入报错,可能是SSB-POC工具问题,也可能是DorisDB问题,当前使用DorisDB1.15.2,原因待查。

                            实际只导入36亿条,36亿条记录可满足测试需求。

                            5. 设置查询并行度

                            首先在客户端执行命令,因只有8个core,此处设置Doris的并行度为8,类似clickhouse set max_threads= 8。

                              mysql> set global parallel_fragment_exec_instance_num  = 8;
                              Query OK, 0 rows affected (0.01 sec)
                              mysql> show variables like 'parallel_fragment_exec_instance_num';
                              +-------------------------------------+-------+
                              | Variable_name | Value |
                              +-------------------------------------+-------+
                              | parallel_fragment_exec_instance_num | 8 |
                              +-------------------------------------+-------+
                              1 row in set (0.00 sec)

                              此处只进行了并行度设置,未进行fe、be参数优化设置。

                              6. SSB测试

                              共执行了4类场景,13条查询语句,分别为单表查询和多表查询,结果为毫秒,并发为1。

                              1) 单表测试场景

                              执行语句:

                                [root@doris1 output]# time bin/benchmark.sh -p -d ssb
                                ------ dataset: ssb, concurrency: 1 ------
                                sql\time(ms)\parallel_num 1
                                q1 14430.0
                                q2 4153.0
                                q3 3265.0
                                q4 29478.0
                                q5 19391.0
                                q6 14638.0
                                q7 83782.0
                                q8 21801.0
                                q9 17182.0
                                q10 5030.0
                                q11 68037.0
                                q12 31603.0
                                q13 39640.0

                                real 5m59.162s
                                user 0m0.154s
                                sys 0m0.113s

                                单表查询时间超过分钟的为查询q7、q11均是对宽表的查询。

                                执行截图:

                                2) 多表jion测试场景

                                执行语句:

                                  [root@doris1 output]# time bin/benchmark.sh -p -d ssb-flat
                                  ------ dataset: ssb-flat, concurrency: 1 ------
                                  sql\time(ms)\parallel_num 1
                                  q1 20464.0
                                  q2 34725.0
                                  q3 25810.0
                                  q4 35177.0
                                  q5 31152.0
                                  q6 23387.0
                                  q7 22354.0
                                  q8 11362.0
                                  q9 1970.0
                                  q10 35.0
                                  q11 36059.0
                                  q12 63.0
                                  q13 32.0

                                  real 4m9.385s
                                  user 0m0.167s
                                  sys 0m0.099s

                                  多表join查询时间未超过分钟级。

                                  执行截图:

                                  7. 测试语句

                                  1) 单表测试SQL

                                    --Q1.1 
                                    SELECT sum(lo_extendedprice * lo_discount) AS `revenue`
                                    FROM lineorder_flat
                                    WHERE lo_orderdate >= '1993-01-01' and lo_orderdate <= '1993-12-31' AND lo_discount BETWEEN 1 AND 3 AND lo_quantity < 25;

                                    --Q1.2
                                    SELECT sum(lo_extendedprice * lo_discount) AS revenue FROM lineorder_flat
                                    WHERE lo_orderdate >= '1994-01-01' and lo_orderdate <= '1994-01-31' AND lo_discount BETWEEN 4 AND 6 AND lo_quantity BETWEEN 26 AND 35;

                                    --Q1.3
                                    SELECT sum(lo_extendedprice * lo_discount) AS revenue
                                    FROM lineorder_flat
                                    WHERE weekofyear(lo_orderdate) = 6 AND lo_orderdate >= '1994-01-01' and lo_orderdate <= '1994-12-31'
                                    AND lo_discount BETWEEN 5 AND 7 AND lo_quantity BETWEEN 26 AND 35;


                                    --Q2.1
                                    SELECT sum(lo_revenue), year(lo_orderdate) AS year, p_brand
                                    FROM lineorder_flat
                                    WHERE p_category = 'MFGR#12' AND s_region = 'AMERICA'
                                    GROUP BY year, p_brand
                                    ORDER BY year, p_brand;

                                    --Q2.2
                                    SELECT
                                    sum(lo_revenue), year(lo_orderdate) AS year, p_brand
                                    FROM lineorder_flat
                                    WHERE p_brand >= 'MFGR#2221' AND p_brand <= 'MFGR#2228' AND s_region = 'ASIA'
                                    GROUP BY year, p_brand
                                    ORDER BY year, p_brand;

                                    --Q2.3
                                    SELECT sum(lo_revenue), year(lo_orderdate) AS year, p_brand
                                    FROM lineorder_flat
                                    WHERE p_brand = 'MFGR#2239' AND s_region = 'EUROPE'
                                    GROUP BY year, p_brand
                                    ORDER BY year, p_brand;


                                    --Q3.1
                                    SELECT c_nation, s_nation, year(lo_orderdate) AS year, sum(lo_revenue) AS revenue FROM lineorder_flat
                                    WHERE c_region = 'ASIA' AND s_region = 'ASIA' AND lo_orderdate >= '1992-01-01' AND lo_orderdate <= '1997-12-31'
                                    GROUP BY c_nation, s_nation, year
                                    ORDER BY year ASC, revenue DESC;

                                    --Q3.2
                                    SELECT c_city, s_city, year(lo_orderdate) AS year, sum(lo_revenue) AS revenue
                                    FROM lineorder_flat
                                    WHERE c_nation = 'UNITED STATES' AND s_nation = 'UNITED STATES' AND lo_orderdate >= '1992-01-01' AND lo_orderdate <= '1997-12-31'
                                    GROUP BY c_city, s_city, year
                                    ORDER BY year ASC, revenue DESC;


                                    --Q3.3
                                    SELECT c_city, s_city, year(lo_orderdate) AS year, sum(lo_revenue) AS revenue
                                    FROM lineorder_flat
                                    WHERE c_city in ( 'UNITED KI1' ,'UNITED KI5') AND s_city in ( 'UNITED KI1' ,'UNITED KI5') AND lo_orderdate >= '1992-01-01' AND lo_orderdate <= '1997-12-31'
                                    GROUP BY c_city, s_city, year
                                    ORDER BY year ASC, revenue DESC;


                                    --Q3.4
                                    SELECT c_city, s_city, year(lo_orderdate) AS year, sum(lo_revenue) AS revenue
                                    FROM lineorder_flat
                                    WHERE c_city in ('UNITED KI1', 'UNITED KI5') AND s_city in ( 'UNITED KI1', 'UNITED KI5') AND lo_orderdate >= '1997-12-01' AND lo_orderdate <= '1997-12-31'
                                    GROUP BY c_city, s_city, year
                                    ORDER BY year ASC, revenue DESC;


                                    --Q4.1
                                    set vectorized_engine_enable = FALSE;
                                    SELECT year(lo_orderdate) AS year, c_nation, sum(lo_revenue - lo_supplycost) AS profit FROM lineorder_flat
                                    WHERE c_region = 'AMERICA' AND s_region = 'AMERICA' AND p_mfgr in ( 'MFGR#1' , 'MFGR#2')
                                    GROUP BY year, c_nation
                                    ORDER BY year ASC, c_nation ASC;


                                    --Q4.2
                                    SELECT year(lo_orderdate) AS year,
                                    s_nation, p_category, sum(lo_revenue - lo_supplycost) AS profit
                                    FROM lineorder_flat
                                    WHERE c_region = 'AMERICA' AND s_region = 'AMERICA' AND lo_orderdate >= '1997-01-01' and lo_orderdate <= '1998-12-31' AND p_mfgr in ( 'MFGR#1' , 'MFGR#2')
                                    GROUP BY year, s_nation, p_category
                                    ORDER BY year ASC, s_nation ASC, p_category ASC;


                                    --Q4.3
                                    SELECT year(lo_orderdate) AS year, s_city, p_brand,
                                    sum(lo_revenue - lo_supplycost) AS profit
                                    FROM lineorder_flat
                                    WHERE s_nation = 'UNITED STATES' AND lo_orderdate >= '1997-01-01' and lo_orderdate <= '1998-12-31' AND p_category = 'MFGR#14'
                                    GROUP BY year, s_city, p_brand
                                    ORDER BY year ASC, s_city ASC, p_brand ASC;

                                    2) 多表jion测试SQL

                                      --Q1.1 
                                      select sum(lo_revenue) as revenue
                                      from lineorder join dates on lo_orderdate = d_datekey
                                      where d_year = 1993 and lo_discount between 1 and 3 and lo_quantity < 25;


                                      --Q1.2
                                      select sum(lo_revenue) as revenue
                                      from lineorder
                                      join dates on lo_orderdate = d_datekey
                                      where d_yearmonthnum = 199401
                                      and lo_discount between 4 and 6
                                      and lo_quantity between 26 and 35;


                                      --Q1.3
                                      select sum(lo_revenue) as revenue
                                      from lineorder
                                      join dates on lo_orderdate = d_datekey
                                      where d_weeknuminyear = 6 and d_year = 1994
                                      and lo_discount between 5 and 7
                                      and lo_quantity between 26 and 35;


                                      --Q2.1
                                      select sum(lo_revenue) as lo_revenue, d_year, p_brand
                                      from lineorder
                                      inner join dates on lo_orderdate = d_datekey
                                      join part on lo_partkey = p_partkey
                                      join supplier on lo_suppkey = s_suppkey
                                      where p_category = 'MFGR#12' and s_region = 'AMERICA'
                                      group by d_year, p_brand
                                      order by d_year, p_brand;


                                      --Q2.2
                                      select sum(lo_revenue) as lo_revenue, d_year, p_brand
                                      from lineorder
                                      join dates on lo_orderdate = d_datekey
                                      join part on lo_partkey = p_partkey
                                      join supplier on lo_suppkey = s_suppkey
                                      where p_brand between 'MFGR#2221' and 'MFGR#2228' and s_region = 'ASIA'
                                      group by d_year, p_brand
                                      order by d_year, p_brand;


                                      --Q2.3
                                      select sum(lo_revenue) as lo_revenue, d_year, p_brand
                                      from lineorder
                                      join dates on lo_orderdate = d_datekey
                                      join part on lo_partkey = p_partkey
                                      join supplier on lo_suppkey = s_suppkey
                                      where p_brand = 'MFGR#2239' and s_region = 'EUROPE'
                                      group by d_year, p_brand
                                      order by d_year, p_brand;


                                      --Q3.1
                                      select c_nation, s_nation, d_year, sum(lo_revenue) as lo_revenue
                                      from lineorder
                                      join dates on lo_orderdate = d_datekey
                                      join customer on lo_custkey = c_custkey
                                      join supplier on lo_suppkey = s_suppkey
                                      where c_region = 'ASIA' and s_region = 'ASIA'and d_year >= 1992 and d_year <= 1997
                                      group by c_nation, s_nation, d_year
                                      order by d_year asc, lo_revenue desc;


                                      --Q3.2
                                      select c_city, s_city, d_year, sum(lo_revenue) as lo_revenue
                                      from lineorder
                                      join dates on lo_orderdate = d_datekey
                                      join customer on lo_custkey = c_custkey
                                      join supplier on lo_suppkey = s_suppkey
                                      where c_nation = 'UNITED STATES' and s_nation = 'UNITED STATES'
                                      and d_year >= 1992 and d_year <= 1997
                                      group by c_city, s_city, d_year
                                      order by d_year asc, lo_revenue desc;


                                      --Q3.3
                                      select c_city, s_city, d_year, sum(lo_revenue) as lo_revenue
                                      from lineorder
                                      join dates on lo_orderdate = d_datekey
                                      join customer on lo_custkey = c_custkey
                                      join supplier on lo_suppkey = s_suppkey
                                      where (c_city='UNITED KI1' or c_city='UNITED KI5')
                                      and (s_city='UNITED KI1' or s_city='UNITED KI5')
                                      and d_year >= 1992 and d_year <= 1997
                                      group by c_city, s_city, d_year
                                      order by d_year asc, lo_revenue desc;


                                      --Q3.4
                                      select c_city, s_city, d_year, sum(lo_revenue) as lo_revenue
                                      from lineorder
                                      join dates on lo_orderdate = d_datekey
                                      join customer on lo_custkey = c_custkey
                                      join supplier on lo_suppkey = s_suppkey
                                      where (c_city='UNITED KI1' or c_city='UNITED KI5') and (s_city='UNITED KI1' or s_city='UNITED KI5') and d_yearmonth
                                      = 'Dec1997'
                                      group by c_city, s_city, d_year
                                      order by d_year asc, lo_revenue desc;


                                      --Q4.1
                                      select d_year, c_nation, sum(lo_revenue) - sum(lo_supplycost) as profit
                                      from lineorder
                                      join dates on lo_orderdate = d_datekey
                                      join customer on lo_custkey = c_custkey
                                      join supplier on lo_suppkey = s_suppkey
                                      join part on lo_partkey = p_partkey
                                      where c_region = 'AMERICA' and s_region = 'AMERICA' and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
                                      group by d_year, c_nation
                                      order by d_year, c_nation;


                                      --Q4.2
                                      select d_year, s_nation, p_category, sum(lo_revenue) - sum(lo_supplycost) as profit
                                      from lineorder
                                      join dates on lo_orderdate = d_datekey
                                      join customer on lo_custkey = c_custkey
                                      join supplier on lo_suppkey = s_suppkey
                                      join part on lo_partkey = p_partkey
                                      where c_region = 'AMERICA'and s_region = 'AMERICA'
                                      and (d_year = 1997 or d_year = 1998)
                                      and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
                                      group by d_year, s_nation, p_category
                                      order by d_year, s_nation, p_category;



                                      --Q4.3
                                      select d_year, s_city, p_brand, sum(lo_revenue) - sum(lo_supplycost) as profit
                                      from lineorder
                                      join dates on lo_orderdate = d_datekey
                                      join customer on lo_custkey = c_custkey
                                      join supplier on lo_suppkey = s_suppkey
                                      join part on lo_partkey = p_partkey
                                      where c_region = 'AMERICA'and s_nation = 'UNITED STATES'
                                      and (d_year = 1997 or d_year = 1998)
                                      and p_category = 'MFGR#14'
                                      group by d_year, s_city, p_brand
                                      order by d_year, s_city, p_brand;


                                      8. 小结

                                      在60亿条单表记录,36亿条宽表记录情况下,通过SSB测试,DorisDB单表查询时间超过分钟的为查询q7、q11,均是对宽表的查询,多表join查询时间未超过分钟级,性能还是较为给力。作为国产OLAP一哥,还是值得一试。

                                      和同类竞品clickhouse、greenplum相比具有很大的优势,文中未做对比。


                                      --  完  --

                                      作者:王坤,微信公众号:rundba,欢迎转载,转载请注明出处。

                                      如需公众号转发,请联系wx:landnow。

                                      文章转载自rundba,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

                                      评论