在计算领域,星型模式是最简单的数据集市模式,也是开发数据仓库和维度数据集市最广泛使用的方法。星型模式由一个或多个引用任意数量维度表的事实表组成。
Star schema benchmark(星型架构基准测试),简称SSB,SSB旨在衡量支持经典数据仓库应用程序的数据库产品的性能,它基于TPC-H基准[TPC-H],但形式有所修改。查询也是基于TPC-H查询,但减少了查询次数,方便在不同平台上运行SSB。SSB已被用于衡量 Linux上的许多主要商业数据库及OLAP查询类产品,已被用于但不限于Oracle Autonomous Database、Clickhouse、Tidb、greenplum、Oceanbase、Apache Kylin等。
Star schema benchmark论文详见:https://www.cs.umb.edu/~poneil/StarSchemaB.pdf
1. 测试环境
1) 测试基础环境
HOST | 3台 * VMWEAR虚拟机 |
CPU | 8C * Intel(R) Xeon(R) Gold 6242R CPU @ 3.10GHz |
MEMORY | 32G swap 0 |
NIC | 10Gb/s |
DISK | SATA |
/poc | 该目录,存放压测工具及生成的文件数据,60亿条约占空间600G |
/dorisdb | 数据库数据分布式存放,600G数据会均匀分散到3台主机 |
2) 测试数据说明
测试表共6张,本次模拟单表lineorder 60亿条记录,lineorder_flat宽表36亿条记录。
表名 | 行数 | 备注 |
lineorder | 60亿 | SSB商品订单表 |
customer | 3000万 | SSB客户表 |
part | 200万 | SSB零部件表 |
supplier | 200万 | SSB供应商表 |
dates | 2556 | 日期表 |
lineorder_flat | 36亿 | SSB打平后的宽表 |
2. 测试数据生成
1) 首先下载ssb-poc工具包并编译
# cd poc
# wget http://dorisdb-public.oss-cn-zhangjiakou.aliyuncs.com/ssb-poc-0.9.zip
# unzip ssb-poc-0.9.zip
# cd ssb-poc
# make && make install
2) 所有相关工具安装到output目录
进入output目录
[root@doris1 output]# cd poc/ssb-poc/output
--如果使用因子100,生成6亿条记录的初始文件,约59G,用时15分钟,本次未执行,仅供参考。
[root@doris1 output]# bin/gen-ssb.sh 100 data_dir
本次测试使用因子1000,生成60亿条记录的初始文件,约600G,用时4.5小时.
[root@doris1 output]# time bin/gen-ssb.sh 1000 data_dir
/poc/ssb-poc/output/bin
/poc/ssb-poc/output
Generating new data set of scale factor 1000
说明:尝试生成100亿条记录,但该工具超过10000GB的数据文件功能仍在测试中,生成的数据集可能不合规!
[root@doris1 output]# time bin/gen-ssb.sh 1700 data_dir
/poc/ssb-poc/output/bin
/poc/ssb-poc/output
Generating new data set of scale factor 1700
NOTE: Data generation for scale factors > 1000 GB is still in development,
and is not yet supported.
Your resulting data set MAY NOT BE COMPLIANT!
建议可以使用小于1000G的数据测试,因子小于1400,可进行测试,本人未做验证。
3. 创建表结构
1) 确定测试目录
[root@doris1 output]# pwd
/root/ssb-poc/output
2) 修改配置文件conf/doris.conf
[root@doris1 output]# vim conf/doris.conf #即/poc/ssb-poc/output/conf/doris.conf,指定脚本操作的Doris集群地址
[doris]
# for mysql cmd
mysql_host: doris1
mysql_port: 9030
mysql_user: root
mysql_password: demo_415
doris_db: ssb
# cluster ports
http_port: 8030
be_heartbeat_port: 9050
broker_port: 8000
参数说明:
[doris]
# for mysql cmd
mysql_host: doris1 #FE节点主机名,IP地址也可以
mysql_port: 9030 #FE端口号
mysql_user: root #DorisDB用户名root
mysql_password: poc_ssb #root密码
doris_db: ssb #生成的测试数据库名为ssb(Star schema benchmark)
# cluster ports
http_port: 8030 #BE端口
be_heartbeat_port: 9050 #BE心跳端口
broker_port: 8000 #broker端口号
3) 执行建表语句
执行脚本建表,ddl_100表示6亿条记录规模,ddl_1000表示60亿条记录规模,ddl_1677约100亿条记录规模,生成数据在output/data_dir目录.
--生成6亿条记录规模空表,本次未执行,仅供参考。
[root@doris1 output]# time bin/create_db_table.sh ddl_100
sql: ssb_create success
sql: ssb_flat_create success
real 0m0.318s
user 0m0.110s
sys 0m0.015s
生成60亿条记录规模空表,本次测试规模。
[root@doris1 output]# time bin/create_db_table.sh ddl_1000
sql: ssb_create success
sql: ssb_flat_create success
real 0m0.423s
user 0m0.091s
sys 0m0.013s
4) 查看生成的表
mysql> use ssb;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> show tables;
+----------------+
| Tables_in_ssb |
+----------------+
| customer |
| dates |
| lineorder |
| lineorder_flat |
| part |
| supplier |
+----------------+
6 rows in set (0.00 sec)
生成表截图:
查看记录数:
4. 数据导入
1) 导入单表记录,用时41m5.584s
使用Stream load导入单表数据,用时41m5.584s,data_dir为前述数据生成目录,将数据导入DorisDB数据库中。
[root@doris1 output]# time bin/stream_load.sh data_dir
lineorder/customer/dates/part/supplier
real 41m5.584s
user 0m30.070s
sys 8m24.738s
其中lineorder导入记录60亿条。其它记录数详见1.2测试数据说明。
2) 插入数据到宽表lineorder_flat,用时60m0.412s
[root@doris1 output]# time bin/flat_insert.sh
sql: ssb_flat_insert start
sql: ssb_flat_insert. flat insert error, msg: (1064, 'Cancelled')
real 60m0.412s
user 0m0.102s
sys 0m0.029s
60亿宽表插入报错,可能是SSB-POC工具问题,也可能是DorisDB问题,当前使用DorisDB1.15.2,原因待查。
实际只导入36亿条,36亿条记录可满足测试需求。
5. 设置查询并行度
首先在客户端执行命令,因只有8个core,此处设置Doris的并行度为8,类似clickhouse set max_threads= 8。
mysql> set global parallel_fragment_exec_instance_num = 8;
Query OK, 0 rows affected (0.01 sec)
mysql> show variables like 'parallel_fragment_exec_instance_num';
+-------------------------------------+-------+
| Variable_name | Value |
+-------------------------------------+-------+
| parallel_fragment_exec_instance_num | 8 |
+-------------------------------------+-------+
1 row in set (0.00 sec)
此处只进行了并行度设置,未进行fe、be参数优化设置。
6. SSB测试
共执行了4类场景,13条查询语句,分别为单表查询和多表查询,结果为毫秒,并发为1。
1) 单表测试场景
执行语句:
[root@doris1 output]# time bin/benchmark.sh -p -d ssb
------ dataset: ssb, concurrency: 1 ------
sql\time(ms)\parallel_num 1
q1 14430.0
q2 4153.0
q3 3265.0
q4 29478.0
q5 19391.0
q6 14638.0
q7 83782.0
q8 21801.0
q9 17182.0
q10 5030.0
q11 68037.0
q12 31603.0
q13 39640.0
real 5m59.162s
user 0m0.154s
sys 0m0.113s
单表查询时间超过分钟的为查询q7、q11均是对宽表的查询。
执行截图:
2) 多表jion测试场景
执行语句:
[root@doris1 output]# time bin/benchmark.sh -p -d ssb-flat
------ dataset: ssb-flat, concurrency: 1 ------
sql\time(ms)\parallel_num 1
q1 20464.0
q2 34725.0
q3 25810.0
q4 35177.0
q5 31152.0
q6 23387.0
q7 22354.0
q8 11362.0
q9 1970.0
q10 35.0
q11 36059.0
q12 63.0
q13 32.0
real 4m9.385s
user 0m0.167s
sys 0m0.099s
多表join查询时间未超过分钟级。
执行截图:
7. 测试语句
1) 单表测试SQL
--Q1.1
SELECT sum(lo_extendedprice * lo_discount) AS `revenue`
FROM lineorder_flat
WHERE lo_orderdate >= '1993-01-01' and lo_orderdate <= '1993-12-31' AND lo_discount BETWEEN 1 AND 3 AND lo_quantity < 25;
--Q1.2
SELECT sum(lo_extendedprice * lo_discount) AS revenue FROM lineorder_flat
WHERE lo_orderdate >= '1994-01-01' and lo_orderdate <= '1994-01-31' AND lo_discount BETWEEN 4 AND 6 AND lo_quantity BETWEEN 26 AND 35;
--Q1.3
SELECT sum(lo_extendedprice * lo_discount) AS revenue
FROM lineorder_flat
WHERE weekofyear(lo_orderdate) = 6 AND lo_orderdate >= '1994-01-01' and lo_orderdate <= '1994-12-31'
AND lo_discount BETWEEN 5 AND 7 AND lo_quantity BETWEEN 26 AND 35;
--Q2.1
SELECT sum(lo_revenue), year(lo_orderdate) AS year, p_brand
FROM lineorder_flat
WHERE p_category = 'MFGR#12' AND s_region = 'AMERICA'
GROUP BY year, p_brand
ORDER BY year, p_brand;
--Q2.2
SELECT
sum(lo_revenue), year(lo_orderdate) AS year, p_brand
FROM lineorder_flat
WHERE p_brand >= 'MFGR#2221' AND p_brand <= 'MFGR#2228' AND s_region = 'ASIA'
GROUP BY year, p_brand
ORDER BY year, p_brand;
--Q2.3
SELECT sum(lo_revenue), year(lo_orderdate) AS year, p_brand
FROM lineorder_flat
WHERE p_brand = 'MFGR#2239' AND s_region = 'EUROPE'
GROUP BY year, p_brand
ORDER BY year, p_brand;
--Q3.1
SELECT c_nation, s_nation, year(lo_orderdate) AS year, sum(lo_revenue) AS revenue FROM lineorder_flat
WHERE c_region = 'ASIA' AND s_region = 'ASIA' AND lo_orderdate >= '1992-01-01' AND lo_orderdate <= '1997-12-31'
GROUP BY c_nation, s_nation, year
ORDER BY year ASC, revenue DESC;
--Q3.2
SELECT c_city, s_city, year(lo_orderdate) AS year, sum(lo_revenue) AS revenue
FROM lineorder_flat
WHERE c_nation = 'UNITED STATES' AND s_nation = 'UNITED STATES' AND lo_orderdate >= '1992-01-01' AND lo_orderdate <= '1997-12-31'
GROUP BY c_city, s_city, year
ORDER BY year ASC, revenue DESC;
--Q3.3
SELECT c_city, s_city, year(lo_orderdate) AS year, sum(lo_revenue) AS revenue
FROM lineorder_flat
WHERE c_city in ( 'UNITED KI1' ,'UNITED KI5') AND s_city in ( 'UNITED KI1' ,'UNITED KI5') AND lo_orderdate >= '1992-01-01' AND lo_orderdate <= '1997-12-31'
GROUP BY c_city, s_city, year
ORDER BY year ASC, revenue DESC;
--Q3.4
SELECT c_city, s_city, year(lo_orderdate) AS year, sum(lo_revenue) AS revenue
FROM lineorder_flat
WHERE c_city in ('UNITED KI1', 'UNITED KI5') AND s_city in ( 'UNITED KI1', 'UNITED KI5') AND lo_orderdate >= '1997-12-01' AND lo_orderdate <= '1997-12-31'
GROUP BY c_city, s_city, year
ORDER BY year ASC, revenue DESC;
--Q4.1
set vectorized_engine_enable = FALSE;
SELECT year(lo_orderdate) AS year, c_nation, sum(lo_revenue - lo_supplycost) AS profit FROM lineorder_flat
WHERE c_region = 'AMERICA' AND s_region = 'AMERICA' AND p_mfgr in ( 'MFGR#1' , 'MFGR#2')
GROUP BY year, c_nation
ORDER BY year ASC, c_nation ASC;
--Q4.2
SELECT year(lo_orderdate) AS year,
s_nation, p_category, sum(lo_revenue - lo_supplycost) AS profit
FROM lineorder_flat
WHERE c_region = 'AMERICA' AND s_region = 'AMERICA' AND lo_orderdate >= '1997-01-01' and lo_orderdate <= '1998-12-31' AND p_mfgr in ( 'MFGR#1' , 'MFGR#2')
GROUP BY year, s_nation, p_category
ORDER BY year ASC, s_nation ASC, p_category ASC;
--Q4.3
SELECT year(lo_orderdate) AS year, s_city, p_brand,
sum(lo_revenue - lo_supplycost) AS profit
FROM lineorder_flat
WHERE s_nation = 'UNITED STATES' AND lo_orderdate >= '1997-01-01' and lo_orderdate <= '1998-12-31' AND p_category = 'MFGR#14'
GROUP BY year, s_city, p_brand
ORDER BY year ASC, s_city ASC, p_brand ASC;
2) 多表jion测试SQL
--Q1.1
select sum(lo_revenue) as revenue
from lineorder join dates on lo_orderdate = d_datekey
where d_year = 1993 and lo_discount between 1 and 3 and lo_quantity < 25;
--Q1.2
select sum(lo_revenue) as revenue
from lineorder
join dates on lo_orderdate = d_datekey
where d_yearmonthnum = 199401
and lo_discount between 4 and 6
and lo_quantity between 26 and 35;
--Q1.3
select sum(lo_revenue) as revenue
from lineorder
join dates on lo_orderdate = d_datekey
where d_weeknuminyear = 6 and d_year = 1994
and lo_discount between 5 and 7
and lo_quantity between 26 and 35;
--Q2.1
select sum(lo_revenue) as lo_revenue, d_year, p_brand
from lineorder
inner join dates on lo_orderdate = d_datekey
join part on lo_partkey = p_partkey
join supplier on lo_suppkey = s_suppkey
where p_category = 'MFGR#12' and s_region = 'AMERICA'
group by d_year, p_brand
order by d_year, p_brand;
--Q2.2
select sum(lo_revenue) as lo_revenue, d_year, p_brand
from lineorder
join dates on lo_orderdate = d_datekey
join part on lo_partkey = p_partkey
join supplier on lo_suppkey = s_suppkey
where p_brand between 'MFGR#2221' and 'MFGR#2228' and s_region = 'ASIA'
group by d_year, p_brand
order by d_year, p_brand;
--Q2.3
select sum(lo_revenue) as lo_revenue, d_year, p_brand
from lineorder
join dates on lo_orderdate = d_datekey
join part on lo_partkey = p_partkey
join supplier on lo_suppkey = s_suppkey
where p_brand = 'MFGR#2239' and s_region = 'EUROPE'
group by d_year, p_brand
order by d_year, p_brand;
--Q3.1
select c_nation, s_nation, d_year, sum(lo_revenue) as lo_revenue
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
where c_region = 'ASIA' and s_region = 'ASIA'and d_year >= 1992 and d_year <= 1997
group by c_nation, s_nation, d_year
order by d_year asc, lo_revenue desc;
--Q3.2
select c_city, s_city, d_year, sum(lo_revenue) as lo_revenue
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
where c_nation = 'UNITED STATES' and s_nation = 'UNITED STATES'
and d_year >= 1992 and d_year <= 1997
group by c_city, s_city, d_year
order by d_year asc, lo_revenue desc;
--Q3.3
select c_city, s_city, d_year, sum(lo_revenue) as lo_revenue
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
where (c_city='UNITED KI1' or c_city='UNITED KI5')
and (s_city='UNITED KI1' or s_city='UNITED KI5')
and d_year >= 1992 and d_year <= 1997
group by c_city, s_city, d_year
order by d_year asc, lo_revenue desc;
--Q3.4
select c_city, s_city, d_year, sum(lo_revenue) as lo_revenue
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
where (c_city='UNITED KI1' or c_city='UNITED KI5') and (s_city='UNITED KI1' or s_city='UNITED KI5') and d_yearmonth
= 'Dec1997'
group by c_city, s_city, d_year
order by d_year asc, lo_revenue desc;
--Q4.1
select d_year, c_nation, sum(lo_revenue) - sum(lo_supplycost) as profit
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
join part on lo_partkey = p_partkey
where c_region = 'AMERICA' and s_region = 'AMERICA' and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
group by d_year, c_nation
order by d_year, c_nation;
--Q4.2
select d_year, s_nation, p_category, sum(lo_revenue) - sum(lo_supplycost) as profit
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
join part on lo_partkey = p_partkey
where c_region = 'AMERICA'and s_region = 'AMERICA'
and (d_year = 1997 or d_year = 1998)
and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
group by d_year, s_nation, p_category
order by d_year, s_nation, p_category;
--Q4.3
select d_year, s_city, p_brand, sum(lo_revenue) - sum(lo_supplycost) as profit
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
join part on lo_partkey = p_partkey
where c_region = 'AMERICA'and s_nation = 'UNITED STATES'
and (d_year = 1997 or d_year = 1998)
and p_category = 'MFGR#14'
group by d_year, s_city, p_brand
order by d_year, s_city, p_brand;
8. 小结
在60亿条单表记录,36亿条宽表记录情况下,通过SSB测试,DorisDB单表查询时间超过分钟的为查询q7、q11,均是对宽表的查询,多表join查询时间未超过分钟级,性能还是较为给力。作为国产OLAP一哥,还是值得一试。
和同类竞品clickhouse、greenplum相比具有很大的优势,文中未做对比。
-- 完 --
作者:王坤,微信公众号:rundba,欢迎转载,转载请注明出处。
如需公众号转发,请联系wx:landnow。