【GreatSQL优化器-14】直方图应用

GreatSQL社区 2025-02-19

* GreatSQL社区原创内容未经授权不得随意使用，转载请联系小编并注明来源。

一、直方图介绍

GreatSQL的优化器负责将SQL查询转换为尽可能高效的执行计划，但因为数据环境不断变化有可能导致优化器对查询数据了解不够充足，可能无法生成最优的执行计划进而影响查询效率，因此推出了直方图(histogram)功能来解决该问题。

直方图用于统计字段值的分布情况，向优化器提供统计信息。利用直方图，可以对一张表的一列数据做分布统计，估算where条件中过滤字段的选择率，从而帮助优化器更准确地估计查询过程中的行数，选择更高效的查询计划。

下面用一个简单的例子来说明直方图怎么应用在优化器。

greatsql> CREATE TABLE t1 (c1 INT PRIMARY KEY, c2 INT,date1 DATETIME);
greatsql> INSERT INTO t1 VALUES (1,10,'2021-03-25 16:44:00.123456'),(2,1,'2022-03-26 16:44:00.123456'),(3,4,'2023-03-27 16:44:00.123456'),(5,5,'2024-03-25 16:44:00.123456');
greatsql> CREATE TABLE t2 (cc1 INT PRIMARY KEY, cc2 INT);
greatsql> INSERT INTO t2 VALUES (1,3),(2,1),(3,2),(4,3),(5,15);
greatsql> CREATE TABLE t3 (ccc1 INT, ccc2 varchar(100));
greatsql> INSERT INTO t3 VALUES (1,'aa1'),(2,'bb1'),(3,'cc1'),(4,'dd1'),(null,'ee');
greatsql> CREATE INDEX idx1 ON t1(c2);
greatsql> CREATE INDEX idx2 ON t1(c2,date1);
greatsql> CREATE INDEX idx2_1 ON t2(cc2);
greatsql> CREATE INDEX idx3_1 ON t3(ccc1);

系统自动创建buckets：
greatsql> ANALYZE TABLE t1 UPDATE HISTOGRAM ON c2 WITH 3 BUCKETS;
greatsql> SELECT json_pretty(histogram)result FROM information_schema.column_statistics WHERE table_name = 't1';
| {
  "buckets": [
    [
      1,
      5,
      0.42857142857142855,
      3
    ],
    [
      10,
      10,
      0.7142857142857143,
      1
    ],
    [
      16,
      16,
      0.8571428571428571,
      1
    ]
  ],
  "data-type": "int",
  "null-values": 0.14285714285714285,
  "collation-id": 8,
  "last-updated": "2024-10-22 08:38:48.858099",
  "sampling-rate": 1.0,
  "histogram-type": "equi-height",
  "number-of-buckets-specified": 3
} 

greatsql> EXPLAIN SELECT * FROM t1 JOIN t3 ON t1.c1=t3.ccc1 OR t1.c2<5;
+----+-------------+-------+------------+------+-------------------+------+---------+------+------+----------+------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys     | key  | key_len | ref  | rows | filtered | Extra                                          |
+----+-------------+-------+------------+------+-------------------+------+---------+------+------+----------+------------------------------------------------+
|  1 | SIMPLE      | t3    | NULL       | ALL  | idx3_1            | NULL | NULL    | NULL |    5 |   100.00 | NULL                                           |
|  1 | SIMPLE      | t1    | NULL       | ALL  | PRIMARY,idx1,idx2 | NULL | NULL    | NULL |    7 |    43.67 | Range checked for each record (index map: 0x7) |
+----+-------------+-------+------------+------+-------------------+------+---------+------+------+----------+------------------------------------------------+

                "plan_prefix": [
                ],
                "table": "`t1`",
                "best_access_path": {
                  "considered_access_paths": [
                    {
                      "rows_to_scan": 7,
                      "filtering_effect": [
                        {
                          "condition": "(`t1`.`c2` < 5)", 对t1.c2的过滤系数估计用到了直方图
                          "histogram_selectivity": 0.342857 这里过滤系数算出来为0.342857，即直方图第一个桶小于5的数据占的百分比
                        }
                      ],
                      "final_filtering_effect": 1,
                      "access_type": "scan",
                      "resulting_rows": 7,
                      "cost": 0.95,
                      "chosen": true
                    }
                  ]
                },

二、get_histogram_selectivity代码解释

直方图用在优化器计算过滤系数的时候，算出来的频率数据更精确。

// Item在计算get_filtering_effect的时候会用到直方图估计过滤系数
static bool get_histogram_selectivity(THD *thd, const Field *field, Item **args,
                                      size_t arg_count,
                                      histograms::enum_operator op,
                                      Item_func *item_func,
                                      const TABLE_SHARE *table_share,
                                      double *selectivity) {
  const histograms::Histogram *histogram =
      table_share->find_histogram(field->field_index());
  if (histogram != nullptr) {
    // 计算过滤系数
    if (!histogram->get_selectivity(args, arg_count, op, selectivity)) {
      return false;
    }
  }
  return true;
}

bool Histogram::get_selectivity(Item **items, size_t item_count,
                                enum_operator op, double *selectivity) const {
  // 找该表的列对应的直方图，取出对应的范围的数据频率。计算方式见表一和表二
  if (get_raw_selectivity(items, item_count, op, selectivity)) return true;
}

表一：等高直方图数据分布频率计算方式

操作op	符合条件区间在第一个桶	符合条件区间不在第一个桶	计算公式
get_equal_to_selectivity	bucket_frequency = 当前找到的频率	bucket_frequency = 当前频率-上一个桶频率	bucket_frequency 不同值个数
get_less_than_selectivity	previous_bucket_cumulative_frequency = 0.0 found_bucket_frequency =当前找到的频率	revious_bucket_cumulative_frequency =上一个桶频率 found_bucket_frequency=当前频率-上一个桶频率	value值在区间左边：previous_bucket_cumulative_frequency value值在区间中间：previous_bucket_cumulative_frequency +(found_bucket_frequency * get_distance_from_lower(value))
get_greater_than_selectivity	found_bucket_frequency=当前找到的频率	found_bucket_frequency=当前频率-上一个桶频率	value值在区间左边：found_bucket_frequency+下一个桶频率 value值在区间中间：get_distance_from_upper() * found_bucket_frequency+下一个桶频率

表二：等宽直方图数据分布频率计算方式

操作op	符合条件区间在第一个桶	符合条件区间不在第一个桶
get_equal_to_selectivity	当前频率	当前频率-上一个桶频率
get_less_than_selectivity	0.0	上一个桶频率
get_greater_than_selectivity	get_non_null_values_fraction()	get_non_null_values_fraction() - 上一个桶频率

表三：等高直方图get_distance_from_upper计算公式

操作op	计算公式	说明
longlong类型	(upper_inclusive - value) (upper_inclusive - get_lower_inclusive() + 1.0)	upper_inclusive指的是区间最大值
ulonglong类型	(upper_inclusive - value) (upper_inclusive - get_lower_inclusive() + 1.0)	upper_inclusive指的是区间最大值
其他类型	1.0 - get_distance_from_lower(value)	get_distance_from_lower计算见表四

表四：等高直方图get_distance_from_lower计算公式

操作op	计算公式	说明
double类型	(value - get_lower_inclusive()) (get_upper_inclusive() - get_lower_inclusive())	lower_inclusive指的是区间最小值
其他类型	(value - lower_inclusive) (get_upper_inclusive() - lower_inclusive + 1.0)	lower_inclusive指的是区间最小值

三、实际例子说明

接下来看几个例子来说明上面的代码。

首先创建等高直方图，看看结果。

greatsql> ANALYZE TABLE t1 UPDATE HISTOGRAM ON c2 WITH 3 BUCKETS;
greatsql> EXPLAIN SELECT * FROM t1 join t3 ON t1.c1=t3.ccc1 or t1.c2<5;
+----+-------------+-------+------------+------+-------------------+------+---------+------+------+----------+------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys     | key  | key_len | ref  | rows | filtered | Extra                                          |
+----+-------------+-------+------------+------+-------------------+------+---------+------+------+----------+------------------------------------------------+
|  1 | SIMPLE      | t3    | NULL       | ALL  | idx3_1            | NULL | NULL    | NULL |    5 |   100.00 | NULL                                           |
|  1 | SIMPLE      | t1    | NULL       | ALL  | PRIMARY,idx1,idx2 | NULL | NULL    | NULL |    7 |    43.67 | Range checked for each record (index map: 0x7) |
+----+-------------+-------+------------+------+-------------------+------+---------+------+------+----------+------------------------------------------------+
                "plan_prefix": [
                ],
                "table": "`t1`",
                "best_access_path": {
                  "considered_access_paths": [
                    {
                      "rows_to_scan": 7,
                      "filtering_effect": [
                        {
                          "condition": "(`t1`.`c2` < 5)", 对t1.c2的过滤系数估计用到了直方图
                          "histogram_selectivity": 0.342857 这里过滤系数计算公式见下面
                        }
                      ],
                      "final_filtering_effect": 1,
                      "access_type": "scan",
                      "resulting_rows": 7,
                      "cost": 0.95,
                      "chosen": true
                    }
                  ]
                },
计算公式：
这里用的是get_less_than_selectivity方法，因为t1.c2<5满足第一个桶的范围，因此数据见下面
previous_bucket_cumulative_frequency = 0.0
found_bucket_frequency = 0.428571428
结果 = previous_bucket_cumulative_frequency + (found_bucket_frequency * get_distance_from_lower(value))
    = 0.0 + 0.428571428 * (value - lower_inclusive) / (get_upper_inclusive() - lower_inclusive + 1.0)
    = 0.428571428 * (5-1) / (5-1+1)
    = 0.428571428 * 0.8
    = 0.342857

看一下实际的数据分布，可以看到小于5的数据实际只有1和4两条，实际在第一个桶的占比应该是2/3=67%，而上面计算出来的占比是80%，也就是比实际的占比会偏大。
greatsql> SELECT c2,count(*) FROM t1 GROUP BY c2; 
+------+----------+
| c2   | count(*) |
+------+----------+
| NULL |        1 |
|    1 |        1 |
|    4 |        1 |
|    5 |        1 |
|   10 |        2 |
|   16 |        1 |
+------+----------+
6 rows in set (0.00 sec)

接着创建等宽直方图，看看结果。

greatsql> ANALYZE TABLE t1 UPDATE HISTOGRAM ON c2 WITH 5 BUCKETS;
greatsql> SELECT json_pretty(histogram)result FROM information_schema.column_statistics WHEREtable_name = 't1';
| {
  "buckets": [
    [
      1,
      0.14285714285714285
    ],
    [
      4,
      0.2857142857142857
    ],
    [
      5,
      0.42857142857142855
    ],
    [
      10,
      0.7142857142857143
    ],
    [
      16,
      0.8571428571428571
    ]
  ],
  "data-type": "int",
  "null-values": 0.14285714285714285,
  "collation-id": 8,
  "last-updated": "2024-10-25 02:18:36.107382",
  "sampling-rate": 1.0,
  "histogram-type": "singleton",
  "number-of-buckets-specified": 5
} 

greatsql> EXPLAIN SELECT * FROM t1 JOIN t3 ON t1.c1=t3.ccc1 OR t1.c2<5;
+----+-------------+-------+------------+------+-------------------+------+---------+------+------+----------+------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys     | key  | key_len | ref  | rows | filtered | Extra                                          |
+----+-------------+-------+------------+------+-------------------+------+---------+------+------+----------+------------------------------------------------+
|  1 | SIMPLE      | t3    | NULL       | ALL  | idx3_1            | NULL | NULL    | NULL |    5 |   100.00 | NULL                                           |
|  1 | SIMPLE      | t1    | NULL       | ALL  | PRIMARY,idx1,idx2 | NULL | NULL    | NULL |    7 |    38.78 | Range checked for each record (index map: 0x7) |                    
                                                                                                  过滤系数比等高的43.67低
+----+-------------+-------+------------+------+-------------------+------+---------+------+------+----------+------------------------------------------------+
                "plan_prefix": [
                ],
                "table": "`t1`",
                "best_access_path": {
                  "considered_access_paths": [
                    {
                      "rows_to_scan": 7,
                      "filtering_effect": [
                        {
                          "condition": "(`t1`.`c2` < 5)",
                          "histogram_selectivity": 0.285714 这个数据就是上面第二桶的频率值，对比上面等高直方图的0.342857更精确
                        }
                      ],
                      "final_filtering_effect": 1,
                      "access_type": "scan",
                      "resulting_rows": 7,
                      "cost": 0.95,
                      "chosen": true
                    }
                  ]
                },

现在看一下之前在【GreatSQL优化器-06】条件过滤导致选择非最佳用过的那个例子，看看在直方图的影响下结果是什么

greatsql>CREATE TABLE t3 (ccc1 INT, ccc2 int,ccc3 datetime(6));
greatsql>INSERT INTO t3 VALUES (1,2,'2021-03-25 16:44:00.123456'),(2,10,'2021-03-25 16:44:00.123456'),(3,4,'2022-03-25 16:44:00.123456'),(4,6,'2023-03-25 16:44:00.123456'),(null,7,'2024-03-25 16:44:00.123456'),(4,3,'2024-04-25 16:44:00.123456'),(null,8,'2025-03-25 16:44:00.123456'),(3,4,'2022-06-25 16:44:00.123456'),(5,4,'2021-11-25 16:44:00.123456');
greatsql>CREATE TABLE t4 (d1 INT, d2 int, d3 varchar(100));
greatsql>INSERT INTO t4 VALUES (1,2,'aa1'),(2,1,'bb1'),(2,3,'cc1'),(3,3,'cc1'),(4,2,'ff1'),(4,4,'ert'),(4,2,'f5fg'),(null,2,'ee'),(5,30,'cc1'),(5,4,'fcc1'),(4,10,'cc1'),(6,4,'ccd1'),(null,1,'fee'),(1,2,'aa1'),(2,1,'bb1'),(2,3,'cc1'),(3,3,'cc1'),(4,2,'ff1'),(4,4,'ert'),(4,2,'f5fg'),(null,2,'ee'),(5,30,'cc1'),(5,4,'fcc1'),(4,10,'cc1'),(6,4,'ccd1'),(null,1,'fee'),(1,2,'aa1'),(2,1,'bb1'),(2,3,'cc1'),(3,3,'cc1'),(4,2,'ff1'),(4,4,'ert'),(4,2,'f5fg'),(null,2,'ee'),(5,30,'cc1'),(5,4,'fcc1'),(4,10,'cc1'),(6,4,'ccd1'),(null,1,'fee');
greatsql>CREATE INDEX idx3_2 ON t3(ccc2);
greatsql>CREATE INDEX idx3_3 ON t3(ccc3);
greatsql>CREATE INDEX idx4_2 ON t4(d2);

首先看一下没有建立直方图之前的结果，这里t4用了全表扫描。
greatsql>  EXPLAIN SELECT * FROM t4 join t3 ON t4.d1=t3.ccc1 and t4.d2=t3.ccc2 where t4.d1<5 and t3.ccc3 < '2023-11-15';
+----+-------------+-------+------------+------+---------------+--------+---------+-----------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key    | key_len | ref       | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+--------+---------+-----------+------+----------+-------------+
|  1 | SIMPLE      | t4    | NULL       | ALL  | idx4_2        | NULL   | NULL    | NULL      |   39 |    33.33 | Using where |
|  1 | SIMPLE      | t3    | NULL       | ref  | idx3_2,idx3_3 | idx3_2 | 5       | db1.t4.d2 |    1 |    11.11 | Using where |
+----+-------------+-------+------------+------+---------------+--------+---------+-----------+------+----------+-------------+

-- 对t4没有索引的列d1创建直方图
greatsql> ANALYZE TABLE t4 UPDATE HISTOGRAM ON d1 WITH 5 BUCKETS;

-- 可以看到结果已经变为更优的t3作为驱动表了，这里看出直方图对于估计结果更为精确。
greatsql>  EXPLAIN SELECT * FROM t4 join t3 ON t4.d1=t3.ccc1 and t4.d2=t3.ccc2 where t4.d1<5 and t3.ccc3 < '2023-11-15';
+----+-------------+-------+------------+-------+---------------+--------+---------+-------------+------+----------+------------------------------------+
| id | select_type | table | partitions | type  | possible_keys | key    | key_len | ref         | rows | filtered | Extra                              |
+----+-------------+-------+------------+-------+---------------+--------+---------+-------------+------+----------+------------------------------------+
|  1 | SIMPLE      | t3    | NULL       | range | idx3_2,idx3_3 | idx3_3 | 9       | NULL        |    6 |   100.00 | Using index condition; Using where |
|  1 | SIMPLE      | t4    | NULL       | ref   | idx4_2        | idx4_2 | 5       | db1.t3.ccc2 |    6 |     6.15 | Using where                        |
+----+-------------+-------+------------+-------+---------------+--------+---------+-------------+------+----------+------------------------------------+