暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

MySQL8.0直方图

mysql code tracer 2021-04-25
477

导读

相信对于Oracle DBA来说,直方图一定不陌生了,但是对于MySQL DBA来说,尤其是MySQL5.6 5.7的使用者来说,可能就有一点陌生了。在使用MySQL5.7及以下版本的时候,经常会碰到由于数据倾斜等原因,导致优化器根据现有的索引统计信息并没有给出最优的执行计划,而直方图就能够很好地解决这种尴尬的窘境,今天我们就一起来看看什么是直方图。

什么是直方图

直方图是MySQL8.0.2开始推出的,也是MySQL8.0最重要的特性之一。在某些情况下,优化器无法找到最佳的执行计划,因为忽略了某些未索引的列。引入直方图的目的是为优化器提供基于直方图的统计信息,从而生成更优的执行计划。

假如有一个存储火车出发时间表:

  1. CREATE TABLE train_schedule(

  2. id INT PRIMARY KEY,

  3. train_code VARCHAR(10),

  4. departure_station VARCHAR(100),

  5. departure_time TIME);

假设高峰时段(从7AM到9AM),而夜间一般来说班次很少,也就是说在7-9点之间的行很多,而凌晨的行数很少,现有查询如下

  1. SELECT * FROM train_schedule WHERE departure_time BETWEEN '07:30:00' AND '09:15:00';

  2. SELECT * FROM train_schedule WHERE departure_time BETWEEN '01:00:00' AND '03:00:00';

在没有任何统计信息的情况下,优化器默认情况下会假设departure_time列中的值是均匀分布的,但实际上并不是均匀的,第一个SQL返回的行数更多。

直方图的出现能够为优化器提供更好的预估,尤其是在JOIN中包含多个表,对于优化器决定执行计划中考虑表的顺序,返回的行数就非常重要了。

如果对返回的行进行了很好的预估,优化器可以在返回少数行的情况下在第一阶段打开表,这样可以使笛卡尔积的总行数最小化,使查询更快。

直方图的分类

  • 等宽直方图

  • 等高直方图

等宽直方图

等宽直方图的特点:

  1. 每个桶代表一个值

  2. 每个桶存储了值、频率数据

  3. 适用于等值和范围条件

等高直方图

等高直方图的特点:

  1. 每个桶存储了多个值

  2. 每个桶存储了最大值、最小值、累积频率、不同值的数量

  3. 不是真正的等高:频繁的值存储在单独的桶中

  4. 适合范围条件

直方图运维

创建直方图
  1. mysql> ANALYZE TABLE city UPDATE HISTOGRAM ON population WITH 1024 BUCKETS;

  2. +------------+-----------+----------+-------------------------------------------------------+

  3. | Table | Op | Msg_type | Msg_text |

  4. +------------+-----------+----------+-------------------------------------------------------+

  5. | world.city | histogram | status | Histogram statistics created for column 'Population'. |

  6. +------------+-----------+----------+-------------------------------------------------------+

说明:如果不指定BUCKETS的数量,默认为100

删除直方图
  1. mysql> ANALYZE TABLE city DROP HISTOGRAM ON population;

  2. +------------+-----------+----------+-------------------------------------------------------+

  3. | Table | Op | Msg_type | Msg_text |

  4. +------------+-----------+----------+-------------------------------------------------------+

  5. | world.city | histogram | status | Histogram statistics removed for column 'population'. |

  6. +------------+-----------+----------+-------------------------------------------------------+

查看直方图
  1. mysql> SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, JSON_PRETTY(HISTOGRAM)

  2. -> FROM information_schema.column_statistics

  3. -> WHERE COLUMN_NAME = 'population'\G

  4. *************************** 1. row ***************************

  5. SCHEMA_NAME: world

  6. TABLE_NAME: city

  7. COLUMN_NAME: Population

  8. JSON_PRETTY(HISTOGRAM): {

  9. "buckets": [

  10. [

  11. 42,

  12. 455,

  13. 0.000980632507967639,

  14. 4

  15. ],

  16. ...

  17. [

  18. 9696300,

  19. 10500000,

  20. 1.0,

  21. 4

  22. ]

  23. ],

  24. "data-type": "int",

  25. "null-values": 0.0,

  26. "collation-id": 8,

  27. "last-updated": "2020-04-26 13:06:01.657385",

  28. "sampling-rate": 1.0,

  29. "histogram-type": "equi-height",

  30. "number-of-buckets-specified": 1024

  31. }

  32. 1 row in set (0.00 sec)

说明:

  • schema_name:schema的名字

  • table_name:表名

  • column_name:列名字

  • json_pretty(histogram):json格式化直方图信息

  • data-type:数据类型

  • null-values:null值个数

  • collation-id:直方图数据的collation id,对应于INFORMATION_SCHEMA.COLLATIONS表中的值,对于string类型的列该列才有意义

  • last-updated:最近更新时间

  • sampling-rate:采样率

  • histogram-type:直方图类型

  • number-of-buckets-specified:buckets数量

来看一个等宽直方图

  1. mysql> ANALYZE TABLE country UPDATE HISTOGRAM ON Region;

  2. +---------------+-----------+----------+---------------------------------------------------+

  3. | Table | Op | Msg_type | Msg_text |

  4. +---------------+-----------+----------+---------------------------------------------------+

  5. | world.country | histogram | status | Histogram statistics created for column 'Region'. |

  6. +---------------+-----------+----------+---------------------------------------------------+

  7. 1 row in set (0.01 sec)


  8. mysql> SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, JSON_PRETTY(HISTOGRAM) FROM information_schema.column_statistics WHERE COLUMN_NAME = 'Region'\G

  9. *************************** 1. row ***************************

  10. SCHEMA_NAME: world

  11. TABLE_NAME: country

  12. COLUMN_NAME: Region

  13. JSON_PRETTY(HISTOGRAM): {

  14. "buckets": [

  15. [

  16. "base64:type254:QW50YXJjdGljYQ==",

  17. 0.02092050209205021

  18. ],

  19. ...

  20. [

  21. "base64:type254:V2VzdGVybiBFdXJvcGU=",

  22. 1.0

  23. ]

  24. ],

  25. "data-type": "string",

  26. "null-values": 0.0,

  27. "collation-id": 8,

  28. "last-updated": "2020-04-26 13:10:44.368923",

  29. "sampling-rate": 1.0,

  30. "histogram-type": "singleton",

  31. "number-of-buckets-specified": 100

  32. }

什么时候创建等宽直方图,什么时候创建等高直方图?

这一点在MySQL创建直方图时自动将值划分为存储桶,并自动决定要创建哪种直方图。

关于histogramgenerationmaxmemsize

histogramgenerationmaxmemsize定义了直方图使用的最大内存,默认值为20000000,单位是bytes,约为20M,该值建议不要设置过大,避免直方图占用过多的内存。

但是这个值如果不够大的话,也会导致采样比例过低,数据统计不准确。

直方图测试

  1. # 数据及查询条件说明

  2. mysql> select city.name, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000;

  3. +-----------------+-----------+

  4. | name | language |

  5. +-----------------+-----------+

  6. | Mumbai (Bombay) | Asami |

  7. | Mumbai (Bombay) | Bengali |

  8. | Mumbai (Bombay) | Gujarati |

  9. | Mumbai (Bombay) | Hindi |

  10. | Mumbai (Bombay) | Kannada |

  11. | Mumbai (Bombay) | Malajalam |

  12. | Mumbai (Bombay) | Marathi |

  13. | Mumbai (Bombay) | Orija |

  14. | Mumbai (Bombay) | Punjabi |

  15. | Mumbai (Bombay) | Tamil |

  16. | Mumbai (Bombay) | Telugu |

  17. | Mumbai (Bombay) | Urdu |

  18. +-----------------+-----------+

  19. 12 rows in set (0.04 sec)


  20. mysql> show create table city\G

  21. *************************** 1. row ***************************

  22. Table: city

  23. Create Table: CREATE TABLE `city` (

  24. `ID` int NOT NULL AUTO_INCREMENT,

  25. `Name` char(35) NOT NULL DEFAULT '',

  26. `CountryCode` char(3) NOT NULL DEFAULT '',

  27. `District` char(20) NOT NULL DEFAULT '',

  28. `Population` int NOT NULL DEFAULT '0',

  29. PRIMARY KEY (`ID`),

  30. KEY `CountryCode` (`CountryCode`),

  31. CONSTRAINT `city_ibfk_1` FOREIGN KEY (`CountryCode`) REFERENCES `country` (`Code`)

  32. ) ENGINE=InnoDB AUTO_INCREMENT=4080 DEFAULT CHARSET=latin1

  33. 1 row in set (0.00 sec)


  34. mysql> show create table countrylanguage\G

  35. *************************** 1. row ***************************

  36. Table: countrylanguage

  37. Create Table: CREATE TABLE `countrylanguage` (

  38. `CountryCode` char(3) NOT NULL DEFAULT '',

  39. `Language` char(30) NOT NULL DEFAULT '',

  40. `IsOfficial` enum('T','F') NOT NULL DEFAULT 'F',

  41. `Percentage` float(4,1) NOT NULL DEFAULT '0.0',

  42. PRIMARY KEY (`CountryCode`,`Language`),

  43. KEY `CountryCode` (`CountryCode`),

  44. CONSTRAINT `countryLanguage_ibfk_1` FOREIGN KEY (`CountryCode`) REFERENCES `country` (`Code`)

  45. ) ENGINE=InnoDB DEFAULT CHARSET=latin1

  46. 1 row in set (0.00 sec)


  47. #执行计划1

  48. mysql> desc select city.name, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000;

  49. +----+-------------+-----------------+------------+------+---------------------+-------------+---------+------------------------+------+----------+-------------+

  50. | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |

  51. +----+-------------+-----------------+------------+------+---------------------+-------------+---------+------------------------+------+----------+-------------+

  52. | 1 | SIMPLE | city | NULL | ALL | CountryCode | NULL | NULL | NULL | 4188 | 33.33 | Using where |

  53. | 1 | SIMPLE | countrylanguage | NULL | ref | PRIMARY,CountryCode | CountryCode | 3 | world.city.CountryCode | 4 | 100.00 | Using index |

  54. +----+-------------+-----------------+------------+------+---------------------+-------------+---------+------------------------+------+----------+-------------+

  55. 2 rows in set, 1 warning (0.00 sec)


  56. # 创建直方图

  57. mysql> ANALYZE TABLE city UPDATE HISTOGRAM ON population WITH 1024 BUCKETS;

  58. +------------+-----------+----------+-------------------------------------------------------+

  59. | Table | Op | Msg_type | Msg_text |

  60. +------------+-----------+----------+-------------------------------------------------------+

  61. | world.city | histogram | status | Histogram statistics created for column 'Population'. |

  62. +------------+-----------+----------+-------------------------------------------------------+

  63. 1 row in set (0.02 sec)


  64. #执行计划2

  65. mysql> explain select city.name, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000;

  66. +----+-------------+-----------------+------------+------+---------------------+-------------+---------+------------------------+------+----------+-------------+

  67. | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |

  68. +----+-------------+-----------------+------------+------+---------------------+-------------+---------+------------------------+------+----------+-------------+

  69. | 1 | SIMPLE | city | NULL | ALL | CountryCode | NULL | NULL | NULL | 4188 | 0.06 | Using where |

  70. | 1 | SIMPLE | countrylanguage | NULL | ref | PRIMARY,CountryCode | CountryCode | 3 | world.city.CountryCode | 4 | 100.00 | Using index |

  71. +----+-------------+-----------------+------------+------+---------------------+-------------+---------+------------------------+------+----------+-------------+

  72. 2 rows in set, 1 warning (0.00 sec)


  73. #创建索引

  74. mysql> create index index idx_population on city(population);

  75. Query OK, 0 rows affected (0.03 sec)

  76. Records: 0 Duplicates: 0 Warnings: 0


  77. #执行计划3

  78. mysql> explain select city.name, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000;

  79. +----+-------------+-----------------+------------+-------+----------------------------+----------------+---------+------------------------+------+----------+-----------------------+

  80. | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |

  81. +----+-------------+-----------------+------------+-------+----------------------------+----------------+---------+------------------------+------+----------+-----------------------+

  82. | 1 | SIMPLE | city | NULL | range | CountryCode,idx_population | idx_population | 4 | NULL | 1 | 100.00 | Using index condition |

  83. | 1 | SIMPLE | countrylanguage | NULL | ref | PRIMARY,CountryCode | CountryCode | 3 | world.city.CountryCode | 4 | 100.00 | Using index |

  84. +----+-------------+-----------------+------------+-------+----------------------------+----------------+---------+------------------------+------+----------+-----------------------+

  85. 2 rows in set, 1 warning (0.00 sec)

说明:

  • 从执行计划1和执行计划2可以看出,两个执行计划貌似没有很大的区别,但是可以注意到的是filtered列差距很大,没有直方图时为33.33,有直方图的情况下是0.06,因此对于行数语句会更加准确,在关联情况下可能会影响到驱动顺序(本案例中并没有体现)

  • 对比执行计划2和执行计划3,可以看出,在有索引的情况下,优化器还是选择了索引,并没有使用直方图

结论

  • 通常可以依靠索引的执行计划是最好的,但是直方图在某些情况下会有所帮助,例如对于非索引列特别有用

  • 直方图并不像MySQL的索引一样会自动维护,必须通过手动去维护,另外如果真的需要使用直方图,不要滥用,务必控制好histogramgenerationmaxmemsize,并且建议业务低峰期进行统计。

  • 通常来说直方图的适用场景是:

    • 随时间变化不大的值

    • 低基数值

    • 分布不均

  • 另外直方图无法使用的场景是:

    • 加密表、临时表

    • JSON数据类型、空间数据类型

    • 已创建唯一索引的单列


参考文章:

《Percona Server和MySQL 8.0上的列直方图》:

https://www.percona.com/blog/2019/10/29/column-histograms-on-percona-server-and-mysql-8-0/

《一文读懂MySQL 8.0直方图》


文章转载自mysql code tracer,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论