一、前言
本文是对上一篇文章《First Time! 记一次MySQL主从节点执行计划不一致的诡异问题》的进一步深入,从上篇文章看从节点的filter不准确导致执行计划不准确,我也参考了网上的一篇文章对filter进行了实验,并在MySQL8.0和MySQL5.7上做了对比测试。
二、FIlter的含义
先看mysql官方对执行计划的中filtered的解释:
https://dev.mysql.com/doc/refman/5.7/en/explain-output.html#explain_filtered
"
The filtered column indicates an estimated percentage of table rows filtered by the table condition. The maximum value is 100, which means no filtering of rows occurred.
Values decreasing from 100 indicate increasing amounts of filtering. rows shows the estimated number of rows examined and rows × filtered shows the number of rows joined with the following table.
For example, if rows is 1000 and filtered is 50.00 (50%), the number of rows to be joined with the following table is 1000 × 50% = 500.
"
filtered 字段内容是一个估算百分比值,MySQL先根据join type和key等信息,估算总共需要检查的记录行数rows,然后根据当前表的查询条件估算出总 rows 中实际可能需要join连接后续表的记录行数,由此可以计算出 filtered 百分比值。
复制
以上是开了一个sr,oracle后线的回复。我简单说下我的理解,可能并不准确,欢迎大佬们批评指正:
1、这是一个评估值,是需要统计信息支撑的
2、这个值是描述通过where条件可以对表过滤的结果集的百分比。
3、这个值需要结合rows来看,也就是说执行计划的这一步返回的预估行数=rows*filtered。
4、这个值是大了好还是小了好?据我的观察和理解,这个值不能孤立的看大小,而且由于MySQL8.0以前的统计信息只有索引的基数统计信息,没有表中行的基数,所以这个filted在全表扫描的时候,大部分时候不那么准确。如果rows和filted都很大,说明where条件的过滤性不好;如果rows很大,filted很小,我们可以理where条件过滤性不错;如果rows很小,filted的很大,那说明我的rows是目标结果集的大部分数据,不需要做过滤,这个实际上感觉也不错。
5、对于MySQL8.0,由于有直方图的存在,可以通过收集直方图,让优化器知道表的数据的基数情况,从而对filted做出准确的判断。
三、实验测试
关于filtered以上几点,是我通过在MySQL5.7和MySQL8.0上做实验得出的结论,下面看下实验过程:
1、MySQL8.0 没有收集直方图的情况下的执行计划,我一并收集了explain analyze,看起来表现是跟MySQL5.7是一致的,从analyze可以看出,t1评估行数为89319,实际返回行数88294,通过ind_status和remove_flag筛选以后,评估行数为893,也是就是89319*1%=893,这个1%就是filtered的值,但是实际上是返回了82488行,可以看出这个巨大的差距,尤其是t1作为驱动表,性能差距是比较大的。
+----+-------------+-------+------------+--------+---------------+---------+---------+-----------------------+-------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+---------------+---------+---------+-----------------------+-------+----------+----------------------------------------------+
| 1 | SIMPLE | t1 | NULL | ALL | PRIMARY | NULL | NULL | NULL | 87266 | 1.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | t3 | NULL | eq_ref | id_idx | id_idx | 8 | RSIND.t1.DATA_SOURCE | 1 | 100.00 | Using where |
| 1 | SIMPLE | t2 | NULL | eq_ref | PRIMARY | PRIMARY | 162 | RSIND.t1.IND_ID,const | 1 | 10.00 | Using where |
+----+-------------+-------+------------+--------+---------------+---------+---------+-----------------------+-------+----------+----------------------------------------------+
3 rows in set, 1 warning (0.00 sec)
| -> Sort: t2.ORDER_NUMBER (actual time=650.922..650.923 rows=6 loops=1)
-> Stream results (cost=9870.38 rows=89) (actual time=152.485..650.899 rows=6 loops=1)
-> Nested loop inner join (cost=9870.38 rows=89) (actual time=152.460..650.864 rows=6 loops=1)
-> Nested loop left join (cost=9557.77 rows=893) (actual time=0.085..138.725 rows=82488 loops=1)
-> Filter: ((t1.IND_STATUS = 1) and (t1.REMOVE_FLAG = 0)) (cost=9245.15 rows=893) (actual time=0.067..97.339 rows=82488 loops=1)
-> Table scan on t1 (cost=9245.15 rows=89319) (actual time=0.064..82.010 rows=88294 loops=1)
-> Filter: (t1.DATA_SOURCE = t3.SOURCE_ID) (cost=0.25 rows=1) (actual time=0.000..0.000 rows=1 loops=82488)
-> Single-row index lookup on t3 using id_idx (SOURCE_ID=t1.DATA_SOURCE) (cost=0.25 rows=1) (actual time=0.000..0.000 rows=1 loops=82488)
-> Filter: (t2.REMOVE_FLAG = 0) (cost=0.25 rows=0) (actual time=0.006..0.006 rows=0 loops=82488)
-> Single-row index lookup on t2 using PRIMARY (IND_ID=t1.IND_ID, NODE_ID=110035) (cost=0.25 rows=1) (actual time=0.006..0.006 rows=0 loops=82488)
复制
2、当我在indx_status和remove_flag上收集了直方图以后,执行计划就变成了正常的了,优化器发现这个t1表的ind_status和remove_flag基数很低,就不会用它做驱动表。
ANALYZE TABLE ind_der_base UPDATE HISTOGRAM ON IND_STATUS,remove_flag;
mysql> SELECT * FROM INFORMATION_SCHEMA.COLUMN_STATISTICS \G
*************************** 1. row ***************************
SCHEMA_NAME: RSIND
TABLE_NAME: ind_der_base
COLUMN_NAME: IND_STATUS
HISTOGRAM: {"buckets": [[0, 0.039878134414569506], [1, 1.0]], "data-type": "int", "null-values": 0.0, "collation-id": 8, "last-updated": "2022-11-24 03:38:58.667972", "sampling-rate": 1.0, "histogram-type": "singleton", "number-of-buckets-specified": 100}
*************************** 2. row ***************************
SCHEMA_NAME: RSIND
TABLE_NAME: ind_der_base
COLUMN_NAME: REMOVE_FLAG
HISTOGRAM: {"buckets": [[0, 0.9726028948739439], [1, 1.0]], "data-type": "int", "null-values": 0.0, "collation-id": 8, "last-updated": "2022-11-24 03:38:58.670820", "sampling-rate": 1.0, "histogram-type": "singleton", "number-of-buckets-specified": 100}
2 rows in set (0.02 sec)
mysql> explain select t1.ind_id id, t1.ind_cname name, t1.ind_frequency frequency, t1.ind_unit unit, t1.data_begin_date dataBeginDate, t1.data_end_date dataEndDate, t1.data_update_date dataUpdateDate, t1.ind_grade_tag gradeTag, t1.ind_ra_name analystName, t1.ind_ra_id analystId, t1.create_user createUser, t1.data_source dataSourceId,t1.ind_scope scope, t3.source_cname dataSourceName, t1.IND_DESCRIPTION description from ind_der_base t1 join ind_mapping_tree_der_ind t2 on t1.ind_id = t2.ind_id left join zjtb_file_source t3 on t1.data_source=t3.source_id where t1.remove_flag=0 and t2.remove_flag=0 and t2.node_id=110035 and t1.IND_STATUS=1 order by order_number;
+----+-------------+-------+------------+--------+---------------+---------+---------+----------------------+--------+----------+-----------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+---------------+---------+---------+----------------------+--------+----------+-----------------------------+
| 1 | SIMPLE | t2 | NULL | ALL | PRIMARY | NULL | NULL | NULL | 103057 | 1.00 | Using where; Using filesort |
| 1 | SIMPLE | t1 | NULL | eq_ref | PRIMARY | PRIMARY | 154 | RSIND.t2.IND_ID | 1 | 93.38 | Using where |
| 1 | SIMPLE | t3 | NULL | eq_ref | id_idx | id_idx | 8 | RSIND.t1.DATA_SOURCE | 1 | 100.00 | Using where |
+----+-------------+-------+------------+--------+---------------+---------+---------+----------------------+--------+----------+-----------------------------+
3 rows in set, 1 warning (0.01 sec)
| -> Nested loop left join (cost=30854.02 rows=96236) (actual time=37.212..37.270 rows=6 loops=1)
-> Nested loop inner join (cost=20989.79 rows=96236) (actual time=37.192..37.245 rows=6 loops=1)
-> Sort: t2.ORDER_NUMBER (cost=10426.45 rows=103057) (actual time=37.136..37.137 rows=6 loops=1)
-> Filter: ((t2.NODE_ID = 110035) and (t2.REMOVE_FLAG = 0)) (cost=10426.45 rows=103057) (actual time=11.811..37.098 rows=6 loops=1)
-> Table scan on t2 (cost=10426.45 rows=103057) (actual time=0.076..31.417 rows=103324 loops=1)
-> Filter: ((t1.IND_STATUS = 1) and (t1.REMOVE_FLAG = 0)) (cost=0.25 rows=1) (actual time=0.017..0.017 rows=1 loops=6)
-> Single-row index lookup on t1 using PRIMARY (IND_ID=t2.IND_ID) (cost=0.25 rows=1) (actual time=0.016..0.016 rows=1 loops=6)
-> Filter: (t1.DATA_SOURCE = t3.SOURCE_ID) (cost=0.25 rows=1) (actual time=0.004..0.004 rows=1 loops=6)
-> Single-row index lookup on t3 using id_idx (SOURCE_ID=t1.DATA_SOURCE) (cost=0.25 rows=1) (actual time=0.003..0.003 rows=1 loops=6)
|
复制
3、对于MySQL5.7上没有直方图的概念,怎么破?我想到MySQL虽然在表上没有基数统计信息,但是在索引上有,那是不是给ind_status和remove_flag加个索引就ok了?我们看,加了索引以后,由于统计信息里有了t1表indx_status和remove_flag的基数,就不会选择他做驱动表了。
mysql> alter table ind_der_base add index ind_remove_status(remove_flag,ind_status);
Query OK, 0 rows affected (0.26 sec)
Records: 0 Duplicates: 0 Warnings: 0
+----+-------------+-------+------------+--------+---------------------------+---------+---------+----------------------+--------+----------+-----------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+---------------------------+---------+---------+----------------------+--------+----------+-----------------------------+
| 1 | SIMPLE | t2 | NULL | ALL | PRIMARY | NULL | NULL | NULL | 103057 | 1.00 | Using where; Using filesort |
| 1 | SIMPLE | t1 | NULL | eq_ref | PRIMARY,ind_remove_status | PRIMARY | 154 | rsind.t2.IND_ID | 1 | 50.00 | Using where |
| 1 | SIMPLE | t3 | NULL | eq_ref | id_idx | id_idx | 8 | rsind.t1.DATA_SOURCE | 1 | 100.00 | Using where |
+----+-------------+-------+------------+--------+---------------------------+---------+---------+----------------------+--------+----------+-----------------------------+
3 rows in set, 1 warning (0.00 sec)
表的索引的基数:
mysql> show index from ind_der_base;
+--------------+------------+-------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------------+------------+-------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| ind_der_base | 0 | PRIMARY | 1 | IND_ID | A | 87017 | NULL | NULL | | BTREE | | |
| ind_der_base | 1 | ind_remove_status | 1 | REMOVE_FLAG | A | 1 | NULL | NULL | YES | BTREE | | |
| ind_der_base | 1 | ind_remove_status | 2 | IND_STATUS | A | 3 | NULL | NULL | | BTREE | | |
+--------------+------------+-------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
3 rows in set (0.00 sec)
复制
四、总结
1、MySQL8.0以前,如果字段没有索引,那么优化器对filtered的判断是很不稳定的。
2、MySQL8.0可以通过收集直方图解决对基数判断不准的问题,从而找到更好的执行计划。
3、在没有直方图的版本里,可以通过添加索引让优化器知道字段的基数,可能这个字段的选择性并不好,添加索引的用处并不是想走索引,而是让优化器选择更准确的驱动表和非驱动表。
4、引流一个大佬的文章《MySQL: 执行计划中filtered的计算方式和影响》https://www.jianshu.com/p/e70de1f7cdcf,这个讲的就比较清楚了。
评论

