In the last part, I asked you to trust me that true global stats are a good thing, so in this post I hope to show you why they are, to make sure you don't kid yourself that you can avoid gathering them. (Note: this is all on 10.2.0.4.)
Why would you even want to avoid them? Global stats can take a lot of work to gather if you're working with a very large object, because Oracle has to visit all of its partitions. As an alternative, Oracle can aggregate lower-level statistics to generate simulated global statistics at higher levels of the same object. In our case, as we insert data into new subpartitions or use partition exchange operations, we gather statistics at the SUBPARTITION level and allow the statistics to aggregate up to the PARTITION and TABLE levels. Here's how it looks ...
I'll delete the existing table stats and regather at the SUBPARTITION level.
SQL> exec dbms_stats.delete_table_stats('TESTUSER', 'TEST_TAB1')

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.gather_table_stats('TESTUSER', 'TEST_TAB1', GRANULARITY => 'SUBPARTITION')

PL/SQL procedure successfully completed.
Note that, because I haven't specified a subpartition name, all of the subpartitions will be visited to gather these statistics, resulting in the following table, partition and subpartition stats.
SQL> select table_name, global_stats, last_analyzed, num_rows from dba_tables where table_name='TEST_TAB1' and owner='TESTUSER' order by 1, 3 desc nulls last

TABLE_NAME                     GLO LAST_ANALYZED          NUM_ROWS
------------------------------ --- ---------------------- ----------
TEST_TAB1                      NO  16-FEB-2010 16:23:32           11

1 row selected.
You can see that the table statistics are not global stats, but the number of rows looks right. These are actually aggregated statistics that Oracle has populated based on the data found in the subpartitions. Let's look at the partitions.
SQL> select table_name, partition_name, global_stats, last_analyzed, num_rows from dba_tab_partitions where table_name='TEST_TAB1' and table_owner='TESTUSER' order by 1, 2, 4 desc nulls last

TABLE_NAME                     PARTITION_NAME                 GLO LAST_ANALYZED          NUM_ROWS
------------------------------ ------------------------------ --- ---------------------- --------
TEST_TAB1                      P_20100131                     NO  16-FEB-2010 16:23:32          0
TEST_TAB1                      P_20100201                     NO  16-FEB-2010 16:23:32          4
TEST_TAB1                      P_20100202                     NO  16-FEB-2010 16:23:32          2
TEST_TAB1                      P_20100203                     NO  16-FEB-2010 16:23:32          1
TEST_TAB1                      P_20100204                     NO  16-FEB-2010 16:23:32          2
TEST_TAB1                      P_20100205                     NO  16-FEB-2010 16:23:32          1
TEST_TAB1                      P_20100206                     NO  16-FEB-2010 16:23:32          1
TEST_TAB1                      P_20100207                     NO  16-FEB-2010 16:23:32          0

8 rows selected.
Again, the aggregated stats appear to be an accurate reflection of the data. How do the subpartition stats look?
SQL> select table_name, subpartition_name, global_stats, last_analyzed, num_rows from dba_tab_subpartitions where table_name='TEST_TAB1' and table_owner='TESTUSER' order by 1, 2, 4 desc nulls last

TABLE_NAME                     SUBPARTITION_NAME              GLO LAST_ANALYZED          NUM_ROWS
------------------------------ ------------------------------ --- ---------------------- --------
TEST_TAB1                      P_20100131_GROT                YES 16-FEB-2010 16:23:32          0
TEST_TAB1                      P_20100131_HALO                YES 16-FEB-2010 16:23:32          0
TEST_TAB1                      P_20100131_JUNE                YES 16-FEB-2010 16:23:32          0
TEST_TAB1                      P_20100131_OTHERS              YES 16-FEB-2010 16:23:32          0
TEST_TAB1                      P_20100201_GROT                YES 16-FEB-2010 16:23:32          1
TEST_TAB1                      P_20100201_HALO                YES 16-FEB-2010 16:23:32          1
TEST_TAB1                      P_20100201_JUNE                YES 16-FEB-2010 16:23:32          1
TEST_TAB1                      P_20100201_OTHERS              YES 16-FEB-2010 16:23:32          1
TEST_TAB1                      P_20100202_GROT                YES 16-FEB-2010 16:23:32          1
TEST_TAB1                      P_20100202_HALO                YES 16-FEB-2010 16:23:32          1
<<output snipped>>

32 rows selected.
So this looks pretty good, doesn't it? We've gathered "global" statistics at the subpartition level, and yet the stats at the table and partition levels look accurate too. Why would we want to use this approach? To use my current system as an example, it's a near-real-time data warehouse which creates tens of thousands of subpartitions per day, most of them over a period of just a few hours, so if we were to regather statistics at the table and partition levels there would be a substantial stats-gathering workload, and the system is under enough strain as it is. Therefore, if we can gather stats at the subpartition level for the new subpartitions only and have Oracle aggregate them into derived table and partition stats, so much the better. To simulate that, I'll insert some more data and see whether the stats still look accurate after adding data and regathering at the SUBPARTITION level.
SQL> INSERT INTO TEST_TAB1 VALUES (20100206, 'GROT', 100000, 'P');

1 row created.

SQL> INSERT INTO TEST_TAB1 VALUES (20100206, 'GROT', 3000000, 'P');

1 row created.

SQL> INSERT INTO TEST_TAB1 VALUES (20100206, 'GROT', 200000, 'P');

1 row created.

SQL> INSERT INTO TEST_TAB1 VALUES (20100206, 'GROT', 110000, 'P');

1 row created.

SQL> INSERT INTO TEST_TAB1 VALUES (20100206, 'GROT', 240000, 'U');

1 row created.

SQL> COMMIT;

Commit complete.
I'll gather stats at the SUBPARTITION level for the only subpartition whose data has changed.
SQL> exec dbms_stats.gather_table_stats('TESTUSER', 'TEST_TAB1', GRANULARITY => 'SUBPARTITION', PARTNAME => 'P_20100206_GROT');

PL/SQL procedure successfully completed.
Time to look at the stats ...
SQL> select table_name, global_stats, last_analyzed, num_rows from dba_tables where table_name='TEST_TAB1' and owner='TESTUSER' order by 1, 2, 4 desc nulls last;

TABLE_NAME                     GLO LAST_ANALYZED          NUM_ROWS
------------------------------ --- ---------------------- ----------
TEST_TAB1                      NO  16-FEB-2010 16:23:34           16

1 row selected.

SQL> select table_name, partition_name, global_stats, last_analyzed, num_rows from dba_tab_partitions where table_name='TEST_TAB1' and table_owner='TESTUSER' order by 1, 2, 4 desc nulls last;

TABLE_NAME                     PARTITION_NAME                 GLO LAST_ANALYZED          NUM_ROWS
------------------------------ ------------------------------ --- ---------------------- --------
TEST_TAB1                      P_20100131                     NO  16-FEB-2010 16:23:34          0
TEST_TAB1                      P_20100201                     NO  16-FEB-2010 16:23:34          4
TEST_TAB1                      P_20100202                     NO  16-FEB-2010 16:23:34          2
TEST_TAB1                      P_20100203                     NO  16-FEB-2010 16:23:34          1
TEST_TAB1                      P_20100204                     NO  16-FEB-2010 16:23:34          2
TEST_TAB1                      P_20100205                     NO  16-FEB-2010 16:23:34          1
TEST_TAB1                      P_20100206                     NO  16-FEB-2010 16:23:34          6
TEST_TAB1                      P_20100207                     NO  16-FEB-2010 16:23:34          0

8 rows selected.

SQL> select table_name, subpartition_name, global_stats, last_analyzed, num_rows from dba_tab_subpartitions where table_name='TEST_TAB1' and table_owner='TESTUSER' order by 1, 2, 4 desc nulls last;

TABLE_NAME                     SUBPARTITION_NAME              GLO LAST_ANALYZED          NUM_ROWS
------------------------------ ------------------------------ --- ---------------------- --------
TEST_TAB1                      P_20100131_GROT                YES 16-FEB-2010 16:23:34          0
TEST_TAB1                      P_20100131_HALO                YES 16-FEB-2010 16:23:34          0
TEST_TAB1                      P_20100131_JUNE                YES 16-FEB-2010 16:23:34          0
TEST_TAB1                      P_20100131_OTHERS              YES 16-FEB-2010 16:23:34          0
TEST_TAB1                      P_20100201_GROT                YES 16-FEB-2010 16:23:34          1
TEST_TAB1                      P_20100201_HALO                YES 16-FEB-2010 16:23:34          1
TEST_TAB1                      P_20100201_JUNE                YES 16-FEB-2010 16:23:34          1
TEST_TAB1                      P_20100201_OTHERS              YES 16-FEB-2010 16:23:34          1
<<output snipped>>
TEST_TAB1                      P_20100206_GROT                YES 16-FEB-2010 16:23:34          5
TEST_TAB1                      P_20100206_HALO                YES 16-FEB-2010 16:23:34          0
TEST_TAB1                      P_20100206_JUNE                YES 16-FEB-2010 16:23:34          0
TEST_TAB1                      P_20100206_OTHERS              YES 16-FEB-2010 16:23:34          1
TEST_TAB1                      P_20100207_GROT                YES 16-FEB-2010 16:23:34          0
TEST_TAB1                      P_20100207_HALO                YES 16-FEB-2010 16:23:34          0
TEST_TAB1                      P_20100207_JUNE                YES 16-FEB-2010 16:23:34          0
TEST_TAB1                      P_20100207_OTHERS              YES 16-FEB-2010 16:23:34          0

32 rows selected.
Everything is still looking very good in this case, so it looks like a great strategy: low collection overhead and accurate statistics. That is, until you use Greg Rahn's query to drill down into the column-level statistics and identify some horrible problems. These are the statistics at the table level.
SQL> select a.column_name, a.num_distinct, display_raw(a.low_value,b.data_type) as low_val, display_raw(a.high_value,b.data_type) as high_val, b.data_type from dba_tab_col_statistics a, dba_tab_cols b where a.owner='TESTUSER' and a.table_name='TEST_TAB1' and a.table_name=b.table_name and a.column_name=b.column_name order by 1;

COLUMN_NAME                NUM_DISTINCT LOW_VAL              HIGH_VAL             DATA_TYPE
-------------------------- ------------ -------------------- -------------------- -----------
REPORTING_DATE                        6 20100201             20100206             NUMBER
SEQ_ID                               14 400                  3000000              NUMBER
SOURCE_SYSTEM                         7 GROT                 WINE                 VARCHAR2
STATUS                                4 P                    U                    VARCHAR2

4 rows selected.
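A quick aside: display_raw in that query is not a built-in. It is a small helper (based on the one Greg Rahn published) that decodes the RAW LOW_VALUE/HIGH_VALUE columns into readable values. A minimal sketch of such a helper, modelled on the idea rather than copied verbatim, might look like this:

-- Hypothetical sketch of a display_raw-style helper (modelled on Greg Rahn's
-- published function, not the verbatim original). It decodes the RAW
-- LOW_VALUE / HIGH_VALUE columns using DBMS_STATS.CONVERT_RAW_VALUE.
create or replace function display_raw (rawval raw, type varchar2)
return varchar2
is
   cn number;
   cv varchar2(32);
   cd date;
begin
   if type = 'NUMBER' then
      dbms_stats.convert_raw_value(rawval, cn);
      return to_char(cn);
   elsif type = 'VARCHAR2' then
      dbms_stats.convert_raw_value(rawval, cv);
      return cv;
   elsif type = 'DATE' then
      dbms_stats.convert_raw_value(rawval, cd);
      return to_char(cd, 'DD-MON-YYYY HH24:MI:SS');
   else
      return 'UNKNOWN DATATYPE';
   end if;
end;
/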
At first glance these look pretty good too. To give a specific example, the STATUS column does have the correct high value of 'U', which only just appeared in the last set of rows I inserted. Based on what I've seen so far, Oracle does accurately update the high/low column values and row counts when generating aggregated stats, but there's a problem here. According to the column statistics there are 4 distinct STATUS values in the table, but that's not true: there are only 2.
SQL> select distinct STATUS from test_tab1;

S
-
U
P

SQL>
If the aggregation gets it wrong with such a small number of rows and only two distinct values, the chances of the number of distinct values (NDV) calculated during stats aggregation being accurate on real data look pretty slim. And consider what a key input to cost-based calculations those values are ...
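To make that concrete, here is a rough back-of-the-envelope illustration of my own (not from the original post): for a simple equality predicate with no histogram, the optimizer's cardinality estimate is roughly num_rows / num_distinct, so an inflated NDV directly shrinks the estimated row count.

-- Rough illustration (assumption: no histogram on STATUS, standard 1/NDV selectivity)
--   With the aggregated (wrong) stats:  16 rows / NDV 4 = ~4 rows estimated
--   With an accurate NDV:               16 rows / NDV 2 = ~8 rows estimated
select * from test_tab1 where status = 'P';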
Why aren't the values accurate? Well, let's compare the high/low values to the number of distinct values (NDV).
When we gathered statistics on the new subpartition, Oracle had access to the previous high/low values at the table level. Here are the column statistics before stats were gathered on the new subpartition.
SQL> select
  2     a.column_name,
  3     a.num_distinct,
  4     display_raw(a.low_value,b.data_type) as low_val,
  5     display_raw(a.high_value,b.data_type) as high_val,
  6     b.data_type
  7  from
  8     dba_tab_col_statistics a, dba_tab_cols b
  9  where
 10     a.owner='TESTUSER' and
 11     a.table_name='TEST_TAB1' and
 12     a.table_name=b.table_name and
 13     a.column_name=b.column_name
 14  order by 1
 15  /

COLUMN_NAME                NUM_DISTINCT LOW_VAL              HIGH_VAL             DATA_TYPE
-------------------------- ------------ -------------------- -------------------- -----------
REPORTING_DATE                        6 20100201             20100206             NUMBER
SEQ_ID                                9 400                  30000                NUMBER
SOURCE_SYSTEM                         8 GROT                 WINE                 VARCHAR2
STATUS                                1 P                    P                    VARCHAR2
So at this stage there is only one distinct value of STATUS, which is 'P'. When we gathered stats on the new subpartition, Oracle could see all of the STATUS values in that subpartition, noticed STATUS='U' on one of the rows, and could work out very easily that it is higher than 'P', so it updated the high value accordingly, as in the example shown earlier.
COLUMN_NAME                NUM_DISTINCT LOW_VAL              HIGH_VAL             DATA_TYPE
-------------------------- ------------ -------------------- -------------------- ------------
REPORTING_DATE                        6 20100201             20100206             NUMBER
SEQ_ID                               14 400                  3000000              NUMBER
SOURCE_SYSTEM                         7 GROT                 WINE                 VARCHAR2
STATUS                                4 P                    U                    VARCHAR2
Now, what about the NDV? Remember, Oracle can't look at the data in any partition or subpartition other than the one we're gathering stats on (that's the whole point, to reduce the overhead). So it has to decide what the new NDV should be based on several inputs (all of them visible in the dictionary, as the sketch after this list shows):
1) The actual STATUS values for the rows in the subpartition it can look at.
2) The previously gathered and stored NDV for each of the other subpartitions.
3) The previously aggregated NDV stored at the table and partition levels.
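Here is a minimal sketch (assuming the same TESTUSER.TEST_TAB1 objects and the display_raw helper above) of how you might list what Oracle actually has stored per subpartition. Note that only the NDV and the high/low endpoints are there; the individual values behind each subpartition's NDV are nowhere to be found.

select a.subpartition_name,
       a.num_distinct,
       display_raw(a.low_value, b.data_type)  as low_val,
       display_raw(a.high_value, b.data_type) as high_val
from   dba_subpart_col_statistics a, dba_tab_cols b
where  a.owner       = 'TESTUSER'
and    a.table_name  = 'TEST_TAB1'
and    a.column_name = 'STATUS'
and    a.table_name  = b.table_name
and    a.column_name = b.column_name
order by 1;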
The problem is that Oracle knows the number of distinct values in the other subpartitions (it looked at the data when it calculated them) but not the values themselves, and without that information how can it say whether the 2 distinct values (P and U) in this subpartition are distinct from the values in the other subpartitions? Actually, in this case we might expect Oracle to do something clever and realise that, since there was only one distinct value P prior to our new subpartition and the only values in the current subpartition are P and U, there can only be 2 distinct values. Sadly, it just doesn't work that way!
Although relying on lower-level statistics being aggregated up to higher levels might initially seem like a neat trick, it's going to lead to some pretty strange statistics, at least in 10g. Which is why Oracle recommends that you gather global stats at the TABLE and PARTITION levels.
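For completeness, a sketch of what that looks like against the same objects as above: a granularity that asks Oracle to visit the data and build genuine table- and partition-level statistics rather than aggregating subpartition stats upwards. (One reasonable choice on 10.2 is 'GLOBAL AND PARTITION'; pick whatever granularity suits your own maintenance windows.)

BEGIN
   dbms_stats.gather_table_stats(
      ownname     => 'TESTUSER',
      tabname     => 'TEST_TAB1',
      granularity => 'GLOBAL AND PARTITION');
END;
/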
I wish I could say that was the only problem with this aggregation process, but there's more to come in the next post ...
Original post: http://oracledoug.com/serendipity/index.php?/archives/1563-Statistics-on-Partitioned-Tables-Part-2.html
