Filed under: extended stats,Oracle,Statistics — Jonathan Lewis @ 12:14 pm BST Oct 21,2020
Here's an odd little detail about the statistics of column groups. At first glance it's counter-intuitive, but it's actually an "obvious" (once you've thought about it for a bit) consequence of the approximate_ndv() algorithm used when gathering stats.
I'll present it as a question:
I have a table with two columns: flag and v1. Although the columns are not declared not null, neither holds any nulls. If flag has 26 distinct values and v1 has 1,000,000 distinct values, what is the smallest number of distinct values I should see if I create the column group (flag, v1)?
The question is, of course, a little ambiguous: there's the number of distinct values the column group actually holds, and the number that a fresh gather of statistics reports it as holding. Here are the stats from a test run of a simple script that creates, populates and gathers stats on my table:
select column_name, num_distinct
from user_tab_cols
where table_name = 'T1'
/
COLUMN_NAME                      NUM_DISTINCT
-------------------------------- ------------
FLAG                                       26
ID                                    1000000
V1                                     999040
SYS_STUQ#TO6BT1REX3P1BKO0ULVR9         989120
There are actually 1,000,000 distinct values for v1 (it's a varchar2() representation of the id column), but the approximate_ndv() mechanism can have an error of (I believe) up to roughly 1.3%, so Oracle's estimate here is a little bit off.
The column group (represented by the internal column definition SYS_STUQ#TO6BT1REX3P1BKO0ULVR9) must hold (at least) 1,000,000 distinct values, but the error in this case is a little larger than the error in v1, with the effect that the number of combinations appears to be smaller than the number of distinct values for v1!
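Oracle doesn't publish the internals of approximate_ndv(), but the effect is easy to reproduce with any sampling-based distinct-count estimator. The sketch below is my own illustration (a k-minimum-values estimator, not Oracle's algorithm): both v1 and the (flag, v1) combination have exactly the same number of distinct values (scaled down here to 200,000 for speed), yet each estimate carries its own independent error of a few percent, so the estimate for the combination can legitimately come out lower than the estimate for v1 alone.

```python
import hashlib

def kmv_estimate(values, k=1024):
    # k-minimum-values estimator: hash every value to [0, 1), keep the k
    # smallest distinct hashes; NDV is estimated as (k - 1) / (k-th hash).
    hashes = set()
    for v in values:
        h = int(hashlib.sha1(repr(v).encode()).hexdigest(), 16) / 2.0**160
        hashes.add(h)
    smallest = sorted(hashes)[:k]
    if len(smallest) < k:
        return len(smallest)     # fewer than k distinct values seen: count is exact
    return int((k - 1) / smallest[-1])

n = 200_000                                         # scaled-down stand-in for 1e6 rows
v1 = [format(i, '010d') for i in range(n)]          # n distinct values, like lpad(id,10,'0')
pairs = [(chr(65 + i % 26), v) for i, v in enumerate(v1)]   # also exactly n distinct

est_v1, est_pair = kmv_estimate(v1), kmv_estimate(pairs)
print(est_v1, est_pair)   # each lands within a few percent of n, independently high or low
```

Because the two estimates are derived from (effectively) independent hash streams, nothing stops the pair estimate from falling below the v1 estimate, exactly as the gathered stats above show.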
There's not much difference in this case between actual and estimate, but the test demonstrates the potential for a significant difference between the estimate and the arithmetic Oracle would use if the column group didn't exist. Nominally the optimizer would assume there were 26 million distinct values (though in this example I created only 1M rows in the table, and the optimizer would sanity-check that 26M against the row count).
So, although the difference between actual and estimate is small, we have to ask: are there cases where the optimizer will ignore the column group stats because a sanity check "proves" the estimate is "wrong"? After all, it must be wrong if its num_distinct is less than the num_distinct of one of its components. Then again, perhaps there's a sanity check that only ignores the column group when the estimate is "wrong enough", but allows for small variations.
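To make the two calculations concrete, here is a small sketch of the arithmetic being contrasted. The function names and the clamping rule are my assumptions for illustration, not documented optimizer behaviour: without a column group the component NDVs are multiplied and capped at the row count, while a plausible sanity check on a stored column-group NDV would simply refuse to let it drop below the largest component NDV.

```python
def combined_ndv_without_group(component_ndvs, num_rows):
    # No column-group stats: multiply the component NDVs, but never
    # assume more combinations than there are rows in the table.
    product = 1
    for ndv in component_ndvs:
        product *= ndv
    return min(product, num_rows)

def sanitized_group_ndv(group_ndv, component_ndvs):
    # Hypothetical sanity check: a combination of columns cannot have
    # fewer distinct values than any one of its components.
    return max(group_ndv, *component_ndvs)

# Figures from the stats output above: 26 flags, 999,040 for v1, 1M rows.
print(combined_ndv_without_group([26, 999_040], 1_000_000))  # 1000000 (26M capped to rows)
print(sanitized_group_ndv(989_120, [26, 999_040]))           # 999040
```

Either rule would mask the anomaly in the gathered stats; the open question is whether the optimizer applies anything like the second one, and if so, how large the discrepancy has to be before it kicks in.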
I mention this only because an odd optimizer estimate showed up recently on the Oracle-L mailing list, and the only significant difference I can see (at present) is that a bad plan appears for a partition where this column group anomaly shows up in the stats, while a good plan appears when the anomaly isn't present.
Footnote:
If you want to recreate the results above, here's the model I used (tested on 19.3.0.0 and 11.2.0.4):
rem
rem Script: column_group_stats_5.sql
rem Author: Jonathan Lewis
rem Dated: Oct 2020
rem
rem Last tested
rem 19.3.0.0
rem 11.2.0.4
rem
execute dbms_random.seed(0)
create table t1
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        chr(65 + mod(rownum,26))        flag,
        rownum                          id,
        lpad(rownum,10,'0')             v1
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid WordPress format issue
order by
        dbms_random.value
/
select column_name, num_distinct
from user_tab_cols
where table_name = 'T1'
/
begin
        dbms_stats.gather_table_stats(
                ownname    => null,
                tabname    => 'T1',
                method_opt => 'for all columns size 1 for columns(v1, flag) size 1'
        );
end;
/
select column_name, num_distinct
from user_tab_cols
where table_name = 'T1'
/
Footnote 2:
As an interesting little statistical quirk, if I defined the column group as (flag, v1) rather than (v1, flag), the estimate for the column group's num_distinct was 1,000,000.
Original post: https://jonathanlewis.wordpress.com/2020/10/21/column-groups-6/