Filed under: extended stats,Oracle,Statistics — Jonathan Lewis @ 12:14 pm BST Oct 21,2020
Here's an odd little detail about the statistics of column groups. At first glance it's counter-intuitive, but it's actually an "obvious" (once you've thought about it for a bit) consequence of the approximate_ndv() algorithm used when gathering stats.
I'll present it as a question:
I have a table with two columns: flag and v1. Although the columns are not declared not null, neither holds any nulls. If flag has 26 distinct values and v1 has 1,000,000 distinct values, what is the smallest number of distinct values I should see if I create the column group (flag, v1)?
The question is, of course, a little ambiguous: there's the number of distinct values the column group actually holds, and the number that a fresh gather of statistics reports it as holding. Here are the stats from a test run of a simple script that creates, populates and gathers stats on my table:
select column_name, num_distinct
from user_tab_cols
where table_name = 'T1'
/
COLUMN_NAME                      NUM_DISTINCT
-------------------------------- ------------
FLAG                                       26
ID                                    1000000
V1                                     999040
SYS_STUQ#TO6BT1REX3P1BKO0ULVR9         989120
There are actually 1,000,000 distinct values for v1 (it's a varchar2() representation of the id column), but the approximate_ndv() mechanism can have an error of (I believe) up to roughly 1.3%, so Oracle's estimate here is a little bit off.
The column group (represented by the internal column definition SYS_STUQ#TO6BT1REX3P1BKO0ULVR9) must hold (at least) 1,000,000 distinct values, but the error in this case is a little larger than the error in v1, with the effect that the number of combinations appears to be smaller than the number of distinct values for v1!
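Oracle doesn't publish the internals of approximate_ndv(), but the effect is easy to reproduce with any sampling-based distinct-count estimator. The sketch below is my own illustration (a k-minimum-values estimator, not Oracle's algorithm): both v1 and the (flag, v1) combination have exactly the same number of distinct values (scaled down here to 200,000 for speed), yet each estimate carries its own independent error of a few percent, so the estimate for the combination can legitimately come out lower than the estimate for v1 alone.

```python
import hashlib

def kmv_estimate(values, k=1024):
    # k-minimum-values estimator: hash every value to [0, 1), keep the k
    # smallest distinct hashes; NDV is estimated as (k - 1) / (k-th hash).
    hashes = set()
    for v in values:
        h = int(hashlib.sha1(repr(v).encode()).hexdigest(), 16) / 2.0**160
        hashes.add(h)
    smallest = sorted(hashes)[:k]
    if len(smallest) < k:
        return len(smallest)     # fewer than k distinct values seen: count is exact
    return int((k - 1) / smallest[-1])

n = 200_000                                         # scaled-down stand-in for 1e6 rows
v1 = [format(i, '010d') for i in range(n)]          # n distinct values, like lpad(id,10,'0')
pairs = [(chr(65 + i % 26), v) for i, v in enumerate(v1)]   # also exactly n distinct

est_v1, est_pair = kmv_estimate(v1), kmv_estimate(pairs)
print(est_v1, est_pair)   # each lands within a few percent of n, independently high or low
```

Because the two estimates are derived from (effectively) independent hash streams, nothing stops the pair estimate from falling below the v1 estimate, exactly as the gathered stats above show.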
There's not much difference in this case between actual and estimate, but the test demonstrates the potential for a significant difference between the estimate and the arithmetic Oracle would use if the column group didn't exist. Nominally the optimizer would assume there were 26 million distinct values (though in this example I created only 1M rows in the table, and the optimizer would sanity-check that 26M against the row count).
So, although the difference between actual and estimate is small, we have to ask: are there cases where the optimizer will ignore the column group stats because a sanity check "proves" the estimate is "wrong"? After all, it must be wrong if its num_distinct is less than the num_distinct of one of its components. Then again, perhaps there's a sanity check that only ignores the column group when the estimate is "wrong enough", but allows for small variations.
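To make the two calculations concrete, here is a small sketch of the arithmetic being contrasted. The function names and the clamping rule are my assumptions for illustration, not documented optimizer behaviour: without a column group the component NDVs are multiplied and capped at the row count, while a plausible sanity check on a stored column-group NDV would simply refuse to let it drop below the largest component NDV.

```python
def combined_ndv_without_group(component_ndvs, num_rows):
    # No column-group stats: multiply the component NDVs, but never
    # assume more combinations than there are rows in the table.
    product = 1
    for ndv in component_ndvs:
        product *= ndv
    return min(product, num_rows)

def sanitized_group_ndv(group_ndv, component_ndvs):
    # Hypothetical sanity check: a combination of columns cannot have
    # fewer distinct values than any one of its components.
    return max(group_ndv, *component_ndvs)

# Figures from the stats output above: 26 flags, 999,040 for v1, 1M rows.
print(combined_ndv_without_group([26, 999_040], 1_000_000))  # 1000000 (26M capped to rows)
print(sanitized_group_ndv(989_120, [26, 999_040]))           # 999040
```

Either rule would mask the anomaly in the gathered stats; the open question is whether the optimizer applies anything like the second one, and if so, how large the discrepancy has to be before it kicks in.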
I mention this only because an odd optimizer estimate showed up recently on the Oracle-L mailing list, and the only significant difference I can see (at present) is that a bad plan appears for a partition where this column group anomaly shows up in the stats, while a good plan appears when the anomaly isn't present.
Footnote:
If you want to recreate the results above, here's the model I used (tested on 19.3.0.0 and 11.2.0.4):
rem
rem Script: column_group_stats_5.sql
rem Author: Jonathan Lewis
rem Dated: Oct 2020
rem
rem Last tested
rem 19.3.0.0
rem 11.2.0.4
rem
execute dbms_random.seed(0)
create table t1
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        chr(65 + mod(rownum,26))        flag,
        rownum                          id,
        lpad(rownum,10,'0')             v1
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid WordPress format issue
order by
        dbms_random.value
/
select column_name, num_distinct
from user_tab_cols
where table_name = 'T1'
/
begin
        dbms_stats.gather_table_stats(
                ownname    => null,
                tabname    => 'T1',
                method_opt => 'for all columns size 1 for columns(v1, flag) size 1'
        );
end;
/
select column_name, num_distinct
from user_tab_cols
where table_name = 'T1'
/
Footnote 2:
As an interesting little statistical quirk, if I defined the column group as (flag, v1) rather than (v1, flag), the estimate for the column group's num_distinct was 1,000,000.
Original post: https://jonathanlewis.wordpress.com/2020/10/21/column-groups-6/