维护大分区表上的统计信息

原创赵勇 2021-05-04

2224

作者：Sunil Chakkappen
最近，我们收到了很多关于如何收集和维护大的分区表上的优化器统计信息的问题。这些问题的大部分可以归纳为两个主题：

当查询访问一个分区级的统计信息过旧或不存在统计信息的单个分区时，会因“超出范围”的值导致得到一个欠优的执行计划。
从时间和系统资源的角度看，全局统计信息的收集是非常昂贵的。

本文将描述这些问题并解释你可以如何处理他们。
这是一个很重要的主题，所以，我建议你也可以了解关于在分区表上维护增量统计信息的三篇文章（https://blogs.oracle.com/optimizer/efficient-statistics-maintenance-for-partitioned-tables-using-incremental-statistics-part-1）。

越界

大表经常会被分解为被称之为分区的较小部分，以便改善查询性能和方便数据管理。Oracle查询优化器依赖于整个表上的统计信息（全局统计信息）和单独分区上的统计信息（分区统计信息），去为SQL语句选择一个好的执行计划。如果查询只需要访问单独一个分区，优化器仅使用被访问分区上的统计信息。如果查询需要访问一个以上的分区，它会联合使用全局和分区统计信息。
“越界”意味着提供给WHERE子句中的谓词的值，超出了列统计信息中由【最小值，最大值】表示的值域。优化器基于谓词条件值与最大值（假设条件值大于最大值）之间的距离，按比例分配一个选择率，即，离最大值越远的值，会得到一个更小的选择率。这种情况常见于以日期列做范围分区的表上，当一个新分区出现，并且数据还在向新分区中加载期间，对这些行的查询已经开始了。由于持续的数据加载，即使你周期性的刷新统计信息，分区统计信息也很快就会陈旧。优化器所知的最大值是不正确的，这导致了查询优化器选择了一个欠优的执行计划。比如，当全表扫描会是更好选择的情况下，优化器却选择了一个索引访问方法。

“越界”的问题可以通过Oracle 10.2.0.4和11g中新提供的拷贝表统计的存储过程来避免。这个存储过程拷贝源分区（或子分区）的统计信息到目标分区（或子分区）中。同时，它也会拷贝依赖对象的统计信息：列，本地分区索引等。它按照后面的方法调整分区键列的最小值和最大值。对于范围分区表，它使用第一个分区键列的（可能存在多个分区键列）最大边界值做为最大值，使用前一个分区的最大值做为最小值。它还能有选择性的对诸如目标分区上的块数，行数等统计信息做缩放。

假设我们有一个名为SALES的表，在SALES_DATE列上按季度做范围分区。在每天结束时，数据被装进最新的分区中。然而，统计信息却只在每个季度的最后一天分区被装满时才收集。假设全局和分区级的统计信息（对于已装满的分区）已是最新的，为了防止“越界”导致欠优的执行计划，使用如下处理步骤：
1、使用DBMS_STATS中的LOCK_TABLE_STATS过程锁定表的统计信息，以避免受到自动统计信息收集任务的干扰。
EXEC DBMS_STATS.LOCK_TABLE_STATS(‘SH’,‘SALES’);

2、在开始向一个新分区（比如SALES_Q4_2000分区)装载数据前，使用COPY_TABLE_STATS过程拷贝前一个分区（比如SALES_Q3_2000分区）到新分区中。需要指定FORCE=>TRUE参数而无视统计信息锁。
EXEC DBMS_STATS.COPY_TABLE_STATS (‘SH’, ‘SALES’, ‘SALES_Q3_2000’, ‘SALES_Q4_2000’, FORCE=>TRUE);

昂贵的全局统计信息收集

在数据仓库环境中，将大量数据装载到一个或多个空的分区是很常见的事情。这会使得分区统计信息陈旧，并会使全局统计信息也过旧。对受影响的分区和整个表重新收集统计信息会消耗大量时间。传统上，统计信息收集分两步来做：
在第一步，我们会扫描表以便收集全局统计信息。
在第二步，我们会扫描已变化的分区以便收集分区级的统计信息。

依据表的大小，为了全局统计信息的收集而对表进行全扫描会是非常昂贵的。注意，是扫描整个表，虽然我们只是改变了所有分区的一小部分。
通过从分区级的统计信息推导计算出全局统计信息，我们可以避免扫描整个表。一部分统计信息可以很容易和精准地从分区级统计信息中推导出来。比如，在全局级别的行数，就是求和各分区的行数。甚至全局的直方图信息也可以从分区的直方图中推导。但是，列上的唯一值的数量是不能从分区级的唯一值数量推导出来的。为此，Oracle为分区级的每一列，维护另一种被称之为SYNOPSIS的结构。SYNOPSIS可以被看做是对唯一值的采样。唯一值数量可以从SYNOPSIS中被精确推导出来。我们还可以合并多个SYNOPSIS为一个，全局的唯一值数量就是由合并了全部分区级SYNOPSIS的SYNOPSIS推导出来的。总结一下：
- 仅对变化的分区收集统计信息和创建SYNOPSIS
- Oracle自动合并分区级的SYNOPSIS为一个全局级的SYNOPSIS
- 全局统计信息被自动由分区级的统计信息和全局级的SYNOPSIS中推导出来。

增量维护功能默认是关闭的。可以通过改变表级参数INCREMENTAL为TRUE来启用它。它也可以在特定的SCHEMA或数据库级上启用。
假设我们有一个名为SALES的，在SALES_DATE列上按日进行范围分区的表。在每天结束的时时候，数据被加载到最新的分区中，同时收集该分区的统计信息。而全局统计信息仅在每月结束时才收集，因为收集它们是很耗时和消耗资源的。通过如下的步骤，可以实现在每天数据加载后都能维护全局统计信息：
打开该表上的增量功能。

EXEC DBMS_STATS.SET_TABLE_PREFS('SH','SALES','INCREMENTAL','TRUE');
复制

在每次加载数据结束后，使用GATHER_TABLE_STATS命令收集表的统计信息。你并不需要指定分区名，也不需要指定granularity参数。该命令将将收集数据变化了或无统计信息的分区的统计信息，并且基于分区级别上的统计信息和SYNOPSIS更新全局统计信息。

EXEC DBMS_STATS.GATHER_TABLE_STATS('SH','SALES');
复制

注：增量维护功能是在Oracle Database 11g Release 1时引入的。但是，我们也提供了在Oracle Database10g Release 2 (10.2.0.4)上模拟类似行为的方案。在10g的GATHER_TABLE_STATS的GRANULARITY参数中提供了一个新值：‘APPROX_GLOBAL AND PARTITION’，其行为除了不能更新非分区键列上的NDV（唯一值数量）和在全局级别更新索引的唯一键值数量外，与增量维护功能是相同的。对于分区键列的NDV，我们通过在累加分区级别的NDV值获得。而且，我们设置唯一索引键列的NDV值为表的行数。通常情况下，全局级别非分区键列的NDV值不太经常变得过时。收集全局统计信息的频率可能低于默认值（表更改10%时），因为“approx_global”选项可以维护大多数全局统计信息的准确性。

原文链接：https://blogs.oracle.com/optimizer/maintaining-statistics-on-large-partitioned-tables
原文内容：
We have gotten a lot of questions recently regarding how to gather and maintain optimizer statistics on large partitioned tables. The majority of these questions can be summarized into two topics:

When queries access a single partition with stale or non-existent partition level statistics I get a sub optimal plan due to "Out of Range" values
Global statistics collection is extremely expensive in terms of time and system resources
复制

This article will describe both of these issues and explain how you can address them both.

This is big topic so I recommend that you also check out the three-part series of posts on maintaining incremental statistics in partitioned tables.

Out of Range
Large tables are often decomposed into smaller pieces called partitions in order to improve query performance and ease of data management. The Oracle query optimizer relies on both the statistics of the entire table (global statistics) and the statistics of the individual partitions (partition statistics) to select a good execution plan for a SQL statement. If the query needs to access only a single partition, the optimizer uses only the statistics of the accessed partition. If the query access more than one partition, it uses a combination of global and partition statistics.

“Out of Range” means that the value supplied in a where clause predicate is outside the domain of values represented by the [minimum, maximum] column statistics. The optimizer prorates the selectivity based on the distance between the predicate value and the maximum value (assuming the value is higher than the max), that is, the farther the value is from the maximum value, the lower the selectivity will be. This situation occurs most frequently in tables that are range partitioned by a date column, a new partition is added, and then queried while rows are still being loaded in the new partition. The partition statistics will be stale very quickly due to the continuous trickle feed load even if the statistics get refreshed periodically. The maximum value known to the optimizer is not correct leading to the “Out of Range” condition. The under-estimation of selectivity often leads the query optimizer to pick a sub optimal plan. For example, the query optimizer would pick an index access path while a full scan is a better choice.

The “Out of Range” condition can be prevented by using the new copy table statistics procedure available in Oracle Database10.2.0.4 and 11g. This procedure copies the statistics of the source [sub] partition to the destination [sub] partition. It also copies the statistics of the dependent objects: columns, local (partitioned) indexes etc. It adjusts the minimum and maximum values of the partitioning column as follows; it uses the high bound partitioning value as the maximum value of the first partitioning column (it is possible to have concatenated partition columns) and high bound partitioning value of the previous partition as the minimum value of the first partitioning column for range partitioned table. It can optionally scale some of the other statistics like the number of blocks, number of rows etc. of the destination partition.

Assume we have a table called SALES that is ranged partitioned by quarter on the SALES_DATE column. At the end of every day data is loaded into latest partition. However, statistics are only gathered at the end of every quarter when the partition is fully loaded. Assuming global and partition level statistics (for all fully loaded partitions) are up to date, use the following steps in order to prevent getting a sub-optimal plan due to “out of range”.

Lock the table statistics using LOCK_TABLE_STATS procedure in DBMS_STATS. This is to avoid interference from auto statistics job.

EXEC DBMS_STATS.LOCK_TABLE_STATS(‘SH’,‘SALES’);

Before beginning the initial load into each new partition (say SALES_Q4_2000) copy the statistics from the previous partition (say SALES_Q3_2000) using COPY_TABLE_STATS. You need to specify FORCE=>TRUE to override the statistics lock.

EXEC DBMS_STATS.COPY_TABLE_STATS (‘SH’, ‘SALES’, ‘SALES_Q3_2000’, ‘SALES_Q4_2000’, FORCE=>TRUE);

Expensive global statistics collection

In data warehouse environment it is very common to do a bulk load directly into one or more empty partitions. This will make the partition statistics stale and may also make the global statistics stale. Re-gathering statistics for the effected partitions and for the entire table can be very time consuming. Traditionally, statistics collection is done in a two-pass approach:

In the first pass we will scan the table to gather the global statistics
In the second pass we will scan the partitions that have been changed to gather their partition level statistics.
复制

The full scan of the table for global statistics collection can be very expensive depending on the size of the table. Note that the scan of the entire table is done even if we change a small subset of partitions.

We avoid scanning the whole table when computing global statistics by deriving the global statistics from the partition statistics. Some of the statistics can be derived easily and accurately from partition statistics. For example, number of rows at global level is the sum of number of rows of partitions. Even global histogram can be derived from partition histograms. But the number of distinct values (NDV) of a column cannot be derived from partition level NDVs. So, Oracle maintains another structure called a synopsis for each column at the partition level. A synopsis can be considered as sample of distinct values. The NDV can be accurately derived from synopses. We can also merge multiple synopses into one. The global NDV is derived from the synopsis generated by merging all of the partition level synopses. To summarize:

Gather statistics and create synopses for the changed partitions only
Oracle automatically merges partition level synopses into a global synopsis
The global statistics are automatically derived from the partition level statistics and global synopses
复制

Incremental maintenance feature is disabled by default. It can be enabled by changing the INCREMENTAL table preference to true. It can also be enabled for a particular schema or at the database level.

Assume we have table called SALES that is range partitioned by day on the SALES_DATE column. At the end of every day data is loaded into latest partition and partition statistics are gathered. Global statistics are only gathered at the end of every month because gathering them is very time and resource intensive. Use the following steps in order to maintain global statistics after every load.

Turn on incremental feature for the table.

EXEC DBMS_STATS.SET_TABLE_PREFS(‘SH’,‘SALES’,‘INCREMENTAL’,‘TRUE’);

At the end of every load gather table statistics using GATHER_TABLE_STATS command. You don’t need to specify the partition name. Also, do not specify the granularity parameter. The command will collect statistics for partitions where data has change or statistics are missing and update the global statistics based on the partition level statistics and synopsis.

EXEC DBMS_STATS.GATHER_TABLE_STATS(‘SH’,‘SALES’);

Note: that the incremental maintenance feature was introduced in Oracle Database 11g Release 1. However, we also provide a solution in Oracle Database10g Release 2 (10.2.0.4) that simulates the same behavior. The 10g solution is a new value, ‘APPROX_GLOBAL AND PARTITION’ for the GRANULARITY parameter of the GATHER_TABLE_STATS procedures. It behaves the same as the incremental maintenance feature except that we don’t update the NDV for non-partitioning columns and number of distinct keys of the index at the global level. For partitioned column we update the NDV as the sum of NDV at the partition levels. Also we set the NDV of columns of unique indexes as the number of rows of the table. In general, non-partitioning column NDV at the global level becomes stale less often. It may be possible to collect global statistics less frequently then the default (when table changes 10%) since approx_global option maintains most of the global statistics accurately.

分区表统计信息 incremental 增量维护

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

维护大分区表上的统计信息

越界

昂贵的全局统计信息收集

评论

相关阅读