PolarDB-PG Internals: Reading the ANALYZE Source Code (Part 5)

PolarDB农夫山泉 2023-08-29

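PolarDB for PostgreSQL (hereafter PolarDB-PG) is an enterprise-grade database product developed in-house by Alibaba Cloud. It uses a compute-storage separation architecture and is compatible with PostgreSQL and Oracle. Both its storage and its compute capacity scale horizontally, and it offers enterprise-grade features such as high reliability, high availability, and elastic scaling. PolarDB-PG also supports massively parallel computation for mixed OLTP and OLAP workloads, and provides multi-model features such as spatio-temporal, vector, search, and graph processing to meet enterprises' fast-changing data-processing needs.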

compute_trivial_stats

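If a column's data type supports neither an equality operator nor a comparison operator, only a few simple statistics can be collected:

The fraction of non-null rows
The average width of the values in the column

Both can be obtained with a single loop over the array of sampled tuples.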
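
The loop described above can be sketched as follows. This is a minimal illustration, not the body of compute_trivial_stats: the types SampleValue and TrivialStats and the function sketch_trivial_stats are hypothetical names invented for this example, and each sampled value is reduced to a null flag plus a byte width.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one sampled column value (not a PostgreSQL type). */
typedef struct SampleValue {
    int    is_null;   /* nonzero if the sampled value is NULL */
    size_t width;     /* size of the value in bytes; ignored when NULL */
} SampleValue;

/* Hypothetical result holder for the two trivial statistics. */
typedef struct TrivialStats {
    double nonnull_frac;  /* fraction of sampled rows that are non-null */
    double avg_width;     /* average width in bytes of the non-null values */
} TrivialStats;

static TrivialStats
sketch_trivial_stats(const SampleValue *samples, int samplerows)
{
    TrivialStats result = {0.0, 0.0};
    int          nonnull_cnt = 0;
    double       total_width = 0.0;

    assert(samplerows >= 0);

    /* One pass over the sample: count non-null values and sum their widths. */
    for (int i = 0; i < samplerows; i++) {
        if (samples[i].is_null)
            continue;
        nonnull_cnt++;
        total_width += (double) samples[i].width;
    }

    /* Guard both divisions so an empty or all-null sample yields zeros. */
    if (samplerows > 0)
        result.nonnull_frac = (double) nonnull_cnt / (double) samplerows;
    if (nonnull_cnt > 0)
        result.avg_width = total_width / (double) nonnull_cnt;

    return result;
}
```

For a sample of four rows where one is NULL and the others are 4, 8, and 12 bytes wide, the sketch reports a non-null fraction of 0.75 and an average width of 8 bytes.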

/*
 *  compute_trivial_stats() -- compute very basic column statistics
 *
 *  We use this when we cannot find a hash "=" operator for the datatype.
 *
 *  We determine the fraction of non-null rows and the average datum width.
 */
static void
compute_trivial_stats(VacAttrStatsP stats,
                      AnalyzeAttrFetchFunc fetchfunc,
                      int samplerows,
                      double totalrows)
{}

compute_distinct_stats

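If a column's data type supports only an equality operator, we can tell what a value is but cannot order it relative to other values. The distribution of values over a range therefore cannot be analyzed; only their distribution by frequency of occurrence can. The statistics this function collects are:

The fraction of non-null rows
The average width of the values in the column
The most common values (MCV)
The (estimated) number of distinct values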
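
The article does not show how the number of distinct values is estimated from a sample. As an assumption based on upstream PostgreSQL's analyze.c, which cites the Haas-Stokes estimator n*d / (n - f1 + f1*n/N), the calculation can be sketched as below. The function name sketch_estimate_ndistinct and its parameters are hypothetical, and the special cases the real code may handle separately (for example, every sampled value being unique) are not modeled here.

```c
#include <assert.h>

/*
 * Sketch of the Haas-Stokes estimator (assumed, not taken from the article):
 *
 *   n  - number of non-null values in the sample
 *   N  - estimated number of non-null rows in the whole table
 *   d  - number of distinct values found in the sample
 *   f1 - number of values that appear exactly once in the sample
 */
static double
sketch_estimate_ndistinct(int n, double N, int d, int f1)
{
    double numer;
    double denom;
    double estimate;

    assert(n > 0 && N >= (double) n);
    assert(d > 0 && d <= n);
    assert(f1 >= 0 && f1 <= d);

    numer = (double) n * (double) d;
    denom = (double) (n - f1) + (double) f1 * (double) n / N;
    estimate = numer / denom;

    /* Clamp to a sane range: at least d, at most N. */
    if (estimate < (double) d)
        estimate = (double) d;
    if (estimate > N)
        estimate = N;

    return estimate;
}
```

When no value is seen exactly once (f1 = 0), the estimate collapses to d, which matches the intuition that the sample has probably seen every distinct value. When every sampled value is seen exactly once, the estimate scales up toward N.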

/*
 *  compute_distinct_stats() -- compute column statistics including ndistinct
 *
 *  We use this when we can find only an "=" operator for the datatype.
 *
 *  We determine the fraction of non-null rows, the average width, the
 *  most common values, and the (estimated) number of distinct values.
 *
 *  The most common values are determined by brute force: we keep a list
 *  of previously seen values, ordered by number of times seen, as we scan
 *  the samples.  A newly seen value is inserted just after the last
 *  multiply-seen value, causing the bottommost (oldest) singly-seen value
 *  to drop off the list.  The accuracy of this method, and also its cost,
 *  depend mainly on the length of the list we are willing to keep.
 */
static void
compute_distinct_stats(VacAttrStatsP stats,
                       AnalyzeAttrFetchFunc fetchfunc,
                       int samplerows,
                       double totalrows)
{}
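
The brute-force list described in the header comment can be sketched as follows. This is an illustration under simplifying assumptions, not the function body: values are plain int compared with ==, the list length TRACK_MAX is a hypothetical constant, and TrackItem, TrackList, and track_value are names invented for this example.

```c
#include <assert.h>

#define TRACK_MAX 4   /* hypothetical list length chosen for the example */

typedef struct TrackItem {
    int value;   /* a value seen in the sample */
    int count;   /* how many times it has been seen so far */
} TrackItem;

typedef struct TrackList {
    TrackItem items[TRACK_MAX];   /* kept ordered by count, descending */
    int       len;                /* number of slots in use */
} TrackList;

static void
track_value(TrackList *list, int value)
{
    int first_single = list->len;   /* index of the first singly-seen entry */
    int matched = 0;
    int j;

    /* Look for the value using equality only; note where count == 1 starts. */
    for (j = 0; j < list->len; j++) {
        if (list->items[j].value == value) {
            matched = 1;
            break;
        }
        if (j < first_single && list->items[j].count == 1)
            first_single = j;
    }

    if (matched) {
        /* Seen before: bump the count and bubble the entry up the list. */
        list->items[j].count++;
        while (j > 0 && list->items[j].count > list->items[j - 1].count) {
            TrackItem tmp = list->items[j];

            list->items[j] = list->items[j - 1];
            list->items[j - 1] = tmp;
            j--;
        }
    } else {
        /*
         * New value: insert it just after the last multiply-seen entry.
         * Shifting the singly-seen entries down pushes the oldest one off
         * the end when the list is full.
         */
        if (list->len < TRACK_MAX)
            list->len++;
        for (j = list->len - 1; j > first_single; j--)
            list->items[j] = list->items[j - 1];
        if (first_single < list->len) {
            list->items[first_single].value = value;
            list->items[first_single].count = 1;
        }
    }
}
```

Feeding the values 7, 7, 3, 5, 9, 7, 1 into an empty list leaves 7 at the head with a count of 3, followed by 1, 9, and 5; the value 3, the oldest singly-seen entry, has dropped off. If every slot holds a multiply-seen value, a new value is simply discarded, which is why the list length governs both accuracy and cost.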

compute_scalar_stats

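If a column's data type supports both an equality operator and a comparison operator, the most detailed analysis is possible. The statistics collected are:

The fraction of non-null rows
The average width of the values in the column
The most common values (MCV)
The (estimated) number of distinct values
A histogram of the data distribution
The correlation between physical and logical order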

/*
 *  compute_scalar_stats() -- compute column statistics
 *
 *  We use this when we can find "=" and "<" operators for the datatype.
 *
 *  We determine the fraction of non-null rows, the average width, the
 *  most common values, the (estimated) number of distinct values, the
 *  distribution histogram, and the correlation of physical to logical order.
 *
 *  The desired stats can be determined fairly easily after sorting the
 *  data values into order.
 */
static void
compute_scalar_stats(VacAttrStatsP stats,
                     AnalyzeAttrFetchFunc fetchfunc,
                     int samplerows,
                     double totalrows)
{}
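
Once values can be ordered, the extra statistics listed for this function fall out of a sort. The sketch below is a simplified illustration under stated assumptions, not the function body: values are plain int, all of them are non-null, most-common values are not removed before the histogram is built, and SortItem, ScalarSketch, and sketch_scalar_stats are hypothetical names. The correlation is the ordinary Pearson coefficient between a row's position in the sample and its position in sorted order.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct SortItem {
    int value;   /* sampled column value */
    int tupno;   /* position of the row in the sample (physical order) */
} SortItem;

typedef struct ScalarSketch {
    int    ndistinct;     /* distinct values found in the sample */
    double correlation;   /* physical vs. sorted (logical) order, -1 to 1 */
} ScalarSketch;

/* Order by value; break ties by physical position for a stable result. */
static int
cmp_sort_item(const void *a, const void *b)
{
    const SortItem *x = a;
    const SortItem *y = b;

    if (x->value != y->value)
        return (x->value > y->value) - (x->value < y->value);
    return (x->tupno > y->tupno) - (x->tupno < y->tupno);
}

/* Fills bounds[0..nbounds-1] with equi-depth histogram boundaries. */
static ScalarSketch
sketch_scalar_stats(const int *sample, int n, int *bounds, int nbounds)
{
    ScalarSketch result = {1, 0.0};
    SortItem    *items;
    double       xysum = 0.0;
    double       xsum;
    double       x2sum;

    assert(n >= 2 && nbounds >= 2 && nbounds <= n);
    items = malloc((size_t) n * sizeof *items);
    assert(items != NULL);

    for (int i = 0; i < n; i++) {
        items[i].value = sample[i];
        items[i].tupno = i;
    }
    qsort(items, (size_t) n, sizeof *items, cmp_sort_item);

    /* One pass over sorted data: count value changes, accumulate sum(x*y). */
    for (int i = 0; i < n; i++) {
        if (i > 0 && items[i].value != items[i - 1].value)
            result.ndistinct++;
        xysum += (double) i * (double) items[i].tupno;
    }

    /* Pick boundaries at evenly spaced positions in the sorted array. */
    for (int i = 0; i < nbounds; i++)
        bounds[i] = items[(long) i * (n - 1) / (nbounds - 1)].value;

    /* Both orders are permutations of 0..n-1, so the sums have closed forms. */
    xsum = (double) (n - 1) * n / 2.0;
    x2sum = (double) (n - 1) * n * (2.0 * n - 1.0) / 6.0;
    result.correlation = (n * xysum - xsum * xsum) / (n * x2sum - xsum * xsum);

    free(items);
    return result;
}
```

A sample that is already in ascending order gives a correlation of 1, and one in descending order gives -1. With the sample 1, 2, 3, 4, 5 and three boundaries, the histogram bounds are 1, 3, and 5.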