PolarDB PostgreSQL版(以下简称 PolarDB-PG)是一款阿里云自主研发的企业级数据库产品,采用计算存储分离架构,兼容 PostgreSQL 与 Oracle。PolarDB-PG 的存储与计算能力均可横向扩展,具有高可靠、高可用、弹性扩展等企业级数据库特性。同时,PolarDB-PG 具有大规模并行计算能力,可以应对 OLTP 与 OLAP 混合负载;还具有时空、向量、搜索、图谱等多模创新特性,可以满足企业对数据处理日新月异的新需求。
compute_trivial_stats
如果某个列的数据类型不支持等值运算符和比较运算符,那么就只能进行一些简单的分析,比如:
- 非空行的比例
- 列中元组的平均宽度
这些可以通过对采样后的元组数组进行循环遍历后轻松得到。
/*
* compute_trivial_stats() -- compute very basic column statistics
*
* We use this when we cannot find a hash "=" operator for the datatype.
*
* We determine the fraction of non-null rows and the average datum width.
*/
static void
compute_trivial_stats(VacAttrStatsP stats,
AnalyzeAttrFetchFunc fetchfunc,
int samplerows,
double totalrows)
{}
复制
compute_distinct_stats
如果某个列只支持等值运算符,也就是说我们只能知道一个数值 是什么,但不能和其它数值比大小。所以无法分析数值在大小范围上的分布,只能分析数值在出现频率上的分布。所以该函数分析的统计数据包含:
- 非空行的比例
- 列中元组的平均宽度
- 最频繁出现的值(MCV)
- (估算的)唯一值个数
/*
* compute_distinct_stats() -- compute column statistics including ndistinct
*
* We use this when we can find only an "=" operator for the datatype.
*
* We determine the fraction of non-null rows, the average width, the
* most common values, and the (estimated) number of distinct values.
*
* The most common values are determined by brute force: we keep a list
* of previously seen values, ordered by number of times seen, as we scan
* the samples. A newly seen value is inserted just after the last
* multiply-seen value, causing the bottommost (oldest) singly-seen value
* to drop off the list. The accuracy of this method, and also its cost,
* depend mainly on the length of the list we are willing to keep.
*/
static void
compute_distinct_stats(VacAttrStatsP stats,
AnalyzeAttrFetchFunc fetchfunc,
int samplerows,
double totalrows)
{}
复制
compute_scalar_stats
如果一个列的数据类型支持等值运算符和比较运算符,那么可以进行最详尽的分析。分析目标包含:
- 非空行的比例
- 列中元组的平均宽度
- 最频繁出现的值(MCV)
- (估算的)唯一值个数
- 数据分布直方图
- 物理和逻辑位置的相关性
/*
* compute_distinct_stats() -- compute column statistics including ndistinct
*
* We use this when we can find only an "=" operator for the datatype.
*
* We determine the fraction of non-null rows, the average width, the
* most common values, and the (estimated) number of distinct values.
*
* The most common values are determined by brute force: we keep a list
* of previously seen values, ordered by number of times seen, as we scan
* the samples. A newly seen value is inserted just after the last
* multiply-seen value, causing the bottommost (oldest) singly-seen value
* to drop off the list. The accuracy of this method, and also its cost,
* depend mainly on the length of the list we are willing to keep.
*/
static void
compute_distinct_stats(VacAttrStatsP stats,
AnalyzeAttrFetchFunc fetchfunc,
int samplerows,
double totalrows)
{}
复制
总结
以 PostgreSQL 优化器需要的统计信息为切入点,分析了 ANALYZE
命令的大致执行流程。出于简洁性,在流程分析上没有覆盖各种 corner case 和相关的处理逻辑。
参考资料
PostgreSQL 14 Documentation: ANALYZE
PostgreSQL 14 Documentation: 25.1. Routine Vacuuming
PostgreSQL 14 Documentation: 14.2. Statistics Used by the Planner
PostgreSQL 14 Documentation: 52.49. pg_statistic
[阿里云数据库内核月报 2016/05:PostgreSQL 特性分析 统计信息计算方法](