PolarDB for PostgreSQL (hereafter PolarDB-PG) is an enterprise-grade database product developed in-house by Alibaba Cloud. It adopts a compute-storage separation architecture and is compatible with PostgreSQL and Oracle. Both its storage and compute capacity can scale horizontally, and it offers enterprise features such as high reliability, high availability, and elastic scaling. PolarDB-PG also provides large-scale parallel computing to handle mixed OLTP and OLAP workloads, along with multi-model capabilities such as spatio-temporal, vector, search, and graph support to meet enterprises' ever-evolving data-processing needs.
Kernel Execution Flow
Now that we know what information `pg_statistic` ultimately needs to store, let's look at how the kernel collects and computes it, starting from the PostgreSQL executor code. For a utility command like `ANALYZE`, the executor routes each kind of statement through the switch-case in `standard_ProcessUtility()` to the function that implements it.
```c
/*
 * standard_ProcessUtility itself deals only with utility commands for
 * which we do not provide event trigger support. Commands that do have
 * such support are passed down to ProcessUtilitySlow, which contains the
 * necessary infrastructure for such triggers.
 *
 * This division is not just for performance: it's critical that the
 * event trigger code not be invoked when doing START TRANSACTION for
 * example, because we might need to refresh the event trigger cache,
 * which requires being in a valid transaction.
 */
void
standard_ProcessUtility(PlannedStmt *pstmt,
                        const char *queryString,
                        bool readOnlyTree,
                        ProcessUtilityContext context,
                        ParamListInfo params,
                        QueryEnvironment *queryEnv,
                        DestReceiver *dest,
                        QueryCompletion *qc)
{
    // ...
    switch (nodeTag(parsetree))
    {
        // ...
        case T_VacuumStmt:
            ExecVacuum(pstate, (VacuumStmt *) parsetree, isTopLevel);
            break;
        // ...
    }
    // ...
}
```
The handling of `ANALYZE` enters through the same path as `VACUUM`: the `ExecVacuum()` function.
```c
/*
 * Primary entry point for manual VACUUM and ANALYZE commands
 *
 * This is mainly a preparation wrapper for the real operations that will
 * happen in vacuum().
 */
void
ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
{
    // ...
    /* Now go through the common routine */
    vacuum(vacstmt->rels, &params, NULL, isTopLevel);
}
```
After parsing a pile of options, we arrive in the `vacuum()` function. Here the kernel first works out exactly which tables to analyze, because `ANALYZE` can be invoked to:

- analyze all tables in the current database
- analyze a few specific tables
- analyze a few specific columns of one table
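In SQL, these three forms look like the following (table and column names are hypothetical): `ANALYZE;` analyzes every table in the database, `ANALYZE users;` a single table, and `ANALYZE users (name, age);` specific columns of a table.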
Once the target tables are determined, each of them is passed in turn to the `analyze_rel()` function:
```c
if (params->options & VACOPT_ANALYZE)
{
    // ...
    analyze_rel(vrel->oid, vrel->relation, params,
                vrel->va_cols, in_outer_xact, vac_strategy);
    // ...
}
```
Inside `analyze_rel()`, the kernel first takes a `ShareUpdateExclusiveLock` on the table to be analyzed, preventing two `ANALYZE` operations from running on it concurrently (this lock mode conflicts with itself but not with ordinary reads and writes, so `ANALYZE` does not block normal DML). It then decides how to proceed based on the kind of relation being analyzed; analyzing a foreign table, for example, should simply invoke the `ANALYZE` support provided by the table's FDW routine. Next, the table is passed into the `do_analyze_rel()` function.
```c
/*
 * analyze_rel() -- analyze one relation
 *
 * relid identifies the relation to analyze. If relation is supplied, use
 * the name therein for reporting any failure to open/lock the rel; do not
 * use it once we've successfully opened the rel, since it might be stale.
 */
void
analyze_rel(Oid relid, RangeVar *relation,
            VacuumParams *params, List *va_cols, bool in_outer_xact,
            BufferAccessStrategy bstrategy)
{
    // ...
    /*
     * Do the normal non-recursive ANALYZE. We can skip this for partitioned
     * tables, which don't contain any rows.
     */
    if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
        do_analyze_rel(onerel, params, va_cols, acquirefunc,
                       relpages, false, in_outer_xact, elevel);
    // ...
}
```
Inside `do_analyze_rel()`, the kernel further narrows down which columns of the table to analyze: the user may have asked for only a few specific columns, and frequently accessed columns are the ones most worth analyzing. It also opens all of the table's indexes to see whether they provide any additional columns that can be analyzed.
To obtain statistics for each column, we obviously have to read the column's data from disk and run computations over it. This raises a key question: how many rows should we scan? In theory, analyzing as much data as possible, ideally all of it, yields the most accurate statistics; but for a very large table we cannot fit all the data in memory, and the time the analysis would take is unacceptable. So the user can cap the number of rows to sample, striking a compromise between runtime cost and statistical accuracy:
```c
/*
 * Determine how many rows we need to sample, using the worst case from
 * all analyzable columns. We use a lower bound of 100 rows to avoid
 * possible overflow in Vitter's algorithm. (Note: that will also be the
 * target in the corner case where there are no analyzable columns.)
 */
targrows = 100;
for (i = 0; i < attr_cnt; i++)
{
    if (targrows < vacattrstats[i]->minrows)
        targrows = vacattrstats[i]->minrows;
}
for (ind = 0; ind < nindexes; ind++)
{
    AnlIndexData *thisdata = &indexdata[ind];

    for (i = 0; i < thisdata->attr_cnt; i++)
    {
        if (targrows < thisdata->vacattrstats[i]->minrows)
            targrows = thisdata->vacattrstats[i]->minrows;
    }
}

/*
 * Look at extended statistics objects too, as those may define custom
 * statistics target. So we may need to sample more rows and then build
 * the statistics with enough detail.
 */
minrows = ComputeExtStatisticsRows(onerel, attr_cnt, vacattrstats);
if (targrows < minrows)
    targrows = minrows;
```
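As we will see in `std_typanalyze()` below, each column's `minrows` is 300 times its statistics target. With the default `default_statistics_target` of 100, that works out to 300 × 100 = 30,000 sampled rows. The target can be raised per column, at the cost of a longer `ANALYZE` and larger statistics, e.g. with `ALTER TABLE users ALTER COLUMN name SET STATISTICS 500;` (table and column names are hypothetical).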
After the sample size has been determined, the kernel allocates a tuple array of that length and begins sampling data through the `acquirefunc` function pointer:
```c
/*
 * Acquire the sample rows
 */
rows = (HeapTuple *) palloc(targrows * sizeof(HeapTuple));
pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE,
                             inh ? PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS_INH :
                             PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS);
if (inh)
    numrows = acquire_inherited_sample_rows(onerel, elevel,
                                            rows, targrows,
                                            &totalrows, &totaldeadrows);
else
    numrows = (*acquirefunc) (onerel, elevel,
                              rows, targrows,
                              &totalrows, &totaldeadrows);
```
This function pointer refers to the `acquire_sample_rows()` function set up back in `analyze_rel()`. It samples the table's data in two stages (a simplified sketch follows the excerpt below):

- Stage 1: randomly select the data blocks to read, up to the target number of sample rows
- Stage 2: within each selected block, randomly sample rows using Vitter's algorithm

The two stages run simultaneously. Once sampling completes, the sampled tuples have already been placed into the tuple array. The array is then quicksorted by tuple position, and the sampled data is used to estimate the numbers of live and dead tuples in the whole table:
```c
/*
 * acquire_sample_rows -- acquire a random sample of rows from the table
 *
 * Selected rows are returned in the caller-allocated array rows[], which
 * must have at least targrows entries.
 * The actual number of rows selected is returned as the function result.
 * We also estimate the total numbers of live and dead rows in the table,
 * and return them into *totalrows and *totaldeadrows, respectively.
 *
 * The returned list of tuples is in order by physical position in the table.
 * (We will rely on this later to derive correlation estimates.)
 *
 * As of May 2004 we use a new two-stage method: Stage one selects up
 * to targrows random blocks (or all blocks, if there aren't so many).
 * Stage two scans these blocks and uses the Vitter algorithm to create
 * a random sample of targrows rows (or less, if there are less in the
 * sample of blocks). The two stages are executed simultaneously: each
 * block is processed as soon as stage one returns its number and while
 * the rows are read stage two controls which ones are to be inserted
 * into the sample.
 *
 * Although every row has an equal chance of ending up in the final
 * sample, this sampling method is not perfect: not every possible
 * sample has an equal chance of being selected. For large relations
 * the number of different blocks represented by the sample tends to be
 * too small. We can live with that for now. Improvements are welcome.
 *
 * An important property of this sampling method is that because we do
 * look at a statistically unbiased set of blocks, we should get
 * unbiased estimates of the average numbers of live and dead rows per
 * block. The previous sampling method put too much credence in the row
 * density near the start of the table.
 */
static int
acquire_sample_rows(Relation onerel, int elevel,
                    HeapTuple *rows, int targrows,
                    double *totalrows, double *totaldeadrows)
{
    // ...
    /* Outer loop over blocks to sample */
    while (BlockSampler_HasMore(&bs))
    {
        bool        block_accepted;
        BlockNumber targblock = BlockSampler_Next(&bs);

        // ...
    }
    // ...

    /*
     * If we didn't find as many tuples as we wanted then we're done. No sort
     * is needed, since they're already in order.
     *
     * Otherwise we need to sort the collected tuples by position
     * (itempointer). It's not worth worrying about corner cases where the
     * tuples are already sorted.
     */
    if (numrows == targrows)
        qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);

    /*
     * Estimate total numbers of live and dead rows in relation, extrapolating
     * on the assumption that the average tuple density in pages we didn't
     * scan is the same as in the pages we did scan. Since what we scanned is
     * a random sample of the pages in the relation, this should be a good
     * assumption.
     */
    if (bs.m > 0)
    {
        *totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
        *totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
    }
    else
    {
        *totalrows = 0.0;
        *totaldeadrows = 0.0;
    }
    // ...
}
```
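To make stage two concrete, below is a minimal, self-contained sketch of reservoir sampling in the style of Knuth's Algorithm R. This is an illustration of the invariant only, not the actual PostgreSQL code: the real implementation uses Vitter's Algorithm Z, which computes how many rows to skip rather than drawing a random number for every row, but both maintain the same property that after `t` rows have been seen, every row is in the sample with probability `targrows / t`.

```c
#include <stdio.h>
#include <stdlib.h>

#define TARGROWS 5              /* sample size, i.e. PostgreSQL's targrows */

int
main(void)
{
    int reservoir[TARGROWS];
    int nseen = 0;

    /* Stream 100 "rows"; each ends up sampled with probability 5/100. */
    for (int row = 0; row < 100; row++)
    {
        nseen++;
        if (nseen <= TARGROWS)
        {
            /* The first TARGROWS rows fill the reservoir directly. */
            reservoir[nseen - 1] = row;
        }
        else
        {
            /* Accept the new row with probability TARGROWS / nseen, */
            /* evicting a uniformly random current member if accepted. */
            int k = rand() % nseen;

            if (k < TARGROWS)
                reservoir[k] = row;
        }
    }

    for (int i = 0; i < TARGROWS; i++)
        printf("%d ", reservoir[i]);
    printf("\n");
    return 0;
}
```

The extrapolation at the end of `acquire_sample_rows()` is then straightforward: if the `bs.m` scanned blocks contained `liverows` live tuples, a table of `totalblocks` blocks is estimated to hold about `liverows / bs.m * totalblocks` live tuples, rounded to the nearest integer.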
Back in `do_analyze_rel()`: with the sampled data in hand, the kernel computes statistics for each column to be analyzed, then updates the `pg_statistic` system catalog:
```c
/*
 * Compute the statistics. Temporary results during the calculations for
 * each column are stored in a child context. The calc routines are
 * responsible to make sure that whatever they store into the VacAttrStats
 * structure is allocated in anl_context.
 */
if (numrows > 0)
{
    // ...
    for (i = 0; i < attr_cnt; i++)
    {
        VacAttrStats *stats = vacattrstats[i];
        AttributeOpts *aopt;

        stats->rows = rows;
        stats->tupDesc = onerel->rd_att;
        stats->compute_stats(stats,
                             std_fetch_func,
                             numrows,
                             totalrows);
        // ...
    }
    // ...

    /*
     * Emit the completed stats rows into pg_statistic, replacing any
     * previous statistics for the target columns. (If there are stats in
     * pg_statistic for columns we didn't process, we leave them alone.)
     */
    update_attstats(RelationGetRelid(onerel), inh,
                    attr_cnt, vacattrstats);
    // ...
}
```
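Once `update_attstats()` has run, the fresh statistics can be inspected through the `pg_stats` view, a human-readable wrapper around `pg_statistic`, for example with `SELECT attname, null_frac, n_distinct, most_common_vals FROM pg_stats WHERE tablename = 'users';` (table name hypothetical).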
Naturally, columns of different data types need different computations, so their `compute_stats` function pointers refer to different functions. Let's look at where this pointer gets assigned:
```c
/*
 * std_typanalyze -- the default type-specific typanalyze function
 */
bool
std_typanalyze(VacAttrStats *stats)
{
    // ...
    /*
     * Determine which standard statistics algorithm to use
     */
    if (OidIsValid(eqopr) && OidIsValid(ltopr))
    {
        /* Seems to be a scalar datatype */
        stats->compute_stats = compute_scalar_stats;

        /*--------------------
         * The following choice of minrows is based on the paper
         * "Random sampling for histogram construction: how much is enough?"
         * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
         * Proceedings of ACM SIGMOD International Conference on Management
         * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
         * says that for table size n, histogram size k, maximum relative
         * error in bin size f, and error probability gamma, the minimum
         * random sample size is
         *      r = 4 * k * ln(2*n/gamma) / f^2
         * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
         *      r = 305.82 * k
         * Note that because of the log function, the dependence on n is
         * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
         * bin size error with probability 0.99.  So there's no real need to
         * scale for n, which is a good thing because we don't necessarily
         * know it at this point.
         *--------------------
         */
        stats->minrows = 300 * attr->attstattarget;
    }
    else if (OidIsValid(eqopr))
    {
        /* We can still recognize distinct values */
        stats->compute_stats = compute_distinct_stats;
        /* Might as well use the same minrows as above */
        stats->minrows = 300 * attr->attstattarget;
    }
    else
    {
        /* Can't do much but the trivial stuff */
        stats->compute_stats = compute_trivial_stats;
        /* Might as well use the same minrows as above */
        stats->minrows = 300 * attr->attstattarget;
    }
    // ...
}
```
This conditional can be read as follows:

- If the column's data type supports the default `=` (`eqopr`: equals operator) and `<` (`ltopr`: less-than operator), the column appears to be an ordered scalar type and can be analyzed with `compute_scalar_stats()`; its `minrows` of `300 * attstattarget` follows from the Chaudhuri, Motwani, and Narasayya sample-size bound quoted in the comment above
- If the data type supports only the `=` operator, distinct values can still be recognized, so `compute_distinct_stats()` can be used for distinct-value statistics
- Failing both, the column can only get some simple analysis via `compute_trivial_stats()` (a sketch of what *trivial* means follows this list)
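What does *trivial* actually mean? Essentially just the NULL fraction (`stanullfrac`) and the average datum width (`stawidth`), which can be computed for any type at all. The sketch below is an illustrative reconstruction under that assumption, using simplified stand-in types; it is not the actual `compute_trivial_stats()` code:

```c
#include <stdio.h>

/* Simplified stand-in for one sampled column value (hypothetical type) */
typedef struct SampleValue
{
    int isnull;                 /* is this value NULL? */
    int width;                  /* datum width in bytes */
} SampleValue;

/*
 * Compute the two statistics that need no operators at all:
 * the fraction of NULLs and the average non-NULL datum width.
 */
static void
compute_trivial_sketch(const SampleValue *vals, int numrows,
                       double *nullfrac, double *avgwidth)
{
    int     nonnull = 0;
    long    total_width = 0;

    for (int i = 0; i < numrows; i++)
    {
        if (vals[i].isnull)
            continue;
        nonnull++;
        total_width += vals[i].width;
    }

    *nullfrac = (numrows > 0) ? (double) (numrows - nonnull) / numrows : 0.0;
    *avgwidth = (nonnull > 0) ? (double) total_width / nonnull : 0.0;
}

int
main(void)
{
    SampleValue sample[4] = {{0, 8}, {1, 0}, {0, 16}, {0, 12}};
    double      nullfrac, avgwidth;

    compute_trivial_sketch(sample, 4, &nullfrac, &avgwidth);
    printf("nullfrac = %.2f, avgwidth = %.1f\n", nullfrac, avgwidth);
    return 0;
}
```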
We could go look at what each of these three analysis functions does, but I am not going to dissect their internals here. The ideas are based on some truly ancient statistics papers, so old that the letters in the scanned PDFs are barely legible. The code itself is not especially readable either: it largely transcribes the papers' formulas directly, so the variables and equations make little sense without the papers at hand.