PolarDB-PG Internals: ANALYZE Source Code Walkthrough (Part 4)

PolarDB农夫山泉 2023-08-29

PolarDB for PostgreSQL (hereafter PolarDB-PG) is an enterprise-grade database product developed in-house by Alibaba Cloud. It uses a compute-storage separation architecture and is compatible with PostgreSQL and Oracle. Both storage and compute in PolarDB-PG scale out horizontally, and the product offers enterprise features such as high reliability, high availability, and elastic scaling. PolarDB-PG also provides large-scale parallel computing to handle mixed OLTP and OLAP workloads, along with multi-model innovations such as spatio-temporal, vector, search, and graph capabilities to meet rapidly evolving enterprise data-processing needs.

Kernel Execution Flow

Now that we know what information pg_statistic ultimately needs to store, let's look at how the kernel collects and computes it, starting from the executor code of the PostgreSQL kernel. For a utility command such as ANALYZE, the executor routes each kind of statement to the function that implements it via the switch case in standard_ProcessUtility().

/*
 * standard_ProcessUtility itself deals only with utility commands for
 * which we do not provide event trigger support.  Commands that do have
 * such support are passed down to ProcessUtilitySlow, which contains the
 * necessary infrastructure for such triggers.
 *
 * This division is not just for performance: it's critical that the
 * event trigger code not be invoked when doing START TRANSACTION for
 * example, because we might need to refresh the event trigger cache,
 * which requires being in a valid transaction.
 */
void
standard_ProcessUtility(PlannedStmt *pstmt,
                        const char *queryString,
                        bool readOnlyTree,
                        ProcessUtilityContext context,
                        ParamListInfo params,
                        QueryEnvironment *queryEnv,
                        DestReceiver *dest,
                        QueryCompletion *qc)
{
    // ...
    switch (nodeTag(parsetree))
    {
        // ...
        case T_VacuumStmt:
            ExecVacuum(pstate, (VacuumStmt *) parsetree, isTopLevel);
            break;
        // ...
    }
    // ...
}

ANALYZE shares its entry point with VACUUM: both go into the ExecVacuum() function.

/*
 * Primary entry point for manual VACUUM and ANALYZE commands
 *
 * This is mainly a preparation wrapper for the real operations that will
 * happen in vacuum().
 */
void
ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
{
    // ...
    /* Now go through the common routine */
    vacuum(vacstmt->rels, &params, NULL, isTopLevel);
}

After parsing a pile of options, execution reaches the vacuum() function. Here the kernel code first works out which tables need to be analyzed, because in terms of usage ANALYZE can:

  • Analyze every table in the whole database
  • Analyze a few specific tables
  • Analyze a few specific columns of a particular table

Once the set of target tables is known, each one is passed in turn to the analyze_rel() function:

if (params->options & VACOPT_ANALYZE)
{
    // ...
    analyze_rel(vrel->oid, vrel->relation, params,
                vrel->va_cols, in_outer_xact, vac_strategy);
    // ...
}

Inside analyze_rel(), the kernel code first takes a ShareUpdateExclusiveLock on the table to be analyzed, which prevents two ANALYZEs from running on it concurrently. It then decides how to proceed based on the kind of relation being analyzed (for instance, analyzing a foreign table should simply invoke the ANALYZE support provided by the FDW routine). Finally, the table is handed off to the do_analyze_rel() function.

/*
 * analyze_rel() -- analyze one relation
 *
 * relid identifies the relation to analyze.  If relation is supplied, use
 * the name therein for reporting any failure to open/lock the rel; do not
 * use it once we've successfully opened the rel, since it might be stale.
 */
void
analyze_rel(Oid relid, RangeVar *relation,
            VacuumParams *params, List *va_cols, bool in_outer_xact,
            BufferAccessStrategy bstrategy)
{
    // ...
    /*
     * Do the normal non-recursive ANALYZE.  We can skip this for partitioned
     * tables, which don't contain any rows.
     */
    if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
        do_analyze_rel(onerel, params, va_cols, acquirefunc,
                       relpages, false, in_outer_xact, elevel);
    // ...
}

Inside do_analyze_rel(), the kernel code further narrows down which columns of the table to analyze: the user may have asked to analyze only a few columns, since frequently accessed columns are the ones most worth analyzing. The code also opens all indexes on the table to see whether they contain any analyzable columns. A simplified sketch of this column-selection step follows.
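
The sketch below is modeled on do_analyze_rel() but is not a verbatim excerpt: error handling, duplicate-column checks, and expression indexes are omitted, and the counter is simplified to a single attr_cnt.

/*
 * Simplified sketch: build a VacAttrStats entry for each column that
 * will be analyzed, either from the user-supplied column list or from
 * every column of the table.
 */
if (va_cols != NIL)
{
    /* the user listed specific columns: ANALYZE tbl (col1, col2, ...) */
    ListCell   *le;

    vacattrstats = (VacAttrStats **) palloc(list_length(va_cols) *
                                            sizeof(VacAttrStats *));
    foreach(le, va_cols)
    {
        char       *col = strVal(lfirst(le));
        int         i = attnameAttNum(onerel, col, false);

        vacattrstats[attr_cnt] = examine_attribute(onerel, i, NULL);
        if (vacattrstats[attr_cnt] != NULL)
            attr_cnt++;
    }
}
else
{
    /* no column list given: consider every column of the table */
    int         attr;

    vacattrstats = (VacAttrStats **) palloc(onerel->rd_att->natts *
                                            sizeof(VacAttrStats *));
    for (attr = 1; attr <= onerel->rd_att->natts; attr++)
    {
        vacattrstats[attr_cnt] = examine_attribute(onerel, attr, NULL);
        if (vacattrstats[attr_cnt] != NULL)
            attr_cnt++;
    }
}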

To compute statistics for each column, we obviously need to read that column's data from disk and crunch the numbers. This raises a key question: how many rows should we scan? In theory, analyzing as much data as possible, ideally all of it, yields the most accurate statistics; but for a very large table we cannot fit all the data in memory, and the time the analysis would block for is unacceptable. So the user can cap the number of rows to sample, striking a compromise between runtime cost and statistics accuracy:

/*
 * Determine how many rows we need to sample, using the worst case from
 * all analyzable columns.  We use a lower bound of 100 rows to avoid
 * possible overflow in Vitter's algorithm.  (Note: that will also be the
 * target in the corner case where there are no analyzable columns.)
 */
targrows = 100;
for (i = 0; i < attr_cnt; i++)
{
    if (targrows < vacattrstats[i]->minrows)
        targrows = vacattrstats[i]->minrows;
}
for (ind = 0; ind < nindexes; ind++)
{
    AnlIndexData *thisdata = &indexdata[ind];

    for (i = 0; i < thisdata->attr_cnt; i++)
    {
        if (targrows < thisdata->vacattrstats[i]->minrows)
            targrows = thisdata->vacattrstats[i]->minrows;
    }
}

/*
 * Look at extended statistics objects too, as those may define custom
 * statistics target. So we may need to sample more rows and then build
 * the statistics with enough detail.
 */
minrows = ComputeExtStatisticsRows(onerel, attr_cnt, vacattrstats);
if (targrows < minrows)
    targrows = minrows;
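
To put concrete numbers on this: each column's minrows is filled in later by std_typanalyze() (shown further below) as 300 times that column's statistics target, so with the default default_statistics_target = 100 every column asks for 300 × 100 = 30,000 rows, and targrows ends up as 30,000 unless some column or extended statistics object requests more. Raising a column's target with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS therefore directly raises the sample size.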

Once the number of rows to sample has been determined, the kernel code allocates a tuple array of that length and starts sampling data through the acquirefunc function pointer:

/*
 * Acquire the sample rows
 */
rows = (HeapTuple *) palloc(targrows * sizeof(HeapTuple));
pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE,
                             inh ? PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS_INH :
                             PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS);
if (inh)
    numrows = acquire_inherited_sample_rows(onerel, elevel,
                                            rows, targrows,
                                            &totalrows, &totaldeadrows);
else
    numrows = (*acquirefunc) (onerel, elevel,
                              rows, targrows,
                              &totalrows, &totaldeadrows);

This function pointer points to the acquire_sample_rows() function set up earlier in analyze_rel(). It samples the table's data with a two-stage scheme:

  • Stage 1: randomly select data blocks, up to the target number of sample rows
  • Stage 2: within each selected block, use the Vitter algorithm to randomly sample rows

The two stages run simultaneously. Once sampling finishes, the sampled tuples have all been placed into the tuple array. The array is then quicksorted by tuple position, and the sampled data is used to estimate the numbers of live and dead tuples in the whole table:

/*
 * acquire_sample_rows -- acquire a random sample of rows from the table
 *
 * Selected rows are returned in the caller-allocated array rows[], which
 * must have at least targrows entries.
 * The actual number of rows selected is returned as the function result.
 * We also estimate the total numbers of live and dead rows in the table,
 * and return them into *totalrows and *totaldeadrows, respectively.
 *
 * The returned list of tuples is in order by physical position in the table.
 * (We will rely on this later to derive correlation estimates.)
 *
 * As of May 2004 we use a new two-stage method:  Stage one selects up
 * to targrows random blocks (or all blocks, if there aren't so many).
 * Stage two scans these blocks and uses the Vitter algorithm to create
 * a random sample of targrows rows (or less, if there are less in the
 * sample of blocks).  The two stages are executed simultaneously: each
 * block is processed as soon as stage one returns its number and while
 * the rows are read stage two controls which ones are to be inserted
 * into the sample.
 *
 * Although every row has an equal chance of ending up in the final
 * sample, this sampling method is not perfect: not every possible
 * sample has an equal chance of being selected.  For large relations
 * the number of different blocks represented by the sample tends to be
 * too small.  We can live with that for now.  Improvements are welcome.
 *
 * An important property of this sampling method is that because we do
 * look at a statistically unbiased set of blocks, we should get
 * unbiased estimates of the average numbers of live and dead rows per
 * block.  The previous sampling method put too much credence in the row
 * density near the start of the table.
 */
static int
acquire_sample_rows(Relation onerel, int elevel,
                    HeapTuple *rows, int targrows,
                    double *totalrows, double *totaldeadrows)
{
    // ...
    /* Outer loop over blocks to sample */
    while (BlockSampler_HasMore(&bs))
    {
        bool        block_accepted;
        BlockNumber targblock = BlockSampler_Next(&bs);
        // ...
    }
    // ...
    /*
     * If we didn't find as many tuples as we wanted then we're done. No sort
     * is needed, since they're already in order.
     *
     * Otherwise we need to sort the collected tuples by position
     * (itempointer). It's not worth worrying about corner cases where the
     * tuples are already sorted.
     */
    if (numrows == targrows)
        qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);

    /*
     * Estimate total numbers of live and dead rows in relation, extrapolating
     * on the assumption that the average tuple density in pages we didn't
     * scan is the same as in the pages we did scan.  Since what we scanned is
     * a random sample of the pages in the relation, this should be a good
     * assumption.
     */
    if (bs.m > 0)
    {
        *totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
        *totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
    }
    else
    {
        *totalrows = 0.0;
        *totaldeadrows = 0.0;
    }
    // ...
}
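
To make the stage-two idea concrete, here is a minimal reservoir-sampling sketch in the classic "Algorithm R" form. It is not the kernel's implementation: acquire_sample_rows() uses Vitter's more efficient skip-based variant so that it does not have to draw a random number for every row, but the invariant is the same, namely that after t rows have been seen, each of them sits in the reservoir with probability targrows / t. The SampleRow type and reservoir_sample() name below are illustrative only.

#include <stdlib.h>

/* stand-in for HeapTuple; purely illustrative */
typedef struct SampleRow { int payload; } SampleRow;

/*
 * Keep a uniform random sample of up to targrows rows from a stream of
 * nrows rows.  Simplified sketch; the kernel's version also counts live
 * and dead rows and works block by block.
 */
static void
reservoir_sample(const SampleRow *stream, int nrows,
                 SampleRow *rows, int targrows)
{
    int t;

    for (t = 0; t < nrows; t++)
    {
        if (t < targrows)
        {
            /* the first targrows rows always enter the reservoir */
            rows[t] = stream[t];
        }
        else
        {
            /* afterwards, row t replaces a random slot with
             * probability targrows / (t + 1) */
            int k = rand() % (t + 1);

            if (k < targrows)
                rows[k] = stream[t];
        }
    }
}

The final extrapolation step is then simple arithmetic: if the bs.m sampled blocks contained, say, liverows = 24,000 live tuples and the table has totalblocks = 10,000 blocks, the estimate is totalrows = floor((24000 / bs.m) * 10000 + 0.5); with bs.m = 300 that gives 800,000 rows.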

Back in do_analyze_rel(): with the sample in hand, statistics are computed for each column to be analyzed, and then the pg_statistic system catalog is updated:

/*
 * Compute the statistics.  Temporary results during the calculations for
 * each column are stored in a child context.  The calc routines are
 * responsible to make sure that whatever they store into the VacAttrStats
 * structure is allocated in anl_context.
 */
if (numrows > 0)
{
    // ...
    for (i = 0; i < attr_cnt; i++)
    {
        VacAttrStats *stats = vacattrstats[i];
        AttributeOpts *aopt;

        stats->rows = rows;
        stats->tupDesc = onerel->rd_att;
        stats->compute_stats(stats,
                             std_fetch_func,
                             numrows,
                             totalrows);
        // ...
    }
    // ...
    /*
     * Emit the completed stats rows into pg_statistic, replacing any
     * previous statistics for the target columns.  (If there are stats in
     * pg_statistic for columns we didn't process, we leave them alone.)
     */
    update_attstats(RelationGetRelid(onerel), inh,
                    attr_cnt, vacattrstats);
    // ...
}

Obviously, for columns of different data types the computation function that compute_stats points to will not be the same. So let's look at where this function pointer gets assigned:

/*
 * std_typanalyze -- the default type-specific typanalyze function
 */
bool
std_typanalyze(VacAttrStats *stats)
{
    // ...
    /*
     * Determine which standard statistics algorithm to use
     */
    if (OidIsValid(eqopr) && OidIsValid(ltopr))
    {
        /* Seems to be a scalar datatype */
        stats->compute_stats = compute_scalar_stats;

        /*--------------------
         * The following choice of minrows is based on the paper
         * "Random sampling for histogram construction: how much is enough?"
         * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
         * Proceedings of ACM SIGMOD International Conference on Management
         * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
         * says that for table size n, histogram size k, maximum relative
         * error in bin size f, and error probability gamma, the minimum
         * random sample size is
         *      r = 4 * k * ln(2*n/gamma) / f^2
         * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
         *      r = 305.82 * k
         * Note that because of the log function, the dependence on n is
         * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
         * bin size error with probability 0.99.  So there's no real need to
         * scale for n, which is a good thing because we don't necessarily
         * know it at this point.
         *--------------------
         */
        stats->minrows = 300 * attr->attstattarget;
    }
    else if (OidIsValid(eqopr))
    {
        /* We can still recognize distinct values */
        stats->compute_stats = compute_distinct_stats;
        /* Might as well use the same minrows as above */
        stats->minrows = 300 * attr->attstattarget;
    }
    else
    {
        /* Can't do much but the trivial stuff */
        stats->compute_stats = compute_trivial_stats;
        /* Might as well use the same minrows as above */
        stats->minrows = 300 * attr->attstattarget;
    }
    // ...
}

This conditional can be read as follows:

  • If a column's data type supports the default = (eqopr: equals operator) and < (ltopr: less-than operator), the column looks like a scalar type and can be analyzed with compute_scalar_stats()
  • If the data type supports only the = operator, distinct-value statistics can still be computed with compute_distinct_stats()
  • Otherwise, only some trivial analysis can be done with compute_trivial_stats()

We could look at what each of these three analysis functions does, but I do not intend to walk through the logic of every one of them. The ideas behind them come from some very old statistics papers, old enough that the letters in the scanned PDFs are barely legible. The code itself is not particularly readable either, since it largely transcribes the formulas from those papers; without reading the papers, the meaning of the variables and formulas is hard to make sense of.
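
Still, the simplest of the three is easy to outline. The sketch below shows roughly the kind of bookkeeping compute_trivial_stats() performs, reduced to the essentials of NULL fraction and average width. It is an illustration, not the actual function body: the sketch_trivial_stats name is ours, and the handling of cstring-style variable-width types and the all-NULL column case is omitted.

/*
 * Rough outline of the "trivial" statistics: with no usable comparison
 * operators we can still measure the NULL fraction and the average
 * width of the sampled values.
 */
static void
sketch_trivial_stats(VacAttrStatsP stats,
                     AnalyzeAttrFetchFunc fetchfunc,
                     int samplerows,
                     double totalrows)
{
    int         i;
    int         null_cnt = 0;
    int         nonnull_cnt = 0;
    double      total_width = 0;
    bool        is_varlena = (!stats->attrtype->typbyval &&
                              stats->attrtype->typlen == -1);

    for (i = 0; i < samplerows; i++)
    {
        bool        isnull;
        Datum       value = fetchfunc(stats, i, &isnull);

        if (isnull)
        {
            null_cnt++;
            continue;
        }
        nonnull_cnt++;

        /* variable-width values must be measured; fixed-width types
         * have a known length */
        if (is_varlena)
            total_width += VARSIZE_ANY(DatumGetPointer(value));
    }

    if (nonnull_cnt > 0)
    {
        stats->stats_valid = true;
        stats->stanullfrac = (double) null_cnt / (double) samplerows;
        stats->stawidth = is_varlena ?
            (total_width / (double) nonnull_cnt) :
            stats->attrtype->typlen;
        stats->stadistinct = 0.0;   /* number of distinct values unknown */
    }
}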

