PolarDB for PostgreSQL (hereafter PolarDB-PG) is an enterprise-grade database product developed in-house by Alibaba Cloud. It adopts a compute-storage separation architecture and is compatible with PostgreSQL and Oracle. Both its storage and compute capacity can scale horizontally, and it offers enterprise features such as high reliability, high availability, and elastic scaling. PolarDB-PG also provides large-scale parallel computing to handle mixed OLTP and OLAP workloads, along with multi-model capabilities such as spatio-temporal, vector, search, and graph support to meet enterprises' ever-evolving data-processing needs.
Kernel Execution Flow
Now that we know what information `pg_statistic` ultimately needs to store, let's look at how the kernel collects and computes it, starting from the PostgreSQL executor code. For a utility command like `ANALYZE`, the executor routes each kind of statement through the switch-case in `standard_ProcessUtility()` to the function that implements it.
```c
/*
 * standard_ProcessUtility itself deals only with utility commands for
 * which we do not provide event trigger support. Commands that do have
 * such support are passed down to ProcessUtilitySlow, which contains the
 * necessary infrastructure for such triggers.
 *
 * This division is not just for performance: it's critical that the
 * event trigger code not be invoked when doing START TRANSACTION for
 * example, because we might need to refresh the event trigger cache,
 * which requires being in a valid transaction.
 */
void
standard_ProcessUtility(PlannedStmt *pstmt,
                        const char *queryString,
                        bool readOnlyTree,
                        ProcessUtilityContext context,
                        ParamListInfo params,
                        QueryEnvironment *queryEnv,
                        DestReceiver *dest,
                        QueryCompletion *qc)
{
    // ...
    switch (nodeTag(parsetree))
    {
        // ...
        case T_VacuumStmt:
            ExecVacuum(pstate, (VacuumStmt *) parsetree, isTopLevel);
            break;
        // ...
    }
    // ...
}
```
The handling of `ANALYZE` enters through the same path as `VACUUM`: the `ExecVacuum()` function.
```c
/*
 * Primary entry point for manual VACUUM and ANALYZE commands
 *
 * This is mainly a preparation wrapper for the real operations that will
 * happen in vacuum().
 */
void
ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
{
    // ...
    /* Now go through the common routine */
    vacuum(vacstmt->rels, &params, NULL, isTopLevel);
}
```
After parsing a pile of options, we arrive in the `vacuum()` function. Here the kernel first works out exactly which tables to analyze, because `ANALYZE` can be invoked to:

- analyze all tables in the current database
- analyze a few specific tables
- analyze a few specific columns of one table
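In SQL, these three forms look like the following (table and column names are hypothetical): `ANALYZE;` analyzes every table in the database, `ANALYZE users;` a single table, and `ANALYZE users (name, age);` specific columns of a table.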
Once the target tables are determined, each of them is passed in turn to the `analyze_rel()` function:
```c
if (params->options & VACOPT_ANALYZE)
{
    // ...
    analyze_rel(vrel->oid, vrel->relation, params,
                vrel->va_cols, in_outer_xact, vac_strategy);
    // ...
}
```
Inside `analyze_rel()`, the kernel first takes a `ShareUpdateExclusiveLock` on the table to be analyzed, preventing two `ANALYZE` operations from running on it concurrently (this lock mode conflicts with itself but not with ordinary reads and writes, so `ANALYZE` does not block normal DML). It then decides how to proceed based on the kind of relation being analyzed; analyzing a foreign table, for example, should simply invoke the `ANALYZE` support provided by the table's FDW routine. Next, the table is passed into the `do_analyze_rel()` function.
```c
/*
 * analyze_rel() -- analyze one relation
 *
 * relid identifies the relation to analyze. If relation is supplied, use
 * the name therein for reporting any failure to open/lock the rel; do not
 * use it once we've successfully opened the rel, since it might be stale.
 */
void
analyze_rel(Oid relid, RangeVar *relation,
            VacuumParams *params, List *va_cols, bool in_outer_xact,
            BufferAccessStrategy bstrategy)
{
    // ...
    /*
     * Do the normal non-recursive ANALYZE. We can skip this for partitioned
     * tables, which don't contain any rows.
     */
    if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
        do_analyze_rel(onerel, params, va_cols, acquirefunc,
                       relpages, false, in_outer_xact, elevel);
    // ...
}
```
Inside `do_analyze_rel()`, the kernel further narrows down which columns of the table to analyze: the user may have asked for only a few specific columns, and frequently accessed columns are the ones most worth analyzing. It also opens all of the table's indexes to see whether they provide any additional columns that can be analyzed.
To obtain statistics for each column, we obviously have to read the column's data from disk and run computations over it. This raises a key question: how many rows should we scan? In theory, analyzing as much data as possible, ideally all of it, yields the most accurate statistics; but for a very large table we cannot fit all the data in memory, and the time the analysis would take is unacceptable. So the user can cap the number of rows to sample, striking a compromise between runtime cost and statistical accuracy:
```c
/*
 * Determine how many rows we need to sample, using the worst case from
 * all analyzable columns. We use a lower bound of 100 rows to avoid
 * possible overflow in Vitter's algorithm. (Note: that will also be the
 * target in the corner case where there are no analyzable columns.)
 */
targrows = 100;
for (i = 0; i < attr_cnt; i++)
{
    if (targrows < vacattrstats[i]->minrows)
        targrows = vacattrstats[i]->minrows;
}
for (ind = 0; ind < nindexes; ind++)
{
    AnlIndexData *thisdata = &indexdata[ind];

    for (i = 0; i < thisdata->attr_cnt; i++)
    {
        if (targrows < thisdata->vacattrstats[i]->minrows)
            targrows = thisdata->vacattrstats[i]->minrows;
    }
}

/*
 * Look at extended statistics objects too, as those may define custom
 * statistics target. So we may need to sample more rows and then build
 * the statistics with enough detail.
 */
minrows = ComputeExtStatisticsRows(onerel, attr_cnt, vacattrstats);
if (targrows < minrows)
    targrows = minrows;
```
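As we will see in `std_typanalyze()` below, each column's `minrows` is 300 times its statistics target. With the default `default_statistics_target` of 100, that works out to 300 × 100 = 30,000 sampled rows. The target can be raised per column, at the cost of a longer `ANALYZE` and larger statistics, e.g. with `ALTER TABLE users ALTER COLUMN name SET STATISTICS 500;` (table and column names are hypothetical).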
After the sample size has been determined, the kernel allocates a tuple array of that length and begins sampling data through the `acquirefunc` function pointer:
```c
/*
 * Acquire the sample rows
 */
rows = (HeapTuple *) palloc(targrows * sizeof(HeapTuple));
pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE,
                             inh ? PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS_INH :
                             PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS);
if (inh)
    numrows = acquire_inherited_sample_rows(onerel, elevel,
                                            rows, targrows,
                                            &totalrows, &totaldeadrows);
else
    numrows = (*acquirefunc) (onerel, elevel,
                              rows, targrows,
                              &totalrows, &totaldeadrows);
```
This function pointer refers to the `acquire_sample_rows()` function set up back in `analyze_rel()`. It samples the table's data in two stages (a simplified sketch follows the excerpt below):

- Stage 1: randomly select the data blocks to read, up to the target number of sample rows
- Stage 2: within each selected block, randomly sample rows using Vitter's algorithm

The two stages run simultaneously. Once sampling completes, the sampled tuples have already been placed into the tuple array. The array is then quicksorted by tuple position, and the sampled data is used to estimate the numbers of live and dead tuples in the whole table:
```c
/*
 * acquire_sample_rows -- acquire a random sample of rows from the table
 *
 * Selected rows are returned in the caller-allocated array rows[], which
 * must have at least targrows entries.
 * The actual number of rows selected is returned as the function result.
 * We also estimate the total numbers of live and dead rows in the table,
 * and return them into *totalrows and *totaldeadrows, respectively.
 *
 * The returned list of tuples is in order by physical position in the table.
 * (We will rely on this later to derive correlation estimates.)
 *
 * As of May 2004 we use a new two-stage method: Stage one selects up
 * to targrows random blocks (or all blocks, if there aren't so many).
 * Stage two scans these blocks and uses the Vitter algorithm to create
 * a random sample of targrows rows (or less, if there are less in the
 * sample of blocks). The two stages are executed simultaneously: each
 * block is processed as soon as stage one returns its number and while
 * the rows are read stage two controls which ones are to be inserted
 * into the sample.
 *
 * Although every row has an equal chance of ending up in the final
 * sample, this sampling method is not perfect: not every possible
 * sample has an equal chance of being selected. For large relations
 * the number of different blocks represented by the sample tends to be
 * too small. We can live with that for now. Improvements are welcome.
 *
 * An important property of this sampling method is that because we do
 * look at a statistically unbiased set of blocks, we should get
 * unbiased estimates of the average numbers of live and dead rows per
 * block. The previous sampling method put too much credence in the row
 * density near the start of the table.
 */
static int
acquire_sample_rows(Relation onerel, int elevel,
                    HeapTuple *rows, int targrows,
                    double *totalrows, double *totaldeadrows)
{
    // ...
    /* Outer loop over blocks to sample */
    while (BlockSampler_HasMore(&bs))
    {
        bool        block_accepted;
        BlockNumber targblock = BlockSampler_Next(&bs);

        // ...
    }
    // ...

    /*
     * If we didn't find as many tuples as we wanted then we're done. No sort
     * is needed, since they're already in order.
     *
     * Otherwise we need to sort the collected tuples by position
     * (itempointer). It's not worth worrying about corner cases where the
     * tuples are already sorted.
     */
    if (numrows == targrows)
        qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);

    /*
     * Estimate total numbers of live and dead rows in relation, extrapolating
     * on the assumption that the average tuple density in pages we didn't
     * scan is the same as in the pages we did scan. Since what we scanned is
     * a random sample of the pages in the relation, this should be a good
     * assumption.
     */
    if (bs.m > 0)
    {
        *totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
        *totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
    }
    else
    {
        *totalrows = 0.0;
        *totaldeadrows = 0.0;
    }
    // ...
}
```
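To make stage two concrete, below is a minimal, self-contained sketch of reservoir sampling in the style of Knuth's Algorithm R. This is an illustration of the invariant only, not the actual PostgreSQL code: the real implementation uses Vitter's Algorithm Z, which computes how many rows to skip rather than drawing a random number for every row, but both maintain the same property that after `t` rows have been seen, every row is in the sample with probability `targrows / t`.

```c
#include <stdio.h>
#include <stdlib.h>

#define TARGROWS 5              /* sample size, i.e. PostgreSQL's targrows */

int
main(void)
{
    int reservoir[TARGROWS];
    int nseen = 0;

    /* Stream 100 "rows"; each ends up sampled with probability 5/100. */
    for (int row = 0; row < 100; row++)
    {
        nseen++;
        if (nseen <= TARGROWS)
        {
            /* The first TARGROWS rows fill the reservoir directly. */
            reservoir[nseen - 1] = row;
        }
        else
        {
            /* Accept the new row with probability TARGROWS / nseen, */
            /* evicting a uniformly random current member if accepted. */
            int k = rand() % nseen;

            if (k < TARGROWS)
                reservoir[k] = row;
        }
    }

    for (int i = 0; i < TARGROWS; i++)
        printf("%d ", reservoir[i]);
    printf("\n");
    return 0;
}
```

The extrapolation at the end of `acquire_sample_rows()` is then straightforward: if the `bs.m` scanned blocks contained `liverows` live tuples, a table of `totalblocks` blocks is estimated to hold about `liverows / bs.m * totalblocks` live tuples, rounded to the nearest integer.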
Back in `do_analyze_rel()`: with the sampled data in hand, the kernel computes statistics for each column to be analyzed, then updates the `pg_statistic` system catalog:
```c
/*
 * Compute the statistics. Temporary results during the calculations for
 * each column are stored in a child context. The calc routines are
 * responsible to make sure that whatever they store into the VacAttrStats
 * structure is allocated in anl_context.
 */
if (numrows > 0)
{
    // ...
    for (i = 0; i < attr_cnt; i++)
    {
        VacAttrStats *stats = vacattrstats[i];
        AttributeOpts *aopt;

        stats->rows = rows;
        stats->tupDesc = onerel->rd_att;
        stats->compute_stats(stats,
                             std_fetch_func,
                             numrows,
                             totalrows);
        // ...
    }
    // ...

    /*
     * Emit the completed stats rows into pg_statistic, replacing any
     * previous statistics for the target columns. (If there are stats in
     * pg_statistic for columns we didn't process, we leave them alone.)
     */
    update_attstats(RelationGetRelid(onerel), inh,
                    attr_cnt, vacattrstats);
    // ...
}
```
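Once `update_attstats()` has run, the fresh statistics can be inspected through the `pg_stats` view, a human-readable wrapper around `pg_statistic`, for example with `SELECT attname, null_frac, n_distinct, most_common_vals FROM pg_stats WHERE tablename = 'users';` (table name hypothetical).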
Naturally, columns of different data types need different computations, so their `compute_stats` function pointers refer to different functions. Let's look at where this pointer gets assigned:
```c
/*
 * std_typanalyze -- the default type-specific typanalyze function
 */
bool
std_typanalyze(VacAttrStats *stats)
{
    // ...
    /*
     * Determine which standard statistics algorithm to use
     */
    if (OidIsValid(eqopr) && OidIsValid(ltopr))
    {
        /* Seems to be a scalar datatype */
        stats->compute_stats = compute_scalar_stats;

        /*--------------------
         * The following choice of minrows is based on the paper
         * "Random sampling for histogram construction: how much is enough?"
         * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
         * Proceedings of ACM SIGMOD International Conference on Management
         * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
         * says that for table size n, histogram size k, maximum relative
         * error in bin size f, and error probability gamma, the minimum
         * random sample size is
         *      r = 4 * k * ln(2*n/gamma) / f^2
         * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
         *      r = 305.82 * k
         * Note that because of the log function, the dependence on n is
         * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
         * bin size error with probability 0.99.  So there's no real need to
         * scale for n, which is a good thing because we don't necessarily
         * know it at this point.
         *--------------------
         */
        stats->minrows = 300 * attr->attstattarget;
    }
    else if (OidIsValid(eqopr))
    {
        /* We can still recognize distinct values */
        stats->compute_stats = compute_distinct_stats;
        /* Might as well use the same minrows as above */
        stats->minrows = 300 * attr->attstattarget;
    }
    else
    {
        /* Can't do much but the trivial stuff */
        stats->compute_stats = compute_trivial_stats;
        /* Might as well use the same minrows as above */
        stats->minrows = 300 * attr->attstattarget;
    }
    // ...
}
```
This conditional can be read as follows:

- If the column's data type supports the default `=` (`eqopr`: equals operator) and `<` (`ltopr`: less-than operator), the column appears to be an ordered scalar type and can be analyzed with `compute_scalar_stats()`; its `minrows` of `300 * attstattarget` follows from the Chaudhuri, Motwani, and Narasayya sample-size bound quoted in the comment above
- If the data type supports only the `=` operator, distinct values can still be recognized, so `compute_distinct_stats()` can be used for distinct-value statistics
- Failing both, the column can only get some simple analysis via `compute_trivial_stats()` (a sketch of what *trivial* means follows this list)
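What does *trivial* actually mean? Essentially just the NULL fraction (`stanullfrac`) and the average datum width (`stawidth`), which can be computed for any type at all. The sketch below is an illustrative reconstruction under that assumption, using simplified stand-in types; it is not the actual `compute_trivial_stats()` code:

```c
#include <stdio.h>

/* Simplified stand-in for one sampled column value (hypothetical type) */
typedef struct SampleValue
{
    int isnull;                 /* is this value NULL? */
    int width;                  /* datum width in bytes */
} SampleValue;

/*
 * Compute the two statistics that need no operators at all:
 * the fraction of NULLs and the average non-NULL datum width.
 */
static void
compute_trivial_sketch(const SampleValue *vals, int numrows,
                       double *nullfrac, double *avgwidth)
{
    int     nonnull = 0;
    long    total_width = 0;

    for (int i = 0; i < numrows; i++)
    {
        if (vals[i].isnull)
            continue;
        nonnull++;
        total_width += vals[i].width;
    }

    *nullfrac = (numrows > 0) ? (double) (numrows - nonnull) / numrows : 0.0;
    *avgwidth = (nonnull > 0) ? (double) total_width / nonnull : 0.0;
}

int
main(void)
{
    SampleValue sample[4] = {{0, 8}, {1, 0}, {0, 16}, {0, 12}};
    double      nullfrac, avgwidth;

    compute_trivial_sketch(sample, 4, &nullfrac, &avgwidth);
    printf("nullfrac = %.2f, avgwidth = %.1f\n", nullfrac, avgwidth);
    return 0;
}
```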
We could go look at what each of these three analysis functions does, but I am not going to dissect their internals here. The ideas are based on some truly ancient statistics papers, so old that the letters in the scanned PDFs are barely legible. The code itself is not especially readable either: it largely transcribes the papers' formulas directly, so the variables and equations make little sense without the papers at hand.