
Autovacuum in PolarDB for PostgreSQL

PolarDB 2025-01-16

About PolarDB for PostgreSQL

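PolarDB for PostgreSQL is a cloud-native relational database developed in-house by Alibaba Cloud. It is 100% compatible with PostgreSQL and highly compatible with Oracle syntax. It uses a Shared-Storage architecture that separates compute from storage, and offers enterprise-grade features such as extreme elasticity, millisecond-level latency, HTAP, Ganos full-spatial data processing, high reliability, high availability, and elastic scaling. PolarDB for PostgreSQL also supports large-scale parallel computing and can handle mixed OLTP and OLAP workloads.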

Feature overview

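In native PostgreSQL, autovacuum consists of two kinds of processes: the autovacuum launcher and the autovacuum workers. The autovacuum launcher is a long-running process; it is forked as a child of the postmaster in the entry function StartAutoVacLauncher. Autovacuum workers are short-lived: the launcher decides which database needs vacuuming and then asks the postmaster to fork an autovacuum worker to process it. The vacuum part of autovacuum mainly reclaims the space occupied by dead tuples and freezes old transaction IDs.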

How it works

autovacuum launcher

Key data structures

typedef struct
{

 sig_atomic_t av_signal[AutoVacNumSignals];
 pid_t  av_launcherpid;
 dlist_head av_freeWorkers;
 dlist_head av_runningWorkers;
 WorkerInfo av_startingWorker;
 AutoVacuumWorkItem av_workItems[NUM_WORKITEMS];
} AutoVacuumShmemStruct;


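Here, av_freeWorkers tracks how many more autovacuum workers can still be launched; its initial size is determined by the GUC parameter autovacuum_max_workers. av_runningWorkers tracks the workers that are currently running autovacuum, and av_startingWorker tracks the worker that is in the middle of starting up.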
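The bookkeeping described above can be pictured as a fixed pool of worker slots that move from "free" to "starting" to "running" and back. The following is only a simplified model and not the PostgreSQL implementation: WorkerPool and the pool_* functions are hypothetical names, and plain counters stand in for the dlist structures.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the worker bookkeeping. */
typedef struct WorkerPool
{
    int  free_workers;     /* models the length of av_freeWorkers */
    int  running_workers;  /* models the length of av_runningWorkers */
    bool starting;         /* models av_startingWorker being set */
} WorkerPool;

/* All slots start out free; the count comes from autovacuum_max_workers. */
void
pool_init(WorkerPool *pool, int autovacuum_max_workers)
{
    if (autovacuum_max_workers < 0)
        autovacuum_max_workers = 0;
    pool->free_workers = autovacuum_max_workers;
    pool->running_workers = 0;
    pool->starting = false;
}

/* Launcher side: reserve a slot only if one is free and none is starting. */
bool
pool_begin_start(WorkerPool *pool)
{
    if (pool->free_workers == 0 || pool->starting)
        return false;
    pool->free_workers--;
    pool->starting = true;
    return true;
}

/* Worker side: once started, the worker counts as running. */
void
pool_finish_start(WorkerPool *pool)
{
    if (!pool->starting)
        return;
    pool->starting = false;
    pool->running_workers++;
}

/* Worker exit: return the slot to the free pool. */
void
pool_release(WorkerPool *pool)
{
    if (pool->running_workers == 0)
        return;
    pool->running_workers--;
    pool->free_workers++;
}
```

With autovacuum_max_workers set to 3, a fourth pool_begin_start call fails until one of the running workers calls pool_release, and a second call also fails while an earlier worker is still starting.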

typedef struct avl_dbase
{

 Oid   adl_datid;
 TimestampTz adl_next_worker;
 int   adl_score;
 dlist_node adl_node;
} avl_dbase;


Key steps:

  • Install the signal handlers:
  /*
   * Set up signal handlers.  We operate on databases much like a regular
   * backend, so we use the same signal handling.  See equivalent code in
   * tcop/postgres.c.
   */

  pqsignal(SIGHUP, av_sighup_handler);
  pqsignal(SIGINT, StatementCancelHandler);
  pqsignal(SIGTERM, avl_sigterm_handler);

  pqsignal(SIGQUIT, quickdie);
  InitializeTimeouts();  /* establishes SIGALRM handler */

  pqsignal(SIGPIPE, SIG_IGN);
  pqsignal(SIGUSR1, procsignal_sigusr1_handler);
  pqsignal(SIGUSR2, avl_sigusr2_handler);
  pqsignal(SIGFPE, FloatExceptionHandler);
  pqsignal(SIGCHLD, SIG_DFL);

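  • rebuild_database_list builds the database list. Each database is represented by an avl_dbase struct. The function initializes each database's adl_score and records its adl_next_worker time, which is compared against autovacuum_naptime to decide whether the database is due for autovacuum. The entries are kept in the DatabaseList list, and autovacuum processes databases starting from the tail of that list: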
/*
 * move the elements from the array into the dllist, setting the
 * next_worker while walking the array
 */

for (i = 0; i < nelems; i++)
{
    avl_dbase  *db = &(dbary[i]);

    current_time = TimestampTzPlusMilliseconds(current_time,
                                               millis_increment);
    db->adl_next_worker = current_time;

    /* later elements should go closer to the head of the list */
    dlist_push_head(&DatabaseList, &db->adl_node);
}

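To see what the loop above does to adl_next_worker, here is a minimal sketch of the staggering. It assumes that the increment is autovacuum_naptime spread evenly over the number of databases; DbSlot and stagger_next_worker are hypothetical names, and times are plain milliseconds rather than TimestampTz.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified entry: times are milliseconds, not TimestampTz. */
typedef struct DbSlot
{
    uint32_t datid;           /* models adl_datid */
    int64_t  next_worker_ms;  /* models adl_next_worker */
} DbSlot;

/*
 * Assign staggered start times so that nelems databases are spread over
 * one naptime. Assumption of this sketch: increment = naptime / nelems.
 * Returns the increment that was used, or 0 for an empty array.
 */
int64_t
stagger_next_worker(DbSlot *dbs, size_t nelems,
                    int64_t now_ms, int naptime_sec)
{
    int64_t increment;
    int64_t t = now_ms;
    size_t  i;

    if (nelems == 0)
        return 0;

    increment = ((int64_t) naptime_sec * 1000) / (int64_t) nelems;
    for (i = 0; i < nelems; i++)
    {
        t += increment;
        dbs[i].next_worker_ms = t;
    }
    return increment;
}
```

With three databases and a 60-second naptime, the entries are scheduled 20 seconds apart. Because the original loop pushes each entry at the head of the list, the entry with the earliest time ends up at the tail, which is where the launcher starts.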
  • Loop, sleeping until the timeout expires or a signal wakes the launcher, then check whether any database needs autovacuum:
/*
 * This loop is a bit different from the normal use of WaitLatch,
 * because we'd like to sleep before the first launch of a child
 * process.  So it's WaitLatch, then ResetLatch, then check for
 * wakening conditions.
 */


//...

/*
 * Wait until naptime expires or we get some type of signal (all the
 * signal handlers will wake us by calling SetLatch).
 */

rc = WaitLatch(MyLatch,
               WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
               (nap.tv_sec * 1000L) + (nap.tv_usec / 1000L),
               WAIT_EVENT_AUTOVACUUM_MAIN);

//...

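The timeout argument passed to WaitLatch is in milliseconds, while nap is a struct timeval holding seconds and microseconds. A minimal sketch of that conversion, using a hypothetical helper name timeval_to_ms:

```c
#include <sys/time.h>

/*
 * Convert a struct timeval into whole milliseconds. Seconds are scaled
 * up by 1000 and microseconds are scaled down by 1000, truncating any
 * sub-millisecond remainder.
 */
long
timeval_to_ms(struct timeval nap)
{
    return (long) (nap.tv_sec * 1000L) + (long) (nap.tv_usec / 1000L);
}
```

For example, 2 seconds plus 500000 microseconds becomes 2500 milliseconds, and a nap shorter than one millisecond truncates to 0.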
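  • Check two conditions: 1) whether av_freeWorkers still has a free slot for another autovacuum worker; 2) whether av_startingWorker shows a worker that is still starting up;

  • If a slot is free and no worker is still starting, call launch_worker; otherwise keep waiting in the loop;

  • do_start_worker first checks whether any database's frozen XID or MultiXact information meets the wraparound trigger condition; if none does, it falls back to adl_next_worker and last_autovac_time to choose a database: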

//...
/* Check to see if this one is at risk of wraparound */
if (TransactionIdPrecedes(tmp->adw_frozenxid, xidForceLimit))
{
    if (avdb == NULL ||
        TransactionIdPrecedes(tmp->adw_frozenxid,
                              avdb->adw_frozenxid))
        avdb = tmp;
    for_xid_wrap = true;
    continue;
}
else if (for_xid_wrap)
    continue;   /* ignore not-at-risk DBs */
else if (MultiXactIdPrecedes(tmp->adw_minmulti, multiForceLimit))
{
    if (avdb == NULL ||
        MultiXactIdPrecedes(tmp->adw_minmulti, avdb->adw_minmulti))
        avdb = tmp;
    for_multi_wrap = true;
    continue;
}
else if (for_multi_wrap)
    continue;   /* ignore not-at-risk DBs */

//...

dlist_reverse_foreach(iter, &DatabaseList)
{
    avl_dbase  *dbp = dlist_container(avl_dbase, adl_node, iter.cur);

    if (dbp->adl_datid == tmp->adw_datid)
    {
        /*
         * Skip this database if its next_worker value falls between
         * the current time and the current time plus naptime.
         */

        if (!TimestampDifferenceExceeds(dbp->adl_next_worker,
                                        current_time, 0) &&
            !TimestampDifferenceExceeds(current_time,
                                        dbp->adl_next_worker,
                                        autovacuum_naptime * 1000))
            skipit = true;

        break;
    }
}
//...

/*
 * Remember the db with oldest autovac time.  (If we are here, both
 * tmp->entry and db->entry must be non-null.)
 */

if (avdb == NULL ||
    tmp->adw_entry->last_autovac_time < avdb->adw_entry->last_autovac_time)
    avdb = tmp;
}

  • Finally, signal the postmaster to start an autovacuum worker process.
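The selection order in do_start_worker can be condensed into a small standalone sketch. It is a simplification under stated assumptions: DbCandidate and choose_database are hypothetical names, the MultiXact branch is left out, and the adl_next_worker window check is reduced to a precomputed recently_scheduled flag. The XID comparison is done modulo 2^32, which is how comparisons between normal transaction IDs work.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;

/* Circular comparison for normal XIDs: true if a is logically older. */
bool
xid_precedes(Xid a, Xid b)
{
    return (int32_t) (a - b) < 0;
}

/* Hypothetical, simplified per-database candidate. */
typedef struct DbCandidate
{
    Xid     frozenxid;           /* models adw_frozenxid */
    int64_t last_autovac_ms;     /* models last_autovac_time */
    bool    recently_scheduled;  /* models the adl_next_worker skip test */
} DbCandidate;

/*
 * Return the index of the database to vacuum next, or -1 if none
 * qualifies. Databases at wraparound risk win, oldest frozenxid first.
 * Otherwise pick the least recently autovacuumed database that was not
 * scheduled within the current naptime window.
 */
int
choose_database(const DbCandidate *dbs, size_t n, Xid xid_force_limit)
{
    int    best = -1;
    bool   for_xid_wrap = false;
    size_t i;

    for (i = 0; i < n; i++)
    {
        if (xid_precedes(dbs[i].frozenxid, xid_force_limit))
        {
            if (!for_xid_wrap ||
                xid_precedes(dbs[i].frozenxid, dbs[best].frozenxid))
                best = (int) i;
            for_xid_wrap = true;
            continue;
        }
        if (for_xid_wrap || dbs[i].recently_scheduled)
            continue;   /* ignore not-at-risk or just-scheduled DBs */
        if (best < 0 ||
            dbs[i].last_autovac_ms < dbs[best].last_autovac_ms)
            best = (int) i;
    }
    return best;
}
```

Once one at-risk database has been seen, every database that is not at risk is ignored, so wraparound protection always takes priority over the regular schedule.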

autovacuum worker

Key steps

autovacuum (excluding analyze):

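  • For each table in the target database, fetch its statistics from pgstat; relation_needs_vacanalyze then uses the freeze-related parameters, together with those statistics, to decide whether the table needs a vacuum: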
//...

/* Fetch reloptions and the pgstat entry for this table */
relopts = extract_autovac_opts(tuple, pg_class_desc);
tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
                                     shared, dbentry);

/* Check if it needs vacuum or analyze */
relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                          effective_multixact_freeze_max_age,
                          &dovacuum, &doanalyze, &wraparound);

//...

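The decision made by relation_needs_vacanalyze can be approximated as follows. This is a sketch, not the actual function: TableStats and needs_vacuum are hypothetical names, the threshold formula follows the documented rule (base threshold plus scale factor times the row count), and details such as special XID values and per-table option overrides are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;

/* Circular comparison for normal XIDs: true if a is logically older. */
bool
xid_precedes(Xid a, Xid b)
{
    return (int32_t) (a - b) < 0;
}

/* Hypothetical inputs for one table; field names are illustrative. */
typedef struct TableStats
{
    double reltuples;      /* estimated row count of the table */
    double n_dead_tuples;  /* dead rows reported by the statistics */
    Xid    relfrozenxid;   /* oldest unfrozen XID in the table */
} TableStats;

/*
 * Decide whether a table should be vacuumed. *wraparound is set when the
 * vacuum is forced because relfrozenxid is older than
 * recent_xid - freeze_max_age.
 */
bool
needs_vacuum(const TableStats *t,
             int vac_base_thresh, double vac_scale_factor,
             Xid recent_xid, uint32_t freeze_max_age,
             bool *wraparound)
{
    Xid    xid_force_limit = recent_xid - freeze_max_age;
    double vacthresh;

    /* Forced vacuum: the table's frozen XID is older than the limit. */
    *wraparound = xid_precedes(t->relfrozenxid, xid_force_limit);
    if (*wraparound)
        return true;

    /* Regular vacuum: dead tuples exceed base + scale * row count. */
    vacthresh = vac_base_thresh + vac_scale_factor * t->reltuples;
    return t->n_dead_tuples > vacthresh;
}
```

With a base threshold of 50 and a scale factor of 0.2, a table of 10000 rows is vacuumed once it has more than 2050 dead tuples, unless the wraparound check forces a vacuum earlier.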
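  • When vacuuming a table, the worker takes a ShareUpdateExclusiveLock on it, which does not conflict with the RowShareLock/RowExclusiveLock taken by ordinary read and write operations. vacuum_set_xid_limits computes OldestXmin and the freeze-related limits: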
/*
 * Open the relation and get the appropriate lock on it.
 *
 * There's a race condition here: the rel may have gone away since the
 * last time we saw it.  If so, we don't need to vacuum it.
 *
 * If we've been asked not to wait for the relation lock, acquire it first
 * in non-blocking mode, before calling try_relation_open().
 */

if (!(options & VACOPT_NOWAIT))
    onerel = try_relation_open(relid, lmode);
else if (ConditionalLockRelationOid(relid, lmode))
    onerel = try_relation_open(relid, NoLock);
else
{
    onerel = NULL;
    rel_lock = false;
}

//...

vacuum_set_xid_limits(onerel,
                      params->freeze_min_age,
                      params->freeze_table_age,
                      params->multixact_freeze_min_age,
                      params->multixact_freeze_table_age,
                      &OldestXmin, &FreezeLimit, 
                      &xidFullScanLimit,
                      &MultiXactCutoff, &mxactFullScanLimit);


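Why ShareUpdateExclusiveLock leaves ordinary reads and writes alone can be checked against the table-level lock conflict matrix. The sketch below encodes that matrix as bit masks, following the conflict table in the PostgreSQL documentation; LOCKBIT, lock_conflicts and locks_conflict are names chosen for this example.

```c
#include <stdbool.h>

/* Table-level lock modes, weakest to strongest, numbered from 1. */
typedef enum LockMode
{
    AccessShareLock = 1,
    RowShareLock,
    RowExclusiveLock,
    ShareUpdateExclusiveLock,
    ShareLock,
    ShareRowExclusiveLock,
    ExclusiveLock,
    AccessExclusiveLock
} LockMode;

#define LOCKBIT(mode) (1 << (mode))

/* For each held mode, the set of requested modes it conflicts with. */
static const int lock_conflicts[] = {
    0,  /* unused: lock modes start at 1 */
    /* AccessShareLock */
    LOCKBIT(AccessExclusiveLock),
    /* RowShareLock */
    LOCKBIT(ExclusiveLock) | LOCKBIT(AccessExclusiveLock),
    /* RowExclusiveLock */
    LOCKBIT(ShareLock) | LOCKBIT(ShareRowExclusiveLock) |
    LOCKBIT(ExclusiveLock) | LOCKBIT(AccessExclusiveLock),
    /* ShareUpdateExclusiveLock */
    LOCKBIT(ShareUpdateExclusiveLock) | LOCKBIT(ShareLock) |
    LOCKBIT(ShareRowExclusiveLock) | LOCKBIT(ExclusiveLock) |
    LOCKBIT(AccessExclusiveLock),
    /* ShareLock */
    LOCKBIT(RowExclusiveLock) | LOCKBIT(ShareUpdateExclusiveLock) |
    LOCKBIT(ShareRowExclusiveLock) | LOCKBIT(ExclusiveLock) |
    LOCKBIT(AccessExclusiveLock),
    /* ShareRowExclusiveLock */
    LOCKBIT(RowExclusiveLock) | LOCKBIT(ShareUpdateExclusiveLock) |
    LOCKBIT(ShareLock) | LOCKBIT(ShareRowExclusiveLock) |
    LOCKBIT(ExclusiveLock) | LOCKBIT(AccessExclusiveLock),
    /* ExclusiveLock */
    LOCKBIT(RowShareLock) | LOCKBIT(RowExclusiveLock) |
    LOCKBIT(ShareUpdateExclusiveLock) | LOCKBIT(ShareLock) |
    LOCKBIT(ShareRowExclusiveLock) | LOCKBIT(ExclusiveLock) |
    LOCKBIT(AccessExclusiveLock),
    /* AccessExclusiveLock */
    LOCKBIT(AccessShareLock) | LOCKBIT(RowShareLock) |
    LOCKBIT(RowExclusiveLock) | LOCKBIT(ShareUpdateExclusiveLock) |
    LOCKBIT(ShareLock) | LOCKBIT(ShareRowExclusiveLock) |
    LOCKBIT(ExclusiveLock) | LOCKBIT(AccessExclusiveLock)
};

/* True if a lock in mode 'requested' must wait for a holder of 'held'. */
bool
locks_conflict(LockMode held, LockMode requested)
{
    return (lock_conflicts[held] & LOCKBIT(requested)) != 0;
}
```

The matrix also shows that ShareUpdateExclusiveLock conflicts with itself, which is why two vacuums cannot run on the same table at once, and that AccessExclusiveLock conflicts with every mode, including plain reads.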
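  • lazy_scan_heap uses the visibility map (VM) to find blocks that can be skipped, then scans the remaining pages in a loop. Each page is pruned: dead tuples are removed and the fragmented free space on the page is compacted. It walks every tuple on the page and, where necessary, freezes the transaction IDs of old tuples and removes the index tuples that point to dead tuples. Finally it decides whether the pages at the end of the table should be truncated; truncation acquires an AccessExclusiveLock: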
if ((options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
{
    while (next_unskippable_block < nblocks)
    {
        uint8  vmstatus;

        vmstatus = visibilitymap_get_status(onerel, next_unskippable_block,
                                            &vmbuffer);
        if (aggressive)
        {
            if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
                break;
        }
        else
        {
            if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
                break;
        }
        vacuum_delay_point();
        next_unskippable_block++;
    }
//...

/*
 * Prune all HOT-update chains in this page.
 *
 * We count tuples removed by the pruning step as removed by 
 * VACUUM.
 */

tups_vacuumed += heap_page_prune(onerel, buf, OldestXmin, false,
                                     &vacrelstats->latestRemovedXid);

//...
/* execute collected freezes */
for (i = 0; i < nfrozen; i++)
{
  ItemId  itemid;
  HeapTupleHeader htup;

  itemid = PageGetItemId(page, frozen[i].offset);
  htup = (HeapTupleHeader) PageGetItem(page, itemid);

  heap_execute_freeze_tuple(htup, &frozen[i]);
}
//...

/*
 * Optionally truncate the relation.
 */

if (should_attempt_truncation(vacrelstats))
    lazy_truncate_heap(onerel, vacrelstats);

//...

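The block-skipping loop at the start of the excerpt can be isolated into a small function. In this sketch an in-memory array of status bytes stands in for visibilitymap_get_status; the function name next_unskippable is hypothetical, and the two flag values match the all-visible and all-frozen bits of the visibility map.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VISIBILITYMAP_ALL_VISIBLE 0x01
#define VISIBILITYMAP_ALL_FROZEN  0x02

/*
 * Starting at 'start', return the first block that cannot be skipped.
 * vm[] holds one status byte per heap block. An aggressive vacuum may
 * only skip all-frozen blocks; a normal vacuum may skip any all-visible
 * block. Returns nblocks if every remaining block is skippable.
 */
size_t
next_unskippable(const uint8_t *vm, size_t nblocks,
                 size_t start, bool aggressive)
{
    uint8_t required = aggressive ? VISIBILITYMAP_ALL_FROZEN
                                  : VISIBILITYMAP_ALL_VISIBLE;
    size_t  blk = start;

    while (blk < nblocks && (vm[blk] & required) != 0)
        blk++;
    return blk;
}
```

An aggressive vacuum therefore stops at blocks that are all-visible but not yet all-frozen, because it still has to visit them to advance relfrozenxid.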
  • After freezing, update datfrozenxid and compute the oldest frozen XID across all databases, which determines which CLOG files can be removed:
/*
 * We leak table_toast_map here (among other things), but since we're
 * going away soon, it's not a problem.
 */


/*
 * Update pg_database.datfrozenxid, and truncate pg_xact if possible. We
 * only need to do this once, not after each table.
 *
 * Even if we didn't vacuum anything, it may still be important to do
 * this, because one indirect effect of vac_update_datfrozenxid() is to
 * update ShmemVariableCache->xidVacLimit.  That might need to be done
 * even if we haven't vacuumed anything, because relations with older
 * relfrozenxid values or other databases with older datfrozenxid values
 * might have been dropped, allowing xidVacLimit to advance.
 *
 * However, it's also important not to do this blindly in all cases,
 * because when autovacuum=off this will restart the autovacuum launcher.
 * If we're not careful, an infinite loop can result, where workers find
 * no work to do and restart the launcher, which starts another worker in
 * the same database that finds no work to do.  To prevent that, we skip
 * this if (1) we found no work to do and (2) we skipped at least one
 * table due to concurrent autovacuum activity.  In that case, the other
 * worker has already done it, or will do so when it finishes.
 */

if (did_vacuum || !found_concurrent_worker)
    vac_update_datfrozenxid();

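The idea behind this last step is a minimum taken in circular XID order: a database's datfrozenxid is the oldest relfrozenxid among its tables, and the oldest datfrozenxid across all databases bounds how much of the CLOG can be truncated. A minimal sketch with hypothetical helper names, valid only for normal XIDs:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;

/* Circular comparison for normal XIDs: true if a is logically older. */
bool
xid_precedes(Xid a, Xid b)
{
    return (int32_t) (a - b) < 0;
}

/* Oldest XID in circular order among n values; n must be at least 1. */
Xid
oldest_xid(const Xid *xids, size_t n)
{
    Xid    oldest = xids[0];
    size_t i;

    for (i = 1; i < n; i++)
    {
        if (xid_precedes(xids[i], oldest))
            oldest = xids[i];
    }
    return oldest;
}

/*
 * Mirrors the guard in the excerpt: the update is skipped only when this
 * worker vacuumed nothing and another worker was active concurrently.
 */
bool
should_update_datfrozenxid(bool did_vacuum, bool found_concurrent_worker)
{
    return did_vacuum || !found_concurrent_worker;
}
```

Applying oldest_xid first to the relfrozenxid values of one database and then to the datfrozenxid values of all databases gives the cutoff below which transaction status is no longer needed.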

Reposted from PolarDB.