A Brief Analysis of the PostgreSQL Ring Buffer

PolarDB 2025-04-18

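About PolarDB for PostgreSQL

PolarDB for PostgreSQL is a cloud-native relational database developed in-house by Alibaba Cloud. It is 100% compatible with PostgreSQL and highly compatible with Oracle syntax. It is built on a Shared-Storage architecture that separates storage from compute, and offers extreme elasticity, millisecond-level latency, HTAP, Ganos full-space data processing, and enterprise-grade features such as high reliability, high availability, and elastic scaling. PolarDB for PostgreSQL also supports massively parallel computation, so it can handle mixed OLTP and OLAP workloads.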

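What the ring buffer does

PostgreSQL uses a cache in shared memory, the buffer pool, to avoid going to storage for every I/O and so improve database performance. The size of the buffer pool is controlled by shared_buffers and cannot be unlimited. When an operation needs to access a large number of pages just once (for example VACUUM, a large sequential scan, or a bulk write), it can pollute the entire buffer pool and hurt the performance of other processes. For such operations PostgreSQL introduced a separate buffer allocation strategy, the ring buffer: the operation is given a much smaller buffer area of size ring_size, and its buffers are allocated within that area, which keeps it from polluting the whole buffer pool.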

How the ring buffer works

The ring buffer strategies fall into three categories:

Type	Scenario	Size parameter
BAS_BULKREAD	Bulk reads	polar_ring_buffer_bulkread_size (default 1MB)
BAS_BULKWRITE	Bulk writes	polar_ring_buffer_bulkwrite_size (default 64MB)
BAS_VACUUM	VACUUM	polar_ring_buffer_vacuum_size (default 128MB)

BufferAccessStrategy
GetAccessStrategy(BufferAccessStrategyType btype)
{
    /* ... */

    /*
     * Select ring size to use.  See buffer/README for rationales.
     *
     * Note: if you change the ring size for BAS_BULKREAD, see also
     * SYNC_SCAN_REPORT_INTERVAL in access/heap/syncscan.c.
     */
    switch (btype)
    {
        case BAS_NORMAL:
            /* if someone asks for NORMAL, just give 'em a "default" object */
            return NULL;

        case BAS_BULKREAD:
            ring_size = polar_ring_buffer_bulkread_size;
            break;
        case BAS_BULKWRITE:
            ring_size = polar_ring_buffer_bulkwrite_size;
            break;
        case BAS_VACUUM:
            ring_size = polar_ring_buffer_vacuum_size;
            break;

        default:
            elog(ERROR, "unrecognized buffer access strategy: %d",
                 (int) btype);
            return NULL;    /* keep compiler quiet */
    }
    /* ... */
}
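To make the sizing concrete, here is a minimal sketch of how a ring size could be turned into a number of buffer slots. It assumes 8KB pages (the PostgreSQL default block size) and the rule, mentioned in this article for bulk writes, that a ring may not exceed 1/8 of shared_buffers. The helper name ring_slots, the constant SKETCH_BLCKSZ, and the choice of kilobytes as the input unit are illustrative assumptions, not PolarDB code.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BLCKSZ 8192  /* assumed page size: the PostgreSQL default */

/*
 * Hypothetical helper: convert a ring size given in kilobytes into a number
 * of buffer slots, capped at 1/8 of the buffer pool (nbuffers pages).
 */
static size_t
ring_slots(size_t ring_size_kb, size_t nbuffers)
{
    size_t slots;
    size_t cap;

    assert(nbuffers >= 8);

    slots = ring_size_kb * 1024 / SKETCH_BLCKSZ;
    cap = nbuffers / 8;

    /* Keep the ring from taking an undue fraction of the pool. */
    return slots < cap ? slots : cap;
}
```

With shared_buffers at 128MB (16384 pages), ring_slots(256, 16384) gives 32 slots for a 256KB ring, while a 64MB ring would be capped at 2048 slots.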

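Here ring_size is the size of the ring buffer. GetBufferFromRing looks in the ring buffer array for a usable buffer. It checks the buffer's REFCOUNT and its usage count (how many times it has been used recently). If the buffer passes both checks it is returned directly; otherwise the caller obtains a buffer from the buffer pool.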

static BufferDesc *
GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
{
    /* ... */

    /* Advance to next ring slot */
    if (++strategy->current >= strategy->ring_size)
        strategy->current = 0;

    /*
     * POLAR: If the slot hasn't been filled yet or bufnum >= NBuffers, tell
     * the caller to allocate a new buffer with the normal allocation
     * strategy.  He will then fill this slot by calling AddBufferToRing with
     * the new buffer.
     */
    bufnum = strategy->buffers[strategy->current];
    if (bufnum == InvalidBuffer || bufnum >= NBuffers)
    {
        strategy->current_was_in_ring = false;
        return NULL;
    }

    /*
     * If the buffer is pinned we cannot use it under any circumstances.
     *
     * If usage_count is 0 or 1 then the buffer is fair game (we expect 1,
     * since our own previous usage of the ring element would have left it
     * there, but it might've been decremented by clock sweep since then). A
     * higher usage_count indicates someone else has touched the buffer, so we
     * shouldn't re-use it.
     *
     * If the buffer is marked to shrink, the buffer cannot be used again.
     */
    buf = GetBufferDescriptor(bufnum - 1);
    local_buf_state = LockBufHdr(buf);
    if (!POLAR_BUFFER_IS_MARK_SHRINK(buf) &&
        BUF_STATE_GET_REFCOUNT(local_buf_state) == 0 &&
        BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
    {
        strategy->current_was_in_ring = true;
        *buf_state = local_buf_state;
        return buf;
    }
    UnlockBufHdr(buf, local_buf_state);

    /*
     * Tell caller to allocate a new buffer with the normal allocation
     * strategy.  He'll then replace this ring element via AddBufferToRing.
     */
    strategy->current_was_in_ring = false;
    return NULL;
}
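The slot-selection logic above can be tried out in isolation with a small toy model. The sketch below is not the PostgreSQL or PolarDB implementation: it drops the buffer header lock and the POLAR shrink mark, and every name prefixed with Toy or toy_ is hypothetical. It keeps only the three decisions the excerpt makes: advance and wrap the ring position, give up on an empty slot, and reuse a buffer only when it is unpinned and its usage count is at most 1.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_RING_SIZE 4
#define TOY_INVALID   (-1)      /* stands in for InvalidBuffer */

/* Toy stand-in for a buffer header: only the two fields the check reads. */
typedef struct ToyBuffer
{
    int refcount;               /* number of pins */
    int usage_count;            /* clock-sweep usage counter */
} ToyBuffer;

typedef struct ToyRing
{
    int  slots[TOY_RING_SIZE];  /* index into the pool, or TOY_INVALID */
    int  current;               /* slot examined most recently */
    bool current_was_in_ring;
} ToyRing;

static void
toy_ring_init(ToyRing *ring)
{
    for (int i = 0; i < TOY_RING_SIZE; i++)
        ring->slots[i] = TOY_INVALID;
    ring->current = 0;
    ring->current_was_in_ring = false;
}

/*
 * Returns the pool index of a reusable buffer, or TOY_INVALID when the
 * caller should fall back to the normal allocation path.
 */
static int
toy_get_buffer_from_ring(ToyRing *ring, const ToyBuffer *pool)
{
    int bufnum;

    /* Advance to the next slot, wrapping around at the end. */
    if (++ring->current >= TOY_RING_SIZE)
        ring->current = 0;

    bufnum = ring->slots[ring->current];

    /* Reuse only a filled slot whose buffer is unpinned and barely used. */
    if (bufnum != TOY_INVALID &&
        pool[bufnum].refcount == 0 &&
        pool[bufnum].usage_count <= 1)
    {
        ring->current_was_in_ring = true;
        return bufnum;
    }

    ring->current_was_in_ring = false;
    return TOY_INVALID;
}
```

In this model a pinned buffer or one with a usage count of 2 or more makes the call return TOY_INVALID, which corresponds to the real function returning NULL so that the caller picks a victim from the whole buffer pool.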

BAS_BULKREAD

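For bulk sequential scans, community PostgreSQL uses a 256KB ring. That is small enough to fit in the L2 cache, which makes transferring pages from the operating system cache into the shared buffer cache more efficient. On public-cloud instances, however, shared memory is generally large, so the ring buffer size has been adjusted accordingly. If the buffers in the ring are frequently modified and marked dirty, the ring buffer strategy falls back to the normal clock-sweep algorithm over the whole buffer pool.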
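One way such a fallback can work is sketched below, modeled loosely on the StrategyRejectBuffer function in community PostgreSQL: in bulk-read mode, a dirty ring buffer whose write-out would first require a WAL flush is dropped from the ring, so the slot is refilled from the whole pool. This is a simplified toy, not the actual source. All Toy and toy_ names are hypothetical, and the needs_wal_flush flag is an assumption that stands in for a check the real caller performs itself.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_INVALID (-1)        /* stands in for InvalidBuffer */

typedef enum ToyStrategyType
{
    TOY_BULKREAD,
    TOY_BULKWRITE,
    TOY_VACUUM
} ToyStrategyType;

typedef struct ToyStrategy
{
    ToyStrategyType btype;
    int             slots[4];   /* buffer numbers held by the ring */
    int             current;    /* slot the candidate buffer came from */
} ToyStrategy;

/*
 * Returns true when the caller should pick another victim instead of
 * writing out buffer bufnum. On rejection the buffer is removed from the
 * ring, so a ring full of dirty buffers cannot loop forever.
 */
static bool
toy_reject_buffer(ToyStrategy *s, int bufnum, bool from_ring,
                  bool needs_wal_flush)
{
    /* Only bulk reads give up on dirty buffers. */
    if (s->btype != TOY_BULKREAD)
        return false;

    /* Leave buffers chosen by the normal strategy alone. */
    if (!from_ring || s->slots[s->current] != bufnum)
        return false;

    /* A buffer that can be written without a WAL flush is still fine. */
    if (!needs_wal_flush)
        return false;

    s->slots[s->current] = TOY_INVALID;
    return true;
}
```

The more often ring buffers are dirtied by other activity, the more slots get emptied this way, and the closer the scan gets to plain clock-sweep allocation.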

BAS_BULKWRITE

Main scenarios:

COPY FROM
CREATE TABLE AS

/*
 * GetBulkInsertState - prepare status object for a bulk insert
 */
BulkInsertState
GetBulkInsertState(void)
{
    BulkInsertState bistate;

    bistate = (BulkInsertState) palloc(sizeof(BulkInsertStateData));
    bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);
    bistate->current_buf = InvalidBuffer;
    return bistate;
}

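PolarDB for PostgreSQL does most of its I/O as direct I/O (DIO): pwrite/write goes straight to disk rather than to the operating system's page cache, so each I/O is comparatively expensive. For the COPY IN case, PostgreSQL uses a 16MB ring (but no more than 1/8 of shared_buffers). A ring that small forces the backend process to flush dirty pages frequently, and combined with the DIO model this severely degrades COPY performance. To reduce or avoid the cost of the backend flushing dirty pages itself, two options are provided: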

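Increase the ring buffer size for bulk writes, which reduces the impact of the backend's own dirty-page flushing on performance;
Disable the ring buffer for bulk writes, so that bulk writes use the normal clock-sweep algorithm over the whole buffer pool and the dirty-page flushing processes do the I/O, which avoids the impact of backend flushing altogether.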
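The effect of the first option can be illustrated with a deliberately crude model. The sketch below counts how often a backend would have to write out a dirty page itself when it pushes a stream of newly dirtied pages through a ring of a given size. It assumes that every page is dirtied, that nothing else cleans ring buffers in the meantime, and that slots are reused strictly in order. The function toy_backend_flushes is hypothetical and is not part of PostgreSQL or PolarDB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Toy model: a backend dirties n_pages new pages in sequence through a ring
 * of ring_slots buffers. Reusing a slot that still holds a dirty page forces
 * the backend to write that page out first. Returns the number of such
 * backend writes, or -1 if memory allocation fails.
 */
static long
toy_backend_flushes(long n_pages, long ring_slots)
{
    bool *dirty;
    long  flushes = 0;
    long  current = 0;

    assert(n_pages >= 0 && ring_slots > 0);

    dirty = calloc((size_t) ring_slots, sizeof(bool));
    if (dirty == NULL)
        return -1;

    for (long i = 0; i < n_pages; i++)
    {
        if (dirty[current])
            flushes++;              /* backend writes the old page itself */
        dirty[current] = true;      /* the newly loaded page gets dirtied */
        current = (current + 1) % ring_slots;
    }

    free(dirty);
    return flushes;
}
```

Under these assumptions, loading 10000 pages through a 2048-slot ring (16MB at 8KB per page) costs 7952 backend writes, while an 8192-slot ring (64MB) costs 1808. A load that fits entirely in the ring costs none, which is the limiting case the second option aims for by handing the writes to the background flushing processes.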

BAS_VACUUM

Main scenario:

vacuum/autovacuum

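If VACUUM itself has to flush dirty pages frequently, its efficiency suffers. For workloads where transactions are generated very quickly and large tables are modified very frequently, a VACUUM that keeps flushing dirty pages may fail to reclaim transaction IDs in time, which can lead to transaction ID wraparound. VACUUM therefore uses the same two options as bulk writes: 1) increase the ring buffer size; 2) disable the ring buffer.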


