A Brief Analysis of the PostgreSQL Ring Buffer

PolarDB 2025-04-18

About PolarDB for PostgreSQL

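PolarDB for PostgreSQL is a cloud-native relational database developed in-house by Alibaba Cloud. It is 100% compatible with PostgreSQL and highly compatible with Oracle syntax. It uses a Shared-Storage architecture that separates compute from storage, and offers enterprise-grade features such as elastic scaling, millisecond-level latency, HTAP, Ganos full-spatial data processing, high reliability, and high availability. PolarDB for PostgreSQL also supports large-scale parallel computing, so it can handle mixed OLTP and OLAP workloads.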

What the ring buffer does

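PostgreSQL uses a cache in shared memory, the buffer pool, to avoid going directly to storage I/O and so improve database performance. The buffer pool is sized by shared_buffers and cannot be arbitrarily large. When an operation touches a large number of pages but needs each of them only once (for example VACUUM, a large sequential scan, or a bulk write), it can pollute the entire buffer pool and hurt the performance of other processes. For such operations PostgreSQL introduced a separate buffer allocation strategy, the ring buffer: the operation is given a much smaller buffer area of size ring_size, and its buffers are allocated within that area, which avoids polluting the whole buffer pool.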
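The pollution argument above can be made concrete with a toy calculation. The sketch below is not PostgreSQL code: the function name and its parameters are hypothetical, and it only models how many distinct shared buffers a one-pass scan ends up occupying with and without a ring.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model (hypothetical, not PostgreSQL source): how many distinct shared
 * buffers does a one-pass operation occupy?
 *
 *   nbuffers   - total buffers in the pool (stand-in for NBuffers)
 *   ring_slots - ring size in buffers; 0 means "no ring, use the whole pool"
 *   pages      - number of distinct pages the operation touches
 */
size_t
scan_footprint(size_t nbuffers, size_t ring_slots, size_t pages)
{
    size_t limit;

    /* a ring can never be larger than the pool it draws from */
    assert(ring_slots <= nbuffers);

    limit = (ring_slots == 0) ? nbuffers : ring_slots;
    return (pages < limit) ? pages : limit;
}
```

With a pool of 16384 buffers, a scan over 100000 pages would occupy all 16384 buffers without a ring, but only 128 of them when it is confined to a 128-slot ring; the remaining buffers stay available to other backends.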

How the ring buffer works

Ring buffer strategies fall into three main types:

Type            Scenario       Size parameter (default)
BAS_BULKREAD    Bulk reads     polar_ring_buffer_bulkread_size (1MB)
BAS_BULKWRITE   Bulk writes    polar_ring_buffer_bulkwrite_size (64MB)
BAS_VACUUM      VACUUM         polar_ring_buffer_vacuum_size (128MB)

BufferAccessStrategy
GetAccessStrategy(BufferAccessStrategyType btype)
{
    /* ... */

    /*
     * Select ring size to use.  See buffer/README for rationales.
     *
     * Note: if you change the ring size for BAS_BULKREAD, see also
     * SYNC_SCAN_REPORT_INTERVAL in access/heap/syncscan.c.
     */
    switch (btype)
    {
        case BAS_NORMAL:
            /* if someone asks for NORMAL, just give 'em a "default" object */
            return NULL;

        case BAS_BULKREAD:
            ring_size = polar_ring_buffer_bulkread_size;
            break;
        case BAS_BULKWRITE:
            ring_size = polar_ring_buffer_bulkwrite_size;
            break;
        case BAS_VACUUM:
            ring_size = polar_ring_buffer_vacuum_size;
            break;

        default:
            elog(ERROR, "unrecognized buffer access strategy: %d",
                 (int) btype);
            return NULL;    /* keep compiler quiet */
    }
    /* ... */
}
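The excerpt does not show how a configured size becomes a number of ring slots. As a rough sketch only: community PostgreSQL divides a byte size by the page size and caps the ring at one eighth of shared_buffers. The function below mirrors that arithmetic. Its name and the SKETCH_BLCKSZ constant are hypothetical, and it assumes the size is given in bytes; the unit of the polar_ring_buffer_* variables is not shown in the excerpt.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BLCKSZ 8192      /* assumed page size (PostgreSQL's default BLCKSZ) */

/*
 * Hypothetical helper: convert a ring size in bytes into ring slots,
 * capped at nbuffers / 8 so the ring is never an undue share of the pool.
 */
size_t
ring_slots_for(size_t ring_bytes, size_t nbuffers)
{
    size_t slots = ring_bytes / SKETCH_BLCKSZ;
    size_t cap = nbuffers / 8;

    assert(nbuffers > 0);

    return (slots < cap) ? slots : cap;
}
```

Under these assumptions a 256KB ring is 32 slots. With shared_buffers of 128MB (16384 pages), a 64MB request would be capped at 2048 slots, which is why larger rings only pay off on instances with a large buffer pool.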

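In GetAccessStrategy, ring_size is the size of the ring buffer. GetBufferFromRing then looks in the ring's buffer array for a reusable buffer. It checks the buffer's REFCOUNT and how recently the buffer has been used (its usage count). If the buffer is unpinned and its usage count is at most 1, it is returned directly; otherwise a buffer is obtained from the buffer pool.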

static BufferDesc *
GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
{
    /* ... */

    /* Advance to next ring slot */
    if (++strategy->current >= strategy->ring_size)
        strategy->current = 0;

    /*
     * POLAR: If the slot hasn't been filled yet or bufnum >= NBuffers, tell
     * the caller to allocate a new buffer with the normal allocation
     * strategy.  He will then fill this slot by calling AddBufferToRing with
     * the new buffer.
     */
    bufnum = strategy->buffers[strategy->current];
    if (bufnum == InvalidBuffer || bufnum >= NBuffers)
    {
        strategy->current_was_in_ring = false;
        return NULL;
    }

    /*
     * If the buffer is pinned we cannot use it under any circumstances.
     *
     * If usage_count is 0 or 1 then the buffer is fair game (we expect 1,
     * since our own previous usage of the ring element would have left it
     * there, but it might've been decremented by clock sweep since then). A
     * higher usage_count indicates someone else has touched the buffer, so we
     * shouldn't re-use it.
     *
     * If the buffer is marked to shrink, the buffer cannot be used again.
     */
    buf = GetBufferDescriptor(bufnum - 1);
    local_buf_state = LockBufHdr(buf);
    if (!POLAR_BUFFER_IS_MARK_SHRINK(buf) &&
        BUF_STATE_GET_REFCOUNT(local_buf_state) == 0 &&
        BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
    {
        strategy->current_was_in_ring = true;
        *buf_state = local_buf_state;
        return buf;
    }
    UnlockBufHdr(buf, local_buf_state);

    /*
     * Tell caller to allocate a new buffer with the normal allocation
     * strategy.  He'll then replace this ring element via AddBufferToRing.
     */
    strategy->current_was_in_ring = false;
    return NULL;
}
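To experiment with the reuse rule outside the server, the following self-contained toy model may help. It is a simplified sketch, not the real implementation: all Toy* names are hypothetical, there is no locking, and the PolarDB shrink check is left out. It keeps only the slot advance, the "unpinned and usage count at most 1" test, and the fallback to normal allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS  4
#define INVALID_BUF 0           /* plays the role of InvalidBuffer */

typedef struct ToyBuffer
{
    int refcount;               /* number of pins on the buffer */
    int usage_count;            /* clock-sweep usage counter */
} ToyBuffer;

typedef struct ToyRing
{
    int  current;               /* slot examined most recently */
    bool current_was_in_ring;   /* did the last request reuse a ring buffer? */
    int  buffers[RING_SLOTS];   /* 1-based buffer ids, INVALID_BUF if empty */
} ToyRing;

/*
 * Return the 1-based id of a reusable ring buffer, or INVALID_BUF when the
 * caller has to allocate from the whole pool instead.
 */
int
toy_get_buffer_from_ring(ToyRing *ring, const ToyBuffer *pool)
{
    int bufnum;

    assert(ring != NULL && pool != NULL);

    /* advance to the next ring slot, wrapping around at the end */
    if (++ring->current >= RING_SLOTS)
        ring->current = 0;

    bufnum = ring->buffers[ring->current];
    if (bufnum != INVALID_BUF &&
        pool[bufnum - 1].refcount == 0 &&
        pool[bufnum - 1].usage_count <= 1)
    {
        ring->current_was_in_ring = true;
        return bufnum;
    }

    /* empty slot, pinned buffer, or buffer that someone else is using */
    ring->current_was_in_ring = false;
    return INVALID_BUF;
}

/* Counterpart of AddBufferToRing: remember a pool buffer in the current slot. */
void
toy_add_buffer_to_ring(ToyRing *ring, int bufnum)
{
    ring->buffers[ring->current] = bufnum;
}
```

Starting from a zero-initialized ToyRing, the first RING_SLOTS requests all return INVALID_BUF because every slot is still empty. Once the caller has filled the slots with toy_add_buffer_to_ring, later requests cycle through the same few buffers, skipping any that are pinned or have a usage count above 1.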

BAS_BULKREAD

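For bulk sequential scans, community PostgreSQL uses a 256KB ring. That is small enough to fit in the L2 cache, which makes transferring pages from the operating system cache into the shared buffer cache more efficient. On public-cloud instances, however, shared memory is generally large, so we increased the ring buffer size accordingly. If buffers in the ring are frequently modified and marked dirty, the ring buffer strategy degrades to the normal clock-sweep strategy over the whole buffer pool.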
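In community PostgreSQL, the degradation for dirty buffers is handled by StrategyRejectBuffer: in bulk-read mode, when the ring buffer chosen for reuse is dirty and writing it would first require a WAL flush, the buffer is dropped from the ring and a replacement is taken from the pool. The toy model below sketches that decision. The Toy* names are hypothetical, the dirty check is folded into a boolean parameter for brevity, and whether PolarDB modifies this path is not shown in the article.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_RING_SLOTS  4
#define TOY_INVALID_BUF 0       /* plays the role of InvalidBuffer */

typedef enum ToyStrategyType
{
    TOY_BAS_BULKREAD,
    TOY_BAS_BULKWRITE,
    TOY_BAS_VACUUM
} ToyStrategyType;

typedef struct ToyStrategy
{
    ToyStrategyType btype;
    int current;                        /* slot handed out most recently */
    int buffers[TOY_RING_SLOTS];        /* 1-based buffer ids */
} ToyStrategy;

/*
 * Return true if the candidate buffer should be given up and removed from
 * the ring, so that the caller picks another victim from the whole pool.
 */
bool
toy_reject_buffer(ToyStrategy *s, int bufnum, bool from_ring,
                  bool dirty_needs_wal_flush)
{
    assert(s != NULL);

    /* only bulk reads give up dirty ring buffers */
    if (s->btype != TOY_BAS_BULKREAD)
        return false;

    /* leave buffers chosen by the normal strategy alone */
    if (!from_ring || s->buffers[s->current] != bufnum)
        return false;

    /* a clean buffer, or one whose WAL is already flushed, is cheap to reuse */
    if (!dirty_needs_wal_flush)
        return false;

    /* drop the dirty buffer; the slot is refilled from the pool later */
    s->buffers[s->current] = TOY_INVALID_BUF;
    return true;
}
```

If scanned pages keep getting dirtied, more and more slots are emptied this way and refilled from the pool, so the scan behaves increasingly like an ordinary clock-sweep consumer.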

BAS_BULKWRITE

Main scenarios:

  • COPY FROM

  • CREATE TABLE AS

/*
 * GetBulkInsertState - prepare status object for a bulk insert
 */
BulkInsertState
GetBulkInsertState(void)
{
    BulkInsertState bistate;

    bistate = (BulkInsertState) palloc(sizeof(BulkInsertStateData));
    bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);
    bistate->current_buf = InvalidBuffer;
    return bistate;
}

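I/O in the PolarDB for PostgreSQL architecture is mainly direct I/O (DIO): pwrite/write goes straight to disk instead of to the operating system's page cache, so each I/O is relatively expensive. For COPY IN, PostgreSQL uses a 16MB ring (but no more than 1/8 of shared_buffers). Such a small ring forces the backend process to flush dirty pages frequently, and under the DIO model this severely hurts COPY performance. To reduce or avoid the cost of the backend flushing dirty pages itself, two options are provided:

  1. Increase the ring buffer size for bulk writes, which reduces the performance impact of the backend flushing dirty pages itself;

  2. Disable the ring buffer for bulk writes, so that bulk writes use the normal clock-sweep strategy over the whole buffer pool and the dirty-page flushing processes do the I/O, which avoids the performance impact of backend-side flushing entirely.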
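The effect of the two options can be estimated with a small simulation. This is a deliberately simplified model with hypothetical names: every loaded page dirties one ring slot, a dirty slot must be written by the backend before it can be reused, and with the ring disabled the model assumes that the background flushing processes always clean buffers in time, so the backend writes nothing itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical model: count the writes a loading backend performs itself.
 *
 *   ring_slots - ring size in buffers; 0 means the ring is disabled
 *   pages      - number of pages written by the bulk load
 *
 * Returns SIZE_MAX if the bookkeeping array cannot be allocated.
 */
size_t
backend_flushes(size_t ring_slots, size_t pages)
{
    bool  *dirty;
    size_t slot = 0;
    size_t flushes = 0;
    size_t expected;

    /* ring disabled: assume background flushing keeps up (an assumption) */
    if (ring_slots == 0)
        return 0;

    dirty = calloc(ring_slots, sizeof(bool));
    if (dirty == NULL)
        return SIZE_MAX;

    for (size_t i = 0; i < pages; i++)
    {
        if (dirty[slot])
            flushes++;          /* old page must be written before reuse */
        dirty[slot] = true;     /* the new page dirties the slot again */
        slot = (slot + 1) % ring_slots;
    }
    free(dirty);

    /* the simulation must agree with the closed form max(0, pages - slots) */
    expected = (pages > ring_slots) ? pages - ring_slots : 0;
    assert(flushes == expected);

    return flushes;
}
```

In this model, enlarging the ring only postpones the first backend write: loading 100000 pages costs the backend 97952 writes with a 2048-slot ring and 91808 with an 8192-slot ring, while a load smaller than the ring costs none. Disabling the ring moves all writes to the background processes, at the price of letting the load compete for the whole buffer pool.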

BAS_VACUUM

Main scenarios:

  • vacuum/autovacuum

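If VACUUM itself has to flush dirty pages frequently, its efficiency suffers. For workloads that generate transactions very quickly on large tables that are also modified very frequently, VACUUM spending its time on dirty-page I/O can delay the cleanup of old transactions and lead to transaction ID wraparound. VACUUM therefore adopts a strategy similar to bulk writes: 1) increase the size of the VACUUM ring buffer; 2) disable the VACUUM ring buffer.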


This article is reposted from PolarDB.
