A Brief Look at the PostgreSQL Ring Buffer
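About PolarDB for PostgreSQL
PolarDB for PostgreSQL is a cloud-native relational database developed in-house by Alibaba Cloud. It is 100% compatible with PostgreSQL and highly compatible with Oracle syntax. It uses a compute-storage separation architecture built on shared storage and offers elastic scaling, millisecond-level latency, HTAP, Ganos spatial data processing, high reliability, and high availability, along with other enterprise-grade database features. PolarDB for PostgreSQL also supports massively parallel processing, so it can handle mixed OLTP and OLAP workloads.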
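What the ring buffer is for
PostgreSQL uses a cache, the shared-memory buffer pool, to avoid going directly to storage I/O and so improve database performance. The size of the buffer pool is controlled by shared_buffers and cannot be unlimited. When an operation touches a large number of pages that it needs only once (for example VACUUM, a large sequential scan, or a bulk write), it can pollute the entire buffer pool and hurt the performance of other processes. For such operations PostgreSQL introduces a separate buffer allocation strategy, the ring buffer: the operation is given a much smaller buffer area of size ring_size and allocates buffers only within it, which keeps it from polluting the whole buffer pool.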
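The pollution effect can be made concrete with a toy simulation. The sketch below is not PostgreSQL code: SimPool, sim_clock_sweep, and sim_scan are hypothetical names, the pool has only 16 buffers, pinning is ignored, and the clock sweep is heavily simplified. It only illustrates why confining a one-off scan to a small ring protects pages that other sessions still want.

```c
#include <assert.h>

#define SIM_POOL_SIZE 16    /* toy buffer pool: 16 buffers (assumption) */
#define SIM_MAX_RING  16    /* largest ring the toy supports */

typedef struct SimBuf { int page; int usage; } SimBuf;
typedef struct SimPool { SimBuf buf[SIM_POOL_SIZE]; int hand; } SimPool;

/* Fill the pool with "hot" pages 0..15, each with usage count 1. */
static void
sim_pool_init(SimPool *p)
{
    for (int i = 0; i < SIM_POOL_SIZE; i++)
    {
        p->buf[i].page = i;
        p->buf[i].usage = 1;
    }
    p->hand = 0;
}

/* Simplified clock sweep: decrement usage counts until one is already 0. */
static int
sim_clock_sweep(SimPool *p)
{
    for (;;)
    {
        int i = p->hand;

        p->hand = (p->hand + 1) % SIM_POOL_SIZE;
        if (p->buf[i].usage == 0)
            return i;
        p->buf[i].usage--;
    }
}

/*
 * Read npages pages that are each needed only once.
 * ring_size == 0 means "no ring": every page takes a victim from the pool.
 * Returns how many of the original hot pages are still cached afterwards.
 */
static int
sim_scan(SimPool *p, int npages, int ring_size)
{
    int ring[SIM_MAX_RING];
    int cur = 0;
    int survivors = 0;

    assert(ring_size >= 0 && ring_size <= SIM_MAX_RING);
    for (int i = 0; i < SIM_MAX_RING; i++)
        ring[i] = -1;               /* -1 marks an unfilled ring slot */

    for (int n = 0; n < npages; n++)
    {
        int victim;

        if (ring_size == 0)
            victim = sim_clock_sweep(p);
        else
        {
            cur = (cur + 1) % ring_size;
            victim = ring[cur];
            /* Unfilled slot, or someone else touched it: go to the pool. */
            if (victim < 0 || p->buf[victim].usage > 1)
            {
                victim = sim_clock_sweep(p);
                ring[cur] = victim;
            }
        }
        p->buf[victim].page = 1000 + n;     /* scan pages use ids >= 1000 */
        p->buf[victim].usage = 1;
    }

    for (int i = 0; i < SIM_POOL_SIZE; i++)
        if (p->buf[i].page < SIM_POOL_SIZE)
            survivors++;
    return survivors;
}
```

In this toy setup, calling sim_scan(&pool, 64, 0) on a freshly initialized pool leaves none of the 16 hot pages cached, while sim_scan(&pool, 64, 4) leaves 12 of them in place, because only the four buffers captured by the ring are ever recycled.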
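How the ring buffer works
There are three main ring buffer strategies (BAS_BULKREAD, BAS_BULKWRITE, and BAS_VACUUM). GetAccessStrategy selects the ring size for each: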
BufferAccessStrategy
GetAccessStrategy(BufferAccessStrategyType btype)
{
    /* ... */
    /*
     * Select ring size to use.  See buffer/README for rationales.
     *
     * Note: if you change the ring size for BAS_BULKREAD, see also
     * SYNC_SCAN_REPORT_INTERVAL in access/heap/syncscan.c.
     */
    switch (btype)
    {
        case BAS_NORMAL:
            /* if someone asks for NORMAL, just give 'em a "default" object */
            return NULL;
        case BAS_BULKREAD:
            ring_size = polar_ring_buffer_bulkread_size;
            break;
        case BAS_BULKWRITE:
            ring_size = polar_ring_buffer_bulkwrite_size;
            break;
        case BAS_VACUUM:
            ring_size = polar_ring_buffer_vacuum_size;
            break;
        default:
            elog(ERROR, "unrecognized buffer access strategy: %d",
                 (int) btype);
            return NULL;        /* keep compiler quiet */
    }
    /* ... */
}
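Here ring_size is the size of the ring buffer. GetBufferFromRing looks in the ring's buffer array for a usable buffer. It checks the buffer's reference count (REFCOUNT) and its usage count, that is, how many times it has been used recently. If the buffer is unpinned (reference count 0) and its usage count is at most 1, it is returned directly; otherwise the caller gets a buffer from the buffer pool instead. The PolarDB version additionally refuses buffers that are marked for shrinking.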
static BufferDesc *
GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
{
    /* ... */
    /* Advance to next ring slot */
    if (++strategy->current >= strategy->ring_size)
        strategy->current = 0;
    /*
     * POLAR: If the slot hasn't been filled yet or bufnum >= NBuffers, tell
     * the caller to allocate a new buffer with the normal allocation
     * strategy. He will then fill this slot by calling AddBufferToRing with
     * the new buffer.
     */
    bufnum = strategy->buffers[strategy->current];
    if (bufnum == InvalidBuffer || bufnum >= NBuffers)
    {
        strategy->current_was_in_ring = false;
        return NULL;
    }
    /*
     * If the buffer is pinned we cannot use it under any circumstances.
     *
     * If usage_count is 0 or 1 then the buffer is fair game (we expect 1,
     * since our own previous usage of the ring element would have left it
     * there, but it might've been decremented by clock sweep since then). A
     * higher usage_count indicates someone else has touched the buffer, so we
     * shouldn't re-use it.
     *
     * If the buffer is marked to shrink, the buffer cannot be used again.
     */
    buf = GetBufferDescriptor(bufnum - 1);
    local_buf_state = LockBufHdr(buf);
    if (!POLAR_BUFFER_IS_MARK_SHRINK(buf) &&
        BUF_STATE_GET_REFCOUNT(local_buf_state) == 0 &&
        BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
    {
        strategy->current_was_in_ring = true;
        *buf_state = local_buf_state;
        return buf;
    }
    UnlockBufHdr(buf, local_buf_state);
    /*
     * Tell caller to allocate a new buffer with the normal allocation
     * strategy. He'll then replace this ring element via AddBufferToRing.
     */
    strategy->current_was_in_ring = false;
    return NULL;
}
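To see the slot-cycling behavior in isolation, here is a minimal, self-contained model of the logic above. It is a sketch under simplifying assumptions, not the real implementation: ToyBuffer, ToyStrategy, toy_get_buffer_from_ring, and toy_add_buffer_to_ring are hypothetical names, there is no buffer header locking, no mark-shrink state, and the pool and ring sizes are tiny constants chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NBUFFERS  8     /* size of the toy buffer pool (assumption) */
#define TOY_RING_SIZE 4     /* number of ring slots (assumption) */
#define TOY_INVALID   0     /* plays the role of InvalidBuffer */

typedef struct ToyBuffer
{
    int refcount;           /* > 0 means pinned by someone */
    int usage_count;        /* recent-use counter */
} ToyBuffer;

typedef struct ToyStrategy
{
    int current;                    /* index of the current ring slot */
    int current_was_in_ring;        /* did the last lookup hit the ring? */
    int buffers[TOY_RING_SIZE];     /* 1-based buffer numbers, 0 = empty */
} ToyStrategy;

static ToyBuffer toy_pool[TOY_NBUFFERS];

/*
 * Return a reusable buffer from the ring, or NULL if the caller must fall
 * back to the normal allocation path and then call toy_add_buffer_to_ring.
 */
static ToyBuffer *
toy_get_buffer_from_ring(ToyStrategy *s)
{
    int        bufnum;
    ToyBuffer *buf;

    /* Advance to the next ring slot, wrapping around at the end. */
    if (++s->current >= TOY_RING_SIZE)
        s->current = 0;

    /* Simplification: only reject numbers outside this toy pool. */
    bufnum = s->buffers[s->current];
    if (bufnum == TOY_INVALID || bufnum > TOY_NBUFFERS)
    {
        s->current_was_in_ring = 0;
        return NULL;
    }

    /* Reuse only if unpinned and not recently touched by someone else. */
    buf = &toy_pool[bufnum - 1];
    if (buf->refcount == 0 && buf->usage_count <= 1)
    {
        s->current_was_in_ring = 1;
        return buf;
    }

    s->current_was_in_ring = 0;
    return NULL;
}

/* Remember a buffer obtained from the pool in the current ring slot. */
static void
toy_add_buffer_to_ring(ToyStrategy *s, int bufnum)
{
    assert(bufnum >= 1 && bufnum <= TOY_NBUFFERS);
    s->buffers[s->current] = bufnum;
}
```

Starting from a zero-initialized ToyStrategy, the first lookup advances to slot 1 and returns NULL because the slot is empty. After the caller stores a buffer there with toy_add_buffer_to_ring, that buffer is handed back once the ring wraps around to slot 1 again, as long as nobody has pinned it or raised its usage count above 1 in the meantime.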
BAS_BULKREAD
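For bulk sequential scans, community PostgreSQL uses a 256KB ring. That is small enough to fit in the L2 cache, which makes transferring pages from the operating system cache into the shared buffer cache more efficient. On public cloud instances, however, shared memory is generally large, so we adjusted the ring buffer size accordingly. If the buffers in the ring are frequently modified and marked dirty, the ring buffer strategy falls back to the normal clock-sweep algorithm over the whole buffer pool.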
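The fallback for dirty ring buffers can be sketched as follows. This is a simplified model based on my understanding of the community StrategyRejectBuffer logic, where a bulk-read ring drops a dirty buffer from its slot rather than writing it out; PolarDB's version may differ. DemoRing and demo_reject_dirty_buffer are hypothetical names, and the "is this buffer dirty" decision is assumed to have been made by the caller.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_RING_SIZE 4    /* number of ring slots (assumption) */
#define DEMO_INVALID   0    /* plays the role of InvalidBuffer */

typedef enum DemoStrategyType
{
    DEMO_BAS_BULKREAD,
    DEMO_BAS_BULKWRITE,
    DEMO_BAS_VACUUM
} DemoStrategyType;

typedef struct DemoRing
{
    DemoStrategyType btype;
    int  current;                   /* current ring slot */
    bool current_was_in_ring;       /* did the last lookup hit the ring? */
    int  buffers[DEMO_RING_SIZE];   /* 1-based buffer numbers, 0 = empty */
} DemoRing;

/*
 * Called when the candidate victim turned out to be dirty.  Returns true if
 * the caller should give up on this buffer and pick another victim from the
 * whole pool; in that case the buffer is also removed from the ring.
 */
static bool
demo_reject_dirty_buffer(DemoRing *ring, int bufnum)
{
    assert(ring->current >= 0 && ring->current < DEMO_RING_SIZE);

    /* Only bulk reads refuse to write out their own dirty ring buffers. */
    if (ring->btype != DEMO_BAS_BULKREAD)
        return false;

    /* Leave buffers that came from the normal allocation path alone. */
    if (!ring->current_was_in_ring ||
        ring->buffers[ring->current] != bufnum)
        return false;

    /* Empty the slot so the next pass refills it from the pool. */
    ring->buffers[ring->current] = DEMO_INVALID;
    return true;
}
```

Each rejection empties one slot, so a bulk read whose ring buffers keep getting dirtied ends up taking most of its buffers from the shared pool, which is the degradation described above.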
BAS_BULKWRITE
Main scenarios:
COPY FROM
CREATE TABLE AS
/*
 * GetBulkInsertState - prepare status object for a bulk insert
 */
BulkInsertState
GetBulkInsertState(void)
{
    BulkInsertState bistate;

    bistate = (BulkInsertState) palloc(sizeof(BulkInsertStateData));
    bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);
    bistate->current_buf = InvalidBuffer;
    return bistate;
}
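Because I/O in the PolarDB for PostgreSQL architecture is mainly direct I/O (DIO), pwrite/write goes straight to disk rather than to the operating system's page cache, so each I/O is relatively expensive. For COPY FROM, PostgreSQL uses a 16MB ring (but no more than 1/8 of shared_buffers). A ring this small forces the backend process to flush dirty pages frequently, and under the DIO model that severely hurts COPY performance. To reduce or avoid the cost of the backend flushing dirty pages itself, two options are provided:
Increase the size of the bulk-write ring buffer, which reduces the performance impact of the backend flushing dirty pages itself.
Disable the bulk-write ring buffer, so that bulk writes use the normal clock-sweep strategy over the whole buffer pool and the dirty-page flushing processes perform the I/O, which completely avoids the impact of backend-side flushing.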
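For a sense of scale, the community defaults quoted in this article (256KB for bulk reads, 16MB capped at 1/8 of shared_buffers for bulk writes) can be converted into buffer counts. The helper below is a hypothetical sketch, not a PostgreSQL function; it assumes the default 8KB block size and applies the 1/8 cap the way community PostgreSQL does. How PolarDB's polar_ring_buffer_* settings are expressed and capped is not shown in the source and is not modeled here.

```c
#include <assert.h>

/* Assumption for this sketch: the default PostgreSQL page size of 8KB. */
#define SKETCH_BLCKSZ 8192

/*
 * Convert a ring size in bytes into a number of buffers, capped at 1/8 of
 * the buffer pool (nbuffers is the pool size in buffers).
 * sketch_ring_buffers is a hypothetical helper used only for illustration.
 */
static int
sketch_ring_buffers(long ring_bytes, int nbuffers)
{
    long ring_size;

    assert(ring_bytes >= 0 && nbuffers > 0);

    ring_size = ring_bytes / SKETCH_BLCKSZ;

    /* Make sure the ring is not an undue fraction of shared buffers. */
    if (ring_size > nbuffers / 8)
        ring_size = nbuffers / 8;

    return (int) ring_size;
}
```

With shared_buffers at 128MB (16384 buffers), sketch_ring_buffers(256 * 1024, 16384) gives 32 buffers for a bulk read and sketch_ring_buffers(16L * 1024 * 1024, 16384) gives 2048 buffers for a bulk write. With only 8MB of shared buffers (1024 buffers), the bulk-write ring is capped at 128 buffers.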
BAS_VACUUM
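Main scenarios:
VACUUM / autovacuum
If VACUUM itself has to flush dirty pages frequently, its efficiency suffers. For workloads that generate transactions very quickly and also modify large tables very frequently, a VACUUM that keeps doing dirty-page I/O may fail to reclaim transaction IDs in time, which can lead to transaction ID wraparound. VACUUM therefore uses a strategy similar to bulk writes: 1) increase the size of the VACUUM ring buffer; 2) disable the VACUUM ring buffer.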




