每天5分钟,PG聊通透 - 系列1 - 热门问题 - 链接、驱动、SQL - 第9期 - 为什么OFFSET(翻页)会越来越慢?

原创 digoal 2022-01-20

622

作者

digoal

日期

2021-12-20

首先扯远一点, 如果是个数组, 元组size固定, 大家思考一下offset怎么做到最快? 可以根据offset的个数算出起点, 将访问起点直接位移到指定存储地址. 例如4字节的元组, offset 100, 直接跳到400字节开始访问即可. 所以这样的访问不管offset多少性能都不会变慢.

但是, 不管是索引扫描, 还是全表扫描, 亦或索引+其他非索引条件过滤, 输入条件都可能是变化的、而且元组的长度、索引的长度也可能是变化的, 更重要的是数据的组织形式并不是向数组一样顺序组织的. 所以没有办法像array一样, 直接位移存储来进行优化, 这也是为啥offset越后面越慢的原因? 下面用2个例子详细解释一下:

where x=? and y>? order by x limit 10 offset 100
- 使用x索引, y条件并不可知, 所以得遍历符合x条件的索引页.
- 就算y条件也在索引中, 那么请看下面的另一个例子.

order by x limit 10 offset 100
- 使用x索引, 也得从索引遍历100条后, 才能获取后面的. 为什么要遍历这100条呢?
- 索引没有版本号, 不遍历offset的条数, 怎么知道这条记录对当前事务是否可见.
- 就算索引有版本号, 它也得读每条上面的版本号才能判断是否这条记录是否对当前事务可见, 所以还得遍历offset的条数.
- 就算整个page有个全局可见标识, 它怎么知道1个index page里面有多少条记录? 以及一个page的value边界是什么? 所以还得遍历offset的条数.
- 就算叶子结点知道每个index page有多少条, 但是分支节点和根节点总不知道吧?
- 因为任何一个结点, 如果要存储它的所有下游结点有多少条记录, 就必须实时更新.
- 对于根节点来说, 任何的插入更新删除都会涉及根节点的更新, 性能影响一定是极大的.
- 对于分支节点, 只有它下面的叶子结点中的记录有插入更新删除才会涉及到这个分支节点的更新.
- 对于叶子结点来说, 影响只在这个叶子里面的记录的增删改.
- 所以通常不会这么设计, 除非是静态数据.
- 那么通常只有叶子结点才知道它这个page有多少条记录以及value边界, 所以至少还是得按逻辑顺序扫描叶子结点的头信息(获取这个叶子结点有多少条记录).

说这么多, 就是要证明 offset big value 没法在通用数据库的内核中进行优化(除非你的数据是静态数据, 内核才有优化的必要).
如果关心索引结构可以看一下:
- 《PostgreSQL rum 索引结构 - 比gin posting list|tree 的ctid(行号)多了addition info》
- 《[未完待续] PostgreSQL hash 索引结构介绍》
- 《深入浅出PostgreSQL B-Tree索引结构》
- 《PostgreSQL bloom 索引原理》
- 《PostgreSQL RUM 索引原理》
- 《PostgreSQL SP-GiST 索引原理》
- 《PostgreSQL GiST 索引原理 - 4》
- 《PostgreSQL GiST 索引原理 - 3》
- 《PostgreSQL GiST 索引原理 - 2》
- 《PostgreSQL GiST 索引原理 - 1》
- 《从难缠的模糊查询聊开 - PostgreSQL独门绝招之一 GIN , GiST , SP-GiST , RUM 索引原理与技术背景》
- 《重新发现PostgreSQL之美 - 13 brin 时序索引》
- 《PostgreSQL 14 preview - pageinspect 内窥heap,index存储结构 , 新增对gist索引的支持》
- 《PostgreSQL 黑科技 - 空间聚集存储, 内窥GIN, GiST, SP-GiST索引》

3、那么怎么优化呢?
一页一页的翻, 是有优化方法的.

3.1、如果SQL不是在pk上order by, 那么可以order by联合PK索引. (对于没有pk的表, 可以增加1个序列(这个序列自动生成, 不变), 作为order by字段的联合UK). 用最后一条的value作为下次的条件输入, 去除offset.
PS: 加PK或者UK的目的是防止gap.

例子1:
有PK的表.

create unlogged table a (id serial8 primary key, info text, crt_time timestamp);    
insert into a (info,crt_time) select 'test', clock_timestamp() from generate_series(1,1000000);   
复制

问题SQL:
使用offset翻页

create index idx_a_1 on a (info,crt_time);   
select * from a where info='?' order by crt_time limit 10 offset 100;    
-- limit 10 offset 100 扫描了110行   
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a where info='test' order by crt_time limit 10 offset 100;   
                                                             QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------   
 Limit  (cost=3.26..3.55 rows=10 width=21) (actual time=0.072..0.078 rows=10 loops=1)   
   Output: id, info, crt_time   
   Buffers: shared hit=4   
   ->  Index Scan using idx_a_1 on public.a  (cost=0.42..28387.47 rows=1000000 width=21) (actual time=0.031..0.062 rows=110 loops=1)   
         Output: id, info, crt_time   
         Index Cond: (a.info = 'test'::text)   
         Buffers: shared hit=4   
 Planning:   
   Buffers: shared hit=28   
 Planning Time: 0.473 ms   
 Execution Time: 0.101 ms   
(11 rows)   
复制

优化:
联合PK进行排序, 使用上一次翻页的最后一条, 作为临界条件传入下一次where条件中, 从而避免使用offset.

create index idx_a_2 on a (info,crt_time,id);    
select * from a where info=? and crt_time>=? and id>? order by crt_time,id limit 10 ;   
postgres=# select * from a where info='test' and crt_time>='2021-12-20 12:23:38.403651' and id>1 order by crt_time,id limit 10 ;   
   id    | info |          crt_time             
---------+------+----------------------------   
 1000102 | test | 2021-12-20 12:23:38.403651   
 1000103 | test | 2021-12-20 12:23:38.403671   
 1000104 | test | 2021-12-20 12:23:38.403673   
 1000105 | test | 2021-12-20 12:23:38.403675   
 1000106 | test | 2021-12-20 12:23:38.403677   
 1000107 | test | 2021-12-20 12:23:38.403678   
 1000108 | test | 2021-12-20 12:23:38.40368   
 1000109 | test | 2021-12-20 12:23:38.403683   
 1000110 | test | 2021-12-20 12:23:38.403685   
 1000111 | test | 2021-12-20 12:23:38.403687   
(10 rows)   
postgres=# select * from a where info='test' and crt_time>='2021-12-20 12:23:38.403687' and id>1000111 order by crt_time,id limit 10 ;   
   id    | info |          crt_time             
---------+------+----------------------------   
 1000112 | test | 2021-12-20 12:23:38.403689   
 1000113 | test | 2021-12-20 12:23:38.403691   
 1000114 | test | 2021-12-20 12:23:38.403693   
 1000115 | test | 2021-12-20 12:23:38.403695   
 1000116 | test | 2021-12-20 12:23:38.403696   
 1000117 | test | 2021-12-20 12:23:38.403698   
 1000118 | test | 2021-12-20 12:23:38.4037   
 1000119 | test | 2021-12-20 12:23:38.403715   
 1000120 | test | 2021-12-20 12:23:38.403717   
 1000121 | test | 2021-12-20 12:23:38.403719   
(10 rows)   
-- 使用上次的条件作为下次的临界条件, limit 10只需要扫描10行   
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a    
where info='test' and crt_time>='2021-12-20 12:23:38.403687' and id>1000111    
order by crt_time, id limit 10 ;   
                                                                    QUERY PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------   
 Limit  (cost=0.42..0.70 rows=10 width=21) (actual time=0.032..0.038 rows=10 loops=1)   
   Output: id, info, crt_time   
   Buffers: shared hit=4   
   ->  Index Only Scan using idx_a_2 on public.a  (cost=0.42..27946.47 rows=999789 width=21) (actual time=0.030..0.033 rows=10 loops=1)   
         Output: id, info, crt_time   
         Index Cond: ((a.info = 'test'::text) AND (a.crt_time >= '2021-12-20 12:23:38.403687'::timestamp without time zone) AND (a.id > 1000111))   
         Heap Fetches: 0   
         Buffers: shared hit=4   
 Planning:   
   Buffers: shared hit=4   
 Planning Time: 0.196 ms   
 Execution Time: 0.062 ms   
(12 rows)   
复制

例子2:
没有PK、UK的表.

create unlogged table b (info text, crt_time timestamp, c1 int);      
insert into b (info,crt_time, c1) select 'test', clock_timestamp(), random()*100 from generate_series(1,1000000);   
复制

问题SQL:
使用offset翻页.

create index idx_b_1 on b (info, c1);   
select * from b where info =? order by c1 limit 10 offset 100;    
postgres=# select * from b where info ='test' order by c1 limit 10 offset 1000;    
 info |          crt_time          | c1    
------+----------------------------+----   
 test | 2021-12-20 13:46:11.975846 |  0    
 test | 2021-12-20 13:46:11.975895 |  0    
 test | 2021-12-20 13:46:11.976153 |  0    
 test | 2021-12-20 13:46:11.976212 |  0    
 test | 2021-12-20 13:46:11.976319 |  0    
 test | 2021-12-20 13:46:11.976413 |  0    
 test | 2021-12-20 13:46:11.976444 |  0    
 test | 2021-12-20 13:46:11.97667  |  0    
 test | 2021-12-20 13:46:11.977002 |  0    
 test | 2021-12-20 13:46:11.977111 |  0    
(10 rows)   
-- limit 10 offset 1000 扫描了1010行   
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from b where info ='test' order by c1 limit 10 offset 1000;    
                                                              QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------   
 Limit  (cost=26.56..26.82 rows=10 width=25) (actual time=0.992..1.006 rows=10 loops=1)   
   Output: info, crt_time, c1, uk   
   Buffers: shared hit=716   
   ->  Index Scan using idx_b_1 on public.b  (cost=0.42..26131.08 rows=1000000 width=25) (actual time=0.034..0.909 rows=1010 loops=1)   
         Output: info, crt_time, c1, uk   
         Index Cond: (b.info = 'test'::text)   
         Buffers: shared hit=716   
 Planning Time: 0.112 ms   
 Execution Time: 1.027 ms   
(9 rows)   
复制

优化:
新建一个字段, 存储序列值防止重复, 联合这个序列值进行排序, 使用上一次翻页的最后一条, 作为临界条件传入下一次where条件中, 从而避免使用offset.

alter table b add column uk serial8;  -- 注意会rewrite table. 大表慎重.   
-- 如果想在逻辑层面确保唯一, 可以再加个唯一约束. 但是会增加索引开销. 看业务需求再做决定   
-- create unique index idx_b_uk on b (uk);   
create index idx_b_2 on b (info, c1, uk);    
select * from a where info=? and c1>=? and uk>? order by c1,uk limit 10 ;   
-- 先不用优化SQL查询一下    
postgres=# select * from b where info ='test' order by c1,uk limit 10 offset 1000;    
 info |          crt_time          | c1 |   uk      
------+----------------------------+----+--------   
 test | 2021-12-20 13:46:11.975846 |  0 | 210694   
 test | 2021-12-20 13:46:11.975895 |  0 | 210707   
 test | 2021-12-20 13:46:11.976153 |  0 | 211162   
 test | 2021-12-20 13:46:11.976212 |  0 | 211219   
 test | 2021-12-20 13:46:11.976319 |  0 | 211385   
 test | 2021-12-20 13:46:11.976413 |  0 | 211531   
 test | 2021-12-20 13:46:11.976444 |  0 | 211599   
 test | 2021-12-20 13:46:11.97667  |  0 | 211880   
 test | 2021-12-20 13:46:11.977002 |  0 | 212354   
 test | 2021-12-20 13:46:11.977111 |  0 | 212528   
(10 rows)    
-- limit 10 offset 1000 扫描了1010行   
postgres=# explain (analyze,verbose,timing,costs,buffers)   
postgres-# select * from b where info ='test' order by c1,uk limit 10 offset 1000;    
                                                              QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------   
 Limit  (cost=31.06..31.36 rows=10 width=25) (actual time=1.181..1.195 rows=10 loops=1)   
   Output: info, crt_time, c1, uk   
   Buffers: shared hit=720   
   ->  Index Scan using idx_b_2 on public.b  (cost=0.42..30632.28 rows=1000000 width=25) (actual time=0.033..1.070 rows=1010 loops=1)   
         Output: info, crt_time, c1, uk   
         Index Cond: (b.info = 'test'::text)   
         Buffers: shared hit=720   
 Planning Time: 0.112 ms   
 Execution Time: 1.221 ms   
(9 rows)   
-- 优化SQL如下   
postgres=# select * from b where info='test' and c1>=0 and uk>212528 order by c1,uk limit 10 ;   
 info |          crt_time          | c1 |   uk      
------+----------------------------+----+--------   
 test | 2021-12-20 13:46:11.977324 |  0 | 212835   
 test | 2021-12-20 13:46:11.97776  |  0 | 213058   
 test | 2021-12-20 13:46:11.97831  |  0 | 213454   
 test | 2021-12-20 13:46:11.978551 |  0 | 213641   
 test | 2021-12-20 13:46:11.978914 |  0 | 213871   
 test | 2021-12-20 13:46:11.979195 |  0 | 214152   
 test | 2021-12-20 13:46:11.979217 |  0 | 214179   
 test | 2021-12-20 13:46:11.979622 |  0 | 214469   
 test | 2021-12-20 13:46:11.980445 |  0 | 215447   
 test | 2021-12-20 13:46:11.980464 |  0 | 215488   
(10 rows)   
-- 使用上次的条件作为下次的临界条件, limit 10只需要扫描10行   
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from b where info='test' and c1>=0 and uk>212528 order by c1,uk limit 10 ;   
                                                            QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------   
 Limit  (cost=0.42..1.21 rows=10 width=25) (actual time=0.027..0.072 rows=10 loops=1)   
   Output: info, crt_time, c1, uk   
   Buffers: shared hit=12   
   ->  Index Scan using idx_b_2 on public.b  (cost=0.42..26206.70 rows=333301 width=25) (actual time=0.026..0.068 rows=10 loops=1)   
         Output: info, crt_time, c1, uk   
         Index Cond: ((b.info = 'test'::text) AND (b.c1 >= 0) AND (b.uk > 212528))   
         Buffers: shared hit=12   
 Planning Time: 0.134 ms   
 Execution Time: 0.097 ms   
(9 rows)   
复制

3.2、如果不想加1个UK来防止返回值的gap, 可以使用PG的新特性fetch ties. (但是注意: 如果某个value的重复值非常多, gang会很大很大, 有一定风险, 例如可能打爆客户端内存.)
《PostgreSQL 13 offset fetch first with ties - 返回ordered peer行S》

例子1:

create unlogged table c (info text, crt_time timestamp, c1 int);      
insert into c (info,crt_time, c1) select 'test', clock_timestamp(), random()*100 from generate_series(1,1000000);   
复制

问题SQL:

create index idx_c_1 on c (info, c1);    
select * from c where info=? order by c1 limit 10 offset 1000 ;    
postgres=# select * from c where info='test' order by c1 limit 10 offset 1000 ;    
 info |          crt_time          | c1    
------+----------------------------+----   
 test | 2021-12-20 13:56:32.740037 |  0   
 test | 2021-12-20 13:56:32.74006  |  0   
 test | 2021-12-20 13:56:32.740173 |  0   
 test | 2021-12-20 13:56:32.740293 |  0   
 test | 2021-12-20 13:56:32.740364 |  0   
 test | 2021-12-20 13:56:32.74052  |  0   
 test | 2021-12-20 13:56:32.740723 |  0   
 test | 2021-12-20 13:56:32.741196 |  0   
 test | 2021-12-20 13:56:32.741273 |  0   
 test | 2021-12-20 13:56:32.741303 |  0   
(10 rows)   
复制

优化:

-- 使用上次的条件作为下次的临界条件, 同时为了防止gap, 返还所有order by的字段最后1条的value相同的所有行.    
-- 这里c1=1有10023行, 全部返回. 扫描了10024行, 即判断第10024行发现c1的值不是1了, 说明已全部返回.    
select * from c where info='test' and c1 > ? order by c1 fetch first 10 row with ties;    
select * from c where info='test' and c1 > 0 order by c1 fetch first 10 row with ties;    
...   
 test | 2021-12-20 13:56:33.289016 |  1   
 test | 2021-12-20 13:56:33.289023 |  1   
 test | 2021-12-20 13:56:33.289037 |  1   
(10023 rows)   
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from c where info='test' and c1 > 0 order by c1 fetch first 10 row with ties;    
                                                              QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------   
 Limit  (cost=0.42..0.70 rows=10 width=17) (actual time=0.034..10.198 rows=10023 loops=1)   
   Output: info, crt_time, c1   
   Buffers: shared hit=5064   
   ->  Index Scan using idx_c_1 on public.c  (cost=0.42..27480.01 rows=994867 width=17) (actual time=0.033..7.837 rows=10024 loops=1)   
         Output: info, crt_time, c1   
         Index Cond: ((c.info = 'test'::text) AND (c.c1 > 0))   
         Buffers: shared hit=5064   
 Planning Time: 0.118 ms   
 Execution Time: 11.191 ms   
(9 rows)   
-- 再取翻页的就取c1>1的了.    
postgres=#        explain (analyze,verbose,timing,costs,buffers) select * from c where info='test' and c1 > 1 order by c1 fetch first 10 row with ties;    
                                                              QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------   
 Limit  (cost=0.42..0.70 rows=10 width=17) (actual time=0.023..10.740 rows=10241 loops=1)   
   Output: info, crt_time, c1   
   Buffers: shared hit=5160 read=8   
   ->  Index Scan using idx_c_1 on public.c  (cost=0.42..27230.51 rows=984715 width=17) (actual time=0.022..8.970 rows=10242 loops=1)   
         Output: info, crt_time, c1   
         Index Cond: ((c.info = 'test'::text) AND (c.c1 > 1))   
         Buffers: shared hit=5160 read=8   
 Planning Time: 0.112 ms   
 Execution Time: 11.442 ms   
(9 rows)   
复制

4、直接跳到第N页? 没法优化.
- 因为前面的临界值已经不知道了, 只能offset, 没有办法优化.

参考:
- 《PostgreSQL 分页, offset, 返回顺序, 扫描方法原理(seqscan, index scan, index only scan, bitmap scan, parallel xx scan)，游标》
- 《PostgreSQL 范围过滤 + 其他字段排序OFFSET LIMIT(多字段区间过滤)的优化与加速》
- 《PostgreSQL 索引扫描offset内核优化 - case》
- 《PostgreSQL 数据访问 offset 的质变 case》
- 《论count与offset使用不当的罪名和分页的优化》
- 《PostgreSQL offset 原理，及使用注意事项》
- 《分页优化 - order by limit x offset y performance tuning》
- 《PostgreSQL 13 offset fetch first with ties - 返回ordered peer行S》

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

每天5分钟,PG聊通透 - 系列1 - 热门问题 - 链接、驱动、SQL - 第9期 - 为什么OFFSET(翻页)会越来越慢?

作者

日期

标签

9、为什么OFFSET会越来越慢?

期望 PostgreSQL 增加什么功能?

PolarDB for PostgreSQL云原生分布式开源数据库

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

评论

相关阅读