Author
digoal
Date
2020-06-10
Tags
PostgreSQL, hll, uv, estimation
Background
Retention analysis: count the number of users on each retention day. Setting aside any other query dimensions, this post compares the efficiency of three approaches.
For example, if a user's first login is on June 1, then logins on June 2 and June 6 count as day-1 and day-5 retention respectively.
The goal is to count the number of users for each retention day.
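A retention day is simply the date difference between a later login and the first login, e.g.:
```
-- day-5 retention: login on 2020-06-06 after a first login on 2020-06-01
select date '2020-06-06' - date '2020-06-01' as retained_day;   -- returns 5
```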
Approach 1:
```
-- 1 million users, 100 array elements per user (52-core machine, PostgreSQL 12)
CREATE TABLE user_retain (
  user_id serial PRIMARY KEY,
  fst_login_date date,
  pay_retained_num int[]   -- retention days stored as an array
);

-- generate 100 random retention days in the range 1..365
create or replace function gen_rand() returns int[] as $$
  select array_agg((ceil(random()*365)::int)) from generate_series(1,100);
$$ language sql strict;

insert into user_retain select generate_series(1,1000000), now(), gen_rand();
```
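A quick sanity check on the generated data (not part of the original post); the heap size should be roughly consistent with the 58824 shared buffers (about 460 MB) reported in the plans below:
```
-- Sanity check (assumption: not shown in the original post)
select count(*) as users,
       avg(cardinality(pay_retained_num)) as avg_array_len,
       pg_size_pretty(pg_table_size('user_retain')) as heap_size
from user_retain;
```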
```
explain (analyze,verbose,timing,costs,buffers)
select unnest(pay_retained_num),count(*) from user_retain group by 1;
```
About 18 seconds (serial).
Why is it so slow, and why doesn't parallelism help?
After unnest, the work cannot be parallelized all the way through: only the set-returning function (ProjectSet) runs in the parallel workers, there is no partial aggregation below the Gather, so all 100 million unnested rows are shipped through Gather and the parallel plan (about 30 seconds, first plan below) ends up slower than the serial one (about 18.5 seconds, second plan below). I think PostgreSQL could optimize this at the kernel level.
```
QUERY PLAN
HashAggregate (cost=70326.50..86732.89 rows=1000008 width=12) (actual time=29802.202..29806.633 rows=365 loops=1)
Output: (unnest(pay_retained_num)), count(*)
Group Key: (unnest(user_retain.pay_retained_num))
Planned Partitions: 8
Peak Memory Usage: 6193 kB
Buffers: shared read=58824
-> Gather (cost=0.00..61420.18 rows=1000008 width=4) (actual time=4.392..16521.356 rows=100000000 loops=1)
Output: (unnest(pay_retained_num))
Workers Planned: 26
Workers Launched: 26
Buffers: shared read=58824
-> ProjectSet (cost=0.00..61420.18 rows=384620 width=4) (actual time=0.040..284.310 rows=3846154 loops=26)
Output: unnest(pay_retained_num)
Buffers: shared read=58824
Worker 0: actual time=0.035..226.005 rows=3107600 loops=1
Buffers: shared read=1828
Worker 1: actual time=0.037..327.656 rows=4547500 loops=1
Buffers: shared read=2675
Worker 2: actual time=0.036..512.300 rows=6907100 loops=1
Buffers: shared read=4063
......
Buffers: shared read=1908
Worker 22: actual time=0.038..262.740 rows=3689000 loops=1
Buffers: shared read=2170
Worker 23: actual time=0.040..203.822 rows=2828800 loops=1
Buffers: shared read=1664
Worker 24: actual time=0.042..161.236 rows=2249100 loops=1
Buffers: shared read=1323
Worker 25: actual time=0.046..340.726 rows=4622300 loops=1
Buffers: shared read=2719
-> Parallel Seq Scan on public.user_retain (cost=0.00..59208.62 rows=38462 width=424) (actual time=0.033..25.329 rows=38462 loops=26)
Output: user_id, fst_login_date, pay_retained_num
Buffers: shared read=58824
Worker 0: actual time=0.029..21.766 rows=31076 loops=1
Buffers: shared read=1828
......
Buffers: shared read=1664
Worker 24: actual time=0.035..16.011 rows=22491 loops=1
Buffers: shared read=1323
Worker 25: actual time=0.036..30.783 rows=46223 loops=1
Buffers: shared read=2719
Planning Time: 0.265 ms
Buffers: shared hit=34 read=4
Execution Time: 29809.325 ms
(124 rows)
```
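The serial plan below was presumably captured with parallel query disabled for this statement; the post does not say how, but one way is:
```
-- Assumption: disable parallelism to reproduce the serial (non-Gather) plan below
set max_parallel_workers_per_gather = 0;
```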
```
postgres=# explain (analyze,verbose,timing,costs,buffers)
select unnest(pay_retained_num),count(*) from user_retain group by 1;
QUERY PLAN
HashAggregate (cost=215387.75..379451.57 rows=10000080 width=12) (actual time=18522.487..18524.227 rows=365 loops=1)
Output: (unnest(pay_retained_num)), count(*)
Group Key: unnest(user_retain.pay_retained_num)
Planned Partitions: 64
Peak Memory Usage: 6193 kB
Buffers: shared hit=58824
-> ProjectSet (cost=0.00..126324.54 rows=10000080 width=4) (actual time=0.012..6471.708 rows=100000000 loops=1)
Output: unnest(pay_retained_num)
Buffers: shared hit=58824
-> Seq Scan on public.user_retain (cost=0.00..68824.08 rows=1000008 width=424) (actual time=0.009..102.847 rows=1000000 loops=1)
Output: user_id, fst_login_date, pay_retained_num
Buffers: shared hit=58824
Planning Time: 0.065 ms
Execution Time: 18525.843 ms
(14 rows)
```
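A possible rewrite, not tested in the original post, is to move the unnest into the FROM clause via LATERAL so that the set-returning function becomes a join node rather than a ProjectSet in the target list; whether the planner then places a partial aggregate below the Gather depends on version and cost settings, so treat this only as a sketch to try:
```
-- Sketch (assumption): unnest in FROM via LATERAL instead of in the target list
explain (analyze,verbose,timing,costs,buffers)
select x.retained_day, count(*)
from user_retain u
cross join lateral unnest(u.pay_retained_num) as x(retained_day)
group by 1;
```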
Approach 2:
Store the array unnested, one row per (user_id, retention day).
```
CREATE TABLE user_retain1 (
  user_id serial,
  fst_login_date date,
  pay_retained_num int
);
insert into user_retain1 select user_id, fst_login_date, unnest(pay_retained_num) from user_retain;
alter table user_retain1 set (parallel_workers = 26);
```
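After flattening, user_retain1 holds 1,000,000 users x 100 elements = 100 million rows (matching rows=100000000 in the plans); this can be confirmed with:
```
select count(*) from user_retain1;   -- expected: 100000000
```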
With 26 parallel workers (settings below): 0.8 seconds.
```
max_worker_processes = 32
max_parallel_workers_per_gather = 26
parallel_leader_participation = off
max_parallel_workers = 32
parallel_tuple_cost = 0
parallel_setup_cost = 0
min_parallel_table_scan_size = 0
min_parallel_index_scan_size = 0
explain (analyze,verbose,timing,costs,buffers)
select pay_retained_num,count(*) from user_retain1 group by 1;
QUERY PLAN
Finalize HashAggregate (cost=598284.46..598288.11 rows=365 width=12) (actual time=859.487..859.525 rows=365 loops=1)
Group Key: pay_retained_num
Peak Memory Usage: 61 kB
-> Gather (cost=598233.36..598237.01 rows=9490 width=12) (actual time=856.866..858.653 rows=9490 loops=1)
Workers Planned: 26
Workers Launched: 26
-> Partial HashAggregate (cost=598233.36..598237.01 rows=365 width=12) (actual time=851.471..851.508 rows=365 loops=26)
Group Key: pay_retained_num
Peak Memory Usage: 0 kB
-> Parallel Seq Scan on user_retain1 (cost=0.00..579002.57 rows=3846157 width=4) (actual time=0.033..294.534 rows=3846154 loops=26)
Planning Time: 0.100 ms
Execution Time: 860.386 ms
(12 rows)
```
Approach 3:
Approximate counting: one row per retention day, with the user_ids for that retention day added into an hll value.
```
create extension hll;

create table t_hll (
  pay_retained_num int primary key,
  u_hll hll   -- approximate set of user_ids for this retention day
);

-- roll up the flattened table: one hll of hashed user_ids per retention day
insert into t_hll select pay_retained_num, hll_add_agg(hll_hash_integer(user_id)) from user_retain1 group by pay_retained_num;

-- # returns the estimated distinct count stored in the hll
select pay_retained_num, # u_hll from t_hll order by 1;
```
3 milliseconds.
```
QUERY PLAN
Index Scan using t_hll_pkey on t_hll (cost=0.15..81.53 rows=365 width=12) (actual time=0.020..3.101 rows=365 loops=1)
Planning Time: 0.050 ms
Execution Time: 3.121 ms
(3 rows)
```
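A sketch, not from the original post, of how the rollup could be maintained incrementally as new retained users arrive instead of rebuilding it from user_retain1; user 123 and retention day 7 are placeholder values, and hll_empty / hll_add are the standard functions of the hll extension:
```
-- Sketch (assumption): incrementally add one (user, retention day) pair
insert into t_hll (pay_retained_num, u_hll)
values (7, hll_add(hll_empty(), hll_hash_integer(123)))
on conflict (pay_retained_num)
do update set u_hll = hll_add(t_hll.u_hll, hll_hash_integer(123));
```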
Difference between the hll estimates and the exact counts
```
with a as (
select pay_retained_num, count(*) as cnt from user_retain1 group by 1),
b as (select pay_retained_num, # u_hll as cnt from t_hll)
select a.pay_retained_num, a.cnt as exact_cnt, b.cnt as hll_cnt from a,b where a.pay_retained_num=b.pay_retained_num order by abs(a.cnt-b.cnt);
 pay_retained_num | exact_cnt |       hll_cnt
------------------+-----------+---------------------
               18 |    273741 |  242072.93394329798
              226 |    273461 |    241358.471066233
              257 |    273317 |  241062.44806733675
              202 |    274096 |  241617.25101208987
              245 |    273489 |  240817.31849724415
              319 |    273263 |    240535.045171427
... ...
```
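The estimates above run roughly 11-12% below the exact counts. If tighter estimates are needed, hll_add_agg accepts an optional log2m argument to use more registers (larger hll values, lower expected error); a sketch, with t_hll_precise as a hypothetical table name, worth validating against the exact counts as above:
```
-- Sketch (assumption): rebuild the rollup with log2m = 17 for more registers
create table t_hll_precise (
  pay_retained_num int primary key,
  u_hll hll
);
insert into t_hll_precise
select pay_retained_num, hll_add_agg(hll_hash_integer(user_id), 17)
from user_retain1
group by pay_retained_num;

select pay_retained_num, # u_hll from t_hll_precise order by 1;
```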