PostgreSQL hll 在留存、UV统计中的通用用法

digoal 2020-06-10

700

作者

digoal

日期

2020-06-10

背景

留存评估, 评估每个留存日的用户数. 抛去其他维度的查询条件, 对比三种方案的效率.

例如第一次登陆为6-1, 那么在6-2, 6-6登陆表示1,5日留存.

目的是统计每个留存日的用户数.

方案1 :

```
-- 100万用户, 每个用户100个number (52核 pg 12)
CREATE TABLE user_retain (
user_id serial PRIMARY KEY,
fst_login_date date,
pay_retained_num int [] -- 数组存储, 表示留存日.
);

create or replace function gen_rand() returns int[] as $$
select array_AGG((ceil(random()*365)::int)) from generate_series(1,100);
$$ language sql strict;

insert into user_retain select generate_series(1,1000000), now(), gen_rand();
```

explain (analyze,verbose,timing,costs,buffers) select unnest(pay_retained_num),count(*) from user_retain group by 1;

18秒

为什么要18秒?

unnest 后没发立即并行, 只是function并行, 并没有并行聚合操作, 反而因为需要gather导致并行比不并行还慢. 当然这个我认为PG是可以在内核层面优化的.

```
QUERY PLAN

HashAggregate (cost=70326.50..86732.89 rows=1000008 width=12) (actual time=29802.202..29806.633 rows=365 loops=1)
Output: (unnest(pay_retained_num)), count(*)
Group Key: (unnest(user_retain.pay_retained_num))
Planned Partitions: 8
Peak Memory Usage: 6193 kB
Buffers: shared read=58824
-> Gather (cost=0.00..61420.18 rows=1000008 width=4) (actual time=4.392..16521.356 rows=100000000 loops=1)
Output: (unnest(pay_retained_num))
Workers Planned: 26
Workers Launched: 26
Buffers: shared read=58824
-> ProjectSet (cost=0.00..61420.18 rows=384620 width=4) (actual time=0.040..284.310 rows=3846154 loops=26)
Output: unnest(pay_retained_num)
Buffers: shared read=58824
Worker 0: actual time=0.035..226.005 rows=3107600 loops=1
Buffers: shared read=1828
Worker 1: actual time=0.037..327.656 rows=4547500 loops=1
Buffers: shared read=2675
Worker 2: actual time=0.036..512.300 rows=6907100 loops=1
Buffers: shared read=4063
......
Buffers: shared read=1908
Worker 22: actual time=0.038..262.740 rows=3689000 loops=1
Buffers: shared read=2170
Worker 23: actual time=0.040..203.822 rows=2828800 loops=1
Buffers: shared read=1664
Worker 24: actual time=0.042..161.236 rows=2249100 loops=1
Buffers: shared read=1323
Worker 25: actual time=0.046..340.726 rows=4622300 loops=1
Buffers: shared read=2719
-> Parallel Seq Scan on public.user_retain (cost=0.00..59208.62 rows=38462 width=424) (actual time=0.033..25.329 rows=38462 loops=26)
Output: user_id, fst_login_date, pay_retained_num
Buffers: shared read=58824
Worker 0: actual time=0.029..21.766 rows=31076 loops=1
Buffers: shared read=1828
......
Buffers: shared read=1664
Worker 24: actual time=0.035..16.011 rows=22491 loops=1
Buffers: shared read=1323
Worker 25: actual time=0.036..30.783 rows=46223 loops=1
Buffers: shared read=2719
Planning Time: 0.265 ms
Buffers: shared hit=34 read=4
Execution Time: 29809.325 ms
(124 rows)
```

```
postgres=# explain (analyze,verbose,timing,costs,buffers)
select unnest(pay_retained_num),count(*) from user_retain group by 1;
QUERY PLAN

HashAggregate (cost=215387.75..379451.57 rows=10000080 width=12) (actual time=18522.487..18524.227 rows=365 loops=1)
Output: (unnest(pay_retained_num)), count(*)
Group Key: unnest(user_retain.pay_retained_num)
Planned Partitions: 64
Peak Memory Usage: 6193 kB
Buffers: shared hit=58824
-> ProjectSet (cost=0.00..126324.54 rows=10000080 width=4) (actual time=0.012..6471.708 rows=100000000 loops=1)
Output: unnest(pay_retained_num)
Buffers: shared hit=58824
-> Seq Scan on public.user_retain (cost=0.00..68824.08 rows=1000008 width=424) (actual time=0.009..102.847 rows=1000000 loops=1)
Output: user_id, fst_login_date, pay_retained_num
Buffers: shared hit=58824
Planning Time: 0.065 ms
Execution Time: 18525.843 ms
(14 rows)
```

方案2 :

展开array存储

```
CREATE TABLE user_retain1 (
user_id serial ,
fst_login_date date,
pay_retained_num int
);

insert into user_retain1 select user_id,fst_login_date,unnest(pay_retained_num) from user_retain;

alter table user_retain1 set (parallel_workers =26);
```

26个并行, 0.8秒

```
max_worker_processes = 32 max_parallel_workers_per_gather = 26 parallel_leader_participation = off max_parallel_workers = 32 parallel_tuple_cost = 0 parallel_setup_cost = 0 min_parallel_table_scan_size = 0 min_parallel_index_scan_size = 0

explain (analyze,verbose,timing,costs,buffers)
select pay_retained_num,count(*) from user_retain1 group by 1;

                                                                 QUERY PLAN

Finalize HashAggregate (cost=598284.46..598288.11 rows=365 width=12) (actual time=859.487..859.525 rows=365 loops=1)
Group Key: pay_retained_num
Peak Memory Usage: 61 kB
-> Gather (cost=598233.36..598237.01 rows=9490 width=12) (actual time=856.866..858.653 rows=9490 loops=1)
Workers Planned: 26
Workers Launched: 26
-> Partial HashAggregate (cost=598233.36..598237.01 rows=365 width=12) (actual time=851.471..851.508 rows=365 loops=26)
Group Key: pay_retained_num
Peak Memory Usage: 0 kB
-> Parallel Seq Scan on user_retain1 (cost=0.00..579002.57 rows=3846157 width=4) (actual time=0.033..294.534 rows=3846154 loops=26)
Planning Time: 0.100 ms
Execution Time: 860.386 ms
(12 rows)
```

方案3 :

估值计算, 每个留存日一条记录, 将留存天数对应的UID写入到hll类型中.

```
create extension hll;

create table t_hll (
pay_retained_num int primary key,
u_hll hll
);

insert into t_hll select pay_retained_num, hll_add_agg(hll_hash_integer(user_id)) from user_retain1 group by pay_retained_num;

select pay_retained_num, # u_hll from t_hll order by 1;
```

3毫秒.

```
QUERY PLAN

Index Scan using t_hll_pkey on t_hll (cost=0.15..81.53 rows=365 width=12) (actual time=0.020..3.101 rows=365 loops=1)
Planning Time: 0.050 ms
Execution Time: 3.121 ms
(3 rows)
```

估值计算的差异

```
with a as (
select pay_retained_num, count(*) as cnt from user_retain1 group by 1),
b as (select pay_retained_num, # u_hll as cnt from t_hll)
select a.pay_retained_num, a.cnt, b.cnt from a,b where a.pay_retained_num=b.pay_retained_num order by abs(a.cnt-b.cnt);

pay_retained_num | cnt | cnt
------------------+--------+--------------------
18 | 273741 | 242072.93394329798
226 | 273461 | 241358.471066233
257 | 273317 | 241062.44806733675
202 | 274096 | 241617.25101208987
245 | 273489 | 240817.31849724415
319 | 273263 | 240535.045171427
... ...
```

文章转载自digoal，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

PostgreSQL hll 在留存、UV统计中的通用用法

作者

日期

标签

背景

方案1 :

方案2 :

方案3 :

PostgreSQL 许愿链接

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

评论