暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

pipelinedb 团队加入Confluent,TimescaleDB时序插件支持准实时聚合(类流式计算)

digoal 2019-05-09
921

作者

digoal

日期

2019-05-09

标签

PostgreSQL , timescaledb , pipelinedb


背景

pipelinedb的最终版本停留在1.0,插件化,支持PG 10以上的版本。阿帕奇开源许可。PIPELINEDB团队全部加入Confluent。不会再有新的pipelinedb版本输出。但是1.0的版本会持续的进行BUG FIX。

另外一个好消息是timescaledb发布了1.3.0,支持时序数据的基础上,再增加了类似流计算的能力(pipelinedb团队也建议用户可以使用timescaledb),实时聚合。

This release contains major new functionality that we call continuous aggregates.

Aggregate queries which touch large swathes of time-series data can take a long time to compute because the system needs to scan large amounts of data on every query execution. Our continuous aggregates continuously calculate the results of a query in the background and materialize the results. Queries to the continuous aggregate view are then significantly faster as they do not need to touch the raw data in the hypertable, instead using the pre-computed aggregates in the view.

Continuous aggregates are somewhat similar to PostgreSQL materialized views, but unlike a materialized view, continuous aggregates do not need to be refreshed manually; the view will be refreshed automatically in the background as new data is added, or old data is modified. Additionally, it does not need to re-calculate all of the data on every refresh. Only new and/or invalidated data will be calculated. Since this re-aggregation is automatic, it doesn’t add any maintenance burden to your database.

Our continuous aggregate approach supports high-ingest rates by avoiding the high-write amplification associated with trigger-based approaches. Instead, we use invalidation techniques to track what data has changed, and then correct the materialized aggregate the next time that the automated process executes.

More information can be found on our docs overview or in this tutorial.

timescale实时聚合例子

https://docs.timescale.com/v1.3/using-timescaledb/continuous-aggregates

1、创建hyper table

CREATE TABLE device_readings ( observation_time TIMESTAMPTZ NOT NULL, device_id TEXT NOT NULL, metric DOUBLE PRECISION NOT NULL, PRIMARY KEY(observation_time, device_id) ); SELECT create_hypertable('device_readings', 'observation_time');

2、基于hyper table创建准实时聚合视图

视图的聚合键必须是时序字段time_bucket

CREATE VIEW device_summary WITH (timescaledb.continuous) --This flag is what makes the view continuous AS SELECT time_bucket('1 hour', observation_time) as bucket, --time_bucket is required device_id, avg(metric) as metric_avg, --We can use any parallelizable aggregate max(metric)-min(metric) as metric_spread --We can also use expressions on aggregates and constants FROM device_readings GROUP BY bucket, device_id; --We have to group by the bucket column, but can also add other group-by columns

例子2

CREATE VIEW cagg_rides_view WITH (timescaledb.continuous, timescaledb.refresh_interval = ’30m’) AS SELECT vendor_id, time_bucket('1h', pickup_datetime) as day, count(*) total_rides, avg(fare_amount) avg_fare, max(trip_distance) as max_trip_distance, min(trip_distance) as min_trip_distance FROM rides GROUP BY vendor_id, time_bucket('1h', pickup_datetime);

3、查看视图结构

\d cagg_rides_view View "public.cagg_rides_view" Column | Type | Collation | Nullable | Default -------------------+-----------------------------+-----------+----------+--------- vendor_id | text | | | day | timestamp without time zone | | | total_rides | bigint | | | avg_fare | numeric | | | max_trip_distance | numeric | | | min_trip_distance | numeric | | |

4、查看视图的聚合统计信息

SELECT view_name, refresh_lag, refresh_interval, max_interval_per_job, materialization_hypertable FROM timescaledb_information.continuous_aggregates; -[ RECORD 1 ]--------------+------------------------------------------------- view_name | cagg_rides_view refresh_lag | 02:00:00 refresh_interval | 00:30:00 max_interval_per_job | 20:00:00 materialization_hypertable | _timescaledb_internal._materialized_hypertable_2

5、视图相关参数

timescaledb.max_interval_per_job,一次增量合并最多计算多少原始hyper table表的数据,例如30分钟,那么一次最多合并30分钟的记录。

timescaledb.refresh_interval ,多长时间刷新一次视图

timescaledb.refresh_lag , 至少延迟多少,例如延迟1小时,那么hyper table里面的数据在到达一小时后才会开始合并到视图中。设置这个参数的目的可能是担心造成空洞(例如有些时间戳可能很早,但是事务提交时间很晚的记录,可能会被漏掉:猜测)

6、修改视图参数

ALTER VIEW device_summary SET (timescaledb.refresh_interval = '10 min');

其他详见文档

参考

https://github.com/timescale/timescaledb/releases

https://docs.timescale.com/v1.3/using-timescaledb/continuous-aggregates

https://docs.timescale.com/v1.3/tutorials/continuous-aggs-tutorial

https://www.pipelinedb.com/blog/pipelinedb-is-joining-confluent

https://github.com/pipelinedb/pipelinedb

PostgreSQL 许愿链接

您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

digoal's wechat

文章转载自digoal,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论