暂无图片
暂无图片
1
暂无图片
暂无图片
暂无图片

PostgreSQL-12分区表详解


姜瑞海
中国PG分会认证专家
PostgreSQL资深内核研发工程师


一、初识分区表


         通常情况下,扫描一个大表会很慢。 例如,如果一个订单表orders的数据量是50G,统计某个州范围内订单的平均额度,往往会消耗几分钟的时间。

      select avg(total_amount) from orders where state_code=1;
    复制

             如果能够把大表分拆成小表,查询数据的时猴,只扫描数据所属的小表,就能大大降低扫描时间,提高查询速度。

             PostgreSQL的分区表(Table Partitioning)可以用来解决此类问题。解决方式是:创建一个表orders,作为分区表(partitionedtable),再创建50个分区(partition),orders_1, orders_2, …, orders_50, 每一个分区对应一个州的数据,分区的数据量平均是1G。分区表和分区都是表。本例中,这50分区联合在一起,组成分区表orders。在执行查询语句(如下)的时候:

      select avg(total_amount) from orders where state_code=1;
      复制

               PostgreSQL通过对执行语句的分析处理,最终把扫描的任务定位在分区order_1上,把查询语句转换成下面的语句,其他分区根本不需要扫描。

        select avg(total_amount) from orders_1;
        复制


        二、PostgreSQL分区表应用举例


                 温度采集在物联网应用中非常普遍,通常一个系统中部署大量的温度传感器,传感器按照设定的采集频率把温度数据发送到服务器。 下面是一个温度采集的例子,表temperature_sensor_data,用于保存温度传感器采集的温度数据。 如果有10万个传感器,每隔一小时采集一次数据,则每一个月会产生3.7G的数据,一年会产生大约43G的数据。

                 对于这样量级的数据,通常需要采用特殊的处理方式。一种可能的方式是:按照月创建分区,数据按照所属的月份,被存储到较小的分区。


        2.1 创建分区表

                 在下面的例子中,创建了分区表temperature_sensor_data和12分区。分区表代表2017年全年的数据,而每一个分区代表单月的数据。

          droptableifexists temperature_sensor_data ;

          CREATETABLEtemperature_sensor_data (
          sensor_id integer NOTNULL,
          timestamp timestampNOTNULL,
          temperature decimal(5,2) NOTNULL
          ) PARTITION BY RANGE (timestamp);

          droptableifexists temperature_sensor_data_2017_1;
          CREATETABLEtemperature_sensor_data_2017_1
          PARTITION OF temperature_sensor_data
          FORVALUESFROM ('2017-01-01') TO ('2017-02-01');

          droptableifexists temperature_sensor_data_2017_2;
          CREATETABLEtemperature_sensor_data_2017_2
          PARTITION OF temperature_sensor_data
          FORVALUESFROM ('2017-02-01') TO ('2017-03-01');


          droptableifexists temperature_sensor_data_2017_3;
          CREATETABLEtemperature_sensor_data_2017_3
          PARTITION OF temperature_sensor_data
          FORVALUESFROM ('2017-03-01') TO ('2017-04-01');

          droptableifexists temperature_sensor_data_2017_4;
          CREATETABLEtemperature_sensor_data_2017_4
          PARTITION OF temperature_sensor_data
          FORVALUESFROM ('2017-04-01') TO ('2017-05-01');


          droptableifexists temperature_sensor_data_2017_5;
          CREATETABLEtemperature_sensor_data_2017_5
          PARTITION OF temperature_sensor_data
          FORVALUESFROM ('2017-05-01') TO ('2017-06-01');


          droptableifexists temperature_sensor_data_2017_6;
          CREATETABLEtemperature_sensor_data_2017_6
          PARTITION OF temperature_sensor_data
          FORVALUESFROM ('2017-06-01') TO ('2017-07-01');

          droptableifexists temperature_sensor_data_2017_7;
          CREATETABLEtemperature_sensor_data_2017_7
          PARTITION OF temperature_sensor_data
          FORVALUESFROM ('2017-07-01') TO ('2017-08-01');

          droptableifexists temperature_sensor_data_2017_8;
          CREATETABLEtemperature_sensor_data_2017_8
          PARTITION OF temperature_sensor_data
          FORVALUESFROM ('2017-08-01') TO ('2017-09-01');

          droptableifexists temperature_sensor_data_2017_9;
          CREATETABLEtemperature_sensor_data_2017_9
          PARTITION OF temperature_sensor_data
          FORVALUESFROM ('2017-09-01') TO ('2017-10-01');

          droptableifexists temperature_sensor_data_2017_10;
          CREATETABLEtemperature_sensor_data_2017_10
          PARTITION OF temperature_sensor_data
          FORVALUESFROM ('2017-10-01') TO ('2017-11-01');

          droptableifexists temperature_sensor_data_2017_11;
          CREATETABLEtemperature_sensor_data_2017_11
          PARTITION OF temperature_sensor_data
          FORVALUESFROM ('2017-11-01') TO ('2017-12-01');

          droptableifexists temperature_sensor_data_2017_12;
          CREATETABLEtemperature_sensor_data_2017_12
          PARTITION OF temperature_sensor_data
          FORVALUESFROM ('2017-12-01') TO ('2018-01-01');
          复制

          2.2 模拟加载数据

          • 100000个传感器

          • 每隔1小时采集一次数据

          • 总共12个月

            with   ids as ( select generate_series(1,100000) as sensor_id ), 
            times as ( SELECT generate_series( '2017-01-01 00:00:00'::timestamp,'2017-12-31 23:59:00', '1 hour' ) as timestamp ),
            samples as ( select sensor_id, timestamp, random()*100::decimal as temperature from ids full join times on 1=1 )
            insert into temperature_sensor_data
            select sensor_id, timestamp, round(temperature::decimal,2) as temperature from samples;
            postgres=# \d+
            List of relations
            Schema | Name | Type | Owner | Size | Description
            --------+---------------------------------+-------------------+-------------+---------+-------------
            public | temperature_sensor_data | partitioned table | postgres | 0 bytes |
            public | temperature_sensor_data_2017_1 | table | postgres | 3703 MB |
            public | temperature_sensor_data_2017_10 | table | postgres | 3703 MB |
            public | temperature_sensor_data_2017_11 | table | postgres | 3584 MB |
            public | temperature_sensor_data_2017_12 | table | postgres | 3703 MB |
            public | temperature_sensor_data_2017_2 | table | postgres | 3345 MB |
            public | temperature_sensor_data_2017_3 | table | postgres | 3703 MB |
            public | temperature_sensor_data_2017_4 | table | postgres | 3584 MB |
            public | temperature_sensor_data_2017_5 | table | postgres | 3703 MB |
            public | temperature_sensor_data_2017_6 | table | postgres | 3584 MB |
            public | temperature_sensor_data_2017_7 | table | postgres | 3703 MB |
            public | temperature_sensor_data_2017_8 | table | postgres | 3703 MB |
            public | temperature_sensor_data_2017_9 | table | postgres | 3584 MB |
            (13 rows)
            复制


            2.3 统计1月份的平均温度


            • 1月份的数据量是3703M

            • 耗时大约33秒

              postgres=#selectavg(temperature) from temperature_sensor_data wheretimestampbetween  '2017-01-01 00:00:00'and'2017-01-0123:59:00';
              avg
              ---------------------
              50.0171680480000000
              (1 row)

              Time: 33305.055 ms(00:33.305)
              postgres=#
              复制

              2.4 使用一个大表,不使用分区表的查询结果

              • 单个表数据量是43G

              • 耗时大约7分51秒

                postgres=# \d+
                List of relations
                Schema | Name | Type | Owner | Size | Description
                --------+-------------------------+-------+-------------+-------+-------------
                public | temperature_sensor_data | table | postgres | 43 GB |
                (1 row)
                postgres=# select avg(temperature) from temperature_sensor_data where timestamp between '2017-01-01 00:00:00' and '2017-01-01 23:59:00';
                avg
                ---------------------
                50.0010354000000000
                (1 row)


                Time: 471373.514 ms (07:51.374)
                复制

                 三、使用DeclarativePartitioning定义分区表

                         PostgreSQL从版本10开始,支持DeclarativePartitioning功能,就是使用create table语句定义分区表和分区。

                         创建分区表的方式是:create table tablename (…) partition by (…)

                         创建分区的方式是: create table partitionname partition oftablename for values (…);

                         其中partition by (…)定义来分区表根据哪些列来分区,使用什么算法;for values (…)定义一个分区内,落入该分区的数据的取值范围。

                         目前PostgreSQL12提供来3种分区算法:

                • partition by range(…),pg10引入

                • partition by list(…),pg10引入

                • parition by hash(…),pg11引入


                3.1 使用PARTITION BY RANGE方式定义分区

                         在创建分区表的时候,需要使用PARTITION BY指明该表是一个分区表,并且定义分区的方式。 以下是PostgreSQL官方文档中一个例子:

                         该例子中,根据logdate字段做分区,使用RANGE方式。分区表measurement对应3个分区:measurement_y2006m02,measurement_y2006m03,measurement_def。其中measurement_def是默认分区。

                         在插入数据的时候,如果logdate的取值在2016年2月份,则数据插入到分区measurement_y2006m02;如果logdate的取值在2016年3月份,则数据被插入到分区measurement_y2006m03;其它的数据,插入到默认分区measurement_def。

                  CREATE TABLE measurement (
                  city_id int not null,
                  logdate date not null,
                  peaktemp int,
                  unitsales int
                  ) PARTITION BY RANGE (logdate);
                  CREATE TABLE measurement_y2006m02 PARTITION OF measurement
                  FOR VALUES FROM ('2006-02-01') TO ('2006-03-01');


                  CREATE TABLE measurement_y2006m03 PARTITION OF measurement
                  FOR VALUES FROM ('2006-03-01') TO ('2006-04-01');


                  CREATE TABLE measurement_def PARTITION OF measurement DEFAULT;
                  复制

                            查询数据的时候,PostgreSQL能够根据合适的过滤条件,选择正确的分区做查询;如果没有适当的过滤条件,则扫描所有分区。

                    postgres=# explain select * from measurement where logdate='2006-02-10';
                    QUERY PLAN
                    ----------------------------------------------------------------------
                    Seq Scan on measurement_y2006m02 (cost=0.00..33.12 rows=9 width=16)
                    Filter: (logdate = '2006-02-10'::date)
                    (2 rows)




                    postgres=# explain select * from measurement;
                    QUERY PLAN
                    -------------------------------------------------------------------------------
                    Append (cost=0.00..113.25 rows=5550 width=16)
                    -> Seq Scan on measurement_y2006m02 (cost=0.00..28.50 rows=1850 width=16)
                    -> Seq Scan on measurement_y2006m03 (cost=0.00..28.50 rows=1850 width=16)
                    -> Seq Scan on measurement_def (cost=0.00..28.50 rows=1850 width=16)
                    (4 rows)
                    复制


                    3.2 使用PARTITION BY LIST(column )定义分区

                               列的取值范围值是一个小的集合,类似编程中的枚举概念。当列值等于某个特定值的时候,落入指定的分区。

                               下面的例子中,分区表sale_order包含3个分区:

                    europe_order,asia_order,default_order。当列country等于'germany'或者'sweden'时,数据落入分区europe_order;当country的值等于india或japan时,行落入分区asia_order;当country等于其它值时,则行数据落入分区default_order。

                      CREATE TABLE sale_order
                      (
                      order_no integer,
                      store_no integer,
                      country varchar(20),
                      date date,
                      amount decimal(5,2)
                      ) PARTITION BY LIST(country);


                      CREATE TABLE europe_order
                      PARTITION OF sale_order FOR VALUES IN ('germany', 'sweden');


                      CREATE TABLE asia_order
                      PARTITION OF sale_order FOR VALUES IN ('india', 'japan');


                      CREATE TABLE default_order
                      PARTITION OF sale_order DEFAULT;
                      复制

                                  查询数据的时候,PostgreSQL能够根据合适的过滤条件,选择正确的分区做查询;如果没有适当的过滤条件,则扫描所有分区。

                        postgres=#explain select * from sale_order where country='india';
                        QUERY PLAN
                        ------------------------------------------------------------
                        Seq Scan on asia_order (cost=0.00..19.25 rows=4 width=82)
                        Filter: ((country)::text = 'india'::text)
                        (2 rows)

                        postgres=#explain select * from sale_order;
                        QUERY PLAN
                        -----------------------------------------------------------------------
                        Append (cost=0.00..63.30 rows=2220 width=82)
                        -> Seq Scan on europe_order (cost=0.00..17.40 rows=740 width=82)
                        -> Seq Scan on asia_order (cost=0.00..17.40 rows=740 width=82)
                        -> Seq Scan on default_order (cost=0.00..17.40 rows=740 width=82)
                        (4 rows)
                        复制

                        3.3 使用PARTITION BY HASH( column )定义分区

                                   对列的值做哈希,哈希值把数据分割成几个分区。

                                   下面的例子中,分区表orders包含4个分区:orders_p1,orders_p2,orders_p3,orders_p4。

                                 插入数据时,对列o_w_id取余,结果等于0,1,2,3,行数据分别落入分区orders_p1, orders_p2, orders_p3,orders_p4。

                          createtableorders (
                          o_w_id integer notnull,
                          o_d_id integer notnull,
                          o_id integer notnull,
                          o_c_id integer,
                          o_carrier_id integer,
                          o_ol_cnt integer,
                          o_all_local integer,
                          o_entry_d timestamp
                          )PARTITIONBY HASH ( o_w_id );

                          CREATETABLEorders_p1 PARTITION OF orders
                          FORVALUESWITH (MODULUS 4, REMAINDER 0);
                          CREATETABLEorders_p2 PARTITION OF orders
                          FORVALUESWITH (MODULUS 4, REMAINDER 1);
                          CREATETABLEorders_p3 PARTITION OF orders
                          FORVALUESWITH (MODULUS 4, REMAINDER 2);
                          CREATETABLEorders_p4 PARTITION OF orders
                          FORVALUESWITH (MODULUS 4, REMAINDER 3);
                          复制

                          3.4 分区的其它特性

                          • 可以在分区表上建立索引,相应的所有分区都能自动建立索引;或者,可以为分区单独建立索引。

                          • 可以根据需要,卸载或这增加一个分区。

                          • 分区可以指定单独的表空间,能充分利用多个磁盘。

                          • 分区可以指向一个PG外表,即FDW表。

                          • 分区表可以根据多个列的值来分区。

                          • 分区可以再次分区。

                          四、使用表继承(Inheritance)方式定义分区表

                                      分区表也可以使用继承的方式来使用。该方式早在PostgreSQL8就支持了。创建方式举例如下:

                          • 1. 创建一个普通表measurement

                            CREATE TABLE measurement (
                            city_id int not null,
                            logdate date not null,
                            peaktemp int,
                            unitsales int
                            );
                            复制
                            • 2. 创建子表,继承自measurement

                              • 每个子表的check约束是为了确保字表只运行符合条件的数据插入。

                              CREATE TABLE measurement_y2006m02 (
                              CHECK ( logdate >= DATE '2006-02-01' AND logdate < DATE '2006-03-01' )
                              ) INHERITS (measurement);


                              CREATE TABLE measurement_y2006m03 (
                              CHECK ( logdate >= DATE '2006-03-01' AND logdate < DATE '2006-04-01' )
                              ) INHERITS (measurement);


                              CREATE TABLE measurement_y2007m12 (
                              CHECK ( logdate >= DATE '2007-12-01' AND logdate < DATE '2008-01-01' )
                              ) INHERITS (measurement);


                              CREATE TABLE measurement_y2008m01 (
                              CHECK ( logdate >= DATE '2008-01-01' AND logdate < DATE '2008-02-01' )
                              ) INHERITS (measurement);
                              复制
                              • 3. 创建函数和触发器,用于把数据插入到相应的分区。

                                CREATE OR REPLACE FUNCTION measurement_insert_trigger()
                                RETURNS TRIGGER AS $$
                                BEGIN
                                IF ( NEW.logdate >= DATE '2006-02-01' AND
                                NEW.logdate < DATE '2006-03-01' ) THEN
                                INSERT INTO measurement_y2006m02 VALUES (NEW.*);
                                ELSIF ( NEW.logdate >= DATE '2006-03-01' AND
                                NEW.logdate < DATE '2006-04-01' ) THEN
                                INSERT INTO measurement_y2006m03 VALUES (NEW.*);
                                ELSIF ( NEW.logdate >= DATE '2007-12-01' AND
                                NEW.logdate < DATE '2008-01-01' ) THEN
                                INSERT INTO measurement_y2007m12 VALUES (NEW.*);
                                ELSIF ( NEW.logdate >= DATE '2008-01-01' AND
                                NEW.logdate < DATE '2008-02-01' ) THEN
                                INSERT INTO measurement_y2008m01 VALUES (NEW.*);
                                ELSE
                                RAISE EXCEPTION 'Date out of range. Fix the measurement_insert_trigger() function!';
                                END IF;
                                RETURN NULL;
                                END;
                                $$
                                LANGUAGE plpgsql;


                                CREATE TRIGGER insert_measurement_trigger
                                BEFORE INSERT ON measurement
                                    FOR EACH ROW EXECUTE FUNCTION measurement_insert_trigger();
                                复制
                                • 4. 插入数据

                                插入4条数据,应该分别落入4个子表。

                                  insertinto measurement 
                                  values (1, '2006-02-10', 10, 1),
                                  (1, '2006-03-10', 10, 1),
                                  (1, '2007-12-10', 10, 1),
                                  (1, '2008-01-10', 10, 1);
                                  复制
                                  • 5. 查询数据

                                    postgres=#select * from measurement;
                                    city_id | logdate | peaktemp | unitsales
                                    ---------+------------+----------+-----------
                                    1 | 2006-02-10 | 10 | 1
                                    1 | 2006-03-10 | 10 | 1
                                    1 | 2007-12-10 | 10 | 1
                                    1 | 2008-01-10 | 10 | 1
                                    (4 rows)

                                    postgres=#select * from measurement_y2006m02;
                                    city_id | logdate | peaktemp | unitsales
                                    ---------+------------+----------+-----------
                                    1 | 2006-02-10 | 10 | 1
                                    (1 row)

                                    postgres=#select * from measurement_y2006m03;
                                    city_id | logdate | peaktemp | unitsales
                                    ---------+------------+----------+-----------
                                    1 | 2006-03-10 | 10 | 1
                                    (1 row)

                                    postgres=#select * from measurement_y2007m12;
                                    city_id | logdate | peaktemp | unitsales
                                    ---------+------------+----------+-----------
                                    1 | 2007-12-10 | 10 | 1
                                    (1 row)

                                    postgres=#select * from measurement_y2008m01 ;
                                    city_id | logdate | peaktemp | unitsales
                                    ---------+------------+----------+-----------
                                    1 | 2008-01-10 | 10 | 1
                                    (1 row)

                                    postgres=#explain select * from measurement;
                                    QUERYPLAN
                                    -------------------------------------------------------------------------------
                                    Append (cost=0.00..151.00 rows=7401 width=16)
                                    -> Seq Scan on measurement (cost=0.00..0.00 rows=1 width=16)
                                    -> Seq Scan on measurement_y2006m02 (cost=0.00..28.50 rows=1850 width=16)
                                    -> Seq Scan on measurement_y2006m03 (cost=0.00..28.50 rows=1850 width=16)
                                    -> Seq Scan on measurement_y2007m12 (cost=0.00..28.50 rows=1850 width=16)
                                    -> Seq Scan on measurement_y2008m01 (cost=0.00..28.50 rows=1850 width=16)
                                    (6 rows)

                                    postgres=#explain select * from measurement where logdate='2007-01-10';
                                    QUERY PLAN
                                    ------------------------------------------------------------
                                    Seq Scan on measurement (cost=0.00..0.00 rows=1 width=16)
                                    Filter: (logdate = '2007-01-10'::date)
                                    (2 rows)
                                    复制


                                    最后修改时间:2019-11-06 10:59:10
                                    文章转载自开源软件联盟PostgreSQL分会,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

                                    评论