
PostgreSQL 12 on Alibaba Cloud ECS: local SSD vs ESSD PL3 performance - with fio, fsync, and pgbench tests, pros and cons, and PG kernel optimization suggestions for cloud disks


Author

digoal

Date

2019-12-28

Tags

PostgreSQL , fio , Alibaba Cloud , local ssd , essd , parallelism , pros and cons , IO


Background

Cloud storage comes in many flavors: local SSD, SSD, ESSD, and ESSD PL1/PL2/PL3. Which should a PG database use, and what are the trade-offs? How do different workloads choose? Cloud disks inevitably add latency to each individual IO, but does every workload actually care about single-IO latency, and how can that latency be worked around sensibly?

We first benchmark the raw performance, then analyze where local and cloud SSD differ, to find the decision points for choosing database storage.

The test environments are as follows:

Local-disk machine

Disks

1.8T * 8, rated IOPS: 240,000 per disk

cpu:

32 cores / 64 threads, 2.5GHz

mem: 512GB

```

lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Stepping: 4
CPU MHz: 2500.006
BogoMIPS: 5000.01
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 33792K
NUMA node0 CPU(s): 0-63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 spec_ctrl intel_stibp

lsblk

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 253:0 0 100G 0 disk
└─vda1 253:1 0 100G 0 part /
vdb 253:16 0 1.8T 0 disk
vdc 253:32 0 1.8T 0 disk
vdd 253:48 0 1.8T 0 disk
vde 253:64 0 1.8T 0 disk
vdf 253:80 0 1.8T 0 disk
vdg 253:96 0 1.8T 0 disk
vdh 253:112 0 1.8T 0 disk
vdi 253:128 0 1.8T 0 disk
```

ESSD PL3 cloud-disk machine

Disk set 1

1.8T * 8, rated IOPS: 91,800 per disk

Disk set 2

20T * 1, rated IOPS: 1,000,000

cpu:

26 cores / 52 threads, 2.5GHz

mem: 192GB

```

lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 52
On-line CPU(s) list: 0-51
Thread(s) per core: 2
Core(s) per socket: 26
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
Stepping: 7
CPU MHz: 2500.002
BogoMIPS: 5000.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-51
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512_vnni

lsblk

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 253:0 0 100G 0 disk
└─vda1 253:1 0 100G 0 part /
vdb 253:16 0 1.8T 0 disk
vdc 253:32 0 1.8T 0 disk
vdd 253:48 0 1.8T 0 disk
vde 253:64 0 1.8T 0 disk
vdf 253:80 0 1.8T 0 disk
vdg 253:96 0 1.8T 0 disk
vdh 253:112 0 1.8T 0 disk
vdi 253:128 0 1.8T 0 disk
vdj 253:144 0 19.5T 0 disk
```

Machine setup

Kernel parameters

```
vi /etc/sysctl.conf

# added by digoal.zhou
fs.aio-max-nr = 1048576
fs.file-max = 76724600

# optional: kernel.core_pattern = /data01/corefiles/core_%e_%u_%t_%s.%p
# create /data01/corefiles beforehand with mode 777; if it is a symlink, chmod the target directory to 777

kernel.sem = 4096 2147483647 2147483646 512000
# semaphores; inspect with `ipcs -l` or `ipcs -u`. One group per 16 processes, 17 semaphores per group.

kernel.shmall = 107374182
# limit on the total size of all shared memory segments (80% of RAM recommended), in pages

kernel.shmmax = 274877906944
# maximum size of a single shared memory segment (half of RAM recommended), in bytes;
# PostgreSQL > 9.2 has greatly reduced its use of SysV shared memory

kernel.shmmni = 819200
# how many shared memory segments can be created in total; each PG cluster needs at least 2

net.core.netdev_max_backlog = 10000

net.core.rmem_default = 262144
# the default setting of the socket receive buffer in bytes

net.core.rmem_max = 4194304
# the maximum receive socket buffer size in bytes

net.core.wmem_default = 262144
# the default setting (in bytes) of the socket send buffer

net.core.wmem_max = 4194304
# the maximum send socket buffer size in bytes

net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_keepalive_intvl = 20
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_mem = 8388608 12582912 16777216
net.ipv4.tcp_fin_timeout = 5
net.ipv4.tcp_synack_retries = 2

net.ipv4.tcp_syncookies = 1
# enable SYN cookies: when the SYN backlog overflows, fall back to cookies, which fends off small SYN floods

net.ipv4.tcp_timestamps = 1
# reduces time_wait

net.ipv4.tcp_tw_recycle = 0
# 1 enables fast recycling of TIME-WAIT sockets, but that can break connections behind NAT;
# keeping it off on servers is recommended

net.ipv4.tcp_tw_reuse = 1
# allow TIME-WAIT sockets to be reused for new TCP connections

net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 65536 16777216

net.nf_conntrack_max = 1200000
net.netfilter.nf_conntrack_max = 1200000

vm.dirty_background_bytes = 409600000
# once system-wide dirty pages reach this value, the background flusher (pdflush or similar)
# writes out pages dirtied more than (dirty_expire_centisecs/100) seconds ago;
# the default is 10% of RAM; on large-memory machines set an absolute byte value instead

vm.dirty_expire_centisecs = 3000
# dirty pages older than this are flushed to disk; 3000 means 30 seconds

vm.dirty_ratio = 95
# if background flushing is too slow and dirty pages exceed 95% of RAM, user processes that
# write to disk (fsync, fdatasync, ...) must flush dirty pages themselves;
# this effectively keeps user processes from flushing, which is very useful with several
# instances per host whose per-instance IOPS is capped by cgroups

vm.dirty_writeback_centisecs = 100
# wakeup interval of the background flusher; 100 means 1 second

vm.swappiness = 0
# do not use swap

vm.mmap_min_addr = 65536

vm.overcommit_memory = 0
# allow a little overcommit when allocating memory; 1 means "always assume enough memory"
# and can be used on low-memory test machines

vm.overcommit_ratio = 90
# used to compute the allowed commit size when overcommit_memory = 2

vm.zone_reclaim_mode = 0
# disable NUMA zone reclaim (or disable NUMA in the kernel)

net.ipv4.ip_local_port_range = 40000 65535
# range of TCP/UDP ports assigned automatically to local sockets

fs.nr_open = 20480000
# per-process limit on open file handles

# take care with the parameters below
vm.extra_free_kbytes = 4096000
vm.min_free_kbytes = 6291456   # roughly 1G of vm.min_free_kbytes per 32G of RAM
# do not set these two values on low-memory machines

vm.nr_hugepages = 66536
# use huge pages once shared_buffers exceeds 64GB; the page size is Hugepagesize in /proc/meminfo

vm.lowmem_reserve_ratio = 1 1 1
# recommended when RAM > 64G; otherwise keep the default 256 256 32

# apply
sysctl -p
```
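Where the advice above says to use huge pages once shared_buffers exceeds 64GB, the page count has to match the actual Hugepagesize. A minimal sketch (not from the original article) that derives vm.nr_hugepages from a target shared_buffers, assuming 2MB huge pages and ~10% headroom for PostgreSQL's other shared memory:

```
# hypothetical helper: size vm.nr_hugepages for a given shared_buffers
SHARED_BUFFERS_GB=32                                          # match postgresql.conf
HUGEPAGE_KB=$(awk '/Hugepagesize/{print $2}' /proc/meminfo)   # usually 2048
PAGES=$(( SHARED_BUFFERS_GB * 1024 * 1024 * 110 / 100 / HUGEPAGE_KB ))
echo "vm.nr_hugepages = $PAGES"   # append to /etc/sysctl.conf, then run sysctl -p
```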

Resource limits

```
vi /etc/security/limits.conf

# if nofile is to exceed 1048576, first set sysctl fs.nr_open to a larger value
# and apply it; only then can nofile be raised

# comment out the other lines and add:

* soft nofile 1024000
* hard nofile 1024000
* soft nproc unlimited
* hard nproc unlimited
* soft core unlimited
* hard core unlimited
* soft memlock unlimited
* hard memlock unlimited
```

Also edit (if present):

/etc/security/limits.d/20-nproc.conf

Disable transparent huge pages (optional)

echo never > /sys/kernel/mm/transparent_hugepage/enabled

Make it persistent:

```
chmod +x /etc/rc.d/rc.local

vi /etc/rc.local

touch /var/lock/subsys/local

if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
```

Change the clock source (optional)

```
vi /etc/rc.local

touch /var/lock/subsys/local

if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi

echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
```

Supported clock sources:

cat /sys/devices/system/clocksource/clocksource0/available_clocksource

kvm-clock tsc acpi_pm

Switch the clock source:

echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource

Clock performance:

```
kvm-clock:

pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 38.05 ns
Histogram of timing durations:
< us % of total count
1 96.22568 75875976
2 3.76667 2970102
4 0.00055 435
8 0.00704 5550
16 0.00005 42
32 0.00000 1
64 0.00000 1
128 0.00000 0
256 0.00000 0
512 0.00000 0
1024 0.00000 1

tsc:

pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 31.93 ns
Histogram of timing durations:
< us % of total count
1 96.82838 90964258
2 3.16507 2973390
4 0.00078 731
8 0.00574 5394
16 0.00003 26
32 0.00000 2
```

tsc is slightly better.

Enable the irqbalance service

systemctl enable irqbalance

systemctl start irqbalance

systemctl status irqbalance

Confirm that NIC multi-queue is enabled

```
ethtool -l eth0

Channel parameters for eth0:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 32
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 32
```

If it is not enabled, turn it on (use the Combined value under Pre-set maximums):

ethtool -L eth0 combined 32
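A small sketch (assuming eth0, and that the Pre-set maximum is always the value wanted) that reads the maximum and applies it, so the number does not have to be hard-coded:

```
# hypothetical helper: read the Pre-set maximum Combined value and apply it
MAX=$(ethtool -l eth0 | awk '/Pre-set/{p=1} p && /Combined/{print $2; exit}')
ethtool -L eth0 combined "$MAX"
```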

Install the PostgreSQL 12 software

yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm

yum install -y centos-release-scl

yum install -y postgresql12*

yum install -y lvm2

Deploy the storage

Logical volume and filesystem setup

Local-disk version (1.8T*8)

```
parted -a optimal -s /dev/vdb mklabel gpt mkpart primary 1MiB 100%
parted -a optimal -s /dev/vdc mklabel gpt mkpart primary 1MiB 100%
parted -a optimal -s /dev/vdd mklabel gpt mkpart primary 1MiB 100%
parted -a optimal -s /dev/vde mklabel gpt mkpart primary 1MiB 100%
parted -a optimal -s /dev/vdf mklabel gpt mkpart primary 1MiB 100%
parted -a optimal -s /dev/vdg mklabel gpt mkpart primary 1MiB 100%
parted -a optimal -s /dev/vdh mklabel gpt mkpart primary 1MiB 100%
parted -a optimal -s /dev/vdi mklabel gpt mkpart primary 1MiB 100%

wipefs -f -a /dev/vd[b-i]1
pvcreate /dev/vd[b-i]1
vgcreate -A y -s 128M vgdata01 /dev/vd[b-i]1
lvcreate -A y -i 8 -I 8 -l 100%FREE -n lv01 vgdata01

[root@iZbp12984c7sv43o6lbbxoZ ~]# pvs
PV VG Fmt Attr PSize PFree
/dev/vdb1 vgdata01 lvm2 a-- <1.75t 0
/dev/vdc1 vgdata01 lvm2 a-- <1.75t 0
/dev/vdd1 vgdata01 lvm2 a-- <1.75t 0
/dev/vde1 vgdata01 lvm2 a-- <1.75t 0
/dev/vdf1 vgdata01 lvm2 a-- <1.75t 0
/dev/vdg1 vgdata01 lvm2 a-- <1.75t 0
/dev/vdh1 vgdata01 lvm2 a-- <1.75t 0
/dev/vdi1 vgdata01 lvm2 a-- <1.75t 0
[root@iZbp12984c7sv43o6lbbxoZ ~]# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
lv01 vgdata01 -wi-a----- <13.97t

mkfs.ext4 /dev/mapper/vgdata01-lv01 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=2,stripe_width=16 -b 4096 -T largefile -L lv01

vi /etc/fstab
LABEL=lv01 /data01 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0

mkdir /data01

mount -a
```
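The ext4 stride/stripe_width values above follow from the stripe geometry: an 8KB LVM stripe (lvcreate -I 8) over 4KB ext4 blocks gives stride = 8K/4K = 2, and 8 data disks give stripe_width = 2 × 8 = 16. A quick sanity check:

```
# stride = LVM stripe size / ext4 block size ; stripe_width = stride * number of disks
echo "stride=$(( 8 / 4 )) stripe_width=$(( 8 / 4 * 8 ))"   # => stride=2 stripe_width=16
```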

Cloud-disk logical volume (1.8T*8)

```
parted -a optimal -s /dev/vdb mklabel gpt mkpart primary 1MiB 100%
parted -a optimal -s /dev/vdc mklabel gpt mkpart primary 1MiB 100%
parted -a optimal -s /dev/vdd mklabel gpt mkpart primary 1MiB 100%
parted -a optimal -s /dev/vde mklabel gpt mkpart primary 1MiB 100%
parted -a optimal -s /dev/vdf mklabel gpt mkpart primary 1MiB 100%
parted -a optimal -s /dev/vdg mklabel gpt mkpart primary 1MiB 100%
parted -a optimal -s /dev/vdh mklabel gpt mkpart primary 1MiB 100%
parted -a optimal -s /dev/vdi mklabel gpt mkpart primary 1MiB 100%

wipefs -f -a /dev/vd[b-i]1
pvcreate /dev/vd[b-i]1
vgcreate -A y -s 128M vgdata01 /dev/vd[b-i]1
lvcreate -A y -i 8 -I 8 -l 100%FREE -n lv01 vgdata01

[root@iZbp1621kc3de3pm12a0exZ ~]# pvs
PV VG Fmt Attr PSize PFree
/dev/vdb1 vgdata01 lvm2 a-- <1.76t 0
/dev/vdc1 vgdata01 lvm2 a-- <1.76t 0
/dev/vdd1 vgdata01 lvm2 a-- <1.76t 0
/dev/vde1 vgdata01 lvm2 a-- <1.76t 0
/dev/vdf1 vgdata01 lvm2 a-- <1.76t 0
/dev/vdg1 vgdata01 lvm2 a-- <1.76t 0
/dev/vdh1 vgdata01 lvm2 a-- <1.76t 0
/dev/vdi1 vgdata01 lvm2 a-- <1.76t 0
[root@iZbp1621kc3de3pm12a0exZ ~]# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
lv01 vgdata01 -wi-a----- 14.06t

mkfs.ext4 /dev/mapper/vgdata01-lv01 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=2,stripe_width=16 -b 4096 -T largefile -L lv01

vi /etc/fstab
LABEL=lv01 /data01 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0

mkdir /data01

mount -a
```

Cloud-disk single volume (20T) setup

```
parted -a optimal -s /dev/vdj mklabel gpt mkpart primary 1MiB 100%

mkfs.ext4 /dev/vdj1 -m 0 -O extent,uninit_bg -E lazy_itable_init=1 -b 4096 -T largefile -L vdj1

vi /etc/fstab
LABEL=vdj1 /data02 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0

mkdir /data02

mount -a
```

Deploy the PostgreSQL 12 databases

Local-disk version (1.8T*8)

mkdir /data01/pg12_1921

chown postgres:postgres /data01/pg12_1921

```
su - postgres

vi ~/.bash_profile

# append the following

export PS1="$USER@`/bin/hostname -s`-> "
export PGPORT=1921
export PGDATA=/data01/pg12_$PGPORT/pg_root

export LANG=en_US.utf8
export PGHOME=/usr/pgsql-12
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export DATE=`date +"%Y%m%d%H%M"`
export PATH=$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
export PGHOST=$PGDATA
export PGUSER=postgres
export PGDATABASE=postgres
alias rm='rm -i'
alias ll='ls -lh'
unalias vi
```

initdb -D $PGDATA -U postgres -E SQL_ASCII --lc-collate=C --lc-ctype=en_US.utf8 --wal-segsize=1024

```
vi $PGDATA/postgresql.auto.conf

listen_addresses = '0.0.0.0'

# listening port
port = 1921
max_connections = 1000
superuser_reserved_connections = 13
unix_socket_directories = '., /var/run/postgresql, /tmp'
unix_socket_permissions = 0700
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 10
shared_buffers = 32GB
max_prepared_transactions = 1000
work_mem = 8MB
maintenance_work_mem = 2GB
dynamic_shared_memory_type = posix
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 10.0
bgwriter_flush_after = 512kB
effective_io_concurrency = 0
max_worker_processes = 128
max_parallel_maintenance_workers = 5
max_parallel_workers_per_gather = 0
parallel_leader_participation = off
max_parallel_workers = 40
backend_flush_after = 2MB
wal_level = replica
fsync = on
synchronous_commit = off
full_page_writes = on
wal_compression = on
wal_buffers = 16MB
wal_writer_delay = 10ms
wal_writer_flush_after = 1MB
checkpoint_timeout = 20min
max_wal_size = 128GB
min_wal_size = 16GB
checkpoint_completion_target = 0.2
archive_mode = on
archive_command = '/bin/date'
max_wal_senders = 32
wal_keep_segments = 0
wal_sender_timeout = 10s
max_replication_slots = 24
random_page_cost = 1.1
effective_cache_size = 128GB
log_destination = 'csvlog'
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%a.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 0
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose
log_line_prefix = '%m [%p] '
log_lock_waits = on
log_statement = 'ddl'
log_timezone = 'Asia/Shanghai'
track_functions = pl
autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_max_workers = 8
autovacuum_vacuum_scale_factor = 0.1
autovacuum_analyze_scale_factor = 0.05
autovacuum_freeze_max_age = 800000000
autovacuum_multixact_freeze_max_age = 900000000
autovacuum_vacuum_cost_delay = 0ms
default_table_access_method = 'heap'
check_function_bodies = on

# set the two timeout parameters below according to business requirements

lock_timeout = 6s
idle_in_transaction_session_timeout = 60s
vacuum_freeze_table_age = 750000000
vacuum_multixact_freeze_table_age = 750000000
datestyle = 'iso, mdy'
timezone = 'Asia/Shanghai'
lc_messages = 'en_US.utf8'
lc_monetary = 'en_US.utf8'
lc_numeric = 'en_US.utf8'
lc_time = 'en_US.utf8'
default_text_search_config = 'pg_catalog.english'
```

Start the database:

pg_ctl start
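After starting, key settings can be spot-checked from pg_settings (a quick verification step, not part of the original run):

```
psql -c "select name, setting, unit from pg_settings where name in ('shared_buffers','max_wal_size','synchronous_commit');"
```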

Cloud-disk logical volume (1.8T*8)

mkdir /data01/pg12_1921

chown postgres:postgres /data01/pg12_1921

```
su - postgres

vi ~/.bash_profile

# append the following

export PS1="$USER@`/bin/hostname -s`-> "
export PGPORT=1921
export PGDATA=/data01/pg12_$PGPORT/pg_root

export LANG=en_US.utf8
export PGHOME=/usr/pgsql-12
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export DATE=`date +"%Y%m%d%H%M"`
export PATH=$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
export PGHOST=$PGDATA
export PGUSER=postgres
export PGDATABASE=postgres
alias rm='rm -i'
alias ll='ls -lh'
unalias vi
```

initdb -D $PGDATA -U postgres -E SQL_ASCII --lc-collate=C --lc-ctype=en_US.utf8 --wal-segsize=1024

```
vi $PGDATA/postgresql.auto.conf

listen_addresses = '0.0.0.0'

# listening port
port = 1921
max_connections = 1000
superuser_reserved_connections = 13
unix_socket_directories = '., /var/run/postgresql, /tmp'
unix_socket_permissions = 0700
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 10
shared_buffers = 32GB
max_prepared_transactions = 1000
work_mem = 8MB
maintenance_work_mem = 2GB
dynamic_shared_memory_type = posix
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 10.0
bgwriter_flush_after = 512kB
effective_io_concurrency = 0
max_worker_processes = 128
max_parallel_maintenance_workers = 5
max_parallel_workers_per_gather = 0
parallel_leader_participation = off
max_parallel_workers = 40
backend_flush_after = 2MB
wal_level = replica
fsync = on
synchronous_commit = off
full_page_writes = on
wal_compression = on
wal_buffers = 16MB
wal_writer_delay = 10ms
wal_writer_flush_after = 1MB
checkpoint_timeout = 20min
max_wal_size = 128GB
min_wal_size = 16GB
checkpoint_completion_target = 0.2
archive_mode = on
archive_command = '/bin/date'
max_wal_senders = 32
wal_keep_segments = 0
wal_sender_timeout = 10s
max_replication_slots = 24
random_page_cost = 1.1
effective_cache_size = 128GB
log_destination = 'csvlog'
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%a.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 0
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose
log_line_prefix = '%m [%p] '
log_lock_waits = on
log_statement = 'ddl'
log_timezone = 'Asia/Shanghai'
track_functions = pl
autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_max_workers = 8
autovacuum_vacuum_scale_factor = 0.1
autovacuum_analyze_scale_factor = 0.05
autovacuum_freeze_max_age = 800000000
autovacuum_multixact_freeze_max_age = 900000000
autovacuum_vacuum_cost_delay = 0ms
default_table_access_method = 'heap'
check_function_bodies = on

# set the two timeout parameters below according to business requirements

lock_timeout = 6s
idle_in_transaction_session_timeout = 60s
vacuum_freeze_table_age = 750000000
vacuum_multixact_freeze_table_age = 750000000
datestyle = 'iso, mdy'
timezone = 'Asia/Shanghai'
lc_messages = 'en_US.utf8'
lc_monetary = 'en_US.utf8'
lc_numeric = 'en_US.utf8'
lc_time = 'en_US.utf8'
default_text_search_config = 'pg_catalog.english'
```

Start the database:

pg_ctl start

Cloud-disk single volume (20T)

mkdir /data02/pg12_1922

chown postgres:postgres /data02/pg12_1922

```
su - postgres

export PGPORT=1922
export PGDATA=/data02/pg12_$PGPORT/pg_root
export PGHOST=$PGDATA

initdb -D $PGDATA -U postgres -E SQL_ASCII --lc-collate=C --lc-ctype=en_US.utf8 --wal-segsize=1024
```

```
vi $PGDATA/postgresql.auto.conf

listen_addresses = '0.0.0.0'

# listening port
port = 1922
max_connections = 1000
superuser_reserved_connections = 13
unix_socket_directories = '., /var/run/postgresql, /tmp'
unix_socket_permissions = 0700
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 10
shared_buffers = 32GB
max_prepared_transactions = 1000
work_mem = 8MB
maintenance_work_mem = 2GB
dynamic_shared_memory_type = posix
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 10.0
bgwriter_flush_after = 512kB
effective_io_concurrency = 0
max_worker_processes = 128
max_parallel_maintenance_workers = 5
max_parallel_workers_per_gather = 0
parallel_leader_participation = off
max_parallel_workers = 40
backend_flush_after = 2MB
wal_level = replica
fsync = on
synchronous_commit = off
full_page_writes = on
wal_compression = on
wal_buffers = 16MB
wal_writer_delay = 10ms
wal_writer_flush_after = 1MB
checkpoint_timeout = 20min
max_wal_size = 128GB
min_wal_size = 16GB
checkpoint_completion_target = 0.2
archive_mode = on
archive_command = '/bin/date'
max_wal_senders = 32
wal_keep_segments = 0
wal_sender_timeout = 10s
max_replication_slots = 24
random_page_cost = 1.1
effective_cache_size = 128GB
log_destination = 'csvlog'
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%a.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 0
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose
log_line_prefix = '%m [%p] '
log_lock_waits = on
log_statement = 'ddl'
log_timezone = 'Asia/Shanghai'
track_functions = pl
autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_max_workers = 8
autovacuum_vacuum_scale_factor = 0.1
autovacuum_analyze_scale_factor = 0.05
autovacuum_freeze_max_age = 800000000
autovacuum_multixact_freeze_max_age = 900000000
autovacuum_vacuum_cost_delay = 0ms
default_table_access_method = 'heap'
check_function_bodies = on

# set the two timeout parameters below according to business requirements

lock_timeout = 6s
idle_in_transaction_session_timeout = 60s
vacuum_freeze_table_age = 750000000
vacuum_multixact_freeze_table_age = 750000000
datestyle = 'iso, mdy'
timezone = 'Asia/Shanghai'
lc_messages = 'en_US.utf8'
lc_monetary = 'en_US.utf8'
lc_numeric = 'en_US.utf8'
lc_time = 'en_US.utf8'
default_text_search_config = 'pg_catalog.english'
```

Start the database:

pg_ctl start

fsync performance test (single-IO capability; gauges the database's sync-commit latency)

Local-disk version (1.8T*8)

```
su - postgres
cd $PGDATA

postgres@iZbp12984c7sv43o6lbbxoZ-> pg_test_fsync
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync 32804.010 ops/sec 30 usecs/op
fdatasync 28216.659 ops/sec 35 usecs/op
fsync 25325.493 ops/sec 39 usecs/op
fsync_writethrough n/a
open_sync 29137.619 ops/sec 34 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync 14620.630 ops/sec 68 usecs/op
fdatasync 24263.659 ops/sec 41 usecs/op
fsync 21762.878 ops/sec 46 usecs/op
fsync_writethrough n/a
open_sync 13006.581 ops/sec 77 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
1 * 16kB open_sync write 25182.114 ops/sec 40 usecs/op
2 * 8kB open_sync writes 13001.270 ops/sec 77 usecs/op
4 * 4kB open_sync writes 7233.661 ops/sec 138 usecs/op
8 * 2kB open_sync writes 3865.482 ops/sec 259 usecs/op
16 * 1kB open_sync writes 1979.762 ops/sec 505 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
write, fsync, close 24085.484 ops/sec 42 usecs/op
write, close, fsync 24072.174 ops/sec 42 usecs/op

Non-sync'ed 8kB writes:
write 192054.909 ops/sec 5 usecs/op
```

Cloud-disk logical volume (1.8T*8)

```
su - postgres
cd $PGDATA

postgres@iZbp1621kc3de3pm12a0exZ-> pg_test_fsync
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync 6345.217 ops/sec 158 usecs/op
fdatasync 6087.221 ops/sec 164 usecs/op
fsync 3689.149 ops/sec 271 usecs/op
fsync_writethrough n/a
open_sync 3942.073 ops/sec 254 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync 3020.085 ops/sec 331 usecs/op
fdatasync 4684.567 ops/sec 213 usecs/op
fsync 2785.027 ops/sec 359 usecs/op
fsync_writethrough n/a
open_sync 1814.915 ops/sec 551 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
1 * 16kB open_sync write 2847.150 ops/sec 351 usecs/op
2 * 8kB open_sync writes 1920.576 ops/sec 521 usecs/op
4 * 4kB open_sync writes 1023.594 ops/sec 977 usecs/op
8 * 2kB open_sync writes 82.223 ops/sec 12162 usecs/op
16 * 1kB open_sync writes 63.008 ops/sec 15871 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
write, fsync, close 3677.066 ops/sec 272 usecs/op
write, close, fsync 3708.723 ops/sec 270 usecs/op

Non-sync'ed 8kB writes:
write 297195.065 ops/sec 3 usecs/op
```

Cloud-disk single volume (20T)

```
su - postgres

export PGPORT=1922
export PGDATA=/data02/pg12_$PGPORT/pg_root
export PGHOST=$PGDATA
cd $PGDATA

postgres@iZbp1621kc3de3pm12a0exZ-> pg_test_fsync
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync 6103.193 ops/sec 164 usecs/op
fdatasync 5901.509 ops/sec 169 usecs/op
fsync 3893.064 ops/sec 257 usecs/op
fsync_writethrough n/a
open_sync 3971.187 ops/sec 252 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync 3010.795 ops/sec 332 usecs/op
fdatasync 5034.260 ops/sec 199 usecs/op
fsync 3648.353 ops/sec 274 usecs/op
fsync_writethrough n/a
open_sync 2033.980 ops/sec 492 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
1 * 16kB open_sync write 3574.507 ops/sec 280 usecs/op
2 * 8kB open_sync writes 1996.539 ops/sec 501 usecs/op
4 * 4kB open_sync writes 1079.944 ops/sec 926 usecs/op
8 * 2kB open_sync writes 96.440 ops/sec 10369 usecs/op
16 * 1kB open_sync writes 75.738 ops/sec 13203 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
write, fsync, close 3716.489 ops/sec 269 usecs/op
write, close, fsync 3641.653 ops/sec 275 usecs/op

Non-sync'ed 8kB writes:
write 297784.864 ops/sec 3 usecs/op
```
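Across all three layouts, open_datasync posts the lowest single-8kB-write latency, while fdatasync is PostgreSQL's default on Linux. If desired, wal_sync_method can be pointed at the winner; a sketch (this parameter only needs a reload, not a restart):

```
# hedged follow-up: adopt the fastest method reported by pg_test_fsync
psql -c "alter system set wal_sync_method = 'open_datasync';"
pg_ctl reload
psql -c "show wal_sync_method;"
```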

fio test (IOPS and throughput capability)

yum install -y fio

yum install -y iotop

Test scripts

```
vi fiotest

function RunFio
{
 numjobs=$1   # number of test threads, e.g. 8
 iodepth=$2   # max number of in-flight I/Os, e.g. 64
 bs=$3        # block size of a single I/O, e.g. 4K
 rw=$4        # read/write mode, e.g. randwrite
 filename=$5  # name of the test file, e.g. /data01/test
 ioengine=$6  # io engine: libaio, sync, etc.; see man fio
 direct=$7    # whether to bypass the page cache; see man fio

 nr_cpus=`cat /proc/cpuinfo |grep "processor" |wc -l`
 if [ $nr_cpus -lt $numjobs ];then
  echo "Numjobs is more than cpu cores, exit!"
  exit -1
 fi
 let nu=$numjobs+1
 cpulist=""
 for ((i=1;i<10;i++))
 do
  list=`cat /sys/block/vdb/mq/*/cpu_list | awk '{if(i<=NF) print $i;}' i="$i" | tr -d ',' | tr '\n' ','`
  if [ -z "$list" ];then
   break
  fi
  cpulist=${cpulist}${list}
 done
 spincpu=`echo $cpulist | cut -d ',' -f 2-${nu}`
 echo $spincpu
 fio --ioengine=${ioengine} --runtime=60s --numjobs=${numjobs} --iodepth=${iodepth} --bs=${bs} --rw=${rw} --filename=${filename} --time_based=1 --direct=${direct} --name=test --group_reporting --cpus_allowed=$spincpu --cpus_allowed_policy=split --size=16G
}

# set queue rq_affinity for the essd block devices, assuming 9 essd disks at vd[b-j]
echo 2 > /sys/block/vdb/queue/rq_affinity
echo 2 > /sys/block/vdc/queue/rq_affinity
echo 2 > /sys/block/vdd/queue/rq_affinity
echo 2 > /sys/block/vde/queue/rq_affinity
echo 2 > /sys/block/vdf/queue/rq_affinity
echo 2 > /sys/block/vdg/queue/rq_affinity
echo 2 > /sys/block/vdh/queue/rq_affinity
echo 2 > /sys/block/vdi/queue/rq_affinity
echo 2 > /sys/block/vdj/queue/rq_affinity

sleep 5
RunFio $1 $2 $3 $4 $5 $6 $7

# example: RunFio 16 64 8k randwrite /data01/test libaio 1
```

```
vi fiotest_1

function RunFio
{
 numjobs=$1   # number of test threads, e.g. 8
 iodepth=$2   # max number of in-flight I/Os, e.g. 64
 bs=$3        # block size of a single I/O, e.g. 4K
 rw=$4        # read/write mode, e.g. randwrite
 filename=$5  # name of the test file, e.g. /data01/test
 ioengine=$6  # io engine: libaio, sync, etc.; see man fio
 direct=$7    # whether to bypass the page cache; see man fio
 size=$8      # how much data to write, e.g. 16G

 nr_cpus=`cat /proc/cpuinfo |grep "processor" |wc -l`
 if [ $nr_cpus -lt $numjobs ];then
  echo "Numjobs is more than cpu cores, exit!"
  exit -1
 fi
 let nu=$numjobs+1
 cpulist=""
 for ((i=1;i<10;i++))
 do
  list=`cat /sys/block/vdb/mq/*/cpu_list | awk '{if(i<=NF) print $i;}' i="$i" | tr -d ',' | tr '\n' ','`
  if [ -z "$list" ];then
   break
  fi
  cpulist=${cpulist}${list}
 done
 spincpu=`echo $cpulist | cut -d ',' -f 2-${nu}`
 echo $spincpu
 fio --ioengine=${ioengine} --runtime=60s --numjobs=${numjobs} --iodepth=${iodepth} --bs=${bs} --rw=${rw} --filename=${filename} --time_based=1 --direct=${direct} --name=test --group_reporting --cpus_allowed=$spincpu --cpus_allowed_policy=split --size=${size}
}

# set queue rq_affinity for the essd block devices, assuming 9 essd disks at vd[b-j]
echo 2 > /sys/block/vdb/queue/rq_affinity
echo 2 > /sys/block/vdc/queue/rq_affinity
echo 2 > /sys/block/vdd/queue/rq_affinity
echo 2 > /sys/block/vde/queue/rq_affinity
echo 2 > /sys/block/vdf/queue/rq_affinity
echo 2 > /sys/block/vdg/queue/rq_affinity
echo 2 > /sys/block/vdh/queue/rq_affinity
echo 2 > /sys/block/vdi/queue/rq_affinity
echo 2 > /sys/block/vdj/queue/rq_affinity

sleep 5
RunFio $1 $2 $3 $4 $5 $6 $7 $8

# example: RunFio 16 64 8k randwrite /data01/test libaio 1 16G
```
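Both scripts are sourced with positional arguments; the runs below were driven like this:

```
# 32 jobs, iodepth 8, 8K random writes, direct IO, against /data01/test
. ./fiotest 32 8 8K randwrite /data01/test libaio 1
# fiotest_1 takes the target size as an extra argument
. ./fiotest_1 32 8 8K randwrite /data01/test libaio 1 16G
```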

Local-disk version (1.8T*8)

```
. ./fiotest 32 8 8K randwrite /data01/test libaio 1
write: IOPS=131k, BW=1024MiB/s (1074MB/s)(29.0GiB/30003msec)

. ./fiotest 32 8 8K randread /data01/test libaio 1
read: IOPS=153k, BW=1193MiB/s (1250MB/s)(34.9GiB/30001msec)

. ./fiotest 32 8 8K write /data01/test libaio 1
write: IOPS=134k, BW=1050MiB/s (1101MB/s)(30.8GiB/30001msec)

. ./fiotest 32 8 8K read /data01/test libaio 1
read: IOPS=151k, BW=1182MiB/s (1240MB/s)(34.6GiB/30002msec)
```

Cloud-disk logical volume (1.8T*8)

```
. ./fiotest 32 8 8K randwrite /data01/test libaio 1
write: IOPS=106k, BW=825MiB/s (865MB/s)(24.2GiB/30003msec)

. ./fiotest 32 8 8K randread /data01/test libaio 1
read: IOPS=106k, BW=826MiB/s (866MB/s)(24.2GiB/30016msec)

. ./fiotest 32 8 8K write /data01/test libaio 1
write: IOPS=106k, BW=826MiB/s (866MB/s)(24.2GiB/30003msec)

. ./fiotest 32 8 8K read /data01/test libaio 1
read: IOPS=106k, BW=829MiB/s (869MB/s)(24.3GiB/30003msec)
```

Cloud-disk single volume (20T)

```
. ./fiotest 32 8 8K randwrite /data02/test libaio 1
write: IOPS=106k, BW=826MiB/s (867MB/s)(24.2GiB/30003msec)

. ./fiotest 32 8 8K randread /data02/test libaio 1
read: IOPS=106k, BW=830MiB/s (870MB/s)(24.3GiB/30004msec)

. ./fiotest 32 8 8K write /data02/test libaio 1
write: IOPS=106k, BW=824MiB/s (865MB/s)(24.2GiB/30006msec)

. ./fiotest 32 8 8K read /data02/test libaio 1
read: IOPS=106k, BW=825MiB/s (866MB/s)(24.2GiB/30004msec)
```
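All three cloud layouts converge on ~106k IOPS at ~830 MiB/s for every pattern, which points to the ECS instance's storage bandwidth cap rather than the disks' rated IOPS (a point the recommendations below return to). A quick back-of-the-envelope check:

```
# 106k IOPS x 8 KiB per IO matches the observed ~830 MiB/s ceiling
echo "$(( 106000 * 8 / 1024 )) MiB/s"    # prints: 828 MiB/s
```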

PostgreSQL tpcb test with 1 billion rows: overall database performance

Local-disk version (1.8T*8)

su - postgres

pgbench -i -s 10000

1000000000 of 1000000000 tuples (100%) done (elapsed 1058.21 s, remaining 0.00 s)

checkpoint;
set max_parallel_workers_per_gather = 16;
set enable_indexscan = on;
set enable_indexonlyscan = on;
set enable_bitmapscan = off;
set enable_seqscan = off;
explain (analyze) select count(*) from pgbench_accounts;
set enable_indexscan = off;
set enable_indexonlyscan = off;
set enable_bitmapscan = off;
set enable_seqscan = on;
explain (analyze) select count(*) from pgbench_accounts;

```
pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120 -S

transaction type:
scaling factor: 10000
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 75279853
latency average = 0.204 ms
latency stddev = 0.557 ms
tps = 627000.964875 (including connections establishing)
tps = 627126.432692 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set aid random(1, 100000 * :scale)
0.205 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
```

```
pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120

transaction type:
scaling factor: 10000
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 3073411
latency average = 4.997 ms
latency stddev = 25.946 ms
tps = 25595.742845 (including connections establishing)
tps = 25600.289357 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set aid random(1, 100000 * :scale)
0.001 \set bid random(1, 1 * :scale)
0.001 \set tid random(1, 10 * :scale)
0.001 \set delta random(-5000, 5000)
0.105 BEGIN;
3.458 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.339 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.220 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.206 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.196 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.488 END;
```

Load time: 1058 seconds.
Read-only qps: 627,000 (the server has 512GB of RAM, so the working set is fully cached).
Read-write tps: 25,595, qps 127,975 (each tpcb transaction runs 5 statements, so qps = 5 × tps).

Cloud-disk logical volume (1.8T*8)

su - postgres

pgbench -i -s 10000

1000000000 of 1000000000 tuples (100%) done (elapsed 948.07 s, remaining 0.00 s)

checkpoint;
set max_parallel_workers_per_gather = 16;
set enable_indexscan = on;
set enable_indexonlyscan = on;
set enable_bitmapscan = off;
set enable_seqscan = off;
explain (analyze) select count(*) from pgbench_accounts;
set enable_indexscan = off;
set enable_indexonlyscan = off;
set enable_bitmapscan = off;
set enable_seqscan = on;
explain (analyze) select count(*) from pgbench_accounts;

```
pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120 -S

transaction type:
scaling factor: 10000
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 7802607
latency average = 1.968 ms
latency stddev = 9.345 ms
tps = 64979.905944 (including connections establishing)
tps = 64991.584022 (excluding connections establishing)
statement latencies in milliseconds:
0.000 \set aid random(1, 100000 * :scale)
1.970 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
```

```
pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120

transaction type:
scaling factor: 10000
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 3438342
latency average = 4.467 ms
latency stddev = 11.680 ms
tps = 28627.417199 (including connections establishing)
tps = 28633.295374 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set aid random(1, 100000 * :scale)
0.000 \set bid random(1, 1 * :scale)
0.000 \set tid random(1, 10 * :scale)
0.000 \set delta random(-5000, 5000)
0.035 BEGIN;
3.967 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.084 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.117 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.084 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.071 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.109 END;
```

Load time: 948 seconds.
Read-only qps: 64,979 (192GB of RAM: low cache-hit ratio, and per-IO latency worse than local disks).
Read-write tps: 28,627, qps 143,135.

Cloud-disk single volume (20T)

```
su - postgres
export PGPORT=1922
export PGDATA=/data02/pg12_$PGPORT/pg_root
export PGHOST=$PGDATA

pgbench -i -s 10000
1000000000 of 1000000000 tuples (100%) done (elapsed 1082.59 s, remaining 0.00 s)
```

checkpoint;
set max_parallel_workers_per_gather = 16;
set enable_indexscan = on;
set enable_indexonlyscan = on;
set enable_bitmapscan = off;
set enable_seqscan = off;
explain (analyze) select count(*) from pgbench_accounts;
set enable_indexscan = off;
set enable_indexonlyscan = off;
set enable_bitmapscan = off;
set enable_seqscan = on;
explain (analyze) select count(*) from pgbench_accounts;

```
pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120 -S

transaction type:
scaling factor: 10000
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 4473595
latency average = 3.433 ms
latency stddev = 5.682 ms
tps = 37253.374526 (including connections establishing)
tps = 37259.270387 (excluding connections establishing)
statement latencies in milliseconds:
0.000 \set aid random(1, 100000 * :scale)
3.434 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
```

```
pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120

transaction type:
scaling factor: 10000
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 3790414
latency average = 4.052 ms
latency stddev = 12.467 ms
tps = 31545.429236 (including connections establishing)
tps = 31550.985798 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set aid random(1, 100000 * :scale)
0.000 \set bid random(1, 1 * :scale)
0.000 \set tid random(1, 10 * :scale)
0.000 \set delta random(-5000, 5000)
0.024 BEGIN;
3.700 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.053 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.077 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.062 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.081 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.052 END;
```

Load time: 1082 seconds.
Read-only qps: 37,253 (192GB of RAM: low cache-hit ratio, and per-IO latency worse than local disks).
Read-write tps: 31,545, qps 157,725.

PostgreSQL tpcb test with 10 billion rows: overall database performance

```
postgres=# \dt+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------------------+-------+----------+---------+-------------
public | pgbench_accounts | table | postgres | 1251 GB |
public | pgbench_branches | table | postgres | 3576 kB |
public | pgbench_history | table | postgres | 0 bytes |
public | pgbench_tellers | table | postgres | 42 MB |
(4 rows)

postgres=# \di+
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+-----------------------+-------+----------+------------------+---------+-------------
public | pgbench_accounts_pkey | index | postgres | pgbench_accounts | 209 GB |
public | pgbench_branches_pkey | index | postgres | pgbench_branches | 4408 kB |
public | pgbench_tellers_pkey | index | postgres | pgbench_tellers | 43 MB |
(3 rows)
```

Local SSD 1.8T*8

```
10000000000 of 10000000000 tuples (100%) done (elapsed 9486.77 s, remaining 0.00 s)

pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120 -S

transaction type:
scaling factor: 100000
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 19059491
latency average = 0.806 ms
latency stddev = 14.075 ms
tps = 158451.015605 (including connections establishing)
tps = 158486.055663 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set aid random(1, 100000 * :scale)
0.813 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

Total DISK READ : 4.56 G/s | Total DISK WRITE : 16.16 M/s
Actual DISK READ: 4.56 G/s | Actual DISK WRITE: 16.49 M/s
peaked at about 7 GB/s

pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120

transaction type:
scaling factor: 100000
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 2969763
latency average = 5.171 ms
latency stddev = 27.769 ms
tps = 24720.906138 (including connections establishing)
tps = 24727.358605 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set aid random(1, 100000 * :scale)
0.001 \set bid random(1, 1 * :scale)
0.001 \set tid random(1, 10 * :scale)
0.001 \set delta random(-5000, 5000)
0.088 BEGIN;
3.596 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.304 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.223 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.218 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.412 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.340 END;
```

essd pl3 1.8T*8

```
10000000000 of 10000000000 tuples (100%) done (elapsed 8195.55 s, remaining 0.00 s)

pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120 -S

transaction type:
scaling factor: 100000
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 2536629
latency average = 6.055 ms
latency stddev = 7.617 ms
tps = 21123.418940 (including connections establishing)
tps = 21126.649320 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set aid random(1, 100000 * :scale)
6.054 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

Total DISK READ : 793.93 M/s | Total DISK WRITE : 1475.73 K/s
Actual DISK READ: 793.21 M/s | Actual DISK WRITE: 1351.46 K/s
peaked at 900+ MB/s

pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120

transaction type:
scaling factor: 100000
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 2071345
latency average = 7.415 ms
latency stddev = 9.884 ms
tps = 17243.124457 (including connections establishing)
tps = 17246.457208 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set aid random(1, 100000 * :scale)
0.000 \set bid random(1, 1 * :scale)
0.000 \set tid random(1, 10 * :scale)
0.000 \set delta random(-5000, 5000)
0.019 BEGIN;
7.153 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.047 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.077 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.051 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.037 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.029 END;
```

essd pl3 20T

```
10000000000 of 10000000000 tuples (100%) done (elapsed 7983.23 s, remaining 0.00 s)

pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120 -S

transaction type:
scaling factor: 100000
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 2702617
latency average = 5.683 ms
latency stddev = 4.769 ms
tps = 22504.744988 (including connections establishing)
tps = 22508.795301 (excluding connections establishing)
statement latencies in milliseconds:
0.000 \set aid random(1, 100000 * :scale)
5.681 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

postgres=# select wait_event,count(*) from pg_stat_activity group by 1 order by 2 desc;
wait_event | count
---------------------+-------
DataFileRead | 128
BgWriterMain | 1
| 1
AutoVacuumMain | 1
CheckpointerMain | 1
LogicalLauncherMain | 1
WalWriterMain | 1
(7 rows)

pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120

transaction type:
scaling factor: 100000
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 2052621
latency average = 7.482 ms
latency stddev = 5.452 ms
tps = 17087.070214 (including connections establishing)
tps = 17090.310318 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set aid random(1, 100000 * :scale)
0.000 \set bid random(1, 1 * :scale)
0.000 \set tid random(1, 10 * :scale)
0.000 \set delta random(-5000, 5000)
0.020 BEGIN;
7.224 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.046 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.073 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.052 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.037 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.029 END;
```

Additional test: 104-vCPU / 384GB ECS + 20T essd pl3

The bandwidth and IOPS of a cloud disk depend not only on the disk's grade and size but also on the ECS instance type, so one more machine was tested: a 104-vCPU, 384GB ECS with a 20T essd pl3.

104 vCPUs / 384GB, 20T essd pl3

```
pg_test_fsync
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync 6358.070 ops/sec 157 usecs/op
fdatasync 6221.143 ops/sec 161 usecs/op
fsync 3922.576 ops/sec 255 usecs/op
fsync_writethrough n/a
open_sync 4078.494 ops/sec 245 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync 3237.264 ops/sec 309 usecs/op
fdatasync 5393.018 ops/sec 185 usecs/op
fsync 3499.650 ops/sec 286 usecs/op
fsync_writethrough n/a
open_sync 2117.949 ops/sec 472 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
1 * 16kB open_sync write 3684.151 ops/sec 271 usecs/op
2 * 8kB open_sync writes 2083.492 ops/sec 480 usecs/op
4 * 4kB open_sync writes 1142.101 ops/sec 876 usecs/op
8 * 2kB open_sync writes 85.792 ops/sec 11656 usecs/op
16 * 1kB open_sync writes 69.520 ops/sec 14384 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
write, fsync, close 3878.116 ops/sec 258 usecs/op
write, close, fsync 3795.515 ops/sec 263 usecs/op

Non-sync'ed 8kB writes:
write 325926.283 ops/sec 3 usecs/op
```

fio

```
. ./fiotest 52 8 8K randwrite /data01/test libaio 1 16G
write: IOPS=167k, BW=1303MiB/s (1366MB/s)(76.3GiB/60006msec)

. ./fiotest 52 8 8K randread /data01/test libaio 1 16G
read: IOPS=141k, BW=1103MiB/s (1156MB/s)(64.6GiB/60005msec)

. ./fiotest 52 8 8K write /data01/test libaio 1 16G
write: IOPS=158k, BW=1231MiB/s (1291MB/s)(72.1GiB/60004msec)

. ./fiotest 52 8 8K read /data01/test libaio 1 16G
read: IOPS=121k, BW=943MiB/s (988MB/s)(55.2GiB/60006msec)
```

1 billion rows

pgbench -i -s 10000

1000000000 of 1000000000 tuples (100%) done (elapsed 1058.21 s, remaining 0.00 s)

checkpoint;
set max_parallel_workers_per_gather = 16;
set enable_indexscan = on;
set enable_indexonlyscan = on;
set enable_bitmapscan = off;
set enable_seqscan = off;
explain (analyze) select count(*) from pgbench_accounts;
set enable_indexscan = off;
set enable_indexonlyscan = off;
set enable_bitmapscan = off;
set enable_seqscan = on;
explain (analyze) select count(*) from pgbench_accounts;

```
pgbench -M prepared -n -r -P 3 -c 52 -j 52 -T 120 -S

transaction type:
scaling factor: 10000
query mode: prepared
number of clients: 52
number of threads: 52
duration: 120 s
number of transactions actually processed: 106501930
latency average = 0.058 ms
latency stddev = 0.017 ms
tps = 887505.958088 (including connections establishing)
tps = 887576.878385 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set aid random(1, 100000 * :scale)
0.058 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
```

```
pgbench -M prepared -n -r -P 3 -c 52 -j 52 -T 120

transaction type:
scaling factor: 10000
query mode: prepared
number of clients: 52
number of threads: 52
duration: 120 s
number of transactions actually processed: 9363834
latency average = 0.666 ms
latency stddev = 4.141 ms
tps = 78006.043823 (including connections establishing)
tps = 78014.124808 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set aid random(1, 100000 * :scale)
0.001 \set bid random(1, 1 * :scale)
0.000 \set tid random(1, 10 * :scale)
0.000 \set delta random(-5000, 5000)
0.020 BEGIN;
0.417 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.039 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.053 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.046 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.046 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.044 END;
```

10 billion rows

pgbench -i -s 100000

10000000000 of 10000000000 tuples (100%) done (elapsed 7789.99 s, remaining 0.00 s)

```
pgbench -M prepared -n -r -P 3 -c 52 -j 52 -T 120 -S

transaction type:
scaling factor: 100000
query mode: prepared
number of clients: 52
number of threads: 52
duration: 120 s
number of transactions actually processed: 3833782
latency average = 1.627 ms
latency stddev = 1.566 ms
tps = 31946.758650 (including connections establishing)
tps = 31949.973049 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set aid random(1, 100000 * :scale)
1.628 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
```

```
pgbench -M prepared -n -r -P 3 -c 52 -j 52 -T 120

transaction type:
scaling factor: 100000
query mode: prepared
number of clients: 52
number of threads: 52
duration: 120 s
number of transactions actually processed: 2287129
latency average = 2.728 ms
latency stddev = 2.637 ms
tps = 19056.932156 (including connections establishing)
tps = 19058.904144 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set aid random(1, 100000 * :scale)
0.000 \set bid random(1, 1 * :scale)
0.000 \set tid random(1, 10 * :scale)
0.000 \set delta random(-5000, 5000)
0.016 BEGIN;
2.504 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.038 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.066 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.046 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.031 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.024 END;
```

storage | disk layout | rated IOPS | cpu | memory
---|---|---|---|---
essd pl3 | 20T | 1,000,000; 4GB/s bandwidth | 52 cores / 104 threads, 2.5GHz | 384GB

test case | 104-thread machine, essd pl3 20T
---|---
fdatasync 8K | 6221

test case | 104-thread machine, essd pl3 20T
---|---
8k random write | IOPS=167k, BW=1303MiB/s
8k random read | IOPS=141k, BW=1103MiB/s
8k sequential write | IOPS=158k, BW=1231MiB/s
8k sequential read | IOPS=121k, BW=943MiB/s

test case | 104-thread machine, essd pl3 20T
---|---
1B-row load | 1058 s
1B-row read-only | qps: 887505
1B-row read-write | tps 78006 , qps 390030

test case | 104-thread machine, essd pl3 20T
---|---
10B-row load | 7789 s
10B-row read-only | qps: 31946
10B-row read-write | tps 19056 , qps 95280

Additional test 2

26 cores / 52 threads, 192GB of RAM, 1.8T essd

```
export PGPORT=1923
export PGDATA=/data03/pg12_$PGPORT/pg_root
export PGHOST=$PGDATA

initdb -D $PGDATA -U postgres -E SQL_ASCII --lc-collate=C --lc-ctype=en_US.utf8 --wal-segsize=1024
```

```
postgres@iZbp1621kc3de3pm12a0exZ-> pg_test_fsync
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync 6455.059 ops/sec 155 usecs/op
fdatasync 6114.725 ops/sec 164 usecs/op
fsync 4182.573 ops/sec 239 usecs/op
fsync_writethrough n/a
open_sync 4318.160 ops/sec 232 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync 3187.113 ops/sec 314 usecs/op
fdatasync 5170.527 ops/sec 193 usecs/op
fsync 3430.941 ops/sec 291 usecs/op
fsync_writethrough n/a
open_sync 2123.727 ops/sec 471 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
1 * 16kB open_sync write 3902.495 ops/sec 256 usecs/op
2 * 8kB open_sync writes 2136.396 ops/sec 468 usecs/op
4 * 4kB open_sync writes 1138.692 ops/sec 878 usecs/op
8 * 2kB open_sync writes 90.899 ops/sec 11001 usecs/op
16 * 1kB open_sync writes 74.271 ops/sec 13464 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
write, fsync, close 4072.978 ops/sec 246 usecs/op
write, close, fsync 4136.551 ops/sec 242 usecs/op

Non-sync'ed 8kB writes:
write 296604.607 ops/sec 3 usecs/op
```

```
. ./fiotest 32 8 8K randwrite /data03/test libaio 1 16G
write: IOPS=91.9k, BW=718MiB/s (753MB/s)(42.1GiB/60019msec)

. ./fiotest 32 8 8K randread /data03/test libaio 1 16G
read: IOPS=92.2k, BW=720MiB/s (755MB/s)(42.2GiB/60003msec)

. ./fiotest 32 8 8K write /data03/test libaio 1 16G
write: IOPS=91.9k, BW=718MiB/s (753MB/s)(42.1GiB/60003msec)

. ./fiotest 32 8 8K read /data03/test libaio 1 16G
read: IOPS=92.0k, BW=719MiB/s (754MB/s)(42.2GiB/60026msec)
```

```
pgbench -i -s 10000

1000000000 of 1000000000 tuples (100%) done (elapsed 735.47 s, remaining 0.00 s)
```

checkpoint;
set max_parallel_workers_per_gather = 16;
set enable_indexscan = on;
set enable_indexonlyscan = on;
set enable_bitmapscan = off;
set enable_seqscan = off;
explain (analyze) select count(*) from pgbench_accounts;
set enable_indexscan = off;
set enable_indexonlyscan = off;
set enable_bitmapscan = off;
set enable_seqscan = on;
explain (analyze) select count(*) from pgbench_accounts;

```
pgbench -M prepared -n -r -P 3 -c 32 -j 32 -T 120 -S

transaction type:
scaling factor: 10000
query mode: prepared
number of clients: 32
number of threads: 32
duration: 120 s
number of transactions actually processed: 9111047
latency average = 0.421 ms
latency stddev = 1.461 ms
tps = 75913.573252 (including connections establishing)
tps = 75917.478749 (excluding connections establishing)
statement latencies in milliseconds:
0.000 \set aid random(1, 100000 * :scale)
0.421 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
```

```
pgbench -M prepared -n -r -P 3 -c 32 -j 32 -T 120

transaction type:
scaling factor: 10000
query mode: prepared
number of clients: 32
number of threads: 32
duration: 120 s
number of transactions actually processed: 4021731
latency average = 0.955 ms
latency stddev = 1.840 ms
tps = 33511.413835 (including connections establishing)
tps = 33513.166609 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set aid random(1, 100000 * :scale)
0.000 \set bid random(1, 1 * :scale)
0.000 \set tid random(1, 10 * :scale)
0.000 \set delta random(-5000, 5000)
0.019 BEGIN;
0.756 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.036 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.046 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.039 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.032 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.025 END;
```

Performance summary

Test environments:

storage | disk layout | rated IOPS | cpu | memory
---|---|---|---|---
local ssd | 1.8T*8 | 240,000 per disk | 32 cores / 64 threads, 2.5GHz | 512GB
essd pl3 | 1.8T | 91,800 | 26 cores / 52 threads, 2.5GHz | 192GB
essd pl3 | 1.8T*8 | 91,800 per disk | 26 cores / 52 threads, 2.5GHz | 192GB
essd pl3 | 20T | 1,000,000 | 26 cores / 52 threads, 2.5GHz | 192GB
essd pl3 | 20T | 1,000,000 | 52 cores / 104 threads, 2.5GHz | 384GB

fsync performance test (single-IO capability; gauges sync-commit latency)

test case | local ssd 1.8T*8 | essd pl3 1.8T*8 | essd pl3 20T (26-core) | essd pl3 20T (52-core) | essd pl3 1.8T (26-core)
---|---|---|---|---|---
fdatasync 8K | 28216 | 6087 | 5901 | 6221 | 6114

fio test (IOPS and throughput capability)

test case | local ssd 1.8T*8 | essd pl3 1.8T*8 | essd pl3 20T (26-core) | essd pl3 20T (52-core) | essd pl3 1.8T (26-core)
---|---|---|---|---|---
8k random write | IOPS=131k, BW=1024MiB/s | IOPS=106k, BW=825MiB/s | IOPS=106k, BW=826MiB/s | IOPS=167k, BW=1303MiB/s | IOPS=91.9k, BW=718MiB/s
8k random read | IOPS=153k, BW=1193MiB/s | IOPS=106k, BW=826MiB/s | IOPS=106k, BW=830MiB/s | IOPS=141k, BW=1103MiB/s | IOPS=92.2k, BW=720MiB/s
8k sequential write | IOPS=134k, BW=1050MiB/s | IOPS=106k, BW=826MiB/s | IOPS=106k, BW=824MiB/s | IOPS=158k, BW=1231MiB/s | IOPS=91.9k, BW=718MiB/s
8k sequential read | IOPS=151k, BW=1182MiB/s | IOPS=106k, BW=829MiB/s | IOPS=106k, BW=825MiB/s | IOPS=121k, BW=943MiB/s | IOPS=92.0k, BW=719MiB/s

PostgreSQL tpcb test with 1 billion rows: overall database performance

test case | local ssd 1.8T*8 | essd pl3 1.8T*8 | essd pl3 20T (26-core) | essd pl3 20T (52-core) | essd pl3 1.8T (26-core)
---|---|---|---|---|---
1B-row load | 1058 s | 948 s | 1082 s | 1058 s | 735 s
1B-row read-only | qps: 627000 (512GB RAM, fully cached) | qps: 64979 (192GB RAM, low hit ratio, per-IO latency worse than local disks) | qps: 37253 (192GB RAM, low hit ratio, per-IO latency worse than local disks) | qps: 887505 (384GB RAM, fully cached) | qps: 75913 (192GB RAM, low hit ratio, per-IO latency worse than local disks)
1B-row read-write | tps 25595 , qps 127975 | tps 28627 , qps 143135 | tps 31545 , qps 157725 | tps 78006 , qps 390030 | tps 33511 , qps 167555

PostgreSQL tpcb test with 10 billion rows: overall database performance

test case | local ssd 1.8T*8 | essd pl3 1.8T*8 | essd pl3 20T (26-core) | essd pl3 20T (52-core)
---|---|---|---|---
10B-row load | 9486 s | 8195 s | 7983 s | 7789 s
10B-row read-only | qps: 158451 | qps: 21123 | qps: 22504 | qps: 31946
10B-row read-write | tps 24720 , qps 123600 | tps 17243 , qps 86215 | tps 17087 , qps 85435 | tps 19056 , qps 95280

Pros and cons of local disks vs cloud disks for PG

Advantages of the cloud-disk version:

  • More reliable (3 replicas vs 1): if the standby node dies, the primary still has 3 copies of the data; with local disks only 1 copy remains.
  • More secure (cloud-disk encryption is supported).
  • Larger capacity (up to 512TB, as 32TB*16 disks, vs 14T locally).
  • Faster backups (snapshots vs copying files): backing up more than 1T of local disks can take 10+ hours.
  • When data files live on local disks, backups generate network traffic and occupy the OS page cache (buffered IO).
  • Faster cloning (snapshot restore vs copying files).
  • If the host of a cloud-disk instance dies, no data rebuild is needed: attach the cloud disks to another ECS and recovery is quick.
  • Smoother performance: local SSD suffers noticeable jitter during GC, while cloud-disk data is striped across many devices, so overall IO is smoother.

Drawbacks of cloud disks:

  • Higher single-IO latency (164us vs 30us). How to cope?
    • Both the OS and the database do read-ahead and write merging, and much of PG's work goes through buffered IO, so single-IO latency has almost no impact on PG performance in most cases.
    • Under concurrency, the waits introduced by per-IO latency are amortized away.
    • Keep individual tables from growing too large; partition them so that many vacuum operations can run in parallel (see the sketch after this list).
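A minimal declarative-partitioning sketch along those lines (table and partition names are illustrative, not from the original tests):

```
psql <<'SQL'
-- hash-partition a large table so per-partition vacuums can run concurrently
create table accounts (aid bigint primary key, abalance int)
  partition by hash (aid);
create table accounts_0 partition of accounts for values with (modulus 4, remainder 0);
create table accounts_1 partition of accounts for values with (modulus 4, remainder 1);
create table accounts_2 partition of accounts for values with (modulus 4, remainder 2);
create table accounts_3 partition of accounts for values with (modulus 4, remainder 3);
SQL
```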

Recommendations for using cloud disks:

Because cloud-disk IO adds latency, compensate with parallelism.

1. Partition tables wherever possible to raise vacuum throughput.

2. For analytical workloads, use a 32K block size to move more data per IO and raise throughput (see the sketch after this list).

3. Use block-device read-ahead at the OS and database level (also sketched below).

4. A cloud disk's IOPS and bandwidth limits depend not only on the disk's grade and size but also on the ECS instance type; consider both, and pick a large instance type for IO-sensitive workloads.
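Hedged sketches for points 2 and 3 (the install prefix and read-ahead value are illustrative, not from the original tests):

```
# point 2: the 32K block size is a compile-time option when building PostgreSQL
./configure --prefix=/usr/local/pgsql12_32k --with-blocksize=32
make -j
make install

# point 3: raise OS-level read-ahead on the data volume
# (units are 512-byte sectors; 16384 sectors = 8 MiB)
blockdev --setra 16384 /dev/mapper/vgdata01-lv01
blockdev --getra /dev/mapper/vgdata01-lv01
```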

Suggested kernel optimizations for the cloud-disk scenario:

Parallelize IO:

bgwriter: parallelize.

checkpoint: parallelize.

walwriter: parallelize.

vacuum of a single table: parallelize.

Operations that are already parallel:

See the PG 12 material on parallel execution: table creation, index builds, and queries are already parallelized.

Suggestions for DBAs:

Enable backend flush so that more processes take part in IO, raising throughput and easing the checkpointer's burden.
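backend_flush_after already appears in the configurations above; as a sketch, it can also be adjusted online:

```
psql -c "alter system set backend_flush_after = '2MB';"
psql -c "select pg_reload_conf();"
psql -c "show backend_flush_after;"
```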
