Preface
Some of our production clusters were built quite a while ago and have been running for over a year without trouble. Recently, though, an expansion request came in: the customer wants to add 11 machines on top of the original 7. The original design did not reserve much PG headroom, with roughly 100 PGs per OSD, so adding that many machines would drive the PG count per OSD down sharply. Hence the need to adjust the PG count on a production cluster.
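Before deciding on new pg_num values, it helps to estimate roughly how many PGs the expanded cluster should carry. Below is a minimal shell sketch using the usual rule of thumb of (OSD count × target PGs per OSD) / replica size, rounded up to a power of two; the 12-OSDs-per-node figure and replica size of 3 are assumptions for illustration, not values taken from this cluster, and the result is a total to be shared across pools by expected data share.
# rule of thumb: total PGs ≈ OSDs * target-PGs-per-OSD / replica size, rounded up to a power of two
OSDS=$((18 * 12))        # 7 + 11 nodes, assuming 12 OSDs per node
TARGET_PER_OSD=100
REPLICA=3
RAW=$(( OSDS * TARGET_PER_OSD / REPLICA ))
PG=1; while [ "$PG" -lt "$RAW" ]; do PG=$((PG * 2)); done
echo "suggested total pg_num: $PG"   # 7200 -> 8192 with the assumptions above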
Research First
To play it safe, we first research the procedure for increasing PGs, then try it in a non-production environment, and only then run it in production. The target cluster runs Mimic.
On increasing the PG count, the official documentation says:
Source: https://docs.ceph.com/docs/mimic/rados/operations/placement-groups/
SET THE NUMBER OF PLACEMENT GROUPS
To set the number of placement groups in a pool, you must specify the number of placement groups at the time you create the pool. See Create a Pool for details. Even after a pool is created you can also change the number of placement groups with:
ceph osd pool set {pool-name} pg_num {pg_num}
After you increase the number of placement groups, you must also increase the number of placement groups for placement (pgp_num) before your cluster will rebalance. The pgp_num will be the number of placement groups that will be considered for placement by the CRUSH algorithm. Increasing pg_num splits the placement groups but data will not be migrated to the newer placement groups until placement groups for placement, ie. pgp_num is increased. The pgp_num should be equal to the pg_num. To increase the number of placement groups for placement, execute the following:
ceph osd pool set {pool-name} pgp_num {pgp_num}
When decreasing the number of PGs, pgp_num is adjusted automatically for you.
So it appears that running ceph osd pool set {pool-name} pg_num {pg_num} on its own, without increasing pgp_num, will not trigger data migration.
This is covered in section 1.5 of chapter 1 of 《ceph之rados设计原理与实现》, although that book does not go into the details of PG splitting on BlueStore, so here is a brief explanation. For an existing pool, to increase its PG count without triggering large-scale data migration, Ceph splits the existing PGs. Each newly created PG keeps the same mapping rule as the PG it was split from, so the new PGs (children) sit on the same OSDs as the original PGs (ancestors), which avoids data migration. To keep that mapping unchanged, pgp_num is used to record the original PG distribution.
In one sentence: increasing pg_num triggers PG splitting; by default the new PGs do not move after the split, so the PG count grows in place.
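One way to convince yourself of this: take a pool with id 6 whose pg_num goes from 256 to 512 (the illustrative case used in the experiment below). A PG with id 6.x should split into 6.x and 6.(x+0x100) (PG ids are hexadecimal), and ceph pg map should report the same up/acting OSD set for parent and child until pgp_num is raised. The specific PG ids here are only examples.
ceph pg map 6.1      # up/acting set of the original PG
ceph pg map 6.101    # the child produced by the 256 -> 512 split; expected to land on the same OSDs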
There is an obvious catch, though: if you only increase pg_num and never touch pgp_num, pg_num cannot be changed again, because pgp_num records the previous PG distribution. Ceph therefore recommends raising pgp_num to the same value at an appropriate time after changing pg_num.
Note: according to the release notes, in Nautilus and later, pgp_num is automatically adjusted to match pg_num after pg_num is changed, so no manual change is needed.
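Nautilus also ships the PG autoscaler, which is worth being aware of in this context. If you want to keep pg_num under manual control on an existing pool, the per-pool autoscale mode can be inspected and set, for example (a sketch; whether to use warn, off, or on is a policy choice):
ceph osd pool autoscale-status
ceph osd pool set test-zone1.rgw.buckets.index pg_autoscale_mode warn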
Hands-on Experiment
Let's try it on a non-production cluster.
[twj@test-cluster ~]$ sudo ceph osd pool ls detail
pool 6 'test-zone1.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 2725 flags hashpspool stripe_width 0 application rgw
Change this pool's pg_num:
[twj@test-cluster ~]$ sudo ceph osd pool set test-zone1.rgw.buckets.index pg_num 512
[twj@test-cluster ~]$ sudo ceph -s
cluster:
id: d9f72976-63d0-4b42-b149-cd0ca3004862
health: HEALTH_WARN
noout,noscrub,nodeep-scrub flag(s) set
1 pools have pg_num > pgp_num
data:
pools: 10 pools, 34064 pgs
objects: 1.45 G objects, 4.0 PiB
usage: 5.7 PiB used, 9.8 PiB / 16 PiB avail
pgs: 0.499% pgs unknown
0.250% pgs not active
33809 active+clean
170 unknown
85 peering
[twj@test-cluster ~]$ sudo ceph osd pool ls detail
pool 6 'test-zone1.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 512 pgp_num 256 last_change 78945 lfor 0/78945 flags hashpspool stripe_width 0 application rgw
[twj@test-cluster ~]$ sudo ceph -s
cluster:
id: d9f72976-63d0-4b42-b149-cd0ca3004862
health: HEALTH_WARN
noout,noscrub,nodeep-scrub flag(s) set
1 pools have pg_num > pgp_num
data:
pools: 10 pools, 34064 pgs
objects: 1.45 G objects, 4.0 PiB
usage: 5.7 PiB used, 9.8 PiB / 16 PiB avail
pgs: 34064 active+clean
io:
client: 198 MiB/s rd, 6.7 MiB/s wr, 18.08 kop/s rd, 3.34 kop/s wr
After the change, the new PGs are created in place with no data migration, and the cluster warns that pg_num and pgp_num no longer match. Next, change pgp_num.
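Since raising pgp_num is the step that actually moves data, on a busy production cluster it may be worth throttling backfill/recovery first so client I/O keeps priority. A sketch, with example values that are not tuned for this cluster (remember to restore your previous settings afterwards):
# keep backfill/recovery gentle while pgp_num is being raised (example values)
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_sleep 0.1'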
[twj@test-cluster ~]$ sudo ceph osd pool set test-zone1.rgw.buckets.index pgp_num 512
set pool 6 pgp_num to 512
[twj@test-cluster ~]$ sudo ceph -s
cluster:
id: d9f72976-63d0-4b42-b149-cd0ca3004862
health: HEALTH_WARN
noout,noscrub,nodeep-scrub flag(s) set
23002/23247060133 objects misplaced (0.000%)
data:
pools: 10 pools, 34064 pgs
objects: 1.45 G objects, 4.0 PiB
usage: 5.7 PiB used, 9.8 PiB / 16 PiB avail
pgs: 23002/23247060133 objects misplaced (0.000%)
33809 active+clean
246 active+remapped+backfill_wait
9 active+remapped+backfilling
io:
client: 30 MiB/s rd, 2.0 MiB/s wr, 19.44 kop/s rd, 7.46 kop/s wr
recovery: 0 B/s, 204.7 kkeys/s, 8 objects/s
As you can see, once pgp_num is changed, data migration kicks in. Since the pool modified here is the index pool, the recovered data is all keys (hence the keys/s figure in the recovery line).
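Progress can be watched until the misplaced objects drain to zero, after which any recovery throttles set earlier should be put back to their previous values. A simple way to keep an eye on it:
watch -n 30 "ceph -s | grep -E 'misplaced|backfill'"
ceph pg stat   # one-line summary of PG states and recovery rates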
One More Thing
For 14.2.0 and later, increasing pg_num automatically increases pgp_num as well, so there is no need to change pgp_num by hand. Because pgp_num changes automatically, the cluster starts migrating data right away, so keep an eye on the impact on client workloads.
On an N (Nautilus) cluster, after changing pg_num the pool looks like this:
-bash-4.2$ sudo ceph osd pool ls detail
pool 16 'mytestzone1.rgw.buckets.data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16384 pgp_num 14430 pgp_num_target 16384 autoscale_mode warn last_change 5847 lfor 0/0/5636 flags hashpspool,selfmanaged_snaps stripe_width 0 application rgw
removed_snaps [1~3]
There is a new field, pgp_num_target: pgp_num is increased gradually until it matches pg_num.
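The pace of that gradual increase is governed by the mgr option target_max_misplaced_ratio (default 0.05, i.e. the manager tries to keep no more than roughly 5% of objects misplaced at any time). If you need the convergence to be faster or slower, that is the knob to adjust; the value below simply restates the default:
ceph config get mgr target_max_misplaced_ratio
ceph config set mgr target_max_misplaced_ratio 0.05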
Summary
When running Ceph in production, capacity expansion is a scenario you cannot avoid. Broadly speaking there are three approaches: first, expand the existing pool by adding machines/disks to it; second, add a new pool; third, stand up a new cluster. Each has its own trade-offs. When expanding an existing pool, increasing the PG count is hard to get around, and I hope this post can serve as a useful reference.