TiDB: How RocksDB Is Used
RocksDB in TiKV
This analysis reflects only my personal views and is meant for research and discussion.

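RocksDB is a storage engine built on an LSM tree; it stores data as key-value pairs.
- Each TiKV instance uses two RocksDB instances: one for Raft data and one for the actual data.
- The raft RocksDB holds the data that distributed transactions need to replicate (DML and so on). Log entries are replicated to the other TiKV nodes and applied there, which satisfies the majority (quorum) requirement of the Raft protocol.
- The kv RocksDB holds the actual table data.
Looking at a TiKV data directory, you can see the two RocksDB instances, db and raft.
I used to wonder how tables map to SST files in TiDB. After creating two test tables, only one SST file had been generated (000017.sst in the listing below), so my guess is that tables and SST files are in a many-to-many relationship and that you cannot tell which table's data a given SST file holds. This makes me seriously doubt query performance on large tables: the deeper the SST levels and the more widely the data is spread across SST files, the worse the read amplification.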
[root@db01 tidb-data]# tree tikv-20161/
tikv-20161/
├── db
│ ├── 000003.log
│ ├── 000017.sst
│ ├── CURRENT
│ ├── IDENTITY
│ ├── LOCK
│ ├── LOG
│ ├── MANIFEST-000006
│ ├── OPTIONS-000010
│ └── OPTIONS-000012
├── import
├── last_tikv.toml
├── LOCK
├── raft
│ ├── 000003.log
│ ├── CURRENT
│ ├── IDENTITY
│ ├── LOCK
│ ├── LOG
│ ├── MANIFEST-000001
│ └── OPTIONS-000005
├── snap
└── space_placeholder_file
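To check how many SST files each of the two RocksDB directories holds and how large they are, a small script is enough. The following is a minimal sketch; the function name summarize_sst_files is my own, and the directory paths in the usage note are simply the ones from the listing above.

```python
from pathlib import Path


def summarize_sst_files(engine_dir):
    """Count the *.sst files directly inside one RocksDB directory
    (for example tikv-20161/db) and report their sizes in bytes."""
    sizes = [f.stat().st_size for f in sorted(Path(engine_dir).glob("*.sst"))]
    return {
        "count": len(sizes),
        "total_bytes": sum(sizes),
        "max_bytes": max(sizes, default=0),
    }
```

Calling summarize_sst_files("tikv-20161/db") and summarize_sst_files("tikv-20161/raft") on the directory shown above would report one SST file for db and none for raft.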
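Column families (CF) in the TiKV kv engine
A column family (CF) gives RocksDB a logical partition. In terms of implementation, the column families of one instance share a single WAL, while each has its own memtables and SST files. The default column family is named "default".
TiKV builds its MVCC mechanism on top of column families. I have to say that engineering of this kind is an art.

The kv instance that holds the data uses three CFs to implement transactions, locking, and so on.
Write CF: when a user writes a row, the row is stored in the write CF if its length is less than or equal to 255 bytes; if it is longer than 255 bytes, it is stored in the default CF.
Lock CF: stores the lock information of transactions; for a distributed transaction it also stores a pointer to the transaction's primary lock.
Default CF: stores data longer than 255 bytes.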
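The 255-byte rule can be written down as a tiny routing function. This is a simplified model, not TiKV code: the names choose_value_cf and commit_layout are made up for illustration, and the layout (write CF keyed by commit timestamp, default CF keyed by start timestamp) is my understanding of TiKV's Percolator-style design rather than something shown in this article.

```python
SHORT_VALUE_MAX_LEN = 255  # bytes; the threshold described above


def choose_value_cf(value: bytes) -> str:
    """Return the CF that holds the row value once the transaction commits."""
    return "write" if len(value) <= SHORT_VALUE_MAX_LEN else "default"


def commit_layout(key: bytes, value: bytes, start_ts: int, commit_ts: int) -> dict:
    """Toy model of what a committed write leaves behind in each CF.

    Assumption: the write CF record is keyed by (key, commit_ts) and points
    back to start_ts; a long value lives in the default CF under
    (key, start_ts), while a short value is embedded in the write record.
    """
    write_record = {"start_ts": start_ts}
    puts = {}
    if choose_value_cf(value) == "write":
        write_record["short_value"] = value
    else:
        puts[("default", key, start_ts)] = value
    puts[("write", key, commit_ts)] = write_record
    return puts
```

With this model a 255-byte value needs one lookup in the write CF, while a 256-byte value needs a second lookup in the default CF.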
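The configuration output shows that TiKV keeps a separate set of parameters for each column family (besides defaultcf, lockcf, and writecf, the output also lists a raftcf section):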
| tikv | 10.51.xxx.69:20161 | rocksdb.auto-tuned | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.bytes-per-sync | 1MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.compaction-readahead-size | 0KiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.create-if-missing | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.block-based-bloom-filter | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.block-cache-size | 64353MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.block-size | 64KiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.bloom-filter-bits-per-key | 10 |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.cache-index-and-filter-blocks | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.compaction-pri | 3 |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.compaction-style | 0 |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.compression-per-level | ["no","no","lz4","lz4","lz4","zstd","zstd"] |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.disable-auto-compactions | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.disable-block-cache | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.dynamic-level-bytes | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.enable-doubly-skiplist | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.force-consistency-checks | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.hard-pending-compaction-bytes-limit | 256GiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.level0-file-num-compaction-trigger | 4 |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.level0-slowdown-writes-trigger | 20 |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.level0-stop-writes-trigger | 36 |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.max-bytes-for-level-base | 512MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.max-bytes-for-level-multiplier | 10 |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.max-compaction-bytes | 2GiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.max-write-buffer-number | 5 |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.min-write-buffer-number-to-merge | 1 |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.num-levels | 7 |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.optimize-filters-for-hits | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.pin-l0-filter-and-index-blocks | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.prop-keys-index-distance | 40960 |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.prop-size-index-distance | 4194304 |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.read-amp-bytes-per-bit | 0 |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.soft-pending-compaction-bytes-limit | 64GiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.target-file-size-base | 8MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.blob-cache-size | 0KiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.blob-file-compression | lz4 |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.blob-run-mode | normal |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.discardable-ratio | 0.5 |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.gc-merge-rewrite | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.level-merge | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.max-gc-batch-size | 64MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.max-sorted-runs | 20 |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.merge-small-file-threshold | 8MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.min-blob-size | 1KiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.min-gc-batch-size | 16MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.range-merge | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.sample-ratio | 0.1 |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.use-bloom-filter | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.whole-key-filtering | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.write-buffer-size | 128MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.enable-multi-batch-write | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.enable-pipelined-write | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.enable-statistics | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.enable-unordered-write | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.info-log-dir | |
| tikv | 10.51.xxx.69:20161 | rocksdb.info-log-keep-log-file-num | 10 |
| tikv | 10.51.xxx.69:20161 | rocksdb.info-log-max-size | 1GiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.info-log-roll-time | 0s |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.block-based-bloom-filter | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.block-cache-size | 1GiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.block-size | 16KiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.bloom-filter-bits-per-key | 10 |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.cache-index-and-filter-blocks | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.compaction-pri | 0 |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.compaction-style | 0 |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.compression-per-level | ["no","no","no","no","no","no","no"] |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.disable-auto-compactions | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.disable-block-cache | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.dynamic-level-bytes | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.enable-doubly-skiplist | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.force-consistency-checks | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.hard-pending-compaction-bytes-limit | 256GiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.level0-file-num-compaction-trigger | 1 |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.level0-slowdown-writes-trigger | 20 |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.level0-stop-writes-trigger | 36 |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.max-bytes-for-level-base | 128MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.max-bytes-for-level-multiplier | 10 |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.max-compaction-bytes | 2GiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.max-write-buffer-number | 5 |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.min-write-buffer-number-to-merge | 1 |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.num-levels | 7 |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.optimize-filters-for-hits | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.pin-l0-filter-and-index-blocks | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.prop-keys-index-distance | 40960 |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.prop-size-index-distance | 4194304 |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.read-amp-bytes-per-bit | 0 |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.soft-pending-compaction-bytes-limit | 64GiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.target-file-size-base | 8MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.blob-cache-size | 0KiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.blob-file-compression | lz4 |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.blob-run-mode | read-only |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.discardable-ratio | 0.5 |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.gc-merge-rewrite | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.level-merge | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.max-gc-batch-size | 64MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.max-sorted-runs | 20 |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.merge-small-file-threshold | 8MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.min-blob-size | 1KiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.min-gc-batch-size | 16MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.range-merge | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.sample-ratio | 0.1 |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.use-bloom-filter | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.whole-key-filtering | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.write-buffer-size | 128MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.max-background-flushes | 2 |
| tikv | 10.51.xxx.69:20161 | rocksdb.max-background-jobs | 10 |
| tikv | 10.51.xxx.69:20161 | rocksdb.max-manifest-file-size | 128MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.max-open-files | 40960 |
| tikv | 10.51.xxx.69:20161 | rocksdb.max-sub-compactions | 3 |
| tikv | 10.51.xxx.69:20161 | rocksdb.max-total-wal-size | 4GiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.block-based-bloom-filter | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.block-cache-size | 128MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.block-size | 16KiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.bloom-filter-bits-per-key | 10 |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.cache-index-and-filter-blocks | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.compaction-pri | 0 |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.compaction-style | 0 |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.compression-per-level | ["no","no","no","no","no","no","no"] |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.disable-auto-compactions | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.disable-block-cache | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.dynamic-level-bytes | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.enable-doubly-skiplist | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.force-consistency-checks | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.hard-pending-compaction-bytes-limit | 256GiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.level0-file-num-compaction-trigger | 1 |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.level0-slowdown-writes-trigger | 20 |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.level0-stop-writes-trigger | 36 |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.max-bytes-for-level-base | 128MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.max-bytes-for-level-multiplier | 10 |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.max-compaction-bytes | 2GiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.max-write-buffer-number | 5 |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.min-write-buffer-number-to-merge | 1 |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.num-levels | 7 |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.optimize-filters-for-hits | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.pin-l0-filter-and-index-blocks | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.prop-keys-index-distance | 40960 |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.prop-size-index-distance | 4194304 |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.read-amp-bytes-per-bit | 0 |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.soft-pending-compaction-bytes-limit | 64GiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.target-file-size-base | 8MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.blob-cache-size | 0KiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.blob-file-compression | lz4 |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.blob-run-mode | read-only |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.discardable-ratio | 0.5 |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.gc-merge-rewrite | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.level-merge | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.max-gc-batch-size | 64MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.max-sorted-runs | 20 |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.merge-small-file-threshold | 8MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.min-blob-size | 1KiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.min-gc-batch-size | 16MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.range-merge | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.sample-ratio | 0.1 |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.use-bloom-filter | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.whole-key-filtering | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.write-buffer-size | 128MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.rate-bytes-per-sec | 0KiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.rate-limiter-mode | 2 |
| tikv | 10.51.xxx.69:20161 | rocksdb.stats-dump-period | 10m |
| tikv | 10.51.xxx.69:20161 | rocksdb.titan.dirname | |
| tikv | 10.51.xxx.69:20161 | rocksdb.titan.disable-gc | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.titan.enabled | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.titan.max-background-gc | 4 |
| tikv | 10.51.xxx.69:20161 | rocksdb.titan.purge-obsolete-files-period | 10s |
| tikv | 10.51.xxx.69:20161 | rocksdb.use-direct-io-for-flush-and-compaction | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.wal-bytes-per-sync | 512KiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.wal-dir | |
| tikv | 10.51.xxx.69:20161 | rocksdb.wal-recovery-mode | 2 |
| tikv | 10.51.xxx.69:20161 | rocksdb.wal-size-limit | 0KiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.wal-ttl-seconds | 0 |
| tikv | 10.51.xxx.69:20161 | rocksdb.writable-file-max-buffer-size | 1MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.block-based-bloom-filter | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.block-cache-size | 38611MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.block-size | 64KiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.bloom-filter-bits-per-key | 10 |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.cache-index-and-filter-blocks | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.compaction-pri | 3 |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.compaction-style | 0 |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.compression-per-level | ["no","no","lz4","lz4","lz4","zstd","zstd"] |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.disable-auto-compactions | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.disable-block-cache | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.dynamic-level-bytes | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.enable-doubly-skiplist | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.force-consistency-checks | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.hard-pending-compaction-bytes-limit | 256GiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.level0-file-num-compaction-trigger | 4 |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.level0-slowdown-writes-trigger | 20 |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.level0-stop-writes-trigger | 36 |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.max-bytes-for-level-base | 512MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.max-bytes-for-level-multiplier | 10 |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.max-compaction-bytes | 2GiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.max-write-buffer-number | 5 |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.min-write-buffer-number-to-merge | 1 |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.num-levels | 7 |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.optimize-filters-for-hits | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.pin-l0-filter-and-index-blocks | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.prop-keys-index-distance | 40960 |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.prop-size-index-distance | 4194304 |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.read-amp-bytes-per-bit | 0 |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.soft-pending-compaction-bytes-limit | 64GiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.target-file-size-base | 8MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.blob-cache-size | 0KiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.blob-file-compression | lz4 |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.blob-run-mode | read-only |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.discardable-ratio | 0.5 |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.gc-merge-rewrite | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.level-merge | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.max-gc-batch-size | 64MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.max-sorted-runs | 20 |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.merge-small-file-threshold | 8MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.min-blob-size | 1KiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.min-gc-batch-size | 16MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.range-merge | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.sample-ratio | 0.1 |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.use-bloom-filter | true |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.whole-key-filtering | false |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.write-buffer-size | 128MiB |
You can dump the contents of an SST file with the sst_dump command:
sst_dump --file=/tidb/db/000022.sst --command=raw
Footer Details:
--------------------------------------
checksum: 1
metaindex handle: B0BFD31A56
index handle: B38ACF1ACAF303
footer version: 2
table_magic_number: 9863518390377041911
Metaindex Details:
--------------------------------------
Filter block handle: E9F2A31AC5972B
Properties block handle: 82FED21AA941
Table Properties:
--------------------------------------
# data blocks: 843
# entries: 651325
# deletions: 85363
# merge operands: 0
# range deletions: 0
raw key size: 52024919
raw average key size: 79.875514
raw value size: 31022953
raw average value size: 47.630527
data block size: 55114089
index block size (user-key? 0, delta-value? 0): 63951
filter block size: 707525
(estimated) table size: 55885565
filter policy name: rocksdb.BuiltinBloomFilter
prefix extractor name: FixedSuffixSliceTransform
column family ID: 2
column family name: write --> corresponds to the write CF
comparator name: leveldb.BytewiseComparator
merge operator name: nullptr
property collectors names: [tikv.mvcc-properties-collector,tikv.range-properties-collector]
SST file compression algo: NoCompression --> the compression in use
creation time: 1655360424
time stamp of earliest key: 1653037511
Index Details:
--------------------------------------
Block key hex dump: Data block handle
Block key ascii
.......
......
....
..
------
HEX 7A7480000000000000FF345F728000000000FF0203C80000000000FAF9FA5351F20BFFF7: 508480B0BADF95EB82067631800004000000010203040C0016001E001F007462707264636F6E7665727453524231353130303132656E645F6461746530
ASCII z t <80> \0 \0 \0 \0 \0 \0 ÿ 4 _ r <80> \0 \0 \0 \0 ÿ ^B ^C È \0 \0 \0 \0 \0 ú ù ú S Q ò ^K ÿ ÷ : P <84> <80> ° º ß <95> ë <82> ^F v 1 <80> \0 ^D \0 \0 \0 ^A ^B ^C ^D ^L \0 ^V \0 ^^ \0 ^_ \0 t b x x x S R B 1 5 1 0 0 1 2 e n d _ d a t e 0
------
HEX 7A7480000000000000FF345F728000000000FF0203C90000000000FAF9FA5351F20BFFF7: 508480B0BADF95EB82067634800004000000010203040C001600210022007462707264636F6E76657274535242313531303031326665655F6163636F756E7420
ASCII z t <80> \0 \0 \0 \0 \0 \0 ÿ 4 _ r <80> \0 \0 \0 \0 ÿ ^B ^C É \0 \0 \0 \0 \0 ú ù ú S Q ò ^K ÿ ÷ : P <84> <80> ° º ß <95> ë <82> ^F v 4 <80> \0 ^D \0 \0 \0 ^A ^B ^C ^D ^L \0 ^V \0 ! \0 " \0 t b x x x S R B 1 5 1 0 0 1 2 f e e _ a c c o u n t --> My guess is that this is the row data (the tail is recognizable as row data; I do not recognize the leading part, which is probably MVCC-related: the ROWID or PK serves as the key, and TSO + row data make up the value)
Data Block Summary:
--------------------------------------
# data blocks: 843
min data block size: 6778
max data block size: 65533
avg data block size: 65373.516014
mysql> select _tidb_rowid from tbtest where table_name='tbxxx' and prd_code='SRB1510012' and field_code='fee_account';
+-------------+
| _tidb_rowid |
+-------------+
| 132041 |
+-------------+
1 row in set (0.00 sec)
mysql> select * from tbtest where table_name='tbprdconvert' and prd_code='SRB1510012' and field_code='fee_account';
+--------------+------------+-------------+-------------+
| table_name | prd_code | field_code | field_value |
+--------------+------------+-------------+-------------+
| tbxxx | SRB1510012 | fee_account | |
+--------------+------------+-------------+-------------+
1 row in set (0.01 sec)
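The key part of the dump can be decoded under a few assumptions about TiKV's key layout, none of which come from the dump itself: a one-byte data prefix "z", a memcomparable encoding in groups of 8 data bytes plus one marker byte, a record key of the form "t" + table id + "_r" + row id, and an 8-byte bitwise-inverted timestamp at the end. The sketch below (function names are my own) applies this to the second key in the dump. It yields row id 132041, the same value as the _tidb_rowid returned by the query above, which supports the guess that the key carries the row id plus an MVCC timestamp.

```python
import struct

SIGN_MASK = 1 << 63
U64_MAX = (1 << 64) - 1

# Second key from the sst_dump output above.
KEY_HEX = "7A7480000000000000FF345F728000000000FF0203C90000000000FAF9FA5351F20BFFF7"


def decode_memcomparable(data: bytes):
    """Decode groups of 8 data bytes followed by a marker byte that equals
    0xFF minus the number of padding bytes in the group.
    Returns (decoded_bytes, encoded_bytes_consumed)."""
    out = bytearray()
    pos = 0
    while True:
        group = data[pos:pos + 8]
        pad = 0xFF - data[pos + 8]
        pos += 9
        out += group[:8 - pad]
        if pad > 0:  # a padded group ends the encoding
            return bytes(out), pos


def decode_write_cf_key(hex_key: str) -> dict:
    raw = bytes.fromhex(hex_key)
    if raw[:1] != b"z":
        raise ValueError("expected the 'z' data prefix")
    user_key, used = decode_memcomparable(raw[1:])
    # The timestamp is stored inverted so that newer versions sort first.
    (inverted_ts,) = struct.unpack(">Q", raw[1 + used:1 + used + 8])
    ts = ~inverted_ts & U64_MAX
    if user_key[:1] != b"t" or user_key[9:11] != b"_r":
        raise ValueError("not a table record key")
    (table_id,) = struct.unpack(">Q", user_key[1:9])
    (row_id,) = struct.unpack(">Q", user_key[11:19])
    return {
        # Flipping the sign bit recovers non-negative ids.
        "table_id": table_id ^ SIGN_MASK,
        "row_id": row_id ^ SIGN_MASK,
        "ts": ts,
        # Assumption: a TSO is (physical milliseconds << 18) + logical counter.
        "ts_unix_seconds": (ts >> 18) // 1000,
    }


if __name__ == "__main__":
    print(decode_write_cf_key(KEY_HEX))
```

The decoded timestamp falls between the "time stamp of earliest key" and the "creation time" reported in the table properties, which is what one would expect for a commit timestamp stored in this SST file.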
The RocksDB LOG file contains periodic performance statistics:
** File Read Latency Histogram By Level [default] **
** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
L1 1/0 193.14 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0
Sum 1/0 193.14 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0
Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0
Uptime(secs): 0.0 total, 0.0 interval
Flush(GB): cumulative 0.000, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
** File Read Latency Histogram By Level [default] **
2022/06/15-16:14:55.737378 7ff2277fe700 [WARN] [db/db_impl.cc:669] ------- DUMPING STATS -------
RocksDB SST




When a memtable fills up, it becomes an immutable memtable. A background flush thread in RocksDB then flushes it to disk as a Sorted String Table (SST) file placed in Level 0. When the number of SST files in Level 0 exceeds a threshold, compaction moves them into Level 1, and so on down the levels.
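The flush and compaction flow can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the class ToyLSM and its parameters are invented for illustration, the memtable limit counts entries rather than bytes, everything runs synchronously instead of in background threads, and the whole of Level 0 is merged into a single Level 1 file.

```python
class ToyLSM:
    """Two-level toy LSM tree: memtable -> L0 SST files -> one merged L1 file."""

    def __init__(self, memtable_limit=4, l0_trigger=4):
        self.memtable_limit = memtable_limit  # entries, standing in for write-buffer-size
        self.l0_trigger = l0_trigger          # standing in for level0-file-num-compaction-trigger
        self.memtable = {}
        self.levels = {0: [], 1: []}          # level -> list of SSTs (sorted (key, value) lists)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # The full memtable is frozen and written out as one sorted L0 file.
        self.levels[0].append(sorted(self.memtable.items()))
        self.memtable = {}
        if len(self.levels[0]) >= self.l0_trigger:
            self.compact_l0()

    def compact_l0(self):
        # Merge oldest data first so that newer values overwrite older ones.
        merged = {}
        for sst in self.levels[1] + self.levels[0]:
            merged.update(sst)
        self.levels[1] = [sorted(merged.items())]
        self.levels[0] = []

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # L0 files may overlap, so every one is checked, newest first.
        for sst in list(reversed(self.levels[0])) + self.levels[1]:
            for k, v in sst:
                if k == key:
                    return v
        return None
```

The get method also shows where read amplification comes from: a lookup may have to visit the memtable, every L0 file, and then each deeper level before it finds the key.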
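The compression that TiKV offers is really a capability of RocksDB: during compaction the data is sorted, compressed, and so on. In addition, a TiKV region starts to split at 96 MiB by default, yet the SST files I actually see are mostly about 8 MiB each (which matches target-file-size-base = 8MiB in the configuration above), and there is no fixed relationship between regions and SST files. As a result, TiKV capacity cannot be estimated accurately, and neither can the extra space needed when scaling out a TiDB cluster.
To explain: a region is TiKV's logical unit, used to tell which regions hold a table's data. In principle that lets you estimate the size of a table, but in practice the estimate is inaccurate, as analyzed below.
The vendor usually sizes a scale-out or architecture plan as region size (96 MiB) * number of replicas / 2; in other words, it assumes that SST compression halves the data. But after a series of SST compactions, a region may average only about 10 MB on disk. That is nearly a tenfold difference, so there is no accurate calculation. I think the estimate should not be based on the 96 MiB region size; a figure slightly above 10 MB per region is a better basis.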
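The usual TiKV capacity estimate:
rows * rowsize * 3 = total capacity
total capacity / 96 MiB / 2 = estimated number of regions
estimated number of regions / 50k = number of TiKV instances --> the vendor currently recommends 50k~60k regions per TiKV instance
The estimate I would use:
rows * rowsize * 3 = total capacity --> decide the number of TiKV instances per machine from your NVMe size, CPU, and memory
total capacity / 10 MB / number of TiKV instances = estimated regions per TiKV instance --> the region count per instance should not be too high and some buffer should be kept, but 50k~60k is too low; it should be around 200k. NVMe drives now come at 4 TB each, and 50% utilization is acceptable (200k regions at about 10 MB each is roughly 2 TB).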
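The two estimates can be put side by side as plain arithmetic. The sketch below only restates the formulas above; the function and parameter names are my own, and I read the factor 3 as the replica count, which the formulas do not state explicitly.

```python
import math

MIB = 1024 ** 2


def vendor_estimate(rows, row_size, replicas=3, region_size=96 * MIB,
                    compression=2, regions_per_tikv=50_000):
    """Usual estimate: regions of 96 MiB, halved by compression,
    50k regions per TiKV instance."""
    total_bytes = rows * row_size * replicas
    regions = total_bytes / region_size / compression
    tikv_instances = math.ceil(regions / regions_per_tikv)
    return total_bytes, regions, tikv_instances


def author_estimate(rows, row_size, tikv_instances, replicas=3,
                    region_on_disk=10 * MIB):
    """Alternative estimate: fix the TiKV count from the hardware first,
    then derive regions per instance from about 10 MiB per region on disk."""
    total_bytes = rows * row_size * replicas
    regions_per_tikv = total_bytes / region_on_disk / tikv_instances
    return total_bytes, regions_per_tikv
```

For the same rows and rowsize, the second method reports close to five times as many regions as the first (96 * 2 / 10 is about 19 regions versus 1 for the same data), which is why a limit of 50k~60k regions per instance looks too low under it.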




