
Using RocksDB in TiDB

Original post by lizhao01, 2022-06-17

RocksDB in TiKV

This analysis reflects only my personal views and is meant for research and discussion.

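RocksDB is a storage engine built on an LSM tree (log-structured merge-tree); it stores data as key-value pairs.

  • Each TiKV instance uses two RocksDB instances: one stores Raft data, the other stores the actual data.
  • The raft RocksDB stores the Raft log, that is, the changes (DML and so on) that have to be replicated. Log replication and apply carry those changes to the other TiKV nodes, so that every write reaches the majority that the Raft protocol requires.
  • The kv RocksDB stores the actual table data.
Listing a TiKV data directory shows the two RocksDB instances, db and raft.
I had wondered how tables map to SST files in TiDB. After I created two test tables, only one SST file had been generated, so my guess is that tables and SST files have a many-to-many relationship, and that there is no obvious way to tell which table's data a given SST file holds. This makes me seriously doubt query performance on large tables: the deeper the SST levels and the more widely the data is spread across SST files, the worse the read amplification.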
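
To make the read-amplification concern concrete, here is a minimal Python sketch of a point lookup in a leveled LSM tree. It is a toy model under stated assumptions, not RocksDB code: `ToySST`, `ToyLSM`, and the sample keys are invented for illustration, each "SST" is a plain dict, and bloom filters and the block cache (which reduce the number of files actually read) are left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ToySST:
    """Stand-in for one SST file: a small run of key-value pairs."""
    data: Dict[str, str]

    def covers(self, key: str) -> bool:
        # Only files whose [smallest, largest] key range contains the key
        # need to be looked at.
        return min(self.data) <= key <= max(self.data)


class ToyLSM:
    """Hypothetical LSM tree: one memtable plus levels of ToySST files."""

    def __init__(self, memtable: Dict[str, str], levels: List[List[ToySST]]):
        self.memtable = memtable
        self.levels = levels  # levels[0] is L0, where file ranges may overlap

    def get(self, key: str) -> Tuple[Optional[str], int]:
        """Return (value, number of SST files probed)."""
        if key in self.memtable:
            return self.memtable[key], 0
        probed = 0
        for level in self.levels:      # newer levels first
            for sst in level:
                if not sst.covers(key):
                    continue
                probed += 1
                if key in sst.data:
                    return sst.data[key], probed
        return None, probed


# Keys of two tables (t1, t2) share one sorted key space, so a single file
# can hold rows of both tables and one table can span several files.
lsm = ToyLSM(
    memtable={"t1_r9": "new"},
    levels=[
        [ToySST({"t1_r1": "a", "t2_r1": "x"}), ToySST({"t1_r2": "b", "t2_r5": "y"})],
        [ToySST({"t1_r3": "c", "t1_r4": "d"}), ToySST({"t2_r2": "z", "t2_r3": "w"})],
    ],
)
for key in ("t1_r9", "t1_r1", "t2_r3", "t1_r5"):
    print(key, lsm.get(key))
```

In this model a key that sits in a deeper level, or that does not exist at all, costs more file probes than a recently written one, which is the effect described above.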

[root@db01 tidb-data]# tree tikv-20161/
tikv-20161/
├── db
│   ├── 000003.log
│   ├── 000017.sst
│   ├── CURRENT
│   ├── IDENTITY
│   ├── LOCK
│   ├── LOG
│   ├── MANIFEST-000006
│   ├── OPTIONS-000010
│   └── OPTIONS-000012
├── import
├── last_tikv.toml
├── LOCK
├── raft
│   ├── 000003.log
│   ├── CURRENT
│   ├── IDENTITY
│   ├── LOCK
│   ├── LOG
│   ├── MANIFEST-000001
│   └── OPTIONS-000005
├── snap
└── space_placeholder_file

The kv column families (CFs) in TiKV

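CF: a column family gives RocksDB a logical partition. In terms of implementation, different column families share one WAL, but each has its own memtables and SST files. The default column family is named "default".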
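
A small Python sketch of that layout may help. It is only a model of the idea (one shared WAL, per-CF memtables and SST files); the `ToyDB` class and its methods are hypothetical and are not the RocksDB API.

```python
from typing import Dict, List, Tuple


class ToyDB:
    """Hypothetical model of column families: shared WAL, separate memtables/SSTs."""

    def __init__(self, cf_names: List[str]) -> None:
        self.wal: List[Tuple[str, str, str]] = []  # one log for all CFs
        self.memtables: Dict[str, Dict[str, str]] = {n: {} for n in cf_names}
        self.ssts: Dict[str, List[Dict[str, str]]] = {n: [] for n in cf_names}

    def put(self, cf: str, key: str, value: str) -> None:
        # Every write is logged to the shared WAL first,
        # then applied to the memtable of its own CF.
        self.wal.append((cf, key, value))
        self.memtables[cf][key] = value

    def flush(self, cf: str) -> None:
        # Flushing one CF turns only that CF's memtable into an SST.
        if self.memtables[cf]:
            self.ssts[cf].append(dict(sorted(self.memtables[cf].items())))
            self.memtables[cf] = {}


db = ToyDB(["default", "lock", "write"])
db.put("write", "k1", "v1")
db.put("default", "k2", "v2")
db.flush("write")
print(len(db.wal), db.ssts["write"], db.memtables["default"])
```

After the flush, the write CF has one SST while the default CF still keeps its data in its memtable; both writes went through the same WAL.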

TiKV builds its MVCC mechanism on top of column families. I have to say, engineering of this kind is an art.

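The kv RocksDB, which holds the data, uses three CFs to implement transactions, locking, and so on.

Write CF: when a user writes a row, the row data is stored in the write CF if it is no longer than 255 bytes; if it exceeds 255 bytes, it is stored in the default CF.

Lock CF: stores the lock information of transactions; in a distributed transaction, a lock also records a pointer to the transaction's primary lock.

Default CF: stores data longer than 255 bytes.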
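
The size rule can be written down in a few lines of Python. This is a sketch of the rule as described here, not TiKV's code: `SHORT_VALUE_MAX_LEN` and `target_cf` are names chosen for the example, treating a value of exactly 255 bytes as "short" is an assumption, and the commit metadata that the write CF also carries is omitted.

```python
SHORT_VALUE_MAX_LEN = 255  # threshold from the text; exact boundary is an assumption


def target_cf(value: bytes) -> str:
    """Return the CF that would hold the row value under the 255-byte rule."""
    return "write" if len(value) <= SHORT_VALUE_MAX_LEN else "default"


for size in (49, 255, 256, 4096):
    print(size, target_cf(b"x" * size))
```

Small rows therefore need only one lookup in the write CF, while larger rows need a second lookup in the default CF.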

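TiKV has a separate set of parameters for each of the three CFs (the listing also contains a raftcf section). Each row gives the component type, the instance address, the parameter name, and its value: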

| tikv | 10.51.xxx.69:20161 | rocksdb.auto-tuned                                    | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.bytes-per-sync                                | 1MiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.compaction-readahead-size                     | 0KiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.create-if-missing                             | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.block-based-bloom-filter            | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.block-cache-size                    | 64353MiB                                    |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.block-size                          | 64KiB                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.bloom-filter-bits-per-key           | 10                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.cache-index-and-filter-blocks       | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.compaction-pri                      | 3                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.compaction-style                    | 0                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.compression-per-level               | ["no","no","lz4","lz4","lz4","zstd","zstd"] |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.disable-auto-compactions            | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.disable-block-cache                 | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.dynamic-level-bytes                 | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.enable-doubly-skiplist              | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.force-consistency-checks            | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.hard-pending-compaction-bytes-limit | 256GiB                                      |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.level0-file-num-compaction-trigger  | 4                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.level0-slowdown-writes-trigger      | 20                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.level0-stop-writes-trigger          | 36                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.max-bytes-for-level-base            | 512MiB                                      |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.max-bytes-for-level-multiplier      | 10                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.max-compaction-bytes                | 2GiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.max-write-buffer-number             | 5                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.min-write-buffer-number-to-merge    | 1                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.num-levels                          | 7                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.optimize-filters-for-hits           | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.pin-l0-filter-and-index-blocks      | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.prop-keys-index-distance            | 40960                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.prop-size-index-distance            | 4194304                                     |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.read-amp-bytes-per-bit              | 0                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.soft-pending-compaction-bytes-limit | 64GiB                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.target-file-size-base               | 8MiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.blob-cache-size               | 0KiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.blob-file-compression         | lz4                                         |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.blob-run-mode                 | normal                                      |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.discardable-ratio             | 0.5                                         |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.gc-merge-rewrite              | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.level-merge                   | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.max-gc-batch-size             | 64MiB                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.max-sorted-runs               | 20                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.merge-small-file-threshold    | 8MiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.min-blob-size                 | 1KiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.min-gc-batch-size             | 16MiB                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.range-merge                   | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.titan.sample-ratio                  | 0.1                                         |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.use-bloom-filter                    | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.whole-key-filtering                 | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.write-buffer-size                   | 128MiB                                      |
| tikv | 10.51.xxx.69:20161 | rocksdb.enable-multi-batch-write                      | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.enable-pipelined-write                        | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.enable-statistics                             | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.enable-unordered-write                        | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.info-log-dir                                  |                                             |
| tikv | 10.51.xxx.69:20161 | rocksdb.info-log-keep-log-file-num                    | 10                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.info-log-max-size                             | 1GiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.info-log-roll-time                            | 0s                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.block-based-bloom-filter               | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.block-cache-size                       | 1GiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.block-size                             | 16KiB                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.bloom-filter-bits-per-key              | 10                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.cache-index-and-filter-blocks          | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.compaction-pri                         | 0                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.compaction-style                       | 0                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.compression-per-level                  | ["no","no","no","no","no","no","no"]        |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.disable-auto-compactions               | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.disable-block-cache                    | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.dynamic-level-bytes                    | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.enable-doubly-skiplist                 | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.force-consistency-checks               | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.hard-pending-compaction-bytes-limit    | 256GiB                                      |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.level0-file-num-compaction-trigger     | 1                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.level0-slowdown-writes-trigger         | 20                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.level0-stop-writes-trigger             | 36                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.max-bytes-for-level-base               | 128MiB                                      |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.max-bytes-for-level-multiplier         | 10                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.max-compaction-bytes                   | 2GiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.max-write-buffer-number                | 5                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.min-write-buffer-number-to-merge       | 1                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.num-levels                             | 7                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.optimize-filters-for-hits              | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.pin-l0-filter-and-index-blocks         | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.prop-keys-index-distance               | 40960                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.prop-size-index-distance               | 4194304                                     |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.read-amp-bytes-per-bit                 | 0                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.soft-pending-compaction-bytes-limit    | 64GiB                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.target-file-size-base                  | 8MiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.blob-cache-size                  | 0KiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.blob-file-compression            | lz4                                         |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.blob-run-mode                    | read-only                                   |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.discardable-ratio                | 0.5                                         |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.gc-merge-rewrite                 | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.level-merge                      | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.max-gc-batch-size                | 64MiB                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.max-sorted-runs                  | 20                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.merge-small-file-threshold       | 8MiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.min-blob-size                    | 1KiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.min-gc-batch-size                | 16MiB                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.range-merge                      | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.titan.sample-ratio                     | 0.1                                         |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.use-bloom-filter                       | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.whole-key-filtering                    | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.write-buffer-size                      | 128MiB                                      |
| tikv | 10.51.xxx.69:20161 | rocksdb.max-background-flushes                        | 2                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.max-background-jobs                           | 10                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.max-manifest-file-size                        | 128MiB                                      |
| tikv | 10.51.xxx.69:20161 | rocksdb.max-open-files                                | 40960                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.max-sub-compactions                           | 3                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.max-total-wal-size                            | 4GiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.block-based-bloom-filter               | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.block-cache-size                       | 128MiB                                      |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.block-size                             | 16KiB                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.bloom-filter-bits-per-key              | 10                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.cache-index-and-filter-blocks          | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.compaction-pri                         | 0                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.compaction-style                       | 0                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.compression-per-level                  | ["no","no","no","no","no","no","no"]        |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.disable-auto-compactions               | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.disable-block-cache                    | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.dynamic-level-bytes                    | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.enable-doubly-skiplist                 | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.force-consistency-checks               | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.hard-pending-compaction-bytes-limit    | 256GiB                                      |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.level0-file-num-compaction-trigger     | 1                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.level0-slowdown-writes-trigger         | 20                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.level0-stop-writes-trigger             | 36                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.max-bytes-for-level-base               | 128MiB                                      |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.max-bytes-for-level-multiplier         | 10                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.max-compaction-bytes                   | 2GiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.max-write-buffer-number                | 5                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.min-write-buffer-number-to-merge       | 1                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.num-levels                             | 7                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.optimize-filters-for-hits              | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.pin-l0-filter-and-index-blocks         | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.prop-keys-index-distance               | 40960                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.prop-size-index-distance               | 4194304                                     |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.read-amp-bytes-per-bit                 | 0                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.soft-pending-compaction-bytes-limit    | 64GiB                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.target-file-size-base                  | 8MiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.blob-cache-size                  | 0KiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.blob-file-compression            | lz4                                         |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.blob-run-mode                    | read-only                                   |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.discardable-ratio                | 0.5                                         |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.gc-merge-rewrite                 | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.level-merge                      | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.max-gc-batch-size                | 64MiB                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.max-sorted-runs                  | 20                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.merge-small-file-threshold       | 8MiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.min-blob-size                    | 1KiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.min-gc-batch-size                | 16MiB                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.range-merge                      | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.titan.sample-ratio                     | 0.1                                         |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.use-bloom-filter                       | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.whole-key-filtering                    | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.write-buffer-size                      | 128MiB                                      |
| tikv | 10.51.xxx.69:20161 | rocksdb.rate-bytes-per-sec                            | 0KiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.rate-limiter-mode                             | 2                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.stats-dump-period                             | 10m                                         |
| tikv | 10.51.xxx.69:20161 | rocksdb.titan.dirname                                 |                                             |
| tikv | 10.51.xxx.69:20161 | rocksdb.titan.disable-gc                              | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.titan.enabled                                 | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.titan.max-background-gc                       | 4                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.titan.purge-obsolete-files-period             | 10s                                         |
| tikv | 10.51.xxx.69:20161 | rocksdb.use-direct-io-for-flush-and-compaction        | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.wal-bytes-per-sync                            | 512KiB                                      |
| tikv | 10.51.xxx.69:20161 | rocksdb.wal-dir                                       |                                             |
| tikv | 10.51.xxx.69:20161 | rocksdb.wal-recovery-mode                             | 2                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.wal-size-limit                                | 0KiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.wal-ttl-seconds                               | 0                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.writable-file-max-buffer-size                 | 1MiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.block-based-bloom-filter              | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.block-cache-size                      | 38611MiB                                    |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.block-size                            | 64KiB                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.bloom-filter-bits-per-key             | 10                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.cache-index-and-filter-blocks         | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.compaction-pri                        | 3                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.compaction-style                      | 0                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.compression-per-level                 | ["no","no","lz4","lz4","lz4","zstd","zstd"] |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.disable-auto-compactions              | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.disable-block-cache                   | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.dynamic-level-bytes                   | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.enable-doubly-skiplist                | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.force-consistency-checks              | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.hard-pending-compaction-bytes-limit   | 256GiB                                      |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.level0-file-num-compaction-trigger    | 4                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.level0-slowdown-writes-trigger        | 20                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.level0-stop-writes-trigger            | 36                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.max-bytes-for-level-base              | 512MiB                                      |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.max-bytes-for-level-multiplier        | 10                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.max-compaction-bytes                  | 2GiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.max-write-buffer-number               | 5                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.min-write-buffer-number-to-merge      | 1                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.num-levels                            | 7                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.optimize-filters-for-hits             | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.pin-l0-filter-and-index-blocks        | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.prop-keys-index-distance              | 40960                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.prop-size-index-distance              | 4194304                                     |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.read-amp-bytes-per-bit                | 0                                           |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.soft-pending-compaction-bytes-limit   | 64GiB                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.target-file-size-base                 | 8MiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.blob-cache-size                 | 0KiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.blob-file-compression           | lz4                                         |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.blob-run-mode                   | read-only                                   |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.discardable-ratio               | 0.5                                         |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.gc-merge-rewrite                | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.level-merge                     | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.max-gc-batch-size               | 64MiB                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.max-sorted-runs                 | 20                                          |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.merge-small-file-threshold      | 8MiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.min-blob-size                   | 1KiB                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.min-gc-batch-size               | 16MiB                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.range-merge                     | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.titan.sample-ratio                    | 0.1                                         |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.use-bloom-filter                      | true                                        |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.whole-key-filtering                   | false                                       |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.write-buffer-size                     | 128MiB                                      |
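
Rows in this shape are easy to compare across CFs with a short script. The Python sketch below parses a few rows copied from the listing, groups them by CF, and estimates a rough upper bound for memtable memory per CF, assuming RocksDB keeps at most max-write-buffer-number memtables of write-buffer-size each. The function names (`parse_size`, `group_by_cf`, `memtable_ceiling`) are made up for this example.

```python
import re
from collections import defaultdict
from typing import Dict

# A few rows copied from the listing above (column padding removed).
ROWS = """\
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.max-write-buffer-number | 5 |
| tikv | 10.51.xxx.69:20161 | rocksdb.defaultcf.write-buffer-size | 128MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.max-write-buffer-number | 5 |
| tikv | 10.51.xxx.69:20161 | rocksdb.lockcf.write-buffer-size | 128MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.max-write-buffer-number | 5 |
| tikv | 10.51.xxx.69:20161 | rocksdb.raftcf.write-buffer-size | 128MiB |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.max-write-buffer-number | 5 |
| tikv | 10.51.xxx.69:20161 | rocksdb.writecf.write-buffer-size | 128MiB |
"""

UNITS = {"KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert a value such as '128MiB' to bytes."""
    match = re.fullmatch(r"(\d+)(KiB|MiB|GiB)", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def group_by_cf(rows: str) -> Dict[str, Dict[str, str]]:
    """Map CF name -> {parameter: value} for rows named rocksdb.<cf>.<parameter>."""
    per_cf: Dict[str, Dict[str, str]] = defaultdict(dict)
    for line in rows.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 4:
            continue  # not a data row
        match = re.fullmatch(r"rocksdb\.(\w+cf)\.(.+)", cells[2])
        if match:
            per_cf[match.group(1)][match.group(2)] = cells[3]
    return dict(per_cf)


def memtable_ceiling(settings: Dict[str, str]) -> int:
    """Rough upper bound in bytes: buffer size times number of buffers."""
    return parse_size(settings["write-buffer-size"]) * int(settings["max-write-buffer-number"])


for cf, settings in sorted(group_by_cf(ROWS).items()):
    print(cf, memtable_ceiling(settings) // 1024 ** 2, "MiB")
```

With the values shown in the listing, each CF comes out at 640 MiB under this assumption.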
The sst_dump command can be used to dump the contents of an SST file:

sst_dump --file=/tidb/db/000022.sst --command=raw

Footer Details:
--------------------------------------
  checksum: 1
  metaindex handle: B0BFD31A56
  index handle: B38ACF1ACAF303
  footer version: 2
  table_magic_number: 9863518390377041911
  
Metaindex Details:
--------------------------------------
  Filter block handle: E9F2A31AC5972B
  Properties block handle: 82FED21AA941

Table Properties:
--------------------------------------
  # data blocks: 843
  # entries: 651325
  # deletions: 85363
  # merge operands: 0
  # range deletions: 0
  raw key size: 52024919
  raw average key size: 79.875514
  raw value size: 31022953
  raw average value size: 47.630527
  data block size: 55114089
  index block size (user-key? 0, delta-value? 0): 63951
  filter block size: 707525
  (estimated) table size: 55885565
  filter policy name: rocksdb.BuiltinBloomFilter
  prefix extractor name: FixedSuffixSliceTransform
  column family ID: 2
  column family name: write                      --> corresponds to the write CF
  comparator name: leveldb.BytewiseComparator
  merge operator name: nullptr
  property collectors names: [tikv.mvcc-properties-collector,tikv.range-properties-collector]
  SST file compression algo: NoCompression     --> the compression in use
  creation time: 1655360424
  time stamp of earliest key: 1653037511
  
Index Details:
--------------------------------------
  Block key hex dump: Data block handle
  Block key ascii

.......
......
....
..

  ------
  HEX    7A7480000000000000FF345F728000000000FF0203C80000000000FAF9FA5351F20BFFF7: 508480B0BADF95EB82067631800004000000010203040C0016001E001F007462707264636F6E7665727453524231353130303132656E645F6461746530
  ASCII  z t <80> \0 \0 \0 \0 \0 \0 ÿ 4 _ r <80> \0 \0 \0 \0 ÿ ^B ^C È \0 \0 \0 \0 \0 ú ù ú S Q ò ^K ÿ ÷ : P <84> <80> ° º ß <95> ë <82> ^F v 1 <80> \0 ^D \0 \0 \0 ^A ^B ^C ^D ^L \0 ^V \0 ^^ \0 ^_ \0 t b x x x S R B 1 5 1 0 0 1 2 e n d _ d a t e 0
  ------
  HEX    7A7480000000000000FF345F728000000000FF0203C90000000000FAF9FA5351F20BFFF7: 508480B0BADF95EB82067634800004000000010203040C001600210022007462707264636F6E76657274535242313531303031326665655F6163636F756E7420
  ASCII  z t <80> \0 \0 \0 \0 \0 \0 ÿ 4 _ r <80> \0 \0 \0 \0 ÿ ^B ^C É \0 \0 \0 \0 \0 ú ù ú S Q ò ^K ÿ ÷ : P <84> <80> ° º ß <95> ë <82> ^F v 4 <80> \0 ^D \0 \0 \0 ^A ^B ^C ^D ^L \0 ^V \0 ! \0 " \0 t b x x x S R B 1 5 1 0 0 1 2 f e e _ a c c o u n t        --> My guess is that this is the row data. The tail is recognizably row content; I do not recognize the leading part, which is probably MVCC-related: the ROWID or PK forms the key, and TSO + row data form the value.

Data Block Summary:
--------------------------------------
  # data blocks: 843
  min data block size: 6778
  max data block size: 65533
  avg data block size: 65373.516014


mysql> select _tidb_rowid from tbtest where table_name='tbxxx' and prd_code='SRB1510012' and field_code='fee_account';
+-------------+
| _tidb_rowid |
+-------------+
|      132041 |
+-------------+
1 row in set (0.00 sec)


mysql> select * from tbtest where table_name='tbprdconvert' and prd_code='SRB1510012' and field_code='fee_account';
+--------------+------------+-------------+-------------+
| table_name   | prd_code   | field_code  | field_value |
+--------------+------------+-------------+-------------+
| tbxxx        | SRB1510012 | fee_account |             |
+--------------+------------+-------------+-------------+
1 row in set (0.01 sec)
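
The guess about the key layout can be checked by decoding the second HEX key from the dump by hand. The sketch below assumes the commonly documented TiKV/TiDB layout, none of which is stated in the dump itself: a `z` data prefix, a memcomparable-encoded user key of the form `t{table_id}_r{row_id}` (8-byte groups, each followed by a marker byte equal to `0xFF` minus the padding count), and an 8-byte bit-inverted timestamp suffix. The function names are hypothetical.

```python
import struct
from typing import Tuple

SIGN_MASK = 0x8000000000000000   # TiDB flips the sign bit so int64 values sort bytewise
U64_MAX = 0xFFFFFFFFFFFFFFFF


def decode_memcomparable(data: bytes) -> Tuple[bytes, int]:
    """Decode 8-byte groups, each followed by a marker byte (0xFF - pad count).

    Returns the decoded bytes and the number of encoded bytes consumed.
    """
    out = bytearray()
    pos = 0
    while True:
        group, marker = data[pos:pos + 8], data[pos + 8]
        pos += 9
        pad = 0xFF - marker
        out += group[:8 - pad]
        if pad:                      # a padded group is always the last one
            return bytes(out), pos


def decode_write_cf_key(hex_key: str) -> dict:
    """Split a write-CF key into table ID, row ID and timestamp (assumed layout)."""
    raw = bytes.fromhex(hex_key)
    assert raw[:1] == b"z", "expected the TiKV data prefix"
    user_key, used = decode_memcomparable(raw[1:])
    (inverted_ts,) = struct.unpack(">Q", raw[1 + used:1 + used + 8])
    assert user_key[:1] == b"t" and user_key[9:11] == b"_r", "not a row key"
    (table_raw,) = struct.unpack(">Q", user_key[1:9])
    (row_raw,) = struct.unpack(">Q", user_key[11:19])
    return {
        # XOR with the sign mask undoes the flip; assumes non-negative IDs
        "table_id": table_raw ^ SIGN_MASK,
        "row_id": row_raw ^ SIGN_MASK,
        # timestamps are stored inverted so newer versions sort first
        "commit_ts": inverted_ts ^ U64_MAX,
    }


key = "7A7480000000000000FF345F728000000000FF0203C90000000000FAF9FA5351F20BFFF7"
print(decode_write_cf_key(key))
```

Under these assumptions the key decodes to table ID 52 and row ID 132041, which is the same `_tidb_rowid` the query above returns for the `fee_account` row.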

The RocksDB LOG file periodically records performance statistics:

** File Read Latency Histogram By Level [default] **

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L1      1/0   193.14 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000       0      0
 Sum      1/0   193.14 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000       0      0
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000       0      0
Uptime(secs): 0.0 total, 0.0 interval
Flush(GB): cumulative 0.000, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

** File Read Latency Histogram By Level [default] **
2022/06/15-16:14:55.737378 7ff2277fe700 [WARN] [db/db_impl.cc:669] ------- DUMPING STATS -------

Rocksdb sst

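When a Memtable fills up, it becomes an immutable Memtable. A background flush thread in RocksDB then flushes it to disk as a Sorted String Table (SST) file and places that file in Level 0. Once the number of SST files in Level 0 exceeds a threshold, the compaction strategy moves them into Level 1, and so on for the deeper levels.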
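
The flush and compaction flow described above can be illustrated with a toy model. This is only a sketch and not how RocksDB is implemented: the class name `ToyLSM` and both thresholds are hypothetical, each "SST file" is just a sorted list held in memory, and the compaction merges all of Level 0 into a single Level 1 file instead of selecting overlapping files.

```python
class ToyLSM:
    """Toy two-level LSM tree: memtable -> Level 0 files -> one Level 1 file."""

    def __init__(self, memtable_limit=4, l0_file_limit=2):
        self.memtable_limit = memtable_limit   # entries before a flush (hypothetical)
        self.l0_file_limit = l0_file_limit     # L0 files tolerated before compaction
        self.memtable = {}
        self.levels = {0: [], 1: []}           # each file is a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable and write it out as a new sorted Level 0 file."""
        if not self.memtable:
            return
        self.levels[0].append(sorted(self.memtable.items()))
        self.memtable = {}
        if len(self.levels[0]) > self.l0_file_limit:
            self.compact_l0()

    def compact_l0(self):
        """Merge Level 1 and all Level 0 files; newer values overwrite older ones."""
        merged = {}
        for sst in self.levels[1] + self.levels[0]:   # oldest data first
            merged.update(sst)
        self.levels[1] = [sorted(merged.items())]
        self.levels[0] = []

    def get(self, key):
        """Read path: memtable, then L0 from newest to oldest, then L1."""
        if key in self.memtable:
            return self.memtable[key]
        for sst in list(reversed(self.levels[0])) + self.levels[1]:
            found = dict(sst).get(key)
            if found is not None:
                return found
        return None


lsm = ToyLSM(memtable_limit=2, l0_file_limit=2)
for i in range(6):
    lsm.put(f"k{i}", f"v{i}")
print(len(lsm.levels[0]), len(lsm.levels[1]))
```

The read path in `get` shows why a key that lives in a deeper level costs more lookups: every newer file has to be checked first.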

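TiKV's compression capability is really RocksDB's capability: data is sorted and compressed during compaction. In addition, a TiKV Region starts to split at 96 MB by default, yet the SST files actually observed on disk are mostly about 8 MB each, and there is no fixed relationship between a Region and SST files. As a result, TiKV capacity cannot be estimated accurately, and neither can the space needed when scaling out a TiDB cluster.
To explain: TiKV treats the Region as a logical unit for tracking which Regions hold a table's data. That can be used to estimate table size, but the estimate is inaccurate in practice, as analyzed below.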

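The vendor usually sizes a scale-out or architecture plan as Region size 96 MB * replica count / 2, which assumes that SST compression halves the data. After a series of SST compactions, however, a Region may average only about 10 MB on disk. That is close to a 10x difference, so no calculation based on it is accurate. In my view the estimate should not use the 96 MB Region size; a figure slightly larger than 10 MB is a better basis.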

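The usual TiKV capacity estimate:

rows * rowsize * 3  = total capacity
total capacity  / 96 MB / 2 = estimated number of Regions
estimated number of Regions  / 50k = number of TiKV instances   --> the vendor currently recommends 50k to 60k Regions per TiKV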

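The estimate I would use instead:

rows * rowsize * 3  = total capacity  --> choose the number of TiKV instances per machine from your NVMe size, CPU, and memory
total capacity  / 10 MB  / number of TiKV instances  =  estimated Regions per TiKV    --> the Region count per instance should not grow too large, and some buffer should be kept, but 50k to 60k is too low; it should be around 200k. NVMe drives are already 4 TB each, and 50% utilization is acceptable.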
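
The two estimates can be compared side by side with a few lines of arithmetic. This is a minimal sketch of the formulas as written above; the function names, the use of binary megabytes, and the sample workload are assumptions made for illustration, not measured data.

```python
import math

MIB = 1024 * 1024


def total_capacity_bytes(rows, row_size_bytes, replicas=3):
    """rows * rowsize * 3 = total capacity."""
    return rows * row_size_bytes * replicas


def vendor_estimate(total_bytes, region_size=96 * MIB, compression=2,
                    regions_per_tikv=50_000):
    """Vendor-style estimate: (Region count, TiKV instances needed)."""
    regions = total_bytes / region_size / compression
    return regions, math.ceil(regions / regions_per_tikv)


def regions_per_tikv(total_bytes, tikv_count, region_on_disk=10 * MIB):
    """Author-style estimate: Regions each TiKV holds at about 10 MB per Region."""
    return total_bytes / region_on_disk / tikv_count


# Hypothetical workload: 2 billion rows of 512 bytes each.
total = total_capacity_bytes(rows=2_000_000_000, row_size_bytes=512)
regions, tikv_needed = vendor_estimate(total)
print(f"total capacity: {total / 1024**4:.2f} TiB")
print(f"vendor estimate: {regions:.0f} Regions on {tikv_needed} TiKV")
print(f"10 MB basis: {regions_per_tikv(total, tikv_count=3):.0f} Regions per TiKV")
```

Changing `region_on_disk` or `tikv_count` shows how sensitive the per-instance Region count is to the assumed on-disk size of a Region.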
Last modified: 2022-07-06 08:36:46