Notes on InnoDB record internals

Original article by 孙河清, published at https://www.leviathan.vip/ on 2023-06-25


Background

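InnoDB is currently MySQL's main storage engine, and the details of how it handles records are intricate. These notes collect them for quick reference. Version: MySQL 8.0.25.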

Data structures

Logical format of an InnoDB record: dtuple_t

/** Structure for an SQL data tuple of fields (logical record) */
struct dtuple_t {
  /* ... */

  /** Number of fields in dtuple */
  ulint n_fields; /* Number of fields in this dtuple. */

  /** number of fields which should be used in comparison services of rem0cmp.*;
  the index search is performed by comparing only these fields, others are
  ignored; the default value in dtuple creation is the same value as n_fields */
  ulint n_fields_cmp; /* Number of fields in this dtuple that take part in
                       * comparisons; set via dtuple_set_n_fields_cmp(). */

  /** Fields. */
  dfield_t *fields; /* Field contents of this dtuple. */
  // For reference, dfield_t is declared separately; it is shown here as a
  // comment so that this excerpt stays well-formed:
  //   /** Structure for an SQL data field */
  //   struct dfield_t {
  //     void *data;                   // pointer to data
  //     unsigned ext : 1;             // TRUE=externally stored, FALSE=local
  //     unsigned spatial_status : 2;  // spatial status of externally stored
  //                                   // field in undo log for purge
  //     unsigned len;                 // data length; UNIV_SQL_NULL if SQL null
  //     dtype_t type;                 // type of data
  //     /* ... */
  //   };

  /** ... */

  /** Compare a data tuple to a physical record.
    * Comparison function between a dtuple_t and a rec_t. */
  int compare(const rec_t *rec, const dict_index_t *index, const ulint *offsets,
              ulint *matched_fields) const;

  /** ... */
};

A record from the MySQL SQL layer (in practice, a key value in MySQL format) can be converted by row_sel_convert_mysql_key_to_innobase() into a dtuple_t that InnoDB understands.
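
To make the role of n_fields_cmp in dtuple_t more concrete, here is a minimal sketch. ToyTuple, set_n_fields_cmp() and toy_compare() are hypothetical names invented for this illustration, and fields are modelled as plain strings; this is not InnoDB code, only a model of "compare the first n_fields_cmp fields and ignore the rest".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for dtuple_t: every field is a string.
struct ToyTuple {
  std::vector<std::string> fields;  // plays the role of dtuple_t::fields
  std::size_t n_fields_cmp;         // leading fields used in comparisons

  // As in dtuple creation, n_fields_cmp defaults to the number of fields.
  explicit ToyTuple(std::vector<std::string> f)
      : fields(std::move(f)), n_fields_cmp(fields.size()) {}

  // Toy counterpart of dtuple_set_n_fields_cmp().
  void set_n_fields_cmp(std::size_t n) {
    assert(n <= fields.size());
    n_fields_cmp = n;
  }
};

// Compares only the first n_fields_cmp fields of the tuple with the record.
// Returns -1, 0 or 1. Fields the record does not have are skipped (toy rule).
int toy_compare(const ToyTuple &tuple, const std::vector<std::string> &rec) {
  for (std::size_t i = 0; i < tuple.n_fields_cmp && i < rec.size(); ++i) {
    const int c = tuple.fields[i].compare(rec[i]);
    if (c != 0) return c < 0 ? -1 : 1;
  }
  return 0;
}
```

With a tuple {"alice", "30"} and a record {"alice", "31"}, the full comparison reports a difference, while restricting the tuple to one comparison field makes the two compare equal. That is the effect a prefix search relies on.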

In-memory index structure: dict_index_t

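index->table->n_cols: number of columns in the table, that is, the user-defined columns plus 3 system columns (DB_ROW_ID, DB_TRX_ID, DB_ROLL_PTR).
index->table->cols: array holding those n_cols columns; the system columns are the last 3 entries.
index->n_fields: number of columns in this index; less than or equal to index->table->n_cols.
index->fields: descriptions of the index's columns: column name, length, ascending or descending order.
For a clustered (primary key) index leaf node:
- If a primary key is defined, DB_ROW_ID is not among the index's system columns, so n_fields is n_cols - 1.
- If no primary key is defined, DB_ROW_ID is included, so n_fields equals n_cols.
For a clustered index non-leaf node:
- The record holds all unique fields plus the page number, index->n_uniq + 1 fields in total.
For a secondary index leaf node:
- n_fields is the number of columns in the secondary index definition plus the number of primary key columns.
For a secondary index non-leaf node:
- The record holds the index's fields (as counted for the leaf node) plus the page number, index->n_fields + 1 fields in total.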

Non-leaf (node pointer) records are built with dict_index_build_node_ptr().
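
The field counts listed above can be written down as a few small functions. This is only a sketch under simplifying assumptions: ToyIndexShape and all function names are hypothetical, the secondary key shares no column with the primary key, and prefix indexes and virtual columns are ignored. It also assumes that a table without an explicit primary key is clustered on DB_ROW_ID alone.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a table and one secondary index.
struct ToyIndexShape {
  std::size_t user_cols;  // user-defined columns in the table
  std::size_t pk_cols;    // columns in the explicit primary key, 0 if none
  std::size_t sec_cols;   // columns named in the secondary index definition
};

constexpr std::size_t kSysCols = 3;  // DB_ROW_ID, DB_TRX_ID, DB_ROLL_PTR

// table->n_cols: user columns plus the three system columns.
std::size_t table_n_cols(const ToyIndexShape &s) {
  return s.user_cols + kSysCols;
}

// Clustered leaf: DB_ROW_ID is dropped when an explicit primary key exists.
std::size_t clustered_leaf_n_fields(const ToyIndexShape &s) {
  return s.pk_cols > 0 ? table_n_cols(s) - 1 : table_n_cols(s);
}

// Unique prefix of the clustered index: the primary key, else DB_ROW_ID.
std::size_t clustered_n_uniq(const ToyIndexShape &s) {
  return s.pk_cols > 0 ? s.pk_cols : 1;
}

// Clustered node pointer: unique fields plus the child page number.
std::size_t clustered_node_ptr_n_fields(const ToyIndexShape &s) {
  return clustered_n_uniq(s) + 1;
}

// Secondary leaf: secondary key columns plus the clustered key columns.
std::size_t secondary_leaf_n_fields(const ToyIndexShape &s) {
  return s.sec_cols + clustered_n_uniq(s);
}

// Secondary node pointer: all leaf fields plus the child page number.
std::size_t secondary_node_ptr_n_fields(const ToyIndexShape &s) {
  return secondary_leaf_n_fields(s) + 1;
}
```

For a table with 3 user columns, a 1-column primary key and a 2-column secondary index, these functions give n_cols = 6, 5 fields in a clustered leaf record, 2 in a clustered node pointer, 3 in a secondary leaf record and 4 in a secondary node pointer.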

InnoDB physical record: rec_t

The offsets array is produced by rec_get_offsets(); its size is n_fields + 1 + REC_OFFS_HEADER_SIZE.

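offsets[0] = n_alloc  /* n_alloc is the number of elements in the array. */
offsets[1] = n_fields /* n_fields is the number of fields in the record. */
offsets[2] = extra size
offsets[3 .. 2 + n_fields] /* End offset of each field (with REC_OFFS_HEADER_SIZE = 2, as in non-debug builds). */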
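
The layout above can be modelled with a small sketch. toy_build_offsets(), toy_field_start() and toy_field_len() are hypothetical helpers, not InnoDB functions, and the header size of 2 is an assumption matching non-debug builds. The real array also packs flag bits (for example for SQL NULL or externally stored fields) into its entries, which this toy version leaves out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kHeaderSize = 2;  // assumed REC_OFFS_HEADER_SIZE

// Builds a toy offsets array from the extra size and the field lengths.
std::vector<std::size_t> toy_build_offsets(
    std::size_t extra_size, const std::vector<std::size_t> &field_lens) {
  const std::size_t n_fields = field_lens.size();
  std::vector<std::size_t> offsets(n_fields + 1 + kHeaderSize);
  offsets[0] = offsets.size();  // n_alloc: number of array elements
  offsets[1] = n_fields;        // number of fields in the record
  offsets[2] = extra_size;      // size of the record header area
  std::size_t end = 0;
  for (std::size_t i = 0; i < n_fields; ++i) {
    end += field_lens[i];
    offsets[3 + i] = end;  // end offset of field i
  }
  return offsets;
}

// A field starts where the previous one ends; field 0 starts at offset 0.
std::size_t toy_field_start(const std::vector<std::size_t> &offsets,
                            std::size_t n) {
  assert(n < offsets[1]);
  return n == 0 ? 0 : offsets[3 + n - 1];
}

// Length of field n, derived from two neighbouring end offsets.
std::size_t toy_field_len(const std::vector<std::size_t> &offsets,
                          std::size_t n) {
  return offsets[3 + n] - toy_field_start(offsets, n);
}
```

Storing only end offsets is enough to recover both the start and the length of any field, which is why the array needs just one entry per field on top of its fixed header.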

A rec_t can be compared directly with a dtuple_t through cmp_dtuple_rec_with_match_low().

The offsets array is used to locate each field of the rec_t, and that field is then compared directly with ((dfield_t *)tuple->fields + n).
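
A field-by-field comparison driven by end offsets might look roughly like the sketch below. ToyRec and toy_cmp_tuple_rec() are hypothetical, fields are compared as raw strings, and treating matched_fields as an in/out counter (in the spirit of the matched_fields parameter of dtuple_t::compare() shown earlier) is a design choice of this toy, not a statement about the real function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy physical record: field bytes stored back to back plus end offsets.
struct ToyRec {
  std::string data;               // concatenated field contents
  std::vector<std::size_t> ends;  // end offset of each field within data
};

// Compares tuple fields with record fields, starting at *matched_fields.
// Returns -1, 0 or 1 and stores the number of fully matched fields.
int toy_cmp_tuple_rec(const std::vector<std::string> &tuple,
                      const ToyRec &rec, std::size_t *matched_fields) {
  std::size_t i = *matched_fields;  // skip fields already known to match
  int result = 0;
  for (; i < tuple.size() && i < rec.ends.size(); ++i) {
    const std::size_t start = (i == 0) ? 0 : rec.ends[i - 1];
    const std::string field = rec.data.substr(start, rec.ends[i] - start);
    const int c = tuple[i].compare(field);
    if (c != 0) {
      result = c < 0 ? -1 : 1;
      break;  // i is the index of the first differing field
    }
  }
  *matched_fields = i;
  return result;
}
```

Returning the number of matched fields lets a caller that walks neighbouring records resume the comparison past the common prefix instead of starting from the first field each time.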

B-tree cursor: btr_pcur_t

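btr_pcur_t is the cursor used to position within the tree during a search or a modification. It records positioning information: store_position() saves the current position, and restore_position() restores the most recently saved one.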

struct btr_pcur_t {
  /** ... */

  /* Save the position the pcur currently points at. */
  void store_position(mtr_t *mtr);

  /* Restore the position saved by the last store_position(). */
  bool restore_position(ulint latch_mode, mtr_t *mtr, const char *file,
                        ulint line);

  /** Positioning metadata of the pcur: index, block, n_fields ... */
  btr_cur_t m_btr_cur;

  /** true if old_rec is stored */
  bool m_old_stored{false};

  /* The record the pcur pointed at, saved by store_position(). */
  rec_t *m_old_rec{nullptr};

  /* Number of fields in m_old_rec. */
  ulint m_old_n_fields{0};

  /* Modify clock recorded when the position was stored. */
  uint64_t m_modify_clock{0};
};

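store_position() saves the position so that the page latch can be released. restore_position() first tries the optimistic path, which simply checks whether m_modify_clock has changed. If it has, for example because the B+ tree went through an SMO (structure modification operation), the cursor falls back to the pessimistic path and searches and latches again through btr_cur_search_to_nth_level().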
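
The optimistic and pessimistic restore paths can be sketched with a toy page and cursor. ToyPage and ToyCursor are hypothetical; the page is just a sorted vector of integer keys, there is no latching, and the rule that erasing a key bumps the modify clock is an assumption of this model. The return value here only reports which path was taken, which differs from the real restore_position().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy page: sorted keys plus a counter bumped by invalidating changes.
struct ToyPage {
  std::vector<int> keys;
  std::uint64_t modify_clock = 0;

  void erase(int key) {
    auto it = std::lower_bound(keys.begin(), keys.end(), key);
    if (it != keys.end() && *it == key) {
      keys.erase(it);
      ++modify_clock;  // stored positions may no longer be valid
    }
  }
};

// Toy persistent cursor: remembers a slot, its key and the page clock.
struct ToyCursor {
  std::size_t pos = 0;
  int old_key = 0;
  std::uint64_t stored_clock = 0;
  bool old_stored = false;

  void store_position(const ToyPage &page) {
    assert(pos < page.keys.size());
    old_key = page.keys[pos];
    stored_clock = page.modify_clock;
    old_stored = true;
  }

  // Returns true if the optimistic path succeeded (clock unchanged).
  bool restore_position(const ToyPage &page) {
    assert(old_stored);
    if (page.modify_clock == stored_clock) return true;  // pos still valid
    // Pessimistic path: search for the saved key again.
    auto it = std::lower_bound(page.keys.begin(), page.keys.end(), old_key);
    pos = static_cast<std::size_t>(it - page.keys.begin());
    return false;
  }
};
```

The saved copy of the key is what makes the fallback possible: when the cheap clock check fails, the cursor still knows what it was pointing at and can find it again by searching.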

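store_position() records the buf_block_t, and the optimistic restore tries to latch that buf_block_t directly. Because the Buffer Pool now supports dynamic resizing, that memory may have been freed, so InnoDB first checks whether the buf_block_t pointer still lies within one of the Buffer Pool's chunks: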

void Block_hint::buffer_fix_block_if_still_valid() {
  if (m_block != nullptr) {
    const buf_pool_t *const pool = buf_pool_get(m_page_id);
    rw_lock_t *latch = buf_page_hash_lock_get(pool, m_page_id);
    rw_lock_s_lock(latch);
    /* If not own buf_pool_mutex, page_hash can be changed. */
    latch = buf_page_hash_lock_s_confirm(latch, pool, m_page_id);
    if (buf_is_block_in_instance(pool, m_block) &&
        m_page_id == m_block->page.id &&
        buf_block_get_state(m_block) == BUF_BLOCK_FILE_PAGE) {
      buf_block_buf_fix_inc(m_block, __FILE__, __LINE__);
    } else {
      clear();
    }
    rw_lock_s_unlock(latch);
  }
}
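
The idea of validating a remembered block pointer before touching it can be illustrated with a small model. ToyBlock, ToyChunk, ToyPool and toy_validate_hint() are hypothetical names; the sketch keeps only the "is the pointer inside a current chunk, and does it still hold the expected page" part and leaves out the hash latch and buffer-fix steps of the real function.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

struct ToyBlock {
  int page_id;  // page currently held by this block
  bool in_use;  // stands in for the BUF_BLOCK_FILE_PAGE state check
};

struct ToyChunk {
  std::vector<ToyBlock> blocks;  // contiguous block descriptors
};

struct ToyPool {
  std::vector<ToyChunk> chunks;

  // True if ptr lies inside the block array of one of the current chunks.
  bool contains(const ToyBlock *ptr) const {
    std::less<const ToyBlock *> lt;  // well-defined for unrelated pointers
    for (const ToyChunk &c : chunks) {
      if (c.blocks.empty()) continue;
      const ToyBlock *first = c.blocks.data();
      const ToyBlock *last = first + c.blocks.size();
      if (!lt(ptr, first) && lt(ptr, last)) return true;
    }
    return false;
  }
};

// Returns the hint only if it is still a pool block holding page_id.
// The range check comes first so a stale pointer is never dereferenced.
const ToyBlock *toy_validate_hint(const ToyPool &pool, const ToyBlock *hint,
                                  int page_id) {
  if (hint != nullptr && pool.contains(hint) && hint->in_use &&
      hint->page_id == page_id) {
    return hint;
  }
  return nullptr;
}
```

The order of the checks matters: only once the pointer is known to fall inside a live chunk is it safe to read the block's fields and compare the page id.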
