暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

MySQL的函数和运算符 - 全文搜索 - 全文终止词

数据库杂货铺 2021-07-03
515

全文终止词

 

使用服务器字符集和排序规则(character_set_server collation_server 系统变量的值)加载终止词列表,进行全文搜索查询。如果用于全文索引或搜索的终止词文件或列的字符集或排序规则与 character_set_server collation_server 不同,那么终止词查找可能会出现错误的命中或未命中。

 

终止词查找是否区分大小写取决于服务器的排序规则。例如,如果排序规则是 utf8mb4_0900_ai_ci,则查找不区分大小写,而如果排序规则是 utf8mb4_0900_as_cs utf8mb4_bin,则查找区分大小写。

 

InnoDB 搜索索引的终止词

 

InnoDB 有一个相对较短的默认终止词列表,因为来自技术、文学和其他来源的文档通常使用短词作为关键字或重要短语。例如,您可能会搜索 “to be or not to be” 并期望得到一个合理的结果,而不是忽略所有这些词。

 

要查看默认 InnoDB 终止词列表,请查询 INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD 表。

 

    mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
    +-------+
    | value |
    +-------+
    | a |
    | about |
    | an |
    | are |
    | as |
    | at |
    | be |
    | by |
    | com |
    | de |
    | en |
    | for |
    | from |
    | how |
    | i |
    | in |
    | is |
    | it |
    | la |
    | of |
    | on |
    | or |
    | that |
    | the |
    | this |
    | to |
    | was |
    | what |
    | when |
    | where |
    | who |
    | will |
    | with |
    | und |
    | the |
    | www |
    +-------+
    36 rows in set (0.00 sec)

     

    要对所有 InnoDB 表定义自己的终止词列表,需要定义一个表与 INNODB_FT_DEFAULT_STOPWORD 表相同的结构,并填充终止词,创建全文索引之前,将 innodb_ft_server_stopword_table 选项的值设置为 db_name/table_name 形式。终止词表必须有一个名为 value VARCHAR 列。下面的示例演示如何为 InnoDB 创建和配置一个新的全局终止词表。

     

      -- Create a new stopword table

      mysql> CREATE TABLE my_stopwords(value VARCHAR(30)) ENGINE = INNODB;
      Query OK, 0 rows affected (0.01 sec)

      -- Insert stopwords (for simplicity, a single stopword is used in this example)

      mysql> INSERT INTO my_stopwords(value) VALUES ('Ishmael');
      Query OK, 1 row affected (0.00 sec)

      -- Create the table

      mysql> CREATE TABLE opening_lines (
      id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
      opening_line TEXT(500),
      author VARCHAR(200),
      title VARCHAR(200)
      ) ENGINE=InnoDB;
      Query OK, 0 rows affected (0.01 sec)

      -- Insert data into the table

      mysql> INSERT INTO opening_lines(opening_line,author,title) VALUES
      ('Call me Ishmael.','Herman Melville','Moby-Dick'),
      ('A screaming comes across the sky.','Thomas Pynchon','Gravity\'s Rainbow'),
      ('I am an invisible man.','Ralph Ellison','Invisible Man'),
      ('Where now? Who now? When now?','Samuel Beckett','The Unnamable'),
      ('It was love at first sight.','Joseph Heller','Catch-22'),
      ('All this happened, more or less.','Kurt Vonnegut','Slaughterhouse-Five'),
      ('Mrs. Dalloway said she would buy the flowers herself.','Virginia Woolf','Mrs. Dalloway'),
      ('It was a pleasure to burn.','Ray Bradbury','Fahrenheit 451');
      Query OK, 8 rows affected (0.00 sec)
      Records: 8 Duplicates: 0 Warnings: 0

      -- Set the innodb_ft_server_stopword_table option to the new stopword table

      mysql> SET GLOBAL innodb_ft_server_stopword_table = 'test/my_stopwords';
      Query OK, 0 rows affected (0.00 sec)

      -- Create the full-text index (which rebuilds the table if no FTS_DOC_ID column is defined)

      mysql> CREATE FULLTEXT INDEX idx ON opening_lines(opening_line);
      Query OK, 0 rows affected, 1 warning (1.17 sec)
      Records: 0 Duplicates: 0 Warnings: 1

       

      通过查询 INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE 中的词,可以验证指定的终止词 ('Ishmael') 是否存在。

       

      注意

       

      默认情况下,长度小于3个字符或大于84个字符的单词不会出现在 InnoDB 全文索引中。最大和最小字长值可以通过 innodb_ft_max_token_size innodb_ft_min_token_size 变量进行配置。这个默认行为不适用于 ngram 解析器插件。ngram 标记的大小由 ngram_token_size 选项定义。

       

        mysql> SET GLOBAL innodb_ft_aux_table='test/opening_lines';
        Query OK, 0 rows affected (0.00 sec)

        mysql> SELECT word FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE LIMIT 15;
        +-----------+
        | word |
        +-----------+
        | across |
        | all |
        | burn |
        | buy |
        | call |
        | comes |
        | dalloway |
        | first |
        | flowers |
        | happened |
        | herself |
        | invisible |
        | less |
        | love |
        | man |
        +-----------+
        15 rows in set (0.00 sec)

         

        要逐表创建终止词列表,请创建其他终止词表,并使用 innodb_ft_user_stopword_table 选项指定在创建全文索引之前要使用的终止词表。

         

        MyISAM 搜索索引的终止词

         

        如果 character_set_server ucs2utf16utf16le utf32,则使用 latin1加载终止词文件进行搜索。

         

        要覆盖 MyISAM 表的默认终止词列表,请设置 ft_stopword_file 系统变量。变量值应该是包含终止词列表文件的路径名,或者是用于禁用终止词过滤的空字符串。服务器在数据目录中查找文件,除非指定了一个绝对路径名来指定另一个目录。在更改此变量的值或终止词文件的内容后,需要重新启动服务器并重建 FULLTEXT 索引。

         

        终止词列表形式自由,用换行符、空格或逗号等任何非字母数字字符分隔终止词。例外是下划线字符(_)和单个撇号('),它们被视为单词的一部分。终止词列表的字符集是服务器的默认字符集。

         

        下面的列表显示了 MyISAM 搜索索引的默认终止词。在 MySQL 源代码分发版中,可以在 storage/myisam/ft_static.c 文件中找到此列表。

         

         

          a's           able          about         above         according
          accordingly across actually after afterwards
          again against ain't all allow
          allows almost alone along already
          also although always am among
          amongst an and another any
          anybody anyhow anyone anything anyway
          anyways anywhere apart appear appreciate
          appropriate are aren't around as
          aside ask asking associated at
          available away awfully be became
          because become becomes becoming been
          before beforehand behind being believe
          below beside besides best better
          between beyond both brief but
          by c'mon c's came can
          can't cannot cant cause causes
          certain certainly changes clearly co
          com come comes concerning consequently
          consider considering contain containing contains
          corresponding could couldn't course currently
          definitely described despite did didn't
          different do does doesn't doing
          don't done down downwards during
          each edu eg eight either
          else elsewhere enough entirely especially
          et etc even ever every
          everybody everyone everything everywhere ex
          exactly example except far few
          fifth first five followed following
          follows for former formerly forth
          four from further furthermore get
          gets getting given gives go
          goes going gone got gotten
          greetings had hadn't happens hardly
          has hasn't have haven't having
          he he's hello help hence
          her here here's hereafter hereby
          herein hereupon hers herself hi
          him himself his hither hopefully
          how howbeit however i'd i'll
          i'm i've ie if ignored
          immediate in inasmuch inc indeed
          indicate indicated indicates inner insofar
          instead into inward is isn't
          it it'd it'll it's its
          itself just keep keeps kept
          know known knows last lately
          later latter latterly least less
          lest let let's like liked
          likely little look looking looks
          ltd mainly many may maybe
          me mean meanwhile merely might
          more moreover most mostly much
          must my myself name namely
          nd near nearly necessary need
          needs neither never nevertheless new
          next nine no nobody non
          none noone nor normally not
          nothing novel now nowhere obviously
          of off often oh ok
          okay old on once one
          ones only onto or other
          others otherwise ought our ours
          ourselves out outside over overall
          own particular particularly per perhaps
          placed please plus possible presumably
          probably provides que quite qv
          rather rd re really reasonably
          regarding regardless regards relatively respectively
          right said same saw say
          saying says second secondly see
          seeing seem seemed seeming seems
          seen self selves sensible sent
          serious seriously seven several shall
          she should shouldn't since six
          so some somebody somehow someone
          something sometime sometimes somewhat somewhere
          soon sorry specified specify specifying
          still sub such sup sure
          t's take taken tell tends
          th than thank thanks thanx
          that that's thats the their
          theirs them themselves then thence
          there there's thereafter thereby therefore
          therein theres thereupon these they
          they'd they'll they're they've think
          third this thorough thoroughly those
          though three through throughout thru
          thus to together too took
          toward towards tried tries truly
          try trying twice two un
          under unfortunately unless unlikely until
          unto up upon us use
          used useful uses using usually
          value various very via viz
          vs want wants was wasn't
          way we we'd we'll we're
          we've welcome well went were
          weren't what what's whatever when
          whence whenever where where's whereafter
          whereas whereby wherein whereupon wherever
          whether which while whither who
          who's whoever whole whom whose
          why will willing wish with
          within without won't wonder would
          wouldn't yes yet you you'd
          you'll you're you've your yours
          yourself yourselves zero

           

           

          官方网址:

          https://dev.mysql.com/doc/refman/8.0/en/fulltext-stopwords.html

          文章转载自数据库杂货铺,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

          评论