Elasticsearch Analyzers Explained

四阿哥胤禛 2019-03-18


analyzer

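An analyzer converts a string into a sequence of terms (tokens). For example, depending on the analyzer used, the string "The quick Brown Foxes." may produce the terms quick, brown, fox. Analyzers are customizable, which makes it possible to search efficiently for individual words within large blocks of text.

Analysis happens not only at index time but also at search time: the query string must be passed through the same or a similar analyzer, so that the terms it produces have the same form as the terms stored in the index.

Elasticsearch ships with many predefined analyzers that can be used without further configuration. It also ships with many character filters, tokenizers, and token filters, which can be combined to configure custom analyzers for each index.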
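The three building blocks named above run in a fixed order: character filters first, then one tokenizer, then token filters. The following is a minimal Python sketch of that pipeline, not Elasticsearch code. All function names, the short stop-word list, and the suffix-stripping stemmer are assumptions made for illustration; the real english analyzer uses a much more careful stemmer and a longer stop-word list.

```python
import re

# Toy stop-word list (assumption); Elasticsearch's _english_ list is longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}


def strip_html(text):
    """Character filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text):
    """Tokenizer: split the text into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Token filter: drop tokens found in the toy stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(tokens):
    """Token filter: toy stemmer that strips a trailing "es" or "s"."""
    stemmed = []
    for t in tokens:
        if t.endswith("es") and len(t) > 4:
            stemmed.append(t[:-2])
        elif t.endswith("s") and len(t) > 3:
            stemmed.append(t[:-1])
        else:
            stemmed.append(t)
    return stemmed


def analyze(text, char_filters, tokenizer, token_filters):
    """Run char filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


terms = analyze(
    "The quick <b>Brown</b> Foxes.",
    [strip_html],
    tokenize,
    [lowercase, remove_stop_words, naive_stem],
)
print(terms)  # ['quick', 'brown', 'fox']
```

Changing the order of the token filters changes the result: if stop-word removal ran before lowercasing, "The" would survive because only the lowercase form is in the list.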

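Analyzers can be specified per query, per field, or per index. At index time, Elasticsearch looks for an analyzer in the following order:

The analyzer defined in the field mapping
The analyzer named default in the index settings
The standard analyzer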

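At search time, there are a few more levels:

The analyzer defined in the full-text query
The search_analyzer defined in the field mapping
The analyzer defined in the field mapping
The analyzer named default_search in the index settings
The analyzer named default in the index settings
The standard analyzer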
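Both lookup orders amount to "take the first setting that is present, otherwise fall back to standard". The Python sketch below models only that fallback logic. The dictionaries and function names are hypothetical stand-ins for a query, a field mapping, and the analyzers defined in the index settings; they are not Elasticsearch APIs.

```python
def resolve_index_analyzer(field_mapping, index_analyzers):
    """Return the name of the analyzer that would be used at index time."""
    if "analyzer" in field_mapping:
        return field_mapping["analyzer"]
    if "default" in index_analyzers:
        return "default"
    return "standard"


def resolve_search_analyzer(query, field_mapping, index_analyzers):
    """Return the name of the analyzer that would be used at search time."""
    if "analyzer" in query:
        return query["analyzer"]
    if "search_analyzer" in field_mapping:
        return field_mapping["search_analyzer"]
    if "analyzer" in field_mapping:
        return field_mapping["analyzer"]
    if "default_search" in index_analyzers:
        return "default_search"
    if "default" in index_analyzers:
        return "default"
    return "standard"


# Hypothetical field that sets both analyzers, in an index with no defaults.
mapping = {"type": "text", "analyzer": "autocomplete", "search_analyzer": "standard"}

print(resolve_index_analyzer(mapping, {}))                        # autocomplete
print(resolve_search_analyzer({}, mapping, {}))                   # standard
print(resolve_search_analyzer({"analyzer": "english"}, mapping, {}))  # english
```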

The simplest way to specify an analyzer for a particular field is to define it in the field mapping, as follows:

PUT /my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "text": {
          "type": "text",
          "fields": {
            "english": {
              "type":     "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}


GET my_index/_analyze
{
  "field": "text",
  "text": "The quick Brown Foxes."
}


GET my_index/_analyze
{
  "field": "text.english",
  "text": "The quick Brown Foxes."
}

search_analyzer

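Usually, the same analyzer should be applied at index time and at search time, so that the terms in the query have the same form as the terms in the inverted index. Sometimes, though, it makes sense to use a different analyzer at search time, for example when using the edge_ngram tokenizer for autocomplete.

By default, queries use the analyzer defined in the field mapping, but this can be overridden with the search_analyzer parameter.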

PUT /my_index
{
    "settings":{
        "analysis":{
            "filter":{
                "autocomplete_filter":{
                    "type":"edge_ngram",
                    "min_gram":1,
                    "max_gram":20
                }
            },
            "analyzer":{
                "autocomplete":{
                    "type":"custom",
                    "tokenizer":"standard",
                    "filter":[
                        "lowercase",
                        "autocomplete_filter"
                    ]
                }
            }
        }
    },
    "mappings":{
        "_doc":{
            "properties":{
                "text":{
                    "type":"text",
                    "analyzer":"autocomplete",
                    "search_analyzer":"standard"
                }
            }
        }
    }
}


PUT /my_index/_doc/1
{
    "text":"Quick Brown Fox"
}


GET my_index/_search
{
    "query":{
        "match":{
            "text":{
                "query":"Quick Br",
                "operator":"and"
            }
        }
    }
}
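To see why the query "Quick Br" matches, it helps to list the terms each analyzer would emit. The Python sketch below is a rough simulation under stated assumptions: a regular expression stands in for the standard tokenizer, and the helper names are made up for illustration. It mirrors the settings above (lowercase plus edge n-grams of length 1 to 20 at index time, plain lowercased words at search time).

```python
import re


def simple_tokens(text):
    """Rough stand-in for the standard tokenizer plus lowercase filter."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the prefixes of token whose length is in [min_gram, max_gram]."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def autocomplete_analyze(text):
    """Index-time analysis: every word is expanded into its prefixes."""
    terms = []
    for token in simple_tokens(text):
        terms.extend(edge_ngrams(token))
    return terms


def search_analyze(text):
    """Search-time analysis: words are kept whole."""
    return simple_tokens(text)


indexed = set(autocomplete_analyze("Quick Brown Fox"))
print(sorted(indexed))
# ['b', 'br', 'bro', 'brow', 'brown', 'f', 'fo', 'fox',
#  'q', 'qu', 'qui', 'quic', 'quick']

# "operator": "and" means every query term must be present in the index.
matches = all(term in indexed for term in search_analyze("Quick Br"))
print(matches)  # True

# If n-grams were also generated at search time, the unrelated query "Qx"
# would emit the term "q" and match under the default OR semantics.
loose = any(term in indexed for term in autocomplete_analyze("Qx"))
strict = any(term in indexed for term in search_analyze("Qx"))
print(loose, strict)  # True False
```

The last two lines show the reason for setting search_analyzer to standard: only what the user actually typed is looked up, while the prefixes live on the index side.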

search_quote_analyzer

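The search_quote_analyzer parameter specifies the analyzer used for phrase queries. This is particularly useful when stop-word removal should be disabled for phrase queries.

To disable stop-word removal for phrases, the field needs three analyzer settings:

analyzer: used at index time; it keeps all terms, including stop words
search_analyzer: used for non-phrase queries; it removes stop words
search_quote_analyzer: used for phrase queries; it keeps stop words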

PUT /my_index
{
    "settings":{
        "analysis":{
            "analyzer":{
                "my_analyzer":{
                    "type":"custom",
                    "tokenizer":"standard",
                    "filter":[
                        "lowercase"
                    ]
                },
                "my_stop_analyzer":{
                    "type":"custom",
                    "tokenizer":"standard",
                    "filter":[
                        "lowercase",
                        "english_stop"
                    ]
                }
            },
            "filter":{
                "english_stop":{
                    "type":"stop",
                    "stopwords":"_english_"
                }
            }
        }
    },
    "mappings":{
        "_doc":{
            "properties":{
                "title":{
                    "type":"text",
                    "analyzer":"my_analyzer",
                    "search_analyzer":"my_stop_analyzer",
                    "search_quote_analyzer":"my_analyzer"
                }
            }
        }
    }
}


PUT /my_index/_doc/1
{
    "title":"The Quick Brown Fox"
}


PUT /my_index/_doc/2
{
    "title":"A Quick Brown Fox"
}


GET my_index/_search
{
    "query":{
        "query_string":{
            "query":"\"the quick brown fox\""
        }
    }
}
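The effect of search_quote_analyzer on this query can be illustrated with a small Python simulation. It is a simplified sketch, not how Elasticsearch executes phrase queries: the two-word stop list, the regular-expression tokenizer, and the consecutive-terms phrase check are all assumptions, and real token positions are ignored.

```python
import re

STOP_WORDS = {"a", "the"}  # toy subset of an English stop-word list


def my_analyzer(text):
    """Lowercased words, stop words kept."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def my_stop_analyzer(text):
    """Lowercased words, stop words removed."""
    return [t for t in my_analyzer(text) if t not in STOP_WORDS]


def contains_phrase(doc_terms, phrase_terms):
    """True if phrase_terms occurs as a consecutive run inside doc_terms."""
    n = len(phrase_terms)
    return any(
        doc_terms[i:i + n] == phrase_terms
        for i in range(len(doc_terms) - n + 1)
    )


docs = {1: "The Quick Brown Fox", 2: "A Quick Brown Fox"}
# Index time uses my_analyzer, so stop words are stored.
index = {doc_id: my_analyzer(title) for doc_id, title in docs.items()}


def phrase_search(query, analyzer):
    """Return the ids of documents that contain the analyzed phrase."""
    phrase = analyzer(query)
    return [doc_id for doc_id, terms in index.items() if contains_phrase(terms, phrase)]


# Phrase analyzed with my_analyzer (the search_quote_analyzer): "the" is kept.
print(phrase_search("the quick brown fox", my_analyzer))       # [1]
# Phrase analyzed with my_stop_analyzer: "the" is dropped, both titles match.
print(phrase_search("the quick brown fox", my_stop_analyzer))  # [1, 2]
```

Because the quoted query is analyzed with my_analyzer, only the document whose title starts with "The" is matched as an exact phrase.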
