Elasticsearch Search in Depth, Part 1: Proximity Matching

开发架构二三事 2019-05-20

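1. Problem: the project uses full-text search, but testing reported three issues:

Numeric search: there is a circle titled "666666", but searching for "6" finds nothing;
Single-character search: there is a circle titled "测试圈子直播", but searching for "测" finds nothing;
Ordering: the search engine returns data in the same order as the data actually returned.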

2. Analyzers

1. Pinyin analyzer

http://192.168.1.38:9200/info/_analyze?analyzer=pinyin_analyzer&text=测试圈子直播

The analysis result is:

{
    "tokens": [
        {
            "token": "ce",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "shi",
            "start_offset": 1,
            "end_offset": 2,
            "type": "word",
            "position": 1
        },
        {
            "token": "quan",
            "start_offset": 2,
            "end_offset": 3,
            "type": "word",
            "position": 2
        },
        {
            "token": "zi",
            "start_offset": 3,
            "end_offset": 4,
            "type": "word",
            "position": 3
        },
        {
            "token": "zhi",
            "start_offset": 4,
            "end_offset": 5,
            "type": "word",
            "position": 4
        },
        {
            "token": "bo",
            "start_offset": 5,
            "end_offset": 6,
            "type": "word",
            "position": 5
        },
        {
            "token": "测试圈子直播",
            "start_offset": 0,
            "end_offset": 6,
            "type": "word",
            "position": 5
        },
        {
            "token": "csqzzb",
            "start_offset": 0,
            "end_offset": 6,
            "type": "word",
            "position": 5
        }
    ]
}

2. ik_max_word analyzer

http://192.168.1.38:9200/info/_analyze?analyzer=ik_max_word&text=测试圈子直播

{
    "tokens": [
        {
            "token": "测试",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "圈子",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "圈",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "子",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_CHAR",
            "position": 3
        },
        {
            "token": "直播",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "播",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 5
        }
    ]
}

3. ik_smart analyzer

http://192.168.1.38:9200/info/_analyze?analyzer=ik_smart&text=测试圈子直播

{
    "tokens": [
        {
            "token": "测试",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "圈子",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "直播",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 2
        }
    ]
}

4. standard analyzer

http://192.168.1.38:9200/info/_analyze?analyzer=standard&text=测试圈子直播

{
    "tokens": [
        {
            "token": "测",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "试",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "圈",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "子",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "直",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "播",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        }
    ]
}

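Comparing the analyzers above:

1. The pinyin analyzer tokenizes by converting the Chinese text to pinyin.
2. The ik_max_word and ik_smart analyzers mostly index multi-character words rather than individual characters.
3. The standard analyzer indexes each character on its own and does not index words.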
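
To make the comparison concrete, here is a minimal Python sketch (not part of the original setup) that treats each analyzer's output as the list of indexed terms and checks whether a query term equals one of them. The token lists are copied from the _analyze responses above; term_hits is a hypothetical helper, and the exact-token lookup is a simplification of what Elasticsearch actually does.

```python
# Tokens copied from the _analyze responses above for "测试圈子直播".
TOKENS = {
    "standard": ["测", "试", "圈", "子", "直", "播"],
    "ik_smart": ["测试", "圈子", "直播"],
    "ik_max_word": ["测试", "圈子", "圈", "子", "直播", "播"],
}


def term_hits(query_term: str) -> dict:
    """Return, per analyzer, whether query_term equals one of the indexed tokens."""
    return {name: query_term in tokens for name, tokens in TOKENS.items()}


print(term_hits("测"))    # single character
print(term_hits("测试"))  # two-character word
```

Under this simplified model only the standard analyzer indexes 测 as a term of its own, which would explain the single-character search problem from section 1 if the title field is analyzed with one of the ik analyzers.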

3. Common concepts in Elasticsearch proximity matching

1. Kinds of matching

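For a document to match the phrase "quick brown fox", the following must all be true:

	1. quick, brown, and fox must all appear in the field.
	2. The position of brown must be 1 greater than the position of quick.
	3. The position of fox must be 2 greater than the position of quick.

If any of these conditions is not met, the document does not match.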
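
The three conditions can be sketched in a few lines of Python. This is only an illustration under simple assumptions: terms are produced by lowercasing and splitting on whitespace, and positions and phrase_match are hypothetical helpers, not Elasticsearch APIs.

```python
def positions(text: str) -> dict:
    """Map each lowercased, whitespace-separated term to the list of its positions."""
    index = {}
    for pos, term in enumerate(text.lower().split()):
        index.setdefault(term, []).append(pos)
    return index


def phrase_match(doc: str, phrase: str) -> bool:
    """True if the phrase terms occur in doc at consecutive positions, in order."""
    index = positions(doc)
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return False
    # Try every occurrence of the first term as the start of the phrase.
    return any(
        all(start + offset in index[term] for offset, term in enumerate(terms))
        for start in index[terms[0]]
    )


print(phrase_match("The quick brown fox", "quick brown fox"))
print(phrase_match("quick brown fox", "quick fox"))
```

The second call fails the position check because fox sits two positions after quick rather than one.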

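match analyzes the query string before searching: a query for quick brown fox is split into the terms quick, brown, and fox, and by default a document matches if it contains any of them, regardless of position. (It is the term query that passes quick brown fox through as a single, unanalyzed term.)
match_phrase performs proximity matching on a phrase. It first analyzes the query string into a list of terms and searches for documents that contain all of them, but keeps only the documents in which the terms also appear at the matching positions. A query for quick fox therefore returns no results, because another word sits between quick and fox in the document. For example: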

GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": "quick brown fox"
        }
    }
}

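The query above can also be written as a match query with type set to phrase (the type parameter was deprecated in Elasticsearch 5.0 and removed in 6.0, so use match_phrase on newer versions):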

"match": {
       "title": {
           "query": "quick brown fox",
           "type":  "phrase"
       }
   }

2. The slop parameter

This parameter is generally used with match_phrase.
The slop parameter tells the match_phrase query how far apart terms are allowed to be while still considering the document a match. By how far apart we mean how many times do you need to move a term in order to make the query and document match?
In other words, when the terms of the query string have to be moved some number of times before they line up with a document, that number of moves is the slop.

Running the search above with the slop parameter (the document contains quick brown fox and the query is quick fox):

GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "quick fox",
                "slop":  1
            }
        }
    }
}

In the query above:

	Pos 1         Pos 2         Pos 3

Doc:        quick         brown         fox

Query:      quick         fox

Slop 1:     quick                 ↳     fox

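So a slop of 1 is enough: the term fox only has to move back by one position.

When the query is fox quick, the slop has to be raised to 3: quick moves forward by one position and fox moves back by two.
When the query is quick fox, as shown above, a slop of 1 is enough.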
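
The slop values quoted here can be reproduced with a small Python model. It is a sketch under stated assumptions, not Lucene's implementation: terms come from a whitespace split, and the slop is taken as the spread between the largest and smallest "document position minus query position" shift, which to my understanding mirrors how sloppy phrase matching measures distance when no query term repeats. required_slop is a hypothetical helper.

```python
from itertools import product


def required_slop(doc: str, query: str):
    """Smallest slop at which the query phrase matches doc, or None if a term is missing.

    For each query term take (document position - query position); the spread
    between the largest and smallest of these shifts is the slop needed.
    """
    doc_terms = doc.lower().split()
    query_terms = query.lower().split()
    occurrences = [
        [pos for pos, term in enumerate(doc_terms) if term == wanted]
        for wanted in query_terms
    ]
    if not query_terms or any(not found for found in occurrences):
        return None
    best = None
    # Try every combination of occurrences and keep the tightest one.
    for combo in product(*occurrences):
        shifts = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        spread = max(shifts) - min(shifts)
        if best is None or spread < best:
            best = spread
    return best


print(required_slop("quick brown fox", "quick fox"))
print(required_slop("quick brown fox", "fox quick"))
```

With the document quick brown fox, the model gives 1 for quick fox and 3 for fox quick, matching the values in the text.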

3. Multi-value fields

Something odd happens when phrase matching is used on a multi-value field. For example:

PUT /my_index/groups/1
{
    "names": [ "John Abraham", "Lincoln Smith"]
}

Then run a phrase query for Abraham Lincoln:

GET /my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": "Abraham Lincoln"
        }
    }
}

Surprisingly, the document matches the query even though Abraham and Lincoln belong to two different people in the names array. The reason lies in how Elasticsearch indexes arrays. Analyzing John Abraham produces:

Position 1: john
Position 2: abraham

Analyzing Lincoln Smith then produces:

Position 3: lincoln
Position 4: smith

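In other words, Elasticsearch produces almost exactly the same tokens for this array as it would for the single string John Abraham Lincoln Smith. The example query looks for abraham immediately followed by lincoln; both terms exist and they are adjacent, so the query matches.

For this situation there is a simple fix called position_increment_gap, which is configured in the field mapping.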

DELETE /my_index/groups/ 

PUT /my_index/_mapping/groups 
{
    "properties": {
        "names": {
            "type":                "string",
            "position_increment_gap": 100
        }
    }
}

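The first request deletes the groups mapping and all documents of that type.
The second creates a new groups mapping with the correct value.

The position_increment_gap setting tells Elasticsearch to increase the current term position by the specified value for each new element in the array. Indexing the names array again now produces: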

Position 1: john
Position 2: abraham
Position 103: lincoln
Position 104: smith

Now the phrase query no longer matches the document, because a gap of 100 separates abraham and lincoln. To match this document you would have to add a slop of 100.
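
The position numbers in both listings can be reproduced with a short Python sketch. It assumes whitespace tokenization and 1-based positions as in the listings; array_positions is a hypothetical helper, and gap=0 models the behaviour described before the mapping change.

```python
def array_positions(values, gap=0):
    """Assign 1-based term positions across array elements, adding `gap` between elements."""
    result = []
    position = 0
    for i, value in enumerate(values):
        if i > 0:
            position += gap  # extra distance inserted before each new array element
        for term in value.lower().split():
            position += 1
            result.append((position, term))
    return result


names = ["John Abraham", "Lincoln Smith"]
print(array_positions(names))
print(array_positions(names, gap=100))
```

Without a gap, abraham and lincoln land on adjacent positions 2 and 3; with a gap of 100 they land on 2 and 103.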

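4. Closer is better

A phrase query simply excludes documents that do not contain the exact query phrase, whereas a proximity query (a phrase query with slop greater than 0) folds the proximity of the query terms into the final relevance _score. By setting a high slop value such as 50 or 100, you exclude only documents in which the words are too far apart, while giving a higher score to documents in which the words are closer together.

The following proximity query for quick dog matches documents that contain both quick and dog, but gives a higher score to documents in which quick and dog are nearer to each other: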

POST /my_index/my_type/_search
{
   "query": {
      "match_phrase": {
         "title": {
            "query": "quick dog",
            "slop":  50 
         }
      }
   }
}

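In the results below, document 3 scores higher because quick and dog are close together.
Document 2 scores lower because quick and dog are far apart.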

{
  "hits": [
     {
        "_id":      "3",
        "_score":   0.75, 
        "_source": {
           "title": "The quick brown fox jumps over the quick dog"
        }
     },
     {
        "_id":      "2",
        "_score":   0.28347334, 
        "_source": {
           "title": "The quick brown fox jumps over the lazy dog"
        }
     }
  ]
}
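
The ranking above can be approximated with a rough Python sketch. Two assumptions are baked in: the distance of a match is measured as in the slop discussion earlier (spread of position shifts), and a match at distance d is weighted by 1 / (d + 1), which to my knowledge is what Lucene's classic similarity uses for sloppy phrase frequency. The sketch does not reproduce the _score values shown, only their order; match_distance and sloppy_weight are hypothetical helpers.

```python
from itertools import product


def match_distance(doc: str, query: str):
    """Smallest position spread between the query terms in doc, or None if a term is missing."""
    doc_terms = doc.lower().split()
    slots = [
        [pos for pos, term in enumerate(doc_terms) if term == wanted]
        for wanted in query.lower().split()
    ]
    if not slots or any(not slot for slot in slots):
        return None
    spreads = []
    for combo in product(*slots):
        shifts = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        spreads.append(max(shifts) - min(shifts))
    return min(spreads)


def sloppy_weight(distance: int) -> float:
    """Assumed weighting of a sloppy match: closer terms contribute more."""
    return 1.0 / (distance + 1)


for title in (
    "The quick brown fox jumps over the quick dog",
    "The quick brown fox jumps over the lazy dog",
):
    distance = match_distance(title, "quick dog")
    print(title, distance, sloppy_weight(distance))
```

The first title contains quick dog as an exact phrase (distance 0), while in the second the two words are six moves apart, so it receives a much smaller weight.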

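5. Using proximity to improve relevance

Although proximity queries are useful, the requirement that all terms be present in the document is too strict. It is the same issue as controlling precision in full-text search: if six out of seven terms match, the document is probably relevant enough for the user, yet the match_phrase query would exclude it.

Instead of making proximity matching an absolute requirement, we can use it as one signal among many potential queries, each contributing to the final score of every document. They can be combined with a bool query. A simple match query goes in the must clause; this query decides which documents are included in the result set, and the minimum_should_match parameter trims the long tail. More specific queries can then be added as should clauses, and each one that matches increases the relevance of the matching documents.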

GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { 
          "title": {
            "query":                "quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      },
      "should": {
        "match_phrase": { 
          "title": {
            "query": "quick brown fox",
            "slop":  50
          }
        }
      }
    }
  }
}

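The must clause includes or excludes documents from the result set.
The should clause increases the relevance score of the documents that match.
Other queries can be added to the should clause, each targeting one specific aspect of relevance.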

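6. Performance

Phrase queries (match_phrase) and proximity queries (match_phrase with slop) are both more expensive than a simple match query. A match query only checks whether terms exist in the inverted index, whereas a match_phrase query has to calculate and compare the positions of multiple, possibly repeated, terms.

The official benchmarks show that a simple term query is about 10 times as fast as a phrase query and about 20 times as fast as a proximity query (a phrase query with slop). This cost is paid at search time, not at index time.

Phrase queries only become really costly in certain pathological cases. A typical example is DNA sequences, where many identical terms repeat in many positions; using a high slop there makes the position calculations grow enormously.

Rescoring results

As described above, using proximity to improve relevance only changes the order of documents in the result list. A query may match thousands of results, but users are most likely interested only in the first few pages. A simple match query has already ranked documents that contain all the search terms near the top of the list. What we really want is to reorder just these top documents, giving an extra relevance boost to those that also match the phrase query.

The search API supports exactly this through rescoring. The rescore phase applies a more expensive scoring algorithm, such as a phrase query, to only the top K results from each shard, which are then re-sorted by their new scores.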

GET /my_index/my_type/_search
{
    "query": {
        "match": {  
            "title": {
                "query":                "quick brown fox",
                "minimum_should_match": "30%"
            }
        }
    },
    "rescore": {
        "window_size": 50, 
        "query": {         
            "rescore_query": {
                "match_phrase": {
                    "title": {
                        "query": "quick brown fox",
                        "slop":  50
                    }
                }
            }
        }
    }
}

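The match query decides which documents are included in the final result set and ranks them by TF/IDF.
window_size is the number of top documents to rescore per shard.
The only rescoring algorithm currently supported is another query, but there are plans to add more algorithms later.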
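
The idea behind window_size can be illustrated with a small Python simulation. It is a sketch, not how Elasticsearch is implemented: it works on a single list of hits rather than per shard, it assumes the original and rescore scores are simply added together, and rescore here is a hypothetical function.

```python
def rescore(hits, window_size, rescore_score):
    """Re-rank only the first `window_size` hits; the rest keep their order and score.

    hits: list of (doc_id, score) already sorted by score, descending.
    rescore_score: function doc_id -> extra score from the more expensive query.
    """
    window = [
        (doc_id, score + rescore_score(doc_id))
        for doc_id, score in hits[:window_size]
    ]
    window.sort(key=lambda hit: hit[1], reverse=True)
    return window + hits[window_size:]


hits = [("a", 1.0), ("b", 0.9), ("c", 0.8), ("d", 0.7)]
phrase_bonus = {"b": 0.5, "d": 5.0}  # made-up phrase-query scores for illustration
print(rescore(hits, 3, lambda doc_id: phrase_bonus.get(doc_id, 0.0)))
```

Document b moves to the top because it falls inside the window, while d keeps its place even though its phrase score would be large: it was never rescored. That is the trade-off of a small window_size.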

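7. Finding associated words

Phrase and proximity queries are useful, but they still have a drawback. They are overly strict: all terms must be present for a phrase query to match, even when slop is used. The flexibility in word order gained from slop also comes at a price, because the association between word pairs is lost. You can identify documents in which sue, alligator, and ate occur close together, but you cannot tell whether Sue ate or the alligator ate.

When words are used together, they express more than they do on their own. The two clauses I’m not happy I’m working and I’m happy I’m not working contain the same words in close proximity, but have quite different meanings.

If we index pairs of words instead of indexing each word independently, we retain more of the context in which the words were used.

For the sentence Sue ate the alligator, we would index not only each word (or unigram) as a term: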

["sue", "ate", "the", "alligator"]

but also each word together with its neighbor as a single term:

["sue ate", "ate the", "the alligator"]

These word pairs (or bigrams) are known as shingles.

Shingles are not restricted to pairs of words; you could also index three-word groups (trigrams):

["sue ate the", "ate the alligator"]

Trigrams give higher precision, but greatly increase the number of unique terms in the index. In most cases, bigrams are enough.
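
Generating shingles from a token list is a one-liner, sketched here in Python purely for illustration; shingles is a hypothetical helper and not the Elasticsearch token filter itself.

```python
def shingles(tokens, size=2):
    """Return every run of `size` consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


tokens = ["sue", "ate", "the", "alligator"]
print(shingles(tokens))          # bigrams
print(shingles(tokens, size=3))  # trigrams
```

For the four unigrams above this yields the three bigrams and two trigrams listed in the text.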

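Of course, shingles are useful only if the user enters the query in the same order as in the original document; a query for sue alligator would match the individual words but none of the shingles.

Fortunately, users tend to express themselves using constructs similar to those in the data they are searching. But this point is important: it is not enough to index just bigrams; we still need unigrams, and can use matching bigrams as a signal to increase the relevance score.

Unigrams and bigrams can be indexed into a single field, but it is cleaner to keep them in separate fields that can be queried independently. The unigram field forms the basis of the search, while the bigram field is used to boost relevance.

1. First, we need to create an analyzer that uses the shingle token filter: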

DELETE /my_index

PUT /my_index
{
    "settings": {
        "number_of_shards": 1,  
        "analysis": {
            "filter": {
                "my_shingle_filter": {
                    "type":             "shingle",
                    "min_shingle_size": 2,(1) 
                    "max_shingle_size": 2, 
                    "output_unigrams":  false   (2)
                }
            },
            "analyzer": {
                "my_shingle_analyzer": {
                    "type":             "custom",
                    "tokenizer":        "standard",
                    "filter": [
                        "lowercase",
                        "my_shingle_filter" (3)
                    ]
                }
            }
        }
    }
}

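(1) The default minimum/maximum shingle size is 2, so these do not actually need to be set.
(2) The shingle token filter outputs unigrams by default, but we want to keep unigrams and bigrams separate.
(3) my_shingle_analyzer uses our custom my_shingle_filter token filter.

Test the analyzer with the analyze API:

GET /my_index/_analyze?analyzer=my_shingle_analyzer
Sue ate the alligator

This returns three terms: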

sue ate
ate the
the alligator

2. Multi-fields

Make the title field a multi-field so that unigrams and bigrams are indexed separately:

PUT /my_index/_mapping/my_type
{
    "my_type": {
        "properties": {
            "title": {
                "type": "string",
                "fields": {
                    "shingles": {
                        "type":     "string",
                        "analyzer": "my_shingle_analyzer"
                    }
                }
            }
        }
    }
}

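With this mapping, the title field of a JSON document is indexed both as unigrams (title) and as bigrams (title.shingles), which means these fields can be queried independently.

3. Searching shingles

To see the benefit of adding the shingles field, first look at the results of a simple match query for The hungry alligator ate Sue, run against the three test documents whose titles appear in the results below: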

GET /my_index/my_type/_search
{
   "query": {
        "match": {
           "title": "the hungry alligator ate sue"
        }
   }
}

The query returns all three documents, but note that documents 1 and 2 have the same relevance score because they contain the same words:

{
  "hits": [
     {
        "_id": "1",
        "_score": 0.44273707, 
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id": "2",
        "_score": 0.44273707, 
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id": "3", 
        "_score": 0.046571054,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}

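Both documents contain the, alligator, and ate, and so get the same score. Document 3 can be excluded by setting the minimum_should_match parameter; see Controlling Precision: https://www.elastic.co/guide/cn/elasticsearch/guide/current/match-multi-word.html#match-precision .

Now add the shingles field to the query, since matching it raises the relevance score: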

GET /my_index/my_type/_search
{
   "query": {
      "bool": {
         "must": {
            "match": {
               "title": "the hungry alligator ate sue"
            }
         },
         "should": {
            "match": {
               "title.shingles": "the hungry alligator ate sue"
            }
         }
      }
   }
}

All three documents still match, but document 2 now ranks first because it matches the shingled term ate sue.

{
  "hits": [
     {
        "_id": "2",
        "_score": 0.4883322,
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id": "1",
        "_score": 0.13422975,
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id": "3",
        "_score": 0.014119488,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}

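Even though the query contains the word hungry, which appears in none of the documents, word proximity still lets us return the most relevant document first.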
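
Why document 2 wins can be checked by comparing bigrams directly. The Python sketch below assumes whitespace tokenization and lowercasing, which only approximates my_shingle_analyzer; bigrams and shared_bigrams are hypothetical helpers.

```python
def bigrams(text: str) -> set:
    """Set of adjacent word pairs in text, lowercased."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)}


def shared_bigrams(query: str, title: str) -> set:
    """Bigrams that occur in both the query and the title."""
    return bigrams(query) & bigrams(title)


query = "the hungry alligator ate sue"
for title in (
    "Sue ate the alligator",
    "The alligator ate Sue",
    "Sue never goes anywhere without her alligator skin purse",
):
    print(title, sorted(shared_bigrams(query, title)))
```

Only The alligator ate Sue shares bigrams with the query (alligator ate as well as ate sue), so only that document gets extra score from the should clause on title.shingles.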

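4. Search performance:

Shingles are not only more flexible than phrase queries, they also perform better. Instead of paying the price of a phrase query on every search, queries for shingles are as efficient as a simple match query. There is a small cost at index time, because more terms have to be indexed, which also means fields with shingles take more disk space. However, most applications write once and read many times, so it makes sense to optimize for fast queries at index time. This is a recurring theme in Elasticsearch: it lets you achieve a lot at search time without much up-front setup, and once you understand your requirements more clearly, you can get better results and better performance by modeling your data correctly at index time.

4. Applying this in practice:

1. Without shingles (the examples below are run in Kibana)

1. Create the mapping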

PUT my_index1
{
  "mappings": {
    "my_type": {
      "_all": { "enabled": false },
      "_routing": {
        "required": true
      },
      "_source": {
        "excludes": []
      },
      "date_detection": false,
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "title": {
          "type": "text",
          "store": "true",
          "analyzer": "ik_max_word"
        }
      }
    }
  }
}

2. Add data

PUT my_index1/my_type/1?routing=test
{
  "title": "测试直播圈子"
}

3. Query in Kibana

GET my_index1/_search?routing=test
{
   "query":{
     "query_string": {
       "query": "测"
     }
   }
}

The result is:
{
  "took": 50,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

Using match:

GET my_index1/_search?routing=test
{
   "query":{
     "match": {
       "title": {
         "query": "测",
         "minimum_should_match": "30%"
       }
     }
   }
}

The result is:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

Using match_phrase:

GET my_index1/_search?routing=test
{
   "query":{
     "match_phrase": {
       "title": {
         "query": "测",
         "slop": 10
       }
     }
   }
}
The result is:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

2. With shingles

1. Create the mapping:

PUT my_index6
{
  "mappings": {
    "my_type": {
      "_all": { "enabled": false },
      "_routing": {
        "required": true
      },
      "_source": {
        "excludes": []
      },
      "date_detection": false,
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "title": {
          "type": "text",
          "fields": {
            "shingles": {
              "type": "text",
              "analyzer": "my_shingle_analyzer"
            }
          }
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "my_shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false
        }
      },
      "analyzer": {
        "my_shingle_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": [
            "lowercase",
            "my_shingle_filter"
          ]
        }
      }
    }
  }
}

2. Create the data

PUT my_index6/my_type/2?routing=test
{
  "title": "测试直播圈子"
}

3. Query in Kibana:

GET my_index6/_search?routing=test
{
   "query":{
     "query_string": {
       "query": "测"
     }
   }
}

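This time the query returns the document. Note that the main title field in this mapping declares no analyzer, so it presumably falls back to the standard analyzer, which indexes each Chinese character as its own term; that, rather than the title.shingles sub-field, is most likely what lets the single character 测 match. The result is: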

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index6",
        "_type": "my_type",
        "_id": "2",
        "_score": 0.2876821,
        "_routing": "test",
        "_source": {
          "title": "测试直播圈子"
        }
      }
    ]
  }
}
