openGauss Daily Practice, Day 20 | Full-Text Search

Original post by 田灬禾, 2021-12-21

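Today I studied full-text indexing in openGauss. Honestly, I was somewhat confused at first, and it only started to make sense after I looked through the relevant pages of the official documentation.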

https://opengauss.org/zh/docs/2.1.0/docs/Developerguide/%E5%85%A8%E6%96%87%E6%A3%80%E7%B4%A2%E6%A6%82%E8%BF%B0.html

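Quoting the official documentation: openGauss provides the ~, ~*, LIKE, and ILIKE operators for text data types, but they lack many essential properties required by modern information systems. These shortcomings can be addressed by using indexes and dictionaries.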

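Plain text search lacks the following properties that information systems require:

There is no linguistic support, even for English. Derived words are not easy to recognize, so regular expressions are not sufficient. For example, given satisfies and satisfy, a regular-expression search for satisfy will not find documents that contain satisfies.
There is no ranking (ordering) of search results, which makes searching ineffective when thousands of matching documents are found.
There is no index support, so every search must scan all documents, which makes searching slow overall.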
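
The first limitation can be seen directly. The following is a minimal sketch, assuming the built-in 'english' text search configuration (the same one used in the index exercise later in this post); the sample sentence is made up for illustration.

```sql
-- Regular-expression match: looks for the literal substring 'satisfy'.
-- The text contains only 'satisfies', so this is expected to return f.
SELECT 'The design satisfies the requirement' ~ 'satisfy' AS regex_match;

-- Full-text match: both sides are normalized to lexemes first, so
-- 'satisfies' and 'satisfy' are expected to reduce to the same stem and match.
SELECT to_tsvector('english', 'The design satisfies the requirement')
       @@ to_tsquery('english', 'satisfy') AS fts_match;
```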

The documentation relies on two key data types, tsvector and tsquery: tsvector stores a preprocessed document, and tsquery stores a search condition.

tsvector

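A tsvector value represents one searchable unit, usually a text field in a table row or a combination of such fields. Its value is a sorted list of normalized lexemes, where normalization maps the variant forms of a word to the same lexeme; entries are sorted and deduplicated automatically on input. The to_tsvector function is normally used to parse and normalize a document string.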

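In other words, a tsvector value is a sorted list of distinct lexemes: a sentence is broken into tokens, duplicate tokens are removed automatically, and the remaining ones are stored in a fixed order.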
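
To make the difference between a raw tsvector literal and a normalized document concrete, here is a small sketch that reuses the sentence from the exercises below. It assumes the 'english' configuration; the outputs in the comments are what the PostgreSQL-derived parser normally produces, so it is worth checking them on your own instance.

```sql
-- A plain cast only sorts and deduplicates the words; nothing is stemmed
-- and stop words such as 'a' and 'on' are kept.
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
-- 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'

-- to_tsvector parses the text, drops stop words, normalizes each word,
-- and records the position(s) of every lexeme.
SELECT to_tsvector('english', 'a fat cat sat on a mat and ate a fat rat');
-- 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
```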

tsquery

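A tsquery value represents a search condition. It stores the lexemes to search for and combines them with the Boolean operators & (AND), | (OR), and ! (NOT); parentheses can be used to enforce grouping. The to_tsquery and plainto_tsquery functions normalize words before converting them to the tsquery type.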
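
The two conversion functions differ in the input they accept. The sketch below assumes the 'english' configuration and uses made-up search terms; the exact printed form of the first result may vary between versions, so no output is shown for it.

```sql
-- to_tsquery expects explicit operators between terms and normalizes each term.
SELECT to_tsquery('english', 'Fat & (Rats | Cats) & !Dogs');

-- plainto_tsquery takes unformatted text, drops stop words,
-- and joins the remaining lexemes with &.
SELECT plainto_tsquery('english', 'The Fat Rats');
-- 'fat' & 'rat'
```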

Full-text search still takes some time to learn properly.

Course exercises

1. Perform two basic text matches, one with tsvector @@ tsquery and one with tsquery @@ tsvector

openGauss=# SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector AS RESULT;
 result
----------
 f
(1 row)

openGauss=# SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery AS RESULT;
 result
----------
 t
(1 row)
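
Both matches above use raw casts, so the words are compared exactly as written. As a hedged illustration of why to_tsvector is usually preferred for real documents, the following sketch (assuming the 'english' configuration) runs the same query against a raw cast and against a normalized vector:

```sql
-- A raw cast keeps the words as written: the vector holds 'rats', not 'rat'.
SELECT 'fat cats ate fat rats'::tsvector
       @@ to_tsquery('english', 'fat & rat') AS raw_cast;
-- f

-- to_tsvector stems 'rats' to 'rat', so the same query matches.
SELECT to_tsvector('english', 'fat cats ate fat rats')
       @@ to_tsquery('english', 'fat & rat') AS normalized;
-- t
```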

2. Create a table with at least two columns of type text, and run a full-text search before creating an index

openGauss=# CREATE TABLE t_txt(id int, body text, title text);
CREATE TABLE
openGauss=# INSERT INTO t_txt VALUES(1, 'China, officially the People''s Republic of China(PRC), located in Asia, is the world''s most populous state.', 'China');
INSERT 0 1
openGauss=# INSERT INTO t_txt VALUES(2, 'America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunnell, Dan Peek, and Gerry Beckley.', 'America');
INSERT 0 1
openGauss=# INSERT INTO t_txt VALUES(3, 'England is a country that is part of the United Kingdom. It shares land borders with Scotland to the north and Wales to the west.', 'England');
INSERT 0 1
openGauss=# SELECT id, body, title FROM t_txt WHERE to_tsvector(body) @@ to_tsquery('america');
 id |                                                          body                                                           |  title 
 
----+-------------------------------------------------------------------------------------------------------------------------+--------
-
  2 | America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunnell, Dan Peek, and Gerry Beckley. | America
(1 row)
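
Since the table has two text columns, a search can also cover both at once and order the hits by relevance. This is a sketch rather than part of the course answer: it assumes the 'english' configuration and the ts_rank function, and the search term 'england' is chosen only for illustration.

```sql
-- Concatenate title and body into one document and rank the matches.
-- The sample rows contain no NULLs, so NULL handling is omitted here.
SELECT id, title,
       ts_rank(to_tsvector('english', title || ' ' || body),
               to_tsquery('english', 'england')) AS rank
FROM t_txt
WHERE to_tsvector('english', title || ' ' || body)
      @@ to_tsquery('english', 'england')
ORDER BY rank DESC;
```

With the three rows inserted above, rows 2 and 3 should match, because both mention England.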


3. Create a GIN index

openGauss=# CREATE INDEX t_txt_idx_1 ON t_txt USING gin(to_tsvector('english', body));
CREATE INDEX
openGauss=# \d+ t_txt
                         Table "public.t_txt"
 Column |  Type   | Modifiers | Storage  | Stats target | Description 
--------+---------+-----------+----------+--------------+-------------
 id     | integer |           | plain    |              | 
 body   | text    |           | extended |              | 
 title  | text    |           | extended |              | 
Indexes:
    "t_txt_idx_1" gin (to_tsvector('english'::regconfig, body)) TABLESPACE pg_default
Has OIDs: no
Options: orientation=row, compression=no
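
Note that the index is built on the expression to_tsvector('english', body), while the search in exercise 2 used the one-argument form to_tsvector(body). In PostgreSQL-derived systems an expression index is generally considered only when the query repeats the indexed expression, so a query meant to use this index would look like the sketch below. On a three-row table the planner may still prefer a sequential scan, so the plan is not shown.

```sql
-- The WHERE clause repeats the indexed expression, including the 'english' argument.
EXPLAIN
SELECT id, title
FROM t_txt
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'america');
```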


4. Clean up the data

openGauss=# drop table t_txt;
DROP TABLE

Last modified: 2021-12-21 12:21:22
