暂无图片
暂无图片
3
暂无图片
暂无图片
暂无图片

【PG15】PostgreSQL 中文分词插件 pg_jieba 编译安装

原创 严少安 2023-03-01
3994

PG菜鸟入门学习中,欢迎各位大佬留言技术指导。

BG

昨日群里在讨论 PostgreSQL 的中文分词插件编译过程中异常中断的事情,我是第一次编译pg的插件,随笔记录下。

常见的中文分词插件有:zhparser、pg_jieba,可用于中文分词,全文检索。
但从github仓库看,这两个插件的代码已经有段没更新,未必直接适配pg15,且为了便于学习pg知识,所以采用自编译的方式进行安装。

测试环境

先交代环境,以免产生不必要的信息差,也为问题复现做好基础信息铺垫。

  1. OS 为 CentOS 7.9
  2. 需要安装 gcc 不然编译会报错。gcc 版本为 gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
checking for gcc... no
checking for cc... no
configure: error: in `/root/postgres':
configure: error: no acceptable C compiler found in $PATH
See `config.log' for more details

其实这里也可以看到,gcc的版本是2015年Release的,很老旧了,如有条件建议升级 Kernel/OS version.

  1. cmake需要安装,版本且看下文。

编译安装 PG 15

pg 版本为 v15.2, 直接下载源码包手动编译,需要安装其他依赖,此处暂不赘述,编译结果如下。

  1. 对源码进行配置
./configure --prefix=/opt/pgsql-152 --datadir=/data/pgdata-152 --with-extra-version="-Yan"
...
checking whether gcc -std=gnu99 supports -Wl,--as-needed... yes
configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
configure: using CFLAGS=-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O2
configure: using CPPFLAGS= -D_GNU_SOURCE 
configure: using LDFLAGS=  -Wl,--as-needed
configure: creating ./config.status
config.status: creating GNUmakefile
config.status: creating src/Makefile.global
config.status: creating src/include/pg_config.h
config.status: creating src/include/pg_config_ext.h
config.status: creating src/interfaces/ecpg/include/ecpg_config.h
config.status: linking src/backend/port/tas/dummy.s to src/backend/port/tas.s
config.status: linking src/backend/port/posix_sema.c to src/backend/port/pg_sema.c
config.status: linking src/backend/port/sysv_shmem.c to src/backend/port/pg_shmem.c
config.status: linking src/include/port/linux.h to src/include/pg_config_os.h
config.status: linking src/makefiles/Makefile.linux to src/Makefile.port
  1. 对源码进行编译,这个过程稍微久一点
gmake world
...
gmake -C vacuumlo all
gmake[2]: Entering directory `/root/postgresql-15.2/contrib/vacuumlo'
gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O2 -I../../src/interfaces/libpq -I. -I. -I../../src/include  -D_GNU_SOURCE   -c -o vacuumlo.o vacuumlo.c
gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O2  vacuumlo.o -L../../src/common -lpgcommon -L../../src/port -lpgport -L../../src/interfaces/libpq -lpq -L../../src/port -L../../src/common   -Wl,--as-needed -Wl,-rpath,'/opt/pgsql-152/lib',--enable-new-dtags   -lpgcommon -lpgport -lz -lreadline -lpthread -lrt -ldl -lm  -o vacuumlo
gmake[2]: Leaving directory `/root/postgresql-15.2/contrib/vacuumlo'
gmake[1]: Leaving directory `/root/postgresql-15.2/contrib'

可以看到此时 pg 已经编译安装完成,查看安装路径,会发现生成了12个目录。

# tree /opt/pgsql-152/ -L 2 -d
/opt/pgsql-152/
├── bin
├── include
│   ├── informix
│   ├── internal
│   ├── libpq
│   └── server
├── lib
│   ├── pgxs
│   └── pkgconfig
└── share
    ├── doc
    └── man

12 directories

但是,插件并非安装在此目录,而是安装到了数据(data)目录,即编译时指定的--datadir路径。

# tree /data/pgdata-152/ -L 2 -d
/data/pgdata-152/
└── postgresql
    ├── extension
    ├── timezone
    ├── timezonesets
    └── tsearch_data

5 directories

对pg进行初始化,并启动pg,登录后,可以查看版本号,同时,可以看到编译时,自定义的版本信息 (15.2-Yan)

$ psql psql (15.2-Yan) Type "help" for help. postgres=# select version(); version ------------------------------------------------------------------------------------------------------------- PostgreSQL 15.2-Yan on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit (1 row) postgres=# \l List of databases Name | Owner | Encoding | Collate | Ctype | ICU Locale | Locale Provider | Access privileges -----------+----------+----------+-------------+-------------+------------+-----------------+----------------------- postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | | libc | template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | | libc | =c/postgres + | | | | | | | postgres=CTc/postgres template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | | libc | =c/postgres + | | | | | | | postgres=CTc/postgres (3 rows) postgres=# \conninfo You are connected to database "postgres" as user "postgres" via socket in "/tmp" at port "5432". postgres=#

编译安装 pg_jieba

pg_jieba 的源码路径为 https://github.com/jaiminpan/pg_jieba/
由于 github 网络限制,我已镜像到 gitee: https://gitee.com/shawnyan/pg_jieba

代码的 readme 中显示该插件支持 postgresql 9.x, postgresql 10.x, postgresql 11.x
该插件需要确保 postgresql 已经安装,并可运行 pg_config 命令。

另外,该插件源码依赖另一个项目 – cppjieba
CppJieba是"结巴(Jieba)"中文分词的C++版本
该项目支持utf8编码。支持载自定义用户词典,多路径时支持分隔符’|‘或者’;'分隔。

但依赖:

  • g++ (version >= 4.1 is recommended) or clang++;
  • cmake (version >= 2.6 is recommended);

这里留个伏笔,虽然 readme 中标记为推荐使用 cmake 2.6,但实际编译时这里就有问题。
如下图,编译时报错:

  1. 提示 foreach 不支持
  2. 需要使用 c99 标准进行编译源码。

20230228_230120.png


[root@centos7 build]# cmake .. -DCMAKE_PREFIX_PATH=/opt/pgsql-152/
-- Setting pg_jieba build type - 
-- Found PostgreSQL: /opt/pgsql-152/lib (found version "15.2-Yan") 
-- PostgreSQL include dirs: /opt/pgsql-152/include;/opt/pgsql-152/include/server
-- PostgreSQL library dirs: /opt/pgsql-152/lib
-- PostgreSQL libraries:    pq
-- POSTGRESQL_EXECUTABLE is /opt/pgsql-152/bin/postgres
-- Configuring done
-- Generating done
-- Build files have been written to: /root/postgresql-15.2/contrib/pg_jieba/build
[root@centos7 build]# 

[root@centos7 build]# make
Scanning dependencies of target pg_jieba
[ 50%] Building C object CMakeFiles/pg_jieba.dir/pg_jieba.c.o
/root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:258:1: warning: ‘DefineCustomConfigVariables’ was used with no prototype before its definition [-Wmissing-prototypes]
 DefineCustomConfigVariables()
 ^
In file included from /opt/pgsql-152/include/server/nodes/primnodes.h:22:0,
                 from /opt/pgsql-152/include/server/nodes/parsenodes.h:27,
                 from /opt/pgsql-152/include/server/utils/guc.h:16,
                 from /root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:17:
/root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c: In function ‘extract_dict_list’:
/opt/pgsql-152/include/server/nodes/pg_list.h:353:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (ForEachState cell##__state = {(lst), 0}; \
  ^
/root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:459:2: note: in expansion of macro ‘foreach’
  foreach(lc, elemlist)
  ^
/opt/pgsql-152/include/server/nodes/pg_list.h:353:2: note: use option -std=c99 or -std=gnu99 to compile your code
  for (ForEachState cell##__state = {(lst), 0}; \
  ^
/root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:459:2: note: in expansion of macro ‘foreach’
  foreach(lc, elemlist)
  ^
make[2]: *** [CMakeFiles/pg_jieba.dir/pg_jieba.c.o] Error 1
make[1]: *** [CMakeFiles/pg_jieba.dir/all] Error 2
make: *** [all] Error 2
[root@centos7 build]# 

编译安装 limonp

翻阅源码,发现 pg_jieba 依赖于 deps/limonp,然而 limonp 也有依赖 cppjieba,那么我们就先跳过这步,直接编译 cppjieba,从最里层开始尝试编译。

<-- 此处先跳转到下一节 -->

代码中显示编译至少要使用 cmake3 以上版本,症结找到了,版本很重要,文档也很重要。
https://gitee.com/shawnyan/limonp/blob/master/CMakeLists.txt#L1

cmake_minimum_required(VERSION 3.0)

编译安装 cppjieba

剥“洋葱”是个有趣的过程,继续翻代码,可以发现 limonp 依赖于 cppjieba。
那么按照惯性,cppjieba也不可能一次编译成功了。

[ 16%] Building CXX object test/CMakeFiles/demo.dir/demo.cpp.o
In file included from /root/cppjieba/include/cppjieba/QuerySegment.hpp:7:0,
                 from /root/cppjieba/include/cppjieba/Jieba.hpp:4,
                 from /root/cppjieba/test/demo.cpp:1:
/root/cppjieba/deps/limonp/include/limonp/Logging.hpp:56:5: warning: identifier ‘nullptr’ is a keyword in C++11 [-Wc++0x-compat]
     assert(tm_tmp != nullptr);
     ^
In file included from /usr/include/c++/4.8.2/cassert:43:0,
                 from /root/cppjieba/deps/limonp/include/limonp/Logging.hpp:6,
                 from /root/cppjieba/include/cppjieba/QuerySegment.hpp:7,
                 from /root/cppjieba/include/cppjieba/Jieba.hpp:4,
                 from /root/cppjieba/test/demo.cpp:1:
/root/cppjieba/deps/limonp/include/limonp/Logging.hpp: In constructor ‘limonp::Logger::Logger(size_t, const char*, int)’:
/root/cppjieba/deps/limonp/include/limonp/Logging.hpp:56:22: error: ‘nullptr’ was not declared in this scope
     assert(tm_tmp != nullptr);
                      ^

果然,那么问题到底出在哪里了呢。
root cause 是在 cmake 的版本。

果然,在安装 cmake3 之后,重新编译便可以通过。

  • output:
[root@centos7 build]# cmake3 . -DCMAKE_CXX_STANDARD=11 .. 
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 4.8.5
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc - works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /root/cppjieba/build
[root@centos7 build]# make
Scanning dependencies of target gtest
[  6%] Building CXX object deps/gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
[ 12%] Linking CXX static library libgtest.a
[ 12%] Built target gtest
Scanning dependencies of target load_test
[ 18%] Building CXX object test/CMakeFiles/load_test.dir/load_test.cpp.o
[ 25%] Linking CXX executable ../load_test
[ 25%] Built target load_test
Scanning dependencies of target demo
[ 31%] Building CXX object test/CMakeFiles/demo.dir/demo.cpp.o
[ 37%] Linking CXX executable ../demo
[ 37%] Built target demo
Scanning dependencies of target test.run
[ 43%] Building CXX object test/unittest/CMakeFiles/test.run.dir/gtest_main.cpp.o
[ 50%] Building CXX object test/unittest/CMakeFiles/test.run.dir/keyword_extractor_test.cpp.o
[ 56%] Building CXX object test/unittest/CMakeFiles/test.run.dir/trie_test.cpp.o
[ 62%] Building CXX object test/unittest/CMakeFiles/test.run.dir/segments_test.cpp.o
[ 68%] Building CXX object test/unittest/CMakeFiles/test.run.dir/pos_tagger_test.cpp.o
[ 75%] Building CXX object test/unittest/CMakeFiles/test.run.dir/jieba_test.cpp.o
[ 81%] Building CXX object test/unittest/CMakeFiles/test.run.dir/pre_filter_test.cpp.o
[ 87%] Building CXX object test/unittest/CMakeFiles/test.run.dir/unicode_test.cpp.o
[ 93%] Building CXX object test/unittest/CMakeFiles/test.run.dir/textrank_test.cpp.o
[100%] Linking CXX executable ../test.run
[100%] Built target test.run
[root@centos7 build]# ls
CMakeCache.txt  CMakeFiles  cmake_install.cmake  CTestTestfile.cmake  demo  deps  load_test  Makefile  test
[root@centos7 build]# ./demo 
他来到了网易杭研大厦
[demo] Cut With HMM
他/来到/了/网易/杭研/大厦
[demo] Cut Without HMM 
他/来到/了/网易/杭/研/大厦
我来到北京清华大学
[demo] CutAll
我/来到/北京/清华/清华大学/华大/大学
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
[demo] CutForSearch
小明/硕士/毕业/于/中国/科学/学院/科学院/中国科学院/计算/计算所/,/后/在/日本/京都/大学/日本京都大学/深造
[demo] Insert User Word
男默/女泪
男默女泪
[demo] CutForSearch Word With Offset
[{"word": "小明", "offset": 0}, {"word": "硕士", "offset": 6}, {"word": "毕业", "offset": 12}, {"word": "于", "offset": 18}, {"word": "中国", "offset": 21}, {"word": "科学", "offset": 27}, {"word": "学院", "offset": 30}, {"word": "科学院", "offset": 27}, {"word": "中国科学院", "offset": 21}, {"word": "计算", "offset": 36}, {"word": "计算所", "offset": 36}, {"word": ",", "offset": 45}, {"word": "后", "offset": 48}, {"word": "在", "offset": 51}, {"word": "日本", "offset": 54}, {"word": "京都", "offset": 60}, {"word": "大学", "offset": 66}, {"word": "日本京都大学", "offset": 54}, {"word": "深造", "offset": 72}]
[demo] Lookup Tag for Single Token
[拖拉机:n, CEO:eng, 123:m, 。:x]
[demo] Tagging
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
[我:r, 是:v, 拖拉机:n, 学院:n, 手扶拖拉机:n, 专业:n, 的:uj, 。:x, 不用:v, 多久:m, ,:x, 我:r, 就:d, 会:v, 升职:v, 加薪:nr, ,:x, 当上:t, CEO:eng, ,:x, 走上:v, 人生:n, 巅峰:n, 。:x]
[demo] Keyword Extraction
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
[{"word": "CEO", "offset": [93], "weight": 11.7392}, {"word": "升职", "offset": [72], "weight": 10.8562}, {"word": "加薪", "offset": [78], "weight": 10.6426}, {"word": "手扶拖拉机", "offset": [21], "weight": 10.0089}, {"word": "巅峰", "offset": [111], "weight": 9.49396}]
[root@centos7 build]# 

完整编译过程

下载源码:

cd /root/postgresql-15.2/contrib
git clone https://gitee.com/shawnyan/pg_jieba --depth=1
git clone https://gitee.com/shawnyan/cppjieba --depth=1 pg_jieba/libjieba
git clone https://gitee.com/shawnyan/limonp --depth=1 pg_jieba/libjieba/deps/limonp

编译:

cd pg_jieba
vi CMakeLists.txt
set(CMAKE_C_STANDARD 99)

mkdir build
cd build

cmake3 -DCMAKE_PREFIX_PATH=/opt/pgsql-152/ ..

make
  • output:
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 4.8.5
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc - works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Setting pg_jieba build type - 
-- Found PostgreSQL: /opt/pgsql-152/lib/libpq.so (found version "15.2") 
-- POSTGRESQL_EXECUTABLE is /opt/pgsql-152/bin/postgres
-- Configuring done
-- Generating done
-- Build files have been written to: /root/postgresql-15.2/contrib/pg_jieba/build

[root@centos7 build]# make
Scanning dependencies of target pg_jieba
[ 33%] Building C object CMakeFiles/pg_jieba.dir/pg_jieba.c.o
/root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:258:1: warning: ‘DefineCustomConfigVariables’ was used with no prototype before its definition [-Wmissing-prototypes]
 DefineCustomConfigVariables()
 ^
[ 66%] Building CXX object CMakeFiles/pg_jieba.dir/jieba.cpp.o
[100%] Linking CXX shared library pg_jieba.so
[100%] Built target pg_jieba
[root@centos7 build]# ll
total 684
-rw-r--r-- 1 root root  15772 Mar  1 00:28 CMakeCache.txt
drwxr-xr-x 5 root root    283 Mar  1 00:28 CMakeFiles
-rw-r--r-- 1 root root   7779 Mar  1 00:28 cmake_install.cmake
-rw-r--r-- 1 root root   7961 Mar  1 00:28 Makefile
-rwxr-xr-x 1 root root 666720 Mar  1 00:28 pg_jieba.so  --------------  
[root@centos7 build]# cp pg_jieba.so /opt/pgsql-152/lib

安装 pg_jieba 插件

启动 pg 服务之前,先在配置文件中添加预加载。

vi postgresql.conf
shared_preload_libraries = 'pg_jieba'

其次,需要复制辅助文件。

cp pg_jieba/libjieba/dict/jieba.dict.utf8 /data/pgdata-152/postgresql/tsearch_data/jieba_base.dict
cp pg_jieba/libjieba/dict/user.dict.utf8 /data/pgdata-152/postgresql/tsearch_data/jieba_user.dict
cp pg_jieba/libjieba/dict/hmm_model.utf8 /data/pgdata-152/postgresql/tsearch_data/jieba_hmm.model
cp pg_jieba/libjieba/dict/stop_words.utf8 /data/pgdata-152/postgresql/tsearch_data/jieba.stop
cp pg_jieba/pg_jieba.control /data/pgdata-152/postgresql/extension/pg_jieba.control
cp pg_jieba/pg_jieba.sql /data/pgdata-152/postgresql/extension/pg_jieba--1.1.1.sql

启动 pg 服务后,便可正常安装使用 pg_jieba 插件。

[postgres@centos7 ~]$ pg_ctl start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start.....2023-03-01 00:44:13.044 CST [49163] LOG:  starting PostgreSQL 15.2-Yan on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
2023-03-01 00:44:13.045 CST [49163] LOG:  listening on IPv6 address "::1", port 5432
2023-03-01 00:44:13.045 CST [49163] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2023-03-01 00:44:13.048 CST [49163] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-03-01 00:44:13.058 CST [49166] LOG:  database system was shut down at 2023-02-28 22:40:27 CST
2023-03-01 00:44:13.063 CST [49163] LOG:  database system is ready to accept connections
 done
server started
[postgres@centos7 ~]$ 
[postgres@centos7 ~]$ psql
psql (15.2-Yan)
Type "help" for help.

postgres=# CREATE EXTENSION pg_jieba;
CREATE EXTENSION
postgres=# 
postgres=# \dx
                        List of installed extensions
   Name   | Version |   Schema   |               Description                
----------+---------+------------+------------------------------------------
 pg_jieba | 1.1.1   | public     | a parser for full-text search of Chinese
 plpgsql  | 1.0     | pg_catalog | PL/pgSQL procedural language
(2 rows)

最后,实测 pg_jieba 插件,效果如下:

postgres=# SELECT * FROM to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所,后在PGCCC学习PostgreSQL');
                                        to_tsvector                                        
-------------------------------------------------------------------------------------------
 'pgccc':10 'postgresql':12 '中国科学院':5 '学习':11 '小明':1 '毕业':3 '硕士':2 '计算所':6
(1 row)

postgres=# SELECT * FROM to_tsvector('jiebacfg', '小丽是个潜水王,也是数据库方面的专家');
                       to_tsvector                       
---------------------------------------------------------
 '专家':12 '小丽':1 '数据库':9 '方面':10 '潜水':4 '王':5
(1 row)

总结

简要总结,源码编译时,编译环境、依赖组件、所需版本都是要注意的。
coding 还是要写注释的,至少日志输出不能少,不然遇到错误不好调查。
尤其是数据库系统,万一遇到问题如何定位问题,如何快速且准确地解决,始终是 DBA 们的必修课。

最后修改时间:2023-03-02 22:58:01
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

文章被以下合辑收录

评论