PG菜鸟入门学习中,欢迎各位大佬留言技术指导。
BG
昨日群里在讨论 PostgreSQL 的中文分词插件编译过程中异常中断的事情,我是第一次编译pg的插件,随笔记录下。
常见的中文分词插件有:zhparser、pg_jieba,可用于中文分词,全文检索。
但从github仓库看,这两个插件的代码已经有段没更新,未必直接适配pg15,且为了便于学习pg知识,所以采用自编译的方式进行安装。
测试环境
先交代环境,以免产生不必要的信息差,也为问题复现做好基础信息铺垫。
- OS 为 CentOS 7.9
- 需要安装 gcc 不然编译会报错。gcc 版本为 gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
checking for gcc... no checking for cc... no configure: error: in `/root/postgres': configure: error: no acceptable C compiler found in $PATH See `config.log' for more details
复制
其实这里也可以看到,gcc的版本是2015年Release的,很老旧了,如有条件建议升级 Kernel/OS version.
- cmake需要安装,版本且看下文。
编译安装 PG 15
pg 版本为 v15.2, 直接下载源码包手动编译,需要安装其他依赖,此处暂不赘述,编译结果如下。
- 对源码进行配置
./configure --prefix=/opt/pgsql-152 --datadir=/data/pgdata-152 --with-extra-version="-Yan" ... checking whether gcc -std=gnu99 supports -Wl,--as-needed... yes configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44) configure: using CFLAGS=-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O2 configure: using CPPFLAGS= -D_GNU_SOURCE configure: using LDFLAGS= -Wl,--as-needed configure: creating ./config.status config.status: creating GNUmakefile config.status: creating src/Makefile.global config.status: creating src/include/pg_config.h config.status: creating src/include/pg_config_ext.h config.status: creating src/interfaces/ecpg/include/ecpg_config.h config.status: linking src/backend/port/tas/dummy.s to src/backend/port/tas.s config.status: linking src/backend/port/posix_sema.c to src/backend/port/pg_sema.c config.status: linking src/backend/port/sysv_shmem.c to src/backend/port/pg_shmem.c config.status: linking src/include/port/linux.h to src/include/pg_config_os.h config.status: linking src/makefiles/Makefile.linux to src/Makefile.port
复制
- 对源码进行编译,这个过程稍微久一点
gmake world ... gmake -C vacuumlo all gmake[2]: Entering directory `/root/postgresql-15.2/contrib/vacuumlo' gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O2 -I../../src/interfaces/libpq -I. -I. -I../../src/include -D_GNU_SOURCE -c -o vacuumlo.o vacuumlo.c gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O2 vacuumlo.o -L../../src/common -lpgcommon -L../../src/port -lpgport -L../../src/interfaces/libpq -lpq -L../../src/port -L../../src/common -Wl,--as-needed -Wl,-rpath,'/opt/pgsql-152/lib',--enable-new-dtags -lpgcommon -lpgport -lz -lreadline -lpthread -lrt -ldl -lm -o vacuumlo gmake[2]: Leaving directory `/root/postgresql-15.2/contrib/vacuumlo' gmake[1]: Leaving directory `/root/postgresql-15.2/contrib'
复制
可以看到此时 pg 已经编译安装完成,查看安装路径,会发现生成了12个目录。
# tree /opt/pgsql-152/ -L 2 -d /opt/pgsql-152/ ├── bin ├── include │ ├── informix │ ├── internal │ ├── libpq │ └── server ├── lib │ ├── pgxs │ └── pkgconfig └── share ├── doc └── man 12 directories
复制
但是,插件并非安装在此目录,而是安装到了数据(data)目录,即编译时指定的--datadir
路径。
# tree /data/pgdata-152/ -L 2 -d /data/pgdata-152/ └── postgresql ├── extension ├── timezone ├── timezonesets └── tsearch_data 5 directories
复制
对pg进行初始化,并启动pg,登录后,可以查看版本号,同时,可以看到编译时,自定义的版本信息 (15.2-Yan)
。
$ psql
psql (15.2-Yan)
Type "help" for help.
postgres=# select version();
version
-------------------------------------------------------------------------------------------------------------
PostgreSQL 15.2-Yan on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
(1 row)
postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | ICU Locale | Locale Provider | Access privileges
-----------+----------+----------+-------------+-------------+------------+-----------------+-----------------------
postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | | libc |
template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | | libc | =c/postgres +
| | | | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | | libc | =c/postgres +
| | | | | | | postgres=CTc/postgres
(3 rows)
postgres=# \conninfo
You are connected to database "postgres" as user "postgres" via socket in "/tmp" at port "5432".
postgres=#
复制
编译安装 pg_jieba
pg_jieba 的源码路径为 https://github.com/jaiminpan/pg_jieba/
由于 github 网络限制,我已镜像到 gitee: https://gitee.com/shawnyan/pg_jieba
代码的 readme 中显示该插件支持 postgresql 9.x, postgresql 10.x, postgresql 11.x
该插件需要确保 postgresql 已经安装,并可运行 pg_config
命令。
另外,该插件源码依赖另一个项目 – cppjieba
CppJieba是"结巴(Jieba)"中文分词的C++版本
该项目支持utf8编码。支持载自定义用户词典,多路径时支持分隔符’|‘或者’;'分隔。
但依赖:
- g++ (version >= 4.1 is recommended) or clang++;
- cmake (version >= 2.6 is recommended);
这里留个伏笔,虽然 readme 中标记为推荐使用 cmake 2.6,但实际编译时这里就有问题。
如下图,编译时报错:
- 提示 foreach 不支持
- 需要使用 c99 标准进行编译源码。
[root@centos7 build]# cmake .. -DCMAKE_PREFIX_PATH=/opt/pgsql-152/ -- Setting pg_jieba build type - -- Found PostgreSQL: /opt/pgsql-152/lib (found version "15.2-Yan") -- PostgreSQL include dirs: /opt/pgsql-152/include;/opt/pgsql-152/include/server -- PostgreSQL library dirs: /opt/pgsql-152/lib -- PostgreSQL libraries: pq -- POSTGRESQL_EXECUTABLE is /opt/pgsql-152/bin/postgres -- Configuring done -- Generating done -- Build files have been written to: /root/postgresql-15.2/contrib/pg_jieba/build [root@centos7 build]# [root@centos7 build]# make Scanning dependencies of target pg_jieba [ 50%] Building C object CMakeFiles/pg_jieba.dir/pg_jieba.c.o /root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:258:1: warning: ‘DefineCustomConfigVariables’ was used with no prototype before its definition [-Wmissing-prototypes] DefineCustomConfigVariables() ^ In file included from /opt/pgsql-152/include/server/nodes/primnodes.h:22:0, from /opt/pgsql-152/include/server/nodes/parsenodes.h:27, from /opt/pgsql-152/include/server/utils/guc.h:16, from /root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:17: /root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c: In function ‘extract_dict_list’: /opt/pgsql-152/include/server/nodes/pg_list.h:353:2: error: ‘for’ loop initial declarations are only allowed in C99 mode for (ForEachState cell##__state = {(lst), 0}; \ ^ /root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:459:2: note: in expansion of macro ‘foreach’ foreach(lc, elemlist) ^ /opt/pgsql-152/include/server/nodes/pg_list.h:353:2: note: use option -std=c99 or -std=gnu99 to compile your code for (ForEachState cell##__state = {(lst), 0}; \ ^ /root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:459:2: note: in expansion of macro ‘foreach’ foreach(lc, elemlist) ^ make[2]: *** [CMakeFiles/pg_jieba.dir/pg_jieba.c.o] Error 1 make[1]: *** [CMakeFiles/pg_jieba.dir/all] Error 2 make: *** [all] Error 2 [root@centos7 build]#
复制
编译安装 limonp
翻阅源码,发现 pg_jieba 依赖于 deps/limonp,然而 limonp 也有依赖 cppjieba,那么我们就先跳过这步,直接编译 cppjieba,从最里层开始尝试编译。
<-- 此处先跳转到下一节 -->
代码中显示编译至少要使用 cmake3 以上版本,症结找到了,版本很重要,文档也很重要。
https://gitee.com/shawnyan/limonp/blob/master/CMakeLists.txt#L1
cmake_minimum_required(VERSION 3.0)
复制
编译安装 cppjieba
剥“洋葱”是个有趣的过程,继续翻代码,可以发现 limonp 依赖于 cppjieba。
那么按照惯性,cppjieba也不可能一次编译成功了。
[ 16%] Building CXX object test/CMakeFiles/demo.dir/demo.cpp.o In file included from /root/cppjieba/include/cppjieba/QuerySegment.hpp:7:0, from /root/cppjieba/include/cppjieba/Jieba.hpp:4, from /root/cppjieba/test/demo.cpp:1: /root/cppjieba/deps/limonp/include/limonp/Logging.hpp:56:5: warning: identifier ‘nullptr’ is a keyword in C++11 [-Wc++0x-compat] assert(tm_tmp != nullptr); ^ In file included from /usr/include/c++/4.8.2/cassert:43:0, from /root/cppjieba/deps/limonp/include/limonp/Logging.hpp:6, from /root/cppjieba/include/cppjieba/QuerySegment.hpp:7, from /root/cppjieba/include/cppjieba/Jieba.hpp:4, from /root/cppjieba/test/demo.cpp:1: /root/cppjieba/deps/limonp/include/limonp/Logging.hpp: In constructor ‘limonp::Logger::Logger(size_t, const char*, int)’: /root/cppjieba/deps/limonp/include/limonp/Logging.hpp:56:22: error: ‘nullptr’ was not declared in this scope assert(tm_tmp != nullptr); ^
复制
果然,那么问题到底出在哪里了呢。
root cause 是在 cmake 的版本。
果然,在安装 cmake3 之后,重新编译便可以通过。
- output:
[root@centos7 build]# cmake3 . -DCMAKE_CXX_STANDARD=11 .. -- The C compiler identification is GNU 4.8.5 -- The CXX compiler identification is GNU 4.8.5 -- Check for working C compiler: /usr/bin/cc -- Check for working C compiler: /usr/bin/cc - works -- Detecting C compiler ABI info -- Detecting C compiler ABI info - done -- Detecting C compile features -- Detecting C compile features - done -- Check for working CXX compiler: /usr/bin/c++ -- Check for working CXX compiler: /usr/bin/c++ - works -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Detecting CXX compile features -- Detecting CXX compile features - done -- Configuring done -- Generating done -- Build files have been written to: /root/cppjieba/build [root@centos7 build]# make Scanning dependencies of target gtest [ 6%] Building CXX object deps/gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o [ 12%] Linking CXX static library libgtest.a [ 12%] Built target gtest Scanning dependencies of target load_test [ 18%] Building CXX object test/CMakeFiles/load_test.dir/load_test.cpp.o [ 25%] Linking CXX executable ../load_test [ 25%] Built target load_test Scanning dependencies of target demo [ 31%] Building CXX object test/CMakeFiles/demo.dir/demo.cpp.o [ 37%] Linking CXX executable ../demo [ 37%] Built target demo Scanning dependencies of target test.run [ 43%] Building CXX object test/unittest/CMakeFiles/test.run.dir/gtest_main.cpp.o [ 50%] Building CXX object test/unittest/CMakeFiles/test.run.dir/keyword_extractor_test.cpp.o [ 56%] Building CXX object test/unittest/CMakeFiles/test.run.dir/trie_test.cpp.o [ 62%] Building CXX object test/unittest/CMakeFiles/test.run.dir/segments_test.cpp.o [ 68%] Building CXX object test/unittest/CMakeFiles/test.run.dir/pos_tagger_test.cpp.o [ 75%] Building CXX object test/unittest/CMakeFiles/test.run.dir/jieba_test.cpp.o [ 81%] Building CXX object test/unittest/CMakeFiles/test.run.dir/pre_filter_test.cpp.o [ 87%] Building CXX object test/unittest/CMakeFiles/test.run.dir/unicode_test.cpp.o [ 93%] Building CXX object test/unittest/CMakeFiles/test.run.dir/textrank_test.cpp.o [100%] Linking CXX executable ../test.run [100%] Built target test.run [root@centos7 build]# ls CMakeCache.txt CMakeFiles cmake_install.cmake CTestTestfile.cmake demo deps load_test Makefile test [root@centos7 build]# ./demo 他来到了网易杭研大厦 [demo] Cut With HMM 他/来到/了/网易/杭研/大厦 [demo] Cut Without HMM 他/来到/了/网易/杭/研/大厦 我来到北京清华大学 [demo] CutAll 我/来到/北京/清华/清华大学/华大/大学 小明硕士毕业于中国科学院计算所,后在日本京都大学深造 [demo] CutForSearch 小明/硕士/毕业/于/中国/科学/学院/科学院/中国科学院/计算/计算所/,/后/在/日本/京都/大学/日本京都大学/深造 [demo] Insert User Word 男默/女泪 男默女泪 [demo] CutForSearch Word With Offset [{"word": "小明", "offset": 0}, {"word": "硕士", "offset": 6}, {"word": "毕业", "offset": 12}, {"word": "于", "offset": 18}, {"word": "中国", "offset": 21}, {"word": "科学", "offset": 27}, {"word": "学院", "offset": 30}, {"word": "科学院", "offset": 27}, {"word": "中国科学院", "offset": 21}, {"word": "计算", "offset": 36}, {"word": "计算所", "offset": 36}, {"word": ",", "offset": 45}, {"word": "后", "offset": 48}, {"word": "在", "offset": 51}, {"word": "日本", "offset": 54}, {"word": "京都", "offset": 60}, {"word": "大学", "offset": 66}, {"word": "日本京都大学", "offset": 54}, {"word": "深造", "offset": 72}] [demo] Lookup Tag for Single Token [拖拉机:n, CEO:eng, 123:m, 。:x] [demo] Tagging 我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。 [我:r, 是:v, 拖拉机:n, 学院:n, 手扶拖拉机:n, 专业:n, 的:uj, 。:x, 不用:v, 多久:m, ,:x, 我:r, 就:d, 会:v, 升职:v, 加薪:nr, ,:x, 当上:t, CEO:eng, ,:x, 走上:v, 人生:n, 巅峰:n, 。:x] [demo] Keyword Extraction 我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。 [{"word": "CEO", "offset": [93], "weight": 11.7392}, {"word": "升职", "offset": [72], "weight": 10.8562}, {"word": "加薪", "offset": [78], "weight": 10.6426}, {"word": "手扶拖拉机", "offset": [21], "weight": 10.0089}, {"word": "巅峰", "offset": [111], "weight": 9.49396}] [root@centos7 build]#
复制
完整编译过程
下载源码:
cd /root/postgresql-15.2/contrib git clone https://gitee.com/shawnyan/pg_jieba --depth=1 git clone https://gitee.com/shawnyan/cppjieba --depth=1 pg_jieba/libjieba git clone https://gitee.com/shawnyan/limonp --depth=1 pg_jieba/libjieba/deps/limonp
复制
编译:
cd pg_jieba vi CMakeLists.txt set(CMAKE_C_STANDARD 99) mkdir build cd build cmake3 -DCMAKE_PREFIX_PATH=/opt/pgsql-152/ .. make
复制
- output:
-- The C compiler identification is GNU 4.8.5 -- The CXX compiler identification is GNU 4.8.5 -- Check for working C compiler: /usr/bin/cc -- Check for working C compiler: /usr/bin/cc - works -- Detecting C compiler ABI info -- Detecting C compiler ABI info - done -- Detecting C compile features -- Detecting C compile features - done -- Check for working CXX compiler: /usr/bin/c++ -- Check for working CXX compiler: /usr/bin/c++ - works -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Detecting CXX compile features -- Detecting CXX compile features - done -- Setting pg_jieba build type - -- Found PostgreSQL: /opt/pgsql-152/lib/libpq.so (found version "15.2") -- POSTGRESQL_EXECUTABLE is /opt/pgsql-152/bin/postgres -- Configuring done -- Generating done -- Build files have been written to: /root/postgresql-15.2/contrib/pg_jieba/build [root@centos7 build]# make Scanning dependencies of target pg_jieba [ 33%] Building C object CMakeFiles/pg_jieba.dir/pg_jieba.c.o /root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:258:1: warning: ‘DefineCustomConfigVariables’ was used with no prototype before its definition [-Wmissing-prototypes] DefineCustomConfigVariables() ^ [ 66%] Building CXX object CMakeFiles/pg_jieba.dir/jieba.cpp.o [100%] Linking CXX shared library pg_jieba.so [100%] Built target pg_jieba [root@centos7 build]# ll total 684 -rw-r--r-- 1 root root 15772 Mar 1 00:28 CMakeCache.txt drwxr-xr-x 5 root root 283 Mar 1 00:28 CMakeFiles -rw-r--r-- 1 root root 7779 Mar 1 00:28 cmake_install.cmake -rw-r--r-- 1 root root 7961 Mar 1 00:28 Makefile -rwxr-xr-x 1 root root 666720 Mar 1 00:28 pg_jieba.so -------------- [root@centos7 build]# cp pg_jieba.so /opt/pgsql-152/lib
复制
安装 pg_jieba 插件
启动 pg 服务之前,先在配置文件中添加预加载。
vi postgresql.conf shared_preload_libraries = 'pg_jieba'
复制
其次,需要复制辅助文件。
cp pg_jieba/libjieba/dict/jieba.dict.utf8 /data/pgdata-152/postgresql/tsearch_data/jieba_base.dict cp pg_jieba/libjieba/dict/user.dict.utf8 /data/pgdata-152/postgresql/tsearch_data/jieba_user.dict cp pg_jieba/libjieba/dict/hmm_model.utf8 /data/pgdata-152/postgresql/tsearch_data/jieba_hmm.model cp pg_jieba/libjieba/dict/stop_words.utf8 /data/pgdata-152/postgresql/tsearch_data/jieba.stop cp pg_jieba/pg_jieba.control /data/pgdata-152/postgresql/extension/pg_jieba.control cp pg_jieba/pg_jieba.sql /data/pgdata-152/postgresql/extension/pg_jieba--1.1.1.sql
复制
启动 pg 服务后,便可正常安装使用 pg_jieba 插件。
[postgres@centos7 ~]$ pg_ctl start pg_ctl: another server might be running; trying to start server anyway waiting for server to start.....2023-03-01 00:44:13.044 CST [49163] LOG: starting PostgreSQL 15.2-Yan on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit 2023-03-01 00:44:13.045 CST [49163] LOG: listening on IPv6 address "::1", port 5432 2023-03-01 00:44:13.045 CST [49163] LOG: listening on IPv4 address "127.0.0.1", port 5432 2023-03-01 00:44:13.048 CST [49163] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432" 2023-03-01 00:44:13.058 CST [49166] LOG: database system was shut down at 2023-02-28 22:40:27 CST 2023-03-01 00:44:13.063 CST [49163] LOG: database system is ready to accept connections done server started [postgres@centos7 ~]$ [postgres@centos7 ~]$ psql psql (15.2-Yan) Type "help" for help. postgres=# CREATE EXTENSION pg_jieba; CREATE EXTENSION postgres=# postgres=# \dx List of installed extensions Name | Version | Schema | Description ----------+---------+------------+------------------------------------------ pg_jieba | 1.1.1 | public | a parser for full-text search of Chinese plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language (2 rows)
复制
最后,实测 pg_jieba 插件,效果如下:
postgres=# SELECT * FROM to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所,后在PGCCC学习PostgreSQL'); to_tsvector ------------------------------------------------------------------------------------------- 'pgccc':10 'postgresql':12 '中国科学院':5 '学习':11 '小明':1 '毕业':3 '硕士':2 '计算所':6 (1 row) postgres=# SELECT * FROM to_tsvector('jiebacfg', '小丽是个潜水王,也是数据库方面的专家'); to_tsvector --------------------------------------------------------- '专家':12 '小丽':1 '数据库':9 '方面':10 '潜水':4 '王':5 (1 row)
复制
总结
简要总结,源码编译时,编译环境、依赖组件、所需版本都是要注意的。
coding 还是要写注释的,至少日志输出不能少,不然遇到错误不好调查。
尤其是数据库系统,万一遇到问题如何定位问题,如何快速且准确地解决,始终是 DBA 们的必修课。