PG菜鸟入门学习中,欢迎各位大佬留言技术指导。
BG
昨日群里在讨论 PostgreSQL 的中文分词插件编译过程中异常中断的事情,我是第一次编译pg的插件,随笔记录下。
常见的中文分词插件有:zhparser、pg_jieba,可用于中文分词,全文检索。
但从github仓库看,这两个插件的代码已经有段没更新,未必直接适配pg15,且为了便于学习pg知识,所以采用自编译的方式进行安装。
测试环境
先交代环境,以免产生不必要的信息差,也为问题复现做好基础信息铺垫。
- OS 为 CentOS 7.9
- 需要安装 gcc 不然编译会报错。gcc 版本为 gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
checking for gcc... no
checking for cc... no
configure: error: in `/root/postgres':
configure: error: no acceptable C compiler found in $PATH
See `config.log' for more details
其实这里也可以看到,gcc的版本是2015年Release的,很老旧了,如有条件建议升级 Kernel/OS version.
- cmake需要安装,版本且看下文。
编译安装 PG 15
pg 版本为 v15.2, 直接下载源码包手动编译,需要安装其他依赖,此处暂不赘述,编译结果如下。
- 对源码进行配置
./configure --prefix=/opt/pgsql-152 --datadir=/data/pgdata-152 --with-extra-version="-Yan"
...
checking whether gcc -std=gnu99 supports -Wl,--as-needed... yes
configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
configure: using CFLAGS=-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O2
configure: using CPPFLAGS= -D_GNU_SOURCE
configure: using LDFLAGS= -Wl,--as-needed
configure: creating ./config.status
config.status: creating GNUmakefile
config.status: creating src/Makefile.global
config.status: creating src/include/pg_config.h
config.status: creating src/include/pg_config_ext.h
config.status: creating src/interfaces/ecpg/include/ecpg_config.h
config.status: linking src/backend/port/tas/dummy.s to src/backend/port/tas.s
config.status: linking src/backend/port/posix_sema.c to src/backend/port/pg_sema.c
config.status: linking src/backend/port/sysv_shmem.c to src/backend/port/pg_shmem.c
config.status: linking src/include/port/linux.h to src/include/pg_config_os.h
config.status: linking src/makefiles/Makefile.linux to src/Makefile.port
- 对源码进行编译,这个过程稍微久一点
gmake world
...
gmake -C vacuumlo all
gmake[2]: Entering directory `/root/postgresql-15.2/contrib/vacuumlo'
gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O2 -I../../src/interfaces/libpq -I. -I. -I../../src/include -D_GNU_SOURCE -c -o vacuumlo.o vacuumlo.c
gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O2 vacuumlo.o -L../../src/common -lpgcommon -L../../src/port -lpgport -L../../src/interfaces/libpq -lpq -L../../src/port -L../../src/common -Wl,--as-needed -Wl,-rpath,'/opt/pgsql-152/lib',--enable-new-dtags -lpgcommon -lpgport -lz -lreadline -lpthread -lrt -ldl -lm -o vacuumlo
gmake[2]: Leaving directory `/root/postgresql-15.2/contrib/vacuumlo'
gmake[1]: Leaving directory `/root/postgresql-15.2/contrib'
可以看到此时 pg 已经编译安装完成,查看安装路径,会发现生成了12个目录。
# tree /opt/pgsql-152/ -L 2 -d
/opt/pgsql-152/
├── bin
├── include
│ ├── informix
│ ├── internal
│ ├── libpq
│ └── server
├── lib
│ ├── pgxs
│ └── pkgconfig
└── share
├── doc
└── man
12 directories
但是,插件并非安装在此目录,而是安装到了数据(data)目录,即编译时指定的--datadir
路径。
# tree /data/pgdata-152/ -L 2 -d
/data/pgdata-152/
└── postgresql
├── extension
├── timezone
├── timezonesets
└── tsearch_data
5 directories
对pg进行初始化,并启动pg,登录后,可以查看版本号,同时,可以看到编译时,自定义的版本信息 (15.2-Yan)
。
$ psql
psql (15.2-Yan)
Type "help" for help.
postgres=# select version();
version
-------------------------------------------------------------------------------------------------------------
PostgreSQL 15.2-Yan on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
(1 row)
postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | ICU Locale | Locale Provider | Access privileges
-----------+----------+----------+-------------+-------------+------------+-----------------+-----------------------
postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | | libc |
template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | | libc | =c/postgres +
| | | | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | | libc | =c/postgres +
| | | | | | | postgres=CTc/postgres
(3 rows)
postgres=# \conninfo
You are connected to database "postgres" as user "postgres" via socket in "/tmp" at port "5432".
postgres=#
编译安装 pg_jieba
pg_jieba 的源码路径为 https://github.com/jaiminpan/pg_jieba/
由于 github 网络限制,我已镜像到 gitee: https://gitee.com/shawnyan/pg_jieba
代码的 readme 中显示该插件支持 postgresql 9.x, postgresql 10.x, postgresql 11.x
该插件需要确保 postgresql 已经安装,并可运行 pg_config
命令。
另外,该插件源码依赖另一个项目 – cppjieba
CppJieba是"结巴(Jieba)"中文分词的C++版本
该项目支持utf8编码。支持载自定义用户词典,多路径时支持分隔符’|‘或者’;'分隔。
但依赖:
- g++ (version >= 4.1 is recommended) or clang++;
- cmake (version >= 2.6 is recommended);
这里留个伏笔,虽然 readme 中标记为推荐使用 cmake 2.6,但实际编译时这里就有问题。
如下图,编译时报错:
- 提示 foreach 不支持
- 需要使用 c99 标准进行编译源码。
[root@centos7 build]# cmake .. -DCMAKE_PREFIX_PATH=/opt/pgsql-152/
-- Setting pg_jieba build type -
-- Found PostgreSQL: /opt/pgsql-152/lib (found version "15.2-Yan")
-- PostgreSQL include dirs: /opt/pgsql-152/include;/opt/pgsql-152/include/server
-- PostgreSQL library dirs: /opt/pgsql-152/lib
-- PostgreSQL libraries: pq
-- POSTGRESQL_EXECUTABLE is /opt/pgsql-152/bin/postgres
-- Configuring done
-- Generating done
-- Build files have been written to: /root/postgresql-15.2/contrib/pg_jieba/build
[root@centos7 build]#
[root@centos7 build]# make
Scanning dependencies of target pg_jieba
[ 50%] Building C object CMakeFiles/pg_jieba.dir/pg_jieba.c.o
/root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:258:1: warning: ‘DefineCustomConfigVariables’ was used with no prototype before its definition [-Wmissing-prototypes]
DefineCustomConfigVariables()
^
In file included from /opt/pgsql-152/include/server/nodes/primnodes.h:22:0,
from /opt/pgsql-152/include/server/nodes/parsenodes.h:27,
from /opt/pgsql-152/include/server/utils/guc.h:16,
from /root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:17:
/root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c: In function ‘extract_dict_list’:
/opt/pgsql-152/include/server/nodes/pg_list.h:353:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
for (ForEachState cell##__state = {(lst), 0}; \
^
/root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:459:2: note: in expansion of macro ‘foreach’
foreach(lc, elemlist)
^
/opt/pgsql-152/include/server/nodes/pg_list.h:353:2: note: use option -std=c99 or -std=gnu99 to compile your code
for (ForEachState cell##__state = {(lst), 0}; \
^
/root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:459:2: note: in expansion of macro ‘foreach’
foreach(lc, elemlist)
^
make[2]: *** [CMakeFiles/pg_jieba.dir/pg_jieba.c.o] Error 1
make[1]: *** [CMakeFiles/pg_jieba.dir/all] Error 2
make: *** [all] Error 2
[root@centos7 build]#
编译安装 limonp
翻阅源码,发现 pg_jieba 依赖于 deps/limonp,然而 limonp 也有依赖 cppjieba,那么我们就先跳过这步,直接编译 cppjieba,从最里层开始尝试编译。
<-- 此处先跳转到下一节 -->
代码中显示编译至少要使用 cmake3 以上版本,症结找到了,版本很重要,文档也很重要。
https://gitee.com/shawnyan/limonp/blob/master/CMakeLists.txt#L1
cmake_minimum_required(VERSION 3.0)
编译安装 cppjieba
剥“洋葱”是个有趣的过程,继续翻代码,可以发现 limonp 依赖于 cppjieba。
那么按照惯性,cppjieba也不可能一次编译成功了。
[ 16%] Building CXX object test/CMakeFiles/demo.dir/demo.cpp.o
In file included from /root/cppjieba/include/cppjieba/QuerySegment.hpp:7:0,
from /root/cppjieba/include/cppjieba/Jieba.hpp:4,
from /root/cppjieba/test/demo.cpp:1:
/root/cppjieba/deps/limonp/include/limonp/Logging.hpp:56:5: warning: identifier ‘nullptr’ is a keyword in C++11 [-Wc++0x-compat]
assert(tm_tmp != nullptr);
^
In file included from /usr/include/c++/4.8.2/cassert:43:0,
from /root/cppjieba/deps/limonp/include/limonp/Logging.hpp:6,
from /root/cppjieba/include/cppjieba/QuerySegment.hpp:7,
from /root/cppjieba/include/cppjieba/Jieba.hpp:4,
from /root/cppjieba/test/demo.cpp:1:
/root/cppjieba/deps/limonp/include/limonp/Logging.hpp: In constructor ‘limonp::Logger::Logger(size_t, const char*, int)’:
/root/cppjieba/deps/limonp/include/limonp/Logging.hpp:56:22: error: ‘nullptr’ was not declared in this scope
assert(tm_tmp != nullptr);
^
果然,那么问题到底出在哪里了呢。
root cause 是在 cmake 的版本。
果然,在安装 cmake3 之后,重新编译便可以通过。
- output:
[root@centos7 build]# cmake3 . -DCMAKE_CXX_STANDARD=11 ..
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 4.8.5
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc - works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /root/cppjieba/build
[root@centos7 build]# make
Scanning dependencies of target gtest
[ 6%] Building CXX object deps/gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
[ 12%] Linking CXX static library libgtest.a
[ 12%] Built target gtest
Scanning dependencies of target load_test
[ 18%] Building CXX object test/CMakeFiles/load_test.dir/load_test.cpp.o
[ 25%] Linking CXX executable ../load_test
[ 25%] Built target load_test
Scanning dependencies of target demo
[ 31%] Building CXX object test/CMakeFiles/demo.dir/demo.cpp.o
[ 37%] Linking CXX executable ../demo
[ 37%] Built target demo
Scanning dependencies of target test.run
[ 43%] Building CXX object test/unittest/CMakeFiles/test.run.dir/gtest_main.cpp.o
[ 50%] Building CXX object test/unittest/CMakeFiles/test.run.dir/keyword_extractor_test.cpp.o
[ 56%] Building CXX object test/unittest/CMakeFiles/test.run.dir/trie_test.cpp.o
[ 62%] Building CXX object test/unittest/CMakeFiles/test.run.dir/segments_test.cpp.o
[ 68%] Building CXX object test/unittest/CMakeFiles/test.run.dir/pos_tagger_test.cpp.o
[ 75%] Building CXX object test/unittest/CMakeFiles/test.run.dir/jieba_test.cpp.o
[ 81%] Building CXX object test/unittest/CMakeFiles/test.run.dir/pre_filter_test.cpp.o
[ 87%] Building CXX object test/unittest/CMakeFiles/test.run.dir/unicode_test.cpp.o
[ 93%] Building CXX object test/unittest/CMakeFiles/test.run.dir/textrank_test.cpp.o
[100%] Linking CXX executable ../test.run
[100%] Built target test.run
[root@centos7 build]# ls
CMakeCache.txt CMakeFiles cmake_install.cmake CTestTestfile.cmake demo deps load_test Makefile test
[root@centos7 build]# ./demo
他来到了网易杭研大厦
[demo] Cut With HMM
他/来到/了/网易/杭研/大厦
[demo] Cut Without HMM
他/来到/了/网易/杭/研/大厦
我来到北京清华大学
[demo] CutAll
我/来到/北京/清华/清华大学/华大/大学
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
[demo] CutForSearch
小明/硕士/毕业/于/中国/科学/学院/科学院/中国科学院/计算/计算所/,/后/在/日本/京都/大学/日本京都大学/深造
[demo] Insert User Word
男默/女泪
男默女泪
[demo] CutForSearch Word With Offset
[{"word": "小明", "offset": 0}, {"word": "硕士", "offset": 6}, {"word": "毕业", "offset": 12}, {"word": "于", "offset": 18}, {"word": "中国", "offset": 21}, {"word": "科学", "offset": 27}, {"word": "学院", "offset": 30}, {"word": "科学院", "offset": 27}, {"word": "中国科学院", "offset": 21}, {"word": "计算", "offset": 36}, {"word": "计算所", "offset": 36}, {"word": ",", "offset": 45}, {"word": "后", "offset": 48}, {"word": "在", "offset": 51}, {"word": "日本", "offset": 54}, {"word": "京都", "offset": 60}, {"word": "大学", "offset": 66}, {"word": "日本京都大学", "offset": 54}, {"word": "深造", "offset": 72}]
[demo] Lookup Tag for Single Token
[拖拉机:n, CEO:eng, 123:m, 。:x]
[demo] Tagging
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
[我:r, 是:v, 拖拉机:n, 学院:n, 手扶拖拉机:n, 专业:n, 的:uj, 。:x, 不用:v, 多久:m, ,:x, 我:r, 就:d, 会:v, 升职:v, 加薪:nr, ,:x, 当上:t, CEO:eng, ,:x, 走上:v, 人生:n, 巅峰:n, 。:x]
[demo] Keyword Extraction
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
[{"word": "CEO", "offset": [93], "weight": 11.7392}, {"word": "升职", "offset": [72], "weight": 10.8562}, {"word": "加薪", "offset": [78], "weight": 10.6426}, {"word": "手扶拖拉机", "offset": [21], "weight": 10.0089}, {"word": "巅峰", "offset": [111], "weight": 9.49396}]
[root@centos7 build]#
完整编译过程
下载源码:
cd /root/postgresql-15.2/contrib
git clone https://gitee.com/shawnyan/pg_jieba --depth=1
git clone https://gitee.com/shawnyan/cppjieba --depth=1 pg_jieba/libjieba
git clone https://gitee.com/shawnyan/limonp --depth=1 pg_jieba/libjieba/deps/limonp
编译:
cd pg_jieba
vi CMakeLists.txt
set(CMAKE_C_STANDARD 99)
mkdir build
cd build
cmake3 -DCMAKE_PREFIX_PATH=/opt/pgsql-152/ ..
make
- output:
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 4.8.5
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc - works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Setting pg_jieba build type -
-- Found PostgreSQL: /opt/pgsql-152/lib/libpq.so (found version "15.2")
-- POSTGRESQL_EXECUTABLE is /opt/pgsql-152/bin/postgres
-- Configuring done
-- Generating done
-- Build files have been written to: /root/postgresql-15.2/contrib/pg_jieba/build
[root@centos7 build]# make
Scanning dependencies of target pg_jieba
[ 33%] Building C object CMakeFiles/pg_jieba.dir/pg_jieba.c.o
/root/postgresql-15.2/contrib/pg_jieba/pg_jieba.c:258:1: warning: ‘DefineCustomConfigVariables’ was used with no prototype before its definition [-Wmissing-prototypes]
DefineCustomConfigVariables()
^
[ 66%] Building CXX object CMakeFiles/pg_jieba.dir/jieba.cpp.o
[100%] Linking CXX shared library pg_jieba.so
[100%] Built target pg_jieba
[root@centos7 build]# ll
total 684
-rw-r--r-- 1 root root 15772 Mar 1 00:28 CMakeCache.txt
drwxr-xr-x 5 root root 283 Mar 1 00:28 CMakeFiles
-rw-r--r-- 1 root root 7779 Mar 1 00:28 cmake_install.cmake
-rw-r--r-- 1 root root 7961 Mar 1 00:28 Makefile
-rwxr-xr-x 1 root root 666720 Mar 1 00:28 pg_jieba.so --------------
[root@centos7 build]# cp pg_jieba.so /opt/pgsql-152/lib
安装 pg_jieba 插件
启动 pg 服务之前,先在配置文件中添加预加载。
vi postgresql.conf
shared_preload_libraries = 'pg_jieba'
其次,需要复制辅助文件。
cp pg_jieba/libjieba/dict/jieba.dict.utf8 /data/pgdata-152/postgresql/tsearch_data/jieba_base.dict
cp pg_jieba/libjieba/dict/user.dict.utf8 /data/pgdata-152/postgresql/tsearch_data/jieba_user.dict
cp pg_jieba/libjieba/dict/hmm_model.utf8 /data/pgdata-152/postgresql/tsearch_data/jieba_hmm.model
cp pg_jieba/libjieba/dict/stop_words.utf8 /data/pgdata-152/postgresql/tsearch_data/jieba.stop
cp pg_jieba/pg_jieba.control /data/pgdata-152/postgresql/extension/pg_jieba.control
cp pg_jieba/pg_jieba.sql /data/pgdata-152/postgresql/extension/pg_jieba--1.1.1.sql
启动 pg 服务后,便可正常安装使用 pg_jieba 插件。
[postgres@centos7 ~]$ pg_ctl start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start.....2023-03-01 00:44:13.044 CST [49163] LOG: starting PostgreSQL 15.2-Yan on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
2023-03-01 00:44:13.045 CST [49163] LOG: listening on IPv6 address "::1", port 5432
2023-03-01 00:44:13.045 CST [49163] LOG: listening on IPv4 address "127.0.0.1", port 5432
2023-03-01 00:44:13.048 CST [49163] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-03-01 00:44:13.058 CST [49166] LOG: database system was shut down at 2023-02-28 22:40:27 CST
2023-03-01 00:44:13.063 CST [49163] LOG: database system is ready to accept connections
done
server started
[postgres@centos7 ~]$
[postgres@centos7 ~]$ psql
psql (15.2-Yan)
Type "help" for help.
postgres=# CREATE EXTENSION pg_jieba;
CREATE EXTENSION
postgres=#
postgres=# \dx
List of installed extensions
Name | Version | Schema | Description
----------+---------+------------+------------------------------------------
pg_jieba | 1.1.1 | public | a parser for full-text search of Chinese
plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
(2 rows)
最后,实测 pg_jieba 插件,效果如下:
postgres=# SELECT * FROM to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所,后在PGCCC学习PostgreSQL');
to_tsvector
-------------------------------------------------------------------------------------------
'pgccc':10 'postgresql':12 '中国科学院':5 '学习':11 '小明':1 '毕业':3 '硕士':2 '计算所':6
(1 row)
postgres=# SELECT * FROM to_tsvector('jiebacfg', '小丽是个潜水王,也是数据库方面的专家');
to_tsvector
---------------------------------------------------------
'专家':12 '小丽':1 '数据库':9 '方面':10 '潜水':4 '王':5
(1 row)
总结
简要总结,源码编译时,编译环境、依赖组件、所需版本都是要注意的。
coding 还是要写注释的,至少日志输出不能少,不然遇到错误不好调查。
尤其是数据库系统,万一遇到问题如何定位问题,如何快速且准确地解决,始终是 DBA 们的必修课。