Waiting for PostgreSQL 15 – Add assorted new regexp_xxx SQL functions

飞象数据 2022-11-26


On 3rd of August 2021, Tom Lane committed patch:

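Add assorted new regexp_xxx SQL functions.
 
This patch adds new functions regexp_count(), regexp_instr(),
regexp_like(), and regexp_substr(), and extends regexp_replace()
with some new optional arguments. All these functions follow
the definitions used in Oracle, although there are small differences
in the regexp language due to using our own regexp engine; most
notably, the default newline-matching behavior is different.
Similar functions appear in DB2 and elsewhere, too. Aside from
easing portability, these functions are easier to use for certain
tasks than our existing regexp_match[es] functions.
 
Gilles Darold, heavily revised by me
 
Discussion: https://postgr.es/m/fc160ee0-c843-b024-29bb-97b5da61971f@darold.net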
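
The newline remark in the commit message is easiest to see on a tiny input. This is a minimal sketch, assuming the documented defaults of PostgreSQL's regexp engine (a dot matches a newline, and ^ matches only at the very start of the string) and the n flag for newline-sensitive matching. The test string and column names are made up for illustration:

```sql
-- E'a\nb' is the three-character string: a, newline, b.
select
    regexp_count(E'a\nb', 'a.b')         as dot_default,    -- 1: by default . matches the newline
    regexp_count(E'a\nb', 'a.b', 1, 'n') as dot_n,          -- 0: with n, . does not match a newline
    regexp_count(E'a\nb', '^b')          as caret_default,  -- 0: by default ^ matches only at string start
    regexp_count(E'a\nb', '^b', 1, 'n')  as caret_n;        -- 1: with n, ^ also matches right after a newline
```

Code ported from Oracle that relies on the dot stopping at line ends may therefore need the n flag here.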

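I missed it, sorry. But I found it after almost 4 months, and decided to write about it.

I'm a huge fan of regexps. Interestingly, I don't like regexps in Pg, because the regexp engine in PostgreSQL has some weird limitations that I don't like, especially since there are so many other options.

Anyway, up to Pg14, we had these regexp functions: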

regexp_match
regexp_matches
regexp_replace
regexp_split_to_array
regexp_split_to_table

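Now, we got 4 more:

regexp_count( string, pattern [, start [, flags ]] )

Returns the number of times the given pattern matches in the source string. The optional start means the search should begin at the n-th character of the string, and the optional flags are the ones described in the docs:

https://www.postgresql.org/docs/devel/functions-matching.html#POSIX-EMBEDDED-OPTIONS-TABLE

For example: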

=$ select regexp_count('depesz DEPESZ depesz DepEsz', 'de.e');
 regexp_count 
--------------
            2
(1 row)
 
=$ select regexp_count('depesz DEPESZ depesz DepEsz', 'de.e', 2);
 regexp_count 
--------------
            1
(1 row)
 
=$ select regexp_count('depesz DEPESZ depesz DepEsz', 'de.e', 1, 'i');
 regexp_count 
--------------
            4
(1 row)
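
For comparison, before these functions the usual way to count matches was to expand them into rows with regexp_matches() and the g flag, and then count the rows. A small sketch of both approaches side by side, reusing the string from above (the column names are only for illustration); both columns should come out as 2:

```sql
select
    -- older approach: one row per match, then count the rows
    (select count(*)
       from regexp_matches('depesz DEPESZ depesz DepEsz', 'de.e', 'g')) as old_way,
    -- PostgreSQL 15: a single scalar function call
    regexp_count('depesz DEPESZ depesz DepEsz', 'de.e') as new_way;
```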

regexp_instr( string, pattern [, start [, N [, endoption [, flags [, subexpr ]]]]])

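This one is more complex, with more optional arguments.

In its simplest form, it returns the position in the string where the pattern match starts: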

=$ with input as (select 'abcdefghijklmnopqrstuvwxyz'::text as i),
ri as (select regexp_instr( i, '..[aeiou]' ) as p from input)
select ri.p, substr( input.i, ri.p) from input, ri;
 p |          substr          
---+--------------------------
 3 | cdefghijklmnopqrstuvwxyz
(1 row)

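The first optional argument, start, is simply the character at which you want the search to begin. So, if I say that I want to start from the 4th: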

=$ with input as (select 'abcdefghijklmnopqrstuvwxyz'::text as i),
ri as (select regexp_instr( i, '..[aeiou]', 4 ) as p from input)
select ri.p, substr( input.i, ri.p ) from input, ri;
 p |        substr        
---+----------------------
 7 | ghijklmnopqrstuvwxyz
(1 row)

Next is N: which match regexp_instr should use to calculate the position. By default it uses the first match, but we can easily make it look for the third:

=$ with input as (select 'abcdefghijklmnopqrstuvwxyz'::text as i),
ri as (select regexp_instr( i, '..[aeiou]', 1, 3 ) as p from input)
select ri.p, substr( input.i, ri.p) from input, ri;
 p  |     substr     
----+----------------
 13 | mnopqrstuvwxyz
(1 row)

The next option is endoption. It can only be 0 (the default) or 1. If set to 1, regexp_instr returns the position of the first character after the given match:

=$ with input as (select 'abcdefghijklmnopqrstuvwxyz'::text as i),
ri as (select regexp_instr( i, '..[aeiou]', 1, 3, 1 ) as p from input)
select ri.p, substr( input.i, ri.p) from input, ri;
 p  |   substr    
----+-------------
 16 | pqrstuvwxyz
(1 row)

This raises an interesting question: what is returned for a regexp that doesn't match?

=$ select regexp_instr('aaa', 'b', 1, 1, 1);
 regexp_instr 
--------------
            0
(1 row)

And, obviously, what happens if it matches at the very end?

=$ select regexp_instr('abc', 'c', 1, 1, 1);
 regexp_instr 
--------------
            4
(1 row)
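
Since endoption = 1 gives the position right after the match, calling regexp_instr twice yields both ends of a match, and their difference is its length. A minimal sketch, reusing the alphabet string and the third match from the earlier examples; the CTE and column names are only for illustration:

```sql
with input as (select 'abcdefghijklmnopqrstuvwxyz'::text as i),
pos as (
    select regexp_instr(i, '..[aeiou]', 1, 3, 0) as match_start,  -- 13
           regexp_instr(i, '..[aeiou]', 1, 3, 1) as match_end     -- 16
    from input
)
select match_start,
       match_end,
       match_end - match_start as match_length,                    -- 3
       substr(i, match_start, match_end - match_start) as matched  -- mno
from input, pos;
```

When nothing matches, both calls return 0, so the computed length is 0 as well.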

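The next argument is flags, which I mentioned earlier with regexp_count, but the last one is a new thing: subexpr. What does it mean?

It means that if my pattern contains elements in parentheses, I can ask for the position of a specific set of parentheses: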

=$ with input as (select 'abcdefghijklmnopqrstuvwxyz'::text as i),
ri as (select regexp_instr( i, '.(.[aeiou])(..)', 1, 1, 0, '') as p from input)
select ri.p, substr( input.i, ri.p) from input, ri;
 p |          substr          
---+--------------------------
 3 | cdefghijklmnopqrstuvwxyz
(1 row)
 
=$ with input as (select 'abcdefghijklmnopqrstuvwxyz'::text as i),
ri as (select regexp_instr( i, '.(.[aeiou])(..)', 1, 1, 0, '', 1) as p from input)
select ri.p, substr( input.i, ri.p) from input, ri;
 p |         substr          
---+-------------------------
 4 | defghijklmnopqrstuvwxyz
(1 row)
 
=$ with input as (select 'abcdefghijklmnopqrstuvwxyz'::text as i),
ri as (select regexp_instr( i, '.(.[aeiou])(..)', 1, 1, 0, '', 2) as p from input)
select ri.p, substr( input.i, ri.p) from input, ri;
 p |        substr         
---+-----------------------
 6 | fghijklmnopqrstuvwxyz
(1 row)

It's complicated, and it shows that we could really use named parameters for built-in functions. I would much rather see:

select regexp_instr( 'abcdefghijklmnopqrstuvwxyz', '.(.[aeiou])(..)', subexpr => 2 )

than:

select regexp_instr( 'abcdefghijklmnopqrstuvwxyz', '.(.[aeiou])(..)', 1, 1, 0, '', 2 )
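
On a version where the built-in function does not accept named arguments, one possible workaround is a thin wrapper with default values. This is only a sketch: the function name my_regexp_instr and its parameter names (start_pos, nth and so on) are made up for illustration, and the defaults are assumed to mirror the documented ones of regexp_instr:

```sql
-- Hypothetical wrapper, not part of PostgreSQL: gives every optional
-- argument a name and a default, so calls can skip the ones they don't need.
create function my_regexp_instr(
    string    text,
    pattern   text,
    start_pos int  default 1,
    nth       int  default 1,
    endoption int  default 0,
    flags     text default '',
    subexpr   int  default 0
) returns int
language sql
immutable
as $$
    select regexp_instr(string, pattern, start_pos, nth, endoption, flags, subexpr)
$$;

-- Named notation: only the interesting argument is spelled out.
select my_regexp_instr('abcdefghijklmnopqrstuvwxyz', '.(.[aeiou])(..)', subexpr => 2);
```

The last query should return 6, same as the positional call with all seven arguments.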

regexp_like( string, pattern [, flags])

This one is simple: it returns true or false depending on whether the given pattern matches the string. With optional flags:

=$ select regexp_like( 'AAA', 'a+' ), regexp_like( 'AAA', 'a+', 'i');
 regexp_like | regexp_like 
-------------+-------------
 f           | t
(1 row)
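
For comparison, the long-standing ~ and ~* operators answer the same question, so regexp_like is mostly a more portable spelling of them. A small sketch with the same input as above; the column names are only for illustration:

```sql
select 'AAA' ~  'a+'                 as op_case_sensitive,    -- f, like regexp_like('AAA', 'a+')
       'AAA' ~* 'a+'                 as op_case_insensitive,  -- t, like regexp_like('AAA', 'a+', 'i')
       regexp_like('AAA', 'a+', 'i') as fn_case_insensitive;  -- t
```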

regexp_substr( string, pattern [, start [, N [, flags [, subexpr ]]]])

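This is somewhat similar to regexp_instr, at least in terms of the options it takes. Except for endoption, all of them are the same and work the same way.

The difference is that it doesn't return a position, but the text that matched. All the selectors are supported: start position, n-th match, flags, and subexpression number: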

=$ select n as nth, s as subexpression, regexp_substr( 'abcdefghijklmnopqrstuvwxyz', '.(.[aeiou])(..)', 1, n, '', s )
from generate_series(1,3) n, generate_series(0, 2) s;
 nth | subexpression | regexp_substr 
-----+---------------+---------------
   1 |             0 | cdefg
   1 |             1 | de
   1 |             2 | fg
   2 |             0 | mnopq
   2 |             1 | no
   2 |             2 | pq
   3 |             0 | stuvw
   3 |             1 | tu
   3 |             2 | vw
(9 rows)
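
This is the kind of task the commit message calls easier than with regexp_match[es]: to get one captured group as plain text, the older function needs an array subscript, while regexp_substr returns the text directly. A small sketch on the same input; the column names are only for illustration, and both columns should contain de:

```sql
select
    -- older approach: regexp_match returns a text[] of captured groups
    (regexp_match('abcdefghijklmnopqrstuvwxyz', '.(.[aeiou])(..)'))[1] as via_match,
    -- PostgreSQL 15: first match, no flags, subexpression 1
    regexp_substr('abcdefghijklmnopqrstuvwxyz', '.(.[aeiou])(..)', 1, 1, '', 1) as via_substr;
```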

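In addition to the four new functions, one existing function got extended:

regexp_replace( source, pattern, replacement [, start [, N ]] [, flags ] )

Two new options were added: start and N. Just like in the other functions, they mean that matching should consider the string from character number start, and process only the N-th match, so I can: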

=$ select regexp_replace('hubert xxx lubaczewski', '(\S+)', 'depesz', 1, 2);
      regexp_replace       
---------------------------
 hubert depesz lubaczewski
(1 row)
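
Two more cases are worth a quick look: start can point into the middle of a word, and, per the documentation, N = 0 means that all matches at or after start are replaced. A minimal sketch on the same input, with expected results in the comments:

```sql
-- start = 2 skips the leading "h", so the first match is "ubert"
select regexp_replace('hubert xxx lubaczewski', '\S+', 'depesz', 2, 1);
-- hdepesz xxx lubaczewski

-- start = 8 is the first "x"; N = 0 replaces every match from there on
select regexp_replace('hubert xxx lubaczewski', '\S+', 'depesz', 8, 0);
-- hubert depesz depesz
```

Text before the start position is left untouched in both cases.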

This is really handy. Thanks a lot to everyone involved, and sorry for the lag in describing it.
