KINGBASE Full-Text Search

Date: 2021-07-22 17:34:08

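KINGBASE supports full-text search. Its built-in default parser splits text on whitespace. Because Chinese words are not separated by spaces, this approach does not work for Chinese; full-text search over Chinese text requires an additional Chinese word-segmentation plugin.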

I. Default whitespace tokenization

1. tsvector

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value');
                             to_tsvector                              
----------------------------------------------------------------------
 'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17

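2. Normalization

Normalization does the following:

  1. Uppercase letters are always converted to lowercase.
  2. Suffixes are often stripped as well (such as s, es, and ing in English). This lets a search match the variants of a word without tediously typing every possible form.
  3. The numbers give each lexeme's position in the original string; for example, "man" appears at positions 6 and 15. You can count them yourself.
  4. The default text search configuration of to_tsvector is "english". It ignores English stop words (translator's note: words such as am, is, are, a, an).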
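
The normalization steps above can be observed one word at a time. The following is a minimal sketch, assuming KINGBASE exposes the PostgreSQL-style ts_lexize and ts_debug functions and the english_stem dictionary; none of these calls appear in the original session.

```sql
-- Assumption: PostgreSQL-compatible text search functions are available.
-- A plural noun is reduced to its stem.
SELECT ts_lexize('english_stem', 'stars');

-- A stop word produces an empty array, so it is dropped from the tsvector.
SELECT ts_lexize('english_stem', 'a');

-- ts_debug lists each token, its token type (alias), and the resulting lexemes.
SELECT alias, token, lexemes
FROM ts_debug('english', 'Try not to become a man of value');
```

In the ts_debug output, rows whose lexemes column is empty correspond to the stop words that are missing from the to_tsvector result.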

3. Searching a tsvector

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ 'become';
 ?column? 
----------
 f
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ 'becom';
 ?column? 
----------
 t
(1 row)

test=# select 'become'::tsquery, to_tsquery('become');
 tsquery  | to_tsquery 
----------+------------
 'become' | 'becom'

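to_tsquery also normalizes its input, whereas a plain cast to tsquery does not. Use to_tsquery when searching so that a term is not missed merely because the stored text was normalized.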
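
In practice the same rule applies when searching a table column. The sketch below uses a hypothetical table named quotes (not part of the original article) and assumes PostgreSQL-compatible GIN expression indexes; the point is only that both the document and the query are normalized with the same configuration.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE quotes (
    id   integer PRIMARY KEY,
    body text
);

INSERT INTO quotes (id, body)
VALUES (1, 'Try not to become a man of success, but rather try to become a man of value');

-- Both sides are normalized with the 'english' configuration,
-- so the query term 'become' is compared as the lexeme 'becom'.
SELECT id
FROM quotes
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'become');

-- Optional: an expression index on the same to_tsvector call,
-- so that the search does not have to re-parse every row.
CREATE INDEX quotes_body_fts ON quotes USING gin (to_tsvector('english', body));
```

Naming the configuration explicitly in both the index and the query keeps the two expressions identical, which is what allows the index to be considered for the search.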

4. Logical operators

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('become');
 ?column? 
----------
 t
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('!become');
 ?column? 
----------
 f
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('tri & become');
 ?column? 
----------
 t
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('Try & !becom');
 ?column? 
----------
 f
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('Try | !become');
 ?column? 
----------
 t
(1 row)
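
To see why the queries above return what they do, it can help to print the parsed query on its own. This is a small sketch, assuming the same PostgreSQL-style functions; the explicit 'english' argument is added here for clarity and is not in the original session.

```sql
-- Show how each query string is normalized before it is matched.
-- & is AND, | is OR, ! is NOT.
SELECT to_tsquery('english', 'Try & !becom');
SELECT to_tsquery('english', 'Try | !become');
```

Each operand is normalized the same way as the document text, so Try and become are compared as the lexemes tri and becom that appear in the tsvector output.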

5. Use :* to match words that start with a given prefix

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('bec:*');
 ?column? 
----------
 t
(1 row)

6. Support for other languages

test=# SELECT to_tsvector('simple','Try not to become a man of success, but rather try to become a man of value');
                                                     to_tsvector                                                     
---------------------------------------------------------------------------------------------------------------------
 'a':5,14 'become':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rather':10 'success':8 'to':3,12 'try':1,11 'value':17
(1 row)

test=# SELECT to_tsvector('english','Try not to become a man of success, but rather try to become a man of value');
                             to_tsvector                              
----------------------------------------------------------------------
 'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17
(1 row)
test=# SELECT to_tsvector('french','Try not to become a man of success, but rather try to become a man of value');
                                                   to_tsvector                                                   
-----------------------------------------------------------------------------------------------------------------
 'a':5,14 'becom':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rath':10 'success':8 'to':3,12 'try':1,11 'valu':17
(1 row)
test=# SELECT to_tsvector('french'::regconfig,'Try not to become a man of success, but rather try to become a man of value');
                                                   to_tsvector                                                   
-----------------------------------------------------------------------------------------------------------------
 'a':5,14 'becom':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rath':10 'success':8 'to':3,12 'try':1,11 'valu':17
(1 row)

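The simple configuration does not ignore stop words, and it does not try to find the stem of a word. With simple, every whitespace-separated group of characters becomes a lexeme, and the only transformation applied is lowercasing. For data-like values, as opposed to natural-language text, the simple configuration is quite practical.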
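
When no configuration name is passed, to_tsvector and to_tsquery fall back to a session default. The sketch below assumes KINGBASE honors the PostgreSQL parameter default_text_search_config; that parameter is not mentioned in the original article, so treat it as an assumption to verify on your installation.

```sql
-- Assumption: the PostgreSQL parameter default_text_search_config exists here.
SHOW default_text_search_config;

-- Switch the session default to the simple configuration.
SET default_text_search_config = 'simple';

-- With no explicit configuration, this call now uses simple:
-- words are lowercased, but stop words and suffixes are kept.
SELECT to_tsvector('Try not to become a man of value');

-- Restore the previous default for the rest of the session.
RESET default_text_search_config;
```

Passing the configuration explicitly, as in the examples above, avoids depending on the session setting at all.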

II. Chinese full-text search

Before introducing Chinese search, let's look at an example:

test=# select to_tsvector('人大金仓致力于提供高可靠的数据库产品');
               to_tsvector                
------------------------------------------
 '人大金仓致力于提供高可靠的数据库产品':1

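Because the built-in parser splits on whitespace and Chinese text contains no spaces, the whole sentence is treated as a single token.

1. Create the Chinese search plugin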

create extension zhparser;
create text search configuration zhongwen_parser (parser = zhparser);
alter text search configuration zhongwen_parser add mapping for n,v,a,i,e,l,j with simple;

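The letters after "for" are the token types to map. The mapping above covers only seven types: nouns (n), verbs (v), adjectives (a), idioms (i), exclamations (e), abbreviations (j), and idiomatic phrases (l). Tokens of any other type are discarded. The dictionary used is the built-in simple dictionary. The full list of token types is:

test=# select ts_token_type('zhparser');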
     ts_token_type      
------------------------
 (97,a,adjective)
 (98,b,differentiation)
 (99,c,conjunction)
 (100,d,adverb)
 (101,e,exclamation)
 (102,f,position)
 (103,g,root)
 (104,h,head)
 (105,i,idiom)
 (106,j,abbreviation)
 (107,k,tail)
 (108,l,tmp)
 (109,m,numeral)
 (110,n,noun)
 (111,o,onomatopoeia)
 (112,p,prepositional)
 (113,q,quantity)
 (114,r,pronoun)
 (115,s,space)
 (116,t,time)
 (117,u,auxiliary)
 (118,v,verb)
 (119,w,punctuation)
 (120,x,unknown)
 (121,y,modal)
 (122,z,status)
(26 rows)
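
To check which token type the parser assigns to each word, and therefore which words survive the mapping, the parser and the configuration can be inspected directly. This is a sketch assuming the PostgreSQL-style ts_parse and ts_debug functions work with the zhparser parser and the zhongwen_parser configuration created above.

```sql
-- Raw parser output: one row per token, with the numeric token type id
-- (for example 110 for a noun, 118 for a verb, per ts_token_type).
SELECT tokid, token
FROM ts_parse('zhparser', '人大金仓致力于提供高可靠的数据库产品');

-- Configuration-level view: tokens whose type has no mapping
-- show up with no dictionary and no lexemes.
SELECT alias, token, lexemes
FROM ts_debug('zhongwen_parser', '人大金仓致力于提供高可靠的数据库产品');
```

If an expected word is missing from a tsvector, the ts_debug output shows whether its token type simply was not included in the mapping.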

2. Inspect pg_ts_config

After the text search configuration is created, it appears in pg_ts_config:

test=# select * from pg_ts_config;
  oid  |     cfgname     | cfgnamespace | cfgowner | cfgparser 
-------+-----------------+--------------+----------+-----------
  3748 | simple          |           11 |       10 |      3722
 13265 | arabic          |           11 |       10 |      3722
 13267 | danish          |           11 |       10 |      3722
 13269 | dutch           |           11 |       10 |      3722
 13271 | english         |           11 |       10 |      3722
 13273 | finnish         |           11 |       10 |      3722
 13275 | french          |           11 |       10 |      3722
 13277 | german          |           11 |       10 |      3722
 13279 | hungarian       |           11 |       10 |      3722
 13281 | indonesian      |           11 |       10 |      3722
 13283 | irish           |           11 |       10 |      3722
 13285 | italian         |           11 |       10 |      3722
 13287 | lithuanian      |           11 |       10 |      3722
 13289 | nepali          |           11 |       10 |      3722
 13291 | norwegian       |           11 |       10 |      3722
 13293 | portuguese      |           11 |       10 |      3722
 13295 | romanian        |           11 |       10 |      3722
 13297 | russian         |           11 |       10 |      3722
 13299 | spanish         |           11 |       10 |      3722
 13301 | swedish         |           11 |       10 |      3722
 13303 | tamil           |           11 |       10 |      3722
 13305 | turkish         |           11 |       10 |      3722
 16390 | parser_name     |         2200 |       10 |     16389
 24587 | zhongwen_parser |         2200 |       10 |     16389

3. Using Chinese word segmentation

test=# select to_tsvector('zhongwen_parser','人大金仓致力于提供高可靠的数据库产品');
                           to_tsvector                            
------------------------------------------------------------------
 '产品':7 '人大':1 '可靠':5 '提供':3 '数据库':6 '致力于':2 '高':4
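
Once the text is segmented, searching works the same way as in the English examples, provided the query is parsed with the same configuration. A minimal sketch, assuming the zhongwen_parser configuration created above:

```sql
-- Parse both the document and the query with the Chinese configuration.
SELECT to_tsvector('zhongwen_parser', '人大金仓致力于提供高可靠的数据库产品')
       @@ to_tsquery('zhongwen_parser', '数据库');

-- Operators combine as before: & is AND, | is OR, ! is NOT.
SELECT to_tsvector('zhongwen_parser', '人大金仓致力于提供高可靠的数据库产品')
       @@ to_tsquery('zhongwen_parser', '数据库 & 产品');
```

Since 数据库 and 产品 both appear as lexemes in the tsvector output, both queries should match.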

4. The contains function

test=# \df+ contains
                                                                                           List of functions
 Schema |   Name   | Result data type | Argument data types | Type | Volatility | Parallel | Owner  | Security | Access privileges | Language |               Source code        
        | Description 
--------+----------+------------------+---------------------+------+------------+----------+--------+----------+-------------------+----------+------------------------------------------+-------------
 sys    | contains | boolean          | text, text          | func | immutable  | safe     | system | invoker  |                   | sql      | select to_tsvector($1) @@ to_tsquery($2) | 
 sys    | contains | boolean          | text, text, integer | func | immutable  | safe     | system | invoker  |                   | sql      | select to_tsvector($1) @@ to_tsquery($2) | 
 sys    | contains | boolean          | text, tsquery       | func | immutable  | safe     | system | invoker  |                   | sql      | select $1::tsvector @@ $2                | 
 sys    | contains | boolean          | tsvector, text      | func | immutable  | safe     | system | invoker  |                   | sql      | select $1 @@ $2::tsquery                 | 
 sys    | contains | boolean          | tsvector, tsquery   | func | immutable  | safe     | system | invoker  |                   | sql      | select $1 @@ $2                          | 

By default, the contains function uses the whitespace-based parser, so it cannot be used to test Chinese text:

test=# select contains('人大金仓致力于提供高可靠的数据库产品','产品');
 contains 
----------
 f
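
One possible workaround is a small wrapper that passes the Chinese configuration explicitly. The function name contains_zh below is hypothetical and not part of KINGBASE; the body mirrors the source of the built-in contains(text, text) shown in the listing, with the configuration added.

```sql
-- Hypothetical helper, modeled on the built-in contains(text, text).
-- Assumes the zhongwen_parser configuration from the earlier step exists.
CREATE FUNCTION contains_zh(text, text) RETURNS boolean
AS $$
    SELECT to_tsvector('zhongwen_parser', $1) @@ to_tsquery('zhongwen_parser', $2)
$$ LANGUAGE sql STABLE;

-- Usage: the same test as above, now segmented as Chinese.
SELECT contains_zh('人大金仓致力于提供高可靠的数据库产品', '产品');
```

The function is declared STABLE rather than IMMUTABLE as a conservative choice, since its result depends on how the zhongwen_parser configuration is defined.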

 
