Recently I have been studying how to build a search engine on PostgreSQL. In short, a search engine is built around an inverted index. The first step is tokenization: split an article or sentence into N words, each with a certain weight. There is a lot of room for optimization in this step, because how accurately the text is cut into meaningful words has a big impact on later searches; TF-IDF and TextRank are useful references here. The second step is to build the inverted index, that is, to associate each word with its positions in the article. With the inverted index in place, the third step is searching: tokenize the input in the same way, assemble the words into a search condition, and feed it to the engine. The fourth step is to process the search results.
However, the common choice in the industry is ES (Elasticsearch), so why do I want to use PG? For massive data ES may well be the better solution, but for ordinary cases ES is too heavyweight. Can PG handle the common cases? If both our business data and our search can be handled by PG, we don't need to synchronize between the database and ES, and the overall complexity drops a lot.
That is why I made this attempt.
PG Internal Support
To use full-text search in PostgreSQL (PG for short), you need a built-in data type provided by PG called tsvector, which stores the tokenized word vector. Let's look at a simple example.
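A minimal sketch of such an example (the sample text and weights are my own illustration):

```sql
-- Tokenize two strings, attach weights, and merge the vectors
SELECT setweight(to_tsvector('simple', 'hello world'), 'A') ||
       setweight(to_tsvector('simple', 'hello postgres'), 'B');
-- Output: 'hello':1A,3B 'postgres':4B 'world':2A
```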
As you can see, this is exactly the inverted index we described earlier: each entry is a word together with the positions where it appears. The A and B suffixes are the word's weight; PG has four weights, A, B, C and D, where A is the highest and D the lowest. The higher the weight, the higher a match can rank in subsequent searches.
Building the inverted index
After we get the article, we first have to tokenize it. I use sego; the reason is that gojieba somehow always panics, so for simplicity I use sego for now. If we later find that sego does not leave enough room for optimization, we can wrap the Python version of jieba as a standalone service.
My database is designed as follows:

- `title`: the title
- `description`: the article body or description
- `tokens`: stores the word vector
This should work for most scenarios, such as searching for articles, products, lyrics, posts, etc. The code is as follows.
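A minimal sketch of the table, assuming the three fields above (the table name and id column are my own choices):

```sql
CREATE TABLE articles (
    id          bigserial PRIMARY KEY,
    title       text NOT NULL,
    description text NOT NULL,
    tokens      tsvector  -- the word vector used for full-text search
);
```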
The next step is to write the tokenized result into the tokens column.
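Something like the following sketch; the pre-tokenized, space-separated strings would come from the application-side tokenizer such as sego:

```sql
UPDATE articles
SET tokens = setweight(to_tsvector('simple', 'tokenized title words'), 'A') ||
             setweight(to_tsvector('simple', 'tokenized description words'), 'B')
WHERE id = 1;
```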
to_tsvector builds the vector. The first argument, simple, means the input is simply split on spaces; if it is not given, english is used by default. The setweight function sets the weight of a to_tsvector result, and the weight options are A, B, C and D, as described above. || merges multiple word vectors.
This approach controls tokenization entirely at the application level, whereas most of the solutions you can find online are based on compiled PG plugins. I prefer application-level tokenization, which has the following advantages:
- Simple to maintain, no compilation. With a cloud-hosted PG you may not be able to load self-compiled plugins at all.
- Easy to scale: scaling at the application level is much easier than scaling the database. Tokenization itself is a CPU-intensive task, so it can quickly become a bottleneck inside the database.
- Fast to update: rolling out new features of the tokenizer or its dictionary is very easy in the application.
At this point, the word vectors have been written into the table. Next we still need to create an index.
I use a GIN index; GiST is the other option. See the differences at: https://www.postgresql.org/docs/current/textsearch-indexes.html.
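A minimal sketch, assuming the articles table above:

```sql
-- GIN is the usual choice for full-text search columns
CREATE INDEX idx_articles_tokens ON articles USING GIN (tokens);
```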
Searching
After saving the data, the next thing to do is search.
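A sketch of a ranked search, assuming the articles table above (the query words would be tokenized by the application first):

```sql
SELECT id, title, ts_rank(tokens, query) AS rank
FROM articles, to_tsquery('simple', 'postgres & search') AS query
WHERE tokens @@ query   -- @@ tests whether the vector matches the query
ORDER BY rank DESC
LIMIT 10;
```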
to_tsquery parses the query string, whose syntax is as follows (refer to the documentation):
- `&`: both conditions must be satisfied
- `|`: either condition is satisfied
- `!`: NOT, the word must not match
- `<->`: the word on the left is followed by the word on the right, roughly an `A ... B` structure
- `blabla*`: prefix match
ts_rank computes a score based on the weights, which is convenient for the subsequent ORDER BY ranking.
Performance testing
I repeatedly imported all the articles of my blog until I had accumulated about 100 million words, then ran actual tests. With a total of about 360,000 articles and roughly 100 million words after tokenization, the response time is basically proportional to the number of results returned, and the response is very fast when there are few matches. I think PG is entirely sufficient for common scenarios. There are still many things to optimize: removing common stop words, punctuation, special characters and other useless tokens; using TF-IDF to extract keywords and give them higher weight while lowering the weight of the remaining words. After these optimizations, the overall performance should be much better.
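Such tests can be reproduced by timing individual queries with EXPLAIN ANALYZE; a sketch, reusing the query above:

```sql
EXPLAIN ANALYZE
SELECT id, ts_rank(tokens, q) AS rank
FROM articles, to_tsquery('simple', 'postgres & search') AS q
WHERE tokens @@ q
ORDER BY rank DESC
LIMIT 10;
```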
Summary
This article summarizes my experiment with using PG as a search engine. After verification, PG is fully capable of handling common scenarios. In the future, when I build something of my own that needs search, I believe this approach can keep the whole system much simpler.