Chris Mair (→ 1006.org)
Presentation at the NOI Techpark Developers’ Thursday on 2026-05-07
As example text corpus, I picked the English (en), German (de) and Italian (it) Wikipedia dumps.
All articles were extracted from the XML dumps from → dumps.wikimedia.org, up to date as of 2026-04-01. Disambiguation, talk, category etc. pages were skipped.
The native MediaWiki format is really weird. Almost nothing except
the MediaWiki backend can parse it. Pandoc, even in its latest version,
chokes on many pages. The least troublesome tool I found was the aptly
named →
mwparserfromhell. I’ve used its .strip_code() method to
extract the plain text from the MediaWiki pages (not perfect, but
workable).
I’ve imported the article titles and bodies into simple Postgres
tables enwiki, dewiki and
itwiki:
column | type
-------+-------------------
id | bigint
title | character varying
body | character varying
Some stats:
| | enwiki | dewiki | itwiki |
|---|---|---|---|
| articles | 7 114 298 | 2 745 523 | 1 801 598 |
| chars | 27 851 734 786 | 12 148 201 273 | 5 772 994 910 |
| PG table size | 17.71 GiB | 8.09 GiB | 3.87 GiB |
| title length (chars) | 20 ± 11 | 20 ± 11 | 20 ± 12 |
| body length (chars) | 3895 ± 7426 | 4405 ± 8037 | 3184 ± 6537 |
The DE Wikipedia has both the longest title (219 chars) and the longest body (1 517 425 chars).
Note
Fulfilling a cliché about Germany, that 219 char title is the name of a German federal law (→ link for the curious). The longest body is the article about the name “Karl”, with a long list of people named “Karl” (→ link for the curious).
Everything was done on a VM with 16 GiB of RAM and 4 CPU threads of an AMD EPYC 9555P and 250 GiB of network attached block storage at 15k IOPS provided by → Scaleway in Milan.
Only for the creation of the HNSW index, the machine was briefly
scaled up to speed up the process (the HNSW index creation is much
faster if you increase maintenance_work_mem, and I was doing
this right before the talk ;).
I’m using → Postgres 17.9
on Debian 13 with the extensions pg_trgm (built-in),
fuzzystrmatch (built-in) and pgvector (third
party: →
pgvector).
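For completeness, once the pgvector package is installed, the three extensions can be enabled per database with:

create extension if not exists pg_trgm;
create extension if not exists fuzzystrmatch;
create extension if not exists vector;   -- pgvector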
Tuning parameters were:
shared_buffers: 4GB
maintenance_work_mem: 1GB
work_mem: 128MB
wal_level: minimal
and replication settings were off.
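For reference, these can be set with ALTER SYSTEM (a sketch, not necessarily the exact commands I used):

alter system set shared_buffers = '4GB';
alter system set maintenance_work_mem = '1GB';
alter system set work_mem = '128MB';
alter system set wal_level = 'minimal';
alter system set max_wal_senders = 0;   -- wal_level = minimal requires this
-- shared_buffers, wal_level and max_wal_senders need a server restart to take effect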
Regarding benchmarks, all timings were taken by repeating each query multiple times and measuring with a warm cache. Note that Postgres never caches query results, but both Postgres and Linux do cache file system pages in RAM.
Use case:
You just want to find substrings in shorter
texts, think the titles of our Wikipedia articles.
Sometimes the simplest solution is all you need.
Let’s for example search for article titles in the EN Wikipedia that
contain the substring “South Tyrol”. We can use the standard SQL
like operator (or ilike for a case-insensitive
match in Postgres):
select title from enwiki
where title ilike '%South Tyrol%'
order by title;

Cool. 68 results:
title
------------------------------------------
Ahr (South Tyrol)
Andrian, South Tyrol
Auer, South Tyrol
[...]
Valleys of South Tyrol
Weißhorn (South Tyrol)
We South Tyroleans
Problem: this takes 1.3 seconds on my test machine.
Actually… that’s not so slow. But, of course we can do better using an index!
We cannot use a standard B-tree index here, though. B-trees rely on ordered data, like the index in a text book. However, when you search for a substring, you don’t know how the string starts, so there is no ordering to exploit.
No problem. Remember:
Note
Postgres can index anything.
We can use two tricks from the toolbox of computer science.
One trick is splitting the string up into three-letter chunks, so-called trigrams. Here are the trigrams for “Valleys of South Tyrol”:

select show_trgm('Valleys of South Tyrol');

This gives the (unordered) chunks:
show_trgm
-----------------------------------------------------------------
{" o"," s"," t"," v"," of"," so"," ty"," va",all,eys,ley,lle,
"of ","ol ",out,rol,sou,"th ",tyr,uth,val,yro,"ys "}
The second trick is using a GIN (Generalized Inverted Index), an index structure that can speed up searching for subsets that are contained in sets, regardless of any ordering.
So, looking for “South Tyrol” will find “Valleys of South Tyrol” because the (unordered) subset {" s"," t"," so"," ty","ol ",out,rol,sou,"th ",tyr,uth,yro} is contained in the above set.
Sounds complicated? It isn’t. Just create the index specifying you want to use the GIN algorithm, and you want it to work on trigrams:
create index enwiki_title_ix on enwiki
using gin(title gin_trgm_ops);

That’s it.
The index weighs 388 MiB for the 141 million chars in the 7.1 million titles. It took 30 seconds to create it on my test machine. As is true for all indexes, once you create it, it will be automatically kept up to date if the data changes.
Finding the 68 matching titles on my test machine now takes 10 milliseconds, down from 1300 milliseconds without the index!
Note
Are you a fan of JSON and semi-structured databases? No problem, if your strings are inside some JSON field, everything works exactly the same way. You can create the index on the expression that extracts the field from a jsonb column. Whenever you use the same expression, Postgres will be able to use the index.
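A minimal sketch of that (the table docs and its jsonb column doc are made up for illustration):

create table docs (id bigint, doc jsonb);

create index docs_title_trgm_ix on docs
using gin ((doc ->> 'title') gin_trgm_ops);

-- this query can use the index because it repeats the same expression
select doc from docs where doc ->> 'title' ilike '%South Tyrol%';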
Note
Besides LIKE and ILIKE, this type of index also works for the regular expression operators ~ and ~*.
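For example (an illustrative query, not one from the benchmark):

select title from enwiki
where title ~* 'south tyrol(ean)?'
order by title;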
Use case:
You want to find natural-language documents that
contain given query terms and optionally rank them by relevance to the
query.
A document is a text of any length, but, typically, when we talk about Full Text Search we’re talking about articles, posts, longer texts, not just titles. For the demo, we’re going to use the Wikipedia article bodies.
Consider that a page of text typed on a typewriter is about 2000 chars. Then, in our English Wikipedia corpus, 47% of the article bodies are longer than 1 page and 8.1% are longer than 5 pages.
We could create a GIN/trigram index on the articles body and use
LIKE. However, there are a number of problems with such a
simple approach!
Performance
The smaller the search term and the larger the texts, the more
inefficient the GIN/trigram approach becomes. To find the 3038 article
bodies matching “South Tyrol”, a GIN on trigrams brings the query time
from 75 seconds to 3 seconds. A big improvement, but not a good end
result. This is because the larger bodies contain many trigrams. Finding
a small subset in a large set is slower and gives candidate matches that
end up not matching the LIKE operation.
As a matter of fact, for this example when using the GIN/trigram approach on the bodies, there are 16321 candidate matches, of which 13283 are removed by a recheck. By contrast, for the previous example querying the title, there are 68 candidate matches and every one turns out to be correct.
Linguistic support
The LIKE operator doesn’t know anything about word
boundaries, or tokenization in general. If we’re looking for “Car” we
don’t want the system to find documents about a “Carpet”.
In the same way, LIKE doesn’t know anything about
languages. It doesn’t know “Car” and “Cars” are the same word in
English, one the plural of the other. It also has no concept of a stop
word. Stop words are words that should be ignored in the given language,
like “the”, or “a” in English.
Ranking
Transforming an article body into trigrams doesn’t conserve any
information about the number of occurrences, positions or the overall
density of the search terms in the text. These metrics are normally used
to rank the relevance of the documents. So ranking is not really possible
when all you have is the LIKE operator.
Postgres has support for full text search, which solves all these problems!
The tsvector data type (text
search vector) is the key here. To
transform a text into a tsvector we can use the
to_tsvector() function.
Let’s pick a sentence from the article about computers:
“A computer is a machine that can be programmed to automatically carry out sequences of arithmetic or logical operations (computation).”
and transform it into a tsvector:
select to_tsvector('A computer is a machine that can be programmed to automatically ' ||
'carry out sequences of arithmetic or logical operations (computation).');

to_tsvector
-----------------------------------------------------------------
'arithmet':16 'automat':11 'carri':12 'comput':2,20 'logic':18
'machin':5 'oper':19 'program':9 'sequenc':14
As you can see: stop words like “a”, “is”, “that” and “to” have been removed, the remaining words have been stemmed (“computer” becomes “comput”) and the position of each lexeme in the text is recorded (“comput” appears at positions 2 and 20).
Postgres 17 supports 29 languages (list them in psql
using \dF)! You can specify the language with a
configuration parameter (default_text_search_config) or by
specifying the language in the function call. Let’s try German and
Italian:
select to_tsvector('german', 'Ein Computer oder Rechner ist ein Gerät, das mittels ' ||
'programmierbarer Rechenvorschriften Daten verarbeitet.');

to_tsvector
-----------------------------------------------------------------
'comput':2 'dat':12 'gerat':7 'mittel':9 'programmierbar':10 'rechenvorschrift':11
'rechn':4 'verarbeitet':13
select to_tsvector('italian', 'Un computer, in italiano anche noto come elaboratore o ' ||
'calcolatore, è una macchina automatizzata programmabile in grado di eseguire sia ' ||
'complessi calcoli matematici (calcolatore) sia altri tipi di elaborazioni dati ' ||
'(elaboratore).');

to_tsvector
-----------------------------------------------------------------
'altri':26 'automatizz':14 'calcol':10,22,24 'compless':21 'computer':2 'dat':30 'elabor':8,29,31
'esegu':19 'grad':17 'ital':4 'macchin':13 'matemat':23 'not':6 'programm':15 'tip':27
As you can see, stop words and stemming follow (roughly) the rules of each language.
OK, so let’s work on our tables! We want to precompute
to_tsvector(body). Postgres has a nice feature,
generated columns, that allows us to define and
materialize a column in a table that is computed from the other columns.
This column is automatically updated whenever records in the table
change:
alter table enwiki add column fts tsvector
generated always as ( to_tsvector('english', title || body) ) stored;
alter table dewiki add column fts tsvector
generated always as ( to_tsvector('german', title || body) ) stored;
alter table itwiki add column fts tsvector
generated always as ( to_tsvector('italian', title || body) ) stored;

Note that we included both the title and the body in our new
fts column with the tsvector.
On my test machine materializing the new column took quite some time,
44 minutes for enwiki, 20 minutes for dewiki
and 10 minutes for itwiki. Remember that you need to do
this just once, though. From now on, the generated column is always
automatically updated whenever the source data changes.
So far so good. How do we query this? How do we actually perform a full text search?
There is a second ingredient in the mix: the tsquery
data type! A query to find all occurrences of “South” followed by
“Tyrol” would look like this:
select to_tsquery('South <-> Tyrol');

to_tsquery
---------------------
'south' <-> 'tyrol'
(1 row)
There are some operators you can use for the query:
& (and), | (or), ! (not) as
well as <-> (followed by).
Parentheses can be used for grouping.
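For example, a (made-up) query for texts mentioning “South Tyrol” or “North Tyrol” but not “Austria” could be written like this:

select to_tsquery('(south | north) <-> tyrol & !austria');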
And, there’s the same language support as seen for the
tsvector:
select to_tsquery('german', 'Die & Bäume');

to_tsquery
------------
'baum'
(1 row)
“Die” is a stop word in German and “Bäume” has been stemmed to “baum”.
Let’s finally perform a search. Postgres uses the @@
operator to express a match operation between a tsvector
and a tsquery:
select title from enwiki where fts @@ to_tsquery('South <-> Tyrol');

2734 rows are found. That takes a whopping 90 seconds! That’s slow.
Remember:
Note
Postgres can index anything.
In this case we can again use a GIN (Generalized Inverted Index). GIN
already knows how to handle a tsvector column, so just
create the indexes like so:
create index enwiki_fts_ix on enwiki using gin(fts);
create index dewiki_fts_ix on dewiki using gin(fts);
create index itwiki_fts_ix on itwiki using gin(fts);

Again, not a super quick operation: it took 20, 10 and 4 minutes on my test machine to create the indexes, which weigh 6.3, 3.3 and 1.5 GiB (again, you need to do this just once).
Let’s repeat the full text search with the index:
select title from enwiki where fts @@ to_tsquery('South <-> Tyrol');

title
----------------------------------
Turopolje pig
2026 Winter Olympics
Baldwin Street
Bogislaw von Bonin
[...]
2010–11 Parma FC season
Tyson Sexsmith
Bergfried
(2734 rows)
Now, to find the 2734 articles out of the 7114298 that contain “South” followed by “Tyrol” we need just 40 milliseconds.
It gets even better if the query is highly selective, such as:
select title from enwiki
where fts @@ to_tsquery('squirrel & cactus & bridge & spoon'); -- remember May 2017?

title
----------------------------------------------
Phoenix, Arizona
Mister Maker
Index of Arizona-related articles
List of Encyclopædia Britannica Films titles
List of 2023 albums
List of birds of North America
(6 rows)
Timing goes down to less than 3 milliseconds!
Or even less when the words are more selective:
select title from enwiki where fts @@ to_tsquery('noi & techpark');

title
----------------------------------
Science and technology in Italy
South Tyrol
Free University of Bozen-Bolzano
Bolzano
Tyrol
NOI
Khalil Berro
NOI Techpark
Timeline of Bolzano
(9 rows)
On my test machine this takes about 0.5 milliseconds.
Of course, we can do this in other languages with the same performance:
select title from dewiki
where fts @@ to_tsquery('german', 'noi & techpark'); -- 18 results in 0.4 ms

Note
As the fts column takes up a lot of space, some people create a functional GIN on to_tsvector(...), basically doing two steps in one (remember: Postgres can index anything). The disadvantage is that you have to repeat the to_tsvector(...) expression in all queries for the index to be applied. Also, ranking can be slower if you have many results that need to be ranked. We’ll talk about ranking next.
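A minimal sketch of that alternative (the index name is made up; the expression in the query must match the indexed expression exactly):

create index enwiki_fts_fn_ix on enwiki
using gin (to_tsvector('english', title || body));

select title from enwiki
where to_tsvector('english', title || body) @@ to_tsquery('South <-> Tyrol');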
We have superfast full text search now, but there’s one problem left.
Consider the results for “South <-> Tyrol”:
title
----------------------------------
Turopolje pig
2026 Winter Olympics
Baldwin Street
Bogislaw von Bonin
[...]
2010–11 Parma FC season
Tyson Sexsmith
Bergfried
(2734 rows)
I’m sure → Turopolje pigs are nice animals, but maybe they shouldn’t be the first hit (as has randomly happened here).
So how can we rank results? If you have metadata, such as timestamps,
authors or something, you can apply a standard SQL ORDER BY
to these fields. However, even by looking at just the documents
themselves, some ranking is possible!
Postgres has a function, ts_rank_cd(), that computes the
so-called cover density ranking for a given
tsvector and tsquery. This takes into account
how often the query terms appear in the document, how close together
they appear and how this relates to the length of the document. There
are a few flags you can pass; I like flag 1 (see the →
Postgres manual for details).
Here we go:
select title, ts_rank_cd(fts, to_tsquery('South <-> Tyrol'), 1) as r
from enwiki where fts @@ to_tsquery('South <-> Tyrol') order by r desc;

or, the exact same thing without repeating the
tsquery:
with mysearch as (select to_tsquery('South <-> Tyrol') as query)
select title, ts_rank_cd(fts, query) as r
from enwiki, mysearch where fts @@ query order by r desc;

title | r
------------------------------------------+--------------
South Tyrol | 1.4759854
Bolzano | 0.44761646
Trentino-Alto Adige/Südtirol | 0.4372067
List of Italian cheeses | 0.4216553
South Tyrolean independence movement | 0.400347
History of South Tyrol | 0.3030828
[...]
Deaths in November 2022 | 0.010483125
Deaths in September 2025 | 0.010480676
Deaths in May 2025 | 0.010408239
(2734 rows)
This makes a lot more sense!
The top hits are the ones we expect to be at the top. The “death” lists are long lists of people who died that month, so the density of South Tyroleans there is low, and they’re sorted at the end.
This takes 50 milliseconds.
Warning
A query that has many hits will be slower here, since ts_rank_cd() needs to be computed on many records! If I were to build a Wikipedia search engine, I’d first try an unranked search with a LIMIT clause to avoid potentially computing the ranking for too many results. If you get more than, say, 10000 results, the terms are not very selective, so maybe fall back to searching only the titles to get fewer results and rank the bodies of just those results.
Here’s an example of this situation. A user wants to search the term
“person”. Full text searching “person” in the article bodies of the
English Wikipedia would give us 917172 results. Ranking 917172 article
bodies using ts_rank_cd() would be too costly.
So we’d run the unranked query with a limit first:

select title from enwiki where fts @@ to_tsquery('person') limit 10000;

and obtain 10000 results (in about 160 milliseconds), so we know the
search term is too generic. We hence fall back to search the
titles only. In this demo I don’t have a
tsvector for only the title prepared, so as a workaround
I’ll just use a regular expression looking for “person” with regexp word
boundaries so it can use my old trigram index. The important point is that I will
find fewer results by searching the titles only (488)
and still be able to rank the bodies:
select title, ts_rank_cd(fts, to_tsquery('person')) as r from enwiki
where title ~* '[[:<:]]person[[:>:]]'
order by r desc;

This will give relevant results in 14 milliseconds:
title | r
--------------------------------+------------
Legal person | 10.900001
One-person operation | 10.6
First-person shooter | 10.400001
Person-centred planning | 10.400001
Person–situation debate | 10
[...]
Black Moon (person) | 0.1
Er (biblical person) | 0.1
Timon of Athens (person) | 0.1
(488 rows)
Use case:
Similar to substring search, but with fuzzy
string matching. For example to offer possible corrections of misspelled
words.
Let’s move from longer texts to single words.
To obtain a list of unique words for the demo, we can extract all distinct words from the Wikipedia titles. This can be done natively in Postgres with a trick.
There is a debug function (ts_stat()) that returns
statistics about a query returning tsvector: it finds the
unique words and reports stats about their occurrences and so on.
select word from ts_stat('select to_tsvector(''simple'', ''Die Bäume'')');

The “simple” configuration has no stop words and no stemming, so it leaves the words unchanged, which is what we want here.
word
-------
die
bäume
We can “abuse” this to generate tables with the unique words of all the titles:
create table enwords as select word from ts_stat('select to_tsvector(''simple'', title) from enwiki');
create table dewords as select word from ts_stat('select to_tsvector(''simple'', title) from dewiki');
create table itwords as select word from ts_stat('select to_tsvector(''simple'', title) from itwiki');

This just takes a few seconds and leaves us with tables of unique words.
Let’s now say we want to check if a word is not in the list (and hence possibly misspelled). For example: “Postgre”. That word doesn’t exist. The name of the database system is PostgreSQL or, for short, Postgres.
A common concept to compare words, for example in the context of
finding misspelled words, is the edit distance. How
many inserts, substitutions or deletes does one need to make to go from
word A to word B? A common algorithm to compute the edit distance is the
one by Levenshtein, implemented in Postgres as
levenshtein():
with mysearch as (select 'postgre' as q)
select word, levenshtein(mysearch.q, word) as ld
from enwords, mysearch
order by ld asc limit 10;

word | ld
---------+----
postgres | 1
posture | 1
postre | 1
postrel | 2
postures | 2
postsee | 2
postrer | 2
poutre | 2
sostre | 2
potere | 2
(10 rows)
Time: 173.544 ms
OK, there are three candidates with distance 1, one of which is “postgres”.
We can speed this up a little bit if we limit the algorithm to short
edit distances by using
levenshtein_less_equal(mysearch.q, word, 2), which stops
calculating the distance once it goes past 2. This brings the time down to
120 milliseconds. A sketch of that variant follows below.
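Here’s what that variant could look like (the same query as above, with the capped distance function swapped in):

with mysearch as (select 'postgre' as q)
select word, levenshtein_less_equal(mysearch.q, word, 2) as ld
from enwords, mysearch
order by ld asc limit 10;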
Can we really speed this up? Not right away with
levenshtein(). However, by now we know trigrams! Let’s
create the index we already know also on these tables:
create index enwords_word_ix on enwords using gin(word gin_trgm_ops);
create index dewords_word_ix on dewords using gin(word gin_trgm_ops);
create index itwords_word_ix on itwords using gin(word gin_trgm_ops);

Trigrams also offer a similarity function that computes the overlap
of trigrams between two words (similarity()) and a matching
operator that returns true if the similarity is above a (configurable)
threshold (%).
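The threshold used by % defaults to 0.3 and can be changed per session, for example:

set pg_trgm.similarity_threshold = 0.5;   -- default is 0.3
show pg_trgm.similarity_threshold;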
with mysearch as (select 'postgre' as q)
select word, similarity(word, mysearch.q) as s
from enwords, mysearch
where word % mysearch.q
order by s desc limit 10;

word | s
------------+------------
postgres | 0.7
postgresql | 0.5833333
postgres-xl | 0.53846157
postre | 0.5
postground | 0.46153846
postgis | 0.45454547
posture | 0.45454547
post | 0.44444445
postgame | 0.41666666
postgate | 0.41666666
(10 rows)
Time: 5.052 ms
Here we go, this also gives us good matches and is a lot faster. The matches are different from the ones computed through the edit distance. However, the % operator in the where clause can use the GIN index, and timing goes down to 5 milliseconds.
One technique does not exclude the other. We can quickly compute
levenshtein() only on the records that survive the where
clause:
with mysearch as (select 'postgre' as q)
select word, similarity(word, mysearch.q) as s, levenshtein(mysearch.q, word) as ld
from enwords, mysearch where word % mysearch.q order by s desc limit 10;

word | s | ld
------------+------------+----
postgres | 0.7 | 1
postgresql | 0.5833333 | 3
postgres-xl | 0.53846157 | 4
postre | 0.5 | 1
postground | 0.46153846 | 4
postgis | 0.45454547 | 2
posture | 0.45454547 | 1
post | 0.44444445 | 3
postgame | 0.41666666 | 2
postgate | 0.41666666 | 2
(10 rows)
Time: 5.025 ms
Here we can further restrict the possible misspelling by adding a
condition on the edit distance (ld) and sort by edit
distance and similarity:
with mysearch as (select 'postgre' as q),
mystep1 as (select word,
similarity(word, mysearch.q) as s,
levenshtein(mysearch.q, word) as ld
from enwords, mysearch
where word % mysearch.q)
select * from mystep1 where ld <= 2 order by ld, s desc limit 10;

This gives good and quick results:
word | s | ld
----------+------------+----
postgres | 0.7 | 1
postre | 0.5 | 1
posture | 0.45454547 | 1
postgis | 0.45454547 | 2
postgame | 0.41666666 | 2
postaire | 0.41666666 | 2
postgate | 0.41666666 | 2
poste | 0.4 | 2
poster | 0.36363637 | 2
postie | 0.36363637 | 2
(10 rows)
Time: 5.143 ms
Note
Postgres’ built-in fuzzystrmatch extension also has algorithms that help identify similar-sounding words (as opposed to misspelled words), such as the original Soundex algorithm or the Daitch-Mokotoff Soundex algorithm (see the → Postgres manual for details).
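For example (just a quick illustration of the Soundex functions, not part of the benchmark):

select soundex('Tyrol'), soundex('Tirol');
-- difference() compares the Soundex codes of two words and returns 0..4 (4 = identical codes)
select difference('Tyrol', 'Tirol');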
Use case:
You want to search by meaning (semantic search),
not relying on string matching techniques.
Let’s first talk about embeddings. In this context, embedding means converting a text into a vector (if you’re not familiar with math terms, just think of a vector as an arrow pointing in a specific direction).
Let’s pick an example. Imagine we map genres in novels to the directions of a compass rose. Let’s say Fantasy & Science Fiction points north, Mystery & Crime points east, Romance points south, Historical Fiction points west.
So, “Lord of the Rings” points north, “Sherlock Holmes” points east, “Romeo and Juliet” points south.
What about Jane Austen’s “Pride and Prejudice”? Well, there’s romance and historical fiction. We could say it points south-west.
Now, imagine you need to rank novels by how close they are to a new query, for example the biography of “Bonnie and Clyde” (technically not fiction…). We could point that south-east. Let’s compute the distances (angles) between “Bonnie and Clyde” and all the other novels and sort them by distance:
| novel | distance (deg) |
|---|---|
| Romeo and Juliet | 45 |
| Sherlock Holmes | 45 |
| Pride and Prejudice | 90 |
| Lord of the Rings | 135 |
Suddenly something as abstract as novel genres maps into a space we can do computations on: we can compute the similarity of two entities or rank one by similarity to all the others!
If you think further about this, problems arise. The compass rose has two dimensions and there is not “enough space” to fit all possible abstract criteria. Where would you place a romantic sci-fi novel? Why should Romance be the opposite of Sci-Fi? The example starts to fall short. It turns out that we need more space. A lot more space! Today, embedding maps text into vectors in a high-dimensional space: typically 384 to 1024 dimensions.
There are Postgres extensions that perform embedding inside Postgres. However, the lingua franca of all things related to machine learning and its application today is Python, and I recommend staying in that ecosystem if you want to do embeddings. Implementing the embedding code as a Postgres client looks easier and more maintainable to me.
The standard Python package to perform embedding is → sentence transformers, based on PyTorch.
Embedding can be done using pre-trained neural network models. These models have been trained in such a way that sentences with a similar meaning map to vectors that point into a similar direction. Many pre-trained models are available as open-weight models. They come in a large variety. Some specialize on shorter or longer texts. Some know only a single language, some are multilingual.
A great, multilingual, very flexible MIT-licensed open-weight model is → bge-m3.
For this demo I’m using its cousin, the smaller and faster, English-only → bge-small-en-v1.5, under the same license. It takes a text spanning from a few words to a few sentences (max. 512 tokens, about 1000 chars or so) and maps it into a vector of 384 dimensions, which is represented by an array of 384 floats.
Let’s try this on the titles of the EN Wikipedia! We create a table where we store the IDs, titles and the resulting embedding vector:
create table embeddings (id bigint, title varchar, embedding vector(384));

The data type vector is implemented by the popular → pgvector
extension.
The Python program embed.py reads
(id, title) from enwiki, runs
title through bge-small-en-v1.5 and stores
(id, title, embedding) into the new table
embeddings.
# pip install "psycopg[binary]"
# pip install torch --index-url https://download.pytorch.org/whl/cpu
# pip install sentence-transformers
import time
import psycopg
from sentence_transformers import SentenceTransformer
cursor = psycopg.connect("host='127.0.0.1' dbname='chris' user='chris'").cursor()

# load the pre-trained embedding model (downloaded beforehand)
model = SentenceTransformer('BAAI/bge-small-en-v1.5', local_files_only=True)

batch_size = 1000
cnt = 0
id = 1

while True:

    # fetch the next batch of (id, title) rows
    t0 = time.time()
    cursor.execute("select id, title from enwiki " +
                   "where id between %s and %s;", [id, id + batch_size - 1])
    results = cursor.fetchall()
    if len(results) == 0:
        break
    titles = []
    for result in results:
        titles.append(result[1])
    t_select = time.time() - t0

    # embed the whole batch of titles in a single call
    t0 = time.time()
    embeddings = model.encode(titles)
    t_embed = time.time() - t0

    # store (id, title, embedding) and commit the batch
    t0 = time.time()
    for i in range(0, len(results)):
        cursor.execute("insert into embeddings values (%s, %s, %s)",
                       [results[i][0], titles[i], embeddings[i].tolist()])
    cursor.connection.commit()
    t_store = time.time() - t0

    cnt += len(results)
    print("%d done, last batch: %.3f select, %.3f embed, %.3f insert" %
          (cnt, t_select, t_embed, t_store))
    id += batch_size

print("%d embeddings done" % (cnt))

The script works in batches of 1000 records, keyed by ID, so it can be restarted from a given ID if it is interrupted. For this to work efficiently, make sure id is a primary key or at least has an index.
On my test machine, this runs at about 300 embeddings per second, and
takes about 6 hours for the 7.1 M titles in the EN Wikipedia. The new
table embeddings weights about 11 GiB.
Note
You can embed longer texts with models that support that, for example bge-m3 supports up to 8192 tokens (20000 chars or so). However, to perform semantic searches on, say, entire books, you need to chunk the text. This is one of the ingredients of RAG, for example. Our Wikipedia article bodies would be too long to meaningfully embed each one into a single vector. An abstract would be a good use case, but Wikipedia doesn’t provide them.
Warning
Unlike the generated column for the precomputed tsvector in the full text search example, the embedding column was computed by a client program and therefore is not automatically kept up to date by Postgres.
What can we do with this? Well, now we can take a query, embed it into a vector and rank all other vectors by semantic similarity to the one we’re looking for!
As an example, let’s see what titles match the query:
“interesting places for sightseeing in the capital of France”
The Python program query.py does this.
# pip install "psycopg[binary]"
# pip install torch --index-url https://download.pytorch.org/whl/cpu
# pip install sentence-transformers
import os
import time
import psycopg
from sentence_transformers import SentenceTransformer
cursor = psycopg.connect("host='127.0.0.1' dbname='chris' user='chris'").cursor()
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
# embed the query text into a 384-dimensional vector
f = model.encode("interesting places for sightseeing in the capital of France")

t0 = time.time()
# fetch the 10 titles whose embeddings are closest (by cosine distance) to the query vector
cursor.execute(
    """
    select title,
           1.0 - (embedding <=> %s::vector) as sim
    from embeddings order by embedding <=> %s::vector limit 10
    """, [f.tolist(), f.tolist()])
results = cursor.fetchall()
t1 = time.time()

print("%.3f s" % (t1 - t0))
for result in results:
    print("%6.4f, %s" % (result[1], result[0]))

The program embeds the query, then runs
select title,
1.0 - (embedding <=> %s::vector) as sim
from embeddings
order by embedding <=> %s::vector limit 10

which lists the top 10 most similar titles.
Here, %s is the placeholder for the float array,
substituted before sending the SQL to Postgres.
<=> is the so-called cosine distance operator provided by pgvector.
It measures how far apart the directions of two vectors are: 1 minus the cosine of the angle between them, so 0 means they point in the same direction and 2 means they point in opposite directions.
The result is:
1.655 s
0.8158, List of historic places in Capitale-Nationale
0.8109, List of capitals of France
0.8103, List of tourist attractions in Paris
0.8076, Landmarks in Paris
0.7897, Places to Visit
0.7852, City Sightseeing
0.7810, French Towns and Lands of Art and History
0.7806, Tourism in France
0.7784, Galeries de la Capitale
0.7774, List of protected heritage sites in Les Bons Villers
As you can see, the titles are pretty close to the meaning of the query!
This works best when the context is at least a few words up to a few
sentences. For example, the title Paris is missing here.
Remember, we didn’t embed the whole article, just the title. The single
word Paris alone apparently isn’t close enough to our
search term about sightseeing.
1.6 seconds is fast, right? Well, it is, but note there is no index yet!
By now, you know the drill:
Note
Postgres can index anything.
The most performant index to speed this up is an HNSW
(Hierarchical Navigable Small World) index with the
vector_cosine_ops operator class.
CREATE INDEX embeddings_ix ON embeddings USING hnsw (embedding vector_cosine_ops);

If you have enough RAM (set maintenance_work_mem to at
least the amount of memory the index to be created will occupy), this is
reasonably fast (18 minutes on my temporarily scaled-up test machine).
Otherwise it still runs fine, but takes much longer. The resulting index
is 14 GiB.
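For reference, the scaled-up build session looked roughly like this (the values are just examples; parallel HNSW builds need pgvector 0.6.0 or later):

set maintenance_work_mem = '16GB';          -- example value: at least the expected index size
set max_parallel_maintenance_workers = 4;   -- pgvector can build HNSW in parallel
-- then run the CREATE INDEX statement above in the same session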
Let’s run the query.py again:
0.005 s
0.8158, List of historic places in Capitale-Nationale
0.8109, List of capitals of France
0.8103, List of tourist attractions in Paris
0.8076, Landmarks in Paris
0.7810, French Towns and Lands of Art and History
0.7806, Tourism in France
0.7774, List of protected heritage sites in Les Bons Villers
0.7750, List of museums in France
0.7749, List of most-visited museums in France
0.7732, Tourism in Paris
Two points.
First. The HNSW indexing is superfast. We computed the top-ten closest vectors among 7.1 million vectors in 5 milliseconds.
An important point is that the
order by embedding <=> %s::vector limit 10 is
performed by HNSW itself. Postgres is not computing 7.1 million distances, then
sorting, then showing the first ten: it actually uses HNSW to find
the top-10 closest vectors!
That’s why it’s important to write the query this way, so Postgres’ planner can apply the index algorithm.
Second. HNSW is approximate. As you can see, the results are not exactly identical with and without the index.
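pgvector lets you trade accuracy for speed at query time via the hnsw.ef_search parameter (how many graph candidates are examined; the default is 40). For example:

set hnsw.ef_search = 100;   -- higher = better recall, slower queries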
Note
People combine semantic search with lexical search! You can just do both and show the combined ranking results (hybrid search). You can also perform one type of search to obtain the top-N results and then rank them using the other type of search (reranking).
Let’s try to run our query example again, but additionally
rerank the 10 results using
ts_rank_cd().
What can we give as full text search term for the ranking? If we look
for all the words, we risk getting ranks of 0.0, because
not all single words of the search term are present in all articles.
Let’s just OR the words. Here are the updated parts in
query.py:
f = model.encode("interesting places for sightseeing in the capital of France")
k = "interesting | places | for | sightseeing | in | the | capital | of | France"
t0 = time.time()
cursor.execute(
"""
select title, ts_rank_cd(fts, to_tsquery(%s)) as r from enwiki where enwiki.id in (
select id from embeddings order by embedding <=> %s::vector limit 10
) order by r desc
""", [k, f.tolist()])
results = cursor.fetchall()
t1 = time.time()

Here is the output. We get the same 10 results, of course, but the ranking within the 10 results is arguably even better!
0.007 s
5.6000, Landmarks in Paris
5.4000, Tourism in France
1.9000, List of tourist attractions in Paris
1.8000, Tourism in Paris
1.6000, List of capitals of France
1.3000, List of museums in France
0.8000, French Towns and Lands of Art and History
0.4000, List of historic places in Capitale-Nationale
0.4000, List of most-visited museums in France
0.0000, List of protected heritage sites in Les Bons Villers
Note
Besides text, there are models that can embed other media. → CLIP is an interesting model that embeds texts and images into the same vector space! That means you can search images using images, or search images using text, or any combination. Or you can compute semantic similarity between images.
(C) 2026 Chris Mair (license: CC BY
4.0)