Chris Mair (→ 1006.org)
Presentation at the NOI Techpark Developers’ Thursday on 2026-05-07
As example text corpus, I picked the English (en), German (de) and Italian (it) Wikipedia dumps.
All articles were extracted from the XML dumps from → dumps.wikimedia.org, up to date as of 2026-04-01. Disambiguation, talk, category etc. pages were skipped.
The native MediaWiki format is really weird. Almost nothing except
the MediaWiki backend can parse it. Pandoc, even in its latest version,
chokes on many pages. The least troublesome tool I found was the aptly
named →
mwparserfromhell. I’ve used its .strip_code() method to
extract the plain text from the MediaWiki pages (not perfect, but
workable).
I’ve imported the article titles and bodies into simple Postgres
tables enwiki, dewiki and
itwiki:
column | type
-------+-------------------
id | bigint
title | character varying
body | character varying
Some stats:
| | enwiki | dewiki | itwiki |
|---|---|---|---|
| articles | 7 114 298 | 2 745 523 | 1 801 598 |
| chars | 27 851 734 786 | 12 148 201 273 | 5 772 994 910 |
| PG table size | 17.71 GiB | 8.09 GiB | 3.87 GiB |
| title length (chars) | 20 ± 11 | 20 ± 11 | 20 ± 12 |
| body length (chars) | 3895 ± 7426 | 4405 ± 8037 | 3184 ± 6537 |
The DE Wikipedia has both the longest title (219 chars) and the longest body (1 517 425 chars).
Note
Fulfilling a cliché about Germany, that 219 char title is the name of a German federal law (→ link for the curious). The longest body is the article about the name “Karl”, with a long list of people named “Karl” (→ link for the curious).
Everything was done on a VM with 16 GiB of RAM and 4 CPU threads of an AMD EPYC 9555P and 250 GiB of network attached block storage at 15k IOPS provided by → Scaleway in Milan.
Only for the creation of the HNSW index, the machine was briefly
scaled up to speed up the process (the HNSW index creation is much
faster if you increase maintenance_work_mem, and I was doing
this right before the talk ;).
I’m using → Postgres 17.9
on Debian 13 with the extensions pg_trgm (built-in),
fuzzystrmatch (built-in) and pgvector (third
party: →
pgvector).
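For completeness, once the pgvector package is installed, the three extensions can be enabled per database with:

create extension if not exists pg_trgm;
create extension if not exists fuzzystrmatch;
create extension if not exists vector;   -- pgvector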
Tuning parameters were:
shared_buffers: 4GB
maintenance_work_mem: 1GB
work_mem: 128MB
wal_level: minimal
and replication settings were off.
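For reference, these can be set with ALTER SYSTEM (a sketch, not necessarily the exact commands I used):

alter system set shared_buffers = '4GB';
alter system set maintenance_work_mem = '1GB';
alter system set work_mem = '128MB';
alter system set wal_level = 'minimal';
alter system set max_wal_senders = 0;   -- wal_level = minimal requires this
-- shared_buffers, wal_level and max_wal_senders need a server restart to take effect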
Regarding benchmarks, all timings were taken by repeating each query multiple times and measuring with a warm cache. Note that Postgres never caches query results, but both Postgres and Linux do cache file system pages in RAM.
Use case:
You just want to find substrings in shorter
texts, think the titles of our Wikipedia articles.
Sometimes the simplest solution is all you need.
Let’s for example search for article titles in the EN Wikipedia that
contain the substring “South Tyrol”. We can use the standard SQL
like operator (or ilike for a case-insensitive
match in Postgres):
select title from enwiki
where title ilike '%South Tyrol%'
order by title;

Cool. 68 results:
title
------------------------------------------
Ahr (South Tyrol)
Andrian, South Tyrol
Auer, South Tyrol
[...]
Valleys of South Tyrol
Weißhorn (South Tyrol)
We South Tyroleans
Problem: this takes 1.3 seconds on my test machine.
Actually… that’s not so slow. But, of course we can do better using an index!
We cannot use a standard B-tree index here, though. B-trees rely on ordered data, like the index in a text book. However, when you search for a substring, you don’t know how the string starts, so there is no ordering to exploit.
No problem. Remember:
Note
Postgres can index anything.
We can use two tricks from the toolbox of computer science.
One trick is splitting the string up into three-letter chunks, so-called trigrams. Here are the trigrams for “Valleys of South Tyrol”:

select show_trgm('Valleys of South Tyrol');

This gives the (unordered) chunks:
show_trgm
-----------------------------------------------------------------
{" o"," s"," t"," v"," of"," so"," ty"," va",all,eys,ley,lle,
"of ","ol ",out,rol,sou,"th ",tyr,uth,val,yro,"ys "}
The second trick is using a GIN (Generalized Inverted Index), an index structure that can speed up searching for subsets that are contained in sets, regardless of any ordering.
So, looking for “South Tyrol” will find “Valleys of South Tyrol” because the (unordered) subset {" s"," t"," so"," ty","ol ",out,rol,sou,"th ",tyr,uth,yro} is contained in the above set.
Sounds complicated? It isn’t. Just create the index specifying you want to use the GIN algorithm, and you want it to work on trigrams:
create index enwiki_title_ix on enwiki
using gin(title gin_trgm_ops);

That’s it.
The index weighs 388 MiB for the 141 million chars in the 7.1 million titles. It took 30 seconds to create it on my test machine. As is true for all indexes, once you create it, it will be automatically kept up to date if the data changes.
Finding the 68 matching titles on my test machine now takes 10 milliseconds, down from 1300 milliseconds without the index!
Note
Are you a fan of JSON and semi-structured databases? No problem, if your strings are inside some JSON field, everything works exactly the same way. You can create the index on the expression that extracts the field from a jsonb column. Whenever you use the same expression, Postgres will be able to use the index.
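A minimal sketch of that (the table docs and its jsonb column doc are made up for illustration):

create table docs (id bigint, doc jsonb);

create index docs_title_trgm_ix on docs
using gin ((doc ->> 'title') gin_trgm_ops);

-- this query can use the index because it repeats the same expression
select doc from docs where doc ->> 'title' ilike '%South Tyrol%';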
Note
Besides LIKE and ILIKE, this type of index also works for the regular expression operators ~ and ~*.
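For example (an illustrative query, not one from the benchmark):

select title from enwiki
where title ~* 'south tyrol(ean)?'
order by title;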
Use case:
You want to find natural-language documents that
contain given query terms and optionally rank them by relevance to the
query.
A document is a text of any length, but, typically, when we talk about Full Text Search we’re talking about articles, posts, longer texts, not just titles. For the demo, we’re going to use the Wikipedia article bodies.
Consider that a page of text typed on a typewriter is about 2000 chars. Then, in our English Wikipedia corpus, 47% of the article bodies are longer than 1 page and 8.1% are longer than 5 pages.
We could create a GIN/trigram index on the articles body and use
LIKE. However, there are a number of problems with such a
simple approach!
Performance
The smaller the search term and the larger the texts, the more
inefficient the GIN/trigram approach becomes. To find the 3038 article
bodies matching “South Tyrol”, a GIN on trigrams brings the query time
from 75 seconds to 3 seconds. A big improvement, but not a good end
result. This is because the larger bodies contain many trigrams. Finding
a small subset in a large set is slower and gives candidate matches that
end up not matching the LIKE operation.
As a matter of fact, for this example when using the GIN/trigram approach on the bodies, there are 16321 candidate matches, of which 13283 are removed by a recheck. By contrast, for the previous example querying the title, there are 68 candidate matches and every one turns out to be correct.
Linguistic support
The LIKE operator doesn’t know anything about word
boundaries, or tokenization in general. If we’re looking for “Car” we
don’t want the system to find documents about a “Carpet”.
In the same way, LIKE doesn’t know anything about
languages. It doesn’t know “Car” and “Cars” are the same word in
English, one the plural of the other. It also has no concept of a stop
word. Stop words are words that should be ignored in the given language,
like “the”, or “a” in English.
Ranking
Transforming an article body into trigrams doesn’t conserve any
information about the number of occurrences, positions or the overall
density of the search terms in the text. These metrics are normally used
to rank the relevance of the documents. So ranking is not really possible
when all you have is the LIKE operator.
Postgres has support for full text search, which solves all these problems!
The tsvector data type (text
search vector) is the key here. To
transform a text into a tsvector we can use the
to_tsvector() function.
Let’s pick a sentence from the article about computers:
“A computer is a machine that can be programmed to automatically carry out sequences of arithmetic or logical operations (computation).”
and transform it into a tsvector:
select to_tsvector('A computer is a machine that can be programmed to automatically ' ||
'carry out sequences of arithmetic or logical operations (computation).');

to_tsvector
-----------------------------------------------------------------
'arithmet':16 'automat':11 'carri':12 'comput':2,20 'logic':18
'machin':5 'oper':19 'program':9 'sequenc':14
As you can see: stop words like “a”, “is”, “that” and “to” have been removed, the remaining words have been stemmed (“computer” becomes “comput”) and the position of each lexeme in the text is recorded (“comput” appears at positions 2 and 20).
Postgres 17 supports 29 languages (list them in psql
using \dF)! You can specify the language with a
configuration parameter (default_text_search_config) or by
specifying the language in the function call. Let’s try German and
Italian:
select to_tsvector('german', 'Ein Computer oder Rechner ist ein Gerät, das mittels ' ||
'programmierbarer Rechenvorschriften Daten verarbeitet.');

to_tsvector
-----------------------------------------------------------------
'comput':2 'dat':12 'gerat':7 'mittel':9 'programmierbar':10 'rechenvorschrift':11
'rechn':4 'verarbeitet':13
select to_tsvector('italian', 'Un computer, in italiano anche noto come elaboratore o ' ||
'calcolatore, è una macchina automatizzata programmabile in grado di eseguire sia ' ||
'complessi calcoli matematici (calcolatore) sia altri tipi di elaborazioni dati ' ||
'(elaboratore).');

to_tsvector
-----------------------------------------------------------------
'altri':26 'automatizz':14 'calcol':10,22,24 'compless':21 'computer':2 'dat':30 'elabor':8,29,31
'esegu':19 'grad':17 'ital':4 'macchin':13 'matemat':23 'not':6 'programm':15 'tip':27
As you can see, stop words and stemming follow (roughly) the rules of each language.
OK, so let’s work on our tables! We want to precompute
to_tsvector(body). Postgres has a nice feature,
generated columns, that allows us to define and
materialize a column in a table that is computed from the other columns.
This column is automatically updated whenever records in the table
change:
alter table enwiki add column fts tsvector
generated always as ( to_tsvector('english', title || body) ) stored;
alter table dewiki add column fts tsvector
generated always as ( to_tsvector('german', title || body) ) stored;
alter table itwiki add column fts tsvector
generated always as ( to_tsvector('italian', title || body) ) stored;

Note that we included both the title and the body in our new
fts column with the tsvector.
On my test machine materializing the new column took quite some time,
44 minutes for enwiki, 20 minutes for dewiki
and 10 minutes for itwiki. Remember that you need to do
this just once, though. From now on, the generated column is always
automatically updated whenever the source data changes.
So far so good. How do we query this? How do we actually perform a full text search?
There is a second ingredient in the mix: the tsquery
data type! A query to find all occurrences of “South” followed by
“Tyrol” would look like this:
select to_tsquery('South <-> Tyrol');

to_tsquery
---------------------
'south' <-> 'tyrol'
(1 row)
There are some operators you can use for the query:
& (and), | (or), ! (not) as
well as <-> (followed by).
Parentheses can be used for grouping.
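For example, a (made-up) query for texts mentioning “South Tyrol” or “North Tyrol” but not “Austria” could be written like this:

select to_tsquery('(south | north) <-> tyrol & !austria');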
And, there’s the same language support as seen for the
tsvector:
select to_tsquery('german', 'Die & Bäume');

to_tsquery
------------
'baum'
(1 row)
“Die” is a stop word in German and “Bäume” has been stemmed to “baum”.
Let’s finally perform a search. Postgres uses the @@
operator to express a match operation between a tsvector
and a tsquery:
select title from enwiki where fts @@ to_tsquery('South <-> Tyrol');

2734 rows are found. That takes a whopping 90 seconds! That’s slow.
Remember:
Note
Postgres can index anything.
In this case we can again use a GIN (Generalized Inverted Index). GIN
already knows how to handle a tsvector column, so just
create the indexes like so:
create index enwiki_fts_ix on enwiki using gin(fts);
create index dewiki_fts_ix on dewiki using gin(fts);
create index itwiki_fts_ix on itwiki using gin(fts);

Again, not a super quick operation: it took 20, 10 and 4 minutes on my test machine to create the indexes, which weigh 6.3, 3.3 and 1.5 GiB (again, you need to do this just once).
Let’s repeat the full text search with the index:
select title from enwiki where fts @@ to_tsquery('South <-> Tyrol');

title
----------------------------------
Turopolje pig
2026 Winter Olympics
Baldwin Street
Bogislaw von Bonin
[...]
2010–11 Parma FC season
Tyson Sexsmith
Bergfried
(2734 rows)
Now, to find the 2734 articles out of the 7114298 that contain “South” followed by “Tyrol” we need just 40 milliseconds.
It gets even better if the query is highly selective, such as:
select title from enwiki
where fts @@ to_tsquery('squirrel & cactus & bridge & spoon'); -- remember May 2017?

title
----------------------------------------------
Phoenix, Arizona
Mister Maker
Index of Arizona-related articles
List of Encyclopædia Britannica Films titles
List of 2023 albums
List of birds of North America
(6 rows)
Timing goes down to less than 3 milliseconds!
Or even less when the words are more selective:
select title from enwiki where fts @@ to_tsquery('noi & techpark');

title
----------------------------------
Science and technology in Italy
South Tyrol
Free University of Bozen-Bolzano
Bolzano
Tyrol
NOI
Khalil Berro
NOI Techpark
Timeline of Bolzano
(9 rows)
On my test machine this takes about 0.5 milliseconds.
Of course, we can do this in other languages with the same performance:
select title from dewiki
where fts @@ to_tsquery('german', 'noi & techpark'); -- 18 results in 0.4 ms

Note
As the fts column takes up a lot of space, some people create a functional GIN on to_tsvector(...), basically doing two steps in one (remember: Postgres can index anything). The disadvantage is that you have to repeat the to_tsvector(...) expression in all queries for the index to be applied. Also, ranking can be slower if you have many results that need to be ranked. We’ll talk about ranking next.
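A minimal sketch of that alternative (the index name is made up; the expression in the query must match the indexed expression exactly):

create index enwiki_fts_fn_ix on enwiki
using gin (to_tsvector('english', title || body));

select title from enwiki
where to_tsvector('english', title || body) @@ to_tsquery('South <-> Tyrol');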
We have superfast full text search now, but there’s one problem left.
Consider the results for “South <-> Tyrol”:
title
----------------------------------
Turopolje pig
2026 Winter Olympics
Baldwin Street
Bogislaw von Bonin
[...]
2010–11 Parma FC season
Tyson Sexsmith
Bergfried
(2734 rows)
I’m sure → Turopolje pigs are nice animals, but maybe they shouldn’t be the first hit (as has randomly happened here).
So how can we rank results? If you have metadata, such as timestamps,
authors or something, you can apply a standard SQL ORDER BY
to these fields. However, even by looking at just the documents
themselves, some ranking is possible!
Postgres has a function, ts_rank_cd(), that computes the
so-called cover density ranking for a given
tsvector and tsquery. This takes into account
how often the query terms appear in the document, how close together
they appear and how this relates to the length of the document. There
are a few flags you can pass; I like flag 1 (see the →
Postgres manual for details).
Here we go:
select title, ts_rank_cd(fts, to_tsquery('South <-> Tyrol'), 1) as r
from enwiki where fts @@ to_tsquery('South <-> Tyrol') order by r desc;

or, the exact same thing without repeating the
tsquery:
with mysearch as (select to_tsquery('South <-> Tyrol') as query)
select title, ts_rank_cd(fts, query) as r
from enwiki, mysearch where fts @@ query order by r desc;

title | r
------------------------------------------+--------------
South Tyrol | 1.4759854
Bolzano | 0.44761646
Trentino-Alto Adige/Südtirol | 0.4372067
List of Italian cheeses | 0.4216553
South Tyrolean independence movement | 0.400347
History of South Tyrol | 0.3030828
[...]
Deaths in November 2022 | 0.010483125
Deaths in September 2025 | 0.010480676
Deaths in May 2025 | 0.010408239
(2734 rows)
This makes a lot more sense!
The top hits are the ones we expect to be at the top. The “death” lists are long lists of people who died that month, so the density of South Tyroleans there is low, and they’re sorted at the end.
This takes 50 milliseconds.
Warning
A query that has many hits will be slower here, since ts_rank_cd() needs to be computed on many records! If I were to build a Wikipedia search engine, I’d first try an unranked search with a LIMIT clause to avoid potentially computing the ranking for too many results. If you get more than, say, 10000 results, the terms are not very selective, so maybe fall back to searching only the titles to get fewer results and rank the bodies of just those results.
Here’s an example of this situation. A user wants to search the term
“person”. Full text searching “person” in the article bodies of the
English Wikipedia would give us 917172 results. Ranking 917172 article
bodies using ts_rank_cd() would be too costly.
So we’d run the unranked query with a limit first:

select title from enwiki where fts @@ to_tsquery('person') limit 10000;

and obtain 10000 results (in about 160 milliseconds), so we know the
search term is too generic. We hence fall back to search the
titles only. In this demo I don’t have a
tsvector for only the title prepared, so as a workaround
I’ll just use a regular expression looking for “person” with regexp word
boundaries so it can use my old trigram index. The important point is that I will
find fewer results by searching the titles only (488)
and still be able to rank the bodies:
select title, ts_rank_cd(fts, to_tsquery('person')) as r from enwiki
where title ~* '[[:<:]]person[[:>:]]'
order by r desc;

This will give relevant results in 14 milliseconds:
title | r
--------------------------------+------------
Legal person | 10.900001
One-person operation | 10.6
First-person shooter | 10.400001
Person-centred planning | 10.400001
Person–situation debate | 10
[...]
Black Moon (person) | 0.1
Er (biblical person) | 0.1
Timon of Athens (person) | 0.1
(488 rows)
Use case:
Similar to substring search, but with fuzzy
string matching. For example to offer possible corrections of misspelled
words.
Let’s move from longer texts to single words.
To obtain a list of unique words for the demo, we can extract all distinct words from the Wikipedia titles. This can be done natively in Postgres with a trick.
There is a debug function (ts_stat()) that returns
statistics about a query returning tsvector: it finds the
unique words and reports stats about their occurrences and so on.
select word from ts_stat('select to_tsvector(''simple'', ''Die Bäume'')');

The “simple” configuration has no stop words and no stemming, so it leaves the words unchanged, which is what we want here.
word
-------
die
bäume
We can “abuse” this to generate tables with the unique words of all the titles:
create table enwords as select word from ts_stat('select to_tsvector(''simple'', title) from enwiki');
create table dewords as select word from ts_stat('select to_tsvector(''simple'', title) from dewiki');
create table itwords as select word from ts_stat('select to_tsvector(''simple'', title) from itwiki');

This just takes a few seconds and leaves us with tables of unique words.
Let’s now say we want to check if a word is not in the list (and hence possibly misspelled). For example: “Postgre”. That word doesn’t exist. The name of the database system is PostgreSQL or, for short, Postgres.
A common concept to compare words, for example in the context of
finding misspelled words, is the edit distance. How
many inserts, substitutions or deletes does one need to make to go from
word A to word B? A common algorithm to compute the edit distance is the
one by Levenshtein, implemented in Postgres as
levenshtein():
with mysearch as (select 'postgre' as q)
select word, levenshtein(mysearch.q, word) as ld
from enwords, mysearch
order by ld asc limit 10;

word | ld
---------+----
postgres | 1
posture | 1
postre | 1
postrel | 2
postures | 2
postsee | 2
postrer | 2
poutre | 2
sostre | 2
potere | 2
(10 rows)
Time: 173.544 ms
OK, there are three candidates with distance 1, one of which is “postgres”.
We can speed this up a little bit if we limit the algorithm to short
edit distances by using
levenshtein_less_equal(mysearch.q, word, 2), which stops
calculating the distance once it goes past 2. This brings the time down to
120 milliseconds. A sketch of that variant follows below.
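Here’s what that variant could look like (the same query as above, with the capped distance function swapped in):

with mysearch as (select 'postgre' as q)
select word, levenshtein_less_equal(mysearch.q, word, 2) as ld
from enwords, mysearch
order by ld asc limit 10;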
Can we really speed this up? Not right away with
levenshtein(). However, by now we know trigrams! Let’s
create the index we already know also on these tables:
create index enwords_word_ix on enwords using gin(word gin_trgm_ops);
create index dewords_word_ix on dewords using gin(word gin_trgm_ops);
create index itwords_word_ix on itwords using gin(word gin_trgm_ops);

Trigrams also offer a similarity function that computes the overlap
of trigrams between two words (similarity()) and a matching
operator that returns true if the similarity is above a (configurable)
threshold (%).
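The threshold used by % defaults to 0.3 and can be changed per session, for example:

set pg_trgm.similarity_threshold = 0.5;   -- default is 0.3
show pg_trgm.similarity_threshold;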
with mysearch as (select 'postgre' as q)
select word, similarity(word, mysearch.q) as s
from enwords, mysearch
where word % mysearch.q
order by s desc limit 10;

word | s
------------+------------
postgres | 0.7
postgresql | 0.5833333
postgres-xl | 0.53846157
postre | 0.5
postground | 0.46153846
postgis | 0.45454547
posture | 0.45454547
post | 0.44444445
postgame | 0.41666666
postgate | 0.41666666
(10 rows)
Time: 5.052 ms
Here we go, this also gives us good matches and is a lot faster. The matches are different from the ones computed through the edit distance. However, the % operator in the where clause can use the GIN index, and timing goes down to 5 milliseconds.
One technique does not exclude the other. We can quickly compute
levenshtein() only on the records that survive the where
clause:
with mysearch as (select 'postgre' as q)
select word, similarity(word, mysearch.q) as s, levenshtein(mysearch.q, word) as ld
from enwords, mysearch where word % mysearch.q order by s desc limit 10;

word | s | ld
------------+------------+----
postgres | 0.7 | 1
postgresql | 0.5833333 | 3
postgres-xl | 0.53846157 | 4
postre | 0.5 | 1
postground | 0.46153846 | 4
postgis | 0.45454547 | 2
posture | 0.45454547 | 1
post | 0.44444445 | 3
postgame | 0.41666666 | 2
postgate | 0.41666666 | 2
(10 rows)
Time: 5.025 ms
Here we can further restrict the possible misspelling by adding a
condition on the edit distance (ld) and sort by edit
distance and similarity:
with mysearch as (select 'postgre' as q),
mystep1 as (select word,
similarity(word, mysearch.q) as s,
levenshtein(mysearch.q, word) as ld
from enwords, mysearch
where word % mysearch.q)
select * from mystep1 where ld <= 2 order by ld, s desc limit 10;

This gives good and quick results:
word | s | ld
----------+------------+----
postgres | 0.7 | 1
postre | 0.5 | 1
posture | 0.45454547 | 1
postgis | 0.45454547 | 2
postgame | 0.41666666 | 2
postaire | 0.41666666 | 2
postgate | 0.41666666 | 2
poste | 0.4 | 2
poster | 0.36363637 | 2
postie | 0.36363637 | 2
(10 rows)
Time: 5.143 ms
Note
Postgres’ built-in fuzzystrmatch extension also has algorithms that help identify similar-sounding words (as opposed to misspelled words), such as the original Soundex algorithm or the Daitch-Mokotoff Soundex algorithm (see the → Postgres manual for details).
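For example (just a quick illustration of the Soundex functions, not part of the benchmark):

select soundex('Tyrol'), soundex('Tirol');
-- difference() compares the Soundex codes of two words and returns 0..4 (4 = identical codes)
select difference('Tyrol', 'Tirol');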
Use case:
You want to search by meaning (semantic search),
not relying on string matching techniques.
Let’s first talk about embeddings. In this context, embedding means converting a text into a vector (if you’re not familiar with math terms, just think of a vector as an arrow pointing in a specific direction).
Let’s pick an example. Imagine we map genres in novels to the directions of a compass rose. Let’s say Fantasy & Science Fiction points north, Mystery & Crime points east, Romance points south, Historical Fiction points west.
So, “Lord of the Rings” points north, “Sherlock Holmes” points east, “Romeo and Juliet” points south.
What about Jane Austen’s “Pride and Prejudice”? Well, there’s romance and historical fiction. We could say it points south-west.
Now, imagine you need to rank novels by how close they are to a new query, for example the biography of “Bonnie and Clyde” (technically not fiction…). We could point that south-east. Let’s compute the distances (angles) between “Bonnie and Clyde” and all the other novels and sort them by distance:
| novel | distance (deg) |
|---|---|
| Romeo and Juliet | 45 |
| Sherlock Holmes | 45 |
| Pride and Prejudice | 90 |
| Lord of the Rings | 135 |
Suddenly something as abstract as novel genres maps into a space we can do computations on: we can compute the similarity of two entities or rank one by similarity to all the others!
If you think further about this, problems arise. The compass rose has two dimensions and there is not “enough space” to fit all possible abstract criteria. Where would you place a romantic sci-fi novel? Why should Romance be the opposite of Sci-Fi? The example starts to fall short. It turns out that we need more space. A lot more space! Today, embedding maps text into vectors in a high-dimensional space: typically 384 to 1024 dimensions.
There are Postgres extensions that perform embedding inside Postgres. However, the lingua franca of all things related to machine learning and its application today is Python, and I recommend staying in that ecosystem if you want to do embeddings. Implementing the embedding code as a Postgres client looks easier and more maintainable to me.
The standard Python package to perform embedding is → sentence transformers, based on PyTorch.
Embedding can be done using pre-trained neural network models. These models have been trained in such a way that sentences with a similar meaning map to vectors that point into a similar direction. Many pre-trained models are available as open-weight models. They come in a large variety. Some specialize on shorter or longer texts. Some know only a single language, some are multilingual.
A great, multilingual, very flexible MIT-licensed open-weight model is → bge-m3.
For this demo I’m using its cousin, the smaller and faster, English-only → bge-small-en-v1.5, under the same license. It takes a text spanning from a few words to a few sentences (max. 512 tokens, about 1000 chars or so) and maps it into a vector of 384 dimensions, which is represented by an array of 384 floats.
Let’s try this on the titles of the EN Wikipedia! We create a table where we store the IDs, titles and the resulting embedding vector:
create table embeddings (id bigint, title varchar, embedding vector(384));

The data type vector is implemented by the popular → pgvector
extension.
The Python program embed.py reads
(id, title) from enwiki, runs
title through bge-small-en-v1.5 and stores
(id, title, embedding) into the new table
embeddings.
# pip install "psycopg[binary]"
# pip install torch --index-url https://download.pytorch.org/whl/cpu
# pip install sentence-transformers
import time
import psycopg
from sentence_transformers import SentenceTransformer
cursor = psycopg.connect("host='127.0.0.1' dbname='chris' user='chris'").cursor()

# load the pre-trained embedding model (downloaded beforehand)
model = SentenceTransformer('BAAI/bge-small-en-v1.5', local_files_only=True)

batch_size = 1000
cnt = 0
id = 1

while True:

    # fetch the next batch of (id, title) rows
    t0 = time.time()
    cursor.execute("select id, title from enwiki " +
                   "where id between %s and %s;", [id, id + batch_size - 1])
    results = cursor.fetchall()
    if len(results) == 0:
        break
    titles = []
    for result in results:
        titles.append(result[1])
    t_select = time.time() - t0

    # embed the whole batch of titles in a single call
    t0 = time.time()
    embeddings = model.encode(titles)
    t_embed = time.time() - t0

    # store (id, title, embedding) and commit the batch
    t0 = time.time()
    for i in range(0, len(results)):
        cursor.execute("insert into embeddings values (%s, %s, %s)",
                       [results[i][0], titles[i], embeddings[i].tolist()])
    cursor.connection.commit()
    t_store = time.time() - t0

    cnt += len(results)
    print("%d done, last batch: %.3f select, %.3f embed, %.3f insert" %
          (cnt, t_select, t_embed, t_store))
    id += batch_size

print("%d embeddings done" % (cnt))

The script works in batches of 1000 records, keyed by ID, so it can be restarted from a given ID if it is interrupted. For this to work efficiently, make sure id is a primary key or at least has an index.
On my test machine, this runs at about 300 embeddings per second, and
takes about 6 hours for the 7.1 M titles in the EN Wikipedia. The new
table embeddings weights about 11 GiB.
Note
You can embed longer texts with models that support that, for example bge-m3 supports up to 8192 tokens (20000 chars or so). However, to perform semantic searches on, say, entire books, you need to chunk the text. This is one of the ingredients of RAG, for example. Our Wikipedia article bodies would be too long to meaningfully embed each one into a single vector. An abstract would be a good use case, but Wikipedia doesn’t provide them.
Warning
Unlike the generated column for the precomputed tsvector in the full text search example, the embedding column was computed by a client program and therefore is not automatically kept up to date by Postgres.
What can we do with this? Well, now we can take a query, embed it into a vector and rank all other vectors by semantic similarity to the one we’re looking for!
As an example, let’s see what titles match the query:
“interesting places for sightseeing in the capital of France”
The Python program query.py does this.
# pip install "psycopg[binary]"
# pip install torch --index-url https://download.pytorch.org/whl/cpu
# pip install sentence-transformers
import os
import time
import psycopg
from sentence_transformers import SentenceTransformer
cursor = psycopg.connect("host='127.0.0.1' dbname='chris' user='chris'").cursor()
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
# embed the query text into a 384-dimensional vector
f = model.encode("interesting places for sightseeing in the capital of France")

t0 = time.time()
# fetch the 10 titles whose embeddings are closest (by cosine distance) to the query vector
cursor.execute(
    """
    select title,
           1.0 - (embedding <=> %s::vector) as sim
    from embeddings order by embedding <=> %s::vector limit 10
    """, [f.tolist(), f.tolist()])
results = cursor.fetchall()
t1 = time.time()

print("%.3f s" % (t1 - t0))
for result in results:
    print("%6.4f, %s" % (result[1], result[0]))

The program embeds the query, then runs
select title,
1.0 - (embedding <=> %s::vector) as sim
from embeddings
order by embedding <=> %s::vector limit 10

which lists the top 10 most similar titles.
Here, %s is the placeholder for the float array,
substituted before sending the SQL to Postgres.
<=> is the so-called cosine distance operator provided by pgvector.
It measures how far apart the directions of two vectors are: 1 minus the cosine of the angle between them, so 0 means they point in the same direction and 2 means they point in opposite directions.
The result is:
1.655 s
0.8158, List of historic places in Capitale-Nationale
0.8109, List of capitals of France
0.8103, List of tourist attractions in Paris
0.8076, Landmarks in Paris
0.7897, Places to Visit
0.7852, City Sightseeing
0.7810, French Towns and Lands of Art and History
0.7806, Tourism in France
0.7784, Galeries de la Capitale
0.7774, List of protected heritage sites in Les Bons Villers
As you can see, the titles are pretty close to the meaning of the query!
This works best when the context is at least a few words up to a few
sentences. For example, the title Paris is missing here.
Remember, we didn’t embed the whole article, just the title. The single
word Paris alone apparently isn’t close enough to our
search term about sightseeing.
1.6 seconds is fast, right? Well, it is, but note there is no index yet!
By now, you know the drill:
Note
Postgres can index anything.
The most performant index to speed this up is an HNSW
(Hierarchical Navigable Small World) index with the
vector_cosine_ops operator class.
CREATE INDEX embeddings_ix ON embeddings USING hnsw (embedding vector_cosine_ops);

If you have enough RAM (set maintenance_work_mem to at
least the amount of memory the index to be created will occupy), this is
reasonably fast (18 minutes on my temporarily scaled-up test machine).
Otherwise it still runs fine, but takes much longer. The resulting index
is 14 GiB.
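For reference, the scaled-up build session looked roughly like this (the values are just examples; parallel HNSW builds need pgvector 0.6.0 or later):

set maintenance_work_mem = '16GB';          -- example value: at least the expected index size
set max_parallel_maintenance_workers = 4;   -- pgvector can build HNSW in parallel
-- then run the CREATE INDEX statement above in the same session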
Let’s run the query.py again:
0.005 s
0.8158, List of historic places in Capitale-Nationale
0.8109, List of capitals of France
0.8103, List of tourist attractions in Paris
0.8076, Landmarks in Paris
0.7810, French Towns and Lands of Art and History
0.7806, Tourism in France
0.7774, List of protected heritage sites in Les Bons Villers
0.7750, List of museums in France
0.7749, List of most-visited museums in France
0.7732, Tourism in Paris
Two points.
First. The HNSW indexing is superfast. We computed the top-ten closest vectors among 7.1 million vectors in 5 milliseconds.
An important point is that the
order by embedding <=> %s::vector limit 10 is
performed by HNSW itself. Postgres is not computing 7.1 million distances, then
sorting, then showing the first ten: it actually uses HNSW to find
the top-10 closest vectors!
That’s why it’s important to write the query this way, so Postgres’ planner can apply the index algorithm.
Second. HNSW is approximate. As you can see, the results are not exactly identical with and without the index.
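pgvector lets you trade accuracy for speed at query time via the hnsw.ef_search parameter (how many graph candidates are examined; the default is 40). For example:

set hnsw.ef_search = 100;   -- higher = better recall, slower queries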
Note
People combine semantic search with lexical search! You can just do both and show the combined ranking results (hybrid search). You can also perform one type of search to obtain the top-N results and then rank them using the other type of search (reranking).
Let’s try to run our query example again, but additionally
rerank the 10 results using
ts_rank_cd().
What can we give as full text search term for the ranking? If we look
for all the words, we risk getting ranks of 0.0, because
not all single words of the search term are present in all articles.
Let’s just OR the words. Here are the updated parts in
query.py:
f = model.encode("interesting places for sightseeing in the capital of France")
k = "interesting | places | for | sightseeing | in | the | capital | of | France"
t0 = time.time()
cursor.execute(
"""
select title, ts_rank_cd(fts, to_tsquery(%s)) as r from enwiki where enwiki.id in (
select id from embeddings order by embedding <=> %s::vector limit 10
) order by r desc
""", [k, f.tolist()])
results = cursor.fetchall()
t1 = time.time()

Here is the output. We get the same 10 results, of course, but the ranking within the 10 results is arguably even better!
0.007 s
5.6000, Landmarks in Paris
5.4000, Tourism in France
1.9000, List of tourist attractions in Paris
1.8000, Tourism in Paris
1.6000, List of capitals of France
1.3000, List of museums in France
0.8000, French Towns and Lands of Art and History
0.4000, List of historic places in Capitale-Nationale
0.4000, List of most-visited museums in France
0.0000, List of protected heritage sites in Les Bons Villers
Note
Besides text, there are models that can embed other media. → CLIP is an interesting model that embeds texts and images into the same vector space! That means you can search images using images, or search images using text, or any combination. Or you can compute semantic similarity between images.
(C) 2026 Chris Mair (license: CC BY
4.0)