Motivation
Not so long ago, I shared that I have an interest in finding things and, in that case, the question was about where something could be found. Another common requirement is, given some expression of interest, finding the set of documents that best answers it. For example, coupled with the geospatial question, we might include that we’re looking for Indian restaurants within the specified geographic area.
For this article, though, we’ll restrict the focus to the problem of finding the most relevant documents within some collection, where that collection just happens to be the CedarDB documentation. To that end, I’ll assert up front that my query “Does the CedarDB ‘asof join’ use an index?” should return a helpful response, while the query “Does pickled watermelon belong on a taco?” should ideally return an empty result.
There are obvious parallels to this blog post, from October 2024, where I provide some potentially helpful background on what semantic search is all about. Worth noting: I used the code that accompanied that article as the starting point for what I’m discussing here. What is notably absent here is the K-Means code from that earlier article.
An illustration of the limitations of classic keyword-based search, and of why semantic search is so compelling: despite the presence of stemmed forms of some of my search terms, none of the top matches among the 2572 results for the query shown above addresses what I was seeking. Having just been stymied by such poor search results, I’m highly motivated to dive into some semantic search using CedarDB.
Semantic search: the essentials
What this problem boils down to is: given a text document, generate vector embeddings that I can store in a way that allows me to efficiently compare them to embeddings generated from a user’s query, and then present the user a list of results based on similarity to the query.
For me, building a semantic search application implies I’ll use Python, since its support for LLM integrations, database interactions, natural language tools, and accessing and processing HTML data are all pretty straightforward. Here, then, is my list of the essential ingredients:
- CedarDB, due to its pgvector support and ability to rapidly scan data
- FastEmbed to generate the embeddings
- Psycopg for the Python / CedarDB interactions
- NLTK to “chunk” the text into sentences (here, embeddings are generated at the sentence level)
- Flask for the REST API
- Beautiful Soup to easily extract the text from HTML documents
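To make the “chunking” step concrete, here is a minimal sketch of sentence-level chunking. The actual app uses NLTK’s `sent_tokenize`, which handles abbreviations and other edge cases far better; this regex-based stand-in just illustrates the idea of numbering sentences within a document.

```python
import re

def chunk_into_sentences(text: str) -> list[str]:
    """Rough stand-in for NLTK's sent_tokenize: split on sentence-ending
    punctuation followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]

doc = "CedarDB supports pgvector. It can scan data rapidly! Can it chunk? Yes."
for chunk_num, chunk in enumerate(chunk_into_sentences(doc)):
    print(chunk_num, chunk)
```

Each (chunk_num, chunk) pair is what later gets an embedding and a row in the database.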
Great! So, how do we run it?
There are a couple of prerequisites / caveats:
- We’ve tested this on Linux (x86) and macOS (ARM)
- Docker is required
- jq is used for formatting the JSON returned by the search client
- (optional) The psql CLI: the usual command line interface for PostgreSQL
Everything else we need is included in the CedarDB “examples” GitHub repo, so we begin by running
$ git clone https://github.com/cedardb/examples.git
$ cd ./examples/semantic/
With that done, the quickest way to get this running is to use the Docker Compose configuration built by Lukas:
$ docker compose up
[+] Running 2/2
✔ Volume "semantic_cedardata" Created 0.0s
✔ Container cedardb Created 0.0s
Attaching to cedardb, semantic_search
cedardb | [INFO] Setting up database directory
cedardb | [INFO] Creating superuser: postgres
cedardb | ALTER ROLE
cedardb |
cedardb | [INFO] Creating database: postgres
cedardb | [INFO] Done setting up database
cedardb |
cedardb | /usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/*
cedardb |
cedardb | [INFO] CedarDB init process complete.
cedardb |
cedardb | [INFO] You are running the most recent CedarDB version (v2025-06-24)
cedardb |
cedardb | 2025-06-26 10:34:47.319427000 UTC INFO: Using 10798 MB buffers, 10798 MB work memory
cedardb | 2025-06-26 10:34:47.337382916 UTC WARNING: Your storage device uses write back caching. Write durability might be delayed. See: https://cedardb.com/docs/references/writecache
cedardb | 2025-06-26 10:34:47.337422291 UTC INFO: You're running CEDARDB COMMUNITY EDITION - using 0 GB out of 64 GB. Our General Terms and Conditions apply to the use of the CedarDB Community Edition. Run "cedardb --license" for more information.
[...]
semantic_search | [notice] A new release of pip is available: 24.0 -> 25.1.1
semantic_search | [notice] To update, run: pip install --upgrade pip
semantic_search | [06/26/2025 10:34:56 AM MainThread] TextEmbedding model ready: 0.58 s
semantic_search | [nltk_data] Downloading package punkt_tab to /root/nltk_data...
semantic_search | [nltk_data] Package punkt_tab is already up-to-date!
semantic_search | [06/26/2025 10:34:56 AM MainThread] NLTK ready: 0.10 s
semantic_search | [06/26/2025 10:34:56 AM MainThread] Checking whether text_embed table exists
semantic_search | [06/26/2025 10:34:57 AM MainThread] Creating tables ...
semantic_search | [06/26/2025 10:34:57 AM MainThread] OK
semantic_search | [06/26/2025 10:34:57 AM MainThread] Serving on http://0.0.0.0:1999
That last line indicates we’re all set and ready to use a client to load some documents. You’ll notice some lines of output, occurring at 10-second intervals, like this:
cedardb | 2025-06-26 10:08:34.906773343 UTC FATAL: client terminated connection unexpectedly
Don’t worry – they are due to the health check built into the Docker Compose environment.
Let’s index those CedarDB documents.
With our CedarDB and app instance up and running, we can index a collection of docs. Within
the GitHub repo, there’s a list of URLs to CedarDB documents, current as of this writing. We
use that to drive the indexing process (in a different terminal window). During this time, you
can observe some stats related to the process over in the terminal where docker compose up
is running. I call this an “indexing process” because the result is a set of vectors in the
database that we can use as a kind of document index.
$ for url in $( cat ./cedardb_docs_urls.txt ) ; do ./index_url.sh $url ; done
https://cedardb.com/docs/compatibility/sql_features/ => OK
https://cedardb.com/docs/compatibility/backend/ => OK
https://cedardb.com/docs/compatibility/ => OK
[...]
https://cedardb.com/blog/compilation/ => OK
https://cedardb.com/blog/strings_deep_dive/ => OK
https://cedardb.com/blog/german_strings/ => OK
https://cedardb.com/blog/geospatial/ => OK
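Under the hood, each indexed sentence ends up as a row containing a 384-dimension vector. A sketch of how an embedding might be turned into a pgvector literal and inserted is shown below; the `text_embed` table and its columns (uri, chunk_num, chunk, embedding) come from the SQL discussed later in this article, but the exact formatting helper and INSERT wiring here are my assumptions, not the app’s verbatim code.

```python
def to_vector_literal(embedding: list[float], precision: int = 6) -> str:
    """Format a Python list of floats as a pgvector text literal,
    e.g. '[-0.054966, 0.011331]'."""
    return "[" + ", ".join(f"{x:.{precision}f}" for x in embedding) + "]"

# Hypothetical parameterized INSERT mirroring the text_embed table
# referenced by the article's search query.
INSERT_SQL = (
    "INSERT INTO text_embed (uri, chunk_num, chunk, embedding) "
    "VALUES (%s, %s, %s, %s::VECTOR)"
)

vec = to_vector_literal([-0.054966, 0.011331])
print(vec)
# With psycopg: cur.execute(INSERT_SQL, (uri, chunk_num, sentence, vec))
```

The same literal format works for the query-side vector in the `'[...]'::VECTOR` cast shown in the next section.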
Time to run some queries!
Earlier, I claimed that the following query should return a helpful response: “Does the CedarDB ‘asof join’ use an index?”
$ time ./search_client.sh "Does the CedarDB 'asof join' use an index?"
[
{
"uri": "https://cedardb.com/docs/references/advanced/asof_join/",
"score": 0.92760235502706,
"chunk": "â¹ï¸ CedarDB currently does not use any indexes for AsOf joins.",
"chunk_num": 17
},
{
"uri": "https://cedardb.com/docs/references/advanced/asof_join/",
"score": 0.852962392215444,
"chunk": "Support for index-supported AsOf joins is planned for a future CedarDB release.",
"chunk_num": 18
},
{
"uri": "https://cedardb.com/blog/geospatial/",
"score": 0.838736791761304,
"chunk": "Did CedarDB use the primary key index?",
"chunk_num": 45
},
{
"uri": "https://cedardb.com/docs/references/configuration/",
"score": 0.808461606502539,
"chunk": "compilationmode = 'a' ; Multiway joins In addition to binary joins, CedarDB also implements mult ...",
"chunk_num": 40
},
{
"uri": "https://cedardb.com/blog/simple_efficient_hash_tables/",
"score": 0.792863477977883,
"chunk": "Most prominently, CedarDB implements relational joins as hash joins.",
"chunk_num": 6
}
]
real 0m0.037s
user 0m0.022s
sys 0m0.021s
That response, “CedarDB currently does not use any indexes for AsOf joins”, answers my question. Very helpful indeed.
What about the query “Does pickled watermelon belong on a taco?” – does that return an empty result?
$ time ./search_client.sh "Does pickled watermelon belong on a taco?"
[]
real 0m0.038s
user 0m0.020s
sys 0m0.021s
It does, so my bold claim has been validated – Huzzah!
Where’s the secret sauce?
Ultimately, it all boils down to this SQL query:
1 WITH s AS
2 (
3 SELECT uri, 1 - (embedding <=> '[-0.054966, ..., 0.011331]'::VECTOR) sim, chunk, chunk_num
4 FROM text_embed
5 )
6 SELECT * FROM s
7 WHERE sim >= 0.65
8 ORDER BY sim DESC
9 LIMIT 5;
Let’s go through that concise yet powerful SQL statement, line by line:
1, 2, 5: We define a common table expression (CTE), which I needed because I wanted to reference sim in my WHERE clause.
3: We SELECT the uri, chunk (a “chunk” is a sentence), and chunk_num columns, along with (and this is the most interesting part) 1 minus the “cosine distance” (the <=> operator) between the embedding column in the text_embed table and the embedding (vector) corresponding to the user’s query, and assign that to sim so that it can be used further down in the query. Note that I’ve truncated the vector and reduced its precision to 6 decimal places so that it displays nicely here, since the vectors generated by the model used in this app have 384 dimensions. Elaborating on the terminology: in this app, based on the results of experiments, we chose to break the entire text into sentences, and we refer to those as “chunks” (they are numbered sequentially within the document; hence chunk_num).
6: We just project all those columns from the CTE.
7: Here, we ensure that, despite having a ranked list of results, each of them has a score of at least the specified value (0.65).
8, 9: We order that result set and keep only the top 5 results.
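The sim expression on line 3 is just cosine similarity: the <=> operator returns cosine distance, and subtracting it from 1 recovers the similarity. A small pure-Python sketch with toy 2-D vectors (the real embeddings have 384 dimensions) shows the relationship:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """What the <=> operator computes: 1 minus the cosine of the
    angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def similarity(a: list[float], b: list[float]) -> float:
    """sim, as computed on line 3 of the SQL: 1 - (a <=> b)."""
    return 1.0 - cosine_distance(a, b)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(similarity([1.0, 0.0], [1.0, 0.0]))
print(similarity([1.0, 0.0], [0.0, 1.0]))
```

That’s why the WHERE clause can use a plain threshold like sim >= 0.65: higher means more semantically similar.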
You’ll naturally be curious about how to reduce the amount of data scanned, scored, and sorted. I’ll touch on that in the next section.
Check out these great app features!
Some aspects of this little Python app that are worth pointing out:
- The MIN_SIMILARITY environment variable allows us to specify a threshold, below which documents aren’t considered relevant. Without that, the query simply returns the top-N (here, 5) matches, which isn’t what we want. This value could be tuned experimentally; my chosen value of 0.65 seems to get the job done here.
- Prior to fetching a document, the app does an HTTP HEAD request and calculates the SHA256 digest of whichever of the following headers are present: Last-Modified, Content-Length, Etag. If there is a row in the text_embed_freshness table corresponding to this (uri, sha256) combination, then the URL is skipped. This is useful when re-indexing a set of URLs when only some of them have changed, as we can avoid generating embeddings for unchanged documents, which is the most expensive part of this process.
- Related to that: at search time, the embeddings generated from the query string are cached, so subsequent searches using the same query should be a little faster (though the cache size is only 1024 entries).
- While CedarDB does support the pgvector type and similarity calculations on these vectors, it does not yet provide an index for them. As I’ve observed repeatedly when using CedarDB, its ability to efficiently scan vast amounts of data allows you to resort to the “brute force” approach in many cases. For cases where the data set is much larger, though, integrating some indexed column into the predicate would make sense. Here, the uri column is a natural choice, and that is exposed via the search client as a command line flag, e.g. -c substring_of_doc_URL. The effect of this is discussed in some detail in the GitHub repo.
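The freshness check described above can be sketched in a few lines. The header names and the (uri, sha256) lookup come from the app’s description; the concatenation order and separator used to build the digest here are my assumptions, since the app’s exact scheme isn’t shown.

```python
import hashlib

# Headers the app hashes when they are present on the HEAD response.
FRESHNESS_HEADERS = ("Last-Modified", "Content-Length", "Etag")

def freshness_digest(headers: dict[str, str]) -> str:
    """SHA256 over whichever freshness-related headers are present.
    A changed document should yield a new digest, triggering re-indexing."""
    h = hashlib.sha256()
    for name in FRESHNESS_HEADERS:
        if name in headers:
            # Assumed encoding: "Name:value" lines; the real app may differ.
            h.update(f"{name}:{headers[name]}\n".encode())
    return h.hexdigest()

unchanged = freshness_digest({"Etag": '"abc123"', "Content-Length": "1024"})
modified = freshness_digest({"Etag": '"abc123"', "Content-Length": "2048"})
print(unchanged != modified)  # a changed header means a new (uri, sha256) row
```

If the computed digest matches an existing (uri, sha256) row in text_embed_freshness, the expensive fetch-chunk-embed pipeline is skipped entirely.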
Let’s wrap this up.
I really appreciate you taking the time to explore how CedarDB’s pgvector support enables semantic search. While CedarDB is not a purpose-built vector database, its amazing ability to rapidly scan large data sets earns it a seat at the vector DB table. Combine that with the fact that, in addition to supporting these vector queries, CedarDB is PostgreSQL compatible and blazingly fast at running analytical queries, which can, for example, power the type of dashboards you find within apps like the one Charles Schwab provides for its user community.
Please reach out to us if you’d like to discuss your AI related vector DB challenges!