Motivation
Not so long ago, I shared that I have an interest in finding things and, in that case, the question was about where something could be found. Another common requirement is, given some expression of interest, finding the set of documents that best answers it. For example, coupled with the geospatial question, we might include that we’re looking for Indian restaurants within the specified geographic area.
For this article, though, we’ll restrict the focus to the problem of finding the most relevant documents within some collection, where that collection just happens to be the CedarDB documentation. To that end, I’ll assert up front that my query “Does the CedarDB ‘asof join’ use an index?” should return a helpful response, while the query “Does pickled watermelon belong on a taco?” should ideally return an empty result.
There are obvious parallels to this blog post, from October 2024, where I provide some potentially helpful background on what semantic search is all about. Worth noting: I used the code that accompanied that article as the starting point for what I’m discussing here. What is notably absent here is the K-Means code from that earlier article.
An illustration of the limitations of classic keyword-based search, and of why semantic search is so compelling: despite the presence of stemmed forms of some of my search terms, none of the top matches among the 2572 results for the query shown above addresses what I was seeking. Having just been stymied by such poor search results, I’m highly motivated to dive into some semantic search using CedarDB.
Semantic search: the essentials
What this problem boils down to is: given a text document, generate vector embeddings that I can store in a way that allows me to efficiently compare them to embeddings generated from a user’s query, and then present the user a list of results based on similarity to the query.
For me, building a semantic search application implies I’ll use Python, since its support for LLM integrations, database interactions, natural language tools, and accessing and processing HTML data are all pretty straightforward. Here, then, is my list of the essential ingredients:
- CedarDB, due to its pgvector support and ability to rapidly scan data
- FastEmbed to generate the embeddings
- Psycopg for the Python / CedarDB interactions
- NLTK to “chunk” the text into sentences (here, embeddings are generated at the sentence level)
- Flask for the REST API
- Beautiful Soup to easily extract the text from HTML documents
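To make the “chunking” step concrete, here is a minimal sketch of sentence-level chunking. The actual app uses NLTK’s `sent_tokenize`, which handles abbreviations and other edge cases far better; this regex-based stand-in just illustrates the idea of numbering sentences within a document.

```python
import re

def chunk_into_sentences(text: str) -> list[str]:
    """Rough stand-in for NLTK's sent_tokenize: split on sentence-ending
    punctuation followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]

doc = "CedarDB supports pgvector. It can scan data rapidly! Can it chunk? Yes."
for chunk_num, chunk in enumerate(chunk_into_sentences(doc)):
    print(chunk_num, chunk)
```

Each (chunk_num, chunk) pair is what later gets an embedding and a row in the database.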
Great! So, how do we run it?
There are a couple of prerequisites / caveats:
- We’ve tested this on Linux (x86) and macOS (ARM)
- Docker is required
- jq is used for formatting the JSON returned by the search client
- (optional) The psql CLI: the usual command line interface for PostgreSQL
Everything else we need is included in the CedarDB “examples” GitHub repo, so we begin by running
$ git clone https://github.com/cedardb/examples.git
$ cd ./examples/semantic/
With that done, the quickest way to get this running is to use the Docker Compose configuration built by Lukas:
$ docker compose up
[+] Running 2/2
✔ Volume "semantic_cedardata" Created 0.0s
✔ Container cedardb Created 0.0s
Attaching to cedardb, semantic_search
cedardb | [INFO] Setting up database directory
cedardb | [INFO] Creating superuser: postgres
cedardb | ALTER ROLE
cedardb |
cedardb | [INFO] Creating database: postgres
cedardb | [INFO] Done setting up database
cedardb |
cedardb | /usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/*
cedardb |
cedardb | [INFO] CedarDB init process complete.
cedardb |
cedardb | [INFO] You are running the most recent CedarDB version (v2025-06-24)
cedardb |
cedardb | 2025-06-26 10:34:47.319427000 UTC INFO: Using 10798 MB buffers, 10798 MB work memory
cedardb | 2025-06-26 10:34:47.337382916 UTC WARNING: Your storage device uses write back caching. Write durability might be delayed. See: https://cedardb.com/docs/references/writecache
cedardb | 2025-06-26 10:34:47.337422291 UTC INFO: You're running CEDARDB COMMUNITY EDITION - using 0 GB out of 64 GB. Our General Terms and Conditions apply to the use of the CedarDB Community Edition. Run "cedardb --license" for more information.
[...]
semantic_search | [notice] A new release of pip is available: 24.0 -> 25.1.1
semantic_search | [notice] To update, run: pip install --upgrade pip
semantic_search | [06/26/2025 10:34:56 AM MainThread] TextEmbedding model ready: 0.58 s
semantic_search | [nltk_data] Downloading package punkt_tab to /root/nltk_data...
semantic_search | [nltk_data] Package punkt_tab is already up-to-date!
semantic_search | [06/26/2025 10:34:56 AM MainThread] NLTK ready: 0.10 s
semantic_search | [06/26/2025 10:34:56 AM MainThread] Checking whether text_embed table exists
semantic_search | [06/26/2025 10:34:57 AM MainThread] Creating tables ...
semantic_search | [06/26/2025 10:34:57 AM MainThread] OK
semantic_search | [06/26/2025 10:34:57 AM MainThread] Serving on http://0.0.0.0:1999
That last line indicates we’re all set and ready to use a client to load some documents. You’ll notice some lines of output, occurring at 10-second intervals, like this:
cedardb | 2025-06-26 10:08:34.906773343 UTC FATAL: client terminated connection unexpectedly
Don’t worry – they are due to the health check built into the Docker Compose environment.
Let’s index those CedarDB documents.
With our CedarDB and app instance up and running, we can index a collection of docs. Within
the GitHub repo, there’s a list of URLs to CedarDB documents, current as of this writing. We
use that to drive the indexing process (in a different terminal window). During this time, you
can observe some stats related to the process over in the terminal where docker compose up
is running. I call this an “indexing process” because the result is a set of vectors in the
database that we can use as a kind of document index.
$ for url in $( cat ./cedardb_docs_urls.txt ) ; do ./index_url.sh $url ; done
https://cedardb.com/docs/compatibility/sql_features/ => OK
https://cedardb.com/docs/compatibility/backend/ => OK
https://cedardb.com/docs/compatibility/ => OK
[...]
https://cedardb.com/blog/compilation/ => OK
https://cedardb.com/blog/strings_deep_dive/ => OK
https://cedardb.com/blog/german_strings/ => OK
https://cedardb.com/blog/geospatial/ => OK
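Under the hood, each indexed sentence ends up as a row containing a 384-dimension vector. A sketch of how an embedding might be turned into a pgvector literal and inserted is shown below; the `text_embed` table and its columns (uri, chunk_num, chunk, embedding) come from the SQL discussed later in this article, but the exact formatting helper and INSERT wiring here are my assumptions, not the app’s verbatim code.

```python
def to_vector_literal(embedding: list[float], precision: int = 6) -> str:
    """Format a Python list of floats as a pgvector text literal,
    e.g. '[-0.054966, 0.011331]'."""
    return "[" + ", ".join(f"{x:.{precision}f}" for x in embedding) + "]"

# Hypothetical parameterized INSERT mirroring the text_embed table
# referenced by the article's search query.
INSERT_SQL = (
    "INSERT INTO text_embed (uri, chunk_num, chunk, embedding) "
    "VALUES (%s, %s, %s, %s::VECTOR)"
)

vec = to_vector_literal([-0.054966, 0.011331])
print(vec)
# With psycopg: cur.execute(INSERT_SQL, (uri, chunk_num, sentence, vec))
```

The same literal format works for the query-side vector in the `'[...]'::VECTOR` cast shown in the next section.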
Time to run some queries!
Earlier, I claimed that the following query should return a helpful response: “Does the CedarDB ‘asof join’ use an index?”
$ time ./search_client.sh "Does the CedarDB 'asof join' use an index?"
[
{
"uri": "https://cedardb.com/docs/references/advanced/asof_join/",
"score": 0.92760235502706,
"chunk": "â¹ï¸ CedarDB currently does not use any indexes for AsOf joins.",
"chunk_num": 17
},
{
"uri": "https://cedardb.com/docs/references/advanced/asof_join/",
"score": 0.852962392215444,
"chunk": "Support for index-supported AsOf joins is planned for a future CedarDB release.",
"chunk_num": 18
},
{
"uri": "https://cedardb.com/blog/geospatial/",
"score": 0.838736791761304,
"chunk": "Did CedarDB use the primary key index?",
"chunk_num": 45
},
{
"uri": "https://cedardb.com/docs/references/configuration/",
"score": 0.808461606502539,
"chunk": "compilationmode = 'a' ; Multiway joins In addition to binary joins, CedarDB also implements mult ...",
"chunk_num": 40
},
{
"uri": "https://cedardb.com/blog/simple_efficient_hash_tables/",
"score": 0.792863477977883,
"chunk": "Most prominently, CedarDB implements relational joins as hash joins.",
"chunk_num": 6
}
]
real 0m0.037s
user 0m0.022s
sys 0m0.021s
That response, “CedarDB currently does not use any indexes for AsOf joins”, answers my question. Very helpful indeed.
What about the query “Does pickled watermelon belong on a taco?” – does that return an empty result?
$ time ./search_client.sh "Does pickled watermelon belong on a taco?"
[]
real 0m0.038s
user 0m0.020s
sys 0m0.021s
It does, so my bold claim has been validated – Huzzah!
Where’s the secret sauce?
Ultimately, it all boils down to this SQL query:
1 WITH s AS
2 (
3 SELECT uri, 1 - (embedding <=> '[-0.054966, ..., 0.011331]'::VECTOR) sim, chunk, chunk_num
4 FROM text_embed
5 )
6 SELECT * FROM s
7 WHERE sim >= 0.65
8 ORDER BY sim DESC
9 LIMIT 5;
Let’s go through that concise yet powerful SQL statement, line by line:
1, 2, 5: We define a common table expression (CTE), which I needed because I wanted to reference sim in my WHERE clause.
3: We SELECT the uri, chunk (a “chunk” is a sentence), and chunk_num columns, along with (and this is the most interesting part) 1 minus the “cosine distance” (the <=> operator) between the embedding column in the text_embed table and the embedding (vector) corresponding to the user’s query, and assign that to sim so that it can be used further down in the query. Note that I’ve truncated the vector and reduced its precision to 6 decimal places so that it displays nicely here, since the vectors generated by the model used in this app have 384 dimensions. Elaborating on the terminology: in this app, based on the results of experiments, we chose to break the entire text into sentences, and we refer to those as “chunks” (they are numbered sequentially within the document; hence chunk_num).
6: We just project all those columns from the CTE.
7: Here, we ensure that, despite having a ranked list of results, each of them has a score of at least the specified value (0.65).
8, 9: We order that result set and keep only the top 5 results.
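The sim expression on line 3 is just cosine similarity: the <=> operator returns cosine distance, and subtracting it from 1 recovers the similarity. A small pure-Python sketch with toy 2-D vectors (the real embeddings have 384 dimensions) shows the relationship:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """What the <=> operator computes: 1 minus the cosine of the
    angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def similarity(a: list[float], b: list[float]) -> float:
    """sim, as computed on line 3 of the SQL: 1 - (a <=> b)."""
    return 1.0 - cosine_distance(a, b)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(similarity([1.0, 0.0], [1.0, 0.0]))
print(similarity([1.0, 0.0], [0.0, 1.0]))
```

That’s why the WHERE clause can use a plain threshold like sim >= 0.65: higher means more semantically similar.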
You’ll naturally be curious about how to reduce the amount of data scanned, scored, and sorted. I’ll touch on that in the next section.
Check out these great app features!
Some aspects of this little Python app that are worth pointing out:
- The MIN_SIMILARITY environment variable allows us to specify a threshold, below which documents aren’t considered relevant. Without that, the query simply returns the top-N (here, 5) matches, which isn’t what we want. This value could be tuned experimentally; my chosen value of 0.65 seems to get the job done here.
- Prior to fetching a document, the app does an HTTP HEAD request and calculates the SHA256 digest of whichever of the following headers are present: Last-Modified, Content-Length, Etag. If there is a row in the text_embed_freshness table corresponding to this (uri, sha256) combination, then the URL is skipped. This is useful when re-indexing a set of URLs when only some of them have changed, as we can avoid generating embeddings for unchanged documents, which is the most expensive part of this process.
- Related to that: at search time, the embeddings generated from the query string are cached, so subsequent searches using the same query should be a little faster (though the cache size is only 1024 entries).
- While CedarDB does support the pgvector type and similarity calculations on these vectors, it does not yet provide an index for them. As I’ve observed repeatedly when using CedarDB, its ability to efficiently scan vast amounts of data allows you to resort to the “brute force” approach in many cases. For cases where the data set is much larger, though, integrating some indexed column into the predicate would make sense. Here, the uri column is a natural choice, and that is exposed via the search client as a command line flag, e.g. -c substring_of_doc_URL. The effect of this is discussed in some detail in the GitHub repo.
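The freshness check described above can be sketched in a few lines. The header names and the (uri, sha256) lookup come from the app’s description; the concatenation order and separator used to build the digest here are my assumptions, since the app’s exact scheme isn’t shown.

```python
import hashlib

# Headers the app hashes when they are present on the HEAD response.
FRESHNESS_HEADERS = ("Last-Modified", "Content-Length", "Etag")

def freshness_digest(headers: dict[str, str]) -> str:
    """SHA256 over whichever freshness-related headers are present.
    A changed document should yield a new digest, triggering re-indexing."""
    h = hashlib.sha256()
    for name in FRESHNESS_HEADERS:
        if name in headers:
            # Assumed encoding: "Name:value" lines; the real app may differ.
            h.update(f"{name}:{headers[name]}\n".encode())
    return h.hexdigest()

unchanged = freshness_digest({"Etag": '"abc123"', "Content-Length": "1024"})
modified = freshness_digest({"Etag": '"abc123"', "Content-Length": "2048"})
print(unchanged != modified)  # a changed header means a new (uri, sha256) row
```

If the computed digest matches an existing (uri, sha256) row in text_embed_freshness, the expensive fetch-chunk-embed pipeline is skipped entirely.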
Let’s wrap this up.
I really appreciate you taking the time to explore how CedarDB’s pgvector support enables semantic search. While CedarDB is not a purpose-built vector database, its amazing ability to rapidly scan large data sets earns it a seat at the vector DB table. Combine that with the fact that, in addition to supporting these vector queries, CedarDB is PostgreSQL compatible and blazingly fast at running analytical queries, which can, for example, power the type of dashboards you find within apps like the one Charles Schwab provides for its user community.
Please reach out to us if you’d like to discuss your AI related vector DB challenges!