Motivation
If you share my interest in finding things, then I hope you will find this brief post worthwhile. I’ve been interested in databases for a while now and, during this time, I’ve consistently been intrigued by text and spatial data. When I got my hands on CedarDB I was very excited about its potential, then I heard a request by a user, for a geospatial related feature. He was kind enough to share with me his specific need, which was essentially given a point, find all other points which are located within a specified distance. (In this exercise, we’ll show we can do this in about 10 ms on a 9M row table.)
(Source: https://geojson.io/#map=14.5/33.59808/-117.86917
, with “OSM” option)
Let’s consider an experiment
That got me thinking: this desire to find the closest points to a given one may be the most common use of geospatial queries. For example, let’s say I am looking for the nearest coffee shops or gas stations. One formulation of the question could be show me the ten nearest coffee shops, and an alternative to that could be show me all coffee shops within a distance of two kilometers. What do we need to get this done?
- Data on coffee shops
- A database table to store that data
- A function to calculate distances between points
- A SQL query
- Data with details on coffee shops, including locations expressed as a (latitude, longitude) pair. I confess I didn’t have such data handy, but was able to find the OSM Names data set, which I then distilled into a more manageable size (see the note below). Here is a sample of this data:
$ head mini_osm_names.tsv
Farmer Island -147.0654248 -76.6071246 Farmer Island Antarctica aq 0dr
Shepard Island -132.6442311 -74.4317802 Shepard Island Antarctica aq 14c
Grant Island -131.4958397 -74.4629826 Grant Island Antarctica aq 14f
Dean Island -127.574413 -74.4741434 Dean Island Antarctica aq 14v
Mount Siple -126.6667177 -73.4333347 Siple Island Antarctica aq 14v
Siple Island -125.8606286 -73.7713537 Siple Island Antarctica aq 14y
Carney Island -121.4802313 -74.0504624 Carney Island Antarctica aq 16c
Thurston Island -98.8154634 -72.1603856 Thurston Island Antarctica aq 1g1
Sprekkehallet -90.6687707 -68.8718615 Peter I Island Antarctica aq 1gz
Toftefallet -90.6933725 -68.8331335 Peter I Island Antarctica aq 1gz
- A database table into which we load this data, so we can query it; something
like this (you may be curious about the
geohash3
column, and I’ll explain that later):
CREATE TABLE osm_names
(
name STRING NOT NULL
, lon FLOAT NOT NULL
, lat FLOAT NOT NULL
, city STRING NOT NULL
, country STRING
, country_code CHAR(2)
, geohash3 CHAR(3) -- ± 78 km
, PRIMARY KEY (geohash3, city, name)
);
- A distance function that operates within the database. Here, I leaned on AI and asked ChatGPT to build a user defined function for PostgreSQL, and it did an excellent job (note that I edited it slightly to conform to the CedarDB dialect for UDFs):
CREATE FUNCTION distance_km (lat1 FLOAT, lon1 FLOAT, lat2 FLOAT, lon2 FLOAT)
RETURNS FLOAT
AS $$
2 * 6_371 * ASIN(
SQRT(
SIN(RADIANS(lat2 - lat1) / 2)^2
+ COS(RADIANS(lat1)) * COS(RADIANS(lat2)) * SIN(RADIANS(lon2 - lon1) / 2)^2
)
)
$$
LANGUAGE sql IMMUTABLE STRICT;
- Finally, a SQL query:
WITH d AS (
SELECT name, city, country,
distance_km(33.59814807733679, -117.8692855497005, lat, lon)::NUMERIC(12, 3) dist_km,
lat, lon, geohash3
FROM osm_names
)
SELECT * FROM d
WHERE dist_km <= 0.5
ORDER BY dist_km ASC;
What you need
- Docker: I am using Docker Desktop on a MacBook Pro (ARM chip)
- CedarDB: here is the Get Started
- Data set: this is a sparse version of the original data set, suitable for this little exercise
- The
psql
CLI, the usual command line interface for PostgreSQL; for me, using a Mac, I was able to justbrew install libpq
(but, then I had to symlink/opt/homebrew/Cellar/libpq/17.4_1/bin/psql
to/usr/local/bin/psql
so it would be in$PATH
)
Let’s try it …
- Pull the CedarDB Docker image:
$ docker pull cedardb/cedardb
Using default tag: latest
latest: Pulling from cedardb/cedardb
Digest: sha256:e16c690c1642d8bf0053f30512be25e619f2309404f4e7c8b9ce4cf237c3cedc
Status: Image is up to date for cedardb/cedardb:latest
docker.io/cedardb/cedardb:latest
- Start CedarDB:
$ docker run --rm -p 5432:5432 -e CEDAR_PASSWORD=postgres --name cedardb cedardb/cedardb
[INFO] Setting up database directory
[INFO] Creating superuser: postgres
ALTER ROLE
[INFO] Creating database: postgres
[INFO] Done setting up database
/usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/*
[INFO] CedarDB init process complete.
[INFO] You are running the most recent CedarDB version (v2025-05-28)
2025-05-30 17:26:14.445188927 UTC INFO: Using 10798 MB buffers, 10798 MB work memory
2025-05-30 17:26:14.498580760 UTC WARNING: Your storage device uses write back caching. Write durability might be delayed. See: https://cedardb.com/docs/references/writecache
2025-05-30 17:26:14.498633593 UTC INFO: You're running CEDARDB COMMUNITY EDITION - using 0 GB out of 64 GB. Our General Terms and Conditions apply to the use of the CedarDB Community Edition. Run "cedardb --license" for more information.
- In a different terminal, connect to CedarDB using
psql
:
$ PGPASSWORD=postgres psql -h localhost -U postgres -d postgres
Timing is on.
psql (17.4, server 16.3 cedar v2025-05-28)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off, ALPN: none)
Type "help" for help.
- Using that
psql
connection, create the two database objects:
CREATE TABLE osm_names
(
name STRING NOT NULL
, lon FLOAT NOT NULL
, lat FLOAT NOT NULL
, city STRING NOT NULL
, country STRING
, country_code CHAR(2)
, geohash3 CHAR(3) -- ± 78 km
, PRIMARY KEY (geohash3, city, name)
);
CREATE FUNCTION distance_km (lat1 FLOAT, lon1 FLOAT, lat2 FLOAT, lon2 FLOAT)
RETURNS FLOAT
AS $$
2 * 6_371 * ASIN(
SQRT(
SIN(RADIANS(lat2 - lat1) / 2)^2
+ COS(RADIANS(lat1)) * COS(RADIANS(lat2)) * SIN(RADIANS(lon2 - lon1) / 2)^2
)
)
$$
LANGUAGE sql IMMUTABLE STRICT;
- In a new terminal window, download and extract the compressed archive containing the tab-separated data file:
$ curl https://download.cedardb.com/datasets/mini_osm_names.tar.gz | tar xzvf -
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 142M 100 142M 0 0 21.3M 0 0:00:06 0:00:06 --:--:-- 26.0Mx mini_osm_names.tsv
x LICENSE.txt
- From that same terminal, use
COPY
to load the data from the tab-separated data file:
$ cat mini_osm_names.tsv | PGPASSWORD=postgres psql -h localhost -U postgres -d postgres -c "COPY osm_names FROM stdin DELIMITER '\t' QUOTE '\r' CSV NULL '' ESCAPE '';"
Timing is on.
COPY 8997054
Time: 10010.060 ms (00:10.010)
You now have roughly 9M rows in the osm_names
table. Each row corresponds with a named place
from the OSM Names data (shown here via the psql
client connection):
postgres=# select * from osm_names tablesample reservoir (10);
name | lon | lat | city | country | country_code | geohash3
----------------------+--------------+-------------+--------------------------------+--------------------------+--------------+----------
Calle Sinchi Roca | -76.7173288 | -11.9351765 | Lurigancho | Peru | pe | 6mc
Şin | 47.1146715 | 41.3558897 | Şin | Azerbaijan | az | tp3
Большой Шелюг | 47.7557516 | 60.4766813 | Папуловское сельское поселение | Russia | ru | v4c
Rio Içana | -68.4741737 | 1.3784914 | São Gabriel da Cachoeira | Brazil | br | d2p
Cedar Drive | -107.7721722 | 39.5392617 | Rifle | United States of America | us | 9x5
Gloria Drive | -83.4561877 | 42.313082 | Canton | United States of America | us | dps
伊武田埼 | 123.8328693 | 24.4075483 | Iriomotejima | Japan | jp | wu2
Kostolná | 18.8939491 | 48.5970825 | District of Žiar nad Hronom | Slovakia | sk | u2t
HaKomuna HaHaderatit | 34.9014796 | 32.4457155 | Hadera | Israel | il | svb
ул. Петър Делян | 23.334317 | 42.9647839 | Своге | Bulgaria | bg | sx8
(10 rows)
Time: 37.754 ms
- With all of that in place, let’s find the nearest named points to a place I was visiting in April:
WITH d AS (
SELECT name, city, country,
distance_km(33.59814807733679, -117.8692855497005, lat, lon)::NUMERIC(12, 3) dist_km,
lat, lon, geohash3
FROM osm_names
)
SELECT * FROM d
WHERE dist_km <= 0.5
ORDER BY dist_km ASC;
name | city | country | dist_km | lat | lon | geohash3
-------------------+---------------+--------------------------+---------+------------+--------------+----------
Marigold Avenue | Newport Beach | United States of America | 0.203 | 33.596331 | -117.8695226 | 9mu
Narcissus Avenue | Newport Beach | United States of America | 0.233 | 33.5961836 | -117.8683826 | 9mu
Corona del Mar | Newport Beach | United States of America | 0.333 | 33.5995229 | -117.8724843 | 9mu
Orchid Avenue | Newport Beach | United States of America | 0.349 | 33.5951653 | -117.8681045 | 9mu
Hibiscus Court | Newport Beach | United States of America | 0.406 | 33.6014666 | -117.8674516 | 9mu
Gardenia Way | Newport Beach | United States of America | 0.432 | 33.601487 | -117.8668929 | 9mu
Poinsettia Avenue | Newport Beach | United States of America | 0.433 | 33.5945788 | -117.8674148 | 9mu
Lilac Avenue | Newport Beach | United States of America | 0.460 | 33.5981828 | -117.8643103 | 9mu
Seaview Avenue | Newport Beach | United States of America | 0.471 | 33.5956631 | -117.8734069 | 9mu
Doffodil Avenue | Newport Beach | United States of America | 0.481 | 33.5985276 | -117.8641069 | 9mu
Poppy Avenue | Newport Beach | United States of America | 0.499 | 33.5944137 | -117.8662819 | 9mu
(11 rows)
Time: 92.432 ms
Sounds about right. I was in Corona del Mar, California, and these are names
of streets in the area. CedarDB needed to do a full table scan, filter by that
distance predicate, and then sort by the dist_km
value; it did that in less
than 100 ms. Impressive by itself!
Recall from earlier: that geohash3
column was the first part of the primary
key. The geohash of a (latitude, longitude) pair provides a string
representation of that point, to a precision that varies depending on the
length of the geohash. With three characters, that equates to a box of +/-
78 km around the point. This suggests a way to speed up queries like these for
cases where the area of interest is confined (as it typically is):
postgres=# WITH ll AS (
SELECT 33.59814807733679 lat, -117.8692855497005 lon
),
d AS
(
SELECT name, city, country,
distance_km(ll.lat, ll.lon, o.lat, o.lon)::NUMERIC(12, 3) dist_km,
o.lat, o.lon, geohash3
FROM ll, osm_names o
WHERE geohash3 = encode_geohash(ll.lat, ll.lon, 3)
)
SELECT * FROM d
WHERE dist_km <= 0.5
ORDER BY dist_km ASC;
name | city | country | dist_km | lat | lon | geohash3
-------------------+---------------+--------------------------+---------+------------+--------------+----------
Marigold Avenue | Newport Beach | United States of America | 0.203 | 33.596331 | -117.8695226 | 9mu
Narcissus Avenue | Newport Beach | United States of America | 0.233 | 33.5961836 | -117.8683826 | 9mu
Corona del Mar | Newport Beach | United States of America | 0.333 | 33.5995229 | -117.8724843 | 9mu
Orchid Avenue | Newport Beach | United States of America | 0.349 | 33.5951653 | -117.8681045 | 9mu
Hibiscus Court | Newport Beach | United States of America | 0.406 | 33.6014666 | -117.8674516 | 9mu
Gardenia Way | Newport Beach | United States of America | 0.432 | 33.601487 | -117.8668929 | 9mu
Poinsettia Avenue | Newport Beach | United States of America | 0.433 | 33.5945788 | -117.8674148 | 9mu
Lilac Avenue | Newport Beach | United States of America | 0.460 | 33.5981828 | -117.8643103 | 9mu
Seaview Avenue | Newport Beach | United States of America | 0.471 | 33.5956631 | -117.8734069 | 9mu
Doffodil Avenue | Newport Beach | United States of America | 0.481 | 33.5985276 | -117.8641069 | 9mu
Poppy Avenue | Newport Beach | United States of America | 0.499 | 33.5944137 | -117.8662819 | 9mu
(11 rows)
Time: 9.215 ms
With this optimization, query time is pretty amazing at less than 10 ms. Why did
we get that 10x speedup? Did CedarDB use the primary key index? No, it didn’t.
With this data set, the optimizer chose a plan that relies on early pruning;
e.g. when the data is at least somewhat clustered by geohash3
, then CedarDB can
skip all the blocks where it rules out the possibility of a hit during the
table scan. Interested readers could compare the results of running the two variants
of this query, but with EXPLAIN ANALYZE
prepended, noting the differences in the
TABLESCAN on osm_names
line.
- If you like, you can stop CedarDB by pressing
CTRL-C
in the terminal where you randocker run ...
(though I’d leave it running and continue to experiment with it).
Wrapping up
Congratulations on solving a common geospatial data problem! Let’s briefly review what we accomplished:
- Fetched and ran the CedarDB Docker image
- Connected using the ubiquitous
psql
CLI - Created a table, along with a UDF to compute distance between points
- Quickly (within about 10 seconds) loaded 9M points into the table, from a CSV (TSV) file
- Ran two variations of the query to find the closest points
- Finally, showed how effective CedarDB’s early pruning is at yielding extremely efficient table scans
If you’d like to discuss your data challenges with us, we’d love to hear from you.