May 28, 2025 · 11 minutes

Solve a Geospatial (GIS) problem with CedarDB!

Let’s solve a common geospatial question: given a point, find the closest points of interest.

Michael Goddard

Motivation

If you share my interest in finding things, I hope you’ll find this brief post worthwhile. I’ve been interested in databases for a while now, and during that time I’ve consistently been intrigued by text and spatial data. When I got my hands on CedarDB, I was very excited about its potential. Then I heard a request from a user for a geospatial feature. He was kind enough to share his specific need, which was essentially: given a point, find all other points located within a specified distance. (In this exercise, we’ll show we can do this in about 10 ms on a 9M-row table.)

Points within 500 m of a given location
(Source: https://geojson.io/#map=14.5/33.59808/-117.86917, with the “OSM” option)

Let’s consider an experiment

That got me thinking: this desire to find the points closest to a given one may be the most common use of geospatial queries. For example, say I am looking for the nearest coffee shops or gas stations. One formulation of the question could be “show me the ten nearest coffee shops”; an alternative could be “show me all coffee shops within two kilometers.” What do we need to get this done?

  1. Data on coffee shops
  2. A database table to store that data
  3. A function to calculate distances between points
  4. A SQL query
  • Data with details on coffee shops, including locations expressed as a (latitude, longitude) pair. I confess I didn’t have such data handy, but I was able to find the OSM Names data set, which I then distilled to a more manageable size (see “What you need” below). Here is a sample of this data:
$ head mini_osm_names.tsv
Farmer Island	-147.0654248	-76.6071246	Farmer Island	Antarctica	aq	0dr
Shepard Island	-132.6442311	-74.4317802	Shepard Island	Antarctica	aq	14c
Grant Island	-131.4958397	-74.4629826	Grant Island	Antarctica	aq	14f
Dean Island	-127.574413	-74.4741434	Dean Island	Antarctica	aq	14v
Mount Siple	-126.6667177	-73.4333347	Siple Island	Antarctica	aq	14v
Siple Island	-125.8606286	-73.7713537	Siple Island	Antarctica	aq	14y
Carney Island	-121.4802313	-74.0504624	Carney Island	Antarctica	aq	16c
Thurston Island	-98.8154634	-72.1603856	Thurston Island	Antarctica	aq	1g1
Sprekkehallet	-90.6687707	-68.8718615	Peter I Island	Antarctica	aq	1gz
Toftefallet	-90.6933725	-68.8331335	Peter I Island	Antarctica	aq	1gz
  • A database table into which we load this data, so we can query it; something like this (you may be curious about the geohash3 column, and I’ll explain that later):
CREATE TABLE osm_names
(
  name STRING NOT NULL
  , lon FLOAT NOT NULL
  , lat FLOAT NOT NULL
  , city STRING NOT NULL
  , country STRING
  , country_code CHAR(2)
  , geohash3 CHAR(3) -- ± 78 km
  , PRIMARY KEY (geohash3, city, name)
);
  • A distance function that operates within the database. Here, I leaned on AI and asked ChatGPT to build a user-defined function for PostgreSQL, and it did an excellent job (note that I edited it slightly to conform to the CedarDB dialect for UDFs). It implements the haversine formula, which computes the great-circle distance between two points on a sphere:
CREATE FUNCTION distance_km (lat1 FLOAT, lon1 FLOAT, lat2 FLOAT, lon2 FLOAT)
RETURNS FLOAT
AS $$
  2 * 6_371 * ASIN(
    SQRT(
      SIN(RADIANS(lat2 - lat1) / 2)^2
      + COS(RADIANS(lat1)) * COS(RADIANS(lat2)) * SIN(RADIANS(lon2 - lon1) / 2)^2 
    )
  )
$$
LANGUAGE sql IMMUTABLE STRICT;
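As a sanity check on the UDF, here is the same haversine formula as a small Python function (just an illustrative sketch, not part of the CedarDB setup), verified against a distance that shows up in the query results later in this post:

```python
from math import radians, sin, cos, asin, sqrt

def distance_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km, mirroring the SQL UDF
    (mean Earth radius 6371 km)."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Corona del Mar to Marigold Avenue: roughly 0.203 km, matching the
# dist_km column in the query output further down.
d = distance_km(33.598148, -117.869286, 33.596331, -117.8695226)
```

Note that the haversine formula assumes a spherical Earth, which is accurate to within about 0.5% — more than good enough for a nearest-points search.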
  • Finally, a SQL query:
WITH d AS (
  SELECT name, city, country,
    distance_km(33.59814807733679, -117.8692855497005, lat, lon)::NUMERIC(12, 3) dist_km,
    lat, lon, geohash3
  FROM osm_names
)
SELECT * FROM d
WHERE dist_km <= 0.5
ORDER BY dist_km ASC;

What you need

  • Docker: I am using Docker Desktop on a MacBook Pro (ARM chip)
  • CedarDB: here is the Get Started
  • Data set: this is a sparse version of the original data set, suitable for this little exercise
  • The psql CLI, the usual command-line interface for PostgreSQL; on my Mac, I was able to just brew install libpq (but then I had to symlink /opt/homebrew/Cellar/libpq/17.4_1/bin/psql to /usr/local/bin/psql so it would be in $PATH)

Let’s try it …

  • Pull the CedarDB Docker image:
$ docker pull cedardb/cedardb
Using default tag: latest
latest: Pulling from cedardb/cedardb
Digest: sha256:e16c690c1642d8bf0053f30512be25e619f2309404f4e7c8b9ce4cf237c3cedc
Status: Image is up to date for cedardb/cedardb:latest
docker.io/cedardb/cedardb:latest
  • Start CedarDB:
$ docker run --rm -p 5432:5432 -e CEDAR_PASSWORD=postgres --name cedardb cedardb/cedardb
[INFO] Setting up database directory
[INFO] Creating superuser: postgres
ALTER ROLE

[INFO] Creating database: postgres
[INFO] Done setting up database

/usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/*

[INFO] CedarDB init process complete.

[INFO] You are running the most recent CedarDB version (v2025-05-28)

2025-05-30 17:26:14.445188927 UTC	INFO:    Using 10798 MB buffers, 10798 MB work memory
2025-05-30 17:26:14.498580760 UTC	WARNING: Your storage device uses write back caching. Write durability might be delayed. See: https://cedardb.com/docs/references/writecache
2025-05-30 17:26:14.498633593 UTC	INFO:    You're running CEDARDB COMMUNITY EDITION - using 0 GB out of 64 GB. Our General Terms and Conditions apply to the use of the CedarDB Community Edition. Run "cedardb --license" for more information.
  • In a different terminal, connect to CedarDB using psql:
$ PGPASSWORD=postgres psql -h localhost -U postgres -d postgres
Timing is on.
psql (17.4, server 16.3 cedar v2025-05-28)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off, ALPN: none)
Type "help" for help.
  • Using that psql connection, create the two database objects:
CREATE TABLE osm_names
(
  name STRING NOT NULL
  , lon FLOAT NOT NULL
  , lat FLOAT NOT NULL
  , city STRING NOT NULL
  , country STRING
  , country_code CHAR(2)
  , geohash3 CHAR(3) -- ± 78 km
  , PRIMARY KEY (geohash3, city, name)
);

CREATE FUNCTION distance_km (lat1 FLOAT, lon1 FLOAT, lat2 FLOAT, lon2 FLOAT)
RETURNS FLOAT
AS $$
  2 * 6_371 * ASIN(
    SQRT(
      SIN(RADIANS(lat2 - lat1) / 2)^2
      + COS(RADIANS(lat1)) * COS(RADIANS(lat2)) * SIN(RADIANS(lon2 - lon1) / 2)^2 
    )
  )
$$
LANGUAGE sql IMMUTABLE STRICT;
  • In a new terminal window, download and extract the compressed archive containing the tab-separated data file:
$ curl https://download.cedardb.com/datasets/mini_osm_names.tar.gz | tar xzvf -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  142M  100  142M    0     0  21.3M      0  0:00:06  0:00:06 --:--:-- 26.0M
x mini_osm_names.tsv

x LICENSE.txt
  • From that same terminal, use COPY to load the data from the tab-separated data file:
$ cat mini_osm_names.tsv | PGPASSWORD=postgres psql -h localhost -U postgres -d postgres -c "COPY osm_names FROM stdin DELIMITER '\t' QUOTE '\r' CSV NULL '' ESCAPE '';"
Timing is on.
COPY 8997054
Time: 10010.060 ms (00:10.010)

You now have roughly 9M rows in the osm_names table. Each row corresponds to a named place from the OSM Names data (shown here via the psql client connection):

postgres=# select * from osm_names tablesample reservoir (10);
         name         |     lon      |     lat     |              city              |         country          | country_code | geohash3
----------------------+--------------+-------------+--------------------------------+--------------------------+--------------+----------
 Calle Sinchi Roca    |  -76.7173288 | -11.9351765 | Lurigancho                     | Peru                     | pe           | 6mc
 Şin                  |   47.1146715 |  41.3558897 | Şin                            | Azerbaijan               | az           | tp3
 Большой Шелюг        |   47.7557516 |  60.4766813 | Папуловское сельское поселение | Russia                   | ru           | v4c
 Rio Içana            |  -68.4741737 |   1.3784914 | São Gabriel da Cachoeira       | Brazil                   | br           | d2p
 Cedar Drive          | -107.7721722 |  39.5392617 | Rifle                          | United States of America | us           | 9x5
 Gloria Drive         |  -83.4561877 |   42.313082 | Canton                         | United States of America | us           | dps
 伊武田埼             |  123.8328693 |  24.4075483 | Iriomotejima                   | Japan                    | jp           | wu2
 Kostolná             |   18.8939491 |  48.5970825 | District of Žiar nad Hronom    | Slovakia                 | sk           | u2t
 HaKomuna HaHaderatit |   34.9014796 |  32.4457155 | Hadera                         | Israel                   | il           | svb
 ул. Петър Делян      |    23.334317 |  42.9647839 | Своге                          | Bulgaria                 | bg           | sx8
(10 rows)

Time: 37.754 ms
  • With all of that in place, let’s find the nearest named points to a place I was visiting in April:
WITH d AS (
  SELECT name, city, country,
    distance_km(33.59814807733679, -117.8692855497005, lat, lon)::NUMERIC(12, 3) dist_km,
    lat, lon, geohash3
  FROM osm_names
)
SELECT * FROM d
WHERE dist_km <= 0.5
ORDER BY dist_km ASC;
       name        |     city      |         country          | dist_km |    lat     |     lon      | geohash3
-------------------+---------------+--------------------------+---------+------------+--------------+----------
 Marigold Avenue   | Newport Beach | United States of America |   0.203 |  33.596331 | -117.8695226 | 9mu
 Narcissus Avenue  | Newport Beach | United States of America |   0.233 | 33.5961836 | -117.8683826 | 9mu
 Corona del Mar    | Newport Beach | United States of America |   0.333 | 33.5995229 | -117.8724843 | 9mu
 Orchid Avenue     | Newport Beach | United States of America |   0.349 | 33.5951653 | -117.8681045 | 9mu
 Hibiscus Court    | Newport Beach | United States of America |   0.406 | 33.6014666 | -117.8674516 | 9mu
 Gardenia Way      | Newport Beach | United States of America |   0.432 |  33.601487 | -117.8668929 | 9mu
 Poinsettia Avenue | Newport Beach | United States of America |   0.433 | 33.5945788 | -117.8674148 | 9mu
 Lilac Avenue      | Newport Beach | United States of America |   0.460 | 33.5981828 | -117.8643103 | 9mu
 Seaview Avenue    | Newport Beach | United States of America |   0.471 | 33.5956631 | -117.8734069 | 9mu
 Doffodil Avenue   | Newport Beach | United States of America |   0.481 | 33.5985276 | -117.8641069 | 9mu
 Poppy Avenue      | Newport Beach | United States of America |   0.499 | 33.5944137 | -117.8662819 | 9mu
(11 rows)

Time: 92.432 ms

Sounds about right. I was in Corona del Mar, California, and these are the names of streets in that area. CedarDB had to do a full table scan, filter by the distance predicate, and then sort by the dist_km value; it did all of that in less than 100 ms. Impressive by itself!

Recall from earlier that the geohash3 column is the first part of the primary key. The geohash of a (latitude, longitude) pair is a short string representation of that point, with a precision that depends on the geohash’s length; at three characters, it identifies a box of about ±78 km around the point. This suggests a way to speed up queries like these whenever the area of interest is confined (as it typically is):

postgres=# WITH ll AS (
  SELECT 33.59814807733679 lat, -117.8692855497005 lon
),
d AS
(
  SELECT name, city, country,
    distance_km(ll.lat, ll.lon, o.lat, o.lon)::NUMERIC(12, 3) dist_km,
    o.lat, o.lon, geohash3
  FROM ll, osm_names o
  WHERE geohash3 = encode_geohash(ll.lat, ll.lon, 3)
)
SELECT * FROM d
WHERE dist_km <= 0.5
ORDER BY dist_km ASC;
       name        |     city      |         country          | dist_km |    lat     |     lon      | geohash3
-------------------+---------------+--------------------------+---------+------------+--------------+----------
 Marigold Avenue   | Newport Beach | United States of America |   0.203 |  33.596331 | -117.8695226 | 9mu
 Narcissus Avenue  | Newport Beach | United States of America |   0.233 | 33.5961836 | -117.8683826 | 9mu
 Corona del Mar    | Newport Beach | United States of America |   0.333 | 33.5995229 | -117.8724843 | 9mu
 Orchid Avenue     | Newport Beach | United States of America |   0.349 | 33.5951653 | -117.8681045 | 9mu
 Hibiscus Court    | Newport Beach | United States of America |   0.406 | 33.6014666 | -117.8674516 | 9mu
 Gardenia Way      | Newport Beach | United States of America |   0.432 |  33.601487 | -117.8668929 | 9mu
 Poinsettia Avenue | Newport Beach | United States of America |   0.433 | 33.5945788 | -117.8674148 | 9mu
 Lilac Avenue      | Newport Beach | United States of America |   0.460 | 33.5981828 | -117.8643103 | 9mu
 Seaview Avenue    | Newport Beach | United States of America |   0.471 | 33.5956631 | -117.8734069 | 9mu
 Doffodil Avenue   | Newport Beach | United States of America |   0.481 | 33.5985276 | -117.8641069 | 9mu
 Poppy Avenue      | Newport Beach | United States of America |   0.499 | 33.5944137 | -117.8662819 | 9mu
(11 rows)

Time: 9.215 ms
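For intuition about what encode_geohash(ll.lat, ll.lon, 3) computes, here is the standard geohash algorithm sketched in Python (an illustrative re-implementation; CedarDB’s built-in may differ in details): alternately bisect the longitude and latitude ranges, recording one bit per bisection, and pack each run of five bits into one base32 character.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash base32 alphabet

def encode_geohash(lat, lon, length):
    """Standard geohash: interleave longitude/latitude bisection bits
    (longitude first), emitting one base32 character per 5 bits."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    result, bits, bit_count, even = "", 0, 0, True  # even bit -> longitude
    while len(result) < length:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits = bits * 2 + 1  # point is in the upper half
            rng[0] = mid
        else:
            bits = bits * 2      # point is in the lower half
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:
            result += BASE32[bits]
            bits, bit_count = 0, 0
    return result
```

Running this on the Corona del Mar coordinates yields "9mu", and on the Thurston Island row from the sample data yields "1g1" — matching the geohash3 values shown above.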

With this optimization, query time is pretty amazing: under 10 ms. Why the 10x speedup? Did CedarDB use the primary key index? No, it didn’t. With this data set, the optimizer chose a plan that relies on early pruning: because the data is at least somewhat clustered by geohash3, CedarDB can skip every block where it can rule out the possibility of a hit during the table scan. Interested readers can compare the two variants of this query with EXPLAIN ANALYZE prepended, noting the differences in the TABLESCAN on osm_names line. (One caveat: filtering on a single geohash cell can miss points that lie just across a cell boundary, so a production query would also check the neighboring cells.)
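Conceptually, the faster plan behaves like a two-phase filter: a cheap prune on the geohash prefix, then the exact (and more expensive) distance computation on the survivors. A toy Python sketch of that idea, using a hypothetical three-row table lifted from the outputs above:

```python
from math import radians, sin, cos, asin, sqrt

def distance_km(lat1, lon1, lat2, lon2):
    # Same haversine formula as the SQL UDF (radius 6371 km).
    a = (sin(radians(lat2 - lat1) / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2))
         * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

# Toy table: (name, lat, lon, geohash3), rows lifted from the outputs above.
rows = [
    ("Marigold Avenue",  33.596331,  -117.8695226, "9mu"),
    ("Narcissus Avenue", 33.5961836, -117.8683826, "9mu"),
    ("Thurston Island", -72.1603856,  -98.8154634, "1g1"),
]

query_lat, query_lon = 33.59814807733679, -117.8692855497005

# Phase 1: cheap equality test on the geohash prefix. This is the step
# that lets CedarDB skip whole blocks when data is clustered by geohash3.
candidates = [r for r in rows if r[3] == "9mu"]

# Phase 2: exact distance check, only on the surviving candidates.
hits = [(name, distance_km(query_lat, query_lon, lat, lon))
        for name, lat, lon, _ in candidates]
hits = [(name, d) for name, d in hits if d <= 0.5]
```

The prune discards the Antarctica row without ever computing its distance; only the two "9mu" rows reach the haversine step.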

  • If you like, you can stop CedarDB by pressing CTRL-C in the terminal where you ran docker run ... (though I’d leave it running and continue to experiment with it).

Wrapping up

Congratulations on solving a common geospatial data problem! Let’s briefly review what we accomplished:

  1. Fetched and ran the CedarDB Docker image
  2. Connected using the ubiquitous psql CLI
  3. Created a table, along with a UDF to compute distance between points
  4. Quickly (in about 10 seconds) loaded 9M points into the table from a tab-separated (TSV) file
  5. Ran two variations of the query to find the closest points
  6. Finally, showed how effective CedarDB’s early pruning is at yielding extremely efficient table scans

If you’d like to discuss your data challenges with us, we’d love to hear from you.