The newly released Parquet support in CedarDB streamlines data exchange.
My recent LinkedIn post (below, with minor changes) provides the motivation for the more detailed deep dive presented in this article:
All this snow has me thinking about skiing (again). And data. And databases. I was inspired by the “Analyzing Stack Overflow data with ClickHouse” article I read, thinking how amazing it is they just raised $400M. After working through their examples, it occurred to me that, as fun as that was, the SQL queries there were very simple. In skiing jargon, I might liken them to the green ski runs, for beginners.
As skiers progress, they move to the blue intermediate runs, then may get bold and try the black diamond runs. That's the direction I headed with the Stack Overflow data set. ClickHouse provides a convenient Parquet export function, which I used in conjunction with CedarDB's Parquet import functionality to load my CedarDB instance with the data set I'd used in ClickHouse. I deployed each of the DBs in Docker on my m7a.8xlarge (32 vCPU, 128 GB RAM, $1.85/hour) EC2 node, running them one at a time. A little pair programming with ChatGPT yielded a set of seven interesting queries, which I ran on each of these DBs, discarding the initial run time and recording the average timing for five runs.
Why the Stack Overflow data set?
The Stack Overflow data set appeals to us for several reasons, the first of which is that many of us can relate to it, having used the Web site over the years. Some other aspects are:
- It’s a real-world data set with a schema that spans multiple tables that relate to one another.
- Real data is messy, so it's a great test for database systems. Import/export is hard (nulls, escape symbols, etc.) and queries are often slow (as illustrated here by ClickHouse's struggle to handle complex joins).
- ClickHouse chose to publish an article around it, which provides a nice starting point for comparing ClickHouse with CedarDB.
- The data set presents both an opportunity to write some simple SQL queries that compute aggregates against a single table, as you can see here (from the ClickHouse post; a Postgres-flavored equivalent of this query appears just after this list):
SELECT
arrayJoin(arrayFilter(t -> (t != ''), splitByChar('|', Tags))) AS Tags,
count() AS c
FROM stackoverflow.posts
GROUP BY Tags
ORDER BY c DESC
LIMIT 10
- … but also an opportunity to explore more complex relationships through SQL queries that join multiple tables; for example, this query that lists the 10 most recent questions (since 2019) that have an accepted answer, and shows the question title and who asked it, the accepted answer ID and who wrote it, and (if available) a badge name belonging to the person who wrote the accepted answer:
SELECT
q."Id" AS question_id,
q."Title" AS question_title,
qu."DisplayName" AS question_owner,
a."Id" AS answer_id,
au."DisplayName" AS answer_owner,
b."Name" AS answerer_badge
FROM posts q
JOIN posts a
ON a."Id" = q."AcceptedAnswerId"
JOIN users qu
ON qu."Id" = q."OwnerUserId"
JOIN users au
ON au."Id" = a."OwnerUserId"
LEFT JOIN badges b
ON b."UserId" = au."Id"
WHERE q."AcceptedAnswerId" IS NOT NULL
AND q."Title" <> ''
AND q."CreationDate" >= TIMESTAMP '2019-01-01'
ORDER BY q."CreationDate" DESC
LIMIT 10;
- Since ClickHouse already covered the simpler queries in its post, the focus here will be on the more complex ones like the one above, the black diamond runs if you will. This is what drives us to consider using Parquet to migrate our workload over to CedarDB.
- What does this Parquet migration process look like in practice? This is the essence of it:
-- Clickhouse:
ip-10-0-1-175.us-east-2.compute.internal :) SELECT *
:-] FROM votes
:-] INTO OUTFILE 'votes.parquet'
:-] FORMAT Parquet;
-- CedarDB:
postgres=# CREATE TABLE votes AS SELECT * FROM '/var/lib/cedardb/data/ext/votes.parquet';
SELECT 238984011
Time: 46287.163 ms (00:46.287)
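As an aside, once the data lands in CedarDB, even the ClickHouse-specific tag query shown earlier has a natural Postgres-style equivalent. Here's a sketch, assuming CedarDB's Postgres compatibility covers string_to_array and unnest (an assumption on my part, not something verified in this experiment):
-- Split the pipe-delimited Tags string, drop empties, and count the ten most common tags
SELECT tag, COUNT(*) AS c
FROM posts,
     LATERAL unnest(string_to_array("Tags", '|')) AS tag
WHERE tag <> ''
GROUP BY tag
ORDER BY c DESC
LIMIT 10;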
Overview of the experiment
Refer to Appendix: Après-Ski Details, below, for the step-by-step procedure.
- Deployed an EC2 instance in us-east-2 using this CloudFormation template
- Started ClickHouse in Docker
- Loaded the Stack Overflow data set
- Ran the seven SQL queries, collecting telemetry on the runs
- Exported the data in Parquet format to the local filesystem
- Shut down ClickHouse
- Started up CedarDB in Docker
- Loaded the data set from the Parquet files
- Repeated the query runs and timing collection
The questions we’ll ask
Here are the questions that our SQL queries will ask, numbered to correspond with the numbered files used in the experiments and with the entries in the results table below. Each question is followed by the SQL query that answers it.
Q1: Find the top 10 most frequently awarded badge types among high-reputation users (users with reputation ≥ 100,000), considering only users who have posts created on or after January 1, 2018, and requiring that those posts have at least one comment.
SELECT
b."Name" AS badge_name,
COUNT(*) AS badge_awards
FROM badges b
JOIN users u
ON u."Id" = b."UserId"
JOIN posts p
ON p."OwnerUserId" = u."Id"
JOIN comments c
ON c."PostId" = p."Id"
WHERE u."Reputation"::int >= 100000
AND p."CreationDate" >= TIMESTAMP '2018-01-01'
GROUP BY b."Name"
ORDER BY badge_awards DESC
LIMIT 10;
Q2: Find posts that the community strongly disagrees on (those with lots of both upvotes and downvotes) and highlight the ones where opinion is most evenly divided, especially those that sparked substantial discussion.
WITH vote_counts AS (
SELECT
v."PostId",
SUM(CASE WHEN v."VoteTypeId" = 2 THEN 1 ELSE 0 END) AS upvotes,
SUM(CASE WHEN v."VoteTypeId" = 3 THEN 1 ELSE 0 END) AS downvotes
FROM votes v
GROUP BY v."PostId"
)
SELECT
p."Id",
p."Title",
u."DisplayName" AS owner,
vc.upvotes,
vc.downvotes,
(CASE WHEN vc.upvotes > vc.downvotes THEN vc.upvotes - vc.downvotes ELSE vc.downvotes - vc.upvotes END) AS abs_diff,
COUNT(c."Id") AS comment_cnt
FROM vote_counts vc
JOIN posts p
ON p."Id" = vc."PostId"
JOIN users u
ON u."Id" = p."OwnerUserId"
LEFT JOIN comments c
ON c."PostId" = p."Id"
WHERE p."Title" <> ''
AND vc.upvotes >= 50
AND vc.downvotes >= 50
GROUP BY p."Id", p."Title", u."DisplayName", vc.upvotes, vc.downvotes
ORDER BY abs_diff ASC, comment_cnt DESC
LIMIT 10;
Q3: List the 10 most recent questions (since 2019) that have an accepted answer, and show (1) the question title and who asked it, (2) the accepted answer ID and who wrote it, and (3, if available) a badge name belonging to the person who wrote the accepted answer.
SELECT
q."Id" AS question_id,
q."Title" AS question_title,
qu."DisplayName" AS question_owner,
a."Id" AS answer_id,
au."DisplayName" AS answer_owner,
b."Name" AS answerer_badge
FROM posts q
JOIN posts a
ON a."Id" = q."AcceptedAnswerId"
JOIN users qu
ON qu."Id" = q."OwnerUserId"
JOIN users au
ON au."Id" = a."OwnerUserId"
LEFT JOIN badges b
ON b."UserId" = au."Id"
WHERE q."AcceptedAnswerId" IS NOT NULL
AND q."Title" <> ''
AND q."CreationDate" >= TIMESTAMP '2019-01-01'
ORDER BY q."CreationDate" DESC
LIMIT 10;
Q4: Highlight popular posts that were 'community-touched' (edited by someone else), and rank them by overall vote volume, using comment count as a tiebreaker.
WITH vote_counts AS (
SELECT
v."PostId",
COUNT(*) AS votes_total
FROM votes v
GROUP BY v."PostId"
)
SELECT
p."Id",
p."Title",
owner."DisplayName" AS owner_name,
editor."DisplayName" AS editor_name,
editor."Reputation" AS editor_rep,
COALESCE(vc.votes_total, 0) AS votes_total,
COUNT(c."Id") AS comment_cnt
FROM posts p
JOIN users owner
ON owner."Id" = p."OwnerUserId"
JOIN users editor
ON editor."Id" = p."LastEditorUserId"
LEFT JOIN vote_counts vc
ON vc."PostId" = p."Id"
LEFT JOIN comments c
ON c."PostId" = p."Id"
WHERE p."LastEditorUserId" IS NOT NULL
AND p."Title" <> ''
AND p."OwnerUserId" <> p."LastEditorUserId"
GROUP BY
p."Id", p."Title", owner."DisplayName", editor."DisplayName", editor."Reputation", vc.votes_total
ORDER BY votes_total DESC, comment_cnt DESC
LIMIT 10;
Q5: Show the 10 most recent links between posts (since 2020), including the titles and authors on both sides of each link.
SELECT
pl."PostId" AS src_post_id,
src."Title" AS src_title,
su."DisplayName" AS src_owner,
pl."RelatedPostId" AS dst_post_id,
dst."Title" AS dst_title,
du."DisplayName" AS dst_owner,
pl."CreationDate" AS link_time
FROM postlinks pl
JOIN posts src
ON src."Id" = pl."PostId"
JOIN posts dst
ON dst."Id" = pl."RelatedPostId"
LEFT JOIN users su
ON su."Id" = src."OwnerUserId"
LEFT JOIN users du
ON du."Id" = dst."OwnerUserId"
WHERE src."Title" <> ''
AND dst."Title" <> ''
AND pl."CreationDate" >= TIMESTAMP '2020-01-01'
ORDER BY pl."CreationDate" DESC
LIMIT 10;
Q6: Highlight the posts that have been 'worked on' the most over time, and whether that work came from a single editor or a broad set of contributors.
SELECT
p."Id" AS post_id,
p."Title",
owner."DisplayName" AS owner,
editor."DisplayName" AS last_editor,
COUNT(ph."Id") AS revisions,
COUNT(DISTINCT ph."UserId") AS distinct_revision_authors
FROM posts p
JOIN users owner
ON owner."Id" = p."OwnerUserId"
LEFT JOIN users editor
ON editor."Id" = p."LastEditorUserId"
JOIN posthistory ph
ON ph."PostId" = p."Id"
WHERE p."Title" <> ''
GROUP BY p."Id", p."Title", owner."DisplayName", editor."DisplayName"
HAVING COUNT(ph."Id") >= 20
ORDER BY revisions DESC
LIMIT 10;
Q7: Determine which combinations of tags generate the most answers overall, and whether those answers tend to come from high-reputation or lower-reputation contributors.
SELECT
q."Tags",
COUNT(a."Id") AS answers_cnt,
AVG(au."Reputation"::int) AS avg_answerer_rep
FROM posts a
JOIN posts q
ON q."Id" = a."ParentId"::int
JOIN users au
ON au."Id" = a."OwnerUserId"
WHERE a."PostTypeId" = 'Answer'
AND q."Tags" <> ''
GROUP BY q."Tags"
ORDER BY answers_cnt DESC
LIMIT 10;
Results and observations
- AWS EC2 instance type: m7a.8xlarge (32 vCPU, 128 GB RAM, $1.85/hour)
- Docker provided a simple means of deployment for both of these DB engines.
- CedarDB version: v2026-02-03
- ClickHouse server version: 26.1.2.11
- Parquet support provided the “chair lift”, taking us from the beginner SQL queries (ClickHouse) to the advanced ones (CedarDB).
- In CedarDB, with a single exception, we created the tables using `CREATE TABLE ... AS SELECT ...` from the Parquet data.
- The `psql` CLI was our interface into both systems, except for when we dumped the Parquet data from ClickHouse.
- CedarDB outperformed ClickHouse on all seven queries, ranging from about 1.5x to 11x faster.
This table summarizes the results:
| Query # | ClickHouse time (ms) | CedarDB time (ms) | Ratio |
|---|---|---|---|
| 1 | 22432.234 | 3036.361 | 7.388 |
| 2 | 1726.711 | 423.479 | 4.077 |
| 3 | 13251.638 | 1146.439 | 11.559 |
| 4 | 5601.891 | 943.241 | 5.939 |
| 5 | 803.491 | 163.510 | 4.914 |
| 6 | 2752.464 | 1836.822 | 1.498 |
| 7 | 1374.796 | 695.075 | 1.978 |
| Geometric mean speedup | | | 4.363 |
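For the record, the geometric mean in the last row is just the exponential of the average log ratio. Here's a quick SQL sketch of that calculation, assuming the per-query averages were loaded into a hypothetical timings table (not something used in the experiment itself):
-- Hypothetical staging table for the per-query averages from the two CSV files:
-- CREATE TABLE timings (query_no INT, clickhouse_ms DOUBLE PRECISION, cedardb_ms DOUBLE PRECISION);
SELECT exp(avg(ln(clickhouse_ms / cedardb_ms))) AS geometric_mean_speedup
FROM timings;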
My Takeaways
With its new support for the Parquet data format, CedarDB makes ingesting Parquet data painless (and it's easy to deploy using Docker).
As you'll see in the appendix below, ClickHouse asks you to provide a fair bit of insight into the data up front:
- `CreationDate DateTime64(3, 'UTC')`: contrast this with simply `TIMESTAMPTZ` in CedarDB (and Postgres)
- `Id Int32 CODEC(Delta(4), ZSTD(1))`: you specify compression strategies
- `ContentLicense LowCardinality(String)`: you enable storage optimizations (e.g. dictionary encoding)
- `ENGINE = MergeTree ORDER BY (CreationDate, PostId)`: you provide some hints on how best to arrange the data
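For contrast, here is a trimmed-down sketch of the same table in CedarDB's Postgres-flavored DDL (a column subset of the explicit posts definition shown in the appendix): plain types, no codecs, engines, or ordering hints.
-- Abbreviated CedarDB/Postgres-style DDL; see the appendix for the full column list
CREATE TABLE posts
(
    "Id"               INTEGER,
    "PostTypeId"       TEXT,
    "AcceptedAnswerId" UINT4,
    "CreationDate"     TIMESTAMP,
    "Score"            INTEGER,
    "Title"            TEXT,
    "Tags"             TEXT,
    "ContentLicense"   TEXT
    -- ... remaining columns omitted for brevity
);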
If you’ve encountered challenges with ClickHouse when your queries involve complex joins, then CedarDB provides immediate relief.
It's also worth pointing out that, unlike ClickHouse, CedarDB can handle both your OLAP and your OLTP workloads simultaneously, and it looks just like Postgres. That means you get to run your analytical queries on fresh data, with zero replication lag and a simpler data architecture.
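To make that concrete, here is a minimal sketch against the tables we just loaded; the user Id is made up, and the point is simply that a transactional write is immediately visible to an analytical read on the same system:
-- OLTP-style point update (hypothetical user Id; Reputation is stored as text in this schema)
UPDATE users SET "Reputation" = '100001' WHERE "Id" = 12345;

-- OLAP-style aggregate over the same, freshly updated data; no ETL step, no replica lag
SELECT COUNT(*) AS high_rep_users
FROM users
WHERE "Reputation"::int >= 100000;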
Parquet doesn’t care which DB engine you ride. It just gets you to the top. What happens on the way down is up to the database.
Thank you for joining me on this Winter adventure!
If you’d like to discuss your data challenges with us, we’d love to hear from you.
Appendix: Après-Ski Details
Deploy onto an EC2 instance with these specs (and a 768 GB EBS disk) into us-east-2:
ubuntu@ip-10-0-1-175:~$ hostnamectl
Static hostname: ip-10-0-1-175
Icon name: computer-vm
Chassis: vm
Machine ID: ec21a20c0e5541c2b2a709e6d4c6a213
Boot ID: 2a5966b0c4f1448b98adf6c1f2a7aa3c
Virtualization: amazon
Operating System: Ubuntu 24.04.3 LTS
Kernel: Linux 6.14.0-1018-aws
Architecture: x86-64
Hardware Vendor: Amazon EC2
Hardware Model: m7a.8xlarge
Firmware Version: 1.0
Firmware Date: Mon 2017-10-16
Firmware Age: 8y 3month 3w 4d
Configure PostgreSQL connectivity on port 15432 for ClickHouse:
conf_dir="clickhouse/config.d"
mkdir -p $conf_dir
echo "
<clickhouse>
<postgresql_port>15432</postgresql_port>
</clickhouse>
" > $conf_dir/postgres_port.xml
Start ClickHouse in Docker:
docker_dir="/var/lib/docker"
data_dir="$docker_dir/clickhouse/data"
sudo mkdir -p $data_dir
logs_dir="$docker_dir/clickhouse/logs"
sudo mkdir -p $logs_dir
# Directory into which we'll write out the Parquet files from ClickHouse
cedar_ext="$docker_dir/cedardb/data/ext"
sudo mkdir -p $cedar_ext
docker run -d --name clickhouse \
--network=host \
--ulimit nofile=262144:262144 \
-p 8123:8123 \
-p 9000:9000 \
-p 15432:15432 \
-e CLICKHOUSE_USER=postgres \
-e CLICKHOUSE_PASSWORD=postgres \
-v $HOME/clickhouse/config.d:/etc/clickhouse-server/config.d \
-v $data_dir:/var/lib/clickhouse/ \
-v $logs_dir:/var/log/clickhouse-server/ \
-v $cedar_ext:/var/lib/clickhouse/user_files \
clickhouse/clickhouse-server
psql "postgresql://postgres:postgres@localhost:15432/$db?sslmode=disable" < stack_overflow_ddl_clickhouse.sql
SELECT 0
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE
Contents of stack_overflow_ddl_clickhouse.sql:
CREATE TABLE posts
(
`Id` Int32 CODEC(Delta(4), ZSTD(1)),
`PostTypeId` Enum8('Question' = 1, 'Answer' = 2, 'Wiki' = 3, 'TagWikiExcerpt' = 4, 'TagWiki' = 5, 'ModeratorNomination' = 6, 'WikiPlaceholder' = 7, 'PrivilegeWiki' = 8),
`AcceptedAnswerId` UInt32,
`CreationDate` DateTime64(3, 'UTC'),
`Score` Int32,
`ViewCount` UInt32 CODEC(Delta(4), ZSTD(1)),
`Body` String,
`OwnerUserId` Int32,
`OwnerDisplayName` String,
`LastEditorUserId` Int32,
`LastEditorDisplayName` String,
`LastEditDate` DateTime64(3, 'UTC') CODEC(Delta(8), ZSTD(1)),
`LastActivityDate` DateTime64(3, 'UTC'),
`Title` String,
`Tags` String,
`AnswerCount` UInt16 CODEC(Delta(2), ZSTD(1)),
`CommentCount` UInt8,
`FavoriteCount` UInt8,
`ContentLicense` LowCardinality(String),
`ParentId` String,
`CommunityOwnedDate` DateTime64(3, 'UTC'),
`ClosedDate` DateTime64(3, 'UTC')
)
ENGINE = MergeTree
PARTITION BY toYear(CreationDate)
ORDER BY (PostTypeId, toDate(CreationDate), CreationDate);
CREATE TABLE votes
(
`Id` UInt32,
`PostId` Int32,
`VoteTypeId` UInt8,
`CreationDate` DateTime64(3, 'UTC'),
`UserId` Int32,
`BountyAmount` UInt8
)
ENGINE = MergeTree
ORDER BY (VoteTypeId, CreationDate, PostId, UserId);
CREATE TABLE comments
(
`Id` UInt32,
`PostId` UInt32,
`Score` UInt16,
`Text` String,
`CreationDate` DateTime64(3, 'UTC'),
`UserId` Int32,
`UserDisplayName` LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY CreationDate;
CREATE TABLE users
(
`Id` Int32,
`Reputation` LowCardinality(String),
`CreationDate` DateTime64(3, 'UTC') CODEC(Delta(8), ZSTD(1)),
`DisplayName` String,
`LastAccessDate` DateTime64(3, 'UTC'),
`AboutMe` String,
`Views` UInt32,
`UpVotes` UInt32,
`DownVotes` UInt32,
`WebsiteUrl` String,
`Location` LowCardinality(String),
`AccountId` Int32
)
ENGINE = MergeTree
ORDER BY (Id, CreationDate);
CREATE TABLE badges
(
`Id` UInt32,
`UserId` Int32,
`Name` LowCardinality(String),
`Date` DateTime64(3, 'UTC'),
`Class` Enum8('Gold' = 1, 'Silver' = 2, 'Bronze' = 3),
`TagBased` Bool
)
ENGINE = MergeTree
ORDER BY UserId;
CREATE TABLE postlinks
(
`Id` UInt64,
`CreationDate` DateTime64(3, 'UTC'),
`PostId` Int32,
`RelatedPostId` Int32,
`LinkTypeId` Enum8('Linked' = 1, 'Duplicate' = 3)
)
ENGINE = MergeTree
ORDER BY (PostId, RelatedPostId);
CREATE TABLE posthistory
(
`Id` UInt64,
`PostHistoryTypeId` UInt8,
`PostId` Int32,
`RevisionGUID` String,
`CreationDate` DateTime64(3, 'UTC'),
`UserId` Int32,
`Text` String,
`ContentLicense` LowCardinality(String),
`Comment` String,
`UserDisplayName` String
)
ENGINE = MergeTree
ORDER BY (CreationDate, PostId);
Load the Parquet formatted data set into ClickHouse:
Timing is on.
psql (16.11 (Ubuntu 16.11-0ubuntu0.24.04.1), server 26.1.2.11)
WARNING: psql major version 16, server major version 26.
Some psql features might not work.
Type "help" for help.
default=> INSERT INTO posts SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/posts/*.parquet');
INSERT 0 59819048
Time: 636401.214 ms (10:36.401)
default=> INSERT INTO votes SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/votes/*.parquet');
INSERT 0 238984011
Time: 16150.618 ms (00:16.151)
default=> INSERT INTO comments SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/comments/*.parquet');
INSERT 0 90380323
Time: 64832.456 ms (01:04.832)
default=> INSERT INTO users SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/users.parquet');
INSERT 0 22484235
Time: 10104.109 ms (00:10.104)
default=> INSERT INTO badges SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/badges.parquet');
INSERT 0 51289973
Time: 7348.446 ms (00:07.348)
default=> INSERT INTO postlinks SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/postlinks.parquet');
INSERT 0 6552590
Time: 2621.988 ms (00:02.622)
default=> INSERT INTO posthistory SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/posthistory/*.parquet');
INSERT 0 160790317
Time: 1747694.098 ms (29:07.694)
Build the query workload runner (a Go app):
sudo apt install golang-go
curl -s https://cedardb-public.s3.us-east-1.amazonaws.com/pgx_bench.tar.gz | tar xzvf -
x pgx_bench/
x pgx_bench/go.mod
x pgx_bench/go.sum
x pgx_bench/main.go
cd pgx_bench/
go mod tidy
go build .
cd -
Run the query workload on ClickHouse:
./pgx_bench/pgx_bench 5 q1.sql q2.sql q3.sql q4.sql q5.sql q6.sql q7.sql > clickhouse_timings.csv
cat clickhouse_timings.csv
file,iterations,min_ms,max_ms,avg_ms
q1.sql,5,22059.208,23186.074,22432.234
q2.sql,5,1648.039,1871.533,1726.711
q3.sql,5,12692.024,14110.898,13251.638
q4.sql,5,5338.250,5753.039,5601.891
q5.sql,5,777.590,823.650,803.491
q6.sql,5,2636.661,2939.570,2752.464
q7.sql,5,1330.932,1409.835,1374.796
Dump each of the tables to the local filesystem, in Parquet format. This is done from a Bash shell on the `clickhouse` container. Here is what the process looks like for the posts table (the remaining tables follow the same pattern):
ubuntu@ip-10-0-1-175:~$ docker exec -it clickhouse /bin/bash
root@ip-10-0-1-175:/# set -o vi
root@ip-10-0-1-175:/# cd /var/lib/clickhouse/user_files
root@ip-10-0-1-175:/var/lib/clickhouse/user_files# clickhouse-client
ClickHouse client version 26.1.2.11 (official build).
Connecting to localhost:9000 as user postgres.
Connected to ClickHouse server version 26.1.2.
Warnings:
* Delay accounting is not enabled, OSIOWaitMicroseconds will not be gathered. You can enable it using `sudo sh -c 'echo 1 > /proc/sys/kernel/task_delayacct'` or by using sysctl.
ip-10-0-1-175.us-east-2.compute.internal :) SELECT *
:-] FROM posts
:-] INTO OUTFILE 'posts.parquet'
:-] FORMAT Parquet;
SELECT *
FROM posts
INTO OUTFILE 'posts.parquet'
FORMAT Parquet
Query id: 49904a9a-a54a-4478-ab94-0944ed519319
59819048 rows in set. Elapsed: 294.139 sec. Processed 59.82 million rows, 74.64 GB (203.37 thousand rows/s., 253.76 MB/s.)
Peak memory usage: 850.06 MiB.
The resulting files:
root@ip-10-0-1-175:/var/lib/clickhouse/user_files# ls -ltr
total 73153720
-rw-r--r-- 1 root root 25099598978 Feb 9 23:51 posts.parquet
-rw-r--r-- 1 root root 2134918600 Feb 10 00:02 votes.parquet
-rw-r--r-- 1 root root 7582062208 Feb 10 00:04 comments.parquet
-rw-r--r-- 1 root root 1085776043 Feb 10 00:07 users.parquet
-rw-r--r-- 1 root root 736515182 Feb 10 00:08 badges.parquet
-rw-r--r-- 1 root root 122295217 Feb 10 00:08 postlinks.parquet
-rw-r--r-- 1 root root 38148226915 Feb 10 00:16 posthistory.parquet
Shut down ClickHouse:
docker stop clickhouse
Start CedarDB in Docker: CedarDB Community Edition is available free of charge subject to the Community Edition License, but since that restricts the data size to a maximum of 64 GB, I obtained an Enterprise Trial license per the guidance available here. The resulting license key string is assigned to the `LICENSE_KEY` variable, below.
docker_dir="/var/lib/docker"
data_dir="$docker_dir/cedardb/data"
docker run -d \
--privileged --ulimit memlock=1073741824 \
--name cedardb \
-v $data_dir:/var/lib/cedardb/data \
-p 5432:5432 \
-e CEDAR_PASSWORD="MMABl3n3SS3nuc" \
-e CEDAR_USER="postgres" \
-e CEDAR_DB="postgres" \
-e LICENSE_KEY="eyJhb[...]" \
cedardb/cedardb:latest
Load the Parquet formatted data set: This is done via a `psql` CLI connection. The only table explicitly created was `posts`, so that we could ensure the correct data types for a couple of columns.
ubuntu@ip-10-0-1-175:~$ psql "postgresql://postgres:MMABl3n3SS3nuc@localhost:5432/postgres?sslmode=require"
Timing is on.
psql (16.11 (Ubuntu 16.11-0ubuntu0.24.04.1), server 16.3 cedar v2026-02-03)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
Type "help" for help.
postgres=# CREATE TABLE posts
postgres-# (
postgres(# "Id" INTEGER,
postgres(# "PostTypeId" TEXT,
postgres(# "AcceptedAnswerId" UINT4,
postgres(# "CreationDate" TIMESTAMP,
postgres(# "Score" INTEGER,
postgres(# "ViewCount" UINT4,
postgres(# "Body" TEXT,
postgres(# "OwnerUserId" INTEGER,
postgres(# "OwnerDisplayName" TEXT,
postgres(# "LastEditorUserId" INTEGER,
postgres(# "LastEditorDisplayName" TEXT,
postgres(# "LastEditDate" TIMESTAMP,
postgres(# "LastActivityDate" TIMESTAMP,
postgres(# "Title" TEXT,
postgres(# "Tags" TEXT,
postgres(# "AnswerCount" UINT2,
postgres(# "CommentCount" UINT1,
postgres(# "FavoriteCount" UINT1,
postgres(# "ContentLicense" TEXT,
postgres(# "ParentId" TEXT,
postgres(# "CommunityOwnedDate" TIMESTAMP,
postgres(# "ClosedDate" TIMESTAMP
postgres(# );
CREATE TABLE
Time: 5.786 ms
postgres=# INSERT INTO posts
postgres-# SELECT "Id", "PostTypeId", "AcceptedAnswerId", "CreationDate", "Score", "ViewCount", "Body", "OwnerUserId",
postgres-# "OwnerDisplayName", "LastEditorUserId", "LastEditorDisplayName", "LastEditDate", "LastActivityDate",
postgres-# "Title", "Tags", "AnswerCount", "CommentCount", "FavoriteCount", "ContentLicense",
postgres-# CASE WHEN "ParentId" = '' THEN NULL ELSE "ParentId"::INT END "ParentId",
postgres-# "CommunityOwnedDate", "ClosedDate"
postgres-# FROM '/var/lib/cedardb/data/ext/posts.parquet';
INSERT 0 59819048
Time: 369243.039 ms (06:09.243)
postgres=# CREATE TABLE votes AS SELECT * FROM '/var/lib/cedardb/data/ext/votes.parquet';
SELECT 238984011
Time: 46287.163 ms (00:46.287)
postgres=# CREATE TABLE comments AS SELECT * FROM '/var/lib/cedardb/data/ext/comments.parquet';
SELECT 90380323
Time: 134800.228 ms (02:14.800)
postgres=# CREATE TABLE users AS SELECT * FROM '/var/lib/cedardb/data/ext/users.parquet';
SELECT 22484235
Time: 20974.077 ms (00:20.974)
postgres=# CREATE TABLE badges AS SELECT * FROM '/var/lib/cedardb/data/ext/badges.parquet';
SELECT 51289973
Time: 10397.579 ms (00:10.398)
postgres=# CREATE TABLE postlinks AS SELECT * FROM '/var/lib/cedardb/data/ext/postlinks.parquet';
SELECT 6552590
Time: 618.701 ms
postgres=# CREATE TABLE posthistory AS SELECT * FROM '/var/lib/cedardb/data/ext/posthistory.parquet';
SELECT 160790317
Time: 892233.261 ms (14:52.233)
Run the query workload in CedarDB:
ubuntu@ip-10-0-1-175:~$ ./pgx_bench/pgx_bench 5 q1.sql q2.sql q3.sql q4.sql q5.sql q6.sql q7.sql > cedardb_timings.csv
ubuntu@ip-10-0-1-175:~$ cat cedardb_timings.csv
file,iterations,min_ms,max_ms,avg_ms
q1.sql,5,2963.357,3091.380,3036.361
q2.sql,5,404.593,436.573,423.479
q3.sql,5,1088.311,1269.720,1146.439
q4.sql,5,924.592,955.603,943.241
q5.sql,5,161.864,165.734,163.510
q6.sql,5,1813.603,1850.868,1836.822
q7.sql,5,664.310,734.428,695.075

Finally, import timing data from the two output CSV files into a spreadsheet to create the table shown earlier – that's it.