The newly released Parquet support in CedarDB streamlines data exchange.
My recent LinkedIn post (below, with minor changes) provides the motivation for the more detailed deep dive presented in this article:
All this snow has me thinking about skiing (again). And data. And databases. I was inspired by the “Analyzing Stack Overflow data with ClickHouse” article I read, thinking how amazing it is they just raised $400M. After working through their examples, it occurred to me that, as fun as that was, the SQL queries there were very simple. In skiing jargon, I might liken them to the green ski runs, for beginners.
As skiers progress, they move to the blue intermediate runs, then may get bold and try the black diamond runs. That's the direction I headed with the Stack Overflow data set. ClickHouse provides a convenient Parquet export function, which I used in conjunction with CedarDB's Parquet import functionality to load my CedarDB instance with the data set I'd used in ClickHouse. I deployed each of the DBs in Docker on my m7a.8xlarge (32 vCPU, 128 GB RAM, $1.85/hour) EC2 node, running them one at a time. A little pair programming with ChatGPT yielded a set of seven interesting queries, which I ran on each of these DBs, discarding the initial run time and recording the average timing for five runs.
Why the Stack Overflow data set?
The Stack Overflow data set appeals to us for several reasons, the first of which is that many of us can relate to it, having used the Web site over the years. Some other aspects are:
- It’s a real-world data set with a schema that spans multiple tables that relate to one another.
- Real data is messy, so it's a great test for database systems. Import/export is hard (nulls, escape symbols, etc.) and queries are often slow (as illustrated here by ClickHouse's struggle to handle complex joins).
- ClickHouse chose to publish an article around it, which provides a nice starting point for comparing ClickHouse with CedarDB.
- The data set presents both an opportunity to write some simple SQL queries that compute aggregates against a single table, as you can see here (from the ClickHouse post; a Postgres-flavored equivalent of this query appears just after this list):
SELECT
arrayJoin(arrayFilter(t -> (t != ''), splitByChar('|', Tags))) AS Tags,
count() AS c
FROM stackoverflow.posts
GROUP BY Tags
ORDER BY c DESC
LIMIT 10
- … but also an opportunity to explore more complex relationships through SQL queries that join multiple tables; for example, this query that lists the 10 most recent questions (since 2019) that have an accepted answer, and shows the question title and who asked it, the accepted answer ID and who wrote it, and (if available) a badge name belonging to the person who wrote the accepted answer:
SELECT
q."Id" AS question_id,
q."Title" AS question_title,
qu."DisplayName" AS question_owner,
a."Id" AS answer_id,
au."DisplayName" AS answer_owner,
b."Name" AS answerer_badge
FROM posts q
JOIN posts a
ON a."Id" = q."AcceptedAnswerId"
JOIN users qu
ON qu."Id" = q."OwnerUserId"
JOIN users au
ON au."Id" = a."OwnerUserId"
LEFT JOIN badges b
ON b."UserId" = au."Id"
WHERE q."AcceptedAnswerId" IS NOT NULL
AND q."Title" <> ''
AND q."CreationDate" >= TIMESTAMP '2019-01-01'
ORDER BY q."CreationDate" DESC
LIMIT 10;
- Since ClickHouse already covered the simpler queries in its post, the focus here will be on the more complex ones like the one above, the black diamond runs if you will. This is what drives us to consider using Parquet to migrate our workload over to CedarDB.
- What does this Parquet migration process look like in practice? This is the essence of it:
-- Clickhouse:
ip-10-0-1-175.us-east-2.compute.internal :) SELECT *
:-] FROM votes
:-] INTO OUTFILE 'votes.parquet'
:-] FORMAT Parquet;
-- CedarDB:
postgres=# CREATE TABLE votes AS SELECT * FROM '/var/lib/cedardb/data/ext/votes.parquet';
SELECT 238984011
Time: 46287.163 ms (00:46.287)
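As an aside, once the data lands in CedarDB, even the ClickHouse-specific tag query shown earlier has a natural Postgres-style equivalent. Here's a sketch, assuming CedarDB's Postgres compatibility covers string_to_array and unnest (an assumption on my part, not something verified in this experiment):
-- Split the pipe-delimited Tags string, drop empties, and count the ten most common tags
SELECT tag, COUNT(*) AS c
FROM posts,
     LATERAL unnest(string_to_array("Tags", '|')) AS tag
WHERE tag <> ''
GROUP BY tag
ORDER BY c DESC
LIMIT 10;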
Overview of the experiment
Refer to Appendix: Après-Ski Details, below, for the step-by-step procedure.
- Deployed an EC2 instance in us-east-2 using this CloudFormation template
- Started ClickHouse in Docker
- Loaded the Stack Overflow data set
- Ran the seven SQL queries, collecting telemetry on the runs
- Exported the data in Parquet format to the local filesystem
- Shut down ClickHouse
- Started up CedarDB in Docker
- Loaded the data set from the Parquet files
- Repeated the query runs and timing collection
The questions we’ll ask
Here are the questions that our SQL queries will ask, numbered to correspond with the numbered files used in the experiments and with the entries in the results table below. Each question is followed by the SQL query that answers it.
Q1: Find the top 10 most frequently awarded badge types among high-reputation users (users with reputation ≥ 100,000), considering only users who have posts created on or after January 1, 2018, and requiring that those posts have at least one comment.
SELECT
b."Name" AS badge_name,
COUNT(*) AS badge_awards
FROM badges b
JOIN users u
ON u."Id" = b."UserId"
JOIN posts p
ON p."OwnerUserId" = u."Id"
JOIN comments c
ON c."PostId" = p."Id"
WHERE u."Reputation"::int >= 100000
AND p."CreationDate" >= TIMESTAMP '2018-01-01'
GROUP BY b."Name"
ORDER BY badge_awards DESC
LIMIT 10;
Q2: Find posts that the community strongly disagrees on (those with lots of both upvotes and downvotes) and highlight the ones where opinion is most evenly divided, especially those that sparked substantial discussion.
WITH vote_counts AS (
SELECT
v."PostId",
SUM(CASE WHEN v."VoteTypeId" = 2 THEN 1 ELSE 0 END) AS upvotes,
SUM(CASE WHEN v."VoteTypeId" = 3 THEN 1 ELSE 0 END) AS downvotes
FROM votes v
GROUP BY v."PostId"
)
SELECT
p."Id",
p."Title",
u."DisplayName" AS owner,
vc.upvotes,
vc.downvotes,
(CASE WHEN vc.upvotes > vc.downvotes THEN vc.upvotes - vc.downvotes ELSE vc.downvotes - vc.upvotes END) AS abs_diff,
COUNT(c."Id") AS comment_cnt
FROM vote_counts vc
JOIN posts p
ON p."Id" = vc."PostId"
JOIN users u
ON u."Id" = p."OwnerUserId"
LEFT JOIN comments c
ON c."PostId" = p."Id"
WHERE p."Title" <> ''
AND vc.upvotes >= 50
AND vc.downvotes >= 50
GROUP BY p."Id", p."Title", u."DisplayName", vc.upvotes, vc.downvotes
ORDER BY abs_diff ASC, comment_cnt DESC
LIMIT 10;
Q3: List the 10 most recent questions (since 2019) that have an accepted answer, and show (1) the question title and who asked it, (2) the accepted answer ID and who wrote it, and (3, if available) a badge name belonging to the person who wrote the accepted answer.
SELECT
q."Id" AS question_id,
q."Title" AS question_title,
qu."DisplayName" AS question_owner,
a."Id" AS answer_id,
au."DisplayName" AS answer_owner,
b."Name" AS answerer_badge
FROM posts q
JOIN posts a
ON a."Id" = q."AcceptedAnswerId"
JOIN users qu
ON qu."Id" = q."OwnerUserId"
JOIN users au
ON au."Id" = a."OwnerUserId"
LEFT JOIN badges b
ON b."UserId" = au."Id"
WHERE q."AcceptedAnswerId" IS NOT NULL
AND q."Title" <> ''
AND q."CreationDate" >= TIMESTAMP '2019-01-01'
ORDER BY q."CreationDate" DESC
LIMIT 10;
Q4: Highlight popular posts that were 'community-touched' (edited by someone else), and rank them by overall vote volume, using comment count as a tiebreaker.
WITH vote_counts AS (
SELECT
v."PostId",
COUNT(*) AS votes_total
FROM votes v
GROUP BY v."PostId"
)
SELECT
p."Id",
p."Title",
owner."DisplayName" AS owner_name,
editor."DisplayName" AS editor_name,
editor."Reputation" AS editor_rep,
COALESCE(vc.votes_total, 0) AS votes_total,
COUNT(c."Id") AS comment_cnt
FROM posts p
JOIN users owner
ON owner."Id" = p."OwnerUserId"
JOIN users editor
ON editor."Id" = p."LastEditorUserId"
LEFT JOIN vote_counts vc
ON vc."PostId" = p."Id"
LEFT JOIN comments c
ON c."PostId" = p."Id"
WHERE p."LastEditorUserId" IS NOT NULL
AND p."Title" <> ''
AND p."OwnerUserId" <> p."LastEditorUserId"
GROUP BY
p."Id", p."Title", owner."DisplayName", editor."DisplayName", editor."Reputation", vc.votes_total
ORDER BY votes_total DESC, comment_cnt DESC
LIMIT 10;
Q5: Show the 10 most recent links between posts (since 2020), including the titles and authors on both sides of each link.
SELECT
pl."PostId" AS src_post_id,
src."Title" AS src_title,
su."DisplayName" AS src_owner,
pl."RelatedPostId" AS dst_post_id,
dst."Title" AS dst_title,
du."DisplayName" AS dst_owner,
pl."CreationDate" AS link_time
FROM postlinks pl
JOIN posts src
ON src."Id" = pl."PostId"
JOIN posts dst
ON dst."Id" = pl."RelatedPostId"
LEFT JOIN users su
ON su."Id" = src."OwnerUserId"
LEFT JOIN users du
ON du."Id" = dst."OwnerUserId"
WHERE src."Title" <> ''
AND dst."Title" <> ''
AND pl."CreationDate" >= TIMESTAMP '2020-01-01'
ORDER BY pl."CreationDate" DESC
LIMIT 10;
Q6: Highlight the posts that have been 'worked on' the most over time, and whether that work came from a single editor or a broad set of contributors.
SELECT
p."Id" AS post_id,
p."Title",
owner."DisplayName" AS owner,
editor."DisplayName" AS last_editor,
COUNT(ph."Id") AS revisions,
COUNT(DISTINCT ph."UserId") AS distinct_revision_authors
FROM posts p
JOIN users owner
ON owner."Id" = p."OwnerUserId"
LEFT JOIN users editor
ON editor."Id" = p."LastEditorUserId"
JOIN posthistory ph
ON ph."PostId" = p."Id"
WHERE p."Title" <> ''
GROUP BY p."Id", p."Title", owner."DisplayName", editor."DisplayName"
HAVING COUNT(ph."Id") >= 20
ORDER BY revisions DESC
LIMIT 10;
Q7: Determine which combinations of tags generate the most answers overall, and whether those answers tend to come from high-reputation or lower-reputation contributors.
SELECT
q."Tags",
COUNT(a."Id") AS answers_cnt,
AVG(au."Reputation"::int) AS avg_answerer_rep
FROM posts a
JOIN posts q
ON q."Id" = a."ParentId"::int
JOIN users au
ON au."Id" = a."OwnerUserId"
WHERE a."PostTypeId" = 'Answer'
AND q."Tags" <> ''
GROUP BY q."Tags"
ORDER BY answers_cnt DESC
LIMIT 10;
Results and observations
- AWS EC2 instance type: m7a.8xlarge (32 vCPU, 128 GB RAM, $1.85/hour)
- Docker provided a simple means of deployment for both of these DB engines.
- CedarDB version: v2026-02-03
- ClickHouse server version: 26.1.2.11
- Parquet support provided the “chair lift”, taking us from the beginner SQL queries (ClickHouse) to the advanced ones (CedarDB).
- In CedarDB, with a single exception, we created the tables using `CREATE TABLE ... AS SELECT ...` from the Parquet data.
- The `psql` CLI was our interface into both systems, except for when we dumped the Parquet data from ClickHouse.
- CedarDB outperformed ClickHouse on all seven queries, ranging from about 1.5x to 11x faster.
This table summarizes the results:
| Query # | ClickHouse time (ms) | CedarDB time (ms) | Ratio |
|---|---|---|---|
| 1 | 22432.234 | 3036.361 | 7.388 |
| 2 | 1726.711 | 423.479 | 4.077 |
| 3 | 13251.638 | 1146.439 | 11.559 |
| 4 | 5601.891 | 943.241 | 5.939 |
| 5 | 803.491 | 163.510 | 4.914 |
| 6 | 2752.464 | 1836.822 | 1.498 |
| 7 | 1374.796 | 695.075 | 1.978 |
| Geometric mean speedup | | | 4.363 |
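For the record, the geometric mean in the last row is just the exponential of the average log ratio. Here's a quick SQL sketch of that calculation, assuming the per-query averages were loaded into a hypothetical timings table (not something used in the experiment itself):
-- Hypothetical staging table for the per-query averages from the two CSV files:
-- CREATE TABLE timings (query_no INT, clickhouse_ms DOUBLE PRECISION, cedardb_ms DOUBLE PRECISION);
SELECT exp(avg(ln(clickhouse_ms / cedardb_ms))) AS geometric_mean_speedup
FROM timings;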
My Takeaways
With its new support for the Parquet data format, CedarDB makes ingesting Parquet data painless (and it's easy to deploy using Docker).
As you'll see in the appendix below, ClickHouse asks you to provide a fair bit of insight into the data up front:
- `CreationDate DateTime64(3, 'UTC')`: contrast this with simply `TIMESTAMPTZ` in CedarDB (and Postgres)
- `Id Int32 CODEC(Delta(4), ZSTD(1))`: you specify compression strategies
- `ContentLicense LowCardinality(String)`: you enable storage optimizations (e.g. dictionary encoding)
- `ENGINE = MergeTree ORDER BY (CreationDate, PostId)`: you provide some hints on how best to arrange the data
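For contrast, here is a trimmed-down sketch of the same table in CedarDB's Postgres-flavored DDL (a column subset of the explicit posts definition shown in the appendix): plain types, no codecs, engines, or ordering hints.
-- Abbreviated CedarDB/Postgres-style DDL; see the appendix for the full column list
CREATE TABLE posts
(
    "Id"               INTEGER,
    "PostTypeId"       TEXT,
    "AcceptedAnswerId" UINT4,
    "CreationDate"     TIMESTAMP,
    "Score"            INTEGER,
    "Title"            TEXT,
    "Tags"             TEXT,
    "ContentLicense"   TEXT
    -- ... remaining columns omitted for brevity
);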
If you’ve encountered challenges with ClickHouse when your queries involve complex joins, then CedarDB provides immediate relief.
It's also worth pointing out that, unlike ClickHouse, CedarDB can handle both your OLAP and your OLTP workloads simultaneously, and it looks just like Postgres. That means you get to run your analytical queries on fresh data, with zero replication lag and a simpler data architecture.
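To make that concrete, here is a minimal sketch against the tables we just loaded; the user Id is made up, and the point is simply that a transactional write is immediately visible to an analytical read on the same system:
-- OLTP-style point update (hypothetical user Id; Reputation is stored as text in this schema)
UPDATE users SET "Reputation" = '100001' WHERE "Id" = 12345;

-- OLAP-style aggregate over the same, freshly updated data; no ETL step, no replica lag
SELECT COUNT(*) AS high_rep_users
FROM users
WHERE "Reputation"::int >= 100000;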
Parquet doesn’t care which DB engine you ride. It just gets you to the top. What happens on the way down is up to the database.
Thank you for joining me on this Winter adventure!
If you’d like to discuss your data challenges with us, we’d love to hear from you.
Appendix: Après-Ski Details
Deploy onto an EC2 instance with these specs (and a 768 GB EBS disk) into us-east-2:
ubuntu@ip-10-0-1-175:~$ hostnamectl
Static hostname: ip-10-0-1-175
Icon name: computer-vm
Chassis: vm
Machine ID: ec21a20c0e5541c2b2a709e6d4c6a213
Boot ID: 2a5966b0c4f1448b98adf6c1f2a7aa3c
Virtualization: amazon
Operating System: Ubuntu 24.04.3 LTS
Kernel: Linux 6.14.0-1018-aws
Architecture: x86-64
Hardware Vendor: Amazon EC2
Hardware Model: m7a.8xlarge
Firmware Version: 1.0
Firmware Date: Mon 2017-10-16
Firmware Age: 8y 3month 3w 4d
Configure PostgreSQL connectivity on port 15432 for ClickHouse:
conf_dir="clickhouse/config.d"
mkdir -p $conf_dir
echo "
<clickhouse>
<postgresql_port>15432</postgresql_port>
</clickhouse>
" > $conf_dir/postgres_port.xml
Start ClickHouse in Docker:
docker_dir="/var/lib/docker"
data_dir="$docker_dir/clickhouse/data"
sudo mkdir -p $data_dir
logs_dir="$docker_dir/clickhouse/logs"
sudo mkdir -p $logs_dir
# Directory into which we'll write out the Parquet files from ClickHouse
cedar_ext="$docker_dir/cedardb/data/ext"
sudo mkdir -p $cedar_ext
docker run -d --name clickhouse \
--network=host \
--ulimit nofile=262144:262144 \
-p 8123:8123 \
-p 9000:9000 \
-p 15432:15432 \
-e CLICKHOUSE_USER=postgres \
-e CLICKHOUSE_PASSWORD=postgres \
-v $HOME/clickhouse/config.d:/etc/clickhouse-server/config.d \
-v $data_dir:/var/lib/clickhouse/ \
-v $logs_dir:/var/log/clickhouse-server/ \
-v $cedar_ext:/var/lib/clickhouse/user_files \
clickhouse/clickhouse-server
psql "postgresql://postgres:postgres@localhost:15432/$db?sslmode=disable" < stack_overflow_ddl_clickhouse.sql
SELECT 0
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE
Contents of stack_overflow_ddl_clickhouse.sql:
CREATE TABLE posts
(
`Id` Int32 CODEC(Delta(4), ZSTD(1)),
`PostTypeId` Enum8('Question' = 1, 'Answer' = 2, 'Wiki' = 3, 'TagWikiExcerpt' = 4, 'TagWiki' = 5, 'ModeratorNomination' = 6, 'WikiPlaceholder' = 7, 'PrivilegeWiki' = 8),
`AcceptedAnswerId` UInt32,
`CreationDate` DateTime64(3, 'UTC'),
`Score` Int32,
`ViewCount` UInt32 CODEC(Delta(4), ZSTD(1)),
`Body` String,
`OwnerUserId` Int32,
`OwnerDisplayName` String,
`LastEditorUserId` Int32,
`LastEditorDisplayName` String,
`LastEditDate` DateTime64(3, 'UTC') CODEC(Delta(8), ZSTD(1)),
`LastActivityDate` DateTime64(3, 'UTC'),
`Title` String,
`Tags` String,
`AnswerCount` UInt16 CODEC(Delta(2), ZSTD(1)),
`CommentCount` UInt8,
`FavoriteCount` UInt8,
`ContentLicense` LowCardinality(String),
`ParentId` String,
`CommunityOwnedDate` DateTime64(3, 'UTC'),
`ClosedDate` DateTime64(3, 'UTC')
)
ENGINE = MergeTree
PARTITION BY toYear(CreationDate)
ORDER BY (PostTypeId, toDate(CreationDate), CreationDate);
CREATE TABLE votes
(
`Id` UInt32,
`PostId` Int32,
`VoteTypeId` UInt8,
`CreationDate` DateTime64(3, 'UTC'),
`UserId` Int32,
`BountyAmount` UInt8
)
ENGINE = MergeTree
ORDER BY (VoteTypeId, CreationDate, PostId, UserId);
CREATE TABLE comments
(
`Id` UInt32,
`PostId` UInt32,
`Score` UInt16,
`Text` String,
`CreationDate` DateTime64(3, 'UTC'),
`UserId` Int32,
`UserDisplayName` LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY CreationDate;
CREATE TABLE users
(
`Id` Int32,
`Reputation` LowCardinality(String),
`CreationDate` DateTime64(3, 'UTC') CODEC(Delta(8), ZSTD(1)),
`DisplayName` String,
`LastAccessDate` DateTime64(3, 'UTC'),
`AboutMe` String,
`Views` UInt32,
`UpVotes` UInt32,
`DownVotes` UInt32,
`WebsiteUrl` String,
`Location` LowCardinality(String),
`AccountId` Int32
)
ENGINE = MergeTree
ORDER BY (Id, CreationDate);
CREATE TABLE badges
(
`Id` UInt32,
`UserId` Int32,
`Name` LowCardinality(String),
`Date` DateTime64(3, 'UTC'),
`Class` Enum8('Gold' = 1, 'Silver' = 2, 'Bronze' = 3),
`TagBased` Bool
)
ENGINE = MergeTree
ORDER BY UserId;
CREATE TABLE postlinks
(
`Id` UInt64,
`CreationDate` DateTime64(3, 'UTC'),
`PostId` Int32,
`RelatedPostId` Int32,
`LinkTypeId` Enum8('Linked' = 1, 'Duplicate' = 3)
)
ENGINE = MergeTree
ORDER BY (PostId, RelatedPostId);
CREATE TABLE posthistory
(
`Id` UInt64,
`PostHistoryTypeId` UInt8,
`PostId` Int32,
`RevisionGUID` String,
`CreationDate` DateTime64(3, 'UTC'),
`UserId` Int32,
`Text` String,
`ContentLicense` LowCardinality(String),
`Comment` String,
`UserDisplayName` String
)
ENGINE = MergeTree
ORDER BY (CreationDate, PostId);
Load the Parquet formatted data set into ClickHouse:
Timing is on.
psql (16.11 (Ubuntu 16.11-0ubuntu0.24.04.1), server 26.1.2.11)
WARNING: psql major version 16, server major version 26.
Some psql features might not work.
Type "help" for help.
default=> INSERT INTO posts SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/posts/*.parquet');
INSERT 0 59819048
Time: 636401.214 ms (10:36.401)
default=> INSERT INTO votes SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/votes/*.parquet');
INSERT 0 238984011
Time: 16150.618 ms (00:16.151)
default=> INSERT INTO comments SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/comments/*.parquet');
INSERT 0 90380323
Time: 64832.456 ms (01:04.832)
default=> INSERT INTO users SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/users.parquet');
INSERT 0 22484235
Time: 10104.109 ms (00:10.104)
default=> INSERT INTO badges SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/badges.parquet');
INSERT 0 51289973
Time: 7348.446 ms (00:07.348)
default=> INSERT INTO postlinks SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/postlinks.parquet');
INSERT 0 6552590
Time: 2621.988 ms (00:02.622)
default=> INSERT INTO posthistory SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/posthistory/*.parquet');
INSERT 0 160790317
Time: 1747694.098 ms (29:07.694)
Build the query workload runner (a Go app):
sudo apt install golang-go
curl -s https://cedardb-public.s3.us-east-1.amazonaws.com/pgx_bench.tar.gz | tar xzvf -
x pgx_bench/
x pgx_bench/go.mod
x pgx_bench/go.sum
x pgx_bench/main.go
cd pgx_bench/
go mod tidy
go build .
cd -
Run the query workload on ClickHouse:
./pgx_bench/pgx_bench 5 q1.sql q2.sql q3.sql q4.sql q5.sql q6.sql q7.sql > clickhouse_timings.csv
cat clickhouse_timings.csv
file,iterations,min_ms,max_ms,avg_ms
q1.sql,5,22059.208,23186.074,22432.234
q2.sql,5,1648.039,1871.533,1726.711
q3.sql,5,12692.024,14110.898,13251.638
q4.sql,5,5338.250,5753.039,5601.891
q5.sql,5,777.590,823.650,803.491
q6.sql,5,2636.661,2939.570,2752.464
q7.sql,5,1330.932,1409.835,1374.796
Dump each of the tables to the local filesystem, in Parquet format. This is done from a Bash shell on the `clickhouse` container. Here is what the process looks like for the posts table (the remaining tables follow the same pattern):
ubuntu@ip-10-0-1-175:~$ docker exec -it clickhouse /bin/bash
root@ip-10-0-1-175:/# set -o vi
root@ip-10-0-1-175:/# cd /var/lib/clickhouse/user_files
root@ip-10-0-1-175:/var/lib/clickhouse/user_files# clickhouse-client
ClickHouse client version 26.1.2.11 (official build).
Connecting to localhost:9000 as user postgres.
Connected to ClickHouse server version 26.1.2.
Warnings:
* Delay accounting is not enabled, OSIOWaitMicroseconds will not be gathered. You can enable it using `sudo sh -c 'echo 1 > /proc/sys/kernel/task_delayacct'` or by using sysctl.
ip-10-0-1-175.us-east-2.compute.internal :) SELECT *
:-] FROM posts
:-] INTO OUTFILE 'posts.parquet'
:-] FORMAT Parquet;
SELECT *
FROM posts
INTO OUTFILE 'posts.parquet'
FORMAT Parquet
Query id: 49904a9a-a54a-4478-ab94-0944ed519319
59819048 rows in set. Elapsed: 294.139 sec. Processed 59.82 million rows, 74.64 GB (203.37 thousand rows/s., 253.76 MB/s.)
Peak memory usage: 850.06 MiB.
The resulting files:
root@ip-10-0-1-175:/var/lib/clickhouse/user_files# ls -ltr
total 73153720
-rw-r--r-- 1 root root 25099598978 Feb 9 23:51 posts.parquet
-rw-r--r-- 1 root root 2134918600 Feb 10 00:02 votes.parquet
-rw-r--r-- 1 root root 7582062208 Feb 10 00:04 comments.parquet
-rw-r--r-- 1 root root 1085776043 Feb 10 00:07 users.parquet
-rw-r--r-- 1 root root 736515182 Feb 10 00:08 badges.parquet
-rw-r--r-- 1 root root 122295217 Feb 10 00:08 postlinks.parquet
-rw-r--r-- 1 root root 38148226915 Feb 10 00:16 posthistory.parquet
Shut down ClickHouse:
docker stop clickhouse
Start CedarDB in Docker: CedarDB Community Edition is available free of charge subject to the Community Edition License, but since that restricts the data size to a maximum of 64 GB, I obtained an Enterprise Trial license per the guidance available here. The resulting license key string is assigned to the `LICENSE_KEY` variable, below.
docker_dir="/var/lib/docker"
data_dir="$docker_dir/cedardb/data"
docker run -d \
--privileged --ulimit memlock=1073741824 \
--name cedardb \
-v $data_dir:/var/lib/cedardb/data \
-p 5432:5432 \
-e CEDAR_PASSWORD="MMABl3n3SS3nuc" \
-e CEDAR_USER="postgres" \
-e CEDAR_DB="postgres" \
-e LICENSE_KEY="eyJhb[...]" \
cedardb/cedardb:latest
Load the Parquet formatted data set: This is done via a `psql` CLI connection. The only table explicitly created was `posts`, so that we could ensure the correct data types for a couple of columns.
ubuntu@ip-10-0-1-175:~$ psql "postgresql://postgres:MMABl3n3SS3nuc@localhost:5432/postgres?sslmode=require"
Timing is on.
psql (16.11 (Ubuntu 16.11-0ubuntu0.24.04.1), server 16.3 cedar v2026-02-03)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
Type "help" for help.
postgres=# CREATE TABLE posts
postgres-# (
postgres(# "Id" INTEGER,
postgres(# "PostTypeId" TEXT,
postgres(# "AcceptedAnswerId" UINT4,
postgres(# "CreationDate" TIMESTAMP,
postgres(# "Score" INTEGER,
postgres(# "ViewCount" UINT4,
postgres(# "Body" TEXT,
postgres(# "OwnerUserId" INTEGER,
postgres(# "OwnerDisplayName" TEXT,
postgres(# "LastEditorUserId" INTEGER,
postgres(# "LastEditorDisplayName" TEXT,
postgres(# "LastEditDate" TIMESTAMP,
postgres(# "LastActivityDate" TIMESTAMP,
postgres(# "Title" TEXT,
postgres(# "Tags" TEXT,
postgres(# "AnswerCount" UINT2,
postgres(# "CommentCount" UINT1,
postgres(# "FavoriteCount" UINT1,
postgres(# "ContentLicense" TEXT,
postgres(# "ParentId" TEXT,
postgres(# "CommunityOwnedDate" TIMESTAMP,
postgres(# "ClosedDate" TIMESTAMP
postgres(# );
CREATE TABLE
Time: 5.786 ms
postgres=# INSERT INTO posts
postgres-# SELECT "Id", "PostTypeId", "AcceptedAnswerId", "CreationDate", "Score", "ViewCount", "Body", "OwnerUserId",
postgres-# "OwnerDisplayName", "LastEditorUserId", "LastEditorDisplayName", "LastEditDate", "LastActivityDate",
postgres-# "Title", "Tags", "AnswerCount", "CommentCount", "FavoriteCount", "ContentLicense",
postgres-# CASE WHEN "ParentId" = '' THEN NULL ELSE "ParentId"::INT END "ParentId",
postgres-# "CommunityOwnedDate", "ClosedDate"
postgres-# FROM '/var/lib/cedardb/data/ext/posts.parquet';
INSERT 0 59819048
Time: 369243.039 ms (06:09.243)
postgres=# CREATE TABLE votes AS SELECT * FROM '/var/lib/cedardb/data/ext/votes.parquet';
SELECT 238984011
Time: 46287.163 ms (00:46.287)
postgres=# CREATE TABLE comments AS SELECT * FROM '/var/lib/cedardb/data/ext/comments.parquet';
SELECT 90380323
Time: 134800.228 ms (02:14.800)
postgres=# CREATE TABLE users AS SELECT * FROM '/var/lib/cedardb/data/ext/users.parquet';
SELECT 22484235
Time: 20974.077 ms (00:20.974)
postgres=# CREATE TABLE badges AS SELECT * FROM '/var/lib/cedardb/data/ext/badges.parquet';
SELECT 51289973
Time: 10397.579 ms (00:10.398)
postgres=# CREATE TABLE postlinks AS SELECT * FROM '/var/lib/cedardb/data/ext/postlinks.parquet';
SELECT 6552590
Time: 618.701 ms
postgres=# CREATE TABLE posthistory AS SELECT * FROM '/var/lib/cedardb/data/ext/posthistory.parquet';
SELECT 160790317
Time: 892233.261 ms (14:52.233)
Run the query workload in CedarDB:
ubuntu@ip-10-0-1-175:~$ ./pgx_bench/pgx_bench 5 q1.sql q2.sql q3.sql q4.sql q5.sql q6.sql q7.sql > cedardb_timings.csv
ubuntu@ip-10-0-1-175:~$ cat cedardb_timings.csv
file,iterations,min_ms,max_ms,avg_ms
q1.sql,5,2963.357,3091.380,3036.361
q2.sql,5,404.593,436.573,423.479
q3.sql,5,1088.311,1269.720,1146.439
q4.sql,5,924.592,955.603,943.241
q5.sql,5,161.864,165.734,163.510
q6.sql,5,1813.603,1850.868,1836.822
q7.sql,5,664.310,734.428,695.075

Finally, import timing data from the two output CSV files into a spreadsheet to create the table shown earlier – that's it.