What It Takes to Be PostgreSQL Compatible
We, and many other database enthusiasts, are big fans of PostgreSQL. Even though we built a database system from scratch, we believe there are many good reasons for using PostgreSQL. In fact, we like PostgreSQL so much that we made sure to build CedarDB to be compatible with PostgreSQL.
Because of PostgreSQL’s popularity, we were not the first to develop a PostgreSQL compatible database system. CMU’s “Database of Databases” lists over 40 database systems that claim to be PostgreSQL compatible. Among them you can find database systems from large cloud vendors such as AlloyDB from Google or Aurora from AWS.
What you cannot easily find in the Database of Databases, or even on most vendors’ websites, is what “PostgreSQL compatibility” actually means. It turns out that the answer is not that simple. So, we dedicated today’s blog post to answering exactly this question and finally clarifying: what is PostgreSQL compatibility?
What Even Is PostgreSQL?
To answer what PostgreSQL compatibility means, we first have to understand what PostgreSQL is. Obviously, we all know that it’s a great and very reliable relational database system. However, developers usually mean much more than just the server program when they talk about PostgreSQL or other database systems. Oftentimes, two people talking about PostgreSQL will mean entirely different things.
What most will agree on, however, is that “PostgreSQL compatible” is usually not used to describe tools, libraries, and applications that connect to and use PostgreSQL. Instead, it describes database systems that try to imitate PostgreSQL and be compatible with the tools, libraries, and applications built for PostgreSQL. So, PostgreSQL compatibility actually means being compatible with the PostgreSQL ecosystem.
Due to the popularity of PostgreSQL, its ecosystem is very large and there are many tools and libraries that use or integrate with PostgreSQL. The ecosystem includes libraries that help you talk to a PostgreSQL server such as rust-postgres or psycopg, ORMs that usually build on top of these libraries such as Diesel or SQLAlchemy, and end-user applications such as Grafana. If you want to get an overview of your database as a developer, you can use database introspection tools such as DataGrip or DBeaver. And finally, many people have developed extensions for PostgreSQL such as pgvector which turns PostgreSQL into a vector database.
Different parts of the PostgreSQL ecosystem use and rely on different components of the PostgreSQL server. So, it is possible to support many tools from the PostgreSQL ecosystem in a new database system by selectively implementing or at least imitating some of the components of the PostgreSQL server.
How to Achieve PostgreSQL Compatibility
SQL Grammar
One thing you immediately run into when using a new database system is its specific SQL syntax or grammar. Different database systems have different conventions for quoting identifiers (e.g., `"my table"` in PostgreSQL vs. `[my table]` in Microsoft SQL Server), different names for the same data types (`BOOLEAN` vs. `BIT`), and different keywords for the same thing (`LIMIT 5` vs. `TOP 5`).
Since PostgreSQL is open source, you can take a look at its full 17k-LOC grammar definition file here. Reimplementing this grammar is easier than it may initially appear: all database systems use an internal representation of queries that abstracts away from the actual SQL text. So, to support PostgreSQL’s SQL grammar, a database system has to implement a parser that reads that grammar and converts it into the system’s own internal representation.
Even better: because of the popularity of PostgreSQL, there also exist a few libraries for parsing its SQL dialect, such as libpg_query. This library is used by popular database systems such as DuckDB (in a slightly modified version).
Such a huge grammar naturally has a lot of functionality to offer. PostgreSQL’s SQL grammar supports window functions, JSON documents, and even XML language constructs. Even if your database system doesn’t support all of these features, using the PostgreSQL grammar will still allow it to fully parse SQL statements and then selectively tell the user which features are not implemented (yet?). You can see this when you try to use XML functionality in CedarDB:
select xmlelement(name foobar);
When you run this query in CedarDB, you won’t get an error message complaining about invalid SQL syntax, but instead get an error message specifically mentioning that CedarDB doesn’t support XML.
Wire Protocol
If you write a program that wants to connect to a database system, you usually use a library for that. Internally, these libraries implement the wire protocol of the database system. PostgreSQL has its own wire protocol and implements it in its own C library called libpq. This C library is used by other PostgreSQL client libraries, such as psycopg for Python. Other libraries, such as rust-postgres, re-implement the protocol without using libpq.
PostgreSQL has a very comprehensive documentation which also covers its wire protocol in detail. The protocol runs over TCP and is divided into messages. Each message begins with a one-byte message type, e.g., ASCII ‘Q’ for a SQL query sent from the client to the server, or ASCII ‘D’ for a single row of a query result sent from the server to the client. The message type is followed by a four-byte integer specifying the size of the entire message. The following message contents depend on the type of the message. For a ‘Q’ message, for example, it’s just the SQL query as a string.
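Based on the message layout described above, framing and unframing a simple query (‘Q’) message can be sketched in a few lines of Python (a toy illustration of the framing only, not a complete protocol implementation):

```python
import struct

def frame_query(sql: str) -> bytes:
    """Build a 'Q' (simple query) message: a 1-byte type, a 4-byte
    big-endian length that counts itself, and the NUL-terminated SQL."""
    body = sql.encode() + b"\x00"
    return b"Q" + struct.pack("!I", 4 + len(body)) + body

def unframe(msg: bytes) -> tuple[bytes, bytes]:
    """Split one message back into (message type, payload)."""
    msg_type = msg[:1]
    (length,) = struct.unpack("!I", msg[1:5])
    # The length field counts itself but not the type byte.
    return msg_type, msg[5:1 + length]

msg = frame_query("SELECT 1")
print(unframe(msg))  # (b'Q', b'SELECT 1\x00')
```

The same framing scheme applies to every message type; only the payload layout differs per type.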
With a bit of experience in network programming, implementing this wire protocol is pretty straightforward. However, once you implement it, you will also quickly find that the wire protocol is rather inefficient. Most values are transmitted in their text representation instead of as binary data. This increases the size of the data that the server needs to send to the client and also adds encoding and decoding overhead on both sides. Additionally, the protocol neither compresses nor deduplicates values. When you write SQL queries with multiple joins, it can easily happen that the value of one or a few columns of the result almost never changes. Yet, the wire protocol still requires every individual copy of the same value to be transmitted.
In any case, if you are looking for compatibility with the PostgreSQL wire protocol, you have to accept these downsides.
Inspection Tables and Functions
Especially for debugging, developers connect to a database system using a separate set of tools. While most PostgreSQL tools will use libpq for the actual connection, database inspection tools such as DataGrip and DBeaver use a very specific set of inspection tables and functions. This is one of multiple queries DBeaver sends to the PostgreSQL server when you connect to it:
select string_agg(word, ',') from pg_catalog.pg_get_keywords() where word <> ALL ('{a,abs,absolute,action,ada,add,admin,after,all,allocate,alter,always,and,any,are,array,as,asc,asensitive,assertion,assignment,asymmetric,at,atomic,attribute,attributes,authorization,avg,before,begin,bernoulli,between,bigint,binary,blob,boolean,both,breadth,by,c,call,called,cardinality,cascade,cascaded,case,cast,catalog,catalog_name,ceil,ceiling,chain,char,char_length,character,character_length,character_set_catalog,character_set_name,character_set_schema,characteristics,characters,check,checked,class_origin,clob,close,coalesce,cobol,code_units,collate,collation,collation_catalog,collation_name,collation_schema,collect,column,column_name,command_function,command_function_code,commit,committed,condition,condition_number,connect,connection_name,constraint,constraint_catalog,constraint_name,constraint_schema,constraints,constructors,contains,continue,convert,corr,corresponding,count,covar_pop,covar_samp,create,cross,cube,cume_dist,current,current_collation,current_date,current_default_transform_group,current_path,current_role,current_time,current_timestamp,current_transform_group_for_type,current_user,cursor,cursor_name,cycle,data,date,datetime_interval_code,datetime_interval_precision,day,deallocate,dec,decimal,declare,default,defaults,deferrable,deferred,defined,definer,degree,delete,dense_rank,depth,deref,derived,desc,describe,descriptor,deterministic,diagnostics,disconnect,dispatch,distinct,domain,double,drop,dynamic,dynamic_function,dynamic_function_code,each,element,else,end,end-exec,equals,escape,every,except,exception,exclude,excluding,exec,execute,exists,exp,external,extract,false,fetch,filter,final,first,float,floor,following,for,foreign,fortran,found,free,from,full,function,fusion,g,general,get,global,go,goto,grant,granted,group,grouping,having,hierarchy,hold,hour,identity,immediate,implementation,in,including,increment,indicator,initially,inner,inout,input,insensitive,insert,instance,instantiable,int,integer,intersect,intersection,interval,into,invoker,is,isolation,join,k,key,key_member,key_type,language,large,last,lateral,leading,left,length,level,like,ln,local,localtime,localtimestamp,locator,lower,m,map,match,matched,max,maxvalue,member,merge,message_length,message_octet_length,message_text,method,min,minute,minvalue,mod,modifies,module,month,more,multiset,mumps,name,names,national,natural,nchar,nclob,nesting,new,next,no,none,normalize,normalized,not,"null",nullable,nullif,nulls,number,numeric,object,octet_length,octets,of,old,on,only,open,option,options,or,order,ordering,ordinality,others,out,outer,output,over,overlaps,overlay,overriding,pad,parameter,parameter_mode,parameter_name,parameter_ordinal_position,parameter_specific_catalog,parameter_specific_name,parameter_specific_schema,partial,partition,pascal,path,percent_rank,percentile_cont,percentile_disc,placing,pli,position,power,preceding,precision,prepare,preserve,primary,prior,privileges,procedure,public,range,rank,read,reads,real,recursive,ref,references,referencing,regr_avgx,regr_avgy,regr_count,regr_intercept,regr_r2,regr_slope,regr_sxx,regr_sxy,regr_syy,relative,release,repeatable,restart,result,return,returned_cardinality,returned_length,returned_octet_length,returned_sqlstate,returns,revoke,right,role,rollback,rollup,routine,routine_catalog,routine_name,routine_schema,row,row_count,row_number,rows,savepoint,scale,schema,schema_name,scope_catalog,scope_name,scope_schema,scroll,search,second,section,security,select,self,sensitive,sequence,serializable,server_name,session,session_user,set,sets,similar,simple,size,smallint,some,source,space,specific,specific_name,specifictype,sql,sqlexception,sqlstate,sqlwarning,sqrt,start,state,statement,static,stddev_pop,stddev_samp,structure,style,subclass_origin,submultiset,substring,sum,symmetric,system,system_user,table,table_name,tablesample,temporary,then,ties,time,timestamp,timezone_hour,timezone_minute,to,top_level_count,trailing,transaction,transaction_active,transactions_committed,transactions_rolled_back,transform,transforms,translate,translation,treat,trigger,trigger_catalog,trigger_name,trigger_schema,trim,true,type,uescape,unbounded,uncommitted,under,union,unique,unknown,unnamed,unnest,update,upper,usage,user,user_defined_type_catalog,user_defined_type_code,user_defined_type_name,user_defined_type_schema,using,value,values,var_pop,var_samp,varchar,varying,view,when,whenever,where,width_bucket,window,with,within,without,work,write,year,zone}'::text[])
Also, PostgreSQL’s own command line tool “psql” generates pretty complex SQL queries for many of its commands. For example, if you want to find all tables that are called “foo”, you can type `\dt foo` in psql, which will send the following query to the server:
SELECT n.nspname as "Schema",
c.relname as "Name",
CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'm' THEN 'materialized view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 't' THEN 'TOAST table' WHEN 'f' THEN 'foreign table' WHEN 'p' THEN 'partitioned table' WHEN 'I' THEN 'partitioned index' END as "Type",
pg_catalog.pg_get_userbyid(c.relowner) as "Owner"
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
LEFT JOIN pg_catalog.pg_am am ON am.oid = c.relam
WHERE c.relkind IN ('r','p','t','s','')
AND c.relname OPERATOR(pg_catalog.~) '^(foo)$' COLLATE pg_catalog.default
AND pg_catalog.pg_table_is_visible(c.oid)
ORDER BY 1,2;
More complex commands that output more information, like `\d+ foo`, will send several even more complex queries to the server.
You will most likely never use these inspection tables and functions in your application. However, if you connect to a PostgreSQL database, you probably expect your favorite database tool to work. As you can see in the queries, supporting these tools requires support for complex SQL features (such as `ALL` quantifiers or regular expressions), and the database needs to have all tables and functions from the `pg_catalog` namespace.
While the SQL grammar and the wire protocol have been implemented by multiple different database systems and you can look up their definitions easily, the inspection tables and functions are unfortunately neither well known nor well documented. PostgreSQL has some documentation on the tables in the pg_catalog namespace. That namespace also contains several views which are implemented as SQL queries. These queries use many more internal PostgreSQL functions, tables, and views, which are usually prefixed with “pg_”.
While many functions and tables in the pg_catalog are not well documented, many tools still rely on them. The PostgreSQL server also uses them internally to store and retrieve information about the database. If you ever wondered where a database system stores the metadata of which tables and columns exist, the answer for the PostgreSQL server is: in the pg_catalog tables!
Fortunately, even though these tables are technically updatable, sane tools from the PostgreSQL ecosystem only ever read from these tables. For a database system trying to be compatible with these tools this means it can emulate the entire pg_catalog. In CedarDB, the tables in pg_catalog are actually only “virtual” tables that are generated on the fly from CedarDB’s internal metadata storage format. You can find the current implementation status of pg_catalog in CedarDB in our docs.
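One way to picture such “virtual” catalog tables is a generator that produces pg_class-like rows from the engine’s own metadata on every read, so nothing is ever stored in catalog form. The following is a simplified sketch with made-up metadata structures, not CedarDB’s actual implementation:

```python
# Hypothetical internal metadata of a database engine (illustrative only).
internal_tables = [
    {"name": "orders", "kind": "table"},
    {"name": "orders_view", "kind": "view"},
]

# Map internal object kinds to PostgreSQL's relkind codes.
RELKIND = {"table": "r", "view": "v"}

def pg_class_rows():
    """Generate pg_catalog.pg_class rows on the fly from internal metadata.
    16384 is the first OID PostgreSQL assigns to user objects."""
    for oid, tbl in enumerate(internal_tables, start=16384):
        yield {"oid": oid, "relname": tbl["name"], "relkind": RELKIND[tbl["kind"]]}

for row in pg_class_rows():
    print(row)
```

Because the rows are generated on demand, the emulated catalog always reflects the engine’s current metadata, and read-only tools see exactly what they expect.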
We have found that inspecting these tools is also the best way to implement the undocumented internal functions and tables. Especially when the tools are open-source, we can look at their source code to understand which undocumented tables they use and what the tools expect the contents of the tables to look like. PostgreSQL itself also uses several of its internal functions in its pure-SQL implementation of the information_schema.
While this sounds tedious, you will have implemented most of the common undocumented functions and tables after looking at only a few popular tools and trying to connect them to your database system.
For CedarDB, you can find the (incomplete) list of supported clients and tools on the docs page. If you try out your favorite tool with CedarDB and encounter issues caused by an incomplete or incorrect implementation of pg_catalog (or anything else), please let us know in our issue tracker!
Low-Level APIs
If you are a power user of PostgreSQL and have used it in production deployments, you may also be familiar with some of the low-level APIs of PostgreSQL. For example, to optimize storage layout, it may be useful to understand how PostgreSQL lays out its tables on disk and how it writes the write-ahead log. Also, as we mentioned above, many people use PostgreSQL extensions such as pgvector which themselves use an internal API to access and influence the query execution in PostgreSQL.
With the inspection tables and functions, we have already touched on some of PostgreSQL’s internal functionality. While the inspection tables and functions are an internal API, they still follow regular SQL semantics, so they can be implemented on top of the existing SQL infrastructure without affecting the internal structure of the database system. The low-level APIs and data structures, however, are implementation details of the PostgreSQL server itself.
To be compatible with the low-level APIs, you essentially have to be PostgreSQL. You need to adopt its storage layout, its internal representation of queries, its execution engine, and many of its internal C functions. This leaves little room for other database systems to support the low-level APIs of PostgreSQL without also inheriting PostgreSQL’s downsides.
When Is a Database System PostgreSQL Compatible?
To summarize the previous section: When people talk about PostgreSQL or PostgreSQL compatibility, they usually mean one or more of these categories:
- The PostgreSQL-specific SQL grammar
- The PostgreSQL wire protocol
- The PostgreSQL-specific inspection tables and functions
- The low-level APIs of PostgreSQL for their storage layout, execution engine, and plugin support
Now, when is it reasonable to call a database system PostgreSQL compatible? Which of the four categories are essential for a PostgreSQL compatible database system?
If you look at PostgreSQL compatible database systems such as Cockroach or Aurora DSQL, which, unlike Neon or AlloyDB, are not direct forks of the PostgreSQL server, you can see that they implement the SQL grammar and the wire protocol. A database system can’t reasonably call itself PostgreSQL compatible if it doesn’t even support basic (PostgreSQL-specific) SQL statements. So, at a minimum, it is necessary to implement the SQL grammar of PostgreSQL. Similarly, users would expect a basic “hello world” web app that uses PostgreSQL as a backend to work with a database system claiming to be PostgreSQL compatible. This requires implementing the wire protocol of PostgreSQL so that existing clients can connect to the new database without changes.
While applications building on PostgreSQL are very likely to work without the inspection tables and functions, we still think supporting them is essential. Switching over an application from using PostgreSQL to another PostgreSQL compatible database system usually just works. Cockroach and Aurora DSQL have shown this to work very well. However, actively developing and debugging an application using PostgreSQL requires more interaction with the database. For this, developers usually use different tools that rely on the inspection tables and functions. Thus, we think you should only call your database system PostgreSQL compatible if you can connect to it with the most popular database tools.
Finally, as we mentioned above, implementing all PostgreSQL low-level APIs means reimplementing a copy of the PostgreSQL server. Alternatively, you could invest a lot of time and effort into emulating the low-level APIs to support PostgreSQL extensions. However, this effort hardly seems reasonable in comparison to rewriting individual extensions, many of which only exist to sidestep some shortcoming of PostgreSQL that can be avoided in newer systems or to add new functionality. In CedarDB, for example, we chose to implement the functionality of pgvector directly.
For CedarDB, we found the best trade-off is implementing all categories but the low-level APIs. We can implement the grammar, the wire protocol, and the inspection tables and functions without negatively impacting our internal storage layout and query processing. This allows you to use CedarDB with most tools from the PostgreSQL ecosystem and only leaves out PostgreSQL extensions.
The following table summarizes CedarDB’s PostgreSQL compatibility:
| PostgreSQL component | Required for | Supported by CedarDB? |
|---|---|---|
| SQL grammar | Reuse your existing SQL queries, no rewrite required | ✅ |
| Wire protocol | Connect using existing libraries and applications with no changes | ✅ |
| Inspection tables/functions | Database inspection tools, development, and debugging | ✅ |
| Low-level APIs | Support for PostgreSQL extensions | ❌ |
We built CedarDB completely from scratch, giving us 100x the performance of PostgreSQL for many workloads while still being PostgreSQL compatible. If you want to test whether CedarDB supports your favorite PostgreSQL tool, sign up for our waitlist!