Importing Data from Parquet

CedarDB supports data imports from Parquet files. Parquet files are compressed, columnar files that are space-efficient and type-safe. They can be used to bring data from other data systems into CedarDB.

ℹ️
While interactive querying of Parquet files is also possible, CedarDB is optimized for its own storage engine. Use Parquet mainly to import data into CedarDB.

Read a Parquet file

-- Autodetect based on file suffix
SELECT * FROM 'test.parquet';

-- Use the parquet_view function
SELECT * FROM parquet_view('test.parquet');
 a  | b  
----+----
  1 |  1
  2 |  2
  3 |  3
 .. | ..
(100 rows)
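
Because the file reference behaves like an ordinary table in the query, you can also filter and aggregate over it directly. A minimal sketch, assuming the test.parquet file with columns a and b shown above:

-- Filter and aggregate directly on the file
SELECT count(*) FROM 'test.parquet' WHERE a > 50;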

Import Parquet into CedarDB

You can either load the Parquet data directly into a table:

CREATE TABLE test AS (SELECT * FROM 'test.parquet');

Use parquet_view if the file name does not end in .parquet:

CREATE TABLE test AS (SELECT * FROM parquet_view('test'));
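
Since the import is an ordinary SELECT, you can also rename, cast, or reorder columns while loading. A sketch under the same assumptions; the table name test_renamed is made up for illustration:

-- Rename and cast columns during the import
CREATE TABLE test_renamed AS (
    SELECT a AS id, b::integer AS value
    FROM 'test.parquet'
);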

Or you can first create the table and then copy the data into it, using COPY in SQL or \copy in psql:

CREATE TABLE test (a integer, b integer);

-- Use COPY in SQL (relative path to server)
COPY test from 'test.parquet' (format parquet);

-- Or \copy in psql (relative path to client)
\copy test from 'test.parquet' (format parquet)

You can also import only a subset of the columns:

-- Create a table with the columns you need
CREATE TABLE onecol (c integer);

-- Select the columns you need for import
INSERT INTO onecol (SELECT a FROM 'test.parquet');
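
The same pattern also lets you restrict the rows during import, not just the columns. A sketch, still assuming the test.parquet file from above:

-- Import only the rows you need
INSERT INTO onecol (SELECT a FROM 'test.parquet' WHERE b > 1);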

Inspect Parquet Metadata

Print the Parquet file layout:

SELECT * FROM parquet_schema('test.parquet');
  file_name   |  name  | cedar_type |  type   | type_length | repetition_type | num_children | converted_type | scale | precision | field_id | logical_type 
--------------+--------+------------+---------+-------------+-----------------+--------------+----------------+-------+-----------+----------+--------------
 test.parquet | schema | boolean    | BOOLEAN |             | REQUIRED        |            2 |                |       |           |          | 
 test.parquet | a      | bigint     | INT64   |             | OPTIONAL        |            0 |                |       |           |          | 
 test.parquet | b      | bigint     | INT64   |             | OPTIONAL        |            0 |                |       |           |          | 
(3 rows)
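
Since parquet_schema returns an ordinary result set, you can query it like any table. A sketch that lists only the leaf columns and their CedarDB types, based on the columns shown above:

-- Leaf columns have no children
SELECT name, cedar_type
FROM parquet_schema('test.parquet')
WHERE num_children = 0;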

Print the Parquet file metadata footer:

SELECT * FROM parquet_file('test.parquet');
  file_name   | file_size | file_meta_size |            created_by            | num_rows | num_row_groups | num_columns | format_version 
--------------+-----------+----------------+----------------------------------+----------+----------------+-------------+----------------
 test.parquet | 2029      | 554            | parquet-cpp-arrow version 21.0.0 | 100      | 1              | 3           |              2
(1 row)
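
The footer columns can be combined in ordinary SQL, for example to estimate the average row width of a file before importing it. A minimal sketch:

-- Rough bytes-per-row estimate from the file footer
SELECT file_name, file_size::numeric / num_rows AS bytes_per_row
FROM parquet_file('test.parquet');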

Print all contained row groups:

SELECT * FROM parquet_rowgroups('test.parquet');
  file_name   | id | num_rows | num_cols | total_bytes | compressed_bytes 
--------------+----+----------+----------+-------------+------------------
 test.parquet |  0 |      100 |        2 |        1463 |             1463
(1 row)
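
Comparing the two byte counters gives a quick estimate of how well each row group compresses; for the uncompressed example file above, the ratio is 1. A sketch:

-- Compression ratio per row group
SELECT id, num_rows, total_bytes::numeric / compressed_bytes AS ratio
FROM parquet_rowgroups('test.parquet');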

Print all contained column chunks:

SELECT * FROM parquet_colchunks('test.parquet');
  file_name   | rg_id | col_id | type  | file_path | schema_path |         encodings          |    codec     | num_values | total_bytes | compressed_bytes 
--------------+-------+--------+-------+-----------+-------------+----------------------------+--------------+------------+-------------+------------------
 test.parquet |     0 |      0 | INT64 |           | {a}         | {PLAIN,RLE,RLE_DICTIONARY} | UNCOMPRESSED |        100 |         771 |              771
 test.parquet |     0 |      1 | INT64 |           | {b}         | {PLAIN,RLE,RLE_DICTIONARY} | UNCOMPRESSED |        100 |         692 |              692
(2 rows)
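
Before a large import, it can be useful to check which codecs and encodings a file actually uses and compare them against the support tables below. A sketch:

-- Codecs and encodings used by the file
SELECT DISTINCT codec, encodings
FROM parquet_colchunks('test.parquet');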

Performance Considerations

CedarDB’s Parquet scan is optimized for full-file imports. The scan is fully multi-threaded and reads only the columns referenced by the query. However, CedarDB does not yet push filters down into Parquet row groups, so it cannot prune data based on Parquet statistics and metadata. You should therefore prefer importing the columns you need into CedarDB over querying the Parquet file directly.
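
Putting this recommendation into practice: import the columns you need once, then run your analytical queries against the CedarDB table instead of the file. A sketch with a made-up table name:

-- One-time import of the needed columns
CREATE TABLE events AS (SELECT a, b FROM 'test.parquet');

-- Subsequent queries run on CedarDB's own storage
SELECT b, count(*) FROM events GROUP BY b;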

Implementation Status

This section summarizes the Parquet features currently supported by CedarDB's Parquet reader.

Legend

  • 🟢 Supported
  • 🟡 Partially supported (details noted in parentheses)
  • 🔴 Not yet supported

Physical Types

 Data Type            | Support
----------------------+------------------------------------
 BOOLEAN              | 🟢
 INT32                | 🟢
 INT64                | 🟢
 INT96                | 🟢
 FLOAT                | 🟢
 DOUBLE               | 🟢
 BYTE_ARRAY           | 🟢
 FIXED_LEN_BYTE_ARRAY | 🟡 (not for legacy string columns)

Logical Types

 Data Type                      | Support
--------------------------------+-----------------------
 STRING                         | 🟢
 ENUM                           | 🟡 (parsed as text)
 UUID                           | 🟢
 Int8,16,32,64                  | 🟢
 UInt8,16,32,64                 | 🟢
 DECIMAL (INT32)                | 🟢
 DECIMAL (INT64)                | 🟢
 DECIMAL (BYTE_ARRAY)           | 🟢
 DECIMAL (FIXED_LEN_BYTE_ARRAY) | 🟢
 FLOAT16                        | 🔴
 DATE                           | 🟢
 TIME (INT32)                   | 🟢
 TIME (INT64)                   | 🟢
 TIMESTAMP (INT64)              | 🟢
 INTERVAL                       | 🔴
 JSON                           | 🟡 (use text instead)
 BSON                           | 🔴
 VARIANT                        | 🔴
 GEOMETRY                       | 🔴
 GEOGRAPHY                      | 🔴
 LIST                           | 🔴
 MAP                            | 🔴
 UNKNOWN (always null)          | 🟢

Encodings

 Encoding                | Support
-------------------------+---------
 PLAIN                   | 🟢
 PLAIN_DICTIONARY        | 🟢
 RLE_DICTIONARY          | 🟢
 RLE                     | 🟢
 BIT_PACKED (deprecated) | 🔴
 DELTA_BINARY_PACKED     | 🔴
 DELTA_LENGTH_BYTE_ARRAY | 🔴
 DELTA_BYTE_ARRAY        | 🔴
 BYTE_STREAM_SPLIT       | 🔴

Compression Codecs

 Compression      | Support
------------------+---------
 UNCOMPRESSED     | 🟢
 BROTLI           | 🔴
 GZIP             | 🟢
 LZ4 (deprecated) | 🔴
 LZ4_RAW          | 🟢
 LZO              | 🔴
 SNAPPY           | 🟢
 ZSTD             | 🟢

Enhanced Features

 Feature               | Support
-----------------------+---------
 Selective Column Read | 🟢
 Row-Group Skip        | 🔴
 DataPageHeaderV2      | 🟢
 Size Statistics       | 🔴
 Page Index            | 🔴
 Bloom Filter          | 🔴
 Nested Encodings      | 🔴