Tables stored on S3
CedarDB supports processing data from disaggregated storage, which lets it work on datasets that are larger than your local SSD. When rows are appended or updated, CedarDB automatically compresses the data into a columnar format in the background. It then uploads the compressed data to S3 and downloads it on demand during querying. Behind the scenes, our unified storage system Colibri differentiates between hot and cold data. For more technical details, see our blog.
Creating a Table on S3
Before creating tables that work on remote data, you need to use the CREATE SERVER statement to first define the location of the data (e.g., the bucket) and the credentials. After defining the remote server, the CREATE TABLE statement uses the information of the remote server to store the data on disaggregated storage. Multiple tables can share the same remote server (e.g., the same bucket).
-- Create a server that stores the location (bucket) and credentials for accessing it
CREATE SERVER serverName FOREIGN DATA WRAPPER s3 OPTIONS (location 's3://bucketname:region', id '<key (AAA...)>', secret '<secret>');
-- Create a table using the server defined above
CREATE TABLE salary (salary integer, tax numeric) WITH (SERVER = serverName);
After setting up a table that uses S3 as backend storage, you can use it just like a regular table stored on the local filesystem.
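As a minimal sketch of what "the same as a regular table" means in practice, the following script writes a few ordinary statements against the `salary` table from the example above into a file. The file name `s3_table_demo.sql` and the inserted values are made up for illustration, and how you run the file depends on your SQL client.

```shell
#!/bin/sh
# Write example statements for the S3-backed salary table to a file.
# The file name and the row values are hypothetical.
cat > s3_table_demo.sql <<'SQL'
-- Append rows; the statement is the same as for a local table
INSERT INTO salary (salary, tax) VALUES (50000, 12500.50), (72000, 21600.00);
-- Query the table
SELECT count(*), sum(salary), avg(tax) FROM salary;
-- Update rows in place
UPDATE salary SET tax = tax * 1.02 WHERE salary > 60000;
SQL
```

If you connect through a PostgreSQL-compatible client such as `psql` (an assumption about your setup), something like `psql -f s3_table_demo.sql` with your usual connection options would execute the statements.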
AWS performance considerations
To get the most out of S3-backed data, it is crucial to choose an instance with enough network bandwidth.
For remote data processing, we recommend using network-optimized instances with 50 Gbit/s or more.
Please also add a sufficiently large EBS device as local storage for metadata and hot data rows.
For fast transactional throughput the EBS device should have enough IOPS and bandwidth provisioned.
We recommend a gp3 volume with at least 500 MB/s bandwidth and 10k IOPS. Note that AWS charges extra for IOPS and bandwidth provisioned beyond the gp3 baseline of 3,000 IOPS and 125 MB/s.
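To get a feel for that extra cost, here is a rough sketch that computes the monthly surcharge for the recommended 10,000 IOPS and 500 MB/s. It relies on the gp3 baseline of 3,000 IOPS and 125 MB/s that is included in the volume price. The two unit prices passed in (0.005 per extra IOPS and 0.04 per extra MB/s, per month) are illustrative assumptions; check the current EBS pricing for your region before relying on the result. The function name `gp3_surcharge` is hypothetical.

```shell
#!/bin/sh
# Rough monthly surcharge for provisioning a gp3 volume above its baseline.
# gp3 includes 3000 IOPS and 125 MB/s in the base volume price.
gp3_surcharge() {
  # $1 = provisioned IOPS, $2 = provisioned throughput (MB/s)
  # $3 = assumed price per extra IOPS-month
  # $4 = assumed price per extra MB/s-month
  awk -v iops="$1" -v tput="$2" -v p_iops="$3" -v p_tput="$4" 'BEGIN {
    extra_iops = (iops > 3000) ? iops - 3000 : 0
    extra_tput = (tput > 125) ? tput - 125 : 0
    printf "%.2f\n", extra_iops * p_iops + extra_tput * p_tput
  }'
}

# Recommended configuration with the assumed unit prices: prints 50.00
gp3_surcharge 10000 500 0.005 0.04
```

With these assumed prices, the surcharge is 35.00 for the 7,000 extra IOPS plus 15.00 for the 375 MB/s of extra throughput, on top of the per-GiB storage price.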
To create such an instance, you can use the following AWS CLI command as a starting point.
# Replace the AMI, key pair, subnet, and security group placeholders with your own values.
# c6in.16xlarge is a network-optimized instance type.
# The root volume is a 1024 GiB gp3 volume with 10000 IOPS and 500 MB/s throughput.
aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type c6in.16xlarge \
  --key-name your-key-pair \
  --subnet-id subnet-xxxxxxxx \
  --security-group-ids sg-xxxxxxxx \
  --associate-public-ip-address \
  --block-device-mappings '[{
    "DeviceName": "/dev/sda1",
    "Ebs": {
      "VolumeSize": 1024,
      "VolumeType": "gp3",
      "Iops": 10000,
      "Throughput": 500
    }
  }]' \
  --query 'Instances[0]'
Cost considerations
AWS generally bills you according to the size of your stored data. S3 storage can be significantly cheaper per byte than EBS storage, so it is often desirable to store large amounts of data on S3 instead of on an EBS volume. However, processing the data incurs additional S3 access costs. AWS charges a small fee for each file request (PUT or GET), but CedarDB's files are designed to be large enough that these costs stay minimal and are usually lower than the cost of the compute instance. Note that it is important to co-locate the S3 bucket and the instance in the same region to avoid network transfer costs. Otherwise, AWS charges expensive inter-region transfer fees, which may dominate the overall cost.
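One way to check the co-location advice above is to compare the bucket's region with the region of your instance. The sketch below uses two hypothetical helper functions, `normalize_bucket_region` and `same_region`. It accounts for the fact that `aws s3api get-bucket-location` reports no location constraint (printed as `None` in text output) for buckets in us-east-1. The commented usage assumes that the AWS CLI's configured default region is the region the instance runs in, which may not hold for your setup.

```shell
#!/bin/sh
# Map the LocationConstraint reported for a bucket to a region name.
# Buckets in us-east-1 report an empty constraint ("None" in text output).
normalize_bucket_region() {
  case "$1" in
    None|null|"") echo "us-east-1" ;;
    *) echo "$1" ;;
  esac
}

# Succeeds if the bucket and the instance are in the same region.
# $1 = LocationConstraint of the bucket, $2 = region of the instance
same_region() {
  [ "$(normalize_bucket_region "$1")" = "$2" ]
}

# Usage on a machine with the AWS CLI configured (bucketname is a placeholder):
#   bucket_region=$(aws s3api get-bucket-location --bucket bucketname \
#                     --query LocationConstraint --output text)
#   instance_region=$(aws configure get region)
#   same_region "$bucket_region" "$instance_region" \
#     || echo "warning: bucket and instance are in different regions"
```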