Text Types

Reference: Text Types

CedarDB’s text data type stores string data. It considers all strings as Unicode in UTF-8 encoding. In addition to the unconstrained text data type, CedarDB support standard SQL blank-padded char(length), and length constrained varchar(length) types.

Usage Example

create table example (
    gender char(1),
    description text
);
insert into example
    values ('⚧', repeat('UwU', 100)),
           ('X', '''Hi''');
select * from example;
 gender | description 
--------+-------------
 ⚧      | UwUUwU...
 X      | 'Hi'
(2 rows)

Text Length

CedarDB specifies text length in Unicode code points:

select length('🍍'), char_length('🍍'), octet_length('🍍');
 length | char_length | octet_length 
--------+-------------+--------------
      1 |           1 |            4
(1 row)

The maximum byte length for strings is 4 GiB. For compatibility to existing systems, conversions from char to text strips trailing blank-padded spaces.

In addition, all strings need to be valid UTF-8 sequences, i.e., it is not possible to store arbitrary binary data in string columns without additional encoding. For such data, consider using bytea.

Performance Considerations

Text and length-constrained string data types are handled equivalently. Strings with explicit length do not provide performance or storage benefits. Thus, we generally recommend against length-constraining string columns.

One exception is char(1), which is often used as an enum value. Therefore, CedarDB stores it as a four-byte integer, i.e., one full Unicode code point.

Independent of length-constraints, CedarDB stores short strings of up to 12 Bytes inline, whereas larger strings need an indirection. Short strings therefore have significant performance advantages.

Unicode Collation Support

Unicode collations allow comparisons of string data. In the default collation, strings are ordered binary, i.e., lexicographically byte-by-byte by their UTF-8 encoding. Collates can be specified as Unicode CLDR locale identifiers with the additional tags _ci for case-insensitivity, and _ai for accent-insensitivity.

For example, a case-insensitive collate can be useful for text comparison:

with strings(a, b) as (
   values ('foo', 'FOO')
)
select a, b, a = b, a collate "en_US_ci" = b
from strings;
  a  |  b  | ?column? | ?column? 
-----+-----+----------+----------
 foo | FOO | f        | t
(1 row)

Non-deterministic Results

Be aware that queries using collates can lead to unexpected results, when values look different, but are considered equivalent according to the specified collate! For example, for the following query, both, the lowercase and the uppercase result are equally valid:

with strings(s) as (values ('foo'), ('FOO'))
select distinct s collate "en_US_ci"
from strings;
 ?column? 
----------
 foo
(1 row)
 ?column? 
----------
 FOO
(1 row)

You can achieve a deterministic result by rewriting the query to output a min() aggregate in binary collate.

select min(s collate "binary")
from strings
group by s collate "en_US_ci";

Choose the Right Locale

The expected ordering of diacritics can depend on the specified collate. French Candians, for example, seem to have a specific preference about the lexicographical order of diacritics:

with strings(s) as (
   values ('cote'),
          ('coté'),
          ('côte'),
          ('côté')
)
select s from strings order by s;
  s
------
 cote
 coté
 côte
 côté
(4 rows)
select s from strings order by s collate "fr_CA";
  s
------
 cote
 côte
 coté
 côté
(4 rows)