Join Order Benchmark
Aside from the well-known industry benchmarks for analytical systems, TPC-H and TPC-DS, the Join Order Benchmark (JOB) is an excellent choice to evaluate the capabilities of a database system, especially its optimizer. In contrast to TPC-H and TPC-DS, whose generators produce predictable data distributions so that the dataset can be scaled arbitrarily, JOB is derived from the Internet Movie Database and contains real-world data full of correlations, leading to a wide variety of data distributions. This makes ordering the joins in queries over this dataset challenging, hence the name, as the optimizer cannot simply assume uniform data distributions and instead has to rely on collected samples and statistics for join ordering.
The Dataset
The dataset comprises a total of 21 tables extracted from IMDB, containing information about the movie industry, such as movies, studios, and actors, as well as the connections between them, for example which roles actors played in which movies.
The full schema with information on all tables is available as an SQL file, schema.sql. For example, the cast_info table, which contains by far the most rows, can be created as:
CREATE TABLE cast_info (
    id integer NOT NULL PRIMARY KEY,
    person_id integer NOT NULL,
    movie_id integer NOT NULL,
    person_role_id integer,
    note text,
    nr_order integer,
    role_id integer NOT NULL
);
Obtain the data
An excerpt of the IMDB dataset is available for non-commercial purposes through the JOB paper. To obtain the relevant data, simply run:
mkdir jobdata && cd jobdata
curl -OL https://bonsai.cedardb.com/job/imdb.tgz
tar -zxvf imdb.tgz
Import the schema to CedarDB
To create the full schema inside CedarDB, download the schema.sql file and load it, either on a new connection using:
psql -h localhost -U {{username}} < your/path/schema.sql
or in an existing psql session:
\i your/path/schema.sql
This will create all necessary relations for the JOB dataset.
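As a quick sanity check, you can count the freshly created relations. The following is only a sketch; it assumes the tables live in the default public schema of an otherwise empty database and that the standard information_schema catalog views are available, as in PostgreSQL:
-- After loading schema.sql, this should report the 21 JOB tables.
SELECT count(*) AS job_tables
FROM information_schema.tables
WHERE table_schema = 'public';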
Import the data to CedarDB
Once you have created all necessary relations, you can import the previously extracted CSV files. Each CSV file can be imported using the COPY command.
COPY cast_info FROM 'your/path/cast_info.csv' DELIMITER ',' CSV NULL '' ESCAPE '\' HEADER;
The IMDB export contains quoted values and backslash-escaped characters, so you have to use the CSV format option instead of the more performant TEXT format when importing the data. For convenience, we provide an SQL file with all necessary COPY commands, load.sql. Please note that you need to modify the paths from your/path to the correct location relative to the CedarDB server.
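To verify that the import succeeded, a simple row count is often enough. This is only a sketch; the exact numbers depend on the IMDB snapshot you downloaded, but cast_info should come out as the largest table by far:
-- cast_info is the largest JOB table; expect a row count in the tens of millions.
SELECT count(*) AS cast_info_rows
FROM cast_info;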
The Query Workload
The queries of the Join Order Benchmark were created, as the name already reveals, to contain challenging join order decisions for the optimizer. Queries in the Join Order Benchmark therefore join at least 4 and up to 17 tables, with an average of 8 joins per query. These queries were designed to include meaningful connections between tables and to mimic realistic analytical questions one might ask about movies.
Run the benchmark queries
All 113 JOB queries are available for download.
You can run these queries manually one by one using the usual query interface. For example, the first query, 1a, tries to find movies in the top 250 that were not produced by Metro-Goldwyn-Mayer Pictures.
SELECT MIN(mc.note) AS production_note, MIN(t.title) AS movie_title, MIN(t.production_year) AS movie_year
FROM company_type AS ct, info_type AS it, movie_companies AS mc, movie_info_idx AS mi_idx, title AS t
WHERE ct.kind = 'production companies'
AND it.info = 'top 250 rank'
AND mc.note not like '%(as Metro-Goldwyn-Mayer Pictures)%'
AND (mc.note like '%(co-production)%' or mc.note like '%(presents)%')
AND ct.id = mc.company_type_id
AND t.id = mc.movie_id
AND t.id = mi_idx.movie_id
AND mc.movie_id = mi_idx.movie_id
AND it.id = mi_idx.info_type_id;
Alternatively, you can run a query directly from its SQL file within an active psql session:
\i your/path/1a.sql
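Since JOB is all about join ordering, it is also instructive to look at the plan the optimizer picks. As a sketch, assuming the PostgreSQL-style EXPLAIN statement is available, you can prefix any benchmark query with EXPLAIN to inspect the chosen join order, for example for query 1a:
-- Show the execution plan, including the join order, without running the query.
EXPLAIN
SELECT MIN(mc.note) AS production_note, MIN(t.title) AS movie_title, MIN(t.production_year) AS movie_year
FROM company_type AS ct, info_type AS it, movie_companies AS mc, movie_info_idx AS mi_idx, title AS t
WHERE ct.kind = 'production companies'
  AND it.info = 'top 250 rank'
  AND mc.note NOT LIKE '%(as Metro-Goldwyn-Mayer Pictures)%'
  AND (mc.note LIKE '%(co-production)%' OR mc.note LIKE '%(presents)%')
  AND ct.id = mc.company_type_id
  AND t.id = mc.movie_id
  AND t.id = mi_idx.movie_id
  AND mc.movie_id = mi_idx.movie_id
  AND it.id = mi_idx.info_type_id;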
Get started with your own queries
In addition, you can of course play around with the dataset on your own however you like. Collect information on your favorite movies, update potentially outdated information, or enrich the data with external sources.
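For instance, a simple starting point is to look up the cast of a particular movie by joining title, cast_info, and name. This is only a sketch; the movie title used here is a hypothetical example, and the results depend on the IMDB snapshot you loaded:
-- List everyone credited for a given movie title.
SELECT t.title, t.production_year, n.name
FROM title AS t, cast_info AS ci, name AS n
WHERE t.title = 'The Matrix'  -- hypothetical example; pick your favorite movie
  AND ci.movie_id = t.id
  AND n.id = ci.person_id
ORDER BY n.name;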