dev July 2, 2024 14 minutes

Working with JSON and Graphs in CedarDB

The only thing that stops a bad guy with a database system, is a good guy with a database system: Learn how to work with semi-structured and graph data in CedarDB by hunting for Germany's most wanted white-collar criminal, Jan Marsalek, on a real world dataset.

Working with JSON and Graphs in CedarDB

While relational database systems are still by far the most popular database systems in use today, other data models are growing in both importance and popularity. The perceived simplicity and greater flexibility of semi-structured data, such as JSON, has made it the first choice for many application developers, especially for web interfaces and log data. As a result, much of the data being generated today is semi-structured. And the shift to new data models does not stop at data generation. Analysts are increasingly turning to graphs to better capture real-world relationships in their data.

Because both of these data models can be challenging for today’s relational database systems, engineers are forced to split their data between multiple systems, such as specialized NoSQL and graph engines. But this does not have to be the case!

In this post, we’ll show you how to work with semi-structured data in CedarDB and how to analyze graph-like structures using PostgresSQL-flavoured SQL queries and features available in CedarDB today. We will immediately put our newly learned skills to good use and hunt for Germany’s most wanted white-collar criminal, Jan Marsalek, a key figure in the Wirecard scandal. If you see something, say something!

Loading JSON data into CedarDB

The first important step in any investigation is gathering evidence. A great place to start when looking for potential accomplices of Jan Marsalek is the German commercial register “Handelsregister”, which contains records of all German companies and their officers. While there is no official API or data dump available, the Open Knowledge Foundation hosts a JSON dump with data from early 2019. As is typical for JSON, the data is not normalized but has information about a company’s officers nested as objects within the company information object. You can see the structure in the snippet below.

{"name": "Wirecard AG", "officers": [{"name": "Eric Pudewill", "type": "person", "end_date": "2007-11-19", "position": "Prokurist", "other_attributes": {"city": "Berlin", "lastname": "Pudewill", "dismissed": true, "firstname": "Eric"}}, {"name": "Salim Chujfi La Roche", "type": "person", "end_date": "2009-05-18", "position": "Prokurist", "other_attributes": {"city": "Berlin", "lastname": "La Roche", "dismissed": true, "firstname": "Salim Chujfi"}}, {"name": "Christian von Hammel-Bonten", "type": "person", "end_date": "2009-08-05", "position": "Prokurist", "other_attributes": {"city": "Oberpframmern", "lastname": "von Hammel-Bonten", "dismissed": true, "firstname": "Christian"}}, {"name": "Jan Marsalek", "type": "person", "end_date": "2010-05-06", "position": "Prokurist", "other_attributes": {"city": "München", "lastname": "Marsalek", "dismissed": true, "firstname": "Jan"}}, {"name": "Andrea Görres", "type": "person", "position": "Prokurist", "start_date": "2007-08-06", "other_attributes": {"city": "München", "lastname": "Görres", "firstname": "Andrea"}}, {"name": "Stephan Freiherr von Freiherr von Erffa", "type": "person", "position": "Prokurist", "start_date": "2007-11-19", "other_attributes": {"city": "Haar", "lastname": "Freiherr von Freiherr von Erffa", "firstname": "Stephan"}}, {"name": "Stephan von Freiherr von Erffa", "type": "person", "position": "Prokurist", "start_date": "2014-01-22", "other_attributes": {"city": "Baldham", "lastname": "von Freiherr von Erffa", "firstname": "Stephan"}}], "retrieved_at": "2018-11-03T13:15:59Z", "all_attributes": {"registrar": "München", "_registerArt": "HRB", "federal_state": "Bavaria", "_registerNummer": "169227", "additional_data": {"AD": true, "CD": true, "DK": true, "HD": false, "SI": true, "UT": true, "VÖ": true}, "registered_office": "Aschheim", "native_company_number": "München HRB 169227"}, "company_number": "D2601V_HRB169227", "current_status": "currently registered", "jurisdiction_code": "de", "registered_address": "Einsteinring 35, 85609 Aschheim."}
{"name": "Wirecard Technologies GmbH", "officers": [{"name": "Burkhard Ley", "type": "person", "end_date": "2012-12-04", "position": "Geschäftsführer", "other_attributes": {"city": "Solingen", "lastname": "Ley", "dismissed": true, "firstname": "Burkhard"}}, {"name": "Stephan Freiherr von Erffa", "type": "person", "end_date": "2012-12-04", "position": "Prokurist", "other_attributes": {"city": "Haar", "lastname": "Freiherr von Erffa", "dismissed": true, "firstname": "Stephan"}}, {"name": "Burkhard Ley", "type": "person", "end_date": "2018-01-03", "position": "Prokurist", "other_attributes": {"city": "Solingen", "lastname": "Ley", "dismissed": true, "firstname": "Burkhard"}}, {"name": "Carlos Häuser", "type": "person", "end_date": "2018-01-26", "position": "Geschäftsführer", "other_attributes": {"city": "Singapur / Singapur", "lastname": "Häuser", "dismissed": true, "firstname": "Carlos"}}, {"name": "Alexander von Knoop", "type": "person", "position": "Prokurist", "start_date": "2018-01-03", "other_attributes": {"city": "Steingau", "lastname": "von Knoop", "firstname": "Alexander"}}, {"name": "Burkhard Ley", "type": "person", "position": "Geschäftsführer", "start_date": "2012-08-13", "other_attributes": {"city": "Solingen", "lastname": "Ley", "firstname": "Burkhard"}}, {"name": "Burkhard Ley", "type": "person", "position": "Prokurist", "start_date": "2013-07-10", "other_attributes": {"city": "Solingen", "lastname": "Ley", "firstname": "Burkhard"}}, {"name": "Carlos Häuser", "type": "person", "position": "Geschäftsführer", "start_date": "2012-12-04", "other_attributes": {"city": "München", "lastname": "Häuser", "firstname": "Carlos"}}, {"name": "Carlos Häuser", "type": "person", "position": "Geschäftsführer", "start_date": "2017-05-31", "other_attributes": {"city": "Singapur / Singapur", "lastname": "Häuser", "firstname": "Carlos"}}, {"name": "Markus Doctor Braun", "type": "person", "position": "Geschäftsführer", "start_date": "2012-08-13", "other_attributes": {"city": "München", "flag": "mit der Befugnis im Namen der Gesellschaft mit sich im eigenen Namen oder als Vertreter eines Dritten Rechtsgeschäfte abzuschließen", "lastname": "Doctor Braun", "firstname": "Markus"}}, {"name": "Michael Brinkmann", "type": "person", "position": "Prokurist", "start_date": "2012-08-13", "other_attributes": {"city": "München", "lastname": "Brinkmann", "firstname": "Michael"}}, {"name": "Susanne Steidl", "type": "person", "position": "Geschäftsführer", "start_date": "2017-11-06", "other_attributes": {"city": "München", "lastname": "Steidl", "firstname": "Susanne"}}], "retrieved_at": "2019-01-31T00:15:25Z", "all_attributes": {"registrar": "München", "_registerArt": "HRB", "federal_state": "Bavaria", "_registerNummer": "200352", "additional_data": {"AD": true, "CD": true, "DK": true, "HD": false, "SI": true, "UT": true, "VÖ": false}, "registered_office": "Aschheim", "native_company_number": "München HRB 200352"}, "company_number": "D2601V_HRB200352", "current_status": "currently registered", "jurisdiction_code": "de", "registered_address": "Einsteinring 35, 85609 Aschheim."}
{"name": "Wirecard Bank AG", "officers": [{"name": "Oliver Bellenhaus", "type": "person", "end_date": "2013-06-27", "position": "Prokurist", "other_attributes": {"city": "München", "lastname": "Bellenhaus", "dismissed": true, "firstname": "Oliver"}}, {"name": "Alexander von Knoop", "type": "person", "end_date": "2014-01-30", "position": "Prokurist", "other_attributes": {"city": "Steingau", "lastname": "von Knoop", "dismissed": true, "firstname": "Alexander"}}, {"name": "Daniel Heuser", "type": "person", "end_date": "2018-01-02", "position": "Prokurist", "other_attributes": {"city": "München", "lastname": "Heuser", "dismissed": true, "firstname": "Daniel"}}, {"name": "Alexander von Knoop", "type": "person", "position": "Prokurist", "start_date": "2010-03-03", "other_attributes": {"city": "Steingau", "lastname": "von Knoop", "firstname": "Alexander"}}, {"name": "Daniel Heuser", "type": "person", "position": "Prokurist", "start_date": "2009-07-09", "other_attributes": {"city": "München", "lastname": "Heuser", "firstname": "Daniel"}}, {"name": "Oliver Bellenhaus", "type": "person", "position": "Prokurist", "start_date": "2011-11-09", "other_attributes": {"city": "München", "lastname": "Bellenhaus", "firstname": "Oliver"}}, {"name": "Thomas Doctor Käppner", "type": "person", "position": "Prokurist", "start_date": "2009-05-19", "other_attributes": {"city": "Leverkusen", "lastname": "Doctor Käppner", "firstname": "Thomas"}}, {"name": "Thorsten Holten", "type": "person", "position": "Prokurist", "start_date": "2014-09-01", "other_attributes": {"city": "München", "lastname": "Holten", "firstname": "Thorsten"}}, {"name": "Thorsten Holten", "type": "person", "position": "Prokurist", "start_date": "2014-09-10", "other_attributes": {"city": "München", "lastname": "Holten", "firstname": "Thorsten"}}], "retrieved_at": "2018-11-03T11:48:59Z", "all_attributes": {"registrar": "München", "_registerArt": "HRB", "federal_state": "Bavaria", "_registerNummer": "161178", "additional_data": {"AD": true, "CD": true, "DK": true, "HD": false, "SI": true, "UT": true, "VÖ": true}, "registered_office": "Aschheim", "native_company_number": "München HRB 161178"}, "company_number": "D2601V_HRB161178", "current_status": "currently registered", "previous_names": [{"company_name": "Wire Card Bank AG"}], "jurisdiction_code": "de", "registered_address": "Hopfenstraße 1d, 24114 Kiel."}

To work efficiently with this data, we first need to make it locally accessible:

curl -O https://daten.offeneregister.de/de_companies_ocdata.jsonl.bz2
bzip2 --decompress de_companies_ocdata.jsonl.bz2

The bzip2 compressed download is about 250 MB, which decompresses to about 4 GB on disk. Of all the information available, we are only interested in information that uniquely identifies the companies and their officers. The relevant part of the schema containing this information is as follows:

Parts of the JSON schema relevant to our investigation

Querying external JSON files

Before we load the JSON file into the database as a relation, which will make subsequent analysis much faster, we can first make sure it contains valuable information by querying the JSON file directly. For our investigation, we want to make sure that the dataset contains information about our target.

with company_officers(officer) as (
  select json_array_elements(coalesce(data::jsonb->'officers', '[]'::jsonb))
  from cedar.csvview('de_companies_ocdata.jsonl') d(data)
)
select count(*) from company_officers where officer->>'name' like '%Marsalek%';

Let’s take a look at what this statement does.

  • In a first step, it makes the JSON file available to CedarDB as a table d with a single data column containing all rows of the JSON by selecting from the csvview function.
  • From this data, we extract all 'officers' values nested in all companies using the json access operator ->. The coalesce ensures that missing officers entries in the dirty input data do not compromise the query; for JSON values with missing officers entries, we simply assume that there are no officers.
  • Using these cleaned entries, we unpack all officer arrays using the json_array_elements function, making them available as a common table expression (CTE) company_officers.
  • Using the resulting company_officers CTE, we access the 'name' property again using the text access operator ->>, looking for any officer whose name contains Marsalek.

The query returns the following result:

| **count** | |-----------:| | 8 |

As we can see from the result, there are eight officers named Marsalek, so it’s worth taking a closer look at the data.

Loading JSON data into CedarDB

While the query above is already reasonably fast on the raw JSON, taking a bit over four seconds to analyze the whole 4 GB of JSON on my laptop, every second counts in such a high-profile case. We can achieve much faster queries by not working on the raw JSON strings and instead loading the JSON file into CedarDB in a binary representation within a JSONB column first.

To do this, we first need to create a relation for our JSON data:

create table register_json (data jsonb not null);
copy register_json from 'de_companies_ocdata.jsonl';

This statement creates a new table register_json with all the json rows in optimized binary format in a single column, which we’ll call data. If you want even faster performance, you can transform this JSON data into relations, as we did in our docs example, but in this post we want to focus on working with plain JSON. Searching for officers named marsalek against the register_json table instead of the external JSON is now three times faster, which means we are ready for more demanding queries.

Finding graph-like connections in SQL

After gathering all our evidence, we can start looking for potential accomplices. We do not want to traverse the graph too far, because each hop brings exponentially more connections, which quickly leads to meaningless results (e.g., on Facebook in 2016, on average only 3.6 intermediate connections were needed to connect any two people). Thus, we only want to look for potential suspects who are two hops out, either directly serving (or having serverd) on the same board as Jan Marsalek (l1), or on the board (l3) of a company (l2) that has board members who also serve on the same board as Jan Marsalek (possibly at another company).

Hops of the graph query

Finding the start nodes

To find potential accomplices, we first need to find the correct Jan Marsalek records among the eight Marsaleks in our discovery query:

with company_officers_json as (
  select data->>'company_number' company_number, data->>'name' company_name,
    json_array_elements(data->'officers') officer_json
  from register_json where data->'officers' is not null
  ),
company_officers as (select company_number, company_name, officer_json->>'name' as name,
    officer_json->'other_attributes'->>'city' city
  from company_officers_json
  ),
marsalek as (select * from company_officers where name = 'Jan Marsalek' and city = 'München')

select * from marsalek;
Try it in CedarDB