July 2, 2024

Working with JSON and Graphs in CedarDB

The only thing that stops a bad guy with a database system, is a good guy with a database system: Learn how to work with semi-structured and graph data in CedarDB by hunting for Germany's most wanted white-collar criminal, Jan Marsalek, on a real world dataset.

Christian Winter
Christian Winter
Christian Winter
Christian Winter
Working with JSON and Graphs in CedarDB

Working with JSON and Graphs in CedarDB

While relational database systems are still by far the most popular database systems in use today, other data models are growing in both importance and popularity. The perceived simplicity and greater flexibility of semi-structured data, such as JSON, has made it the first choice for many application developers, especially for web interfaces and log data. As a result, much of the data being generated today is semi-structured. And the shift to new data models does not stop at data generation. Analysts are increasingly turning to graphs to better capture real-world relationships in their data.

Because both of these data models can be challenging for today’s relational database systems, engineers are forced to split their data between multiple systems, such as specialized NoSQL and graph engines. But this does not have to be the case!

In this post, we’ll show you how to work with semi-structured data in CedarDB and how to analyze graph-like structures using PostgresSQL-flavoured SQL queries and features available in CedarDB today. We will immediately put our newly learned skills to good use and hunt for Germany’s most wanted white-collar criminal, Jan Marsalek, a key figure in the Wirecard scandal. If you see something, say something!

Loading JSON data into CedarDB

The first important step in any investigation is gathering evidence. A great place to start when looking for potential accomplices of Jan Marsalek is the German commercial register “Handelsregister”, which contains records of all German companies and their officers. While there is no official API or data dump available, the Open Knowledge Foundation hosts a JSON dump with data from early 2019. As is typical for JSON, the data is not normalized but has information about a company’s officers nested as objects within the company information object. You can see the structure in the snippet below.

{"name": "Wirecard AG", "officers": [{"name": "Eric Pudewill", "type": "person", "end_date": "2007-11-19", "position": "Prokurist", "other_attributes": {"city": "Berlin", "lastname": "Pudewill", "dismissed": true, "firstname": "Eric"}}, {"name": "Salim Chujfi La Roche", "type": "person", "end_date": "2009-05-18", "position": "Prokurist", "other_attributes": {"city": "Berlin", "lastname": "La Roche", "dismissed": true, "firstname": "Salim Chujfi"}}, {"name": "Christian von Hammel-Bonten", "type": "person", "end_date": "2009-08-05", "position": "Prokurist", "other_attributes": {"city": "Oberpframmern", "lastname": "von Hammel-Bonten", "dismissed": true, "firstname": "Christian"}}, {"name": "Jan Marsalek", "type": "person", "end_date": "2010-05-06", "position": "Prokurist", "other_attributes": {"city": "München", "lastname": "Marsalek", "dismissed": true, "firstname": "Jan"}}, {"name": "Andrea Görres", "type": "person", "position": "Prokurist", "start_date": "2007-08-06", "other_attributes": {"city": "München", "lastname": "Görres", "firstname": "Andrea"}}, {"name": "Stephan Freiherr von Freiherr von Erffa", "type": "person", "position": "Prokurist", "start_date": "2007-11-19", "other_attributes": {"city": "Haar", "lastname": "Freiherr von Freiherr von Erffa", "firstname": "Stephan"}}, {"name": "Stephan von Freiherr von Erffa", "type": "person", "position": "Prokurist", "start_date": "2014-01-22", "other_attributes": {"city": "Baldham", "lastname": "von Freiherr von Erffa", "firstname": "Stephan"}}], "retrieved_at": "2018-11-03T13:15:59Z", "all_attributes": {"registrar": "München", "_registerArt": "HRB", "federal_state": "Bavaria", "_registerNummer": "169227", "additional_data": {"AD": true, "CD": true, "DK": true, "HD": false, "SI": true, "UT": true, "VÖ": true}, "registered_office": "Aschheim", "native_company_number": "München HRB 169227"}, "company_number": "D2601V_HRB169227", "current_status": "currently registered", "jurisdiction_code": "de", "registered_address": "Einsteinring 35, 85609 Aschheim."}
{"name": "Wirecard Technologies GmbH", "officers": [{"name": "Burkhard Ley", "type": "person", "end_date": "2012-12-04", "position": "Geschäftsführer", "other_attributes": {"city": "Solingen", "lastname": "Ley", "dismissed": true, "firstname": "Burkhard"}}, {"name": "Stephan Freiherr von Erffa", "type": "person", "end_date": "2012-12-04", "position": "Prokurist", "other_attributes": {"city": "Haar", "lastname": "Freiherr von Erffa", "dismissed": true, "firstname": "Stephan"}}, {"name": "Burkhard Ley", "type": "person", "end_date": "2018-01-03", "position": "Prokurist", "other_attributes": {"city": "Solingen", "lastname": "Ley", "dismissed": true, "firstname": "Burkhard"}}, {"name": "Carlos Häuser", "type": "person", "end_date": "2018-01-26", "position": "Geschäftsführer", "other_attributes": {"city": "Singapur / Singapur", "lastname": "Häuser", "dismissed": true, "firstname": "Carlos"}}, {"name": "Alexander von Knoop", "type": "person", "position": "Prokurist", "start_date": "2018-01-03", "other_attributes": {"city": "Steingau", "lastname": "von Knoop", "firstname": "Alexander"}}, {"name": "Burkhard Ley", "type": "person", "position": "Geschäftsführer", "start_date": "2012-08-13", "other_attributes": {"city": "Solingen", "lastname": "Ley", "firstname": "Burkhard"}}, {"name": "Burkhard Ley", "type": "person", "position": "Prokurist", "start_date": "2013-07-10", "other_attributes": {"city": "Solingen", "lastname": "Ley", "firstname": "Burkhard"}}, {"name": "Carlos Häuser", "type": "person", "position": "Geschäftsführer", "start_date": "2012-12-04", "other_attributes": {"city": "München", "lastname": "Häuser", "firstname": "Carlos"}}, {"name": "Carlos Häuser", "type": "person", "position": "Geschäftsführer", "start_date": "2017-05-31", "other_attributes": {"city": "Singapur / Singapur", "lastname": "Häuser", "firstname": "Carlos"}}, {"name": "Markus Doctor Braun", "type": "person", "position": "Geschäftsführer", "start_date": "2012-08-13", "other_attributes": {"city": "München", "flag": "mit der Befugnis im Namen der Gesellschaft mit sich im eigenen Namen oder als Vertreter eines Dritten Rechtsgeschäfte abzuschließen", "lastname": "Doctor Braun", "firstname": "Markus"}}, {"name": "Michael Brinkmann", "type": "person", "position": "Prokurist", "start_date": "2012-08-13", "other_attributes": {"city": "München", "lastname": "Brinkmann", "firstname": "Michael"}}, {"name": "Susanne Steidl", "type": "person", "position": "Geschäftsführer", "start_date": "2017-11-06", "other_attributes": {"city": "München", "lastname": "Steidl", "firstname": "Susanne"}}], "retrieved_at": "2019-01-31T00:15:25Z", "all_attributes": {"registrar": "München", "_registerArt": "HRB", "federal_state": "Bavaria", "_registerNummer": "200352", "additional_data": {"AD": true, "CD": true, "DK": true, "HD": false, "SI": true, "UT": true, "VÖ": false}, "registered_office": "Aschheim", "native_company_number": "München HRB 200352"}, "company_number": "D2601V_HRB200352", "current_status": "currently registered", "jurisdiction_code": "de", "registered_address": "Einsteinring 35, 85609 Aschheim."}
{"name": "Wirecard Bank AG", "officers": [{"name": "Oliver Bellenhaus", "type": "person", "end_date": "2013-06-27", "position": "Prokurist", "other_attributes": {"city": "München", "lastname": "Bellenhaus", "dismissed": true, "firstname": "Oliver"}}, {"name": "Alexander von Knoop", "type": "person", "end_date": "2014-01-30", "position": "Prokurist", "other_attributes": {"city": "Steingau", "lastname": "von Knoop", "dismissed": true, "firstname": "Alexander"}}, {"name": "Daniel Heuser", "type": "person", "end_date": "2018-01-02", "position": "Prokurist", "other_attributes": {"city": "München", "lastname": "Heuser", "dismissed": true, "firstname": "Daniel"}}, {"name": "Alexander von Knoop", "type": "person", "position": "Prokurist", "start_date": "2010-03-03", "other_attributes": {"city": "Steingau", "lastname": "von Knoop", "firstname": "Alexander"}}, {"name": "Daniel Heuser", "type": "person", "position": "Prokurist", "start_date": "2009-07-09", "other_attributes": {"city": "München", "lastname": "Heuser", "firstname": "Daniel"}}, {"name": "Oliver Bellenhaus", "type": "person", "position": "Prokurist", "start_date": "2011-11-09", "other_attributes": {"city": "München", "lastname": "Bellenhaus", "firstname": "Oliver"}}, {"name": "Thomas Doctor Käppner", "type": "person", "position": "Prokurist", "start_date": "2009-05-19", "other_attributes": {"city": "Leverkusen", "lastname": "Doctor Käppner", "firstname": "Thomas"}}, {"name": "Thorsten Holten", "type": "person", "position": "Prokurist", "start_date": "2014-09-01", "other_attributes": {"city": "München", "lastname": "Holten", "firstname": "Thorsten"}}, {"name": "Thorsten Holten", "type": "person", "position": "Prokurist", "start_date": "2014-09-10", "other_attributes": {"city": "München", "lastname": "Holten", "firstname": "Thorsten"}}], "retrieved_at": "2018-11-03T11:48:59Z", "all_attributes": {"registrar": "München", "_registerArt": "HRB", "federal_state": "Bavaria", "_registerNummer": "161178", "additional_data": {"AD": true, "CD": true, "DK": true, "HD": false, "SI": true, "UT": true, "VÖ": true}, "registered_office": "Aschheim", "native_company_number": "München HRB 161178"}, "company_number": "D2601V_HRB161178", "current_status": "currently registered", "previous_names": [{"company_name": "Wire Card Bank AG"}], "jurisdiction_code": "de", "registered_address": "Hopfenstraße 1d, 24114 Kiel."}

To work efficiently with this data, we first need to make it locally accessible:

curl -O https://daten.offeneregister.de/de_companies_ocdata.jsonl.bz2
bzip2 --decompress de_companies_ocdata.jsonl.bz2

The bzip2 compressed download is about 250 MB, which decompresses to about 4 GB on disk. Of all the information available, we are only interested in information that uniquely identifies the companies and their officers. The relevant part of the schema containing this information is as follows:

Parts of the JSON schema relevant to our investigation

Querying external JSON files

Before we load the JSON file into the database as a relation, which will make subsequent analysis much faster, we can first make sure it contains valuable information by querying the JSON file directly. For our investigation, we want to make sure that the dataset contains information about our target.

with company_officers(officer) as (
  select json_array_elements(coalesce(data::jsonb->'officers', '[]'::jsonb)) 
  from cedar.csvview('de_companies_ocdata.jsonl') d(data)
) 
select count(*) from company_officers where officer->>'name' like '%Marsalek%';

Let’s take a look at what this statement does.

  • In a first step, it makes the JSON file available to CedarDB as a table d with a single data column containing all rows of the JSON by selecting from the csvview function.
  • From this data, we extract all 'officers' values nested in all companies using the json access operator ->. The coalesce ensures that missing officers entries in the dirty input data do not compromise the query; for JSON values with missing officers entries, we simply assume that there are no officers.
  • Using these cleaned entries, we unpack all officer arrays using the json_array_elements function, making them available as a common table expression (CTE) company_officers.
  • Using the resulting company_officers CTE, we access the 'name' property again using the text access operator ->>, looking for any officer whose name contains Marsalek.

The query returns the following result:

count
8

As we can see from the result, there are eight officers named Marsalek, so it’s worth taking a closer look at the data.

Loading JSON data into CedarDB

While the query above is already reasonably fast on the raw JSON, taking a bit over four seconds to analyze the whole 4 GB of JSON on my laptop, every second counts in such a high-profile case. We can achieve much faster queries by not working on the raw JSON strings and instead loading the JSON file into CedarDB in a binary representation within a JSONB column first.

To do this, we first need to create a relation for our JSON data:

create table register_json (data jsonb not null);
copy register_json from 'de_companies_ocdata.jsonl';

This statement creates a new table register_json with all the json rows in optimized binary format in a single column, which we’ll call data. If you want even faster performance, you can transform this JSON data into relations, as we did in our docs example, but in this post we want to focus on working with plain JSON. Searching for officers named marsalek against the register_json table instead of the external JSON is now three times faster, which means we are ready for more demanding queries.

Finding graph-like connections in SQL

After gathering all our evidence, we can start looking for potential accomplices. We do not want to traverse the graph too far, because each hop brings exponentially more connections, which quickly leads to meaningless results (e.g., on Facebook in 2016, on average only 3.6 intermediate connections were needed to connect any two people). Thus, we only want to look for potential suspects who are two hops out, either directly serving (or having serverd) on the same board as Jan Marsalek (l1), or on the board (l3) of a company (l2) that has board members who also serve on the same board as Jan Marsalek (possibly at another company).

Hops of the graph query

Finding the start nodes

To find potential accomplices, we first need to find the correct Jan Marsalek records among the eight Marsaleks in our discovery query:

with company_officers_json as (
  select data->>'company_number' company_number, data->>'name' company_name,
    json_array_elements(data->'officers') officer_json
  from register_json where data->'officers' is not null
  ),
company_officers as (select company_number, company_name, officer_json->>'name' as name,
    officer_json->'other_attributes'->>'city' city
  from company_officers_json
  ),
marsalek as (select * from company_officers where name = 'Jan Marsalek' and city = 'München')

select * from marsalek;
Try it in CedarDB

Let’s decompose the query:

  • First, we extract all the relevant data for each company, such as the unique company number, the company name, and the officers. Again, we use the json_array_elements function to expand each officer into a separate row.
  • Next, we extract the officer information from the json format as well, leaving us with a purely relational CTE containing four columns: The company’s unique number, the company’s name, and the officer’s name and place of residence.
  • Finally, we display all information about officers named Jan Marsalek who are located in Munich.

Feel free to run or modify the query in CedarDB yourself using the Try in CedarDB button!

The query gives the following result:

company_numbercompany_namenamecity
D2601V_HRB187465Wirecard Sales International Holding GmbHJan MarsalekMünchen
F1103R_HRB88060Wirecard AGJan MarsalekMünchen
D2601V_HRB142427Wirecard Technologies AGJan MarsalekMünchen
D2601V_HRB142427Wirecard Technologies AGJan MarsalekMünchen
D2601V_HRB169227Wirecard AGJan MarsalekMünchen

We can see that we are on the right track, as all connected companies have Wirecard in their name.

Traversing the graph

Having found all of Jan Marsalek’s board positions, the starting nodes of our graph query, we can finally locate our potential suspects by traversing the graph. All of the necessary graph traversal steps can be expressed in a SQL query based on the CTEs defined above:

with company_officers_json as (
  select data->>'company_number' company_number, data->>'name' company_name,
    json_array_elements((data->'officers')::jsonb) officer_json
  from register_json where data->'officers' is not null
  ),
company_officers as (select company_number, company_name, officer_json->>'name' as name,
    officer_json->'other_attributes'->>'city' city
  from company_officers_json
  ),
marsalek    as (select * from company_officers where name = 'Jan Marsalek' and city = 'München'),
marsalek_l1 as (select * from company_officers where company_number in
                  (select company_number from marsalek)),
marsalek_l2 as (select * from company_officers o where exists
                  (select * from marsalek_l1 m where o.name = m.name and o.city = m.city)),
marsalek_l3 as (select * from company_officers where company_number in
                  (select company_number from marsalek_l2))

select distinct name, city from marsalek_l3 order by name;
Try it in CedarDB

Let’s take a look at what the new CTEs and the final query do:

  • For the first traversal step (l1), we perform a left semi join on the company_officers relation with the marsalek CTE, selecting all officers who serve on the same board as Jan Marsalek.
  • In the next step (l2), we use another left semi join to find all other companies where any of the officers part of the previous step holds a board position.
  • In the final step (l3), we now find all the officers who serve on any of the boards on which the officers from l2 serve.
  • Finally, we print the names and places of residence of our potential suspects. Since officers may appear multiple times in the result, e.g. by serving on multiple boards, we remove duplicates using the keyword distinct.

The query result will also include all officers part of l1, as they of course serve on the same board as another officer part of l1, namely the board that qualified them for l1 in the first place.

Simply serving on the same board as someone wanted for a crime does not make you an accomplice, so we will not print the final result. However, as you can see from the result excerpt below, there are definitely relevant suspects, such as Dr. Markus Braun, in the result.

namecity
Markus Doctor BraunGrasbrunn
Markus Doctor BraunMünchen

Summary

In this post, we have learned how to work efficiently with JSON and graph-like data in CedarDB. We learned how to query external files and how to import JSON values into an optimized JSONB column. We also learned how to use common table expressions and joins to build meaningful multi-hop graph queries in SQL. If you are interested in more examples, check out our data cookbooks and example datasets in our docs, or check back for future blog posts.

Can’t wait to apply your newly learned skills to your own data? Then join our waitlist today and be among the first to have access to CedarDB!

Join our waitlist!