C2. Data Models and Query Languages - Part 3
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Hello 👋 ,
Welcome to a new post!
In Chapter 2, we have been discussing data models and query languages (Part I, Part II). In this last part of the chapter, we will discuss graph models, their query languages, and comparison between network model and graph model.
This is post was relatively difficult to simplify. I have put in a lot of effort in this post, trying to think how can I put these complex concepts in a simple way. After a lot of exploring around, I tried my best with my limited available time and hope it all makes sense to you guys. Appreciate your feedback in comments. So, let’s get started.
*Click on the title to read this post in browser. IMO, it’s better experience.
Graph-Like Data Models
What is graph?
A graph is a collection of vertices (nodes or entities) and edges (relationship) that connects them.
In below example, Customer, Order, Product, Category, and Supplier are vertices. PART_OF, ORDERS, PURCHASED, and SUPPLIES are edges or relationships.
The edges can connect any two types of nodes. The nodes needs not be of same type.
When to use graph data model?
If your data is highly connected, in other words, if data has a lot of many-to-many relationships then exploring graph data model makes sense.
How to represent graphs in relational model?
Consider having two tables vertices and edges. The properties of both of these objects can be stored as json in one of the column. Following are some points to note with relational schema defined below.
Any vertex can have an edge connecting it with any other vertex.
Given any vertex, you can efficiently query it’s incoming or outgoing edges. Thereby, path b/w two vertices.
A clean data model can maintained even though you have N types of relationship out/in to the node.
What are other ways to represent graph data model?
There are several ways to represent graph data models but predominant are property graph and triple-store data models.
Property Graphs
What are vertices consists of?
A unique identifier
A set of outgoing edges
A set of incoming edges
A collection of properties (key-value pairs)
What are edges consists of?
A unique identifier
The vertex at which the edge starts (the tail vertex - for SUPPLY edge "Supplier")
The vertex at which the edge ends (the head vertex - for SUPPLY edge "Product")
A label to describe the kind of relationship between the two vertices - SUPPLY in SUPPLY edge 😂
A collection of properties (key-value pairs)
What is an example of graph data model?
The e-commerce data model presented in above diagram is a good example of property graph.
How do we query such model?
Neo4J uses Cypher Query Language but there also some other languages like Gremlin, Datalog, and SPARQL etc.
The Cypher Query Language
What is an example of creating an entry above e-commerce graph data model?
A sample query for adding vertices and edges can be written as below.
One important thing to note in above definition is that a category can have subcategories. For example, Food & Beverages can be parent category and then child categories can be Vegan, Dairy, Non-Dairy etc. And this nesting is unbounded, it means there aren't any restrictions on the depth as such.
How to write a query to fetch all the products of a category?
Try to understand, because categories can have subcategories and these subcategories can be further divided into sub-subcategories and so on, so you need to query recursively until node with specific conditions you are looking for is found (Depth first search kind of).
This is fairly simple in Cypher as shown below. For all the products recursively look for "Dairy" category, if it is there include product's name in the result.
How database executes above query?
DB has two choices, either start with each product node and look for a node with "Dairy" property.
OR go backwards, start with all category nodes with name "Dairy" and trace back to products referring them directly or indirectly.
The DB optimally decides what to do and all details are abstracted away from user.
Graph Queries in SQL
What is the SQL alternative look like?
Because this is recursive query and depth isn't fixed a query like below will return partial results.
SQL provides another way to query hierarchal data using Common Table Expression (CTE). But even with CTE the size of the query will be complex to write and complexity will increase based on sophistication of your business use case.
👉 Exact syntax of languages are not important, just note how simple it is to write such a complex query in Cypher. Because of such optimizations and abstractions, it makes sense to have different data models and ways to query them in your toolbox.
I recommend you to read about CTE here. It nicely explains how can you implement recursion in SQL query. Interesting topic! Another simple example can be found here.
Triple-Stores and SPARQL
What is Trip-Stores data model?
Similar to property graph, Triple-stores is another to represent graph data model.
In a triple-store, all information is stored in the form of three-part statements (subject, predicate, object). For example (Sharekh, writes, Notes), (Reader, shares, Post).
What is a subject?
A subject of triple is equivalent to a vertex in a graph.
What is an object?
If object is a primitive data type like number or string then predicate and object of the triple are equivalent to the key and value of a property on the subject vertex.
If object is another vertex in the graph in that case predicate is an edge in the graph, subject is tail vertex and object is the head vertex.
What does a graph definition (create statements) look like?
They are simple statements written as (subject, predicate, object), look like below
What is Semantic Web?
The intent was to create a "database of everything".
If websites are already publishing information in the form of text and picture, if they also publish the information in standard format it will allow different website to be automatically connected.
That standard format was supposed to be the Resource Description Format (RDF), just another XML specification.
Why talk about semantic web here?
The triple-store model and semantic web are closely linked (on online discussions) but they are independent of each other.
They pop up together in many online discussions, so it we discussed them.
What is the query language for triple-store format?
It is SPARQL. If SQL and Cypher had baby, it would look like SPARQL. 👼
Graph Databases Compared to the Network Model
What are the differences between graph databases and network models (CODASYL)?
Schema vs No-Schema - CODASYL database had a schema that specifies which record types could be nested within which other record type. In graph, there is no such restriction.
Access Path vs Direct Access - CODASYL required that you remember access path to a nested node from the its root. In graphs, nodes and edges can be accessed directly.
Ordered vs UnOrdered Records - Due to storage layout CODASYL required children of record to ordered and this responsibility was onto applications. In graphs, no ordering is required from application's point of view.
Imperative vs Declarative Queries - CODASYL's queries were imperative and can be easily broken by change in schema. In graph, you have already seen how easy it is to write queries in Cypher which are declarative and flexible with schema.
If you missed our discussion on Network Model (CODASYL) discussion go to Part I.
The Foundation: Datalog
What is Datalog?
This is kind of mix of both, a query language as well as data model.
Instead of writing triples as (subject, predicate, object), we write it as predicate(subject, object)
And query is also expressed in small functions (rules) which can be composed together to build more complex queries.
That’s it for today. In the next post we will start chapter 3, i.e., “Storage and Retrieval” 😮 where we will discuss data structures used in implementing database, difference between transactional and analytics processing, and column oriented storage.
Exciting? Me too. Why not share? 😁