C3. Storage and Retrieval - Part 1
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Hello 👋 ,
Welcome to a new post after a long hiatus!
In chapter 2, we discussed data models and query languages. A wild ride now begins 🤯. In this chapter, we are going to discuss the internals of databases!
Introduction
What are the fundamental responsibilities of a DB?
Store the given data.
Get the stored data.
Why do I need to know how storage and retrieval work internally?
When choosing a db for your application, you should have a rough understanding of how it works under the hood.
It allows you to make the right decisions while tuning the db for your specific needs.
Data Structures That Power Your Databases
What is the most straightforward possible database?
For the SET operation, append the key-value pair to the end of a file (without overwriting existing entries for the same key).
For the GET operation, scan the whole file and return the value from the last entry for that key.
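A minimal sketch of this in Python (the book's own version is two tiny Bash functions; the file name and the comma-separated line format here are my own assumptions):

```python
# A toy append-only database: SET appends a line, GET scans the whole file.
DB_FILE = "database.txt"

def db_set(key: str, value: str) -> None:
    # Append a new record; older records for the same key stay in the file.
    with open(DB_FILE, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(key: str):
    # Scan every record and keep the last value seen for the key: O(N).
    result = None
    try:
        with open(DB_FILE) as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    result = v
    except FileNotFoundError:
        pass
    return result

db_set("42", "San Francisco")
db_set("42", "San Francisco, CA")
print(db_get("42"))  # -> San Francisco, CA (the latest entry wins)
```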
What is a log?
A log is an append-only sequence of records.
It doesn't have to be human-readable; it may be in a binary format intended only for other programs to read.
What is the performance of SET operation?
Appending to the end of a file is the simplest possible way to persist data, so writes are very efficient.
Hence, O(1).
What is the performance of GET operation?
The GET operation requires scanning the whole file, hence O(N), N being the number of records added so far.
If the number of records doubles, the time to GET an item doubles.
How can we optimize the GET operation?
We can create an index to find the needed data quickly.
Indexes act as references or signposts for quickly getting to data.
What is an index?
An index is an additional data structure created from primary data.
It doesn't affect the contents of the db, but it does affect the performance of queries.
Who creates an index?
It’s the responsibility of the developer to create an index to satisfy the application’s data requirements.
Developers can create more than one index if they wish to query data in other possible ways.
What are the performance implications of creating indexes?
Every index is an additional data structure that must be updated on every write.
Indexes are therefore write overhead compared to a simple append-only write: they speed up reads at the cost of slowing down writes.
Hash Indexes
How can we use Hash Tables to build an index?
Assume all keys fit in memory. An in-memory hash map then maps every key in the db to the byte offset (the start position of that key's latest record) in the log file.
On querying a particular key, get the byte offset of the key from the hash map and directly jump to that location on the disk.
The index saves us from scanning the whole file on every GET operation.
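A rough Python sketch of the idea (class and method names are my own; a real engine like Bitcask also adds segment files, checksums, timestamps, and so on):

```python
class HashIndexedLog:
    """Append-only log with an in-memory hash map of key -> byte offset."""

    def __init__(self, path="segment.log"):
        self.path = path
        self.index = {}                      # key -> byte offset of its latest record
        self.f = open(path, "ab+")           # append-only writes, random-access reads

    def set(self, key: str, value: str) -> None:
        self.f.seek(0, 2)                    # go to the end of the file
        offset = self.f.tell()               # where this record will start
        self.f.write(f"{key},{value}\n".encode())
        self.f.flush()
        self.index[key] = offset             # remember only the latest offset

    def get(self, key: str):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.f.seek(offset)                  # jump straight to the record on disk
        line = self.f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]

db = HashIndexedLog()
db.set("user:1", "Alice")
db.set("user:1", "Alicia")
print(db.get("user:1"))   # -> Alicia (only the latest record is read)
```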
Which db uses this method of indexing?
Bitcask, the default storage engine in Riak, uses hash indexes.
What are the conditions where such dbs are a good choice?
The number of distinct keys is small enough to fit in memory.
There are many writes, but only to a relatively small number of distinct keys (e.g., a play counter per video URL).
How do you address the increasing size of log files because of the append-only nature?
Break the log into fixed-size files called segments; when a segment reaches its size limit, writes continue in a new segment.
Each segment has its own hash table.
Once a segment is frozen (full), a background thread performs compaction and writes the result to a new, smaller segment.
The new, smaller segment and its hash table replace the old one, and the old segment file is deleted.
During compaction, only the latest record for each key is kept; the rest are discarded.
Several compacted segments can also be merged together, keeping only the latest record for each key across them, to keep the number of segments low. (A sketch of compaction and merging follows the deletion question below.)
How does lookup work after compaction and merging?
All the segments are kept ordered by their creation time, and each segment has its own hash table.
A lookup first checks the most recent segment; if the key isn't there, it checks the next most recent segment, and so on, until the record is found.
How does the deletion work?
A special deletion record, called a tombstone, is appended to the log for the key being deleted.
During compaction and merging, the tombstone tells the process to discard all previous values for that key.
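Here is a hedged sketch of compaction and merging, including tombstone handling, over in-memory segments (the TOMBSTONE marker and the list-of-pairs segment representation are illustrative assumptions, not the actual on-disk format):

```python
TOMBSTONE = "__tombstone__"               # illustrative deletion marker

def compact(segment):
    """Keep only the latest value for each key within one segment."""
    latest = {}
    for key, value in segment:            # records are in write order
        latest[key] = value               # later records overwrite earlier ones
    return latest

def merge(segments):
    """Merge compacted segments, oldest first, and drop tombstoned keys."""
    merged = {}
    for seg in segments:                  # ordered oldest -> newest
        merged.update(seg)                # newer segments win on duplicate keys
    return {k: v for k, v in merged.items() if v != TOMBSTONE}

old = compact([("a", "1"), ("b", "2"), ("a", "3")])
new = compact([("b", TOMBSTONE), ("c", "4")])
print(merge([old, new]))                  # {'a': '3', 'c': '4'}
```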
What are the practical design considerations of designing such db?
File format - CSV or another human-readable format is inefficient and leads to parsing and escaping problems. A better way is a binary format that stores the length of each string followed by its raw bytes (see the sketch after this list).
Deleting records - As described above, via tombstones.
Crash recovery - You could rebuild the in-memory hash maps by reading every segment from the beginning, but that is slow. Instead, a snapshot of each segment's hash map can be stored on disk; after a crash, the index is loaded from the snapshot and then caught up with the remaining log entries.
Partially written records - The db can crash halfway through appending a record, leaving a partially written, corrupt record. Checksums are used to detect and discard such records during recovery.
Concurrency control - Since writes are appended in strictly sequential order, there is only one writer thread. Segments are append-only and otherwise immutable, so they can be read concurrently by multiple threads.
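For the file-format point above, here is a sketch of a length-prefixed binary encoding (the 4-byte big-endian length prefix is my own choice; real engines typically also add a checksum and timestamp per record):

```python
import struct

def encode_record(key: bytes, value: bytes) -> bytes:
    # Each field is a 4-byte big-endian length followed by the raw bytes,
    # so commas, quotes, and newlines never need escaping.
    return (struct.pack(">I", len(key)) + key +
            struct.pack(">I", len(value)) + value)

def decode_record(buf: bytes, offset: int = 0):
    # Returns (key, value, offset of the next record).
    (klen,) = struct.unpack_from(">I", buf, offset)
    key = buf[offset + 4 : offset + 4 + klen]
    offset += 4 + klen
    (vlen,) = struct.unpack_from(">I", buf, offset)
    value = buf[offset + 4 : offset + 4 + vlen]
    return key, value, offset + 4 + vlen

record = encode_record(b"city:1", b"San Francisco, CA")  # no escaping needed
key, value, _ = decode_record(record)
print(key, value)  # b'city:1' b'San Francisco, CA'
```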
Why append-only logs? Why not just update the value at the item's location in the file?
Appending and segment merging are both sequential operations, with no random access needed. Sequential writes are generally much faster than random writes on disk.
Append-only logs make concurrency and crash recovery much simpler: you never have to worry about a crash leaving a value half-overwritten, with part of the old value spliced together with part of the new one.
Merging old segments avoids the problem of data files becoming fragmented over time.
What are the limitations of Hash Indexes?
The number of keys is limited by memory, since the whole hash map must fit in RAM.
Range queries are not efficient: you have to look up each key in the range individually.
How to scale to a large number of keys in hash indexes?
By storing the hash table on disk, we could scale to many more keys, but an on-disk hash map performs poorly: reads now hit the disk and require a lot of random-access I/O.
SSTable and LSM-Trees
What is an SSTable?
Instead of storing key-value pairs in the order they arrive, we write each segment's log sorted by key.
It requires that each key occurs only once per segment (compaction already takes care of this).
This way of arranging the log is called a Sorted String Table (SSTable).
What are the advantages of SSTable?
Merging segments is simple and efficient, similar to the merge step of merge sort. If a key occurs in two segments S1 and S2, where S2 was created later than S1, we take the value from S2 and discard the value from S1 during the merge.
We no longer need to keep every key in memory; a sparse index is enough. Storing, say, the first key of each segment (or of each block within a segment) lets us find the right region by comparing keys and then search within it.
It is also possible to group records into blocks and compress each block before writing it to disk, with the sparse index pointing to the block's location. Compression saves disk space and reduces I/O bandwidth.
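A sketch of the merge-sort-style merge and of a sparse-index lookup (the list-of-pairs segments and the every-Nth-key sparse index are simplifying assumptions):

```python
import bisect

def merge_sstables(older, newer):
    # Merge two key-sorted segments like the merge step of merge sort.
    # On duplicate keys, the value from the newer segment wins.
    out, i, j = [], 0, 0
    while i < len(older) and j < len(newer):
        if older[i][0] < newer[j][0]:
            out.append(older[i]); i += 1
        elif older[i][0] > newer[j][0]:
            out.append(newer[j]); j += 1
        else:                                   # same key in both segments
            out.append(newer[j]); i += 1; j += 1
    return out + older[i:] + newer[j:]

def find_in_segment(sorted_records, sparse_index, key):
    # sparse_index: sorted list of (key, position) for every Nth record.
    # Find the closest indexed key <= key, then scan forward from there.
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return None
    for k, v in sorted_records[sparse_index[i][1]:]:
        if k == key:
            return v
        if k > key:                             # keys are sorted: we can stop early
            return None
    return None

s1 = [("apple", "1"), ("cherry", "2"), ("mango", "3")]   # older segment
s2 = [("banana", "9"), ("cherry", "7")]                  # newer segment
merged = merge_sstables(s1, s2)
print(merged)  # [('apple', '1'), ('banana', '9'), ('cherry', '7'), ('mango', '3')]

sparse = [("apple", 0), ("cherry", 2)]                   # one entry every 2 records
print(find_in_segment(merged, sparse, "banana"))         # -> 9
```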
Constructing and Maintaining SSTables
How do you create the SSTable?
Maintain a balanced tree (e.g., a red-black or AVL tree) in memory that keeps records sorted by key; this is called a memtable.
As soon as the memtable exceeds some threshold size, flush its contents to disk as an SSTable segment.
To find a key, first look in the memtable; if not found, look in the most recent SSTable on disk; if not found there, look in the next most recent SSTable, and so on.
In the background, periodically run compaction & merging on the segments to discard overwritten or deleted values.
How to handle the crash when items are still in memtable?
We maintain a separate append-only log on disk (a write-ahead log).
As soon as we write an item to the memtable, a corresponding entry is appended to this log.
The log need not be sorted, since its only purpose is to restore the memtable after a crash; once the memtable is flushed to an SSTable, the corresponding log can be discarded.
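A simplified sketch of the memtable-plus-log idea (a real engine uses a balanced tree; here a sorted key list plus a dict stands in for it, and the file names and flush threshold are my own assumptions):

```python
import bisect

class MemTable:
    """Toy memtable with a write-ahead log and flush-to-SSTable."""

    def __init__(self, wal_path="wal.log", flush_threshold=3):
        self.keys, self.values = [], {}
        self.flush_threshold = flush_threshold
        self.flush_count = 0
        self.wal = open(wal_path, "ab")            # unsorted crash-recovery log

    def set(self, key, value):
        self.wal.write(f"{key},{value}\n".encode())
        self.wal.flush()                           # make the write durable first
        if key not in self.values:
            bisect.insort(self.keys, key)          # keep keys in sorted order
        self.values[key] = value
        if len(self.keys) >= self.flush_threshold:
            self.flush()

    def get(self, key):
        return self.values.get(key)                # on a miss, check SSTables on disk

    def flush(self):
        # Writing keys in sorted order makes the output file an SSTable segment.
        self.flush_count += 1
        with open(f"sstable-{self.flush_count:04d}.txt", "w") as f:
            for k in self.keys:
                f.write(f"{k},{self.values[k]}\n")
        self.keys, self.values = [], {}
        self.wal.truncate(0)                       # flushed data no longer needs the WAL

mt = MemTable()
for k, v in [("banana", "2"), ("apple", "1"), ("cherry", "3")]:
    mt.set(k, v)                                   # third distinct key triggers a flush
```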
What is an LSM-Tree?
The above data structure implementation, a combination of Tree + SSTable + Log, is called Log-Structured Merge Tree (LSM-Tree).
Performance Optimisations
How to optimize for a missing key check?
Checking for a key that doesn't exist in the database is slow: you have to check the memtable and then every segment (back to the oldest) before concluding the key is absent.
A probabilistic data structure called a Bloom filter is used to optimize this: it can tell us quickly when a key definitely does not appear in the database.
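A tiny Bloom filter sketch (the bit-array size, the number of hash functions, and the use of hashlib with a salt prefix are my own choices):

```python
import hashlib

class BloomFilter:
    """Probabilistic set: answers 'definitely absent' or 'possibly present'."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # False -> key is definitely not in the database; skip all segment lookups.
        # True  -> key may be present; fall through to the normal segment search.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
bf.add("user:1")
print(bf.might_contain("user:1"))    # True
print(bf.might_contain("user:999"))  # almost certainly False
```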
What are the compaction and merge strategies?
There are two common compaction and merge strategies: size-tiered compaction and leveled compaction.
What is size-tiered compaction?
Repeatedly compact and merge smaller tables to build the bigger ones.
Let's say we have a memtable size of 256 MB.
Whenever the memtable is full, we flush its contents into a 512 MB SSTable, until that SSTable is full.
When one such SSTable is full, we start another 512 MB SSTable, and this goes on until we have X fully filled 512 MB SSTables.
Then compaction & merging run over these X tables, and the output is written to a bigger SSTable, say 1 GB in size.
The smaller SSTables are then marked empty and reclaimed.
And this process continues.
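A rough simulation of this idea (the fanout of 4 tables per tier and the in-memory dict representation are illustrative assumptions):

```python
def size_tiered_flush(tiers, table, fanout=4):
    # tiers maps tier number -> list of similarly sized SSTables (as dicts).
    # When `fanout` tables pile up in a tier, merge them into one bigger table
    # in the next tier, and cascade if that tier overflows too.
    tier = 0
    tiers.setdefault(tier, []).append(table)
    while len(tiers[tier]) >= fanout:
        merged = {}
        for t in tiers[tier]:              # oldest first, so newer values overwrite
            merged.update(t)
        tiers[tier] = []                   # the smaller tables are dropped
        tier += 1
        tiers.setdefault(tier, []).append(merged)

tiers = {}
for i in range(9):                         # simulate nine memtable flushes
    size_tiered_flush(tiers, {f"key{i}": i})
print({t: len(tables) for t, tables in tiers.items()})  # {0: 1, 1: 2}
```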
What is leveled compaction?
Illustrated explanation: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction
Level 0 (L0) holds the SSTables flushed directly from the memtable.
Each level is made up of multiple segments, and each level has a fixed target size.
As the level number increases, the target size of the level grows exponentially (e.g., each level is about 10x larger than the previous one).
When records are flushed down from L0 into L1 and L1 exceeds its target size, the overflowing records are merged into the overlapping segments of the next level.
If L1 still exceeds its target size, records keep getting pushed down to L2, and this goes on.
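To make the exponential growth concrete, here is a back-of-the-envelope sketch (the 256 MB base size and 10x multiplier follow RocksDB's documented defaults, but treat them as illustrative):

```python
def level_target_sizes(num_levels=6, l1_target_mb=256, growth_factor=10):
    # Each level's target size is `growth_factor` times the previous level's.
    targets = {}
    for level in range(1, num_levels + 1):
        size_mb = l1_target_mb * growth_factor ** (level - 1)
        targets[f"L{level}"] = (f"{size_mb} MB" if size_mb < 1024
                                else f"{size_mb / 1024:g} GB")
    return targets

print(level_target_sizes())
# {'L1': '256 MB', 'L2': '2.5 GB', 'L3': '25 GB',
#  'L4': '250 GB', 'L5': '2500 GB', 'L6': '25000 GB'}
```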
That’s it for today! I hope you liked this one. In the next part, we will dive deep into B/B+ Tree-based data structures. See you soon!