5. Replication - Part 1
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
This is an early-release preview; a lot remains to be updated here.
Introduction
Assumption: All data can fit on a single host - no need for partitioning
Why replication?
Redundancy of data - Increased availability
Tolerance to failure
Geographic distribution - Reduced latency for users and increased performance
Why does complexity arise in replication?
If the data never changes, we could simply copy it once to every host
Complexity arises when updates happen - how do we propagate changes to all nodes?
What are the algorithms?
Single Leader, Multi-leader, Leaderless
What are the trade-offs?
Synchronous vs Asynchronous replication
Tolerance to node failures
Leaders and Followers
How to make sure all data ends up on all the replicas?
The most common approach is leader-based replication
How does leader-based replication work?
A single replica is designated as the leader; all other replicas are called followers
The leader is the only node that accepts write requests
Read requests can be served from any of the replicas, including the leader
Whenever a write occurs at the leader, it sends the change to the followers as part of a replication log (also called a change stream)
By applying the changes in the same order as the leader processed them, the followers update their own copies of the data
Examples: PostgreSQL, MySQL, MongoDB, Kafka, RabbitMQ
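A minimal in-memory sketch of this write path (a toy key-value store with made-up Replica/Leader classes, not any real database's API):

class Replica:
    """A toy key-value replica that applies changes from a replication log."""
    def __init__(self):
        self.data = {}
        self.log = []                      # ordered change stream: (key, value) pairs

    def apply(self, change):
        key, value = change
        self.data[key] = value
        self.log.append(change)

class Leader(Replica):
    """The only node that accepts writes; it ships each change to its followers."""
    def __init__(self, followers):
        super().__init__()
        self.followers = followers

    def write(self, key, value):
        change = (key, value)
        self.apply(change)                 # apply locally first
        for follower in self.followers:    # then send the change stream, in order
            follower.apply(change)

# Writes go to the leader; reads could be served by any replica.
followers = [Replica(), Replica()]
leader = Leader(followers)
leader.write("user:1", "Alice")
print(followers[0].data)                   # {'user:1': 'Alice'}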
Synchronous vs Asynchronous Replication
What is synchronous replication?
A user sends a write request to the leader
The leader applies the change locally but keeps the user waiting for a response; it also doesn't make the change visible to other clients yet (somewhat like git's staging area - not committed yet)
The leader then asks the follower(s) to apply the same change and send an acknowledgement
Once the follower(s) acknowledge, the leader reports success to the user and makes the data visible to other clients
What are the pros of sync. replication?
Each follower is up to date with the leader
If the leader dies, a follower can immediately take its place without any data loss
What are the cons of sync. replication?
The user has to wait until all the followers have sent their acknowledgements
If one follower fails, the current write as well as upcoming writes get stuck until it comes back
What is async. replication?
A user sends a write request to the leader
The leader applies the change locally (no staging area) and immediately returns an OK response to the user
The leader then sends the replication log (or change stream) to the followers
The followers apply the changes from the replication log
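A rough sketch contrasting the two modes, using a toy key-value follower; the class names and the synchronous flag are illustrative assumptions, not a real API:

from collections import deque

class Follower:
    def __init__(self):
        self.data = {}
        self.pending = deque()              # changes received but not yet applied

    def receive(self, change):
        self.pending.append(change)

    def apply_pending(self):                # in async mode this happens later
        while self.pending:
            key, value = self.pending.popleft()
            self.data[key] = value

class Leader:
    def __init__(self, followers, synchronous):
        self.data = {}
        self.followers = followers
        self.synchronous = synchronous

    def write(self, key, value):
        self.data[key] = value
        for f in self.followers:
            f.receive((key, value))
            if self.synchronous:
                f.apply_pending()           # block until the follower has applied ("ack")
        return "ok"                         # async: reply before followers have applied

# With synchronous=False the leader replies while the follower still lags behind.
follower = Follower()
leader = Leader([follower], synchronous=False)
leader.write("x", 1)
print(follower.data)                        # {} - replication lag until apply_pending() runs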
What are the pros of async. replication?
The user doesn't have to wait until replication is done across all nodes
The leader can continue to accept writes even if all the follower replicas fail
What are the cons of async. replication?
Any writes accepted by the leader but not yet replicated to the followers are lost if the leader fails - durability is weakened
Due to replication lag, users may see stale or inconsistent data when reading from followers
What is a practical solution to async. replication's durability problem?
Chain replication - a synchronous variant implemented in systems such as Microsoft Azure Storage
What is semi-sync replication?
Only one follower is made a synchronous replica; all the others are asynchronous
If the synchronous follower becomes slow or unavailable, another follower is made synchronous in its place
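A toy sketch of the semi-synchronous bookkeeping, assuming a hypothetical SemiSyncConfig helper: one follower is flagged as synchronous and is swapped out if it becomes slow:

class SemiSyncConfig:
    """Toy bookkeeping for semi-synchronous replication: exactly one follower is
    treated as synchronous, the rest are asynchronous."""
    def __init__(self, followers):
        self.followers = list(followers)
        self.sync_follower = self.followers[0]      # pick any follower to start

    def on_follower_slow(self, follower):
        # If the synchronous follower is the slow one, make another follower
        # synchronous in its place so that writes are not blocked.
        if follower == self.sync_follower:
            candidates = [f for f in self.followers if f != follower]
            if candidates:
                self.sync_follower = candidates[0]

config = SemiSyncConfig(["follower-1", "follower-2", "follower-3"])
config.on_follower_slow("follower-1")
print(config.sync_follower)                         # follower-2 is now the sync replica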
Which is the most common configuration?
Async. replication
Why not semi-sync? It seems like the best of both worlds
You would need to deal with the problems of both worlds, which adds complexity
Setting Up a New Follower
What are the approaches to set up a new follower, either after a failure or to increase the number of replicas?
Simply copying the DB files to the new replica
Locking the DB (stopping writes) until the new replica is up to date
Snapshot and catch-up approach
Why not simply copy the DB files to the new replica?
The data is constantly being updated by clients, so files that have already been copied may change again while the copy is in progress
We would need to repeatedly identify the files that changed and copy them over and over
Why not lock the DB (stop writes) until copying completes?
This goes against our availability goal - users wouldn't be able to perform any writes
What is the snapshot and catch up approach?
A snapshot of the leader's database is taken at some time T, without locking the DB.
The snapshot is copied to the new follower.
The follower then requests from the leader all the changes made since time T (i.e., since the snapshot's position in the replication log)
Once the follower has processed the backlog of changes, it has caught up and keeps applying new changes from the leader as they happen
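A minimal sketch of snapshot-and-catch-up, treating the leader's replication log as a plain Python list and the snapshot position as an index into it (all names are illustrative):

def setup_new_follower(leader_log):
    """Toy snapshot-and-catch-up: the leader's state is an ordered list of
    (key, value) changes; the snapshot remembers its position in that log."""
    # 1. Take a snapshot at position T without blocking writes.
    snapshot_pos = len(leader_log)
    snapshot = {}
    for key, value in leader_log[:snapshot_pos]:
        snapshot[key] = value

    # 2. Copy the snapshot to the new follower.
    follower_data = dict(snapshot)

    # 3. Writes keep arriving at the leader while the copy is in progress...
    leader_log.append(("y", 2))

    # 4. ...so the follower asks for everything after position T and applies it.
    for key, value in leader_log[snapshot_pos:]:
        follower_data[key] = value
    return follower_data

log = [("x", 1)]
print(setup_new_follower(log))   # {'x': 1, 'y': 2} - caught up with the leader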
Handling Node Outages
The goal is to keep the system running even if a node goes down, keeping the impact of the outage to a minimum.
Why would a node go down?
Maintenance activity
HW failure etc.
How to recover from follower failure?
This is similar to “Setting Up a New Follower"
If the node restarts and its earlier data is still intact, it doesn't need to copy everything from the leader again; it simply requests the changes that occurred while it was down and eventually catches up
How to identify if leader is dead?
There is no foolproof way to automatically detect that the leader has failed, or why
The usual automated approach is a health check: nodes expect periodic responses from the leader
If the leader doesn't respond within a configured time period, it is presumed dead
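A tiny sketch of timeout-based detection; the 30-second value and the function name are illustrative, not a recommendation:

import time

HEARTBEAT_TIMEOUT_S = 30   # illustrative value; real systems make this configurable

def leader_presumed_dead(last_heartbeat_at, now=None, timeout=HEARTBEAT_TIMEOUT_S):
    """Timeout-based detection: if we haven't heard from the leader within the
    configured window, presume it is dead and trigger a failover."""
    now = time.monotonic() if now is None else now
    return (now - last_heartbeat_at) > timeout

# A leader that last responded 45 seconds ago is presumed dead with a 30s timeout.
print(leader_presumed_dead(last_heartbeat_at=0, now=45))   # True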
How to recover from leader failure? (This process is called failover)
First, detect that the leader is dead
A new leader is chosen from among the followers (ideally the one with the most up-to-date data)
Clients and the remaining followers are notified of the new leader and reconfigure themselves to send writes to it
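A toy sketch of choosing the new leader during failover, assuming each follower reports how far it has read in the old leader's replication log:

def choose_new_leader(followers):
    """Toy leader election for failover: pick the follower that has applied the
    most of the old leader's replication log (i.e., the most up-to-date one).
    `followers` maps a node name to its log position."""
    return max(followers, key=lambda name: followers[name])

# follower-2 has applied more of the log, so it becomes the new leader.
positions = {"follower-1": 1041, "follower-2": 1057, "follower-3": 998}
print(choose_new_leader(positions))   # follower-2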
What are the problems associated with failover?
If async. replication is used, the new leader might not have received all the writes from the old leader, so those writes may be lost. If the old leader later comes back with previously un-replicated writes, the simplest option is usually to discard them, because the new leader's log has since diverged.
If two nodes both believe they are the leader, we have the split brain problem; both may accept conflicting writes. Once detected, the simplest solution is to shut one of them down.
For the leader health check, if the timeout is configured too short we get unnecessary failovers: it is common for a leader to respond slowly under high load, and a failover adds overhead that makes the situation worse. If the timeout is configured too long, recovery from a real failure takes longer.
Implementation of Replication Logs
What are the methods to implement replication logs?
Statement-based replication
Write-ahead Log Shipping
Logical (row-based) log replication
Trigger-based replication
Statement-based Replication
What is statement-based replication?
The leader logs every write statement (e.g. an SQL INSERT/UPDATE/DELETE) that it executes and sends the statement text as-is to the followers in the replication log
The followers read each statement from the log and execute it on their own copy of the data
Statement-based replication was used in MySQL before version 5.1; newer versions use row-based replication by default
What are the cons of this approach?
For non-deterministic functions such as NOW(), which returns the current timestamp, the result on a follower won't match the leader. The leader can work around this by replacing such calls with a fixed value before logging the statement, but there are many edge cases to deal with (see the sketch below).
If a statement uses an auto-incrementing column (or otherwise depends on existing data), it must be executed on each follower in exactly the same order as on the leader, which limits concurrent transaction execution.
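A small sketch of the NOW() problem and the "replace it with a fixed value before logging" workaround; the statement string and the rewrite rule are illustrative only:

import re
import time

def rewrite_nondeterministic(statement):
    """Before logging a statement, replace NOW() with the literal timestamp the
    leader observed, so followers replaying the statement get the same value."""
    fixed_ts = time.strftime("%Y-%m-%d %H:%M:%S")
    return re.sub(r"NOW\(\)", f"'{fixed_ts}'", statement)

stmt = "INSERT INTO events (created_at) VALUES (NOW())"

# Replaying the raw statement later on another node yields a different timestamp;
# replaying the rewritten statement yields exactly the value the leader stored.
print(rewrite_nondeterministic(stmt))
# e.g. INSERT INTO events (created_at) VALUES ('2024-01-01 12:00:00')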
Write-Ahead Log (WAL) Shipping
What is write ahead log shipping?
As seen in Chapter 3, storage engines usually append every write to a log:
In log-structured DBs, the log is the main place where data is stored; log segments are compacted and merged in the background
In B-tree based DBs, which overwrite individual pages, the log is a mechanism to recover from crashes
Because the log is written before the changes are applied to the DB's main content, it is called a write-ahead log
In WAL shipping, the leader sends this same log to its followers over the network; by processing it, each follower builds an exact copy of the data structures found on the leader
This strategy is implemented by PostgreSQL and Oracle
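A toy illustration of why WAL shipping couples replication to the storage format: the records below are hypothetical physical changes (block and byte offsets) that only make sense to an engine with the exact same on-disk layout:

# Hypothetical physical WAL records: "at block B, offset O, write these bytes".
wal_records = [
    {"block": 17, "offset": 128, "bytes": b"\x00\x2a"},
    {"block": 17, "offset": 130, "bytes": b"\x01\x07"},
]

def apply_wal(page_cache, records):
    """A follower replays the records byte-for-byte into its own pages.
    This only works if its storage engine lays out blocks identically."""
    for rec in records:
        page = page_cache.setdefault(rec["block"], bytearray(8192))
        page[rec["offset"]:rec["offset"] + len(rec["bytes"])] = rec["bytes"]

pages = {}
apply_wal(pages, wal_records)
print(pages[17][128:132])   # bytearray(b'\x00*\x01\x07')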
What are the cons of WAL shipping?
The WAL describes the data at a very low level: it records which bytes were changed in which disk blocks
This tightly couples replication to the storage engine; if the storage format changes from one version to the next, the leader and followers typically cannot run different versions
Such tight coupling can force downtime during a version upgrade, because the log format is not compatible across versions
How does loose coupling solve the downtime problem?
Assume the replication log is decoupled from the storage format and is backward compatible across versions. Say the leader is running version 1 of the DB and we have already upgraded the followers to version 2.
We can trigger a failover and promote one of the upgraded followers to be the new leader.
Then we upgrade the previous leader and add it back as a follower - an upgrade with no downtime.
Logical (row-based) Log Replication
What is row-based log replication?
A logical log is a sequence of records describing writes to the database at the granularity of a row:
For an inserted row - the new values of all its columns
For a deleted row - enough information to uniquely identify the deleted row (typically the primary key)
For an updated row - enough information to identify the row, plus the new values of the changed columns
For a transaction that modifies several rows - the row records, followed by a special record indicating that the transaction was committed
The logical log format is decoupled from the storage engine's internal (physical) representation
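An illustrative sketch of logical (row-based) log records and a consumer applying them; the field names are made up, but the point is that nothing here depends on the leader's storage engine:

logical_log = [
    {"op": "insert", "table": "users", "row": {"id": 1, "name": "Alice"}},
    {"op": "update", "table": "users", "key": {"id": 1}, "changes": {"name": "Bob"}},
    {"op": "delete", "table": "users", "key": {"id": 1}},
    {"op": "commit", "txn": 42},            # transaction boundary record
]

def apply_logical(tables, records):
    """A follower (or an external consumer doing change data capture) applies
    the records without knowing anything about the leader's storage engine."""
    for rec in records:
        if rec["op"] == "commit":
            continue                        # marks the end of a transaction
        rows = tables.setdefault(rec["table"], {})
        if rec["op"] == "insert":
            rows[rec["row"]["id"]] = dict(rec["row"])
        elif rec["op"] == "update":
            rows[rec["key"]["id"]].update(rec["changes"])
        elif rec["op"] == "delete":
            rows.pop(rec["key"]["id"], None)

tables = {}
apply_logical(tables, logical_log)
print(tables["users"])                      # {} - inserted, updated, then deleted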
What are the pros?
Because the log format is decoupled from the storage engine internals, it is easier to keep it backward compatible, allowing the leader and followers to run different versions and avoiding downtime during upgrades.
A logical log is also easier for external applications to parse, e.g. for building a data warehouse; this technique is called change data capture.
Trigger-based Replication
What is trigger based replication?
All the methods discussed so far are implemented by the database itself; replication is not part of the application code.
But sometimes we want more flexibility, e.g. replicating only a subset of the data or transforming data before storing it - that logic has to live in application code.
Triggers and stored procedures let developers register custom code that runs in response to a database event (such as a write); the trigger can record the change in a separate table, from which it can be read and replicated by an external process (see the sketch below).
This approach can nevertheless be useful because of its flexibility.
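A rough application-level sketch of the idea, using a toy in-memory table with a hypothetical trigger hook (a real database would provide its own trigger/stored-procedure mechanism):

class Table:
    """A toy table that calls registered trigger functions after every insert."""
    def __init__(self):
        self.rows = []
        self.triggers = []

    def insert(self, row):
        self.rows.append(row)
        for trigger in self.triggers:   # run application code in response to the write
            trigger(row)

# The "changes" table collects only the subset of data we care about; an
# external process could read it and replicate or transform it further.
users = Table()
changes = Table()

def replicate_premium_users(row):
    if row.get("plan") == "premium":
        changes.insert({"user_id": row["id"], "event": "premium_signup"})

users.triggers.append(replicate_premium_users)
users.insert({"id": 1, "plan": "free"})
users.insert({"id": 2, "plan": "premium"})
print(changes.rows)   # [{'user_id': 2, 'event': 'premium_signup'}]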
What are the cons?
This approach is more error-prone and more limited than the database's built-in replication.
Triggers add additional overhead.
END OF PART-1