C1. Reliable, Scalable, and Maintainable Applications - Part 1
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Hello 👋 ,
We start our journey into the amazing book “Designing Data-Intensive Applications” by Martin Kleppmann. This book not only gives an inside view of the most popular distributed systems but also serves as a go-to guide for designing everyday systems in general.
I highly recommend going through these notes if you are preparing for a system design interview. I have split notes for chapters into multiple parts so that each can be consumed within 10-15 mins.
Let’s get started.
Introduction
What is this book about?
For many applications today the limiting factor is the data (its volume, complexity, and rate of change) rather than raw compute power
When building a solution for a data-intensive application we have abstractions at hand such as databases, caches, search indexes, streams, and batch processing
There are multiple implementations of these abstractions and various ways to combine them, which makes it difficult to select the right tool for the job.
This book covers principles and practicalities of data systems to build data-intensive applications. It explores what these tools have in common, what distinguishes them, and how they achieve their characteristics.
What does this chapter cover?
We will define Reliability, Scalability, and Maintainability
Thinking About Data Systems
Why should we refer to databases, caches, queues, etc. under the umbrella term data systems?
Traditionally these tools are distinguished by their access patterns
But the boundaries between these tools are blurring. Some datastores can also be used as message queues (Redis), and some message queues offer database-like durability guarantees (Kafka)
Today's application requirements are too wide-ranging to be satisfied by a single tool. So we stitch together multiple such tools.
When we stitch these tools together to solve our application's problem, we are creating a special-purpose data system, exposed through an API, with its own characteristics. Hence, it is fair to bring them under a single umbrella term like data systems.
What are the concerns which are common and should be considered for designing data systems?
Reliability - Continuing to work correctly, even when things go wrong
Scalability - The system should be able to grow and handle more load
Maintainability - People should be able to work on the system productively, both maintaining current behaviour and adapting it to new use cases
Reliability
Continuing to work correctly, even when things go wrong
What are the expectations of reliability for a software system?
The app does what it is supposed to do, tolerates users making mistakes or operating it in unexpected ways, and prevents unauthorized access and abuse
Performance is good enough for required use case under the expected load and volume
What is fault?
A fault is usually defined as one component of the system deviating from its spec.
Things which can go wrong in a system are called faults.
What is fault-tolerant or resilient system?
A system that anticipates faults and can cope with them
Systems can't handle all kinds of faults (e.g. Earth swallowed by a black hole)
So fault tolerance only makes sense with respect to specific, reasonable types of faults.
What is failure?
A fault is not the same as a failure. A failure is when the system as a whole stops providing the required service to the user.
Faults can't be eliminated completely, but we can design the system so that faults are prevented from causing failures, or so that it recovers from them. For security faults, prevention is better than cure.
The faults that can be cured fall into three groups: hardware faults, software errors, and human errors
Netflix's Chaos Monkey is an example of deliberately introducing faults to test a system's fault tolerance
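The idea of deliberately introducing faults can be tried at a very small scale. The sketch below is only an illustration, not how Chaos Monkey itself works: `flaky`, `call_with_retries`, `InjectedFault`, and `fetch_profile` are hypothetical names, and the 30% failure rate is an arbitrary assumption. It wraps a dependency so that some calls fail on purpose, then checks that a simple retry keeps the injected faults from turning into failures for the caller.

```python
import random


class InjectedFault(Exception):
    """Raised by the fault injector in place of a real outage."""


def flaky(func, failure_rate, rng):
    """Wrap func so that roughly `failure_rate` of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected fault")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, attempts):
    """Call func up to `attempts` times (attempts >= 1); re-raise the last fault."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as error:
            last_error = error
    raise last_error


def fetch_profile():
    # Hypothetical dependency standing in for a real service call.
    return {"user": "alice"}


rng = random.Random(42)  # fixed seed so the experiment is repeatable
unreliable_fetch = flaky(fetch_profile, failure_rate=0.3, rng=rng)

successes = 0
for _ in range(1000):
    try:
        call_with_retries(unreliable_fetch, attempts=3)
        successes += 1
    except InjectedFault:
        pass  # all three attempts hit an injected fault: a failure for the user

print(f"{successes} of 1000 requests succeeded despite injected faults")
```

With three attempts, a request only fails when all three hit an injected fault, so most requests still succeed even though almost a third of the individual calls do not.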
Hardware Faults
What are hardware faults?
Hard disk crashes, faulty RAM, power outages: all faults of this kind are called hardware faults.
These happen all the time. The book reports a mean time to failure (MTTF) for hard disks of about 10 to 50 years, which means a storage cluster with 10,000 disks should expect, on average, one disk failure per day.
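The disk-failure estimate is simple arithmetic: fleet size divided by MTTF expressed in days. Here is a minimal sketch, assuming disks fail independently and at a constant rate; the function name and the sample MTTF inputs are illustrative assumptions.

```python
def expected_disk_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Average number of disks expected to fail per day in a fleet."""
    mttf_days = mttf_years * 365
    return num_disks / mttf_days


# Sample MTTF values (illustrative) for a fleet of 10,000 disks.
for mttf_years in (10, 30, 50):
    rate = expected_disk_failures_per_day(10_000, mttf_years)
    print(f"MTTF {mttf_years} years -> about {rate:.2f} failures/day")
```

For MTTF values in the tens of years, the result lands between roughly half a failure and a few failures per day, which is why a large fleet sees disk failures as a routine event.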
How are these faults handled traditionally?
By providing redundancy. We maintain one or more instances of a component as a backup so if one fails we can easily swap and reduce downtime. For example, dual power supplies in data centers.
Is hardware redundancy sufficient?
It used to be, but it no longer is, because applications with huge data volumes need a large number of machines to process them
When more and more machines are put together to perform a job, the chance that at least one of them fails increases
So the move is towards tolerating the loss of entire machines in a multi-machine environment, using software fault-tolerance techniques in addition to hardware redundancy
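To see why adding machines raises the odds of a fault somewhere in the cluster, here is a small sketch. It assumes machines fail independently, and the 0.1% per-machine daily failure probability is a made-up number used only for illustration.

```python
def p_any_machine_fails(num_machines: int, p_single: float) -> float:
    """Probability that at least one machine fails, assuming independent failures."""
    return 1 - (1 - p_single) ** num_machines


# p_single = 0.001 is a hypothetical per-machine, per-day failure probability.
for n in (1, 10, 100, 1000):
    chance = p_any_machine_fails(n, 0.001)
    print(f"{n:>5} machines -> {chance:.1%} chance of at least one failure per day")
```

A fault that is rare on one machine becomes an everyday event on a thousand, so the system has to keep working when a whole machine disappears.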
Software Errors
What are the software errors?
Hardware failures are usually considered independent: one hard disk failing doesn't cause another to fail.
Software errors are usually bugs or hidden assumptions about the environment that surface when those assumptions stop holding, for example on incorrect input.
Software errors tend to be correlated and can hit many nodes or services at once. For example, if a downstream service slows down or returns incorrect responses, it may cause its consumers to fail as well.
Is there a fix for such errors?
Short answer: no. But carefully examining the assumptions the system makes, and continuously monitoring whether they still hold, helps raise an alarm as early as possible.
Typically, such bugs go unnoticed for a long period of time.
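One way to monitor an assumption continuously is to have the system check an invariant about itself while it runs. The toy queue below is a hypothetical example, not taken from any real library: it assumes every message that was put in has either been delivered or is still waiting, and exposes a check that an alerting job could call.

```python
class CountingQueue:
    """Toy in-memory queue that keeps counters for a self-check."""

    def __init__(self):
        self._items = []
        self.enqueued = 0
        self.dequeued = 0

    def put(self, item):
        self._items.append(item)
        self.enqueued += 1

    def get(self):
        item = self._items.pop(0)
        self.dequeued += 1
        return item

    def check_invariant(self):
        """True if every enqueued message is either delivered or still pending."""
        return self.enqueued == self.dequeued + len(self._items)


queue = CountingQueue()
queue.put("a")
queue.put("b")
queue.get()
print(queue.check_invariant())  # counters and contents agree

queue._items.pop()  # simulate a bug that silently drops a message
print(queue.check_invariant())  # the discrepancy is now detectable
```

The check does not prevent the bug, but it shortens the time the bug stays hidden.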
Human Errors
Human operators are a leading cause of outages. How can we make systems reliable in spite of unreliable humans?
Design a well-balanced system that lets humans operate within their scope and contains the impact of their mistakes
Decouple the places where people make the most mistakes from the places where they can cause failures, for example, sandbox environments.
Test, test, and test! At all levels: unit tests, integration tests, QA, and manual testing.
Provide quick recovery mechanisms in case of an error, such as configuration rollback, data backfill, etc.
Set up detailed and clear monitoring and alarms (also called telemetry)
Good management practices and training
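As one illustration of a quick recovery mechanism, the sketch below keeps every applied configuration version so that a mistaken change can be undone in a single step. `VersionedConfig` and the setting names are hypothetical, and a real system would also persist the history.

```python
class VersionedConfig:
    """Keeps every applied configuration so a bad change can be undone."""

    def __init__(self, initial: dict):
        self._history = [dict(initial)]

    @property
    def current(self) -> dict:
        # Return a copy so callers cannot mutate the stored version.
        return dict(self._history[-1])

    def apply(self, changes: dict) -> dict:
        """Record a new version with `changes` merged over the current one."""
        new_version = {**self._history[-1], **changes}
        self._history.append(new_version)
        return dict(new_version)

    def rollback(self) -> dict:
        """Discard the latest version; the initial version is never discarded."""
        if len(self._history) > 1:
            self._history.pop()
        return self.current


config = VersionedConfig({"timeout_seconds": 30, "max_connections": 100})
config.apply({"timeout_seconds": 0})  # a mistaken change made by an operator
config.rollback()                     # quick recovery to the previous version
print(config.current)
```

Because old versions are never overwritten, the operator's mistake costs one rollback call instead of a manual reconstruction of the previous settings.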
How Important Is Reliability?
Loss of reliability hurts productivity and, in most cases, the revenue of the business
It also costs customer trust, for example, when parents can't access their kids' pictures because the service is down
There are situations where you might cut corners on reliability, such as when developing a prototype. Be conscious of when you are doing so.
That’s it for this post. See you in the next post, where we will discuss the last part of this chapter, “Scalability”. Yeah, I know… everyone nowadays is like, is it scalable?