L1: Distributed Systems: What and Why?

May 01, 2021

What is distributed systems?

What is partial failure?

Some parts (may be) broken, others are working fine
It is possible for software to be unaware of failed components (future lectures)

What are the types of computing philosophies?

What is Cloud Computing Philosophy?

The applications running on set of nodes are designed to work around partial failures
Partial failure is an expected outcome of such type of working system

What is High Performance Computing Philosophy?

Which philosophy will be discussed in this course?

Cloud Computing Philosophy will be discussed in this course
Checkpointing approach works for small set of powerful nodes but as number of nodes increase the probability that someone will fail also increases

Consider following simple setup for two machines M1 and M2

In above setup, what are the ways the system could fail?

Request from M1 gets lost (e.g. packet loss)
Request from M1 is slow (M2 might close the connection meanwhile)
M2 crashes - fails to process the request
M2 takes a lot of time to process the request
Response from M2 is slow
M1 crashes before/while receiving the response
Corrupt transmission (Byzantine faults)
1. M2 is compromised and returns incorrect value of x
2. M2 deliberately lies to M1 and returns incorrect value of x
3. Man in the middle which changes the response from M2 to an incorrect value of x

Can M1 tell what is the failure if it doesn't get response back?

How do real systems try to cope with this?

They implement Timeouts which means within a preset value of time if the response doesn't come back the request is assumed to be failed and may be retried
Timeouts are imperfect solution

Why timeouts are imperfect solution?

Configuring timeout correctly is difficult task due to uncertainties
Uncertainty such network latency is difficult to know for sure (Palvano's definition of distributed systems: Partial failure + unbounded latency)
Another uncertainty example, what if M1 modifies some value (x = x + 1) at M2 but it fails before receiving OK response from M2. Then should it retry?
Such uncertainties can't be solved completely, so we have to deal with them

If latency and processing time was known how would compute the timeout?

Why would you want distributed systems?

You want to make things faster by breaking computation in smaller chunks (e.g. MapReduce)
You don't have another choice because you have more data than can fit on one machine
More machines could give you reliability if one machine dies rest can still continue processing or serving requests

***

Notes by Sharekh