What is distributed systems?
A system running on several nodes connected by a network
And has characteristics of partial failure
What is partial failure?
Some parts (may be) broken, others are working fine
It is possible for software to be unaware of failed components (future lectures)
What are the types of computing philosophies?
Cloud Computing Philosophy
High Performance Computing Philosophy
What is Cloud Computing Philosophy?
The applications running on set of nodes are designed to work around partial failures
Partial failure is an expected outcome of such type of working system
What is High Performance Computing Philosophy?
The applications treat partial failure as total failure
Applications use checkpointing mechanism to deal with failures
Which philosophy will be discussed in this course?
Cloud Computing Philosophy will be discussed in this course
Checkpointing approach works for small set of powerful nodes but as number of nodes increase the probability that someone will fail also increases
Consider following simple setup for two machines M1 and M2
M1 requests value of x
M2 does some processing
M2 responds with value of x
In above setup, what are the ways the system could fail?
Request from M1 gets lost (e.g. packet loss)
Request from M1 is slow (M2 might close the connection meanwhile)
M2 crashes - fails to process the request
M2 takes a lot of time to process the request
Response from M2 is slow
M1 crashes before/while receiving the response
Corrupt transmission (Byzantine faults)
M2 is compromised and returns incorrect value of x
M2 deliberately lies to M1 and returns incorrect value of x
Man in the middle which changes the response from M2 to an incorrect value of x
Can M1 tell what is the failure if it doesn't get response back?
No, it is impossible for M1 to know why
How do real systems try to cope with this?
They implement Timeouts which means within a preset value of time if the response doesn't come back the request is assumed to be failed and may be retried
Timeouts are imperfect solution
Why timeouts are imperfect solution?
Configuring timeout correctly is difficult task due to uncertainties
Uncertainty such network latency is difficult to know for sure (Palvano's definition of distributed systems: Partial failure + unbounded latency)
Another uncertainty example, what if M1 modifies some value (x = x + 1) at M2 but it fails before receiving OK response from M2. Then should it retry?
Such uncertainties can't be solved completely, so we have to deal with them
If latency and processing time was known how would compute the timeout?
Let's say, latency or delay between two systems is "d"
Let's say, processing time at M2 is "r"
Then Timeout at M1 = 2 * d + r
Why would you want distributed systems?
You want to make things faster by breaking computation in smaller chunks (e.g. MapReduce)
You don't have another choice because you have more data than can fit on one machine
More machines could give you reliability if one machine dies rest can still continue processing or serving requests
***