L2: Time and Clocks; Lamport Diagrams; Causality and happens-before; Network models; State and events
CSE138 (Distributed Systems) by Lindsey Kuper
What will be covered in this lecture?
Time and clocks
Lamport diagrams
Network models
Causality and happens-before
State and events
Time and clocks
What do we use clocks for?
Marking points in time
Specifying durations and intervals of time
What are the examples of marking points in time?
Scheduling i.e. when something will happen e.g. This class starts at 3:30 PM PT
This item in my browser cache expires on 1st April, 2021 at 8 PM PT
This error message in errors.txt has a timestamp of April 1, 2021
What are the examples of duration?
This class is 95 minutes long
This request will timeout in 5 seconds
What are the different types of clocks?
Time of day clock
Monotonic clock
What is time of day clock?
Basically, it says what time of the day it is! For example 8 PM 1st April, 2021
Across multiple machines time of day clocks are synchronized using protocols such as Network Time Protocol (NTP)
Time of day clocks are okay to be used for marking points in time but they are error prone
But these clocks are bad for measuring duration
Consider following setup for two communicating machines M1 and M2
At 3 PM value of x is 5 at M1
Just after 3 PM, M1 calls M2 to increment x
M2 increment x to 6 but note the time at M2 is still not yet 3 PM. Because clocks are not in sync.
Why time of day clocks are error prone for marking points in time?
If we take a snapshot at 3 PM in above scenario the value of x will differ in snapshots of both machines
Why would clocks at M1 and M2 go out of sync?
NTP used for synching is not perfect. So machines can go out of sync and can disagree for a period for certain time
Why time of day clocks are bad for measuring duration?
These clocks can jump forward or backward in time, this happens due to different physical ways of measuring time on machines
A good case study is Cloudflare's leap year bug [Link]
What are monotonic clocks?
These clocks only goes forward
They are good for measuring duration (e.g. timeout) and intervals but not good for marking point in time
Why monotonic clocks are bad for marking point in time?
The implementation of monotonic clock is machine or programming language dependent
For example, the starting point of such clocks could be when you last restarted your machine
Following sample Python Code can be used for verifying this behaviour across machines
import time
time.monotonic()
# Output: 11.076321139
# For me starting point of clock was since I started my Python REPL
Summary
What are logical clocks?
These clocks only measure the order of events
Consider two events A & B, notation A → B means event A happened before event B
What can be inferred about causality from A → B?
A could have caused B
B could not have caused A
Where can ordering of events be helpful?
Debugging to see what lead to a particular bug or state
Designing application where order of events is inherent property of solution
Lamport Diagrams (Spacetime Diagram)
What are Lamport Diagrams and why?
Standard notation of discussing time based events across multiple processes or machines
A process is a line with discrete beginning and goes on indefinitely
Event happened on this process are denoted by dots
In below diagram we can infer X → Z, Y → Z, X → Y
What are the types of events?
When machines communicate they send messages to each other, these messages are also considered as events (important ones!). These are send and receive events. An event can't be both send and receive.
An event which isn't send or receive event is called internal event.
How to define A → B (A happened before B)?
If either of following (2-4) is are true for event A and B then we can say A → B (A happened before B)
A and B occur on the same process with A before B
A is a send event and B is the corresponding receive event
A → C and C → B - Transitivity obtained by combining above two parts of definitions
What are concurrent events?
If the relationship A → B can't be established between two events A and B then they are called concurrent event
In above diagram P & X are concurrent event. Note you might see P happening before X in diagram but the causality can't be implied as there is no shared notion of time (such as time of day or monotonic clock) among M1 and M2.
Similarly, Z and U are concurrent event, denoted Z /→ U
What is causal anomaly?
It can be explained with an example below where Carol is confused on whether Alice's comment or Bob's comment is responsible for other.
Carol received events in some order but she can't figure out causality between the two events
Network Models
What are the types of network models?
Synchronous
Partially synchronous
Asynchronous
What is synchronous network model?
A network model where there exists an N such that no message takes longer than N units of time to be delivered
This would have made solving many distributed systems related problem easy but this isn't how real world works
What is partially synchronous network model?
A network model where there exists an N such that no message takes longer than N units of time to be delivered but it is not known. It needs to be estimated/discovered.
What is asynchronous network model?
A network model where there exists no N such that no message takes longer than N units of time to be delivered
This model is hard but more practical and if you are designing a protocol it makes sense to design for async network model by considering it as worst case scenario
This is network model considered for future discussions
State and Events
What is a state of machine?
Events represented as dot on the Lamport diagram like send, receive etc.
A state can be defined as contents of a machine (in memory, register etc.)
If a machine is storing value of x then x = 5 can be considered as its state
How is state represented on Lamport's diagram?
We annotate a particular event with state of that system to denote on diagram
This is hacky solution but logically it makes sense
How is representing state as annotation right?
A state or contents of machine can be completely determined by all the events happened till given point
Replaying these event till a certain point will result in state of machine at that point
Can we rebuild the sequence of events if only state is provided?
No because there could be many ways to reach that particular state
Can we rebuild the logs if we know there is one and only way to reach a state?
May be but think about events which didn't have any impact on the state
***