# Exam

- distributed system, partial failure
  - a distributed system is composed of multiple processes that seek to achieve some form of cooperation; its components may fail independently (= *partial failure*)
  - a multi-core processor is not a distributed system – the whole processor crashes (we cannot have one core working while another has crashed)
- why do we want/have a distributed system?
  - users are in different places → it is a distributed system by nature
  - better performance, stability (if one server fails, we have another)

## Ordering of Events

- notation (processes, channels, events, states, history)
  - set of processes $P=\set{p_1,\dots,p_n}$
  - they communicate through channels (channel from $p_i$ to $p_j$ … $C_{ij}$)
  - event $\# k$ on process $p_i$ … $e_i^k$
    - an event changes the state of the process
  - states of the process $\sigma_i^0,\sigma_i^1,\dots$
    - event $e_i^1$ performs the transition from $\sigma^0_i$ to $\sigma_i^1$ on process $p_i$
  - local history $h_i=e^1_i e^2_i e^3_i \dots$
  - global history $H=h_1\cup h_2\cup h_3\dots$
- happened-before ($\to$) relation between two events
  - first case: $e_i^k\to e_i^\ell$ iff $k\lt\ell$
  - second case: send($m$) $\to$ recv($m$)
    - we can receive a message $m$ only after it has been sent
  - transitivity: $e\to e'\land e'\to e''\implies e\to e''$
  - antisymmetry: $e\to e'\implies e'\not\to e$
  - it is a partial ordering of events
    - if there is no relation between $e$ and $e'$, we say they are concurrent – $e||e'$
- global state, cut
  - global state $(\sigma_1^{k_1},\sigma_2^{k_2},\dots,\sigma_n^{k_n})$ is composed of the local state of every process
  - cut $C$ … subset of the global history $H$ that includes a prefix of each local history
    - it is defined by a tuple $(ct_1,ct_2,\dots,ct_n)$
    - $C\equiv h_1^{ct_1}\cup h_2^{ct_2}\cup\dots\cup h_n^{ct_n}$
    - cut $C$ defines the global state $(\sigma_1^{ct_1},\sigma_2^{ct_2},\dots,\sigma_n^{ct_n})$
  - consistent cut … a cut whose global state could have existed
    - cut $C$ is consistent iff $e'\in C\land e\to e'\implies e\in C$
    - a *consistent global state* is a global state defined by a consistent cut
- Chandy-Lamport snapshot algorithm
  - we assume FIFO channels
  - to take a snapshot, we need an initiator process which is the first to save its state and broadcast the message SNAPSHOT
  - when process $p_i$ receives the SNAPSHOT message, it saves its state and also broadcasts SNAPSHOT (with no other events in between)
  - the state of channel $c_{ji}$ corresponds to the messages that process $p_i$ received from $p_j$ between broadcasting SNAPSHOT and receiving SNAPSHOT on $c_{ji}$
    - messages are added into the channel state one by one
  - after $p_i$ receives SNAPSHOT from all processes, it knows that the computation of the snapshot has terminated

## Time

- asynchronous system
  - no bound on message transmission delays
  - no bound on relative speed of processes
- timestamp function for an asynchronous system
  - how to put timestamps on events so that we have guarantees that one event happened before another?
  - we would like a timestamp function $TS$ such that $e\to e'\iff TS(e)\lt TS(e')$
- Lamport clock $LC$
  - algorithm
    - initial state … $LC_i\leftarrow 0$
    - for any internal event … $LC_i\leftarrow LC_i+1$
    - on $\mathrm{send}(m)$ … $LC_i\leftarrow LC_i+1$
      - we also attach $LC_i$ to $m$, so that $ts(m)=LC_i(\mathrm{send}(m))$
    - when $p_j$ receives $m$ … $LC_j\leftarrow\max(LC_j,ts(m))+1$
  - what can we say about LC?
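The Lamport clock update rules described above can be sketched as a small Python class (a minimal sketch; the class and method names are my own, not from the notes):

```python
class LamportClock:
    """Minimal Lamport clock sketch; one instance per process."""

    def __init__(self):
        self.time = 0  # initial state: LC_i <- 0

    def internal_event(self):
        self.time += 1  # LC_i <- LC_i + 1
        return self.time

    def send(self):
        self.time += 1         # sending is an event too
        return self.time       # ts(m): timestamp attached to the message

    def recv(self, ts_m):
        # LC_j <- max(LC_j, ts(m)) + 1
        self.time = max(self.time, ts_m) + 1
        return self.time


# two processes exchanging one message
p, q = LamportClock(), LamportClock()
p.internal_event()     # p: LC = 1
m_ts = p.send()        # p: LC = 2, the message carries ts = 2
q_time = q.recv(m_ts)  # q: LC = max(0, 2) + 1 = 3
print(q_time)          # prints 3
```

The receive rule is what makes $\mathrm{send}(m)\to\mathrm{recv}(m)$ show up as $LC(\mathrm{send})\lt LC(\mathrm{recv})$.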
    - $e\to e'\implies LC(e)\lt LC(e')$
      - we don't have $\impliedby$
    - $LC_i(e)$ corresponds to the length of the longest chain of events (the longest causal chain) leading to $e$
- vector clocks
  - definition
    - for $i=j$: $VC(e_i)[j]=$ number of events on $p_i$ up to and including $e_i$
    - for $i\neq j$: $VC(e_i)[j]=$ number of events on $p_j$ that happened-before $e_i$
  - algorithm
    - if $e_i$ is internal or $\mathrm{send}(m)$
      - $VC(e_i)\leftarrow VC_i$
      - $VC(e_i)[i]\leftarrow VC_i[i]+1$
    - if $e_i$ is $\mathrm{recv}(m)$
      - $VC(e_i)\leftarrow\max(VC_i,ts(m))$
      - $VC(e_i)[i]\leftarrow VC(e_i)[i]+1$
  - how we compare vectors
    - $u\lt v\iff(\forall i)(u_i\leq v_i)\land(\exists j)(u_j\lt v_j)$
  - it holds that $e\to e'\iff VC(e)\lt VC(e')$

## Abstractions, Failure Detectors

- abstractions
  - we want something simple but useful
  - we use a stack of components
  - every component has an interface
- fault models for processes
  - **crash-stop**
    - the process crashes and never comes back
    - the most typical model
  - crash-recovery
    - the process crashes and later resumes from the point where it crashed
      - but this rarely happens – it is more usual to restart the process, i.e., reinitialize it (so it's a new process)
    - use case: a process that loses its connection frequently
  - byzantine
    - the process doesn't behave according to its specification
    - might happen for various reasons – corrupted memory, an attack, …
    - possible strategies
      - we need to guarantee that the process can be trusted
      - we assume that there is a limited number of byzantine processes and use majority vote
- fault models for channels
  - integrity
    - "channels don't create messages & channels don't modify/corrupt messages"
    - a link from $p$ to $q$ satisfies integrity if $q$ receives a message $m$ at most once, and only if it was previously sent by $p$
    - note: $m$ can be lost
  - fair link – satisfies integrity & if $p$ sends $m$ infinitely often, then $q$ receives $m$ infinitely often (the channel can lose infinitely many messages)
  - reliable link – satisfies integrity & if $p$ sends $m$ and $q$ is correct (does not crash), then $q$ receives $m$
    - → the channel does not lose messages
    - problem: $p$ could crash right after sending $m$
  - quasi-reliable link – satisfies integrity & if $p$ sends $m$ and $p,q$ are correct, then $q$ receives $m$
    - if $m$ gets lost, $p$ can "send it again"
  - why do we define the reliable link?
    - it can help us determine whether a problem is solvable or not
    - if we cannot solve the problem using reliable links, there is no way to solve it using weaker fault models
  - in reality we only have fair links
    - on top of them we can implement stubborn links, and then quasi-reliable links
    - then, we want to add a FIFO property on top of that
- implementation of quasi-reliable links
  - stubborn link – if a correct process $p$ sends $m$ once, then a correct $q$ receives it infinitely many times
    - does not satisfy integrity (it duplicates messages)
    - implementation: we re-send all previously sent messages every $\Delta$ time units and deliver every received message
  - receive vs. deliver
    - in our stack of abstractions, there are several layers: network, fair links, stubborn links, quasi-reliable links, process
    - a layer receives a message and then decides to deliver it (to the layer above)
  - quasi-reliable link
    - we keep a set of delivered messages (and deliver only messages not yet in the set)
    - but the set is always growing
      - we could use ACKs and remove a message from the set after $p$ stops sending it
      - but how to correctly detect that $p$ stopped sending?
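The delivered-set idea can be sketched as follows (a naive sketch that ignores the garbage-collection question just raised; the names are my own):

```python
class QuasiReliableLink:
    """Naive quasi-reliable point-to-point link sketch.

    Sits on top of a stubborn link, which may deliver each message
    many times; integrity is restored by remembering which messages
    were already delivered (the ever-growing set discussed above).
    """

    def __init__(self):
        self.delivered = set()  # ids of messages delivered so far

    def on_stubborn_deliver(self, msg_id, payload):
        # deliver each message to the layer above at most once
        if msg_id in self.delivered:
            return None          # duplicate from the stubborn layer: drop
        self.delivered.add(msg_id)
        return payload           # handed to the layer above


link = QuasiReliableLink()
out = []
# the stubborn link re-delivers every message forever; simulate a few rounds
for _ in range(3):
    for msg_id, payload in [(1, "a"), (2, "b")]:
        p = link.on_stubborn_deliver(msg_id, payload)
        if p is not None:
            out.append(p)
print(out)  # ['a', 'b'] – each message delivered once despite re-sends
```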
      - problematic scenario
        - $p$ is sending $m$ repeatedly, $q$ sends an ACK
        - after some time, $q$ removes $m$ from the delivered set
        - another instance of $m$ arrives at $q$
        - $q$ recognizes $m$ as a new message
      - in practice, we send a sequence number to let the other side know how many messages we have already received
        - it can be bundled with messages
        - if there are no new messages in the channel, we may want to send the acknowledgement separately (with its own sequence number)
  - FIFO quasi-reliable link
    - properties
      - satisfies integrity
      - if $p$ sends $m$ and $p,q$ are correct, then $q$ receives $m$
      - if $p$ sends $m'$ after $m$ and $p,q$ are correct, then $q$ receives $m'$ after $m$
    - implementation proposal
      - sender $p$ assigns a timestamp to every message; the messages are then ordered by the timestamp
- synchronous system
  - bound on message delay … $\Delta$
    - the maximum time required to deliver a message
  - bound on relative process speed … $\beta$
    - the fastest process needs $x$ time to do something $\implies$ the slowest process needs $x\beta$ time to do it
  - $p_1$ asks $p_2$: "are you alive?"
    - in an asynchronous system, there is no way to tell if the other process is alive
    - in a synchronous system, $p_1$ can be sure that the response has to arrive within $2\Delta+x\beta$
- failure detector
  - a magic box which tells the process which other processes are alive
  - two properties: completeness, accuracy
  - by default, a failure detector
    - can make mistakes
    - can change its mind
    - different FDs can have different opinions
  - completeness
    - strong – eventually every crashed process is suspected by every correct process
    - weak – eventually every crashed process is suspected by at least one correct process (but this is too weak to be used in practice)
  - accuracy
    - strong – no process is suspected before it crashes
    - weak – some correct processes are never suspected
    - eventually strong – there is a time after which we get the strong accuracy
    - eventually weak – there is a time after which we get the weak accuracy (some correct processes are no longer suspected)
  - types of detectors
    - perfect failure detector $(P)$ – strong completeness + strong accuracy
      - if we have a perfect failure detector, the system is almost synchronous
    - eventually perfect failure detector $(\Diamond P)$ – strong completeness + eventually strong accuracy
    - strong failure detector $(S)$ – strong completeness + weak accuracy
    - eventually strong failure detector $(\Diamond S)$ – strong completeness + eventually weak accuracy
    - note: when solving consensus with a majority of correct processes, the eventually weak failure detector ($\Diamond W$, weak completeness and eventually weak accuracy) may be enough

## Reliable Broadcast

- properties and assumptions of reliable broadcast
  - properties
    - safety – nothing bad will ever happen
    - liveness – something good will eventually happen
  - we assume the crash-stop failure model for processes and quasi-reliable channels
- best-effort broadcast
  - integrity – each process delivers message $m$ at most once and only if it was broadcast by some process
    - safety property
  - validity – if a correct process broadcasts a message $m$, then every correct process eventually delivers $m$
    - liveness property
  - implementation is straightforward using quasi-reliable channels
    - broadcast … send to all processes in the group $\Pi$ (the group of processes into which message $m$ is broadcast)
    - deliver … simply deliver the message
  - performance metrics
    - number of communication steps required to terminate one operation – one
    - number of messages exchanged during one operation – $O(N)$, where $N$ is the number of processes
  - problem: if the sender crashes, some processes might not get the message
- (regular) reliable broadcast
  - properties: integrity & validity & agreement
    - agreement – if a message $m$ is delivered by some correct process, then $m$ is eventually delivered by every correct process
      - liveness property
  - implementation uses best-effort broadcast and a perfect failure detector
    - two rules
      - if a process delivers $m$ whose sender has crashed, it broadcasts $m$ again
      - when $p$ becomes aware that $q$ has crashed, $p$ broadcasts all $q$'s messages that $p$ has already delivered
  - performance
    - best case: one communication step, $O(N)$ messages
    - worst case: $N$ communication steps, $O(N^2)$ messages
  - exercise
    - if we used a failure detector satisfying only weak accuracy, the performance would be worse
    - with only weak completeness, agreement is not ensured
      - the correct process that delivered $m$ might not know that the sender has crashed
  - if we don't want to rely on the failure detector, we can just broadcast every received (broadcast) message
  - problematic scenario (motivating uniform reliable broadcast)
    - $m$ is delivered by a process which then crashes (and the sender crashes too)
    - other correct processes don't get $m$
- uniform reliable broadcast
  - properties: integrity & validity & uniform agreement
    - uniform agreement – if a message $m$ is delivered by some process (whether correct or not), then $m$ is eventually delivered by every correct process
  - the proposed solution is called *All-ack Uniform Reliable Broadcast*
    - idea – a process can deliver a message only when it has received a copy of that message from all correct processes
    - so every process broadcasts $m$ after receiving it for the first time
    - each process also maintains its list of correct processes
    - $p$ checks whether $m$ can be delivered every time 1) $p$ receives $m$, 2) a process crashes
  - performance
    - best case: two communication steps
      1. the sender broadcasts … $N$ messages
      2. everyone else broadcasts … $(N-1)\times N$ messages
    - worst case: $N+1$ steps required to terminate (if processes crash in sequence), $N^2$ messages sent
- FIFO reliable broadcast
  - properties: integrity & validity & agreement & FIFO delivery
    - FIFO delivery – if some process broadcasts message $m_1$ before it broadcasts message $m_2$, then no correct process delivers $m_2$ unless it has already delivered $m_1$
  - solution – piggybacking sequence numbers
    - so we just sort the messages (delaying some of them)
- how to make the broadcast faster?
  - we could make the broadcast circular
  - we could build a binary tree
    - better latency
    - but what if a process crashes?
  - *gossip*
    - $k$ … number of processes to contact
    - $r$ … number of rounds to execute
    - for $k=3$, the original process randomly selects 3 processes to send the message to
      - then each of these processes sends the message to 3 randomly selected processes
      - after $r$ rounds, the sending stops
    - a probabilistic broadcast algorithm – we cannot guarantee that every correct process gets the message
    - efficiency declines over time – we increasingly send to processes that have already got the message
    - push strategy – processes that have the information send messages to processes that don't
    - pull strategy – "hey, is there anything I missed?"
      - we can use a vector clock for this

## Consensus

- consensus applications
  - clients & servers
    - all servers should behave the same
    - the servers should agree on the order of processing clients' requests → consensus problem
  - Kafka (who's the leader?)
  - blockchain
- definition of the consensus problem, valence
  - primitives
    - $\mathrm{propose}(v_i)$
      - $v_i$ … value proposed by process $i$
    - $\mathrm{decide}(v)$
      - all processes should agree on the same $v$ in the end
  - properties
    - termination – every correct process eventually decides
    - validity – if a process decides $v$, then $v$ is the initial (proposed) value of some process
    - uniform agreement – no two processes decide differently
      - (non-uniform) agreement – no two *correct* processes decide differently – would be too weak
  - valence of a configuration
    - $\mathrm{val}(c)$ … set of possible values that could be decided (initially, the set of proposed values)
    - we want to get from a multivalent configuration to a univalent configuration
      - $v$-valent configuration … $\mathrm{val}(c)=\set{v}$
    - then, we want the processes to detect that such a configuration has been reached
- we consider
  - an asynchronous system
    - no bound on message delays
    - no bound on relative speed of processes
  - quasi-reliable channels, processes may crash
  - if we don't make any additional assumptions, consensus is impossible to solve
- impossibility results
  - theorem: consensus cannot be solved in an asynchronous system with reliable channels if a majority of processes may be faulty
    - proof
      - we partition the processes into two sets $\Pi_0,\Pi_1$, containing $\lceil n/2\rceil$ and $\lfloor n/2\rfloor$ processes respectively
      - processes in $\Pi_0$ always propose 0, processes in $\Pi_1$ always propose 1
      - we consider three runs ($R_0,R_1,R$)
        - in run $R_0$, only the processes in $\Pi_0$ are correct (the others crash at the beginning of the run)
          - by validity, they decide 0 – so some process $q_0\in\Pi_0$ decides 0 at time $t_0$
        - in run $R_1$, only the processes in $\Pi_1$ are correct
          - therefore some $q_1\in\Pi_1$ decides 1 at time $t_1$
        - in run $R$, no process crashes, but the transmission of messages between the two sets is delayed until $t=\max(t_0,t_1)$
          - until time $t$, $R$ is indistinguishable from $R_0$ for the processes in $\Pi_0$
          - similarly, the processes in $\Pi_1$ cannot distinguish $R$ from $R_1$
          - therefore, in $R$ some processes decide 0 and others decide 1, which violates agreement
  - theorem (FLP impossibility result): there exists no deterministic algorithm that solves consensus in an asynchronous system with reliable channels if a single process may crash
    - intuition of the proof
      - there's no way to know whether a non-responsive process $q$ has crashed or is just slow
      - if we wait, we might wait forever
      - if we decide, we might find out later that $q$ took a different decision
    - of course, we can make some additional assumptions (there are algorithms that work if we ignore some problematic situations)
- consensus algorithm in a synchronous system
  - we have a perfect failure detector $P$ and use the best-effort broadcast algorithm
  - assuming no crashes – trivial
    - everyone broadcasts their proposed value
    - after receiving values from everyone, we use a deterministic decision function (like $\mathrm{min}$)
  - flooding consensus
    - several rounds
    - in every round, every process sends the values it has collected so far (to ensure that everyone has the same information)
    - for a round to end, the process needs to get messages from all the correct processes
    - edge case
      - in the first round, $p_1$ manages to send its (unique) value $v_1$ only to $p_2$ and then crashes (in the middle of the round)
      - in the second round, $p_2$ manages to propagate $v_1$ only to $p_3$, then $p_2$ also crashes
      - processes keep crashing one by one
    - after two consecutive rounds with the same set of correct processes (no one crashes between the rounds), we can decide
      - we can be sure that we have all the information
    - this algorithm does not provide uniform agreement
      - there could have been a process $p'$ that managed to propagate a unique value to $p$ but crashed right after $p$ decided (so the other processes don't have the unique value)
      - in other words: after two such rounds, we can be sure that *we* have all the information, but we cannot be sure that *everyone else* has all the information
    - how to get uniform agreement?
      - we can always do $f+1$ rounds, where $f$ is the number of processes that can crash (3 rounds without a crash would probably also work)
  - FloodSet algorithm
    - $W_p:=\set{v_p}$ … everyone starts with their own value
    - in round $r$
      - send $W_p$
      - on recv $W_q$: $W_p:= W_p\cup W_q$
      - if $r=f+1$: decide $\min W_p$
- consensus in a partially synchronous system
  - partially synchronous system
    - $\Delta$ bound on transmission delay and $\beta$ bound on how slow a process can be
    - both hold *eventually* – there is a time $T$ (unknown) after which the bounds $\beta,\Delta$ hold forever
      - $T$ … GST (global stabilization time)
    - channels can lose messages before GST; channels are quasi-reliable after GST
    - GSR (global stabilization round) = the first round from which the system behaves as synchronous
      - $\exists\, GSR\gt 0$ s.t. $\forall r\geq GSR,\forall p,q$ correct: $p$ sends $m$ to $q$ in round $r\implies q$ receives $m$ in round $r$
    - we don't use a failure detector here
  - OneThirdRule algorithm
    - assumption: $f\lt \frac n3$, so $n-f\gt\frac23 n$
    - at the start of each round, process $p$ sends its value $x_p$ to all processes
    - as the process receives messages in the given round
      - if it gets at least $n-f$ messages, it sets $x_p$ to the most frequent value received (taking the smallest if several values have the same greatest frequency)
      - if at least $n-f$ of the values received are equal to a value $v$, it decides $v$
    - we transition between rounds after a predefined time
      - after GSR, we can guarantee that there is enough time to get all the messages
    - proof of uniform agreement
      - let $r_0$ be the smallest round in which some process $p$ decides $v$
        - → $p$ received $v$ at least $n-f$ times → at least $n-f$ processes proposed $v$
        - there cannot exist another $n-f$ processes proposing $v'\neq v$ in the same round
        - so all the other processes deciding in the same round also decide $v$
      - let $r_1\geq r_0$ be the smallest round in which a process $q$ changes its value $x_q$
        - at this point, at most $f$ processes have a value $v'\neq v$ (as $n-f$ processes proposed $v$ in round $r_0$)
        - the new value $x_q$ has to be $v$, as there is no way for a $v'$ to achieve a majority ($f$ cannot be the majority of $n-f$ received messages)
        - so after $r_0$, a majority for $v'$ can never arise
    - termination is easy
      - suppose we are past GSR and all faulty processes have crashed
      - everyone gets the same $n-f$ messages and sets $x_p$ to the most frequent value
      - in the next round, everyone decides this value
    - valence
      - a configuration where $n-f$ correct processes have value $v$ is $v$-valent
      - if in some round $r$ all correct processes receive $v$ from $n-f$ processes, the configuration becomes $v$-valent
      - if the configuration becomes $v$-valent in round $r\geq GSR$, processes will decide in round $r+1$
  - (broken) consensus algorithms that satisfy only some of the criteria
    - only validity & uniform agreement – we don't need to do anything (nobody ever decides)
    - only uniform agreement & termination – everyone decides "1"
    - only validity & termination – everyone decides their own proposed value
  - bonus: LastVoting algorithm (for a partially synchronous system)
    - it's possible to increase $f$ from $f\lt n/3$ to $f\lt n/2$ by considering a non-symmetric algorithm
    - round-based LastVoting algorithm (similar to the Paxos consensus algorithm)
      - assumes $f\lt n/2$, so $n-f\gt \frac12 n$
      - rounds are grouped into phases – one phase consists of three rounds
      - in each phase, one of the processes serves as a coordinator
        - the coordinator role rotates ($p_1$ is the coordinator in the first phase, then $p_2,p_3,\dots,p_n$, then $p_1$ again, …)
      - every process $p$ stores its proposed value $x_p$ and a timestamp $ts_p$ (the number of the most recent phase in which $x_p$ was updated)
      - first round
        - everyone sends $(x_p,ts_p)$ to the coordinator
        - if the coordinator gets $\geq n-f$ messages, it selects an $x$ whose $ts$ is the largest
      - second round
        - if the coordinator managed to select an $x$ in the first round, it now broadcasts this $x$
        - as the processes receive this value, they update their $x_p$ and $ts_p$
      - third round
        - every process which updated its $ts_p$ in the second round now broadcasts $(\mathrm{ack}, x_p)$
        - if a process receives at least $n-f$ messages $(\mathrm{ack}, v)$, it decides $v$
      - valence
        - the configuration becomes $v$-valent when there is a set $Q$ of processes s.t. $|Q|\geq n-f$, every $q\in Q$ has value $v$, and no process outside $Q$ has a larger timestamp
        - in the third round, the $v$-valent configuration is identified
        - note that the second round cannot disrupt the $v$-valence (thanks to the condition in the first round)
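The FloodSet algorithm described above can be sketched as a tiny simulation (a minimal sketch of a crash-free run; function and variable names are my own):

```python
def floodset(proposals, f):
    """FloodSet sketch for a synchronous system; `f` is the maximum
    number of crashes tolerated. This run simulates no actual crashes:
    f+1 rounds of everyone sending its set W_p to everyone."""
    n = len(proposals)
    W = [{v} for v in proposals]      # W_p := {v_p}
    for _ in range(f + 1):            # rounds 1 .. f+1
        sent = [set(w) for w in W]    # everyone sends its current W_p
        for p in range(n):
            for q in range(n):
                W[p] |= sent[q]       # on recv W_q: W_p := W_p ∪ W_q
    return [min(w) for w in W]        # after round f+1: decide min W_p


decisions = floodset([3, 1, 2], f=1)
print(decisions)  # [1, 1, 1] – everyone decides the minimum proposal
```

With crashes, a process would simply miss some of the `sent[q]` sets; the $f+1$ rounds guarantee at least one crash-free round, after which all sets are equal.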
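Similarly, the OneThirdRule round structure can be sketched as a simulation of a run in which every round already behaves synchronously, i.e. as if GSR had passed and no messages are lost (a sketch under those assumptions; the names are my own):

```python
from collections import Counter


def one_third_rule(values, f, max_rounds=10):
    """OneThirdRule sketch, simulated with no crashes or losses,
    so every process receives all n messages in every round."""
    n = len(values)
    assert f < n / 3                  # the algorithm's assumption
    x = list(values)
    decided = [None] * n
    for _ in range(max_rounds):
        received = list(x)            # everyone receives everyone's x_p
        for p in range(n):
            if len(received) >= n - f:
                counts = Counter(received)
                top = max(counts.values())
                # most frequent value received, smallest on a tie
                x[p] = min(v for v, c in counts.items() if c == top)
                # decide if at least n-f received values equal some v
                for v, c in counts.items():
                    if c >= n - f and decided[p] is None:
                        decided[p] = v
        if all(d is not None for d in decided):
            break
    return decided


# n=4, f=1: round 1 makes everyone adopt 1, round 2 decides it
print(one_third_rule([1, 1, 2, 3], f=1))  # [1, 1, 1, 1]
```

This matches the termination argument in the notes: one synchronous round makes all values equal, and the next round produces $n-f$ identical values, so everyone decides.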