# Exam

- distributed system, partial failure
  - a distributed system is composed of multiple processes that seek to achieve some form of cooperation; its components may fail independently (= *partial failure*)
  - a multi-core processor is not a distributed system – the whole processor crashes (we cannot have one core working while another has crashed)
- why do we want/have a distributed system?
  - users are in different places → it is a distributed system by nature
  - better performance, stability (if one server fails, we have another)

## Ordering of Events

- notation (processes, channels, events, states, history)
  - set of processes $P=\set{p_1,\dots,p_n}$
  - they communicate through channels (channel from $p_i$ to $p_j$ … $C_{ij}$)
  - event $\# k$ on process $p_i$ … $e_i^k$
    - an event changes the state of the process
  - states of the process $\sigma_i^0,\sigma_i^1,\dots$
    - event $e_i^1$ performs the transition from $\sigma^0_i$ to $\sigma_i^1$ on process $p_i$
  - local history $h_i=e^1_i e^2_i e^3_i \dots$
  - global history $H=h_1\cup h_2\cup h_3\dots$
- happened-before ($\to$) relation between two events
  - first case: $e_i^k\to e_i^\ell$ iff $k\lt\ell$
  - second case: send($m$) $\to$ recv($m$)
    - we can receive a message $m$ only after it has been sent
  - transitivity: $e\to e'\land e'\to e''\implies e\to e''$
  - antisymmetry: $e\to e'\implies e'\not\to e$
  - it is a partial ordering of events
    - if there is no relation between $e$ and $e'$, we say they are concurrent – $e||e'$
- global state, cut
  - global state $(\sigma_1^{k_1},\sigma_2^{k_2},\dots,\sigma_n^{k_n})$ is composed of the local state of every process
  - cut $C$ … subset of the global history $H$ that includes a prefix of each local history
    - it is defined by a tuple $(ct_1,ct_2,\dots,ct_n)$
    - $C\equiv h_1^{ct_1}\cup h_2^{ct_2}\cup\dots\cup h_n^{ct_n}$
    - cut $C$ defines the global state $(\sigma_1^{ct_1},\sigma_2^{ct_2},\dots,\sigma_n^{ct_n})$
  - consistent cut … a cut whose global state could have existed
    - cut $C$ is consistent iff $e'\in C\land e\to e'\implies e\in C$
    - a *consistent global state* is a global state defined by a consistent cut
- Chandy-Lamport snapshot algorithm
  - we assume FIFO channels
  - to take a snapshot, we need an initiator process which is the first to save its state and broadcast the message SNAPSHOT
  - when process $p_i$ receives the SNAPSHOT message, it saves its state and also broadcasts SNAPSHOT (with no other events in between)
  - the state of channel $c_{ji}$ corresponds to the messages that process $p_i$ received from $p_j$ between broadcasting SNAPSHOT and receiving SNAPSHOT on $c_{ji}$
    - messages are added into the channel state one by one
  - after $p_i$ receives SNAPSHOT from all processes, it knows that the computation of the snapshot has terminated

## Time

- asynchronous system
  - no bound on message transmission delays
  - no bound on relative speed of processes
- timestamp function for an asynchronous system
  - how to put timestamps on events so that we have guarantees that one event happened before another?
  - we would like a timestamp function $TS$ such that $e\to e'\iff TS(e)\lt TS(e')$
- Lamport clock $LC$
  - algorithm
    - initial state … $LC_i\leftarrow 0$
    - for any internal event … $LC_i\leftarrow LC_i+1$
    - on $\mathrm{send}(m)$ … $LC_i\leftarrow LC_i+1$
      - we also attach $LC_i$ to $m$, so that $ts(m)=LC_i(\mathrm{send}(m))$
    - when $p_j$ receives $m$ … $LC_j\leftarrow\max(LC_j,ts(m))+1$
  - what can we say about LC?
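The Lamport clock update rules described above can be sketched as a small Python class (a minimal sketch; the class and method names are my own, not from the notes):

```python
class LamportClock:
    """Minimal Lamport clock sketch; one instance per process."""

    def __init__(self):
        self.time = 0  # initial state: LC_i <- 0

    def internal_event(self):
        self.time += 1  # LC_i <- LC_i + 1
        return self.time

    def send(self):
        self.time += 1         # sending is an event too
        return self.time       # ts(m): timestamp attached to the message

    def recv(self, ts_m):
        # LC_j <- max(LC_j, ts(m)) + 1
        self.time = max(self.time, ts_m) + 1
        return self.time


# two processes exchanging one message
p, q = LamportClock(), LamportClock()
p.internal_event()     # p: LC = 1
m_ts = p.send()        # p: LC = 2, the message carries ts = 2
q_time = q.recv(m_ts)  # q: LC = max(0, 2) + 1 = 3
print(q_time)          # prints 3
```

The receive rule is what makes $\mathrm{send}(m)\to\mathrm{recv}(m)$ show up as $LC(\mathrm{send})\lt LC(\mathrm{recv})$.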
    - $e\to e'\implies LC(e)\lt LC(e')$
      - we don't have $\impliedby$
    - $LC_i(e)$ corresponds to the length of the longest chain of events (the longest causal chain) leading to $e$
- vector clocks
  - definition
    - for $i=j$: $VC(e_i)[j]=$ number of events on $p_i$ up to and including $e_i$
    - for $i\neq j$: $VC(e_i)[j]=$ number of events on $p_j$ that happened-before $e_i$
  - algorithm
    - if $e_i$ is internal or $\mathrm{send}(m)$
      - $VC(e_i)\leftarrow VC_i$
      - $VC(e_i)[i]\leftarrow VC_i[i]+1$
    - if $e_i$ is $\mathrm{recv}(m)$
      - $VC(e_i)\leftarrow\max(VC_i,ts(m))$
      - $VC(e_i)[i]\leftarrow VC(e_i)[i]+1$
  - how we compare vectors
    - $u\lt v\iff(\forall i)(u_i\leq v_i)\land(\exists j)(u_j\lt v_j)$
  - it holds that $e\to e'\iff VC(e)\lt VC(e')$

## Abstractions, Failure Detectors

- abstractions
  - we want something simple but useful
  - we use a stack of components
  - every component has an interface
- fault models for processes
  - **crash-stop**
    - the process crashes and never comes back
    - the most typical model
  - crash-recovery
    - the process crashes and later resumes from the point where it crashed
      - but this rarely happens – it is more usual to restart the process, i.e., reinitialize it (so it's a new process)
    - use case: a process that loses its connection frequently
  - byzantine
    - the process doesn't behave according to its specification
    - might happen for various reasons – corrupted memory, an attack, …
    - possible strategies
      - we need to guarantee that the process can be trusted
      - we assume that there is a limited number of byzantine processes and use majority vote
- fault models for channels
  - integrity
    - "channels don't create messages & channels don't modify/corrupt messages"
    - a link from $p$ to $q$ satisfies integrity if $q$ receives a message $m$ at most once, and only if it was previously sent by $p$
    - note: $m$ can be lost
  - fair link – satisfies integrity & if $p$ sends $m$ infinitely often, then $q$ receives $m$ infinitely often (the channel can lose infinitely many messages)
  - reliable link – satisfies integrity & if $p$ sends $m$ and $q$ is correct (does not crash), then $q$ receives $m$
    - → the channel does not lose messages
    - problem: $p$ could crash right after sending $m$
  - quasi-reliable link – satisfies integrity & if $p$ sends $m$ and $p,q$ are correct, then $q$ receives $m$
    - if $m$ gets lost, $p$ can "send it again"
  - why do we define the reliable link?
    - it can help us determine whether a problem is solvable or not
    - if we cannot solve the problem using reliable links, there is no way to solve it using weaker fault models
  - in reality we only have fair links
    - on top of them we can implement stubborn links, and then quasi-reliable links
    - then, we want to add a FIFO property on top of that
- implementation of quasi-reliable links
  - stubborn link – if a correct process $p$ sends $m$ once, then a correct $q$ receives it infinitely many times
    - does not satisfy integrity (it duplicates messages)
    - implementation: we re-send all previously sent messages every $\Delta$ time units and deliver every received message
  - receive vs. deliver
    - in our stack of abstractions, there are several layers: network, fair links, stubborn links, quasi-reliable links, process
    - a layer receives a message and then decides to deliver it (to the layer above)
  - quasi-reliable link
    - we keep a set of delivered messages (and deliver only messages not yet in the set)
    - but the set is always growing
      - we could use ACKs and remove a message from the set after $p$ stops sending it
      - but how to correctly detect that $p$ stopped sending?
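The delivered-set idea can be sketched as follows (a naive sketch that ignores the garbage-collection question just raised; the names are my own):

```python
class QuasiReliableLink:
    """Naive quasi-reliable point-to-point link sketch.

    Sits on top of a stubborn link, which may deliver each message
    many times; integrity is restored by remembering which messages
    were already delivered (the ever-growing set discussed above).
    """

    def __init__(self):
        self.delivered = set()  # ids of messages delivered so far

    def on_stubborn_deliver(self, msg_id, payload):
        # deliver each message to the layer above at most once
        if msg_id in self.delivered:
            return None          # duplicate from the stubborn layer: drop
        self.delivered.add(msg_id)
        return payload           # handed to the layer above


link = QuasiReliableLink()
out = []
# the stubborn link re-delivers every message forever; simulate a few rounds
for _ in range(3):
    for msg_id, payload in [(1, "a"), (2, "b")]:
        p = link.on_stubborn_deliver(msg_id, payload)
        if p is not None:
            out.append(p)
print(out)  # ['a', 'b'] – each message delivered once despite re-sends
```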
      - problematic scenario
        - $p$ is sending $m$ repeatedly, $q$ sends an ACK
        - after some time, $q$ removes $m$ from the delivered set
        - another instance of $m$ arrives at $q$
        - $q$ recognizes $m$ as a new message
      - in practice, we send a sequence number to let the other side know how many messages we have already received
        - it can be bundled with messages
        - if there are no new messages in the channel, we may want to send the acknowledgement separately (with its own sequence number)
  - FIFO quasi-reliable link
    - properties
      - satisfies integrity
      - if $p$ sends $m$ and $p,q$ are correct, then $q$ receives $m$
      - if $p$ sends $m'$ after $m$ and $p,q$ are correct, then $q$ receives $m'$ after $m$
    - implementation proposal
      - sender $p$ assigns a timestamp to every message; the messages are then ordered by the timestamp
- synchronous system
  - bound on message delay … $\Delta$
    - the maximum time required to deliver a message
  - bound on relative process speed … $\beta$
    - the fastest process needs $x$ time to do something $\implies$ the slowest process needs $x\beta$ time to do it
  - $p_1$ asks $p_2$: "are you alive?"
    - in an asynchronous system, there is no way to tell if the other process is alive
    - in a synchronous system, $p_1$ can be sure that the response has to arrive within $2\Delta+x\beta$
- failure detector
  - a magic box which tells the process which other processes are alive
  - two properties: completeness, accuracy
  - by default, a failure detector
    - can make mistakes
    - can change its mind
    - different FDs can have different opinions
  - completeness
    - strong – eventually every crashed process is suspected by every correct process
    - weak – eventually every crashed process is suspected by at least one correct process (but this is too weak to be used in practice)
  - accuracy
    - strong – no process is suspected before it crashes
    - weak – some correct processes are never suspected
    - eventually strong – there is a time after which we get the strong accuracy
    - eventually weak – there is a time after which we get the weak accuracy (some correct processes are no longer suspected)
  - types of detectors
    - perfect failure detector $(P)$ – strong completeness + strong accuracy
      - if we have a perfect failure detector, the system is almost synchronous
    - eventually perfect failure detector $(\Diamond P)$ – strong completeness + eventually strong accuracy
    - strong failure detector $(S)$ – strong completeness + weak accuracy
    - eventually strong failure detector $(\Diamond S)$ – strong completeness + eventually weak accuracy
    - note: when solving consensus with a majority of correct processes, the eventually weak failure detector ($\Diamond W$, weak completeness and eventually weak accuracy) may be enough

## Reliable Broadcast

- properties and assumptions of reliable broadcast
  - properties
    - safety – nothing bad will ever happen
    - liveness – something good will eventually happen
  - we assume the crash-stop failure model for processes and quasi-reliable channels
- best-effort broadcast
  - integrity – each process delivers message $m$ at most once and only if it was broadcast by some process
    - safety property
  - validity – if a correct process broadcasts a message $m$, then every correct process eventually delivers $m$
    - liveness property
  - implementation is straightforward using quasi-reliable channels
    - broadcast … send to all processes in the group $\Pi$ (the group of processes into which message $m$ is broadcast)
    - deliver … simply deliver the message
  - performance metrics
    - number of communication steps required to terminate one operation – one
    - number of messages exchanged during one operation – $O(N)$, where $N$ is the number of processes
  - problem: if the sender crashes, some processes might not get the message
- (regular) reliable broadcast
  - properties: integrity & validity & agreement
    - agreement – if a message $m$ is delivered by some correct process, then $m$ is eventually delivered by every correct process
      - liveness property
  - implementation uses best-effort broadcast and a perfect failure detector
    - two rules
      - if a process delivers $m$ whose sender has crashed, it broadcasts $m$ again
      - when $p$ becomes aware that $q$ has crashed, $p$ broadcasts all $q$'s messages that $p$ has already delivered
  - performance
    - best case: one communication step, $O(N)$ messages
    - worst case: $N$ communication steps, $O(N^2)$ messages
  - exercise
    - if we used a failure detector satisfying only weak accuracy, the performance would be worse
    - with only weak completeness, agreement is not ensured
      - the correct process that delivered $m$ might not know that the sender has crashed
  - if we don't want to rely on the failure detector, we can just broadcast every received (broadcast) message
  - problematic scenario (motivating uniform reliable broadcast)
    - $m$ is delivered by a process which then crashes (and the sender crashes too)
    - other correct processes don't get $m$
- uniform reliable broadcast
  - properties: integrity & validity & uniform agreement
    - uniform agreement – if a message $m$ is delivered by some process (whether correct or not), then $m$ is eventually delivered by every correct process
  - the proposed solution is called *All-ack Uniform Reliable Broadcast*
    - idea – a process can deliver a message only when it has received a copy of that message from all correct processes
    - so every process broadcasts $m$ after receiving it for the first time
    - each process also maintains its list of correct processes
    - $p$ checks whether $m$ can be delivered every time 1) $p$ receives $m$, 2) a process crashes
  - performance
    - best case: two communication steps
      1. the sender broadcasts … $N$ messages
      2. everyone else broadcasts … $(N-1)\times N$ messages
    - worst case: $N+1$ steps required to terminate (if processes crash in sequence), $N^2$ messages sent
- FIFO reliable broadcast
  - properties: integrity & validity & agreement & FIFO delivery
    - FIFO delivery – if some process broadcasts message $m_1$ before it broadcasts message $m_2$, then no correct process delivers $m_2$ unless it has already delivered $m_1$
  - solution – piggybacking sequence numbers
    - so we just sort the messages (delaying some of them)
- how to make the broadcast faster?
  - we could make the broadcast circular
  - we could build a binary tree
    - better latency
    - but what if a process crashes?
  - *gossip*
    - $k$ … number of processes to contact
    - $r$ … number of rounds to execute
    - for $k=3$, the original process randomly selects 3 processes to send the message to
      - then each of these processes sends the message to 3 randomly selected processes
      - after $r$ rounds, the sending stops
    - a probabilistic broadcast algorithm – we cannot guarantee that every correct process gets the message
    - efficiency declines over time – we increasingly send to processes that have already got the message
    - push strategy – processes that have the information send messages to processes that don't
    - pull strategy – "hey, is there anything I missed?"
      - we can use a vector clock for this

## Consensus

- consensus applications
  - clients & servers
    - all servers should behave the same
    - the servers should agree on the order of processing clients' requests → consensus problem
  - Kafka (who's the leader?)
  - blockchain
- definition of the consensus problem, valence
  - primitives
    - $\mathrm{propose}(v_i)$
      - $v_i$ … value proposed by process $i$
    - $\mathrm{decide}(v)$
      - all processes should agree on the same $v$ in the end
  - properties
    - termination – every correct process eventually decides
    - validity – if a process decides $v$, then $v$ is the initial (proposed) value of some process
    - uniform agreement – no two processes decide differently
      - (non-uniform) agreement – no two *correct* processes decide differently – would be too weak
  - valence of a configuration
    - $\mathrm{val}(c)$ … set of possible values that could be decided (initially, the set of proposed values)
    - we want to get from a multivalent configuration to a univalent configuration
      - $v$-valent configuration … $\mathrm{val}(c)=\set{v}$
    - then, we want the processes to detect that such a configuration has been reached
- we consider
  - an asynchronous system
    - no bound on message delays
    - no bound on relative speed of processes
  - quasi-reliable channels, processes may crash
  - if we don't make any additional assumptions, consensus is impossible to solve
- impossibility results
  - theorem: consensus cannot be solved in an asynchronous system with reliable channels if a majority of processes may be faulty
    - proof
      - we partition the processes into two sets $\Pi_0,\Pi_1$, containing $\lceil n/2\rceil$ and $\lfloor n/2\rfloor$ processes respectively
      - processes in $\Pi_0$ always propose 0, processes in $\Pi_1$ always propose 1
      - we consider three runs ($R_0,R_1,R$)
        - in run $R_0$, only the processes in $\Pi_0$ are correct (the others crash at the beginning of the run)
          - by validity, they decide 0 – so some process $q_0\in\Pi_0$ decides 0 at time $t_0$
        - in run $R_1$, only the processes in $\Pi_1$ are correct
          - therefore some $q_1\in\Pi_1$ decides 1 at time $t_1$
        - in run $R$, no process crashes, but the transmission of messages between the two sets is delayed until $t=\max(t_0,t_1)$
          - until time $t$, $R$ is indistinguishable from $R_0$ for the processes in $\Pi_0$
          - similarly, the processes in $\Pi_1$ cannot distinguish $R$ from $R_1$
          - therefore, in $R$ some processes decide 0 and others decide 1, which violates agreement
  - theorem (FLP impossibility result): there exists no deterministic algorithm that solves consensus in an asynchronous system with reliable channels if a single process may crash
    - intuition of the proof
      - there's no way to know whether a non-responsive process $q$ has crashed or is just slow
      - if we wait, we might wait forever
      - if we decide, we might find out later that $q$ took a different decision
    - of course, we can make some additional assumptions (there are algorithms that work if we ignore some problematic situations)
- consensus algorithm in a synchronous system
  - we have a perfect failure detector $P$ and use the best-effort broadcast algorithm
  - assuming no crashes – trivial
    - everyone broadcasts their proposed value
    - after receiving values from everyone, we use a deterministic decision function (like $\mathrm{min}$)
  - flooding consensus
    - several rounds
    - in every round, every process sends the values it has collected so far (to ensure that everyone has the same information)
    - for a round to end, the process needs to get messages from all the correct processes
    - edge case
      - in the first round, $p_1$ manages to send its (unique) value $v_1$ only to $p_2$ and then crashes (in the middle of the round)
      - in the second round, $p_2$ manages to propagate $v_1$ only to $p_3$, then $p_2$ also crashes
      - processes keep crashing one by one
    - after two consecutive rounds with the same set of correct processes (no one crashes between the rounds), we can decide
      - we can be sure that we have all the information
    - this algorithm does not provide uniform agreement
      - there could have been a process $p'$ that managed to propagate a unique value to $p$ but crashed right after $p$ decided (so the other processes don't have the unique value)
      - in other words: after two such rounds, we can be sure that *we* have all the information, but we cannot be sure that *everyone else* has all the information
    - how to get uniform agreement?
      - we can always do $f+1$ rounds, where $f$ is the number of processes that can crash (3 rounds without a crash would probably also work)
  - FloodSet algorithm
    - $W_p:=\set{v_p}$ … everyone starts with their own value
    - in round $r$
      - send $W_p$
      - on recv $W_q$: $W_p:= W_p\cup W_q$
      - if $r=f+1$: decide $\min W_p$
- consensus in a partially synchronous system
  - partially synchronous system
    - $\Delta$ bound on transmission delay and $\beta$ bound on how slow a process can be
    - both hold *eventually* – there is a time $T$ (unknown) after which the bounds $\beta,\Delta$ hold forever
      - $T$ … GST (global stabilization time)
    - channels can lose messages before GST; channels are quasi-reliable after GST
    - GSR (global stabilization round) = the first round from which the system behaves as synchronous
      - $\exists\, GSR\gt 0$ s.t. $\forall r\geq GSR,\forall p,q$ correct: $p$ sends $m$ to $q$ in round $r\implies q$ receives $m$ in round $r$
    - we don't use a failure detector here
  - OneThirdRule algorithm
    - assumption: $f\lt \frac n3$, so $n-f\gt\frac23 n$
    - at the start of each round, process $p$ sends its value $x_p$ to all processes
    - as the process receives messages in the given round
      - if it gets at least $n-f$ messages, it sets $x_p$ to the most frequent value received (taking the smallest if several values have the same greatest frequency)
      - if at least $n-f$ of the values received are equal to a value $v$, it decides $v$
    - we transition between rounds after a predefined time
      - after GSR, we can guarantee that there is enough time to get all the messages
    - proof of uniform agreement
      - let $r_0$ be the smallest round in which some process $p$ decides $v$
        - → $p$ received $v$ at least $n-f$ times → at least $n-f$ processes proposed $v$
        - there cannot exist another $n-f$ processes proposing $v'\neq v$ in the same round
        - so all the other processes deciding in the same round also decide $v$
      - let $r_1\geq r_0$ be the smallest round in which a process $q$ changes its value $x_q$
        - at this point, at most $f$ processes have a value $v'\neq v$ (as $n-f$ processes proposed $v$ in round $r_0$)
        - the new value $x_q$ has to be $v$, as there is no way for a $v'$ to achieve a majority ($f$ cannot be the majority of $n-f$ received messages)
        - so after $r_0$, a majority for $v'$ can never arise
    - termination is easy
      - suppose we are past GSR and all faulty processes have crashed
      - everyone gets the same $n-f$ messages and sets $x_p$ to the most frequent value
      - in the next round, everyone decides this value
    - valence
      - a configuration where $n-f$ correct processes have value $v$ is $v$-valent
      - if in some round $r$ all correct processes receive $v$ from $n-f$ processes, the configuration becomes $v$-valent
      - if the configuration becomes $v$-valent in round $r\geq GSR$, processes will decide in round $r+1$
  - (broken) consensus algorithms that satisfy only some of the criteria
    - only validity & uniform agreement – we don't need to do anything (nobody ever decides)
    - only uniform agreement & termination – everyone decides "1"
    - only validity & termination – everyone decides their own proposed value
  - bonus: LastVoting algorithm (for a partially synchronous system)
    - it's possible to increase $f$ from $f\lt n/3$ to $f\lt n/2$ by considering a non-symmetric algorithm
    - round-based LastVoting algorithm (similar to the Paxos consensus algorithm)
      - assumes $f\lt n/2$, so $n-f\gt \frac12 n$
      - rounds are grouped into phases – one phase consists of three rounds
      - in each phase, one of the processes serves as a coordinator
        - the coordinator role rotates ($p_1$ is the coordinator in the first phase, then $p_2,p_3,\dots,p_n$, then $p_1$ again, …)
      - every process $p$ stores its proposed value $x_p$ and a timestamp $ts_p$ (the number of the most recent phase in which $x_p$ was updated)
      - first round
        - everyone sends $(x_p,ts_p)$ to the coordinator
        - if the coordinator gets $\geq n-f$ messages, it selects an $x$ whose $ts$ is the largest
      - second round
        - if the coordinator managed to select an $x$ in the first round, it now broadcasts this $x$
        - as the processes receive this value, they update their $x_p$ and $ts_p$
      - third round
        - every process which updated its $ts_p$ in the second round now broadcasts $(\mathrm{ack}, x_p)$
        - if a process receives at least $n-f$ messages $(\mathrm{ack}, v)$, it decides $v$
      - valence
        - the configuration becomes $v$-valent when there is a set $Q$ of processes s.t. $|Q|\geq n-f$, every $q\in Q$ has value $v$, and no process outside $Q$ has a larger timestamp
        - in the third round, the $v$-valent configuration is identified
        - note that the second round cannot disrupt the $v$-valence (thanks to the condition in the first round)
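The FloodSet algorithm described above can be sketched as a tiny simulation (a minimal sketch of a crash-free run; function and variable names are my own):

```python
def floodset(proposals, f):
    """FloodSet sketch for a synchronous system; `f` is the maximum
    number of crashes tolerated. This run simulates no actual crashes:
    f+1 rounds of everyone sending its set W_p to everyone."""
    n = len(proposals)
    W = [{v} for v in proposals]      # W_p := {v_p}
    for _ in range(f + 1):            # rounds 1 .. f+1
        sent = [set(w) for w in W]    # everyone sends its current W_p
        for p in range(n):
            for q in range(n):
                W[p] |= sent[q]       # on recv W_q: W_p := W_p ∪ W_q
    return [min(w) for w in W]        # after round f+1: decide min W_p


decisions = floodset([3, 1, 2], f=1)
print(decisions)  # [1, 1, 1] – everyone decides the minimum proposal
```

With crashes, a process would simply miss some of the `sent[q]` sets; the $f+1$ rounds guarantee at least one crash-free round, after which all sets are equal.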
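Similarly, the OneThirdRule round structure can be sketched as a simulation of a run in which every round already behaves synchronously, i.e. as if GSR had passed and no messages are lost (a sketch under those assumptions; the names are my own):

```python
from collections import Counter


def one_third_rule(values, f, max_rounds=10):
    """OneThirdRule sketch, simulated with no crashes or losses,
    so every process receives all n messages in every round."""
    n = len(values)
    assert f < n / 3                  # the algorithm's assumption
    x = list(values)
    decided = [None] * n
    for _ in range(max_rounds):
        received = list(x)            # everyone receives everyone's x_p
        for p in range(n):
            if len(received) >= n - f:
                counts = Counter(received)
                top = max(counts.values())
                # most frequent value received, smallest on a tie
                x[p] = min(v for v, c in counts.items() if c == top)
                # decide if at least n-f received values equal some v
                for v, c in counts.items():
                    if c >= n - f and decided[p] is None:
                        decided[p] = v
        if all(d is not None for d in decided):
            break
    return decided


# n=4, f=1: round 1 makes everyone adopt 1, round 2 decides it
print(one_third_rule([1, 1, 2, 3], f=1))  # [1, 1, 1, 1]
```

This matches the termination argument in the notes: one synchronous round makes all values equal, and the next round produces $n-f$ identical values, so everyone decides.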