# Exam

## Markov Chains

- Stochastic process
  - sequence of random variables $X_0,X_1,\dots$ such that “all probabilities are defined”
    - e.g. $P(X_0=1\land X_1=2\land X_3=0)$ is defined
- Markov process/chain
  - stochastic process such that
    - $\exists$ finite/countable set $S$ such that $\forall t:\text{Im}\,X_t\subseteq S$
      - elements of $S$ … states
      - $t$ … time
    - $\forall i,j\in S:\exists p_{ij}\in [0,1]$
  - notes
    - this is a discrete-time Markov chain
    - discrete space … $|S|\leq |\mathbb N|$
    - time-homogeneous … $p_{ij}$ not depending on $t$
  - Markov condition … “forgetting the past”
    - $P(X_{t+1}=a_{t+1}\mid X_t=a_t\land \dots\land X_0=a_0)=P(X_{t+1}=a_{t+1}\mid X_t=a_t)$
- Transition matrix, diagram
  - transition matrix is a matrix $P$ such that $P_{ij}=p_{ij}$
    - observation: every row of $P$ sums to 1
  - transition diagram/graph
    - vertices … $S$
    - directed edges … $\set{(i,j)\in S^2:p_{ij}\gt 0}$
  - a Markov chain is equivalent to a random walk on the transition diagram
    - from state $i$ we go to $j$ with probability $p_{ij}$, independently of the past
- Probability mass function for $X_0,X_1,\dots$
  - $\pi_i^{(t)}=P(X_t=i)$
  - theorem: $\pi^{(t+1)}=\pi^{(t)}\cdot P$
    - where $\pi^{(t)}=(\pi_1^{(t)},\dots,\pi_n^{(t)})$ is a row vector
    - proof: definition of matrix multiplication & law of total probability
  - by induction, we can prove that $\pi^{(k)}=\pi^{(0)}P^k$ and, more generally, $\pi^{(t+k)}=\pi^{(t)}P^k$
- Probability of $k$-step transition … $r_{ij}(k)$
  - $r_{ij}(k)=P(X_{t+k}=j\mid X_t=i)$
  - $r_{ij}(1)=P(X_{t+1}=j\mid X_t=i)=p_{ij}$
- Chapman–Kolmogorov theorem
  - for every Markov chain and all $i,j,k$:
    1. $r_{ij}(k+1)=\sum_{u=1}^n r_{iu}(k)\, p_{uj}$
    2. $r_{ij}(k+\ell)=\sum_{u=1}^n r_{iu}(k)\, r_{uj}(\ell)$
    3. $r_{ij}(k)=(P^k)_{ij}$
  - proof
    - $2\implies 1$ … 1 is a special case of 2 ($r_{uj}(1)=p_{uj}$)
    - $1\implies 3$ … matrix multiplication & induction
    - 2 … we want $r_{ij}(k+\ell)=\sum_{u=1}^n r_{iu}(k)\, r_{uj}(\ell)$
      - $P(X_{t+k+\ell}=j\mid X_t=i)=\sum_{u=1}^nP(X_{t+k}=u\mid X_t=i)\cdot P(X_{t+k+\ell}=j\mid X_{t+k}=u)$
      - this follows from the law of total probability together with the Markov condition
- Accessible states
  - definition: $j$ is accessible from $i$ (where $i,j\in S$)
    - notation: $i\to j$, or $j\in A(i)$
    - $j$ is accessible from $i\equiv\exists t:\underbrace{P(X_t=j\mid X_0=i)}_{r_{ij}(t)}\gt 0$
    - $\iff$ the transition digraph contains a directed walk $i\to j$ (of some length $t$)
- Communicating states
  - definition: $i\leftrightarrow j$ ($i$ and $j$ communicate) $\equiv i\in A(j)\land j\in A(i)$
  - theorem: $\leftrightarrow$ is an equivalence relation
  - proof
    - reflexive: $i\in A(i)$ … $t=0$ in the definition
    - symmetric: the formula $i\in A(j)\land j\in A(i)$ is symmetric
    - transitive: in the digraph, $\exists$ walk $i$ to $j$ and $\exists$ walk $j$ to $k$ $\implies\exists$ walk $i\to k$
  - decomposing the transition digraph into strongly connected components $\equiv$ finding the equivalence classes of $\leftrightarrow$
- Reducibility
  - a Markov chain is irreducible if $\forall i,j\in S:i\leftrightarrow j$
- Time to get to $i\in S$
  - definition: $T_i=\min\set{t\geq 1:X_t=i}$
    - $T_i=\infty$ if the set is empty (there is no such $t$)
    - $T_i$ … random variable
  - $f_{ij}(n)=P(\text{we first reach }j\text{ at time }n\mid X_0=i)=P(T_j=n\mid X_0=i)$
    - we define it for $n\gt 0$
  - $f_{ij}=P(T_j\lt\infty\mid X_0=i)=\sum_{n\geq 1} f_{ij}(n)$
    - probability that we ever reach $j$ if we start from $i$ (after at least one step)
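A quick numeric sanity check of the identities $\pi^{(t+1)}=\pi^{(t)}P$ and $r_{ij}(k)=(P^k)_{ij}$ above; the two-state transition matrix is an arbitrary illustrative choice.

```python
import numpy as np

# an arbitrary two-state transition matrix (rows sum to 1)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

pi0 = np.array([1.0, 0.0])  # PMF of X_0: start in state 0

# evolve the PMF: pi^(t+1) = pi^(t) P
pi = pi0
for _ in range(5):
    pi = pi @ P
print(pi)                                  # pi^(5)
print(pi0 @ np.linalg.matrix_power(P, 5))  # same thing, pi^(0) P^5

# Chapman-Kolmogorov: r_ij(k + l) = sum_u r_iu(k) r_uj(l), i.e. P^(k+l) = P^k P^l
k, l = 3, 2
assert np.allclose(np.linalg.matrix_power(P, k + l),
                   np.linalg.matrix_power(P, k) @ np.linalg.matrix_power(P, l))
```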
- Recurrent state
  - a state $i\in S$ is recurrent if we return to it with probability 1
    - it is transient otherwise
    - in Czech: *rekurentní*, *tranzientní*
  - $i$ is transient
    - $f_{ii}\lt 1$ … probability of ever getting back
    - $\iff P(T_i=\infty\mid X_0=i)\gt 0$
    - $\iff P(\text{we return infinitely often})\lt 1$
  - theorem: $i$ is a recurrent state $\iff\sum_{n=1}^\infty r_{ii}(n)=\infty$
  - theorem: if $i\leftrightarrow j$, then either both $i$ and $j$ are recurrent or both are transient
  - for finite Markov chains
    - $C$ … equivalence class of $\leftrightarrow$ in a finite Markov chain
    - $C$ is recurrent ($\forall i\in C$ is recurrent) $\iff(\forall i \in C)(\forall j\in S):$ if $i\to j$ then $j\in C$ (“$C$ is closed”)
  - definition: $i$ is null recurrent if $i$ is recurrent and $\mathbb E(T_i\mid X_0=i)=\infty$
    - $i$ is positive recurrent if $i$ is recurrent and $\mathbb E(T_i\mid X_0=i)\lt\infty$
    - example: random walk on a line
- Periodic state
  - definition: $i\in S$ has period $d_i:=\gcd\set{t:r_{ii}(t)\gt 0}$
  - $i\in S$ is aperiodic if $d_i=1$
  - example of a periodic Markov chain: two states, we change the state with probability 1
- Properties of a communicating class
  - if $i,j\in S$, $i\leftrightarrow j$, then
    - $d_i=d_j$
    - $i$ is transient $\iff j$ is transient
    - $i$ is recurrent $\iff j$ is recurrent
    - $i$ is null recurrent $\iff j$ is null recurrent
    - $i$ is positive recurrent $\iff j$ is positive recurrent
  - these are properties of the classes of $\leftrightarrow$
- Stationary distribution / steady-state distribution
  - definition: $\pi:S\to [0,1]$ such that $\sum_{i\in S}\pi_i=1$ is called a stationary distribution if $\text{``}\pi P=\pi\text{"}$
    - i.e. $\forall j:\sum_i\pi_i\cdot p_{ij}=\pi_j$
  - if $\pi^{(0)}$ (the PMF of $X_0$) is $\pi$, then $\pi^{(1)}$ is $\pi$ as well (→ steady state)
    - reminder: $\pi^{(1)}=\pi^{(0)}P$
  - theorem
    - if a Markov chain is irreducible, aperiodic and finite, then
      - there exists a unique stationary distribution $\pi$ such that $\pi P=\pi$
      - $\forall i,j:\lim_{t\to\infty}(P^t)_{ij}=\pi_j$
        - $P(X_t=j\mid X_0=i)\doteq\pi_j$ for large $t$
    - actually, the MC does not have to be finite; it suffices that all states are positive recurrent
  - to find $\pi$, solve the system of linear equations $\pi P=\pi$ together with $\sum_{i\in S}\pi_i=1$
    - for every $i$ there is an equation $\pi_i=\sum_j p_{ji}\cdot \pi_j$ (one term for every *incoming* arc)
  - $\pi$ describes the long-term behavior of the MC
    - PageRank (original Google search) … MC model of people browsing the WWW
  - given $\pi$, we can find a MC such that $\pi$ is its stationary distribution; then we can run the MC to generate random objects with distribution $\pi$
    - Markov chain Monte Carlo (MCMC)
- Detailed balance equation
  - a MC may have this property
  - $\forall i\neq j:\pi_iP_{ij}=\pi_jP_{ji}$
  - to imagine this: an ant colony, each ant moving independently according to the Markov chain
    - stationary distribution … the same number of ants at each state at each time – ants don't “accumulate”
  - the detailed balance equation implies $\pi P=\pi$ (but it is stronger than that)
- MCMC sampling
  - Markov chain Monte Carlo
  - Metropolis–Hastings method
    - we construct a MC from the probability distribution we want
    - choose an aperiodic irreducible digraph
    - use the detailed balance equation
      - $p_{ij}=\min\set{1,\frac{\pi _j}{\pi_i}}\cdot C$
      - $p_{ji}=\min\set{1,\frac{\pi_i}{\pi_j}}\cdot C$
      - choose $C$ such that $\forall i:\sum_{j\neq i} p_{ij}\lt 1$
      - define $p_{ii}=1-\sum_{j\neq i}p_{ij}\gt 0$
        - $\implies d_i=1$
    - tune the process to make convergence fast
    - we run the MC for long enough
  - alternative sampling approach: rejection sampling
  - MCMCs are useful in Bayesian statistics – we only need the ratios $\frac{\pi_j}{\pi_i}$, so the normalizing constants cancel out (and we don't have to compute them)
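A minimal sketch of the Metropolis method above for a finite target distribution. The target weights and the cycle-shaped proposal digraph are my own illustrative choices; since the proposal is symmetric, the acceptance probability reduces to $\min\set{1,\frac{\pi_j}{\pi_i}}$, and only ratios enter, so the weights need not be normalized.

```python
import random

# unnormalized target weights over states 0..n-1 (arbitrary example)
w = [1, 2, 3, 4, 5, 6]
n = len(w)

def metropolis_sample(steps=100_000):
    """Random walk on a cycle (symmetric proposal), Metropolis acceptance."""
    counts = [0] * n
    i = 0
    for _ in range(steps):
        j = (i + random.choice([-1, 1])) % n         # propose a neighbor on the cycle
        if random.random() < min(1.0, w[j] / w[i]):  # accept with prob min(1, pi_j / pi_i)
            i = j
        counts[i] += 1                               # rejections count as self-loops p_ii
    return [c / steps for c in counts]

print(metropolis_sample())         # empirical distribution after many steps
print([x / sum(w) for x in w])     # target pi (normalized only for comparison)
```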
- Absorbing state
  - $i\in S$ is an absorbing state if $p_{ii}=1$
  - $A$ … set of absorbing states
  - which absorbing state do we end at? will it be state $0\in A$?
    - $a_i=P(\exists t:X_t=0\mid X_0=i)$
    - theorem: $(a_i)_{i\in S}$ is the unique solution to
      - $a_0=1$
      - $a_i=0$ if $i\in A,\,i\neq 0$
      - $a_i=\sum_j p_{ij}\cdot a_j$ otherwise
  - what is the expected number of steps to get to an absorbing state?
    - $\mu_i=\mathbb E(T\mid X_0=i)$
      - $T=\min\set{t: X_t\in A}$
    - theorem: $(\mu_i)_{i\in S}$ is the unique solution to
      - $\mu_i=0$ if $i\in A$
      - $\mu_i=1+\sum_j p_{ij}\cdot\mu_j$ if $i\notin A$
  - example: drunk person on their way home
    - $A=\set{0}$
    - $\mu_0=0$
    - $\mu_1=1+\frac12\mu_0+\frac12\mu_2$
    - $\mu_2=1+\frac12\mu_1+\frac12\mu_3$
    - …
    - $\mu_{n-1}=1+\frac12\mu_{n-2}+\frac12\mu_{n}$
    - $\mu_n=1+\mu_{n-1}$
    - solution
      - $\mu_1=2n-1$
      - $\mu_n=n^2$
      - $\mu_{i}\leq n^2$
    - how to compute it
      - we start from the bottom up, by combining the last two equations
      - we get $\mu_{n-1}=3+\mu_{n-2}$
      - then we get $\mu_{n-2}=5+\mu_{n-3}$
      - continuing this way, we get to $\mu_1=2n-1+\mu_{0}=2n-1$
      - we see that $\mu_2=2n-3+\mu_1$
      - we can substitute for $\mu_1$
      - when we get back to the bottom, we have $\mu_n=1+3+5+\dots+(2n-1)=n^2$
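The linear system for the $\mu_i$ can be solved mechanically. A small NumPy sketch for the drunkard's-walk example (my encoding: states $0,\dots,n$, state $0$ absorbing, inner states step left/right with probability $\frac12$, state $n$ always steps to $n-1$), verifying $\mu_1=2n-1$ and $\mu_n=n^2$:

```python
import numpy as np

n = 10  # drunkard's walk on states 0..n, state 0 (home) is absorbing

P = np.zeros((n + 1, n + 1))
P[0, 0] = 1.0                      # absorbing state
for i in range(1, n):
    P[i, i - 1] = P[i, i + 1] = 0.5
P[n, n - 1] = 1.0                  # reflecting end of the street

# mu_i = 1 + sum_j p_ij mu_j for transient i, and mu_0 = 0
# => (I - Q) mu = 1, where Q is P restricted to the transient states 1..n
Q = P[1:, 1:]
mu = np.linalg.solve(np.eye(n) - Q, np.ones(n))

print(mu[0], 2 * n - 1)   # mu_1 = 2n - 1
print(mu[-1], n ** 2)     # mu_n = n^2
```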
- 2-SAT problem
  - input: $\varphi=(x_1\lor x_2)\land(x_3\lor\neg x_1)\land\ldots$
    - clauses with exactly 2 literals
  - output: a satisfying assignment OR “unsatisfiable”
  - there exists a polynomial algorithm
  - we will show a randomized algorithm
    1. arbitrarily initialize $(x_1,\dots,x_n)$
    2. while $\varphi$ has an unsatisfied clause
       - choose one of the unsatisfied clauses and flip one of its variables (at random)
    3. return $(x_1,\dots,x_n)$
  - we repeat step (2) at most $2mn^2$ times
    - then, if $\varphi$ is still unsatisfied, we return “unsatisfiable”
    - this may introduce errors
  - theorem: the algorithm makes an error with probability $\leq 2^{-m}$
  - proof
    - $x_1^*,\dots,x_n^*$ … one of the satisfying assignments
    - $D_t$ … the number of $i$ such that $x_i\neq x_i^*$ at time $t$
      - $t=0,1,\dots,2mn^2$
      - $0\leq D_t\leq n$
      - $D_t=0\implies$ we found a solution
    - situation
      - we assume that clause $(x_1\lor x_2)$ is unsatisfied at time $t$ and we choose it
      - $\implies x_1=x_2=F$
      - $\implies x_1^*\lor x_2^*=T$
      - we randomly flip $x_1$ or $x_2$, which increases or decreases $D$ by one
      - if $x_1^*=T$ and $x_2^*=F$ (or vice versa), the random flip decreases or increases $D$ by one (each with probability $\frac12$)
        - we will call these the “good cases”
      - if $x_1^*=x_2^*=T$, then the random flip always decreases $D$, which is nice, but it means that we cannot model $D$ by a Markov chain
        - this is the “bad case”
    - because of that, we define $D'$ which always decreases or increases with probability $\frac12$
      - in the good cases, it behaves the same way as $D$
      - in the bad case, we flip a coin to decide whether it should go up or down
    - clearly $D\leq D'$ and $T\leq T'$ (where $T$ is the first time with $D=0$, and similarly for $T',D'$)
    - $D'$ behaves like the drunkard's walk, so $\mathbb ET'\leq n^2$, which means that $\mathbb ET\leq n^2$
    - using the Markov inequality, we get $P(T\geq 2n^2)\leq\frac{\mathbb ET}{2n^2}\leq\frac{n^2}{2n^2}=\frac12$
    - we repeat the whole process $m$ times; that way we get failure probability $\leq 2^{-m}$ and running time $O(mn^4)$
      - there are $O(n^2)$ clauses of two literals, and we need to find an unsatisfied one $2mn^2$ times
  - we can get a randomized 3-SAT algorithm with $(\frac43)^n$ time complexity this way
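A minimal sketch of the randomized walk described above. The clause encoding (signed 1-indexed literals) is my own, and for simplicity it rescans all clauses in every step instead of using a cleverer data structure; the step budget is taken as a parameter.

```python
import random

def random_walk_2sat(n, clauses, max_steps):
    """clauses: list of pairs of literals; literal +i / -i means x_i / not x_i (1-indexed)."""
    x = [random.choice([False, True]) for _ in range(n + 1)]  # x[1..n], x[0] unused

    def satisfied(lit):
        return x[abs(lit)] == (lit > 0)

    for _ in range(max_steps):
        unsat = [c for c in clauses if not (satisfied(c[0]) or satisfied(c[1]))]
        if not unsat:
            return x[1:]                               # satisfying assignment found
        lit = random.choice(random.choice(unsat))      # random unsatisfied clause, random literal
        x[abs(lit)] = not x[abs(lit)]                  # flip that variable
    return None                                        # "unsatisfiable" (may err with small prob.)

# (x1 or x2) and (not x1 or x2) and (x1 or not x2)  ->  x1 = x2 = True
print(random_walk_2sat(2, [(1, 2), (-1, 2), (1, -2)], max_steps=100))
```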
- Hidden Markov Model (HMM)
  - there is some unobservable $(X_t)$ … a Markov chain
  - we can observe $(Y_t)$ … $Y_t$ is somehow obtained from $X_t$
  - widely applicable
  - smart algorithm (Viterbi algorithm)

## Bayesian Statistics

- Probabilistic method
  - we prove that some graph exists by showing that a random graph has a certain property with probability greater than zero
  - $G(n,p)$ … graph with vertices $1,\dots,n$
    - $\forall i,j:i\sim j$ with probability $p$ (all independently)
  - $P(G(n,\frac12)$ has no $K_k$ nor $\overline K_k$ as an induced subgraph$)\gt 0$ if $n\leq 2^{k/2}$
    - thus $\exists G$ on $2^{k/2}$ vertices with no induced $K_k$ nor $\overline K_k$
    - Ramsey theorem
- Bayes theorem
  - let $X,\Theta$ be discrete random variables
  - $p_{\Theta\mid X}(\theta\mid x)=\frac{p_{X\mid\Theta}(x\mid\theta)\cdot p_\Theta(\theta)}{\sum_{\theta'\in\text{Im}\,\Theta} p_{X\mid\Theta}(x\mid\theta')\cdot p_\Theta(\theta')}$
  - similar for continuous RVs
  - note: terms with $p_{\Theta}(\theta')=0$ are considered to be zero
- Bayesian statistics – general approach
  - $\Theta$ … possible worlds
  - prior distribution $p_\Theta(\theta)$
    - example: with probability $\frac12$ the coin is fake (has heads on both sides)
      - $p_\Theta(\text{real})=\frac12$
      - $p_\Theta(\text{fake})=\frac12$
  - $X$ … measurement/observation
    - example: we got heads
  - posterior distribution
    - $p_{\Theta\mid X}(\theta\mid x)=\frac{ p_\Theta(\theta)\cdot p_{X\mid\Theta}(x\mid\theta)}{\sum_{t\in\text{Im}\,\Theta} p_\Theta(t)\cdot p_{X\mid\Theta}(x\mid t)}$
    - example: $p_{\Theta\mid X}(\text{fake}\mid\text{heads})=\frac{\frac12\cdot 1}{\frac12\cdot1+\frac12\cdot\frac12}=\frac23$
    - alternatively: which “heads” did we get?
      - one side of the fake coin
      - the other side of the fake coin
      - one side of the real coin
      - two of the three equally likely “worlds” correspond to the fake coin → $p_{\Theta\mid X}(\text{fake}\mid\text{heads})=\frac23$
  - $\hat\theta=\text{fake}$ … MAP estimate
- Bayesian estimates
  - point estimates
    - MAP (maximum a posteriori)
      - we choose $\hat\theta$ to maximize $p_{\Theta\mid X}$ … we replace the random variable by its mode
    - LMS (“least squares”)
      - we choose $\hat\theta=\mathbb E(\Theta\mid X=x)$ … we replace the random variable by its mean
      - we get an unbiased point estimate that has the smallest possible LMS (least mean square) error
  - interval estimates
    - credible sets $S$ (similar to the confidence intervals used in classical statistics)
  - note: often, the constant in the denominator does not matter
- MAP (maximum a posteriori)
  - $\hat\theta_\text{MAP}=\text{argmax}_\theta\,p_{\Theta\mid X}(\theta\mid x)$
  - $\hat\theta$ is a point estimate for $\Theta$
  - it equals $\text{argmax}_\theta\,p_\Theta(\theta)\cdot p_{X\mid\Theta}(x\mid\theta)$
    - we can ignore the normalizing constant
  - we are replacing the random variable by its mode
  - note: maximum likelihood is similar, but it uses $\text{argmax}_\theta\,p_{X\mid\Theta}(x\mid\theta)$
    - it assumes a “flat prior” (uniform or discrete uniform)
- Naive Bayes classifier
  - let $x=(x_1,\dots,x_n)$
    - in machine learning … a vector of features
  - we need $p_{X\mid\Theta}(x\mid\theta)$ to get the MAP estimate
  - “naive assumption”: $p(x\mid\theta)=p(x_1\mid\theta)\cdot p(x_2\mid\theta)\cdot\ldots\cdot p(x_n\mid\theta)$
- LMS point estimate
  - LMS = least mean square
  - the estimate such that $\mathbb E((\Theta-\hat\theta)^2\mid X=x)$ is minimal
  - to compute it
    - $\hat\theta_\text{LMS}=\mathbb E(\Theta\mid X=x)=\sum_{t\in\text{Im}\,\Theta}t\cdot p_{\Theta\mid X}(t\mid x)$
    - we find the mean of the random variable
  - proof
    - $\mathbb E((\Theta-\hat\theta)^2\mid X=x)=\mathbb E(\Theta^2\mid X=x)-\mathbb E(2\Theta\hat\theta\mid X=x)+\mathbb E(\hat\theta^2\mid X=x)$
    - we want to minimize this
    - the first term does not depend on $\hat\theta$, so we can drop it
    - we are left with $-2\hat\theta\,\mathbb E(\Theta\mid X=x)+\hat\theta^2$
    - to find the minimum, we set the derivative equal to zero
      - $-2\mathbb E(\Theta\mid X=x)+2\hat\theta=0$
    - therefore $\hat\theta=\mathbb E(\Theta\mid X=x)$, as we wanted
- Beta distribution
  - $f_\Theta(x)=c\cdot x^{\alpha-1}(1-x)^{\beta-1}$ for $x\in[0,1]$
    - where $c$ is a normalizing constant
  - properties
    - $f_\Theta$ is maximal for $\theta=\frac{\alpha-1}{\alpha+\beta-2}$
    - $\mathbb E\Theta=\frac\alpha{\alpha+\beta}$
- Conjugate distributions
  - Beta (prior) & binomial (likelihood) → Beta (posterior)
    - “Beta and Beta are conjugate distributions with respect to the binomial distribution”
  - normal (prior) & normal (likelihood) → normal (posterior)
  - both were proved in the lecture
  - it is an algebraic convenience; it does not work for all distributions
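A small numeric sketch of the Beta–binomial conjugacy above: with a $\text{Beta}(\alpha,\beta)$ prior on the heads probability and $k$ heads observed in $n$ tosses, the posterior is $\text{Beta}(\alpha+k,\beta+n-k)$. The prior parameters and data below are made up, and SciPy is assumed for the densities.

```python
import numpy as np
from scipy.stats import beta, binom

a, b = 2.0, 2.0    # Beta prior parameters (arbitrary)
n, k = 10, 7       # made-up data: 7 heads in 10 tosses

theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]

# posterior ~ prior * likelihood, normalized numerically on the grid
unnorm = beta.pdf(theta, a, b) * binom.pmf(k, n, theta)
post_grid = unnorm / (unnorm.sum() * dtheta)

# conjugacy: the posterior is Beta(a + k, b + n - k)
post_exact = beta.pdf(theta, a + k, b + n - k)

print(np.abs(post_grid - post_exact).max())   # small (grid error only)
print("MAP:", (a + k - 1) / (a + b + n - 2))  # posterior mode
print("LMS:", (a + k) / (a + b + n))          # posterior mean
```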
## Conditional Expectation

- Conditional independence
  - events
    - $A\perp B\iff P(A\cap B)=P(A)\cdot P(B)$
    - $A\perp_C B\iff P(A\cap B\mid C)=P(A\mid C)\cdot P(B\mid C)$
      - $A,B$ are independent conditionally given $C$
  - it is possible (even typical) that $A\perp_C B$ and $A\perp_{C^C} B$, but not $A\perp B$
- Conditional expectation
  - $\mathbb E(Y\mid X=x)$ vs. $\mathbb E(Y\mid X)$
    - $\mathbb E(Y\mid X=x)=g(x)$ … a number
    - $\mathbb E(Y\mid X)=g(X)$ … a random variable
- Law of iterated expectation
  - $\mathbb E(\mathbb E(Y\mid X))=\mathbb E(Y)$
  - proof
    - $\mathbb E(\mathbb E(Y\mid X))=\mathbb E(g(X))=\sum_x P(X=x)\cdot g(x)$
      - by LOTUS
    - $\mathbb E(Y)=\sum_x P(X=x)\cdot\mathbb E(Y\mid X=x)=\sum_x P(X=x)\cdot g(x)$
      - by the law of total probability
- Estimate error
  - basic task of statistics
    - estimate one quantity ($Y$) given data/measurement ($X$)
  - example: groups of students and their exam results
    - $X$ … group of a student
      - $X=2$ … we are currently looking at the second group
    - $Y$ … students' exam results
    - $\mathbb E(Y\mid X=3)$ … the expectation of the results in group 3
  - estimator
    - $\hat Y=\mathbb E(Y\mid X)=g(X)$
  - we want the estimate error
    - $\tilde Y=\hat Y-Y$
  - what is $\mathbb E(\tilde Y\mid X)$ equal to?
    - $\mathbb E(\tilde Y\mid X)=\mathbb E(\hat Y-Y\mid X)=\mathbb E(\hat Y\mid X)-\mathbb E(Y\mid X)=\hat Y-\hat Y=0$
      - because $\mathbb E(\hat Y\mid X)=\mathbb E(\mathbb E(Y\mid X)\mid X)=\mathbb E(Y\mid X)$
    - $\mathbb E(\tilde Y)=\mathbb E(\mathbb E(\tilde Y\mid X))=0$
      - by the law of iterated expectation
- Estimate error covariance and variance
  - $\text{cov}(\tilde Y,\hat Y)=\mathbb E(\tilde Y\hat Y)-\underbrace{\mathbb E(\tilde Y)}_0\mathbb E(\hat Y)=\mathbb E(\mathbb E(\tilde Y\hat Y\mid X))=\mathbb E(\hat Y\underbrace{\mathbb E(\tilde Y\mid X)}_0)=0$
    - the penultimate equality holds because $\hat Y$ is a constant once $X=x$ is fixed
  - $\tilde Y,\hat Y$ are uncorrelated
  - note
    - uncorrelated $\impliedby$ independent
    - uncorrelated $\;\;\not\!\!\!\implies$ independent
- Law of iterated variance / Eve's rule
  - $\text{var}\,Y=\mathbb E(\text{var}(Y\mid X))+\text{var}(\mathbb E(Y\mid X))$
    - intragroup variance + intergroup variance
  - proof
    - $Y=\hat Y-\tilde Y$
    - $\text{var}\,Y=\text{var}(\hat Y)+\text{var}(\tilde Y)-\underbrace{2\,\text{cov}(\hat Y,\tilde Y)}_0=\text{var}(\hat Y)+\text{var}(\tilde Y)$
    - $\text{var}(\tilde Y)=\mathbb E(\tilde Y^2)-\underbrace{(\mathbb E\tilde Y)^2}_0=\mathbb E(\mathbb E(\tilde Y^2\mid X))$
    - $=\mathbb E(\mathbb E((Y-\hat Y)^2\mid X))=\mathbb E(\text{var}(Y\mid X))$
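A quick exact check of the law of iterated expectation and Eve's rule on a small made-up discrete example (two groups of students with different result distributions); the joint PMF is an arbitrary choice.

```python
from collections import defaultdict

# made-up joint PMF of (X, Y): X = group, Y = exam result
pmf = {(1, 60): 0.2, (1, 80): 0.2, (2, 50): 0.1, (2, 70): 0.3, (2, 90): 0.2}

def E(f):  # expectation of f(x, y) under the joint PMF
    return sum(p * f(x, y) for (x, y), p in pmf.items())

pX = defaultdict(float)
for (x, y), p in pmf.items():
    pX[x] += p

# g(x) = E(Y | X = x), v(x) = var(Y | X = x)
g = {x: sum(p * y for (x2, y), p in pmf.items() if x2 == x) / pX[x] for x in pX}
v = {x: sum(p * (y - g[x]) ** 2 for (x2, y), p in pmf.items() if x2 == x) / pX[x] for x in pX}

EY = E(lambda x, y: y)
varY = E(lambda x, y: (y - EY) ** 2)

# law of iterated expectation: E(E(Y|X)) = E(Y)
print(sum(pX[x] * g[x] for x in pX), EY)
# Eve's rule: var Y = E(var(Y|X)) + var(E(Y|X))
print(sum(pX[x] * v[x] for x in pX) + sum(pX[x] * (g[x] - EY) ** 2 for x in pX), varY)
```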
## Balls & Bins

- Really useful approximation
  - $1-x\approx e^{-x}$
  - example usage: birthday paradox
    - $(1-\frac1{365})(1-\frac2{365})\cdots(1-\frac{k-1}{365})\approx e^{-\frac{k(k-1)}{2\cdot 365}}$
      - because $1-\frac1{365}\approx e^{-\frac1{365}}$ etc.
- Union bound
  - for events $A_1,A_2,\ldots$ it holds that $P(\bigcup A_i)\leq \sum P(A_i)$
  - therefore $P(\text{none of }A_i)\geq 1-\sum P(A_i)$
- Balls and bins model
  - we throw $m$ balls randomly into $n$ bins
  - $X_i$ … number of balls in bin $i$
- How many bins are empty?
  - $P(X_i=0)=(1-\frac1n)^m\approx e^{-\frac mn}$
  - $I_i=1$ if $X_i=0$ (otherwise $I_i=0$)
  - $\mathbb E(\sum_i I_i)=\sum_i\mathbb EI_i\approx ne^{-\frac mn}$
- How many balls are in bin $i$?
  - $X_i\sim \text{Bin}(m,\frac 1n)\approx\text{Pois}(\frac mn)$
  - $\mathbb EX_i=\frac mn$
- Applications – bucket sort and hashing
  - (see the lecture)
- Max-load likely upper bound
  - definition: max-load … $\max\set{X_1,\dots,X_n}$
  - we assume $m=n$ in the max-load bounds
  - theorem: for large enough $n$ we have $P(\text{max-load}\geq\frac{3\log n}{\log\log n})\leq\frac1n$
  - proof
    - $X_i$ … number of balls in bin $i$
    - $P(X_i\geq M)\leq{n\choose M}{1\over n^M}$
      - probability that bin $i$ gets at least $M$ balls
      - over all $M$-tuples of balls, we sum the probability that the whole tuple falls into bin $i$ (union bound)
    - $B_i$ … the event $X_i\geq M$
    - $P(\bigcup_i B_i)=P(\max_i X_i\geq M)$
    - $P(\bigcup_i B_i)\leq\sum_{i=1}^n P(B_i)\leq n{n\choose M}{1\over n^M}\leq n({e\over M})^M$
      - union bound again, then an estimate of the factorial
    - we need to show that $n({e\over M})^M\overset?\lt \frac1n$
      - $\log\left(n^2({e\over M})^M\right)\overset?\lt \log 1$
      - $2L+M(1-\log M)\overset?\lt 0$
        - where $L$ is $\log n$ ($LL$ is $\log\log n$ etc.)
      - we substitute $M=\frac{3L}{LL}$
      - $2L+\frac{3L}{LL}(1-(\log 3+LL-LLL))\overset?\lt 0$
      - $2L+\frac{3L(1-\log3-LL+LLL)}{LL}\overset?\lt 0$
      - $-L+\frac{3L(1-\log 3)}{LL}+\frac{3L\cdot LLL}{LL}\overset?\lt 0$
      - the second and third terms are much smaller than $L$
      - therefore it holds
- Exact case vs. Poisson case
  - in the Poisson case, we pretend the bin loads are independent $\text{Pois}(\frac mn)$ variables
  - theorem: any event that happens with probability $\leq p$ in the Poisson case happens with probability $\leq p\cdot e\sqrt m$ in the exact case
- Max-load likely lower bound
  - theorem: for large enough $n$ we have $P(\text{max-load}\leq\frac{\log n}{\log\log n})\leq\frac1n$
  - proof
    - Poisson case
      - $P(X_i\lt M)\leq 1-P(X_i=M)=1-e^{-\lambda}\frac{\lambda^M}{M!}=1-\frac1{eM!}$
        - $\lambda=\frac mn=1$
      - $P(\max_i X_i\lt M)=P(X_1\lt M)\cdot\ldots\cdot P(X_n\lt M)$
        - $\leq(1-\frac1{eM!})^n\leq(e^{-\frac1{eM!}})^n=\exp(-\tfrac n{eM!})=:p$
    - exact case
      - $P(\max_i X_i\lt M)\leq p\cdot e\sqrt{m}=e\sqrt n\cdot \exp(-\tfrac n{eM!})$
      - we want $e\sqrt n\cdot \exp(-\tfrac n{eM!})\overset?\lt\frac1n$
        - $\exp(-\tfrac n{eM!})\overset?\lt\frac1{n^2}$
        - $\frac n{eM!}\overset?\gt 2\log n$
        - $M!\overset?\lt\frac{n}{2e\log n}$
        - $\log M!\overset?\lt L-LL-\log 2e$
      - we know $M!\leq M(\frac Me)^M$
        - $\log M!\leq\log M+M(\log M-1)$
        - $\leq LL-LLL+\frac L{LL}(LL-LLL-1)$
          - here we substituted $M=\frac L{LL}$
        - $\leq L-L\frac{LLL}{LL}+LL\leq L-LL-c$
          - as $2LL\leq L\frac{LLL}{LL}$ (it's not obvious, but it holds)
      - therefore $\log M!\leq L-LL-c$
      - that's what we wanted to prove
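A small simulation sketch of the estimates above: the empty-bin count should be close to $ne^{-m/n}$, and with $m=n$ the max-load typically lands between $\frac{\log n}{\log\log n}$ and $\frac{3\log n}{\log\log n}$ (the value of $n$ is an arbitrary choice).

```python
import math
import random

n = m = 100_000
loads = [0] * n
for _ in range(m):                       # throw m balls into n bins uniformly
    loads[random.randrange(n)] += 1

# empty bins vs the approximation n * e^{-m/n}
print(sum(1 for x in loads if x == 0), n * math.exp(-m / n))

# max-load vs the likely lower/upper bounds (here m = n)
lo = math.log(n) / math.log(math.log(n))
print(lo, max(loads), 3 * lo)
```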
## Stochastic Processes

- Bernoulli process
  - an infinite sequence of independent identically distributed Bernoulli trials
  - a sequence of RVs $X_1,X_2,\dots$ that are independent and each follows the $\text{Ber}(p)$ distribution
  - $X_k=1$ … success/arrival at time $k$
- Quantities of a Bernoulli process
  - $N_t$ … number of successes up to time $t$
    - $N_t\sim\text{Bin}(t,p)$
    - $\mathbb E[N_t]=t\cdot p$
  - $T_k$ … time of the $k$-th success
  - $L_k=T_k-T_{k-1}$
    - waiting time
    - memoryless
    - $L_k\sim\text{Geom}(p)$
    - $\mathbb E[L_k]=\frac1p$
    - $P(L_k=\ell)=(1-p)^{\ell-1}\cdot p$
  - $T_k$ properties
    - $T_k=L_1+\dots+L_k$
    - by linearity, $\mathbb E[T_k]=\frac kp$
    - $T_k$ has the Pascal distribution of order $k$
      - $\forall t\geq k:P(T_k=t)={t-1\choose k-1}p^{k}(1-p)^{t-k}$
      - $\forall t\lt k:P(T_k=t)=0$
- Alternative description of a Bernoulli process
  - we can describe the situation by the interarrival times
  - i.i.d. random variables $L_1,L_2,\ldots\sim\text{Geom}(p)$
  - $T_k=L_1+\dots+L_k$
  - $X_t=1$ whenever $T_k=t$ for some $k$
  - clearly, the sequence $X_1,X_2,\dots$ is a Bernoulli process
- Merging of Bernoulli processes
  - two independent Bernoulli processes, $Z_i=X_i\lor Y_i$
  - $BP(p)\,\text{[merged with]}\,BP(q)=BP(1-(1-p)(1-q))=BP(p+q-pq)$
- Splitting of Bernoulli processes
  - $X_i,Y_i,Z_i$
  - $Z_i\sim BP(p)$
  - if $Z_i=0$, then $X_i=Y_i=0$
  - if $Z_i=1$, then with probability $q$ we set $(X_i,Y_i)=(1,0)$, otherwise $(X_i,Y_i)=(0,1)$
  - then $X_i\sim BP(pq)$ and $Y_i\sim BP(p(1-q))$
- Poisson process
  - continuous time
  - $T_1,T_2,T_3,\dots$ are the times of the individual arrivals
  - $N_t$ … number of arrivals up to time $t$
  - $\lambda$ … intensity (how many arrivals per unit of time)
  - $N_t\sim\text{Pois}(\lambda t)$
  - $L_k\sim\text{Exp}(\lambda)$
  - $T_k$ has the Erlang distribution, $\mathbb E[T_k]=\frac k\lambda$
- Poisson process interval independence
  - for any sequence $0\leq t_0\lt t_1\lt\dots\lt t_k$
  - the RVs $N_{t_{i+1}}-N_{t_{i}}$
    - are independent
    - each follows $\text{Pois}(\lambda(t_{i+1}-t_i))$
- Merging of Poisson processes
  - $PP(\lambda)\,\text{[merged with]}\,PP(\lambda')=PP(\lambda+\lambda')$
- Splitting of a Poisson process
  - for each arrival of $PP(\lambda)$, we toss a coin: with probability $p$, the arrival is counted as valid
  - the resulting process is $PP(\lambda p)$
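A simulation sketch of Poisson merging and splitting above: arrivals are generated from $\text{Exp}(\lambda)$ interarrival times, so the long-run arrival rates should match the predicted intensities ($\lambda,\lambda',p$ and the horizon are arbitrary choices).

```python
import random

def poisson_arrivals(lam, horizon):
    """Arrival times of PP(lam) on [0, horizon] via Exp(lam) interarrival times."""
    t, arrivals = 0.0, []
    while True:
        t += random.expovariate(lam)
        if t > horizon:
            return arrivals
        arrivals.append(t)

lam1, lam2, p, horizon = 2.0, 3.0, 0.25, 10_000.0

# merging: PP(lam1) merged with PP(lam2) has rate lam1 + lam2
merged = sorted(poisson_arrivals(lam1, horizon) + poisson_arrivals(lam2, horizon))
print(len(merged) / horizon)    # ~ 5.0

# splitting: keep each arrival independently with probability p -> rate lam1 * p
split = [t for t in poisson_arrivals(lam1, horizon) if random.random() < p]
print(len(split) / horizon)     # ~ 0.5
```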
## Non-parametric Tests

- Permutation test
  - two-sample test
  - A/B testing
    - $A$ … control group (no treatment)
    - $B$ … group with the treatment applied
    - is the difference between $A$ and $B$ only random?
  - we observe … $T(A,B)$
    - $T$ … test statistic
    - example: $T=\bar A-\bar B$
  - $\mathcal F=\set{T(\pi(A\cup B))\mid \pi\in S_{m+n}}$
    - usually the group sizes stay the same: we use $T(\text{first }|A|\text{ elements of }\pi(A\cup B),\,\text{rest of }\pi(A\cup B))$
  - p-value: the fraction of $\mathcal F$ at least as extreme as the observed $T(A,B)$
  - speed-up … we use only $k$ random samples from $S_{m+n}$
  - for a paired test, we only permute $X_i$ with $Y_i$
- Sign test
  - paired test
  - pairs $X_i,Y_i$
  - $Z_i=Y_i-X_i$
  - $H_0$ … $P(X_i\gt Y_i)=\frac12$
  - count the number of pairs where $Z_i\gt 0$ (there was a measurable improvement after the treatment)
  - use $\text{Bin}(n,\frac12)$ to get the p-value
- Wilcoxon signed-rank test
  - paired test
  - procedure
    - let $X_i$ be the difference $B_i-A_i$
    - sort by absolute value, assign ranks ($R_i$)
    - calculate $T$ (or $T^+$ or $T^-$, it does not matter which)
      - $T^+=\sum_{X_i\gt 0}R_i$
      - $T^-=\sum_{X_i\lt 0} R_i$
      - $T=T^+-T^-=\sum_i R_i\cdot\text{sign}(X_i)$
    - compare with the scores for measurements with random $+/-$ signs
  - we use a stronger null hypothesis
    - $H_0:$ median $=0$ & the distribution of $X$ is symmetric
    - note: if $A,B$ have the same continuous distribution, then $X$ is symmetric ([source](https://stats.stackexchange.com/questions/348057/wilcoxon-signed-rank-symmetry-assumption))
- Mann–Whitney U-test
  - two-sample test
  - we have two populations, $X$ and $Y$
    - we get $X_1,\dots,X_m$ and $Y_1,\dots,Y_n$
    - usually, $m\neq n$
  - $S(X_i,Y_j)=\begin{cases} 1 & X_i\gt Y_j\\ \frac12 & X_i=Y_j \\ 0 & X_i\lt Y_j\end{cases}$
  - $U=\sum_{i=1}^m\sum_{j=1}^nS(X_i,Y_j)$
  - the sign test is a paired variant of this
  - we use the same approach as in the permutation test, with $U$ instead of $T=\bar X_m-\bar Y_n$
  - alternative view: we sort $X\cup Y$ and count the pairs where $Y_j$ precedes $X_i$
  - when to use which test?
    - numerical data … real numbers (weight, time, price) → permutation test
    - ordinal data … classes (light/medium/heavy, quick/slow) → $U$-test
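A minimal sketch of the Monte Carlo permutation test described above, with $T=\bar A-\bar B$ and $k$ random permutations instead of all of $S_{m+n}$; the sample data are made up.

```python
import random
from statistics import mean

def permutation_test(A, B, k=10_000):
    """Two-sided Monte Carlo permutation test with T = mean(A) - mean(B)."""
    observed = mean(A) - mean(B)
    pooled = A + B
    extreme = 0
    for _ in range(k):
        random.shuffle(pooled)                 # a random permutation of A u B
        t = mean(pooled[:len(A)]) - mean(pooled[len(A):])
        if abs(t) >= abs(observed):
            extreme += 1
    return extreme / k   # p-value: fraction of permutations at least as extreme

A = [12.1, 11.4, 13.0, 12.8, 11.9]   # control group (made-up data)
B = [13.5, 14.1, 12.9, 14.8, 13.7]   # treatment group (made-up data)
print(permutation_test(A, B))
```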
## Moment Generating Functions and their applications

- Markov inequality
  - for $X$ nonnegative
  - $\forall a\geq 1:P(X\geq a\cdot \mu)\leq\frac1a$
- Chebyshev inequality
  - $\forall a\gt 0:P(|X-\mu|\geq a\cdot\sigma)\leq\frac1{a^2}$
- Chernoff bounds (alternative form)
  - for $X=\sum_{i=1}^nX_i$ where $X_i\sim\text{Ber}(p_i)$ are independent
  - $\mu=\sum_{i=1}^np_i$
  - $\forall\delta\in(0,1]:P(X\geq (1+\delta)\mu)\leq e^{-\mu\delta^2/3}$
  - $\forall\delta\in(0,1):P(X\leq(1-\delta)\mu)\leq e^{-\mu\delta^2/2}$
  - note: there are many more alternative bounds
- Moment generating function
  - given a random variable $X$, we define $M_X:\mathbb R\to\mathbb R$
  - $M_X(s):=\mathbb E(e^{sX})$
  - $\forall X:M_X(0)=\mathbb E(e^{0\cdot X})=1$
- MGF moment theorem
  - theorem: $M_X(s)=\sum_{k=0}^\infty\mathbb E[X^k]\cdot \frac{s^k}{k!}$
    - $\mathbb E[X^k]$ … the $k$-th moment of $X$
  - proof
    - $\mathbb E[e^{sX}]=\mathbb E[\sum_k\frac{(sX)^k}{k!}]=\mathbb E[\sum_k X^k\frac{s^k}{k!}]=\sum_k\mathbb E[X^k\frac{s^k}{k!}]=\sum_{k=0}^\infty\mathbb E[X^k]\cdot \frac{s^k}{k!}$
- MGF of a Bernoulli variable
  - $X\sim\text{Ber}(p)$
  - $M_X(s)=p\cdot e^{s\cdot 1}+(1-p)\cdot e^{s\cdot 0}=1-p+pe^s$
  - note: $e^s=1+s+\frac{s^2}2+\frac{s^3}{3!}+\dots=\sum_{k=0}^\infty\frac{s^k}{k!}$
  - $M_X(s)=1+p(s+\frac{s^2}2+\frac{s^3}{3!}+\dots)$
  - $\mathbb EX=[s^1]M_X(s)=p$
    - this notation selects the coefficient of $s^1$ in the generating function
  - $\mathbb EX^2=[\frac{s^2}{2!}]M_X(s)=p$
- MGF for continuous $X$
  - $M_X(s)=\mathbb E[e^{sX}]\overset{\text{LOTUS}}{=}\int_{-\infty}^\infty e^{sx}f_X(x)\,\text{d}x$
- MGF “linearity” and “sum”
  - theorem: $M_{aX+b}(s)=e^{sb}M_X(as)$
    - proof: $\mathbb E(e^{s(aX+b)})=\mathbb E(e^{asX}\cdot e^{sb})=e^{sb}\cdot\mathbb E(e^{(as)X})$
  - theorem: $X,Y$ independent $\implies M_{X+Y}=M_X\cdot M_Y$
    - proof: $M_{X+Y}(s)=\mathbb E[e^{s(X+Y)}]=\mathbb E[e^{sX}\cdot e^{sY}]=\mathbb E[e^{sX}]\cdot \mathbb E[e^{sY}]=M_X(s)\cdot M_Y(s)$
  - example
    - $X\sim\text{Bin}(n,p)$
    - $M_X(s)=(1-p+pe^s)^n$
- MGF of the normal distribution
  - $X\sim N(\mu,\sigma^2)$
  - $M_X(s)=e^{\mu s}\cdot e^{\sigma^2 s^2/2}$
- MGF equality and convergence
  - equality
    - let $X,Y$ be RVs such that $(\exists\varepsilon\gt 0)(\forall s\in (-\varepsilon,\varepsilon)):M_X(s)=M_Y(s)\in\mathbb R$
    - then $F_X=F_Y$
  - convergence
    - $Y,X_1,X_2,X_3,\dots$ RVs
    - $(\exists\varepsilon\gt 0)(\forall s\in (-\varepsilon,\varepsilon)):\lim_{n\to\infty} M_{X_n}(s)=M_Y(s)$
    - $F_Y$ is continuous
    - then $X_n\xrightarrow d Y$
      - i.e. $\lim_{n\to\infty} F_{X_n}(x)=F_Y(x)$
- Central limit theorem
  - theorem
    - $X_1,X_2,\dots$ i.i.d. (independent identically distributed) RVs
    - $\mathbb EX_i=\mu$, $\text{var}\, X_i=\sigma^2$
    - $Y_n:=\frac{X_1+\dots+X_n-n\mu}{\sqrt{n}\,\sigma}$
    - then $Y_n\xrightarrow d N(0,1)$
      - that is, $\lim_{n\to\infty}F_{Y_n}(t)=\Phi(t)$
  - proof
    - we may assume $\mu=0$
      - otherwise, we would set $X_n'=X_n-\mu$
      - the variance does not change
      - the formula for $Y_n$ also does not change (we subtract $n\mu$ there)
    - $M_{X_i}(s)=1+as+bs^2+O(s^3)\quad(s\to 0)$
      - $a=\mu=0$
      - $b=\frac12\mathbb EX_i^2=\frac12(\sigma^2+\mu^2)=\frac12\sigma^2$
        - from the alternative formula for $\text{var}$
      - $M_{X_i}(s)=1+\frac{\sigma^2}2s^2+O(s^3)$
    - $M_{Y_n}(s)=\prod_i M_{X_i}\left(\frac s{\sqrt n\,\sigma}\right)$
      - because $Y_n=\frac{X_1+\dots+X_n}{\sqrt n\,\sigma}$
    - we need to show that $\lim M_{Y_n}(s)=M_{Y}(s)$ where $Y\sim N(0,1)$
      - i.e. that $\lim_{n\to\infty}M_{X_i}(\frac s{\sqrt n\sigma})^n=e^{s^2/2}$
    - $M_{Y_n}(s)=M_{X_i}(\tfrac s{\sqrt n\sigma})^n=\left(1+\frac{\sigma^2}2\left(\frac s{\sqrt n\sigma}\right)^2+O(n^{-3/2})\right)^n=\left(1+\frac{s^2}{2n}+O(n^{-3/2})\right)^n$
      - we substituted $\frac s{\sqrt n\sigma}$ for $s$ in $M_{X_i}$ (for fixed $s$, the error term is $O(n^{-3/2})$)
    - we know that $(1+\frac xn)^n\to e^x$
    - therefore $\left(1+\frac{s^2/2}{n}+O(n^{-3/2})\right)^n\to e^{s^2/2}$
- Chernoff inequality
  - theorem
    - suppose $X_1,X_2,\dots,X_n$ are i.i.d. and $\pm1$, each with probability $\frac12$
    - $X=\sum_iX_i$
    - we have $\sigma^2=n$
    - $\forall t\gt 0:P(X\geq t)=P(X\leq -t)\leq e^{-t^2/2\sigma^2}$
  - proof
    - $M_{X_i}(s)=\frac{e^{-s}+e^{s}}2$
    - Markov inequality (for $X$ nonnegative): $\forall a\gt 0:P(X\geq a)\leq\frac \mu a$
    - we apply it to $e^{sX}$
      - $\forall s\gt 0:P(X\geq t)=P(e^{sX}\geq e^{st})\leq\frac{\mathbb E(e^{sX})}{e^{st}}$
    - $\frac{\mathbb E(e^{sX})}{e^{st}}=\frac{M_{X_i}(s)^n}{e^{st}}=\frac{(\frac{e^{-s}+e^{s}}2)^n}{e^{st}}$
    - it can be shown that $\frac{e^{-s}+e^{s}}2\leq e^{s^2/2}$
    - therefore $P(X\geq t)\leq \frac{e^{ns^2/2}}{e^{st}}=e^{ns^2/2-st}$
    - we set $s=t/n$
    - then $P(X\geq t)\leq e^{-t^2/2n}$
- Set balancing using the Chernoff inequality
  - we have subsets $S_1,\dots,S_n\subseteq[m]$
    - these may be features like “men”, “old people”, …
  - we want $T\subseteq[m]$ that almost dissects every $S_i$
    - to get two groups with the same (or similar) representation of every feature
  - $D_i=|T\cap S_i|-|S_i\setminus T|$
  - $\text{disc}(T)=\max_i |D_i|$
    - discrepancy
  - use a random $T$!
  - $\forall a_j\in S_i$ we define “indicators”
    - $X_j=1\iff a_j\in T$
    - $X_j=-1\iff a_j\notin T$
  - $D_i=\sum_{j=1}^{|S_i|}X_j$
  - $P(D_i\geq t)\leq e^{-\frac{t^2}{2|S_i|}}\leq e^{-\frac{t^2}{2m}}$
  - $P(|D_i|\geq t)\leq 2e^{-t^2/2m}$
  - $P(\exists i:|D_i|\geq t)\leq n\cdot 2e^{-t^2/2m}\leq \frac2n$
    - for $t=\sqrt{4m\log n}$
  - idea: it is useful to have an inequality (Chernoff) that always holds, so we don't need to worry whether our $n$ is big enough
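A simulation sketch of the set-balancing argument above: generate random feature sets $S_i$, pick a uniformly random $T$ (each element tossed in with probability $\frac12$), and check that the discrepancy stays below $t=\sqrt{4m\log n}$; the set sizes and counts are arbitrary choices.

```python
import math
import random

m, n = 1000, 200   # ground set [m], n feature sets (arbitrary sizes)
sets = [random.sample(range(m), random.randrange(1, m + 1)) for _ in range(n)]

# random T: each element of [m] is included independently with probability 1/2
T = {a for a in range(m) if random.random() < 0.5}

# D_i = |T n S_i| - |S_i \ T|, discrepancy = max_i |D_i|
disc = max(abs(sum(1 if a in T else -1 for a in S)) for S in sets)
t = math.sqrt(4 * m * math.log(n))

print(disc, "<=", t)   # holds with probability >= 1 - 2/n
```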