# Exam

## Markov Chains

- Stochastic process
  - sequence of random variables $X_0,X_1,\dots$ such that “all probabilities are defined”
    - e.g. $P(X_0=1\land X_1=2\land X_3=0)$ is defined
- Markov process/chain
  - stochastic process such that
    - $\exists$ finite/countable set $S$ such that $\forall t:\text{Im}\,X_t\subseteq S$
      - elements of $S$ … states
      - $t$ … time
    - $\forall i,j\in S:\exists p_{ij}\in [0,1]$
  - notes
    - this is a discrete-time Markov chain
    - discrete space … $|S|\leq |\mathbb N|$
    - time-homogeneous … $p_{ij}$ not depending on $t$
  - Markov condition … “forgetting the past”
    - $P(X_{t+1}=a_{t+1}\mid X_t=a_t\land \dots\land X_0=a_0)=P(X_{t+1}=a_{t+1}\mid X_t=a_t)$
- Transition matrix, diagram
  - transition matrix is a matrix $P$ such that $P_{ij}=p_{ij}$
    - observation: every row of $P$ sums to 1
  - transition diagram/graph
    - vertices … $S$
    - directed edges … $\set{(i,j)\in S^2:p_{ij}\gt 0}$
  - a Markov chain is equivalent to a random walk on the transition diagram
    - from state $i$ we go to $j$ with probability $p_{ij}$, independently of the past
- Probability mass function for $X_0,X_1,\dots$
  - $\pi_i^{(t)}=P(X_t=i)$
  - theorem: $\pi^{(t+1)}=\pi^{(t)}\cdot P$
    - where $\pi^{(t)}=(\pi_1^{(t)},\dots,\pi_n^{(t)})$ is a row vector
    - proof: definition of matrix multiplication & law of total probability
  - by induction, we can prove that $\pi^{(k)}=\pi^{(0)}P^k$ and, more generally, $\pi^{(t+k)}=\pi^{(t)}P^k$
- Probability of $k$-step transition … $r_{ij}(k)$
  - $r_{ij}(k)=P(X_{t+k}=j\mid X_t=i)$
  - $r_{ij}(1)=P(X_{t+1}=j\mid X_t=i)=p_{ij}$
- Chapman–Kolmogorov theorem
  - for every Markov chain and all $i,j,k$:
    1. $r_{ij}(k+1)=\sum_{u=1}^n r_{iu}(k)\, p_{uj}$
    2. $r_{ij}(k+\ell)=\sum_{u=1}^n r_{iu}(k)\, r_{uj}(\ell)$
    3. $r_{ij}(k)=(P^k)_{ij}$
  - proof
    - $2\implies 1$ … 1 is a special case of 2 ($r_{uj}(1)=p_{uj}$)
    - $1\implies 3$ … matrix multiplication & induction
    - 2 … we want $r_{ij}(k+\ell)=\sum_{u=1}^n r_{iu}(k)\, r_{uj}(\ell)$
      - $P(X_{t+k+\ell}=j\mid X_t=i)=\sum_{u=1}^nP(X_{t+k}=u\mid X_t=i)\cdot P(X_{t+k+\ell}=j\mid X_{t+k}=u)$
      - this follows from the law of total probability together with the Markov condition
- Accessible states
  - definition: $j$ is accessible from $i$ (where $i,j\in S$)
    - notation: $i\to j$, or $j\in A(i)$
    - $j$ is accessible from $i\equiv\exists t:\underbrace{P(X_t=j\mid X_0=i)}_{r_{ij}(t)}\gt 0$
    - $\iff$ the transition digraph contains a directed walk $i\to j$ (of some length $t$)
- Communicating states
  - definition: $i\leftrightarrow j$ ($i$ and $j$ communicate) $\equiv i\in A(j)\land j\in A(i)$
  - theorem: $\leftrightarrow$ is an equivalence relation
  - proof
    - reflexive: $i\in A(i)$ … $t=0$ in the definition
    - symmetric: the formula $i\in A(j)\land j\in A(i)$ is symmetric
    - transitive: in the digraph, $\exists$ walk $i$ to $j$ and $\exists$ walk $j$ to $k$ $\implies\exists$ walk $i\to k$
  - decomposing the transition digraph into strongly connected components $\equiv$ finding the equivalence classes of $\leftrightarrow$
- Reducibility
  - a Markov chain is irreducible if $\forall i,j\in S:i\leftrightarrow j$
- Time to get to $i\in S$
  - definition: $T_i=\min\set{t\geq 1:X_t=i}$
    - $T_i=\infty$ if the set is empty (there is no such $t$)
    - $T_i$ … random variable
  - $f_{ij}(n)=P(\text{we first reach }j\text{ at time }n\mid X_0=i)=P(T_j=n\mid X_0=i)$
    - we define it for $n\gt 0$
  - $f_{ij}=P(T_j\lt\infty\mid X_0=i)=\sum_{n\geq 1} f_{ij}(n)$
    - probability that we ever reach $j$ if we start from $i$ (after at least one step)
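A quick numeric sanity check of the identities $\pi^{(t+1)}=\pi^{(t)}P$ and $r_{ij}(k)=(P^k)_{ij}$ above; the two-state transition matrix is an arbitrary illustrative choice.

```python
import numpy as np

# an arbitrary two-state transition matrix (rows sum to 1)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

pi0 = np.array([1.0, 0.0])  # PMF of X_0: start in state 0

# evolve the PMF: pi^(t+1) = pi^(t) P
pi = pi0
for _ in range(5):
    pi = pi @ P
print(pi)                                  # pi^(5)
print(pi0 @ np.linalg.matrix_power(P, 5))  # same thing, pi^(0) P^5

# Chapman-Kolmogorov: r_ij(k + l) = sum_u r_iu(k) r_uj(l), i.e. P^(k+l) = P^k P^l
k, l = 3, 2
assert np.allclose(np.linalg.matrix_power(P, k + l),
                   np.linalg.matrix_power(P, k) @ np.linalg.matrix_power(P, l))
```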
- Recurrent state
  - a state $i\in S$ is recurrent if we return to it with probability 1
    - it is transient otherwise
    - in Czech: *rekurentní*, *tranzientní*
  - $i$ is transient
    - $f_{ii}\lt 1$ … probability of ever getting back
    - $\iff P(T_i=\infty\mid X_0=i)\gt 0$
    - $\iff P(\text{we return infinitely often})\lt 1$
  - theorem: $i$ is a recurrent state $\iff\sum_{n=1}^\infty r_{ii}(n)=\infty$
  - theorem: if $i\leftrightarrow j$, then either both $i$ and $j$ are recurrent or both are transient
  - for finite Markov chains
    - $C$ … equivalence class of $\leftrightarrow$ in a finite Markov chain
    - $C$ is recurrent ($\forall i\in C$ is recurrent) $\iff(\forall i \in C)(\forall j\in S):$ if $i\to j$ then $j\in C$ (“$C$ is closed”)
  - definition: $i$ is null recurrent if $i$ is recurrent and $\mathbb E(T_i\mid X_0=i)=\infty$
    - $i$ is positive recurrent if $i$ is recurrent and $\mathbb E(T_i\mid X_0=i)\lt\infty$
    - example: random walk on a line
- Periodic state
  - definition: $i\in S$ has period $d_i:=\gcd\set{t:r_{ii}(t)\gt 0}$
  - $i\in S$ is aperiodic if $d_i=1$
  - example of a periodic Markov chain: two states, we change the state with probability 1
- Properties of a communicating class
  - if $i,j\in S$, $i\leftrightarrow j$, then
    - $d_i=d_j$
    - $i$ is transient $\iff j$ is transient
    - $i$ is recurrent $\iff j$ is recurrent
    - $i$ is null recurrent $\iff j$ is null recurrent
    - $i$ is positive recurrent $\iff j$ is positive recurrent
  - these are properties of the classes of $\leftrightarrow$
- Stationary distribution / steady-state distribution
  - definition: $\pi:S\to [0,1]$ such that $\sum_{i\in S}\pi_i=1$ is called a stationary distribution if $\text{``}\pi P=\pi\text{"}$
    - i.e. $\forall j:\sum_i\pi_i\cdot p_{ij}=\pi_j$
  - if $\pi^{(0)}$ (the PMF of $X_0$) is $\pi$, then $\pi^{(1)}$ is $\pi$ as well (→ steady state)
    - reminder: $\pi^{(1)}=\pi^{(0)}P$
  - theorem
    - if a Markov chain is irreducible, aperiodic and finite, then
      - there exists a unique stationary distribution $\pi$ such that $\pi P=\pi$
      - $\forall i,j:\lim_{t\to\infty}(P^t)_{ij}=\pi_j$
        - $P(X_t=j\mid X_0=i)\doteq\pi_j$ for large $t$
    - actually, the MC does not have to be finite; it suffices that all states are positive recurrent
  - to find $\pi$, solve the system of linear equations $\pi P=\pi$ together with $\sum_{i\in S}\pi_i=1$
    - for every $i$ there is an equation $\pi_i=\sum_j p_{ji}\cdot \pi_j$ (one term for every *incoming* arc)
  - $\pi$ describes the long-term behavior of the MC
    - PageRank (original Google search) … MC model of people browsing the WWW
  - given $\pi$, we can find a MC such that $\pi$ is its stationary distribution; then we can run the MC to generate random objects with distribution $\pi$
    - Markov chain Monte Carlo (MCMC)
- Detailed balance equation
  - a MC may have this property
  - $\forall i\neq j:\pi_iP_{ij}=\pi_jP_{ji}$
  - to imagine this: an ant colony, each ant moving independently according to the Markov chain
    - stationary distribution … the same number of ants at each state at each time – ants don't “accumulate”
  - the detailed balance equation implies $\pi P=\pi$ (but it is stronger than that)
- MCMC sampling
  - Markov chain Monte Carlo
  - Metropolis–Hastings method
    - we construct a MC from the probability distribution we want
    - choose an aperiodic irreducible digraph
    - use the detailed balance equation
      - $p_{ij}=\min\set{1,\frac{\pi _j}{\pi_i}}\cdot C$
      - $p_{ji}=\min\set{1,\frac{\pi_i}{\pi_j}}\cdot C$
      - choose $C$ such that $\forall i:\sum_{j\neq i} p_{ij}\lt 1$
      - define $p_{ii}=1-\sum_{j\neq i}p_{ij}\gt 0$
        - $\implies d_i=1$
    - tune the process to make convergence fast
    - we run the MC for long enough
  - alternative sampling approach: rejection sampling
  - MCMCs are useful in Bayesian statistics – we only need the ratios $\frac{\pi_j}{\pi_i}$, so the normalizing constants cancel out (and we don't have to compute them)
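A minimal sketch of the Metropolis method above for a finite target distribution. The target weights and the cycle-shaped proposal digraph are my own illustrative choices; since the proposal is symmetric, the acceptance probability reduces to $\min\set{1,\frac{\pi_j}{\pi_i}}$, and only ratios enter, so the weights need not be normalized.

```python
import random

# unnormalized target weights over states 0..n-1 (arbitrary example)
w = [1, 2, 3, 4, 5, 6]
n = len(w)

def metropolis_sample(steps=100_000):
    """Random walk on a cycle (symmetric proposal), Metropolis acceptance."""
    counts = [0] * n
    i = 0
    for _ in range(steps):
        j = (i + random.choice([-1, 1])) % n         # propose a neighbor on the cycle
        if random.random() < min(1.0, w[j] / w[i]):  # accept with prob min(1, pi_j / pi_i)
            i = j
        counts[i] += 1                               # rejections count as self-loops p_ii
    return [c / steps for c in counts]

print(metropolis_sample())         # empirical distribution after many steps
print([x / sum(w) for x in w])     # target pi (normalized only for comparison)
```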
- Absorbing state
  - $i\in S$ is an absorbing state if $p_{ii}=1$
  - $A$ … set of absorbing states
  - which absorbing state do we end at? will it be state $0\in A$?
    - $a_i=P(\exists t:X_t=0\mid X_0=i)$
    - theorem: $(a_i)_{i\in S}$ is the unique solution to
      - $a_0=1$
      - $a_i=0$ if $i\in A,\,i\neq 0$
      - $a_i=\sum_j p_{ij}\cdot a_j$ otherwise
  - what is the expected number of steps to get to an absorbing state?
    - $\mu_i=\mathbb E(T\mid X_0=i)$
      - $T=\min\set{t: X_t\in A}$
    - theorem: $(\mu_i)_{i\in S}$ is the unique solution to
      - $\mu_i=0$ if $i\in A$
      - $\mu_i=1+\sum_j p_{ij}\cdot\mu_j$ if $i\notin A$
  - example: drunk person on their way home
    - $A=\set{0}$
    - $\mu_0=0$
    - $\mu_1=1+\frac12\mu_0+\frac12\mu_2$
    - $\mu_2=1+\frac12\mu_1+\frac12\mu_3$
    - …
    - $\mu_{n-1}=1+\frac12\mu_{n-2}+\frac12\mu_{n}$
    - $\mu_n=1+\mu_{n-1}$
    - solution
      - $\mu_1=2n-1$
      - $\mu_n=n^2$
      - $\mu_{i}\leq n^2$
    - how to compute it
      - we start from the bottom up, by combining the last two equations
      - we get $\mu_{n-1}=3+\mu_{n-2}$
      - then we get $\mu_{n-2}=5+\mu_{n-3}$
      - continuing this way, we get to $\mu_1=2n-1+\mu_{0}=2n-1$
      - we see that $\mu_2=2n-3+\mu_1$
      - we can substitute for $\mu_1$
      - when we get back to the bottom, we have $\mu_n=1+3+5+\dots+(2n-1)=n^2$
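The linear system for the $\mu_i$ can be solved mechanically. A small NumPy sketch for the drunkard's-walk example (my encoding: states $0,\dots,n$, state $0$ absorbing, inner states step left/right with probability $\frac12$, state $n$ always steps to $n-1$), verifying $\mu_1=2n-1$ and $\mu_n=n^2$:

```python
import numpy as np

n = 10  # drunkard's walk on states 0..n, state 0 (home) is absorbing

P = np.zeros((n + 1, n + 1))
P[0, 0] = 1.0                      # absorbing state
for i in range(1, n):
    P[i, i - 1] = P[i, i + 1] = 0.5
P[n, n - 1] = 1.0                  # reflecting end of the street

# mu_i = 1 + sum_j p_ij mu_j for transient i, and mu_0 = 0
# => (I - Q) mu = 1, where Q is P restricted to the transient states 1..n
Q = P[1:, 1:]
mu = np.linalg.solve(np.eye(n) - Q, np.ones(n))

print(mu[0], 2 * n - 1)   # mu_1 = 2n - 1
print(mu[-1], n ** 2)     # mu_n = n^2
```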
- 2-SAT problem
  - input: $\varphi=(x_1\lor x_2)\land(x_3\lor\neg x_1)\land\ldots$
    - clauses with exactly 2 literals
  - output: a satisfying assignment OR “unsatisfiable”
  - there exists a polynomial algorithm
  - we will show a randomized algorithm
    1. arbitrarily initialize $(x_1,\dots,x_n)$
    2. while $\varphi$ has an unsatisfied clause
       - choose one of the unsatisfied clauses and flip one of its variables (at random)
    3. return $(x_1,\dots,x_n)$
  - we repeat step (2) at most $2mn^2$ times
    - then, if $\varphi$ is still unsatisfied, we return “unsatisfiable”
    - this may introduce errors
  - theorem: the algorithm makes an error with probability $\leq 2^{-m}$
  - proof
    - $x_1^*,\dots,x_n^*$ … one of the satisfying assignments
    - $D_t$ … the number of $i$ such that $x_i\neq x_i^*$ at time $t$
      - $t=0,1,\dots,2mn^2$
      - $0\leq D_t\leq n$
      - $D_t=0\implies$ we found a solution
    - situation
      - we assume that clause $(x_1\lor x_2)$ is unsatisfied at time $t$ and we choose it
      - $\implies x_1=x_2=F$
      - $\implies x_1^*\lor x_2^*=T$
      - we randomly flip $x_1$ or $x_2$, which increases or decreases $D$ by one
      - if $x_1^*=T$ and $x_2^*=F$ (or vice versa), the random flip decreases or increases $D$ by one (each with probability $\frac12$)
        - we will call these the “good cases”
      - if $x_1^*=x_2^*=T$, then the random flip always decreases $D$, which is nice, but it means that we cannot model $D$ by a Markov chain
        - this is the “bad case”
    - because of that, we define $D'$ which always decreases or increases with probability $\frac12$
      - in the good cases, it behaves the same way as $D$
      - in the bad case, we flip a coin to decide whether it should go up or down
    - clearly $D\leq D'$ and $T\leq T'$ (where $T$ is the first time with $D=0$, and similarly for $T',D'$)
    - $D'$ behaves like the drunkard's walk, so $\mathbb ET'\leq n^2$, which means that $\mathbb ET\leq n^2$
    - using the Markov inequality, we get $P(T\geq 2n^2)\leq\frac{\mathbb ET}{2n^2}\leq\frac{n^2}{2n^2}=\frac12$
    - we repeat the whole process $m$ times; that way we get failure probability $\leq 2^{-m}$ and running time $O(mn^4)$
      - there are $O(n^2)$ clauses of two literals, and we need to find an unsatisfied one $2mn^2$ times
  - we can get a randomized 3-SAT algorithm with $(\frac43)^n$ time complexity this way
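A minimal sketch of the randomized walk described above. The clause encoding (signed 1-indexed literals) is my own, and for simplicity it rescans all clauses in every step instead of using a cleverer data structure; the step budget is taken as a parameter.

```python
import random

def random_walk_2sat(n, clauses, max_steps):
    """clauses: list of pairs of literals; literal +i / -i means x_i / not x_i (1-indexed)."""
    x = [random.choice([False, True]) for _ in range(n + 1)]  # x[1..n], x[0] unused

    def satisfied(lit):
        return x[abs(lit)] == (lit > 0)

    for _ in range(max_steps):
        unsat = [c for c in clauses if not (satisfied(c[0]) or satisfied(c[1]))]
        if not unsat:
            return x[1:]                               # satisfying assignment found
        lit = random.choice(random.choice(unsat))      # random unsatisfied clause, random literal
        x[abs(lit)] = not x[abs(lit)]                  # flip that variable
    return None                                        # "unsatisfiable" (may err with small prob.)

# (x1 or x2) and (not x1 or x2) and (x1 or not x2)  ->  x1 = x2 = True
print(random_walk_2sat(2, [(1, 2), (-1, 2), (1, -2)], max_steps=100))
```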
- Hidden Markov Model (HMM)
  - there is some unobservable $(X_t)$ … a Markov chain
  - we can observe $(Y_t)$ … $Y_t$ is somehow obtained from $X_t$
  - widely applicable
  - smart algorithm (Viterbi algorithm)

## Bayesian Statistics

- Probabilistic method
  - we prove that some graph exists by showing that a random graph has a certain property with probability greater than zero
  - $G(n,p)$ … graph with vertices $1,\dots,n$
    - $\forall i,j:i\sim j$ with probability $p$ (all independently)
  - $P(G(n,\frac12)$ has no $K_k$ nor $\overline K_k$ as an induced subgraph$)\gt 0$ if $n\leq 2^{k/2}$
    - thus $\exists G$ on $2^{k/2}$ vertices with no induced $K_k$ nor $\overline K_k$
    - Ramsey theorem
- Bayes theorem
  - let $X,\Theta$ be discrete random variables
  - $p_{\Theta\mid X}(\theta\mid x)=\frac{p_{X\mid\Theta}(x\mid\theta)\cdot p_\Theta(\theta)}{\sum_{\theta'\in\text{Im}\,\Theta} p_{X\mid\Theta}(x\mid\theta')\cdot p_\Theta(\theta')}$
  - similar for continuous RVs
  - note: terms with $p_{\Theta}(\theta')=0$ are considered to be zero
- Bayesian statistics – general approach
  - $\Theta$ … possible worlds
  - prior distribution $p_\Theta(\theta)$
    - example: with probability $\frac12$ the coin is fake (has heads on both sides)
      - $p_\Theta(\text{real})=\frac12$
      - $p_\Theta(\text{fake})=\frac12$
  - $X$ … measurement/observation
    - example: we got heads
  - posterior distribution
    - $p_{\Theta\mid X}(\theta\mid x)=\frac{ p_\Theta(\theta)\cdot p_{X\mid\Theta}(x\mid\theta)}{\sum_{t\in\text{Im}\,\Theta} p_\Theta(t)\cdot p_{X\mid\Theta}(x\mid t)}$
    - example: $p_{\Theta\mid X}(\text{fake}\mid\text{heads})=\frac{\frac12\cdot 1}{\frac12\cdot1+\frac12\cdot\frac12}=\frac23$
    - alternatively: which “heads” did we get?
      - one side of the fake coin
      - the other side of the fake coin
      - one side of the real coin
      - two of the three equally likely “worlds” correspond to the fake coin → $p_{\Theta\mid X}(\text{fake}\mid\text{heads})=\frac23$
  - $\hat\theta=\text{fake}$ … MAP estimate
- Bayesian estimates
  - point estimates
    - MAP (maximum a posteriori)
      - we choose $\hat\theta$ to maximize $p_{\Theta\mid X}$ … we replace the random variable by its mode
    - LMS (“least squares”)
      - we choose $\hat\theta=\mathbb E(\Theta\mid X=x)$ … we replace the random variable by its mean
      - we get an unbiased point estimate that has the smallest possible LMS (least mean square) error
  - interval estimates
    - credible sets $S$ (similar to the confidence intervals used in classical statistics)
  - note: often, the constant in the denominator does not matter
- MAP (maximum a posteriori)
  - $\hat\theta_\text{MAP}=\text{argmax}_\theta\,p_{\Theta\mid X}(\theta\mid x)$
  - $\hat\theta$ is a point estimate for $\Theta$
  - it equals $\text{argmax}_\theta\,p_\Theta(\theta)\cdot p_{X\mid\Theta}(x\mid\theta)$
    - we can ignore the normalizing constant
  - we are replacing the random variable by its mode
  - note: maximum likelihood is similar, but it uses $\text{argmax}_\theta\,p_{X\mid\Theta}(x\mid\theta)$
    - it assumes a “flat prior” (uniform or discrete uniform)
- Naive Bayes classifier
  - let $x=(x_1,\dots,x_n)$
    - in machine learning … a vector of features
  - we need $p_{X\mid\Theta}(x\mid\theta)$ to get the MAP estimate
  - “naive assumption”: $p(x\mid\theta)=p(x_1\mid\theta)\cdot p(x_2\mid\theta)\cdot\ldots\cdot p(x_n\mid\theta)$
- LMS point estimate
  - LMS = least mean square
  - the estimate such that $\mathbb E((\Theta-\hat\theta)^2\mid X=x)$ is minimal
  - to compute it
    - $\hat\theta_\text{LMS}=\mathbb E(\Theta\mid X=x)=\sum_{t\in\text{Im}\,\Theta}t\cdot p_{\Theta\mid X}(t\mid x)$
    - we find the mean of the random variable
  - proof
    - $\mathbb E((\Theta-\hat\theta)^2\mid X=x)=\mathbb E(\Theta^2\mid X=x)-\mathbb E(2\Theta\hat\theta\mid X=x)+\mathbb E(\hat\theta^2\mid X=x)$
    - we want to minimize this
    - the first term does not depend on $\hat\theta$, so we can drop it
    - we are left with $-2\hat\theta\,\mathbb E(\Theta\mid X=x)+\hat\theta^2$
    - to find the minimum, we set the derivative equal to zero
      - $-2\mathbb E(\Theta\mid X=x)+2\hat\theta=0$
    - therefore $\hat\theta=\mathbb E(\Theta\mid X=x)$, as we wanted
- Beta distribution
  - $f_\Theta(x)=c\cdot x^{\alpha-1}(1-x)^{\beta-1}$ for $x\in[0,1]$
    - where $c$ is a normalizing constant
  - properties
    - $f_\Theta$ is maximal for $\theta=\frac{\alpha-1}{\alpha+\beta-2}$
    - $\mathbb E\Theta=\frac\alpha{\alpha+\beta}$
- Conjugate distributions
  - Beta (prior) & binomial (likelihood) → Beta (posterior)
    - “Beta and Beta are conjugate distributions with respect to the binomial distribution”
  - normal (prior) & normal (likelihood) → normal (posterior)
  - both were proved in the lecture
  - it is an algebraic convenience; it does not work for all distributions
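A small numeric sketch of the Beta–binomial conjugacy above: with a $\text{Beta}(\alpha,\beta)$ prior on the heads probability and $k$ heads observed in $n$ tosses, the posterior is $\text{Beta}(\alpha+k,\beta+n-k)$. The prior parameters and data below are made up, and SciPy is assumed for the densities.

```python
import numpy as np
from scipy.stats import beta, binom

a, b = 2.0, 2.0    # Beta prior parameters (arbitrary)
n, k = 10, 7       # made-up data: 7 heads in 10 tosses

theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]

# posterior ~ prior * likelihood, normalized numerically on the grid
unnorm = beta.pdf(theta, a, b) * binom.pmf(k, n, theta)
post_grid = unnorm / (unnorm.sum() * dtheta)

# conjugacy: the posterior is Beta(a + k, b + n - k)
post_exact = beta.pdf(theta, a + k, b + n - k)

print(np.abs(post_grid - post_exact).max())   # small (grid error only)
print("MAP:", (a + k - 1) / (a + b + n - 2))  # posterior mode
print("LMS:", (a + k) / (a + b + n))          # posterior mean
```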
## Conditional Expectation

- Conditional independence
  - events
    - $A\perp B\iff P(A\cap B)=P(A)\cdot P(B)$
    - $A\perp_C B\iff P(A\cap B\mid C)=P(A\mid C)\cdot P(B\mid C)$
      - $A,B$ are independent conditionally given $C$
  - it is possible (even typical) that $A\perp_C B$ and $A\perp_{C^C} B$, but not $A\perp B$
- Conditional expectation
  - $\mathbb E(Y\mid X=x)$ vs. $\mathbb E(Y\mid X)$
    - $\mathbb E(Y\mid X=x)=g(x)$ … a number
    - $\mathbb E(Y\mid X)=g(X)$ … a random variable
- Law of iterated expectation
  - $\mathbb E(\mathbb E(Y\mid X))=\mathbb E(Y)$
  - proof
    - $\mathbb E(\mathbb E(Y\mid X))=\mathbb E(g(X))=\sum_x P(X=x)\cdot g(x)$
      - by LOTUS
    - $\mathbb E(Y)=\sum_x P(X=x)\cdot\mathbb E(Y\mid X=x)=\sum_x P(X=x)\cdot g(x)$
      - by the law of total probability
- Estimate error
  - basic task of statistics
    - estimate one quantity ($Y$) given data/measurement ($X$)
  - example: groups of students and their exam results
    - $X$ … group of a student
      - $X=2$ … we are currently looking at the second group
    - $Y$ … students' exam results
    - $\mathbb E(Y\mid X=3)$ … the expectation of the results in group 3
  - estimator
    - $\hat Y=\mathbb E(Y\mid X)=g(X)$
  - we want the estimate error
    - $\tilde Y=\hat Y-Y$
  - what is $\mathbb E(\tilde Y\mid X)$ equal to?
    - $\mathbb E(\tilde Y\mid X)=\mathbb E(\hat Y-Y\mid X)=\mathbb E(\hat Y\mid X)-\mathbb E(Y\mid X)=\hat Y-\hat Y=0$
      - because $\mathbb E(\hat Y\mid X)=\mathbb E(\mathbb E(Y\mid X)\mid X)=\mathbb E(Y\mid X)$
    - $\mathbb E(\tilde Y)=\mathbb E(\mathbb E(\tilde Y\mid X))=0$
      - by the law of iterated expectation
- Estimate error covariance and variance
  - $\text{cov}(\tilde Y,\hat Y)=\mathbb E(\tilde Y\hat Y)-\underbrace{\mathbb E(\tilde Y)}_0\mathbb E(\hat Y)=\mathbb E(\mathbb E(\tilde Y\hat Y\mid X))=\mathbb E(\hat Y\underbrace{\mathbb E(\tilde Y\mid X)}_0)=0$
    - the penultimate equality holds because $\hat Y$ is a constant once $X=x$ is fixed
  - $\tilde Y,\hat Y$ are uncorrelated
  - note
    - uncorrelated $\impliedby$ independent
    - uncorrelated $\;\;\not\!\!\!\implies$ independent
- Law of iterated variance / Eve's rule
  - $\text{var}\,Y=\mathbb E(\text{var}(Y\mid X))+\text{var}(\mathbb E(Y\mid X))$
    - intragroup variance + intergroup variance
  - proof
    - $Y=\hat Y-\tilde Y$
    - $\text{var}\,Y=\text{var}(\hat Y)+\text{var}(\tilde Y)-\underbrace{2\,\text{cov}(\hat Y,\tilde Y)}_0=\text{var}(\hat Y)+\text{var}(\tilde Y)$
    - $\text{var}(\tilde Y)=\mathbb E(\tilde Y^2)-\underbrace{(\mathbb E\tilde Y)^2}_0=\mathbb E(\mathbb E(\tilde Y^2\mid X))$
    - $=\mathbb E(\mathbb E((Y-\hat Y)^2\mid X))=\mathbb E(\text{var}(Y\mid X))$
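A quick exact check of the law of iterated expectation and Eve's rule on a small made-up discrete example (two groups of students with different result distributions); the joint PMF is an arbitrary choice.

```python
from collections import defaultdict

# made-up joint PMF of (X, Y): X = group, Y = exam result
pmf = {(1, 60): 0.2, (1, 80): 0.2, (2, 50): 0.1, (2, 70): 0.3, (2, 90): 0.2}

def E(f):  # expectation of f(x, y) under the joint PMF
    return sum(p * f(x, y) for (x, y), p in pmf.items())

pX = defaultdict(float)
for (x, y), p in pmf.items():
    pX[x] += p

# g(x) = E(Y | X = x), v(x) = var(Y | X = x)
g = {x: sum(p * y for (x2, y), p in pmf.items() if x2 == x) / pX[x] for x in pX}
v = {x: sum(p * (y - g[x]) ** 2 for (x2, y), p in pmf.items() if x2 == x) / pX[x] for x in pX}

EY = E(lambda x, y: y)
varY = E(lambda x, y: (y - EY) ** 2)

# law of iterated expectation: E(E(Y|X)) = E(Y)
print(sum(pX[x] * g[x] for x in pX), EY)
# Eve's rule: var Y = E(var(Y|X)) + var(E(Y|X))
print(sum(pX[x] * v[x] for x in pX) + sum(pX[x] * (g[x] - EY) ** 2 for x in pX), varY)
```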
## Balls & Bins

- Really useful approximation
  - $1-x\approx e^{-x}$
  - example usage: birthday paradox
    - $(1-\frac1{365})(1-\frac2{365})\cdots(1-\frac{k-1}{365})\approx e^{-\frac{k(k-1)}{2\cdot 365}}$
      - because $1-\frac1{365}\approx e^{-\frac1{365}}$ etc.
- Union bound
  - for events $A_1,A_2,\ldots$ it holds that $P(\bigcup A_i)\leq \sum P(A_i)$
  - therefore $P(\text{none of }A_i)\geq 1-\sum P(A_i)$
- Balls and bins model
  - we throw $m$ balls randomly into $n$ bins
  - $X_i$ … number of balls in bin $i$
- How many bins are empty?
  - $P(X_i=0)=(1-\frac1n)^m\approx e^{-\frac mn}$
  - $I_i=1$ if $X_i=0$ (otherwise $I_i=0$)
  - $\mathbb E(\sum_i I_i)=\sum_i\mathbb EI_i\approx ne^{-\frac mn}$
- How many balls are in bin $i$?
  - $X_i\sim \text{Bin}(m,\frac 1n)\approx\text{Pois}(\frac mn)$
  - $\mathbb EX_i=\frac mn$
- Applications – bucket sort and hashing
  - (see the lecture)
- Max-load likely upper bound
  - definition: max-load … $\max\set{X_1,\dots,X_n}$
  - we assume $m=n$ in the max-load bounds
  - theorem: for large enough $n$ we have $P(\text{max-load}\geq\frac{3\log n}{\log\log n})\leq\frac1n$
  - proof
    - $X_i$ … number of balls in bin $i$
    - $P(X_i\geq M)\leq{n\choose M}{1\over n^M}$
      - probability that bin $i$ gets at least $M$ balls
      - over all $M$-tuples of balls, we sum the probability that the whole tuple falls into bin $i$ (union bound)
    - $B_i$ … the event $X_i\geq M$
    - $P(\bigcup_i B_i)=P(\max_i X_i\geq M)$
    - $P(\bigcup_i B_i)\leq\sum_{i=1}^n P(B_i)\leq n{n\choose M}{1\over n^M}\leq n({e\over M})^M$
      - union bound again, then an estimate of the factorial
    - we need to show that $n({e\over M})^M\overset?\lt \frac1n$
      - $\log\left(n^2({e\over M})^M\right)\overset?\lt \log 1$
      - $2L+M(1-\log M)\overset?\lt 0$
        - where $L$ is $\log n$ ($LL$ is $\log\log n$ etc.)
      - we substitute $M=\frac{3L}{LL}$
      - $2L+\frac{3L}{LL}(1-(\log 3+LL-LLL))\overset?\lt 0$
      - $2L+\frac{3L(1-\log3-LL+LLL)}{LL}\overset?\lt 0$
      - $-L+\frac{3L(1-\log 3)}{LL}+\frac{3L\cdot LLL}{LL}\overset?\lt 0$
      - the second and third terms are much smaller than $L$
      - therefore it holds
- Exact case vs. Poisson case
  - in the Poisson case, we pretend the bin loads are independent $\text{Pois}(\frac mn)$ variables
  - theorem: any event that happens with probability $\leq p$ in the Poisson case happens with probability $\leq p\cdot e\sqrt m$ in the exact case
- Max-load likely lower bound
  - theorem: for large enough $n$ we have $P(\text{max-load}\leq\frac{\log n}{\log\log n})\leq\frac1n$
  - proof
    - Poisson case
      - $P(X_i\lt M)\leq 1-P(X_i=M)=1-e^{-\lambda}\frac{\lambda^M}{M!}=1-\frac1{eM!}$
        - $\lambda=\frac mn=1$
      - $P(\max_i X_i\lt M)=P(X_1\lt M)\cdot\ldots\cdot P(X_n\lt M)$
        - $\leq(1-\frac1{eM!})^n\leq(e^{-\frac1{eM!}})^n=\exp(-\tfrac n{eM!})=:p$
    - exact case
      - $P(\max_i X_i\lt M)\leq p\cdot e\sqrt{m}=e\sqrt n\cdot \exp(-\tfrac n{eM!})$
      - we want $e\sqrt n\cdot \exp(-\tfrac n{eM!})\overset?\lt\frac1n$
        - $\exp(-\tfrac n{eM!})\overset?\lt\frac1{n^2}$
        - $\frac n{eM!}\overset?\gt 2\log n$
        - $M!\overset?\lt\frac{n}{2e\log n}$
        - $\log M!\overset?\lt L-LL-\log 2e$
      - we know $M!\leq M(\frac Me)^M$
        - $\log M!\leq\log M+M(\log M-1)$
        - $\leq LL-LLL+\frac L{LL}(LL-LLL-1)$
          - here we substituted $M=\frac L{LL}$
        - $\leq L-L\frac{LLL}{LL}+LL\leq L-LL-c$
          - as $2LL\leq L\frac{LLL}{LL}$ (it's not obvious, but it holds)
      - therefore $\log M!\leq L-LL-c$
      - that's what we wanted to prove
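A small simulation sketch of the estimates above: the empty-bin count should be close to $ne^{-m/n}$, and with $m=n$ the max-load typically lands between $\frac{\log n}{\log\log n}$ and $\frac{3\log n}{\log\log n}$ (the value of $n$ is an arbitrary choice).

```python
import math
import random

n = m = 100_000
loads = [0] * n
for _ in range(m):                       # throw m balls into n bins uniformly
    loads[random.randrange(n)] += 1

# empty bins vs the approximation n * e^{-m/n}
print(sum(1 for x in loads if x == 0), n * math.exp(-m / n))

# max-load vs the likely lower/upper bounds (here m = n)
lo = math.log(n) / math.log(math.log(n))
print(lo, max(loads), 3 * lo)
```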
## Stochastic Processes

- Bernoulli process
  - an infinite sequence of independent identically distributed Bernoulli trials
  - a sequence of RVs $X_1,X_2,\dots$ that are independent and each follows the $\text{Ber}(p)$ distribution
  - $X_k=1$ … success/arrival at time $k$
- Quantities of a Bernoulli process
  - $N_t$ … number of successes up to time $t$
    - $N_t\sim\text{Bin}(t,p)$
    - $\mathbb E[N_t]=t\cdot p$
  - $T_k$ … time of the $k$-th success
  - $L_k=T_k-T_{k-1}$
    - waiting time
    - memoryless
    - $L_k\sim\text{Geom}(p)$
    - $\mathbb E[L_k]=\frac1p$
    - $P(L_k=\ell)=(1-p)^{\ell-1}\cdot p$
  - $T_k$ properties
    - $T_k=L_1+\dots+L_k$
    - by linearity, $\mathbb E[T_k]=\frac kp$
    - $T_k$ has the Pascal distribution of order $k$
      - $\forall t\geq k:P(T_k=t)={t-1\choose k-1}p^{k}(1-p)^{t-k}$
      - $\forall t\lt k:P(T_k=t)=0$
- Alternative description of a Bernoulli process
  - we can describe the situation by the interarrival times
  - i.i.d. random variables $L_1,L_2,\ldots\sim\text{Geom}(p)$
  - $T_k=L_1+\dots+L_k$
  - $X_t=1$ whenever $T_k=t$ for some $k$
  - clearly, the sequence $X_1,X_2,\dots$ is a Bernoulli process
- Merging of Bernoulli processes
  - two independent Bernoulli processes, $Z_i=X_i\lor Y_i$
  - $BP(p)\,\text{[merged with]}\,BP(q)=BP(1-(1-p)(1-q))=BP(p+q-pq)$
- Splitting of Bernoulli processes
  - $X_i,Y_i,Z_i$
  - $Z_i\sim BP(p)$
  - if $Z_i=0$, then $X_i=Y_i=0$
  - if $Z_i=1$, then with probability $q$ we set $(X_i,Y_i)=(1,0)$, otherwise $(X_i,Y_i)=(0,1)$
  - then $X_i\sim BP(pq)$ and $Y_i\sim BP(p(1-q))$
- Poisson process
  - continuous time
  - $T_1,T_2,T_3,\dots$ are the times of the individual arrivals
  - $N_t$ … number of arrivals up to time $t$
  - $\lambda$ … intensity (how many arrivals per unit of time)
  - $N_t\sim\text{Pois}(\lambda t)$
  - $L_k\sim\text{Exp}(\lambda)$
  - $T_k$ has the Erlang distribution, $\mathbb E[T_k]=\frac k\lambda$
- Poisson process interval independence
  - for any sequence $0\leq t_0\lt t_1\lt\dots\lt t_k$
  - the RVs $N_{t_{i+1}}-N_{t_{i}}$
    - are independent
    - each follows $\text{Pois}(\lambda(t_{i+1}-t_i))$
- Merging of Poisson processes
  - $PP(\lambda)\,\text{[merged with]}\,PP(\lambda')=PP(\lambda+\lambda')$
- Splitting of a Poisson process
  - for each arrival of $PP(\lambda)$, we toss a coin: with probability $p$, the arrival is counted as valid
  - the resulting process is $PP(\lambda p)$
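A simulation sketch of Poisson merging and splitting above: arrivals are generated from $\text{Exp}(\lambda)$ interarrival times, so the long-run arrival rates should match the predicted intensities ($\lambda,\lambda',p$ and the horizon are arbitrary choices).

```python
import random

def poisson_arrivals(lam, horizon):
    """Arrival times of PP(lam) on [0, horizon] via Exp(lam) interarrival times."""
    t, arrivals = 0.0, []
    while True:
        t += random.expovariate(lam)
        if t > horizon:
            return arrivals
        arrivals.append(t)

lam1, lam2, p, horizon = 2.0, 3.0, 0.25, 10_000.0

# merging: PP(lam1) merged with PP(lam2) has rate lam1 + lam2
merged = sorted(poisson_arrivals(lam1, horizon) + poisson_arrivals(lam2, horizon))
print(len(merged) / horizon)    # ~ 5.0

# splitting: keep each arrival independently with probability p -> rate lam1 * p
split = [t for t in poisson_arrivals(lam1, horizon) if random.random() < p]
print(len(split) / horizon)     # ~ 0.5
```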
## Non-parametric Tests

- Permutation test
  - two-sample test
  - A/B testing
    - $A$ … control group (no treatment)
    - $B$ … group with the treatment applied
    - is the difference between $A$ and $B$ only random?
  - we observe … $T(A,B)$
    - $T$ … test statistic
    - example: $T=\bar A-\bar B$
  - $\mathcal F=\set{T(\pi(A\cup B))\mid \pi\in S_{m+n}}$
    - usually the group sizes stay the same: we use $T(\text{first }|A|\text{ elements of }\pi(A\cup B),\,\text{rest of }\pi(A\cup B))$
  - p-value: the fraction of $\mathcal F$ at least as extreme as the observed $T(A,B)$
  - speed-up … we use only $k$ random samples from $S_{m+n}$
  - for a paired test, we only permute $X_i$ with $Y_i$
- Sign test
  - paired test
  - pairs $X_i,Y_i$
  - $Z_i=Y_i-X_i$
  - $H_0$ … $P(X_i\gt Y_i)=\frac12$
  - count the number of pairs where $Z_i\gt 0$ (there was a measurable improvement after the treatment)
  - use $\text{Bin}(n,\frac12)$ to get the p-value
- Wilcoxon signed-rank test
  - paired test
  - procedure
    - let $X_i$ be the difference $B_i-A_i$
    - sort by absolute value, assign ranks ($R_i$)
    - calculate $T$ (or $T^+$ or $T^-$, it does not matter which)
      - $T^+=\sum_{X_i\gt 0}R_i$
      - $T^-=\sum_{X_i\lt 0} R_i$
      - $T=T^+-T^-=\sum_i R_i\cdot\text{sign}(X_i)$
    - compare with the scores for measurements with random $+/-$ signs
  - we use a stronger null hypothesis
    - $H_0:$ median $=0$ & the distribution of $X$ is symmetric
    - note: if $A,B$ have the same continuous distribution, then $X$ is symmetric ([source](https://stats.stackexchange.com/questions/348057/wilcoxon-signed-rank-symmetry-assumption))
- Mann–Whitney U-test
  - two-sample test
  - we have two populations, $X$ and $Y$
    - we get $X_1,\dots,X_m$ and $Y_1,\dots,Y_n$
    - usually, $m\neq n$
  - $S(X_i,Y_j)=\begin{cases} 1 & X_i\gt Y_j\\ \frac12 & X_i=Y_j \\ 0 & X_i\lt Y_j\end{cases}$
  - $U=\sum_{i=1}^m\sum_{j=1}^nS(X_i,Y_j)$
  - the sign test is a paired variant of this
  - we use the same approach as in the permutation test, with $U$ instead of $T=\bar X_m-\bar Y_n$
  - alternative view: we sort $X\cup Y$ and count the pairs where $Y_j$ precedes $X_i$
  - when to use which test?
    - numerical data … real numbers (weight, time, price) → permutation test
    - ordinal data … classes (light/medium/heavy, quick/slow) → $U$-test
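A minimal sketch of the Monte Carlo permutation test described above, with $T=\bar A-\bar B$ and $k$ random permutations instead of all of $S_{m+n}$; the sample data are made up.

```python
import random
from statistics import mean

def permutation_test(A, B, k=10_000):
    """Two-sided Monte Carlo permutation test with T = mean(A) - mean(B)."""
    observed = mean(A) - mean(B)
    pooled = A + B
    extreme = 0
    for _ in range(k):
        random.shuffle(pooled)                 # a random permutation of A u B
        t = mean(pooled[:len(A)]) - mean(pooled[len(A):])
        if abs(t) >= abs(observed):
            extreme += 1
    return extreme / k   # p-value: fraction of permutations at least as extreme

A = [12.1, 11.4, 13.0, 12.8, 11.9]   # control group (made-up data)
B = [13.5, 14.1, 12.9, 14.8, 13.7]   # treatment group (made-up data)
print(permutation_test(A, B))
```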
## Moment Generating Functions and their applications

- Markov inequality
  - for $X$ nonnegative
  - $\forall a\geq 1:P(X\geq a\cdot \mu)\leq\frac1a$
- Chebyshev inequality
  - $\forall a\gt 0:P(|X-\mu|\geq a\cdot\sigma)\leq\frac1{a^2}$
- Chernoff bounds (alternative form)
  - for $X=\sum_{i=1}^nX_i$ where $X_i\sim\text{Ber}(p_i)$ are independent
  - $\mu=\sum_{i=1}^np_i$
  - $\forall\delta\in(0,1]:P(X\geq (1+\delta)\mu)\leq e^{-\mu\delta^2/3}$
  - $\forall\delta\in(0,1):P(X\leq(1-\delta)\mu)\leq e^{-\mu\delta^2/2}$
  - note: there are many more alternative bounds
- Moment generating function
  - given a random variable $X$, we define $M_X:\mathbb R\to\mathbb R$
  - $M_X(s):=\mathbb E(e^{sX})$
  - $\forall X:M_X(0)=\mathbb E(e^{0\cdot X})=1$
- MGF moment theorem
  - theorem: $M_X(s)=\sum_{k=0}^\infty\mathbb E[X^k]\cdot \frac{s^k}{k!}$
    - $\mathbb E[X^k]$ … the $k$-th moment of $X$
  - proof
    - $\mathbb E[e^{sX}]=\mathbb E[\sum_k\frac{(sX)^k}{k!}]=\mathbb E[\sum_k X^k\frac{s^k}{k!}]=\sum_k\mathbb E[X^k\frac{s^k}{k!}]=\sum_{k=0}^\infty\mathbb E[X^k]\cdot \frac{s^k}{k!}$
- MGF of a Bernoulli variable
  - $X\sim\text{Ber}(p)$
  - $M_X(s)=p\cdot e^{s\cdot 1}+(1-p)\cdot e^{s\cdot 0}=1-p+pe^s$
  - note: $e^s=1+s+\frac{s^2}2+\frac{s^3}{3!}+\dots=\sum_{k=0}^\infty\frac{s^k}{k!}$
  - $M_X(s)=1+p(s+\frac{s^2}2+\frac{s^3}{3!}+\dots)$
  - $\mathbb EX=[s^1]M_X(s)=p$
    - this notation selects the coefficient of $s^1$ in the generating function
  - $\mathbb EX^2=[\frac{s^2}{2!}]M_X(s)=p$
- MGF for continuous $X$
  - $M_X(s)=\mathbb E[e^{sX}]\overset{\text{LOTUS}}{=}\int_{-\infty}^\infty e^{sx}f_X(x)\,\text{d}x$
- MGF “linearity” and “sum”
  - theorem: $M_{aX+b}(s)=e^{sb}M_X(as)$
    - proof: $\mathbb E(e^{s(aX+b)})=\mathbb E(e^{asX}\cdot e^{sb})=e^{sb}\cdot\mathbb E(e^{(as)X})$
  - theorem: $X,Y$ independent $\implies M_{X+Y}=M_X\cdot M_Y$
    - proof: $M_{X+Y}(s)=\mathbb E[e^{s(X+Y)}]=\mathbb E[e^{sX}\cdot e^{sY}]=\mathbb E[e^{sX}]\cdot \mathbb E[e^{sY}]=M_X(s)\cdot M_Y(s)$
  - example
    - $X\sim\text{Bin}(n,p)$
    - $M_X(s)=(1-p+pe^s)^n$
- MGF of the normal distribution
  - $X\sim N(\mu,\sigma^2)$
  - $M_X(s)=e^{\mu s}\cdot e^{\sigma^2 s^2/2}$
- MGF equality and convergence
  - equality
    - let $X,Y$ be RVs such that $(\exists\varepsilon\gt 0)(\forall s\in (-\varepsilon,\varepsilon)):M_X(s)=M_Y(s)\in\mathbb R$
    - then $F_X=F_Y$
  - convergence
    - $Y,X_1,X_2,X_3,\dots$ RVs
    - $(\exists\varepsilon\gt 0)(\forall s\in (-\varepsilon,\varepsilon)):\lim_{n\to\infty} M_{X_n}(s)=M_Y(s)$
    - $F_Y$ is continuous
    - then $X_n\xrightarrow d Y$
      - i.e. $\lim_{n\to\infty} F_{X_n}(x)=F_Y(x)$
- Central limit theorem
  - theorem
    - $X_1,X_2,\dots$ i.i.d. (independent identically distributed) RVs
    - $\mathbb EX_i=\mu$, $\text{var}\, X_i=\sigma^2$
    - $Y_n:=\frac{X_1+\dots+X_n-n\mu}{\sqrt{n}\,\sigma}$
    - then $Y_n\xrightarrow d N(0,1)$
      - that is, $\lim_{n\to\infty}F_{Y_n}(t)=\Phi(t)$
  - proof
    - we may assume $\mu=0$
      - otherwise, we would set $X_n'=X_n-\mu$
      - the variance does not change
      - the formula for $Y_n$ also does not change (we subtract $n\mu$ there)
    - $M_{X_i}(s)=1+as+bs^2+O(s^3)\quad(s\to 0)$
      - $a=\mu=0$
      - $b=\frac12\mathbb EX_i^2=\frac12(\sigma^2+\mu^2)=\frac12\sigma^2$
        - from the alternative formula for $\text{var}$
      - $M_{X_i}(s)=1+\frac{\sigma^2}2s^2+O(s^3)$
    - $M_{Y_n}(s)=\prod_i M_{X_i}\left(\frac s{\sqrt n\,\sigma}\right)$
      - because $Y_n=\frac{X_1+\dots+X_n}{\sqrt n\,\sigma}$
    - we need to show that $\lim M_{Y_n}(s)=M_{Y}(s)$ where $Y\sim N(0,1)$
      - i.e. that $\lim_{n\to\infty}M_{X_i}(\frac s{\sqrt n\sigma})^n=e^{s^2/2}$
    - $M_{Y_n}(s)=M_{X_i}(\tfrac s{\sqrt n\sigma})^n=\left(1+\frac{\sigma^2}2\left(\frac s{\sqrt n\sigma}\right)^2+O(n^{-3/2})\right)^n=\left(1+\frac{s^2}{2n}+O(n^{-3/2})\right)^n$
      - we substituted $\frac s{\sqrt n\sigma}$ for $s$ in $M_{X_i}$ (for fixed $s$, the error term is $O(n^{-3/2})$)
    - we know that $(1+\frac xn)^n\to e^x$
    - therefore $\left(1+\frac{s^2/2}{n}+O(n^{-3/2})\right)^n\to e^{s^2/2}$
- Chernoff inequality
  - theorem
    - suppose $X_1,X_2,\dots,X_n$ are i.i.d. and $\pm1$, each with probability $\frac12$
    - $X=\sum_iX_i$
    - we have $\sigma^2=n$
    - $\forall t\gt 0:P(X\geq t)=P(X\leq -t)\leq e^{-t^2/2\sigma^2}$
  - proof
    - $M_{X_i}(s)=\frac{e^{-s}+e^{s}}2$
    - Markov inequality (for $X$ nonnegative): $\forall a\gt 0:P(X\geq a)\leq\frac \mu a$
    - we apply it to $e^{sX}$
      - $\forall s\gt 0:P(X\geq t)=P(e^{sX}\geq e^{st})\leq\frac{\mathbb E(e^{sX})}{e^{st}}$
    - $\frac{\mathbb E(e^{sX})}{e^{st}}=\frac{M_{X_i}(s)^n}{e^{st}}=\frac{(\frac{e^{-s}+e^{s}}2)^n}{e^{st}}$
    - it can be shown that $\frac{e^{-s}+e^{s}}2\leq e^{s^2/2}$
    - therefore $P(X\geq t)\leq \frac{e^{ns^2/2}}{e^{st}}=e^{ns^2/2-st}$
    - we set $s=t/n$
    - then $P(X\geq t)\leq e^{-t^2/2n}$
- Set balancing using the Chernoff inequality
  - we have subsets $S_1,\dots,S_n\subseteq[m]$
    - these may be features like “men”, “old people”, …
  - we want $T\subseteq[m]$ that almost dissects every $S_i$
    - to get two groups with the same (or similar) representation of every feature
  - $D_i=|T\cap S_i|-|S_i\setminus T|$
  - $\text{disc}(T)=\max_i |D_i|$
    - discrepancy
  - use a random $T$!
  - $\forall a_j\in S_i$ we define “indicators”
    - $X_j=1\iff a_j\in T$
    - $X_j=-1\iff a_j\notin T$
  - $D_i=\sum_{j=1}^{|S_i|}X_j$
  - $P(D_i\geq t)\leq e^{-\frac{t^2}{2|S_i|}}\leq e^{-\frac{t^2}{2m}}$
  - $P(|D_i|\geq t)\leq 2e^{-t^2/2m}$
  - $P(\exists i:|D_i|\geq t)\leq n\cdot 2e^{-t^2/2m}\leq \frac2n$
    - for $t=\sqrt{4m\log n}$
  - idea: it is useful to have an inequality (Chernoff) that always holds, so we don't need to worry whether our $n$ is big enough
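A simulation sketch of the set-balancing argument above: generate random feature sets $S_i$, pick a uniformly random $T$ (each element tossed in with probability $\frac12$), and check that the discrepancy stays below $t=\sqrt{4m\log n}$; the set sizes and counts are arbitrary choices.

```python
import math
import random

m, n = 1000, 200   # ground set [m], n feature sets (arbitrary sizes)
sets = [random.sample(range(m), random.randrange(1, m + 1)) for _ in range(n)]

# random T: each element of [m] is included independently with probability 1/2
T = {a for a in range(m) if random.random() < 0.5}

# D_i = |T n S_i| - |S_i \ T|, discrepancy = max_i |D_i|
disc = max(abs(sum(1 if a in T else -1 for a in S)) for S in sets)
t = math.sqrt(4 * m * math.log(n))

print(disc, "<=", t)   # holds with probability >= 1 - 2/n
```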