# Probability and Statistics 2 ## Markov chains - example 1 - machine in 2 states – working × broken - w → b … probability 0.01 - w → w … 0.99 - b → w … 0.9 - b → b … 0.1 - example 2 - a fly which may get caught in a web … absorbing states - df: stochastic process (náhodný proces) - sequence of random variables $X_0,X_1,\dots$ such that "all probabilities are defined" - e.g. $P(X_0=1\land X_1=2\land X_3=0)$ is defined - df: Markov process/chain - stochastic process such that - $\exists$ finite/countable set $S:\text{Im}X_t\subseteq S\;\forall t$ - elements of $S$ … states - $t$ … time - $(\forall i,j\in S)(\exists p_{ij}\in [0,1])(\forall t)(\forall a_0,a_1,\dots,a_t,a_{t+1})$ - notes - this is a discrete time Markov chain - discrete space … $|S|\leq |\mathbb N|$ - time-homogeneous - $p_{ij}$ not depeding on $t$ - Markov condition - only applies if the condition probabilities are defined - Markov condition … "forgetting the past" - df: transition matrix $P=(p_{ij})$ - df: transition diagram/graph - vertices … $S$ - directed edges … $\set{(i,j):p_{ij}\gt 0}$ - observation: every row of $P$ sums to 1 - Markov chain is equivalent to a random walk on the transition diagram - from state $i$ we go to $j$ with the probability of $p_{ij}$ independent of the past - in example 1, assume $X_0$, what is $P(X_2=w)$? - $P(X_1=w)=p_{11}=0.99$ - $P(X_2=w)=p_{11}\cdot p_{11}+p_{12}\cdot p_{21}=0.99\cdot 0.99+0.01\cdot 0.9$ - using the total probability theorem - probability mass function (PMF, pravděpodobnostní funkce) for $X_0,X_1,\dots$ - $\pi_i^{(t)}=P(X_t=i)$ - $t\in\mathbb N_0$ - $i\in S$ - theorem - $\pi^{(t+1)}=\pi^{(t)}\cdot P$ - where $\pi^{(t)}=(\pi_1^{(t)},\dots,\pi_n^{(t)})$ row vector - proof … definition of matrix multiplication & total probability theorem - theorem - $P(X_0=a_0,X_1=a_1,\dots,X_t=a_t)=\pi_{a_0}^{(0)}\cdot P_{a_0a_1}\cdot P_{a_1a_2}\cdot\ldots\cdot P_{a_{t-1}a_t}$ - proof by induction - df: probability of $k$-step transition … $r_{ij}(k)$ - $r_{ij}(k)=P(X_{t+k}=j\mid X_t=i)$ - $r_{ij}(1)=P(X_{t+1}=j\mid X_t=i)=p_{ij}$ - it is independent of $t$ - theorem (Chapman-Kolmogorov) - $\forall$ Markov chain, $\forall i,j,k:$ - $r_{ij}(k+1)=\sum_{u=1}^n r_{iu}(k) p_{uj}$ - $r_{ij}(k+\ell)=\sum_{u=1}^n r_{iu}(k) r_{uj}(\ell)$ - $r_{ij}(k)=(P^k)_{ij}$ - proof - 1 is a special case of 2 ($r_{uj}(1)=p_{uj})$ - $1\implies 3$ … matrix multiplication & induction - … - df: $j$ is accessible from $i$ (where $i,j\in S$) - $i\to j$ - $j\in A(i)$ - $j$ is accessible from $i\equiv$ $\exists t:\underbrace{P(X_t=j\mid X_0=i)}_{r_{ij}(t)}\gt 0$ - $\iff$ in the transition digraph exists directed path $i\to j$ of length $t$ - df: $i\leftrightarrow j$ ($i$ and $j$ communicate) $\equiv i\in A(j)\land j\in A(i)$ - theorem: $\leftrightarrow$ is an equivalence relation - proof - reflexive: $i\in A(i)$ … $t=0$ in the definition - symmetric: the formula $i\in A(j)\land j\in A(i)$ is symmetric - transitive: in the digraph $\exists$ path $i$ to $j$, $\exists$ path $j$ to $k$ $\implies\exists$ walk $i\to k$ - decomposition of the transition digraph to components of strong connectivity $\equiv$ finding equivalence classes of $\leftrightarrow$ - df: a state $i\in S$ is recurrent if we return to it with probability 1 - it is transient otherwise - česky rekurentní, tranzientní - df: Markov chain is irreducible if $\forall i,j\in S:i\leftrightarrow j$ - df: $T_i=\min\set{t\geq 1:X_t=i}$ - $T_i=\infty$ if the set is empty (there is no such $t$) - $T_i$ … random variable - $f_{ij}(n)=P(\text{we get to }j\text{ first at time }n\mid X_0=i)=P(T_j=n\mid X_0=i)$ - we define it for $n\gt 0$ - $f_{ij}=P(T_j\lt\infty\mid X_0=i)=\sum_{n\geq 1} f_{ij}(n)$ - $i$ is transient - $f_{ii}\lt 1$ … probability of ever getting back - $\iff P(T_i=\infty)\gt 0$ - $\iff P(\text{get back infinitely often})\lt 1$ - example – random walk on a line - with probability 1/2 we go to the left - $f_{00}(2)=(\frac12)^2+(\frac12)^2=\frac12$ - $f_{00}(1)=f_{00}(3)=\dots=0$ - $f_{00}(4)=\frac{1}{2^4}\cdot 2$ - $r_{00}(4)\neq f_{00}(4)$ - $r_{00}(4)=\frac{1}{2^4}\cdot 4$ - is 0 recurrent? - by definition, we should add $f_{00}(2)+f_{00}(4)+\dots$ and check if it equals 1 - theorem: $i$ is a recurrent state $\iff\sum_{n=1}^\infty r_{ii}(n)=\infty$ - $B_n=\begin{cases} 1&\text{ if }X_n=i\\ 0 &\text{ otherwise}\end{cases}$ - we got back - $\mathbb E(B_n\mid X_0=i)=P(X_n=i\mid X_0=i)=r_{ii}(n)$ - $r_{00}(2n)=$ probability that out of the $2n$ steps, $n$ were to the left, $n$ were to the right - $=\frac1{2^{2n}}\cdot{2n\choose n}\doteq\frac c{\sqrt n}$ … see Matoušek, Nešetřil - $\sum_{n=1}^\infty \frac{c}{\sqrt n}=\infty$ (taught in calculus) - $\mathbb E(T_0\mid X_0=0)=\infty$ (we did not prove that) - theorem - if $i\leftrightarrow j$, then either both $i$ and $j$ are recurrent or both $i$ and $j$ are transient - for finite Markov chains - $C$ … equiv class of $\leftrightarrow$ in a finite Markov chain - $C$ is recurrent ($\forall i\in C$ is recurrent) $\iff(\forall i \in C)(\forall j\in S):$ if $i\to j$ then $j\in C$ - → $C$ is closed - stationary distribution / steady state distribution - df: $\pi:S\to [0,1]$ such that $\sum_{i\in S}\pi_i=1$ is called a stationary distribution if $\text{``}\pi P=\pi\text{"}$ - $\forall i:\sum_i\pi_i p_{ij}=\pi_j$ - if $\pi^{(0)}$ (the PMF of $X_0$) is $\pi$, then $\pi^{(1)}$ is $\pi$ as well - $\pi^{(1)}=\pi^{(0)}P$ - theorem: if a Markov chain is finite, aperiodic (→ not periodic) and irreducible, then 1. $\exists$ unique stat. distribution $\pi$ 2. $\forall i:\lim_{n\to\infty}(P^n)_{ij}=\pi_j$ - example of periodic Markov chain: two states, we change the state with probability 1 - df: $i\in S$ has period $d_i:=\text{gcd}\set{t:r_{ii}(t)\gt 0}$ - $i\in S$ is aperiodic if $d_i=1$ - df: $i$ is null recurrent if $i$ is recurrent and $\mathbb E(T_i\mid X_0=i)=\infty$ - $i$ is positive recurrent if $i$ is recurrent and $\mathbb E(T_i\mid X_0=i)\lt\infty$ - example: random walk on a line - theorem - if $i,j\in S$, $i\leftrightarrow j$, then - $d_i=d_j$ - $i$ is transient $\iff j$ is trainsient - $i$ is recurrent $\iff j$ is recurrent - $i$ is null recurrent $\iff j$ is null recurrent - $i$ is positive recurrent $\iff j$ is positive recurrent - these are properties of the class of $\leftrightarrow$ - theorem - if a Markov chain is irreducible, aperiodic and finite, then - there exists a unique stationary distribution $\pi$: $\pi P=\pi$ - $\forall ij:\lim(P^t)_{ij}=\pi_j$ - $P(X_t=j\mid X_0=i)\doteq\pi_j$ - actually, MC does not have to be finite, it suffices if all states are positive recurrent (?) - steady state (?) - if $\pi^{(0)}=\pi$ then $\pi^{(1)}=\pi$ - the proof is not easy, here is a cheat proof - $Pj=Ij$ (row sums are 1) - $(P-I)j=0$ - $P-I$ is sungular matrix - $\exists x:x(P-I)=0\implies xP=x$ - $\pi=\frac xc$ such that $\sum \pi_i=1$ - problem - $x$ may have negative coordinates - to fix: use Perron-Frobenius theorem - the correct proof is shown in class of probabilistic techniques - to find $\pi$, solve system of linear equations $\pi P=\pi$, add $\sum_{i\in S}\pi_i=1$ - $\pi$ describes long-term behavior of the MC - Page Rank (original google search) … MC model of people browsing WWW - given $\pi$, we can find a MC such that $\pi$ is its stationary distribution; then we can run the MC to generate random objects with distribution $\pi$ - Markov chain Monte Carlo (MCMC) - detailed balance equation - MC may have this property - $\forall i\neq j:\pi_iP_{ij}=\pi_jP_{ji}$ - to imagine this: ant colony moving independently according to a Markov chain - stationary distribution $\iff$ the same number of ants at each state at each time – ants don't "accumulate" - detailed balance equation implies $\pi P=\pi$ - detailed balance equation is stronger than $\pi P=\pi$ - MCMC algo. sketch - choose aperiodic irreducible digraph - $p_{ij}=\min\set{1,\frac{\pi _j}{\pi_i}}\cdot C$ - $p_{ji}=\min\set{1,\frac{\pi_i}{\pi_j}}\cdot C$ - choose $C$ such that - $\forall i:\sum_{j\neq i} p_{ij}\lt 1$ - df. $p_{ii}=1-\sum_{j\neq i}p_{ij}\gt 0$ - $\implies d_i=1$ - tune the process to make convergence fast - absorbing state $i:p_{ii}=1$ - $A$ … set of absorbing states - question 1: which $i\in A$ we end at? - question 2: how fast? - example: $0\in A$ (?) - $a_i=P(\exists t:X_t=0\mid X_0=i)$ - $\mu_i=\mathbb E(T\mid X_0=i)$ - $T=\min\set{t: X_t\in A}$ - theorem: $(a_i)_{i\in S}$ are the unique solution to - $a_0=1$ - $a_i=0$ if $i\in A,\,i\neq 0$ - $a_i=\sum_j p_{ij}\cdot a_j$ otherwise - theorem: $(\mu_i)_{i\in S}$ are unieque solutions to - $\mu_i=0$ if $i\in A$ - $\mu_i=1+\sum_j p_{ij}\mu_j$ if $i\notin A$ - proof - $P(\exists t:X_t=0\mid X_0=0)=1$ - $P(\exists t: X_t=0\mid X_0=i\in A\setminus\set{0})=0$ - $i\notin A$ - $B_j=\set{\exists t:X_t=0}$ - $P(B_i)=\sum_{j\in S} p_{ij}\cdot \underbrace{P(B_i\mid X_1=j)}_{P(B_j)=a_j}$ - example: drunk person on their way home - $A=\set{0}$ - $\mu_0=0$ - $\mu_1=1+\frac12\mu_0+\frac12\mu_2$ - $\mu_2=1+\frac12\mu_1+\frac12\mu_3$ - $\mu_{n-1}=1+\frac12\mu_{n-2}+\frac12\mu_{n}$ - $\mu_n=1+\mu_{n-1}$ - solution - $\mu_1=2n-1$ - $\mu_n=n^2$ - $\mu_{i}\leq n^2$ - 2-SAT problem - input: $\varphi=(x_1\lor x_2)\land(x_3\lor\neg x_1)\land\ldots$ - clauses with exactly 2 literals - output: a satisfying assignment OR “unsatisfiable” - there exists a polynomial algorithm - we will show a randomized algorithm 1. arbitrarily initialize $(x_1,\dots,x_n)$ 2. while $\varphi$ has an unsatisfied clause - choose one of the unsatisfied clauses and change one of its variables 3. return $(x_1,\dots,x_n)$ - repeat (2) $\leq2mn^2$ times - then, if $\varphi$ is still unsatisfied, return “unsatisfiable” - this may introduce errors - theorem: the algorithm makes an error with probability $\leq 2^{-m}$ - proof - $x_1^*,\dots,x_n^*$ … one of the satisfying assignments - $D_t$ the number of $i$ such that $x_i\neq x_i^*$ at time $t$ - $t=0,1,\dots,2mn^2$ - $0\leq D_t\leq n$ - $D_t=0\implies$ we found a solution - situation - we assume that clause $(x_1\lor x_2)$ is unsatisfied at time $t$ and we choose it - $\implies x_1=x_2=F$ - $(x_1\neq x_2)$ - ($x_2\neq\neg x_1)$ … we could remove such clauses - $\implies x_1^*\lor x_2^*$ is $T$ - we randomly switch $x_1$ or $x_2$ which increases or decreases $D$ - … - we can get a 3-SAT randomized algorithm with $(\frac43)^n$ time complexity this way - Hidden Markov Model (HMM) - not observable $(X_t)$ … Markov chain - observable $(Y_t)$ … $Y_t$ is obtained from $X_t$ - widely aplplicable - smart algorithm (Vitebri algorithm) ## Probability - what is probability? - $P:\mathcal F\to[0,1]$ such that - $P(\Omega)=1$ - $P(\bigcup A_n)=\sum P(A_n)$ if disjoint - where to find/use it? - randomized algorithms - is it possible to have true randomness? - hardware methods to sample random bits - software methods - their independence can be “weak” (for statistics) or “strong” (for cryptography) - probabilistic method - we prove that some graph exists by showing that random graph has certain property with probability greater than zero - $G(n,p)$ … graph with vertices $1,\dots,n$ - $\forall i,j:i\sim j$ with probability $p$ (all independent) - $P(G(n,\frac12)$ has no $K_k$ nor $\overline K_k$ as an induced subgraph$)\gt 0$ if $n\leq 2^{k/2}$ - thus $\exists G$ on $2^{k/2}$ vertices with no induced $K_k$ nor $\overline K_k$ - Ramsey theorem - statistics - frequentist - $P(A)=$ number of good / number of all - “in long term repetition” - repeat a random experiment independently $n$ times - observe that $A$ happens $k$ times - $P(A):=k/n$ - Bayesian - P(it will rain tomorrow) - subjective probability → betting - does satisfy axioms! - “random universe” - $\Omega$ = set of all possible universes - Bayes theorem - MAP (maximum a posteriori) - $\hat\theta_\text{MAP}=\text{argmax}_\theta\,p_{\Theta\mid X}(\theta\mid x)$ - $\hat\theta$ is a point estimate for $\Theta$ - that equals $\text{argmax}\,p_\Theta(\hat\theta)\cdot p_{X\mid\Theta}(x\mid\theta)$ as we can ignore the normalization constant - Beta function - $B(\alpha,\beta)=\int_0^1x^{\alpha-1}(1-x)^{\beta-1}=\frac{(\alpha-1)!(\beta-1)!}{(\alpha+\beta-1)!}$ - Beta distribution - $f_\Theta(x)=\frac1{B(\alpha,\beta)}x^{\alpha-1}(1-x)^{\beta-1}$ - $(\ln f)'=(c+(\alpha-1)\ln x+(\beta-1)\ln (1-x))'={\alpha-1\over x}-{\beta-1\over 1-x}$ - in the maximum, this will be equal to zero - maximum … $\frac{\alpha-1}{\alpha-1+\beta-1}$ - … - LMS point estimate - LMS = least mean square - estimate such that $\mathbb E((\Theta-\hat\theta)^2\mid X=x)$ is minimal - to compute it - … - example: measurement error with a normal distribution - often, the constant in the denominator does not matter - from posterior $f_{\Theta\mid X}$ we can find - point extimates - MAP - LMS - interval estimates - confidence intervals (in classical statistics) → credible sets $S$ - sampling - rejection sympling - MCMC sampling - Monte Carlo Markov chains - Metropolis Hastings method - we construct a MC from the probability distribution we want - we run the MC for long enough - LMS - $\Theta\mid X=x$ - mean … LMS = $\min\mathbb E((\Theta-\hat\theta)^2\mid X=x)$ - median … $\min\mathbb E(|\Theta-\hat\theta|\mid X=x)$ - modus … MAP - conditional independence - events - $A\perp B\iff P(A\cap B)=P(A)\cdot P(B)$ - $A\perp_C B\iff P(A\cap B\mid C)=P(A\mid C)\cdot P(B\mid C)$ - $A,B$ are independent conditionally given $C$ - it is possible (even typical) that $A\perp_C B$, $A\perp_{C^C} B$, but not $A\perp B$ - conditional expectation - $\mathbb E(Y\mid X=x)$ vs. $\mathbb E(Y\mid X)$ - $\mathbb E(Y\mid X=x)=g(x)$ … number - $\mathbb E(Y\mid X)=g(X)$ … random variable - we proved that $\mathbb E(\mathbb E(Y\mid X))=\mathbb E(g(X))=\mathbb EY$ - “law of iterated expectation” - basic task of statistics - estimate one quantity (Y) given data/measurement (X) - example: groups of students, their exam results - estimator - $\hat Y=\mathbb E(Y\mid X)=g(X)$ - $\tilde Y=\hat Y-Y$ - we proved that $\mathbb E(\tilde Y\mid X)=0$ - therefore $\mathbb E(\tilde Y)=\mathbb E(\mathbb E(\tilde Y\mid X))=0$ - also, $\text{cov}(\tilde Y,\hat Y)=0$ - they are uncorellated - note - uncorellated $\impliedby$ independent - uncorellated $\centernot\implies$ independent - conditional variance - iterated variance / eve's rule - $\text{var }Y=\mathbb E(\text{var}(Y\mid X))+\text{var}(\mathbb E(Y\mid X))$ - intragroup variance + intergroup variance - birthday paradox - $1-x\approx e^{-x}$ - balls into bins - $m$ balls - $n$ bins - $X_i$ … number of balls in bin $i$ - birthday paradox … $P(\max X_i\geq 2)$ - questions - how many bins are used? - approximation of $X_i$ - max load … $\max\set{X_1,\dots,X_n}$ - $P(X_i=0)=(1-\frac1n)^m\approx e^{-\frac mn}$ - $I_i=1$ if $X_i=0$ (otherwise 1) - $\mathbb E(\sum_i I_i)=\sum_i\mathbb EI_i\approx ne^{-\frac mn}$ - distribution of $X_i$ - $X_i\sim \text{Bin}(m,\frac 1n)\approx\text{Pois}(\frac mn)$ - $\mathbb EX_i=\frac mn$ - applications - bucket sort - hash collisions - exact case vs. Poisson case - theorem: any event that happens with probability $\leq p$ in the Poisson case happens with probability $\leq p\cdot e\sqrt m$ in the exact case - bernoulli process, poisson process - … - statistics – descriptive vs. inferential - descriptive – describe what we observed - inferential – deduce properties of a larger set - how many people are left-handed? - we can observe this on a small sample of people - observational study vs. randomized study (RCT, randomized control trial) - treatment – what you do to experimental units - example - experimental units … students - treatment … which tutorial student attends - placebo, control group - observational study → we let the students decide which tutorial to attend - randomized study → we assign the tutorials to students (randomly) - confounders - people can get better because they believe they are treated - random $\neq$ arbitrary - haphazard - statistics – parametric vs. non-parametric - parametric → observations from parametrized random variables - t-test - permutation test - example: earbuds - black (4.3/5) vs. navy blue (4/5) - descriptive statistics - observational study - example: alpacas - permutation test - https://www.jwilber.me/permutationtest/ - we observe … $T(X,Y)$ - $Z:=X,Y$ - $\mathcal F=\set{T(\pi(Z))\mid \pi\in S_{m+n}}$ - p-value: percentage of $\mathcal F$ more extreme than observed - speed-up … we use only $k$ random samples from $S_{m+n}$ - tests - one-sample test - two-sample test - paired test - quality of wool before/after treatment - give black & blue to $n$ testers, ask each tester to score both - randomize the order of testing the color - use one-sample test on $D_i=X_i-Y_i$ - parametrized statistics … t-test (assuming normal distribution) - non-parametric version … paired test & permutation test - for paired test, we only permute $X_i$ with $Y_i$ - sign test - $Y_i=$ "sign of $X_i$" - example - 10 values - $H_0$ … median = 0 - sign $S\sim\text{Bin}(10,\frac12)$ - one-sided test … $F_S(3)=P(S\leq 3)=0.17$ - two-sided test … $P(S\leq 3\lor S\geq 7)=0.34$ - but we are suspicious - we can rank absolute values (from lowest to highest) and write the averages of the ranks (for rank 3–5 → the average will be 4) - we can multiply that by the sign ($S$) - test procedure (Wilcoxon signed rank test) - data … $X_1,\dots,X_n$ - ranks of $|X_i|$ … $R_1,\dots,R_n$ - $T^+=\sum_{X_i\gt 0}R_i$ - $T^-=\sum_{X_i\lt 0} R_i$ - $T=T^+-T^-=\sum R_i\cdot\text{sign}(X_n)$ - $T^++T^-=\sum i=\frac {n(n+1)}2$ - we will use stronger null hypothesis - $H_0:$ median = 0 & distribution is symmetric - assuming $H_0$… (and ignoring ties) - $T=\sum_{i=1}^n i\cdot V_i$ - $V_i=\begin{cases}+1 \text{ prob. } 1/2\\ -1\end{cases}$ - $V_i^+=0/1$ - $\sim\text{Ber}(\frac12)$ - $T^+=\sum i\cdot V_i^+$ - $V_i$ … independent - what can we do? - we can compute CDF of $T$ - we can apply CLT to approximate (can we really?) - back to our example - `wilcox` test in R → $0.07$ - what is correct? $0.07$ or $0.17$? - it depends on what we can assume - sign test - Wilcoxon signed rank test - Student's t-test - power of test - $1-P(\text{type II error})$ - to increase, we have two choices - we can get more data - or we can make stronger assumptions - but we should not cheat – if it is obvious that the data is not normally distributed, we should not say they are - paired test - $X_i=A_i-B_i$ - null hypothesis … the procedure did not help nor hurt - the distribution may be very weird - but it will be symmetrical - if $A_i,B_i$ have same distribution, then $A_i-B_i$ has distribution symmetrical around 0, median 0 - Mann-Whitney U-test - $U=\sum_{i=1}^m\sum_{j=1}^nS(X_i,Y_j)$ - $S(X_i,Y_j)=\begin{cases} 1 & X_i\gt Y_j\\ \frac12 & X_i=Y_j \\ 0 & X_i\lt Y_j\end{cases}$ - 2-sample test - we have two populations, $X$ and $Y$ - we get $X_1,\dots,X_m$ and $Y_1,\dots,Y_n$ - usually, $m\neq n$ - sign test is a paired variant of this - we use the same approach as in the permutation test with $U$ instead of $T=\bar X_m-\bar Y_n$ - types of data - numerical … real numbers (weight, time, price) - → permutation test - ordinal … classes (light/medium/heavy, quick/slow) - → $U$-test - bootstrapping - data $X_1,\dots,X_n$ iid - sample mean $\bar X_n=\frac1n\sum_{i=1}^n X_i$ - distr. of $\bar X_n:\mathbb E \bar X_n=\mathbb E X_i=\mu$ - variance of $\bar X_n:\text{var}\bar X_n=\frac1n\sigma^2$ - CLT: $\bar X_n$ has $\approx$ Normal distribution $N(\mu,(\frac\sigma{\sqrt n})^2)$ - $P(\bar X_n\lt a)$ can be approximated by the normal distribution $N$ - now, we want the same of a different parameter - $M$ … median of $X_i$ - $\hat M=$ sample median: “middle value of $X_1,\dots,X_n$” - $X_{\set{1}},\dots,X_{\set{n}}$ … sorted data - $\hat M=X_{\set{\frac n2}}$ - estimator function of data we use to estimate the true $M$ - bootstrapping … sampling with repetition from $X_1,\dots,X_n$ - we can approximate the distribution of $\hat M$ - interval estimate for median $M$ - $M^*$ … bootstrapped median (median of the one set of bootstrapped data) - $M_\alpha^*=\set{x:P(M^*\leq x)=\alpha}$ - this method is simple and does not depend on the distribution of the data ## Moment generating functions - today: proof of CLT, proof of Chernoff inequality - given a random variable $X$, we define $M_X:\mathbb R\to\mathbb R$ - $M_X(s):=\mathbb E(e^{sX})$ - $\forall X:M_X(0)=\mathbb E(e^{0x})=1$ - $X\sim\text{Ber}(p)$ - $M_X(s)=p\cdot e^{s\cdot 1}+(1-p)\cdot e^{s\cdot 0}$ - $M_X(s)=1-p+pe^s$ - note: $e^s=1+s+\frac{s^2}2+\frac{s^3}{3!}+\dots=\sum_{k=0}^\infty\frac{s^k}{k!}$ - $M_X(s)=1+p(s+\frac{s^2}2+\frac{s^3}{3!}+\dots)$ - theorem: $M_X(s)=\sum_{k=0}^\infty\mathbb E[X^k]\cdot \frac{s^k}{k!}$ - $\mathbb E[X^k]$ … $k$-th moment of $X$ - proof - $\mathbb E[e^{sX}]=\mathbb E[\sum\frac{(sX)^k}{k!}]=\mathbb E[\sum X^k\frac{s^k}{k!}]=\sum\mathbb E[X^k\frac{s^k}{k!}]=\sum_{k=0}^\infty\mathbb E[X^k]\cdot \frac{s^k}{k!}$ - back to Ber(p) - $\mathbb EX=[s^1]M_X(s)=p$ - this notation selects the coefficient of the $s^1$ in the GF - $\mathbb EX^2=[\frac{s^2}{2!}]M_X(s)=p$ - $X\sim N(0,1)$ - $M_X(s)=\mathbb E[e^{sX}]\overset{\text{LOTUS}}{=}\int_{-\infty}^\infty e^{sx}f_X(x)\text { d}x$ - … - $M_X(s)=e^{s^2/2}$ - theorem: $M_{aX+b}(s)=e^{sb}M_X(as)$ - proof: … - example usage - $X\sim N(\mu,\sigma^2)$ - $\frac{X-\mu}{\sigma}=Y\sim N(0,1)$ - $X=\sigma Y+\mu$ - $M_X(s)=e^{\mu s}\cdot e^{\sigma^2 s^2/2}$ - theorem - let $X,Y$ be RVs such that $(\exists\varepsilon\gt 0)(\forall s\in (-\varepsilon,\varepsilon)):M_X(s)=M_Y(s)\in\mathbb R$ - then $F_X=F_Y$ - example - $X_1\sim N(\mu_1,\sigma^2_1)$ - $X_2\sim N(\mu_2,\sigma^2_2)$ - $\implies X_1+X_2\sim N(\mu,\sigma^2)$ - proof - $M_{X_1}=\text{exp}(\mu_1s+\sigma_1^2s^2/2)$ - $M_{X_2}=\text{exp}(\mu_2s+\sigma_2^2s^2/2)$ - … - theorem - $X,Y$ independent $\implies M_{X+Y}=M_X\cdot M_Y$ - proof - $M_{X+Y}(s)=\mathbb E[e^{s(X+Y)}]=\mathbb E[e^{sX}\cdot e^{sY}]=\mathbb E[e^{sX}]\cdot \mathbb E[e^{sY}]=M_X(s)\cdot M_Y(s)$ - example - $X\sim\text{Bin}(n,p)$ - $M_X(s)=(1-p+pe^s)^n$ - theorem - $Y,X_1,X_2,X_3,\dots$ RVs - $(\exists\varepsilon\gt 0)(\forall s\in (-\varepsilon,\varepsilon)):\lim_{n\to\infty} M_{X_n}(s)=M_Y(s)$ - $F_Y$ is continuous - then $X_n\xrightarrow d Y$ - $\lim_{n\to\infty} F_{X_n}(s)=F_Y(s)$ - theorem (CLT) - $X_1,X_2,\dots$ i.i.d. (independent identically distributed) RVs - $\mathbb EX_i=\mu$, $\text{var} X_i=\sigma^2$ - $Y_n:=\frac{X_1+\dots+X_n-n\mu}{\sqrt{n}\sigma}$ - $Y_n\xrightarrow d N(0,1)$ - that is $\lim_{n\to\infty}F_{Y_n}(t)=\Phi(t)$ - proof - we may assume $\mu=0$ - otherwise, we would set $X_n'=X_n-\mu$ - variance would not change - the formula for $Y_n$ also would not change (we subtract $n\mu$ there) - $M_{X_i}(s)=1+as+bs^2+O(s^3)\quad(s\doteq 0)$ - $a=\mu=0$ - $b=\frac12\mathbb EX_i^2=\frac12(\sigma^2-\mu^2)=\frac12\sigma^2$ - $M_{X_i}(s)=1+\frac{\sigma^2}2s^2+O(s^3)$ - $M_{Y_n}(s)=\prod M_{X_i}\left(\frac s{\sqrt n\sigma}\right)$ - we will use the previous theorem for $Y,Y_1,Y_2,\dots$ - we need to show that $\lim M_{Y_n}(s)=M_Y(s)$ - or that $\lim_{n\to\infty}M_{X_n}(\frac s{\sqrt n\sigma})^n=e^{s^2/2}$ - $M_{X_n}(\frac s{\sqrt n\sigma})^n=…=\text{exp}(n\ln(1+\frac{s^2}{2n}+O(s^3))$ - … - theorem (Chernoff inequality) - … - application - set balancing - discrepancy - we have subsets $S_1,\dots,S_n\subseteq[m]$ - we want $T\subseteq[m]$ such that it almost disects every $S_i$ - $D_i=|T\cap S_i|-|S_i\setminus T|$ - $\text{disc}(T)=\max D_i$ - use random $T$! - $D_i=\sum_{j=1}^{|S_i|}X_j$ - $\forall a_j\in S_i:X_j$ indicator ($X_j=1\iff a_j\in T$) - $P(D_i\geq t)\leq e^{-\frac{t^2}{2|S_i|}}$ - … - idea: it is useful to have an inequality (Chernoff) that holds always, we don't need to worry if our $n$ is big enough