Exam
Markov Chains
- Stochastic process
- sequence of random variables X_0, X_1, … such that "all probabilities are defined"
- e.g. P(X_0=1 ∧ X_1=2 ∧ X_3=0) is defined
- Markov process/chain
- stochastic process such that
- ∃ finite/countable set S such that ∀t: Im X_t ⊆ S
- elements of S … states
- t … time
- ∀i,j∈S ∃p_{ij}∈[0,1] such that P(X_{t+1}=j ∣ X_t=i) = p_{ij}
- notes
- this is a discrete time Markov chain
- discrete space … |S| ≤ |ℕ|
- time-homogeneous … p_{ij} not depending on t
- Markov condition … “forgetting the past”
- P(X_{t+1}=a_{t+1} ∣ X_t=a_t ∧ ⋯ ∧ X_0=a_0) = P(X_{t+1}=a_{t+1} ∣ X_t=a_t)
- Transition matrix, diagram
- transition matrix is the matrix P with P_{ij} = p_{ij}
- observation: every row of P sums to 1
- transition diagram/graph
- vertices … S
- directed edges … {(i,j)∈S² : p_{ij} > 0}
- Markov chain is equivalent to a random walk on the transition diagram
- from state i we go to state j with probability p_{ij}, independently of the past
- Probability mass function for X0,X1,…
- π_i(t) = P(X_t = i)
- theorem: π(t+1) = π(t)·P
- where π(t) = (π_1(t), …, π_n(t)) is a row vector
- proof: definition of matrix multiplication & total probability theorem
- by induction, we can prove that π(k) = π(0)·P^k and, more generally, π(t+k) = π(t)·P^k (see the sketch below)
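- a minimal NumPy sketch of π(t+1) = π(t)·P; the 2-state transition matrix is a made-up example, not from the lecture:
```python
# Evolve the PMF of a small Markov chain: pi(t+1) = pi(t) @ P.
import numpy as np

P = np.array([[0.9, 0.1],    # p_00, p_01
              [0.4, 0.6]])   # p_10, p_11  (each row sums to 1)
pi = np.array([1.0, 0.0])    # pi(0): start in state 0

for t in range(5):
    pi = pi @ P              # pi(t+1) = pi(t) * P
    print(t + 1, pi)

# pi(k) = pi(0) P^k, computed directly with a matrix power:
print(np.array([1.0, 0.0]) @ np.linalg.matrix_power(P, 5))
```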
- Probability of a k-step transition … r_{ij}(k)
- r_{ij}(k) = P(X_{t+k}=j ∣ X_t=i)
- r_{ij}(1) = P(X_{t+1}=j ∣ X_t=i) = p_{ij}
- Chapman-Kolmogorov theorem
- ∀ Markov chain, ∀i,j,k:
- (1) r_{ij}(k+1) = ∑_{u=1}^n r_{iu}(k)·p_{uj}
- (2) r_{ij}(k+ℓ) = ∑_{u=1}^n r_{iu}(k)·r_{uj}(ℓ)
- (3) r_{ij}(k) = (P^k)_{ij}
- proof
- (2) ⟹ (1)
- (1) is a special case of (2), since r_{uj}(1) = p_{uj}
- (1) ⟹ (3)
- matrix multiplication & induction
- (2) … we want r_{ij}(k+ℓ) = ∑_{u=1}^n r_{iu}(k)·r_{uj}(ℓ)
- P(X_{t+k+ℓ}=j ∣ X_t=i) =
- = ∑_{u=1}^n P(X_{t+k}=u ∣ X_t=i)·P(X_{t+k+ℓ}=j ∣ X_{t+k}=u)
- this follows from the total probability theorem together with the Markov condition
- Accessible states
- definition: j is accessible from i (where i,j∈S)
- i→j
- j∈A(i)
- j is accessible from i ≡ ∃t: r_{ij}(t) = P(X_t=j ∣ X_0=i) > 0
- ⟺ the transition digraph contains a directed path i→j of length t
- Communicating states
- definition: i↔j (i and j communicate) ≡ i∈A(j) ∧ j∈A(i)
- theorem: ↔ is an equivalence relation
- proof
- reflexive: i∈A(i) … take t=0 in the definition
- symmetric: the formula i∈A(j) ∧ j∈A(i) is symmetric
- transitive: in the digraph ∃ path i→j and ∃ path j→k ⟹ ∃ walk i→k
- decomposing the transition digraph into strongly connected components ≡ finding the equivalence classes of ↔
- Reducibility
- Markov chain is irreducible if ∀i,j∈S:i↔j
- Time to get to i∈S
- definition: T_i = min{t≥1 : X_t=i}
- T_i = ∞ if the set is empty (there is no such t)
- T_i … a random variable
- f_{ij}(n) = P(we first reach j at time n ∣ X_0=i) = P(T_j=n ∣ X_0=i)
- we define it for n>0
- f_{ij} = P(T_j<∞ ∣ X_0=i) = ∑_{n≥1} f_{ij}(n)
- the probability that we ever reach j if we start from i (after at least one step)
- Recurrent state
- a state i∈S is recurrent if we return to it with probability 1
- it is transient otherwise
- in Czech: rekurentní, tranzientní
- i is transient
- f_{ii} < 1 … the probability of ever getting back
- ⟺ P(T_i=∞ ∣ X_0=i) > 0
- ⟺ P(we get back infinitely often) < 1
- theorem: i is a recurrent state ⟺ ∑_{n=1}^∞ r_{ii}(n) = ∞
- theorem: if i↔j, then either both i and j are recurrent or both i and j are transient
- for finite Markov chains
- C … equiv class of ↔ in a finite Markov chain
- C is recurrent (∀i∈C is recurrent) ⟺ (∀i∈C)(∀j∈S): if i→j then j∈C ("C is closed")
- definition: i is null recurrent if i is recurrent and E(T_i ∣ X_0=i) = ∞
- i is positive recurrent if i is recurrent and E(T_i ∣ X_0=i) < ∞
- example: the symmetric random walk on ℤ (it returns with probability 1, but the expected return time is infinite, so it is null recurrent)
- Periodic state
- definition: i∈S has period d_i := gcd{t : r_{ii}(t) > 0}
- i∈S is aperiodic if d_i = 1
- example of a periodic Markov chain: two states, we change the state with probability 1 (period 2)
- Properties of a communicating class
- if i,j∈S, i↔j, then
- d_i = d_j
- i is transient ⟺j is transient
- i is recurrent ⟺j is recurrent
- i is null recurrent ⟺j is null recurrent
- i is positive recurrent ⟺j is positive recurrent
- these are properties of the class of ↔
- Stationary distribution / steady state distribution
- definition: π: S→[0,1] such that ∑_{i∈S} π_i = 1 is called a stationary distribution if πP = π
- i.e. ∀j: ∑_i π_i·p_{ij} = π_j
- if π(0) (the PMF of X_0) is π, then π(1) is π as well (→ steady state)
- reminder: π(1) = π(0)·P
- theorem
- if a Markov chain is irreducible, aperiodic and finite, then
- there exists a unique stationary distribution π such that πP=π
- ∀i,j: lim_{t→∞} (P^t)_{ij} = π_j
- P(X_t=j ∣ X_0=i) ≐ π_j for large t
- actually, the MC does not have to be finite; it suffices that all states are positive recurrent
- to find π, solve the system of linear equations πP = π, and add ∑_{i∈S} π_i = 1 (see the sketch below)
- for every i there should be an equation π_i = ∑_j p_{ji}·π_j (there is a term for every incoming arc)
- π describes the long-term behavior of the MC
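- a sketch of finding π by solving the linear system; the 3-state chain below is an arbitrary example:
```python
# Solve pi P = pi together with sum(pi) = 1 by replacing one
# (redundant) balance equation with the normalization.
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
n = P.shape[0]

A = (P - np.eye(n)).T          # row j: sum_i pi_i p_ij - pi_j = 0
A[-1, :] = 1.0                 # replace the last equation by sum_i pi_i = 1
b = np.zeros(n); b[-1] = 1.0
pi = np.linalg.solve(A, b)
print(pi, pi @ P)              # pi and pi*P should match: (0.25, 0.5, 0.25)
```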
- PageRank (the original Google search) … a Markov chain model of people browsing the WWW
- given π, we can find a MC such that π is its stationary distribution; then we can run the MC to generate random objects with distribution π
- Markov chain Monte Carlo (MCMC)
- Detailed balance equation
- MC may have this property
- ∀i≠j: π_i·p_{ij} = π_j·p_{ji}
- to imagine this: an ant colony, each ant moving independently according to the Markov chain
- stationary distribution … the same (expected) number of ants at each state at each time – ants don't "accumulate"
- the detailed balance equation implies πP = π (sum it over i and use ∑_i p_{ji} = 1), but it is stronger than that
- MCMC sampling
- Markov chain Monte Carlo
- Metropolis–Hastings method
- we construct a MC from the probability distribution we want
- choose aperiodic irreducible digraph
- use detailed balance equation
- p_{ij} = min{1, π_j/π_i}·C
- p_{ji} = min{1, π_i/π_j}·C
- choose C such that
- ∀i: ∑_{j≠i} p_{ij} < 1
- define p_{ii} = 1 − ∑_{j≠i} p_{ij} > 0
- ⟹ d_i = 1 (the self-loop makes the state aperiodic)
- tune the process to make convergence fast
- we run the MC for long enough
- alternative sampling approach: rejection sampling
- MCMCs are useful in Bayesian statistics – we only need the ratios π_j/π_i, so the normalizing constants cancel out (and we don't have to compute them); see the sketch below
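- a minimal sketch of Metropolis sampling on a 4-cycle with a symmetric random-neighbor proposal; the unnormalized weights are made up, and only the ratios π_j/π_i are ever used:
```python
# Metropolis sampler: propose a neighbor, accept with min{1, pi_j / pi_i}.
import random

weights = [1, 2, 3, 4]                 # unnormalized target pi

def metropolis(steps, start=0):
    i = start
    counts = [0] * len(weights)
    for _ in range(steps):
        j = (i + random.choice([-1, 1])) % len(weights)   # symmetric proposal
        if random.random() < min(1.0, weights[j] / weights[i]):
            i = j                       # accept; otherwise stay (self-loop)
        counts[i] += 1
    return [c / steps for c in counts]

print(metropolis(100_000))   # should approach [0.1, 0.2, 0.3, 0.4]
```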
- Absorbing state
- i∈S is an absorbing state if p_{ii} = 1
- A … set of absorbing states
- which absorbing state do we end at? will it be state 0∈A?
- a_i = P(∃t: X_t=0 ∣ X_0=i)
- theorem: (a_i)_{i∈S} is the unique solution to
- a_0 = 1
- a_i = 0 if i∈A, i≠0
- a_i = ∑_j p_{ij}·a_j otherwise
- what is the expected number of steps to get to an absorbing state?
- μ_i = E(T ∣ X_0=i)
- T = min{t : X_t∈A}
- theorem: (μ_i)_{i∈S} is the unique solution to
- μ_i = 0 if i∈A
- μ_i = 1 + ∑_j p_{ij}·μ_j if i∉A
- example: drunk person on their way home (a random walk on states 0,…,n, where home 0 is absorbing)
- A = {0}
- μ_0 = 0
- μ_1 = 1 + (1/2)μ_0 + (1/2)μ_2
- μ_2 = 1 + (1/2)μ_1 + (1/2)μ_3
- ⋮
- μ_{n−1} = 1 + (1/2)μ_{n−2} + (1/2)μ_n
- μ_n = 1 + μ_{n−1}
- solution
- μ_1 = 2n−1
- μ_n = n²
- μ_i ≤ n²
- how to compute it
- we start from the bottom up by combining the two last equations
- we get μ_{n−1} = 3 + μ_{n−2}
- then we get μ_{n−2} = 5 + μ_{n−3}
- it can be shown that we get down to μ_1 = (2n−1) + μ_0 = 2n−1
- we then see that μ_2 = (2n−3) + μ_1
- we can substitute μ_1
- when we get back to the bottom, we have μ_n = 1 + 3 + 5 + ⋯ + (2n−1) = n² (see the numeric check below)
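- a numeric check of the drunk-walk system above (n = 10 is an arbitrary choice):
```python
# Solve mu_0 = 0, mu_i = 1 + (mu_{i-1} + mu_{i+1})/2 for 0 < i < n,
# mu_n = 1 + mu_{n-1}, and compare with the closed forms.
import numpy as np

n = 10
A = np.zeros((n + 1, n + 1))
b = np.zeros(n + 1)
A[0, 0] = 1.0                                  # mu_0 = 0
for i in range(1, n):
    A[i, i] = 1.0
    A[i, i - 1] = A[i, i + 1] = -0.5           # mu_i - mu_{i-1}/2 - mu_{i+1}/2 = 1
    b[i] = 1.0
A[n, n], A[n, n - 1], b[n] = 1.0, -1.0, 1.0    # mu_n = 1 + mu_{n-1}
mu = np.linalg.solve(A, b)
print(mu[1], 2 * n - 1)                        # mu_1 = 2n - 1
print(mu[n], n ** 2)                           # mu_n = n^2
```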
- 2-SAT problem
- input: φ = (x_1∨x_2) ∧ (x_3∨¬x_1) ∧ …
- clauses with exactly 2 literals
- output: a satisfying assignment OR “unsatisfiable”
- there exists a polynomial algorithm
- we will show a randomized algorithm
- arbitrarily initialize (x_1,…,x_n)
- while φ has an unsatisfied clause:
- choose one of the unsatisfied clauses and flip one of its variables (at random)
- return (x_1,…,x_n)
- repeat the loop body at most 2mn² times
- then, if φ is still unsatisfied, return "unsatisfiable"
- this may introduce errors
- theorem: the algorithm makes an error with probability ≤ 2^{−m}
- proof
- x_1^*, …, x_n^* … one of the satisfying assignments
- D_t … the number of indices i such that x_i ≠ x_i^* at time t
- t = 0, 1, …, 2mn²
- 0 ≤ D_t ≤ n
- D_t = 0 ⟹ we found a solution
- situation
- we assume that the clause (x_1∨x_2) is unsatisfied at time t and we choose it
- ⟹ x_1 = x_2 = F
- ⟹ x_1^* ∨ x_2^* = T
- we randomly flip x_1 or x_2, which increases or decreases D
- if x_1^* = T and x_2^* = F (or vice versa), the random flip may decrease or increase D by one (each with probability 1/2)
- we will call these the "good cases"
- if x_1^* = x_2^* = T, then the random flip always decreases D, which is nice, but it means that we cannot model D using a Markov chain
- because of that, we define D′ which always decreases or increases by one, each with probability 1/2
- in the good cases, it behaves the same way as D
- in the bad case, we flip a coin to decide whether D′ should go up or down
- clearly D ≤ D′ and T ≤ T′ (where T is the first time D = 0, similarly for T′, D′)
- we know that ET′ ≤ n² (the drunk-walk bound above), which means that ET ≤ n²
- using the Markov inequality, we get P(T ≥ 2n²) ≤ ET/(2n²) ≤ n²/(2n²) = 1/2
- we repeat the whole process m times; that way we get a failure probability ≤ 2^{−m} and running time O(mn⁴)
- there are O(n²) clauses of two literals, and we need to find an unsatisfied one 2mn² times
- we can get a randomized 3-SAT algorithm with (4/3)^n time complexity this way (a sketch of the 2-SAT algorithm follows below)
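- a runnable sketch of the randomized 2-SAT walk above; the clause encoding (variable index, polarity) and the toy formula are my own choices, and m plays the role of the confidence parameter from the analysis:
```python
import random

def solve_2sat(clauses, n, m):
    x = [random.random() < 0.5 for _ in range(n)]    # arbitrary initialization
    for _ in range(2 * m * n * n):
        unsat = [c for c in clauses
                 if not any(x[v] == pos for v, pos in c)]
        if not unsat:
            return x                                  # satisfying assignment
        v, _ = random.choice(random.choice(unsat))    # random clause, random literal
        x[v] = not x[v]                               # flip that variable
    return None                  # "unsatisfiable" (errs with probability <= 2^-m)

# (x0 or x1) and (not x0 or x1) and (x0 or not x1) -- satisfied by x0 = x1 = True
clauses = [[(0, True), (1, True)], [(0, False), (1, True)], [(0, True), (1, False)]]
print(solve_2sat(clauses, n=2, m=5))
```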
- Hidden Markov Model (HMM)
- there is some unobservable Markov chain (X_t)
- we can observe (Y_t) … Y_t is somehow obtained from X_t
- widely applicable
- smart algorithms exist (e.g. the Viterbi algorithm, which finds the most likely sequence of hidden states)
Bayesian Statistics
- Probabilistic method
- we prove that some graph exists by showing that a random graph has a certain property with probability greater than zero
- G(n,p) … a random graph with vertices 1,…,n
- ∀i,j: i∼j with probability p (all edges independent)
- P(G(n,1/2) has no K_k nor its complement K̄_k (an independent set on k vertices) as an induced subgraph) > 0 if n ≤ 2^{k/2}
- thus ∃ G on 2^{k/2} vertices with no induced K_k nor K̄_k
- Ramsey theorem
- Bayes theorem
- let X,Θ be discrete random variables
- p_{Θ∣X}(θ∣x) = p_{X∣Θ}(x∣θ)·p_Θ(θ) / (∑_{θ′∈Im Θ} p_{X∣Θ}(x∣θ′)·p_Θ(θ′))
- similar for continuous RVs
- note: terms with pΘ(θ′)=0 are considered to be zero
- Bayesian statistics – general approach
- Θ possible worlds
- prior distribution pΘ(θ)
- example: with probability 1/2 the coin is fake (has heads on both sides)
- p_Θ(real) = 1/2
- p_Θ(fake) = 1/2
- X … measurement/observation
- posterior distribution
- p_{Θ∣X}(θ∣x) = p_Θ(θ)·p_{X∣Θ}(x∣θ) / (∑_{t∈Im Θ} p_Θ(t)·p_{X∣Θ}(x∣t))
- example: p_{Θ∣X}(fake∣heads) = (1/2·1) / (1/2·1 + 1/2·1/2) = 2/3
- alternatively: which “heads” did we get?
- one side of the fake coin
- the other side of the fake coin
- one side of the real coin
- two of the three possible "worlds" correspond to the fake coin → p_{Θ∣X}(fake∣heads) = 2/3
- θ̂ = fake … MAP estimate (see the numeric sketch below)
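- the fake-coin posterior computed numerically (a direct transcription of the example above):
```python
# Worlds: 'real' (P(heads) = 1/2) and 'fake' (heads on both sides, P(heads) = 1).
prior = {'real': 0.5, 'fake': 0.5}
likelihood_heads = {'real': 0.5, 'fake': 1.0}

evidence = sum(prior[t] * likelihood_heads[t] for t in prior)
posterior = {t: prior[t] * likelihood_heads[t] / evidence for t in prior}
print(posterior)   # {'real': 1/3, 'fake': 2/3}; the MAP estimate is 'fake'
```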
- Bayesian estimates
- point estimates
- MAP (maximum a posteriori)
- we choose θ^ to maximize pΘ∣X … we replace the random variable by its mode
- LMS ("least mean squares")
- we choose θ̂ = E(Θ∣X=x) … we replace the random variable by its mean
- we get an unbiased point estimate that has the smallest possible mean square (LMS) error
- interval estimates
- credible sets S (similar to confidence intervals used in classical statistics)
- note: often, the constant in the denominator does not matter
- MAP (maximum a posteriori)
- θ̂_MAP = argmax_θ p_{Θ∣X}(θ∣x)
- θ̂ is a point estimate for Θ
- that equals argmax_θ p_Θ(θ)·p_{X∣Θ}(x∣θ)
- we can ignore the normalizing constant
- we are replacing the random variable by its mode
- note: maximum likelihood is similar but it uses argmax_θ p_{X∣Θ}(x∣θ)
- it corresponds to assuming a "flat prior" (uniform or discrete uniform)
- Naive Bayes classifier
- let x = (x_1,…,x_n)
- in machine learning … a vector of features
- we need p_{X∣Θ}(x∣θ) to get the MAP estimate
- "naive assumption": p(x∣θ) = p(x_1∣θ)·p(x_2∣θ)·…·p(x_n∣θ) (the features are conditionally independent given θ)
- LMS point estimate
- LMS = least mean square
- the estimate such that E((Θ−θ̂)² ∣ X=x) is minimal
- to compute it
- θ̂_LMS = E(Θ∣X=x) = ∑_{t∈Im Θ} t·p_{Θ∣X}(t∣x)
- we find the mean of the random variable
- proof
- E((Θ−θ̂)² ∣ X=x) = E(Θ² ∣ X=x) − E(2Θθ̂ ∣ X=x) + E(θ̂² ∣ X=x)
- we want to minimize it
- the first term does not depend on θ̂, so we get rid of it
- we are left with −2θ̂·E(Θ∣X=x) + θ̂²
- to find its minimum, we take the derivative and set it equal to zero
- −2E(Θ∣X=x) + 2θ̂ = 0
- therefore θ̂ = E(Θ∣X=x) as we wanted
- Beta distribution
- f_Θ(x) = c·x^{α−1}·(1−x)^{β−1} for x∈[0,1]
- where c is a normalizing constant
- properties
- f_Θ is maximal for θ = (α−1)/(α+β−2)
- EΘ = α/(α+β)
- Conjugate distributions
- Beta (prior) & binomial (likelihood) → Beta (posterior)
- "Beta and Beta are conjugate distributions with respect to the binomial distribution" (the prior and the posterior come from the same family)
- normal (prior) & normal (likelihood) → normal (posterior)
- both were proved in the lecture
- it is an algebraic convenience; it does not work for all distributions (see the numeric sketch below)
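- a numeric sketch of the Beta–binomial conjugacy, checking on a grid that a Beta(α,β) prior with a Bin(n,·) likelihood gives a Beta(α+k, β+n−k) posterior; all constants are toy choices:
```python
import numpy as np
from math import comb

a, b, n, k = 2, 2, 10, 7
theta = np.linspace(1e-6, 1 - 1e-6, 10_000)

prior = theta**(a - 1) * (1 - theta)**(b - 1)
likelihood = comb(n, k) * theta**k * (1 - theta)**(n - k)
post = prior * likelihood
post /= post.sum() * (theta[1] - theta[0])           # normalize numerically

conjugate = theta**(a + k - 1) * (1 - theta)**(b + n - k - 1)
conjugate /= conjugate.sum() * (theta[1] - theta[0])
print(np.max(np.abs(post - conjugate)))              # ~0: the two densities agree
```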
Conditional Expectation
- Conditional independence
- events
- A⊥B ⟺ P(A∩B) = P(A)·P(B)
- A⊥B ∣ C ⟺ P(A∩B∣C) = P(A∣C)·P(B∣C)
- A, B are independent conditionally given C
- it is possible (even typical) that A⊥B ∣ C and A⊥B ∣ C^c, but not A⊥B
- Conditional expectation
- E(Y∣X=x) vs. E(Y∣X)
- E(Y∣X=x) = g(x) … a number
- E(Y∣X) = g(X) … a random variable
- Law of iterated expectation
- E(E(Y∣X))=E(Y)
- proof
- E(E(Y∣X)) = E(g(X)) = ∑_x P(X=x)·g(x)
- E(Y) = ∑_x P(X=x)·E(Y∣X=x) = ∑_x P(X=x)·g(x)
- by the law of total probability
- Estimate error
- basic task of statistics
- estimate one quantity (Y) given data/measurement (X)
- example: groups of students, their exam results
- X … groups of students
- X=2 … we are currently looking at the second group
- Y … students' exam results
- E(Y∣X=3) … the expectation of the results in group 3
- estimator
- Ŷ = E(Y∣X) = g(X)
- we want the estimate error
- Ỹ = Ŷ − Y
- what is E(Ỹ∣X) equal to?
- E(Ỹ∣X) = E(Ŷ−Y∣X) = E(Ŷ∣X) − E(Y∣X) = Ŷ − Ŷ = 0
- because E(Ŷ∣X) = E(E(Y∣X)∣X) = E(Y∣X) = Ŷ
- (E(Y∣X) is already a function of X, so conditioning on X again changes nothing)
- E(Ỹ) = E(E(Ỹ∣X)) = 0 by the law of iterated expectation
- Estimate error covariance and variance
- cov(Ỹ,Ŷ) = E(ỸŶ) − E(Ỹ)E(Ŷ) = E(ỸŶ) (as E(Ỹ)=0) = E(E(ỸŶ∣X)) = E(Ŷ·E(Ỹ∣X)) = E(Ŷ·0) = 0
- the penultimate equality holds because Ŷ is constant once X=x is given
- Ỹ, Ŷ are uncorrelated
- note
- independent ⟹ uncorrelated
- uncorrelated ⇏ independent
- Law of iterated variance / Eve's rule
- var Y=E(var(Y∣X))+var(E(Y∣X))
- intragroup variance + intergroup variance
- proof
- Y = Ŷ − Ỹ
- var Y = var(Ŷ) + var(Ỹ) − 2·cov(Ŷ,Ỹ) = var(Ŷ) + var(Ỹ) (the covariance is 0)
- var(Ỹ) = E(Ỹ²) − (EỸ)² = E(Ỹ²) (as EỸ = 0) = E(E(Ỹ²∣X))
- = E(E((Y−Ŷ)²∣X)) = E(var(Y∣X)); and var(Ŷ) = var(E(Y∣X)) is the second term (see the numeric check below)
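- a numeric check of Eve's rule on made-up grouped data (X = group index, Y = score; picking a uniformly random data point realizes the pair (X,Y)):
```python
import numpy as np

groups = [np.array([1.0, 2.0, 3.0]),        # Y-values for X = 0
          np.array([10.0, 12.0]),           # X = 1
          np.array([5.0, 5.0, 6.0, 8.0])]   # X = 2

y = np.concatenate(groups)
p = np.array([len(g) for g in groups]) / len(y)       # P(X = x)
means = np.array([g.mean() for g in groups])          # E(Y | X = x)
vars_ = np.array([g.var() for g in groups])           # var(Y | X = x)

within = (p * vars_).sum()                            # E(var(Y|X))
between = (p * (means - (p * means).sum())**2).sum()  # var(E(Y|X))
print(y.var(), within + between)                      # the two sides agree
```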
Balls & Bins
- Really useful approximation
- 1−x ≈ e^{−x} (for x close to 0)
- example usage: birthday paradox
- (1−1/365)(1−2/365)…(1−(k−1)/365) ≈ e^{−k(k−1)/(2·365)}
- because 1−1/365 ≈ e^{−1/365} etc.
- Union bound
- for events A_1, A_2, … it holds that P(⋃A_i) ≤ ∑P(A_i)
- therefore P(none of the A_i) ≥ 1 − ∑P(A_i)
- Balls and bins model
- we will be throwing m balls randomly into n bins
- Xi … number of balls in bin i
- How many bins are empty?
- P(X_i=0) = (1−1/n)^m ≈ e^{−m/n}
- I_i = 1 if X_i = 0 (otherwise I_i = 0)
- E(∑_i I_i) = ∑_i E I_i ≈ n·e^{−m/n} (see the simulation below)
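- a small simulation comparing the observed number of empty bins with n·e^{−m/n} (parameter values are arbitrary):
```python
import random, math

n, m, trials = 1000, 1500, 20
empty = 0
for _ in range(trials):
    bins = [0] * n
    for _ in range(m):
        bins[random.randrange(n)] += 1   # throw each ball into a random bin
    empty += bins.count(0)
print(empty / trials, n * math.exp(-m / n))   # the two numbers are close
```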
- How many balls are in bin i?
- X_i ∼ Bin(m, 1/n) ≈ Pois(m/n)
- E X_i = m/n
- Applications – bucket sort and hashing
- Max-load likely upper bound
- definition: max-load … max{X_1,…,X_n}
- we assume m = n in the max-load bounds
- theorem: for large enough n we have P(max-load ≥ 3·log n/log log n) ≤ 1/n
- proof
- X_i … number of balls in bin i
- P(X_i ≥ M) ≤ (n choose M)·(1/n)^M
- the probability that bin i gets at least M balls
- over all M-tuples of balls, we sum the probability that they all fall into bin i (union bound)
- B_i … the event X_i ≥ M
- P(⋃_i B_i) = P(max X_i ≥ M)
- P(⋃_i B_i) ≤ ∑_{i=1}^n P(B_i) ≤ n·(n choose M)·(1/n)^M ≤ n·(e/M)^M
- union bound again, then the estimate (n choose M) ≤ (en/M)^M (from an estimate of the factorial)
- we need to show that n·(e/M)^M <? 1/n
- log(n²·(e/M)^M) <? log 1 = 0
- 2L + M(1 − log M) <? 0
- where L is log n (LL is log log n, etc.)
- we substitute M = 3L/LL
- 2L + (3L/LL)·(1 − (log 3 + LL − LLL)) <? 0
- 2L + (3L/LL)·(1 − log 3 − LL + LLL) <? 0
- −L + (3L/LL)·(1 − log 3) + (3L/LL)·LLL <? 0
- the second and third terms are much smaller than L
- therefore it holds for large n
- Exact case vs. Poisson case
- theorem: any event that happens with probability ≤ p in the Poisson case happens with probability ≤ p·e·√m in the exact case
- Max-load likely lower bound
- theorem: for large enough n we have P(max-load ≤ log n/log log n) ≤ 1/n
- proof
- Poisson case
- P(X_i < M) ≤ 1 − P(X_i = M) = 1 − e^{−λ}·λ^M/M! = 1 − 1/(e·M!)
- λ = m/n = 1
- P(max X_i < M) = P(X_1 < M)·…·P(X_n < M) (the bins are independent in the Poisson case)
- ≤ (1 − 1/(e·M!))^n ≤ (e^{−1/(e·M!)})^n = exp(−n/(e·M!)) =: p
- exact case
- P(max X_i < M) ≤ p·e·√m = e·√n·exp(−n/(e·M!))
- we want e·√n·exp(−n/(e·M!)) <? 1/n
- it suffices that exp(−n/(e·M!)) <? 1/n² (as e·√n ≤ n for large n)
- n/(e·M!) >? 2·log n
- M! <? n/(2e·log n)
- log M! <? L − LL − log(2e)
- we know M! ≤ M·(M/e)^M
- log M! ≤ log M + M(log M − 1)
- ≤ (LL − LLL) + (L/LL)·(LL − LLL − 1)
- here we substituted M = L/LL (so log M = LL − LLL)
- ≤ L + LL − (L·LLL)/LL ≤ L − LL − c
- as 2LL + c ≤ (L·LLL)/LL for large n (it's not that obvious, but it holds)
- therefore log M! ≤ L − LL − c
- that's what we wanted to prove
Stochastic Processes
- Bernoulli process
- infinite sequence of independent identically distributed Bernoulli trials
- sequence of RVs X_1, X_2, … that are independent and each follows the Ber(p) distribution
- X_k = 1 … success/arrival at time k
- Quantities of a Bernoulli process
- N_t … number of successes up to time t
- N_t ∼ Bin(t,p)
- E[N_t] = t·p
- T_k … time of the k-th success
- L_k = T_k − T_{k−1}
- waiting time
- memoryless
- L_k ∼ Geom(p)
- E[L_k] = 1/p
- P(L_k = ℓ) = (1−p)^{ℓ−1}·p
- Tk properties
- T_k = L_1 + ⋯ + L_k
- by linearity, E[T_k] = k/p
- T_k has the Pascal distribution of order k
- ∀t≥k: P(T_k=t) = (t−1 choose k−1)·p^k·(1−p)^{t−k}
- ∀t<k: P(T_k=t) = 0
- Alternative description of a Bernoulli process
- we can describe the situation by the interarrival times
- i.i.d. random variables L_1, L_2, … ∼ Geom(p)
- T_k = L_1 + ⋯ + L_k
- X_k = 1 whenever T_j = k for some j
- clearly, the sequence X1,X2,… is a Bernoulli process
- Merging of Bernoulli processes
- two independent Bernoulli processes; Z_i = X_i ∨ Y_i
- BP(p) [merged with] BP(q) = BP(1−(1−p)(1−q))
- = BP(p+q−pq)
- Splitting of Bernoulli processes
- X_i, Y_i, Z_i
- Z_i ∼ BP(p)
- if Z_i = 0, then X_i = Y_i = 0
- if Z_i = 1, then with probability q we set (X_i,Y_i) = (1,0), otherwise (X_i,Y_i) = (0,1)
- then X_i ∼ BP(pq) and Y_i ∼ BP(p(1−q))
- Poisson process
- continuous time
- T_1, T_2, T_3, … are the times of individual arrivals
- N_t … number of arrivals up to time t
- λ … intensity (how many arrivals per unit of time)
- N_t ∼ Pois(λ·t)
- L_k ∼ Exp(λ)
- T_k has the Erlang distribution, E[T_k] = k/λ
- Poisson process interval independence
- for any sequence 0 ≤ t_0 < t_1 < ⋯ < t_k
- the RVs N_{t_{i+1}} − N_{t_i}
- are independent
- each one follows Pois(λ·(t_{i+1} − t_i))
- Merging of Poisson process
- PP(λ)[merge with]PP(λ′)=PP(λ+λ′)
- Splitting of Poisson process
- for each arrival of PP(λ), we toss a coin: with probability p the arrival is counted as valid
- the resulting process is PP(λp) (see the simulation below)
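- a simulation sketch: generate a Poisson process from Exp(λ) interarrival times, then split it with p-coins; the empirical rates should be ≈ λ and ≈ λp (all parameters made up):
```python
import random

def poisson_arrivals(lam, horizon):
    t, out = 0.0, []
    while True:
        t += random.expovariate(lam)   # L_k ~ Exp(lambda)
        if t > horizon:
            return out
        out.append(t)

lam, p, horizon = 2.0, 0.3, 10_000
arrivals = poisson_arrivals(lam, horizon)
kept = [t for t in arrivals if random.random() < p]   # split with a p-coin
print(len(arrivals) / horizon)   # ~ lambda
print(len(kept) / horizon)       # ~ lambda * p
```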
Non-parametric Tests
- Permutation test
- two-sample test
- A/B testing
- A … control group (no treatment)
- B … group with treatment applied
- is the A vs. B difference only random?
- we observe … T(A,B)
- T … test statistic
- example: T = Ā − B̄
- F = {T(π(A∪B)) ∣ π ∈ S_{m+n}} (where m = |A|, n = |B|)
- usually the sizes of the groups stay the same: we use T(first |A| elements of π(A∪B), rest of π(A∪B))
- p-value: the fraction of F that is more extreme than the observed T(A,B)
- speed-up … we use only k random samples from S_{m+n} (as in the sketch below)
- for a paired test, we only permute X_i with Y_i
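- a sketch of the approximate (Monte Carlo) permutation test with T = Ā − B̄; the data and k are made up:
```python
import random

A = [2.1, 2.5, 2.8, 3.0, 2.2]
B = [2.9, 3.4, 3.1, 3.6, 2.8]

def T(a, b):
    return sum(a) / len(a) - sum(b) / len(b)

observed = T(A, B)
pool, k, extreme = A + B, 10_000, 0
for _ in range(k):
    random.shuffle(pool)                 # a random permutation of A union B
    if abs(T(pool[:len(A)], pool[len(A):])) >= abs(observed):
        extreme += 1
print(extreme / k)                       # two-sided p-value
```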
- Sign test
- paired test
- pairs Xi,Yi
- Z_i = Y_i − X_i
- H_0 … P(X_i > Y_i) = 1/2
- count the number of pairs where Z_i > 0 (there was a measurable improvement after the treatment)
- use Bin(n, 1/2) to get the p-value (see the sketch below)
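- a sketch of the sign-test p-value computed directly from Bin(n, 1/2) (toy counts):
```python
from math import comb

n, k = 20, 15                      # say 15 of 20 pairs improved (toy numbers)

def binom_tail(n, k):              # P(Bin(n, 1/2) >= k)
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# two-sided: double the smaller of the two tails, capped at 1
p_value = 2 * min(binom_tail(n, k), 1 - binom_tail(n, k + 1))
print(min(p_value, 1.0))
```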
- Wilcoxon signed rank test
- paired test
- procedure
- let X_i be the difference B_i − A_i
- sort by absolute value, assign ranks (R_i)
- calculate T (or T⁺ or T⁻, it does not matter)
- T⁺ = ∑_{X_i>0} R_i
- T⁻ = ∑_{X_i<0} R_i
- T = T⁺ − T⁻ = ∑_i R_i·sign(X_i)
- compare with the scores for measurements with random +/− signs
- we use a stronger null hypothesis
- H_0: the median is 0 & the distribution of X is symmetric
- note: X is symmetric ⟺ A, B have the same continuous distribution (source)
- Mann-Whitney U-test
- two-sample test
- we have two populations, X and Y
- we get X_1,…,X_m and Y_1,…,Y_n
- usually m ≠ n
- S(X_i,Y_j) = 1 if X_i > Y_j, = 1/2 if X_i = Y_j, = 0 if X_i < Y_j
- U = ∑_{i=1}^m ∑_{j=1}^n S(X_i,Y_j)
- the sign test is a paired variant of this
- we use the same approach as in the permutation test, with U instead of T = X̄_m − Ȳ_n
- alternative view: we sort X∪Y and count pairs "X_i before Y_j"
- when to use it?
- numerical data … real numbers (weight, time, price) → permutation test
- ordinal data … classes (light/medium/heavy, quick/slow) → U-test (see the sketch below)
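- the U statistic computed straight from its definition (toy scores); a p-value would then be obtained as in the permutation test above:
```python
def u_statistic(xs, ys):
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0       # X_i beats Y_j
            elif x == y:
                u += 0.5       # ties count half
    return u

X = [7, 3, 9, 2]               # toy ordinal scores
Y = [4, 6, 1, 5]
print(u_statistic(X, Y))
```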
Moment Generating Functions and their applications
- Markov inequality
- for X nonnegative
- ∀a≥1: P(X ≥ a·μ) ≤ 1/a
- Chebyshev inequality
- ∀a>0: P(|X−μ| ≥ a·σ) ≤ 1/a²
- Chernoff bounds (alternative)
- for X = ∑_{i=1}^n X_i where the X_i ∼ Ber(p_i) are independent
- μ = ∑_{i=1}^n p_i
- ∀δ∈(0,1]: P(X ≥ (1+δ)·μ) ≤ e^{−μδ²/3}
- ∀δ∈(0,1): P(X ≤ (1−δ)·μ) ≤ e^{−μδ²/2}
- note: there are many more alternative bounds
- Moment generating function
- given a random variable X, we define M_X: ℝ → ℝ
- M_X(s) := E(e^{sX})
- ∀X: M_X(0) = E(e^{0·X}) = 1
- MGF moment theorem
- theorem: M_X(s) = ∑_{k=0}^∞ E[X^k]·s^k/k!
- E[X^k] … the k-th moment of X
- proof
- E[e^{sX}] = E[∑_k (sX)^k/k!] = E[∑_k X^k·s^k/k!] = ∑_k E[X^k·s^k/k!] = ∑_{k=0}^∞ E[X^k]·s^k/k!
- MGF for Bernoulli variable
- X∼Ber(p)
- M_X(s) = p·e^{s·1} + (1−p)·e^{s·0}
- M_X(s) = 1 − p + p·e^s
- note: e^s = 1 + s + s²/2 + s³/3! + ⋯ = ∑_{k=0}^∞ s^k/k!
- M_X(s) = 1 + p·(s + s²/2 + s³/3! + …)
- EX = [s/1!] M_X(s) = p
- this notation selects the coefficient of s^k/k! in the MGF
- EX² = [s²/2!] M_X(s) = p (see the SymPy check below)
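- a SymPy check of reading moments off the MGF of Ber(p), using E[X^k] = k!·[s^k] M_X(s):
```python
import sympy as sp

s, p = sp.symbols('s p')
M = 1 - p + p * sp.exp(s)                      # MGF of Ber(p)
series = sp.series(M, s, 0, 4).removeO()      # Taylor expansion around s = 0
print(sp.factorial(1) * series.coeff(s, 1))   # E[X]   = p
print(sp.factorial(2) * series.coeff(s, 2))   # E[X^2] = p
```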
- MGF for continuous X
- M_X(s) = E[e^{sX}] = (LOTUS) ∫_{−∞}^{∞} e^{sx}·f_X(x) dx
- MGF “linearity” and “sum”
- theorem
- M_{aX+b}(s) = e^{sb}·M_X(as)
- proof
- E(e^{s(aX+b)}) = E(e^{asX}·e^{sb}) = e^{sb}·E(e^{(as)X})
- theorem
- X, Y independent ⟹ M_{X+Y} = M_X·M_Y
- proof
- M_{X+Y}(s) = E[e^{s(X+Y)}] = E[e^{sX}·e^{sY}] = E[e^{sX}]·E[e^{sY}] = M_X(s)·M_Y(s)
- example
- X∼Bin(n,p)
- M_X(s) = (1−p+p·e^s)^n
- MGF of normal distribution
- X ∼ N(μ, σ²)
- M_X(s) = e^{μs}·e^{σ²s²/2}
- MGF equality and convergence
- equality
- let X, Y be RVs such that (∃ε>0)(∀s∈(−ε,ε)): M_X(s) = M_Y(s) ∈ ℝ
- then F_X = F_Y
- convergence
- Y, X_1, X_2, X_3, … RVs
- (∃ε>0)(∀s∈(−ε,ε)): lim_{n→∞} M_{X_n}(s) = M_Y(s)
- F_Y is continuous
- then X_n →_d Y (convergence in distribution)
- lim_{n→∞} F_{X_n}(x) = F_Y(x)
- Central limit theorem
- theorem
- X_1, X_2, … i.i.d. (independent identically distributed) RVs
- EX_i = μ, var X_i = σ²
- Y_n := (X_1 + ⋯ + X_n − nμ)/(σ·√n)
- Y_n →_d N(0,1)
- that is, lim_{n→∞} F_{Y_n}(t) = Φ(t)
- proof
- we may assume μ = 0
- otherwise we would set X′_n = X_n − μ
- the variance would not change
- the formula for Y_n also would not change (we subtract nμ there)
- M_{X_i}(s) = 1 + as + bs² + O(s³) (s ≐ 0)
- a = μ = 0
- b = (1/2)·E(X_i²) = (1/2)(σ² + μ²) = (1/2)σ²
- from the alternative formula for variance (var X = EX² − (EX)²)
- M_{X_i}(s) = 1 + σ²s²/2 + O(s³)
- M_{Y_n}(s) = ∏_i M_{X_i}(s/(σ√n))
- because Y_n = (X_1 + ⋯ + X_n)/(σ√n)
- we need to show that lim M_{Y_n}(s) = M_Y(s) where Y ∼ N(0,1)
- i.e. that lim_{n→∞} M_{X_i}(s/(σ√n))^n = e^{s²/2}
- M_{Y_n}(s) = M_{X_i}(s/(σ√n))^n = (1 + (σ²/2)·(s/(σ√n))² + O(n^{−3/2}))^n = (1 + s²/(2n) + O(n^{−3/2}))^n
- we substituted s/(σ√n) for s in M_{X_i}
- we know that (1 + x/n)^n → e^x
- therefore (1 + (s²/2)/n + O(n^{−3/2}))^n → e^{s²/2} (a simulation check follows below)
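- an empirical check of the CLT for uniform(0,1) summands (the sample sizes are arbitrary):
```python
import numpy as np

n, trials = 100, 50_000
mu, sigma = 0.5, np.sqrt(1 / 12)             # mean and sd of uniform(0,1)

samples = np.random.rand(trials, n)
Y = (samples.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))
print(Y.mean(), Y.std())                     # ~0 and ~1
print((Y <= 1.0).mean())                     # ~ Phi(1) = 0.8413
```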
- Chernoff inequality
- theorem
- suppose i.i.d. X_1, X_2, …, X_n are ±1, each with probability 1/2
- X = ∑_i X_i
- we have σ² = n
- ∀t>0: P(X ≥ t) = P(X ≤ −t) ≤ e^{−t²/(2σ²)}
- proof
- M_{X_i}(s) = (e^{−s}+e^s)/2
- Markov inequality
- for X nonnegative
- ∀a>0: P(X ≥ a) ≤ μ/a
- we use it
- ∀s>0: P(X ≥ t) = P(e^{sX} ≥ e^{st}) ≤ E(e^{sX})/e^{st}
- E(e^{sX})/e^{st} = M_{X_i}(s)^n/e^{st} = ((e^{−s}+e^s)/2)^n / e^{st}
- it can be shown that (e^{−s}+e^s)/2 ≤ e^{s²/2}
- therefore P(X ≥ t) ≤ e^{ns²/2}/e^{st} = e^{ns²/2−st}
- we set s = t/n
- then P(X ≥ t) ≤ e^{−t²/(2n)}
- Set balancing using Chernoff inequality
- we have subsets S_1,…,S_n ⊆ [m]
- they may be features like "men", "old people", …
- we want T ⊆ [m] such that it almost dissects every S_i
- to get two groups with the same (or similar) representation of every feature
- D_i = |T∩S_i| − |S_i∖T|
- disc(T) = max_i |D_i|
- use random T!
- ∀a_j∈S_i we define "indicators"
- X_j = 1 ⟺ a_j∈T
- X_j = −1 ⟺ a_j∉T
- D_i = ∑_{j=1}^{|S_i|} X_j
- P(D_i ≥ t) ≤ e^{−t²/(2|S_i|)} ≤ e^{−t²/(2m)}
- P(|D_i| ≥ t) ≤ 2·e^{−t²/(2m)}
- P(∃i: |D_i| ≥ t) ≤ n·2·e^{−t²/(2m)} ≤ 2/n
- for t = √(4m·log n)
- idea: it is useful to have an inequality (Chernoff) that always holds; we don't need to worry whether our n is big enough (see the sketch below)
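- a simulation sketch of random set balancing: each element joins T with probability 1/2 (encoded as a ±1 sign), and the observed discrepancy is compared with the threshold √(4m·log n); the set-system parameters are made up:
```python
import math, random

m, n = 500, 50
# random "features": each element belongs to each set with probability 0.3
sets = [[j for j in range(m) if random.random() < 0.3] for _ in range(n)]

signs = [random.choice([-1, 1]) for _ in range(m)]   # +1 <=> element in T
disc = max(abs(sum(signs[j] for j in S)) for S in sets)
print(disc, math.sqrt(4 * m * math.log(n)))          # disc stays below the bound w.h.p.
```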