this dir | view | cards | source | edit | dark
top
Probability and Statistics 2
Markov chains
- example 1
- machine in 2 states – working × broken
- w → b … probability 0.01
- w → w … 0.99
- b → w … 0.9
- b → b … 0.1
- example 2
- a fly which may get caught in a web … absorbing states
- df: stochastic process (náhodný proces)
- sequence of random variables X0,X1,… such that "all probabilities are defined"
- e.g. P(X0=1∧X1=2∧X3=0) is defined
- df: Markov process/chain
- stochastic process such that
- ∃ finite/countable set S:ImXt⊆S∀t
- elements of S … states
- t … time
- (∀i,j∈S)(∃pij∈[0,1])(∀t)(∀a0,a1,…,at,at+1)
- notes
- this is a discrete time Markov chain
- discrete space … ∣S∣≤∣N∣
- time-homogeneous
- pij not depeding on t
- Markov condition
- only applies if the condition probabilities are defined
- Markov condition … "forgetting the past"
- df: transition matrix P=(pij)
- df: transition diagram/graph
- vertices … S
- directed edges … {(i,j):pij>0}
- observation: every row of P sums to 1
- Markov chain is equivalent to a random walk on the transition diagram
- from state i we go to j with the probability of pij independent of the past
- in example 1, assume X0, what is P(X2=w)?
- P(X1=w)=p11=0.99
- P(X2=w)=p11⋅p11+p12⋅p21=0.99⋅0.99+0.01⋅0.9
- using the total probability theorem
- probability mass function (PMF, pravděpodobnostní funkce) for X0,X1,…
- πi(t)=P(Xt=i)
- t∈N0
- i∈S
- theorem
- π(t+1)=π(t)⋅P
- where π(t)=(π1(t),…,πn(t)) row vector
- proof … definition of matrix multiplication & total probability theorem
- theorem
- P(X0=a0,X1=a1,…,Xt=at)=πa0(0)⋅Pa0a1⋅Pa1a2⋅…⋅Pat−1at
- proof by induction
- df: probability of k-step transition … rij(k)
- rij(k)=P(Xt+k=j∣Xt=i)
- rij(1)=P(Xt+1=j∣Xt=i)=pij
- it is independent of t
- theorem (Chapman-Kolmogorov)
- ∀ Markov chain, ∀i,j,k:
- rij(k+1)=∑u=1nriu(k)puj
- rij(k+ℓ)=∑u=1nriu(k)ruj(ℓ)
- rij(k)=(Pk)ij
- proof
- 1 is a special case of 2 (ruj(1)=puj)
- 1⟹3 … matrix multiplication & induction
- …
- df: j is accessible from i (where i,j∈S)
- i→j
- j∈A(i)
- j is accessible from i≡ ∃t:rij(t)P(Xt=j∣X0=i)>0
- ⟺ in the transition digraph exists directed path i→j of length t
- df: i↔j (i and j communicate) ≡i∈A(j)∧j∈A(i)
- theorem: ↔ is an equivalence relation
- proof
- reflexive: i∈A(i) … t=0 in the definition
- symmetric: the formula i∈A(j)∧j∈A(i) is symmetric
- transitive: in the digraph ∃ path i to j, ∃ path j to k ⟹∃ walk i→k
- decomposition of the transition digraph to components of strong connectivity ≡ finding equivalence classes of ↔
- df: a state i∈S is recurrent if we return to it with probability 1
- it is transient otherwise
- česky rekurentní, tranzientní
- df: Markov chain is irreducible if ∀i,j∈S:i↔j
- df: Ti=min{t≥1:Xt=i}
- Ti=∞ if the set is empty (there is no such t)
- Ti … random variable
- fij(n)=P(we get to j first at time n∣X0=i)=P(Tj=n∣X0=i)
- we define it for n>0
- fij=P(Tj<∞∣X0=i)=∑n≥1fij(n)
- i is transient
- fii<1 … probability of ever getting back
- ⟺P(Ti=∞)>0
- ⟺P(get back infinitely often)<1
- example – random walk on a line
- with probability 1/2 we go to the left
- f00(2)=(21)2+(21)2=21
- f00(1)=f00(3)=⋯=0
- f00(4)=241⋅2
- r00(4)=f00(4)
- r00(4)=241⋅4
- is 0 recurrent?
- by definition, we should add f00(2)+f00(4)+… and check if it equals 1
- theorem: i is a recurrent state ⟺∑n=1∞rii(n)=∞
- Bn={10 if Xn=i otherwise
- we got back
- E(Bn∣X0=i)=P(Xn=i∣X0=i)=rii(n)
- r00(2n)= probability that out of the 2n steps, n were to the left, n were to the right
- =22n1⋅(n2n)≐nc … see Matoušek, Nešetřil
- ∑n=1∞nc=∞ (taught in calculus)
- E(T0∣X0=0)=∞ (we did not prove that)
- theorem
- if i↔j, then either both i and j are recurrent or both i and j are transient
- for finite Markov chains
- C … equiv class of ↔ in a finite Markov chain
- C is recurrent (∀i∈C is recurrent) ⟺(∀i∈C)(∀j∈S): if i→j then j∈C
- stationary distribution / steady state distribution
- df: π:S→[0,1] such that ∑i∈Sπi=1 is called a stationary distribution if “πP=π"
- ∀i:∑iπipij=πj
- if π(0) (the PMF of X0) is π, then π(1) is π as well
- π(1)=π(0)P
- theorem: if a Markov chain is finite, aperiodic (→ not periodic) and irreducible, then
- ∃ unique stat. distribution π
- ∀i:limn→∞(Pn)ij=πj
- example of periodic Markov chain: two states, we change the state with probability 1
- df: i∈S has period di:=gcd{t:rii(t)>0}
- i∈S is aperiodic if di=1
- df: i is null recurrent if i is recurrent and E(Ti∣X0=i)=∞
- i is positive recurrent if i is recurrent and E(Ti∣X0=i)<∞
- example: random walk on a line
- theorem
- if i,j∈S, i↔j, then
- di=dj
- i is transient ⟺j is trainsient
- i is recurrent ⟺j is recurrent
- i is null recurrent ⟺j is null recurrent
- i is positive recurrent ⟺j is positive recurrent
- these are properties of the class of ↔
- theorem
- if a Markov chain is irreducible, aperiodic and finite, then
- there exists a unique stationary distribution π: πP=π
- ∀ij:lim(Pt)ij=πj
- P(Xt=j∣X0=i)≐πj
- actually, MC does not have to be finite, it suffices if all states are positive recurrent (?)
- steady state (?)
- if π(0)=π then π(1)=π
- the proof is not easy, here is a cheat proof
- Pj=Ij (row sums are 1)
- (P−I)j=0
- P−I is sungular matrix
- ∃x:x(P−I)=0⟹xP=x
- π=cx such that ∑πi=1
- problem
- x may have negative coordinates
- to fix: use Perron-Frobenius theorem
- the correct proof is shown in class of probabilistic techniques
- to find π, solve system of linear equations πP=π, add ∑i∈Sπi=1
- π describes long-term behavior of the MC
- Page Rank (original google search) … MC model of people browsing WWW
- given π, we can find a MC such that π is its stationary distribution; then we can run the MC to generate random objects with distribution π
- Markov chain Monte Carlo (MCMC)
- detailed balance equation
- MC may have this property
- ∀i=j:πiPij=πjPji
- to imagine this: ant colony moving independently according to a Markov chain
- stationary distribution ⟺ the same number of ants at each state at each time – ants don't "accumulate"
- detailed balance equation implies πP=π
- detailed balance equation is stronger than πP=π
- MCMC algo. sketch
- choose aperiodic irreducible digraph
- pij=min{1,πiπj}⋅C
- pji=min{1,πjπi}⋅C
- choose C such that
- ∀i:∑j=ipij<1
- df. pii=1−∑j=ipij>0
- ⟹di=1
- tune the process to make convergence fast
- absorbing state i:pii=1
- A … set of absorbing states
- question 1: which i∈A we end at?
- question 2: how fast?
- example: 0∈A (?)
- ai=P(∃t:Xt=0∣X0=i)
- μi=E(T∣X0=i)
- T=min{t:Xt∈A}
- theorem: (ai)i∈S are the unique solution to
- a0=1
- ai=0 if i∈A,i=0
- ai=∑jpij⋅aj otherwise
- theorem: (μi)i∈S are unieque solutions to
- μi=0 if i∈A
- μi=1+∑jpijμj if i∈/A
- proof
- P(∃t:Xt=0∣X0=0)=1
- P(∃t:Xt=0∣X0=i∈A∖{0})=0
- i∈/A
- Bj={∃t:Xt=0}
- P(Bi)=∑j∈Spij⋅P(Bj)=ajP(Bi∣X1=j)
- example: drunk person on their way home
- A={0}
- μ0=0
- μ1=1+21μ0+21μ2
- μ2=1+21μ1+21μ3
- μn−1=1+21μn−2+21μn
- μn=1+μn−1
- solution
- μ1=2n−1
- μn=n2
- μi≤n2
- 2-SAT problem
- input: φ=(x1∨x2)∧(x3∨¬x1)∧…
- clauses with exactly 2 literals
- output: a satisfying assignment OR “unsatisfiable”
- there exists a polynomial algorithm
- we will show a randomized algorithm
- arbitrarily initialize (x1,…,xn)
- while φ has an unsatisfied clause
- choose one of the unsatisfied clauses and change one of its variables
- return (x1,…,xn)
- repeat (2) ≤2mn2 times
- then, if φ is still unsatisfied, return “unsatisfiable”
- this may introduce errors
- theorem: the algorithm makes an error with probability ≤2−m
- proof
- x1∗,…,xn∗ … one of the satisfying assignments
- Dt the number of i such that xi=xi∗ at time t
- t=0,1,…,2mn2
- 0≤Dt≤n
- Dt=0⟹ we found a solution
- situation
- we assume that clause (x1∨x2) is unsatisfied at time t and we choose it
- ⟹x1=x2=F
- (x1=x2)
- (x2=¬x1) … we could remove such clauses
- ⟹x1∗∨x2∗ is T
- we randomly switch x1 or x2 which increases or decreases D
- …
- we can get a 3-SAT randomized algorithm with (34)n time complexity this way
- Hidden Markov Model (HMM)
- not observable (Xt) … Markov chain
- observable (Yt) … Yt is obtained from Xt
- widely aplplicable
- smart algorithm (Vitebri algorithm)
Probability
- what is probability?
- P:F→[0,1] such that
- P(Ω)=1
- P(⋃An)=∑P(An) if disjoint
- where to find/use it?
- randomized algorithms
- is it possible to have true randomness?
- hardware methods to sample random bits
- software methods
- their independence can be “weak” (for statistics) or “strong” (for cryptography)
- probabilistic method
- we prove that some graph exists by showing that random graph has certain property with probability greater than zero
- G(n,p) … graph with vertices 1,…,n
- ∀i,j:i∼j with probability p (all independent)
- P(G(n,21) has no Kk nor Kk as an induced subgraph)>0 if n≤2k/2
- thus ∃G on 2k/2 vertices with no induced Kk nor Kk
- Ramsey theorem
- statistics
- frequentist
- P(A)= number of good / number of all
- “in long term repetition”
- repeat a random experiment independently n times
- observe that A happens k times
- P(A):=k/n
- Bayesian
- P(it will rain tomorrow)
- subjective probability → betting
- “random universe”
- Ω = set of all possible universes
- Bayes theorem
- MAP (maximum a posteriori)
- θ^MAP=argmaxθpΘ∣X(θ∣x)
- θ^ is a point estimate for Θ
- that equals argmaxpΘ(θ^)⋅pX∣Θ(x∣θ) as we can ignore the normalization constant
- Beta function
- B(α,β)=∫01xα−1(1−x)β−1=(α+β−1)!(α−1)!(β−1)!
- Beta distribution
- fΘ(x)=B(α,β)1xα−1(1−x)β−1
- (lnf)′=(c+(α−1)lnx+(β−1)ln(1−x))′=xα−1−1−xβ−1
- in the maximum, this will be equal to zero
- maximum … α−1+β−1α−1
- …
- LMS point estimate
- LMS = least mean square
- estimate such that E((Θ−θ^)2∣X=x) is minimal
- to compute it
- example: measurement error with a normal distribution
- often, the constant in the denominator does not matter
- from posterior fΘ∣X we can find
- point extimates
- interval estimates
- confidence intervals (in classical statistics) → credible sets S
- sampling
- rejection sympling
- MCMC sampling
- Monte Carlo Markov chains
- Metropolis Hastings method
- we construct a MC from the probability distribution we want
- we run the MC for long enough
- LMS
- Θ∣X=x
- mean … LMS = minE((Θ−θ^)2∣X=x)
- median … minE(∣Θ−θ^∣∣X=x)
- modus … MAP
- conditional independence
- events
- A⊥B⟺P(A∩B)=P(A)⋅P(B)
- A⊥CB⟺P(A∩B∣C)=P(A∣C)⋅P(B∣C)
- A,B are independent conditionally given C
- it is possible (even typical) that A⊥CB, A⊥CCB, but not A⊥B
- conditional expectation
- E(Y∣X=x) vs. E(Y∣X)
- E(Y∣X=x)=g(x) … number
- E(Y∣X)=g(X) … random variable
- we proved that E(E(Y∣X))=E(g(X))=EY
- “law of iterated expectation”
- basic task of statistics
- estimate one quantity (Y) given data/measurement (X)
- example: groups of students, their exam results
- estimator
- Y^=E(Y∣X)=g(X)
- Y~=Y^−Y
- we proved that E(Y~∣X)=0
- therefore E(Y~)=E(E(Y~∣X))=0
- also, cov(Y~,Y^)=0
- note
- uncorellated ⟸ independent
- uncorellated \centernot⟹ independent
- conditional variance
- iterated variance / eve's rule
- var Y=E(var(Y∣X))+var(E(Y∣X))
- intragroup variance + intergroup variance
- birthday paradox
- 1−x≈e−x
- balls into bins
- m balls
- n bins
- Xi … number of balls in bin i
- birthday paradox … P(maxXi≥2)
- questions
- how many bins are used?
- approximation of Xi
- max load … max{X1,…,Xn}
- P(Xi=0)=(1−n1)m≈e−nm
- Ii=1 if Xi=0 (otherwise 1)
- E(∑iIi)=∑iEIi≈ne−nm
- distribution of Xi
- Xi∼Bin(m,n1)≈Pois(nm)
- EXi=nm
- applications
- bucket sort
- hash collisions
- exact case vs. Poisson case
- theorem: any event that happens with probability ≤p in the Poisson case happens with probability ≤p⋅em in the exact case
- bernoulli process, poisson process
- statistics – descriptive vs. inferential
- descriptive – describe what we observed
- inferential – deduce properties of a larger set
- how many people are left-handed?
- we can observe this on a small sample of people
- observational study vs. randomized study (RCT, randomized control trial)
- treatment – what you do to experimental units
- example
- experimental units … students
- treatment … which tutorial student attends
- observational study → we let the students decide which tutorial to attend
- randomized study → we assign the tutorials to students (randomly)
- confounders
- people can get better because they believe they are treated
- random = arbitrary
- statistics – parametric vs. non-parametric
- parametric → observations from parametrized random variables
- t-test
- permutation test
- example: earbuds
- black (4.3/5) vs. navy blue (4/5)
- descriptive statistics
- observational study
- example: alpacas
- permutation test
- https://www.jwilber.me/permutationtest/
- we observe … T(X,Y)
- Z:=X,Y
- F={T(π(Z))∣π∈Sm+n}
- p-value: percentage of F more extreme than observed
- speed-up … we use only k random samples from Sm+n
- tests
- one-sample test
- two-sample test
- paired test
- quality of wool before/after treatment
- give black & blue to n testers, ask each tester to score both
- randomize the order of testing the color
- use one-sample test on Di=Xi−Yi
- parametrized statistics … t-test (assuming normal distribution)
- non-parametric version … paired test & permutation test
- for paired test, we only permute Xi with Yi
- sign test
- Yi= "sign of Xi"
- example
- 10 values
- H0 … median = 0
- sign S∼Bin(10,21)
- one-sided test … FS(3)=P(S≤3)=0.17
- two-sided test … P(S≤3∨S≥7)=0.34
- but we are suspicious
- we can rank absolute values (from lowest to highest) and write the averages of the ranks (for rank 3–5 → the average will be 4)
- we can multiply that by the sign (S)
- test procedure (Wilcoxon signed rank test)
- data … X1,…,Xn
- ranks of ∣Xi∣ … R1,…,Rn
- T+=∑Xi>0Ri
- T−=∑Xi<0Ri
- T=T+−T−=∑Ri⋅sign(Xn)
- T++T−=∑i=2n(n+1)
- we will use stronger null hypothesis
- H0: median = 0 & distribution is symmetric
- assuming H0… (and ignoring ties)
- T=∑i=1ni⋅Vi
- Vi={+1 prob. 1/2−1
- Vi+=0/1
- ∼Ber(21)
- T+=∑i⋅Vi+
- Vi … independent
- what can we do?
- we can compute CDF of T
- we can apply CLT to approximate (can we really?)
- back to our example
wilcox
test in R → 0.07
- what is correct? 0.07 or 0.17?
- it depends on what we can assume
- sign test
- Wilcoxon signed rank test
- Student's t-test
- power of test
- 1−P(type II error)
- to increase, we have two choices
- we can get more data
- or we can make stronger assumptions
- but we should not cheat – if it is obvious that the data is not normally distributed, we should not say they are
- paired test
- Xi=Ai−Bi
- null hypothesis … the procedure did not help nor hurt
- the distribution may be very weird
- but it will be symmetrical
- if Ai,Bi have same distribution, then Ai−Bi has distribution symmetrical around 0, median 0
- Mann-Whitney U-test
- U=∑i=1m∑j=1nS(Xi,Yj)
- S(Xi,Yj)=⎩⎨⎧1210Xi>YjXi=YjXi<Yj
- 2-sample test
- we have two populations, X and Y
- we get X1,…,Xm and Y1,…,Yn
- usually, m=n
- sign test is a paired variant of this
- we use the same approach as in the permutation test with U instead of T=Xˉm−Yˉn
- types of data
- numerical … real numbers (weight, time, price)
- ordinal … classes (light/medium/heavy, quick/slow)
- bootstrapping
- data X1,…,Xn iid
- sample mean Xˉn=n1∑i=1nXi
- distr. of Xˉn:EXˉn=EXi=μ
- variance of Xˉn:varXˉn=n1σ2
- CLT: Xˉn has ≈ Normal distribution N(μ,(nσ)2)
- P(Xˉn<a) can be approximated by the normal distribution N
- now, we want the same of a different parameter
- M … median of Xi
- M^= sample median: “middle value of X1,…,Xn”
- X{1},…,X{n} … sorted data
- M^=X{2n}
- estimator function of data we use to estimate the true M
- bootstrapping … sampling with repetition from X1,…,Xn
- we can approximate the distribution of M^
- interval estimate for median M
- M∗ … bootstrapped median (median of the one set of bootstrapped data)
- Mα∗={x:P(M∗≤x)=α}
- this method is simple and does not depend on the distribution of the data
Moment generating functions
- today: proof of CLT, proof of Chernoff inequality
- given a random variable X, we define MX:R→R
- MX(s):=E(esX)
- ∀X:MX(0)=E(e0x)=1
- X∼Ber(p)
- MX(s)=p⋅es⋅1+(1−p)⋅es⋅0
- MX(s)=1−p+pes
- note: es=1+s+2s2+3!s3+⋯=∑k=0∞k!sk
- MX(s)=1+p(s+2s2+3!s3+…)
- theorem: MX(s)=∑k=0∞E[Xk]⋅k!sk
- E[Xk] … k-th moment of X
- proof
- E[esX]=E[∑k!(sX)k]=E[∑Xkk!sk]=∑E[Xkk!sk]=∑k=0∞E[Xk]⋅k!sk
- back to Ber(p)
- EX=[s1]MX(s)=p
- this notation selects the coefficient of the s1 in the GF
- EX2=[2!s2]MX(s)=p
- X∼N(0,1)
- MX(s)=E[esX]=LOTUS∫−∞∞esxfX(x) dx
- …
- MX(s)=es2/2
- theorem: MaX+b(s)=esbMX(as)
- proof: …
- example usage
- X∼N(μ,σ2)
- σX−μ=Y∼N(0,1)
- X=σY+μ
- MX(s)=eμs⋅eσ2s2/2
- theorem
- let X,Y be RVs such that (∃ε>0)(∀s∈(−ε,ε)):MX(s)=MY(s)∈R
- then FX=FY
- example
- X1∼N(μ1,σ12)
- X2∼N(μ2,σ22)
- ⟹X1+X2∼N(μ,σ2)
- proof
- MX1=exp(μ1s+σ12s2/2)
- MX2=exp(μ2s+σ22s2/2)
- …
- theorem
- X,Y independent ⟹MX+Y=MX⋅MY
- proof
- MX+Y(s)=E[es(X+Y)]=E[esX⋅esY]=E[esX]⋅E[esY]=MX(s)⋅MY(s)
- example
- X∼Bin(n,p)
- MX(s)=(1−p+pes)n
- theorem
- Y,X1,X2,X3,… RVs
- (∃ε>0)(∀s∈(−ε,ε)):limn→∞MXn(s)=MY(s)
- FY is continuous
- then XndY
- limn→∞FXn(s)=FY(s)
- theorem (CLT)
- X1,X2,… i.i.d. (independent identically distributed) RVs
- EXi=μ, varXi=σ2
- Yn:=nσX1+⋯+Xn−nμ
- YndN(0,1)
- that is limn→∞FYn(t)=Φ(t)
- proof
- we may assume μ=0
- otherwise, we would set Xn′=Xn−μ
- variance would not change
- the formula for Yn also would not change (we subtract nμ there)
- MXi(s)=1+as+bs2+O(s3)(s≐0)
- a=μ=0
- b=21EXi2=21(σ2−μ2)=21σ2
- MXi(s)=1+2σ2s2+O(s3)
- MYn(s)=∏MXi(nσs)
- we will use the previous theorem for Y,Y1,Y2,…
- we need to show that limMYn(s)=MY(s)
- or that limn→∞MXn(nσs)n=es2/2
- MXn(nσs)n=…=exp(nln(1+2ns2+O(s3))
- …
- theorem (Chernoff inequality)
- application
- set balancing
- discrepancy
- we have subsets S1,…,Sn⊆[m]
- we want T⊆[m] such that it almost disects every Si
- Di=∣T∩Si∣−∣Si∖T∣
- disc(T)=maxDi
- use random T!
- Di=∑j=1∣Si∣Xj
- ∀aj∈Si:Xj indicator (Xj=1⟺aj∈T)
- P(Di≥t)≤e−2∣Si∣t2
- …
- idea: it is useful to have an inequality (Chernoff) that holds always, we don't need to worry if our n is big enough