Exam
Markov Chains
- Stochastic process
- sequence of random variables X_0, X_1, … such that "all probabilities are defined"
- e.g. P(X_0=1 ∧ X_1=2 ∧ X_3=0) is defined
- Markov process/chain
- stochastic process such that
- ∃ finite/countable set S such that ∀t: Im X_t ⊆ S
- elements of S … states
- t … time
- ∀i,j∈S ∃p_{ij}∈[0,1] such that P(X_{t+1}=j ∣ X_t=i) = p_{ij}
- notes
- this is a discrete time Markov chain
- discrete space … |S| ≤ |ℕ|
- time-homogeneous … p_{ij} not depending on t
- Markov condition … “forgetting the past”
- P(X_{t+1}=a_{t+1} ∣ X_t=a_t ∧ ⋯ ∧ X_0=a_0) = P(X_{t+1}=a_{t+1} ∣ X_t=a_t)
- Transition matrix, diagram
- transition matrix is the matrix P with P_{ij} = p_{ij}
- observation: every row of P sums to 1
- transition diagram/graph
- vertices … S
- directed edges … {(i,j)∈S² : p_{ij} > 0}
- Markov chain is equivalent to a random walk on the transition diagram
- from state i we go to state j with probability p_{ij}, independently of the past
- Probability mass function for X0,X1,…
- π_i(t) = P(X_t = i)
- theorem: π(t+1) = π(t)·P
- where π(t) = (π_1(t), …, π_n(t)) is a row vector
- proof: definition of matrix multiplication & total probability theorem
- by induction, we can prove that π(k) = π(0)·P^k and, more generally, π(t+k) = π(t)·P^k (see the sketch below)
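- a minimal NumPy sketch of π(t+1) = π(t)·P; the 2-state transition matrix is a made-up example, not from the lecture:
```python
# Evolve the PMF of a small Markov chain: pi(t+1) = pi(t) @ P.
import numpy as np

P = np.array([[0.9, 0.1],    # p_00, p_01
              [0.4, 0.6]])   # p_10, p_11  (each row sums to 1)
pi = np.array([1.0, 0.0])    # pi(0): start in state 0

for t in range(5):
    pi = pi @ P              # pi(t+1) = pi(t) * P
    print(t + 1, pi)

# pi(k) = pi(0) P^k, computed directly with a matrix power:
print(np.array([1.0, 0.0]) @ np.linalg.matrix_power(P, 5))
```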
- Probability of a k-step transition … r_{ij}(k)
- r_{ij}(k) = P(X_{t+k}=j ∣ X_t=i)
- r_{ij}(1) = P(X_{t+1}=j ∣ X_t=i) = p_{ij}
- Chapman-Kolmogorov theorem
- ∀ Markov chain, ∀i,j,k:
- (1) r_{ij}(k+1) = ∑_{u=1}^n r_{iu}(k)·p_{uj}
- (2) r_{ij}(k+ℓ) = ∑_{u=1}^n r_{iu}(k)·r_{uj}(ℓ)
- (3) r_{ij}(k) = (P^k)_{ij}
- proof
- (2) ⟹ (1)
- (1) is a special case of (2), since r_{uj}(1) = p_{uj}
- (1) ⟹ (3)
- matrix multiplication & induction
- (2) … we want r_{ij}(k+ℓ) = ∑_{u=1}^n r_{iu}(k)·r_{uj}(ℓ)
- P(X_{t+k+ℓ}=j ∣ X_t=i) =
- = ∑_{u=1}^n P(X_{t+k}=u ∣ X_t=i)·P(X_{t+k+ℓ}=j ∣ X_{t+k}=u)
- this follows from the total probability theorem together with the Markov condition
- Accessible states
- definition: j is accessible from i (where i,j∈S)
- i→j
- j∈A(i)
- j is accessible from i ≡ ∃t: r_{ij}(t) = P(X_t=j ∣ X_0=i) > 0
- ⟺ the transition digraph contains a directed path i→j of length t
- Communicating states
- definition: i↔j (i and j communicate) ≡ i∈A(j) ∧ j∈A(i)
- theorem: ↔ is an equivalence relation
- proof
- reflexive: i∈A(i) … take t=0 in the definition
- symmetric: the formula i∈A(j) ∧ j∈A(i) is symmetric
- transitive: in the digraph ∃ path i→j and ∃ path j→k ⟹ ∃ walk i→k
- decomposing the transition digraph into strongly connected components ≡ finding the equivalence classes of ↔
- Reducibility
- Markov chain is irreducible if ∀i,j∈S:i↔j
- Time to get to i∈S
- definition: T_i = min{t≥1 : X_t=i}
- T_i = ∞ if the set is empty (there is no such t)
- T_i … a random variable
- f_{ij}(n) = P(we first reach j at time n ∣ X_0=i) = P(T_j=n ∣ X_0=i)
- we define it for n>0
- f_{ij} = P(T_j<∞ ∣ X_0=i) = ∑_{n≥1} f_{ij}(n)
- the probability that we ever reach j if we start from i (after at least one step)
- Recurrent state
- a state i∈S is recurrent if we return to it with probability 1
- it is transient otherwise
- in Czech: rekurentní, tranzientní
- i is transient
- f_{ii} < 1 … the probability of ever getting back
- ⟺ P(T_i=∞ ∣ X_0=i) > 0
- ⟺ P(we get back infinitely often) < 1
- theorem: i is a recurrent state ⟺ ∑_{n=1}^∞ r_{ii}(n) = ∞
- theorem: if i↔j, then either both i and j are recurrent or both i and j are transient
- for finite Markov chains
- C … equiv class of ↔ in a finite Markov chain
- C is recurrent (∀i∈C is recurrent) ⟺ (∀i∈C)(∀j∈S): if i→j then j∈C ("C is closed")
- definition: i is null recurrent if i is recurrent and E(T_i ∣ X_0=i) = ∞
- i is positive recurrent if i is recurrent and E(T_i ∣ X_0=i) < ∞
- example: the symmetric random walk on ℤ (it returns with probability 1, but the expected return time is infinite, so it is null recurrent)
- Periodic state
- definition: i∈S has period d_i := gcd{t : r_{ii}(t) > 0}
- i∈S is aperiodic if d_i = 1
- example of a periodic Markov chain: two states, we change the state with probability 1 (period 2)
- Properties of a communicating class
- if i,j∈S, i↔j, then
- d_i = d_j
- i is transient ⟺j is transient
- i is recurrent ⟺j is recurrent
- i is null recurrent ⟺j is null recurrent
- i is positive recurrent ⟺j is positive recurrent
- these are properties of the class of ↔
- Stationary distribution / steady state distribution
- definition: π: S→[0,1] such that ∑_{i∈S} π_i = 1 is called a stationary distribution if πP = π
- i.e. ∀j: ∑_i π_i·p_{ij} = π_j
- if π(0) (the PMF of X_0) is π, then π(1) is π as well (→ steady state)
- reminder: π(1) = π(0)·P
- theorem
- if a Markov chain is irreducible, aperiodic and finite, then
- there exists a unique stationary distribution π such that πP=π
- ∀i,j: lim_{t→∞} (P^t)_{ij} = π_j
- P(X_t=j ∣ X_0=i) ≐ π_j for large t
- actually, the MC does not have to be finite; it suffices that all states are positive recurrent
- to find π, solve the system of linear equations πP = π, and add ∑_{i∈S} π_i = 1 (see the sketch below)
- for every i there should be an equation π_i = ∑_j p_{ji}·π_j (there is a term for every incoming arc)
- π describes the long-term behavior of the MC
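- a sketch of finding π by solving the linear system; the 3-state chain below is an arbitrary example:
```python
# Solve pi P = pi together with sum(pi) = 1 by replacing one
# (redundant) balance equation with the normalization.
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
n = P.shape[0]

A = (P - np.eye(n)).T          # row j: sum_i pi_i p_ij - pi_j = 0
A[-1, :] = 1.0                 # replace the last equation by sum_i pi_i = 1
b = np.zeros(n); b[-1] = 1.0
pi = np.linalg.solve(A, b)
print(pi, pi @ P)              # pi and pi*P should match: (0.25, 0.5, 0.25)
```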
- PageRank (the original Google search) … a Markov chain model of people browsing the WWW
- given π, we can find a MC such that π is its stationary distribution; then we can run the MC to generate random objects with distribution π
- Markov chain Monte Carlo (MCMC)
- Detailed balance equation
- MC may have this property
- ∀i≠j: π_i·p_{ij} = π_j·p_{ji}
- to imagine this: an ant colony, each ant moving independently according to the Markov chain
- stationary distribution … the same (expected) number of ants at each state at each time – ants don't "accumulate"
- the detailed balance equation implies πP = π (sum it over i and use ∑_i p_{ji} = 1), but it is stronger than that
- MCMC sampling
- Markov chain Monte Carlo
- Metropolis–Hastings method
- we construct a MC from the probability distribution we want
- choose aperiodic irreducible digraph
- use detailed balance equation
- p_{ij} = min{1, π_j/π_i}·C
- p_{ji} = min{1, π_i/π_j}·C
- choose C such that
- ∀i: ∑_{j≠i} p_{ij} < 1
- define p_{ii} = 1 − ∑_{j≠i} p_{ij} > 0
- ⟹ d_i = 1 (the self-loop makes the state aperiodic)
- tune the process to make convergence fast
- we run the MC for long enough
- alternative sampling approach: rejection sampling
- MCMCs are useful in Bayesian statistics – we only need the ratios π_j/π_i, so the normalizing constants cancel out (and we don't have to compute them); see the sketch below
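- a minimal sketch of Metropolis sampling on a 4-cycle with a symmetric random-neighbor proposal; the unnormalized weights are made up, and only the ratios π_j/π_i are ever used:
```python
# Metropolis sampler: propose a neighbor, accept with min{1, pi_j / pi_i}.
import random

weights = [1, 2, 3, 4]                 # unnormalized target pi

def metropolis(steps, start=0):
    i = start
    counts = [0] * len(weights)
    for _ in range(steps):
        j = (i + random.choice([-1, 1])) % len(weights)   # symmetric proposal
        if random.random() < min(1.0, weights[j] / weights[i]):
            i = j                       # accept; otherwise stay (self-loop)
        counts[i] += 1
    return [c / steps for c in counts]

print(metropolis(100_000))   # should approach [0.1, 0.2, 0.3, 0.4]
```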
- Absorbing state
- i∈S is an absorbing state if p_{ii} = 1
- A … set of absorbing states
- which absorbing state do we end at? will it be state 0∈A?
- a_i = P(∃t: X_t=0 ∣ X_0=i)
- theorem: (a_i)_{i∈S} is the unique solution to
- a_0 = 1
- a_i = 0 if i∈A, i≠0
- a_i = ∑_j p_{ij}·a_j otherwise
- what is the expected number of steps to get to an absorbing state?
- μ_i = E(T ∣ X_0=i)
- T = min{t : X_t∈A}
- theorem: (μ_i)_{i∈S} is the unique solution to
- μ_i = 0 if i∈A
- μ_i = 1 + ∑_j p_{ij}·μ_j if i∉A
- example: drunk person on their way home (a random walk on states 0,…,n, where home 0 is absorbing)
- A = {0}
- μ_0 = 0
- μ_1 = 1 + (1/2)μ_0 + (1/2)μ_2
- μ_2 = 1 + (1/2)μ_1 + (1/2)μ_3
- ⋮
- μ_{n−1} = 1 + (1/2)μ_{n−2} + (1/2)μ_n
- μ_n = 1 + μ_{n−1}
- solution
- μ_1 = 2n−1
- μ_n = n²
- μ_i ≤ n²
- how to compute it
- we start from the bottom up by combining the two last equations
- we get μ_{n−1} = 3 + μ_{n−2}
- then we get μ_{n−2} = 5 + μ_{n−3}
- it can be shown that we get down to μ_1 = (2n−1) + μ_0 = 2n−1
- we then see that μ_2 = (2n−3) + μ_1
- we can substitute μ_1
- when we get back to the bottom, we have μ_n = 1 + 3 + 5 + ⋯ + (2n−1) = n² (see the numeric check below)
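- a numeric check of the drunk-walk system above (n = 10 is an arbitrary choice):
```python
# Solve mu_0 = 0, mu_i = 1 + (mu_{i-1} + mu_{i+1})/2 for 0 < i < n,
# mu_n = 1 + mu_{n-1}, and compare with the closed forms.
import numpy as np

n = 10
A = np.zeros((n + 1, n + 1))
b = np.zeros(n + 1)
A[0, 0] = 1.0                                  # mu_0 = 0
for i in range(1, n):
    A[i, i] = 1.0
    A[i, i - 1] = A[i, i + 1] = -0.5           # mu_i - mu_{i-1}/2 - mu_{i+1}/2 = 1
    b[i] = 1.0
A[n, n], A[n, n - 1], b[n] = 1.0, -1.0, 1.0    # mu_n = 1 + mu_{n-1}
mu = np.linalg.solve(A, b)
print(mu[1], 2 * n - 1)                        # mu_1 = 2n - 1
print(mu[n], n ** 2)                           # mu_n = n^2
```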
- 2-SAT problem
- input: φ = (x_1∨x_2) ∧ (x_3∨¬x_1) ∧ …
- clauses with exactly 2 literals
- output: a satisfying assignment OR “unsatisfiable”
- there exists a polynomial algorithm
- we will show a randomized algorithm
- arbitrarily initialize (x_1,…,x_n)
- while φ has an unsatisfied clause:
- choose one of the unsatisfied clauses and flip one of its variables (at random)
- return (x_1,…,x_n)
- repeat the loop body at most 2mn² times
- then, if φ is still unsatisfied, return "unsatisfiable"
- this may introduce errors
- theorem: the algorithm makes an error with probability ≤ 2^{−m}
- proof
- x_1^*, …, x_n^* … one of the satisfying assignments
- D_t … the number of indices i such that x_i ≠ x_i^* at time t
- t = 0, 1, …, 2mn²
- 0 ≤ D_t ≤ n
- D_t = 0 ⟹ we found a solution
- situation
- we assume that the clause (x_1∨x_2) is unsatisfied at time t and we choose it
- ⟹ x_1 = x_2 = F
- ⟹ x_1^* ∨ x_2^* = T
- we randomly flip x_1 or x_2, which increases or decreases D
- if x_1^* = T and x_2^* = F (or vice versa), the random flip may decrease or increase D by one (each with probability 1/2)
- we will call these the "good cases"
- if x_1^* = x_2^* = T, then the random flip always decreases D, which is nice, but it means that we cannot model D using a Markov chain
- because of that, we define D′ which always decreases or increases by one, each with probability 1/2
- in the good cases, it behaves the same way as D
- in the bad case, we flip a coin to decide whether D′ should go up or down
- clearly D ≤ D′ and T ≤ T′ (where T is the first time D = 0, similarly for T′, D′)
- we know that ET′ ≤ n² (the drunk-walk bound above), which means that ET ≤ n²
- using the Markov inequality, we get P(T ≥ 2n²) ≤ ET/(2n²) ≤ n²/(2n²) = 1/2
- we repeat the whole process m times; that way we get a failure probability ≤ 2^{−m} and running time O(mn⁴)
- there are O(n²) clauses of two literals, and we need to find an unsatisfied one 2mn² times
- we can get a randomized 3-SAT algorithm with (4/3)^n time complexity this way (a sketch of the 2-SAT algorithm follows below)
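- a runnable sketch of the randomized 2-SAT walk above; the clause encoding (variable index, polarity) and the toy formula are my own choices, and m plays the role of the confidence parameter from the analysis:
```python
import random

def solve_2sat(clauses, n, m):
    x = [random.random() < 0.5 for _ in range(n)]    # arbitrary initialization
    for _ in range(2 * m * n * n):
        unsat = [c for c in clauses
                 if not any(x[v] == pos for v, pos in c)]
        if not unsat:
            return x                                  # satisfying assignment
        v, _ = random.choice(random.choice(unsat))    # random clause, random literal
        x[v] = not x[v]                               # flip that variable
    return None                  # "unsatisfiable" (errs with probability <= 2^-m)

# (x0 or x1) and (not x0 or x1) and (x0 or not x1) -- satisfied by x0 = x1 = True
clauses = [[(0, True), (1, True)], [(0, False), (1, True)], [(0, True), (1, False)]]
print(solve_2sat(clauses, n=2, m=5))
```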
- Hidden Markov Model (HMM)
- there is some unobservable Markov chain (X_t)
- we can observe (Y_t) … Y_t is somehow obtained from X_t
- widely applicable
- smart algorithms exist (e.g. the Viterbi algorithm, which finds the most likely sequence of hidden states)
Bayesian Statistics
- Probabilistic method
- we prove that some graph exists by showing that a random graph has a certain property with probability greater than zero
- G(n,p) … a random graph with vertices 1,…,n
- ∀i,j: i∼j with probability p (all edges independent)
- P(G(n,1/2) has no K_k nor its complement K̄_k (an independent set on k vertices) as an induced subgraph) > 0 if n ≤ 2^{k/2}
- thus ∃ G on 2^{k/2} vertices with no induced K_k nor K̄_k
- Ramsey theorem
- Bayes theorem
- let X,Θ be discrete random variables
- p_{Θ∣X}(θ∣x) = p_{X∣Θ}(x∣θ)·p_Θ(θ) / (∑_{θ′∈Im Θ} p_{X∣Θ}(x∣θ′)·p_Θ(θ′))
- similar for continuous RVs
- note: terms with pΘ(θ′)=0 are considered to be zero
- Bayesian statistics – general approach
- Θ possible worlds
- prior distribution pΘ(θ)
- example: with probability 1/2 the coin is fake (has heads on both sides)
- p_Θ(real) = 1/2
- p_Θ(fake) = 1/2
- X … measurement/observation
- posterior distribution
- p_{Θ∣X}(θ∣x) = p_Θ(θ)·p_{X∣Θ}(x∣θ) / (∑_{t∈Im Θ} p_Θ(t)·p_{X∣Θ}(x∣t))
- example: p_{Θ∣X}(fake∣heads) = (1/2·1) / (1/2·1 + 1/2·1/2) = 2/3
- alternatively: which “heads” did we get?
- one side of the fake coin
- the other side of the fake coin
- one side of the real coin
- two of the three possible "worlds" correspond to the fake coin → p_{Θ∣X}(fake∣heads) = 2/3
- θ̂ = fake … MAP estimate (see the numeric sketch below)
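- the fake-coin posterior computed numerically (a direct transcription of the example above):
```python
# Worlds: 'real' (P(heads) = 1/2) and 'fake' (heads on both sides, P(heads) = 1).
prior = {'real': 0.5, 'fake': 0.5}
likelihood_heads = {'real': 0.5, 'fake': 1.0}

evidence = sum(prior[t] * likelihood_heads[t] for t in prior)
posterior = {t: prior[t] * likelihood_heads[t] / evidence for t in prior}
print(posterior)   # {'real': 1/3, 'fake': 2/3}; the MAP estimate is 'fake'
```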
- Bayesian estimates
- point estimates
- MAP (maximum a posteriori)
- we choose θ^ to maximize pΘ∣X … we replace the random variable by its mode
- LMS ("least mean squares")
- we choose θ̂ = E(Θ∣X=x) … we replace the random variable by its mean
- we get an unbiased point estimate that has the smallest possible mean square (LMS) error
- interval estimates
- credible sets S (similar to confidence intervals used in classical statistics)
- note: often, the constant in the denominator does not matter
- MAP (maximum a posteriori)
- θ̂_MAP = argmax_θ p_{Θ∣X}(θ∣x)
- θ̂ is a point estimate for Θ
- that equals argmax_θ p_Θ(θ)·p_{X∣Θ}(x∣θ)
- we can ignore the normalizing constant
- we are replacing the random variable by its mode
- note: maximum likelihood is similar but it uses argmax_θ p_{X∣Θ}(x∣θ)
- it corresponds to assuming a "flat prior" (uniform or discrete uniform)
- Naive Bayes classifier
- let x = (x_1,…,x_n)
- in machine learning … a vector of features
- we need p_{X∣Θ}(x∣θ) to get the MAP estimate
- "naive assumption": p(x∣θ) = p(x_1∣θ)·p(x_2∣θ)·…·p(x_n∣θ) (the features are conditionally independent given θ)
- LMS point estimate
- LMS = least mean square
- the estimate such that E((Θ−θ̂)² ∣ X=x) is minimal
- to compute it
- θ̂_LMS = E(Θ∣X=x) = ∑_{t∈Im Θ} t·p_{Θ∣X}(t∣x)
- we find the mean of the random variable
- proof
- E((Θ−θ̂)² ∣ X=x) = E(Θ² ∣ X=x) − E(2Θθ̂ ∣ X=x) + E(θ̂² ∣ X=x)
- we want to minimize it
- the first term does not depend on θ̂, so we get rid of it
- we are left with −2θ̂·E(Θ∣X=x) + θ̂²
- to find its minimum, we take the derivative and set it equal to zero
- −2E(Θ∣X=x) + 2θ̂ = 0
- therefore θ̂ = E(Θ∣X=x) as we wanted
- Beta distribution
- f_Θ(x) = c·x^{α−1}·(1−x)^{β−1} for x∈[0,1]
- where c is a normalizing constant
- properties
- f_Θ is maximal for θ = (α−1)/(α+β−2)
- EΘ = α/(α+β)
- Conjugate distributions
- Beta (prior) & binomial (likelihood) → Beta (posterior)
- "Beta and Beta are conjugate distributions with respect to the binomial distribution" (the prior and the posterior come from the same family)
- normal (prior) & normal (likelihood) → normal (posterior)
- both were proved in the lecture
- it is an algebraic convenience; it does not work for all distributions (see the numeric sketch below)
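- a numeric sketch of the Beta–binomial conjugacy, checking on a grid that a Beta(α,β) prior with a Bin(n,·) likelihood gives a Beta(α+k, β+n−k) posterior; all constants are toy choices:
```python
import numpy as np
from math import comb

a, b, n, k = 2, 2, 10, 7
theta = np.linspace(1e-6, 1 - 1e-6, 10_000)

prior = theta**(a - 1) * (1 - theta)**(b - 1)
likelihood = comb(n, k) * theta**k * (1 - theta)**(n - k)
post = prior * likelihood
post /= post.sum() * (theta[1] - theta[0])           # normalize numerically

conjugate = theta**(a + k - 1) * (1 - theta)**(b + n - k - 1)
conjugate /= conjugate.sum() * (theta[1] - theta[0])
print(np.max(np.abs(post - conjugate)))              # ~0: the two densities agree
```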
Conditional Expectation
- Conditional independence
- events
- A⊥B ⟺ P(A∩B) = P(A)·P(B)
- A⊥B ∣ C ⟺ P(A∩B∣C) = P(A∣C)·P(B∣C)
- A, B are independent conditionally given C
- it is possible (even typical) that A⊥B ∣ C and A⊥B ∣ C^c, but not A⊥B
- Conditional expectation
- E(Y∣X=x) vs. E(Y∣X)
- E(Y∣X=x) = g(x) … a number
- E(Y∣X) = g(X) … a random variable
- Law of iterated expectation
- E(E(Y∣X))=E(Y)
- proof
- E(E(Y∣X)) = E(g(X)) = ∑_x P(X=x)·g(x)
- E(Y) = ∑_x P(X=x)·E(Y∣X=x) = ∑_x P(X=x)·g(x)
- by the law of total probability
- Estimate error
- basic task of statistics
- estimate one quantity (Y) given data/measurement (X)
- example: groups of students, their exam results
- X … groups of students
- X=2 … we are currently looking at the second group
- Y … students' exam results
- E(Y∣X=3) … the expectation of the results in group 3
- estimator
- Ŷ = E(Y∣X) = g(X)
- we want the estimate error
- Ỹ = Ŷ − Y
- what is E(Ỹ∣X) equal to?
- E(Ỹ∣X) = E(Ŷ−Y∣X) = E(Ŷ∣X) − E(Y∣X) = Ŷ − Ŷ = 0
- because E(Ŷ∣X) = E(E(Y∣X)∣X) = E(Y∣X) = Ŷ
- (E(Y∣X) is already a function of X, so conditioning on X again changes nothing)
- E(Ỹ) = E(E(Ỹ∣X)) = 0 by the law of iterated expectation
- Estimate error covariance and variance
- cov(Ỹ,Ŷ) = E(ỸŶ) − E(Ỹ)E(Ŷ) = E(ỸŶ) (as E(Ỹ)=0) = E(E(ỸŶ∣X)) = E(Ŷ·E(Ỹ∣X)) = E(Ŷ·0) = 0
- the penultimate equality holds because Ŷ is constant once X=x is given
- Ỹ, Ŷ are uncorrelated
- note
- independent ⟹ uncorrelated
- uncorrelated ⇏ independent
- Law of iterated variance / Eve's rule
- var Y=E(var(Y∣X))+var(E(Y∣X))
- intragroup variance + intergroup variance
- proof
- Y = Ŷ − Ỹ
- var Y = var(Ŷ) + var(Ỹ) − 2·cov(Ŷ,Ỹ) = var(Ŷ) + var(Ỹ) (the covariance is 0)
- var(Ỹ) = E(Ỹ²) − (EỸ)² = E(Ỹ²) (as EỸ = 0) = E(E(Ỹ²∣X))
- = E(E((Y−Ŷ)²∣X)) = E(var(Y∣X)); and var(Ŷ) = var(E(Y∣X)) is the second term (see the numeric check below)
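- a numeric check of Eve's rule on made-up grouped data (X = group index, Y = score; picking a uniformly random data point realizes the pair (X,Y)):
```python
import numpy as np

groups = [np.array([1.0, 2.0, 3.0]),        # Y-values for X = 0
          np.array([10.0, 12.0]),           # X = 1
          np.array([5.0, 5.0, 6.0, 8.0])]   # X = 2

y = np.concatenate(groups)
p = np.array([len(g) for g in groups]) / len(y)       # P(X = x)
means = np.array([g.mean() for g in groups])          # E(Y | X = x)
vars_ = np.array([g.var() for g in groups])           # var(Y | X = x)

within = (p * vars_).sum()                            # E(var(Y|X))
between = (p * (means - (p * means).sum())**2).sum()  # var(E(Y|X))
print(y.var(), within + between)                      # the two sides agree
```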
Balls & Bins
- Really useful approximation
- 1−x ≈ e^{−x} (for x close to 0)
- example usage: birthday paradox
- (1−1/365)(1−2/365)…(1−(k−1)/365) ≈ e^{−k(k−1)/(2·365)}
- because 1−1/365 ≈ e^{−1/365} etc.
- Union bound
- for events A_1, A_2, … it holds that P(⋃A_i) ≤ ∑P(A_i)
- therefore P(none of the A_i) ≥ 1 − ∑P(A_i)
- Balls and bins model
- we will be throwing m balls randomly into n bins
- Xi … number of balls in bin i
- How many bins are empty?
- P(X_i=0) = (1−1/n)^m ≈ e^{−m/n}
- I_i = 1 if X_i = 0 (otherwise I_i = 0)
- E(∑_i I_i) = ∑_i E I_i ≈ n·e^{−m/n} (see the simulation below)
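- a small simulation comparing the observed number of empty bins with n·e^{−m/n} (parameter values are arbitrary):
```python
import random, math

n, m, trials = 1000, 1500, 20
empty = 0
for _ in range(trials):
    bins = [0] * n
    for _ in range(m):
        bins[random.randrange(n)] += 1   # throw each ball into a random bin
    empty += bins.count(0)
print(empty / trials, n * math.exp(-m / n))   # the two numbers are close
```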
- How many balls are in bin i?
- X_i ∼ Bin(m, 1/n) ≈ Pois(m/n)
- E X_i = m/n
- Applications – bucket sort and hashing
- Max-load likely upper bound
- definition: max-load … max{X_1,…,X_n}
- we assume m = n in the max-load bounds
- theorem: for large enough n we have P(max-load ≥ 3·log n/log log n) ≤ 1/n
- proof
- X_i … number of balls in bin i
- P(X_i ≥ M) ≤ (n choose M)·(1/n)^M
- the probability that bin i gets at least M balls
- over all M-tuples of balls, we sum the probability that they all fall into bin i (union bound)
- B_i … the event X_i ≥ M
- P(⋃_i B_i) = P(max X_i ≥ M)
- P(⋃_i B_i) ≤ ∑_{i=1}^n P(B_i) ≤ n·(n choose M)·(1/n)^M ≤ n·(e/M)^M
- union bound again, then the estimate (n choose M) ≤ (en/M)^M (from an estimate of the factorial)
- we need to show that n·(e/M)^M <? 1/n
- log(n²·(e/M)^M) <? log 1 = 0
- 2L + M(1 − log M) <? 0
- where L is log n (LL is log log n, etc.)
- we substitute M = 3L/LL
- 2L + (3L/LL)·(1 − (log 3 + LL − LLL)) <? 0
- 2L + (3L/LL)·(1 − log 3 − LL + LLL) <? 0
- −L + (3L/LL)·(1 − log 3) + (3L/LL)·LLL <? 0
- the second and third terms are much smaller than L
- therefore it holds for large n
- Exact case vs. Poisson case
- theorem: any event that happens with probability ≤ p in the Poisson case happens with probability ≤ p·e·√m in the exact case
- Max-load likely lower bound
- theorem: for large enough n we have P(max-load ≤ log n/log log n) ≤ 1/n
- proof
- Poisson case
- P(X_i < M) ≤ 1 − P(X_i = M) = 1 − e^{−λ}·λ^M/M! = 1 − 1/(e·M!)
- λ = m/n = 1
- P(max X_i < M) = P(X_1 < M)·…·P(X_n < M) (the bins are independent in the Poisson case)
- ≤ (1 − 1/(e·M!))^n ≤ (e^{−1/(e·M!)})^n = exp(−n/(e·M!)) =: p
- exact case
- P(max X_i < M) ≤ p·e·√m = e·√n·exp(−n/(e·M!))
- we want e·√n·exp(−n/(e·M!)) <? 1/n
- it suffices that exp(−n/(e·M!)) <? 1/n² (as e·√n ≤ n for large n)
- n/(e·M!) >? 2·log n
- M! <? n/(2e·log n)
- log M! <? L − LL − log(2e)
- we know M! ≤ M·(M/e)^M
- log M! ≤ log M + M(log M − 1)
- ≤ (LL − LLL) + (L/LL)·(LL − LLL − 1)
- here we substituted M = L/LL (so log M = LL − LLL)
- ≤ L + LL − (L·LLL)/LL ≤ L − LL − c
- as 2LL + c ≤ (L·LLL)/LL for large n (it's not that obvious, but it holds)
- therefore log M! ≤ L − LL − c
- that's what we wanted to prove
Stochastic Processes
- Bernoulli process
- infinite sequence of independent identically distributed Bernoulli trials
- sequence of RVs X_1, X_2, … that are independent and each follows the Ber(p) distribution
- X_k = 1 … success/arrival at time k
- Quantities of a Bernoulli process
- N_t … number of successes up to time t
- N_t ∼ Bin(t,p)
- E[N_t] = t·p
- T_k … time of the k-th success
- L_k = T_k − T_{k−1}
- waiting time
- memoryless
- L_k ∼ Geom(p)
- E[L_k] = 1/p
- P(L_k = ℓ) = (1−p)^{ℓ−1}·p
- Tk properties
- T_k = L_1 + ⋯ + L_k
- by linearity, E[T_k] = k/p
- T_k has the Pascal distribution of order k
- ∀t≥k: P(T_k=t) = (t−1 choose k−1)·p^k·(1−p)^{t−k}
- ∀t<k: P(T_k=t) = 0
- Alternative description of a Bernoulli process
- we can describe the situation by the interarrival times
- i.i.d. random variables L_1, L_2, … ∼ Geom(p)
- T_k = L_1 + ⋯ + L_k
- X_k = 1 whenever T_j = k for some j
- clearly, the sequence X1,X2,… is a Bernoulli process
- Merging of Bernoulli processes
- two independent Bernoulli processes; Z_i = X_i ∨ Y_i
- BP(p) [merged with] BP(q) = BP(1−(1−p)(1−q))
- = BP(p+q−pq)
- Splitting of Bernoulli processes
- X_i, Y_i, Z_i
- Z_i ∼ BP(p)
- if Z_i = 0, then X_i = Y_i = 0
- if Z_i = 1, then with probability q we set (X_i,Y_i) = (1,0), otherwise (X_i,Y_i) = (0,1)
- then X_i ∼ BP(pq) and Y_i ∼ BP(p(1−q))
- Poisson process
- continuous time
- T_1, T_2, T_3, … are the times of individual arrivals
- N_t … number of arrivals up to time t
- λ … intensity (how many arrivals per unit of time)
- N_t ∼ Pois(λ·t)
- L_k ∼ Exp(λ)
- T_k has the Erlang distribution, E[T_k] = k/λ
- Poisson process interval independence
- for any sequence 0 ≤ t_0 < t_1 < ⋯ < t_k
- the RVs N_{t_{i+1}} − N_{t_i}
- are independent
- each one follows Pois(λ·(t_{i+1} − t_i))
- Merging of Poisson process
- PP(λ)[merge with]PP(λ′)=PP(λ+λ′)
- Splitting of Poisson process
- for each arrival of PP(λ), we toss a coin: with probability p the arrival is counted as valid
- the resulting process is PP(λp) (see the simulation below)
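- a simulation sketch: generate a Poisson process from Exp(λ) interarrival times, then split it with p-coins; the empirical rates should be ≈ λ and ≈ λp (all parameters made up):
```python
import random

def poisson_arrivals(lam, horizon):
    t, out = 0.0, []
    while True:
        t += random.expovariate(lam)   # L_k ~ Exp(lambda)
        if t > horizon:
            return out
        out.append(t)

lam, p, horizon = 2.0, 0.3, 10_000
arrivals = poisson_arrivals(lam, horizon)
kept = [t for t in arrivals if random.random() < p]   # split with a p-coin
print(len(arrivals) / horizon)   # ~ lambda
print(len(kept) / horizon)       # ~ lambda * p
```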
Non-parametric Tests
- Permutation test
- two-sample test
- A/B testing
- A … control group (no treatment)
- B … group with treatment applied
- is the A vs. B difference only random?
- we observe … T(A,B)
- T … test statistic
- example: T = Ā − B̄
- F = {T(π(A∪B)) ∣ π ∈ S_{m+n}} (where m = |A|, n = |B|)
- usually the sizes of the groups stay the same: we use T(first |A| elements of π(A∪B), rest of π(A∪B))
- p-value: the fraction of F that is more extreme than the observed T(A,B)
- speed-up … we use only k random samples from S_{m+n} (as in the sketch below)
- for a paired test, we only permute X_i with Y_i
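- a sketch of the approximate (Monte Carlo) permutation test with T = Ā − B̄; the data and k are made up:
```python
import random

A = [2.1, 2.5, 2.8, 3.0, 2.2]
B = [2.9, 3.4, 3.1, 3.6, 2.8]

def T(a, b):
    return sum(a) / len(a) - sum(b) / len(b)

observed = T(A, B)
pool, k, extreme = A + B, 10_000, 0
for _ in range(k):
    random.shuffle(pool)                 # a random permutation of A union B
    if abs(T(pool[:len(A)], pool[len(A):])) >= abs(observed):
        extreme += 1
print(extreme / k)                       # two-sided p-value
```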
- Sign test
- paired test
- pairs Xi,Yi
- Z_i = Y_i − X_i
- H_0 … P(X_i > Y_i) = 1/2
- count the number of pairs where Z_i > 0 (there was a measurable improvement after the treatment)
- use Bin(n, 1/2) to get the p-value (see the sketch below)
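- a sketch of the sign-test p-value computed directly from Bin(n, 1/2) (toy counts):
```python
from math import comb

n, k = 20, 15                      # say 15 of 20 pairs improved (toy numbers)

def binom_tail(n, k):              # P(Bin(n, 1/2) >= k)
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# two-sided: double the smaller of the two tails, capped at 1
p_value = 2 * min(binom_tail(n, k), 1 - binom_tail(n, k + 1))
print(min(p_value, 1.0))
```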
- Wilcoxon signed rank test
- paired test
- procedure
- let X_i be the difference B_i − A_i
- sort by absolute value, assign ranks (R_i)
- calculate T (or T⁺ or T⁻, it does not matter)
- T⁺ = ∑_{X_i>0} R_i
- T⁻ = ∑_{X_i<0} R_i
- T = T⁺ − T⁻ = ∑_i R_i·sign(X_i)
- compare with the scores for measurements with random +/− signs
- we use a stronger null hypothesis
- H_0: the median is 0 & the distribution of X is symmetric
- note: X is symmetric ⟺ A, B have the same continuous distribution (source)
- Mann-Whitney U-test
- two-sample test
- we have two populations, X and Y
- we get X_1,…,X_m and Y_1,…,Y_n
- usually m ≠ n
- S(X_i,Y_j) = 1 if X_i > Y_j, = 1/2 if X_i = Y_j, = 0 if X_i < Y_j
- U = ∑_{i=1}^m ∑_{j=1}^n S(X_i,Y_j)
- the sign test is a paired variant of this
- we use the same approach as in the permutation test, with U instead of T = X̄_m − Ȳ_n
- alternative view: we sort X∪Y and count pairs "X_i before Y_j"
- when to use it?
- numerical data … real numbers (weight, time, price) → permutation test
- ordinal data … classes (light/medium/heavy, quick/slow) → U-test (see the sketch below)
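- the U statistic computed straight from its definition (toy scores); a p-value would then be obtained as in the permutation test above:
```python
def u_statistic(xs, ys):
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0       # X_i beats Y_j
            elif x == y:
                u += 0.5       # ties count half
    return u

X = [7, 3, 9, 2]               # toy ordinal scores
Y = [4, 6, 1, 5]
print(u_statistic(X, Y))
```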
Moment Generating Functions and their applications
- Markov inequality
- for X nonnegative
- ∀a≥1: P(X ≥ a·μ) ≤ 1/a
- Chebyshev inequality
- ∀a>0: P(|X−μ| ≥ a·σ) ≤ 1/a²
- Chernoff bounds (alternative)
- for X = ∑_{i=1}^n X_i where the X_i ∼ Ber(p_i) are independent
- μ = ∑_{i=1}^n p_i
- ∀δ∈(0,1]: P(X ≥ (1+δ)·μ) ≤ e^{−μδ²/3}
- ∀δ∈(0,1): P(X ≤ (1−δ)·μ) ≤ e^{−μδ²/2}
- note: there are many more alternative bounds
- Moment generating function
- given a random variable X, we define M_X: ℝ → ℝ
- M_X(s) := E(e^{sX})
- ∀X: M_X(0) = E(e^{0·X}) = 1
- MGF moment theorem
- theorem: M_X(s) = ∑_{k=0}^∞ E[X^k]·s^k/k!
- E[X^k] … the k-th moment of X
- proof
- E[e^{sX}] = E[∑_k (sX)^k/k!] = E[∑_k X^k·s^k/k!] = ∑_k E[X^k·s^k/k!] = ∑_{k=0}^∞ E[X^k]·s^k/k!
- MGF for Bernoulli variable
- X∼Ber(p)
- M_X(s) = p·e^{s·1} + (1−p)·e^{s·0}
- M_X(s) = 1 − p + p·e^s
- note: e^s = 1 + s + s²/2 + s³/3! + ⋯ = ∑_{k=0}^∞ s^k/k!
- M_X(s) = 1 + p·(s + s²/2 + s³/3! + …)
- EX = [s/1!] M_X(s) = p
- this notation selects the coefficient of s^k/k! in the MGF
- EX² = [s²/2!] M_X(s) = p (see the SymPy check below)
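- a SymPy check of reading moments off the MGF of Ber(p), using E[X^k] = k!·[s^k] M_X(s):
```python
import sympy as sp

s, p = sp.symbols('s p')
M = 1 - p + p * sp.exp(s)                      # MGF of Ber(p)
series = sp.series(M, s, 0, 4).removeO()      # Taylor expansion around s = 0
print(sp.factorial(1) * series.coeff(s, 1))   # E[X]   = p
print(sp.factorial(2) * series.coeff(s, 2))   # E[X^2] = p
```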
- MGF for continuous X
- M_X(s) = E[e^{sX}] = (LOTUS) ∫_{−∞}^{∞} e^{sx}·f_X(x) dx
- MGF “linearity” and “sum”
- theorem
- M_{aX+b}(s) = e^{sb}·M_X(as)
- proof
- E(e^{s(aX+b)}) = E(e^{asX}·e^{sb}) = e^{sb}·E(e^{(as)X})
- theorem
- X, Y independent ⟹ M_{X+Y} = M_X·M_Y
- proof
- M_{X+Y}(s) = E[e^{s(X+Y)}] = E[e^{sX}·e^{sY}] = E[e^{sX}]·E[e^{sY}] = M_X(s)·M_Y(s)
- example
- X∼Bin(n,p)
- M_X(s) = (1−p+p·e^s)^n
- MGF of normal distribution
- X ∼ N(μ, σ²)
- M_X(s) = e^{μs}·e^{σ²s²/2}
- MGF equality and convergence
- equality
- let X, Y be RVs such that (∃ε>0)(∀s∈(−ε,ε)): M_X(s) = M_Y(s) ∈ ℝ
- then F_X = F_Y
- convergence
- Y, X_1, X_2, X_3, … RVs
- (∃ε>0)(∀s∈(−ε,ε)): lim_{n→∞} M_{X_n}(s) = M_Y(s)
- F_Y is continuous
- then X_n →_d Y (convergence in distribution)
- lim_{n→∞} F_{X_n}(x) = F_Y(x)
- Central limit theorem
- theorem
- X_1, X_2, … i.i.d. (independent identically distributed) RVs
- EX_i = μ, var X_i = σ²
- Y_n := (X_1 + ⋯ + X_n − nμ)/(σ·√n)
- Y_n →_d N(0,1)
- that is, lim_{n→∞} F_{Y_n}(t) = Φ(t)
- proof
- we may assume μ = 0
- otherwise we would set X′_n = X_n − μ
- the variance would not change
- the formula for Y_n also would not change (we subtract nμ there)
- M_{X_i}(s) = 1 + as + bs² + O(s³) (s ≐ 0)
- a = μ = 0
- b = (1/2)·E(X_i²) = (1/2)(σ² + μ²) = (1/2)σ²
- from the alternative formula for variance (var X = EX² − (EX)²)
- M_{X_i}(s) = 1 + σ²s²/2 + O(s³)
- M_{Y_n}(s) = ∏_i M_{X_i}(s/(σ√n))
- because Y_n = (X_1 + ⋯ + X_n)/(σ√n)
- we need to show that lim M_{Y_n}(s) = M_Y(s) where Y ∼ N(0,1)
- i.e. that lim_{n→∞} M_{X_i}(s/(σ√n))^n = e^{s²/2}
- M_{Y_n}(s) = M_{X_i}(s/(σ√n))^n = (1 + (σ²/2)·(s/(σ√n))² + O(n^{−3/2}))^n = (1 + s²/(2n) + O(n^{−3/2}))^n
- we substituted s/(σ√n) for s in M_{X_i}
- we know that (1 + x/n)^n → e^x
- therefore (1 + (s²/2)/n + O(n^{−3/2}))^n → e^{s²/2} (a simulation check follows below)
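- an empirical check of the CLT for uniform(0,1) summands (the sample sizes are arbitrary):
```python
import numpy as np

n, trials = 100, 50_000
mu, sigma = 0.5, np.sqrt(1 / 12)             # mean and sd of uniform(0,1)

samples = np.random.rand(trials, n)
Y = (samples.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))
print(Y.mean(), Y.std())                     # ~0 and ~1
print((Y <= 1.0).mean())                     # ~ Phi(1) = 0.8413
```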
- Chernoff inequality
- theorem
- suppose i.i.d. X_1, X_2, …, X_n are ±1, each with probability 1/2
- X = ∑_i X_i
- we have σ² = n
- ∀t>0: P(X ≥ t) = P(X ≤ −t) ≤ e^{−t²/(2σ²)}
- proof
- M_{X_i}(s) = (e^{−s}+e^s)/2
- Markov inequality
- for X nonnegative
- ∀a>0: P(X ≥ a) ≤ μ/a
- we use it
- ∀s>0: P(X ≥ t) = P(e^{sX} ≥ e^{st}) ≤ E(e^{sX})/e^{st}
- E(e^{sX})/e^{st} = M_{X_i}(s)^n/e^{st} = ((e^{−s}+e^s)/2)^n / e^{st}
- it can be shown that (e^{−s}+e^s)/2 ≤ e^{s²/2}
- therefore P(X ≥ t) ≤ e^{ns²/2}/e^{st} = e^{ns²/2−st}
- we set s = t/n
- then P(X ≥ t) ≤ e^{−t²/(2n)}
- Set balancing using Chernoff inequality
- we have subsets S_1,…,S_n ⊆ [m]
- they may be features like "men", "old people", …
- we want T ⊆ [m] such that it almost dissects every S_i
- to get two groups with the same (or similar) representation of every feature
- D_i = |T∩S_i| − |S_i∖T|
- disc(T) = max_i |D_i|
- use random T!
- ∀a_j∈S_i we define "indicators"
- X_j = 1 ⟺ a_j∈T
- X_j = −1 ⟺ a_j∉T
- D_i = ∑_{j=1}^{|S_i|} X_j
- P(D_i ≥ t) ≤ e^{−t²/(2|S_i|)} ≤ e^{−t²/(2m)}
- P(|D_i| ≥ t) ≤ 2·e^{−t²/(2m)}
- P(∃i: |D_i| ≥ t) ≤ n·2·e^{−t²/(2m)} ≤ 2/n
- for t = √(4m·log n)
- idea: it is useful to have an inequality (Chernoff) that always holds; we don't need to worry whether our n is big enough (see the sketch below)
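- a simulation sketch of random set balancing: each element joins T with probability 1/2 (encoded as a ±1 sign), and the observed discrepancy is compared with the threshold √(4m·log n); the set-system parameters are made up:
```python
import math, random

m, n = 500, 50
# random "features": each element belongs to each set with probability 0.3
sets = [[j for j in range(m) if random.random() < 0.3] for _ in range(n)]

signs = [random.choice([-1, 1]) for _ in range(m)]   # +1 <=> element in T
disc = max(abs(sum(signs[j] for j in S)) for S in sets)
print(disc, math.sqrt(4 * m * math.log(n)))          # disc stays below the bound w.h.p.
```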