Lecture
- look at the slides before the lecture
- rational agent
- agent – perceives environment through sensors, acts upon environment through actuators
- rational agent maximizes its expected performance measure
- in AI 1 we used logical approach; we ignored uncertainty
- including interface between agent and environment
- we also ignored self-improvement capabilities
- uncertainty in pure logical approach
- belief states
- instead of having state, we have a belief state (set of all possibilities that can be true)
- drawbacks
- logical agent must consider every logically possible explanation for the observations (no matter how unlikely they are) → large and complex representations
- a correct contingent plan must consider arbitrarily unlikely contingencies, so it can grow arbitrarily large
- sometimes there's no plan which would guarantee the desired result (but the agent still has to act somehow)
- practical problems (in medicine)
- laziness – it is too much work to list the complete set of antecedents or consequents and too hard to use such rules
- theoretical ignorance – medical science has no complete theory for the domain
- practical ignorance – even if we know all the rules, we may not be able to run all the necessary tests
- that's why we'll use another approach – probability theory
- probability
- conditional probability
- P(a ∣ b) = P(a ∧ b) / P(b) whenever P(b) > 0
- in a factored representation, a possible world is represented by a set of variable/value pairs
- every variable has a domain
- a possible world is fully identified by values of all random variables
- probability of all possible worlds can be represented using a table called a full joint probability distribution
- P(Cavity)=⟨0.2,0.8⟩
- Cavity = true → 0.2
- Cavity = false → 0.8
- inclusion-exclusion principle
- P(a∨b)=P(a)+P(b)−P(a∧b)
- we can do inference by summing up certain cells of the full joint distribution table
- using normalization constant α may be helpful
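Summing up cells and normalizing can be sketched in Python; the joint-distribution numbers below are the classic dentist example over (Cavity, Toothache, Catch) from Russell & Norvig, and the query is P(Cavity ∣ toothache):

```python
# Full joint distribution over (Cavity, Toothache, Catch) as a dict;
# numbers are the dentist example from Russell & Norvig.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def query_cavity_given_toothache(joint):
    """P(Cavity | toothache=true): sum matching cells, then normalize."""
    p_true  = sum(p for (cav, tooth, _), p in joint.items() if tooth and cav)
    p_false = sum(p for (cav, tooth, _), p in joint.items() if tooth and not cav)
    alpha = 1 / (p_true + p_false)          # normalization constant α
    return (alpha * p_true, alpha * p_false)

print(query_cavity_given_toothache(joint))  # ≈ (0.6, 0.4)
```

Note that α saves us from ever computing P(toothache) explicitly: we only need the two unnormalized sums.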
- drawbacks of inference by enumeration
- worst-case complexity O(dⁿ) where n is the number of variables and d is the number of values in the domain of each variable
- adding another variable – weather
- does not influence tooth problems (is independent) → we add another table
- also, we don't need to store the entire table (probabilities add up to one)
- usually, we are interested in the diagnostic direction P(disease | symptoms)
- but we know P(disease), P(symptoms), and P(symptoms | disease) … causal direction
- we use Bayes' rule to get the diagnostic direction
- it may be better not to store the diagnostic direction directly, since the underlying probabilities we base the calculation on may change (e.g., P(disease) during an epidemic)
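A one-line sketch of Bayes' rule turning the causal direction into the diagnostic one, using the stiff-neck/meningitis numbers from Russell & Norvig:

```python
def bayes(p_symptom_given_disease, p_disease, p_symptom):
    """Diagnostic direction from the causal one: P(d|s) = P(s|d) P(d) / P(s)."""
    return p_symptom_given_disease * p_disease / p_symptom

# Stiff neck / meningitis: P(s|m) = 0.7, P(m) = 1/50000, P(s) = 0.01
print(bayes(0.7, 1 / 50000, 0.01))  # ≈ 0.0014
```

If an epidemic changes P(m), only the prior needs updating; the causal P(s∣m) stays the same.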
- naive Bayes model
- generally, we can exploit conditional independence by ordering the variables properly
- naive Bayes model – we assume independence
- P(Cause, Effect₁, …, Effectₙ) = P(Cause) ∏ᵢ P(Effectᵢ ∣ Cause)
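A small sketch of a naive Bayes query: compute P(Cause) ∏ᵢ P(effectᵢ ∣ Cause) for both values of the cause and normalize. The probability numbers below are made up for illustration:

```python
# Prior P(Cause) and per-effect tables P(effect_i = true | Cause)
# (hypothetical values, for illustration only).
p_cause = {True: 0.2, False: 0.8}
p_effect_given_cause = [{True: 0.9, False: 0.1},
                        {True: 0.7, False: 0.2}]

def naive_bayes_posterior(observed):
    """P(Cause | effects); observed is a list of booleans, one per effect."""
    scores = {}
    for c in (True, False):
        s = p_cause[c]
        for table, e in zip(p_effect_given_cause, observed):
            p_true = table[c]
            s *= p_true if e else (1 - p_true)  # P(effect_i | Cause = c)
        scores[c] = s
    alpha = 1 / sum(scores.values())            # normalize
    return {c: alpha * s for c, s in scores.items()}

print(naive_bayes_posterior([True, True]))
```

Observing both effects as true makes Cause = true much more likely than its 0.2 prior, because both effects are far more probable under the cause.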
- Wumpus world
- we consider worlds with n pits for every possible n
- we know that each square contains a pit independently with probability 0.2
- (is this useful???)
- in the end, we get P(P_{1,3} ∣ known, b) = α′ P(P_{1,3}) ∑_{fringe} P(b ∣ P_{1,3}, known, fringe) P(fringe)
- summary
- we use probability theory to handle uncertainty
- full joint distribution describes probabilities of all possible worlds
- answers to queries can be obtained by summing out probabilities of worlds consistent with the observation (marginalization)
- too expensive for larger problems → we exploit conditional independence
- we can remove some “dependencies”
- Bayesian network
- DAG; specifies conditional independence relationships among random variables
- nodes correspond to random variables
- arcs describe the dependencies
- example: burglary detection
- CPT (conditional probability table) for each variable
- to get the full joint distribution, we just multiply the CPT entries: P(x₁, …, xₙ) = ∏ᵢ P(xᵢ ∣ parents(Xᵢ))
- we construct the network based on the selected order
- usually, we want to use the causal direction (so that we know how to construct the CPTs)
- the arc should be there if there is a dependence between the variables
- for X conditionally independent of Y given Z, it should hold that P(X ∣ Y, Z) = P(X ∣ ¬Y, Z) = P(X ∣ Z)
- how much space can we save?
- assume that each variable is directly influenced by at most k other variables
- then the size of the representation is n·2ᵏ numbers for a Bayesian network vs. 2ⁿ for the full joint distribution
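A quick sanity check of the savings, using the textbook's illustrative numbers of n = 30 Boolean variables with at most k = 5 parents each:

```python
# n Boolean variables, each with at most k parents:
# each CPT has at most 2**k rows, so the network needs n * 2**k numbers,
# while the full joint distribution needs 2**n.
n, k = 30, 5
print(n * 2**k)  # 960
print(2**n)      # 1073741824 (over a billion)
```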
- Markov blanket
- inference by enumeration
- using joint probability (including hidden variables) + moving terms out of the sums if possible
- using branching to compute the probabilities
- dynamic programming
- we can use factors (tables constructed from CPTs)
- multiplication: we can multiply the rows that have the same values of the shared variables
- elimination: we sum the rows that are almost the same (differ only in the value of the variable we want to eliminate)
- we construct the factors based on the evidence
- suggested ordering – eliminate the variables that would need large factors in the subsequent stages
- complexity
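The two factor operations above can be sketched as follows: a factor over Boolean variables is represented as a tuple of variable names plus a table mapping value tuples to numbers (the probabilities in the usage example are made up):

```python
from itertools import product

def multiply(f, g):
    """Pointwise product: rows agreeing on the shared variables are multiplied."""
    fv, ft = f
    gv, gt = g
    vars_ = fv + tuple(v for v in gv if v not in fv)
    table = {}
    for vals in product((True, False), repeat=len(vars_)):
        assign = dict(zip(vars_, vals))
        table[vals] = (ft[tuple(assign[v] for v in fv)]
                       * gt[tuple(assign[v] for v in gv)])
    return vars_, table

def sum_out(var, f):
    """Eliminate var by summing rows that differ only in its value."""
    fv, ft = f
    rest = tuple(v for v in fv if v != var)
    table = {}
    for vals, p in ft.items():
        key = tuple(v for name, v in zip(fv, vals) if name != var)
        table[key] = table.get(key, 0.0) + p
    return rest, table

# Usage: P(B) = sum_A P(A) * P(B|A), with made-up numbers.
pa  = (("A",), {(True,): 0.3, (False,): 0.7})
pba = (("A", "B"), {(True, True): 0.9, (True, False): 0.1,
                    (False, True): 0.2, (False, False): 0.8})
print(sum_out("A", multiply(pa, pba)))  # P(B=true) ≈ 0.41, P(B=false) ≈ 0.59
```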
- approximate inference
- Monte Carlo methods
- direct sampling
- we go through the Bayesian network in topological (top-down) order
- in each step, we sample one variable conditioned on the already-sampled values of its parents
- but we don't care about the individual samples themselves – only their relative frequencies, which estimate the probabilities
- rejection sampling
- we generate samples using direct sampling
- we eliminate the ones that don't correspond to the evidence
- we estimate P(X = x ∣ e) as the number of non-rejected samples with X = x divided by the total number of non-rejected samples
- problem: we may reject too many samples
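A minimal sketch of direct (prior) sampling and rejection sampling, on a hypothetical two-node network Rain → WetGrass with made-up CPT numbers:

```python
import random

# Hypothetical two-node network Rain -> WetGrass (made-up numbers).
P_RAIN = 0.2
P_WET_GIVEN = {True: 0.9, False: 0.1}   # P(WetGrass=true | Rain)

def prior_sample():
    """Direct sampling: sample each variable top-down given its parents."""
    rain = random.random() < P_RAIN
    wet = random.random() < P_WET_GIVEN[rain]
    return rain, wet

def rejection_sample_rain_given_wet(n):
    """Estimate P(Rain=true | WetGrass=true) by discarding inconsistent samples."""
    kept = rain_count = 0
    for _ in range(n):
        rain, wet = prior_sample()
        if not wet:              # inconsistent with evidence -> reject
            continue
        kept += 1
        rain_count += rain
    return rain_count / kept

random.seed(0)
print(rejection_sample_rain_given_wet(100_000))  # ≈ 0.69 (exact: 0.18/0.26)
```

Here about 74 % of the samples are rejected; with rarer evidence the waste gets much worse, which motivates likelihood weighting.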
- likelihood weighting
- instead of rejecting samples, we generate only the ones consistent with evidence e
- but then, we lose information about the probability of the specific piece of evidence
- so we assign weights to the samples according to the probability of the evidence
- the weights can become very small, so our arithmetic has to be precise enough
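Likelihood weighting can be sketched on the same hypothetical Rain → WetGrass network (made-up numbers): we fix the evidence WetGrass = true, sample only Rain, and weight each sample by P(WetGrass = true ∣ Rain):

```python
import random

# Hypothetical two-node network Rain -> WetGrass (made-up numbers);
# evidence: WetGrass = true.
P_RAIN = 0.2
P_WET_GIVEN = {True: 0.9, False: 0.1}   # P(WetGrass=true | Rain)

def likelihood_weighting_rain_given_wet(n):
    """Sample non-evidence variables only; weight by P(evidence | parents)."""
    weighted_rain = total = 0.0
    for _ in range(n):
        rain = random.random() < P_RAIN   # sample the non-evidence variable
        w = P_WET_GIVEN[rain]             # weight: P(WetGrass=true | rain)
        total += w
        weighted_rain += w * rain
    return weighted_rain / total

random.seed(0)
print(likelihood_weighting_rain_given_wet(100_000))  # ≈ 0.69, same as rejection
```

No sample is ever thrown away; the weights carry the information about how probable the evidence was under each sample.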
- MCMC (Markov chain Monte Carlo)