agent – perceives environment through sensors, acts upon environment through actuators
rational agent maximizes its expected performance measure
in AI 1 we used logical approach; we ignored uncertainty
including uncertainty at the interface between agent and environment (sensors and actuators)
we also ignored self-improvement capabilities
uncertainty in pure logical approach
belief states
instead of having state, we have a belief state (set of all possibilities that can be true)
drawbacks
logical agent must consider every logically possible explanation for the observations (no matter how unlikely they are) → large and complex representations
a correct contingent plan must consider arbitrarily unlikely contingencies
sometimes there's no plan which would guarantee the desired result (but the agent still has to act somehow)
practical problems (in medicine)
laziness – it is too much work to list the complete set of antecedents or consequents and too hard to use such rules
theoretical ignorance – medical science has no complete theory for the domain
practical ignorance – even if we know all the rules, we may not be able to run all the necessary tests
that's why we'll use another approach – probability theory
probability
conditional probability
P(a∣b) = P(a∧b) / P(b) whenever P(b) > 0
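As a quick sanity check, the definition can be evaluated directly; the numbers below are made up for illustration, not from the notes:

```python
# Conditional probability: P(a|b) = P(a ∧ b) / P(b), defined only when P(b) > 0.
# Illustrative (made-up) numbers:
p_a_and_b = 0.12
p_b = 0.4

p_a_given_b = p_a_and_b / p_b   # ≈ 0.3
```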
in a factored representation, a possible world is represented by a set of variable/value pairs
every variable has a domain
a possible world is fully identified by values of all random variables
probability of all possible worlds can be represented using a table called a full joint probability distribution
P(Cavity)=⟨0.2,0.8⟩
Cavity = true → 0.2
Cavity = false → 0.8
inclusion-exclusion principle
P(a∨b)=P(a)+P(b)−P(a∧b)
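A minimal worked instance of the formula, with assumed numbers:

```python
# Inclusion-exclusion: P(a ∨ b) = P(a) + P(b) − P(a ∧ b).
# Subtracting P(a ∧ b) avoids double-counting worlds where both a and b hold.
# Illustrative (made-up) numbers:
p_a, p_b, p_a_and_b = 0.2, 0.5, 0.1
p_a_or_b = p_a + p_b - p_a_and_b   # ≈ 0.6
```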
we can do inference by summing up certain cells of the full joint distribution table
using normalization constant α may be helpful
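A sketch of inference by enumeration over a full joint table, using the dentistry variables from these notes; the individual cell probabilities are assumptions chosen to sum to 1 and to be consistent with P(Cavity) = 0.2:

```python
# Inference by enumeration over a full joint distribution.
# Variables (in tuple order): Cavity, Toothache, Catch — all boolean.
# Cell probabilities are illustrative; they sum to 1.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def query(joint, query_index, evidence):
    """P(variable at query_index | evidence): sum matching cells, then normalize."""
    dist = {True: 0.0, False: 0.0}
    for world, p in joint.items():
        if all(world[i] == v for i, v in evidence.items()):
            dist[world[query_index]] += p
    alpha = 1.0 / sum(dist.values())        # normalization constant α
    return {value: alpha * p for value, p in dist.items()}

# P(Cavity | Toothache = true) ≈ {True: 0.6, False: 0.4}
posterior = query(joint, 0, {1: True})
```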
drawbacks of inference by enumeration
worst-case complexity O(dⁿ) where n is the number of variables and d is the number of values in each variable's domain
adding another variable – weather
does not influence tooth problems (is independent) → we add another table
also, we don't need to store the entire table (probabilities add up to one)
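To see the saving, suppose Weather has 4 values and is independent of the three boolean dentistry variables (the domain sizes here are assumptions for illustration):

```python
# Independence lets the joint factor:
# P(Weather, Cavity, Toothache, Catch) = P(Weather) · P(Cavity, Toothache, Catch)
entries_full     = 4 * 2 ** 3              # 32 entries in one combined joint table
entries_factored = 4 + 2 ** 3              # 12 entries in two separate tables
# each table sums to 1, so one entry per table is redundant:
numbers_stored   = (4 - 1) + (2 ** 3 - 1)  # 10 independent numbers
```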
usually, we are interested in the diagnostic direction P(disease | symptoms)
but we know P(disease), P(symptoms), and P(symptoms | disease) … causal direction
we use Bayes' rule to get the diagnostic direction
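A sketch of Bayes' rule turning the known causal direction into the diagnostic one; the probabilities are made-up numbers in the spirit of a medical-diagnosis example:

```python
# Bayes' rule: P(disease | symptom) = P(symptom | disease) · P(disease) / P(symptom)
# Made-up numbers; the causal direction P(symptom | disease) is what we know.
p_symptom_given_disease = 0.7
p_disease = 1 / 50_000        # prior on the disease
p_symptom = 0.01              # prior on the symptom

p_disease_given_symptom = p_symptom_given_disease * p_disease / p_symptom  # ≈ 0.0014
```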
it may be better not to store the diagnostic direction, as the original probabilities (which we base the calculation on) may change
naive Bayes model
generally, we can exploit conditional independence by ordering the variables properly
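A minimal naive Bayes sketch on the dentistry variables: assuming the effects are conditionally independent given the cause, the joint is P(Cause) · Π_i P(effect_i | Cause). All numbers below are assumptions for illustration:

```python
# Naive Bayes: P(Cause | e_1, …, e_n) ∝ P(Cause) · Π_i P(e_i | Cause)
# Made-up conditional probabilities:
p_cavity          = {True: 0.2, False: 0.8}   # prior P(Cavity)
p_toothache_given = {True: 0.6, False: 0.1}   # P(Toothache=true | Cavity)
p_catch_given     = {True: 0.9, False: 0.2}   # P(Catch=true | Cavity)

# observed evidence: Toothache = true, Catch = true
unnormalized = {c: p_cavity[c] * p_toothache_given[c] * p_catch_given[c]
                for c in (True, False)}
alpha = 1.0 / sum(unnormalized.values())      # normalization constant α
posterior = {c: alpha * p for c, p in unnormalized.items()}
```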