# Lecture

- look at the slides before the lecture
- rational agent
  - an agent perceives its environment through sensors and acts upon the environment through actuators
  - a rational agent maximizes its expected performance measure
  - in AI 1 we used the logical approach and ignored uncertainty
    - including the interface between the agent and the environment
    - we also ignored self-improvement capabilities
- uncertainty in the purely logical approach
  - belief states
    - instead of a single *state*, we have a belief state (the set of all states that may be the true one)
  - drawbacks
    - a logical agent must consider every logically possible explanation for its observations, no matter how unlikely → large and complex belief-state representations
    - a correct contingent plan must handle arbitrarily unlikely contingencies, so it can grow arbitrarily large
    - sometimes there is no plan that would guarantee the desired result (but the agent still has to act somehow)
  - practical problems (e.g., in medicine)
    - laziness – it is too much work to list the complete set of antecedents or consequents, and such rules are too hard to use
    - theoretical ignorance – medical science has no complete theory of the domain
    - practical ignorance – even if we knew all the rules, we might not be able to run all the necessary tests
  - that's why we'll use another approach – probability theory
- probability
  - conditional probability
    - $P(a\mid b)=\frac{P(a\land b)}{P(b)}$ whenever $P(b)\gt 0$
  - in a factored representation, a possible world is represented by a set of variable/value pairs
    - every variable has a domain
    - a possible world is fully identified by the values of all random variables
  - the probabilities of all possible worlds can be represented by a table called the *full joint probability distribution*
    - $\textbf{P}(\mathrm{Cavity})=\braket{0.2,0.8}$
      - Cavity = true → 0.2
      - Cavity = false → 0.8
  - inclusion–exclusion principle
    - $P(a\lor b)=P(a)+P(b)-P(a\land b)$
  - we can do inference by summing up the relevant cells of the full joint distribution table
    - using a normalization constant $\alpha$ may be helpful
  - drawbacks of inference by enumeration
    - worst-case complexity $O(d^n)$, where $d$ is the number of values in the domain of each variable and $n$ is the number of variables
  - adding another variable – weather
    - it does not influence tooth problems (it is independent) → we keep it in a separate table
    - also, we don't need to store the entire table (the probabilities add up to one)
- Bayes' rule
  - usually, we are interested in the diagnostic direction P(disease | symptoms)
  - but we know P(disease), P(symptoms), and P(symptoms | disease) … the causal direction
  - we use Bayes' rule to get the diagnostic direction
  - it may be better not to store the diagnostic direction, as the original probabilities we base the calculation on may change
- naive Bayes model
  - generally, we can exploit conditional independence by ordering the variables properly
  - naive Bayes model – we assume the effects are conditionally independent given the cause
    - $P(\text{Cause}, \text{Effect}_1, \dots, \text{Effect}_n) = P(\text{Cause})\prod_iP(\text{Effect}_i\mid\text{Cause})$
- Wumpus world
  - we consider worlds with $n$ pits for every possible $n$
  - we know that each square contains a pit with probability 0.2, independently of the other squares
    - this independence is what makes the prior $P(\mathrm{fringe})$ easy to compute (a product of per-square probabilities)
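The earlier idea of inference by enumeration (sum the consistent cells of the full joint table, then normalize with $\alpha$) can be sketched in Python on the dentist example. The joint-table numbers below are illustrative, chosen only so that $P(\mathrm{Cavity}) = 0.2$ as above:

```python
# Full joint distribution over three Boolean variables
# (Cavity, Toothache, Catch); illustrative numbers with P(Cavity) = 0.2.
joint = {
    # (cavity, toothache, catch): probability
    (True,  True,  True):  0.108,
    (True,  True,  False): 0.012,
    (True,  False, True):  0.072,
    (True,  False, False): 0.008,
    (False, True,  True):  0.016,
    (False, True,  False): 0.064,
    (False, False, True):  0.144,
    (False, False, False): 0.576,
}

def query(var_index, evidence):
    """P(X | evidence): sum the cells consistent with the evidence,
    then normalize with the constant alpha."""
    dist = {}
    for world, p in joint.items():
        if all(world[i] == v for i, v in evidence.items()):
            dist[world[var_index]] = dist.get(world[var_index], 0.0) + p
    alpha = 1.0 / sum(dist.values())  # normalization constant
    return {value: alpha * p for value, p in dist.items()}

# P(Cavity | toothache) = <0.6, 0.4> with these numbers
print(query(0, {1: True}))
```

Note that the worst-case cost is exactly the $O(d^n)$ table scan mentioned above: every cell is inspected once.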
  - in the end, we get $P(P_{1,3}\mid \mathrm{known},b)=\alpha'P(P_{1,3})\sum_{\mathrm{fringe}}P(b\mid P_{1,3},\mathrm{known},\mathrm{fringe})P(\mathrm{fringe})$
- summary
  - we use probability theory to handle uncertainty
  - the full joint distribution describes the probabilities of all possible worlds
  - answers to queries can be obtained by summing out the probabilities of the worlds consistent with the observation (marginalization)
  - this is too expensive for larger problems → we exploit conditional independence
    - we can remove some “dependencies”
- Bayesian network
  - a DAG that specifies conditional independence relationships among random variables
    - nodes correspond to random variables
    - arcs describe the dependencies
  - example: burglary detection
  - CPT (conditional probability table) for each variable
    - to get an entry of the full joint distribution, we just multiply one row from each table
  - we construct the network based on a selected variable ordering
    - usually, we want to use the causal direction (so that we know how to construct the CPTs)
    - an arc should be there if there is a direct dependence between the variables
    - for $X$ conditionally independent of $Y$ given $Z$, it should hold that $P(X\mid Y,Z)=P(X\mid \neg Y,Z)=P(X\mid Z)$
  - how much space can we save?
    - assume that each variable is directly influenced by at most $k$ other variables
    - then the size of the representation is $n\cdot 2^k$ for the Bayesian network vs. $2^n$ for the full joint distribution
  - *Markov blanket*
- inference by enumeration
  - use the joint probability (including the hidden variables) and move terms out of the sums where possible
  - use branching to compute the probabilities
- dynamic programming (variable elimination)
  - we can use factors (tables constructed from the CPTs)
    - multiplication: we multiply the rows that agree on the values of the shared variables
    - elimination: we sum the rows that are almost identical (they differ only in the value of the variable we want to eliminate)
  - we construct the factors based on the evidence
  - suggested ordering – eliminate first the variables that would otherwise appear in large factors in the subsequent stages
  - complexity
- approximate inference
  - Monte Carlo methods
  - direct sampling
    - we go through the Bayesian network in top-down (topological) order
    - in each step, we sample one variable and base further sampling on this choice
    - but the individual samples are not the goal – we only count them to estimate probabilities
  - rejection sampling
    - we generate samples using direct sampling
    - we discard the ones that don't match the evidence
    - we estimate $P(X\mid e)$ as the fraction of the kept (not discarded) samples in which $X$ holds
    - problem: we may reject too many samples
  - likelihood weighting
    - instead of rejecting samples, we generate only those consistent with the evidence $e$
    - but then we lose information about how probable the specific piece of evidence is
    - so we assign weights to the samples according to the probability of the evidence
    - our arithmetic has to be precise enough
  - MCMC (Markov chain Monte Carlo)
    - Gibbs sampling
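Rejection sampling can be sketched on a hypothetical two-variable network, Cloudy → Rain, with made-up CPTs (not an example from the lecture): sample top-down, throw away samples inconsistent with the evidence, and count what remains.

```python
import random

# Made-up CPTs for a tiny network Cloudy -> Rain.
P_CLOUDY = 0.5
P_RAIN = {True: 0.8, False: 0.1}  # P(Rain = true | Cloudy)

def estimate_p_cloudy_given_rain(n, seed=0):
    """Estimate P(Cloudy | Rain = true) by rejection sampling."""
    rng = random.Random(seed)
    kept = cloudy_count = 0
    for _ in range(n):
        cloudy = rng.random() < P_CLOUDY       # sample parents first (top-down)
        rain = rng.random() < P_RAIN[cloudy]   # then sample children
        if not rain:
            continue                           # reject: inconsistent with evidence
        kept += 1
        cloudy_count += cloudy
    return cloudy_count / kept

# Exact answer by Bayes' rule: 0.5*0.8 / (0.5*0.8 + 0.5*0.1) = 8/9 ≈ 0.889
print(estimate_p_cloudy_given_rain(100_000))
```

This also shows the weakness noted above: with these numbers only about 45 % of the samples survive, and the rarer the evidence, the more work is wasted – the motivation for likelihood weighting.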