Online Test

Introduction Methodologies Data Analysis Regression Analysis Cluster Analysis Decision Trees

Introduction

data mining from databases
- non-trivial process of gaining implicit, previously not known, but potentially useful information from the data
- originated in the 90s (there was not enough data before)
- knowledge discovery in databases (KDD)
- data mining (DM) – business intelligence (BI) and big data
foundations
- artificial intelligence, machine learning methods
- database systems (to store large data sets), information retrieval
- statistics – modeling and analysis of dependencies found in the data
- + how to use the results for decision-making
data mining is an interactive and iterative process
- data preparation – we build one table containing all the relevant data
  - selection (of useful attributes), preprocessing, transformation
- the actual “data mining” – we find patterns in the data
- interpretation – found knowledge shall be evaluated from the point of view of the end user (manager, customer, etc.)
PoV of a manager
- there’s a topical issue
- goal of the data mining process is to obtain as much information as possible that is relevant to solving the problem
- example
  - find groups of customers of a department store to offer special services to
  - the found groups can be interpreted as segments in the given market area
- steps
  - form a team: data analyst, domain expert, expert on databases, …
  - specify the problem
  - obtain all data available
    - we should also obtain the external data describing the environment of the analyzed processes (time period of the year, advertising, political issues, weather, …)
  - select the methods
    - clustering, classification, exploratory data analysis, association rules, decision trees, genetic algorithms, Bayesian networks, neural networks
    - visualization methods – helpful for presentation
  - preprocess the data
  - mine the data
  - interpret the results
    - we may need to create an analytical report
    - make the results easy to understand
    - the output can also mean to carry out a reasonable action
tasks
- classification and prediction
  - goal: predict a continuous or discrete value based on some attributes
  - interpretation may be challenging
  - prediction: weather forecast, stock prices, …
  - we should be able to cover the entire domain (all the data may be useful for a reasonable prediction)
- description
  - goal: find a dominant structure or relationships
  - we may ignore some of the information; the extracted knowledge does not need to be that precise (but it should be easily understandable)
- looking for “nuggets”
  - goal: find some interesting knowledge (does not have to fully cover the given concept)
real tasks (examples)
- segmentation and classification of bank clients
- causes of failures in telecommunication networks
- causes of change of service provider
- prediction of power consumption
- analysis of the patient database in a hospital
  - Florence Nightingale
  - Ignaz Semmelweis
- market basket analysis

Methodologies

goal, popular methodologies
- goal: provide the users with a unified framework; guide data mining applications regardless of industry
- it’s necessary to have high-quality data; the steps are usually iterative
- popular methodologies: 5A, SEMMA, CRISPR-DM
SEMMA
- sample – select data for modeling
  - may include sampling, imputation (adding other useful information, e.g. adding seasons of the year to the data about the sales), partitioning (train-test-validation split)
- explore – visual exploration and dimensionality reduction
- modify – prepare the objects, values, and variables for data modeling; transform the data
- model – apply data mining techniques (decision trees, regression models, NNs, …)
  - create models providing relevant outcome
- assess – evaluate the results of modeling (assess their reliability and usefulness)
  - so that the manager can understand
CRISP-DM
- cross-industry standard process for data mining; a robust general-purpose model
  - software-independent
- 6 phases (their order is not strict)
  - business understanding
    - determine our business objective
    - assess our present situation, what resources and data we have
    - risk assessment
    - setting KPIs
  - data understanding
    - collect and describe initial data; explore and visualize it
    - verify the quality of the data
  - data preparation
    - cleaning, integration (merging), aggregation, …
    - make the dataset ready for further analysis and modeling
  - modeling
    - select the modeling technique
    - generate test design
    - build the model
    - assess the model
  - evaluation
    - evaluate the results – from the point of view of the manager
    - review the process
    - determine the next steps
  - deployment
    - results should be presented in a simple and comprehensible way
    - plan the deployment (what steps should be done)
    - final report, review the project
      - to preserve knowledge
- ciritique
  - little support for project management
  - what to do if there are problems?
  - → IBM released Analytics Solutions Unified Method for Data Mining (ASUM)
ASUM
- phases
  1. analyze – define the needs (desired features, performance, usability, …), obtain agreement
  2. design – define solution components, identify resources, clarify requirements
  3. configure & build – including testing and validation
  4. deploy – create a plan to run and maintain the solution
  5. operate & optimize – preserve the health
  - + project management – there are processes that help to monitor the progress and maintain the project
- benefits
  - minimized risk
  - scalable and enterprise-ready
  - comprehensive
  - product-specific implementation roadmaps
  - users can replicate previous software implementations

Data Analysis

types of attributes
- interval-scaled attributes – their values are real number following a linear scale
- ratio-scaled attributes – follow an exponential scale → need to be log-transformed ( $\log x_i$ ) before treating them like interval-scaled attributes
- a nominal attribute can be transformed to a series of binary attributes (one-hot encoding)
- ordinal attributes are sometimes treated as interval-scaled attributes
data standardization of interval-scaled attributes
- decimal scaling – divide the data by the smallest power of 10 to get them to the interval $[-1,1]$ (while maintaining the mutual relations)
- range standardization
  - to $[0,1]$ … $\frac{x-\mathrm{min}}{\mathrm{max}-\mathrm{min}}$
  - to $[-1,1]$ … $2\cdot \frac{x-\mathrm{min}}{\mathrm{max}-\mathrm{min}}-1$
- Z-score standardization … $\frac{x-\mu}{\sigma}$ $\frac{x - μ}{σ}$
  - $\mu$ … mean
  - $\sigma$ $σ$ … corrected sample standard deviation
    - $\sigma=\sqrt{\frac1{n-1}\sum_{i=1}^n(x_i-\mu)^2}$
  - or we can use the corrected sample standard deviation
- standardization according to the mean absolute difference
  - $\frac{x-\mu}{s}$
  - where $s=\frac1n\sum_{i=1}^n|x_i-\mu|$
range, quartiles, outliers
- range = difference between the highest and the lowest value in the set
- quartiles – values $Q_1,Q_2,Q_3$ $Q_{1}, Q_{2}, Q_{3}$
  - 25% of the observations below $Q_1$ (lower quartile)
  - 50% below $Q_2$ (median)
  - 75% below $Q_3$ (upper quartile)
- interquartile range $IQR=Q_3-Q_1$
- formulas (for discrete data)
  - $Q_1=x_{\lceil n/4 \rceil}$
  - $Q_3=x_{\lceil 3n/4 \rceil}$
  - but if $x_{n/4}$ or $x_{3n/4}$ exists ( $n$ is divisible by 4), we need to average this value with the next one $($ $x_{n/4+1}$ or $x_{3n/4+1}$ )
- outlier is any value $\notin[Q_1-1.5\cdot IQR,\ Q_3+1.5\cdot IQR]$
- box plot
  - box … between $Q_1$ and $Q_3$ with a mark (line) representing the median
  - two other lines … boundaries for outliers (or the extrema that are not outliers)
  - other marks outside … individual outliers
contingency table
- relationship between two categorical quantities (may be binary)
- cell … how many observations have this combination of attributes
- there are also column and row totals
- we get an expected frequency of combination if we multiply the column total with the row total and divide it by the total number of observations
  - intuition: we take the number of elements in the column and multiply it by the (observed) probability that any sample belongs to the given row (has this value of the attribute)
$\chi^2$ $χ^{2}$ -test
- $\chi^2=\sum_k\sum_\ell\frac{(a_{k\ell}-e_{k\ell})^2}{e_{k\ell}}$ $χ^{2} = \sum_{k} \sum_{ℓ} \frac{( a _{k ℓ} - e _{k ℓ} ) ^{2}}{e _{k ℓ}}$
  - where $a_{k\ell}$ is the value of the cell (number of observation with this combination of attributes)
  - $e_{k\ell}$ is the expected frequency as described above: $e_{k\ell}=\frac{r_k\cdot s_\ell}n$
- $\chi^2$ distribution has $(R-1)(S-1)$ degrees of freedom (the last row and column are somehow determined by the rest of the table)
- we choose a level of significance $\alpha$ (usually 0.05) = probability of rejecting a true null hypothesis
- we reject $H_0$ if our $\chi^2$ -statistic is greater than the value of the $\chi^2$ distribution (for the given DoF and $\alpha$ )
- it can be used only for large-enough frequencies – when $\forall k,\ell:e_{k\ell}\geq 5$ $\forall k, ℓ : e_{k ℓ} \geq 5$
  - if the frequencies are too low and the contingency table contains only 4 fields, we can use Fisher’s test
Fisher’s test
- for a contingency table with only 4 fields (two binary attributes)
- one-sided or two-sided version
- we get the $p$ $p$ -value directly: what is the probability of this or more extreme table if we assume fixed column and row totals?
  - for the two-sided version, we need to compute the expected frequencies $e_{k\ell}$
  - probability of this table: $p=\frac{r_1!\cdot r_2!\cdot s_1!\cdot s_2!}{n!\cdot a!\cdot b!\cdot c!\cdot d!}$
    - $a,b,c,d$ … values of the cells
    - $r_i,s_j$ … row and column totals respectively
  - probability of some more extreme table $T_i$ – we consider values $a-i,\ b+i,\ c+i,\ d-i$ instead
  - the probabilities of the tables need to be summed up to get the $p$ -value
- null hypothesis: independence of the attributes
  - we reject for $p$ -value $\leq \alpha$

Regression Analysis

motivation
- it may be costly to obtain output values
- input values are known earlier than the output (we want to estimate the output)
- controlling input variables may lead to the desired behavior of the output
- we want to find a (causal) relationship between the input and the output
linear regression
- what are the parameters of the linear relationship between two numerical quantities?
- $y=\beta x+\alpha \textcolor{gray}{+ \varepsilon}$
- least squares method
  - find parameters $\alpha,\beta$
  - we minimize $SSE=\sum_i(y_i-f(x_i))^2$ (sum of squared errors)
  - set partial derivatives of SSE equal to zero, we get two equations:
    - $n\alpha+\beta\sum_i x_i=\sum_i y_i$ $n α + β \sum_{i} x_{i} = \sum_{i} y_{i}$
      - or $\alpha=\bar y-\beta\bar x$
    - $\alpha\sum_i x_i+\beta\sum_i x_i^2=\sum_i x_iy_i$
  - so $(\bar y-\beta\bar x)\sum_i x_i+\beta\sum_i x_i^2=\sum x_iy_i$
  - we get $\beta=\frac{\sum_i (x_i-\bar x)(y_i-\bar y)}{\sum_i(x_i-\bar x)^2}$ $β = \frac{\sum _{i} ( x _{i} - x ˉ ) ( y _{i} - y ˉ )}{\sum _{i} ( x _{i} - x ˉ ) ^{2}}$
    - and $\alpha=\bar y-\beta\bar x$
correlation analysis
- is there a linear relationship between two numerical quantities?
- correlation coefficient $\rho(x,y)=\frac{\text{cov}(x,y)}{\sigma_x\sigma_y}\in[-1,1]$ $ρ (x, y) = \frac{cov ( x , y )}{σ _{x} σ _{y}} \in [- 1, 1]$
  - corrected sample covariance $\text{cov}(x,y)=\frac{1}{n-1}\sum_i(x_i-\bar x)(y_i-\bar y)$
  - corrected sample standard deviation $\sigma_x=\sqrt{\frac1{n-1}\sum_i(x_i-\bar x)^2}$ $σ_{x} = \frac{1}{n - 1} \sum_{i} (x_{i} - \overset{x}{ˉ})^{2}$
    - similar for $\sigma_y$
multi-dimensional regression
- linear → least-squares method in matrix form
  - $y=X\beta$
  - $X$ $X$ … matrix of input samples
    - before the $x$ values of the sample, there is $1$ (so that we don’t need to keep the constant $\alpha$ )
  - so $\beta=(X^TX)^{-1}(X^Ty)$
  - may be computationally expensive → approximate solutions
- non-linear – example: logistic regression
  - if the dependent variable is categorical
  - we use log (conditional) odds to model the conditional probability
    - $s_i=\ln\frac{P}{1-P}$
    - so $P=\frac 1{1+e^{-s_i}}$ (sigmoid)
    - where $s_i=\alpha+\beta x_i=\alpha+\sum_j\beta_j x_{ij}$
  - we estimate the parameters using maximum likelihood
    - $L=\prod_i P(y_i\mid x_i)=\prod_i p_i^{y_i}(1-p_i)^{1-y_i}$ $L = \prod_{i} P (y_{i} ∣ x_{i}) = \prod_{i} p_{i}^{y_{i}} (1 - p_{i})^{1 - y_{i}}$
      - $p_i=\sigma(s_i)=\frac1{1+e^{-\alpha-\beta x_i}}$
      - note that $1-p_i=1-\sigma(s_i)=\sigma(-s_i)$
    - we minimize negative log-likelihood (= cross-entropy loss)
      - $E=-\sum_i(y_i\log p_i+(1-y_i)\log(1-p_i))$
  - gradient descent is then applied like this
    - $\beta_j\leftarrow\beta_j-\eta\sum_i(\sigma(\alpha+\beta x_i)-y_i)x_{ij}$
    - $\alpha\leftarrow\alpha-\eta\sum_i(\sigma(\alpha+\beta x_i)-y_i)$
    - $\eta$ … learning rate
    - note that $\sigma'(s_i)=\sigma(s_i)(1-\sigma(s_i))$
discriminant analysis
- classification into classes
- we consider a discriminant function $f_t$ $f_{t}$ for every class $c_t$ $c_{t}$
  - conditional (a posteriori) probability of classifying $x$ to the class
- for two classes, we get $f(x)=f_1(x)-f_2(x)=P(x|c_1)P(c_1)-P(x|c_2)P(c_2)$ $f (x) = f_{1} (x) - f_{2} (x) = P (x ∣ c_{1}) P (c_{1}) - P (x ∣ c_{2}) P (c_{2})$
  - (there should be a normalization constant $\alpha$ maybe)
- for normal distributions, we only need to estimate the means and the covariance matrices to get the discriminant functions
  - the decision boundary is located where the distribution curves intersect

Cluster Analysis

cluster analysis, assumptions
- we want to divide the observed samples into groups of mutually similar ones
- assumption: we can measure the distance between samples (or between clusters)
- distance between two samples
  - Minkovski metrics $L_z(x_1,x_2)=\sqrt[z]{\sum_j (x_{1j}-x_{2j})^z}$ $L_{z} (x_{1}, x_{2}) = z \sum_{j} (x_{1 j} - x_{2 j})^{z}$
    - Manhattan distance … $L_1$
    - Euclidean distance … $L_2$
    - Chebyshev distance … $L_\infty$
  - Hamming distance – number of positions at which $x_1$ and $x_2$ differ
  - Mahalanobis distance – useful if the quantities have different variance
    - uses the covariance matrix
    - $d(x_1,x_2)=(x_1-x_2)^TS^{-1}(x_1-x_2)$
- distance between two clusters
  - method of the nearest neighbor
  - method of the farthest neighbor
  - method of average distance (we compute the mean distance of two arbitrary samples – one from each cluster)
  - centroid method (distance between the centers of the clusters)
centroid
- compute the average over all the features
- but such point may not be in the data at all
k-means clustering
- Lloyd
  - divide the points randomly into $k$ clusters
  - repeat until convergence
    - compute the centroids
    - assign each point to its closest centroid
- alternative initializations
  - pronounce the first $k$ points to be centroids
  - k-means++ … sample $k$ points (to be used as centroids) with probability proportional to their squared distance to the closest existing centroid
  - LocalSearch++ … start with k-means++, then sample some more points (in the same way) one by one and replace some of the existing centroids if they cover the distribution (points) better
k-medians
- just use Manhattan distance instead of Euclidean
- → centroid = median (independently along each dimension)
  - the representative may not exist in the dataset
- median is less sensitive to the presence of outliers than the mean
hierarchical clustering
- bottom-up approach
- we get a dendrogram
  - a clustering corresponds to some “cut” in the dendrogram
- algorithm
  - initialization: each point is its own cluster
  - repeat: merge the two closest clusters
learning vector quantization (LVQ)
- for supervised clustering
- we consider several weight vectors (representatives)
  - each weight vector corresponds to a single class
- first, we need to initialize the weight vectors
- then, for every sample, we update the closest weight vector
  - if the class of the vector corresponds to the class of the sample, we move the vector closer to the sample: $w\leftarrow w+\mu(x-w)$
  - otherwise, we move it further from the sample: $w\leftarrow w-\mu(x-w)$
- $\mu$ … learning rate (usually some constant divided by the number of the epoch)
- no guarantee of convergence
k-medoids
- uses a similarity measure instead of averaging
- we try to find $k$ $k$ best representatives $Y_1,\dots,Y_k$ $Y_{1}, \dots, Y_{k}$ (from the dataset)
  - so the medoids are real data points
- objective function: $O=\sum_{i=1}^n(\min_j\mathrm{Dist}(X_i,Y_j))$
- more robust to outliers than k-means
- sometimes, we cannot easily compute average (or it does not make sense)
- possible strategy
  - randomly select $r$ pairs $(X_i,Y_j)$ for a possible exchange
  - the pair with the best improvement of the objective function is exchanged
  - requires time proportional to $rn$
grid-based methods
- e.g. MAFIA
- which cells of the grid are densely covered by the data?
- merge neighboring dense hyper-cubes (cells) to get clusters
  - there is a threshold which tells us if the hyper-cube is dense
density-based algorithms
- core point: there are at least $\tau$ points closer than $\varepsilon$
- border point: there is at least one core point closer than $\varepsilon$
- noise point: otherwise
- DBSCAN
  - connectivity graph
  - nodes = core points
  - nodes are connected if the core points are closer than $\varepsilon$
  - clusters = connected components
  - border points still belong to the corresponding clusters
  - noise points are reported as outliers
- another algorithm: DENCLUE
scalable approaches – for lots of data
- k-medoids are expensive to compute
- CLARA
  - based on partitioning around medoids (PAM) – a variant of k-medoids
    - all possible $k(n-k)$ pairs of medoids and non-medoids are tested for an exchange
    - the pair with the best improvement of the objective function is exchanged
    - $O(kn^2d)$ time per iteration for a $d$ -dimensional dataset
  - we apply the algorithm to a smaller sample set of size $fn$ $f n$ where $f\in(0,1)$ $f \in (0, 1)$
    - the remaining data points are assigned to the medoids found in the smaller sample
    - we repeat this multiple times and use the best found clustering
    - $O(kf^2n^2d+k(n-k))$ time
  - the main problem occurs when the preselected samples do not include a good choice of medoids
- CLARANS
  - k-medoids, but uses the entire dataset
  - finds several locally optimal solutions and uses the best one
  - finding a locally optimal solution
    - randomly initialize $k$ medoids
    - repeat: take a random pair of a medoid and a non-medoid, swap them if this improves the objective function
    - if a certain number of unsuccessful exchanges is reached, we consider the solution to be locally optimal
- CURE (clustering using representatives) – agglomerative hierarchical algorithm
  - sample $s$ points, divide them into $p$ equal partitions
  - cluster each partition independently using hierarchical merging to $k'$ clusters
  - perform hierarchical clustering over the $k'p$ clusters across all partitions to get the desired number of clusters ( $k$ )
  - assign each point to its closest cluster

Decision Trees

decision tree, (dis)advantages
- each inner node is labeled by an attribute
- each edge is labeled by a predicate applicable to the attribute of the parent node
- each leaf has a class label
- there’s a stopping criterion – e.g. all the training data in the node belong to the same class
- advantages: easy to implement and interpret, efficient, extract simple rules, applicable to large datasets
- disadvantages
  - difficult to process continuous data (decision trees divide the features space into rectangular regions)
  - difficult to process for missing data
  - overfitting (solved by pruning)
  - mutual correlations among the attributes are not considered
    - problem if the decision boundary looks like $y=x$ (we would need to check a value of one attribute relatively to another attribute)
top down induction of decision trees (TDIDT)
- top-down method
- algorithm (divide and conquer)
  - choose one attribute as a root of the subtree
  - divide the data according to the values of this attribute and add a node for each subset
  - repeat the steps recursively for any nodes that do contain data from at least two different classes
- how to choose the attribute
  - we use Shannon entropy $H=-\sum_c p_c\log_2p_c$ $H = - \sum_{c} p_{c} lo g_{2} p_{c}$
    - $p_c$ … probability of occurrence of class $c$ (relative frequency of class $c$ in the data)
    - by definition, $0\log_2 0=0$
    - note that $\log_2 x=\frac{\log_{10} x}{\log_{10} 2}=\frac{\log_{10} x}{0.301}$
  - we calculate $H$ for every possible value of the attribute
  - then we apply weighted average – weights correspond to the frequencies of the values of the attribute in the data
  - we choose the attribute with the minimum weighted average entropy
    - se we choose the attribute like this: $\mathrm{argmin}_A \sum_v P(A=v) H(v)$
    - where $H(v)=-\sum_c p_c\log_2 p_c$
    - $p_c=P(C=c\mid A=v)$
    - obviously, we always consider probability $P$ estimated for the subset of the data that belongs to the current node
- it is useful to perform early stopping (by the principle of Occam’s razor)
- time complexity: $O(np\log p)$ $O (n p lo g p)$
  - $n$ … number of attributes
  - $p$ … number of training data
ID3 algorithm
- boolean output (only two classes)
- greedy top-down method
- divides the nodes until there is a single class or there are no attributes (or data) left
- uses information gain = entropy reduction
  - $\mathrm{gain}(A)=H-\sum_v P(A=v)H(v)$
  - where $H$ is the entropy of the node, $H(v)$ is the entropy of the child node containing samples s.t. $A=v$
C4.5, C5.0
- modifications of the ID3 algorithm
- C4.5
  - missing attribute values are ignored during tree construction; they are estimated from the other data during prediction (probably just by using mode?)
  - continuous data – categorized according to the values in the training set
  - pruning
    - train/validation split
    - subtree is replaced by a leaf (with majority vote) if the new tree does not perform worse on the validation set than the original one
    - in a similar fashion, a whole subtree can be raised (the most frequently used one)
  - rule post-pruning
    - rules can be extracted from the tree
    - then, the rules can be generalized (pruned) and ordered by their expected accuracy (then used in this order instead of the original tree)
    - quite radical pruning
    - the rules are transparent and easy to understand
  - estimates confidence interval for accuracy of rules and trees
    - for a given rule, we can evaluate training accuracy $\hat p=$ correctly classified samples ÷ all samples processed by the rule
    - we assume binomial distribution
    - for the confidence interval computed at the 95% confidence level, the lower accuracy bound for new data will correspond to $\hat p-1.96\cdot\underbrace{\sqrt{\frac{\hat p(1-\hat p)}{N}}}_{\text{std dev}}$
      - where $N$ is the total number of samples
  - uses information gain ratio as criterion for data division
    - $\mathrm{argmax}_A\,\frac{\mathrm{gain}(A)}{-\sum_v P(A=v)\log_2 P(A=v)}$
    - in the denominator, we have SplitInformation, which is the same formula as entropy if we consider the attribute values to be “classes”
    - this penalizes excessive branching
- C5.0
  - commercial version of C4.5 for large databases
  - improved generation of rules
  - higher accuracy is achieved using boosting
    - we gradually construct an ensemble of several classifiers (trees)
    - every new tree tries to improve the accuracy of the ensemble
      - the (previously) incorrectly classified data points are assigned larger weights → the tree knows they have high priority
    - majority vote is used
classification and regression trees (algorithm CART)
- generates binary trees
  - chooses a value of an attribute it can use to split data into the two subtrees (usually it's a threshold)
- uses entropy or Gini index to choose the best decision attribute
- measure of “goodness” of the data division
  - $\phi(v|t)=2P_LP_R\sum_c |P(c|t_L)-P(c|t_R)|$ $ϕ (v ∣ t) = 2 P_{L} P_{R} \sum_{c} ∣ P (c ∣ t_{L}) - P (c ∣ t_{R}) ∣$
    - $P_L$ probability that a sample belongs to the left subtree
    - $P(c|t_L)$ probability that a sample in the left subtree belongs to class $c$
  - we want the children to be both balanced and pure
  - we use the value $v^*=\mathrm{argmax}_v\ \phi(v|t)$
- characteristic properties of CART
  - order of the attributes corresponds to their impact during classification
  - missing data is ignored
  - stopping criterion: there's no way to divide the node in order to improve the classification accuracy
  - high accuracy achieved on the training set does not have to reflect the accuracy on the test data
algorithm CHAID
- chi-square automatic interaction detection
- $\chi^2$ criterion for branching
- values of the categorical attributes are gradually grouped together (by $\chi^2$ “similarity”) → there remain only two groups
bagging
- bootstrapped aggregating
- reduces the variance of the classifier
- if we consider an ensemble of $k$ i.i.d. classifiers with variance $\sigma^2$ each, the ensemble has variance $\sigma^2/k$
- decision trees are an ideal choice for bagging – they have low bias and high variance (if they are sufficiently deep)
- bagging does not reduce bias (it may even worsen accuracy if bias is the main problem)
- to get classifiers that are i.i.d. (independent and identically distributed) to some extent, we perform bootstrapping
  - the data points are sampled uniformly from the original data with replacement
  - this way, the new dataset contains roughly $1-(1-1/n)^n\approx 1-1/e\approx 63.2\%$ distinct data points
random forests
- we use bagging
- problem: pairwise correlation of models constructed using bagging
  - if our $k$ classifiers, each with variance $\sigma^2$ , have a positive pairwise correlation $\rho$ , then the variance of the averaged prediction will be $\rho\sigma^2+\frac{(1-\rho)\sigma^2}k$
- solution: we limit the number of features that can be used for splitting
  - at each node, only a random subset of features is available
- this speeds up training due to fewer features to search over at each stage and there is no need to prune the trees
- reduced variance does not negatively affect the bias
- two parameters
  - number of features to be sampled
  - number of trees to be built
boosting
- classifiers are built one by one based on a weighted dataset
  - misclassified data points tend to have larger weights
- each new tree depends on the previous ones
  - it is trained on a dataset with different weights (it focuses on the misclassified data points)
- in the end, we get a weighted combination of the classifiers
- the ensemble has a lower overall bias
- disadvantage: outliers may hurt the overall performance of the model
random forests vs. boosting
- random forests: fast (can run in parallel, use a small set of features at each stage)
- boosting: sequential, can be expensive, but outperforms random forests