# Online Test

## Introduction

- data mining from databases
	- non-trivial process of gaining implicit, previously not known, but potentially useful information from the data
	- originated in the 90s (there was not enough data before)
	- knowledge discovery in databases (KDD)
	- data mining (DM) – business intelligence (BI) and big data
- foundations
	- artificial intelligence, machine learning methods
	- database systems (to store large data sets), information retrieval
	- statistics – modeling and analysis of dependencies found in the data
	- \+ how to use the results for decision-making
- data mining is an interactive and iterative process
	- data preparation – we build one table containing all the relevant data
		- selection (of useful attributes), preprocessing, transformation
	- the actual “data mining” – we find *patterns* in the data
	- interpretation – found knowledge shall be evaluated from the point of view of the end user (manager, customer, etc.)
- PoV of a manager
	- there’s a topical issue
	- goal of the data mining process is to obtain as much information as possible that is relevant to solving the problem
	- example
		- find groups of customers of a department store to offer special services to
		- the found groups can be interpreted as segments in the given market area
	- steps
		- form a team: data analyst, domain expert, expert on databases, …
		- specify the problem
		- obtain all data available
			- we should also obtain the external data describing the environment of the analyzed processes (time period of the year, advertising, political issues, weather, …)
		- select the methods
			- clustering, classification, exploratory data analysis, association rules, decision trees, genetic algorithms, Bayesian networks, neural networks
			- visualization methods – helpful for presentation
		- preprocess the data
		- mine the data
		- interpret the results
			- we may need to create an analytical report
			- make the results easy to understand
			- the output can also mean to carry out a reasonable action
- tasks
	- classification and prediction
		- goal: predict a continuous or discrete value based on some attributes
		- interpretation may be challenging
		- prediction: weather forecast, stock prices, …
		- we should be able to cover the entire domain (all the data may be useful for a reasonable prediction)
	- description
		- goal: find a dominant structure or relationships
		- we may ignore some of the information; the extracted knowledge does not need to be that precise (but it should be easily understandable)
	- looking for “nuggets”
		- goal: find some interesting knowledge (does not have to fully cover the given concept)
- real tasks (examples)
	- segmentation and classification of bank clients
	- causes of failures in telecommunication networks
	- causes of change of service provider
	- prediction of power consumption
	- analysis of the patient database in a hospital
		- Florence Nightingale
		- Ignaz Semmelweis
	- market basket analysis

## Methodologies

- goal, popular methodologies
	- goal: provide the users with a unified framework; guide data mining applications regardless of industry
	- it’s necessary to have high-quality data; the steps are usually iterative
	- popular methodologies: 5A, SEMMA, CRISPR-DM
- SEMMA
	- sample – select data for modeling
		- may include sampling, imputation (adding other useful information, e.g. adding seasons of the year to the data about the sales), partitioning (train-test-validation split)
	- explore – visual exploration and dimensionality reduction
	- modify – prepare the objects, values, and variables for data modeling; transform the data
	- model – apply data mining techniques (decision trees, regression models, NNs, …)
		- create models providing relevant outcome
	- assess – evaluate the results of modeling (assess their reliability and usefulness)
		- so that the manager can understand
- CRISP-DM
	- cross-industry standard process for data mining; a robust general-purpose model
		- software-independent
	- 6 phases (their order is not strict)
		- business understanding
			- determine our business objective
			- assess our present situation, what resources and data we have
			- risk assessment
			- setting KPIs
		- data understanding
			- collect and describe initial data; explore and visualize it
			- verify the quality of the data
		- data preparation
			- cleaning, integration (merging), aggregation, …
			- make the dataset ready for further analysis and modeling
		- modeling
			- select the modeling technique
			- generate test design
			- build the model
			- assess the model
		- evaluation
			- evaluate the results – from the point of view of the manager
			- review the process
			- determine the next steps
		- deployment
			- results should be presented in a simple and comprehensible way
			- plan the deployment (what steps should be done)
			- final report, review the project
				- to preserve knowledge
	- ciritique
		- little support for project management
		- what to do if there are problems?
		- → IBM released Analytics Solutions Unified Method for Data Mining (ASUM)
- ASUM
	- phases
		1. analyze – define the needs (desired features, performance, usability, …), obtain agreement
		2. design – define solution components, identify resources, clarify requirements
		3. configure & build – including testing and validation
		4. deploy – create a plan to run and maintain the solution
		5. operate & optimize – preserve the health
		- \+ project management – there are processes that help to monitor the progress and maintain the project
	- benefits
		- minimized risk
		- scalable and enterprise-ready
		- comprehensive
		- product-specific implementation roadmaps
		- users can replicate previous software implementations

## Data Analysis

- types of attributes
	- interval-scaled attributes – their values are real number following a linear scale
	- ratio-scaled attributes – follow an exponential scale → need to be log-transformed ($\log x_i$) before treating them like interval-scaled attributes
	- a nominal attribute can be transformed to a series of binary attributes (one-hot encoding)
	- ordinal attributes are sometimes treated as interval-scaled attributes
- data standardization of interval-scaled attributes
	- decimal scaling – divide the data by the smallest power of 10 to get them to the interval $[-1,1]$ (while maintaining the mutual relations)
	- range standardization
		- to $[0,1]$ … $\frac{x-\mathrm{min}}{\mathrm{max}-\mathrm{min}}$
		- to $[-1,1]$ … $2\cdot \frac{x-\mathrm{min}}{\mathrm{max}-\mathrm{min}}-1$
	- Z-score standardization … $\frac{x-\mu}{\sigma}$
		- $\mu$ … mean
		- $\sigma$ … corrected sample standard deviation
			- $\sigma=\sqrt{\frac1{n-1}\sum_{i=1}^n(x_i-\mu)^2}$
		- or we can use the *corrected* sample standard deviation
	- standardization according to the mean absolute difference
		- $\frac{x-\mu}{s}$
		- where $s=\frac1n\sum_{i=1}^n|x_i-\mu|$
- range, quartiles, outliers
	- range = difference between the highest and the lowest value in the set
	- quartiles – values $Q_1,Q_2,Q_3$
		- 25% of the observations below $Q_1$ (lower quartile)
		- 50% below $Q_2$ (median)
		- 75% below $Q_3$ (upper quartile)
	- interquartile range $IQR=Q_3-Q_1$
	- formulas (for discrete data)
		- $Q_1=x_{\lceil n/4 \rceil}$
		- $Q_3=x_{\lceil 3n/4 \rceil}$
		- but if $x_{n/4}$ or $x_{3n/4}$ exists ($n$ is divisible by 4), we need to average this value with the next one $($$x_{n/4+1}$ or $x_{3n/4+1}$)
	- outlier is any value $\notin[Q_1-1.5\cdot IQR,\ Q_3+1.5\cdot IQR]$
	- box plot
		- box … between $Q_1$ and $Q_3$ with a mark (line) representing the median
		- two other lines … boundaries for outliers (or the extrema that are not outliers)
		- other marks outside … individual outliers
- contingency table
	- relationship between two categorical quantities (may be binary)
	- cell … how many observations have this combination of attributes
	- there are also column and row totals
	- we get an expected frequency of combination if we multiply the column total with the row total and divide it by the total number of observations
		- intuition: we take the number of elements in the column and multiply it by the (observed) probability that any sample belongs to the given row (has this value of the attribute)
- $\chi^2$-test
	- $\chi^2=\sum_k\sum_\ell\frac{(a_{k\ell}-e_{k\ell})^2}{e_{k\ell}}$
		- where $a_{k\ell}$ is the value of the cell (number of observation with this combination of attributes)
		- $e_{k\ell}$ is the expected frequency as described above: $e_{k\ell}=\frac{r_k\cdot s_\ell}n$
	- $\chi^2$ distribution has $(R-1)(S-1)$ degrees of freedom (the last row and column are somehow determined by the rest of the table)
	- we choose a level of significance $\alpha$ (usually 0.05) = probability of rejecting a true null hypothesis
	- we reject $H_0$ if our $\chi^2$-statistic is greater than the value of the $\chi^2$ distribution (for the given DoF and $\alpha$)
	- it can be used only for large-enough frequencies – when $\forall k,\ell:e_{k\ell}\geq 5$
		- if the frequencies are too low and the contingency table contains only 4 fields, we can use Fisher’s test
- Fisher’s test
	- for a contingency table with only 4 fields (two binary attributes)
	- one-sided or two-sided version
	- we get the $p$-value directly: what is the probability of this or more extreme table if we assume fixed column and row totals?
		- for the two-sided version, we need to compute the expected frequencies $e_{k\ell}$
		- probability of this table: $$p=\frac{r_1!\cdot r_2!\cdot s_1!\cdot s_2!}{n!\cdot a!\cdot b!\cdot c!\cdot d!}$$
			- $a,b,c,d$ … values of the cells
			- $r_i,s_j$ … row and column totals respectively
		- probability of some more extreme table $T_i$ – we consider values $a-i,\ b+i,\ c+i,\ d-i$ instead
		- the probabilities of the tables need to be summed up to get the $p$-value
	- null hypothesis: independence of the attributes
		- we reject for $p$-value $\leq \alpha$

## Regression Analysis

- motivation
	- it may be costly to obtain output values
	- input values are known earlier than the output (we want to estimate the output)
	- controlling input variables may lead to the desired behavior of the output
	- we want to find a (causal) relationship between the input and the output
- linear regression
	- what are the parameters of the linear relationship between two numerical quantities?
	- $y=\beta x+\alpha \textcolor{gray}{+ \varepsilon}$
	- least squares method
		- find parameters $\alpha,\beta$
		- we minimize $SSE=\sum_i(y_i-f(x_i))^2$ (sum of squared errors)
		- set partial derivatives of SSE equal to zero, we get two equations:
			- $n\alpha+\beta\sum_i x_i=\sum_i y_i$
				- or $\alpha=\bar y-\beta\bar x$
			- $\alpha\sum_i x_i+\beta\sum_i x_i^2=\sum_i x_iy_i$
		- so $(\bar y-\beta\bar x)\sum_i x_i+\beta\sum_i x_i^2=\sum x_iy_i$
		- we get $\beta=\frac{\sum_i (x_i-\bar x)(y_i-\bar y)}{\sum_i(x_i-\bar x)^2}$
			- and $\alpha=\bar y-\beta\bar x$
- correlation analysis
	- is there a linear relationship between two numerical quantities?
	- correlation coefficient $\rho(x,y)=\frac{\text{cov}(x,y)}{\sigma_x\sigma_y}\in[-1,1]$
		- corrected sample covariance $\text{cov}(x,y)=\frac{1}{n-1}\sum_i(x_i-\bar x)(y_i-\bar y)$
		- corrected sample standard deviation $\sigma_x=\sqrt{\frac1{n-1}\sum_i(x_i-\bar x)^2}$
			- similar for $\sigma_y$
- multi-dimensional regression
	- linear → least-squares method in matrix form
		- $y=X\beta$
		- $X$ … matrix of input samples
			- before the $x$ values of the sample, there is $1$ (so that we don’t need to keep the constant $\alpha$)
		- so $\beta=(X^TX)^{-1}(X^Ty)$
		- may be computationally expensive → approximate solutions
	- non-linear – example: logistic regression
		- if the dependent variable is categorical
		- we use log (conditional) odds to model the conditional probability
			- $s_i=\ln\frac{P}{1-P}$
			- so $P=\frac 1{1+e^{-s_i}}$ (sigmoid)
			- where $s_i=\alpha+\beta x_i=\alpha+\sum_j\beta_j x_{ij}$
		- we estimate the parameters using *maximum likelihood*
			- $L=\prod_i P(y_i\mid x_i)=\prod_i p_i^{y_i}(1-p_i)^{1-y_i}$
				- $p_i=\sigma(s_i)=\frac1{1+e^{-\alpha-\beta x_i}}$
				- note that $1-p_i=1-\sigma(s_i)=\sigma(-s_i)$
			- we minimize negative log-likelihood (= cross-entropy loss)
				- $E=-\sum_i(y_i\log p_i+(1-y_i)\log(1-p_i))$
		- gradient descent is then applied like this
			- $\beta_j\leftarrow\beta_j-\eta\sum_i(\sigma(\alpha+\beta x_i)-y_i)x_{ij}$
			- $\alpha\leftarrow\alpha-\eta\sum_i(\sigma(\alpha+\beta x_i)-y_i)$
			- $\eta$ … learning rate
			- note that $\sigma'(s_i)=\sigma(s_i)(1-\sigma(s_i))$
- discriminant analysis
	- classification into classes
	- we consider a discriminant function $f_t$ for every class $c_t$
		- conditional (a posteriori) probability of classifying $x$ to the class
	- for two classes, we get $f(x)=f_1(x)-f_2(x)=P(x|c_1)P(c_1)-P(x|c_2)P(c_2)$
		- (there should be a normalization constant $\alpha$ maybe)
	- for normal distributions, we only need to estimate the means and the covariance matrices to get the discriminant functions
		- the decision boundary is located where the distribution curves intersect

## Cluster Analysis

- cluster analysis, assumptions
	- we want to divide the observed samples into groups of mutually similar ones
	- assumption: we can measure the distance between samples (or between clusters)
	- distance between two samples
		- Minkovski metrics $L_z(x_1,x_2)=\sqrt[z]{\sum_j (x_{1j}-x_{2j})^z}$
			- Manhattan distance … $L_1$
			- Euclidean distance … $L_2$
			- Chebyshev distance … $L_\infty$
		- Hamming distance – number of positions at which $x_1$ and $x_2$ differ
		- Mahalanobis distance – useful if the quantities have different variance
			- uses the covariance matrix
			- $d(x_1,x_2)=(x_1-x_2)^TS^{-1}(x_1-x_2)$
	- distance between two clusters
		- method of the nearest neighbor
		- method of the farthest neighbor
		- method of average distance (we compute the mean distance of two arbitrary samples – one from each cluster)
		- centroid method (distance between the centers of the clusters)
- centroid
	- compute the average over all the features
	- but such point may not be in the data at all
- k-means clustering
	- Lloyd
		- divide the points randomly into $k$ clusters
		- repeat until convergence
			- compute the centroids
			- assign each point to its closest centroid
	- alternative initializations
		- pronounce the first $k$ points to be centroids
		- k-means++ … sample $k$ points (to be used as centroids) with probability proportional to their squared distance to the closest existing centroid
		- LocalSearch++ … start with k-means++, then sample some more points (in the same way) one by one and replace some of the existing centroids if they cover the distribution (points) better
- k-medians
	- just use Manhattan distance instead of Euclidean
	- → centroid = median (independently along each dimension)
		- the representative may not exist in the dataset
	- median is less sensitive to the presence of outliers than the mean
- hierarchical clustering
	- bottom-up approach
	- we get a dendrogram
		- a clustering corresponds to some “cut” in the dendrogram
	- algorithm
		- initialization: each point is its own cluster
		- repeat: merge the two closest clusters
- learning vector quantization (LVQ)
	- for *supervised* clustering
	- we consider several weight vectors (representatives)
		- each weight vector corresponds to a single class
	- first, we need to initialize the weight vectors
	- then, for every sample, we update the closest weight vector
		- if the class of the vector corresponds to the class of the sample, we move the vector closer to the sample: $w\leftarrow w+\mu(x-w)$
		- otherwise, we move it further from the sample: $w\leftarrow w-\mu(x-w)$
	- $\mu$ … learning rate (usually some constant divided by the number of the epoch)
	- no guarantee of convergence
- k-medoids
	- uses a similarity measure instead of averaging
	- we try to find $k$ best representatives $Y_1,\dots,Y_k$ (from the dataset)
		- so the medoids are real data points
	- objective function: $O=\sum_{i=1}^n(\min_j\mathrm{Dist}(X_i,Y_j))$
	- more robust to outliers than k-means
	- sometimes, we cannot easily compute average (or it does not make sense)
	- possible strategy
		- randomly select $r$ pairs $(X_i,Y_j)$ for a possible exchange
		- the pair with the best improvement of the objective function is exchanged
		- requires time proportional to $rn$
- grid-based methods
	- e.g. MAFIA
	- which cells of the grid are densely covered by the data?
	- merge neighboring *dense* hyper-cubes (cells) to get clusters
		- there is a threshold which tells us if the hyper-cube is dense
- density-based algorithms
	- core point: there are at least $\tau$ points closer than $\varepsilon$
	- border point: there is at least one core point closer than $\varepsilon$
	- noise point: otherwise
	- DBSCAN
		- connectivity graph
		- nodes = core points
		- nodes are connected if the core points are closer than $\varepsilon$
		- clusters = connected components
		- border points still belong to the corresponding clusters
		- noise points are reported as outliers
	- another algorithm: DENCLUE
- scalable approaches – for lots of data
	- k-medoids are expensive to compute
	- CLARA
		- based on *partitioning around medoids* (PAM) – a variant of k-medoids
			- all possible $k(n-k)$ pairs of medoids and non-medoids are tested for an exchange
			- the pair with the best improvement of the objective function is exchanged
			- $O(kn^2d)$ time per iteration for a $d$-dimensional dataset
		- we apply the algorithm to a smaller sample set of size $fn$ where $f\in(0,1)$
			- the remaining data points are assigned to the medoids found in the smaller sample
			- we repeat this multiple times and use the best found clustering
			- $O(kf^2n^2d+k(n-k))$ time
		- the main problem occurs when the preselected samples do not include a good choice of medoids
	- CLARANS
		- k-medoids, but uses the entire dataset
		- finds several locally optimal solutions and uses the best one
		- finding a locally optimal solution
			- randomly initialize $k$ medoids
			- repeat: take a random pair of a medoid and a non-medoid, swap them if this improves the objective function
			- if a certain number of unsuccessful exchanges is reached, we consider the solution to be locally optimal
	- CURE (clustering using representatives) – agglomerative hierarchical algorithm
		- sample $s$ points, divide them into $p$ equal partitions
		- cluster each partition independently using hierarchical merging to $k'$ clusters
		- perform hierarchical clustering over the $k'p$ clusters across all partitions to get the desired number of clusters ($k$)
		- assign each point to its closest cluster

## Decision Trees

- decision tree, (dis)advantages
	- each inner node is labeled by an attribute
	- each edge is labeled by a predicate applicable to the attribute of the parent node
	- each leaf has a class label
	- there’s a stopping criterion – e.g. all the training data in the node belong to the same class
	- advantages: easy to implement and interpret, efficient, extract simple rules, applicable to large datasets
	- disadvantages
		- difficult to process continuous data (decision trees divide the features space into rectangular regions)
		- difficult to process for missing data
		- overfitting (solved by pruning)
		- mutual correlations among the attributes are not considered
			- problem if the decision boundary looks like $y=x$ (we would need to check a value of one attribute relatively to another attribute)
- top down induction of decision trees (TDIDT)
	- top-down method
	- algorithm (divide and conquer)
		- choose one attribute as a root of the subtree
		- divide the data according to the values of this attribute and add a node for each subset
		- repeat the steps recursively for any nodes that do contain data from at least two different classes
	- how to choose the attribute
		- we use Shannon entropy $H=-\sum_c p_c\log_2p_c$
			- $p_c$ … probability of occurrence of class $c$ (relative frequency of class $c$ in the data)
			- by definition, $0\log_2 0=0$
			- note that $\log_2 x=\frac{\log_{10} x}{\log_{10} 2}=\frac{\log_{10} x}{0.301}$
		- we calculate $H$ for every possible value of the attribute
		- then we apply weighted average – weights correspond to the frequencies of the values of the attribute in the data
		- we choose the attribute with the minimum weighted average entropy
			- se we choose the attribute like this: $\mathrm{argmin}_A \sum_v P(A=v) H(v)$
			- where $H(v)=-\sum_c p_c\log_2 p_c$
			- $p_c=P(C=c\mid A=v)$
			- obviously, we always consider probability $P$ estimated for the subset of the data that belongs to the current node
	- it is useful to perform early stopping (by the principle of Occam’s razor)
	- time complexity: $O(np\log p)$
		- $n$ … number of attributes
		- $p$ … number of training data
- ID3 algorithm
	- boolean output (only two classes)
	- greedy top-down method
	- divides the nodes until there is a single class or there are no attributes (or data) left
	- uses information gain = entropy reduction
		- $\mathrm{gain}(A)=H-\sum_v P(A=v)H(v)$
		- where $H$ is the entropy of the node, $H(v)$ is the entropy of the child node containing samples s.t. $A=v$
- C4.5, C5.0
	- modifications of the ID3 algorithm
	- C4.5
		- missing attribute values are ignored during tree construction; they are estimated from the other data during prediction (probably just by using mode?)
		- continuous data – categorized according to the values in the training set
		- pruning
			- train/validation split
			- subtree is replaced by a leaf (with majority vote) if the new tree does not perform worse on the validation set than the original one
			- in a similar fashion, a whole subtree can be raised (the most frequently used one)
		- rule post-pruning
			- rules can be extracted from the tree
			- then, the rules can be generalized (pruned) and ordered by their expected accuracy (then used in this order instead of the original tree)
			- quite radical pruning
			- the rules are transparent and easy to understand
		- estimates confidence interval for accuracy of rules and trees
			- for a given rule, we can evaluate training accuracy $\hat p=$ correctly classified samples ÷ all samples processed by the rule
			- we assume binomial distribution
			- for the confidence interval computed at the 95% confidence level, the lower accuracy bound for new data will correspond to $$\hat p-1.96\cdot\underbrace{\sqrt{\frac{\hat p(1-\hat p)}{N}}}_{\text{std dev}}$$
				- where $N$ is the total number of samples
		- uses information gain ratio as criterion for data division
			- $\mathrm{argmax}_A\,\frac{\mathrm{gain}(A)}{-\sum_v P(A=v)\log_2 P(A=v)}$
			- in the denominator, we have *SplitInformation*, which is the same formula as entropy if we consider the attribute values to be “classes”
			- this penalizes excessive branching
	- C5.0
		- commercial version of C4.5 for large databases
		- improved generation of rules
		- higher accuracy is achieved using *boosting*
			- we gradually construct an ensemble of several classifiers (trees)
			- every new tree tries to improve the accuracy of the ensemble
				- the (previously) incorrectly classified data points are assigned larger weights → the tree knows they have high priority
			- majority vote is used
- classification and regression trees (algorithm CART)
	- generates *binary* trees
		- chooses a value of an attribute it can use to split data into the two subtrees (usually it's a threshold)
	- uses entropy or Gini index to choose the best decision attribute
	- measure of “goodness” of the data division
		- $\phi(v|t)=2P_LP_R\sum_c |P(c|t_L)-P(c|t_R)|$
			- $P_L$ probability that a sample belongs to the left subtree
			- $P(c|t_L)$ probability that a sample in the left subtree belongs to class $c$
		- we want the children to be both balanced and pure
		- we use the value $v^*=\mathrm{argmax}_v\ \phi(v|t)$
	- characteristic properties of CART
		- order of the attributes corresponds to their impact during classification
		- missing data is ignored
		- stopping criterion: there's no way to divide the node in order to improve the classification accuracy
		- high accuracy achieved on the training set does not have to reflect the accuracy on the test data
- algorithm CHAID
	- chi-square automatic interaction detection
	- $\chi^2$ criterion for branching
	- values of the categorical attributes are gradually grouped together (by $\chi^2$ “similarity”) → there remain only two groups
- bagging
	- bootstrapped aggregating
	- reduces the variance of the classifier
	- if we consider an ensemble of $k$ i.i.d. classifiers with variance $\sigma^2$ each, the ensemble has variance $\sigma^2/k$
	- decision trees are an ideal choice for bagging – they have low bias and high variance (if they are sufficiently deep)
	- bagging does not reduce bias (it may even worsen accuracy if bias is the main problem)
	- to get classifiers that are i.i.d. (independent and identically distributed) to some extent, we perform *bootstrapping*
		- the data points are sampled uniformly from the original data with replacement
		- this way, the new dataset contains roughly $1-(1-1/n)^n\approx 1-1/e\approx 63.2\%$ distinct data points
- random forests
	- *we use bagging*
	- problem: pairwise correlation of models constructed using bagging
		- if our $k$ classifiers, each with variance $\sigma^2$, have a positive pairwise correlation $\rho$, then the variance of the averaged prediction will be $\rho\sigma^2+\frac{(1-\rho)\sigma^2}k$
	- solution: we limit the number of features that can be used for splitting
		- at each node, only a random subset of features is available
	- this speeds up training due to fewer features to search over at each stage and there is no need to prune the trees
	- reduced variance does not negatively affect the bias
	- two parameters
		- number of features to be sampled
		- number of trees to be built
- boosting
	- classifiers are built one by one based on a weighted dataset
		- misclassified data points tend to have larger weights
	- each new tree depends on the previous ones
		- it is trained on a dataset with different weights (it focuses on the misclassified data points)
	- in the end, we get a weighted combination of the classifiers
	- the ensemble has a lower overall bias
	- disadvantage: outliers may hurt the overall performance of the model
- random forests vs. boosting
	- random forests: fast (can run in parallel, use a small set of features at each stage)
	- boosting: sequential, can be expensive, but outperforms random forests