# Online Test ## Introduction - data mining from databases - non-trivial process of gaining implicit, previously not known, but potentially useful information from the data - originated in the 90s (there was not enough data before) - knowledge discovery in databases (KDD) - data mining (DM) – business intelligence (BI) and big data - foundations - artificial intelligence, machine learning methods - database systems (to store large data sets), information retrieval - statistics – modeling and analysis of dependencies found in the data - \+ how to use the results for decision-making - data mining is an interactive and iterative process - data preparation – we build one table containing all the relevant data - selection (of useful attributes), preprocessing, transformation - the actual “data mining” – we find *patterns* in the data - interpretation – found knowledge shall be evaluated from the point of view of the end user (manager, customer, etc.) - PoV of a manager - there’s a topical issue - goal of the data mining process is to obtain as much information as possible that is relevant to solving the problem - example - find groups of customers of a department store to offer special services to - the found groups can be interpreted as segments in the given market area - steps - form a team: data analyst, domain expert, expert on databases, … - specify the problem - obtain all data available - we should also obtain the external data describing the environment of the analyzed processes (time period of the year, advertising, political issues, weather, …) - select the methods - clustering, classification, exploratory data analysis, association rules, decision trees, genetic algorithms, Bayesian networks, neural networks - visualization methods – helpful for presentation - preprocess the data - mine the data - interpret the results - we may need to create an analytical report - make the results easy to understand - the output can also mean to carry out a reasonable action - tasks - classification and prediction - goal: predict a continuous or discrete value based on some attributes - interpretation may be challenging - prediction: weather forecast, stock prices, … - we should be able to cover the entire domain (all the data may be useful for a reasonable prediction) - description - goal: find a dominant structure or relationships - we may ignore some of the information; the extracted knowledge does not need to be that precise (but it should be easily understandable) - looking for “nuggets” - goal: find some interesting knowledge (does not have to fully cover the given concept) - real tasks (examples) - segmentation and classification of bank clients - causes of failures in telecommunication networks - causes of change of service provider - prediction of power consumption - analysis of the patient database in a hospital - Florence Nightingale - Ignaz Semmelweis - market basket analysis ## Methodologies - goal, popular methodologies - goal: provide the users with a unified framework; guide data mining applications regardless of industry - it’s necessary to have high-quality data; the steps are usually iterative - popular methodologies: 5A, SEMMA, CRISPR-DM - SEMMA - sample – select data for modeling - may include sampling, imputation (adding other useful information, e.g. adding seasons of the year to the data about the sales), partitioning (train-test-validation split) - explore – visual exploration and dimensionality reduction - modify – prepare the objects, values, and variables for data modeling; transform the data - model – apply data mining techniques (decision trees, regression models, NNs, …) - create models providing relevant outcome - assess – evaluate the results of modeling (assess their reliability and usefulness) - so that the manager can understand - CRISP-DM - cross-industry standard process for data mining; a robust general-purpose model - software-independent - 6 phases (their order is not strict) - business understanding - determine our business objective - assess our present situation, what resources and data we have - risk assessment - setting KPIs - data understanding - collect and describe initial data; explore and visualize it - verify the quality of the data - data preparation - cleaning, integration (merging), aggregation, … - make the dataset ready for further analysis and modeling - modeling - select the modeling technique - generate test design - build the model - assess the model - evaluation - evaluate the results – from the point of view of the manager - review the process - determine the next steps - deployment - results should be presented in a simple and comprehensible way - plan the deployment (what steps should be done) - final report, review the project - to preserve knowledge - ciritique - little support for project management - what to do if there are problems? - → IBM released Analytics Solutions Unified Method for Data Mining (ASUM) - ASUM - phases 1. analyze – define the needs (desired features, performance, usability, …), obtain agreement 2. design – define solution components, identify resources, clarify requirements 3. configure & build – including testing and validation 4. deploy – create a plan to run and maintain the solution 5. operate & optimize – preserve the health - \+ project management – there are processes that help to monitor the progress and maintain the project - benefits - minimized risk - scalable and enterprise-ready - comprehensive - product-specific implementation roadmaps - users can replicate previous software implementations ## Data Analysis - types of attributes - interval-scaled attributes – their values are real number following a linear scale - ratio-scaled attributes – follow an exponential scale → need to be log-transformed ($\log x_i$) before treating them like interval-scaled attributes - a nominal attribute can be transformed to a series of binary attributes (one-hot encoding) - ordinal attributes are sometimes treated as interval-scaled attributes - data standardization of interval-scaled attributes - decimal scaling – divide the data by the smallest power of 10 to get them to the interval $[-1,1]$ (while maintaining the mutual relations) - range standardization - to $[0,1]$ … $\frac{x-\mathrm{min}}{\mathrm{max}-\mathrm{min}}$ - to $[-1,1]$ … $2\cdot \frac{x-\mathrm{min}}{\mathrm{max}-\mathrm{min}}-1$ - Z-score standardization … $\frac{x-\mu}{\sigma}$ - $\mu$ … mean - $\sigma$ … corrected sample standard deviation - $\sigma=\sqrt{\frac1{n-1}\sum_{i=1}^n(x_i-\mu)^2}$ - or we can use the *corrected* sample standard deviation - standardization according to the mean absolute difference - $\frac{x-\mu}{s}$ - where $s=\frac1n\sum_{i=1}^n|x_i-\mu|$ - range, quartiles, outliers - range = difference between the highest and the lowest value in the set - quartiles – values $Q_1,Q_2,Q_3$ - 25% of the observations below $Q_1$ (lower quartile) - 50% below $Q_2$ (median) - 75% below $Q_3$ (upper quartile) - interquartile range $IQR=Q_3-Q_1$ - formulas (for discrete data) - $Q_1=x_{\lceil n/4 \rceil}$ - $Q_3=x_{\lceil 3n/4 \rceil}$ - but if $x_{n/4}$ or $x_{3n/4}$ exists ($n$ is divisible by 4), we need to average this value with the next one $($$x_{n/4+1}$ or $x_{3n/4+1}$) - outlier is any value $\notin[Q_1-1.5\cdot IQR,\ Q_3+1.5\cdot IQR]$ - box plot - box … between $Q_1$ and $Q_3$ with a mark (line) representing the median - two other lines … boundaries for outliers (or the extrema that are not outliers) - other marks outside … individual outliers - contingency table - relationship between two categorical quantities (may be binary) - cell … how many observations have this combination of attributes - there are also column and row totals - we get an expected frequency of combination if we multiply the column total with the row total and divide it by the total number of observations - intuition: we take the number of elements in the column and multiply it by the (observed) probability that any sample belongs to the given row (has this value of the attribute) - $\chi^2$-test - $\chi^2=\sum_k\sum_\ell\frac{(a_{k\ell}-e_{k\ell})^2}{e_{k\ell}}$ - where $a_{k\ell}$ is the value of the cell (number of observation with this combination of attributes) - $e_{k\ell}$ is the expected frequency as described above: $e_{k\ell}=\frac{r_k\cdot s_\ell}n$ - $\chi^2$ distribution has $(R-1)(S-1)$ degrees of freedom (the last row and column are somehow determined by the rest of the table) - we choose a level of significance $\alpha$ (usually 0.05) = probability of rejecting a true null hypothesis - we reject $H_0$ if our $\chi^2$-statistic is greater than the value of the $\chi^2$ distribution (for the given DoF and $\alpha$) - it can be used only for large-enough frequencies – when $\forall k,\ell:e_{k\ell}\geq 5$ - if the frequencies are too low and the contingency table contains only 4 fields, we can use Fisher’s test - Fisher’s test - for a contingency table with only 4 fields (two binary attributes) - one-sided or two-sided version - we get the $p$-value directly: what is the probability of this or more extreme table if we assume fixed column and row totals? - for the two-sided version, we need to compute the expected frequencies $e_{k\ell}$ - probability of this table: $$p=\frac{r_1!\cdot r_2!\cdot s_1!\cdot s_2!}{n!\cdot a!\cdot b!\cdot c!\cdot d!}$$ - $a,b,c,d$ … values of the cells - $r_i,s_j$ … row and column totals respectively - probability of some more extreme table $T_i$ – we consider values $a-i,\ b+i,\ c+i,\ d-i$ instead - the probabilities of the tables need to be summed up to get the $p$-value - null hypothesis: independence of the attributes - we reject for $p$-value $\leq \alpha$ ## Regression Analysis - motivation - it may be costly to obtain output values - input values are known earlier than the output (we want to estimate the output) - controlling input variables may lead to the desired behavior of the output - we want to find a (causal) relationship between the input and the output - linear regression - what are the parameters of the linear relationship between two numerical quantities? - $y=\beta x+\alpha \textcolor{gray}{+ \varepsilon}$ - least squares method - find parameters $\alpha,\beta$ - we minimize $SSE=\sum_i(y_i-f(x_i))^2$ (sum of squared errors) - set partial derivatives of SSE equal to zero, we get two equations: - $n\alpha+\beta\sum_i x_i=\sum_i y_i$ - or $\alpha=\bar y-\beta\bar x$ - $\alpha\sum_i x_i+\beta\sum_i x_i^2=\sum_i x_iy_i$ - so $(\bar y-\beta\bar x)\sum_i x_i+\beta\sum_i x_i^2=\sum x_iy_i$ - we get $\beta=\frac{\sum_i (x_i-\bar x)(y_i-\bar y)}{\sum_i(x_i-\bar x)^2}$ - and $\alpha=\bar y-\beta\bar x$ - correlation analysis - is there a linear relationship between two numerical quantities? - correlation coefficient $\rho(x,y)=\frac{\text{cov}(x,y)}{\sigma_x\sigma_y}\in[-1,1]$ - corrected sample covariance $\text{cov}(x,y)=\frac{1}{n-1}\sum_i(x_i-\bar x)(y_i-\bar y)$ - corrected sample standard deviation $\sigma_x=\sqrt{\frac1{n-1}\sum_i(x_i-\bar x)^2}$ - similar for $\sigma_y$ - multi-dimensional regression - linear → least-squares method in matrix form - $y=X\beta$ - $X$ … matrix of input samples - before the $x$ values of the sample, there is $1$ (so that we don’t need to keep the constant $\alpha$) - so $\beta=(X^TX)^{-1}(X^Ty)$ - may be computationally expensive → approximate solutions - non-linear – example: logistic regression - if the dependent variable is categorical - we use log (conditional) odds to model the conditional probability - $s_i=\ln\frac{P}{1-P}$ - so $P=\frac 1{1+e^{-s_i}}$ (sigmoid) - where $s_i=\alpha+\beta x_i=\alpha+\sum_j\beta_j x_{ij}$ - we estimate the parameters using *maximum likelihood* - $L=\prod_i P(y_i\mid x_i)=\prod_i p_i^{y_i}(1-p_i)^{1-y_i}$ - $p_i=\sigma(s_i)=\frac1{1+e^{-\alpha-\beta x_i}}$ - note that $1-p_i=1-\sigma(s_i)=\sigma(-s_i)$ - we minimize negative log-likelihood (= cross-entropy loss) - $E=-\sum_i(y_i\log p_i+(1-y_i)\log(1-p_i))$ - gradient descent is then applied like this - $\beta_j\leftarrow\beta_j-\eta\sum_i(\sigma(\alpha+\beta x_i)-y_i)x_{ij}$ - $\alpha\leftarrow\alpha-\eta\sum_i(\sigma(\alpha+\beta x_i)-y_i)$ - $\eta$ … learning rate - note that $\sigma'(s_i)=\sigma(s_i)(1-\sigma(s_i))$ - discriminant analysis - classification into classes - we consider a discriminant function $f_t$ for every class $c_t$ - conditional (a posteriori) probability of classifying $x$ to the class - for two classes, we get $f(x)=f_1(x)-f_2(x)=P(x|c_1)P(c_1)-P(x|c_2)P(c_2)$ - (there should be a normalization constant $\alpha$ maybe) - for normal distributions, we only need to estimate the means and the covariance matrices to get the discriminant functions - the decision boundary is located where the distribution curves intersect ## Cluster Analysis - cluster analysis, assumptions - we want to divide the observed samples into groups of mutually similar ones - assumption: we can measure the distance between samples (or between clusters) - distance between two samples - Minkovski metrics $L_z(x_1,x_2)=\sqrt[z]{\sum_j (x_{1j}-x_{2j})^z}$ - Manhattan distance … $L_1$ - Euclidean distance … $L_2$ - Chebyshev distance … $L_\infty$ - Hamming distance – number of positions at which $x_1$ and $x_2$ differ - Mahalanobis distance – useful if the quantities have different variance - uses the covariance matrix - $d(x_1,x_2)=(x_1-x_2)^TS^{-1}(x_1-x_2)$ - distance between two clusters - method of the nearest neighbor - method of the farthest neighbor - method of average distance (we compute the mean distance of two arbitrary samples – one from each cluster) - centroid method (distance between the centers of the clusters) - centroid - compute the average over all the features - but such point may not be in the data at all - k-means clustering - Lloyd - divide the points randomly into $k$ clusters - repeat until convergence - compute the centroids - assign each point to its closest centroid - alternative initializations - pronounce the first $k$ points to be centroids - k-means++ … sample $k$ points (to be used as centroids) with probability proportional to their squared distance to the closest existing centroid - LocalSearch++ … start with k-means++, then sample some more points (in the same way) one by one and replace some of the existing centroids if they cover the distribution (points) better - k-medians - just use Manhattan distance instead of Euclidean - → centroid = median (independently along each dimension) - the representative may not exist in the dataset - median is less sensitive to the presence of outliers than the mean - hierarchical clustering - bottom-up approach - we get a dendrogram - a clustering corresponds to some “cut” in the dendrogram - algorithm - initialization: each point is its own cluster - repeat: merge the two closest clusters - learning vector quantization (LVQ) - for *supervised* clustering - we consider several weight vectors (representatives) - each weight vector corresponds to a single class - first, we need to initialize the weight vectors - then, for every sample, we update the closest weight vector - if the class of the vector corresponds to the class of the sample, we move the vector closer to the sample: $w\leftarrow w+\mu(x-w)$ - otherwise, we move it further from the sample: $w\leftarrow w-\mu(x-w)$ - $\mu$ … learning rate (usually some constant divided by the number of the epoch) - no guarantee of convergence - k-medoids - uses a similarity measure instead of averaging - we try to find $k$ best representatives $Y_1,\dots,Y_k$ (from the dataset) - so the medoids are real data points - objective function: $O=\sum_{i=1}^n(\min_j\mathrm{Dist}(X_i,Y_j))$ - more robust to outliers than k-means - sometimes, we cannot easily compute average (or it does not make sense) - possible strategy - randomly select $r$ pairs $(X_i,Y_j)$ for a possible exchange - the pair with the best improvement of the objective function is exchanged - requires time proportional to $rn$ - grid-based methods - e.g. MAFIA - which cells of the grid are densely covered by the data? - merge neighboring *dense* hyper-cubes (cells) to get clusters - there is a threshold which tells us if the hyper-cube is dense - density-based algorithms - core point: there are at least $\tau$ points closer than $\varepsilon$ - border point: there is at least one core point closer than $\varepsilon$ - noise point: otherwise - DBSCAN - connectivity graph - nodes = core points - nodes are connected if the core points are closer than $\varepsilon$ - clusters = connected components - border points still belong to the corresponding clusters - noise points are reported as outliers - another algorithm: DENCLUE - scalable approaches – for lots of data - k-medoids are expensive to compute - CLARA - based on *partitioning around medoids* (PAM) – a variant of k-medoids - all possible $k(n-k)$ pairs of medoids and non-medoids are tested for an exchange - the pair with the best improvement of the objective function is exchanged - $O(kn^2d)$ time per iteration for a $d$-dimensional dataset - we apply the algorithm to a smaller sample set of size $fn$ where $f\in(0,1)$ - the remaining data points are assigned to the medoids found in the smaller sample - we repeat this multiple times and use the best found clustering - $O(kf^2n^2d+k(n-k))$ time - the main problem occurs when the preselected samples do not include a good choice of medoids - CLARANS - k-medoids, but uses the entire dataset - finds several locally optimal solutions and uses the best one - finding a locally optimal solution - randomly initialize $k$ medoids - repeat: take a random pair of a medoid and a non-medoid, swap them if this improves the objective function - if a certain number of unsuccessful exchanges is reached, we consider the solution to be locally optimal - CURE (clustering using representatives) – agglomerative hierarchical algorithm - sample $s$ points, divide them into $p$ equal partitions - cluster each partition independently using hierarchical merging to $k'$ clusters - perform hierarchical clustering over the $k'p$ clusters across all partitions to get the desired number of clusters ($k$) - assign each point to its closest cluster ## Decision Trees - decision tree, (dis)advantages - each inner node is labeled by an attribute - each edge is labeled by a predicate applicable to the attribute of the parent node - each leaf has a class label - there’s a stopping criterion – e.g. all the training data in the node belong to the same class - advantages: easy to implement and interpret, efficient, extract simple rules, applicable to large datasets - disadvantages - difficult to process continuous data (decision trees divide the features space into rectangular regions) - difficult to process for missing data - overfitting (solved by pruning) - mutual correlations among the attributes are not considered - problem if the decision boundary looks like $y=x$ (we would need to check a value of one attribute relatively to another attribute) - top down induction of decision trees (TDIDT) - top-down method - algorithm (divide and conquer) - choose one attribute as a root of the subtree - divide the data according to the values of this attribute and add a node for each subset - repeat the steps recursively for any nodes that do contain data from at least two different classes - how to choose the attribute - we use Shannon entropy $H=-\sum_c p_c\log_2p_c$ - $p_c$ … probability of occurrence of class $c$ (relative frequency of class $c$ in the data) - by definition, $0\log_2 0=0$ - note that $\log_2 x=\frac{\log_{10} x}{\log_{10} 2}=\frac{\log_{10} x}{0.301}$ - we calculate $H$ for every possible value of the attribute - then we apply weighted average – weights correspond to the frequencies of the values of the attribute in the data - we choose the attribute with the minimum weighted average entropy - se we choose the attribute like this: $\mathrm{argmin}_A \sum_v P(A=v) H(v)$ - where $H(v)=-\sum_c p_c\log_2 p_c$ - $p_c=P(C=c\mid A=v)$ - obviously, we always consider probability $P$ estimated for the subset of the data that belongs to the current node - it is useful to perform early stopping (by the principle of Occam’s razor) - time complexity: $O(np\log p)$ - $n$ … number of attributes - $p$ … number of training data - ID3 algorithm - boolean output (only two classes) - greedy top-down method - divides the nodes until there is a single class or there are no attributes (or data) left - uses information gain = entropy reduction - $\mathrm{gain}(A)=H-\sum_v P(A=v)H(v)$ - where $H$ is the entropy of the node, $H(v)$ is the entropy of the child node containing samples s.t. $A=v$ - C4.5, C5.0 - modifications of the ID3 algorithm - C4.5 - missing attribute values are ignored during tree construction; they are estimated from the other data during prediction (probably just by using mode?) - continuous data – categorized according to the values in the training set - pruning - train/validation split - subtree is replaced by a leaf (with majority vote) if the new tree does not perform worse on the validation set than the original one - in a similar fashion, a whole subtree can be raised (the most frequently used one) - rule post-pruning - rules can be extracted from the tree - then, the rules can be generalized (pruned) and ordered by their expected accuracy (then used in this order instead of the original tree) - quite radical pruning - the rules are transparent and easy to understand - estimates confidence interval for accuracy of rules and trees - for a given rule, we can evaluate training accuracy $\hat p=$ correctly classified samples ÷ all samples processed by the rule - we assume binomial distribution - for the confidence interval computed at the 95% confidence level, the lower accuracy bound for new data will correspond to $$\hat p-1.96\cdot\underbrace{\sqrt{\frac{\hat p(1-\hat p)}{N}}}_{\text{std dev}}$$ - where $N$ is the total number of samples - uses information gain ratio as criterion for data division - $\mathrm{argmax}_A\,\frac{\mathrm{gain}(A)}{-\sum_v P(A=v)\log_2 P(A=v)}$ - in the denominator, we have *SplitInformation*, which is the same formula as entropy if we consider the attribute values to be “classes” - this penalizes excessive branching - C5.0 - commercial version of C4.5 for large databases - improved generation of rules - higher accuracy is achieved using *boosting* - we gradually construct an ensemble of several classifiers (trees) - every new tree tries to improve the accuracy of the ensemble - the (previously) incorrectly classified data points are assigned larger weights → the tree knows they have high priority - majority vote is used - classification and regression trees (algorithm CART) - generates *binary* trees - chooses a value of an attribute it can use to split data into the two subtrees (usually it's a threshold) - uses entropy or Gini index to choose the best decision attribute - measure of “goodness” of the data division - $\phi(v|t)=2P_LP_R\sum_c |P(c|t_L)-P(c|t_R)|$ - $P_L$ probability that a sample belongs to the left subtree - $P(c|t_L)$ probability that a sample in the left subtree belongs to class $c$ - we want the children to be both balanced and pure - we use the value $v^*=\mathrm{argmax}_v\ \phi(v|t)$ - characteristic properties of CART - order of the attributes corresponds to their impact during classification - missing data is ignored - stopping criterion: there's no way to divide the node in order to improve the classification accuracy - high accuracy achieved on the training set does not have to reflect the accuracy on the test data - algorithm CHAID - chi-square automatic interaction detection - $\chi^2$ criterion for branching - values of the categorical attributes are gradually grouped together (by $\chi^2$ “similarity”) → there remain only two groups - bagging - bootstrapped aggregating - reduces the variance of the classifier - if we consider an ensemble of $k$ i.i.d. classifiers with variance $\sigma^2$ each, the ensemble has variance $\sigma^2/k$ - decision trees are an ideal choice for bagging – they have low bias and high variance (if they are sufficiently deep) - bagging does not reduce bias (it may even worsen accuracy if bias is the main problem) - to get classifiers that are i.i.d. (independent and identically distributed) to some extent, we perform *bootstrapping* - the data points are sampled uniformly from the original data with replacement - this way, the new dataset contains roughly $1-(1-1/n)^n\approx 1-1/e\approx 63.2\%$ distinct data points - random forests - *we use bagging* - problem: pairwise correlation of models constructed using bagging - if our $k$ classifiers, each with variance $\sigma^2$, have a positive pairwise correlation $\rho$, then the variance of the averaged prediction will be $\rho\sigma^2+\frac{(1-\rho)\sigma^2}k$ - solution: we limit the number of features that can be used for splitting - at each node, only a random subset of features is available - this speeds up training due to fewer features to search over at each stage and there is no need to prune the trees - reduced variance does not negatively affect the bias - two parameters - number of features to be sampled - number of trees to be built - boosting - classifiers are built one by one based on a weighted dataset - misclassified data points tend to have larger weights - each new tree depends on the previous ones - it is trained on a dataset with different weights (it focuses on the misclassified data points) - in the end, we get a weighted combination of the classifiers - the ensemble has a lower overall bias - disadvantage: outliers may hurt the overall performance of the model - random forests vs. boosting - random forests: fast (can run in parallel, use a small set of features at each stage) - boosting: sequential, can be expensive, but outperforms random forests