the actual “data mining” – we find patterns in the data
interpretation – found knowledge shall be evaluated from the point of view of the end user (manager, customer, etc.)
PoV of a manager
there’s a topical issue
goal of the data mining process is to obtain as much information as possible that is relevant to solving the problem
example
find groups of customers of a department store to offer special services to
the found groups can be interpreted as segments in the given market area
steps
form a team: data analyst, domain expert, expert on databases, …
specify the problem
obtain all data available
we should also obtain the external data describing the environment of the analyzed processes (time period of the year, advertising, political issues, weather, …)
select the methods
clustering, classification, exploratory data analysis, association rules, decision trees, genetic algorithms, Bayesian networks, neural networks
visualization methods – helpful for presentation
preprocess the data
mine the data
interpret the results
we may need to create an analytical report
make the results easy to understand
the output can also mean to carry out a reasonable action
tasks
classification and prediction
goal: predict a continuous or discrete value based on some attributes
interpretation may be challenging
prediction: weather forecast, stock prices, …
we should be able to cover the entire domain (all the data may be useful for a reasonable prediction)
description
goal: find a dominant structure or relationships
we may ignore some of the information; the extracted knowledge does not need to be that precise (but it should be easily understandable)
looking for “nuggets”
goal: find some interesting knowledge (does not have to fully cover the given concept)
real tasks (examples)
segmentation and classification of bank clients
causes of failures in telecommunication networks
causes of change of service provider
prediction of power consumption
analysis of the patient database in a hospital
Florence Nightingale
Ignaz Semmelweis
market basket analysis
Methodologies
goal, popular methodologies
goal: provide the users with a unified framework; guide data mining applications regardless of industry
it’s necessary to have high-quality data; the steps are usually iterative
popular methodologies: 5A, SEMMA, CRISPR-DM
SEMMA
sample – select data for modeling
may include sampling, imputation (adding other useful information, e.g. adding seasons of the year to the data about the sales), partitioning (train-test-validation split)
explore – visual exploration and dimensionality reduction
modify – prepare the objects, values, and variables for data modeling; transform the data
model – apply data mining techniques (decision trees, regression models, NNs, …)
create models providing relevant outcome
assess – evaluate the results of modeling (assess their reliability and usefulness)
so that the manager can understand
CRISP-DM
cross-industry standard process for data mining; a robust general-purpose model
software-independent
6 phases (their order is not strict)
business understanding
determine our business objective
assess our present situation, what resources and data we have
risk assessment
setting KPIs
data understanding
collect and describe initial data; explore and visualize it
verify the quality of the data
data preparation
cleaning, integration (merging), aggregation, …
make the dataset ready for further analysis and modeling
modeling
select the modeling technique
generate test design
build the model
assess the model
evaluation
evaluate the results – from the point of view of the manager
review the process
determine the next steps
deployment
results should be presented in a simple and comprehensible way
plan the deployment (what steps should be done)
final report, review the project
to preserve knowledge
ciritique
little support for project management
what to do if there are problems?
→ IBM released Analytics Solutions Unified Method for Data Mining (ASUM)
configure & build – including testing and validation
deploy – create a plan to run and maintain the solution
operate & optimize – preserve the health
+ project management – there are processes that help to monitor the progress and maintain the project
benefits
minimized risk
scalable and enterprise-ready
comprehensive
product-specific implementation roadmaps
users can replicate previous software implementations
Data Analysis
types of attributes
interval-scaled attributes – their values are real number following a linear scale
ratio-scaled attributes – follow an exponential scale → need to be log-transformed (logxi) before treating them like interval-scaled attributes
a nominal attribute can be transformed to a series of binary attributes (one-hot encoding)
ordinal attributes are sometimes treated as interval-scaled attributes
data standardization of interval-scaled attributes
decimal scaling – divide the data by the smallest power of 10 to get them to the interval [−1,1] (while maintaining the mutual relations)
range standardization
to [0,1] … max−minx−min
to [−1,1] … 2⋅max−minx−min−1
Z-score standardization … σx−μ
μ … mean
σ … corrected sample standard deviation
σ=n−11∑i=1n(xi−μ)2
or we can use the corrected sample standard deviation
standardization according to the mean absolute difference
sx−μ
where s=n1∑i=1n∣xi−μ∣
range, quartiles, outliers
range = difference between the highest and the lowest value in the set
quartiles – values Q1,Q2,Q3
25% of the observations below Q1 (lower quartile)
50% below Q2 (median)
75% below Q3 (upper quartile)
interquartile range IQR=Q3−Q1
formulas (for discrete data)
Q1=x⌈n/4⌉
Q3=x⌈3n/4⌉
but if xn/4 or x3n/4 exists (n is divisible by 4), we need to average this value with the next one (xn/4+1 or x3n/4+1)
outlier is any value ∈/[Q1−1.5⋅IQR,Q3+1.5⋅IQR]
box plot
box … between Q1 and Q3 with a mark (line) representing the median
two other lines … boundaries for outliers (or the extrema that are not outliers)
other marks outside … individual outliers
contingency table
relationship between two categorical quantities (may be binary)
cell … how many observations have this combination of attributes
there are also column and row totals
we get an expected frequency of combination if we multiply the column total with the row total and divide it by the total number of observations
intuition: we take the number of elements in the column and multiply it by the (observed) probability that any sample belongs to the given row (has this value of the attribute)
χ2-test
χ2=∑k∑ℓekℓ(akℓ−ekℓ)2
where akℓ is the value of the cell (number of observation with this combination of attributes)
ekℓ is the expected frequency as described above: ekℓ=nrk⋅sℓ
χ2 distribution has (R−1)(S−1) degrees of freedom (the last row and column are somehow determined by the rest of the table)
we choose a level of significance α (usually 0.05) = probability of rejecting a true null hypothesis
we reject H0 if our χ2-statistic is greater than the value of the χ2 distribution (for the given DoF and α)
it can be used only for large-enough frequencies – when ∀k,ℓ:ekℓ≥5
if the frequencies are too low and the contingency table contains only 4 fields, we can use Fisher’s test
Fisher’s test
for a contingency table with only 4 fields (two binary attributes)
one-sided or two-sided version
we get the p-value directly: what is the probability of this or more extreme table if we assume fixed column and row totals?
for the two-sided version, we need to compute the expected frequencies ekℓ
probability of this table: p=n!⋅a!⋅b!⋅c!⋅d!r1!⋅r2!⋅s1!⋅s2!
a,b,c,d … values of the cells
ri,sj … row and column totals respectively
probability of some more extreme table Ti – we consider values a−i,b+i,c+i,d−i instead
the probabilities of the tables need to be summed up to get the p-value
null hypothesis: independence of the attributes
we reject for p-value ≤α
Regression Analysis
motivation
it may be costly to obtain output values
input values are known earlier than the output (we want to estimate the output)
controlling input variables may lead to the desired behavior of the output
we want to find a (causal) relationship between the input and the output
linear regression
what are the parameters of the linear relationship between two numerical quantities?
y=βx+α+ε
least squares method
find parameters α,β
we minimize SSE=∑i(yi−f(xi))2 (sum of squared errors)
set partial derivatives of SSE equal to zero, we get two equations:
nα+β∑ixi=∑iyi
or α=yˉ−βxˉ
α∑ixi+β∑ixi2=∑ixiyi
so (yˉ−βxˉ)∑ixi+β∑ixi2=∑xiyi
we get β=∑i(xi−xˉ)2∑i(xi−xˉ)(yi−yˉ)
and α=yˉ−βxˉ
correlation analysis
is there a linear relationship between two numerical quantities?
corrected sample standard deviation σx=n−11∑i(xi−xˉ)2
similar for σy
multi-dimensional regression
linear → least-squares method in matrix form
y=Xβ
X … matrix of input samples
before the x values of the sample, there is 1 (so that we don’t need to keep the constant α)
so β=(XTX)−1(XTy)
may be computationally expensive → approximate solutions
non-linear – example: logistic regression
if the dependent variable is categorical
we use log (conditional) odds to model the conditional probability
si=ln1−PP
so P=1+e−si1 (sigmoid)
where si=α+βxi=α+∑jβjxij
we estimate the parameters using maximum likelihood
L=∏iP(yi∣xi)=∏ipiyi(1−pi)1−yi
pi=σ(si)=1+e−α−βxi1
note that 1−pi=1−σ(si)=σ(−si)
we minimize negative log-likelihood (= cross-entropy loss)
E=−∑i(yilogpi+(1−yi)log(1−pi))
gradient descent is then applied like this
βj←βj−η∑i(σ(α+βxi)−yi)xij
α←α−η∑i(σ(α+βxi)−yi)
η … learning rate
note that σ′(si)=σ(si)(1−σ(si))
discriminant analysis
classification into classes
we consider a discriminant function ft for every class ct
conditional (a posteriori) probability of classifying x to the class
for two classes, we get f(x)=f1(x)−f2(x)=P(x∣c1)P(c1)−P(x∣c2)P(c2)
(there should be a normalization constant α maybe)
for normal distributions, we only need to estimate the means and the covariance matrices to get the discriminant functions
the decision boundary is located where the distribution curves intersect
Cluster Analysis
cluster analysis, assumptions
we want to divide the observed samples into groups of mutually similar ones
assumption: we can measure the distance between samples (or between clusters)
distance between two samples
Minkovski metrics Lz(x1,x2)=z∑j(x1j−x2j)z
Manhattan distance … L1
Euclidean distance … L2
Chebyshev distance … L∞
Hamming distance – number of positions at which x1 and x2 differ
Mahalanobis distance – useful if the quantities have different variance
uses the covariance matrix
d(x1,x2)=(x1−x2)TS−1(x1−x2)
distance between two clusters
method of the nearest neighbor
method of the farthest neighbor
method of average distance (we compute the mean distance of two arbitrary samples – one from each cluster)
centroid method (distance between the centers of the clusters)
centroid
compute the average over all the features
but such point may not be in the data at all
k-means clustering
Lloyd
divide the points randomly into k clusters
repeat until convergence
compute the centroids
assign each point to its closest centroid
alternative initializations
pronounce the first k points to be centroids
k-means++ … sample k points (to be used as centroids) with probability proportional to their squared distance to the closest existing centroid
LocalSearch++ … start with k-means++, then sample some more points (in the same way) one by one and replace some of the existing centroids if they cover the distribution (points) better
k-medians
just use Manhattan distance instead of Euclidean
→ centroid = median (independently along each dimension)
the representative may not exist in the dataset
median is less sensitive to the presence of outliers than the mean
hierarchical clustering
bottom-up approach
we get a dendrogram
a clustering corresponds to some “cut” in the dendrogram
algorithm
initialization: each point is its own cluster
repeat: merge the two closest clusters
learning vector quantization (LVQ)
for supervised clustering
we consider several weight vectors (representatives)
each weight vector corresponds to a single class
first, we need to initialize the weight vectors
then, for every sample, we update the closest weight vector
if the class of the vector corresponds to the class of the sample, we move the vector closer to the sample: w←w+μ(x−w)
otherwise, we move it further from the sample: w←w−μ(x−w)
μ … learning rate (usually some constant divided by the number of the epoch)
no guarantee of convergence
k-medoids
uses a similarity measure instead of averaging
we try to find k best representatives Y1,…,Yk (from the dataset)
so the medoids are real data points
objective function: O=∑i=1n(minjDist(Xi,Yj))
more robust to outliers than k-means
sometimes, we cannot easily compute average (or it does not make sense)
possible strategy
randomly select r pairs (Xi,Yj) for a possible exchange
the pair with the best improvement of the objective function is exchanged
requires time proportional to rn
grid-based methods
e.g. MAFIA
which cells of the grid are densely covered by the data?
merge neighboring dense hyper-cubes (cells) to get clusters
there is a threshold which tells us if the hyper-cube is dense
density-based algorithms
core point: there are at least τ points closer than ε
border point: there is at least one core point closer than ε
noise point: otherwise
DBSCAN
connectivity graph
nodes = core points
nodes are connected if the core points are closer than ε
clusters = connected components
border points still belong to the corresponding clusters
noise points are reported as outliers
another algorithm: DENCLUE
scalable approaches – for lots of data
k-medoids are expensive to compute
CLARA
based on partitioning around medoids (PAM) – a variant of k-medoids
all possible k(n−k) pairs of medoids and non-medoids are tested for an exchange
the pair with the best improvement of the objective function is exchanged
O(kn2d) time per iteration for a d-dimensional dataset
we apply the algorithm to a smaller sample set of size fn where f∈(0,1)
the remaining data points are assigned to the medoids found in the smaller sample
we repeat this multiple times and use the best found clustering
O(kf2n2d+k(n−k)) time
the main problem occurs when the preselected samples do not include a good choice of medoids
CLARANS
k-medoids, but uses the entire dataset
finds several locally optimal solutions and uses the best one
finding a locally optimal solution
randomly initialize k medoids
repeat: take a random pair of a medoid and a non-medoid, swap them if this improves the objective function
if a certain number of unsuccessful exchanges is reached, we consider the solution to be locally optimal
CURE (clustering using representatives) – agglomerative hierarchical algorithm
sample s points, divide them into p equal partitions
cluster each partition independently using hierarchical merging to k′ clusters
perform hierarchical clustering over the k′p clusters across all partitions to get the desired number of clusters (k)
assign each point to its closest cluster
Decision Trees
decision tree, (dis)advantages
each inner node is labeled by an attribute
each edge is labeled by a predicate applicable to the attribute of the parent node
each leaf has a class label
there’s a stopping criterion – e.g. all the training data in the node belong to the same class
advantages: easy to implement and interpret, efficient, extract simple rules, applicable to large datasets
disadvantages
difficult to process continuous data (decision trees divide the features space into rectangular regions)
difficult to process for missing data
overfitting (solved by pruning)
mutual correlations among the attributes are not considered
problem if the decision boundary looks like y=x (we would need to check a value of one attribute relatively to another attribute)
top down induction of decision trees (TDIDT)
top-down method
algorithm (divide and conquer)
choose one attribute as a root of the subtree
divide the data according to the values of this attribute and add a node for each subset
repeat the steps recursively for any nodes that do contain data from at least two different classes
how to choose the attribute
we use Shannon entropy H=−∑cpclog2pc
pc … probability of occurrence of class c (relative frequency of class c in the data)
by definition, 0log20=0
note that log2x=log102log10x=0.301log10x
we calculate H for every possible value of the attribute
then we apply weighted average – weights correspond to the frequencies of the values of the attribute in the data
we choose the attribute with the minimum weighted average entropy
se we choose the attribute like this: argminA∑vP(A=v)H(v)
where H(v)=−∑cpclog2pc
pc=P(C=c∣A=v)
obviously, we always consider probability P estimated for the subset of the data that belongs to the current node
it is useful to perform early stopping (by the principle of Occam’s razor)
time complexity: O(nplogp)
n … number of attributes
p … number of training data
ID3 algorithm
boolean output (only two classes)
greedy top-down method
divides the nodes until there is a single class or there are no attributes (or data) left
uses information gain = entropy reduction
gain(A)=H−∑vP(A=v)H(v)
where H is the entropy of the node, H(v) is the entropy of the child node containing samples s.t. A=v
C4.5, C5.0
modifications of the ID3 algorithm
C4.5
missing attribute values are ignored during tree construction; they are estimated from the other data during prediction (probably just by using mode?)
continuous data – categorized according to the values in the training set
pruning
train/validation split
subtree is replaced by a leaf (with majority vote) if the new tree does not perform worse on the validation set than the original one
in a similar fashion, a whole subtree can be raised (the most frequently used one)
rule post-pruning
rules can be extracted from the tree
then, the rules can be generalized (pruned) and ordered by their expected accuracy (then used in this order instead of the original tree)
quite radical pruning
the rules are transparent and easy to understand
estimates confidence interval for accuracy of rules and trees
for a given rule, we can evaluate training accuracy p^= correctly classified samples ÷ all samples processed by the rule
we assume binomial distribution
for the confidence interval computed at the 95% confidence level, the lower accuracy bound for new data will correspond to p^−1.96⋅std devNp^(1−p^)
where N is the total number of samples
uses information gain ratio as criterion for data division
argmaxA−∑vP(A=v)log2P(A=v)gain(A)
in the denominator, we have SplitInformation, which is the same formula as entropy if we consider the attribute values to be “classes”
this penalizes excessive branching
C5.0
commercial version of C4.5 for large databases
improved generation of rules
higher accuracy is achieved using boosting
we gradually construct an ensemble of several classifiers (trees)
every new tree tries to improve the accuracy of the ensemble
the (previously) incorrectly classified data points are assigned larger weights → the tree knows they have high priority
majority vote is used
classification and regression trees (algorithm CART)
generates binary trees
chooses a value of an attribute it can use to split data into the two subtrees (usually it's a threshold)
uses entropy or Gini index to choose the best decision attribute
measure of “goodness” of the data division
ϕ(v∣t)=2PLPR∑c∣P(c∣tL)−P(c∣tR)∣
PL probability that a sample belongs to the left subtree
P(c∣tL) probability that a sample in the left subtree belongs to class c
we want the children to be both balanced and pure
we use the value v∗=argmaxvϕ(v∣t)
characteristic properties of CART
order of the attributes corresponds to their impact during classification
missing data is ignored
stopping criterion: there's no way to divide the node in order to improve the classification accuracy
high accuracy achieved on the training set does not have to reflect the accuracy on the test data
algorithm CHAID
chi-square automatic interaction detection
χ2 criterion for branching
values of the categorical attributes are gradually grouped together (by χ2 “similarity”) → there remain only two groups
bagging
bootstrapped aggregating
reduces the variance of the classifier
if we consider an ensemble of k i.i.d. classifiers with variance σ2 each, the ensemble has variance σ2/k
decision trees are an ideal choice for bagging – they have low bias and high variance (if they are sufficiently deep)
bagging does not reduce bias (it may even worsen accuracy if bias is the main problem)
to get classifiers that are i.i.d. (independent and identically distributed) to some extent, we perform bootstrapping
the data points are sampled uniformly from the original data with replacement
this way, the new dataset contains roughly 1−(1−1/n)n≈1−1/e≈63.2% distinct data points
random forests
we use bagging
problem: pairwise correlation of models constructed using bagging
if our k classifiers, each with variance σ2, have a positive pairwise correlation ρ, then the variance of the averaged prediction will be ρσ2+k(1−ρ)σ2
solution: we limit the number of features that can be used for splitting
at each node, only a random subset of features is available
this speeds up training due to fewer features to search over at each stage and there is no need to prune the trees
reduced variance does not negatively affect the bias
two parameters
number of features to be sampled
number of trees to be built
boosting
classifiers are built one by one based on a weighted dataset
misclassified data points tend to have larger weights
each new tree depends on the previous ones
it is trained on a dataset with different weights (it focuses on the misclassified data points)
in the end, we get a weighted combination of the classifiers
the ensemble has a lower overall bias
disadvantage: outliers may hurt the overall performance of the model
random forests vs. boosting
random forests: fast (can run in parallel, use a small set of features at each stage)
boosting: sequential, can be expensive, but outperforms random forests