# Lecture

- online in-class test: 20–24 April
- organization
	- practicals (40 \%)
		- homework (traditional or trial alternative): 10 points
		- quizzes: 5×5 = 25 points
			- in Moodle throughout the semester
		- project (traditional or trial alternative): 15 points
			- usually, students propose their own topics
			- with a presentation
		- we can even get some bonus points for cooperation during practicals
	- exam (60 \%)
		- online in-class test (during lab practical) 15 \%
		- final exam (written and oral part) 45 \%
- traditional version: no AI, individual
	- traditional project – can be in teams
- trial alternative (for project+homework)
	- can be solved individually or in small teams of two (or three) students
	- AI tools allowed
		- if used, test at least two and compare their benefits/drawbacks
	- involves two presentations
		- traditional one
		- software-oriented one – instead of the homework!
	- agreement with the publication of the project is necessary

## Introduction

- data mining from databases
	- non-trivial process of extracting implicit, previously unknown, but potentially useful information from the data
	- originated in the 90s (there was not enough data before)
	- knowledge discovery in databases (KDD)
	- data mining (DM) – business intelligence (BI) and big data
- foundations
	- artificial intelligence, machine learning methods
	- database systems (to store large data sets), information retrieval
	- statistics – modeling and analysis of dependencies found in the data
	- \+ how to use the results for decision-making
- data mining is an interactive and iterative process
	- data preparation
		- we build one table containing all the relevant data
		- selection
		- preprocessing
		- transformation
	- the actual “data mining” – we find *patterns* in the data
	- interpretation – the discovered knowledge should be evaluated from the point of view of the end user (manager, customer, etc.)
- a manager's point of view
	- there is a current problem that needs to be solved
	- the goal of the data mining process is to obtain as much information as possible that is relevant to solving the problem
	- example
		- find groups of customers of a department store to offer special services to
		- the found groups can be interpreted as segments in the given market area
	- steps
		- form a team: data analyst, domain expert, expert on databases, …
		- specify the problem
		- obtain all data available
			- we should also obtain external data describing the environment of the analyzed processes (time of year, advertising, political events, weather, …)
		- select the methods
			- clustering, classification, exploratory data analysis, association rules, decision trees, genetic algorithms, Bayesian networks, neural networks
			- visualization methods – helpful for presentation
		- preprocess the data
		- mine the data
		- interpret the results
			- we may need to create an analytical report
			- make the results easy to understand
			- the output can also be a recommendation to carry out a reasonable action
- tasks
	- classification and prediction
		- goal: predict a continuous or discrete value based on some attributes
		- interpretation may be challenging
		- prediction: weather forecast, stock prices, …
		- we should be able to cover the entire domain (all the data may be useful for a reasonable prediction)
	- description
		- goal: find a dominant structure or relationships
		- we may ignore some of the information; the extracted knowledge does not need to be that precise (but it should be easily understandable)
	- looking for “nuggets”
		- goal: find some interesting knowledge (does not have to fully cover the given concept)
- real tasks
	- segmentation and classification of bank clients
	- causes of failures in telecommunication networks
	- causes of change of service provider
	- prediction of power consumption
	- analysis of the patient database in a hospital
		- Florence Nightingale
		- Ignaz Semmelweis
	- market basket analysis
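
Market basket analysis, the last task in the list above, is usually described in terms of association rules. The following is a minimal sketch of the two standard rule measures, support and confidence, computed by brute force. The transactions, the item names, and the helper names `support`, `confidence`, and `frequent_pairs` are all invented for illustration and do not come from the lecture.

```python
from itertools import combinations

# Toy transactions (invented for illustration only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): supp(A and C) / supp(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def frequent_pairs(transactions, min_support):
    """All item pairs whose support reaches `min_support` (brute force)."""
    items = sorted(set().union(*transactions))
    result = {}
    for pair in combinations(items, 2):
        s = support(pair, transactions)
        if s >= min_support:
            result[pair] = s
    return result


pairs = frequent_pairs(transactions, min_support=0.4)
rule_conf = confidence({"bread"}, {"butter"}, transactions)
print(pairs)
print(round(rule_conf, 2))
```

A rule such as "bread implies butter" is a typical "nugget": it does not describe the whole data set, but it may still be useful. The brute-force enumeration is only practical for tiny examples; real tools prune the search space.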

## Methodologies

- goal: provide the users with a unified framework; guide data mining applications regardless of industry
- SEMMA
	- sample – select data for modeling
		- may include sampling, enrichment (adding other useful information, e.g. adding the season of the year to sales data; imputation in the strict sense means filling in missing values), partitioning (train/validation/test split)
	- explore – visual exploration and dimensionality reduction
	- modify – prepare the objects, values, and variables for data modeling; transform the data
	- model – apply data mining techniques (decision trees, regression models, NNs, …)
		- create models providing relevant outcome
	- assess – evaluate the results of modeling (assess their reliability and usefulness)
- CRISP-DM – cross-industry standard process for data mining; a robust general-purpose model
	- business understanding
		- determine our business objective
		- assess our present situation, including what data we have
		- risk assessment
		- setting KPIs
	- data understanding
		- collect and describe initial data; explore and visualize it
		- verify the quality of the data
	- data preparation
		- cleaning, integration (merging), aggregation, …
	- modeling
	- evaluation
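
The "sample" step of SEMMA described above (adding the season of the year to sales data, then partitioning) might look roughly as follows. This is only a sketch under stated assumptions: the sales records are invented, the helper names `season_of` and `partition` are hypothetical, the season mapping assumes the Northern Hemisphere, and the 60/20/20 split ratio is an arbitrary choice.

```python
import random
from datetime import date

# Month-to-season mapping (assumption: Northern Hemisphere, meteorological seasons).
SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "autumn", 10: "autumn", 11: "autumn"}


def season_of(day):
    """Return the season name for a `datetime.date`."""
    return SEASONS[day.month]


def partition(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle a copy of `rows` and split it into train/validation/test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * validation)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])


# Invented sales records: one (date, amount) pair per month.
sales = [(date(2024, month, 15), 100 + 10 * month) for month in range(1, 13)]

# Add the derived attribute, then partition the data.
enriched = [{"date": d, "amount": a, "season": season_of(d)} for d, a in sales]
train_set, validation_set, test_set = partition(enriched)

print(len(train_set), len(validation_set), len(test_set))
```

Shuffling before splitting avoids a split that follows the calendar order of the records. For time-dependent prediction tasks a chronological split may be more appropriate; that choice is outside the scope of this sketch.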