Lecture

online in-class test: 20.–24. 4.
organization
- practicals (40 %)
  - homework (traditional or trial alternative): 10 points
  - quizzes: 5×5 = 25 points
    - in Moodle throughout the semester
  - project (traditional or trial alternative): 15 points
    - usually, students propose their own topics
    - with a presentation
  - we can even get some bonus points for cooperation during practicals
- exam (60 %)
  - online in-class test (during lab practical) 15 %
  - final exam (written and oral part) 45 %
traditional version: no AI, individual
- traditional project – can be in teams
trial alternative (for project+homework)
- can be solved individually or in small teams of two (or three) students
- AI tools allowed
  - if used, test at least two and compare their benefits/drawbacks
- involves two presentations
  - traditional one
  - software-oriented one – instead of the homework!
- agreement with the publication of the project is necessary

Introduction

data mining from databases
- non-trivial process of extracting implicit, previously unknown, but potentially useful information from the data
- originated in the 90s (there was not enough data before)
- knowledge discovery in databases (KDD)
- data mining (DM) – business intelligence (BI) and big data
foundations
- artificial intelligence, machine learning methods
- database systems (to store large data sets), information retrieval
- statistics – modeling and analysis of dependencies found in the data
- + how to use the results for decision-making
data mining is an interactive and iterative process
- data preparation
  - we build one table containing all the relevant data
  - selection
  - preprocessing
  - transformation
- the actual “data mining” – we find patterns in the data
- interpretation – the discovered knowledge should be evaluated from the point of view of the end user (manager, customer, etc.)
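The data preparation steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the two source tables, their column names, the mean-fill strategy, and the helper `prepare` are all hypothetical and not taken from the lecture.

```python
# Hypothetical raw sources: a customer table and a purchase table.
customers = [
    {"id": 1, "age": "34", "city": "Prague"},
    {"id": 2, "age": None, "city": "Brno"},
    {"id": 3, "age": "51", "city": "Prague"},
]
purchases = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 3, "amount": 40.0},
]


def prepare(customers, purchases):
    """Build one flat table: one row per customer with all relevant attributes."""
    # Selection: keep only the attributes assumed relevant to the task.
    selected = [{"id": c["id"], "age": c["age"]} for c in customers]

    # Preprocessing: parse ages and fill missing ones with the mean of known ages.
    known = [int(r["age"]) for r in selected if r["age"] is not None]
    mean_age = sum(known) / len(known)
    for r in selected:
        r["age"] = float(r["age"]) if r["age"] is not None else mean_age

    # Transformation: derive a total-spend attribute and rescale age to [0, 1].
    totals = {}
    for p in purchases:
        totals[p["customer_id"]] = totals.get(p["customer_id"], 0.0) + p["amount"]
    lo = min(r["age"] for r in selected)
    hi = max(r["age"] for r in selected)
    for r in selected:
        r["total_spent"] = totals.get(r["id"], 0.0)
        r["age_scaled"] = (r["age"] - lo) / (hi - lo) if hi > lo else 0.0
    return selected


table = prepare(customers, purchases)
```

The result is a single table with one row per customer, which is the form most mining methods expect as input.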
PoV of a manager
- there is a topical problem that needs to be solved
- the goal of the data mining process is to obtain as much information relevant to solving the problem as possible
- example
  - find groups of customers of a department store to offer special services to
  - the found groups can be interpreted as segments in the given market area
- steps
  - form a team: data analyst, domain expert, expert on databases, …
  - specify the problem
  - obtain all data available
    - we should also obtain the external data describing the environment of the analyzed processes (time period of the year, advertising, political issues, weather, …)
  - select the methods
    - clustering, classification, exploratory data analysis, association rules, decision trees, genetic algorithms, Bayesian networks, neural networks
    - visualization methods – helpful for presentation
  - preprocess the data
  - mine the data
  - interpret the results
    - we may need to create an analytical report
    - make the results easy to understand
    - the outcome may also be carrying out a reasonable action
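The department-store example above is a clustering task. Below is a minimal k-means sketch, assuming each customer is described by two made-up numeric attributes; the data, the hand-picked initial centroids, and the function name `kmeans` are illustrative assumptions, not the method prescribed in the lecture.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points; returns (centroids, cluster index per point)."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(
                range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2
                + (p[1] - centroids[k][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels


# Hypothetical customers: (visits per month, average spend per visit).
customers = [(1, 20), (2, 25), (1, 30), (8, 200), (9, 220), (10, 210)]
# Initial centroids are chosen by hand here; in practice they are often random.
centroids, labels = kmeans(customers, centroids=[customers[0], customers[3]])
```

Each resulting cluster can then be read as a candidate market segment. In a real analysis the attributes would usually be rescaled first, so that spend does not dominate the distance.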
tasks
- classification and prediction
  - goal: predict a continuous or discrete value based on some attributes
  - interpretation may be challenging
  - prediction: weather forecast, stock prices, …
  - we should be able to cover the entire domain (all the data may be useful for a reasonable prediction)
- description
  - goal: find a dominant structure or relationships
  - we may ignore some of the information; the extracted knowledge does not need to be that precise (but it should be easily understandable)
- looking for “nuggets”
  - goal: find some interesting knowledge (does not have to fully cover the given concept)
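To make the first task concrete, the sketch below contrasts predicting a discrete value (classification) with predicting a continuous one. The 1-nearest-neighbour rule and the least-squares line are just two simple stand-ins chosen for illustration; the data sets and function names are hypothetical.

```python
def nearest_neighbour(train, query):
    """Classification: return the label of the training example closest to query."""
    _, label = min(
        train, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], query))
    )
    return label


def fit_line(xs, ys):
    """Prediction of a continuous value: least-squares line y = a * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return a, mean_y - a * mean_x


# Hypothetical bank clients: (income in thousands, existing loans) -> risk class.
clients = [((20, 3), "risky"), ((25, 2), "risky"), ((60, 0), "safe"), ((70, 1), "safe")]
risk = nearest_neighbour(clients, (65, 0))

# Hypothetical daily temperature (degrees C) vs. power consumption (MWh).
a, b = fit_line([0, 10, 20, 30], [50, 40, 30, 20])
forecast = a * 15 + b
```

Both models must give an answer for any input from the domain, which is why prediction tasks need data covering the whole domain, unlike description or the search for nuggets.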
real tasks
- segmentation and classification of bank clients
- causes of failures in telecommunication networks
- causes of change of service provider
- prediction of power consumption
- analysis of the patient database in a hospital
  - Florence Nightingale
  - Ignaz Semmelweis
- market basket analysis
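Market basket analysis is typically approached with association rules, which are scored by support and confidence. A minimal sketch, assuming a handful of invented baskets; the item names and helper functions are hypothetical.

```python
from itertools import combinations

# Hypothetical transactions: each is the set of items in one basket.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item of itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) from the baskets."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


# Support of every item pair that occurs together at least once.
items = sorted(set().union(*baskets))
pair_support = {
    pair: support(set(pair), baskets)
    for pair in combinations(items, 2)
    if support(set(pair), baskets) > 0
}

# Confidence of the rule "bread -> butter".
conf = confidence({"bread"}, {"butter"}, baskets)
```

Enumerating all pairs like this only works for tiny item sets; practical tools prune the search, for example by discarding itemsets below a minimum support.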

Methodologies

goal: provide the users with a unified framework; guide data mining applications regardless of industry
SEMMA
- sample – select data for modeling
  - may include sampling, imputation (filling in missing values), enrichment (adding other useful information, e.g. adding seasons of the year to the sales data), partitioning (train-validation-test split)
- explore – visual exploration and dimensionality reduction
- modify – prepare the objects, values, and variables for data modeling; transform the data
- model – apply data mining techniques (decision trees, regression models, NNs, …)
  - create models providing relevant outcome
- assess – evaluate the results of modeling (assess their reliability and usefulness)
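The partitioning mentioned in the sample step can be sketched as a shuffled three-way split. The 60/20/20 ratio, the fixed seed, and the function name `partition` are assumptions made for the example, not values prescribed by SEMMA.

```python
import random


def partition(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and split them into train / validation / test parts."""
    rows = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_valid = int(len(rows) * validation)
    return (
        rows[:n_train],
        rows[n_train:n_train + n_valid],
        rows[n_train + n_valid:],
    )


train_set, valid_set, test_set = partition(range(100))
```

The model is fitted on the training part, tuned on the validation part, and the test part is kept aside for the final assess step.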
CRISP-DM – cross-industry standard process for data mining; a robust general-purpose model
- business understanding
  - determine our business objective
  - assess our present situation and the data we have
  - risk assessment
  - setting KPIs
- data understanding
  - collect and describe initial data; explore and visualize it
  - verify the quality of the data
- data preparation
  - cleaning, integration (merging), aggregation, …
- modeling
- evaluation
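The data understanding and data preparation phases above can be illustrated on a toy table: first verify quality (missing values, duplicates), then clean and aggregate. The sales records, column names, and both helper functions are hypothetical, and dropping incomplete rows is only one of several possible cleaning policies.

```python
from collections import defaultdict

# Hypothetical sales records; None marks a missing value.
sales = [
    {"store": "A", "month": 1, "revenue": 100.0},
    {"store": "A", "month": 1, "revenue": 100.0},  # exact duplicate
    {"store": "A", "month": 2, "revenue": None},   # missing revenue
    {"store": "B", "month": 1, "revenue": 70.0},
    {"store": "B", "month": 2, "revenue": 90.0},
]


def quality_report(rows):
    """Data understanding: count missing values per column and duplicate rows."""
    missing = defaultdict(int)
    seen, duplicates = set(), 0
    for row in rows:
        for column, value in row.items():
            if value is None:
                missing[column] += 1
        key = tuple(sorted(row.items()))
        duplicates += key in seen
        seen.add(key)
    return dict(missing), duplicates


def clean_and_aggregate(rows):
    """Data preparation: drop duplicates and incomplete rows, sum revenue per store."""
    unique = {tuple(sorted(r.items())) for r in rows}
    totals = defaultdict(float)
    for key in unique:
        row = dict(key)
        if row["revenue"] is not None:
            totals[row["store"]] += row["revenue"]
    return dict(totals)


missing, duplicates = quality_report(sales)
revenue_per_store = clean_and_aggregate(sales)
```

Running the quality report before cleaning makes it explicit how much data the preparation step discards.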