Lecture
- online in-class test: 20–24 April
- organization
- practicals (40 %)
- homework (traditional or trial alternative): 10 points
- quizzes: 5×5 = 25 points
- in Moodle throughout the semester
- project (traditional or trial alternative): 15 points
- usually, students propose their own topics
- with a presentation
- bonus points may also be awarded for cooperation during practicals
- exam (60 %)
- online in-class test (during lab practical) 15 %
- final exam (written and oral part) 45 %
- traditional version: no AI, individual
- traditional project – can be in teams
- trial alternative (for project+homework)
- can be solved individually or in small teams of two (or three) students
- AI tools allowed
- if used, test at least two and compare their benefits/drawbacks
- involves two presentations
- traditional one
- software-oriented one – instead of the homework!
- agreement with the publication of the project is necessary
Introduction
- data mining from databases
- non-trivial process of extracting implicit, previously unknown, but potentially useful information from the data
- originated in the 90s (there was not enough data before)
- knowledge discovery in databases (KDD)
- data mining (DM) – business intelligence (BI) and big data
- foundations
- artificial intelligence, machine learning methods
- database systems (to store large data sets), information retrieval
- statistics – modeling and analysis of dependencies found in the data
- + how to use the results for decision-making
- data mining is an interactive and iterative process
- data preparation
- we build one table containing all the relevant data
- selection
- preprocessing
- transformation
- the actual “data mining” – we find patterns in the data
- interpretation – the knowledge found should be evaluated from the point of view of the end user (manager, customer, etc.)
- PoV of a manager
- there is a topical issue that needs to be solved
- goal of the data mining process is to obtain as much information as possible that is relevant to solving the problem
- example
- find groups of customers of a department store to offer special services to
- the found groups can be interpreted as segments in the given market area
- steps
- form a team: data analyst, domain expert, expert on databases, …
- specify the problem
- obtain all data available
- we should also obtain the external data describing the environment of the analyzed processes (time period of the year, advertising, political issues, weather, …)
- select the methods
- clustering, classification, exploratory data analysis, association rules, decision trees, genetic algorithms, Bayesian networks, neural networks
- visualization methods – helpful for presentation
- preprocess the data
- mine the data
- interpret the results
- we may need to create an analytical report
- make the results easy to understand
- the outcome can also be a decision to carry out a reasonable action
- tasks
- classification and prediction
- goal: predict a continuous or discrete value based on some attributes
- interpretation may be challenging
- prediction: weather forecast, stock prices, …
- we should be able to cover the entire domain (all the data may be useful for a reasonable prediction)
- description
- goal: find a dominant structure or relationships
- we may ignore some of the information; the extracted knowledge does not need to be that precise (but it should be easily understandable)
- looking for “nuggets”
- goal: find some interesting knowledge (does not have to fully cover the given concept)
- real tasks
- segmentation and classification of bank clients
- causes of failures in telecommunication networks
- causes of change of service provider
- prediction of power consumption
- analysis of the patient database in a hospital
- Florence Nightingale
- Ignaz Semmelweis
- market basket analysis
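
To make the last task concrete, here is a minimal sketch of market basket analysis using the two standard association-rule measures, support and confidence. The transactions, the item names, and the helper names (`count`, `support`, `confidence`, `frequent_pairs`) are assumptions invented for illustration; they do not come from the lecture, and a real analysis would typically use a dedicated algorithm such as Apriori rather than this brute-force enumeration.

```python
from itertools import combinations

# Hypothetical transactions, invented purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)


def frequent_pairs(transactions, min_support):
    """All item pairs whose support reaches `min_support` (brute force)."""
    items = sorted(set().union(*transactions))
    return {
        pair: support(pair, transactions)
        for pair in combinations(items, 2)
        if support(pair, transactions) >= min_support
    }


if __name__ == "__main__":
    print(frequent_pairs(transactions, min_support=0.4))
    print(confidence({"bread"}, {"milk"}, transactions))
```

A rule such as "bread ⇒ milk" is usually reported only if both its support and its confidence exceed thresholds chosen by the analyst; this fits the "nuggets" task above, since such rules do not have to cover the whole data set.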
Methodologies
- goal: provide the users with a unified framework; guide data mining applications regardless of industry
- SEMMA
- sample – select data for modeling
- may include sampling, imputation (filling in missing values or, more broadly, adding other useful information, e.g. adding seasons of the year to the sales data), partitioning (train/validation/test split)
- explore – visual exploration and dimensionality reduction
- modify – prepare the objects, values, and variables for data modeling; transform the data
- model – apply data mining techniques (decision trees, regression models, NNs, …)
- create models providing relevant outcome
- assess – evaluate the results of modeling (assess their reliability and usefulness)
- CRISP-DM – cross-industry standard process for data mining; a robust general-purpose model
- business understanding
- determine our business objective
- assess our present situation, what data we have
- risk assessment
- setting KPIs
- data understanding
- collect and describe initial data; explore and visualize it
- verify the quality of the data
- data preparation
- cleaning, integration (merging), aggregation, …
- modeling
- evaluation
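
The data preparation phase (cleaning, integration, aggregation) can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions: the two tables, their column names, and the helpers `clean`, `integrate`, `aggregate`, and `prepare` are all hypothetical, and dropping rows with missing values is just one possible cleaning strategy. The result is a single table with one row per customer, in the spirit of "we build one table containing all the relevant data".

```python
from collections import defaultdict

# Hypothetical raw tables, invented for illustration only.
customers = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": "business"},
    {"customer_id": 3, "segment": "retail"},
]
purchases = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": None},  # missing value, removed by cleaning
    {"customer_id": 2, "amount": 300.0},
    {"customer_id": 3, "amount": 80.0},
    {"customer_id": 3, "amount": 20.0},
    {"customer_id": 4, "amount": 50.0},  # unknown customer, dropped by the join
]


def clean(rows, required):
    """Cleaning: keep only rows where every required field is present."""
    return [r for r in rows if all(r.get(k) is not None for k in required)]


def integrate(left, right, key):
    """Integration: inner join of two tables on `key`."""
    index = {r[key]: r for r in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def aggregate(rows, key, value):
    """Aggregation: total and count of `value` for each distinct `key`."""
    totals = defaultdict(lambda: {"total": 0.0, "count": 0})
    for r in rows:
        totals[r[key]]["total"] += r[value]
        totals[r[key]]["count"] += 1
    return dict(totals)


def prepare(customers, purchases):
    """Build one table with a single row per customer."""
    merged = integrate(clean(purchases, ["amount"]), customers, "customer_id")
    per_customer = aggregate(merged, "customer_id", "amount")
    segments = {c["customer_id"]: c["segment"] for c in customers}
    return [
        {"customer_id": cid, "segment": segments[cid], **stats}
        for cid, stats in sorted(per_customer.items())
    ]


table = prepare(customers, purchases)

if __name__ == "__main__":
    for row in table:
        print(row)
```

The resulting table is what the modeling phase would consume; whether to drop, impute, or flag incomplete rows is a decision that depends on the data understanding phase.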