# Lecture

- online in-class test: 20.–24. 4.
- organization
    - practicals (40 \%)
        - homework (traditional or trial alternative): 10 points
        - quizzes: 5×5 = 25 points
            - in Moodle throughout the semester
        - project (traditional or trial alternative): 15 points
            - usually, students propose their own topics
            - with a presentation
        - we can even get some bonus points for cooperation during practicals
    - exam (60 \%)
        - online in-class test (during lab practical) 15 \%
        - final exam (written and oral part) 45 \%
    - traditional version: no AI, individual
        - traditional project – can be in teams
    - trial alternative (for project+homework)
        - can be solved individually or in small teams of two (or three) students
        - AI tools allowed
            - if used, test at least two and compare their benefits/drawbacks
        - involves two presentations
            - traditional one
            - software-oriented one – instead of the homework!
        - agreement with the publication of the project is necessary

## Introduction

- data mining from databases
    - non-trivial process of extracting implicit, previously unknown, but potentially useful information from the data
    - originated in the 90s (there was not enough data before)
    - knowledge discovery in databases (KDD)
    - data mining (DM) – business intelligence (BI) and big data
- foundations
    - artificial intelligence, machine learning methods
    - database systems (to store large data sets), information retrieval
    - statistics – modeling and analysis of dependencies found in the data
    - \+ how to use the results for decision-making
- data mining is an interactive and iterative process
    - data preparation
        - we build one table containing all the relevant data
        - selection
        - preprocessing
        - transformation
    - the actual “data mining” – we find *patterns* in the data
    - interpretation – the knowledge found should be evaluated from the point of view of the end user (manager, customer, etc.)
- point of view of a manager
    - there's a topical issue
    - the goal of the data mining process is to obtain as much information as possible that is relevant to solving the problem
    - example
        - find groups of customers of a department store to offer special services to
        - the groups found can be interpreted as segments in the given market area
- steps
    - form a team: data analyst, domain expert, expert on databases, …
    - specify the problem
    - obtain all data available
        - we should also obtain the external data describing the environment of the analyzed processes (time period of the year, advertising, political issues, weather, …)
    - select the methods
        - clustering, classification, exploratory data analysis, association rules, decision trees, genetic algorithms, Bayesian networks, neural networks
        - visualization methods – helpful for presentation
    - preprocess the data
    - mine the data
    - interpret the results
        - we may need to create an analytical report
        - make the results easy to understand
        - the outcome can also be carrying out a reasonable action
- tasks
    - classification and prediction
        - goal: predict a continuous or discrete value based on some attributes
        - interpretation may be challenging
        - prediction: weather forecast, stock prices, …
        - we should be able to cover the entire domain (all the data may be useful for a reasonable prediction)
    - description
        - goal: find a dominant structure or relationships
        - we may ignore some of the information; the extracted knowledge does not need to be that precise (but it should be easily understandable)
    - looking for “nuggets”
        - goal: find some interesting knowledge (does not have to fully cover the given concept)
- real tasks
    - segmentation and classification of bank clients
    - causes of failures in telecommunication networks
    - causes of change of service provider
    - prediction of power consumption
    - analysis of the patient database in a hospital
        - Florence Nightingale
        - Ignaz Semmelweis
    - market basket analysis

## Methodologies

- goal: provide the users with a unified framework; guide data mining applications regardless of industry
- SEMMA
    - sample – select data for modeling
        - may include sampling, imputation (adding other useful information, e.g. adding seasons of the year to the data about the sales), partitioning (train-test-validation split)
    - explore – visual exploration and dimensionality reduction
    - modify – prepare the objects, values, and variables for data modeling; transform the data
    - model – apply data mining techniques (decision trees, regression models, NNs, …)
        - create models providing relevant outcome
    - assess – evaluate the results of modeling (assess their reliability and usefulness)
- CRISP-DM – cross-industry standard process for data mining; a robust general-purpose model
    - business understanding
        - determine our business objective
        - assess our present situation, what data we have
        - risk assessment
        - setting KPIs
    - data understanding
        - collect and describe initial data; explore and visualize it
        - verify the quality of the data
    - data preparation
        - cleaning, integration (merging), aggregation, …
    - modeling
    - evaluation
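
The partitioning mentioned under the SEMMA sample step (train-test-validation split) could be sketched roughly as follows. This is only a minimal illustration using the Python standard library; the function name `partition`, the 60/20/20 fractions, the fixed seed, and the toy `sales` records are assumptions made for the example, not something prescribed by SEMMA or by the lecture.

```python
import random


def partition(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle the rows and split them into train, validation and test lists.

    The fractions are illustrative defaults; whatever remains after the
    train and validation shares goes to the test set.
    """
    if not 0 < train < 1 or not 0 <= validation < 1 or train + validation >= 1:
        raise ValueError("fractions must leave a non-empty share for the test set")

    shuffled = list(rows)                  # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded generator for reproducibility

    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# Hypothetical toy data: ten sales records.
sales = [{"id": i, "amount": 10 * i} for i in range(10)]
train_set, validation_set, test_set = partition(sales)
print(len(train_set), len(validation_set), len(test_set))
```

Shuffling before the split avoids accidentally putting, for example, only the oldest records into the training set. For time-dependent data such as sales, a chronological split may be more appropriate; that choice depends on the task.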