Page 1: Intelligent Data Analysis and Data Miningavellido/teaching/12-13/Intro1_IDADM_12-09-12_web.pdf“Data Mining, by its simplest definition, automates the detection of ... environments

Lluis Belanche + Alfredo Vellido

Intelligent Data Analysis and Data Mining, a.k.a. Data Mining II

Office 319, Omega, BCN; EET, office 107, TR‐2, Terrassa

[email protected], gtalk: avellido

Tels.: 934137796, 937398090

www.lsi.upc.edu/~avellido/teaching/data_mining.html

…/~belanche/docencia/aiddm/aiddm.html

Contents of the course (disclaimer: but who knows)

1. Introduction to DM and its methodologies
2. Visual DM: Exploratory DM through visualization
3. Pattern recognition 1
4. Pattern recognition 2
5. Feature extraction
6. Feature selection
7. Error estimation
8. Linear classifiers, kernels and SVMs
9. Probability in Data Mining
10. Nonlinear Dimensionality Reduction (NLDR)
11. Applications of NLDR: biomed & beyond
12. DM Case studies

IDADM

IDADM 2012/2013. Alfredo Vellido

An Introduction to Mining (1)

What is DATA MINING? (1)

“Data Mining is the process of discovering actionable and meaningful patterns, profiles, and trends by sifting through your data using pattern recognition technologies (…) is a hot new technology about one of the oldest processes of human endeavour: pattern recognition (…) It is an iterative process of extracting knowledge from business transactions (…) DM is the automatic discovery of usable knowledge from your stored data.”

Jesús Mena: Data Mining your Website (Digital Press, 1999, available @ books.google)

What is DATA MINING? (2)

“Data Mining, by its simplest definition, automates the detection of relevant patterns in a database (…) For many years, statisticians have manually “mined” databases (…) DM uses well‐established statistical and machine learning techniques to build models that predict customer behaviour. Today, technology automates the mining process, integrates it with commercial data warehouses, and presents it in a relevant way for business users (…) the leading DM products address the broader business and technical issues, such as their integration into complex IT environments.”

Berson, Smith, & Thearling: Building Data Mining Applications for CRM (McGraw‐Hill, 2000)

What is DATA MINING? (3)

WIKIPEDIA 2005 DIXIT: “Data mining has been defined as "The nontrivial extraction of implicit, previously unknown, and potentially useful information from data" (1) and "The science of extracting useful information from large data sets or databases" (2). Although it is usually used in relation to analysis of data, data mining, like artificial intelligence, is an umbrella term and is used with varied meaning in a wide range of contexts.”

(1) W. Frawley and G. Piatetsky‐Shapiro and C. Matheus, Knowledge Discovery in Databases: An Overview. AI Magazine, 1992, 213‐228.

(2) D. Hand, H. Mannila, P. Smyth: Principles of Data Mining. MIT Press, 2001.

en.wikipedia.org/wiki/Data_mining

What is DATA MINING? (4)

WIKIPEDIA’06 DIXIT: “Data mining (DM), also called Knowledge‐Discovery in Databases (KDD) or Knowledge‐Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns such as association rules. It is a fairly recent topic in computer science but applies many older computational techniques from statistics, information retrieval, machine learning and pattern recognition.”

What is DATA MINING? (5)

In 1996, in the proceedings of the 1st International Conference on KDD, Fayyad gave one of the best‐known definitions of Knowledge Discovery from Data:

“The non‐trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.”

KDD quickly gathered strength as an interdisciplinary research field in which a combination of advanced techniques from Statistics, Artificial Intelligence, Information Systems, and Visualization is used to tackle knowledge acquisition from large databases. The term Knowledge Discovery from Data appeared in 1989, referring to the:

“[...] overall process of finding and interpreting patterns from data, typically interactive and iterative, involving repeated application of specific data mining methods or algorithms and the interpretation of the patterns generated by these algorithms.”

What is DATA MINING? (6)

WIKIPEDIA’08 DIXIT: “Data mining is the process of sorting through large amounts of data and picking out relevant information. It is usually used by business intelligence organizations, and financial analysts, but is increasingly being used in the sciences to extract information from the enormous data sets generated by modern experimental and observational methods. It has been described as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and "the science of extracting useful information from large data sets or databases." Data mining in relation to enterprise resource planning is the statistical and logical analysis of large sets of transaction data, looking for patterns that can aid decision making.”

What is DATA MINING? (7)

WIKIPEDIA’10 gave up:

BOTTOM LINE: The concept of DM, even if fairly well‐established, is still quite fluid.

What to expect from a DM conference… (good and bad examples, starting with a rather bad one)

15‐17 September’04: Wessex Institute of Technology (W.I.T.), Málaga, Spain

Data Mining 2004: Main Topics

Sessions 1 & 2: Text Mining

Session 3: Web Mining

Session 4: Clustering Techniques

Session 5: Data Preparation Techniques

Sessions 6 & 7: Applications in Business, Industry and Government

Session 8: Customer Relationship Management (CRM)

Sessions 9 & 10: Applications in Science and Engineering

Data Mining 2007: Main Topics

Session 1: Categorisation Methods

Session 2: Data Preparation

Session 3: Enterprise Information Systems

Session 4: Clustering Techniques

Session 5: National Security

Session 6: Data and Text Mining

Session 7: Mining Environmental and Geospatial Data

Session 8: Applications in Business, Industry and Government

Data Mining 2008: Late years …

Data Mining 2009: Late years …

Investigative Data Mining For Security And Criminal Detection
Jesús Mena
Butterworth‐Heinemann 2003

A different (good) conference, a different take …
IEEE CIDM 2012, Brussels

2012 IEEE Symposium on Computational Intelligence and Data Mining

• Data mining foundations
    • Novel data mining algorithms in traditional areas (such as classification, regression, clustering, probabilistic modeling, and association analysis)
    • Algorithms for new, structured, data types, such as arising in chemistry, biology, environment, and other scientific domains
    • Developing a unifying theory of data mining
    • Mining sequences and sequential data
    • Mining spatial and temporal datasets
    • Mining textual and unstructured datasets
    • High performance implementations of data mining algorithms

A different conference, a different take …
IEEE CIDM 2012, Brussels

2012 IEEE Symposium on Computational Intelligence and Data Mining

• Mining in targeted application contexts
    • Mining high speed data streams
    • Mining sensor data
    • Distributed data mining and mining multi‐agent data
    • Mining in networked settings: web, social and computer networks, and online communities
    • Data mining in electronic commerce, such as recommendation, sponsored web search, advertising, and marketing tasks

A different conference, a different take …
IEEE CIDM 2012, Brussels

2012 IEEE Symposium on Computational Intelligence and Data Mining

• Methodological aspects and the KDD process
    • Data pre‐processing, data reduction, feature selection, and feature transformation
    • Quality assessment, interestingness analysis, and post‐processing
    • Statistical foundations for robust and scalable data mining
    • Handling imbalanced data
    • Automating the mining process and other process related issues
    • Dealing with cost sensitive data and loss models
    • Human‐machine interaction and visual data mining
    • Security, privacy, and data integrity

A different conference, a different take …
IEEE CIDM 2012, Brussels

2012 IEEE Symposium on Computational Intelligence and Data Mining

• Integrated KDD applications and systems
    • Bioinformatics, computational chemistry, geoinformatics, and other science & engineering disciplines
    • Computational finance, online trading, and analysis of markets
    • Intrusion detection, fraud prevention, and surveillance
    • Healthcare, epidemic modeling, and clinical research
    • Customer relationship management
    • Telecommunications, network and systems management

But let’s talk money ...

Where is the money in DM?

www.darpa.mil

What’s DATA MINING?: A procedural viewpoint

What’s DATA MINING?: A historicist viewpoint

[Diagram labels: STATISTICS; ARTIFICIAL INTELLIGENCE; EXPERT SYSTEMS; MACHINE LEARNING; DB MANAGEMENT; PATT RECOG; DM; KDD]

What’s DATA MINING?: A historicist viewpoint

[Diagram labels: STATISTICS; ARTIFICIAL INTELLIGENCE; OTHERS…; KDD; MACHINE LEARNING; Bio-plausible Models; Algor. Devel.; Probabilistic Models; ADVANCED PROBABILISTIC MODELS]

DATA MINING as a methodology

CRISP: a DM methodology

CRoss‐Industry Standard Process for Data Mining: neutral methodology from the point of view of industry, tool and application (free & non‐proprietary)

Pete Chapman, Randy Kerber (NCR); Julian Clinton, Thomas Khabaza, Colin Shearer (SPSS); Thomas Reinartz, Rüdiger Wirth (DaimlerChrysler)

CRISP‐DM was conceived in 1996

DaimlerChrysler: leaders in industrial application; SPSS: leaders in product development (Clementine, 1994); NCR: owners of large (huge!) databases (Teradata)

Financed by the EU. Version 1.0 released officially in 1999

CRISP: Hierarchic structure of the methodology

CRISP: Description of phases

Problem/Business understanding: study of targets and requirements from the business/problem viewpoint. Defining it as a DM problem.

Data understanding: data collection; getting to know the data, trying to detect both quality problems and interesting features.

Data preparation: Preparing the data set to be modelled, starting from raw data. This is an iterative and exploratory process. Selection of files, tables, variables, record samples… plus data cleaning.

Modelling: Data analysis using modelling techniques that are suitable for the problem at hand. Includes fiddling with the models, tuning their parameters, etc.

Evaluation: All previous steps must be evaluated as a whole (as a unitary process), and we must decide whether the deliverables so far meet the DM challenge.

Implementation: All the knowledge acquired to this point must be organized and presented to the “client” in a usable form. We must define, together with this client, a protocol to reliably deploy the DM findings.
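The phases above can be walked through on a toy problem. The following is only a minimal sketch under loud assumptions: the records, the churn framing, the threshold "model" and every function name are hypothetical illustrations, not part of CRISP‐DM or of the course material.

```python
from statistics import mean

# Problem/Business understanding (assumed goal): flag customers likely to churn.
# Hypothetical toy records: (monthly_spend, churned). None marks a missing value.
RAW = [(10.0, 1), (12.0, 1), (None, 0), (55.0, 0),
       (60.0, 0), (15.0, 1), (58.0, 0), (11.0, 1)]

def data_understanding(rows):
    """Get to know the data: count records and missing values."""
    missing = sum(1 for spend, _ in rows if spend is None)
    return {"n_rows": len(rows), "n_missing": missing}

def data_preparation(rows):
    """Data cleaning: drop records whose spend is missing."""
    return [(spend, label) for spend, label in rows if spend is not None]

def modelling(rows):
    """Fit a one-parameter rule: predict churn when spend < threshold."""
    churn_mean = mean(spend for spend, label in rows if label == 1)
    stay_mean = mean(spend for spend, label in rows if label == 0)
    return (churn_mean + stay_mean) / 2

def evaluation(rows, threshold):
    """Accuracy of the rule. Measured on the training records for brevity,
    which is optimistic; a held-out sample would be the honest choice."""
    hits = sum(1 for spend, label in rows
               if (1 if spend < threshold else 0) == label)
    return hits / len(rows)

def implementation(threshold):
    """Present the finding to the "client" in a usable form."""
    return f"Flag customers spending under {threshold:.1f} per month"

report = data_understanding(RAW)
clean = data_preparation(RAW)
threshold = modelling(clean)
accuracy = evaluation(clean, threshold)
rule = implementation(threshold)
print(report, accuracy, rule)
```

Each function maps to one phase, so the hand-offs between phases are explicit: the cleaned data feeds modelling, and the model plus the data feed evaluation.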

CRISP: The virtuous loop of methodology phases
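The slide's figure did not survive in this transcript. As a rough sketch of what such a loop implies, the code below assumes the simplest reading: each pass runs the phases up to evaluation, and a pass that falls short of the target sends the project back to business understanding. The function name, the scoring callback and the numbers are hypothetical; the standard CRISP‐DM diagram also allows shorter back-and-forth steps between neighbouring phases, which this sketch leaves out.

```python
from typing import Callable, List

PHASES = [
    "business understanding",
    "data understanding",
    "data preparation",
    "modelling",
    "evaluation",
    "implementation",
]

def run_loop(score_of_pass: Callable[[int], float],
             target: float,
             max_passes: int = 5) -> List[str]:
    """Return the phases visited, in order, until deployment or give-up."""
    trace: List[str] = []
    for pass_number in range(1, max_passes + 1):
        # Every pass walks the phases up to and including evaluation.
        trace.extend(PHASES[:5])
        if score_of_pass(pass_number) >= target:
            trace.append(PHASES[5])  # good enough: move on to implementation
            break
        # Otherwise loop back to business understanding and try again.
    return trace

# Hypothetical evaluation scores that improve with each pass: 0.6, 0.7, 0.8, ...
trace = run_loop(lambda k: 0.5 + 0.1 * k, target=0.75)
print(len(trace), trace[-1])
```

With these made-up scores the third pass is the first to reach the target, so the trace contains three evaluations before the single implementation step.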
