Prof. Carolina Ruiz
Department of Computer Science
Worcester Polytechnic Institute
INTRODUCTION TO
KNOWLEDGE DISCOVERY IN DATABASES AND DATA MINING
WHAT IS DATA MINING? OR MORE GENERALLY, KNOWLEDGE DISCOVERY IN DATABASES (KDD)
“Non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” [Fayyad et al. 1996]
• Raw Data → Data Mining → Patterns
  • Analytical Patterns (rules, decision trees)
  • Statistical Patterns (data distribution)
  • Visual Patterns
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases." AI Magazine, 17(3), pp. 37-54. Fall 1996.
NEED FOR DATA MINING
• Data are being gathered and stored extremely fast
• Computational tools and techniques are needed to help humans in summarizing, understanding, and taking advantage of accumulated data
DATA ANALYSIS (KDD) PROCESS
• data sources
• data management → data
  • databases
  • data warehouses
• data “pre”-processing → clean data
  • noisy/missing data
  • dimensionality reduction
• data analysis / data mining → models
  • analytical
  • statistical
  • visual
• model/pattern evaluation → “good” model
  • quantitative
  • qualitative
• model/pattern deployment (applied to new data)
  • prediction
  • decision support
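The stages of this process can be sketched end to end on a toy problem. The following is only an illustrative sketch, not the method used in the slides: the data set, the mean-imputation step, and the one-threshold "model" are all assumptions chosen to keep the example short.

```python
from statistics import mean

# Hypothetical raw data: (feature value, or None if missing; class label).
raw = [(1.0, "low"), (1.5, "low"), (None, "low"), (2.0, "low"),
       (8.0, "high"), (9.0, "high"), (None, "high"), (7.5, "high")]

def preprocess(rows):
    """Pre-processing: replace missing values with the mean of the observed ones."""
    fill = mean(x for x, _ in rows if x is not None)
    return [(fill if x is None else x, y) for x, y in rows]

def mine(rows):
    """Data mining: learn a one-threshold rule halfway between the class means."""
    low = mean(x for x, y in rows if y == "low")
    high = mean(x for x, y in rows if y == "high")
    return (low + high) / 2

def predict(threshold, x):
    """Deployment: apply the learned rule to a single value."""
    return "high" if x > threshold else "low"

def evaluate(threshold, rows):
    """Quantitative evaluation: fraction of rows the rule labels correctly."""
    return sum(predict(threshold, x) == y for x, y in rows) / len(rows)

clean = preprocess(raw)                  # data -> clean data
train, test = clean[::2], clean[1::2]    # hold out every other row for evaluation
model = mine(train)                      # clean data -> model
accuracy = evaluate(model, test)         # model -> "good" model?
new_label = predict(model, 6.8)          # new data -> prediction

print(f"threshold={model:.2f} accuracy={accuracy:.2f} new_label={new_label}")
```

In practice each stage would be far richer (for example, imputing per class or reducing dimensionality), but the flow from raw data to a deployed model is the same.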
KDD IS INTERDISCIPLINARY: TECHNIQUES COME FROM MULTIPLE FIELDS
• Machine Learning (AI)
  • Contributes (semi-)automatic induction of empirical laws from observations & experimentation
• Statistics
  • Contributes language, framework, and techniques
• Pattern Recognition
  • Contributes pattern extraction and pattern matching techniques
• Databases
  • Contributes efficient data storage, data cleansing, and data access techniques
• Data Visualization
  • Contributes visual data displays and data exploration
• High-Performance Computing
  • Contributes techniques for efficiently handling complexity
• Application Domain
  • Contributes domain knowledge
DATA MINING MODES
• Confirmatory (verification)
  • Given a hypothesis, verify its validity against the data
• Exploratory (discovery)
  • Prescriptive patterns
    • Patterns for predicting behavior of newly encountered entities
  • Descriptive patterns
    • Patterns for presenting the behavior of observed entities in a human-understandable format
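The difference between the two modes can be illustrated with a small sketch. Everything here is hypothetical: the records, the rule form "passed iff hours >= threshold", and the hypothesis threshold of 5 are assumptions made for illustration only.

```python
# Hypothetical observations: (hours studied, passed the exam?).
records = [(1, False), (2, False), (3, False), (4, True),
           (5, True), (6, True), (7, True), (2, True)]

def rule_accuracy(threshold, data):
    """Fraction of records consistent with 'passed iff hours >= threshold'."""
    return sum((hours >= threshold) == passed for hours, passed in data) / len(data)

# Confirmatory mode: the analyst supplies a hypothesis and checks it against the data.
hypothesis_threshold = 5
confirmed = rule_accuracy(hypothesis_threshold, records)

# Exploratory mode: search the candidate rules and let the data pick the best one.
candidates = sorted({hours for hours, _ in records})
best_threshold = max(candidates, key=lambda t: rule_accuracy(t, records))
best = rule_accuracy(best_threshold, records)

# The discovered rule can describe the observed records, or be applied
# to a newly encountered entity (here, someone who studied 6 hours).
new_student_passes = 6 >= best_threshold

print(confirmed, best_threshold, best, new_student_passes)
```

In the confirmatory case the threshold is fixed in advance and only measured; in the exploratory case it is an output of the search.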
WHAT DO YOU WANT TO LEARN FROM YOUR DATA? KDD APPROACHES
Data →
• classification
• regression
• clustering
• summarization
• dependency/association analysis
• change/deviation detection
[Figure: sample outputs of the approaches above: a bar chart (East, West, North by quarter); rules (IF a & b & c THEN d & k; IF k & a THEN e); a decision tree with leaves labeled blue and orange; a graph over nodes A, B, C, D with weights 0.5, 0.75, 0.3; association rules (A, B -> C 80%; C, D -> A 22%)]
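A percentage attached to an association rule such as A, B -> C is commonly its confidence, although the slide does not say which measure is shown. The sketch below assumes that reading and computes support and confidence over a hypothetical set of transactions. The transactions are invented for illustration and do not reproduce the 22% figure from the slide.

```python
# Hypothetical market-basket transactions over items A-D.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"C", "D"},
    {"B", "D"},
    {"A", "C", "D"},
]

def count(itemset, data):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in data)

def support(itemset, data):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, data) / len(data)

def confidence(antecedent, consequent, data):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent. Returns 0.0 if the antecedent never occurs."""
    covered = count(antecedent, data)
    if covered == 0:
        return 0.0
    return count(antecedent | consequent, data) / covered

rule_conf = confidence({"A", "B"}, {"C"}, transactions)
rule_supp = support({"A", "B", "C"}, transactions)
print(f"A, B -> C  support={rule_supp:.0%}  confidence={rule_conf:.0%}")
```

A full association-rule miner would enumerate candidate itemsets and keep only the rules above chosen support and confidence thresholds; this sketch only scores a single given rule.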
COMMERCIAL DATA MINING SYSTEMS
• Matlab
• Oracle Data Mining
and lots more ...
ACADEMIC DATA MINING SYSTEMS
• WEKA: Frank et al., University of Waikato, New Zealand
• RapidMiner: Klinkenberg et al., Univ. of Dortmund, Germany
• R Programming Language: Ross Ihaka and Robert Gentleman, Univ. of Auckland, New Zealand
and many more ...
DATA MINING RESOURCES – JOURNALS
• Data Mining and Knowledge Discovery Journal
Newsletters:
• ACM SIGKDD Explorations Newsletter
Related Journals:
• TKDE: IEEE Transactions on Knowledge and Data Engineering
• TODS: ACM Transactions on Database Systems
• JACM: Journal of the ACM
• Data and Knowledge Engineering
• JIIS: Intl. Journal of Intelligent Information Systems
DATA MINING RESOURCES – CONFERENCES
• KDD: ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining
• ICDM: IEEE International Conference on Data Mining
• SDM: SIAM International Conference on Data Mining
• PKDD: European Conference on Principles and Practice of Knowledge Discovery in Databases
• PAKDD: Pacific-Asia Conference on Knowledge Discovery and Data Mining
• DaWaK: Intl. Conference on Data Warehousing and Knowledge Discovery
Related Conferences:
• ICML: Intl. Conf. on Machine Learning
• IDEAL: Intl. Conf. on Intelligent Data Engineering and Automated Learning
• IJCAI: International Joint Conference on Artificial Intelligence
• AAAI: American Association for Artificial Intelligence Conference
• SIGMOD/PODS: ACM Intl. Conference on Management of Data
• ICDE: International Conference on Data Engineering
• VLDB: International Conference on Very Large Data Bases
DATA MINING RESOURCES – BOOKS, DATASETS, …
See resources webpage at:
• http://web.cs.wpi.edu/~ruiz/KDDRG/resources.html
SUMMARY
• KDD is the “non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”
• The KDD process includes data collection and pre-processing, data mining, and evaluation and validation of those patterns
• Data mining is the discovery and extraction of patterns from data, not the extraction of data
• Important challenges in data mining: privacy, security, scalability, real-time processing, and handling non-conventional data
KDDRG: KNOWLEDGE DISCOVERY AND DATA MINING RESEARCH GROUP
• KDDRG Meetings
• WHEN? Fridays at 1 pm
• WHERE? Beckett Conference Room in Fuller Labs
• To receive announcements of the talks, please subscribe to the KDDRG mailing list
• I’ll send you an email with instructions on how to do so