CS570 Introduction to Data Mining
Department of Mathematics and Computer Science
Li Xiong
Today
• Meeting everybody in class
• Course topics
• Course logistics
1/18/2011 Data Mining: Concepts and Techniques 2
Meet your Instructor
• http://www.mathcs.emory.edu/~lxiong
• Have taught the class twice
• Research areas: data privacy and security, information management
• Committed to helping you learn
• If you have any questions, concerns, or suggestions
  • Come to my office hours (Tu and Th 3-4pm, WSC E412)
  • Email me ([email protected])
Meet your co-instructor and TA
• Co-instructor: Sharath R Cholleti, PhD
  • Co-teach some lectures
• TA: Liyue Fan, PhD student
  • Grading assignments and projects
  • Office hours: TBA
Meet everyone in class
• Your name
• Your research area
• Why you are here and what you hope to get out of the class
• Something interesting to share with the class
Today
• Meeting everybody in class
• Course topics
• Course logistics
What the class is about
• Classical data mining algorithms and techniques
  • Use and implementation
• Data mining on non-traditional data
  • Stream data mining, graph data mining
• Data mining in new domains
  • Privacy preserving data mining, distributed data mining
• Data warehousing
  • Multi-dimensional view of a database
Evolution of Sciences
• Before 1600, empirical science
  • Knowledge must be based on observable phenomena
  • Natural science vs. social sciences
• 1600-1950s, theoretical science
  • Motivate experiments and generalize our understanding (e.g. theoretical physics)
• 1950s-now, computational science
  • Traditionally meant simulation (e.g. computational physics)
  • Evolving to include information management
• 1960-now, data science
  • Flood of data from new scientific instruments and simulations
  • Ability to economically store and manage petabytes of data online
  • Accessibility of the data through the Internet and computing Grid
  • Scientific information management poses Computer Science challenges: acquisition, organization, query, analysis and visualization of the data
Jim Gray and Alex Szalay, The World Wide Telescope, Comm. ACM, 45(11): 50-54, Nov. 2002
Evolution of Data and Information Science
• 1960s:
  • Data collection, database creation, network DBMS
• 1970s:
  • Relational data model, relational DBMS implementation
• 1980s:
  • RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  • Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
  • Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
  • Stream data management and mining
  • Data mining and its applications
  • Web technology (XML, data integration) and global information systems
  • Social networks
A Brief History of Data Mining Society
• 1989 IJCAI Workshop on Knowledge Discovery in Databases
  • Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
• 1991-1994 Workshops on Knowledge Discovery in Databases
  • Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
  • Journal of Data Mining and Knowledge Discovery (1997)
• ACM SIGKDD conferences since 1998 and SIGKDD Explorations
• More conferences on data mining
  • PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
• ACM Transactions on KDD since 2007
What Is Data Mining?
• Data mining (knowledge discovery from data)
  • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  • Data mining really means knowledge mining
  • We are drowning in data, but starving for knowledge!
• Alternative names
  • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
  • Simple search and query processing
  • Small scale data analysis
  • Data dredging, data fishing, data snooping
Knowledge Discovery (KDD) Process
[Figure: KDD process pipeline. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection and transformation -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Statistics, Machine Learning, Artificial Intelligence, Database Technology, Visualization, and Other Disciplines]
Motivating challenges
• Scalability: massive datasets
  • Search strategies, novel data structures, disk-based algorithms, parallel algorithms
• High dimensionality: biological data, temporal or spatial data
• Heterogeneous data: semi-structured text, DNA data with sequential and 3D structure, networks
• Data distribution
  • Distributed data mining algorithms
• Data privacy
  • Privacy preserving data mining algorithms
Multi-Dimensional View of Data Mining
• Data View: Data to be mined
  • Relational, data warehouse, transactional, stream, object-oriented/relational, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
• Knowledge View: Knowledge to be mined
  • Association, classification, clustering, trend/deviation, outlier analysis, etc.
  • Multiple/integrated functions and mining at multiple levels
• Method View: Techniques utilized
  • Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
• Application View: Applications adapted
  • Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
  • Relational database
  • Data warehouse
  • Transactional database
• Advanced data sets and applications
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data (incl. bio-sequences)
  • Graphs, social networks and multi-linked data
  • Heterogeneous databases and legacy databases
  • Spatial data and spatiotemporal data
  • Multimedia database
  • Text databases
  • The World-Wide Web
Data Mining Functionalities
• Predictive: predict the value of a particular attribute based on the values of other attributes
  • Classification
  • Regression
• Descriptive: derive patterns that summarize the underlying relationships in data
  • Cluster analysis
  • Association analysis
Classification and prediction
• Classification: construct models (functions) that describe and distinguish classes for future prediction
• Prediction/regression: predict unknown or missing numerical values
• Derived models can be represented as rules, mathematical formulas, etc.
• Topics
  • Classification: Decision tree, Bayesian classification, Neural networks, Support vector machines, kNN
  • Regression: linear and non-linear regression
  • Ensemble methods
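
As a rough illustration of one classifier from the topic list, the following is a minimal kNN sketch in Python. The function name `knn_predict`, the toy training points, and the choice of Euclidean distance with majority voting are assumptions made for this example, not material from the slides.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest
    training points (Euclidean distance). `train` is a list of
    (feature_tuple, label) pairs."""
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups of 2-D points.
train = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((0.9, 1.1), "low"),
    ((5.0, 5.0), "high"), ((5.2, 4.9), "high"), ((4.8, 5.1), "high"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

kNN builds no explicit model; the training data itself plays that role, which is why it is often called a lazy learner.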
Classification example
Frequent pattern mining and association analysis
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
  • Frequent sequential pattern
  • Frequent structured pattern
• Applications
  • Basket data analysis: beer and diapers
  • Web log (click stream) analysis
  • DNA sequence analysis
• Challenge: efficient algorithms to handle exponential size of the search space
• Topics
  • Algorithms: Apriori, Frequent pattern growth, Vertical format
  • Closed and maximal patterns
  • Association rules mining
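
To make the frequent-itemset idea concrete, here is a small level-wise sketch in the spirit of Apriori. The function name `apriori`, the toy baskets, and the use of an absolute support count are assumptions for illustration; a real implementation would avoid rescanning all transactions for every candidate.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset that appears in
    at least `min_support` transactions, found level by level."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all of its (k-1)-subsets are
        # frequent, then confirm its own support against the data.
        level = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
            and support(c) >= min_support
        }
    return frequent


# Hypothetical market baskets.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
]
result = apriori(baskets, min_support=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is what keeps the exponential search space manageable: any itemset with an infrequent subset is discarded without counting it.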
Cluster and outlier analysis
• Cluster analysis
  • Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
  • Unsupervised learning (vs. supervised learning)
  • Maximizing intra-class similarity & minimizing inter-class similarity
• Outlier analysis
  • Outlier: Data object that does not comply with the general behavior of the data
  • Noise or exception? Useful in fraud detection, rare events analysis
  • E.g. an extremely large purchase
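
One simple way to flag an object that "does not comply with the general behavior of the data" is a z-score test. The sketch below is an assumption-laden illustration: the function name `zscore_outliers`, the threshold of 2.0, and the purchase amounts are all made up for this example, and the slides do not prescribe this method.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` population standard
    deviations away from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical purchase amounts with one extremely large purchase.
purchases = [20, 25, 22, 30, 27, 24, 26, 23, 5000]
print(zscore_outliers(purchases))
```

Whether the flagged value is noise or a meaningful exception (for example, fraud) is a domain question that the statistic alone cannot answer.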
Clustering Analysis
• Topics
  • Partitioning based clustering: k-means
  • Hierarchical clustering: classical, BIRCH
  • Density based clustering: DBSCAN
  • Model-based clustering: EM
  • Cluster evaluation
  • Outlier analysis
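
The partitioning approach can be sketched with a bare-bones k-means loop. This is a minimal illustration under stated assumptions: the function name `kmeans` and the toy points are hypothetical, and the first k points are used as initial centroids so the run is deterministic (practical implementations usually seed randomly or with k-means++).

```python
import math


def kmeans(points, k, iterations=20):
    """Alternate between assigning each point to its nearest centroid and
    moving each centroid to the mean of its assigned points."""
    centroids = list(points[:k])  # assumption: first k points as seeds
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep the old centroid if empty
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (0.8, 1.2), (5.2, 4.8), (4.8, 5.2)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Each iteration can only lower the total within-cluster squared distance, which reflects the "maximize intra-class similarity" goal of cluster analysis.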
Course topics
• Data preprocessing
• Classical data mining (machine learning) algorithms
  • Frequent itemsets mining
  • Classification
  • Cluster analysis
• Emerging applications
  • Graph mining and social network analysis
  • Privacy preserving data mining
  • Biomedical data mining
• Data warehousing and data cube computation
Graph Pattern Mining
• Frequent subgraph mining
  • Finding frequent subgraphs within a single graph
  • Finding frequent (sub)graphs in a set of graphs
• Applications of graph pattern mining
  • Biochemical structures, XML structures or Web communities
  • Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
Source: Mining and Searching Graphs in Graph Databases
Example: Frequent Subgraph Mining in Chemical Compounds
[Figure: a graph dataset of three chemical compounds, labeled (A), (B), and (C), and the two frequent subgraph patterns, labeled (1) and (2), found with a minimum support of 2]
Social Network Analysis
• Actor level: centrality, prestige, etc.
• Dyadic level: distance and reachability, etc.
• Triadic level: balance and transitivity
• Subset level: cliques, cohesive subgroups, components
• Network level: connectedness, diameter, centralization, density, etc.
Source: Social network analysis: methods and applications
Actor Centrality Example
• Degree centrality
• Betweenness centrality
• Closeness centrality
• Eigenvector centrality
Source: OrgNet.com
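
Two of the actor-level measures listed above can be sketched on an undirected, unweighted graph as follows. The function names, the adjacency-dict representation, and the star-shaped example graph are assumptions for illustration; the normalizations used (divide by n-1) are one common convention among several.

```python
from collections import deque


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(graph)
    return {v: len(neighbors) / (n - 1) for v, neighbors in graph.items()}


def closeness_centrality(graph):
    """(n - 1) divided by the sum of shortest-path distances to all other
    nodes, using breadth-first search. Nodes that cannot reach every other
    node are given 0.0 in this sketch."""
    n = len(graph)
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in graph[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (n - 1) / total if total and len(dist) == n else 0.0
    return result


# Hypothetical star network: "a" is linked to everyone else.
graph = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
print(degree)
print(closeness)
```

In the star network the hub scores highest on both measures, which matches the intuition that it is the most central actor.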
Link Analysis on WWW
• Ranking algorithms
  • PageRank
  • HITS
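
As a hedged sketch of how a link-based ranking can be computed, the following runs a plain power iteration for PageRank on a tiny link graph. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the three-page graph are assumptions for this example; every link target is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively redistribute rank along outgoing links. `links` maps
    each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outlinks in links.items():
            if outlinks:
                share = damping * rank[p] / len(outlinks)
                for q in outlinks:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical three-page web.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(ranks)
```

Page "c" collects links from both other pages, so it ends up ranked above "b", which is linked only from "a".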
Influence and Diffusion
[Figure: CDC: Spread of Airborne Disease]
Data mining is all good, but what about privacy?
• Privacy is not only a concern but a phenomenon
  • AOL data release
  • Netflix challenge
• Topics: algorithms that allow data mining while preserving individual information
  • Perturbation
  • Generalization
• Challenge: tradeoff between privacy, accuracy, and efficiency
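
Generalization can be illustrated by coarsening identifying attributes until each record blends in with others. The sketch below is only one possible reading: the k-anonymity style check, the function names, the choice of attributes (age and ZIP code), and the records themselves are all hypothetical and not taken from the slides.

```python
from collections import Counter


def generalize(record):
    """Coarsen an (age, zip_code) pair: age becomes a 10-year band and the
    ZIP code keeps only its first three digits."""
    age, zip_code = record
    low = (age // 10) * 10
    return (f"{low}-{low + 9}", zip_code[:3] + "**")


def is_k_anonymous(records, k):
    """True if every distinct record value occurs at least k times."""
    counts = Counter(records)
    return all(count >= k for count in counts.values())


# Hypothetical records: (age, ZIP code).
raw = [(61, "12345"), (67, "12399"), (34, "54321"), (38, "54388")]
generalized = [generalize(r) for r in raw]

print(is_k_anonymous(raw, 2))
print(generalized)
print(is_k_anonymous(generalized, 2))
```

The coarser values are less useful for analysis, which is the privacy versus accuracy tradeoff named in the last bullet.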
A Face Is Exposed for AOL Searcher No. 4417749
• 20 million Web search queries released by AOL (~650k users)
• User 4417749
  • “numb fingers”
  • “60 single men”
  • “dog that urinates on everything”
  • “landscapers in Lilburn, Ga”
  • Several people's names with the last name Arnold
  • “homes sold in shadow lake subdivision gwinnett county georgia”
• Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her dogs
Course topics
• Data preprocessing
• Classical data mining (machine learning) algorithms
  • Frequent itemsets mining
  • Classification
  • Cluster analysis
• Emerging applications
  • Graph mining and social network analysis
  • Privacy preserving data mining
  • Biomedical data mining
• Data warehousing and data cube computation
What is a Data Warehouse?
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
• Key aspects
  • A decision support database that is maintained separately from the organization’s operational database
  • Support OLAP (vs. OLTP)
• Data warehousing:
  • The process of constructing and using data warehouses
Multi Dimensional View: From Tables to Data Cubes
• A data warehouse is based on a multidimensional data model which views data in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
[Figure: data cube with axes including Item and Time]
Cube: A Lattice of Cuboids
• An n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
[Figure: lattice of cuboids for the dimensions time, item, location, supplier]
  0-D (apex) cuboid: all
  1-D cuboids: time; item; location; supplier
  2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
  3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
  4-D (base) cuboid: time,item,location,supplier
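
The lattice above is the set of all subsets of the dimensions, grouped by size. A short sketch can enumerate it; the function name `cuboid_lattice` is an assumption for this example, and the empty tuple stands in for the apex cuboid "all".

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Map each level k (0..n) to the list of k-dimensional cuboids, each
    written as a tuple of dimension names. Level 0 holds the apex cuboid
    (the empty tuple); level n holds the base cuboid."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}


lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for level, cuboids in lattice.items():
    print(f"{level}-D:", cuboids)

total = sum(len(cuboids) for cuboids in lattice.values())
print(total)  # 2**4 = 16 cuboids for 4 dimensions
```

With n dimensions (and no concept hierarchies) there are 2^n cuboids, which is why computing a full data cube becomes expensive as dimensions are added.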
Course topics
• Data preprocessing
• Classical data mining (machine learning) algorithms
  • Frequent itemsets mining
  • Classification
  • Cluster analysis
• Emerging applications
  • Graph mining and social network analysis
  • Privacy preserving data mining
  • Biomedical data mining
• Data warehousing and data cube computation
Top-10 Most Popular DM Algorithms: 18 Identified Candidates (I)
• Classification
  • #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
  • #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.
  • #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. TPAMI. 18(6)
  • #4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398.
• Statistical Learning
  • #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag.
  • #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.
• Association Analysis
  • #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
  • #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00.
The 18 Identified Candidates (II)
• Link Mining
  • #9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In WWW-7, 1998.
  • #10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. SODA, 1998.
• Clustering
  • #11. K-Means: MacQueen, J. B., Some methods for classification and analysis of multivariate observations, in Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1967.
  • #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96.
• Bagging and Boosting
  • #13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.
The 18 Identified Candidates (III)
• Sequential Patterns
  • #14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology, 1996.
  • #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE '01.
• Integrated Mining
  • #16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and association rule mining. KDD-98.
• Rough Sets
  • #17. Finding reduct: Zdzislaw Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Norwell, MA, 1992
• Graph Mining
  • #18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In ICDM '02.
Top-10 Algorithms Selected at ICDM’06
• #1: C4.5 Decision Tree - Classification (61 votes)
• #2: K-Means - Clustering (60 votes)
• #3: SVM - Classification (58 votes)
• #4: Apriori - Frequent Itemsets (52 votes)
• #5: EM - Clustering (48 votes)
• #6: PageRank - Link mining (46 votes)
• #7: AdaBoost - Boosting (45 votes)
• #7: kNN - Classification (45 votes)
• #7: Naive Bayes - Classification (45 votes)
• #10: CART - Classification (34 votes)
Today
• Meeting everybody in class
• Course topics
• Course logistics
Textbook
• Data Mining: Concepts and Techniques. J. Han and M. Kamber. Second edition
Other Reference Books
• S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
• R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
• T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
• U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001
• D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
• T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, 2001
• B. Liu, Web Data Mining, Springer, 2006
• T. M. Mitchell, Machine Learning, McGraw Hill, 1997
• G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
• P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
• S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
• I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed. 2005
Conferences and Journals on Data Mining
• KDD Conferences
  • ACM SIGKDD Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  • SIAM Data Mining Conf. (SDM)
  • (IEEE) Conf. on Data Mining (ICDM)
  • Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
  • Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
  • ACM Conference on Applied Computing (SAC), Data Mining track
• Other related conferences
  • ACM SIGMOD
  • VLDB
  • (IEEE) ICDE
  • WWW, SIGIR
  • ICML, CVPR, NIPS
• Journals
  • Data Mining and Knowledge Discovery (DAMI or DMKD)
  • IEEE Trans. on Knowledge and Data Eng. (TKDE)
  • KDD Explorations
  • ACM Trans. on KDD
Resources (Google, DBLP, conferences)
• Data mining and KDD (SIGKDD: CDROM)
  • Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  • Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
• Database systems (SIGMOD: ACM SIGMOD Anthology, CD ROM)
  • Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  • Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
• AI & Machine Learning
  • Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
  • Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
• Web and IR
  • Conferences: SIGIR, WWW, CIKM, etc.
  • Journals: WWW: Internet and Web Information Systems
• Statistics
  • Conferences: Joint Stat. Meeting, etc.
  • Journals: Annals of Statistics, etc.
• Visualization
  • Conference proceedings: CHI, ACM-SIGGraph, etc.
  • Journals: IEEE Trans. Visualization and Computer Graphics, etc.
Workload
• 2-3 programming assignments (individual)
  • Implementation of classical algorithms and competition!
• 2-3 written/reading assignments
• 1 paper presentation
• 1 open-ended course project (team of up to 2 students) with project presentation
  • Application and evaluation of existing algorithms to interesting data
  • Design of new algorithms to solve new problems
  • Survey of a class of algorithms
• 1 midterm
• No final exam
Grading
• Assignments/presentations: 40
• Final project: 30
• Midterm: 30
Late Policy
• Late assignments will be accepted within 3 days of the due date and penalized 10% per day
• 1 late assignment allowance, which can be used to turn in a single late assignment within 3 days of the due date without penalty
Today
• Meeting everybody in class
• Course topics
• Course logistics
• Next lecture: data preprocessing