Slides are based on Negnevitsky, Pearson Education, 2005

Lecture 14
Data mining and knowledge discovery
– Introduction, or what is data mining?
– Data warehouse and query tools
– Decision trees
– Case study: Profiling people with high blood pressure
– Summary
What is data mining?
Data is what we collect and store, and knowledge is what helps us to make informed decisions.
The extraction of knowledge from data is called data mining.
Data mining can also be defined as the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.
The ultimate goal of data mining is to discover knowledge.
Why data mining?
The explosive growth of data: from terabytes to petabytes
– Data collection and data availability
» Automated data collection tools, database systems, the Web, a computerized society
– Major sources of abundant data
» Business: Web, e-commerce, transactions, stocks, …
» Science: remote sensing, bioinformatics, scientific simulation, …
» Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
Why not traditional data analysis?
Tremendous amount of data
– Algorithms must be highly scalable to handle terabytes of data
High dimensionality of data
– Microarray data may have tens of thousands of dimensions
High complexity of data
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structured data, graphs, social networks and multi-linked data
– Heterogeneous databases and legacy databases
– Spatial, spatiotemporal, multimedia, text and Web data
– Software programs, scientific simulations
New and sophisticated applications
Knowledge Discovery (KDD) Process
– Data mining: the core of the knowledge discovery process
[Figure: the KDD pipeline — Databases → Data Cleaning → Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]
KDD Process: Several Key Steps
Learning the application domain
– relevant prior knowledge and goals of the application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of the effort!)
Data reduction and transformation
– Find useful features; dimensionality/variable reduction; invariant representation
Choosing functions of data mining
– summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
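The cleaning → selection → mining → evaluation flow listed above can be sketched end to end on a toy record set. Everything here (field names, thresholds, and the trivial "mining" step) is invented for illustration, not taken from the lecture:

```python
from collections import Counter

# Hypothetical raw records: two are dirty and must be cleaned away.
raw = [
    {"age": 42, "income": 54000, "responded": "yes"},
    {"age": None, "income": 61000, "responded": "no"},   # missing value
    {"age": 35, "income": 61000, "responded": "no"},
    {"age": 29, "income": -1, "responded": "yes"},       # invalid income
]

# 1. Data cleaning: drop records with missing or invalid fields.
clean = [r for r in raw if r["age"] is not None and r["income"] > 0]

# 2. Selection/transformation: keep task-relevant attributes, discretise income.
target = [{"income_band": "high" if r["income"] > 50000 else "low",
           "responded": r["responded"]} for r in clean]

# 3. "Data mining": count how often each income band responded.
patterns = Counter((r["income_band"], r["responded"]) for r in target)

# 4. Pattern evaluation: report the response rate per band.
for band in ("low", "high"):
    yes = patterns[(band, "yes")]
    total = yes + patterns[(band, "no")]
    if total:
        print(band, yes / total)
```

A real pipeline would, of course, replace step 3 with an actual mining algorithm (decision trees, clustering, association rules), but the staging is the same.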
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the intersection of database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines]
Architecture: Typical Data Mining System
[Figure: data sources (database, data warehouse, the World Wide Web, other information repositories) feed — via data cleaning, integration, and selection — a database or data warehouse server; above it sit the data mining engine and the pattern evaluation module, both consulting a knowledge base, all topped by a graphical user interface]
Data Mining Functionalities (1)
Frequent patterns, association, correlation vs. causality
– Diaper → Beer [0.5%, 75%] (correlation or causality?)
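The two bracketed numbers in the Diaper → Beer rule are its support and confidence. A minimal sketch of how both are computed, on an invented set of market baskets:

```python
# Toy transaction database: each basket is a set of items (invented data).
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"diaper", "beer"} <= t)
antecedent = sum(1 for t in transactions if "diaper" in t)

support = both / n              # fraction of ALL baskets containing both items
confidence = both / antecedent  # of baskets with diaper, fraction also with beer

print(support, confidence)
```

Here the rule diaper → beer has support 0.5 and confidence 2/3; the slide's [0.5%, 75%] figures are read the same way, just over a vastly larger database.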
Classification and prediction
– Construct models (functions) that describe and distinguish classes or concepts for future prediction
» E.g., classify countries based on climate, or classify cars based on gas mileage
– Predict some unknown or missing numerical values
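One of the simplest classifiers is nearest-neighbour (kNN, which reappears in the top-10 list later in this deck). A hedged sketch using the slide's car example, with invented features (gas mileage, weight) and labels:

```python
# Invented training data: cars as (gas mileage, weight) with a class label.
train = [
    ((45.0, 1000.0), "economy"),
    ((40.0, 1100.0), "economy"),
    ((15.0, 2200.0), "luxury"),
    ((12.0, 2500.0), "luxury"),
]

def classify(x):
    # 1-nearest-neighbour: predict the label of the closest training example.
    def sq_dist(a):
        return sum((ai - xi) ** 2 for ai, xi in zip(a, x))
    return min(train, key=lambda pair: sq_dist(pair[0]))[1]

print(classify((42.0, 1050.0)))  # "economy"
print(classify((14.0, 2300.0)))  # "luxury"
```

In practice the features would be scaled first, since raw weight dominates raw mileage in the distance; this sketch gets away with it only because the toy classes are well separated.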
Data Mining Functionalities (2)
Cluster analysis
– Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
– Maximizing intra-class similarity and minimizing inter-class similarity
Outlier analysis
– Outlier: a data object that does not comply with the general behavior of the data
– Noise or exception? Useful in fraud detection and rare-events analysis
Data Mining Functionalities (3)
Trend and evolution analysis
– Trend and deviation: e.g., regression analysis
– Sequential pattern mining: e.g., digital camera → large SD memory
– Periodicity analysis
– Similarity-based analysis
Other pattern-directed or statistical analyses
Top-10 Most Popular DM Algorithms: 18 Identified Candidates (I)
Classification
– #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
– #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.
– #3. k-Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. TPAMI 18(6).
– #4. Naive Bayes: Hand, D. J. and Yu, K. 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398.
(II)
Statistical Learning
– #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag.
– #6. EM: McLachlan, G. and Peel, D. 2000. Finite Mixture Models. J. Wiley, New York.
Association Analysis
– #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
– #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00.
(III)
Link Mining
– #9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In WWW-7, 1998.
– #10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. SODA, 1998.
(IV)
Clustering
– #11. K-Means: MacQueen, J. B. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1967.
– #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96.
Bagging and Boosting
– #13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.
(V)
Sequential Patterns
– #14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology, 1996.
– #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE '01.
Integrated Mining
– #16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and association rule mining. KDD-98.
(VI)
Rough Sets
– #17. Finding reduct: Zdzislaw Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Norwell, MA, 1992.
Graph Mining
– #18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In ICDM '02.
Top-10 Algorithms Finally Selected at ICDM '06
#1: C4.5 (61 votes)
#2: K-Means (60 votes)
#3: SVM (58 votes)
#4: Apriori (52 votes)
#5: EM (48 votes)
#6: PageRank (46 votes)
#7: AdaBoost (45 votes)
#7: kNN (45 votes)
#7: Naive Bayes (45 votes)
#10: CART (34 votes)
Conferences and Journals on Data Mining
KDD Conferences
– ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
– SIAM Data Mining Conf. (SDM)
– (IEEE) Int. Conf. on Data Mining (ICDM)
– Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
– Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
Other related conferences
– ACM SIGMOD
– VLDB
– (IEEE) ICDE
– WWW, SIGIR
– ICML, CVPR, NIPS
Journals
– Data Mining and Knowledge Discovery (DAMI or DMKD)
– IEEE Trans. on Knowledge and Data Eng. (TKDE)
– KDD Explorations
– ACM Trans. on KDD
Data warehouse
Modern organisations must respond quickly to any change in the market. This requires rapid access to current data, normally stored in operational databases.
However, an organisation must also determine which trends are relevant. This task is accomplished with access to historical data stored in large databases called data warehouses.
The main characteristic of a data warehouse is its capacity. A data warehouse is really big – it includes millions, even billions, of data records.
The data stored in a data warehouse is time dependent – linked together by the times of recording – and integrated – all relevant information from the operational databases is combined and structured in the warehouse.
Query tools
A data warehouse is designed to support decision-making in the organisation. The information needed can be obtained with query tools.
Query tools are assumption-based – a user must ask the right questions.
How is data mining applied in practice?
Many companies use data mining today, but refuse to talk about it.
In direct marketing, data mining is used for targeting people who are most likely to buy certain products and services.
In trend analysis, it is used to determine trends in the marketplace, for example, to model the stock market. In fraud detection, data mining is used to identify insurance claims, cellular phone calls and credit card purchases that are most likely to be fraudulent.
Motivation: finding latent relationships in data
– What products were often purchased together? Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
Applications
– Market basket data analysis (shelf-space planning, increasing sales, promotion)
– Cross-marketing
– Catalog design
– Sale campaign analysis
– Web log (click stream) analysis
– DNA sequence analysis
Data mining tools
Data mining is based on intelligent technologies already discussed in this book. It often applies such tools as neural networks and neuro-fuzzy systems.
However, the most popular tool used for data mining is the decision tree.
Decision trees
A decision tree can be defined as a map of the reasoning process. It describes a data set by a tree-like structure.
Decision trees are particularly good at solving classification problems.
ID3
Training examples, each described by (height, hair colour, eye colour) with a class label:
(tall, blond, blue) w
(short, silver, blue) w
(short, black, blue) w
(tall, blond, brown) w
(tall, silver, blue) w
(short, blond, blue) w
(short, black, brown) e
(tall, silver, black) e
(short, black, brown) e
(tall, black, brown) e
(tall, black, black) e
(short, blond, black) e
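ID3 picks the attribute whose split yields the highest information gain (the drop in class entropy). A sketch of that choice on the twelve examples above; the attribute names are inferred from the attribute values:

```python
from collections import Counter
from math import log2

examples = [
    (("tall", "blond", "blue"), "w"), (("short", "silver", "blue"), "w"),
    (("short", "black", "blue"), "w"), (("tall", "blond", "brown"), "w"),
    (("tall", "silver", "blue"), "w"), (("short", "blond", "blue"), "w"),
    (("short", "black", "brown"), "e"), (("tall", "silver", "black"), "e"),
    (("short", "black", "brown"), "e"), (("tall", "black", "brown"), "e"),
    (("tall", "black", "black"), "e"), (("short", "blond", "black"), "e"),
]
attrs = ["height", "hair", "eyes"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(i):
    # Information gain = entropy before the split minus the weighted
    # entropy of the subsets produced by attribute i.
    base = entropy([y for _, y in examples])
    by_value = {}
    for x, y in examples:
        by_value.setdefault(x[i], []).append(y)
    remainder = sum(len(ys) / len(examples) * entropy(ys)
                    for ys in by_value.values())
    return base - remainder

best = max(range(3), key=gain)
print(attrs[best])  # "eyes"
```

Height splits the classes 3/3 on both sides (zero gain), while eye colour isolates all six "w" blue-eyed examples at once, so ID3 splits on eye colour first.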
A decision tree consists of nodes, branches and leaves.
The top node is called the root node. The tree always starts from the root node and grows down by splitting the data at each level into new nodes. The root node contains the entire data set (all data records), and child nodes hold respective subsets of that set.
All nodes are connected by branches.
Nodes that are at the end of branches are called terminal nodes, or leaves.
How does a decision tree select splits?
A split in a decision tree corresponds to the predictor with the maximum separating power. The best split does the best job of creating nodes where a single class dominates.
One of the best-known methods of calculating a predictor's power to separate data is based on the Gini coefficient of inequality.
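"Separating power" can be made concrete with Gini impurity, the related node-purity measure CART uses: a strong split produces children each dominated by a single class. The counts below are invented:

```python
def gini(pos, neg):
    # Gini impurity of a node with pos/neg class counts.
    # For two classes: 1 - p^2 - (1-p)^2 = 2p(1-p); 0 means pure.
    total = pos + neg
    if total == 0:
        return 0.0
    p = pos / total
    return 2 * p * (1 - p)

def split_impurity(left, right):
    # Weighted impurity of the two child nodes a split creates.
    n = sum(left) + sum(right)
    return (sum(left) / n) * gini(*left) + (sum(right) / n) * gini(*right)

parent = gini(50, 50)                      # 0.5: maximally mixed node
weak = split_impurity((30, 20), (20, 30))  # barely separates the classes
strong = split_impurity((48, 2), (2, 48))  # each child is nearly pure

print(parent - weak, parent - strong)
```

The strong split reduces impurity by far more (about 0.42 versus 0.02 here), so it is the one a Gini-based tree would choose.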
Major Issues in Data Mining (1)
Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing knowledge: knowledge fusion
(2)
User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy
Summary (1)
Data mining: discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
(2)
Mining can be performed on a variety of information repositories
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
Data mining systems and architectures
Major issues in data mining
Thank you
An example of a decision tree
[Figure: a decision tree profiling 1,000 customer records by response. Root node: Homeownership (Total: 1000; responded: 112, not responded: 888), splitting into Yes (Total: 343; responded: 9, not responded: 334) and No (Total: 657; responded: 103, not responded: 554). The No branch splits on Savings Accounts: ≤ $20,700 (Total: 172; responded: 14, not responded: 158) and ≥ $20,701 (Total: 485; responded: 89, not responded: 396); the latter node splits further on Household Income into two leaves (Total: 274; responded: 86, not responded: 188, and Total: 211; responded: 3, not responded: 208).]
The Gini coefficient
The Gini coefficient is a measure of how well the predictor separates the classes contained in the parent node.
Gini, an Italian economist, introduced a rough measure of the amount of inequality in the income distribution of a country.
Computation of the Gini coefficient
[Figure: a Lorenz curve plotting % income (y-axis) against % population (x-axis), both running from 0 to 100, with the diagonal line of perfect equality]
The Gini coefficient is calculated as the area between the curve and the diagonal divided by the area below the diagonal. For a perfectly equal wealth distribution, the Gini coefficient is equal to zero.
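That geometric definition translates directly into code: build the Lorenz curve from sorted incomes, measure the area under it, and divide the gap below the diagonal (whose area is 1/2) by 1/2. The income figures are invented:

```python
def gini_coefficient(incomes):
    xs = sorted(incomes)
    n, total = len(xs), sum(xs)
    # Lorenz curve points: cumulative income share after each person.
    lorenz = [0.0]
    cum = 0.0
    for x in xs:
        cum += x
        lorenz.append(cum / total)
    # Area under the Lorenz curve by the trapezoid rule (x-steps of 1/n).
    area_lorenz = sum((lorenz[i] + lorenz[i + 1]) / (2 * n) for i in range(n))
    # Area between curve and diagonal, divided by area below the diagonal.
    return (0.5 - area_lorenz) / 0.5

print(gini_coefficient([10, 10, 10, 10]))  # 0.0: perfect equality
print(gini_coefficient([0, 0, 0, 100]))    # 0.75: one person holds everything
```

With only four people the maximum possible value is 0.75; it approaches 1 as the population grows and one person holds all the income.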
Selecting an optimal decision tree: (a) Splits selected by Gini

[Figure: a tree grown from a root node of 150 cases (Class A: 100, Class B: 50) by successive yes/no splits on Predictors 1 to 6, with each split chosen by the Gini measure. The splits drive the leaves towards purity; one leaf, for example, isolates 36 Class B cases with no Class A cases.]
Selecting an optimal decision tree: (b) Splits selected by guesswork

[Figure: the same 150 cases (Class A: 100, Class B: 50) split on the same predictors, but with the splits chosen by guesswork. The resulting leaves all remain mixed; none comes close to containing a single class.]
Gain chart of Class A

[Figure: a gain chart plotting % of Class A captured against % of population, both axes running from 0 to 100, comparing the Gini splits with manual split selection.]
Can we extract rules from a decision tree?

The path from the root node to a bottom leaf reveals a decision rule.

For example, a rule associated with the right bottom leaf in the figure that represents the Gini splits can be represented as follows:

if (Predictor 1 = no)
and (Predictor 4 = no)
and (Predictor 6 = no)
then class = Class A
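Rule extraction is a mechanical walk from the root to each leaf. A hedged sketch, using a hypothetical nested-tuple encoding of the Gini-split tree; the Class B labels on the other leaves are illustrative, not taken from the figure:

```python
def extract_rules(node, conditions=()):
    """Walk a decision tree and emit one if-then rule per leaf."""
    if isinstance(node, str):            # leaf: a class label
        yield "if " + " and ".join(conditions) + f" then class = {node}"
        return
    predictor, branches = node           # internal node: (name, {answer: subtree})
    for answer, child in branches.items():
        yield from extract_rules(child, conditions + (f"({predictor} = {answer})",))

# Hypothetical encoding of the rightmost branch of the Gini-split tree.
tree = ("Predictor 1", {
    "yes": "Class B",
    "no": ("Predictor 4", {
        "yes": "Class B",
        "no": ("Predictor 6", {"yes": "Class B", "no": "Class A"}),
    }),
})

for rule in extract_rules(tree):
    print(rule)
```

The last rule printed is exactly the one on the slide: if (Predictor 1 = no) and (Predictor 4 = no) and (Predictor 6 = no) then class = Class A.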
Case study: Profiling people with high blood pressure

A typical task for decision trees is to determine conditions that may lead to certain outcomes.

Blood pressure can be categorised as optimal, normal or high. Optimal pressure is below 120/80, normal is between 120/80 and 130/85, and hypertension is diagnosed when blood pressure is over 140/90.
A data set for a hypertension study
Community Health Survey: Hypertension Study (California, U.S.A.)

Gender: Male; Female
Age: 18–34 years; 35–50 years; 51–64 years; 65 or more years
Race: Caucasian; African American; Hispanic; Asian or Pacific Islander
Marital Status: Married; Separated; Divorced; Widowed; Never Married
Household Income: Less than $20,700; $20,701–$45,000; $45,001–$75,000; $75,001 and over
A data set for a hypertension study (continued)
Community Health Survey: Hypertension Study (California, U.S.A.)

Weight: kg
Height: cm
Alcohol Consumption: Abstain from alcohol; Occasional (a few drinks per month); Regular (one or two drinks per day); Heavy (three or more drinks per day)
Smoking: Nonsmoker; 1–10 cigarettes per day; 11–20 cigarettes per day; More than one pack per day
Caffeine Intake: Abstain from coffee; One or two cups per day; Three or more cups per day
Salt Intake: Low-salt diet; Moderate-salt diet; High-salt diet
Physical Activities: None; One or two times per week; Three or more times per week
Blood Pressure: Optimal; Normal; High
Data cleaning

Decision trees are only as good as the data they represent. Unlike neural networks and fuzzy systems, decision trees do not tolerate noisy and polluted data. Therefore, the data must be cleaned before we can start data mining.

We might find that such fields as Alcohol Consumption or Smoking have been left blank or contain incorrect information.
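A minimal sketch of such a cleaning pass, assuming survey records arrive as dictionaries keyed by the field names on the earlier slides (the valid-value sets are abbreviated for brevity):

```python
# Valid categories for two of the survey fields (abbreviated labels).
VALID = {
    "Alcohol Consumption": {"Abstain", "Occasional", "Regular", "Heavy"},
    "Smoking": {"Nonsmoker", "1-10", "11-20", "More than one pack"},
}

def clean(records):
    """Split records into clean ones and ones with blank or invalid fields."""
    good, bad = [], []
    for rec in records:
        ok = all(rec.get(field) in allowed for field, allowed in VALID.items())
        (good if ok else bad).append(rec)
    return good, bad

records = [
    {"Alcohol Consumption": "Occasional", "Smoking": "Nonsmoker"},
    {"Alcohol Consumption": "", "Smoking": "Nonsmoker"},   # blank field
    {"Smoking": "Nonsmoker"},                              # missing field
]
good, bad = clean(records)
print(len(good), len(bad))  # 1 2
```

Rejected records can then be corrected or discarded before the tree is grown.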
Data enriching

From such variables as weight and height we can easily derive a new variable, obesity. This variable is calculated with a body-mass index (BMI), that is, the weight in kilograms divided by the square of the height in metres. Men with BMIs of 27.8 or higher and women with BMIs of 27.3 or higher are classified as obese.
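The enrichment step can be sketched directly from the formula and thresholds above (the function name and the Male/Female encoding are assumptions):

```python
def obesity(weight_kg, height_cm, gender):
    """Derive the 'obesity' field from weight and height via BMI.

    BMI = weight in kilograms / (height in metres) ** 2, with the
    thresholds quoted on the slide: 27.8 for men, 27.3 for women.
    """
    bmi = weight_kg / (height_cm / 100) ** 2
    threshold = 27.8 if gender == "Male" else 27.3
    return "Obese" if bmi >= threshold else "Not Obese"

print(obesity(93, 170, "Male"))    # BMI ~ 32.2 -> Obese
print(obesity(60, 170, "Female"))  # BMI ~ 20.8 -> Not Obese
```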
A data set for a hypertension study (continued)
Community Health Survey: Hypertension Study (California, U.S.A.)

Obesity: Obese; Not Obese
Growing a decision tree

Blood Pressure (Total: 1000; optimal: 319 (32%), normal: 528 (53%), high: 153 (15%))
Split on Age:
  18–34 years (Total: 157; optimal: 88 (56%), normal: 64 (41%), high: 5 (3%))
  35–50 years (Total: 596; optimal: 208 (35%), normal: 340 (57%), high: 48 (8%))
  51–64 years (Total: 173; optimal: 21 (12%), normal: 90 (52%), high: 62 (36%))
  65 or more years (Total: 74; optimal: 2 (3%), normal: 34 (46%), high: 38 (51%))
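The quality of the Age split can be checked numerically with the Gini-style impurity used by CART-like tree growers (a sketch, not necessarily the book's exact procedure), using the class counts on this slide:

```python
def gini_impurity(counts):
    """Gini impurity 1 - sum(p_k ** 2) of a node with the given class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Class counts (optimal, normal, high) taken from the slide.
root = (319, 528, 153)
age_groups = [(88, 64, 5), (208, 340, 48), (21, 90, 62), (2, 34, 38)]

# Weighted average impurity of the child nodes after splitting on Age.
n = sum(root)
weighted = sum(sum(g) / n * gini_impurity(g) for g in age_groups)
print(f"root impurity:           {gini_impurity(root):.3f}")
print(f"after splitting on Age:  {weighted:.3f}")  # lower means purer children
```

A good split lowers the weighted impurity of the children relative to the parent, which is exactly what the tree grower searches for at each node.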
Growing a decision tree (continued)

51–64 years (Total: 173; optimal: 21 (12%), normal: 90 (52%), high: 62 (36%))
Split on Obesity:
  Obese (Total: 107; optimal: 3 (3%), normal: 53 (49%), high: 51 (48%))
  Not Obese (Total: 66; optimal: 18 (27%), normal: 37 (56%), high: 11 (17%))
Growing a decision tree (continued)

Obese (Total: 107; optimal: 3 (3%), normal: 53 (49%), high: 51 (48%))
Split on Race:
  Caucasian (Total: 43; optimal: 2 (5%), normal: 24 (55%), high: 17 (40%))
  African American (Total: 37; optimal: 0 (0%), normal: 13 (35%), high: 24 (65%))
  Hispanic (Total: 19; optimal: 0 (0%), normal: 11 (58%), high: 8 (42%))
  Asian (Total: 8; optimal: 1 (12%), normal: 5 (63%), high: 2 (25%))
Solution space of the hypertension study

The solution space is first divided into four rectangles by age, then the age group 51–64 is further divided into those who are overweight and those who are not. And finally, the group of obese people is divided by race.
Solution space of the hypertension study

[Figure: the solution space partitioned into rectangles, with region sizes 157, 596 and 74 for the age groups, 66 for the not-obese part of the 51–64 group, and 43, 37, 19 and 8 for the obese part of the 51–64 group divided by race.]
Hypertension study: forcing a split

Blood Pressure (Total: 1000; optimal: 319 (32%), normal: 528 (53%), high: 153 (15%))
Split on Age:
  18–34 years (Total: 157; optimal: 88 (56%), normal: 64 (41%), high: 5 (3%))
  35–50 years (Total: 596; optimal: 208 (35%), normal: 340 (57%), high: 48 (8%))
    Forced split on Gender:
      Male (Total: 307; optimal: 111 (36%), normal: 168 (55%), high: 28 (9%))
      Female (Total: 289; optimal: 97 (34%), normal: 172 (59%), high: 20 (7%))
  51–64 years (Total: 173; optimal: 21 (12%), normal: 90 (52%), high: 62 (36%))
    Forced split on Gender:
      Male (Total: 86; optimal: 11 (13%), normal: 48 (56%), high: 27 (31%))
      Female (Total: 87; optimal: 10 (12%), normal: 42 (48%), high: 35 (40%))
  65 or more years (Total: 74; optimal: 2 (3%), normal: 34 (46%), high: 38 (51%))
Advantages of decision trees

The main advantage of the decision-tree approach to data mining is that it visualises the solution; it is easy to follow any path through the tree.

Relationships discovered by a decision tree can be expressed as a set of rules, which can then be used in developing an expert system.
Drawbacks of decision trees

Continuous data, such as age or income, have to be grouped into ranges, which can unwittingly hide important patterns.

Handling of missing and inconsistent data: decision trees can produce reliable outcomes only when they deal with "clean" data.

Inability to examine more than one variable at a time. This confines trees to only those problems that can be solved by dividing the solution space into several successive rectangles.
In spite of all these limitations, decision trees have become the most successful technology used for data mining.

The ability to produce clear sets of rules makes decision trees particularly attractive to business professionals.