Mining Knowledge in Data Explosion Age

廖宜恩, Department of Computer Science and Engineering, National Chung Hsing University

Upload: tommy96

Post on 13-Jan-2015


TRANSCRIPT

  • 1. Mining Knowledge in Data Explosion Age

  • 2. Outline
      - Some News Reports
      - Why Data Mining
      - What is Data Mining
      - Knowledge Discovery Process
      - Data Mining Functionalities
      - Data Mining Process
      - Data Mining Tools
      - Trends in Data Mining
      - Some Research Results on Data Mining
      - Conclusions
  • 3. Some News Reports
      - Time's Person of the Year for 2006
      - 12 IT skills that employers can't say no to
      - F.B.I. Data Mining Reached Beyond Initial Targets
      - MIT names its top 10 emerging technologies for 2008
      - Effect of US Recession on Data Mining Demand (July 2008)
  • 4. Why Data Mining
      - Data Explosion Problem
      - Data in the world doubles every 20 months!
      - NASA's Earth Orbiting System: forty-six megabytes of data per second, 4,000,000,000,000 bytes a day (4 terabytes/day, i.e. twenty 200 GB hard disks)
      - FBI fingerprints image library: 200,000,000,000,000 bytes (200 TB)
      - In-line image analysis for particle detection: 1 megabyte in one second
  • 5. Why Data Mining? Commercial Viewpoint
      - Lots of data is being collected and warehoused
          - Web data, e-commerce
          - Purchases at department/grocery stores
          - Bank/credit card transactions
      - Competitive pressure is strong
          - Provide better, customized services for an edge (e.g. in Customer Relationship Management)
  • 6. Why Data Mining? Scientific Viewpoint
      - Data collected and stored at enormous speeds (GB/hour)
          - Remote sensors on a satellite
          - Telescopes scanning the skies
          - Microarrays generating gene expression data
          - Scientific simulations generating terabytes of data
      - Traditional techniques infeasible for raw data
  • 7. Mining Large Data Sets - Motivation
      - There is often information hidden in the data that is not readily evident
      - Human analysts may take weeks to discover useful information
      - Much of the data is never analyzed at all
      - We are drowning in data, but starving for knowledge!
      - [Figure: The Data Gap. Total new disk (TB) since 1995 versus number of analysts, 1995 to 1999]
  • 8. What is Data Mining?
      - Data Mining (Knowledge Discovery in Databases, KDD): exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules
  • 9. Knowledge Discovery Process
      - Data mining: the core of the knowledge discovery process
      - [Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
  • 10. Origins of Data Mining
      - Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
      - Traditional techniques may be unsuitable due to
          - Enormity of data
          - Curse of high dimensionality
          - Heterogeneous, distributed nature of data
      - [Figure: Data Mining at the intersection of Statistics, Machine Learning/AI, Pattern Recognition, and Database systems]
  • 11. Data Mining Functionalities
      1. Concept description: Characterization and discrimination
      2. Classification
      3. Association rule mining
      4. Clustering
      5. Sequence analysis
      6. Anomaly detection
  • 12. Concept description: Characterization and discrimination
      - Characterization: provides a concise summarization of the given collection of data
          - Example: Describe general characteristics of graduate students in the NCHU database
      - Discrimination: provides descriptions comparing two or more collections of data
          - Example: Compare graduate and undergraduate students of NCHU using a discriminant rule
  • 13. Classification
      - Given a collection of records (training set)
          - Each record contains a set of attributes; one of the attributes is the class.
      - Find a model for the class attribute as a function of the values of the other attributes.
      - Goal: previously unseen records should be assigned a class as accurately as possible.
          - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
  • 14. Decision Tree Classification Task
      - Training set (Learn Model):

          Tid  Attrib1  Attrib2  Attrib3  Class
          1    Yes      Large    125K     No
          2    No       Medium   100K     No
          3    No       Small    70K      No
          4    Yes      Medium   120K     No
          5    No       Large    95K      Yes
          6    No       Medium   60K      No
          7    Yes      Large    220K     No
          8    No       Small    85K      Yes
          9    No       Medium   75K      No
          10   No       Small    90K      Yes

      - Test set (Apply Model, here a decision tree):

          Tid  Attrib1  Attrib2  Attrib3  Class
          11   No       Small    55K      ?
          12   Yes      Medium   80K      ?
          13   Yes      Large    110K     ?
          14   No       Small    95K      ?
          15   No       Large    67K      ?

  • 15. Example of a Decision Tree
      - Training data:

          Tid  Refund  Marital Status  Taxable Income  Cheat
          1    Yes     Single          125K            No
          2    No      Married         100K            No
          3    No      Single          70K             No
          4    Yes     Married         120K            No
          5    No      Divorced        95K             Yes
          6    No      Married         60K             No
          7    Yes     Divorced        220K            No
          8    No      Single          85K             Yes
          9    No      Married         75K             No
          10   No      Single          90K             Yes

      - Model: decision tree (splitting attributes Refund, MarSt, TaxInc):

          Refund?
            Yes -> NO
            No  -> MarSt?
                     Married          -> NO
                     Single, Divorced -> TaxInc?
                                           < 80K -> NO
                                           > 80K -> YES

  • 16. Apply Model to Test Data
      - Start from the root of the tree.
      - Test data:

          Refund  Marital Status  Taxable Income  Cheat
          No      Married         80K             ?

      - [Figure: the decision tree of slide 15 (Refund, MarSt, TaxInc)]
  • 17. Examples of Classification Task
      - Predicting tumor cells as benign or malignant
      - Classifying credit card transactions as legitimate or fraudulent
      - Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
      - Categorizing news stories as finance, weather, entertainment, sports, etc.
  • 18. Association rule mining
      - Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
      - Market-basket transactions:

          TID  Items
          1    Bread, Milk
          2    Bread, Diaper, Beer, Eggs
          3    Milk, Diaper, Beer, Coke
          4    Bread, Milk, Diaper, Beer
          5    Bread, Milk, Diaper, Coke

      - Example of association rules:
          - {Diaper} --> {Beer}
          - {Milk, Bread} --> {Eggs, Coke}
          - {Beer, Bread} --> {Milk}
  • 19. Association Rule Discovery: Application 1
      - Marketing and Sales Promotion: let the rule discovered be {Beer, ...} --> {Potato Chips}
      - Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
      - Beer in the antecedent => Can be used to see which products would be affected if the store discontinues selling beer.
      - Beer in antecedent and Potato Chips in consequent => Can be used to see what products should be sold with Beer to promote sale of Potato Chips!
  • 20. Clustering
      - Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
          - Data points in one cluster are more similar to one another.
          - Data points in separate clusters are less similar to one another.
  • 21. Illustrating Clustering
      - Euclidean distance based clustering in 3-D space
          - Intracluster distances are minimized
          - Intercluster distances are maximized
  • 22. Clustering: Applications
      - Market Segmentation
          - Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
      - Document Clustering
          - Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
  • 23. Clustering of Microarray Data
  • 24. Sequence analysis

          Sequence Database  Sequence                                        Element (Transaction)                                                     Event (Item)
          Customer           Purchase history of a given customer            A set of items bought by a customer at time t                             Books, dairy products, CDs, etc.
          Web Data           Browsing activity of a particular Web visitor   A collection of files viewed by a Web visitor after a single mouse click  Home page, index page, contact info, etc.
          Event data         History of events generated by a given sensor   Events triggered by a sensor at time t                                    Types of alarms generated by sensors
          Genome sequences   DNA sequence of a particular species            An element of the DNA sequence                                            Bases A, T, G, C

      - [Figure: a sequence drawn as an ordered list of elements (transactions), each containing events (items) E1 to E4]
  • 25. [Figure. Source: Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001]
  • 26. How does the human genome stack up?

          Organism                            Genome Size (Bases)  Estimated Genes
          Human (Homo sapiens)                3 billion            25,000
          Laboratory mouse (M. musculus)      2.6 billion          30,000
          Mustard weed (A. thaliana)          100 million          25,000
          Roundworm (C. elegans)              97 million           19,000
          Fruit fly (D. melanogaster)         137 million          13,000
          Yeast (S. cerevisiae)               12.1 million         6,000
          Bacterium (E. coli)                 4.6 million          3,200
          Human immunodeficiency virus (HIV)  9700                 9

  • 27. Why Finding (15,4) Motif is Difficult?
      - Sequences with implanted motif instances (upper case):

          atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGatgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttataggtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

      - Alignment of two motif instances (| marks a matching position):

          AgAAgAAAGGttGGG
          ..|..|||.|..|||
          cAAtAAAAcGGcGGG

  • 28. Anomaly Detection
      - Detect significant deviations from normal behavior
      - Applications:
          - Credit Card Fraud Detection
          - Network Intrusion Detection
      - Typical network traffic at university level may reach over 100 million connections per day
  • 29. Social Network Analysis (Link Mining)
      - Six Degrees of Separation
      - Link: relationship among data objects
      - Link-Based Object Ranking (LBR): Exploit the link structure of a graph to order or prioritize the set of objects within the graph
      - Web information analysis methods such as PageRank and HITS are typical LBR approaches
  • 30. Complex Network
      - A complex network is a network (graph) that has certain non-trivial topological features that do not occur in simple networks.
      - Such non-trivial features include:
          - a heavy tail in the degree distribution;
          - a high clustering coefficient;
          - assortativity (a correlation between two nodes) or disassortativity among vertices;
          - evidence of a hierarchical structure.
  • 31. Web Mining
      - Web Usage Mining
      - Web Structure Mining
      - Web Content Mining
      - Google has a precious asset: Database of Intentions
  • 32. Graph Mining
      - Find frequent subgraphs in a given graph database
      - Graphs are ubiquitous
          - Web databases, XML databases
          - Cheminformatics (chemical compounds)
          - Bioinformatics (protein structure, pathway)
          - Workflow analysis
          - Social network analysis
  • 33. Example (Chemistry-informatics)
      - [Figure: graph dataset with compounds (A), (B), (C) and the frequent patterns (1), (2) found with min support 2]
  • 34. Data Mining Process
      - Define the problem
      - Build data mining database
      - Explore data
      - Prepare data for modeling
      - Build model
      - Evaluate model
      - Deploy model
  • 35. Examples of data mining in science & engineering
      - Data mining in Biomedical Engineering
      - Robotic Arm Control Using Data Mining Techniques
  • 36. Data Mining Process: 1. Define the problem
      - Control a robotic arm by means of EMG signals from biceps and triceps muscles.
      - Electromyography (EMG) is a medical technique for evaluating and recording physiologic properties of muscles at rest and while contracting.

          Muscle contraction  Biceps  Triceps
          Supination          H       H
          Pronation           L       L
          Flexion             H       L
          Extension           L       H

      - [Figure: arm movements Supination, Pronation, Flexion, Extension]
  • 37. Data Mining Process: 2. Build a data mining database
      - The dataset includes 80 records.
      - There are two input variables: biceps signal and triceps signal.
      - One output variable, with four possible values: supination, pronation, flexion and extension.
  • 38. Data Mining Process: 3. Explore data
      - [Figure: scatter plot of Triceps against Record# for Flexion, Extension, Supination, Pronation]
  • 39. Data Mining Process: 3. Explore data (cont.)
      - [Figure: scatter plot of Biceps against Record# for Flexion, Extension, Supination, Pronation]
  • 40. Data Mining Process: 4. Prepare data for modeling
      - Build a dataset with the ARFF format:

          @relation EMG
          @attribute Triceps real
          @attribute Biceps real
          @attribute Move {Flexion,Extension,Pronation,Supination}
          @data
          13,31,Flexion
          14,30,Flexion
          10,31,Flexion
          13,29,Flexion

  • 41. Data Mining Process: 5. Build Model
      - Classification
          - OneR
          - Decision Tree
          - Naive Bayesian
          - K-Nearest Neighbors
          - Neural Networks
          - Linear Discriminant Analysis
          - Support Vector Machines
  • 42. Data Mining Process: 5. Decision Tree
      1. Find the attribute that best classifies the training data.
      2. Use this attribute as the root of the decision tree.
      3. Repeat the process for each subtree.
      - [Figure: decision tree with splits on Triceps and Biceps (the values 37, 14 and 17 appear in the figure) and leaves Flexion, Pronation, Extension, Supination]
  • 43. Data Mining Process: 6. Evaluate Models
      - Simple validation: training set and test set
      - n-fold cross-validation
      - Leave-one-out
      - 10-fold cross-validation results:

          Model                Accuracy
          OneR                 76%
          Decision Tree        90%
          Naive Bayesian       98%
          1-Nearest Neighbors  100%
          Neural Networks      100%

  • 44. Data Mining Process: 7. Deploy Model
      - The neural network model was successfully implemented inside the robotic arm.
  • 45. Data Mining Tools
      - Commercial tools: SAS Enterprise Miner, IBM Intelligent Miner, SPSS Clementine
      - Open source tools:
          - WEKA: http://www.cs.waikato.ac.nz/ml/weka
          - RapidMiner: http://rapid-i.com/index.php?lang=en
      - Poll: Data mining/analytic tools you used in 2006
      - Good portals for data mining: KDnuggets
  • 46. Trends in Data Mining
      - Application exploration: development of application-specific data mining systems
      - Invisible data mining (mining as a built-in function)
      - Scalable data mining methods
      - Constraint-based mining: use of constraints to guide data mining systems in their search for interesting patterns
      - Integration of data mining with database systems, data warehouse systems, and Web database systems
  • 47. Trends in Data Mining
      - Web mining
      - Social network analysis
      - Recommender systems:
          - US$1 million prize for a 10% improvement on Netflix's Cinematch movie recommender system
          - Netflix: "If You Liked This, You're Sure to Love That" (New York Times, Nov. 21, 2008)
  • 48. Trends in Data Mining
      - Spam filters:
          - Cost of spam: how much does spam cost you? Google will calculate: http://www.google.com/a/help/intl/en/security/roi_calculator.html
      - Privacy protection and information security in data mining
      - Bioinformatics
  • 49. Some Research Results on DM
      - Localization system for WLAN
      - Rogue Access Point Detection System Based on Packet Analysis
      - Library Recommender System Based on Personal Ontology Model
  • 50. Localization system for WLAN
      - "Enhancing the Accuracy of WLAN-based Location Determination Systems Using Predicted Orientation Information" (Information Sciences, Vol. 178, No. 4, Feb. 15, 2008, pp. 1049-1068)
      - We proposed the Accumulated Orientation Strength (AOS) algorithm, based on a Bayesian classifier, to predict the orientation of a mobile user and so improve the accuracy of the localization system.
  • 51. Rogue Access Point Detection System
      - A paper entitled "Detecting Rogue Access Points Using Client-side Bottleneck Bandwidth Analysis" has been accepted for publication in Computers & Security.
  • 52. Rogue Access Point Detection System
      - Big challenge in managing APs on a university campus: NCHU is a class B network with more than 50 departmental networks
  • 53. Rogue Access Point Detection System: Intruders from the Air
  • 54. Rogue Access Point Detection System
      - Proposed a novel approach for detecting rogue access points by estimating client-side bottleneck bandwidth based on the ACK packet pair technique.
      - The system is implemented and tested in the Computer and Information Network Center at NCHU.
      - Experimental results show that the accuracy is higher than 90%.
  • 55. Library Recommender System Based on Personal Ontology Model (PORE)
      - A paper entitled "PORE: A Personal Ontology Recommender System for Digital Library" has been accepted for publication in The Electronic Library.
      - Proposed a personal ontology model for recommending books to library patrons based on keywords extracted from the books borrowed by the user
  • 56. Library Recommender System Based on Personal Ontology Model (PORE)
      - Collaborative filtering techniques are also incorporated into the PORE system
      - The PORE system is in service at NCHU Library
  • 57. Conclusions
      - We are drowning in data, but starving for knowledge!
      - Data mining is the key to knowledge discovery.
      - Applications of data mining techniques can be found in almost every research area of computer science and engineering.
      - Even in a recession, data mining services are still in strong demand.
  • 58. References
      1. Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison-Wesley, 2006.
      2. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd Ed., Morgan Kaufmann, 2005.
      3. Neil Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
      4. http://www.chem-eng.utoronto.ca/~datamining/
      5. Duncan Watts, Six Degrees, 2004.
      6. Mark Buchanan, Nexus, 2003.
      7. http://www.kdnuggets.com/
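
Worked example for slide 18 (association rule mining). The slides list the market-basket transactions and three example rules but do not say how a rule is scored. A minimal sketch, assuming the usual support and confidence measures (these names and the helper functions `support` and `confidence` are not from the slides), could look like this:

```python
# Market-basket transactions copied from slide 18.
TRANSACTIONS = [
    frozenset({"Bread", "Milk"}),
    frozenset({"Bread", "Diaper", "Beer", "Eggs"}),
    frozenset({"Milk", "Diaper", "Beer", "Coke"}),
    frozenset({"Bread", "Milk", "Diaper", "Beer"}),
    frozenset({"Bread", "Milk", "Diaper", "Coke"}),
]


def count(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item of itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item of itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent (assumed definition, not from the slides)."""
    a = frozenset(antecedent)
    n_antecedent = count(a, transactions)
    if n_antecedent == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return count(a | frozenset(consequent), transactions) / n_antecedent


if __name__ == "__main__":
    rules = [
        ({"Diaper"}, {"Beer"}),
        ({"Milk", "Bread"}, {"Eggs", "Coke"}),
        ({"Beer", "Bread"}, {"Milk"}),
    ]
    for lhs, rhs in rules:
        print(sorted(lhs), "-->", sorted(rhs),
              "support:", support(lhs | rhs),
              "confidence:", confidence(lhs, rhs))
```

Under these definitions a rule can be listed as an example and still score poorly on a sample this small, which is why real miners usually apply minimum support and confidence thresholds.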
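
Worked example for slides 15 and 16 (applying a decision tree). The tree on slide 15 can be written as three nested tests. The sketch below is only an illustration: the function name `classify` is hypothetical, and because the slide labels the income branches "< 80K" and "> 80K" without saying where exactly 80K goes, sending 80K to the NO branch is an assumption.

```python
# Training data from slide 15: (Refund, Marital Status, Taxable Income in K, Cheat).
TRAINING = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def classify(refund, marital_status, taxable_income_k):
    """Walk the slide-15 tree from the root and return the Cheat label."""
    if refund == "Yes":               # Refund = Yes -> NO
        return "No"
    if marital_status == "Married":   # MarSt = Married -> NO
        return "No"
    # MarSt = Single or Divorced -> test TaxInc.
    # Assumption: an income of exactly 80K follows the "< 80K" (NO) branch.
    return "Yes" if taxable_income_k > 80 else "No"


if __name__ == "__main__":
    correct = sum(classify(r, m, inc) == label for r, m, inc, label in TRAINING)
    print("training records classified correctly:", correct, "of", len(TRAINING))
    # Test record from slide 16: Refund = No, Married, 80K.
    print("slide 16 test record ->", classify("No", "Married", 80))
```

For the slide 16 record the income test is never reached, because the Married branch already ends in a NO leaf.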
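
Worked example for slide 27 (why a (15,4) motif is hard to find). The alignment on that slide compares two implanted instances. A small sketch, assuming that lower-case letters inside an instance mark its mutated positions and using a hypothetical helper `hamming`, shows how far apart two instances of the same motif can be:

```python
# The two motif instances aligned on slide 27.
INSTANCE_A = "AgAAgAAAGGttGGG"
INSTANCE_B = "cAAtAAAAcGGcGGG"


def hamming(a, b):
    """Number of positions at which two equal-length strings differ,
    ignoring letter case."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x.lower() != y.lower())


def mutated_positions(instance):
    """Indices of lower-case letters (assumed to mark mutations)."""
    return [i for i, ch in enumerate(instance) if ch.islower()]


if __name__ == "__main__":
    print("length:", len(INSTANCE_A))
    print("mutations in A:", len(mutated_positions(INSTANCE_A)))
    print("mutations in B:", len(mutated_positions(INSTANCE_B)))
    print("mismatches between A and B:", hamming(INSTANCE_A, INSTANCE_B))
```

Each instance differs from the hidden pattern in only 4 of 15 positions, yet the two instances disagree with each other in 7 positions, so a pairwise comparison of candidate substrings gives a weak signal against the random background.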