Computational Intelligence Methods for Information Understanding and Information Management

Włodzisław Duch, Department of Informatics, Nicolaus Copernicus University


  • Slide 1
  • Computational intelligence methods for information understanding and information management. Włodzisław Duch, Department of Informatics, Nicolaus Copernicus University, Toruń, Poland & School of Computer Engineering, Nanyang Technological University, Singapore. IMS2005, Kunming, China
  • Slide 2
  • Plan What is this about? How to discover knowledge in data; how to create comprehensible models of data; how to evaluate new data; how to understand what computational intelligence (CI) methods really do.
    1. AI, CI & Data Mining
    2. Forms of useful knowledge
    3. Integration of different methods in GhostMiner
    4. Exploration & Visualization
    5. Rule-based data analysis
    6. Neurofuzzy models
    7. Neural models, understanding what they do
    8. Similarity-based models, prototype rules
    9. Case studies
    10. From data to expert system
  • Slide 3
  • AI, CI & DM Artificial Intelligence: symbolic models of knowledge. Higher-level cognition: reasoning, problem solving, planning, heuristic search for solutions. Machine learning, inductive, rule-based methods. Technology: expert systems.
    Computational Intelligence, Soft Computing: methods inspired by many sources:
    - biology: evolutionary, immune, neural computing;
    - statistics, pattern recognition;
    - probability: Bayesian networks;
    - logic: fuzzy, rough.
    Perception, object recognition.
    Data Mining, Knowledge Discovery in Databases: discovery of interesting rules, knowledge => information understanding; building predictive data models => part of information management.
  • Slide 4
  • Forms of useful knowledge AI/Machine Learning camp: Neural nets are black boxes. Unacceptable! Symbolic rules forever. But... knowledge accessible to humans is in: symbols and rules; similarity to prototypes, structures, known cases; images, visual representations. What type of explanation is satisfactory? An interesting question for cognitive scientists, but in different fields the answers are different!
  • Slide 5
  • Forms of knowledge 3 types of explanation presented here: logic-based: symbols and rules; exemplar-based: prototypes and similarity of structures; visualization-based: maps, diagrams, relations... Humans remember examples of each category and refer to such examples, as similarity-based, case-based or nearest-neighbor methods do. Humans create prototypes out of many examples, as Gaussian classifiers, RBF networks, or neurofuzzy systems modeling probability densities do. Logical rules are the highest form of summarization of simple forms of knowledge; Bayesian networks present complex relationships.
  • Slide 6
  • GhostMiner Philosophy There is no free lunch => provide different types of tools for knowledge discovery: decision tree, neural, neurofuzzy, similarity-based, SVM, committees. Provide tools for visualization of data. Support the process of knowledge discovery/model building and evaluation, organizing it into projects. GhostMiner: tools for data mining & knowledge discovery, from our lab + Fujitsu: http://www.fqspl.com.pl/ghostminer/ Separate the process of model building and knowledge discovery (hackers) from model use (lamers) => GhostMiner Developer & GhostMiner Analyzer (ver. 3.0 & newer)
  • Slide 7
  • Wine data example Chemical analysis of wine from grapes grown in the same region in Italy, but derived from three different cultivars. Task: recognize the source of a wine sample. 13 quantities measured, all features are continuous:
    - alcohol content
    - malic acid content
    - ash content
    - alkalinity of ash
    - magnesium content
    - total phenols content
    - flavanoids content
    - nonflavanoid phenols content
    - proanthocyanins
    - color intensity
    - hue
    - OD280/OD315 of diluted wines
    - proline
    Wine sample => 13 numerical quantities => feature space representation. Complex structures: no feature space, only Similarity(A,B) known.
  • Slide 8
  • Exploration and visualization General info about the data
  • Slide 9
  • Exploration: data Inspect the data
  • Slide 10
  • Exploration: data statistics Distribution of feature values. Proline has very large values; most methods will benefit from data standardization before further processing.
  • Slide 11
  • Exploration: data standardized Standardized data: unit standard deviation, about 2/3 of all data should fall within [mean-std,mean+std] Other options: normalize to [-1,+1], or normalize rejecting p% of extreme values.
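The preprocessing choices on this slide can be sketched in a few lines of plain Python (helper names are mine, not GhostMiner's API):

```python
import statistics

def standardize(values):
    """Z-score: subtract the mean, divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

def normalize(values):
    """Alternative mentioned on the slide: rescale linearly to [-1, +1]."""
    lo, hi = min(values), max(values)
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]

# Proline values are in the hundreds-to-thousands range, so without
# standardization they would dominate any distance-based method.
proline = [1065.0, 1050.0, 1185.0, 1480.0, 735.0]
z = standardize(proline)   # mean 0, standard deviation 1
```

After standardization roughly 2/3 of the values of a normally distributed feature fall in [-1, +1], which is what the slide's rule of thumb expresses.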
  • Slide 12
  • Exploration: 1D histograms Distribution of feature values in classes Some features are more useful than the others.
  • Slide 13
  • Exploration: 1D/3D histograms Distribution of feature values in classes, 3D
  • Slide 14
  • Exploration: 2D projections Projections on selected 2D
  • Slide 15
  • Visualize data Hard to imagine relations in more than 3D. Linear methods: PCA, FDA, PP... use input combinations. SOM mappings: popular for visualization, but rather inaccurate, there is no measure of distortions. Measure of topographical distortions: map all points X_i from R^n to points x_i in R^m, m < n, and ask: how well are the distances R_ij = D(X_i, X_j) reproduced by the distances r_ij = d(x_i, x_j)? Use m = 2 for visualization, use higher m for dimensionality reduction.
  • Slide 16
  • Visualize data: MDS Multidimensional scaling: invented in psychometrics by Torgerson (1952), re-invented by Sammon (1969) and myself (1994). Minimize the measure of topographical distortions by moving the x coordinates.
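The distortion measure can be written down concretely; a minimal sketch of Sammon's stress, one common choice of topographical distortion measure (function names are mine):

```python
import math

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def sammon_stress(X, Y):
    """Topographical distortion between original points X (in R^n) and
    their images Y (in R^m, m < n): sum over pairs of (R_ij - r_ij)^2 / R_ij,
    normalized by the sum of the original distances R_ij."""
    num, norm = 0.0, 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            R = dist(X[i], X[j])   # distance in the original space
            r = dist(Y[i], Y[j])   # distance in the map
            num += (R - r) ** 2 / R
            norm += R
    return num / norm

# A perfect embedding reproduces all pairwise distances: stress is zero.
X = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
Y = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
```

MDS then minimizes this stress by moving the map coordinates Y, e.g. by gradient descent.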
  • Slide 17
  • Visualize data: Wine 3 clusters are clearly distinguished, 2D is fine. The green outlier can be identified easily.
  • Slide 18
  • Decision trees Simplest things first: use decision tree to find logical rules. Test single attribute, find good point to split the data, separating vectors from different classes. DT advantages: fast, simple, easy to understand, easy to program, many good algorithms.
  • Slide 19
  • Decision borders Univariate trees: test the value of a single attribute x < a. Multivariate trees: test on combinations of attributes. Result: feature space is divided into large hyperrectangular areas with decision borders perpendicular to axes.
  • Slide 20
  • Splitting criteria Most popular: information gain, used in C4.5 and other trees. CART trees use Gini index of node purity: Which attribute is better? Which should be at the top of the tree? Look at entropy reduction, or information gain index.
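Both criteria are easy to state concretely (a sketch; `counts` are class frequencies in a node, and the helper names are mine):

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of the class counts in a node."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gini(counts):
    """Gini index of node impurity: 1 minus the sum of squared class fractions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def info_gain(parent, children):
    """Entropy reduction achieved by splitting `parent` into `children`."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)
```

A pure node has entropy 0 and Gini 0; a 50/50 binary node has entropy 1 bit and Gini 0.5. The attribute whose split gives the largest gain goes to the top of the tree.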
  • Slide 21
  • Non-Bayesian selection Bayesian MAP selection: choose the maximum a posteriori P(C|X) = P(C,X)/P(X). Problem: for binary features non-optimal decisions are taken! But estimation of P(C,A) for non-binary features is not reliable.
                A=0     A=1
    P(C,A_1)    0.0100  0.4900   P(C_0)=0.5
                0.0900  0.4100   P(C_1)=0.5
    P(C,A_2)    0.0300  0.4700
                0.1300  0.3700
    MAP is here equivalent to a majority classifier (MC): given A=x, choose max_C P(C,A=x).
    MC(A_1)=0.58, S+=0.98, S-=0.18, AUC=0.58, MI=0.058
    MC(A_2)=0.60, S+=0.94, S-=0.26, AUC=0.60, MI=0.057
    MC(A_1) < MC(A_2), but MI(A_1) > MI(A_2)!
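The slide's numbers can be checked directly from the joint distributions P(C,A) (a sketch; function names are mine):

```python
import math

def majority_score(joint):
    """MC: for each attribute value pick the majority class and sum P(C,A).
    Rows of `joint` are classes, columns are attribute values."""
    return sum(max(col) for col in zip(*joint))

def mutual_info(joint):
    """MI(C;A) in bits, computed from the joint distribution P(C,A)."""
    pc = [sum(row) for row in joint]           # marginal P(C)
    pa = [sum(col) for col in zip(*joint)]     # marginal P(A)
    return sum(p * math.log2(p / (pc[i] * pa[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)

A1 = [[0.01, 0.49], [0.09, 0.41]]  # P(C,A_1) from the slide
A2 = [[0.03, 0.47], [0.13, 0.37]]  # P(C,A_2) from the slide
```

MC prefers A_2 (0.60 vs 0.58) while mutual information prefers A_1 (0.058 vs 0.057) — exactly the disagreement between selection criteria the slide points out.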
  • Slide 22
  • SSV decision tree Separability Split Value tree: based on the separability criterion. SSV criterion: separate as many pairs of vectors from different classes as possible; minimize the number of separated pairs from the same class. Define subsets of data D using a binary test f(X,s) to split the data into left and right subsets, D = LS ∪ RS.
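The criterion can be sketched as pair counting (a simplified illustration; the actual SSV implementation uses the same-class count as a secondary criterion, and the binary test f(X,s) can take other forms than a threshold):

```python
def ssv_split_counts(data, feature, threshold):
    """For the binary test x[feature] < threshold, count how many pairs of
    vectors from different classes the split separates, and how many pairs
    from the same class it tears apart.
    `data` is a list of (feature_vector, class_label) pairs."""
    left  = [c for x, c in data if x[feature] < threshold]
    right = [c for x, c in data if x[feature] >= threshold]
    separated_diff = sum(1 for a in left for b in right if a != b)
    separated_same = sum(1 for a in left for b in right if a == b)
    return separated_diff, separated_same

# A clean split at 2.5 separates all 4 cross-class pairs and no same-class pair.
data = [((1.0,), "A"), ((2.0,), "A"), ((3.0,), "B"), ((4.0,), "B")]
```

SSV picks the split value s maximizing the first count, using the second to break ties.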
  • Slide 23
  • SSV complex tree Trees can always learn to achieve 100% accuracy on the training data, but then very few vectors are left in the leaves: such splits are not reliable and will overfit the data!
  • Slide 24
  • SSV simplest tree Pruning finds the nodes that should be removed to increase generalization (accuracy on unseen data). Tree with 7 nodes left: 15 errors/178 vectors.
  • Slide 25
  • SSV logical rules Trees may be converted to logical rules. The simplest tree leads to 4 logical rules:
    1. if proline > 719 and flavanoids > 2.3 then class 1
    2. if proline < 719 and OD280 > 2.115 then class 2
    3. if proline > 719 and flavanoids < 2.3 then class 3
    4. if proline < 719 and OD280 < 2.115 then class 3
    How accurate are such rules? Not 15/178 errors, or 91.5% accuracy! Run 10-fold CV and average the results: 85±10%? Run 10 times and average: 85±10%±2%? Run again...
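Read as code, the four rules form a complete classifier. A sketch with the slide's thresholds; the second rule's operators were garbled in transcription, so its exact form (the complement of rules 1, 3 and 4) is my reading:

```python
def classify_wine(proline, flavanoids, od280):
    """Apply the four logical rules from the simplest SSV tree.
    Returns the predicted cultivar (1, 2 or 3), or None if no rule fires
    (e.g. a value lying exactly on a threshold)."""
    if proline > 719 and flavanoids > 2.3:
        return 1
    if proline < 719 and od280 > 2.115:   # rule 2: reconstructed form
        return 2
    if proline > 719 and flavanoids < 2.3:
        return 3
    if proline < 719 and od280 < 2.115:
        return 3
    return None
```

Note that only 3 of the 13 features survive into the rules — the summarization the slides advertise.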
  • Slide 26
  • SSV optimal trees/rules Optimal: estimate how well the rules will generalize. Use stratified crossvalidation for training; use beam search for better results.
    1. if OD280/OD315 > 2.505 and proline > 726.5 then class 1
    2. if OD280/OD315 < 2.505 and hue > 0.875 and malic-acid < 2.82 then class 2
    3. if OD280/OD315 > 2.505 and proline < 726.5 then class 2
    4. if OD280/OD315 < 2.505 and hue > 0.875 and malic-acid > 2.82 then class 3
    5. if OD280/OD315 < 2.505 and hue < 0.875 then class 3
    Note 6/178 errors, or 96.6% accuracy! Run 10-fold CV: results are 85±10%? Run 10 times!
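Stratified crossvalidation, used above for training, just partitions the data so that each fold keeps the class proportions; a minimal sketch (helper name mine):

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds, keeping class proportions in each."""
    rng = random.Random(seed)
    by_class = {}
    for idx, c in enumerate(labels):
        by_class.setdefault(c, []).append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)              # randomize within each class
        for i, idx in enumerate(idxs):
            folds[i % k].append(idx)   # deal out round-robin
    return folds

# 178 wine samples in 3 classes (59/71/48): every fold gets ~1/10 of each class.
labels = [1] * 59 + [2] * 71 + [3] * 48
folds = stratified_folds(labels)
```

Each fold in turn serves as the test set while rules are learned on the rest; averaging over folds (and over repeated runs) gives the honest accuracy estimates the slide asks for.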
  • Slide 27
  • Logical rules Crisp logic rules: for continuous x use linguistic variables (predicate functions):
    s_k(x) ≡ True[X_k ≤ x ≤ X'_k], for example:
    small(x) = True{x | x < 1}
    medium(x) = True{x | x ∈ [1,2]}
    large(x) = True{x | x > 2}
    Linguistic variables are used in crisp (propositional, Boolean) logic rules:
    IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie) ELSE IF ... ELSE ...
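The linguistic variables and the Brownie rule translate directly into Boolean predicates (a toy sketch with the slide's thresholds):

```python
def small(x):
    """small(x) = True for x < 1."""
    return x < 1

def medium(x):
    """medium(x) = True for x in [1, 2]."""
    return 1 <= x <= 2

def large(x):
    """large(x) = True for x > 2."""
    return x > 2

def is_brownie(height, has_hat, has_beard):
    """IF small-height(X) AND has-hat(X) AND has-beard(X) THEN X is a Brownie."""
    return small(height) and has_hat and has_beard
```

Crisp predicates return only True or False, which is exactly the jump from 0 to 1 of the rectangular membership functions discussed on the next slide.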
  • Slide 28
  • Crisp logic decisions Crisp logic is based on rectangular membership functions: True/False values jump from 0 to 1. Step functions are used for par
