Towards comprehensive foundations of Computational Intelligence
Włodzisław Duch, Department of Informatics, Nicolaus Copernicus University, Toruń, Poland

  • Slide 1
  • Towards comprehensive foundations of Computational Intelligence. Włodzisław Duch, Department of Informatics, Nicolaus Copernicus University, Toruń, Poland; School of Computer Engineering, Nanyang Technological University, Singapore. Google: Duch. ICONIP HK 11/2006
  • Slide 2
  • Plan. What is Computational Intelligence (CI)? What can we learn? Why solid foundations are needed. Similarity-based framework. Transformations and heterogeneous systems. Meta-learning. Beyond pattern recognition. Scaling up intelligent systems to human-level competence?
  • Slide 3
  • What is Computational Intelligence? The Field of Interest of the Society shall be the theory, design, application, and development of biologically and linguistically motivated computational paradigms emphasizing neural networks, connectionist systems, genetic algorithms, evolutionary programming, fuzzy systems, and hybrid intelligent systems in which these paradigms are contained. Artificial Intelligence (AI) was established in 1956! AI Magazine 2005, Alan Mackworth: In AI's youth, we worked hard to establish our paradigm by vigorously attacking and excluding apparent pretenders to the throne of intelligence, pretenders such as pattern recognition, behaviorism, neural networks, and even probability theory. Now that we are established, such ideological purity is no longer a concern. We are more catholic, focusing on problems, not on hammers. Given that we do have a comprehensive toolbox, issues of architecture and integration emerge as central.
  • Slide 4
  • CI definition. Computational Intelligence. An International Journal (1984) + 10 other journals with "Computational Intelligence" in the title; D. Poole, A. Mackworth & R. Goebel, Computational Intelligence - A Logical Approach (OUP 1998), a GOFAI book on logic and reasoning. CI should: be problem-oriented, not method-oriented; cover all that the CI community is doing now and is likely to do in the future; include AI (after all, AI researchers also consider themselves part of CI). CI: the science of solving (effectively) non-algorithmizable problems. A problem-oriented definition, firmly anchored in computer science. AI focuses on problems requiring higher-level cognition; the rest of CI is more focused on problems related to perception and control.
  • Slide 5
  • The future of computational intelligence...
  • Slide 6
  • What can we learn? A good part of CI is about learning. What can we learn? Neural networks are universal approximators and evolutionary algorithms solve global optimization problems, so everything can be learned? Not quite... Duda, Hart & Stork, Ch. 9, No Free Lunch + Ugly Duckling Theorems: uniformly averaged over all target functions, the expected error for all learning algorithms is the same; averaged over all target functions, no learning algorithm yields generalization error superior to any other. There is no problem-independent or "best" set of features. Experience with a broad range of techniques is the best insurance for solving arbitrary new classification problems.
  • Slide 7
  • Data mining packages. No free lunch => provide different types of tools for knowledge discovery: decision trees, neural, neurofuzzy, similarity-based, SVM, committees, tools for visualization of data. Support the process of knowledge discovery/model building and evaluation, organizing it into projects. GhostMiner, data mining tools from our lab + Fujitsu: separate the process of model building and knowledge discovery (hackers) from model use (lamers) => GM Developer & Analyzer. Many other interesting DM packages of this sort exist: Weka, Yale, Orange, Knime... 168 packages on the list! We are building Intemi, completely new tools. Surprise! Almost nothing can be learned using such tools!
  • Slide 8
  • What do DM packages do? Hundreds of components... transforming, visualizing... Yale 3.3 component counts by type: data preprocessing 74; experiment operations 35; learning methods 114; metaoptimization schemes 17; postprocessing 5; performance validation 14; plus visualization, presentation, and plugin extensions... Visual knowledge flow to link components, or script languages (XML) to define complex experiments.
  • Slide 9
  • What do NN components really do? Vector mappings from the input space to hidden space(s) and to the output space, plus adaptation of parameters to improve cost functions. Input-hidden-output mapping done by MLPs: T = {X^i}, training data, N-dimensional; H = {h_j(T)}, image of X in the hidden space, j = 1..N_H; ... more transformations in hidden layers; Y = {y_k(H)}, image of X in the output space, k = 1..N_C. ANN goal: the data image H in the last hidden space should be linearly separable; internal representations will determine network generalization. But we never look at them!
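The claim that the hidden layer maps data to a linearly separable image can be seen on the classic XOR problem. The sketch below uses hand-chosen weights (OR and AND hidden units) purely for illustration; a trained network would find an equivalent mapping:

```python
import numpy as np

# XOR is not linearly separable in input space, but its image in a
# two-unit hidden space is. Weights are hand-chosen for illustration.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

def layer(X, W, b):
    """One network layer: affine map followed by a step nonlinearity."""
    return (X @ W + b > 0).astype(float)

# Hidden layer: h1 = OR(x1, x2), h2 = AND(x1, x2)
W_h = np.array([[1.0, 1.0], [1.0, 1.0]])
b_h = np.array([-0.5, -1.5])
H = layer(X, W_h, b_h)          # image of the data in hidden space

# Output layer: a single linear unit on H now suffices
W_o = np.array([[1.0], [-1.0]])
b_o = np.array([-0.5])
Y = layer(H, W_o, b_o).ravel()

print(H)   # hidden representations of the four points
print(Y)   # [0. 1. 1. 0.] -- XOR, linearly separated in hidden space
```

Inspecting H directly, as the slide suggests we rarely do, shows the four inputs collapsed onto three hidden points that a single hyperplane splits correctly.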
  • Slide 10
  • Why solid foundations are needed. Hundreds of components... thousands of combinations... Our treasure box is full! We can publish forever! But what would we really like to have? Press the button and wait for the truth! Computer power is with us; meta-learning should find all interesting data models = sequences of transformations/procedures. Many considerations: optimal-cost solutions, various costs of using feature subsets; models that are simple & easy to understand; various representations of knowledge: crisp, fuzzy or prototype rules, visualization, confidence in predictions...
  • Slide 11
  • Slide 12
  • Principles: information compression. Neural information processing in perception and cognition: information compression, or algorithmic complexity. In computing: minimum length (message, description) encoding. Wolff (2006): cognition and computation as compression by multiple alignment, unification and search; covers analysis and production of natural language, fuzzy pattern recognition, probabilistic reasoning and unsupervised inductive learning, but so far only models for sequential data and 1D alignment. Information compression (encoding new information in terms of old) has been used to define a measure of syntactic and semantic information (Duch, Jankowski 1994), based on the size of the minimal graph representing a given data structure or knowledge-base specification; thus it goes beyond alignment.
  • Slide 13
  • Graphs of consistent concepts. Learn new concepts in terms of old ones: use a large semantic network and add new concepts by linking them to known ones. Disambiguate concepts by spreading activation, selecting those that are consistent with already active subnetworks.
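A minimal sketch of disambiguation by spreading activation over a toy semantic network; the graph, decay factor, and word senses below are illustrative assumptions, not the network from the talk:

```python
from collections import defaultdict

# Toy semantic network: undirected weighted links between concepts.
# The edges and weights are made up for illustration.
edges = {
    ("bank", "money"): 0.9, ("bank", "river"): 0.7,
    ("money", "loan"): 0.8, ("river", "water"): 0.8,
    ("loan", "interest"): 0.6,
}
graph = defaultdict(list)
for (a, b), w in edges.items():
    graph[a].append((b, w))
    graph[b].append((a, w))

def spread(seeds, steps=2, decay=0.5):
    """Propagate activation from seed concepts through the network."""
    act = defaultdict(float, {s: 1.0 for s in seeds})
    for _ in range(steps):
        new = defaultdict(float, act)
        for node, a in act.items():
            for nbr, w in graph[node]:
                new[nbr] += decay * w * a
        act = new
    return dict(act)

# Context "money, loan" activates the financial subnetwork, so the
# financial sense of "bank" wins over the river sense.
act = spread(["money", "loan"])
print(act["bank"] > act.get("river", 0.0))  # prints True
```

The consistent (financial) subnetwork accumulates activation from all its members, which is the selection criterion the slide describes.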
  • Slide 14
  • Similarity-based framework. (Dis)similarity: more general than feature-based description (no need for vector spaces, handles structured objects), more general than the fuzzy approach (F-rules reduce to P-rules); includes nearest-neighbor algorithms, MLPs, RBFs, separable function networks, SVMs, kernel methods and many others. Similarity-Based Methods (SBMs) are organized in a framework: p(Ci|X;M) posterior classification probabilities or y(X;M) approximators, with models M parameterized in increasingly sophisticated ways. A systematic search (greedy, beam, evolutionary) in the space of all SBM models selects the optimal combination of parameters and procedures, opening different types of optimization channels and trying to discover the appropriate bias for a given problem. Results: several candidate models; a very limited version gives the best results in 7 out of 12 Statlog problems.
  • Slide 15
  • SBM framework. Pre-processing: from objects (cases) O to features X, or directly to (dis)similarities D(O,O). Calculation of similarity between features d(xi, yi) and objects D(X,Y). Reference (or prototype) vector R selection/creation/optimization. Weighted influence of reference vectors G(D(Ri,X)), i = 1..k. Functions/procedures to estimate p(C|X;M) or y(X;M). Cost functions E[DT;M] and model selection/validation procedures. Optimization procedures for the whole model Ma. Search control procedures to create more complex models Ma+1. Creation of ensembles of (local, competent) models. M = {X(O), d(.,.), D(.,.), k, G(D), {R}, {pi(R)}, E[.], K(.), S(.,.)}, where S(Ci,Cj) is a matrix evaluating similarity of the classes, and a vector of observed probabilities pi(X) may replace hard labels. The kNN model: p(Ci|X;kNN) = p(Ci|X; k, D(.), {DT}); the RBF model: p(Ci|X;RBF) = p(Ci|X; D(.), G(D), {R}), etc.
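The kNN instance of this framework can be sketched as a posterior estimator with a pluggable distance D and neighbor-weighting function G; the function names and data here are illustrative, not from any framework implementation:

```python
import numpy as np

# Sketch of kNN as an SBM instance: p(C|X; k, D, {R}) built from the
# G-weighted votes of the k references nearest to X under distance D.
def euclidean(x, R):
    """Distance from a single point x to each reference row of R."""
    return np.sqrt(((x - R) ** 2).sum(axis=1))

def knn_posterior(x, refs, labels, n_classes, k=3, D=euclidean,
                  G=lambda d: np.ones_like(d)):
    """Estimate p(C|x); plain kNN uses uniform weights G(d) = 1."""
    d = D(x[None, :], refs)
    nearest = np.argsort(d)[:k]
    w = G(d[nearest])
    p = np.zeros(n_classes)
    for idx, wi in zip(nearest, w):
        p[labels[idx]] += wi
    return p / p.sum()

refs = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
labels = np.array([0, 0, 1, 1])
p = knn_posterior(np.array([0.05, 0.1]), refs, labels, n_classes=2, k=3)
print(p)  # class 0 dominates near the first two references
```

Swapping D (e.g. for a weighted Canberra distance) or G (e.g. a Gaussian of the distance, turning this into an RBF-like model) changes the model without changing the framework, which is the point of the parameterization M.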
  • Slide 16
  • Meta-learning in SBM scheme. Start from kNN, k=1, all data & features, Euclidean distance; end with a model that is a novel combination of procedures and parameterizations. Example sequence: k-NN 67.5/76.6%; +d(x,y) Canberra 89.9/90.7%; +si=(0,0,1,0,1,1) 71.6/64.4%; +selection 67.5/76.6%; +k opt. 67.5/76.6%; +d(x,y) + si=(1,0,1,0.6,0.9,1) Canberra 74.6/72.9%; +d(x,y) + selection, Canberra 89.9/90.7%.
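The distance/feature-weight combination explored in the sequence above is the weighted Canberra distance, d(x,y) = sum_i s_i |x_i - y_i| / (|x_i| + |y_i|); a minimal sketch with made-up vectors:

```python
import numpy as np

# Weighted Canberra distance, with feature weights s_i acting as the
# soft feature selection used in the meta-learning sequence.
def canberra(x, y, s=None):
    x, y = np.asarray(x, float), np.asarray(y, float)
    s = np.ones_like(x) if s is None else np.asarray(s, float)
    denom = np.abs(x) + np.abs(y)
    safe = np.where(denom > 0, denom, 1.0)          # avoid 0/0 terms
    terms = np.where(denom > 0, np.abs(x - y) / safe, 0.0)
    return float((s * terms).sum())

x, y = [1.0, 2.0, 0.0], [3.0, 2.0, 0.0]
print(canberra(x, y))                      # 0.5: only feature 1 differs
print(canberra(x, y, s=[0.0, 1.0, 1.0]))   # 0.0: that feature weighted out
```

Setting s_i = 0 removes a feature entirely, and binary vectors such as si=(0,0,1,0,1,1) reproduce the hard feature-selection steps of the sequence.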
  • Slide 17
  • Transformation-based framework. Extend SBM by adding fine granulation of methods and relations between them, to enable meta-learning by search in the model space. For example, the first transformation (layer) after pre-processing: PCA networks, with each node computing a principal component; LDA networks, each node computing an LDA direction (including FDA); ICA networks, nodes computing independent components; KL, or Kullback-Leibler, networks with orthogonal or non-orthogonal components; maximization of mutual information is a special case.
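A "PCA network" first layer can be sketched as a fixed linear transformation whose nodes output principal components; here the components are obtained by eigendecomposition on synthetic data (a real PCA node might instead be trained online, e.g. by Hebbian/Oja rules):

```python
import numpy as np

# First-layer PCA network sketch: each node's weight vector is one
# eigenvector of the data covariance, so node outputs are decorrelated.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated data

Xc = X - X.mean(axis=0)                  # center the data
C = Xc.T @ Xc / (len(Xc) - 1)            # sample covariance matrix
eigval, eigvec = np.linalg.eigh(C)       # eigenvalues in ascending order
order = np.argsort(eigval)[::-1]         # sort components by variance
W = eigvec[:, order[:2]]                 # weights of the first 2 PCA nodes

H = Xc @ W                               # node outputs = component scores
cov_H = np.cov(H.T)
print(np.round(cov_H, 3))                # off-diagonal ~0: decorrelated
```

The same skeleton with different node weights (LDA directions, independent components) yields the other first-layer transformations listed above, which is what makes them interchangeable pieces for meta-learning search.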

