Studies in Classification, Data Analysis, and Knowledge Organization

Managing Editors: H. H. Bock, Aachen; O. Opitz, Augsburg; M. Schader, Mannheim

Springer Berlin Heidelberg New York Barcelona Budapest Hong Kong London Milan Paris Santa Clara Singapore Tokyo

Editorial Board: W. H. E. Day, St. John's; E. Diday, Paris; A. Ferligoj, Ljubljana; W. Gaul, Karlsruhe; J. C. Gower, Harpenden; D. J. Hand, Milton Keynes; P. Ihm, Marburg; J. Meulmann, Leiden; S. Nishisato, Toronto; F. J. Radermacher, Ulm; R. Wille, Darmstadt

Titles in the Series

H.-H. Bock and P. Ihm (Eds.) Classification, Data Analysis, and Knowledge Organization

M. Schader (Ed.) Analyzing and Modeling Data and Knowledge

O. Opitz, B. Lausen, and R. Klar (Eds.) Information and Classification

H.-H. Bock, W. Lenski, and M.M. Richter (Eds.) Information Systems and Data Analysis

E. Diday, Y. Lechevallier, M. Schader, P. Bertrand, and B. Burtschy (Eds.) New Approaches in Classification and Data Analysis

W. Gaul and D. Pfeifer (Eds.) From Data to Knowledge

Hans-Hermann Bock · Wolfgang Polasek (Eds.)

Data Analysis and Information Systems: Statistical and Conceptual Approaches

Proceedings of the 19th Annual Conference of the Gesellschaft für Klassifikation e.V., University of Basel, March 8-10, 1995

With 127 Figures

Springer

Prof. Dr. Hans-Hermann Bock, Institut für Statistik und Wirtschaftsmathematik, Rheinisch-Westfälische Technische Hochschule Aachen (RWTH), Wüllnerstr. 3, D-52056 Aachen, Germany [email protected]

Prof. Dr. Wolfgang Polasek, Institut für Statistik und Ökonometrie, Universität Basel, Holbeinstr. 12, CH-4051 Basel, Switzerland [email protected]

Data analysis and information systems: statistical and conceptual approaches; University of Basel, March 8-10, 1995 / Hans-Hermann Bock; Wolfgang Polasek (ed.). - Berlin; Heidelberg; New York; Barcelona; Budapest; Hong Kong; London; Milan; Paris; Santa Clara; Singapore; Tokyo: Springer, 1996

(Proceedings of the ... annual conference of the Gesellschaft für Klassifikation e.V.; 19) (Studies in classification, data analysis and knowledge organization)

NE: Bock, Hans Hermann [Hrsg.]

ISBN-13: 978-3-540-60774-8 e-ISBN-13: 978-3-642-80098-6 DOI: 10.1007/978-3-642-80098-6

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

© Springer-Verlag Berlin · Heidelberg 1996

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Product liability: The publishers cannot guarantee the accuracy of any information about the application of operative techniques and medications contained in this book. In every individual case the user must check such information by consulting the relevant literature.

SPIN 10517148 21/3135 - 5 4 3 2 1 0 - Printed on acid-free paper

Preface

This volume presents 45 articles dealing with theoretical aspects, methodological advances and practical applications in domains relating to classification and clustering, statistical and computational data analysis, conceptual or terminological approaches for information systems, and knowledge structures for databases. These articles were selected from about 140 papers presented at the 19th Annual Conference of the Gesellschaft für Klassifikation, the German Classification Society. The conference was hosted by W. Polasek at the Institute of Statistics and Econometrics of the University of Basel (Switzerland), March 8-10, 1995¹.

The papers are grouped as follows, where the number in parentheses is the number of papers in the chapter.

1. Classification and clustering (8)
2. Uncertainty and fuzziness (5)
3. Methods of data analysis and applications (7)
4. Statistical models and methods (4)
5. Bayesian learning (5)
6. Conceptual classification, knowledge ordering and information systems (12)
7. Linguistics and dialectometry (4)

These chapters are interrelated in many respects. The reader may recognize, for example, the analogies and distinctions existing among classification principles developed in such different domains as statistics and information sciences, the benefit to be gained by the comparison of conceptual and mathematical approaches for structuring data and knowledge, and, finally, the wealth of practical applications described in many of the papers.

For convenience of the reader, the content of this volume is briefly reviewed.

1. Classification and clustering

P. G. Bryant applies the 'minimum description length' criterion for selecting the number of components in a normal mixture model with unequal covariance matrices in the classes, sometimes resulting in simpler models than the classical model selection approaches. W. Gaul and M. Schader consider two-mode data, e.g., matrices which describe an association (friendship, fluctuation) of a row unit (e.g., a cognac) to a column unit (e.g., an advertisement), and propose various clustering criteria and algorithms (exchange or penalty methods) for these data, including incomplete tables and overlapping classes. P. Hansen and B. Jaumard present a masterly survey of cluster analysis, statistical and combinatorial optimization criteria, and a wealth of numerical optimization algorithms used or developed in this framework. Ch. Heitz investigates the classification of an observed time series into one of two classes; unlike traditional approaches he uses a joint nonparametric time-frequency representation of signals, combined with a Euclidean distance classifier. Among the four papers of this volume dealing with spatial analysis, the article of M. Hussain and K. Fuchs considers spatial clustering of sites or regions described by correlated data vectors; they use a dissimilarity measure that takes into account differing autocorrelation structures in different spatial directions. J. Krauth investigates combinatorial tests for temporal or spatial (disease) clustering and derives bounds for the p-values of certain test statistics so that they can be used for medium or large size samples. The validation of a hierarchical single linkage classification underlies the paper of B. Van Cutsem and B. Ycart, who consider the exact and asymptotic probability distribution of several characteristics of a 'totally random', unstructured dendrogram together with a useful Markov chain interpretation. Finally, K.-D. Wernecke shows how classification trees obtained by the CART algorithm can be validated and summarized using resampling methods.

¹ Conference papers related to Internet problems were published in a separate volume in German: H.Chr. Hobohm and H.J. Wätjen (Hrsg.): Wissen in elektronischen Netzwerken. Beiträge zur Strukturierung und zum Retrieval von Information im Internet. Bibliotheks- und Informationssystem (BIS) der Universität Oldenburg, Oldenburg, 1995.

2. Uncertainty and fuzziness

This section deals with fuzzy classifications and methods for expressing or handling uncertain knowledge. T. Augustin presents a constructive way for defining and computing generalized interval-probabilities that includes the classical belief function approach of Dempster and Shafer. The unification of various uncertainty theories is attempted in the paper by E. Umkehrer and K. Schill, who develop a general formalism for representing and handling uncertain knowledge. S. Pohlmann summarizes her general method for deriving bounds for the simultaneous occurrence of several events, given marginal probability intervals for these events. In the framework of fuzzy clustering, Ch. Back and M. Hussain propose a measure for comparing two fuzzy partitions which avoids the deficiencies of the direct fuzzification of traditional indices, e.g., of the well-known Rand index. Finally, in an information-oriented setting, P. Mutschke presents a fuzzy retrieval model (AKCESS) for analyzing the relevance of scientific agents in a bibliographic database.

3. Methods of data analysis and applications

D. Baier and W. Gaul consider the analysis of paired comparisons data by probabilistic ideal point and vector models where, traditionally, a priori clusterings have been used. They propose a simultaneous approach for clustering and estimation and show by an example that the latter approach may outperform various versions of sequential methods. H.H. Bock, W.H.E. Day and F.R. McMorris investigate consensus rules for molecular sequences that are useful, e.g., for protein or DNA sequencing, and point to numerous open problems to be investigated more thoroughly. C. Mennicken and I. Balderjahn describe and analyze the individual process of perceiving and evaluating ecological risks from a manager's point of view and discuss the results obtained from a correspondence analysis of their data matrix. The paper of S. Ohl is a nice illustration of various descriptive, graphical and multivariate data analysis methods: his data were obtained in the development of the Mercedes-Benz S-Class when car buyers were asked for their preferences in extras and options, and he reveals from these data various dependence and clustering structures. U. Streit gives an account of computer-based geographic information systems and discusses various aspects of the geometrical, topological and thematic modelling of spatial objects. He calls for a closer cooperation of statisticians, geo-scientists and information scientists in order to improve on these systems. M. Theus presents an analysis of spatio-temporal data using interactive statistical graphics. He describes the software package REGARD which uses exploratory and graphical techniques. As an alternative, e.g., to multidimensional scaling, U. Wille presents methods by which an ordinal objects × data matrix may be represented by linearly separated regions in the n-dimensional real vector space. She thereby generalizes Scott's axiomatization of representability.

4. Statistical models and methods

G. Arminger and D. Enache survey artificial neural networks considered as statistical nonlinear regression models with parameters (weights) estimated (adapted) by quasi-maximum likelihood methods and nonlinear least-squares. Various distance or evaluation measures and several numerical algorithms are presented. G. Tutz proposes non-parametric smoothing methods for categorical response data, in particular discrete kernel regression and local likelihood approaches. Those approaches are investigated in detail for multi-categorical regression and the estimation of discrete hazard functions. C. Weihs and W. Seewald present the statistical expert system STAVEX which allows the interactive computer-based design of experiments in industry. Various optimization criteria as well as special features, such as mixture factors (with 'Cox axes') and composite responses, are included. R.L. Wolpert proposes the 'conditional frequentist test' for testing simple hypotheses. This test reports, on the one hand, posterior probabilities pleasing the Bayesians and is, on the other hand, optimal from a classical frequentist perspective.

5. Bayesian learning

M.J. Bayarri and B. Font consider the sampling from finite populations by using random routes and present Bayesian methods for the estimation of the unknown population average. K. Ickstadt, S. Jin and W. Polasek describe Gibbs and Metropolis sampling in bilinear time series models. Since the full conditional distributions cannot be found in a closed form, they are approximated by two versions of the Metropolis algorithm. In the framework of time series analysis, J.S. Pai and N. Ravishanker present four closed form expressions for the exact likelihood function of Gaussian fractionally integrated ARMA (or ARFIMA(p, d, q)) processes. ARFIMA models express long memory features and the likelihood presentations are investigated for classical and Bayesian estimation. L.I. Pettit considers the analysis of lifetime data where, in contrast to classical approaches, information on the degradation of the surviving items is incorporated. Bayesian estimation of the parameters of the degradation process and the prediction of future items are discussed. K.D.S. Young surveys Bayesian classification methods using predictive diagnostic measures. She compares techniques based on Bayes factors to those using conditional predictive ordinates.

6. Conceptual classification, knowledge ordering and information systems

This chapter is devoted to various topics from information science and pertains to classification systems, concept theory, terminology, and knowledge-based systems. J. Ingenerf considers conceptual models of a domain where the categories of objects are structured in taxonomies (such as vocabularies in medicine). He contrasts the two most popular representation formalisms dealing with terminological knowledge, i.e., Description Logic and Conceptual Graphs. R. Klar and A. Zaiß give an account of conceptual classifications and nomenclatures in medicine (e.g., ICD and SNOMED), discuss their properties and the pros and cons of their practical usage, and give recommendations for future developments. In this framework, S. Schulz sketches a conceptual model for pathological states based on a controlled vocabulary and on clusters of suitably defined object-attribute-value triples. Since formal and informal knowledge connected with terminologies is important in many information systems, M.M. Richter, G. Schmidt and M. Schneider discuss the relation between terminologies and knowledge acquisition in the framework of their environment MIKADO-KIT. H. Schimpe, M. Staudt, B. Kauert and A. Sperber propose a technique for improving the quality of knowledge bases in expert systems with an application to judging credit agreements in bank service. In order to manage the dialogue with the user in a knowledge-based system, R. Kiel and M. Schader designed and implemented an interpreter for a corresponding dialog-controlled rule system with a view to the marketing data analysis system WIMDAS. F. Wartenberg and W. Gaul show how this latter system can be enhanced by integrating a model base that supports the user when selecting optimization or data analysis techniques.

The two papers by W. Lenski and E. Wette-Roch are devoted to the Logic Information System LIS. While the latter author presents a conceptual model for structuring and representing concepts of mathematical logic and for developing a structured thesaurus, the former describes a new method for processing external (bibliographical) source information in LIS by combining information retrieval techniques and case-based reasoning. In the framework of formal concept analysis, G. Stumme shows how attribute exploration supports the acquisition of knowledge described by various types of implications for attributes. T.P. Reinartz and M. Zickwolff discuss the use of hierarchical conceptual clustering and of formal concept analysis for acquiring human expert knowledge, exemplified by the case of manufacturing rotational parts in mechanical engineering. Finally, A. Rusch and R. Wille show that the theory of knowledge spaces can be effectively connected with algebraic methods in formal concept analysis.

7. Linguistics and dialectometry

This last section deals with classifications, data analysis and learning systems in linguistics. P. Badry and S. Naumann contrast the symbolic and sub-symbolic approaches in computational linguistics, a question which is motivated by recent successful applications of neural nets in this domain. P. Barg presents a method for the automatic acquisition of linguistic knowledge from unstructured data, based on the lexical knowledge representation language DATR. E. Leopold investigates the probability distribution of different meanings of the same linguistic expression and derives, under some axiomatic assumptions, a generalization of the well-known Zipf distribution. Finally, G. Schiltz considers the classification of German dialects and discusses the benefits and disadvantages of taxometric methods formerly used for Romance and English dialects.

The organizers of the conference are very much indebted to several institutions that supported the conference. In particular, the invitation of outstanding plenary and survey speakers would not have been possible without the generous financial support provided by the Schweizer Nationalfonds, the Förderverein des Wirtschaftswissenschaftlichen Zentrums (WWZ) of the University of Basel and private enterprises such as Ciba-Geigy (Basel) and Daimler-Benz (Stuttgart). The University of Basel has provided, apart from technical support, the framework for the conference. Many people were active in the organization of the conference, including the members of the Scientific Program Committee and the staff of the Institut für Statistik und Ökonometrie. We very much appreciate their commitment. Special mention must go to Dipl.-Stat. Reinhard Vonthein, who managed a considerable part of the preparation and organization of the conference, and to Rosemarie Westphal, secretary of the congress office.

The editors of this volume gratefully acknowledge the cooperation of about 80 referees inside and outside of Germany who checked the submitted manuscripts and provided detailed reports, critiques and suggestions. Finally, Dipl.-Math. Erhard Cramer (Aachen) dedicated tremendous effort and considerable time to the preparation, reformatting and the substantial improvement of the revised manuscripts. The editors are extremely grateful for his excellent work.

Aachen/Basel, October 1995
Hans-Hermann Bock
Wolfgang Polasek

Table of Contents

Preface

Table of Contents

Section 1: Classification and Clustering

MDL for Mixtures of Normal Distributions
Peter G. Bryant

A New Algorithm for Two-Mode Clustering
Wolfgang Gaul, Martin Schader

Computational Methods in Clustering from a Mathematical Programming Viewpoint
Pierre Hansen, Brigitte Jaumard

Classification of Time Series with Optimized Time-Frequency Representations
Christoph Heitz

Cluster Analysis Using Spatial Autocorrelation
Mushtaq Hussain, Klemens Fuchs

Bounds for p-values of Combinatorial Tests for Clustering in Epidemiology
Joachim Krauth

Probability Distributions on Indexed Dendrograms and Related Problems of Classifiability
Bernard Van Cutsem, Bernard Ycart

On the Validation of Classification Trees
Klaus-Dieter Wernecke, Kurt Possinger, Gerd Kalb

Section 2: Uncertainty and Fuzziness

Modeling Weak Information with Generalized Basic Probability Assignments
Thomas Augustin

Validity Measures for Fuzzy Partitions
Christian Back, Mushtaq Hussain

Uncertainty and Actor-Oriented Information Retrieval in μ-AKCESS. An Approach Based on Fuzzy Set Theory
Peter Mutschke

The "Combination Problem" for Probability Intervals: Necessary Assumptions
Sigrid Pohlmann

A Classification System for Uncertainty Theories: How to Select an Appropriate Formalism?
Elisabeth Umkehrer, Kerstin Schill

Section 3: Methods of Data Analysis and Applications

Analyzing Paired Comparisons Data Using Probabilistic Ideal Point Models and Probabilistic Vector Models
Daniel Baier, Wolfgang Gaul

Consensus Rules for Molecular Sequences: Open Problems
Hans-Hermann Bock, William H.E. Day, Fred R. McMorris

Latent Dimensions of Managers' Risk Perception: An Application of Correspondence Analysis
Claudia Mennicken, Ingo Balderjahn

Analysis of the Behaviour of New Car Buyers
Stefan Ohl

Statistical Analysis of Spatial Data in Geographic Information Systems
Ulrich Streit

Analysis of Spatio-Temporal Data Using Interactive Statistical Graphics
Martin Theus

Representation of Finite Ordinal Data in Real Vector Spaces
Uta Wille

Section 4: Statistical Models and Methods

Statistical Models and Artificial Neural Networks
Gerhard Arminger, Daniel Enache

Smoothing for Categorical Data: Discrete Kernel Regression and Local Likelihood Approaches
Gerhard Tutz

Computer-based Design of Experiments in Industry
Claus Weihs, Wolfgang Seewald

Testing Simple Hypotheses
Robert L. Wolpert

Section 5: Bayesian Learning

Bayesian Hierarchical Models for Random Routes in Finite Populations
Maria J. Bayarri, Begoña Font

Metropolis Sampling in Bilinear Time Series Models
Katja Ickstadt, Song Jin, Wolfgang Polasek

Exact Likelihood Function Forms for an ARFIMA Process
Jeffrey S. Pai, Nalini Ravishanker

Learning about Degradation Models and Prognostic Factors
Lawrence I. Pettit

Bayesian Classification Using Predictive Diagnostic Measures
Karen D.S. Young

Section 6: Conceptual Classification, Knowledge Ordering and Information Systems

On the Relationship between Description Logics and Conceptual Graphs with some References to the Medical Domain
Josef Ingenerf

The Design of an Interpreter for Dialog-Controlled Rule Systems
Rainer Kiel, Martin Schader

Conceptual Classifications and Nomenclatures in Medicine
Rüdiger Klar, Albrecht Zaiß

A Method for the Detection of Sources in the Logic Information System LIS
Wolfgang Lenski

Two Conceptual Approaches to Acquire Human Expert Knowledge in a Complex Real World Domain
Thomas P. Reinartz, Monika Zickwolff

Terminology and Knowledge Representation in Complex Domains
Michael M. Richter, Gabriele Schmidt, Marina Schneider

Knowledge Spaces and Formal Concept Analysis
Anja Rusch, Rudolf Wille

A Deductive Approach for Knowledge Base Refinement in Expert Systems and its Application to Credit Analysis
Heribert Schimpe, Martin Staudt, Britt Kauert, Achim Sperber

Knowledge Representation of Pathological States
Stefan Schulz

Attribute Exploration with Background Implications and Exceptions
Gerd Stumme

Integrating an OR Model Base into a Knowledge-Based System for Marketing Data Analysis
Frank Wartenberg, Wolfgang Gaul

LIS/THESAURUS: A Model for Structuring and Representing Concepts of Mathematical Logic
Elisabeth Wette-Roch

Section 7: Linguistics and Dialectometry

Gender Classification of German Nouns: Symbolic and Sub-Symbolic Approaches
Petra Badry, Sven Naumann

Automatic Inference of DATR Theories
Petra Barg

Probability Distributions of Polysemantic Expressions
Edda Leopold

German Dialectometry
Guillaume Schiltz

Subject Index (including List of Authors)