

Mining Text with the Prototype-Matching Method

A. Durfee, Appalachian State University, USA
A. Visa, Tampere University of Technology, Finland
H. Vanharanta, Tampere University of Technology, Finland
S. Schneberger, Appalachian State University, USA
B. Back, Åbo Akademi University, Finland

Information Resources Management Journal, 20(3), 19-31, July-September 2007, edited by Mehdi Khosrow-Pour.

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

ABSTRACT

Text documents are the most common means for exchanging formal knowledge among people. Text is a rich medium that can contain a vast range of information, but text can be difficult to decipher automatically. Many organizations have vast repositories of textual data but few means of automatically mining that text. Text mining methods seek to use an understanding of natural language text to extract information relevant to user needs. This article evaluates a new text mining methodology, prototype-matching for text clustering, developed by the authors' research group. The methodology was applied to four applications: clustering documents based on their abstracts, analyzing financial data, distinguishing authorship, and evaluating multiple-translation similarity. The results are discussed in terms of common business applications and possible future research.

Keywords: heuristic development; information retrieval; natural language interface

INTRODUCTION

It can be argued that computers are now used more for storing and retrieving data than for computing data. Organizational computer systems are used for maintaining inventory, production, marketing, financial, sales, accounting, personnel, customer, and other types of data. With enterprise systems, vast amounts of corporate data can be stored digitally and made available to employees when and where needed. Data mining software is often used to further glean information from corporate databases.

A lot of transactional corporate data is numeric, but not all of it. Indeed, it is often stated that about 80% of corporate information is textual or unstructured (for example, see Chen, 2001, and Robb, 2004). An entire information systems specialty, knowledge management, includes collecting, storing, organizing, evaluating, and using textual data, such as the vast repositories of written reports kept by consulting agencies.

The World Wide Web provides access to planetary-wide databases of textual data for corporate users. Just one of hundreds of online article databases (the Education Resources Information Center, or ERIC) has more than 1.2 million citations and 110,000 full-text articles. Another, HighWire Press, has more than 1.3 million full-text articles. Internal and external data sources offer extensive decision support for managers in dynamic, complex, and demanding business environments. But how can managers, decision makers, and knowledge workers find appropriate textual content among billions of words in internal and external document repositories when it is virtually impossible to do so manually? Seventy-five percent of managers spend more than an hour per day just sorting their e-mails, according to a Gartner Group survey (Marino, 2001).

Compounding the problem is that text, by its very nature, can have multiple meanings and interpretations. The structure of text is not only complex but also not always directly obvious. Even the author of a text might not know the extent of what might be interpreted from it. These features make text a very rich medium for conveying a wide range of meanings, but also very difficult to manage, analyze, and mine using computers (Nasukawa & Nagano, 2001). Therein lies the conundrum: there is too much internal and external text to mine manually, but it is problematic for computer software to correctly interpret text, let alone create knowledge from it.

Text mining (TM) offers a remedy to that problem. TM seeks to extract high-level knowledge and useful patterns from low-level textual data. Text mining tools seek to analyze and learn the meaning of implicitly structured information automatically (Dorre, Gerstl, & Seiffert, 1999). There are two broad categories of text mining: text categorization and text clustering.

Text categorization analyzes text using pre-determined structures or words (i.e., keywords). It is a framework-driven approach, usually based on earlier analysis or expectations. Authors, readers, and librarians may introduce and use keywords, indexes, or mark-ups to outline the main ideas, concepts, and themes within a text to make textual searches easier for computers (Anderson, 1999; Chieng, 1997; Lahtinen, 2000; Salton, 1989; Weiss, White, Apte, & Damerau, 2000). However, authors and textual information users can assign different keywords to the same text, or even ascribe different meanings to the same keywords, possibly defeating the speed and accuracy of computer-based textual keyword searches. Readers need only consider their own wayward searches using keyword-based online search engines to understand the depth and breadth of the problem.

Text clustering, on the other hand, differs from keyword or pre-determined structural searches. Text clustering discovers latent groupings of text, where the textual similarities within a group are maximized while the similarities among groups are minimized. Effective text clustering uses the characteristics of textual meaning, structure, syntax, and semantics to find commonality and group similar text. The resulting clusters can then be used to efficiently search and analyze textual documents. Text clustering is thus a data-driven approach, not dependent on the accuracy and meaning of preconceived keywords or structures.

The key to gaining knowledge from internal and external textual repositories, therefore, may be to exploit computers for processing the vast amounts of textual data with text mining software, using text clustering to discover intrinsic knowledge within documents.

This paper seeks to present a new text clustering methodology developed by the authors' research group, position it among other text mining approaches, and empirically test it in four different applications. In the next section, this article examines related text mining research, approaches, and problems. Then the paper closely explains the process and research methodologies used by the new methodology, analyzes the results of testing it against four applications, and discusses its general business applicability. Finally, the paper examines further text mining research opportunities.


RELATED WORK

The Nature of Text

Text, as a written form of spoken natural language, provides one of the most effective communication bridges among people. Text has a complicated and ambiguous multilevel structure; it is highly multidimensional, with tens of thousands of dimensions (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Structure exists in word formation (the morphology of language), in sentence grammar (syntax), and in meaning (semantics). Moreover, the three components of text (word usage, grammatical construction, and meaning) vary considerably within every individual language.

The authors and readers of text often represent the same meaning using different words (synonymy) or use the same words to convey different meanings (polysemy). For example, in human-system communication, two people favor the same term with a probability of less than 20%, resulting in 80-90% failure rates in communication (Furnas, Landauer, Gomez, & Dumais, 1987).

This characteristic of natural language words as the basic units of text can confuse text-handling technologies such as document management systems, automatic thesauruses, and search engines. These technologies are primarily based on keywords, indexes, or a text property (such as author, subject, type, word count, printed page count, and time last written). Such approaches are less effective with natural language text because of its ambiguity, polysemy, synonymy, syntactic complexity, and multi-variance of interpretations. Text mining users should be able to categorize, prioritize, and compare documents not only by pre-defined keywords but also by the meaning of any particular document, without having to manually browse, read, and analyze them.

Text Mining Challenges

To some extent, a language is merely a fixed stock of words that can be arranged in seemingly endless variations. Additionally, words can interact in many ways; some words are more likely to occur near certain other words, for example, while others modify the meanings of words nearby. According to Zipf's law, the product of a word's frequency and its frequency rank is approximately constant (Zipf, 1972). This can allow some textual analysis based on word occurrence and placement.
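As an illustration of the word-occurrence analysis mentioned above, the rank-frequency product of Zipf's law can be computed directly from a corpus. The following sketch is not part of the original methodology; the toy corpus and function name are invented for illustration.

```python
from collections import Counter

def zipf_products(text):
    """Rank words by frequency and return (word, rank, freq, rank*freq)
    tuples. Under Zipf's law, the rank * frequency product is roughly
    constant for a sufficiently large natural-language corpus."""
    counts = Counter(text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(w, r, f, r * f) for r, (w, f) in enumerate(ranked, start=1)]

products = zipf_products("the cat sat on the mat and the dog sat on the rug")
for word, rank, freq, prod in products[:3]:
    print(word, rank, freq, prod)
```

On a corpus this small the products fluctuate, but on book-length text the near-constancy described by Zipf becomes visible.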

A more fundamental reason for a language being more than a simple word stock is that natural language expressions have syntactic structural significance.1 This is the fundamental problem with text categorization using keywords; finding keywords does not necessarily equate to finding meaning. Different individuals may describe the same meaning with different (key)words because of word synonymy. In natural language, there is no one-to-one correspondence between word strings and syntactic structure, or between syntactic structure and meaning.

According to Pullum and Scholz (2001), other features of natural languages that confuse automated systems are the unlimited complexity of natural language expressions and the variability of syntax. Even when some words are omitted from context, a reader or listener is usually able to infer the meaning from ill-structured text. A speaker or writer who constructs textual expressions to deliver certain information to a listener or reader naturally allows some degree of personal preference and background to determine textual structure, giving the literary world its vast range of writing styles. The very features that allow us to accept different structural styles and ill-structured expressions handicap automated text mining software.

Text Mining

Text mining (TM) methods and tools strive to search, organize, browse, and analyze text collections automatically, looking for patterns (Kroeze, Matthee, & Bothma, 2003). TM can be simply defined, according to Witten, Bray, Mahoui, and Teahan (1998), as the process of analyzing text to extract information useful for


particular purposes. Hearst (1999) introduces TM as a step toward discovering or creating knowledge from a collection of documents. TM as a knowledge management technique identifies patterns and unexpected relationships in text that were previously unknown to its users (Albrecht & Merkl, 1998). Kroeze et al. (2003) expand TM terminology and offer the parameters of non-novel, semi-novel, and novel investigation to differentiate between full-text information retrieval, standard text mining, and intelligent text mining. This terminology framework is represented in Table 1.

Understanding textual data can be supported by specific mathematical operations: categorization, clustering, feature extraction, thematic indexing, and information retrieval by content (Hand, Mannila, & Smyth, 2001; van Rijsbergen, 1979). Categorization assigns documents to pre-existing categories, called "topics" or "themes." Text categorization applications include indexing text to support document retrieval and data extraction (Lewis, 1992). Clustering partitions a given collection of items into a number of previously unknown groups with similar content. Clustering can discover unknown or previously unnoticed links in a subset of words, sentences, paragraphs, or documents. Feature extraction extracts particular items from a collection of items to provide a representative sample of the overall content. Distinctive vocabulary items found in a document can be assigned to different categories by measuring the importance of those items to the document content. Thematic indexing identifies significant items in a particular collection. Textual indexing can identify a given document or query text by a set of weighted or un-weighted terms, often referred to as index terms or keywords (most commonly or rarely used). Information retrieval locates a subset of a collection deemed to be relevant to a posed query, based on a preconceived classification system. Traditionally, textual information retrieval systems are query-based, and it is assumed that users can describe their information needs explicitly and adequately in a parsable query. Text categorization and clustering are the most prevalent text mining methodologies. This article focuses on text clustering.

Table 1. A classification of text mining approaches, adapted from Kroeze et al. (2003)

Non-novel investigation (finding/retrieving already existing and known information): information retrieval, which uses exact or best matching (e.g., finds full texts or abstracts of papers).

Semi-novel investigation (patterns/trends that already exist in the data but are unknown): standard text mining, which uses mainly statistical and computational methods (e.g., discovers lexical and syntactic features; finds beginnings of new themes; categorizes text into preexisting classes; summarizes text; groups text into clusters with shared qualities).

Novel investigation (creating new knowledge outside of the data collection itself): intelligent text mining, which uses interaction between an investigator and a computerized tool (e.g., which business decision is implied; how linguistic features of text can be used to create knowledge about the outside world).


Text Clustering

There are a number of text mining clustering techniques based on statistical clustering (Slonim & Tishby, 2000; Zamir & Etzioni, 1998) and neural networks, often in the form of self-organizing maps (SOM) (Chieng, 1997; Lagus, 2000). WebSOM, for example, applies SOM to cluster and visualize text content (Kohonen, 1997; Lagus, 2000; Toivonen, Visa, Vesanen, Back, & Vanharanta, 2001). WebSOM is based less on subjective perceptions of the authors and more on organizing or "visualizing" document content. Another statistical approach is based on fuzzy semantic typing to draw up a complete fuzzy affect lexicon of free text, as introduced by Subasic and Huettner (2000). Gedeon, Sing, Koczy, and Bustos (1996) apply fuzzy importance measures to retrieve significant "concepts" from documents using the entire document as a query vector. The hyperlink vector voting method of indexing and retrieving hypertext documents uses the content of hyperlinks pointing to a document to rank its relevance to the query terms (Li, 1998).

A new methodology for text clustering, the prototype-matching method (PMM), is introduced and discussed in the next section. PMM statistically analyzes natural language text as a digital array to understand semantic meaning hidden in the text (Visa, Back, & Vanharanta, 1999). The method is free from human preconceptions about the text, such as the markups, indexes, or keywords used by text categorization techniques.

RESEARCH METHODOLOGY

The PMM text mining methodology can be applied to various real-world problems to find hidden patterns in textual information (Visa et al., 1999). The starting point was to provide a mechanism enabling computers to retrieve pieces of text semantically relevant to each other. PMM can be thought of as a type of document matching, which matches a new document to old documents and ranks the new document by assigning a relevance score (Weiss et al., 2000). The proposed methodology was implemented in a prototyping software package called GILTA-3.

GILTA-3, using PMM, seeks similarities between the document-prototype and the closest-matching subject documents. For every prototype, two clusters are created: one cluster of documents that are similar to the prototype in some specific way, and another cluster of documents that are different from it in some specific way. The method constructs a ranked list for every document-prototype and creates clusters from the first hits on the ranked list. A cluster of similar documents is formed from documents that "fire," or appear as the closest matches at the top of a ranked list (in ascending order) of the distances between the prototype and the other documents. The cluster of documents different from the prototype shares few or no patterns with the prototype; the documents in this cluster fire at the farthest distances from the prototype in the ranked list.

It should be noted that clustering is the most challenging task, since there is no pre-existing set of categories created by human experts. Similarly, clustering results can be difficult to evaluate. The patterns that combine or separate documents into clusters may not be obvious to a user, and therefore evaluating the results can become tricky. Figure 1 below schematically depicts the process of comparing documents from a collection of documents.

Document Preprocessing and Encoding

Automated text mining using statistical methods can be aided by preprocessing every textual line in a document. Preprocessing rounds numbers, separates punctuation marks with extra spaces, and excludes extra carriage returns, mathematical signs, and dashes. Abbreviation, synonym, and compound word files are used to perform synonym and compound word filtering. Preprocessing does not omit words or perform word stemming, to keep as much initial information and structure in the preprocessed documents as possible, since word order, combinations, concurrences, and conjunctions can convey important insights.
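The preprocessing steps just described might be sketched as follows. The exact rules used in GILTA-3 are not given in the text, so the regular expressions and the sample input below are illustrative approximations only.

```python
import re

def preprocess(text):
    """Toy version of the preprocessing step: round numbers, pad
    punctuation with spaces, and drop dashes and carriage returns.
    This approximates, rather than reproduces, the GILTA-3 rules."""
    # Round decimal numbers to the nearest integer (e.g., 3.7 -> 4).
    text = re.sub(r"\d+\.\d+", lambda m: str(round(float(m.group()))), text)
    # Separate punctuation marks with extra spaces.
    text = re.sub(r"([.,;:!?])", r" \1 ", text)
    # Exclude dashes and stray carriage returns.
    text = text.replace("-", " ").replace("\r", " ")
    # Collapse runs of whitespace left behind by the steps above.
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Sales rose 3.7%, beating all-time records.\r\n"))
```

Note that no words are removed and no stemming is applied, matching the design goal of preserving word order and structure.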


After preprocessing, every document is encoded. Every word w in a document is transformed into a unique number y according to the formula:

y = Σ_{i=0}^{L-1} k^(L-i) × c_i        (1)

where L is the length of the word as a string of characters, c_i is the ASCII encoding value of the i-th character of w, and k is a constant. Since the eight-bit ASCII character set was used, k = 256. The encoding algorithm produces a unique number for each word, disregarding word stems, capitalization, and synonyms; only exactly identical words have equal y values. Since punctuation marks also have unique ASCII values, they are also encoded. The resulting values of every word and punctuation mark in every document become word vectors. These vectors can then be statistically processed.
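Equation (1) can be sketched in a few lines. The exact arrangement of the exponent did not survive the scan cleanly, so the base-k positional form below is a plausible reading rather than a verbatim transcription of the published formula.

```python
def encode_word(word, k=256):
    """Encode a word as an integer in the spirit of equation (1):
    y = sum over i of k**(L - i) * c_i, where c_i is the ASCII code
    of the i-th character and L is the word length. Distinct strings
    map to distinct values, so stems, capitalization, and synonyms
    are NOT conflated."""
    L = len(word)
    return sum(k ** (L - i) * ord(ch) for i, ch in enumerate(word))

# Only exactly identical strings share a code:
print(encode_word("mine") == encode_word("mine"))   # True
print(encode_word("Mine") == encode_word("mine"))   # False
print(encode_word("mines") == encode_word("mine"))  # False
```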

Document Processing

PMM is based on the frequency distribution of all words compared to a training set, often comprising the entire text collection as the initial training set. Word and sentence histograms, which allow the comparison of different documents to each other, may rely heavily on the frequency distributions of words or sentences from an entire document collection. The same processing and analysis can be performed for document paragraphs if they are sufficiently lengthy to provide good distributions. Processing begins by examining the distribution of the coded word numbers.

Word Quantization

From the set of word codes produced by equation (1), the minimal and maximal values are identified for the entire document collection. The distribution of the codes is then examined using a Weibull distribution. The Weibull distribution is a highly adaptable distribution that can take on the characteristics of other types of distributions, depending on the value of its shape parameters (ReliaSoft Corporation, 2002). The range between the minimal and maximal values is divided into Nw logarithmically equal bins, where Nw is the total number of words in the text collection. The word frequency of each bin is calculated and then normalized according to Nw.
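The logarithmic binning step can be sketched as below. The function name and the toy word codes are invented for illustration, and the normalization shown (by total count) is a simplification of the Nw-based normalization in the text.

```python
import math
from collections import Counter

def log_bins(codes, n_bins):
    """Assign each word code to one of n_bins logarithmically equal
    bins spanning [min(codes), max(codes)], then return normalized
    bin frequencies (a toy version of the word-quantization step)."""
    lo, hi = math.log(min(codes)), math.log(max(codes))
    width = (hi - lo) / n_bins
    hist = Counter()
    for y in codes:
        # Clamp the top edge into the last bin.
        b = min(int((math.log(y) - lo) / width), n_bins - 1)
        hist[b] += 1
    total = len(codes)
    return [hist[b] / total for b in range(n_bins)]

freqs = log_bins([300, 70_000, 1_000_000, 16_000_000], 4)
print(freqs, sum(freqs))
```

Logarithmic (rather than linear) bins keep short common words and long rare words from collapsing into the same few bins, since the codes from equation (1) grow exponentially with word length.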

Using a selected precision, Weibull distributions are then calculated. The best-fitting Weibull distribution corresponding to the textual data is determined by examining the cumulative distribution. Every estimated Weibull distribution is compared with the code distribution by calculating the cumulative distribution function (CDF) according to:

Figure 1. Comparing documents based on extracted histograms of words and sentences

Page 7: Mining Text with the Prototype-Matching Method

Information Resource Management Journal, 20(3), 1�-31, July-September 2007 25

Copyright © 2007, IGI Global Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

CDF = 1 − e^(−2.6 × a × (log(y / y_max))^b)        (2)

where the minimum a and maximum b define the shape of the Weibull distribution. The ratio y/y_max is the actual portion of the total density mass. The only way to compare distributions is to use cumulative functions, which are the same as the integrated probability distributions. Comparing the estimated Weibull distributions with the cumulative code number distribution is performed in terms of the smallest square sum.
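The smallest-square-sum comparison can be sketched as below. Because the published parameterization of equation (2) did not survive digitization cleanly, this sketch substitutes the standard two-parameter Weibull CDF; the candidate parameter pairs and empirical points are invented for illustration.

```python
import math

def weibull_cdf(y, a, b):
    """Standard two-parameter Weibull CDF, used here as a stand-in for
    equation (2); treat a (scale) and b (shape) as illustrative."""
    return 1.0 - math.exp(-((y / a) ** b))

def best_fit(empirical, candidates):
    """Pick the (a, b) pair whose CDF has the smallest squared-error
    sum against an empirical cumulative distribution, mirroring the
    smallest-square-sum comparison described in the text."""
    def sse(params):
        a, b = params
        return sum((weibull_cdf(y, a, b) - p) ** 2 for y, p in empirical)
    return min(candidates, key=sse)

# Empirical cumulative points (value, cumulative proportion):
empirical = [(0.5, 0.39), (1.0, 0.63), (2.0, 0.86), (4.0, 0.98)]
print(best_fit(empirical, [(1.0, 1.0), (2.0, 0.5), (1.0, 2.0)]))
```

In practice one would scan a grid of (a, b) values at the selected precision rather than a handful of hand-picked candidates.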

The best-fitting Weibull distribution is then divided into Nw bins of equal size, and every word is assigned to a bin. Every word is eventually represented by the number of the distribution bin to which it belongs.

Sentence Quantization

In the same manner as words, every sentence is converted to a representational number. After every word in a sentence is changed to a bin number (bn_i), the whole encoded sentence is considered a sampled signal (vector). Since not all sentences contain the same number of words, sentence vector lengths vary. To compensate for this, a discrete Fourier transform (DFT) converts every sentence vector from a collection into an input signal. The input signal is a vector (bn_1, bn_2, ..., bn_m), where m is the word's position in the sentence. The output signals of the DFT are the coefficients B_i (i = 0 ... n). The coefficient B_1 is then selected to represent the sentence.
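The reduction of a variable-length sentence vector to a single B_1 coefficient can be sketched as follows. Whether the original method kept the complex B_1, its magnitude, or some other reduction is not stated, so the use of the magnitude here is an assumption.

```python
import cmath

def sentence_coefficient(bin_numbers):
    """Compute the DFT of a sentence's word-bin vector and return |B1|,
    the magnitude of the first non-constant DFT coefficient, as a
    single representative number for the sentence."""
    m = len(bin_numbers)
    B1 = sum(x * cmath.exp(-2j * cmath.pi * k / m)
             for k, x in enumerate(bin_numbers))
    return abs(B1)

# Sentences of different lengths still yield one number each:
print(sentence_coefficient([3, 7, 7, 2]))
print(sentence_coefficient([12, 1, 5]))
```

Because the DFT is computed over whatever length the sentence happens to have, sentences of different lengths all collapse to comparable scalar codes.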

After every sentence has been converted into a number, a cumulative distribution is created from the sentence data set using the coefficients B_1, in the same way as on the word level. The range between the minimal and maximal parameters of the sentence code distribution is divided into Ns equally sized bins, where Ns is the length of the histogram vector. The frequency of sentences belonging to each bin is calculated. Then the bin counts are normalized

Figure 2. Example of a sentence distribution


in accordance with Ns. Finally, the best-fitting Weibull distribution corresponding to the sentence distributions is found. A graphical representation of sentence quantization from Back, Toivonen, Vanharanta, and Visa (2001) is shown in Figure 2.

Individual Histograms

Finally, every document in a collection is re-processed to create individual word- and sentence-level histograms. After each word is quantized using word quantization, a word histogram Aw is created for each document. The histograms Aw are then normalized according to the length of the histogram vectors. Similarly, the sentence-level histograms for every document in a collection are created and normalized.

Document Matching and Ranking

The individual word- and sentence-level histograms of all documents in a collection can then be compared with a histogram corresponding to a document-prototype (or sample document). This analysis is called document matching. Matching is done by first calculating simple Euclidean distances among the histograms; the documents closest to a document-prototype in terms of Euclidean distance form a document cluster. This matching is done for both word and sentence histograms.

In the ranking phase, the documents with the smallest distances to a document-prototype are chosen from the top of the ranked list. The system creates a proximity table of all distances among the documents in a collection. The documents at the top of the proximity table for a given document-prototype are presented to a user within a specified recall window. The recall window is the quantity of closest-matching documents that a user wants to retrieve and consider for further analysis.
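The matching and ranking steps can be sketched as below. The document names and histogram values are invented for illustration; a real run would use the normalized word and sentence histograms described above.

```python
import math

def euclidean(h1, h2):
    """Euclidean distance between two equal-length histograms."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

def recall_window(prototype, documents, window):
    """Rank documents by histogram distance to the prototype and return
    the `window` closest matches, mimicking the proximity table and
    recall window described in the text."""
    ranked = sorted(documents.items(),
                    key=lambda kv: euclidean(prototype, kv[1]))
    return [name for name, _ in ranked[:window]]

docs = {
    "annual_report_a": [0.1, 0.4, 0.5],
    "annual_report_b": [0.3, 0.3, 0.4],
    "press_release":   [0.8, 0.1, 0.1],
}
print(recall_window([0.1, 0.5, 0.4], docs, window=2))
```

The documents at the far end of the same ranked list would form the cluster of documents most different from the prototype.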

Empirical Validation

The prototype-matching text mining methodology was validated in four applications: clustering scientific articles, analyzing financial data, authorship attribution, and translation accuracy.

Scientific Article Clustering

Scientific research articles are published in the tens of thousands each year in hundreds of different academic fields. Their topics similarly range in the thousands, yet may overlap to some degree. A marketing research paper, for example, may investigate segmentation using a new information systems technology, or a biology paper may examine ant colonies in terms of economic theories. How can one best mine scientific research papers when there may be significant topical overlap? At present, most papers (including this one) include keywords for text categorization techniques, but text clustering may offer much more efficient and effective text mining. Similar problems can be found in business, when many cross-related reports need to be mined for specific topics.

For this paper, the prototype-matching tool was applied to 444 scientific abstracts obtained from the Hawaii International Conference on System Sciences 2001 (HICSS-34). The scientific papers were organized by conference track chairs into nine major thematic tracks, with further subdivision into 78 mini-tracks. Furthermore, the track chairs attempted to identify themes that ran across the tracks, outlining six cross-track themes featuring 134 papers in 26 mini-tracks.

Using the GILTA-3 software based on PMM, the authors sought to justify the conference's chosen track themes based on the themes expressed in the papers. The full text of every abstract was encoded into an array of 2,080 text distribution bins based on common word histograms. Sentence histograms of size 25 were generated for every abstract.

There were mixed results comparing what GILTA-3 found versus how conference chairs allocated papers. With a recall window set at 25, for example, 26% of the data-mining papers clustered with the papers from the data mining mini-track theme. Only 12% of papers from the e-commerce development track clustered with papers discussing e-commerce issues. For other cross-track themes, such as knowledge management, collaborative learning, workflow, and e-commerce development, the number of


papers that fired as the closest ones to the papers within a theme was less than 10%.

Qualitative Financial Data

Traditionally, corporate financial performance is analyzed using quantitative financial ratio data (such as the price-to-earnings ratio). However, some valuable financial descriptive data can be found in the textual portions of corporate reports, such as annual reports. Manually reading thousands of long annual reports to get a complete financial picture of companies, however, is not practical. Can automated text mining methods be applied to glean financial information from corporate reporting text?

Using a database of 234 annual reports from 50 pulp companies covering 1985-1989, the text was evaluated using PMM. The text-mining results were compared with the quantitative analysis conducted for the same companies in Kloptchenko et al. (2004). The comparison highlighted some discrepancies between the qualitative and quantitative performance pictures in the reports. While the discrepancies may be partly explained by a tendency to overstate actual financial status in textual reports, the analysis appeared to support the notion that PMM text mining can be used to analyze qualitative, textual financial data.

Another effort clustered textual quarterly reports from the telecommunications sector leaders Ericsson, Motorola, and Nokia for the years 2000-2003 (Back et al., 2001). It was found that annual and quarterly textual reports contain messages about a company's future prospects, not just its past performance. This helps explain the dissimilarities between the clusterings of the qualitative and quantitative parts of the same quarterly and annual reports observed in Kloptchenko et al. (2004).

Both results suggest differences in the clustering of quantitative and qualitative data from the reports. Moreover, fluctuations in quantitative financial performance appeared to influence the qualitative parts of reports after some time lag. The lag varies by company: for Ericsson it is about one quarter, while for Motorola it can be two quarters.

Authorship Attribution

When digitized text can so easily be found on the Internet and copied directly into documents, can PMM be used to identify when one text differs significantly from another?

Two tests examined how well the clustering methodology could find divergences between texts written by different authors. Texts by three classical authors (William Shakespeare, Edgar Allan Poe, and George Bernard Shaw) were examined (Visa, Toivonen, Back, & Vanharanta, 2000). After preprocessing and vector quantization, histograms were created for the texts at the word and sentence levels. Bin sizes were set to 2,080 at the word level and 25 at the sentence level. Each text piece was treated in turn as a prototype and matched against the consolidated text from all sources.
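The matching step itself is not specified in detail. Under the assumption that similarity is measured as distance between histograms, the prototype-matching loop can be sketched as follows; Euclidean distance is an assumption here, and the authors' tool may use a different similarity measure.

```python
# Hypothetical sketch of the prototype-matching step: each text piece's
# histogram serves in turn as the prototype and is matched against the
# histograms of all other pieces.
import math

def distance(h1: list[float], h2: list[float]) -> float:
    """Euclidean distance between two equal-length histograms."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

def closest_matches(histograms: list[list[float]]) -> list[int]:
    """For each piece (used as prototype), index of its nearest neighbor."""
    matches = []
    for i, proto in enumerate(histograms):
        others = [(distance(proto, h), j)
                  for j, h in enumerate(histograms) if j != i]
        matches.append(min(others)[1])
    return matches
```

Authorship would then be attributed correctly whenever a piece's nearest neighbor was written by the same author.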

The author-divergence results were extremely good at the word level: the closest matches occurred among text pieces written by the same author. At the sentence level, the results were generally good, except for one mismatch between Shaw's Mrs. Warren's Profession and Poe's The Assignation. PMM thus appeared to have significant potential for recognizing and distinguishing authors' styles based on the peculiarities of their sentence structure.

Translation Authenticity

In a global business environment, textual documents are routinely translated into many languages. Since fluency in multiple languages is a specialist skill, business managers often have to take it on faith that translations are accurate. Can PMM be used to evaluate different translations of the same underlying text?

To evaluate this possibility, Bible versions in Greek, Latin, English, and two in Finnish (from 1933 and 1938) were chosen as test material. They were assumed to be highly accurate translations of significant cultural and religious meaning. Word-, sentence-, and paragraph-level histograms were created using the procedures outlined previously. Each book of the Bible was used as a prototype and matched across the different versions (Toivonen et al., 2001). A recall window of the 10 closest-matching documents was chosen for comparing identical passages in different translations. The assumption was that if books in different languages clustered together, that was evidence PMM could be used to verify translations (Visa, Toivonen, Vanharanta, & Back, 2001).

PMM found that, on average, 6 books out of 10 appeared in the same bins, that is, matched as identical. The English and Finnish versions averaged 4.52 books in the same bins based on the word map, 7.94 based on the sentence map, and 5.56 based on the paragraph map. A random sample would have placed only about two similar books in a bin. The results therefore appear to support using PMM text mining to compare translation accuracy. As a side note, comparing translations appears to work better at the sentence level than at the word or paragraph level.
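The recall-window evaluation just described can be sketched as follows: for a query book's histogram, rank all candidate histograms by distance and count how many of the k nearest carry the same label (that is, are the same book in another translation). The distance measure and the data below are illustrative assumptions, not the authors' exact procedure.

```python
# Hypothetical sketch of the recall-window evaluation used for the
# Bible-translation test (recall window k = 10 in the paper).
import math

def recall_window_hits(query, corpus, labels, query_label, k=10):
    """Count how many of the k nearest histograms share the query's label."""
    dist = lambda h1, h2: math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))
    ranked = sorted(range(len(corpus)), key=lambda i: dist(query, corpus[i]))
    return sum(1 for i in ranked[:k] if labels[i] == query_label)
```

Averaging this hit count over all query books, and comparing it with the count expected under random assignment, yields the kind of baseline comparison reported above.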

Discussion

In spite of the multidimensional and complex nature of natural language, statistical methods such as prototype-matching appear to be suitable for mining text in terms of text clustering. Given the enormous range and amount of text available in digital files within and outside companies, this suggests a number of potentially valuable automated applications.

In particular, three areas may provide the greatest return: text filtering, searching, and management, as shown in Table 2. These applications appear useful across the corporate value chain, but perhaps most so in marketing and finance, where textual news and reports can be mined for indications of future trends. Moreover, the applications appear worthwhile across different languages for global business operations.

The authors believe that the GILTA tool using PMM goes beyond information retrieval as an intelligent text mining tool, since it uncovers semi-novel information and creates knowledge from knowledge. Furthermore, the authors believe the GILTA-3 software can be implemented as a module within an existing enterprise support system or within individual decision support systems, such as financial analysis or marketing tools.

Research Opportunities

Given the depth and breadth of digitally available text and the range of business opportunities presented in Table 2, there appear to be considerable opportunities for refining PMM and its application.

PMM formulas might be refined to be more accurate yet more tolerant of natural text variation and of different native languages. The constant parameters and encoding methods might be improved, and appropriate bin sizes determined for different applications. Algorithms specifically for filtering text streams can be designed and tested. Search engines may be able to use PMM techniques to improve the efficacy of search results; in particular, they could search with entire paragraphs or documents instead of just keywords.

Table 2. Potential business applications of PMM text mining

Filtering: e-mails; mail routing; news monitoring; push publishing
Searching: automated indexing; genre classification; authorship attribution; survey coding; e-mails; corporate reports; social network analysis; business intelligence
Managing: knowledge management; corporate learning; leveraging expertise; legal document retention; compliance

Testing the applicability of PMM text clustering to the uses in Table 2 may present substantial opportunities for research and refinement. It is also possible that research into PMM will uncover further applications where automated text mining can contribute to business success.

ACKNOWLEDGMENT

The financial support of TEKES (grant number 40887/97) and the Academy of Finland is gratefully acknowledged.

REFERENCES

Albrecht, R., & Merkl, D. (1998). Knowledge discovery in literature data bases. Library and Information Services in Astronomy III, ASP Conference Series, Vol. 153.

Anderson, M. (1999). A tool for building digital libraries. Journal Review, 5(2).

Back, B., Toivonen, J., Vanharanta, H., & Visa, A. (2001). Comparing numerical data and text information from annual reports using self-organizing maps. International Journal of Accounting Information Systems, 2.

Chen, H. (2001). Knowledge management systems: A text mining perspective. Tucson, AZ: Knowledge Computing Corporation.

Chien, L.-F. (1997). PAT-tree-based keyword extraction for Chinese information retrieval. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '97), Philadelphia.

Dörre, J., Gerstl, P., & Seiffert, R. (1999). Text mining: Finding nuggets in mountains of textual data. In Proceedings of KDD-99, the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). Knowledge discovery and data mining: Towards a unifying framework. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR.

Furnas, G. W., Landauer, T. K., Gomez, L. M., & Dumais, S. T. (1987). The vocabulary problem in human-system communication. Communications of the ACM, 30(11), 964-971.

Gedeon, T., Sing, S., Koczy, L., & Bustos, R. (1996). Fuzzy relevance values for information retrieval and hypertext link generation. In Proceedings of EUFIT-96, the Fourth European Congress on Intelligent Techniques and Soft Computing, Aachen, Germany.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Boston: The MIT Press.

Hearst, M. (1999). Untangling text data mining. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), College Park, MD.

Kloptchenko, A., Eklund, T., Back, B., Karlsson, J., Vanharanta, H., & Visa, A. (2004). Combining data and text mining techniques for analyzing financial reports. International Journal of Intelligent Systems in Accounting, Finance, and Management, 12(1), 29-41.

Kohonen, T. (1997). Self-organizing maps. Springer-Verlag.

Kroeze, J., Matthee, M., & Bothma, J. (2003). Differentiating data and text mining terminology. In Proceedings of SAICSIT 2003, 93-101.

Lagus, K. (2000). Text mining with WebSOM. Unpublished doctoral dissertation, Espoo, Finland.

Lahtinen, T. (2000). Automatic indexing: An approach using an index term corpus and combining linguistic and statistical methods. Unpublished doctoral dissertation, University of Helsinki, Finland.

Lewis, D. (1992). Feature selection and feature extraction for text categorization. In Proceedings of the Speech and Natural Language Workshop.

Li, Y. (1998). Toward a qualitative search engine. IEEE Internet Computing, July-August.

Marino, G. (2001). Workers mired in e-mail wasteland. Retrieved from CNetNews.com.

Nasukawa, T., & Nagano, T. (2001). Text analysis and knowledge mining systems. IBM Systems Journal, 40(4), 967-984.

Pullum, G., & Scholz, B. (2001). More than words. Nature, 413, 367.

ReliaSoft Corporation (2002). Reliability glossary. ReliaSoft Corporation.

Robb, D. (2004). Text mining tools take on unstructured data. ComputerWorld, June 21.

Salton, G. (1989). Automatic text processing. Addison-Wesley.

Slonim, N., & Tishby, N. (2000). Document clustering using word clusters via the information bottleneck method. In Proceedings of SIGIR 2000. New York: ACM Press.

Subasic, P., & Huettner, A. (2000). Calculus of fuzzy semantic typing for qualitative analysis of text. In Proceedings of the KDD-2000 Workshop on Text Mining, Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston.

Toivonen, J., Visa, A., Vesanen, T., Back, B., & Vanharanta, H. (2001). Validation of text clustering based on document contents. In Proceedings of MLDM'2001, International Workshop on Machine Learning and Data Mining in Pattern Recognition, Leipzig, Germany.

van Rijsbergen, C. (1979). Information retrieval (2nd ed.). London: Butterworths.

Visa, A., Back, B., & Vanharanta, H. (1999). Toward text understanding: Comparison of text documents by sentence map. In Proceedings of EUFIT'99, the 7th European Congress on Intelligent Techniques and Soft Computing, Aachen, Germany.

Visa, A., Toivonen, J., Back, B., & Vanharanta, H. (2000). Toward text understanding: Classification of text documents by word map. In Proceedings of AeroScience 2000, SPIE 14th Annual International Symposium on Aerospace/Defense Sensing, Simulating and Controls, Orlando, FL.

Visa, A., Toivonen, J., Vanharanta, H., & Back, B. (2001). Prototype-matching: Finding meaning in the books of the Bible. In Proceedings of HICSS-34, Hawaii International Conference on System Sciences, Maui, Hawaii.

Weiss, S., White, B., Apte, C., & Damerau, F. (2000). Lightweight document matching for help-desk applications. IEEE Intelligent Systems, March/April.

Witten, I., Bray, Z., Mahoui, M., & Teahan, B. (1998). Text mining: A new frontier for lossless compression. In Proceedings of Data Compression Conference '98, IEEE.

Zamir, O., & Etzioni, O. (1998). Web document clustering: A feasibility demonstration. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98). ACM Press.

Zipf, G. K. (1972). Human behavior and the principle of least effort: An introduction to human ecology. New York: Hafner.

ENDNOTE

1 "Mohammed will come to the mountain" and "The mountain will come to Mohammed" have, of course, completely different meanings although they use exactly the same words.

Antonina Durfee (Kloptchenko) is an assistant professor at Appalachian State University in Boone, USA. She holds a PhD from Åbo Akademi University. Her current research is in text mining, knowledge discovery, human issues in technology adoption, and information-seeking behavior. She has published in the International Journal of Intelligent Systems in Accounting, Finance, and Management and the International Journal of Digital Accounting Research.


Ari Visa is professor of digital signal processing at Tampere University of Technology in Tampere, Finland. His current research interests are in multimedia and multimedia systems, adaptive systems, wireless communications, distributed computing, soft computing, computer vision, knowledge mining, and knowledge retrieval. He has published in journals such as the Journal of Management Information Systems; Information & Management; Benchmarking: An International Journal; and Information Visualization.

Hannu Vanharanta is professor of industrial management and engineering at Tampere University of Technology in Pori, Finland. His research interests include strategic management, human resource management, knowledge management, and executive support systems. He has published in journals such as the Journal of Management Information Systems; Information & Management; and Benchmarking: An International Journal.

Scott Schneberger is an associate professor in the Department of Computer Information Systems in the Walker College of Business at Appalachian State University, Boone, NC. He is also managing director of the Center for Applied Research on Emerging Technologies at Appalachian State University.

Barbro Back is professor of accounting information systems at Åbo Akademi University in Turku, Finland. Her research interests are in the areas of knowledge mining, neural networks, financial benchmarking, and enterprise resource planning systems. She has published in journals such as the Journal of Management Information Systems; Accounting, Management and Information Technologies; International Journal of Accounting Information Systems; European Journal of Operational Research; and Information & Management.