
Vocabulary Mining for Information Retrieval: Rough Sets and Fuzzy Sets

Padmini Srinivasan and Miguel E. Ruiz
School of Library and Information Science
The University of Iowa
Iowa City, Iowa 52242
padmini-srinivasan, [email protected]

Donald H. Kraft, Jianhua Chen, and Sukhamay Kundu
Department of Computer Science
Louisiana State University
Baton Rouge, LA 70803-4020
kraft, jianhua, [email protected] 7, 1998

Abstract

Vocabulary mining in information retrieval refers to the utilization of the domain vocabulary towards improving the user's query. Most often, queries posed to information retrieval systems are not optimal for retrieval purposes. Vocabulary mining allows one to generalize, specialize or perform other kinds of vocabulary-based transformations on the query in order to improve retrieval performance. This paper investigates a new framework for vocabulary mining that derives from the combination of rough sets and fuzzy sets. The framework allows one to use rough set based approximations even when the documents and queries are described using weighted, i.e., fuzzy, representations. The paper also explores the application of generalized rough sets and the variable precision models. The problem of coordination between multiple vocabulary views is also examined. Finally, a preliminary analysis of issues that arise when applying the proposed vocabulary mining framework to the Unified Medical Language System (a state of the art vocabulary system) is presented. The proposed framework supports the systematic study and application of different vocabulary views in information retrieval.

Keywords: Vocabulary mining, generalized rough sets, fuzzy sets, multiple vocabulary views, UMLS.


In information retrieval the challenge is to retrieve relevant texts in response to user queries. Information retrieval technology has matured to the point that we now have reasonably sophisticated operational and research systems. However, increasing the effectiveness of retrieval algorithms remains an important and actively pursued research goal. Query refinement, where the initial query is modified to yield a potentially more effective query, is an important part of information retrieval. This step is critical for users whose queries are not formulated well enough for an effective retrieval run. One alternative for query refinement, referred to here as vocabulary based query refinement, is to exploit knowledge within a vocabulary that is typically domain specific. A second approach utilizes the vocabulary in documents related to the query, where the related documents may be identified either through relevance or retrieval feedback.

Several families of statistical information retrieval models have received significant and long term attention, such as the Boolean, vector, probabilistic and fuzzy families of models. The general approach is to create suitable representations (Boolean, weighted, unweighted, etc.) for the query and the document and apply a suitable retrieval technique (similarity computation, probability of relevance, etc.) that derives from the adopted model. Query refinement in the Boolean model may occur by changing the query operators, the terms, or both. At all times the integrity of the term-operator relationships with respect to the user's information needs must be maintained. In the vector model, processes such as Rocchio's and Ide's feedback offer document based query refinement options [Salton, 1971]. Researchers have also investigated the derivation of fuzzy thesauri [Miyamoto, 1990]. However, query refinement in these models is an optional feature. In other words, these models allow retrieval to be conducted without any query refinement.

In contrast, the rough set model offers a tight integration between retrieval and vocabulary based query refinement. In fact, retrieval operates only after first exploring query refinement. Characteristics of the domain vocabulary, i.e., terms and relationships, are automatically utilized to refine the query representation before retrieval begins. An additional advantage is that the model also automatically allows the natural perturbations in vocabularies to influence document representations. In essence, rough sets offer an approach where the domain's vocabulary(ies) can be automatically mined prior to retrieval. Relationships linking terms, such as synonymy, near synonymy or related terms, lexically related terms, and specific and general terms, can all be automatically mined in order to strengthen retrieval effectiveness.

Our research goal is to explore the application of the family of rough set models to information retrieval. Almost 10 years ago, initial efforts by one of the authors demonstrated some of the potential of rough sets for information retrieval [Das-Gupta, 1988, Srinivasan, 1989, Srinivasan, 1991]. Since then the area of rough sets has matured significantly, with many exciting advances reported in the literature. We will explore further developments and their potential for information retrieval. In particular, we aim to determine if current extensions to the model will strengthen our previous applications of rough sets to retrieval.

Section 1 provides a brief review of the standard rough set model and our previous application of the model to information retrieval. Section 2 explores the inclusion of fuzzy sets and logic in the rough set framework. Section 3 applies the combination of fuzzy and rough sets to information retrieval. Sections 4 and 5 describe other extensions to the rough set model and their application to information retrieval. Extensions explored include generalized rough set models and the variable precision rough set model. Section 6 combines these extensions along with fuzzy notions into a unified and novel framework for collaboratively mining alternative vocabulary views. Section 7 presents a preliminary analysis of issues that arise when applying the proposed framework to the


Unified Medical Language System, a state of the art vocabulary system developed by the National Library of Medicine. The final section offers our conclusions and future plans for testing this approach.

1 Pawlak's Rough Set Model

The Rough Set Model (RSM) was proposed by Pawlak in the early 1980s [Pawlak, 1982]. It is an extension of standard set theory that supports approximations in decision making. It shares ideas and goals, to some extent, with other tools such as the Dempster-Shafer theory of evidence [Skowron & Grzymala-Busse, 1994], fuzzy set theory [Pawlak & Skowron, 1994], and discriminant analysis [Krusinska, Slowinski & Stefanowski, 1992]. As stated by others [Pawlak, et al., 1995], one advantage of rough set theory is that it does not require preliminary information about the data, such as probability assignments (as in Dempster-Shafer theory) or membership values (as in fuzzy set theory). However, it does require an equivalence relation operating on a universe of objects, and it offers a pair of approximation operators to characterize different subsets of the universe. Various systems for dealing with approximations in different application contexts (especially in data mining) have been built using these operators [Hu & Cercone, 1995, Millan & Machuca, 1997, Nguyen, et al., 1997, Øhrn, et al., 1997].

In Pawlak's model an equivalence relation partitions a non-empty universe of objects into disjoint equivalence classes. Objects within an equivalence class are indistinguishable with regard to the relation. Any appropriate equivalence relation may be used for this purpose. The universe and the equivalence relation together define an approximation space. The equivalence classes and the empty set, ∅, are considered the elementary or atomic sets in this approximation space. Such an approximation space may be used to describe arbitrary subsets of the universe. This is done using two approximations: the lower and the upper approximations of the subset.

Let R be an equivalence relation that partitions U, a non-empty universe of objects, to create an approximation space apr_R = (U, R). Let the partition be denoted as U/R = {C1, C2, ..., Cn}, where Ci is an equivalence class of R. Now, for an arbitrary subset S of U,

the lower approximation of S, apr_R(S) = {x ∈ Ci | Ci ⊆ S}, and
the upper approximation of S, āpr_R(S) = {x ∈ Ci | Ci ∩ S ≠ ∅}.

These two approximations are, in effect, approximate descriptions of the subset S in the approximation space (U, R). The term `rough set' refers to the pair of these two approximations (apr_R(S), āpr_R(S)) for reference set S. The accuracy of the approximation for S is |apr_R(S)| / |āpr_R(S)|.

The RSM may be viewed as an extension of set theory with two additional unary set-theoretic operators: the lower and upper approximation operators [Lin & Liu, 1993]. The underlying equivalence relation is serial, reflexive, symmetric, transitive and Euclidean in nature.

A relation R is serial if for every x ∈ U there exists a y ∈ U such that xRy holds. In other words, every element in the universe has an R-related element in the same universe. R is reflexive if for all x ∈ U, xRx holds. R is symmetric if for every xRy that holds, yRx also holds. R is transitive if for every x, y, z ∈ U, if xRy and yRz hold, then xRz holds. R is Euclidean if for every


x, y, z ∈ U, if xRy and xRz hold, then yRz holds. These properties are not independent of each other. For example, a reflexive relation implies a serial relation, and a symmetric and transitive relation implies a Euclidean relation. When equivalence is used to define the approximation space, and given any two subsets A and B of the universe, the lower approximations satisfy the following properties [Yao, et al., 1994] (here apr_R denotes the lower approximation, āpr_R the upper approximation, and ∼ set complement):

L1: apr_R(A) = ∼(āpr_R(∼A))
L2: apr_R(U) = U
L3: apr_R(A ∩ B) = apr_R(A) ∩ apr_R(B)
L4: apr_R(A ∪ B) ⊇ apr_R(A) ∪ apr_R(B)
L5: A ⊆ B ⟹ apr_R(A) ⊆ apr_R(B)
L6: apr_R(∅) = ∅
L7: apr_R(A) ⊆ A
L8: apr_R(A) = apr_R(apr_R(A))
L9: āpr_R(A) = apr_R(āpr_R(A))

Similarly, the upper approximations satisfy the following properties:

U1: āpr_R(A) = ∼(apr_R(∼A))
U2: āpr_R(∅) = ∅
U3: āpr_R(A ∪ B) = āpr_R(A) ∪ āpr_R(B)
U4: āpr_R(A ∩ B) ⊆ āpr_R(A) ∩ āpr_R(B)
U5: A ⊆ B ⟹ āpr_R(A) ⊆ āpr_R(B)
U6: āpr_R(U) = U
U7: A ⊆ āpr_R(A)
U8: āpr_R(āpr_R(A)) = āpr_R(A)
U9: āpr_R(apr_R(A)) = apr_R(A)

The following two properties are also satisfied:

K: apr_R(∼A ∪ B) ⊆ ∼apr_R(A) ∪ apr_R(B)
LU: apr_R(A) ⊆ āpr_R(A)

Properties L1 and U1 indicate that these approximations are dual operators. Many researchers have applied and studied these operators quite extensively for data mining in several domains, such as health care and engineering (for example, see [Hu & Cercone, 1995, Millan & Machuca, 1997, Nguyen, et al., 1997, Øhrn, et al., 1997]). Our interest is in information retrieval, i.e., the problem of identifying potentially relevant texts in response to queries.

1.1 Application of Pawlak's RSM to Information Retrieval

In previous work we showed that a vocabulary of terms can be modeled for information retrieval applications using the RSM [Das-Gupta, 1988, Srinivasan, 1989, Srinivasan, 1991]. The model was applied by considering the domain's vocabulary (individual words and phrases) as the universe U of objects. R represented the equivalence relation defined by the term synonymy relationship and


was used to create a partition of U such that terms within a class are synonyms of each other. Documents and queries, represented by vectors, were compared via their approximations in the approximation space apr_R = (U, R), as illustrated in the following example.

Example 1: Let T = {t1, t2, ..., t10} represent a vocabulary partitioned by R, the synonymy relation, such that T/R = {C1, C2, C3, C4, C5}, as defined below.

C1 = {t1, t4, t6}
C2 = {t3, t7}
C3 = {t5, t8, t10}
C4 = {t9}
C5 = {t2}

Let D1, a document, and Q1, a query, be defined as subsets of T:

D1 = {t2, t3, t4, t7}
Q1 = {t1, t2, t3}

We can then define the rough set (apr_R(D1), āpr_R(D1)) with reference set D1:

apr_R(D1) = {t2, t3, t7}
āpr_R(D1) = {t1, t2, t3, t4, t6, t7}

We can also define the rough set (apr_R(Q1), āpr_R(Q1)) with reference set Q1:

apr_R(Q1) = {t2}
āpr_R(Q1) = {t1, t2, t3, t4, t6, t7}

The following interpretations may be given to these approximations in the context of our retrieval application. The lower approximation identifies properties (i.e., terms) that definitely describe the subset (i.e., document or query) of interest. In contrast, the upper approximation identifies features that possibly describe the subset. These definite and possible features are, of course, determined largely by the underlying approximation space (i.e., vocabulary) and the relation R (i.e., synonymy). Thus the vocabulary T partitioned by synonymy indicates that the query Q1 is definitely described by t2 and is possibly described by t1, t2, t3, t4, t6, t7. Notice that the lower approximation automatically narrows the query/document to its core description, while the upper approximation expands the description to the extent permitted by the vocabulary space.

Comparing D1 and Q1: In our previous work a number of comparison methods were designed. For example, a document and query were considered `roughly equal' if they had identical lower and upper approximations. They were `top equal' if they had identical upper approximations.
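As a sketch (our own Python, not part of the paper), the approximations of Example 1 follow directly from the partition; the helper names are ours:

```python
# Sketch: Pawlak lower/upper approximations over a crisp partition (Example 1).

def lower_approx(partition, s):
    """Union of the equivalence classes wholly contained in s."""
    out = set()
    for c in partition:
        if c <= s:
            out |= c
    return out

def upper_approx(partition, s):
    """Union of the equivalence classes that intersect s."""
    out = set()
    for c in partition:
        if c & s:
            out |= c
    return out

partition = [{"t1", "t4", "t6"}, {"t3", "t7"}, {"t5", "t8", "t10"},
             {"t9"}, {"t2"}]
D1 = {"t2", "t3", "t4", "t7"}
Q1 = {"t1", "t2", "t3"}

print(sorted(lower_approx(partition, D1)))  # ['t2', 't3', 't7']
print(sorted(upper_approx(partition, Q1)))  # ['t1', 't2', 't3', 't4', 't6', 't7']
```

The lower approximation keeps only classes fully supported by the description, which is why Q1 narrows to {t2}.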


In the present work we adopt a slightly different strategy, as described next. Two subsets of U, say S1 and S2, may be compared in the approximation space (U, R) with a pair of asymmetric similarity measures. Using the lower approximations, the asymmetric similarity between S1 and S2, with S2 as the focus, can be computed in the following way. First let

Bl = apr_R(S2) |−| (apr_R(S1) ∩ apr_R(S2)) and
Bu = āpr_R(S2) |−| (āpr_R(S1) ∩ āpr_R(S2)),

where |−| represents the bounded difference. Then calculate (the superscripts l and u mark the lower- and upper-approximation versions of the measure):

Similarity_R^l(S1, S2) = 1 − [card(Bl) / card(apr_R(S2))]   (1)

This will equal 0 for no match and 1 for maximum match between S2 and S1, keeping S2 as the focus in the comparison. If card(apr_R(S2)) = 0 then Similarity_R^l(S1, S2) is set to 0. In the same way,

Similarity_R^u(S1, S2) = 1 − [card(Bu) / card(āpr_R(S2))]   (2)

This will equal 0 for no match and 1 for maximum match between S2 and S1, keeping S2 as the focus in the comparison. If card(āpr_R(S2)) = 0 then Similarity_R^u(S1, S2) is set to 0. Applying these similarity measures with the query as the central focus in Example 1:

Similarity_R^l(D1, Q1) = 1 and Similarity_R^u(D1, Q1) = 1.

The overall retrieval strategy could be to use either similarity measure or perhaps some weighted combination of the two. We are currently investigating mechanisms for combining these two similarity measures into one retrieval status value. In our previous work we also showed how the vocabulary model quite naturally yields document clusters, another important application within information retrieval. However, there were some limitations. These include, for example, the inability to use weighted descriptions of documents and queries and the inability to utilize term relationships other than synonymy. In the current research we solve the first problem by using a combination of rough set and fuzzy set theories. We also consider recent extensions to the RSM that provide further flexibility, as for example in being able to accommodate other types of vocabulary relationships.
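The pair of asymmetric similarity measures defined above can be sketched for the crisp case, where the bounded difference reduces to ordinary set difference (our own code and names, checked against the Example 1 values):

```python
# Sketch: asymmetric similarity (equations 1 and 2) for crisp sets.

def lower_approx(partition, s):
    return set().union(*[c for c in partition if c <= s])

def upper_approx(partition, s):
    return set().union(*[c for c in partition if c & s])

def similarity(a1, a2):
    """1 - |B| / |a2| with B = a2 - (a1 & a2); 0 when a2 is empty."""
    if not a2:
        return 0.0
    return 1 - len(a2 - (a1 & a2)) / len(a2)

partition = [{"t1", "t4", "t6"}, {"t3", "t7"}, {"t5", "t8", "t10"},
             {"t9"}, {"t2"}]
D1 = {"t2", "t3", "t4", "t7"}
Q1 = {"t1", "t2", "t3"}

# Query-focused similarities from Example 1: both equal 1.
sim_l = similarity(lower_approx(partition, D1), lower_approx(partition, Q1))
sim_u = similarity(upper_approx(partition, D1), upper_approx(partition, Q1))
print(sim_l, sim_u)  # 1.0 1.0
```

The measure is asymmetric because the focus set S2 alone supplies the denominator; swapping the arguments generally changes the score.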
Finally, we also propose a method for combining multiple vocabulary relationships, i.e., vocabulary views.

2 Combining Rough Set and Fuzzy Set Notions

The motivation to include fuzzy information in the rough set framework is to enable users to specify approximate descriptions of queries. Similarly, greater flexibility is offered when documents are described using term weights. Several researchers have studied the combination of rough and fuzzy notions [Dubois & Prade, 1990, Dubois & Prade, 1992, Lin, 1992, Yao, 1997]. We base our efforts on the approach proposed by Yao in [Yao, 1997], where he investigates combinations based on alpha-level sets. As a preliminary step, the following expressions taken from [Yao, 1997] show how membership functions may be used to compute the two rough set approximations in Pawlak's


standard model. Assume S to be a set of interest in a universe U, and R an equivalence relation on U. If we let μS denote the membership function for S, then

μ_apr_R(S)(x) = inf{μS(y) | y ∈ U, (x, y) ∈ R}
μ_āpr_R(S)(x) = sup{μS(y) | y ∈ U, (x, y) ∈ R},

and if we let μR denote the membership function for R, then

μ_apr_R(S)(x) = inf{1 − μR(x, y) | y ∉ S}
μ_āpr_R(S)(x) = sup{μR(x, y) | y ∈ S}.

These may be combined to give:

μ_apr_R(S)(x) = inf{max[μS(y), 1 − μR(x, y)] | y ∈ U}   (3)
μ_āpr_R(S)(x) = sup{min[μS(y), μR(x, y)] | y ∈ U}   (4)

Using μS to define the approximations, an element x belongs to the lower approximation of S (with membership 1) if all elements equivalent to x belong to S (i.e., μS(y) = 1). Using μR to define the approximations, an element x belongs to the lower approximation of S (with membership 1) if all terms not in S are not equivalent to x (i.e., μR(x, y) = 0). Finally, using both μS and μR to define the approximations, an element x belongs to the lower approximation of S (with membership 1) if, across all terms in the universe, whenever a term is equivalent to x (i.e., μR(x, y) = 1) it is present in S (i.e., μS(y) = 1). Thus we see that these functions may be used to compute the approximations in the degenerate case where μS and μR take only the values 0 or 1, i.e., S and R are crisp sets. As Yao shows, these functions are also relevant when S and R are fuzzy sets.

Rough Fuzzy Sets

A rough fuzzy set is derived from the approximation of a fuzzy set in a crisp approximation space. Let F be a fuzzy set in an approximation space apr_R = (U, R), with R being an equivalence relation. The α-cut, α ∈ [0, 1], of a fuzzy set is defined as:

F_α = {x ∈ U | μF(x) ≥ α}

With any given F_α as reference set, a rough set (apr_R(F_α), āpr_R(F_α)) may be defined using the standard Pawlak model. More generally, with the fuzzy set F as the reference set, a rough fuzzy set (apr_R(F), āpr_R(F)) may be defined where each approximation is itself a fuzzy set. The membership value of x in the fuzzy set apr_R(F) is the minimum of the membership values of the elements in the equivalence class containing x, and the membership value of x in āpr_R(F) is the maximum. Thus, equations 3 and 4 may be used to determine the memberships in apr_R(F) and āpr_R(F).


Fuzzy Rough Sets

A fuzzy rough set is derived from the approximation of a crisp set in a fuzzy approximation space. Consider a fuzzy approximation space apr_ℜ = (U, ℜ), with ℜ representing a fuzzy similarity relation. Similar to an α-cut for a fuzzy set, it is possible to apply a β-cut, with β ∈ [0, 1], to the fuzzy similarity relation ℜ such that each of ℜ's β-level sets is an equivalence relation. Thus we may apply the standard Pawlak model for each β-cut, and so for a given subset S of U we can derive a rough set (apr_ℜβ(S), āpr_ℜβ(S)) in each of the β-level equivalence relations. More generally, we get a fuzzy rough set (apr_ℜ(S), āpr_ℜ(S)) in ℜ, where each approximation is a fuzzy set whose membership values may also be determined by equations 3 and 4.

Fuzzy Sets and Fuzzy Approximation Spaces

This is a more general model that allows the approximation of a fuzzy set in a fuzzy approximation space. Thus we can use α-cuts on the fuzzy set F to get crisp sets, and β-cuts on the fuzzy similarity relation ℜ to get equivalence relations. This yields a family of rough sets (apr_ℜβ(F_α), āpr_ℜβ(F_α)), α ∈ [0, 1], β ∈ [0, 1]. The combination may be interpreted in three different ways: as a family of rough sets, a family of rough fuzzy sets, or a family of fuzzy rough sets, depending upon how these rough sets are grouped. Irrespective of interpretation, generalized versions of equations 3 and 4, as shown below, may be used to determine memberships in the fuzzy approximation sets:

μ_apr_Θ(Δ)(x) = inf{max[μΔ(y), 1 − μΘ(x, y)] | y ∈ U}   (5)
μ_āpr_Θ(Δ)(x) = sup{min[μΔ(y), μΘ(x, y)] | y ∈ U}   (6)

The variable Θ can stand for either an equivalence relation or a fuzzy similarity relation. The variable Δ can stand for either a crisp or a fuzzy subset.

3 Combining Rough and Fuzzy Set Models for Information Retrieval

Yao's scheme for combining rough and fuzzy sets is important for us because it allows us to explore the following situations:

- Fuzzy documents.
- Fuzzy queries.
- Fuzzy similarity relations for the vocabulary spaces.

Situation 1: Crisp vocabulary space and fuzzy vectors.
This is an application of rough fuzzy sets where the vocabulary is partitioned using an equivalence relation based on term synonymy. The document and/or query vector is a fuzzy vector. As described before, the approximations of fuzzy sets in this space will yield fuzzy sets.
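Situation 1 can be sketched as follows (our own code and names). Fuzzy sets are dictionaries holding only positive memberships; the data reuse the synonymy partition of Example 1 together with an illustrative fuzzy document and query, and the α-cut similarity follows the form of equation 7 given later:

```python
# Sketch: rough fuzzy approximations in a crisp approximation space (eqs 3/4)
# plus an alpha-cut similarity in the style of eq. 7.

def fuzzy_approx(partition, mu):
    """Lower = min over the equivalence class, upper = max over the class."""
    lower, upper = {}, {}
    for c in partition:
        lo = min(mu.get(t, 0.0) for t in c)
        hi = max(mu.get(t, 0.0) for t in c)
        for t in c:
            if lo > 0:
                lower[t] = lo
            if hi > 0:
                upper[t] = hi
    return lower, upper

def alpha_cut(mu, alpha):
    # Support when alpha = 0, otherwise memberships >= alpha.
    return {t for t, m in mu.items() if m > 0 and m >= alpha}

def similarity_alpha(a1, a2, alpha):
    """1 - |B_alpha| / |a2_alpha|, with B the bounded difference; focus = a2."""
    denom = alpha_cut(a2, alpha)
    if not denom:
        return 0.0
    b = {t: a2[t] - min(a1.get(t, 0.0), a2[t]) for t in a2}
    return 1 - len(alpha_cut(b, alpha)) / len(denom)

partition = [{"t1", "t4", "t6"}, {"t3", "t7"}, {"t5", "t8", "t10"},
             {"t9"}, {"t2"}]
FD2 = {"t1": 0.9, "t4": 0.7, "t3": 0.5, "t9": 0.8}
FQ2 = {"t1": 0.5, "t3": 0.2, "t2": 0.3, "t9": 0.5}

lo_d, up_d = fuzzy_approx(partition, FD2)
lo_q, up_q = fuzzy_approx(partition, FQ2)
print(lo_d)                               # {'t9': 0.8}
print(similarity_alpha(lo_d, lo_q, 0.0))  # 0.5
print(similarity_alpha(lo_d, lo_q, 0.4))  # 1.0
```

Raising α discards weakly supported mismatches, which is why the lower-approximation similarity climbs from 0.5 to 1.0 here.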


Example 2: As in the previous example, let T = {t1, t2, ..., t10} represent a vocabulary partitioned by R, the synonymy relation, such that T/R = {C1, C2, C3, C4, C5}, as defined previously. Assume the fuzzy document FD2 and the fuzzy query FQ2 as defined below.

FD2 = {0.9/t1, 0.7/t4, 0.5/t3, 0.8/t9}
FQ2 = {0.5/t1, 0.2/t3, 0.3/t2, 0.5/t9}

Then, applying the same membership functions as in equations 5 and 6, the fuzzy lower and upper approximations for FD2 may be derived:

apr_R(FD2) = {0.8/t9}
āpr_R(FD2) = {0.9/t1, 0.9/t4, 0.9/t6, 0.5/t3, 0.5/t7, 0.8/t9}

Similarly,

apr_R(FQ2) = {0.3/t2, 0.5/t9}
āpr_R(FQ2) = {0.5/t1, 0.5/t4, 0.5/t6, 0.2/t3, 0.2/t7, 0.3/t2, 0.5/t9}

We must modify equations 1 and 2 slightly, involving α-cuts of the fuzzy information, to yield:

Similarity_R^l(S1, S2)_α = 1 − [card(Bl) / card(apr_R(S2))]_α   (7)

This will equal 0 for no match and 1 for maximum match between S2 and S1, keeping S2 as the focus in the comparison. If card(apr_R(S2))_α = 0 then Similarity_R^l(S1, S2)_α is set to 0. In the same way,

Similarity_R^u(S1, S2)_α = 1 − [card(Bu) / card(āpr_R(S2))]_α   (8)

This will equal 0 for no match and 1 for maximum match between S2 and S1, keeping S2 as the focus in the comparison. If card(āpr_R(S2))_α = 0 then Similarity_R^u(S1, S2)_α is set to 0. Thus we can compute:

Bl = apr_R(FQ2) |−| [apr_R(FD2) ∩ apr_R(FQ2)] = {0.3/t2}

Therefore, by equation 7, Similarity_R^l(FD2, FQ2)_0 = 1 − 1/2 = 0.5 and Similarity_R^l(FD2, FQ2)_0.4 = 1.0. We also have,


Bu = āpr_R(FQ2) |−| [āpr_R(FD2) ∩ āpr_R(FQ2)] = {0.3/t2}

Therefore, by equation 8, Similarity_R^u(FD2, FQ2)_0 = 1 − 1/7 = 0.86 and Similarity_R^u(FD2, FQ2)_0.4 = 1 − 1/4 = 0.75.

Situation 2: Fuzzy vocabulary space and fuzzy document and/or query vectors.
We now introduce a fuzzy approximation space. This is analogous to Yao's fuzzy rough set model. A fuzzy approximation space is created by a fuzzy similarity relation, which has the following properties:

reflexive: μℜ(x, x) = 1 for x ∈ U
symmetric: μℜ(x, y) = μℜ(y, x) for x, y ∈ U
transitive: μℜ(x, z) ≥ min[μℜ(x, y), μℜ(y, z)]

Example 3: This is the most general case, where the synonymy relation is fuzzy and the vectors are also fuzzy. Let:

[t1]ℜ2 = {1/t1, 0.9/t4, 0.4/t6}
[t2]ℜ2 = {1/t2, 0.8/t3, 0.9/t5}
[t3]ℜ2 = {0.8/t2, 1/t3, 0.8/t5}
[t4]ℜ2 = {0.9/t1, 1/t4, 0.4/t6}
[t5]ℜ2 = {0.9/t2, 0.8/t3, 1/t5}
[t6]ℜ2 = {0.4/t1, 0.4/t4, 1/t6}
[t7]ℜ2 = {1/t7, 0.6/t10}
[t8]ℜ2 = {1/t8}
[t9]ℜ2 = {1/t9}
[t10]ℜ2 = {0.6/t7, 1/t10}

where [ti]ℜ2 represents the fuzzy set of terms similar to term ti. Let:

FD2 = {0.9/t1, 0.7/t4, 0.5/t3, 0.8/t9}
FQ2 = {0.5/t1, 0.2/t3, 0.3/t2, 0.5/t9}

Thus the fuzzy approximations in this fuzzy equivalence space are:

apr_ℜ2(FD2) = {0.6/t1, 0.2/t3, 0.6/t4, 0.8/t9}


āpr_ℜ2(FD2) = {0.9/t1, 0.5/t2, 0.5/t3, 0.9/t4, 0.5/t5, 0.4/t6, 0.8/t9}
apr_ℜ2(FQ2) = {0.1/t1, 0.1/t2, 0.2/t3, 0.5/t9}
āpr_ℜ2(FQ2) = {0.5/t1, 0.3/t2, 0.3/t3, 0.5/t4, 0.3/t5, 0.4/t6, 0.5/t9}

Since:

Bl = {0.1/t2}
Bu = {}

Similarity_ℜ2^l(FD2, FQ2)_0 = 1 − 1/4 = 0.75
Similarity_ℜ2^l(FD2, FQ2)_0.4 = 1 − 0/1 = 1.0
Similarity_ℜ2^u(FD2, FQ2)_0 = 1 − 0/7 = 1.0
Similarity_ℜ2^u(FD2, FQ2)_0.4 = 1 − 0/7 = 1.0

The two situations described above represent the inclusion of different levels of fuzzy information into the rough set based information retrieval model. In each case we can consistently handle combinations of crisp and fuzzy information. The main advantage gained is in modeling power. Fuzzy weights offer a reasonably straightforward means to represent the varying degrees of association between information objects and vocabulary terms, and also among vocabulary terms. Thus these fuzzy extensions to the rough set approach make the information retrieval model more realistic.

Further Extensions to the Model: The inclusion of fuzzy notions within the rough set framework does indeed offer added realism when modeling vocabularies and making approximate comparisons between documents and queries for information retrieval. However, certain limitations make this a less than perfect solution. Many of the interesting vocabulary based relationships are not equivalence or fuzzy similarity relations. Relations identifying a term's `specific', `lexically related' and `statistically related' terms have different properties. However, these are important relations and we would like to use them for information retrieval. We also wish to allow multiple vocabulary relations to collaboratively support information retrieval.
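The fuzzy-space approximations of Example 3 can be reproduced by applying equations 3 and 4 verbatim; a sketch (our own code, with the similarity relation stored as nested dictionaries and values rounded to absorb floating-point noise):

```python
# Sketch: approximating a fuzzy set in a fuzzy approximation space (eqs 3 and 4).
# rel[x][y] is the similarity mu(x, y); missing entries are 0.

rel = {
    "t1": {"t1": 1, "t4": 0.9, "t6": 0.4},
    "t2": {"t2": 1, "t3": 0.8, "t5": 0.9},
    "t3": {"t2": 0.8, "t3": 1, "t5": 0.8},
    "t4": {"t1": 0.9, "t4": 1, "t6": 0.4},
    "t5": {"t2": 0.9, "t3": 0.8, "t5": 1},
    "t6": {"t1": 0.4, "t4": 0.4, "t6": 1},
    "t7": {"t7": 1, "t10": 0.6},
    "t8": {"t8": 1},
    "t9": {"t9": 1},
    "t10": {"t7": 0.6, "t10": 1},
}
U = list(rel)

def fuzzy_space_approx(mu):
    lower, upper = {}, {}
    for x in U:
        lo = min(max(mu.get(y, 0.0), 1 - rel[x].get(y, 0.0)) for y in U)  # eq. 3
        hi = max(min(mu.get(y, 0.0), rel[x].get(y, 0.0)) for y in U)      # eq. 4
        if round(lo, 6) > 0:
            lower[x] = round(lo, 6)
        if round(hi, 6) > 0:
            upper[x] = round(hi, 6)
    return lower, upper

FD2 = {"t1": 0.9, "t4": 0.7, "t3": 0.5, "t9": 0.8}
lower, upper = fuzzy_space_approx(FD2)
print(lower)  # {'t1': 0.6, 't3': 0.2, 't4': 0.6, 't9': 0.8}
```

These values match the apr_ℜ2(FD2) and āpr_ℜ2(FD2) sets stated in Example 3.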
In other words, if each relation is regarded as a distinct view of the vocabulary, then it is important to be able to (optionally) consider multiple views while conducting information retrieval. This corresponds to many realistic situations where a search session may involve synonyms, specific terms, lexically related terms, etc.

4 Extensions to the Rough Set Model

In recent work various extensions to Pawlak's RSM have been proposed. The following sections examine some of these extensions and their potential in achieving the goals mentioned above.


4.1 Generalized Rough Set Model

Some of the most interesting extensions to Pawlak's rough sets, from the retrieval perspective, are those that substitute the equivalence relation with more general binary relations [Lin, 1989, Lingras & Yao, 1998, Yao, et al., 1994, Zakowski, 1983]. Such substitutions are motivated by the fact that the requirement of equivalence may be too restrictive for certain applications (such as information retrieval). Hence a number of alternatives have been proposed, such as compatibility relations [Zakowski, 1983], which are also the basis for neighbourhood systems [Lin, 1989]. This approach has, for example, been used for data mining from incomplete databases [Lingras & Yao, 1998]. The properties of the underlying relation of the rough set model are significant since they determine the properties of the two approximation operators. For example, with compatibility relations, which are only reflexive and symmetric, the approximation operators do not satisfy properties L9, L10, U9 and U10.

The recently proposed general classification of rough set models [Yao, et al., 1994] allows us to study these models based on the properties of the underlying binary relation. This classification considers various types of relations besides equivalence and compatibility relations by drawing parallels between rough sets and modal logic. Just as a modal logic system is an extension of propositional logic with two modal operators (the necessity and possibility operators), rough sets may be viewed as an extension of standard set theory with the additional lower and upper approximation operators. Properties of the approximation operators in a rough set model (listed before) relate to the axioms of the modal operators in a modal logic system. Properties L1 through L5, U1 through U5 and K are satisfied independent of the type of binary relation.
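This relation-independence can be checked mechanically. A small sketch (our own code) exhaustively verifies duality (L1), L3, U3 and L4 for an arbitrary relation that is neither reflexive, symmetric nor transitive:

```python
# Sketch: verify that L1, L3, U3 and L4 hold for an arbitrary binary relation.
from itertools import chain, combinations

U = frozenset({0, 1, 2, 3})
nbhd = {0: {1}, 1: {1, 2}, 2: set(), 3: {0, 3}}  # R(x): deliberately irregular

def lower(s):
    return {x for x in U if nbhd[x] <= s}

def upper(s):
    return {x for x in U if nbhd[x] & s}

def subsets(u):
    return chain.from_iterable(combinations(u, k) for k in range(len(u) + 1))

for a in map(set, subsets(U)):
    assert lower(a) == U - upper(U - a)                 # L1 (duality)
    for b in map(set, subsets(U)):
        assert lower(a & b) == lower(a) & lower(b)      # L3
        assert upper(a | b) == upper(a) | upper(b)      # U3
        assert lower(a | b) >= lower(a) | lower(b)      # L4
print("L1, L3, U3, L4 hold for this non-equivalence relation")
```

By contrast, L7 already fails here: element 2 has an empty neighbourhood and so belongs to the lower approximation of every subset, including subsets that do not contain 2.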
The remaining properties LU, L6 - L9 and U6 - U9, which depend upon the characteristics of the binary relation, are used to classify the various rough set models. For example, with a serial binary relation, property LU holds. LU in combination with L2 and U2 yields L6 and U6. With a transitive relation, property L8 holds. In general, any subset of these properties may define a class of rough set models. Given that the properties are not independent, and using results from modal logic, fifteen different classes of rough set models are constructed. Pawlak's rough set model, based on equivalence, is the strongest one, while the weakest rough set model does not require any of the additional properties to hold. The advantage offered by general binary relations is the more flexible application of the rough set approach.

Irrespective of the type of binary relation underlying the rough set model, we may compute lower and upper approximations using the general scheme shown below. Consider an arbitrary binary relation R1 on U. That is, aR1b implies that b is R1-related to a. Thus R1 may be used to create the class of R1-related terms for a given term, as for the term a below:

R1(a) = {x ∈ U | aR1x}

Given classes defined by such binary relations on the universe, we can define, for a subset S ⊆ U,

apr_R1(S) = {x | R1(x) ⊆ S}
āpr_R1(S) = {x | R1(x) ∩ S ≠ ∅}

The set apr_R1(S) consists of all those elements whose R1-related elements are all in S. The set


aprR1(S) consists of those elements such that at least one R1{related element is in S. The pair(aprR1(S); aprR1(S)) is referred to as the generalized rough set induced by R1 with reference set S.In the case where R1 is an equivalence relation, one gets the standard rough set model. Thus wesee that subsets in the universe may be described using these two approximations even with moregeneral binary relations.4.2 The Generalized Rough Set Model and Information RetrievalIn information retrieval there are several vocabulary based relations of interest e.g., speci�c term,general term, lexically related term, and statistically related term. The `speci�c term' relation is notserial, re exive, symmetric or Euclidean but only transitive. Similar features hold for the generalterm relationship. In contrast the lexically related term relationship is also symmetric. Thus weare faced with alternative binary relations that exhibit varying properties. The generalized roughset model is therefore highly relevant and immediately yields a more exible modeling approach.We now illustrate the application of the generalized rough set model to information retrieval. Wecontinue to consider fuzzy document and query descriptions in this analysis.Situation 3: Generalized Binary Relations and Fuzzy Vectors.Example 4:Assume the R1 (speci�c term) relation as given below. The interpretation given is that t4 and t6de�ne the set of speci�c terms for t1. Notice that t10 is a speci�c term for t6, t8 and t9. Noticealso, that not all terms have speci�c terms in the vocabulary space. Finally, as expected theserelationships implicitly de�ne the hierarchical connections between terms. 
Since, in theory, each term may have its own class of specific terms, there could be almost as many 'specific term' classes as there are terms in the vocabulary.

R1(t1) = {t4, t6, t7, t8, t10}
R1(t2) = {t3, t5, t7, t9, t10}
R1(t3) = {t7, t9, t10}
R1(t4) = {t7, t8, t10}
R1(t5) = {t9, t10}
R1(t6) = {t10}
R1(t8) = {t10}
R1(t9) = {t10}

Now, given a fuzzy document and query defined on the same universe of terms,

FD3 = {0.3/t1, 0.4/t3, 0.5/t9, 0.2/t10}
FQ3 = {0.2/t3, 0.4/t4, 0.5/t7}

we can compute the lower and upper approximations for the document and query using the same method as depicted in equations 3 and 4.
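As a sketch of this computation, the following assumes that equations 3 and 4 take the usual rough-fuzzy form: the lower approximation membership of a term is the infimum of the memberships of its R1-class, and the upper approximation membership is the supremum. The dictionaries mirror Example 4's R1 and FQ3; terms with an empty or zero-membership class are omitted from the result.

```python
# Example 4's specific-term relation: term -> its class of specific terms
R1 = {
    "t1": {"t4", "t6", "t7", "t8", "t10"},
    "t2": {"t3", "t5", "t7", "t9", "t10"},
    "t3": {"t7", "t9", "t10"},
    "t4": {"t7", "t8", "t10"},
    "t5": {"t9", "t10"},
    "t6": {"t10"},
    "t8": {"t10"},
    "t9": {"t10"},
}

FQ3 = {"t3": 0.2, "t4": 0.4, "t7": 0.5}  # fuzzy query: term -> membership

def lower_approx(F, R):
    """Terms whose entire related class lies (fuzzily) in F; membership = inf."""
    result = {}
    for x, related in R.items():
        m = min(F.get(y, 0.0) for y in related)
        if m > 0.0:
            result[x] = m
    return result

def upper_approx(F, R):
    """Terms with at least one related term in F; membership = sup."""
    result = {}
    for x, related in R.items():
        m = max(F.get(y, 0.0) for y in related)
        if m > 0.0:
            result[x] = m
    return result

print(lower_approx(FQ3, R1))  # {}
print(upper_approx(FQ3, R1))  # {'t1': 0.5, 't2': 0.5, 't3': 0.5, 't4': 0.5}
```

Under these assumed definitions the query's approximations come out exactly as in the text: the lower approximation is empty, and the upper approximation recommends t1 through t4, each at the level of its strongest in-class match.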


apr_R1(FD3) = {0.2/t5, 0.2/t6, 0.2/t8, 0.2/t9}
apr̄_R1(FD3) = {0.5/t1, 0.5/t2, 0.5/t3, 0.2/t4, 0.2/t6, 0.2/t8, 0.2/t9}
apr_R1(FQ3) = {}
apr̄_R1(FQ3) = {0.5/t1, 0.5/t2, 0.5/t3, 0.5/t4}

The following interpretations may be given to these approximations. From the point of view of R1 (the specific term relationship on the vocabulary), the lower approximation identifies t5, t6, t8 and t9 as representing FD3, each with fuzzy membership value 0.2. Thus the specific term relationship provides a particular view of the vocabulary which offers two alternatives for representing the document (and the query). The lower approximation identifies weighted terms that are definitely recommended, and the upper approximation identifies weighted terms that are possibly recommended by the vocabulary view. Now the similarity between the query and the document may be computed using equations 7 and 8:

Similarity_R1(FD3, FQ3)_0 = 0 (using the lower approximations)
Similarity_R1(FD3, FQ3)_0 = 1 − 1/4 = 0.75 (using the upper approximations)
Similarity_R1(FD3, FQ3)_0.4 = 1

5 Variable Precision Rough Sets

Another interesting extension of the rough set model for information retrieval purposes is the variable precision rough set model [Wong & Ziarko, 1987, Ziarko, 1993]. In Pawlak's standard model, an element belongs to the lower approximation of a set S if all of its related elements belong to S. For the upper approximation, at least one of its related elements should be in S. In graded rough sets [Yao & Wong, 1992], the degree of overlap is considered. For some n the following are defined with graded rough sets:

apr^n_R(S) = {x | |R(x)| − |S ∩ R(x)| ≤ n}
apr̄^n_R(S) = {x | |S ∩ R(x)| > n}

Thus x belongs to apr^n_R(S) if at most n members of R(x) are not in S, and it belongs to apr̄^n_R(S) if more than n members of R(x) are in S. This gives us a family of graded approximation operators by simply varying n. Variable precision rough sets [Wong & Ziarko, 1987, Ziarko, 1993] offer a probabilistic approach to rough sets by extending the idea of graded rough sets. In essence, the model also considers the size of R(x).
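The graded operators can be written directly from the definitions above. This is a sketch over small hypothetical relation classes (not the running example); note that n = 0 recovers Pawlak's standard approximations.

```python
def graded_lower(R, S, n):
    """x is in the graded lower approximation if at most n of its
    R-related elements fall outside the reference set S."""
    return {x for x, related in R.items() if len(related - S) <= n}

def graded_upper(R, S, n):
    """x is in the graded upper approximation if more than n of its
    R-related elements fall inside S."""
    return {x for x, related in R.items() if len(related & S) > n}

# Hypothetical relation classes and reference set
R = {"a": {"x", "y", "z"}, "b": {"x", "y"}, "c": {"z"}}
S = {"x", "y"}

# n = 0: the standard (Pawlak-style) lower and upper approximations
print(sorted(graded_lower(R, S, 0)))  # ['b']
print(sorted(graded_upper(R, S, 0)))  # ['a', 'b']

# Raising n relaxes the lower approximation: one out-of-S element is tolerated
print(sorted(graded_lower(R, S, 1)))  # ['a', 'b', 'c']
```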
Thus we now have:

apr^γ_R(S) = {x | |S ∩ R(x)| / |R(x)| ≥ 1 − γ}


apr̄^γ_R(S) = {x | |S ∩ R(x)| / |R(x)| ≥ γ}

where γ ∈ [0, 1]. Variable precision rough sets in essence smoothen the nature of the approximations. This extension is important for information retrieval because of the different types of vocabulary relations one may encounter. It is possible that the appropriate value for γ is relation dependent. The optimal γ for a relation that is somewhat loosely defined, i.e., one that yields large classes, is perhaps different from that for a relation that is very tightly defined, such as synonymy. Further insights will be gained empirically.

Given our information retrieval context with fuzzy document and query vectors, the above equations are modified slightly. For a fuzzy set F and a binary relation R defined on the universe which also contains element x, we have:

μ_apr^γ_R(F)(x) = sup{α | |F_α ∩ R(x)| / |R(x)| ≥ 1 − γ}    (9)

μ_apr̄^γ_R(F)(x) = sup{α | |F_α ∩ R(x)| / |R(x)| ≥ γ}    (10)

where α represents the largest membership threshold value that allows the α-cut F_α to satisfy the given condition. Thus γ sets two thresholds on the membership function. By setting γ ∈ [0.0, 0.5], we can ensure that the threshold for the lower approximation is higher than the threshold for the upper approximation.

5.1 Variable Precision Rough Sets and Information Retrieval

Example 5:

Assume the same information given in Example 4. In addition, let us set γ = 0.3. Thus the threshold for the lower approximation is 0.7 and that for the upper approximation is 0.3. Then, using equations 9 and 10:

apr^γ_R1(FD3) = {0.5/t5, 0.2/t6, 0.2/t8, 0.2/t9}
apr̄^γ_R1(FD3) = {0.4/t1, 0.2/t3, 0.5/t5, 0.2/t6, 0.2/t8, 0.2/t9}
apr^γ_R1(FQ3) = {}
apr̄^γ_R1(FQ3) = {0.4/t1, 0.2/t2, 0.5/t3, 0.5/t4}

Thus,

Bl = {}
Bu = {0.2/t2, 0.5/t4}
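The α-cut computation of equations 9 and 10 can be sketched as follows. This uses small hypothetical data rather than the running example, and assumes (as in the equations above) that both bounds are non-strict thresholds on the overlap ratio, with threshold 1 − γ for the lower approximation and γ for the upper.

```python
def alpha_cut(F, alpha):
    """Crisp set of elements with membership at least alpha."""
    return {y for y, m in F.items() if m >= alpha}

def vp_approx(F, R, threshold):
    """Generic variable precision approximation: each x receives the largest
    alpha whose cut overlaps R(x) in at least `threshold` proportion
    (threshold = 1 - gamma for the lower bound, gamma for the upper bound)."""
    out = {}
    for x, related in R.items():
        best = 0.0
        for a in sorted(set(F.values())):
            overlap = len(alpha_cut(F, a) & related) / len(related)
            if overlap >= threshold:
                best = max(best, a)
        if best > 0.0:
            out[x] = best
    return out

# Hypothetical fuzzy set and relation classes, with gamma = 0.3
F = {"x": 0.8, "y": 0.4}
R = {"a": {"x", "y"}, "b": {"y"}}
gamma = 0.3

print(vp_approx(F, R, 1 - gamma))  # lower: {'a': 0.4, 'b': 0.4}
print(vp_approx(F, R, gamma))      # upper: {'a': 0.8, 'b': 0.4}
```

For element "a", the cut at α = 0.8 covers only half of R(a), which clears the loose upper threshold (0.3) but not the strict lower one (0.7); hence the two approximations diverge exactly as the variable precision model intends.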


Similarity_R1(FD3, FQ3)_0 = 0 (using the lower approximations)
Similarity_R1(FD3, FQ3)_0 = 1 − 2/4 = 0.5 (using the upper approximations)

6 Multiple Vocabulary Views for Information Retrieval.

The previous section considers a vocabulary with a single view, i.e., the view offered by the specific term relation. We now consider multiple views of the same vocabulary working in concert. This is important because an information retrieval vocabulary system expresses different kinds of relationships, e.g., more specific, more general, synonymy, and statistically related. Any and all of these relationships may be relevant for a given retrieval goal. For instance, when searching for "data mining" literature one may use both the general synonym "knowledge discovery" and the specific search terms "rough sets" and "ID3". This suggests that a modeling approach that allows multiple views of the vocabulary to collaborate in suggesting important search terms is worth investigating. To simplify the analysis we consider two distinct views of the given vocabulary, as defined by two distinct binary relations on the same universe of terms. The extension to more than two relations is straightforward.

Example 6:

Consider the binary relations R1, representing the "specific term" relationship, and R2, representing the "lexical variant" relationship. Now, given a fuzzy document and query defined on the same universe U of terms,

FD4 = {0.9/t1, 0.5/t3, 0.2/t4, 0.8/t6, 0.5/t9}
FQ4 = {0.5/t1, 0.3/t2, 0.7/t3, 0.9/t9}

Assume that with R1 and γ1 we have:

apr_R1,γ1(FD4) = {0.2/t1, 0.2/t2, 0.5/t4}
apr̄_R1,γ1(FD4) = {0.5/t1, 0.5/t2, 0.5/t4}
apr_R1,γ1(FQ4) = {0.7/t1, 0.2/t2, 0.9/t4}
apr̄_R1,γ1(FQ4) = {0.8/t1, 0.5/t2, 0.9/t4, 0.3/t5}

Also assume that with R2 and γ2 we have:

apr_R2,γ2(FD4) = {0.2/t2, 0.5/t3, 0.5/t5}
apr̄_R2,γ2(FD4) = {0.8/t2, 0.5/t3, 0.5/t5}


apr_R2,γ2(FQ4) = {0.5/t3, 0.7/t5}
apr̄_R2,γ2(FQ4) = {0.5/t3, 0.7/t5}

Notice that each vocabulary view has its own optimal value for γ, allowing us to fine tune the variable precision model for each relation. (However, it is also possible that a single common value for γ works optimally across all relations.) Each view makes its own recommendations regarding the weighted terms that definitely and possibly represent the query (or document). Thus, the specific term view recommends 0.2/t1, 0.2/t2, 0.5/t4, while the lexical relation view recommends 0.2/t2, 0.5/t3, 0.5/t5 as definitely describing the document. This analysis suggests that there are two alternative methods for combining the terms suggested by the different views. We can be highly selective and retain only those terms that are suggested by both views. Alternatively, we may select terms suggested by either view. Thus the fuzzy AND and the fuzzy OR may be appropriate for the two options, respectively.

Assuming that both views must effect selection, we have:

apr_R1,γ1 AND R2,γ2(FD4) = apr_R1,γ1(FD4) ∩ apr_R2,γ2(FD4) = {0.2/t2}
apr̄_R1,γ1 AND R2,γ2(FD4) = apr̄_R1,γ1(FD4) ∩ apr̄_R2,γ2(FD4) = {0.5/t2}
apr_R1,γ1 AND R2,γ2(FQ4) = apr_R1,γ1(FQ4) ∩ apr_R2,γ2(FQ4) = {}
apr̄_R1,γ1 AND R2,γ2(FQ4) = apr̄_R1,γ1(FQ4) ∩ apr̄_R2,γ2(FQ4) = {0.3/t5}

Assuming that either view may effect selection, we have:

apr_R1,γ1 OR R2,γ2(FD4) = apr_R1,γ1(FD4) ∪ apr_R2,γ2(FD4) = {0.2/t1, 0.2/t2, 0.5/t3, 0.5/t4, 0.5/t5}
apr̄_R1,γ1 OR R2,γ2(FD4) = apr̄_R1,γ1(FD4) ∪ apr̄_R2,γ2(FD4) = {0.5/t1, 0.8/t2, 0.5/t3, 0.5/t4, 0.5/t5}
apr_R1,γ1 OR R2,γ2(FQ4) = apr_R1,γ1(FQ4) ∪ apr_R2,γ2(FQ4) = {0.7/t1, 0.2/t2, 0.5/t3, 0.9/t4, 0.7/t5}
apr̄_R1,γ1 OR R2,γ2(FQ4) = apr̄_R1,γ1(FQ4) ∪ apr̄_R2,γ2(FQ4) = {0.8/t1, 0.5/t2, 0.5/t3, 0.9/t4, 0.7/t5}

Notice that the original query representation contains the term t9, which does not appear even in the (more optimistic) upper approximations of either view. This is because each approximation is generated from the particular vocabulary view alone. Thus utilizing the two vocabulary views alone for retrieval risks the loss of original terms.
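The fuzzy AND and OR combinations used above reduce to the minimum of memberships over the common support and the maximum over the union, respectively. A small sketch, reusing the two lower approximations of FD4 from Example 6:

```python
def fuzzy_and(A, B):
    """Fuzzy intersection of weighted-term sets: min membership on shared terms."""
    return {t: min(A[t], B[t]) for t in A.keys() & B.keys()}

def fuzzy_or(A, B):
    """Fuzzy union of weighted-term sets: max membership over all terms."""
    return {t: max(A.get(t, 0.0), B.get(t, 0.0)) for t in A.keys() | B.keys()}

# Lower approximations of FD4 under the two views (values from Example 6)
view1 = {"t1": 0.2, "t2": 0.2, "t4": 0.5}  # specific-term view, gamma1
view2 = {"t2": 0.2, "t3": 0.5, "t5": 0.5}  # lexical-variant view, gamma2

print(fuzzy_and(view1, view2))  # {'t2': 0.2}
print(sorted(fuzzy_or(view1, view2).items()))
# [('t1', 0.2), ('t2', 0.2), ('t3', 0.5), ('t4', 0.5), ('t5', 0.5)]
```

These reproduce the AND result {0.2/t2} and the OR result {0.2/t1, 0.2/t2, 0.5/t3, 0.5/t4, 0.5/t5} computed in the example; an original-view Ro can be folded in by one more `fuzzy_or` (or `fuzzy_and`) call with the document or query vector itself.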
To solve this problem, we may treat the original representations as the 'original' (Ro) view of these objects. These may then be combined with the other views using the AND or OR operations. Thus:

apr_R1,γ1 AND R2,γ2 AND Ro(FD4) = {}


apr̄_R1,γ1 AND R2,γ2 AND Ro(FD4) = {}
apr_R1,γ1 AND R2,γ2 AND Ro(FQ4) = {}
apr̄_R1,γ1 AND R2,γ2 AND Ro(FQ4) = {}

and,

apr_R1,γ1 OR R2,γ2 OR Ro(FD4) = {0.9/t1, 0.2/t2, 0.5/t3, 0.5/t4, 0.5/t5, 0.8/t6, 0.5/t9}
apr̄_R1,γ1 OR R2,γ2 OR Ro(FD4) = {0.9/t1, 0.8/t2, 0.5/t3, 0.5/t4, 0.5/t5, 0.8/t6, 0.5/t9}
apr_R1,γ1 OR R2,γ2 OR Ro(FQ4) = {0.7/t1, 0.3/t2, 0.7/t3, 0.9/t4, 0.7/t5, 0.9/t9}
apr̄_R1,γ1 OR R2,γ2 OR Ro(FQ4) = {0.8/t1, 0.5/t2, 0.7/t3, 0.9/t4, 0.7/t5, 0.9/t9}

Thus we have:

Similarity_{R1 AND R2 AND Ro}(FD4, FQ4)_0 = 0 (using the lower approximations)
Similarity_{R1 AND R2 AND Ro}(FD4, FQ4)_0 = 0 (using the upper approximations)
Similarity_{R1 OR R2 OR Ro}(FD4, FQ4)_0 = 1 − 5/6 = 0.17 (using the lower approximations)
Similarity_{R1 OR R2 OR Ro}(FD4, FQ4)_0 = 1 − 4/6 = 0.33 (using the upper approximations)

Thus, multiple views of the same vocabulary offer alternative approaches for term selection. These views may operate in concert to yield either the common denominator representation or the union representation. However, it is evident from this simple example that as the number of views increases, the AND operation is likely to become overly restrictive. This is in fact to be expected, since the views are really quite different from each other. It does not make much sense to expect terms to be both specific terms and lexically related to the query terms. Hence, the OR operator is more likely to be suitable, given the nature of interesting relations in the information retrieval domain. Another alternative is to apply the different views in sequence, with each step offering an ORed combination of the current view and the previous step's final representation. Thus this modeling approach allows significant flexibility when combining the various binary relations that may be observed in a vocabulary scheme.

7 Preliminary Analysis of the UMLS Metathesaurus

We now show the relevance of the vocabulary mining framework established in previous sections by examining some of the properties of a real world vocabulary system used for information retrieval. We will see how such a vocabulary offers a variety of relations with different properties.
These relations may be utilized for information retrieval either independently or in particular combinations. Our framework offers the ability to research retrieval effectiveness under different conditions.


The 1998 edition of the UMLS Metathesaurus [NLM, 1998] is an integration of more than forty health care related vocabularies. It contains 476,313 concepts. For each concept the Metathesaurus presents a wide variety of information, such as definitions, synonyms, and parts of speech. In this section we focus on eight different types of relations that may be gleaned from the UMLS for its concepts, as described in Table 1. Column 1 identifies each relation. Columns 3 and 4 provide information pertaining to the related terms for a single Metathesaurus concept, 1,2-Dipalmitoylphosphatidylcholine, represented here as C1. Thus there are 9 synonyms identified for the concept, with three examples given.

ID   Type                 Set size   Sample entries for UMLS concept 1,2-Dipalmitoylphosphatidylcholine
R1   Synonym              9          1,2-Dipalmitoyl-Glycerophosphocholine; Dipalmitoylphosphatidylcholine; Dipalmitoyllecithin
R2   Ancestor term        14         Phospholipids - 5; Glycerophosphates - 6; Phosphatidic Acids - 7; Phosphatidylcholines - 8
R3   Parent term          3          Phosphatidylcholines
R4   Sibling term         1          Dimyristoylphosphatidylcholine
R5   Qualifier term       31         administration & dosage; adverse effects; analogs & derivatives
R6   Narrower term        6          1,3-DG-2-P; 1,3-dipalmitoyl-glycero-2-phosphocholine; colfosceril palmitate
R7   Related term         1          Dipalmitoylphosphatidyl:Mass Concentration:Point in Time:Serum:Quantitative
R8   Co-occurring term    380        Acetophenones; Acids; Alcohols; Laurates

Table 1: Eight UMLS Based Relations

The ancestor relation is derived from the UMLS component vocabularies that are hierarchically organized with more general/specific concept links. (The numbers shown with the ancestor entries indicate the hierarchical distance from the concept in focus; the smaller the number, the greater the distance in the source vocabulary.) The parent term relation is a subset of the ancestor term relation. R5 represents the allowable qualifying terms, i.e., terms that add further specificity to the semantics of the qualified term. R7 represents related concepts (other than synonymy, ancestor, child, parent, and sibling). R8 represents the co-occurring concept relation, which has the largest number of entries (380) for our example concept. The Metathesaurus classifies these co-occurrences further into five different classes, such as "co-occurrence of primary or main subject headings in citations to the published literature". We ignore these classes in this example; however, it should be noted that each co-occurrence class may be regarded as a separate relation. This example shows that the UMLS offers many relations besides synonymy that are potentially useful for information retrieval. The vocabulary mining framework defined here allows one to systematically study retrieval effectiveness using these relations, either independently or in various combinations.

These eight binary relations differ in several respects. For instance, R1, the synonymy relation, is an equivalence relation with all its concomitant properties. R2, the ancestor relation, and R6, the narrower relation, are transitive but not serial, reflexive or symmetric. R3, the parent relation, and R5, qualifying terms, have none of these properties. R4, the sibling relation, and R7, the related term relation, are symmetric, while R8, the co-occurrence relation, may be symmetric (depending upon the definition of co-occurrence implemented in the UMLS). By including generalized relations, the proposed framework allows one to mine these different relations either individually or in combination.

There are other differences between the relations. For example, the co-occurrence relation tends to yield many more related terms than R4, the sibling relation. Thus the question arises: how does one combine multiple vocabulary views in such a way that the combination remains somewhat neutral to significant differences in class size across relations? Relations averaging ten entries per class and relations averaging more than 300 entries should be able to collaborate, if necessary, during retrieval.
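Before plugging a relation into the framework, the properties that decide which rough set model applies can be checked mechanically. A small sketch, using a hypothetical ancestor-style relation encoded as an adjacency map (not actual UMLS data):

```python
def is_symmetric(R):
    """a R b implies b R a."""
    return all(a in R.get(b, set())
               for a, related in R.items()
               for b in related)

def is_transitive(R):
    """a R b and b R c imply a R c."""
    return all(c in related
               for a, related in R.items()
               for b in related
               for c in R.get(b, set()))

# Hypothetical 'ancestor'-style relation: transitive but not symmetric
ancestor = {"t10": {"t6", "t1"}, "t6": {"t1"}, "t1": set()}

print(is_transitive(ancestor))  # True
print(is_symmetric(ancestor))   # False
```

Checks like these (serial, reflexive, and Euclidean tests follow the same pattern) make it easy to classify each candidate UMLS relation before choosing the appropriate generalized rough set model for it.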
The proposed vocabulary mining framework includes the variable precision rough set extension. This offers the parameter γ, which gives us some control over this aspect.

Another aspect to consider is that relations may differ in their level of abstraction. R2, the ancestor relation, yields terms at different levels of abstraction compared with R1, the synonymy relation. Similarly, the co-occurrence relation may be subdivided into finer grain relations. Again, one must be able to control for such differences. The proposed framework allows one to enforce some degree of consistency in the level of abstraction by defining the relation appropriately.

Finally, it is clear that some relations are not independent of each other. For example, the parent relation is a subset of the ancestor relation, so using both is somewhat redundant. The choice between the more general relation and the more specific one is possibly context dependent. With some queries, the parent relation is likely to be more useful than the ancestor relation. This aspect may be investigated empirically within the proposed framework.

To conclude, the example shows that our rough and fuzzy set based vocabulary mining framework is motivated by real world complex vocabularies such as the UMLS. It is also evident that a number of decisions will need to be made when applying the proposed rough set framework. The core issue underlying these decisions is the very definition of the different relations/views that can be derived from the given vocabulary. Once the views are defined, other questions arise, such as which views to select for a given query and how to combine them. These and other aspects related to vocabulary mining will be examined empirically in future research.

8 Conclusion.

The exploration of domain vocabularies for information retrieval has been an active research area for several decades.
Our contribution is a retrieval model within a vocabulary mining framework in which individual vocabulary relations, or views, may be characterized and studied. The framework also supports the systematic study needed to identify effective methods for combining vocabulary views for information retrieval. The framework is based on Pawlak's theory of rough sets and some of its extensions. The ability to automatically mine domain vocabularies to adjust query and document representations is of significant value. Data mining research, including efforts based on rough sets, has typically focused on discovering knowledge from highly structured databases such as relational databases. Our work extends data mining goals into the realm of relatively unstructured, textual databases. Equally challenging is our goal of just-in-time discovery of the relevant vocabulary relations for a given user query.

Future plans include testing this model in various domains. As shown here, the Unified Medical Language System (UMLS) implemented by the National Library of Medicine offers a rich vocabulary system with various types of binary relations. WordNet is another example of a rich vocabulary system that offers an interesting test option for the future.

References

[Das-Gupta, 1988] Das-Gupta, P. "Rough Sets and Information Retrieval." Proceedings of the 11th International Conference of the Association for Computing Machinery Special Interest Group on Information Retrieval (ACM SIGIR). Ed. Yves Chiaramella. Grenoble, France, 567-582, 1988.

[Dubois & Prade, 1992] Dubois, D. and Prade, H. Putting rough sets and fuzzy sets together. In Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory, Slowinski, R. (Ed.), Kluwer Academic Publishers, Boston, 204-232, 1992.

[Dubois & Prade, 1990] Dubois, D. and Prade, H. Rough Fuzzy Sets and Fuzzy Rough Sets. International Journal of General Systems, 17, pp. 191-209, 1990.

[Hu & Cercone, 1995] Hu, X. and Cercone, N. Mining knowledge rules from databases: A rough set approach. In: Proceedings of the 12th International Conference on Data Engineering, New Orleans, 96-105, 1995.

[Krusinska, Slowinski & Stefanowski, 1992] Krusinska, E., Slowinski, R., and Stefanowski, J. Discriminant versus rough set approach to vague data analysis. Applied Stochastic Models and Data Analysis, 8, 43-56, 1992.

[Lin, 1989] Lin, T. Y. Neighbourhood systems and approximation in database and knowledge base systems. Proceedings of the Fourth International Symposium on Methodologies of Intelligent Systems, 1989.

[Lin, 1992] Lin, T. Y. Topological and Fuzzy Rough Sets. In Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory, Slowinski, R. (Ed.), Kluwer Academic Publishers, Boston, pp. 287-304, 1992.

[Lin & Liu, 1993] Lin, T. Y. and Liu, Q. Rough Approximate Operators. Proceedings of the International Workshop on Rough Sets and Knowledge Discovery, First Edition, 255-257, 1993.

[Lingras & Yao, 1998] Lingras, P.J. and Yao, Y.Y. Data mining using extensions of the rough set model. Journal of the American Society for Information Science, 49(5), pp. 415-422, 1998.


[Millan & Machuca, 1997] Millan, M. and Machuca, F. Using the rough set theory to exploit the data mining potential in relational databases systems. In: [RSSC'97], 344-347.

[Miyamoto, 1990] Miyamoto, S. Fuzzy Sets in Information Retrieval and Cluster Analysis. Kluwer, Dordrecht, The Netherlands, 1990.

[NLM, 1998] National Library of Medicine. Unified Medical Language System (UMLS) Knowledge Sources, 9th edition. MD: NLM, 1998.

[Nguyen et al., 1997] Nguyen, S. Hoa, Skowron, A., Synak, P., and Wróblewski, J. Knowledge discovery in data bases: Rough set approach. In: M. Mares, R. Meisar, V. Novak, and J. Ramik (Eds.), Proceedings of the Seventh International Fuzzy Systems Association World Congress (IFSA'97), June 25-29, Academia, Prague, vol. 2, 204-209, 1997.

[Øhrn et al., 1997] Øhrn, A., Vinterbo, S., Szymański, P., and Komorowski, J. Modeling cardiac patient set residuals using rough sets. In: Proceedings of the AMIA Annual Fall Symposium (formerly SCAMC), Nashville, TN, USA, October 25-29, 203-207, 1997.

[Pawlak, 1982] Pawlak, Z. Rough Sets. International Journal of Computer and Information Sciences, 11, 341-356, 1982.

[Pawlak & Skowron, 1994] Pawlak, Z. and Skowron, A. Rough membership functions. In R. R. Yager, M. Fedrizzi, and J. Kacprzyk (Eds.), Advances in the Dempster-Shafer Theory of Evidence. Wiley, New York, 251-271, 1994.

[Pawlak et al., 1995] Pawlak, Z., Grzymala-Busse, J., Slowinski, R., and Ziarko, W. Rough sets. Communications of the ACM, 38(11), 89-95, November 1995.

[Salton, 1971] Salton, G. (Ed.) The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, NJ, 1971.

[Skowron & Grzymala-Busse, 1994] Skowron, A. and Grzymala-Busse, J. W. From rough set theory to evidence theory. In R. R. Yager, M. Fedrizzi, and J. Kacprzyk (Eds.), Advances in the Dempster-Shafer Theory of Evidence. Wiley, New York, 193-236, 1994.

[Srinivasan, 1989] Srinivasan, P. "Intelligent Information Retrieval using Rough Set Approximations." Information Processing and Management, 25(4), 347-361, 1989.

[Srinivasan, 1991] Srinivasan, P. "The Importance of Rough Approximations for Information Retrieval." International Journal of Man-Machine Studies, 34, 657-671, 1991.

[Tsumoto et al., 1995] Tsumoto, S., Ziarko, W., Shan, N., and Tanaka, H. Knowledge discovery in clinical databases based on variable precision rough sets model. In: Proceedings of the 19th Annual Symposium on Computer Applications in Medical Care, New Orleans, 1995. Journal of the American Medical Informatics Association, Supplement, 270-274, 1995.

[Wong, Wang & Yao, 1995] Wong, S.K.M., Wang, L.S., and Yao, Y.Y. On modeling uncertainty with interval structures. Computational Intelligence, 11, pp. 406-426, 1995.


[Wong & Ziarko, 1987] Wong, S.K.M. and Ziarko, W. Comparison of the probabilistic approximate classification and the fuzzy set model. Fuzzy Sets and Systems, 21, pp. 357-362, 1987.

[Yao, 1997] Yao, Y.Y. Combination of Rough and Fuzzy Sets based on α-level sets. In Rough Sets and Data Mining: Analysis for Imprecise Data, T. Y. Lin and N. Cercone (Eds.), Kluwer Academic Publishers, Boston, 301-321, 1997.

[Yao et al., 1994] Yao, Y.Y., Li, X., Lin, T. Y., and Liu, Q. Representation and classification of rough set models. Soft Computing: Proceedings of the Third International Workshop on Rough Sets and Soft Computing (RSSC '94), San Jose, CA, Nov. 10-12. T. Y. Lin and A. M. Wildberger (Eds.), The Society for Computer Simulation, San Diego, CA, 44-47.

[Yao & Wong, 1992] Yao, Y.Y. and Wong, S.K.M. A decision theoretic framework for approximating concepts. International Journal of Man-Machine Studies, 37, pp. 793-809, 1992.

[Yao, Wong & Lin, 1997] Yao, Y.Y., Wong, S.K.M., and Lin, T.Y. A Review of Rough Set Models. In Rough Sets and Data Mining: Analysis for Imprecise Data, T. Y. Lin and N. Cercone (Eds.), Kluwer Academic Publishers, Boston, 47-73, 1997.

[Zakowski, 1983] Zakowski, W. Approximations in the space (U, Π). Demonstratio Mathematica, XVI, 761-769, 1983.

[Ziarko, 1993] Ziarko, W. Variable precision rough set model. Journal of Computer and System Sciences, 46, pp. 39-59, 1993.
