Automatic Generation of Taxonomies from the WWW
Presented by Jian-Shiun Tzeng 3/26/2009
Automatic Generation of Taxonomies from the WWW
David Sánchez , Antonio Moreno
Department of Computer Science and Mathematics, Universitat Rovira i Virgili (URV), Avda. Països Catalans, 26, 43007 Tarragona, Spain
In: Proceedings of the 5th International Conference on Practical Aspects of Knowledge Management (PAKM 2004). LNAI
2
Domain Ontology Learning from the Web
• Publisher: VDM Verlag Dr. Mueller Akt.ges.&Co.KG
• Pub. Date: February 2008
• ISBN-13: 9783836470698
• 224 pp.
3
Related Paper
• Sánchez, D. and Moreno, A. “Web-scale taxonomy learning.” Proceedings of the Workshop on Extending and Learning Lexical Ontologies using Machine Learning, ICML05, 2005
• Sánchez, D. and Moreno, A. “Learning non-taxonomic relationships from web documents for domain ontology construction.” Data & Knowledge Engineering, 64(3), 600-623, 2007
• Sánchez, D. and Moreno, A. “Pattern-based automatic taxonomy learning from the Web.” AI Communications, 21(1), 27-48, 2008
4
Abstract
• A methodology to extract information from the Web to build a taxonomy of terms and Web resources for a given domain
• The system makes intensive use of a publicly available search engine, extracts concepts (based on their relation to the initial keyword and on statistical data about their appearances), selects and categorizes the most representative Web resources for each one, and represents the result in a standard way
5
Outline
1. Introduction
2. Taxonomy building methodology
3. Ontology representation and evaluation
4. Conclusion and future work
6
1. Introduction
• There is no standard way of representing information that eases access to it
• Although several search tools have been developed (e.g. search engines like Google), when the searched topic is very general or we lack exhaustive knowledge of the domain (needed to set the most appropriate and restrictive search query), evaluating the huge number of potential resources obtained is quite slow and tedious
7
1. Introduction
• In this sense, a way of representing and accessing the available resources in a structured manner, depending on the selected domain, would be very useful
• It is also important that this kind of representation can be obtained automatically, due to the dynamic and changing nature of the Web
8
1. Introduction
• In this paper we present a methodology to extract information from the Web to build automatically a taxonomy of terms and Web resources for a given domain
9
1. Introduction
• During the building process, the most representative web sites for each selected concept are retrieved and categorized
• Finally, a polysemy detection algorithm is performed in order to discover different senses of the same word and group the most related concepts
• The result is a hierarchical and categorized organization of the available resources for the given domain
10
1. Introduction
• This hierarchy is obtained automatically and autonomously from the whole Web without any previous domain knowledge, and it represents the available resources at the moment
• These aspects distinguish this method from other similar ones [18] that:
– are applied to a representative selected corpus of documents [17]
– use external sources of semantic information [16] (like WordNet [1] or predefined ontologies)
– or require the supervision of a human expert
• A prototype has been implemented to test the proposed method
11
1. Introduction
• The idea at the base of our proposal is the information redundancy that characterizes the Web, which allows us to detect important concepts for a domain through a statistical analysis of their appearances
12
1. Introduction
• In addition to the advantage that a hierarchical representation of web resources provides in terms of searching for information, the obtained taxonomy is also a valuable element for building machine processable information representations like ontologies [6]
• In fact, many ontology creation methodologies like METHONTOLOGY [20] consider a taxonomy of terms for the domain as the point of departure for the ontology creation
13
1. Introduction
• However, the manual creation of these hierarchies of terms is a difficult task that requires extensive knowledge of the domain and, in most situations, yields incomplete, inaccurate or obsolete results
• In our case, the taxonomy is created automatically and represents the state of the art for a domain (assuming that the Web contains the latest information for a certain topic)
14
1. Introduction
• The rest of the paper is organized as follows:
– Section 2 describes the methodology developed to build the taxonomy and classify web sites
– Section 3 presents the way of representing the results and discusses the main issues related to their evaluation
– The final section contains some conclusions and proposes some lines of future work
15
Outline
1. Introduction
2. Taxonomy building methodology
3. Ontology representation and evaluation
4. Conclusion and future work
16
2. Taxonomy building methodology
• In this section, the methodology used to discover, select and organize representative concepts and websites for a given domain is described
• The algorithm is based on analysing a large number of web sites in order to find important concepts for a domain by studying the neighborhood of an initial keyword (we assume that words near the specified keyword are closely related to it)
17
2. Taxonomy building methodology
• Concretely, the word immediately before a keyword frequently categorizes it, whereas the word immediately after it represents the domain where it is applied [19]
– E.g. Humidity (anterior) Sensor Company (posterior)
• These concepts are processed in order to select the most adequate ones by performing a statistical analysis
• The selected anterior words are finally incorporated into the taxonomy
• For each one, the websites from where it was extracted are stored and categorized (using the posterior words), and the process is repeated recursively to find new terms and build a hierarchy
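The anterior/posterior neighbourhood analysis above can be sketched as follows; this is a minimal illustration on plain tokenized text (the actual system parses full web pages), and the function name is hypothetical:

```python
import re

def neighbors(text, keyword):
    """Collect the words immediately before (candidate subtypes) and
    immediately after (application domains) each occurrence of `keyword`."""
    tokens = re.findall(r"[A-Za-z][A-Za-z-]*", text.lower())
    anterior, posterior = [], []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            if i > 0:
                anterior.append(tokens[i - 1])   # e.g. "humidity" sensor
            if i + 1 < len(tokens):
                posterior.append(tokens[i + 1])  # sensor "company"
    return anterior, posterior

ant, post = neighbors("The humidity sensor company sells a pressure sensor kit.", "sensor")
print(ant, post)  # ['humidity', 'pressure'] ['company', 'kit']
```

In the method described above, the anterior words become taxonomy candidates, while the posterior words are kept to categorize the web sites from which they were extracted.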
18
2. Taxonomy building methodology
• Finally, in order to detect the different meanings or domains to which the obtained classes belong, a word sense discovery algorithm is performed
• This process is especially interesting when working with polysemous keywords (to avoid merging results from different domains)
• The algorithm creates clusters of classes depending on the web sites' domains from which they were selected
19
2. Taxonomy building methodology
• The resulting taxonomy of terms eases the access to the available web resources and can be used to guide a search for information or a classification process from a document corpus [3, 4, 12, 13], or it can be the base for finding more complex relations between concepts and creating ontologies [8]
20
2. Taxonomy building methodology
• This last point (the resulting taxonomy) is especially important due to the need for ontologies to achieve interoperability and to ease the access to and interpretation of knowledge resources (e.g. Semantic Web [21], more information in [10])
21
2.1 Term discovery, taxonomy building and web categorization algorithm
• In more detail, the algorithm's sequence for discovering representative terms and building a taxonomy of web resources (shown in Fig. 1) has the following phases (see Table 1 and Figs. 2 and 3 in order to follow the explanation example):
22
23
2.1 Term discovery, taxonomy building and web categorization algorithm
1. It starts with a keyword that has to be representative enough for a specific domain (e.g. sensor) and a set of parameters that constrain the search and the concept selection (described below)
24
2.1 Term discovery, taxonomy building and web categorization algorithm
2. Then, it uses a publicly available search engine (Google) in order to obtain the most representative web sites that contain that keyword. The search constraints specified are the following:
– Maximum number of pages returned by the search engine
– Filter of similar sites
25
Maximum number of pages returned by the search engine
• This parameter constrains the size of the search
• The larger the number of documents evaluated, the better the results obtained, because we base the quality of the results on the redundancy of information (e.g. 1000 pages for the sensor example)
26
Filter of similar sites
• For a general keyword (e.g. sensor), the enabling of this filter hides the web sites that belong to the same web domain, obtaining a set of results that represent a wider spectrum
• For a concrete word (e.g. neural network based sensor) with a smaller amount of results, the disabling of this filter will return the whole set of pages (even sub pages of a web domain)
27
2.1 Term discovery, taxonomy building and web categorization algorithm
3. For each web site returned, an exhaustive analysis is performed in order to obtain useful information from each one. Concretely:
– Different types of non-HTML document formats are processed (pdf, ps, doc, ppt, etc.) by obtaining the HTML version from Google's cache
– For each "Not found" or "Unable to show" page, the parser tries to obtain the web site's data from Google's cache
– Redirections are followed until the final site is found
– Frame-based sites are also considered, obtaining the complete set of texts by analysing each web subframe
28
2.1 Term discovery, taxonomy building and web categorization algorithm
4. The parser returns the useful text from each site (rejecting tags and visual information), and tries to find the initial keyword (e.g. sensor). For each match, it analyses the word immediately before it (e.g. temperature sensor). If it fulfills a set of prerequisites, it is selected as a candidate concept. The posterior words (e.g. sensor network) are also considered. Concretely, the parser verifies the following: (Next Page)
29
– Words must have a minimum size (e.g. 3 characters) and must be represented with a standard ASCII character set (not Japanese, for example)
– They must be "relevant words": prepositions, determiners and very common words ("stop words") are rejected
– Each word is analyzed from its morphological root (e.g. optical and optic are considered the same word and their attribute values, described below, are merged: for example, the numbers of appearances of both words are added)
• A stemming algorithm for the English language is used for this purpose
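The word-level prerequisites above can be sketched as follows. The stemmer here is a crude suffix stripper standing in for the real English stemming algorithm the slides mention, and the stop-word list is only an illustrative subset:

```python
STOP_WORDS = {"the", "and", "for", "with", "this", "are", "was"}  # illustrative subset

def stem(word):
    """Crude suffix stripper: a stand-in for a proper English stemmer.
    It maps derivative forms to a shared root so their counts can merge."""
    for suffix in ("ing", "al", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

def is_candidate(word):
    """Apply the parser's prerequisites: minimum length, standard ASCII
    characters only, and not a stop word."""
    return (len(word) >= 3
            and word.isascii()
            and word.lower() not in STOP_WORDS)

# "optical" and "optic" collapse to the same root, so their attribute values merge
print(stem("optical"), stem("optic"))  # optic optic
print(is_candidate("the"), is_candidate("wireless"))  # False True
```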
30
2.1 Term discovery, taxonomy building and web categorization algorithm
5. For each candidate concept selected (some examples are contained in Table 1), a statistical analysis is performed in order to select the most representative ones. Apart from the text frequency analysis, the information obtained from the search engine (which is based on an analysis of the whole web) is also considered. Concretely, we consider the following attributes: (Next Page)
31
32
33
• Total number of appearances, i.e. the number of times the word appears in the retrieved pages (e.g. minimum of 5 for the first iteration of the sensor example)
– this represents a measure of the concept's relevance for the domain and allows eliminating very specific or unrelated ones (e.g. original)
• Number of different web sites that contain the concept at least once (e.g. minimum of 3 for the first iteration of sensor)
– this gives a measure of the word's generality for a domain (e.g. wireless is quite common, but Bosch isn't)
34
• Estimated number of results returned by the search engine for the selected concept alone, without including sensor (e.g. maximum of 10,000,000 for sensor)
– this indicates the word's global generality and allows avoiding widely-used ones (e.g. level)
• Estimated number of results returned by the search engine joining the selected concept with the initial keyword (e.g. a minimum of 50 for the sensor example)
– this represents a measure of association between those two terms (e.g. "oxygen sensor" gives many results but "optimized sensor" doesn't)
35
• Ratio between the two last measures (e.g. minimum of 0.0001 for sensor)
– This is a very important measure because it indicates the intensity of the relation between the concept and the keyword and allows detecting relevant words (e.g. "temperature sensor" is much more relevant than "optimized sensor")
P(temperature|sensor) = P(temperature, sensor) / P(sensor) = C(temperature, sensor) / C(sensor) = 2,490,000 / 84,300,000 ≈ 0.03
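The ratio above is simply the quotient of two search-engine hit counts, an estimate of P(concept | keyword). Using the slide's numbers:

```python
def association_ratio(joint_hits, keyword_hits):
    """Ratio between hits for the combined query ("temperature sensor")
    and hits for the keyword alone: an estimate of P(concept | keyword)."""
    return joint_hits / keyword_hits if keyword_hits else 0.0

# Hit counts from the slide's example:
# C(temperature, sensor) = 2,490,000 and C(sensor) = 84,300,000
ratio = association_ratio(2_490_000, 84_300_000)
print(round(ratio, 2))  # 0.03
```

A candidate like "temperature sensor" passes the 0.0001 threshold easily, while a weakly associated pair would fall below it.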
36
2.1 Term discovery, taxonomy building and web categorization algorithm
6. Only concepts (a small percentage of the candidate list) whose attributes fit a set of specified constraints (a range of values for each parameter) are selected (marked in bold in Table 1). Moreover, a relevance measure (1) of this selection is computed, based on the amount by which the concept's attribute values exceed the selection constraints (Next Page)
37
38
39
• This measure could be useful if an expert evaluates the taxonomy, or if the hierarchy is bigger than expected (perhaps the constraints were too loose) and a trim of the less relevant concepts is performed
40
2.1 Term discovery, taxonomy building and web categorization algorithm
7. For each concept extracted from a word previous to the initial one, a new keyword is constructed by joining the new concept with the initial one (e.g. "position sensor"), and the algorithm is executed again from the beginning.
This process is repeated recursively until a selected depth level is reached (e.g. 4 levels for the sensor example) or no more results are found (e.g. solid-state pressure sensor does not have any subclass).
Each new execution has its own search and selection parameter values because the searched keyword is more restrictive (constraints have to be relaxed in order to obtain a significant number of final results)
41
2.1 Term discovery, taxonomy building and web categorization algorithm
8. The obtained result is a hierarchy that is stored as an ontology with is-a relations. If a word has different derivative forms, all of them are evaluated independently (e.g. optic, optical) but identified with an equivalence relation (see some examples in Fig. 2).
Moreover, each class stores the concept's attributes described previously and the set of URLs from which it was selected during the analysis of the immediate superclass (e.g. the set of URLs returned by Google when setting the keyword sensor that contain the candidate term optical sensor)
42
Part of Fig. 2
43
2.1 Term discovery, taxonomy building and web categorization algorithm
9. In the same way that the “previous word analysis” returns candidate concepts that could become classes, the word posterior to the initial keyword is also considered.
In this case, the selected terms will be used to describe and classify the set of URLs associated to each class
44
For example, if we find that, for a specified URL associated to the class humidity sensor, this keyword set is followed by the word company (and this term has been selected as a candidate concept during the statistical analysis), the URL will be categorized with this word (which represents a domain of application).
This information could be useful when the user browses the set of URLs of a class, because it can give an idea about the context where the class is applied (see some examples in Fig. 2 (or 3?): humidity sensor company, magnetic sensor prototype, temperature sensor applications…) (Next Page)
45
Part of Fig. 3
46
2.1 Term discovery, taxonomy building and web categorization algorithm
10. Finally, a refinement process is performed in order to obtain a more compact taxonomy and avoid redundancy. In this process, classes and subclasses that have the same set of associated URLs are merged because we consider that they are closely related: in the search process, the two concepts have always appeared together (Next Page)
47
• For example, the hierarchy “wireless -> scale -> large” will result in “wireless -> large_scale” (automatically discovering a multiword term, [16]), because the last 2 subclasses have the same web sets. Moreover, the list of web sites for each class is processed in order to avoid redundancies: if a URL is stored in one of its subclasses, it will be deleted from the superclass set
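The refinement step above can be sketched with a minimal, hypothetical tree structure. The real system works over the OWL ontology; this sketch only handles the single-child merge shown in the example:

```python
class Node:
    """Minimal taxonomy node: a term, its URL set and its subclasses
    (a hypothetical stand-in for the system's ontology classes)."""
    def __init__(self, term, urls, children=None):
        self.term, self.urls, self.children = term, set(urls), children or []

def refine(node):
    """Merge a subclass into its parent when their URL sets coincide
    (discovering a multiword term), then drop from the parent any URL
    already stored in one of its subclasses."""
    for child in node.children:
        refine(child)
    if len(node.children) == 1 and node.children[0].urls == node.urls:
        child = node.children[0]
        node.term = f"{child.term}_{node.term}"  # e.g. scale + large -> large_scale
        node.children = child.children
    for child in node.children:
        node.urls -= child.urls
    return node

# "scale -> large" with identical URL sets collapses to "large_scale"
scale = Node("scale", {"u1", "u2"}, [Node("large", {"u1", "u2"})])
print(refine(scale).term)  # large_scale
```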
48
2.1 Term discovery, taxonomy building and web categorization algorithm
• An example of the resulting taxonomy of terms obtained by this method with the set of parameters described during the algorithm description and for the sensor domain is shown in Fig. 2
• Several examples of candidate concepts for the first level of search and their attribute values are shown in Table 1
• Moreover, for the most significant classes discovered, examples of the Web resources obtained and categorized are shown in Fig. 3 (note that several types of non-HTML file types are retrieved)
49
Fig. 2
50
51
52
Fig. 3
53
2.2 Polysemy detection and semantic clustering
• One of the main problems when analyzing natural language resources is polysemous words
• In our case, for example, if the primary keyword has more than one sense (e.g. virus can be applied over “malicious computer programs” or “infectious biological agents”), the resulting taxonomy could contain concepts from different domains (one for each meaning) completely merged (e.g. “computer virus” and “immunodeficiency virus”)
54
2.2 Polysemy detection and semantic clustering
• Although these concepts have been selected correctly, it could be interesting for the branches of the resulting taxonomic tree to be grouped if they pertain to the same domain, corresponding to a concrete sense of the immediate “father” concept
• With this representation, the user would be able to consult the hierarchy of terms that belongs to the desired sense of the initial keyword
55
2.2 Polysemy detection and semantic clustering
• Performing this classification without any previous knowledge (for example, the list of meanings, synonyms for each sense, or a thesaurus like WordNet [1]) is not a trivial process [16, 17]
• However, we can take advantage of the context from which each concept has been extracted; concretely, the documents (URLs) that contain it
56
2.2 Polysemy detection and semantic clustering
• We can assume that each website that talks about a given keyword is using it in a concrete sense, so all candidate concepts that are selected from the analysis of a single document pertain to the domain associated to a concrete keyword’s meaning
57
2.2 Polysemy detection and semantic clustering
• Applying this idea over a large amount of documents we can find, as shown in Fig. 4 for the virus example, quite consistent semantic relations between the candidate concepts
Fig. 4
58
2.2 Polysemy detection and semantic clustering
• So, if a word has N meanings (or it is used on N different domains with different senses), the resulting taxonomy of terms of this concept will be grouped in a similar number of sets, each one containing the elements that belong to a particular domain
• The classification process is performed without any previous semantic knowledge
• In more detail, the algorithm performs the following actions: (Next Page)
59
2.2 Polysemy detection and semantic clustering
1. It starts from the taxonomy obtained with the described methodology. Concretely, all concepts are organized in an is-a ontology, and each term stores the set of web sites from which it has been obtained and the whole list of URLs returned by Google
60
2.2 Polysemy detection and semantic clustering
2. For a given term of the taxonomy (for example the initial keyword: virus) and a concrete depth level (for example the first one), a classification process is performed by joining the terms which belong to each keyword sense.
This process is performed by a SAHN (Sequential Agglomerative Hierarchical Non-overlapping) clustering algorithm that joins the most similar terms, using as a similitude measure the number of coincidences between their URL sets: (Next Page)
61
• For each term of the same depth level (e.g. anti, latest, simplex, linux, computer, influenza, online, immunodeficiency, leukaemia, nile, email) that depends directly on the selected word (virus), a set of URLs is constructed by joining the stored web sites associated to the class and all the sets of its descendant classes (without repetitions)
62
• Each term is compared to each other and a similitude measure is obtained by comparing their URL sets (2).
Concretely, the measure represents the maximum amount of coincidences between each set (normalized as a percentage of the total set). So, the higher it is, the more similar the terms are (because they are frequently used together in the same context)
63
• With these measures, a similitude matrix between all terms is constructed. The most similar terms (in the example, anti and computer) are selected and joined (they belong to the same keyword sense)
64
• The joining process is performed by creating a new class with those terms and removing them individually from the initial taxonomy. Their URL sets are joined but not merged (each term keeps its URL set independently)
65
• For this new class, the similitude measure to the remaining terms is computed. In this case, the measure of coincidence will be the minimum number of coincidences between the URL set of the individual term and the URL set of each term of the group that forms the new class (3)
66
• With this method we are able to detect the most subtle senses of the initial keyword (or at least different domains where it is used).
Other measures, like taking into consideration the maximum number of coincidences or the mean, have been tested, obtaining worse results (they tend to join all the classes, making it difficult to differentiate the senses)
67
• The similitude matrix is updated with these values and the new most similar terms/classes are joined (building a dendrogram like the one shown in Fig. 4). The process is repeated until no more elements remain unjoined or the similitude between each one is 0 (there are no coincidences between the URL sets)
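The clustering loop above can be sketched as follows. This is a simplified version that returns flat clusters instead of the full dendrogram; normalizing measure (2) by the smaller URL set is an assumption of this sketch, and the group similarity uses the minimum pairwise overlap, as measure (3) prescribes:

```python
def overlap(a, b):
    """Similitude between two terms: shared URLs, here normalized by
    the smaller set (an assumption; the paper normalizes as a percentage
    of the total set)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def sahn_cluster(url_sets):
    """SAHN-style agglomerative clustering of terms by URL overlap.
    Group-to-group similitude is the MINIMUM pairwise overlap, as in
    measure (3); merging stops when the best similitude drops to 0."""
    clusters = [[term] for term in url_sets]
    def sim(c1, c2):
        return min(overlap(url_sets[a], url_sets[b]) for a in c1 for b in c2)
    while len(clusters) > 1:
        pairs = [(sim(c1, c2), i, j)
                 for i, c1 in enumerate(clusters)
                 for j, c2 in enumerate(clusters) if i < j]
        best, i, j = max(pairs)
        if best == 0:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical URL sets for four terms of the "virus" example
urls = {
    "computer":  {"a", "b", "c"},
    "anti":      {"a", "b"},
    "influenza": {"x", "y"},
    "nile":      {"x", "z"},
}
print(sorted(sorted(c) for c in sahn_cluster(urls)))
# [['anti', 'computer'], ['influenza', 'nile']]
```

The two resulting groups correspond to the two senses of the keyword: the terms sharing computing-related pages cluster apart from those sharing biology-related pages.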
68
2.2 Polysemy detection and semantic clustering
3. The result is a set of classes (with 2 elements for the virus example, and a number that has been automatically discovered) that groups the terms belonging to a specific meaning. The dendrogram with the joined classes and similitude values can also be consulted by the user of the system
69
Outline
1. Introduction
2. Taxonomy building methodology
3. Ontology representation and evaluation
4. Conclusion and future work
70
3. Ontology representation and evaluation
• The final hierarchy of terms is stored in a standard representation language: OWL [14]. The Web Ontology Language is a semantic markup language for publishing and sharing ontologies on the World Wide Web
• It is designed for use by applications that need to process the content of information and facilitates greater machine interpretability by providing additional vocabulary along with a formal semantics [21]
• OWL is supported by many ontology visualizers and editors, like Protégé 2.1 [15], allowing the user to explore, understand, analyze or even modify the ontology easily
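A minimal sketch of the export step: serializing a nested is-a hierarchy as OWL/RDF-XML class declarations with rdfs:subClassOf links. The taxonomy dict and the exact element layout are illustrative; a real exporter would use an RDF library:

```python
def to_owl(hierarchy, parent=None):
    """Emit OWL class declarations for a nested dict taxonomy, linking
    each class to its superclass with rdfs:subClassOf."""
    xml = []
    for term, subclasses in hierarchy.items():
        sub = (f'\n  <rdfs:subClassOf rdf:resource="#{parent}"/>'
               if parent else "")
        xml.append(f'<owl:Class rdf:ID="{term}">{sub}\n</owl:Class>')
        xml.extend(to_owl(subclasses, term))  # recurse into subclasses
    return xml

taxonomy = {"sensor": {"temperature_sensor": {}, "optical_sensor": {}}}
print("\n".join(to_owl(taxonomy)))
```

The resulting fragment can be opened in an OWL editor such as Protégé for inspection.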
71
3. Ontology representation and evaluation
• In order to evaluate the correctness of the results, a set of formal tests has been performed
• Concretely, Protégé provides a set of ontological tests for detecting inconsistencies between classes, subclasses and properties of an OWL ontology from a logical point of view
• However, the number of different documents evaluated for obtaining the result and, in general, the variety and quantity of resources available on the web make any kind of automated evaluation extremely difficult
72
3. Ontology representation and evaluation
• So, correctness from a semantic point of view can only be tested by comparing the results with other existing semantic studies (for example, using other well-known methodologies) or through an analysis performed by a domain expert
73
3. Ontology representation and evaluation
• Anyway, several points could be improved in the current method (from the example of Fig. 2), like:
– the detection of the presence of common meaningless words (like based)
– the detection of multiword terms (low->energy) or equivalences between classes (possible acronyms, relations at different levels for the same class, e.g. airborne sensor and airborne digital sensor, or very common sub-hierarchies like optic)
74
Outline
1. Introduction
2. Taxonomy building methodology
3. Ontology representation and evaluation
4. Conclusion and future work
75
4. Conclusion and future work
• Some researchers have been working on knowledge mining and ontology learning from different kinds of structured information sources (like databases, knowledge bases or dictionaries [7])
• However, taking into consideration the amount of resources easily available on the Internet, we believe that knowledge extraction from unstructured documents like web pages is an important line of research
76
4. Conclusion and future work
• In this sense, many authors [2, 5, 8, 10, 11] are putting their effort into processing natural language texts for creating or extending structured representations like ontologies
• In this field, term taxonomies are the point of departure for many ontology creation methodologies [20]
77
4. Conclusion and future work
• The discovery of these hierarchies of concepts based on word association has a very important precedent in [19], which proposes a way to interpret the relation between consecutive words (multiword terms)
• Several authors [11, 16, 17] have applied a similar idea for extracting hierarchies of terms from a document corpus to create or extend ontologies
78
4. Conclusion and future work
• However, the classical approach of these methods is the analysis of a relevant corpus of documents for a domain. In some cases, a semantic repository (WordNet [1]), from which word meanings can be extracted and linguistic analyses performed, or an existing representative ontology is used
79
4. Conclusion and future work
• In contrast, our methodology does not start from any kind of predefined knowledge of the domain, and it only uses publicly available web search engines for building relevant taxonomies from scratch
• Moreover, searching the whole web introduces some very important problems compared with relevant-corpus analysis
80
4. Conclusion and future work
• On the one hand, the heterogeneity of the information resources hinders the extraction of important data; in this case, we base our methodology on the high degree of information redundancy and on a statistical analysis of the relevance of candidate terms for the domain
81
4. Conclusion and future work
• Note also that most of these statistical measures are obtained directly from the web search engine, a fact that greatly speeds up the analysis and gives more representative results than a classical statistical approach based only on the texts (they are based on whole-web statistics, not on the analysed subset)
82
4. Conclusion and future work
• On the other hand, polysemy becomes a serious problem when the retrieved resources obtained from the search engine are only based on the keyword's presence (some authors [16, 17] have detected this problem even in corpus analysis); in this case, we propose an automatic approach to polysemy disambiguation based on the clustering of the selected classes according to the similarities between their information sources (web pages and web domains), without any kind of semantic knowledge
83
4. Conclusion and future work
• The final taxonomy obtained with our method really represents the state of the art on the WWW for a given concept, and the hierarchically structured and domain-categorized list of the most representative web sites for each class is a great help for finding and accessing the desired web resources
84
4. Conclusion and future work
• To ease the definition of the search and selection parameters, an automatic pre-analysis can be performed from the initial keyword to estimate the most adequate values
– For example, the number of results for a concept gives a measure of its generality, which indicates the need for more restrictive or relaxed constraints
85
4. Conclusion and future work
• Several executions from the same initial keyword at different times can give us different taxonomies. A study of the changes can tell us how a domain evolves (e.g. a new kind of sensor appears)
86
4. Conclusion and future work
• For each class, an extended analysis of the relevant web sites could be performed to find possible attributes and values that describe important characteristics (e.g. the price of a sensor), or closely related words (like a topic signature [9])
87
4. Conclusion and future work
• The same methodology applied to discover subtypes of the initial keyword can be useful for finding concrete instances of classes (e.g. manufacturer names of a specific sensor type), using some simple rules like the presence of capital letters [19]
88
4. Conclusion and future work
• The described methodology is useful when working in easily categorized domains (where concatenated adjectives can be found)
• However, in other cases, a more exhaustive analysis has to be performed (like finding the verb of a sentence or detecting the predicate or the subject). In this way, more complex semantic relations could be found, and ontological structures could be constructed
Presented by Jian-Shiun Tzeng 3/26/2009
Web-scale taxonomy learning
David Sánchez , Antonio Moreno
Department of Computer Science and Mathematics, Universitat Rovira i Virgili (URV), Avda. Països Catalans, 26, 43007 Tarragona, Spain
Proceedings of Workshop on Extending and Learning Lexical Ontologies using Machine Learning, ICML05, 2005
90
Abstract
• In this paper, we propose an automatic and unsupervised methodology to obtain taxonomies of terms from the Web and to arrange retrieved web sites into a meaningful organization for a desired domain, without previous knowledge
91
Abstract
• It is based on the intensive use of web search engines to retrieve domain-suitable resources from which to extract knowledge, and to obtain web-scale statistics from which to infer knowledge relevancy
• Results can be useful for:
– easing the access to the web resources
– or as the first step in constructing ontologies suitable for the Semantic Web
92
Outline
1. Introduction
2. Taxonomy building methodology
3. Representation and evaluation of the results
4. Related work, conclusions and future research
93
1. Introduction
• In this paper we present an unsupervised and automatic methodology to build taxonomies of terms and associated web resources from the Web for a given domain without any previous knowledge
94
Outline
1. Introduction
2. Taxonomy building methodology
3. Representation and evaluation of the results
4. Related work, conclusions and future research
95
2. Taxonomy building methodology
• In this section, the methodology used to discover, select and organize in a taxonomy the representative concepts, instances and websites for a given domain is described
96
2. Taxonomy building methodology
• The algorithm is based on analysing a considerable number of web sites (but not the full set) to find candidate concepts for a domain defined by an initial keyword
• The detection is performed by using linguistic patterns involving that keyword
• At this point, we centre the analysis on studying the neighborhood of the initial keyword when it appears in noun phrases
97
2. Taxonomy building methodology
• In English, the word immediately before a keyword frequently classifies it, whereas the word immediately after it often indicates the domain where it is applied (Grefenstette, 1997)
– anterior (e.g. lung cancer) + posterior (e.g. cancer treatment)
• However, this analysis can be extended by considering other patterns for hyponymy detection, such as Hearst's (1992)
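The two kinds of patterns just described can be sketched with simple regular expressions. This is an illustrative sketch only, not the authors' implementation (which relies on the OpenNLP annotation engine for proper syntactic analysis); the helper names are hypothetical.

```python
import re

def neighborhood_candidates(text, keyword):
    # Grefenstette-style pattern: the word immediately before the keyword
    # (a classifier, e.g. "lung" in "lung cancer") and the word immediately
    # after it (a domain of application, e.g. "treatment").
    pattern = re.compile(r"(?:(\w+)\s+)?" + re.escape(keyword) + r"(?:\s+(\w+))?",
                         re.IGNORECASE)
    return [(m.group(1), m.group(2)) for m in pattern.finditer(text)]

def hearst_hyponyms(text, keyword):
    # One Hearst (1992) pattern: "<keyword>s such as X, Y and Z".
    m = re.search(re.escape(keyword) + r"s?\s+such\s+as\s+([\w,\s]+)",
                  text, re.IGNORECASE)
    if not m:
        return []
    return [w.strip() for w in re.split(r",\s*|\s+and\s+", m.group(1)) if w.strip()]
```

A real system would add part-of-speech checks (nouns/adjectives only) as described in the following slides.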
98
2. Taxonomy building methodology
• The system relies on a search engine as the expert used to search for and access the web resources from which to extract knowledge
• It dynamically constructs the appropriate search queries to obtain the most suitable corpus of web resources at each stage
• So, the more knowledge is extracted for a domain, the more specific the evaluated corpus of web documents becomes
99
2. Taxonomy building methodology
• Moreover, the search engine is also used to obtain web-scale statistics from the number of hits associated with specially formulated queries
• From these statistics we check the relevance of the extracted terms and evaluate the strength of the relationships between them
100
2. Taxonomy building methodology
• However, one problem may arise from the use of keyword-based search engines:
– the search will be centered on the initial keyword selected as the representative of the domain of knowledge; if several ways of expressing the same concept exist (e.g. synonyms, lexicalizations or even different morphological forms), a considerable amount of potentially interesting web resources will be omitted
101
2. Taxonomy building methodology
• In order to solve that problem, we intend to use a previously developed automatic methodology (Sanchez & Moreno, 2005) for discovering those alternative keywords from an initial one, using the same initial taxonomy obtained through the method explained in this paper and the Web environment
• By incorporating this step into the present methodology and performing iterative executions of the learning algorithm, we can potentially improve the final recall of our system while maintaining its unsupervised and uninformed operation
102
103
2.1 Concept discovery, taxonomy building and web categorization algorithm
• In more detail, the algorithm for discovering representative concepts and building a taxonomy of web resources (shown in Figure 1) has the following phases (see Figures 2 and 3 to follow an example for the Cancer domain referred to during the explanation):
104
Figure 2.
105
Figure 3.
106
2.1 Concept discovery, taxonomy building and web categorization algorithm
1. It starts with a keyword representative enough of a domain (e.g. cancer). Thanks to the generic implementation of the learning algorithm, this initial keyword can be composed of several words if desired (e.g. breast cancer). Then, it uses a web search engine to obtain the most representative web pages that contain that keyword (e.g. 100 pages for the Cancer domain)
107
• A priori, the more web sites are evaluated, the more complete the results will be
– However, the best and most suitable web resources are presented first by the search engine and, due to the amount of redundancy, once a certain percentage of the full set has been evaluated, obtaining new valid terms becomes quite difficult
– So, evaluating just a reduced portion of the full set can give us representative and accurate results
108
2.1 Concept discovery, taxonomy building and web categorization algorithm
2. For each site returned, an exhaustive analysis is performed. A web parser obtains the useful text and the system tries to find the initial keyword (e.g. cancer) within the retrieved text. Then, a syntactical analysis based on an annotation engine for the English language (OpenNLP1) is performed over the keyword’s neighborhood to detect tokens and grammatical categories
109
• The immediate anterior word (e.g. prostate cancer) is selected as a candidate concept. The posterior words (e.g. cancer treatment) are also retrieved
– Concretely, the system verifies that the words are nouns or adjectives, as these modify the keyword's meaning
– Each word is analyzed from its morphological root using a stemming algorithm
– So, the different morphological forms of the same term will be considered as the same concept, and their statistics will be attributed to the most common morphological form
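The morphological grouping described above can be sketched as follows. A real system would use a proper stemming algorithm (e.g. Porter's); the crude suffix stripping below is only an illustrative stand-in, and the helper names are hypothetical.

```python
from collections import Counter, defaultdict

def crude_stem(word):
    # Very crude suffix stripping; e.g. maps both "hormone" and
    # "hormonal" to the pseudo-root "hormon".
    for suffix in ("ational", "al", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def group_by_stem(forms):
    # Group morphological variants under one stem and attribute the
    # combined count to the most frequent surface form, as the slide says.
    groups = defaultdict(Counter)
    for w in forms:
        groups[crude_stem(w)][w] += 1
    return {stem: (counts.most_common(1)[0][0], sum(counts.values()))
            for stem, counts in groups.items()}
```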
110
2.1 Concept discovery, taxonomy building and web categorization algorithm
3. For each candidate, a selection algorithm is performed to decide if it will be considered as a subclass candidate of the initial one (a type of…), or as an instance candidate of the immediate superclass (an example of…), or even both at the same time if the same word is used for expressing concepts at different levels of concreteness. The method followed is described in section 2.2
111
2.1 Concept discovery, taxonomy building and web categorization algorithm
4. Among the candidate concepts considered as subclass candidates, a statistical analysis is performed to select the most representative ones
– In order to obtain relevant measures, in addition to the frequency analysis performed over the reduced set of analyzed web sites, the information obtained from the search engine, which is based on the analysis of the whole Web (web-scale statistics (Etzioni et al., 2004)), is also evaluated
112
– This allows us to obtain representative measures without analyzing a large amount of documents (good scalability)
– As we want to perform the selection process automatically, a set of selection constraints will be computed at this point as a function of the domain's generality and the size of the search. Concretely:
113
1) From the analyzed set of web sites, we extract the total number of appearances of each candidate and the number of different web sites in which it has appeared. Those measures give us a local idea of which candidates are potentially the most used in the domain (e.g. breast cancer is very relevant but common cancer isn't)
• For each value, a minimum number of hits will be required in order to detect the most representative terms. As this number depends on the size of the search, the minimum threshold will also depend on it
114
• However, the number of appearances of a specific concept (especially the most relevant ones) does not grow linearly with the number of websites, because web resources are sorted by the search engine and, consequently, the most important ones are presented first
• In consequence, setting the threshold as a logarithmic function of the size of the search is a good choice (1)
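Formula (1) itself is not reproduced in the slides; a hedged sketch of the described behaviour (a minimum-appearances threshold that grows logarithmically with the search size) could look like this, where the constant `k` is an assumed tuning parameter:

```python
import math

def min_appearance_threshold(search_size, k=2.0):
    # Threshold grows logarithmically, not linearly, with the number of
    # analysed web sites, because relevance-sorted results mean new
    # appearances of good concepts taper off as the search grows.
    return max(1, round(k * math.log(search_size)))
```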
115
2.1 Concept discovery, taxonomy building and web categorization algorithm
2) From the search engine, we compute the number of hits obtained by the concatenation of the candidate and the initial keyword, the total hits obtained by the candidate itself, and a ratio between those measures (2)
• The first two measures impose mandatory constraints defined a priori for avoiding very common words (e.g. world cancer) and misspelled ones (e.g. todaybreast cancer)
116
• The ratio value is especially important as it gives us a measure of the strength of the relationship between the candidate and the initial keyword against the whole Web (e.g. "cervical cancer" is a much more suitable relationship than "premier cancer")
• As that ratio is an absolute measure, it is fixed at a value that allows configuring the behavior of the system. This last statistical measure has been previously employed by several authors (Turney, 2001) to compute, in an efficient and scalable way, the degree of relationship between terms
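A hedged reading of measure (2): hit counts are assumed to come from the search engine for the quoted joint query and for the candidate alone (the slides do not give the exact formula).

```python
def web_scale_ratio(hits_joint, hits_candidate):
    # Strength of the candidate-keyword relation against the whole Web:
    # hits("candidate keyword") / hits("candidate"). A very common word
    # like "world" yields a tiny ratio; a domain-bound word like
    # "cervical" yields a much larger one.
    return hits_joint / hits_candidate if hits_candidate else 0.0
```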
117
2.1 Concept discovery, taxonomy building and web categorization algorithm
5. For each candidate that fulfils the two mandatory restrictions, a relevance measure (3) is computed
– That measure takes into account the number of times the concept's statistical values exceed the specified minimums
– We consider the total number of appearances, the number of different web sites and the web-scale ratio
– The first two values express the concept's generality, while the last one indicates the strength of its relation to the specific domain
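Measure (3) is not shown in the transcript; one illustrative reading consistent with the description (rewarding how far each statistic exceeds its minimum) is:

```python
def relevance(stats, minimums):
    # Sum, over the three statistics (appearances, distinct sites,
    # web-scale ratio), of how many times each value exceeds its
    # specified minimum. This is an assumed form, not the paper's (3).
    return sum(stats[name] / minimums[name] for name in minimums)
```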
118
2.1 Concept discovery, taxonomy building and web categorization algorithm
6. At this point, the selection process is performed by first selecting the most confident candidates and then computing a selection threshold for the remaining ones as a function of the selected candidates' relevance
– The idea is to use the statistical values of the most representative obtained terms to adapt the selection heuristic to the specific domain, independently of the search's size or the domain's generality
– This allows us to perform the selection process in a completely automatic, unsupervised and domain-independent way
119
– More concretely, those candidates whose full set of statistics satisfies all the specified constraints are selected first
– Next, the lowest relevance value among those high-confidence selected concepts is set as the selection threshold
– Then, the remaining concepts that surpass the selection threshold are also selected
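The two-stage selection of step 6 can be sketched as follows, using an illustrative relevance score (sum of each statistic over its minimum); the exact constraints and formulas are not reproduced in the slides.

```python
def select_candidates(candidates, minimums):
    # Stage 1: candidates satisfying ALL minimum constraints are
    # high-confidence selections; the lowest relevance among them
    # becomes the threshold. Stage 2: remaining candidates above
    # that threshold are also selected.
    def rel(stats):
        return sum(stats[k] / minimums[k] for k in minimums)
    confident = {name for name, s in candidates.items()
                 if all(s[k] >= minimums[k] for k in minimums)}
    if not confident:
        return set()
    threshold = min(rel(candidates[name]) for name in confident)
    return {name for name, s in candidates.items()
            if name in confident or rel(s) >= threshold}
```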
120
2.1 Concept discovery, taxonomy building and web categorization algorithm
7. For each selected concept, a new and more complex keyword is constructed by joining this concept with the initial keyword (e.g. "breast cancer"), and the algorithm is executed again
– So, at each iteration, a new and more specific set of relevant documents for the concrete subdomain is retrieved
– This process is repeated recursively until a selected depth level is reached or no more results are found
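The recursion of step 7 can be sketched as a skeleton, where the hypothetical `discover` callback stands in for the whole search-and-select pipeline of the previous steps and returns the selected subclass terms for a (possibly multi-word) query:

```python
def build_taxonomy(keyword, discover, max_depth=2):
    # Each selected concept is joined with the current keyword to form a
    # more specific query (e.g. "breast" + "cancer" -> "breast cancer"),
    # and the process recurses until the depth limit or no results.
    if max_depth == 0:
        return {}
    return {concept: build_taxonomy(f"{concept} {keyword}", discover, max_depth - 1)
            for concept in discover(keyword)}
```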
121
2.1 Concept discovery, taxonomy building and web categorization algorithm
8. The result is a hierarchy that is stored as an ontology with is-a relations (as shown in Figure 3)
– If a concept has different derivative forms (e.g. hormone and hormonal), all of them are represented and linked with an equivalence relation
– Each class and instance stores the URLs from which it was selected
– This can be valuable information for improving access to web sites and for the Semantic Web
122
Figure 3.
123
2.1 Concept discovery, taxonomy building and web categorization algorithm
9. The posterior word for the initial keyword is also considered
– In this case, the selected concepts are used to categorize the set of URLs associated with each class
– For example, if we find that, in a URL associated with the class breast cancer, this keyword is followed by the word foundation, the URL will be categorized with this word, which represents a domain of application (some examples in Figure 2)
124
125
2.1 Concept discovery, taxonomy building and web categorization algorithm
10. When dealing with polysemic domains (e.g. virus), a semantic disambiguation algorithm is performed in order to group the obtained subclasses according to the specific word sense (e.g. biological agent or computer program)
– The methodology clusters those classes, using the number of coincidences between their URLs as a similarity measure
– More details in Sanchez & Moreno (2004b)
126
2.1 Concept discovery, taxonomy building and web categorization algorithm
11. Finally, a refinement is performed to obtain a more compact taxonomy
– In this process, classes that share a high proportion of their URLs (e.g. above 75%) are merged, because we consider them closely related
– For example, the hierarchy “cancer -> cell -> germ -> refractory -> cisplatin -> nonseminomatous” will result in “cancer -> cell -> germ -> refractory -> nonseminomatous_cisplatin”, discovering a multiword (Voosen, 2001)
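The merge test of step 11 can be sketched as below. The slides do not specify the exact overlap formula, so taking the shared fraction relative to the smaller URL set is an assumption.

```python
def should_merge(urls_a, urls_b, threshold=0.75):
    # Merge two classes when they share a high fraction of their URLs
    # (e.g. above 75%), relative to the smaller set (assumed definition).
    if not urls_a or not urls_b:
        return False
    shared = len(urls_a & urls_b)
    return shared / min(len(urls_a), len(urls_b)) >= threshold
```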
127
2.2 Named entity discovery
• One of the hardest problems of the knowledge acquisition process is to decide when a term has to be considered a subclass (which can become the superclass of new classes) or an instance (which represents a leaf of the hierarchy)
• Even for a knowledge engineer this can be a challenging issue
• A simplified approach is to consider named entities as instances, as they represent individualizations of concepts
128
2.2 Named entity discovery
• Many named entity recognizers traditionally rely on lists of name examples (Mikheev, Moens & Grover, 1999)
– The lists are compiled by humans, or assembled from authoritative sources
• It is also possible to build recognizers that identify names automatically in text (Stevenson & Gaizauskas, 2000)
– Such approaches usually attempt to learn general categories such as organizations or persons rather than refined categories
– Even when fine-grained categories are considered (Fleischman & Hovy, 2002), they use a closed, pre-specified set of categories of interest
129
2.2 Named entity discovery
• Other authors (Lamparter, Ehrig & Tempich, 2004) use a thesaurus like WordNet to perform this detection:
– if the word is not found in the dictionary, it is assumed to be a named entity
– However, a named entity can sometimes be expressed with normal words, so the use of a thesaurus alone is not enough to make the distinction
130
2.2 Named entity discovery
• Other approaches take into consideration the way in which named entities are presented
– Concretely, languages such as English distinguish proper names from common words through capitalization
– This simple but effective idea, combined with a certain degree of linguistic analysis, has been applied by several authors (Cimiano & Staab, 2004), obtaining good results
131
2.2 Named entity discovery
– That approach, combined with a statistical analysis of named-entity candidates, is the basis on which our method has been designed
• However, as most search engines do not differentiate between lower- and upper-case letters, we cannot obtain such statistics directly (e.g. the ratio between hits of a word in its lower-case form and its upper-case form) and, in consequence, some level of analysis has to be performed
132
2.2 Named entity discovery
• More in detail, during the taxonomy building process, if the candidate concept is represented with at least one capital letter, it will be marked as an instance candidate; otherwise, it will be considered as a subclass candidate
• Note that a concept can be considered of both types at the same time if it has been found represented with the two forms
133
2.2 Named entity discovery
• For each instance candidate, the first web sites returned by the search engine are processed to count the number of times it appears with upper- and lower-case letters
– Those which appear capitalized in most cases (e.g. 85% for high confidence) will be selected as instances (see examples in Fig. 2b and Table 1; details on how the full name of the named entity is discovered are given in section 3)
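The capitalization heuristic of section 2.2 can be sketched as follows; the 85% cut-off comes from the slide, while the function name and interface are illustrative.

```python
def classify_candidate(occurrences, instance_threshold=0.85):
    # A candidate seen capitalized in most of its occurrences (e.g. >= 85%)
    # is taken as an instance (named entity); otherwise as a subclass
    # candidate. A term could be kept as both if counted per context.
    capitalized = sum(1 for w in occurrences if w[:1].isupper())
    ratio = capitalized / len(occurrences)
    return "instance" if ratio >= instance_threshold else "subclass"
```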
134
135
Outline
1. Introduction
2. Taxonomy building methodology
3. Representation and evaluation of the results
4. Related work, conclusions and future research
136
3. Representation and evaluation of the results
• The final hierarchy is stored in a standard representation language: OWL. The Web Ontology Language is a semantic markup language for publishing and sharing ontologies on the World Wide Web (Berners-Lee, Hendler & Lassila, 2001)
– It is supported by many ontology visualisers and editors
137
3. Representation and evaluation of the results
• Concerning the evaluation of automatically acquired knowledge, this is recognized to be an open problem (OntoWeb, 2002)
• Regarding our proposal, evaluation is, at this moment, a matter of ongoing research
138
3. Representation and evaluation of the results
• However, some evaluations have been performed manually, and others are planned to be performed automatically by comparing the results against other methodologies or available semantic repositories such as WordNet
– This does not affect the basic domain-independence premise (no predefined knowledge)
– The idea is to be able to demonstrate the quality and suitability of the system in obtaining results for well-known domains, and so establish trust in the results obtained for any other domain
139
3. Representation and evaluation of the results
• The evaluation of the obtained taxonomies of concepts is performed manually at this moment
– Whenever possible, a representative human-made classification is taken as the ideal taxonomy to achieve (Gold Standard)
– Then, several tests for that domain are performed with different sizes of search
– For each one, we apply the standard measures of recall and precision (see an example of evaluation for the Cancer domain in Figure 4)
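The standard measures used against the Gold Standard are straightforward to compute:

```python
def precision_recall(extracted, gold):
    # Precision: fraction of extracted concepts found in the gold
    # standard. Recall: fraction of gold-standard concepts extracted.
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```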
140
3. Representation and evaluation of the results
• Moreover, a local recall measure considering only the concepts potentially discoverable from the evaluated resources is also computed, in order to check the performance of the selection procedure
• As a measure of comparison of the quality of the obtained results against similar available systems, we have evaluated precision and recall against hand-made web directory services and taxonomical search engines
141
3. Representation and evaluation of the results
• The performance of the candidate selection procedure is high, as the proportion of mistakes (incorrectly selected and rejected items) remains around 15-20%
142
3. Representation and evaluation of the results
• Moreover, the local recall (the number of correctly selected concepts out of the full list of discovered ones) is typically around 90-95%
143
3. Representation and evaluation of the results
• The growth of the number of discovered concepts (and, in consequence, the recall) follows a logarithmic distribution with respect to the size of the search, due to the redundancy of information and the relevance-based sorting of web sites by the search engine
144
3. Representation and evaluation of the results
• Moreover, once a point is reached at which a considerable number of concepts has been discovered, precision tends to decrease due to the growth of false candidates
• As a consequence, analysing a large number of web sites does not imply obtaining better results than with a smaller corpus
145
3. Representation and evaluation of the results
• Comparing the performance to Yahoo, we see that although its precision is the highest (as it is made by humans), the number of results (recall) is quite limited
146
3. Representation and evaluation of the results
• In relation to the automatic search tools Clusty and AlltheWeb, the first tends to return more results (higher recall) but with low precision, whereas the second offers the inverse situation
• In both cases, the performance (recall-precision trade-off) is clearly surpassed by our proposal
147
3. Representation and evaluation of the results
• The evaluation of the correctness of the discovered instances (in our case, named entities) is a hard task, even for domain experts, because in most cases no standard classification exists
• However, an alternative way for checking automatically obtained terms can be the comparison with other well-known learning techniques applied over the same corpus and domain
148
3. Representation and evaluation of the results
• Concretely, for the named-entity case, we have used a named-entity detection package trained for the English language (again, OpenNLP) that is able to detect with quite high accuracy some word patterns like organization, person, and location names
149
3. Representation and evaluation of the results
• The evaluation is performed by testing if the selected instance name is also tagged by the detection tool as a named entity
• This process gives us a measure of the precision of the obtained results in an automatic way, by comparing our algorithm against a well-known human-trained entity detection mechanism (based on maximum entropy models (Borthwick, 1999))
150
3. Representation and evaluation of the results
• In the future, we plan to improve this evaluation considering also recall measures and comparisons with other automatic techniques
• A collateral effect of this process is the detection of the full name associated with an instance, automatically tagged by OpenNLP when this name is composed of several words (e.g. 24th Annual San Antonio Breast Cancer Symposium), as shown in Table 1
151
3. Representation and evaluation of the results
• For illustrative purposes, for the Cancer, Sensor and Disease domains we have obtained a precision of 100%, 85% and 91% respectively
• However, even though it is obtained automatically, this is a relative result, because the tagging tool itself does not achieve 100% precision and is limited to predefined categories
152
Outline
1. Introduction
2. Taxonomy building methodology
3. Representation and evaluation of the results
4. Related work, conclusions and future research
153
4. Related work, conclusions and future research
• Several authors have been working in recent years on knowledge acquisition from natural language text collections and, more concretely, from the Web. In (Agirre et al., 2000; Alfonseca & Manandhar, 2002; Navigli & Velardi, 2004) the authors propose methods for extending an existing ontology with unknown words
154
4. Related work, conclusions and future research
• These approaches rely on previously obtained knowledge and semantic repositories like WordNet in order to classify concepts and perform disambiguation
155
4. Related work, conclusions and future research
• More closely related, in (Etzioni et al., 2004) the authors also try to enrich previously obtained ontologies, but using web-scale statistics obtained from web search engines to detect suitable candidates
• Other authors using this approach for knowledge acquisition tasks are (Cimiano & Staab, 2004), who learn taxonomic relationships using Hearst's (1992) patterns, and Turney (2001), for synonym discovery
156
4. Related work, conclusions and future research
• In contrast to many of the previously developed methodologies, our proposal does not start from any kind of predefined knowledge, such as WordNet, and, in consequence, it can be applied to domains that are not typically covered in semantic repositories, for example Biosensors (Sanchez & Moreno, 2004a)
157
4. Related work, conclusions and future research
• Moreover, the intensive use of search engines for obtaining statistics saves us from analysing large amounts of resources while maintaining good-quality results, and the automatic operation eases the updating of the results in rapidly evolving domains
• These facts result in a scalable and suitable method for acquiring knowledge from a repository as huge and dynamic as the Web
158
4. Related work, conclusions and future research
• The results obtained can be useful for several tasks
– On the one hand, the automatic classification of the web resources considered during the learning process into a meaningful organization offers a final representation that can ease the way users access those resources
– On the other hand, the automatically obtained taxonomy can be used as a first step in the construction of ontologies (Fensel, 2001)
159
4. Related work, conclusions and future research
• Ontologies offer a formal and efficient way of representing knowledge, but they are usually built entirely by hand
• If that knowledge evolves, an ontology maintenance process is required to keep the ontological knowledge up-to-date
– This is why methodologies that contribute to automatic ontology construction are necessary
• Finally, the knowledge structure populated from the Web itself is very suitable for the purposes of the Semantic Web (Berners-Lee, Hendler & Lassila, 2001)
160
4. Related work, conclusions and future research
• The concept and named-entity discovery method can be extended using more complex linguistic patterns, such as Hearst's (1992)
– This can be useful when dealing with domains that are not easily categorized, where no suitable noun phrases are found
161
4. Related work, conclusions and future research
• As mentioned before, incorporating an intermediate step to automatically learn alternative forms (synonyms, lexicalizations, etc.) of the initial keyword may allow us to retrieve new resources and concepts, improving the final recall
– For the cancer domain considered in this paper, other alternative forms easily discovered with this procedure are: cancers, carcinoma(s), tumor(s), tumour(s), neoplasm, etc
– More details in (Sanchez & Moreno, 2005). The procedure for joining the partial results obtained from the learning process for each of those initial keywords into a final complete taxonomy should also be considered in the future
162
4. Related work, conclusions and future research
• In order to extend the methodology, more complex relations (such as non-taxonomic ones) can be discovered from an analysis of the sentences that contain the selected concepts (text nuggets)
• As previously introduced, ways to automate, or at least ease, the evaluation of every step of the knowledge acquisition process will be studied