
Page 1: Automatic Generation of Taxonomies from the WWW

Presented by Jian-Shiun Tzeng 3/26/2009

Automatic Generation of Taxonomies from the WWW

David Sánchez, Antonio Moreno

Department of Computer Science and Mathematics, Universitat Rovira i Virgili (URV), Avda. Països Catalans, 26, 43007 Tarragona, Spain

In: Proceedings of the 5th International Conference on Practical Aspects of Knowledge Management (PAKM 2004), LNAI

Page 2: Automatic Generation of Taxonomies from the WWW

2

Domain Ontology Learning from the Web

• Publisher: VDM Verlag Dr. Mueller Akt.ges.&Co.KG

• Pub. Date: February 2008
• ISBN-13: 9783836470698
• 224 pp

Page 3: Automatic Generation of Taxonomies from the WWW

3

Related Paper

• Sánchez, D. and Moreno, A. "Web-scale taxonomy learning." Proceedings of the Workshop on Extending and Learning Lexical Ontologies using Machine Learning, ICML05, 2005

• Sánchez, D. and Moreno, A. "Learning non-taxonomic relationships from web documents for domain ontology construction." Data & Knowledge Engineering, 64(3), 600-623, 2007

• Sánchez, D. and Moreno, A. "Pattern-based automatic taxonomy learning from the Web." AI Communications, 21(1), 27-48, 2008

Page 4: Automatic Generation of Taxonomies from the WWW

4

Abstract

• A methodology to extract information from the Web to build a taxonomy of terms and Web resources for a given domain

• The system makes intensive use of a publicly available search engine, extracts concepts (based on their relation to the initial one and on statistical data about their appearances), selects and categorizes the most representative Web resources for each one, and represents the result in a standard way

Page 5: Automatic Generation of Taxonomies from the WWW

5

Outline

1. Introduction
2. Taxonomy building methodology
3. Ontology representation and evaluation
4. Conclusion and future work

Page 6: Automatic Generation of Taxonomies from the WWW

6

1. Introduction

• There is no standard way of representing information in order to ease access to it

• Although several search tools have been developed (e.g. search engines like Google), when the searched topic is very general or we don’t have an exhaustive knowledge of the domain (in order to set the most appropriate and restrictive search query), the evaluation of the huge amount of potential resources obtained is quite slow and tedious

Page 7: Automatic Generation of Taxonomies from the WWW

7

1. Introduction

• In this sense, a way of representing and accessing the available resources in a structured way, depending on the selected domain, would be very useful

• It is also important that this kind of representation can be obtained automatically, due to the dynamic and changing nature of the Web

Page 8: Automatic Generation of Taxonomies from the WWW

8

1. Introduction

• In this paper we present a methodology to extract information from the Web to build automatically a taxonomy of terms and Web resources for a given domain

Page 9: Automatic Generation of Taxonomies from the WWW

9

1. Introduction

• During the building process, the most representative web sites for each selected concept are retrieved and categorized

• Finally, a polysemy detection algorithm is applied in order to discover different senses of the same word and group the most related concepts

• The result is a hierarchical and categorized organization of the available resources for the given domain

Page 10: Automatic Generation of Taxonomies from the WWW

10

1. Introduction

• This hierarchy is obtained automatically and autonomously from the whole Web without any previous domain knowledge, and it represents the available resources at the moment

• These aspects distinguish this method from other similar ones [18] that:
– are applied on a representative selected corpus of documents [17]
– use external sources of semantic information [16] (like WordNet [1] or predefined ontologies)
– or require the supervision of a human expert

• A prototype has been implemented to test the proposed method

Page 11: Automatic Generation of Taxonomies from the WWW

11

1. Introduction

• The idea at the base of our proposal is the information redundancy that characterizes the Web, which allows us to detect important concepts for a domain through a statistical analysis of their appearances

Page 12: Automatic Generation of Taxonomies from the WWW

12

1. Introduction

• In addition to the advantage that a hierarchical representation of web resources provides in terms of searching for information, the obtained taxonomy is also a valuable element for building machine processable information representations like ontologies [6]

• In fact, many ontology creation methodologies like METHONTOLOGY [20] consider a taxonomy of terms for the domain as the point of departure for the ontology creation

Page 13: Automatic Generation of Taxonomies from the WWW

13

1. Introduction

• However, the manual creation of these hierarchies of terms is a difficult task that requires extensive knowledge of the domain and, in most situations, yields incomplete, inaccurate or obsolete results

• In our case, the taxonomy is created automatically and represents the state of the art for a domain (assuming that the Web contains the latest information for a certain topic)

Page 14: Automatic Generation of Taxonomies from the WWW

14

1. Introduction

• The rest of the paper is organized as follows:
– Section 2 describes the methodology developed to build the taxonomy and classify web sites
– Section 3 describes the way of representing the results and discusses the main issues related to their evaluation
– The final section contains some conclusions and proposes some lines of future work

Page 15: Automatic Generation of Taxonomies from the WWW

15

Outline

1. Introduction
2. Taxonomy building methodology
3. Ontology representation and evaluation
4. Conclusion and future work

Page 16: Automatic Generation of Taxonomies from the WWW

16

2. Taxonomy building methodology

• In this section, the methodology used to discover, select and organize representative concepts and websites for a given domain is described

• The algorithm is based on analysing a large number of web sites in order to find important concepts for a domain by studying the neighborhood of an initial keyword (we assume that words that are near to the specified keyword are closely related)

Page 17: Automatic Generation of Taxonomies from the WWW

17

2. Taxonomy building methodology

• Concretely, the word immediately before (anterior to) a keyword frequently categorizes it, whereas the word immediately after (posterior to) it represents the domain where it is applied [19]
– E.g. in "humidity sensor company", humidity is the anterior word and company the posterior one (a minimal extraction sketch is given below)

• These concepts are processed in order to select the most adequate ones by performing a statistical analysis

• The selected anterior words are finally incorporated into the taxonomy

• For each one, the websites from which it was extracted are stored and categorized (using the posterior words), and the process is repeated recursively to find new terms and build a hierarchy
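The anterior/posterior extraction can be pictured with a minimal Python sketch; this is not the authors' implementation, and the tokenizer is a simplified assumption:

```python
import re

def neighbor_words(text: str, keyword: str):
    """Yield (anterior, posterior) word pairs around each occurrence of `keyword`."""
    tokens = re.findall(r"[a-z][a-z\-]*", text.lower())
    for i, tok in enumerate(tokens):
        if tok == keyword:
            anterior = tokens[i - 1] if i > 0 else None
            posterior = tokens[i + 1] if i + 1 < len(tokens) else None
            yield anterior, posterior

# "humidity" becomes a candidate subclass modifier of "sensor",
# while "company" describes a domain where the keyword is applied.
text = "Acme is a humidity sensor company based in Tarragona."
print(list(neighbor_words(text, "sensor")))   # [('humidity', 'company')]
```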

Page 18: Automatic Generation of Taxonomies from the WWW

18

2. Taxonomy building methodology

• Finally, in order to detect the different meanings or domains to which the obtained classes belong, a word sense discovery algorithm is applied

• This process is especially interesting when working with polysemous keywords (to avoid merging results from different domains)

• The algorithm creates clusters of classes depending on the web domains of the sites from which they were selected

Page 19: Automatic Generation of Taxonomies from the WWW

19

2. Taxonomy building methodology

• The resulting taxonomy of terms eases the access to the available web resources and can be used to guide a search for information or a classification process from a document corpus [3, 4, 12, 13], or it can be the base for finding more complex relations between concepts and creating ontologies [8]

Page 20: Automatic Generation of Taxonomies from the WWW

20

2. Taxonomy building methodology

• This last point (resulting taxonomy) is especially important due to the necessity of ontologies for achieving interoperability and easing the access and interpretation of knowledge resources (e.g. Semantic Web [21], more information in [10])

Page 21: Automatic Generation of Taxonomies from the WWW

21

2.1 Term discovery, taxonomy building and web categorization algorithm

• In more detail, the algorithm's sequence for discovering representative terms and building a taxonomy of web resources (shown in Fig. 1) has the following phases (see Table 1 and Figs. 2 and 3 in order to follow the explanation example):

Page 22: Automatic Generation of Taxonomies from the WWW

22

Page 23: Automatic Generation of Taxonomies from the WWW

23

2.1 Term discovery, taxonomy building and web categorization algorithm

1. It starts with a keyword that has to be representative enough for a specific domain (e.g. sensor) and a set of parameters that constrain the search and the concept selection (described below)

Page 24: Automatic Generation of Taxonomies from the WWW

24

2.1 Term discovery, taxonomy building and web categorization algorithm

2. Then, it uses a publicly available search engine (Google) in order to obtain the most representative web sites that contain that keyword. The search constraints specified are the following:
– Maximum number of pages returned by the search engine
– Filter of similar sites

Page 25: Automatic Generation of Taxonomies from the WWW

25

Maximum number of pages returned by the search engine

• This parameter constrains the size of the search
• The larger the amount of documents evaluated, the better the results obtained, because we base the quality of the results on the redundancy of information (e.g. 1000 pages for the sensor example)


Page 26: Automatic Generation of Taxonomies from the WWW

26

Filter of similar sites

• For a general keyword (e.g. sensor), enabling this filter hides the web sites that belong to the same web domain, obtaining a set of results that covers a wider spectrum

• For a concrete word (e.g. neural network based sensor) with a smaller amount of results, disabling this filter returns the whole set of pages (even subpages of a web domain)


Page 27: Automatic Generation of Taxonomies from the WWW

27

2.1 Term discovery, taxonomy building and web categorization algorithm

3. For each web site returned, an exhaustive analysis is performed in order to obtain useful information from each one. Concretely:
– Different types of non-HTML document formats are processed (pdf, ps, doc, ppt, etc.) by obtaining the HTML version from Google's cache
– For each "Not found" or "Unable to show" page, the parser tries to obtain the web site's data from Google's cache
– Redirections are followed until the final site is found
– Frame-based sites are also considered, obtaining the complete set of texts by analysing each web subframe

Page 28: Automatic Generation of Taxonomies from the WWW

28

2.1 Term discovery, taxonomy building and web categorization algorithm

4. The parser returns the useful text from each site (rejecting tags and visual information) and tries to find the initial keyword (e.g. sensor). For each match, it analyses the word immediately before it (e.g. temperature sensor). If it fulfils a set of prerequisites, it is selected as a candidate concept. The posterior words (e.g. sensor network) are also considered. Concretely, the parser verifies the following: (Next Page)

Page 29: Automatic Generation of Taxonomies from the WWW

29

– Words must have a minimum size (e.g. 3 characters) and must be represented with a standard ASCII character set (not Japanese, for example)
– They must be "relevant words": prepositions, determiners and very common words ("stop words") are rejected
– Each word is analyzed from its morphological root (e.g. optical and optic are considered as the same word and their attribute values, described below, are merged: for example, the numbers of appearances of both words are added)
• A stemming algorithm for the English language is used for this purpose
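A minimal sketch of these prerequisites follows; the stop-word list is illustrative, and NLTK's Porter stemmer is used only as a stand-in for "a stemming algorithm for the English language":

```python
from nltk.stem import PorterStemmer   # stand-in for the English stemming algorithm

STOP_WORDS = {"the", "a", "an", "of", "for", "and", "or", "in", "on", "to", "with"}
stemmer = PorterStemmer()

def is_candidate(word: str, min_len: int = 3) -> bool:
    """Prerequisites: minimum size, plain ASCII letters, not a stop word."""
    return (len(word) >= min_len
            and word.isascii()
            and word.isalpha()
            and word.lower() not in STOP_WORDS)

def normalize(word: str) -> str:
    """Reduce a word to its morphological root so that e.g. 'optical' and 'optic' merge."""
    return stemmer.stem(word.lower())

print(is_candidate("of"), is_candidate("optical"))   # False True
print(normalize("optical") == normalize("optic"))    # True (both stem to 'optic')
```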

Page 30: Automatic Generation of Taxonomies from the WWW

30

2.1 Term discovery, taxonomy building and web categorization algorithm

5. For each candidate concept selected (some examples are contained in Table 1), a statistical analysis is performed in order to select the most representative ones. Apart from the text frequency analysis, the information obtained from the search engine (which is based on the analysis of the whole web) is also considered. Concretely, we consider the following attributes: (Next Page)

Page 31: Automatic Generation of Taxonomies from the WWW

31

Page 32: Automatic Generation of Taxonomies from the WWW

32

Page 33: Automatic Generation of Taxonomies from the WWW

33

• Total number of appearances (e.g. minimum of 5 for the first iteration of the sensor example)
– i.e. the number of times the word appears in the retrieved pages
– this represents a measure of the concept's relevance for the domain and allows eliminating very specific or unrelated ones (e.g. original)

• Number of different web sites that contain the concept at least once (e.g. minimum of 3 for the first iteration of sensor)
– this gives a measure of the word's generality for the domain (e.g. wireless is quite common, but Bosch isn't)

Page 34: Automatic Generation of Taxonomies from the WWW

34

• Estimated number of results returned by the search engine for the selected concept alone, i.e. not including sensor (e.g. maximum of 10,000,000 for the sensor example)
– this indicates the word's global generality and allows avoiding widely-used ones (e.g. level)

• Estimated number of results returned by the search engine joining the selected concept with the initial keyword (e.g. a minimum of 50 for the sensor example)
– this represents a measure of association between those two terms (e.g. "oxygen sensor" gives many results but "optimized sensor" doesn't)

Page 35: Automatic Generation of Taxonomies from the WWW

35

• Ratio between the last two measures (e.g. minimum of 0.0001 for sensor)
– This is a very important measure because it indicates the intensity of the relation between the concept and the keyword and allows detecting relevant words (e.g. "temperature sensor" is much more relevant than "optimized sensor")

P(temperature | sensor) = P(temperature, sensor) / P(sensor) = C(temperature, sensor) / C(sensor) = 2,490,000 / 84,300,000 ≈ 0.03
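The ratio is computed directly from hit counts; the tiny sketch below just reproduces the worked example above (the helper name is ours, not the paper's):

```python
def association_ratio(hits_joint: int, hits_single: int) -> float:
    """Ratio between the joint hit count (e.g. "temperature sensor") and a single-term
    hit count, used as a measure of how strongly the two terms are associated."""
    return hits_joint / hits_single if hits_single else 0.0

# Worked example above: C(temperature, sensor) / C(sensor)
print(round(association_ratio(2_490_000, 84_300_000), 2))   # 0.03
```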

Page 36: Automatic Generation of Taxonomies from the WWW

36

2.1 Term discovery, taxonomy building and web categorization algorithm

6. Only concepts (a small percentage of the candidate list) whose attributes fit a set of specified constraints (a range of values for each parameter) are selected (marked in bold in Table 1). Moreover, a relevance measure (1) of this selection is computed, based on the number of times the concept's attribute values exceed the selection constraints (Next Page)

Page 37: Automatic Generation of Taxonomies from the WWW

37

attributes

Page 38: Automatic Generation of Taxonomies from the WWW

38

Page 39: Automatic Generation of Taxonomies from the WWW

39

• This measure could be useful if an expert evaluates the taxonomy or if the hierarchy is bigger than expected (perhaps the constraints were too loose) and the less relevant concepts are trimmed
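A hedged sketch of this constraint-based selection follows. The constraint values are the ones quoted for the sensor example; the relevance score is only one plausible reading of "how many times the attribute values exceed the constraints", since formula (1) itself is not reproduced in the transcript:

```python
CONSTRAINTS = {                       # example values quoted for the sensor domain
    "appearances":       ("min", 5),
    "distinct_sites":    ("min", 3),
    "hits_alone":        ("max", 10_000_000),
    "hits_with_keyword": ("min", 50),
    "ratio":             ("min", 0.0001),
}

def satisfies(value: float, kind: str, bound: float) -> bool:
    return value >= bound if kind == "min" else value <= bound

def select(candidates: dict) -> dict:
    """Keep candidates whose attributes satisfy every constraint and score them by how
    far each attribute exceeds its threshold (a stand-in for relevance measure (1))."""
    selected = {}
    for name, attrs in candidates.items():
        if all(satisfies(attrs[a], k, b) for a, (k, b) in CONSTRAINTS.items()):
            relevance = sum(attrs[a] / b if k == "min" else b / max(attrs[a], 1)
                            for a, (k, b) in CONSTRAINTS.items())
            selected[name] = relevance
    return selected
```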

Page 40: Automatic Generation of Taxonomies from the WWW

40

2.1 Term discovery, taxonomy building and web categorization algorithm

7. For each concept extracted from a word previous to the initial one, a new keyword is constructed by joining the new concept with the initial one (e.g. "position sensor"), and the algorithm is executed again from the beginning
– This process is repeated recursively until a selected depth level is reached (e.g. 4 levels for the sensor example) or no more results are found (e.g. solid-state pressure sensor has no subclasses)
– Each new execution has its own search and selection parameter values because the searched keyword is more restrictive (constraints have to be relaxed in order to obtain a significant number of final results)
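The recursion can be summarised in a few lines; `discover_candidates` is a hypothetical helper that gathers candidate attributes for a keyword, and `select` is the filter sketched above:

```python
def build_taxonomy(keyword: str, depth: int = 0, max_depth: int = 4) -> dict:
    """Return a nested dict of subclasses, e.g. {"position": {...}, "optical": {...}}."""
    if depth >= max_depth:
        return {}
    taxonomy = {}
    for term in select(discover_candidates(keyword)):
        # The new, more restrictive keyword is the concatenation "term keyword",
        # e.g. "position sensor"; it drives the next, more specific search.
        taxonomy[term] = build_taxonomy(f"{term} {keyword}", depth + 1, max_depth)
    return taxonomy

# sensor_taxonomy = build_taxonomy("sensor")
```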

Page 41: Automatic Generation of Taxonomies from the WWW

41

2.1 Term discovery, taxonomy building and web categorization algorithm

8. The obtained result is a hierarchy that is stored as an ontology with is-a relations. If a word has different derivative forms, all of them are evaluated independently (e.g. optic, optical) but identified with an equivalence relation (see some examples in Fig. 2)
– Moreover, each class stores the concept's attributes described previously and the set of URLs from which it was selected during the analysis of the immediate superclass (e.g. the set of URLs returned by Google for the keyword sensor that contain the candidate term optical sensor)
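The paper does not describe its storage code; as an illustration only, such an is-a hierarchy with equivalence links could be serialized to OWL with rdflib (the namespace is a made-up example):

```python
from rdflib import Graph, Namespace, RDF, RDFS, OWL

EX = Namespace("http://example.org/sensor#")   # hypothetical namespace
g = Graph()

def add_subclass(child: str, parent: str):
    g.add((EX[child], RDF.type, OWL.Class))
    g.add((EX[parent], RDF.type, OWL.Class))
    g.add((EX[child], RDFS.subClassOf, EX[parent]))

add_subclass("OpticalSensor", "Sensor")
add_subclass("OpticSensor", "Sensor")
# Derivative forms (optic / optical) stay as separate classes linked by an equivalence.
g.add((EX["OpticSensor"], OWL.equivalentClass, EX["OpticalSensor"]))

print(g.serialize(format="pretty-xml"))        # OWL as RDF/XML
```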

Page 42: Automatic Generation of Taxonomies from the WWW

42

Part of Fig. 2

Page 43: Automatic Generation of Taxonomies from the WWW

43

2.1 Term discovery, taxonomy building and web categorization algorithm

9. In the same way that the "previous word analysis" returns candidate concepts that could become classes, the posterior word for the initial keyword is also considered. In this case, the selected terms will be used to describe and classify the set of URLs associated with each class

Page 44: Automatic Generation of Taxonomies from the WWW

44

For example, if we find that, for a specific URL associated with the class humidity sensor, this keyword set is followed by the word company (and this term has been selected as a candidate concept during the statistical analysis), the URL will be categorized with this word (which represents a domain of application). This information can be useful when the user browses the set of URLs of a class, because it gives an idea about the context where the class is applied (see some examples in Fig. 3: humidity sensor company, magnetic sensor prototype, temperature sensor applications…) (Next Page)

Page 45: Automatic Generation of Taxonomies from the WWW

45

Part of Fig. 3

Page 46: Automatic Generation of Taxonomies from the WWW

46

2.1 Term discovery, taxonomy building and web categorization algorithm

10. Finally, a refinement process is performed in order to obtain a more compact taxonomy and avoid redundancy. In this process, classes and subclasses that have the same set of associated URLs are merged, because we consider that they are closely related: in the search process, the two concepts have always appeared together (Next Page)

Page 47: Automatic Generation of Taxonomies from the WWW

47

• For example, the hierarchy "wireless -> scale -> large" will result in "wireless -> large_scale" (automatically discovering a multiword term [16]), because the last two subclasses have the same web sets. Moreover, the list of web sites for each class is processed in order to avoid redundancies: if a URL is stored in one of its subclasses, it is deleted from the superclass set (a rough sketch of this step is given below)
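A rough sketch of the refinement step, under simplifying assumptions (taxonomy nodes are plain objects holding a name, a URL set and children; only single-child chains are merged):

```python
class Node:
    def __init__(self, name, urls=None, children=None):
        self.name = name
        self.urls = set(urls or [])
        self.children = list(children or [])

def merge_chains(node: Node) -> Node:
    """Merge a class with its only subclass when both have the same URL set,
    producing a multiword term (e.g. scale -> large becomes large_scale)."""
    node.children = [merge_chains(c) for c in node.children]
    if len(node.children) == 1 and node.children[0].urls == node.urls:
        child = node.children[0]
        node = Node(f"{child.name}_{node.name}", node.urls, child.children)
    return node

def strip_duplicates(node: Node) -> Node:
    """Remove from a superclass the URLs already stored in one of its subclasses."""
    for child in node.children:
        strip_duplicates(child)
        node.urls -= child.urls
    return node

# "wireless -> scale -> large" with identical URL sets becomes "wireless -> large_scale"
wireless = Node("wireless", {"u1", "u2"},
                [Node("scale", {"u1"}, [Node("large", {"u1"})])])
wireless = strip_duplicates(merge_chains(wireless))
print([c.name for c in wireless.children], wireless.urls)   # ['large_scale'] {'u2'}
```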

Page 48: Automatic Generation of Taxonomies from the WWW

48

2.1 Term discovery, taxonomy building and web categorization algorithm

• An example of the resulting taxonomy of terms obtained by this method, with the set of parameters described during the algorithm explanation and for the sensor domain, is shown in Fig. 2

• Several examples of candidate concepts for the first level of search, together with their attribute values, are shown in Table 1

• Moreover, for the most significant classes discovered, examples of the Web resources obtained and categorized are shown in Fig. 3 (note that several non-HTML file types are retrieved)

Page 49: Automatic Generation of Taxonomies from the WWW

49

Fig. 2

Page 50: Automatic Generation of Taxonomies from the WWW

50

attributes

Page 51: Automatic Generation of Taxonomies from the WWW

51

Page 52: Automatic Generation of Taxonomies from the WWW

52

Fig. 3

Page 53: Automatic Generation of Taxonomies from the WWW

53

2.2 Polysemy detection and semantic clustering

• One of the main problems when analyzing natural language resources is polysemous words

• In our case, for example, if the primary keyword has more than one sense (e.g. virus can refer to "malicious computer programs" or to "infectious biological agents"), the resulting taxonomy could contain concepts from different domains (one for each meaning) completely merged (e.g. "computer virus" and "immunodeficiency virus")

Page 54: Automatic Generation of Taxonomies from the WWW

54

2.2 Polysemy detection and semantic clustering

• Although these concepts have been selected correctly, it would be interesting for the branches of the resulting taxonomic tree to be grouped if they pertain to the same domain, corresponding to a concrete sense of the immediate "father" concept

• With this representation, the user would be able to consult the hierarchy of terms that belongs to the desired sense of the initial keyword

Page 55: Automatic Generation of Taxonomies from the WWW

55

2.2 Polysemy detection and semantic clustering

• Performing this classification without any previous knowledge (for example, the list of meanings, synonyms for each sense, or a thesaurus like WordNet [1]) is not a trivial process [16, 17]

• However, we can take advantage of the context from which each concept has been extracted, concretely, the documents (URLs) that contain it

Page 56: Automatic Generation of Taxonomies from the WWW

56

2.2 Polysemy detection and semantic clustering

• We can assume that each website that talks about a given keyword is using it in a concrete sense, so all candidate concepts that are selected from the analysis of a single document pertain to the domain associated with a concrete meaning of the keyword

Page 57: Automatic Generation of Taxonomies from the WWW

57

2.2 Polysemy detection and semantic clustering

• Applying this idea over a large amount of documents we can find, as shown in Fig. 4 for the virus example, quite consistent semantic relations between the candidate concepts

Fig. 4

Page 58: Automatic Generation of Taxonomies from the WWW

58

2.2 Polysemy detection and semantic clustering

• So, if a word has N meanings (or it is used on N different domains with different senses), the resulting taxonomy of terms of this concept will be grouped in a similar number of sets, each one containing the elements that belong to a particular domain

• The classification process is performed without any previous semantic knowledge

• In more detail, the algorithm performs the following actions: (Next Page)

Page 59: Automatic Generation of Taxonomies from the WWW

59

2.2 Polysemy detection and semantic clustering

1. It starts from the taxonomy obtained with the described methodology. Concretely, all concepts are organized in an is-a ontology and each term stores the set of web sites from which it has been obtained and the whole list of URLs returned by Google

Page 60: Automatic Generation of Taxonomies from the WWW

60

2.2 Polysemy detection and semantic clustering

2. For a given term of the taxonomy (for example the initial keyword: virus) and a concrete level of depth (for example the first one), a classification process is performed by joining the terms which belong to each keyword sense. This process is performed by a SAHN (Sequential Agglomerative Hierarchical Non-overlapping) clustering algorithm that joins the most similar terms, using as a similitude measure the number of coincidences between their URL sets: (Next Page)

Page 61: Automatic Generation of Taxonomies from the WWW

61

• For each term of the same depth level (e.g. anti, latest, simplex, linux, computer, influenza, online, immunodeficiency, leukaemia, nile, email) that depends directly on the selected word (virus), a set of URLs is constructed by joining the stored websites associated with the class and all the sets of its descendant classes (without repetitions)

Page 62: Automatic Generation of Taxonomies from the WWW

62

• Each term is compared to each other term, and a similitude measure is obtained by comparing their URL sets (2). Concretely, the measure represents the maximum number of coincidences between the sets (normalized as a percentage of the total set). So, the higher it is, the more similar the terms are (because they are frequently used together in the same context)

Page 63: Automatic Generation of Taxonomies from the WWW

63

• With these measures, a similitude matrix between all terms is constructed. The most similar terms (in the example, anti and computer) are selected and joined (they belong to the same keyword sense)

Page 64: Automatic Generation of Taxonomies from the WWW

64

• The joining process is performed by creating a new class with those terms and removing them individually from the initial taxonomy. Their URL sets are joined but not merged (each term keeps its URL set independently)

Page 65: Automatic Generation of Taxonomies from the WWW

65

• For this new class, the similitude measure to the remaining terms is computed. In this case, the measure of coincidence will be the minimum number of coincidences between the URL set of the individual term and the URL set of each term of the group that forms the new class (3)

Page 66: Automatic Generation of Taxonomies from the WWW

66

• With this method we are able to detect the most subtle senses of the initial keyword (or at least different domains where it is used). Other measures, such as taking the maximum number of coincidences or the mean, have been tested, obtaining worse results (they tend to join all the classes, making the differentiation of senses difficult)

Page 67: Automatic Generation of Taxonomies from the WWW

67

• The similitude matrix is updated with these values and the new most similar terms/classes are joined (building a dendrogram like the one shown in Fig. 4). The process is repeated until no more elements remain unjoined or the similitude between them is 0 (there are no coincidences between the URL sets)
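A compact sketch of this agglomerative clustering over URL sets follows. It keeps the slides' group linkage (the minimum number of coincidences over a group's members) but omits the percentage normalization, and the toy URL sets are purely illustrative:

```python
from itertools import combinations

def coincidences(a: set, b: set) -> int:
    return len(a & b)

def cluster_senses(url_sets: dict) -> list:
    clusters = [[term] for term in url_sets]             # start with singletons
    while len(clusters) > 1:
        def linkage(c1, c2):                             # minimum coincidences over members
            return min(coincidences(url_sets[t1], url_sets[t2])
                       for t1 in c1 for t2 in c2)
        (best, other), score = max(
            ((pair, linkage(*pair)) for pair in combinations(clusters, 2)),
            key=lambda x: x[1])
        if score == 0:                                   # no shared URLs left: stop joining
            break
        clusters.remove(best)
        clusters.remove(other)
        clusters.append(best + other)
    return clusters

# virus example: terms sharing pages end up in the two senses of "virus"
sets = {"computer": {"u1", "u2"}, "anti": {"u1", "u3"},
        "influenza": {"u4", "u5"}, "immunodeficiency": {"u4"}}
print(cluster_senses(sets))   # [['computer', 'anti'], ['influenza', 'immunodeficiency']]
```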

Page 68: Automatic Generation of Taxonomies from the WWW

68

2.2 Polysemy detection and semantic clustering

3. The result is a set of classes (with 2 elements for the virus example, their number being discovered automatically) that groups the terms belonging to a specific meaning. The dendrogram with the joined classes and similitude values can also be consulted by the user of the system

Page 69: Automatic Generation of Taxonomies from the WWW

69

Outline

1. Introduction
2. Taxonomy building methodology
3. Ontology representation and evaluation
4. Conclusion and future work

Page 70: Automatic Generation of Taxonomies from the WWW

70

3. Ontology representation and evaluation

• The final hierarchy of terms is stored in a standard representation language: OWL [14]. The Web Ontology Language is a semantic markup language for publishing and sharing ontologies on the World Wide Web

• It is designed for use by applications that need to process the content of information and facilitates greater machine interpretability by providing additional vocabulary along with a formal semantics [21]

• OWL is supported by many ontology visualizers and editors, like Protégé 2.1 [15], allowing the user to explore, understand, analyze or even modify the ontology easily

Page 71: Automatic Generation of Taxonomies from the WWW

71

3. Ontology representation and evaluation

• In order to evaluate the correctness of the results, a set of formal tests has been performed

• Concretely, Protégé provides a set of ontological tests for detecting inconsistencies between classes, subclasses and properties of an OWL ontology from a logical point of view

• However, the amount of different documents evaluated for obtaining the result and, in general, the variety and quantity of resources available on the web make any kind of automated evaluation extremely difficult

Page 72: Automatic Generation of Taxonomies from the WWW

72

3. Ontology representation and evaluation

• So, the test of correctness from a semantic point of view can only be made by comparing the results with other existing semantic studies (for example, using other well-known methodologies) or through an analysis performed by a domain expert

Page 73: Automatic Generation of Taxonomies from the WWW

73

3. Ontology representation and evaluation

• Anyway, several points could be improved in the current method (from the example of Fig. 2), such as:
– the detection of common meaningless words (like based)
– the detection of multiword terms (low -> energy) or equivalences between classes (possible acronyms, relations at different levels for the same class, e.g. airborne sensor and airborne digital sensor, or very common sub-hierarchies like optic)

Page 74: Automatic Generation of Taxonomies from the WWW

74

Outline

1. Introduction
2. Taxonomy building methodology
3. Ontology representation and evaluation
4. Conclusion and future work

Page 75: Automatic Generation of Taxonomies from the WWW

75

4. Conclusion and future work

• Some researchers have been working on knowledge mining and ontology learning from different kinds of structured information sources (like databases, knowledge bases or dictionaries [7])

• However, taking into consideration the amount of resources easily available on the Internet, we believe that knowledge extraction from unstructured documents like web pages is an important line of research

Page 76: Automatic Generation of Taxonomies from the WWW

76

4. Conclusion and future work

• In this sense, many authors [2, 5, 8, 10, 11] are putting effort into processing natural language texts for creating or extending structured representations like ontologies

• In this field, term taxonomies are the point of departure for many ontology creation methodologies [20]

Page 77: Automatic Generation of Taxonomies from the WWW

77

4. Conclusion and future work

• The discovery of these hierarchies of concepts based on word association has a very important precedent in [19], which proposes a way to interpret the relation between consecutive words (multiword terms)

• Several authors [11, 16, 17] have applied a similar idea for extracting hierarchies of terms from a document corpus to create or extend ontologies

Page 78: Automatic Generation of Taxonomies from the WWW

78

4. Conclusion and future work

• However, the classical approach of these methods is the analysis of a relevant corpus of documents for a domain. In some cases, a semantic repository (WordNet [1]), from which word meanings can be extracted and linguistic analysis performed, or an existing representative ontology is used

Page 79: Automatic Generation of Taxonomies from the WWW

79

4. Conclusion and future work

• On the contrary, our methodology does not start from any kind of predefined knowledge of the domain, and it only uses publicly available web search engines for building relevant taxonomies from scratch

• Moreover, searching the whole web adds some very important problems compared with relevant-corpus analysis

Page 80: Automatic Generation of Taxonomies from the WWW

80

4. Conclusion and future work

• On the one hand, the heterogeneity of the information resources makes the extraction of important data difficult; in this case, we base our methodology on the high degree of information redundancy and a statistical analysis of the relevance of candidate terms for the domain

Page 81: Automatic Generation of Taxonomies from the WWW

81

4. Conclusion and future work

• Note also that most of these statistical measures are obtained directly from the web search engine, a fact that greatly speeds up the analysis and gives more representative results (they are based on whole-web statistics, not on the analyzed subset) than a classical statistical approach based only on the texts

Page 82: Automatic Generation of Taxonomies from the WWW

82

4. Conclusion and future work

• On the other hand, polysemy becomes a serious problem when the resources retrieved from the search engine are selected only on the basis of the keyword's presence (some authors [16, 17] have detected this problem even in corpus analysis); in this case, we propose an automatic approach for polysemy disambiguation based on the clustering of the selected classes according to the similarities between their information sources (web pages and web domains), without any kind of semantic knowledge

Page 83: Automatic Generation of Taxonomies from the WWW

83

4. Conclusion and future work

• The final taxonomy obtained with our method truly represents the state of the art on the WWW for a given concept, and the hierarchically structured, domain-categorized list of the most representative web sites for each class is a great help for finding and accessing the desired web resources

Page 84: Automatic Generation of Taxonomies from the WWW

84

4. Conclusion and future work

• To ease the definition of the search and selection parameters, an automatic pre-analysis can be performed from the initial keyword to estimate the most adequate values
– For example, the number of results for a concept gives a measure of its generality, which indicates the need for more restrictive or more relaxed constraints

Page 85: Automatic Generation of Taxonomies from the WWW

85

4. Conclusion and future work

• Several executions from the same initial keyword at different times can give us different taxonomies. A study of the changes can tell us how a domain evolves (e.g. a new kind of sensor appears)

Page 86: Automatic Generation of Taxonomies from the WWW

86

4. Conclusion and future work

• For each class, an extended analysis of the relevant web sites could be performed to find possible attributes and values that describe important characteristics (e.g. the price of a sensor), or closely related words (like a topic signature [9])

Page 87: Automatic Generation of Taxonomies from the WWW

87

4. Conclusion and future work

• The same methodology applied to discover subtypes of the initial keyword can be useful for finding concrete instances of classes (e.g. manufacturer names of a specific sensor type), using some simple rules like the presence of capital letters [19]

Page 88: Automatic Generation of Taxonomies from the WWW

88

4. Conclusion and future work

• The described methodology is useful when working in easily categorized domains (where concatenated adjectives can be found)

• However, in other cases a more exhaustive analysis has to be performed (like finding the verb of a sentence or detecting the predicate or the subject). In this way, more complex semantic relations could be found and ontological structures could be constructed

Page 89: Automatic Generation of Taxonomies from the WWW

Presented by Jian-Shiun Tzeng 3/26/2009

Web-scale taxonomy learning

David Sánchez, Antonio Moreno

Department of Computer Science and Mathematics, Universitat Rovira i Virgili (URV), Avda. Països Catalans, 26, 43007 Tarragona, Spain

Proceedings of Workshop on Extending and Learning Lexical Ontologies using Machine Learning, ICML05, 2005

Page 90: Automatic Generation of Taxonomies from the WWW

90

Abstract

• In this paper, we propose an automatic and unsupervised methodology to obtain taxonomies of terms from the Web and represent retrieved web sites into a meaningful organization for a desired domain without previous knowledge

Page 91: Automatic Generation of Taxonomies from the WWW

91

Abstract

• It is based on the intensive use of web search engines to retrieve domain-suitable resources from which to extract knowledge, and to obtain web-scale statistics from which to infer knowledge relevancy

• Results can be useful for:
– easing the access to web resources
– or as the first step for constructing ontologies suitable for the Semantic Web

Page 92: Automatic Generation of Taxonomies from the WWW

92

Outline

1. Introduction
2. Taxonomy building methodology
3. Representation and evaluation of the results
4. Related work, conclusions and future research

Page 93: Automatic Generation of Taxonomies from the WWW

93

1. Introduction

• In this paper we present an unsupervised and automatic methodology to build taxonomies of terms and associated web resources from the Web for a given domain without any previous knowledge

Page 94: Automatic Generation of Taxonomies from the WWW

94

Outline

1. Introduction
2. Taxonomy building methodology
3. Representation and evaluation of the results
4. Related work, conclusions and future research

Page 95: Automatic Generation of Taxonomies from the WWW

95

2. Taxonomy building methodology

• In this section, the methodology used to discover, select and organize in a taxonomy the representative concepts, instances and websites for a given domain is described

Page 96: Automatic Generation of Taxonomies from the WWW

96

2. Taxonomy building methodology

• The algorithm is based on analysing a considerable number of web sites (but not the full set) to find candidate concepts for a domain defined by an initial keyword

• The detection is performed by using linguistic patterns involving that keyword

• At this point we centre the analysis on the neighborhood of the initial keyword when it appears in noun phrases

Page 97: Automatic Generation of Taxonomies from the WWW

97

2. Taxonomy building methodology

• In the English language, the word immediately before a keyword frequently classifies it, whereas the word immediately after it represents the domain where it is applied (Grefenstette, 1997)
– e.g. lung cancer, where lung is the anterior word for the keyword cancer

• However, this analysis can be extended by considering other patterns for hyponymy detection, like Hearst's (1992) ones (a pattern sketch is given below)
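Two of the classic Hearst (1992) patterns ("X such as Y" and "Y and other X") can be matched with simple regular expressions; the sketch below is a simplified illustration, not the system's pattern set:

```python
import re

# "NP such as NP(, NP)* (and|or) NP" and "NP(, NP)* (and|or) other NP";
# only bare lowercase noun phrases are handled in this sketch.
SUCH_AS = re.compile(r"(\w[\w ]*?)\s+such as\s+([\w ,]+?)(?:\.|$)")
OTHER   = re.compile(r"([\w ,]+?)\s+(?:and|or) other\s+(\w[\w ]*)")

def hyponyms(sentence: str):
    pairs = []
    m = SUCH_AS.search(sentence)
    if m:
        hypernym, rest = m.group(1), m.group(2)
        pairs += [(term.strip(), hypernym.strip())
                  for term in re.split(r",| and | or ", rest) if term.strip()]
    m = OTHER.search(sentence)
    if m:
        rest, hypernym = m.group(1), m.group(2)
        pairs += [(term.strip(), hypernym.strip())
                  for term in re.split(r",| and | or ", rest) if term.strip()]
    return pairs

print(hyponyms("cancers such as lung cancer, breast cancer and melanoma."))
# [('lung cancer', 'cancers'), ('breast cancer', 'cancers'), ('melanoma', 'cancers')]
```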

Page 98: Automatic Generation of Taxonomies from the WWW

98

2. Taxonomy building methodology

• The system relies on a search engine as the particular expert to search and access the web resources from which to extract knowledge

• It dynamically constructs the appropriate search queries for obtaining the most appropriate corpus of web resources at each step

• So, the more knowledge extracted for a domain, the more specific the evaluated corpus of web documents will be

Page 99: Automatic Generation of Taxonomies from the WWW

99

2. Taxonomy building methodology

• Moreover, the search engine is also used to obtain web scale statistics from the number of hits associated to specially formulated queries

• From these statistics we check the relevancy of the extracted terms and evaluate the strength of the relationships between them

Page 100: Automatic Generation of Taxonomies from the WWW

100

2. Taxonomy building methodology

• However, one problem may arise due to the use of keyword based search engines:– the search will be centered on the initial keyword

selected as the representative for the domain of knowledge but, if several ways of expressing the same concept exist (e.g. synonyms, lexicalizations or even different morphological forms) a considerable amount of potentially interesting web resources will be omitted

Page 101: Automatic Generation of Taxonomies from the WWW

101

2. Taxonomy building methodology

• In order to solve that problem, we intend to use an automatic methodology previously developed (Sanchez & Moreno, 2005) for discovering those alternative keywords from an initial one, using the same initial taxonomy obtained through the method explained in this paper and the Web environment

• Incorporating this step into the present methodology and performing iterative executions of the learning algorithm, we will potentially improve the final recall of our system while maintaining the unsupervised and uninformed operation

Page 102: Automatic Generation of Taxonomies from the WWW

102

Page 103: Automatic Generation of Taxonomies from the WWW

103

2.1 Concept discovery, taxonomy building and web categorization algorithm

• In more detail, the algorithm for discovering representative concepts and building a taxonomy of web resources (shown in Figure 1) has the following phases (see Figures 2 and 3 to follow an example for the Cancer domain referred to during the explanation):

Page 104: Automatic Generation of Taxonomies from the WWW

104

Figure 2.

Page 105: Automatic Generation of Taxonomies from the WWW

105

Figure 3.

Page 106: Automatic Generation of Taxonomies from the WWW

106

2.1 Concept discovery, taxonomy building and web categorization algorithm

1. It starts with a keyword representative enough for a domain (e.g. cancer). Thanks to the generic implementation of the learning algorithm, this initial keyword can be composed of several words if desired (e.g. breast cancer). Then, it uses a web search engine to obtain the most representative web sites that contain that keyword (e.g. 100 pages for the Cancer domain)

Page 107: Automatic Generation of Taxonomies from the WWW

107

• A priori, the more web sites are evaluated, the more complete the results will be
– However, the best and most suitable web resources are presented first by the search engine and, due to the amount of redundancy, once a certain percentage of the full set has been evaluated, obtaining new valid terms becomes quite difficult
– So, evaluating just a reduced fraction of the full set can give us representative and accurate results

Page 108: Automatic Generation of Taxonomies from the WWW

108

2.1 Concept discovery, taxonomy building and web categorization algorithm

2. For each site returned, an exhaustive analysis is performed. A web parser obtains the useful text and the system tries to find the initial keyword (e.g. cancer) within the retrieved text. Then, a syntactical analysis based on an annotation engine for the English language (OpenNLP) is performed over the keyword's neighborhood to detect tokens and grammatical categories

Page 109: Automatic Generation of Taxonomies from the WWW

109

• The immediate anterior word (e.g. prostate cancer) is selected as a candidate concept. The posterior words (e.g. cancer treatment) are also retrieved
– Concretely, the system verifies that the words are nouns or adjectives, as they are modifiers of the keyword's meaning
– Each word is analyzed from its morphological root using a stemming algorithm
– So, the different morphological forms of the same term will be considered as the same concept and their statistics will be approximated to the most common morphological form
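The paper uses OpenNLP for this step; the sketch below uses NLTK only as a convenient stand-in to illustrate keeping noun/adjective neighbours of the keyword:

```python
import nltk   # stand-in for the OpenNLP annotation engine used in the paper
# One-off setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def modifier_candidates(sentence: str, keyword: str):
    """Return (anterior, posterior) neighbours of `keyword` that are nouns or adjectives."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence.lower()))
    keep = lambda tag: tag.startswith("NN") or tag.startswith("JJ")
    pairs = []
    for i, (tok, _) in enumerate(tagged):
        if tok == keyword:
            ant = tagged[i - 1] if i > 0 else None
            post = tagged[i + 1] if i + 1 < len(tagged) else None
            pairs.append((ant[0] if ant and keep(ant[1]) else None,
                          post[0] if post and keep(post[1]) else None))
    return pairs

print(modifier_candidates("Prostate cancer treatment options improve.", "cancer"))
# expected: [('prostate', 'treatment')]
```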

Page 110: Automatic Generation of Taxonomies from the WWW

110

2.1 Concept discovery, taxonomy building and web categorization algorithm

3. For each candidate, a selection algorithm is performed to decide if it will be considered as a subclass candidate of the initial one (a type of…), or as an instance candidate of the immediate superclass (an example of…), or even both at the same time if the same word is used for expressing concepts at different levels of concreteness. The method followed is described in section 2.2

Page 111: Automatic Generation of Taxonomies from the WWW

111

2.1 Concept discovery, taxonomy building and web categorization algorithm

4. Among the candidate concepts considered as subclass candidates, a statistical analysis is performed for selecting the most representative ones
– In order to obtain relevant measures, in addition to the frequency analysis performed over the reduced set of analyzed web sites, the information obtained from the search engine, which is based on the analysis of the whole Web (web-scale statistics (Etzioni et al., 2004)), is also evaluated

Page 112: Automatic Generation of Taxonomies from the WWW

112

– This allows us to obtain representative measures without analyzing a large amount of documents (good scalability)

– As we want to perform the selection process automatically, a set of selection constraints will be computed at this point as a function of the domain's generality and the size of the search. Concretely:

Page 113: Automatic Generation of Taxonomies from the WWW

113

1) From the analyzed set of web sites, we extract the total number of appearances of each candidate and the number of different web sites in which it has appeared. Those measures give us a local idea of which candidates are potentially the most used in the domain (e.g. breast cancer is very relevant but common cancer isn't)
• For each value, a minimum number of hits will be considered in order to detect the most representative terms. As this number depends on the size of the search, the minimum threshold will also depend on it

Page 114: Automatic Generation of Taxonomies from the WWW

114

• However, the number of appearances of a specific concept (especially the most relevant ones) does not grow linearly with the amount of websites, because web resources are sorted by the search engine and, consequently, the most important ones are presented first

• Consequently, setting the threshold as a logarithmic function of the size of the search is a good choice (1)

Page 115: Automatic Generation of Taxonomies from the WWW

115

2.1 Concept discovery, taxonomy building and web categorization algorithm

2) From the search engine, we compute the number of hits obtained by the concatenation of the candidate and the initial keyword, the total hits obtained by the candidate itself, and a ratio between those measures (2)
• The first two measures present mandatory constraints defined a priori for avoiding very common words (e.g. world cancer) and misspelled ones (e.g. todaybreast cancer)

Page 116: Automatic Generation of Taxonomies from the WWW

116

• The ratio value is especially important as it gives us a measure of the strength of the relationship between the candidate and the initial keyword against the whole Web (e.g. "cervical cancer" is a much more suitable relationship than "premier cancer")

• As that ratio is an absolute measure, it is fixed with a value that allows configuring the behavior of the system. This last statistical measure has been previously employed by several authors (Turney, 2001) for computing, in an efficient and scalable way, the degree of relationship between terms

Page 117: Automatic Generation of Taxonomies from the WWW

117

2.1 Concept discovery, taxonomy building and web categorization algorithm

5. For each candidate that fulfils the two mandatory restrictions, a relevancy measure (3) is computed
– That measure takes into consideration the number of times that the concept's statistical values exceed the minimum values specified
– We consider the total number of appearances, the number of different web sites and the web-scale ratio
– The first two values express a measure of the concept's generality and the last one indicates the strength of the relation with the specific domain

Page 118: Automatic Generation of Taxonomies from the WWW

118

2.1 Concept discovery, taxonomy building and web categorization algorithm

6. At this point, the selection process is performed by first selecting the most confident candidates and then computing a selection threshold for the remaining ones as a function of the selected candidates' relevancy
– The idea is to use the statistical values of the most representative obtained terms to adapt the selection heuristic to the specific domain, independently of the search size or the domain generality
– This allows us to perform the selection process in a completely automatic, unsupervised and domain-independent way

Page 119: Automatic Generation of Taxonomies from the WWW

119

– More concretely, those candidates whose full set of statistics fits the whole set of specified constraints are selected in the first place
– Next, the lowest relevancy measure of those high-confidence selected concepts is set as the selection threshold
– Then, the remaining concepts that surpass the selection threshold are also selected (a sketch of this two-phase selection is given below)
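A few lines are enough to sketch this self-adapting selection; `meets_all_constraints` and `relevance` are assumed helpers standing in for the constraint check and for relevancy measure (3):

```python
def adaptive_select(candidates, meets_all_constraints, relevance):
    """Accept high-confidence candidates first, then reuse their lowest relevancy
    as the threshold for accepting the remaining candidates."""
    confident = [c for c in candidates if meets_all_constraints(c)]
    if not confident:
        return []
    threshold = min(relevance(c) for c in confident)
    extra = [c for c in candidates
             if c not in confident and relevance(c) > threshold]
    return confident + extra
```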

Page 120: Automatic Generation of Taxonomies from the WWW

120

2.1 Concept discovery, taxonomy building and web categorization algorithm

7. For each selected concept, a new and more complex keyword is constructed by joining this concept with the initial keyword (e.g. "breast cancer"), and the algorithm is executed again
– So, at each iteration, a new and more specific set of relevant documents for the concrete subdomain is retrieved
– This process is repeated recursively until a selected depth level is achieved or no more results are found

Page 121: Automatic Generation of Taxonomies from the WWW

121

2.1 Concept discovery, taxonomy building and web categorization algorithm

8. The result is a hierarchy that is stored as an ontology with is-a relations (as shown in Figure 3)
– If a concept has different derivative forms (e.g. hormone and hormonal), all of them are represented and identified with an equivalence relation
– Each class and instance stores the URLs from which it was selected
– This can be valuable information for improving the access to web sites and for the Semantic Web

Page 122: Automatic Generation of Taxonomies from the WWW

122

Figure 3.

Page 123: Automatic Generation of Taxonomies from the WWW

123

2.1 Concept discovery, taxonomy building and web categorization algorithm

9. The posterior word for the initial keyword is also considered
– In this case, the selected concepts are used to categorize the set of URLs associated with each class
– For example, if we find that, for a URL associated with the class breast cancer, this keyword is followed by the word foundation, the URL will be categorized with this word, which represents a domain of application (some examples in Figure 2)

Page 124: Automatic Generation of Taxonomies from the WWW

124

Page 125: Automatic Generation of Taxonomies from the WWW

125

2.1 Concept discovery, taxonomy building and web categorization algorithm

10. When dealing with polysemic domains (e.g. virus), a semantic disambiguation algorithm is performed in order to group the obtained subclasses according to the specific word sense (e.g. biological agent or computer program)
– The methodology performs a clustering of those classes, using as a similarity measure the amount of coincidences between their URLs
– More details in Sanchez & Moreno (2004b)

Page 126: Automatic Generation of Taxonomies from the WWW

126

2.1 Concept discovery, taxonomy building and web categorization algorithm

11. Finally, a refinement is performed to obtain a more compact taxonomy
– In this process, classes that share a high proportion of their URLs (e.g. above 75%) are merged, because we consider them to be closely related
– For example, the hierarchy "cancer -> cell -> germ -> refractory -> cisplatin -> nonseminomatous" will result in "cancer -> cell -> germ -> refractory -> nonseminomatous_cisplatin", discovering a multiword term (Voosen, 2001)

Page 127: Automatic Generation of Taxonomies from the WWW

127

2.2 Named entity discovery

• One of the hardest problems of the knowledge acquisition process is to decide when a term has to be considered as a subclass (which can become the superclass of new classes) or as an instance (which represents a leaf of the hierarchy)

• Even for a knowledge engineer this can be a challenging issue

• A simplified approach is considering named entities as instances, as they represent individualizations of concepts

Page 128: Automatic Generation of Taxonomies from the WWW

128

2.2 Named entity discovery

• Many named entity recognizers traditionally rely on lists of name examples (Mikheev, Moens & Grover, 1999)
– The lists are compiled by humans, or assembled from authoritative sources
• It is also possible to build recognizers that identify names automatically in text (Stevenson & Gaizauskas, 2000)
– Such approaches usually attempt to learn general categories such as organizations or persons rather than refined categories
– Even when considering fine-grained categories (Fleischman & Hovy, 2002), they use a closed, pre-specified set of categories of interest

Page 129: Automatic Generation of Taxonomies from the WWW

129

2.2 Named entity discovery

• Other authors (Lamparter, Ehrig & Tempich, 2004) use a thesaurus like WordNet for performing this detection:
– if the word is not found in the dictionary, it is assumed to be a named entity
– However, sometimes a named entity can be expressed with normal words, so the use of a thesaurus is not enough to perform the differentiation

Page 130: Automatic Generation of Taxonomies from the WWW

130

2.2 Named entity discovery

• Other approaches take into consideration the way in which named entities are presented
– Concretely, a language such as English distinguishes proper names from normal words through capitalization
– This simple but effective idea, combined with a certain degree of linguistic analysis, has been applied by several authors (Cimiano & Staab, 2004), obtaining good results

Page 131: Automatic Generation of Taxonomies from the WWW

131

2.2 Named entity discovery

– That approach, combined with a statistical analysis of candidates for named entities, is the base on which our method has been designed
• However, as most search engines do not differentiate between lower- and upper-case letters, we cannot obtain statistics directly (e.g. the ratio between hits of a word in its lower-case form and in its upper-case form) and, in consequence, some level of analysis has to be performed

Page 132: Automatic Generation of Taxonomies from the WWW

132

2.2 Named entity discovery

• More in detail, during the taxonomy building process, if the candidate concept is represented with at least one capital letter, it will be marked as an instance candidate; otherwise, it will be considered as a subclass candidate

• Note that a concept can be considered of both types at the same time if it has been found represented with the two forms

Page 133: Automatic Generation of Taxonomies from the WWW

133

2.2 Named entity discovery

• For each instance candidate, the first web sites returned by the search engine are processed to count the number of times that it appears with upper- and lower-case letters
– Those which are presented with capital letters in most cases (e.g. 85% for high confidence) will be selected as instances (see examples in Fig. 2b and Table 1; details on how the full name of the named entity is discovered are included in section 3). A small sketch of this test is given below
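A small sketch of the capitalization test, with illustrative snippets (the 85% threshold is the value quoted above; the helper is not the paper's code):

```python
import re

def is_named_entity(candidate: str, snippets: list, threshold: float = 0.85) -> bool:
    """Promote a candidate to instance (named entity) if it appears capitalized in
    at least `threshold` of its occurrences within the retrieved snippets."""
    pattern = re.compile(rf"\b{re.escape(candidate)}\b", re.IGNORECASE)
    occurrences = [m.group(0) for text in snippets for m in pattern.finditer(text)]
    if not occurrences:
        return False
    capitalized = sum(1 for occ in occurrences if occ[0].isupper())
    return capitalized / len(occurrences) >= threshold

snippets = ["The Mayo Clinic treats breast cancer.", "Visit Mayo for cancer care."]
print(is_named_entity("mayo", snippets))    # True: always capitalized in the snippets
print(is_named_entity("cancer", snippets))  # False: always lower case
```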

Page 134: Automatic Generation of Taxonomies from the WWW

134

Page 135: Automatic Generation of Taxonomies from the WWW

135

Outline

1. Introduction
2. Taxonomy building methodology
3. Representation and evaluation of the results
4. Related work, conclusions and future research

Page 136: Automatic Generation of Taxonomies from the WWW

136

3. Representation and evaluation of the results

• The final hierarchy is stored in a standard representation language: OWL. The Web Ontology Language is a semantic markup language for publishing and sharing ontologies on the World Wide Web (Berners-Lee, Hendler & Lassila, 2001)
– It is supported by many ontology visualisers and editors

Page 137: Automatic Generation of Taxonomies from the WWW

137

3. Representation and evaluation of the results

• Concerning the evaluation of automatically acquired knowledge, it is recognized to be an open problem (OntoWeb, 2002)

• Regarding our proposal, the evaluation is a task that, at this moment, is a matter of current research

Page 138: Automatic Generation of Taxonomies from the WWW

138

3. Representation and evaluation of the results

• However, some evaluations have been performed manually and others are planned to be performed automatically, comparing the results against other methodologies or available semantic repositories such as WordNet
– However, this fact does not affect the domain-independent (no predefined knowledge) basic premise
– The idea is to be able to demonstrate the quality and suitability of the system for obtaining results for well-known domains and to establish a base of trustworthiness for the results obtained for any other possible domain

Page 139: Automatic Generation of Taxonomies from the WWW

139

3. Representation and evaluation of the results

• The evaluation of the obtained taxonomies of concepts is performed manually at this moment

– Whenever possible, a representative human-made classification is taken as the ideal taxonomy to achieve (gold standard)

– Then, several tests for that domain are performed with different search sizes

– For each one, we apply the standard recall and precision measures (see an example of evaluation for the Cancer domain in Figure 4)

Page 140: Automatic Generation of Taxonomies from the WWW

140

3. Representation and evaluation of the results

• Moreover, a local recall measure, which considers only the concepts potentially discoverable from the evaluated resources, is also computed in order to check the performance of the selection procedure

• To compare the quality of the obtained results against similar available systems, we have also evaluated precision and recall against hand-made web directory services and taxonomy-based search engines; a small sketch of these measures is given below
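
As a rough sketch of how these set-based measures can be computed once a gold standard is fixed, the following Python snippet implements precision, recall and the local recall variant described above; the concept sets in the usage example are hypothetical placeholders, not data from the paper's evaluation.

```python
def precision_recall(discovered: set, gold: set):
    """Standard set-based precision and recall of the discovered concepts
    with respect to a gold-standard taxonomy for the same domain."""
    correct = discovered & gold
    precision = len(correct) / len(discovered) if discovered else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


def local_recall(selected: set, discoverable: set) -> float:
    """Recall restricted to the concepts actually present in the evaluated
    resources, i.e. how many of the discoverable concepts the selection kept."""
    if not discoverable:
        return 0.0
    return len(selected & discoverable) / len(discoverable)


# Hypothetical toy example (not data from the paper's evaluation)
gold = {"breast cancer", "lung cancer", "skin cancer", "prostate cancer"}
discovered = {"breast cancer", "lung cancer", "cancer research"}
p, r = precision_recall(discovered, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```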

Page 141: Automatic Generation of Taxonomies from the WWW

141

3. Representation and evaluation of the results

• The performance of the candidate selection procedure is high, as the proportion of mistakes (incorrectly selected or rejected items) remains around 15-20%

Page 142: Automatic Generation of Taxonomies from the WWW

142

3. Representation and evaluation of the results

• Moreover, the local recall (the proportion of correctly selected concepts among the full list of discovered ones) is typically around 90-95%

Page 143: Automatic Generation of Taxonomies from the WWW

143

3. Representation and evaluation of the results

• The number of discovered concepts (and, in consequence, the recall) grows logarithmically with the size of the search, due to the redundancy of information and the relevance-based ranking of the web sites returned by the search engine (an illustrative curve fit is sketched below)
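
Purely to illustrate the kind of logarithmic trend being described, one could fit a logarithmic model to the discovered-concept counts; the figures below are invented for illustration and are not measurements from the paper.

```python
import numpy as np

# Invented, purely illustrative figures: number of analysed web sites vs.
# number of distinct concepts discovered (NOT measurements from the paper)
sites = np.array([25, 50, 100, 200, 400, 800])
concepts = np.array([40, 55, 70, 82, 95, 108])

# Fit concepts ~ a * ln(sites) + b to check how well a logarithmic trend fits
a, b = np.polyfit(np.log(sites), concepts, 1)
print(f"concepts ~= {a:.1f} * ln(sites) + {b:.1f}")
```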

Page 144: Automatic Generation of Taxonomies from the WWW

144

3. Representation and evaluation of the results

• Moreover, once a certain point is reached at which a considerable number of concepts has been discovered, precision tends to decrease due to the growth of false candidates

• As a consequence, analysing a larger number of web sites does not necessarily yield better results than a more reduced corpus

Page 145: Automatic Generation of Taxonomies from the WWW

145

3. Representation and evaluation of the results

• Comparing the performance to Yahoo, we see that although its precision is the highest (as its directory has been built by humans), the number of results (recall) is quite limited

Page 146: Automatic Generation of Taxonomies from the WWW

146

3. Representation and evaluation of the results

• In relation to the automatic search tools, Clusty and AlltheWeb, the first one tends to return more results (higher recall) but with low precision, whereas the second one shows the inverse situation

• In both cases, the overall performance (the recall-precision trade-off) is easily surpassed by our proposal

Page 147: Automatic Generation of Taxonomies from the WWW

147

3. Representation and evaluation of the results

• The evaluation of the correctness of the discovered instances (in our case, named entities) is a hard task, even for domain experts, because in most cases a standard classification does not exist

• However, an alternative way of checking automatically obtained terms is to compare them with the output of other well-known learning techniques applied over the same corpus and domain

Page 148: Automatic Generation of Taxonomies from the WWW

148

3. Representation and evaluation of the results

• Concretely, for the named-entity case, we have used a named-entity detection package trained for the English language (again, OpenNLP) that is able to detect, with quite high accuracy, entity types such as organization, person and location names

Page 149: Automatic Generation of Taxonomies from the WWW

149

3. Representation and evaluation of the results

• The evaluation is performed by testing whether the selected instance name is also tagged by the detection tool as a named entity

• This process gives us an automatic measure of the precision of the obtained results, by comparing our algorithm against a well-known, human-trained entity detection mechanism based on maximum entropy models (Borthwick, 1999); a schematic sketch is given below
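
A schematic view of this automatic precision estimate is sketched below. The `detect_with_opennlp` name is a hypothetical wrapper around the external named-entity tagger (e.g. the OpenNLP name finder); the wrapper and the variable names are assumptions for illustration, not part of the authors' code or of OpenNLP's API.

```python
def ner_precision(instance_names, is_named_entity) -> float:
    """Fraction of the selected instance names that an external named-entity
    detector also tags as named entities (an automatic precision estimate)."""
    instance_names = list(instance_names)
    if not instance_names:
        return 0.0
    agreed = sum(1 for name in instance_names if is_named_entity(name))
    return agreed / len(instance_names)


# `detect_with_opennlp` would be a thin wrapper around the external tagger;
# it is hypothetical and shown here only to indicate the intended usage.
# precision = ner_precision(selected_instances, detect_with_opennlp)
```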

Page 150: Automatic Generation of Taxonomies from the WWW

150

3. Representation and evaluation of the results

• In the future, we plan to improve this evaluation by also considering recall measures and comparisons with other automatic techniques

• A collateral effect of this process is the detection of the full name associated with an instance, which OpenNLP tags automatically when the name is composed of several words (e.g. 24th Annual San Antonio Breast Cancer Symposium), as shown in Table 1

Page 151: Automatic Generation of Taxonomies from the WWW

151

3. Representation and evaluation of the results

• For illustrative purposes, for the Cancer, Sensor and Disease domains, we have obtained a precision of 100%, 85% and 91% respectively

• However, even though it is obtained automatically, this is a relative result, because the tagging tool itself does not reach 100% precision and is limited to predefined entity groups

Page 152: Automatic Generation of Taxonomies from the WWW

152

Outline

1. Introduction
2. Taxonomy building methodology
3. Representation and evaluation of the results
4. Related work, conclusions and future research

Page 153: Automatic Generation of Taxonomies from the WWW

153

4. Related work, conclusions and future research

• Several authors have worked on knowledge acquisition from natural language text collections and, more concretely, from the Web in recent years. In (Agirre et al., 2000; Alfonseca & Manandhar, 2002; Navigli & Velardi, 2004) the authors propose methods for extending an existing ontology with unknown words

Page 154: Automatic Generation of Taxonomies from the WWW

154

4. Related work, conclusions and future research

• These approaches rely on previously obtained knowledge and semantic repositories such as WordNet in order to classify concepts and perform disambiguation

Page 155: Automatic Generation of Taxonomies from the WWW

155

4. Related work, conclusions and future research

• More closely related, in (Etzioni et al., 2004) the authors also try to enrich previously obtained ontologies, but using web-scale statistics obtained from web search engines to detect suitable candidates

• Other authors using this approach for knowledge-acquisition tasks are Cimiano and Staab (2004), who learn taxonomic relationships using Hearst's (1992) patterns, and Turney (2001), for synonym discovery

Page 156: Automatic Generation of Taxonomies from the WWW

156

4. Related work, conclusions and future research

• In contrast to many of the previously developed methodologies, our proposal does not start from any kind of predefined knowledge, such as WordNet, and, in consequence, it can be applied to domains that are not typically covered by semantic repositories, for example Biosensors (Sanchez & Moreno, 2004a)

Page 157: Automatic Generation of Taxonomies from the WWW

157

4. Related work, conclusions and future research

• Moreover, the intensive use of search engines for obtaining statistics saves us from analysing large amounts of resources while maintaining good-quality results, and the automatic operation eases the updating of the results in rapidly evolving domains

• These facts result in a scalable and suitable method for acquiring knowledge from a repository as huge and dynamic as the Web

Page 158: Automatic Generation of Taxonomies from the WWW

158

4. Related work, conclusions and future research

• The results obtained can be useful for several tasks

– On the one hand, the automatic classification of the web resources considered during the learning process into a meaningful organization offers a final representation that can ease the way in which the user accesses those resources

– On the other hand, the automatically obtained taxonomy can be used as a first step in the construction of ontologies (Fensel, 2001)

Page 159: Automatic Generation of Taxonomies from the WWW

159

4. Related work, conclusions and future research

• Ontologies offer a formal and efficient way of representing knowledge, but they are usually built entirely by hand

• If that knowledge evolves, an ontology maintenance process is required to keep the ontological knowledge up to date

– This is why methodologies that contribute to automatic ontology construction are necessary

• Finally, a web-populated knowledge structure obtained from the Web itself is very suitable for the purposes of the Semantic Web (Berners-Lee, Hendler & Lassila, 2001)

Page 160: Automatic Generation of Taxonomies from the WWW

160

4. Related work, conclusions and future research

• The concept and named-entity discovery method can be extended using more complex linguistic patterns, such as Hearst's (1992) patterns; a simple regex sketch of one such pattern is given below

– This can be useful when dealing with domains that are not easily categorized, where no suitable noun phrases are found
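
A minimal, deliberately simplistic sketch of one Hearst pattern ("X such as A, B and C") matched with a regular expression; a real system would combine this with proper noun-phrase detection, and the regex and example sentence below are illustrative assumptions rather than the authors' implementation.

```python
import re

# One Hearst pattern: "X such as A, B and C" -> (A, X), (B, X), (C, X)
SUCH_AS = re.compile(
    r"(?P<hypernym>\w+(?: \w+)?)\s+such as\s+"
    r"(?P<hyponyms>[\w ,]+?(?: and [\w ]+)?)(?=[.;]|$)",
    re.IGNORECASE,
)


def hearst_such_as(sentence: str):
    """Extract (hyponym, hypernym) candidate pairs from a 'such as' enumeration."""
    pairs = []
    for m in SUCH_AS.finditer(sentence):
        hypernym = m.group("hypernym").strip()
        for hyponym in re.split(r",| and ", m.group("hyponyms")):
            if hyponym.strip():
                pairs.append((hyponym.strip(), hypernym))
    return pairs


print(hearst_such_as("Skin cancers such as melanoma and basal cell carcinoma."))
# -> [('melanoma', 'Skin cancers'), ('basal cell carcinoma', 'Skin cancers')]
```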

Page 161: Automatic Generation of Taxonomies from the WWW

161

4. Related work, conclusions and future research

• As mentioned before, incorporating an intermediate step for automatically learning alternative forms (synonyms, lexicalizations, etc.) of the initial keyword may allow us to retrieve new resources and concepts, improving the final recall

– For the cancer domain considered in this paper, alternative forms that are easily discovered with this procedure include: cancers, carcinoma(s), tumor(s), tumour(s), neoplasm, etc.

– More details are given in (Sanchez & Moreno, 2005). How to merge the partial results obtained from the learning process for each of those initial keywords into a final complete taxonomy should also be considered in the future

Page 162: Automatic Generation of Taxonomies from the WWW

162

4. Related work, conclusions and future research

• In order to extend the methodology, more complex relations (like non-taxonomic ones) can be discovered from an analysis of the sentences that contain the selected concepts (text nuggets)

• As previously introduced, ways to automate, or at least ease, the evaluation of every step of the knowledge acquisition process will be studied