semantic web mining - uni konstanz

Semantic Web Mining

Diana Cerbu

Semantic Web Mining * December 2007

Contents

• Semantic Web

• Data mining

• Web mining

▫ Web content mining

▫ Web structure mining

▫ Web usage mining

• Semantic Web Mining


Semantic Web

• "The Semantic Web is a vision: the idea of having data on the web defined and linked in a way that it can be used by machines – not just for display purposes, but for using it in various applications." [Tim Berners-Lee]


Semantic Web Layer Cake


Semantic Web Apps

• search engines

▫ Hakia

▫ TrueKnowledge

▫ Powerset

▫ Spock

• Firefox extensions

▫ Gnosis

• TripIt


Gnosis


Data mining

“Data mining is the semi-automatic extraction of patterns, changes, associations, anomalies, and other statistically significant structures from large data sets.”

- R. Grossman

Fig 3. Overview of the steps constituting the KDD process


Data mining tasks

• Clustering

• Association rules

• Naïve Bayes

• Neural networks

• Decision trees


Web mining

• the process of discovering patterns and relations in Web data

• applies data mining techniques to the Web

• 3 areas can be distinguished:

▫ Web content mining

▫ Web structure mining

▫ Web usage mining


Why web mining?

• Internet usage and popularity have grown steadily

▫ web pages: over 800 million (2000)

▫ HTML pages: ~6 TB of data

▫ every day ~1 million pages are added

▫ every month hundreds of GB worth of changes are made to existing pages

▫ 2006-2007: over 60 million domains were registered (as many as in 1995-2005)


Web content mining

• is mostly a form of text mining (the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources)

• takes advantage of the semi-structured form (as opposed to databases) of HTML and XML pages to extract knowledge

• can be used to detect co-occurrences of terms in texts
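
Detecting co-occurrences of terms can be sketched in a few lines. The example below is only an illustration: the page texts are invented, and the function name `cooccurrences` is hypothetical. It counts, for every pair of distinct terms, the number of texts in which both appear.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrences(texts):
    """Count in how many texts each pair of distinct terms appears together."""
    counts = Counter()
    for text in texts:
        # Lowercase, split into word tokens, and count each term once per text.
        terms = sorted(set(re.findall(r"\w+", text.lower())))
        # Pairs are stored in alphabetical order, e.g. ("golf", "hotel").
        for pair in combinations(terms, 2):
            counts[pair] += 1
    return counts


# Hypothetical page texts, invented for illustration.
pages = [
    "Hotel with golf course and sea view",
    "Golf hotel near the lake",
    "City hotel with sauna",
]
counts = cooccurrences(pages)
print(counts[("golf", "hotel")])
```

Pairs with a high count relative to the number of texts are candidates for related concepts; a real system would also remove stop words such as "with" and "the".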


Web structure mining

• describes the organization of the content within the website

▫ includes the organization inside a webpage, internal/ external links and the site hierarchy

• Google's PageRank algorithm ranks a website on the basis of how many other sites link to it

• used to identify information hubs

• used to derive models in order to predict the popularity of a website
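
The link-based ranking idea can be sketched as follows. This is a textbook-style damped power iteration, not Google's actual implementation; the damping value 0.85, the iteration count, and the tiny link graph are assumptions made for the example. Note that in this formulation a page's score depends not only on how many pages link to it but also on the scores of those pages.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores for a dict mapping page -> list of linked pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base score regardless of its in-links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on in equal shares to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its score evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: pages a, b and d link to c; c links back to a.
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))
```

In this graph page c collects the most incoming links and ends up with the highest score, which is the kind of page the slide calls an information hub.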


Web usage mining

• describes the use of websites, reflected in a web server's access log, as well as in logs for specific applications

• semantics created by usage

▫ identification of people with the same interests: “People who liked/bought this book also looked at ...”

▫ online catalog: users interested in product A are also interested in product B


Web usage mining

• the frequency of a file in a web log reveals knowledge, such as:

▫ pages of little interest / pages of much interest

• result:

▫ reorganized site structure (not automated)
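
Counting how often each file appears in an access log can be sketched as below. The log lines are invented and assumed to follow the Common Log Format, where the request is the quoted part of the line; the function name `page_frequencies` is hypothetical.

```python
from collections import Counter


def page_frequencies(log_lines):
    """Count requests per path in Common Log Format lines."""
    counts = Counter()
    for line in log_lines:
        # The request is the first quoted field, e.g. "GET /index.html HTTP/1.0".
        parts = line.split('"')
        if len(parts) < 3:
            continue  # skip lines without a quoted request
        request = parts[1].split()
        if len(request) >= 2:
            counts[request[1]] += 1
    return counts


# Hypothetical access log entries, invented for illustration.
log = [
    '10.0.0.1 - - [10/Dec/2007:10:00:00 +0100] "GET /hotels/wellness.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Dec/2007:10:00:05 +0100] "GET /hotels/wellness.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Dec/2007:10:01:10 +0100] "GET /hotels/palace.html HTTP/1.0" 200 1810',
    '10.0.0.3 - - [10/Dec/2007:10:02:30 +0100] "GET /hotels/wellness.html HTTP/1.0" 200 2326',
    'malformed line without a request',
]
frequencies = page_frequencies(log)
print(frequencies.most_common())
```

Paths at the top of the list are pages of much interest; paths that exist on the site but never appear in the log are candidates for reorganization.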


Semantic Web mining

• take a set of Web pages from a site and improve them for both human and machine users

▫ generate metadata that reflect a semantic model underlying the site

▫ identify patterns both in the pages' text and in their usage

▫ improve information architecture and page design


Steps

• employ mining methods on Web resources

▫ generate semantic structure

• employ mining methods on the resulting semantically structured Web resources

▫ generate further structure

• at the end, feed the results back into

▫ the design of the Web pages themselves (visible to human users)

▫ the metadata and the underlying ontology (visible to machine users)


Ontology

• provides the opportunity of representing arbitrary worlds

• includes a set of concepts, a hierarchy on them, and (n-ary) relations between concepts

• two types of ontologies:

▫ the first uses only a small number of relations between concepts, e.g. Yahoo!

▫ the second is rich in relations but has a rather limited description of each concept, usually consisting of a short text, e.g. WordNet


Ontology learning


The ontology is filled


Knowledge base is mined


Association Rules

• combination of knowledge about instances, like the “Wellnesshotel” and its Sea View golf course, and knowledge derived from the Web pages' texts

▫ hotels with golf courses often have five stars

• (Confidence, support)

▫ (89%, 0.4%)
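
How confidence and support of a rule such as "golf course → five stars" are computed can be sketched as follows. The hotel records are invented and do not reproduce the (89%, 0.4%) figures from the slide; the function name is hypothetical. Support is the share of all records that contain both sides of the rule; confidence is the share of records containing the left side that also contain the right side.

```python
def support_and_confidence(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    if not records:
        return 0.0, 0.0
    # Records that satisfy the left-hand side of the rule.
    with_antecedent = [r for r in records if antecedent <= r]
    # Records that satisfy both sides of the rule.
    with_both = [r for r in with_antecedent if consequent <= r]
    support = len(with_both) / len(records)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Hypothetical hotel descriptions, one set of features per hotel.
hotels = [
    {"golf course", "five stars"},
    {"golf course", "five stars"},
    {"golf course"},
    {"five stars"},
    {"sauna"},
]
support, confidence = support_and_confidence(hotels, {"golf course"}, {"five stars"})
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

A rule with high confidence but very low support, as on the slide, holds for almost every hotel with a golf course but concerns only a small fraction of all hotels.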


Clustering

• use web document clustering techniques to improve search engine results (i.e. the search results better reflect the term/s sought)

▫ identify a cluster of users who visit and closely examine the pages of the “Wellnesshotel”, the “Palacehotel”, and the “Starhotel”

▫ “you might also want to look at…”
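
Grouping users by the pages they visit can be sketched as below. This is a deliberately simple greedy grouping based on Jaccard similarity, chosen for brevity and not taken from the slides; the threshold 0.5, the user IDs, and the page names other than the three hotels named above are assumptions.

```python
def jaccard(a, b):
    """Share of pages two users have in common, relative to all pages they visited."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_users(visits, threshold=0.5):
    """Greedily assign each user to the first cluster whose first member is similar enough."""
    clusters = []
    for user in sorted(visits):
        for cluster in clusters:
            representative = cluster[0]
            if jaccard(visits[user], visits[representative]) >= threshold:
                cluster.append(user)
                break
        else:
            # No existing cluster is similar enough, so the user starts a new one.
            clusters.append([user])
    return clusters


# Hypothetical sessions: the set of hotel pages each user examined.
visits = {
    "u1": {"Wellnesshotel", "Palacehotel", "Starhotel"},
    "u2": {"Wellnesshotel", "Palacehotel", "Starhotel", "Cityhotel"},
    "u3": {"Hostel", "Campsite"},
    "u4": {"Wellnesshotel", "Starhotel"},
}
clusters = cluster_users(visits)
print(clusters)
```

Pages seen by other members of a user's cluster but not by the user are natural candidates for a "you might also want to look at…" recommendation.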


Redesigning

• in order to introduce a new category “golf hotels”

▫ all hotels for which there is a “golf course” that “belongs to” the hotel become instances of the new category

• site and page design are modified

▫ by adding a new value for the search criterion “hotel facilities” in order to correspond to the newly added category
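
Populating the new category can be sketched with a small set of subject-predicate-object triples. The triple layout, the predicate name "is a", and the function name `golf_hotels` are assumptions for the example; only "belongs to", the "Wellnesshotel", and its Sea View golf course come from the slides.

```python
def golf_hotels(triples):
    """Return hotels that own a golf course, i.e. instances of the new category."""
    golf_courses = {s for s, p, o in triples if p == "is a" and o == "golf course"}
    hotels = {s for s, p, o in triples if p == "is a" and o == "hotel"}
    # A hotel qualifies if some golf course "belongs to" it.
    return sorted(
        {o for s, p, o in triples if p == "belongs to" and s in golf_courses and o in hotels}
    )


# Hypothetical knowledge base as (subject, predicate, object) triples.
knowledge_base = [
    ("Wellnesshotel", "is a", "hotel"),
    ("Palacehotel", "is a", "hotel"),
    ("Starhotel", "is a", "hotel"),
    ("Sea View", "is a", "golf course"),
    ("Sea View", "belongs to", "Wellnesshotel"),
]
new_category = golf_hotels(knowledge_base)
print(new_category)
```

Each hotel returned would be listed under the new value "golf hotels" of the search criterion "hotel facilities".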


Benefits

• input:

▫ a page of the site describes the “Palacehotel” in Zürich

▫ hotel is a subclass of accommodation

▫ Zürich is located in Switzerland

• search for “accommodation in Switzerland”

• result:

▫ “Palacehotel”
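
The inference behind this search result can be sketched with three small lookup tables. The table and function names are hypothetical; the facts are the ones listed above. The search follows the subclass hierarchy upward and the "located in" relation transitively, so a page that only mentions a hotel in Zürich still matches "accommodation in Switzerland".

```python
# Facts from the example, stored as simple lookup tables (names are illustrative).
SUBCLASS_OF = {"hotel": "accommodation"}
INSTANCE_OF = {"Palacehotel": "hotel"}
LOCATED_IN = {"Palacehotel": "Zürich", "Zürich": "Switzerland"}


def is_a(instance, concept):
    """True if the instance belongs to the concept or to one of its subclasses."""
    current = INSTANCE_OF.get(instance)
    while current is not None:
        if current == concept:
            return True
        current = SUBCLASS_OF.get(current)  # climb the concept hierarchy
    return False


def located_in(thing, place):
    """True if the thing lies in the place, following 'located in' transitively."""
    current = LOCATED_IN.get(thing)
    while current is not None:
        if current == place:
            return True
        current = LOCATED_IN.get(current)
    return False


def search(concept, place):
    """Return all known instances of the concept located in the place."""
    return sorted(i for i in INSTANCE_OF if is_a(i, concept) and located_in(i, place))


results = search("accommodation", "Switzerland")
print(results)
```

The loops assume that the hierarchy and the location chain contain no cycles, which holds for this small example. A plain keyword search would miss the page, because neither "accommodation" nor "Switzerland" appears in its text.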


Q&A


Links

• http://www.hakia.com/

• http://www.powerset.com/

• http://www.trueknowledge.com/

• http://www.spock.com/

• http://www.tripit.com/

• https://addons.mozilla.org/en-US/firefox/addon/3999

• http://wordnet.princeton.edu/


Bibliography

• Web Mining: From Web to Semantic Web, Bettina Berendt, Andreas Hotho

• Towards Semantic Web Mining, Bettina Berendt, Andreas Hotho
