Semantic Web Mining - Uni Konstanz
Semantic Web Mining * December 2007
Contents
• Semantic Web
• Data mining
• Web mining
▫ Web content mining
▫ Web structure mining
▫ Web usage mining
• Semantic Web Mining
Semantic web
• "The Semantic Web is a vision: the idea of having data on the web defined and linked in a way that it can be used by machines – not just for display purposes, but for using it in various applications." [Tim Berners-Lee]
Semantic Web Apps
• search engines
▫ Hakia
▫ TrueKnowledge
▫ Powerset
▫ Spock
• Firefox extensions
▫ Gnosis
• TripIt
Data mining
“Data mining is the semi-automatic extraction of patterns, changes, associations, anomalies, and other statistically significant structures from large data sets.”
- R. Grossman
Fig 3. Overview of the steps constituting the KDD process
Data mining tasks
• Clustering
• Association rules
• Naïve Bayes
• Neural networks
• Decision trees
Web mining
• the process of discovering patterns and relations in Web data
• applies data mining techniques to the web
• 3 areas can be distinguished:
▫ Web content mining
▫ Web structure mining
▫ Web usage mining
Why web mining?
• the internet has been constantly increasing in usage and popularity
▫ web pages: over 800 million (2000)
▫ HTML pages: ~6 TB of data
▫ every day ~1 million pages are added
▫ every month hundreds of GB worth of changes to existing pages
▫ 2006-2007: over 60 million domains were registered, as many as in the whole of 1995-2005
Web content mining
• is mostly a form of text mining (the discovery by computer of new, previously unknown information by automatically extracting it from different written resources)
• takes advantage of the semi-structured form of HTML and XML pages (as opposed to the fully structured form of databases) to extract knowledge
• can be used to detect co-occurrences of terms in texts
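
A minimal sketch of co-occurrence detection, assuming each page has already been reduced to plain text and that splitting on whitespace is a good enough tokenizer. The function name `cooccurrences` and the sample page texts are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(documents):
    """Count how often each unordered pair of terms appears in the same document."""
    counts = Counter()
    for text in documents:
        # Each term counts once per document; sorting gives every pair a fixed order.
        terms = sorted(set(text.lower().split()))
        counts.update(combinations(terms, 2))
    return counts


# Hypothetical page texts, invented for illustration.
pages = [
    "hotel golf course sea view",
    "hotel golf five stars",
    "hotel spa sea view",
]
pairs = cooccurrences(pages)
```

Pairs with a high count, such as `("golf", "hotel")` here, are candidates for terms that belong together. A real pipeline would also strip markup and stop words first.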
Web structure mining
• describes the organization of the content within the website
▫ includes the organization inside a webpage, internal/ external links and the site hierarchy
• Google's PageRank algorithm ranks a page on the basis of how many other pages link to it and how highly those pages are ranked themselves
• used to identify information hubs
• used to derive models in order to predict the popularity of a website
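
The link-based ranking idea can be sketched with a textbook power iteration. This is a simplified illustration, not Google's production algorithm. The damping factor of 0.85 is the commonly quoted default, and the site graph below is invented.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank; `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share, independent of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical site graph, invented for illustration.
site = {
    "home": ["rooms", "golf"],
    "rooms": ["home"],
    "golf": ["home"],
    "blog": ["home", "golf"],
}
ranks = pagerank(site)
```

In this toy graph "home" ends up with the highest rank because every other page links to it, while "blog" ranks lowest because nothing links to it.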
Web usage mining
• describes the use of websites, as reflected in a web server's access log and in the logs of specific applications
• semantics created by usage
▫ identification of people with the same interests: “People who liked/bought this book also looked at ...”
▫ online catalog: users interested in product A are also interested in product B
Web usage mining
• how often a file appears in a web log reveals knowledge, such as:
▫ which pages are of little interest and which are of great interest
• result:
▫ reorganized site structure (not automated)
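
Counting file frequencies in a web log can be sketched as follows, assuming the server writes lines in the Common Log Format. The helper name `page_hits` and the log lines are invented for illustration.

```python
from collections import Counter


def page_hits(log_lines):
    """Count requests per path in Common Log Format lines."""
    hits = Counter()
    for line in log_lines:
        try:
            request = line.split('"')[1]  # e.g. 'GET /golf.html HTTP/1.0'
            path = request.split()[1]
        except IndexError:
            continue  # skip lines that do not look like a log entry
        hits[path] += 1
    return hits


# Hypothetical access log, invented for illustration.
log = [
    '10.0.0.1 - - [03/Dec/2007:10:00:00 +0100] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.2 - - [03/Dec/2007:10:00:05 +0100] "GET /golf.html HTTP/1.0" 200 2210',
    '10.0.0.1 - - [03/Dec/2007:10:01:12 +0100] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.3 - - [03/Dec/2007:10:02:40 +0100] "GET /golf.html HTTP/1.0" 200 2210',
    '10.0.0.3 - - [03/Dec/2007:10:03:01 +0100] "GET /imprint.html HTTP/1.0" 200 512',
    '10.0.0.4 - - [03/Dec/2007:10:04:30 +0100] "GET /index.html HTTP/1.0" 200 1043',
    'not a log entry',
]
hits = page_hits(log)
```

`hits.most_common()` then lists the pages from most to least requested. Deciding how to reorganize the site on that basis remains a manual step.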
Semantic Web mining
• take a set of Web pages from a site and improve them for both human and machine users
▫ generate metadata that reflect a semantic model underlying the site
▫ identify patterns both in the pages' text and in their usage
▫ improve information architecture and page design
Steps
• employ mining methods on Web resources
▫ generate mining structure
• employ mining methods on the resulting semantically structured Web resources
▫ generate further structure
• at the end, feed the results back into:
▫ the design of the Web pages themselves (visible to human users)
▫ the metadata and the underlying ontology (visible to machine users)
Ontology
• provides the opportunity of representing arbitrary worlds
• includes a set of concepts, a hierarchy on them, and (n-ary) relations between concepts
• two types of ontologies:
▫ the 1st uses a small number of relations between concepts: e.g. Yahoo!
▫ the 2nd is rich in relations but has a rather limited description of each concept, usually consisting of a short text: e.g. WordNet
Association Rules
• combination of knowledge about instances like the “Wellnesshotel” and its Sea View golf course and knowledge derived from the Web pages' texts
▫ hotels with golf courses often have five stars
• (Confidence, support)
▫ (89%, 0.4%)
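
Under the standard definitions, the pair (89%, 0.4%) would mean that 89% of the hotels with a golf course have five stars, and that 0.4% of all hotels have both. A minimal sketch of how the two numbers are computed for one rule follows. The function name `rule_metrics` and the five hotel records are invented, so the resulting figures differ from those on the slide.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (confidence, support) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    # Records that contain everything on the left-hand side of the rule.
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    # Records that contain both sides of the rule.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return confidence, support


# Hypothetical hotel records, invented for illustration.
hotels = [
    {"golf course", "five stars"},
    {"golf course", "five stars", "spa"},
    {"golf course", "four stars"},
    {"spa", "four stars"},
    {"five stars"},
]
confidence, support = rule_metrics(hotels, {"golf course"}, {"five stars"})
```

Here two of the three golf hotels have five stars (confidence 2/3), and two of the five hotels have both properties (support 0.4).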
Clustering
• use web document clustering techniques to improve search engine results (i.e. the search results better reflect the term/s sought)
▫ identify a cluster of users who visit and closely examine the pages of the “Wellnesshotel”, the “Palacehotel”, and the “Starhotel”
▫ “you might also want to look at…”
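
One simple way to find such a user cluster is to compare the sets of pages each user visited. The sketch below uses Jaccard similarity with a threshold and groups users by single-link connected components. This technique is chosen here only for illustration and is not necessarily the one the slides have in mind. The user ids, the threshold of 0.5, and the visit data are invented.

```python
def jaccard(a, b):
    """Share of pages two users have in common, relative to all pages either visited."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_users(visits, threshold=0.5):
    """Group users linked by a chain of pairwise similarities >= threshold."""
    unassigned = set(visits)
    clusters = []
    for user in sorted(visits):
        if user not in unassigned:
            continue
        unassigned.remove(user)
        cluster, frontier = [user], [user]
        while frontier:
            current = frontier.pop()
            for other in sorted(unassigned):
                if jaccard(visits[current], visits[other]) >= threshold:
                    unassigned.remove(other)
                    cluster.append(other)
                    frontier.append(other)
        clusters.append(sorted(cluster))
    return clusters


def suggestions(visits, cluster, user):
    """Pages seen by the user's cluster peers but not yet by the user."""
    seen_by_peers = set().union(*(visits[u] for u in cluster if u != user))
    return sorted(seen_by_peers - visits[user])


# Hypothetical visit data, invented for illustration.
visits = {
    "u1": {"Wellnesshotel", "Palacehotel", "Starhotel"},
    "u2": {"Wellnesshotel", "Palacehotel", "Starhotel", "imprint"},
    "u3": {"Wellnesshotel", "Starhotel"},
    "u4": {"imprint", "contact"},
}
clusters = cluster_users(visits)
hints = suggestions(visits, clusters[0], "u3")
```

Users u1, u2 and u3 end up in one cluster, and u3 is pointed to the pages its peers looked at but u3 did not.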
Redesigning
• in order to introduce a new category “golf hotels”
▫ all hotels for which there is a “golf course” that “belongs to” the hotel become instances of the new category
• site and page design are modified
▫ by adding a new value for the search criterion “hotel facilities” in order to correspond to the newly added category
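
Deriving the instances of the new category can be sketched over a small set of (subject, relation, object) facts. The triple layout, the relation name "is a", and the function `derive_category` are assumptions made for illustration. The "Green Park" course is invented to show a golf course that belongs to no hotel.

```python
def derive_category(triples, facility_type, new_category):
    """Hotels that own a facility of the given type become instances of the new category."""
    facilities = {s for s, p, o in triples if p == "is a" and o == facility_type}
    owners = sorted({o for s, p, o in triples if p == "belongs to" and s in facilities})
    return [(hotel, "is a", new_category) for hotel in owners]


# Hypothetical facts, invented for illustration.
facts = [
    ("Sea View", "is a", "golf course"),
    ("Sea View", "belongs to", "Wellnesshotel"),
    ("Wellnesshotel", "is a", "hotel"),
    ("Palacehotel", "is a", "hotel"),
    ("Green Park", "is a", "golf course"),
]
new_facts = derive_category(facts, "golf course", "golf hotel")
```

Only the "Wellnesshotel" becomes a "golf hotel" here, because it is the only hotel that a golf course belongs to.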
Benefits
• input:
▫ page of a site describes the “Palacehotel” in Zürich
▫ hotel subclass of accommodation
▫ Zürich is located in Switzerland
• search for “accommodation in Switzerland”
• result:
▫ “Palacehotel”
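
The inference behind this result can be sketched in a few lines, assuming the ontology is stored as two plain lookup tables. The table names, the `search` function, and the second hotel in Konstanz are invented for illustration.

```python
# Background knowledge from the ontology (first entries taken from the example above).
SUBCLASS_OF = {"hotel": "accommodation"}
LOCATED_IN = {"Zürich": "Switzerland", "Konstanz": "Germany"}

# Metadata extracted from the pages; the second entry is invented as a counter-example.
INSTANCES = [
    {"name": "Palacehotel", "type": "hotel", "city": "Zürich"},
    {"name": "Starhotel", "type": "hotel", "city": "Konstanz"},
]


def ancestors(item, relation):
    """Return the item plus everything reachable by following the relation."""
    chain = [item]
    while chain[-1] in relation and relation[chain[-1]] not in chain:
        chain.append(relation[chain[-1]])
    return chain


def search(instances, wanted_type, wanted_place):
    """Find instances whose type and location match directly or via the ontology."""
    return [
        inst["name"]
        for inst in instances
        if wanted_type in ancestors(inst["type"], SUBCLASS_OF)
        and wanted_place in ancestors(inst["city"], LOCATED_IN)
    ]


result = search(INSTANCES, "accommodation", "Switzerland")
```

The page itself mentions neither "accommodation" nor "Switzerland", so a plain keyword search would miss it. The two background facts close that gap.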
Links
• http://www.hakia.com/
• http://www.powerset.com/
• http://www.trueknowledge.com/
• http://www.spock.com/
• http://www.tripit.com/
• https://addons.mozilla.org/en-US/firefox/addon/3999
• http://wordnet.princeton.edu/