automatically annotating web pages using google rich snippets 11th dutch-belgian information...
Post on 20-Dec-2015
220 views
TRANSCRIPT
Automatically Annotating Web Pages Using Google Rich Snippets
11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
February 4, 2011
Frederik [email protected]
Flavius [email protected]
Damir [email protected]
Jeroen van der [email protected]
Ferry [email protected]
Uzay [email protected]
Erasmus University RotterdamPO Box 1738, NL-3000 DRRotterdam, the Netherlands
This talk is based on the paper A Framework for Automatic Annotation of Web Pages Using the Google Rich Snippets Vocabulary. Meer, J. van der, Boon, F., Hogenboom, F.P., Frasincar, F. & Kaymak, U. (2011). In 26th Symposium on Applied Computing (SAC 2011) (pp. 763-770). ACM.
Introduction (1)
• Semantically annotating Web pages enhances machine interpretation
• Google Rich Snippets (RDFa) enable Web page owners to add semantics to their pages
• The vocabulary enables interesting applications
11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Introduction (2)
• Automating annotation for static and 3rd party Web sites is deemed necessary
• Hence, we propose the Automatic Review Recognition and annOtation of Web pages (ARROW) framework
11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Framework (1)
11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
• Four main stages:– Hotspot identification– Subjectivity analysis– Information extraction– Page annotation
• Web pages are converted to DOM trees in order to enable easy processing
Framework (2)
11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
RDFa
Framework (3): Hotspots
• Reviews are characterized by large blocks of text: hotspots
• Headers, navigation elements, footers, etc., do not contain these blocks
• Text blocks have few HTML elements
• For each element in the DOM tree, we compute the text-to-content-ratio (TTCR):
, with = # textual characters, and = total # characters in
DOM
11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
DOM
text
L
LTTCR textL
DOML
Framework (4): Hotspots
• Illustrative example:
• The h1 element contains 64/73 × 100% ≈ 88% text
• However, the div element merely contains 34/116 × 100% ≈ 29% text due to its span elements
11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
<h1> Intel Core i7-975 Extreme And i7-950 Processors Reviewed</h1><div> <p> Page <span class="page-number">1</span> of <span class="num-pages">15</span> </p></div>
Framework (5): Subjectivity
• Hotspots are verified as reviews whenever they are subjective enough
• We utilize an updated version of the LightWeight subjectivity Detection mechanism (LWD) of Barbosa et al. (2009):– Original: check if document has ≥ k sentences that contain ≥
n subjectivity words each– Modification: check if document has ≥ m percent of all
sentences that contain ≥ n subjectivity words each
11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Framework (6): IE
• Various information is extracted:– Authors:
• Named entities are detected in the vicinity of hotspots• Named Entity Recognizer (NER)
– Dates:• Many different date formats are easily parsed• Regular expressions
– Products:• Name often found in title and h1 elements• Overlapping words
– Ratings:• Many formats, e.g., images (90%), which can be numerical (80%),
descriptors (15%), or letters (5%)• We focus on numerical ratings• Regular expressions on plain text or alt text of images
11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
(\w)\s(\d{1,2})(th|,)?\s(\d{2,4})
([0-9.,]+)\s?/\s?([0-9.,]+)
MM dd yyyy
4/5
Framework (7): Annotation
• Key elements are tagged using Google Rich Snippets
• A new annotated Web page is returned
11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
<div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review"> <span property="v:itemreviewed"> Tango Hotel Taichung </span> <span property="v:reviewer">Sarah Lee</span> <span property="v:rating">4 stars</span> <span property="v:dtreviewed">18th December 2008</span> <p property="v:summary"> Boutique like hotel without the boutique price </p></div>
Implementation (1)
• We have implemented the ARROW framework as a Web application:– Java-based– Apache Tomcat server
• Input:– URL– Preferred output:
• Visualizer
• Annotated document
11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Implementation (2)
11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Evaluation
• Test set: 100 review, 100 non-review Web pages
• Sub-second performance
• Precision and specificity are good (both ± 90%), while accuracy and recall are varying (± 40% – 60%)
• Main problems related to detecting authors, likely caused by the use of nicknames
• Dependency on Web site structures
11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Conclusions
• We presented ARROW, a framework for automatically annotating reviews with Google Rich Snippets
• Framework not bound to vocabulary
• Proof-of-concept implementation shows promising results
• Future work:– Improve heuristics– Add intelligent (semantically enabled) text parsers– Extend to other domains, e.g., recipes, videos, etc.
11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Questions
http://www.arrow-project.com/
11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)