[submission] final_presentation
![Page 1: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/1.jpg)
Higgs-Reader
Team C. Arif Jafer, Camilo Celis, Marcus Low,
![Page 2: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/2.jpg)
Contents
❏ Overview
❏ Problems & Requirements
❏ Goals Met
❏ Approach
❏ Architecture
❏ WebAnn - Training Set Creator Tool
❏ The Korean Language Model
❏ Final Reader
❏ Demo
![Page 3: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/3.jpg)
Overview
![Page 4: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/4.jpg)
Overview
● A reader engine is composed of:
  ○ A web-page text extraction algorithm, to find the main article text
  ○ Heuristics to find metadata and images relevant to the main article
  ○ A user interface to embody the reader engine
![Page 5: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/5.jpg)
Overview (boilerpipe)
![Page 6: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/6.jpg)
Overview
● Higgs-Reader (built upon DOM-Distiller)
  ○ Boilerpipe extended with a Korean Language Model
    ■ Tools to train the model - Weka / C4.5 Decision Trees
    ■ Tools to generate the training set - WebAnn
    ■ Integration of the model back into the reader engine
  ○ Existing heuristics in DOM-Distiller will be tuned to improve performance on Korean web pages
  ○ Final Reader Chrome Extension
![Page 7: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/7.jpg)
Goals Met
● Extended the DOM-Distiller reader engine with enhanced support for Korean web pages.
● Created a new Korean Language Model for text extraction.
● Tuned the existing heuristics to improve performance on Korean web sites.
● Created a Reader UI to embody the reader engine.
![Page 8: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/8.jpg)
Problems Encountered
● Existing reader engines, such as DOM-Distiller, had poor support for non-English web pages.
● Korean websites did not commonly follow website markup standards such as the OpenGraph protocol or schema.org.
● Most websites still use <div> or <table> tags to separate content. This eliminates the possibility of identifying the semantics of any particular section of the HTML source.
● Poor performance on multi-page websites. The reader should be able to retrieve all of the pages, or at most K of them, at once.
● Poor detection of relevant images and other rich-content media.
![Page 9: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/9.jpg)
Requirements Met
● A Korean language model was built and integrated into the Boilerpipe algorithm.
  ○ Tooling for creating the training set (WebAnn)
● The existing DOM-Distiller was tuned to work with Korean websites.
● Better support for web pages whose layouts are built with tables.
● Better support for multi-page web pages.
● Enhanced the relevant-image detection heuristic.
● Chrome Extension implementation (Final Reader)
● Comparison mechanism for testing purposes
![Page 10: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/10.jpg)
Approach
● 4 Stages
  ○ Web Page Annotator
  ○ Korean Language Model for boilerpipe
  ○ Reader Engine tuning
  ○ Reader UI
![Page 11: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/11.jpg)
Approach / Architecture (Overall)
![Page 12: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/12.jpg)
Approach / Architecture (WebAnn)
● Web Page Annotator (WebAnn)
  ○ Built as a Chrome extension
  ○ Provides a simple UI to annotate different sections of a web page with predefined labels.
    ■ HEADING
    ■ FULL_CONTENT
    ■ SUPPLEMENTARY
    ■ COMMENTS
    ■ RELEVANT_IMAGES
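The five predefined labels can be pictured as a small data structure. The sketch below is only an illustration: the `Annotation` record, its `page_url` and `css_selector` fields, and the `group_by_label` helper are hypothetical names, since the slides do not show how WebAnn actually stores its annotations.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    """The five predefined WebAnn labels listed on the slide."""
    HEADING = "HEADING"
    FULL_CONTENT = "FULL_CONTENT"
    SUPPLEMENTARY = "SUPPLEMENTARY"
    COMMENTS = "COMMENTS"
    RELEVANT_IMAGES = "RELEVANT_IMAGES"


@dataclass(frozen=True)
class Annotation:
    # Hypothetical record layout; WebAnn's real storage format is not shown.
    page_url: str
    css_selector: str
    label: Label


def group_by_label(annotations):
    """Collect the annotated selectors of one or more pages under each label."""
    groups = defaultdict(list)
    for annotation in annotations:
        groups[annotation.label].append(annotation.css_selector)
    return dict(groups)


# Hypothetical annotations for a made-up page.
example = [
    Annotation("http://example.com/news/1", "h1.title", Label.HEADING),
    Annotation("http://example.com/news/1", "div#article", Label.FULL_CONTENT),
    Annotation("http://example.com/news/1", "div#replies", Label.COMMENTS),
]
grouped = group_by_label(example)
```

Keeping the label set closed (an enum rather than free text) would make it harder for annotators to introduce labels that the training step does not know about.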
![Page 13: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/13.jpg)
WebAnn -- Training Set Creator Tool
Ordinary Webpage
![Page 14: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/14.jpg)
WebAnn -- Training Set Creator Tool
Annotator in Action
![Page 15: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/15.jpg)
Approach / Architecture (Machine Learning)
![Page 16: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/16.jpg)
Approach / Architecture (Language Model)
● Korean Language Model
  ○ A corresponding model for each of the models listed in Table 3.2
  ○ Will be trained using the shallow text features listed in Table 3.2

Models:
DensityRulesClassifier
HeuristicsFilterBase
IgnoreBlocksAfterContentFilter
IgnoreBlocksAfterContentFromEndFilter
KeepLargestFulltextBlockFilter
MinFullTextWordsFilter
NumWordsRulesClassifier
TerminatingBlocksFinder

Shallow text features:
prev_link_density
prev_text_density
prev_num_words
prev_num_words_in_anchor_text
curr_link_density
curr_text_density
curr_num_words
curr_num_words_in_anchor_text
next_link_density
next_text_density
next_num_words
next_num_words_in_anchor_text
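As a rough illustration of how the twelve shallow text features could be derived for one block and its neighbours, here is a minimal sketch. It is not the boilerpipe implementation: the `Block` class and `block_features` helper are hypothetical, words are split on whitespace, text density is simplified to words per 80-character wrapped line, and a missing neighbour is assumed to contribute zeros.

```python
import textwrap
from dataclasses import dataclass


@dataclass
class Block:
    text: str         # all visible text of the block
    anchor_text: str  # the part of that text that sits inside <a> tags


def num_words(text):
    # Whitespace tokenisation; a simplification for any language.
    return len(text.split())


def link_density(block):
    total = num_words(block.text)
    return num_words(block.anchor_text) / total if total else 0.0


def text_density(block, width=80):
    # Simplified: words per wrapped line at an assumed width of 80 characters.
    lines = textwrap.wrap(block.text, width)
    return num_words(block.text) / len(lines) if lines else 0.0


def block_features(blocks, index):
    """Return the prev_/curr_/next_ feature dictionary for blocks[index]."""
    empty = Block("", "")  # assumption: missing neighbours count as empty
    neighbours = {
        "prev": blocks[index - 1] if index > 0 else empty,
        "curr": blocks[index],
        "next": blocks[index + 1] if index + 1 < len(blocks) else empty,
    }
    features = {}
    for prefix, block in neighbours.items():
        features[f"{prefix}_link_density"] = link_density(block)
        features[f"{prefix}_text_density"] = text_density(block)
        features[f"{prefix}_num_words"] = num_words(block.text)
        features[f"{prefix}_num_words_in_anchor_text"] = num_words(block.anchor_text)
    return features


blocks = [
    Block("Home News Sports", "Home News Sports"),  # navigation-like block
    Block("one two three four", ""),                 # article-like block
    Block("x", ""),
]
middle = block_features(blocks, 1)
```

A navigation block made entirely of links gets a link density of 1.0, while a block of running text gets 0.0, which is the kind of contrast a decision tree can split on.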
![Page 17: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/17.jpg)
Approach / Architecture (Language Model)
● Korean Language Models
  ○ Trained using the C4.5 Decision Trees algorithm
    ■ Existing English language models were also trained with this algorithm
    ■ Better performance on multi-category classification problems
    ■ Good performance in supervised learning
  ○ Use the Weka ML toolset
    ■ Provides a wide range of implementations of ML algorithms
    ■ Easy to compare and evaluate different models by tuning the parameters
    ■ Provides cross-validation features, such as k-fold cross-validation
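C4.5 chooses its splits by gain ratio (information gain divided by split information). The sketch below shows that criterion for a single numeric threshold; it is a toy illustration in plain Python, not Weka's code, and it leaves out pruning and the other refinements of a full C4.5 implementation. The sample values and labels are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold`."""
    left = [label for value, label in zip(values, labels) if value <= threshold]
    right = [label for value, label in zip(values, labels) if value > threshold]
    if not left or not right:
        return 0.0  # the threshold does not separate anything
    weights = [len(left) / len(labels), len(right) / len(labels)]
    info_gain = entropy(labels) - (weights[0] * entropy(left)
                                   + weights[1] * entropy(right))
    split_info = -sum(weight * math.log2(weight) for weight in weights)
    return info_gain / split_info


# Invented toy data: word counts of six blocks and their classes.
word_counts = [2, 3, 4, 40, 55, 60]
classes = ["boilerplate"] * 3 + ["content"] * 3

best = max(word_counts, key=lambda t: gain_ratio(word_counts, classes, t))
```

On this toy data the threshold 4 separates the two classes perfectly, so it scores higher than any other candidate threshold.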
![Page 18: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/18.jpg)
Korean Language Heuristics
● Lack of <p> tags
● Terminating Blocks
![Page 19: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/19.jpg)
Korean Language Model
Decision Tree based on Number of Words
Decision Tree based on Density of Words
![Page 20: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/20.jpg)
Korean Language Model

Number of Words

English Model
boilerplate   content
21032         621
225           647

Confusion Matrix
boilerplate   content
21637         16
142           730

Correctly Classified Instances     22367   99.2986 %
Incorrectly Classified Instances   158     0.7014 %

Density of Words

English Model
boilerplate   content
21105         548
220           652

Confusion Matrix
boilerplate   content
21637         16
142           730

Correctly Classified Instances     22367   99.2986 %
Incorrectly Classified Instances   158     0.7014 %
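The reported percentages can be checked directly from the matrices. The sketch below assumes only that the diagonal cells are the correctly classified instances, which matches the reported 22367 = 21637 + 730. The variable names are illustrative, and `reported_matrix` is the one labelled "Confusion Matrix" on the slide.

```python
def summarize(matrix):
    """Return (correct, incorrect, accuracy) for a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct, total - correct, correct / total


# Figures copied from the slide.
english_num_words = [[21032, 621], [225, 647]]
english_density = [[21105, 548], [220, 652]]
reported_matrix = [[21637, 16], [142, 730]]  # same matrix under both headings

correct, incorrect, accuracy = summarize(reported_matrix)
print(correct, incorrect, f"{100 * accuracy:.4f} %")  # 22367 158 99.2986 %
```

All three matrices sum to the same 22525 instances. The two English Model matrices misclassify 846 and 768 instances respectively, against 158 for the matrix labelled "Confusion Matrix".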
![Page 21: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/21.jpg)
Approach / Architecture (Language Model)
![Page 22: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/22.jpg)
Approach / Architecture (Reader Engine)
● Reader Engine
  ○ Based on the DOM-Distiller project
  ○ New language model will be integrated into Boilerpipe
  ○ Existing heuristics will be tuned to improve performance on Korean web pages
  ○ Built upon Google Web Toolkit (GWT)
    ■ Can use Java libraries
    ■ Can use Java OOP features
    ■ Compiler will produce cross-browser JS code
    ■ Reader engine can be ported into any browser
![Page 23: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/23.jpg)
Approach / Architecture (Reader Engine)
![Page 24: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/24.jpg)
Final Reader UI (Implementation)
Final Reader Old Reader
![Page 25: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/25.jpg)
OLD READER (Live Demo)
● Small Chrome Extension using old dom-distiller code and old language model
![Page 26: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/26.jpg)
FINAL READER (Live Demo)
● Faster build cycles
● Can be used to easily compare with the Old Reader extension
![Page 27: [submission] Final_Presentation](https://reader033.vdocuments.net/reader033/viewer/2022042817/55a9eb3d1a28ab23638b46ed/html5/thumbnails/27.jpg)
Thank you