Detecting Blogs Independently From The Language And Content (MSM09)
Francisco Manuel Rangel Pardo
PhD. Anselmo Peñas Padilla
Powered by Corex Soluciones Informáticas 2009
Introduction
- What is social media? Sharing, discussion, collaboration -> Web 2.0
- What is a Blog? Opinions, experiences, information; free communication
- What does “Detecting Blogs Independently From The Content And Language” mean?
Structure
- Problem definition and research objectives
- Research approach
- Experimental results
- Conclusions and future work. Applications.
Problem definition
- Many people generating content, many people consuming content
- Huge quantities of users and data
- Free, global and spontaneous information, experiences and opinions
- Blogs are a source of knowledge, but the data is raw
- First of all, we have to identify them
Research Objective
Obtain a self-contained representation of Web pages that an inductive learning process can use to identify Blogs accurately, independently of content, style, author and language.
Research approach: Visual Characteristics of the Blogs
- What are Blogs like?
- Heterogeneous in content, themes and styles
- Many different languages
- Technology vs. Web 2.0 concept
Research approach: Features of the representation
- Machine learning / inductive learning
- 14 features from content and structure
- Frequency of occurrence of some entities
- Ratios between frequencies of occurrence of some entities
Research approach: Features of the representation
- “blog” in URL
- “blog” in document
- “post” in document
- “rss” or “atom”
- Comments vs. Dates
- Comments in link vs. Dates
- Comments in link vs. Comments
- Comments vs. Headlines
- Comments in link vs. Headlines
- Dates vs. Headlines
- Blogroll?
- Links same domain vs. Links different domain
- Links different domain vs. Total links
- Links blogroll vs. Links page
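A few of the listed features can be sketched in code. The slides do not say how each entity is detected, so everything below is an assumption made for illustration: the names `LinkCollector`, `ratio` and `extract_features` are hypothetical, keyword features are plain substring checks, feed detection looks only at `<link type=...>`, and "same domain" means an identical host name. The comment, date, headline and blogroll features are left out because their extraction is not described.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects <a href> targets and notes whether a feed <link> is declared."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.has_feed = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])
        if tag == "link":
            link_type = (attrs.get("type") or "").lower()
            if "rss" in link_type or "atom" in link_type:
                self.has_feed = True


def ratio(a, b):
    """Returns a / b, or 0.0 when the denominator is zero (assumed convention)."""
    return a / b if b else 0.0


def extract_features(url, html):
    """Computes an illustrative subset of the 14 features for one page."""
    parser = LinkCollector()
    parser.feed(html)
    text = html.lower()
    page_host = urlparse(url).netloc.lower()

    # Resolve relative links and keep only http(s) targets.
    hosts = []
    for href in parser.hrefs:
        parsed = urlparse(urljoin(url, href))
        if parsed.scheme in ("http", "https"):
            hosts.append(parsed.netloc.lower())

    same = sum(1 for host in hosts if host == page_host)
    different = len(hosts) - same
    return {
        "blog_in_url": "blog" in url.lower(),
        "blog_in_document": "blog" in text,
        "post_in_document": "post" in text,
        "rss_or_atom": parser.has_feed,
        "same_vs_different_links": ratio(same, different),
        "different_vs_total_links": ratio(different, len(hosts)),
    }
```

The result is a fixed-length record per page, which is what lets the same classifier be trained on pages in any language.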
Experimental Results: Test Collection
- DMOZ ODP
- 4 different languages
- Blog / No-Blog
- No-Blog: many different categories (arts, business, computers…)
Experimental Results: Evaluation framework
- 4 different classifiers: Naïve Bayes, BayesNet, Support Vector Machines, Decision Trees
- Training based on accuracy
- Cross-validation
- F-measure
- Student's t-test, H0: all representations have the same performance
- Interval for the true error
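The quantities named in the evaluation framework can be sketched as follows. The slides name the F-measure, a Student's t-test and an error interval but give no formulas, so the choices here are assumptions: the balanced F-measure (F1), a normal-approximation interval for the true error at roughly 95% confidence, and a paired t statistic over per-fold scores. All function names are hypothetical.

```python
import math


def f_measure(tp, fp, fn):
    """Balanced F-measure (F1) from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def error_interval(errors, n, z=1.96):
    """Normal-approximation interval for the true error rate.

    errors: number of misclassified test examples, n: test set size,
    z: normal quantile (1.96 corresponds to about 95% confidence).
    """
    e = errors / n
    half_width = z * math.sqrt(e * (1 - e) / n)
    return max(0.0, e - half_width), min(1.0, e + half_width)


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two representations scored on the same folds."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    mean = sum(diffs) / k
    variance = sum((d - mean) ** 2 for d in diffs) / (k - 1)
    if variance == 0:
        raise ValueError("all fold differences are identical; t is undefined")
    return mean / math.sqrt(variance / k)
```

The t statistic would then be compared against the critical value of the t distribution with k - 1 degrees of freedom to decide whether to reject H0.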
Experimental Results: Baseline and other representations
4 different representations + majority baseline
- BoW: Bag of Words
- Google Blog Search
- NITLE Project
- CRX: our representation
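For reference, a majority baseline always predicts the most frequent class seen in training, so any useful representation has to beat its accuracy. The sketch below assumes labels are plain strings; the function name is hypothetical and the labels in the usage note are made up for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Predicts the most frequent training label for every test example."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return majority_label, correct / len(test_labels)
```

For example, with training labels `["blog", "no-blog", "no-blog"]` the baseline predicts `"no-blog"` for every test page.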
Experimental Results: Results
We reject H0 -> CRX significantly improves the classification
Experimental Results: Discussion
- BoW methods
  - High dimensionality: multilingualism and themes
  - Performance decreases
- Google Blog Search
  - Does not distinguish between Blogs and pages with subscription: newsgroups, wikis or forums
- NITLE
  - Logic rules vs. inductive learning
  - Personal Web page created with WordPress
  - Blog created programmatically and hosted on its own domain
  - Based on the current technology
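The dimensionality point about Bag of Words can be illustrated with a toy sketch: each new language or theme adds its own words to the vocabulary, so the vector length grows with the collection, whereas a fixed feature set stays the same size. The tokenizer and function names below are hypothetical and far simpler than a real BoW pipeline.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercases the text and splits it into word tokens."""
    return re.findall(r"\w+", text.lower())


def bag_of_words(docs):
    """Builds a shared vocabulary and one term-count vector per document."""
    vocabulary = sorted({token for doc in docs for token in tokenize(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        vectors.append([counts[token] for token in vocabulary])
    return vocabulary, vectors
```

With one English and one Spanish document, only the shared word "blog" overlaps and every other word adds a new dimension.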
Conclusions
- We have created a test collection:
  - 4 different languages
  - 2 different classes: Blog / No-Blog
- We have experimented with:
  - 4 different representations
  - 4 different methods of inductive learning
- We have obtained better results:
  - F values of 0.920
  - Error interval lower than 2%
- Our representation
  - The concept of Blog vs. underlying technology
  - Prioritizes Blogs with reviews and comments
- Conclusion
  - Identification independent of content, style, author and language
Future work
- Deal with new languages
- Strengthen entity extraction
- Include temporal analysis
- Include extra rules
Applications
Searching for information and opinions about:
- Products and services
- Customers and providers
- Competitors
Contact
Thank you
You can contact us:
Resources:
http://www.wikrplusd.com