Exploring in the Weblog Space by Detecting Informative and Affective Articles

Xiaochuan Ni, Gui-Rong Xue, Xiao Ling, Yong Yu (Shanghai Jiao-Tong University); Qiang Yang (Hong Kong University of Science and Technology). WWW2007

Page 1: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Exploring in the Weblog Space by Detecting Informative and Affective Articles

Xiaochuan Ni, Gui-Rong Xue, Xiao Ling, Yong Yu
Shanghai Jiao-Tong University

Qiang Yang
Hong Kong University of Science and Technology

WWW2007

Page 2: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Introduction

Unique characteristics of blogs
– Mainly maintained by individual persons, and thus the contents are generally personal
– The link structures between blogs generally form localized communities

Ongoing research on blogs
– Content-based analysis
– Blog communities’ evolution
– Different kinds of tools to help users retrieve, organize and analyze the blogs

Page 3: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Introduction – Genres in Blog’s Content

Affective
– The online diary by which people share their daily life publicly and express their feelings, thoughts or emotions through the blogs

Informative
– Topic-oriented; the topic can be related to a hobby or the author’s profession or business

Page 4: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Introduction – the Problem and the Approach

The problem
– Separating informative articles from affective articles in blogs.

The approach
– Considering the problem as binary classification
– Challenges
    – The definitions of the informative articles and the affective articles
    – The training corpus for both categories
    – The machine learning algorithm

Page 5: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Introduction – Studies in the Weblog Space

Emotion and topic classification of blog articles
– To improve the effectiveness of emotion classification by filtering out informative articles

Blog search
– An intent-driven blog-search engine is proposed to re-sort the search results by considering their informative-value scores.

Automatic detection of high-quality blogs
– To measure the quality of a blog by calculating the percentage of informative articles
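The quality measure in the last bullet is a simple ratio. A minimal sketch, assuming each article of a blog already carries a predicted label; the function name and the label strings are hypothetical and not taken from the slides:

```python
def informative_ratio(labels):
    """Fraction of a blog's articles that are labeled 'informative'.

    `labels` is a list of per-article labels for one blog, e.g.
    ["informative", "affective", ...]. An empty blog scores 0.0 (an
    assumption made here to avoid dividing by zero).
    """
    if not labels:
        return 0.0
    informative = sum(1 for label in labels if label == "informative")
    return informative / len(labels)


# Usage: a blog with two informative articles out of four.
blog_labels = ["informative", "affective", "informative", "affective"]
quality = informative_ratio(blog_labels)
```

How this ratio is binned into quality levels is not stated on the slide, so no threshold is assumed here.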

Page 6: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Definition of Informative and Affective Articles

A survey is done among the users who usually participate in the activities in blogs

Contents of informative articles include:
– News that is similar to the news on traditional news websites
– Technical descriptions, e.g. programming techniques
– Commonsense knowledge
– Objective comments on the events in the world

Contents of affective articles include:
– Diaries about personal affairs
– Descriptions of personal feelings or emotions

Page 7: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Algorithms

Classification algorithms
– Naïve Bayes Classifier (NB)
– Support Vector Machine (SVM)
– Rocchio Classifier

Feature selection algorithms
– Information Gain (IG)
– χ² statistic (CHI)

Page 8: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Classification Algorithm – Naïve Bayes Classifier

$$P(c \mid d) = \frac{P(c)\,P(d \mid c)}{P(d)}$$

$$c^{*} = \arg\max_{c} \Big\{ P(c) \prod_{j=1}^{K} P(w_j \mid c) \Big\}$$

Laplace smoothing is applied to overcome the zero-frequency problem
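A minimal sketch of a multinomial Naïve Bayes classifier with Laplace (add-one) smoothing may make this decision rule concrete. It assumes documents are already tokenized into word lists; the function names and the toy data are illustrative and do not come from the slides. Log-probabilities are summed instead of multiplying probabilities, which gives the same arg max while avoiding underflow.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Estimate log P(c) and Laplace-smoothed log P(w|c) from tokenized docs."""
    vocab = {w for doc in docs for w in doc}
    class_sizes = Counter(labels)
    word_counts = {c: Counter() for c in class_sizes}
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_sizes.items()}
    log_like = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        # Add-one smoothing: every vocabulary word gets a nonzero probability.
        log_like[c] = {
            w: math.log((counts[w] + 1) / (total + len(vocab))) for w in vocab
        }
    return log_prior, log_like, vocab


def classify_nb(model, doc):
    """Return the class maximizing log P(c) + sum_j log P(w_j|c)."""
    log_prior, log_like, vocab = model
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in doc if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy usage with made-up documents.
docs = [
    ["stock", "market", "report"],
    ["market", "news"],
    ["i", "feel", "happy"],
    ["feel", "sad", "today"],
]
labels = ["informative", "informative", "affective", "affective"]
model = train_nb(docs, labels)
prediction = classify_nb(model, ["market", "report"])
```

Words unseen during training are skipped at prediction time; that is one common convention and an assumption of this sketch.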

Page 9: Exploring in the Weblog Space by Detecting Informative and Affective Articles

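Classification Algorithm – Rocchio Classifier

Category profile based classifier

$$\vec{c}_j = \alpha \frac{1}{|c_j|} \sum_{\vec{d} \in c_j} \frac{\vec{d}}{\|\vec{d}\|} - \beta \frac{1}{|D - c_j|} \sum_{\vec{d} \in D - c_j} \frac{\vec{d}}{\|\vec{d}\|}$$

where $|c_j|$ is the number of documents in the category $c_j$, $\vec{d}$ denotes a document with terms weighted by TF-IDF, $D$ is the full training set, and $\alpha$ and $\beta$ weight the in-category and out-of-category terms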
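A category profile of this kind can be sketched in a few lines: average the length-normalized vectors inside the category, subtract the average of those outside it, and assign a new document to the category whose profile is closest by cosine similarity. The sketch assumes documents arrive as sparse dictionaries of TF-IDF weights; the `alpha` and `beta` defaults of 1.0, the function names, and the toy vectors are assumptions for illustration only.

```python
import math


def normalize(vec):
    """Scale a sparse vector (dict term -> weight) to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def mean_of_normalized(vectors):
    """Average of the unit-length versions of the given vectors."""
    total = {}
    for vec in vectors:
        for t, v in normalize(vec).items():
            total[t] = total.get(t, 0.0) + v
    return {t: v / len(vectors) for t, v in total.items()}


def rocchio_profile(in_class, out_of_class, alpha=1.0, beta=1.0):
    """Profile = alpha * mean(in-class docs) - beta * mean(other docs)."""
    pos = mean_of_normalized(in_class)
    neg = mean_of_normalized(out_of_class)
    terms = set(pos) | set(neg)
    return {t: alpha * pos.get(t, 0.0) - beta * neg.get(t, 0.0) for t in terms}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    a, b = normalize(a), normalize(b)
    return sum(v * b.get(t, 0.0) for t, v in a.items())


def classify_rocchio(profiles, doc):
    """Pick the category whose profile is most similar to the document."""
    return max(profiles, key=lambda c: cosine(profiles[c], doc))


# Toy usage with made-up TF-IDF weights.
informative = [{"market": 2.0, "report": 1.0}, {"news": 1.0, "market": 1.0}]
affective = [{"feel": 2.0, "happy": 1.0}, {"sad": 1.0, "feel": 1.0}]
profiles = {
    "informative": rocchio_profile(informative, affective),
    "affective": rocchio_profile(affective, informative),
}
prediction = classify_rocchio(profiles, {"market": 1.0})
```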

Page 10: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Feature Selection Algorithms

Information Gain (IG)

$$G(t) = -\sum_{i=1}^{m} P(c_i) \log P(c_i) + P(t) \sum_{i=1}^{m} P(c_i \mid t) \log P(c_i \mid t) + P(\bar{t}) \sum_{i=1}^{m} P(c_i \mid \bar{t}) \log P(c_i \mid \bar{t})$$

χ² statistic (CHI)

$$\chi^2(t, c) = \frac{N \times (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$

$$\chi^2_{avg}(t) = \sum_{i=1}^{m} P(c_i)\, \chi^2(t, c_i)$$
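Both scores can be computed from document counts. The sketch below assumes the usual contingency-table reading of the χ² formula, which the slide itself does not spell out: A is the number of documents in category c that contain term t, B the number outside c that contain t, C the number in c without t, D the number outside c without t, and N = A + B + C + D. Function names and the toy counts are hypothetical.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square score of a term for one category from the 2x2 counts."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def chi_square_avg(term_counts, class_counts):
    """Average chi-square over categories, weighted by P(c_i).

    term_counts[c]  = documents of category c that contain the term
    class_counts[c] = documents in category c
    """
    n = sum(class_counts.values())
    n_t = sum(term_counts.get(c, 0) for c in class_counts)
    total = 0.0
    for cat, n_c in class_counts.items():
        a = term_counts.get(cat, 0)  # in category, contains term
        b = n_t - a                  # outside category, contains term
        c = n_c - a                  # in category, lacks term
        d = n - n_c - b              # outside category, lacks term
        total += (n_c / n) * chi_square(a, b, c, d)
    return total


def information_gain(term_counts, class_counts):
    """Information gain of a term, using natural logarithms."""
    def plogp(p):
        return p * math.log(p) if p > 0 else 0.0

    n = sum(class_counts.values())
    n_t = sum(term_counts.get(c, 0) for c in class_counts)
    gain = -sum(plogp(k / n) for k in class_counts.values())
    if n_t:
        gain += (n_t / n) * sum(
            plogp(term_counts.get(c, 0) / n_t) for c in class_counts
        )
    if n - n_t:
        gain += ((n - n_t) / n) * sum(
            plogp((class_counts[c] - term_counts.get(c, 0)) / (n - n_t))
            for c in class_counts
        )
    return gain


# Toy usage: a term that appears in every document of one category only.
class_counts = {"informative": 10, "affective": 10}
perfect_term = {"informative": 10, "affective": 0}
useless_term = {"informative": 5, "affective": 5}
ig_perfect = information_gain(perfect_term, class_counts)
ig_useless = information_gain(useless_term, class_counts)
```

A term that splits the categories perfectly gets the full class entropy as its gain, while a term spread evenly across categories scores zero under both measures.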

Page 11: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Experiment Data

5000 articles crawled from MSN space

3,547 of them are labeled as affective and 1,109 are labeled as informative, while the others are filtered because of the encoding problem

2,200 articles from Sohu.com Directory as informative articles
– News, commonsense knowledge or objective comments about 22 different topics

Table 1. Statistics of Data Set

Page 12: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Experiment – Comparing Classification Algorithms

Table 2. Performances of three classification algorithms

Page 13: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Comparing Feature Selection Algorithms

Table 3. Performances on different feature sets

Page 14: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Representative Features

Table 4. Top 20 representative features of each category

Page 15: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Study on Emotion and Topic Classification

Assume that informative articles do not express personal emotions
– Extracting affective articles can help to build a corpus with pure emotional articles

Figure 1. Two-step approach for topic and emotion classification

Page 16: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Experiment on Emotion Classification

Data
– Training: 2,494 blog articles are manually labeled into two emotion tendencies, positive and negative
– Testing: 1,303 articles from 75 blogs in MSN Space

Table 5. Data set used for emotion classification

Page 17: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Experiment Result on Emotion Classification

Before the binary emotion classifier, the information-affectiveness classification is used (I-Approach) or not (II-Approach)

Table 6. Comparison results for two emotion classification approaches
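The difference between the two approaches can be sketched as a single switch in a pipeline. This is a schematic only: `is_affective` and `emotion_of` stand in for trained classifiers, the keyword rules below are toy placeholders, and all names are hypothetical rather than taken from the slides.

```python
def classify_emotions(articles, is_affective, emotion_of, filter_first=True):
    """Label each article with an emotion, optionally filtering first.

    filter_first=True  mimics the I-Approach: articles judged informative
                       are skipped (labeled None) before emotion labeling.
    filter_first=False mimics the II-Approach: every article goes straight
                       to the emotion classifier.
    """
    results = []
    for article in articles:
        if filter_first and not is_affective(article):
            results.append((article, None))
        else:
            results.append((article, emotion_of(article)))
    return results


# Toy stand-ins for the two trained classifiers (assumptions, not real models).
def toy_is_affective(text):
    return "feel" in text


def toy_emotion_of(text):
    return "positive" if "happy" in text else "negative"


articles = ["i feel happy today", "market report for june", "i feel tired"]
with_filter = classify_emotions(articles, toy_is_affective, toy_emotion_of)
without_filter = classify_emotions(
    articles, toy_is_affective, toy_emotion_of, filter_first=False
)
```

With the filter on, the market report receives no emotion label; with it off, the emotion classifier is forced to assign one, which is the kind of error the first step is meant to remove.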

Page 18: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Study on Intent-driven Weblog Search Engine

Blog search is currently at the state of Web search

Intent-driven search (re-rank)

$$S_{mixed} = \lambda \cdot S_{if} + (1 - |\lambda|) \cdot S_{origin}$$

where $S_{if}$ is a confidence value between -1 (strong affective intent) and 1 (strong informative intent), and $S_{origin}$ is the original relevance score
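The re-ranking formula is easy to try out numerically. The sketch below assumes λ is restricted to [-1, 1], which the |λ| term suggests but the slide does not state, and represents each search result as a hypothetical `(title, s_if, s_origin)` tuple with made-up scores.

```python
def mixed_score(s_if, s_origin, lam):
    """S_mixed = lam * S_if + (1 - |lam|) * S_origin."""
    if not -1.0 <= lam <= 1.0:
        raise ValueError("lam is assumed to lie in [-1, 1]")
    return lam * s_if + (1 - abs(lam)) * s_origin


def rerank(results, lam):
    """Sort (title, s_if, s_origin) tuples by descending mixed score."""
    return sorted(
        results, key=lambda r: mixed_score(r[1], r[2], lam), reverse=True
    )


# Toy usage: one diary-like result and one news-like result.
results = [("diary", -0.9, 0.7), ("news", 0.9, 0.6)]
informative_first = [title for title, _, _ in rerank(results, 0.5)]
original_order = [title for title, _, _ in rerank(results, 0.0)]
affective_first = [title for title, _, _ in rerank(results, -0.5)]
```

With λ = 0 the original relevance ranking is kept; a positive λ promotes results with high informative confidence, and a negative λ promotes affective ones, because a negative λ multiplied by a negative $S_{if}$ contributes positively.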

Page 19: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Analysis for the Distribution of Two Genres of Articles

Figure 2. Distribution of informative articles and affective articles on 99,059 blog articles

Page 20: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Detecting High-quality Blogs

Figure 3. Distribution of blogs with different levels of quality on 6,319 blogs

Page 21: Exploring in the Weblog Space by Detecting Informative and Affective Articles

Conclusion and Future Work

The task of separating informative and affective articles is addressed and considered as a binary classification task.

The applications of the above information-affectiveness classification are studied, including emotion classification, intent-driven blog search and high-quality blog detection.

Future work: 1) building a much larger data set by using semi-supervised learning techniques; 2) applying the existing approach to data in other languages