Exploring in the Weblog Space by Detecting Informative and Affective Articles
Xiaochuan Ni, Gui-Rong Xue, Xiao Ling, Yong Yu (Shanghai Jiao-Tong University)
Qiang Yang (Hong Kong University of Science and Technology)
WWW 2007

Introduction
Unique characteristics of blogs
– Mainly maintained by individuals, so the contents are generally personal
– The link structures between blogs generally form localized communities
Ongoing research on blogs
– Content-based analysis
– Evolution of blog communities
– Tools that help users retrieve, organize and analyze blogs

Introduction – Genres in Blog Content
Affective
– Online diaries in which people share their daily lives publicly and express their feelings, thoughts or emotions
Informative
– Topic-oriented; the topic can be related to a hobby or to the author's profession or business

Introduction – the Problem and the Approach
The problem
– Separating informative articles from affective articles in blogs
The approach
– Considering the problem as binary classification
– Challenges
  – The definitions of informative and affective articles
  – The training corpus for both categories
  – The machine learning algorithm

Introduction – Studies in the Weblog Space
Emotion and topic classification of blog articles
– Improve the effectiveness of emotion classification by filtering out informative articles
Blog search
– An intent-driven blog search engine is proposed that re-ranks search results according to their informativeness scores
Automatic detection of high-quality blogs
– Measure the quality of a blog by the percentage of its articles that are informative

Definition of Informative and Affective Articles
A survey was conducted among users who regularly take part in blog activities
Contents of informative articles include:
– News similar to the news on traditional news websites
– Technical descriptions, e.g. programming techniques
– Commonsense knowledge
– Objective comments on world events
Contents of affective articles include:
– Diaries about personal affairs
– Descriptions of the author's own feelings or emotions

Algorithms
Classification algorithms
– Naïve Bayes Classifier (NB)
– Support Vector Machine (SVM)
– Rocchio Classifier
Feature selection algorithms
– Information Gain (IG)
– χ² statistic (CHI)

Classification Algorithm – Naïve Bayes Classifier
Laplace smoothing is applied to overcome the zero-frequency problem
$$P(c \mid d) = \frac{P(c)\,P(d \mid c)}{P(d)}$$
$$c^{*} = \arg\max_{c} \Big\{ P(c) \prod_{j=1}^{K} P(w_j \mid c) \Big\}$$
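
As a rough illustration of the Naïve Bayes decision rule with Laplace smoothing, the following minimal sketch works in log space to avoid underflow. It assumes a multinomial word-count model and documents given as token lists; neither detail is stated on the slides, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Estimate log-priors and Laplace-smoothed log-likelihoods from tokenized docs."""
    vocab = {w for doc in docs for w in doc}
    log_prior, log_lik = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values())
        # Laplace smoothing: add 1 to every count so that a word never seen
        # in class c still gets a non-zero probability.
        log_lik[c] = {
            w: math.log((counts[w] + 1) / (total + len(vocab))) for w in vocab
        }
    return log_prior, log_lik


def classify_nb(doc, log_prior, log_lik):
    """Return argmax_c of log P(c) + sum_j log P(w_j | c).

    Words outside the training vocabulary are skipped (an assumption).
    """
    def score(c):
        return log_prior[c] + sum(log_lik[c][w] for w in doc if w in log_lik[c])

    return max(log_prior, key=score)


# Toy, made-up training data (not from the experiments in the slides).
docs = [
    ["stock", "market", "report"],
    ["software", "release", "report"],
    ["feel", "happy", "today"],
    ["feel", "sad", "today"],
]
labels = ["informative", "informative", "affective", "affective"]
log_prior, log_lik = train_nb(docs, labels)
print(classify_nb(["feel", "tired", "today"], log_prior, log_lik))
```

Without the `+ 1` in the numerator, any class that never saw one of the document's words would receive probability zero, which is the zero-frequency problem the slide refers to.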
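
Classification Algorithm – Rocchio Classifier
Category profile based classifier
$$\vec{c}_j = \alpha \frac{1}{|c_j|} \sum_{\vec{d} \in c_j} \frac{\vec{d}}{\|\vec{d}\|} - \beta \frac{1}{|D - c_j|} \sum_{\vec{d} \in D - c_j} \frac{\vec{d}}{\|\vec{d}\|}$$
where $|c_j|$ is the number of documents in category $c_j$, $D$ is the whole training set, $\alpha$ and $\beta$ weight the positive and negative examples, and $\vec{d}$ denotes a document with its terms weighted by TF-IDF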
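
A category-profile (Rocchio) classifier of this kind can be sketched as follows. The sketch assumes length-normalized TF-IDF vectors, cosine similarity for the final decision, and illustrative weights `alpha=1.0` and `beta=0.25`; none of these settings are given on the slides, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Inverse document frequency, log(N / df), from tokenized training docs."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return {w: math.log(n / df[w]) for w in df}


def tfidf_unit_vector(doc, idf):
    """TF-IDF vector of a document, scaled to unit length (d / ||d||)."""
    tf = Counter(w for w in doc if w in idf)
    vec = {w: count * idf[w] for w, count in tf.items()}
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {w: x / norm for w, x in vec.items()} if norm else {}


def rocchio_profile(vectors, labels, category, alpha=1.0, beta=0.25):
    """Weighted mean of in-category vectors minus weighted mean of the rest."""
    pos = [v for v, y in zip(vectors, labels) if y == category]
    neg = [v for v, y in zip(vectors, labels) if y != category]
    profile = defaultdict(float)
    for v in pos:
        for w, x in v.items():
            profile[w] += alpha * x / len(pos)
    for v in neg:
        for w, x in v.items():
            profile[w] -= beta * x / len(neg)
    return dict(profile)


def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def classify_rocchio(doc_vec, profiles):
    """Assign the category whose profile is most similar to the document."""
    return max(profiles, key=lambda c: cosine(doc_vec, profiles[c]))


# Toy, made-up data for illustration only.
docs = [
    ["stock", "market", "report"],
    ["software", "release", "report"],
    ["feel", "happy", "today"],
    ["feel", "sad", "today"],
]
labels = ["informative", "informative", "affective", "affective"]
idf = fit_idf(docs)
vectors = [tfidf_unit_vector(d, idf) for d in docs]
profiles = {c: rocchio_profile(vectors, labels, c) for c in set(labels)}
print(classify_rocchio(tfidf_unit_vector(["feel", "happy"], idf), profiles))
```

Profiles may contain negative weights here; some Rocchio variants clip them to zero, which this sketch does not do.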
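
Feature Selection Algorithms
Information Gain (IG)
$$G(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})$$
χ² statistic (CHI)
$$\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$
$$\chi^2_{avg}(t) = \sum_{i=1}^{m} P(c_i)\,\chi^2(t, c_i)$$
where $A$ is the number of documents in category $c$ that contain term $t$, $B$ the number outside $c$ that contain $t$, $C$ the number in $c$ that do not contain $t$, $D$ the number outside $c$ that do not contain $t$, and $N$ the total number of documents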
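
Both feature-selection scores, information gain and the category-averaged χ² statistic, can be computed directly from document counts. The sketch below assumes each document is represented as a set of terms (presence or absence only) and uses natural logarithms; the function names and toy data are hypothetical.

```python
import math


def information_gain(docs, labels, term):
    """G(t): class entropy minus the expected entropy after splitting on the term."""
    n = len(docs)
    classes = set(labels)

    def plogp(p):
        return p * math.log(p) if p > 0 else 0.0

    with_t = [y for doc, y in zip(docs, labels) if term in doc]
    without_t = [y for doc, y in zip(docs, labels) if term not in doc]
    gain = -sum(plogp(labels.count(c) / n) for c in classes)
    for subset in (with_t, without_t):
        if subset:
            p_subset = len(subset) / n  # P(t) or P(not t)
            gain += p_subset * sum(
                plogp(subset.count(c) / len(subset)) for c in classes
            )
    return gain


def chi_square(docs, labels, term, category):
    """Chi-square statistic of a term and a category from the 2x2 contingency table."""
    a = b = c = d = 0
    for doc, y in zip(docs, labels):
        if term in doc and y == category:
            a += 1  # in category, contains term
        elif term in doc:
            b += 1  # outside category, contains term
        elif y == category:
            c += 1  # in category, lacks term
        else:
            d += 1  # outside category, lacks term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def chi_square_avg(docs, labels, term):
    """Category-prior-weighted average of the chi-square statistic."""
    n = len(docs)
    return sum(
        (labels.count(c) / n) * chi_square(docs, labels, term, c)
        for c in set(labels)
    )


# Toy, made-up data for illustration only.
docs = [{"stock", "report"}, {"software", "report"}, {"feel", "happy"}, {"feel", "report"}]
labels = ["informative", "informative", "affective", "affective"]
for term in ("feel", "report"):
    print(term, information_gain(docs, labels, term), chi_square_avg(docs, labels, term))
```

In this toy data "feel" separates the two categories perfectly, so it scores higher than "report" under both measures.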

Experiment Data
5,000 articles crawled from MSN Space
– 3,547 are labeled as affective and 1,109 as informative; the remaining 344 were filtered out because of encoding problems
2,200 articles from the Sohu.com Directory, used as informative articles
– News, commonsense knowledge or objective comments on 22 different topics
Table 1. Statistics of Data Set

Experiment – Comparing Classification Algorithms
Table 2. Performances of three classification algorithms

Comparing Feature Selection Algorithms
Table 3. Performances on different feature sets

Representative Features
Table 4. Top 20 representative features of each category

Study on Emotion and Topic Classification
Assume that informative articles do not express personal emotions
– Extracting affective articles can help to build a corpus of purely emotional articles
Figure 1. Two-step approach for topic and emotion classification
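
One way to read the two-step approach is as a routing step followed by a genre-specific classifier. The sketch below assumes that affective articles are passed to an emotion classifier and informative articles to a topic classifier; the slides do not spell out this routing, and the three classifier functions are toy stand-ins, not the models used in the experiments.

```python
def two_step_classify(articles, is_affective, classify_emotion, classify_topic):
    """Step 1: split by genre. Step 2: apply the genre-specific classifier."""
    results = []
    for article in articles:
        if is_affective(article):
            results.append((article, "affective", classify_emotion(article)))
        else:
            results.append((article, "informative", classify_topic(article)))
    return results


# Hypothetical keyword-based stand-ins for trained classifiers.
def toy_is_affective(text):
    return "feel" in text.split()


def toy_emotion(text):
    return "positive" if "happy" in text.split() else "negative"


def toy_topic(text):
    return "technology" if "software" in text.split() else "other"


routed = two_step_classify(
    ["i feel happy today", "new software release notes"],
    toy_is_affective,
    toy_emotion,
    toy_topic,
)
for article, genre, label in routed:
    print(genre, label, "-", article)
```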

Experiment on Emotion Classification
Data
– Training: 2,494 blog articles manually labeled with one of two emotion tendencies, positive or negative
– Testing: 1,303 articles from 75 blogs in MSN Space
Table 5. Data set used for emotion classification

Experiment Result on Emotion Classification
The information-affectiveness classification is either applied before the binary emotion classifier (I-Approach) or not applied (II-Approach)
Table 6. Comparison results for two emotion classification approaches

Study on Intent-driven Weblog Search Engine
Blog search is currently at the same stage as general Web search
Intent-driven search (re-rank)
$$S_{mixed} = \lambda \cdot S_{if} + (1 - |\lambda|) \cdot S_{origin}$$
where $S_{if}$ is a confidence value between -1 (strong affective intent) and 1 (strong informative intent), and $S_{origin}$ is the original relevance score
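
The mixed re-ranking score can be illustrated with a short sketch. It assumes that λ lies in [-1, 1] (implied by the |λ| term but not stated on the slide), so that positive values favor informative articles and negative values favor affective ones; the result tuples and scores are made up.

```python
def mixed_score(s_if, s_origin, lam):
    """S_mixed = lam * S_if + (1 - |lam|) * S_origin."""
    if not -1.0 <= lam <= 1.0:
        raise ValueError("lam is assumed to lie in [-1, 1]")
    return lam * s_if + (1 - abs(lam)) * s_origin


def rerank(results, lam):
    """Sort (doc_id, s_if, s_origin) tuples by mixed score, best first."""
    return sorted(
        results, key=lambda r: mixed_score(r[1], r[2], lam), reverse=True
    )


# Made-up search results: (doc_id, S_if, S_origin).
results = [("diary", -0.8, 0.9), ("tutorial", 0.7, 0.6)]
for lam in (0.0, 0.5, -0.5):
    print(lam, [doc_id for doc_id, _, _ in rerank(results, lam)])
```

With λ = 0 the original relevance order is kept; moving λ toward 1 or -1 shifts weight from relevance to the informative or affective intent.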

Analysis for the Distribution of Two Genres of Articles
Figure 2. Distribution of informative articles and affective articles on 99,059 blog articles

Detecting High-quality Blogs
Figure 3. Distribution of blogs with different levels of quality on 6,319 blogs
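
The blog-quality measure used in this study, the percentage of a blog's articles that are informative, reduces to a simple ratio once every article has a predicted genre. The sketch below assumes labels are the strings "informative" and "affective" and returns 0.0 for an empty blog; both choices, and the example blogs, are hypothetical.

```python
def informative_ratio(article_labels):
    """Fraction of a blog's articles labeled informative (0.0 for an empty blog)."""
    if not article_labels:
        return 0.0
    informative = sum(1 for y in article_labels if y == "informative")
    return informative / len(article_labels)


# Made-up per-blog predictions for illustration.
blogs = {
    "tech-notes": ["informative", "informative", "affective", "informative"],
    "my-diary": ["affective", "affective", "affective"],
}
for name, article_labels in blogs.items():
    print(name, informative_ratio(article_labels))
```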

Conclusion and Future Work
The task of separating informative and affective articles is addressed and treated as a binary classification task.
The applications of this information-affectiveness classification are studied, including emotion classification, intent-driven blog search and high-quality blog detection.
Future work: 1) building a much larger data set by using semi-supervised learning techniques; 2) applying the existing approach to data in other languages