anticipating discussion activity on community forums

Anticipating Discussion Activity on Community Forums Matthew Rowe , Sofia Angeletou and Harith Alani Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom The Third IEEE International Conference on Social Computing. MIT, Boston, USA. 2011

Upload: matthew-rowe

Post on 29-Nov-2014


Page 1: Anticipating Discussion Activity on Community Forums

Anticipating Discussion Activity on Community Forums

Matthew Rowe, Sofia Angeletou and Harith Alani

Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom

The Third IEEE International Conference on Social Computing. MIT, Boston, USA. 2011

Page 2: Anticipating Discussion Activity on Community Forums


Community Content

• Online communities are now used to:
  – Ask questions
  – Post opinions and ideas
  – Discuss events and current issues

• Content analysis in online communities is attractive for:
  – Market analysis
  – Brand consensus and product opinion

• Social network analytics in the US is predicted to reach $1 billion by 2014 (Forrester 2009)

• Masses of data are now being published in online communities:
  – Facebook has more than 60 million status updates per day (Facebook statistics 2010)

Page 3: Anticipating Discussion Activity on Community Forums


Page 4: Anticipating Discussion Activity on Community Forums


The Need for Analysis

• Analysts need to know which piece of content will generate the most activity
  – i.e. the most auspicious or influential
  – Helps focus the attention of human and computerised analysts
    • What to track?

• Need to understand the effect features (community and content) have on attention to content

• Enable content creators to shape their content in order to maximise impact
  – E.g. promoters, government policy makers

RQ1: Which features are key to stimulating discussions?

RQ2: How do these features influence discussion length?

Page 5: Anticipating Discussion Activity on Community Forums


Outline

• Anticipating Discussion Activity: Approach Overview
  – Identifying Seed Posts
  – Predicting Discussion Activity
• Features
• Dataset
  – Community Message Board: Boards.ie
• 1. Identifying Seed Posts
• 2. Predicting Discussion Activity
• Findings
• Conclusions

Page 6: Anticipating Discussion Activity on Community Forums


Approach Overview

• Two-stage approach to predict discussion activity in online communities:

1. Identify seed posts
  • i.e. thread starters that yield a reply
  • Will a given post start a discussion?
  • What are the properties that seed posts exhibit?
    – What parameters tend to trigger a discussion?

2. Predict discussion activity levels
  • From the identified seed posts
  • What is the level of discussion that a seed post will generate?
  • What features correlate with heightened discussion activity?
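The two stages can be read as a filter followed by a ranking. The sketch below is only an illustration of that control flow, not the authors' implementation: `is_seed` and `predict_activity` are hypothetical stand-ins for a trained classifier and a trained regression model, and the toy feature names and thresholds are assumptions.

```python
from typing import Callable, Dict, List

Post = Dict[str, float]  # hypothetical representation: feature name -> value


def rank_by_expected_activity(
    posts: List[Post],
    is_seed: Callable[[Post], bool],
    predict_activity: Callable[[Post], float],
) -> List[Post]:
    """Stage 1: keep posts predicted to be seeds.
    Stage 2: order those seeds by predicted discussion activity, highest first."""
    seeds = [post for post in posts if is_seed(post)]
    return sorted(seeds, key=predict_activity, reverse=True)


# Toy stand-ins for the two models (assumptions, for illustration only).
posts = [
    {"id": 1, "referral_count": 0, "complexity": 2.5},
    {"id": 2, "referral_count": 9, "complexity": 3.0},
    {"id": 3, "referral_count": 1, "complexity": 4.0},
]
ranked = rank_by_expected_activity(
    posts,
    is_seed=lambda post: post["referral_count"] < 5,
    predict_activity=lambda post: post["complexity"],
)
```

Keeping the two stages as separate callables means the regression model only ever sees posts that passed the seed filter, which mirrors the "from the identified seed posts" step.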

Page 7: Anticipating Discussion Activity on Community Forums


Features

• For each post, model: a) the author, b) the content and c) the topical concentration of the author

• F1: User Features
  – In-degree, out-degree: social network properties of the author
  – Post count, age, post rate: participation information of the author

• F2: Content Features
  – Post length, referral count, time in day: surface features of the post
  – Complexity: cumulative entropy of terms in the post
  – Readability: Gunning Fog index of the post
  – Informativeness: TF-IDF measure of terms within the post
  – Polarity: average sentiment of terms in the post
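Two of the content features can be sketched in a few lines. This is a minimal sketch under stated assumptions: complexity is computed here as the Shannon entropy of the post's term distribution (one plausible reading of "cumulative entropy of terms"), the tokeniser is a simple regular expression, and the syllable counter used by the Gunning Fog index is a crude vowel-group heuristic. None of these choices is confirmed by the slides.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokeniser (assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def term_entropy(text):
    """Shannon entropy (bits) of the term distribution within one post."""
    terms = tokenize(text)
    if not terms:
        return 0.0
    total = len(terms)
    counts = Counter(terms)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def count_syllables(word):
    """Rough syllable estimate: number of vowel groups, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def gunning_fog(text):
    """Gunning Fog index: 0.4 * (words per sentence + 100 * share of complex words).
    A word counts as complex when the heuristic finds three or more syllables."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize(text)
    if not sentences or not words:
        return 0.0
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    return 0.4 * (len(words) / len(sentences) + 100.0 * complex_words / len(words))
```

A post that repeats a single term has entropy 0, while a post whose terms are all distinct reaches the maximum for its length, which matches the reading of complexity as "more diverse language".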

Page 8: Anticipating Discussion Activity on Community Forums


Features (2)

• F3: Focus Features
  – Topic entropy: the concentration of the author across community forums
    • Higher entropy indicates a wider spread of forum activity
    • More random distribution, less concentrated
  – Topic likelihood: the likelihood that a user posts in a specific forum given his post history
    • Measures the affinity that a user has with a given forum
    • Lower likelihood indicates a user posting on an unfamiliar topic
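Both focus features can be derived from the list of forums an author has posted in. The sketch below assumes a maximum-likelihood estimate (relative frequency) for the likelihood and Shannon entropy in bits for the spread; the function names and the list-of-forum-names input are hypothetical, and the original work may use a smoothed or differently normalised variant.

```python
import math
from collections import Counter


def forum_entropy(history):
    """Entropy (bits) of an author's posts across forums.
    `history` is a list of forum names, one entry per past post."""
    counts = Counter(history)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def forum_likelihood(history, forum):
    """Relative frequency with which the author has posted in `forum`."""
    if not history:
        return 0.0
    return history.count(forum) / len(history)


history = ["Rugby", "Rugby", "Rugby", "World of Warcraft"]
focus = forum_entropy(history)                  # low value: concentrated author
familiar = forum_likelihood(history, "Rugby")   # high value: familiar forum
unfamiliar = forum_likelihood(history, "Japanese Culture")  # zero: never posted there
```

An author who posts in one forum only has entropy 0; spreading the same number of posts evenly over more forums raises the entropy and lowers each forum's likelihood.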

Page 9: Anticipating Discussion Activity on Community Forums


Dataset: Boards.ie

• Irish community message board that was established in 1998

• Covers a wide array of topics and themes in forums
  – E.g. World of Warcraft, Japanese Culture, Rugby

• We were provided with the complete dataset spanning 1998-2008 of all posts and forum information
  – Focussed on 2006 due to the scale of the entire dataset

• No explicit social connections exist in the dataset
  – Social network features were built from the reply-to graph

• 6-month window prior to the post date was used to build the user and focus features
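Deriving in-degree and out-degree from a reply-to graph inside a fixed window could look like the following. This is a sketch under assumptions: replies are given as hypothetical `(replier, replied_to_author, timestamp)` tuples, in-degree counts distinct users who replied to the author, out-degree counts distinct users the author replied to, and six months is approximated as 182 days.

```python
from datetime import datetime, timedelta


def degree_features(replies, author, post_date, window_days=182):
    """Return (in_degree, out_degree) for `author` from replies made in the
    `window_days` before `post_date` (the post date itself is excluded)."""
    window_start = post_date - timedelta(days=window_days)
    repliers = set()    # users who replied to the author
    replied_to = set()  # users the author replied to
    for replier, target, when in replies:
        if not (window_start <= when < post_date):
            continue  # reply falls outside the feature window
        if target == author and replier != author:
            repliers.add(replier)
        if replier == author and target != author:
            replied_to.add(target)
    return len(repliers), len(replied_to)


replies = [
    ("bob", "alice", datetime(2006, 5, 1)),
    ("carol", "alice", datetime(2006, 5, 2)),
    ("alice", "bob", datetime(2006, 4, 1)),
    ("dave", "alice", datetime(2005, 1, 1)),  # too old, ignored
]
alice_degrees = degree_features(replies, "alice", datetime(2006, 6, 1))
```

Restricting the window to replies before the post date keeps the features free of information that would only be available after the post was made.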

Page 10: Anticipating Discussion Activity on Community Forums


1. Identifying Seed Posts

• Will a given post start a discussion?
• What are the properties that seed posts exhibit?

• Experiment Setup:
  – Used all thread starter posts from Boards.ie in 2006
  – Training/validation/testing sets using a 70/20/10% random split
  – Binary classification task: Is this a seed post or not?
  – Measures: precision, recall, f-measure, area under ROC curve

• Performed 2 experiments:
  – a) Model Selection
    • Tested individual feature sets (user, content, focus) and combinations
  – b) Feature Assessment
    • Dropping 1 feature at a time, record reduction in f-measure
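The experimental setup can be sketched as three small helpers: a seeded 70/20/10 random split, precision/recall/F-measure for the binary seed label, and a drop-one-feature loop. The function names are hypothetical, and `evaluate_f1` is an assumed callable that trains a classifier on the given feature list and returns its F-measure; the slides do not specify the classifier at this point.

```python
import random


def split_70_20_10(items, seed=0):
    """Shuffle with a fixed seed, then cut into train/validation/test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 20 // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def precision_recall_f1(y_true, y_pred):
    """Scores for the positive class (1 = seed post, 0 = non-seed)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def f1_reduction_per_feature(features, evaluate_f1):
    """Drop one feature at a time and record how far the F-measure falls."""
    baseline = evaluate_f1(features)
    return {
        dropped: baseline - evaluate_f1([f for f in features if f != dropped])
        for dropped in features
    }
```

A larger reduction for a dropped feature suggests that the full model relied on it more heavily, which is the reading used for feature assessment here.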

Page 11: Anticipating Discussion Activity on Community Forums


1.a) Model Selection

Page 12: Anticipating Discussion Activity on Community Forums


1.b) Feature Assessment

Page 13: Anticipating Discussion Activity on Community Forums


1.b) Feature Assessment

Page 14: Anticipating Discussion Activity on Community Forums


2. Predicting Discussion Activity

• What is the level of discussion that a seed post will generate?
• What features correlate with heightened discussion activity?

• Experiment Setup:
  – Train: seed posts in 70% training split
  – Test: seed posts in 20% validation split
  – Measure: Normalised Discounted Cumulative Gain (nDCG)
    • Look at varying rank positions: nDCG@k, k=1,2,5,10,20,50,100

• Performed 2 experiments
  – a) Model Selection
    • Regression models: Linear, Isotonic, Support Vector Regression
    • Tested individual feature sets (user, content, focus) and combinations
  – b) Feature Contributions
    • Assess the features in the best performing model from a)
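nDCG@k compares the ranking induced by the predicted activity levels with the ideal ranking by actual activity. The sketch below assumes the common formulation with gain equal to the actual reply count and a log2(rank + 1) discount; the slides do not state which nDCG variant was used, so treat the exact gain and discount as assumptions.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions (ranks start at 1)."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(actual, predicted, k):
    """nDCG@k for posts with `actual` activity levels ranked by `predicted` scores."""
    order = sorted(range(len(actual)), key=lambda i: predicted[i], reverse=True)
    ranked_gains = [actual[i] for i in order]
    ideal_gains = sorted(actual, reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


actual_replies = [10, 2, 5]          # toy reply counts for three seed posts
predicted_scores = [0.9, 0.5, 0.1]   # toy regression outputs
scores = {k: ndcg_at_k(actual_replies, predicted_scores, k) for k in (1, 2, 3)}
```

Because the measure is computed on the ordering only, a regression model can score well even when its absolute predictions are off, provided it ranks the most active seed posts near the top.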

Page 15: Anticipating Discussion Activity on Community Forums


2.a) Model Selection

Page 16: Anticipating Discussion Activity on Community Forums


2.a) Model Selection

[Figure panels: Linear, Isotonic, Support Vector Regression]

Page 17: Anticipating Discussion Activity on Community Forums


2.b) Feature Contributions

• What features correlate with heightened discussion activity?

Page 18: Anticipating Discussion Activity on Community Forums


Findings

RQ1: Which features are key to stimulating discussions?
• Having many URLs in a post can negatively impact discussion activity
  – Could associate the post with spam content
• Seed posts are associated with greater forum likelihood
• Lower informativeness is associated with seed posts
  – i.e. seeds use language that is familiar to the community

RQ2: How do these features influence discussion length?
• Lower forum entropy = heightened discussion activity
• Greater complexity = heightened discussion activity
  – i.e. include more diverse language in the post
• Increased activity can be expected from an increase in forum likelihood coupled with a decrease in forum entropy
• Negative sentiment posts generate more activity

Page 19: Anticipating Discussion Activity on Community Forums


Conclusions and Future Work

• The two-stage approach is able to:
  – Identify seed posts to a high degree of accuracy
    • F-measure: 0.792
  – Predict discussion activity levels
    • nDCG@1: 0.89 (linear regression model)
    • Content and focus features yield the best performing model
      – Average nDCG@k: 0.756

• Findings inform:
  – Market analysts to track high activity posts from the outset
  – Content creators to shape content in order to maximise impact

• Currently applying approach over different platforms:
  – How can we predict activity on a given social web system?
  – How do social web systems differ in how they generate activity?

Page 20: Anticipating Discussion Activity on Community Forums


Questions?

Web: http://people.kmi.open.ac.uk/rowe

Email: [email protected]

@mattroweshow