ws3 2014 group project: x-post

Post on 01-Jul-2015

DESCRIPTION

Slides for the group project of Team 3 (Anca Dumitrache, Fabio Benedetti, Seyi Feyisetan), Web Science Summer School in Southampton, July 2014.

TRANSCRIPT

X-Post

Creating a Cross Posting Facilitator for Technology Communities

Hacker News & Stack Overflow

WS3 Group 3

Anca Dumitrache, Fabio Benedetti, Seyi Feyisetan

Introduction

● Stack Overflow: questions and answers on technology
● Hacker News: news for technology enthusiasts
● similar to Hacker News: Reddit, Slashdot
● similar to Stack Overflow: Quora

Goals

1. develop a methodology to compare online technology communities
2. use the vocabulary of one social community (e.g. Stack Overflow) to describe the other (e.g. Hacker News)
3. topic recommendation: newsworthy cross posting across communities

Topic recommendation

Pipeline


Approach

1. data gathering:
   ○ sources: Hacker News + Stack Overflow
   ○ fixed timeframe: September 2013
   ○ method: web scraping with Python, R
2. data processing:
   ○ linking: named entity extraction with term matching using the tags vocabulary from Stack Overflow
   ○ cleanup: only keep posts with tech-related topics
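The linking and cleanup steps can be sketched roughly as follows. This is a minimal illustration, not the team's actual code: the function names, the `title` field, and the choice of a single compiled regular expression are all assumptions made for the example. It matches Stack Overflow tags as whole tokens in a post title and keeps only posts where at least one tag was found.

```python
import re


def build_tag_matcher(tags):
    """Compile one regex matching any tag as a whole token (hypothetical helper)."""
    if not tags:
        raise ValueError("need at least one tag")
    # Try longer tags first so that "c++" is preferred over "c".
    ordered = sorted(tags, key=len, reverse=True)
    alternatives = "|".join(re.escape(tag) for tag in ordered)
    # Lookarounds are used instead of \b because tags such as "c++" or
    # ".net" begin or end with non-word characters.
    return re.compile(
        r"(?<![\w+#.-])(?:" + alternatives + r")(?![\w+#-])",
        re.IGNORECASE,
    )


def extract_tags(text, matcher):
    """Return the sorted, lower-cased set of tags mentioned in the text."""
    return sorted({m.group(0).lower() for m in matcher.finditer(text)})


def keep_tech_posts(posts, matcher):
    """Cleanup step: keep posts whose title mentions at least one tag.

    Each post is assumed to be a dict with a "title" key (an assumption
    for this sketch, not the scraped data format).
    """
    kept = []
    for post in posts:
        tags = extract_tags(post["title"], matcher)
        if tags:
            kept.append((post, tags))
    return kept
```

Plain term matching like this cannot tell the language "Go" from the verb "go", which is one reason the slides list disambiguation and a better recogniser as later steps.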

Future development

1. data processing:
   ○ crowdsourced disambiguation of entities
2. training:
   ○ use a priori observations of cross posting as training data
   ○ possible features:
      i. co-occurring tags
      ii. frequency of tags
      iii. number of points in a post
      iv. number of comments in a post
      v. time...
3. evaluation:
   ○ crowdsourced ranking of recommendation relevance
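As a rough illustration of the possible features listed above, the sketch below turns a post into a small feature dictionary. The post layout (`tags`, `points`, `comments`) and the particular aggregates (summed tag frequency, maximum pair co-occurrence) are assumptions chosen for the example; the slides do not specify how the features would be encoded, and the time feature is left out.

```python
from collections import Counter
from itertools import combinations


def tag_frequencies(posts):
    """Count in how many posts each tag appears."""
    freq = Counter()
    for post in posts:
        freq.update(set(post["tags"]))
    return freq


def tag_cooccurrences(posts):
    """Count in how many posts each unordered pair of tags appears together."""
    cooc = Counter()
    for post in posts:
        for pair in combinations(sorted(set(post["tags"])), 2):
            cooc[pair] += 1
    return cooc


def post_features(post, freq, cooc):
    """Build a feature dict for one post (hypothetical feature encoding)."""
    tags = sorted(set(post["tags"]))
    pairs = list(combinations(tags, 2))
    return {
        "tag_frequency_sum": sum(freq[t] for t in tags),
        "max_cooccurrence": max((cooc[p] for p in pairs), default=0),
        "points": post["points"],
        "comments": post["comments"],
    }
```

Dictionaries of this shape could then be fed to any standard classifier once labelled cross-posting observations are available.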

Results

Topic overlap

Trending topics


Frequency overlap

Frequency overlap (zoomed in)

Findings

1. small set of overlapping topics over the two social machines (but better NER could identify more links)
2. Stack Overflow has a more diverse range of topics than Hacker News (although the vocabulary likely introduces bias)
3. different frequently discussed topics on both social machines (although a set of outliers does exist)
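One way to quantify findings like these is sketched below. The slides do not state which measures were used, so the Jaccard index and the relative-frequency gap are assumptions introduced for illustration, as are the function names. The inputs are per-community tag counts.

```python
from collections import Counter


def topic_overlap(counts_a, counts_b):
    """Return the shared topics and the Jaccard index of the two topic sets."""
    a, b = set(counts_a), set(counts_b)
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


def relative_frequencies(counts):
    """Normalise raw counts so communities of different size are comparable."""
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}


def frequency_gaps(counts_a, counts_b):
    """For shared topics, relative frequency in A minus that in B.

    Sorted by absolute gap, largest first, so strongly skewed topics
    (candidates for cross posting) appear at the top.
    """
    rel_a = relative_frequencies(counts_a)
    rel_b = relative_frequencies(counts_b)
    shared = set(rel_a) & set(rel_b)
    gaps = [(tag, rel_a[tag] - rel_b[tag]) for tag in shared]
    return sorted(gaps, key=lambda item: abs(item[1]), reverse=True)
```

A low Jaccard index would correspond to the first finding, and large gaps on shared topics to the third.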

Future Work

● add more data sources such as Reddit, Slashdot

● gather data over a larger timeframe

● fine-tune our named entity recogniser

● expand the vocabulary used to describe the communities (and publish as Linked Data)

● use crowdsourcing for tag disambiguation and output evaluation

Conclusion

Preliminary studies show that:

● we can use Stack Overflow tags as a vocabulary to understand online technology communities

● we can identify a feature set to compare these communities

● there is enough gap between trending topics in the two communities to allow for the use case of a topic recommendation system
