Extracting Interest Tags from Twitter User Biographies
Ying Ding, Jing Jiang
School of Information Systems
Singapore Management University
AIRS 2014, Kuching, Malaysia
Social Media and Personal Data
Dec 5, 2014 AIRS 2014 2
• Much personal information is revealed in social media
– Content, links, ratings, personal preferences
• All this information is useful to
– Researchers: social science
– Businesses: targeted advertising
User Biographies in Twitter
• Self-introductions written in free form
• Reflect users’ background and interests
User Biographies in Twitter
[Figure: example biography with its profession, interests, and age annotated]
Around 28% of Singapore Twitter users and 50% of US Twitter users revealed their personal interests in their biographies.
Dong Wei et al. Who am I on Twitter?: A cross-country comparison. WWW’2014
Outline
• Background
• Our task
• Syntactic patterns of interest tags
• Build training data + gold standard
• Method
• Experiments
• Summary
Our task
• Automatically extract phrases that describe a user’s personal interests.
– We call them “interest tags”.
– A typical information extraction problem.
– Automatically build training data based on common syntactic patterns.
Syntactic Patterns of Interest Tags
• Based on manual annotation of 500 user biographies.
• 28.8% of user biographies contain meaningful interest tags.
Building Training Data
• Seed patterns:
– Play + [NP]
– [NP] + fan
– Interested in + [NP]
• Steps:
– Use seed patterns to extract noun phrases and rank them according to their frequency
– Pick the top-100 ranked noun phrases and use them as positive instances to train a CRF
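The two steps above can be sketched roughly as follows. This is only an illustration, not the authors' code: the slides do not say how noun phrases were detected, so the sketch approximates an [NP] as up to three consecutive non-stopword tokens. The stopword list, `MAX_LEN`, and all function names are assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a tiny stopword list marks noun-phrase boundaries.
STOP = {"a", "an", "the", "and", "or", "i", "my", "of", "in",
        "for", "with", "to", "at", "is", "am", "are"}
MAX_LEN = 3  # assumption: an approximate noun phrase has at most 3 tokens


def tokenize(text):
    """Lowercase and split into word tokens and single punctuation marks."""
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())


def is_word(tok):
    return tok[0].isalnum() and tok not in STOP


def phrase_after(tokens, start):
    """Collect up to MAX_LEN content words starting at index `start`."""
    out = []
    for tok in tokens[start:start + MAX_LEN]:
        if not is_word(tok):
            break
        out.append(tok)
    return " ".join(out)


def phrase_before(tokens, end):
    """Collect up to MAX_LEN content words ending just before index `end`."""
    out = []
    for tok in reversed(tokens[max(0, end - MAX_LEN):end]):
        if not is_word(tok):
            break
        out.append(tok)
    return " ".join(reversed(out))


def seed_candidates(bio):
    """Apply the three seed patterns to one biography."""
    tokens = tokenize(bio)
    found = []
    for i, tok in enumerate(tokens):
        if tok == "play":                                   # Play + [NP]
            found.append(phrase_after(tokens, i + 1))
        elif tok == "fan":                                  # [NP] + fan
            found.append(phrase_before(tokens, i))
        elif tok == "interested" and tokens[i + 1:i + 2] == ["in"]:
            found.append(phrase_after(tokens, i + 2))       # Interested in + [NP]
    return [p for p in found if p]


def top_phrases(bios, k=100):
    """Rank extracted phrases by frequency and keep the k most frequent."""
    counts = Counter(p for bio in bios for p in seed_candidates(bio))
    return [phrase for phrase, _ in counts.most_common(k)]
```

Calling `top_phrases(bios, k=100)` on a list of biography strings would give the candidate list that the slide describes as positive training instances; a real pipeline would likely replace the token heuristic with a proper noun-phrase chunker.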
Features
• Syntactic or dependency features are not used, as Twitter text is too noisy for parsing
• Both lexical and POS tag features are used
• To avoid over-fitting, only features extracted from the surrounding tokens of each position are used
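A minimal sketch of such context-only features is shown below. It reads "surrounding tokens" as excluding the current token's own word and POS tag, which is an interpretation rather than something the slides state. The window size, the feature names, and the assumption that POS tags are supplied by some external Twitter-aware tagger are all choices made for the example.

```python
def context_features(tokens, pos_tags, i, window=2):
    """Lexical and POS features for position i, taken only from neighbours.

    The token at position i itself is skipped, so a sequence model trained
    on these features cannot simply memorise the seed phrases.
    """
    feats = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the current token itself
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"word[{offset:+d}]"] = tokens[j].lower()
            feats[f"pos[{offset:+d}]"] = pos_tags[j]
        else:
            feats[f"pad[{offset:+d}]"] = True  # position falls outside the sentence
    return feats


def sentence_features(tokens, pos_tags, window=2):
    """One feature dict per token, in the shape CRF toolkits commonly accept."""
    return [context_features(tokens, pos_tags, i, window)
            for i in range(len(tokens))]
```

For the tokens `["Huge", "football", "fan"]`, the features at the position of "football" contain "huge" and "fan" as context words but never "football" itself.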
Gold Standard
• Two annotators: graduate students
• 500 randomly sampled user biographies
• 1190 sentences
– The two annotators disagree on 10 sentences
– High agreement (1180 of 1190 sentences, about 99.2%)
Experiment
BL-700: baseline using the top 700 most frequent phrases; we chose 700 because it gave the highest F-score among the values we tried.
Seed: uses the seed patterns to recognize interest tags.
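As a rough illustration of how a cutoff such as 700 could be selected by F-score, the sketch below scores a set of predicted tags against a gold set and picks the best cutoff from a list of candidates. The slides do not describe the exact matching or scoring procedure, so the set-based matching and the function names `precision_recall_f1` and `best_cutoff` are assumptions.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 of predicted tags against gold tags."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def best_cutoff(ranked_phrases, gold, cutoffs):
    """Return the cutoff k whose top-k phrases give the highest F1."""
    return max(cutoffs,
               key=lambda k: precision_recall_f1(ranked_phrases[:k], gold)[2])
```

With a frequency-ranked phrase list, `best_cutoff(ranked, gold, [100, 300, 500, 700, 900])` would return whichever of those sizes scores best; the candidate sizes here are placeholders, not the values the authors tried.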
Extracted Patterns
Some popular patterns are:
• [Interest tag] + fan/lover/enthusiast
• I love + [interest tag]
• [Interest tag] is/are my life
Is it difficult to predict interest tags from users’ tweets?
We also applied TF-IDF ranking, which has been used to extract personalized user tags, to extract user interest tags from users’ tweets.
• Interest tags extracted from users’ biographies are not necessarily reflected in their tweets.
• They can serve as supplementary information when profiling a user.
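The TF-IDF ranking mentioned on this slide can be sketched as follows. This is a generic formulation, not necessarily the variant the authors used: each user's tweets are treated as one document, terms are single lowercase words, and the score is relative term frequency times log(N / document frequency). The function names and the tokenization are assumptions.

```python
import math
import re
from collections import Counter


def terms(text):
    """Lowercase word tokens of at least two characters."""
    return re.findall(r"[a-z][a-z']+", text.lower())


def tfidf_top_terms(user_docs, k=5):
    """Rank each user's terms by TF-IDF.

    user_docs maps a user id to the concatenated text of that user's tweets.
    Returns a dict from user id to that user's k highest-scoring terms.
    """
    term_counts = {user: Counter(terms(text)) for user, text in user_docs.items()}

    # Document frequency: in how many users' tweets each term appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_users = len(user_docs)
    ranked = {}
    for user, counts in term_counts.items():
        total = sum(counts.values()) or 1
        scores = {term: (count / total) * math.log(n_users / doc_freq[term])
                  for term, count in counts.items()}
        # Sort by descending score, breaking ties alphabetically.
        ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        ranked[user] = [term for term, _ in ordered[:k]]
    return ranked
```

Comparing each user's top-ranked terms with the tags extracted from the same user's biography is one way to check how often biography interests also surface in tweets.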
Summary
• We studied the problem of extracting interest tags from Twitter user biographies
• We automatically built noisy training data based on syntactic patterns
• We trained a CRF classifier on the noisy training data and achieved decent performance
• Interest tags extracted from Twitter user biographies may not be reflected in the user’s tweets