Contextual Analysis of User Interests in Social Media Sites
– An Exploration with Micro-blogs
Nilanjan Banerjee, Dipanjan Chakraborty, Koustuv Dasgupta, Anupam Joshi, Sameer Madan, Sumit Mittal, Seema Nagar, Angshu Rai
[CIKM ’09]
Advisor: Dr. Koh Jia-Ling
Reporter: Che-Wei, Liang
Date: 2009/10/26
Outline
• Introduction
• Data Set
• Mining Real-Time User Interests
• Discovering Associations in User Interests
• Pattern Discovery in Interest Clusters
• Conclusion and Future Work
Introduction
• Challenges
  – Tweets tend to be stream-of-consciousness fragments
  – Lack of structure in the description
• In this paper
  – Report the results of analyzing tweets using unstructured text mining techniques
Data Set
• Collect data from Twitter
  – Select the most active users across 10 cities
  – Collect tweets over four weeks
    • from March 2009 to April 2009
• Tweet: <user name, tweet, time of publishing the tweet>
Mining Real-Time User Interests
• Tweets usually have the following properties
  – ephemeral:
    • the interest in an activity changes over time
  – descriptive:
    • the interest can be described using one or more indicative keywords or terms
  – localized:
    • the interest (or activity) is usually associated with (contextual) location information
Mining Real-Time User Interests
• Identify tweets expressing interests by content-indicative and usage-indicative keywords
  – Content-indicative keywords (category words)
    • Express the broad class (category) of user interests, e.g. movie, sports, etc.
  – Usage-indicative keywords
    • Characterize the activity associated with a particular interest
    • Can be either temporal or action keywords
Mining Real-Time User Interests
• First, explore which kinds of keywords Twitter users use most
• Exclude pronouns, prepositions, helping verbs, question words, and other non-indicative words
• Stem the words using the Porter stemming algorithm
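The preprocessing steps above can be sketched roughly as follows. This is only an illustration: the stop list is a tiny hypothetical one (the slides do not give the actual list), and stemming is left as a pluggable `stem` function that defaults to the identity. A real Porter stemmer, for example `nltk.stem.PorterStemmer().stem`, could be passed in instead.

```python
import re
from collections import Counter
from typing import Callable, Iterable

# Hypothetical, deliberately tiny stop list; the slides do not give the real one.
STOP_WORDS = {
    "i", "you", "he", "she", "we", "they", "it",      # pronouns
    "to", "in", "on", "at", "for", "of", "with",      # prepositions
    "is", "am", "are", "was", "be", "have", "will",   # helping verbs
    "what", "when", "where", "who", "why", "how",     # question words
    "a", "an", "the", "and",                          # other non-indicative words
}


def keyword_counts(tweets: Iterable[str],
                   stem: Callable[[str], str] = lambda w: w) -> Counter:
    """Count keywords across tweets after stop-word removal and stemming."""
    counts = Counter()
    for tweet in tweets:
        # Lowercase and keep alphabetic tokens only.
        for token in re.findall(r"[a-z']+", tweet.lower()):
            if token not in STOP_WORDS:
                counts[stem(token)] += 1
    return counts


example_tweets = [
    "I want to watch a movie tonight",
    "Going to the movie with friends",
    "What a game tonight",
]
counts = keyword_counts(example_tweets)
print(counts.most_common(3))
```

Keeping the stemmer as a parameter lets the same counting code run on stemmed or non-stemmed words.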
Mining Real-Time User Interests
• Content-indicative keywords
  – Form an initial list of category keywords
    • Consult WordNet and IMDb
  – Enrich the seed list of keywords by
    • Manually inspecting thousands of tweets and including “interest-indicative words”
  – Finally, identify five seed categories from the list of category keywords
    • movie, music, food, sports, dance
Mining Real-Time User Interests
• Use a term frequency-based measure
  – Estimate the occurrences of temporal and action words
Mining Real-Time User Interests
• Context-based discovery of keywords
  – Consider non-stemmed words to enrich the knowledge base of keywords
    • Stemmed data incurs a loss of tense information
  – Discover similar words by
    • Finding matches that are contextually similar to the seed dictionary words
Mining Real-Time User Interests
• POS-based discovery of action verbs
  – Use a POS analyser to extract action verbs
  – Identify the relevant action verbs that show a high correlation with identified category words
    • D represents the total number of tweets
    • A = { tweets containing the keyword “cw” }
    • B = { tweets containing the keyword “aw” }
  – Add them to the existing set of usage-indicative keywords
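The correlation formula itself is not given with these definitions, so the sketch below uses lift, P(A ∩ B) / (P(A) · P(B)), purely as an assumed stand-in built from the same quantities D, A and B. Reading “cw” as a category word and “aw” as an action word is also an assumption, and the function names are hypothetical.

```python
from typing import List, Set


def tweets_containing(tweets: List[str], keyword: str) -> Set[int]:
    """Return the indices of tweets whose tokens include `keyword`."""
    return {i for i, t in enumerate(tweets) if keyword in t.lower().split()}


def lift(tweets: List[str], cw: str, aw: str) -> float:
    """Assumed measure, not taken from the slides: P(A and B) / (P(A) * P(B))."""
    d = len(tweets)                       # D: total number of tweets
    a = tweets_containing(tweets, cw)     # A: tweets containing cw
    b = tweets_containing(tweets, aw)     # B: tweets containing aw
    if not a or not b:
        return 0.0
    # Algebraically equal to (|A & B| / D) / ((|A| / D) * (|B| / D)).
    return len(a & b) * d / (len(a) * len(b))


sample = [
    "want to watch a movie tonight",
    "watch the movie with me",
    "lunch was great",
    "going to watch the game",
]
print(lift(sample, "movie", "watch"))
print(lift(sample, "movie", "lunch"))
```

Under this measure, a value above 1 means the two words co-occur more often than independence would predict.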
Discovering Associations in User Interests
• Goal:
  – Explore different latent semantic associations between content-indicative category words and usage-indicative action/temporal words
• N-Gram Analysis
• Contextual Analysis using k-means clustering
• Temporal Analysis
N-Gram Analysis
• If a user intends to pursue an interest, he/she should use indicative action and/or temporal words to express it
  – E.g. “I want to watch a movie tonight”
• Employ bigram-based analysis of category words
  – Co-occurring words can be at a variable distance (a tolerance limit of 5 words)
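A minimal sketch of variable-distance co-occurrence counting around a category word. It assumes that the "tolerance limit of 5 words" means any word at most 5 token positions before or after the category word. The function name and whitespace tokenization are illustrative choices.

```python
from collections import Counter
from typing import Iterable


def window_cooccurrences(tweets: Iterable[str], category_word: str,
                         tolerance: int = 5) -> Counter:
    """Count words appearing within `tolerance` positions of `category_word`."""
    counts = Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()
        for i, token in enumerate(tokens):
            if token != category_word:
                continue
            # Window boundaries, clipped to the tweet.
            lo = max(0, i - tolerance)
            hi = min(len(tokens), i + tolerance + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


ngram_tweets = [
    "i want to watch a movie tonight",
    "movie tonight with friends",
]
pairs = window_cooccurrences(ngram_tweets, "movie")
print(pairs.most_common(3))
```

Setting `tolerance=1` reduces this to ordinary adjacent bigrams around the category word.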
N-Gram Analysis
• People tend to tweet about activities that are planned for different times of the day
  – E.g. “party tonight”
Contextual Analysis using k-means clustering
• Goal: discover new groups of tweets and perform a contextual analysis
  – Clustering is a well-accepted technique for grouping similar documents
  – Use k-means clustering
  – Analyze clusters to discover latent associations of cluster tags with other words in the cluster
    • Tag each cluster with its highest-occurring words
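The clustering and tagging steps can be illustrated with a small standard-library sketch. It assumes raw term-count vectors, squared Euclidean distance, and hand-picked seed tweets as initial centroids. The slides do not state the features, distance, or initialization actually used, so all of these (and the function names) are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple


def bag_of_words(tweets: List[str]) -> Tuple[List[List[float]], List[str]]:
    """Turn tweets into term-count vectors over a shared, sorted vocabulary."""
    vocab = sorted({w for t in tweets for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for t in tweets:
        v = [0.0] * len(vocab)
        for w in t.lower().split():
            v[index[w]] += 1.0
        vectors.append(v)
    return vectors, vocab


def sq_dist(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors: List[List[float]], seeds: List[int],
           iterations: int = 20) -> List[int]:
    """Plain Lloyd's k-means; `seeds` are indices of the initial centroids."""
    centroids = [list(vectors[i]) for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [min(range(len(centroids)),
                      key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def tag_clusters(tweets: List[str], labels: List[int],
                 top_n: int = 2) -> Dict[int, List[str]]:
    """Tag each cluster with its `top_n` most frequent words."""
    tags = {}
    for c in sorted(set(labels)):
        words = Counter(w for t, l in zip(tweets, labels) if l == c
                        for w in t.lower().split())
        tags[c] = [w for w, _ in words.most_common(top_n)]
    return tags


cluster_tweets = [
    "movie tonight movie",
    "watch movie tonight",
    "pizza dinner pizza",
    "dinner pizza now",
]
vectors, vocab = bag_of_words(cluster_tweets)
labels = kmeans(vectors, seeds=[0, 2])
tags = tag_clusters(cluster_tweets, labels)
print(labels, tags)
```

Running the same routine again on the tweets of a single cluster gives sub-clusters in the same way.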
Contextual Analysis using k-means clustering - Sub-Cluster Analysis
• Analyzed the content of clusters having content-indicative tags, temporal words, and action words
  – Ran k-means and gathered the predominant sub-clusters
Temporal Analysis
• Real-time interests have a significant temporal component; capturing it can lead to insights into how words associate with the temporal aspect of interests
Pattern Discovery in Interest Clusters
• A microscopic analysis of select content-indicative clusters
• Built a benchmark set
  – 5000 tweets comprising a mix from the party, food, sports, and movie clusters
  – Manually tagged those that indicate a real-time interest (i.e. positive tweets)
Patterns in Real-time Interest Tweets
• Patterns can be of several types:
  1. Word occurrence-based
    • e.g. “gym” occurs with “go” in positive tweets
  2. Grammar-based
    • e.g. “party” is preceded by a verb of the form “going for” in positive tweets
  3. Precedence-based
    • e.g. “tonight” succeeds “movie”
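The three pattern types can be approximated with simple token checks, as in the rough sketch below. The function names and the `max_gap` parameter are hypothetical. The grammar-based check is only a phrase match standing in for real part-of-speech analysis.

```python
from typing import List


def tokens(tweet: str) -> List[str]:
    return tweet.lower().split()


def co_occurs(tweet: str, w1: str, w2: str) -> bool:
    """Word occurrence-based: both words appear somewhere in the tweet."""
    toks = tokens(tweet)
    return w1 in toks and w2 in toks


def preceded_by_phrase(tweet: str, word: str, phrase: str,
                       max_gap: int = 2) -> bool:
    """Grammar-like: `word` follows `phrase` with at most `max_gap` tokens between."""
    toks = tokens(tweet)
    ph = phrase.lower().split()
    for i in range(len(toks) - len(ph) + 1):
        if toks[i:i + len(ph)] == ph:
            end = i + len(ph)
            if word in toks[end:end + max_gap + 1]:
                return True
    return False


def succeeds(tweet: str, later: str, earlier: str) -> bool:
    """Precedence-based: `later` occurs after the first `earlier`."""
    toks = tokens(tweet)
    if earlier not in toks:
        return False
    return later in toks[toks.index(earlier) + 1:]


print(co_occurs("time to go to the gym", "gym", "go"))
print(preceded_by_phrase("going for a party tonight", "party", "going for"))
print(succeeds("watching a movie tonight", "tonight", "movie"))
```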
Patterns in Real-time Interest Tweets
• Sports Category
  – An intention to play a sport or go and watch a game
• Food Category
  – A real-time intention of having food or going to a restaurant
Patterns in Real-time Interest Tweets
• Party Category
  – Depicts the user’s intention to get involved in a party
• Movie Category
  – Expresses an intention to watch a movie in the near future
Differentiating Intentions from Tweets - Word Affinity Measure
• Affinity of a word “w” to a set of tweets “T”
  – Defined as the probability that “w” occurs in “T”
  – Used to compute the associations of frequently used words in tweets
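One simple way to estimate this affinity is the fraction of tweets in T that contain w. The sketch below assumes that estimator, and the example tweets are made up for illustration. Comparing the affinity of a word to positive versus negative tweets then suggests how well it separates real-time intentions from other mentions.

```python
from typing import Iterable


def affinity(word: str, tweets: Iterable[str]) -> float:
    """Fraction of tweets in the set that contain `word` (assumed estimator)."""
    tweets = list(tweets)
    if not tweets:
        return 0.0
    hits = sum(1 for t in tweets if word in t.lower().split())
    return hits / len(tweets)


# Made-up examples of tweets tagged as positive (real-time interest) or negative.
positive = [
    "going to the gym tonight",
    "want to go for a movie tonight",
    "party tonight",
]
negative = [
    "the movie last week was great",
    "tonight show was boring",
]
print(affinity("tonight", positive), affinity("tonight", negative))
```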
Real-time Interest Classification - Initial Evaluation
• An evaluation of how some traditional text classification algorithms perform in classifying tweets
• Several further mechanisms need to be exploited
  – Word-usage-based heuristics, rule-based filtering
Conclusion And Future Work
• Investigated and evaluated microblogs by
  – Using contextual information of their users to capture real-time user interests
    • Revealed enough keywords that express interests
    • Used statistical techniques to discover associations
    • Clustering reveals words indicative of user interests
    • Discovered patterns from clusters
• There exists ample scope for research
  – Identifying user context
    • Emotions, presence, location