DON’T REMOVE MY STOP WORDS: IDENTIFYING PERSONALITY TRAITS FROM QUORA ANSWERS
ASHUTOSH BAHETI, 12CS10012
RAHUL GURNANI, 12CS10039
DHRUV JAIN, 12CS30043
NISHKARSH SHASTRI, 12CS10034
SABYASACHEE BARAUH, 12CS30029
OBJECTIVE
● Identify the personality of Quora users with respect to the Big Five personality traits, using linguistic-feature-based analysis of their answers:
● Openness To Experience
● Conscientiousness
● Extraversion
● Agreeableness
● Neuroticism
RELATED WORK
Tausczik, Yla R., and James W. Pennebaker. "The psychological meaning of words: LIWC and computerized text analysis methods."
Mairesse, François, et al. "Using linguistic cues for the automatic recognition of personality in conversation and text."
Celli, Fabio, Fabio Pianesi, David Stillwell, and Michal Kosinski. "Workshop on Computational Personality Recognition."
Project Timeline
Classifying essay data based on LIWC as feature
Identifying the linguistic features for the Big Five personality traits
Extraction of textual features from the essays
Classifying based on new features and LIWC
Survey with the Quora users to get a labelled dataset
Crawling the answers of Surveyed users
Using the Quora Dump to expand LIWC
Training the model on the labelled Quora dataset
Calculating the accuracy of the trained model
Classification of Essay Data
Straightforward ML approach
Labelled essays with binary values for each personality trait
Sanitized the data present in the essays
Created the trie structure for LIWC prefix matching
Extracted the features based on LIWC word count for each category
Applied SVM to the data using WEKA
Accuracy of model found to be 53%
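The trie-based prefix matching and per-category counting above can be sketched as follows. This is a minimal illustration, not the project's actual script: the class name `LiwcTrie`, the function `liwc_features`, the tiny example dictionary, and the normalisation by total word count are all assumptions. It relies only on the LIWC convention that an entry ending in `*` matches any word starting with that prefix.

```python
from collections import Counter


class LiwcTrie:
    """Trie over dictionary entries; an entry ending in '*' matches any word with that prefix."""

    def __init__(self):
        self.root = {}

    def add(self, entry, categories):
        is_prefix = entry.endswith("*")
        node = self.root
        for ch in entry.rstrip("*"):
            node = node.setdefault(ch, {})
        # Multi-character keys cannot collide with single-character child keys.
        key = "prefix" if is_prefix else "exact"
        node.setdefault(key, set()).update(categories)

    def lookup(self, word):
        found = set()
        node = self.root
        for ch in word:
            # Any wildcard entry ending at this node matches the whole word.
            found |= node.get("prefix", set())
            node = node.get(ch)
            if node is None:
                return found
        return found | node.get("prefix", set()) | node.get("exact", set())


def liwc_features(text, trie):
    """Return category -> share of words in that category (normalisation is an assumption)."""
    words = [w.strip(".,;:!?\"'()") for w in text.lower().split()]
    counts = Counter()
    for word in words:
        for category in trie.lookup(word):
            counts[category] += 1
    total = len(words) or 1
    return {category: n / total for category, n in counts.items()}


# Hypothetical three-entry dictionary, for illustration only.
trie = LiwcTrie()
trie.add("happ*", {"posemo"})
trie.add("sad", {"negemo"})
trie.add("i", {"pronoun"})
features = liwc_features("I am happy, not sad.", trie)
```

The resulting dictionary can be turned into a fixed-order feature vector before it is handed to a classifier.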
Features Identified for Extraversion
Word Variance (repetitivity)
Type/Token Ratio
Formality measure and Informality measure
F-Measure = (noun freq + adjective freq + preposition freq + article freq - pronoun freq - verb freq - adverb freq - interjection freq + 100) / 2
I-Measure = (wrongly typed words freq + interjections freq + emoticon freq) * 100
Positivity of Text and Negativity Of Text
Rich Vocabulary, use of difficult words
Concrete and Frequent Words
Use of more social words
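A minimal sketch of three of the measures listed above: type/token ratio, F-Measure and I-Measure. The function names are hypothetical, and two readings are assumed: the F-Measure frequencies are percentages of all tokens, and the I-Measure frequencies are counts divided by the token count. Part-of-speech tagging (for example with nltk) is assumed to happen upstream and is not shown.

```python
def type_token_ratio(tokens):
    """Distinct word forms divided by total tokens (case-insensitive)."""
    tokens = [t.lower() for t in tokens]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def f_measure(freq):
    """Formality score; freq maps a word class to its percentage of all tokens (assumed)."""
    plus = ("noun", "adjective", "preposition", "article")
    minus = ("pronoun", "verb", "adverb", "interjection")
    positive = sum(freq.get(k, 0.0) for k in plus)
    negative = sum(freq.get(k, 0.0) for k in minus)
    return (positive - negative + 100) / 2


def i_measure(wrong_typed, interjections, emoticons, n_tokens):
    """Informality score from raw counts, scaled by the number of tokens (assumed)."""
    if not n_tokens:
        return 0.0
    return 100 * (wrong_typed + interjections + emoticons) / n_tokens


ttr = type_token_ratio("the cat saw the dog".split())
formality = f_measure({
    "noun": 30, "adjective": 10, "preposition": 10, "article": 10,
    "pronoun": 10, "verb": 20, "adverb": 5, "interjection": 5,
})
informality = i_measure(wrong_typed=1, interjections=2, emoticons=1, n_tokens=100)
```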
Features Identified for Openness
Preference for longer words
Words expressing tentativeness
Avoidance of 1st person singular pronouns
Present tense forms
Avoidance of past tense forms
Features Identified for Conscientiousness
Avoid negations
Avoid words reflecting discrepancies (e.g., should and would)
2nd person pronouns
Filler words (in males and not in females): more useful in speech analysis
Features Identified for Agreeableness
More positive emotions, fewer negative emotions
Few articles
Negative and Positive emotion words
Leisure activity
Features Identified for Neuroticism
1st person singular pronouns
Noun Negative
Multiple punctuations
Fewer references to occupation
Extraction of Features
Python scripts using nltk to extract the features listed on the previous five slides
Speech-based features were not extracted
NLP-Technique-Based Features: Discourse Parsing
Applied discourse parsing to all the essay data.
Created RST style discourse trees.
Extracted main nucleus text from the data
Extracted the relation count from the RST trees
Normalized the relation count.
Constructed the feature vector to include the discourse relation count
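The relation-count step can be sketched as below, assuming the discourse parser's output has already been reduced to a flat list of relation labels per essay. The label inventory `RELATIONS` is a hypothetical subset, and dividing by the total relation count is only one plausible reading of "normalized".

```python
from collections import Counter

# Hypothetical subset of RST relation labels; the real inventory depends on the parser.
RELATIONS = ["Elaboration", "Contrast", "Attribution", "Cause", "Condition"]


def relation_features(relations, inventory=RELATIONS):
    """Return one normalised count per relation, in the fixed order of `inventory`."""
    counts = Counter(relations)
    total = sum(counts[r] for r in inventory)
    return [counts[r] / total if total else 0.0 for r in inventory]


vector = relation_features(["Elaboration", "Elaboration", "Contrast", "Cause"])
```

Keeping the order fixed means the vector can be appended directly to the other features of an essay.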
Expansion of LIWC Word Set
Seeded LDA and Word2Vec Methods
Expansion of LIWC: Seeded LDA
Seeded LDA treats each document as a mixture of topics.
It treats topics as a probability distribution of words.
We can give an asymmetric prior probability to a word-topic pair to seed the topic with the given word.
We used the gensim package and its eta parameter to implement seeded LDA; however, it did not give better results due to overfitting.
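One way to build such an asymmetric prior is sketched below. The helper `build_seeded_eta` and the `base` and `boost` values are hypothetical; the deck does not state the values actually used. The function returns a topics-by-vocabulary matrix in which seed words receive a larger prior weight in their own topic.

```python
def build_seeded_eta(vocab, seed_words, base=0.01, boost=1.0):
    """Return a num_topics x num_terms prior as a list of rows.

    vocab      -- list of words, in the same order as the model's term ids
    seed_words -- one list of seed words per topic
    """
    index = {word: i for i, word in enumerate(vocab)}
    eta = []
    for topic_seeds in seed_words:
        row = [base] * len(vocab)
        for word in topic_seeds:
            if word in index:  # seeds missing from the vocabulary are skipped
                row[index[word]] = boost
        eta.append(row)
    return eta


vocab = ["happy", "sad", "work", "job"]
eta = build_seeded_eta(vocab, [["happy"], ["sad"], ["work", "job", "missing"]])

# With gensim installed, the matrix could then be passed on roughly as:
#   LdaModel(corpus=corpus, id2word=dictionary, num_topics=len(eta), eta=numpy.asarray(eta))
# where `corpus` and `dictionary` are assumed to come from the usual gensim preprocessing.
```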
Expansion of LIWC: Word2Vec
Applied Word2Vec modelling on Quora Dump
Found the most similar words for each word present under the tag
Compared the similarity with 1 Billion Wiki Text
Added the most similar words thus found to the new LIWC dictionary
Trained the models on new LIWC dictionary
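The expansion step can be illustrated with a small nearest-neighbour sketch. Toy two-dimensional vectors stand in for the embeddings trained on the Quora dump, and `expand_category`, `top_n` and `threshold` are hypothetical names and values. With a trained gensim model, the neighbour search would typically be done with `model.wv.most_similar(word, topn=...)` instead of the explicit cosine loop.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_category(seed_words, vectors, top_n=2, threshold=0.8):
    """Add to a category the nearest neighbours of each seed word."""
    added = set()
    for seed in seed_words:
        if seed not in vectors:
            continue
        scored = [
            (cosine(vectors[seed], vec), word)
            for word, vec in vectors.items()
            if word not in seed_words
        ]
        for score, word in sorted(scored, reverse=True)[:top_n]:
            if score >= threshold:
                added.add(word)
    return set(seed_words) | added


# Toy embeddings, for illustration only.
vectors = {
    "happy": [1.0, 0.1],
    "glad": [0.9, 0.2],
    "joyful": [0.95, 0.05],
    "table": [0.0, 1.0],
}
expanded = expand_category({"happy"}, vectors)
```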
Expansion of Posemo, Negemo, Funct-words
Added more positive and negative words [1]
Added more function words [2]
1. Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews."
2. Leah Gilner and Franc Morales at Sequence Publishing (http://www.sequencepublishing.com), for listing English function words
User Survey
Survey Method
Used a 10-question questionnaire: BFI-10
Contacted Quora users with more than 30 answers
50 users filled in the survey
Calculated a personality score between 1 and 10 for each of the 5 personality traits
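A sketch of how BFI-10 responses might be turned into trait scores. The deck does not give its scoring rule, so several things are assumed here: answers are on a 1-5 Likert scale, each trait has two items of which one is reverse-scored, the item keying below follows the published BFI-10 as best recalled (it should be checked against the questionnaire actually used), and a trait score is the sum of its two items, which gives a range of 2 to 10.

```python
# (item number, reverse-scored?) per trait. Assumed keying; verify against the survey form.
BFI10_KEY = {
    "Extraversion": [(1, True), (6, False)],
    "Agreeableness": [(2, False), (7, True)],
    "Conscientiousness": [(3, True), (8, False)],
    "Neuroticism": [(4, True), (9, False)],
    "Openness": [(5, True), (10, False)],
}


def score_bfi10(responses):
    """responses maps item number (1-10) to a Likert answer (1-5)."""
    scores = {}
    for trait, items in BFI10_KEY.items():
        scores[trait] = sum(
            (6 - responses[item]) if reverse else responses[item]
            for item, reverse in items
        )
    return scores


neutral = score_bfi10({item: 3 for item in range(1, 11)})
```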
Extraction of Data
Wrote a Python script to crawl all the answers of these users
Sanitized the answers
Pruned all the answers with less than 200 words
Labelled the dataset thus obtained with the survey results
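The pruning and labelling steps could look roughly like this. It is a sketch under assumptions: `build_dataset` is a hypothetical name, words are counted by whitespace splitting, and every retained answer inherits the survey scores of its author.

```python
def build_dataset(answers_by_user, survey_scores, min_words=200):
    """Keep answers with at least `min_words` words and attach the author's survey scores."""
    dataset = []
    for user, answers in answers_by_user.items():
        if user not in survey_scores:
            continue  # no survey response, so no label
        for text in answers:
            if len(text.split()) >= min_words:
                dataset.append((text, survey_scores[user]))
    return dataset


# Hypothetical users and scores, for illustration only.
answers_by_user = {
    "u1": ["word " * 200, "short answer"],
    "u2": ["word " * 250],
}
survey_scores = {"u1": {"Openness": 7}}
dataset = build_dataset(answers_by_user, survey_scores)
```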
Results
Only LIWC Features on labelled Essays
Trait              SMO        Logistic   Adaboost   SVM        Random Forest
Openness           60.5348 %  59.9271 %  59.1167 %  51.5397 %  55.1053 %
Conscientiousness  55.4295 %  55.3485 %  55.3485 %  50.8104 %  53.4441 %
Extraversion       54.5786 %  54.7812 %  54.8622 %  51.7423 %  53.201 %
Agreeableness      55.1459 %  53.7682 %  56.0778 %  53.0794 %  54.4165 %
Neuroticism        55.9968 %  56.1183 %  54.3355 %  50.0405 %  52.5932 %
LIWC Features + New Extracted Features on labelled Essays
Trait              SMO        Logistic   Adaboost   SVM        Random Forest
Openness           60.5348 %  60.3728 %  58.59 %    51.9854 %  57.7391 %
Conscientiousness  56.3614 %  55.0243 %  55.2269 %  51.2156 %  53.282 %
Extraversion       55.1864 %  55.5105 %  55.5105 %  51.2561 %  52.5122 %
Agreeableness      54.9028 %  53.6872 %  56.7261 %  53.0389 %  52.107 %
Neuroticism        56.9692 %  57.7391 %  54.0924 %  50.6888 %  51.7828 %
Expanded LIWC + New Extracted Features on labelled Essays
Trait              SMO        Logistic   Adaboost   SVM        Random Forest
Openness           61.1831 %  61.7504 %  59.8865 %  53.0794 %  56.3209 %
Conscientiousness  55.5105 %  54.6191 %  53.8088 %  51.5802 %  51.6613 %
Extraversion       54.2139 %  54.3355 %  55.8752 %  52.0259 %  50.6078 %
Agreeableness      55.3485 %  54.2139 %  54.2139 %  51.6613 %  51.7423 %
Neuroticism        57.577 %   56.6856 %  54.8622 %  51.2561 %  51.9449 %
Expanded LIWC + New Extracted Features + Discourse Relations on labelled Essays
Trait              SMO        Logistic   Adaboost   SVM        Random Forest
Openness           61.4336 %  60.2726 %  58.9601 %  52.3473 %  57.2943 %
Conscientiousness  56.4866 %  55.679 %   53.054 %   51.5901 %  51.2367 %
Extraversion       54.1141 %  53.4578 %  55.5275 %  52.5492 %  53.054 %
Agreeableness      56.7895 %  56.9914 %  54.5684 %  53.8617 %  54.8208 %
Neuroticism        56.84 %    57.0419 %  53.8617 %  53.5083 %  53.3569 %
Only LIWC Features on Labelled Quora Dataset
Trait              SMO        Logistic   Adaboost   SVM        Random Forest
Openness           74.8971 %  74.8971 %  70.3704 %  70.535 %   71.6049 %
Conscientiousness  68.9712 %  66.9136 %  68.9712 %  68.9712 %  69.7942 %
Extraversion       76.2963 %  76.7078 %  76.2963 %  76.2963 %  78.93 %
Agreeableness      67.8189 %  66.5844 %  63.4568 %  63.4568 %  66.1728 %
Neuroticism        72.9218 %  71.8519 %  72.9218 %  72.9218 %  71.7695 %
Expanded LIWC + Features on Quora dataset
Trait              SMO        Logistic   Adaboost   Adaboost (random forest)  SVM        Random Forest
Openness           75.3909 %  73.9095 %  72.7572 %  77.284 %                  71.0288 %  74.7325 %
Conscientiousness  70.1235 %  67.572 %   68.9712 %  73.6626 %                 68.5597 %  71.2757 %
Extraversion       76.3786 %  77.284 %   76.2963 %  80.4115 %                 77.9424 %  79.7531 %
Agreeableness      66.9959 %  67.1605 %  63.4568 %  69.3827 %                 64.1975 %  66.0905 %
Neuroticism        73.0041 %  70.0412 %  72.9218 %  75.2263 %                 71.9342 %  72.3457 %
Future Work
Expand LIWC by taking more unlabelled Quora data
Gather richer labelled Quora data by conducting paid personality surveys
Evaluate on more labelled Quora data
Leverage discourse output to generate better discourse features
Add more linguistic features by identifying patterns in Quora answers
Thank You