DON’T REMOVE MY STOP WORDS: IDENTIFYING PERSONALITY TRAITS FROM QUORA ANSWERS

ASHUTOSH BAHETI, 12CS10012

RAHUL GURNANI, 12CS10039

DHRUV JAIN, 12CS30043

NISHKARSH SHASTRI, 12CS10034

SABYASACHEE BARAUH, 12CS30029

OBJECTIVE

● Identifying the personality of Quora users with respect to the Big Five personality traits, using linguistic-feature-based analysis of their answers

● Openness To Experience

● Conscientiousness

● Extraversion

● Agreeableness

● Neuroticism

RELATED WORK

Tausczik, Yla R., and James W. Pennebaker. "The psychological meaning of words: LIWC and computerized text analysis methods."

Mairesse, François, et al. "Using linguistic cues for the automatic recognition of personality in conversation and text."

Celli, Fabio, Fabio Pianesi, David Stillwell, and Michal Kosinski. "Workshop on Computational Personality Recognition."

Project Timeline

Classifying essay data using LIWC categories as features

Identifying the linguistic features for the Big Five personality traits

Extraction of textual features from the essays

Classifying based on new features and LIWC

Survey with the Quora users to get a labelled dataset

Crawling the answers of Surveyed users

Using the Quora Dump to expand LIWC

Training the model on the labelled Quora dataset

Calculating the accuracy of the trained model

Classification of Essay Data

Straightforward ML approach

Labelled essays with binary values for each personality trait

Sanitized the data present in the essays

Created the trie structure for LIWC prefix matching

Extracted the features based on LIWC word count for each category

Applied SVM to the data using WEKA

Accuracy of the model was found to be 53%
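The trie-based prefix matching and per-category counting above can be sketched roughly as follows. This is a minimal illustration, not the project's script: it assumes LIWC-style entries where a trailing `*` marks a prefix pattern, and the class name `LiwcTrie`, the helper `category_counts`, and the toy entries in the usage note are all hypothetical.

```python
from collections import Counter


class LiwcTrie:
    """Trie over LIWC-style entries; a trailing '*' marks a prefix pattern."""

    def __init__(self):
        self.root = {}

    def insert(self, entry, categories):
        is_prefix = entry.endswith("*")
        node = self.root
        for ch in entry.rstrip("*"):
            node = node.setdefault(ch, {})
        # Multi-character keys cannot clash with single-character child keys.
        key = "$prefix" if is_prefix else "$exact"
        node.setdefault(key, set()).update(categories)

    def lookup(self, word):
        """Return every category whose entry matches the word."""
        found = set()
        node = self.root
        for ch in word:
            # Any prefix pattern ending on the path so far matches the word.
            found |= node.get("$prefix", set())
            node = node.get(ch)
            if node is None:
                return found
        found |= node.get("$prefix", set())
        found |= node.get("$exact", set())
        return found


def category_counts(tokens, trie):
    """Count how many tokens fall into each category."""
    counts = Counter()
    for token in tokens:
        for category in trie.lookup(token.lower()):
            counts[category] += 1
    return counts
```

For example, after `trie.insert("happi*", ["posemo"])`, the lookup `trie.lookup("happiness")` returns the `posemo` category, while an exact entry such as `sad` does not match `sadly`. The resulting counts (possibly divided by the essay length) would form one feature per category.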

Features Identified for Extraversion

Word Variance (repetitiveness)

Type/Token Ratio

Formality measure (F-Measure) and informality measure (I-Measure)

F-Measure = (noun freq + adjective freq + preposition freq + article freq - pronoun freq - verb freq - adverb freq - interjection freq + 100) / 2

I-Measure = (wrong-typed word freq + interjection freq + emoticon freq) * 100

Positivity of Text and Negativity Of Text

Rich Vocabulary, use of difficult words

Concrete and Frequent Words

Use of more social words
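The type/token ratio and the two formulas on this slide can be sketched as below. This is an illustration under stated assumptions: the input is a list of (token, tag) pairs with Penn Treebank tags, as a tagger such as nltk's would produce; the F-Measure frequencies are taken as percentages of all tokens and the I-Measure frequencies as proportions; the tag-to-class mapping in `pos_group` is a guess at a reasonable mapping, not the project's own.

```python
from collections import Counter

ARTICLES = {"a", "an", "the"}
PRONOUN_TAGS = {"PRP", "PRP$", "WP", "WP$"}
ADVERB_TAGS = {"RB", "RBR", "RBS", "WRB"}


def pos_group(token, tag):
    """Map a (token, Penn Treebank tag) pair to an F-Measure word class."""
    if token.lower() in ARTICLES:
        return "article"
    if tag.startswith("NN"):
        return "noun"
    if tag.startswith("JJ"):
        return "adjective"
    if tag == "IN":  # assumption: IN also covers subordinating conjunctions
        return "preposition"
    if tag in PRONOUN_TAGS:
        return "pronoun"
    if tag.startswith("VB"):
        return "verb"
    if tag in ADVERB_TAGS:
        return "adverb"
    if tag == "UH":
        return "interjection"
    return "other"


def f_measure(tagged_tokens):
    """Formality score from word-class frequencies (in percent)."""
    total = len(tagged_tokens)
    if total == 0:
        return 50.0  # every frequency is zero, so the formula gives 100 / 2
    counts = Counter(pos_group(tok, tag) for tok, tag in tagged_tokens)

    def freq(group):
        return 100.0 * counts[group] / total

    return (freq("noun") + freq("adjective") + freq("preposition")
            + freq("article") - freq("pronoun") - freq("verb")
            - freq("adverb") - freq("interjection") + 100.0) / 2.0


def i_measure(n_mistyped, n_interjections, n_emoticons, n_tokens):
    """Informality score from mistyped words, interjections and emoticons."""
    if n_tokens == 0:
        return 0.0
    return (n_mistyped + n_interjections + n_emoticons) / n_tokens * 100.0


def type_token_ratio(tokens):
    """Distinct word forms divided by total tokens (case-insensitive)."""
    if not tokens:
        return 0.0
    return len({t.lower() for t in tokens}) / len(tokens)
```

A text made only of articles, adjectives and nouns scores the maximum F-Measure of 100, while a text balanced between the added and subtracted classes scores 50.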

Features Identified for Openness

Preference for longer words

Words expressing tentativeness

Avoidance of 1st person singular pronouns

Present tense forms

Avoidance of past tense forms

Features Identified for Conscientiousness

Avoid negations

Avoid words reflecting discrepancies (e.g., should and would)

2nd person pronouns

Filler words (in males and not in females); more useful in speech analysis

Features Identified for Agreeableness

More positive emotions, fewer negative emotions

Few articles

Negative and Positive emotion words

Leisure activity

Features Identified for Neuroticism

1st person singular pronouns

Noun Negative

Multiple punctuation marks

Fewer references to occupation

Extraction of Features

Python scripts using nltk to extract the features mentioned in the previous five slides

Speech-based features were not extracted
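A few of the lexical features from the previous slides could be computed along these lines. This sketch swaps nltk's tokenizer for a plain regular expression so that it runs without downloaded models; the word lists and the function name `extract_features` are illustrative assumptions, not the project's actual lists.

```python
import re

FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}
SECOND_PERSON = {"you", "your", "yours", "yourself", "yourselves"}
NEGATIONS = {"no", "not", "never", "none", "nothing", "neither", "nor", "cannot"}


def extract_features(text):
    """Return a small dictionary of per-text lexical features."""
    # Regex tokenizer used here in place of nltk (assumption for portability).
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens) or 1  # avoid division by zero on empty text
    return {
        # Preference for longer words (Openness).
        "avg_word_length": sum(len(t) for t in tokens) / n,
        # 1st person singular pronouns (Openness, Neuroticism).
        "first_person_singular":
            sum(t in FIRST_PERSON_SINGULAR for t in tokens) / n,
        # 2nd person pronouns (Conscientiousness).
        "second_person": sum(t in SECOND_PERSON for t in tokens) / n,
        # Negations, including contracted forms such as "can't".
        "negations":
            sum(t in NEGATIONS or t.endswith("n't") for t in tokens) / n,
        # Runs of two or more sentence-final punctuation marks (Neuroticism).
        "multiple_punctuation": len(re.findall(r"[!?.]{2,}", text)),
    }
```

Each value except the punctuation count is a proportion of tokens, so essays of different lengths remain comparable.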

NLP-Technique-Based Features: Discourse Parsing

Ran discourse parsing on all the essay data.

Created RST style discourse trees.

Extracted main nucleus text from the data

Extracted the relation count from the RST trees

Normalized the relation count.

Constructed the feature vector to include the discourse relation count
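The steps above, from RST tree to feature vector, might look like the sketch below. The tree encoding (nested dictionaries with `relation`, `children`, `role` and `text` keys), the relation inventory in `RELATIONS`, and normalization by the total number of relations are all assumptions made for illustration; the slides do not specify the parser output format or the normalization used.

```python
from collections import Counter

# Hypothetical relation inventory; a real parser defines its own label set.
RELATIONS = ["Elaboration", "Contrast", "Cause", "Attribution"]


def count_relations(node, counts=None):
    """Count relation labels over all internal nodes of the tree."""
    counts = Counter() if counts is None else counts
    if "relation" in node:
        counts[node["relation"]] += 1
        for child in node["children"]:
            count_relations(child, counts)
    return counts


def relation_features(tree, relations=RELATIONS):
    """Relation counts divided by the total number of relations."""
    counts = count_relations(tree)
    total = sum(counts.values())
    return [counts[r] / total if total else 0.0 for r in relations]


def main_nucleus(node):
    """Follow nucleus children from the root down to a leaf."""
    while "relation" in node:
        # For multi-nuclear relations this takes the first nucleus.
        node = next(c for c in node["children"] if c["role"] == "N")
    return node["text"]


example_tree = {
    "relation": "Elaboration",
    "children": [
        {"role": "N", "text": "I like hiking."},
        {"role": "S", "relation": "Contrast", "children": [
            {"role": "N", "text": "It is tiring,"},
            {"role": "N", "text": "but it is fun."},
        ]},
    ],
}
```

The vector returned by `relation_features` can be appended to the LIWC and linguistic features of the same essay.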

Expansion of LIWC Word Set

Seeded LDA and Word2Vec Methods

Expansion of LIWC: Seeded LDA

Seeded LDA treats each document as a mixture of topics.

It treats each topic as a probability distribution over words.

We can give a prior asymmetric probability to a word-topic pair to seed the topic with the given word.

We used the gensim package and its eta parameter to implement seeded LDA; however, it did not give better results, due to overfitting.
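One way to seed topics through gensim's `eta` parameter is to build a topic-by-word prior matrix in which seed words get a larger value in their own topic. The sketch below is an assumption about how this could be done, not the project's code: the function names, the `base` and `boost` values, and the seed-word format are hypothetical. gensim and numpy are imported only inside `train_seeded_lda`, so the prior construction can be tried without them.

```python
def build_seeded_eta(vocab, seed_words, num_topics, base=0.01, boost=1.0):
    """Build an asymmetric topic-by-word prior.

    vocab: list of words, where the list index is the term id.
    seed_words: {topic id: [seed words for that topic]}.
    """
    index = {word: i for i, word in enumerate(vocab)}
    eta = [[base] * len(vocab) for _ in range(num_topics)]
    for topic, words in seed_words.items():
        for word in words:
            if word in index:  # seeds missing from the vocabulary are skipped
                eta[topic][index[word]] = boost
    return eta


def train_seeded_lda(corpus, dictionary, seed_words, num_topics):
    """Train LDA with the seeded prior (requires gensim and numpy)."""
    import numpy as np
    from gensim.models import LdaModel

    vocab = [dictionary[i] for i in range(len(dictionary))]
    # eta must have shape (num_topics, number of terms).
    eta = np.asarray(build_seeded_eta(vocab, seed_words, num_topics))
    return LdaModel(corpus=corpus, id2word=dictionary,
                    num_topics=num_topics, eta=eta)
```

A large `boost` relative to `base` ties a topic strongly to its seeds, which may be one route to the overfitting noted above.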

Expansion of LIWC: Word2Vec

Applied Word2Vec modelling to the Quora dump

Found the most similar words for each word listed under a LIWC tag

Compared the similarity with the 1 Billion Wiki text

Added the most similar words thus found to the new LIWC dictionary

Trained the models on the new LIWC dictionary
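The expansion step can be illustrated with plain cosine similarity over word vectors. In this sketch a small dictionary of toy vectors stands in for embeddings trained on the Quora dump; the function name `expand_category` and the `topn` and `min_similarity` cut-offs are hypothetical, since the slides do not state how many neighbours were kept or at what threshold.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_category(seed_words, vectors, topn=2, min_similarity=0.8):
    """Add the nearest neighbours of each seed word to the category."""
    expanded = set(seed_words)
    for seed in seed_words:
        if seed not in vectors:
            continue  # seed never seen in the training corpus
        scored = sorted(
            ((cosine(vectors[seed], vec), word)
             for word, vec in vectors.items() if word != seed),
            reverse=True,
        )
        for similarity, word in scored[:topn]:
            if similarity >= min_similarity:
                expanded.add(word)
    return expanded


# Toy two-dimensional vectors, purely illustrative.
toy_vectors = {
    "happy": [1.0, 0.0],
    "glad": [0.9, 0.1],
    "sad": [-1.0, 0.0],
    "table": [0.0, 1.0],
}
```

With these toy vectors, expanding a category seeded with `happy` adds `glad` and leaves out `sad` and `table`, which fall below the similarity cut-off.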

Expansion of Posemo, Negemo, Funct-words

Added more positive words and negative words [1]

Added more function words [2]

1. Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews."

2. Leah Gilner and Franc Morales at Sequence Publishing (http://www.sequencepublishing.com) for listing English function words

User Survey

Survey Method

Used a 10-question questionnaire, the BFI-10

Contacted Quora users with more than 30 answers

50 users completed the survey

Calculated the personality score for each of the 5 personality traits on a scale of 1-10
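The slides do not say how the BFI-10 responses were turned into trait scores. As a hedged illustration, the sketch below applies the commonly published BFI-10 key (two items per trait, one of them reverse-scored on a 1-5 scale) and sums the two items, which yields scores from 2 to 10. Both the key and the summing are assumptions that should be checked against the questionnaire actually used.

```python
# Assumed BFI-10 key: (item number, reverse-scored?) for each trait.
BFI10_KEY = {
    "Extraversion": [(1, True), (6, False)],
    "Agreeableness": [(2, False), (7, True)],
    "Conscientiousness": [(3, True), (8, False)],
    "Neuroticism": [(4, True), (9, False)],
    "Openness": [(5, True), (10, False)],
}


def score_bfi10(responses):
    """Score one respondent.

    responses: {item number 1-10: Likert rating 1-5}.
    Returns {trait: summed score in the range 2-10}.
    """
    scores = {}
    for trait, items in BFI10_KEY.items():
        scores[trait] = sum(
            (6 - responses[item]) if reverse else responses[item]
            for item, reverse in items
        )
    return scores
```

A respondent who answers 3 to every item receives the midpoint score of 6 on all five traits.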

Extraction of Data

Wrote a Python script to crawl all the answers of these users

Sanitized the answers

Pruned all the answers with fewer than 200 words

Labelled the dataset thus obtained with survey results
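The pruning and labelling steps can be sketched as follows. The data layout (a mapping from user to crawled answers, and a mapping from user to survey scores) and the function name `build_dataset` are hypothetical, and the sketch assumes that every retained answer inherits the survey labels of its author, which the slides do not state explicitly.

```python
def build_dataset(answers_by_user, survey_scores, min_words=200):
    """Keep answers with at least min_words words and attach survey labels.

    answers_by_user: {user id: [answer text, ...]}.
    survey_scores: {user id: {trait: score}}.
    """
    dataset = []
    for user, answers in answers_by_user.items():
        if user not in survey_scores:
            continue  # no survey response, so no label is available
        for text in answers:
            # Whitespace splitting is a simple word-count approximation.
            if len(text.split()) >= min_words:
                dataset.append({
                    "user": user,
                    "text": text,
                    "labels": survey_scores[user],
                })
    return dataset
```

Answers shorter than the cut-off are dropped before feature extraction, so the number of examples per user can differ widely.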

Results

Only LIWC Features on Labelled Essays

Trait              SMO        Logistic   Adaboost   SVM        Random Forest
Openness           60.5348 %  59.9271 %  59.1167 %  51.5397 %  55.1053 %
Conscientiousness  55.4295 %  55.3485 %  55.3485 %  50.8104 %  53.4441 %
Extraversion       54.5786 %  54.7812 %  54.8622 %  51.7423 %  53.201 %
Agreeableness      55.1459 %  53.7682 %  56.0778 %  53.0794 %  54.4165 %
Neuroticism        55.9968 %  56.1183 %  54.3355 %  50.0405 %  52.5932 %

LIWC Features + New Extracted Features on Labelled Essays

Trait              SMO        Logistic   Adaboost   SVM        Random Forest
Openness           60.5348 %  60.3728 %  58.59 %    51.9854 %  57.7391 %
Conscientiousness  56.3614 %  55.0243 %  55.2269 %  51.2156 %  53.282 %
Extraversion       55.1864 %  55.5105 %  55.5105 %  51.2561 %  52.5122 %
Agreeableness      54.9028 %  53.6872 %  56.7261 %  53.0389 %  52.107 %
Neuroticism        56.9692 %  57.7391 %  54.0924 %  50.6888 %  51.7828 %

Expanded LIWC + New Extracted Features on Labelled Essays

Trait              SMO        Logistic   Adaboost   SVM        Random Forest
Openness           61.1831 %  61.7504 %  59.8865 %  53.0794 %  56.3209 %
Conscientiousness  55.5105 %  54.6191 %  53.8088 %  51.5802 %  51.6613 %
Extraversion       54.2139 %  54.3355 %  55.8752 %  52.0259 %  50.6078 %
Agreeableness      55.3485 %  54.2139 %  54.2139 %  51.6613 %  51.7423 %
Neuroticism        57.577 %   56.6856 %  54.8622 %  51.2561 %  51.9449 %

Expanded LIWC + New Extracted Features + Discourse Relations on Labelled Essays

Trait              SMO        Logistic   Adaboost   SVM        Random Forest
Openness           61.4336 %  60.2726 %  58.9601 %  52.3473 %  57.2943 %
Conscientiousness  56.4866 %  55.679 %   53.054 %   51.5901 %  51.2367 %
Extraversion       54.1141 %  53.4578 %  55.5275 %  52.5492 %  53.054 %
Agreeableness      56.7895 %  56.9914 %  54.5684 %  53.8617 %  54.8208 %
Neuroticism        56.84 %    57.0419 %  53.8617 %  53.5083 %  53.3569 %

Only LIWC Features on Labelled Quora Dataset

Trait              SMO        Logistic   Adaboost   SVM        Random Forest
Openness           74.8971 %  74.8971 %  70.3704 %  70.535 %   71.6049 %
Conscientiousness  68.9712 %  66.9136 %  68.9712 %  68.9712 %  69.7942 %
Extraversion       76.2963 %  76.7078 %  76.2963 %  76.2963 %  78.93 %
Agreeableness      67.8189 %  66.5844 %  63.4568 %  63.4568 %  66.1728 %
Neuroticism        72.9218 %  71.8519 %  72.9218 %  72.9218 %  71.7695 %

Expanded LIWC + Features on Quora Dataset

Trait              SMO        Logistic   Adaboost   Adaboost (random forest)  SVM        Random Forest
Openness           75.3909 %  73.9095 %  72.7572 %  77.284 %                  71.0288 %  74.7325 %
Conscientiousness  70.1235 %  67.572 %   68.9712 %  73.6626 %                 68.5597 %  71.2757 %
Extraversion       76.3786 %  77.284 %   76.2963 %  80.4115 %                 77.9424 %  79.7531 %
Agreeableness      66.9959 %  67.1605 %  63.4568 %  69.3827 %                 64.1975 %  66.0905 %
Neuroticism        73.0041 %  70.0412 %  72.9218 %  75.2263 %                 71.9342 %  72.3457 %

Future Work

Expand LIWC by taking more unlabelled Quora data

Gather richer labelled Quora data by conducting paid personality surveys

Evaluate on more labelled Quora data

Leverage discourse output to generate better discourse features

Add more linguistic features by identifying patterns in Quora answers

Thank You
