Modeling Latent Biographic Attributes in Conversational Genres

Nikesh Garera, David Yarowsky

Introduction

• Latent classification of biographic attributes such as gender, age, and native language.

• Modeling and classifying biographic attributes based on lexical and discourse factors.

• A speaker’s lexical choice and discourse style may differ substantially depending on the gender/age of the speaker’s interlocutor.

Corpus Detail

• Fisher telephone conversation corpus
• Standard Switchboard conversational corpus
• Both are transcribed and annotated.

Modeling Gender via Ngram Features

• Using unigram and bigram features with tf-idf weighting.

• Stop words were retained in the feature set.
• Only Ngrams with frequency greater than 5 were retained.
• SVM model.
– 90.84% on the Fisher corpus for gender classification
– 90.22% on the Switchboard corpus for gender classification
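The feature extraction described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the tf-idf variant (raw term frequency times log(N/df)) is an assumption since the slides do not specify one, and the SVM training step is left out.

```python
import math
from collections import Counter

def ngrams(tokens):
    """Return the unigrams and bigrams of a token list (stop words are kept)."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams

def build_vocabulary(sides, min_count=5):
    """Keep only n-grams whose corpus frequency is greater than min_count."""
    totals = Counter()
    for tokens in sides:
        totals.update(ngrams(tokens))
    return {gram for gram, count in totals.items() if count > min_count}

def tfidf_vectors(sides, vocabulary):
    """Map each conversation side to a sparse {ngram: tf-idf weight} dict.

    Assumed weighting: raw term frequency * log(N / document frequency).
    """
    counts = [Counter(g for g in ngrams(tokens) if g in vocabulary)
              for tokens in sides]
    doc_freq = Counter()
    for side_counts in counts:
        doc_freq.update(side_counts.keys())
    n_docs = len(sides)
    return [
        {g: tf * math.log(n_docs / doc_freq[g]) for g, tf in side_counts.items()}
        for side_counts in counts
    ]
```

Each resulting sparse vector would then be fed to an SVM, one vector per conversation side.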

Effect of Partner’s Gender

• Modeling of speaker gender/age based on the prior and joint modeling of the partner speaker’s gender/age

• People tend to use stronger gender-specific, age-specific, or dialect-specific words, phrases, and discourse properties when speaking with someone of a similar gender/age/dialect than with someone of a different gender/age/dialect, in which case they may adopt a more neutral speaking style.


• Oracle Experiment
– Assume we know whether the conversation is homogeneous (same gender) or heterogeneous (different gender).
– Classify both the test conversation side and the partner side; if the classifier is more confident about the partner side, choose the gender of the test conversation side from the partner's predicted gender and the heterogeneous/homogeneous information.
• Test (T) vs. Partner (P)
– If Confidence(T) > Confidence(P): keep Class(T)
– If Confidence(T) < Confidence(P):
» If Hetero, Class(T) = !Class(P)
» If Homo, Class(T) = Class(P)
– Overall accuracy improves from 90.84% to 96.46% on the Fisher corpus.
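The oracle decision rule can be sketched as a small function. The names `oracle_gender` and `flip` are hypothetical, the labels are assumed to be the strings "male" and "female", and resolving a confidence tie in favor of the test side is an assumption the slides do not address.

```python
def flip(gender):
    """Return the opposite label in a two-class male/female setup."""
    return "female" if gender == "male" else "male"

def oracle_gender(test_pred, test_conf, partner_pred, partner_conf, homogeneous):
    """Pick the gender of the test side using oracle homo/hetero knowledge.

    homogeneous is True for a same-gender conversation, False otherwise.
    """
    # Classifier is at least as sure about the test side: trust it directly.
    # (Treating a tie this way is an assumption.)
    if test_conf >= partner_conf:
        return test_pred
    # Otherwise derive the test label from the more confident partner label.
    return partner_pred if homogeneous else flip(partner_pred)
```

For example, a weak "male" prediction for the test side paired with a strong "male" prediction for the partner in a known mixed-gender conversation is overridden to "female".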


• Replacing the Oracle with a Homo vs. Hetero Classifier
– Classify the conversation as mixed-gender or single-gender.
– Accuracy is low (68.35% on the Fisher corpus) because male-male and female-female conversations are grouped into one class.
– Create two different classifiers:
» Male-male vs. Rest
» Female-female vs. Rest


• Modeling partner via conditional model and whole-conversation model
– The following classifiers were trained, and each of their scores was used as a feature in a meta SVM classifier:
» Male-male vs. Rest
» Female-female vs. Rest
» Conditional model of gender given most likely partner's gender
» Ngram model


• Conditional model of gender given most likely partner's gender
– Two separate classifiers were trained for classifying the gender of a given conversation side: one where the partner is male and the other where the partner is female.

– Given a test conversation side, we first choose the most likely gender of the partner’s conversation side using the Ngram-based model.

– Choose the gender of the test conversation side using the appropriate conditional model.
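The two-stage procedure can be sketched as follows. This is an illustration under assumptions: the three classifiers are passed in as plain callables that take a token list and return "male" or "female", and all names are hypothetical rather than taken from the authors' code.

```python
def classify_with_partner(test_side, partner_side, ngram_model,
                          given_male_partner, given_female_partner):
    """Classify the test side with the model matching the partner's likely gender.

    ngram_model, given_male_partner and given_female_partner are assumed to be
    callables mapping a list of tokens to "male" or "female".
    """
    # Stage 1: guess the partner's gender with the plain Ngram-based model.
    partner_gender = ngram_model(partner_side)
    # Stage 2: apply the conditional classifier trained for that partner gender.
    if partner_gender == "male":
        return given_male_partner(test_side)
    return given_female_partner(test_side)
```

Passing the models in as callables keeps the sketch independent of any particular SVM library.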

Effect of Partner’s Gender

• Incorporating Sociolinguistic Features
– 1. % of conversation spoken
– 2. Speaker rate
– 3. % of pronoun usage
– 4. % of back-channel responses such as "(laughter)" and "(lipsmacks)"
– 5. % of passive usage
– 6. % of short utterances
– 7. % of modal auxiliaries, subordinate clauses
– 8. % of “mm” tokens such as "mhm", "um", "uh-huh", "uh", "hm", "hmm", etc.
– 9. Type-token ratio
– 10. Mean inter-utterance time
– 11. % of “yeah” occurrences
– 12. % of WH-question words
– 13. Mean word and utterance length

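A few of the listed features can be computed directly from a transcript, as in the sketch below. It covers only features 8, 9, 11, 12 and 13; whitespace tokenization, the exact membership of `WH_WORDS` (including "how"), and the function name are assumptions made for illustration.

```python
# Filler tokens named on the slide; the real list ends with "etc.".
MM_TOKENS = {"mhm", "um", "uh-huh", "uh", "hm", "hmm"}
# Assumed set of WH-question words; the slides do not enumerate them.
WH_WORDS = {"who", "what", "when", "where", "why", "which", "how"}

def sociolinguistic_features(utterances):
    """Compute a subset of the features for one conversation side.

    utterances is a list of transcript strings, one per utterance.
    """
    tokens = [tok.lower() for utt in utterances for tok in utt.split()]
    n = len(tokens)
    if n == 0:
        return {}

    def pct(count):
        # Express a token count as a percentage of all tokens on this side.
        return 100.0 * count / n

    return {
        "pct_mm": pct(sum(t in MM_TOKENS for t in tokens)),
        "pct_yeah": pct(tokens.count("yeah")),
        "pct_wh": pct(sum(t in WH_WORDS for t in tokens)),
        "type_token_ratio": len(set(tokens)) / n,
        "mean_word_length": sum(len(t) for t in tokens) / n,
        "mean_utterance_length": n / len(utterances),
    }
```

The remaining features (speaker rate, inter-utterance time, passive usage and so on) would need timing information or syntactic analysis that this sketch does not attempt.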
Gender Classification Results

Reference

Nikesh Garera. Multilingual Acquisition of Structured Information via Novel Relationship Extraction Models over Diverse Knowledge Sources. Ph.D. Thesis, Johns Hopkins University, Baltimore, Maryland, September 2009.
