Detecting Gender by Full Name: Experiments with the Russian Language


DESCRIPTION

This paper describes a method that detects the gender of a person from his or her full name. While several approaches have been proposed for English, little has been done so far for Russian. We fill this gap and present a large-scale experiment on a dataset of 100,000 Russian full names from Facebook. Our method is based on three types of features (word endings, character n-grams and a dictionary of names) combined within a linear supervised model. Experiments show that this simple and computationally efficient approach achieves an accuracy of up to 96%.

TRANSCRIPT

Page 1: Detecting Gender by Full Name:  Experiments with the Russian Language

Problem and the Dataset Method Results References

Detecting Gender by Full Name: Experiments with the Russian Language

Alexander Panchenko [email protected]

Digital Society Laboratory LLC, Russia

April 11, 2014

Alexander Panchenko [email protected] Detecting Gender by Full Name: Experiments with the Russian Language


Problem

Detect the gender of a user:
- to profile a user;
- user segmentation is helpful in search, advertisement, etc.

By text written by a user: Ciot et al. [2013], Koppel et al. [2002], Goswami et al. [2009], Mukherjee and Liu [2010], Peersman et al. [2011], Rao et al. [2010], Rangel and Rosso [2013], Al Zamal et al. [2012] and Liu et al. [2012]. Russian language: Daniel [2012]

By full name: Burger et al. [2011]


Online demo

http://research.digsolab.com/gender


Dataset

- 100,000 full names of Facebook users
- full name: first and last name of a user
- each full name has a gender label: male or female
- names are written in both Cyrillic and Latin alphabets
- examples: “Alexander Ivanov”, “Masha Sidorova”, “Pavel Nikolenko”, etc.


Dataset

Figure: Name-surname co-occurrences. Rows and columns are sorted by frequency.


Dataset

- 72% of first names have a typical male/female ending
- 68% of surnames have a typical male/female ending
- a typical male/female ending splits males from females with an error of less than 5%
- gender of ≥ 50% of first names is recognized with 8 endings
- gender of ≥ 50% of second names is recognized with 5 endings

Conclusion

A simple symbolic ending-based method cannot robustly classify about 30% of names. This motivates the need for a more sophisticated statistical approach.


Dataset: character endings of Russian names

Type          Ending      Gender   Error, %   Example
first name    na (на)     female   0.27       Ekaterina
first name    iya (ия)    female   0.32       Anastasiya
first name    ei (ей)     male     0.16       Sergei
first name    dr (др)     male     0.00       Alexandr
first name    ga (га)     male     4.94       Serega
first name    an (ан)     male     4.99       Ivan
first name    la (ла)     female   4.23       Luidmila
first name    ii (ий)     male     0.34       Yurii
second name   va (ва)     female   0.28       Morozova
second name   ov (ов)     male     0.21       Objedkov
second name   na (на)     female   2.22       Matyushina
second name   ev (ев)     male     0.44       Sergeev
second name   in (ин)     male     1.94       Teterin

Table: Most discriminative and frequent two-character endings of Russian names.
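To make the limits of a purely symbolic approach concrete, the endings in the table can be turned into a small rule-based classifier. The following is a minimal Python sketch, not the baseline evaluated in the slides: the function name `guess_gender_by_endings`, the restriction to Cyrillic "first last" input, and the decision to abstain on unknown or conflicting endings are all assumptions made for illustration.

```python
# Two-character Cyrillic endings and their majority gender, copied from the
# table above (the error rates are omitted).
FIRST_NAME_ENDINGS = {
    "на": "female", "ия": "female", "ла": "female",
    "ей": "male", "др": "male", "га": "male", "ан": "male", "ий": "male",
}
LAST_NAME_ENDINGS = {
    "ва": "female", "на": "female",
    "ов": "male", "ев": "male", "ин": "male",
}


def guess_gender_by_endings(full_name):
    """Return "male", "female", or None when the rules cannot decide."""
    parts = full_name.lower().split()
    if len(parts) != 2:
        return None  # assumption: input is exactly "first last"
    first, last = parts
    # Collect the votes of the first-name and last-name endings.
    votes = {
        FIRST_NAME_ENDINGS.get(first[-2:]),
        LAST_NAME_ENDINGS.get(last[-2:]),
    }
    votes.discard(None)
    # Abstain when no ending is known or when the two endings disagree.
    return votes.pop() if len(votes) == 1 else None
```

A name such as “Олег Мороз” (Oleg Moroz) matches none of the listed endings, so the sketch abstains on it. This is the kind of case that motivates the statistical model.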


Gender Detection Method

- input: a string representing the name of a person
- output: gender (male or female)
- binary classification task


Features

- endings
- character n-grams
- dictionary of male/female names and surnames


Word endings

- males: Alexander Yaroskavski, Oleg Arbuzov
- females: Alexandra Yaroskavskaya, Nayaliya Arbuzova
- BUT: “Sidorenko”, “Moroz” or “Bondar”!
- two most common one-character endings: “a” and “ya” (“я”)


Character n-grams

- unigrams, bigrams or trigrams
- k most frequent n-grams

Most frequent trigrams

a__, va_, v__, na , ova, __A, ov_, ina, kov, nov, _Al, n__, a__, o__, __V, ndr, Ale, iya , ei , lek, eks, ko_, nko, rin, _An, enk, __C, na_, “ii ”, __E, eva, __N, __M, san, ksa , __I , ev_ , in_, and, len, __O, va_, rov, _Na
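The trigrams listed above suggest that each word is padded with underscores before n-grams are extracted (e.g. `__A`, `_Al`, `va_`, `a__`). A minimal Python sketch of such a feature extractor follows. The padding scheme (n − 1 underscores on each side of every word), the binary presence encoding, and the helper names `char_ngrams`, `top_k_ngrams` and `featurize` are assumptions, since the slides do not spell out these details.

```python
from collections import Counter


def char_ngrams(word, n=3, pad="_"):
    """Character n-grams of a word padded with n-1 pad symbols per side."""
    padded = pad * (n - 1) + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def top_k_ngrams(full_names, k, n=3):
    """The k most frequent n-grams over all words of all full names."""
    counts = Counter()
    for name in full_names:
        for word in name.split():
            counts.update(char_ngrams(word, n))
    return [gram for gram, _ in counts.most_common(k)]


def featurize(full_name, vocabulary, n=3):
    """Binary vector: 1 if the vocabulary n-gram occurs in the name."""
    present = {g for word in full_name.split() for g in char_ngrams(word, n)}
    return [1 if gram in present else 0 for gram in vocabulary]
```

For example, `char_ngrams("Ivan")` yields `__I`, `_Iv`, `Iva`, `van`, `an_` and `n__`. The vocabulary returned by `top_k_ngrams` fixes the order of the feature vector that `featurize` produces.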


Dictionaries of first and last names

Each first/last name w is stored with the probability that it belongs to the male gender:

$$P(c = \text{male} \mid w) = \frac{n_w^{\text{male}}}{\sum_{c \in \{\text{male}, \text{female}\}} n_w^{c}},$$

where $n_w^{\text{male}}$ is the number of male profiles with the first/last name $w$ in the dictionary.

- probability that the first name is of male gender: $P(c = \text{male} \mid \text{firstname})$
- probability that the last name is of male gender: $P(c = \text{male} \mid \text{lastname})$
- 3,427 first names, 11,411 last names
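The dictionary probability is a simple ratio of counts, which the following Python sketch illustrates on toy data. The function name `build_gender_dictionary` and the toy profiles are hypothetical, and so is the fallback of 0.5 for names missing from the dictionary; the slides do not say how unseen names are handled.

```python
from collections import defaultdict


def build_gender_dictionary(labeled_names):
    """Map each name w to P(c = male | w) = n_male / (n_male + n_female)."""
    counts = defaultdict(lambda: {"male": 0, "female": 0})
    for name, gender in labeled_names:
        counts[name][gender] += 1
    return {
        name: c["male"] / (c["male"] + c["female"])
        for name, c in counts.items()
    }


# Toy first-name observations (hypothetical, not taken from the dataset).
profiles = [
    ("Sasha", "male"), ("Sasha", "female"), ("Sasha", "male"),
    ("Oleg", "male"), ("Elena", "female"),
]
p_male = build_gender_dictionary(profiles)

# Assumed fallback: an unseen name gets the uninformative value 0.5.
p_unseen = p_male.get("Gocha", 0.5)
```

Here “Sasha” occurs in two male profiles and one female profile, so its dictionary entry is 2/3. A value this far from 0 or 1 signals an ambiguous name.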


Model

- L2-regularized Logistic Regression
- yields reasonable results for NLP-related problems
- model w, feature vector x
- classification:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-w^T x}}.$$
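The classification rule can be sketched in a few lines of Python. This shows only prediction for a given weight vector w, not training of the L2-regularized model. The function names, the example weights, and the 0.5 decision threshold are assumptions, and the slides do not state which gender is encoded as y = 1.

```python
import math


def predict_proba(w, x):
    """P(y = 1 | x) = 1 / (1 + exp(-w^T x)), computed in a stable way."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form exp(z) / (1 + exp(z)),
    # which avoids overflow in exp(-z) when z is very negative.
    ez = math.exp(z)
    return ez / (1.0 + ez)


def classify(w, x, threshold=0.5):
    """Return 1 if P(y = 1 | x) reaches the (assumed) threshold, else 0."""
    return 1 if predict_proba(w, x) >= threshold else 0


# Hypothetical weights and a two-feature input.
w = [2.0, -1.0]
x = [1.0, 1.0]
probability = predict_proba(w, x)  # sigmoid of w^T x = 1.0
label = classify(w, x)
```

In the model above, x would hold the ending, n-gram and dictionary features, and w would be learned from the labeled names.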


Results

Model                   Accuracy        Precision       Recall          F-measure
rule-based baseline     0.638           0.995           0.633           0.774
endings                 0.850 ± 0.002   0.921 ± 0.003   0.784 ± 0.004   0.847 ± 0.002
3-grams                 0.944 ± 0.003   0.948 ± 0.003   0.946 ± 0.003   0.947 ± 0.003
dicts                   0.956 ± 0.002   0.992 ± 0.001   0.925 ± 0.003   0.957 ± 0.002
endings+3-grams         0.946 ± 0.003   0.950 ± 0.002   0.947 ± 0.004   0.949 ± 0.003
3-grams+dicts           0.956 ± 0.003   0.960 ± 0.003   0.957 ± 0.004   0.959 ± 0.003
endings+3-grams+dicts   0.957 ± 0.003   0.961 ± 0.003   0.959 ± 0.004   0.960 ± 0.002

Table: Results of the experiments on the training set of 10,000 names. Here endings denotes 4 Russian female endings, 3-grams denotes the 1000 most frequent 3-grams, and dicts denotes the name/surname dictionary with the top γ = 80% of entries. The table presents precision, recall and F-measure of the female class.
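For readers who want to relate the columns of the results table to each other, the short Python sketch below computes precision, recall and F-measure from error counts, assuming the usual F1 definition (the harmonic mean of precision and recall). The function names and the toy counts are hypothetical. Only the precision and recall values passed to `f_measure` at the end come from the table.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall of the positive (here: female) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy counts: 8 true positives, 2 false positives, 2 false negatives.
p, r = precision_recall(tp=8, fp=2, fn=2)
f = f_measure(p, r)

# Consistency check against the rule-based baseline row of the table.
baseline_f = round(f_measure(0.995, 0.633), 3)
```

Under this definition the rule-based baseline's precision of 0.995 and recall of 0.633 give an F-measure of 0.774, which agrees with the value in the table.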


Results

Figure: Learning curves of single and combined models. Accuracy was estimated on a separate sample of 10,000 names.


Results

Figure: Accuracy of the 3-grams model as a function of the number of features used, k.


Results

Figure: Accuracy of the dicts model as a function of the fraction of the dictionaries used, γ.


Error Analysis

There are several types of errors:
- Inconsistent annotation, such as “Anna Kryukova (male)” or “Boris Krolchansky (female)”.
- The name string is neither male nor female, but rather the name of a group, e.g. “Wikom Tools”, “Kazakh University of Humanities” or “Privat Bank”.
- The name string represents a foreign name, e.g. “Abdulloh Ibn Abdulloh”, “Brooke Alisson”, “Ulpetay Niyetbay” or “Yola Dolson”. Our model was not trained to deal with such names.
- Meaningless or partially anonymized names, e.g. “Crazzy Ma”, “Un Petit Diable”, “Vv Tt”, “Vio La Tor” or “Muu Muu”. Additional information is required to derive the gender of such users.
- People with rare names or surnames, e.g. “Guldjan Reyzova”, “Yagun Zumpelich” or “Akob Saakan”. These names and surnames are common in other countries, such as Georgia, Kazakhstan, Azerbaijan or Tajikistan. In our dataset, people from these countries are under-represented.
- Full names that can denote both males and females, e.g. “Jenya Chekulenko”, “Jenya Sergienko”, “Sasha Sidorenko” or “Sasha Radchenko”. Additional information is required to infer the gender of such people.
- Misclassifications of common names, e.g. “Ilya Nasorshin”, “Oleg Dubovik” or “Elena Antropova”.


Error Analysis

    Train set errors               Test set errors
#   Name               True class  Name                                 True class
1   Lea Shraiber       female      Ilya Nadorshin                       male
2   Profanum Vulgus    female      Rustem Saledinov                     male
3   Anna Kryukova      male        Erkin Bahlamet                       male
4   Gin Amaya          male        Gocha Lapachi                        male
5   Gertrud Gallet     female      Muttaqiyyah Abdulvahhab              female
6   Dolores Laughter   female      Yola Dolson                          female
7   Di Nolik           male        Heiran Gasanova                      female
8   Jlija Hotieca      female      Hadji Murad                          male
9   Gic Globmedic      female      Jenya Chekulenko                     female
10  Ulpetay Niyetbay   female      Tury.Ru Domodedovskaya Metro Office  male
11  Olga Shoff         male        Elmira Nabizade                      female
12  Phil Golosoun      male        Niko Liparteliani                    male
13  Tsitsino Shurgaya  female      Oleg Grin’                           male
14  Anna Grobov        female      Santi Zarovneva                      female
15  Linguini Incident  female      Misha Badali                         male
16  Toma Oganesyan     female      Che Serega                           male
17  Swon Swetik        female      Petr Kiyashko                        male
18  Adel Simon         female      Sandugash Botabaeva                  female
19  Ant Kam-           male        Jenya Sergienko                      female
20  Xristi Xitrozver   female      Abdulloh Ibn Abdulloh                female
21  Anii Reznookova    female      Naikaita Laitvainenko                male
22  Aurelia Grishko    male        Fil Kalnitskiy                       male
23  Alex Bu            female      Helen Hovel’                         female
24  Karen Karine       female      Valery Kotelnikov                    male
25  Russian Spain      female      Max Od                               male
26  Lucy Walter        male        Jean Kvartshelia                     male
27  Aysah Ahmed        female      Adjedo Trupachuli                    female
28  Kiti Iz            female      Ainur Serikova                       female
29  Cutejilian Juka    female      Privat Bank                          female
30  Azer Dunja         male        No Limit                             female

Table: Examples of train and test set errors of the model endings+3-grams+dicts. Here Cyrillic characters were transliterated into the Latin alphabet in the standard way.


Thank you! Questions?


Al Zamal, F., Liu, W., and Ruths, D. (2012). Homophily and latent attribute inference: Inferring latent attributes of Twitter users from neighbors. In ICWSM.

Burger, J. D., Henderson, J., Kim, G., and Zarrella, G. (2011). Discriminating gender on Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1301–1309. Association for Computational Linguistics.

Ciot, M., Sonderegger, M., and Ruths, D. (2013). Gender inference of Twitter users in non-English contexts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Wash, pages 18–21.

Daniel, M. A. and Zelenkov, Y. (2012). Russian National Corpus as a playground for sociolinguistic research. Episode IV. Gender and length of the utterance. In Proceedings of Dialog-2012, pages 51–62.


Goswami, S., Sarkar, S., and Rustagi, M. (2009). Stylometric analysis of bloggers’ age and gender. In Third International AAAI Conference on Weblogs and Social Media.

Koppel, M., Argamon, S., and Shimoni, A. R. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412.

Liu, W., Zamal, F. A., and Ruths, D. (2012). Using social media to infer gender composition of commuter populations. Proceedings of the When the City Meets the Citizen Workshop.

Mukherjee, A. and Liu, B. (2010). Improving gender classification of blog authors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 207–217. Association for Computational Linguistics.

Peersman, C., Daelemans, W., and Van Vaerenbergh, L. (2011). Predicting age and gender in online social networks. In Proceedings of the 3rd international workshop on Search and mining user-generated contents, pages 37–44. ACM.


Rangel, F. and Rosso, P. (2013). Use of language and author profiling: Identification of gender and age. Natural Language Processing and Cognitive Science, page 177.

Rao, D., Yarowsky, D., Shreevats, A., and Gupta, M. (2010). Classifying latent user attributes in Twitter. In Proceedings of the 2nd international workshop on Search and mining user-generated contents, pages 37–44. ACM.
