Gender Detection in Blogs [Information Retrieval and Extraction]

GENDER DETECTION IN BLOGS

Presented By (Team No. 32)

Nitish Jain (201301227)

Ganesh Borle (201505587)

Vamshikrishna Reddy (201202177)

Mentored By

Lokesh Walase

IRE [CSE474]

The Big Picture

ABSTRACT

🔸Through the sands of time, textual content has remained a prominent feature of internet media, especially blogs.

🔸Thus, author profiling and attribution becomes an important task, and we try to capture one aspect of it, i.e. gender.

● A lot of content brings a lot of responsibility.

● But . . .

● The internet cannot take responsibility for all of the content; that responsibility should lie with the author.

Given the text of a blog, can we identify whether the writer is male or female?

The Question

WHO IS THE AUTHOR?

OUR APPROACH

THE APPROACH

● We take advantage of the linguistic features of the blog and create a feature file.

● Various classifiers are then trained on this feature file, and a model is prepared for each classifier.

🔸An ensemble is applied to these models, and the input document is classified as written by a male or a female.

WORKFLOW

The Dataset

● Koppel's blog dataset

● Contains about 19 thousand documents

🔸Each document contains the text of about 35 blogs in XML format.

[Dataset Link: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm]

PARSING

● Language used: Python

● Each blog entry is stored in XML format:

<Blog>
<date>....... </date>
<post>….</post>
...
</Blog>

● Each blog filename contains the name and gender of the author.
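
The parsing step described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names are hypothetical, the example filename is made up, and it assumes the gender appears as a dot-separated token in the filename. A regular expression is used instead of a strict XML parser on the assumption that raw blog text may contain characters that are not valid XML.

```python
import re

# Matches the body of every <post>...</post> element, across line breaks.
POST_RE = re.compile(r"<post>(.*?)</post>", re.DOTALL | re.IGNORECASE)


def extract_posts(xml_text):
    """Return the stripped text of each <post> element in one blog file."""
    return [body.strip() for body in POST_RE.findall(xml_text)]


def gender_from_filename(filename):
    """Return 'male' or 'female' if the filename carries such a token, else None.

    Assumption: the filename is dot-separated and one token is the gender.
    """
    for token in filename.lower().split("."):
        if token in ("male", "female"):
            return token
    return None


sample = """<Blog>
<date>14,May,2004</date>
<post>
  First post text.
</post>
<date>15,May,2004</date>
<post>
  Second post text.
</post>
</Blog>"""

posts = extract_posts(sample)
# Hypothetical filename used only for illustration.
label = gender_from_filename("1000331.female.37.indUnk.Leo.xml")
```

Here `posts` holds the text of the two posts and `label` is the class label that would be attached to every feature row built from this file.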

The Feature Extraction

FEATURES

For our task of Gender Identification, we take the help of the following linguistic features:

🔸Character Based Features

🔸Word Based Features

🔸Syntactic Features

🔸Structural Features

🔸Function Words

🔸POS Start Probability
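
As a rough illustration of what a few of these feature families might look like in Python, the sketch below computes some character-based, word-based, structural and function-word features for one text. The specific feature names and the short function-word list are assumptions made for this example; the features actually used in the project may differ. Syntactic and POS-based features are left out because they need a part-of-speech tagger.

```python
import re
import string

# Assumption: a tiny illustrative function-word list; a real list would be much longer.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "i", "you", "it", "is")


def extract_features(text):
    """Compute a few character, word, structural and function-word features."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_chars = len(text)
    n_words = len(words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in text.split("\n\n") if p.strip()]

    def ratio(count, total):
        # Guard against empty input.
        return count / total if total else 0.0

    features = {
        # Character-based features
        "char_count": n_chars,
        "upper_ratio": ratio(sum(c.isupper() for c in text), n_chars),
        "digit_ratio": ratio(sum(c.isdigit() for c in text), n_chars),
        "punct_ratio": ratio(sum(c in string.punctuation for c in text), n_chars),
        # Word-based features
        "word_count": n_words,
        "avg_word_len": ratio(sum(len(w) for w in words), n_words),
        "vocab_richness": ratio(len(set(words)), n_words),
        # Structural features
        "sentence_count": len(sentences),
        "paragraph_count": len(paragraphs),
    }
    # Function-word features: relative frequency of each listed word.
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = ratio(words.count(fw), n_words)
    return features


features = extract_features("I went to the market. It is in the city!\n\nYou should go.")
```

Each document would yield one such dictionary, and the rows together would form the feature file that the classifiers are trained on.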

The Classification

THE CLASSIFICATION TASK

For the task of classification, we tried several classification algorithms and arrived at a model that uses an ensemble of the following:

🔸Random Forest Classifier

🔸Neural Networks Classifier

🔸Adaboost Tree Classifier

🔸Gradient Boosting Classifier

🔸Bagging Classifier

THE CLASSIFICATION TASK

For each of the classifiers:

🔸We fed it partial feature sets to see how the accuracy varies with the features.

🔸We applied 10-fold validation to measure the accuracy.

To measure the accuracy of the ensemble, we took the majority class from the classified results of the classifiers.
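
The 10-fold validation mentioned above can be sketched in plain Python as follows. This is only an illustration of the procedure: `train_majority` is a hypothetical stand-in for a real classifier, the toy data is made up, and the sketch assumes there are at least as many samples as folds.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(train_fn, X, y, k=10, seed=0):
    """Average held-out accuracy over k folds.

    train_fn(X_train, y_train) must return a function mapping one sample to a label.
    """
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        predict = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def train_majority(X_train, y_train):
    """Toy baseline (hypothetical): always predict the most common training label."""
    most_common = max(set(y_train), key=y_train.count)
    return lambda sample: most_common


# Toy data: 100 samples, 60 labelled "male" and 40 labelled "female".
X = [[i] for i in range(100)]
y = ["male"] * 60 + ["female"] * 40
folds = k_fold_indices(len(X), k=10)
accuracy = cross_val_accuracy(train_majority, X, y, k=10)
```

Each sample is held out exactly once, so the averaged score reflects performance on data the model did not see during training.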

RANDOM FOREST CLASSIFIER

● A meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and averages their predictions.

● Using the Random Forest Classifier, we achieved an accuracy of 69.79%.

NEURAL NETWORKS CLASSIFIER

● Consists of multiple layers of nodes, with each layer fully connected to the nodes of the next layer; each node is a neuron with a non-linear activation function.

● Uses a supervised learning technique called backpropagation to train the network.

● Using the Neural Networks Classifier, we achieved an accuracy of 69.51%.

ADABOOST TREE CLASSIFIER

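● A meta-estimator that begins by fitting a classifier on the original dataset and then fits further rounds of classifiers on the same dataset, with the weights of incorrectly classified instances increased so that later classifiers focus on the difficult cases.

● Using the Adaboost Tree Classifier, we achieved an accuracy of 69.57%.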
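
To make the reweighting idea concrete, here is a minimal sketch of one round of the textbook discrete AdaBoost update. It is not the project's code: it assumes the two classes are encoded as +1 and -1 (which gender maps to which value is arbitrary) and that the sample weights sum to 1.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One round of discrete AdaBoost for labels in {-1, +1}.

    Returns (alpha, new_weights): the weak learner's vote weight and the
    re-normalised sample weights.
    """
    # Weighted error of the weak learner on the current weights.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    error = min(max(error, 1e-10), 1 - 1e-10)  # keep the logarithm finite
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified samples (t * p == -1) are scaled up, correct ones scaled down.
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]  # the weak learner gets the second sample wrong
weights = [0.25] * 4
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
```

After the update, the misclassified sample carries half of the total weight, which is what pushes the next weak learner toward the cases the previous one got wrong.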

GRADIENT BOOSTING CLASSIFIER

● Builds the model in a forward stage-wise fashion.

● At each subsequent stage, a new weak classifier is introduced to compensate for the shortcomings of the existing weak learners; these shortcomings are identified by the gradients.

● Using the Gradient Boosting Classifier, we achieved an accuracy of 70.81%.

BAGGING CLASSIFIER

● A meta-estimator that fits base classifiers, each on a random subset of the dataset, and then aggregates their individual predictions.

● Using the Bagging Classifier, we achieved an accuracy of 70.03%.

THE ENSEMBLE

● An ensemble takes the outputs of the other classifiers and applies majority voting to them to determine the final output.

● Using the Ensemble model on the classifiers discussed above, we achieved an accuracy of 71.10%.
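
The majority voting described above can be sketched as follows. The predictions are made-up values for illustration only, and the classifier names in the comments merely indicate which model each row might come from.

```python
from collections import Counter


def majority_vote(predictions_per_classifier):
    """Combine per-classifier label lists into one label per document."""
    combined = []
    # zip(*...) groups the predictions that all classifiers made for one document.
    for votes in zip(*predictions_per_classifier):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical predictions from five classifiers on four documents.
predictions = [
    ["male", "female", "male", "female"],    # e.g. random forest
    ["male", "male", "male", "female"],      # e.g. neural network
    ["female", "female", "male", "male"],    # e.g. AdaBoost
    ["male", "female", "female", "female"],  # e.g. gradient boosting
    ["male", "female", "male", "male"],      # e.g. bagging
]
ensemble = majority_vote(predictions)
score = accuracy(["male", "female", "male", "male"], ensemble)
```

With five classifiers and two classes, a tie is impossible, so every document gets a well-defined majority label.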

FINAL RESULTS

THE FINAL RESULTS

● By using the ensemble, we were able to increase our accuracy by nearly 1% in each case, irrespective of the performance of the individual classifiers.

● The maximum accuracy observed during the experiments was 73.19%, achieved by the Ensemble model.

73.188406 %

The maximum Accuracy Achieved

USEFUL LINKS

🔸Github - https://github.com/nitishjain2007/Gender_Identification

🔸Youtube -

🔸Slideshare -

🔸Website - http://nitishjain2007.github.io/Gender_Identification/

🔸Dropbox -



Thanks! Any questions?
