author paper midterm

Author-Paper Identification Problem. Team: Karthik Reddy Vakati, Nachammai C, Pooja Mishra. Guided by Prof. Duc Tran.

Upload: pooja-mishra

Post on 28-Jun-2015


Page 1: Author paper midterm

Author-Paper Identification Problem

Team:

Karthik Reddy Vakati

Nachammai C

Pooja Mishra

Guided by Prof. Duc Tran

Page 2: Author paper midterm

Problem Statement

• Determine, for a given paper, the correct author from the author dataset.

• Ambiguity in author names can cause a paper to be assigned to the wrong author, which leads to noisy author profiles.

• This KDD Cup task challenges participants to determine which papers in an author profile were truly written by the given author.

Page 3: Author paper midterm

Type of data

Data provided by the KDD challenge is in CSV format.

Paper - (Id, Title, Year, ConferenceId, JournalId, Keywords)
Author - (Id, Name, Affiliation)
Paper-Author - (PaperId, AuthorId, Name, Affiliation)
Conference - (Id, ShortName, FullName, HomePage)
Journal - (Id, ShortName, FullName, HomePage)
Train - (AuthorId, ConfirmedPaperIds, DeletedPaperIds)
Test - (AuthorId, PaperIds)
Validation - (AuthorId, PaperIds, Usage)

Page 4: Author paper midterm

Data Points

The data points include all papers written by an author and the author's affiliation (university, technical society, groups).

Paper-Author - (PaperId, AuthorId, Name, Affiliation)

The metadata includes the journals in which an author's papers appeared and the conferences the author attended.

Paper - (Id, Title, Year, ConferenceId, JournalId, Keywords)
Author - (Id, Name, Affiliation)
Conference - (Id, ShortName, FullName, HomePage)
Journal - (Id, ShortName, FullName, HomePage)

Page 5: Author paper midterm

Issues with data

The CSV files needed cleaning:

• A few records had attributes spilled over three rows.

• Some rows had more attributes than the required number.

• Special characters caused issues.

We wrote a Perl script to clean and reformat the data.
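The three repairs listed above could be sketched roughly as follows. This is an illustrative Python sketch, not the team's Perl script: the function name `clean_rows`, the decision to fold surplus fields into the last column, and the sample lines are all assumptions made for the example.

```python
import csv
import re

EXPECTED_FIELDS = 6  # Paper: Id, Title, Year, ConferenceId, JournalId, Keywords

def clean_rows(lines, expected=EXPECTED_FIELDS):
    """Yield repaired rows with exactly `expected` fields (hypothetical helper)."""
    pending = ""
    for line in lines:
        # Replace control characters (one kind of "special character") with spaces.
        line = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", " ", line.rstrip("\n"))
        # Glue this physical line onto any unfinished record.
        pending = f"{pending} {line}".strip() if pending else line
        fields = next(csv.reader([pending]))
        if len(fields) < expected:
            continue  # record spilled over; keep reading
        if len(fields) > expected:
            # Assumption: surplus fields belong to the last column.
            fields = fields[:expected - 1] + [" ".join(fields[expected - 1:])]
        yield [f.strip() for f in fields]
        pending = ""

# Hypothetical sample: one record split over two lines, one with an extra field.
sample_lines = [
    "1,Deep\n",
    "Learning,2010,5,0,neural nets\n",
    "2,Graphs,2011,0,7,trees,extra\n",
]
cleaned = list(clean_rows(sample_lines))
```

A record whose line break falls inside a field that also contains stray commas cannot be repaired unambiguously this way, so rows like that would still need manual inspection.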

Page 6: Author paper midterm

Issues with data-I

Page 7: Author paper midterm

Issues with data-II

Page 8: Author paper midterm

Predictions & Intuitions

Prediction: Given a paper and an author, one should be able to identify whether the given paper was written by the author.

Intuition: We initially identified this problem as a clustering problem. We chose clustering because the set of papers written by one author can be grouped together; then, for a given paper and author, we can check whether the paper falls in the author's cluster.

The features PaperId, AuthorId, PaperTitle and AuthorName play a significant role in the prediction.

Page 9: Author paper midterm

Feature selection

We used the following attributes from the Train dataset while building the model:

• ConfirmedPaperIds

• DeletedPaperIds
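One way to use these two attributes is to expand each Train row into labelled (author, paper) pairs, with confirmed papers as positives and deleted papers as negatives. The Python sketch below assumes the id lists are space-separated inside each CSV cell; the function name `label_pairs` and the sample rows are hypothetical.

```python
import csv
import io

def label_pairs(train_csv_text):
    """Expand Train rows into (author_id, paper_id, label) tuples.

    label is 1 for ConfirmedPaperIds and 0 for DeletedPaperIds.
    Assumes ids within a cell are separated by spaces.
    """
    pairs = []
    for row in csv.DictReader(io.StringIO(train_csv_text)):
        author = int(row["AuthorId"])
        for paper in row["ConfirmedPaperIds"].split():
            pairs.append((author, int(paper), 1))
        for paper in row["DeletedPaperIds"].split():
            pairs.append((author, int(paper), 0))
    return pairs

# Hypothetical two-author sample in the Train layout.
SAMPLE_TRAIN = (
    "AuthorId,ConfirmedPaperIds,DeletedPaperIds\n"
    "7,10 11,12\n"
    "8,,13 14\n"
)
pairs = label_pairs(SAMPLE_TRAIN)
```

Each resulting tuple is one training example for a binary classifier, which is the shape that tools such as Weka expect.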

Page 10: Author paper midterm

Tools Used & Models Trained

Tools used: Weka, R, Apache Mahout

Models trained: Simple K-Means, J-48, ZeroR

Page 11: Author paper midterm

K-means clustering using Weka: training the data

Page 12: Author paper midterm

Visualization of k-means clustering result

Page 13: Author paper midterm

Simple K-means clustering using R

Page 14: Author paper midterm

Error in R for Clustering

> x=read.table("Paper_fixed.csv",header=TRUE,sep=',')
> x[1:10,]
> # Paper has text columns (e.g. Title, Keywords); kmeans() coerces them to numeric, producing NA
> km3 <- kmeans(x,3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In kmeans(x, 3) : NAs introduced by coercion

Page 15: Author paper midterm

Conclusion

Why does clustering not work for this problem?

• Handling a mixed set of attributes is an issue in R.

• Simple k-means works by calculating distances from centroids and therefore needs numeric attributes.

• Hence clustering is not the best approach for our problem.

To overcome this, we are trying to convert the data into numeric (integer) values so that numeric distance measures can be computed.

However, this problem looks more like a classification problem: classifying whether a paper was written by a given author.
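The integer-conversion workaround can be illustrated with a small Python sketch (Python is used here for illustration rather than the R and Weka tools named in the slides). The helper names and the sample rows are hypothetical. It also shows the weakness of the approach: the codes impose an arbitrary order, so distances between them carry no real meaning.

```python
import math

def integer_encode(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return encoded, codes

def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]

# Hypothetical (Year, conference short name) rows.
rows = [(2005, "KDD"), (2006, "ICML"), (2005, "KDD")]
conf_codes, mapping = integer_encode([conf for _, conf in rows])
points = [(year, code) for (year, _), code in zip(rows, conf_codes)]

center = centroid(points)
# Euclidean distance from the second row to the centroid, as k-means would compute it.
distance = math.dist(points[1], center)
```

Because "ICML" receiving code 1 rather than 0 is an accident of row order, the resulting distance depends on the encoding and not on any real similarity between conferences.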

Page 16: Author paper midterm

Moving on to classification algorithms...

ZeroR

Tree J-48

Naïve Bayes
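ZeroR is the usual baseline among these: it ignores every feature and always predicts the majority class, so any useful classifier should beat it. A minimal Python sketch of the idea follows; it is not Weka's implementation, and the label lists are hypothetical.

```python
from collections import Counter

def zero_r_fit(labels):
    """Return the most frequent class label; features are never consulted."""
    return Counter(labels).most_common(1)[0][0]

def zero_r_predict(majority_label, n_rows):
    """Predict the same majority label for every row."""
    return [majority_label] * n_rows

def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)

# Hypothetical labels: 1 = paper confirmed for the author, 0 = paper deleted.
train_labels = [1, 1, 0, 1, 0]
test_labels = [1, 0, 1, 1]

majority = zero_r_fit(train_labels)
predictions = zero_r_predict(majority, len(test_labels))
baseline_accuracy = accuracy(predictions, test_labels)
```

The baseline accuracy simply equals the share of the majority class in the evaluated data, which is why it serves as a floor for J-48 and Naïve Bayes.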

Page 17: Author paper midterm

Results using Tree-J48 algorithm

Page 18: Author paper midterm

Results using ZeroR algorithm

Page 19: Author paper midterm

Visualization of ZeroR results for Precision

Page 20: Author paper midterm

Next Steps

We are working on feature engineering (feature transformation): taking the Author name attribute and transforming it into a common format for all author names.

Once the feature engineering is done, we will work principally on Naïve Bayes and other classification algorithms that we think will suit our problem.

And fine-tune the model…
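As a rough idea of what a common name format might look like, the Python sketch below lowercases names, strips accents and punctuation, and rewrites "Last, First" as "First Last". The target format and the function name `normalize_name` are assumptions for illustration; the slides do not specify the format the team chose.

```python
import re
import unicodedata

def normalize_name(name):
    """Return a lowercase, accent-free, 'first last' form of an author name."""
    # Decompose accented letters and drop the combining marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Assumption: a comma means the name is written "Last, First".
    if "," in text:
        last, _, first = text.partition(",")
        text = f"{first} {last}"
    # Keep only letters a-z; punctuation and digits become spaces.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Collapse runs of whitespace.
    return " ".join(text.split())

same_author = normalize_name("Mishra, Pooja") == normalize_name("Pooja  MISHRA")
```

This sketch erases characters outside a-z, so names written in non-Latin scripts would need a different treatment.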

Page 21: Author paper midterm

Thank you!!