author paper midterm

Author-Paper Identification Problem

Team:

Karthik Reddy Vakati

Nachammai C

Pooja Mishra

Guided by Prof. Duc Tran

Problem Statement

• To determine, from the author dataset, the correct author of a particular paper.

• Ambiguity in author names might cause a paper to be assigned to the wrong author, which leads to noisy author profiles.

• This KDD Cup task challenges participants to determine which papers in an author profile were truly written by a given author.

Type of data

Data provided by the KDD challenge is in CSV format.

• Paper (Id, Title, Year, ConferenceId, JournalId, Keywords)

• Author (Id, Name, Affiliation)

• Paper-Author (PaperId, AuthorId, Name, Affiliation)

• Conference (Id, ShortName, FullName, HomePage)

• Journal (Id, ShortName, FullName, HomePage)

• Train (AuthorId, ConfirmedPaperIds, DeletedPaperIds)

• Test (AuthorId, PaperIds)

• Validation (AuthorId, PaperIds, Usage)

Data Points

The data points include all papers written by an author and the author's affiliation (university, technical society, groups).

• Paper-Author (PaperId, AuthorId, Name, Affiliation)

The metadata includes the journals in which an author has published and the conferences the author has attended.

• Paper (Id, Title, Year, ConferenceId, JournalId, Keywords)

• Author (Id, Name, Affiliation)

• Conference (Id, ShortName, FullName, HomePage)

• Journal (Id, ShortName, FullName, HomePage)

Issues with data

The CSV files needed cleaning:

• A few records had attributes spilled over 3 rows.

• Some rows had more attributes than the required number.

• Special characters caused issues.

We wrote a Perl script to clean and format the data.
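The cleaning steps listed above can be sketched as follows. This is a minimal Python illustration, not the Perl script the team wrote. The expected field count (6, matching the Paper schema), the set of characters treated as "special", and the choice to fold surplus fields into the last column are all assumptions made for the example.

```python
import csv
import re

# Assumption: Paper rows have Id, Title, Year, ConferenceId, JournalId, Keywords.
EXPECTED_FIELDS = 6

# Assumption: "special characters" means control characters other than tab/newline.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def clean_rows(lines, expected=EXPECTED_FIELDS):
    """Yield rows that have exactly `expected` fields.

    Physical lines are merged until a record has enough fields, and
    surplus fields are folded into the last column.
    """
    pending = ""
    for line in lines:
        line = CONTROL_CHARS.sub(" ", line.rstrip("\r\n"))
        pending = f"{pending} {line}" if pending else line
        row = next(csv.reader([pending]))
        if len(row) < expected:
            continue  # record spilled onto the next line; keep accumulating
        if len(row) > expected:
            # Assumption: the extra fields belong to the last column.
            row = row[:expected - 1] + [" ".join(row[expected - 1:])]
        yield [field.strip() for field in row]
        pending = ""


# Hypothetical sample: one clean row, one row spilled over 3 lines,
# and one row with an extra attribute.
sample_lines = [
    "1,Title one,2001,3,0,kw a",
    "2,Title",
    "two",
    "continued,2002,0,5,kw b",
    "3,T,2003,1,0,k1,k2",
]
cleaned = list(clean_rows(sample_lines))
```

A short record that is never completed would be merged into the following record, so a real script would also need a rule for detecting where a new record starts (for example, a leading numeric Id).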

Issues with data-I

Issues with data-II

Predictions & Intuitions

Prediction: Given a paper and an author, one should be able to identify whether the given paper was written by the author.

Intuition: We initially identified this as a clustering problem. We chose clustering because the set of papers written by one author can be grouped together; then, for a given paper and author, we can check whether the paper falls in that author's cluster.

The features PaperId, AuthorId, PaperTitle, and AuthorName play a significant role in the prediction.

Feature selection

We used the following features from the Train dataset while building the model:

• ConfirmedPaperIds

• DeletedPaperIds
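One way to use these two columns is to expand each Train row into labelled (author, paper) pairs, with confirmed papers as positives and deleted papers as negatives. The sketch below is illustrative only: the function name, the sample rows, and the assumption that the IDs inside a cell are separated by whitespace are not taken from the slides.

```python
import csv
import io


def labelled_pairs(train_csv_text):
    """Expand each Train row into (author_id, paper_id, label) tuples.

    label is 1 for a confirmed paper and 0 for a deleted one.
    Assumption: the ID lists inside a cell are whitespace-separated.
    """
    pairs = []
    for row in csv.DictReader(io.StringIO(train_csv_text)):
        author_id = int(row["AuthorId"])
        for paper_id in row["ConfirmedPaperIds"].split():
            pairs.append((author_id, int(paper_id), 1))
        for paper_id in row["DeletedPaperIds"].split():
            pairs.append((author_id, int(paper_id), 0))
    return pairs


# Hypothetical rows in the Train layout (AuthorId, ConfirmedPaperIds, DeletedPaperIds).
sample = "AuthorId,ConfirmedPaperIds,DeletedPaperIds\n7,101 102,205\n9,300,\n"
pairs = labelled_pairs(sample)
print(pairs)
```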

Tools Used & Model Trained

Tools used:

• Weka

• R

• Apache Mahout

Models trained:

• Simple K-Means

• J-48

• ZeroR

K-means clustering using Weka: training the data

Visualization of k-means clustering result

Simple K-means clustering using R

Error in R for Clustering

> y=read.table("Paper_fixed.csv",header=TRUE,sep=',')

> y[1:10,]

> km3 <- kmeans(x,3)

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

In addition: Warning message:

In kmeans(x, 3) : NAs introduced by coercion

Conclusion

Why does clustering not work for this problem?

• Handling a mixed set of attributes is an issue in R.

• Simple k-means clustering works by calculating distances from centroids, and thus needs numeric attributes.

• Hence clustering is not the best approach for our problem.

To overcome the problem, we are trying to convert the data into numeric integer values so that numeric distance measures can be applied.

However, this problem looks more like a classification problem: classifying whether a paper was written by a given author.
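The conversion to integer values mentioned above could look roughly like the sketch below, which maps each distinct value of a text column to an integer code so that a numeric distance can be computed. The function names and the sample affiliations are hypothetical, and this is not the conversion the team implemented.

```python
import math


def encode_column(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return encoded, codes


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical affiliation values.
affiliations = ["MIT", "Stanford", "MIT", "CMU"]
encoded, mapping = encode_column(affiliations)
print(encoded)   # integer code per row
print(mapping)   # value -> code lookup

# With numeric vectors, a distance to a centroid is well defined.
distance = euclidean([encoded[0], 2001], [encoded[1], 2001])
```

Integer codes impose an arbitrary ordering on the categories, so distances between codes carry no real meaning; this is one more reason distance-based clustering fits this data poorly.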

Moving on to classification algorithms...

ZeroR

Tree J-48

Naïve Bayes
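As a reference point for the algorithms listed above, ZeroR is the simplest baseline: it ignores all features and always predicts the most frequent class in the training data. The sketch below illustrates that idea in plain Python; it is not Weka's implementation, and the class name, method names, and sample labels are assumptions made for the example.

```python
from collections import Counter


class ZeroRBaseline:
    """Baseline classifier: always predicts the most frequent training label."""

    def fit(self, labels):
        # Remember the majority label; features are never looked at.
        self.prediction = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n):
        # Return the same label for each of the n instances.
        return [self.prediction] * n


# Hypothetical labels: 1 = confirmed paper, 0 = deleted paper.
train_labels = [1, 1, 0, 1, 0]
model = ZeroRBaseline().fit(train_labels)
predictions = model.predict(3)
print(predictions)
```

Any classifier worth keeping, such as J-48 or Naïve Bayes, should beat this baseline.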

Results using Tree-J48 algorithm

Results using ZeroR algorithm

Visualization of ZeroR results for Precision

Next Steps

We are working on feature engineering (feature transformation): taking the author name attribute and transforming it into a common format for all author names.

Once the feature engineering is done, we will work principally on Naïve Bayes and other classification algorithms that we think will suit our problem.

And fine-tune the model…
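The author-name transformation described above might start from something like the following sketch, which lowercases names, strips accents and punctuation, and collapses whitespace. The function name and the specific rules are assumptions; the common format the team has in mind is not specified in the slides.

```python
import re
import unicodedata


def normalize_name(name):
    """Return a lowercase, accent-free, punctuation-free version of a name."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation with spaces, then collapse runs of whitespace.
    no_punct = re.sub(r"[^\w\s]", " ", stripped)
    return " ".join(no_punct.lower().split())


# Hypothetical variants of the same author name.
a = normalize_name("  José  A. García-López ")
b = normalize_name("JOSE A GARCIA LOPEZ")
print(a, "|", b)
```

Variants that differ only in case, accents, or punctuation then compare equal, which makes the name attribute usable as a feature for matching authors.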

Thank you!!
