author paper identification problem final presentation
![Page 1: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/1.jpg)
Author-Paper Identification Problem
Guided by Prof. Duc Tran
Team:
Karthik Reddy Vakati
Nachammai C
Pooja Mishra
![Page 2: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/2.jpg)
Problem Statement
To determine the correct author, from the author dataset, for a particular paper.
Ambiguity in author names can cause a paper to be assigned to the wrong author, which leads to noisy author profiles.
The challenge is to determine which papers in an author profile were truly written by that author.
![Page 3: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/3.jpg)
Data Points
The data points include all papers written by an author and the author's affiliation (university, technical society, groups).
Paper-Author - (PaperId, AuthorId, Name, Affiliation)
The metadata includes the journals an author has published in and the conferences the author has attended.
Paper - (Id, Title, Year, ConferenceId, JournalId, Keywords)
Author - (Id, Name, Affiliation)
Conference - (Id, ShortName, FullName, HomePage)
Journal - (Id, ShortName, FullName, HomePage)
![Page 4: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/4.jpg)
Machine Learning Task
• Building the model using train dataset and testing
• Feature Engineering
• Algorithms
• Model Tuning
• Results
• Evaluation
![Page 5: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/5.jpg)
Steps Taken to Solve Problem
• Data preprocessing and cleaning
• Feature engineering
• Extracting feature values
• Creating the final input file
• Choosing an ML algorithm: Random Forest / Gradient Boost
• Building the model
• Tuning the model
• Testing the model on test data
• Evaluating the results
![Page 6: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/6.jpg)
Data preprocessing and cleaning
Issues with data:
• A few records had attributes spilled over 3 rows
• Some rows had more attributes than the required number of attributes
Cleaning performed:
• Wrote a Perl script to clean and format the data
• Removed stop words using the NLTK package in Python
• Converted all text to lower case
• Removed special characters
• Removed noise from the year field
• Assigned an ID to each keyword and normalized it
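The text-cleaning steps above can be sketched in a few lines of Python. This is only an illustration, not the script we actually ran: the stop-word set is a tiny stand-in for NLTK's English list, and the function names and the year bounds (`lo`, `hi`) are assumptions made for the example.

```python
import re

# Tiny illustrative stop-word set (assumption); the slides used NLTK's list.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def clean_text(text):
    """Lower-case, replace special characters with spaces, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def clean_year(raw, lo=1500, hi=2100):
    """Return a plausible 4-digit year as an int, or None for noisy values.

    The bounds lo and hi are hypothetical; pick whatever fits the data.
    """
    match = re.search(r"\b\d{4}\b", str(raw))
    if match is None:
        return None
    year = int(match.group())
    return year if lo <= year <= hi else None


def keyword_ids(keyword_lists):
    """Assign a stable integer ID to each distinct keyword, in first-seen order."""
    ids = {}
    for keywords in keyword_lists:
        for keyword in keywords:
            ids.setdefault(keyword, len(ids))
    return ids
```

For example, `clean_year("0")` is treated as noise and becomes `None`, while `clean_year("2009")` is kept as the integer 2009.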
![Page 7: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/7.jpg)
Issues with data-I
![Page 8: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/8.jpg)
Feature Engineering Steps
• Aggregation: combining multiple features into one.
How we used it: built an elaborated train file in which AuthorID, PaperID and Confirmation are combined with the name and affiliation from the Author and PaperAuthor files.
• Construction: creating new features out of the original ones.
How we used it: combined minyear and maxyear into active years.
• Discretization: converting continuous features or variables into discretized or nominal features.
How we used it: the year the paper was published, and the max and min years in which the author was actively publishing papers.
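A minimal sketch of the three steps, under stated assumptions: the dictionary keys mirror the column names on the slide, "active years" is taken to be the inclusive span between the first and last publishing year, and the 5-year bucket width is arbitrary. None of these choices is confirmed by the slides.

```python
def aggregate(train_rows, authors):
    """Aggregation: attach each author's name and affiliation to the train rows.

    train_rows: list of dicts with an "AuthorID" key.
    authors: dict mapping AuthorID -> {"Name": ..., "Affiliation": ...}.
    """
    return [{**row, **authors.get(row["AuthorID"], {})} for row in train_rows]


def active_years(min_year, max_year):
    """Construction: derive one feature from minyear and maxyear.

    Assumes the inclusive span, e.g. 1998..2005 counts as 8 active years.
    """
    return max_year - min_year + 1


def year_bucket(year, width=5):
    """Discretization: map a year to a nominal bucket such as '2000-2004'."""
    start = year - year % width
    return f"{start}-{start + width - 1}"
```

Bucketing the year this way turns a numeric column into a small set of nominal values, which is the form some of the Weka classifiers we tried earlier work with most naturally.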
![Page 9: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/9.jpg)
Author features
• distance between the author names in the paper-author and author files
• matched substring ratio between the author names in the paper-author and author files
• keywords used by a particular author (given less weight)
• number of keywords for the author
• number of co-authors for a given author
• weighted TF-IDF measure of all author keywords inside the author's papers
• number of different papers by the author
• years during which the author wrote many papers
• number of times an author is repeated (sum over distinct ids)
• list of distinct ids assigned to the same author
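The two name-comparison features at the top of this list could be computed roughly as follows. This is a sketch, not our exact feature code: it assumes the "distance" is Levenshtein edit distance and approximates the "matched substring ratio" with `difflib.SequenceMatcher.ratio()` from the Python standard library; the function names are ours.

```python
from difflib import SequenceMatcher


def matched_ratio(name_a, name_b):
    """Similarity in [0, 1] based on matching blocks of characters."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        cur = [i]
        for j, ch_b in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                  # deletion
                cur[j - 1] + 1,               # insertion
                prev[j - 1] + (ch_a != ch_b)  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]
```

A name stored as "J. Smith" in one file and "John Smith" in the other gets a ratio below 1 but well above that of two unrelated names, which is the signal the classifier can use.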
![Page 10: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/10.jpg)
Paper features
• year of the paper
• number of authors of the paper
• number of duplicated papers for the paper
• number of duplicated authors for the paper
• number of keywords in the paper
• how many times the exact same set of authors is repeated in different papers (without duplicates)
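Two of these paper features, the author count and the repeated-author-set count, can be derived from (PaperId, AuthorId) pairs alone. The sketch below assumes that input shape; the function and key names are hypothetical.

```python
from collections import Counter, defaultdict


def paper_features(paper_author_pairs):
    """Compute per-paper author counts and repeated-author-set counts.

    paper_author_pairs: iterable of (paper_id, author_id) tuples.
    Duplicate pairs are collapsed because authors are stored in a set.
    """
    authors_by_paper = defaultdict(set)
    for paper_id, author_id in paper_author_pairs:
        authors_by_paper[paper_id].add(author_id)

    # How many distinct papers share exactly the same set of authors.
    set_counts = Counter(frozenset(a) for a in authors_by_paper.values())

    return {
        paper_id: {
            "n_authors": len(authors),
            "same_author_set_papers": set_counts[frozenset(authors)],
        }
        for paper_id, authors in authors_by_paper.items()
    }
```

Using a `frozenset` as the counter key makes the author order irrelevant, so papers listing the same people in a different order still count as the same set.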
![Page 11: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/11.jpg)
Paper-Author features
• correct affiliation from the PaperAuthor table: binary feature
• which publishing year of the author the paper falls in (this feature equals 1 for papers from the author's first year, 2 for papers from the second year, and so on)
• count sources: number of times the author-paper pair appears in the PaperAuthor table (and, likewise, in the Author table)
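A rough sketch of the last two features. It assumes the publishing-year feature is the offset from the author's first publishing year (so a gap year still advances the counter) and that "count sources" is a plain occurrence count of each pair in PaperAuthor; both readings are our interpretation of the slide, and the function names are made up for the example.

```python
from collections import Counter


def career_year(author_years, paper_year):
    """1 for papers in the author's first publishing year, 2 for the next, ...

    author_years: all publication years known for this author (non-empty).
    """
    return paper_year - min(author_years) + 1


def count_sources(paper_author_rows):
    """Count how often each (paper_id, author_id) pair occurs in PaperAuthor."""
    return Counter((paper_id, author_id) for paper_id, author_id in paper_author_rows)
```

A pair that shows up several times in PaperAuthor was reported by several sources, which makes it more likely that the assignment is correct.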
![Page 12: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/12.jpg)
Machine Learning Models Used
Models used earlier
• K-means clustering
• ZeroR
• J48 decision tree
• Naïve Bayes
![Page 13: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/13.jpg)
Machine Learning Models Used
• Random Forest
Using Weka, Mahout and H2O
• Gradient Boosting
Using H2O
![Page 14: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/14.jpg)
Build Random Forest using Weka
With the elaborated train file
![Page 15: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/15.jpg)
Build Random Forest using H2O
![Page 16: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/16.jpg)
Feature values extraction & importance
• Count of papers for each author
• Minimum active year for a given author
• Maximum active year for a given author
• Jaccard distance between the author name in the author file and in the paper-author file
• Jaccard distance between the affiliation in the author file and in the paper-author file
• The year the paper was published
• Normalized keyword IDs
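For reference, the Jaccard distance used for the name and affiliation features can be written as below. The sketch assumes the strings are compared as sets of lower-cased, whitespace-separated tokens; a character-level or n-gram variant would work the same way, and the slides do not say which one we used.

```python
def jaccard_distance(text_a, text_b):
    """1 - |A ∩ B| / |A ∪ B| over the word tokens of the two strings."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # two empty strings are treated as identical
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

A distance of 0 means the two files agree exactly on the tokens, and 1 means they share none, so small values support the author-paper assignment.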
![Page 17: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/17.jpg)
Random Forest using Mahout Steps
1. Put the data in HDFS.
$HADOOP_HOME/bin/hadoop fs -mkdir testdata
$HADOOP_HOME/bin/hadoop fs -put testdata
2. Build the job files. In $MAHOUT_HOME/ run:
mvn clean install -DskipTests
3. Generate a file descriptor for the dataset.
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core--job.jar org.apache.mahout.classifier.df.tools.Describe -p testdata/KDDTrain+.arff -f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
4. Run the model.
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples--job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d testdata/KDDTrain+.arff -ds testdata/KDDTrain+.info -sl 5 -p -t 100 -o nsl-forest
5. Use the decision forest to classify new data.
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples--job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i nsl-kdd/KDDTest+.arff -ds nsl-kdd/KDDTrain+.info -m nsl-forest -a -mr -o predictions
![Page 18: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/18.jpg)
Evaluation Metrics
• Mean Average Precision
The mean average precision for N users at position n is the average of the average precision of each user, i.e.,
$$\mathrm{MAP}@n = \frac{1}{N}\sum_{i=1}^{N} \mathrm{ap}@n_i$$
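A small sketch of how this metric can be computed. In our setting each "user" is an author and the ranked list holds that author's candidate papers, with True marking a paper the author really wrote. Normalizing ap@n by min(number of correct papers, n) is one common convention and is an assumption here, as are the function names.

```python
def average_precision(ranked_labels, n=None):
    """ap@n for one author; ranked_labels is a best-first list of booleans."""
    top = ranked_labels[:n]  # n=None keeps the whole list
    total_correct = sum(1 for label in ranked_labels if label)
    denom = min(total_correct, len(top))
    if denom == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, is_correct in enumerate(top, start=1):
        if is_correct:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / denom


def mean_average_precision(rankings, n=None):
    """MAP@n: the mean of ap@n over all authors."""
    if not rankings:
        return 0.0
    return sum(average_precision(r, n) for r in rankings) / len(rankings)
```

For one author with ranking [True, False, True] the average precision is (1/1 + 2/3) / 2, and for another with [False, True] it is 1/2; MAP is the mean of the two values.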
![Page 19: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/19.jpg)
Lessons Learned!
Since we were given train and test data, supervised learning is the best fit. Hence classification algorithms work better than clustering for the author-paper identification problem.
When you want to find patterns or structures in the provided data, use unsupervised learning models such as clustering.
Choosing the features is the most important step: we extract the feature values from the given data and use them to build the model.
Construct an initial set of features, build the model and test its accuracy. This might not always give good results, so construct new features from the existing ones.
Choosing features from several data points gives better results than choosing them from only one.
Choose an initial set of weights for each feature based on its importance. This helps in model tuning.
![Page 20: Author paper identification problem final presentation](https://reader038.vdocuments.net/reader038/viewer/2022110115/54b773854a795985568b4652/html5/thumbnails/20.jpg)
Thank you!!