REPS – Resume Extraction and Processing System
Resume Extraction and Processing System
By
Abhijit Pawar
Ruchit Sontakke
Prerana Narang
Karan Parekh
Instructor: Dr. Bun Yue
Mentor: Dr. Dilhar De Silva
Date: 12/4/2009
University of Houston- Clear Lake- Fall 2009 1
Acknowledgement
With affection and deep appreciation, we acknowledge our indebtedness to our instructor, Dr.
Kwok Bun Yue, for giving us the opportunity to explore our skills and ideas beyond the
prescribed syllabus of our coursework by permitting us to work on an external project. We
further extend our gratitude to our mentor, Dr. Dilhar De Silva, who guided us continuously
throughout the project. In addition, we thank our Project Manager, Mr. Stewart Bush, whose
encouragement inspired us to go ahead with the project.
Abstract
AtLink Communications commercializes software through a process-driven approach that is
essential to the success of any product it develops. To examine the efficiency of this
approach with the help of a Business Process Management (BPM) tool, AtLink intended to
develop a product for automated resume extraction, which can help the organization analyze
the information in resumes submitted online through its job portal website.
The main purpose of our project, the Resume Extraction and Processing System (REPS), is to
analyze the resume uploaded by a user and extract the information it contains. Our team used
a Software Development Assistant (SDA) tool to carry out the project activities. These
activities were performed in four phases: inception, elaboration, construction and transition.
REPS was implemented as a web application with Adobe Flex® as the user interface, deployed
on an Apache Tomcat 6.0 server. We used BlazeDS, Adobe's open source Java remoting and web
messaging technology, to integrate the application programs with the Flex user interface. We
used a MySQL database to store the data extracted from the resume.
The timeline, the deliverables and the effort put in by the team during the implementation of
this application have been recorded in the SDA tool.
After testing, validation and verification, AtLink Communications should be able to implement
this application in any web-based job portal where users are required to upload their
resumes online. The application eases the process considerably by reducing the manual input
required from users to a minimum.
Table of Contents
1. Introduction and Background
1.1 ConvertDoc
1.2 VisualText Resume Analyzer
1.3 DOM Parser
1.4 BlazeDS
1.5 Adobe Flex
2. Design and Implementation
2.1 Architecture of REPS
2.2 Implementation with SDA Tool
2.2.1 Inception Phase
2.2.2 Elaboration Phase
2.2.3 Construction Phase
2.2.4 Transition Phase
3. Implementation Issues and Lessons Learned
4. Conclusion
5. References
Appendix A: Team Information
Appendix B: Team Contribution
Appendix C: Schedule
Appendix D: Screen Shots
Appendix E: Database Design
1. Introduction and Background
AtLink Communications is a provider of Process Automation technology. By treating Processes
the same as a standard Business Process, AtLink customers automate, manage and control their
enterprise more effectively and efficiently than previously possible. [3]
Many web-based applications and web sites either require their users to enter resume
information manually into a form or simply show recruiters a scanned copy of the resume.
The company asked us to develop a web-based application that automates the submission of a
resume into its relational database, thereby minimizing user input. The application must
accept resumes in MS Word format, scan them and map the relevant data to tables in a
relational database customized for this purpose.
We were required to use the Software Development Assistant tool while implementing this
application, taking it through the full software development life cycle.
Developing this application posed many challenges, including the numerous resume formats in
use today, parsing these resumes, separating their different segments and mapping those
segments accurately onto the database.
We started by examining various resume formats and looking for a way to read the resumes and
their fields accurately. We then worked on converting the MS Word resume to a human-readable
text format, mapping the extracted values and fields to the database, and showing them back
to the user in the Flex-based user interface. The most important part of the development was
the integration of the different open source modules we used.
In the final application the user only has to upload an MS Word resume; the rest of the
process is done by the application. The resume is then shown to the user as a segmented
cover-flow form, with each segment on its own page for ease of use. The application also
allows the user to edit the information in the resulting web resume form where required.
The following software modules were used in the project:
1.1 ConvertDoc
ConvertDoc is an open source document conversion utility that converts a document from one
format to another. We used it to convert the MS Word resume submitted by the user to text
format. A detailed explanation is given in the implementation section.
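As an illustration only, the Java sketch below shows one way a server could launch a
command-line converter as a child process. It uses the standard ProcessBuilder API, which may
differ from the call REPS actually made. The class name DocToTextConverter, the {in} and {out}
placeholders and the command template are assumptions made for this example; the real
ConvertDoc executable name and flags are not shown in this report, so they are left to the
caller.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: run an external document converter as a child process. */
final class DocToTextConverter {

    /**
     * Fills the {in} and {out} placeholders of a command template.
     * The template (executable name and flags) is supplied by the caller,
     * because the converter's real options are not assumed here.
     */
    static List<String> buildCommand(List<String> template, Path input, Path output) {
        List<String> command = new ArrayList<>();
        for (String part : template) {
            command.add(part.replace("{in}", input.toString())
                            .replace("{out}", output.toString()));
        }
        return command;
    }

    /** Runs the converter and fails loudly on a non-zero exit code. */
    static void convert(List<String> template, Path input, Path output)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(template, input, output))
                .inheritIO() // forward the converter's messages to our console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Converter exited with code " + exitCode);
        }
    }
}
```

Passing the command as a list of arguments, rather than one concatenated string, avoids quoting
problems when an uploaded file name contains spaces.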
1.2 VisualText Resume Analyzer
VisualText Resume Analyzer is an open source natural language processing and text analysis
system. We used it in our application to analyze resumes and extract their information with
a natural language processing parser. The text-format resume produced by the ConvertDoc
utility is given as input to the analyzer. The detailed use of this system is covered in the
implementation section.
1.3 DOM Parser
We used a DOM parser in Java to parse the extracted information in the XML file produced by
the VisualText Resume Analyzer.
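A minimal sketch of this kind of DOM parsing, using the standard javax.xml.parsers API, is
given below. The element names (resume, contact, name, email) are hypothetical; the actual
layout of the VisualText XML output is not shown in this report, and ResumeXmlReader is a name
invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical sketch: read one section of analyzer output with a DOM parser. */
final class ResumeXmlReader {

    /** Returns the child elements of the first sectionTag element as name/value pairs. */
    static Map<String, String> readSection(String xml, String sectionTag) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xml)));

            Map<String, String> fields = new LinkedHashMap<>();
            NodeList sections = document.getElementsByTagName(sectionTag);
            if (sections.getLength() == 0) {
                return fields; // the section was not extracted
            }
            NodeList children = sections.item(0).getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip whitespace-only text nodes between elements.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    fields.put(child.getNodeName(), child.getTextContent().trim());
                }
            }
            return fields;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse analyzer output", e);
        }
    }
}
```

A LinkedHashMap keeps the fields in document order, which is convenient when the same values
are later shown in a form.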
1.4 BlazeDS
Our team used BlazeDS, a Java remoting and web messaging technology, to integrate the Java
programs with the Flex user interface.
1.5 Adobe Flex
Adobe Flex is an SDK for developing and deploying cross-platform rich internet applications
based on the Adobe Flash platform. We built the whole application on Flex, including the
user interface where users upload their resumes and see the resulting resume form. A
snapshot of the user interface can be seen in Appendix D.
2. Design and Implementation
2.1 Architecture of REPS
The high-level architecture diagram gives an overview of all the modules that make up REPS.
[Figure: REPS high-level architecture. The client (Flex UI forms) sends the resume (.doc or
.docx format) over the Internet to the server; ConvertDoc produces a .txt file; the VisualText
Resume Analyzer produces an XML file; the DOM parser returns the extracted data to the Flex UI;
the confirmed or edited data is mapped to the database.]
The following steps show the flow of the application:
a. First, the MS Word resume file selected by the user is sent to the server side,
where REPS resides. The client-side user interface is implemented in Flex.
b. On the server side, in REPS, the file is converted from MS Word format to text format
using ConvertDoc, the open source utility.
c. The converted resume in text format is then submitted to the VisualText Resume Analyzer,
which extracts the resume information and outputs it as an XML file.
d. This XML file is the input to the DOM parser. The parsed data from the DOM
parser is displayed on the Flex user interface using BlazeDS.
e. The user gets an opportunity to edit and correct the extracted data.
f. After all the changes have been made, the data is finally mapped onto the database.
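The server-side part of this flow (steps b to d) can be pictured as three stages chained
together. The Java sketch below is an illustration under assumptions, not the REPS code:
ResumePipeline and its stage names are hypothetical, and each stage is a plain function so
that ConvertDoc, VisualText and the DOM parser could each be plugged in behind it.

```java
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of the REPS server-side flow as three pluggable stages. */
final class ResumePipeline {
    private final Function<String, Map<String, String>> stages;

    /**
     * @param toText   step b: Word file path to plain text (ConvertDoc in REPS)
     * @param toXml    step c: plain text to XML (VisualText in REPS)
     * @param toFields step d: XML to a field map (DOM parser in REPS)
     */
    ResumePipeline(Function<String, String> toText,
                   Function<String, String> toXml,
                   Function<String, Map<String, String>> toFields) {
        // andThen feeds the output of each stage into the next one.
        this.stages = toText.andThen(toXml).andThen(toFields);
    }

    /** Runs all stages; the result is what the user would confirm or edit in step e. */
    Map<String, String> extract(String wordFilePath) {
        return stages.apply(wordFilePath);
    }
}
```

Keeping the stages separate in this way makes it possible to test each one with stub inputs
before the real tools are integrated.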
2.2 Implementation with SDA tool
We used the SDA tool to implement this application. The SDA tool divides the life cycle of
the software development process into four major phases:
2.2.1 Inception Phase
During this phase we were given a demo of the existing resume application. We examined
that application and the different resume formats in common use. We then prepared a risk
document, which helped us mitigate the risks we encountered during the project.
The documents we prepared during this phase included the project abstract, a common
vocabulary for users and management, the use cases and the first version of the requirements
document. These documents gave us a clear understanding of the application we were aiming
for.
The most challenging part of this phase was selecting and acquiring tools that could
accurately read an MS Word resume and parse it into a human-readable format, so that the
resume data could be mapped accurately into the database.
We came across many parsers and analyzers during our research, but none was intelligent
enough to analyze the wide variety of resume formats that users may submit.
With the help of our mentor, our team found an open source analyzer, VisualText, which uses
natural language processing techniques. This analyzer was the key to the success of the
project. We adapted the VisualText analyzer to our application by feeding it different types
of data, for example cities, address types, states, zip codes, university formats, dates of
attendance and employment history. VisualText takes only a text document as input, so the
MS Word resume was first converted to a text document using ConvertDoc.
The output of the VisualText analyzer was in XML form. The next big task was to parse this
XML document.
2.2.2 Elaboration Phase
Much of the requirements and design work took place in this phase.
During the elaboration phase we prepared the sequence diagrams, which guided us on the flow
of the application. This was a very useful step. We also studied Adobe Flex in order to use
it for the user interface of the web-based application during the construction phase.
Our team further refined the knowledge base of the natural language processing based
VisualText analyzer to obtain a more detailed output format. We also prepared the relational
database for the application in MySQL, according to the company's requirements.
It was during this phase that we prepared the final requirements document, which stated what
the final application should do. The requirements document was approved by our mentor, and
we moved into the construction phase.
2.2.3 Construction Phase
This was one of the longest phases in the life cycle of the project. We divided our team
into two groups. The first group worked on parsing the VisualText analyzer's XML output so
that the resume data could be mapped into the database; for this, the team used DOM parsers.
The second group worked on the first prototype of the user interface for the final
application. We used panels and buttons combined with ActionScript to implement the
cover-flow style resume form. The resume in the UI form is divided according to the database
tables: General Information, which includes name, address, e-mail, phone etc.; Employment
History, which includes employer name, date of joining, end date etc.; and similarly for the
other segments of a resume. The user can browse through all the sections by clicking the
navigation bar.
The next task was to map the parsed output from the DOM parser onto the database and to fill
the cover-flow resume form in the UI with the same data. This task was completed using
BlazeDS.
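To make the database mapping step concrete, the sketch below shows one common way to write a
map of parsed fields into a table through JDBC. It is an assumption-based example rather than
the REPS implementation: ResumeFieldWriter, the table name and the column names are
hypothetical, and the real schema is the one summarized in Appendix E.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: insert one row of parsed resume fields through JDBC. */
final class ResumeFieldWriter {

    /** Builds "INSERT INTO t (a, b) VALUES (?, ?)" for the given column names. */
    static String insertSql(String table, List<String> columns) {
        StringBuilder placeholders = new StringBuilder();
        for (int i = 0; i < columns.size(); i++) {
            placeholders.append(i == 0 ? "?" : ", ?");
        }
        return "INSERT INTO " + table + " (" + String.join(", ", columns)
                + ") VALUES (" + placeholders + ")";
    }

    /**
     * Inserts one row. Table and column names must come from trusted code,
     * not from the resume; the field values are bound as parameters.
     */
    static void insert(Connection connection, String table, Map<String, String> fields)
            throws SQLException {
        List<String> columns = new ArrayList<>(fields.keySet());
        String sql = insertSql(table, columns);
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            for (int i = 0; i < columns.size(); i++) {
                // JDBC parameter indexes start at 1.
                statement.setString(i + 1, fields.get(columns.get(i)));
            }
            statement.executeUpdate();
        }
    }
}
```

Binding the values as parameters, instead of concatenating them into the SQL text, keeps
unusual characters in a resume from breaking or altering the statement.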
2.2.4 Transition Phase
This was the final phase in the development of our application. We deployed the working
model of the application. The technical report and the final project presentation were
completed during this phase.
3. Implementation Issues and Lessons Learned
During the life cycle of our project we encountered many issues that could have affected the
overall output and timeline; however, the SDA tool helped cover many of these issues and
risks. For example, the process tool notified us whenever a task was assigned to us and sent
a reminder e-mail when assigned work was nearing its due date.
Finding and generalizing tools that work for every resume format in use today is not
feasible. Instead, a specific region, for example North America, should be chosen and
implemented first. Other regions and resume formats can be added as and when required.
The important lessons we learned were that implementing a process and using BPM tools really
helps the overall success of a project, and that teamwork and contribution were the key to
the success of ours.
4. Conclusion
The Resume Extraction and Processing System is a standalone web-based application. The
system has been implemented and the overall goal, minimizing user input when uploading a
resume, has been achieved. The project taught us how a real-world project is implemented
using software engineering best practices. It also helped us maintain the project timeline,
team spirit and close team coordination. The team had a very good learning experience
throughout the project.
5. References
1. http://msdn.microsoft.com/en-us/default.aspx
2. http://www.convert-files.com/SII/Convert-DOC/English/WebHelp/command_line_manual/examples__converting_docx_word_documents/example_converting_from_docx_to_txt.htm
3. http://www.textanalysis.com/help/help.htm
4. http://linguistlist.org/sp/Software.html#66
5. http://nlp.stanford.edu/software/lex-parser.shtml#Download
6. http://www.ellogon.org/
7. http://en.wikipedia.org/wiki/Natural_language_processing
8. http://java.sun.com/j2se/1.4.2/docs/api/java/lang/Runtime.html
9. http://www.w3schools.com/Xpath/xpath_examples.asp
10. http://www.roseindia.net/xml/dom/accessing-xml-file-java.shtml
11. http://opensource.adobe.com/wiki/display/blazeds/Release+Builds
12. http://learn.adobe.com/wiki/display/Flex/Creating+a+BlazeDS+web+service+application+in+Flex+Builder
13. http://ieee.org/portal/site
Appendix A: Team Information
Team Website: http://dcm.uhcl.edu/capf09g3/index.html
Team Members:
1. Abhijit Pawar:
Name: Abhijit Pawar
Student ID: 0862273
Email: [email protected], [email protected]
Phone Number: 832-561-0866
Major: Computer Science
Responsibilities: Team Leader, Research, Programmer, Documentation
2. Prerana Narang:
Name: Prerana Narang
Student ID: 0855767
Email: [email protected], [email protected]
Phone Number: 832-266-9175
Major: Computer Science
Responsibilities: Documentation, Research, Programmer
3. Ruchit Sontakke:
Name: Ruchit Sontakke
Student ID: 0858027
Email: [email protected], [email protected]
Phone Number: 832-316-6339
Major: Computer Science
Role: Testing, Research, Webmaster, Programmer
4. Karan Parekh:
Name: Karan Parekh
Student ID: 0834607
Email: [email protected], [email protected]
Phone Number: 281-224-5817
Major: Computer Information Systems
Role: Analyst, Research, Webmaster, Programmer
Appendix B: Team Contribution
1. Conversion to Text: 50% Abhijit and 50% Prerana
2. GUI Design: 60% Karan and 40% Ruchit
3. VisualText (Resume Analyzer): 70% Abhijit and 30% Prerana
4. Flex Implementation: 30% Ruchit, 30% Karan and 40% Abhijit
5. DOM Parser: 60% Abhijit and 40% Prerana
6. Database Implementation: 70% Abhijit and 30% Karan
7. Website Maintenance: 50% Ruchit and 50% Karan
8. Minutes and Agendas: 60% Ruchit and 40% Prerana
9. Technical Writing (Report): 50% Karan, 30% Ruchit, 15% Prerana and 15% Abhijit
Appendix C: Schedule
Appendix D: Screen Shots
Upload Resume Page:
View Resume Page:
Data Extracted Page:
Resume Page:
Snapshot of the SDA tool:
1)
The above snapshot shows the different phases of the life cycle.
The most important phases are Inception, Elaboration, Construction and Transition.
It shows how the project proceeds through the different phases. Each of these phases is
described in detail in the tool. Under each phase there are a number of activities that need
to be completed in order to move to the next phase. Terms such as Active, Inactive and
Complete define the status of a phase.
2)
The above snapshot shows the layout of the tool when a specific activity is chosen from a
specific project.
3)
This snapshot shows the layout of the tool when the user enters the document control
section. Here a document can have a different status depending on where it is placed: it can
be under the Get Working category or the Review category.
Appendix E: Database Design
Contact Information Table:
Education Information Table:
Employment Information Table: