
REPS – Resume Extraction and Processing System

Resume Extraction and Processing System

By

Abhijit Pawar

Ruchit Sontakke

Prerana Narang

Karan Parekh

Instructor: Dr. Bun Yue

Mentor: Dr. Dilhar De Silva

Date: 12/4/2009

University of Houston-Clear Lake, Fall 2009


Acknowledgement

With affection and deep appreciation, we acknowledge our indebtedness to our instructor, Dr. Kwok Bun Yue, for giving us the opportunity to explore our skills and ideas beyond the prescribed syllabus of our coursework by permitting us to work on an external project. We further extend our gratitude to our mentor, Dr. Dilhar De Silva, who guided us continuously throughout the course of the project. In addition, we would like to thank our Project Manager, Mr. Stewart Bush, whose encouragement inspired us to go ahead with the project.


Abstract

AtLink Communications commercializes software through a process-driven approach, which it regards as essential to the success of any product. To examine the efficiency of this approach with the help of a Business Process Management (BPM) tool, AtLink set out to develop a product for automated resume extraction. Such a product can help the organization analyze the information in resumes submitted online through its job portal website.

The main purpose of our project, the Resume Extraction and Processing System (REPS), is to analyze the resume uploaded by a user and extract the information it contains. Our team used a Software Development Assistant (SDA) tool to carry out the project activities. The activities across the project life cycle were performed in four phases: inception, elaboration, construction and transition.

REPS was implemented as a web application with Adobe Flex® as the user interface and was deployed on an Apache Tomcat 6.0 server. We used BlazeDS, Adobe's open source Java remoting and web messaging technology, to integrate the application programs with the Flex user interface. We used a MySQL database to store the data extracted from the resume.

The timeline, deliverables and effort contributed by the team during the implementation of this application were recorded in the SDA tool.

After testing, validation and verification, AtLink Communications will be able to deploy this application in any web-based job portal where users are required to upload their resumes online. The application eases the process considerably by reducing manual input from users to a minimum.


Table of Contents

1. Introduction and Background
    1.1 ConvertDoc
    1.2 VisualText Resume Analyzer
    1.3 DOM Parser
    1.4 BlazeDS
    1.5 Adobe Flex
2. Design and Implementation
    2.1 Architecture of REPS
    2.2 Implementation with SDA Tool
        2.2.1 Inception Phase
        2.2.2 Elaboration Phase
        2.2.3 Construction Phase
        2.2.4 Transition Phase
3. Implementation Issues and Lessons Learned
4. Conclusion
5. References
Appendix A: Team Information
Appendix B: Team Contribution
Appendix C: Schedule
Appendix D: Screen Shots
Appendix E: Database Design


1. Introduction and Background

AtLink Communications is a provider of process automation technology. By treating processes the same as a standard business process, AtLink customers automate, manage and control their enterprise more effectively and efficiently than previously possible. [3]

Many web-based applications and websites either require a great deal of input from users, who must fill in their resume information on a form, or simply show recruiters a scanned copy of the resume. The company asked us to develop a web-based application that automates the submission of a resume into its relational database, thereby minimizing user input. The application must accept resumes in MS Word format, scan them and map the relevant data to tables in a relational database customized for this purpose.

We were required to use the Software Development Assistant tool in implementing this application, taking it through the full software development life cycle.

The development of this application posed many challenges, including the numerous resume formats in use today, parsing these resumes, segregating their different segments and mapping those segments accurately onto the database.

We started by examining various resume formats and looking for a way to read those resumes and their fields accurately. We then worked on converting the MS Word resume to a human-readable format, mapping its values and fields to the database and showing them back to the user on the Flex-built user interface. The most important part of the development was the integration of the different open source modules we used.


In the final application, the user only has to upload an MS Word resume; the rest of the process is handled by the application. The resume is then shown to the user in a segmented cover-flow form, with each segment on its own page for ease of use. The application also allows the user to edit the information on the resulting web resume form when and where required.

The following software modules were used in the project:

1.1 ConvertDoc

ConvertDoc is an open source document conversion utility that converts a document from one format to another. We used it to convert the MS Word resume submitted by the user to text format. A detailed explanation is given in the implementation section.
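As an illustration only, a command-line converter of this kind could be driven from Java roughly as sketched below. The class `ExternalConverter`, the executable name `converter` and the `{in}`/`{out}` placeholder convention are all assumptions made for this sketch; they are not ConvertDoc's actual command-line syntax, which should be taken from its manual (reference 2).

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around a command-line document converter.
class ExternalConverter {

    // Replaces the {in} and {out} placeholders of a command template with real paths.
    static List<String> buildCommand(List<String> template, Path input, Path output) {
        List<String> command = new ArrayList<>();
        for (String part : template) {
            command.add(part.replace("{in}", input.toString())
                            .replace("{out}", output.toString()));
        }
        return command;
    }

    // Starts the converter, waits for it to finish and returns its exit code.
    static int convert(List<String> template, Path input, Path output)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(template, input, output))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // "converter" is a placeholder executable name, not a real tool.
        List<String> template = Arrays.asList("converter", "{in}", "{out}");
        System.out.println(buildCommand(template, Paths.get("resume.doc"), Paths.get("resume.txt")));
    }
}
```

Keeping the command as a template supplied by the caller means the real converter's flags live in configuration rather than in code, so the sketch does not depend on any particular tool.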

1.2 VisualText Resume Analyzer

VisualText Resume Analyzer is an open source natural language processing and text analysis system. We used the analyzer in our application to analyze resumes and extract their information with a natural language processing parser. The text-format resume produced by the ConvertDoc utility is given as input to the analyzer for data extraction. The detailed use of this system is covered in the implementation section.


1.3 DOM Parser

We parsed the extracted information in the XML file produced by the VisualText Resume Analyzer in Java, using a DOM parser.
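A minimal sketch of this kind of DOM parsing, using the standard `javax.xml.parsers` API, is given below. The XML layout (a `resume` root with one child element per field) and the class name `ResumeXmlReader` are assumptions made for illustration; the actual structure of the analyzer's output is not reproduced in this report.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper: reads the direct child elements of the root into a field map.
class ResumeXmlReader {

    static Map<String, String> readFields(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();

            Map<String, String> fields = new LinkedHashMap<>();
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip the whitespace text nodes that sit between elements.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    fields.put(child.getNodeName(), child.getTextContent().trim());
                }
            }
            return fields;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse analyzer output", e);
        }
    }

    public static void main(String[] args) {
        // Element names here are invented sample data.
        String xml = "<resume><name>Jane Doe</name><email>jane@example.com</email></resume>";
        System.out.println(readFields(xml));
    }
}
```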

1.4 BlazeDS

Our team used BlazeDS, a Java remoting and web messaging technology, to integrate the Java programs with the Flex user interface.

1.5 Adobe Flex

Adobe Flex is an SDK for developing and deploying cross-platform rich internet applications based on the Adobe Flash platform. We built the whole application on Flex, including the user interface where users upload their resumes and see the resulting resume form. A snapshot of the user interface can be seen in Appendix D.


2. Design and Implementation

2.1 Architecture of REPS

The high-level architecture diagram gives an overview of all the modules that we developed.

[Figure: REPS architecture diagram. Client side: (a) Flex UI forms, connected over the Internet to the server. Server side: (b) ConvertDoc, (c) VisualText Resume Analyzer, (d) DOM Parser, and the database. Data passed between modules: resume (.doc or .docx format), resume (.txt file), XML file, extracted data, confirmed/edited data, which is mapped to the database.]

The following steps show the flow of the application:

a. First, the MS Word resume file selected by the user is sent to the server side, where REPS resides. The client-side user interface is implemented using Flex.
b. On the server side, in REPS, the file is converted from MS Word format to text format using ConvertDoc, the open source utility.
c. The converted resume in text format is then submitted to the VisualText Resume Analyzer, which extracts the resume information and outputs it in an XML file.
d. This XML file is the input to the DOM parser. The parsed data from the DOM parser is displayed on the Flex user interface using BlazeDS.
e. The user gets an opportunity to edit and make changes to the extracted data.
f. After all changes have been made, the data is finally mapped onto the database.
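The flow in steps b through e can be sketched in Java as a chain of stages. This is only an illustrative outline: the class `ResumePipeline`, its stage signatures and the stand-in lambdas in `main` are assumptions, and the real stages (ConvertDoc, the VisualText Resume Analyzer and the DOM parser) are external modules that are not reproduced here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical outline of the server-side flow; each stage is supplied by the caller.
class ResumePipeline {
    private final Function<byte[], String> toText;              // step b: Word file to text
    private final Function<String, String> toXml;               // step c: text to analyzer XML
    private final Function<String, Map<String, String>> parse;  // step d: XML to fields

    ResumePipeline(Function<byte[], String> toText,
                   Function<String, String> toXml,
                   Function<String, Map<String, String>> parse) {
        this.toText = toText;
        this.toXml = toXml;
        this.parse = parse;
    }

    // Runs steps b, c and d in order on the uploaded file.
    Map<String, String> extract(byte[] wordFile) {
        return toText.andThen(toXml).andThen(parse).apply(wordFile);
    }

    // Step e: the user's edits override the extracted values.
    static Map<String, String> applyEdits(Map<String, String> extracted,
                                          Map<String, String> edits) {
        Map<String, String> confirmed = new LinkedHashMap<>(extracted);
        confirmed.putAll(edits);
        return confirmed;
    }

    public static void main(String[] args) {
        // Stand-in stages for demonstration; they do not call the real tools.
        ResumePipeline pipeline = new ResumePipeline(
                bytes -> new String(bytes, StandardCharsets.UTF_8),
                text -> "<resume><name>" + text + "</name></resume>",
                xml -> Collections.singletonMap("xml", xml));
        System.out.println(pipeline.extract("Jane Doe".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Passing the stages in as functions keeps each module replaceable, which matches the way the application integrates separate tools rather than one monolithic parser.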

2.2 Implementation with SDA Tool

We used the SDA tool for the implementation of this application. The SDA tool divided the life cycle of the software development process into four major phases:

2.2.1 Inception Phase

During this phase we were given a demo of the existing resume application. We examined that application as well as the different resume formats in common use. We then prepared a risk document, which helped us mitigate the risks we encountered during the course of the project.

The documents we prepared during this phase included the project abstract, a common vocabulary for the users and management, use cases and the first version of the requirements document. These documents gave us a clear understanding of the application we were aiming to build.

The most challenging part of this phase was selecting and acquiring tools that could accurately read the MS Word resume and parse it into a human-readable format, so that the resume data could be mapped accurately into the database.


We came across many parsers and analyzers during our research, but none were intelligent enough to analyze the many varieties of resume format that users may submit.

With the help of our mentor, our team found an open source analyzer, VisualText, which uses natural language processing techniques. This analyzer was the key to the success of the project. We adapted the VisualText analyzer for our application by feeding it different types of data, for example cities, address types, states, zip codes, university formats, dates of attendance and employment history. VisualText takes only a text document as input, so the MS Word resume was first converted to a text document using ConvertDoc.

The output of the VisualText analyzer was in XML form. The next big task was to parse this XML document.

2.2.2 Elaboration Phase

This phase involved a large share of the requirements and design work.

During the elaboration phase we prepared the sequence diagrams, which guided us on how the application would flow. This was a very useful step. We also studied Adobe Flex so that we could use it to build the user interface of the web-based application during the construction phase.

Our team further refined the knowledge base of the natural language processing based VisualText analyzer to obtain a more detailed output format. To support this, we prepared the relational database for the application in MySQL, according to the company's requirements.


It was during this phase that we prepared the final requirements document, which stated what the final application should do. The requirements document was approved by the mentor, and we transitioned into the construction phase.

2.2.3 Construction Phase

This was one of the lengthiest phases in the life cycle of the project. We divided our team into two groups. The first group worked on parsing the VisualText analyzer's XML output so that the resume data could be mapped into the database; to implement this, the group used DOM parsers. The second group worked on the first prototype of the user interface for the final application. We used panels and buttons combined with ActionScript to implement the cover-flow style resume form. The resume in the UI form is divided according to the database tables: general information, which includes name, address, e-mail, phone and so on; employment history, which includes employer name, date of joining, end date and so on; and similarly for the other segments of a resume. The user can browse through all the sections by clicking the navigation bar.

The next task our team implemented was to map the parsed output from the DOM parser onto the database and to fill the cover-flow resume form in the UI with the same data. This task was completed using BlazeDS.
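As a rough sketch of the database-mapping step, parsed fields could be written to a table with a JDBC `PreparedStatement`, as below. The class `ResumeDao` and the table and column names (`contact_info`, `name`, `email`) are assumptions made for illustration; the actual schema is the one shown in Appendix E.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical data-access helper for storing one section of a parsed resume.
class ResumeDao {

    // Builds a parameterized INSERT. Table and column names must come from
    // trusted code, never from user input, because they are concatenated into the SQL.
    static String insertSql(String table, List<String> columns) {
        String cols = String.join(", ", columns);
        String marks = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ")";
    }

    // Binds each field value to its placeholder and executes the insert.
    static int insert(Connection conn, String table, Map<String, String> fields)
            throws SQLException {
        List<String> columns = new ArrayList<>(fields.keySet());
        try (PreparedStatement stmt = conn.prepareStatement(insertSql(table, columns))) {
            for (int i = 0; i < columns.size(); i++) {
                stmt.setString(i + 1, fields.get(columns.get(i)));
            }
            return stmt.executeUpdate();
        }
    }

    public static void main(String[] args) {
        // Invented table and column names, shown only to illustrate the generated SQL.
        System.out.println(insertSql("contact_info", Arrays.asList("name", "email")));
    }
}
```

The field values themselves are always bound as parameters, so text extracted from a resume is never interpreted as SQL.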

2.2.4 Transition Phase

This was the final phase in the development of our application. We deployed the working model of the application. The technical report and the final project presentation were also completed during this phase.


3. Implementation Issues and Lessons Learned

During the life cycle of our software development project we encountered many issues that could have affected the overall output and timelines of the project. However, the SDA tool helped us cover many of these issues and risks. For example, the process tool prompted us whenever a task was assigned to us, and it sent us a reminder email when assigned work was nearing its due date.

Finding and generalizing tools that work for all the resume formats in use today is not feasible. Instead, a specific region, for example North America, should be chosen and implemented first. Other regions and resume formats can be added as and when required.

The important lessons we learned were that implementing a process and using BPM tools really help in the overall success of a project, and that teamwork and contribution were the key to the success of our project.


4. Conclusion

The Resume Extraction and Processing System is a standalone web-based application. The system has been implemented, and the overall goal of minimizing user input while uploading a resume has been achieved. The project taught us how a real-world project is implemented using software engineering best practices. It also helped us maintain the project timeline, team spirit and close team coordination. The team had a very good learning experience throughout the project.


5. References

1. http://msdn.microsoft.com/en-us/default.aspx

2. http://www.convert-files.com/SII/Convert-DOC/English/WebHelp/command_line_manual/examples__converting_docx_word_documents/example_converting_from_docx_to_txt.htm

3. http://www.textanalysis.com/help/help.htm

4. http://linguistlist.org/sp/Software.html#66

5. http://nlp.stanford.edu/software/lex-parser.shtml#Download

6. http://www.ellogon.org/

7. http://en.wikipedia.org/wiki/Natural_language_processing

8. http://java.sun.com/j2se/1.4.2/docs/api/java/lang/Runtime.html

9. http://www.w3schools.com/Xpath/xpath_examples.asp

10. http://www.roseindia.net/xml/dom/accessing-xml-file-java.shtml

11. http://opensource.adobe.com/wiki/display/blazeds/Release+Builds

12. http://learn.adobe.com/wiki/display/Flex/Creating+a+BlazeDS+web+service+application+in+Flex+Builder

13. http://ieee.org/portal/site


Appendix A: Team Information

Team Website: http://dcm.uhcl.edu/capf09g3/index.html

Team Members:

1. Abhijit Pawar

Name: Abhijit Pawar
Student ID: 0862273
Email: [email protected], [email protected]
Phone Number: 832-561-0866
Major: Computer Science
Responsibilities: Team Leader, Research, Programmer, Documentation

2. Prerana Narang

Name: Prerana Narang
Student ID: 0855767
Email: [email protected], [email protected]
Phone Number: 832-266-9175
Major: Computer Science
Responsibilities: Documentation, Research, Programmer


3. Ruchit Sontakke

Name: Ruchit Sontakke
Student ID: 0858027
Email: [email protected], [email protected]
Phone Number: 832-316-6339
Major: Computer Science
Role: Testing, Research, Webmaster, Programmer

4. Karan Parekh

Name: Karan Parekh
Student ID: 0834607
Email: [email protected], [email protected]
Phone Number: 281-224-5817
Major: Computer Information Systems
Role: Analyst, Research, Webmaster, Programmer


Appendix B: Team Contribution

1. Conversion to Text: 50% Abhijit and 50% Prerana

2. GUI Design: 60% Karan and 40% Ruchit

3. VisualText (Resume Analyzer): 70% Abhijit and 30% Prerana

4. Flex Implementation: 30% Ruchit, 30% Karan and 40% Abhijit

5. DOM Parser: 60% Abhijit and 40% Prerana

6. Database Implementation: 70% Abhijit and 30% Karan

7. Website Maintenance: 50% Ruchit and 50% Karan

8. Minutes and Agendas: 60% Ruchit and 40% Prerana

9. Technical Writing (Report): 50% Karan, 30% Ruchit, 15% Prerana and 15% Abhijit


Appendix C: Schedule


Appendix D: Screen Shots

Upload Resume Page:


View Resume Page:

Data Extracted Page:


Resume Page:


Snapshot of the SDA tool:

1)

The above snapshot shows the different phases of the life cycle. The most important phases are Inception, Elaboration, Construction and Transition. The snapshot shows how the project proceeds through the different phases, each of which is described in detail in the tool. Under each phase there are a number of activities that must be completed in order to move to the next phase. Terms such as Active, Inactive and Complete define the status of a phase.


2)

The above snapshot shows the layout of the tool when a specific activity is chosen from a

specific project.


3)

This snapshot shows the layout of the tool when the user enters the document control section. Here a document can have a different status depending on where it is placed: it can be under the Get Working category or the Review category.


Appendix E: Database Design

Contact Information Table:

Education Information Table:

Employment Information Table:
