
Q & A PARSER

Submitted By

Name of the Student Roll no

PRANAY B.MHATRE 30
SUDHARMA S.PATIL 37
SWAPNIL B.PRADHAN 41

In partial fulfillment for the award of

Bachelor of Engineering (Computer Engineering)

Guided by
Ms. Deepti Vijay Chandran

Department of Computer Engineering
Smt. Indira Gandhi College of Engineering

Affiliated to Mumbai University
Mumbai (M.S.)

(2010 - 2011)


CERTIFICATE

This is to certify that the project “Q & A PARSER” submitted by

Name of the Student Roll No.

PRANAY B.MHATRE 30

SUDHARMA S.PATIL 37

SWAPNIL B.PRADHAN 41

is a bonafide work completed under my supervision and guidance in partial fulfillment for award of Bachelor of Engineering (Computer Engineering) Degree of Mumbai University, Mumbai.

Place : Koparkharine

Date :

(Ms. Deepti Vijay Chandran)
Guide                                        Examiner

(Prof. K.T. Patil)                           (Dr. S.K. Narayankhedkar)
Head of the Department                       Principal

Smt. Indira Gandhi College of Engineering
Koparkharine, Navi Mumbai.

INDEX

CONTENTS

1. INTRODUCTION
   1.1 Problem Definition
   1.2 Q & A parser (NLP)
   1.3 Objective of Project
   1.4 Scope of Project
   1.5 Methodology Used

2. LITERATURE SURVEY

3. PROJECT SCHEDULE & TIMELINE CHART
   3.1 Milestones and Timeline

4. REQUIREMENT ANALYSIS
   4.1 System Requirement
   4.2 Feasibility Study
   4.3 Language Used
   4.4 Dataflow Diagrams

5. DESIGN
   5.1 System Architecture
   5.2 Interface Design
   5.3 Classes Used for Processing an Application
   5.4 Flowchart
   5.5 UML Diagrams

6. IMPLEMENTATION
   6.1 Algorithm Development
   6.2 Applications
   6.3 HIPO Chart
   6.4 IPO

7. CONCLUSION
   7.1 Conclusion
   7.2 Advantages
   7.3 Disadvantages
   7.4 Future Modification

REFERENCES

List of Figures:

Fig. No.   Diagram
1          Spiral Model
2          Sequence Diagram
4          Use Case Diagram
5          Class Diagram
6          Activity Diagram
7          State Chart Diagram

List of Tables

Table No.   Name
Table 1     Milestones and Timelines
Table 2     Input Output Table

List of Graphs

Graph No.   Name
Graph 1     Timelines

Chapter 1

INTRODUCTION

1.1 Problem Definition:

Why Natural Language Processing is a Critically Needed Technology

Anyone who has used a search engine to perform market, consulting, or financial research can tell you the pain of spending hours looking for the answer to a seemingly simple question. Add up all the questions a researcher must ask and the hours really rack up. Just how big is the search problem? According to International Data Group, the average knowledge worker makes $60,000 per year, of which $14,000 is spent on search; knowledge workers spend 24% of their time on search. Here is a quote from Network World: "A company that employs 1,000 information workers can expect more than $5 million in annual salary costs to go down the drain because of the time wasted looking for information and not finding it, IDC research found last year." Furthermore, an Accenture study found that 50% of the information retrieved in search by middle managers is useless. In the document-heavy financial services sector, researchers are frequently forced to give up looking for answers, or cannot check the accuracy of answers against multiple sources because it would be time-prohibitive.

Senior risk management comprises a firm’s most senior executives, whose job is to evaluate whether you are doing your job correctly to mitigate risk at the uppermost levels of the firm. Now imagine you are on the phone with your firm’s senior risk managers (your boss’s boss’s boss) and you are asked a question that you do not know the answer to. Imagine if you could type a short question into a search box and get an answer in time to provide an intelligent and correct response. That is the power of natural language processing: you type in a question in “natural language” and are provided with an instant result containing the answer that saves the day.

Our biggest challenge was being flexible enough to revisit our designs when it turned out that our original ideas were not as viable as we had hoped. During initial planning, we felt that we would need to apply an array of different techniques to bring the system together, including named entity recognition, part of speech tagging, reference resolution, and text classification. However, once we started building the system we were surprised to find that our test results did not always align with our ideas. For instance, using named entity recognition to generate questions did not always generate fluent or even answerable questions. In the case of our "answer" program, however, the initial simple solution turned out to be quite effective all by itself.

More concretely, the area where we struggled the most was question asking. When we first began considering options, we hadn't learned much about the necessary tools. Parsing in particular wasn't something we had spent much time on in class at that point, so we didn't realize how useful it could be. Instead, we focused on using named entity recognition, an approach that didn't pan out. In addition, while we planned to ask questions of all difficulty levels, while testing we found that our "medium" and "hard" questions were too often either disfluent or unanswerable given the article. This motivated us to restrict our question asking to "easy"-difficulty questions.


1.2 Q & A parser (NLP)

Natural Language Processing (NLP) is the technology that evaluates the relationships between words, such as actions, entities, or events, contained within unstructured text, that is, sentences within paragraphs found in a variety of text-based documents. Question Answering Natural Language Processing Search is the NLP technology that specifically solves the problem of finding answers to a question, which can be asked by simply entering it into a search interface using natural human language, for example, “Who is Barack Obama?”

Unlike keyword search in Google or Yahoo, for example, Natural Language Processing Question Answering Search specifically allows users to ask questions in their natural language and then retrieves the most relevant answers within seconds. The standard search process requires executing multiple keyword combinations that force the searcher to click on links, all too frequently to find no answer, and the process of searching and clicking continues until the user finds something or gives up. With Natural Language Processing Search there is no extra work and no need to follow multiple links, resulting in immense time savings. Entering a question is simple for the user even though the technology behind the scenes is highly complex.

Fig. 1: Comparison of search


Natural Language processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. In theory, natural-language processing is a very attractive method of human-computer interaction. Natural-language understanding is sometimes referred to as an AI-complete problem, because natural-language recognition seems to require extensive knowledge about the outside world and the ability to manipulate it.

NLP has significant overlap with the field of computational linguistics, and is often considered a sub-field of artificial intelligence. Modern NLP algorithms are grounded in machine learning, especially statistical machine learning. Research into modern statistical NLP algorithms requires an understanding of a number of disparate fields, including linguistics, computer science, statistics (particularly Bayesian statistics), linear algebra and optimization theory.

Text-to-Speech Intro:

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.[1]

Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output.[2]

The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Many computer operating systems have included speech synthesizers since the early 1980s.
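Since the parser's answers can also be spoken aloud (the "answer in voice form" step in the flowchart later in this report), a minimal sketch of software speech synthesis is shown below. It assumes the managed System.Speech library that ships with .NET 3.0+ (usable from Visual Studio 2008); the project's actual voice output may be wired up differently.

using System;
using System.Speech.Synthesis;   // reference System.Speech.dll (.NET 3.0+)

class SpeakAnswerDemo
{
    static void Main()
    {
        // The synthesizer's front-end normalizes the text and its back-end
        // renders the result as audio on the default sound device.
        using (SpeechSynthesizer synth = new SpeechSynthesizer())
        {
            synth.SetOutputToDefaultAudioDevice();
            synth.Speak("The second longest river in China is Huanghe.");
        }
    }
}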


1.3 Objective of Project

The goal of Question & Answer parsing (NLP) is to design and build software that will analyze, understand, and generate languages that humans use naturally, so that eventually you will be able to address your computer as though you were addressing another person.

This goal is not easy to reach. "Understanding" language means, among other things, knowing what concepts a word or phrase stands for and knowing how to link those concepts together in a meaningful way.

1.4 Scope of Project

In this project, we developed the Q & A system, an open-domain question answering system based on parsing and the Reed-Kellogg algorithm.

It does not try to understand the semantics of a question or answer; instead it uses statistical methods that rely on data redundancy. In addition, some linguistic transformations of the question are performed. It can handle factoid questions and definition questions. For most types of questions, it tries to return exact answers. If this is not possible, it returns short text passages, which are single sentences or sentence fragments that are assumed to contain the answer.
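As an illustration only, and not the project's actual retrieval code, such a sentence-level fallback can be sketched as a simple keyword-overlap ranking; the PassageFallback helper below is hypothetical:

using System;
using System.Linq;

static class PassageFallback
{
    // Return the sentence of the text that shares the most keywords with the question.
    public static string BestSentence(string question, string text)
    {
        string[] keywords = question.ToLower()
            .Split(new[] { ' ', ',', ';', ':', '?' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => w.Length > 3)          // crude filter for short function words
            .ToArray();

        string[] sentences = text.Split(new[] { '.', '!', '?' },
            StringSplitOptions.RemoveEmptyEntries);

        return sentences
            .OrderByDescending(s => keywords.Count(k => s.ToLower().Contains(k)))
            .FirstOrDefault();
    }
}

For example, given the sample text used later in Chapter 6, BestSentence("What is the second longest river in China?", text) would pick the sentence mentioning "second", "longest", "river", and "China".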


1.5 Methodology Used

The model for the system development life cycle that we have used is the Spiral Lifecycle Model. This model of development combines the features of the prototyping model and the waterfall model.

Fig 1. Spiral Model

Chapter 2

LITERATURE SURVEY

2. Literature Survey


The history of NLP generally starts in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence. This criterion depends on the ability of a computer program to impersonate a human in a real-time written conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably - on the basis of the conversational content alone - between the program and a real human. The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three to five years, machine translation would be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.

Some notably successful NLP systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example, responding to "My head hurts" with "Why do you say your head hurts?”

During the 70's many programmers began to write 'conceptual ontologies', which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert 1981). During this time, many chatterbots were written including PARRY, Racter, and Jabberwacky.

Up to the 1980's, most NLP systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing. This was due both to the steady increase in computational power resulting from Moore's Law and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing. Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.

Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems. As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data.


Review of Literature

Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms are able to learn from data that has not been hand-annotated with the desired answers, or using a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results.

Project Review:

First, the question (from the upper field) is parsed to get a Reed-Kellogg tree syntax graph. Then the graph is transformed into its direct answer form; for example, the Question Clause syntax node is replaced with a Clause syntax node. The resulting graph is used as a syntax-lexical pattern. The algorithm then scans the text in the second field and tries to find the utterances most similar to the pattern. First, it compares a syntax node from the pattern with a syntax node from the target text. If the syntax nodes match, it compares the meanings of the words on the nodes; to compare word meanings it simply compares the Lexemes. If both syntax and meanings match, the algorithm goes down the syntax trees and builds the syntax fragment common to both utterances. The more syntax nodes that have been matched, the higher the matching score, and the best answers are shown as the result. If the question has a question word, the tool ensures that the question word is always matched. The node in the answer graph that matches the question word is the short answer (possibly together with all the words below it in the syntax tree).


Review of Text To Speech

A text-to-speech system (or "engine") is composed of two parts- a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end—often referred to as the synthesizer—then converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of the target prosody (pitch contour, phoneme durations), which is then imposed on the output speech.

Fig.3 TTS synthesis


Chapter 3

PROJECT SCHEDULE & TIMELINE CHART

3.1 Milestones and Timeline

1. Requirement Specification (3 weeks)
   A requirement specification document should be delivered.
   Remarks: An attempt should be made to identify additional features which can be incorporated at a later point of time. Brainstorming involving all the members should be done.

2. Technology Familiarization (5 weeks)
   Understanding of the technology. Each member should become an expert in one of the technologies, arrange a half-day session to share the information, and come up with a document for reference.

3. System Setup (1 week)
   Set up a test environment.

   Design (2 weeks)
   A high-level architecture diagram and a detailed design of all the modules; a data dictionary document should also be delivered.

4. Implementation of 1st phase (6 weeks)
   Working code for the 1st module should be developed. This should bring up the server program and admin functionality.

5. Testing and rework for 1st phase (1.5 weeks)
   Testing and fixing the bugs.

6. Implementation of 2nd phase (4 weeks)
   The working code for the 1st and 2nd phases.

7. Testing and rework for 2nd phase (1.5 weeks)
   Testing and fixing the bugs.

8. Implementation of 3rd phase (5 weeks)
   The working code for the 1st and 2nd phases.

9. Testing and rework of the entire application (3 weeks)
   Testing and fixing the bugs.

10. Deployment of the application (1 week)
    Deploy the application.

Table 1. Milestones and Timeline

Graph 1. Timeline

Tasks scheduled across the weeks of August to November:

1. Identify need of the project
2. Research
   a. Existing software
   b. What improvement we can provide
3. Requirement analysis
4. Conduct feasibility study
5. Designing

Tasks scheduled across the weeks of January to April:

6. Designing of GUI
7. Coding of GUI
8. Algorithm development
9. Algorithm implementation
10. Designing and coding of software modules
11. Complete software testing
12. Analysis of different conditions
13. Final release of the software

Chapter 4

REQUIREMENT ANALYSIS

4.1 System Requirement

HARDWARE CONFIGURATION

The hardware used for the development of the project is:

PROCESSOR : PENTIUM III 866 MHz

RAM : 128 MB SDRAM

MONITOR : 15” COLOR

HARD DISK : 20 GB

FLOPPY DRIVE : 1.44 MB

CD DRIVE : LG 52X

KEYBOARD : STANDARD 102 KEYS

MOUSE : LOGITECH MOUSE

SOFTWARE CONFIGURATION

The software used for the development of the project is:

OPERATING SYSTEM : Windows XP Professional

ENVIRONMENT : Visual studio 2008


4.2 Feasibility Study

Feasibility analysis is performed to choose the system that meets the performance requirements at least cost. The most essential tasks performed by a feasibility analysis are the identification and description of candidate systems and the selection of the best of the candidate systems, that is, the system that meets the performance requirements at the least cost. The most difficult part of a feasibility analysis is the identification of the candidate systems and the evaluation of their performance and cost. The new system has no additional expenses to implement. It has advantages such as easy access to files from any client in the network, accurate output for accurate input, and a more user-friendly application. This application can be used not only in this organization but also in other firms, so the problem is worth solving.

Analysts should concentrate on providing the answers to four key questions:

How much? The cost of the new system

What? The objectives of the new system

When? The delivery timescale

How? The means and procedures used to produce the new system.

4.2.1 Economical Feasibility:

Economic analysis is the most frequently used method for evaluating the effectiveness of the candidate system. More commonly known as cost/benefit analysis, the procedure is to determine the benefits and savings that are expected from a candidate system and compare them with the cost. This analysis phase determines how much cost is needed to produce the proposed system. This system is economically feasible since it does not require any initial set-up cost: the organization already has the machines and supporting programs required for the application to execute. It does not require additional staffing.

4.2.2 Technical Feasibility:

Technical feasibility analysis is performed to check whether the proposed system is technically feasible or not. Technical feasibility centers on the existing computer system (hardware, software, etc.) and on the extent to which it can support the proposed addition. This involves financial considerations to accommodate technical enhancements. This project is technically feasible. The input can be given through a simple interface that is both interactive and user-friendly, so a normal user can also operate the system.


4.2.3 Operational Feasibility:

Operational feasibility analysis is performed to check whether the system is operationally feasible or not. Using command buttons throughout the program enhances operational feasibility, so maintenance and modification are found to be easier. Will the system be used if it is developed and implemented? Will there be resistance from users that will undermine the possible application benefits? The feasibility study is carried out by a small group of people who are familiar with information systems techniques, understand the part of the business or organization that will be involved or affected by the project, and are skilled in the systems analysis and design process.

This project to be implemented is feasible in all respects, since it uses existing resources; there is no need to spend more money to buy an entirely new system. It enhances the performance of the network, so it is operationally feasible.


4.3 Language Used

C# (c sharp):

1. C# is a simple, modern, object oriented language derived from C++ and Java.

2. It aims to combine the high productivity of Visual Basic and the raw power of C++.

3. It is a part of Microsoft Visual Studio 7.0.

4. Visual Studio supports VB, VC++, C++, VBScript, and JScript. All of these languages provide access to the Microsoft .NET platform.

5. .NET includes a Common Execution engine and a rich class library.

6. Microsoft's equivalent of the JVM is the Common Language Runtime (CLR).

7. CLR accommodates more than one language such as C#, VB.NET, Jscript, ASP.NET, C++.

8. Source code ---> Intermediate Language (IL) code ---> (JIT compiler) ---> Native code.

9. The classes and data types are common to all of the .NET languages.

10. We may develop Console application, Windows application, and Web application using C#.

11. In C# Microsoft has taken care of C++ problems such as Memory management, pointers etc.

12. It supports garbage collection, automatic memory management, and more.

MAIN FEATURES OF C#

Simple, Modern, Object Oriented, Type Safe, Interoperability, Scalable & Updateable

SIMPLE:


1. Pointers are missing in C#.

2. Unsafe operations such as direct memory manipulation are not allowed.

3. In C# there is no usage of "::" or "->" operators.

4. Since it's on .NET, it inherits the features of automatic memory management and garbage collection.

5. Varying ranges of the primitive types like Integer, Floats etc.

6. Integer values of 0 and 1 are no longer accepted as Boolean values. Boolean values are pure true or false values in C#, so there are no more errors from confusing the "=" operator with the "==" operator: "==" is used for comparison and "=" is used for assignment.

MODERN

1. C# has been designed according to current trends and is very powerful and simple for building interoperable, scalable, robust applications.

2. C# includes built in support to turn any component into a web service that can be invoked over the Internet from any application running on any platform.

OBJECT ORIENTED

1. C# supports Data Encapsulation, inheritance, polymorphism, interfaces.

2. Primitive types (int, float, double) are not objects in Java, but C# introduces structures (structs) which enable the primitive types to behave as objects:

int i = 1; string a = i.ToString(); // conversion (or) boxing

TYPE SAFE

1. In C# we cannot perform unsafe casts, like converting a double to a Boolean.

2. Value types (primitive types) are initialized to zero, and reference types (objects and classes) are initialized to null by the compiler automatically.

3. Arrays are zero-base indexed and are bounds-checked.

4. Overflow of types can be checked.
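A minimal sketch of these checks (array bounds checking and the checked keyword for overflow detection), assuming a plain console project:

using System;

class TypeSafetyDemo
{
    static void Main()
    {
        int[] numbers = new int[3];              // arrays are zero-indexed and bounds-checked
        try
        {
            numbers[3] = 42;                     // out of range: throws instead of corrupting memory
        }
        catch (IndexOutOfRangeException e)
        {
            Console.WriteLine("Bounds check: " + e.Message);
        }

        try
        {
            int big = int.MaxValue;
            int overflowed = checked(big + 1);   // overflow is detected inside a checked context
            Console.WriteLine(overflowed);
        }
        catch (OverflowException e)
        {
            Console.WriteLine("Overflow check: " + e.Message);
        }
    }
}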

INTEROPERABILITY


1. C# includes native support for the COM and windows based applications.

2. Allowing restricted use of native pointers.

3. Users no longer have to explicitly implement the IUnknown and other COM interfaces; those features are built in.

4. C# allows the users to use pointers as unsafe code blocks to manipulate your old code.

5. Components from VB.NET and other managed-code languages can be used directly in C#.

SCALABLE AND UPDATEABLE

1. .NET has introduced assemblies, which are self-describing by means of their manifest. The manifest establishes the assembly identity, version, culture, digital signature, etc. Assemblies need not be registered anywhere.

2. To update or scale our application, we simply delete the old files and replace them with new ones; there is no registering of dynamic link libraries.

3. Updating software components is an error-prone task, since revisions made to the code can affect existing programs. C# supports versioning in the language, and native support for interfaces and method overriding enables complex frameworks to be developed and evolved over time.
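For illustration, an assembly's identity and version are typically declared with attributes in the project's AssemblyInfo.cs; the title "QAParser" below is only an example, not the project's actual assembly name:

using System.Reflection;

// These attributes become part of the assembly manifest that describes the
// assembly's identity to the CLR; no external registration is needed.
[assembly: AssemblyTitle("QAParser")]
[assembly: AssemblyVersion("1.0.0.0")]
[assembly: AssemblyCulture("")]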


4.4 Dataflow Diagrams:

LEVEL 0:

Fig. 4.4.1 DFD Level 0
(External entity: USER. Process: 0.0 QUERY PROCESSING. Data flows: input question & text, voice mode, display of the output.)

LEVEL 1:

Fig. 4.4.2 DFD Level 1
(Processes: 1.0 SYNTAX ANALYSER, 2.0 PART OF SPEECH TAGGING, 3.0 PARSE TREE GENERATION / KEYWORD EXTRACTION, 4.0 COMPARER, 5.0 MATCHING SCORE. Data flows: input text & question, syntactically right answer, token, keyword, all possible answers, main answer, display of answer.)

LEVEL 2:

Fig.4.4.3 DFD Level 2


Chapter 5

DESIGN

5.1 SYSTEM ARCHITECTURE:


5.2 INTERFACE DESIGN:

(The interface mock-up shows a window titled "Q & A PARSER USING NLP" with fields for the INPUT SENTENCES, the QUESTION, and the ANSWER, and panes for the SENTENCE PARSE TREE and the QUESTION PARSE TREE.)


5.3 CLASSES USED FOR PROCESSING AN APPLICATION:

1. Lexeme class description:

Namespace: Nlp4Net.NlpLib Assembly: NlpLib.dll

public class Lexeme : IUserData, ICloneable

Lexeme is a string of characters. There are three types of Lexemes, given by Lexeme.LexType.

Lexemes with syntax and semantic information contain Words.

There may be several syntactically different Words associated with the same Lexeme. For example, the lexeme "code" has two Words, a noun and a verb; it plays different syntax roles and carries different semantics in the following utterances: "We code the project. The code is complex." Which Word is used can be determined only during higher levels of processing.

Words may belong to different languages. Currently NlpLib supports only the en-US language.

You can use lexical ambiguity in OCR or speech recognition when a Lexeme is not clearly recognized. Instead of processing different Lexemes, overload the same Lexeme with the possible Words and let the syntax parser make the choice.

2. NLParser class description:

Namespace: Nlp4Net.NlpLib Assembly: NlpLib.dll

public class NLParser : IUserData

NLParser is a natural language parser that allows lexical and syntax parsing of English text.

Note: converting plain text into Lexemes and Utterances.

The easiest way is to use the NLParser.Text<Lexeme>() or NLParser.Text<Utterance>() enumerators:

Syntax:

NLParser parser = new NLParser();
foreach (Lexeme lexeme in parser.Text<Lexeme>(@"c:\test.txt", Encoding.UTF8))
{
    if ((Lexeme.LexType.word == lexeme.LexemeType) && !lexeme.HasWords)
        Console.WriteLine(lexeme.Text);
}


Alternatively, you can use the Parse() method and subscribe to the NLParser.OnLexeme or NLParser.OnUtterance event. When parsing is complete, call the Flush() method; otherwise some portion of the text may remain in internal buffers. You can continue parsing after calling Flush().

Flush() may be used if you want to force an end of an utterance.

Note: NLParser supports plain text or Lexemes as input

3. SyntaxNode class description:

Namespace: Nlp4Net.NlpLib Assembly: NlpLib.dll

public class SyntaxNode : IUserData, ICloneable

SyntaxNode is used to build a Reed-Kellogg tree graph. The root of a tree graph represents a syntax diagram of an Utterance. An Utterance allows multiple diagrams when the syntax is ambiguous.

SyntaxNode may have associated Words. SyntaxNode has methods allowing sorting and search operations. SyntaxNode allows syntax graph transformation. For example, an Utterance can be simplified by cutting out less important parts. Another typical transformation is the conversion between question and answer forms in natural language queries.

4. Utterance class description:

Namespace: Nlp4Net.NlpLib Assembly: NlpLib.dll

public class Utterance : IUserData, ICloneable

Utterance represents the smallest piece of exchangeable semantic information. You can analyze syntax relations between Words to convert Utterance.Syntaxes into your semantic representation.

Utterance is the result of applying Natural Language processor to a text.


5.4 FLOWCHART:

(Flowchart: the final step is to display the answer and give the answer in voice form.)

Example:

(Example flowchart: the final step is to display the answer in voice form.)


5.5 UML Diagrams:

1.0 Use Case Diagram:

1.1:

1.2:

2.0 Class Diagram:

3.0 Sequence Diagram:

4.0 Activity Diagram:

Chapter 6

IMPLEMENTATION

6.1 ALGORITHM DEVELOPMENT:

Text file:

The longest river in the world is Nile.

The second longest river in China is Huanghe.

The Yangtze River is the longest river in Asia.

Strawberries contain no fat.

There is nearly no fat in strawberries.

Step 1: The input question is "What is the second longest river in China?"

Step 2: Once the input question is given, the program checks for a valid question; using the Reed-Kellogg syntax function it checks whether we have a proper question. To find the above utterances, we use the code below:

NLParser parser = new NLParser();
foreach (Utterance utterance in parser.Text<Utterance>(@"c:\test.txt", Encoding.UTF8))
{
    if ((null != utterance.Syntaxes) && (0 != utterance.Syntaxes.Length))
        Console.WriteLine(utterance.Syntaxes[0].ToString());
}

And this is the output of the code


And this is the syntax diagram output for the phrase “second longest river in China”:

Step 3:

If no question is found in the syntax tree, the message below is returned:

if (null == question)
{
    Console.WriteLine("Cannot parse question: " + szQuestion);
    return;
}

If there are no utterances in the sentence, it will display “no answer found”.

Step 4: Syntax graphs are matched.

This is the syntax graph for the question “What is the second longest river in China?”:


And this is the syntax graph for the input text, i.e. “The second longest river in China is Huanghe”:

Step 5: If the syntax nodes match, the meanings of the words associated with the syntax nodes are compared. Here the meanings of the words are their Lexemes: if "river" is a noun in the question and it is also a noun in the input text, then the "river" lexeme is matched.

Step 6: If both the syntax and the meanings are equal, the utterances are considered to match and the matching score is incremented. The more lexemes that match, the higher the score, and the highest-scoring utterance becomes the output answer.

Once matches have been found, we have the output; a simplified sketch of this scoring idea is given below.
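For illustration only, the node-by-node scoring in Steps 5 and 6 can be sketched with a small hand-rolled tree type; the Node class and Match method below are hypothetical stand-ins, not the NlpLib types the project actually uses.

using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for a syntax-tree node.
class Node
{
    public string Role;            // syntax role, e.g. "subject", "verb", "object"
    public string Lexeme;          // the word at this node, e.g. "river"
    public List<Node> Children = new List<Node>();
}

static class Matcher
{
    // Count how many pattern nodes are mirrored (same role and same lexeme)
    // in the candidate utterance; a higher count means a better answer.
    public static int Match(Node pattern, Node candidate)
    {
        if (pattern == null || candidate == null)
            return 0;
        if (pattern.Role != candidate.Role || pattern.Lexeme != candidate.Lexeme)
            return 0;

        int score = 1;             // this pair of nodes matches
        int n = Math.Min(pattern.Children.Count, candidate.Children.Count);
        for (int i = 0; i < n; i++)
            score += Match(pattern.Children[i], candidate.Children[i]);
        return score;
    }
}

In the real system the question pattern is matched against every utterance in the text, the highest score wins, and the node aligned with the question word gives the short answer.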


6.2 APPLICATIONS:

Artificial Intelligence

Information Retrieval & Web Search
Information retrieval (IR) is the science of searching for documents, for information within documents, and for metadata about documents, as well as that of searching databases and the World Wide Web.

Information Extraction
Information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information, i.e. categorized and contextually and semantically well-defined data from a certain domain, from unstructured machine-readable documents.

Question Answering
Instead of typing in keywords, questions are asked in natural language, and the response is an answer extracted or generated from the documents.

Text Summarization
The process of distilling the most important information from a source to produce an abridged version.

Machine Translation
The use of computer software to translate text or speech from one natural language to another.


6.3 HIPO CHART:

Fig 6.3 HIPO chart


6.4 IPO


Fig.6.4 IPO

Chapter 7

CONCLUSION

7.1 CONCLUSION:

Q & A systems have been extended in recent years to explore critical new scientific and practical dimensions.

For example, systems have been developed to automatically answer temporal and geospatial questions, definitional questions, biographical questions, multilingual questions, and questions from multimedia (e.g., audio, imagery, video). Additional aspects such as interactivity (often required for clarification of questions or answers), answer reuse, and knowledge representation and reasoning to support question answering have been explored. Future research may explore what kinds of questions can be asked and answered about social media, including sentiment analysis.

Some problems remain: word segmentation and POS tagging need to be performed in a more general way in order to apply this model to a wider domain of applications. In our model, these problems are solved by a technical workaround, namely defining a dictionary for the system and assigning POS labels to words. This is suitable only for a specific application with clear information about the structure of the data and predictable searching scenarios.

7.2 ADVANTAGES:-

1. Quick response time

2. Customized processing

3. Small memory factor

4. No database needed


7.3 DISADVANTAGES:-

1. Cannot decode complex sentences.

2. Since it is the first version, it may have some bugs.

3. Cannot take more than 15 sentences

7.4 FUTURE MODIFICATION:-

1) Increasing sentence capacity

2) Adding spell check features

3) Voice based browser

4) Text Summarization

5) Language Recognizer & Translator


REFERENCES:

IEEE paper:

Nguyen Tuan Dang and Do Thi Thanh Tuyen, “Natural Language Question Answering Model Applied to Document Retrieval System”.

[1] Enrique Alfonseca, Marco De Boni, José-Luis Jara-Valencia, Suresh Manandhar, “A prototype Question Answering system using syntactic and semantic information for answer retrieval”, Proceedings of the 10th Text Retrieval Conference, 2002.

[2] Carlos Amaral, Dominique Laurent, “Implementation of a QA system in a real context”, Workshop TellMeMore, November 24, 2006.

[3] Eric Brill, Susan Dumais, Michele Banko, “An Analysis of the AskMSR Question-Answering System”, Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2002.

[4] Boris Katz, Jimmy Lin, “Selectively Using Relations to Improve Precision in Question Answering”, Proceedings of the EACL 2003 Workshop on Natural Language, 2003.

[5] Boris Katz, Beth Levin, “Exploiting Lexical Regularities in Designing Natural Language Systems”, Proceedings of the 12th International Conference on Computational Linguistics.

[6] Chris Callison-Burch, A computer model of a grammar for English questions, Undergraduate honors thesis, Stanford University, 2000.

[7] Nguyen Kim Anh, “Translating the logical queries into SQL queries in natural language query systems”, Proceedings of the ICT.rda’06 in Hanoi Capital, 2006.

[8] Nguyen Tuan Dang, Do Thi Thanh Tuyen, “E-Library Searching by Natural Language Question-Answering System”, Proceedings of the Fifth International Conference on Information Technology in Education and Training (IT@EDU2008), Pages: 71-76, Ho Chi Minh and Vung Tau, Vietnam, December 15-16 , 2008.

[9] Nguyen Tuan Dang, Do Thi Thanh Tuyen “Document Retrieval Based on Question Answering System”, accepted paper, The Second International Conference on Information and Computing Science, Manchester, UK, May 21-22, 2009.

[10] Riloff, Mann, Phillips, "Reverse-Engineering Question/Answer Collections from Ordinary Text", in Advances in Open Domain Question Answering, Springer Series: Text, Speech and Language Technology , Vol. 32, 2006.


[11] Roger S. Pressman, “Software Engineering: A Practitioner’s Approach”, Sixth Edition, McGraw Hill International Edition.

[12] Shari Lawrence Pfleeger, “Software Engineering”, Second Edition, Pearson Education.

[13] Fabio Rinaldi, James Dowdall, Kaarel Kaljurand, Michael Hess, “Exploiting Paraphrases in a Question Answering System”, Proceedings of the second international workshop on Paraphrasing, 2003.
