Post on 28-Mar-2015

© Generative Software Technologies Corp. 1

Knowledge Extraction from Technical Documents

*With first-class support for Feature Modeling

Rehan Rauf, Michal Antkiewicz, and Krzysztof Czarnecki

Generative Software Technologies Corp. Waterloo, Canada

http://gensoftech.com

The Idea

Specification Documents

[Figure: a spec doc annotated with its physical and logical structures]

Physical structures: section, table, paragraph

Logical structures (specification elements): functional reqs, business rules, use case

Recognize and extract specification elements based on physical document structure

ET – Extraction Tool

Searches for template instances

[Figure: given a UC template, ET finds the instances UC 1 and UC 2 in a spec doc]

Precondition: Documents have been authored with some template in mind

Application scenarios

Import to Requirements Mgmt Tools

[Figure: ET extracts functional reqs, business rules, and use cases from a spec doc and imports them into DOORS, HP Quality Center, RequisitePro, …]

Structured Query

Example query: all use cases with actor = ‘customer’

[Figure: QT runs the structured query over spec docs containing functional reqs, business rules, and use cases, and returns the matching use cases]
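Once use cases have been extracted into structured records, a query such as "all use cases with actor = ‘customer’" becomes a simple filter. The sketch below is only an illustration: the `UseCase` record, the `query_by_actor` helper, and the sample data are hypothetical and are not taken from the tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UseCase:
    """Hypothetical shape of an extracted use case (an assumption)."""
    name: str
    actor: str
    flow: List[str] = field(default_factory=list)


def query_by_actor(use_cases, actor):
    """Return all use cases whose actor matches, ignoring case and spacing."""
    wanted = actor.strip().lower()
    return [uc for uc in use_cases if uc.actor.strip().lower() == wanted]


# Made-up extraction results, for illustration only.
extracted = [
    UseCase("Place order", "Customer", ["Select items", "Confirm order"]),
    UseCase("Approve refund", "Clerk", ["Review request"]),
    UseCase("Track shipment", "customer", ["Enter order id"]),
]

matches = query_by_actor(extracted, "customer")
print([uc.name for uc in matches])
```

The comparison is case-insensitive on purpose, since documents written by several authors rarely capitalize actor names consistently.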

Tracing

[Figure: trace links connecting business rules and use cases across spec docs]

Template Conformance Checking

[Figure: use cases in a spec doc are checked against the template]
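One way to picture template conformance checking is as a comparison between the components a template expects and the components actually found in an instance. The following is a minimal sketch under that assumption; the `check_conformance` function and the component names are hypothetical, and the real tool may check much more (order, formatting, nesting).

```python
def check_conformance(instance, required, optional=()):
    """Compare an extracted instance (component name -> content)
    against the required and optional components of a template.

    Returns (missing, unexpected), both as sorted lists of names.
    """
    present = set(instance)
    missing = sorted(set(required) - present)
    unexpected = sorted(present - set(required) - set(optional))
    return missing, unexpected


# Made-up use case instance that lacks an actor and has an extra part.
instance = {
    "name": "Place order",
    "flow": ["Select items", "Confirm order"],
    "notes": "Draft",
}

missing, unexpected = check_conformance(
    instance,
    required=("name", "actor", "flow"),
    optional=("extensions",),
)
print("missing:", missing)
print("unexpected:", unexpected)
```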

Main Challenge: Logical and Physical Variation

Challenge – Variation

Instances of Use Case: logical components and component identifiers

Variation Types

Designed vs. accidental

Logical vs. physical

Designed Logical Variation

Optional component

Designed Logical Alternatives

Deeper decomposition

Different methodologies lead to logical variation

Designed Physical Variation

Different formatting

Accidental Variation

Logical: missing components, e.g., actor

Physical: spelling mistakes, e.g., “Actar”; style inconsistency, e.g., italics instead of bold

Solution

ET – Extraction Tool

Docs -> PSE -> physical components (sections, lists, table cells)

Physical components + UC template -> LSE -> logical components (actor, flow, extensions)

Accidental variation: via match threshold

Designed variation: via template
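Handling accidental variation via a match threshold can be illustrated with approximate string matching on component identifiers, so that a misspelling such as “Actar” still matches the identifier “Actor”. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure; the slides do not say which measure the tool uses, and the `match_identifier` helper and the 0.75 threshold are assumptions.

```python
from difflib import SequenceMatcher


def match_identifier(text, identifiers, threshold=0.75):
    """Return the template identifier most similar to `text`,
    or None if no identifier reaches the match threshold."""
    candidate = text.strip().rstrip(":").lower()
    best_name, best_score = None, 0.0
    for name in identifiers:
        score = SequenceMatcher(None, candidate, name.lower()).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


identifiers = ["Actor", "Flow", "Extensions"]

print(match_identifier("Actar:", identifiers))   # misspelled identifier
print(match_identifier("Summary", identifiers))  # unrelated heading
```

A lower threshold tolerates more spelling mistakes but raises the risk of matching unrelated text, so the value is a trade-off between recall and precision.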

UC Template

Metamodel: UC (Name : String), Flow (Action : String); multiplicities *, 1, 1

Physical structures: Section, Heading, List, Paragraph

Mapping: links the metamodel to the physical structures
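The idea of a template as a metamodel plus a mapping to physical structures can be sketched with a few data classes. Everything below is an assumption made for illustration: the class names `Section` and `PList`, and the mapping itself (section to UC, heading to name, first list to flow, list item to action) are one plausible reading of the diagram, not the tool's template language.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Physical structure (hypothetical minimal model).
@dataclass
class PList:
    items: List[str]


@dataclass
class Section:
    heading: str
    lists: List[PList] = field(default_factory=list)


# Logical structure, following the metamodel on the slide.
@dataclass
class Flow:
    actions: List[str]


@dataclass
class UC:
    name: str
    flow: Flow


def map_section_to_uc(section: Section) -> Optional[UC]:
    """Assumed mapping: section -> UC, heading -> name,
    first list -> flow, list items -> actions."""
    if not section.lists:
        return None  # no list, so the section is not a UC instance
    actions = list(section.lists[0].items)
    return UC(name=section.heading, flow=Flow(actions=actions))


section = Section("Place order", [PList(["Select items", "Confirm order"])])
uc = map_section_to_uc(section)
print(uc)
```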

Example Template

Logical Structure

Mapping

Regular Expressions

Lists

Component Nesting

Optional Components

Physical Alternatives

Templates with Tables

Logical Alternatives

ET – Extraction Tool

Docs -> PSE -> physical components

Basic: paragraph, cell, graphic

Composite: sections, lists, tables, …

Physical components + UC template -> LSE -> logical components (actor, flow, extensions)

Physical Structure Extraction

PSE is the only part that depends on the document format
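Confining format dependence to the physical extraction step can be sketched as an interface with one implementation per document format, while the logical step works only on the extracted physical components. The classes and the two toy formats below (blank-line separated text and `<p>` tagged text) are hypothetical; real extractors for word-processor or PDF formats would be considerably more involved.

```python
import re
from abc import ABC, abstractmethod
from typing import List, Optional


class PhysicalExtractor(ABC):
    """Format-specific step: turns a raw document into paragraphs."""

    @abstractmethod
    def paragraphs(self, raw: str) -> List[str]:
        ...


class PlainTextExtractor(PhysicalExtractor):
    def paragraphs(self, raw: str) -> List[str]:
        # Paragraphs are separated by blank lines.
        return [" ".join(p.split()) for p in raw.split("\n\n") if p.strip()]


class TaggedTextExtractor(PhysicalExtractor):
    def paragraphs(self, raw: str) -> List[str]:
        # Paragraphs are wrapped in <p>...</p> tags.
        found = re.findall(r"<p>(.*?)</p>", raw, flags=re.S)
        return [" ".join(p.split()) for p in found]


def find_actor(paragraphs: List[str]) -> Optional[str]:
    """Format-independent step: look for an 'Actor:' paragraph."""
    for p in paragraphs:
        if p.lower().startswith("actor:"):
            return p.split(":", 1)[1].strip()
    return None


plain = "Use Case: Place order\n\nActor: Customer"
tagged = "<p>Use Case: Place order</p><p>Actor: Customer</p>"

print(find_actor(PlainTextExtractor().paragraphs(plain)))
print(find_actor(TaggedTextExtractor().paragraphs(tagged)))
```

Supporting another document format then means adding one more `PhysicalExtractor` subclass, with `find_actor` left unchanged.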

Performance

Can we extract logical structures from real-world documents?

Document Set

43 documents: 24 from 3 companies, 11 from public sources, 6 student projects; 2,000 to 23,000 words

Content: use cases, data objects, business rules, functional reqs, non-functional reqs, …

Template Development

1) Write template manually

2) Verify extraction (run ET with the UC template on instances such as UC1 and UC2)

3) Refine template

Results

36 logical structures: use cases, data objects, business rules, …

Template sizes from 3 to 52 LOC

Total 942 instances

Nearly all instances perfectly recognized

100% recall for 33 templates; over 80% for remaining 3

100% precision for 35 templates; 87% for remaining 1

Error causes: severe formatting problems, e.g., manual line breaks; forgotten ids
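For readers less familiar with the two metrics, recall is the share of true instances that were extracted, and precision is the share of extracted items that are true instances. The sketch below computes both per template from sets of instance ids; the ids are made up for illustration and are not data from the evaluation.

```python
def precision_recall(extracted, expected):
    """Compute (precision, recall) for one template.

    extracted: ids of the instances the tool reported
    expected:  ids of the instances actually present in the documents
    """
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 1.0
    recall = true_positives / len(expected) if expected else 1.0
    return precision, recall


# Illustrative ids only: one instance missed, one false match.
expected = {"UC1", "UC2", "UC3", "UC4", "UC5"}
extracted = {"UC1", "UC2", "UC3", "UC4", "BR7"}

precision, recall = precision_recall(extracted, expected)
print(f"precision={precision:.0%} recall={recall:.0%}")
```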

Other Questions

Amount and kind of template change in refinement: 1% to 25% of LOC affected during refinement; 81% of changes concern optionality (add ‘?’ or component)

Number of iterations: from 1 instance (11 cases) to 50% of all instances (6 cases)

e.g., 10 out of 20 (2 cases); mostly simple edits, add ‘?’

Implication: start with a few examples, then edit the template based on expert knowledge (e.g., add ‘?’)

Related Work

Import to Req Mgmt Tools: tools prescribe document structure; manual markup for fine-grained extraction

Wrapper induction: machine-generated docs (web pages); induced regex not human readable (no modeling language)

Natural language processing: can benefit from structure-induced semantic tags

Future: Template by Example

1) Mark up sample document

2) Extract template (TE)

3) Verify extraction (ET)

4) Refine template

Summary

ET – Design: PSE extracts physical components from spec docs; LSE uses the UC template to extract logical components (functional reqs, business rules, use cases)

Application scenarios: import, query, tracing, conformance

Template development

Evaluation results: nearly all instances perfectly recognized; 43 real-world documents

Technology available at http://gensoftech.com/IntelligentET