knowledge extraction from technical documents knowledge extraction from technical documents *with...

46
Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf, Michal Antkiewicz, and Krzysztof Czarnecki Generative Software Technologies Corp. Waterloo, Canada http://gensoftech.com 1 © Generative Software Technologies Corp.

Upload: aaron-homer

Post on 28-Mar-2015

234 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 1

Knowledge Extraction from Technical Documents

*With first class-support for Feature Modeling

Rehan Rauf, Michal Antkiewicz, and Krzysztof Czarnecki

Generative Software Technologies Corp. Waterloo, Canada

http://gensoftech.com

Page 2: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 2

The Idea

Page 3: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 3

Specification Documents

Spec DocHeadingtext text text text text text text- text text text text text text - text text text text text text text text text text text text text text text text text text

text text text text text text text text text text text text texttext text text text text text text text text text text text text text text text text text text text text

Text Text Text Text Text Text

text text Text Text text text

text text text text text text

Section

Table

Paragraph

Physical structures

Functional Reqs

Business Rules

Use Case

Logical structures(specification elements)

Page 4: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 4

Recognize and extract specification elements

based on physical document

structure

Page 5: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 5

ET – Extraction Toolsearches for template instances

Spec Doctext text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text

Text Text Text

text text text

text text text

text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text

Text Text Text

text Text text

text text

UC Template UC 1

UC 2

Page 6: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 9

Precondition:Documents have been authored with some

template in mind

Page 7: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 10

Application scenarios

Page 8: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 11

Import to Requirements Mgmt Tools

Spec DocHeadingtext text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text

text text text text text text text text text text text text texttext text text text text text text text text text text text text text text text text text text text text

Text Text Text Text Text Text

text text Text Text text text

text text text text text text

DoorsHP Quality CenterRequisite Pro…

Functional Reqs

Business Rules

Use Case

Functional Reqs

Business Rules

Use Case

ET

Page 9: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 12

Spec DocHeadingtext text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text

text text text text text text text text text text text text texttext text text text text text text text text text text text text text text text text text text text text

QT

Spec DocHeadingtext text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text

text text text text text text text text text text text text texttext text text text text text text text text text text text text text text text text text text text text

Structured Query

Text Text Text Text Text Text

text text Text Text text text

text text text text text text

All use cases with actor = ‘customer’

Use Case

Spec DocHeadingtext text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text

text text text text text text text text text text text text texttext text text text text text text text text text text text text text text text text text text text text

Functional Reqs

Use CaseUse Case

Business Rules

Page 10: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 13

Spec Doc

text text text text text text text text text text text text texttext text text text text text text text text text text text text text text text text text text text text

Headingtext text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text

Text Text Text Text Text Text

text text Text Text text text

text text text text text text

Spec DocHeadingtext text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text

text text text text text text text text text text text text texttext text text text text text text text text text text text text text text text text text text text text

Tracing

Business Rules

Use Case

Use Case

Page 11: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 14

Spec Doc

text text text text text text text text text text text text texttext text text text text text text text text text text text text text text text text text text text text

Headingtext text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text

Text Text Text Text Text Text

text text Text Text text text

text text text text text text

Template Conformance Checking

Use Case

Use Case

Page 12: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 15

Main Challenge:Logical and Physical

Variation

Page 13: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 16

Challenge – Variation

Instances of Use Case

Page 14: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 17

Challenge – Variation

Instances of Use Case Logical components Component Identifiers

Page 15: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 18

Challenge – Variation

Instances of Use Case Logical components Component Identifiers

Page 16: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 19

Variation Types

Designed Accidental

Logical

Physical

Page 17: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 20

Designed Logical Variation

Optional component

Page 18: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 21

Designed Logical Alternatives

Deeper decomposition

Different methodologies lead to logical variation

Page 19: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 22

Designed Physical Variation

Different formatting

Page 20: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 23

Accidental Variation

LogicalMissing components, e.g., actor

PhysicalSpelling mistakes, e.g., “Actar”Style inconsistency, e.g., italics instead of bold

Page 21: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 24

Solution

Page 22: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 25

ET – Extraction Tool

Docs PSE

Physical componentsSections, lists, table cells

LSE

UC Template

Logical componentsActor, flow, extensions

Accidental variationvia match threshold

Designed variation

via template

Page 23: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 26

UC Template

Metamodel

UC

Name : String Flow

Action : String

*

1 1

SectionHeading

List

Paragraph

Mapping

Page 24: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 27

Example Template

Page 25: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 28

Logical Structure

Page 26: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 29

Mapping

Page 27: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 30

Regular Expressions

Page 28: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 31

Lists

Page 29: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 32

Component Nesting

Page 30: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 33

Optional Components

Page 31: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 34

Physical Alternatives

Page 32: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 35

Templates with Tables

Page 33: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 36

Logical Alternatives

Page 34: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 37

ET – Extraction Tool

Docs PSE

Physical components

Basic: Paragraph, cell, graphic

Composite: Sections, lists, tables, …

LSE

UC Template

Logical componentsActor, flow, extensions

Page 35: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 38

Physical Structure Extraction

Docs PSE

Physical components

Basic: Paragraph, cell, graphic

Composite: Sections, lists, tables, …

LSE

UC Template

Logical componentsActor, flow, extensions

Only part dependent on

document-format

Page 36: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 39

Performance

Page 37: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 40

Can we extract logical structures from real-world documents?

Page 38: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 41

Document Set

43 documents24 from 3 companies11 from public sources6 student projects2,000 to 23,000 words

ContentUse CasesData ObjectsBusiness RulesFunctional ReqsNon-Functional Reqs…

Docs

Page 39: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 42

ET2) Verify extraction

Template Development

UC1

UC Template

UC Template

1) Write template manually

UC2

??

3) Refine template

Page 40: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 43

Results

36 logical structuresUse cases, data objects, business rules, … Template sizes from 3 to 52 LOCTotal 942 instances

Nearly all instances perfectly recognized100% recall for 33 templates; over 80% for remaining 3100% precision for 35 templates; 87% for remaining 1

Error causesSevere formatting problems, e.g., manual line breaksForgotten ids

Page 41: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 44

Other Questions

Amount & kind of template change in refinement 1% – 25% LOC affected during refinement81% changes concern optionality (add ‘?’ or component)

Amount of iterations1 instance (11 cases) to 50% of all instances (6 cases)

e.g., 10 out of 20 (2 cases); mostly simple edits, add `?’

ImplicationStart with few examples, then edit the template based on expert knowledge (e.g., add `?’)

Page 42: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 45

Related Work

Import to Req Mgmt ToolsTools prescribe document structureManual markup for fine-grained extraction

Wrapper inductionMachine generated docs (web pages)Induced Regex not human readable (no modeling language)

Natural language processingCan benefit from structure-induced semantic tags

Page 43: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 46

Future: Template by Example

UC1

UC Template

UC2

3) Refine template

1) Mark up sample document

UC Template

TE 2) Extract template

3) Verify extraction

ET

Page 44: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 47

Summary

Page 45: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 48

ET – Design

48

Functional Reqs

B. Rules

Use Case

B. Rules

Use Case

Use Case

PSE

Physical components

Spec Doc

Spec Doc

Spec Doc

UC Template

LSE

Logical components

Spec Doc

Spec Doc

Use CaseQT

Query

Functional Reqs

B. Rules

Use Case

ET

Import

Tracing

Conformance

Application scenarios Template development

Evaluation results

Nearly all instancesperfectly recognized

43 real-world documents

Page 46: Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,

© Generative Software Technologies Corp. 49

Technology available athttp://gensoftech.com/IntelligentET