pattern markup-language

Pattern Markup-Language: A tool for simplifying data extraction from semi-structured sources

Jonathan Baker, Hilton Campbell, Jordan Crabtree, David W. Embley


Post on 11-Jan-2016


DESCRIPTION

Pattern Markup-Language. A tool for simplifying data extraction from semi-structured sources. Jonathan Baker, Hilton Campbell, Jordan Crabtree, David W. Embley. Many Sites with Genealogical Data. Structural Patterns. Programmer Defined Regular Expressions.

TRANSCRIPT

Page 1: Pattern Markup-Language

Pattern Markup-Language

A tool for simplifying data extraction from semi-structured sources

Jonathan Baker, Hilton Campbell, Jordan Crabtree, David W. Embley

Page 2: Pattern Markup-Language

Pattern Markup Language 2

Many Sites with Genealogical Data

Page 3: Pattern Markup-Language


Page 4: Pattern Markup-Language


Page 5: Pattern Markup-Language

Structural Patterns

Page 6: Pattern Markup-Language


Page 7: Pattern Markup-Language


Page 8: Pattern Markup-Language


Page 9: Pattern Markup-Language


Page 10: Pattern Markup-Language

Programmer Defined Regular Expressions

Regular Expression A

Page 11: Pattern Markup-Language

Programmer Defined Regular Expressions

Regular Expression B

Page 12: Pattern Markup-Language

Programmer Defined Regular Expressions

Regular Expression C
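The slides show the regular expressions only as images, so the following is a minimal sketch of what one programmer-defined expression might look like. The sample fragment, the `b.` label convention, and the names `SAMPLE`, `BIRTH_DATE`, and `extract_birth_date` are all assumptions made for illustration, not taken from the presentation.

```python
import re

# Hypothetical page fragment; real genealogy pages will differ.
SAMPLE = "<li>John Doe (b. 12 Mar 1845, d. 3 Jan 1910)</li>"

# Hand-written pattern: a "b." label followed by a day, a three-letter
# month abbreviation, and a four-digit year.
BIRTH_DATE = re.compile(r"\bb\.\s*(\d{1,2} [A-Z][a-z]{2} \d{4})")


def extract_birth_date(html):
    """Return the first birth date found in the fragment, or None."""
    match = BIRTH_DATE.search(html)
    return match.group(1) if match else None


print(extract_birth_date(SAMPLE))
```

Each fact on the page would need its own pattern of this kind, which is the hand-coding burden the presentation goes on to address.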

Page 13: Pattern Markup-Language

Which Relationships Found?

Given Name, Birth Date, Death Date, Aliases

Page 14: Pattern Markup-Language

Simple Schema Represents Relationships

Person
    Birth
        Date
    Death
        Date
    Names
        Given
        Aliases
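One way to picture this schema is as a nested mapping in which every leaf is a fact to extract. The sketch below assumes the reading Person > Birth > Date, Death > Date, Names > Given and Aliases; the dictionary form and the helper `leaf_paths` are illustrative and not part of the presentation.

```python
# Hypothetical in-memory form of the schema tree shown on the slide.
SCHEMA = {
    "Person": {
        "Birth": {"Date": {}},
        "Death": {"Date": {}},
        "Names": {"Given": {}, "Aliases": {}},
    }
}


def leaf_paths(tree, prefix=()):
    """Return every root-to-leaf path as a tuple of node names."""
    paths = []
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            paths.extend(leaf_paths(children, path))
        else:
            paths.append(path)
    return paths


for path in leaf_paths(SCHEMA):
    print("/".join(path))
```

The paths make the relationships explicit: two values that both look like dates are told apart by the branch they sit under.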

Page 15: Pattern Markup-Language

Combine Schema and Regular Expressions

Person
    Birth
        Date
    Death
        Date
    Names
        Given
        Aliases

Regular Expression A, Regular Expression B, Regular Expression C, Regular Expression D

Tree Represented by XML = PatML
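The transcript does not show actual PatML syntax, so the following sketch uses invented element and attribute names purely to illustrate the idea of an XML tree whose leaves carry regular expressions. The `pattern` attribute, the sample page, and the `extract` helper are all assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Invented markup for illustration only; NOT the real PatML format.
PATML_LIKE = r"""
<Person>
  <Birth><Date pattern="Born:\s*(\d{1,2} \w{3} \d{4})"/></Birth>
  <Death><Date pattern="Died:\s*(\d{1,2} \w{3} \d{4})"/></Death>
  <Names>
    <Given pattern="Name:\s*(\w+)"/>
    <Aliases pattern="Also known as:\s*([\w ]+)"/>
  </Names>
</Person>
"""

# Hypothetical sample page.
PAGE = (
    "<p>Name: John Doe</p><p>Born: 12 Mar 1845</p>"
    "<p>Died: 3 Jan 1910</p><p>Also known as: Jack Doe</p>"
)


def extract(element, page, prefix=""):
    """Walk the tree, apply each node's pattern, and key results by path."""
    path = f"{prefix}/{element.tag}" if prefix else element.tag
    results = {}
    pattern = element.get("pattern")
    if pattern is not None:
        match = re.search(pattern, page)
        results[path] = match.group(1) if match else None
    for child in element:
        results.update(extract(child, page, path))
    return results


record = extract(ET.fromstring(PATML_LIKE), PAGE)
for path, value in record.items():
    print(path, "=", value)
```

Keeping the patterns in a data file rather than in parser code means a new site needs a new file, not a new program.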

Page 16: Pattern Markup-Language


Page 17: Pattern Markup-Language


Page 18: Pattern Markup-Language


Page 19: Pattern Markup-Language


Page 20: Pattern Markup-Language

PatML Generation Tools

Schema Generator: establishes relationships

Page 21: Pattern Markup-Language

PatML Generation Tools

PatML Editor: helps write the regular expressions and establish which facts they match

Page 22: Pattern Markup-Language


Page 23: Pattern Markup-Language

Using PatML Editor

Get your schema file

Browse for sample page

Add nodes

Add expressions

See the highlights in source

Adjust
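The "see the highlights in source, then adjust" loop can be approximated outside the editor. The sketch below is an assumption about how such feedback might work, not the editor's actual implementation: the hypothetical helper `highlight` wraps every match of a candidate expression in visible markers so that a missed or over-broad match is easy to spot.

```python
import re


def highlight(source, pattern, open_mark="[[", close_mark="]]"):
    """Wrap every match of `pattern` in markers, as a stand-in for editor highlighting."""
    pieces = []
    last = 0
    for match in re.finditer(pattern, source):
        pieces.append(source[last:match.start()])
        pieces.append(open_mark + match.group(0) + close_mark)
        last = match.end()
    pieces.append(source[last:])
    return "".join(pieces)


# Hypothetical page source and candidate expression.
source = "<td>12 Mar 1845</td><td>3 Jan 1910</td>"
print(highlight(source, r"\d{1,2} \w{3} \d{4}"))
```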

Page 24: Pattern Markup-Language

PatML Editor Interface

Browser with rendered sample page

Text area with sample page source

Tree representing PatML structure

Page 25: Pattern Markup-Language


Page 26: Pattern Markup-Language

Fast and Versatile

Regular sites can be integrated in hours

Adaptable to any type of information

Page 27: Pattern Markup-Language

Implementation to Date

Genesis uses PatML files to search a variety of sites

Searches TNG, Retrospect-GDS, Family Search, GedCom and Kansas Gunslingers

Standardizes information for a common data model

Simultaneously searches other sites (in different formats) for people with similar information
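The slides do not describe Genesis's data model, so the following is only a sketch of what standardizing into a common model could involve. The `Person` fields, the site keys `site_a` and `site_b`, and their date formats are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Person:
    """Hypothetical common record shared by all sites."""
    given_name: str
    birth_date: Optional[str]  # ISO 8601, e.g. "1845-03-12"


# Assumed per-site date formats; real sites would need their own entries.
DATE_FORMATS = {"site_a": "%m/%d/%Y", "site_b": "%Y-%m-%d"}


def normalize(site, raw):
    """Convert one site's raw extracted fields into the common record."""
    birth = raw.get("birth")
    iso = None
    if birth:
        iso = datetime.strptime(birth, DATE_FORMATS[site]).date().isoformat()
    return Person(given_name=raw["given"].strip().title(), birth_date=iso)


a = normalize("site_a", {"given": " john ", "birth": "03/12/1845"})
b = normalize("site_b", {"given": "JOHN", "birth": "1845-03-12"})
print(a == b)
```

Once records from different sites share one shape, finding people with similar information reduces to comparing fields.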

Page 28: Pattern Markup-Language

Results

Page 29: Pattern Markup-Language

Results

Produced PatML that correctly extracts data from TNG, RGDS, GedCom sites, and Kansas Gunslingers

User interface provides an improved debugging environment

About 1/10 the coding time with PatML generation tools compared to similarly functioning hand-coded parsers

Page 30: Pattern Markup-Language

Limitations

Sites must be recognizable with regular expressions

Even regular sites have page-to-page HTML variations

Programmer error with regular expressions

Regular expression operations can be slow
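The page-to-page variation problem can be illustrated with two hypothetical fragments that render the same fact with different markup. The patterns and names below (`RIGID`, `TOLERANT`, `first_group`) are assumptions for this sketch, not expressions from the presentation.

```python
import re

# A pattern written against one sample page only.
RIGID = re.compile(r"<td>Birth</td><td>([^<]+)</td>")

# A pattern that allows tag attributes, case differences, and whitespace.
TOLERANT = re.compile(
    r"<td[^>]*>\s*Birth\s*</td>\s*<td[^>]*>\s*([^<]+?)\s*</td>",
    re.IGNORECASE,
)

# Two hypothetical pages from the "same" site.
PAGE_1 = "<td>Birth</td><td>12 Mar 1845</td>"
PAGE_2 = '<TD class="lbl"> Birth </TD>\n<TD class="val">12 Mar 1845</TD>'


def first_group(pattern, html):
    """Return the first captured group, or None when the pattern misses."""
    match = pattern.search(html)
    return match.group(1) if match else None


for page in (PAGE_1, PAGE_2):
    print(first_group(RIGID, page), "|", first_group(TOLERANT, page))
```

A more tolerant pattern is also harder to read and easier to get wrong, which ties into the programmer-error limitation listed above.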

Page 31: Pattern Markup-Language

Future work

Automatic regular expression generation

Parsing links to extract data on connected pages

Use in other applications and fields

XPath approaches
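As a rough idea of what an XPath-style approach might look like, the sketch below uses the limited XPath subset in Python's standard `xml.etree.ElementTree`. It assumes the page is well-formed XML, which real HTML often is not; the sample table, the `class` attribute values, and the helper `value_for` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment.
SAMPLE = """
<table>
  <tr><td class="label">Birth</td><td class="value">12 Mar 1845</td></tr>
  <tr><td class="label">Death</td><td class="value">3 Jan 1910</td></tr>
</table>
"""


def value_for(root, label):
    """Find the row whose label cell matches and return its value cell text."""
    for row in root.findall(".//tr"):
        if row.findtext("td[@class='label']") == label:
            return row.findtext("td[@class='value']")
    return None


root = ET.fromstring(SAMPLE)
print(value_for(root, "Birth"))
```

Here the selection follows document structure rather than character sequences, so whitespace changes between pages do not affect it.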