Querying the Web for Genealogical Information
Troy Walker
Spring Research Conference 2003
Research funded by NSF

Genealogical Information on the Web
Hundreds of thousands of sites
  Some professional (Ancestry.com, Familysearch.org)
  Mostly hobbyist (Cyndislist.com)
Search engines
  "Walker genealogy" on Google: 199,000 results
  1 page/minute = 5 months to go through
Why not enlist the help of a computer?
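
The five-month figure is rough arithmetic. A quick check, assuming one result page per minute read around the clock and 30-day months (both assumptions; the slide does not state them):

```python
RESULTS = 199_000        # Google hits for "Walker genealogy" (from the slide)
MINUTES_PER_PAGE = 1     # reading rate (from the slide)

total_minutes = RESULTS * MINUTES_PER_PAGE
days = total_minutes / (60 * 24)   # assumption: reading nonstop, 24 hours a day
months = days / 30                 # assumption: 30-day months

print(f"{days:.0f} days, about {months:.1f} months")
```

Under these assumptions the total comes to a little under five months, which matches the rounded figure on the slide.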

Problems
No standard way of presenting data
  Text formatted with HTML tags
  Tables
  Forms to access information
Each site has its own idea of what genealogical information is: differing schemas

Proposed solution
Based on Ontos and other work done at the BYU Data Extraction Group
Able to extract from:
  Semi-structured or unstructured text
  Tables
  Forms
Scalable and robust to changes in pages
Built for genealogy but easily adaptable to other domains

Text

Tables

Forms

Forms (continued)

System Overview
[Diagram: system components are User Query, URL Database, Document Retriever, Document Structure Recognizer, Form Engine, Table Engine, Unstructured or Semi-Structured Text Engine, Data Extraction Engine, Mapping Information, and Result Filter. Legend: to be implemented, to be improved, to be integrated.]

User Query
Form generated from ontology
Query by example
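
One way to picture these two points is the minimal sketch below. The ontology layout, the field naming scheme, and both function names are assumptions made for illustration; the slides do not show the actual Ontos ontology format or the form generator.

```python
# Hypothetical, simplified ontology: object sets and their attributes.
# The real ontology format is not shown in the slides.
ONTOLOGY = {
    "Person": ["Name", "Gender"],
    "Event": ["Event", "Date", "Location"],
}


def generate_form(ontology):
    """Return one blank form field per (object set, attribute) pair."""
    return {f"{obj}.{attr}": "" for obj, attrs in ontology.items() for attr in attrs}


def query_from_example(form):
    """Query by example: only the fields the user filled in become constraints."""
    return {field: value for field, value in form.items() if value.strip()}


form = generate_form(ONTOLOGY)
form["Person.Name"] = "Walker"                 # the user fills in an example
form["Event.Location"] = "Taylor, Apache, AZ"
query = query_from_example(form)
print(query)
```

Because the form is derived from the ontology, swapping in an ontology for another domain would produce a different form without changing the code.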

URL Database and Document Retriever
Contains Genealogy URLs
Search each URL: too much time
Filter likely URLs

| URL | Filter |
| --- | --- |
| http://www.ancestry.com/search/main.htm?lfl=adv | |
| http://userdb.rootsweb.com/deaths/cgi-bin/deaths.cgi | Death Date > 1880 |
| http://www.camcomp.com/users/jwalker/johngene/johngenes.htm | Name: Bates, Boyle, Damon, Eliot, … Walker, Woodsworth |
| http://www.rootsweb.com/~gaupson/cedarcem.htm | Burial Location: Thomaston, GA |
| http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Adams.html | Name: Adams |
| http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Walker.html | Name: Walker |
| http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Warley.html | Name: Warley |
| http://homepages.rootsweb.com/~gemmell/walkdesc.htm | Name: Walker |
| http://www.smartnouveau.com/jbplace/Kemp/f0000425.html | Name: Anderson, Burt, Summers, Walker |
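
The filtering idea can be sketched as follows, using a subset of the URLs above. How the filters are stored and compared is not described in the slides, so the tuple encoding, the function names, and the rule that an unconstrained field never excludes a URL are all assumptions.

```python
# A subset of the URL/filter pairs from the slide, in a hypothetical encoding.
# None means no filter is recorded, so the URL is always a candidate.
URL_DATABASE = [
    ("http://www.ancestry.com/search/main.htm?lfl=adv", None),
    ("http://userdb.rootsweb.com/deaths/cgi-bin/deaths.cgi", ("death_year_after", 1880)),
    ("http://www.rootsweb.com/~gaupson/cedarcem.htm", ("burial_location", {"Thomaston, GA"})),
    ("http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Adams.html", ("name", {"Adams"})),
    ("http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Walker.html", ("name", {"Walker"})),
    ("http://homepages.rootsweb.com/~gemmell/walkdesc.htm", ("name", {"Walker"})),
]


def is_candidate(url_filter, query):
    """Return True unless the query contradicts the URL's recorded filter."""
    if url_filter is None:
        return True
    kind, value = url_filter
    if kind == "death_year_after":
        year = query.get("death_year")
        return year is None or year > value
    wanted = query.get(kind)
    # Assumption: a field the query leaves open never rules a URL out.
    return wanted is None or wanted in value


def likely_urls(query):
    """Keep only the URLs worth retrieving for this query."""
    return [url for url, url_filter in URL_DATABASE if is_candidate(url_filter, query)]


query = {"name": "Walker", "death_year": 1952}
for url in likely_urls(query):
    print(url)
```

For this query the Adams list is skipped, while the search form, the death index, the cemetery page, and both Walker pages remain as candidates.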

Method Selector
Analyze page
Select appropriate method
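
The slides do not say how a page is analyzed. As a rough illustration only, the sketch below counts HTML tags with Python's standard `html.parser` and picks an engine from the counts; the heuristic, the `min_rows` threshold, and the names `TagCounter` and `select_method` are assumptions.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each start tag occurs in a page."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def select_method(html, min_rows=3):
    """Pick 'form', 'table', or 'text' from simple tag counts (assumed heuristic)."""
    counter = TagCounter()
    counter.feed(html)
    counts = counter.counts
    has_inputs = counts.get("input", 0) or counts.get("select", 0)
    if counts.get("form", 0) and has_inputs:
        return "form"
    if counts.get("tr", 0) >= min_rows:
        return "table"
    return "text"


print(select_method('<form action="/search"><input name="surname"></form>'))
print(select_method("<table><tr><td>a</td></tr><tr><td>b</td></tr><tr><td>c</td></tr></table>"))
print(select_method("<p>Ezra Erastus Walker, born 27 Sep 1885</p>"))
```

A real selector would need to be more careful, for example with tables that are used only for page layout.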

Preprocessing Engines
Text
  Improved record-separation
  Ability to handle single-record pages
Table
Forms

Extraction Engine
Ontos
Cache schema matches
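
Caching schema matches could look roughly like the sketch below. The cache key (host plus column headers), the synonym table that stands in for real schema matching, and all names are assumptions; the slides give no detail on how Ontos or the mapping information actually works.

```python
from urllib.parse import urlparse

# Hypothetical synonym table standing in for real schema matching.
SYNONYMS = {"born": "Birth Date", "died": "Death Date", "name": "Name", "surname": "Name"}


def match_schema(headers):
    """Map page headers to ontology attributes (placeholder for the costly step)."""
    return {h: SYNONYMS[h.lower()] for h in headers if h.lower() in SYNONYMS}


class SchemaMatchCache:
    """Remember the mapping computed for a site so it is matched only once."""

    def __init__(self):
        self._cache = {}
        self.misses = 0

    def mapping_for(self, url, headers):
        # Assumption: pages on the same host with the same headers share a schema.
        key = (urlparse(url).netloc, tuple(headers))
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = match_schema(headers)
        return self._cache[key]


cache = SchemaMatchCache()
headers = ["Name", "Born", "Died"]
first = cache.mapping_for("http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Adams.html", headers)
second = cache.mapping_for("http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Walker.html", headers)
print(first, cache.misses)
```

The second lookup reuses the mapping computed for the first page, so the matching step runs once for both pages.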

Result Filter
Filters objects relevant to query
Presents to user

| Person | Name | Gender |
| --- | --- | --- |
| 1 | Ezra Erastus Walker | M |

| Person | Event | Date | Location |
| --- | --- | --- | --- |
| 1 | Birth | 27 Sep 1885 | Taylor, Apache, AZ |
| 1 | Death | 19 Sep 1952 | |
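
A minimal sketch of the filtering step, assuming extracted objects arrive as Person and Event records linked by a person number, as in the tables above. Person 2 is a made-up record added only so the filter has something to drop, and the surname comparison is an assumed matching rule.

```python
persons = [
    {"person": 1, "name": "Ezra Erastus Walker", "gender": "M"},
    {"person": 2, "name": "John Adams", "gender": "M"},  # made-up record for illustration
]
events = [
    {"person": 1, "event": "Birth", "date": "27 Sep 1885", "location": "Taylor, Apache, AZ"},
    {"person": 1, "event": "Death", "date": "19 Sep 1952", "location": ""},
    {"person": 2, "event": "Birth", "date": "1 Jan 1900", "location": ""},  # made-up record
]


def filter_results(persons, events, surname):
    """Keep persons whose last name matches the query, plus their events."""
    kept = [p for p in persons if p["name"].split()[-1].lower() == surname.lower()]
    kept_ids = {p["person"] for p in kept}
    return kept, [e for e in events if e["person"] in kept_ids]


kept_persons, kept_events = filter_results(persons, events, "Walker")
for person in kept_persons:
    print(person["name"], person["gender"])
for event in kept_events:
    print(event["event"], event["date"], event["location"])
```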

Conclusion
Integrates, builds on previous DEG work
Extracts from:
  Semi-structured or unstructured text
  Tables
  Forms
Scalable: only searches probable pages
Robust to changes in pages
Ontology based: easily adapted to other domains