knowledge discovery and dissemination (kdd) program

KBB: A Knowledge-Bundle Builder for Research Studies

Knowledge Discovery and Dissemination (KDD) Program
  IARPA-BAA-09-10
  Question Period: 22 Dec 09 to 2 Feb 10
  Proposal Due Date: 16 Feb 10

Issues
  Information Extraction / Annotation / Wrapper Generation
  Wide Variety of Data Sets
  (Possibly) Large Data Sets
  (Possibly) Numerous Data Sets
  Alignment of Data Sets
  Schema Mapping & Schema Integration
  Data Cleaning and Integration
  Advanced Analytic Algorithms / Query / Reasoning
  Performance

Unifying Solution Theme
  Knowledge Bundles (KBs) ~ discovered/extracted/annotated knowledge organized for dissemination/query/analysis
    Either actual or virtual, or a combination
    Queries, reasoning, algorithmic analysis, data mining
    Queries & reasoning should always work immediately, based on a library of extraction ontologies, ontology snippets, and instance recognizers
    Pay as you go: greater organization, more extraction, improved analysis based on just doing the KDD work
  Knowledge-Bundle Builder (KBB)
    Knowledge begets knowledge (KBs as extraction ontologies)
    Fully automatic KBB tools
    Semi-automatic KBB tools

Many Applications
  Business planning and decision making
  Scientific research studies
  Purchase of large-ticket items
  Genealogy and family history
  Web of Knowledge
    Interconnected KBs superimposed over a web of pages
    Yahoo's Web of Concepts initiative [Kumar et al., PODS09]
  And Intelligence Gathering and Analysis

Notes: Not just bio-research studies. I'll draw from some of these as I further explain KBs and KBBs. (Switching applications because (1) Cui is not here to explain the medical biology, and (2) it's not implemented. Most of what I will present is implemented, although not fully integrated and not working as well as it should for the system to be commercialized.)

Prior Research (outline for next part of presentation)
  Formalization of Ideas
  Query Processing and Reasoning
    AskOntos / SerFR
    GenWoK
  Extraction and Annotation (Semi- & Un-structured Sources)
    OntoES
    FOCIH, TANGO, TISP
    NER
  Reverse Engineering (Structured Sources)
    RDB, XML, OWL
    Nested Tables
  Semantic Integration
    Multifaceted mappings (including mappings based on OntoES)
    Direct and indirect mappings
    Semantic enrichment for integration (e.g., MOGO)

Notes: Explain formal framework and semi-automatic creation in the rest of the presentation. Tie into ACM-L.

KB Formalization
  KB: a 7-tuple (O, R, C, I, D, A, L)
    O: Object sets (one-place predicates)
    R: Relationship sets (n-place predicates)
    C: Constraints (closed formulas)
    I: Interpretations (predicate calc. models for (O, R, C))
    D: Deductive inference rules (open formulas)
    A: Annotations (links from KB to source documents)
    L: Linguistic groundings (data frames) to enable:
      high-precision document filtering
      automatic annotation
      free-form query processing
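As a rough illustration of the 7-tuple, the sketch below holds the seven components in a plain Python container. The class name, field names, sample facts, source locator, and the date pattern are all hypothetical and chosen only for illustration; they are not taken from the KB/KBB prototype.

```python
import re
from dataclasses import dataclass, field


@dataclass
class KnowledgeBundle:
    """Hypothetical container mirroring the 7-tuple (O, R, C, I, D, A, L)."""
    object_sets: set = field(default_factory=set)        # O: one-place predicate names
    relationship_sets: set = field(default_factory=set)  # R: n-place predicate names
    constraints: list = field(default_factory=list)      # C: closed formulas, kept as text
    interpretation: dict = field(default_factory=dict)   # I: predicate name -> set of tuples
    inference_rules: list = field(default_factory=list)  # D: open formulas, kept as text
    annotations: dict = field(default_factory=dict)      # A: fact -> source-document locator
    groundings: dict = field(default_factory=dict)       # L: object set -> recognizer regex


def recognizes(kb: KnowledgeBundle, object_set: str, text: str) -> bool:
    """Return True if the grounding for object_set matches the whole text."""
    pattern = kb.groundings.get(object_set)
    return pattern is not None and re.fullmatch(pattern, text) is not None


# Illustrative (made-up) content for a tiny bundle.
kb = KnowledgeBundle(
    object_sets={"Person", "BirthDate"},
    relationship_sets={"Person has BirthDate"},
    constraints=["each Person has at most one BirthDate"],
    interpretation={
        "Person": {("Person1",)},
        "BirthDate": {("1930-09-20",)},
        "Person has BirthDate": {("Person1", "1930-09-20")},
    },
    inference_rules=["Age(x) :- ObituaryDate(y), BirthDate(z), AgeCalculator(x, y, z)"],
    annotations={("Person has BirthDate", "Person1", "1930-09-20"): "doc1#offset=120"},
    groundings={"BirthDate": r"\d{4}-\d{2}-\d{2}"},
)
```

Keeping I as a mapping from predicate names to sets of tuples follows the "predicate calc. models" reading of interpretations; a real system would likely use a database or triple store instead.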

KB: (O, R, C, ...)

KB: (O, R, C, ..., L)

KB: (O, R, C, I, ..., A, L)

KB: (O, R, C, I, D, A, L)

  Age(x) :- ObituaryDate(y), BirthDate(z), AgeCalculator(x, y, z)

Notes: Another reasoning possibility to point out: "Thursday", which does not have a specific date attached to it, can be reasoned about to determine that it must be March 12, 1998.
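The Age rule can be read as: whenever both an obituary date and a birth date are known, an age fact can be derived. A minimal sketch of that reading follows. The function names, the dictionary representation of facts, and the sample dates are assumptions made for illustration; AgeCalculator is taken here to mean "completed years between the two dates".

```python
from datetime import date


def age_calculator(obituary_date: date, birth_date: date) -> int:
    """Completed years between birth_date and obituary_date (assumed meaning)."""
    years = obituary_date.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in the obituary year.
    if (obituary_date.month, obituary_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def derive_age(facts: dict) -> dict:
    """Apply Age(x) :- ObituaryDate(y), BirthDate(z), AgeCalculator(x, y, z)."""
    derived = dict(facts)
    if "ObituaryDate" in facts and "BirthDate" in facts:
        derived["Age"] = age_calculator(facts["ObituaryDate"], facts["BirthDate"])
    return derived


# Hypothetical extracted facts for one obituary.
facts = {"ObituaryDate": date(2001, 6, 15), "BirthDate": date(1930, 9, 20)}
result = derive_age(facts)
```

If either date is missing, the rule body is not satisfied and no Age fact is added, which matches the usual reading of a Datalog-style rule.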

KB Query

KB Reasoning
  Screenshots from CW's thesis

Free-form Query Processing with Annotated Results

Notes: We are working toward KBs and KBBs.

KBB: (Semi)-Automatically Building KBs
  Ontology Editor (manual; gives full control)
  FOCIH (semi-automatic)
  TANGO (semi-automatic)
  TISP (fully automatic)
  NER (Named-Entity Recognition research)

Ontology Editor

FOCIH: Form-based Ontology Creation and Information Harvesting

  fleck velter   gonsity (ld/gg)   hepth (gd)
  burlam         1.2               120
  falder         2.3               230
  multon         2.5               400

TANGO: Table ANalysis for Generating Ontologies
  repeat:
    understand table
    generate mini-ontology
    match with growing ontology
    adjust & merge
  until ontology developed

  [Figure label: Growing Ontology]

TISP: Table Interpretation by Sibling Pages
  [Figure labels: Same, Different]

NER: Named-Entity Recognition

Notes: Automated extraction is critical. OpenDMAP for biology.

Reverse Engineering from Structured Sources
  Transformation from source (?) to target (O, R, C, I, ...)
    Information preserving
    Constraint preserving
  Structured source
    Predicates and constraints formalized in some way
    Examples: RDB, XML, OWL, Nested Forms

RDB Reverse Engineering
  Theorem. Let S be a relational database with its schema restricted as follows: (1) the only declared constraints are single-attribute primary-key constraints and single-attribute foreign-key constraints, (2) every relation schema has a primary key, (3) all foreign keys reference only primary keys and have the same name as the primary key they reference, (4) except for attributes referencing foreign keys, all attribute names are unique throughout the entire database schema, (5) all relation schemas are in 3NF. Let T be an OSM-O model instance. A transformation from S to T exists that preserves information and constraints.

C-XML: Conceptual XML

  XML Schema and C-XML

Notes: In general, reverse engineering from any structured schema.

OWL and OSM
  Yihong's Converter Code

Nested Table Reverse Engineering via TISP

  Theorem. Let S be a nested table with a single label path to each data item, and let T be an OSM-O model instance. A transformation from S to T exists that preserves information and constraints.

Semantic Integration
  Schema Mapping
    Direct & Indirect
    Use of extraction ontologies
  Semantic Enhancement for Integration
    Semantics of many sources abstracted away
    Alignment with global community knowledge
      WordNet
      Data-frame library

Multi-faceted Schema Mapping
  Central Idea: Exploit All Data & Metadata
  Matching Possibilities (Facets)
    Attribute Names
    Data-Value Characteristics
    Expected Data Values (use of extraction ontologies)
    Data-Dictionary Information
    Structural Properties

Example
  [Figure: Source Schema S and Target Schema T, each centered on Car and connected to its attributes by "has" relationships. Labels appearing in the figure: Car, Year, Make, Model, Cost, Style, Feature, Mileage, Miles, Phone.]

Individual Facet Matching
  Attribute Names
  Data-Value Characteristics
  Expected Data Values

Attribute Names
  Target and Source Attributes: T : A, S : B
  WordNet
  C4.5 Decision Tree: feature selection, trained on schemas in DB books
    f0: same word
    f1: synonym
    f2: sum of distances to a common hypernym root
    f3: number of different common hypernym roots
    f4: sum of the number of senses of A and B

WordNet Rule

  The number of different common hypernym roots of A and B
  The sum of distances of A and B to a common hypernym
  The sum of the number of senses of A and B

Confidence Measures

Data-Value Characteristics
  C4.5 Decision Tree
  Features
    Numeric data (mean, variation, standard deviation, ...)
    Alphanumeric data (string length, numeric ratio, space ratio)

Confidence Measures
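The features named on the Data-Value Characteristics slide could be computed along the following lines before being handed to a decision-tree learner. This is only a sketch: the function names are hypothetical, "variation" is read here as population variance, and the exact definitions of numeric ratio and space ratio (fraction of digit and whitespace characters) are assumptions.

```python
import statistics


def numeric_features(values):
    """Summary statistics for a non-empty column of numeric data values."""
    return {
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),  # assumed reading of "variation"
        "std_dev": statistics.pstdev(values),
    }


def alphanumeric_features(values):
    """Character-level statistics for a non-empty column of string data values."""
    total_chars = sum(len(v) for v in values)
    digits = sum(ch.isdigit() for v in values for ch in v)
    spaces = sum(ch.isspace() for v in values for ch in v)
    return {
        "mean_length": total_chars / len(values),
        "numeric_ratio": digits / total_chars if total_chars else 0.0,
        "space_ratio": spaces / total_chars if total_chars else 0.0,
    }


# Hypothetical sample columns.
num = numeric_features([1.0, 2.0, 3.0])
alpha = alphanumeric_features(["ab 12", "cd34"])
```

Two attributes whose feature vectors are close would then be candidates for a match on this facet; how closeness is turned into a confidence is left to the trained classifier.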

Expected Data Values
  Target Schema T and Source Schema S
    Regular-expression recognizer for attribute A in T
    Data instances for attribute B in S
  Hit Ratio = N'/N for an (A, B) match
    N': number of B data instances recognized by the regular expressions of A
    N: number of B data instances
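The hit ratio N'/N can be sketched directly. In the example below, the recognizer pattern for a Year attribute and the sample source values are hypothetical, and a value counts as recognized only if the whole value matches the pattern; the source does not say whether partial matches count.

```python
import re


def hit_ratio(recognizer: str, instances: list) -> float:
    """Return N'/N: the fraction of source instances the recognizer fully matches."""
    if not instances:
        return 0.0  # no data instances, so no evidence for a match
    pattern = re.compile(recognizer)
    hits = sum(1 for value in instances if pattern.fullmatch(value.strip()))
    return hits / len(instances)


# Hypothetical recognizer for target attribute A = Year.
YEAR_RECOGNIZER = r"(19|20)\d{2}"

# Hypothetical data instances for source attribute B.
source_values = ["1998", "2003", "red", "20k"]
ratio = hit_ratio(YEAR_RECOGNIZER, source_values)
```

Here two of the four instances are recognized, so the hit ratio for this (A, B) pair is 2/4.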

Confidence Measures

Combined Measures
  Threshold: 0.5
  [Figure: matrix of 0/1 entries; cell layout not recoverable]

Final Confidence Measures
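The slides do not state how the per-facet confidences are combined, so the following is only one plausible sketch: average the facet confidences for each (target, source) attribute pair, then apply the 0.5 threshold to obtain a 0/1 decision. The averaging rule, the function name, and all sample scores are assumptions.

```python
def combine_facets(facet_scores, threshold=0.5):
    """Average per-facet confidences per attribute pair, then threshold to 0/1.

    facet_scores: list of dicts mapping (target_attr, source_attr) -> confidence.
    A pair missing from a facet is treated as confidence 0.0 for that facet.
    """
    pairs = set().union(*facet_scores)
    combined = {}
    for pair in pairs:
        scores = [facet.get(pair, 0.0) for facet in facet_scores]
        combined[pair] = sum(scores) / len(scores)
    decisions = {pair: 1 if conf >= threshold else 0 for pair, conf in combined.items()}
    return combined, decisions


# Hypothetical confidences from the three facets.
attribute_names = {("Year", "Year"): 1.0, ("Miles", "Mileage"): 0.9, ("Make", "Phone"): 0.1}
value_characteristics = {("Year", "Year"): 0.8, ("Miles", "Mileage"): 0.7, ("Make", "Phone"): 0.2}
expected_values = {("Year", "Year"): 0.9, ("Miles", "Mileage"): 0.8, ("Make", "Phone"): 0.0}

combined, decisions = combine_facets(
    [attribute_names, value_characteristics, expected_values]
)
```

A weighted combination or a maximum over facets would fit the same interface; only the line computing `combined[pair]` would change.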

  [Figure: final confidence matrix; entries not recoverable]

Direct & Indirect Schema Mappings
  [Figure: source and target Car schemas. Labels appearing in the figure: Car, Year, Cost, Style, Feature, Phone, Miles, Mileage, Model, Make, Make&Model, Color, Body Type.]

Mapping Generation
  Direct Matches as described earlier:
    Attribute Names based on WordNet
    Value Characteristics based on value lengths, averages, ...
    Expected Values based on regular-expression recognizers
  Indirect Matches:
    1-n, n-1, or n-m based on direct matches
    Structure Evaluation
      Union
      Selection
      Decomposition
      Composition

Union and Selection
  [Figure: source and target Car schemas, with the same labels as on the Direct & Indirect Schema Mappings slide]

Decomposition and Composition
  [Figure: source and target Car schemas, with the same labels as on the Direct & Indirect Schema Mappings slide]

Semantic Enrichment (e.g., MOGO)

  fleck velter   gonsity (ld/gg)   hepth (gd)
  burlam         1.2               120
  falder         2.3               230
  multon         2.5               400

TANGO repeatedly turns raw tables into conceptual mini-ontologies and integrates them into a growing ontology.

MOGO (Mini-Ontology GeneratOr) generates mini-ontologies from interpreted tables.

Sample Input
  Region and State Information

  Location       Population (2000)   Latitude   Longitude
  Northeast      2,122,869
    Delaware     817,376             45         -90
    Maine        1,305,493           44         -93
  Northwest      9,690,665
    Oregon       3,559,547           45         -120
    Washington   6,131,118           43         -120

Sample Output

Notes: Explain what is "mini" about the ontology. Walk the user through the mini-ontology to explain what it tells us about the concepts and relationships.

Concept/Value Recognition
  Lexical Clues
    Labels as data values
    Data value assignment
  Data Frame Clues
    Labels as data values
    Data value assignment
  Default
    Recognize concepts and values by syntax and layout

Concepts and Value Assignments
  Location, Region, State: Northeast, Northwest, Delaware, Maine, Oregon, Washington
  Population: 2,122,869; 817,376; 1,305,493; 9,690,665; 3,559,547; 6,131,118
  Latitude: 45, 44, 45, 43
  Longitude: -90, -93, -120, -120
  Year: 2002, 2003

Relationship Discovery
  Dimension Tree Mappings
  Lexical Clues
    Generalization/Specialization
    Aggregation
  Data Frames
  Ontology Fragment Merge

Constraint Discovery
  Generalization/Specialization
  Computed Values
  Functional Relationships
  Optional Participation

  Region and State Information

  Location       Population (2000)   Latitude   Longitude
  Northeast      2,122,869
    Delaware     817,376             45         -90
    Maine        1,305,493           44         -93
  Northwest      9,690,665
    Oregon       3,559,547           45         -120
    Washington   6,131,118           43         -120
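The Computed Values constraint can be illustrated with the sample table: each region's population equals the sum of the populations of the states listed under it. The sketch below checks that candidate constraint. The population figures are from the table; the dictionary layout and function name are hypothetical, and the grouping of states under regions is read from the table's row order.

```python
# Population (2000) values from the Region and State Information table.
population = {
    "Northeast": 2_122_869,
    "Delaware": 817_376,
    "Maine": 1_305_493,
    "Northwest": 9_690_665,
    "Oregon": 3_559_547,
    "Washington": 6_131_118,
}

# Region -> states, as implied by the row order in the table.
states_of = {
    "Northeast": ["Delaware", "Maine"],
    "Northwest": ["Oregon", "Washington"],
}


def is_computed_sum(parent, parts, values):
    """True if the parent's value equals the sum of its parts' values."""
    return values[parent] == sum(values[part] for part in parts)


checks = {
    region: is_computed_sum(region, states, population)
    for region, states in states_of.items()
}
```

Both checks hold for this table, which is the kind of evidence that would support recording region population as a value computed from state populations.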

Notes: Explain how functional dependencies are found (better: values in the original table are functionally determined).

Ontology Workbench: Prototype Development Tool

Notes: We are working toward KBs and KBBs.

Case Study: Knowledge Bundles for Bio-Research
  Problem: locate, gather, organize data
  Solution: semi-automatically create KBs with KBBs
  KBs
    Conceptualized data + reasoning and provenance links
    Linguistically grounded & thus extraction ontologies
  KBBs
    KB Builder tool set
    Actively learns to build KBs

Notes: What's my take-home message? KBs and KBBs can play a significant role in helping researchers locate, gather, and organize information for research studies. As an example, this may well be the essence of what ACM-L means: conceptual modeling (CM) is the foundation for KBs and active learning (A-L) is the foundation for KBBs. Emphasize here and later. (What must listeners understand to be able to take the message home?)

Research Study: Objective and Task
  Objective: Study the association of:
    TP53 polymorphism and
    Lung cancer
  Task: locate, gather, organize data from:
    Single Nucleotide Polymorphism database
    Medical journal articles
    Medical-record database

Notes: It doesn't matter whether it is the essence of ACM-L or not. It may be better to pitch it as one possible way to achieve Active Conceptual Modeling for Learning (ACM-L).

Gather SNP Information from the NCBI dbSNP Repository
  SNP: Single Nucleotide Polymorphism
  NCBI: National Center for Biotechnology Information

Notes: Explain how FOCIH works. Also, how it will work when we add a filtering mechanism (e.g., minor allele frequency > 1%).

Search PubMed Literature
  PubMed: Search-engine access to life sciences and biomedical scientific journal articles

Notes: Works by linguistically grounding an extraction ontology (e.g., people in the bioinformatics community may know about OpenDMAP).

Reverse-Engineer Human Subject Information from INDIVO
  INDIVO: personally controlled health record system

Add Annotated Images
  Radiology Report (John Doe, July 19, 12:14 pm)

Query and Analyze Data in Knowledge Bundle (KB)

Research to Accomplish
  Build Unified Prototype
    Integrate projects
    Enhance/Add KBB tools
  Create Knowledge Repository
    Data-frame recognizers
    Ontology snippets
    Extraction ontologies (both developed & developing)
  Develop user interface
  Allow for virtual KBs
  Add/Develop analysis tools & data mining tools
  Resolve performance issues
    Decidability & tractability of basic algorithms
    Architecture for web-scale system

Issue Resolution (Summary)
  Wide variety of data sets
    General references: the Web? CIA World Factbook? ... (OntoES, FOCIH, TISP)
    Free-running text: news, technical journals (WePS, [Embley09], Ancestry.com)
    Geospatial data ([Embley89b])
    Entity databases (RelDB [Embley97], XML [Al-Kamha07, Al-Kamha08], IMS=hierarchical [Mok06, Mok10], Network=graph=OSM, OWL [Ding-converter])
    Reports (filled-in forms and semi-structured data [Tao09, Liddle99], TANGO)
    And more (Attensity?)
  Large and numerous data sets (extension to large and additional types; performance)
  Alignment of data models (TANGO)
    Schema mapping ([Xu03, Xu06], ...)
    Data integration ([Biskup03])
    Semantic enrichment (MOGO)
  Advanced analytic algorithms (Giraud-Carrier: knowledge-based semantic distance, record linkage, and hybrid social networks; best-effort, quick answers [Zitzelberger thesis])
  Performance ([Al-Muhammed07b], IS/Liddle, Attensity?)

Vision: KBs & KBBs for Knowledge Discovery and Dissemination
  Custom harvesting of information into KBs
  KB creation via a KBB
    Semi-automatic: shifts harvesting burden to machine
    Synergistic: works without intrusive overhead
    Actively learns as it goes & improves with experience
  Resolve challenging research issues
    KB/KBB prototype
    Semantic integration
    Analysis & data mining tools
    Performance issues (including virtual KBs, large & diverse source repositories, quick construction & immediate usage)

www.deg.byu.edu

