
Page 1: Achieving Semantic Interoperability – Architectures and Methods Denise A. D. Bedford Senior Information Officer World Bank

Achieving Semantic Interoperability – Architectures and Methods

Denise A. D. Bedford

Senior Information Officer

World Bank

Page 2

Semantic Interoperability (SI)

• Semantic interoperability means different things to different people primarily because the context is always different

• Semantics
– Resolved at the understanding and reasoning level
– Word level, Concept level, Language level, Grammatical level, Domain Vocabulary level, Representation level

• Interoperability
– Resolved at the architecture level
– Different sources using different semantics

Page 3

What Does SI Look Like?

• Answer to this question is always, “It depends…”

• Achieving semantic interoperability means that the semantic and the interoperability challenges are resolved at the system level – not at the user level

• Practical examples
– Cross application discovery
– Cross language discovery
– Recommender engines
– Workflow management
– Scenario inferencing

• Let’s look at a high-level model of enterprise search and find the SI points

Page 4

[Diagram: Vision of Semantic Interoperability. Labels: TRIM Archives; PeopleSoft; SIRSI System; InfoShop Metadata; SAP Financial System; Web Content Mgmt. Metadata; IRIS Oracle; Metadata Extract (repeated); Transformation Rules/Maps; Metadata Repository of Bank Standard Metadata (Oracle Tables & Indexes); Reference Tables (CDS+) for Topics, Countries, Document Types (Oracle data classes); Data Governance Bodies; Concept Extraction, Categorization & Summarization Technologies; World Bank Catalog/Enterprise Search (Oracle Intermedia); Site Specific Searching; Publications Catalog; Recommender Engines; Personal Profiles; Portal Content Syndication; Browse & Navigation Structures]

Page 5

Basic Assumptions and Constraints

• There are many layers of semantic challenges between the user experience and architecture

• Ideally, semantic interoperability is grounded in your enterprise architecture – regardless of the level of sophistication of your enterprise architecture

• Semantic interoperability is a question of degree - some of the layers are interoperable at the enterprise level and others may be at a local level

• Some layers may be universal – beyond the enterprise – and others are by definition limited to the enterprise

Page 6

Managing Interoperability Challenges

• Option 1: Integrate, map and reconcile at a superficial level
– Reference mappings
– Continuous monitoring – always after the fact
– Consultation and reconciliation and fixing
– SI solution is always a partial solution

• Option 2: Provide the capability to generate semantically interoperable solutions early in the development stages
– Use the technologies to model what people would do if they had unlimited time and resources
– Develop consistent profiles which are distributed throughout an enterprise, but managed centrally
– Govern and manage the profiles, not the ‘mess’

Page 7

Combining Options

• Option 1 is feeding the beast – you never get ahead and it consumes resources you could use for other products and services

• My experience is that we have to use both options
– Mapping and managing the legacy data unless you can recon
– Trying to push a programmatic solution for new content
– At least trying to stop the reconciliation at a given point in time

• I’d like to talk first about the idea behind the architecture and second, about the actual semantic methods

Page 8

Teragram Tools

• Teragram is a company located in Boston and Paris which offers COTS natural language processing (NLP) technologies

• Teragram’s Natural Language Processing technologies include:
– Rules Based Concept Extraction (also called classifier)
– Grammar Based Concept Extraction
– Categorization
– Summarization
– Clustering
– Language detection

• Semantic engines are available in 30+ languages

Page 9

Teragram Use

• Operationalized in the System
– IRIS – Retrospective Processing
– ImageBank – daily processing of incoming documents
– Structured service descriptions – terse text

• Self-Service Model
– WBI Library of Learning
– Africa Region Operations Toolkit
– External Affairs – eLibrary
– External Affairs – Media Monitoring
– External Affairs – Disease Control Priorities Website
– ICSID -- Document Management
– PICs MARC Record attributes
– Web Archives metadata

Page 10

Structured & Unstructured Data

• Range of formats processed
– Anything in electronic format – MS Office, html, xml, pdf, …

• Range of types of text processed
– 17M pdf documents
– Very short structured service descriptions

• Different writing styles
– Formal publications, internal informal emails, web pages, data reports

• Depending on what you are trying to do with the data – may or may not have to adjust the profile and your strategy

• Most important consideration, though, is the nature of the writing style – informal requires some adjustments

Page 11

Business Drivers

• In order to get ahead of the problem, we decided to:

• ‘Institutionalize’ the Teragram profiles so that outputs are consistently generated across applications and content

• Have a single installation of the technologies to ensure consistent management and efficient maintenance

• Allow different systems to call and consume the outputs from the technologies while using the same profiles

• Avoid tight integration of the Teragram technologies with any existing system

Page 12

Teragram Components & Configuration

[Diagram labels: Enterprise Profile Development & Maintenance; IQ Teragram Team; TK240 Client; Master Data Stores; Authority Lists; Taxonomies; Controlled Vocabularies; Concept Profile; Categorization Profile; Concept List for Clustering; Summarization Rules File; Training Sets; Testing Sets; Language Grammars; Concept Engine; Categorization Engine; Summarization Engine; Clustering Engine; XML formatted output]

Page 13

Enterprise Metadata Capture – Functional Reference Model

[Diagram labels: Enterprise Metadata Capture Strategy; Enterprise Profile Development & Maintenance; TK240 Client; e-CDS Reference Sources; Dedicated Server – Teragram Semantic Engine – Concept Extraction, Categorization, Clustering, Rule Based Engine, Language Detection; APIs & Technical Integration; ImageBank Integration; ISP Integration; IRIS Integration; Factiva Metadata Database; Content Capture; APIs & Integration; XML Wrapped Metadata; XML Output; Content Owners; Business Analyst; IDU Indexers; SITRC Librarians; IRIS Functional Team]

Page 14

Information Architecture Best Practices

• Build profiles at the attribute level so that everyone can use the same profile and there is only one profile to maintain

• Each calling system, though, can specify the attributes that they want to use in their processing
– ImageBank can specify Topics and Keywords
– WBI can specify Topics, Keywords, Country, Regions
– Media Monitoring can specify Topics, Organization Names, People Names
– eLibrary can specify Author, Title, Publisher, Publication Date, Topics, Library of Congress Class No.

• Each of these users is calling the same Topic profile even though their overall profiles are different
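The attribute-level design above can be sketched roughly as follows. This is a minimal illustration, not the Teragram API: `make_list_profile`, `ATTRIBUTE_PROFILES`, `SYSTEM_ATTRIBUTES` and `capture_metadata` are hypothetical names, and the term lists are invented stand-ins for real profiles. Only the two attribute selections (ImageBank and WBI) come from the slide.

```python
from typing import Callable, Dict, List

def make_list_profile(terms: List[str]) -> Callable[[str], List[str]]:
    """Build a stand-in profile that reports which terms occur in a text."""
    def profile(text: str) -> List[str]:
        lowered = text.lower()
        return [term for term in terms if term.lower() in lowered]
    return profile

# One shared profile per attribute, maintained in a single place.
# The term lists are invented for illustration.
ATTRIBUTE_PROFILES: Dict[str, Callable[[str], List[str]]] = {
    "Topics": make_list_profile(["telecommunications", "wildlife resources"]),
    "Keywords": make_list_profile(["technical assistance", "economic reforms"]),
    "Country": make_list_profile(["Nepal", "India"]),
    "Regions": make_list_profile(["South Asia", "Africa"]),
}

# Each calling system names only the attributes it wants (from the slide).
SYSTEM_ATTRIBUTES: Dict[str, List[str]] = {
    "ImageBank": ["Topics", "Keywords"],
    "WBI": ["Topics", "Keywords", "Country", "Regions"],
}

def capture_metadata(system: str, text: str) -> Dict[str, List[str]]:
    """Run only the shared profiles that the calling system asked for."""
    return {attr: ATTRIBUTE_PROFILES[attr](text) for attr in SYSTEM_ATTRIBUTES[system]}
```

Both systems reuse the same `Topics` profile, so a change to that profile reaches every caller without any per-system copy to maintain.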

Page 15

Enterprise Profile Creation and Maintenance

Enterprise Metadata Profile

• Concept Extraction Technology: Country, Organization Name, People Name, Series Name/Collection Title, Author/Creator, Title, Publisher, Standard Statistical Variable, Version/Edition

• Categorization Technology: Topic Categorization, Business Function Categorization, Region Categorization, Sector Categorization, Theme Categorization

• Rule-Based Capture: Project ID, Trust Fund #, Loan #, Credit #, Series #, Publication Date, Language

• Summarization

[Diagram labels: Enterprise Profile Development & Maintenance; e-CDS Reference Sources for Country, Region, Topics, Business Function, Keywords, Project ID, People, Organization; Data Governance Process for Topics, Business Function, Country, Region, Keywords, People, Organizations, Project ID; Teragram Team; TK240 Client; ISP; IRIS; ImageBank; Factiva; JOLIS; E-Journals; UCM Service Requests; Update & Change Requests]

Page 16

Now For the Semantics…

Page 17

Context

• Today I will use a simple application to illustrate the problems and the solutions

• Context is programmatic capture of high quality, consistent, persistent, rich metadata to support parametric enterprise search

• Parametric enterprise search looks simple but there are a lot of underlying semantic problems

• Implementation has expanded beyond core metadata at this point in time and continues to grow but that’s another discussion – also expanding into other languages

Page 18

Cross Application Information Discovery

Page 19

Cross Application Information Discovery

Page 20

World Bank Core Metadata

• Identification/Distinction: Agent, Title, Date, Format, Publisher, Language, Version

• Search & Browse: Country, Region, Abstract/Summary, Keywords, Topic, Business Function

• Use Management: Authorized By, Rights Management, Access Rights, Location, Use History, Disclosure Review Date, Disclosure Status

• Compliant Document Management: Record Identifier, Disposal Status, Disposal Review Date, Management History, Retention Schedule/Mandate, Preservation History, Aggregation Level

• Other attributes listed: Series Name, Series #, Relation, Content Type

• Legend (how each attribute is captured): Human Creation, Programmatic Capture, Inherit from System Context, Extrapolate from Business Rules

Page 21

Semantic Methods

• Each of these parameters presents a different kind of semantic challenge

• Need to find the right semantic solution to fit the semantic problem

• Semantic methods should always mirror how a human approaches, deconstructs and solves the semantic challenge

• Purely statistical approaches to solving semantic problems are only appropriate where a human being would take a statistical approach

• Mistake we have made in the profession is to assume that statistical methods can solve semantic problems – they cannot

Page 22

NLP Technologies – Two Approaches

• Over the past 50 years, there have been two competing strategies in NLP - statistical vs. semantic

• In the mid-1990s at the AAAI Stanford Spring Workshops it was agreed by the active practitioners that the statistical NLP approach had hit a rubber ceiling – there were no further productivity gains to be made from this approach

• About that time, the semantic approach showed practical gains – we have been combining the two approaches since the late 1990s

• Most of the tools on the market today are statistical NLP, but some have a more robust underlying semantic engine

Page 23

Problem with Statistical NLP

• We experimented with several of these tools in the early 2000s – including Autonomy, Semio, Northern Lights Clustering – but there were problems

– the statistical associations you generate are entirely dependent upon the frequency at which they occur in the training set

– Without a semantic base you cannot distinguish types of entities, attributes, concepts or relationships

– If the training set is not representative of your universe, your relationships will not be representative and you cannot generalize from the results

– If the universe crosses domains, then the data that have the greatest commonality (least meaning) have the greatest association value
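A toy illustration of the last point, assuming a deliberately tiny, invented corpus that mixes two domains. Counting plain co-occurrence (one common statistical association measure, not necessarily what the tools named above use) ranks the generic words highest, because they are the only ones shared across both domains.

```python
from collections import Counter
from itertools import combinations

# Invented corpus spanning two domains (illustration only).
documents = [
    "the project report covers rural telecommunications networks",
    "the project report covers wildlife habitat protection",
    "the project report reviews cellular networks",
    "the project report reviews wildlife poaching",
]

def cooccurrence_counts(docs):
    """Count, for each unordered word pair, the documents containing both words."""
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc.split()))
        counts.update(combinations(words, 2))
    return counts

counts = cooccurrence_counts(documents)
# The three strongest associations are all between generic words.
top_pairs = counts.most_common(3)
```

Here `top_pairs` contains only pairs drawn from "the", "project" and "report", while a domain pair such as "networks" and "telecommunications" co-occurs just once.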

Page 24

Semantic NLP

• For years, people thought the semantic approach could not be achieved so they relied on statistical methods

• The reason they thought it would never be practical is that it took a long time to build the foundation – understanding human language is not a trivial exercise

• Building a semantic foundation involves:
– Developing grammatical and morphological rules – language by language
– Using parsers and Part of Speech (POS) taggers to semantically decompose text into semantic elements
– Building dictionaries or corpora for individual languages as fuel for the semantic foundation to run on
– Making it all work fast enough and in a resource efficient way to make it economically practical

Page 25

Example of Semantic Analysis

Page 26

Problem with Statistical Tools

• There are problems with the way the statistical methods are packaged in tools…

– Resource intense to run – to cluster 100 documents may take several hours and give you suboptimal results

– Results are dynamic not persistent - you can’t do anything else with the results but look at them and point back to the documents

– They only live in the index that was built to support the cluster and generally are not consumable by any other tools

– Outputs are not persistently associated with the content

– We wanted to generate persistent metadata which could then be manipulated by other tools

Page 27

Implementing Teragram

• The package consists of a developer’s client (TK240) and multiple servers to support the technologies

• Client is the tool we use to build the profiles/rules – server interprets the rules

• Recall the earlier model of enterprise profiles

• Each attribute is supported by its own profile – there is a profile for countries, one for regions, one for topics, one for people names, and so on

• We keep a ‘table’ of the profiles that any application uses – call the profiles at run time

• Language profiles are separate – English, French, Spanish, …

Page 28

Implementing Teragram

• The first step is not applying the tool to content, but analyzing the semantic challenge

• Understand how a person resolves the semantic problem - then devise a machine solution that resembles the human solution

• The solution involves selecting a tool from the Teragram set, building the rules, testing and refining the rules, then rolling out as QA for end user review

• End user feedback and signoff is important – helps build confidence and improves the quality of the result

• Depending on the complexity of the problem and whether the rules require a reference source, putting the solution together might take a week to two months

Page 29

Examples of Solutions

• There are different kinds of semantic tools – you have to find the one that suits your semantic problem

• Let’s look at some solution examples:
– Rules Based Concept Extraction
– Grammar Based Concept Extraction
– Categorization
– Summarization
– Clustering
– Language detection

• As I talk about each solution, I’ll describe what we tried that didn’t work, as well as what did work in the end

Page 30

Rule Based Concept Extraction

• What is it?
– Rule based concept or entity extraction is a simple pattern recognition technique which looks for and extracts named entities
– Entities can be anything – but you have to have a comprehensive list of the names of the entities you’re looking for

• How does it work?
– It is a simple pattern matching program which compares the list of entity names to what it finds in content
– Regular expressions are used to match sets of strings that follow a pattern but contain some variation
– List of entity names can be built from scratch or using existing sources – we try to use existing sources
– A rule-based concept extractor would be fueled by a list such as Working Paper Series Names, edition or version statement, Publisher’s names, etc.
– Generally, concept extraction works on a “match” or “no match” approach – it matches or it doesn’t
– Your list of entity names has to be pretty good
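As a rough sketch of the two mechanisms described above (list matching and regular expressions), assuming Python's `re` module in place of the TK240 rules engine. The publisher list and the series pattern are illustrative assumptions, not the Bank's actual profiles.

```python
import re

# Illustrative authority list; in practice it would come from an existing source.
PUBLISHER_NAMES = ["World Bank", "Oxford University Press"]

# Illustrative pattern for a series statement whose number varies.
SERIES_PATTERN = re.compile(r"Policy Research Working Paper\s+(?:No\.\s*)?\d+")

def extract_concepts(text):
    """Return exact list matches and pattern matches found in the text."""
    found = {"publisher": [], "series": []}
    for name in PUBLISHER_NAMES:
        # Match or no match: the name is present exactly as listed, or it is not.
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found["publisher"].append(name)
    # The regular expression covers the variation a plain list cannot enumerate.
    found["series"] = SERIES_PATTERN.findall(text)
    return found
```

A misspelled publisher name simply produces no match, which is why the quality of the list matters so much.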

Page 31

Rule Based Concept Extraction

• How do we build it?
1. Create a comprehensive list of the names of the entities – most of the time these already exist, and there may be multiple copies
2. Review the list, study the patterns in the names, and prune the list
3. Apply regular expressions to simplify the patterns in the names
4. Build a Concept Profile
5. Run the concept profile against a test set of documents (not a training set because we build this from an authoritative list not through ‘discovery’)
6. Review the results and refine the profile

• State of Industry
– The industry is very advanced – this type of work has been under development and deployed for at least three decades now. It is a bit more reliable than grammatical extraction, but it takes more time to build.

Page 32

Rules Based Concept Extraction Examples

• Loan #
• Credit #
• Report #
• Trust Fund #
• ISBN, ISSN
• Organization Name (companies, NGOs, IGOs, governmental organizations, etc.)
• Address
• Phone Numbers
• Social Security Numbers
• Library of Congress Class Number
• Document Object Identifier
• URLs
• ICSID Tribunal Number
• Edition or version statement
• Series Name
• Publisher Name

Let’s look at the Teragram TK240 profiles for Organization Names, Edition Statements, and ISBN

Page 33

Replace this slide with the ISBN screen – with the rules displayed

Concept based rules engine allows us to define patterns to capture other kinds of data

ISBN Concept Extraction Profile – Regular Expressions (RegEx)

Use of concept extraction, regular expressions, and the rules engine to capture ISBNs.

Regular expressions match sets of strings by pattern, so we don’t need to list every exact ISBN we’re looking for.
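A sketch of the same idea in Python, since the actual TK240 rules are not reproduced in this transcript. The pattern and function names are assumptions. The check-digit test is an addition not mentioned on the slide; it is included only because a pattern alone will also accept digit strings that are not real ISBNs.

```python
import re

# Optional "ISBN" label variants, then 10 or 13 digits with optional hyphens
# or spaces between groups; the last ISBN-10 character may be X.
ISBN_CANDIDATE = re.compile(
    r"\bISBN(?:-1[03])?:?\s*"
    r"((?:97[89][- ]?)?\d{1,5}[- ]?\d{1,7}[- ]?\d{1,7}[- ]?[\dX])\b"
)

def is_valid_isbn(raw):
    """Check the ISBN-10 (mod 11) or ISBN-13 (mod 10) check digit."""
    digits = re.sub(r"[- ]", "", raw)
    if re.fullmatch(r"\d{9}[\dX]", digits):
        total = sum(
            (10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(digits)
        )
        return total % 11 == 0
    if re.fullmatch(r"\d{13}", digits):
        total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(digits))
        return total % 10 == 0
    return False

def extract_isbns(text):
    """Return pattern matches that also pass the check-digit test."""
    return [m for m in ISBN_CANDIDATE.findall(text) if is_valid_isbn(m)]
```

One pattern covers every ISBN in the collection, whereas a list-based profile would need each number entered individually.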

Page 34

Classifier concept extraction allows us to look for exact string matches

List of entities matches exact strings. This requires an exhaustive list – but gives us extensive control. (It would be difficult to distinguish by pattern between IGOs and other NGOs.)

Page 35

Another list of entities matches exact strings. In this case, though, we’re making this into an ‘authority control list’ – we’re matching multiple strings to the one approved output. (In this case, the AACR2-approved edition statement.)
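The authority-control idea reduces to a many-to-one lookup. In this minimal sketch the variant strings and the approved forms are illustrative assumptions; they should be checked against the cataloguing rules actually in use.

```python
# Several observed variants map to one approved output string.
# Entries are illustrative, not the actual World Bank authority list.
EDITION_AUTHORITY = {
    "second edition": "2nd ed.",
    "2nd edition": "2nd ed.",
    "2nd ed.": "2nd ed.",
    "revised edition": "Rev. ed.",
    "rev. ed.": "Rev. ed.",
}

def normalize_edition(text):
    """Return the approved edition statement for the first variant found, else None."""
    lowered = text.lower()
    # Try longer variants first so a short variant never shadows a longer one.
    for variant in sorted(EDITION_AUTHORITY, key=len, reverse=True):
        if variant in lowered:
            return EDITION_AUTHORITY[variant]
    return None
```

Whatever wording appears in the document, downstream systems only ever see the approved form.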

Page 36

Grammatical Concept Extractions

• What is it?
– A simple pattern matching algorithm which matches your specifications to the underlying grammatical entities
– For example, you could define a grammar that describes a proper noun for people’s names or for sentence fragments that look like titles

• How does it work?
– This is also a pattern matching program but it uses computational linguistics knowledge of a language in order to identify the entities to extract – if you don’t have an underlying semantic engine, you can’t do this type of extraction
– There is no authoritative list in this case – instead it uses parsers, part-of-speech tagging and grammatical code
– The semantic engine’s dictionary determines how well the extraction works – if you don’t have a good dictionary you won’t get good results
– There needs to be a distinct semantic engine for each language you’re working with

Page 37

Grammatical Concept Extractions

• How do we build it?
– Model the type of grammatical entity we want to extract and use the grammar definitions to build a profile
– Test the profile on a set of test content to see how it behaves
– Refine the grammars
– Deploy the profile

• State of Industry
– It has taken decades to get the grammars for languages well defined
– There are not too many of these tools available on the market today but we are pushing to have more open source
– Teragram now has grammars and semantic engines for 30 different languages commercially available
– IFC has been working with ClearForest

• Let’s look at some examples of grammatical profiles – People’s Names, Noun Phrases, Verb Phrases, Book Titles

Page 38

TK240 Grammars for People Names

Grammar concept extraction allows us to define concepts based on semantic language patterns.

Page 39

Grammatical Concept Extraction

<?xml version="1.0" encoding="UTF-8"?>
<Proper_Noun_Concept>
  <Source>
    <Source_Type>file</Source_Type>
    <Source_Name>W:/Concept Extraction/Media Monitoring Negative Training Set/ 001B950F2EE8D0B4452570B4003FF816.txt</Source_Name>
  </Source>
  <Profile_Name>PEOPLE_ORG</Profile_Name>
  <keywords>Abdul Salam Syed, Aruna Roy, Arundhati Roy, Arvind Kesarival, Bharat Dogra, Kwazulu Natal, Madhu Bhaduri, </keywords>
  <keyword_count>7</keyword_count>
</Proper_Noun_Concept>

Proper Noun Profile for People Names uses grammars to find and extract the names of people referenced in the document.
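Because the output is plain XML, any calling system can consume it with a standard parser. A minimal sketch using Python's standard library follows; the element names are taken from the output shown on this slide, while the function name `parse_concepts` and the returned dictionary layout are assumptions.

```python
import xml.etree.ElementTree as ET

# The XML output shown on the slide, reproduced as a string.
XML_OUTPUT = """<?xml version="1.0" encoding="UTF-8"?>
<Proper_Noun_Concept>
  <Source>
    <Source_Type>file</Source_Type>
    <Source_Name>W:/Concept Extraction/Media Monitoring Negative Training Set/ 001B950F2EE8D0B4452570B4003FF816.txt</Source_Name>
  </Source>
  <Profile_Name>PEOPLE_ORG</Profile_Name>
  <keywords>Abdul Salam Syed, Aruna Roy, Arundhati Roy, Arvind Kesarival, Bharat Dogra, Kwazulu Natal, Madhu Bhaduri, </keywords>
  <keyword_count>7</keyword_count>
</Proper_Noun_Concept>"""

def parse_concepts(xml_text):
    """Pull the profile name, source and extracted names out of the XML output."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    raw = root.findtext("keywords", default="")
    # The keyword list ends with a trailing comma, so drop empty pieces.
    names = [name.strip() for name in raw.split(",") if name.strip()]
    return {
        "profile": root.findtext("Profile_Name"),
        "source": root.findtext("Source/Source_Name"),
        "names": names,
        "declared_count": int(root.findtext("keyword_count", default="0")),
    }

parsed = parse_concepts(XML_OUTPUT)
```

Comparing `len(parsed["names"])` with `parsed["declared_count"]` gives a cheap sanity check on each record before it is stored as metadata.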

Page 40

Grammatical Concept Extraction – People Names Client testing mode

Page 41

Rule-Based Categorization

• What is it?
– Categorization is the process of grouping things based on characteristics
– Categorization technologies classify documents into groups or collections of resources
– An object is assigned to a category or schema class because it is ‘like’ the other resources in some way
– Categories form part of a hierarchical structure when applied to such subjects as a taxonomy

• How does it work?
– Automated categorization is an ‘inferencing’ task – meaning that we have to tell the tools what makes up a category and then how to decide whether something fits that category or not
– We have to teach it to think like a human being
• When I see -- access to phone lines, analog cellular systems, answer bid rate, answer seizure rate – I know this should be categorized as ‘telecommunications’
• We use domain vocabularies to create the category descriptions
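A minimal sketch of categorization driven by domain vocabularies. The telecommunications concepts are the four named on the slide; the second category, the `min_hits` threshold and the function name are hypothetical, and a production profile would hold far more concepts per category.

```python
# Each category is described by concepts from its domain vocabulary.
CATEGORY_CONCEPTS = {
    "telecommunications": [
        "access to phone lines",
        "analog cellular systems",
        "answer bid rate",
        "answer seizure rate",
    ],
    # Hypothetical second category, added only for contrast.
    "wildlife resources": ["wildlife habitat", "poaching", "game reserves"],
}

def categorize(text, min_hits=2):
    """Assign every category whose vocabulary matches at least min_hits concepts."""
    lowered = text.lower()
    assigned = {}
    for category, concepts in CATEGORY_CONCEPTS.items():
        hits = [concept for concept in concepts if concept in lowered]
        if len(hits) >= min_hits:
            assigned[category] = hits
    return assigned
```

Returning the matched concepts along with the category keeps the decision explainable, which helps when end users review the results.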

Page 42

Rule Based Categorization

• How do we build it?
1. Build the hierarchy of categories
a) Manually if you have a scheme in place and maintained by people
b) Programmatically if you need to discover what the scheme should be
2. Build a training set of content category by category – from all kinds of content
3. Describe each category in terms of its ‘ontology’ – in our case this means the concepts that describe it (generally between 1,000 and 10,000 concepts)
4. Filter the list to discover groups of concepts
5. The richer the definition, the better the categorization engine works
6. Test each category profile on the training set
7. Test the category profile on a larger set that is outside the domain
8. Insert the category profile into the profile for the larger hierarchy

Page 43

Rule Based Categorization

• State of the Industry
– Only a handful of rule-based categorizers are on the market today
– Most of the existing technologies are dynamic clustering tools
– However, the market will probably grow in this area as the demand grows

Page 44

Categorization Examples

• Let’s look at some working examples by going to the Teragram TK240 profiles
– Topics
– Countries
– Regions
– Sector
– Theme
– Disease Profiles

• Other categorization profiles we’re also working on…
– Business processes (characteristics of business processes)
– Sentiment ratings (positive media statements, negative media statements, etc.)
– Document types (by characteristics found in the documents)
– Security classification (by characteristics found in the documents)

Page 45

Topic Hierarchy From Relationships across data classes

Build the rules at the lowest level of categorization

Page 46

Subtopics

Domain concepts or controlled vocabulary

Page 47

Topics Categorization Client Test

Page 48

Automatically Generated XML Metadata

Page 49

Automatically Generated Metadata

Page 50

Automatically Generated XML Metadata for Business Function attribute

• Office memorandum on requesting CD’s clearance of the Board Package for NEPAL: Economic Reforms Technical Assistance (ERTA)

Page 51

Clustering

• What is it?

– The use of statistical and data mining techniques to partition data into sets. Generally the partitioning is based on statistical co-occurrence of words, and their proximity to or distance from each other

• How does it work?

– Those words that have frequent occurrences close to one another are assigned to the same cluster

– Clusters can be defined at the set or the concept level – usually the latter

– Can work with a raw training set of text to discover and associate concepts or to suggest ‘buckets’ of concepts

– A few tools can work with a refined list of concepts to be clustered against a text corpus

– Please note the difference between clustering words in content and clustering domain concepts – major distinction
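A small sketch of word-level clustering by proximity, under simplifying assumptions: words are whitespace tokens, "close" means within a fixed window, and words are merged into one cluster when they appear close together at least `min_count` times. Commercial engines use more elaborate statistics; the function names and thresholds here are hypothetical.

```python
from collections import Counter

def proximity_pairs(docs, window=2):
    """Count word pairs that occur within `window` positions of each other."""
    counts = Counter()
    for doc in docs:
        words = doc.lower().split()
        for i, word in enumerate(words):
            for other in words[i + 1 : i + 1 + window]:
                if other != word:
                    counts[tuple(sorted((word, other)))] += 1
    return counts

def cluster(docs, window=2, min_count=2):
    """Merge words into clusters when they are frequently found close together."""
    parent = {}

    def find(word):
        # Union-find lookup with path halving.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for (a, b), n in proximity_pairs(docs, window).items():
        if n >= min_count:
            parent[find(a)] = find(b)

    groups = {}
    for word in list(parent):
        groups.setdefault(find(word), set()).add(word)
    return sorted(sorted(group) for group in groups.values())
```

The clusters depend entirely on the text supplied, so adding or removing documents can change them.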

Page 52

Clustering vs. Categorization

• Clustering Categorization

Page 53

Feeder Clustering

• How do we build it?
1. Define the list of concepts
2. Create the training set
3. Load the concepts into the clustering engine
4. Generate the concept clusters

• State of Industry
– Most of the commercial tools that call themselves ‘categorizers’ are actually clustering engines
– Generally, clustering doesn’t work at a high domain level for large sets of text
– They can provide insights into concepts in a domain when used on a small set of documents
– All the engines are resource intense, though, and the outputs are transitory – clusters live only in the cluster index
– If you change the text set, the cluster changes

Page 54

Clustering Concepts

This is from the clustering output for 12.15.00 - Wildlife Resources.

‘Clusters’ of concepts between line breaks are terms from the Wildlife Resources controlled vocabulary found co-occurring in the same training document. This highlights often subtle relationships.


Clustering Words in Content

Clusters of words based on occurrences in the content


Summarization

• What is it?

– Rule-driven pattern matching and sentence extraction programs

– Important to distinguish summarization technologies from some information extraction technologies – many on the market extract only ‘fragments’ of sentences, which is what Google does when it presents a search result to you

– Will generate document surrogates, point of view summaries, HTML metatag Description, and ‘gist’ or ‘synopsis’ for search indexing

– Results are sufficient for ‘gisting’ for HTML metatags, as surrogates for full text document indexing, or as summaries to display in search results to give the user a sense of the content

• How does it work?

– Uses rules and conditions for selecting sentences

– Enables us to define how many sentences to select

– Allows us to specify the concepts to use to select sentences

– Allows us to determine where in the sentence the concepts might occur

– Allows us to exclude sentences from being selected

– We can write multiple sets of rules for different kinds of content
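
As a rough illustration of rule-driven sentence selection, the sketch below picks whole sentences that contain trigger concepts, skips sentences that contain excluded concepts, and caps the number of sentences returned. The function name, its parameters, and the regex-based sentence splitter are assumptions made for this example; it does not reproduce any vendor's engine.

```python
import re


def summarize(text, include, exclude=(), max_sentences=2):
    """Select up to `max_sentences` whole sentences using simple concept rules."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    selected = []
    for sentence in sentences:
        lowered = sentence.lower()
        if any(term in lowered for term in exclude):
            continue  # exclusion rules win over inclusion rules
        if any(term in lowered for term in include):
            selected.append(sentence)
        if len(selected) == max_sentences:
            break
    return selected


text = (
    "Copyright 2004 all rights reserved. "
    "The objective of the project is to expand credit. "
    "For example, loans may be offered. "
    "The Bank agreed to finance the project."
)
print(summarize(text, include=["objective", "agreed to"], exclude=["copyright", "for example"]))
```

Different kinds of content would get different `include` and `exclude` lists, which corresponds to writing multiple rule sets.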


Summarization

• How do we build it?

1. Analyze the content to be summarized to understand the type of speech and writing used – IRIS is different from Publications, which is different from News stories

2. Identify the key concepts that should trigger a sentence extraction

3. Identify where in the sentence these concepts are likely to occur

4. Identify the concepts that should be avoided

5. Convert concepts and conditions to a rule format

6. Load the rule file onto the summarization server

7. Test the rules against a test set of content and refine until ‘done’

8. Launch the summarization engine and call the rule file

• State of Industry

– Most tools are either readers or extractors. The reader method uses clustering & weighting to promote sentence fragments. The extractor method uses internal format representation and word & sentence weighting

– What has been missing from the extractors in most commercial products is the capability to specify the concepts and the rules. Teragram is the only product we found to support this.


Summarization Rules

Code | Where the term would appear in the sentence | Likelihood of inclusion | Syntax

5 | anywhere in the sentence | It is likely not to be included | copyright/2004,5

9 | anywhere in the sentence | Definitely not included | for/example,9

7 | anywhere in the sentence | Definitely to be included | got/the/top/grade,7

10 | anywhere in the sentence | It is likely to be included | pull/off/that/coup,10

2 | anywhere in the sentence, followed by the second term | It is likely to be included | evidence,2:collected

1 | beginning of the sentence | It is likely to be included | we/report,1

6 | beginning of the sentence | Definitely to be included | reporting/on,6

8 | beginning of the sentence | Definitely not included | copyright/reserved,8

3 | beginning of the sentence; only if the preceding sentence qualifies | It is likely to be included | however,3

4 | beginning of the sentence; only if the preceding sentence qualifies | Definitely to be included | the/former,4
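
The rule syntax in this table can be read mechanically. The sketch below parses a rule line and checks whether it fires on a sentence. It is a guess at the semantics based only on the table: it assumes that `/` separates the words of a phrase, that the number after the comma is the code, and that the text after `:` is the second term for code 2. `Rule`, `parse_rule`, `rule_fires` and the `CODES` labels are hypothetical names, not part of any product.

```python
from dataclasses import dataclass
from typing import Optional

# (position condition, inclusion verdict) per code, transcribed from the table.
CODES = {
    1: ("start", "likely"),
    2: ("anywhere_then_second", "likely"),
    3: ("start_if_prev", "likely"),
    4: ("start_if_prev", "definitely"),
    5: ("anywhere", "likely_not"),
    6: ("start", "definitely"),
    7: ("anywhere", "definitely"),
    8: ("start", "definitely_not"),
    9: ("anywhere", "definitely_not"),
    10: ("anywhere", "likely"),
}


@dataclass
class Rule:
    phrase: str
    code: int
    second: Optional[str] = None


def parse_rule(line):
    """Parse e.g. 'got/the/top/grade,7' or 'evidence,2:collected'."""
    terms, _, rest = line.strip().rpartition(",")
    code, _, second = rest.partition(":")
    return Rule(terms.replace("/", " "), int(code), second.replace("/", " ") or None)


def rule_fires(rule, sentence, prev_qualified=False):
    """Check the position condition of a rule against one sentence."""
    s = sentence.lower()
    position, _verdict = CODES[rule.code]
    if position == "anywhere":
        return rule.phrase in s
    if position == "start":
        return s.startswith(rule.phrase)
    if position == "start_if_prev":
        return prev_qualified and s.startswith(rule.phrase)
    # anywhere_then_second: the phrase must be followed later by the second term.
    first = s.find(rule.phrase)
    return first != -1 and rule.second is not None and rule.second in s[first + len(rule.phrase):]


rule = parse_rule("evidence,2:collected")
print(rule, rule_fires(rule, "Evidence was collected in 2003."))
```

How an engine combines the verdicts of several firing rules (for example, "definitely not" overriding "likely") is not stated on the slide, so the sketch stops at matching.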


Automatically Generated Gist

• PID Bosnia-Herzegovina Private Sector Credit Project

• Rules

– agreed/to,10

– with/the/objective,10

– objective,2:project

– proposed,2:project

– assist/in,10

• Gist


Impacts & Outcomes

• Productivity Improvements

– Can now assign deep metadata to all kinds of content

– Remove the human review aspect from the metadata capture

– Reduce unit times where human review is still used

• Information Quality impacts

– The metadata created is consistent

– All metadata carries the information architecture with it

– Apply quality metrics at the metadata level to eliminate need to build ‘fuzzy search architectures’ – these rarely scale or improve in performance

– Use the technologies to identify and fix problems with our data


Lessons Learned

• All semantic interoperability challenges are practical which means that there is a context in which they are used

• Don’t try to solve semantic challenges that don’t pertain to your environment – think long term about use

• Analyze the context to determine the highest value semantic challenges

• Leverage what others have done, but don’t adopt their SI solutions as a black box solution – won’t work unless you have identical contexts

• Start by modeling the context – you might begin with a logical reference model or an ontology


Additional Applications

• 60 years of content which is not characterized in terms of its business process – retrospectively categorize to provide an important perspective

• People and Institutions Referenced

• Media Monitoring – generating metadata for news stories from around the world for statistical analysis purposes – how is the Bank perceived in Brazil, in Kenya, in India

• Capturing important numbers – bid #, project ID, Trust Fund # - where staff don’t input it or make errors in transcription

• Language detection for content


Thank You!

Questions & Discussions