
Enterprise Information Integration and Semantic Technologies at the World Bank

Denisa Popescu, Enterprise Architecture

World Bank Group

Presentation Outline

• Bank’s Information Challenges

• SAS Teragram Technologies Overview

• How Teragram Works in the Bank

• Key Outcomes & Lessons Learned

Bank’s Information Challenges

World Bank Group

• World Bank Group is an international development organization providing loans, grants and knowledge and advisory services to developing countries for a wide array of purposes that include education, health, public administration, infrastructure, financial and private sector development, agriculture and environmental and natural resource management.

• Office of the Enterprise Architecture is part of the Central IT Department and is responsible for the Enterprise Architecture Framework, Enterprise Information and Technology Standards and Policies, and Shared Enterprise Information Platforms and Tools.

Bank’s Architectural & Information Challenges

• Numerous repositories that contain large amounts of information

• Most of our information is unstructured (.pdf, .doc, .txt, .ppt, .html)

• We rely on staff to “file” information and add metadata

• Lack of authoritative reference sources

As a result,

• Uneven capture of information and metadata across the Bank’s institutional repositories

• Similar information resides in multiple repositories

• Multiple representations for the same type of “information”

• Staff can’t find related information

Information in Bank’s Environment

Structured Data

• Operational/Transaction Data: Operations, Human Resources, Financial Mgmt, Loans, etc.

• Data entities, attributes, relationships; data definition: varchar(x), number, character, primary key, foreign key, etc.

Metadata

• Attributes describing unstructured content: Author, Title, Project ID, Country, Topic, Business Function, …

Unstructured Information

• Content: Documents, Web Pages, Email, Records, Multimedia, Books

Metadata is the glue

• People create information, for example a TOR (I. Purpose, II. Participants, III. Findings, IV. …), Conference Proceedings, or Comments on a PCN review meeting

• Metadata: Title, Date, Speaker, File format, Topic, …

• People find information by searching or browsing repositories

• Knowledge Sharing

But the problem with metadata is that…

• Quality of metadata is uneven

• It takes too much effort for the user to enter it

• Each individual (creator or searcher) may have a different perspective on how to describe information

What are Semantic Technologies?

• Semantic technologies provide an abstraction layer above existing IT technologies that enables bridging and interconnection of information, people, and processes.

• At the World Bank, we are using Conceptual Information Models and SAS Teragram Technologies to create this layer.

Based on James Melzer’s EIA in Context, 2006.

Information Management Framework (diagram)

• Business Process

• Managing Information: IA Policy & Guidance; Discover/Design Architecture; Build Structural Schemes

• Designing Structures: Architecture; Structures; Conceptual Information Model

• Create, Capture & Catalog Information; Capture Metadata

• Organize, Manage & Publish Collections; Manage Workflow

• Search, Contextualize & Deliver to Audience

• Information Governance; Records Management (Retention, etc.)

• Information Access/Usage; Information Distribution/Movement

• Governance: Policies, Procedures & Standards

• Master Data (Models, Stewards, Data Harmonization)

• Enterprise Architecture: Business, Application, Technology, Information

A unified Enterprise Information Architecture (diagram)

• Structured: SAP, P/Soft, Other DBs

• Unstructured: Documents, Email, Multimedia, Etc.

• Processes to Manage Information: Create, Capture & Catalog; Capture Metadata (Automated & Manual); Organize, Manage & Publish Collections; Search, Contextualize & Deliver to Audience; Provide Administrative Reports & Manage Workflow

• Records Management, Web, Document Management, Portal

• Audience: HQ Staff, CO Staff, Partners, Public

Corporate Data Model Framework: Shared Information Domains

CDM Data Domains

• Client

• Employee / Consultant

• Finance (Cost Centre – Fund Centre, Chart of Account)

• WB Organization

• Project

• Policy

• Vendor

• Business Partner

• Party

• Product & Service

• Identity

• Documents & Reports

• Theme

• Sector

For each Data Domain, Conceptual & Logical Data Models, a Data Dictionary and Data Standards are provided.

Core Metadata Standard for Unstructured Information

The slide groups the metadata elements into Core and Extension sets.

• Metadata elements: Resource Identifier, Title, Author, Owner, Abstract, Document Date, Language, Topics, Keywords, Business Function, Project ID, Client Party, Country

• Data domains shown alongside: Identity, Client, Employee / Consultant, Project, Theme, Sector, Reference Data, Geographical, Country
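As an illustration only, the metadata elements listed on this slide could be carried in a single record type. The sketch below is a minimal Python example; the class name, the field types, the choice of which elements are required, and the `missing_elements` helper are all assumptions made for this example and are not part of the Bank's standard.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class CoreMetadata:
    """One metadata record for an unstructured resource.

    Field names follow the elements on the slide; types and
    required/optional status are assumptions for this sketch.
    """
    resource_identifier: str
    title: str
    authors: List[str] = field(default_factory=list)
    owner: Optional[str] = None
    abstract: Optional[str] = None
    document_date: Optional[date] = None
    language: Optional[str] = None
    topics: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    business_function: Optional[str] = None
    project_id: Optional[str] = None
    client_party: Optional[str] = None
    countries: List[str] = field(default_factory=list)

    def missing_elements(self) -> List[str]:
        """Return the names of elements that are still empty (hypothetical QC helper)."""
        return [name for name, value in vars(self).items()
                if value in (None, "", [])]


record = CoreMetadata(
    resource_identifier="doc-001",  # made-up identifier
    title="Sustainable tourism and cultural heritage",
    topics=["Tourism"],
)
print(record.missing_elements())
```

A record like this gives automated extraction and manual indexing the same target shape, so empty elements can be spotted before a document is published.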

Automatic Metadata Capture using Teragram

• Automatic Metadata Capture to generate consistent values for core metadata across information collections

• For high-value information collections, automated metadata extraction strengthens the information quality control function (e.g. indexers)

Teragram Technologies Overview

What does SAS / Teragram do?

• SAS Teragram applies natural language processing (NLP) and advanced linguistic techniques to automatically extract relevant concepts and categorize large volumes of multilingual content.

– Rule-based Automatic Categorization

– Entity and Fact/Event Extraction

– Document Summarization

– Document De-duplication

– Noun Phrase extraction

– Clustering

– Language detection

– Tokenization, Stemming, Part-of-speech tagging, …

Why is the Bank using Teragram?

• Teragram will allow the Bank to standardize description of information across multiple systems and programmatically generate metadata:

– Standardization improves the consistency of metadata

– Automatic metadata capture saves time and resources

– Ability to process and describe huge amounts of information

– Will improve “findability” of information by providing data-driven browsing structures

Case in Point: How Teragram Works in the Bank

At the World Bank, we create so much information that no number of human eyes could ever effectively categorize all of it in a timely manner.

Fortunately, we can automatically process a great deal of it using Teragram, which scans documents, recognizes terms and categorizes them for us. This is often more effective than letting a human being try to figure out what a document is about.

You might think it would be easy to tell what the document you’re reading is about. However, this software can tell us not only what you think it’s about, but what the Bank thinks it’s about.

Case in Point

For example, the Bank produces a working paper on “Sustainable tourism and cultural heritage”. This report provides an overview of the relation between cultural heritage preservation and tourism and presents strategies for promoting sustainability in the tourism industry associated with cultural heritage sites and natural environments.

• That report obviously belongs under “Eco-Tourism”.

• This concerns preservation of heritage sites.

• A lot of these projects need to be consistent with the country’s cultural policy.

• Since we don’t have a folder for “tourism industry”, I’ll just tag this “industry” for now.

• I see this mainly as a “sustainable development” project….

Automatic Metadata Capture: Documents & Reports Library

Example of Browsing Structures: Documents & Reports Library

Enterprise Topic Taxonomy

Automatic Metadata Capture: E-Library

Teragram-generated Topics, Keywords, Region

Example of Use of Thesaurus in Search

Teragram-generated (Thesaurus)

Automatic Metadata Extraction and Categorization

Diagram labels:

• Raw Content

• Apply Teragram Profiles

• Group into Collections

• Extract Metadata

• Content and metadata

• Content delivery: Search, Syndication, Browse

• Quality control

Rule-based Categorization: Enterprise Topics

Rule-based Categorization: Business Functions
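The general idea behind rule-based categorization can be sketched in a few lines of Python. This is not Teragram's rule syntax or the Bank's actual profiles: the `TopicRule` type, the topic names and the trigger terms below are all hypothetical, and the matching is plain substring counting rather than linguistic analysis.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TopicRule:
    """A hypothetical rule: assign `topic` if enough trigger terms occur."""
    topic: str
    terms: Tuple[str, ...]
    min_hits: int = 1


def categorize(text: str, rules: List[TopicRule]) -> List[str]:
    """Return matching topics, strongest match first (ties sorted by name)."""
    lowered = text.lower()
    scores = {}
    for rule in rules:
        # Count every occurrence of every trigger term in the text.
        hits = sum(lowered.count(term) for term in rule.terms)
        if hits >= rule.min_hits:
            scores[rule.topic] = hits
    return sorted(scores, key=lambda topic: (-scores[topic], topic))


# Illustrative rules only; real profiles would be derived from the taxonomy.
RULES = [
    TopicRule("Tourism", ("tourism",)),
    TopicRule("Cultural Heritage", ("cultural heritage", "heritage site")),
    TopicRule("Energy", ("power plant", "electricity")),
]

SAMPLE = (
    "This report provides an overview of the relation between cultural "
    "heritage preservation and tourism and presents strategies for promoting "
    "sustainability in the tourism industry associated with cultural "
    "heritage sites and natural environments."
)

print(categorize(SAMPLE, RULES))
```

Because the rules are explicit, every document is judged against the same criteria, which is the consistency benefit the earlier slides describe.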

Grammar-based Concept Extraction: People Names

RegEx-based Concept Extraction: Project Identifier
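A regular-expression extractor for identifiers can be sketched as follows. The slide does not state the identifier format, so the pattern here (a capital P followed by exactly six digits) is an assumption used for illustration, and `extract_project_ids` is a hypothetical helper name.

```python
import re
from typing import List

# Assumed format: "P" followed by exactly six digits, e.g. P123456.
PROJECT_ID = re.compile(r"\bP\d{6}\b")


def extract_project_ids(text: str) -> List[str]:
    """Return distinct project identifiers in order of first appearance."""
    found: List[str] = []
    for match in PROJECT_ID.findall(text):
        if match not in found:
            found.append(match)
    return found


sample = "Financing for P123456 builds on P654321; see also P123456 annex 2."
print(extract_project_ids(sample))
```

The word boundaries (`\b`) keep the pattern from matching inside longer tokens, such as an identifier with seven digits.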

Summarizer: Project Identification Document

Other uses

• Find and link similar documents (e.g. versions, parts contained in other documents)

• Extract Institutions Referenced in the document

• Extract People Referenced in the document

• Extract Document date

• Extract Location (country, region, city)

• Extract Title of document
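One common way to find similar documents, such as versions of the same report, is to compare overlapping word sequences ("shingles") with Jaccard similarity. The sketch below shows that general technique under stated assumptions; it is not a description of how Teragram implements de-duplication, and the function names and the shingle size of three words are choices made for this example.

```python
import re
from typing import Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of overlapping k-word sequences in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(doc_a: str, doc_b: str, k: int = 3) -> float:
    """Jaccard similarity of two documents' k-word shingle sets."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k))


draft = "Project appraisal document for rural roads"
final = "Project appraisal document for rural water"
print(similarity(draft, final))
```

Document pairs that score above a chosen threshold can then be linked as likely versions or as parts contained in other documents; the threshold itself would need tuning per collection.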

Key Outcomes & Lessons Learned

Key Outcomes

• Improve quality and consistency of metadata and reference sources

• Increase productivity of the metadata capture process

– Prior to using Teragram's technology, Bank staff categorized three electronic documents per hour. Teragram now drives 50,000 PDF pages per hour through its platform, dramatically improving the processing rate.

• Improve the availability & quantity of metadata

– Prior to Teragram, Bank editors manually uncovered four to five keywords per document. Today, the software identifies 70-300 keywords per document.

Lessons Learned

• Understand the characteristics of the information collections and the way a person decides to categorize the information

• Ensure that you can derive rules from the information and context

• Large initial investment in building the profiles

• Iterative process: use feedback to improve the profiles over time

• Link the initiative to Master Data/Reference Data Program

• Get buy-in from the Business Departments in the organization