Enterprise Information Integration and Semantic Technologies at the World Bank
Denisa Popescu, Enterprise Architecture
World Bank Group
Presentation Outline
• Bank’s Information Challenges
• SAS Teragram Technologies Overview
• How Teragram Works in the Bank
• Key Outcomes & Lessons Learned
Bank’s Information Challenges
World Bank Group
• World Bank Group is an international development organization providing loans, grants and knowledge and advisory services to developing countries for a wide array of purposes that include education, health, public administration, infrastructure, financial and private sector development, agriculture and environmental and natural resource management.
• Office of the Enterprise Architecture is part of the Central IT Department and is responsible for the Enterprise Architecture Framework, Enterprise Information and Technology Standards and Policies, and Shared Enterprise Information Platforms and Tools.
Bank’s Architectural & Information Challenges
• Numerous repositories that contain large amounts of information
• Most of our information is unstructured (.pdf, .doc, .txt, .ppt, .html)
• Rely on staff to “file” information and add metadata
• Lack of authoritative reference sources
As a result,
• Uneven capture of information and metadata across the Bank’s institutional repositories
• Similar information resides in multiple repositories
• Multiple representations for the same type of “information”
• Staff can’t find related information
Information in Bank’s Environment
[Diagram. Recoverable labels:]
• Structured Data: Operational/Transaction Data (Operations, Human Resources, Financial Mgmt, Loans, etc.)
• Metadata: Data Entities, Attributes, Relationships, Data definition (varchar(x), number, character, primary key, foreign key, etc.); Author, Title, Project ID, Country, Topic, Business Function, …
• Unstructured Information: Content (Documents, Web Pages, Email, Records, Multimedia, Books), for example a TOR (I. Purpose, II. Participants, III. Findings, IV. …), Conference Proceedings, Comments on PCN review meeting
• Attributes describing unstructured content: Title, Date, Speaker, File format, Topic, …
Metadata is the glue
• People create information
• People find information by searching or browsing repositories
• Knowledge Sharing
But the problem with metadata is that…
• Quality of metadata is uneven
• It takes too much effort for the user to put it in
• Each individual (creator or searcher) may have different perspectives on how to describe information
What are Semantic Technologies?
• Semantic technologies provide an abstraction layer above existing IT technologies that enables bridging and interconnection of information, people, and processes.
• In the World Bank, we are using Conceptual Information Models and SAS Teragram Technologies to create this layer.
Based on James Melzer’s EIA in Context, 2006.
[Diagram: Information Management Framework. Recoverable labels:]
• Business Process
• Managing Information: IA Policy & Guidance; Discover/Design Architecture; Build Structural Schemes
• Designing Structures
• Create, Capture & Catalog Information; Capture Metadata
• Organize, Manage & Publish Collections; Manage Workflow
• Search, Contextualize & Deliver to Audience
• Architecture; Structures
• Information Governance; Records Management (Retention, etc.); Conceptual Information Model; Information Access/Usage; Information Distribution/Movement
• Governance: Policies, Procedures & Standards
• Master Data (Models, Stewards, Data Harmonization)
• Enterprise Architecture: Business, Application, Technology, Information
A unified Enterprise Information Architecture
[Diagram. Recoverable labels:]
• Processes to Manage Information: Create, Capture & Catalog; Capture Metadata (Automated & Manual); Organize, Manage & Publish Collections; Search, Contextualize & Deliver to Audience; Provide Administrative Reports & Manage Workflow
• Structured: SAP, P/Soft, Other DBs
• Unstructured: Documents, Email, Multimedia, Etc.
• Records Management, Web, Document Management, Portal
• HQ Staff, CO Staff, Partners, Public
Corporate Data Model Framework: Shared Information Domains
CDM Data Domains
• Client
• Employee / Consultant
• Finance (Cost Centre – Fund Centre, Chart of Account)
• WB Organization
• Project
• Policy
• Vendor, Business Partner, Party
• Product & Service
• Identity
• Documents & Reports
• Theme, Sector
• Reference Data, Geographical, Country
For each Data Domain, Conceptual & Logical Data Models, Data Dictionary, and Data Standards are provided.
Core Metadata Standard for Unstructured Information
Metadata elements (legend on the slide: Core, Extension)
• Identity
• Author
• Owner
• Abstract
• Document Date
• Project ID
• Title
• Client Party
• Topics
• Business Function
• Keywords
• Language
• Project
• Resource Identifier
• Country
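One way to picture the core metadata standard is as a record attached to each document. The sketch below is illustrative only: the field names mirror a subset of the element labels on the slide, but the `CoreMetadata` class, the field types, the `missing_fields` helper, and the sample values are assumptions made for this example, not the Bank's actual schema.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class CoreMetadata:
    """Hypothetical record; element names follow the slide, types are assumed."""
    resource_identifier: str
    title: str
    author: Optional[str] = None
    owner: Optional[str] = None
    abstract: Optional[str] = None
    document_date: Optional[str] = None  # assumed to be an ISO 8601 string
    project_id: Optional[str] = None
    client_party: Optional[str] = None
    country: Optional[str] = None
    language: Optional[str] = None
    business_function: Optional[str] = None
    topics: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)


def missing_fields(record: CoreMetadata) -> List[str]:
    """Return the names of elements that are still empty (None or empty list)."""
    return [f.name for f in fields(record) if getattr(record, f.name) in (None, [])]


record = CoreMetadata(
    resource_identifier="doc-0001",  # made-up identifier
    title="Sustainable tourism and cultural heritage",
    topics=["Tourism"],  # made-up topic value
)
print(missing_fields(record))
```

A check like `missing_fields` shows where automatic capture could fill gaps that staff left empty.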
Automatic Metadata Capture using Teragram
• Automatic Metadata Capture to generate consistent values for core metadata across information collections
• For high-value information collections, automated metadata extraction strengthens the information quality control function (e.g. indexers)
Teragram Technologies Overview
What does SAS / Teragram do?
• SAS Teragram applies natural language processing (NLP) and advanced linguistic techniques to automatically extract relevant concepts and categorize large volumes of multilingual content.
– Rule-based Automatic Categorization
– Entity and Fact/Event Extraction
– Document Summarization
– Document De-duplication
– Noun Phrase extraction
– Clustering
– Language detection
– Tokenization, Stemming, Part-of-speech tagging, …
Why is the Bank using Teragram?
• Teragram will allow the Bank to standardize description of information across multiple systems and programmatically generate metadata:
– Standardization improves the consistency of metadata
– Automatic metadata capture saves time and resources
– Ability to process and describe huge amounts of information
– Will improve “findability” of information by providing data-driven browsing structures
Case in Point: How Teragram Works in the Bank
At the World Bank, we create so much information that no number of human eyes could ever effectively categorize all of it in a timely manner.
Fortunately, we can automatically process a great deal of it using Teragram, which scans documents, recognizes terms, and categorizes them for us. This is often more effective than letting a human being try to figure out what a document is about.
You might think it would be easy to tell what the document you’re reading is about. However, this software can tell us not only what you think it’s about, but what the Bank thinks it’s about.
Case in Point
For example, the Bank produces a working paper on “Sustainable tourism and cultural heritage”. This report provides an overview of the relation between cultural heritage preservation and tourism and presents strategies for promoting sustainability in the tourism industry associated with cultural heritage sites and natural environments.
Different staff might each file it differently:
• That report obviously belongs under “Eco-Tourism”.
• This concerns preservation of heritage sites.
• A lot of these projects need to be consistent with the country’s cultural policy.
• Since we don’t have a folder for “tourism industry”, I’ll just tag this “industry” for now.
• I see this mainly as a “sustainable development” project….
Automatic Metadata Capture: Documents & Reports Library
Example of Browsing Structures: Documents & Reports Library
Enterprise Topic Taxonomy
Automatic Metadata Capture: E-Library
Teragram-generated Topics, Keywords, Region
Example of Use of Thesaurus in Search
Teragram-generated (Thesaurus)
Automatic Metadata Extraction and Categorization
[Diagram. Recoverable labels: Raw Content; Apply Teragram Profiles; Group into Collections; Extract Metadata; Quality control; content and metadata; content delivery (Search, Syndication, Browse)]
Rule-based Categorization: Enterprise Topics
Rule-based Categorization: Business Functions
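The slides show rule-based categorization only as screenshots. As a rough idea of the general technique, here is a minimal sketch in which a topic is assigned when enough of its rule terms occur in the text. The `RULES` table, the topic names, the `min_hits` threshold, and the `categorize` function are all hypothetical. This is not Teragram's rule syntax or the Bank's taxonomy, and the sample sentence is adapted from the working-paper description earlier in the deck.

```python
import re
from typing import Dict, List, Set

# Hypothetical rules: each topic lists terms that count as evidence for it.
RULES: Dict[str, Set[str]] = {
    "Tourism": {"tourism", "tourist", "eco-tourism"},
    "Cultural Heritage": {"heritage", "preservation", "cultural"},
    "Education": {"school", "teacher", "curriculum"},
}


def categorize(text: str, rules: Dict[str, Set[str]] = RULES, min_hits: int = 2) -> List[str]:
    """Return topics for which at least `min_hits` distinct rule terms occur."""
    # Lowercase word tokens; hyphenated words such as "eco-tourism" stay whole.
    tokens = set(re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()))
    return sorted(topic for topic, terms in rules.items() if len(terms & tokens) >= min_hits)


sample = (
    "This report provides an overview of the relation between cultural "
    "heritage preservation and tourism, including eco-tourism strategies."
)
print(categorize(sample))
```

Because the rules are data rather than code, they can be refined over time without touching the matching logic.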
Grammar-based Concept Extraction: People Names
RegEx-based Concept Extraction: Project Identifier
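The pattern used on the slide is not preserved in this transcript. The sketch below shows how a regular expression can pull project identifiers out of free text, assuming an identifier is the letter "P" followed by six digits. That format, the `PROJECT_ID` pattern, the `extract_project_ids` helper, and the sample identifiers are assumptions made for illustration.

```python
import re
from typing import List

# Assumed format: the letter "P" followed by exactly six digits.
PROJECT_ID = re.compile(r"\bP\d{6}\b")


def extract_project_ids(text: str) -> List[str]:
    """Return distinct project identifiers in order of first appearance."""
    seen: List[str] = []
    for match in PROJECT_ID.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


# Made-up identifiers for demonstration only.
text = "Comments on PCN review for P123456; see also P654321 and P123456 (draft)."
print(extract_project_ids(text))
```

The word boundaries (`\b`) keep the pattern from matching inside longer tokens such as a seven-digit number.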
Summarizer: Project Identification Document
Other uses
• Find and link similar documents (e.g. versions, parts contained in other documents)
• Extract Institutions Referenced in the document
• Extract People Referenced in the document
• Extract Document date
• Extract Location (country, region, city)
• Extract Title of document
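For the first item in the list above, finding and linking similar documents, one common general technique is to compare overlapping word windows ("shingles") with Jaccard similarity. The sketch below illustrates that technique only. The slides do not say how Teragram detects similar documents, and the function names, shingle size, and sample texts are assumptions.

```python
import re
from typing import Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of k-word shingles (overlapping word windows) in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two texts' shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0  # nothing to compare
    return len(sa & sb) / len(sa | sb)


v1 = "sustainable tourism and cultural heritage in developing countries"
v2 = "sustainable tourism and cultural heritage in client countries"
print(jaccard(v1, v2))
```

Pairs scoring above a chosen threshold could be flagged as likely versions of one another and linked for review.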
Key Outcomes & Lessons Learned
Key Outcomes
• Improve quality and consistency of metadata and reference sources
• Increase productivity of the metadata capture process
– Prior to using Teragram's technology, Bank staff categorized three electronic documents per hour. Teragram now drives 50,000 PDF pages per hour through its platform, dramatically improving the processing rate.
• Improve the availability & quantity of metadata
– Prior to Teragram, Bank editors manually uncovered four to five keywords per document. Today, the software identifies 70-300 keywords per document.
Lessons Learned
• Understand the characteristics of the information collections and the way a person decides to categorize the information
• Ensure that you can derive rules from the information and context
• Large initial investment in building the profiles
• Iterative process: use feedback to improve the profiles over time
• Link the initiative to the Master Data/Reference Data Program
• Get buy-in from the Business Departments in the organization