TRANSCRIPT
The Future of Metadata
Denise Bedford
World Bank
Presentation to Fall Metadata Forum
November 2, 2005
Department of Homeland Security
Meta-Future
Most of our information use and access today is based on an anonymous access model
It is increasingly clear that anonymous access to information and the packaging of information for single use contexts is neither sufficient for users nor an efficient use of development/engineering resources
We need to think in terms of contextualization and sensitization of information so that it can be used in any context where it pertains
In the future, information will flow – information, not the systems in which it lives or was created, will be our focus
Information needs to be agile and mobile – it needs to be sensitized to the contexts in which it might be used, to the interests of those who might use it, and to the applications that might consume it
Meta-Future
Envision a future like that described in the Netcentric Information Models formulated by the Dept. of Defense
Information is created, tagged, posted and shared
Any applications or users can – according to security privileges – use any information they can find, in any application they need to use to do their work
Technology becomes increasingly invisible but more logic based
More and different kinds of information such as reference sources need to be managed and maintained
This meta-future is heavily dependent upon the existence of rich, conceptual, sensitized, meaningful metadata
This future is now – it is simply a practical view of the Semantic Web
The problem with metadata
This future sounds wonderful and the contextualization vision is exciting but there’s just one problem… metadata
Metadata…
– Is expensive and time consuming to create
– Is sometimes subjective and not granular enough
– Doesn’t always address the ways that users and systems think about the information it describes
– May not tell us enough about the information to trust it
– May address only one context – the context for which it is created
– May live in the source application where it was created
– May not be as accessible as the information asset
If a Meta-Future depends on metadata, we have to solve these problems
The problem with technologies
Many of the tools are so tightly integrated, you might generate rich metadata, but it will not make your information agile or mobile
Statistical clustering engines do not get us to persistent meaning or contextualization. Clustering engines are great for thresholding or pattern tracings, but they will not generate the kind of metadata we need to realize this future
We need semantic engines at the base of all our metadata efforts, and these engines need to be available in multiple languages -- semantics vary by language
Magic black box approaches are neither meaningful nor sustainable -- you need to have access to the programs through a user-friendly interface so you can adapt them to your environment without having to have programming knowledge
You need to have several different kinds of technologies to do what I’m going to describe today – not just one tool
Understanding the Dimensions of Contextualization
[Figure: diagram with three axes (Content Dimension, User Dimension, Context Dimension) and three zones: Anonymous Access (Context Free); Information Gathering & Transformation (Context Sensitive – Person); Information Diffusion (Context Sensitive – Group). Labels on the diagram include the Topic, Business Activity, Country and Region Schemes, Topic Thesaurus, Content Metadata, Metadata Management, Individual, Social Group, Institutional, Client and Partner Profiles, Communities of Practice, Browsing, Parametric Searching, Text Classification, Concept Extraction, Results Clustering, Personal and Social Group SDI, Concept, Threshold, Collaborative and Social Filtering, Recommender Engines, Syndication Engines, Q&A Systems, Translation Systems, Authorization and Authentication Rules, Workflow Management, and Online Training.]
Vision of Contextualization
We need to address metadata challenges not in a traditional way but in the future context – with the idea that metadata is contextualizable and sensitized – to support information agility and mobility
In order to achieve contextualization you need to have ‘extreme metadata’
– Metadata about the information
– Metadata about the user
– Metadata about the context
– Rich metadata designed to meet many functional requirements
– Metadata in multiple languages
Metadata needs to be ‘interpretable’ for and in a context
– Reference sources not only for traditional metadata but for all of the relationships and logic that are present in an ontology (simply different kinds of taxonomy representations)
– Metadata must reflect any context or interest that a user might express
– Still need to have some control over metadata in order to make it understandable in different contexts
New View of Ontology
[Figure: entity-relationship diagram. Entities: Content Entity, Content Elements, Content Parts, Metadata, User, Profile, Context, Contextual Matrix & Sensing, Contextual Logic, Business Rule, Rule Logic. Reference sources: Topic Class Scheme, Business Process Scheme, Thesaurus, Country Names, Region Names, Skill Sets/Competencies, Standard Statistical Variables, People Referenced, Orgs Referenced. Taxonomy forms: Hierarchy, Flat Taxonomy, Network Taxonomy, Faceted Taxonomy, Ring Taxonomy. Relationship labels: Has, Has values, Contains, uses, Has relationship to, Has Meaning in.]
Getting to Rich Metadata
Given the future demand for rich, contextualizable metadata, and all of the traditional drawbacks… how will we achieve this future?
We need to look for a different model for creating and sustaining metadata and reference sources
We need to teach technologies how to capture the metadata we need and how to maintain our reference sources
I’d like to show you an example of how we might achieve that future
Please keep in mind that I’m showing you an example of what is possible – Enterprise Search, Authority Control/Entity Discovery
Fueling Semantic Search With Metadata
Or… if Metadata is Dead, Semantic Web and Semantic Search Are Dead
Building and Maintaining Taxonomies
Moving towards automated metadata generation means that catalogers shift their effort to reviewing the metadata generated and to more fully developing and maintaining subject headings/thesauri and classification schemes as part of a suite of categorization tools
Level of effort shifts to training and developing the tools and away from original cataloging and metadata capture
Continue to work closely with subject experts to define the controlled vocabularies and classification schemes
It means that you have to have a metadata infrastructure that looks something like that ontology we just reviewed
There is no silver bullet ontology tool out there that will do this work for you – your knowledge and skills are critical
Metadata Capture Methods
Identification/Distinction: Agent, Title, Date, Format, Publisher, Language, Version, Series & Series #, Content Type
Search & Browse: Country, Region, Abstract/Summary, Keywords, Subject-Sector-Theme-Topic, Business Function, Aggregation Level, Relation
Use Management: Authorized By, Rights Management, Access Rights, Location, Use History
Compliant Document Management: Record Identifier, Disposal Status, Disposal Review Date, Management History, Retention Schedule/Mandate, Preservation History
Capture methods: Human Capture, Programmatic Capture, Inherit from System Context, Extrapolate from Business Rules
Smart Use of Technologies
Sample structure – Bank Topics Classification Scheme (hierarchical taxonomy)
– Oracle data classes used to represent Topic Classification scheme
  – hierarchical taxonomy as reference source for the attribute – Topic
  – used for Browse, Search, Content Syndication, Personalization
– 1st challenge is to architect the hierarchy correctly
  – 3 distinct data classes, not a tree structure with inheritance
  – Allows you to use the three data classes for distinct functions across systems but still enforce relationships across the classes
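As a rough illustration of "3 distinct data classes" with relationships enforced across them, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for Oracle. The slide does not name the three classes, so the table names (topic, subtopic, concept), the columns, and the sample values are all assumptions made for illustration, not the Bank's schema.

```python
import sqlite3

def build_topic_store():
    """Create three separate tables (hypothetical names) linked by foreign keys."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE topic (
            topic_id INTEGER PRIMARY KEY,
            name     TEXT NOT NULL UNIQUE
        );
        CREATE TABLE subtopic (
            subtopic_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            topic_id    INTEGER NOT NULL REFERENCES topic(topic_id)
        );
        CREATE TABLE concept (
            concept_id  INTEGER PRIMARY KEY,
            term        TEXT NOT NULL,
            subtopic_id INTEGER NOT NULL REFERENCES subtopic(subtopic_id)
        );
    """)
    return conn

conn = build_topic_store()
conn.execute("INSERT INTO topic VALUES (1, 'Education')")
conn.execute("INSERT INTO subtopic VALUES (10, 'Primary Education', 1)")
conn.execute("INSERT INTO concept VALUES (100, 'school enrollment', 10)")

# Each class can be used on its own, for example as a flat browse list.
topics = [row[0] for row in conn.execute("SELECT name FROM topic")]

# The relationship across classes is still enforced: a subtopic that
# points at a topic that does not exist is rejected.
try:
    conn.execute("INSERT INTO subtopic VALUES (11, 'Orphan', 999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Keeping the classes in separate tables lets one system read only the topic list for browsing while another reads concepts for indexing, and the foreign keys keep the levels consistent.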
Categorizing and Indexing Content
Let’s look at how we’re categorizing our content to this structure automatically
Topic classification, geographical region assignment, keywording examples
Can apply this approach to any kind of content
Enables us to build a robust metadata repository model, with strong metadata quality, to move towards SI at the functional level
Also note that we can do this across many languages
Semantic Analysis
Using The Technologies to Best Advantage
Semantic analysis tools which support concept extraction, categorization, summarization and pattern matching rules engines
Teragram works in 23 languages
Use categorization to capture Topics, Business Activities, Regions, Sectors, Themes, etc.
Use Concept Extraction to capture keywords
Use Rules Engine to capture Loan #, Credit #, Project ID, Trust Fund #, etc.
Use Summarization to generate a ‘gist’ of the content
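The rule-based capture of identifiers can be pictured as a set of pattern-matching rules. The sketch below uses Python regular expressions rather than Teragram, and the identifier formats (a "P" followed by six digits, four- or five-digit loan and credit numbers) are assumptions chosen only for illustration.

```python
import re

# Hypothetical identifier formats; real rules would come from the business owners.
RULES = {
    "project_id": re.compile(r"\bP\d{6}\b"),
    "loan_number": re.compile(r"\bLoan\s+(?:No\.?|#)\s*(\d{4,5})\b", re.IGNORECASE),
    "credit_number": re.compile(r"\bCredit\s+(?:No\.?|#)\s*(\d{4,5})\b", re.IGNORECASE),
}

def capture_identifiers(text):
    """Return {attribute: [values]} for every rule that matches the text."""
    found = {}
    for attribute, pattern in RULES.items():
        values = []
        for match in pattern.finditer(text):
            # Use the captured group when the rule defines one, else the whole match.
            value = match.group(1) if pattern.groups else match.group(0)
            if value not in values:
                values.append(value)
        if values:
            found[attribute] = values
    return found

sample = "Project P123456 is financed under Loan No. 4567 and Credit # 3210."
captured = capture_identifiers(sample)
```

Because each rule is just data, new identifier types could be added without touching the capture function.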
Semantic Analysis Basics
Once you have made some sense of the sentence (decompose), reconstruct entities for information extraction (compose)
– Identify names and other fixed form expressions – people, organizations, actions, relationships, places
– Identify basic noun groups, verb groups, formatting elements, logic statements
– Construct complex noun groups and verb groups
– Identify event structures
– Identify common elements and associate
Leveraging the Topic Structure
Each subtopic is a knowledge domain (hierarchical taxonomy)
Each subtopic has an extensive concept level definition (1,000 – 5,000+ concepts)
Concepts are controlled vocabularies in their raw form (flat taxonomy)
Concepts with relationships (extensive per new Z39.19 standard) comprise semantic network (network taxonomy)
Categorization tools work with topic structure & concept definitions to categorize and index content
The following screen illustrates how that same structure is embedded into Teragram profile to support categorization
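To make the link between concept definitions and categorization concrete, here is a minimal sketch that scores a document against each subtopic by counting matches from its concept list. This is a simple term-count stand-in, not Teragram's method; the subtopics, concept terms, and the `min_hits` threshold are hypothetical.

```python
import re
from collections import Counter

# Hypothetical subtopics with tiny concept lists; the slide describes
# real definitions of 1,000 to 5,000+ concepts per subtopic.
CONCEPTS = {
    "Primary Education": ["school enrollment", "literacy", "teacher training"],
    "Water Supply": ["drinking water", "sanitation", "water utility"],
}

def categorize(text, concepts=CONCEPTS, min_hits=2):
    """Return (subtopic, hit_count) pairs with at least min_hits concept matches."""
    lowered = text.lower()
    scores = Counter()
    for subtopic, terms in concepts.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            scores[subtopic] += len(re.findall(pattern, lowered))
    return [(name, hits) for name, hits in scores.most_common() if hits >= min_hits]

doc = ("Teacher training and literacy programs raised school enrollment, "
       "while sanitation lagged.")
result = categorize(doc)
```

The threshold keeps a single passing mention (here, "sanitation") from assigning a subtopic.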
Example of use of Authority Control to capture country names but extract ‘authorized’ version of country name
Example of use of a gazetteer + concept extraction + rules engine to support semantic interoperability
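The authority control idea in these captions, capturing whatever form of a country name appears but recording only the authorized form, can be sketched as a lookup against an authority file. The variants and authorized names below are a small hypothetical sample, not the Bank's gazetteer.

```python
import re

# Hypothetical authority file: variant forms mapped to one authorized name.
AUTHORITY = {
    "laos": "Lao PDR",
    "lao pdr": "Lao PDR",
    "burma": "Myanmar",
    "myanmar": "Myanmar",
    "russia": "Russian Federation",
    "russian federation": "Russian Federation",
}

def authorized_countries(text, authority=AUTHORITY):
    """Find any known variant in the text and return the authorized names, sorted."""
    lowered = text.lower()
    found = set()
    for variant, authorized in authority.items():
        if re.search(r"\b" + re.escape(variant) + r"\b", lowered):
            found.add(authorized)
    return sorted(found)

countries = authorized_countries("Flooding affected Laos, Burma and Russia.")
```

Storing only the authorized form means two systems that index the same document agree on the country value, whichever variant the author used.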
Overview of Process & Tools
Activity: Create new facet
– Approach: Human review & consultation, data structures, governance
– Tools: Oracle DBMS, in future Metadata Repository tools (ISO 11179); Oracle representation of data classes
Activity: Create new class
– Approach: Human review & harmonization of existing information structures; tool based discovery of new structures through clustering & extraction
– Tools: Teragram dynamic concept extraction using grammars, categorization, clustering; Oracle representation of data classes
Activity: Create new concept
– Approach: Create training sets working with experts, identify & integrate existing vocabularies
– Tools: Teragram concept extraction, Oracle representation of values
Activity: Create new relationship
– Approach: Human relationship creation, augmented by technological discovery
– Tools: Teragram clustering engine, MultiTes Thesaurus Management System, Oracle copy of thesaurus relationships
Activity: Create new metadata
– Approach: Enterprise Profile Development with human review in some cases, no review in others; Metadata in the language of the document/content
– Tools: Teragram enterprise profile leveraging concept extraction, categorization, and summarization
Enterprise Profile Creation and Maintenance
[Figure. Labeled elements: Enterprise Profile Development & Maintenance; Teragram Team; e-CDS Reference Sources for Country, Region, Topics, Business Function, Keywords, Project ID, People, Organization; Data Governance Process for Topics, Business Function, Country, Region, Keywords, People, Organizations, Project ID; systems named: TK240 Client, ISP, IRIS, ImageBank, Factiva, JOLIS, E-Journals.]
Enterprise Metadata Profile
– Concept Extraction Technology: Country, Organization Name, People Name, Series Name/Collection Title, Author/Creator, Title, Publisher, Standard Statistical Variable, Version/Edition
– Categorization Technology: Topic Categorization, Business Function Categorization, Region Categorization, Sector Categorization, Theme Categorization
– Rule-Based Capture: Project ID, Trust Fund #, Loan #, Credit #, Series #, Publication Date, Language
– Summarization
Enterprise Metadata Capture – Functional Reference Model
[Figure. Labeled elements: Enterprise Metadata Capture Strategy; Dedicated Server – Teragram Semantic Engine (Concept Extraction, Categorization, Clustering, Rule Based Engine, Language Detection); Enterprise Profile Development & Maintenance; e-CDS Reference Sources; Content Capture; XML Wrapped Metadata; XML Output; APIs & Integration; APIs & Technical Integration; ImageBank Integration; ISP Integration; IRIS Integration; Factiva Metadata Database; TK240 Client; UCM Service Requests; Update & Change Requests. Roles: Content Owners, Business Analyst, IDU Indexers, SITRC Librarians, IRIS Functional Team.]
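The "XML Wrapped Metadata" label in this model suggests captured values are handed to consuming systems as XML. A minimal sketch of such a wrapper follows, using Python's standard xml.etree.ElementTree; the element names, the attribute names, and the record identifier are hypothetical, since the slides do not show the actual output schema.

```python
import xml.etree.ElementTree as ET

def wrap_metadata(record_id, metadata):
    """Serialize {attribute: [values]} as a small XML record (hypothetical schema)."""
    root = ET.Element("record", {"id": record_id})
    for attribute, values in metadata.items():
        for value in values:
            # One element per value so multi-valued attributes stay explicit.
            child = ET.SubElement(root, "attribute", {"name": attribute})
            child.text = value
    return ET.tostring(root, encoding="unicode")

xml_text = wrap_metadata(
    "doc-001",
    {"topic": ["Primary Education"], "country": ["Lao PDR", "Myanmar"]},
)

# A consuming system can read the values back without knowing how they were captured.
parsed = ET.fromstring(xml_text)
country_values = [e.text for e in parsed.findall("attribute[@name='country']")]
```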
Impacts & Outcomes
Information Access impacts
– Increased precision of search
– Better control over recall
– Searching like we talk
– Exact match searching – known item searching will work better
– Metadata based searching now begins to resemble full-text searching but with all the advantages of structure & context, and a significant reduction in the amount of noise
Productivity Improvements
– Can now assign deep metadata to all kinds of content
– Remove the human review aspect from the metadata capture
– Reduce unit times where human review is still used
Information Quality impacts
– All metadata carries the information architecture with it
– Apply quality metrics at the metadata level to eliminate need to build ‘fuzzy search architectures’ – these rarely scale or improve in performance
– Use the technologies to identify and fix problems with our data
In Progress Impacts
Same methodology can be leveraged to develop a structure of lines of business, entities prominent in particular domains, relationships among entities in a domain, standard statistical variables, etc.
The richer the metadata and the more fully elaborated the reference structures, the closer we come to understanding at a system level what is happening in a particular domain at any point in time
It is this overall structure which can then be leveraged in other contexts, perhaps even a counter-terrorism context, to threshold events
Without metadata, though, no information asset can be secured and still have its importance known
Without metadata, no information is agile or mobile