The Evolving Semantic World
Barbara McGlamery
Taxonomist, Martha Stewart Living Omnimedia
ABOUT ME
Master's in Library and Information Science: Long Island University
New York Public Library: Branch librarian; NYPL for the Performing Arts – Drama reference
Entertainment Weekly: Data Manager
Time Inc.: Senior Data Manager, Taxonomist, Metadata Architect, Ontologist
Martha Stewart Living Omnimedia: Taxonomist
AGENDA
What is the Semantic Web? Big “S” and little “s” semantics
What we used to believe: Time Inc. & the theory of overkill
What we know now: Martha Stewart and the theory that less is more
Where we’re going: Leaner and meaner (but more standards)
The Semantic Web is a web of data…. (it) provides a
common framework that allows data to be shared and
reused across applications, enterprise, and community
boundaries.
--W3C
“The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”
--Tim Berners-Lee, James Hendler, and Ora Lassila, Scientific American, 2001
BIG S SEMANTIC WEB
…big "S" web technologies provide a
framework for describing data on a web page when
the data on the website is published. If data is read
or captured, because the data's semantic meaning
has already been described, you don't have to go
through the process of understanding the meaning
of the data after the fact.
--Sean Martin, CEO of Cambridge Semantics
LITTLE S SEMANTICS
Little "s" web technologies capture and filter data with no description or understanding of the data provided after the capture process. The process of understanding the meaning of that data starts once data capture has happened. People have to intervene to provide the context and meaning for language on the web.
--Sean Martin, CEO of Cambridge Semantics
ESSENTIALS OF BIG S SEMANTIC WEB
URI – Uniform Resource Identifier
RDF – Resource Description Framework
OWL – Web Ontology Language
Semantic reasoner (inference engine)
URI – UNIFORM RESOURCE IDENTIFIER
Way to identify things: images, pages of text, locations
De-referenceable
Freebase: http://www.freebase.com/view/en/will_smith
• URIs are unique; no two are the same
• Will Smith: http://www.freebase.com/view/en/will_smith
RDF – RESOURCE DESCRIPTION FRAMEWORK
Framework used to describe relationships between objects
Extends and formalizes XML
Subject>Predicate>Object
RDF – RESOURCE DESCRIPTION FRAMEWORK
Subject>Predicate>Object
http://ew.com/PersonsTax/Will_Smith
http://ew.com/EntertainmentOnt/leadPerformanceIn
http://ew.com/EntertainmentTax/Movies/Bad_Boys
Will Smith >>> is the lead actor in >>> Bad Boys
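As a minimal sketch, the statement above can be modelled as a three-part tuple of URIs and written out in N-Triples syntax. The three URIs are the ones on the slide; the `Triple` type and the `to_ntriples` helper are hypothetical names used for illustration and are not taken from Topics or from any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject > predicate > object (all URIs here)."""
    subject: str
    predicate: str
    object: str


def to_ntriples(triple: Triple) -> str:
    """Serialize a URI-only triple as a single N-Triples line."""
    return " ".join(f"<{term}>" for term in triple) + " ."


# URIs copied from the slide.
statement = Triple(
    subject="http://ew.com/PersonsTax/Will_Smith",
    predicate="http://ew.com/EntertainmentOnt/leadPerformanceIn",
    object="http://ew.com/EntertainmentTax/Movies/Bad_Boys",
)

print(to_ntriples(statement))
```

Because every part of the statement is a URI, two systems that agree on the URIs agree on what the statement means, without parsing the display names.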
OWL – WEB ONTOLOGY LANGUAGE
…designed to be used by applications that need to process the content of information instead of just presenting it to humans
-- W3C
OWL – WEB ONTOLOGY LANGUAGE
Metadata model
Extends RDF to further define properties
Ex: Equivalent relationships
[Diagram: two "is married to" relationships]
SEMANTIC REASONER
Software able to infer logical consequences from a set of asserted facts
Follows inference rules specified by OWL properties
Inverse
Transitive
Symmetric
Functional/Inverse functional
Equivalent
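To make the idea of inference concrete, here is a rough sketch of a forward-chaining loop that handles three of the property types listed above (symmetric, inverse, transitive). It is not how Jena or any OWL reasoner is implemented. The short names stand in for full URIs, and `hasLeadPerformer`, `marriedTo`, `subCategoryOf`, `Person_A`, `Person_B`, and the category names are assumptions made up for the example.

```python
def infer(facts, symmetric=(), transitive=(), inverse=None):
    """Repeatedly apply the property rules until no new triples appear."""
    inverse = inverse or {}
    known = set(facts)
    while True:
        new = set()
        for s, p, o in known:
            if p in symmetric:            # A p B  =>  B p A
                new.add((o, p, s))
            if p in inverse:              # A p B  =>  B inverse(p) A
                new.add((o, inverse[p], s))
            if p in transitive:           # A p B, B p C  =>  A p C
                for s2, p2, o2 in known:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))
        if new <= known:                  # nothing new: fixed point reached
            return known
        known |= new


asserted = [
    ("Will_Smith", "leadPerformanceIn", "Bad_Boys"),
    ("Person_A", "marriedTo", "Person_B"),
    ("Action_Movies", "subCategoryOf", "Movies"),
    ("Movies", "subCategoryOf", "Entertainment"),
]

inferred = infer(
    asserted,
    symmetric={"marriedTo"},
    transitive={"subCategoryOf"},
    inverse={"leadPerformanceIn": "hasLeadPerformer"},
)

for triple in sorted(inferred - set(asserted)):
    print(triple)
```

Only four facts are asserted by hand; the other three (the reverse marriage, the reverse performance link, and the skipped category level) are derived from the property declarations.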
PUTTING IT ALL TOGETHER
Ontology: rule set
Classes and properties
Taxonomy: application of rule set
Tags and relationships
Everything is a statement: Subject>Predicate>Object
Ex: Will Smith is lead performer in Bad Boys
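One way to picture the split between ontology and taxonomy is the sketch below: the ontology says which classes a property may connect, and the taxonomy assigns real terms to those classes. The class names `Person` and `Movie`, the dictionary layout, and the `is_valid` helper are all assumptions for illustration.

```python
# Ontology: the rule set (which classes a property may connect).
ONTOLOGY = {
    "leadPerformanceIn": {"domain": "Person", "range": "Movie"},
}

# Taxonomy: the rule set applied to real terms (each tag and its class).
TAXONOMY = {
    "Will_Smith": "Person",
    "Bad_Boys": "Movie",
}


def is_valid(subject, predicate, obj):
    """Check a subject > predicate > object statement against the ontology."""
    rule = ONTOLOGY.get(predicate)
    if rule is None:
        return False  # unknown property
    return (TAXONOMY.get(subject) == rule["domain"]
            and TAXONOMY.get(obj) == rule["range"])


print(is_valid("Will_Smith", "leadPerformanceIn", "Bad_Boys"))
print(is_valid("Bad_Boys", "leadPerformanceIn", "Will_Smith"))
```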
BENEFITS OF RDF/OWL
Persistent URIs
Verifiable XML
Unambiguous Relationships
Polyhierarchy
Interoperability
LIMITATIONS OF RDF/OWL
Difficult to propagate across web
Challenge to integrate with legacy systems
Expensive queries
No “Killer App”
RDFa - Resource Description Framework (in) Attributes
W3C recommendation that adds a set of attribute-level extensions to XHTML for embedding rich metadata within Web documents
Easy to implement
Not HTML 5 compliant
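The attribute-level idea can be sketched with a small XHTML fragment and a few lines of Python that read the `about` and `property` attributes back out as statements. This is a simplified illustration and not a conforming RDFa processor: it ignores prefix resolution, `rel`, `resource`, `typeof`, and nesting rules. The example.com URL and the recipe text are made up; `dc:title` and `dc:creator` are Dublin Core elements.

```python
from html.parser import HTMLParser

XHTML = """
<div xmlns:dc="http://purl.org/dc/elements/1.1/" about="http://example.com/recipes/apple-pie">
  <h1 property="dc:title">Apple Pie</h1>
  <span property="dc:creator">Test Kitchen</span>
</div>
"""


class RDFaSketchParser(HTMLParser):
    """Collect (subject, property, text) from `about` and `property` attributes."""

    def __init__(self):
        super().__init__()
        self.subject = None   # value of the most recent `about` attribute
        self.pending = None   # property whose text content we are waiting for
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "about" in attrs:
            self.subject = attrs["about"]
        if "property" in attrs:
            self.pending = attrs["property"]

    def handle_data(self, data):
        text = data.strip()
        if self.pending and text:
            self.triples.append((self.subject, self.pending, text))
            self.pending = None


parser = RDFaSketchParser()
parser.feed(XHTML)
for triple in parser.triples:
    print(triple)
```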
LINKED OPEN DATA 2007
“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
LINKED OPEN DATA 2010
“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
MICROFORMATS
Semantic markup which seeks to re-use existing HTML/XHTML class attributes to structure data
Easy to implement
Limited formats
MICRODATA
A WHATWG HTML5 specification used to nest semantics within existing content on web pages
Officially supported by Bing, Yahoo, & Google
Can embed other markup languages like RDFa, microformats, and Dublin Core
Not well-known (yet)
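A rough sketch of what nesting semantics within existing content looks like: the function below wraps a recipe name and its ingredients in microdata attributes (`itemscope`, `itemtype`, `itemprop`). The `recipe_microdata` helper and the sample recipe are hypothetical, and the property names `name` and `recipeIngredient` follow the current schema.org Recipe type, which may differ from what was available when this talk was given.

```python
from html import escape


def recipe_microdata(name, ingredients):
    """Return an HTML fragment marking up a recipe with microdata attributes."""
    items = "\n".join(
        f'    <li itemprop="recipeIngredient">{escape(item)}</li>'
        for item in ingredients
    )
    return (
        '<div itemscope itemtype="https://schema.org/Recipe">\n'
        f'  <h1 itemprop="name">{escape(name)}</h1>\n'
        "  <ul>\n"
        f"{items}\n"
        "  </ul>\n"
        "</div>"
    )


markup = recipe_microdata("Mac & Cheese", ["elbow macaroni", "cheddar"])
print(markup)
```

The visible page content is unchanged; only attributes are added, which is why this approach is cheap to retrofit onto existing templates.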
OPEN GRAPH PROTOCOL
Facebook-created markup language that turns any web page into an Open Graph object, allowing any page to become a Facebook page
I “Like” you
Good for targeted advertising
Limited in scope
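In practice, Open Graph markup is a handful of `<meta>` tags in the page head. The sketch below reads them back with Python's standard HTML parser. The page content and example.com URLs are invented for illustration; `og:title`, `og:type`, `og:url`, and `og:image` are the basic properties defined by the protocol.

```python
from html.parser import HTMLParser

PAGE = """
<html><head>
  <title>Apple Pie</title>
  <meta property="og:title" content="Apple Pie" />
  <meta property="og:type" content="article" />
  <meta property="og:url" content="http://example.com/recipes/apple-pie" />
  <meta property="og:image" content="http://example.com/images/apple-pie.jpg" />
  <meta name="description" content="Not an Open Graph tag" />
</head><body></body></html>
"""


class OpenGraphParser(HTMLParser):
    """Collect og:* properties from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("property") or ""
        if name.startswith("og:") and "content" in attrs:
            self.properties[name] = attrs["content"]


og = OpenGraphParser()
og.feed(PAGE)
for key, value in og.properties.items():
    print(key, "=", value)
```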
BACK-OF-THE-NAPKIN COMPARISON
Features: RDF/OWL, RDFa, MF (microformats), MD (microdata), OGP (Open Graph Protocol)
W3C standard: X X X
Extensible: X X X
Pre-existing vocabs: X X
Uses URIs: X X
Easy to implement: X X X X
HTML 5 compliant: X X X
Inferencing: X
STATUS REPORT ON S SEMANTIC WEB
Linked Open Data graph growing
Many countries have developed government sites with rich semantics
Development of Semantic search
More widespread adoption of lighter semantics
WHERE WE MIGHT BE GOING
Pharmaceutical industry identifies trends across clinical studies, and not just within them
News industry better targets content by locale
Department of Defense using it to make better decisions in the field
Utilized in advertising to drive more and more revenue
TIME INC
Largest magazine media company in U.S.
48 websites worldwide
Websites attract more than 50M unique visitors each month
Domains include lifestyle, entertainment, style, news, sports, and business
Early adopter (2005-2006) of SW technologies
GOALS
Enhance data integrity
Improve editorial efficiency
Create contextual presentation of content
Develop relationships that cannot be derived from content
Share resources among titles
Improve search and facilitate guided navigation
CHALLENGES
Aging CMS with sites on different versions
Many different domains
Scalability to accommodate volume of data and development of complex relationships
Lack of resources, money, and time
WHY WE NEED CONTROLLED VOCABULARIES (OR WHY FREEFORM KEYWORDS JUST DON’T WORK)
Star Wars: Episode I -- The Phantom Menace
Episode 1
Episode I
Phantom Menace
Star Wars Episode I The Phantom Menace
Star Wars Episode I: The Phantom Menace
Star Wars prequel
Star Wars: Episode 1 -- The Phantom Menace
Star Wars: Episode i -- the Phantom Menace
Star Wars: Episode I: The Phantom Menace
Star Wars: Episode I--The Phantom Menace
Star Wars: Episode I--The Phantom Menance
Star Wars: Episode One -- The Phantom Menace
Star Wars: The Phantom Menace
Star Wars: The Phantom Menace -- Episode I
The Phantom Menace
The Phanton Menace
Star Wars: Episode I -- The Phantom Menace
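A minimal sketch of how a controlled vocabulary resolves this mess: every known variant, including the misspellings, points at one preferred term. The variants are a subset of the freeform keywords on the slide; the `normalize` and `to_controlled_term` helpers and the lookup-table design are assumptions for illustration, not a description of how Topics works.

```python
import re

PREFERRED = "Star Wars: Episode I -- The Phantom Menace"

# A subset of the freeform keywords from the slide, typos included.
VARIANTS = [
    "Episode 1",
    "Phantom Menace",
    "Star Wars Episode I The Phantom Menace",
    "Star Wars prequel",
    "Star Wars: Episode I--The Phantom Menance",
    "Star Wars: Episode One -- The Phantom Menace",
    "Star Wars: The Phantom Menace -- Episode I",
    "The Phanton Menace",
]


def normalize(keyword):
    """Lowercase and collapse punctuation and whitespace runs to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", keyword.casefold()).strip()


# Each normalized variant (and the preferred term itself) maps to one term.
LOOKUP = {normalize(v): PREFERRED for v in VARIANTS + [PREFERRED]}


def to_controlled_term(keyword):
    """Return the preferred term for a keyword, or None if it is unknown."""
    return LOOKUP.get(normalize(keyword))


print(to_controlled_term("star wars episode i: the phantom menace"))
print(to_controlled_term("The Phanton Menace"))
print(to_controlled_term("Attack of the Clones"))
```

Normalization absorbs differences in case and punctuation, but misspellings such as "Menance" still have to be listed explicitly, which is part of why the vocabulary needs a librarian to maintain it.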
WHAT STANDARD TO ADOPT?
RDF: flexible; scalable; fits business needs; new technology but industry standard
Microformats: easy to implement; no inferencing; solved some business needs but not all; no standards; limited formats
SEARCH FOR VENDORS
In 2005, few commercial RDF/OWL tools were available that fit our needs
Open source reasoners like Jena and a proprietary design seemed more cost-effective and realistic
WHAT IS TOPICS?
Librarian Tool – allows librarians to create resources and properties
Relationship Tool - generates unambiguous connections between data
Classification Tool - allows editors to add uniform, structured metadata to content
Semantic reasoner - finds new facts from existing data
Query Engine - manages logical retrieval of data
TECHNICAL DETAILS OF SYSTEM
Java application
Jena semantic reasoner
Joseki query engine
Sybase database
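To give a feel for what a query over stored triples does, here is a small pattern-matching sketch in which `None` acts as a wildcard. It is not SPARQL and does not reflect Joseki's actual interface. The `isAbout` property, the article identifiers, and the `match` helper are hypothetical.

```python
TRIPLES = [
    ("Will_Smith", "leadPerformanceIn", "Bad_Boys"),
    ("article_101", "isAbout", "Will_Smith"),
    ("article_102", "isAbout", "Bad_Boys"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Aggregated content: everything tagged as being about Will Smith.
aggregated = match(TRIPLES, predicate="isAbout", obj="Will_Smith")

# Related content: articles about movies in which Will Smith was the lead.
movies = [o for _, _, o in match(TRIPLES, "Will_Smith", "leadPerformanceIn")]
related = [
    s for movie in movies for s, _, _ in match(TRIPLES, predicate="isAbout", obj=movie)
]

print(aggregated)
print(related)
```

The second query needs two lookups joined together, which hints at why queries over a very large triple store become expensive.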
ENTERTAINMENT WEEKLY
Aggregated content
Related content
Improved search
Sharing of resources among titles
Features
PEOPLE
Aggregated content
Related content
Improved search
Sharing of resources among titles
Features
THIS OLD HOUSE
Aggregated content
Navigational taxonomy
Improved search
Related content
Faceted browse
Features
STRENGTHS OF TOPICS
Utilizes URIs
Sharable
Create once use many times
Unambiguous relationships
Facilitates aggregation of content
Controlled SEO keywords
WEAKNESSES OF TOPICS
Creates massive database of RDF triples
Expensive to query
Based on unsupported open source code (Jena)
Polyhierarchy makes it difficult to create navigational taxonomies
MARTHA STEWART LIVING OMNIMEDIA
MSLO is a publishing, broadcasting, and merchandising business
Extensive cross-promotion of content and products
3 websites and numerous digital apps
Domains include home, food, weddings, and healthy living
GOALS
Enhance data integrity
Improve editorial efficiency
Share resources among titles and types of content
Create contextual presentation of content
Improve search and facilitate guided navigation
CHALLENGES
Between CMSs: Vignette to Drupal 6
Limited resources, time, money: working on new CMS
Fuzzy business requirements: unclear plan for redesign
DECISIONS DECISIONS
RDF/OWL: expensive to implement; no easy HTML 5 implementation; no business reason to undertake such a large endeavor
Roadblocks (Lots), LOE (Great), Time (Massive), Resources (Plenty)
DECISIONS DECISIONS
RDFa: No easy HTML 5 implementation
Microformats: Useful for recipes but limited formats
Microdata: Useful for recipes, but new and untested
Open Graph Protocol: Facebook use only, but critical to deploy ASAP
JUST ENOUGH SEMANTICS
Now
Microformats: Google Rich Snippets and Recipe search
OGP: site-wide implementation
Next up
Probably Microdata from Schema.org
Google approved
Integration of other formats
Shiny and new, untested
LESSONS LEARNED (SO FAR)
Educate the troops
Buy-in from senior leadership
Loose, but coherent implementation plan
Concise, easy-to-reach business goals to start
One content type to start, then branch out
WHAT’S NEXT FOR MARTHA
Microdata deployed across all sites
Development of more sophisticated relationships with our content
Roll out of more robust faceted search
Integration of all content types into topic pages
FUTURE OF SEMANTIC WEB
Move from web of objects to web of data
More personalized experiences
Positive impact on content management costs
Classifying content well allows for unanticipated uses and users; cataloging allows for audience targeting.
Barbara McGlamery
Taxonomist
Martha Stewart Living Omnimedia
(212)[email protected]