Connecting Concepts : joining up the BBC

Rob Lee : [email protected]

Contextual semantic disambiguation

Joining across the BBC

http://www.rattlecentral.com

I’m going to tell a bit of a story: where we ended up was different to where we thought we’d end up. We never thought we’d end up building a semantic categorisation system.

Innovation Labs

How did it begin? BBC Innovation Labs.

The BBC is an ivory tower of content - above the web, not part of it. BBC News in particular is the premier property: horizontal navigation isn’t bad, but links out to the wider web are missing, very ‘functional’ or unsurprising. Journalists make poor librarians, so how can we breathe a little more life into this content? Where’s the ‘wilfing’?

Wikipedia demonstrates wilfing really well, it has good internal and outbound links

wilfing - What Was I Looking For ?

We wanted to look at ways we could bring this wilfing experience to the BBC. The options: train 100 journalists to go back through the content archives, or automate it. To automate it, the first thing we need is a data source - let’s use Wikipedia, as it’s good for wilfing.

Muddy Boots

Muddy Boots is a research project, tramping new trails through the pristine pastures of the BBC content. It is built on some fairly simple precepts, using freely available technology.

Let’s take a look at a Wikipedia page: how does it support wilfing? Good internal page links support horizontal navigation.

Circled on the slide: comparison of web app frameworks, interview link, DHH says no to use of Rails logo, Railscasts.

It’s not just internal links that are useful - there is typically an ‘interesting’ set of external links. Some are functional, but others are picked by users and thus likely to be interesting.

How do we relate the story to our commons content? We need to relate Wikipedia data to our content, but most archives have poor classification/descriptive data: journalists make poor librarians. How can we improve the classification of content in our archives, and how can we find out what it’s about?

Various automated techniques are available, including Semantic Analysis and Term Extraction. We cheated - we used YTE (Yahoo Term Extraction). Now we can say what characterises a story and start to relate it to Wikipedia.

Too much information - we’ve established a set of relationships, but how do we pick out something useful AND relevant/interesting from all these articles? How do we rank information, and what information do we want to extract/use? We concentrated on external links, as the BBC’s external links at the time were very functional (chosen by journalists without much time?) and not very interesting - they didn’t really support wilfing.

We use another 3rd party service to rank external URLs. We tried using Google rank and Technorati buzz, but they didn’t necessarily rank the interesting links highly (e.g. try searching for google.com on Google and see how many hits there are). del.icio.us is different, though: users provide both context and ranking (via tagging and the process of bookmarking a URL). So for each external link from all the Wikipedia pages, we can see how many of the original extracted tags there are compared to all the tags for a URL: high matches = high rank. This tells us how relevant the URL is to the story in question, and the fact it’s been bookmarked means it’s interesting to someone.
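The tag-matching idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the MuddyBoots code: the function names, the example URLs and the tag lists are all made up, and the tags are assumed to have been fetched from the bookmarking service already.

```python
def tag_match_score(story_tags, url_tags):
    """Fraction of a URL's bookmark tags that also appear among the story's extracted terms."""
    story = {t.lower() for t in story_tags}
    bookmarks = [t.lower() for t in url_tags]
    if not bookmarks:
        # Never bookmarked: no evidence that anyone found it interesting.
        return 0.0
    matches = sum(1 for t in bookmarks if t in story)
    return matches / len(bookmarks)


def rank_links(story_tags, tags_by_url):
    """Return (url, score) pairs, highest score first."""
    scored = [(url, tag_match_score(story_tags, tags)) for url, tags in tags_by_url.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical data: terms extracted from a story, and bookmark tags per external link.
story_tags = ["rails", "ruby", "web", "framework"]
tags_by_url = {
    "http://example.org/rails-interview": ["rails", "ruby", "interview", "web"],
    "http://example.org/homepage": ["search", "tools", "web", "misc"],
    "http://example.org/unbookmarked": [],
}
ranked = rank_links(story_tags, tags_by_url)
```

Dividing by the total number of tags for the URL means a generic, heavily tagged page does not win just by having lots of tags.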

Recent story - the journalist has added good related links this time. Interestingly, the MuddyBoots system has ‘recommended’ many of the same links independently - a nice verification of the method.

Problems:
Real time performance - it currently takes about two minutes, due to API access and data-set sizes (we have a local mirror of the Wikipedia db), but del.icio.us queries are expensive.
Story classification can be incorrect.
It doesn’t always work - sometimes the coverage isn’t there in del.icio.us (not such a biggie).
Disambiguation can be a big problem, causing false positives - the main issue.
Language and geographical relevance.

Apple (the fruit) or Apple (the company) ?

A more semantic approach can help. If we can say this is a person or a company or a place rather than just a piece of text, then we can be more certain about what it is and thus be less ambiguous. It would be good if we had a single point of reference for this ‘thing’.

Simplify the problem :

Unambiguously identify the “main actors” in a news story

Then add semantic markup for them

Answers the “who” in who, what, where, why ...

Back to basics: only try to do one thing well - and solve the problem of ambiguity.


geonames.org

In order to solve the disambiguation problem, we could use the Wikipedia disambiguation pages. But Wikipedia is difficult for machines to work with, hence we use DBpedia. It’s also possible to link other data sources such as GeoNames or MusicBrainz, so we can find out more information than was previously possible using just Wikipedia. We’re building a semantic categorisation system.

Page for ‘Apple’ on DBpedia -> lots of data

For a given term we can find a) if it’s ambiguous and b) what the term could possibly mean. It’s possible to use this information to help determine which meaning we are actually referring to in the original text.
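One way to picture this is a simple context-overlap check. The sketch below is an assumption about how such a step might look, not a description of the actual system: the candidate labels and their descriptive words are invented, and in practice they would come from a source such as DBpedia.

```python
def disambiguate(term, context_terms, candidates):
    """Pick the candidate meaning whose descriptive words overlap most with the story's terms.

    Returns None if the term has no candidates, or if no candidate shares a word with the context.
    """
    context = {t.lower() for t in context_terms}
    best, best_overlap = None, 0
    for label, description_words in candidates.get(term, {}).items():
        overlap = len(context & {w.lower() for w in description_words})
        if overlap > best_overlap:
            best, best_overlap = label, overlap
    return best


# Hypothetical candidate meanings for one ambiguous term.
candidates = {
    "Apple": {
        "apple-the-company": ["company", "computer", "iphone", "software"],
        "apple-the-fruit": ["fruit", "tree", "orchard", "cider"],
    }
}

# a) Is the term ambiguous?  b) Which meaning fits the other terms found in the story?
is_ambiguous = len(candidates["Apple"]) > 1
choice = disambiguate("Apple", ["iPhone", "software", "launch"], candidates)
```

Returning None when nothing overlaps is deliberate: given that false positives are the main issue, declining to guess seems safer than picking a meaning at random.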

Entity Extraction

Yahoo term extraction

TagThe.net

Voting System

Leveraging existing web services to perform entity extraction is useful, especially when employing a voting system. We also used a local named entity extraction service; this will be more useful in the future, as we have some direction over its evolution.
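A voting system over several extractors could look roughly like this. It is a minimal sketch: the extractor names are just labels, the entity lists are made up, and the threshold of two votes is an assumption rather than the value used in the project.

```python
from collections import Counter


def vote(extractor_results, min_votes=2):
    """Keep entities proposed by at least `min_votes` extractors."""
    votes = Counter()
    for entities in extractor_results.values():
        # A set ensures each extractor votes at most once per entity.
        votes.update({e.strip().lower() for e in entities})
    return sorted(e for e, n in votes.items() if n >= min_votes)


# Hypothetical output from three extraction services for the same story.
extractor_results = {
    "yahoo_term_extraction": ["Apple", "Steve Jobs", "iPhone"],
    "tagthe_net": ["apple", "Steve Jobs", "California"],
    "local_extractor": ["Apple", "iPhone", "keynote"],
}
agreed = vote(extractor_results)
```

Entities that only one service suggests are dropped, which trades a little coverage for fewer spurious terms.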

We can also say that this is a company (or, if it was Steve Jobs for example, a person). How about we use this information to mark up our original content with hCard microformats? Suddenly we’ve created a way for semantic aggregators to know our story is really about Apple the company. We can drive more targeted, better quality search traffic to our site - new routes into the content.
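As a rough illustration of the markup step, the sketch below wraps an entity name in minimal hCard markup, using the hCard convention that an element carrying both `fn` and `org` describes an organisation. The helper names are hypothetical, and the story text is assumed to be plain text that is already safe to embed in HTML.

```python
from html import escape


def hcard(name, kind):
    """Wrap an entity name in minimal hCard markup; kind is "organisation" or "person"."""
    safe = escape(name)
    if kind == "organisation":
        # "fn" and "org" on the same element marks the card as an organisation.
        inner = f'<span class="fn org">{safe}</span>'
    else:
        inner = f'<span class="fn">{safe}</span>'
    return f'<span class="vcard">{inner}</span>'


def mark_up_first(text, name, kind):
    """Replace the first occurrence of name in text with its hCard markup."""
    return text.replace(name, hcard(name, kind), 1)


story = "Apple unveiled a new phone today."
marked = mark_up_first(story, "Apple", "organisation")
```

Only the first mention is marked up here, which keeps the page readable while still giving a parser one unambiguous card per entity.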

One Possible Workflow

[Diagram. Boxes: Parse Content & Markup; Extract (& Classify) Entities; Find in DBpedia / Wikipedia; Classify Entities via DBpedia; Extract Required Attributes]

Entity extraction - many methods are available. Entity classification via DBpedia is very extensible. Finally, microformat markup is good for machines and the semantic web as a whole.

Chris sizemore slide

• contxt

Chris Sizemore’s approach to using Wikipedia as a controlled vocabulary: it can produce interesting ‘human’ categorisation results.

Sample of MuddyBoots output, classification of a BBC news article. Demonstrates ‘main actor’ discovery and automated microformatting and inclusion of extra content from DBpedia in a ‘featured actors’ sidebar. The inclusion of microformats means machines can now query this page in a more granular fashion

An added bonus is creating semantic links and using ‘web scale identifiers’. BBC Music beta aggregates around MusicBrainz identifiers and DBpedia knows about MusicBrainz, therefore we can provide news feeds for any artist on BBC Music beta using this relationship. This demonstrates how a common controlled vocabulary can help us both join up the BBC and link out to other web databases.
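The join via shared identifiers can be sketched as a simple lookup. Everything in this example is a placeholder: the story ids, the `dbpedia:` labels and the `mbid-` values are invented, and the mapping between the two identifier schemes is assumed to have been obtained elsewhere.

```python
def stories_by_artist(story_entities, dbpedia_to_musicbrainz):
    """Group story ids by the MusicBrainz identifier of any artist they mention."""
    feeds = {}
    for story_id, dbpedia_ids in story_entities.items():
        for dbpedia_id in dbpedia_ids:
            mbid = dbpedia_to_musicbrainz.get(dbpedia_id)
            if mbid is not None:
                # Entities with no MusicBrainz identifier (places, companies) are skipped.
                feeds.setdefault(mbid, []).append(story_id)
    return feeds


# Hypothetical classified stories and a hypothetical identifier mapping.
story_entities = {
    "story-1": ["dbpedia:Artist_A", "dbpedia:Place_X"],
    "story-2": ["dbpedia:Artist_A", "dbpedia:Artist_B"],
}
dbpedia_to_musicbrainz = {
    "dbpedia:Artist_A": "mbid-aaa",
    "dbpedia:Artist_B": "mbid-bbb",
}
feeds = stories_by_artist(story_entities, dbpedia_to_musicbrainz)
```

Because both sides key on the same identifier, a page that aggregates around that identifier can pull in its news feed with no manual linking.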

http://www.muddy.it

Where next? Precision and recall testing with the BBC. Trialling the technology in a production environment. The Muddy.it service: for those of you who are interested in the technology, go and have a look at muddy.it. We’re testing it with beta partners at the moment and you can go and register your interest.
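For reference, precision and recall for a single story can be computed as below. This is a generic sketch of the standard definitions, not the BBC test procedure; the predicted and expected entity lists are made up.

```python
def precision_recall(predicted, expected):
    """Precision: share of predicted entities that are correct.
    Recall: share of expected entities that were found."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical system output versus a hand-made list for the same story.
precision, recall = precision_recall(
    predicted=["Apple", "Steve Jobs", "Cupertino", "Orchard"],
    expected=["Apple", "Steve Jobs", "iPhone"],
)
```

Here two of the four predicted entities are correct and two of the three expected entities were found.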

In summary, using DBpedia as a controlled vocabulary has a number of benefits:
• no maintenance required - it is performed by the community
• use of web scale identifiers allows you to link your content to other web scale databases, e.g. MusicBrainz
• there’s lots of information within DBpedia that allows you to move beyond NLP and into contextual semantic disambiguation

Photos:
http://www.flickr.com/photos/evapro/305689596/
http://www.flickr.com/photos/jvk/141284308
http://www.flickr.com/photos/aquatic/500443103/
http://www.flickr.com/photos/donnagrayson/195244498/
http://flickr.com/photos/garrulus/82714475/
http://www.flickr.com/photos/bekathwia/2120050762/






