semantic search: from document retrieval to virtual assistants

Semantic search: from document retrieval to virtual assistants PRESENTED BY Peter Mika, Sr. Research Scientist, Yahoo Labs ⎪ June 19, 2014

Upload: peter-mika

Post on 06-May-2015


DESCRIPTION

Keynote at the 3rd Spanish Conference in Information Retrieval (CERI)

TRANSCRIPT

1. Semantic search: from document retrieval to virtual assistants
  Presented by Peter Mika, Sr. Research Scientist, Yahoo Labs, June 19, 2014

2. Agenda
  Invite
  What is Semantic Search?
  Applications to Web search
    Enhanced results
    Entity retrieval and recommendations
  Beyond Web search

3. Yahoo Labs Barcelona
  Established January, 2006
  Part of a global network of Labs in Sunnyvale, New York, Barcelona, Haifa, Bangalore, Beijing, Santiago
  Led by Ricardo Baeza-Yates
  Research areas: Distributed Systems, Semantic Search, Social Media, Web Mining, Web Retrieval

4. Semantic Search Research
  Jordi Atserias, Sr. Research Engineer
  Roi Blanco, Sr. Research Scientist
  Hugues Bouchard, Sr. Research Engineer
  Peter Mika, Sr. Research Scientist, Manager
  Tim Potter, Research Engineer
  Edgar Meij, Research Scientist

5. What is Semantic Search?

6. Search is really fast, without necessarily being intelligent

7. Why Semantic Search?
  Improvements in IR are harder and harder to come by
    Basic relevance models are well established
    Machine learning using hundreds of features
    Heavy investment in computational power, e.g. real-time indexing and instant search
  Remaining challenges are not computational, but in modeling user cognition
    Could Watson explain why the answer is Toronto?
    Need a deeper understanding of the query, the content and the relationship of the two

8. Semantic gap
  Ambiguity: jaguar; paris hilton
  Secondary meaning: george bush (and I mean the beer brewer in Arizona)
  Subjectivity: reliable digital camera; paris hilton sexy
  Imprecise or overly precise searches: jim hendler
  Complex needs
    Missing information: brad pitt zombie; florida man with 115 guns; 35 year old computer scientist living in barcelona
    Category queries: countries in africa; barcelona nightlife
    Transactional or computational queries: 120 dollars in euros; digital camera under 300 dollars; world temperature in 2020
  Poorly solved information needs remain
    Are there even true keyword queries? Users may have stopped asking them

9. Real problem

10. What it's like to be a machine? Roi Blanco

11. What it's like to be a machine? =

12. Def.
  Semantic Search is any retrieval method where
    User intent and resources are represented in a semantic model
      A set of concepts or topics that generalize over tokens/phrases
      Additional structure such as a hierarchy among concepts, relationships among concepts etc.
    Semantic representations of the query and the user intent are exploited in some part of the retrieval process
  As a research field
    Workshops: ESAIR (2008-2014) at CIKM; Semantic Search (SemSearch) workshop series (2008-2011) at ESWC/WWW; EOS workshop (2010-2011) at SIGIR; JIWES workshop (2012) at SIGIR; Semantic Search Workshop (2011-2014) at VLDB
    Special Issues of journals
    Surveys: Christos L. Koumenides, Nigel R. Shadbolt: Ranking methods for entity-oriented semantic web search. JASIST 65(6): 1091-1106 (2014)

13. Semantic models: implicit vs. explicit
  Implicit/internal semantics
    Models of text extracted from a corpus of queries, documents or interaction logs
    Query reformulation, term dependency models, translation models, topic models, latent space models, learning to match (PLS)
    See Hang Li and Jun Xu: Semantic Matching in Search. Foundations and Trends in Information Retrieval Vol 7 Issue 5, 2013, pp 343-469
  Explicit/external semantics
    Explicit linguistic or ontological structures extracted from text and linked to external knowledge
    Obtained using IE techniques or acquired from Semantic Web markup

14. Semantic Search: a process view
  Query Construction: Keywords, Forms, NL, Formal language
  Query Processing: IR-style matching & ranking, DB-style precise matching, KB-style matching & inferences
  Result Presentation: Query visualization, Document and data presentation, Summarization
  Query Refinement: Implicit feedback, Explicit feedback, Incentives
  Document Representation
  Knowledge Representation
  Semantic Models
  Resources
  Documents

15. What it's like to be a machine? =

16. What it's like to be a machine? =

17.
Information Extraction
  Documents
    Natural language
      Named Entity Recognition & Disambiguation (entity linking)
      Deep parsing (dependency parsing)
    Specific to the Web
      Extraction from web tables, wrapper induction etc.
      Open Information Extraction such as NELL, ReVerb etc.
  Queries
    Short text and no structure: nothing to do?

18. Information Extraction on queries
  Entities play an important role
    ~70% of queries contain a named entity (entity mention queries) and ~50% of queries have an entity focus (entity seeking queries), e.g. brad pitt attacked by fans
    ~10% of queries are looking for a class of entities, e.g. brad pitt movies
  See
    Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010: 771-780
    Thomas Lin, Patrick Pantel, Michael Gamon, Anitha Kannan, Ariel Fuxman: Active objects: actions for entity-centric search. WWW 2012: 589-598

19. Information Extraction on queries
  Common structure to entity mention queries: query = <entity> + <intent>
  Intent is typically an additional word or phrase to
    Disambiguate, e.g. brad pitt actor
    Specify action or aspect, e.g. brad pitt net worth, brad pitt download
  Useful also in off-line query log analysis
    Reduce the sparsity of query log data by mapping entities and intents to a reference base of entities and intents

20.
Patterns in logs are hard to see
  Sample of sessions from June, 2011 containing the term moneyball
  oakland as bradd pitt movie moneyball movies.yahoo.com oakland as wikipedia.org captain america movies.yahoo.com moneyball trailer movies.yahoo.com money moneyball movies.yahoo.com moneyball movies.yahoo.com movies.yahoo.com en.wikipedia.org movies.yahoo.com peter brand peter brand oakland nymag.com moneyball the movie www.imdb.com moneyball trailer movies.yahoo.com moneyball trailer brad pitt brad pitt moneyball brad pitt moneyball movie brad pitt moneyball brad pitt moneyball oscar www.imdb.com relay for life calvert ocunty www.relayforlife.org trailer for moneyball movies.yahoo.com moneyball.movie-trailer.com moneyball en.wikipedia.org movies.yahoo.com map of africa www.africaguide.com money ball movie www.imdb.com money ball movie trailer moneyball.movie-trailer.com brad pitt new www.zimbio.com www.usaweekend.com www.ivillage.com www.ivillage.com brad pitt news news.search.yahoo.com moneyball trailer moneyball trailer www.imdb.com www.imdb.com
  What are users trying to do?

21. Semantic annotations help to generalize
  oakland as bradd pitt movie moneyball trailer movies.yahoo.com oakland as wikipedia.org
  Annotations: Sports team, Movie, Actor

22. ...and understand user needs
  moneyball trailer
  Labels: Movie (object of the query); what the user wants to do with it

23. Information extraction on queries
  Entity linking
    Tutorial: Entity Linking and Retrieval by Edgar Meij, Krisztian Balog and Daan Odijk
    Dataset for evaluation of entity linking (2013): Yahoo WebScope dataset L24 - Yahoo Search Query Log To Entities, version 1.0
  Semantic annotation for query log analysis
    Frequent pattern mining on raw queries fails due to large amount of noise
    Meaningful patterns start to emerge when mining the semantic annotations instead
    Laura Hollink, Peter Mika, Roi Blanco: Web usage mining with semantic analysis. WWW 2013: 561-570

24.
Semantic Web
  Significant extension of the Web stack
    Languages for publishing raw data and document annotations
    Standards for querying, validating and reasoning with data distributed across the Web
  Research community formed around 2001: ISWC, ESWC, WWW Semantic Web Track, JWS
  Conflicted history with Information Retrieval
    Misplaced expectations as to what the Semantic Web will bring
    Building the chicken farm before any chickens or eggs
  Since 2007 more solid progress in adoption
    Metadata in HTML
    Public and private Knowledge Graphs

25. Metadata in HTML: schema.org
  Agreement on a shared set of schemas for common types of web content
    Bing, Google, and Yahoo! as initial founders (June, 2011), joined by Yandex later
    Similar in intent to sitemaps.org
  Use a single format to communicate the same information to all three search engines

  Pirates of the Caribbean: On Stranger Tides (2011)
  Jack Sparrow and Barbossa embark on a quest to find the elusive fountain of youth, only to discover that Blackbeard and his daughter are after it too.
  Director: Rob Marshall
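
A minimal sketch of how a consumer might read markup like the movie example above, assuming it was published as schema.org microdata. The HTML tags did not survive in this transcript, so SNIPPET is a reconstruction, and MicrodataParser and extract_microdata are hypothetical names. It uses only Python's standard html.parser and records flat (item type, property, text) triples; it does not link a nested item back to its parent property and does not handle void elements such as <meta>.

```python
from html.parser import HTMLParser

# Assumed markup: a reconstruction of the slide's example as microdata.
SNIPPET = """
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Pirates of the Caribbean: On Stranger Tides (2011)</h1>
  <span itemprop="description">Jack Sparrow and Barbossa embark on a quest
  to find the elusive fountain of youth.</span>
  Director:
  <div itemprop="director" itemscope itemtype="http://schema.org/Person">
    <span itemprop="name">Rob Marshall</span>
  </div>
</div>
"""


class MicrodataParser(HTMLParser):
    """Collects (item type, property, text) triples from simple microdata."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.scopes = []          # (tag depth, itemtype) of open itemscopes
        self.current_prop = None  # itemprop of the element being read
        self.triples = []

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        attrs = dict(attrs)
        # A property that opens a nested item has no text value of its own.
        if "itemprop" in attrs and "itemscope" not in attrs:
            self.current_prop = attrs["itemprop"]
        if "itemscope" in attrs:
            self.scopes.append((self.depth, attrs.get("itemtype") or ""))

    def handle_endtag(self, tag):
        # Close the innermost item when its opening element ends.
        if self.scopes and self.scopes[-1][0] == self.depth:
            self.scopes.pop()
        self.depth -= 1
        self.current_prop = None

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.current_prop and self.scopes:
            item_type = self.scopes[-1][1].rsplit("/", 1)[-1]
            self.triples.append((item_type, self.current_prop, text))


def extract_microdata(html):
    parser = MicrodataParser()
    parser.feed(html)
    return parser.triples


for triple in extract_microdata(SNIPPET):
    print(triple)
```

The unmarked text "Director:" is skipped because it carries no itemprop, which is the point of the markup: the same page serves readers and machines, and only the annotated spans become data.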

26. Substantial adoption of schema.org markup
  Over 15% of all pages now have schema.org markup
  Over 5 million sites, over 25 billion entity references
  In other words: same order of magnitude as the web
  Source: R.V. Guha: Light at the end of the tunnel, ISWC 2013 keynote
  See also
    P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012
      Based on Bing US corpus
      31% of webpages, 5% of domains contain some metadata (including Facebook's OGP)
    WebDataCommons
      Based on CommonCrawl Nov 2013
      26% of webpages, 14% of domains contain some metadata (including Facebook's OGP)

27. Knowledge Graphs
  Linked (Open) Data (linkeddata.org)
    Public movement for making open/public databases available in standard Semantic Web formats, interlinking them
  DBpedia is a central hub in this network of datasets
    Software framework to extract structured data from Wikipedia and consolidate it under a common ontology
    The resulting dataset that contains links to Freebase and others
    Freebase links to IMDB and so on
  Basis for private Knowledge Graphs: Bing, Google, Yahoo

28. Yahoo's Knowledge Graph
  [Figure: example entity graph. Nodes: Chicago Cubs, Chicago, Barack Obama, Carlos Zambrano, 10% off tickets, Brad Pitt, Angelina Jolie, Steven Soderbergh, George Clooney, Ocean's Twelve, E/R, Fight Club, Dust Brothers. Edge labels: for, plays for, plays in, lives in, partner, directs, casts in, takes place in, music by.]
  Nicolas Torzec: Making knowledge reusable at Yahoo!: a Look at the Yahoo! Knowledge Base (SemTech 2013)

29. Building Yahoo's Knowledge Graph
  Ontology building and maintenance
    Editorially maintained OWL ontology with 300+ classes
    Covering the domains of interest of Yahoo
  Information extraction
    Public datasets and proprietary data
  Data fusion
    Manual mapping from the source schemas to the ontology
    Supervised entity reconciliation
    Kedar Bellare, Carlo Curino, Ashwin Machanavajjhala, Peter Mika, Mandar Rahurkar, Aamod Sane: WOO: A Scalable and Multi-tenant Platform for Continuous Knowledge Base Synthesis. PVLDB 2013
    Michael J. Welch, Aamod Sane, Chris Drome: Fast and accurate incremental entity resolution relative to an entity knowledge base. CIKM 2012
  Editorial curation and quality assessment

30. Applications in Web Search

31. Semantic Search for...
  Improving ad-hoc document retrieval
    Query composition
    Result presentation
    Matching
    Ranking
  Providing new search functionality
    Entity retrieval
    Related entity recommendation
    Personalization
    Question-answering
    Task completion

32. Exploiting Semantic Web markup (internal prototype, 2007)
  Personal and private homepage of the same person (clear from the snippet but it could be also automatically de-duplicated)
  Conferences he plans to attend and his vacations from homepage plus bio events from LinkedIn
  Geolocation

33. Search snippets using Semantic Web markup
  Summarization of HTML is a hard task
    Template detection
    Selecting relevant snippets
    Composing readable text
    Efficiency constraints
  Yahoo SearchMonkey (2008)
    Enhanced results using structured data from the page
      Key/value pairs
      Deep links
      Image or Video

34. Effectiveness of enhanced results
  Explicit user feedback
    Side-by-side editorial evaluation (A/B testing)
      Editors are shown a traditional search result and enhanced result for the same page
      Users prefer enhanced results in 84% of the cases and traditional results in 3% (N=384)
  Implicit user feedback
    Click-through rate analysis
      Long dwell time limit of 100s (Ciemiewicz et al. 2010)
      15% increase in good clicks
    User interaction model
      Enhanced results lead users to relevant documents even though less likely to be clicked than textual results
      Enhanced results effectively reduce bad clicks!
  See Kevin Haas, Peter Mika, Paul Tarjan, Roi Blanco: Enhanced results for web search. SIGIR 2011: 725-734

35.
Enhanced results at other search providers
  Google announces Rich Snippets - June, 2009
    Faceted search for recipes - Feb, 2011
  Bing tiles - Feb, 2011
  Facebook's Like button and the Open Graph Protocol (2010)
    Shows up in profiles and news feed
    Site owners can later reach users who have liked an object

36. Moving beyond entity markup
  We would like to help our users in task completion
    But we have trained our users to talk in nouns
      Retrieval performance decreases by adding verbs to queries
    Markup for actions/intents could potentially help
  Modeling actions
    Understand what actions can be taken on a page
    Help users in mapping their query to potential actions
    Applications in web search, email etc.
  [Diagram: THING, THING]
  Schema.org v1.2 including Actions vocabulary published April 16, 2014

37. Applications of Actions markup
  Email (Gmail)
  SERP (Yandex)

38. Entity displays in web search
  Entity retrieval
    Which entity does a keyword query refer to, if any?
  Related entities for navigation
    Which entity would the user visit next?

39. Entity Retrieval
  Keyword search over entity graphs
    See Pound et al. WWW08 for a definition
    No common benchmark until 2010
  SemSearch Challenge 2010/2011
    50 entity-mention queries
      Selected from the Search Query Tiny Sample v1.0 dataset (Yahoo! Webscope)
    Billion Triples Challenge 2009 data set
    Evaluation using Mechanical Turk
  See report: Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, Thanh Tran: Repeatable and reliable semantic search evaluation. J. Web Sem. 21: 14-29 (2013)

40.
Glimmer: open-source entity retrieval engine from Yahoo
  Extension of MG4J from University of Milano
  Indexing of RDF data
    MapReduce-based
    Horizontal indexing (subject/predicate/object fields)
    Vertical indexing (one field per predicate)
  Retrieval
    BM25F with machine-learned weights for properties and domains
    52% improvement over the best system in SemSearch 2010
  See Roi Blanco, Peter Mika, Sebastiano Vigna: Effective and Efficient Entity Search in RDF Data. International Semantic Web Conference (1) 2011: 83-97
  https://github.com/yahoo/Glimmer/

41. Other evaluations in Entity Retrieval
  TREC Entity Track 2009-2011
    Data: ClueWeb 09 collection
    Queries
      Related Entity Finding
        Entities related to a given entity through a particular relationship
        (Homepages of) airlines that fly Boeing 747
      Entity List Completion
        Given some elements of a list of entities, complete the list
        Professional sports teams in Philadelphia such as the Philadelphia Wings,
    Relevance assessments provided by TREC assessors
  Question Answering over Linked Data 2011-2014
    Data: DBpedia and MusicBrainz in RDF
    Queries
      Full natural language questions of different forms, written by the organizers
      Multi-lingual
      Give me all actors starring in Batman Begins
    Results are defined by an equivalent SPARQL query
    Systems are free to return list of results or a SPARQL query

42. Related entity recommendations
  Related entities

43. Example user sessions

44. Spark(le) system for related entity recommendations
  1. Knowledge Graph: Filtering and enrichment
  2. Feature extraction: Query logs, Flickr, Twitter
  3. MLR
  4. Online/offline evaluation: Point-wise assessments, Side-by-side testing, Online evaluation
  5. Runtime
  Unary
    Popularity features from text: probability, entropy, Wiki entity popularity
    Graph features: PageRank on the entity graph, Wikipedia, Web graph
    Type features: entity type
  Binary
    Co-occurrence features from text: conditional probability, joint probability
    Graph features: common neighbors
    Type features: relation type
  Roi Blanco, B. Barla Cambazoglu, Peter Mika, Nicolas Torzec: Entity Recommendations in Web Search. ISWC 2013

45. Beyond Web Search

46. Mobile search on the rise
  Information access on-the-go requires hands-free operation
    Driving, walking, gym, etc.
    Americans spend 540 hours a year in their cars [1] vs. 348 hours browsing the Web [2]
  ~50% of queries are coming from mobile devices (and growing)
    Changing habits, e.g. iPad usage peaks before bedtime
    Limitations in input/output
  [1] http://answers.google.com/answers/threadview?id=392456
  [2] http://articles.latimes.com/2012/jun/22/business/la-fi-tn-top-us-brands-news-web-sites-20120622

47. Mobile search: challenges and opportunities
  Interaction
    Question-answering
    Support for interactive retrieval
    Spoken-language access
    Task completion
  Contextualization
    Personalization
    Geo
    Context (work/home/travel)
  Try getaviate.com

48. Interactive, conversational voice search
  Parlance EU project
    Complex dialogs within a domain
      Requires complete semantic understanding
    Complete system (mixed license)
      Automated Speech Recognition (ASR)
      Spoken Language Understanding (SLU)
      Interaction Management
      Knowledge Base
      Natural Language Generation (NLG)
      Text-to-Speech (TTS)
    Video

49. Example dialogue

50. Components of a Spoken Dialog System (SDS)
  [Diagram: User, Recognizer (ASR), Semantic Decoder, Dialog Control, Message Generator, Synthesizer (TTS), The Web. Signals: Waveforms, Words, Dialog Acts.]
  Example: "I want to find a restaurant?" -> inform(task=find, entity=restaurant); request(food) -> "What kind of food would you like?"
  Currently limited domain
  Hand-crafted using rule-based parsers, template generators and flowchart-based dialog control
  Expensive to build and fragile in operation

51. A Statistical Spoken Dialogue System
  [Diagram: ASR, Semantic Decoder, Bayesian Belief Network, Evidence, Belief State, Belief Propagation, Stochastic Policy, Action, Response Generator, TTS, Ontology, Reward Function (Rewards: success/fail), Reinforcement Learning, Supervised Learning, Partially Observable Markov Decision Process (POMDP)]
  User: I want an Italian
  ASR hypotheses: I'd like italian {0.4}; I want an Italian {0.2}; I'd like Indian {0.2}; In the east {0.1}
  Decoded user acts: inform(food=italian) {0.6}; inform(food=indian) {0.2}; inform(area=east) {0.1}; null() {0.1}
  Belief state slots: Food (Ita, Ind, -); Area (N, E, S, W)
  System acts: confirm(food=italian), request(area)
  System: You are looking for an Italian restaurant? Whereabouts?

52. Semantic Decoding
  Input: I'm looking for a place to eat perhaps french.
  Extract features, e.g. frequent N-grams: I'm looking; I'm looking for; for a place; place to eat; french
  Bank of binary classifiers: u-act = request; u-act = inform; entity=restaurant; entity=bar; entity=hotel; food=french; food=chinese; etc.
  Classifier outputs: 0.1, 0.6, 0.5, 0.3, 0.0, 0.8, 0.1
  User acts: inform(entity=restaurant, food=french) {0.5}; inform(entity=bar, food=french) {0.3}; ...; inform(entity=restaurant, food=chinese) {0.1}

53. Belief State
  [Diagram: Bayesian network over time slices t and t+1. Per slot: Goal (g_entity, g_food), User Act (u_entity, u_food), Observation at time t (o_entity, o_food). Goal to User Act: User Behaviour. User Act to Observation: Recognition/Understanding Errors.]
  Ontology: task -> find(entity,method,); entity -> restaurant(food, ..); entity -> bar(food, ..); food = French, Italian, Indian, ..
  Compile Bayesian Network from Ontology

54. Choosing the next action: the Policy
  Belief state b over goals (g_entity: H, B, R; g_food: Fr, It, In, -), e.g. inform(entity=bar) {0.4}
  Feature Extraction: summary belief space
  Gaussian Process Q-Function Approximation: $Q(b, a) = E\left[\sum_{\tau=t+1}^{T} r_\tau \mid b, a\right]$
  Sample from $\{Q(b, a) : a \in A\}$, then choose $\arg\max_a \{Q(b, a) : a \in A\}$
  Selected action, e.g. select(entity=bar, entity=restaurant)

55.
Large Scale Evaluation
  Task Success Rates

  Condition          Word Err Rate  Conventional Success Rate  POMDP System Success Rate
  Telephone          21%            84.6%                      86.9%
  Telephone + noise  30%            75.2%                      81.2%
  In Car             29%            67.8%                      75.8%

  Success = finding the required information for a restaurant which matches the supplied criteria
  Note that users' perceived success rate was ~10% higher!

56. Scaling up to the Web
  We can build a fully statistical spoken dialogue system for a specific narrow domain, but how do we scale up to much broader domains?
  [Diagram, CamInfo Restaurant System: Real Users; Working System; Crowd-sourced annotators; Data for input output mapping; User simulator for policy optimisation; Corpus Data for model parameter estimation; Domain Ontology; Hand-crafted input, output, and model parameters]
  [Diagram, Personal Assistant: Corpus Data for model parameter estimation; Domain Ontology; Unsupervised learning; Fast on-line reinforcement learning; Wide coverage ontology; Real Users]

57. Conclusions
  Semantic Search
    Explicit understanding for queries and documents through links to external knowledge
      Using methods of Information Extraction or explicit annotations (markup) in webpages
      Semantic Web as a source of external knowledge
  Increasing level of understanding
    Early focus on entities and their attributes
      Applications in web search: rich results, entity displays, entity recommendation
    Moving toward modeling intents/actions
    Adding human-like interaction

58. Q&A
  Many thanks to members of the Semantic Search team at Yahoo Labs Barcelona and to Yahoos around the world
  Slides on POMDP-based dialogue systems courtesy of Prof. Steve Young, UCAM
  Contact: [email protected], @pmika, http://www.slideshare.net/pmika/
  Ask about our internships and other opportunities
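
As a rough illustration of the decoding step on the statistical dialogue system slide (slide 51), the sketch below turns the N-best list of ASR hypotheses into a distribution over user dialog acts by summing the confidences of hypotheses that map to the same act. The hypotheses and confidences are taken from the slide. The keyword rules in RULES are a hypothetical stand-in for the statistical semantic decoder described in the talk, and assigning the unallocated probability mass to null() is an assumption chosen to match the slide's numbers.

```python
from collections import defaultdict

# N-best ASR hypotheses with confidences, as listed on slide 51.
N_BEST = [
    ("I'd like italian", 0.4),
    ("I want an Italian", 0.2),
    ("I'd like Indian", 0.2),
    ("In the east", 0.1),
]

# Hypothetical keyword rules; the real system uses a bank of classifiers.
RULES = [
    ("italian", "inform(food=italian)"),
    ("indian", "inform(food=indian)"),
    ("east", "inform(area=east)"),
]


def decode(utterance):
    """Map one hypothesis to a dialog act, or null() if nothing matches."""
    lowered = utterance.lower()
    for keyword, act in RULES:
        if keyword in lowered:
            return act
    return "null()"


def dialog_act_distribution(n_best):
    """Sum hypothesis confidences per dialog act."""
    acts = defaultdict(float)
    for utterance, confidence in n_best:
        acts[decode(utterance)] += confidence
    # Assumption: mass not covered by the N-best list goes to null().
    leftover = 1.0 - sum(acts.values())
    if leftover > 1e-9:
        acts["null()"] += leftover
    return dict(acts)


for act, p in dialog_act_distribution(N_BEST).items():
    print(f"{act} {{{p:.1f}}}")
```

Two differently worded hypotheses ("I'd like italian" and "I want an Italian") collapse into one act, so the act-level distribution is more peaked than the word-level one. This distribution is the evidence that the belief state update consumes.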