2011 search query rewrites - synonyms & acronyms

Bay Area Search Wednesday July 27, 2011

Upload: brian-johnson

Post on 22-Jan-2015






July 27, 2011 Bay Area Search Presentation

Brian Johnson, Engineering Director, Query Services @ eBay

Query expansion is an important part of search recall for all search engines. In this talk I'll discuss some of the general trends driving Hadoop adoption within the Search Query Services team at eBay, and the types of algorithms/techniques we've moved to Hadoop at eBay. Over time we've moved from smaller, editorial data sets to large machine-generated data sets mined from behavior log data, items/listings, catalogs, etc. One common workflow is to mine large candidate rewrite/expansion data sets from multiple data sources, use crowdsourced human judgment to classify a subset of the candidates (true positive, false positive), use machine learning techniques to discard false positives, run automated validation on the final data set, and automatically push to production.

Ravi Jammalamadaka, Senior Applied Researcher, Query Services @ eBay

Ravi is a real engineer. Not a pointy haired manager like the previous speaker. Expect some real engineering :-) He'll be doing a literature review for acronym mining and discussing a real world implementation.

Title: Mining Acronyms From Raw Text

Abstract: A significant number of eBay products are known by their acronyms. The eBay query expansion service expands user queries with their acronym equivalents to increase recall. The challenge is to mine acronyms from either seller data (e.g. item descriptions, titles) or buyer data (e.g. queries). Ravi will present the state of the art algorithms from recent conferences that mine acronyms from raw text, along with their limitations. He will present a new acronym mining algorithm that seeks to address the limitations identified in previous algorithms, and a machine learning classifier that seeks to remove the false positives generated by the acronym mining algorithm.


  • 1. Bay Area Search Wednesday July 27, 2011

2. Agenda
6:30 Eat & Greet - Free Food & Beer
7:00 Speaker #1 Brian Johnson
7:45 Speaker #2 Ravi Jammalamadaka
Plan on 2 fabulous 45 minute presentations by excellent local search experts. Please suggest speakers or topics you would like to hear.
Great speakers, good food, fine beer, and everyone's favorite search term - Free, Free, Free :-)
Event will be held at the eBay campus just off 17/880 @ Hamilton in the main Community building. Look for lobby/flagpole.
4th Wednesday of every month
http://www.meetup.com/Bay-Area-Search/

3. How Can I Help?
Speakers
Feedback
Organizers
Videographers

4. Brian Johnson
Brian is the Director of Engineering for Query Services at eBay. He has held this role since January of 2011. Prior to that he managed the engineering teams for Query Understanding (metrics and crowdsourced human judgment), classification, data publishing, and browsing. Brian has been at eBay since 2002.
Prior to eBay Brian was at (http://www.linkedin.com/in/brianscottjohnson)
  Handspring - Managed the team working on email/IM/web browsing for one of the first smartphones (Treo)
  Excite@Home - Director of Engineering for the Excite homepage
  Synopsys - Engineer for chip design visualization
  AT&T Bell Labs - Data visualization research
Brian received his PhD in Computer Science from the University of Maryland in 1993. His papers regarding visualizing hierarchical and categorical data with Treemaps have been cited hundreds of times.
Brian is a pleasure to listen to and I'm sure you'll appreciate his insights from the trenches regarding search query rewrite research and practice at eBay.

5. Ravi Jammalamadaka
Ravi works in the query services team at eBay looking at ways to rewrite user queries to improve both precision and recall.
Received his PhD from University of California, Irvine.
Research on Data Security, Databases
Ravi published 10 research papers in the areas of databases, data security and data mining.
Ravi was invited to be a Program Committee member for IEEE ISI 2010, 2011 and ICDE 2010 (demo track).

6. Query Rewrites
Brian Johnson
Bay Area Search July 27, 2011

7. Documents + Users
SEARCH

8. What Is A Query?
Queries are more than a text box
  Keywords=Red Size 7 Shoes
  Keywords=Red, Category=Shoes
  Keywords=Red, Category=Shoes, Size = 7
Many filter variables affect recall
  Query, category, attributes current context dimension targets
  Format, condition, location/distance, shipping, seller, price

9. Questions About Queries
Popularity/Rank
Supply
Demand
Click Through Rate (CTR)
Conversion
Rewrites/Expansions
Related Searches with CTR & Conversion
Category Supply/Demand/CTR/Sales
Product Supply/Demand/CTR/Sales
Top Products
Items (recalled, view, bin, bid, offer, watch, ask, purchase)
Autocompletes
Classification (broad, narrow, ambiguous, help, navigational)
Purchase Site
Frequency by day, day of week, time of day
Cross Border Sales
Position distribution in user sessions
Result set size
Exit Rate
Exit Destination

10. Data Mining & Machine Learning
TRENDS

11. Query Rewrite Trends
Intelligence: Human -> Machine
Data: Small -> Big
Sources: Few -> Many
Context: Little -> Some

12. EXAMPLES

13. Example Query Services/Rewrites
Related Search: canon sd1300is, canon sd1400 is, canon sd4000, canon sd1400is, canon sd, canon sd1300 is waterproof, canon sd 1300, canon
Stemming (ipod or ipods)
Spelling (cannon or canon)
Condition (new or condition=new)
Synonyms (boat carpet or marine carpet)
Space Synonyms (MarioKart > Mario Kart)
Item Specifics (blue or color=blue)
Acronyms (os = one size in CSA | Operating Systems in Electronics)
Category (shoes or category=63850)
Cross Border (site=0 and category=123) or (site=3 and category=456)
Fitment (fits model=X)
Term Removal (Harry Potter and the Order of the Phoenix (daily deal))

14.
Context & Specificity
Beyond decontextualized single entities
Examples
  Stemming failures
    (cowboy v cowboys) and (hat v hats)
    Doesn't work for cowboy hats & dallas cowboy caps/hats
  hp printer > (hp v hewlett packard) printer
  15 hp pump > 15 (hp v horsepower) pump
  motor bike > motor (bike v cycle)
  audi b6 > (audi v make=audi) & (b6 v platform=b6) v (product=789)
  the who != who the
Time
  Today: latest generation > latest generation v (generation=4)
  Tomorrow: latest generation > latest generation v (generation=5)

15. HOW

16. Architecture
Online (Code + NoSQL Cache)
Offline (Hadoop)
Document & Behavioral Data

17. Better, Faster, Cheaper
Better
  Better recall
  Awesome related search suggestions
  Mind reading spell corrections
Faster
[...]
Artist trading card in ART
ATC > Automatic Tool Change in Business and Industrial
Directional
  Old > Antique
  Yoga towels/mats > Yogitoes
Rescue Project

29. Acronym/Abbreviation Mining and Category Based Expansions
Acronym/Abbreviation Mining
  Acronyms/abbreviations are mined from raw text and query logs
  Look for patterns of text
    long form (short form)
    short form (long form)
  Employ intelligent matching algorithms to mine candidates
  Example title: new cheap Playstation portable (PSP)
  Acronym discovered: PSP -> PlayStation Portable
  Candidates mined are fed through a machine learning classifier to remove the false positives
Category Based Expansions
  hp
    Electronics: Hewlett Packard
    Cars and Trucks: horsepower
  System allows
    Category based expansions
    Directional expansions
    Positive and Negative expansions

30. THANKS & QUESTIONS

31. Mining Acronyms/Abbrevia
[...]
PlayStation 3
eBay, Inc.

40. Schwartz et al
Pros:
  Finds almost all abbreviations and acronyms
Cons:
  High false positive rate. Example: Foot Massage Diabetes Treatment (FEET)
  Suffers from the truncated long form problem. Example: American Automobile Association (AAA)

41. Acronym-Expansion Recognition and Ranking on the Web
First few characters match
Ignore stop words
Example: Cool -> Cooperation in Ontology and Linguistics.
Alpa Jain, Silviu Cucerzan, Saliha Azzam.
Acronym-Expansion Recognition and Ranking on the Web.

42. Jain et al
Pros:
  Low false positive rate
Cons:
  Does not do a good job at identifying abbreviations
  Misses a lot of actual acronyms
  Will not find the PlayStation 3 and PS3 association

43. eBay Acronym Mining Architecture
Components shown in the diagram: User Data, Candidate Generator, Feature Extractor, Classifier, Dictionary, Human Judgment, A/B Test, Live on Site

44. eBay Acronym/Abbreviation Mining Algorithm
Desirable properties
  Find all abbreviations and acronyms, like the greedy match
  Reduce the number of false positives
  Solve the truncated long form problem
What makes a good acronym expansion pair?
  Characters in the acronym are found at the beginning of the words.
  Expansions generally do not have words that are skipped or not represented in the acronym.
Can a cost metric capture the intuition?

45. Cost Based Approach for Mining Abbreviations
CIM ------- Computer Interface Module            Total Cost: Low cost
PVC ------- PolyVinyl Chloride                   Total Cost: Medium cost
HSF ------- Heat shock transcription factor      Total Cost: High cost

46. Cost Based Recursive Algorithm
Title: new American Automobile Association (AAA) map of mexico
Objective: Find the longest form with the lowest cost
American Automobile Association (AAA)
Min ( American Automobile Associ (AA), American Automobile Associ (AAA) ) + Cost so far

47. Salient Properties of the new algorithm
If Cost > Threshold, then the long form is a false positive.
As cost increases
  False positives increase
  The chance that a real acronym is not identified decreases
As cost decreases
  False positives decrease
  The chance that a real acronym is not identified increases
At lower costs, the algorithm behaves like the first few characters match.
At high costs, the algorithm behaves like the greedy match algorithm.

48.
Experiments
Sample Dataset: 2.5 million item titles

Algorithm                     Total Candidates   False Positive Rate   Yield
Greedy Match                  2548               39%                   1554
First Few Characters Match    759                4%                    728
Cost Based Match, k1          1223               14%                   1051
Cost Based Match, k2          1604               16%                   1284
Cost Based Match, k3          2023               20%                   1554

49. Removing false positives
Goal
  Develop a classification algorithm that will classify whether a candidate is an acronym or not.
Classification algorithm
  Decision trees
  TreeNet data mining tool
Candidates are tagged with many features.
The classifier learns on the tagged golden set.
New candidates are then run through the classifier.

50. Example of a Decision Tree
Training Data

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund
  Yes: NO
  No: MarSt
    Married: NO
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES

Acknowledgements: George Kollios

51. Features: Neighborhood Similarity
Rationale: Two synonym candidates A and B will tend to have similar neighbors (viz. keywords) surrounding them.
Neighborhood similarity = |Neighbours(A) ∩ Neighbours(B)| / min(|Neighbours(A)|, |Neighbours(B)|)

52. Features: Mutual Information
Rationale: The goal of this metric is to determine whether the candidates co-occur in the description significantly more often than they would by random chance.

53. Features: KL divergence
Rationale: Two synonym candidates will have similar category distributions of their inventory.

54. KL distance: Example
Example 1
  Ipods: Electronics (50), Clothing Shoes and Accessories (1)
  Ipod: Electronics (100), Clothing Shoes and Accessories (3)
  KL divergence: 0.83
Example 2
  Ipod: Electronics (100), Clothing Shoes and Accessories (3)
  T-shirt: Clothing Shoes and Accessories (1000), Uniforms (50)
  KL divergence: 128592.74

55.
Classifier Decision Tree Example
KL Distance
  > 2.5: False Positive
  otherwise: Neighbourhood Similarity
    > 0.2: Mutual Information
      > 0.003: True Positive
      otherwise: False Positive
    otherwise: False Positive

56. Classifier Results
The false positive rate at the candidate generation stage is 20%.
The false positive rate after going through the classifier is 5.5%.
The remaining false positives are removed by human judges.

57. Conclusions
We presented the state of the art algorithms for acronym mining and their limitations.
We presented a new cost based algorithm for mining acronyms from raw text that seeks to address the limitations of the previous algorithms.
We presented a classifier approach to remove false positives.
We experimentally validated our approach and showed that it is a viable approach for mining acronyms.

58. References
[1] Ariel S. Schwartz, Marti A. Hearst. A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text.
[2] Youngja Park, Roy J. Byrd. Hybrid Text Mining for Finding Abbreviations and Their Definitions.
[3] Mathieu Roche, Violaine Prince. Managing the Acronym/Expansion Identification Process for Text-Mining Applications.

59. References (2)
[4] Yee Fan Tan, Ergin Elmacioglu, Min-Yen Kan, Dongwon Lee. Efficient Web-Based Linkage of Short to Long Forms.
[5] Alpa Jain, Silviu Cucerzan, Saliha Azzam. Acronym-Expansion Recognition and Ranking on the Web.
[6] Xiaonan Ji, Gu Xu, James Bailey, Hang Li. Mining, Ranking and Using Acronym Patterns.

60. Thanks
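
Appendix: a sketch of cost based matching

The cost based approach on slides 44-47 is described only at the level of intuition, so the following is a minimal sketch under stated assumptions, not the algorithm eBay deployed. The cost values (`INNER_COST`, `SKIP_COST`), the default threshold, the regular expressions, and every function name are hypothetical. The sketch scores a "long form (short form)" pattern by charging nothing for an acronym letter that starts a word, a small cost for a letter found inside a word, and a larger cost for a skipped word. It then keeps the longest long form with the lowest cost, which is the stated objective on slide 46.

```python
import math
import re
from functools import lru_cache
from typing import Optional, Tuple

INNER_COST = 1.0  # assumed: acronym letter matched inside a word, not at its start
SKIP_COST = 2.0   # assumed: long-form word not represented in the acronym


def is_subsequence(needle: str, haystack: str) -> bool:
    """True if the characters of needle appear in haystack in order."""
    it = iter(haystack)
    return all(ch in it for ch in needle)


def word_cost(word: str, chars: str) -> Optional[float]:
    """Cost of letting one word account for chars, or None if it cannot."""
    if not chars:
        return SKIP_COST
    if word[0] == chars[0] and is_subsequence(chars[1:], word[1:]):
        return INNER_COST * (len(chars) - 1)
    if is_subsequence(chars, word[1:]):
        return INNER_COST * len(chars)
    return None


def align_cost(words: Tuple[str, ...], acronym: str) -> float:
    """Lowest cost of aligning every acronym letter to the words, in order."""
    @lru_cache(maxsize=None)
    def best(wi: int, ci: int) -> float:
        if wi == len(words):
            return 0.0 if ci == len(acronym) else math.inf
        lowest = math.inf
        min_take = 1 if wi == 0 else 0  # the first word must contribute a letter
        for take in range(min_take, len(acronym) - ci + 1):
            cost = word_cost(words[wi], acronym[ci:ci + take])
            if cost is not None:
                lowest = min(lowest, cost + best(wi + 1, ci + take))
        return lowest

    return best(0, 0)


def find_expansion(title: str, threshold: float = 1.0) -> Optional[Tuple[str, str, float]]:
    """Return (acronym, long form, cost) for a 'long form (ACRONYM)' title."""
    match = re.search(r"\(([A-Za-z0-9]{2,6})\)", title)
    if match is None:
        return None
    acronym = match.group(1).lower()
    words = re.findall(r"[a-z0-9]+", title[:match.start()].lower())
    best = None
    for n in range(1, min(len(words), len(acronym) + 2) + 1):
        window = tuple(words[-n:])
        cost = align_cost(window, acronym)
        # Lowest cost wins; on a tie, prefer the longer long form.
        if cost <= threshold and (best is None or (cost, -n) < (best[0], -len(best[1]))):
            best = (cost, window)
    if best is None:
        return None
    return match.group(1), " ".join(best[1]), best[0]


print(find_expansion("new American Automobile Association (AAA) map of mexico"))
print(find_expansion("Foot Massage Diabetes Treatment (FEET)"))
```

With these assumed costs, the AAA title yields the full long form "american automobile association" at cost 0.0 rather than a truncated one, and the FEET title is rejected at the default threshold of 1.0. Raising the threshold to 2.0 admits "Heat shock transcription factor (HSF)" but also lets the FEET false positive through, which mirrors the trade-off on slide 47.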