rapid pruning of search space through hierarchical matching
DESCRIPTION
Presented by Chandra Mouleeswaran, Co-Chair at IntelliFest.org, ThreatMetrix. This talk presents our experience applying Lucene/Solr to the classification of user and device data. On a daily basis, ThreatMetrix, Inc. handles a huge volume of volatile data. The primary challenge is classifying each incoming transaction rapidly and precisely by searching a huge index within a very strict latency specification. The audience will be taken through the various design choices and the lessons learned. The talk also details a hierarchical search procedure that systematically divides the search space into manageable partitions while maintaining precision.
TRANSCRIPT
![Page 1: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/1.jpg)
RAPID PRUNING OF SEARCH SPACE THROUGH HIERARCHICAL MATCHING
Chandra Mouleeswaran
Machine Learning Scientist, ThreatMetrix Inc.
5/2/13
![Page 2: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/2.jpg)
My Background
• Machine Learning Scientist at ThreatMetrix Inc.
• Co-Chair, Developer Programs, IntelliFest.org, Oct 2013, San Diego, CA
Career Path
- Siemens Corporate Research: Learning & Expert Systems
- Technology division of Donaldson, Lufkin and Jenrette company (Pershing): Artificial Intelligence Group
- Network Monitoring
- Several startups: Classification, Web Crawling, Security, Financial Trading, etc.
![Page 3: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/3.jpg)
Outline
• Task description
• Approaches
• Why search paradigm?
• Hierarchical matching
• Results
• Acknowledgments
![Page 4: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/4.jpg)
The Device Identification Task
• Computationally, it's a CLASSIFICATION problem: { a0, a1, a2, a3, ..., an } → { ci }
  ai = ( attribute | field | key ) value
  ci = ( label | signature | class | hash )
• Returning devices should be correctly identified within certain tolerances
• New classes may be created if a good match is not found in the repository of known devices
• Devices age out, based on data retention policy
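The match-or-create behaviour described above can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Repository` class, the `similarity` measure (fraction of agreeing attributes), the 0.75 tolerance, and the label scheme are hypothetical and are not taken from the talk. Age-out is omitted.

```python
from dataclasses import dataclass, field


def similarity(a, b):
    """Fraction of attribute keys (over the union) whose values agree."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    return sum(1 for k in keys if a.get(k) == b.get(k)) / len(keys)


@dataclass
class Repository:
    """Hypothetical in-memory repository of known devices: label -> attributes."""
    threshold: float = 0.75          # assumed tolerance, not from the talk
    devices: dict = field(default_factory=dict)

    def classify(self, attrs):
        """Return (label, created). A new class is created when no known
        device matches within the tolerance."""
        best_label, best_score = None, 0.0
        for label, known in self.devices.items():
            score = similarity(attrs, known)
            if score > best_score:
                best_label, best_score = label, score
        if best_label is not None and best_score >= self.threshold:
            return best_label, False
        new_label = f"c{len(self.devices)}"
        self.devices[new_label] = dict(attrs)
        return new_label, True
```

A linear scan like this is exactly what does not scale to hundreds of millions of devices, which is what motivates the search-based approach in the rest of the talk.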
![Page 5: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/5.jpg)
Task Challenges
• Extremely volatile attributes
• There are no pivot attributes to divide and conquer the search space
• Changing distributions
• Emphasis on PRECISION
• Stringent RESPONSE time
![Page 6: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/6.jpg)
Engineering Challenges
• Precision (accuracy) and latency (response time) are antagonistic constraints
• Project management

| | Repository Size (millions) | Load (TPS) | Latency (ms) |
|---|---|---|---|
| Project start | 28 | 200 | < 100 |
| Present | 280 | 300 | < 100 |
| Change | 10 X | 1.5 X | None |
![Page 7: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/7.jpg)
Approaches
• Rules engine
• Learning models
• Vector space models
Need an enterprise-grade solution!
![Page 8: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/8.jpg)
Rules Engine
• No experts
• Number of rules?
• Maintenance?
Not a viable approach!
![Page 9: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/9.jpg)
Learning Models
• Most machine learning methods deal predominantly with binary classification problems (e.g. fraud / not fraud) or a small number of target classes
• Few exemplars for each class
• Attribute values may be unbounded
• Attributes may not follow a natural progression
![Page 10: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/10.jpg)
Learning Models …
• Unsupervised learning such as clustering methods would make good models, but not good enough to be of practical use. Any simplification process will compromise on accuracy
• Ability to explain is critical
• Tend to ignore domain knowledge
Challenge in providing enterprise solution
![Page 11: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/11.jpg)
Thoughts
• No comparable application with such requirements
• Build and deploy a classifier that explains itself easily, scales temporally and offers quick response
• Use domain knowledge to guide verification
• Improve the classifier through machine learning methods by analyzing performance in the field
![Page 12: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/12.jpg)
Vector-Space Models
• Similarity-based search makes the vector-space model a good choice for generating selections
• Given the volatile nature of data, information retrieval (IR) systems can adapt easily
• Good at neighborhood search
Sensitive to individual attribute changes!
![Page 13: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/13.jpg)
Sources of Inspiration
• Lucene/Solr features
• Documentation from (erstwhile) Lucid Imagination
• Ease with which Lucene/Solr could be installed and explored
Very short learning curve for novices!
![Page 14: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/14.jpg)
Feature Selection
• Primitive and derived attributes
• Entropy
• Distribution
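The slide names entropy as a feature-selection criterion without saying how it was applied. As a hedged illustration only, the sketch below computes the Shannon entropy of each attribute's observed values and ranks attributes by it; the function names and the idea of ranking by descending entropy are assumptions, not the speaker's stated procedure.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rank_attributes(records):
    """Rank attribute names by the entropy of their values across `records`
    (a list of dicts). Higher entropy means the attribute discriminates more
    between devices; lower entropy means most devices share the same value."""
    names = {k for r in records for k in r}
    scored = {n: shannon_entropy([r.get(n) for r in records]) for n in names}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```

An attribute that takes the same value on every device scores zero and contributes nothing to telling devices apart, while an attribute split evenly between two values scores one bit.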
![Page 15: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/15.jpg)
Domain
• Devices come with structural information but not much grammar or semantics
• Bag-of-words (single field) approach is fast but not precise
• Using all fields is precise but response is slow
Now what?
![Page 16: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/16.jpg)
Disjunction Max
• Matrix of all possible combinations of user input query and document fields
• Transforms into a Boolean query of DisjunctionMaxQueries, one per row
• DisjunctionMaxQuery uses the maximum score of its sub-clauses
• No single term in user input dominates
This is needed!
Src: SearchHub and LucidWorks
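The scoring idea behind the bullets above can be sketched in a few lines of Python. This is a simplified model rather than Lucene's implementation: it assumes per-field scores are already computed, takes the maximum per query term (plus an optional tie-breaker fraction of the remaining field scores, as Lucene's `DisjunctionMaxQuery` does), and sums across terms, ignoring normalisation and boosts. The function names are illustrative.

```python
def dismax_score(field_scores, tie=0.0):
    """Score of one query term across fields: the best field score, plus
    `tie` times the sum of the other field scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


def query_score(matrix, tie=0.0):
    """`matrix` has one row per query term and one column per document field.
    Each row is collapsed with dismax_score, and the rows are summed the way
    the enclosing Boolean query combines its clauses."""
    return sum(dismax_score(row, tie) for row in matrix)
```

With `tie=0.0` a term that matches weakly in several fields scores no higher than its single best field, which is why repeated matches of one term cannot dominate the overall score.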
![Page 17: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/17.jpg)
DisMax Experiments (index size = 60 million)
Scenario 1
mm = 2
Solr fields = { a1, a2, a3 }
Values = { phrase1, phrase2, phrase3 }
Must-Match Clauses
Latency: YES (35 ms)
Precision: NO (20% failure)
Scenario 2
mm = 50%
Solr fields = { a1 }
Values = { term1, term2, term3, ..., termn }
Should-Match Clauses
Latency: NO (> 2 seconds)
Precision: YES (> 98%)
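For readers unfamiliar with the parameters, the two scenarios roughly correspond to DisMax requests like the ones built below. `defType`, `q`, `qf`, `mm` and `rows` are standard Solr request parameters; the field names `a1` to `a3` and the placeholder values come from the slide. The exact queries used in the experiments are not shown in the talk, so the quoting of phrases and the `rows` value are assumptions.

```python
from urllib.parse import urlencode


def dismax_params(values, fields, mm, rows=10):
    """Build Solr DisMax request parameters as a plain dict."""
    return {
        "defType": "dismax",
        "q": " ".join(values),       # user input: phrases or terms
        "qf": " ".join(fields),      # fields to search
        "mm": mm,                    # minimum-should-match: count or percentage
        "rows": rows,                # assumed number of candidates to return
    }


# Scenario 1: three quoted phrases over three fields, at least two must match.
scenario1 = dismax_params(['"phrase1"', '"phrase2"', '"phrase3"'],
                          ["a1", "a2", "a3"], mm="2")

# Scenario 2: many terms over a single bag-of-words field, half must match.
scenario2 = dismax_params(["term1", "term2", "term3"], ["a1"], mm="50%")

scenario1_qs = urlencode(scenario1)
scenario2_qs = urlencode(scenario2)
```

The resulting query strings would be appended to a collection's `/select` endpoint; the host and collection name depend on the deployment and are deliberately left out.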
![Page 18: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/18.jpg)
Possible Workaround
• Look-ahead: Customize Lucene/Solr to do a branch-and-bound search, bail out on some lower bound score
• Minimize candidates for DisMax search
- reduce total number of Solr instances to search
- reduce total number of disjunctive terms
[ Empirical estimate: t_n = 2 * t_{n-1}, where t = time and n = number of disjunctive terms ]
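The empirical estimate says each additional disjunctive term doubles the search time, so time grows as 2^n. A small worked example makes the consequence concrete; the term counts used in the comments are hypothetical and not measurements from the talk.

```python
def relative_latency(n_terms, baseline_terms):
    """Latency relative to a baseline query, under the empirical estimate
    t_n = 2 * t_{n-1} (each extra disjunctive term doubles the time)."""
    return 2.0 ** (n_terms - baseline_terms)


# Hypothetical example: trimming a 10-term query to 7 terms.
# relative_latency(7, 10) is 2**-3, i.e. one eighth of the baseline time.
# Adding a single term instead doubles it: relative_latency(11, 10) is 2.0.
```

Under this model, removing even a few terms pays off disproportionately, which is why the workaround concentrates on reducing the number of disjunctive terms.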
![Page 19: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/19.jpg)
Phrases over Terms
• Used collocation (co-occurrence matrix) to determine most common phrases
• Delete terms covered by phrases
• Add stop words based on frequency analysis
• Ensure precision is preserved through regression tests
Reduced the number of DisMax terms by 30%
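The phrase-extraction steps above can be approximated as follows. This sketch counts adjacent token pairs, which is a simplification of a full co-occurrence matrix, and then rewrites a query so that pairs known to be common phrases replace their constituent terms. The function names, the `min_count` threshold and the stop-word handling are assumptions for illustration.

```python
from collections import Counter


def common_phrases(docs, min_count=2):
    """Return adjacent token pairs that occur at least `min_count` times
    across `docs` (each doc is a list of tokens)."""
    pair_counts = Counter()
    for tokens in docs:
        pair_counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, count in pair_counts.items() if count >= min_count}


def rewrite_query(tokens, phrases, stop_words=frozenset()):
    """Drop stop words, then merge each known phrase into a single clause so
    the terms it covers no longer appear as separate DisMax terms."""
    kept = [t for t in tokens if t not in stop_words]
    clauses = []
    i = 0
    while i < len(kept):
        if i + 1 < len(kept) and (kept[i], kept[i + 1]) in phrases:
            clauses.append(f"{kept[i]} {kept[i + 1]}")
            i += 2
        else:
            clauses.append(kept[i])
            i += 1
    return clauses
```

Each merged phrase removes one disjunctive clause from the query, so the clause count shrinks without discarding the information carried by the original terms.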
![Page 20: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/20.jpg)
Sources of Inspiration
• Planning in a Hierarchy of Abstraction Spaces, Artificial Intelligence, Vol. 5, No. 2, pp. 115-135 (1974)
• Search Reduction in Hierarchical Problem Solving, Proc. of the 9th IJCAI, AAAI Press, Menlo Park, CA (1991)
• Exceptional Data Quality Using Intelligent Matching and Retrieval, AI Magazine, AAAI Press (Spring 2010)
![Page 21: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/21.jpg)
Hierarchical Matching
[Diagram: hierarchical matching pipeline. Box labels: Bag of words; Models; Phrases; Filters; DisMax; Query Formulator; Domain-specific patterns; CSV/JSON; Solr instances selector; To Solr Servers; Verification]
![Page 22: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/22.jpg)
Conflict Resolution
• Top n candidates are returned from each Solr instance
• They are ranked based on custom verification module
• Ties are broken using recency
• Top candidate is persisted and returned along with custom score
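The conflict-resolution steps above amount to a merge, a re-rank and a tie-break. The sketch below shows one way to express that. The `Candidate` fields and the `verify` function (fraction of matched attributes) are hypothetical stand-ins, since the talk does not describe the custom verification module's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    device_id: str
    matched_attrs: int   # attributes agreeing with the incoming transaction
    total_attrs: int
    last_seen: int       # e.g. epoch seconds; larger means more recent


def verify(candidate):
    """Stand-in for the custom verification module: fraction of matches."""
    return candidate.matched_attrs / candidate.total_attrs


def resolve(candidates_per_instance, score=verify):
    """Merge the top-n lists from every Solr instance, rank by the custom
    score, break ties by recency, and return (winner, custom_score)."""
    merged = [c for top_n in candidates_per_instance for c in top_n]
    if not merged:
        return None
    best = max(merged, key=lambda c: (score(c), c.last_seen))
    return best, score(best)
```

Sorting on the tuple `(score, last_seen)` keeps the tie-break rule in one place: recency only matters when two candidates have identical custom scores.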
![Page 23: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/23.jpg)
Comments
• DisMax performs multidimensional match
• Extracted multiple filters and arranged them hierarchically
• Separation of selection and evaluation
- Selection = approximate solution
- Evaluation = refinement
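The separation of selection and evaluation can be sketched as a small pipeline: cheap filters prune the repository, an approximate score shortlists candidates, and a stricter score picks the winner. All names below (`same_os`, `select_score`, `evaluate_score`, the `top_n` cut-off) are hypothetical, and in the talk the selection step is performed by Solr rather than by in-memory Python.

```python
def same_os(query, device):
    """Example filter (assumed): keep devices whose 'os' attribute matches."""
    return query.get("os") == device.get("os")


def select_score(query, device):
    """Selection = approximate solution: bag-of-words overlap of values."""
    return len({str(v) for v in query.values()} & {str(v) for v in device.values()})


def evaluate_score(query, device):
    """Evaluation = refinement: fraction of query fields matched exactly."""
    return sum(1 for k, v in query.items() if device.get(k) == v) / len(query)


def hierarchical_match(query, repository, filters, top_n=3):
    """`repository` maps device id -> attribute dict. Filters are applied in
    order, from coarse to fine, before any scoring happens."""
    ids = list(repository)
    for keep in filters:
        ids = [i for i in ids if keep(query, repository[i])]
    shortlist = sorted(ids, key=lambda i: select_score(query, repository[i]),
                       reverse=True)[:top_n]
    if not shortlist:
        return None
    best = max(shortlist, key=lambda i: evaluate_score(query, repository[i]))
    return best, evaluate_score(query, repository[best])
```

The expensive evaluation runs only on the shortlist, so its cost is bounded by `top_n` regardless of repository size.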
![Page 24: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/24.jpg)
Where time went...
• Attribute selection
• Ranking
• Optimization
• Index re-generation
• Regression testing
![Page 25: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/25.jpg)
Sources for Tune Up
• Scaling Solr, Lucene Revolution, May 2011
• Practical Search with Solr: Beyond just Looking it Up, Lucid Imagination, May 2010
![Page 26: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/26.jpg)
Testing
• Precision testing using self and mixed modes
• Latency tests
- custom harness for stand-alone tests
- integrated tests with JMeter framework
![Page 27: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/27.jpg)
Results
![Page 28: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/28.jpg)
Latency Percentiles
[Chart legend:]
original edismax
Initial solution
Optimization 1: Filters, Focused search, verification
Optimization 2: Domain patterns, Stop words, de-dupe
![Page 29: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/29.jpg)
TPS
![Page 30: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/30.jpg)
Response Times over Time
![Page 31: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/31.jpg)
Project Execution
• Agile Methodology
• Risk mitigation through primary and contingency plans
• Rapid prototyping followed by good software engineering practices
• Evaluating DSE (DataStax) & Solr Cloud
![Page 32: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/32.jpg)
Gleanings
• You can classify anything with Lucene/Solr; the lexicon is your own
• The question is not whether Lucene/Solr can solve a particular classification problem, but whether you can prioritize among the many ways of doing it
• If you run into a problem, someone has solved it or will solve it in the near future
![Page 33: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/33.jpg)
Gleanings …
• Deal with accuracy before latency
• If precision, latency and scale are all critical to your domain, expect to invest some time in hierarchical abstractions
• "Index once, run anytime, anywhere" does not apply during development
• Throwing all data at Lucene/Solr will not work for mission-critical applications
• Rapid prototyping and willingness to fail
![Page 34: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/34.jpg)
Summary
Simplify and match at multiple levels of abstraction
![Page 35: Rapid pruning of search space through hierarchical matching](https://reader035.vdocuments.net/reader035/viewer/2022081403/556a714dd8b42ab0468b5356/html5/thumbnails/35.jpg)
Contributors
Chandra Mouleeswaran: Research & Prototyping
Fang Chen: Research & Prototyping
Luke Mertens: Productization & Scalability
Brent Pearson: Release Management
Tracy Hsu: Precision Testing & QA
Srinivas Nayani: Deployment & QA