Build Realtime Search: From Mobile SDK to SaaS, a Tech POV
Sylvain Utard
SaaSisBeautiful #2 – June 2014
• Today Search means Google
• Search is a daily activity
• Search is complex
• Databases do not handle text queries
• Speed and relevance are key
• Fuzzy matching (typo-tolerance)
Search
• Databases
• Optimized for INSERT/UPDATE/DELETE/SELECT (that's a lot)
• Structured query syntax (mostly SQL)
• Some operations scan all your rows
Why Search Engines?
• Search engines
• HIGHLY optimized for “SELECT” (only)
• Full-text queries: understand what a word is
• Query execution time driven by the number of matching documents
• And obviously, “LIKE '%foo%'” is not full-text search
Why Search Engines?
• Indexing(input=documents)
• Multiple attributes (textual, numerical, geo)
• Search(input=query, output=documents)
• Full-‐text queries and/or numerical filters
• Understandable results: score (ranking) + highlighting
How does it work?
• 2 distinct processes
• Indexing
• Storing documents in a highly optimized way
• Query
• Matching documents
• Ranking matched documents
Implementation
• Indexing means building an “index” or “inverted lists”
• A dedicated data structure optimized for search (only)
• Input = a set of documents containing words
• Output = a set of words associated with documents
Implementation: Indexing process
Implementation: Indexing process
Documents
Doc 1: foo bar baz
Doc 2: bar foo
Doc 3: baz baz qux
→ Indexing
Index (inverted lists)
foo: Doc 1, Doc 2
bar: Doc 1, Doc 2
baz: Doc 1, Doc 3
qux: Doc 3
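The indexing step in this diagram can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the engine's implementation: the `build_inverted_index` helper is hypothetical and tokenization is a naive whitespace split.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the sorted list of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive tokenizer (assumption): lowercase, then split on whitespace.
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(doc_ids) for word, doc_ids in index.items()}

# The three documents from the slide.
documents = {1: "foo bar baz", 2: "bar foo", 3: "baz baz qux"}
index = build_inverted_index(documents)
```

Note that "baz" appears twice in Doc 3 but the document is listed only once: an inverted list records which documents contain a word, not every occurrence.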
• Queries
• Goal = Retrieve all documents matching a user query
• Order results from the highest ranked to the lowest
Implementation: Query Process
Implementation: Query Process
User query "baz"
Index (inverted lists)
foo: Doc 1, Doc 2
bar: Doc 1, Doc 2
baz: Doc 1, Doc 3
qux: Doc 3
→ Sort matching documents
→ Pagination
Implementation: Query Process
User query "baz qux"
Index (inverted lists)
foo: Doc 1, Doc 2
bar: Doc 1, Doc 2
baz: Doc 1, Doc 3
qux: Doc 3
→ Intersect inverted lists
→ Sort matching documents
→ Pagination
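The query steps in these diagrams (intersect the inverted lists, sort the matches, paginate) could look like the following Python sketch. The `search` function, its parameters, and the default ordering by document ID are assumptions made for illustration; the slides do not specify them.

```python
def search(index, query, rank_key=None, page=0, hits_per_page=10):
    """Return one page of document IDs matching every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    # One inverted list per query word; an unknown word yields an empty list.
    postings = [set(index.get(word, ())) for word in words]
    # A document matches only if it appears in all of the lists.
    matches = set.intersection(*postings)
    # Placeholder ranking: sort by rank_key, or by document ID if none is given.
    ranked = sorted(matches, key=rank_key)
    start = page * hits_per_page
    return ranked[start:start + hits_per_page]

# The inverted lists for the three example documents.
index = {"foo": [1, 2], "bar": [1, 2], "baz": [1, 3], "qux": [3]}
single = search(index, "baz")
both = search(index, "baz qux")
```

Because each query word contributes only its own inverted list, the work is proportional to the number of matching documents rather than to the size of the whole collection.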
Database Search
Documents / Database entries
• Founded in 2012
• 2012 → Mar 2013
• Mobile-oriented
• Now: SaaS-oriented
• Search engine as a Service
Title
• Embed a Search Engine in your App
• iOS, Android, Windows Phone
• SDK/library provider
• Offline
• Ideal customers
• Evernote, Contacts, POI, …
Mobile first
• Search as you type
• Typo-tolerance
• High-performance
• Target most phones
• Starting from the cheapest Android phone
Mobile focus
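Search as you type combined with typo-tolerance can be illustrated with a prefix-aware edit distance. The sketch below is a generic textbook approach, not the algorithm used by the SDK; the function names and the `max_typos` default are assumptions.

```python
def prefix_edit_distance(query, word):
    """Smallest Levenshtein distance between `query` and any prefix of `word`."""
    # previous[j] = distance between the query consumed so far and word[:j].
    previous = list(range(len(word) + 1))
    for i, qc in enumerate(query, start=1):
        current = [i]
        for j, wc in enumerate(word, start=1):
            cost = 0 if qc == wc else 1
            current.append(min(previous[j] + 1,          # drop a query character
                               current[j - 1] + 1,       # add a missing character
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    # The query is still being typed, so any prefix of the word may match.
    return min(previous)

def suggest(query, words, max_typos=1):
    """Words whose beginning matches the query with at most `max_typos` typos."""
    q = query.lower()
    return [w for w in words if prefix_edit_distance(q, w.lower()) <= max_typos]

words = ["evernote", "contacts", "events"]
typo_hits = suggest("evr", words)
exact_hits = suggest("cont", words)
```

This version compares the query with every word, which is fine for a demo; a real engine would avoid scanning the whole dictionary on each keystroke.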
• 10-20 queries / sec
• Realtime if <100ms
• 1 sec to build a 10K-entry index
• C++ engine + Objective-C/C#/Java interfaces
• <100KB of RAM, whatever the index size
Mobile Performance
• Same issues on websites & apps
• Users are used to Google/Amazon: it just works
• Poor search experience everywhere
• SQL/NoSQL technologies do not provide a working solution
What about hosted search?
Hosted Search
1. Push a copy of your data
2. Get blazing fast search
• Open-source
• ElasticSearch, Solr, Sphinx
• Commercial
• Hosted ElasticSearch/Solr/Sphinx
• Enterprise-oriented on-premise engines
Alternatives
• Mostly document oriented
• Designed to search in “big” documents
• Statistical ranking algorithm
• No instant-search capabilities
Alternatives
• Database Search
• Semi-structured objects (multiple attributes)
• Give importance to the right attributes
• Combine text relevance & record popularity
Database Search
• No stats, no TF-IDF, no “score”
• Tie-breaking based, one criterion after another
1. # typos
2. geo
3. proximity
4. attribute weight
5. exact match
6. custom
Record rank
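Tie-breaking ranking maps naturally onto a lexicographic sort: records are compared on the first criterion, and the next one is consulted only when they tie. The sketch below uses Python tuples for this. The `Hit` fields, their units, and the direction of each criterion are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    name: str
    typos: int            # fewer is better
    geo_distance: float   # distance from the user, smaller is better
    proximity: int        # gap between matched query words, smaller is better
    attribute_rank: int   # position of the matched attribute, smaller is better
    exact_matches: int    # more is better
    custom: int           # e.g. popularity, higher is better

def ranking_key(hit):
    # Tuples compare element by element, so a later criterion only
    # matters when all earlier ones are equal. "Higher is better"
    # criteria are negated to fit the ascending sort.
    return (hit.typos, hit.geo_distance, hit.proximity,
            hit.attribute_rank, -hit.exact_matches, -hit.custom)

hits = [
    Hit("A", 1, 0.0, 1, 0, 1, 900),  # popular, but matched with a typo
    Hit("B", 0, 0.0, 2, 0, 1, 10),
    Hit("C", 0, 0.0, 1, 1, 0, 5),
    Hit("D", 0, 0.0, 1, 1, 0, 50),   # ties with C until the custom criterion
]
ranked = [h.name for h in sorted(hits, key=ranking_key)]
```

Record A ends up last despite its high popularity: popularity is the final tie-breaker, so it never outweighs a typo.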
• C++ mobile SDK → C++ backend search engine
• hosted as an NGINX module
• multi-tenant (shared resources)
• fault-tolerant (SLA 99.99%)
• Faceting, synonyms, analytics, …
Repackaging + Improvements
• Each cluster = 3 machines
• Distributed consensus (SLA)
• Multiple datacenters (EU, US, Asia)
• Bare-metal servers
• 6c (12t) 3.5 GHz
• 128 GB RAM
• 2x480 GB SSD (RAID-0)
SaaS Architecture
SaaS Architecture v1
• More and more users
• API slaughter
• Too much I/O
• Writes / sec
• Consensus
SaaS Architecture v2
SaaS Architecture v2
• Data privacy
• Send us only non-critical data
• Dedicated cluster
• Per end-user security
• Restrict the result set per end-user, per tag, …
• Crawling
• Built-in rate-limits
SaaS Security
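One way to restrict the result set per end-user is to have the customer's backend sign the allowed tag, so that a client cannot search under another user's tag. The slides do not describe the mechanism, so the HMAC scheme below is an assumption, and `sign_tag`, `restricted_search`, and the record layout are hypothetical names.

```python
import hashlib
import hmac

def sign_tag(secret_key: bytes, tag: str) -> str:
    """Token proving that the customer's backend authorized searches on `tag`."""
    return hmac.new(secret_key, tag.encode("utf-8"), hashlib.sha256).hexdigest()

def restricted_search(records, query, tag, token, secret_key):
    """Return matching records, but only those carrying the signed tag."""
    # Constant-time comparison avoids leaking the expected token.
    if not hmac.compare_digest(sign_tag(secret_key, tag), token):
        raise PermissionError("token does not match the requested tag filter")
    return [r for r in records if tag in r["tags"] and query in r["text"]]

SECRET = b"server-side-secret"  # hypothetical; never shipped to the client
records = [
    {"text": "invoice march", "tags": {"user_42"}},
    {"text": "invoice april", "tags": {"user_7"}},
]
token = sign_tag(SECRET, "user_42")  # issued by the backend for user 42
hits = restricted_search(records, "invoice", "user_42", token, SECRET)
```

A token issued for `user_42` is rejected if the client swaps in `user_7`, because the signature no longer matches the requested tag.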
• 2B operations in June
• 30% month-over-month growth in MRR
• 40+ servers
What about scalability?
Monitoring
• ServerDensity
• Custom probes
• Alerts
• SMS
• RAM over-booking
• Small memory footprint per index
• All indexes are mmapped
• Lazy loading (no query = no RAM consumption)
• SSD
• Disable swapping
• Set up a new cluster if the current one is full
RAM
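The lazy-loading behaviour of memory-mapped indexes can be demonstrated with Python's standard `mmap` module. Mapping a file only reserves address space; the operating system loads pages into RAM when they are first read. The file content and the `open_index` helper below are hypothetical and stand in for a real index format.

```python
import mmap
import os
import tempfile

def open_index(path):
    """Memory-map an index file read-only; pages are loaded on first access."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file. The mapping stays valid after
        # the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Build a throwaway "index" file for the demo (made-up text format).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"foo:1,2\nbar:1,2\nbaz:1,3\nqux:3\n")
    path = tmp.name

index = open_index(path)
# Only the pages touched by this lookup need to be resident in memory.
offset = index.find(b"baz:")
line = index[offset:index.find(b"\n", offset)]
index.close()
os.remove(path)
```

An index that is mapped but never queried costs almost no physical memory, which is what makes over-booking RAM across many tenants workable.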
• Do NOT trust your default system configuration
• I/O: not optimized for SSD
• Memory: not optimized for 128GB RAM
• Network: not optimized for 10K+ keep-alive connections
Network
• Automatic
• Ability to rollback
• Ability to test on a “fake” production env
Deployment
• Your server will:
• reboot
• crash
• explode
• Make it happen now!
Hardware
Questions?