Search at LinkedIn, by Sriram Sankar and Kumaresh Pattabiraman
Search at LinkedIn
Sriram Sankar, Principal Staff Engineer
Kumaresh Pattabiraman, Senior Product Manager
https://www.youtube.com/watch?v=obCHKPYHuhA
2
Search at LinkedIn
Personalized professional search
Part of a bigger product experience
But a really big part of it
3
4
Some history . . .
Approach to Search
Off the shelf components (Lucene)
Extended to address Lucene limitations (Sensei, Bobo, Zoie, Content Store)
Specialized verticals (Cleo, Krati)
Stack adopted for other purposes (recommendations, newsfeed, ads, analytics, etc.)
5
Lucene
An open source API that supports search functionality:
– Add new documents to the index
– Delete documents from the index
– Construct queries
– Search the index using the query
– Score the retrieved documents
6
7
The Search Index
Inverted Index: Mapping from (search) terms to list of documents (they are present in)
Forward Index: Mapping from documents to metadata about them
8
[Figure: two example documents, one containing the terms Kumaresh and LinkedIn and the other containing Sriram and LinkedIn, shown with the inverted index (Kumaresh, Sriram, LinkedIn mapped to document numbers) and the forward index (document numbers 1 and 2)]
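The two index types above can be sketched in a few lines of Python. This is a toy illustration only, not how Lucene stores its index; the document ids, the metadata fields, and the `build_indexes` helper are all assumptions made up for the example.

```python
from collections import defaultdict

def build_indexes(docs):
    """Build a toy inverted index and forward index.

    docs maps a document id to a (text, metadata) pair (hypothetical layout).
    """
    inverted = defaultdict(list)   # term -> posting list of document ids
    forward = {}                   # document id -> metadata about the document
    for doc_id in sorted(docs):
        text, metadata = docs[doc_id]
        forward[doc_id] = metadata
        # Each distinct term contributes one hit to its posting list.
        for term in set(text.lower().split()):
            inverted[term].append(doc_id)
    return dict(inverted), forward

# Hypothetical documents, loosely modelled on the slide's example.
docs = {
    1: ("blah Kumaresh blah LinkedIn", {"title": "doc one"}),
    2: ("blah Sriram blah LinkedIn", {"title": "doc two"}),
}
inverted, forward = build_indexes(docs)
print(inverted["linkedin"])   # posting list for the term "linkedin"
print(forward[2])             # metadata stored for document 2
```

Because documents are visited in id order, each posting list comes out sorted by document id, which is the property that makes list intersection cheap.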
9
The Search Index
The lists are called posting lists
Up to hundreds of millions of posting lists
Up to hundreds of millions of documents
Posting lists may contain as few as a single hit and as many as tens of millions of hits
Terms can be:
– words in the document
– inferred attributes about the document
10
Lucene Queries
"Sriram Sankar"
Sriram Kumaresh
+Sriram +LinkedIn
+Kumaresh connection:418001
+Kumaresh industry:software connection:418001^4
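In Lucene's classic query syntax, a leading `+` marks a clause as required, `field:value` restricts a term to a field, and `^4` boosts a clause. The sketch below mimics those three features on toy documents. It is not Lucene's parser or scorer: the `parse` and `score` helpers, the default field name `text`, and the "sum of boosts" score are all assumptions for illustration, and phrase queries are not handled.

```python
def parse(query):
    """Parse a tiny subset of the syntax: [+][field:]term[^boost]."""
    clauses = []
    for token in query.split():
        required = token.startswith("+")
        token = token.lstrip("+")
        token, _, boost = token.partition("^")
        field, sep, term = token.partition(":")
        if not sep:                       # no field given: use the default field
            field, term = "text", field
        clauses.append((required, field, term.lower(), float(boost or 1.0)))
    return clauses

def score(doc, clauses):
    """Return None if the doc does not match, else the sum of matched boosts."""
    total, matched_any = 0.0, False
    for required, field, term, boost in clauses:
        hit = term in doc.get(field, set())
        if required and not hit:
            return None                   # a required clause is missing
        if hit:
            total += boost
            matched_any = True
    return total if matched_any else None

clauses = parse("+Kumaresh industry:software connection:418001^4")

# Hypothetical documents: field name -> set of terms in that field.
doc_a = {"text": {"kumaresh"}, "industry": {"software"}, "connection": {"418001"}}
doc_b = {"text": {"kumaresh"}, "industry": {"finance"}}
doc_c = {"text": {"sriram"}, "industry": {"software"}, "connection": {"418001"}}

print(score(doc_a, clauses), score(doc_b, clauses), score(doc_c, clauses))
```

The boosted `connection` clause lets a document connected to member 418001 outrank one that merely matches the name, while the required `+Kumaresh` clause filters out `doc_c` entirely.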
11
Lucene Scoring
As documents are added to the index, Lucene maintains some metadata on the terms (e.g., term position, tf/idf)
Lucene accepts scoring information via query modifications, boosts, etc.
Lucene assigns a score to each retrieved document using this information
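As a rough illustration of term statistics plus query-time boosts, here is a textbook tf-idf scorer. It is deliberately not Lucene's actual similarity formula; the `tfidf_scores` function, the idf variant `log(1 + N/df)`, and the sample documents are assumptions chosen for brevity.

```python
import math
from collections import Counter

def tfidf_scores(docs, query_terms, boosts=None):
    """Score each matching document as sum(tf * idf * boost) over query terms."""
    boosts = boosts or {}
    n = len(docs)
    # Term frequencies per document, gathered "at indexing time".
    tokenized = {d: Counter(text.lower().split()) for d, text in docs.items()}
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for counts in tokenized.values():
        df.update(counts.keys())
    scores = {}
    for d, counts in tokenized.items():
        s = 0.0
        for t in query_terms:
            if counts[t]:
                idf = math.log(1 + n / df[t])     # rarer terms weigh more
                s += counts[t] * idf * boosts.get(t, 1.0)
        if s > 0:
            scores[d] = s
    return scores

# Hypothetical corpus.
docs = {
    1: "sriram linkedin search",
    2: "kumaresh linkedin linkedin product",
    3: "lucene search",
}
plain = tfidf_scores(docs, ["sriram", "linkedin"])
boosted = tfidf_scores(docs, ["sriram", "linkedin"], boosts={"linkedin": 4.0})
print(sorted(plain, key=plain.get, reverse=True))
print(sorted(boosted, key=boosted.get, reverse=True))
```

Without boosts the rare term "sriram" puts document 1 first; boosting "linkedin" flips the order, which is the kind of control a query-time boost provides.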
12
Sensei
Layer over Lucene that provides:
– Sharding
– Cluster management
– Enhanced query language
13
14
Sensei BQL
SELECT *
FROM cars
WHERE price > 2000.00
USING RELEVANCE MODEL my_model (favoriteColor:"black", favoriteTag:"cool")
  DEFINED AS (String favoriteColor, String favoriteTag)
  BEGIN
    float boost = 1.0;
    if (tags.contains(favoriteTag))
      boost += 0.5;
    if (color.equals(favoriteColor))
      boost += 1.2;
    return _INNER_SCORE * boost;
  END
15
Live Updates – Zoie and Content Store
The index reader has to be reopened before earlier live updates are visible
The only way to perform a live update is to replace the entire document – which requires access to the unchanged attributes also
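The two limitations above suggest why a separate content store is useful: if an update must replace the whole document, something has to remember the unchanged attributes. The sketch below is a toy model of that idea under stated assumptions. `ToyIndex` and `ContentStore` are hypothetical classes, not the Zoie or Content Store implementations, and the "reopen" step only imitates the visibility behaviour described on the slide.

```python
class ToyIndex:
    """Toy index: writes become searchable only after reopen(),
    and an update always replaces the whole document."""

    def __init__(self):
        self._pending = {}    # doc id -> full document, including unseen writes
        self._visible = {}    # snapshot that searches run against

    def replace_document(self, doc_id, full_doc):
        self._pending[doc_id] = dict(full_doc)

    def reopen(self):
        # Searches see earlier writes only after the reader is reopened.
        self._visible = dict(self._pending)

    def search(self, field, value):
        return sorted(d for d, doc in self._visible.items()
                      if doc.get(field) == value)


class ContentStore:
    """Keeps full documents so a partial change can be turned into a full replace."""

    def __init__(self, index):
        self._docs = {}
        self._index = index

    def partial_update(self, doc_id, changes):
        # Merge the changed attributes with the stored, unchanged ones.
        full = {**self._docs.get(doc_id, {}), **changes}
        self._docs[doc_id] = full
        self._index.replace_document(doc_id, full)


index = ToyIndex()
store = ContentStore(index)
store.partial_update(1, {"name": "sriram", "company": "linkedin"})
before_reopen = index.search("company", "linkedin")   # write not visible yet
index.reopen()
after_reopen = index.search("company", "linkedin")
store.partial_update(1, {"title": "engineer"})        # only one attribute changes
index.reopen()
still_found = index.search("name", "sriram")          # unchanged attribute survives
print(before_reopen, after_reopen, still_found)
```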
16
Zoie
17
Search Content Store
[Diagram labels: Search Content Store, Lucene Index, Activity Feeds, Inserts, Deletes]
18
Faceting
19
Bobo
20
Typeahead (Instant Search)
Results as you type
Conventional wisdom: Inverted indices cannot support typeahead
Cleo, Krati
21
Fast forward to last year – and growing pains . . .
22
Scalability
Rebuilding index from scratch extremely difficult
Not possible to use complex algorithms during indexing
Live updates at document granularity
Inflexible scoring – both at Lucene and Sensei levels
23
Fragmentation
Too many open source components glued together with primary developers spread across many companies
Different instantiations starting to diverge to deal with their specific growing pains – so diverging stacks and distracted engineers
24
Our new search stack . . .
Two verticals already in production
25
Life of a Query
[Diagram labels: User Query, Query Rewriter/Planner, Search Shard (multiple), Results Merging, Search Results]
26
Life of a Query – Within A Search Shard
[Diagram labels: Rewritten Query, INDEX, Retrieve a Document, Score the Document, Top Results, Top Results From Shard]
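Read together, the two "Life of a Query" slides describe a scatter-gather flow: the query is rewritten, sent to every shard, each shard retrieves and scores its own documents and returns its top results, and the results are merged. A minimal sketch of that flow follows. The shard layout, the `rewrite` step (just lowercasing and splitting), and the scoring rule (matched terms times a per-document quality value) are placeholders invented for the example.

```python
import heapq

def rewrite(user_query):
    """Stand-in for the query rewriter/planner: normalise into terms."""
    return user_query.lower().split()

def search_shard(shard, terms, k):
    """Retrieve and score documents within one shard; return its top k."""
    hits = []
    for doc_id, (text, quality) in shard.items():
        words = text.split()
        matches = sum(1 for t in terms if t in words)
        if matches:                                  # retrieve a document
            hits.append((matches * quality, doc_id)) # score the document
    return heapq.nlargest(k, hits)

def search(shards, user_query, k=3):
    """Scatter the rewritten query to all shards, then merge the results."""
    terms = rewrite(user_query)
    per_shard = [search_shard(shard, terms, k) for shard in shards]
    return heapq.nlargest(k, (hit for hits in per_shard for hit in hits))

# Hypothetical shards: doc id -> (text, quality).
shards = [
    {1: ("sriram linkedin search", 1.0), 2: ("kumaresh product", 0.5)},
    {3: ("linkedin search relevance", 2.0), 4: ("lucene", 1.0)},
]
results = search(shards, "LinkedIn Search")
print(results)
```

Each shard only needs to return its own top k, because no document outside a shard's top k can appear in the global top k.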
27
Life of a Query – Within A Rewriter
[Diagram labels: Query, Rewriter State, Rewriter Module (three), DATA MODEL (three), Rewritten Query]
28
Life of Data - Offline
[Diagram labels: Raw Data, Derived Data, DATA MODEL (five), INDEX]
29
Benefits of New Stack
A complete search engine
Frequent reindexing possible (a full reset)
Resharding becomes easy
Clear separation of infrastructure and relevance functions
A single stack with a single identity!
30
Early Termination
We order documents in the index based on a static rank – from most important to least important
An offline relevance algorithm assigns a static rank to each document on which the sorting is performed
This allows retrieval to be early-terminated (assuming a strong correlation between static rank and importance of result for a specific query)
Happens to work well with personalized search also
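The idea can be sketched as follows, assuming documents are numbered in static-rank order so that a lower document id means a more important document. The `retrieve_early_terminated` function and the sample posting lists are hypothetical, and a real engine would intersect sorted lists with skipping rather than build sets as this toy does.

```python
def retrieve_early_terminated(posting_lists, terms, limit):
    """Return up to `limit` documents containing all terms, best static rank first.

    Posting lists are sorted by document id, and ids follow static rank,
    so the first matches found are also the most important ones.
    """
    lists = [posting_lists.get(t, []) for t in terms]
    if not lists:
        return []
    others = [set(p) for p in lists[1:]]   # toy shortcut for intersection
    hits = []
    for doc_id in lists[0]:                # ascending id = descending static rank
        if all(doc_id in s for s in others):
            hits.append(doc_id)
            if len(hits) == limit:
                break                      # early termination: skip the rest
    return hits

# Hypothetical posting lists over documents sorted by static rank.
posting_lists = {
    "linkedin": [0, 1, 2, 5, 8],
    "search": [1, 2, 3, 8, 9],
}
top_two = retrieve_early_terminated(posting_lists, ["linkedin", "search"], limit=2)
print(top_two)
```

With `limit=2` the loop stops after document 2 and never examines documents 5 and 8, which is where the savings come from on posting lists with millions of hits.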
31
New Strategy for Live Updates
Lucene segments are “document-partitioned”
We have enhanced Lucene with “term-partitioned” segments
We use 3 term-partitioned segments:
– Base index (never changed)
– Live update buffer
– Snapshot index
Fault tolerant, and performant
No more content store!
32
[Diagram labels: Base Index, Snapshot Index, Live Update Buffer]
33
Data Distribution
BitTorrent-based data distribution framework
More details at a later time
34
Relevance
Offline analysis – resulting in a better index and data models
Query rewriting – for better and more accurate recall
Scoring – to fine tune each of the retrieved results
Reranking – selection of top results for overall result set quality
Blending – to combine results from multiple verticals
35
Machine Learned Scorers
Goal: To automatically build a function whose arguments are interesting features of the query and the document
Input to the machine learning system is a set of training data that describes how the function should behave on various combinations of feature values
The function takes the form of standard templates – a linear formula is commonly used (due to simplicity)
36
Linear Regression on a Single Feature
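A single-feature linear fit can be sketched with the ordinary least-squares closed form. The `fit_line` helper and the training points are made up for illustration; they are not LinkedIn's features or data.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept on one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: feature value -> desired score.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)

def scorer(x):
    """The learned scoring function for this single feature."""
    return slope * x + intercept

print(slope, intercept, scorer(4.0))
```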
37
LinkedIn Scorer: Different Linear Models for Different Intents
Relevance models incorporate user features:
score = P (Document | Query, User)
Tree with linear regression leaves
[Diagram: decision tree with a split on X2 (X2 = 0 / X2 = 1) and a split on the test X10 < 0.1234 (Yes / No)]
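A tree with linear regression leaves routes each query-document pair down a few feature tests and then applies the linear model found at the leaf. The sketch below borrows the two tests visible on the slide (on X2 and on X10 < 0.1234), but how they nest, the leaf weights, and the feature `x1` are assumptions invented for the example.

```python
def tree_score(features, node):
    """Walk the tree to a leaf, then evaluate that leaf's linear model."""
    while "leaf" not in node:
        node = node["yes"] if node["test"](features) else node["no"]
    weights, bias = node["leaf"]
    return bias + sum(w * features[name] for name, w in weights.items())

# Hypothetical tree: the nesting and all coefficients are made up.
tree = {
    "test": lambda f: f["x2"] == 0,
    "yes": {"leaf": ({"x1": 2.0}, 1.0)},
    "no": {
        "test": lambda f: f["x10"] < 0.1234,
        "yes": {"leaf": ({"x1": 1.0}, 0.0)},
        "no": {"leaf": ({"x1": 4.0}, -1.0)},
    },
}

print(tree_score({"x1": 0.5, "x2": 0, "x10": 0.9}, tree))
print(tree_score({"x1": 0.5, "x2": 1, "x10": 0.05}, tree))
print(tree_score({"x1": 0.5, "x2": 1, "x10": 0.9}, tree))
```

The tests at the top pick which linear model applies, which is one way to get different linear models for different intents while keeping each leaf simple.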
38
Going Forward
Further standardize infrastructure for relevance components
Scatter-gather
Java GC issues
Extend infrastructure to browser/device
Reintegrate diverging stacks
39
Product Overview
40
LinkedIn’s Vision
“Create economic opportunity for every member of the global workforce”
41
The Economic Graph
42
Search is core to the economic graph vision
Who uses search?
Casual User: LI as professional identity
Job Seeker: LI as a way to get the day job
Outbound professional (Recruiter / Sales): LI as day job
43
44
Casual User
Name Search
Topic Search
Instant: Name Search
Search all members by name or approximate name
45
Unified Search: Topic Search
One federated search result page with all relevant entities about the topic
46
47
Outbound professional
Exploratory people search
Instant: Search Suggestions
Entity-aware suggestions for companies, skills & titles
48
Instant: Just one keystroke
From name search to exploratory search
49
People Search
Explore using facets and advanced search fields
50
People Search
Leverage the network through shared connections
51
Recruiter & Sales Navigator
Products powered by search
52
53
Job Seeker
Job Search
Instant: Search Suggestions
Entity-aware suggestions for companies, skills & titles
54
Job Search
Explore using facets and advanced search fields
55
Job Search
Leverage the network through relationship to job poster or connections in the company
56
57
Other Search Users include…
Students – University Search
Information Seekers / Researchers – Content Search
Advertisers / Content Marketers – Company & Group Search
58
Bringing it all together
300 Million+ members
Search the economic graph of:
– 300M profiles
– 3B Endorsements
– 300K jobs
– 3M Companies
– 2M Groups
– 25K Schools
– 100M+ pieces of professional content
One index
One unified search stack
Users
Product
Platform
59