Search at LinkedIn by Sriram Sankar and Kumaresh Pattabiraman


Search at LinkedIn

Sriram Sankar, Principal Staff Engineer

Kumaresh Pattabiraman, Senior Product Manager

https://www.youtube.com/watch?v=obCHKPYHuhA

2

Search at LinkedIn

Personalized professional search

Part of a bigger product experience

But a really big part of it

3

4

Some history . . .

Approach to Search

Off the shelf components (Lucene)

Extended to address Lucene limitations (Sensei, Bobo, Zoie, Content Store)

Specialized verticals (Cleo, Krati)

Stack adopted for other purposes (recommendations, newsfeed, ads, analytics, etc.)

5

Lucene

An open source API that supports search functionality:
– Add new documents to index
– Delete documents from the index
– Construct queries
– Search the index using the query
– Score the retrieved documents

6

7

The Search Index

Inverted Index: Mapping from (search) terms to list of documents (they are present in)

Forward Index: Mapping from documents to metadata about them
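A minimal sketch of the two structures, assuming whitespace tokenization and two made-up documents. The function name `build_indexes` and the metadata fields are illustrative only and are not Lucene's API.

```python
from collections import defaultdict

def build_indexes(documents):
    """Build a toy inverted index and forward index from {doc_id: text}."""
    inverted = defaultdict(list)   # term -> list of doc ids containing it
    forward = {}                   # doc id -> metadata about the document
    for doc_id in sorted(documents):
        text = documents[doc_id]
        tokens = text.lower().split()
        forward[doc_id] = {"length": len(tokens), "text": text}
        # Visiting doc ids in ascending order keeps every list sorted.
        for term in sorted(set(tokens)):
            inverted[term].append(doc_id)
    return dict(inverted), forward

# Hypothetical sample documents.
docs = {
    1: "Kumaresh works at LinkedIn",
    2: "Sriram works at LinkedIn",
}
inverted, forward = build_indexes(docs)
print(inverted["linkedin"])   # [1, 2]
print(inverted["sriram"])     # [2]
print(forward[1]["length"])   # 4
```

The inverted index answers "which documents contain this term?", while the forward index answers "what do we know about this document?", which is what scoring typically needs once a document has been retrieved.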

8

[Figure: two example documents numbered 1 and 2, one mentioning Kumaresh and LinkedIn and the other mentioning Sriram and LinkedIn. The Inverted Index lists the terms Kumaresh, Sriram, and LinkedIn with the documents they appear in; the Forward Index lists the documents by number.]

9

The Search Index

The lists are called posting lists

Up to hundreds of millions of posting lists

Up to hundreds of millions of documents

Posting lists may contain as few as a single hit and as many as tens of millions of hits

Terms can be
– words in the document
– inferred attributes about the document

10

Lucene Queries

“Sriram Sankar”

Sriram Kumaresh

+Sriram +LinkedIn

+Kumaresh connection:418001

+Kumaresh industry:software connection:418001^4
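How such queries select documents can be sketched over toy posting lists. This is a simplified model, not Lucene's query parser: the `search` helper, the `required`/`optional` split, and the sample posting lists are assumptions for illustration. Terms prefixed with `+` must match; unprefixed terms are optional and, when required terms are present, only influence scoring (which is omitted here).

```python
def search(inverted, required=(), optional=()):
    """Return sorted doc ids for a toy boolean query.

    With required terms: documents containing all of them.
    Without required terms: documents containing any optional term.
    """
    if required:
        postings = [set(inverted.get(term, ())) for term in required]
        result = set.intersection(*postings)
    else:
        result = set()
        for term in optional:
            result |= set(inverted.get(term, ()))
    return sorted(result)

# Hypothetical posting lists; attribute terms use a field:value form.
inverted = {
    "sriram": [2, 3],
    "kumaresh": [1, 3],
    "linkedin": [1, 2, 3],
    "industry:software": [1, 2],
}

print(search(inverted, optional=["sriram", "kumaresh"]))   # [1, 2, 3]
print(search(inverted, required=["sriram", "linkedin"]))   # [2, 3]
print(search(inverted, required=["kumaresh"],
             optional=["industry:software"]))              # [1, 3]
```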

11

Lucene Scoring

As documents are added to the index, Lucene maintains some metadata on the terms (e.g., term position, tf/idf)

Lucene accepts scoring information via query modifications, boosts, etc.

Lucene assigns a score to each retrieved document using this information
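A rough sketch of how term statistics and boosts can combine into a score. The formula below is a textbook tf-idf variant chosen for brevity; Lucene's actual scoring formula differs, and the function name, the `boosts` parameter, and the sample documents are all assumptions.

```python
import math

def tf_idf_scores(query_terms, documents, boosts=None):
    """Score every document against the query with a simple tf-idf sum."""
    boosts = boosts or {}
    n_docs = len(documents)
    tokenized = {d: text.lower().split() for d, text in documents.items()}
    scores = {}
    for doc_id, tokens in tokenized.items():
        score = 0.0
        for term in query_terms:
            tf = tokens.count(term)   # term frequency in this document
            df = sum(1 for toks in tokenized.values() if term in toks)
            if tf and df:
                idf = math.log(1 + n_docs / df)   # rarer terms weigh more
                score += boosts.get(term, 1.0) * tf * idf
        scores[doc_id] = score
    return scores

# Hypothetical documents.
docs = {
    1: "sriram sriram linkedin",
    2: "kumaresh linkedin",
    3: "search engine",
}
scores = tf_idf_scores(["sriram", "linkedin"], docs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)   # [1, 2, 3]
```

Passing `boosts={"linkedin": 4.0}` plays the role of a `^4` query boost: it multiplies that term's contribution without changing which documents match.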

12

Sensei

Layer over Lucene that provides:
– Sharding
– Cluster management
– Enhanced query language

14

Sensei BQL

SELECT *
FROM cars
WHERE price > 2000.00
USING RELEVANCE MODEL my_model (favoriteColor:"black", favoriteTag:"cool")
  DEFINED AS (String favoriteColor, String favoriteTag)
  BEGIN
    float boost = 1.0;
    if (tags.contains(favoriteTag))
      boost += 0.5;
    if (color.equals(favoriteColor))
      boost += 1.2;
    return _INNER_SCORE * boost;
  END

15

Live Updates – Zoie and Content Store

The index reader has to be reopened before earlier live updates are visible

The only way to perform a live update is to replace the entire document – which requires access to the unchanged attributes also
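The second limitation is what motivates a content store: to change one attribute, the full document has to be rebuilt from a stored copy and then reindexed. A minimal sketch of that pattern follows; the class, its methods, and the `field:token` term format are hypothetical and do not reflect the actual Zoie or Content Store interfaces.

```python
class ContentStoreIndexer:
    """Toy index where every update replaces the whole document."""

    def __init__(self):
        self.content_store = {}   # doc id -> full set of attributes
        self.index = {}           # term -> set of doc ids

    def _remove(self, doc_id):
        for postings in self.index.values():
            postings.discard(doc_id)

    def _add(self, doc_id, attributes):
        for field, value in attributes.items():
            for token in str(value).lower().split():
                self.index.setdefault(f"{field}:{token}", set()).add(doc_id)

    def update(self, doc_id, changed):
        # Merge the changed attributes with the unchanged ones kept in
        # the content store, then delete and re-add the whole document.
        full = {**self.content_store.get(doc_id, {}), **changed}
        self.content_store[doc_id] = full
        self._remove(doc_id)
        self._add(doc_id, full)

indexer = ContentStoreIndexer()
indexer.update(1, {"name": "Sriram Sankar", "company": "LinkedIn"})
indexer.update(1, {"title": "Engineer"})   # only one attribute changes
print(1 in indexer.index["company:linkedin"])   # True
```

Without the stored copy, the second `update` call would lose the name and company attributes, because the index alone cannot reproduce the original document.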

16

Zoie

17

Search Content Store

[Diagram labels: Search Content Store, Lucene Index, Activity Feeds, Inserts, Deletes]

18

Faceting

19

Bobo

20

Typeahead (Instant Search)

Results as you type

Conventional wisdom: Inverted indices cannot support typeahead

Cleo, Krati

21

Fast forward to last year – and growing pains . . .

22

Scalability

Rebuilding index from scratch extremely difficult

Not possible to use complex algorithms during indexing

Live updates at document granularity

Inflexible scoring – both at Lucene and Sensei levels

23

Fragmentation

Too many open source components glued together with primary developers spread across many companies

Different instantiations starting to diverge to deal with their specific growing pains – so diverging stacks and distracted engineers

24

Our new search stack . . .

Two verticals already in production

25

Life of a Query

[Diagram: the User Query passes through the Query Rewriter/Planner to multiple Search Shards; Results Merging combines their output into the Search Results]
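The flow on this slide (user query, rewriter/planner, search shards, results merging, search results) follows the usual scatter-gather pattern, which can be sketched as below. Everything here is a simplification under stated assumptions: the rewriter only lowercases and splits the query, shards are plain dictionaries, and the score is a raw term count. None of the function names come from the LinkedIn stack.

```python
import heapq

def rewrite(user_query):
    """Stand-in for the query rewriter/planner."""
    return user_query.lower().split()

def search_shard(shard, terms, k):
    """Retrieve and score documents in one shard; return its top k."""
    scored = []
    for doc_id, text in shard.items():
        tokens = text.lower().split()
        score = sum(tokens.count(term) for term in terms)
        if score > 0:
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)

def search(shards, user_query, k=3):
    terms = rewrite(user_query)
    # Scatter: every shard answers the same rewritten query.
    partial = [search_shard(shard, terms, k) for shard in shards]
    # Gather: merge the per-shard top results into a global top k.
    merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))
    return [doc_id for _, doc_id in merged]

# Hypothetical shards.
shards = [
    {1: "sriram sankar linkedin search", 2: "kumaresh linkedin"},
    {3: "linkedin search search", 4: "unrelated text"},
]
print(search(shards, "LinkedIn Search", k=2))   # [3, 1]
```

Each shard only needs to return its own top k, because no document outside a shard's top k can appear in the global top k.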

26

Life of a Query – Within A Search Shard

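[Diagram: the Rewritten Query runs against the INDEX; the shard repeatedly retrieves a document, scores the document, and keeps the Top Results, which are returned as the Top Results From Shard]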

27

Life of a Query – Within A Rewriter

[Diagram labels: Query, Rewriter State, three Rewriter Modules, three DATA MODELs, Rewritten Query]

28

Life of Data - Offline

[Diagram labels: Raw Data, Derived Data, five DATA MODELs, INDEX]

29

Benefits of New Stack

A complete search engine

Frequent reindexing possible (a full reset)

Resharding becomes easy

Clear separation of infrastructure and relevance functions

A single stack with a single identity!

30

Early Termination

We order documents in the index based on a static rank – from most important to least important

An offline relevance algorithm assigns a static rank to each document on which the sorting is performed

This allows retrieval to be early-terminated (assuming a strong correlation between static rank and importance of result for a specific query)

Happens to work well with personalized search also
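A minimal sketch of the idea, assuming documents are assigned index positions in decreasing static rank so that every posting list is already in most-important-first order. The helper names, the sample documents, and the static rank values are made up for illustration.

```python
def build_index(docs, static_rank):
    """Assign positions by decreasing static rank and build posting lists."""
    order = sorted(docs, key=lambda name: static_rank[name], reverse=True)
    postings = {}
    for position, name in enumerate(order):
        for term in set(docs[name].lower().split()):
            # Positions are visited in ascending order, so each list
            # stays sorted from most to least important document.
            postings.setdefault(term, []).append(position)
    return order, postings

def retrieve(postings, terms, limit):
    """Return the first `limit` positions that contain all terms."""
    lists = [postings.get(term, []) for term in terms]
    if not lists:
        return []
    others = [set(lst) for lst in lists[1:]]
    hits = []
    for position in lists[0]:
        if all(position in other for other in others):
            hits.append(position)
            if len(hits) == limit:
                break   # early termination: the rest rank lower
    return hits

# Hypothetical documents and offline-computed static ranks.
docs = {
    "a": "linkedin search",
    "b": "linkedin recruiter",
    "c": "linkedin search engineer",
    "d": "search",
}
static_rank = {"a": 0.2, "b": 0.9, "c": 0.7, "d": 0.5}
order, postings = build_index(docs, static_rank)
hits = retrieve(postings, ["linkedin", "search"], limit=1)
print([order[p] for p in hits])   # ['c']
```

The sketch uses sets for the membership test to stay short; a production engine would advance through all posting lists in step rather than materializing them.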

31

New Strategy for Live Updates

Lucene segments are “document-partitioned”

We have enhanced Lucene with “term-partitioned” segments

We use 3 term-partitioned segments:
– Base index (never changed)
– Live update buffer
– Snapshot index

Fault tolerant, and performant

No more content store!
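The slide does not say how the three segments interact, so the following is only one possible reading, offered as a sketch: a term's postings are the union of what the base index, the snapshot index, and the live update buffer hold for that term, and the buffer is periodically folded into the snapshot. The class, its method names, and the flush step are all assumptions.

```python
class TermPartitionedIndex:
    """Toy three-segment index keyed by term (assumed design)."""

    def __init__(self, base):
        # Base index: built offline and never changed afterwards.
        self.base = {term: tuple(docs) for term, docs in base.items()}
        self.snapshot = {}   # periodically written copy of live updates
        self.live = {}       # in-memory buffer of recent updates

    def add_posting(self, term, doc_id):
        # A live update touches one term only; the rest of the
        # document does not have to be re-supplied.
        self.live.setdefault(term, set()).add(doc_id)

    def flush(self):
        """Fold the live buffer into the snapshot (assumed step)."""
        for term, docs in self.live.items():
            self.snapshot.setdefault(term, set()).update(docs)
        self.live = {}

    def postings(self, term):
        merged = set(self.base.get(term, ()))
        merged |= self.snapshot.get(term, set())
        merged |= self.live.get(term, set())
        return sorted(merged)

index = TermPartitionedIndex({"linkedin": [1, 2]})
index.add_posting("linkedin", 3)
print(index.postings("linkedin"))   # [1, 2, 3]
```

Under this reading, updating a single attribute no longer requires the unchanged attributes, which would explain why the content store is no longer needed.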

32

[Diagram: Base Index, Snapshot Index, Live Update Buffer]

33

Data Distribution

BitTorrent-based data distribution framework

More details at a later time

34

Relevance

Offline analysis – resulting in a better index and data models

Query rewriting – for better and more accurate recall

Scoring – to fine tune each of the retrieved results

Reranking – selection of top results for overall result set quality

Blending – to combine results from multiple verticals

35

Machine Learned Scorers

Goal: To automatically build a function whose arguments are interesting features of the query and the document

Input to the machine learning system is a set of training data that describes how the function should behave on various combinations of feature values

The function takes the form of standard templates – a linear formula is commonly used (due to simplicity)
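As a small illustration of the linear template, the sketch below fits a line to a single feature with ordinary least squares and then scores with a weighted sum. The training data is invented, and the closed-form fit is a standard textbook method rather than the training procedure used at LinkedIn.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def linear_score(weights, bias, features):
    """Linear scorer: bias plus the weighted sum of feature values."""
    return bias + sum(w * f for w, f in zip(weights, features))

# Hypothetical training data: feature value -> desired score.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)                          # 2.0 1.0
print(linear_score([slope], intercept, [4.0]))   # 9.0
```

With more features the same `linear_score` applies unchanged; only the fitting step needs a multivariate solver.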

36

Linear Regression on a Single Feature

37

LinkedIn Scorer: Different Linear Models for Different Intents

Relevance models incorporate user features:

score = P (Document | Query, User)

Tree with linear regression leaves

37

[Diagram: decision tree with a split on X2 (branches X2 = 0 and X2 = 1) and a split on X10 < 0.1234 (branches Yes and No)]
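A tree with linear regression leaves can be sketched as a few branches that each select a different linear model. The split features X2 and X10 and the threshold 0.1234 come from the slide; where the X10 test sits in the tree, the extra feature `x1`, and all weights are assumptions made for illustration.

```python
def linear(weights, bias):
    """Return a linear model over a dict of named features."""
    def model(features):
        return bias + sum(w * features[name] for name, w in weights.items())
    return model

# Hypothetical leaf models; real weights would be learned per leaf.
LEAF_A = linear({"x1": 1.0}, 0.0)
LEAF_B = linear({"x1": 2.0}, 1.0)
LEAF_C = linear({"x1": 3.0}, 2.0)

def score(features):
    """Route the example down the tree, then apply that leaf's model."""
    if features["x2"] == 0:
        return LEAF_A(features)
    # Assumed placement: the X10 test sits under the X2 = 1 branch.
    if features["x10"] < 0.1234:
        return LEAF_B(features)
    return LEAF_C(features)

print(score({"x1": 1.0, "x2": 0, "x10": 0.5}))    # 1.0
print(score({"x1": 1.0, "x2": 1, "x10": 0.05}))   # 3.0
print(score({"x1": 1.0, "x2": 1, "x10": 0.5}))    # 5.0
```

The tree lets different query intents or user segments use different linear formulas while each formula stays simple.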

38

Going Forward

Further standardize infrastructure for relevance components

Scatter-gather

Java GC issues

Extend infrastructure to browser/device

Reintegrate diverging stacks

39

Product Overview

40

LinkedIn’s Vision

“Create economic opportunity for every member of the global workforce”

41

The Economic Graph

42

Search is core to the economic graph vision

Who uses search?

Casual User: LI as professional identity

Outbound professional (Recruiter / Sales): LI as day job

Job Seeker: LI as a way to get the day job

43

44

Casual User

Name Search

Topic Search

Instant: Name Search

Search all members by name or approximate name

45

Unified Search: Topic Search

One federated search result page with all relevant entities about the topic

46

47

Outbound professional

Exploratory people search

Instant: Search Suggestions

Entity-aware suggestions for companies, skills & titles

48

Instant: Just one keystroke

From name search to exploratory search

49

People Search

Explore using facets and advanced search fields

50

People Search

Leverage the network through shared connections

51

Recruiter & Sales Navigator

Products powered by search

52

53

Job Seeker

Job Search

Instant: Search Suggestions

Entity-aware suggestions for companies, skills & titles

54

Job Search

Explore using facets and advanced search fields

55

Job Search

Leverage the network through relationship to job poster or connections in the company

56

57

Other Search Users include…

Students – University Search

Information Seekers / Researchers – Content Search

Advertisers / Content Marketers – Company & Group Search

58

Bringing it all together

300 Million+ members

Search the economic graph of
– 300M profiles
– 3B Endorsements
– 300K jobs
– 3M Companies
– 2M Groups
– 25K Schools
– 100M+ pieces of professional content

One index

One unified search stack

Users

Product

Platform


Technology