Did you mean ‘Galene’?

Upload: azeem-mohammad

Post on 29-Jul-2015

The Galene Search Stack

• Query starts at the browser/device.

• Moves to Search Frontend where some processing is done.

• Moves to backend where the bulk of the Galene search functionality lives.

• Results returning from the backend go back to the user through the frontend to the browser/device.

The Federator and Broker

Federator and Broker are similar services:

• They accept a query plus metadata as input.

• They send that input to multiple services.

• They wait for responses from these services.

• They combine the responses and return an aggregated response to the caller.
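
The accept / fan out / wait / combine pattern above can be sketched in a few lines of Python. This is only an illustration, not Galene's code: `scatter_gather`, the two stand-in services and `concatenate` are hypothetical names, and a thread pool is assumed as the way to issue the calls in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter_gather(query, metadata, services, combine):
    """Send (query, metadata) to every service in parallel, then combine the responses."""
    with ThreadPoolExecutor(max_workers=len(services)) as pool:
        futures = [pool.submit(service, query, metadata) for service in services]
        # result() blocks, so this line waits for every downstream service.
        responses = [future.result() for future in futures]
    return combine(responses)


# Hypothetical stand-ins for the downstream services (verticals or searchers).
def members_service(query, metadata):
    return [("member:" + query, 0.9)]


def content_service(query, metadata):
    return [("content:" + query, 0.7)]


def concatenate(responses):
    """Simplest possible aggregation: flatten the per-service result lists."""
    return [hit for response in responses for hit in response]


results = scatter_gather(
    "galene", {"user": 42}, [members_service, content_service], concatenate
)
print(results)  # [('member:galene', 0.9), ('content:galene', 0.7)]
```

The `combine` argument is where a Federator and a Broker would differ: the same fan-out skeleton can be given a blending function in one case and a merging function in the other.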

The Federator

• Federator invokes a ‘Query Re-writer’ to convert the plain text query into a ‘structured’ query.

• The Query Re-writer also enhances the query with additional metadata.

• The Federator then passes the Query Re-writer’s output to one or more ‘Search Verticals’.

Search Verticals

• A search vertical serves a specific kind of entity, e.g. members or content.

The Broker

• The Broker is the receiving service in each vertical.

• Performs additional, vertical-specific query re-writing.

• Sends the re-written query to ‘Searchers’.

• Gathers results from the Searchers and ‘merges’ them together.

• Merging can be a simple score-based merge sort or a sophisticated re-ranking algorithm.

• Merged results are sent back to the Federator.
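
The simple score-based merge mentioned above might look like the following sketch. It assumes each Searcher returns `(entity_id, score)` pairs already sorted by descending score; `merge_by_score` and the sample shard results are made up for illustration.

```python
import heapq
from itertools import islice


def merge_by_score(searcher_results, limit):
    """Merge per-searcher hit lists, each already sorted by descending score."""
    merged = heapq.merge(*searcher_results, key=lambda hit: hit[1], reverse=True)
    # heapq.merge is lazy, so only the first `limit` hits are ever produced.
    return list(islice(merged, limit))


# Hypothetical results from two searchers (one per index shard).
shard_a = [("doc1", 0.95), ("doc4", 0.40)]
shard_b = [("doc7", 0.80), ("doc2", 0.65)]

top = merge_by_score([shard_a, shard_b], limit=3)
print(top)  # [('doc1', 0.95), ('doc7', 0.8), ('doc2', 0.65)]
```

A re-ranking Broker would replace this function with one that rescores the gathered hits before ordering them.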

Search Results

• The Federator blends (combines) the results sent to it by the Brokers of multiple verticals.

• Blending can involve complex relevance algorithms.

• Blended search results are sent to the Frontend.

Query Re-writers

• A rewriter is made up of multiple rewriter modules, each of which performs a specific function.

• Examples: synonym expansion, spelling correction, and graph proximity personalization.

• Rewriter modules operate in sequence and update an internal state. The final rewritten query is produced from this state.

• Data models (for example, synonym maps, common n-grams, and query completion data) are built offline along with the search index and copied into the Federator/Broker.
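
A rough sketch of modules that run in sequence over a shared state is shown below. Everything in it is an assumption made for illustration: the `RewriteState` class, the two modules, the tiny `SPELLING` and `SYNONYMS` dictionaries standing in for data models, and the list-of-alternatives shape of the structured query.

```python
from dataclasses import dataclass, field

# Hypothetical data models; in the deck these are built offline and copied in.
SPELLING = {"enginer": "engineer"}
SYNONYMS = {"engineer": ["developer"]}


@dataclass
class RewriteState:
    """Internal state that every rewriter module reads and updates."""
    terms: list
    synonyms: dict = field(default_factory=dict)


def correct_spelling(state):
    state.terms = [SPELLING.get(term, term) for term in state.terms]


def expand_synonyms(state):
    state.synonyms = {t: SYNONYMS[t] for t in state.terms if t in SYNONYMS}


def rewrite(raw_query, modules):
    state = RewriteState(terms=raw_query.lower().split())
    for module in modules:  # modules operate in sequence on the same state
        module(state)
    # The final structured query is produced from the state: one clause per
    # term, each clause listing the term and its acceptable alternatives.
    return [[term] + state.synonyms.get(term, []) for term in state.terms]


structured = rewrite("Software Enginer", [correct_spelling, expand_synonyms])
print(structured)  # [['software'], ['engineer', 'developer']]
```

Module order matters in this sketch: running synonym expansion before spelling correction would miss the misspelled term.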

The Searcher

• A Searcher operates on a single shard of the index.

• It receives the rewritten query and metadata from the Broker, and retrieves matching entities from the index.

• The entities are scored and the top scoring entities are returned to the Broker.

The Scorer

• Scorers are built as plugins to a scorer API exported by the Searcher.

• Inputs to the scoring algorithm:

– Input query

– Input metadata

– Details on how the query matched the entity

– Forward index

• Since scorers are pluggable, an ML-based Scorer can also be developed.
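
To make the plugin idea concrete, here is a minimal sketch of what a scorer interface taking the four inputs above could look like. The real scorer API is not shown in the deck, so the `Scorer` base class, the `TermCountScorer` plugin, the `top_entities` helper and the dictionary shapes are all hypothetical.

```python
from abc import ABC, abstractmethod


class Scorer(ABC):
    """Hypothetical plugin interface; the actual Galene scorer API is not shown."""

    @abstractmethod
    def score(self, query, metadata, match_details, forward_index):
        """Return a numeric score for one matched entity."""


class TermCountScorer(Scorer):
    """Toy plugin: number of matched terms plus a stored static rank."""

    def score(self, query, metadata, match_details, forward_index):
        return len(match_details["matched_terms"]) + forward_index.get("static_rank", 0.0)


def top_entities(scorer, query, metadata, candidates, limit):
    """Searcher side: score every retrieved entity and keep the best ones."""
    scored = [
        (entity_id, scorer.score(query, metadata, details, forward))
        for entity_id, details, forward in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Made-up candidates: (entity id, match details, forward index entry).
candidates = [
    ("e1", {"matched_terms": ["software"]}, {"static_rank": 0.5}),
    ("e2", {"matched_terms": ["software", "engineer"]}, {"static_rank": 0.1}),
]
best = top_entities(TermCountScorer(), ["software", "engineer"], {}, candidates, limit=1)
print(best[0][0])  # e2
```

An ML-based scorer would be another subclass whose `score` method feeds the same four inputs to a trained model.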

Indexing on Hadoop

• HDFS holds the raw data, which contains all the information we need to build the index.

• First, we run map-reduce jobs with embedded relevance algorithms that enrich the raw data, producing the derived data.

• Examples of relevance algorithms: spell correction and standardization of concepts (for example, unifying “software engineer” and “computer programmer”).

• Galene provides custom map-reduce templates that perform the final step of building the actual index and data models. These templates are instantiated for specific jobs through schema definitions.
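
The final index-building step can be pictured as a map phase that emits `(term, entity)` pairs and a reduce phase that groups them into posting lists. The sketch below runs both phases in-process on made-up data; `SCHEMA`, `map_phase` and `reduce_phase` are hypothetical names, and the real templates and schema format are not shown in the deck.

```python
from collections import defaultdict

# Hypothetical schema definition: which fields of an entity get indexed.
SCHEMA = {"indexed_fields": ["title", "skills"]}


def map_phase(entity_id, record, schema):
    """Emit one (term, entity_id) pair per token of every indexed field."""
    for field_name in schema["indexed_fields"]:
        for token in record.get(field_name, "").lower().split():
            yield (field_name + ":" + token, entity_id)


def reduce_phase(pairs):
    """Group the emitted pairs into posting lists (term -> sorted entity ids)."""
    postings = defaultdict(set)
    for term, entity_id in pairs:
        postings[term].add(entity_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Made-up derived data, keyed by entity id.
derived_data = {
    1: {"title": "Software Engineer", "skills": "search java"},
    2: {"title": "Search Engineer", "skills": "python"},
}

pairs = [
    pair
    for entity_id, record in derived_data.items()
    for pair in map_phase(entity_id, record, SCHEMA)
]
index = reduce_phase(pairs)
print(index["title:engineer"])  # [1, 2]
```

Changing the schema, rather than the map and reduce functions, is what adapts this sketch to a different kind of entity.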

Early Termination: Static Ranking

• Static Rank = Measure of importance of an entity that is independent of any search query.

• Static ranks are calculated while the indexes are being built.

• Entities are ordered by static rank in the index, placing the most important entities for a term first.

• The retrieval process terminates as soon as an adequate number of entities matching the query have been collected.

• We do not have to wait for all matching entities to be considered; this is early termination.

• Scoring is expensive, so early termination helps reduce the number of entities scored.
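
The following sketch shows early termination on a two-term AND query. It assumes entities are identified by their position in static rank order (0 is the most important), so every posting list is ascending; the function name and the sample posting lists are made up.

```python
def retrieve_with_early_termination(posting_lists, needed):
    """Collect the first `needed` entities present in every posting list.

    Returns the matches and how many candidate entities were examined.
    """
    driver, *others = sorted(posting_lists, key=len)  # walk the shortest list
    # Sets keep the sketch short; a real engine would advance all lists in lockstep.
    other_sets = [set(postings) for postings in others]
    matches, examined = [], 0
    for entity in driver:  # most important entities come first
        examined += 1
        if all(entity in postings for postings in other_sets):
            matches.append(entity)
            if len(matches) == needed:
                break  # early termination: the rest is never examined or scored
    return matches, examined


# Made-up posting lists, already in static rank order.
software = [0, 2, 3, 5, 8, 9]
engineer = [0, 1, 3, 4, 5, 8, 9]

matches, examined = retrieve_with_early_termination([software, engineer], needed=2)
print(matches, examined)  # [0, 3] 3
```

Only three of the six entries in the shorter list are examined here, and only the two collected matches would go on to the scorer.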

Live Updates

Pre-Galene:

– Updates happen at the granularity of a whole entity, which hurts performance.

– We have to maintain a second copy of the index in the Search Content Store.

– Adding and removing entities from the index upsets the static rank order and, with it, the ability to perform early termination.

– The index is always being modified, which makes the system brittle and makes recovery from index corruption and similar failures difficult.

Live Updates in Galene

• Live updates are performed at the granularity of single fields.

• Made possible by a new type of index segment – the term partitioned segment.

• The inverted index and forward index of each entity may be split up across such segments.

• The same posting list (term -> list of documents) can be present in multiple segments, so traversing a single posting list becomes traversing a disjunction of the posting lists in each of the segments.

• For this to work properly, the entities in each segment have to be ordered in the same manner – using static rank helps us meet this constraint.

• The forward index becomes the union of the forward indices in each of the segments.
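
A small sketch of both ideas, the disjunction over per-segment posting lists and the union of forward indices, is given below. It assumes entity ids double as static rank positions, so every segment's posting list is ascending, and it assumes that on a field conflict the later segment in the list wins. The dictionary layout and function names are made up for illustration.

```python
import heapq


def traverse_posting_list(term, segments):
    """Yield each entity once, in static rank order, across all segments."""
    lists = [segment["inverted"].get(term, []) for segment in segments]
    previous = None
    # heapq.merge only works because every segment orders entities the same way.
    for entity in heapq.merge(*lists):
        if entity != previous:
            yield entity
        previous = entity


def forward_index(entity, segments):
    """Union of the entity's forward index entries from every segment."""
    merged = {}
    for segment in segments:  # assumption: later segments override earlier ones
        merged.update(segment["forward"].get(entity, {}))
    return merged


# Two made-up segments that each hold part of the index.
base = {
    "inverted": {"java": [1, 4]},
    "forward": {1: {"name": "Ann"}, 4: {"name": "Bo"}},
}
live = {
    "inverted": {"java": [2, 4]},
    "forward": {2: {"name": "Cy"}, 4: {"title": "SRE"}},
}

print(list(traverse_posting_list("java", [base, live])))  # [1, 2, 4]
print(forward_index(4, [base, live]))  # {'name': 'Bo', 'title': 'SRE'}
```

If the segments ordered their entities differently, the merged traversal would no longer come out in static rank order, which is the constraint the bullet above describes.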

• In Galene, we maintain three such segments:

• The base index: built offline on Hadoop and rebuilt periodically (every week). Once built, it is never modified, only discarded after the next base index is built.

• The live update buffer: maintained in memory. This segment is designed to accept incremental updates and to augment itself so that entities stay in the correct static rank order.

• The snapshot index: a periodic snapshot of the base index plus all previous live update buffers. The snapshot index is rebuilt periodically by combining the existing snapshot index with the live update buffer. Each time a snapshot index is generated, the live update buffer is reset.
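
The interplay of the three segments can be sketched as a toy class. It models only forward-index field values and treats each segment as a dictionary overlay, which is a simplification assumed here rather than how Galene stores segments; `SegmentSet` and its method names are hypothetical.

```python
class SegmentSet:
    """Toy model of the base index, snapshot index and live update buffer."""

    def __init__(self, base):
        self.base = dict(base)  # built offline; never modified afterwards
        self.snapshot = {}      # rebuilt periodically
        self.live = {}          # in-memory buffer of field-level updates

    def update_field(self, entity, field_name, value):
        """Live update at the granularity of a single field."""
        self.live.setdefault(entity, {})[field_name] = value

    def take_snapshot(self):
        """Combine the existing snapshot with the live buffer, then reset the buffer."""
        merged = {entity: dict(fields) for entity, fields in self.snapshot.items()}
        for entity, fields in self.live.items():
            merged.setdefault(entity, {}).update(fields)
        self.snapshot = merged
        self.live = {}

    def lookup(self, entity):
        """Newest value wins: base, then snapshot, then live buffer."""
        view = {}
        for segment in (self.base, self.snapshot, self.live):
            view.update(segment.get(entity, {}))
        return view


segments = SegmentSet({1: {"title": "Engineer", "name": "Ann"}})
segments.update_field(1, "title", "Manager")
before = segments.lookup(1)
segments.take_snapshot()
after = segments.lookup(1)
print(before == after, segments.live)  # True {}
```

Readers see the same view before and after the snapshot is taken, while the base segment is left untouched throughout.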

In the diagram, horizontal lines are entities and vertical lines are posting lists (term -> list of docs). The box at the far right of each entity represents its forward index.

Index Lifecycle Management

• A base index is built every week, snapshots are generated every few hours, and these indices have to be present on all replicas of the searchers.

• Snapshots are generated on separate machines known as Indexers. Every few hours, the Indexers merge their live update buffers with the snapshot index.

• The base index needs to be moved from HDFS to the searchers and indexers, and the snapshot index from the indexers to the searchers.

• This transfer of indices is supported by ‘Replica Groups’, a BitTorrent-based framework.

• Machines can join replica groups and automatically get all data associated with that replica group.

• Replica group members can also add data to their replica groups, which then gets replicated to all the other members.

• This framework is also used to move data models from HDFS to the Federators and Brokers. Additional lifecycle management capabilities, such as index versioning and rollback, are also built into this system.