hpts talk on micro sharding with katta

Randomized Clustering

Text Retrieval Task

• Text viewed as a sequences of terms in fields

• Document and position for each term are indexed

• Query is a sequence of terms (typically many more than user actually types)

Text Retrieval

• Scores computed by merging occurrences of terms in query

• Only top scoring documents are kept

• Deletion and document edits done by adding new documents and keeping deletion list

• (this is all standard ... Lucene is the best known example)

Traditional Scaling

sharshard 1d 1

sharshard 2d 2

sharshard 3d 3

sharshard 4d 4

sharshard 5d 5

sharshard 1d 1

sharshard 2d 2

sharshard 3d 3

sharshard 4d 4

sharshard 5d 5

sharshard 1d 1

sharshard 2d 2

sharshard 3d 3

sharshard 4d 4

sharshard 5d 5

Rep

licati

on

Sharding?1..n/4 n/4+1...n/2 n/2+1...3n/4 3n/4+1...n

Consistent Hashing

0 1

01

0 1

Problems

• Presumes objects can be moved individually

• Has very high insertion/deletion rate

• Has disordered access patterns

• Often exhibits content/placement correlations

Micro Sharding

Retrieval Indexer Retrieval Indexer #1#1Retrieval Indexer Retrieval Indexer

#2#2Retrieval Indexer Retrieval Indexer ##nn

Content Indexer Content Indexer #1#1Content Indexer Content Indexer

#2#2Content Indexer Content Indexer ##mm

for (t in types)for (t in types) yield [key:(t, h(key)yield [key:(t, h(key)%shardCnt), %shardCnt), value:doc]value:doc]

n,m >> number of search nodes

map reduce hdfs

Search Architecture

Retrieval Engine Retrieval Engine #1#1Retrieval Engine Retrieval Engine #2#2

Retrieval Engine Retrieval Engine ##nn

Content Engine Content Engine #1#1Content Engine Content Engine ##mm

federatofederatorrfederatofederatorr

presentatipresentation layeron layer

Control Architecture

Retrieval Engine Retrieval Engine #2#2

federatofederatorr

zookeepezookeeperr

indexeindexerr

katta katta mastemasterr

HDFS

Scenario: Node Start

● Node starts, tells ZK it exists and has no shards

● Master notified by ZK, looks at shard placement

● Imbalance exists so Master assigns shards to new node

● Node notified by ZK, downloads shard, tells ZK

● Master notified by ZK, looks at shard placement, unassigns shard somewhere

Scenario: Node Crash

● ZK detects node connection loss and session expiration

● Master is notified by ZK that node ephemeral file has vanished, looks at shard placements

● If under-replication exists, Master assigns shards to other nodes

● Nodes are notified by ZK, download shards, tell ZK

● Master is notified by ZK, no action needed

Summary of Master

● Master is notified of node set or shard set change

● Master examines current state of cluster● If shards are under-replicated, add

assignments● If shards are over-replicated, delete

assignments● If cluster is imbalanced, add assignments● Rinse, repeat

Quick Results• No deletion/insertion in indexes at runtime

• Reloading micro-shards allows large sequential transfers

• Multiple shards allows very simple threading of search

• Random placement guided by balancing policy gives near optimal motion

• Node addition and failure are simple, reliable

• Random sharding also near optimallocal = global statistics, 2x query time improvementload balancinguniform management

• EC2 - elastic compute

• Zookeeper - reliable coordination

• Katta - shard and query management

• Hadoop - map-reduce, RPC for Katta

• Lucene - candidate set retrieval, index file storage

• Deepdyve search algorithms - segment scoring

Building Blocks

Zookeeper• Replicated key-value in-memory store

• Minimal semanticscreate, read, replace specified versionsequential and ephemeral filesnotifications

• Very strict correctness guaranteesstrict orderingquorum writesno blocking operationsno race conditions

• High speed50,000 updates per second, 200,000 reads per second

• EC2 - elastic compute

• Zookeeper - reliable coordination

• Katta - shard and query management

• Hadoop - map-reduce, RPC for Katta

• Lucene - candidate set retrieval, index file storage

• Deepdyve search algorithms - segment scoring

Building Blocks

Katta Interface• Simple Interface

Client - horizontal broadcast for query, vertical broadcast for updateInodeManaged - add/removeShard

• Pluggable Application Interface

• Pluggable Return PolicyGiven current return statereturn < 0 => donereturn 0 => return result, allow updatesreturn n => wait at most n milliseconds

• Comprehensive ResultsResults, exceptions, arrival times

Horizontal/Vertical Broadcast

sharshard 1d 1

sharshard 2d 2

sharshard 3d 3

sharshard 4d 4

sharshard 1d 1

sharshard 2d 2

sharshard 3d 3

sharshard 4d 4

sharshard 1d 1

sharshard 2d 2

sharshard 3d 3

sharshard 4d 4

Rep

licati

on

1..n/4 n/4+1...n/2 n/2+1...3n/4 3n/4+1...n

Operations

Retrieval Engine Retrieval Engine #2#2

federatofederatorr

zookeepezookeeperr

indexeindexerr

katta katta mastemasterr

HDFS

Impact of Cloud Approach

• Scale-free programming

• Deployed in EC2 (test) or in private farm (production)

• No single point of failure

• Real-time scale up/down

• Extensible to real-time index updates

Lessons

• Random is goodno correlations in document to shard assignments

⇒ strong bounds on node variations in search time⇒ local statistics are as good as global statistics

no structure in shard to node assignments

⇒ node failure is not correlated to documents⇒ load balancing and rebalancing is trivial⇒ threaded search is trivial

More Lessons

• Randomized clustering requires good coordination

Zookeeper makes that easy

• Good coordination means not having to say you’re sorry

Masters coordinate but don’t participate

Resources

● My blog– http://tdunning.blogspot.com/

● The web-site–www.deepdyve.com

● Source code– Katta (sourceforge)–Hadoop (Apache)– Lucene (Apache)

hpts talk on micro sharding with katta

Technology