hpts talk on micro sharding with katta
DESCRIPTION
A talk given by Ted Dunning in the HPTS workshop in Asilomar in 2009. It describes the ideas behind micro-sharding and outlines how Katta can manage micro-shards. Some builds and spacing are off because this was exported as power point from Keynote.TRANSCRIPT
Randomized Clustering
Text Retrieval Task
• Text viewed as a sequences of terms in fields
• Document and position for each term are indexed
• Query is a sequence of terms (typically many more than user actually types)
Text Retrieval
• Scores computed by merging occurrences of terms in query
• Only top scoring documents are kept
• Deletion and document edits done by adding new documents and keeping deletion list
• (this is all standard ... Lucene is the best known example)
Traditional Scaling
sharshard 1d 1
sharshard 2d 2
sharshard 3d 3
sharshard 4d 4
sharshard 5d 5
sharshard 1d 1
sharshard 2d 2
sharshard 3d 3
sharshard 4d 4
sharshard 5d 5
sharshard 1d 1
sharshard 2d 2
sharshard 3d 3
sharshard 4d 4
sharshard 5d 5
Rep
licati
on
Sharding?1..n/4 n/4+1...n/2 n/2+1...3n/4 3n/4+1...n
Consistent Hashing
0 1
01
0 1
Problems
• Presumes objects can be moved individually
• Has very high insertion/deletion rate
• Has disordered access patterns
• Often exhibits content/placement correlations
Micro Sharding
Retrieval Indexer Retrieval Indexer #1#1Retrieval Indexer Retrieval Indexer
#2#2Retrieval Indexer Retrieval Indexer ##nn
Content Indexer Content Indexer #1#1Content Indexer Content Indexer
#2#2Content Indexer Content Indexer ##mm
for (t in types)for (t in types) yield [key:(t, h(key)yield [key:(t, h(key)%shardCnt), %shardCnt), value:doc]value:doc]
n,m >> number of search nodes
map reduce hdfs
Search Architecture
Retrieval Engine Retrieval Engine #1#1Retrieval Engine Retrieval Engine #2#2
Retrieval Engine Retrieval Engine ##nn
Content Engine Content Engine #1#1Content Engine Content Engine ##mm
federatofederatorrfederatofederatorr
presentatipresentation layeron layer
Control Architecture
Retrieval Engine Retrieval Engine #2#2
federatofederatorr
zookeepezookeeperr
indexeindexerr
katta katta mastemasterr
HDFS
Scenario: Node Start
● Node starts, tells ZK it exists and has no shards
● Master notified by ZK, looks at shard placement
● Imbalance exists so Master assigns shards to new node
● Node notified by ZK, downloads shard, tells ZK
● Master notified by ZK, looks at shard placement, unassigns shard somewhere
Scenario: Node Crash
● ZK detects node connection loss and session expiration
● Master is notified by ZK that node ephemeral file has vanished, looks at shard placements
● If under-replication exists, Master assigns shards to other nodes
● Nodes are notified by ZK, download shards, tell ZK
● Master is notified by ZK, no action needed
Summary of Master
● Master is notified of node set or shard set change
● Master examines current state of cluster● If shards are under-replicated, add
assignments● If shards are over-replicated, delete
assignments● If cluster is imbalanced, add assignments● Rinse, repeat
Quick Results• No deletion/insertion in indexes at runtime
• Reloading micro-shards allows large sequential transfers
• Multiple shards allows very simple threading of search
• Random placement guided by balancing policy gives near optimal motion
• Node addition and failure are simple, reliable
• Random sharding also near optimallocal = global statistics, 2x query time improvementload balancinguniform management
• EC2 - elastic compute
• Zookeeper - reliable coordination
• Katta - shard and query management
• Hadoop - map-reduce, RPC for Katta
• Lucene - candidate set retrieval, index file storage
• Deepdyve search algorithms - segment scoring
Building Blocks
• EC2 - elastic compute
• Zookeeper - reliable coordination
• Katta - shard and query management
• Hadoop - map-reduce, RPC for Katta
• Lucene - candidate set retrieval, index file storage
• Deepdyve search algorithms - segment scoring
Building Blocks
Zookeeper• Replicated key-value in-memory store
• Minimal semanticscreate, read, replace specified versionsequential and ephemeral filesnotifications
• Very strict correctness guaranteesstrict orderingquorum writesno blocking operationsno race conditions
• High speed50,000 updates per second, 200,000 reads per second
• EC2 - elastic compute
• Zookeeper - reliable coordination
• Katta - shard and query management
• Hadoop - map-reduce, RPC for Katta
• Lucene - candidate set retrieval, index file storage
• Deepdyve search algorithms - segment scoring
Building Blocks
Katta Interface• Simple Interface
Client - horizontal broadcast for query, vertical broadcast for updateInodeManaged - add/removeShard
• Pluggable Application Interface
• Pluggable Return PolicyGiven current return statereturn < 0 => donereturn 0 => return result, allow updatesreturn n => wait at most n milliseconds
• Comprehensive ResultsResults, exceptions, arrival times
Horizontal/Vertical Broadcast
sharshard 1d 1
sharshard 2d 2
sharshard 3d 3
sharshard 4d 4
sharshard 1d 1
sharshard 2d 2
sharshard 3d 3
sharshard 4d 4
sharshard 1d 1
sharshard 2d 2
sharshard 3d 3
sharshard 4d 4
Rep
licati
on
1..n/4 n/4+1...n/2 n/2+1...3n/4 3n/4+1...n
Operations
Retrieval Engine Retrieval Engine #2#2
federatofederatorr
zookeepezookeeperr
indexeindexerr
katta katta mastemasterr
HDFS
Impact of Cloud Approach
• Scale-free programming
• Deployed in EC2 (test) or in private farm (production)
• No single point of failure
• Real-time scale up/down
• Extensible to real-time index updates
Lessons
• Random is goodno correlations in document to shard assignments
⇒ strong bounds on node variations in search time⇒ local statistics are as good as global statistics
no structure in shard to node assignments
⇒ node failure is not correlated to documents⇒ load balancing and rebalancing is trivial⇒ threaded search is trivial
More Lessons
• Randomized clustering requires good coordination
Zookeeper makes that easy
• Good coordination means not having to say you’re sorry
Masters coordinate but don’t participate
Resources
● My blog– http://tdunning.blogspot.com/
● The web-site–www.deepdyve.com
● Source code– Katta (sourceforge)–Hadoop (Apache)– Lucene (Apache)