distributed search solutions and comparison

Post on 15-Jan-2015


Distributed Search - Solutions and Comparison

Ngọc Bùi

trungngoc.bui@vtc.vn

Facts

FB:

750 million active users

3B photos uploaded each month. A record 750M photos were uploaded to FB over New Year's weekend.

14M videos uploaded each month

More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month.

TBs of log data daily

HOW TO FIND A NEEDLE IN THAT HUGE HAYSTACK?

Centralized Search – PROBLEM?

Lucene is great:

High-performance, full-featured search library

Incremental indexing

Boolean Query, Fuzzy Query, Range Query, Multi Phrase Query, Wild Card Query, etc.

It's great, BUT:

Slow if the index is very big

The index can grow bigger than a single HDD

No load balancing

No failover

GOAL

Reliable index serving through failover (master and nodes)

Scalable for traffic and index size by adding nodes

Distributed TF-IDF
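One way to read "Distributed TF-IDF": each shard only sees its own documents, so document frequencies have to be merged across shards before scores are comparable. The sketch below is a minimal illustration under that assumption. The function names, the toy shards and the smoothed formula `ln(1 + N / df)` are illustrative choices, not something taken from the slides.

```python
import math
from collections import Counter

def shard_stats(docs):
    """Return (document count, per-term document frequency) for one shard."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return len(docs), df

def global_idf(stats):
    """Merge per-shard statistics into cluster-wide IDF values."""
    total_docs = sum(n for n, _ in stats)
    total_df = Counter()
    for _, df in stats:
        total_df.update(df)
    # Smoothed IDF variant chosen for this sketch: ln(1 + N / df)
    return {term: math.log(1 + total_docs / d) for term, d in total_df.items()}

def tf_idf(tokens, term, idf):
    """Score one document for one term using the cluster-wide IDF."""
    return tokens.count(term) * idf.get(term, 0.0)

# Two hypothetical shards, each holding tokenized documents
shard_a = [["lucene", "index"], ["lucene", "search"]]
shard_b = [["hbase", "index"], ["hbase", "storage"], ["search", "cluster"]]

idf = global_idf([shard_stats(shard_a), shard_stats(shard_b)])
score = tf_idf(["lucene", "lucene", "index"], "lucene", idf)
```

With shard-local statistics only, "lucene" would look common on shard A and unknown on shard B; merging the counts first gives every shard the same weight for the same term.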

Solution:

Documents are indexed in parallel on different machines in a cluster. When a user issues a search, the query is sent to multiple machines in parallel.
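The idea can be sketched as a scatter-gather search: documents are assigned to shards by a hash of their id, a query is sent to every shard in parallel, and the partial top-k lists are merged. This is a toy in-memory model, not how any of the systems in this deck is implemented; `Shard`, `Cluster` and the term-frequency score are assumptions made for illustration.

```python
import heapq
import zlib
from concurrent.futures import ThreadPoolExecutor

class Shard:
    """Toy in-memory shard: maps doc id -> list of tokens."""
    def __init__(self):
        self.docs = {}

    def index(self, doc_id, text):
        self.docs[doc_id] = text.lower().split()

    def search(self, term, k):
        # Score is the raw term frequency; a real engine would use TF-IDF or similar.
        hits = [(tokens.count(term), doc_id)
                for doc_id, tokens in self.docs.items() if term in tokens]
        return heapq.nlargest(k, hits)

class Cluster:
    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def _shard_for(self, doc_id):
        # crc32 is stable across processes, unlike the built-in hash() for str
        return self.shards[zlib.crc32(doc_id.encode()) % len(self.shards)]

    def index(self, doc_id, text):
        self._shard_for(doc_id).index(doc_id, text)

    def search(self, term, k=3):
        term = term.lower()
        # Scatter: query every shard in parallel
        with ThreadPoolExecutor(max_workers=len(self.shards)) as pool:
            partials = list(pool.map(lambda s: s.search(term, k), self.shards))
        # Gather: merge the per-shard top-k lists into a global top-k
        merged = [hit for part in partials for hit in part]
        return heapq.nlargest(k, merged)

cluster = Cluster(num_shards=3)
cluster.index("d1", "needle in a haystack")
cluster.index("d2", "needle needle needle")
cluster.index("d3", "just hay")
cluster.index("d4", "one needle two needle")
results = cluster.search("needle", k=2)
```

Each shard only has to return its own top k hits, so the merge step stays cheap no matter how large the individual shards are.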

Choices:

Katta

Elastic Search

HbaseDirectory (our choice)

Katta

Katta is a distributed application running on many commodity hardware servers

An index for Katta is a folder with a set of subfolders. Those subfolders are called index shards.

The distributed configuration and locking system Zookeeper is used for master-node communication.

Pros and Cons

Pros:

Copies and distributes shards automatically to slaves.

Supports distributing queries and aggregating results.

Cons:

No indexing support.

Incremental index updates are hard.

Resharding is too expensive.

Elastic Search (www.elasticsearch.org)

Elastic Search is an open source, distributed, RESTful search engine built on top of Lucene

Automatic shard allocation

Automatic index sharding and index updates

Network interface (HTTP) for indexing data, searching and administration through a purely RESTful API

Schema free

Integrates well with Hadoop/MapReduce

Behind Elastic

Automatic shard allocation

There is no need for a load balancer in elasticsearch. Each node can receive a request, and if it can't handle the request itself, it automatically delegates it to the appropriate node(s).

If you want to scale out search, you can simply have more shards and more replicas per shard.
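The delegation behaviour described above can be modelled in a few lines: any node accepts a request, serves it locally if it holds a copy of the relevant shard, and otherwise forwards it to a node that does, rotating over the copies. This is a simplified sketch and not Elasticsearch's actual code; `Node`, `Cluster`, the placement rule and the round-robin choice are all assumptions for illustration.

```python
import itertools
import zlib

class Node:
    def __init__(self, name):
        self.name = name
        self.cluster = None   # set when the node joins a cluster
        self.local = {}       # shard id -> {doc id: document}

    def get(self, doc_id):
        """Return (name of the serving node, document) for a request sent to this node."""
        shard_id = self.cluster.shard_of(doc_id)
        if shard_id in self.local:
            # This node holds a copy of the shard: answer directly
            return self.name, self.local[shard_id].get(doc_id)
        # Otherwise delegate to one of the nodes holding a copy
        target = self.cluster.next_copy(shard_id)
        return target.name, target.local[shard_id].get(doc_id)

class Cluster:
    def __init__(self, nodes, num_shards, copies):
        self.nodes = nodes
        self.num_shards = num_shards
        self.holders = {}     # shard id -> nodes holding a copy
        self._rotation = {}   # shard id -> round-robin iterator over holders
        for shard_id in range(num_shards):
            # Toy allocation: place the copies on consecutive nodes
            chosen = [nodes[(shard_id + i) % len(nodes)] for i in range(copies)]
            for node in chosen:
                node.local[shard_id] = {}
            self.holders[shard_id] = chosen
            self._rotation[shard_id] = itertools.cycle(chosen)
        for node in nodes:
            node.cluster = self

    def shard_of(self, doc_id):
        return zlib.crc32(doc_id.encode()) % self.num_shards

    def next_copy(self, shard_id):
        return next(self._rotation[shard_id])

    def put(self, doc_id, document):
        shard_id = self.shard_of(doc_id)
        for node in self.holders[shard_id]:   # write to every copy
            node.local[shard_id][doc_id] = document

nodes = [Node("a"), Node("b"), Node("c")]
cluster = Cluster(nodes, num_shards=3, copies=2)
cluster.put("user:1", {"name": "Ngoc"})
answers = [node.get("user:1") for node in nodes]
```

In this model, adding copies per shard spreads read traffic over more nodes, which is the scaling lever the slide refers to.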

HbaseDirectory – What?

[Figure: "Directory" diagram with panels "Indexing Phase" and "Searching Phase"]

Is Directory distributed? No, but it is not impossible.

How to distribute it? Use Directory on top of a distributed storage system.

HDFS: too slow

Hbase: our choice, since it is optimized for random access, which suits the access pattern of a Lucene index.

Hbase Directory: treat Hbase as a logical "Directory".

Two Modes

Hbase Directory: lazy mode

Keep Lucene's index file structures and port them to Hbase

Only rewrite 2 libraries: FSDirectory & RAMDirectory (Directory interface)

Hbase Directory: active mode

Redesign the index structure to utilize Hbase's strengths

Rewrite: the 2 above + IndexReader & IndexWriter
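To make the lazy mode more concrete, here is a rough sketch of a Directory-like store that keeps index files unchanged but saves them as fixed-size blocks in a key-value table, so that a random-access read only touches the blocks it needs. It is not the authors' implementation: `KVDirectory`, its method names, the (file name, block number) row key and the plain dict standing in for an Hbase table are all hypothetical.

```python
class KVDirectory:
    """Directory-like file store on top of a key-value table.

    `table` is a plain dict standing in for an Hbase table. Row keys are
    (file name, block number) and values are byte blocks.
    """
    def __init__(self, block_size=4):
        self.block_size = block_size
        self.table = {}      # (name, block number) -> bytes
        self.lengths = {}    # name -> total length in bytes

    def write_file(self, name, data):
        self.delete_file(name)  # overwrite semantics: drop old blocks first
        for start in range(0, len(data), self.block_size):
            block_no = start // self.block_size
            self.table[(name, block_no)] = data[start:start + self.block_size]
        self.lengths[name] = len(data)

    def read(self, name, offset, length):
        """Random-access read that fetches only the blocks overlapping the range."""
        end = min(offset + length, self.lengths[name])
        if offset >= end:
            return b""
        first = offset // self.block_size
        last = (end - 1) // self.block_size
        chunk = b"".join(self.table[(name, b)] for b in range(first, last + 1))
        skip = offset - first * self.block_size
        return chunk[skip:skip + (end - offset)]

    def list_all(self):
        return sorted(self.lengths)

    def file_length(self, name):
        return self.lengths[name]

    def delete_file(self, name):
        size = self.lengths.pop(name, 0)
        blocks = (size + self.block_size - 1) // self.block_size
        for block_no in range(blocks):
            del self.table[(name, block_no)]

directory = KVDirectory(block_size=4)
directory.write_file("segments_1", b"0123456789")
piece = directory.read("segments_1", 3, 5)
```

The active mode goes further than this sketch: instead of storing opaque file blocks, it would redesign what is stored per row, which is why IndexReader and IndexWriter also need to be rewritten.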

Lucene index flow – Hbase flow

Performance & Conclusion

Refer to the Excel file

HbaseDirectory – Active mode is the correct choice.

Improvement needed.
