Distributed Search - Solutions and Comparison

Ngọc Bùi, trungngoc.bui@vtc.vn

FB:

750 million active users.
3B photos uploaded each month. Record 750M photos uploaded to FB over New Year’s weekend.
14M videos uploaded each month.
More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month.
TBs of log data daily.

HOW TO FIND A NEEDLE IN THAT HUGE HAYSTACK?

Centralized Search – PROBLEM?

Lucene is great: a high-performance, full-featured search library.

Incremental indexing.
Boolean Query, Fuzzy Query, Range Query, Multi Phrase Query, Wildcard Query, etc.

It’s great, BUT:

Slow if the index is very big.
The index can grow bigger than a single HDD can hold.
No load balancing.
No failover.

Reliable index serving through failover (master and nodes).

Scalable for traffic and index size by adding nodes.

Distributed TF-IDF.
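One way to read "Distributed TF-IDF": each shard only knows its own document frequencies, so scores from different shards are not directly comparable unless the statistics are merged. Below is a minimal sketch, assuming each shard can report its document count and per-term document frequencies. The `ShardStats` type and `global_idf` function are hypothetical names, and the classic Lucene-style IDF formula is used only for illustration; it is not taken from the slides.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ShardStats:
    """Hypothetical per-shard statistics: doc count and doc frequency per term."""
    num_docs: int
    doc_freq: Counter = field(default_factory=Counter)


def global_idf(shards, term):
    """IDF computed from statistics summed over all given shards."""
    total_docs = sum(s.num_docs for s in shards)
    total_df = sum(s.doc_freq[term] for s in shards)
    # Classic Lucene-style IDF, 1 + ln(N / (df + 1)); assumed here for illustration.
    return 1.0 + math.log(total_docs / (total_df + 1))


shards = [
    ShardStats(num_docs=1000, doc_freq=Counter({"needle": 1})),
    ShardStats(num_docs=1000, doc_freq=Counter({"needle": 400})),
]
local = [global_idf([s], "needle") for s in shards]  # each shard on its own
merged = global_idf(shards, "needle")                # one value shared by all shards
```

With purely local statistics the same term looks rare on one shard and common on the other; the merged value sits in between and gives every shard the same weight for the term.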

Solution:

Documents are indexed in parallel on different machines in a cluster. When a user issues a search, the query is sent to multiple machines in parallel and the partial results are merged.
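The fan-out described above is often called scatter-gather. Here is a minimal sketch, assuming two in-memory "shards" and a toy term-count scorer in place of real Lucene indexes; `SHARDS`, `search_shard`, and `distributed_search` are hypothetical names used only for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "shards": each maps a doc id to its text.
SHARDS = [
    {"a1": "lucene is a search library", "a2": "hbase stores big tables"},
    {"b1": "distributed search with lucene shards", "b2": "zookeeper coordinates nodes"},
]


def search_shard(shard, term):
    """Toy scorer: score is the number of occurrences of the term in the document."""
    hits = []
    for doc_id, text in shard.items():
        score = text.split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return hits


def distributed_search(term, top_k=10):
    """Send the query to every shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, term), SHARDS)
        all_hits = [hit for hits in partials for hit in hits]
    return heapq.nlargest(top_k, all_hits)


results = distributed_search("lucene")
```

In a real cluster each shard search would be a network call to another machine, and the merge step would need scores that are comparable across shards.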

Choices: Katta, Elastic Search, HbaseDirectory (our choice).

Katta is a distributed application running on many commodity hardware servers

An index for Katta is a folder with a set of subfolders. Those subfolders are called index shards.
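To make the folder layout concrete, the sketch below builds a throwaway index folder and lists its shard subfolders. It uses only the Python standard library, not Katta itself; the folder names and the `list_shards` helper are made up for illustration.

```python
import tempfile
from pathlib import Path


def list_shards(index_dir):
    """Return the names of the subfolders (shards) of an index folder, sorted."""
    return sorted(p.name for p in Path(index_dir).iterdir() if p.is_dir())


# Build a throwaway index folder with three shard subfolders (names are made up).
with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp) / "my-index"
    for name in ("shard-0", "shard-1", "shard-2"):
        (index_dir / name).mkdir(parents=True)
    shards = list_shards(index_dir)
```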

The distributed configuration and locking system ZooKeeper is used for master-node communication.

Pros and Cons

Pros: Copies and distributes shards automatically to slaves. Supports distributing queries and aggregating results.

Cons: No indexing support. Incremental index updates are hard. Resharding is too expensive.

Elastic Search (www.elasticsearch.org)

Elastic Search is an open source, distributed, RESTful search engine built on top of Lucene.

Automatic shard allocation.
Automatic index sharding and index updates.
Network interface (HTTP) for indexing, searching, and administration through a purely RESTful API.
Schema free.
Integrates well with Hadoop/MapReduce.

Behind Elastic

automatic shard allocation

There is no need for a load balancer in elasticsearch: each node can receive a request and, if it can’t handle it, automatically delegates it to the appropriate node(s).
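The delegation idea can be sketched as follows. This is a simplified model, not elasticsearch's actual code: the allocation table, node names, and the use of CRC32 as the hash are all assumptions made for illustration.

```python
import zlib

NUM_SHARDS = 4
# Hypothetical shard-to-node allocation table that every node knows.
SHARD_OWNER = {0: "node-a", 1: "node-b", 2: "node-a", 3: "node-b"}


def shard_for(doc_id):
    """Pick a shard from a stable hash of the document id (CRC32 is an assumption)."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS


def handle(receiving_node, doc_id):
    """Serve locally when this node owns the shard, otherwise delegate to the owner."""
    owner = SHARD_OWNER[shard_for(doc_id)]
    if owner == receiving_node:
        return f"{receiving_node} served {doc_id}"
    return f"{receiving_node} forwarded {doc_id} to {owner}"
```

Because every node holds the same allocation table, a client can send a request to any node and still reach the node that owns the data.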

If you want to scale out search, you can simply add more shards and more replicas per shard.
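Shard and replica counts are index settings passed over the REST API. The sketch below only builds the JSON body for an index-creation request and does not contact a server; `number_of_shards` and `number_of_replicas` are the elasticsearch setting names, while the host, the index name `photos`, and the helper `index_settings` are made up for illustration.

```python
import json


def index_settings(shards, replicas):
    """Build the JSON body for an index-creation request."""
    return json.dumps(
        {"settings": {"number_of_shards": shards, "number_of_replicas": replicas}}
    )


# Would be sent as: PUT http://localhost:9200/photos (host and index name are made up).
body = index_settings(shards=5, replicas=2)
```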

HbaseDirectory – What?
