Scaling Information Retrieval to the Web
1
Yinon Bentor, April 13, 2010
Overview
• What is large data? How big is it? How do we handle it?
• What we don’t want to do
• The Google Platform (and the Apache Platform, and the Amazon Platform, ...)
• MapReduce for robust, efficient batch computation
• Distributed File Systems (GFS, HDFS), and why they’re useful
• Distributed Databases: BigTable, CouchDB, HBase
2
Overview
• And how does this apply to Information Retrieval?
• Distributed implementation of Inverted Indexing
• MapReduce for PageRank
• What else can we do?
• Practical considerations
3
Large Data
• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s LHC will generate 15 PB a year
4
[Slide from Jimmy Lin: http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/ ]
What Can We Do With All This Data?
• Data Mining
• Question Answering
• Machine Translation
• Recommendation
• Ad Placement
• Train Classifiers (e.g., Spam Filters)
• Analyze Social Graphs
• “Discover the secrets of the universe”
5
“There’s no data like more data”
6
Numbers Everyone Should Know*
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lock/unlock 25 ns
Main memory reference 100 ns
Send 2K bytes over 1 Gbps network 20,000 ns
Read 1 MB sequentially from memory 250,000 ns
Round trip within same datacenter 500,000 ns
Disk seek 10,000,000 ns
Read 1 MB sequentially from disk 20,000,000 ns
Send packet CA → Netherlands → CA 150,000,000 ns
* According to Jeff Dean (LADIS 2009 keynote) [Slide from Jimmy Lin: http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/ ]
What are the Lessons?
• CPUs are fast, memory is slow, disk is slower: use variable-length encodings, compression, etc.
• Read from memory whenever possible: Memory reads are ~80x faster than disk
• Prefer sequential disk reads to random access
• Prefer large files (64MB block sizes aren’t bad)
• Locality is important: Keep it within the same cache read, memory page, machine, rack, data center, continent, …
7
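To make these numbers concrete, here is a small back-of-the-envelope calculation (not from the original slides) that turns the latency table above into scan times for 1 GB of data:

# Rough scan-time arithmetic using the latency figures from the table above.
NS_PER_MB_MEMORY = 250_000     # read 1 MB sequentially from memory
NS_PER_MB_DISK = 20_000_000    # read 1 MB sequentially from disk
NS_PER_SEEK = 10_000_000       # one disk seek

def scan_seconds(megabytes, ns_per_mb, seeks=0):
    # total time = sequential-read time + seek time, converted to seconds
    return (megabytes * ns_per_mb + seeks * NS_PER_SEEK) / 1e9

print(scan_seconds(1024, NS_PER_MB_MEMORY))           # ~0.26 s: 1 GB from memory
print(scan_seconds(1024, NS_PER_MB_DISK))             # ~20.5 s: 1 GB sequentially from disk
print(scan_seconds(1024, NS_PER_MB_DISK, seeks=1024)) # ~30.7 s: 1 GB with a seek per MB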
What we don’t want to do
Use expensive machines (they fail too)
➡ Cheap commodity hardware is better
Die on hardware failure
➡ Build reliability in software
Wait on shared resources
➡ Distribute everything
Transfer data unnecessarily
➡ Move code to data instead
8
9
Distributed/Cloud Computing Platforms
• Computation: MapReduce (Google), Hadoop (Apache/Yahoo), EC2 / Elastic MapReduce (Amazon)
• File Storage: Google File System (GFS), HDFS, Amazon S3
• Database: BigTable (Google); HBase, Cassandra, CouchDB (Apache/Yahoo); Amazon SimpleDB
MapReduce
10
“A simple programming model that applies to many large scale computing problems”
[Slide from Jeff Dean LADIS 2009]
Hide messy details in MapReduce runtime library:
• automatic parallelization
• load balancing
• network and disk transfer optimizations
• handling of machine failures
• robustness
Improvements to core library benefit all users of library.
Programming Model (Lisp)
• map: take a list and a function f of 1 argument, apply f to each element:
map([1, 2, 4, 10],
    function(x) {return x*x;})
> [1, 4, 16, 100]
• fold: take a list, a function g of 2 arguments, and an accumulator value; apply g iteratively to the accumulator and each value:
fold([1, 4, 16, 100], 0,
     function(x, y) {return x+y;})
> 121
11
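For reference, a runnable Python 3 equivalent of the two examples above (not part of the original slide):

from functools import reduce

print(list(map(lambda x: x * x, [1, 2, 4, 10])))           # [1, 4, 16, 100]
print(reduce(lambda acc, y: acc + y, [1, 4, 16, 100], 0))  # 121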
MapReduce Semantics
map:(k1, v1) → [(k2, v2)]
[sort and group by k2]
reduce:
(k2, [v2]) → [(k3, v3)]
12
MapReduce Operation
13
[Image from Jimmy Lin, Data Intensive Processing with MapReduce, (forthcoming)]
Word Count Example
14
[Figure: word count data flow. Input documents are fed to parallel MAP tasks, each emitting (term, count) pairs for its document; the pairs are grouped by term and sorted; REDUCE tasks then sum the counts per term, e.g. all 1266, cat 72, Dracula 37, school 11.]
Word Count: Pseudocode
15
[Jimmy Lin, Data Intensive Processing with MapReduce, (forthcoming)]
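The pseudocode figure itself is not reproduced in this transcript. A minimal Python sketch of the same mapper/reducer pair, following the (key, value) semantics from slide 12 (function names are illustrative, not Lin's):

def map_word_count(doc_id, text):
    # (k1, v1) = (doc_id, text); emit (term, 1) for every token
    for term in text.split():
        yield (term, 1)

def reduce_word_count(term, counts):
    # counts holds every value emitted for this term after the sort/group step
    yield (term, sum(counts))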
Generating Map Tiles
16
[Slide from Jeff Dean LADIS 2009]
Inverted Indexing
• Recall that an Inverted Index is a map from a term to its posting list
• A Posting List records each occurrence of the term in the corpus. For each posting, we might store:
• Additionally, we might want to compute Document Frequency (DF)
17
• DocID
• Position
• Features: Anchor? Title? Font size
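As a concrete illustration (the field set mirrors the slide; the class itself is not from the original), a posting could be represented as:

from dataclasses import dataclass

@dataclass
class Posting:
    doc_id: int
    position: int
    in_anchor: bool   # does the term occur in anchor text?
    in_title: bool    # does the term occur in the title?
    font_size: int    # font size of the occurrence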
Inverted Index in MapReduce (Basic Implementation)
18
[Jimmy Lin, Data Intensive Processing with MapReduce, (forthcoming)]
Inverted Index in MapReduce (Basic Implementation)
19
[Jimmy Lin, Data Intensive Processing with MapReduce, (forthcoming)]
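The referenced pseudocode figure is missing from this transcript; a rough Python sketch of the basic algorithm, assuming the mapper receives (doc_id, text) pairs (names are illustrative):

from collections import Counter

def map_index(doc_id, text):
    # emit one (term, posting) pair per distinct term in the document
    for term, tf in Counter(text.split()).items():
        yield (term, (doc_id, tf))

def reduce_index(term, postings):
    # gather the complete posting list for this term and sort it by doc id
    yield (term, sorted(postings))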
Inverted Index in MapReduce (Extensions)
• Mapper could:
• parse HTML or other data
• extract additional features from each page and emit more detailed postings
• Reducer could:
• implement compression, partitioning, and coding for more efficient retrieval
20
Inverted Index in MapReduce (Limitations)
• The basic implementation has a big scalability bottleneck. Using your IR knowledge, can you spot it?
• Vocabulary size is governed by Heaps' Law; posting-list size is governed by Zipf's Law. For some terms, we might not be able to fit the posting list in memory!
• Workarounds exist; see [Lin 2010] (one is sketched below)
21
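One workaround Lin describes is value-to-key conversion: move the document id out of the value and into the key, so the framework sorts postings for us and the reducer can write them out one at a time instead of buffering a whole posting list. A rough sketch (illustrative, not Lin's exact pseudocode):

from collections import Counter

def map_index(doc_id, text):
    for term, tf in Counter(text.split()).items():
        # composite key: the framework now sorts by (term, doc_id)
        yield ((term, doc_id), tf)

# A custom partitioner must route every (term, *) key to the same reducer,
# which then streams postings for each term in sorted order.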
PageRank in MapReduce
Recall that graphs can be represented as adjacency matrices or adjacency lists:
22
[Image from Jimmy Lin, Cloud Computing Course: http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/index.html]
Which one is more appropriate for our task?
(Simplified) PageRank in MapReduce
23
(Assuming α=0 and no dangling edges) [Images from Jimmy Lin, Cloud Computing Course: http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/index.html]
(Simplified) PageRank in MapReduce
24
[Jimmy Lin, Data Intensive Processing with MapReduce, (forthcoming)]
Each iteration is a MapReduce:
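The iteration figure does not survive in this transcript; a minimal Python sketch of one simplified iteration under the slide's assumptions (α=0, no dangling nodes), with illustrative names:

def map_pagerank(node_id, node):
    rank, out_links = node
    # pass the graph structure through to the reducer
    yield (node_id, ('GRAPH', out_links))
    # divide this node's rank evenly over its out-links
    for m in out_links:
        yield (m, ('MASS', rank / len(out_links)))

def reduce_pagerank(node_id, values):
    out_links, new_rank = [], 0.0
    for tag, payload in values:
        if tag == 'GRAPH':
            out_links = payload
        else:
            new_rank += payload
    yield (node_id, (new_rank, out_links))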
What about Retrieval?
The indexing problem (great for MapReduce!):
• Scalability is paramount
• Must be relatively fast, but need not be real time
• Fundamentally a batch operation
• Incremental updates may or may not be important
• For the web, crawling is a challenge in itself
The retrieval problem (not so great for MapReduce):
• Must have sub-second response time
• For the web, only need relatively few results
25
[Slide from Jimmy Lin, Cloud Computing Course: http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/index.html]
MapReduce: Execution
The MapReduce framework:
• schedules mappers and reducers
• allocates workers close to data
• checks for slow or failed processes periodically and re-submits
• handles sorting and combining efficiently
26
MapReduce: Conclusions
• Divide and Conquer on a massive scale
• Can efficiently handle many IR batch tasks:
• Indexing, PageRank, Language Modeling, Sequence Alignment (for Translation), Classification, and more
• A reasonable abstraction, trading off flexibility against ease of implementation
27
[Dean and Ghemawat, OSDI 2004]
File Storage
• In traditional supercomputers, storage and computation are kept separate. This means data must be transferred through fast interconnects to compute nodes (bad!).
• Google File System (GFS) and the Hadoop Distributed File System (HDFS) keep data replicated across cheap commodity hardware
• Each file is replicated at least 3 times (more for highly-used or critical files)
28
GFS: Design Considerations
• Large files: 64MB chunks (why?) stored on Chunkservers
• GFS Masters manage metadata
• Clients retrieve file data directly from chunkservers
29
[Dean, Handling Large Datasets at Google: http://research.yahoo.com/files/6DeanGoogle.pdf]
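A rough sketch of the read path this division of labor implies: the client asks the master only for metadata, then pulls bytes directly from a chunkserver. All names below are hypothetical, not the actual GFS or HDFS API:

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

def read_file(master, path, offset, length):
    # 1) metadata lookup: which chunk holds this offset, and where are its replicas?
    chunk_index = offset // CHUNK_SIZE
    chunk_handle, replicas = master.lookup(path, chunk_index)
    # 2) data transfer: read directly from one replica; no bytes flow through the master
    return replicas[0].read_chunk(chunk_handle, offset % CHUNK_SIZE, length)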
GFS Usage (2007)
• 200+ GFS clusters
• Largest clusters:
• 5000+ machines
• 5+ PB of disk usage
• 10000+ clients
30
[Dean, Handling Large Datasets at Google: http://research.yahoo.com/files/6DeanGoogle.pdf]
Semi-Structured Data
• Traditional relational databases fail at this scale: most operations are too expensive.
• Solution: distributed databases
• Google’s BigTable stores data as a “sparse, distributed multi-dimensional sorted map”
• In the open-source world, Cassandra (Digg/Twitter/Facebook), HBase (Yahoo, others) and CouchDB perform similar roles
31
32
[Figure: visual guide to NoSQL systems, from http://blog.nahurst.com/visual-guide-to-nosql-systems]
BigTable Design Considerations
• Loose schema and data types
• BigTables are divided into tablets, which are replicated and distributed
• Tablets are ~100-200 MB each and are stored in GFS; each machine hosts ~100 tablets
• Optimized for reads and appends
• Tablets can be reallocated on failures or increased load
33
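Because the table is a sorted map split into tablets, locating the tablet that owns a row key amounts to a search over sorted range boundaries. A toy illustration (not BigTable's actual API):

import bisect

# tablets described by their (inclusive) end row keys, kept in sorted order
tablet_end_keys = ['f', 'm', 't', '\uffff']
tablet_servers = ['server-1', 'server-2', 'server-3', 'server-4']

def locate_tablet(row_key):
    # the first tablet whose end key is >= the row key owns that key
    i = bisect.bisect_left(tablet_end_keys, row_key)
    return tablet_servers[i]

print(locate_tablet('com.example.www'))  # -> 'server-1'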
BigTable at Google
• Used widely: Google Earth, Analytics, Crawl, Print, Orkut, Blogger, …
• Largest cluster (2009): 70+ PB of data, 10M ops/sec, 30+ GB/s I/O
34
[Jeff Dean, LADIS 2009 Keynote]
Fitting It All Together
• Operating at web scale requires completely distributed, fault-tolerant systems
• Replication and data locality are key
• Good abstractions allow smart programmers to be efficient
• Data is only going to get bigger
35
Questions?
36