Scalable Web Programming
CS193S - Jan Jannink - 2/25/10
Weekly Syllabus
1. Scalability: (Jan.)
2. Agile Practices
3. Ecology/Mashups
4. Browser/Client
5. Data/Server: (Feb.)
6. Security/Privacy
7. Analytics
8. Cloud/Map-Reduce
9. Published APIs: (Mar.)*
10. Future
* PROJECT DUE DATE
Cloud Recap
Progressive commoditization of IT services
Choose based on value creation in project
Build it, Scale it, Code it, Customize it
Google has most efficient data centers
Ning, SocialGo offer customizable communities
Server & Data Scaling
Traditionally depended on next hardware release
AltaVista search engine
limited to the most expensive DEC Alpha box
Original eBay build-out
massive SUN/Teradata clusters
Map/Reduce
Background
Google founders’ disdain for traditional RDBMS
Original paper published 5 years ago (OSDI 2004)
Main Features
limitless scalability on cheap hardware
real time fault tolerance
Google Example
Key ‘contrarian’ insight
scaling on cheap hardware is best
need generic API to divide and conquer
If McDonald’s needed a top chef in each store
how much more would it cost?
how many restaurants would it now have?
Hadoop
First Open Source implementation
by Doug Cutting, also of Lucene fame
Storage, Execution, Management components
Storage: HDFS, HBase, Hive
Execution: MapReduce
Management: Pig, ZooKeeper
1-2-3
Configure server clusters
Code Map & Reduce classes
input list of (key, value) pairs
output set of (key, value) pairs
Start HDFS, MapReduce servers
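As a rough illustration of step 2, the sketch below shows the shape of a map function and a reduce function in plain Python rather than Hadoop's Java API. The names `map_fn`, `reduce_fn`, and `run_job`, and the "host bytes" log records, are hypothetical; the driver runs in a single process only to show how (key, value) pairs flow from input to output.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(key: int, value: str) -> Iterator[Tuple[str, int]]:
    """Map: take one input (key, value) pair, emit intermediate pairs."""
    host, size = value.split()
    yield host, int(size)


def reduce_fn(key: str, values: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Reduce: take one key and all of its values, emit output pairs."""
    yield key, sum(values)


def run_job(records: List[Tuple[int, str]]) -> Dict[str, int]:
    """Single-process stand-in for the cluster: map, group by key, reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    results: Dict[str, int] = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_fn(key, groups[key]):
            results[out_key] = out_value
    return results


# Input: a list of (line number, "host bytes") pairs.
records = [(0, "a.example 120"), (1, "b.example 30"), (2, "a.example 80")]
totals = run_job(records)
```

On a real cluster the grouping step in `run_job` is what the framework does between the map and reduce phases; only the two functions are user code.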
Execution Schema
[Diagram built up step by step over seven slides: browser, apache, master, map and reduce stages, three data stores]
Example: Word Count
cat data.txt | sort | uniq -c
When data.txt is huge
split it into even chunks
sort chunks (or better yet, prefix sort)
count on a per item/prefix basis
return result
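The steps above can be sketched in a single Python process. This is only an illustration under stated assumptions: the function names and the chunk size are hypothetical, a hash-based `Counter` stands in for the sort step, and lines are split into words (the shell pipeline as written counts identical lines).

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def split_into_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Split the input into chunks of at most chunk_size lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def count_chunk(chunk: List[str]) -> Counter:
    """Map step: count the words within one chunk."""
    counts: Counter = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def merge_counts(partials: Iterable[Counter]) -> Counter:
    """Reduce step: add up the per-chunk counts for each word."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines: Iterable[str], chunk_size: int = 2) -> Counter:
    chunks = split_into_chunks(lines, chunk_size)
    return merge_counts(count_chunk(chunk) for chunk in chunks)


lines = ["the quick fox", "the lazy dog", "the fox"]
result = word_count(lines)
```

In a distributed run, each call to `count_chunk` would execute on a different machine, and only the small per-chunk counts would travel over the network.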
Insights
Relates to network switching, routing concepts
DB queries can generate initial (key, value) pairs
Master restarts failed map, reduce tasks
ping nodes after first results start coming in
Rule of thumb
data / 64MB ≅ # mappers > # reducers
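The rule of thumb is simple arithmetic, worked through below. The helper name `estimate_mappers` is hypothetical, and the sketch assumes binary units (1 MB = 1024² bytes); it says nothing about the reducer count beyond the slide's guidance that it should be smaller.

```python
import math

BLOCK_SIZE_BYTES = 64 * 1024**2  # 64 MB per input split, as in the rule of thumb


def estimate_mappers(data_bytes: int, block_bytes: int = BLOCK_SIZE_BYTES) -> int:
    """Roughly one mapper per 64 MB of input, and never fewer than one."""
    return max(1, math.ceil(data_bytes / block_bytes))


one_tb = 1024**4
mappers_for_1tb = estimate_mappers(one_tb)          # 2**40 / 2**26 = 2**14
mappers_for_100tb = estimate_mappers(100 * one_tb)
```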
Additional Details
Storing intermediate results
RAM/File systems on mappers, reducers
Can combine, e.g., duplicate removal, on mappers
Partition mapper results by key for reducers
Backup masters possible (not frequently used)
Caching becomes a critical system component
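A minimal sketch of the combine and partition steps follows. The function names are hypothetical, the combiner sums counts (the slide's own example is duplicate removal), and CRC32 modulo the reducer count is just one possible partitioning function, not necessarily the one any given framework uses.

```python
import zlib
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, int]


def combine(pairs: List[Pair]) -> List[Pair]:
    """Mapper-side combine: collapse repeated keys before sending data on."""
    combined: Counter = Counter()
    for key, value in pairs:
        combined[key] += value
    return list(combined.items())


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key; the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs: List[Pair], num_reducers: int) -> List[List[Pair]]:
    """Place each pair in the bucket of the reducer that owns its key."""
    buckets: List[List[Pair]] = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


mapper_output = [("fox", 1), ("the", 1), ("fox", 1), ("the", 1), ("the", 1)]
combined = combine(mapper_output)   # five pairs shrink to two
buckets = shuffle(combined, num_reducers=2)
```

CRC32 is used here instead of Python's built-in `hash()` because string hashes are randomized per process, and every mapper must agree on which reducer owns a key.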
Caching is a Huge Win
[Diagram: the execution schema (browser, apache, master, map and reduce stages, three data stores) with a cache added]
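One way to picture the benefit is a result cache placed in front of the job launcher, so repeated requests never reach the cluster. This is a sketch under assumptions: `run_mapreduce_job` is a hypothetical stand-in for an expensive job, and `functools.lru_cache` stands in for whatever cache a real deployment would use.

```python
from functools import lru_cache

job_runs = {"count": 0}  # tracks how often the expensive job actually runs


def run_mapreduce_job(query: str) -> int:
    """Hypothetical stand-in for an expensive cluster job."""
    job_runs["count"] += 1
    return len(query.split())


@lru_cache(maxsize=1024)
def cached_result(query: str) -> int:
    """Serve repeated queries from memory instead of rerunning the job."""
    return run_mapreduce_job(query)


first = cached_result("scalable web programming")
second = cached_result("scalable web programming")  # served from the cache
```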
Current Status
Current production development at Google
entirely Map/Reduce
Yahoo runs ~4000 node Hadoop cluster
100TB data sorting record < 3 hours
Facebook has ~1000 node Hadoop install
Amazon offers Elastic MapReduce
Controversy?
DB community on both sides of argument
DeWitt, Stonebraker: a giant step backwards
Abadi: HadoopDB
In web apps, however
DBs are used for persistence, not transactions
MapReduce provides scale and fault tolerance
Future Directions
AsterData, Google
hybrid Map/Reduce DBs, Data Warehouses
HadoopDB
open source PostgreSQL & Hadoop hybrid
NoSQL movement
Data Glut
Over 3 million English Wikipedia articles
1000+ new articles a day
unthinkable to use without search
Facebook Data Warehouse adds 15 TB a day
in 2007 it was 15 TB total
Google has several multi-petabyte data stores
Here to Stay
Easy to learn, easy to maintain
very incremental learning curve
Scales as fast as data is growing
only option for large data mining tasks
Accommodates multiple persistence backends
Worth Checking Out
Hadoop
http://hadoop.apache.org/
Wikimedia Report Card
http://stats.wikimedia.org/reportcard/
Data blog
http://www.dbms2.com/
Q & A Topics
Merging DB optimizers and Map/Reduce
Managing multi-stage Map/Reduce pipelines
Map/Reduce and virtual machines