cloud computing lecture #1 what is cloud...

22
1 Cloud Computing Lecture #1 What is Cloud Computing? What is Cloud Computing? (and an intro to parallel/distributed processing) Jimmy Lin The iSchool University of Maryland University of Maryland Wednesday, September 3, 2008 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details Some material adapted from slides by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (licensed under Creation Commons Attribution 3.0 License) Source: http://www.free-pictures-photos.com/

Upload: vuongtram

Post on 26-Mar-2018

223 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

1

Cloud Computing Lecture #1What is Cloud Computing?What is Cloud Computing?(and an intro to parallel/distributed processing)

Jimmy LinThe iSchoolUniversity of MarylandUniversity of Maryland

Wednesday, September 3, 2008

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United StatesSee http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

Some material adapted from slides by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (licensed under Creation Commons Attribution 3.0 License)

Source: http://www.free-pictures-photos.com/

Page 2: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

2

What is Cloud Computing?1. Web-scale problems

2. Large data centers

ff f3. Different models of computing

4. Highly-interactive Web applications

The iSchoolUniversity of Maryland

1. Web-Scale ProblemsCharacteristics:

Definitely data-intensiveMay also be processing intensive

Examples:Crawling, indexing, searching, mining the Web“Post-genomics” life sciences researchOther scientific data (physics, astronomers, etc.)Sensor networksWeb 2.0 applications

The iSchoolUniversity of Maryland

Web 2.0 applications…

Page 3: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

3

How much data?Wayback Machine has 2 PB + 20 TB/month (2006)

Google processes 20 PB a day (2008)

““all words ever spoken by human beings” ~ 5 EB

NOAA has ~1 PB climate data (2007)

CERN’s LHC will generate 15 PB a year (2008)

640K ought to be

The iSchoolUniversity of Maryland

genough for anybody.

Maximilien Brice, © CERN

Page 4: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

4

Maximilien Brice, © CERN

There’s nothing like more data!s/inspiration/data/g;

The iSchoolUniversity of Maryland(Banko and Brill, ACL 2001)

(Brants et al., EMNLP 2007)

Page 5: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

5

What to do with more data?Answering factoid questions

Pattern matching on the WebWorks amazingly well

Learning relationsStart with seed instancesSearch for patterns on the WebUsing patterns to find more instances

Who shot Abraham Lincoln? → X shot Abraham Lincoln

Wolfgang Amadeus Mozart (1756 1791)

The iSchoolUniversity of Maryland

Birthday-of(Mozart, 1756)Birthday-of(Einstein, 1879)

Wolfgang Amadeus Mozart (1756 - 1791)Einstein was born in 1879

PERSON (DATE –PERSON was born in DATE

(Brill et al., TREC 2001; Lin, ACM TOIS 2007)(Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; … )

2. Large Data CentersWeb-scale problems? Throw more machines at it!

Clear trend: centralization of computing resources in large data centersdata centers

Necessary ingredients: fiber, juice, and spaceWhat do Oregon, Iceland, and abandoned mines have in common?

Important Issues:RedundancyEffi i

The iSchoolUniversity of Maryland

EfficiencyUtilizationManagement

Page 6: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

6

Source: Harper’s (Feb, 2008)

Maximilien Brice, © CERN

Page 7: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

7

Key Technology: Virtualization

App App App

Hardware

Operating System

App App App

Traditional Stack

Hardware

OS

App App App

Hypervisor

OS OS

Virtualized Stack

The iSchoolUniversity of Maryland

3. Different Computing Models

Utility computingWhy buy machines when you can rent cycles?

“Why do it yourself if you can pay someone to do it for you?”

Why buy machines when you can rent cycles?Examples: Amazon’s EC2, GoGrid, AppNexus

Platform as a Service (PaaS)Give me nice API and take care of the implementationExample: Google App Engine

Software as a Service (SaaS)

The iSchoolUniversity of Maryland

( )Just run it for me!Example: Gmail

Page 8: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

8

4. Web Applications A mistake on top of a hack built on sand held together by duct tape?

What is the nature of software applications?What is the nature of software applications?From the desktop to the browserSaaS == Web-based applicationsExamples: Google Maps, Facebook

How do we deliver highly-interactive Web-based applications?

The iSchoolUniversity of Maryland

AJAX (asynchronous JavaScript and XML)For better, or for worse…

What is the course about?MapReduce: the “back-end” of cloud computing

Batch-oriented processing of large datasets

Ajax: the “front-end” of cloud computingAjax: the front end of cloud computingHighly-interactive Web-based applications

Computing “in the clouds”Amazon’s EC2/S3 as an example of utility computing

The iSchoolUniversity of Maryland

Page 9: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

9

Amazon Web ServicesElastic Compute Cloud (EC2)

Rent computing resources by the hourBasic unit of accounting = instance-hourAdditional costs for bandwidth

Simple Storage Service (S3)Persistent storageCharge by the GB/monthAdditional costs for bandwidth

You’ll be using EC2/S3 for course assignments!

The iSchoolUniversity of Maryland

You ll be using EC2/S3 for course assignments!

This course is not for you…If you’re not genuinely interested in the topic

If you’re not ready to do a lot of programming

fIf you’re not open to thinking about computing in new ways

If you can’t cope with uncertainly, unpredictability, poor documentation, and immature software

If you can’t put in the time

The iSchoolUniversity of Maryland

Otherwise, this will be a richly rewarding course!

Page 10: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

10

Source: http://davidzinger.wordpress.com/2007/05/page/2/

Cloud Computing ZenDon’t get frustrated (take a deep breath)…

This is bleeding edge technologyThose W$*#T@F! moments

Be patient… This is the second first time I’ve taught this course

Be flexible…There will be unanticipated issues along the way

Be constructive…

The iSchoolUniversity of Maryland

Tell me how I can make everyone’s experience better

Page 11: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

11

Source: Wikipedia

Source: Wikipedia

Page 12: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

12

Source: Wikipedia

Source: Wikipedia

Page 13: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

13

Things to go over…Course schedule

Assignments and deliverables

C /SAmazon EC2/S3

The iSchoolUniversity of Maryland

Web-Scale Problems?Don’t hold your breath:

BiocomputingNanocomputingQuantum computing…

It all boils down to…Divide-and-conquerThrowing more hardware at the problem

The iSchoolUniversity of Maryland

Simple to understand… a lifetime to master…

Page 14: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

14

Divide and Conquer

“Work” Partition

w1 w2 w3

r1 r2 r3

“worker” “worker” “worker”

The iSchoolUniversity of Maryland

“Result” Combine

Different WorkersDifferent threads in the same core

Different cores in the same CPU

ff CDifferent CPUs in a multi-processor system

Different machines in a distributed system

The iSchoolUniversity of Maryland

Page 15: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

15

Choices, Choices, ChoicesCommodity vs. “exotic” hardware

Number of machines vs. processor vs. cores

fBandwidth of memory vs. disk vs. network

Different programming models

The iSchoolUniversity of Maryland

Flynn’s Taxonomy

Instructions

Single (SI) Multiple (MI)

Dat

a

MD

)

SISD

Single-threaded process

MISD

Pipeline architecture

SIMD MIMD

Sin

gle

(SD

)

The iSchoolUniversity of Maryland

Mul

tiple

(M Vector Processing Multi-threaded Programming

Page 16: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

16

SISD

D D D D D D D

Processor

Instructions

The iSchoolUniversity of Maryland

SIMD

Processor

D0 D0D0 D0 D0 D0

D1

D2

D3

D4

D1

D2

D3

D4

D1

D2

D3

D4

D1

D2

D3

D4

D1

D2

D3

D4

D1

D2

D3

D4

D1

D2

D3

D4

D0

The iSchoolUniversity of Maryland

Instructions

Dn Dn Dn Dn Dn DnDn

Page 17: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

17

MIMD

D D D D D D D

Processor

Instructions

D D D D D D D

Processor

The iSchoolUniversity of Maryland

D D D D D D D

Instructions

Memory Typology: Shared

Memory

Processor

Processor Processor

Processor

The iSchoolUniversity of Maryland

Page 18: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

18

Memory Typology: Distributed

MemoryProcessor MemoryProcessor

MemoryProcessor MemoryProcessor

Network

The iSchoolUniversity of Maryland

Memory Typology: Hybrid

Processor ProcessorMemory

Processor

Network

MemoryProcessor

MemoryProcessor

MemoryProcessor

The iSchoolUniversity of Maryland

e o yProcessor

e o yProcessor

Page 19: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

19

Parallelization ProblemsHow do we assign work units to workers?

What if we have more work units than workers?

f ?What if workers need to share partial results?

How do we aggregate partial results?

How do we know all the workers have finished?

What if workers die?

The iSchoolUniversity of Maryland

What is the common theme of all of these problems?

General Theme?Parallelization problems arise from:

Communication between workersAccess to shared resources (e.g., data)

Thus, we need a synchronization system!

This is tricky:Finding bugs is hardSolving bugs is even harder

The iSchoolUniversity of Maryland

Page 20: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

20

Managing Multiple WorkersDifficult because

(Often) don’t know the order in which workers run(Often) don’t know where the workers are running(Often) don’t know when workers interrupt each other

Thus, we need:Semaphores (lock, unlock)Conditional variables (wait, notify, broadcast)Barriers

Still lots of problems:

The iSchoolUniversity of Maryland

Still, lots of problems:Deadlock, livelock, race conditions, ...

Moral of the story: be careful!Even trickier if the workers are on different machines

Patterns for ParallelismParallel computing has been around for decades

Here are some “design patterns” …

The iSchoolUniversity of Maryland

Page 21: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

21

Master/Slaves

master

The iSchoolUniversity of Maryland

slaves

Producer/Consumer Flow

CP

P

P

C

C

CP

P

P

C

C

The iSchoolUniversity of Maryland

Page 22: Cloud Computing Lecture #1 What is Cloud Computing?lintool.github.io/UMD-courses/bigdata-2008-Fall/Session1.pdf · 2 What is Cloud Computing? 1. Web-scale problems 2. Large data centers

22

Work Queues

CP

P

P

C

C

shared queue

W W W W W

The iSchoolUniversity of Maryland

Rubber Meets RoadFrom patterns to implementation:

pthreads, OpenMP for multi-threaded programmingMPI for clustering computing…

The reality:Lots of one-off solutions, custom codeWrite you own dedicated library, then program with itBurden on the programmer to explicitly manage everything

MapReduce to the rescue!

The iSchoolUniversity of Maryland

MapReduce to the rescue!(for next time)