1 BARC BARC Microsoft Bay Area Research Center Tom Barclay Tyler Beam (U VA)* Gordon Bell Joe Barrera Josh Coates (UCB)* Jim Gemmell Jim Gray Steve Lucco Erik Riedel (CMU)* Eve Schooler (Cal Tech) Don Slutz Catherine Van Ingen (NTFS)* http://www.research.Microsoft.com/barc/

Upload: penelope-welch

Post on 11-Jan-2016


TRANSCRIPT

Page 1:

BARC: Microsoft Bay Area Research Center

Tom Barclay, Tyler Beam (U VA)*, Gordon Bell, Joe Barrera, Josh Coates (UCB)*, Jim Gemmell, Jim Gray, Steve Lucco, Erik Riedel (CMU)*, Eve Schooler (Cal Tech), Don Slutz, Catherine Van Ingen (NTFS)*

http://www.research.Microsoft.com/barc/

Page 2:

Overview

• Telepresence
» Goals
» Prototypes
• RAGS: automating software testing
• Scalable Systems
» Goals
» Prototypes
• Misc.


Page 4:

Telepresence: The next Killer App

• Space shifting:
» Reduce travel
• Time shifting:
» Retrospectives
» Condensations
» Just in time meetings.
• Example: ACM 97
» http://research.Microsoft.com/barc/acm97/
» NetShow and Web site.
» More web visitors than attendees

Page 5:

What We Are Doing

• Scalable Reliable Multicast (SRM)
» Used by WB (white board) of Mbone
» NACK suppression (backoff)
» N² message traffic to set up
• Error Correcting SRM (EC SRM)
» Do not resend lost packets.
» Send error correction in addition to regular packets
» (or) send error correction in response to NACK
» One EC packet repairs any of k lost packets
» Improved scalability (millions of subscribers).

Page 6:

Telepresence Prototypes

• PowerCast: multicast PowerPoint
» Streaming: pre-sends next anticipated slide
» Send slides and voice rather than talking head and voice
» Uses ECSRM for reliable multicast
» 1000’s of receivers can join and leave any time.
» No server needed; no pre-load of slides.
» Cooperating with NetShow
• FileCast: multicast file transfer.
» Erasure encodes all packets
» Receivers only need to receive as many bytes as the length of the file
» Multicast IE to solve Midnight-Madness problem
• NT SRM: reliable IP multicast library for NT
• Spatialized Teleconference Station
» Texture map faces onto spheres
» Space map voices

Page 7:

IP Multicast

• Is pruned broadcast to a multicast address
• Unreliable
• Reliable would require Ack/Nack.
• State or Nack implosion problem

[Figure: tree of routers connecting hosts; legend: sender, receiver, not interested.]

Page 8:

(n,k) encoding

[Figure: original packets 1, 2, ..., k are encoded (copy 1st k) into packets 1, 2, ..., k, k+1, k+2, ..., n; take any k of them and decode to recover the original packets 1, 2, ..., k.]
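The "take any k" property can be illustrated with the simplest possible erasure code: n = k + 1, where the single extra packet is the XOR of the k originals. This is only a sketch under that assumption; the slides do not say which (n,k) code the BARC prototypes use, and the function names here are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length packets byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(originals):
    """Return n = k + 1 packets: the k originals (copied) plus one XOR parity packet."""
    parity = reduce(xor_bytes, originals)
    return list(originals) + [parity]


def decode(received, k):
    """Rebuild the k originals from a dict {index: payload} holding any k of the n packets."""
    missing = [i for i in range(k) if i not in received]
    if not missing:
        return [received[i] for i in range(k)]
    if len(missing) > 1 or k not in received:
        raise ValueError("a (k+1, k) XOR code can repair only one lost packet")
    # XOR of the parity packet and the k - 1 surviving originals is the lost original.
    repaired = reduce(xor_bytes, received.values())
    restored = dict(received)
    restored[missing[0]] = repaired
    return [restored[i] for i in range(k)]
```

For example, with three originals, losing packet 1 and receiving packets 0, 2, and the parity packet 3 is enough to rebuild all three. Codes with larger n - k (for instance Reed-Solomon) tolerate more losses, at the cost of more arithmetic.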

Page 9:

Fcast

• File transfer protocol
• FEC-only
• Files transmitted in parallel

Page 10:

Fcast send order

[Figure: rows of packets 1, 2, ..., k, k+1, k+2, ..., n, grouped under File 1 and File 2; X marks a lost packet. Need k from each row.]

Page 11:

ECSRM - Erasure Correcting SRM

• Combines:

» suppression

» erasure correction

Page 12:

Suppression

• Delay a NACK or repair in the hopes that someone else will do it.

• NACKs are multicast

• After NACKing, re-set timer and wait for repair

• If you hear a NACK that you were waiting to send, then re-set your timer as if you did send it.
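The suppression rule above can be sketched as a tiny model. It assumes every receiver missed the same packet, each has armed a backoff timer, and a multicast NACK reaches all receivers after one fixed propagation delay; the function name and the millisecond units are illustrative, not taken from the slides.

```python
def nacks_sent(timer_delays_ms, propagation_delay_ms):
    """Count how many receivers actually multicast a NACK for one lost packet.

    A receiver sends its NACK when its timer fires, unless it has already
    heard someone else's NACK, in which case it stays quiet and re-sets its timer.
    """
    first_fire = min(timer_delays_ms)
    heard_at = first_fire + propagation_delay_ms  # when the first NACK reaches everyone
    return sum(1 for delay in timer_delays_ms if delay < heard_at)
```

With timers of 100, 500, and 900 ms and a 200 ms propagation delay, only the first receiver sends. If the second timer were 200 ms, it would fire before the first NACK arrived, and two NACKs would go out. Spreading the timers out relative to the propagation delay is what keeps the NACK count low.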

Page 13:

ECSRM - adding FEC to suppression

• Assign each packet to an EC group of size k
• NACK: (group, # missing)
• NACK of (g,c) suppresses all (g,x≤c).
• Don’t re-send originals; send EC packets using (n,k) encoding
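A minimal sketch of the NACK bookkeeping described on this slide. It assumes the suppression rule reads "a NACK for (g, c) suppresses any pending NACK (g, x) with x ≤ c" (the rule is garbled in the transcript), and that any EC packet repairs any one loss in its group; both function names are hypothetical.

```python
def should_suppress(pending, heard):
    """pending and heard are NACKs of the form (group, number_missing)."""
    same_group = pending[0] == heard[0]
    return same_group and pending[1] <= heard[1]


def ec_packets_needed(missing_per_receiver):
    """EC packets the sender must multicast for one group.

    missing_per_receiver maps receiver -> number of packets it lost from the group.
    Because every EC packet helps every receiver, the worst-off receiver sets the count.
    """
    return max(missing_per_receiver.values(), default=0)
```

If three receivers each lost one (different) packet from the same group, one NACK of (g, 1) suppresses the other two, and a single EC packet repairs all three receivers.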

Page 14:

ECSRM

• Combine suppression & erasure correction
• Assign each packet to an EC group of size k
• NACK: (group, # missing)
• NACK of (g,c) suppresses all (g,x≤c).
• Don’t re-send originals; send EC packets using (n,k) encoding
• Below, one NACK and one EC packet fix all errors.

[Figure: packets 1 to 7 followed by one EC packet, shown arriving at several receivers; X marks the packets each receiver lost.]

Page 15:

Multicast PowerPoint Add-in

[Figure: slides, annotations, and control information are sent over ECSRM; the slide master is sent over Fcast.]

Page 16:

Multicast PowerPoint - Late Joiners

• Viewers joining late don’t impact others with session persistent data (slide master)

[Figure: timeline with join and leave events; Fcast and ECSRM channels shown over time.]

Page 17:

Future Work

• Adding hierarchy (e.g. PGM by Cisco)

• Do we need 2 protocols?

Page 18:

Spatialized Teleconferences

• Map heads to “Eggs”

• Project voices in stereo using “nose vector”

Page 19:

RAGS: RAndom SQL test Generator

• Microsoft spends a LOT of money on testing (60% of development according to one source).
• Idea: test SQL by
» generating random correct queries
» executing queries against database
» comparing results with SQL 6.5, DB2, Oracle, Sybase
• Being used in SQL 7.0 testing.
» 375 unique bugs found (since 2/97)
» Very productive test tool
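The generate / execute / compare loop can be sketched in a few lines. This is not the RAGS implementation: it assumes a toy two-column table, uses Python's built-in sqlite3 as the system under test, and compares against a plain Python filter instead of a second vendor's database. All names are hypothetical.

```python
import operator
import random
import sqlite3

# Hypothetical toy table with columns (a, b); a is unique and ascending.
ROWS = [(i, (i * 7) % 10) for i in range(20)]
OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
       ">=": operator.ge, ">": operator.gt}


def random_predicate(rng, depth=0):
    """Return (sql_text, python_function) for a random, syntactically correct predicate."""
    if depth >= 2 or rng.random() < 0.4:
        col_index = rng.randrange(2)
        col_name = "ab"[col_index]
        op = rng.choice(sorted(OPS))
        const = rng.randint(0, 20)
        compare = OPS[op]
        return f"({col_name} {op} {const})", lambda row: compare(row[col_index], const)
    left_sql, left_fn = random_predicate(rng, depth + 1)
    right_sql, right_fn = random_predicate(rng, depth + 1)
    if rng.random() < 0.5:
        return f"({left_sql} AND {right_sql})", lambda row: left_fn(row) and right_fn(row)
    return f"({left_sql} OR {right_sql})", lambda row: left_fn(row) or right_fn(row)


def run_differential_test(num_queries=200, seed=0):
    """Run random queries and return the ones whose results disagree with the reference."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", ROWS)
    suspects = []
    for _ in range(num_queries):
        sql_pred, py_pred = random_predicate(rng)
        query = f"SELECT a, b FROM t WHERE {sql_pred} ORDER BY a"
        got = conn.execute(query).fetchall()
        expected = [row for row in ROWS if py_pred(row)]
        if got != expected:
            suspects.append(query)  # a suspect still needs a human to decide who is wrong
    conn.close()
    return suspects
```

Calling `run_differential_test()` returns the list of suspect queries. As on the slides, a suspect is only a disagreement between two implementations; it takes inspection to turn it into a filed bug.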

Page 20:

Sample Rags Generated Statement

SELECT TOP 3 T1.royalty , T0.price , "Apr 15 1996 10:23AM" , T0.notesFROM titles T0, roysched T1WHERE EXISTS ( SELECT DISTINCT TOP 9 $3.11 , "Apr 15 1996 10:23AM" , T0.advance , ( "<v3``VF;" +(( UPPER(((T2.ord_num +"22\}0G3" )+T2.ord_num ))+("{1FL6t15m" + RTRIM( UPPER((T1.title_id +((("MlV=Cf1kA" +"GS?" )+T2.payterms )+T2.payterms ))))))+(T2.ord_num +RTRIM((LTRIM((T2.title_id +T2.stor_id ))+"2" ))))), T0.advance , (((-(T2.qty ))/(1.0 ))+(((-(-(-1 )))+( DEGREES(T2.qty )))-(-(( -4 )-(-(T2.qty ))))))+(-(-1 )) FROM sales T2 WHERE EXISTS ( SELECT "fQDs" , T2.ord_date , AVG ((-(7 ))/(1 )), MAX (DISTINCT -1 ), LTRIM("0I=L601]H" ), ("jQ\" +((( MAX(T3.phone )+ MAX((RTRIM( UPPER( T5.stor_name ))+((("<" +"9n0yN" )+ UPPER("c" ))+T3.zip ))))+T2.payterms )+ MAX("\?" ))) FROM authors T3, roysched T4, stores T5 WHERE EXISTS ( SELECT DISTINCT TOP 5 LTRIM(T6.state ) FROM stores T6 WHERE ( (-(-(5 )))>= T4.royalty ) AND (( ( ( LOWER( UPPER((("9W8W>kOa" + T6.stor_address )+"{P~" ))))!= ANY (

SELECT TOP 2 LOWER(( UPPER("B9{WIX" )+"J" )) FROM roysched T7 WHERE ( EXISTS (

SELECT (T8.city +(T9.pub_id +((">" +T10.country )+ UPPER( LOWER(T10.city))))), T7.lorange , ((T7.lorange )*((T7.lorange )%(-2 )))/((-5 )-(-2.0 )) FROM publishers T8, pub_info T9, publishers T10 WHERE ( (-10 )<= POWER((T7.royalty )/(T7.lorange ),1)) AND (-1.0 BETWEEN (-9.0 ) AND (POWER(-9.0 ,0.0)) ) ) --EOQ ) AND (NOT (EXISTS ( SELECT MIN (T9.i3 ) FROM roysched T8, d2 T9, stores T10 WHERE ( (T10.city + LOWER(T10.stor_id )) BETWEEN (("QNu@WI" +T10.stor_id )) AND ("DT" ) ) AND ("R|J|" BETWEEN ( LOWER(T10.zip )) AND (LTRIM( UPPER(LTRIM( LOWER(("_\tk`d" +T8.title_id )))))) ) GROUP BY T9.i3, T8.royalty, T9.i3 HAVING -1.0 BETWEEN (SUM (-( SIGN(-(T8.royalty ))))) AND (COUNT(*)) ) --EOQ ) ) ) --EOQ ) AND (((("i|Uv=" +T6.stor_id )+T6.state )+T6.city ) BETWEEN ((((T6.zip +( UPPER(("ec4L}rP^<" +((LTRIM(T6.stor_name )+"fax<" )+("5adWhS" +T6.zip )))) +T6.city ))+"" )+"?>_0:Wi" )) AND (T6.zip ) ) ) AND (T4.lorange BETWEEN ( 3 ) AND (-(8 )) ) ) ) --EOQ GROUP BY ( LOWER(((T3.address +T5.stor_address )+REVERSE((T5.stor_id +LTRIM( T5.stor_address )))))+ LOWER((((";z^~tO5I" +"" )+("X3FN=" +(REVERSE((RTRIM( LTRIM((("kwU" +"wyn_S@y" )+(REVERSE(( UPPER(LTRIM("u2C[" ))+T4.title_id ))+( RTRIM(("s" +"1X" ))+ UPPER((REVERSE(T3.address )+T5.stor_name )))))))+ "6CRtdD" ))+"j?]=k" )))+T3.phone ))), T5.city, T5.stor_address ) --EOQ ORDER BY 1, 6, 5 )

This statement yields an error: SQLState=37000, Error=8623, Internal Query Processor Error: Query processor could not produce a query plan.

Page 21:

Automation

• Simpler statement with same error:
SELECT roysched.royalty FROM titles, roysched WHERE EXISTS ( SELECT DISTINCT TOP 1 titles.advance FROM sales ORDER BY 1 )
• Control statement attributes
» complexity, kind, depth, ...
• Multi-user stress tests
» tests concurrency, allocation, recovery

Page 22:

One 4-Vendor RAGS Test: 3 of them vs Us

• 60 k Selects on MSS, DB2, Oracle, Sybase.
• 17 SQL Server Beta 2 suspects: 1 suspect per 3350 statements.
• Examined 10 suspects, filed 4 bugs! One duplicate. Assume 3/10 are new.
• Note: this is the SS Beta 2 product. Quality rising fast (and RAGS sees that).

Page 23:

RAGS Next Steps

• Done:
» Patents, papers, talks
» Tech transfer to development: SQL 7 (over 400 bugs), FoxPro, OLE DB.
• Next steps:
» Make even more automatic
» Extend to other parts of SQL and T-SQL
» “Crawl” the config space (look for new holes)
» Apply ideas to other domains (OLE DB).

Page 24:

Scale Up and Scale Out

• Grow up with SMP: 4xP6 is now standard
• Grow out with cluster: cluster has inexpensive parts

[Figure: scale up from Personal System to Departmental Server to SMP Super Server; scale out to a Cluster of PCs.]

Page 25:

Billions Of Clients

• Every device will be “intelligent”

• Doors, rooms, cars…

• Computing will be ubiquitous

Page 26:

Billions Of Clients Need Millions Of Servers

• All clients networked to servers
» May be nomadic or on-demand
• Fast clients want faster servers
• Servers provide
» Shared data
» Control
» Coordination
» Communication

[Figure: clients (mobile clients, fixed clients) connected to servers (server, super server).]

Page 27:

Thesis: Many little beat few big

• Smoking, hairy golf ball
• How to connect the many little parts?
• How to program the many little parts?
• Fault tolerance?

[Figure: machine classes Mainframe, Mini, Micro, Nano at $1 million, $100 K, $10 K; disk form factors 14", 9", 5.25", 3.5", 2.5", 1.8"; Pico Processor of 1 mm³ with 1 M SPECmarks, 1 TFLOP, 10⁶ clocks to bulk RAM, event horizon on chip, VM reincarnated, multiprogram cache, on-chip SMP. Storage latencies: 10 pico-second RAM, 10 nano-second RAM, 10 microsecond RAM, 10 millisecond disc, 10 second tape archive. Capacities: 1 MB, 100 MB, 10 GB, 1 TB, 100 TB.]

Page 28:

Microsoft TerraServer: Scaleup to Big Databases

• Build a 1 TB SQL Server database
• Data must be
» 1 TB
» Unencumbered
» Interesting to everyone everywhere
» And not offensive to anyone anywhere
• Loaded
» 1.5 M place names from Encarta World Atlas
» 3 M sq km from USGS (1 meter resolution)
» 1 M sq km from Russian Space Agency (2 m)
• On the web (world’s largest atlas)
• Sell images with commerce server.

Page 29:

Microsoft TerraServer Background

• Earth is 500 tera-meters square
» USA is 10 tm²
• 100 tm² land in 70ºN to 70ºS
• We have pictures of 6% of it
» 3 tsm from USGS
» 2 tsm from Russian Space Agency
• Compress 5:1 (JPEG) to 1.5 TB.
• Slice into 10 KB chunks
• Store chunks in DB
• Navigate with
» Encarta™ Atlas (globe, gazetteer)
» StreetsPlus™ in the USA
• Someday
» multi-spectral image
» of everywhere
» once a day / hour

[Figure: image pyramid: 40x60 km² jump image, 20x30 km² browse image, 10x15 km² thumbnail, 1.8x1.2 km² tile.]

Page 30:

USGS Digital Ortho Quads (DOQ)

• US Geological Survey
• 4 terabytes
• Most data not yet published
• Based on a CRADA
» Microsoft TerraServer makes data available.

[Figure: USGS “DOQ” coverage: 1x1 meter, 4 TB, continental US, new data coming.]

Page 31:

Russian Space Agency (SovInformSputnik) SPIN-2 (Aerial Images is Worldwide Distributor)

• 1.5 meter geo-rectified imagery of (almost) anywhere
• Almost equal-area projection
• De-classified satellite photos (from 200 km)
• More data coming (1 m)
• Selling imagery on Internet.
• Putting 2 tm² onto Microsoft TerraServer.

[Figure: SPIN-2.]

Page 32:

Live on the Internet 6/24/98, for 18 months: One Billion Served

• New since S-Day:
» More data: 4.8 TB USGS DOQ, .8 TB Russian
» Bigger server: Alpha 8400 (8 proc, 8 GB RAM, 2.9 TB disk)
» Improved application: better UI, uses ASP, commerce app
• Load 6 TB more: 60% US, 4% world
• 30 M web hits per day peak
• 8 Mhpd avg (1 M page views/day)
• 1 billion pages served!
• 99.95% available
• No NT failures, 30 minute SQL restart

Page 33:

http://www.TerraServer.Microsoft.com/

Demo

SPIN-2

Microsoft

BackOffice

Page 34:

Demo

• navigate by coverage map to White House

• Download image

• buy imagery from USGS

• navigate by name to Venice

• buy SPIN2 image & Kodak photo

• Pop out to Expedia street map of Venice

• Mention that DB will double in next 18 months (2x USGS, 2X SPIN2)

Page 35:

Hardware

1 TB Database Server: AlphaServer 8400 4x400, 10 GB RAM, 324 StorageWorks disks, 10 drive tape library (STC Timber Wolf DLT7000)

[Figure: hardware diagram. Labels: Internet, DS3, 100 Mbps Ethernet Switch, Site Servers, Map Server, SPIN-2, Web Servers, AlphaServer 8400 (8 x 440MHz Alpha cpus, 10 GB DRAM), Enterprise Storage Array (groups of 48 9 GB Drives), STK9710 DLT Tape Library.]

Page 36:

The Microsoft TerraServer Hardware

• Compaq AlphaServer 8400
• 8x400Mhz Alpha cpus
• 10 GB DRAM
• 324 9.2 GB StorageWorks Disks
» 3 TB raw, 2.4 TB of RAID5
• STK 9710 tape robot (4 TB)
• WindowsNT 4 EE, SQL Server 7.0

Page 37:

Software

[Figure: software architecture. Labels: browser, HTML, Java Viewer, The Internet, Web Client, Microsoft Automap ActiveX Server, Internet Information Server 4.0, Image Delivery Application, SQL Server 7, Microsoft Site Server EE, Image Provider Site(s), TerraServer DB, Automap Server, TerraServer Stored Procedures, Image Server, Active Server Pages, MTS, TerraServer Web Site.]

Page 38:

System Management & Maintenance

• Backup and Recovery
» STK 9710 Tape robot
» Legato NetWorker™
» SQL Server 7 Backup & Restore
» Clocked at 80 MBps (peak) (~ 200 GB/hr)
• SQL Server Enterprise Mgr
» DBA Maintenance
» SQL Performance Monitor

Page 39:

Microsoft TerraServer File Group Layout

• Convert 324 disks to 28 RAID5 sets plus 28 spare drives
• Make 4 WinNT volumes (RAID 50), 595 GB per volume
• Build 30 20GB files on each volume
• DB is File Group of 120 files

[Figure: pairs of HSZ70 controllers (A and B) serving volumes E:, F:, G:, H:.]

Page 40:

Image Delivery and Load

Incremental load of 4 more TB in next 18 months

[Figure: load pipeline. Labels: DLT Tape “tar”, \Drop’N’, DoJob, Wait 4 Load, LoadMgr DB, LoadMgr, ImgCutter, \Images, NT Backup, DLT Tape, 100mbit EtherSwitch, AlphaServer 4100, ESA, 60 4.3 GB Drives, AlphaServer 8400, Enterprise Storage Array, 108 9.1 GB Drives (three groups), STK DLT Tape Library.]

LoadMgr steps: 10: ImgCutter, 20: Partition, 30: ThumbImg, 40: BrowseImg, 45: JumpImg, 50: TileImg, 55: Meta Data, 60: Tile Meta, 70: Img Meta, 80: Update Place, ...

Page 41:

Technical Challenge: Key idea

• Problem: geo-spatial search without geo-spatial access methods (just standard SQL Server)
• Solution: geo-spatial search key:
» Divide earth into rectangles of 1/48th degree longitude (X) by 1/96th degree latitude (Y)
» Z-transform X & Y into single Z value, build B-tree on Z
» Adjacent images stored next to each other
• Search method:
» Latitude and longitude => X, Y, then Z
» Select on matching Z value
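One common way to fold X and Y into a single key is bit interleaving (Z-order, or Morton order). The slide does not spell out TerraServer's exact transform, so the sketch below is an assumption: it shifts longitude and latitude to be non-negative, scales by the 1/48 and 1/96 degree cell sizes, and interleaves 16 bits of each index. Function names are hypothetical.

```python
def cell_index(lat, lon):
    """Map (latitude, longitude) in degrees to integer cell coordinates (x, y)."""
    x = int((lon + 180.0) * 48)  # 1/48th-degree steps of longitude
    y = int((lat + 90.0) * 96)   # 1/96th-degree steps of latitude
    return x, y


def interleave_bits(x, y, bits=16):
    """Interleave the low `bits` bits of x and y: x in even positions, y in odd."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def z_key(lat, lon):
    """Single integer search key suitable for an ordinary B-tree index."""
    x, y = cell_index(lat, lon)
    return interleave_bits(x, y)
```

Because nearby cells share their high-order bits, their Z values tend to fall close together in the B-tree, which is the "adjacent images stored next to each other" property. A lookup converts latitude and longitude to a Z value and selects the matching row.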

Page 42:

Some Tera-Byte Databases

• The Web: 1 TB of HTML
• TerraServer: 1 TB of images
• Several other 1 TB (file) servers
• Hotmail: 7 TB of email
• Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
• EOS/DIS (picture of planet each week)
» 15 PB by 2007
• Federal Clearing house: images of checks
» 15 PB by 2006 (7 year history)
• Nuclear Stockpile Stewardship Program
» 10 Exabytes (???!!)

[Figure: scale bar: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta.]

Page 43:

[Figure: scale bar (Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta) with example sizes: a letter, a novel, a movie, Library of Congress (text), LoC (image), LoC (sound + cinema), all disks, all tapes, all photos, all information!]

Page 44:

Michael Lesk’s Points
www.lesk.com/mlesk/ksg97/ksg.html

• Soon everything can be recorded and kept
• Most data will never be seen by humans
• Precious resource: human attention. Auto-summarization and auto-search will be a key enabling technology.

Page 45:

Scalability

• 1 billion transactions
• 1.8 million mail messages
• 4 terabytes of data
• 100 million web hits

• Scale up: to large SMP nodes
• Scale out: to clusters of SMP nodes

Page 46:

1.2 B tpd

• 1 B tpd ran for 24 hrs.
• Out-of-the-box software
• Off-the-shelf hardware
• AMAZING!
• Sized for 30 days
• Linear growth
• 5 micro-dollars per transaction

Page 47:

How Much Is 1 Billion Tpd?

[Figure: two bar charts titled “Millions of Transactions Per Day” (Mtpd), one on a log scale from 0.1 to 1,000 and one on a linear scale from 0 to 1,000, comparing 1 Btpd, Visa, ATT, BofA, NYSE.]

• 1 billion tpd = 11,574 tps ~ 700,000 tpm (transactions/minute)
• ATT
» 185 million calls per peak day (worldwide)
• Visa ~20 million tpd
» 400 million customers
» 250K ATMs worldwide
» 7 billion transactions (card+cheque) in 1994
• New York Stock Exchange
» 600,000 tpd
• Bank of America
» 20 million tpd checks cleared (more than any other bank)
» 1.4 million tpd ATM transactions
• Worldwide Airlines Reservations: 250 Mtpd
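The per-second and per-minute figures on this slide follow directly from the number of seconds and minutes in a day. A small check, with hypothetical helper names:

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
MINUTES_PER_DAY = 24 * 60        # 1,440


def per_second(transactions_per_day):
    """Average transactions per second for a given daily volume."""
    return transactions_per_day / SECONDS_PER_DAY


def per_minute(transactions_per_day):
    """Average transactions per minute for a given daily volume."""
    return transactions_per_day / MINUTES_PER_DAY


one_billion = 10**9
print(round(per_second(one_billion)))  # 11574
print(round(per_minute(one_billion)))  # 694444, which the slide rounds to ~700,000
```

These are daily averages; a real workload would need headroom above them for peak periods.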

Page 48:

NCSA Super Cluster

• National Center for Supercomputing Applications, University of Illinois @ Urbana
• 512 Pentium II cpus, 2,096 disks, SAN
• Compaq + HP + Myricom + WindowsNT
• A Super Computer for 3M$
• Classic Fortran/MPI programming
• DCOM programming model

http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html

Page 49:

NT Clusters (Wolfpack)

• Scale DOWN to PDA: WindowsCE

• Scale UP an SMP: TerraServer

• Scale OUT with a cluster of machines

• Single-system image

»Naming

»Protection/security

»Management/load balance

• Fault tolerance

»“Wolfpack”

• Hot pluggable hardware & software

Page 50:

Symmetric Virtual Server Failover Example

[Figure: Browser, Server 1, and Server 2; each server can host the Web site and Database virtual servers, backed by Web site files and Database files; shown before and after failover.]

Page 51:

Clusters & BackOffice

• Research: Instant & Transparent failover

• Making BackOffice PlugNPlay on Wolfpack

»Automatic install & configure

• Virtual Server concept makes it easy

»simpler management concept

»simpler context/state migration

»transparent to applications

• SQL 6.5E & 7.0 Failover

• MSMQ (queues), MTS (transactions).

Page 52:

Storage Latency: How Far Away is the Data?

[Figure: relative latency of each storage level, with a travel-time analogy.]
• Registers: 1 (My Head, 1 min)
• On Chip Cache: 2 (This Room)
• On Board Cache: 10 (This Campus, 10 min)
• Memory: 100 (Sacramento, 1.5 hr)
• Disk: 10⁶ (Pluto, 2 Years)
• Tape / Optical Robot: 10⁹ (Andromeda, 2,000 Years)

Page 53:

The Memory Hierarchy

• Measuring & Modeling Sequential IO
• Where is the bottleneck?
• How does it scale with
» SMP, RAID, new interconnects

Goals:
• balanced bottlenecks
• low overhead
• scale many processors (10s)
• scale many disks (100s)

[Figure: data path. Labels: App address space, File cache, Memory, Mem bus, PCI, Adapter, SCSI, Controller.]

Page 54:

Sequential IO: your mileage will vary

• 40 MB/sec: advertised UW SCSI
• 35r-23w MB/sec: actual disk transfer
• 29r-17w MB/sec: 64 KB request (NTFS)
• 9 MB/sec: single disk media
• 3 MB/sec: 2 KB request (SQL Server)

• Measuring hardware & software
• Looking for software fixes.
• Aiming for the “out of the box” 1/2 power point: 50% of peak power “out of the box”

[Figure: throughput (MB/sec, 0 to 30) vs. transfer size (2 to 128 KB) for 4 disk read, 4 disk write, 1 disk read, 1 disk write. Striping helps; controller is the bottleneck.]

[Figure: throughput (MB/sec, 0 to 10) vs. transfer size (2 to 128 KB) for 1 disk read, 1 disk write, 1 disk read (NTFS buffer), 1 disk write (NTFS buffer). NTFS read is good at 8 KB, but writes are uniformly slow.]

Page 55:

PAP (peak advertised performance) vs RAP (real application performance)

• Goal: RAP = PAP / 2 (the half-power point)

[Figure: data path from Disk through SCSI, PCI, and System Bus to File System Buffers and Application Data. Peak rates: System Bus 422 MBps, PCI 133 MBps, SCSI 40 MBps, Disk 10-15 MBps; measured rate 7.2 MB/s at every stage.]

Page 56:

The Best Case: Temp File, NO IO

• Temp file read / write: file system cache
• Program uses small (in cpu cache) buffer.
• So, write/read time is bus move time (3x better than copy)
• Paradox: fastest way to move data is to write then read it.
• This hardware is limited to 150 MBps per processor

[Figure: bar chart “Temp File Read/Write” in MBps: Temp read 148, Temp write 136, Memcopy() 54.]

Page 57:

Bottleneck Analysis

• Drawn to linear scale

[Figure: bandwidths drawn to scale: theoretical bus bandwidth 422 MBps = 66 MHz x 64 bits; memory read/write ~150 MBps; MemCopy ~50 MBps; disk R/W ~9 MBps.]

Page 58:

3 Stripes and You’re Out!

• 3 disks can saturate adapter
• Similar story with UltraWide
• CPU time goes down with request size
• Ftdisk (striping is cheap)

[Figure: three charts vs. request size (2 to 192 K bytes) for 1, 2, 3, and 4 disks: “Read Throughput vs Stripes - 3 deep Fast” (MB/s, 0 to 20), “Write Throughput vs Stripes - 3 deep Fast” (MB/s, 0 to 20), and “CPU milliseconds per MB” (cost in CPU ms/MB, log scale 1 to 100).]

Page 59:

Parallel SCSI Busses Help

• Second SCSI bus nearly doubles read and WCE throughput
• Write needs deeper buffers
• Experiment is unbuffered (3-deep + WCE)

[Figure: “One or Two SCSI Busses”: throughput (MB/s, 0 to 25) vs. request size (2 to 192 K bytes) for Read, Write, and WCE on 1 bus and on 2 busses; about 2x with the second bus.]

Page 60:

File System Buffering & Stripes (UltraWide Drives)

• FS buffering helps small reads
• FS buffered writes peak at 12 MBps
• 3-deep async helps
• Write peaks at 20 MBps
• Read peaks at 30 MBps

[Figure: two charts, “Three Disks, 1 Deep” and “Three Disks, 3 Deep”: throughput (MB/s, 0 to 35) vs. request size (2 to 192 K bytes) for FS Read, Read, FS Write WCE, Write WCE.]

Page 61: 1 BARC BARC Microsoft Bay Area Research Center Tom Barclay Tyler Beam (U VA)* Gordon Bell Joe Barrera Josh Coates (UCB)* Jim Gemmell Jim Gray Steve Lucco

61

PAP vs RAP

• Reads are easy, writes are hard

• Async write can match WCE.

[Diagram: data path through Application Data, File System, PCI, SCSI and Disks, annotated with bandwidths of 422 MBps, 142 MBps, 133 MBps, 72 MBps, 40 MBps, 31 MBps, 10-15 MBps and 9 MBps.]


62

Bottleneck Analysis

• NTFS Read/Write 9 disk, 2 SCSI bus, 1 PCI

~ 65 MBps Unbuffered read

~ 43 MBps Unbuffered write

~ 40 MBps Buffered read

~ 35 MBps Buffered write

[Diagram: Memory Read/Write ~150 MBps; PCI ~70 MBps; Adapter ~30 MBps; Adapter; annotation: 70 MBps.]


63

Hypothetical Bottleneck Analysis

• NTFS Read/Write 12 disk, 4 SCSI, 2 PCI

(not measured, we had only one PCI bus available, 2nd one was “internal”)

~ 120 MBps Unbuffered read

~ 80 MBps Unbuffered write

~ 40 MBps Buffered read

~ 35 MBps Buffered write

[Diagram: Memory Read/Write ~150 MBps; two PCI busses at ~70 MBps; four adapters at ~30 MBps; annotation: 120 MBps.]
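One way to read the two bottleneck slides is as a simple model: each stage's aggregate bandwidth is its unit count times its per-unit rate, and the whole path runs at the speed of the slowest stage. The sketch below is an assumption about how such estimates can be formed, not the authors' stated method. It uses only the per-unit figures quoted on the slides, and the function name and stage labels are illustrative.

```python
# Per-unit bandwidths (MBps) quoted on the slides.
UNIT_MBPS = {"memory": 150, "pci": 70, "adapter": 30}

def bottleneck(counts):
    """Return (stage, aggregate MBps) for the slowest stage.

    counts maps a stage name to the number of units working in parallel.
    Assumes perfect scaling within a stage, which real hardware rarely gives.
    """
    aggregate = {stage: n * UNIT_MBPS[stage] for stage, n in counts.items()}
    stage = min(aggregate, key=aggregate.get)
    return stage, aggregate[stage]

# Measured configuration: 2 SCSI adapters on 1 PCI bus.
measured = bottleneck({"memory": 1, "pci": 1, "adapter": 2})
# Hypothetical configuration: 4 SCSI adapters on 2 PCI busses.
hypothetical = bottleneck({"memory": 1, "pci": 2, "adapter": 4})
print(measured, hypothetical)
```

Under this model the adapters limit both configurations: about 60 MBps for the measured setup (close to the ~65 MBps unbuffered read) and 120 MBps for the hypothetical one.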


64

Computers shrink to a point

• Disks on track

• 100x in 10 years: 2 TB 3.5” drive

• Shrink to 1” is 200GB

• Disk replaces tape?

• Disk is super computer!

[Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]
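The "100x in 10 years" projection can be restated as a compound annual rate. The short sketch below does that arithmetic; the function name is illustrative, and the starting capacity is only what the slide's two numbers imply, not a figure stated in the talk.

```python
def annual_growth(factor, years):
    """Compound annual growth factor that yields `factor` after `years`."""
    return factor ** (1.0 / years)

rate = annual_growth(100, 10)
projected_gb = 2000                  # the slide's 2 TB drive, decimal units
implied_start_gb = projected_gb / 100
print(f"{(rate - 1) * 100:.0f}% per year from about {implied_start_gb:.0f} GB")
```

A 100x gain over a decade works out to roughly 58% growth per year.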


65

Data Gravity: Processing Moves to Transducers

• Move Processing to data sources

• Move to where the power (and sheet metal) is

• Processor in

»Modem

»Display

»Microphones (speech recognition) & cameras (vision)

»Storage: Data storage and analysis


66

It’s Already True of Printers: Peripheral = CyberBrick

• You buy a printer

• You get

»several network interfaces

»A Postscript engine: cpu, memory, software, a spooler (soon)

»and… a print engine.


67

Remember Your Roots


68

Year 2002 Disks

• Big disk (10 $/GB)

» 3”

» 100 GB

» 150 kaps (k accesses per second)

» 20 MBps sequential

• Small disk (20 $/GB)

» 3”

» 4 GB

» 100 kaps

» 10 MBps sequential

• Both running Windows NT™ 7.0? (see below for why)
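The projected disk figures imply a unit price and a full-scan time for each drive. The sketch below works those out from the slide's numbers alone; the `Disk` class and its method names are illustrative, and decimal units (1 GB = 1000 MB) are assumed.

```python
from dataclasses import dataclass

@dataclass
class Disk:
    name: str
    dollars_per_gb: float
    capacity_gb: float
    seq_mbps: float

    def price(self):
        """Drive price in dollars."""
        return self.dollars_per_gb * self.capacity_gb

    def scan_seconds(self):
        """Time to read the whole drive sequentially (1 GB = 1000 MB)."""
        return self.capacity_gb * 1000 / self.seq_mbps

big = Disk("big", dollars_per_gb=10, capacity_gb=100, seq_mbps=20)
small = Disk("small", dollars_per_gb=20, capacity_gb=4, seq_mbps=10)
for d in (big, small):
    print(d.name, d.price(), d.scan_seconds())
```

On these numbers the big disk costs about 1,000$ and takes 5,000 seconds to scan end to end, while the small disk costs about 80$ and scans in 400 seconds.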


69

How Do They Talk to Each Other?

• Each node has an OS

• Each node has local resources: A federation.

• Each node does not completely trust the others.

• Nodes use RPC to talk to each other

» CORBA? DCOM? IIOP? RMI?

» One or all of the above.

• Huge leverage in high-level interfaces.

• Same old distributed system story.

[Diagram: two protocol stacks joined by Wire(s); labels in each stack: Applications, RPC?, streams, datagrams, with VIAL/VIPL beneath.]


70

What if Networking Was as Cheap As Disk IO?

• TCP/IP

»Unix/NT 100% cpu @ 40MBps

• Disk

»Unix/NT 8% cpu @ 40MBps

Why the Difference?

• Host Bus Adapter does SCSI packetizing, checksum, …, flow control, DMA

• Host does TCP/IP packetizing, checksum, …, flow control, small buffers
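The gap is easier to see as CPU time per megabyte moved. The sketch below converts the slide's two utilization figures into that unit; the function name is illustrative, and a single CPU is assumed.

```python
def cpu_ms_per_mb(cpu_fraction, mbps):
    """CPU milliseconds consumed per megabyte moved, for one CPU."""
    return cpu_fraction * 1000.0 / mbps

tcp = cpu_ms_per_mb(1.00, 40)    # TCP/IP: 100% cpu at 40 MBps
disk = cpu_ms_per_mb(0.08, 40)   # Disk:     8% cpu at 40 MBps
print(f"tcp/ip {tcp:.1f} ms/MB, disk {disk:.1f} ms/MB, ratio {tcp / disk:.1f}x")
```

At the same 40 MBps, TCP/IP costs about 25 ms of CPU per megabyte against about 2 ms for disk IO, a factor of roughly 12.5.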


71

Technology Drivers: The Promise of SAN/VIA: 10x in 2 years

http://www.ViArch.org/

• Today:

»wires are 10 MBps (100 Mbps Ethernet)

»~20 MBps tcp/ip saturates 2 cpus

»round-trip latency is ~300 us

• In the lab

»Wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet,…

»Fast user-level communication

  • tcp/ip ~ 100 MBps, 10% of each processor

  • round-trip latency is 15 us
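To see what the latency and bandwidth figures mean for a single message, a rough cost model is a fixed round-trip latency plus serialization time. This is a deliberate simplification, assumed here only for illustration: it ignores CPU saturation, pipelining and protocol overhead, and the function and variable names are hypothetical.

```python
def transfer_us(size_bytes, round_trip_us, mbps):
    """Round-trip latency plus time on the wire, in microseconds."""
    # 1 MBps moves one byte per microsecond, so bytes / MBps is already in us.
    return round_trip_us + size_bytes / mbps

def today(size_bytes):
    return transfer_us(size_bytes, round_trip_us=300, mbps=10)

def in_the_lab(size_bytes):
    return transfer_us(size_bytes, round_trip_us=15, mbps=100)

for size in (1_000, 1_000_000):
    speedup = today(size) / in_the_lab(size)
    print(f"{size} bytes: {today(size):.0f} us vs {in_the_lab(size):.0f} us ({speedup:.0f}x)")
```

Under this model a 1,000-byte message goes from 400 us to 25 us (16x), while a 1 MB transfer improves by about 10x, because latency dominates small messages and bandwidth dominates large ones.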


72

SAN: Standard Interconnect

• LAN faster than memory bus?

• 1 GBps links in lab.

• 100$ port cost soon

• Port is computer

[Figure: interconnect bandwidths. Gbps Ethernet: 110 MBps; PCI: 70 MBps; UW SCSI: 40 MBps; FW SCSI: 20 MBps; SCSI: 5 MBps. Tombstones: RIP FDDI, RIP ATM, RIP SCI, RIP SCSI, RIP FC, RIP ?]


73

Technology Drivers: Plug & Play Software

• RPC is standardizing: (DCOM, IIOP, HTTP)

» Gives huge TOOL LEVERAGE

» Solves the hard problems for you: naming, security, directory service, operations,...

• Commoditized programming environments

» FreeBSD, Linux, Solaris,… + tools

» NetWare + tools

» WinCE, WinNT,… + tools

» JavaOS + tools

• Apps gravitate to data.

• General purpose OS on controller runs apps.


74

Disk = Node

• has magnetic storage (100 GB?)

• has processor & DRAM

• has SAN attachment

• has execution environment

[Diagram: software stack on the disk node. Applications; Services, DBMS, RPC, ..., File System; OS Kernel, SAN driver, Disk driver.]


75

Penny Sort Ground Rules

http://research.microsoft.com/barc/SortBenchmark

• How much can you sort for a penny?

» Hardware and Software cost

» Depreciated over 3 years

» 1M$ system gets about 1 second,

» 1K$ system gets about 1,000 seconds.

» Time (seconds) = 946,080 / SystemPrice ($)

• Input and output are disk resident

• Input is

» 100-byte records (random data)

» key is first 10 bytes.

• Must create output file and fill with sorted version of input file.

• Daytona (product) and Indy (special) categories
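The time budget follows from the depreciation rule. Three years is 94,608,000 seconds, so a system costing P dollars costs P / 94,608,000 dollars per second, and one penny buys 946,080 / P seconds. The sketch below checks that against the two examples on the slide; the function name is illustrative.

```python
SECONDS_IN_3_YEARS = 3 * 365 * 24 * 3600        # 94,608,000
PENNY_CONSTANT = SECONDS_IN_3_YEARS // 100      # 946,080 second-dollars per penny

def penny_seconds(system_price_dollars):
    """Seconds of machine time that one penny buys on a system of this price."""
    return PENNY_CONSTANT / system_price_dollars

print(penny_seconds(1_000_000))   # about 1 second
print(penny_seconds(1_000))       # about 1,000 seconds
```

Cheaper systems get proportionally more time, which is why the benchmark favors commodity hardware.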


76

PennySort

• Hardware

» 266 MHz Intel PPro

» 64 MB SDRAM (10ns)

» Dual Fujitsu DMA 3.2GB EIDE

• Software

» NT workstation 4.3

» NT 5 sort

• Performance

» sort 15 M 100-byte records (~1.5 GB)

» Disk to disk

» elapsed time 820 sec

  • cpu time = 404 sec

[Pie chart: PennySort Machine (1107$). board 13%; Memory 8%; Cabinet + Assembly 7%; Network, Video, floppy 9%; Software 6%; Other 22%; cpu 32%; Disk 25%.]
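The reported run can be turned into a few derived figures: data volume, sort rate, CPU utilization, and how the elapsed time compares with the penny budget for a 1107$ machine. The sketch below uses only numbers from the slides; the variable names are illustrative, and decimal megabytes are assumed.

```python
records = 15_000_000
record_bytes = 100
elapsed_s = 820
cpu_s = 404
price_dollars = 1107

data_mb = records * record_bytes / 1e6      # 1500 MB, the slide's ~1.5 GB
throughput_mbps = data_mb / elapsed_s       # MB sorted per elapsed second
cpu_utilization = cpu_s / elapsed_s         # fraction of elapsed time on the CPU
budget_s = 946_080 / price_dollars          # seconds one penny buys at this price

print(f"{data_mb:.0f} MB at {throughput_mbps:.2f} MB/s, "
      f"cpu {cpu_utilization:.0%}, budget {budget_s:.0f} s")
```

The run sorts about 1.8 MB per second with the CPU busy roughly half the time, and the 820-second elapsed time fits inside the roughly 855-second budget.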


77

Cluster Sort Conceptual Model

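• Multiple Data Sources

• Multiple Data Destinations

• Multiple nodes

• Disks -> Sockets -> Disk -> Disk

[Diagram: three source disks each hold mixed records (AAABBBCCC); nodes A, B and C each receive one key range, giving partitioned runs AAAAAAAAA, BBBBBBBBB and CCCCCCCCC at both disk stages.]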
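The conceptual model amounts to a range-partitioned sort: every source routes each record to the node that owns its key range, and each node then sorts what it received. The sketch below simulates that in one process, with lists standing in for disks and sockets. The function name and the splitter keys are illustrative, not taken from NTClusterSort.

```python
import bisect

def cluster_sort(sources, splitters):
    """Range-partition records from every source, then sort each partition.

    sources: one list of records per source disk.
    splitters: sorted keys; a record goes to partition i, where i is the
    number of splitters less than or equal to the record.
    """
    partitions = [[] for _ in range(len(splitters) + 1)]
    for source in sources:                       # Disks -> Sockets
        for record in source:
            dest = bisect.bisect_right(splitters, record)
            partitions[dest].append(record)
    return [sorted(p) for p in partitions]       # each node sorts locally

sources = [list("AAABBBCCC") for _ in range(3)]
result = cluster_sort(sources, splitters=["B", "C"])
print(["".join(p) for p in result])
```

Because the partitions cover disjoint, ordered key ranges, concatenating the per-node outputs in partition order yields the globally sorted file.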


78

Cluster Install & Execute

• If this is to be used by others, it must be:

» Easy to install

» Easy to execute

• Installations of distributed systems take time and can be tedious. (AM2, GluGuard)

• Parallel Remote execution is non-trivial. (GLUnix, LSF)

How do we keep this “simple” and “built-in” to NTClusterSort?


79

Remote Install

RegConnectRegistry()

RegCreateKeyEx()

•Add Registry entry to each remote node.
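The two calls named on the slide have direct counterparts in Python's Windows-only `winreg` module (`ConnectRegistry` and `CreateKeyEx`), which makes the remote-install step easy to sketch. This is a minimal illustration rather than the NTClusterSort installer: the key path, value name and node names a caller would pass are hypothetical, and the target machines need the remote registry service running and suitable access rights.

```python
def unc_name(node):
    """Return the machine name with the leading double backslash the registry API expects."""
    return node if node.startswith("\\\\") else "\\\\" + node

def install_on_node(node, subkey, value_name, value):
    """Create `subkey` under HKEY_LOCAL_MACHINE on `node` and set one string value."""
    import winreg  # Windows-only standard library module, imported lazily
    with winreg.ConnectRegistry(unc_name(node), winreg.HKEY_LOCAL_MACHINE) as hive:
        with winreg.CreateKeyEx(hive, subkey, 0, winreg.KEY_WRITE) as key:
            winreg.SetValueEx(key, value_name, 0, winreg.REG_SZ, value)

def install_everywhere(nodes, subkey, value_name, value):
    """Add the same registry entry to each remote node in turn."""
    for node in nodes:
        install_on_node(node, subkey, value_name, value)
```

A caller would pass its own node list and key, for example `install_everywhere(["node1", "node2"], r"SOFTWARE\NTClusterSort", "InstallDir", r"C:\NTClusterSort")`, where every argument shown is a placeholder.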


80

Cluster Execution

• Setup:

» MULTI_QI struct

» COSERVERINFO struct

• CoCreateInstanceEx()

• Retrieve remote object handle from MULTI_QI struct

• Invoke methods as usual

[Diagram: MULTI_QI and COSERVERINFO structs, three HANDLEs, and a Sort() call on each.]
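Once each node's handle is in hand, the last step is a parallel fan-out of Sort() calls. The sketch below shows only that fan-out pattern. It makes no COM calls: `SortNode` is a local stand-in for the remote object that CoCreateInstanceEx() would return, and all names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

class SortNode:
    """Local stand-in for a remote sort object; not a real COM proxy."""

    def __init__(self, name, records):
        self.name = name
        self.records = records

    def Sort(self):
        """Sort this node's records in place and report how many there were."""
        self.records.sort()
        return self.name, len(self.records)

def run_cluster(handles):
    """Invoke Sort() on every handle concurrently; results keep handle order."""
    with ThreadPoolExecutor(max_workers=len(handles)) as pool:
        return list(pool.map(lambda h: h.Sort(), handles))

handles = [SortNode("node1", [3, 1, 2]), SortNode("node2", [9, 7]), SortNode("node3", [5])]
print(run_cluster(handles))
```

With real DCOM proxies the calls would block on the network rather than the CPU, which is the case where one thread per outstanding call is a reasonable fit.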


81

Public Service

• Gordon Bell

» Computer Museum

» Vanguard Group

» Edits column in CACM

• Jim Gray

» National Research Council Computer Science and Telecommunications Board

» Presidential Advisory Committee on NGI-IT-HPPC.

• Tom Barclay

» USGS and Russian cooperative research


82

A Plug for CoRR

• CoRR = Computer Science Research Repository

• All computer science literature in cyberspace

• http://xxx.lanl.gov/archive/cs

• Endorsed by CACM

• Reviewed & Refereed EJournals will evolve from this archive

• PLEASE submit articles

• Copyright issues are still problematic


83

BARC: Microsoft Bay Area Research Center

Tom Barclay, Tyler Beam (U VA)*, Gordon Bell, Joe Barrera, Josh Coates (UCB)*, Jim Gemmell, Jim Gray, Steve Lucco, Erik Riedel (CMU)*, Eve Schooler (Cal Tech), Don Slutz, Catherine Van Ingen (NTFS)*

http://www.research.Microsoft.com/barc/