Tom Barclay
Jim Gray, Don Slutz, Greg Smith, many others
Microsoft Research
SPIN-2
Scaleup - Big Database
Build a 1 TB SQL Server database
Data must be
– 1 TB
– Unencumbered
– Interesting to everyone everywhere
– And not offensive to anyone anywhere
Loaded
– 1.1 M place names from Encarta World Atlas
– 1 M sq km from USGS (1 meter resolution)
– 2 M sq km from Russian Space Agency (2 m)
Will be on the web (world’s largest atlas)
Sell images with Commerce Server.
What’s a Terabyte?
1 Terabyte
  1,000,000,000 business letters     150 miles of book shelf
  100,000,000 book pages             15 miles of book shelf
  50,000,000 FAX images              7 miles of book shelf
  10,000,000 TV pictures (mpeg)      10 days of video
  4,000 LandSat images               16 earth images (100m)
Library of Congress (in ASCII) is 25 TB
1980: 200 M$ of disc (10,000 discs); 5 M$ of tape silo (10,000 tapes)
1998: 100 K$ of magnetic disc (60 discs); 50 K$ nearline tape (30 tapes)
Terror Byte !!
Some Other Terror-Byte Databases
[Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]
TerraServer
Sloan Digital Sky Survey:
– 40 TB raw, 2 TB cooked
EOS/DIS (picture of planet each week)
– 15 PB by 2007
Federal Reserve Clearing house: images of checks
– 15 PB by 2006 (7 year history)
Nuclear Stockpile Stewardship Program
– 10 Exabytes (???!!)
TerraServer is:
An on-line demo and sales tool directed at IT customers and ISVs
A test of the Sphinx VLDB features:
– Load performance
– Online Backup/Restore
– Query Performance
A “cool 90s app”
– Image and Text data
– Web-lication
– Electronic Commerce
“A shameless advertisement of WNT and SQL Server Scalability”
Application Requirements
BIG —1 TB of data.
PUBLIC — available on the world wide web.
INTERESTING — to a wide audience
ACCESSIBLE — using standard browsers (IE, Netscape)
REAL — a real application (users can buy imagery)
FREE —cannot require NDA or money to access
FAST — impress customers for BackOffice, StorageWorks
EASY — Inexpensive to develop, deploy, and maintain
Project Partners and Motivation
USGS
– Distribute DOQs to a wider audience
– Lower cost of distribution
SPIN-2
– Demo scope & quality of Spin-2 imagery
– Open new markets for imagery sales
Digital
– Demo DEC Alpha & StorageWorks™ scalability
– Recognized as superior h/w vendor
Microsoft
– Demo scalability of NT & SQL Server
Database & App UI
Coverage: Range from 70ºN to 70ºS
– 35% U.S., 1% outside U.S.
Source Imagery:
– 3.5 TB 1 sq meter/pixel Aerial (USGS - 60,000 46 Mb B&W - 151 Mb Color IR files)
– 700 GB 1.56 meter/pixel Satellite (Spin-2 - 2400 300 Mb B&W)
Display Imagery:
– 80 m 225 x 150 pixel images, 1.6 m x 3 sub-sampled views
Nav Tools:
– 1.5 m place names
– “Click-on” Coverage map
– Expedia & Virtual Globe map
[Figure labels: 1.8x1.2km 32m “city view”; 1.8x1.2km 16m thumbnail; 1.8x1.2km 8m browse; 225x150m tile]
Concept: User navigates an ‘almost seamless’ image of earth
TerraServer Demo
Intranet Beta Sites:
– http://terraweb1
– http://terraweb2
Internal Beta Schedule
– Mon April 27 - June 23
What Microsoft & DEC Contribute
Microsoft’s contribution:
– Build an “internet UI”
– Design the app and the database
– Slice & Dice & Load the data.
– Build “electronic stores” for USGS and for Aerial Images to operate, to sell & distribute images
– Run a “robust” web site for 18 months
Digital’s contribution:
– Provide high-performance processors
– Provide high capacity, reliable storage.
– Provide technical advice
World’s Largest PC!
– 324 disks (2.4 TB)
– 8 x 440 MHz Alpha CPU
– 10 GB RAM
[Diagram labels: Alpha 8400 (8 x 440 MHz, 10 GB RAM); Enterprise Storage Array; StorageTek; 9 HSZ70 Ultra-SCSI dual redundant controllers; 324 9.1 GB Seagate disks; 6 DLT7000 Quantum drives; FWD SCSI; Compaq 5500 (4 x 200 MHz) web servers, shown twice; To the Web]
Site Configuration
[Diagram labels: Web Client (browser, HTML, Java Viewer); The Internet; Terra-Server Web Site (Internet Information Server 4.0, Active Server Pages, MTS, Image Server); Terra-Server Stored Procedures; Terra-Server DB; Sphinx (SQL Server); Automap Server; Microsoft Automap ActiveX Server; Image Provider Site(s); Image Delivery Application; SQL Server 7; Microsoft Site Server EE; Internet Information Server 4.0]
Software
How We Did It
“Chopped” big images into small “tiles”
– Sub-sampled tiles to create zoom levels
– Tile sizes map to Lat/Lon system
– Unique ID assigned to each Tile location (Z-transform of lat/long or UTM)
– Unique ID clusters adjacent tiles onto the same database & index pages
Wrote Load Management program
– Runs image cutting job
– Loads meta and image data into SQL
– Multiple Loaders can run in parallel
– Web Active Server Page controls load process
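The slide does not spell out what the “Z-transform” is. One common way to get the clustering it describes is Z-order (Morton) bit interleaving of a tile's grid column and row, so the sketch below assumes that technique. The function name and the 16-bit width are illustrative and are not taken from TerraServer.

```python
def z_order_id(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of grid column x and grid row y.

    Bit i of x lands at position 2*i and bit i of y at position 2*i + 1,
    so tiles that are close on the grid tend to get close IDs.
    """
    if x < 0 or y < 0 or x >= 1 << bits or y >= 1 << bits:
        raise ValueError("grid index out of range")
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# The four tiles of one aligned 2x2 block receive four consecutive IDs.
block = sorted(z_order_id(x, y) for x in (2, 3) for y in (4, 5))
print(block)
```

With IDs like these as the clustering key, neighbouring tiles tend to land on the same or nearby index pages, which is the effect the slide attributes to its unique ID.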
USGS Editing Process
[Diagram: a cell of 1 degree latitude by 1 degree longitude holding quadrangles numbered 1 to 64 (1 quadrangle = 7.5’ x 7.5’); 1 “QUAD” DOQ photo = 3.75’ x 3.75’; DOQQ origin point; quad cut 3x6 (cells numbered 1 to 18) for Jump, Thumbnail & Browse images; DOQ tiles]
Spin-2 Image Editing Process
48 x 96 cells per sq degree
Image aligned to left corner of grid system
Non-image squares (all white) are discarded
Cut images are extracted
Sub-sample to Jump, Thumb, and Browse images (diagram labels: 32m, 16m, Browse 8m)
Tiles are cut 5x5, scrambled, output JPEG
Spin-2 Meta Data
Fields:
– File name (of image)
– City [1]
– State [1]
– Country
– Number of Rows
– Number of Columns
– Shooting Height
– Height of Sun
– Date of survey (mm/dd/yyyy)
– Time of survey (GMT) (hr:mn:ss)
– Upper Left Latitude
– Upper Left Longitude
– Lower Right Latitude
– Lower Right Longitude
– Camera System [1]
– Pixel size [1]
– Copyright [1]
[1] Field is not required; if not present, then a blank field is present
Semi-colon delimited fields, ASCII encoding, 1 record per line
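A minimal sketch of reading one such record, assuming the fields appear in the order listed on the slide. The snake_case field names and the sample line are made up for illustration; they do not come from a real Spin-2 file.

```python
from typing import Dict, List

# Field order follows the slide's listing; the names are illustrative.
FIELDS: List[str] = [
    "file_name", "city", "state", "country", "rows", "columns",
    "shooting_height", "sun_height", "survey_date", "survey_time_gmt",
    "upper_left_lat", "upper_left_lon", "lower_right_lat", "lower_right_lon",
    "camera_system", "pixel_size", "copyright",
]
# Fields the slide marks as not required (blank when absent).
OPTIONAL = {"city", "state", "camera_system", "pixel_size", "copyright"}


def parse_record(line: str) -> Dict[str, str]:
    """Split one semicolon-delimited line into a field dictionary."""
    values = [v.strip() for v in line.rstrip("\r\n").split(";")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    for name, value in record.items():
        if not value and name not in OPTIONAL:
            raise ValueError(f"required field {name!r} is blank")
    return record


# Hypothetical sample line: city and copyright are left blank.
sample = ("img0001.tif;;DC;USA;1000;2000;220;45;04/27/1998;15:30:00;"
          "38.9;-77.1;38.8;-77.0;cam-x;2;")
rec = parse_record(sample)
print(rec["survey_date"], repr(rec["city"]))
```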
Database Design and Load
Build a 1 TB (2**40 B) SQL Server Database
Database includes
– Gazetteer data for searching
– Image data pyramid and metadata
Load the Database
– Chop the big images into tiles
– BCP data and metadata in
– Allow for restart and undo of loads
– Create indexes
– Check consistency of the data
Keep it Simple, no Tricks, Test the Scaling
The Image Pyramid
Zooming in on the Washington Monument
– Jump image: 1 pixel = 32x32 m²
– Dithered Browse image: 1 pixel = 16x16 m²
– Dithered Thumb image: 1 pixel = 8x8 m²
– USGS Tile image (DOQ of Washington Monument): 1 pixel = 1 sq meter
[Diagram labels: 64:1, 1:1, 1:1]
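The slides do not say which filter produces the coarser pyramid levels (they only mention dithering). As a rough sketch, the code below assumes plain box averaging: each output pixel is the mean of a factor x factor block of 1 m pixels. The function name and the test image are illustrative.

```python
from typing import List

Image = List[List[float]]


def subsample(image: Image, factor: int) -> Image:
    """Shrink an image by averaging each factor x factor block of pixels."""
    rows, cols = len(image), len(image[0])
    if rows % factor or cols % factor:
        raise ValueError("image size must be a multiple of the factor")
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [image[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


# A 32x32 checkerboard at 1 m/pixel; factors 8, 16 and 32 give the
# 8 m, 16 m and 32 m levels named on the slide.
base = [[float((r + c) % 2) for c in range(32)] for r in range(32)]
pyramid = {f: subsample(base, f) for f in (8, 16, 32)}
print({f: (len(img), len(img[0])) for f, img in pyramid.items()})
```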
‘Logical’ Schema
Gazetteer: Country, State, Place, PlaceType, FeatureType
Star schema
Index on
• image, place, type
• image, state, type
• image, state, country, type
• image, place, state, type
• image, place, country, type
all lookups are fast
Image Data & Meta Data: ImgMeta, TileMeta, Jump Img, Browse Img, Thumb Img, Tile Img, Theme Meta Information, TileLog
Lookup by UGrid or ZGrid ID plus resolution
Lookups are fast.
Indices are in DRAM (auto-magically by SQL)
SQL manages all the tiles and indices
Images are brought in on demand
[Diagram label: Lat/Long (U/ZGridId)]
Gazetteer Design
Classic Snowflake Schema
Top 10 Hint to RE for Cursor Select
Tables (columns; rows):
– Place (PlaceID, Name, AlternateName, CountryID, StateID, TypeID, GazSourcID, Latitude, Longitude, UGridID, ZGridID, DOQdate, SPIN2date, ImageFlag; 1,089,897)
– FeatureType (TypeID, Description; 13)
– GazetteerSource (GazSrcID, Description; 1)
– Country (CountryID, CountryName, UNcode; 264)
– State (StateID, CountryID, StateName; 1083)
– CountrySearch (AlternateName, CountryID, GazSrcID; 1148)
– StateSearch (AlternateName, CountryID, StateID, FeatureID, GazSrcID; 3776)
– PlaceGrid (ZGridID, BestPlaceName, XDistance, YDistance; 50,000,000)
1
Image Data Design
Image pyramid stored in DBMS (250 M recs)
Tables (columns; rows):
– ImgSource (SrcID, SrcName, SrcTblName, SrcDescription, GridSysID, ImgTypeID; 2)
– ImgType (ImgTypeID, ImgFileDesc, ImgFileExt, MimeStr; 4)
– Pick (Name, Description, Link, PickDate; 10)
– OriginalMetaData (OrigMetaID, SrcID, ImageSourceAgency, SourcePhotoID, SourcePhotoDate, SourceDEMDate, MetaDataDate, ProductionSystem, ProductionDate, DataFileSize, Compression, HeaderBytes, … 80 other fields; 650 k SPIN2, 2 M USGS)
– ImageMeta (ImgMetaID, OrigMetaID, ImgStatus, ImgDate, ImgTypeID, JumpPixHeight, JumpPixWidth, BrowsePixHeight, BrowsePixWidth, ThumbPixWidth, ThumbPixHeight, CutCol, CutRow, MidLat, MidLong, NELat, NELong, NWLat, NWLong, SELat, SELong, SWLat, SWLong, UGridID, UTMZone, XUtmID, YUtmID, XGridID, YGridID, ZGridID; 650 k SPIN2, 2 M USGS)
– Jump, Browse, Thumb, each with the same columns (UGridID, ZGridID, ZTileGridID, ImgData, ImgDate, ImgTypeID, ImgMetaID, SrcID, EncryptKey, File Name; .65 M SPIN2, 1.5 M USGS)
– Tile (same columns as Jump; 16 M SPIN2, 96 M USGS)
– TileMeta (ImgMetaID, OrigMetaID, SrcID, ImgStatus, ImgDate, ImgTypeID, TilePixHeight, TilePixWidth, CutCol, CutRow, MidLat, MidLong, NELat, NELong, NWLat, NWLong, SELat, SELong, SWLat, SWLong, UGridID, UTMZone, XUtmID, YUtmID, XGridID, YGridID, ZGridID; 16 M SPIN2, 96 M USGS)
– UGridHits (URL, UGridID, ZTileGridID, count; xxx)
– Log (URL, Time, <extensive list of action parameters>; xxx)
TerraServer File Group Design
Make 28 RAID5 sets from 324 disks
– Each RAID set has 11 disks (16 spare drives)
Make 4 595 GB NT volumes
– Each striped over 7 RAID sets on 7 controllers
Create 26 20,000 MB files on F:, 27 on G:
DB is File Group of 53 files (1.011 TB)
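The figures on this slide can be cross-checked with a few lines of arithmetic. This sketch assumes the slide's TB is binary (1 TB = 1024 x 1024 MB), which is what makes 53 files come out at 1.011 TB.

```python
DISKS = 324
RAID_SETS = 28
DISKS_PER_SET = 11
VOLUMES = 4
FILES = 26 + 27          # files on F: plus files on G:
FILE_MB = 20_000

spares = DISKS - RAID_SETS * DISKS_PER_SET    # drives left outside the RAID sets
sets_per_volume = RAID_SETS // VOLUMES        # RAID sets striped into each volume
total_mb = FILES * FILE_MB
total_tb = total_mb / 1024 / 1024             # assumes binary units

print(spares, sets_per_volume, FILES, round(total_tb, 3))
```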
[Diagram: NT volumes F:, G:, H:, I: striped across seven HSZ70 A/B controller pairs]
Physical Database
53 Files, 20,000 MB each
– 16,960,000 extents
– 135,680,000 pages
Separate tables for DOQ, Spin ‘Themes’
Each image stored in column of type ‘image’
All tile images in one (big) table
A number of indexes too
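The page and extent counts follow from the file sizes, given SQL Server 7.0's 8 KB pages and 8-page (64 KB) extents. A short check, with variable names chosen for illustration:

```python
PAGE_KB = 8              # SQL Server 7.0 page size
PAGES_PER_EXTENT = 8     # one extent = 8 pages = 64 KB

total_mb = 53 * 20_000                 # 53 files of 20,000 MB each
pages = total_mb * 1024 // PAGE_KB     # KB in the file group / KB per page
extents = pages // PAGES_PER_EXTENT

print(f"{pages:,} pages, {extents:,} extents")
```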
TerraServer Tables
USGS DOQ Data
– 48,000 DOQQ images (45-55mb / image)
– Creates 864,000 Jump, Thumb, & Browse images (3.5 m rows)
– Creates 55.3 m Tile images (110.6 m rows)
SPIN-2 Data
– 3200 278 MB images (approximate size)
– Creates 620,800 Jump, Thumb, & Browse images (2.5 m rows)
– Creates 15.5 m Tile images (31 m rows)
Gazetteer Data
– 1.1 m named places (Encarta World Atlas)
– 45 m cell names
Total Rows = 193.7 M
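The row total can be reproduced by adding the per-group figures from the slide. The dictionary keys below are shorthand labels, not table names.

```python
# Row counts in millions, as quoted on the slide.
rows_millions = {
    "USGS jump/thumb/browse": 3.5,
    "USGS tiles": 110.6,
    "SPIN-2 jump/thumb/browse": 2.5,
    "SPIN-2 tiles": 31.0,
    "gazetteer named places": 1.1,
    "gazetteer cell names": 45.0,
}

total = round(sum(rows_millions.values()), 1)
print(total)
```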
The Loading Process
Includes cutting images, building BCP files, BCP meta data, BCP image data
First Load 1/97-5/97 for Scalability Day
– 190 GB actual image data, 800 GB duplicates
– Pre-beta Sphinx
Second Load 12/97-4/98 for Web Server
– 750 GB actual image data, all images recut
Image Preparation and Load
[Diagram labels: DLT Tape “tar”; \Drop’N’; DoJob; Wait 4 Load; LoadMgr; DB; 100mbit Ether Switch; 108 9.1 GB Drives (three times); Enterprise Storage Array; AlphaServer 8400; STC DLT Tape Library; 60 4.3 GB Drives; AlphaServer 4100 (twice); ESA; NT Backup; ImgCutter; \Drop’N’ \Images]
LoadMgr steps:
10: ImgCutter
20: Partition
30: ThumbImg
40: BrowseImg
45: JumpImg
50: TileImg
55: Meta Data
60: Tile Meta
70: Img Meta
80: Update Place
Meta & Image Load Process
Input: *.IMD & *.JPG
Pre-Process Data
– Read *.IMD files
– Generate Ids
– Generate ZLatLong
– Sort by ZLatLong
[Diagram labels: Image Meta, Tile Meta]
Load Thumb Img
– Read Image Meta
– Read Image Data
– BCP into ImgTbl
Load Browse Img
– Read Image Meta
– Read Image Data
– BCP into ImgTbl
Load Tile Img
– Read Tile Meta
– Read Tile Data
– BCP into TileTbl
Load Tile Meta
– Read Image Meta
– BCP into TileMeta
Load Img Meta
– Read Image Meta
– BCP into TileMeta
Tables:
– ImgMeta (ImgMetaId int, OrigMetaId int, SrcId int, ImgTypeId int, XGridId int, YGridId int, ImgDate Date, Hemisphere smallint, Continent smallint, xxLat smallint, xxLong smallint, ZLatLong int, MetaStr vchar(255))
– TileMeta (TileMetaId int, ImgMetaId int, OrigMetaId int, SrcId int, ImgTypeId int, XGridId int, YGridId int, Hemisphere smallint, Continent smallint, xxLat smallint, xxLong smallint, ZLatLong int)
– “SRC”ThumbImg (ThumbImgId int, ImgMetaId int, ZLatLong int, SrcId int, ImgTypeId int, PixWidth int, PixHeight int, ImgData Blob)
– “SRC”BrowseImg (BrowseImgId int, ImgMetaId int, ZLatLong int, SrcId int, ImgTypeId int, PixWidth int, PixHeight int, ImgData Blob)
– “SRC”TileImg (TileImgId int, TileMetaId int, ZLatLong int, SrcId int, ImgTypeId int, PixWidth int, PixHeight int, ImgData Blob)
The Load Manager
A Workflow System.
Manages Job ‘Steps’.
Built as an SQL Database App.
Collects Stats.
Would use Data Transformation Services today
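The slides do not show how the Load Manager is built, so the following is only a sketch of the general idea: job steps tracked in a SQL table so that a failed job can be restarted from the step that failed. The step numbers and names are the ones listed in the load diagram; the table layout, function names, and the use of SQLite are assumptions made for illustration. Undo of a load is not covered.

```python
import sqlite3
from typing import Callable, Dict

# Step numbers and names from the load diagram; everything else is illustrative.
STEPS = [(10, "ImgCutter"), (20, "Partition"), (30, "ThumbImg"),
         (40, "BrowseImg"), (45, "JumpImg"), (50, "TileImg"),
         (55, "Meta Data"), (60, "Tile Meta"), (70, "Img Meta"),
         (80, "Update Place")]


def create_job(db: sqlite3.Connection, job_id: int) -> None:
    """Register a job with every step marked pending."""
    db.execute("CREATE TABLE IF NOT EXISTS job_step ("
               "job_id INTEGER, step_no INTEGER, name TEXT, status TEXT, "
               "PRIMARY KEY (job_id, step_no))")
    db.executemany("INSERT INTO job_step VALUES (?, ?, ?, 'pending')",
                   [(job_id, no, name) for no, name in STEPS])
    db.commit()


def run_job(db: sqlite3.Connection, job_id: int,
            handlers: Dict[str, Callable[[], None]]) -> bool:
    """Run the steps that are not done yet, in order; stop at the first failure."""
    rows = db.execute("SELECT step_no, name FROM job_step "
                      "WHERE job_id = ? AND status != 'done' ORDER BY step_no",
                      (job_id,)).fetchall()
    for step_no, name in rows:
        try:
            handlers[name]()
            status = "done"
        except Exception:
            status = "failed"
        db.execute("UPDATE job_step SET status = ? "
                   "WHERE job_id = ? AND step_no = ?", (status, job_id, step_no))
        db.commit()
        if status == "failed":
            return False
    return True


# Demo: the TileImg step fails once, then the job is restarted.
calls = []
state = {"fail_tile": True}


def make_handler(name: str) -> Callable[[], None]:
    def handler() -> None:
        if name == "TileImg" and state["fail_tile"]:
            state["fail_tile"] = False
            raise RuntimeError("simulated load failure")
        calls.append(name)
    return handler


db = sqlite3.connect(":memory:")
create_job(db, 1)
handlers = {name: make_handler(name) for _, name in STEPS}
first = run_job(db, 1, handlers)    # stops at TileImg
second = run_job(db, 1, handlers)   # resumes at TileImg, earlier steps are skipped
print(first, second, calls.count("ImgCutter"))
```

Because step status lives in the database, several loaders could in principle pick up different jobs in parallel, and the same table doubles as a place to collect per-step statistics.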
Load Statistics
601 DOQ Jobs, 818 Spin Jobs
– Each job does 3 meta BCP, 4 Image BCP steps
5676 Image BCP Steps
– 106 million total images loaded
– 546 GB total. 5.4 KB avg image size
For Tile Images (96% of the database)
– avg 68,000 images/step. max 757,000
– avg 33 minutes/step. max 596
– total time 796 hours (33 days)
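These statistics are consistent with one another, as a quick calculation shows. The sketch assumes binary units (1 GB = 1024³ bytes, 1 KB = 1024 bytes), which is what reproduces the 5.4 KB average.

```python
doq_jobs, spin_jobs = 601, 818
image_bcp_steps_per_job = 4

image_steps = (doq_jobs + spin_jobs) * image_bcp_steps_per_job

total_images = 106_000_000
total_bytes = 546 * 1024**3                       # 546 GB, binary units assumed
avg_image_kb = total_bytes / total_images / 1024

tile_hours = 796
tile_days = tile_hours / 24

print(image_steps, round(avg_image_kb, 1), round(tile_days, 1))
```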
System Maintenance: Backup & Recovery
Industrial Strength
– High Performance
– Online Backups
– Simple, Error Free Media Handling
– Minimal Recovery Time
Project Phases & Characteristics
Load Phase
– Ongoing Massive Data Loads
– Updates to Fix Errors in Meta-Data
– Backups at Key Milestones
Deployed
– 7 x 24
– Some Updates to Existing Data
– Small Loads as More Data Arrives
– Infrequent Large Loads
SQL Server 7.0 Backup/Restore Features
Fast
Online Backup Under Load
– Minimal Impact
Just the Data
Backup Part of the Database
Minimize Recovery Time
– Differential Backups, Log Backups
– Restore Only Damaged Files
Backup ISVs Address Limitations
Legato NetWorker™, Computer Associates ArcServe™, Seagate Backup Exec™, Others…
These Products support SQL Server 6.5
None Support SQL Server 7.0 yet.
[Timeline labels: Deployed, 6/98, ...]
ISV Supports SQL Server 7.0 High Performance Backup API
ISV Supports Full Range of SQL Server 7.0 Backup/Restore Features
[Diagram labels: Backup Software; Backup API; SQL Server; Tape Library]
Backup API Performance
[Charts: Avg CPU Usage (percent) and Throughput (MB/sec), each comparing Backup API (no write), PIPE (no write), and NUL]
Verifying Backup/Restore
Minimal Risk
Restore to a Separate System at DECWest
– Early Problems with Unreadable Tapes
[Diagram labels: TerraServer; Test System]
Another Terabyte of Disk!
TerraServer Backup/Restore Factoids
Backup/Restore Rate: 200 GB/Hr (57 MB/sec)
Time Required for Full Database Backup: 5 Hours
Number of DLT Tape Cartridges: 36
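The rate and duration quoted here fit together, and they fit the file group size given earlier in the deck (53 files of 20,000 MB). A rough check, assuming binary units and that the full file group is what gets backed up:

```python
rate_gb_per_hour = 200
rate_mb_per_sec = rate_gb_per_hour * 1024 / 3600   # GB/hour -> MB/sec

db_gb = 53 * 20_000 / 1024                         # file group size in GB
backup_hours = db_gb / rate_gb_per_hour

print(round(rate_mb_per_sec), round(backup_hours, 1))
```

The result is a little over five hours, in line with the figure on the slide.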
Other Details
Active Server pages
– faster and easier than DB stored procedures.
Commerce Server is interesting
– Images are the inventory: no SKU, millions of them
– USGS built their own; they are very smart, but it is easy to masquerade as a credit-card reader.
The earth is a geoid, and every geographer has a coordinate system (or two).
Tapes are still a nightmare.
Everyone is a UI expert.
Thank You!
SPIN-2
Microsoft
BackOffice