
Page 1: PS1 PSPS Object Data Manager Design

PS1 PSPSObject Data Manager Design

PSPS Critical Design Review November 5-6, 2007

IfA

Page 2: PS1 PSPS Object Data Manager Design

slide 2

Outline

ODM Overview Critical Requirements Driving Design Work Completed Detailed Design Spatial Querying [AS]

ODM Prototype [MN]

Hardware/Scalability [JV]

How Design Meets Requirements WBS and Schedule Issues/Risks

[AS] = Alex, [MN] = Maria, [JV] = Jan

Page 3: PS1 PSPS Object Data Manager Design

slide 3

ODM Overview

The Object Data Manager will:

Provide a scalable data archive for the Pan-STARRS data products

Provide query access to the data for Pan-STARRS users

Provide detailed usage tracking and logging

Page 4: PS1 PSPS Object Data Manager Design

slide 4

ODM Driving Requirements

Total size ~100 TB:
• 1.5 × 10^11 P2 detections
• 8.3 × 10^10 P2 cumulative-sky (stack) detections
• 5.5 × 10^9 celestial objects

Nominal daily rate (divide by 3.5 × 365):
• P2 detections: 120 Million/day
• Stack detections: 65 Million/day
• Objects: 4.3 Million/day

Cross-Match requirement: 120 Million / 12 hrs ~ 2800 / s

DB size requirement:
• 25 TB / yr
• ~100 TB by end of PS1 (3.5 yrs)

Page 5: PS1 PSPS Object Data Manager Design

slide 5

Work completed so far

Built a prototype

Scoped and built prototype hardware

Generated simulated data
• 300M SDSS DR5 objects, 1.5B Galactic plane objects

Initial Load done – created 15 TB DB of simulated data
• Largest astronomical DB in existence today

Partitioned the data correctly using Zones algorithm

Able to run simple queries on distributed DB

Demonstrated critical steps of incremental loading

It is fast enough
• Cross-match > 60k detections/sec
• Required rate is ~3k/sec

Page 6: PS1 PSPS Object Data Manager Design

slide 6

Detailed Design

Reuse SDSS software as much as possible

Data Transformation Layer (DX) – Interface to IPP

Data Loading Pipeline (DLP)

Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware

Query Manager (QM: CasJobs for prototype)

Page 7: PS1 PSPS Object Data Manager Design

slide 7

High-Level Organization

Legend: Database, Full table, [partitioned table], Output table, Partitioned View

[Architecture diagram] The Web Based Interface (WBI) and Query Manager (QM) sit on top of the Data Storage (DS) layer. DS consists of the PS1 head database (PartitionsMap plus full Objects, LnkToObj, and Meta tables and a Detections partitioned view) and slice databases P1..Pm ([Objects_p], [LnkToObj_p], [Detections_p], Meta), glued together through linked servers. The Data Transformation Layer (DX) feeds the Data Loading Pipeline (DLP), whose LoadAdmin and LoadSupport1..n databases hold objZoneIndx, orphans, Detections_l, and LnkToObj_l tables plus a PartitionMap, also connected via linked servers.

Page 8: PS1 PSPS Object Data Manager Design

slide 8

Detailed Design

Reuse SDSS software as much as possible

Data Transformation Layer (DX) – Interface to IPP

Data Loading Pipeline (DLP)

Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware

Query Manager (QM: CasJobs for prototype)

Page 9: PS1 PSPS Object Data Manager Design

slide 9

Data Transformation Layer (DX)

Based on SDSS sqlFits2CSV package
• LINUX/C++ application
• FITS reader driven off header files

Convert IPP FITS files to
• ASCII CSV format for ingest (initially)
• SQL Server native binary later (3x faster)

Follow the batch and ingest verification procedure described in ICD
• 4-step batch verification
• Notification and handling of broken publication cycle

Deposit CSV or Binary input files in directory structure
• Create “ready” file in each batch directory

Stage input data on LINUX side as it comes in from IPP

Page 10: PS1 PSPS Object Data Manager Design

slide 10

DX Subtasks

DX subtasks:

Initialization Job
• FITS schema, FITS reader
• CSV converter, CSV writer

Batch Ingest
• Interface with IPP, naming convention, uncompress batch
• Read batch, verify batch

Batch Verification
• Verify manifest, verify FITS integrity, verify FITS content
• Verify FITS data, handle broken cycle

Batch Conversion
• CSV converter, binary converter
• “batch_ready”, interface with DLP

Page 11: PS1 PSPS Object Data Manager Design

slide 11

DX-DLP Interface

Directory structure on staging FS (LINUX):
• Separate directory for each JobID_BatchID
• Contains a “batch_ready” manifest file
  – Name, #rows and destination table of each file
• Contains one file per destination table in ODM
  – Objects, Detections, other tables

Creation of “batch_ready” file is signal to loader to ingest the batch

Batch size and frequency of ingest cycle TBD

Page 12: PS1 PSPS Object Data Manager Design

slide 12

Detailed Design

Reuse SDSS software as much as possible

Data Transformation Layer (DX) – Interface to IPP

Data Loading Pipeline (DLP)

Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware

Query Manager (QM: CasJobs for prototype)

Page 13: PS1 PSPS Object Data Manager Design

slide 13

Data Loading Pipeline (DLP)

sqlLoader – SDSS data loading pipeline
• Pseudo-automated workflow system
• Loads, validates and publishes data
  – From CSV to SQL tables
• Maintains a log of every step of loading
• Managed from Load Monitor Web interface

Has been used to load every SDSS data release
• EDR, DR1-6, ~15 TB of data altogether
• Most of it (since DR2) loaded incrementally
• Kept many data errors from getting into database
  – Duplicate ObjIDs (symptom of other problems)
  – Data corruption (CSV format invaluable in catching this)

Page 14: PS1 PSPS Object Data Manager Design

slide 14

sqlLoader Design

Existing functionality
• Shown for SDSS version
• Workflow, distributed loading, Load Monitor

New functionality
• Schema changes
• Workflow changes
• Incremental loading
  – Cross-match and partitioning

Page 15: PS1 PSPS Object Data Manager Design

slide 15

sqlLoader Workflow

Distributed design achieved with linked servers and SQL Server Agent

LOAD stage can be done in parallel by loading into temporary task databases

PUBLISH stage writes from task DBs to final DB

FINISH stage creates indices and auxiliary (derived) tables

[Workflow diagram] LOAD → PUBLISH → FINISH
LOAD: Export (EXP) → Check CSV (CHK) → Build Task DBs (BLD) → Build SQL Schema (SQL) → Validate (VAL) → Backup (BCK) → Detach (DTC)
PUBLISH: Publish (PUB) → Cleanup (CLN)
FINISH: Finish (FIN)

Loading pipeline is a system of VB and SQL scripts, stored procedures and functions
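A minimal sketch of what the PUBLISH step amounts to for one table, assuming a task database named TaskDB_001 and the published PS1 database on the same server (these names are illustrative; the real sqlLoader wraps this in stored procedures with logging around each step, and a linked-server four-part name would be used across machines):

-- Copy validated rows from the task DB into the published database (assumed table names)
INSERT INTO PS1.dbo.Objects WITH (TABLOCK)
SELECT * FROM TaskDB_001.dbo.Objects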

Page 16: PS1 PSPS Object Data Manager Design

slide 16

Load Monitor Tasks Page

Page 17: PS1 PSPS Object Data Manager Design

slide 17

Load Monitor Active Tasks

Page 18: PS1 PSPS Object Data Manager Design

slide 18

Load Monitor Statistics Page

Page 19: PS1 PSPS Object Data Manager Design

slide 19

Load Monitor – New Task(s)

Page 20: PS1 PSPS Object Data Manager Design

slide 20

Data Validation

Tests for data integrity and consistency

Scrubs data and finds problems in upstream pipelines

Most of the validation can be performed within the individual task DB (in parallel)

Validation tests:
• Test Uniqueness of Primary Keys – test the unique key in each table
• Test Foreign Keys – test for consistency of keys that link tables
• Test Cardinalities – test consistency of numbers of various quantities
• Test HTM IDs – test the Hierarchical Triangular Mesh IDs used for spatial indexing
• Test Link Table Consistency – ensure that links are consistent
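As an illustration of the kind of check that runs inside each task DB, a minimal T-SQL sketch of the primary-key uniqueness test (the table and column names follow the PS1 schema shown elsewhere in this review; the actual loader procedure may differ):

-- Report any objID that appears more than once in the task DB's Objects table
SELECT objID, COUNT(*) AS nDup
FROM Objects
GROUP BY objID
HAVING COUNT(*) > 1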

Page 21: PS1 PSPS Object Data Manager Design

slide 21

Distributed Loading

[Diagram] A LoadAdmin master server (Master Schema, Publish Schema) coordinates LoadSupport slave servers, each holding a view of the Master Schema and its own Task DBs (TaskData). CSV/Binary files are Samba-mounted on the loaders; the Load Monitor drives the workflow, and the Publish/Finish steps write the task data into the publish database.

Page 22: PS1 PSPS Object Data Manager Design

slide 22

Schema Changes

Schema in task and publish DBs is driven off a list of schema DDL files to execute (xschema.txt)

Requires replacing DDL files in schema/sql directory and updating xschema.txt with their names

PS1 schema DDL files have already been built

Index definitions have also been created

Metadata tables will be automatically generated using metadata scripts already in the loader

Page 23: PS1 PSPS Object Data Manager Design

slide 23

Workflow Changes

Cross-Match and Partition steps will be added to the workflow

Cross-match will match detections to objects

Partition will horizontally partition data, move it to slice servers, and build DPVs on main

[Workflow diagram] LOAD: Export → Check CSVs → Create Task DBs → Build SQL Schema → Validate → XMatch → Partition → PUBLISH

Page 24: PS1 PSPS Object Data Manager Design

slide 24

Matching Detections with Objects

Algorithm described fully in prototype section

Stored procedures to cross-match detections will be part of the LOAD stage in loader pipeline

Vertical partition of Objects table kept on load server for matching with detections

Zones cross-match algorithm used to do 1” and 2” matches

Detections with no matches saved in Orphans table
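A minimal sketch of the orphan-selection step (detections left unmatched after the cross-match passes); the table names follow the loading schema shown in the data-flow slide, but the column names and the actual stored procedure are assumptions:

-- Detections that matched no object in either cross-match pass become orphans
INSERT INTO Orphans (detectID, ra, dec)
SELECT d.detectID, d.ra, d.dec
FROM Detections_l1 AS d
WHERE NOT EXISTS (SELECT 1 FROM LnkToObj_l1 AS l WHERE l.detectID = d.detectID)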

Page 25: PS1 PSPS Object Data Manager Design

slide 25

XMatch and Partition Data Flow

[Data flow diagram] Loadsupport: LoadDetections, XMatch (Detections_In, ObjZoneIndx → LinkToObj_In, Orphans), PullChunk (Detections_chunk, LinkToObj_chunk). Slice Pm: MergePartitions (Detections_m, LinkToObj_m), UpdateObjects (Objects_m), SwitchPartition. PS1 main: Pull, Partition (Objects_m, LinkToObj_m → Objects, LinkToObj).

Page 26: PS1 PSPS Object Data Manager Design

slide 26

Detailed Design

Reuse SDSS software as much as possible

Data Transformation Layer (DX) – Interface to IPP

Data Loading Pipeline (DLP)

Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware

Query Manager (QM: CasJobs for prototype)

Page 27: PS1 PSPS Object Data Manager Design

slide 27

Data Storage – Schema

Page 28: PS1 PSPS Object Data Manager Design

slide 28

PS1 Table Sizes Spreadsheet

Spreadsheet header values: Stars 5.00E+09 / 1.51E+11; Galaxies 5.00E+08 / 3.675E+10; Total Objects 5.50E+09; P2 Detections per year 4.30E+10; 0.3*DR1; scale factors 0.3, 0.29, 0.57, 0.86, 1.00

tablename / [flag] / columns / bytes per row / total rows / total size (TB) / Prototype / DR1 / DR2 / DR3 / DR4 / [flag] / fraction per partition
AltModels 0 7 1547 10 1.547E-08 1.547E-08 1.547E-08 1.547E-08 1.547E-08 1.547E-08 1 1
CameraConfig 0 5 287 30 8.61E-09 8.61E-09 8.61E-09 8.61E-09 8.61E-09 8.61E-09 1 1
FileGroupMap 0 4 4335 100 4.335E-07 4.335E-07 4.335E-07 4.335E-07 4.335E-07 4.335E-07 1 1
IndexMap 0 7 2301 100 2.301E-07 2.301E-07 2.301E-07 2.301E-07 2.301E-07 2.301E-07 1 1
Objects 0 88 420 5.50E+09 2.31 0.693 2.31 2.31 2.31 2.31 1 0.33
ObjZoneIndx 0 7 63 5.50E+09 0.3465 0.10395 0.3465 0.3465 0.3465 0.3465 1 0
PartitionMap 0 3 4111 100 4.111E-07 4.111E-07 4.111E-07 4.111E-07 4.111E-07 4.111E-07 1 1
PhotoCal 0 10 151 1000 0.000000151 0.000000151 0.000000151 0.000000151 0.000000151 0.000000151 1 1
PhotozRecipes 0 2 267 10 2.67E-09 2.67E-09 2.67E-09 2.67E-09 2.67E-09 2.67E-09 1 1
SkyCells 0 2 10 50000 0.0000005 0.0000005 0.0000005 0.0000005 0.0000005 0.0000005 1 1
Surveys 0 2 267 30 8.01E-09 8.01E-09 8.01E-09 8.01E-09 8.01E-09 8.01E-09 1 1
DropP2ToObj 1 4 39 4.00E+06 0.000156 1.33714E-05 4.45714E-05 8.91429E-05 0.000133714 0.000156 1 0.33
DropStackToObj 1 4 39 4.00E+06 0.000156 1.33714E-05 4.45714E-05 8.91429E-05 0.000133714 0.000156 1 0.33
P2AltFits 1 13 71 1.51E+10 1.06855 0.09159 0.3053 0.6106 0.9159 1.06855 0 0.33
P2FrameMeta 1 18 343 1.05E+06 0.00036015 0.00003087 0.0001029 0.0002058 0.0003087 0.00036015 1 1
P2ImageMeta 1 64 2870 6.72E+07 0.192864 0.0165312 0.055104 0.110208 0.165312 0.192864 1 1
P2PsfFits 1 34 183 1.51E+11 27.5415 2.3607 7.869 15.738 23.607 27.5415 0 0.33
P2ToObj 1 3 31 1.51E+11 4.6655 0.3999 1.333 2.666 3.999 4.6655 1 0.33
P2ToStack 1 2 15 1.51E+11 2.2575 0.1935 0.645 1.29 1.935 2.2575 0 0.33
StackDeltaAltFits 1 13 71 3.68E+09 0.260925 0.022365 0.07455 0.1491 0.22365 0.260925 0 0.33
StackHiSigDeltas 1 32 167 3.68E+10 6.13725 0.52605 1.7535 3.507 5.2605 6.13725 0 0.33
StackLowSigDelta 1 2 5000 1.65E+06 0.00825 0.000707143 0.002357143 0.004714286 0.007071429 0.00825 0 0.33
StackMeta 1 49 1551 700000 0.0010857 0.00032571 0.0010857 0.0010857 0.0010857 0.0010857 0 0.33
StackModelFits 1 131 535 7.50E+09 4.0125 0.343928571 1.146428571 2.292857143 3.439285714 4.0125 0 0.33
StackPsfFits 1 44 215 8.25E+10 17.7375 1.520357143 5.067857143 10.13571429 15.20357143 17.7375 0 0.33
StackToObj 1 4 39 8.25E+10 3.2175 0.275785714 0.919285714 1.838571429 2.757857143 3.2175 1 0.33
StationaryTransient 1 2 23 5.00E+08 0.0115 0.000985714 0.003285714 0.006571429 0.009857143 0.0115 1 0.33

sum 69.76959861 6.549735569 21.83244779 41.00730812 60.18216845 69.76959861
indices 13.95391972 1.309947114 4.366489558 8.201461624 12.03643369 13.95391972
total 83.72351833 7.859682683 26.19893735 49.20876974 72.21860214 83.72351833

Flag notes: 0 means the table size is essentially the same for all data releases (Primary filegroup); 1 means the table size will grow

0 means full table; 1 means the table is partitioned and distributed across the cluster

Fraction per partition: fraction of the table contained on each partition

Note: These estimates are for the whole PS1, assuming 3.5 years. 7 bytes added to each row for overhead as suggested by Alex

Page 29: PS1 PSPS Object Data Manager Design

slide 29

PS1 Table Sizes - All Servers

Table Year 1 Year 2 Year 3 Year 3.5

Objects 4.63 4.63 4.61 4.59

StackPsfFits 5.08 10.16 15.20 17.76

StackToObj 1.84 3.68 5.56 6.46

StackModelFits 1.16 2.32 3.40 3.96

P2PsfFits 7.88 15.76 23.60 27.60

P2ToObj 2.65 5.31 8.00 9.35

Other Tables 3.41 6.94 10.52 12.67

Indexes +20% 5.33 9.76 14.18 16.48

Total 31.98 58.56 85.07 98.87

Sizes are in TB

Page 30: PS1 PSPS Object Data Manager Design

slide 30

Data Storage – Test Queries

Drawn from several sources
• Initial set of SDSS 20 queries
• SDSS SkyServer Sample Queries
• Queries from PS scientists (Monet, Howell, Kaiser, Heasley)

Two objectives
• Find potential holes/issues in schema
• Serve as test queries
  – Test DBMS integrity
  – Test DBMS performance

Loaded into CasJobs (Query Manager) as sample queries for prototype

Page 31: PS1 PSPS Object Data Manager Design

slide 31

Data Storage – DBMS

Microsoft SQL Server 2005
• Relational DBMS with excellent query optimizer

Plus
• Spherical/HTM (C# library + SQL glue)
  – Spatial index (Hierarchical Triangular Mesh)
• Zones (SQL library)
  – Alternate spatial decomposition with dec zones
• Many stored procedures and functions
  – From coordinate conversions to neighbor search functions
• Self-extracting documentation (metadata) and diagnostics

Page 32: PS1 PSPS Object Data Manager Design

slide 32

Documentation and Diagnostics

Page 33: PS1 PSPS Object Data Manager Design

slide 33

Data Storage – Scalable Architecture

Monolithic database design (a la SDSS) will not do it

SQL Server does not have a cluster implementation
• Do it by hand

Partitions vs Slices
• Partitions are file-groups on the same server
  – Parallelize disk accesses on the same machine
• Slices are data partitions on separate servers
• We use both!

Additional slices can be added for scale-out

For PS1, use SQL Server Distributed Partitioned Views (DPVs)

Page 34: PS1 PSPS Object Data Manager Design

slide 34

Distributed Partitioned Views

Difference between DPVs and file-group partitioning
• FG on same database
• DPVs on separate DBs
• FGs are for scale-up
• DPVs are for scale-out

Main server has a view of a partitioned table that includes remote partitions (we call them slices to distinguish them from FG partitions)

Accomplished with SQL Server’s linked server technology

NOT truly parallel, though
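A minimal sketch of how such a view could be declared on the head node, assuming slice servers registered as linked servers named Slice1 and Slice2 and member tables Detections_p1/Detections_p2 constrained on disjoint ObjID ranges (all names here are illustrative, not the actual PS1 DDL):

-- Distributed Partitioned View gluing slice tables into one logical Detections table
CREATE VIEW Detections
AS
SELECT * FROM Slice1.PS1_p1.dbo.Detections_p1   -- linked server . database . schema . table
UNION ALL
SELECT * FROM Slice2.PS1_p2.dbo.Detections_p2

With a CHECK constraint on the partitioning column in each member table, the optimizer can route a query that touches only one ObjID range to the single remote member, which is what makes the head-node view usable at scale.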

Page 35: PS1 PSPS Object Data Manager Design

slide 35

Scalable Data Architecture

Shared-nothing architecture

Detections split across cluster

Objects replicated on Head and Slice DBs

DPVs of Detections tables on the Head node DB

Queries on Objects stay on head node

[Diagram] Head node: Objects plus Objects_S1, Objects_S2, Objects_S3 and a Detections DPV over Detections_S1, Detections_S2, Detections_S3. Slices S1, S2, S3: local Objects_Sx and Detections_Sx tables.

Queries on detections use only local data on slices

Page 36: PS1 PSPS Object Data Manager Design

slide 36

Hardware - Prototype

[Hardware diagram] Prototype servers and storage:
• LX PS01 – Linux staging; L1 PS13 and L2/M PS05 – loading and MyDB; Head PS11, S1 PS12, S2 PS03, S3 PS04 – DB servers; W PS02 – Web server
• Capacities shown: staging 10 TB, loading 9 TB, DB 39 TB, Web 0 TB
• Server naming convention: PS0x = 4-core, PS1x = 8-core
• RAID configs: RAID5 (staging), RAID10 (DB); disk/rack configs 14D/3.5W, 12D/4W
• Storage building blocks: 10A = 10 x [13 x 750 GB], 3B = 3 x [12 x 500 GB]
• Function key: LX = Linux, L = Load server, S/Head = DB server, M = MyDB server, W = Web server

Page 37: PS1 PSPS Object Data Manager Design

slide 37

Hardware – PS1

[Diagram] Each database exists as Live (Copy 1), Offline (Copy 2), and Spare (Copy 3); queries run against the live copy while ingest goes to the offline copy, replication synchronizes the copies, and the roles are then switched.

Ping-pong configuration to maintain high availability and query performance

2 copies of each slice and of main (head) node database on fast hardware (hot spares)

3rd spare copy on slow hardware (can be just disk)

Updates/ingest on offline copy then switch copies when ingest and replication finished

Synchronize second copy while first copy is online

Both copies live when no ingest

3x basic config. for PS1

Page 38: PS1 PSPS Object Data Manager Design

slide 38

Detailed Design

Reuse SDSS software as much as possible

Data Transformation Layer (DX) – Interface to IPP

Data Loading Pipeline (DLP)

Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware

Query Manager (QM: CasJobs for prototype)

Page 39: PS1 PSPS Object Data Manager Design

slide 39

Query Manager

Based on SDSS CasJobs

Configure to work with distributed database, DPVs

Direct links (contexts) to slices can be added later if necessary

Segregates quick queries from long ones

Saves query results server-side in MyDB

Gives users a powerful query workbench

Can be scaled out to meet any query load

PS1 Sample Queries available to users

PS1 Prototype QM demo

Page 40: PS1 PSPS Object Data Manager Design

slide 40

ODM Prototype Components

Data Loading Pipeline

Data Storage

CasJobs
• Query Manager (QM)
• Web Based Interface (WBI)

Testing

Page 41: PS1 PSPS Object Data Manager Design

slide 41

Spatial Queries (Alex)

Page 42: PS1 PSPS Object Data Manager Design

slide 42

Spatial Searches in the ODM

Page 43: PS1 PSPS Object Data Manager Design

slide 43

Common Spatial Questions

Points in region queries
1. Find all objects in this region
2. Find all “good” objects (not in masked areas)
3. Is this point in any of the regions?

Region in region
4. Find regions near this region and their area
5. Find all objects with error boxes intersecting region
6. What is the common part of these regions?

Various statistical operations
7. Find the object counts over a given region list
8. Cross-match these two catalogs in the region

Page 44: PS1 PSPS Object Data Manager Design

slide 44

Sky Coordinates of Points

Many different coordinate systems
• Equatorial, Galactic, Ecliptic, Supergalactic

Longitude-latitude constraints

Searches often in a mix of different coordinate systems
• gb > 40 and dec between 10 and 20
• Problem: coordinate singularities, transformations

How can one describe constraints in an easy, uniform fashion?

How can one perform fast database queries in an easy fashion?
• Fast: indexes
• Easy: simple query expressions

Page 45: PS1 PSPS Object Data Manager Design

slide 45

Describing Regions

Spacetime metadata for the VO (Arnold Rots)

Includes definitions of
• Constraint: single small or great circle
• Convex: intersection of constraints
• Region: union of convexes

Support both angles and Cartesian descriptions

Constructors for
• CIRCLE, RECTANGLE, POLYGON, CONVEX HULL

Boolean algebra (INTERSECTION, UNION, DIFF)

Proper language to describe the abstract regions

Similar to GIS, but much better suited for astronomy

Page 46: PS1 PSPS Object Data Manager Design

slide 46

Things Can Get Complex

[Diagram: overlapping regions A and B]

Green area: A (B − ε) should find B if it contains an A and is not masked. Yellow area: A (B ± ε) is an edge case; it may find B if it contains an A.

Page 47: PS1 PSPS Object Data Manager Design

slide 47

We Do Spatial 3 Ways

Hierarchical Triangular Mesh (extension to SQL)
• Uses table-valued functions
• Acts as a new “spatial access method”

Zones: fits SQL well
• Surprisingly simple & good

3D Constraints: a novel idea
• Algebra on regions, can be implemented in pure SQL

Page 48: PS1 PSPS Object Data Manager Design

slide 48

PS1 Footprint

Using the projection cell definitions as centers for tessellation (T. Budavari)

Page 49: PS1 PSPS Object Data Manager Design

slide 49

CrossMatch: Zone Approach

Divide space into declination zones

Objects ordered by zoneid, ra (on the sphere, need wrap-around margin)

Point search looks in neighboring zones within a ~(ra ± Δ) bounding box

All inside the relational engine

Avoids “impedance mismatch”

Can “batch” comparisons

Automatically parallel

Details in Maria’s thesis

[Diagram: a zone of height zoneMax with a search box of width ra ± Δ around point x]

Page 50: PS1 PSPS Object Data Manager Design

slide 50

Indexing Using Quadtrees

Cover the sky with hierarchical pixels

COBE – start with a cube

Hierarchical Triangular Mesh (HTM) uses trixels
• Samet, Fekete

Start with an octahedron, and split each triangle into 4 children, down to 20 levels deep

Smallest triangles are 0.3”

Each trixel has a unique htmID

[Diagram: trixel 2 splits into 2,0 2,1 2,2 2,3; trixel 2,3 splits into 2,3,0 2,3,1 2,3,2 2,3,3; similarly 22 splits into 220 221 222 223]

Page 51: PS1 PSPS Object Data Manager Design

slide 51

Space-Filling Curve

[Diagram: HTM space-filling curve; trixels 100–133 map to intervals such as [0.120,0.121), [0.121,0.122), [0.122,0.123), [0.123,0.130), [0.12,0.13), [0.122,0.130)]

Triangles correspond to ranges; all points inside the triangle are inside the range.

Page 52: PS1 PSPS Object Data Manager Design

slide 52

SQL HTM Extension

Every object has a 20-deep htmID (44 bits)

Clustered index on htmID

Table-valued functions for spatial joins
• Given a region definition, routine returns up to 10 ranges of covering triangles
• Spatial query is mapped to ~10 range queries

Current implementation rewritten in C#

Excellent performance, little calling overhead

Three layers
• General geometry library
• HTM kernel
• IO (parsing + SQL interface)

Page 53: PS1 PSPS Object Data Manager Design

slide 53

Writing Spatial SQL

-- region description is contained by @area
DECLARE @cover TABLE (htmStart bigint, htmEnd bigint)
INSERT @cover SELECT * FROM dbo.fHtmCover(@area)
--
DECLARE @region TABLE (convexId bigint, x float, y float, z float, c float)
INSERT @region SELECT * FROM dbo.fGetHalfSpaces(@area)
--
SELECT o.ra, o.dec, 1 AS flag, o.objid
FROM (SELECT objID AS objid, cx, cy, cz, ra, [dec]
      FROM Objects q
      JOIN @cover AS c ON q.htmID BETWEEN c.htmStart AND c.htmEnd) AS o
WHERE NOT EXISTS
      (SELECT p.convexId FROM @region AS p
       WHERE o.cx*p.x + o.cy*p.y + o.cz*p.z < p.c
       GROUP BY p.convexId)
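For context, a hedged example of how @area might be set up before running the query above; the region-string grammar shown (a J2000 circle with a radius in arcminutes) follows the usual SDSS/HTM convention but should be checked against the deployed library:

DECLARE @area varchar(8000)
SET @area = 'CIRCLE J2000 101.287 -16.716 5'   -- ra deg, dec deg, radius arcmin (assumed grammar)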

Page 54: PS1 PSPS Object Data Manager Design

slide 54

Status

All three libraries extensively tested

Zones used for Maria’s thesis, plus various papers

New HTM code in production use since July on SDSS

Same code also used by STScI HLA, Galex

Systematic regression tests developed

Footprints computed for all major surveys

Complex mask computations done on SDSS

Loading: zones used for bulk crossmatch

Ad hoc queries: use HTM-based search functions

Excellent performance

Page 55: PS1 PSPS Object Data Manager Design

slide 55

Prototype (Maria)

Page 56: PS1 PSPS Object Data Manager Design

slide 56

PS1 PSPSObject Data Manager Design

PSPS Critical Design Review November 5-6, 2007

IfA

Page 57: PS1 PSPS Object Data Manager Design

slide 57

Detail Design

General Concepts

Distributed Database Architecture

Ingest Workflow

Prototype

Page 58: PS1 PSPS Object Data Manager Design

slide 58

Zones (spatial partitioning and indexing algorithm)

Partition and bin the data into declination zones
• ZoneID = floor ((dec + 90.0) / zoneHeight)

Few tricks required to handle spherical geometry

Place the data close on disk
• Clustered index on ZoneID and RA

Fully implemented in SQL

Efficient
• Nearby searches
• Cross-Match (especially)

Fundamental role in addressing the critical requirements
• Data volume management
• Association speed
• Spatial capabilities

[Diagram: sky split into declination (Dec) zones vs. Right Ascension (RA)]

Page 59: PS1 PSPS Object Data Manager Design

slide 59

Zoned Table

ObjID ZoneID* RA Dec CX CY CZ …

1 0 0.0 -90.0

2 20250 180.0 0.0

3 20250 181.0 0.0

4 40500 360.0 +90.0

* ZoneHeight = 8 arcsec in this example

ZoneID = floor ((dec + 90.0) / zoneHeight)
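A small T-SQL sketch of how a zoned helper table can be populated from object coordinates using the formula above; the zone height and the table/column names (ObjZoneIndx, Objects, cx/cy/cz) are illustrative assumptions:

DECLARE @zoneHeight float
SET @zoneHeight = 8.0 / 3600.0   -- assumed zone height in degrees (8 arcsec)

-- Assign each object to a declination zone and cluster it by (zoneID, ra)
SELECT objID,
       CAST(FLOOR((dec + 90.0) / @zoneHeight) AS int) AS zoneID,
       ra, dec, cx, cy, cz
INTO   ObjZoneIndx
FROM   Objects

CREATE CLUSTERED INDEX ix_zone ON ObjZoneIndx (zoneID, ra)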

Page 60: PS1 PSPS Object Data Manager Design

slide 60

SQL CrossNeighbors

SELECT *
FROM prObj1 z1
JOIN zoneZone ZZ ON ZZ.zoneID1 = z1.zoneID
JOIN prObj2 z2 ON ZZ.zoneID2 = z2.zoneID
WHERE z2.ra BETWEEN z1.ra - ZZ.alpha AND z1.ra + ZZ.alpha
  AND z2.dec BETWEEN z1.dec - @r AND z1.dec + @r
  AND (z1.cx*z2.cx + z1.cy*z2.cy + z1.cz*z2.cz) > cos(radians(@r))
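The query above assumes a zoneZone helper table listing, for every zone, the neighboring zones to search and the RA half-width (alpha) to use. A hedged sketch of how such a table might be built, using a simplified form of the zones-algorithm widening of alpha at high declination (the real loader may use a more careful formula near the poles); Zone(zoneID, decMin) is an assumed bookkeeping table:

DECLARE @r float, @zoneHeight float
SET @r = 1.0 / 3600.0            -- 1 arcsec match radius, in degrees (assumed)
SET @zoneHeight = 8.0 / 3600.0   -- assumed zone height, in degrees

SELECT z1.zoneID AS zoneID1,
       z2.zoneID AS zoneID2,
       -- widen the RA window at high declination; clamp cos(dec) to avoid blow-up at the poles
       @r / (CASE WHEN c.cosDec < 0.01 THEN 0.01 ELSE c.cosDec END) AS alpha
INTO   zoneZone
FROM   Zone z1
JOIN   Zone z2
  ON   z2.zoneID BETWEEN z1.zoneID - CEILING(@r / @zoneHeight)
                     AND z1.zoneID + CEILING(@r / @zoneHeight)
CROSS APPLY (SELECT COS(RADIANS(ABS(z1.decMin))) AS cosDec) AS c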

Page 61: PS1 PSPS Object Data Manager Design

slide 61

Good CPU Usage

Page 62: PS1 PSPS Object Data Manager Design

slide 62

Partitions

SQL Server 2005 introduces technology to handle tables which are partitioned across different disk volumes and managed by a single server.

Partitioning makes management and access of large tables and indexes more efficient
• Enables parallel I/O
• Reduces the amount of data that needs to be accessed
• Related tables can be aligned and collocated in the same place, speeding up JOINs

Page 63: PS1 PSPS Object Data Manager Design

slide 63

Partitions

2 key elements
• Partitioning function
  – Specifies how the table or index is partitioned
• Partitioning scheme
  – Using a partitioning function, the scheme specifies the placement of the partitions on file groups

Data can be managed very efficiently using Partition Switching
• Add a table as a partition to an existing table
• Switch a partition from one partitioned table to another
• Reassign a partition to form a single table

Main requirement
• The table must be constrained on the partitioning column

Page 64: PS1 PSPS Object Data Manager Design

slide 64

Partitions

For the PS1 design,
• Partitions mean File Group Partitions
• Tables are partitioned into ranges of ObjectID, which correspond to declination ranges
• ObjectID boundaries are selected so that each partition has a similar number of objects

Page 65: PS1 PSPS Object Data Manager Design

slide 65

Distributed Partitioned Views

Tables participating in a Distributed Partitioned View (DPV) reside in different databases, which in turn reside on different instances or different (linked) servers

Page 66: PS1 PSPS Object Data Manager Design

slide 66

Concept: Slices

In the PS1 design, the bigger tables will be partitioned across servers

To avoid confusion with the File Group Partitioning, we call them “Slices”

Data is glued together using Distributed Partitioned Views

The ODM will manage slices. Using slices improves system scalability.

For PS1 design, tables are sliced into ranges of ObjectID, which correspond to broad declination ranges. Each slice is subdivided into partitions that correspond to narrower declination ranges.

ObjectID boundaries are selected so that each slice has a similar number of objects.

Page 67: PS1 PSPS Object Data Manager Design

slide 67

Detail Design Outline

General Concepts

Distributed Database Architecture

Ingest Workflow

Prototype

Page 68: PS1 PSPS Object Data Manager Design

slide 68

PS1 Distributed DB system

[Diagram] The Query Manager (QM) and Web Based Interface (WBI) connect to the PS1 head database (PartitionsMap, Objects, LnkToObj, Meta, plus a Detections partitioned view) and, over linked servers, to slice databases P1..Pm ([Objects_p], [LnkToObj_p], [Detections_p], Meta). The loading side consists of a LoadAdmin database and LoadSupport1..n databases holding objZoneIndx, orphans_l, Detections_l, LnkToObj_l, detections, and PartitionsMap, also connected through linked servers.

Legend: Database, Full table, [partitioned table], Output table, Partitioned View

Page 69: PS1 PSPS Object Data Manager Design

slide 69

Design Decisions: ObjID

Objects have their positional information encoded in their objID
• fGetPanObjID (ra, dec, zoneH)
• ZoneID is the most significant part of the ID

It gives scalability, performance, and spatial functionality

Object tables are range partitioned according to their object ID
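A hedged sketch of a function in this spirit. It reproduces only the property the design relies on - the zoneID occupying the most significant part of the identifier - not the exact digit layout of the real fGetPanObjID, which differs:

-- Illustrative only: pack zoneID into the high-order digits and an RA-derived
-- number into the low-order digits (assumed layout, not the production encoding)
CREATE FUNCTION dbo.fGetPanObjID_sketch (@ra float, @dec float, @zoneH float)
RETURNS bigint
AS
BEGIN
    DECLARE @zoneID bigint
    SET @zoneID = FLOOR((@dec + 90.0) / @zoneH)
    -- 10^13 leaves room below the zone prefix for an RA-based sequence number
    RETURN @zoneID * CAST(10000000000000 AS bigint)
         + CAST(ROUND(@ra * 10000000.0, 0) AS bigint)
END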

Page 70: PS1 PSPS Object Data Manager Design

slide 70

ObjectID Clusters Data Spatially

ObjectID = 087941012871550661

Dec = –16.71611583 ZH = 0.008333

ZID = (Dec+90) / ZH = 08794.0661

RA = 101.287155

ObjectID is unique when objects are separated by >0.0043 arcsec

Page 71: PS1 PSPS Object Data Manager Design

slide 71

Design Decisions: DetectID

Detections have their positional information encoded in the detection identifier
• fGetDetectID (dec, observationID, runningID, zoneH)
• Primary key (objID, detectionID), to align detections with objects within partitions
• Provides efficient access to all detections associated to one object
• Provides efficient access to all detections of nearby objects

Page 72: PS1 PSPS Object Data Manager Design

slide 72

DetectionID Clusters Data in Zones

DetectID = 0879410500001234567

Dec = –16.71611583 ZH = 0.008333

ZID = (Dec+90) / ZH = 08794.0661

ObservationID = 1050000

Running ID = 1234567

Page 73: PS1 PSPS Object Data Manager Design

slide 73

ODM Capacity

5.3.1.3 The PS1 ODM shall be able to ingest into the ODM a total of
• 1.5 × 10^11 P2 detections
• 8.3 × 10^10 cumulative-sky (stack) detections
• 5.5 × 10^9 celestial objects
together with their linkages.

Page 74: PS1 PSPS Object Data Manager Design

slide 74

PS1 Table Sizes - Monolithic

Table Year 1 Year 2 Year 3 Year 3.5

Objects 2.31 2.31 2.31 2.31

StackPsfFits 5.07 10.16 15.20 17.74

StackToObj 0.92 1.84 2.76 3.22

StackModelFits 1.15 2.29 3.44 4.01

P2PsfFits 7.87 15.74 23.61 27.54

P2ToObj 1.33 2.67 4.00 4.67

Other Tables 3.19 6.03 8.87 10.29

Indexes +20% 4.37 8.21 12.04 13.96

Total 26.21 49.24 72.23 83.74

Sizes are in TB

Page 75: PS1 PSPS Object Data Manager Design

slide 75

What goes into the main Server

[Diagram] The PS1 head database holds the full Objects, LnkToObj, Meta, and PartitionsMap tables; slices P1..Pm are reached over linked servers.

Legend: Database, Full table, [partitioned table], Output table, Distributed Partitioned View

Page 76: PS1 PSPS Object Data Manager Design

slide 76

What goes into slices

[Diagram] Each slice database P1..Pm holds its partitioned [Objects_p], [LnkToObj_p], and [Detections_p] tables together with PartitionsMap and Meta; the PS1 head database (PartitionsMap, Objects, LnkToObj, Meta) reaches the slices over linked servers.

Legend: Database, Full table, [partitioned table], Output table, Distributed Partitioned View

Page 77: PS1 PSPS Object Data Manager Design

slide 77

What goes into slices

[Diagram] Repeat of the previous slide.

Page 78: PS1 PSPS Object Data Manager Design

slide 78

Duplication of Objects & LnkToObj

Objects are distributed across slices

Objects, P2ToObj, and StackToObj are duplicated in the slices to parallelize “inserts” & “updates”

Detections belong in their object’s slice

Orphans belong to the slice where their position would allocate them
• Orphans near slices’ boundaries will need special treatment

Objects keep their original object identifier
• Even though positional refinement might change their zoneID and therefore the most significant part of their identifier

Page 79: PS1 PSPS Object Data Manager Design

slide 79

Glue = Distributed Views

[Diagram] The PS1 head database exposes Detections as a Distributed Partitioned View over the [Detections_p1]..[Detections_pm] tables in the slices P1..Pm; Objects, LnkToObj, PartitionsMap, and Meta live on the head, and slices are reached via linked servers.

Legend: Database, Full table, [partitioned table], Output table, Distributed Partitioned View

Page 80: PS1 PSPS Object Data Manager Design

slide 80

Partitioning in Main Server

Main server is partitioned (Objects) and collocated (LnkToObj) by objid

Slices are partitioned (Objects) and collocated (LnkToObj) by objid

[Diagram] Query Manager (QM) and Web Based Interface (WBI) on top of the PS1 database and slices P1..Pm, connected via linked servers

Page 81: PS1 PSPS Object Data Manager Design

slide 81

PS1 Table Sizes - Main Server

Table Year 1 Year 2 Year 3 Year 3.5

Objects 2.31 2.31 2.31 2.31

StackPsfFits

StackToObj 0.92 1.84 2.76 3.22

StackModelFits

P2PsfFits

P2ToObj 1.33 2.67 4.00 4.67

Other Tables 0.41 0.46 0.52 0.55

Indexes +20% 0.99 1.46 1.92 2.15

Total 5.96 8.74 11.51 12.90

Sizes are in TB

Page 82: PS1 PSPS Object Data Manager Design

slide 82

PS1 Table Sizes - Each Slice

m=4 m=8 m=10 m=12

Table Year 1 Year 2 Year 3 Year 3.5

Objects 0.58 0.29 0.23 0.19

StackPsfFits 1.27 1.27 1.52 1.48

StackToObj 0.23 0.23 0.28 0.27

StackModelFits 0.29 0.29 0.34 0.33

P2PsfFits 1.97 1.97 2.36 2.30

P2ToObj 0.33 0.33 0.40 0.39

Other Tables 0.75 0.81 1.00 1.01

Indexes +20% 1.08 1.04 1.23 1.19

Total 6.50 6.23 7.36 7.16

Sizes are in TB

Page 83: PS1 PSPS Object Data Manager Design

slide 83

PS1 Table Sizes - All Servers

Table Year 1 Year 2 Year 3 Year 3.5

Objects 4.63 4.63 4.61 4.59

StackPsfFits 5.08 10.16 15.20 17.76

StackToObj 1.84 3.68 5.56 6.46

StackModelFits 1.16 2.32 3.40 3.96

P2PsfFits 7.88 15.76 23.60 27.60

P2ToObj 2.65 5.31 8.00 9.35

Other Tables 3.41 6.94 10.52 12.67

Indexes +20% 5.33 9.76 14.18 16.48

Total 31.98 58.56 85.07 98.87

Sizes are in TB

Page 84: PS1 PSPS Object Data Manager Design

slide 84

Detail Design Outline

General Concepts

Distributed Database Architecture

Ingest Workflow

Prototype

Page 85: PS1 PSPS Object Data Manager Design

slide 85

PS1 Distributed DB system

[Diagram] Repeat of the PS1 distributed DB system diagram shown earlier: head database, slices P1..Pm, and the loading databases connected via linked servers, fronted by the Query Manager (QM) and Web Based Interface (WBI).

Page 86: PS1 PSPS Object Data Manager Design

slide 86

“Insert” & “Update”

SQL Insert and Update are expensive operations due to logging and re-indexing

In the PS1 design, Insert and Update have been re-factored into sequences of:
Merge + Constrain + Switch Partition

Frequency
• f1: daily
• f2: at least monthly
• f3: TBD (likely to be every 6 months)
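A hedged T-SQL sketch of the Merge + Constrain + Switch pattern for one partition; the table names, the ObjectID range, and the partition number are illustrative, and the target/source tables must share filegroup and index layout for the switches to succeed:

-- 1. Merge: build the new partition contents with a minimally logged SELECT INTO
SELECT d.*
INTO   Detections_new
FROM   (SELECT * FROM Detections_p     WHERE objID BETWEEN 100000000000000 AND 199999999999999
        UNION ALL
        SELECT * FROM Detections_batch WHERE objID BETWEEN 100000000000000 AND 199999999999999) AS d

-- 2. Constrain: add the primary key and the CHECK constraint matching the partition range
ALTER TABLE Detections_new ADD CONSTRAINT pk_new PRIMARY KEY (objID, detectID)
ALTER TABLE Detections_new ADD CONSTRAINT ck_new
      CHECK (objID BETWEEN 100000000000000 AND 199999999999999)

-- 3. Switch: swap the rebuilt table in for the old partition (metadata-only operations)
ALTER TABLE Detections_p   SWITCH PARTITION 2 TO Detections_old
ALTER TABLE Detections_new SWITCH TO Detections_p PARTITION 2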

Page 87: PS1 PSPS Object Data Manager Design

slide 87

Ingest Workflow

[Ingest workflow diagram] Inputs: CSV Detect, ObjectsZ; steps: DZone, X(1”) → DXO_1a, NoMatch → X(2”) → DXO_2a, Resolve; outputs: P2PsfFits, P2ToObj, Orphans

Page 88: PS1 PSPS Object Data Manager Design

slide 88

Ingest @ frequency = f1

[Diagram] LOADER: ObjectsZ, P2PsfFits, P2ToObj, Orphans → SLICE_1: Objects_1 (partitions 11, 12, 13), Stack*_1, P2ToPsfFits_1, P2ToObj_1, Orphans_1 → MAIN: Metadata+, Objects (partitions 1, 2, 3), P2ToObj, StackToObj

Page 89: PS1 PSPS Object Data Manager Design

slide 89

Updates @ frequency = f2

[Diagram] Objects from the LOADER update Objects_1 in SLICE_1 (partitions 11, 12, 13, Stack*_1, P2ToPsfFits_1, P2ToObj_1, Orphans_1) and the Objects table in MAIN (Metadata+, partitions 1, 2, 3, P2ToObj, StackToObj)

Page 90: PS1 PSPS Object Data Manager Design

slide 90

Updates @ frequency = f2

[Diagram] The updated Objects_1 in SLICE_1 is synchronized via the LOADER with the Objects table in MAIN (Metadata+, partitions 1, 2, 3, P2ToObj, StackToObj)

Page 91: PS1 PSPS Object Data Manager Design

slide 91

Snapshots @ frequency = f3

[Diagram] A Snapshot of the Objects table is taken on MAIN (Metadata+, Objects, partitions 1, 2, 3, P2ToObj, StackToObj)

Page 92: PS1 PSPS Object Data Manager Design

slide 92

Batch Update of a Partition

[Diagram] Partitions A1, A2, A3 are combined with SELECT INTO into a merged table; SELECT INTO … WHERE then rebuilds B1, B2, B3, each with its PK index, and the rebuilt tables are switched back in as partitions

Page 93: PS1 PSPS Object Data Manager Design

slide 93

Scaling-out

Apply Ping-Pong strategy to satisfy query performance during ingest:
2 x (1 main + m slices)

[Diagram] Two copies of the PS1 head database (PartitionsMap, Objects, LnkToObj, Meta, Detections) and of each slice P1..Pm ([Objects_p], [LnkToObj_p], [Detections_p], Meta), connected via linked servers and fronted by the Query Manager (QM)

Legend: Database, Duplicate, Full table, [partitioned table], Partitioned View, Duplicate P view

Page 94: PS1 PSPS Object Data Manager Design

slide 94

Scaling-out

More robustness, fault-tolerance, and reliability calls for:
3 x (1 main + m slices)

[Diagram] Three copies of the PS1 head database and of each slice P1..Pm, connected via linked servers and fronted by the Query Manager (QM)

Legend: Database, Duplicate, Full table, [partitioned table], Partitioned View, Duplicate P view

Page 95: PS1 PSPS Object Data Manager Design

slide 95

Adding New slices

SQL Server range partitioning capabilities make it easy:
• Recalculate partitioning limits
• Transfer data to new slices
• Remove data from slices
• Define and apply new partitioning schema
• Add new partitions to main server
• Apply new partitioning schema to main server

Page 96: PS1 PSPS Object Data Manager Design

slide 96

Adding New Slices

Page 97: PS1 PSPS Object Data Manager Design

slide 97

Detail Design Outline

General Concepts

Distributed Database Architecture

Ingest Workflow

Prototype

Page 98: PS1 PSPS Object Data Manager Design

slide 99

ODM Ingest Performance

5.3.1.6 The PS1 ODM shall be able to ingest the data from the IPP at two times the nominal daily arrival rate*

* The nominal daily data rate from the IPP is defined as the total data volume to be ingested annually by the ODM divided by 365.

Nominal daily data rate:
• 1.5 × 10^11 / 3.5 / 365 ≈ 1.2 × 10^8 P2 detections / day
• 8.3 × 10^10 / 3.5 / 365 ≈ 6.5 × 10^7 stack detections / day

Page 99: PS1 PSPS Object Data Manager Design

slide 100

Number of Objects

Objects: miniProto / myProto / Prototype / PS1
SDSS* Stars: 5.7 × 10^4 / 1.3 × 10^7 / 1.1 × 10^8 / –
SDSS* Galaxies: 9.1 × 10^4 / 1.1 × 10^7 / 1.7 × 10^8 / –
Galactic Plane: 1.5 × 10^6 / 3 × 10^6 / 1.0 × 10^9 / –
TOTAL: 1.6 × 10^6 / 2.6 × 10^7 / 1.3 × 10^9 / 5.5 × 10^9

* “SDSS” includes a mirror of 11.3 < < 30 objects to < 0

Total GB of CSV loaded data: 300 GB
CSV bulk insert load: 8 MB/s
Binary bulk insert: 18-20 MB/s
Creation started: October 15th 2007
Finished: October 29th 2007 (??)
Includes
• 10 epochs of P2PsfFits detections
• 1 epoch of Stack detections

Page 100: PS1 PSPS Object Data Manager Design

slide 102

Prototype in Context

Survey / Objects / Detections
SDSS DR6: 3.8 × 10^8 / –
2MASS: 4.7 × 10^8 / –
USNO-B: 1.0 × 10^9 / –
Prototype: 1.3 × 10^9 / 1.4 × 10^10
PS1 (end of survey): 5.5 × 10^9 / 2.3 × 10^11

Page 101: PS1 PSPS Object Data Manager Design

slide 103

Size of Prototype Database

Table Main Slice1 Slice2 Slice3 Loader Total

Objects 1.30 0.43 0.43 0.43 1.30 3.89

StackPsfFits 6.49 6.49

StackToObj 6.49 6.49

StackModelFits 0.87 0.87

P2PsfFits 4.02 3.90 3.35 0.37 11.64

P2ToObj 4.02 3.90 3.35 0.12 11.39

Total 15.15 8.47 8.23 7.13 1.79 40.77

Extra Tables 0.87 4.89 4.77 4.22 6.86 21.61

Grand Total 16.02 13.36 13.00 11.35 8.65 62.38

Table sizes are in billions of rows

Page 102: PS1 PSPS Object Data Manager Design

slide 104

Size of Prototype Database

Table Main Slice1 Slice2 Slice3 Loader Total

Objects 547.6 165.4 165.3 165.3 137.1 1180.6

StackPsfFits 841.5 841.6

StackToObj 300.9 300.9

StackModelFits 476.7 476.7

P2PsfFits 879.9 853.0 733.5 74.7 2541.1

P2ToObj 125.7 121.9 104.8 3.8 356.2

Total 2166.7 1171.0 1140.2 1003.6 215.6 5697.1

Extra Tables 207.9 987.1 960.2 840.7 957.3 3953.2

Allocated / Free 1878.0 1223.0 1300.0 1121.0 666.0 6188.0

Grand Total 4252.6 3381.1 3400.4 2965.3 1838.9 15838.3

9.6 TB of data in a distributed database

Table sizes are in GB

Page 103: PS1 PSPS Object Data Manager Design

slide 105

Well-Balanced Partitions

Server Partition Rows Fraction Dec Range

Main 1 432,590,598 33.34% 32.59

Slice 1 1 144,199,105 11.11% 14.29

Slice 1 2 144,229,343 11.11% 9.39

Slice 1 3 144,162,150 11.12% 8.91

Main 2 432,456,511 33.33% 23.44

Slice 2 1 144,261,098 11.12% 8.46

Slice 2 2 144,073,972 11.10% 7.21

Slice 2 3 144,121,441 11.11% 7.77

Main 3 432,496,648 33.33% 81.98

Slice 3 1 144,270,093 11.12% 11.15

Slice 3 2 144,090,071 11.10% 14.72

Slice 3 3 144,136,484 11.11% 56.10

Page 104: PS1 PSPS Object Data Manager Design

slide 106

Ingest and Association Times

Task / Measured Minutes

Create Detections Zone Table 39.62

X(0.2") 121M X 1.3B 65.25

Build #noMatches Table 1.50

X(1") 12k X 1.3B 0.65

Build #allMatches Table (121M) 6.58

Build Orphans Table 0.17

Create P2PsfFits Table 11.63

Create P2ToObj Table 14.00

Total of Measured Times 140.40

Page 105: PS1 PSPS Object Data Manager Design

slide 107

Ingest and Association Times

Task / Estimated Minutes

Compute DetectionID, HTMID 30

Remove NULLS 15

Index P2PsfFits on ObjID 15

Slices Pulling Data from Loader 5

Resolve 1 Detection - N Objects 10

Total of Estimated Times 75

Educated Guess / Wild Guess

Page 106: PS1 PSPS Object Data Manager Design

slide 108

Total Time to I/A daily Data

Task / Time (hours) / Time (hours)

Ingest 121M Detections (binary) 0.32

Ingest 121M Detections (CSV) 0.98

Total of Measured Times 2.34 2.34

Total of Estimated Times 1.25 1.25

Total Time to I/A Daily Data 3.91 4.57

Requirement: Less than 12 hours (more than 2800 detections / s)

Detection Processing Rate: 8600 to 7400 detections / s

Margin on Requirement: 3.1 to 2.6

Using multiple loaders would improve performance

Page 107: PS1 PSPS Object Data Manager Design

slide 109

Insert Time @ slices

Task / Estimated Minutes

Import P2PsfFits (binary out/in) 20.45

Import P2PsfFits (binary out/in) 2.68

Import Orphans 0.00

Merge P2PsfFits 58

Add constraint P2PsfFits 193

Merge P2ToObj 13

Add constraint P2ToObj 54

Total of Measured Times 362

6 h with 8 partitions/slice (~1.3 × 10^9 detections/partition)

Educated Guess

Page 108: PS1 PSPS Object Data Manager Design

slide 110

Detections Per Partition

Years / Total Detections / Slices / Partitions per Slice / Total Partitions / Detections per Partition
0.0 / 0.00 / 4 / 8 / 32 / 0.00
1.0 / 4.29 × 10^10 / 4 / 8 / 32 / 1.34 × 10^9
1.0 / 4.29 × 10^10 / 8 / 8 / 64 / 6.7 × 10^8
2.0 / 8.57 × 10^10 / 8 / 8 / 64 / 1.34 × 10^9
2.0 / 8.57 × 10^10 / 10 / 8 / 80 / 1.07 × 10^9
3.0 / 1.29 × 10^11 / 10 / 8 / 80 / 1.61 × 10^9
3.0 / 1.29 × 10^11 / 12 / 8 / 96 / 1.34 × 10^9
3.5 / 1.50 × 10^11 / 12 / 8 / 96 / 1.56 × 10^9

Page 109: PS1 PSPS Object Data Manager Design

slide 111

Total Time for Insert @ slice

Task / Time (hours)

Total of Measured Times 0.25

Total of Estimated Times 5.3

Total Time for daily insert 6

Daily insert may operate in parallel with daily ingest and association.

Requirement: Less than 12 hours

Margin on Requirement: 2.0

Using more slices will improve insert performance.

Page 110: PS1 PSPS Object Data Manager Design

slide 112

Summary

Ingest + Association < 4 h using 1 loader (@f1 = daily)
• Scales with the number of servers
• Current margin on requirement 3.1
• Room for improvement

Detection Insert @ slices (@f1 = daily)
• 6 h with 8 partitions/slice
• It may happen in parallel with loading

Detections Lnks Insert @ main (@f2 < monthly)
• Unknown
• 6 h available

Objects insert & update @ slices (@f2 < monthly)
• Unknown
• 6 hours available

Objects update @ main server (@f2 < monthly)
• Unknown
• 12 h available. Transfer can be pipelined as soon as objects have been processed

Page 111: PS1 PSPS Object Data Manager Design

slide 113

Risks

Estimates of Insert & Update at slices could be underestimated
• Need more empirical evaluation of exercising parallel I/O

Estimates and layout of disk storage could be underestimated
• Merges and indexes require 2x the data size

Page 112: PS1 PSPS Object Data Manager Design

slide 114

Hardware/Scalability (Jan)

Page 113: PS1 PSPS Object Data Manager Design

slide 115

PS1 Prototype Systems Design – Jan Vandenberg, JHU

Early PS1 Prototype

Page 114: PS1 PSPS Object Data Manager Design

slide 116

Engineering Systems to Support the Database Design

Sequential read performance is our life-blood. Virtually all science queries will be I/O-bound.

~70 TB raw data: 5.9 hours for full scan on IBM’s fastest 3.3 GB/s champagne-budget SAN
• Need a 20 GB/s I/O engine just to scan the full data in less than an hour. Can’t touch this on a monolith.

Data mining a challenge even with good index coverage
• ~14 TB worth of indexes: 4-odd times bigger than SDSS DR6.

Hopeless if we rely on any bulk network transfers: must do work where the data is

Loading/Ingest more cpu-bound, though we still need solid write performance

Page 115: PS1 PSPS Object Data Manager Design

slide 117

Choosing I/O Systems

So killer sequential I/O performance is a key systems design goal. Which gear to use?
• FC/SAN?
• Vanilla SATA?
• SAS?

Page 116: PS1 PSPS Object Data Manager Design

slide 118

Fibre Channel, SAN

Expensive but not-so-fast physical links (4 Gbit, 10 Gbit)

Expensive switch

Potentially very flexible

Industrial-strength manageability

Little control over RAID controller bottlenecks

Page 117: PS1 PSPS Object Data Manager Design

slide 119

Straight SATA

Fast

Pretty cheap

Not so industrial-strength

Page 118: PS1 PSPS Object Data Manager Design

slide 120

SAS

Fast: 12 Gbit/s FD building blocks

Nice and mature, stable

SCSI’s not just for swanky drives anymore: takes SATA drives!

So we have a way to use SATA without all the “beige”.

Pricey? $4400 for full 15x750GB system ($296/drive == close to Newegg media cost)

Page 119: PS1 PSPS Object Data Manager Design

slide 121

SAS Performance, Gory Details

SAS v. SATA differences

[Chart: Native SAS v. SATA Performance – throughput (MB/s, 0 to 500) vs. number of disks (1 to 7); the curves differ by about 20%]

Page 120: PS1 PSPS Object Data Manager Design

slide 122

Per-Controller Performance

One controller can’t quite accommodate the throughput of an entire storage enclosure.

[Chart: Controller Limits – throughput (MB/s, 0 to 1400) vs. number of disks (1 to 13), showing the 6 Gbit limit, one-controller, and ideal curves]

Page 121: PS1 PSPS Object Data Manager Design

slide 123

Resulting PS1 Prototype I/O Topology

1100 MB/s single-threaded sequential reads per server

[Chart: Aggregate Design I/O Performance – throughput (MB/s, 0 to 1600) vs. number of disks (1 to 18), showing the 6 Gbit limit, dual-controller, one-controller, and ideal curves]

Page 122: PS1 PSPS Object Data Manager Design

slide 124

RAID-5 v. RAID-10?

Primer, anyone?

RAID-5 perhaps feasible with contemporary controllers…
• …but not a ton of redundancy

But after we add enough disks to meet performance goals, we have enough storage to run RAID-10 anyway!

Page 123: PS1 PSPS Object Data Manager Design

slide 125

RAID-10 Performance

0.5*RAID-0 for single-threaded reads

RAID-0 performance for 2-user/2-thread workloads

0.5*RAID-0 writes

Page 124: PS1 PSPS Object Data Manager Design

slide 126

PS1 Prototype Servers

[Diagram] Prototype DB: head H1 and slices S1, S2, S3. Prototype Loader: Linux staging plus loaders L1, L2.

Page 125: PS1 PSPS Object Data Manager Design

slide 127

PS1 Prototype Servers

PS1 Prototype

Page 126: PS1 PSPS Object Data Manager Design

slide 128

PS1 Prototype Servers

Page 127: PS1 PSPS Object Data Manager Design

slide 129

Projected PS1 Systems Design

[Diagram] Three replica sets R1, R2, R3 and a geoplex set G1, each with head nodes H1, H2 and slices S1..S8, plus a loader tier (Linux staging and loaders L1..LN).

Page 128: PS1 PSPS Object Data Manager Design

slide 130

Backup/Recovery/Replication Strategies

No formal backup
• …except maybe for mydb’s, f(cost*policy)

3-way replication
• Replication != backup
  – Little or no history (though we might have some point-in-time capabilities via metadata)
  – Replicas can be a bit too cozy: must notice badness before replication propagates it
• Replicas provide redundancy and load balancing…
• Fully online: zero time to recover
• Replicas needed for happy production performance plus ingest, anyway

Off-site geoplex
• Provides continuity if we lose HI (local or trans-Pacific network outage, facilities outage)
• Could help balance trans-Pacific bandwidth needs (service continental traffic locally)

Page 129: PS1 PSPS Object Data Manager Design

slide 131

Why No Traditional Backups?

Money no object… do traditional backups too!!!

Synergy, economy of scale with other collaboration needs (IPP?)… do traditional backups too!!!

Not super pricey…

…but not very useful relative to a replica for our purposes
• Time to recover

Page 130: PS1 PSPS Object Data Manager Design

slide 132

Failure Scenarios (Easy Ones)

Zero downtime, little effort:
• Disks (common)
  – Simple* hotswap
  – Automatic rebuild from hotspare or replacement drive
• Power supplies (not uncommon)
  – Simple* hotswap
• Fans (pretty common)
  – Simple* hotswap

* Assuming sufficiently non-beige gear

Page 131: PS1 PSPS Object Data Manager Design

slide 133

Failure Scenarios (Mostly Harmless Ones)

Some downtime and replica cutover:
• System board (rare)
• Memory (rare and usually proactively detected and handled via scheduled maintenance)
• Disk controller (rare, potentially minimal downtime via cold-spare controller)
• CPU (not utterly uncommon, can be tough and time consuming to diagnose correctly)

Page 132: PS1 PSPS Object Data Manager Design

slide 134

Failure Scenarios (Slightly Spooky Ones)

Database mangling by human or pipeline error
• Gotta catch this before replication propagates it everywhere
• Need lots of sanity checks before replicating
• (and so off-the-shelf near-realtime replication tools don’t help us)
• Need to run replication backwards from older, healthy replicas. Probably less automated than healthy replication.

Catastrophic loss of datacenter
• Okay, we have the geoplex
  – …but we’re dangling by a single copy ‘till recovery is complete
  – …and this may be a while.
  – …but are we still in trouble? Depending on colo scenarios, did we also lose the IPP and flatfile archive?

Page 133: PS1 PSPS Object Data Manager Design

slide 135

Failure Scenarios (Nasty Ones)

Unrecoverable badness fully replicated before detection

Catastrophic loss of datacenter without geoplex

Can we ever catch back up with the data rate if we need to start over and rebuild with an ingest campaign? Don’t bet on it!

Page 134: PS1 PSPS Object Data Manager Design

slide 136

Operating Systems, DBMS?

Sql2005 EE x64
• Why?
• Why not DB2, Oracle RAC, PostgreSQL, MySQL, <insert your favorite>?

(Win2003 EE x64)

Why EE? Because it’s there. <indexed DPVs?>

Scientific Linux 4.x/5.x, or local favorite

Platform rant from JVV available over beers

Page 135: PS1 PSPS Object Data Manager Design

slide 137

Systems/Database Management

Active Directory infrastructure

Windows patching tools, practices

Linux patching tools, practices

Monitoring

Staffing requirements

Page 136: PS1 PSPS Object Data Manager Design

slide 138

Facilities/Infrastructure Projections for PS1

Power/cooling
• Prototype is 9.2 kW (2.6 tons AC)
• PS1: something like 43 kW, 12.1 tons

Rack space
• Prototype is 69 RU, <2 42U racks (includes 14U of rackmount UPS at JHU)
• PS1: about 310 RU (9-ish racks)

Networking: ~40 Gbit Ethernet ports

…plus sundry infrastructure, ideally already in place (domain controllers, monitoring systems, etc.)

Page 137: PS1 PSPS Object Data Manager Design

slide 139

Operational Handoff to UofH

Gulp.

Page 138: PS1 PSPS Object Data Manager Design

slide 140

How Design Meets Requirements

Cross-matching detections with objects
• Zone cross-match part of loading pipeline
• Already exceeded requirement with prototype

Query performance
• Ping-pong configuration for query during ingest
• Spatial indexing and distributed queries
• Query manager can be scaled out as necessary

Scalability
• Shared-nothing architecture
• Scale out as needed
• Beyond PS1 we will need truly parallel query plans

Page 139: PS1 PSPS Object Data Manager Design

slide 141

WBS/Development Tasks

Refine Prototype/Schema

Staging/Transformation

Initial Load

Load/Resolve Detections

Resolve/Synchronize Objects

Create Snapshot

Replication Module

Query Processing

• Workflow Systems
• Logging
• Data Scrubbing
• SSIS (?) + C#
• QM/Logging

Hardware

Documentation

2 PM

3 PM

1 PM

3 PM

3 PM

1 PM

2 PM

2 PM

2 PM

2 PM

4 PM

4 PM

4 PM

2 PM

Total Effort: 35 PM
Delivery: 9/2008

Testing

Redistribute Data

Page 140: PS1 PSPS Object Data Manager Design

slide 142

Personnel Available

2 new hires (SW Engineers) 100%

Maria 80%

Ani 20%

Jan 10%

Alainna 15%

Nolan Li 25%

Sam Carliles 25%

George Fekete 5%

Laszlo Dobos 50% (for 6 months)

Page 141: PS1 PSPS Object Data Manager Design

slide 143

Issues/Risks

Versioning
• Do we need to preserve snapshots of monthly versions?
• How will users reproduce queries on subsequent versions?
• Is it ok that a new version of the sky replaces the previous one every month?

Backup/recovery
• Will we need 3 local copies rather than 2 for safety?
• Is restoring from offsite copy feasible?

Handoff to IfA beyond scope of WBS shown
• This will involve several PMs

Page 142: PS1 PSPS Object Data Manager Design

Mahalo!

Page 143: PS1 PSPS Object Data Manager Design

slide 145

[Query Manager screenshot] Annotations:
• Context that query is executed in
• MyDB table that query results go into
• Name that this query job is given
• Check query syntax
• Get graphical query plan
• Run query in quick (1 minute) mode
• Submit query to long (8-hour) queue
• Query buffer
• Load one of the sample queries into query buffer

Query Manager

Page 144: PS1 PSPS Object Data Manager Design

slide 146

Stored procedure arguments

SQL code for stored procedure

Query Manager

Page 145: PS1 PSPS Object Data Manager Design

slide 147

MyDB context is the default, but other contexts can be selected

The space used and total space available

Multiple tables can be selected and dropped at once

Table list can be sorted by name, size, type.

User can browse DB Views, Tables, Functions and Procedures

Query Manager

Page 146: PS1 PSPS Object Data Manager Design

slide 148

The query that created this table

Query Manager

Page 147: PS1 PSPS Object Data Manager Design

slide 149

Search radius

Table to hold results

Context to run search on

Query Manager