On-demand associations using database server clusters
László Dobos, Tamás Budavári, Alex Szalay, István Csabai
Eötvös University / JHU
IDIES Inaugural Symposium, Baltimore, Aug. 25-26, 2008



Cross-match problem in astronomy

- Astronomical catalogs in the TB range, O(100M) detections per catalog
- Geographically distributed:
  - reliable, lightweight transfer protocol needed
  - should benefit from co-located datasets
- Goals:
  - find the same object in every catalog
  - find drop-outs (requires complete description of footprints)
  - on-demand: do it quickly (< 5 min)
- Matching primarily based on celestial coordinates
  - astrometric error
  - error can vary from object to object
- Additional match criteria: size, color, etc.
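As an illustration of coordinate-based matching with per-object astrometric error, the following is a minimal sketch, not the engine's actual code. The function names, the `(ra, dec, sigma)` tuple layout, and the `n_sigma` cut are assumptions made for the example.

```python
import math

ARCSEC_PER_DEG = 3600.0


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine form) of two positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h)))) * ARCSEC_PER_DEG


def is_candidate(det1, det2, n_sigma=3.0):
    """det = (ra_deg, dec_deg, sigma_arcsec); n_sigma is a hypothetical cut.

    Each detection carries its own error, so the threshold varies per pair.
    """
    sep = angular_separation_arcsec(det1[0], det1[1], det2[0], det2[1])
    combined_sigma = math.hypot(det1[2], det2[2])
    return sep <= n_sigma * combined_sigma
```

Every pair that passes such a cut still costs several trigonometric evaluations, which is why a cheap pre-filter matters at catalog scale.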


Cross-match problem in astronomy

- The math:
  - Bayesian model selection [Budavári & Szalay 2008, "Probabilistic Cross-Identification of Astronomical Sources"]
  - First step: cut on distance
  - Including additional match criteria is easy and natural
  - Tested on simulations [Heinis et al. 2009]
- The problems:
  - one-to-one matching of objects is expensive
  - trigonometric computations
  - IO intensive if dataset is big: always have to keep the right subset of data in memory
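For the two-catalog case, the Bayes factor of the cited Budavári & Szalay paper reduces, in the limit of small Gaussian astrometric errors, to B = 2 / (σ1² + σ2²) · exp(−ψ² / (2 (σ1² + σ2²))), with the errors σ1, σ2 and the separation ψ in radians. The sketch below evaluates that expression; the function name and the choice of arcsecond inputs are assumptions made for the example.

```python
import math

ARCSEC_TO_RAD = math.pi / (180.0 * 3600.0)


def bayes_factor_two_catalogs(psi_arcsec, sigma1_arcsec, sigma2_arcsec):
    """Two-catalog Bayes factor in the small-error Gaussian limit.

    psi_arcsec: angular separation of the two detections.
    sigma*_arcsec: astrometric errors of the two detections.
    """
    psi = psi_arcsec * ARCSEC_TO_RAD
    variance = (sigma1_arcsec * ARCSEC_TO_RAD) ** 2 + (sigma2_arcsec * ARCSEC_TO_RAD) ** 2
    return (2.0 / variance) * math.exp(-psi ** 2 / (2.0 * variance))
```

A large B favors the hypothesis that both detections are the same source, so a match can be accepted by thresholding B. The value falls off steeply with separation, which is what makes a preliminary distance cut safe.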


Hardware and data layout

- JHU Graywulf cluster:
  - Dell PowerEdge 2950 + Dell PowerVault MD1000, 2 × PERC 5/E RAID controller
  - 1.2-1.4 GB/sec nominal IO bandwidth, InfiniBand
  - 2 × 4-core Intel Xeon, 8-32 GB RAM
  - 5-20 machines partially assigned to cross-match engine
- Catalogs are mirrored on every node
- User catalogs uploaded to / located at a dedicated node
- Remote data sources (via various protocols)
- Queries are partitioned and executed in parallel on every machine


Xmatch definition language

A cross-match query:

    SELECT s.objId as SobjID, s.ra, s.dec, g.ra, g.dec, j_m
    FROM SDSS:PhotoObjAll AS s
    CROSS JOIN GALEX:PhotoObjAll AS g
    XMATCH BAYESIAN AS x
        MUST s ON Point(s.ra, s.dec), 0.1
        MUST g ON Point(g.ra, g.dec), 0.5
        HAVING x.BF > 1e3
    WHERE s.type = 3
        AND s.ra BETWEEN 200 AND 210 AND s.dec BETWEEN -2 AND 2
        AND g.ra BETWEEN 200 AND 210 AND g.dec BETWEEN -2 AND 2

A partitioned query:

    SELECT s.ObjID
    FROM SDSS:PhotoObjAll s PARTITION ON Ra
    WHERE Ra BETWEEN 200 AND 210 AND Dec BETWEEN -5 AND 5


Query Execution 1

- Parse:
  - proprietary SQL parser written from scratch
  - covers ~80% of SQL Server's SELECT statement grammar
  - extensions can be added easily by changing BNF grammar
- Job assignment (to be implemented):
  - determine sets of collocated catalogs using a central registry
  - send part of cross-match job to remote service
  - return only cross-matched result, not full raw datasets
  - merge resultsets at any node
- Partition:
  - cross-match queries: on right ascension
  - simple queries: on specified column
  - partitions determined based on histogram: histogram query executed on a subsample to get metrics
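One way to turn a subsample into partitions is to cut at equal-count quantiles, so dense regions of the sky get narrower ranges. This is only a sketch of that idea and not necessarily how the system picks its boundaries. `partition_bounds` and `partition_predicates` are hypothetical helpers, and the generated predicates are plain SQL fragments.

```python
def partition_bounds(sample_values, n_partitions, lo, hi):
    """Equal-count boundaries from a subsample; lo/hi are the full query range."""
    if n_partitions < 1 or not sample_values:
        raise ValueError("need at least one partition and a non-empty sample")
    ordered = sorted(sample_values)
    n = len(ordered)
    # Inner cut points are the sample quantiles at i / n_partitions.
    inner = [ordered[(i * n) // n_partitions] for i in range(1, n_partitions)]
    return [lo] + inner + [hi]


def partition_predicates(column, bounds):
    """Half-open ranges (last one closed) so no row lands in two partitions."""
    predicates = []
    last = len(bounds) - 2
    for i in range(len(bounds) - 1):
        upper_op = "<=" if i == last else "<"
        predicates.append(
            f"{column} >= {bounds[i]} AND {column} {upper_op} {bounds[i + 1]}"
        )
    return predicates
```

For example, `partition_predicates("Ra", partition_bounds(sample, 4, 200, 210))` yields four WHERE fragments, one per worker.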


Query Execution 2

- Cache:
  - cache remote datasets
  - copy myDB tables to worker nodes
  - can benefit from filters defined in query
- Execute:
  - construct T-SQL queries
  - execute T-SQL queries on nodes in parallel
  - automatically retry on failure
- Merge:
  - merge result sets
  - benefit from clever partitioning: no duplicates
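The execute, retry, and merge steps can be sketched generically as below. The real system coordinates this with Windows Workflow Foundation and SQL Server; here each partition is a plain Python callable standing in for "run this T-SQL on that node", and `run_with_retry` and `execute_partitions` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(task, max_attempts=3):
    """Call task() until it succeeds or max_attempts is exhausted."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task()
        except Exception as exc:  # a sketch: real code would catch narrower errors
            last_error = exc
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def execute_partitions(tasks, max_workers=4, max_attempts=3):
    """Run one task per partition in parallel and concatenate the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_with_retry, task, max_attempts) for task in tasks]
        merged = []
        # Results are collected in partition order. Because partitions are
        # disjoint, plain concatenation needs no de-duplication step.
        for future in futures:
            merged.extend(future.result())
    return merged
```

The merge is a simple concatenation only because the partition ranges do not overlap; overlapping ranges would require an explicit duplicate-removal pass.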


Applied technologies

- Relational Database Management System:
  - SQL Server 2008
  - CLR integration with parallel execution support
- Windows Workflow Foundation:
  - coordinates the complex execution workflow
  - transactions help keep the system consistent
  - parallel execution support
- SMO:
  - SQL Server Management Objects
  - easy access to the database schema


Zone algorithm

- Zone algorithm:
  - Pure T-SQL: can leverage the query optimizer of SQL Server
  - Divide sphere into zones
  - ZoneID: very simple hash on declination
  - Indexes built on ZoneID and right ascension help very quick pre-filtering of match candidates
  - very well parallelized on multi-core machines
- [Gray, Szalay & Nieto-Santisteban 2006, The Zones Algorithm for Finding Points-Near-a-Point or Cross-Matching Spatial Datasets]
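The zone idea can be illustrated outside the database with a small in-memory sketch. The ZoneID formula floor((dec + 90) / zoneHeight) follows the cited Zones paper; everything else is an assumption made for the example: the dictionary-of-sorted-lists index stands in for the SQL index on (ZoneID, ra), the right-ascension widening is approximate, and wrap-around at 0/360 degrees is ignored.

```python
import math
from bisect import bisect_left, bisect_right
from collections import defaultdict


def zone_id(dec_deg, zone_height_deg):
    """Simple hash on declination: equal-height stripes counted from dec = -90."""
    return int(math.floor((dec_deg + 90.0) / zone_height_deg))


def build_zone_index(objects, zone_height_deg):
    """objects: iterable of (obj_id, ra_deg, dec_deg) -> {zone: rows sorted by ra}."""
    zones = defaultdict(list)
    for obj_id, ra, dec in objects:
        zones[zone_id(dec, zone_height_deg)].append((ra, dec, obj_id))
    for rows in zones.values():
        rows.sort()
    return zones


def candidates(index, ra, dec, radius_deg, zone_height_deg):
    """Pre-filter only: ids in overlapping zones whose ra lies in a widened window."""
    cos_dec = math.cos(math.radians(min(abs(dec) + radius_deg, 90.0)))
    # Approximate widening of the ra window away from the equator.
    ra_window = 360.0 if cos_dec < 1e-9 else radius_deg / cos_dec
    first = zone_id(dec - radius_deg, zone_height_deg)
    last = zone_id(dec + radius_deg, zone_height_deg)
    found = []
    for zone in range(first, last + 1):
        rows = index.get(zone, [])
        lo = bisect_left(rows, (ra - ra_window,))
        hi = bisect_right(rows, (ra + ra_window, float("inf")))
        found.extend(obj_id for _, _, obj_id in rows[lo:hi])
    return found
```

The survivors of this pre-filter still need an exact distance test; the zones only keep that expensive step away from the vast majority of pairs.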


Summary and future work

- On-demand cross-matching is feasible
- Parser and partitioning logic built for handling cross-match job descriptions
- Workflow built for executing partitioned jobs
- New technologies allow rapid development of complex workflows and high performance data warehouses
- Future work:
  - Develop GUI
  - Install and publish system
  - Add support for remote datasets
  - Add support to benefit from collocated datasets
