IDIES Inaugural Symposium, Baltimore
On-demand associations using database server clusters
László Dobos, Tamás Budavári, Alex Szalay, István Csabai
Eötvös University / JHU
Aug. 25-26, 2008.

Cross-match problem in astronomy
- Astronomical catalogs in the TB range, O(100M) detections per catalog
- Geographically distributed:
  - reliable, lightweight transfer protocol needed
  - should benefit from co-located datasets
- Goals:
  - find the same object in every catalog
  - find drop-outs (requires complete description of footprints)
  - on-demand: do it quickly (< 5 min)
- Matching primarily based on celestial coordinates
  - astrometric error
  - error can vary from object to object
- Additional match criteria: size, color, etc.

Cross-match problem in astronomy
- The math:
  - Bayesian model selection [Budavári & Szalay 2008, "Probabilistic Cross-Identification of Astronomical Sources"]
  - First step: cut on distance
  - Including additional match criteria is easy and natural
  - Tested on simulations [Heinis et al. 2009]
- The problems:
  - one-to-one matching of objects is expensive
  - trigonometric computations
  - IO intensive if dataset is big: always have to keep the right subset of data in memory
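
The distance cut followed by a Bayesian test can be illustrated with a short sketch. It assumes the two-catalog, small-angle form of the Bayes factor given in the cited Budavári & Szalay paper, B = 2 / (s1^2 + s2^2) * exp(-psi^2 / (2 (s1^2 + s2^2))), with the astrometric errors s1, s2 and the separation psi in radians. The function names, the cut radius and the sample coordinates are illustrative choices and not part of the talk.

```python
import math

ARCSEC = math.radians(1.0 / 3600.0)  # one arcsecond expressed in radians


def angular_separation(ra1, dec1, ra2, dec2):
    """Angular separation in radians of two points given in degrees (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return 2.0 * math.asin(min(1.0, math.sqrt(a)))


def bayes_factor(psi, sigma1, sigma2):
    """Assumed two-catalog Bayes factor; all arguments in radians."""
    s2 = sigma1 ** 2 + sigma2 ** 2
    return 2.0 / s2 * math.exp(-psi ** 2 / (2.0 * s2))


def is_match(obj1, obj2, sigma1_arcsec, sigma2_arcsec, max_sep_arcsec, bf_limit=1e3):
    """obj1, obj2 are (ra, dec) in degrees. Cheap distance cut first, Bayes factor second."""
    psi = angular_separation(obj1[0], obj1[1], obj2[0], obj2[1])
    if psi > max_sep_arcsec * ARCSEC:
        return False  # rejected by the distance cut, no Bayes factor needed
    bf = bayes_factor(psi, sigma1_arcsec * ARCSEC, sigma2_arcsec * ARCSEC)
    return bf > bf_limit
```

The trigonometric functions in the separation are the per-pair cost mentioned above, which is why the distance cut is applied before anything else.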

Hardware and data layout
- JHU Graywulf cluster:
  - Dell PowerEdge 2950 + Dell PowerVault MD1000, 2 × PERC 5/E RAID controller
  - 1.2-1.4 GB/sec nominal IO bandwidth, InfiniBand
  - 2 × 4-core Xeon, 8-32 GB RAM
  - 5-20 machines partially assigned to cross-match engine
- Catalogs are mirrored on every node
- User catalogs uploaded to / located at a dedicated node
- Remote data sources (via various protocols)
- Queries are partitioned and executed in parallel on every machine

Xmatch definition language
A cross-match query:

    SELECT s.objId as SobjID, s.ra, s.dec, g.ra, g.dec, j_m
    FROM SDSS:PhotoObjAll AS s
    CROSS JOIN GALEX:PhotoObjAll AS g
    XMATCH BAYESIAN AS x
        MUST s ON Point(s.ra, s.dec), 0.1
        MUST g ON Point(g.ra, g.dec), 0.5
        HAVING x.BF > 1e3
    WHERE s.type = 3 AND s.ra BETWEEN 200 AND 210 AND s.dec BETWEEN -2 AND 2
        AND g.ra BETWEEN 200 AND 210 AND g.dec BETWEEN -2 AND 2

A partitioned query:

    SELECT s.ObjID
    FROM SDSS:PhotoObjAll s PARTITION ON Ra
    WHERE Ra BETWEEN 200 AND 210 AND Dec BETWEEN -5 AND 5

Query Execution 1
- Parse:
  - proprietary SQL parser written from scratch
  - covers ~80% of SQL Server's SELECT statement grammar
  - extensions can be added easily by changing the BNF grammar
- Job assignment: (to be implemented)
  - determine sets of collocated catalogs using a central registry
  - send part of cross-match job to remote service
  - return only cross-matched result, not full raw datasets
  - merge result sets at any node
- Partition:
  - cross-match queries: on right ascension
  - simple queries: on specified column
  - partitions determined based on histogram:
    - histogram query executed on a subsample to get metrics
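
The histogram-based partitioning step can be sketched as follows. This is an illustration under assumptions, not the system's code: it takes right-ascension values from a hypothetical subsample, builds a fixed-width histogram, and picks boundaries so that each partition holds roughly the same share of the sample. The function names, the bin count and the half-open interval convention are choices made here.

```python
def partition_bounds(sample_ra, ra_min, ra_max, n_partitions, n_bins=100):
    """Return increasing boundaries from ra_min to ra_max (at most n_partitions intervals)."""
    width = (ra_max - ra_min) / n_bins
    counts = [0] * n_bins
    for ra in sample_ra:
        if ra_min <= ra <= ra_max:
            counts[min(int((ra - ra_min) / width), n_bins - 1)] += 1
    total = sum(counts)
    if total == 0:
        # Nothing sampled in range: fall back to equal-width partitions.
        step = (ra_max - ra_min) / n_partitions
        return [ra_min + k * step for k in range(n_partitions)] + [ra_max]
    bounds = [ra_min]
    cumulative = 0
    # The last bin is skipped because its upper edge is ra_max, appended below.
    for i, count in enumerate(counts[:-1]):
        cumulative += count
        # Close the current partition once its share of the sample is reached.
        if len(bounds) < n_partitions and cumulative >= total * len(bounds) / n_partitions:
            bounds.append(ra_min + (i + 1) * width)
    bounds.append(ra_max)
    return bounds


def partition_predicates(bounds, column="ra"):
    """Turn boundaries into WHERE fragments; only the last interval is closed on the right."""
    last = len(bounds) - 2
    predicates = []
    for i in range(len(bounds) - 1):
        upper_op = "<=" if i == last else "<"
        predicates.append(f"{column} >= {bounds[i]} AND {column} {upper_op} {bounds[i + 1]}")
    return predicates
```

Because the intervals do not overlap, each row of a simple query lands in exactly one partition. A heavily skewed sample may yield fewer partitions than requested.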

Query Execution 2
- Cache:
  - cache remote datasets
  - copy myDB tables to worker nodes
  - can benefit from filters defined in query
- Execute:
  - construct T-SQL queries
  - execute T-SQL queries on nodes in parallel
  - automatically retry on failure
- Merge:
  - merge result sets
  - benefit from clever partitioning: no duplicates
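
The execute, retry and merge steps can be sketched in a few lines. The system described here coordinates them with Windows Workflow Foundation; the Python below only illustrates the control flow with a thread pool. `run_query` is a hypothetical callable standing in for whatever sends a T-SQL query to a worker node and returns its rows.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(run_query, node, query, max_attempts=3):
    """Call run_query(node, query), retrying on any exception up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query(node, query)
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt


def execute_partitions(run_query, jobs, max_attempts=3):
    """Run (node, query) jobs in parallel and concatenate their row lists in job order."""
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        futures = [
            pool.submit(run_with_retry, run_query, node, query, max_attempts)
            for node, query in jobs
        ]
        merged = []
        for future in futures:
            merged.extend(future.result())
    return merged
```

The merge is a plain concatenation, which is only valid when the partitions do not overlap, so no duplicate elimination is needed.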

Applied technologies
- Relational Database Management System:
  - SQL Server 2008
  - CLR integration with parallel execution support
- Windows Workflow Foundation:
  - coordinates the complex execution workflow
  - transactions help keep the system consistent
  - parallel execution support
- SMO:
  - SQL Server Management Objects
  - easy access to the database schema

Zone algorithm
- Pure T-SQL: can take advantage of the SQL Server query optimizer
- Divide sphere into zones
- ZoneID: very simple hash on declination
- Indexes built on ZoneID and right ascension allow very quick pre-filtering of match candidates
- Parallelizes very well on multi-core machines
- [Gray, Szalay & Nieto-Santisteban 2006, "The Zones Algorithm for Finding Points-Near-a-Point or Cross-Matching Spatial Datasets"]
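
The idea behind the zone index can be shown outside the database with a small sketch. The system itself uses pure T-SQL; this Python version only mimics the index on (ZoneID, right ascension) with a dictionary of RA-sorted lists. It assumes the zone hash floor((dec + 90) / zone height) described in the cited paper, uses an approximate RA window widened by 1/cos(dec), and ignores wrap-around at RA 0/360 and the poles. All function names are illustrative.

```python
import bisect
import math
from collections import defaultdict


def zone_id(dec, zone_height):
    """Assumed zone hash: index of the declination stripe containing dec (degrees)."""
    return int(math.floor((dec + 90.0) / zone_height))


def build_zone_index(objects, zone_height):
    """objects: iterable of (obj_id, ra, dec) in degrees -> {zone: sorted [(ra, dec, obj_id)]}."""
    index = defaultdict(list)
    for obj_id, ra, dec in objects:
        index[zone_id(dec, zone_height)].append((ra, dec, obj_id))
    for rows in index.values():
        rows.sort()  # sorted by RA, like an index on (ZoneID, ra)
    return index


def separation_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def neighbors(index, zone_height, ra, dec, radius):
    """Ids of indexed objects within radius (degrees) of (ra, dec)."""
    # Approximate RA half-width of the search box, widened away from the equator.
    max_dec = min(abs(dec) + radius, 89.0)
    ra_window = radius / math.cos(math.radians(max_dec))
    found = []
    first = zone_id(dec - radius, zone_height)
    last = zone_id(dec + radius, zone_height)
    for zone in range(first, last + 1):
        rows = index.get(zone, [])
        # Pre-filter: only rows whose RA falls inside the window are examined.
        lo = bisect.bisect_left(rows, (ra - ra_window,))
        hi = bisect.bisect_right(rows, (ra + ra_window, float("inf")))
        for cand_ra, cand_dec, obj_id in rows[lo:hi]:
            if separation_deg(ra, dec, cand_ra, cand_dec) <= radius:
                found.append(obj_id)
    return found
```

Only the candidates that survive the zone and RA pre-filter reach the trigonometric distance test, which is where the speed-up comes from.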

Summary and future work
- On-demand cross-matching is feasible
- Parser and partitioning logic built for handling cross-match job descriptions
- Workflow built for executing partitioned jobs
- New technologies allow rapid development of complex workflows and high performance data warehouses
- Future work:
  - Develop GUI
  - Install and publish system
  - Add support for remote datasets
  - Add support to benefit from collocated datasets