Starr & Bloom: TCP Using Hadoop on Yahoo!'s M45 Cluster (20100112)


Berkeley Astronomy’s Transients Classification Pipeline

• Project Overview

• The TCP is a time-series classification project which identifies flux-varying stellar sources in “real-time” data streams.

• Upon identification of scientifically interesting sources, the pipeline emits source information to robotic telescopes for immediate and automated follow-up.

• TCP sub-project for Yahoo! M45 cluster

• We want to implement our most computationally expensive classifier generation technique using Hadoop.

1. Given an array of times at which an interesting astronomical source has been observed.

2. Re-sample the time-series of well-sampled, classified known sources to match the given time-array.

3. Add noise to the re-sampled data that is characteristic of the observing telescope and of seasonal/local conditions (a minimal sketch of steps 2–3 follows this list).

4. Generate a time-series science classifier using this re-sampled data.

5. Classify the original interesting source using this classifier.
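A minimal sketch of steps 2–3, under simplifying assumptions (linear interpolation onto the query epochs and a single Gaussian error term standing in for telescope and seasonal/local noise); the function name and error model here are illustrative, not the TCP's actual implementation:

```python
import numpy as np

def resample_and_noisify(ref_times, ref_mags, query_times, mag_err=0.05):
    """Re-sample a well-sampled reference light curve onto `query_times`
    (step 2) and add Gaussian noise of width `mag_err` as a stand-in for
    telescope- and condition-dependent errors (step 3)."""
    resampled = np.interp(query_times, ref_times, ref_mags)
    return resampled + np.random.normal(0.0, mag_err, len(query_times))

# Epochs of the "interesting" source (the same values shown on a later slide).
query = np.array([1.01, 1.15, 2.03, 3.72, 8.11, 8.25, 20.93, 21.03, 25.48])
ref_t = np.linspace(0.0, 30.0, 300)                    # dense reference sampling
ref_m = 12.0 + 0.4 * np.sin(2 * np.pi * ref_t / 5.3)   # toy periodic variable
print(resample_and_noisify(ref_t, ref_m, query))
```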

PI: Josh Bloom; Software Engineer: Dan Starr

Berkeley’s Transients Classification Pipeline

• Hadoop technologies used:

• Hadoop Streaming

• Used to wrap existing TCP Python algorithms (a mapper sketch follows this list)

• Cascading (ver 1.1-86)

• Allows construction of a Hadoop dataflow using pipe, sink, and source objects.

• Other packages used:

• Python modules: numpy, scipy, xml...ElementTree, pyephem

• WEKA (Java-based machine learning software)
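As a rough illustration of the Hadoop Streaming wrapping, here is a hypothetical mapper that reads tab-separated (ID, compressed-XML) records on stdin, calls an existing TCP routine (represented by a stub), and writes (ID, attribute-dict) records to stdout; the record layout and `extract_features()` are assumptions for illustration only:

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper.  Streaming feeds one record per line
# on stdin and collects tab-separated key/value output from stdout.
import sys

def extract_features(xml_blob):
    # Stand-in for an existing TCP Python algorithm (period search,
    # time-series characterizing attributes, etc.).
    return {"n_epochs": xml_blob.count("<TD>")}

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    source_id, xml_blob = line.split("\t", 1)
    sys.stdout.write("%s\t%r\n" % (source_id, extract_features(xml_blob)))
```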

Berkeley’s Transients Classification Pipeline

• Dataflow for generating a “noisified” classifier:

1. Inputs: the interesting source’s time array, e.g. [1.01, 1.15, 2.03, 3.72, 8.11, 8.25, 20.93, 21.03, 25.48], and reference, well-sampled sources as (ID, <time-series in compressed XML>) tuples.

2. Join the time-arrays with the well-sampled source XMLs, producing ([1.01, ... 25.48], <time-series in compressed XML>) tuples.

3. Generate several “noisified”, resampled time-series for each tuple.

4. Generate time-series characterizing attributes for each time-series tuple, emitting (ID, <python dictionary of time-series attributes>) tuples. Only output resampled sources where a period could be found.

5. Reduce the noisified source tuples into a single WEKA .arff formatted string (see the sketch below).

6. This .arff file is then used to generate a WEKA classifier.

7. This classifier can then be applied to the original interesting source to obtain a science classification.
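A minimal sketch of the reduce-side step above, folding per-source attribute dictionaries into one WEKA .arff string; the attribute names, class labels, and relation name are illustrative assumptions, not the TCP's actual feature set:

```python
def to_arff(records, relation="noisified_sources"):
    """records: list of (class_label, attribute-dict) pairs sharing the same keys.
    Returns a single WEKA .arff formatted string."""
    attr_names = sorted(records[0][1].keys())
    classes = sorted(set(label for label, _ in records))
    lines = ["@RELATION %s" % relation, ""]
    for name in attr_names:
        lines.append("@ATTRIBUTE %s NUMERIC" % name)
    lines.append("@ATTRIBUTE class {%s}" % ",".join(classes))
    lines.extend(["", "@DATA"])
    for label, attrs in records:
        row = [str(attrs[name]) for name in attr_names] + [label]
        lines.append(",".join(row))
    return "\n".join(lines)

# Toy usage with two noisified sources:
print(to_arff([("rr_lyrae", {"period": 0.57, "amplitude": 0.9}),
               ("eclipsing", {"period": 2.10, "amplitude": 0.4})]))
```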

Berkeley’s Transients Classification Pipeline

• Metrics of the noisification pipeline:

• M45 Hadoop Pipeline

• 15 minutes

• Original TCP Python

• Using IPython parallelization across 8 cores

• 150 minutes

• Other work done as part of the Yahoo! Cloud initiative

• We’ve developed code which applies a WEKA classifier to TCP’s VOSource XML (see the sketch after this list).

• We’ve tested our software on other Hadoop distributions (Cloudera).

• Future tasks to improve comparison metrics

• Think more about distributing the workload more evenly across map() and reduce() tasks.

• Package the Python code as a single, self-contained, distributable Python .egg.
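A hedged sketch of how a serialized WEKA classifier might be applied from the command line via Python; the J48 learner class, weka.jar location, and file names are assumptions, not necessarily what the TCP code uses:

```python
import subprocess

# Apply a previously trained, serialized WEKA classifier to new instances.
cmd = ["java", "-cp", "weka.jar",
       "weka.classifiers.trees.J48",       # same learner used to build the model
       "-l", "noisified.model",            # load the trained classifier
       "-T", "interesting_source.arff",    # instances built from the VOSource XML
       "-p", "0"]                          # print predictions only
print(subprocess.check_output(cmd))
```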

Berkeley’s Transients Classification Pipeline

• Issues we’ve had with the M45 cluster:

• M45’s system installation of Python is v2.4.3 (Old).

• This required modifying some syntax used by our code.

• M45’s Python does not include scipy or numpy Python modules

• This required the non-ideal hack of packaging the numpy & scipy source code with certain map/reduce Hadoop Streaming jobs (see the sketch after this list).

• This is not needed on Hadoop clusters which have numpy, scipy installed.
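A sketch of a job driver reflecting that hack, assuming a tarball of the numpy/scipy sources is shipped alongside the mapper and reducer scripts with Streaming's -file option; all paths, file names, and the streaming jar location here are hypothetical:

```python
import subprocess

# Launch one noisification Streaming job, shipping the Python scripts and a
# numpy+scipy source tarball to every task's working directory via -file.
cmd = ["hadoop", "jar", "hadoop-streaming.jar",
       "-input", "tcp/joined_time_arrays",
       "-output", "tcp/noisified_attributes",
       "-mapper", "noisify_mapper.py",
       "-reducer", "arff_reducer.py",
       "-file", "noisify_mapper.py",
       "-file", "arff_reducer.py",
       "-file", "numpy_scipy_src.tar.gz"]  # unpacked/built by the tasks themselves
subprocess.check_call(cmd)
```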

• Future work using Hadoop:

• Generate classifiers for real astronomical sources using the noisification pipeline.

• We’ve currently used only a test case astronomical source.

• Apply our Hadoop based pipelines to the TCP’s real-time datastream.

• Break pipeline into a finer granularity of map(), reduce() algorithms.

• Make use of other Hadoop-based machine learning projects (e.g., Mahout)

• Port other TCP tasks to Hadoop.

Josh Bloom (PI), Dan Starr, Justin Higgins, Adam Morgan