Starr & Bloom: TCP using Hadoop on Yahoo!'s M45 cluster (2010-01-12)
TRANSCRIPT
Berkeley Astronomy’s Transients Classification Pipeline
• Project Overview
• The TCP is a time-series classification project that identifies flux-varying stellar sources in “real-time” data streams.
• Upon identification of scientifically interesting sources, the pipeline emits source information to robotic telescopes for immediate and automated follow-up.
• TCP sub-project for Yahoo! M45 cluster
• We want to implement our most computationally expensive classifier generation technique using Hadoop.
1. Given an array of times at which an interesting astronomical source has been observed.
2. Re-sample the time-series of well-sampled, classified known sources to match the given time-array.
3. Add noise to the re-sampled data that is characteristic of the observing telescope and of seasonal / local conditions.
4. Generate a time-series science classifier using this re-sampled data.
5. Classify the original interesting source using this classifier.
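The resampling and noisification steps above can be sketched roughly as follows. This is a minimal illustration with hypothetical function names; linear interpolation and Gaussian noise are simplifying stand-ins for the actual TCP algorithms:

```python
import numpy as np

def resample_to_times(known_times, known_fluxes, target_times):
    """Re-sample a well-sampled source's light curve onto the
    observation times of the interesting source (step 2).
    Linear interpolation stands in for the real resampling."""
    return np.interp(target_times, known_times, known_fluxes)

def noisify(fluxes, flux_errors, rng):
    """Add noise characteristic of the observing telescope and
    conditions (step 3); Gaussian noise is a simplifying assumption."""
    return fluxes + rng.normal(0.0, flux_errors, size=len(fluxes))

# The interesting source's observation times (step 1)
target_times = np.array([1.01, 1.15, 2.03, 3.72, 8.11, 8.25,
                         20.93, 21.03, 25.48])

# A hypothetical well-sampled, classified known source (dense light curve)
known_times = np.linspace(0.0, 30.0, 300)
known_fluxes = 10.0 + np.sin(2 * np.pi * known_times / 5.0)

rng = np.random.default_rng(42)
resampled = resample_to_times(known_times, known_fluxes, target_times)
noisified = noisify(resampled, np.full(len(target_times), 0.05), rng)
# `noisified` would then feed attribute generation and classifier
# training, steps (4) and (5).
```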
PI: Josh Bloom, Sw Eng: Dan Starr
• Hadoop technologies used:
• Hadoop Streaming
• Used to wrap existing TCP Python algorithms
• Cascading (ver 1.1-86)
• Allows construction of a Hadoop dataflow using pipes and source / sink objects.
• Other packages used:
• Python modules: numpy, scipy, xml...ElementTree, pyephem
• WEKA (JAVA based machine learning software)
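As an illustration of how Hadoop Streaming wraps existing Python code: a streaming mapper is just a script that reads tab-separated key/value lines on stdin and writes key/value lines to stdout. A minimal sketch with a hypothetical record layout (not the actual TCP code):

```python
import sys

def map_line(line):
    """Parse one streaming input record of the (hypothetical) form
    'ID<TAB>compressed-xml' and emit zero or more output records."""
    key, _, value = line.rstrip("\n").partition("\t")
    if not key:
        return []
    # A real mapper would decompress the XML and emit noisified
    # resamplings; here we just tag the record as processed.
    return [(key, "processed:" + value)]

def run(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        for out_key, out_value in map_line(line):
            stdout.write("%s\t%s\n" % (out_key, out_value))

if __name__ == "__main__":
    run()
```

Such a script is handed to the streaming jar via `-mapper` and shipped to the cluster with `-file`, alongside an analogous reducer; Cascading then wires these stages together as pipes between sources and sinks.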
Generating a “noisified” classifier
[Slide diagram; dataflow reconstructed as a list:]
• Inputs: the interesting source’s time array (e.g. [1.01, 1.15, 2.03, 3.72, 8.11, 8.25, 20.93, 21.03, 25.48]) and reference, well-sampled sources, each an (ID, <time-series in compressed XML>) tuple.
• Join the time-arrays with the well-sampled source XMLs, producing ([1.01, ... 25.48], <time-series in compressed XML>) tuples.
• Generate several “noisified”, resampled time-series for each tuple.
• Generate time-series characterizing attributes for each time-series tuple, emitting (ID, <python dictionary of time-series attributes>) tuples; only output resampled sources where a period could be found.
• Reduce the noisified source tuples into a single WEKA .arff formatted string.
• This .arff file is then used to generate a WEKA classifier, which can then be applied to the original interesting source to obtain a science classification.
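The final reduce step collapses the per-source attribute dictionaries into one WEKA .arff file. A rough sketch of that assembly, with hypothetical attribute names and class labels, simplified to numeric attributes plus a class attribute:

```python
def build_arff(relation, attribute_names, rows):
    """Reduce attribute dictionaries into a single .arff-formatted
    string: a header declaring each attribute, then one comma-separated
    data line per source. Class values here are illustrative only."""
    lines = ["@RELATION %s" % relation, ""]
    for name in attribute_names:
        lines.append("@ATTRIBUTE %s NUMERIC" % name)
    lines.append("@ATTRIBUTE class {rr_lyrae,eclipsing,mira}")
    lines.append("")
    lines.append("@DATA")
    for attrs, label in rows:
        values = [str(attrs[name]) for name in attribute_names]
        lines.append(",".join(values + [label]))
    return "\n".join(lines)

# Hypothetical attribute dictionaries emitted by the map stage
rows = [
    ({"period": 0.57, "amplitude": 1.2}, "rr_lyrae"),
    ({"period": 3.10, "amplitude": 0.8}, "eclipsing"),
]
arff = build_arff("noisified_sources", ["period", "amplitude"], rows)
```

WEKA can then train a classifier directly from the resulting file.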
• Metrics of the noisification pipeline:
• M45 Hadoop pipeline: 15 minutes
• Original TCP Python, using IPython parallelization across 8 cores: 150 minutes
• Other work done as part of the Yahoo! Cloud initiative
• We’ve developed code which applies a WEKA classifier to TCP’s VOSource XML.
• We’ve tested our software on other Hadoop distributions (Cloudera).
• Future tasks to improve comparison metrics
• Think more carefully about distributing the workload fairly across map() and reduce() tasks.
• Package the Python code as a single self-contained, distributable Python .egg.
• Issues we’ve had with the M45 cluster:
• M45’s system installation of Python is v2.4.3 (Old).
• This required modifying some syntax used by our code.
• M45’s Python does not include scipy or numpy Python modules
• This required the non-ideal hack of packaging numpy & scipy source code with certain map/reduce Hadoop Streaming jobs.
• This is not needed on Hadoop clusters which have numpy, scipy installed.
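As a hypothetical illustration of the kind of syntax change Python 2.4 forced: conditional expressions (`x if cond else y`) only arrived in Python 2.5, so code using them had to fall back to older idioms (this example is illustrative, not actual TCP code):

```python
def label_variability(sigma):
    """Python 2.4-compatible replacement for
    'return "variable" if sigma > 5.0 else "steady"'
    (conditional expressions were added in Python 2.5)."""
    # The classic and/or idiom; safe here because both branch
    # values are truthy strings.
    return (sigma > 5.0) and "variable" or "steady"
```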
• Future work using Hadoop:
• Generate classifiers for real astronomical sources using the noisification pipeline.
• We’ve currently used only a test case astronomical source.
• Apply our Hadoop based pipelines to the TCP’s real-time datastream.
• Break pipeline into a finer granularity of map(), reduce() algorithms.
• Make use of other Hadoop based machine learning projects (e.g.: Mahout)
• Port other TCP tasks to Hadoop.