
Imola K. Fodor and Chandrika Kamath

Center for Applied Scientific Computing, Lawrence Livermore National Laboratory

IPAM Workshop, January 2002

Mining the FIRST Astronomical Survey


Faint Images of the Radio Sky at Twenty-Centimeters (FIRST)

- On-going sky survey, started in 1993
- When completed, will cover more than 10,000 deg² to a flux density limit of 1.0 mJy (milli-Jansky)
- Current coverage is about 8,000 deg²
  – more than 32,000 two-million-pixel images
- There are about 90 radio sources/deg²
- Data available at http://sundog.stsci.edu

[Image: the NRAO Very Large Array (VLA)]


One goal of FIRST is to identify radio galaxies with a bent-double morphology

- A bent-double galaxy is … Problem: there is no definition of "bent-double"
- Rough characteristic: there is a radio-emitting "core", along with a number of (not necessarily two!) side components that are "bent" around the core
- Astronomers search manually for bent-doubles

[Images: example bent-doubles and non-bent-doubles]


Sapphire: use data mining to enhance the visual search for bent-doubles

- Use galaxies classified by astronomers to model the binary response variable $Y \in \{\text{bent}, \text{non-bent}\}$
- Find features $X$ and a model $f$ such that $f(X) = \hat{Y} \approx Y$ with the desired accuracy
- Aim: 10% misclassification error, as manual classification is not more accurate

[Flowchart: FIRST images → pre-processing (denoising, feature extraction, dimension reduction) → "good" features → pattern recognition (classification) → bent/non-bent coordinates]


The FIRST catalog is based on fitting 2D elliptical Gaussians to denoised images

[Figure: a 1550 × 1150-pixel image map, with a 64-pixel cutout around one radio source (RS); there are 32K image maps, 7.1 MB each. Each fitted Gaussian is a catalog entry (CE); the catalog has 720K entries.]

RA        DEC        Peak Flux  Major Axis  Minor Axis  Position Angle
                     (mJy/bm)   (arcsec)    (arcsec)    (degrees)
00 56 25  -01 15 43  25.38       7.39        2.23       37.9
00 56 26  -01 15 57   5.50      18.30       14.29       94.2
00 56 24  -01 16 31   6.44      19.34       10.19       39.8
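The catalog columns above are the parameters of one fitted Gaussian. As a concrete illustration, here is a minimal numpy sketch of the 2D elliptical Gaussian model such a fit is based on; the function and parameter names are illustrative, not taken from the FIRST pipeline:

```python
import numpy as np

def elliptical_gaussian(x, y, peak, x0, y0, major_fwhm, minor_fwhm, pa_deg):
    """Evaluate a 2D elliptical Gaussian (illustrative model).

    peak:                 peak flux (e.g. mJy/beam)
    x0, y0:               center position
    major_fwhm/minor_fwhm: FWHM along the major/minor axes
    pa_deg:               position angle of the major axis, in degrees
    """
    theta = np.deg2rad(pa_deg)
    # Rotate coordinates into the frame of the ellipse.
    dx, dy = x - x0, y - y0
    u = dx * np.cos(theta) + dy * np.sin(theta)   # along the major axis
    v = -dx * np.sin(theta) + dy * np.cos(theta)  # along the minor axis
    # Convert FWHM to Gaussian sigma: FWHM = 2 * sqrt(2 ln 2) * sigma.
    c = 2.0 * np.sqrt(2.0 * np.log(2.0))
    s_maj, s_min = major_fwhm / c, minor_fwhm / c
    return peak * np.exp(-0.5 * ((u / s_maj) ** 2 + (v / s_min) ** 2))
```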


A first pre-processing step is to identify potential features to discriminate bents

- For the FIRST data, we extracted various features based on radio intensities, angles, distances, …
- For galaxies with 3 entries: a total of 103 features
  – three sets of single features, three pairs of double features, and the triple features
  – possible redundancies
- Reduce dimension using
  – domain knowledge
  – EDA
  – PCA
  – GLM step-wise model selection


Triple features for three catalog entries

[Diagram: three catalog entries A, B, and C, with derived points M, N, P and quantities a, b, c used to define the triple features]


Using exploratory data analysis (EDA), we reduced the number of features to 25

- Use EDA techniques such as
  – box-plots
  – multivariate plots
  – parallel-coordinate plots
  – the correlation matrix
- to
  – explore the data
  – find unusual observations
  – eliminate correlations among the features (see the sketch below)
- Call these EDA features
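As one illustration of the "eliminate correlations" step, a minimal greedy sketch that repeatedly drops one feature from the most correlated remaining pair; the 0.95 threshold and the tie-breaking rule are assumptions, not values from the study:

```python
import numpy as np

def drop_correlated(X, names, threshold=0.95):
    """Greedily drop features until no pair has |correlation| > threshold.

    X: (n_samples, n_features) array; names: list of feature names.
    Returns the names of the surviving features.
    """
    keep = list(range(X.shape[1]))
    corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(corr, 0.0)
    while True:
        sub = corr[np.ix_(keep, keep)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        if sub[i, j] <= threshold:
            break
        # Drop whichever feature of the pair is more correlated, on
        # average, with the rest of the kept features.
        drop = keep[i] if sub[i].mean() >= sub[j].mean() else keep[j]
        keep.remove(drop)
    return [names[k] for k in keep]
```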


Example parallel coordinate plot: nine variables split by bentness category

[Plot: parallel coordinates for the nine variables, split into Bent and Non-bent panels; annotations mark unusual observations ("x"), the 3/2 sky regions for bent/non-bent, and a pair of axes with large negative correlation]


Principal component analysis (PCA) finds linear combinations of variables

- Suppose we have $p$ features $X = (X_1, \ldots, X_p)'$, with $E[X] = 0$ and $E[XX'] = \Psi$, and we want a linear combination $U = a'X$ with maximal variance, subject to $a'a = 1$
- By the spectral decomposition theorem, $\Psi = V \Lambda V'$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ with $\lambda_1 \geq \cdots \geq \lambda_p$, and $V = (V_1, \ldots, V_p)$ is a $p \times p$ orthogonal matrix
- The first PC, $U_1 = V_1'X$, has maximal variance, and $\mathrm{var}(U_1) = \lambda_1 \geq \cdots \geq \mathrm{var}(U_p) = \mathrm{var}(V_p'X) = \lambda_p$
- The total variance is preserved: $\sigma^2_{\mathrm{total}} = \sum_{i=1}^{p} \mathrm{var}(X_i) = \sum_{i=1}^{p} \mathrm{var}(U_i)$
- Dimension reduction: use the first $k$ PCs as new "features"
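The spectral decomposition above translates directly into a few lines of numpy; a minimal sketch, illustrative rather than the Sapphire implementation:

```python
import numpy as np

def pca(X, k):
    """Project centered data onto the first k principal components.

    X: (n_samples, p) data matrix.
    Returns (scores U, loadings V, eigenvalues lam) for the first k PCs.
    """
    Xc = X - X.mean(axis=0)              # enforce E[X] = 0
    Psi = np.cov(Xc, rowvar=False)       # p x p covariance matrix
    lam, V = np.linalg.eigh(Psi)         # spectral decomposition Psi = V diag(lam) V'
    order = np.argsort(lam)[::-1]        # sort eigenvalues in decreasing order
    lam, V = lam[order], V[:, order]
    # var(U_1) = lam[0]; sum(lam) equals the total variance trace(Psi).
    U = Xc @ V[:, :k]                    # scores of the first k PCs, U_j = V_j'X
    return U, V[:, :k], lam[:k]
```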


We used PCA differently to reduce the number of original features to 20

- The first 20 PCs explain 90% of the variance
- PCs are hard to interpret: instead of using 20 PCs, keep 20 of the original variables, following Multivariate Analysis (Mardia, Kent, Bibby)
  – consider the last PC, $U_p = V_p'X = \sum_{i=1}^{p} V_{p,i} X_i$, the one with the smallest variance
  – find the largest (in absolute value) coefficient $V_{p,j}$, and discard the corresponding original variable $X_j$
  – repeat the procedure with the second-to-last PC, and iterate until only 20 variables remain (see the sketch below)
- Call these PCA features
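A hedged sketch of this backward elimination; recomputing the PCs after every deletion is one reading of "iterate" (the slide could also mean walking up the loadings of a single decomposition):

```python
import numpy as np

def pca_select(X, names, n_keep=20):
    """Discard variables dominating the smallest-variance PCs until n_keep remain."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        Xc = X[:, keep] - X[:, keep].mean(axis=0)
        lam, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
        # eigh returns eigenvalues in increasing order, so column 0 of V
        # is the last PC (smallest variance); drop its dominant variable.
        j = int(np.argmax(np.abs(V[:, 0])))
        keep.pop(j)
    return [names[k] for k in keep]
```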


We also used step-wise model selection to reduce the number of variables

- Binary response: Y = {bent, non-bent}
- Explanatory variables: the features $X_i$
- Logistic regression, step-wise model selection with the AIC as a measure of goodness (minimize the -log-likelihood, with a penalty term for large models); a sketch follows below
- Cannot use all 103 features because of correlations
- We identified the features selected by EDA or PCA
  – stepwise model selection => GLM 2 features (25)
- We identified the features selected by EDA and PCA
  – stepwise model selection => GLM 3 features (10)
  – stepwise model selection, including second-order interactions => GLM 4 features (9, +5 interactions)
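A minimal backward-only sketch of AIC-based stepwise selection for the logistic model, using statsmodels; the library choice and the backward-only search are assumptions (classical step-wise selection also tries forward moves):

```python
import statsmodels.api as sm

def backward_step_aic(y, X):
    """Greedy backward stepwise selection by AIC.

    y: binary 0/1 response; X: pandas DataFrame of candidate features.
    Returns the list of retained feature names.
    """
    current = list(X.columns)
    best_aic = sm.Logit(y, sm.add_constant(X[current])).fit(disp=0).aic
    improved = True
    while improved and len(current) > 1:
        improved = False
        for name in list(current):
            trial = [c for c in current if c != name]
            aic = sm.Logit(y, sm.add_constant(X[trial])).fit(disp=0).aic
            if aic < best_aic:  # smaller AIC: better likelihood/size trade-off
                best_aic, current, improved = aic, trial, True
                break
    return current
```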


Pattern recognition uses the features from pre-processing to classify the data

[Flowchart: Training data → Extract Features → Create Classifier (Decision Tree, GLM) → Check for Accuracy; Extract Features for Unclassified Data → Apply Classifier to Unclassified Data → Show Results and Obtain Score → Update Training Data]

An iterative and interactive classification process


We use decision trees to classify the radio sources into bents and non-bents

- Use information gain to split
  – $T$: set of examples at a node
  – $k$: number of classes
  – $S$: a split of $T$ into two subsets, $T_L$ and $T_R$
  – $n_i$: number of examples of class $i$ in $T$

$\mathrm{Entropy}(T) = -\sum_{i=1}^{k} p_i \log_2 p_i, \qquad p_i = n_i/|T|$

$\mathrm{Gain}(T, S) = \mathrm{Entropy}(T) - \frac{|T_L|}{|T|}\,\mathrm{Entropy}(T_L) - \frac{|T_R|}{|T|}\,\mathrm{Entropy}(T_R)$

[Diagram: example tree splitting on "radius > a?" at the root, then on "color?" in each branch]
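The entropy and gain formulas above, in executable form; a minimal numpy sketch, not the Sapphire decision-tree code:

```python
import numpy as np

def entropy(labels):
    """Entropy(T) = -sum_i p_i log2 p_i over the k classes present in T."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # p_i = n_i / |T|
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels, left_mask):
    """Gain(T, S) for the split S of T into T_L (mask True) and T_R."""
    left, right = labels[left_mask], labels[~left_mask]
    n, n_l, n_r = len(labels), len(left), len(right)
    return entropy(labels) - (n_l / n) * entropy(left) - (n_r / n) * entropy(right)

# Example: information_gain(y, radius > a) scores the split "radius > a?"
# (y, radius, and a are hypothetical arrays/threshold for illustration).
```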


Decision tree created with all the features: Tree 1

- Resubstitution error, train/test (90%) set: 2.8%
- Cross-validation error, train/validate (10%) set: 5.3%

[Tree diagram; callouts mark a leaf node with 11 non-bents, a leaf node with 4 bents, and a leaf node with 145 items: (145-4) bents and 4 non-bents]


Decision tree created with the EDA and PCA features: Tree 2

- Resubstitution error: 1.7%
- Cross-validation error: 5.3%


Decision tree created with the GLM 3 features: Tree 3

- Resubstitution error: 2.8%
- Cross-validation error: 0%

Using fewer, well-selected variables results in smaller and more accurate trees


We also used generalized linear models (GLMs) to classify the galaxies

- Linear models explain response variables in terms of linear combinations of explanatory variables:
  $Y = X\beta + \varepsilon, \quad E(\varepsilon) = 0, \quad \mathrm{Cov}(\varepsilon) = \Sigma$,
  i.e. $E(y_i) = \beta_0 + \beta_1 X_{i,1} + \cdots + \beta_p X_{i,p}$, $i = 1, \ldots, n$
- The least-squares estimate solves $\hat{\beta} = \arg\min_{\beta} \{(y - X\beta)'\,\Sigma^{-1}\,(y - X\beta)\}$, with fitted values $\hat{Y} = X\hat{\beta}$
- There are no restrictions on the range of the fitted values
- GLMs allow such restrictions by modeling $g(\mu_i) = X_i'\beta$ and $\mathrm{Var}(y_i) = V(\mu_i)$, where $g(\cdot)$ is a monotone increasing link function
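With $\Sigma = \sigma^2 I$ the least-squares criterion above reduces to ordinary least squares, which numpy solves directly; a small self-contained example on synthetic data, for illustration only:

```python
import numpy as np

# Hypothetical design matrix X (with an intercept column) and response y.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=100)

# OLS solves argmin_beta (y - X beta)'(y - X beta).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_fitted = X @ beta_hat   # fitted values Y-hat = X beta-hat, unrestricted range
```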


Logistic regression is a special GLM suitable for modeling binary responses

- $Y = \{0, 1\}$
- Logit link and variance functions:
  $g(\mu_i) = \log\!\left(\frac{\mu_i}{1 - \mu_i}\right), \qquad V(\mu_i) = \mu_i(1 - \mu_i)$
- The likelihood is non-linear in the parameters and has no closed-form maximizer: use iteratively reweighted least squares to find $\hat{\beta}$ (a sketch follows below)
- Given $\hat{\beta}$,
  $\hat{\mu}_i = \frac{\exp\{X_i'\hat{\beta}\}}{1 + \exp\{X_i'\hat{\beta}\}}, \qquad \hat{y}_i = I\{\hat{\mu}_i > p\}$,
  where $I\{a\}$ is {0, 1} according to {a = False, a = True}, and the fraction $p$ is generally taken to be 0.5
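A compact numpy sketch of the iteratively reweighted least squares iteration for the logit link, in its standard textbook form (not the authors' code):

```python
import numpy as np

def logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression by iteratively reweighted least squares.

    X: (n, p) design matrix (include a column of ones for an intercept);
    y: binary 0/1 response. Returns the coefficient estimate beta-hat.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))   # inverse logit link
        w = mu * (1.0 - mu)               # variance function V(mu)
        z = eta + (y - mu) / w            # working response
        # Weighted least squares step: beta = (X'WX)^{-1} X'Wz.
        beta_new = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Classification rule from the slide: y_hat_i = I{mu_hat_i > 0.5}.
```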


GLM created with the GLM 2 features


GLM created with the GLM 3 features


GLM created with the GLM 4 features


Misclassification errors of best models are below the desired 10% in training set

Misclassification errors based on 10 ten-fold cross-validations in the training set (a sketch of the protocol follows below):

       Tree 1   Tree 2   Tree 3   GLM 2    GLM 3   GLM 4
Mean   11.1%    9.5%     8.3%     18.74%   7.84%   4.00%
SE     0.4%     0.4%     0.4%     4.34%    0.91%   1.14%

- Careful selection of variables reduces error
- Trees are less sensitive to the input features than GLMs
- GLM 4 has the lowest misclassification errors
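A hedged sketch of the "10 ten-fold cross-validations" protocol behind these numbers, using scikit-learn; the library and the decision tree as the stand-in classifier are assumptions:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def cv_misclassification(X, y, seed=0):
    """Mean and SE of the error over 10 repetitions of ten-fold CV."""
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=seed)
    acc = cross_val_score(DecisionTreeClassifier(random_state=seed), X, y, cv=cv)
    err = 1.0 - acc                         # misclassification error per fold
    # Scores arrive repetition by repetition: average the 10 folds within
    # each repetition, then summarize across the 10 repetitions.
    per_rep = err.reshape(10, 10).mean(axis=1)
    return per_rep.mean(), per_rep.std(ddof=1) / np.sqrt(len(per_rep))
```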


Our methods identified the "interesting" part of the FIRST dataset

- 15,059 three-entry radio sources in the 2000 catalog
- 2,577 labeled as bent by all six methods
- Astronomers can start by exploring the smaller set
- Visually explore random samples to assess the percentage of false positives and missed bents

Classification results for the entire 2000 catalog:

           Tree 1   Tree 2   Tree 3   GLM 2   GLM 3   GLM 4   All 6
Bent       9647     10431    9399     9941    3979    10719   2577
Non-bent   5412     4628     5660     5118    11080   4340    637


Example classifications for previously unlabeled galaxies are encouraging

The labels commonly assigned by the six methods are correct in the examples below

[Images: example galaxies labeled Bent and Non-bent]


Summary

- Described how data mining can help identify radio galaxies with bent-double morphology
- Illustrated specific data mining steps
  – data pre-processing is crucial
- In our experience, data mining is semi-automatic
  – interaction and feedback are required at many stages
  – domain knowledge is essential
- Multi-disciplinary collaboration is challenging, but rewarding
  – astronomy - computer science - statistics
- There is always room for improvement
  – alternative techniques
  – your feedback is welcome!


The Sapphire team: supporting a multi-disciplinary endeavor

Chandrika Kamath (Project Lead), Erick Cantú-Paz, Imola K. Fodor, Nu A. Tang

Thanks to the FIRST scientists: Robert Becker, Michael Gregg, David Helfand, Sally Laurent-Muehleisen, and Rick White

www.llnl.gov/casc/sapphire

UCRL-JC-145672. This work was performed under the auspices of the U.S. Department of Energy by University of California, Lawrence Livermore National Laboratory under contract W-7405-Eng-48.