multivariate methods in hep

Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 1

Multivariate Methods in HEP

Pushpa Bhat Fermilab


Outline

• Introduction/History• Physics Analysis Examples• Popular Methods

• Likelihood Discriminants• Neural Networks• Bayesian Learning• Decision Trees

• Future• Issues and Concerns• Summary


Some History

• In 1990 most of the HEP community was skeptical towards use of multivariate methods, particularly so in case of neural networks (NN)• NN as a black box

Can’t understand weightsNonlinear mapping; higher order correlations Though mathematical function can’t explain in terms of physicsCan’t calculate systematic errors reliably

Uni-variate or “cut-based” analysis was the norm • Some were pursuing application of neural network methods to HEP

around 1990• Peterson, Lonnblad, Denby, Becks, Seixas, Lindsey, etc

• First AIHENP (Artificial Intelligence in High Energy & Nuclear Physics) workshop was in 1990.• Organizers included D. Perret-Gallix, K.H. Becks, R. Brun, J.Vermaseren. AIHENP metamorphosed into ACAT ten years later, in 2000

• Multivariate methods such as Fisher discriminants were in limited use.• In 1990, I began to pursue the use of multivariate methods, especially

NN, in top quark searches at Dzero.


Mid-1990’s

• LEP experiments had been using NN and likelihood discriminants for particle-ID applications and eventually for signal searches (Steinberger; tau-ID)

• H1 at HERA successfully implemented and used NN for triggering (Kiesling).

• Hardware NN was attempted at Fermilab at CDF• Fermilab Advanced Analysis Methods Group

brought CDF and DØ together for discussion of these methods and applications in physics analyses.


The Top QuarkPost-Evidence, Pre-Discovery !

Fisher Analysis of tte channel

One candidate event (S/B)(mt = 180 GeV)

= 18 w.r.t. Z = 10 w.r.t WW

NN Analysis tt e+jets channeltt

W+jets

W+jetstt160 Data

P. Bhat, DPF94


Cut Optimization for Top Discovery Feb. ‘95

Signal

BackgroundJan. ’95

(Aspen) cut

Mar. ’95Discovery cut

Contours: Possible NN cuts Feb. ‘95

Sig. Eff.

S/B (Feb-Mar, 95 -Discovery

Conventional cut)

S/B reach with 2-v NN analysisfor similar efficiency

(Jan, 95 –Aspen mtg.Conventional cut)

Neural Network Equi-probability Contour cuts from 2-variable analysis compared with conventional cuts used in Jan. ’95 and in Observation paper

P. Bhat, H.Prosper, E. AmidiD0 Top Marathon, Feb. ‘95


Measurement of the Top Quark Mass

Discriminant variables

mt = 173.3 ± 5.6(stat.) ± 6.2 (syst.) GeV/c2

The DiscriminantsThe Discriminants

DØ Lepton+jetsDØ Lepton+jets

Fit performed in 2-D: (DLB/NN, mfit)

Run I (1996) result with NN and likelihoodRecent (CDF+D0) mt measurement:

mt= 171.4 ± 2.1 Gev/c2

First significant physics result using multivariate methods


Higgs, the Holy Grail of HEPDiscovery Reach at the Tevatron

• The challenges are daunting! But using NN provides same reach with a factor of 2 less luminosity w.r.t. conventional analysis

• Improved bb mass resolution & b-tag efficiency crucial

Run II Higgs study hep-ph/0010338 (Oct-2000)P.C.Bhat, R.Gilmartin, H.Prosper, Phys.Rev.D.62 (2000) 074022


Then, it got easier

• One of the important steps in getting the NN accepted at the Tevatron experiments was to make the Bayesian connection.

• Another important message to drive home was “the maximal use of information in the event” for the job at hand

• Developed a random grid search technique that can be used as baseline for comparison

• Neural network methods now have become popular due to the ease of use, power and many successful applications

Maybe too easy??


Optimal Event Selection

x

r(x,y) = constant defines an optimaldecision boundary

r(x,y) = constant defines an optimaldecision boundary

Feature spaceFeature space

),|(

),|(

)()|,(

)()|,(),(

yxbp

yxsp

bpbyxp

spsyxpyxr

),|(

),|(

)()|,(

)()|,(),(

yxbp

yxsp

bpbyxp

spsyxpyxr

S = B =

Conventional cutsx x

y y

0

0

y

0y

x0

x

y

x

y

0x

0y


Limitations of “Conventional NN”

• The training yields one set of weights or network parameters• Need to look for “best” network, but avoid overfitting

• Heuristic decisions on network architecture• Inputs, number of hidden nodes, etc.

• No direct way to compute uncertainties


Ensembles of Networks

NN1

NN2

NN3

NNM

X

y1

y2

y3

yM

)(xyayi

ii

Decision by averaging over many networks (a committee of networks) has lower error than that of any individual network.


Bayesian Learning

• The result of Bayesian training is a posterior density of the network weights

P(w|training data) • Generate a sequence of weights (network

parameters) in the network parameter space i.e., a sequence of networks. The optimal network is approximated by averaging over the last K points:

K

1knew

1),( kwxy

Ky


Bayesian Learning – 2

• Advantages• Less prone to over-fitting• Less need to optimize the size of the network. Can use a

large network! Indeed, number of weights can be greater than number of training events!

• In principle, provides best estimate of p(t|x)p(t|x)

• Disadvantages• Computationally demanding!

• The dimensionality of the parameter space is, typically, large • There could be multiple maxima in the likelihood function p(t|

x,w), or, equivalently, multiple minima in the error function E(x,w).


Example: Single Top Search

• Training Data• 2000 events (1000 tqb- + 1000 Wbb-)• Standard set of 11 variables

• Network• (11, 30, 1) Network (391391 parameters!)

• Markov Chain Monte Carlo (MCMC)• 500 iterations, but use last 100 iterations • 20 MCMC steps per iteration• NN-parameters stored after each iteration• 10,000 steps• ~ 1000 steps / hour (on 1 GHz, Pentium III laptop)


Signal:tqb; Background:Wbb Distributions

Example: Single Top Search


Decision Trees

• Recover events that fail criteria in cut-based analyses• Start at first “node” with a fraction of the “training

sample” • Select best variable and cut with best separation to

produce two “branches ” of events, (F)ailed and (P)assed cut

• Repeat recursively on successive nodes• Stop when improvement stops or when too few events

are left • Terminal node is called a “leaf ” with purity =

Ns/(Ns+Nb)• Run remaining events and data through the tree to

derive results• Boosting DT:

• Boosting is a recently developed technique that improves any weak classifier (decision tree, neural network, etc)

• Boosting averages the results of many trees, dilutes the discrete nature of the output, improves the performance

DØ single topanalysis


Matrix Element MethodExample: Top mass measurement

• Maximal use of information in each event by calculating event-by-event signal and background probabilities based on the respective matrix element

x: reconstructed kinematic variables of final state objectsJES: jet energy Scale from Mw constraint

• Signal and background probabilities from differential cross sections

• Write combined likelihood for all events

• Maximize likelihood w.r.t. mtop, JES


Summary

• Multivariate methods are now used extensively in HEP data analysis

• Neural networks, because of their ease of use and power, are favorites for particle-ID and signal/background discrimination

• Bayesian neural networks take us one step closer to optimization

• Likelihood discriminants and Decision trees are becoming popular because they are easier to “defend” (no “black-box” stigma)

• Many issues remain to be addressed as we get ready to deploy the multivariate methods for discoveries in HEP


Nothing tends so much to the advancement of knowledge as the application of a new instrument - Humphrey Davy

No amount of experimentation can ever prove me right; a single experiment can prove me wrong. - Albert Einstein


CDF

CDF

DØ

DØDØ

Booster

World’s Highest Energy Laboratory

(for now)


Our Fancy New Toys

LHC Ring

SPS Ring

PS

Circumference = 27kmBeam Energy = 7.7 TeVLuminosity =1.65x1034 cm-2sec-1

Startup date: 2007

p p

LHC Magnet LHC Tunnel

TI 2TI 2

TI 8TI 8

The Large Hadron Collider

CMS


LHC Environment

14 TeV Proton Proton colliding beams

Parameter ValueBunch-crossing frequency 40 MHz

Average # of collisions / crossing

20

“interaction rate” ~109

Average # of charged tracks

1000

Radiation field severe

CMS Parameter ValueLevel-1 trigger rate 100 kHz

Mean time between triggers

10 sec

Trigger latency 3.2 sec

Solenoid field 4 T


CMS Silicon Tracker

Challenges


CMS Si Tracker

5.4 m

2,4

m

Inner Barrel & Disks

(TIB & TID)

PixelsOuter Barrel (TOB)


Lots of Silicon

214m2 of silicon sensors11.4 million silicon strips66 million pixels!


Si Tracker Challenges

• Large and complex system• 77.4 million total channels (out of a total of 78.2 M for

experiment)• Detector monitoring, data organization, data quality monitoring,

analysis, visualization, interpretation all daunting!

• Need to monitor every channel and make sure most of the detector is working at all times (live fraction of the detector and efficiencies bound to decrease with time)

• Need to verify data integrity and data quality for physics• Diagnose and fix problems ASAP• Keep calibration and alignment parameters current


Detector/Data Monitoring

• Monitor• Environmental variables

• Temperatures, coolant flow rates, interlocks, radiation doses

• Hardware status• Voltages, currents

• Channel Data• Readout states, Errors, missing data/channels, bad ID for

channel/modulemany kinds to be categorized and tracked and displayedshould be able to find rare problems/errors (with low

occurrence rate) that may corrupt data Problems (Rare problems may indicate a developing failure mode or hidden bad behavior)

Correlate problem/noisy channels with history, temperature, currents, etc.


Data Quality Monitoring

• Monitor• Raw Data

• Pedestals, noise, adc counts, occupancies, efficiencies• Processed high level objects

• Clusters, tracks, etc.• Evaluate thousands of histograms

• Can’t visually examine all• Automatically evaluate histograms by comparing to reference

histograms • Adaptive, efficient, find evolving patterns over time

• Quantiles? q-q plots/comparison instead of KS test?• A variety of 2D “heat” maps

• Occupancies, #of bad channels/module, #of errors/module, etc.

• Typical occupancy ~ 2% in strip tracker• 200,000 channels written out 100 times/sec


Module Assembly Precision

Example of a“Heat” map


Need smart approaches

• What are the best techniques for data-mining?• To organize data for analysis and data visualization

• complex geometry/addressing makes visualization difficult

• For finding problematic channels quickly, efficiently clustering, exploratory data-mining

• For finding anomalies, corrupt data, patterns of behaviorFeature-finding algorithms, superpose many events, time

evolution, spatial and temporal correlations

• Noise Correlations • Via correlation coefficients of defined groups• Correlate to history (time variations), environmental

variables


Data Visualization

• Based on hierarchical/geometrical structure of the tracker• Display every channel, attach objects/info to each

Sub-structuresLayers/ringsModulesReadout Chips


Multivariate Analysis Issues

• Dimensionality Reduction• Choosing Variables optimally without losing information

• Choosing the right method for the problem• Controlling Model Complexity• Testing Convergence• Validation

• Given a limited sample what is the best way?

• Computational Efficiency


Multivariate Analysis Issues

• Correctness of modeling• How do we make sure the multivariate modeling is

correct? • The data used for training or building PDEs represent reality.

Is it sufficient to check the modeling in the mapped variable? Pair-wise correlations? Higher order correlations?

• How do we show that the background is modeled well? How do we quantify the correctness of modeling?

• In conventional analysis, we normally look for variables that are well modeled in order to apply cuts

• How well is the background modeled in the signal region?

• Worries about hidden bias• Worries about underestimating errors


Sociological Issues

• We have been conservative in the use of MV methods for discovery.

• We have been more aggressive in the use of MV methods for setting limits.

• But discovery is more important and needs all the power you can muster!

• This is expected to change at LHC.


Summary

• The next generation of experiments will need to adopt advanced data mining and data analysis techniques

• Conventional/routine tasks such as alignment, detector performance and data quality monitoring and data visualization will be challenging and require new approaches

• Many issues regarding use of multivariate methods of data analysis for discoveries and measurements need to be addressed to make optimal use of data


MV: Where can we use them?

• Almost everywhere since HEP events are multivariate• Improve several aspects of analysis

• Event selection• Triggering, Real-time Filters, Data Streaming

• Event reconstruction• Tracking/vertexing, particle ID

• Signal/Background Discrimination• Higgs discovery, SUSY discovery, Single top, …

• Functional Approximation• Jet energy corrections, tag rates, fake rates

• Parameter estimation• Top quark mass, Higgs mass, SUSY model parameters

• Data Exploration• Knowledge Discovery via data-mining• Data-driven extraction of information, latent structure analysis

multivariate methods in hep

Documents

use of multivariate

hardware nn

variable analysis

based analysis

limited use

physics analyses

significant physics

fisher discriminants