using principal component analysis to find correlations...

3
USING PRINCIPAL COMPONENT ANALYSIS TO FIND CORRELATIONS AND PATTERNS AT DIAMOND LIGHT SOURCE C. Bloomer, G. Rehm, Diamond Light Source, Oxfordshire, UK Abstract Principal component analysis is a powerful data analysis tool, capable of reducing large complex data sets containing many variables. Examination of the principal components set allows the user to spot underlying trends and patterns that might otherwise be masked in a very large volume of data, or hidden in noise. Diamond Light Source archives many gigabytes of machine data every day, far more than any one human could effectively search through for correla- tions. Presented in this paper are some of the results from running principal component analysis on years of archived data in order to find underlying correlations that may other- wise have gone unnoticed. The advantages and limitations of the technique are discussed. INTRODUCTION Principal component analysis (PCA) is a technique used to find underlying correlations that exist in a (potentially very large) set of variables. The objective of the analysis is to take a set of n variables, Y 1 , Y 2 , Y 3 , ..., Y n , and to find corre- lations. The most important of these correlations are called the principal components (PCs). The analysis will return vectors Z 1 , Z 2 , Z 3 , ..., Z n , each describing a different un- derlying variation or trend found in the initial data set. The vectors of Z are ordered by their importance; that is to say the component Z 1 is the most prevalent trend seen through- out the data, and accounts for more variation than Z 2 . Z 2 is a component uncorrelated with Z 1 , and will account for the second largest trend seen in the data. Z 3 describes the third largest component, and so on. The ‘importance’ of each Z is determined by it’s variance [1]. The rationale behind performing PCA on a data set is the idea that hopefully much, or perhaps even most, of the vari- ation seen can be attributed to just a few of the most im- portant principal components. A highly correlated data set can often be described by just a handful of principal com- ponents. Equally, it is possible for the analysis to produce no useful results at all if the original variables are highly uncorrelated. Consider a simple case with two variables from Diamond Light Source, Y 1 and Y 2 (beam positions from two adjacent BPMs). These are plotted against one another in Fig 1. The first principal component, Z 1 , is marked in blue and in- dicates the predominant correlation between the variables. The second component, Z 2 , is marked in red and represents the variation between the variables that is uncorrelated with the first component. The length of the blue and red vectors are the standard deviation of each component ( variance), and thus indicate that component’s importance. In this case, -1 -0.5 0 0.5 1 -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 Y 1 [μm] Y 2 [μm] Figure 1: PCA for two highly correlated variables: two ad- jacent beam position monitors at Diamond Light Source. The blue and red lines indicate the first and second principal components respectively. it is clear that the relationship between Y 1 and Y 2 can be pri- marily described by Z 1 . Z 2 represents noise in this data. The chosen method to calculate the PCA presented in this paper is the singular value decomposition (SVD). This is a factorization of a matrix, based on a theorem from linear algebra. It states that a rectangular matrix, A, of size m × n, where m is the number of datapoints and n is the number of variables, can be broken down into a product of three matrices. Formally, this is usually written A = USV T The computation of the SVD itself is beyond the scope of this paper, but is discussed in great detail elsewhere [2] [3] [4]. For the purposes of this paper, it is enough to know that the SVD is a very general method of computing the principal components of a data set, and that the SVD of a matrix can be robustly and quickly computed in many soft- ware packages (MATLAB, Python, C++, to name a few). DATA SETS FROM DIAMOND LIGHT SOURCE Diamond Light Source archives over 100,000 process variables (PVs), resulting in gigabytes of data being stored every day. For this study 1113 archived PVs relating to Diamond Light Source storage ring parameters were chosen for analysis (temperatures, vacuum pressures, X- ray beam positions, electron BPM quadrupolar differences Q =( A+C)-( B+D), length encoders, loss monitors, and pin- hole camera measurements have all been included). Two years of data was retrieved from the archiver. A sim- ple Gaussian low pass filter was then applied, the data is ‘zeroed’ by subtracting the mean value from each variable, and normalised so that the standard deviation of each vari- able is always 1. This normalisation process is important 5th International Particle Accelerator Conference IPAC2014, Dresden, Germany JACoW Publishing ISBN: 978-3-95450-132-8 doi:10.18429/JACoW-IPAC2014-THPME188 06 Instrumentation, Controls, Feedback & Operational Aspects T03 Beam Diagnostics and Instrumentation THPME188 3719 Content from this work may be used under the terms of the CC BY 3.0 licence (© 2014). Any distribution of this work must maintain attribution to the author(s), title of the work, publisher, and DOI.

Upload: others

Post on 22-Mar-2020

23 views

Category:

Documents


0 download

TRANSCRIPT

USING PRINCIPAL COMPONENT ANALYSIS TO FIND CORRELATIONS

AND PATTERNS AT DIAMOND LIGHT SOURCE

C. Bloomer, G. Rehm, Diamond Light Source, Oxfordshire, UK

Abstract

Principal component analysis is a powerful data analysis

tool, capable of reducing large complex data sets containing

many variables. Examination of the principal components

set allows the user to spot underlying trends and patterns

that might otherwise be masked in a very large volume of

data, or hidden in noise. Diamond Light Source archives

many gigabytes of machine data every day, far more than

any one human could effectively search through for correla-

tions. Presented in this paper are some of the results from

running principal component analysis on years of archived

data in order to find underlying correlations that may other-

wise have gone unnoticed. The advantages and limitations

of the technique are discussed.

INTRODUCTION

Principal component analysis (PCA) is a technique used

to find underlying correlations that exist in a (potentially

very large) set of variables. The objective of the analysis is

to take a set of n variables, Y1, Y2, Y3, ..., Yn , and to find corre-

lations. The most important of these correlations are called

the principal components (PCs). The analysis will return

vectors Z1, Z2, Z3, ..., Zn , each describing a different un-

derlying variation or trend found in the initial data set. The

vectors of Z are ordered by their importance; that is to say

the component Z1 is the most prevalent trend seen through-

out the data, and accounts for more variation than Z2. Z2 is

a component uncorrelated with Z1, and will account for the

second largest trend seen in the data. Z3 describes the third

largest component, and so on. The ‘importance’ of each Z

is determined by it’s variance [1].

The rationale behind performing PCA on a data set is the

idea that hopefully much, or perhaps even most, of the vari-

ation seen can be attributed to just a few of the most im-

portant principal components. A highly correlated data set

can often be described by just a handful of principal com-

ponents. Equally, it is possible for the analysis to produce

no useful results at all if the original variables are highly

uncorrelated.

Consider a simple case with two variables from Diamond

Light Source, Y1 and Y2 (beam positions from two adjacent

BPMs). These are plotted against one another in Fig 1.

The first principal component, Z1, is marked in blue and in-

dicates the predominant correlation between the variables.

The second component, Z2, is marked in red and represents

the variation between the variables that is uncorrelated with

the first component. The length of the blue and red vectors

are the standard deviation of each component (√

variance),

and thus indicate that component’s importance. In this case,

−1 −0.5 0 0.5 1

−1

−0.8

−0.6

−0.4

−0.2

0

0.2

0.4

0.6

0.8

1

Y1 [µm]

Y2 [

µm

]

Figure 1: PCA for two highly correlated variables: two ad-

jacent beam position monitors at Diamond Light Source.

The blue and red lines indicate the first and second principal

components respectively.

it is clear that the relationship between Y1 and Y2 can be pri-

marily described by Z1. Z2 represents noise in this data.

The chosen method to calculate the PCA presented in this

paper is the singular value decomposition (SVD). This is a

factorization of a matrix, based on a theorem from linear

algebra. It states that a rectangular matrix, A, of size m × n,

where m is the number of datapoints and n is the number

of variables, can be broken down into a product of three

matrices. Formally, this is usually written

A = USVT

The computation of the SVD itself is beyond the scope

of this paper, but is discussed in great detail elsewhere [2]

[3] [4]. For the purposes of this paper, it is enough to know

that the SVD is a very general method of computing the

principal components of a data set, and that the SVD of a

matrix can be robustly and quickly computed in many soft-

ware packages (MATLAB, Python, C++, to name a few).

DATA SETS FROM DIAMOND LIGHT

SOURCE

Diamond Light Source archives over 100,000 process

variables (PVs), resulting in gigabytes of data being stored

every day. For this study 1113 archived PVs relating

to Diamond Light Source storage ring parameters were

chosen for analysis (temperatures, vacuum pressures, X-

ray beam positions, electron BPM quadrupolar differences

Q = (A+C)-(B+D), length encoders, loss monitors, and pin-

hole camera measurements have all been included).

Two years of data was retrieved from the archiver. A sim-

ple Gaussian low pass filter was then applied, the data is

‘zeroed’ by subtracting the mean value from each variable,

and normalised so that the standard deviation of each vari-

able is always 1. This normalisation process is important

5th International Particle Accelerator Conference IPAC2014, Dresden, Germany JACoW PublishingISBN: 978-3-95450-132-8 doi:10.18429/JACoW-IPAC2014-THPME188

06 Instrumentation, Controls, Feedback & Operational AspectsT03 Beam Diagnostics and Instrumentation

THPME1883719

Cont

entf

rom

this

wor

km

aybe

used

unde

rthe

term

soft

heCC

BY3.

0lic

ence

(©20

14).

Any

distr

ibut

ion

ofth

isw

ork

mus

tmai

ntai

nat

tribu

tion

toth

eau

thor

(s),

title

ofth

ew

ork,

publ

isher

,and

DO

I.

0 5 10 15 20 25 30 35 40 45 500.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Principal component number

Fra

ctio

n o

f cu

mu

lative

va

ria

nce

2 year data set

Individual run data sets

Figure 2: Cumulative variance accounted for by principal

components for the whole two year data set, and for each

individual 6-8 week user run.

as the PCA results can be influenced by the absolute magni-

tude and magnitude of variation in the input data.

In order to reduce the computational requirements, the

data was then processed to remove periods with no beam

(machine shutdown).

The PCA results for two timescales are presented in this

paper. The full two year data set has been analysed for very

long-term trends and correlation, and smaller subsets of the

data have been analysed for short term variation seen over

the course of a single user run (typically 6-8 weeks).

THE PRINCIPAL COMPONENTS OF TWO

YEARS DATA

By performing PCA we obtain information about the con-

tributions that each principal component makes to the total

variance of the data. Over the complete two year data set

it is found that a handful of principal components make up

the bulk of the variance.

Figure 2 shows the cumulative variance accounted for by

each principal component. Across 1113 variables just the

first four principal components alone account for 50% the

total variance, and 12 components account for an enormous

80% of the total variance. The ‘total variance’ is defined in

this case to be the sum of the variances of each variable

in our data set. It is truly remarkable that so few principal

components account for so much of the variation seen in the

original data set.

Figure 3 shows the two most important principal com-

ponents identified for the two year span, along with a selec-

tion of the normalised variables themselves (shown here are

the variables that exhibit the highest correlation with each

shown principal component).

This is useful information as it allows effort to be fo-

cussed on investigating just the few most important trends.

As the bulk of the variation seen in Diamond Light Source

storage ring PVs can be accounted for by just a relatively

small number of components it makes sense to concentrate

efforts on investigating the causes behind just these largest

contributors. Table 1 lists a few of the storage ring variables

most closely associated with the first two principal compo-

nents.

Table 1: A partial list of variables found to contain the

largest contributions from the first and second principal

components for two years of data. Steerer magnet strengths

(STR), electron BPM quadrupolar differences (EBPM Q),

PIN diode beam loss monitors (PIN), and physical length

encoders measuring the distance between the electron BPM

button blocks and the experimental floor (LENC) are all

found to be correlated.

PVs correlated with:

1st PC 2nd PC

VSTR 06-7 LENC 09-3

EBPM 16-5 Q VSTR 13-3

VSTR 02-2 VSTR 13-4

VSTR 06-5 VSTR 14-4

HSTR 13-1 LENC 07-13

EBPM 04-1 Q HSTR 24-6

PIN 12-17 EBPM 17-5 Q

LENC 18-4 EBPM 12-1 Q

LENC 15-3 LENC 19-6

... ...

25% of total variance 15% of total variance

ANALYSIS OF SHORTER TIMESCALES

Analysing two years of data will very likely result in very

long term trends dominating the PCA results. It is instruc-

tive to also consider a smaller data set to determine whether

or not the same lists of variables are found to be correlated

over different timescales. Figure 4 shows the first and sec-

ond principal components for a single user run (Run 2, 2013

- chosen to present here as it shows the longest continuous

300 mA storage ring run time).

It is interesting to note that at these (shorter) timescales

the PCA has resulted in different principal components,

with a different set of correlated variables. This illustrates

an obvious aspect of PCA: one must carefully choose a sen-

sible data set to use as input. If the analysis of long-term

trends are of most importance, then one must input an ap-

propriately long data set. If variation on short timescales

are important, then providing a huge, year long, data set

will not necessarily yield useful results.

Comparing the PCA results for each individual run

within the complete two year data set shows some general

trends: the same groups of PVs tend to be correlated, al-

though the exact ordering of these groupings among the

principal components is subject to change.

CONCLUSIONS

PCA is a valuable data analysis tool for investigating large

volumes of data, and for identifying the principal trends and

their related variables. It can quickly identify which princi-

pal components provide the largest contribution to variation

in the data. Thus effort can be concentrated in trying to iden-

tify and understand these few most important components,

rather than facing the daunting task of trying to guess which

of the 1113 input variables might be of most importance.

5th International Particle Accelerator Conference IPAC2014, Dresden, Germany JACoW PublishingISBN: 978-3-95450-132-8 doi:10.18429/JACoW-IPAC2014-THPME188

THPME1883720

Cont

entf

rom

this

wor

km

aybe

used

unde

rthe

term

soft

heCC

BY3.

0lic

ence

(©20

14).

Any

distr

ibut

ion

ofth

isw

ork

mus

tmai

ntai

nat

tribu

tion

toth

eau

thor

(s),

title

ofth

ew

ork,

publ

isher

,and

DO

I.

06 Instrumentation, Controls, Feedback & Operational AspectsT03 Beam Diagnostics and Instrumentation

01/04 01/07 01/10 01/01 01/04 01/07 01/10−0.06

−0.04

−0.02

0

0.02

0.04

0.06Principal component 1

01/04 01/07 01/10 01/01 01/04 01/07 01/10−5

0

5Associated PVs

SR06A−PC−VSTR−07:I

SR16C−DI−EBPM−05:SA:Q

SR02A−PC−VSTR−02:I

SR06A−PC−VSTR−05:I

SR13A−PC−HSTR−01:I

01/04 01/07 01/10 01/01 01/04 01/07 01/10−0.06

−0.04

−0.02

0

0.02

0.04

0.06Principal component 2

01/04 01/07 01/10 01/01 01/04 01/07 01/10−5

0

5Associated PVs

SR09C−DI−LENC−03:POSITION

SR13A−PC−VSTR−03:I

SR13A−PC−VSTR−04:I

SR14A−PC−VSTR−04:I

SR19C−DI−LENC−06:POSITION

Figure 3: Two pairs of plots, showing the first and second

principal components respectively, and those variables with

the largest contribution from these components. The top

two plots show the first PC; the bottom two plots show the

second PC. The timespan shown is Jan 2012 - Jan 2014.

However, these results also serve as an important exam-

ple of the limitations of PCA: the analysis only provides in-

formation about correlations, it says nothing conclusive re-

garding causation. (Recall the important maxim correlation

does not imply causation!) For example, the variables iden-

tified as being correlated with the principal components in

Fig. 3 and Fig. 4 include corrector magnets, electron BPM Q,and storage ring length encoders, but the underlying cause

of the variation is not identified. Human intelligence is still

required to scrutinize the results and to draw conclusions.

Of particular interest to the authors is the correlation

found between corrector magnet strengths, electron BPM

Q, and in particular the length encoder measurements. Cor-

relation between electron BPM electrical stability and cor-

rector magnet variation has been long known, but the re-

sults of this analysis show that the physical movement of the

electron BPM blocks is also correlated with the very long

term trends in magnet variation. The correlation between

length encoders and magnets strengths in entirely different

21/04 28/04 05/05 12/05 19/05 26/05

−0.08

−0.06

−0.04

−0.02

0

0.02

0.04

0.06

0.08

Principal component 1

21/04 28/04 05/05 12/05 19/05 26/05−5

0

5Associated PVs

SR08A−PC−HSTR−05:I

SR21C−DI−LENC−08:POSITION

SR08A−PC−HSTR−04:I

SR01C−DI−LENC−11:POSITION

SR16A−PC−VSTR−06:I

21/04 28/04 05/05 12/05 19/05 26/05

−0.08

−0.06

−0.04

−0.02

0

0.02

0.04

0.06

0.08

Principal component 2

21/04 28/04 05/05 12/05 19/05 26/05−5

0

5Associated PVs

SR14C−DI−EBPM−06:SA:Q

SR15C−DI−EBPM−07:SA:Q

SR24C−DI−EBPM−04:SA:Q

SR07C−DI−EBPM−05:SA:Q

SR15C−DI−EBPM−04:SA:Q

Figure 4: As per Fig 3. The top two plots show the first prin-

cipal component; the bottom two show the second principal

component. Their respective correlated variables are plot-

ted underneath. The plots cover 7 weeks, Apr - Jun 2013.

sections of the storage ring is noteworthy. Applying our

human intelligence to the PCA results, we can reason that

the origin of this variation could be real, physical, machine

movement over time, previously thought to be negligibly

small.

REFERENCES

[1] B.F.J. Manly, Multivariate Statistical Methods, (Chapman &

Hall/CRC, 2005).

[2] K. Baker, Singular Value Decomposition Tutorial:

http://www.ling.ohio-state.edu/~kbaker/pubs/

Singular_Value_Decomposition_Tutorial.pdf

[3] M. Richardson, Principal Component Analysis:

http://people.maths.ox.ac.uk/richardsonm/

SignalProcPCA.pdf

[4] J. Shlens, A Tutorial on Principal Component Analysis:

http://www.cs.cmu.edu/~elaw/papers/pca.pdf

5th International Particle Accelerator Conference IPAC2014, Dresden, Germany JACoW PublishingISBN: 978-3-95450-132-8 doi:10.18429/JACoW-IPAC2014-THPME188

06 Instrumentation, Controls, Feedback & Operational AspectsT03 Beam Diagnostics and Instrumentation

THPME1883721

Cont

entf

rom

this

wor

km

aybe

used

unde

rthe

term

soft

heCC

BY3.

0lic

ence

(©20

14).

Any

distr

ibut

ion

ofth

isw

ork

mus

tmai

ntai

nat

tribu

tion

toth

eau

thor

(s),

title

ofth

ew

ork,

publ

isher

,and

DO

I.