
LICRE Supplementary Material

1 THE LICRE WORKFLOW

An enlarged illustration of the LICRE algorithm workflow is shown in Figure 1.

2 THE LICRE ALGORITHM

The pseudocode for the LICRE algorithm is presented in Algorithms 1 and 2.


Fig. 1. A summary of the LICRE algorithm workflow.


Algorithm 1 LICRE

Input: Complete feature set, F

Output: Reduced feature set, FR

// Apply the threshold selection algorithm

γ = ThresholdSelect(F);

// Compute the pairwise correlation matrix (as in Algorithm 2)

S = correlation(F);

// Create a relevance network whose edges connect features with correlation coefficient greater than the γ threshold

SR = createRelevanceNetwork(S, γ);

// Find all features that form singleton connected components in SR and store them

FR = findSingletons(SR);

SR = removeVertices(SR, FR);

// Find all features that form 2-element components in SR (i.e., pairs)

P = findPairs(SR);

// Select the feature with the higher median measurement from the pair

R = selectHigherMeasurementNode(P);

FR = FR ∪ R;

SR = removeVertices(SR, P);

// Find the feature (centroid) with the highest mean correlation to its neighbours in S, store it, drop its adjacent neighbours, and repeat until all clusters with three or more nodes (features) are removed from SR

while SR contains clusters with three or more nodes do

vx = findMaxAvgCorr(SR);

N = neighbours(SR, vx) ∪ {vx};

SR = removeVertices(SR, N );

FR = FR ∪ {vx};

end while

// Assess resultant residual feature pairs and singletons from centroid pruning

while SR contains feature pairs do

// Find all residual feature pairs

P = findPairs(SR);

// Select the feature with the higher median measurement from the pair

R = selectHigherMeasurementNode(P);

FR = FR ∪ R;

SR = removeVertices(SR, P);

end while

// Identify all residual singletons in SR (all remaining features in SR are singletons) and add them to the reduced feature set, FR

FR = FR ∪ SR;

return FR
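The graph-pruning steps of Algorithm 1 can be sketched in pure Python. This is a minimal illustrative sketch, not the authors' implementation: the dict-of-dicts correlation matrix, the helper names, and the toy data are all assumptions.

```python
def components(adj):
    """Connected components of an adjacency-dict graph."""
    seen, comps = set(), []
    for start in list(adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def licre_reduce(S, median, gamma):
    feats = list(S)
    # Relevance network: edge iff |correlation| exceeds the gamma threshold
    adj = {a: {b for b in feats if b != a and abs(S[a][b]) > gamma}
           for a in feats}
    FR = set()

    def drop(vs):
        for v in vs:
            adj.pop(v, None)
        for nbrs in adj.values():
            nbrs -= vs

    # Singleton components: keep them directly
    singles = {v for v, nbrs in adj.items() if not nbrs}
    FR |= singles
    drop(singles)
    # Pair components: keep the feature with the higher median measurement
    for pair in [c for c in components(adj) if len(c) == 2]:
        FR.add(max(pair, key=median.get))
        drop(pair)
    # Centroid pruning: while clusters of three or more nodes remain, keep
    # the feature with the highest mean correlation to its neighbours and
    # drop it together with those neighbours
    while any(len(c) >= 3 for c in components(adj)):
        vx = max((v for v in adj if adj[v]),
                 key=lambda v: sum(abs(S[v][u]) for u in adj[v]) / len(adj[v]))
        FR.add(vx)
        drop(adj[vx] | {vx})
    # Residual pairs, then residual singletons
    for pair in [c for c in components(adj) if len(c) == 2]:
        FR.add(max(pair, key=median.get))
        drop(pair)
    return FR | set(adj)

# Toy data: {A, B, C} form one tight cluster, {D, E} a pair, F a singleton
feats = "ABCDEF"
high = {("A", "B"): 0.9, ("A", "C"): 0.95, ("B", "C"): 0.85, ("D", "E"): 0.9}
S = {a: {b: high.get((a, b), high.get((b, a), 0.1))
         for b in feats if b != a} for a in feats}
median = {"A": 1, "B": 2, "C": 3, "D": 5, "E": 7, "F": 0}
reduced = licre_reduce(S, median, gamma=0.8)
```

On this toy input the reduction keeps the cluster centroid A, the higher-median pair member E, and the singleton F.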


Algorithm 2 Threshold Selection (ThresholdSelect)

Input: Complete feature set, F

Output: Correlation coefficient threshold, γ

// Compute pairwise correlation

S = correlation(F );

// Construct a Maximum Spanning Tree (MST), Tmst, from the VAT reordering of S

Tmst = VAT(S);

// Compute a histogram of the edge weights (correlation coefficients) of Tmst

HR = histogram(Tmst);

// Apply soft means clustering to identify a threshold that delineates the cluster of high correlation coefficients in the histogram

γ = softmeansclustering(HR);

return γ
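The threshold selection of Algorithm 2 can be sketched similarly. Two simplifications are assumed here: Prim's algorithm builds the maximum spanning tree directly (standing in for the VAT-based construction), and plain 2-means on the tree's edge weights stands in for the histogram plus soft means clustering step.

```python
def threshold_select(S):
    """Sketch of ThresholdSelect: build a maximum spanning tree over the
    absolute correlations, then split its edge weights into a low and a
    high group; gamma is the midpoint between the two group centroids."""
    feats = list(S)
    in_tree, edges = {feats[0]}, []
    while len(in_tree) < len(feats):
        # Heaviest edge leaving the tree (Prim's step for a *maximum* tree)
        w, _, v = max((abs(S[a][b]), a, b)
                      for a in in_tree for b in feats if b not in in_tree)
        in_tree.add(v)
        edges.append(w)
    # 2-means on the edge weights, initialised at the extremes
    lo, hi = min(edges), max(edges)
    for _ in range(50):
        low = [e for e in edges if abs(e - lo) <= abs(e - hi)]
        high = [e for e in edges if abs(e - lo) > abs(e - hi)]
        lo = sum(low) / len(low) if low else lo
        hi = sum(high) / len(high) if high else hi
    return (lo + hi) / 2

# A toy correlation matrix: two tight groups over a 0.1 background
feats = "ABCDEF"
high = {("A", "B"): 0.9, ("A", "C"): 0.95, ("B", "C"): 0.85, ("D", "E"): 0.9}
S = {a: {b: high.get((a, b), high.get((b, a), 0.1))
         for b in feats if b != a} for a in feats}
gamma = threshold_select(S)
```

On this input the tree's edge weights separate into a low group {0.1, 0.1} and a high group {0.9, 0.9, 0.95}, giving a gamma of about 0.51.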


3 RESULTS ON CLINICAL DATASETS

Tables 1–3 summarise the classification performance on three example datasets: (1) Diabetes versus NGT, (2) Prediabetes versus NGT, and (3) Unstable CAD versus Stable CAD. Example datasets (1) and (2) were based on lipidomics performed on subjects drawn from The Australian Diabetes, Obesity and Lifestyle Study, the largest Australian longitudinal population-based study examining the natural history of diabetes, pre-diabetes, heart disease and kidney disease. Example dataset (3) was based on lipidomics performed on subjects with unstable coronary artery disease and stable coronary artery disease.

Models were trained on LICRE feature-reduced datasets and on complete (unreduced) datasets, the latter for benchmarking purposes. The order of feature inclusion into the models was determined by five approaches:

• Feature selection by univariate area under the receiver operating characteristic (ROC) curve

o Each feature was assessed univariately in terms of the area under the ROC curve achieved by the feature alone. The feature with the highest area under the ROC was the first to be included, followed by the feature with the second highest AUC, etc.

• Feature selection by univariate parametric test (t-test)

o The order of feature inclusion was determined by the p-value obtained from a parametric univariate test, i.e., Student's t-test on log-transformed lipid measurements – the most significant feature was first to be included.

• Feature selection by univariate non-parametric test (Mann-Whitney U test)

o The order of feature inclusion was determined by the p-value obtained from a non-parametric univariate test i.e., the Mann-Whitney U-test – the most significant feature was first to be included.

• Feature selection by the reliefF algorithm (reliefF)

o The order of feature inclusion was determined by the reliefF (Robnik-Šikonja and Kononenko, 2003) algorithm.

• Feature selection by LASSO (LASSO)

o The order of feature inclusion was determined by the absolute value of the fitted coefficients of LASSO (Tibshirani, 1996) – the feature with the largest absolute fitted coefficient was first to be included.
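As an illustration of the first of these approaches, univariate AUC ranking can be sketched as follows. The `max(a, 1 - a)` flip, which lets anti-correlated features also rank highly, is an assumption of this sketch, as is the toy data.

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney rank identity:
    the probability that a random case outscores a random control."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_by_univariate_auc(X, y):
    """Order feature indices best-first by each feature's standalone AUC."""
    scores = []
    for j in range(len(X[0])):
        a = auc([row[j] for row in X], y)
        scores.append(max(a, 1 - a))  # assumed flip for anti-correlated features
    return sorted(range(len(X[0])), key=lambda j: scores[j], reverse=True)

# Toy data: feature 0 separates the classes perfectly, feature 1 barely
X = [[10, 0.5], [9, 0.1], [1, 0.4], [2, 0.9]]
y = [1, 1, 0, 0]
order = rank_by_univariate_auc(X, y)
```

Here feature 0 achieves an AUC of 1.0 and is ranked first.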

Model training and testing was performed within a 3-fold cross-validation framework repeated 200 times. The classifier model used was C-SVC (C-Support Vector Classifier) with a polynomial kernel as implemented in LIBSVM (Chang and Lin, 2011).
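The resampling skeleton of that framework can be sketched as follows; the classifier itself (LIBSVM's C-SVC in the text) would be fitted where indicated.

```python
import random

def repeated_kfold(n, k=3, repeats=200, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation repeated
    `repeats` times with reshuffling, mirroring the 3-fold x 200 framework."""
    rng = random.Random(seed)
    idx = list(range(n))
    for _ in range(repeats):
        rng.shuffle(idx)
        folds = [idx[f::k] for f in range(k)]
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            # A classifier (the text uses LIBSVM's C-SVC with a polynomial
            # kernel) would be fitted on `train` and scored on `test` here.
            yield train, test

splits = list(repeated_kfold(9, k=3, repeats=2, seed=1))
```

Each repeat reshuffles the sample indices and partitions them into k disjoint test folds, so every sample is tested exactly once per repeat.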

Figures 2–4 show the area under the ROC curve (mean and 95% confidence intervals) for various combinations of feature selection approaches applied to LICRE-reduced datasets or complete (unreduced) datasets. Table 4 summarises the features most frequently incorporated into each model, based on the feature selection approach that, combined with LICRE, yielded the highest area under the ROC curve. The features selected from LICRE-reduced datasets were more evenly distributed across lipid classes, while features selected from an unreduced dataset were concentrated within a limited number of lipid classes. The features from the latter set are highly correlated, while the features selected from the LICRE-reduced dataset are less correlated.


Table 1. Comparison of Diabetes vs Normal Glucose Tolerance (NGT) Classification Performance Based on LICRE-Reduced Dataset and Full Dataset Paired with Different Feature Selection Approaches. Values are area under the ROC curve (95% confidence interval) at 5, 10, and 30 included features.

Feature Selection | LICRE @ 5 | Full @ 5 | LICRE @ 10 | Full @ 10 | LICRE @ 30 | Full @ 30
ROC | 0.774 (0.771, 0.778) | 0.743 (0.739, 0.747) | 0.779 (0.776, 0.783) | 0.753 (0.749, 0.757) | 0.791 (0.788, 0.794) | 0.769 (0.766, 0.773)
t-test | 0.769 (0.766, 0.773) | 0.737 (0.733, 0.741) | 0.770 (0.767, 0.773) | 0.740 (0.736, 0.744) | 0.771 (0.767, 0.775) | 0.738 (0.734, 0.742)
Mann-Whitney U test | 0.767 (0.763, 0.771) | 0.675 (0.671, 0.679) | 0.779 (0.776, 0.783) | 0.703 (0.699, 0.707) | 0.778 (0.774, 0.781) | 0.752 (0.748, 0.756)
reliefF | 0.761 (0.757, 0.765) | 0.644 (0.640, 0.649) | 0.769 (0.765, 0.773) | 0.674 (0.670, 0.679) | 0.772 (0.768, 0.776) | 0.727 (0.723, 0.731)
LASSO | 0.672 (0.665, 0.679) | 0.631 (0.625, 0.638) | 0.744 (0.739, 0.748) | 0.690 (0.685, 0.695) | 0.773 (0.769, 0.776) | 0.741 (0.737, 0.745)


Table 2. Comparison of Prediabetes vs NGT Classification Performance Based on LICRE-Reduced Dataset and Full Dataset Paired with Different Feature Selection Approaches. Values are area under the ROC curve (95% confidence interval) at 5, 10, and 30 included features.

Feature Selection | LICRE @ 5 | Full @ 5 | LICRE @ 10 | Full @ 10 | LICRE @ 30 | Full @ 30
ROC | 0.722 (0.717, 0.727) | 0.662 (0.657, 0.668) | 0.734 (0.729, 0.738) | 0.681 (0.677, 0.686) | 0.725 (0.721, 0.729) | 0.699 (0.694, 0.703)
t-test | 0.725 (0.720, 0.729) | 0.656 (0.649, 0.663) | 0.738 (0.733, 0.742) | 0.681 (0.676, 0.686) | 0.726 (0.722, 0.730) | 0.708 (0.703, 0.713)
Mann-Whitney U test | 0.696 (0.691, 0.701) | 0.561 (0.555, 0.566) | 0.726 (0.722, 0.731) | 0.578 (0.572, 0.584) | 0.727 (0.723, 0.731) | 0.668 (0.663, 0.672)
reliefF | 0.724 (0.719, 0.729) | 0.664 (0.659, 0.670) | 0.732 (0.727, 0.736) | 0.685 (0.680, 0.690) | 0.722 (0.718, 0.727) | 0.705 (0.700, 0.710)
LASSO | 0.585 (0.577, 0.592) | 0.550 (0.543, 0.557) | 0.637 (0.631, 0.643) | 0.594 (0.588, 0.600) | 0.704 (0.699, 0.708) | 0.664 (0.659, 0.669)


Table 3. Comparison of Unstable CAD vs Stable CAD Classification Performance Based on LICRE-Reduced Dataset and Full Dataset Paired with Different Feature Selection Approaches. Values are area under the ROC curve (95% confidence interval) at 5, 10, and 30 included features.

Feature Selection | LICRE @ 5 | Full @ 5 | LICRE @ 10 | Full @ 10 | LICRE @ 30 | Full @ 30
ROC | 0.783 (0.778, 0.787) | 0.771 (0.767, 0.776) | 0.802 (0.797, 0.806) | 0.788 (0.784, 0.793) | 0.787 (0.782, 0.792) | 0.791 (0.786, 0.795)
t-test | 0.781 (0.777, 0.786) | 0.771 (0.766, 0.775) | 0.783 (0.778, 0.788) | 0.777 (0.772, 0.782) | 0.755 (0.751, 0.760) | 0.764 (0.759, 0.769)
Mann-Whitney U test | 0.753 (0.748, 0.758) | 0.729 (0.723, 0.734) | 0.775 (0.770, 0.780) | 0.762 (0.757, 0.767) | 0.776 (0.771, 0.781) | 0.774 (0.769, 0.780)
reliefF | 0.761 (0.757, 0.766) | 0.675 (0.669, 0.682) | 0.770 (0.766, 0.775) | 0.742 (0.737, 0.747) | 0.782 (0.778, 0.787) | 0.775 (0.771, 0.780)
LASSO | 0.641 (0.633, 0.649) | 0.644 (0.637, 0.651) | 0.706 (0.699, 0.713) | 0.699 (0.693, 0.705) | 0.761 (0.756, 0.766) | 0.752 (0.747, 0.757)


Fig. 2. Comparison of Diabetes vs NGT Classification Performance for Different Models with and without LICRE Reduction


Fig. 3. Comparison of Prediabetes vs NGT Classification Performance for Different Models with and without LICRE Reduction


Fig. 4. Comparison of Unstable CAD vs Stable CAD Classification Performance for Different Models with and without LICRE Reduction


Table 4. Features Ranked by Frequency of Inclusion (within the first five features) in Each Model

Rank | Diabetes vs NGT, LICRE (ROC) | Diabetes vs NGT, Full (ROC) | Prediabetes vs NGT, LICRE (t-test) | Prediabetes vs NGT, Full (t-test) | Unstable vs Stable CAD, LICRE (ROC) | Unstable vs Stable CAD, Full (ROC)
1 | DG 16:0/22:6 | DG 16:0/22:6 | DG 16:0/22:6 | TG 16:1/16:1/18:1 | PC 34:5 | PC 34:5
2 | dhCer 18:0 | CE 16:2 | TG 16:1/18:1/18:2 | TG 16:1/18:1/18:1 | PC 36:4 | PC 36:4
3 | PE 38:1 | dhCer 18:0 | dhCer 18:0 | TG 16:0/16:1/18:1 | PC 34:3 | PC 34:4
4 | PG 36:2 | DG 16:0/18:1 | TG 16:0/16:0/18:1 | DG 16:1/18:0 | DG 14:0/14:0 | PC 34:3
5 | CE 24:5 | TG 16:0/16:0/18:1 | COH | DG 18:0/18:1 | PC 38:6a | DG 14:0/14:0
6 | CE 18:1 | DG 16:0/16:0 | dhCer 22:0 | TG 16:0/18:1/18:1 | PC(O-32:2) | PC(O-32:2)
7 | TG 16:1/16:1/18:0 | DG 18:0/18:1 | PE 38:1 | dhCer 18:0 | LPC 14:0 | PC 38:6a
8 | LPC(O-24:2) | CE 16:1 | Cer 20:0 | DG 16:1/18:1 | PI 36:3 | LPC 14:0
9 | PC(O-34:2) | TG 16:0/16:0/18:2 | PG 36:2 | TG 16:1/16:1/16:1 | CE 14:0 | PI 36:3
10 | CE 22:4 | CE 20:3 | PC(O-36:2) | DG 16:0/22:6 | PE(O-34:1) | LPC 18:3
11 | CE 24:6 | TG 16:0/16:0/16:0 | CE 24:5 | DG 18:1/18:1 | PE(O-36:4) | PC 36:6
12 | SM 39:1 | CE 20:4 | PC 40:5 | DG 16:0/18:1 | PE(P-34:2) | PE(O-36:3)
13 | dhCer 24:0 | DG 18:1/18:1 | CE 24:6 | TG 14:1/18:1/18:1 | PC 28:0 | CE 14:0
14 | LPC(O-24:1) | TG 18:0/18:1/18:1 | PG 34:2 | TG 14:1/18:0/18:2 | CE 18:3 | PE(O-34:1)
15 | PC 37:4 | TG 16:0/18:1/18:1 | PC(O-36:4) | dhCer 24:1 | PI 38:2 | PE(O-36:4)
16 | LPC 18:2 | CE 22:6 | PS 34:1 | TG 18:0/18:1/18:1 | PI 36:4 | PE(P-34:2)
17 | PS 40:6 | CE 18:2 | CE 20:2 | TG 18:1/18:1/22:6 | PC(O-35:4) | CE 18:3
18 | LPC(O-22:0) | CE 22:5 | PC 37:4 | COH | TG 14:0/16:0/18:2 | PC 28:0
19 | Cer 20:0 | PE 40:6 | TG 18:2/18:2/20:4 | dhCer 22:0 | Cer 18:0 | PI 38:2
20 | PI 38:6 | DG 16:1/18:0 | PE 38:5 | TG 16:1/18:1/18:2 | PI 36:1 | PI 36:4


4 COMPARISON TO PCA

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of correlated variables into a set of values of linearly uncorrelated variables called principal components. Like LICRE, PCA is an unsupervised approach, but unlike LICRE the derived principal components are difficult, if not impossible, to interpret biologically, since each is constructed as a linear weighted combination of multiple features (lipids). Using principal component scores as features (predictors) discards the biological meaning of the original features, which renders the approach unsuited to biomarker discovery. It is also unclear how lipid species would be selected from the derived principal components for the purpose of developing clinical tests for disease diagnosis/prognosis.

In contrast, LICRE data reduction retains the original representation and biological meaning of each feature in the reduced dataset: lipids remain unaltered as features, and the interpretation of each feature in the LICRE-reduced dataset is direct because its original meaning is retained. Consequently, the task of identifying lipid biomarkers for disease screening, diagnosis, or prognosis is straightforward.

Additionally, principal components seek out the directions of greatest variation in the data, but the greatest variation is not guaranteed to be biologically driven; it may instead arise for technical reasons, e.g., batch variation between samples from different runs on the mass spectrometer. In this scenario, data reduction with PCA will result in the loss of biological information. LICRE avoids this pitfall because it focuses on feature correlation, summarising each cluster of highly correlated features with a single representative feature, without relying on optimisation of other data parameters.

Finally, because the computed principal components are data dependent, they change whenever models are trained and tested within a cross-validation framework, as the subsets of data used for training and testing are resampled. This makes comparison of the models and their features (principal components) meaningless, since different principal components are constructed in each trial of cross-validation.
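The claim that each principal component is a weighted combination of all features can be illustrated with a few lines of power iteration. The data here are hypothetical, and power iteration is a minimal stand-in for a full PCA routine.

```python
def leading_pc(X, iters=200):
    """Leading principal component of data matrix X (list of rows) via
    power iteration on the sample covariance matrix."""
    n, d = len(X), len(X[0])
    means = [sum(col) / n for col in zip(*X)]
    Xc = [[x - m for x, m in zip(row, means)] for row in X]
    C = [[sum(row[a] * row[b] for row in Xc) / (n - 1)
          for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Two hypothetical, strongly correlated "lipid" measurements
X = [[1, 2], [2, 4.1], [3, 6.2], [4, 7.9]]
v = leading_pc(X)
# Both features receive substantial non-zero weights, so the component
# cannot be read as any single lipid
```

This is the interpretability cost described above: the component is a mixture, whereas a LICRE-reduced dataset keeps individual, nameable lipids as features.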


REFERENCES

Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 27.

Robnik-Šikonja, M. and Kononenko, I. (2003). Theoretical and empirical analysis of ReliefF and RReliefF. Mach. Learn., 53(1-2), 23–69.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), 58, 267–288.