[ieee 2010 4th international conference on bioinformatics and biomedical engineering (icbbe) -...

Feature extraction and classification of proteomics data using stationary wavelet transform and naive Bayes classifier

Liu Dan, Huang Yuan-yuan, Ma Chen-xiang

School of Life Science and Technology, Xi'an Jiaotong University Xi'an 710049, P. R. China

e-mail: [email protected]

Abstract- The purpose of the current study was to investigate changes in the serum proteome and to discover potential biomarkers from a publicly available ovarian cancer proteomic dataset. A workflow that combines the stationary wavelet transform with a naive Bayes classifier is presented to select candidate biomarkers from 253 serum proteomic profiles of cancer patients and controls. The method identified correlative mass points and obtained a discriminative pattern with 96.7% sensitivity and 92.7% specificity.

Keywords- mass spectrometry; proteomic; stationary wavelet transform; naive Bayes classifier

I. INTRODUCTION

In recent years, mass spectrometry (MS)-based proteomic profiling has been used to discriminate diseased from healthy individuals, with the aim of discovering molecular markers for disease. These methods have been applied to several different diseases [1-3], with encouraging results. Generally, MS data consist of tens of thousands of measurements and are inherently noisy. The noise is mainly due to interference from the matrix material and sample contamination (chemical noise) and to the physical characteristics of the instrument (electrical noise). Broadly speaking, a mass spectrum plots time-of-flight on the x-axis and ion counts on the y-axis. Alternatively, time-of-flight can be transformed to mass-to-charge ratio (m/z) and ion counts to signal intensity. Data processing in most studies has adopted genetic algorithms and self-organising cluster analysis, which first appeared in Petricoin's paper. Other studies have focused on data regression and dimensionality reduction. Although the results are encouraging, they are open to doubt because the selected biomarkers are binned m/z ratios, and most m/z ratios do not contain biologically relevant information.

From a biological perspective, we believe that only the peaks in MS data may have real biological meaning; peaks constitute the most important features of a single spectrum. In proteomic studies, the goal should be to identify peaks that are related to specific outcomes of different malignant diseases or to specific clinical responses. The proteins corresponding to the selected peaks can then be identified by laboratory experiments.

Typically, a preprocessing step that includes normalization, baseline subtraction, denoising and peak detection must be carried out before any further analysis. The quality of the results of further analysis depends heavily on these preprocessing steps, and especially on the denoising. A variety of studies have focused on denoising, adopting techniques from digital signal processing such as the wavelet transform. Since the MS signal is a digital signal generated by the mass spectrometer, these methods should be able to reveal the valuable information in the MS data. However, that work focuses mainly on peak detection and does not investigate the subsequent procedures, which we found to be equally important. In our implementation, we put forward a workflow that combines the wavelet transform, statistical analysis and data mining to process MS data and find a discriminative pattern.

II. DATASET

The ovarian cancer dataset 8-07-02, which contains 162 instances of ovarian cancer and 91 instances of healthy controls, each composed of 15,154 paired data points, was downloaded from the website of the US National Cancer Institute (http://home.ccr.cancer.gov/).

III. DATA ANALYSIS

A. Preprocessing

We applied several successive analyses to preprocess the raw MS data. We began with outlier screening, in which we removed spectra whose data distribution deviated substantially from the others. For each spectrum we examined the average Pearson's correlation coefficient against all other spectra in the dataset; if the value was clearly lower than that of the other spectra, the corresponding spectrum was discarded. To reduce the noise and dimensionality of the raw data, we used a denoising procedure based on the stationary wavelet transform. The low-frequency baseline of each spectrum was estimated by using multiple shifted windows of 200 bins. Spline approximation was used to regress the varying baseline. The regressed baseline was subtracted from the spectrum, yielding a baseline-corrected spectrum. For peak detection, an m/z value is identified as a peak if the sign of the intensity's slope changes from positive to negative (that is, the first derivative crosses zero). Peaks with intensity below a threshold were considered noise and were discarded.
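The outlier screening step can be illustrated with a minimal sketch. It assumes each spectrum is a plain list of intensities on a shared m/z axis; the function names and the `cutoff` value are hypothetical, since the paper only says that spectra with an obviously lower average correlation were discarded.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_correlations(spectra):
    """Average correlation of each spectrum against all the other spectra."""
    k = len(spectra)
    return [
        sum(pearson(spectra[i], spectra[j]) for j in range(k) if j != i) / (k - 1)
        for i in range(k)
    ]

def screen_outliers(spectra, cutoff):
    """Indices of spectra whose average correlation falls below `cutoff`."""
    return [i for i, r in enumerate(mean_correlations(spectra)) if r < cutoff]

# Toy data: three similar spectra and one with a different shape.
spectra = [
    [1.0, 5.0, 2.0, 8.0, 3.0],
    [1.1, 5.2, 2.1, 7.9, 3.2],
    [0.9, 4.8, 1.9, 8.3, 2.9],
    [8.0, 1.0, 7.0, 2.0, 6.0],
]
flagged = screen_outliers(spectra, cutoff=0.0)  # cutoff chosen for this toy data only
print(flagged)
```

With so few spectra a single outlier also lowers the averages of the good spectra, so in practice the cutoff would be chosen by inspecting the distribution of the average correlations.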

This research was supported by the Undergraduate Innovation Experiment Project of Xi'an Jiaotong University under Grant No.091069823

978-1-4244-4713-8/10/$25.00 ©2010 IEEE

1) Normalize intensity values using total ion current: Normalization reduces variation in signal intensity between spectra. A commonly used normalization method for mass spectrometric data is rescaling each spectrum by its total ion current. In our paper, each spectrum was normalized according to total ion current as follows: calculate the total signal for each spectrum; calculate the average signal by dividing the total signal by the number of data points in each spectrum; calculate the average signal over all spectra using the averages from the previous step; calculate the normalization factor for each spectrum by dividing the average signal over all spectra by the average signal of that spectrum; multiply each value in each spectrum by its normalization factor. This step was performed using the SpecAlign software [4].
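The total ion current steps can be written out as a short sketch. This is not SpecAlign's code; it is an illustration that assumes the normalization factor is the overall average signal divided by each spectrum's own average, which is the direction that brings all spectra to a common scale.

```python
def tic_normalize(spectra):
    """Rescale each spectrum so that all spectra share the same average signal."""
    means = [sum(s) / len(s) for s in spectra]   # average signal of each spectrum
    overall = sum(means) / len(means)            # average signal over all spectra
    factors = [overall / m for m in means]       # one normalization factor per spectrum
    return [[v * f for v in s] for s, f in zip(spectra, factors)]

# Two toy spectra with the same shape but different total ion current.
spectra = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
normalized = tic_normalize(spectra)
print(normalized)
```

After normalization both toy spectra have the same total signal, so their intensities can be compared directly.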

2) MS signal denoising using stationary wavelet transform: The wavelet transform has been used successfully in various applications to remove noise and recover the true signal. It provides high time resolution and low frequency resolution at high frequencies, and high frequency resolution and low time resolution at low frequencies; these properties allow the wavelet transform to represent signals with localized features, such as MS signals. In an MS signal, the localized features are mainly the real signal and the noise, which have different time-frequency attributes. The discrete wavelet transform (DWT) is the most commonly used wavelet algorithm in scientific applications. Because the DWT employs decimators after filtering, both the approximation signal and the detail signal are half the length of the original signal after transformation, which makes it computationally very efficient. However, the classical DWT has a drawback: it is not a time-invariant transform. This means that, even with periodic signal extension, the DWT of a translated version of a signal s is in general not the translated version of the DWT of s. To obtain a more complete characterization of the analyzed signal and to restore translation invariance, a desirable property lost by the classical DWT, we adopted the stationary wavelet transform (SWT, also called the undecimated wavelet transform or maximal overlap DWT). The decimators are removed in the SWT, so the signals are no longer decimated after filtering, and the approximation and detail signals have the same size as the analyzed signal. Compared with the DWT, the SWT therefore provides more precise information.

Fig. 1. Diagram of SWT decomposition

Fig. 1 illustrates the SWT decomposition, where h[n] is a high-pass filter and g[n] is a low-pass filter. In the SWT, the decimation operation after filtering is omitted, so all subband signals (the details D1 and D2 and the approximation A2) have the same size as the input signal s. At each resolution level, the low-pass and high-pass filters g[n] and h[n] have to be up-sampled to keep a consistent multi-resolution analysis. This up-sampling is done by inserting zeros between the filter's coefficients at each level. The detail coefficients are computed as the difference between the low-passed signals from two consecutive levels. Compared with the DWT, the drawback of the SWT is that it requires more storage space and involves more computation.
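The decomposition in Fig. 1 can be sketched in a few lines. The sketch uses the two-tap Haar filters as a stand-in for the db5 filters used in the paper, and it assumes periodic (circular) signal extension; both are simplifications made here for brevity, and the function names are illustrative.

```python
import math

def upsample(filt, level):
    """Insert 2**level - 1 zeros between neighbouring filter coefficients."""
    step = 2 ** level
    out = [0.0] * ((len(filt) - 1) * step + 1)
    for i, c in enumerate(filt):
        out[i * step] = c
    return out

def circ_filter(signal, filt):
    """Circular convolution, so the output has the same length as the input."""
    n = len(signal)
    return [sum(c * signal[(i - k) % n] for k, c in enumerate(filt)) for i in range(n)]

G = [1 / math.sqrt(2), 1 / math.sqrt(2)]    # Haar low-pass filter g[n]
H = [1 / math.sqrt(2), -1 / math.sqrt(2)]   # Haar high-pass filter h[n]

def swt(signal, levels):
    """Return (approximation, [D1, D2, ...]); no subband is decimated."""
    approx, details = list(signal), []
    for level in range(levels):
        details.append(circ_filter(approx, upsample(H, level)))
        approx = circ_filter(approx, upsample(G, level))
    return approx, details

s = [0.0, 1.0, 4.0, 2.0, 0.0, 3.0, 5.0, 1.0]
a2, (d1, d2) = swt(s, 2)

# Translation invariance: transforming a shifted signal shifts the coefficients.
shifted = s[-1:] + s[:-1]
a2s, (d1s, d2s) = swt(shifted, 2)
print(len(a2), len(d1), len(d2))
```

Every subband keeps the length of the input, and shifting the input by one sample shifts every subband by one sample, which is the property the decimated DWT lacks.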

Several commonly used wavelet families, including Daubechies, Symlets, Coiflets and Meyer, were assessed. We finally adopted the Daubechies wavelets (db) because of their good regularity, which allows the signal to be reconstructed smoothly. The regularity index of db wavelets is about N/5 (N is the order), so we selected a moderate order, N = 5.

We used a global threshold Th_global, calculated according to the equation below:

$$Th_{global} = \sigma \sqrt{2 \log n} \qquad (1)$$

where n is the length of the MS data and σ is the standard deviation of the detail coefficients. After soft thresholding is applied to all detail coefficients, the inverse stationary wavelet transform yields the denoised MS data.
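Equation (1) and the soft thresholding rule can be sketched as follows. The sketch takes σ to be the population standard deviation of the detail coefficients, as the text describes; the function names and the toy coefficients are illustrative only.

```python
import math
import statistics

def universal_threshold(detail):
    """Th_global = sigma * sqrt(2 * log(n)), with n the number of samples."""
    sigma = statistics.pstdev(detail)
    return sigma * math.sqrt(2.0 * math.log(len(detail)))

def soft_threshold(coeffs, th):
    """Set coefficients with |c| <= th to zero and shrink the rest toward zero by th."""
    return [math.copysign(max(abs(c) - th, 0.0), c) for c in coeffs]

detail = [0.1, -0.2, 4.0, 0.05, -0.15, -2.5, 0.2, 0.0]
th = universal_threshold(detail)
shrunk = soft_threshold(detail, th)
print(round(th, 4), shrunk)
```

In a full pipeline the shrunk detail coefficients of every level would be passed to the inverse SWT to rebuild the denoised spectrum.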

In signal analysis, the L2-norm (Euclidean length) can be regarded as a measure of the signal energy. We therefore evaluated the performance of denoising by calculating the L2-norm recovery of the denoised signal S_denoised relative to the original signal S_original according to the following equation:

$$\text{L2-norm recovery} = \frac{\lVert S_{denoised} \rVert}{\lVert S_{original} \rVert} \qquad (2)$$

The energy recovery of the denoised mean cancer and mean normal spectra is 0.9967 and 0.9985, respectively, which indicates a high energy recovery rate.
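Equation (2) amounts to a ratio of two Euclidean lengths, as the following small sketch shows. The signals are made-up values used only to exercise the formula; they are not the paper's data.

```python
import math

def l2_norm(signal):
    """Euclidean length of a signal."""
    return math.sqrt(sum(v * v for v in signal))

def l2_recovery(denoised, original):
    """Ratio of the L2-norm of the denoised signal to that of the original."""
    return l2_norm(denoised) / l2_norm(original)

original = [3.0, 4.0, 0.0]
denoised = [3.0, 3.9, 0.0]
ratio = l2_recovery(denoised, original)
print(round(ratio, 4))
```

A ratio close to 1 means that thresholding removed little of the signal's energy.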

3) Baseline subtraction: On first inspection of a raw spectrum generated by mass spectrometry, an elevated baseline is obvious; it is more pronounced at smaller m/z values and levels off to a plateau at larger m/z values. This elevated baseline is mainly caused by chemical noise from the energy-absorbing molecule and by ion overload. Ideally, a spectrum should rest more or less on the zero horizontal line. To make different spectra comparable, this baseline needs to be subtracted from each spectrum. In our paper, the low-frequency baseline of each spectrum was estimated by using multiple shifted windows of 200 bins. Spline approximation was used to regress the varying baseline. The regressed baseline was subtracted from the spectrum, yielding a baseline-corrected spectrum. In addition, the spectra were scaled to have an overall maximum intensity of 100.
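A simplified sketch of windowed baseline estimation is given below. It makes several assumptions that the paper does not state: the baseline level in each window is taken as the window minimum, the windows do not overlap, and linear interpolation replaces the spline approximation to keep the example short.

```python
import math
from bisect import bisect_right

def interp(x, xs, ys):
    """Piecewise-linear interpolation, held constant outside [xs[0], xs[-1]]."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    j = bisect_right(xs, x) - 1
    t = (x - xs[j]) / (xs[j + 1] - xs[j])
    return ys[j] + t * (ys[j + 1] - ys[j])

def estimate_baseline(y, window=200):
    """Take the minimum of each window and interpolate between window centers."""
    centers, levels = [], []
    for start in range(0, len(y), window):
        chunk = y[start:start + window]
        centers.append(start + (len(chunk) - 1) / 2.0)
        levels.append(min(chunk))
    return [interp(i, centers, levels) for i in range(len(y))]

def subtract_baseline(y, window=200):
    """Subtract the estimated baseline from the spectrum."""
    return [v - b for v, b in zip(y, estimate_baseline(y, window))]

def rescale(spectra, top=100.0):
    """Scale all spectra by one factor so that the overall maximum equals `top`."""
    peak = max(max(s) for s in spectra)
    return [[v * top / peak for v in s] for s in spectra]

# Synthetic spectrum: a decaying baseline plus one sharp peak at index 500.
raw = [50.0 * math.exp(-i / 300.0) for i in range(1000)]
raw[500] += 60.0
corrected = subtract_baseline(raw)
scaled = rescale([corrected])
print(corrected.index(max(corrected)))
```

Because one common factor is applied to all spectra in `rescale`, relative intensities between spectra are preserved.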

4) Peak detection: Before peak detection, the noise must be estimated in order to obtain the signal-to-noise ratio (S/N). The median absolute deviation (MAD) was used to estimate the noise. The MAD is a robust statistic: it is more resilient to outliers in a data set than the standard deviation, which makes it a more robust estimator of scale than the sample variance or standard deviation. Because the region of a peak generally contains about 10 m/z values, the noise estimation window size was set to 9 (it must be an odd integer).

A peak is a point where the slope of the intensity changes sign from positive to negative, that is, where the first derivative crosses zero. This method is based on the ideas of Coombes et al. [5, 6].
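The noise estimate and the slope-sign rule can be sketched as follows. The window of 9 points comes from the text; the function names, the toy spectrum and the intensity threshold are hypothetical, and flat-topped peaks are ignored for simplicity.

```python
import statistics

def mad(values):
    """Median absolute deviation of a sequence."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def local_noise(y, window=9):
    """MAD of the intensities inside a sliding window centered on each point."""
    half = window // 2
    return [mad(y[max(0, i - half): i + half + 1]) for i in range(len(y))]

def find_peaks(y):
    """Indices where the slope changes from positive to negative."""
    return [i for i in range(1, len(y) - 1) if y[i] > y[i - 1] and y[i] > y[i + 1]]

y = [0.0, 1.0, 3.0, 1.0, 0.0, 0.2, 0.1, 2.0, 5.0, 2.0, 0.0]
candidates = find_peaks(y)

# Discard low peaks as noise; the threshold value is an illustrative choice.
threshold = 1.0
peaks = [i for i in candidates if y[i] >= threshold]
noise = local_noise(y)
print(candidates, peaks)
```

The small bump at index 5 passes the slope test but is removed by the intensity threshold, which mirrors the filtering of noise peaks described in the text.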

[Figure 2 plot. Legend: baseline-subtracted and denoised data; peaks meeting the S/N threshold; the 5 highest peaks.]

Because of noise, not all detected peaks are real protein peaks, so a criterion is needed to filter out those that are not. We noticed that the mean signal-to-noise ratio of the MS data lies mainly between 50 and 60 dB. In audio denoising and compression, 60 dB is a commonly chosen value, so we also selected 60 dB as the S/N threshold. A larger threshold reduces the number of peaks found. In total, 298 peaks were found; the result is shown in Fig. 2.

Fig. 2. Peaks found in MS data after denoising and baseline subtraction

B. Biomarker selection

1) Statistical test: Not all peaks that satisfy the S/N criterion are useful for the subsequent discrimination; only those that differ significantly between the two groups may be meaningful. We adopted the Kolmogorov-Smirnov test, one of the most commonly used nonparametric methods for comparing two samples, because it is sensitive to differences in both the location and the shape of the empirical cumulative distribution functions of the two samples. The test is two-sided, and the significance level is set to 0.05.
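The two-sample Kolmogorov-Smirnov test can be illustrated with a short sketch that computes the test statistic and an asymptotic two-sided p-value from the Kolmogorov distribution. The paper does not say which implementation was used, so the function names and the intensity values below are assumptions for illustration.

```python
import math

def ks_statistic(a, b):
    """Largest gap between the two empirical cumulative distribution functions."""
    a, b = sorted(a), sorted(b)
    d, i, j = 0.0, 0, 0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided p-value for sample sizes n and m."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return 1.0  # the p-value is indistinguishable from 1 in this range
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, terms + 1))
    return min(max(p, 0.0), 1.0)

# Hypothetical intensities of one peak in the two groups.
cancer = [5.1, 5.3, 4.9, 5.6, 5.2, 5.8, 5.0, 5.4]
control = [3.9, 4.1, 3.7, 4.3, 4.0, 3.8, 4.2, 4.4]
d = ks_statistic(cancer, control)
p = ks_pvalue(d, len(cancer), len(control))
keep_peak = p < 0.05
print(d, p, keep_peak)
```

A peak would be kept as a candidate feature when its p-value is below the 0.05 significance level.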

2) Data mining using naive Bayes classifier: Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. It uses machine learning, statistical and visualization techniques to discover and present knowledge in a form that is easily comprehensible to humans. It includes many algorithms, such as the naive Bayes classifier, the support vector machine and the artificial neural network. The naive Bayes classifier [7] is based on Bayes' theorem and is particularly suited to inputs of high dimensionality, as with MS proteomic data. Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. An advantage of the naive Bayes classifier is that it requires only a small amount of training data to estimate the parameters necessary for classification. Because the variables are assumed to be independent, only the variances of the variables for each class need to be determined, not the entire covariance matrix. In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers can often outperform more sophisticated classification methods.
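A Gaussian naive Bayes classifier shows why only per-class means and variances are needed. The sketch below is an illustration under the assumption of normally distributed features; it is not the WEKA implementation used in the paper, and the class name and toy data are hypothetical.

```python
import math
from collections import defaultdict

class GaussianNaiveBayes:
    """Naive Bayes with one normal distribution per feature and class."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        self.params = {}
        for label, rows in groups.items():
            n = len(rows)
            columns = list(zip(*rows))
            means = [sum(col) / n for col in columns]
            # Small constant keeps the variance strictly positive.
            variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                         for col, m in zip(columns, means)]
            self.params[label] = (math.log(n / len(X)), means, variances)
        return self

    def predict(self, row):
        def log_posterior(label):
            log_prior, means, variances = self.params[label]
            return log_prior + sum(
                -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
                for v, m, var in zip(row, means, variances)
            )
        return max(self.params, key=log_posterior)

# Toy peak intensities (two features per sample).
X = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 0.5], [3.2, 0.4], [2.9, 0.6]]
y = ["control"] * 3 + ["cancer"] * 3
model = GaussianNaiveBayes().fit(X, y)
print(model.predict([1.1, 2.0]), model.predict([3.1, 0.5]))
```

The independence assumption turns the class likelihood into a product over features, which is why the log posterior is a simple sum and no covariance matrix is estimated.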

3) The classification performance was examined by 10-fold cross-validation, which has relatively low bias and variance and is often used for estimating accuracy. The algorithm was implemented using the WEKA machine learning package [8]. A discriminative pattern was obtained with 96.7% sensitivity and 92.7% specificity.
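The evaluation procedure can be sketched as follows. The sketch assumes simple interleaved folds rather than WEKA's own fold assignment, treats "cancer" as the positive class, and uses a hypothetical nearest-mean classifier on one feature as a stand-in for the naive Bayes model.

```python
def k_fold_indices(n, k=10):
    """Partition indices 0..n-1 into k folds whose sizes differ by at most one."""
    return [list(range(i, n, k)) for i in range(k)]

def sensitivity_specificity(truth, predicted, positive="cancer"):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)

def cross_validate(X, y, fit_predict, k=10):
    """Predict each sample once, using a model trained on the other folds."""
    predicted = [None] * len(y)
    for fold in k_fold_indices(len(y), k):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in fold])
        for i, p in zip(fold, preds):
            predicted[i] = p
    return sensitivity_specificity(y, predicted)

def nearest_mean(train_X, train_y, test_X):
    """Hypothetical stand-in classifier: pick the class with the closest mean."""
    means = {}
    for label in set(train_y):
        vals = [x for x, t in zip(train_X, train_y) if t == label]
        means[label] = sum(vals) / len(vals)
    return [min(means, key=lambda c: abs(x - means[c])) for x in test_X]

X = [1.0, 1.2, 0.8, 1.1, 0.9, 3.0, 3.2, 2.8, 3.1, 2.9]
y = ["control"] * 5 + ["cancer"] * 5
sens, spec = cross_validate(X, y, nearest_mean, k=5)
print(sens, spec)
```

Sensitivity is the fraction of cancer samples classified as cancer, and specificity is the fraction of control samples classified as control.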

IV. RESULTS AND CONCLUSION

The processing of the MS signal in this paper consists of two main parts, preprocessing and biomarker selection, and the results are determined mainly by these two steps. For denoising, the SWT is more appropriate than the DWT for this application because of the characteristics of the MS data. Results of our early studies show that the choice of wavelet family does not affect the results much. Wavelets with a higher number of vanishing moments are more regular and lead to smoother approximations. On the other hand, the support of the wavelet increases with the regularity, and boundary effects may arise in the DWT, so a trade-off is often necessary. We analyzed the data with our method using Daubechies wavelets with different numbers of vanishing moments; they showed very similar denoising and detection performance (results not shown). For a given wavelet function, the thresholding method is very important; we used one of the most commonly used threshold values, which gave a good result, as shown by its high energy recovery. After SWT denoising, most of the noise in the MS signal is removed effectively, so the subsequent steps suffer less disturbance.

Statistical analysis is one of the most common approaches to dimensionality reduction and feature reduction. Only peaks that differ statistically between the control and case groups should be regarded as candidate features for the discrimination, so a statistical test is essential. Since the distribution of the peaks of the MS signal is not known, we used a nonparametric test to select the peaks.

The naive Bayes classifier is based on Bayes' theorem and is robust when the dimensionality of the inputs is high. It can be trained very efficiently in a supervised learning setting, and it requires only a small amount of training data to estimate the parameters necessary for classification. In our implementation, we found that the naive Bayes classifier is very fast and well suited to this application. The discriminative pattern obtained achieved high sensitivity, specificity and accuracy.

In this paper, we presented an MS data processing workflow that combines the stationary wavelet transform with a naive Bayes classifier. We showed that the proposed approach can select mass points from the MS dataset. The final selected peaks are more likely to represent identifiable proteins, protein fragments or peptides, which is important for the ultimate goal of identifying proteins or peptides that distinguish disease from control. Once the proteins are identified, the focus will be on validating them through other sample sets and analytical platforms. We believe that computational methods alone cannot solve the complex task of biomarker discovery from mass spectra involving thousands of proteins. In addition to advanced computational methods that are capable of extracting knowledge from complex and high-dimensional data, this task requires careful study design, sample collection and preparation, improved mass spectrometry, well-designed data analysis methods and inter-laboratory validation.

REFERENCES

[1] E. P. Diamandis, "Analysis of Serum Proteomic Patterns for Early Cancer Diagnosis: Drawing Attention to Potential Problems," Journal of the National Cancer Institute, vol. 96, pp. 353-356, 2004.

[2] T. D. Veenstra, et al., "Proteomic patterns for early cancer detection," Drug Discovery Today, vol. 9, pp. 889-897, Oct 15 2004.

[3] S. E. Lana, et al., "Proteomic profiling using SELDI-TOF mass spectrometry: a diagnostic tool for canine B-cell lymphoma," Vet Comp Oncol, vol. 2, p. 106, Jun 2004.

[4] J. W. H. Wong, et al., "SpecAlign-processing and alignment of mass spectra datasets," Bioinformatics, vol. 21, pp. 2088-2090, 2005.

[5] K. R. Coombes, et al., "Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform," Proteomics, vol. 5, pp. 4107-4117, 2005.

[6] J. S. Morris, et al., "Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum," Bioinformatics, vol. 21, pp. 1764-1775, 2005.

[7] P. Domingos and M. Pazzani, "On the optimality of the simple Bayesian classifier under zero-one loss," Machine Learning, vol. 29, pp. 103-130, 1997.

[8] I. H. Witten and E. Frank, Data Mining: Practical machine learning tools and techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005.
