Big Challenges with Big Data in Life Sciences


  • Big Challenges with Big Data in Life Sciences

    Shankar Subramaniam, UC San Diego

  • The Digital Human

  • A Super-Moore's Law (adapted from Lincoln Stein, 2012)

  • The Phenotypic Readout

  • Data to Networks to Biology

  • NETWORK RECONSTRUCTION

    Data-driven network reconstruction of biological systems:
    - Derive relationships between input/output data
    - Represent the relationships as a network

    Inverse problem: data-driven network reconstruction.

  • Network Reconstructions

    Reverse engineering of biological networks:

    Structural identification: to ascertain network structure or topology.

    Identification of dynamics: to determine interaction details.

    Main approaches:

    - Statistical methods
    - Simulation methods
    - Optimization methods
    - Regression techniques
    - Clustering

  • Network Reconstruction of Dynamic Biological Systems: Doubly Penalized LASSO

    Behrang Asadi*, Mano R. Maurya*, Daniel Tartakovsky, Shankar Subramaniam
    Department of Bioengineering, University of California, San Diego
    NSF grants STC-0939370, DBI-0641037, and DBI-0835541; NIH grant 5 R33 HL087375-02
    * Equal effort

  • APPLICATION

    Phosphoprotein signaling and cytokine measurements in RAW 264.7 macrophage cells.

  • MOTIVATION FOR THE NOVEL METHOD

    Various methods:
    - Regression-based approaches (least squares) with statistical significance testing of coefficients
    - Dimensionality reduction to handle correlation: PCR and PLS
    - Optimization/shrinkage (penalty)-based approach: LASSO
    - Partial-correlation and probabilistic model/Bayesian-based approaches

    Different methods have distinct advantages and disadvantages. Can we benefit by combining the methods to compensate for the disadvantages?

    A novel method, the Doubly Penalized Least Absolute Shrinkage and Selection Operator (DPLASSO), incorporates both statistical significance testing and shrinkage.

  • LINEAR REGRESSION

    Goal: building a linear-relationship-based model

    $y = Xb + e$

    - X: input data (m samples by n inputs), zero mean, unit standard deviation
    - y: output data (m samples by 1 output column), zero mean
    - b: model coefficients, which translate into the edges in the network
    - e: normal random noise with zero mean

    Ordinary Least Squares solution: $\hat{b} = (X^T X)^{-1} X^T y$

    For dynamic systems the formulation is analogous, with time-shifted state measurements serving as inputs and outputs. A numerical sketch follows.
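
    A minimal numerical sketch of the model above and its OLS solution, in Python with NumPy (the sizes, coefficients, noise level, and seed are illustrative assumptions, not values from the slides):

```python
import numpy as np

# Toy data standing in for the slides' setup: y = X b + e, with X
# standardized and y zero-mean. Sizes and noise level are assumptions.
rng = np.random.default_rng(0)
m, n = 100, 22                      # m samples, n inputs (22 as in the PP data)
X = rng.standard_normal((m, n))
b_true = rng.standard_normal(n)
b_true[: n // 3] = 0.0              # about a third of the true coefficients are zero
y = X @ b_true + 0.1 * rng.standard_normal(m)

# Ordinary Least Squares solution b_hat = (X^T X)^{-1} X^T y,
# computed via lstsq rather than an explicit inverse for stability.
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```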

  • STATISTICAL SIGNIFICANCE TESTING

    Most coefficients come out non-zero, a mathematical artifact, so perform statistical significance testing:
    - Compute the standard deviation of the coefficients
    - Form the ratio $t_i = b_i / \mathrm{std}(b_i)$
    - The coefficient is significant (different from zero) if $|t_i|$ exceeds the critical t-value

    Edges in the network graph represent the coefficients. A sketch of the test follows.

    * Krämer, Nicole, and Masashi Sugiyama. "The degrees of freedom of partial least squares regression." Journal of the American Statistical Association 106.494 (2011): 697-705.
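
    Continuing the OLS sketch above, a hedged implementation of the t-ratio test (the 5% two-sided level is an assumption; the slide does not state a threshold):

```python
from scipy import stats

# Standard error of each OLS coefficient: cov(b_hat) = sigma^2 (X^T X)^{-1},
# with sigma^2 estimated from the residuals of the fit above.
residuals = y - X @ b_hat
dof = m - n                              # degrees of freedom
sigma2 = residuals @ residuals / dof     # noise-variance estimate
se_b = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

t_ratio = b_hat / se_b                   # the slide's ratio b_i / std(b_i)
t_crit = stats.t.ppf(0.975, dof)         # critical value, 5% two-sided
significant = np.abs(t_ratio) > t_crit   # these survive as network edges
```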

  • CORRELATED INPUTS: PLS

    Partial least squares finds the direction in the X space that explains the maximum-variance direction in the Y space.

    - PLS regression is used when the number of observations per variable is low and/or collinearity exists among X values
    - It requires an iterative algorithm: NIPALS, SIMPLS, etc.
    - Statistical significance testing is iterative

    * H. Wold (1975), "Soft modelling by latent variables: the non-linear iterative partial least squares approach," in Perspectives in Probability and Statistics, Papers in Honour of M. S. Bartlett, J. Gani, ed., Academic Press, London.
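
    A short sketch of PLS regression on the same toy data, using scikit-learn's NIPALS-based implementation (the component count of 5 is an assumption; in practice it would be chosen by cross-validation):

```python
from sklearn.cross_decomposition import PLSRegression

# PLS projects X onto a few latent components that best explain y,
# which tolerates collinearity among the inputs.
pls = PLSRegression(n_components=5)
pls.fit(X, y)                  # X, y from the OLS sketch above
b_pls = pls.coef_.ravel()      # coefficients mapped back to input space
```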

  • Derivative of the regression coefficients and degrees of freedom for PLS

    - X: n × p
    - y: n × 1
    - S: n × n
    - M: number of (latent) components

  • LASSO

    Shrinkage version of Ordinary Least Squares, subject to an L1 penalty constraint: the sum of the absolute values of the coefficients must be less than a threshold. The LASSO estimator is defined as:

    $\hat{b}^{\mathrm{lasso}} = \arg\min_b \|y - Xb\|_2^2 \quad \text{subject to} \quad \textstyle\sum_j |b_j| \le t \sum_j |\hat{b}_j^{\mathrm{LS}}|$

    where $\hat{b}^{\mathrm{LS}}$ represents the full least-squares estimates and 0 < t < 1 causes the shrinkage. A sketch follows.

    * Tibshirani, R.: "Regression shrinkage and selection via the Lasso," J. Roy. Stat. Soc. B Met., 1996, 58, (1), pp. 267-288.
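
    A hedged sketch with scikit-learn, which solves the equivalent penalized form min (1/2m)·||y − Xb||² + α·Σ|b_j| rather than the constrained form above; the value of α here is an assumption (larger α corresponds to a smaller threshold t):

```python
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.05)
lasso.fit(X, y)                          # X, y from the sketches above
b_lasso = lasso.coef_                    # many entries shrunk exactly to zero
n_edges = int(np.count_nonzero(b_lasso)) # surviving edges in the network
```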

  • Noise and Missing Data

    A more systematic comparison is needed with respect to:
    - Noise: level and type
    - Size (dimension)
    - Level of missing data
    - Collinearity or dependency among input channels
    - Nonlinearity between inputs/outputs and nonlinear dependency
    - Time-series inputs/outputs and dynamic structure

  • METHODS: Linear Matrix Inequalities (LMI)*

    Converts a nonlinear optimization problem into a linear optimization problem via a congruence transformation.

    Pre-existing knowledge of the system can be added in the form of LMI constraints; the coefficients are then thresholded. A constrained-regression sketch of this idea follows.

    * Cosentino, C., et al., IET Systems Biology, 2007. 1(3): p. 164-173.
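
    The following is not the LMI formulation of Cosentino et al.; it is only a sketch of the general idea that prior knowledge enters as constraints on a convex regression problem, here via cvxpy with hypothetical constraints:

```python
import cvxpy as cp

# Least-squares network fit with illustrative prior-knowledge constraints;
# X, y, n come from the earlier sketches.
b = cp.Variable(n)
constraints = [b[0] == 0,                # hypothetical prior: input 0 has no edge
               b[1] >= 0]                # hypothetical prior: input 1 is activating
problem = cp.Problem(cp.Minimize(cp.sum_squares(X @ b - y)), constraints)
problem.solve()
b_constrained = b.value                  # thresholding would follow, as on the slide
```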

  • METRICS

    Metrics for comparing the methods:
    - Reconstruction from 80% of the datasets, with 20% held out for validation
    - RMSE on the test set, and the number and identity of the significant predictors, as the basic metrics to evaluate the performance of each method
    - Fractional error in estimating the parameters
    - Sensitivity, specificity, G, accuracy (a small helper computing these follows)

    Parameters smaller than 10% of the standard deviation of all parameter values were set to 0 when generating the synthetic data.

    TP: true positive; FP: false positive; TN: true negative; FN: false negative
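
    A small helper computing these metrics from true and estimated coefficient vectors, using the standard definitions (G taken as the geometric mean of sensitivity and specificity, as defined later in the deck):

```python
import numpy as np

def network_metrics(b_true, b_est, tol=1e-8):
    """Sensitivity, specificity, G, and accuracy of recovered edges."""
    true_edge = np.abs(b_true) > tol
    est_edge = np.abs(b_est) > tol
    tp = np.sum(true_edge & est_edge)    # TP: true positive
    tn = np.sum(~true_edge & ~est_edge)  # TN: true negative
    fp = np.sum(~true_edge & est_edge)   # FP: false positive
    fn = np.sum(true_edge & ~est_edge)   # FN: false negative
    sens = tp / (tp + fn)                # sensitivity = TP / (TP + FN)
    spec = tn / (tn + fp)                # specificity = TN / (TN + FP)
    g = np.sqrt(sens * spec)             # geometric mean of the two
    acc = (tp + tn) / (tp + tn + fp + fn)
    return sens, spec, g, acc
```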

  • RESULTS: DATA SETS

    Two data sets for benchmarking:
    - First set: experimental data measured on macrophage cells (phosphoprotein (PP) vs. cytokine)*
    - Second set: synthetic data generated in Matlab

    We build the model using 80% of the data set (the training set) and validate the model on the remaining 20% (the test set); a one-line sketch of such a split follows.

    * Pradervand, S., M.R. Maurya, and S. Subramaniam, Genome Biology, 2006. 7(2): p. R11.
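
    A minimal scikit-learn version of this split (the fixed random seed is an assumption):

```python
from sklearn.model_selection import train_test_split

# 80% training set, 20% test set, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```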

  • RESULTS: PP-Cytokine Data Set

    Schematic representation of phosphoprotein (PP) vs. cytokine signaling: signals were transmitted through 22 recorded signaling proteins as well as other, unmeasured pathways. Only the measured pathways contributed to the analysis.

    Schematic graphs from: Pradervand, S., M.R. Maurya, and S. Subramaniam, Genome Biology, 2006. 7(2): p. R11.

  • PP-CYTOKINE DATASET (courtesy: AfCS)

  • Measurements of cytokines in response to LPS (~250 such datasets)

  • RESULTS: COMPARISON

    Comparison on synthetic noisy data: the methods are applied to synthetic data with 22 inputs and 1 output. About one third of the true input coefficients are set to zero, to test whether the methods identify them as insignificant.

    Effect of noise level: four outputs with 5, 10, 20, and 40% noise levels, respectively, are generated from the noise-free (true) output.

    Effect of noise type: three outputs with white, t-distributed, and uniform noise types, respectively, are generated from the noise-free (true) output. A sketch of this protocol follows.
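
    A sketch of this noise-generation protocol, continuing the toy data from earlier (the t-distribution's 3 degrees of freedom are an assumption; the slides do not specify them):

```python
import numpy as np

rng = np.random.default_rng(1)
y_true = X @ b_true                      # noise-free (true) output
scale = np.std(y_true)

# Effect of noise level: white noise at 5, 10, 20, and 40% of signal scale.
noisy = {lvl: y_true + (lvl / 100) * scale * rng.standard_normal(m)
         for lvl in (5, 10, 20, 40)}

# Effect of noise type at a fixed 20% level: white (Gaussian),
# t-distributed (heavy-tailed), and uniform.
level = 0.2 * scale
y_white = y_true + level * rng.standard_normal(m)
y_tdist = y_true + level * rng.standard_t(3, m)
y_unif  = y_true + level * rng.uniform(-1.0, 1.0, m)
```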

  • RESULTS: COMPARISON

    Variability between realizations of data with white noise: PCR, LASSO, and LMI are used to identify significant predictors for 1000 input-output pairs.

    Mean and standard deviation of the histograms of the coefficients for the three significant predictors common to the three methods:

    Method   Quantity              Predictor #1   #10     #11
             True value            -3.40          5.82    -6.95
    PCR      Mean                  -3.81          4.73    -6.06
             Std.                   0.33          0.32     0.32
             Frac. err. in mean     0.12          0.19     0.13
    LASSO    Mean                  -2.82          4.48    -5.62
             Std.                   0.34          0.32     0.33
             Frac. err. in mean     0.17          0.23     0.19
    LMI      Mean                  -3.70          4.74    -6.34
             Std.                   0.34          0.32     0.34
             Frac. err. in mean     0.09          0.18     0.09

  • RESULTS: COMPARISON

    Comparison of the outcomes of the different methods on the real data: the methods identified unique sets of common and distinct predictors for each output.

    Graphical illustration of PCR, LASSO, and LMI in the detection of significant predictors for output IL-6 in the PP/cytokine experimental dataset:
    - Only the PCR method detects the true input cAMP
    - Zone I provides validation: it highlights the common output of all the methods

  • RESULTS: SUMMARY

    - Comparison with respect to different noise types: LASSO is the most robust method across noise types.
    - Missing data (RMSE): LASSO shows less deviation, i.e., it is more robust.
    - Collinearity: PCR shows less deviation against noise level, and better accuracy and G with increasing noise level.

  • A COMPARISON (Asadi, et al., 2012)

    Increasing noise:
    - RMSE. Score = (average RMSE across noise levels for LS) / (average RMSE across noise levels for the chosen method):
      PCR 0.68 (degrades gradually with noise level), LASSO 0.56, LMI 0.94
    - Standard deviation and error in mean of coefficients. Score = 1 - average(fractional error in mean(10,12,20) + std(10,12,20) / |true associated coefficients|):
      PCR 0.53, LASSO 0.47, LMI 0.55
    - Accuracy/G. Score = average accuracy across noise levels (white noise):
      PCR 0.70, LASSO 0.87, LMI 0.91 (at high noise, all are similar)
    - Fractional error in estimating the parameters. Score = 1 - average fractional error across noise levels (white noise):
      PCR 0.81, LASSO 0.55, LMI 0.78

    Types of noise:
    - Fractional error in estimating the parameters. Score = 1 - average fractional error across noise levels and noise types (20% noise level):
      PCR 0.80, LASSO 0.56, LMI 0.79
    - Accuracy and G. Score = average accuracy across noise levels and noise types:
      PCR 0.71, LASSO 0.87, LMI 0.91

    Dimension ratio / size (m/n = 100/25, 100/50, 400/100):
    - Fractional error in estimating the parameters. Score = 1 - average fractional error across noise levels and ratios:
      PCR 0.77, LASSO 0.53, LMI 0.75
    - Accuracy and G. Score = average accuracy across white-noise levels and ratios:
      PCR 0.66, LASSO 0.83, LMI 0.90

  • DPLASSO: Doubly Penalized Least Absolute Shrinkage and Selection Operator

  • OUR APPROACH: DPLASSO

  • Our approach: DPLASSO includes two parameter-selection layers:

    Layer 1 (supervisory layer): Partial Least Squares (PLS) with statistical significance testing

    Layer 2 (lower layer): LASSO with extra weights on the less informative model parameters derived in layer 1; retain significant predictors and set the remaining small coefficients to zero

    DPLASSO WORKFLOW (a code sketch of the two layers follows)
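
    A minimal sketch of the two-layer idea, not the authors' exact algorithm: PLS coefficient magnitudes stand in for the layer-1 significance score, and the extra LASSO weights are applied by rescaling columns (penalizing w_j·|b_j| is equivalent to running LASSO on X_j / w_j). The weight values and the median cutoff are assumptions:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import Lasso

# Layer 1 (supervisory): PLS-based significance score per parameter.
pls = PLSRegression(n_components=5).fit(X, y)
score = np.abs(pls.coef_.ravel())
weights = np.where(score > np.median(score), 1.0, 3.0)  # 3.0: assumed extra penalty

# Layer 2: weighted LASSO via column rescaling, then map back.
lasso = Lasso(alpha=0.05).fit(X / weights, y)
b_dplasso = lasso.coef_ / weights        # remaining small coefficients can be zeroed
```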

  • DPLASSO: EXTENDED VERSION

    Smooth weights:
    - Layer 1: a continuous significance score (versus binary), via a mapping function (logistic significance score); a sketch follows
    - Layer 2: a continuous weight vector (versus a fuzzy weight vector)
    - Tuning parameter
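
    The slides' exact mapping function is not reproduced here; the following is a plausible logistic significance score under assumed center (tau) and steepness (s) parameters:

```python
import numpy as np

def logistic_weight(t_ratio, tau=2.0, s=1.0):
    """Map a |t|-ratio to a continuous significance score in (0, 1).

    tau and s are assumed tuning parameters, not values from the slides.
    """
    return 1.0 / (1.0 + np.exp(-(np.abs(t_ratio) - tau) / s))
```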

  • APPLICATIONS

    - Synthetic (random) networks: datasets generated in Matlab
    - Biological dataset: fission-yeast cell division cycle model (see below)

  • SYNTHETIC (RANDOM) NETWORKS

    Datasets generated in Matlab using (an analogous Python sketch follows):
    - A linear dynamic system
    - Dominant poles/eigenvalues in the range [-2, 0]
    - Lyapunov stability (informal definition from Wikipedia: if all solutions of the dynamical system that start out near an equilibrium point x_e stay near x_e forever, then the system is Lyapunov stable)
    - A zero-input / excited-state release condition
    - 5% measurement (white) noise
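
    An analogous generator in Python (the originals were generated in Matlab); shifting the spectrum so all real parts are negative is one simple way to meet the stability requirement, and the step size, horizon, and seed are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20                                    # network size (assumed)
A = rng.standard_normal((n, n)) / np.sqrt(n)
A -= np.eye(n) * (np.max(np.linalg.eigvals(A).real) + 1.0)  # make A stable

# Zero-input, excited-state release: simulate dx/dt = A x by forward Euler
# from a random initial state, then add 5% white measurement noise.
dt, steps = 0.05, 200
x = rng.standard_normal(n)
traj = np.empty((steps, n))
for k in range(steps):
    x = x + dt * (A @ x)
    traj[k] = x
traj += 0.05 * np.std(traj) * rng.standard_normal(traj.shape)
```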

  • METRICS

    Two metrics to evaluate the performance of DPLASSO:
    - Sensitivity, specificity, G (geometric mean of sensitivity and specificity), and accuracy
    - The root-mean-squared error (RMSE) of prediction

    TP: true positive; FP: false positive; TN: true negative; FN: false negative

  • TUNING

    Tuning the shrinkage parameter for DPLASSO: the shrinkage parameter at the LASSO level (threshold t) is chosen via k-fold cross-validation (k = 10) on the associated dataset (validation error versus selection threshold t for DPLASSO on the synthetic data set).

    Rule of thumb after cross-validation: the optimal threshold roughly tracks the network connectivity. Example: the optimal value of the tuning parameter for a network with 65% connectivity is roughly equal to 0.65. A cross-validation sketch follows.
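
    A cross-validation sketch with scikit-learn, which tunes the equivalent penalty weight alpha rather than the slide's threshold t:

```python
from sklearn.linear_model import LassoCV

# 10-fold cross-validation over the LASSO shrinkage parameter (k = 10,
# as above); X, y are the training data from the earlier sketches.
model = LassoCV(cv=10).fit(X, y)
print("selected alpha:", model.alpha_)
```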

  • PERFORMANCE COMPARISON: ACCURACY

    PLS has the better performance; DPLASSO provides a good compromise between LASSO and PLS in terms of accuracy for different network densities.

  • PERFORMANCE COMPARISON: PRECISION

    DPLASSO provides a good compromise between LASSO and PLS in terms of precision for different network densities.

  • PERFORMANCE COMPARISON: SENSITIVITY

    LASSO has the better performance; DPLASSO provides a good compromise between LASSO and PLS in terms of sensitivity for different network densities.

  • PERFORMANCE COMPARISON: SPECIFICITY

    DPLASSO provides a good compromise between LASSO and PLS in terms of specificity for different network densities.

  • PERFORMANCE COMPARISON: NETWORK SIZE

    DPLASSO provides a good compromise between LASSO and PLS in terms of accuracy, and also in terms of sensitivity (not shown), for different network sizes.

  • ROC CURVE vs. DYNAMICS AND WEIGHTINGS

    DPLASSO exhibits better performance for networks with slow dynamics. The weighting parameter in DPLASSO can be adjusted to improve performance for networks with fast dynamics.


  • YEAST CELL DIVISION

    Dataset generated via a well-known nonlinear model of the cell division cycle of fission yeast.* The model is dynamic, with 9 state variables.

    * Novak, Bela, et al. "Mathematical model of the cell division cycle of fission yeast." Chaos: An Interdisciplinary Journal of Nonlinear Science 11.1 (2001): 277-286.

  • CELL DIVISION CYCLE

    [Figure: reconstructed networks; panels: True Network (Cell Division Cycle), PLS, DPLASSO, LASSO]


  • RECONSTRUCTION PERFORMANCE

    Case Study I: 10 Monte Carlo simulations, size 20, averaged over the tuning parameters, network density, and Monte Carlo sample datasets

    Method    Accuracy   Sensitivity   Specificity   Lift   SD RMSE/Mean
    LASSO     0.31       0.92          0.16          7.53   0.14
    DPLASSO   0.56       0.73          0.52          6.12   0.08
    PLS       0.60       0.67          0.63          6.05   0.09

    Case Study II: Cell Division Cycle, averaged over the tuning-parameter values

    Method    Accuracy   Sensitivity   Specificity   Lift   SD RMSE/Mean
    LASSO     0.39       0.90          0.05          2.3    0.06
    DPLASSO   0.52       0.90          0.34          2.2    0.07
    PLS       0.59       0.80          0.20          2.1    0.07

  • CONCLUSION

    - A novel method, the Doubly Penalized Least Absolute Shrinkage and Selection Operator (DPLASSO), to reconstruct dynamic biological networks, based on the integration of significance testing of coefficients and optimization, with a smoothing function to trade off between PLS and LASSO.

    - Simulation results on synthetic datasets: DPLASSO provides a good compromise between PLS and LASSO in terms of accuracy and sensitivity for different network densities and different network sizes.

    - For the biological dataset: DPLASSO is best in terms of sensitivity, and a good compromise between LASSO and PLS in terms of accuracy, specificity, and lift.

  • Information Theory Methods

    Farzaneh Farangmehr

  • Mutual Information

    Mutual information is a metric indicating how much information one variable provides for predicting the behavior of another. The higher the mutual information, the more similar the two profiles.

    For two discrete random variables X = {x_1, ..., x_n} and Y = {y_1, ..., y_m}:

    $I(X;Y) = \sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)}$

    where p(x_i, y_j) is the joint probability of x_i and y_j, and p(x_i) and p(y_j) are the marginal probabilities of x_i and y_j. A small computational sketch follows.
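
    A small sketch of this formula for discrete variables given as a joint probability table (the example table is illustrative):

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) = sum_ij p(x_i, y_j) * log2( p(x_i, y_j) / (p(x_i) p(y_j)) ).

    `joint` is an n-by-m array of joint probabilities p(x_i, y_j).
    """
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x_i)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y_j)
    nz = joint > 0                          # skip zero-probability terms
    return float(np.sum(joint[nz] * np.log2((joint / (px * py))[nz])))

# Two dependent binary variables: MI > 0 bits, rising with similarity.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(mutual_information(joint))            # about 0.28 bits
```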

  • Information-theoretical approach: Shannon theory

    Hartley's conceptual framework of information relates the information of a random variable to its probability. Shannon defined the entropy, H, of a random variable X, given a random sample, in terms of its probability distribution:

    $H(X) = -\sum_{i} p(x_i) \log p(x_i)$

    Entropy is a...