american statistical association october 23 2009 presentation part 1
DESCRIPTION
large data set is not available for some disease such as Brain Tumor. This and part2 presentation shows how to find "Actionable solution from a difficult cancer datasetTRANSCRIPT
Size distribution of microtumors
0
10
20
30
40
50
60
70
80
90
2000
0
6000
0
1000
00
1400
00
1800
00
2200
00
2600
00
3000
00
3400
00
3800
00
4200
00
4600
00
5000
00
5400
00
5800
00
6200
00
6600
00
7000
00
7400
00
7800
00
8200
00
8600
00
9000
00
9400
00
Size limit (μm2)
Num
ber
of m
icro
tum
ors
A range of sizes and morphologies observed: MicrotumorsNumber and Average Size of Aggregates and Microtumors
0
20000
40000
60000
80000
100000
120000
140000
160000
ALLSTRUCTURES
AGGREGATESNOT MELANIZED
MICROTUMORSNOT MELANIZED
MICROTUMORSPARTIALLYMELANIZED
MICROTUMORSSTRONGLYMELANIZED
Pro
jection A
rea (µm
²) .
(n=932)
(n=513)
(n=268)
(n=55)
(n=96)
Ubc9- dif- dl-
Ubc9-
-
Microtumor
MicrotumorMicrotumor
AggregateCluster
Aggregate
SmallMicrotumor
Fat Body
419Projection >10,000 m2
Estimated volume: 0.5 mm3 -1 mm3 932
513
Fruitfly Tumors
Genotype Number of Larvae
Ubc9-(transheterozygote) 58
Bc + Ubc9- 55 95% CI
Odds Ratio: NS>5% 0.85- 1.25
Ubc9-
Aggregates + Tumors Aggr Tumors
Totals 932 513 419
% 55.04% 44.96%
Bc Ubc9/+ Ubc9-
Aggregates + Tumors Aggr Tumors
Totals 874 262 612
% 29.98% 70.02%
(Chiu et al 2005) : dUbc9 negatively regulates the Toll-NF-nB pathways in larval hematopoiesis and drosomycin activation in Drosophila. Developmental Biology.
Bc allele backgroundFlyBase GBrowse
modENCODE GBrowse Gene Dmel\Bc
FB2009_07, released August 10, 2009 General Information Symbol Dmel\Bc Species D. melanogaster Name Black cells Annotation symbol CG5779 Feature
type protein_coding_gene FlyBase ID FBgn0000165 Gene Model Status Current Stock availability 68 publicly available Genomic Location Chromosome (arm) 2R Recombination map 2-80.6 Cytogenetic map 54F6-54F6 Sequence location
2R:13,774,718..13,777,477 [-] Genomic MapsThe gene Black cells is referred to in FlyBase by the symbol Dmel\Bc (CG5779, FBgn0000165). It is a protein_coding_gene
from Drosophila melanogaster. Its sequence location is 2R:13774718..13777477. It has the cytological map location 54F6. Its molecular function is described as: monophenol monooxygenase activity; oxygen transporter activity;
oxidoreductase activity. It is involved in the biological processes: defense response; melanization defense response; scab formation; response to symbiont; response to wounding; transport. 10 alleles are reported. The phenotypes of these
alleles are annotated with: crystal cell; hemocyte; hemolymph; lymph gland; adult; procrystal cell; lamellocyte; posterior lymph gland pair. It has one annotated transcript and one annotated polypeptide.
Takehana, A., Katsuyama, T., Yano, T., Oshima, Y., Takada, H., Aigaki, T., Kurata, S. (2002). Overexpression of a pattern-recognition receptor, peptidoglycan-recognition protein-LE, activates
imd/relish-mediated antibacterial defense and the prophenoloxidase cascade in Drosophila larvae. Proc. Natl. Acad. Sci. U.S.A. 99(21): 13705--13710.
Ye, Y.H., Chenoweth, S.F., McGraw, E.A. (2009). Effective but costly, evolved mechanisms of defense against a virulent opportunistic pathogen in Drosophila melanogaster. PLoS Pathog. 5(4): e1000385.
Comparative Analysis of Area limits 25K to 300K and 300K to 600K in both Genotypes: Higher Maximum Likelihood mean, variances and wider confidence interval of 25K-300K shows faster mitosis and cell death than that of 300K-600K.
Maximum Likelihood (ML) Estimates of BC-All (BC-lwr) and lwr43-5 All
BC-All
Mean Tumors Variance Tumors 95% Confidence Interval
25K-300K 4.86 0.85 1.22 to 1.84
300K-600K 1.67 0.02 1.11 to 1.20
lwr43-5 All
Mean Tumors Variance Tumors 95% Confidence Interval
25K-300K 4.5 0.97
1.10 to 1.88
300K-600K 1.27 0.02 1.05 to 1.12
25K-300K Area Size Tumor Log-Normal Distribution in BC-All and Recessive Genotypes (number of micro tumor found or
frequency on Y-axis; every 25K scale)
PROBLEM STATEMENTTumor size data from non-random and correlated data. Samples were prepared for 8 days and scored on 9th day- cumulative effects on frequencies of BC-All and recessive (lwr-) Area size Units between 25k to 600k size distributions? Effects of new VS experienced PhD student on data collection?
612 VS 419. This difference is not statistically significant (P> 5%). EXPECTED frequency higher at all area size for Semidominant gene in the hypothetical Y-axis.
Does not have a pattern to quantify by a Dynamical simulation equations- tried 100’s of published math methods…. Sample size is ONLY 48 rows of Tumor Frequency data!
ASA 10/23/2009 Minneapolis Presentation Predictive Modeling, Mathematical Simulations and Data Mining: Making Sense Out of Really
Difficult Cancer Data.Navin K. Sinha, MS (Statistical Genetics), MS (Biometrics) and MBA (Decision Sciences)
Bc mutation alters aggregate proportions?Bc = Black cellsSemidominantDead crystal cellsVisible easily
Analysis of Raw data showing V-shape residual and compensatory response by 25K area limit (R-square = 0.36 VS 0.76 VS 0.86). Data Analysis needs Dynamical
Simulations, Reverse Engineering Algorithms and Simulated OLS Regression.
LITERATURE REVIEW & METHODSDynamical Simulation by Taylor’s Power Series like Math equation: A. Y= x1 + x2+x3 + x4. Reference: “Lee Specter and Shawn Luke- Culture Enhances the Evolvability of Cognition. 1996. In Proceedings of the Eighteenth Annual Conference of the Cognitive Science Society. “
According to Specter and Luke, special type of Dynamical Simulation is Symbolic Regression- “to produce a function, in symbolic form, that fits a provided set of data points. For each element of a set of (x,y) points, the function should map the x value to an appropriate y value. This sort of problem faced by a scientist who has obtained a set of experimental data points and suspects that a simple formula will suffice to explain the data”. This method is a standard example from Dynamical simulation and used in many different types of biological systems (Koza, J.R. 1992. Genetic Programming: on the programming of computers by means of natural selection. Cambridge, MA, MIT Press).
B. Reverse Engineering Prediction by the equation of y = 4.251a2 + ln(a2) + 7.243ea- CF. (Candida Ferreira. 2003. www.gene-expression-
programming.com/author.asp- equation 3.2 )
Ekaterina Vladislavleva- June 2008- PhD Theses
Models to exhibit not only requiredproperties, but also additional convenient properties like compactness, small
number of constants, etc. It is important, that generated models are interpretable and transparent, in order to provide additional understanding
of the underlying system or process.
Modified Candida Ferreira Method (Equation 3.2): Correction Factor (CF)- Genetic Fitness not as Underestimated: Consistency
in Results. Original Frequency Data (Y-axis) Residual Plot
Residual Plot of Graph of a Function after Matrix Algebra Treatment.
Reverse Engineering of Polynomial Models of Gene Regulatory Networks (Visual Analytics = Meta Modeling = what are the ranges of input variables that cause theresponse to take certain values, not necessarily optimal? )
Dr. Eduardo Mendoza Mathematics Department Center for NanoScience Ludwig-Maximilians-University Munich, Germany [email protected]
Brody et al. October 1, 2002: PNAS: Significance and Statistical Errors in the analysis of DNA microarray data. 99 (20): 12975-12978 (Even for Lorentizian like distributions, median of ratios provide distributions more Gaussian like).
Reverse Engineering of SystemsSystems identification in Engineering: goal is to construct a
system with prescribed dynamical properties
In Systems Biology, one is interested in identifying as closely as possible a unique biological system that
has been observed experimentally
In both cases: sparsity of available measurements will leave
the system underdetermined (GIGO- Uninterpretable)
Mathematical Genetics Concepts
•Average Effects of a Gene: Mean deviation from population mean of Individuals which received that gene from one
parent, the gene received from other parent having come at random from the population.
•Average Effects of Gene Substitution: Change one allele (i.e. A2 allele) into another allele (i.e. A1 alleles) at random in the population and observe resulting change in genotypic value.
•Breeding Value: Twice the Average Value of an individual’s offspring, expressed as deviation from population mean. Also
known as sum of the average effects of genes.
Average Effects of Gene Substitution: І7.333І; very close to equation 3.2 of Candida Ferreira (frequency of 0= 7.243 x12= 86.916 VS 7.333x12=88.0). Comparison: Lowest to Highest R-sq. is represented by linear, Quadratic and Cubic model Respectively. Very comparable to Original frequencies.
A. “Operon or Tumor Gene Expression occurs in a deterministic way from 25K to 300K area limits, and hence would have high survival probability”. This hypothesis indicates that there are conserved Protein motifs which
generates various Brain Tumor sizes in Fruit fly in predetermined frequencies. Thus, micro-tumors counted (frequency) for lower size
limits can be predicted by least non-linear mathematical and statistical equations .
B. “Log-Normal distribution arose due to compensatory response by lowest size distribution over the next few micro-tumor classes”. If the number of micro-tumors counted for 25K area size is at the expense of
next few, then a Log-Normal Distribution can be assured.
Log-Normal Distribution explanation
Leo Breiman: Statist. Sci. Volume 16, Issue 3 (2001), 199-231. Statistical Modeling: The Two Cultures
AbstractThere are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical
community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept
statisticians from working on a large range of interesting current problems.
Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence
on data models and adopt a more diverse set of tools.
A. Analysis of size distribution of lwr (-) microtumors from 58 animals
Projection >10,000 m2; Estimated volume: 0.5 mm3 -1 mm3
Taylor series: y = x1 + x2+x3 + x4 Area Limit Simulated Frequency100,000 -01 (1)200,000 +01 (2)275,000 -02 (3)
MLE:25k-300kMean=4.5 TumorsVariance=0.97 TumorsCI= 1.10-1.88 Tumors MLE: 300k-600kMean=1.27 TumorsVariance=0.02 Tumors
CI= 1.05-1.12 Tumors
RESULTS : Specter and Luke INPUT/OUTPUT Method (Genomics by Stanford University): The frequency of 300K was taken as x1 value and plugged into the equation. First the whole formula was used (1), then x4 was dropped (2), 3 was x1 + x2.
A. Bc-ALL B. Bc-All (corrected)Area limit Simul. Freq. Area Size Simul. Freq.
25K - 97 (1) 25K -1875K + 13 (2) 50K -04 150K + 01 (3) 75K -04 175K +01 (3) (1) THE PATTERN OF SIZE DISTRIBUTION OF SMALL TUMORS IN BOTH GENOTYPES SUGGESTS THAT MITOSIS IS DRIVING TUMORGENESIS. (2) CELL DEATH CONTRIBUTES TO SHIFTING TUMOR SIZE DISTRIBUTION-AS MORE CELLS DIE FROM COMPETITION, MORE SMALL TUMOR CELLS WERE CREATED TO FILL VACANT SPACE.
Ekaterina Vladislavleva- PhD: JUNE 2008
Both measured and simulated data are very often corrupted by noise,and in case of real measurements can be driven by a combination of both
measured and unmeasured input variables, empirical models should not onlyaccurately predict the observed response, but also have some extra generalization capabilities. The same requirement holds for models
developed on simulated data.
Models to exhibit not only requiredproperties, but also additional convenient properties like compactness, small
number of constants, etc. It is important, that generated models are interpretable and transparent, in order to provide additional understanding
of the underlying system or process.
VISUAL ANALYTICS: Meta Modeling: No Plateau Observed! Genetic Fitness keeps increasing-DNA structural similarity is NOT Functional
Similarity.
Original Data Reverse Engineering Algorithm
B. COMPENSATORY RESPONSE HYPOTHESES: BRODY et. al. “Even for Lorentizian like distributions, median of ratios provide
distributions more Gaussian like”
Bc-all Tumor size FREQ/lwr tumor size FREQ Summary Statistics • Obtain Ratio from all cell sizes and then summary statistics on it. • Mean = 1.509206 (ratio by lwr freq. of 8 was very similar to it)• Standard Error = 0.201937 • Median = 1.513738 Tumors • Mode #N/A • Standard Deviation = 0.699531 • Sample Variance = 0.489343 • Kurtosis = 0.430923 (ratio by lwr freq. of 11 was very similar to it)• Skewness = 0.566484 • Minimum = 0.545455, Maximum = 3.0 • Count = 12 = Number of Tumor Cell Sizes. • Confidence Level(95.0%) = 0.444461
B. COMPENSATORY RESPONSE HYPOTHESES…
Bc-all/11 / lwr/11 Ratio Bc-all/8 / lwr/8 Ratio
• Mean = 1.509206 • Standard Error = 0.2019371 • Median = 1.5137383 • Mode #N/A • Standard Deviation = 0.6995308 • Sample Variance = 0.4893433 • Kurtosis = 0.4309225 • Skewness = 0.5664844 • Minimum 0.5454545 Maximum 3.0 • Count = 12 • Confidence Level(95.0%) = 0.4444606
• Mean = 1.509206 • Standard Error = 0.2019371 • Median = 1.5137383 • Mode #N/A • Standard Deviation = 0.6995308 • Sample Variance = 0.4893433 • Kurtosis = 0.4309225 • Skewness = 0.5664844• Minimum = 0.5454545 Maximum = 3.0 • Count = 12 • Confidence Level(95.0%) = 0.4444606
COMPENSATORY RESPONSE HYPOTHESES…
Bc-all FREQ+11 / lwr FREQ ratio Simulation: Summary Statistics on Median Data
• Mean = 2.828184 Tumors• Standard Error = 0.616883 • Median = 1.742237 • Mode = 3.0 • Standard Deviation = 2.136946 • Sample Variance = 4.56654 • Kurtosis = 1.466792 • Skewness = 1.543556 • Minimum = 0.691358 Maximum = 7.5 • Count = 12 • Confidence Level(95.0%) = 1.357751
• Mean = 5.530856 Tumors• Standard Error = 2.08372 • Median = 1.711034 is close to +11 Summary
Statistics • Mode = 1.513738 = Median value prev.• Standard Deviation = 6.589301 • Sample Variance = 43.41889 • Kurtosis = -0.73807 • Skewness = 1.136088 • Minimum = 0.703297 Maximum =16.65112 • Count 10 = # of simulations= 10 medians
from various ratio simulations were generated and summery statistics generated here is from those 10 medians.
• Confidence Level(95.0%) = 4.713702