american statistical association october 23 2009 presentation part 1

Figure: Size distribution of microtumors. X-axis: size limit (μm²); Y-axis: number of microtumors.

A range of sizes and morphologies observed: Microtumors

Figure: Number and average size of aggregates and microtumors. Y-axis: projection area (µm²). Categories: all structures (n=932); aggregates, not melanized (n=513); microtumors, not melanized (n=268); microtumors, partially melanized (n=55); microtumors, strongly melanized (n=96).

Image panel labels: Ubc9- dif- dl-; Ubc9-; microtumor; small microtumor; aggregate; aggregate cluster; fat body.

Counts: 932 structures, 513 aggregates, 419 microtumors. Projection >10,000 µm²; estimated volume: 0.5 mm³ to 1 mm³.

Fruitfly Tumors

Genotype                     Number of larvae
Ubc9- (transheterozygote)    58
Bc + Ubc9-                   55

Odds ratio: not significant (P > 5%); 95% CI 0.85 to 1.25

Genotype           Aggregates + Tumors   Aggregates   Tumors
Ubc9-              932                   513          419
  %                                      55.04%       44.96%
Bc Ubc9/+ Ubc9-    874                   262          612
  %                                      29.98%       70.02%
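The percentage rows in the table above can be reproduced from the raw counts. The following is a minimal sketch; the dictionary layout and the function name `shares` are illustrative choices, not part of the original analysis.

```python
# Counts transcribed from the aggregates/tumors table above.
counts = {
    "Ubc9-": {"aggregates": 513, "tumors": 419},
    "Bc Ubc9/+ Ubc9-": {"aggregates": 262, "tumors": 612},
}

def shares(row):
    """Return (total, % aggregates, % tumors), percentages rounded to 2 places."""
    total = row["aggregates"] + row["tumors"]
    return (
        total,
        round(100 * row["aggregates"] / total, 2),
        round(100 * row["tumors"] / total, 2),
    )

for genotype, row in counts.items():
    print(genotype, shares(row))
```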

(Chiu et al. 2005): dUbc9 negatively regulates the Toll-NF-κB pathways in larval hematopoiesis and drosomycin activation in Drosophila. Developmental Biology.

Bc allele background (FlyBase gene report for Dmel\Bc; FB2009_07, released August 10, 2009)

General Information
Symbol: Dmel\Bc
Species: D. melanogaster
Name: Black cells
Annotation symbol: CG5779
Feature type: protein_coding_gene
FlyBase ID: FBgn0000165
Gene model status: Current
Stock availability: 68 publicly available

Genomic Location
Chromosome (arm): 2R
Recombination map: 2-80.6
Cytogenetic map: 54F6-54F6
Sequence location: 2R:13,774,718..13,777,477 [-]

The gene Black cells is referred to in FlyBase by the symbol Dmel\Bc (CG5779, FBgn0000165). It is a protein_coding_gene from Drosophila melanogaster. Its sequence location is 2R:13774718..13777477. It has the cytological map location 54F6. Its molecular function is described as: monophenol monooxygenase activity; oxygen transporter activity; oxidoreductase activity. It is involved in the biological processes: defense response; melanization defense response; scab formation; response to symbiont; response to wounding; transport. 10 alleles are reported. The phenotypes of these alleles are annotated with: crystal cell; hemocyte; hemolymph; lymph gland; adult; procrystal cell; lamellocyte; posterior lymph gland pair. It has one annotated transcript and one annotated polypeptide.

Takehana, A., Katsuyama, T., Yano, T., Oshima, Y., Takada, H., Aigaki, T., Kurata, S. (2002). Overexpression of a pattern-recognition receptor, peptidoglycan-recognition protein-LE, activates imd/relish-mediated antibacterial defense and the prophenoloxidase cascade in Drosophila larvae. Proc. Natl. Acad. Sci. U.S.A. 99(21): 13705-13710.

Ye, Y.H., Chenoweth, S.F., McGraw, E.A. (2009). Effective but costly, evolved mechanisms of defense against a virulent opportunistic pathogen in Drosophila melanogaster. PLoS Pathog. 5(4): e1000385.

Comparative analysis of area limits 25K to 300K and 300K to 600K in both genotypes: the 25K-300K range has a higher maximum likelihood mean, a larger variance and a wider confidence interval than the 300K-600K range. This is taken to show faster mitosis and cell death at 25K-300K than at 300K-600K.

Maximum Likelihood (ML) Estimates of BC-All (BC-lwr) and lwr43-5 All

BC-All
Area limit    Mean (tumors)   Variance (tumors)   95% confidence interval
25K-300K      4.86            0.85                1.22 to 1.84
300K-600K     1.67            0.02                1.11 to 1.20

lwr43-5 All
Area limit    Mean (tumors)   Variance (tumors)   95% confidence interval
25K-300K      4.5             0.97                1.10 to 1.88
300K-600K     1.27            0.02                1.05 to 1.12
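For readers who want to see how maximum likelihood estimates arise under a log-normal model, here is a minimal sketch. It assumes the usual closed-form estimators on the log scale; the function names and the example counts are hypothetical and are not the data or the procedure used to produce the tables above.

```python
import math

def lognormal_mle(values):
    """ML estimates (mu, sigma2) of a log-normal from positive observations."""
    logs = [math.log(v) for v in values]
    n = len(logs)
    mu = sum(logs) / n
    sigma2 = sum((x - mu) ** 2 for x in logs) / n  # ML divides by n, not n - 1
    return mu, sigma2

def lognormal_mean_var(mu, sigma2):
    """Mean and variance on the original scale implied by (mu, sigma2)."""
    mean = math.exp(mu + sigma2 / 2)
    var = (math.exp(sigma2) - 1) * math.exp(2 * mu + sigma2)
    return mean, var

# Hypothetical per-class tumor counts, NOT the data from the presentation.
example = [1, 2, 2, 3, 4, 6, 9]
mu, sigma2 = lognormal_mle(example)
print(mu, sigma2, lognormal_mean_var(mu, sigma2))
```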

Figure: Log-normal distribution of tumor area size, 25K-300K, in BC-All and recessive genotypes (Y-axis: number of microtumors found, i.e. frequency; X-axis: area size in steps of 25K).

PROBLEM STATEMENT
Tumor size data come from non-random, correlated samples. Samples were prepared for 8 days and scored on the 9th day. What are the cumulative effects on the frequencies of BC-All and recessive (lwr-) area size units across the 25K to 600K size distributions? What is the effect of a new versus an experienced PhD student on data collection?

612 vs. 419: this difference is not statistically significant (P > 5%). The EXPECTED frequency is higher at every area size for the semidominant gene on the hypothetical Y-axis.

The data show no pattern that dynamical simulation equations could quantify; hundreds of published mathematical methods were tried. The sample size is ONLY 48 rows of tumor frequency data!

ASA 10/23/2009 Minneapolis Presentation
Predictive Modeling, Mathematical Simulations and Data Mining: Making Sense Out of Really Difficult Cancer Data.
Navin K. Sinha, MS (Statistical Genetics), MS (Biometrics) and MBA (Decision Sciences)

Bc mutation alters aggregate proportions?
• Bc = Black cells
• Semidominant
• Dead crystal cells
• Visible easily

Analysis of the raw data shows a V-shaped residual pattern and a compensatory response at the 25K area limit (R-square = 0.36 vs. 0.76 vs. 0.86). The data analysis needs dynamical simulations, reverse engineering algorithms and simulated OLS regression.

LITERATURE REVIEW & METHODS
A. Dynamical simulation by a Taylor power series-like equation: y = x + x^2 + x^3 + x^4. Reference: Lee Spector and Sean Luke. 1996. "Culture Enhances the Evolvability of Cognition." In Proceedings of the Eighteenth Annual Conference of the Cognitive Science Society.

According to Spector and Luke, a special type of dynamical simulation is symbolic regression: “to produce a function, in symbolic form, that fits a provided set of data points. For each element of a set of (x,y) points, the function should map the x value to an appropriate y value. This sort of problem faced by a scientist who has obtained a set of experimental data points and suspects that a simple formula will suffice to explain the data”. This method is a standard example from dynamical simulation and is used in many different types of biological systems (Koza, J.R. 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: MIT Press).
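To make the idea of symbolic regression concrete, here is a toy sketch. It assumes the target is a sum of the first four powers of a single x, and it swaps the evolutionary search of genetic programming for an exhaustive search over a tiny, hand-picked family of candidate formulas. The sample grid and all names are illustrative assumptions.

```python
from itertools import product

# Assumed target: a sum of the first four powers of one variable x.
def target(x):
    return x + x**2 + x**3 + x**4

# Sample (x, y) points; the x grid is an arbitrary illustrative choice.
points = [(x / 10, target(x / 10)) for x in range(-10, 11)]

def make_candidate(mask):
    """mask[k] == 1 means the term x**(k + 1) is included in the model."""
    return lambda x: sum(x ** (k + 1) for k, on in enumerate(mask) if on)

def sse(f):
    """Sum of squared errors of f over the sample points."""
    return sum((f(x) - y) ** 2 for x, y in points)

# Exhaustive search over all 0/1 masks for powers 1..5.
# This stands in for the evolutionary search used in genetic programming.
best_mask = min(product([0, 1], repeat=5), key=lambda m: sse(make_candidate(m)))
print(best_mask)
```

A real symbolic regression run searches a far larger space of expressions, which is why an evolutionary method is used in the cited work instead of enumeration.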

B. Reverse engineering prediction by the equation y = 4.251a^2 + ln(a^2) + 7.243e^a - CF, where CF is a correction factor (Candida Ferreira. 2003. www.gene-expression-programming.com/author.asp, equation 3.2).
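A minimal sketch of evaluating this equation follows. It assumes the garbled exponents read a^2 and e^a, and it treats CF as a plain subtracted constant; the default CF of 0 and the sample inputs are placeholders, not values from the presentation.

```python
import math

def ferreira_y(a, cf=0.0):
    """y = 4.251*a^2 + ln(a^2) + 7.243*e^a - CF; undefined at a == 0."""
    if a == 0:
        raise ValueError("ln(a^2) is undefined at a = 0")
    return 4.251 * a**2 + math.log(a**2) + 7.243 * math.exp(a) - cf

# Illustrative inputs only.
for a in (0.5, 1.0, 2.0):
    print(a, ferreira_y(a))
```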

Ekaterina Vladislavleva, PhD thesis, June 2008:
Models to exhibit not only required properties, but also additional convenient properties like compactness, small number of constants, etc. It is important, that generated models are interpretable and transparent, in order to provide additional understanding of the underlying system or process.

Modified Candida Ferreira method (equation 3.2) with correction factor (CF): genetic fitness is not underestimated as much, and the results are consistent.
Figure panels: residual plot of the original frequency data (Y-axis); residual plot of the graph of a function after matrix algebra treatment.

Reverse Engineering of Polynomial Models of Gene Regulatory Networks (Visual Analytics = Meta Modeling = what are the ranges of input variables that cause the response to take certain values, not necessarily optimal?)

Dr. Eduardo Mendoza, Mathematics Department, Center for NanoScience, Ludwig-Maximilians-University, Munich, Germany

Brody et al. October 1, 2002. Significance and statistical errors in the analysis of DNA microarray data. PNAS 99(20): 12975-12978. (Even for Lorentzian-like distributions, the median of ratios provides distributions that are more Gaussian-like.)

Reverse Engineering of Systems
• Systems identification in engineering: the goal is to construct a system with prescribed dynamical properties.
• In systems biology, one is interested in identifying as closely as possible a unique biological system that has been observed experimentally.
• In both cases, sparsity of available measurements will leave the system underdetermined (GIGO: uninterpretable).
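The underdetermination point can be illustrated with a small sketch: two measurements cannot fix the three coefficients of a quadratic, so different models fit the data equally well. The measurement values and function names below are hypothetical, chosen only for illustration.

```python
# Two measurements (hypothetical values, chosen only for illustration).
measurements = [(0.0, 1.0), (1.0, 3.0)]

# A quadratic y = c0 + c1*x + c2*x^2 has three unknown coefficients,
# so two measurements cannot pin it down: c2 remains a free parameter.
def quadratic_through(points, c2):
    """Return (c0, c1, c2) of the quadratic through both points for a chosen c2."""
    (x0, y0), (x1, y1) = points
    c1 = (y1 - y0) / (x1 - x0) - c2 * (x0 + x1)
    c0 = y0 - c1 * x0 - c2 * x0**2
    return c0, c1, c2

def evaluate(coeffs, x):
    """Evaluate the quadratic with coefficients (c0, c1, c2) at x."""
    c0, c1, c2 = coeffs
    return c0 + c1 * x + c2 * x**2

model_a = quadratic_through(measurements, c2=0.0)
model_b = quadratic_through(measurements, c2=5.0)

# Both models reproduce the measurements exactly but disagree elsewhere.
print(evaluate(model_a, 2.0), evaluate(model_b, 2.0))
```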

Mathematical Genetics Concepts

• Average effect of a gene: the mean deviation from the population mean of individuals that received that gene from one parent, the gene received from the other parent having come at random from the population.

• Average effect of gene substitution: change one allele (e.g. the A2 allele) into another allele (e.g. the A1 allele) at random in the population and observe the resulting change in genotypic value.

• Breeding value: twice the average value of an individual’s offspring, expressed as a deviation from the population mean. Also known as the sum of the average effects of genes.
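These three concepts have standard closed forms for a single locus with two alleles, sketched below under the textbook coding of genotypic values as +a (A1A1), d (A1A2) and -a (A2A2). The function name and the numeric inputs are hypothetical and do not come from the presentation.

```python
def average_effects(a, d, p):
    """Average effects for one locus with two alleles (A1 at frequency p).

    Genotypic values are coded +a (A1A1), d (A1A2), -a (A2A2).
    """
    q = 1.0 - p
    alpha = a + d * (q - p)                  # average effect of gene substitution
    alpha1, alpha2 = q * alpha, -p * alpha   # average effects of A1 and A2
    breeding_values = {
        "A1A1": 2 * alpha1,          # sum of the average effects of both genes
        "A1A2": alpha1 + alpha2,
        "A2A2": 2 * alpha2,
    }
    return alpha, alpha1, alpha2, breeding_values

# Hypothetical inputs for illustration.
print(average_effects(a=2.0, d=1.0, p=0.5))
```

The breeding values are deviations from the population mean, so their frequency-weighted average over the three genotypes is zero.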

Average effect of gene substitution: |7.333|, very close to equation 3.2 of Candida Ferreira (frequency of 0 = 7.243 × 12 = 86.916 vs. 7.333 × 12 = 88.0). Comparison: the lowest to highest R-square values are given by the linear, quadratic and cubic models, respectively. The results are very comparable to the original frequencies.
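The two products quoted above can be checked directly. In this sketch the multiplier 12 is assumed to be the number of tumor size classes, and the variable names are illustrative.

```python
n_classes = 12                      # assumed: number of tumor size classes
ferreira_constant = 7.243           # coefficient in equation 3.2
substitution_effect = abs(7.333)    # |average effect of gene substitution|

product_ferreira = round(ferreira_constant * n_classes, 3)
product_substitution = round(substitution_effect * n_classes, 3)

# The second product is 87.996, which the slide reports rounded as 88.0.
print(product_ferreira, product_substitution)
```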

A. “Operon or tumor gene expression occurs in a deterministic way from 25K to 300K area limits, and hence would have high survival probability.” This hypothesis implies that conserved protein motifs generate the various brain tumor sizes in the fruit fly at predetermined frequencies. Thus, the number of microtumors counted (frequency) at the lower size limits can be predicted by the least non-linear mathematical and statistical equations.

B. “Log-normal distribution arose due to compensatory response by lowest size distribution over the next few micro-tumor classes.” If the number of microtumors counted for the 25K area size comes at the expense of the next few classes, then a log-normal distribution can be assured.

Log-Normal Distribution explanation

Leo Breiman. Statistical Modeling: The Two Cultures. Statist. Sci. Volume 16, Issue 3 (2001), 199-231.

Abstract: There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

A. Analysis of size distribution of lwr (-) microtumors from 58 animals
Projection >10,000 µm²; estimated volume: 0.5 mm³ to 1 mm³

Taylor series: y = x + x^2 + x^3 + x^4

Area limit   Simulated frequency
100,000      -01 (1)
200,000      +01 (2)
275,000      -02 (3)

MLE, 25K-300K: mean = 4.5 tumors; variance = 0.97 tumors; CI = 1.10-1.88 tumors
MLE, 300K-600K: mean = 1.27 tumors; variance = 0.02 tumors; CI = 1.05-1.12 tumors

RESULTS: Spector and Luke INPUT/OUTPUT method (Genomics by Stanford University): the frequency at 300K was taken as the x value and plugged into the equation. First the whole formula was used (1), then the x^4 term was dropped (2), and finally only x + x^2 was used (3).
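The three variants just described can be sketched as follows, assuming the equation is a sum of powers of a single input x. The function name and the input value 2 are hypothetical; the presentation plugs in the observed 300K frequency, which is not reproduced here.

```python
def simulate(x):
    """Return the three formula variants described above for one input x."""
    full = x + x**2 + x**3 + x**4     # (1) whole formula
    no_quartic = x + x**2 + x**3      # (2) x^4 term dropped
    two_terms = x + x**2              # (3) x + x^2 only
    return full, no_quartic, two_terms

# Hypothetical input, NOT the observed 300K frequency.
print(simulate(2))
```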

A. Bc-All
Area limit   Simul. freq.
25K          -97 (1)
75K          +13 (2)
150K         +01 (3)
175K         +01 (3)

B. Bc-All (corrected)
Area size    Simul. freq.
25K          -18
50K          -04
75K          -04

(1) THE PATTERN OF SIZE DISTRIBUTION OF SMALL TUMORS IN BOTH GENOTYPES SUGGESTS THAT MITOSIS IS DRIVING TUMORIGENESIS.
(2) CELL DEATH CONTRIBUTES TO SHIFTING TUMOR SIZE DISTRIBUTION: AS MORE CELLS DIE FROM COMPETITION, MORE SMALL TUMOR CELLS WERE CREATED TO FILL VACANT SPACE.

Ekaterina Vladislavleva, PhD thesis, June 2008:

Both measured and simulated data are very often corrupted by noise, and in case of real measurements can be driven by a combination of both measured and unmeasured input variables, empirical models should not only accurately predict the observed response, but also have some extra generalization capabilities. The same requirement holds for models developed on simulated data.

Models to exhibit not only required properties, but also additional convenient properties like compactness, small number of constants, etc. It is important, that generated models are interpretable and transparent, in order to provide additional understanding of the underlying system or process.

VISUAL ANALYTICS: Meta Modeling: No plateau observed! Genetic fitness keeps increasing. DNA structural similarity is NOT functional similarity.

Figure panels: Original Data; Reverse Engineering Algorithm.

B. COMPENSATORY RESPONSE HYPOTHESES: BRODY et al.: “Even for Lorentzian like distributions, median of ratios provide distributions more Gaussian like”
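The quoted claim can be explored with a small simulation. This is only a sketch under stated assumptions: ratios of two independent standard normal draws stand in for Lorentzian (Cauchy) data, the block size of 25 is arbitrary, and none of the numbers come from the tumor data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def ratio_sample(n):
    """Ratios of two independent standard normals (Lorentzian, i.e. Cauchy)."""
    out = []
    while len(out) < n:
        num, den = random.gauss(0, 1), random.gauss(0, 1)
        if den != 0:
            out.append(num / den)
    return out

raw = ratio_sample(5000)
# Median of each consecutive block of 25 ratios.
medians = [statistics.median(raw[i:i + 25]) for i in range(0, len(raw), 25)]

def iqr(values):
    """Interquartile range of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

print("largest |ratio|:", max(abs(v) for v in raw))
print("largest |median|:", max(abs(v) for v in medians))
print("IQR raw vs. medians:", iqr(raw), iqr(medians))
```

The raw ratios show the extreme outliers typical of a heavy-tailed distribution, while the block medians stay in a narrow range, which is the behavior the quotation describes.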

Bc-all tumor size FREQ / lwr tumor size FREQ: Summary Statistics
Obtain the ratio for every tumor size class, then compute summary statistics on the ratios.
• Mean = 1.509206 (the ratio by lwr freq. of 8 was very similar to it)
• Standard Error = 0.201937
• Median = 1.513738 tumors
• Mode = #N/A
• Standard Deviation = 0.699531
• Sample Variance = 0.489343
• Kurtosis = 0.430923 (the ratio by lwr freq. of 11 was very similar to it)
• Skewness = 0.566484
• Minimum = 0.545455, Maximum = 3.0
• Count = 12 = number of tumor cell sizes
• Confidence Level (95.0%) = 0.444461
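Several of these summary statistics are tied together by standard formulas, so they can be cross-checked from the reported standard deviation and count alone. The sketch below assumes the "Confidence Level (95.0%)" entry is the half-width of a Student t interval for the mean; the critical value for 11 degrees of freedom is hard-coded because the Python standard library has no t quantile function.

```python
import math

n = 12                 # Count
sd = 0.699531          # Standard Deviation, as reported
t_crit = 2.200985      # two-sided 95% Student t quantile for n - 1 = 11 df

variance = sd ** 2                 # Sample Variance
std_error = sd / math.sqrt(n)      # Standard Error
half_width = t_crit * std_error    # assumed meaning of "Confidence Level (95.0%)"

print(round(variance, 6), round(std_error, 6), round(half_width, 6))
```

Under these assumptions the three derived values agree with the reported ones to about five decimal places.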

B. COMPENSATORY RESPONSE HYPOTHESES…

Statistic                    Bc-all/11 / lwr/11 ratio   Bc-all/8 / lwr/8 ratio
Mean                         1.509206                   1.509206
Standard Error               0.2019371                  0.2019371
Median                       1.5137383                  1.5137383
Mode                         #N/A                       #N/A
Standard Deviation           0.6995308                  0.6995308
Sample Variance              0.4893433                  0.4893433
Kurtosis                     0.4309225                  0.4309225
Skewness                     0.5664844                  0.5664844
Minimum                      0.5454545                  0.5454545
Maximum                      3.0                        3.0
Count                        12                         12
Confidence Level (95.0%)     0.4444606                  0.4444606

COMPENSATORY RESPONSE HYPOTHESES…

Bc-all FREQ+11 / lwr FREQ ratio simulation
• Mean = 2.828184 tumors
• Standard Error = 0.616883
• Median = 1.742237
• Mode = 3.0
• Standard Deviation = 2.136946
• Sample Variance = 4.56654
• Kurtosis = 1.466792
• Skewness = 1.543556
• Minimum = 0.691358, Maximum = 7.5
• Count = 12
• Confidence Level (95.0%) = 1.357751

Summary statistics on median data
• Mean = 5.530856 tumors
• Standard Error = 2.08372
• Median = 1.711034 (close to the +11 summary statistics)
• Mode = 1.513738 (equal to the previous median value)
• Standard Deviation = 6.589301
• Sample Variance = 43.41889
• Kurtosis = -0.73807
• Skewness = 1.136088
• Minimum = 0.703297, Maximum = 16.65112
• Count = 10 = number of simulations. Ten medians were generated from various ratio simulations, and the summary statistics here are computed from those 10 medians.
• Confidence Level (95.0%) = 4.713702
