Statistical Modeling: Building a Better Mouse Trap, and Others, Dec 10, 2012 at the University of...

Post on 17-Dec-2015

TRANSCRIPT

  • Slide 1
  • Statistical Modeling: Building a Better Mouse Trap, and others. Dec 10, 2012 at the University of Hong Kong. Stephen Sauchi Lee, Associate Professor of Statistics, Affiliated Professor of Bioinformatics and Computational Biology, Department of Statistics, University of Idaho, Moscow, Idaho, USA
  • Slide 2
  • Statistical Modeling on 3 projects
      • I. Building a Better Mouse Trap? The Incremental Utility Behind the Methodology of Risk Assessment
      • II. Predicting Parkinson Disease Status
      • III. Demographic Impacts on Social Vulnerability in Norway
  • Slide 3
  • Academy of Criminal Justice Sciences, NYC, 2012-03-16
      • Zachary Hamilton, PhD; Melanie-Angela Neuilly, PhD; Robert Barnoski, PhD (Washington State University, Pullman, Washington)
      • Stephen S. Lee, PhD (University of Idaho, Moscow, Idaho)
  • Slide 4
  • Four generations: 1) clinical judgment, 2) static predictors, 3) dynamic factors, 4) automated
      • Regression methods utilized for instrument creation: LSI-R (logistic regression); COMPAS (survival regression)
      • Recent advancements in prediction rarely utilized for criminal risk assessment: decision trees, neural networks, latent class analysis
  • Slide 5
  • Linear approaches assume equal additive quality for all factors (Steadman, et al., 2000)
      • They typically neglect interaction effects
      • Machine-learning approaches mirror diagnostic processes more closely (Steadman, et al., 2000)
      • Machine-learning approaches most commonly used: Classification Trees (CT) and other recursive partitioning models (CHAID, CART, ICT, Random Forests, etc.); Neural Networks (NN)
  • Slide 6
  • Hierarchical question-decision tree model (Breiman, 1984)
      • The final answer is the result of a series of conditioning answers (if this, then that, etc.)
      • Used in diagnostic reasoning
      • No statistical significance
      • Random Forests: inductive statistical learning; aggregation of hundreds of Classification Trees
  • Slide 7
  • [Figure: example classification tree. Starting point: total sample. First split: age at first arrest (threshold 25). Second split: prior arrests (threshold 2). Final split (terminal nodes): recidivism / no recidivism.]
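The hierarchical if-then logic of a classification tree, and the way a random forest aggregates many such trees, can be sketched in a few lines of Python. This is only an illustration: the transcript preserves the split variables and the thresholds 25 and 2 but not the direction of each comparison, so the inequalities below are assumptions, and `tree_b` and `tree_c` are hypothetical stand-ins for trees grown on other bootstrap samples.

```python
from collections import Counter


def example_tree(age_at_first_arrest, prior_arrests):
    """One classification tree written as nested if-then rules."""
    # First split: age at first arrest (direction of the split is an assumption).
    if age_at_first_arrest >= 25:
        return "no recidivism"
    # Second split: number of prior arrests (direction is an assumption).
    if prior_arrests >= 2:
        return "recidivism"
    return "no recidivism"


def tree_b(age_at_first_arrest, prior_arrests):
    """Hypothetical second tree that only looks at prior arrests."""
    return "recidivism" if prior_arrests >= 3 else "no recidivism"


def tree_c(age_at_first_arrest, prior_arrests):
    """Hypothetical third tree that only looks at age at first arrest."""
    return "recidivism" if age_at_first_arrest < 21 else "no recidivism"


def forest_vote(trees, age_at_first_arrest, prior_arrests):
    """Aggregate the trees' answers by majority vote."""
    votes = Counter(t(age_at_first_arrest, prior_arrests) for t in trees)
    return votes.most_common(1)[0][0]


trees = [example_tree, tree_b, tree_c]
print(forest_vote(trees, age_at_first_arrest=19, prior_arrests=4))
```

A real random forest grows hundreds of trees on resampled data and random subsets of predictors; the vote at the end is the part this sketch shares with it.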
  • Slide 8
  • Developed in Artificial Intelligence research
      • Data mining technique for pattern recognition
      • Aims at modeling the lower-level brain functions
      • Layered nodes of fact-sets, instead of rules, are used to train the network
      • Based on the training data, the network learns to deduce the right answer to any new piece of information
      • Used in psychiatric diagnostics
  • Slide 9
  • [Figure: neural network schematic. Inputs, weights, hidden neurons (activation functions), weights, predicted outputs; weights are recalculated based on predicted and actual outputs.]
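The flow in the schematic (inputs, weights, hidden neurons with activation functions, a second set of weights, a predicted output, then recalculation of the weights from the prediction error) can be sketched as a tiny one-hidden-layer network. Everything specific here is an assumption made for illustration: the sigmoid activation, squared-error loss, learning rate, network size, and the single made-up training case are not taken from the study.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_out):
    """Inputs -> weights -> hidden neurons (sigmoid) -> weights -> predicted output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output


def train_step(x, y, w_hidden, w_out, lr=0.5):
    """Recalculate weights from the gap between predicted and actual output."""
    hidden, output = forward(x, w_hidden, w_out)
    # Gradient of the squared error 0.5 * (output - y)^2 at the output neuron.
    delta_out = (output - y) * output * (1.0 - output)
    # Error propagated back to each hidden neuron, using the pre-update output weights.
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(hidden))]
    for j in range(len(hidden)):
        w_out[j] -= lr * delta_out * hidden[j]
        for i in range(len(x)):
            w_hidden[j][i] -= lr * delta_hidden[j] * x[i]
    return 0.5 * (output - y) ** 2


random.seed(0)
n_inputs, n_hidden = 3, 2
w_hidden = [[random.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
w_out = [random.uniform(-1, 1) for _ in range(n_hidden)]

x, y = [1.0, 0.0, 1.0], 1.0  # one made-up training case
losses = [train_step(x, y, w_hidden, w_out) for _ in range(200)]
print(losses[0], losses[-1])  # the error shrinks as the weights are recalculated
```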
  • Slide 10
  • Studies using CT-like analyses, as well as NN, tend to make use of smaller samples (< 1,500) (except Berk et al., 2009; Palocsay et al., 2000; and Silver et al., 2000)
      • Overall, results are mixed, but the studies finding significant improvement via CT use lack proper validation (Liu, 2010)
      • Studies using NN show very split results (Liu, 2010)
  • Slide 11
  • Overall, very few studies have investigated the utility of CT and NN for predicting recidivism
      • Previous studies have been limited in power and limited to violence prediction
      • The current study remedies such limitations: close to million cases; general recidivism, as well as possibilities for investigating offense-specific recidivism
  • Slide 12
  • Previously utilized LSI-R
      • Found laborious by community corrections officers
      • Evaluated to be strengthened by increase of static items (Barnoski, 2003)
      • Created current instrument in 2006
      • Factors strongly related to recidivism: demographics, juvenile record, commitments to DOC, felonies, misdemeanors, and violations
      • Removed dynamic items (interview not required)
      • Instrument scored from logistic regression (logit weights)
      • Comparable predictive validity for WA sample (WSIPP, 2007): LSI-R AUC = .66; WA Static Risk AUC = .74
  • Slide 13
  • 24 variables included from current risk prediction instrument
      • 3-year follow-up (from release from incarceration)
      • Outcome: any felony recidivism
      • 2-step creation
      • Construction sample: all offenders released from prison or jail and placed on community supervision from 1986 to 2000 (N = 287,417)
      • Validation sample: all offenders released from prison or jail and placed on community supervision from 2001 to 2002 (N = 71,957)
      • Methods of prediction compared by area under the receiver operating characteristic curve (AUC)
      • Values in the .500s indicate no predictive accuracy; .600s are weak, .700s moderate, and above .800 strong predictive accuracy
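As a reminder of what the AUC measures, the sketch below computes it as the share of (recidivist, non-recidivist) pairs in which the recidivist received the higher risk score, with ties counted as one half. The `strength` helper applies the interpretive bands quoted above; treating exactly .800 as "strong" is an assumption about the boundary, and the scores in the usage lines are invented.

```python
def auc(scores, labels):
    """Probability that a random positive case outranks a random negative case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def strength(auc_value):
    """Interpretive bands from the slide (boundary handling is an assumption)."""
    if auc_value >= 0.8:
        return "strong"
    if auc_value >= 0.7:
        return "moderate"
    if auc_value >= 0.6:
        return "weak"
    return "little or no predictive accuracy"


# Invented risk scores for four offenders; 1 = recidivated, 0 = did not.
print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))

# The two instrument AUCs reported for the WA sample.
print(strength(0.66), strength(0.74))
```

This pairwise form is quadratic in the sample size; for samples as large as the ones in the study a rank-based formula would be the practical choice.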
  • Slide 14
  • Descriptive Statistics (N = 359,374). Columns: Predictor; % or Mean (SD)
      • White (not included in model): 79.7
      • 1. Male: 18.7
      • 2. Age At Risk: 31.7 (10.2)
      • 3. Adult Felonies: 2.1 (1.9)
      • 4. Juvenile Felony Score: 32
      • 5. Juvenile Person Score: 6
      • 6. Number of DOC Commitments: 2.0 (1.7)
      • 7. Homicide/Manslaughter: 1
      • 8. Felony Sex: 7
      • 9. Felony Violent Property: 9
      • 10. Felony Non-Domestic Violence Assault: 16
      • 11. Felony Domestic Violence Assault: 2
      • 12. Felony Weapon: 4
      • 13. Felony Property: 85
      • 14. Felony Drug: 62
      • 15. Felony Escape: 8
      • 16. Misdemeanor Non-Domestic Violence Assault: 23
      • 17. Misdemeanor Domestic Violence Assault: 21
      • 18. Misdemeanor Sex: 3
      • 19. Misdemeanor Domestic Violence Other: 1
      • 20. Misdemeanor Weapon: 4
      • 21. Misdemeanor Property: 52
      • 22. Misdemeanor Drug: 17
      • 23. Misdemeanor Escape: 1
      • 24. Misdemeanor Alcohol: 17
      • NewFelony (Outcome): 44
  • Slide 15
  • Extended validation sample of original instrument construction
      • Strongest model predictors (weights) were: 1) Misd. Property, 2) Juvenile Felony, 3) Misd. Domestic Violence Assault, 4) Misd. Drug, 5) Misd. Sex, 6) Male
      • Findings comparable to original instrument construction
      • Models (Construction Sample ROC; Validation Sample ROC)
          • Original Sample: .756; .742
          • Extended Sample: .750; .749
  • Slide 16
  • Strongest model predictors: 1) Felony Adjudications, 2) Misd. Property, 3) Sentence Length, 4) Juvenile Felony, 5) Age, 6) Felony Property
  • Slide 17
  • Model Comparisons. Columns: Model; Construction Sample ROC (SE); Validation Sample ROC (SE)
      • Logistic Regression: .750 (.001); .749* (.002)
      • Neural Network: .755* (.001); .750* (.002)
      • Random Forest: .750 (.001); .734 (.002)
  • Slide 18
  • Significant differences found
      • Neural network showed significantly greater predictive validity than random forest
      • Neural network showed significantly greater predictive validity than logistic regression, but only in the construction sample
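The slides do not say which test produced these significance findings. As a rough plausibility check, one can form a z statistic from the reported AUCs and standard errors. This sketch treats the two AUC estimates as independent, which is a simplifying assumption: the models were scored on the same cases, so their AUCs are correlated, and a paired method such as DeLong's test would normally be preferred.

```python
import math


def z_for_auc_difference(auc_a, se_a, auc_b, se_b):
    """z statistic for two AUCs, treating the estimates as independent."""
    return (auc_a - auc_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# (AUC, SE) pairs copied from the model-comparison slide.
construction = {"logistic": (0.750, 0.001), "neural": (0.755, 0.001), "forest": (0.750, 0.001)}
validation = {"logistic": (0.749, 0.002), "neural": (0.750, 0.002), "forest": (0.734, 0.002)}

for name, sample in (("construction", construction), ("validation", validation)):
    for other in ("logistic", "forest"):
        z = z_for_auc_difference(*sample["neural"], *sample[other])
        print(f"{name}: neural vs {other}: z = {z:.2f}")
```

Under this approximation the neural network versus logistic regression gap is large relative to its standard error in the construction sample (z about 3.5) and negligible in the validation sample (z about 0.35), which is at least consistent with the pattern reported on the slide.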
  • Slide 19
  • Neural networks performed best, followed by logistic regression and random forest
      • ROC differences between methods were found to be significant, but not universally
      • The preliminary nature of the findings is stressed
  • Slide 20
  • Lack of specificity of outcome measure and sample heterogeneity
      • Any felony within 3 years
      • Specialization and taxonomic structures not considered
      • Unit of analysis is incarceration cycle: violation of independence assumption for repeat incarcerations
      • Exclusion of dynamic predictors
  • Slide 21
  • Add dynamic predictors to models
      • Available since 2008
      • Prior/preliminary findings indicate only modest improvement
      • Examine impact of latent variable methods: 4th potential model; disentangle heterogeneity
      • Subgroup analyses based on offense specialties, i.e. drug, violent, sex offender
  • Slide 22
  • Predicting Parkinson's disease status with vocal dysphonia measurements. Roxana Hickey, Bioinformatics & Computational Biology. Statistics 519 Multivariate Statistics, Term Project. Professor Stephen Lee. April 27, 2011
  • Slide 23
  • Outline: background; Parkinson's disease; vocal dysphonia; study dataset; statistical analyses; conclusions
  • Slide 24
  • Parkinson's disease
      • Neurological disorder that leads to shaking and difficulty with walking, movement and coordination [1]
      • Affects >1 million people in North America [2]; prevalence increases rapidly after age 60 [3]
      • No cure, but medication is available to alleviate symptoms, especially in early stages [4]; early detection is key to effective treatment strategies
      • http://www.healthtree.com/articles/parkinsons-disease/causes/
  • Slide 25
  • Parkinson's disease & vocal impairment
      • ~90% of individuals with Parkinson's disease have some form of vocal impairment [5, 6]
      • Characteristics [7]: dysphonia (impaired production of vocal sounds); dysarthria (problems with normal articulation in speech)
      • May be one of the earliest indicators of onset of illness [8]
      • Tests for vocal impairment [9, 10]
          • Sustained phonations [11, 12] (focus of this study): produce a single vowel and hold pitch constant
          • Running speech [12]: speak standard sentences that contain a representative sample of linguistic units
  • Slide 26
  • Measures of assessing vocal dysphonia
      • Traditional methods [11, 12]
          • pitch (F0, fundamental frequency of vocal oscillation)
          • absolute sound pressure level (loudness)
          • jitter (variation in F0 from vocal cycle to vocal cycle)
          • shimmer (variation in amplitude)
          • noise-to-harmonics ratio
      • Novel methods [13, 14]
          • nonlinear dynamical systems theory and nonlinear time series analysis
          • recurrence period density entropy
          • detrended fluctuation analysis
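Jitter and shimmer are both cycle-to-cycle perturbation measures, so one helper can illustrate both. The sketch below uses one common "local" definition (mean absolute difference between consecutive cycles, divided by the mean), applied to cycle lengths (the reciprocal of F0) for jitter and to peak amplitudes for shimmer. It is an assumption that this matches any particular MDVP variant in the dataset, and the cycle values are made up.

```python
def relative_perturbation(values):
    """Mean absolute difference between consecutive cycles, divided by the mean."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


periods_ms = [10.0, 12.0, 10.0, 12.0]  # hypothetical vocal cycle lengths
amplitudes = [1.0, 0.9, 1.0, 0.9]      # hypothetical peak amplitudes per cycle

jitter = relative_perturbation(periods_ms)   # cycle-to-cycle variation in period
shimmer = relative_perturbation(amplitudes)  # cycle-to-cycle variation in amplitude
print(f"jitter = {jitter:.1%}, shimmer = {shimmer:.1%}")
```

A perfectly steady voice would give zero for both measures; larger values indicate a less stable phonation.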
  • Slide 27
  • Measures of assessing vocal dysphonia
      • Measurements differ in robustness [14] to uncontrolled variation in the acoustic environment and to the physical condition and characteristics of the subject
      • Therefore, chosen measurement methods should be as robust as possible to this variation
      • Goal of the study: identify an optimal feature set that is both robust to uncontrolled variation and able to classify patients with Parkinson's disease based on vocal dysphonic symptoms
      • Additional advantage: possibility of monitoring patients remotely
  • Slide 28
  • http://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data
  • Slide 29
  • Subjects & methods
      • Subjects: 31 individuals (8 healthy, 23 with Parkinson's disease (PD))
      • An average of six sustained vowel phonations recorded from each subject; total n = 195
      • Calculation of features via software programs: traditional measures; non-standard measures, including a new measure proposed by the authors (pitch period entropy)
  • Slide 30
  • Variables (attribute: description)
      • MDVP:Jitter(%): MDVP jitter as percentage
      • MDVP:Jitter(Abs): MDVP absolute jitter in microseconds
      • MDVP:RAP: MDVP Relative Amplitude Perturbation
      • MDVP:PPQ: MDVP five-point Period Perturbation Quotient
      • Jitter:DDP: average absolute difference of differences between cycles, divided by the average period
      • MDVP:Shimmer: MDVP local shimmer
      • MDVP:Shimmer(dB): MDVP local shimmer in decibels
      • Shimmer:APQ3: 3-point Amplitude Perturbation Quotient
      • Shimmer:APQ5: 5-point Amplitude Perturbation Quotient
      • MDVP:APQ: MDVP 11-point Amplitude Perturbation Quotient
      • Shimmer:DDA: average absolute difference between consecutive differences between the amplitudes of consecutive periods
      • NHR: Noise-to-Harmonics Ratio
      • HNR: Harmonics-to-Noise Ratio
      • RPDE: Recurrence Period Density Entropy
      • D2: correlation dimension
      • DFA: Detrended Fluctuation Analysis
      • spread1: nonlinear measure of fundamental frequency variation
      • spread2: nonlinear measure of fundamental frequency variation
      • PPE: pitch period entropy
      • MDVP:Fo(Hz): average vocal fundamental frequency
      • MDVP:Fhi(Hz): maximum vocal fundamental frequency
      • MDVP:Flo(Hz): minimum vocal fundamental frequency
      • MDVP = (Kay Pentax) Multi-Dimensional Voice Program
      • Legend (variable groups): measures of variation in amplitude; measures of variation in fundamental frequency; measures of ratio of noise to tonal components in voice; nonlinear dynamical complexity measures; signal fractal scaling exponent; nonlinear measures of fundamental frequency variation
      • Grouping variable: status = 0 (healthy), = 1 (PD)
  • Slide 31
  • Statistical analyses: EDA; PCA; MANOVA; Hotelling's $T^2$; QDA; classification tree (with random forest)
  • Slides 32 to 36
  • EDA [Figures: exploratory plots by group; 0 = healthy, 1 = PD]
  • Slide 37
  • PCA
  • Slide 38
  • MANOVA template: $H_0: \boldsymbol{\mu}_{\text{healthy}} = \boldsymbol{\mu}_{\text{Parkinson's}}$
  • Slide 39
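  • Hotelling's $T^2$ test
      • $H_0: \boldsymbol{\mu}_{\text{healthy}} = \boldsymbol{\mu}_{\text{Parkinson's}}$ ($p = 22$)
      • Test statistic: $T^2 = 187.48$
      • $df = 48 + 147 - 2 = 193$
      • Critical value: $T^2_{0.05,\,22,\,193} \approx 47$ (extrapolated)
      • Conclusion: reject $H_0$ ($\alpha = 0.05$)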
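Instead of extrapolating a critical value from a $T^2$ table, the statistic can be converted to an exact F statistic using the standard two-sample relation $F = \frac{n_1 + n_2 - p - 1}{p\,(n_1 + n_2 - 2)}\,T^2$, which follows an $F_{p,\;n_1+n_2-p-1}$ distribution under $H_0$ when the groups share a covariance matrix. The sketch below applies it to the numbers on the slide; the function name is only illustrative.

```python
def hotelling_t2_to_f(t2, n1, n2, p):
    """Convert a two-sample Hotelling T^2 to its F statistic and degrees of freedom."""
    n = n1 + n2
    f_stat = t2 * (n - p - 1) / (p * (n - 2))
    return f_stat, p, n - p - 1


f_stat, df1, df2 = hotelling_t2_to_f(t2=187.48, n1=48, n2=147, p=22)
print(round(f_stat, 2), df1, df2)  # prints 7.59 22 172
```

An F value of about 7.6 on 22 and 172 degrees of freedom leads to the same conclusion as the slide: the null hypothesis is rejected at the 0.05 level.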
  • Slide 40
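  • park.qda.cv classification table (rows: actual; columns: classified)
      • Actual 0: classified 0 = 35; classified 1 = 13
      • Actual 1: classified 0 = 9; classified 1 = 138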
  • Slide 41
  • Classification tree (CART)
      • table(Actual=park.g$group, Classified=park.cart.pred)
      • Actual 0: classified 0 = 38; classified 1 = 10
      • Actual 1: classified 0 = 6; classified 1 = 141
      • Error rate: (10+6)/195 = 8.21%
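The error rate quoted for CART can be reproduced from the classification table, and the same arithmetic can be applied to the QDA table on Slide 40. The sketch reads both tables as rows = actual and columns = classified (so the QDA counts are 35, 13, 9, 138) and adds sensitivity, specificity, and a naive baseline for context; those extra quantities are not reported on the slides.

```python
def summarize(confusion):
    """Summarize confusion[actual][classified], with 0 = healthy and 1 = PD."""
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    return {
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),  # PD recordings classified as PD
        "specificity": tn / (tn + fp),  # healthy recordings classified as healthy
    }


qda_cv = [[35, 13], [9, 138]]  # Slide 40 table
cart = [[38, 10], [6, 141]]    # Slide 41 table

# Baseline: always predicting PD is wrong on the 48 healthy recordings.
baseline_error = 48 / 195

for name, table in (("QDA (CV)", qda_cv), ("CART", cart)):
    stats = summarize(table)
    print(name, {k: round(v, 4) for k, v in stats.items()})
print("always-PD baseline error:", round(baseline_error, 4))
```

Because roughly three quarters of the recordings come from PD subjects, an error rate is most informative next to the baseline of about 24.6%.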
  • Slide 42
  • Random forests park.rf
  • Slide 52
  • Climate Change Effects in Arctic Europe
      • Highest warming rate in the last 3 decades was in the Arctic.
      • Barents Sea ice-free during the past 5 years; ice declined at 12% per decade, faster than models predicted.
      • Boreal forest moving poleward. Shrubs becoming trees. Tree line altitude increase = 100 meters; poleward = >30 km in some locations since 1909.
      • Near-surface thaw of permafrost causing infrastructure problems and loss of palsa mires, causing release of methane and CO2.
      • Reindeer and fish migrations changing.
  • Slide 53
  • Social Vulnerability
      • Social vulnerability occurs when unequal exposure to risk is coupled with unequal access to resources (Morrow 2008).
      • Groups potentially discriminated against are the socially, culturally, and economically marginalized (Mustafa 1998, Morrow 2008).
      • Variables promoting social vulnerability include: poverty, minority status, gender, age, disability, human capital (Morrow 2008).
  • Slide 54
  • Why look at social vulnerability?
      • Previous migration patterns can be detrimental
      • Northern Norway still tied to natural resources
      • Welfare state: the North receives subsidies, and its debts are borne by the national government.
      • OECD states Norway will have to change its social programs and transfer programs to maintain growth in the economy.
  • Slide 55
  • Methods and Results
      • Indices of accessibility for the region
      • Migration rates during the past 2 decades
      • Multivariate analysis with social vulnerability variables
  • Slide 56
  • Accessibility Index [formula not preserved in the transcript; surviving symbol definitions follow]
      • Magnet cities: cities with population of 20,000 or greater
      • Distance: from municipal centroid to city
      • To be included, a city had to be within 200 km of the centroid.
  • Slide 59
  • [Figure: Net Migration (%), 1990-2009]
  • Slide 61
  • Pressures from Outmigration
      • Loss of potential labor force
      • Loss of human capital
      • Makes region less attractive for potential employers
      • Less of a tax base
      • Loss of potential progeny
  • Slide 62
  • Social Vulnerability
      • Social vulnerability occurs when unequal exposure to risk is coupled with unequal access to resources (Morrow 2008).
      • Groups potentially discriminated against are the socially, culturally, and economically marginalized (Mustafa 1998, Morrow 2008).
      • Variables promoting social vulnerability include: poverty, minority status, gender, age, disability, human capital (Morrow 2008).
  • Slide 63
  • Variables used in multivariate analysis
      • Percent net migration (1990-2009)
      • Percent household income < 150,000 NOK ($22,000 USD) (2010)
      • Percent household income > 500,000 NOK ($85,000 USD) (2010)
      • Percent elderly (old age dependency) (2010)
      • Percent employed in primary industries, i.e. mining, fishing, farming (2010)
      • Percent labor force participation (2010)
      • Percent unemployed (2010)
      • Percent paid for social assistance (2009)
      • Percent over age 25 with only completed primary education (2010)
      • Percent over age 25 with secondary education attainment (2010)
      • Percent over age 25 with attainment beyond secondary w/o completion of tertiary (2010)
      • Percent over age 25 with attainment of tertiary education (2010)
      • Percent voter turnout (2008)
      • Percent municipal net loan to gross revenue (2010)
      • Percent municipal net loan debt per capita (2010)
      • Percent municipal long-term debt to revenue (2010)
  • Slide 64
  • Barents vs. Non-Barents Municipalities (N = 430): Non-Barents (N = 342); Barents (N = 88)
  • Slide 65
  • MANOVA output (Hotelling-Lawley test)
      • barentsF: Df = 1; Hotelling-Lawley = 1.5733; approx F = 43.424; num Df = 15; den Df = 414; Pr(>F) < 2.2e-16 ***
      • Residuals: Df = 428
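When the hypothesis has a single degree of freedom, as with the two-level Barents factor here, the F approximation for the Hotelling-Lawley trace reduces to the trace multiplied by the ratio of denominator to numerator degrees of freedom. The sketch below checks that the reported figures are mutually consistent; the function name is illustrative, and the simplification applies only to this one-degree-of-freedom case.

```python
def hotelling_lawley_to_f(trace, num_df, den_df):
    """Approximate F for a Hotelling-Lawley trace when the hypothesis has 1 df."""
    return trace * den_df / num_df


f_approx = hotelling_lawley_to_f(trace=1.5733, num_df=15, den_df=414)
print(round(f_approx, 3))  # prints 43.423
```

The small gap from the reported 43.424 is what one would expect from the trace having been rounded to four decimals.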
  • Slide 66
  • Principal components. Columns: Component; Eigenvalue; % of total variance; Variables (component loadings)
      • 1. Age, Income, School, Migration and Labor Force: 2.324; 33.75%
          • Percent Elderly (0.730); Income < 150,000 (0.762); Percent Primary Sector (0.651); Percent Tertiary School 1 (-0.738); Percent Tertiary School 2 (-0.641); Net Migration (-0.695); Labor Force Part. (-0.612); Income > 500,000 (-0.814)
      • 2. Social Welfare: 1.692; 17.90%
          • Percent Unemploy. (0.599); Upper Secondary Ed. (-0.576); Social Assistance (0.543); Labor Force Part. (-0.527)
      • 3. Debt: 1.305; 10.65%
          • Net Loan to Gross Rev. (0.617); Long Term debt (0.637); Net Loan Debt/capita (0.709)
      • 4. Education: 1.115; 7.77%
          • Tertiary Ed (-0.543) 1
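A quick arithmetic check helps in reading this table. If the PCA was run on 16 standardized variables (an assumption; the slide does not say), the total variance is 16, and the listed percentages are reproduced by squaring the values in the "Eigenvalues" column and dividing by 16. That pattern suggests the column may hold component standard deviations rather than eigenvalues, though the transcript alone cannot confirm it.

```python
n_variables = 16  # variables listed for the multivariate analysis (assumed all used)
reported = {      # component: (value in "Eigenvalues" column, listed % of variance)
    1: (2.324, 33.75),
    2: (1.692, 17.90),
    3: (1.305, 10.65),
    4: (1.115, 7.77),
}

for comp, (value, pct) in reported.items():
    implied = 100 * value ** 2 / n_variables
    print(f"PC{comp}: listed {pct:.2f}%, implied by squared value {implied:.2f}%")

cumulative = sum(pct for _, pct in reported.values())
print(f"First four components: {cumulative:.2f}% of total variance")
```

The first four components together account for about 70% of the total variance.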
  • Slide 69
  • [Figure: plots of municipalities on the first 3 principal components; groups: Barents, Non-Barents]
  • Slide 70
  • [Figure: Standard Deviations on First Principal Component]
  • Slide 71
  • QDA Analysis
      • Quadratic Discriminant Analysis on the same 16 variables
      • Results illustrate a discernible difference between North and South
      • Correct classification rate of 94.65% (cross-validated)
      • Group coding: 1 = Non-Barents, 2 = Barents
  • Slide 72
  • Discussion
      • Distinction between North and South: urbanization; migration and social vulnerability; life biography (20-somethings)
      • Caveats: missing variables (ethnic minority); Indigenous group (Sami)
      • Further research: community-level analysis
  • Slide 73
  • Photo by Hildegun Johnsen
  • Slide 74
  • Questions, Feedback?
  • Slide 75
  • Thank you
  • Slide 76
  • Determining the Geographic Origin of Potatoes with Trace Metal Analysis Using Statistical and Neural Network Classifiers
      • The objective of this research was to develop a method to confirm the geographical authenticity of Idaho-labeled potatoes as Idaho-grown potatoes.
      • Elemental analysis (K, Mg, Ca, Sr, Ba, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, Mo, S, Cd, Pb, and P)
      • PCA, CDA, discriminant function analysis, k-nearest neighbors, and neural network