Statistical Modeling: Building a Better Mouse Trap, and others

Dec 10, 2012 at the University of Hong Kong
Stephen Sauchi Lee
Associate Professor of Statistics
Affiliated Professor of Bioinformatics and Computational Biology
Department of Statistics, University of Idaho
Moscow, Idaho, USA


Statistical Modeling
On 3 projects
  Building a Better Mouse Trap? The Incremental Utility Behind the Methodology of Risk Assessment
  Predicting Parkinson Disease Status
  Demographic Impacts on Social Vulnerability in Norway

Building a Better Mouse Trap? The Incremental Utility Behind the Methodology of Risk Assessment
Academy of Criminal Justice Sciences, NYC, 2012-03-16

Zachary Hamilton, PhD
Melanie-Angela Neuilly, PhD
Robert Barnoski, PhD
Washington State University, Pullman, Washington

Stephen S. Lee, PhD
University of Idaho, Moscow, Idaho

Emerging technology of risk assessment
Four generations:
1) clinical judgment
2) static predictors
3) dynamic factors
4) automated

Regression methods utilized for instrument creation
  LSI-R: Logistic Regression
  COMPAS: Survival Regression

Recent advancements in prediction rarely utilized for criminal risk assessment
  Decision trees
  Neural networks
  Latent class analysis

Non-linear prediction models
Linear approaches assume equal additive quality for all factors (Steadman et al., 2000)
  Typically neglect interaction effects
Machine-learning approaches mirror diagnostic processes more closely (Steadman et al., 2000)
Machine-learning approaches most commonly used:
  Classification Trees (CT) and other recursive partitioning models (CHAID, CART, ICT, Random Forests, etc.)
  Neural Networks (NN)

CHAID = Chi-squared Automatic Interaction Detection

Classification Trees
Hierarchical question-decision tree model (Breiman, 1984)
The final answer is the result of a series of conditioning answers (if this, then that, etc.)
Used in diagnostic reasoning
No statistical significance

Random Forests
Inductive statistical learning
Aggregation of hundreds of Classification Trees
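The "if this, then that" structure of a classification tree and the aggregation step of a forest can be sketched in a few lines of Python. This is only an illustration: the feature names and cut-offs are invented, not taken from the study, and a real random forest would also grow each tree on a bootstrap sample with random feature subsets, which is omitted here.

```python
from collections import Counter

# Hypothetical hand-written trees; feature names and cut-offs are invented
# for illustration and do not come from the study.
def tree_a(person):
    # Each level asks one question and conditions the next one on the answer.
    if person["adult_felonies"] >= 3:
        return 1 if person["age"] < 30 else 0
    return 0

def tree_b(person):
    if person["juvenile_felony"]:
        return 1
    return 1 if person["misd_property"] >= 2 else 0

def tree_c(person):
    return 1 if person["age"] < 25 and person["adult_felonies"] >= 1 else 0

def forest_predict(trees, person):
    """Aggregate the individual trees by majority vote (ties go to class 0)."""
    votes = Counter(tree(person) for tree in trees)
    return 1 if votes[1] > votes[0] else 0

trees = [tree_a, tree_b, tree_c]
person = {"age": 24, "adult_felonies": 3, "juvenile_felony": False, "misd_property": 0}
prediction = forest_predict(trees, person)  # 1 = predicted to recidivate
```

With hundreds of trees instead of three, the vote becomes much less sensitive to any single tree's questions.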

[Figure: Hypothetical recidivism tree]

Neural Networks
Developed in Artificial Intelligence research
Data mining technique for pattern recognition
Aim at modeling the lower-level brain functions
Layered nodes of fact-sets instead of rules, used to train the network
Based on the training data, the network learns to deduce the right answer to any new piece of information
Used in psychiatric diagnostics

[Figure: Schematic Neural Network]
Recalculation of weights based on predicted and actual outputs

Previous CT and NN research on recidivism prediction
Studies using CT-like analyses, as well as NN, tend to make use of smaller samples ( 1,500) (except Berk et al., 2009; Palocsay et al., 2000; and Silver et al., 2000)
Overall, results are mixed, but those finding significant improvement via CT use lack proper validations (Liu, 2010)
Studies using NN show very split results (Liu, 2010)
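The "recalculation of weights based on predicted and actual outputs" noted for the schematic neural network can be illustrated with a single logistic unit. This is a deliberately reduced sketch, not the multi-layer network used in the study; the toy data (a logical OR), the learning rate, and the function names are all assumptions made for the example.

```python
import math

def predict(weights, bias, x):
    """Output of one logistic unit: a value between 0 and 1."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def update(weights, bias, x, actual, rate=0.5):
    """Move each weight in proportion to (actual - predicted) for this case."""
    error = actual - predict(weights, bias, x)
    new_weights = [w + rate * error * xi for w, xi in zip(weights, x)]
    return new_weights, bias + rate * error

# Toy training data (logical OR), chosen only because it is easy to check.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = [0.0, 0.0], 0.0
for _ in range(2000):
    for x, actual in data:
        weights, bias = update(weights, bias, x, actual)
```

After training, `predict(weights, bias, [0, 0])` falls below 0.5 and the other three inputs fall above it. A full network repeats the same idea across layers of units.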

Gaps in the literature
Overall, very few studies have investigated the utility of CT and NN for predicting recidivism
Previous studies have been limited
  In power
  To violence prediction
The current study remedies such limitations
  Close to million cases
  General recidivism as well as possibilities for investigating offense-specific recidivism

Washington State Static Risk Assessment
Previously utilized LSI-R
  Found laborious by community corrections officers
  Evaluated to be strengthened by increase of static items (Barnoski, 2003)

Created current instrument in 2006
  Factors strongly related to recidivism: demographics, juvenile record, commitments to DOC, felonies, misdemeanors, and violations
  Removed dynamic items (interview not required)
  Instrument scored from logistic regression (logit weights)
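Scoring an instrument from logit weights amounts to a weighted sum of the items passed through the logistic function. The sketch below shows the mechanics only; the item names, weights, and intercept are hypothetical, since the slides do not list the instrument's actual coefficients.

```python
import math

# Hypothetical logit weights and intercept; the instrument's real values
# are not given in the slides.
WEIGHTS = {"male": 0.30, "adult_felonies": 0.20, "misd_property": 0.40}
INTERCEPT = -1.5

def risk_probability(offender):
    """Convert item values into a predicted probability of recidivism."""
    logit = INTERCEPT + sum(w * offender.get(item, 0) for item, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-logit))

example = {"male": 1, "adult_felonies": 2, "misd_property": 1}
p = risk_probability(example)  # logit = -1.5 + 0.3 + 0.4 + 0.4 = -0.4
```

Because only static record items enter the sum, the score can be computed from administrative data without an interview.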

Comparable Predictive Validity for WA Sample (WSIPP, 2007)
  LSI-R AUC = .66
  WA Static Risk AUC = .74

Analysis Plan
24 variables included from current risk prediction instrument
3 year follow-up (release from incarceration)
Any felony recidivism

2 step creation
Construction sample
  All offenders released from prison or jail placed on community supervision from 1986 to 2000 (N = 287,417)
Validation sample
  All offenders released from prison or jail placed on community supervision from 2001 to 2002 (N = 71,957)
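The two-step design above is a temporal split: models are built on earlier releases and checked on later ones. A minimal sketch, assuming each record carries a hypothetical `release_year` field:

```python
def split_by_release_year(records):
    """Temporal split: build on 1986-2000 releases, validate on 2001-2002."""
    construction = [r for r in records if 1986 <= r["release_year"] <= 2000]
    validation = [r for r in records if 2001 <= r["release_year"] <= 2002]
    return construction, validation

# Toy records; "release_year" is an assumed field name.
records = [{"id": i, "release_year": y} for i, y in enumerate([1986, 1995, 2000, 2001, 2002])]
construction, validation = split_by_release_year(records)
```

Validating on a later cohort, rather than a random holdout, tests whether the instrument still predicts for offenders released after it was built.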

Compare methods of prediction models
Area under the receiver operating characteristic curve (AUC)
  Values in the .500s indicate no predictive accuracy
  .600s are weak, .700s moderate, and above .800 strong predictive accuracy

Descriptive Statistics (N = 359,374)
Predictor                                           % / Mean (SD)
    White (not included in model)                   79.7
 1. Male                                            18.7
 2. Age At Risk                                     31.7 (10.2)
 3. Adult Felonies                                  2.1 (1.9)
 4. Juvenile Felony Score                           32
 5. Juvenile Person Score                           6
 6. Number of DOC Commitments                       2.0 (1.7)
 7. Homicide/Manslaughter                           1
 8. Felony Sex                                      7
 9. Felony Violent Property                         9
10. Felony Non-Domestic Violence Assault            16
11. Felony Domestic Violence Assault                2
12. Felony Weapon                                   4
13. Felony Property                                 85
14. Felony Drug                                     62
15. Felony Escape                                   8
16. Misdemeanor Non-Domestic Violence Assault       23
17. Misdemeanor Domestic Violence Assault           21
18. Misdemeanor Sex                                 3
19. Misdemeanor Domestic Violence Other             1
20. Misdemeanor Weapon                              4
21. Misdemeanor Property                            52
22. Misdemeanor Drug                                17
23. Misdemeanor Escape                              1
24. Misdemeanor Alcohol                             17
    NewFelony (Outcome)                             44

Logistic regression method
Extended validation sample of original instrument construction
Strongest model predictors (weights) were: 1) Misd. Property, 2) Juvenile Felony, 3) Misd. Domestic Violence Assault, 4) Misd. Drug, 5) Misd. Sex, 6) Male
Findings comparable to original instrument construction

Models             Construction Sample ROCs    Validation Sample ROCs
Original Sample    .756                        .742
Extended Sample    .750                        .749

Random Forest Model
Strongest model predictors: 1) Felony Adjudications, 2) Misd. Property, 3) Sentence Length, 4) Juvenile Felony, 5) Age, 6) Felony Property

Models                 Construction Sample ROC (SE)    Validation Sample ROC (SE)
Logistic Regression    .750 (.001)                     .749* (.002)
Neural Network         .755* (.001)                    .750* (.002)
Random Forest          .750 (.001)                     .734 (.002)

Model Comparisons

The ROC curve plots sensitivity (y axis) against 1 - specificity, the false positive rate (x axis). The closer the ROC curve is to the upper left corner, the higher the overall accuracy.
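To make the AUC figures concrete, the sketch below builds ROC points from predicted scores, plotting sensitivity against the false positive rate (1 - specificity), and integrates them with the trapezoidal rule. The six scores and outcomes are invented for the example, and this plain-Python version is only an illustration, not the software used in the study.

```python
def roc_points(scores, labels):
    """(false positive rate, sensitivity) at each distinct score threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under consecutive ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy data: predicted risk scores and observed outcomes (1 = new felony).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
area = auc(roc_points(scores, labels))  # 8/9, about 0.889
```

The same number is the share of (recidivist, non-recidivist) pairs in which the recidivist received the higher score: 8 of the 9 pairs here.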

Model Comparisons
Significant differences found
  Neural network: significantly greater predictive validity than random forest
  Neural network: significantly greater predictive validity than logistic regression, but only in the construction sample

Incremental Utility of Methodological Advancements
Neural networks performed best, followed by logistic regression and random forest
ROC differences of methods found to be significant, but not universally
Preliminary nature of findings is stressed

Limitations
Lack of specificity of outcome measure and sample heterogeneity
  Any felony within 3 years
  Specialization and taxonomic structures not considered
Unit of analysis is incarceration cycle
  Violation of independence assumption for repeat incarcerations
Exclusion of dynamic predictors

Future Findings and Policy Implications
Add dynamic predictors to models
  Available since 2008
  Prior/preliminary findings indicate only modest improvement
Examine impact of latent variable methods
  4th potential model

Disentangle heterogeneity
  Subgroup analyses based on offense specialties, i.e. drug, violent, sex offender

Predicting Parkinson's disease status with vocal dysphonia measurements
Roxana Hickey
Bioinformatics & Computational Biology
Statistics 519 Multivariate Statistics Term Project
Professor Stephen Lee
April 27, 2011

Outline
  Background
  Parkinson's disease
  Vocal dysphonia
  Study dataset
  Statistical analyses
  Conclusions

Parkinson's disease
Neurological disorder that leads to shaking and difficulty with walking, movement and coordination [1]
Affects >1 million people in North America [2]
  rapidly increased prevalence after age 60 [3]
No cure, but medication available to alleviate symptoms, especially in early stages [4]
  early detection key to effective treatment strategies
http://www.healthtree.com/articles/parkinsons-disease/causes/

[1] = PubMed Health

Parkinson's disease & vocal impairment
~90% of individuals with Parkinson's disease have some form of vocal impairment [5, 6]
  characteristics [7]
    dysphonia (impaired production of vocal sounds)
    dysarthria (problems with normal articulation in speech)
  may be one of the earliest indicators of onset of illness [8]
Tests for vocal impairment [9, 10]
  sustained phonations [11, 12] (focus of this study)
    produce single vowel and hold pitch constant
  running speech [12]
    speak standard sentences that contain a representative sample of linguistic units

Measures of assessing vocal dysphonia
Traditional methods [11, 12]
  pitch (F0, fundamental frequency of vocal oscillation)
  absolute sound pressure level (loudness)
  jitter (variation in F0 from vocal cycle to vocal cycle)
  shimmer (variation in amplitude)
  noise-to-harmonics ratio
Novel methods [13, 14]
  nonlinear dynamical systems theory and nonlinear time series analysis
  recurrence period density entropy
  detrended fluctuation analysis
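Among the traditional measures above, jitter and shimmer are both cycle-to-cycle perturbation measures, one on the period (or F0) and one on the amplitude. The sketch below uses a common "local" definition: mean absolute difference between consecutive cycles divided by the mean value. The exact formulas in the MDVP software may differ, and the cycle values here are invented.

```python
def relative_perturbation(values):
    """Mean absolute cycle-to-cycle difference, relative to the mean value."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

# Hypothetical measurements for five consecutive vocal cycles.
periods_ms = [10.0, 10.2, 9.9, 10.1, 9.8]     # cycle lengths
amplitudes = [0.50, 0.52, 0.49, 0.51, 0.48]   # peak amplitudes

jitter_pct = 100 * relative_perturbation(periods_ms)   # about 2.5
shimmer_pct = 100 * relative_perturbation(amplitudes)  # about 5.0
```

A perfectly steady phonation would give zero on both measures; larger values indicate less stable vocal fold oscillation.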

Measures of assessing vocal dysphonia
Measurements differ in robustness [14]
  uncontrolled variation in acoustic environment
  physical condition and characteristics of subject
Therefore, chosen measurement methods should be as robust as possible to this variation
Goal of the study: identify an optimal feature set that is both robust to uncontrolled variation and able to classify patients with Parkinson's disease based on vocal dysphonic symptoms
Additional advantage: possibility of monitoring patients remotely

http://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data

Subjects & methods
Subjects
  31 individuals
    8 healthy
    23 with Parkinson's disease (PD)
  average of six sustained vowel phonations recorded from each subject
  Total n = 195
Calculation of features via software programs
  traditional measures
  non-standard measures, including new measure proposed by authors: pitch period entropy

Variables
Attribute            Description
MDVP:Fo(Hz)          Average vocal fundamental frequency
MDVP:Fhi(Hz)         Maximum vocal fundamental frequency
MDVP:Flo(Hz)         Minimum vocal fundamental frequency
MDVP:Jitter(%)       MDVP jitter as percentage
MDVP:Jitter(Abs)     MDVP absolute jitter in microseconds
MDVP:RAP             MDVP Relative Amplitude Perturbation
MDVP:PPQ             MDVP five-point Period Perturbation Quotient
Jitter:DDP           Average absolute difference of differences between cycles, divided by the average period
MDVP:Shimmer         MDVP local shimmer
MDVP:Shimmer(dB)     MDVP local shimmer in decibels
Shimmer:APQ3         3-pt Amplitude Perturbation Quotient
Shimmer:APQ5         5-pt Amplitude Perturbation Quotient
MDVP:APQ             MDVP 11-point Amplitude Perturbation Quotient
Shimmer:DDA          Avg abs. diff. between consecutive differences between the amplitudes of consecutive periods
NHR                  Noise-to-Harmonics Ratio
HNR                  Harmonics-to-Noise Ratio
RPDE                 Recurrence Period Density Entropy
D2                   Correlation dimension
DFA                  Detrended Fluctuation Analysis
spread1              Nonlinear measure of fundamental frequency variation
spread2              Nonlinear measure of fundamental frequency variation
PPE                  Pitch period entropy

MDVP = (Kay Pentax) Multi-Dimensional Voice Program

Variable groups (legend):
  Measures of variation in amplitude
  Measures of variation in fundamental frequency
  Measures of ratio of noise to tonal components in voice
  Nonlinear dynamical complexity measures
  Single fractal scaling exponent
  Nonlinear measures of fundamental frequency variation

Grouping variable: status
  = 0 (healthy)
  = 1 (PD)

Statistical analyses
  EDA
  PCA
  MANOVA
  Hotelling's T2
  QDA
  Classification tree (with random forest)

EDA

[EDA figures; groups: 0 = healthy, 1 = PD. Includes a parallel coordinate plot.]

PCA

MANOVA
H0: μ_healthy = μ_Parkinson's

Hotelling's T² test
H0: μ_healthy = μ_Parkinson's (p = 22)
T-square test statistic = 187.48
df = 48 + 147 - 2 = 193
critical T²(0.05; 22, 193) ≈ 47 (extrapolated)
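As a cross-check on the tabled critical value, the two-sample Hotelling statistic can be converted to an F statistic. The following is the standard textbook relation, written under the assumption that 48 and 147 in the df line are the two group sample sizes (healthy and PD recordings) and that p = 22 variables were tested; it is not taken from the slides themselves.

```latex
T^2 = \frac{n_1 n_2}{n_1 + n_2}\,
      (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)^{\top}
      \mathbf{S}_{p}^{-1}
      (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2),
\qquad
\frac{n_1 + n_2 - p - 1}{p\,(n_1 + n_2 - 2)}\; T^2 \;\sim\; F_{p,\; n_1 + n_2 - p - 1}
\quad \text{under } H_0 .

% With n_1 = 48, n_2 = 147, p = 22 and T^2 = 187.48:
F = \frac{172}{22 \times 193} \times 187.48 \approx 7.59
\quad \text{on } (22,\ 172) \text{ degrees of freedom.}
```

Here S_p denotes the pooled covariance matrix. Comparing the converted value with an F table avoids extrapolating in a T-square table.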

Conclusion: reject H0 (α = 0.05)
Coefficients of a indicate relative importance of variables in multivariate T-square test

park.qda.cv (cross-validated QDA)
                 Classified
Actual           0       1
0 (healthy)      35      13
1 (PD)           9       138

[...] 500,000 NOK $85,000 USD (2010)
Percent elderly (Old age dependency) (2010)
Percent employed in primary industries, i.e. mining, fishing, farming (2010)
Percent Labor Force participation (2010)
Percent unemployed (2010)
Percent paid for Social Assistance (2009)
  -Percent over age 25 with only completed primary education (2010)
  -Percent over age 25 with secondary education attainment (2010)
  -Percent over age 25 with attainment beyond secondary w/o completion of tertiary (2010)
  -Percent over age 25 with attainment of tertiary education (2010)
Percent Voter turnout (2008)
Percent Municipal Net Loan to Gross Revenue (2010)
Percent Municipal Net loan debt per capita (2010)
Percent Municipal Long term debt to Revenue of (2010)

Barents vs. Non-Barents Municipalities
N = 430: Non-Barents (N = 342), Barents (N = 88)

            Df  Hotelling-Lawley  approx F  num Df  den Df     Pr(>F)
barentsF     1            1.5733    43.424      15     414  < 2.2e-16 ***
Residuals  428
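The approximate F in the MANOVA output above follows directly from the Hotelling-Lawley trace when there are only two groups (one hypothesis degree of freedom). A small sketch of that arithmetic, using the numbers printed in the table; the function name is made up for the example.

```python
def hotelling_lawley_to_f(trace, num_df, den_df):
    """F statistic from the Hotelling-Lawley trace, two-group case only."""
    return trace * den_df / num_df

f_value = hotelling_lawley_to_f(1.5733, 15, 414)  # about 43.42
```

The result agrees with the reported 43.424 up to rounding of the printed trace. With more than two groups the conversion is more involved and this shortcut does not apply.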

Component                          Std. dev. (slide label: Eigenvalues)    % of total variance
  Variables and (component loadings)

1. Age, Income, School, Migration and Labor Force    2.324    33.75%
  Percent Elderly (0.730)
  Income < 150,000 (0.762)
  Percent Primary Sector (0.651)
  Percent Tertiary School1 (-0.738)
  Percent Tertiary School2 (-0.641)
  Net Migration (-0.695)
  Labor Force Part. (-0.612)
  Income > 500,000 (-0.814)

2. Social Welfare    1.692    17.90%
  Percent Unemploy. (0.599)
  Upper Secondary Ed. (-0.576)
  Social Assistance (0.543)
  Labor Force Part. (-0.527)

3. Debt    1.305    10.65%
  Net Loan to Gross Rev. (0.617)
  Long Term debt (0.637)
  Net Loan Debt/capita (0.709)

4. Education    1.115    7.77%
  Tertiary Ed (-0.543)
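The values listed per component appear to be component standard deviations rather than eigenvalues: squaring each one and dividing by 16 reproduces the stated percentages. The check below assumes the PCA was run on 16 standardized variables (so total variance is 16), which matches the "same 16 variables" mentioned for the QDA but is not stated for the PCA itself.

```python
N_VARIABLES = 16  # assumption: 16 standardized variables, total variance = 16

# Values listed per component in the table above.
listed = {1: 2.324, 2: 1.692, 3: 1.305, 4: 1.115}

# Eigenvalue = squared standard deviation; share = eigenvalue / total variance.
shares = {k: 100 * sd ** 2 / N_VARIABLES for k, sd in listed.items()}
cumulative = sum(shares.values())  # about 70% for the first four components
```

Dividing the listed values by 16 without squaring would give about 14.5% for the first component, which does not match the table.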

[Figure: Plots of municipalities on first 3 principal components; groups: Barents, Non-Barents]

[Figure: Standard deviations on first principal component]

QDA Analysis
Quadratic Discriminant Analysis on the same 16 variables. Results illustrate a discernible difference between North and South.
*** correct classification rate of 94.65% *** (cross-validated)
1 = Non-Barents, 2 = Barents

Discussion
Distinction between North and South urbanization
Migration and social vulnerability
Life Biography (20 somethings)

Caveats
Missing variables (ethnic minority)
Indigenous group Sami

Further research
Community level analysis

Photo by Hildegun Johnsen

Questions, Feedback?

Thank you

Determining the Geographic Origin of Potatoes with Trace Metal Analysis Using Statistical and Neural Network Classifiers
The objective of this research was to develop a method to confirm the geographical authenticity of Idaho-labeled potatoes as Idaho-grown potatoes.
Elemental analysis (K, Mg, Ca, Sr, Ba, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, Mo, S, Cd, Pb, and P)
PCA, CDA, discriminant function analysis, k-nearest neighbors, and neural network