Mining Health-Related Data Methods and applications in Research, Public Health, and Patient Care John H. Holmes, Ph.D. Center for Clinical Epidemiology and Biostatistics University of Pennsylvania School of Medicine CCEB


Page 1: Methods and applications in Research, Public Health, and ...infranet.uwaterloo.ca/infranet/inftalks/2003-2004/... · Knowledge Discovery in Databases - KDD lData-driven identification

Mining Health-Related Data
Methods and applications in Research, Public Health, and Patient Care

John H. Holmes, Ph.D.
Center for Clinical Epidemiology and Biostatistics

University of Pennsylvania School of Medicine

CCEB

Page 2

Why are these guys so happy?

Page 3

Where we’re going today…

• Introduction to databases and warehouses
• Data mining: What is it?
• Output of data mining
• The data mining life cycle
• Data mining applications
• Conclusion

Page 4

What are we looking for?
The Information Spectrum

Data Information Knowledge Wisdom

Page 5

160

Data!

Page 6

160/94

Information

Knowledge

Page 7

Databases

• Logically coherent collection of data with some inherent meaning
• Databases are designed, built, and populated with data for a specific purpose, for an intended group of users
• Represent some aspect of the real world

Page 8

“Large” data

• How to define large data
  » Number of fields
  » Number of records
  » Complexity of data model
  » Breadth of distribution
• Always, the issue is high dimensionality
• Ultimately, large data end up in a centralized resource

Page 9

Examples of large data

• CMS Minimum Data Set
• State Medicaid claims databases
• Federally mandated surveillance systems
• Proprietary insurance claims data

Page 10

Another (generic) example: Data Warehouses

• A centralized resource for long-term data storage
• Support the activities of entire organizations (enterprises)
• Input from distributed databases on scheduled batch basis
• Platform for decision support
• Provide large-scale, temporal data

Page 11

How does a warehouse work?

[Diagram: five source databases → extract data, transform data, clean data → data warehouse → three data marts]

Page 12

Large data gets us into a hole…

• Large number of raw and derived variables renders traditional “manual” methods for discovering patterns in data unwieldy
• Hypothesis-driven (biased) analyses may lead to missed associations
• Constantly changing patterns in prospective data require constantly changing analytic approaches that can be informed by data mining

Page 13

• Introduction to databases and warehouses
• Data mining: What is it?
• Output of data mining
• The data mining life cycle
• Data mining applications
• Conclusion

Page 14

Knowledge Discovery in Databases - KDD

• Data-driven identification of valid, novel, potentially useful, and ultimately meaningful patterns in databases
• Traditionally applied to large-scale enterprise databases (data warehouses)
• Focused on hypothesis generation, not hypothesis testing

Page 15

Now what are we looking for?
The Information Spectrum revisited

Data Information Knowledge Wisdom

Page 16

The KDD Process

[Figure: flow diagram of the KDD process. Stages: Database, Data Cleaning, Data Analysis, Model Development, Output Generation. Supporting tools: Query Tools, Statistics and AI Tools, Visualization Tools, Presentation Tools, Data Transformation Tools. Other labels: Domain Model, Model, Report, Monitor, Action.]

From Brachman and Anand, in Fayyad et al., Advances in Knowledge Discovery and Data Mining

Page 17

Data mining is the application of specialized software tools to the process of knowledge discovery

Page 18

• Introduction to databases and warehouses
• Data mining: What is it?
• Output of data mining
• The data mining life cycle
• Data mining applications
• Data mining resources

Page 19

The Ore… What comes out of the mine?

• Decision tables and trees
• Association rules
• Classification rules
• Prediction rules
• Clusters
• Visualizations

Page 20

Decision Trees

• Simple, graphical method of representing data attributes and the relationships between them
• Robust data visualization tools
• Nodes (or cells) implicitly test an attribute against a constant or another attribute

Page 21

A simple decision tree

[Figure: decision tree for treating a possible rabies exposure. Decision nodes: Bitten, Rabies present, Animal Captured, Animal Vaccinated. Branches are labeled Yes/No. Leaves: Treat, Don't treat, Test brain.]

Page 22

Rules

• IF {condition} THEN {result}
  where:
  » condition = antecedent (LHS)
  » result = consequent (RHS)
• Conditions can be joined by Boolean connectors
  » AND, NOT, OR
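A minimal sketch of how an IF/THEN rule with Boolean connectors might be represented, assuming a case is a dictionary of attribute values. The helper names (`eq`, `all_of`, `any_of`, `negate`, `Rule`) are hypothetical and are not taken from the talk; the example rule mirrors the form of the driver/road/restraint rule shown later in the deck.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Case = Dict[str, str]
Condition = Callable[[Case], bool]


def eq(attribute: str, value: str) -> Condition:
    """Atomic condition: the case's attribute equals the given value."""
    return lambda case: case.get(attribute) == value


def all_of(*conditions: Condition) -> Condition:
    """Boolean AND of several conditions."""
    return lambda case: all(c(case) for c in conditions)


def any_of(*conditions: Condition) -> Condition:
    """Boolean OR of several conditions."""
    return lambda case: any(c(case) for c in conditions)


def negate(condition: Condition) -> Condition:
    """Boolean NOT of a condition."""
    return lambda case: not condition(case)


@dataclass
class Rule:
    antecedent: Condition  # the IF part (LHS)
    consequent: str        # the THEN part (RHS)

    def fires(self, case: Case) -> bool:
        return self.antecedent(case)


# IF DRIVER=Yes AND ROAD=Rural AND NOT RESTRAINT=Yes THEN FATAL=Yes
rule = Rule(
    antecedent=all_of(
        eq("DRIVER", "Yes"),
        eq("ROAD", "Rural"),
        negate(eq("RESTRAINT", "Yes")),
    ),
    consequent="FATAL=Yes",
)

case = {"DRIVER": "Yes", "ROAD": "Rural", "RESTRAINT": "No"}
if rule.fires(case):
    print(rule.consequent)
```

Note that `negate(eq("RESTRAINT", "Yes"))` also holds when the attribute is absent from the case, which is one reason a real system would need an explicit policy for missing values.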

Page 23

Association Rules

• Focus on relationships between any attributes
• Most databases have large numbers of association rules that are often trivial (and misleading!)
• Example:
  » IF car-make = Ford THEN seat-belts-worn = Yes
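One way to see why such a rule can be trivial or misleading is to compare its confidence with the overall rate of the consequent. The sketch below uses support and confidence, two standard association-rule measures that the slide itself does not name, on four made-up records; the function name `support_confidence` is hypothetical.

```python
def support_confidence(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    antecedent and consequent are dicts of attribute -> required value.
    """
    def matches(record, conditions):
        return all(record.get(k) == v for k, v in conditions.items())

    n = len(records)
    n_lhs = sum(1 for r in records if matches(r, antecedent))
    n_both = sum(
        1 for r in records if matches(r, antecedent) and matches(r, consequent)
    )
    support = n_both / n if n else 0.0
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence


# Toy data, invented for illustration only.
records = [
    {"car-make": "Ford", "seat-belts-worn": "Yes"},
    {"car-make": "Ford", "seat-belts-worn": "Yes"},
    {"car-make": "Ford", "seat-belts-worn": "No"},
    {"car-make": "Toyota", "seat-belts-worn": "Yes"},
]

support, confidence = support_confidence(
    records, {"car-make": "Ford"}, {"seat-belts-worn": "Yes"}
)
# Share of all records with the consequent, regardless of car make.
baseline = sum(r["seat-belts-worn"] == "Yes" for r in records) / len(records)
print(support, confidence, baseline)
```

In this toy data the rule's confidence is lower than the baseline rate of seat-belt use, so the rule says nothing useful about Fords even though it holds for most Ford records.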

Page 24

Classification rule mining

• Looking for rules that classify cases into one of the known classes

If {VARIABLE}=value
then FATALITY=Yes

or

If {VARIABLE}=value
then FATALITY=No

Page 25

Prediction Rules

• Classification rules that are used to predict class membership for objects of unknown class
• May provide simple class membership
• May indicate probability of class membership

Page 26

How a prediction rule works

Unknown case

Class=?

Prediction rule

Classified case

Outcome=Positive

Page 27

Clustering

• Grouping of objects into sets defined by some type of similarity
• Clusters are constructed by maximizing intraclass similarity and minimizing interclass similarity
  » Objects in a cluster are similar to others in the same cluster
  » Objects in one cluster are dissimilar to those in another cluster
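The idea of grouping by similarity can be sketched with k-means on one-dimensional data. k-means is only one common clustering algorithm and the talk does not say which clusterer was used; the function name `kmeans_1d`, the starting centers, and the values are all made up for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for 1-D data with caller-supplied starting centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster with the nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its members.
        centers = [
            sum(members) / len(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers, clusters


values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

Members of each resulting cluster are close to their own center and far from the other one, which is the intraclass/interclass contrast described above.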

Page 28

An Example: Raw Data

Page 29

Applying a clusterer: Identifying similarities and dissimilarities

Page 30

Applying a clusterer: Identifying similarities and dissimilarities

Page 31

Online Analytical Processing (OLAP)

• Analysis techniques applied to data warehouses
  » Summarization
  » Consolidation
  » Aggregation
  » “Rotational” analysis
• Support multidimensional databases
• Predecessor of modern data mining

Page 32

Data Visualization

• Used when data are not in an organized form
• Often a good first step to data reduction and transformation
• Focuses on graphical techniques

Page 33

Visualization Tools

• Bar charts
• Pie charts
• Line graphs
• X-Y plots
• Maps
• Density plots

Page 34

Visualization tools of historical interest: Snow’s Cholera Map

Page 35

Visualization tools of historical interest: Nightingale’s Coxcomb Plot

Page 36

• Introduction to databases and warehouses
• Data mining: What is it?
• Output of data mining
• The data mining life cycle
• Data mining applications
• Data mining resources

Page 37

The Data Mining Life Cycle

• Data preparation
• Data reduction
• Data modeling and prediction
• Evaluation

Page 38

1. Data Preparation

• Standardization of coding from attribute to attribute of like concept
• Applying attribute transformations
• Applying attribute normalizations
• Database denormalization
• Discretization
• Missing data

Page 39

Coding standardization

• Make sure all variables of like meaning are coded the same way
• Example
  » Height should be in same units from record to record in the database

Page 40

Attribute transformations

• Some techniques require data to be normally distributed or otherwise “smoothed”
• Example
  » WBC (heavily skewed to the right)
  » Solution: log transform
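A small sketch of the log transform on a right-skewed attribute. The WBC values are invented, and the gap between mean and median is used here only as a rough indicator of skew; the function name `log_transform` is hypothetical.

```python
import math
from statistics import mean, median


def log_transform(values):
    """Natural-log transform; every value must be strictly positive."""
    return [math.log(v) for v in values]


# Made-up white blood cell counts with one large value in the right tail.
wbc = [4.0, 5.0, 6.0, 7.0, 8.0, 30.0]
logged = log_transform(wbc)

# A mean well above the median suggests right skew; the gap shrinks after the transform.
print(mean(wbc) - median(wbc))
print(mean(logged) - median(logged))
```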

Page 41

Attribute normalization

• Rescales an attribute's range to a fixed interval, usually 0 to 1 or -1 to +1
• Required by clustering methods for polytomous categorical data
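Range normalization can be sketched as min-max scaling. This is one common scheme, assumed here for illustration since the slide does not name a specific formula; `min_max_scale` is a hypothetical helper.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale values so the minimum maps to low and the maximum to high."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # A constant attribute carries no range information; map it to the lower bound.
        return [low for _ in values]
    scale = (high - low) / (largest - smallest)
    return [low + (v - smallest) * scale for v in values]


ages = [20, 40, 60]
print(min_max_scale(ages))             # scaled to the interval 0..1
print(min_max_scale(ages, -1.0, 1.0))  # scaled to the interval -1..+1
```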

Page 42

Database denormalization

• Required if the software can’t handle normalized databases
• Example
  » Large claims databases, where patients have many claims records
  » Solution: do the needed joins to create a flat file and mine that
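The join that produces a flat file can be sketched in plain Python. The table layout, field names (`patient_id`, `dx`, `sex`, `age`), and records below are hypothetical and stand in for a real claims schema.

```python
# Normalized form: one patient table and one claims table linked by patient_id.
patients = {
    1: {"sex": "F", "age": 54},
    2: {"sex": "M", "age": 61},
}
claims = [
    {"patient_id": 1, "dx": "Pulmonary"},
    {"patient_id": 1, "dx": "Endocrine"},
    {"patient_id": 2, "dx": "Neurologic"},
    {"patient_id": 3, "dx": "Cancer"},  # no matching patient record
]


def flatten(patients, claims):
    """Inner join: one flat row per claim, with the patient's attributes copied on."""
    rows = []
    for claim in claims:
        patient = patients.get(claim["patient_id"])
        if patient is None:
            continue  # drop claims without a matching patient
        rows.append({**claim, **patient})
    return rows


flat = flatten(patients, claims)
for row in flat:
    print(row)
```

Patient attributes are repeated on every claim row, which is the redundancy a normalized design avoids and a flat-file miner requires.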

Page 43

Discretization

• Some data mining software can’t handle continuously valued attributes
  » Example: Older evolutionary computation methods
  » Solution: discretize
• Approaches:
  » Histogram analysis
  » Statistical binning
  » Machine learning methods
    – Entropy-based discretization
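A minimal sketch of entropy-based discretization for one continuous attribute, assuming a single binary cut chosen to minimize the weighted class entropy of the two resulting bins. Real tools typically apply this recursively with a stopping criterion; the function names and the toy ages and labels are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return the midpoint cut that minimizes weighted entropy of the two bins."""
    pairs = sorted(zip(values, labels))
    best_score, best_point = float("inf"), None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut is possible between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_score = score
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_point


ages = [18, 22, 25, 60, 65, 70]
fatal = ["No", "No", "No", "Yes", "Yes", "Yes"]

cut = best_cut(ages, fatal)
binned = ["low" if age <= cut else "high" for age in ages]
print(cut, binned)
```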

Page 44

Handling missing data

• Special coding regimes
  » Use negative or very large numbers to code numerical missing data
• Imputation
  » Estimate what the missing value should be, using statistical methods
• Ignore it
  » Some software can do this!
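The first two options can be sketched together: a sentinel code marks missing values, and mean imputation fills them in. Mean imputation is only the simplest statistical choice, and the sentinel `-9`, the name `impute_mean`, and the blood pressure values are assumptions made for illustration.

```python
from statistics import mean

MISSING = -9  # hypothetical sentinel code for a missing numeric value


def impute_mean(values, missing=MISSING):
    """Replace sentinel-coded values with the mean of the observed values."""
    observed = [v for v in values if v != missing]
    if not observed:
        return list(values)  # nothing to estimate from
    fill = mean(observed)
    return [fill if v == missing else v for v in values]


systolic = [120, MISSING, 140, 160]
print(impute_mean(systolic))

# Forgetting to exclude the sentinel distorts summary statistics.
print(mean(systolic))
```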

Page 45

2. Data Reduction

• Why?
  » Data mining methods are not open-ended as to capacity
  » Trivia and minutiae are noise
  » Noise may overwhelm software
  » Output may overwhelm users

Page 46

Methods for Data Reduction

• Segmentation
• Deletion
• Sampling
• Feature selection

Page 47

Segmentation

• Divide the database up into manageable chunks, while analyzing all of the data
• Methods
  » Kohonen maps
  » Clustering

Page 48

Deletion

• Rows
  » Restrict the exploration to selected records in the database
  » Problem: you could end up missing a rare cancer!
• Columns
  » Restrict the exploration to selected fields in the database
  » Problem: you could end up missing important risk factors!

Page 49

Feature selection

• Heuristic approaches
  » Cognitive domain model
  » Expert panel
• Statistical methods
  » Univariate and bivariate analysis
  » Regression
  » Nearest neighbors
  » Clustering

Page 50

3. Data Modeling and Prediction

• Use of mined information to create or augment knowledge
• Application of mined information for classification and prediction
• Employs the output of data mining (we’ll come back to that!)

Page 51

Two families of data mining tools

• Statistical/Probabilistic
• Machine learning/Artificial intelligence

Page 52

Statistical and probabilistic data mining tools

• Univariate
• Multivariate
• Bayesian classifiers
• Statistical classifiers

Page 53

Machine learning tools

• Neural networks
• Decision tree induction
• Evolutionary computation

Page 54

Decision Tree Induction

• Decision trees
  » A node represents a test on an attribute
  » A branch represents the test’s outcome
  » Leaf nodes represent decisions
• Created (induced) from data by means of an entropy-based metric, information gain
  » Used to select recursively the attribute that best separates the data into separate classes at a given node
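The attribute-selection step at a single node can be sketched as follows, using the usual definition of information gain (parent entropy minus the weighted entropy of the child partitions). This is a minimal illustration on four invented records, not the algorithm of any particular tool; the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target before the split minus weighted entropy after it."""
    before = entropy([row[target] for row in rows])
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[attribute]].append(row[target])
    after = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return before - after


def best_attribute(rows, attributes, target):
    """Pick the attribute with the highest information gain at this node."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Toy records, invented for illustration.
rows = [
    {"ejection": "Yes", "sex": "M", "fatal": "Yes"},
    {"ejection": "Yes", "sex": "F", "fatal": "Yes"},
    {"ejection": "No", "sex": "M", "fatal": "No"},
    {"ejection": "No", "sex": "F", "fatal": "No"},
]

print(information_gain(rows, "ejection", "fatal"))
print(information_gain(rows, "sex", "fatal"))
print(best_attribute(rows, ["ejection", "sex"], "fatal"))
```

A full inducer would repeat this choice recursively on each partition until the leaves are pure or a stopping rule applies.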

Page 55

Evolutionary Computation

• Framework and algorithms based on genetics metaphor
• Each unique combination of responses to a set of variables mapped to a unique rule or “chromosome”
• Each rule or “chromosome” is mapped to a possible outcome or “phenotype”
• Genetic operators mimic Darwinism (survival of the fittest rule)

Page 56

Mining Complex Data

• Spatial
• Time-series
• Text
• Web

Page 57

Spatial Data

• Contain topological information
• Organized according to a complex multidimensional indexing structure
• Require spatial reasoning and representation
• Examples
  » Maps
  » Imaging data

Page 58

An example: MRI of an abdominal aortic aneurysm

Page 59

Mining time-series data

• Trend analysis
  » Trend movements
    – General direction in which a time-series graph moves over time
    – Cyclic variations
      • Long-term oscillations of a trend line over time
    – Seasonal variation
      • Cyclic variations tied to recurring points in time
    – Random variation
      • Variations in movement of the trend line that are not cyclic
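One simple way to expose the general direction of a series is a moving average, which smooths out short-term variation. This is a swapped-in illustration, since the slide does not name a smoothing method; the series values and the name `moving_average` are made up.

```python
def moving_average(series, window):
    """Mean of each run of `window` consecutive values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Invented series: an upward trend with an alternating up/down pattern on top.
counts = [10, 14, 12, 16, 14, 18, 16, 20]
trend = moving_average(counts, window=2)
print(trend)
```

With a window equal to the length of the alternating pattern, the zigzag cancels out and the smoothed values rise steadily.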

Page 60

Text mining

• Text databases
  » Collections of text-based documents
  » Data semi- or unstructured
• Methods
  » Keyword and similarity retrieval
  » Latent semantic indexing

Page 61

Keyword and similarity retrieval

• Keyword retrieval
  » Document represented by a string containing one or more keywords
  » Query formulated using keyword vectors with Boolean operators
• Similarity-based retrieval
  » Similar to keyword retrieval, but retrieval is based on degree of similarity between keywords in document and in query vector
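The two retrieval styles can be contrasted in a short sketch. Boolean keyword retrieval is shown as an AND over required keywords, and similarity is measured with cosine similarity over keyword counts; cosine is a common choice assumed here, since the slide does not name a measure. The documents, query, and function names are invented.

```python
import math
from collections import Counter


def keyword_vector(text):
    """Bag-of-words keyword counts for a whitespace-separated string."""
    return Counter(text.lower().split())


def boolean_and(doc_vector, keywords):
    """Keyword retrieval: the document matches only if every keyword is present."""
    return all(k in doc_vector for k in keywords)


def cosine_similarity(a, b):
    """Cosine of the angle between two keyword-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = {
    "d1": keyword_vector("medication error dosage"),
    "d2": keyword_vector("vehicle crash fatality"),
}
query = keyword_vector("medication dosage")

matches = [name for name, vec in docs.items() if boolean_and(vec, ["medication", "dosage"])]
ranked = sorted(docs, key=lambda name: cosine_similarity(docs[name], query), reverse=True)
print(matches, ranked)
```

Boolean retrieval returns an unordered match set, while similarity-based retrieval yields a ranking in which partial matches can still appear.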

Page 62

Web mining

• Issues
  » Size
  » Complexity
  » Dynamic nature
  » Broad coverage
  » Lots of chaff

Page 63

Identifying Web usage patterns

• Uses Web logs to discover patterns of access to pages
  » URL, IP of accessing user, and timestamp
  » Lots of simple data, but often confusing patterns!
• Applications
  » Marketing
  » Web site design

Page 64

4. Evaluation

• Heuristics
  » Does this make good clinical sense?
• Multi-method comparisons
• Statistical methods
  » Tests of association
  » Sensitivity, specificity, predictive values, ROC curves
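Sensitivity, specificity, and the predictive values all come from the same 2×2 table of predicted versus actual class. The sketch below uses the standard definitions; the counts are invented and `diagnostic_metrics` is a hypothetical helper name.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard 2x2-table measures; a ratio is None when its denominator is zero."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),  # true positives among actual positives
        "specificity": ratio(tn, tn + fp),  # true negatives among actual negatives
        "ppv": ratio(tp, tp + fp),          # actual positives among predicted positives
        "npv": ratio(tn, tn + fn),          # actual negatives among predicted negatives
    }


# Made-up counts: 50 actual positives and 150 actual negatives.
metrics = diagnostic_metrics(tp=40, fp=20, fn=10, tn=130)
for name, value in metrics.items():
    print(name, round(value, 3))
```

An ROC curve plots sensitivity against 1 − specificity as the classification threshold varies, so each point on it corresponds to one such table.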

Page 65

• Introduction to databases and warehouses
• Data mining: What is it?
• Output of data mining
• The data mining life cycle
• Data mining applications
• Data mining resources

Page 66

Three illustrative applications of data mining

• Epidemiologic surveillance
  » Motor vehicle-associated fatalities
• Patient safety
  » Features associated with medication errors
• Research
  » Intelligent data analysis

Page 67

Fatality Analysis Reporting System (FARS)

• Prospective surveillance database of all fatal vehicle accidents occurring in the US
• Available at http://www-fars.nhtsa.dot.gov/
• Person-, vehicle-, and crash-level data

Page 68

The FARS data model

Crash Vehicle Person

Page 69

Some possible questions about FARS

• What variables are associated with fatality?
• What variables predict fatality?
• Why not just use logistic regression?
• How to go about mining this database?

Page 70

Some characteristics of FARS

• Denormalized
  » The Person File contains pertinent data from the Vehicle and Crash Files
• Large
  » 100,968 person records
  » 72 candidate variables
• Unbalanced
  » 42,116 deaths (41.7%)
• Many missing values
  » Bicycles don’t have airbags!
• Some variables are continuous
  » Require discretization for some DM tools
• Interactions
  » Passenger airbag deployment vs. year of vehicle
• Prospective
  » Even within a given year, new patterns emerge over time

Page 71

What we’re left with

Outcome: Fatality (Yes/No)

Candidate predictors

• Age
• Sex
• Roadway function
• Manner of collision
• Model year
• Body type
• Rollover
• Emergency use vehicle
• Impact type
• Fire and/or explosion
• Person type
• Seating position
• Location in vehicle
• Ejection from vehicle
• Alcohol use
• Drug use
• Work-related injury
• Restraint use
• Weather conditions
• Surface conditions

Page 72

Sample decision tree output

:...ejection = Yes:
    :...hospital = No: Fatality (2968/36)
        hospital = Yes:
        :...driver = Yes: Fatality (1337/388)
            driver = No:
            :...rollover = Yes: No Fatality (1120/379)
                rollover = No:
                :...collision = No: Fatality (12/1)
                    collision = Yes:
                    :...urban = No: No Fatality (275/99)
                        urban = Yes:
                        :...fire_exp = Yes: No Fatality (10/2)
                            fire_exp = No:
                            :...air_bag = Yes: Fatality (18/6)
                                air_bag = No:
                                :...rest_use = Yes: No Fatality (10/2)
                                    rest_use = No:
                                    :...sex = No: Fatality (70/31)
                                        sex = Male:
                                        :...drinking = No: No Fatality (107/42)
                                            drinking = Yes: Fatality (5/1)

Page 73

Sample rules from a decision rule inducer

If DRIVER=No
And EJECTION=No
And DRINKING=No
And DRUGS=No
Then FATAL=No

If DRIVER=Yes
And ROAD=Rural
And RESTRAINT=No
Then FATAL=Yes

If WORK_INJ=Yes
Then FATAL=Yes

If HOSPITAL=Yes
Then FATAL=No

If PEDESTRIAN=No
Then FATAL=No

If EJECTION=Yes
And HOSPITAL=No
Then FATAL=Yes

Page 74

Three illustrative applications of data mining

• Epidemiologic surveillance
  » Motor vehicle-associated fatalities
• Patient safety
  » Features associated with medication errors
• Research
  » Intelligent data analysis

Page 75

Mining a large dataset for diagnoses associated with medication errors

• Nationwide Inpatient Sample
  » Healthcare Cost and Utilization Project of U.S. Agency for Healthcare Research and Quality
• Details
  » Seven million hospital discharges in 1997
  » Data from 22 states

Page 76

Predictor variables

• Age in years
• Died in hospital
• Discharge disposition
• Length of stay

Demographics

• Primary insurance
• Race
• Sex
• Income

Discharge Diagnoses

• Cancer
• Circulation
• Congenital
• Dermatologic
• Endocrine
• Gastrointestinal
• Genitourinary
• Hematologic
• Musculoskeletal
• Neurologic
• Obstetrical
• Psychiatric
• Pulmonary
• Others

Page 77

Class distribution

Medication Error 52,491 (0.73%)

No Medication Error 7,095,929 (99.27%)

Total 7,148,420 (100.0%)

Page 78

Classification variable

• Presence of a diagnosis suggesting medication error, defined as:
  » Incorrect dosage
  » Incorrect route of administration
  » Incorrect drug
  » Incorrect time or frequency
  » Problem associated with medication administration that could lead to an ADE

Page 79

Data mining approach

• Decision tree induction (See5)
• Evolutionary computation (EpiCS)
• Logistic regression

Page 80

Results: Sample negative rules

If AGE<=22
   LOS<=2
   INCOME>$50K
Then No error present

If LOS>2 and LOS<=6
   INCOME>$25K
   DIAGNOSIS=Dermatologic
Then No error present

If AGE>37 and AGE<=45
   LOS=5
   INSURER=Private
   DIAGNOSIS<>Gastrointestinal
Then No error present

Page 81: Methods and applications in Research, Public Health, and ...infranet.uwaterloo.ca/infranet/inftalks/2003-2004/... · Knowledge Discovery in Databases - KDD lData-driven identification

Results: Sample positive rules

If LOS=2
   DIAGNOSIS=Psychiatric
Then Error present

If LOS>13 and LOS<=16
   INSURER=Medicaid
   INCOME<$25K
   DIAGNOSIS<>Obstetric
Then Error present

If DISPOSITION=Transferred to another hospital
   LOS=1
   SEX=Male
   DIAGNOSIS=Psychiatric
Then Error present
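
Rules of this kind are easy to apply to a record once they are written as attribute tests. The Python sketch below encodes two of the positive rules above, assuming that the stacked conditions in each rule are joined by AND and that each discharge record is a dict. The key names mirror the attribute names on the slide, but the record layout, the function names, and the sample record are hypothetical; See5 and EpiCS use their own rule formats.

```python
# Each rule is a list of (attribute, test) pairs; a rule fires only when
# every test passes (assumption: conditions are joined by AND).
POSITIVE_RULES = [
    [("LOS", lambda v: v == 2),
     ("DIAGNOSIS", lambda v: v == "Psychiatric")],
    [("DISPOSITION", lambda v: v == "Transferred to another hospital"),
     ("LOS", lambda v: v == 1),
     ("SEX", lambda v: v == "Male"),
     ("DIAGNOSIS", lambda v: v == "Psychiatric")],
]


def rule_fires(rule, record):
    """Return True when the record satisfies every condition of the rule."""
    return all(attr in record and test(record[attr]) for attr, test in rule)


def classify(record, rules=POSITIVE_RULES):
    """Label a record 'Error present' if any positive rule fires."""
    if any(rule_fires(rule, record) for rule in rules):
        return "Error present"
    return "No rule matched"


# Hypothetical record, invented for this example.
record = {"LOS": 2, "DIAGNOSIS": "Psychiatric", "SEX": "Female",
          "DISPOSITION": "Home"}
print(classify(record))
```

A record that matches no rule is reported as "No rule matched" rather than "No error present", since the sample rules shown are only a small subset of a full rule set.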

Three illustrative applications of data mining

l Epidemiologic surveillance
» Motor vehicle-associated fatalities

l Patient safety
» Features associated with medication errors

l Research
» Intelligent data analysis

Using data mining to inform statistical analysis

l Data
» FARS 2001 person file

l Data mining methods
» EpiXCS
» See5 (decision tree inducer)

l Primary statistical analysis method
» Logistic regression

The problem…

l The dataset is too large to analyze via traditional statistical methods
» >100K cases
» >100 variables, plus interactions

l Variable selection for logistic regression via bivariate methods too cumbersome

l How can we build a robust logistic model using these data?
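
To see why bivariate screening becomes cumbersome, it helps to count the candidate terms. The small Python sketch below takes 100 variables as a lower bound for the ">100 variables" above; the real count is larger.

```python
from math import comb

n_variables = 100  # lower bound; the slide says ">100 variables"

main_effects = n_variables
two_way = comb(n_variables, 2)    # all pairs of variables
three_way = comb(n_variables, 3)  # all triples of variables

print(f"main effects:           {main_effects:,}")
print(f"two-way interactions:   {two_way:,}")
print(f"three-way interactions: {three_way:,}")
```

Even at this lower bound there are thousands of pairwise interactions to screen, before any higher-order terms are considered.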

Alternatives

l Bootstrapping

l Mine the data to identify candidate predictors and interactions

l Or, both!
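
As a rough illustration of the bootstrapping alternative, the Python sketch below resamples a small synthetic dataset with replacement and reports a percentile interval for an odds ratio. The data, the function names, and the 1,000-replicate setting are all assumptions made for this example; nothing here reproduces the FARS analysis.

```python
import random


def odds_ratio(rows):
    """Odds ratio of outcome given exposure from (exposed, outcome) pairs.

    Adds 0.5 to each cell (Haldane-Anscombe correction) so that a
    resample with an empty cell does not cause division by zero.
    """
    a = sum(1 for e, o in rows if e and o) + 0.5          # exposed, outcome
    b = sum(1 for e, o in rows if e and not o) + 0.5      # exposed, no outcome
    c = sum(1 for e, o in rows if not e and o) + 0.5      # unexposed, outcome
    d = sum(1 for e, o in rows if not e and not o) + 0.5  # unexposed, no outcome
    return (a * d) / (b * c)


def bootstrap_interval(rows, n_replicates=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the odds ratio."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_replicates):
        resample = [rng.choice(rows) for _ in rows]  # sample with replacement
        estimates.append(odds_ratio(resample))
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_replicates)]
    hi = estimates[int((1 - alpha / 2) * n_replicates) - 1]
    return lo, hi


# Synthetic data: 40 exposed with outcome, 60 exposed without,
# 20 unexposed with outcome, 80 unexposed without.
data = ([(True, True)] * 40 + [(True, False)] * 60 +
        [(False, True)] * 20 + [(False, False)] * 80)

point = odds_ratio(data)
low, high = bootstrap_interval(data)
print(f"odds ratio {point:.2f}, 95% bootstrap interval {low:.2f} to {high:.2f}")
```

The same resampling loop can wrap a variable-selection step, so that predictors chosen in most replicates are treated as the stable candidates.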

Here’s how the methods compared on variable selection

Significant Predictors     See5   LR   EpiXCS

Inappropriate restraint     -     -     X
Rear-end impact             -     -     X
Age >55                     -     -     X
Pickup trucks               X     -     X
Vehicle rollover            X     -     X
Not hospitalized            X     -     X
Driver’s side impact        X     X     X
Fire/Explosion              X     X     X
Ejection                    X     X     X
Motorcyclists               X     X     X
Cyclists                    X     X     X

And on classification…

Classification Performance on Testing Set

        See5    LR      EpiXCS
PPV     0.84    0.77    0.84
AUC     0.80    0.86    0.85
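
For readers unfamiliar with the two measures: PPV is the fraction of predicted positives that are truly positive, and AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A minimal Python sketch of both follows, using made-up labels and scores rather than the study's data; the function names are illustrative.

```python
def ppv(y_true, y_pred):
    """Positive predictive value: TP / (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else float("nan")


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the share of (positive, negative) pairs in which the positive
    case has the higher score; ties count as half. This is O(n^2) and is
    meant for illustration, not for large datasets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example: true labels, classifier scores, and predictions at 0.5.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(f"PPV = {ppv(y_true, y_pred):.2f}")
print(f"AUC = {auc(y_true, scores):.2f}")
```

PPV depends on the chosen cut-off (0.5 here), while AUC summarizes ranking quality across all cut-offs, which is why the two measures can order the methods differently.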

l Introduction to databases and warehouses
l Data mining: What is it?
l Output of data mining
l The data mining life cycle
l Data mining applications
l Conclusion

What kinds of questions can (and can’t) be answered by data mining?

l Can…
» Are there attributes that seem to be associated with others?
» Are there any attributes that may be associated with an outcome that I might not be considering?
» What variables should be included in a regression model?

l Can’t…
» Is an observed association statistically significant?
» Can I always rely on what the miner is telling me?
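
Whether an association flagged by a miner is statistically significant has to be settled with a conventional test afterwards. As one hedged illustration, the Python sketch below runs a Pearson chi-square test on a 2x2 table. The counts are invented, and the helper name chi_square_2x2 is hypothetical.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table

                 outcome   no outcome
    exposed         a          b
    unexposed       c          d

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one case")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the statistic is a squared standard normal,
    # so the upper-tail probability is erfc(sqrt(statistic / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Invented counts for an association a miner might have flagged.
stat, p = chi_square_2x2(40, 60, 20, 80)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```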

Where is data mining appropriate?

l Large data
» Data warehouses
» Data marts
» Temporal databases

l Small data
» Specialized registries
» Ad hoc clinical research databases

l When the data are complex enough that you’re not sure that just looking at them will give you the answers you need

What is left after you strip away the hype?

l Data mining is a computer science discipline in development
» New techniques and software appear frequently, many of them untested
» Old techniques have been prematurely rejected
» Data mining is not a panacea…

BE CIRCUMSPECT!

Ethical concerns

l Data mining can’t be used to make policy, at least not by itself!

l Data mining results should not be reported in the literature, unless it’s a data mining article

l There is no substitute for the intellectual enterprise, of which data mining is a small part

Some data mining resources

l One-stop shopping web site
» KDnuggets www.kdnuggets.com

l Software
» Weka www.cs.waikato.ac.nz/ml/weka/
» MLC++ www.sgi.com/tech/mlc/
» IBM Intelligent Miner www-3.ibm.com/software/data/iminer/
» SPSS Clementine www.spss.com