Doing Data Science – Chapter 12: Epidemiology
• Vast amounts of individual patient medical data are available
  – Detailed: visits, prescriptions, outcomes, etc.
  – Records cover lifetimes
  – Largest databases have records on 80 million people
• However, many medical studies are observational
  – Not founded on data
  – Results affect the actions of doctors and insurance regulators
Confounder Problem and Stratification
• Confounding problem: an extraneous variable that correlates with both the dependent and the independent variable, giving a misleading impression of cause and effect
• Stratification: partitioning a case into subcases and evaluating just the subcases to reach conclusions about the top-level case
  – A weighted average is one way of combining the subcase results
• Example [p.294-295]:
  – In a study where equal numbers of women (50) and men (50) had the treatment but different numbers (80 women, 20 men) were in the control group
  – Original causal effect is 10%
  – Stratified causal effect is 5% for men and 11.25% for women
  – This does NOT prove that the treatment side effects are twice as strong for women
• Problem: errors in causality if the numbers in the groups after stratification are too different to give meaningful statistics
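The stratification arithmetic can be sketched in a few lines of Python. This is only an illustration: the slide gives the group sizes and the resulting effects but not the outcome counts, so the event counts below are hypothetical values chosen to be consistent with the quoted percentages, and the table on p.294-295 may differ. The names Stratum, crude_effect, and weighted_effect are made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    name: str
    treated_n: int        # people who received the treatment
    treated_events: int   # of those, how many had the side effect
    control_n: int        # people in the control group
    control_events: int   # of those, how many had the side effect

    @property
    def size(self) -> int:
        return self.treated_n + self.control_n

    @property
    def risk_difference(self) -> float:
        # Effect within the stratum: treated risk minus control risk.
        return (self.treated_events / self.treated_n
                - self.control_events / self.control_n)


def crude_effect(strata):
    """Risk difference computed on the pooled, unstratified data."""
    treated_n = sum(s.treated_n for s in strata)
    treated_events = sum(s.treated_events for s in strata)
    control_n = sum(s.control_n for s in strata)
    control_events = sum(s.control_events for s in strata)
    return treated_events / treated_n - control_events / control_n


def weighted_effect(strata):
    """Average of per-stratum effects, weighted by stratum size."""
    total = sum(s.size for s in strata)
    return sum(s.risk_difference * s.size for s in strata) / total


# HYPOTHETICAL event counts; only the group sizes come from the slide.
strata = [
    Stratum("women", treated_n=50, treated_events=10,
            control_n=80, control_events=7),
    Stratum("men", treated_n=50, treated_events=10,
            control_n=20, control_events=3),
]

for s in strata:
    print(f"{s.name}: {s.risk_difference:.4f}")
print(f"crude: {crude_effect(strata):.4f}")
print(f"weighted: {weighted_effect(strata):.4f}")
```

Weighting by stratum size is one possible choice; other weighting schemes exist. With only 20 men in the control group, a single extra event shifts the men's estimate by five percentage points, which is the small-group problem the slide warns about.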
Data Driven Studies
• Analysis of 50 studies of drug/outcome pairs
  – 5000 analyses for each pair on nine databases
  – Example:
    • ACE inhibitors (treatment for hypertension)/swelling of the heart
    • Results varied between databases from 3X risk to 6X risk
  – For 20 of 50 pairs, risk or no risk was database dependent
  – By adjusting the choice of database, confounders, and time windows, any study can be made to show risk or no risk
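To make "risk or no risk was database dependent" concrete, here is a minimal sketch that computes a relative risk with an approximate 95% confidence interval for the same drug/outcome pair in two databases and compares the verdicts. The counts and database names are hypothetical, and the log-scale interval used here is a common textbook approximation, not necessarily the method used in the analyses the slide describes.

```python
import math


def relative_risk(exposed_events, exposed_n, unexposed_events, unexposed_n):
    """Return (rr, lower, upper) using a 95% interval on the log scale."""
    rr = (exposed_events / exposed_n) / (unexposed_events / unexposed_n)
    se = math.sqrt(1 / exposed_events - 1 / exposed_n
                   + 1 / unexposed_events - 1 / unexposed_n)
    lower = math.exp(math.log(rr) - 1.96 * se)
    upper = math.exp(math.log(rr) + 1.96 * se)
    return rr, lower, upper


def verdict(lower, upper):
    """Classify by whether the interval excludes a relative risk of 1."""
    if lower > 1:
        return "risk"
    if upper < 1:
        return "protective"
    return "no clear effect"


# HYPOTHETICAL counts for one drug/outcome pair in two databases:
# (exposed_events, exposed_n, unexposed_events, unexposed_n)
databases = {
    "database_a": (30, 1000, 10, 1000),
    "database_b": (12, 1000, 10, 1000),
}

verdicts = {}
for name, counts in databases.items():
    rr, lower, upper = relative_risk(*counts)
    verdicts[name] = verdict(lower, upper)
    print(f"{name}: RR={rr:.2f} CI=({lower:.2f}, {upper:.2f}) "
          f"-> {verdicts[name]}")

# The pair is "database dependent" when the verdicts disagree.
database_dependent = len(set(verdicts.values())) > 1
print("database dependent:", database_dependent)
```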
Data Driven Studies
• Observational Medical Outcomes Partnership (OMOP)
  – See how well current methods predict things we already know
  – 10 large medical databases containing records for 200 million people
  – $25M
  – Determined an ROC curve. Area Under the Curve (AUC) was 0.65, not much better than a random 0.5
  – Databases are self-consistent: using one database gave better accuracy (0.92 in one case)
  – Graphs show ~80% sensitivity with ~10% false-positive rate [p.302]
“The epidemiologists in general don’t believe the results of this study.”
In other words, they prefer to rely on observational rather than data-driven conclusions.
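For readers unfamiliar with the AUC, sensitivity, and false-positive figures quoted for OMOP, the following sketch shows how such numbers are computed from labels and scores. It assumes each drug/outcome pair carries a label (1 if the association is already known to be real, 0 if it is known to be absent) and a score produced by some method. The labels, scores, and function names are hypothetical and are not OMOP data.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def sensitivity_and_fpr(labels, scores, threshold):
    """Sensitivity and false-positive rate when flagging scores >= threshold."""
    n_pos = sum(1 for l in labels if l == 1)
    n_neg = sum(1 for l in labels if l == 0)
    tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= threshold)
    fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= threshold)
    return tp / n_pos, fp / n_neg


# HYPOTHETICAL labels (1 = known true association) and method scores.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.35, 0.7, 0.3, 0.2, 0.1]

auc = roc_auc(labels, scores)
sens, fpr = sensitivity_and_fpr(labels, scores, threshold=0.5)
print(f"AUC={auc:.3f} sensitivity={sens:.2f} false-positive rate={fpr:.2f}")
```

An AUC of 0.5 means the scores rank positives no better than chance, which is why the slide treats 0.65 as a weak result.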
References
• http://en.wikipedia.org/wiki/Confounding
• https://en.wikipedia.org/wiki/ACE_inhibitor