mining unstructured healthcare data deep dhillon

mining unstructured healthcare data deep dhillon chief data scientist | [email protected] | twitter.com/zang0


Post on 24-Feb-2016



TRANSCRIPT

Page 1: mining unstructured healthcare data deep dhillon

mining unstructured healthcare data

deep dhillon

chief data scientist | [email protected] | twitter.com/zang0

Page 2: mining unstructured healthcare data deep dhillon

alliance health networks

HCP/advanced patients: medify.com – 10,000% user growth past year

patients: diabeticconnect.com – #1 online diabetes site @ ~1.4M uniques

*connect.com – content sites + guided social networks

health care industry: patient surveying, matchmaking and analysis

Page 3: mining unstructured healthcare data deep dhillon

why mine healthcare text?

anatomy of what patients currently use: webmd (e.g. drugs.com, yahoo health, etc.)

q/a
• original content
• aligns well with google searches
• provides answers

topic pages
• original content
• typically aligned well with google searches, i.e. treatments for X, symptoms for Y
• good coverage in the head

news + media
• original content
• fresh
• simple to understand
• good coverage in the head

experts
• original content
• moderately authoritative
• simple to understand
• good coverage in the head
• provides answers

discussions
• original content
• addresses emotional needs
• simple to understand
• provides answers

original content = important
providing answers = important
head = developed
fresh = important
authority = important

Page 4: mining unstructured healthcare data deep dhillon

why mine healthcare text?

Patients and HCPs need long tail, statistically meaningful, consumer friendly, authoritative and fresh health content.

q/a
• manually written, not authoritative
• not consistently credible, i.e. minimal accreditation
• not statistically meaningful

topic pages
• evergreen, rarely change
• manually written, not authoritative
• not consistently credible, i.e. minimal accreditation
• not thorough, i.e. not long tail

news + media
• manually written, not authoritative
• not consistently credible, i.e. minimal accreditation
• not thorough, i.e. not long tail

experts
• manually written, moderately authoritative
• not thorough, i.e. not long tail

discussions
• manually written, not authoritative
• not thorough, i.e. not long tail
• not consistently credible, i.e. minimal accreditation
• not statistically meaningful
• sparse

manual = expensive
manual = head focused
manual = not authoritative
manual = old and dated

Page 5: mining unstructured healthcare data deep dhillon

automated content generation

• cost effective
• structured content
• statistically based
• scales to millions of patients
• scales to long tail treatments + conditions
• authoritative / citation driven
• fresh

Page 6: mining unstructured healthcare data deep dhillon

types of text mining

Page 7: mining unstructured healthcare data deep dhillon

medify demo

Page 8: mining unstructured healthcare data deep dhillon

how do we mine text?

[Diagram components: examples, parser, curators, knowledge, assignments, external repos like UMLS, Index, SearchPage Module]

Page 9: mining unstructured healthcare data deep dhillon

gt demo:
• http://www.diabeticconnect.com/discussions/4790
• https://www.medify.com/internal/annotate/abstract?abstractId=8181575

annotate in the product:
• Easier for people to give explicit ground truth (GT) feedback
• Channels high visibility error angst productively
• Channels more GT toward areas most seen by users
• Override high visibility system mistakes with human data

Page 13: mining unstructured healthcare data deep dhillon

[Chart: Detailed Signals Preliminary Mining From Discussion Threads; axis scale 0 to 5,000]

Page 14: mining unstructured healthcare data deep dhillon

Distribution of Signals Mined From Diabetic Connect Discussion Threads

• Medical History - Newly Diagnosed: 0%
• Medical History - Symptom: 3%
• Lifestyle Adjustment: 6%
• Medical History - Condition: 10%
• Personal Account: 12%
• Treatment Experience: 15%
• Medical History: 21%
• Helper/Influencer: 33%

Page 15: mining unstructured healthcare data deep dhillon

knowledge base

• identity (i.e. UMLS id)
• taxonomical relations
• hierarchical is_a
– condition/arthritis/Rheumatoid Arthritis
• synonym relations
– RA=Rheumatoid Arthritis
– metformim
• polarity
– effective=positive
– ineffective=negative
• parsing clues (ambiguity)
• anaphor clues

[Diagram components: UMLS; treatment/symptom/condition/demographic; Research (Outcome); Discussion (Gromin); API; curator]
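
The knowledge base entries listed on this slide could be held in a structure like the one below. This is only a minimal sketch, assuming a plain in-memory layout: the `Concept` and `KnowledgeBase` classes, their method names, and the id `KB:0001` are hypothetical placeholders (the id is not a real UMLS identifier), and nothing here reflects how the medify system actually stores its data.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    concept_id: str  # placeholder id, NOT a real UMLS identifier
    path: str        # hierarchical is_a path, most general term first
    synonyms: set = field(default_factory=set)


class KnowledgeBase:
    """Toy store for identity, is_a paths, synonyms and polarity cues."""

    def __init__(self):
        self._by_name = {}   # lowercased surface form -> Concept
        self._polarity = {}  # lowercased cue word -> "positive" / "negative"

    def add_concept(self, concept):
        # Index the concept under its preferred name and every synonym.
        preferred = concept.path.split("/")[-1]
        for name in {preferred} | concept.synonyms:
            self._by_name[name.lower()] = concept

    def add_polarity(self, word, polarity):
        self._polarity[word.lower()] = polarity

    def lookup(self, surface):
        return self._by_name.get(surface.lower())

    def is_a(self, surface, ancestor):
        # True if `ancestor` appears above the concept in its is_a path.
        concept = self.lookup(surface)
        if concept is None:
            return False
        ancestors = [part.lower() for part in concept.path.split("/")[:-1]]
        return ancestor.lower() in ancestors

    def polarity(self, word):
        return self._polarity.get(word.lower())


kb = KnowledgeBase()
kb.add_concept(
    Concept("KB:0001", "condition/arthritis/Rheumatoid Arthritis", {"RA"})
)
kb.add_polarity("effective", "positive")
kb.add_polarity("ineffective", "negative")
```

With this layout, `kb.is_a("RA", "condition")` resolves the synonym first and then walks the path, so synonym and hierarchy lookups share one index.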

Page 16: mining unstructured healthcare data deep dhillon

abstract parser work flow

[Diagram stages: document sentence selection, shallow tagging, sentence modification, dependency tree parse, triple extraction, concluder, 3 tier classification, confidence assignment, annotations, knowledge]

1. sentences w/ high conclusion presence selected
2. shallow entity tagging applied based on UMLS
3. sentence text is modified to retain entities, optimize for performance, and eliminate unnecessary filler.
4. deep typed dependency parsing of modified text
5. 3 tier text classification applied on BOW + entities
6. SVO triples extracted from dependencies
7. rule based conclusions generated from triples
8. confidence models applied to rule and classifier based conclusions
9. knowledge base used to represent domain and discourse style.
10. errors measured against curated data applied for improvements
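
Steps 6 and 7 of the abstract work flow (SVO triples from dependencies, then rule based conclusions) can be sketched roughly as follows. This is an illustration under stated assumptions, not the parser described in the talk: the dependency edges are written by hand instead of coming from a real parser, `nsubj` and `dobj` are the usual Stanford typed dependency labels, and `Dep`, `VERB_POLARITY`, `extract_svo`, `conclude`, `drug_x` and `symptom_y` are all hypothetical names.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Dep(NamedTuple):
    head: str       # governing word (here: the verb)
    rel: str        # typed dependency label, e.g. "nsubj" or "dobj"
    dependent: str  # dependent word


# Hypothetical verb cues; a real system would draw these from a knowledge base.
VERB_POLARITY = {
    "reduced": "positive",
    "improved": "positive",
    "worsened": "negative",
}


def extract_svo(deps: List[Dep]) -> List[Tuple[str, str, str]]:
    """Pair each verb's subjects with its direct objects."""
    subjects: Dict[str, List[str]] = {}
    objects: Dict[str, List[str]] = {}
    for dep in deps:
        if dep.rel == "nsubj":
            subjects.setdefault(dep.head, []).append(dep.dependent)
        elif dep.rel == "dobj":
            objects.setdefault(dep.head, []).append(dep.dependent)
    triples = []
    for verb, subs in subjects.items():
        for sub in subs:
            for obj in objects.get(verb, []):
                triples.append((sub, verb, obj))
    return triples


def conclude(triple, entity_types: Dict[str, str]) -> Optional[dict]:
    """Toy rule: treatment + polar verb + symptom/condition gives a conclusion."""
    subject, verb, obj = triple
    if entity_types.get(subject) != "treatment":
        return None
    if entity_types.get(obj) not in {"symptom", "condition"}:
        return None
    polarity = VERB_POLARITY.get(verb)
    if polarity is None:
        return None
    return {"treatment": subject, "target": obj, "polarity": polarity}


# Hand-written edges for the toy sentence "drug_x reduced symptom_y".
deps = [
    Dep("reduced", "nsubj", "drug_x"),
    Dep("reduced", "dobj", "symptom_y"),
]
entity_types = {"drug_x": "treatment", "symptom_y": "symptom"}
triples = extract_svo(deps)
conclusions = [conclude(t, entity_types) for t in triples]
```

Keeping triple extraction separate from the rule layer means the rules only ever see entity types and verbs, which is where a confidence model (step 8) could later attach a score.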

Page 17: mining unstructured healthcare data deep dhillon

discussion parser work flow

[Diagram stages: message sentence split, shallow tagging, sentence modification, dependency tree parse, triple extraction, concluder, classification, anaphora resolution, conclusions, knowledge]

1. sentences split from discussion thread messages
2. shallow entity tagging applied based on UMLS
3. sentence text is modified to retain entities, optimize for performance, and eliminate unnecessary filler.
4. deep typed dependency parsing of modified text for select conclusion types
5. select conclusion type based text classification applied on BOW + entities after dimensionality reduction
6. SVO triples extracted from dependencies
7. rule based conclusions generated from triples
8. rule based KB driven anaphora resolution applied. Classification based conclusions added.
9. knowledge base used to represent domain and discourse style.
10. errors measured against curated data applied for improvements
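
Steps 1 and 2 of the discussion work flow (sentence splitting, then shallow entity tagging) might look something like this in outline. It is a deliberately naive sketch: the `LEXICON` dictionary is a hypothetical stand-in for a UMLS-backed tagger, the regex splitter ignores abbreviations, and the function names are assumptions made for the example.

```python
import re

# Hypothetical mini-lexicon standing in for UMLS-based lookup.
LEXICON = {
    "metformin": "treatment",
    "insulin": "treatment",
    "fatigue": "symptom",
    "diabetes": "condition",
}

# Split after sentence-final punctuation followed by whitespace.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def split_sentences(message):
    """Naive splitter; does not handle abbreviations such as 'Dr.'."""
    parts = _SENTENCE_BREAK.split(message.strip())
    return [part.strip() for part in parts if part.strip()]


def tag_entities(sentence):
    """Return (surface text, entity type) for every lexicon hit."""
    tags = []
    for match in re.finditer(r"[A-Za-z]+", sentence):
        entity_type = LEXICON.get(match.group().lower())
        if entity_type is not None:
            tags.append((match.group(), entity_type))
    return tags


message = "I started metformin last month. Has anyone else had fatigue?"
sentences = split_sentences(message)
tagged = [(sentence, tag_entities(sentence)) for sentence in sentences]
```

Tagging per sentence, as opposed to per message, keeps the entity spans aligned with the units that later stages (dependency parsing, triple extraction) operate on.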

Page 18: mining unstructured healthcare data deep dhillon

feature engineering

• entities, i.e. normalize synonyms, id new entity types, like social relations

• entity types, i.e. metformin > treatment_medication

• phrase driven cues, i.e. [have] [you] [considered] > suggestion_indicator
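
The three kinds of features above can be combined in a small extractor. The sketch below is an assumption-laden illustration: the tables `SYNONYMS`, `ENTITY_TYPES` and `PHRASE_CUES` are tiny hypothetical samples built from the slide's own examples, and the function name and feature naming scheme (`entity=...`) are invented for the example.

```python
import re
from collections import Counter

# Hypothetical sample tables built from the examples on the slides.
SYNONYMS = {"ra": "rheumatoid_arthritis"}
ENTITY_TYPES = {
    "metformin": "treatment_medication",
    "rheumatoid_arthritis": "condition",
}
PHRASE_CUES = [
    (re.compile(r"\bhave you considered\b"), "suggestion_indicator"),
]


def extract_features(sentence):
    """Return a bag of engineered features for one sentence."""
    text = sentence.lower()
    features = Counter()
    # Phrase driven cues, e.g. [have] [you] [considered] > suggestion_indicator
    for pattern, cue in PHRASE_CUES:
        if pattern.search(text):
            features[cue] += 1
    for token in re.findall(r"[a-z]+", text):
        # Normalize synonyms to one canonical entity name.
        canonical = SYNONYMS.get(token, token)
        entity_type = ENTITY_TYPES.get(canonical)
        if entity_type is not None:
            features["entity=" + canonical] += 1
            # Entity type feature, e.g. metformin > treatment_medication
            features[entity_type] += 1
    return features


features = extract_features("Have you considered metformin for RA?")
```

Emitting both the canonical entity and its type lets a downstream classifier generalize across entities of the same type while still seeing the specific one.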

Page 19: mining unstructured healthcare data deep dhillon

anaphora resolution

• relation structure, i.e. [person]>takes>it
– "it" refers to treatment (i.e. not condition/symptom), and specifically: medication but not device

• statistically driven, manually curated cues
– i.e. drug > treatment/medication

• filter
– non matching antecedent candidates
– singular/plural agreement

• score candidates:
– antecedent occurrence frequency
– distance (#sents) from antecedent to anaphor
– co-occurrence of anaphor w/ antecedent
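
The filter-then-score idea on this slide can be sketched as below. This is a rough illustration, not the resolver from the talk: the linear scoring formula, the default weights, the `RELATION_TYPE` table, the `cooccurrence` scores and all names (`Mention`, `resolve`) are assumptions introduced for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str     # entity surface form
    etype: str    # entity type, e.g. "treatment_medication"
    plural: bool  # grammatical number
    sent: int     # index of the sentence containing the mention


# Hypothetical relation cue: in "[person] takes it", "it" must be a medication.
RELATION_TYPE = {"takes": "treatment_medication"}


def resolve(verb, anaphor_sent, anaphor_plural, mentions,
            cooccurrence=None, weights=(1.0, 1.0, 1.0)):
    """Pick the best antecedent for an anaphor governed by `verb`."""
    required_type = RELATION_TYPE.get(verb)
    cooccurrence = cooccurrence or {}
    # Filter: matching type, number agreement, not after the anaphor.
    candidates = [
        m for m in mentions
        if m.etype == required_type
        and m.plural == anaphor_plural
        and m.sent <= anaphor_sent
    ]
    if not candidates:
        return None
    frequency = Counter(m.text for m in candidates)
    w_freq, w_dist, w_cooc = weights

    def score(mention):
        distance = anaphor_sent - mention.sent  # in sentences
        return (w_freq * frequency[mention.text]
                - w_dist * distance
                + w_cooc * cooccurrence.get(mention.text, 0.0))

    return max(candidates, key=score).text


mentions = [
    Mention("metformin", "treatment_medication", False, 0),
    Mention("diabetes", "condition", False, 0),
    Mention("test strips", "treatment_device", True, 1),
    Mention("metformin", "treatment_medication", False, 1),
    Mention("insulin", "treatment_medication", False, 1),
]
# Anaphor: "it" in sentence 2, singular, object of "takes".
antecedent = resolve("takes", 2, False, mentions)
```

Filtering before scoring keeps the scoring function simple: by the time candidates are ranked, type and number mismatches (the condition and the plural device above) are already gone.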

Page 20: mining unstructured healthcare data deep dhillon

technology

• Lang: Java + Python + Ruby
• DB: Solr 4, MongoDB, S3
• Work: MapReduce
• Dependencies: MaltParser, Stanford Parser
• Misc: Tomcat, Spring, MALLET, ReVerb, MinorThird
• Tagging: Peregrine + home grown

Page 21: mining unstructured healthcare data deep dhillon

Data Pipeline


Page 22: mining unstructured healthcare data deep dhillon

request transaction flow

[Diagram components: browser; Load Balancer – www.medify.com; Portal 1, Portal 2, Portal N; Cache; Load Balancer – API; Solr 1, Solr 2, Solr N]

Page 23: mining unstructured healthcare data deep dhillon

questions?

Page 36: mining unstructured healthcare data deep dhillon

Advair: Experts vs. Patients
• “medicalese” vs. patients' words
• more granularity
• a story-like perspective w/ words of inspiration

My Pulmonologist today said that he had just come from the hospital bedside of a patient with my exact symptoms who is on oxygen and not doing well. If I hadn't been so diligent in following my asthma plan that could easily have been me. His telling me that really hit home.

By the way, my Pulmonologist surprised me with his view on many of his patients. He actually got excited and thanked me for being knowledgeable about asthma, understanding my own body and health needs and for following my treatment plan. He said he doesn't get many patients who can actually talk about their disease, symptoms, time frames, treatments/medications, and non-medical measures they are taking. He also doesn't get many people who ask questions. My response to this: What are people thinking? They need to take control of their own health or noone else will.