mining unstructured healthcare data
Post on 12-Nov-2014
Embed Size (px)
DESCRIPTIONAlliance Health Chief Data Scientist Deep Dhillon presentation to UW CS students on mining unstructured healthcare data. This talk describes technical information on a system designed to empower patients and health care professionals by automatically generating health care content that is fresh, authoritative, statistically relevant and not available via other means.
- 1. mining unstructured healthcare datadeep dhillonchief data scientist | email@example.com | twitter.com/zang0
2. alliance health networks HCP/advanced patients medify.com 10,000% user growth past year patientsdiabeticconnect.com - # 1 online diabetes site @ ~1.4M uniques*connect.com content sites + guided social networks health care industry patient surveying, matchmaking and analysis 3. why mine healthcare text?anatomy of what patients currently use: webmd (e.g. drugs.com, yahoo health, etc.)q/a topic pagesnews + mediaexperts discussions original content addresses emotional needs simple to understand provides answers original content moderately authoritative simple to understand original content good coverage in the head fresh provides answers simple to understand good coverage in the head original content typically aligned well with google searches,i.e. treatments for X, symptoms for Yoriginal content=important good coverage in the headproviding answers=important original content head=developed aligns well with google searches fresh=important provides answers authority=important 4. why mine healthcare text?Patients and HCPs need long tail, statistically meaningful, consumer friendly,authoritative and fresh health content.q/a topic pages news + mediaexperts discussions manually written, not authoritative not thorough, i.e. not long tail not consistently credible, i.e. minimal accredation not statistically meaningful sparse manually written, moderately authoritative not thorough, i.e. not long tail manually written, not authoritative not consistently credible, i.e. minimal accredation not thorough, i.e. not long tail evergreen, rarely change manually written, not authoritative not consistently credible, i.e. minimal accredation not thorough, i.e. not long tail manual=expensive manual=head focused manually written, not authoritative not consistently credible, i.e. minimal accredationmanual=not authoritative not statistically meaningful manual=old and dated 5. automated content generation cost effective structured content statistically based scales to millions of patients scales to long tail treatments + conditions authoritative / citation driven fresh 6. types of text mining 7. medify demo 8. how do we mine text? assignmentscuratorsexamplesknowledgeparser Indexexternal reposPagelike UMLS Search Module 9. annotate in the product: Easier for people to give explicit GT feedback Channels high visibility error angst productively Channels more GT toward areas most seen by users Override high visibility system mistakes with human datagt demo: http://www.diabeticconnect.com/discussions/4790 https://www.medify.com/internal/annotate/abstract?abstractId=8181575 10. Detailed Signals Preliminary Mining From Discussion Threads5,0004,5004,0003,5003,0002,5002,0001,5001,000 500 0 11. Distribution of Signals Mined From Diabetic Connect DiscussionThreads Medical History - NewlyLifestyle Adjustment Diagnosed Medical History - Symptom6% 0%3%Medical History - Condition 10%Helper/Influencer33% Personal Account 12% Treatment Experience 15% Medical History21% 12. API Research (Outcome)Discussion (Gromin)curator treatment/symptom/condition/demographic identity (i.e. UMLS id)UMLS taxonomical relations hierarchical is_a condition/arthritis/Rhematoid Arthritis synonym relations RA=Rhematoid Arthritis metformim polarityknowledge base effective=positive ineffective=negative parsing clues (ambiguity) anaphor clues 13. abstract parser work flowdocumentsentence selection 11.sentences w/ high conclusion presence selected2.shallow entity tagging applied based on umls 23.sentence text is modified to retain entities, optimize shallow taggingfor performance, and eliminate unnecessary filler.4.deep typed dependency parsing of modified text5.3 tier text classification applied on BOW + entities 3sentence6.SVO triples extracted from dependencies modification 7.rule based conclusions generated from triples8.confidence models applied to rule and classifierbased conclusions9 9.knowledge base used to represent domain and dependency treediscourse style.knowledge parse 4 10. errors measured against curated data applied forimprovements5 triple extraction6 3 tier classification10concluder 78confidence assignmentannotations 14. discussion parser work flowmessage sentence split1 1. sentences split from discussion thread messages2. shallow entity tagging applied based on umls3. sentence text is modified to retain entities, optimize2for performance, and eliminate unnecessary filler. shallow tagging4. deep typed dependency parsing of modified text for select conclusion types5. select conclusion type based text classification3applied on BOW + entities after dimensionalitysentence reduction modification 6. SVO triples extracted from dependencies7. rule based conclusions generated from triples9 8. rule based KB driven anaphora resolution applied. Classification based conclusions added. dependency tree9. knowledge base used to represent domain andknowledge parse 4discourse style.10.errors measured against curated data applied for improvements 5 triple extraction6classification10concluder 7 8anaphora resolutionconclusions 15. feature engineering entities, i.e. normalize synonyms, id newentity types, like social relations entity types, i.e. metformin >treatment_medication phrase driven cues, i.e. [have] [you][considered] > suggestion_indicator 16. anaphora resolution relation structure, i.e. [person]>takes>itit refers to treatment (i.e. not condition/symptom), and specifically: medication but not device statistically driven, manually curated cuesi.e. drug > treatment/medication filter non matching antecedent candidates singular/plural agreement score candidates: antecedent occurrence frequency distance (#sents) from antecedent to anaphor co-occurrence of anaphor w/ antecedent 17. technology Lang: Java + Python + Ruby DB: Solr 4, Mongo DB, S3 Work: Map Reduce Dependencies: Malt Parser, Stanford Parser Misc: Tomcat, Spring, Mallet, Reverb, Minor Third Tagging: Peregrine + home grown 18. Data Pipeline21 19. Solr 1Solr 2 Solr N Load Balancer APICachePortal 1Portal 2 Portal NLoad Balancer www.medify.comrequest transaction flow browser 20. questions? 21. Advair: Experts vs. Patients medicalese vs. patients words more granularity a story like perspective w/ wordsof inspirationMy Pulmonologist today said that he had just come fromthe hospital bedside of a patient with my exact symptomswho is on oxygen and not doing well. If I hadnt been sodiligent in following my asthma plan that could easily havebeen me. His telling me that really hit home.By the way, my Pulmonologist surprised me with hisview on many of his patients. He actually got excitedand thanked me for being knowledgeable aboutasthma, understanding my own body and health needsand for following my treatment plan. He said he doesntget many patients who can actually talk about theirdisease, symptoms, timeframes, treatments/medications, and non-medicalmeasures they are taking. He also doesnt get manypeople who ask questions. My response to this: Whatare people thinking? They need to take control of theirown health or noone else will.