Neurolinguistic Approach to Vector Representation of Medical Concepts

Włodzisław Duch
Department of Informatics, Nicolaus Copernicus University, Toruń, Poland

Paweł Matykiewicz & John Pestian
Department of Biomedical Informatics, Children's Hospital Research Foundation, Cincinnati, OH



Plan

Goal: reaching human-level competence in all aspects of NLP.
Well ... at least in annotation of medical texts.

• Neurocognitive inspirations: how are words represented in brains? What practical ideas can we derive from these inspirations?

• From semantic networks to vector representations.

• Hospital discharge summaries: what can be done with them?

• Enhancing document representations using medical ontologies.

• Clusterization, categorization, topics from discharge summaries.

• A few final thoughts while we are still on the run.

Ambitious approaches…

CYC, Douglas Lenat, started in 1984. Developed by CyCorp, with 2.5 million assertions linking over 150,000 concepts and using thousands of micro-theories (2004). Cyc-NL is still a “potential application”; knowledge representation in frames is quite complicated and thus difficult to use.

Open Mind Common Sense Project (MIT): a WWW collaboration with over 14,000 authors, who contributed 710,000 sentences; used to generate ConceptNet, a very large semantic network. Other such projects: HowNet (Chinese Academy of Sciences), FrameNet (Berkeley), various large-scale ontologies.

The focus of these projects is to understand all relations in text/dialogue. NLP is hard and messy! Many people have lost hope that good NLP systems can be created without deep embodiment.

Go the brain way!

Brain areas involved

Organization of the word recognition circuits in the left temporal lobe has been elucidated using fMRI experiments (Cohen et al. 2004). How do words that we hear, see or think of activate the brain? Seeing words: orthography, phonology, articulation, semantics.

The visual word form area (VWFA) in the left occipitotemporal sulcus is a strictly unimodal visual area.

The adjacent lateral inferotemporal multimodal area (LIMA) reacts to both auditory & visual stimulation and has cross-modal phonemic and lexical links.

Likely: homolog of the VWFA in the auditory stream, the auditory word form area, located in the left anterior superior temporal sulcus; this area shows reduced activity in developmental dyslexics.

Large variability in location of these regions in individual brains.

Words in the brain

Psycholinguistic experiments show that most likely categorical, phonological representations are used, not the acoustic input.
Acoustic signal => phoneme => words => semantic concepts.
Phonological processing precedes semantic by 90 ms (from N200 ERPs).
F. Pulvermuller (2003) The Neuroscience of Language. On Brain Circuits of Words and Serial Order. Cambridge University Press.

Phonological neighborhood density = the number of words that are similar in sound to a target word. Similar = similar pattern of brain activations.

Semantic neighborhood density = the number of words that are similar in meaning to a target word.

Action-perception networks inferred from ERP and fMRI

Insights and brains

Brain activity has been investigated during the solving of problems that required insight and of problems that could be solved in a schematic, sequential way. E.M. Bowden, M. Jung-Beeman, J. Fleck, J. Kounios, “New approaches to demystifying insight”. Trends in Cognitive Sciences 2005.

After solving a verbally presented problem, subjects themselves indicated whether they had had an insight or not.

An increased activity of the right hemisphere anterior superior temporal gyrus (RH-aSTG) was observed during initial solving efforts and insights. About 300 ms before insight a burst of gamma activity was observed, interpreted by the authors as “making connections across distantly related information during comprehension ... that allow them to see connections that previously eluded them”.

Insight interpreted

What really happens? My interpretation:

• LH-STG represents concepts, S = start, F = final;
• understanding, solving = transition, step by step, from S to F;
• if no connection (transition) is found this leads to an impasse;
• RH-STG ‘sees’ LH activity on meta-level, clustering concepts into abstract categories (cosets, or constrained sets);
• connection between S and F is found in RH, leading to a feeling of vague understanding;
• gamma burst increases the activity of LH representations for S, F and intermediate configurations;
• stepwise transition between S and F is found;
• finding the solution is rewarded by emotions during the Aha! experience; they are necessary to increase plasticity and create permanent links.

Memory & creativity

Creative brains accept more incoming stimuli from the surrounding environment (Carson 2003); they show low levels of latent inhibition, the mechanism responsible for filtering out stimuli that were irrelevant in the past. “Zen Mind, Beginner's Mind” (S. Suzuki): learn to avoid habituation! A creative mind maintains complex representations of objects and situations.

Pair-wise word association technique may be used to probe if a connection between different configurations representing concepts in the brain exists.

A. Gruszka, E. Nęcka, Creativity Research Journal, 2002.

Words may be close (easy) or distant (difficult) to connect; priming words may be helpful or neutral; helpful words are related semantically or phonologically (hogse for horse); neutral words may be nonsensical or just not related to the presented pair.

Results for groups of people of low/high creativity are surprising …

Experimental sequence: Word 1 => priming (0.2 s) => Word 2

Semantic memory

Connectionist spreading activation model => semantic network with mostly lateral connections (Collins and Loftus, 1975).

Hierarchical model of semantic memory followed by most ontologies (Collins and Quillian, 1969).

Our implementation of semantic memory is based on a connectionist model; it uses a relational database and an object access layer API.

The database stores three types of data:
• concepts, or objects being described;
• keywords (features of concepts extracted from data sources);
• relations between them.

Attempts are being made to create “common sense” semantic memory from machine-readable sources and using active dialogues (see Szymanski & Duch, Semantic Memory Knowledge Acquisition Through Active Dialogues, poster #1156); here only medical applications are discussed.
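The three types of data listed above could be stored as in the following minimal sketch. It is only an illustration: the table names, column names and the helper functions `add_fact` and `keywords_of` are hypothetical, and SQLite stands in for whatever relational database the actual system uses.

```python
import sqlite3

# In-memory database; all table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE keyword (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE relation (
    concept_id INTEGER NOT NULL REFERENCES concept(id),
    keyword_id INTEGER NOT NULL REFERENCES keyword(id),
    rel_type   TEXT NOT NULL,
    weight     REAL NOT NULL DEFAULT 1.0,
    PRIMARY KEY (concept_id, keyword_id, rel_type)
);
""")


def add_fact(concept, rel_type, keyword, weight=1.0):
    """Store 'concept --rel_type--> keyword', creating rows as needed."""
    conn.execute("INSERT OR IGNORE INTO concept(name) VALUES (?)", (concept,))
    conn.execute("INSERT OR IGNORE INTO keyword(name) VALUES (?)", (keyword,))
    conn.execute(
        "INSERT OR REPLACE INTO relation "
        "SELECT c.id, k.id, ?, ? FROM concept c, keyword k "
        "WHERE c.name = ? AND k.name = ?",
        (rel_type, weight, concept, keyword),
    )


def keywords_of(concept):
    """Return (keyword, relation type) pairs describing a concept."""
    rows = conn.execute(
        "SELECT k.name, r.rel_type FROM relation r "
        "JOIN concept c ON c.id = r.concept_id "
        "JOIN keyword k ON k.id = r.keyword_id "
        "WHERE c.name = ? ORDER BY k.name",
        (concept,),
    )
    return rows.fetchall()


add_fact("virus", "causes", "disease or syndrome")
add_fact("virus", "has", "capsid")
virus_keywords = keywords_of("virus")
```

Keeping relations in their own table makes it cheap to add new relation types without changing the concept or keyword tables.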

Semantic => vector reps

Word w in a context Cont: the state (w,Cont) is a distribution of brain activations.

States (w,Cont) correspond to lexicographical meanings: clusterize (w,Cont) for all contexts, define prototypes (wk,Cont) for different meanings wk.

Simplification: use spreading activation in semantic networks to define the states (w,Cont). How does the activation flow? Try this algorithm on a collection of texts:

• Perform text pre-processing steps: stemming, stop-list, spell-checking ...

• Use MetaMap with very restrictive settings to discover concepts, avoiding highly ambiguous results when mapping text to the UMLS ontology.

• Use UMLS relations to create first-order cosets (terms + all new terms from included relations); add only those types of relations that lead to improvement of classification results.

• Reduce dimensionality of the first-order coset space, leaving all original features; use a feature ranking method for this reduction.

• Repeat the last two steps iteratively to create second- and higher-order enhanced spaces, first expanding, then shrinking the space.

Create X vectors representing concepts.
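The expand-then-shrink loop from the list above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the relation table, concept names and the helpers `expand`, `shrink` and `enhance` are hypothetical, a new term simply inherits the tf of the term that introduced it, and feature ranking is done with the absolute Pearson correlation between a feature and each class indicator.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for constant inputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / sqrt(sxx * syy)


def expand(doc, relations):
    """Coset step: keep all terms and add only NEW related terms."""
    out = dict(doc)
    for concept, tf in doc.items():
        for related in relations.get(concept, ()):
            if related not in doc:  # assumption: existing terms keep their tf
                out[related] = out.get(related, 0) + tf
    return out


def shrink(docs, labels, original, threshold):
    """Keep original features; keep added ones only if correlated with a class."""
    keep = set(original)
    for feature in {c for d in docs for c in d} - keep:
        column = [d.get(feature, 0) for d in docs]
        for cls in set(labels):
            indicator = [1.0 if lab == cls else 0.0 for lab in labels]
            if abs(pearson(column, indicator)) > threshold:
                keep.add(feature)
                break
    return [{c: v for c, v in d.items() if c in keep} for d in docs]


def enhance(docs, labels, relations, steps=2, threshold=0.02):
    original = {c for d in docs for c in d}
    for _ in range(steps):  # first expanding, then shrinking the space
        docs = [expand(d, relations) for d in docs]
        docs = shrink(docs, labels, original, threshold)
    return docs


# Toy data: every name below is made up for illustration.
relations = {"ibd": ["intestinal_disease"], "mtx": ["antirheumatic"],
             "fever": ["symptom"]}
docs = [{"ibd": 1, "fever": 1}, {"ibd": 2, "fever": 1},
        {"mtx": 1, "fever": 1}, {"mtx": 1, "fever": 1}]
labels = ["gi", "gi", "jra", "jra"]
enhanced = enhance(docs, labels, relations)
```

In this toy run "symptom" is added to every document with the same value, carries no class information, and is therefore removed again by the ranking step.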

Medical applications

• Can we capture an expert's intuition in evaluating document similarity and finding its category?
• How to include a priori knowledge in document categorization; important especially for rare diseases.
• Provide unambiguous annotation of all concepts.
• Acronyms/abbreviations expansion and disambiguation.
• How to make inferences from the information in the text, assign values to concepts (true, possible, unlikely, false).
• How to deal with negative knowledge (not consistent with ...).
• Automatic creation of medical billing codes from text.
• Semantic search support, better specification of queries, Q/A system.
• Integration of text analysis with molecular medicine.
• Provide support for billing, knowledge discovery, dialog systems.

• Here: categorization of summary discharges.

Unified Medical Language System (UMLS)

Semantic types linked by a semantic relation: “Virus” causes “Disease or Syndrome”

Other relations: “interacts with”, “contains”, “consists of” , “result of”, “related to”, …

Other types: “Body location or region”, “Injury or Poisoning”, “Diagnostic procedure”, …

UMLS – Example (keyword: “virus”)

Metathesaurus:

Concept: Virus, CUI: C0042776, Semantic Type: Virus

Definition (1 of 3): Group of minute infectious agents characterized by a lack of independent metabolism and by the ability to replicate only within living host cells; have capsid, may have DNA or RNA (not both). (CRISP Thesaurus)

Synonyms: Virus, Vira, Viridae

Semantic Network: "Virus" causes "Disease or Syndrome"

More semantic relations

Neurocognitive approach to language understanding: use recognition, semantic and episodic memory models, create graphs of consistent concepts for interpretation, use spreading activation and inhibition to simulate the effect of semantic priming, annotate and disambiguate text.

For medical texts UMLS has >2M concepts, 15M relations … we are developing a system for unambiguous concept mapping in the medical domain, and an ontology for common-sense reasoning (with J. Szymanski).
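Spreading activation with inhibition, as mentioned above, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tiny graph, the weights, the decay and threshold values and the function name `spread` are hypothetical and are not taken from the system described in the talk.

```python
def spread(graph, activation, steps=3, decay=0.5, threshold=0.05):
    """Propagate activation along weighted links.

    graph maps a node to a list of (neighbour, weight) pairs;
    a negative weight models inhibition. Activations are clipped to [0, 1].
    """
    act = dict(activation)
    for _ in range(steps):
        incoming = {}
        for node, level in act.items():
            if level < threshold:  # weakly active nodes do not fire
                continue
            for neighbour, weight in graph.get(node, ()):
                incoming[neighbour] = incoming.get(neighbour, 0.0) + level * weight
        updated = {}
        for node in set(act) | set(incoming):
            value = decay * act.get(node, 0.0) + incoming.get(node, 0.0)
            updated[node] = min(1.0, max(0.0, value))
        act = updated
    return act


# Two competing senses of the word "cold" inhibit each other;
# the context word "virus" primes only one of them.
graph = {
    "cold": [("common_cold", 0.5), ("low_temperature", 0.5)],
    "virus": [("common_cold", 0.8)],
    "common_cold": [("low_temperature", -0.9)],
    "low_temperature": [("common_cold", -0.9)],
}
result = spread(graph, {"cold": 1.0, "virus": 1.0})
winner = max(("common_cold", "low_temperature"), key=result.get)
```

The sense that receives support from the context stays active while its competitor is suppressed, which is the disambiguation effect that priming is meant to simulate.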

Data statistics

General info:
• 4534 documents, hospital summary discharges
• 10 classes (main disease treated)
• 807 initial features (concepts) for 26 semantic types

Baseline:
• Majority: 19.1% (asthma class)
• Content based: 34.6% (frequency of class name in text)

Remarks:
• Feature values represent term frequency (tf), i.e. the number of occurrences of a particular concept in text
• Very short documents + specialized vocabulary => very sparse vectors, hard to categorize
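The majority baseline can be checked from the per-class record counts reported for the test data (865 asthma records out of 4534). The sketch below also computes balanced accuracy, taken here to mean the mean per-class recall; that definition is an assumption, since the slides do not spell it out.

```python
# Records per class as reported for the test data; asthma (865) is listed first.
records = [865, 586, 493, 177, 283, 41, 298, 544, 638, 609]
total = sum(records)

# Always predicting the largest class gives the majority baseline.
majority_accuracy = max(records) / total


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall (assumed definition of balanced accuracy)."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Class indices 0..9; class 0 is the majority class.
y_true = [cls for cls, n in enumerate(records) for _ in range(n)]
y_pred = [0] * total
majority_balanced = balanced_accuracy(y_true, y_pred)

print(f"majority accuracy: {100 * majority_accuracy:.1f}%")
print(f"majority balanced accuracy: {100 * majority_balanced:.1f}%")
```

With ten unequal classes the same trivial classifier scores 19.1% plain accuracy but only 10% balanced accuracy, which is one reason to report balanced accuracy on data this skewed.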

Example of clinical summary discharges

Jane is a 13yo WF who presented with CF bronchopneumonia.

She has noticed increasing cough, greenish sputum production, and fatique since prior to 12/8/03. She had 2 febrile epsiodes, but denied any nausea, vomiting, diarrhea, or change in appetite. Upon admission she had no history of diabetic or liver complications.

Her FEV1 was 73% 12/8 and she was treated with 2 z-paks, and on 12/29 FEV1 was 72% at which time she was started on Cipro.

She noted no clinical improvement and was admitted for a 2 week IV treatment of Tobramycin and Meropenem.

Summary discharge test data

Columns: disease name; clinical data (no. of records, average size in bytes); reference data (size in bytes).

Asthma: 865 records, average size 1282 bytes, reference data 36720 bytes.

The other nine classes are UTI, Gastroenteritis, Otitis media, Cerebral palsy, Cystic fibrosis, JRA, Anemia, Epilepsy and Pneumonia. The alignment of names with rows did not survive in this transcript; the column values are:

No. of records: 586, 493, 177, 283, 41, 298, 544, 638, 609
Average size [bytes]: 1375, 1420, 1597, 1790, 1816, 1587, 2849, 1598, 1451
Reference data size [bytes]: 9906, 32416, 35348, 7958, 27024, 13430, 14282, 19418, 23583

JRA: juvenile rheumatoid arthritis; UTI: urinary tract infection.

Data processing/preparation

Reference texts => MMTx => UMLS concepts (feature prototypes) => filtering, focused on 26 semantic types. Features are UMLS concept IDs.

Clinical documents => MMTx => UMLS concepts => filtering using the existing space => final data.

MMTx discovers UMLS concepts in text.

Semantic types used

Values indicate the actual numbers of concepts found in:

I – clinical texts
II – reference texts

26 most useful types found using feature selection for all features of the specific type.

Classification results

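Feature weightings:

M0: raw tf frequencies
M1: binarized tf vectors
M2: a function of tf whose operator is not recoverable from this transcript
M3: $1 + \log(tf)$
M4: $s_{ij} = (1 + \log tf_{ij}) \, \log(N / df_i)$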

10-fold cross-validation, balanced accuracy (%), for different feature weightings:

              M0    M1    M2    M3    M4
kNN           48.9  50.2  51.0  51.4  49.5
SSV DT        39.5  40.6  31.0  39.5  39.5
SVM           59.3  60.4  60.9  60.5  59.8
10 ref cos    60.1  58.9  56.7  56.8  56.5

Itert L, Duch W, Pestian J, Influence of a priori Knowledge on Medical Document Categorization, IEEE Symposium on Computational Intelligence in Data Mining, IEEE Press, April 2007, pp. 163-170.
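The weightings and the “10 ref cos” row (cosine similarity to one reference vector per class) can be sketched as below. Several points are assumptions: the M2 formula is not legible in the transcript, so a square root is used purely as a placeholder; a zero tf is mapped to weight 0 because the logarithm is undefined there; and the reference vectors, concept names and function names are invented for illustration.

```python
from math import log, sqrt


def weight(tf, scheme, n_docs=None, df=None):
    """Weight one term frequency under schemes M0..M4."""
    if tf == 0:
        return 0.0  # assumption: absent concepts get zero weight
    if scheme == "M0":
        return float(tf)  # raw tf
    if scheme == "M1":
        return 1.0  # binarized
    if scheme == "M2":
        return sqrt(tf)  # ASSUMPTION: placeholder, formula not recoverable
    if scheme == "M3":
        return 1.0 + log(tf)
    if scheme == "M4":
        return (1.0 + log(tf)) * log(n_docs / df)
    raise ValueError(f"unknown scheme: {scheme}")


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(doc, references):
    """Assign the class whose reference vector is most similar to doc."""
    return max(references, key=lambda cls: cosine(doc, references[cls]))


# Hypothetical reference vectors built from reference texts, one per class.
references = {
    "asthma": {"wheezing": 3.0, "albuterol": 2.0},
    "pneumonia": {"infiltrate": 2.0, "fever": 1.0},
}
doc_tf = {"wheezing": 4, "fever": 1}
weighted = {c: weight(tf, "M3") for c, tf in doc_tf.items()}
predicted = classify(weighted, references)
```

Because the reference vectors come from texts outside the training set, this classifier needs no training at all, which is what makes its result comparable to the SVM row interesting.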

Enhancing representations

Experts reading the text activate their semantic memory and add a lot of knowledge that is not explicitly present in the text.

Co-occurrence statistics do not capture structural relations of real objects and features; systematic knowledge is needed. An approximation (not as good as SM): use ontologies, adding related concepts (use parent & other relations) to those discovered in the text.

Ex: IBD => [C0021390] Inflammatory Bowel Diseases =>
-> [C0341268] Disorder of small intestine
-> [C0012242] Digestive System Disorders
-> [C1290888] Inflammatory disorder of digestive tract
-> [C1334233] Intestinal Precancerous Condition
-> [C0851956] Gastrointestinal inflammatory disorders NEC
-> [C1285331] Inflammation of specific body organs
-> [C0021831] Intestinal Diseases

[C0025677] Methotrexate (Pharmacologic Substance) =>
-> [C0003191] Antirheumatic Agents
-> [C1534649] Analgesic/antipyretic/antirheumatic
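The two examples above amount to a lookup that adds related concepts to those found in the text. A minimal sketch, using only the concept identifiers listed above; the dictionary `RELATED` and the function `add_related` are hypothetical stand-ins for a real query against UMLS relations.

```python
# Related concepts copied from the two examples; a real system would
# query the UMLS relation tables instead of a hard-coded dictionary.
RELATED = {
    "C0021390": ["C0341268", "C0012242", "C1290888", "C1334233",
                 "C0851956", "C1285331", "C0021831"],
    "C0025677": ["C0003191", "C1534649"],
}


def add_related(concepts):
    """Return the concepts found in a text plus their related concepts."""
    enhanced = set(concepts)
    for cui in concepts:
        enhanced.update(RELATED.get(cui, ()))
    return enhanced


found_in_text = {"C0021390", "C0025677"}  # IBD and Methotrexate
enhanced = add_related(found_in_text)
```

Two concepts found in the text become eleven after one expansion step, which shows why a subsequent feature selection step is needed to keep the space manageable.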

Clusterization on enhanced data

MDS mapping of 4534 documents divided into 10 classes, using cosine distances.

1. Initial representation, 807 features.
2. Enhanced by 26 selected semantic types, two steps, 2237 concepts with CC > 0.02 for at least one class.

Two steps create feedback loops A <=> B between concepts.

Structure appears ... is it interesting to experts? Are these specific subtypes (clinotypes)?

Searching for topics

Discover topics, subclusters, more focused than general categories.

Map text on the 2007 MeSH (Medical Subject Headings) ontology, more precise than UMLS. Filter rare concepts (appearing in <1% of docs) and very common concepts (>99% of docs); remove documents with too few concepts (<1% of all) => smaller but better defined clusters. Leave only 26 semantic types.

Ward’s clustering used, with silhouette measure of clustering quality.

Only 3 classes: the two classes that mix most strongly (Pneumonia and Otitis media), plus the smallest class, JRA.

Initial filtering: 570 concepts with 1%<tf<99%, 1002 documents.
Semantic (26 types): 224 concepts, 908 docs with >1% concepts.

These 224 concepts have about 70,000 UMLS relations, of which only 500 belong to the 26 semantic types. Enhancement: very restrictive, only ~25 most correlated concepts added.
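The filtering described for the topic search (drop concepts present in fewer than 1% or more than 99% of documents, then drop documents with too few remaining concepts) can be sketched as follows. The function names, thresholds as parameters and the synthetic documents are assumptions for illustration only.

```python
def filter_concepts(docs, low=0.01, high=0.99):
    """Keep concepts whose document frequency lies strictly between low and high."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for concept in set(doc):
            df[concept] = df.get(concept, 0) + 1
    kept = {c for c, count in df.items() if low < count / n_docs < high}
    return [[c for c in doc if c in kept] for doc in docs], kept


def drop_sparse(docs, n_concepts, min_fraction=0.01):
    """Remove documents containing too few of the kept concepts."""
    return [doc for doc in docs if len(set(doc)) > min_fraction * n_concepts]


# Synthetic corpus: "fever" is in every document, "rare" in only one.
docs = []
for i in range(200):
    doc = ["fever"]
    if i < 100:
        doc.append("cough")
    if i % 4 == 0:
        doc.append("otalgia")
    if i == 0:
        doc.append("rare")
    docs.append(doc)

filtered, kept = filter_concepts(docs)
dense = drop_sparse(filtered, len(kept))
```

Here the ubiquitous and the near-unique concepts disappear, and the 75 documents left without any informative concept are removed before clustering.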

Results

Start, iterations 2, 3 and 4 shown; 5 clinotypes may be distinguished.

Few conclusions

Neurocognitive NLP leads to interesting inspirations. Sydney Lamb (Rice University) wrote a general book (1999) on the neural basis of language. How can practical large-scale algorithms be created?

Various approximations to knowledge representation in brain networks are studied: the use of a priori knowledge based on reference vectors, formation of graphs of consistent concepts in spreading activation networks, ontology & semantic-based enhancements + specific relations.

Clusterization/categorization quality has been used to discover which semantic types are useful (selecting categories of features), expand and reduce the concept space, discovering useful “pathways of the brain”.

Can one identify specific clinotypes in summary discharges? Can they be used to improve training of young MDs?

Sessions on Medical Text Analysis and the billing annotation challenge, April 1-5, 2007, IEEE CIDM, Honolulu, showed that human-level competence in some text analysis tasks can be reached!

Thank you for lending your ear ...

Google: Duch => Papers, talks