semantic annotation, clustering and visualization
DESCRIPTION
"Practise" presentation of my MSc thesis I did for the Leiden University Bio-imaging Group. More information @ http://graus.nu/category/thesis/
TRANSCRIPT
Semantic annotation, clustering and visualization
David Graus
Media Technology MSc Programme
Graduation Project
Supervisor: Joris Slob
Introduction
David Graus Media Technology MSc Programme 07/02/2012
Cyttron DB entry
"The volume of the brain evaluated in this
study. The color scale represents the
number of 4-mm voxels with data in at least
7 subjects along a 3-cm deep line into the
brain. A three-dimensional rendering of a
brain is shown in regions where insufficient
data were obtained. The most superior
regions of the frontal and parietal lobes and
the most inferior regions of the temporal
lobes were not evaluated. Imaging artifacts
may also compromise the significance of
results in the most inferior portions of the
frontal lobe."
Tasks
1. Semantic annotation
Identify and tag most important concepts from text [NLP]
2. Topic extraction
Relate concepts and find clusters [Linked Data]
3. Visualization
Draw resulting graphs and clusters [Data visualization]
1. Semantic Annotation
Method I: Find words
Method II: Compare texts
Semantic Annotation: Method I
Formal knowledge: Biomedical Ontology
NCI Thesaurus
89,129 unique concepts
50,804 definitions
258,051 synonyms
Relations!
Concept: Agrobacterium tumefaciens
Definition: A species of Gram negative, rod shaped bacteria assigned to the phylum Proteobacteria. This bacteria is motile by flagella and mediates the horizontal gene transfer of its Ti plasmid to infect plants. A. tumefaciens is commonly found in soil and around the root surfaces of plants and is the causative agent of crown gall disease.
Synonyms: RHIZOBIUM RADIOBACTER, CDC GROUP VD-3
Semantic Annotation: Method I
Concept: Brain
Is a: Body Part, Organ, or Organ Component
subClass Of: Organ, Central Nervous System Part
Definition: An organ composed of grey and white matter containing billions of neurons that is the center for intelligence and reasoning. It is protected by the bony cranium.
Example
"The volume of the brain evaluated in this study. The color scale represents
the number of 4-mm voxels with data in at least 7 subjects along a 3-cm deep
line into the brain. A three-dimensional rendering of a brain is shown in regions
where insufficient data were obtained. The most superior regions of the frontal
and parietal lobes and the most inferior regions of the temporal lobes were
not evaluated. Imaging artifacts may also compromise the significance of results
in the most inferior portions of the frontal lobe."
Most, Brain, A, Inferior, Data, And, With, Volume,
Volume, Three, Temporal, Superior, Study, Scale,
Parietal, Number, Lobe, Line, Into, Frontal Lobe, Deep,
Color, At
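The word-finding step of Method I could be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the concept list is a tiny hand-made stand-in for the NCI Thesaurus, and the function name `find_concepts` is hypothetical.

```python
import re

# Hypothetical mini-ontology: a handful of concept labels, for illustration only.
CONCEPT_LABELS = ["Brain", "Frontal Lobe", "Volume", "Color", "Scale", "Mammary Gland"]

def find_concepts(text, labels):
    """Return the labels that occur in the text as whole words, with their counts."""
    found = {}
    for label in labels:
        # \b keeps "Scale" from matching inside a longer word; matching ignores case.
        pattern = r"\b" + re.escape(label) + r"\b"
        hits = re.findall(pattern, text, flags=re.IGNORECASE)
        if hits:
            found[label] = len(hits)
    return found

entry = ("The volume of the brain evaluated in this study. The color scale "
         "represents the number of 4-mm voxels with data in at least 7 subjects "
         "along a 3-cm deep line into the brain.")

matches = find_concepts(entry, CONCEPT_LABELS)
print(matches)
```

Literal matching of this kind has no notion of relevance: in the list above, words such as "A", "And" and "With" match concept labels too.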
Semantic Annotation: Method I
2 ‘Modifiers’ of representations:
1. (Porter) Stemming (text & ontology concepts)
Lobes – lobe
Brains – brain
Etc…
2. Generate synonyms (using WordNet)
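The two modifiers can be illustrated with a small stdlib-only sketch. Both parts are stand-ins: `naive_stem` only strips a plural "s" and is not the Porter algorithm, and the `SYNONYMS` table is a hypothetical replacement for WordNet lookups. With NLTK installed, `nltk.stem.PorterStemmer` and `nltk.corpus.wordnet` would be the usual choices.

```python
def naive_stem(word):
    """Rough stand-in for a Porter stemmer: lowercases and strips a plural 's'."""
    word = word.lower()
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

# Hypothetical synonym table standing in for WordNet lookups.
SYNONYMS = {"brain": ["encephalon"], "color": ["colour"]}

def expand(words):
    """Apply both modifiers: stem each word, then add the synonyms of each stem."""
    expanded = set()
    for word in words:
        stem = naive_stem(word)
        expanded.add(stem)
        expanded.update(SYNONYMS.get(stem, []))
    return expanded

print(naive_stem("Lobes"), naive_stem("Brains"))
print(sorted(expand(["Brains", "lobes", "color"])))
```

Applying the same stemmer to the text and to the ontology labels is what lets "lobes" in the text meet "Lobe" in the ontology.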
Different text representations
Most frequent words: 'brain, regions, data, evaluated, frontal, inferior, lobes, along, also, artifacts, color, compromise, deep, dimensional, imaging, insufficient, least, line, lobe'
Most frequent nouns: 'brain, color, deep, imaging, insufficient, line, lobe, number, rendering, scale, significance, study, volume'
Bigrams: 'also compromise, artifacts may, cm deep, color scale, compromise significance, deep line, dimensional rendering, imaging artifacts, may also, mm voxels, represents number, scale represents, significance results, subjects along, data least, data obtained, evaluated study, frontal lobe, frontal parietal, inferior portions'
Trigrams: 'also compromise significance, artifacts may also, cm deep line, color scale represents, compromise significance results, imaging artifacts may, may also compromise, scale represents number, insufficient data obtained, mm voxels data, portions frontal lobe, […]'
Combo: 'brain, regions, data, evaluated, frontal, inferior, lobes, along, also, artifacts, color, compromise, deep, dimensional, imaging, insufficient, least, line, lobe. brain, color, deep, imaging, insufficient, […]'
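A sketch of how keyword representations like these might be derived from the entry, using only the standard library. The stopword list and the tokenizer are assumptions chosen for illustration, and the n-grams in this sketch run across sentence boundaries. The noun representation is left out because it needs a part-of-speech tagger.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "of", "in", "this", "a", "and", "with", "at", "into",
             "were", "is", "where", "not", "most"}

def tokenize(text):
    """Lowercase the text and keep alphabetic runs only ('4-mm' becomes 'mm')."""
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

entry = (
    "The volume of the brain evaluated in this study. The color scale represents "
    "the number of 4-mm voxels with data in at least 7 subjects along a 3-cm deep "
    "line into the brain. A three-dimensional rendering of a brain is shown in "
    "regions where insufficient data were obtained. The most superior regions of "
    "the frontal and parietal lobes and the most inferior regions of the temporal "
    "lobes were not evaluated. Imaging artifacts may also compromise the "
    "significance of results in the most inferior portions of the frontal lobe."
)

tokens = [t for t in tokenize(entry) if t not in STOPWORDS]
counts = Counter(tokens)          # basis for the "most frequent words" list
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(counts.most_common(5))
print(bigrams[:5])
print(trigrams[:5])
```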
Semantic Annotation: Method I
6 Representations (literal + 5 keyword variations)
4 Treatments (literal + stem + synonyms + both)
6 × 4 = 24 results
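The count of 24 is simply every representation paired with every treatment. A short sketch follows; the labels given to the five keyword variations are an assumption based on the representation names used in this deck.

```python
from itertools import product

# Labels are assumptions taken from the representation names in the slides.
REPRESENTATIONS = ["literal", "frequent words", "frequent nouns",
                   "bigrams", "trigrams", "combo"]
TREATMENTS = ["literal", "stem", "synonyms", "stem + synonyms"]

# One annotation run per (representation, treatment) pair.
runs = list(product(REPRESENTATIONS, TREATMENTS))
print(len(runs))
```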
Method II: Text Comparison
Find concepts that might not occur in text
Compare text to definitions
Find relevant concepts based on their (textual) definitions
Compare the Cyttron entry with the NCI Thesaurus definitions
Method II: Text Comparison
Find concepts that might not occur in text
Frontotemporal Dementia: A syndrome caused by progressive degeneration of the frontal or temporal lobes of the brain. It is manifested with personality changes and deterioration of the language skills.
Parietal Lobe: One of the lobes of the cerebral hemisphere located superiorly to the occipital lobe and posteriorly to the frontal lobe. Cognition and visuospatial processing are its main functions.
Area of Broca: A region of the brain located in the frontal lobe of the dominant hemisphere; it plays an essential role in the motor control of speech.
Compare how?
Bag of Words + TF-IDF
Dictionary: BioMed Central Corpus
> 100,000 articles
> 8 GB raw data
Process Corpus
Clean (strip tags, store only article body)
Tokenize (create list of words)
Remove common words (stopwords)
Stem remaining words
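The four processing steps could look roughly like this. It is a sketch under assumptions: the input is taken to be a markup string, the stopword list is a toy one, and `naive_stem` is a crude plural stripper standing in for a real stemmer. Extracting only the article body depends on the corpus format and is not shown.

```python
import re
from html import unescape

# Toy stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "of", "in", "a", "and", "to", "is"}

def clean(raw):
    """Strip markup tags and decode character entities."""
    return unescape(re.sub(r"<[^>]+>", " ", raw))

def tokenize(text):
    """Create a list of lowercase words."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def naive_stem(token):
    """Crude stand-in for a real stemmer: strips a plural 's'."""
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def preprocess(raw):
    """Clean, tokenize, drop stopwords, then stem what remains."""
    return [naive_stem(t) for t in remove_stopwords(tokenize(clean(raw)))]

doc = "<body><p>The frontal lobes of the brain &amp; the parietal lobe.</p></body>"
processed = preprocess(doc)
print(processed)
```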
Method II: Text Comparison
Convert both texts to vector space using dictionary, compute similarity.
Return most similar concepts.
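A minimal sketch of the comparison, assuming bag-of-words TF-IDF vectors and cosine similarity as the similarity measure (the slides name TF-IDF but not the measure, so cosine is an assumption). The four-document corpus and the tokenized definitions are toy data, not the BioMed Central corpus or real NCI Thesaurus definitions.

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, n_docs):
    """Sparse TF-IDF vector: term count times log inverse document frequency."""
    counts = Counter(tokens)
    return {term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items() if term in doc_freq}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy "dictionary" corpus, already tokenized (hypothetical data).
corpus = [
    "brain frontal lobe imaging".split(),
    "parietal lobe brain hemisphere".split(),
    "mammary gland tissue".split(),
    "plant soil bacteria".split(),
]
# Document frequency: in how many corpus documents each term occurs.
doc_freq = Counter(term for doc in corpus for term in set(doc))

# Toy concept definitions and a toy entry, already preprocessed.
definitions = {
    "Parietal Lobe": "lobe hemisphere parietal frontal lobe".split(),
    "Mammary Gland": "gland tissue mammary".split(),
}
entry = "brain frontal parietal lobe imaging".split()

entry_vec = tfidf_vector(entry, doc_freq, len(corpus))
ranked = sorted(
    ((cosine(entry_vec, tfidf_vector(tokens, doc_freq, len(corpus))), name)
     for name, tokens in definitions.items()),
    reverse=True,
)
for score, name in ranked:
    print(f"{name}: {score:.2f}")
```

Because the vectors are built from the definition text, a concept can rank highly even when its label never appears in the entry.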
1. Frontotemporal Dementia
2. Parietal Lobe
3. Area of Broca
4. Anterior Cranial Fossa
5. Brain Lobectomy
6. Anterior Parietal Artery
7. Mammary Gland
8. Frontal Lobe
9. Interlobar
10. Lobar
Method II: Text Comparison
Different cut-off rules:
1. Anything over x% similar
2. 5 most similar
3. 10 most similar
4. 20% most similar
5. 10% most similar
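The cut-off rules amount to three small filters over a ranked result list. In this sketch the similarity scores are made up purely for illustration, and the function names are hypothetical.

```python
import math

def over_threshold(ranked, min_similarity):
    """Rule 1: keep anything more similar than the given threshold."""
    return [(name, score) for name, score in ranked if score > min_similarity]

def top_n(ranked, n):
    """Rules 2 and 3: keep the n most similar concepts."""
    return ranked[:n]

def top_percent(ranked, percent):
    """Rules 4 and 5: keep the top share of the list, rounded up."""
    keep = math.ceil(len(ranked) * percent / 100)
    return ranked[:keep]

# Concept names from the result list; the scores are MADE UP for illustration.
names = ["Frontotemporal Dementia", "Parietal Lobe", "Area of Broca",
         "Anterior Cranial Fossa", "Brain Lobectomy", "Anterior Parietal Artery",
         "Mammary Gland", "Frontal Lobe", "Interlobar", "Lobar"]
scores = [0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05]
ranked = list(zip(names, scores))  # already sorted, most similar first

print(len(over_threshold(ranked, 0.30)))
print([name for name, _ in top_n(ranked, 5)])
print([name for name, _ in top_percent(ranked, 20)])
```

A fixed count always returns something, even for a poor match, whereas a threshold can return nothing at all.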
Result
Long list of (linked) concepts
Relevancy?
Find clusters
Measure semantic similarity between concepts
- Shortest paths
- Shared parents
- Node’s ‘depth’
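The three ingredients listed above can be computed on a small is-a hierarchy as sketched below. The `PARENTS` graph is a hypothetical toy structure, not the actual NCI Thesaurus hierarchy, and how the three values would be combined into one similarity score is left open.

```python
from collections import deque

# Hypothetical toy hierarchy: each concept maps to its direct parents.
PARENTS = {
    "Frontal Lobe": ["Brain Lobe"],
    "Parietal Lobe": ["Brain Lobe"],
    "Brain Lobe": ["Brain"],
    "Brain": ["Organ"],
    "Mammary Gland": ["Organ"],
    "Organ": [],
}

def ancestors(node):
    """All concepts reachable by following parent links upwards."""
    seen = set()
    stack = list(PARENTS[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen

def shared_parents(a, b):
    """Ancestors that two concepts have in common."""
    return ancestors(a) & ancestors(b)

def depth(node):
    """Number of is-a steps from the node up to the nearest root."""
    if not PARENTS[node]:
        return 0
    return 1 + min(depth(parent) for parent in PARENTS[node])

def shortest_path(a, b):
    """Breadth-first search over is-a edges, followed in both directions."""
    neighbours = {node: set(parents) for node, parents in PARENTS.items()}
    for node, parents in PARENTS.items():
        for parent in parents:
            neighbours[parent].add(node)
    queue = deque([(a, 0)])
    visited = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours[node] - visited:
            visited.add(nxt)
            queue.append((nxt, dist + 1))
    return None  # no connection between the two concepts

print(shortest_path("Frontal Lobe", "Parietal Lobe"))
print(shortest_path("Frontal Lobe", "Mammary Gland"))
print(sorted(shared_parents("Frontal Lobe", "Parietal Lobe")))
print(depth("Frontal Lobe"))
```

In this toy graph the two lobes are close and share a deep parent, while "Mammary Gland" connects to them only through the root, which is the kind of signal a clustering step could use to set an outlier apart.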
To do
Get data!
Analyse algorithms