Functional Gene Clustering via Gene Annotation Sentences, MeSH and GO
Keywords from Biomedical Literature
Dr. N. JEYAKUMAR, M.Sc., Ph.D.
Bioinformatics Centre, School of Biotechnology
Madurai Kamaraj University
Madurai – 625021, INDIA
2
Purpose & Goals
Extract gene-specific functional ‘keywords’ from biological literature
  From full abstracts
  From gene-specific sentences
Augment the extracted keywords with MeSH and GO keywords related to each gene
Compare the accuracy of the following keyword extraction methods on a test data set
  Full abstracts
  Gene-specific sentences
  Gene-specific sentences + MeSH keywords
  Gene-specific sentences + MeSH and GO keywords
Use the keyword extraction method to cluster the differentially expressed genes from a microarray experiment
3
Outline
Two parts: I and II
Part I: Text mining and keyword extraction from literature
  Our text mining methodology
Part II: Applications to microarrays
  Functional keyword clustering of microarray data
4
Part I: Text Mining
5
Text Mining: Introduction and overview
Text mining
  aims to identify non-trivial, implicit, previously unknown, and potentially useful patterns in text (e.g. classification systems, association rules, hypotheses)
  includes more established research areas such as information retrieval (IR), natural language processing (NLP), information extraction (IE), and traditional data mining (DM)
  is relevant to bioinformatics because of
    the explosive growth of biomedical literature (e.g. MEDLINE, with 15 million records)
    the availability of some information in textual form only, e.g. clinical records
6
Text Mining: System Architecture
Experimental design of gene clustering with sentence-level, MeSH and GO keywords
[Figure: system architecture flowchart. Box labels: Microarray Experiment; Gene List; Gene/Protein Dictionary; Set of Abstracts; Filtering MedLine Abstracts; Sentence Extraction; Keyword Extraction; MeSH/Gene Ontology; MeSH/GO Keyword Extraction; Feature Vector Generation; Clustering; Patterns, Visualization, Annotation]
7
Text Mining: Keyword Extraction from Biomedical Literature
Steps to extract sentence-level keywords
Gene-synonym dictionary: a special gene-name synonym dictionary was created for human genes using Entrez Gene.
Gene-name normalization: this process replaces every gene name in the abstract with its unique canonical identifier (Entrez Gene ID), using the gene-synonym dictionary specially constructed for this study.
Sentence filtering: uses corpus-specific regular expressions. For example, the expression
($gene @{0,6} $action (of|with) @{0,2} $gene)
extracts sentences that match the structure shown below it. The notational construct ‘A B ...’ is interpreted as ‘A followed by B followed by ...’.
gene name, 0-6 words, action verb, ‘of’ or ‘with’, 0-2 words, gene name
Keyword extraction: see the next slide.
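The normalization and sentence-filtering steps can be sketched in Python as follows. This is only an illustration, not the authors' implementation: the synonym dictionary, the placeholder identifiers (GENEID_1, GENEID_2), the list of action words, and the function names are all assumptions made for the example. The pattern notation is translated as $gene = a normalized identifier, @{m,n} = between m and n arbitrary words, and $action = one of the listed action words.

```python
import re

# Hypothetical synonym dictionary (lower-cased surface form -> placeholder ID).
# In the slides this is built from Entrez Gene; real Entrez Gene IDs would go here.
GENE_SYNONYMS = {"abi5": "GENEID_1", "abi3": "GENEID_2"}

# Illustrative stand-ins for the $action word class (not the authors' list).
ACTIONS = ["interaction", "inhibition", "activation"]


def normalize(sentence, synonyms=GENE_SYNONYMS):
    """Replace every known gene name with its canonical identifier."""
    names = sorted(synonyms, key=len, reverse=True)  # try longest names first
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, names)) + r")\b", re.IGNORECASE
    )
    return pattern.sub(lambda m: synonyms[m.group(1).lower()], sentence)


def words(lo, hi):
    """Regex for between lo and hi arbitrary words, i.e. @{lo,hi}."""
    return r"(?:\S+\s+){%d,%d}" % (lo, hi)


GENE = r"GENEID_\d+"
ACTION = r"(?:" + "|".join(ACTIONS) + r")"

# Translation of ($gene @{0,6} $action (of|with) @{0,2} $gene)
GENE_ACTION_GENE = re.compile(
    GENE + r"\s+" + words(0, 6) + ACTION + r"\s+(?:of|with)\s+"
    + words(0, 2) + GENE
)


def is_gene_sentence(sentence):
    """True if the normalized sentence matches the gene-action-gene pattern."""
    return GENE_ACTION_GENE.search(normalize(sentence)) is not None
```

Normalizing first means the filter only has to recognize one identifier shape, however many synonyms a gene has.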
8
Text Mining: Keyword Extraction from Biomedical Literature
Table 1. An example set of regular expressions: nouns describing agents and actions, and passive and active verbs
Nouns describing agents
  Expression pattern: ($gene (is)? (the|an|a) @{0,2} $action of @{0,2} $gene)
  Sentence output: IL6, a known mediator of STAT3 response
Nouns describing actions
  Expression pattern: ($gene @{0,6} $action (of|with) @{0,1} $gene)
  Sentence output: abi5 domains required for interaction with abi3
Passive verbs
  Expression pattern: ($gene @{0,6} (is|was|be|are|were) @{0,1} $action (by|via|through) @{0,3} $gene)
  Sentence output: Protein kinase c (PKC) has been shown to be activated by parathyroid hormone
Active verbs
  Expression pattern: ($gene $sub-action @{0,1} $action @{0,2} $gene)
  Sentence output: Insulin mediated inhibition of hormone sensitivity lipase activity
9
Text Mining: Keyword Extraction from Biomedical Literature
Keyword extraction example
Sentence: BRCA1 physically associates with p53 and stimulates its transcriptional activity.
Brill-POS-tagged sentence: BRCA1/NNP physically/RB associates/VBZ with/IN p53/NN and/CC stimulates/VBZ its/PRP$ transcriptional/JJ activity/NN ./.
Sentence keywords: associates, stimulates, transcription activity
Sentence keywords after manual curation: transcription activity
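A minimal Python sketch of how keywords might be pulled from a POS-tagged sentence like the one above. The selection rule is an assumption (verbs, plus runs of adjectives and nouns, skipping gene names); the slides do not state the exact rule. The sketch keeps the surface form "transcriptional activity", whereas the slide lists the normalized form "transcription activity", and it does not attempt the manual curation step.

```python
def extract_keywords(tagged_sentence, gene_names):
    """Return verbs and adjective/noun phrases from a word/TAG sentence."""
    keywords = []
    phrase = []  # current run of adjective/noun tokens

    def flush():
        # Close the current phrase, if any, and store it as one keyword.
        if phrase:
            keywords.append(" ".join(phrase))
            phrase.clear()

    for token in tagged_sentence.split():
        word, tag = token.rsplit("/", 1)  # split on the last slash only
        if word in gene_names:
            flush()                        # gene names are not keywords
        elif tag.startswith("VB"):
            flush()
            keywords.append(word)          # verbs are kept individually
        elif tag == "JJ" or tag.startswith("NN"):
            phrase.append(word)            # extend the adjective/noun run
        else:
            flush()
    flush()
    return keywords


TAGGED = ("BRCA1/NNP physically/RB associates/VBZ with/IN p53/NN and/CC "
          "stimulates/VBZ its/PRP$ transcriptional/JJ activity/NN ./.")

print(extract_keywords(TAGGED, {"BRCA1", "p53"}))
# ['associates', 'stimulates', 'transcriptional activity']
```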
10
Text Mining: MeSH Keyword Extraction
MeSH keywords
  MeSH keywords are subject index terms assigned to each article by the National Library of Medicine (NLM) for subject indexing and for searching journal articles via PubMed.
MeSH keyword extraction
  Extracted directly from gene-specific abstracts via Perl scripts
MeSH keyword curation
  Using a dictionary of MeSH stop words (e.g., human, DNA, animal, Support U.S. Govt, etc.)
For example, the MeSH keywords associated with the gene ‘FOS’ in our gene list are ‘oncogene, felypressin, transcription-factor, thermo-receptors, DNA-binding, antibiosis, inflammatory-response, zinc-fingers, gene-regulation, and neuronal-plasticity’.
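The extraction and curation steps can be illustrated with a short Python sketch (the slides use Perl; Python is used here only for illustration). It assumes the abstracts are available as MEDLINE-format records, in which MeSH headings appear on lines beginning with "MH  - ". The stop-word set, the function name, and the sample record are made up for the example.

```python
# Illustrative stop-word set, based on the examples named on the slide.
MESH_STOP_WORDS = {"human", "humans", "animal", "animals", "dna"}

MH_PREFIX = "MH  - "  # MeSH heading field in MEDLINE-format records


def mesh_keywords(medline_record, stop_words=MESH_STOP_WORDS):
    """Collect MeSH headings from one record, dropping stop words."""
    keywords = []
    for line in medline_record.splitlines():
        if not line.startswith(MH_PREFIX):
            continue
        # Keep the main heading only: drop subheadings after "/" and the
        # "*" that marks a major topic.
        heading = line[len(MH_PREFIX):].split("/")[0].lstrip("*").strip()
        if heading.lower() not in stop_words and heading not in keywords:
            keywords.append(heading)
    return keywords


# Made-up record fragment, for illustration only.
RECORD = """PMID- 0000000
MH  - Humans
MH  - Transcription Factors/*metabolism
MH  - *Zinc Fingers
MH  - DNA
"""

print(mesh_keywords(RECORD))
# ['Transcription Factors', 'Zinc Fingers']
```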
11
Text Mining: GO Keyword Extraction
GO keywords
  Gene Ontology (GO) is a hierarchical organization of gene and gene product terms from various databases, in which concepts at higher levels in the hierarchy are more general than those further down.
GO keyword extraction
  Of the three GO annotation categories, we included only molecular function and biological process and left out cellular component, as it is less important for characterizing gene functions.
  Further, because of the hierarchical nature of GO and the multiple inheritance in the GO structure, we include every ancestor up to level 2 in the GO tree.
For example, the GO keywords associated with the gene ‘FOS’ in our gene list are ‘protein-dimerization, DNA binding, RNA polymerase, transcription factor, DNA methylation, and inflammatory-response’.
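One possible reading of the ancestor rule, sketched in Python on a toy hierarchy. Everything here is an assumption: the term names are placeholders rather than real GO terms, "level" is taken to be the shortest distance from a root term (root = level 0), and "up to level 2" is read as keeping all ancestors at level 2 or deeper while dropping the most general terms at levels 0 and 1.

```python
# Hypothetical toy hierarchy: term -> list of parent terms.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "B1": ["B"],
    "X": ["A1", "B1"],  # multiple inheritance: two parents
}


def level(term, parents=PARENTS):
    """Shortest distance from term to a root term (a root has level 0)."""
    frontier, dist = {term}, 0
    while not any(not parents[t] for t in frontier):
        frontier = {p for t in frontier for p in parents[t]}
        dist += 1
    return dist


def ancestors_to_level(term, min_level=2, parents=PARENTS):
    """All ancestors of term whose level is at least min_level."""
    found, seen, stack = set(), set(), list(parents[term])
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if level(node, parents) >= min_level:
            found.add(node)
        stack.extend(parents[node])  # keep climbing along every parent
    return found


print(sorted(ancestors_to_level("X")))
# ['A1', 'B1']
```

Following every parent link means a term with several parents contributes the ancestors from each branch.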
12
Text Mining: Keyword Representation and Calculation of Numeric Vectors
This process is concerned with computing the numeric weight, wij, for each gene-keyword pair (gi, tj) (i = 1, 2, … n and j = 1, 2, … k) to represent the gene’s characteristics in terms of the associated keywords.
Common techniques for such numeric encoding include
Binary. The presence or absence of a keyword relative to a gene.
Term frequency. The frequency of occurrence of a keyword with a gene.
Term frequency / inverse document frequency (TF*IDF). The relative frequency of occurrence of a keyword with a gene compared to other genes.
13
Text Mining: TF*IDF Weighting
The most common weighting scheme in information retrieval and text classification is the TF*IDF (term frequency / inverse document frequency) weighting scheme.
TF(w,d) (Term Frequency) is the number of times word w occurs in a document d.
DF(w) (Document Frequency) is the number of documents in which the word w occurs at least once.
The inverse document frequency is calculated as
$$\mathrm{IDF}(w) = \log\left(\frac{|D|}{\mathrm{DF}(w)}\right)$$
where | D | is the total number of documents in the corpus.
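As a small worked illustration of these definitions, the following Python sketch computes TF(w,d) * IDF(w) for every word in a toy corpus. The function name, the example documents, and the use of the natural logarithm are assumptions; the slide does not specify the base of the logarithm.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: TF * IDF} dict per document (a list of words)."""
    n_docs = len(documents)              # |D|
    df = Counter()                       # DF(w): documents containing w
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)                # TF(w, d): occurrences of w in d
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


# Toy corpus, for illustration only.
docs = [["repair", "binding", "repair"], ["binding", "kinase"]]
weights = tf_idf(docs)
print(weights[0])
```

A word that occurs in every document ("binding" here) gets IDF = log(1) = 0, so it carries no weight however often it appears.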
14
Text Mining: Keyword Representation and Calculation of Numeric vectors
In our study, because the keywords are extracted from gene-specific sentences rather than from full abstracts, the number of keywords associated with each gene is small.
Further, the frequency of occurrence of most keywords tended to be one.
Therefore, the binary encoding scheme was adopted, as illustrated in Table 2.
Genes / Terms | t1 | t2 | ... | tk
g1 | w11 = 0 | w12 = 1 | ... | w1k = 1
g2 | w21 = 1 | w22 = 1 | ... | w2k = 0
... | ... | ... | ... | ...
gn | wn1 = 0 | wn2 = 0 | ... | wnk = 1
Table 2. Binary representation of genes * keywords
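A minimal Python sketch of the binary encoding, using placeholder genes (g1, g2) and terms (t1 to t3) rather than real data. The function name and the input format (a dict from gene to its set of keywords) are assumptions made for the example.

```python
def binary_matrix(gene_keywords):
    """Encode {gene: set of keywords} as a 0/1 gene-by-term matrix."""
    genes = sorted(gene_keywords)
    terms = sorted(set().union(*gene_keywords.values()))
    matrix = [
        [1 if term in gene_keywords[gene] else 0 for term in terms]
        for gene in genes
    ]
    return genes, terms, matrix


# Placeholder input, for illustration only.
genes, terms, matrix = binary_matrix({"g1": {"t2", "t3"}, "g2": {"t1", "t2"}})
for gene, row in zip(genes, matrix):
    print(gene, row)
# g1 [0, 1, 1]
# g2 [1, 1, 0]
```

Each row is a gene vector and each column a term vector, which is the representation the clustering step works on.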
15
Text Mining: Gene Clustering
The binary coding scheme adopted in this study yields numeric row vectors representing genes (via the associated biological functional keywords) and numeric column vectors representing annotation terms (via the associated genes).
Clustering can produce useful and specific information about the biological characteristics of sets of genes.
Clustering: partition unlabeled examples into disjoint subsets (clusters), such that
  examples within a cluster are very similar
  examples in different clusters are very different
Discover new categories in an unsupervised manner.
16
Text Mining: Test Set and Evaluation
The test set contains 20 genes with 10 abstracts per gene, giving a total of 200 abstracts in two cancer categories (Table 3). It was used to evaluate the usefulness of our keyword extraction method.
Genes | Category
ADAM23, DKK1, IGF2, LRRC4, L3MBTL, MMP9, MSH2, PTPNS1, SFMBT1, ZIC1 | Brain Tumor
AMPH, ATM, BRCA1, BRCA2, CHEK2, CDH1, PHB, TFF1, TSG101, XRCC3 | Breast Cancer
Table 3. Test set of 20 human genes manually grouped into two cancer categories
17
Text Mining: Evaluation
1. Full abstract keywords (baseline). Extracts gene annotation terms based on term frequency * inverse document frequency (TF*IDF) within the entire abstract, without regard to sentence structure.
2. Sentence keywords. Extracts gene-specific keywords based on sentence-level processing.
3. Sentence + MeSH keywords. As in (2) above, plus MeSH terms (see the MeSH Keyword Extraction slide).
4. Sentence + MeSH + GO keywords. As in (2) above, plus MeSH terms (see the MeSH Keyword Extraction slide) and GO terms (see the GO Keyword Extraction slide).
18
Text Mining: Evaluation
Results of various keyword extraction methods
Keyword Extraction Method | Precision | Recall | F-measure (%)
Abstract keywords (baseline) | 0.31 | 0.24 | 27.05
Sentence keywords only | 0.57 | 0.38 | 45.60
Sentence + MeSH keywords | 0.64 | 0.47 | 54.19
Sentence + MeSH + GO keywords | 0.78 | 0.72 | 74.88
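The slide does not state how the F-measure was computed. The reported values are consistent with the balanced F-measure, F = 2PR / (P + R), expressed as a percentage, and the Python sketch below checks this under that assumption. Because the precision and recall values are rounded to two decimals, the recomputed figure can differ from the reported one in the last digit.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of P and R), as a percentage."""
    return 100 * 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) for the four methods on the slide.
REPORTED = [
    (0.31, 0.24, 27.05),  # Abstract keywords (baseline)
    (0.57, 0.38, 45.60),  # Sentence keywords only
    (0.64, 0.47, 54.19),  # Sentence + MeSH keywords
    (0.78, 0.72, 74.88),  # Sentence + MeSH + GO keywords
]

for p, r, reported in REPORTED:
    print(f"P={p:.2f} R={r:.2f} recomputed={f_measure(p, r):.2f} reported={reported:.2f}")
```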
Part II: Applications to Microarrays
Functional keyword clustering of genes resulting from a microarray experiment
20
Applications to Microarrays: Data and Analysis
As an illustrative example, our keyword extraction method was applied to the functional interpretation of clusters of genes that were found to be differentially expressed in a microarray experiment investigating the impact of two mitogens, epidermal growth factor (EGF) and sphingosine 1-phosphate (S1P), on glioblastoma cell lines.
When compared to the resting state, 19 genes were significantly differentially expressed in response to EGF, 35 genes in response to S1P, and 30 genes in response to COM, i.e., the combined stimuli of S1P and EGF. The three gene lists are referred to as G(EGF), G(S1P) and G(COM), respectively (Table 4).
21
Applications to Microarrays: Data and Analysis
Table 4. List of differentially expressed genes
Gene List | Names of Genes
G(EGF) (19 genes) | HRY, KLF2, ID1, JUN, DUSP6, IMPDH2, GP1BB, PNUTL1, CGI-96, CALD1, TRIM15, FOS, SPRY4, CLU, SLC5A3, MRPS6, ABCA1, OLFM1, PHLDA1
G(S1P) (35 genes) | F3, NR4A1, KLF5, GADD45B, IL8, CITED2, CALD1, IL6, BCL6, LBH, HRB2, KIAA0992, NFKBIA, TNFAIP3, CCL2, DSCR1, TXNIP, NAB1, EHD1, GBP1, GLIPR1, MAP2K3, FZD7, RGS3, SOCS5, FOSL2, JAG1, DOC1, NRG1, BTG1, PDE4C, KIAA1718, KIAA0346, SFRS3, PLAU
G(COM) (30 genes) | MAFF, DUSP5, EGR3, SERPINE1, ZFP36, DUSP1, LIF, DTR, MYC, GADD45B, RTP801, ATF3, JUNB, SNARK, WEE1, EGR2, TIEG, SPRY2, CEBPD, SGK, GEM, NEDD9, LDLR, EGR1, C8FW, UGCG, MCL1, ZYX, FOSL1, DIPA
22
Applications to Microarrays: Data and Analysis
Using the three gene lists obtained from the microarray experiment (Table 4) as queries in MEDLINE returned three corresponding sets of abstracts, A(EGF), A(S1P) and A(COM), respectively (Table 5).
The abstracts were processed with the sentence-level keyword extraction method augmented with MeSH and GO keywords.
The resulting keywords were encoded using the binary weighting scheme.
The resulting representations were clustered using the average linkage hierarchical clustering algorithm.
23
Applications to Microarrays: Data and Analysis
Table 5. Three sets of abstracts, A(EGF), A(S1P), and A(COM), retrieved via MEDLINE for this study
Gene List | # of Genes in List | Retrieved Abstract Set | # of Abstracts in Set
G(EGF) | 19 | A(EGF) | 28 913
G(S1P) | 35 | A(S1P) | 19 705
G(COM) | 30 | A(COM) | 39 890
24
Applications to Microarrays: Average Linkage Hierarchical Clustering Algorithm
Use average similarity across all pairs within the merged cluster to measure the similarity of two clusters.
Compromise between single and complete link.
Averaged across all ordered pairs in the merged cluster instead of unordered pairs between the two clusters.
$$\mathrm{sim}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{x \in (c_i \cup c_j)} \; \sum_{y \in (c_i \cup c_j):\, y \neq x} \mathrm{sim}(x, y)$$
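The group-average criterion can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: genes are represented by their keyword sets (equivalent to binary vectors), Jaccard similarity is assumed for sim(x, y) because the slides do not say which pairwise similarity was used, and the gene and term names are placeholders.

```python
from itertools import combinations


def jaccard(a, b):
    """Assumed pairwise similarity between two keyword sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_average_sim(ci, cj, vectors, sim=jaccard):
    """Average similarity over all ordered pairs in the merged cluster."""
    merged = ci + cj                     # ci and cj are disjoint tuples
    n = len(merged)
    total = sum(sim(vectors[x], vectors[y])
                for x in merged for y in merged if x != y)
    return total / (n * (n - 1))


def average_link_clustering(vectors, n_clusters, sim=jaccard):
    """Merge the most similar pair of clusters until n_clusters remain."""
    clusters = [(gene,) for gene in sorted(vectors)]
    while len(clusters) > n_clusters:
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: group_average_sim(pair[0], pair[1], vectors, sim),
        )
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return sorted(clusters)


# Placeholder genes and terms, for illustration only.
vectors = {
    "g1": {"t1", "t2"},
    "g2": {"t1", "t2", "t3"},
    "g3": {"t4"},
    "g4": {"t4", "t5"},
}
print(average_link_clustering(vectors, 2))
# [('g1', 'g2'), ('g3', 'g4')]
```

Recording the order of merges instead of stopping at a fixed number of clusters would give the full dendrogram.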
25
Applications to Microarrays: Results
Summary of analysis of EGF cluster
[Figure: clustering of the G(EGF) genes by functional keywords.
Gene labels: HRY, KLF2, ID1, JUN, DUSP6, IMPDH2, GP1BB, PNUTL1, CALD1, TRIM15, FOS, SPRY4, CLU, SLC5A3, MRPS6, ABCA1, OLFM1, PHLDA1
Keyword labels: neural tube defects, transcription factor, cell death, embryogenesis, ion binding, angiogenesis, inhibition, embryonic development, trans-activators, zinc fingers, mitogenesis, assemble, secretion, biosynthesis, regulation, glycoprotein, androgens, odontogenesis, calmodulin-binding, desaturases, shape-regulation, relaxation, tumorigenesis, intracellular, atherogeneses, glutamine-transport, DNA-methylation, felypressin, transition, clustering, recombination, thermo-receptors, v-fos, fusion, sensation, immuno-reactivity, antibiosis, osteoblasts]
26
Applications to Microarrays: Results
Summary of analysis of S1P cluster
[Figure: clustering of the G(S1P) genes by functional keywords.
Gene labels: F3, NR4A1, KLF5, GADD45B, IL8, CITED2, CALD1, IL6, BCL6, HRB2, NFKBIA, TNFAIP3, CCL2, DSCR1, TXNIP, NAB1, EHD1, GBP1, GLIPR1, MAP2K3, FZD7, RGS3, SOCS5, FOSL2, JAG1, DOC1, NRG1, BTG1, PDE4C, SFRS3, PLAU
Keyword labels: atherogenesis, mitogenesis, assemble, inflammation, angiogenesis, endocytosis, lymphocytes, pathogenesis, DNA-dependent, focal-contact, DNA-damage, splicing, G1 phase, extracellular, motility, protein-binding, cos-cells, myosin, RNA localization, dose-response, anticodon, cytotoxicity, parasitophorous, G protein, demyelination, cytolysis, Ca release, locomotion, homeostasis, circulation, phosphorylation, synthesis, repair, protein kinase, endothelialization, organogenesis, cell-adhesion, mutagenesis, immune-response]
27
Applications to Microarrays: Results
Summary of analysis of COM cluster
[Figure: clustering of the G(COM) genes by functional keywords.
Gene labels: MAFF, DUSP5, EGR3, SERPINE1, ZFP36, DUSP1, LIF, DTR, MYC, GADD45B, RTP801, ATF3, JUNB, SNARK, WEE1, EGR2, TIEG, SPRY2, CEBPD, SGK, GEM, NEDD9, LDLR, EGR1, C8FW, UGCG, MCL1, ZYX, FOSL1, DIPA
Keyword labels: DNA-binding, zinc fingers, repressor proteins, DNA-dependent, nucleus, transactivation, leucine zippers, transcription, gene expression regulation, oxidative stress, proto-oncogene, cell survival, signal transduction, maturation, endocytosis, differentiation, mitogenesis, mitosis, G2 phase, chemosensitivity, mutagenesis, lymphangiogenesis, ion binding, RNA processing, G2-M transition, mRNA splicing, immortality, DNA recombination, microtubule, gene silencing, helix-loop-helix motifs, transcription factor, seizures, genome instability, DNA modification, DNA methylation, jun genes]
28
Conclusions
An important topic in microarray data mining is to link transcriptionally modulated genes to functional pathways, or to determine how transcriptional modulation is associated with specific biological events such as genetic disease phenotypes or cell differentiation.
However, the amount of functional annotation available for each transcriptionally modulated gene is still a limiting factor, because not all genes are well annotated.
Further, Jenssen et al. (2001) earlier compiled a network of human gene relationships from MEDLINE abstracts. These compiled relationships were then compared to gene expression clustering results. This approach gave an interesting result: functionally related genes can show totally different expression patterns, and hence belong to different clusters (Jenssen et al.: A literature network of human genes for high-throughput analysis of gene expression, Nat. Genet., 28, 21-28, 2001).
29
Conclusions
Our functional keyword clustering of genes makes it possible to select functionally informative genes from among the differentially expressed genes for further investigation.
Our evaluation suggests that this approach provides more specific and useful information than typical approaches using abstract-level information. This is particularly the case when the sentence-level terms are augmented by MeSH and GO keywords.
The current trend in text mining is toward full-text mining. Because full text contains a large number of irrelevant sentences compared to abstracts, this approach is well suited to full-text studies, as it filters out irrelevant sentences before clustering.
30
Acknowledgments
Eric G. Bremer, Brain Tumor Research Program, Children’s Memorial Research Center, Chicago, IL, USA, and James R. van Brocklyn, Division of Neuropathology, Department of Pathology, The Ohio State University, Columbus, Ohio, USA, for the microarray data set
Dr. Daniel Berrar, Bioinformatics Research Group, University of Ulster, UK
Members of Bioinformatics Centre, Madurai Kamaraj University, India
Dept of Biotechnology, Govt. of India for Bioinformatics facilities
31
THANK YOU