
Functional Gene Clustering via Gene Annotation Sentences, MeSH and GO Keywords from Biomedical Literature

Dr. N. JEYAKUMAR, M.Sc., Ph.D.

Bioinformatics Centre, School of Biotechnology, Madurai Kamaraj University

Madurai – 625021, INDIA

Purpose & Goals

Extract gene-specific functional ‘keywords’ from the biological literature, both from full abstracts and from gene-specific sentences.

Augment the extracted keywords with MeSH and GO keywords related to each gene.

Compare the accuracy of four keyword extraction methods on a test data set: full abstracts; gene-specific sentences; gene-specific sentences + MeSH keywords; gene-specific sentences + MeSH and GO keywords.

Use the keyword extraction method to cluster the differentially expressed genes from a microarray experiment.

Outline: Two Parts, I and II

Part I: Text mining and keyword extraction from literature (our text mining methodology)

Part II: Applications to microarrays (functional keyword clustering of microarray data)

Part I: Text Mining

Text Mining: Introduction and Overview

Text mining aims to identify non-trivial, implicit, previously unknown, and potentially useful patterns in text (e.g. classification systems, association rules, hypotheses).

It includes more established research areas such as information retrieval (IR), natural language processing (NLP), information extraction (IE), and traditional data mining (DM).

It is relevant to bioinformatics because of the explosive growth of the biomedical literature (e.g. MEDLINE, with 15 million records) and because some information is available in textual form only (e.g. clinical records).

Text Mining: System Architecture

[Figure: Experimental design of gene clustering with sentence-level, MeSH and GO keywords. Components shown in the diagram: Microarray Experiment, Gene List, Gene/Protein Dictionary, Set of Abstracts, Filtering MEDLINE Abstracts, Sentence Extraction, Keyword Extraction, MeSH/GO Keyword Extraction, MeSH/Gene Ontology, Feature Vector Generation, Clustering, Patterns, Visualization, Annotation.]

Text Mining: Keyword Extraction from Biomedical Literature

Steps to extract sentence-level keywords:

Gene-synonym dictionary. A dedicated dictionary of gene names and their synonyms was created for human genes using Entrez Gene.

Gene-name normalization. All gene names in an abstract are replaced with their unique canonical identifier (Entrez Gene ID) using the gene-synonym dictionary constructed for this study.

Sentence filtering. Corpus-specific regular expressions select the gene-specific sentences. For example, the expression

($gene @{0,6} $action (of|with) @{0,2} $gene)

extracts sentences that match the structure shown below, where the notation ‘A B ...’ is read as ‘A followed by B followed by ...’:

gene name, 0-6 words, action verb, ‘of’ or ‘with’, 0-2 words, gene name

Keyword extraction. Described on the next slide.
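The normalization and sentence-filtering steps can be sketched in Python as follows. This is a minimal illustration, not the scripts used in the study: the `GENE_<id>` token format, the numeric identifiers, the `SYNONYMS` dictionary and the short `ACTION` word list are all assumptions made for the example, and `@{m,n}` is read as "between m and n arbitrary words".

```python
import re

# Hypothetical synonym dictionary: gene name -> made-up numeric identifier.
SYNONYMS = {"abi5": 101, "abi3": 102}

# Illustrative stand-ins for the $gene and $action classes of the slide.
GENE = r"GENE_\d+"
ACTION = r"(?:interaction|activation|inhibition|regulation|mediator)"


def gap(lo, hi):
    """Regex for lo..hi arbitrary words, each followed by whitespace."""
    return rf"(?:\S+\s+){{{lo},{hi}}}"


# ($gene @{0,6} $action (of|with) @{0,2} $gene)
PATTERN = re.compile(
    rf"{GENE}\s+{gap(0, 6)}{ACTION}\s+(?:of|with)\s+{gap(0, 2)}{GENE}"
)


def normalize(sentence, synonyms):
    """Replace every known gene name with its canonical GENE_<id> token."""
    # Longest names first, so longer synonyms win over their prefixes.
    for name in sorted(synonyms, key=len, reverse=True):
        sentence = re.sub(
            rf"\b{re.escape(name)}\b",
            f"GENE_{synonyms[name]}",
            sentence,
            flags=re.IGNORECASE,
        )
    return sentence


def filter_sentences(sentences):
    """Keep only the sentences that match the gene-action-gene pattern."""
    return [s for s in sentences if PATTERN.search(s)]


sentences = [
    "abi5 domains required for interaction with abi3",
    "abi5 is expressed in seeds",
]
normalized = [normalize(s, SYNONYMS) for s in sentences]
kept = filter_sentences(normalized)
```

Only the first sentence survives the filter, because the second contains a single gene mention and no action word followed by ‘of’ or ‘with’.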

Text Mining: Keyword Extraction from Biomedical Literature

Table 1. An example set of regular expressions for nouns describing agents and actions, and for passive and active verbs

Name of expression: Nouns describing agents
Expression pattern: ($gene (is)? (the|an|a) @{0,2} $action of @{0,2} $gene)
Sentence output: IL6, a known mediator of STAT3 response

Name of expression: Nouns describing actions
Expression pattern: ($gene @{0,6} $action (of|with) @{0,1} $gene)
Sentence output: abi5 domains required for interaction with abi3

Name of expression: Passive verbs
Expression pattern: ($gene @{0,6} (is|was|be|are|were) @{0,1} $action (by|via|through) @{0,3} $gene)
Sentence output: Protein kinase c (PKC) has been shown to be activated by parathyroid hormone

Name of expression: Active verbs
Expression pattern: ($gene $sub-action @{0,1} $action @{0,2} $gene)
Sentence output: Insulin mediated inhibition of hormone sensitivity lipase activity

Text Mining: Keyword Extraction from Biomedical Literature

Keyword extraction example

Sentence: BRCA1 physically associates with p53 and stimulates its transcriptional activity.

Brill POS-tagged sentence: BRCA1/NNP physically/RB associates/VBZ with/IN p53/NN and/CC stimulates/VBZ its/PRP$ transcriptional/JJ activity/NN ./.

Sentence keywords: associates, stimulates, transcription activity

Sentence keywords after manual curation: transcription activity
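The example above can be reproduced approximately with a short Python sketch that works on an already tagged sentence. The selection rule used here (keep verbs, and keep runs of adjectives and nouns as one phrase, skipping gene names) is an assumption inferred from the example, not a documented rule of the study; the function name `extract_keywords` is likewise hypothetical. No stemming is applied, so the phrase comes out as "transcriptional activity" rather than the "transcription activity" shown on the slide.

```python
def extract_keywords(tagged_sentence, gene_names):
    """Pick candidate keywords from a 'word/TAG word/TAG ...' string."""
    keywords, phrase = [], []
    for token in tagged_sentence.split():
        word, tag = token.rsplit("/", 1)
        is_gene = word in gene_names
        # Adjectives (JJ*) and nouns (NN*) are collected into one phrase.
        if tag.startswith(("JJ", "NN")) and not is_gene:
            phrase.append(word)
            continue
        # Any other token ends the current phrase.
        if phrase:
            keywords.append(" ".join(phrase))
            phrase = []
        # Verbs (VB*) are kept as single-word keywords.
        if tag.startswith("VB"):
            keywords.append(word)
    if phrase:
        keywords.append(" ".join(phrase))
    return keywords


tagged = (
    "BRCA1/NNP physically/RB associates/VBZ with/IN p53/NN and/CC "
    "stimulates/VBZ its/PRP$ transcriptional/JJ activity/NN ./."
)
keywords = extract_keywords(tagged, gene_names={"BRCA1", "p53"})
```

Manual curation, as in the last line of the example, would then drop the generic verbs and keep only the functional phrase.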

Text Mining: MeSH Keyword Extraction

MeSH keywords. MeSH keywords are subject index terms assigned to each article by the National Library of Medicine (NLM) for subject indexing and for searching journal articles via PubMed.

MeSH keyword extraction. The keywords are extracted directly from the gene-specific abstracts via Perl scripts.

MeSH keyword curation. Uninformative terms are removed using a dictionary of MeSH stop words (e.g. human, DNA, animal, Support U.S. Govt).

For example, the MeSH keywords associated with the gene ‘FOS’ in our gene list are ‘oncogene, felypressin, transcription-factor, thermo-receptors, DNA-binding, antibiosis, inflammatory-response, zinc-fingers, gene-regulation, and neuronal-plasticity’.
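The extraction and curation steps can be sketched as below. The study used Perl scripts; this Python version is a substitute for illustration only. It assumes the records are in the MEDLINE text format, where each MeSH heading sits on a line starting with `MH  - `, and the stop-word set and sample record are made up for the example.

```python
# Illustrative stop-word set; the study's actual dictionary is not given.
MESH_STOP_WORDS = {"humans", "animals", "dna"}


def mesh_keywords(medline_record, stop_words=MESH_STOP_WORDS):
    """Collect MeSH headings from one MEDLINE-format record."""
    keywords = []
    for line in medline_record.splitlines():
        if not line.startswith("MH  - "):
            continue
        # Drop subheadings after '/' and the '*' major-topic marker.
        heading = line[6:].split("/")[0].lstrip("*").strip().lower()
        if heading and heading not in stop_words and heading not in keywords:
            keywords.append(heading)
    return keywords


# Made-up record fragment used only to exercise the function.
record = "\n".join([
    "PMID- 0000000",
    "TI  - An illustrative title",
    "MH  - Humans",
    "MH  - *Transcription Factors/metabolism",
    "MH  - Zinc Fingers",
    "MH  - Transcription Factors/genetics",
])
found = mesh_keywords(record)
```

Duplicates that differ only in their subheading collapse into one keyword, and stop words such as "Humans" are dropped.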

Text Mining: GO Keyword Extraction

GO keywords. Gene Ontology (GO) is a hierarchical organization of gene and gene product terms from various databases, in which concepts at higher levels of the hierarchy are more general than those further down.

GO keyword extraction. Of the three GO annotation categories, we included only molecular function and biological process, and left out cellular component because it is less important for characterizing gene function.

Further, because of the hierarchical nature of GO and the multiple inheritance in its structure, each GO term is considered together with every ancestor up to level 2 of the GO tree.

For example, the GO keywords associated with the gene ‘FOS’ in our gene list are ‘protein-dimerization, DNA binding, RNA polymerase, transcription factor, DNA methylation, and inflammatory-response’.
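One possible reading of the ancestor rule is sketched below on a toy ontology. Everything here is an assumption for illustration: the `PARENTS` graph is invented and is not real GO data, the root is taken as level 0, a term's level is taken as its shortest distance to the root, and "up to level 2" is interpreted as keeping ancestors at level 2 or deeper (that is, discarding the most general terms).

```python
# Toy child -> parents graph with multiple inheritance (not real GO data).
PARENTS = {
    "root": [],
    "binding": ["root"],
    "regulation": ["root"],
    "nucleic acid binding": ["binding"],
    "DNA binding": ["nucleic acid binding"],
    "transcription factor activity": ["DNA binding", "regulation"],
}


def level(term):
    """Shortest distance from the term to the root (root is level 0)."""
    parents = PARENTS[term]
    return 0 if not parents else 1 + min(level(p) for p in parents)


def expand_with_ancestors(term, min_level=2):
    """Return the term plus all ancestors at level >= min_level."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(PARENTS[current])  # follow every parent
    return sorted(t for t in seen if level(t) >= min_level)


expanded = expand_with_ancestors("transcription factor activity")
```

With this reading, very general terms such as "binding" or "regulation" never become keywords, while more specific ancestors reached through any parent are retained.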

Text Mining: Keyword Representation and Calculation of Numeric Vectors

This step computes a numeric weight, w_ij, for each gene-keyword pair (g_i, t_j) (i = 1, 2, ..., n and j = 1, 2, ..., k) to represent each gene's characteristics in terms of its associated keywords.

Common techniques for such numeric encoding include:

Binary. The presence or absence of a keyword relative to a gene.

Term frequency. The frequency of occurrence of a keyword with a gene.

Term frequency / inverse document frequency (TF*IDF). The relative frequency of occurrence of a keyword with a gene compared to other genes.
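The first two encodings can be sketched in a few lines of Python. The function name `encode`, the keyword occurrence lists and the vocabulary are illustrative assumptions; they are not data from the study.

```python
from collections import Counter


def encode(gene_keywords, vocabulary, scheme="binary"):
    """Return {gene: [w_i1, ..., w_ik]} over the given keyword vocabulary."""
    vectors = {}
    for gene, keywords in gene_keywords.items():
        counts = Counter(keywords)
        if scheme == "binary":
            # 1 if the keyword occurs with the gene at all, else 0.
            vectors[gene] = [1 if counts[t] else 0 for t in vocabulary]
        elif scheme == "tf":
            # Raw number of occurrences of the keyword with the gene.
            vectors[gene] = [counts[t] for t in vocabulary]
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return vectors


# Made-up keyword occurrences per gene, for illustration only.
occurrences = {
    "FOS": ["transcription-factor", "DNA-binding", "transcription-factor"],
    "JUN": ["transcription-factor"],
}
vocabulary = ["DNA-binding", "transcription-factor", "zinc-fingers"]

binary_vectors = encode(occurrences, vocabulary, scheme="binary")
tf_vectors = encode(occurrences, vocabulary, scheme="tf")
```

The two schemes differ only where a keyword occurs more than once with a gene, which is why they nearly coincide when most counts are one.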

Text Mining: TF*IDF Weighting

The most widely used weighting scheme in information retrieval and text classification is TF*IDF (term frequency / inverse document frequency).

TF(w, d) (term frequency) is the number of times word w occurs in document d.

DF(w) (document frequency) is the number of documents in which word w occurs at least once.

The inverse document frequency is calculated as

$$\mathrm{IDF}(w) = \log\left(\frac{|D|}{\mathrm{DF}(w)}\right)$$

where |D| is the total number of documents in the corpus.
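As a small worked illustration of these definitions, the sketch below computes DF, IDF (the logarithm of the corpus size divided by the document frequency) and the product TF × IDF on a toy corpus. The natural logarithm is an assumption, since the slide does not state the base, and the three token lists are invented.

```python
import math


def document_frequency(word, documents):
    """DF(w): number of documents containing the word at least once."""
    return sum(1 for doc in documents if word in doc)


def idf(word, documents):
    """IDF(w) = log(|D| / DF(w)); the word must occur in the corpus."""
    df = document_frequency(word, documents)
    if df == 0:
        raise ValueError(f"{word!r} does not occur in any document")
    return math.log(len(documents) / df)


def tf_idf(word, doc, documents):
    """TF(w, d) * IDF(w) for one tokenized document of the corpus."""
    return doc.count(word) * idf(word, documents)


# Toy corpus of tokenized documents, for illustration only.
corpus = [
    ["kinase", "binding", "kinase"],
    ["binding"],
    ["apoptosis", "binding"],
]
weight_kinase = tf_idf("kinase", corpus[0], corpus)    # 2 * log(3 / 1)
weight_binding = tf_idf("binding", corpus[0], corpus)  # 1 * log(3 / 3)
```

A word that appears in every document, like "binding" here, gets weight zero, which is the intended effect of the IDF factor.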

Text Mining: Keyword Representation and Calculation of Numeric Vectors

In our study, because the keywords are extracted from gene-specific sentences rather than from full abstracts, the number of keywords associated with each gene is small.

Further, the frequency of occurrence of most keywords tended to be one.

Therefore, the binary encoding scheme was adopted, as illustrated in Table 2.

Table 2. Binary representation of genes × keywords

Genes / Terms   t1        t2        ...   tk
g1              w11 = 0   w12 = 1   ...   w1k = 1
g2              w21 = 1   w22 = 1   ...   w2k = 0
...             ...       ...       ...   ...
gn              wn1 = 0   wn2 = 0   ...   wnk = 1

Text Mining: Gene Clustering

The binary encoding adopted in this study yields numeric row vectors representing genes (via the associated biological functional keywords) and numeric column vectors representing annotation terms (via the associated genes).

Clustering can produce useful and specific information about the biological characteristics of sets of genes.

Clustering partitions unlabeled examples into disjoint subsets (clusters) such that examples within a cluster are very similar and examples in different clusters are very different.

It discovers new categories in an unsupervised manner.

Text Mining: Test Set and Evaluation

A test set of 20 genes with 10 abstracts per gene, 200 abstracts in total across two cancer categories (Table 3), was used to evaluate the usefulness of our keyword extraction method.

Table 3. Test set of 20 human genes manually grouped into two cancer categories

Category: Brain Tumor
Genes: ADAM23, DKK1, IGF2, LRRC4, L3MBTL, MMP9, MSH2, PTPNS1, SFMBT1, ZIC1

Category: Breast Cancer
Genes: AMPH, ATM, BRCA1, BRCA2, CHEK2, CDH1, PHB, TFF1, TSG101, XRCC3

Text Mining: Evaluation

1. Full abstract keywords (baseline). Extracts gene annotation terms based on term frequency * inverse document frequency (TF*IDF) within the entire abstract, without regard to sentence structure.

2. Sentence keywords. Extracts gene-specific keywords based on sentence-level processing.

3. Sentence + MeSH keywords. As in (2) above, plus MeSH terms (see Section MeSH keyword extraction).

4. Sentence + MeSH + GO keywords. As in (2) above, plus MeSH terms (see Section MeSH keyword extraction) and GO terms (see Section GO keyword extraction).

Text Mining: Evaluation

Results of the keyword extraction methods

Keyword Extraction Method        Precision   Recall   F-measure (%)
Abstract keywords (baseline)     0.31        0.24     27.05
Sentence keywords only           0.57        0.38     45.60
Sentence + MeSH keywords         0.64        0.47     54.19
Sentence + MeSH + GO keywords    0.78        0.72     74.88
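The F-measure column appears to be the balanced F-measure, the harmonic mean of precision and recall, expressed as a percentage. The slides do not state the formula, so this is an assumption, but the short check below reproduces the reported values to within rounding of the two-decimal precision and recall figures.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall), in %."""
    return 100 * 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) taken from the results table.
reported = {
    "Abstract keywords (baseline)": (0.31, 0.24, 27.05),
    "Sentence keywords only": (0.57, 0.38, 45.60),
    "Sentence + MeSH keywords": (0.64, 0.47, 54.19),
    "Sentence + MeSH + GO keywords": (0.78, 0.72, 74.88),
}

# Absolute difference between recomputed and reported F-measure per method.
differences = {
    method: abs(f_measure(p, r) - f)
    for method, (p, r, f) in reported.items()
}
```

Every difference stays below 0.02 percentage points, which is consistent with the table having been computed from unrounded precision and recall.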

Part II: Applications to Microarrays

Functional keyword clustering of genes resulting from a microarray experiment

Applications to Microarrays: Data and Analysis

As an illustrative example, our keyword extraction method was applied to the functional interpretation of clusters of genes that were found to be differentially expressed in a microarray experiment investigating the impact of two mitogens, epidermal growth factor (EGF) and sphingosine 1-phosphate (S1P), on glioblastoma cell lines.

Compared to the resting state, 19 genes were significantly differentially expressed in response to EGF, 35 genes in response to S1P, and 30 genes in response to COM, i.e., the combined stimuli of S1P and EGF. The three gene lists are referred to as G(EGF), G(S1P) and G(COM), respectively (Table 4).

Table 4. List of differentially expressed genes

Gene list: G(EGF) (19 genes)
Genes: HRY, KLF2, ID1, JUN, DUSP6, IMPDH2, GP1BB, PNUTL1, CGI-96, CALD1, TRIM15, FOS, SPRY4, CLU, SLC5A3, MRPS6, ABCA1, OLFM1, PHLDA1

Gene list: G(S1P) (35 genes)
Genes: F3, NR4A1, KLF5, GADD45B, IL8, CITED2, CALD1, IL6, BCL6, LBH, HRB2, KIAA0992, NFKBIA, TNFAIP3, CCL2, DSCR1, TXNIP, NAB1, EHD1, GBP1, GLIPR1, MAP2K3, FZD7, RGS3, SOCS5, FOSL2, JAG1, DOC1, NRG1, BTG1, PDE4C, KIAA1718, KIAA0346, SFRS3, PLAU

Gene list: G(COM) (30 genes)
Genes: MAFF, DUSP5, EGR3, SERPINE1, ZFP36, DUSP1, LIF, DTR, MYC, GADD45B, RTP801, ATF3, JUNB, SNARK, WEE1, EGR2, TIEG, SPRY2, CEBPD, SGK, GEM, NEDD9, LDLR, EGR1, C8FW, UGCG, MCL1, ZYX, FOSL1, DIPA

Querying MEDLINE with the three gene lists obtained from the microarray experiment (Table 4) returned three corresponding sets of abstracts, A(EGF), A(S1P) and A(COM), respectively (Table 5).

The abstracts were processed with the sentence-level keyword extraction method augmented with MeSH and GO keywords.

The resulting keywords were encoded with the binary weighting scheme.

The resulting representations were clustered using the average linkage hierarchical clustering algorithm.

Table 5. Three sets of abstracts, A(EGF), A(S1P), and A(COM), retrieved via MEDLINE for this study

Gene List   # of Genes in List   Retrieved Abstract Set   # of Abstracts in Set
G(EGF)      19                   A(EGF)                   28 913
G(S1P)      35                   A(S1P)                   19 705
G(COM)      30                   A(COM)                   39 890

Applications to Microarrays: Average Linkage Hierarchical Clustering Algorithm

Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters.

This is a compromise between single and complete link.

The similarity is averaged across all ordered pairs in the merged cluster instead of unordered pairs between the two clusters:

$$\mathrm{sim}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,\left(|c_i \cup c_j| - 1\right)} \sum_{x \in (c_i \cup c_j)} \; \sum_{y \in (c_i \cup c_j):\, y \neq x} \mathrm{sim}(x, y)$$
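The merge criterion described on this slide, the average similarity over all ordered pairs of distinct members of the merged cluster, can be sketched in plain Python as below. The pairwise similarity between binary keyword vectors is taken to be the Jaccard coefficient, which is an assumption because the slides do not name the measure used; the gene names `g1` to `g4` and their vectors are toy data. The criterion is written out by hand rather than taken from a library, since library "average" linkage commonly averages only the pairs between the two clusters.

```python
from itertools import permutations


def jaccard(x, y):
    """Jaccard similarity of two binary vectors (assumed measure)."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0


def group_average_sim(ci, cj, vectors, sim=jaccard):
    """Average sim over all ordered pairs x != y in the merged cluster."""
    merged = ci | cj
    n = len(merged)
    total = sum(sim(vectors[x], vectors[y]) for x, y in permutations(merged, 2))
    return total / (n * (n - 1))


def cluster(vectors, k, sim=jaccard):
    """Agglomerative clustering down to k clusters."""
    clusters = [frozenset([gene]) for gene in vectors]
    while len(clusters) > k:
        pairs = [(a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]]
        # Merge the pair whose union has the highest average similarity.
        a, b = max(pairs, key=lambda p: group_average_sim(p[0], p[1], vectors, sim))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


# Toy binary gene-by-keyword vectors, for illustration only.
toy_vectors = {
    "g1": [1, 1, 0, 0],
    "g2": [1, 1, 1, 0],
    "g3": [0, 0, 1, 1],
    "g4": [0, 0, 0, 1],
}
result = cluster(toy_vectors, k=2)
```

On the toy data, g1 and g2 merge first (they share two of three keywords), then g3 and g4, leaving two clusters.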

Applications to Microarrays: Results

[Figure: Summary of analysis of EGF cluster (gene-by-keyword cluster map).

Gene labels: HRY, KLF2, ID1, JUN, DUSP6, IMPDH2, GP1BB, PNUTL1, CALD1, TRIM15, FOS, SPRY4, CLU, SLC5A3, MRPS6, ABCA1, OLFM1, PHLDA1.

Keyword labels: neural tube defects, transcription factor, cell death, embryogenesis, ion binding, angiogenesis, inhibition, embryonic development, trans-activators, zinc fingers, mitogenesis, assemble, secretion, biosynthesis, regulation, glycoprotein, androgens, odontogenesis, calmodulin-binding, desaturases, shape-regulation, relaxation, tumorigenesis, intracellular, atherogeneses, glutamine-transport, DNA-methylation, felypressin, transition, clustering, recombination, thermo-receptors, v-fos, fusion, sensation, immuno-reactivity, antibiosis, osteoblasts.]

[Figure: Summary of analysis of S1P cluster (gene-by-keyword cluster map).

Gene labels: F3, NR4A1, KLF5, GADD45B, IL8, CITED2, CALD1, IL6, BCL6, HRB2, NFKBIA, TNFAIP3, CCL2, DSCR1, TXNIP, NAB1, EHD1, GBP1, GLIPR1, MAP2K3, FZD7, RGS3, SOCS5, FOSL2, JAG1, DOC1, NRG1, BTG1, PDE4C, SFRS3, PLAU.

Keyword labels: atherogenesis, mitogenesis, assemble, inflammation, angiogenesis, endocytosis, lymphocytes, pathogenesis, DNA-dependent, focal-contact, DNA-damage, splicing, G1 phase, extracellular, motility, protein-binding, cos-cells, myosin, RNA localization, dose-response, anticodon, cytotoxicity, parasitophorous, G protein, demyelination, cytolysis, Ca release, locomotion, homeostasis, circulation, phosphorylation, synthesis, repair, protein kinase, endothelialization, organogenesis, cell-adhesion, mutagenesis, immune-response.]

[Figure: Summary of analysis of COM cluster (gene-by-keyword cluster map).

Gene labels: MAFF, DUSP5, EGR3, SERPINE1, ZFP36, DUSP1, LIF, DTR, MYC, GADD45B, RTP801, ATF3, JUNB, SNARK, WEE1, EGR2, TIEG, SPRY2, CEBPD, SGK, GEM, NEDD9, LDLR, EGR1, C8FW, UGCG, MCL1, ZYX, FOSL1, DIPA.

Keyword labels: DNA-binding, zinc fingers, repressor proteins, DNA-dependent, nucleus, transactivation, leucine zippers, transcription, gene expression regulation, oxidative stress, proto-oncogene, cell survival, signal transduction, maturation, endocytosis, differentiation, mitogenesis, mitosis, G2 phase, chemosensitivity, mutagenesis, lymphangiogenesis, ion binding, RNA processing, G2-M transition, mRNA splicing, immortality, DNA recombination, microtubule, gene silencing, helix-loop-helix motifs, transcription factor, seizures, genome instability, DNA modification, DNA methylation, jun genes.]

Conclusions

An important topic in microarray data mining is how to link transcriptionally modulated genes to functional pathways, or how transcriptional modulation can be associated with specific biological events such as genetic disease phenotypes, cell differentiation, etc.

However, the amount of functional annotation available for each transcriptionally modulated gene is still a limiting factor, because not all genes are well annotated.

Further, Jenssen et al. (2001) earlier compiled a network of human gene relationships from MEDLINE abstracts. These compiled relationships were then compared to gene expression cluster results. This approach gave a very interesting result: functionally related genes can show totally different expression patterns, and hence belong to different clusters (Jenssen et al.: A literature network of human genes for high-throughput analysis of gene expression, Nat. Genet., 28, 21-28, 2001).

Conclusions

Our functional keyword clustering/grouping of genes will make it possible to select functionally informative genes from among the differentially expressed genes for further investigation.

Our evaluation suggests that this approach provides more specific and useful information than typical approaches using abstract-level information. This is particularly the case when the sentence-level terms are augmented by MeSH and GO keywords.

Text mining is currently moving toward full-text articles. Because full text contains a large number of irrelevant sentences compared to abstracts, this approach is well suited to full-text studies, as it filters out irrelevant sentences before clustering.

Acknowledgments

Eric G. Bremer, Brain Tumor Research Program, Children’s Memorial Research Center, Chicago, IL, USA, and James R. van Brocklyn, Division of Neuropathology, Department of Pathology, The Ohio State University, Columbus, Ohio, USA, for the microarray data set

Dr. Daniel Berrar, Bioinformatics Research Group, University of Ulster, UK

Members of the Bioinformatics Centre, Madurai Kamaraj University, India

Dept. of Biotechnology, Govt. of India, for Bioinformatics facilities


THANK YOU
