Functional Gene Clustering via Gene Annotation Sentences, MeSH and GO Keywords from Biomedical Literature Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology Madurai Kamaraj University Madurai – 625021, INDIA


Post on 03-Feb-2016

Page 1: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Functional Gene Clustering via Gene Annotation Sentences, MeSH and GO Keywords from Biomedical Literature

Dr. N. JEYAKUMAR, M.Sc., Ph.D.
Bioinformatics Centre, School of Biotechnology
Madurai Kamaraj University
Madurai – 625021, INDIA

Page 2: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Purpose & Goals

- Extract gene-specific functional keywords from biological literature:
  - from full abstracts
  - from gene-specific sentences
- Augment the extracted keywords with MeSH and GO keywords related to each gene
- Compare the accuracy of the following keyword extraction methods on a test data set:
  - Full abstracts
  - Gene-specific sentences
  - Gene-specific sentences + MeSH keywords
  - Gene-specific sentences + MeSH and GO keywords
- Use the keyword extraction method to cluster the differentially expressed genes from a microarray experiment

Page 3: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Outline

Two parts:

Part I: Text mining and keyword extraction from literature
- Our text mining methodology

Part II: Applications to microarrays
- Functional keyword clustering of microarray data

Page 4: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Part I: Text Mining

Page 5: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Text Mining: Introduction and Overview

- Text mining aims to identify non-trivial, implicit, previously unknown, and potentially useful patterns in text (e.g., classification systems, association rules, hypotheses).
- It includes more established research areas such as information retrieval (IR), natural language processing (NLP), information extraction (IE), and traditional data mining (DM).
- It is relevant to bioinformatics because of:
  - the explosive growth of biomedical literature (e.g., MEDLINE – 15 million records)
  - the availability of some information in textual form only (e.g., clinical records)

Page 6: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Text Mining: System Architecture

Experimental design of gene clustering with sentence-level, MeSH and GO keywords

[Figure: system architecture diagram. Box labels: Microarray Experiment, Gene List, Gene/Protein Dictionary, MedLine Abstracts, Filtering, Set of Abstracts, Sentence Extraction, Keyword Extraction, MeSH/Gene Ontology, MeSH/GO Keyword Extraction, Feature Vector Generation, Clustering, Patterns, Visualization, Annotation.]

Page 7: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Text Mining: Keyword Extraction from Biomedical Literature

Steps to extract sentence-level keywords:

1. Gene-synonym dictionary. A dedicated gene-name synonym dictionary was created for human genes using Entrez Gene.

2. Gene-name normalization. All gene names in an abstract are replaced with their unique canonical identifier (Entrez Gene ID), using the gene-synonym dictionary constructed for this study.

3. Sentence filtering. Corpus-specific regular expressions select the gene-specific sentences. For example,

   ($gene @{0,6} $action (of|with) @{0,2} $gene)

   extracts sentences that match the structure shown below. The notational construct 'A B ...' is interpreted as 'A followed by B followed by ...'.

   gene name, 0-6 words, action verb, 'of' or 'with', 0-2 words, gene name

4. Keyword extraction (see the following slides).
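The normalization and sentence-filtering steps above can be sketched in Python. This is only an illustration under stated assumptions, not the study's implementation: the `GENE_<id>` token format, the `SYNONYMS` entries, and the `ACTION` word list are all hypothetical stand-ins, since the slides do not give the dictionary contents or the `$action` vocabulary. The `@{lo,hi}` construct is read here as "between lo and hi arbitrary words".

```python
import re

# Hypothetical synonym dictionary: surface name -> Entrez Gene ID (illustrative entries).
SYNONYMS = {"BRCA1": "672", "p53": "7157", "TP53": "7157"}

# Hypothetical action-word list; the real $action vocabulary is not given on the slide.
ACTION = r"(?:interaction|activation|inhibition|regulation|mediator)"
GENE = r"GENE_\d+"  # assumed token format after normalization


def normalize(text, synonyms=SYNONYMS):
    """Replace every known gene name with a canonical GENE_<id> token."""
    names = sorted(synonyms, key=len, reverse=True)  # try longest names first
    lookup = {name.lower(): gene_id for name, gene_id in synonyms.items()}
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b", re.IGNORECASE
    )
    return pattern.sub(lambda m: "GENE_" + lookup[m.group(1).lower()], text)


def gap(lo, hi):
    """Regex for between lo and hi arbitrary words, i.e. the slide's @{lo,hi}."""
    return r"(?:\S+\s+){%d,%d}" % (lo, hi)


# Python rendering of ($gene @{0,6} $action (of|with) @{0,2} $gene)
FILTER = re.compile(
    GENE + r"\s+" + gap(0, 6) + ACTION + r"\s+(?:of|with)\s+" + gap(0, 2) + GENE,
    re.IGNORECASE,
)


def filter_sentences(sentences, synonyms=SYNONYMS):
    """Normalize each sentence and keep those that match the filter pattern."""
    normalized = (normalize(s, synonyms) for s in sentences)
    return [s for s in normalized if FILTER.search(s)]
```

For instance, with a toy dictionary mapping `abi5` and `abi3` to placeholder IDs, the sentence "abi5 domains required for interaction with abi3" passes the filter, while a sentence naming only one gene does not.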

Page 8: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Text Mining: Keyword Extraction from Biomedical Literature

Table 1. Example regular expressions for nouns describing agents and actions, and for passive and active verbs

Nouns describing agents
  Expression pattern: ($gene (is)? (the|an|a) @{0,2} $action of @{0,2} $gene)
  Sentence output: IL6, a known mediator of STAT3 response

Nouns describing actions
  Expression pattern: ($gene @{0,6} $action (of|with) @{0,1} $gene)
  Sentence output: abi5 domains required for interaction with abi3

Passive verbs
  Expression pattern: ($gene @{0,6} (is|was|be|are|were) @{0,1} $action (by|via|through) @{0,3} $gene)
  Sentence output: Protein kinase C (PKC) has been shown to be activated by parathyroid hormone

Active verbs
  Expression pattern: ($gene $sub-action @{0,1} $action @{0,2} $gene)
  Sentence output: Insulin mediated inhibition of hormone-sensitive lipase activity

Page 9: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Text Mining: Keyword Extraction from Biomedical Literature

Keyword extraction example

Sentence:
  BRCA1 physically associates with p53 and stimulates its transcriptional activity.

Brill-POS-tagged sentence:
  BRCA1/NNP physically/RB associates/VBZ with/IN p53/NN and/CC stimulates/VBZ its/PRP$ transcriptional/JJ activity/NN ./.

Sentence keywords:
  associates, stimulates, transcription activity

Sentence keywords after manual curation:
  transcription activity
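One way to get from the tagged sentence to candidate keywords can be sketched as follows. The selection rule is an assumption, because the slides do not state it: verbs are kept as single keywords, runs of adjectives and nouns ending in a noun are kept as phrases, and gene names are skipped. The function names are hypothetical, and the sketch does not reproduce the slide's normalization of "transcriptional" to "transcription" or the manual curation step.

```python
def parse_tagged(tagged):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in tagged.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def candidate_keywords(tagged, gene_names):
    """Collect verbs and adjective/noun phrases, skipping gene names (assumed rule)."""
    genes = {g.lower() for g in gene_names}
    keywords, phrase = [], []

    def flush():
        # Keep the pending phrase only if it ends in a noun.
        if phrase and phrase[-1][1].startswith("NN"):
            keywords.append(" ".join(word for word, _ in phrase))
        phrase.clear()

    for word, tag in parse_tagged(tagged):
        if word.lower() in genes:
            flush()
        elif tag.startswith("VB"):
            flush()
            keywords.append(word)
        elif tag.startswith("JJ") or tag.startswith("NN"):
            phrase.append((word, tag))
        else:
            flush()
    flush()
    return keywords


TAGGED = (
    "BRCA1/NNP physically/RB associates/VBZ with/IN p53/NN and/CC "
    "stimulates/VBZ its/PRP$ transcriptional/JJ activity/NN ./."
)
print(candidate_keywords(TAGGED, {"BRCA1", "p53"}))
```

On the slide's example this yields the two verbs and the phrase "transcriptional activity", which corresponds to the uncurated keyword list shown above.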

Page 10: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Text Mining: MeSH Keyword Extraction

- MeSH keywords: subject index terms assigned to each article by the National Library of Medicine (NLM) for the purpose of subject indexing and searching journal articles via PubMed.
- MeSH keyword extraction: keywords are extracted directly from the gene-specific abstracts via Perl scripts.
- MeSH keyword curation: uninformative terms are removed using a MeSH stop-word dictionary (e.g., human, DNA, animal, Support U.S. Govt).

For example, the MeSH keywords associated with the gene 'FOS' in our gene list are 'oncogene, felypressin, transcription-factor, thermo-receptors, DNA-binding, antibiosis, inflammatory-response, zinc-fingers, gene-regulation, and neuronal-plasticity'.
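The extraction and stop-word curation described above were done with Perl scripts in the study; the following Python sketch only illustrates the idea. It assumes the abstracts are available in PubMed's MEDLINE text format, where each MeSH heading sits on a line starting with `MH  - `, and it uses a small hypothetical stop-word set rather than the study's dictionary.

```python
# Hypothetical stop-word set; the study's actual dictionary is not listed in full.
MESH_STOP_WORDS = {"humans", "animals", "dna"}


def mesh_keywords(medline_record, stop_words=MESH_STOP_WORDS):
    """Return the MeSH headings of one MEDLINE-format record, minus stop words."""
    keywords = []
    for line in medline_record.splitlines():
        if not line.startswith("MH  - "):
            continue
        # Drop subheadings after '/' and the '*' major-topic marker.
        heading = line[6:].split("/")[0].replace("*", "").strip().lower()
        if heading and heading not in stop_words and heading not in keywords:
            keywords.append(heading)
    return keywords


# Made-up record for illustration only.
RECORD = """PMID- 0000000
MH  - Humans
MH  - Transcription Factors/*metabolism
MH  - *Zinc Fingers
MH  - DNA
"""
print(mesh_keywords(RECORD))
```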

Page 11: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Text Mining: GO Keyword Extraction

- GO keywords: Gene Ontology (GO) is a hierarchical organization of gene and gene product terms from various databases, in which concepts at higher levels of the hierarchy are more general than those further down.
- GO keyword extraction: of the three GO annotation categories, we included only molecular function and biological process, and left out cellular component because it is less important for characterizing gene function.
- Further, because of the hierarchical nature of GO and the multiple inheritance in its structure, each annotated term is considered together with every ancestor up to level 2 of the GO tree.

For example, the GO keywords associated with the gene 'FOS' in our gene list are 'protein-dimerization, DNA binding, RNA polymerase, transcription factor, DNA methylation, and inflammatory-response'.
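The ancestor expansion can be sketched on a toy graph. Several points here are assumptions rather than details from the slides: the root is taken to be level 0, a term's level is its shortest distance from the root, and the `PARENTS` map is a simplified illustration that does not reproduce the actual GO structure.

```python
# Toy is-a graph (child -> parents); simplified, not the real GO hierarchy.
PARENTS = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "transcription regulator activity": ["molecular_function"],
    "nucleic acid binding": ["binding"],
    "DNA binding": ["nucleic acid binding"],
    "DNA-binding transcription factor activity": [
        "DNA binding",
        "transcription regulator activity",
    ],
}


def level(term, parents, cache):
    """Shortest distance from the root (assumed definition of 'level')."""
    if term not in cache:
        ps = parents.get(term, [])
        cache[term] = 0 if not ps else 1 + min(level(p, parents, cache) for p in ps)
    return cache[term]


def go_keywords(term, parents=PARENTS, min_level=2):
    """Return the term plus all of its ancestors at min_level or deeper."""
    seen, stack = set(), [term]
    while stack:  # walk every parent path; multiple inheritance is handled by 'seen'
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    cache = {}
    return sorted(t for t in seen if level(t, parents, cache) >= min_level)


print(go_keywords("DNA-binding transcription factor activity"))
```

Cutting the expansion at a fixed level keeps very general terms such as the root categories out of the keyword set, since they would be shared by almost every gene.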

Page 12: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Text Mining: Keyword Representation and Calculation of Numeric Vectors

This process computes a numeric weight, wij, for each gene-keyword pair (gi, tj) (i = 1, 2, ..., n and j = 1, 2, ..., k) to represent the gene's characteristics in terms of the associated keywords.

Common techniques for such numeric encoding include:

- Binary. The presence or absence of a keyword relative to a gene.
- Term frequency. The frequency of occurrence of a keyword with a gene.
- Term frequency / inverse document frequency (TF*IDF). The relative frequency of occurrence of a keyword with a gene compared to other genes.

Page 13: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Text Mining: TF*IDF Weighting

The most common weighting scheme in information retrieval and text classification is TF*IDF (term frequency / inverse document frequency).

- TF(w,d) (term frequency) is the number of times word w occurs in a document d.
- DF(w) (document frequency) is the number of documents in which the word w occurs at least once.
- The inverse document frequency is calculated as

$$\mathrm{IDF}(w) = \log\left(\frac{|D|}{\mathrm{DF}(w)}\right)$$

  where |D| is the total number of documents in the corpus.
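The TF, DF and IDF definitions above translate directly into a few lines of Python. This is a minimal sketch: the natural logarithm is an assumption (the slide does not give the base), and the function names and toy documents are illustrative.

```python
import math
from collections import Counter


def idf(documents):
    """IDF(w) = log(|D| / DF(w)) for every word in a list of tokenized documents."""
    n_docs = len(documents)
    df = Counter(word for doc in documents for word in set(doc))
    return {word: math.log(n_docs / count) for word, count in df.items()}


def tf_idf(documents):
    """Per-document dictionaries mapping word -> TF(w, d) * IDF(w)."""
    weights = idf(documents)
    return [
        {word: tf * weights[word] for word, tf in Counter(doc).items()}
        for doc in documents
    ]


# Toy corpus of three tokenized documents.
DOCS = [["dna", "binding", "dna"], ["binding", "kinase"], ["apoptosis", "binding"]]
for row in tf_idf(DOCS):
    print(row)
```

A word that occurs in every document, such as "binding" in the toy corpus, gets IDF = log(1) = 0 and therefore carries no weight.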

Page 14: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Text Mining: Keyword Representation and Calculation of Numeric Vectors

- In our study, because the keywords are extracted from gene-specific sentences rather than from full abstracts, the number of keywords associated with each gene is small.
- Further, the frequency of occurrence of most keywords tended to be one.
- Therefore, the binary encoding scheme was adopted, as illustrated in Table 2.

Table 2. Binary representation of gene * keywords

Genes / Terms   t1        t2        ...   tk
g1              w11 = 0   w12 = 1   ...   w1k = 1
g2              w21 = 1   w22 = 1   ...   w2k = 0
...             ...       ...       ...   ...
gn              wn1 = 0   wn2 = 0   ...   wnk = 1
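A binary gene-by-keyword matrix of the kind shown in Table 2 can be built as sketched below. The function name and the example keyword sets are illustrative only; the FOS entries are taken from the keyword examples on earlier slides, while the JUN entry is a made-up placeholder.

```python
def binary_matrix(gene_keywords):
    """Encode {gene: set of keywords} as a 0/1 matrix with genes as rows."""
    genes = sorted(gene_keywords)
    terms = sorted({term for kws in gene_keywords.values() for term in kws})
    matrix = [
        [1 if term in gene_keywords[gene] else 0 for term in terms]
        for gene in genes
    ]
    return genes, terms, matrix


# Illustrative input; the JUN keywords are a hypothetical placeholder.
EXAMPLE = {
    "FOS": {"transcription-factor", "zinc-fingers"},
    "JUN": {"transcription-factor"},
}
genes, terms, matrix = binary_matrix(EXAMPLE)
for gene, row in zip(genes, matrix):
    print(gene, row)
```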

Page 15: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Text Mining: Gene Clustering

- After encoding, the binary scheme adopted in this study yields numeric row vectors representing genes (via the associated biological functional keywords) and numeric column vectors representing annotation terms (via the associated genes).
- Clustering can produce useful and specific information about the biological characteristics of sets of genes.
- Clustering: partition unlabeled examples into disjoint subsets (clusters), such that:
  - examples within a cluster are very similar
  - examples in different clusters are very different
- Discover new categories in an unsupervised manner.

Page 16: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Text Mining: Test Set and Evaluation

A test set of 20 genes with 10 abstracts per gene, for a total of 200 abstracts in two cancer categories (Table 3), was used to evaluate the usefulness of our keyword extraction method.

Table 3. Test set of 20 human genes manually grouped into two cancer categories

Category        Genes
Brain Tumor     ADAM23, DKK1, IGF2, LRRC4, L3MBTL, MMP9, MSH2, PTPNS1, SFMBT1, ZIC1
Breast Cancer   AMPH, ATM, BRCA1, BRCA2, CHEK2, CDH1, PHB, TFF1, TSG101, XRCC3

Page 17: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Text Mining: Evaluation

1. Full abstract keywords (baseline). Extracts gene annotation terms based on term frequency * inverse document frequency (TF*IDF) within the entire abstract, without regard to sentence structure.

2. Sentence keywords. Extracts gene-specific keywords based on sentence-level processing.

3. Sentence + MeSH keywords. As in (2) above, plus MeSH terms (see MeSH Keyword Extraction).

4. Sentence + MeSH + GO keywords. As in (2) above, plus MeSH terms (see MeSH Keyword Extraction) and GO terms (see GO Keyword Extraction).

Page 18: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Text Mining: Evaluation

Results of various keyword extraction methods

Keyword Extraction Method        Precision   Recall   F-measure (%)
Abstract keywords (baseline)     0.31        0.24     27.05
Sentence keywords only           0.57        0.38     45.60
Sentence + MeSH keywords         0.64        0.47     54.19
Sentence + MeSH + GO keywords    0.78        0.72     74.88
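The F-measure column is consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R), which the following sketch checks against the reported figures. The function name and the `RESULTS` dictionary are illustrative; the slides do not say how precision and recall themselves were computed from the clusters, so that part is not reproduced here.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure in %) as given in the results table.
RESULTS = {
    "Abstract keywords (baseline)": (0.31, 0.24, 27.05),
    "Sentence keywords only": (0.57, 0.38, 45.60),
    "Sentence + MeSH keywords": (0.64, 0.47, 54.19),
    "Sentence + MeSH + GO keywords": (0.78, 0.72, 74.88),
}

for method, (p, r, reported) in RESULTS.items():
    print(f"{method}: computed {100 * f_measure(p, r):.2f}, reported {reported:.2f}")
```

The computed values agree with the reported ones to within 0.01 percentage points, with the small difference in the third row attributable to rounding.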

Page 19: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Part II: Applications to Microarrays

Functional keyword clustering of genes resulting from a microarray experiment

Page 20: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Applications to Microarrays: Data and Analysis

- As an illustrative example, our keyword extraction method was applied to the functional interpretation of clusters of genes that were found to be differentially expressed in a microarray experiment investigating the impact of two mitogens, epidermal growth factor (EGF) and sphingosine 1-phosphate (S1P), on glioblastoma cell lines.

- When compared to the resting state, 19 genes were significantly differentially expressed in response to EGF, 35 genes in response to S1P, and 30 genes in response to COM, i.e., the combined stimuli of S1P and EGF. The three gene lists are referred to as G(EGF), G(S1P) and G(COM), respectively (Table 4).

Page 21: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Applications to Microarrays: Data and Analysis

Table 4. List of differentially expressed genes

Gene list: G(EGF) (19 genes)
Genes: HRY, KLF2, ID1, JUN, DUSP6, IMPDH2, GP1BB, PNUTL1, CGI-96, CALD1, TRIM15, FOS, SPRY4, CLU, SLC5A3, MRPS6, ABCA1, OLFM1, PHLDA1

Gene list: G(S1P) (35 genes)
Genes: F3, NR4A1, KLF5, GADD45B, IL8, CITED2, CALD1, IL6, BCL6, LBH, HRB2, KIAA0992, NFKBIA, TNFAIP3, CCL2, DSCR1, TXNIP, NAB1, EHD1, GBP1, GLIPR1, MAP2K3, FZD7, RGS3, SOCS5, FOSL2, JAG1, DOC1, NRG1, BTG1, PDE4C, KIAA1718, KIAA0346, SFRS3, PLAU

Gene list: G(COM) (30 genes)
Genes: MAFF, DUSP5, EGR3, SERPINE1, ZFP36, DUSP1, LIF, DTR, MYC, GADD45B, RTP801, ATF3, JUNB, SNARK, WEE1, EGR2, TIEG, SPRY2, CEBPD, SGK, GEM, NEDD9, LDLR, EGR1, C8FW, UGCG, MCL1, ZYX, FOSL1, DIPA

Page 22: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Applications to Microarrays: Data and Analysis

- Using the three gene lists obtained from the microarray experiment (Table 4) as queries in MEDLINE returned three corresponding sets of abstracts, A(EGF), A(S1P) and A(COM), respectively (Table 5).
- The abstracts were processed with the keyword extraction method that augments sentence-level keywords with MeSH and GO keywords.
- The resulting keywords were encoded with the binary weighting scheme.
- The resulting representations were clustered using the average linkage hierarchical clustering algorithm.

Page 23: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Applications to Microarrays: Data and Analysis

Table 5. Three sets of abstracts, A(EGF), A(S1P), and A(COM), retrieved via MEDLINE for this study

Gene List   # of Genes in List   Retrieved Abstract Set   # of Abstracts in Set
G(EGF)      19                   A(EGF)                   28,913
G(S1P)      35                   A(S1P)                   19,705
G(COM)      30                   A(COM)                   39,890

Page 24: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Applications to Microarrays: Average Linkage Hierarchical Clustering Algorithm

- Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters.
- This is a compromise between single and complete link.
- The similarity is averaged across all ordered pairs in the merged cluster, instead of unordered pairs between the two clusters:

$$\mathrm{sim}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,\bigl(|c_i \cup c_j| - 1\bigr)} \sum_{x \in (c_i \cup c_j)} \; \sum_{y \in (c_i \cup c_j):\, y \neq x} \mathrm{sim}(x, y)$$
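The group-average rule described above can be sketched as a small agglomerative procedure over binary keyword vectors. Two things are assumptions here: the pairwise measure sim(x, y) is not named on the slides, so Jaccard similarity is used as a stand-in, and the function names and toy vectors are illustrative. Stopping at a fixed number of clusters k is also just one convenient way to cut the hierarchy.

```python
from itertools import combinations


def jaccard(x, y):
    """Jaccard similarity of two binary vectors (assumed pairwise measure)."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0


def group_average_sim(ci, cj, vectors, sim=jaccard):
    """Average sim over all ordered pairs (x, y), x != y, in the merged cluster."""
    merged = ci + cj  # ci and cj are disjoint lists of row indices
    n = len(merged)
    total = sum(sim(vectors[a], vectors[b]) for a in merged for b in merged if a != b)
    return total / (n * (n - 1))


def average_link_clusters(vectors, k, sim=jaccard):
    """Merge the most similar pair of clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: group_average_sim(clusters[p[0]], clusters[p[1]], vectors, sim),
        )
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return sorted(clusters)


# Four toy genes described by four binary keyword indicators.
VECTORS = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
print(average_link_clusters(VECTORS, 2))
```

On the toy vectors, the first two genes share two keywords and are merged first, after which the last two genes are more similar to each other than either is to the merged pair.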

Page 25: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Applications to Microarrays: Results

[Figure: Summary of analysis of EGF cluster (gene-by-keyword cluster map).
Genes: HRY, KLF2, ID1, JUN, DUSP6, IMPDH2, GP1BB, PNUTL1, CALD1, TRIM15, FOS, SPRY4, CLU, SLC5A3, MRPS6, ABCA1, OLFM1, PHLDA1.
Keywords: neural tube defects, transcription factor, cell death, embryogenesis, ion binding, angiogenesis, inhibition, embryonic development, trans-activators, zinc fingers, mitogenesis, assemble, secretion, biosynthesis, regulation, glycoprotein, androgens, odontogenesis, calmodulin-binding, desaturases, shape-regulation, relaxation, tumorigenesis, intracellular, atherogenesis, glutamine-transport, DNA-methylation, felypressin, transition, clustering, recombination, thermo-receptors, v-fos, fusion, sensation, immuno-reactivity, antibiosis, osteoblasts.]

Page 26: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Applications to Microarrays: Results

[Figure: Summary of analysis of S1P cluster (gene-by-keyword cluster map).
Genes: F3, NR4A1, KLF5, GADD45B, IL8, CITED2, CALD1, IL6, BCL6, HRB2, NFKBIA, TNFAIP3, CCL2, DSCR1, TXNIP, NAB1, EHD1, GBP1, GLIPR1, MAP2K3, FZD7, RGS3, SOCS5, FOSL2, JAG1, DOC1, NRG1, BTG1, PDE4C, SFRS3, PLAU.
Keywords: atherogenesis, mitogenesis, assemble, inflammation, angiogenesis, endocytosis, lymphocytes, pathogenesis, DNA-dependent, focal-contact, DNA-damage, splicing, G1 phase, extracellular, motility, protein-binding, cos-cells, myosin, RNA localization, dose-response, anticodon, cytotoxicity, parasitophorous, G protein, demyelination, cytolysis, Ca release, locomotion, homeostasis, circulation, phosphorylation, synthesis, repair, protein kinase, endothelialization, organogenesis, cell-adhesion, mutagenesis, immune-response.]

Page 27: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Applications to Microarrays: Results

[Figure: Summary of analysis of COM cluster (gene-by-keyword cluster map).
Genes: MAFF, DUSP5, EGR3, SERPINE1, ZFP36, DUSP1, LIF, DTR, MYC, GADD45B, RTP801, ATF3, JUNB, SNARK, WEE1, EGR2, TIEG, SPRY2, CEBPD, SGK, GEM, NEDD9, LDLR, EGR1, C8FW, UGCG, MCL1, ZYX, FOSL1, DIPA.
Keywords: DNA-binding, zinc fingers, repressor proteins, DNA-dependent, nucleus, transactivation, leucine zippers, transcription, gene expression regulation, oxidative stress, proto-oncogene, cell survival, signal transduction, maturation, endocytosis, differentiation, mitogenesis, mitosis, G2 phase, chemosensitivity, mutagenesis, lymph, angiogenesis, ion binding, RNA processing, G2-M transition, mRNA splicing, immortality, DNA recombination, microtubule, gene silencing, helix-loop-helix motifs, transcription factor, seizures, genome instability, DNA modification, DNA methylation, jun genes.]

Page 28: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Conclusions

- An important topic in microarray data mining is linking transcriptionally modulated genes to functional pathways, or determining how transcriptional modulation is associated with specific biological events such as genetic disease phenotypes, cell differentiation, etc.

- However, the amount of functional annotation available for each transcriptionally modulated gene is still a limiting factor, because not all genes are well annotated.

- Further, Jenssen et al. (2001) earlier compiled a network of human gene relationships from MEDLINE abstracts and compared these relationships with gene expression cluster results. This approach gave a very interesting result: functionally related genes can show totally different expression patterns, and hence belong to different clusters (Jenssen et al.: A literature network of human genes for high-throughput analysis of gene expression, Nat. Genet., 28, 21-28, 2001).

Page 29: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Conclusions

- Our functional keyword clustering/grouping of genes will make it possible to select functionally informative genes from among the differentially expressed genes for further investigation.

- Our evaluation suggests that this approach provides more specific and useful information than typical approaches using abstract-level information. This is particularly the case when the sentence-level terms are augmented by MeSH and GO keywords.

- The current trend in text mining is toward full-text mining. Because full text contains a large number of irrelevant sentences compared to abstracts, this approach is well suited to full-text studies, as it filters out irrelevant sentences before clustering.

Page 30: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology

Acknowledgments

- Eric G. Bremer, Brain Tumor Research Program, Children's Memorial Research Center, Chicago, IL, USA, and James R. van Brocklyn, Division of Neuropathology, Department of Pathology, The Ohio State University, Columbus, Ohio, USA, for the microarray data set
- Dr. Daniel Berrar, Bioinformatics Research Group, University of Ulster, UK
- Members of the Bioinformatics Centre, Madurai Kamaraj University, India
- Dept. of Biotechnology, Govt. of India, for Bioinformatics facilities

Page 31: Dr. N. JEYAKUMAR, M.Sc., Ph.D., Bioinformatics Centre School of Biotechnology


THANK YOU