CLUNCH Applying NLP models to the Biological Domain Eugen Buehler Lyle Ungar

Upload: barry-bradford

Post on 16-Dec-2015


TRANSCRIPT

Page 1: CLUNCH Applying NLP models to the Biological Domain Eugen Buehler Lyle Ungar

CLUNCH

Applying NLP models to the Biological Domain

Eugen Buehler
Lyle Ungar

Page 2

Overview

• “Languages” of Computers and Biology

• Probability Models for NL and Biology

• Maximum Entropy
• Basic ME amino acid model
• The “Whole Protein Model”
• Results in a gene prediction model

Page 3

Bits and Bytes: The Alphabet of Computers

• Computer electronics are complicated: RAM, processor, etc.

• It all comes down to bits (1s and 0s).

• Bits can be organized into bytes (8 bits each).

• Bytes can represent, among other things, letters (ASCII), which can form sentences.
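A minimal Python sketch of the chain described on this slide, from letters to bytes to bits and back. The function names `text_to_bits` and `bits_to_text` are illustrative and not from the slides, and the sketch assumes plain ASCII input.

```python
def text_to_bits(text: str) -> str:
    """Encode ASCII text as a string of 0s and 1s, 8 bits per character."""
    return "".join(format(byte, "08b") for byte in text.encode("ascii"))


def bits_to_text(bits: str) -> str:
    """Decode a bit string back to text, reading one byte (8 bits) at a time."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("ascii")


bits = text_to_bits("Hi")
print(bits)                # 0100100001101001
print(bits_to_text(bits))  # Hi
```

Without the byte boundaries, the bit string alone does not reveal where one letter ends and the next begins, which is the difficulty the "Find the words!" slide illustrates.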

Page 4

Page 5

DNA: Biology’s Alphabet

• Biology is complicated.

• It comes down to nucleotides (A, C, G, T).

• Nucleotides can be grouped into codons.

• Codons represent amino acids; amino acids make proteins/genes.

 L   C   S   A   M
CTG TGC AGC GCT ATG
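The codon-to-amino-acid mapping can be sketched in Python as below. The table is the standard genetic code with codons enumerated in T, C, A, G order; the function name `translate` and the choice to ignore trailing bases that do not fill a codon are assumptions made for this illustration. The example string is the slide's five codons written in the DNA alphabet.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA string codon by codon, starting at position 0.

    Trailing bases that do not fill a whole codon are ignored.
    Characters outside A, C, G, T raise a KeyError.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("CTGTGCAGCGCTATG"))  # LCSAM
```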

Page 6

Find the words!

0101000110010100100100011010100011100101101101101001011101010101000001110101010100010001001110011101001100111001001110010100110010100100010010010001000100100010001001001100010001001100110010011101010100110011001001100101010001000110100100010000100100100010100100100010001101010100010101011100101011100011110001111000110011101001111101000011010000011110100111110010011000111100101111000111010101011001

Page 7

Find the genes!

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
GACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTT
GCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGC
TGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACAACGT
TACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCT
GAGTCCACCCGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCG
CCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGC
TGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTCTATACCTGCGACCCGCGT
CAGGTGCCCGATGCGAGGTTGTTGAAGTCGATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTCGGCG
CTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCTTGCCTGATTAAAAATAC
CGGAAATCCTCAAGCACCAGGTACGCTCATTGGTGCCAGCCGTGATGAAGACGAATTACCGGTCAAGGGC
ATTTCCAATCTGAATAACATGGCAATGTTCAGCGTTTCTGGTCCGGGGATGAAAGGGATGGTCGGCATGG
CGGCGCGCGTCTTTGCAGCGATGTCACGCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGA
ATACAGCATCAGTTTCTGCGTTCCACAAAGCGACTGTGTGCGAGCTGAACGGGCAATGCAGGAAGAGTTC

Page 8

NL and Biological Modeling

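“Mary went to the ____ .”

MSGTIPSCPTAL ___

[Figure labels: h is the history seen so far, w is the next word, a is the next amino acid; the model assigns probabilities such as p(store | h).]

$$P(\text{protein}) = \prod_{i=1}^{n} p(a_i \mid h_i)$$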

Page 9

Markov Models

$$p(w_n \mid h) \approx p(w_n \mid w_{n-2}, w_{n-1})$$

$$p(a_n \mid h) \approx p(a_n \mid a_{n-2}, a_{n-1})$$
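A small Python sketch of a second-order Markov model over amino acids, where the next residue depends only on the previous two. It uses unsmoothed maximum-likelihood counts, which is a simplification chosen for this illustration; the function names and the second training sequence are hypothetical.

```python
from collections import Counter, defaultdict


def train_markov(sequences, order=2):
    """Count how often each symbol follows each context of length `order`."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts


def prob(counts, history, symbol, order=2):
    """Estimate p(symbol | last `order` symbols of history); 0.0 for unseen contexts."""
    context = history[-order:]
    total = sum(counts[context].values())
    return counts[context][symbol] / total if total else 0.0


# First sequence is the fragment from the slides; the second is made up.
TRAINING = ["MSGTIPSCPTAL", "MSGTAPS"]
model = train_markov(TRAINING)
print(prob(model, "MSGT", "I"))  # 0.5: "GT" is followed once by I and once by A
print(prob(model, "MS", "G"))    # 1.0: "MS" is always followed by G
```

Because the context is a fixed-length window, information from earlier in the sequence is discarded; the trigger and cache features discussed later are one way to bring some of it back.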

Page 10

ME, In a Nutshell

• Constrain the model.
• Maximize entropy.

Page 11

Constraining features

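• “is the” occurs with frequency 1/10000.

• Define a feature:

$$f_{\{is,\,the\}}(h, w) = \begin{cases} 1 & \text{if } h \text{ ends in ``is'' and } w = \text{``the''} \\ 0 & \text{otherwise} \end{cases}$$

• Require that:

$$\sum_{h, w} p(h, w)\, f_{\{is,\,the\}}(h, w) = \frac{1}{10000}$$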
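The target value in a constraint like this one is usually taken from training data as the empirical expectation of the feature. The sketch below computes that expectation on a toy corpus; the corpus, the function names, and the representation of a history as a list of preceding tokens are assumptions for illustration only.

```python
def f_is_the(history, word):
    """Feature from the slide: 1 if the history ends in "is" and the word is "the"."""
    return 1 if history and history[-1] == "is" and word == "the" else 0


def empirical_expectation(feature, tokens):
    """Average value of the feature over every (history, word) event in the corpus."""
    events = [(tokens[:i], tokens[i]) for i in range(len(tokens))]
    return sum(feature(h, w) for h, w in events) / len(events)


# Hypothetical toy corpus; the slide's 1/10000 would come from a real corpus.
tokens = "this is the store and that is the bank".split()
print(empirical_expectation(f_is_the, tokens))  # 2 of 9 events fire the feature
```

The maximum entropy model is then required to reproduce this expectation, while remaining as uniform as possible in every other respect.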

Page 12

Exponential Solution

• A unique solution exists with maximum entropy:

$$p_{ME}(w \mid h) = \frac{1}{Z(h)} \exp\left( \sum_{i=1}^{n} \lambda_i f_i(h, w) \right)$$
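To make the exponential form concrete, the sketch below evaluates it for a three-word vocabulary with a single feature. The vocabulary, the weight value, and the function names are hypothetical; in practice the weights are fitted by an iterative procedure, which is not shown here.

```python
import math


def f_is_the(history, word):
    """Bigram feature: the history ends in "is" and the word is "the"."""
    return 1 if history and history[-1] == "is" and word == "the" else 0


def p_me(word, history, vocab, features, lambdas):
    """Conditional model: exp(sum_i lambda_i * f_i(h, w)) / Z(h)."""
    def unnormalized(w):
        return math.exp(sum(lam * f(history, w) for f, lam in zip(features, lambdas)))

    z = sum(unnormalized(w) for w in vocab)  # Z(h) sums over the whole vocabulary
    return unnormalized(word) / z


VOCAB = ["the", "a", "store"]   # hypothetical vocabulary
FEATURES = [f_is_the]
LAMBDAS = [math.log(4.0)]       # hypothetical weight, not a fitted value

print(p_me("the", ["Mary", "is"], VOCAB, FEATURES, LAMBDAS))    # about 0.667
print(p_me("the", ["Mary", "went"], VOCAB, FEATURES, LAMBDAS))  # about 0.333
```

When no feature fires, every word gets the same score and the distribution is uniform, which is the maximum entropy behaviour in the absence of constraints.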

Page 13

Triggers

• Triggers – Words that increase the likelihood of other words.

Crop → Harvest
Cuban → Havana
Iran → Hashemi
Hate → Hate

Page 14

Unigram and Bigram Caches

• Caches – frequency tables built from the history.

• Is “supercalifragilisticexpialidocious” a common word?

• Allow for model adaptation.
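A unigram cache can be sketched as a frequency table that is updated as the history grows. The class name `UnigramCache` and its methods are illustrative; how the cache estimate is combined with the rest of the model is not shown.

```python
from collections import Counter


class UnigramCache:
    """Frequency table built from the symbols seen so far in the history."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def observe(self, symbol):
        """Record one more symbol from the history."""
        self.counts[symbol] += 1
        self.total += 1

    def prob(self, symbol):
        """Relative frequency of the symbol in the history (0.0 if the cache is empty)."""
        return self.counts[symbol] / self.total if self.total else 0.0


cache = UnigramCache()
for word in "supercalifragilisticexpialidocious is a long word".split():
    cache.observe(word)
print(cache.prob("supercalifragilisticexpialidocious"))  # 0.2
```

A word that is rare in general becomes likely once it has appeared in the current document, which is how the cache lets the model adapt.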

Page 15

Applying ME Models in Computational Biology

• Significant improvement for NLP.
• Same for biological models?
• AA sequences: a simple test case.

Page 16

Feature Sets

• Unigrams and Bigrams

• Self-triggers - frequency of a specific amino acid.

• Class based self-triggers - frequency of a specific amino acid class.

• Unigram Cache - Amino acid frequency for this protein.

Page 17

Training and testing data

• Burset et al. set of 571 proteins.
• Homologous proteins eliminated.
• Resulting set of 204 proteins split into 2 groups of 102 each.

Page 18

Perplexity of Amino Acid Models

[Figure: perplexity (y-axis, ticks from 16.6 to 18.4) of each amino acid model, with separate series for test data and training data.]
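Perplexity, the measure reported on this slide, is 2 raised to the average negative log2 probability that a model assigns to each symbol. The sketch below implements that standard definition; the function name and the input format (a list of per-symbol probabilities) are assumptions for illustration.

```python
import math


def perplexity(probabilities):
    """Perplexity of a model given the probability it assigned to each observed symbol.

    A probability of 0 raises ValueError, since the perplexity would be infinite.
    """
    if not probabilities:
        raise ValueError("need at least one probability")
    log_sum = sum(math.log2(p) for p in probabilities)
    return 2 ** (-log_sum / len(probabilities))


# A uniform model over the 20 amino acids assigns 1/20 to every residue.
print(perplexity([1 / 20] * 50))  # approximately 20
```

A uniform model over the 20 amino acids therefore has perplexity 20, which gives a reference point for reading the values on the chart.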

Page 19

Results

• “Long distance” features help.
• Best model gives a 30% reduction in perplexity over unigram reduction.
• Our model may improve predictions made by Genscan, a eukaryotic gene finding algorithm.

Page 20

Limitations of this model

• Artificial model.
• Cannot represent all global features.

Page 21

The “Whole Sentence” Model

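$$P_{WS}(s) = \frac{1}{Z}\, P_0(s) \exp\left( \sum_{j} \lambda_j f_j(s) \right)$$

$$P_0(s) = \prod_{n=1}^{|s|} p(w_n \mid h_n) = \prod_{n=1}^{|s|} \frac{1}{Z(h_n)} \exp\left( \sum_{i} \lambda_i f_i(h_n, w_n) \right)$$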

Page 22

Secondary Structure

MAGTVTEAWDVAVFAARRRNDEDDTTRDSLFTYTNSNNTRGPFEGPNYHIAPRWVYNITSVWMIFVVIASIFTNGLVLVATAKFKKLRHPLNWILVNLAIADLGETVIASTISVINQISGYFILGHPMCVLEGYTVSTCGISALWSLAVISWERWVVVCKPFGNVKFDAKLAVAGIVFSWVWSAVWTAPPVFGWSRYWPHGLKTSCGPDVFSGSDDPGVLSYMIVLMITCCFIPLAVILLCYLQVWLAIRAVAAQQKESESTQKAEKEVSRMVVVMIIAYCFCWGPYTVFACFAAANPGYAFHPLAAALPAYFAKSATIYNPIIYVFMNRQFRNCIMQLFGKKVDDGSELSSTSRTEVSSVSNSSVSPA

-----HHHHHHHHHHH--------------EEE--------------------EEEE---EEEEEEEEEEEEH--HEEHHHHHHHH------HHHHHHHHH---HHEEEEEEEEEEE---EEE-----EEE----EEEE-EHEHHHHEHHHH-HEEEE---------HHHHHEHEEEEEEEHH-------------H------------E----------EEEEEEEEE------EHHHHHHHHHHHHHHHHHH-H----HHHHHHHHHHHHHEEEEEEEE-------EEEHHH----------HHHHHHHHHH--------EEEEEH-----HHHHHHH---------------EEEEE---------

Page 23

“Whole Protein” Results

• 19 features evaluated
• Two were selected:
  – Mean length of alpha helix region
  – Maximum length of any structural region
• 59% increase in protein likelihood
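The two selected features can be computed from a secondary-structure string like the one on the previous slide. The sketch below assumes that `H` marks helix, `E` marks strand, and `-` marks unstructured positions, and that a "structural region" means a run of `H` or `E`; both readings are assumptions, as are the function names.

```python
from itertools import groupby


def regions(structure):
    """Split a structure string into (label, run_length) pairs."""
    return [(label, len(list(run))) for label, run in groupby(structure)]


def mean_helix_length(structure):
    """Mean length of the helix (H) regions, or 0.0 if there are none."""
    lengths = [n for label, n in regions(structure) if label == "H"]
    return sum(lengths) / len(lengths) if lengths else 0.0


def max_region_length(structure):
    """Length of the longest H or E region, or 0 if there are none."""
    lengths = [n for label, n in regions(structure) if label in "HE"]
    return max(lengths, default=0)


example = "--HHHH--EEE-HH"  # hypothetical short structure string
print(mean_helix_length(example))  # 3.0
print(max_region_length(example))  # 4
```

Both values depend on the entire protein, so they cannot be expressed as features of a single residue and its history; this is what the whole-sequence model makes possible.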

Page 24

Improved Glimmer Models

• Glimmer used interpolated Markov models (IMMs) to predict genes in bacteria.

• Will adding amino acid triggers improve these models? How much?

Page 25

H. pylori Genome

• 1562 Coding Sequences
• Split into:

– Training (>500bp) – 1154 genes, 1,354,167 bp

– Testing (<500bp) – 408 genes, 129,045 bp

Page 26

Glimmer Depth

[Figure: change in PPC (y-axis, ticks from -0.1 to 1.5) plotted against change in model depth/features.]

Page 27

Lateral Gene Transfer

• Many genes in bacteria come not from their ancestors but from other bacterial species.

• Different bacteria “prefer” to use different codons.

• Analogous to plagiarism detection?
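One simple way to see a codon "preference" is to compare codon frequency tables between two sequences. The sketch below uses the sum of absolute frequency differences as the comparison; that distance, the function names, and the toy sequences are assumptions for illustration and not a method described in the slides.

```python
from collections import Counter


def codon_frequencies(gene):
    """Relative frequency of each codon, reading the gene in frame from position 0."""
    usable = len(gene) - len(gene) % 3
    counts = Counter(gene[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def usage_distance(freq_a, freq_b):
    """Sum of absolute differences between two codon frequency tables."""
    codons = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(c, 0.0) - freq_b.get(c, 0.0)) for c in codons)


# Hypothetical toy genes: CTG and TTA both encode leucine.
native = codon_frequencies("CTGCTGCTG")
candidate = codon_frequencies("TTATTACTG")
print(usage_distance(native, candidate))  # about 1.333
```

Two genes can encode the same amino acids and still differ sharply in codon usage, which is the stylistic signal the plagiarism analogy refers to.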

Page 28

Model Adaptation

• Gene models are trained for every organism.

• Lots of unused information
• Analogous to cross-domain application of NLP models.

Page 29

Thanks

• Lyle Ungar
• Roni Rosenfeld
• NIH Grant

Page 30

Page 31

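N-Gram Features

• Unigram (frequency of individual words)

$$f_{w_1}(h, w) = \begin{cases} 1 & \text{if } w = w_1 \\ 0 & \text{otherwise} \end{cases}$$

• Bigram (frequency of pairs of words)

$$f_{\{w_1, w_2\}}(h, w) = \begin{cases} 1 & \text{if } h \text{ ends in } w_1 \text{ and } w = w_2 \\ 0 & \text{otherwise} \end{cases}$$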
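The unigram and bigram indicator features can be written as small closures. In this sketch the history is any sequence whose last element is the previous symbol, so a list of words or a string of amino acids both work; the factory function names are illustrative.

```python
def unigram_feature(w1):
    """Feature that fires whenever the predicted symbol is w1."""
    def f(history, word):
        return 1 if word == w1 else 0
    return f


def bigram_feature(w1, w2):
    """Feature that fires when the history ends in w1 and the predicted symbol is w2."""
    def f(history, word):
        return 1 if history and history[-1] == w1 and word == w2 else 0
    return f


f_the = unigram_feature("the")
f_is_the = bigram_feature("is", "the")
print(f_the(["Mary", "went", "to"], "the"))  # 1
print(f_is_the(["this", "is"], "the"))       # 1
print(f_is_the(["Mary", "went"], "the"))     # 0

# The same factories apply to amino acids, with a string as the history.
f_gt = bigram_feature("G", "T")
print(f_gt("MSG", "T"))                      # 1
```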

Page 32

Model with 44 features
Feature: A Parameter: 1.1283516023198754
Feature: V Parameter: 1.0452969493978286
Feature: L Parameter: 1.6105829266787726
Feature: I Parameter: 0.6972079195742815
Feature: P Parameter: 0.8475961041799492
Feature: W Parameter: 0.25978594248392567
Feature: F Parameter: 0.7344119941888887
Feature: M Parameter: 0.4493594016115246
Feature: G Parameter: 1.0298040447402417
Feature: S Parameter: 1.4262351925813321
Feature: T Parameter: 0.9964109799704354
Feature: Y Parameter: 0.5683263852913849
Feature: C Parameter: 0.4212841347252633
Feature: N Parameter: 0.714248861570727
Feature: Q Parameter: 1.0039856305693131
Feature: K Parameter: 1.0584367118507696
Feature: R Parameter: 1.0280086647521616
Feature: H Parameter: 0.45401904480206107
Feature: D Parameter: 0.7901584163893435
Feature: Self Trigger (FREQUENCY) : 1 Parameter: 1.4162727357372282
Feature: Self Trigger (FREQUENCY) : 2 Parameter: 1.0953472527852288
Feature: Self Trigger (FREQUENCY) : 3 Parameter: 0.9310253183080955
Feature: Self Trigger (FREQUENCY) : 4 Parameter: 5.441497902570791
Feature: Self Trigger (FREQUENCY) : 5 Parameter: 1.9764385350098654
Feature: Self Trigger (FREQUENCY) : A Parameter: 2.9856234039913274
Feature: Self Trigger (FREQUENCY) : V Parameter: 1.864040230431935
Feature: Self Trigger (FREQUENCY) : L Parameter: 2.9274415604129906
Feature: Self Trigger (FREQUENCY) : I Parameter: 2.4563140770039715
Feature: Self Trigger (FREQUENCY) : P Parameter: 3.7708734693326558
Feature: Self Trigger (FREQUENCY) : W Parameter: 2.672577458866343
Feature: Self Trigger (FREQUENCY) : F Parameter: 1.479212103590432
Feature: Self Trigger (FREQUENCY) : M Parameter: 0.5107656797047934
Feature: Self Trigger (FREQUENCY) : G Parameter: 4.495511648228042
Feature: Self Trigger (FREQUENCY) : S Parameter: 5.91039344990589
Feature: Self Trigger (FREQUENCY) : T Parameter: 2.449321508559543
Feature: Self Trigger (FREQUENCY) : Y Parameter: 2.3542114958521925
Feature: Self Trigger (FREQUENCY) : C Parameter: 82.68056436437357
Feature: Self Trigger (FREQUENCY) : N Parameter: 2.4258773271617287
Feature: Self Trigger (FREQUENCY) : Q Parameter: 14.611485492431102
Feature: Self Trigger (FREQUENCY) : K Parameter: 3.1913667655121665
Feature: Self Trigger (FREQUENCY) : R Parameter: 17.76347525956296
Feature: Self Trigger (FREQUENCY) : H Parameter: 2.6972280092545198
Feature: Self Trigger (FREQUENCY) : D Parameter: 1.5621090399310904
Feature: Self Trigger (FREQUENCY) : E Parameter: 34.508027837307324
Correction Parameter: 0.9499284545145702
Iteration 26
Perplexity Training Set = 17.48022432377071
Perplexity of Test Set = 17.94122518955002
26 17.48022432377071 17.94122518955002

Page 33

Trigger feature function

$$f_{bank \to teller}(h, w) = \begin{cases} 1 & \text{if ``bank''} \in h \text{ and } w = \text{``teller''} \\ 0 & \text{otherwise} \end{cases}$$
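A trigger feature differs from a bigram feature in that the triggering word may appear anywhere in the history. The sketch below assumes the history is a list of tokens (with a plain string, `in` would test for a substring instead); the factory name `trigger_feature` is illustrative.

```python
def trigger_feature(trigger, target):
    """Feature that fires when `trigger` occurs anywhere in the history
    and the predicted word is `target`."""
    def f(history, word):
        return 1 if trigger in history and word == target else 0
    return f


f_bank_teller = trigger_feature("bank", "teller")
history = "she walked into the bank and asked the".split()
print(f_bank_teller(history, "teller"))                 # 1
print(f_bank_teller(["she", "asked", "the"], "teller"))  # 0

# A self-trigger uses the same word as trigger and target.
f_hate_hate = trigger_feature("hate", "hate")
print(f_hate_hate(["they", "hate", "to"], "hate"))       # 1
```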