CLUNCH
Applying NLP models to the Biological Domain
Eugen Buehler, Lyle Ungar
Overview
• “Languages” of Computers and Biology
• Probability Models for NL and Biology
• Maximum Entropy
• Basic ME amino acid model
• The “Whole Protein Model”
• Results in a gene prediction model
Bits and Bytes: The Alphabet of Computers
• Computer electronics are complicated: RAM, processor, etc.
• It all comes down to bits (1s and 0s).
• Bits can be organized into bytes (8).
• Bytes can represent, among other things, letters (ASCII), which can form sentences.
DNA: Biology’s Alphabet
• Biology is complicated.
• It comes down to nucleotides (A,C,G,T).
• Nucleotides can be grouped into codons.
• Codons represent amino acids; amino acids make proteins/genes.
Example: CTG TGC AGC GCU ATG → L C S A M
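The codon-to-amino-acid mapping in the example above can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard genetic code; the `translate` helper and the compact table layout are choices made for this sketch, not something taken from the talk.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(seq: str) -> str:
    """Translate a nucleotide string codon by codon (illustrative helper)."""
    seq = seq.upper().replace("U", "T")  # treat RNA 'U' as DNA 'T'
    usable = len(seq) - len(seq) % 3     # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[c] for c in codons)

print(translate("CTGTGCAGCGCUATG"))  # LCSAM
```

The `*` entries in the table mark stop codons.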
Find the words!
0101000110010100100100011010100011100101101101101001011101010101000001110101010100010001001110011101001100111001001110010100110010100100010010010001000100100010001001001100010001001100110010011101010100110011001001100101010001000110100100010000100100100010100100100010001101010100010101011100101011100011110001111000110011101001111101000011010000011110100111110010011000111100101111000111010101011001
Find the genes!
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
GACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTT
GCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGC
TGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACAACGT
TACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCT
GAGTCCACCCGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCG
CCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGC
TGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTCTATACCTGCGACCCGCGT
CAGGTGCCCGATGCGAGGTTGTTGAAGTCGATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTCGGCG
CTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCTTGCCTGATTAAAAATAC
CGGAAATCCTCAAGCACCAGGTACGCTCATTGGTGCCAGCCGTGATGAAGACGAATTACCGGTCAAGGGC
ATTTCCAATCTGAATAACATGGCAATGTTCAGCGTTTCTGGTCCGGGGATGAAAGGGATGGTCGGCATGG
CGGCGCGCGTCTTTGCAGCGATGTCACGCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGA
ATACAGCATCAGTTTCTGCGTTCCACAAAGCGACTGTGTGCGAGCTGAACGGGCAATGCAGGAAGAGTTC
NL and Biological Modeling
“Mary went to the ____ .”
MSGTIPSCPTAL ___
$p(w \mid h)$, e.g. $p(\text{store} \mid h)$
$p(a \mid h)$
$$P(\text{protein}) = \prod_{i=1}^{n} p(a_i \mid h_i)$$
Markov Models
$$p(w_n \mid h) \approx p(w_n \mid w_{n-1}, w_{n-2})$$
$$p(a_n \mid h) \approx p(a_n \mid a_{n-1}, a_{n-2})$$
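A Markov model of this kind conditions each symbol on only the two preceding symbols. Below is a minimal counting sketch with maximum-likelihood estimates and no smoothing. The function names and the second toy sequence are made up for illustration.

```python
from collections import Counter, defaultdict

def train_second_order(sequences):
    """Count next-symbol occurrences for each two-symbol context."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(2, len(seq)):
            counts[seq[i - 2:i]][seq[i]] += 1
    return counts

def prob(counts, context, symbol):
    """Maximum-likelihood estimate of p(symbol | context); 0.0 if context unseen."""
    total = sum(counts[context].values())
    return counts[context][symbol] / total if total else 0.0

# Toy data: the first sequence is the fragment from the slide, the second is invented.
counts = train_second_order(["MSGTIPSCPTAL", "MSGTVPS"])
print(prob(counts, "SG", "T"))  # 1.0
print(prob(counts, "GT", "I"))  # 0.5
```

A real model would need smoothing, since most two-symbol contexts are rare in small training sets.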
ME, In a Nutshell
• Constrain the model.
• Maximize entropy.
Constraining features
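• “is the” occurs with frequency 1/10000.
• Define a feature:
$$f_{\{\text{is},\,\text{the}\}}(h, w) = \begin{cases} 1 & \text{if } h \text{ ends in ``is'' and } w = \text{``the''} \\ 0 & \text{otherwise} \end{cases}$$
• Require that:
$$\sum_{h,w} p(h, w)\, f_{\{\text{is},\,\text{the}\}}(h, w) = \frac{1}{10000}$$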
Exponential Solution
• A unique solution exists with maximum entropy:
$$p_{ME}(w \mid h) = \frac{1}{Z(h)} \exp\left(\sum_{i=1}^{n} \lambda_i f_i(h, w)\right)$$
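The exponential form can be evaluated directly once features and weights are given. The sketch below assumes a three-word toy vocabulary, a single "is the" feature, and a hypothetical weight of 1.5. In practice the weights are fitted to the constraints, for example by iterative scaling, which is not shown here.

```python
import math

VOCAB = ["the", "a", "store"]  # toy vocabulary (assumption)

def f_is_the(history, word):
    """Bigram feature: 1 if the history ends in 'is' and the word is 'the'."""
    return 1.0 if history and history[-1] == "is" and word == "the" else 0.0

FEATURES = [f_is_the]
LAMBDAS = [1.5]  # hypothetical weight, not a fitted value

def p_me(word, history):
    """p(w | h) = exp(sum_i lambda_i * f_i(h, w)) / Z(h)."""
    def score(w):
        return math.exp(sum(lam * f(history, w) for lam, f in zip(LAMBDAS, FEATURES)))
    z = sum(score(w) for w in VOCAB)  # Z(h) normalizes over the vocabulary
    return score(word) / z

h = ["this", "is"]
print({w: round(p_me(w, h), 3) for w in VOCAB})
```

When no feature fires for a history, every score is `exp(0) = 1` and the model falls back to a uniform distribution over the vocabulary.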
Triggers
• Triggers – Words that increase the likelihood of other words.
Crop → Harvest
Cuban → Havana
Iran → Hashemi
Hate → Hate
Unigram and Bigram Caches
• Caches – frequency tables built from the history.
• Is “supercalifragilisticexpialidocious” a common word?
• Allow for model adaptation.
Applying ME Models in Computational Biology
• Significant improvement for NLP.
• Same for biological models?
• AA sequences: a simple test case.
Feature Sets
• Unigrams and Bigrams
• Self-triggers - frequency of a specific amino acid.
• Class-based self-triggers - frequency of a specific amino acid class.
• Unigram Cache - Amino acid frequency for this protein.
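The cache and self-trigger features in the list above both depend on how often an amino acid has already appeared in the protein. The sketch below is one plausible reading. The slides do not give the exact definitions, so the threshold-based `self_trigger` and its default of 0.1 are assumptions.

```python
from collections import Counter

def unigram_cache(history):
    """Relative frequency of each amino acid seen so far in this protein."""
    counts = Counter(history)
    total = len(history)
    return {aa: c / total for aa, c in counts.items()} if total else {}

def self_trigger(history, candidate, threshold=0.1):
    """1 if the candidate's frequency in the history exceeds the (assumed) threshold."""
    return 1 if unigram_cache(history).get(candidate, 0.0) > threshold else 0

history = "MSGTIPSCPTAL"
print(self_trigger(history, "S"))  # 1  (S is 2 of 12 residues)
print(self_trigger(history, "W"))  # 0  (W has not been seen)
```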
Training and testing data
• Burset et al. set of 571 proteins.
• Homologous proteins eliminated.
• Resulting set of 204 proteins split into 2 groups of 102 each.
Perplexity of Amino Acid Models
[Figure: perplexity (axis range 16.6 to 18.4) of the amino acid models on test data and training data]
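Perplexity is the exponentiated average negative log-probability the model assigns to each symbol, so lower is better. A minimal sketch, assuming the per-symbol probabilities have already been computed by some model:

```python
import math

def perplexity(probs):
    """Perplexity = 2 ** (-(1/N) * sum(log2 p_i)) over per-symbol probabilities."""
    n = len(probs)
    return 2 ** (-sum(math.log2(p) for p in probs) / n)

# A uniform model over the 20 amino acids has perplexity of about 20,
# which is the baseline any informative model should beat.
print(perplexity([1 / 20] * 50))
```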
Results
• “Long distance” features help.
• Best model gives a 30% reduction in perplexity over unigram reduction.
• Our model may improve predictions made by Genscan, a eukaryotic gene finding algorithm.
Limitations of this model
• Artificial model.
• Cannot represent all global features.
The “Whole Sentence” Model
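$$P_{WS}(s) = \frac{1}{Z} P_0(s) \exp\left(\sum_j \lambda_j f_j(s)\right)$$
$$P_0(s) = \prod_{n=1}^{|s|} p(w_n \mid h_n) = \prod_{n=1}^{|s|} \frac{1}{Z(h_n)} \exp\left(\sum_i \lambda_i f_i(h_n, w_n)\right)$$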
Secondary Structure
MAGTVTEAWDVAVFAARRRNDEDDTTRDSLFTYTNSNNTRGPFEGPNYHIAPRWVYNITSVWMIFVVIASIFTNGLVLVATAKFKKLRHPLNWILVNLAIADLGETVIASTISVINQISGYFILGHPMCVLEGYTVSTCGISALWSLAVISWERWVVVCKPFGNVKFDAKLAVAGIVFSWVWSAVWTAPPVFGWSRYWPHGLKTSCGPDVFSGSDDPGVLSYMIVLMITCCFIPLAVILLCYLQVWLAIRAVAAQQKESESTQKAEKEVSRMVVVMIIAYCFCWGPYTVFACFAAANPGYAFHPLAAALPAYFAKSATIYNPIIYVFMNRQFRNCIMQLFGKKVDDGSELSSTSRTEVSSVSNSSVSPA
-----HHHHHHHHHHH--------------EEE--------------------EEEE---EEEEEEEEEEEEH--HEEHHHHHHHH------HHHHHHHHH---HHEEEEEEEEEEE---EEE-----EEE----EEEE-EHEHHHHEHHHH-HEEEE---------HHHHHEHEEEEEEEHH-------------H------------E----------EEEEEEEEE------EHHHHHHHHHHHHHHHHHH-H----HHHHHHHHHHHHHEEEEEEEE-------EEEHHH----------HHHHHHHHHH--------EEEEEH-----HHHHHHH---------------EEEEE---------
“Whole Protein” Results
• 19 features evaluated
• Two were selected:
– Mean length of alpha helix region
– Maximum length of any structural region
• 59% increase in protein likelihood
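The two selected features are properties of the whole secondary-structure string rather than of any single position. The sketch below computes them from a string in the H / E / - notation used on the Secondary Structure slide. It assumes that H marks helix, E marks strand, and "-" is unstructured, and that "structural region" means a run of H or E. The function names and the short example string are illustrative.

```python
from itertools import groupby

def structural_regions(ss):
    """Split a secondary-structure string into (symbol, run length) pairs."""
    return [(sym, len(list(run))) for sym, run in groupby(ss)]

def mean_helix_length(ss):
    """Mean length of runs of 'H'; 0.0 if there are none."""
    helices = [n for sym, n in structural_regions(ss) if sym == "H"]
    return sum(helices) / len(helices) if helices else 0.0

def max_region_length(ss):
    """Longest run of 'H' or 'E' (runs of '-' are skipped, by assumption)."""
    return max((n for sym, n in structural_regions(ss) if sym != "-"), default=0)

# Short made-up example: helices of length 11 and 2, one strand of length 3.
ss = "-" * 5 + "H" * 11 + "-" * 8 + "E" * 3 + "-" * 3 + "H" * 2
print(mean_helix_length(ss), max_region_length(ss))  # 6.5 11
```

In a whole-protein model, values like these would serve as the features f_j(s) that reweight the baseline probability of the full sequence.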
Improved Glimmer Models
• Glimmer uses interpolated Markov models (IMMs) to predict genes in bacteria.
• Will adding amino acid triggers improve these models? How much?
H. pylori Genome
• 1562 Coding Sequences
• Split into:
– Training (>500bp) – 1154 genes, 1,354,167 bp
– Testing (<500bp) – 408 genes, 129,045 bp
Glimmer Depth
[Figure: change in PPC (axis range -0.1 to 1.5) plotted against change in model depth/features]
Lateral Gene Transfer
• Many genes in bacteria come not from their ancestors but from other bacterial species.
• Different bacteria “prefer” to use different codons.
• Analogous to plagiarism detection?
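Codon preference can be made concrete by comparing codon frequency profiles between genes. The sketch below is only an illustration of that idea, not the method from the talk. The distance measure and the two toy sequences are assumptions.

```python
from collections import Counter

def codon_usage(gene):
    """Relative frequency of each codon in a non-empty, in-frame coding sequence."""
    usable = len(gene) - len(gene) % 3  # ignore a trailing partial codon
    codons = [gene[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    return {c: n / len(codons) for c, n in counts.items()}

def usage_distance(u, v):
    """Sum of absolute frequency differences (one simple, assumed measure)."""
    return sum(abs(u.get(c, 0.0) - v.get(c, 0.0)) for c in set(u) | set(v))

# Both toy genes encode Leu-Leu-Leu-Lys but choose different leucine codons.
native = codon_usage("CTGCTGCTGAAA")
foreign = codon_usage("CTTCTTCTTAAA")
print(usage_distance(native, foreign))  # 1.5
```

A gene whose profile sits far from the rest of its genome would be a candidate for lateral transfer, in the same way an out-of-style passage is a candidate for plagiarism.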
Model Adaptation
• Gene models are trained for every organism.
• Lots of unused information
• Analogous to cross-domain application of NLP models.
Thanks
• Lyle Ungar
• Roni Rosenfeld
• NIH Grant
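N-Gram Features
• Unigram (frequency of individual words)
$$f_{w_1}(h, w) = \begin{cases} 1 & \text{if } w = w_1 \\ 0 & \text{otherwise} \end{cases}$$
• Bigram (frequency of pairs of words)
$$f_{\{w_1, w_2\}}(h, w) = \begin{cases} 1 & \text{if } h \text{ ends in } w_1 \text{ and } w = w_2 \\ 0 & \text{otherwise} \end{cases}$$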
Model with 44 features
Feature: A Parameter: 1.1283516023198754
Feature: V Parameter: 1.0452969493978286
Feature: L Parameter: 1.6105829266787726
Feature: I Parameter: 0.6972079195742815
Feature: P Parameter: 0.8475961041799492
Feature: W Parameter: 0.25978594248392567
Feature: F Parameter: 0.7344119941888887
Feature: M Parameter: 0.4493594016115246
Feature: G Parameter: 1.0298040447402417
Feature: S Parameter: 1.4262351925813321
Feature: T Parameter: 0.9964109799704354
Feature: Y Parameter: 0.5683263852913849
Feature: C Parameter: 0.4212841347252633
Feature: N Parameter: 0.714248861570727
Feature: Q Parameter: 1.0039856305693131
Feature: K Parameter: 1.0584367118507696
Feature: R Parameter: 1.0280086647521616
Feature: H Parameter: 0.45401904480206107
Feature: D Parameter: 0.7901584163893435
Feature: Self Trigger (FREQUENCY) : 1 Parameter: 1.4162727357372282
Feature: Self Trigger (FREQUENCY) : 2 Parameter: 1.0953472527852288
Feature: Self Trigger (FREQUENCY) : 3 Parameter: 0.9310253183080955
Feature: Self Trigger (FREQUENCY) : 4 Parameter: 5.441497902570791
Feature: Self Trigger (FREQUENCY) : 5 Parameter: 1.9764385350098654
Feature: Self Trigger (FREQUENCY) : A Parameter: 2.9856234039913274
Feature: Self Trigger (FREQUENCY) : V Parameter: 1.864040230431935
Feature: Self Trigger (FREQUENCY) : L Parameter: 2.9274415604129906
Feature: Self Trigger (FREQUENCY) : I Parameter: 2.4563140770039715
Feature: Self Trigger (FREQUENCY) : P Parameter: 3.7708734693326558
Feature: Self Trigger (FREQUENCY) : W Parameter: 2.672577458866343
Feature: Self Trigger (FREQUENCY) : F Parameter: 1.479212103590432
Feature: Self Trigger (FREQUENCY) : M Parameter: 0.5107656797047934
Feature: Self Trigger (FREQUENCY) : G Parameter: 4.495511648228042
Feature: Self Trigger (FREQUENCY) : S Parameter: 5.91039344990589
Feature: Self Trigger (FREQUENCY) : T Parameter: 2.449321508559543
Feature: Self Trigger (FREQUENCY) : Y Parameter: 2.3542114958521925
Feature: Self Trigger (FREQUENCY) : C Parameter: 82.68056436437357
Feature: Self Trigger (FREQUENCY) : N Parameter: 2.4258773271617287
Feature: Self Trigger (FREQUENCY) : Q Parameter: 14.611485492431102
Feature: Self Trigger (FREQUENCY) : K Parameter: 3.1913667655121665
Feature: Self Trigger (FREQUENCY) : R Parameter: 17.76347525956296
Feature: Self Trigger (FREQUENCY) : H Parameter: 2.6972280092545198
Feature: Self Trigger (FREQUENCY) : D Parameter: 1.5621090399310904
Feature: Self Trigger (FREQUENCY) : E Parameter: 34.508027837307324
Correction Parameter: 0.9499284545145702
Iteration 26
Perplexity Training Set = 17.48022432377071
Perplexity of Test Set = 17.94122518955002
26 17.48022432377071 17.94122518955002
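Trigger feature function
$$f_{\text{bank} \to \text{teller}}(h, w) = \begin{cases} 1 & \text{if ``bank''} \in h \text{ and } w = \text{``teller''} \\ 0 & \text{otherwise} \end{cases}$$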
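Unlike a bigram feature, a trigger feature fires when the trigger word appears anywhere in the history, not only in the last position. A minimal sketch, assuming the history is a list of tokens; the `make_trigger` factory is a name introduced here for illustration.

```python
def make_trigger(trigger_word, target_word):
    """Build f(h, w): 1 if trigger_word occurs anywhere in h and w == target_word."""
    def feature(history, word):
        return 1 if trigger_word in history and word == target_word else 0
    return feature

f_bank_teller = make_trigger("bank", "teller")
print(f_bank_teller(["the", "bank", "hired", "a"], "teller"))  # 1
print(f_bank_teller(["the", "shop", "hired", "a"], "teller"))  # 0
```

A self-trigger such as Hate → Hate is the special case `make_trigger(x, x)`.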