Bioinformatics The Prediction of Life

Tony C Smith
Department of Computer Science
University of Waikato
[email protected]

Page 1: Bioinformatics The Prediction of Life

Bioinformatics

The Prediction of Life

Tony C Smith

Department of Computer Science

University of Waikato

[email protected]

Page 2: Bioinformatics The Prediction of Life

Bioinformatics Tony C Smith

Bioinformatics

Computation with biological data

Data: genes, proteins, microarrays, mass spectra, written documents, populations of organisms …

Goal: knowledge discovery

Page 3: Bioinformatics The Prediction of Life

The essence is prediction …

My dog is very littl_ ?

We know that letters do not occur in English at random; not all letters are equally common (e.g. ‘e’ is more common than ‘x’)

We know that context changes the probability of a letter (e.g. what’s the most likely letter after the sequence “I eat Weet-Bi_”)

Prediction is important in many applications (e.g. encryption, compression, communication, graphics, simulation … and bioinformatics!)
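The idea that context changes the probability of the next letter can be sketched with a small character-level model. This is a minimal illustration, not the method used in the talk: the function names `train` and `predict`, the context length of three characters, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def train(text, order=3):
    """Count which character follows each context of `order` characters."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    return counts


def predict(counts, context, order=3):
    """Return the most frequent follower of the last `order` characters, or None."""
    followers = counts.get(context[-order:])
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy corpus (hypothetical); a real model would be trained on far more text.
corpus = "my dog is very little. a little dog is a little friend. "
model = train(corpus, order=3)
guess = predict(model, "my dog is very littl", order=3)
print(guess)
```

A longer context usually sharpens the prediction but needs more training text before each context has been seen often enough to be reliable.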

Page 4: Bioinformatics The Prediction of Life

Prediction in bioinformatics

Predicting the location of genes in DNA

Predicting the function of proteins

Predicting diseases from molecular samples

Predicting population dynamics

Anything that involves “making a judgment”; typically expressible as a yes/no decision about some sample datum

Page 5: Bioinformatics The Prediction of Life

Representation

W e e t – B i x

0101011101100101011001010111010000101101 …

… to the computer, everything is binary!
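The bit string on this slide is consistent with the 8-bit ASCII codes of the first five characters, "Weet-" (with a plain hyphen). A short sketch, assuming ASCII encoding and using a hypothetical helper name `to_bits`:

```python
def to_bits(text):
    """Encode text as ASCII and write each byte as eight binary digits."""
    return "".join(format(byte, "08b") for byte in text.encode("ascii"))


bits = to_bits("Weet-")
print(bits)
```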

Page 6: Bioinformatics The Prediction of Life

0101011101100101011001010111010000101101

0101101100100111111011010011010000101101

A A C G T C A T T C G A T G A T T C G A

Just as we can teach a computer to predict things about a sequence of letters in English prose, we can also teach it to predict things about other sequences, such as a genetic sequence.

Page 7: Bioinformatics The Prediction of Life

A genetic prediction problem

ttgcaatcggcgctacgcttcaaaatttattatattcccggc
gcggctacgttcatcccagcagcagcgattttaaaattaa
cgcatcagactctcgtcgcgttcgtcgcctttattcacgcta
atggacgacatcttttactacgacggcgcctacgcatcg
cagcatacgacgcccagcatagtattttagaggcgagg
acatcatcatatcgcagctacagcgcatcagacgcata
cgacgacgactacgacgacactaacgacgatgttgcgc
acccacaccagttatatagagacgaactcgcatcagc

Page 8: Bioinformatics The Prediction of Life

A genetic prediction problem

[Slide shows a full page of nucleotide sequence in which the fragment from the previous slide recurs many times.]

Page 9: Bioinformatics The Prediction of Life

A genetic prediction problem

[Slide shows a still longer page of the same repeating nucleotide sequence.]

Page 10: Bioinformatics The Prediction of Life

A genetic prediction problem

A gene encodes a protein

It is a blueprint that provides biochemical instructions on how to construct a sequence of amino acids so as to make a working protein that will perform some function in the organism

Page 11: Bioinformatics The Prediction of Life

A genetic prediction problem

[Diagram labels: encoding region, untranslated region, transcription factor, RNA]

Page 12: Bioinformatics The Prediction of Life

A genetic prediction problem

untranslated region

Page 13: Bioinformatics The Prediction of Life

A genetic prediction problem

untranslated region

ttgcaatcggcgctacgcttcaaaatttattatattcccggc

Page 14: Bioinformatics The Prediction of Life

A genetic prediction problem

ttgcaatcggcgctacgcttcaaaatttattatattcccggc

What transcription factors bind to this gene?

Where is the transcription factor binding site?

Page 15: Bioinformatics The Prediction of Life

A genetic prediction problem

ttgcaatcggcgctacgcttcaaaatttattatattcccggc

Clues: A binding site is often a short general pattern

E.g. CCGATNATCGG
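A pattern of this kind can be located with an ordinary regular expression. The sketch below assumes that `N` follows the usual convention of standing for any one of the four nucleotides; the function name `find_sites` and the test sequences are hypothetical.

```python
import re

# Assumption: N matches any single nucleotide; other letters match themselves.
SYMBOLS = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}


def find_sites(sequence, pattern):
    """Return the start index of every (possibly overlapping) match of pattern."""
    body = "".join(SYMBOLS[ch] for ch in pattern.upper())
    # A lookahead lets overlapping occurrences be reported as well.
    regex = re.compile("(?=(" + body + "))")
    return [m.start() for m in regex.finditer(sequence.upper())]


hits = find_sites("ttgcCCGATGATCGGaat", "CCGATNATCGG")
none = find_sites("ttgcaatcggcgctacgcttcaaaatttattatattcccggc", "CCGATNATCGG")
print(hits, none)
```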

Page 16: Bioinformatics The Prediction of Life

A genetic prediction problem

ttgcaatcggcgctacgcttcaaaatttattatattcccggc

Clues: The patterns are often reverse complements

E.g. CCGATNATCGG
     GGCTANTAGCC
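The two strings in this example, CCGATNATCGG and GGCTANTAGCC, are base-by-base complements of each other, and the first one reads the same as its own reverse complement. A minimal sketch that checks both facts, with hypothetical helper names `complement` and `reverse_complement`:

```python
# A pairs with T and C pairs with G; N (any nucleotide) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def complement(seq):
    """Return the base-by-base complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)


def reverse_complement(seq):
    """Return the complement read in the opposite direction."""
    return complement(seq)[::-1]


pattern = "CCGATNATCGG"
print(complement(pattern))
print(reverse_complement(pattern) == pattern)
```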

Page 17: Bioinformatics The Prediction of Life

A genetic prediction problem

ttgcaatcggcgctacgcttcaaaatttattatattcccggc

Clues: Where there is one binding site, often there is another nearby.
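One way to use this clue is to keep only candidate sites that have a neighbour within some distance. The sketch below is an illustration under stated assumptions: the function name `nearby_pairs`, the example positions, and the 50-nucleotide window are hypothetical, and the slide does not say how near "nearby" is.

```python
def nearby_pairs(positions, max_gap=50):
    """Return all pairs of site positions that lie within max_gap of each other."""
    ordered = sorted(positions)
    return [
        (a, b)
        for i, a in enumerate(ordered)
        for b in ordered[i + 1:]
        if b - a <= max_gap
    ]


# Hypothetical start positions of candidate binding sites.
pairs = nearby_pairs([10, 40, 200, 230, 500], max_gap=50)
print(pairs)
```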

Page 18: Bioinformatics The Prediction of Life

A genetic prediction problem

Computer science has developed algorithms and data structures that identify all of these properties quickly and efficiently, so this is exactly the kind of problem computer scientists should be able to solve.

Page 19: Bioinformatics The Prediction of Life

Proteomics

Three consecutive nucleotides in the coding region form a ‘codon’ … i.e. encode an amino acid.

A string of amino acids makes a protein.

3 nucleotides, 4 possibilities for each, so

4^3 = 64 possible codons

But there are only 20 amino acids!
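The count of 64 can be confirmed by enumerating every three-letter combination of the four nucleotides. A small sketch using the standard library:

```python
from itertools import product

# Every ordered triple drawn from the four nucleotides.
codons = ["".join(triple) for triple in product("ACGT", repeat=3)]
print(len(codons), codons[:4])
```

With 64 codons and only 20 amino acids, several codons must map to the same amino acid.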

Page 20: Bioinformatics The Prediction of Life

Proteomics

Glycine: GGA, GGC, GGG, GGT

Tyrosine: TAT, TAC

Methionine: ATG

There is quite a bit of redundancy in codons.
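The redundancy shows up clearly when the table is inverted so that each codon looks up its amino acid. The sketch below uses only the three amino acids listed on this slide, so it is a deliberately partial table (a complete one has 64 entries); the names `AMINO_ACID_BY_CODON` and `translate` and the input string are hypothetical.

```python
# Partial table: only the codons listed on the slide.
CODONS_BY_AMINO_ACID = {
    "Glycine": ["GGA", "GGC", "GGG", "GGT"],
    "Tyrosine": ["TAT", "TAC"],
    "Methionine": ["ATG"],
}

# Invert the table: several codons point at the same amino acid.
AMINO_ACID_BY_CODON = {
    codon: name
    for name, codon_list in CODONS_BY_AMINO_ACID.items()
    for codon in codon_list
}


def translate(dna):
    """Read a coding sequence three nucleotides at a time; '?' marks unknown codons."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return [AMINO_ACID_BY_CODON.get(dna[i:i + 3], "?") for i in range(0, usable, 3)]


protein = translate("atgggatacggc")
print(protein)
```

Here GGA and GGC both translate to glycine, which is the redundancy the slide describes.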

Page 21: Bioinformatics The Prediction of Life

Amino Acid

[Diagram labels: amine group, carboxyl group, R group]

Page 22: Bioinformatics The Prediction of Life

Amino Acid

glycine

tyrosine

Page 23: Bioinformatics The Prediction of Life

Primary structure: MSALVSTTPSLLAGVRNVDB …

Page 24: Bioinformatics The Prediction of Life

Tertiary Structure

Page 25: Bioinformatics The Prediction of Life

Secondary Structure

Page 26: Bioinformatics The Prediction of Life

Signal peptide

A relatively short sequence of amino acid residues at the N-terminus of the nascent protein

typically 15-50 residues

MAGPRPSPWARLLLAALISVSLSGTLA
RCKKAPVSKKCETCVGQAALTGL …

Cleaved off as the protein passes through the membrane (operates like a pass key)

Knowing the signal peptide helps determine the protein's function in the organism
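Cleavage can be pictured as splitting the residue string at a single position. The sketch below is purely illustrative: the function name `cleave` is hypothetical, and the cleavage point after residue 27 is an assumption taken from where the example sequence breaks across two lines on the slide, not a stated result. Predicting the real cleavage point is the hard part of the problem.

```python
def cleave(protein, signal_length):
    """Split a protein string into (signal peptide, remaining protein)."""
    if not 0 <= signal_length <= len(protein):
        raise ValueError("signal_length must lie within the protein")
    return protein[:signal_length], protein[signal_length:]


# Fragment shown on the slide; the slide marks it as continuing beyond this point.
fragment = "MAGPRPSPWARLLLAALISVSLSGTLARCKKAPVSKKCETCVGQAALTGL"

# Assumed cleavage position (hypothetical), within the slide's 15-50 residue range.
signal, rest = cleave(fragment, 27)
print(signal)
print(rest)
```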

Page 27: Bioinformatics The Prediction of Life

How do we do it? See any patterns?

[Slide shows a full page of the same repeating nucleotide sequence.]
cgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtgttgcgcacccacaccagttatatagagacgaactcttagaggcgaggacatcatcatatcgcagctacagcgcatcagttagaggcgaggacatcatcatatcgcagctacagcgcatcagttagaggcgaggacatcatcatatgcatcagtgttgcgcacccacaccagttatatagagacgaactcttagaggcgaggacatcatcatatcgcagctacagcgcatcagttagaggcgaggacatcatcatatcgcagctacagcgcatcagttagaggcgaggacatcatcatatcgccgc

Page 28: Bioinformatics The Prediction of Life

Local biases in residues around the cleavage site

Sequence regularities can be exploited by statistical and pattern-based models
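
One way to make the idea of a statistical model concrete is a position-specific frequency table. The sketch below is illustrative only: the three aligned windows are invented toy data, the uniform background is an assumption, and the function names are hypothetical rather than anything taken from the talk.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_counts(windows):
    """Count which residue occurs at each position of equal-length windows."""
    counts = [Counter() for _ in range(len(windows[0]))]
    for window in windows:
        for i, residue in enumerate(window):
            counts[i][residue] += 1
    return counts

def log_odds_score(window, counts, n_windows, background):
    """Sum per-position log-odds against the background (add-one smoothing)."""
    score = 0.0
    for i, residue in enumerate(window):
        p_here = (counts[i][residue] + 1) / (n_windows + len(AMINO_ACIDS))
        score += math.log(p_here / background[residue])
    return score

# Toy windows covering positions -3..-1 before a cleavage site (invented data).
toy_windows = ["VSA", "ASA", "VQA"]
toy_counts = position_counts(toy_windows)
uniform = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}  # assumed background

typical = log_odds_score("VSA", toy_counts, len(toy_windows), uniform)
atypical = log_odds_score("WWW", toy_counts, len(toy_windows), uniform)
```

A window that resembles the training windows scores above zero, while one made of residues never seen at those positions scores below zero.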

Page 29: Bioinformatics The Prediction of Life

Proteomic prediction

Language:
• letters combine to form words
• words combine to form phrases
• phrases combine to form sentences
• sentences combine to form paragraphs (and ultimately Harry Potter books)

Proteins:
• amino acids combine to form peptides
• peptides combine to form secondary motifs (e.g. α-helices and β-sheets)
• motifs combine to make proteins
• proteins combine to make toenails (and ultimately people)

Page 30: Bioinformatics The Prediction of Life

Approach

The problem is stated as two-class: an amino acid is either the first residue of the mature protein or it is not

Each residue is described by a single document, which includes as many electrochemical, structural or contextual facts as are available (or desirable)
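
The two-class framing can be sketched in a few lines. This is a minimal illustration, assuming the position of the first mature residue is already known for each training protein; the function name and the short example sequence are hypothetical.

```python
def label_residues(sequence, mature_start):
    """Return (index, residue, is_first_mature) for every residue.

    mature_start is the 0-based index of the first residue of the mature
    protein; exactly one residue per protein gets the positive label.
    """
    return [(i, aa, i == mature_start) for i, aa in enumerate(sequence)]

# Invented example: cleavage between index 3 and index 4.
labelled = label_residues("MKLARHQQ", 4)
positives = [item for item in labelled if item[2]]
```

Because each protein contributes a single positive example and many negatives, the two classes are heavily imbalanced, which is worth keeping in mind when choosing and evaluating a learner.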

Page 31: Bioinformatics The Prediction of Life

Properties of amino acids

Page 32: Bioinformatics The Prediction of Life

Residue as a document

E.g. Cysteine Cys C

aliphatic [yes], aromatic [no], hydrophobic [yes], charge [-], polarized [yes], small [no], number of nitrogen atoms [1], contains sulphur [yes], has a carbon ring [no], ionized [yes], valence [2], cbeta [no], covalent [yes], h-bond [yes], etc. (whatever else the experimenter wants to include)

Page 33: Bioinformatics The Prediction of Life

Sample document

PRNUM:1. AANUM:21.

AMINO[-8]:L. ALIPH[-8]:-. AROMA[-8]:-. CBETA[-8]:-. CHARG[-8]:-. COVAL[-8]:-. HBOND[-8]:-. HPHOB[-8]:+. IONIZ[-8]:-. NITRO[-8]:1. POLAR[-8]:-. POSNG[-8]:0. SMALL[-8]:-. SULPH[-8]:-. TEENY[-8]:-. CRING[-8]:-. VALEN[-8]:2.
AMINO[-7]:L. ALIPH[-7]:-. AROMA[-7]:-. CBETA[-7]:-. CHARG[-7]:-. COVAL[-7]:-. HBOND[-7]:-. HPHOB[-7]:+. IONIZ[-7]:-. NITRO[-7]:1. POLAR[-7]:-. POSNG[-7]:0. SMALL[-7]:-. SULPH[-7]:-. TEENY[-7]:-. CRING[-7]:-. VALEN[-7]:2.
AMINO[-6]:F. ALIPH[-6]:+. AROMA[-6]:+. CBETA[-6]:-. CHARG[-6]:-. COVAL[-6]:-. HBOND[-6]:-. HPHOB[-6]:+. IONIZ[-6]:-. NITRO[-6]:1. POLAR[-6]:-. POSNG[-6]:0. SMALL[-6]:-. SULPH[-6]:-. TEENY[-6]:-. CRING[-6]:+. VALEN[-6]:2.
AMINO[-5]:A. ALIPH[-5]:-. AROMA[-5]:-. CBETA[-5]:-. CHARG[-5]:-. COVAL[-5]:-. HBOND[-5]:-. HPHOB[-5]:-. IONIZ[-5]:-. NITRO[-5]:1. POLAR[-5]:-. POSNG[-5]:0. SMALL[-5]:+. SULPH[-5]:-. TEENY[-5]:+. CRING[-5]:-. VALEN[-5]:2.
AMINO[-4]:T. ALIPH[-4]:+. AROMA[-4]:-. CBETA[-4]:+. CHARG[-4]:-. COVAL[-4]:-. HBOND[-4]:+. HPHOB[-4]:-. IONIZ[-4]:-. NITRO[-4]:1. POLAR[-4]:+. POSNG[-4]:0. SMALL[-4]:+. SULPH[-4]:-. TEENY[-4]:-. CRING[-4]:-. VALEN[-4]:2.
AMINO[-3]:C. ALIPH[-3]:+. AROMA[-3]:-. CBETA[-3]:-. CHARG[-3]:-. COVAL[-3]:+. HBOND[-3]:+. HPHOB[-3]:+. IONIZ[-3]:+. NITRO[-3]:1. POLAR[-3]:+. POSNG[-3]:-. SMALL[-3]:-. SULPH[-3]:+. TEENY[-3]:-. CRING[-3]:-. VALEN[-3]:2.
AMINO[-2]:I. ALIPH[-2]:-. AROMA[-2]:-. CBETA[-2]:+. CHARG[-2]:-. COVAL[-2]:-. HBOND[-2]:-. HPHOB[-2]:+. IONIZ[-2]:-. NITRO[-2]:1. POLAR[-2]:-. POSNG[-2]:0. SMALL[-2]:-. SULPH[-2]:-. TEENY[-2]:-. CRING[-2]:-. VALEN[-2]:2.
AMINO[-1]:A. ALIPH[-1]:-. AROMA[-1]:-. CBETA[-1]:-. CHARG[-1]:-. COVAL[-1]:-. HBOND[-1]:-. HPHOB[-1]:-. IONIZ[-1]:-. NITRO[-1]:1. POLAR[-1]:-. POSNG[-1]:0. SMALL[-1]:+. SULPH[-1]:-. TEENY[-1]:+. CRING[-1]:-. VALEN[-1]:2.
AMINO[0]:R. ALIPH[0]:+. AROMA[0]:-. CBETA[0]:-. CHARG[0]:+. COVAL[0]:-. HBOND[0]:+. HPHOB[0]:-. IONIZ[0]:+. NITRO[0]:4. POLAR[0]:+. POSNG[0]:+. SMALL[0]:-. SULPH[0]:-. TEENY[0]:-. CRING[0]:-. VALEN[0]:3.
AMINO[1]:H. ALIPH[1]:+. AROMA[1]:+. CBETA[1]:-. CHARG[1]:+. COVAL[1]:-. HBOND[1]:+. HPHOB[1]:-. IONIZ[1]:+. NITRO[1]:3. POLAR[1]:+. POSNG[1]:+. SMALL[1]:-. SULPH[1]:-. TEENY[1]:-. CRING[1]:+. VALEN[1]:3.
AMINO[2]:Q. ALIPH[2]:+. AROMA[2]:-. CBETA[2]:-. CHARG[2]:-. COVAL[2]:-. HBOND[2]:+. HPHOB[2]:-. IONIZ[2]:-. NITRO[2]:2. POLAR[2]:+. POSNG[2]:0. SMALL[2]:-. SULPH[2]:-. TEENY[2]:-. CRING[2]:-. VALEN[2]:2.
AMINO[3]:Q. ALIPH[3]:+. AROMA[3]:-. CBETA[3]:-. CHARG[3]:-. COVAL[3]:-. HBOND[3]:+. HPHOB[3]:-. IONIZ[3]:-. NITRO[3]:2. POLAR[3]:+. POSNG[3]:0. SMALL[3]:-. SULPH[3]:-. TEENY[3]:-. CRING[3]:-. VALEN[3]:2.
AMINO[4]:R. ALIPH[4]:+. AROMA[4]:-. CBETA[4]:-. CHARG[4]:+. COVAL[4]:-. HBOND[4]:+. HPHOB[4]:-. IONIZ[4]:+. NITRO[4]:4. POLAR[4]:+. POSNG[4]:+. SMALL[4]:-. SULPH[4]:-. TEENY[4]:-. CRING[4]:-. VALEN[4]:3.
AMINO[5]:Q. ALIPH[5]:+. AROMA[5]:-. CBETA[5]:-. CHARG[5]:-. COVAL[5]:-. HBOND[5]:+. HPHOB[5]:-. IONIZ[5]:-. NITRO[5]:2. POLAR[5]:+. POSNG[5]:0. SMALL[5]:-. SULPH[5]:-. TEENY[5]:-. CRING[5]:-. VALEN[5]:2.
AMINO[6]:Q. ALIPH[6]:+. AROMA[6]:-. CBETA[6]:-. CHARG[6]:-. COVAL[6]:-. HBOND[6]:+. HPHOB[6]:-. IONIZ[6]:-. NITRO[6]:2. POLAR[6]:+. POSNG[6]:0. SMALL[6]:-. SULPH[6]:-. TEENY[6]:-. CRING[6]:-. VALEN[6]:2.
AMINO[7]:Q. ALIPH[7]:+. AROMA[7]:-. CBETA[7]:-. CHARG[7]:-. COVAL[7]:-. HBOND[7]:+. HPHOB[7]:-. IONIZ[7]:-. NITRO[7]:2. POLAR[7]:+. POSNG[7]:0. SMALL[7]:-. SULPH[7]:-. TEENY[7]:-. CRING[7]:-. VALEN[7]:2.
AMINO[8]:Q. ALIPH[8]:+. AROMA[8]:-. CBETA[8]:-. CHARG[8]:-. COVAL[8]:-. HBOND[8]:+. HPHOB[8]:-. IONIZ[8]:-. NITRO[8]:2. POLAR[8]:+. POSNG[8]:0. SMALL[8]:-. SULPH[8]:-. TEENY[8]:-. CRING[8]:-. VALEN[8]:2.

MULT3:7. MULT5:4. MULT7:3. MULT9:2. 2GRAM:IA. GRAM2:HQ. 3GRAM:CIA. GRAM3:HQQ.
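
A document in this token format could be generated along the following lines. This is a sketch under several assumptions, not the author's generator. Only two of the seventeen per-residue properties are included, with values copied from the sample above. The reading of MULTk as AANUM divided by k (rounded down), and of the n-gram tokens as the residues immediately before and after position 0, is inferred from the sample values. The first twelve residues of the example sequence are invented placeholders, and all function and variable names are hypothetical.

```python
# Property values copied from the sample document, only for the residues it contains.
HPHOB = {"L": "+", "F": "+", "C": "+", "I": "+",
         "A": "-", "T": "-", "R": "-", "H": "-", "Q": "-"}
SMALL = {"A": "+", "T": "+", "L": "-", "F": "-", "C": "-",
         "I": "-", "R": "-", "H": "-", "Q": "-"}

def residue_document(seq, aanum, prnum=1, radius=8):
    """Build a token document for the residue at 1-based position aanum."""
    centre = aanum - 1
    tokens = [f"PRNUM:{prnum}.", f"AANUM:{aanum}."]
    for offset in range(-radius, radius + 1):
        i = centre + offset
        if 0 <= i < len(seq):  # skip offsets that fall off either end
            aa = seq[i]
            tokens.append(f"AMINO[{offset}]:{aa}.")
            tokens.append(f"HPHOB[{offset}]:{HPHOB.get(aa, '?')}.")
            tokens.append(f"SMALL[{offset}]:{SMALL.get(aa, '?')}.")
    for k in (3, 5, 7, 9):  # assumed meaning: integer division of AANUM
        tokens.append(f"MULT{k}:{aanum // k}.")
    tokens.append(f"2GRAM:{seq[max(0, centre - 2):centre]}.")
    tokens.append(f"GRAM2:{seq[centre + 1:centre + 3]}.")
    tokens.append(f"3GRAM:{seq[max(0, centre - 3):centre]}.")
    tokens.append(f"GRAM3:{seq[centre + 1:centre + 4]}.")
    return " ".join(tokens)

# Twelve placeholder residues, then the window shown in the sample document.
example_seq = "MAAAAAAAAAAA" + "LLFATCIA" + "R" + "HQQRQQQQ"
example_doc = residue_document(example_seq, 21)
```

With this example sequence, the positional, MULT and n-gram tokens agree with the corresponding tokens in the sample document, which supports (but does not prove) the inferred meanings.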

Page 34: Bioinformatics The Prediction of Life

Artificial Intelligence

Computers do things only human brains can otherwise do

expert

Page 35: Bioinformatics The Prediction of Life

Artificial Intelligence

Computers do things only human brains can otherwise do

expert system

expert

Page 36: Bioinformatics The Prediction of Life

Artificial Intelligence

Computers do things only human brains can otherwise do

learning system

expert system

Page 37: Bioinformatics The Prediction of Life

Machine learning

What is machine learning?
• creating computer programs that get better with experience
• learn how to make expert judgments
• discover previously hidden, potentially useful information (data mining)

How does it work?
• user provides learning system with examples of concept to be learned
• induction algorithm infers a characteristic model of the examples
• model is used to predict whether or not future novel instances are also examples, and it does this very consistently and very, very quickly!
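
The three steps (examples in, model inferred, novel instances classified) can be sketched with a very small learner. The slides do not name a particular induction algorithm, so the choice here is an assumption: a simplified naive Bayes classifier that scores only the tokens present in a document, with add-one smoothing. The training examples are invented toy data and the function names are hypothetical.

```python
import math
from collections import Counter

def train(examples):
    """Infer a model from (tokens, label) pairs: per-class token and class counts."""
    token_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for tokens, label in examples:
        class_counts[label] += 1
        token_counts[label].update(set(tokens))  # count each token once per document
    return token_counts, class_counts

def predict(model, tokens):
    """Return the label whose smoothed log-probability score is highest."""
    token_counts, class_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in (True, False):
        score = math.log((class_counts[label] + 1) / (total + 2))  # class prior
        for token in set(tokens):
            seen = token_counts[label][token]
            score += math.log((seen + 1) / (class_counts[label] + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy examples: True means "first residue of the mature protein".
toy_examples = [
    (["AMINO[-1]:A.", "AMINO[-3]:V."], True),
    (["AMINO[-1]:A.", "AMINO[-3]:A."], True),
    (["AMINO[-1]:L.", "AMINO[-3]:K."], False),
    (["AMINO[-1]:Q.", "AMINO[-3]:L."], False),
]
toy_model = train(toy_examples)
guess_positive = predict(toy_model, ["AMINO[-1]:A.", "AMINO[-3]:S."])
guess_negative = predict(toy_model, ["AMINO[-1]:L."])
```

Once trained, the model classifies a new document with a handful of dictionary lookups, which is the sense in which prediction is consistent and fast.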

Page 38: Bioinformatics The Prediction of Life

Bioinformatics

Biologists know proteins, computer scientists know machine learning

Together, they can find hidden and potentially useful information about genes and proteins

Biotechnology is a multi-billion dollar industry

Biotechnology is one of the best funded areas of scientific research

Shortage of people educated in bioinformatics

Page 39: Bioinformatics The Prediction of Life

The University of Waikato

Waikato University is ranked first in the country in computer science and in molecular, cellular, and whole-organism biology

centre of the universe for machine learning

Page 40: Bioinformatics The Prediction of Life

The University of Waikato

If you’re interested in getting involved in bioinformatics, or indeed any other area along the leading edge of computer science and/or biology, then …

Waikato wants You!