bioinformatics for high school
-
8/8/2019 Bioinformatics for High School
1/28
An Introduction to Bioinformatics
(high-school version)
Ying Xu
Institute of Bioinformatics, and Biochemistry and Molecular Biology Department
University of Georgia
[email protected]@bmb.uga.edu
-
The Basics
[Figure: cell, chromosome, genes; genome and sequencing; protein; metabolic pathway/network]
ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgat
cgtgtgggtagtagctgatatgatgcgaggtaggggataggatag
caacagatgagcggatgctgagtgcagtggcatgcgatgtcgatg
atagcggtaggtagacttcgcgcataaagctgcgcgagatgattg
caaagragttagatgagctgatgctagaggtcagtgactgatgatc
gatgcatgcatggatgatgcagctgatcgatgtagatgcaataagt
cgatgatcgatgatgatgctagatgatagctagatgtgatcgatggt
aggtaggatggtaggtaaattgatagatgctagatcgtaggta
-
Bioinformatics (or computational biology)
"This interdisciplinary science is about providing computational support to studies on linking the behavior of cells, organisms and populations to the information encoded in the genomes"
Temple Smith
[Figure: a stretch of genome sequence, ccgtacgtacgtagag...]
-
Information Encoded in Genomes
What information? And how to find and interpret it?
Working molecules (proteins, RNAs) in our cells
[Figure: genome sequence (ccgtacgtacgtagag...) and a bacterial cell]
-
Information Encoded in Genomes
How to find where protein-encoding genes are in a genome?
A genome is like a book written in words consisting of 4 letters (A, C, G, T), and each protein-encoding gene is like an instruction about how the protein is made
People have found that six-letter words (e.g., AAGTGC) have different frequencies in genes than in non-gene regions
[Figure: a stretch of genome sequence, ccgtacgtacgtagag...]
-
Information Encoded in Genomes
Frequency in genes (AAA ATT) = 1.4%; Frequency in non-genes (AAA ATT) = 5.2%
Frequency in genes (AAA GAC) = 1.9%; Frequency in non-genes (AAA GAC) = 4.8%
Frequency in genes (AAA TAG) = 0.0%; Frequency in non-genes (AAA TAG) = 6.3%
...
AAAATTAAAATTAAAGACAAAATTAAAGACAAACACAAAATTAAATAGAAATAGAAAATT ...
Is this a gene or non-gene region if you have to make a bet?
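One way to approach the bet is to chop the region into its six-letter words and count them. The short Python sketch below does that for the sequence shown on this slide. Reading the region as consecutive, non-overlapping words starting at the first letter is an assumption made here for illustration.

```python
from collections import Counter

# The region shown on the slide.
region = "AAAATTAAAATTAAAGACAAAATTAAAGACAAACACAAAATTAAATAGAAATAGAAAATT"

# Assumption: read the region as consecutive, non-overlapping 6-letter words.
words = [region[i:i + 6] for i in range(0, len(region), 6)]
counts = Counter(words)

# Print each distinct word with the number of times it occurs.
for word, n in counts.most_common():
    print(word, n)
```

The word AAATAG shows up in the region even though its quoted frequency in genes is 0.0%, which already hints at the safer bet.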
-
Information Encoded in Genomes
Preference model:
  for each 6-letter word X (e.g., AAA AAA), calculate its frequencies in gene and non-gene regions, FC(X) and FN(X)
  calculate X's preference value P(X) = log (FC(X)/FN(X))
Properties:
  P(X) is 0 if X has the same frequency in gene and non-gene regions
  P(X) is positive if X has a higher frequency in gene than in non-gene regions; the larger the difference, the more positive the score
  P(X) is negative if X has a higher frequency in non-gene than in gene regions; the larger the difference, the more negative the score
Gene prediction: given a DNA region, calculate the sum of the P(X) values for all 6-letter words X in the region; if the sum is larger than zero, predict gene, otherwise predict non-gene
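The preference model can be tried out in a few lines of Python. This is a minimal sketch, not a production gene finder: it uses only the three word frequencies quoted on the earlier slide, and it makes three assumptions that are not part of the slides. Words are read consecutively without overlap, words with no frequency data score 0, and a small floor (PSEUDO) replaces a frequency of 0.0% so the logarithm stays defined.

```python
import math

# Frequencies in percent quoted on the earlier slide: word -> (in genes, in non-genes).
FREQS = {
    "AAAATT": (1.4, 5.2),
    "AAAGAC": (1.9, 4.8),
    "AAATAG": (0.0, 6.3),
}

PSEUDO = 0.01  # assumption: small floor so that log(0) never occurs


def preference(word, freqs=FREQS, pseudo=PSEUDO):
    """P(X) = log(FC(X) / FN(X)); words without data score 0 (assumption)."""
    if word not in freqs:
        return 0.0
    fc, fn = freqs[word]
    return math.log(max(fc, pseudo) / max(fn, pseudo))


def six_letter_words(region):
    """Split a region into consecutive, non-overlapping 6-letter words (assumption)."""
    return [region[i:i + 6] for i in range(0, len(region) - 5, 6)]


def predict(region):
    """Sum the preference values and apply the larger-than-zero rule."""
    score = sum(preference(w) for w in six_letter_words(region))
    return ("gene" if score > 0 else "non-gene"), score


region = "AAAATTAAAATTAAAGACAAAATTAAAGACAAACACAAAATTAAATAGAAATAGAAAATT"
label, score = predict(region)
print(label, round(score, 2))
```

Every word with known frequencies is more common in non-gene regions, so each one pulls the sum below zero and the region is called non-gene.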
-
Information Encoded in Genomes
You just learned your first bioinformatics method for gene prediction: congratulations!
-
Information Encoded in Genomes
Ok, we now have learned how to find genes encoded in a genome
How do we find out what they do (their biological functions, e.g. sensors, transporters, regulators, enzymes)?
-
Information Encoded in Genomes
People have observed that similar protein sequences tend to have similar functions
Over the years, many genes have been thoroughly studied in different organisms, e.g., human, mouse, fly, ..., rice, and their biological functions have been identified and documented
For a new protein, scientists can possibly predict its function by identifying well-studied proteins in other organisms that have high sequence similarity to it
This works for ~60% of genes in a newly sequenced genome
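The idea of borrowing a function from the most similar well-studied protein can be sketched in Python. Everything in this example is made up for illustration: the three "known" proteins, their sequences, and the 0.6 cutoff are hypothetical. The similarity score comes from Python's standard difflib module and is only a rough stand-in for the sequence alignment scores that real tools compute.

```python
from difflib import SequenceMatcher

# Hypothetical mini-database of well-studied proteins: name -> (sequence, function).
KNOWN = {
    "toy_transporter": ("MKTLLVAGALLSAVFA", "transporter"),
    "toy_enzyme": ("MSDQEAKPSTEDLGDK", "enzyme"),
    "toy_regulator": ("MNRISTTELARRLGVS", "regulator"),
}


def similarity(a, b):
    """Rough similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, a, b).ratio()


def predict_function(query, known=KNOWN, threshold=0.6):
    """Return (function, score) of the most similar known protein.

    If even the best match falls below the (hypothetical) threshold,
    return (None, score): no confident prediction.
    """
    best_name = max(known, key=lambda name: similarity(query, known[name][0]))
    score = similarity(query, known[best_name][0])
    if score < threshold:
        return None, score
    return known[best_name][1], score


# A new protein that differs from toy_transporter by a single letter.
print(predict_function("MKTLLVAGALISAVFA"))
# A sequence unlike anything in the database.
print(predict_function("WWWWWWWWWW"))
```

The second query illustrates the remaining genes for which no similar well-studied protein is found.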
-
Information Encoded in Genomes
Scientists have developed computational techniques for
  identifying regulatory signals that control gene transcription
  predicting protein-protein interactions
  elucidating biological networks for a particular function
  ... and elucidating much other information
-
Information Encoded in Genomes
E. coli O157 and O111 are pathogenic to humans while E. coli K-12 is not
Can we tell why? Which genes or pathways in E. coli O157 and O111 are responsible for the pathogenicity?
-
Information Encoded in Genomes
[Figure panels: E. coli K-12, E. coli O157, B. pseudomallei, P. furiosus, random sequence, human chromosome #1]
-
Information Encoded in Genomes
[Figure legend]
Red: prokaryotes
Blue: eukaryotes
Green: plastids
Orange: plasmids
Black: mitochondria
x-axis: average of variations of the K-mer frequencies
y-axis: average barcode similarity among fragments of a genome
-
Information Encoded in Genomes
Yes, biologists can derive a lot of information from genomes now
but we are far from fully understanding any genome yet, even for the simplest living organisms, bacteria
We can clearly use new ideas from bright young minds
Interested in doing bioinformatics?
-
Linking Genome Information to Biological Systems Behaviors
To fully understand cellular behaviors, we need to
  elucidate information encoded in the genome, and
  understand how working molecules, encoded by the genome, behave according to the physical laws on earth!
[Figure: genome sequence (ccgtacgtacgtagag...), gene, protein]
-
Key Drivers of Bioinformatics
The Human Genome Project has fundamentally changed biological science
A key consequence of the genome project is that scientists learned they can produce biological data on a massive scale:
  genome sequences
  microarray data for gene expression levels
  yeast two-hybrid systems for protein-protein interactions
  and other high-throughput biological data
These data reflect the cellular states, molecular structures and functions, in complex ways
-
Key Drivers of Bioinformatics
and let bioinformaticians (help to) decipher the meaning of these data, as in genome sequences
Together, high-throughput probing technologies and bioinformatics are transforming biological science into a new science more like physics
-
Key Drivers of Bioinformatics
"Like physics, where general rules and laws are taught at the start, biology will surely be presented to future generations of students as a set of basic systems ....... duplicated and adapted to a very wide range of cellular and organismic functions, following basic evolutionary principles constrained by Earth's geological history."
Temple Smith, Current Topics in Computational Molecular Biology
-
Biomarker Identification
Our goal is to identify markers in blood that can tell if a person has a particular form of cancer, in a similar fashion to doing a pregnancy test using a test kit, possibly at home
-
Biomarker Identification
Microarray gene expression data allow comparative analyses of gene expression patterns in cancer versus normal tissues
[Figure panels: on cancer tissues; on normal tissues]
Finding genes showing maximum difference in their expression levels between cancer and normal tissues
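A toy version of this search fits in a few lines of Python. The gene names and expression values below are invented for illustration, and ranking by the difference of mean expression is a simplifying assumption; real analyses typically also use statistical tests and many more samples.

```python
from statistics import mean

# Hypothetical expression levels (arbitrary units): gene -> (cancer samples, normal samples).
EXPRESSION = {
    "geneA": ([9.1, 8.7, 9.4], [2.0, 2.3, 1.9]),
    "geneB": ([5.0, 5.2, 4.9], [5.1, 4.8, 5.0]),
    "geneC": ([1.2, 1.0, 1.1], [6.5, 6.9, 6.4]),
    "geneD": ([7.0, 7.4, 6.8], [5.9, 6.1, 6.0]),
}


def mean_difference(cancer, normal):
    """Mean expression in cancer minus mean expression in normal tissue."""
    return mean(cancer) - mean(normal)


def rank_genes(expression):
    """Sort genes by the size of their cancer-versus-normal difference, largest first."""
    diffs = {gene: mean_difference(c, n) for gene, (c, n) in expression.items()}
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)


ranking = rank_genes(EXPRESSION)
for gene, diff in ranking:
    print(f"{gene}: {diff:+.2f}")
```

A positive difference means the gene is more active in the cancer samples, and a negative one means it is less active; genes near the top of the ranking are the marker candidates.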
-
Biomarker Identification
Proteins A, ..., Z highly expressed in cancer
-
Biomarker Identification
Question: Can we predict which of these tissue marker proteins can get secreted into blood circulation so we can get markers in blood?
Through literature search, we found over proteins being secreted into blood circulation due to various physiological conditions
We then trained a classifier to identify features that distinguish between proteins that can be secreted into blood and proteins that cannot
-
Biomarker Identification
We have developed a classifier to distinguish blood-secretory proteins from other proteins
On a test set with 52 positive and 3,629 negative examples, our classifier achieves 89.6% sensitivity, 98.5% specificity and 94% AUC
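For readers new to these three measures, here is a small Python sketch of how they are defined. The labels and scores are a made-up toy example, not the test set from the slide, and the 0.5 cutoff is an arbitrary choice. Sensitivity is the fraction of true positives that are caught, specificity is the fraction of true negatives that are correctly rejected, and AUC is computed here as the fraction of (positive, negative) pairs in which the positive example gets the higher score.

```python
def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity) for 0/1 labels and 0/1 predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # positives caught
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # positives missed
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # negatives rejected
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false alarms
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = secreted into blood, 0 = not secreted (hypothetical).
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.6, 0.05]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # arbitrary cutoff

sens, spec = sensitivity_specificity(labels, predictions)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} AUC={auc(labels, scores):.2f}")
```

Reporting sensitivity and specificity separately matters when negatives vastly outnumber positives, as on the slide, because a classifier that always answers "not secreted" would still be right most of the time.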
-
Biomarker Identification
The predicted marker proteins can be validated using mass spectrometry experiments
-
Biomarker Identification
If successful, it will be possible to test for cancer using a test kit like pregnancy test kits
-
Take-Home Message
Biological science is under rapid transformation because of high-throughput measurement technologies and bioinformatics
As an emerging field, bioinformatics is about using computational techniques to solve biological problems, and represents the future of biology
-
THANK YOU!