bda2015 tutorial-part1-intro
16th December 2015
Genomics 3.0: Big Data in Precision Medicine
Asoke K Talukder, Ph.D
InterpretOmics, Bangalore, India
17th December 2009
Big Data Analytics 2015, Hyderabad, 16-18 December 2015
Acknowledgement
• BDA2015 Technical committee
• Authors & Publishers making their articles Open Access on the Web
• Open Source Software/Foundation
• Authors of Open Source & Open Domain software
• NCBI & other open domain databases
• Wikipedia & other sites that believe in Bhikshu Economy
Disclaimer
• While researching this tutorial, I referred to many texts and presentations available on the Web or obtained from colleagues and professionals. I have tried to credit the creators of the artifacts used in this presentation; if I have missed a citation to an original author, the omission is unintentional and regretted.
About the Speaker
• Dr. Asoke K. Talukder is a computer scientist who has worked for companies such as Fujitsu-ICIM, Microsoft, Oracle, Informix, Digital, Hewlett Packard, ICL, Sequoia, Northern Telecom, NEC, KredietBank, iGate, Cellnext, etc. He has authored or edited six books, two of which have been translated into Chinese, and has published many peer-reviewed research papers. He is the recipient of many international awards, including the All India Radio/Doordarshan award, ICIM Professional Excellence Award, ICL Excellence Award, IBM Solutions Excellence Award, Simagine GSMWorld Award, etc. He has been listed in “Who’s Who in the World”, “Who’s Who in Science and Engineering”, and “Outstanding Scientists of 21st Century”. He holds an M.Sc (Physics with Biophysics major) and a Ph.D in Computer Science. He was the DaimlerChrysler Chair Professor at IIIT; Adjunct Professor, Department of CSE, NIT Warangal; and Adjunct Faculty, CE, NITK, Surathkal. He is Co-founder and Chief Scientific Officer of InterpretOmics, the Data Sciences and Systems Biology company.
Part I - Introduction
Everyday Newspaper Headlines
Structure of the Tutorial
• Introduction to Omic Sciences
• Omic Sciences Challenges
• Computational Biology
• Algorithms & Data Mining in Biology
• Blood Biopsy – a case study
Goal of this Tutorial
• This tutorial defines the role of Big Data and Data Sciences in biology and the life sciences. Chemistry and physics have given us some understanding of biology, and advances in technology now make the next leap possible. We need mathematics and computers to solve the grand challenges in biology: a better understanding of life and of genomics, the building block of life. This will help solve real-life problems such as disease management and the management of food and the environment.
Leading causes of death (U.S., 1999)
Rank   Cause                       Number of deaths   % of total deaths
1      heart disease               725,192            30.3
2      malignant neoplasm          549,192            23.0
3      cerebrovascular disease     167,366            7.0
4      chronic lower respiratory   124,181            5.2
5      accidents                   97,860             4.1
6      diabetes mellitus           68,399             2.9
7      influenza, pneumonia        63,730             2.7
8      Alzheimer’s disease         44,536             1.9
9      nephritis & related         35,525             1.5
10     septicemia                  30,680             1.3
11 …   all other                   2,391,399          20.2
Source: National Vital Statistics Reports 49(11):1-87, 2001.
Classification of Disease
Genomics and World Health
• “It is now believed that the information generated by genomics will, in the long-term, have major benefits for the prevention, diagnosis and management of many diseases which hitherto have been difficult or impossible to control. These include communicable and genetic diseases, together with other common killers or causes of chronic ill health, including cardiovascular disease, cancer, diabetes, the major psychoses, dementia, rheumatic disease, asthma, and many others.”
– Genomics and World Health, Report of the Advisory Committee on Health Research, presented to the Director-General of WHO on 20 December 2001; Ref: Jeffrey D. Sachs, WHO, Geneva, 2002
Genomics and Food Chain
• To develop high-nutrient food and high-yield crops, we need to understand the genetic structure of plants and the disease vectors.
• We also need GMO (Genetically Modified Organism) crops that can grow and produce in hostile environments such as drought-affected or highly saline areas.
Genomics and Energy
• Most of our energy comes from fossil fuels such as coal and petroleum, which formed from living organisms over millions of years
• Can we culture organisms that will reduce this cycle to a few years instead of millions of years?
• Can we generate bio-fuels that will be economical and commercially viable?
Genomics and Environment
• Can we culture organisms that will help the carbon cycle and reduce CO2?
• Can we culture organisms or plants that will desalinate sea water and produce fresh drinking water?
• Can we culture organisms or plants that will clean the environment and accelerate the biodegradation of waste?
Genetic Components of Disease
Alzheimer’s Disease
Landmark Discoveries
• 1941 Genes code for single proteins
• 1944 Proof that DNA carries genetic information
• 1949 The concept of sickle cell anaemia as a “molecular disease”
• 1953 Structure of insulin determined
• 1953 Multistage mutational theory of cancer by Nordling
• 1953 Field Cancerization theory of cancer
• 1953 Structure of Nucleic Acid and DNA determined
• 1956 Monogenic disease due to a single amino acid substitution of the β-chain of haemoglobin
• 1960 The X-ray crystallographic structure of haemoglobin
• 1961 The genetic code, messenger RNA, gene regulation
• 1972 Recombinant DNA, cloning and gene isolation
• 1974 Direct demonstration of a human gene deletion
• 1975 Southern blotting*
• 1976 Proto-oncogenes
• 1977 DNA sequencing
• 1978 Human gene library
• 1979 Restriction fragment length polymorphism used for prenatal diagnosis; stop codon mutation demonstrated in human globin messenger RNA; cellular oncogenes
• 1979–81 Human genes cloned and sequenced
• 1985 “Disease genes” isolated by positional cloning; polymerase chain reaction (PCR)
• 2000 The Human Genome Project — completion of 90% draft
Questions Biologists Often Ask
Biologists need answers to a number of questions
• How can we extract all the knowledge contained in given sequence or structural data?
– analysis
– prediction of certain properties
• How can software tools help in designing drugs and curing diseases based on available data?
– Tools for the early drug discovery process
– Tools to predict and treat diseases before they manifest
Omic Sciences
• Genomics – the study of the genome, the "basic recipe" book that defines the characteristics of an individual, a population, or a living species
• Transcriptomics – the study of RNA transcripts, that is, how the "basic recipes" are read out on the way to the final product: the proteins
• Proteomics – the study of all proteins produced by the expression of the genome
• Metabolomics – the study of interactions between proteins and all "metabolites" (sugar, fat, biomolecules, etc.) of a cell or a biological entity
• Physiomics – the study of interaction with physiology
• Fluxomics – the study of dynamic changes of molecules within a cell over time
• Sociomics – the study of all social and cultural ecosystems that interact with the genomes
• Epigenomics – the study of the influence of the environmental imprint on the "coat" that covers the genetic material in the genome
• Phenomics – the study of phenotype
• Bibliomics – the study of literature
Genomics
• Genomics is the study of the genomes of organisms. The field includes intensive efforts to determine the entire DNA sequence of organisms and fine-scale genetic mapping efforts. The field also includes studies of intragenomic phenomena such as heterosis, epistasis, pleiotropy and other interactions between loci and alleles within the genome. In contrast, the investigation of the roles and functions of single genes is a primary focus of molecular biology or genetics and is a common topic of modern medical and biological research. Research of single genes does not fall into the definition of genomics unless the aim of this genetic, pathway, and functional information analysis is to elucidate its effect on, place in, and response to the entire genome's networks.
Gene
• With the exception of viruses, which are intracellular parasites, living organisms are divided into two general classes. First, there are eukaryotes whose cells have a complex compartmentalized internal structure; they comprise algae, fungi, plants and animals. Second, there are prokaryotes, single-celled microorganisms with a simple internal organization, which comprise bacteria and related organisms. Genetic information is transferred from one generation to the next by subcellular structures called chromosomes. Prokaryotes usually have a single circular chromosome, while most eukaryotes have more than two and in some cases up to several hundred. For example, in humans there are 23 pairs; one of the pair is inherited from each parent. Twenty-two pairs are called autosomes and one pair are called sex chromosomes. The latter are designated X and Y; females have two X chromosomes (XX) while males have an X and Y (XY).
Genetics Vs Genomics
• Genetics is Biology
• Genomics is Statistical Data Mining
• Genetics is Confirmatory
• Genomics is Exploratory
• Genetics is hypothesis driven
• Genomics is hypothesis creating
Genomics 3.0
• Genomics 1.0: started with the Human Genome Project; used by academics and researchers to understand disease dynamics and the genotype-phenotype association of a living system, at a time when clinicians treated the symptoms of a disease (the phenotype)
• Genomics 2.0: entered the clinic and pharmaceutical companies through translational genomics. It is used today as a tool for the diagnosis of non-communicable and genetic diseases. Clinicians use Genomics 2.0 not just to treat symptoms but to treat the disease
• Genomics 3.0: will deal with holistic precision medicine and will be driven by the big-data genomic analytics of the 21st century. Genomics 3.0 will be used to address disease onset while it is still asymptomatic. It will not just treat a disease, but treat a patient and cure a disease
Reduction Vs Integration
What is a System?
• A system is a whole entity made of a set of interacting or interdependent components that form an integrated whole
• It can be a collection of elements (often called 'components') and relationships among them that differ from the relationships of the set or its elements to other elements or sets
• Outside the whole, an interdependent component may have a different property or may exhibit no property at all
• When these components are combined, they become a whole system whose static and dynamic properties are completely different from those of the individual components
Systems Biology
• Systems Biology is about the integration of modeling, simulation, experimentation, databases, and bioinformatic approaches
• It aims at a predictive understanding of microbial and plant systems to advance clinical medicine, high-yield crops, high-nutrient produce, biofuels, biological control of carbon cycling, cleaning up contaminated environments, etc.
The Synergy
Genomics
Transcriptomics
Proteomics
Metabolomics
Fluxomics
Sociomics
Epigenomics
Systems Biology
........
Bibliomics
Model
• Scientific modelling is an activity to make a particular function or entity of the real world easier to define, quantify, visualize, understand, or simulate by referencing it to existing and usually commonly accepted knowledge
• A simulator should be able to model the actual system in Reduced or Enlarged Space & Time
• Key issues in simulation include representation of the true characteristics, function, and behaviours of the original system in a space that can be manipulated or changed as desired
• However, in many cases the similarity is only approximate or even intentionally distorted.
Biological System
Ways To Study A System*
Deductive and Inductive Science
Ref: Sylvia Wassertheil-Smoller, Biostatistics and Epidemiology, Springer, 2003
Physical Science
Law of Gravitation,
Newton's Law of Motion
E = mc^2
Chemical/Molecular Properties
Statistics
Biological Phenomenon
Simulation (Model fitting)
Wireless Mobile Communication
Clinical Trial
Technical Attractions of Simulation
• Ability to compress time, expand time
• Ability to control sources of variation
• Avoids errors in measurement
• Ability to stop and review
• Ability to restore system state
• Facilitates replication
• Modeler can control level of detail
Ref: G. Fishman, Discrete-Event Simulation: Modeling, Programming, and Analysis, 2001
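Several of these attractions can be seen in even a very small simulation. The sketch below is illustrative only and not taken from the tutorial: the single-server queue, the function name `simulate_queue`, and all parameter values are assumptions. A fixed random seed controls the source of variation and makes a run exactly replicable, and the simulated clock jumps from one arrival to the next, so long stretches of simulated time are compressed into a short computation.

```python
import random


def simulate_queue(seed, n_customers=1000, mean_interarrival=1.0, mean_service=0.8):
    """Simulate a single-server queue; return the mean waiting time.

    All times are in arbitrary simulated time units (hypothetical model).
    """
    rng = random.Random(seed)      # fixed seed: controlled, repeatable variation
    clock = 0.0                    # simulated time, advanced arrival by arrival
    server_free_at = 0.0           # time at which the server next becomes idle
    total_wait = 0.0
    for _ in range(n_customers):
        clock += rng.expovariate(1.0 / mean_interarrival)   # next arrival time
        start = max(clock, server_free_at)                  # wait if server is busy
        total_wait += start - clock
        server_free_at = start + rng.expovariate(1.0 / mean_service)
    return total_wait / n_customers


# Replication: the same seed reproduces the run exactly,
# while a different seed gives an independent replicate.
first = simulate_queue(seed=1)
repeat = simulate_queue(seed=1)
other = simulate_queue(seed=2)
print(first == repeat, first, other)
```

Changing `n_customers` or the two mean values is how the modeler controls the level of detail and the load on the system in this toy model.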
Simulation System
Part II – Some Biology
Precision Medicine Will Transform the Health Care Industry
Will impact the health care system significantly:
• Pharmaceuticals
• Biotechnology
• Healthcare industry
• Health insurance
• Medicine: diagnostics, therapy, prevention, wellness
• Nutrition
• Assessments of environmental toxicities
• Academia and medical schools
Healthcare System
New ideas need new organizational structures
Instruments to Decipher Various Types of Biological Information
Protein interactions: Yeast two-hybrid method
Watson-Crick Model of DNA
• Based on X-ray data from Rosalind Franklin, recognized that the 3.4 Angstrom period suggested a double helix.
• Based on Chargaff’s rule ([A]=[T] and [C]=[G]), recognized that the two strands must be held together by H-bonds between purine and pyrimidine pairs.
• Accepted the assumption that nucleotides were held together by phosphodiester bonds with phosphate as the chain backbone.
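Chargaff's rule follows directly from complementary pairing: if every A on one strand faces a T on the other, and every C faces a G, then the counts must match across the duplex. The short sketch below checks this for a made-up sequence; the sequence and the helper name `duplex_base_counts` are illustrative assumptions, not data from the tutorial.

```python
from collections import Counter

# Watson-Crick partners for each base
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def duplex_base_counts(strand):
    """Count bases over both strands of the duplex formed by `strand`."""
    partner = "".join(COMPLEMENT[base] for base in reversed(strand))
    return Counter(strand + partner)


counts = duplex_base_counts("GCCATTGACA")   # hypothetical example sequence
# Chargaff's rule: [A] = [T] and [C] = [G] in double-stranded DNA
print(counts["A"] == counts["T"], counts["C"] == counts["G"])
```

Note that the rule holds for the duplex as a whole; a single strand on its own need not have equal numbers of A and T.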
Watson & Crick
• James D. Watson and Francis Crick, using X-ray data collected by Rosalind Franklin, proposed the double helix structure of the DNA molecule in 1953. Their article, Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid, is celebrated for its treatment of the B form of DNA (B-DNA), and as the source of Watson-Crick base pairing of nucleotides. Together with Maurice Wilkins, they were awarded the Nobel Prize in Physiology or Medicine in 1962.
The Journal Article that Won the Nobel Prize
Interactions within a Cell
Animal Plant
Nucleus
Ribosome
Endoplasmic Reticulum
Golgi Body
Ribosome: site where proteins are made
Nucleus
Chromosome
DNA
Nucleic Acid
Nucleotide
Inside the Nucleus
Nucleic Acids
• Deoxyribonucleic acid (DNA)
– DNA is found in the nucleus with small amounts in mitochondria and chloroplasts
• Ribonucleic acid (RNA)
– RNA is found throughout the cell
© 2007 Paul Billiet ODWS
Watson-Crick Model of DNA
• Chains are in an antiparallel orientation
• Bases are stacked perpendicular to the helix axis and associate through hydrogen bonds
• Each turn spans 34 Angstroms and contains 10 base pairs
• The helix has major and minor grooves
• The double helix has a 20 Angstrom diameter
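The figures on this slide fix the geometry of the helix: 34 Angstroms per turn divided by 10 bases per turn gives a rise of 3.4 Angstroms per base pair. As a rough worked example, the sketch below turns a base-pair count into a contour length and a number of turns; the function names and the 1,000 bp example are assumptions made for illustration.

```python
ANGSTROM_PER_TURN = 34.0      # from the slide
BASE_PAIRS_PER_TURN = 10      # from the slide
RISE_PER_BASE_PAIR = ANGSTROM_PER_TURN / BASE_PAIRS_PER_TURN   # 3.4 Angstrom


def helix_length_angstrom(n_base_pairs):
    """Contour length of an idealised double helix, in Angstroms."""
    return n_base_pairs * RISE_PER_BASE_PAIR


def helix_turns(n_base_pairs):
    """Number of full helical turns for the given number of base pairs."""
    return n_base_pairs / BASE_PAIRS_PER_TURN


# A hypothetical 1,000 bp fragment: about 3,400 Angstroms (340 nm) and 100 turns
length = helix_length_angstrom(1000)
turns = helix_turns(1000)
print(round(length), turns)
```

This treats the molecule as an ideal, straight helix; real DNA bends and supercoils, so the value is a contour length rather than an end-to-end distance.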
ADDING IN THE BASES
• The bases are attached to the 1st Carbon
• Their order is important: it determines the genetic information of the molecule
[Figure: a phosphate (P) backbone with the bases G, C, C, A, T, T attached]
© 2007 Paul Billiet ODWS
Nucleotide Base Pairing
Nucleotides pair by forming H-bonds between bases. The pairing is the basis for the antiparallel strands associating with each other.
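Because the two strands are antiparallel, the partner of a strand read 5' to 3' is its reverse complement. The sketch below computes it and also counts the hydrogen bonds in the duplex, using the standard values of two bonds per A-T pair and three per G-C pair (a textbook fact, not stated on the slide). The function names and the example sequence are illustrative assumptions.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
# Hydrogen bonds formed by each base with its partner: A-T pairs 2, G-C pairs 3
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(strand):
    """Return the partner strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def hydrogen_bonds(strand):
    """Total H-bonds holding `strand` to its partner strand."""
    return sum(H_BONDS[base] for base in strand)


top = "GCCATT"                      # hypothetical strand, 5' to 3'
bottom = reverse_complement(top)    # "AATGGC", also 5' to 3'
print(top, bottom, hydrogen_bonds(top))   # 15 H-bonds for this duplex
```

Taking the reverse complement twice returns the original strand, which is a quick sanity check on any implementation.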
[Figure: single-stranded DNA and double-stranded DNA, with the 5’ and 3’ ends labelled]
Proteins play key roles in a living system
• Three examples of protein functions
– Catalysis: Almost all chemical reactions in a living cell are catalyzed by protein enzymes. Example: alcohol dehydrogenase oxidizes alcohols to aldehydes or ketones.
– Transport: Some proteins transport various substances, such as oxygen, ions, and so on. Example: haemoglobin carries oxygen.
– Information transfer: For example, hormones. Insulin controls the amount of sugar in the blood.
Amino acid: Basic unit of protein
[Figure: an amino acid. A central carbon (C) carries an amino group (NH3+), a carboxylic acid group (COO-), a hydrogen (H), and a side chain (R)]
Different side chains, R, determine the properties of the 20 amino acids.
Proteins are linear polymers of amino acids
[Figure: amino acids with side chains R1, R2, R3 joined head to tail by peptide bonds, with one H2O released per bond; a chain of residues labelled A, A, F, N, G, G, S, T, S, D, K illustrates a sequence]
• A carboxylic acid group condenses with an amino group, releasing a water molecule and forming a peptide bond
• The amino acid sequence is called the primary structure
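The bookkeeping of this condensation is simple: joining n amino acids into one linear chain forms n - 1 peptide bonds and releases n - 1 water molecules. A minimal sketch, in which the function name `condense` and the one-letter example sequence are illustrative assumptions:

```python
def condense(n_residues):
    """Return (peptide_bonds, water_molecules_released) for a linear chain."""
    if n_residues < 1:
        raise ValueError("a chain needs at least one amino acid")
    bonds = n_residues - 1      # one bond between each pair of neighbours
    return bonds, bonds         # each bond releases exactly one H2O


primary_structure = "AAFNGGSTSDK"   # hypothetical 11-residue sequence
bonds, waters = condense(len(primary_structure))
print(bonds, waters)                # 10 peptide bonds, 10 water molecules
```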
Gene is protein’s blueprint, genome is life’s blueprint
[Figure: the genome (DNA) contains many genes; each gene encodes a protein]
Gene is protein’s blueprint, Genome is life’s blueprint
[Figure: the genes of the genome encode proteins, which together form networks such as the glycolysis network]
Amino acid sequence is encoded by DNA base sequence in a gene
First    Third    Second letter
letter   letter   G          A          C          T
G        G        GGG Gly    GAG Glu    GCG Ala    GTG Val
G        A        GGA Gly    GAA Glu    GCA Ala    GTA Val
G        C        GGC Gly    GAC Asp    GCC Ala    GTC Val
G        T        GGT Gly    GAT Asp    GCT Ala    GTT Val
A        G        AGG Arg    AAG Lys    ACG Thr    ATG Met
A        A        AGA Arg    AAA Lys    ACA Thr    ATA Ile
A        C        AGC Ser    AAC Asn    ACC Thr    ATC Ile
A        T        AGT Ser    AAT Asn    ACT Thr    ATT Ile
C        G        CGG Arg    CAG Gln    CCG Pro    CTG Leu
C        A        CGA Arg    CAA Gln    CCA Pro    CTA Leu
C        C        CGC Arg    CAC His    CCC Pro    CTC Leu
C        T        CGT Arg    CAT His    CCT Pro    CTT Leu
T        G        TGG Trp    TAG Stop   TCG Ser    TTG Leu
T        A        TGA Stop   TAA Stop   TCA Ser    TTA Leu
T        C        TGC Cys    TAC Tyr    TCC Ser    TTC Phe
T        T        TGT Cys    TAT Tyr    TCT Ser    TTT Phe
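The codon table can be applied mechanically: read the coding strand three bases at a time and look each codon up until a stop codon is reached. The sketch below builds the standard genetic code (DNA alphabet, one-letter amino acid codes, `*` for Stop) and translates a short made-up sequence. The function name `translate` and the example sequence are illustrative assumptions, and the sketch ignores reading-frame detection and alternative genetic codes.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding-strand DNA sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):          # step through whole codons only
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":                    # TAA, TAG or TGA
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical gene fragment: ATG (Met) GCT (Ala) TGG (Trp) TAA (Stop)
print(translate("ATGGCTTGGTAA"))   # MAW
```

A dictionary lookup is used here for clarity; libraries such as Biopython offer equivalent, more complete translation routines.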
Our life is maintained by molecular network systems
Molecular network system in a cell
(From ExPASy Biochemical Pathways; http://www.expasy.org/cgi-bin/show_thumbnails.pl?2)
So how can we meaningfully integrate the data?
Cellular networks:
[Figure: interacting layers of a cell. GENOME (genes, protein-gene interactions), PROTEOME (protein-protein interactions), METABOLISM (bio-chemical reactions, e.g. the Citrate Cycle)]
A Real-life System - Reactome
End of Part I & II
InterpretOmics
Office: Shezan Lavelle, 5th Floor, #15 Walton Road, Bengaluru 560001
Lab: #329, 7th Main, HAL 2nd Stage, Indiranagar, Bengaluru 560008
Phone: +91(80)46623800