BB30055: Genes and genomes
Genomes - Dr. MV Hejmadi (bssmvh)
3 broad areas
(A) Genomes
(B) Applications of genome projects
(C) Genome evolution
Why sequence the genome? 3 main reasons
• A description of the sequence of every gene is valuable. It includes regulatory regions, which help in understanding not only the molecular activities of the cell but also the ways in which they are controlled.
• Identify & characterise important inheritable disease genes or bacterial genes (for industrial use)
• Role of intergenic sequences e.g. satellites, intronic regions etc
History of Human Genome Project (HGP)
1953 – DNA structure (Watson & Crick)
1972 – Recombinant DNA (Paul Berg)
1977 – DNA sequencing (Maxam, Gilbert and Sanger)
1985 – PCR technology (Kary Mullis)
1986 – automated sequencing (Leroy Hood & Lloyd Smith)
1988 – IHGSC established (NIH, DOE), Watson leads
1990 – IHGSC scaled up, BLAST published (Lipman+Myers)
1992 – Watson quits, Venter sets up TIGR
1993 – F Collins heads IHGSC, Sanger Centre (Sulston)
1995 – cDNA microarray
1998 – Celera genomics (J Craig Venter)
2001 – Working draft of human genome sequence published
2003 – Finished sequence announced
Human Genome Project (HGP)
Goal: Obtain the entire DNA sequence of human genome
Players:
(A) International Human Genome Sequencing Consortium (IHGSC)
- public funding, free access to all, started earlier
- used mapping overlapping clones method
(B) Celera Genomics
- private funding, pay to view
- started in 1998
- used whole genome shotgun strategy
Whose genome is it anyway?
(A) International Human Genome Sequencing Consortium (IHGSC) - composite from several different people, generated from 10-20 primary samples taken from numerous anonymous donors across racial and ethnic groups
(B) Celera Genomics - 5 different donors (one of whom was J Craig Venter himself!)
Strategies for sequencing the human genome
sequencing larger genomes
Mapping phase
Sequencing phase
Result….
~30,000 - 40,000 protein-coding genes estimated based on known genes and predictions
                 IHGSC    Celera
definite genes   24,500   26,383
possible genes   5,000    12,000
Organisation of human genome
Nuclear genome (3.2 Gbp)
• 24 types of chromosomes
• sizes range from Y (51 Mb) to chromosome 1 (279 Mb)
Mitochondrial genome
General organisation of human genome
Polypeptide-coding regions
Gene organisation
Rare bicistronic transcription units, e.g. UBA52 transcription generates ubiquitin and a ribosomal protein (L40); the related UBA80 gene generates ubiquitin and ribosomal protein S27a
General organisation of human genome
Non polypeptide–coding: RNA encoding
Pseudogenes (ψ)
Processed pseudogenes
• non-functional copies of exonic sequences of an active gene
• thought to arise by genomic insertion of a cDNA as a result of retroposition
• contribute to overall repetitive elements (<1%)
Pseudogenes in globin gene cluster
Gene fragments or truncated genes
Gene fragments: small segments of a gene (e.g. single exon from a multiexon gene)
Truncated genes: short components of functional genes (e.g. 5’ or 3’ end)
Thought to arise due to unequal crossover or exchange
General organisation of human genome
Repetitive elements
Main classes based on origin
Tandem repeats
Interspersed repeats
Segmental duplications
1) Tandem repeats
Blocks of tandem repeats at
• subtelomeres
• pericentromeres
• short arms of acrocentric chromosomes
• ribosomal gene clusters
Tandem / clustered repeats
class             size of repeat   repeat block   major chromosomal location
Satellite         5-171 bp         > 100kb        centromeric heterochromatin
minisatellite     9-64 bp          0.1–20kb       Telomeres
microsatellites   1-13 bp          < 150 bp       Dispersed
HMG3 by Strachan and Read pp 265-268
Broadly divided into 4 types based on size
Satellites
Large arrays of repeats
Some examples
• Satellite 1, 2 & 3
• α satellite (alphoid DNA) - found in all chromosomes
• β satellite
Minisatellites
Moderate sized arrays of repeats
Some examples
• Hypervariable minisatellite DNA
- core of GGGCAGGAXG
- found in telomeric regions
- used in original DNA fingerprinting technique by Alec Jeffreys
Microsatellites
• VNTRs - Variable Number of Tandem Repeats, SSR - Simple Sequence Repeats
• 1-13 bp repeats e.g. (A)n ; (AC)n
• 2% of genome (dinucleotides - 0.5%)
• Used as genetic markers (especially for disease mapping)
Individual genotype
Microsatellite genotyping
• Design PCR primers unique to one locus in the genome
• A single pair of PCR primers will produce different sized products for each of the different length microsatellites
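The size logic behind microsatellite genotyping can be sketched in a few lines. This is only an illustration: the flank lengths, the (AC)n repeat unit and the repeat counts are made-up values, not data from the lecture.

```python
def pcr_product_size(flank_left, flank_right, repeat_unit, repeat_count):
    """Length in bp of the PCR product spanning one microsatellite allele.

    flank_left / flank_right are the unique flanking stretches (primers
    included) on either side of the repeat block; hypothetical values.
    """
    return flank_left + len(repeat_unit) * repeat_count + flank_right


def genotype_sizes(repeat_counts, repeat_unit="AC", flank_left=60, flank_right=55):
    """Product sizes for the two alleles carried by one diploid individual."""
    return sorted(
        pcr_product_size(flank_left, flank_right, repeat_unit, n)
        for n in repeat_counts
    )


# One primer pair, two individuals: the band sizes differ only because the
# number of repeat units differs between alleles.
heterozygote = genotype_sizes((12, 17))
homozygote = genotype_sizes((15, 15))
print(heterozygote)  # [139, 149]
print(homozygote)    # [145, 145]
```

On a gel or capillary trace the heterozygote would show two products and the homozygote a single one, which is what makes the locus usable as a genetic marker.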
2) Interspersed repeats
A.k.a. Transposon-derived repeats
45% of genome
Arise mainly as a result of transposition either through a DNA or an RNA intermediate
Interspersed repeats (transposon-derived): major types
class            family            size      copy number   % genome
LINE*            L1 (Kpn family)   ~6.4kb    0.5 x 10^6    16.9
                 L2                          0.3 x 10^6    3.2
SINE             Alu               ~0.3kb    1.1 x 10^6    10.6
LTR              e.g. HERV         ~1.3kb    0.3 x 10^6    8.3
DNA transposon   mariner           ~0.25kb   1-2 x 10^4    2.8
* Updated from HGP publications
HMG3 by Strachan & Read pp268-272
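As a rough cross-check of the figures in this table, the sketch below totals the listed genome percentages and estimates the Alu share from copy number and element size. It assumes the 3.2 Gbp nuclear genome size quoted earlier and treats every Alu copy as full length, which is a simplification.

```python
GENOME_BP = 3.2e9  # nuclear genome size quoted earlier in the slides

# "% genome" column of the table, keyed by class/family
repeat_percent = {
    "LINE L1": 16.9,
    "LINE L2": 3.2,
    "SINE Alu": 10.6,
    "LTR (e.g. HERV)": 8.3,
    "DNA transposon (mariner)": 2.8,
}


def percent_of_genome(copy_number, element_bp, genome_bp=GENOME_BP):
    """Share of the genome occupied if every copy were full length."""
    return 100.0 * copy_number * element_bp / genome_bp


listed_total = sum(repeat_percent.values())
alu_estimate = percent_of_genome(1.1e6, 300)

print(round(listed_total, 1))  # 41.8
print(round(alu_estimate, 1))  # 10.3
```

The listed families add up to a little under the 45% quoted for interspersed repeats as a whole, and the simple Alu estimate lands close to the tabulated 10.6%.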
LINEs (long interspersed elements)
• Most ancient elements of eukaryotic genomes
• Autonomous transposition (reverse transcriptase)
• ~6-8kb long
• Internal polymerase II promoter and 2 ORFs
• 3 related LINE families in humans – LINE-1, LINE-2, LINE-3
• Believed to be responsible for retrotransposition of SINEs and creation of processed pseudogenes
Nature (2001) pp879-880 HMG3 by Strachan & Read pp268-272
SINEs (short interspersed elements)
• Non-autonomous (successful freeloaders! ‘borrow’ RT from other sources such as LINEs)
• ~100-300bp long
• Internal polymerase III promoter
• No proteins
• Share 3’ ends with LINEs
• 3 related SINE families in humans – active Alu, inactive MIR and Ther2/MIR3
LINEs and SINEs have preferred insertion sites
• In this example, yellow represents the distribution of mys (a type of LINE) over a mouse genome where chromosomes are orange. There are more mys inserted in the sex (X) chromosomes.
Try the link below to do an online experiment which shows how an Alu insertion polymorphism has been used as a tool to reconstruct the human lineage
http://www.geneticorigins.org/geneticorigins/pv92/intro.html
Long Terminal Repeats (LTR)
Repeats in the same orientation on both sides of element e.g. ATATATNNNNNNNATATAT
• contain sequences that serve as transcription promoters as well as terminators
• These sequences allow the element to code for an mRNA molecule that is processed and polyadenylated.
• At least two genes coded within the element supply essential activities for the retrotransposition mechanism.
• The RNA contains a specific primer binding site (PBS) for initiating reverse transcription.
• A hallmark of almost all mobile elements is that they form small direct repeats at the site of integration.
• Autonomous or non-autonomous
• Autonomous retroposons encode gag and pol genes, which encode the protease, reverse transcriptase, RNAseH and integrase
DNA transposons (lateral transfer?)
Inverted repeats on both sides of element e.g. ATGCNNNNNNNNNNNCGTA
Nature (2001) pp879-880; from Genes VII by Lewin
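The difference between the direct repeats flanking LTR elements and the inverted repeats flanking DNA transposons can be checked mechanically. The sketch below is illustrative only; the function names are hypothetical. Note that the slide example is drawn as a simple mirror image, whereas on one strand of real double-stranded DNA an inverted repeat reads as the reverse complement, so both cases are reported separately.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def terminal_repeat_type(element, k):
    """Compare the first and last k bases of an element.

    Returns "direct", "mirror", "reverse-complement" or "none".
    """
    left, right = element[:k], element[-k:]
    if right == left:
        return "direct"
    if right == left[::-1]:
        return "mirror"
    if right == reverse_complement(left):
        return "reverse-complement"
    return "none"


print(terminal_repeat_type("ATATATNNNNNNNATATAT", 6))  # direct
print(terminal_repeat_type("ATGCNNNNNNNNNNNCGTA", 4))  # mirror
print(terminal_repeat_type("ATGCNNNNGCAT", 4))         # reverse-complement
```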
3) Segmental duplications
Closely related sequence blocks at different genomic loci
Transfer of 1-200kb blocks of genomic sequence
Segmental duplications can occur on homologous chromosomes (intrachromosomal) or non homologous chromosomes (interchromosomal)
Not always tandemly arranged
Relatively recent
Segmental duplications
Interchromosomal: segments duplicated among non-homologous chromosomes
Intrachromosomal: duplications occur within a chromosome / arm
Nature Reviews Genetics 2, 791-800 (2001);
Segmental duplications in chromosome 22
Segmental duplications in chromosome 7
Major insights from the HGP
Nature (2001) 15th Feb Vol 409 special issue; pgs 814 & 875-914.
1) Gene size, content and distribution
2) Proteome content
3) SNP identification
4) Distribution of GC content
5) CpG islands
6) Recombination rates
7) Repeat content
1) Gene size
More genes: Twice as many as Drosophila / C. elegans
Uneven gene distribution: Gene-rich and gene-poor regions
More paralogs: some gene families have extended the number of paralogs e.g. olfactory gene family has 1000 genes
More alternative transcripts: Increased RNA splice variants produced, thereby expanding the primary proteins by 5 fold (e.g. neurexin genes)
Gene content….
Gene distribution
Genes within genes, e.g. NF1 gene
Overlapping genes (transcribed from 2 DNA strands) - Rare
Genes generally dispersed (~1 gene per 100kb)
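The "~1 gene per 100kb" figure follows from numbers given earlier in the slides. A minimal arithmetic sketch, assuming the 3.2 Gbp genome size and the 30,000 to 40,000 gene estimate quoted above, and ignoring the uneven distribution of genes:

```python
GENOME_BP = 3.2e9  # nuclear genome size quoted earlier in the slides


def mean_gene_spacing_kb(gene_count, genome_bp=GENOME_BP):
    """Average genomic span per gene in kb, assuming genes are evenly spread."""
    return genome_bp / gene_count / 1000


for genes in (30_000, 40_000):
    print(genes, round(mean_gene_spacing_kb(genes), 1))
# 30000 106.7
# 40000 80.0
```

The real spacing varies widely around this average, as the gene-rich and gene-poor regions on the following slides show.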
Class III complex at HLA 6p21.3
HMG3 Fig 9.8
Gene-rich regions, e.g. MHC on chromosome 6 has 60 genes with a GC content of 54%
Gene-poor regions: 82 gene deserts identified. Large or unidentified genes?
What is the functional significance of these variations?
Uneven gene distribution
2) Proteome content
Proteome more complex than invertebrates
Protein domains (sections with identifiable shape/function)
Domain arrangements in humans
• largest total number of domains is 130
• largest number of domain types per protein is 9
• Mostly identical arrangement of domains
[Figure: Protein X drawn as a string of repeated domains A, B and C]
Proteome more complex than invertebrates……
• No huge difference in domain number in humans
• BUT, frequency of domain sharing very high in human proteins (structural proteins and proteins involved in signal transduction and immune function)
• However, only 3 cases where a combination of 3 domain types is shared by human & yeast proteins
e.g. carbamoyl-phosphate synthase (involved in the first 3 steps of de novo pyrimidine biosynthesis) has 7 domain types, which occurs once in human and yeast but twice in Drosophila
3) SNPs (single nucleotide polymorphisms)
• More than 1.4 million SNPs identified
• One every 1.9kb length on average
• Densities vary over regions and chromosomes, e.g. HLA region has a high SNP density, reflecting maintenance of diverse haplotypes over many million years
Nature (2001) 15th Feb Vol 409 special issue; pgs 821-823 & 928
• Sites that result from point mutations in individual base pairs
• Biallelic
• ~60,000 SNPs lie within exons and untranslated regions (85% of exons lie within 5kb of a SNP)
• May or may not affect the ORF
• Most SNPs may be regulatory
How does one distinguish sequence errors from polymorphisms?
Sequence errors
• Each piece of genome sequenced at least 10 times to reduce error rate (0.01%)
Polymorphisms
• Sequence variation between individuals is 0.1%
• To be defined as a polymorphism, the altered sequence must be present in a significant proportion of the population
• Rate of polymorphisms in diploid human genome is about 1 in 500 bp
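The SNP figures quoted on these slides can be re-expressed as simple arithmetic. The sketch below only converts the quoted numbers (1.4 million SNPs, one per 1.9kb, 0.01% error rate, 0.1% variation); it is not a method for calling SNPs.

```python
def bp_per_event(rate_percent):
    """Convert a per-base rate in percent into 'one event every N bp'."""
    return 100.0 / rate_percent


snp_count = 1_400_000
mean_snp_spacing_bp = 1_900

# Sequence length implied by 1.4 million SNPs spaced 1.9 kb apart on average
span_covered_bp = snp_count * mean_snp_spacing_bp
print(span_covered_bp)  # 2660000000

# Residual sequencing error vs. variation between individuals
error_spacing = bp_per_event(0.01)
variation_spacing = bp_per_event(0.1)
print(round(error_spacing), round(variation_spacing))  # 10000 1000
```

On these numbers a true difference between individuals is expected about ten times more often than a residual sequencing error, which is why repeated sequencing of each region helps separate the two.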
SNPs and disease
3) SNPs……and risk of disease
N(291)S
3) SNPs……and risk of disease
3 major alleles (APO E2, E3, and E4)
APO E2: Cys112 / Cys158
APO E3: Cys112 / Arg158
APO E4: Arg112 / Arg158
Late-onset Alzheimer's disease (LOAD): the APO E4 allele is a genetic risk factor
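The APO E allele definitions above amount to a lookup on the residues at positions 112 and 158. A minimal sketch of that lookup follows; the function names are hypothetical, and it only flags whether an E4 allele is present, it does not quantify risk.

```python
# (residue 112, residue 158) -> allele, as listed on the slide
APOE_ALLELES = {
    ("Cys", "Cys"): "E2",
    ("Cys", "Arg"): "E3",
    ("Arg", "Arg"): "E4",
}


def apoe_allele(res112, res158):
    """Name the APO E allele from the amino acids at positions 112 and 158."""
    return APOE_ALLELES.get((res112, res158), "unknown")


def carries_e4(allele_a, allele_b):
    """True if either allele of a diploid genotype is E4 (the LOAD risk allele)."""
    return "E4" in (allele_a, allele_b)


maternal = apoe_allele("Cys", "Arg")
paternal = apoe_allele("Arg", "Arg")
print(maternal, paternal)              # E3 E4
print(carries_e4(maternal, paternal))  # True
```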
3) SNPs……and pharmacogenomics
4) Distribution of GC content
• Genome wide average of 41%
• Huge regional variations exist, e.g. distal 48Mb of chromosome 1p has 47% but chromosome 13 has only 36%
• Confirms cytogenetic staining with G-bands (Giemsa)
  dark G-bands – low GC content (37%)
  light G-bands – high GC content (45%)
Nature (2001) 15th Feb Vol 409 special issue; pg 876-877
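Regional GC variation of the kind described here is usually measured in windows along a sequence. The following is a minimal sketch on a toy sequence; the window and step sizes are arbitrary choices, and real analyses would use windows of many kilobases.

```python
def gc_percent(seq):
    """Percentage of G and C among the called bases (A, C, G, T) in seq."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return 0.0
    gc = sum(base in "GC" for base in called)
    return 100.0 * gc / len(called)


def windowed_gc(seq, window, step):
    """GC percentage of each full-length window along seq."""
    return [
        gc_percent(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


toy = "ATATATATAT" + "GCGCGCGCGC"  # an AT-rich block followed by a GC-rich block
print(gc_percent(toy))             # 50.0
print(windowed_gc(toy, 10, 10))    # [0.0, 100.0]
```

The whole-sequence average hides the two very different halves, which is the same effect as a 41% genome-wide average masking regions of 36% and 47%.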
5) CpG islands
Significance of CpG islands
1) Non-methylated CpG islands associated with the 5’ ends of genes
2) Aberrant methylation of CpG islands is one mechanism of inactivating tumor suppressor genes (TSGs) in neoplasia
http://www.sanger.ac.uk/HGP/cgi.shtml
CpG → methyl-CpG (methylated at C) → TpG (deamination)
CpG islands show no methylation
CpG islands
• CpG dinucleotides are greatly under-represented in human genome
• ~28,890 CpG islands in number
• Variable density, e.g. Y – 2.9/Mb but 16, 17 & 22 have 19-22/Mb
• Average is 10.5/Mb
Nature (2001) 15th Feb Vol 409 special issue; pg 877-888
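CpG under-representation is usually expressed as an observed/expected ratio, and islands are the stretches where that ratio stays high. The sketch below assumes the commonly used thresholds of at least 200 bp, at least 50% GC and a ratio of at least 0.6; these thresholds are not taken from the slides, and the function names are hypothetical.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (CpG count x length) / (C count x G count)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def looks_like_cpg_island(seq, min_len=200, min_gc=50.0, min_ratio=0.6):
    """Apply length, GC-content and obs/exp thresholds to one candidate region."""
    seq = seq.upper()
    if len(seq) < min_len:
        return False
    gc = 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
    return gc >= min_gc and cpg_obs_exp(seq) >= min_ratio


print(cpg_obs_exp("CCGG"))                 # 1.0
print(cpg_obs_exp("CATGCATG"))             # 0.0
print(looks_like_cpg_island("CG" * 100))   # True
print(looks_like_cpg_island("CATG" * 50))  # False
```

Bulk genomic DNA scores well below 1 because methylated CpG tends to deaminate to TpG over time, while unmethylated islands keep a ratio close to what base composition predicts.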
6) Recombination rates
2 main observations
• Recombination rate increases with decreasing arm length
• Recombination rate suppressed near the centromeres and increases towards the distal 20-35Mb
7) Repeat content
a) Age distribution
b) Comparison with other genomes
c) Variation in distribution of repeats
d) Distribution by GC content
e) Y chromosome
Nature (2001) 409: pp 881-891
Repeat content…….
Most interspersed repeats predate eutherian radiation (confirms the slow rate of clearance of nonfunctional sequence from vertebrate genomes)
LINEs and SINEs have extremely long lives
2 major peaks of transposon activity
No DNA transposition in the past 50MYr
LTR retroposons teetering on the brink of extinction
a) Age distribution
overall decline in interspersed repeat activity in hominid lineage in the past 35-40MYr
compared to mouse genome, which shows a younger and more dynamic genome
a) Age distribution
b) Comparison with other genomes
Higher density of transposable elements in euchromatic portion of genome
Higher abundance of ancient transposons
60% of interspersed repeats (IR) made up of LINE1 and Alu repeats, whereas DNA transposons represent only 6%
(a few human genes appear likely to have resulted from horizontal transfer from bacteria!!)
c) Variation in distribution of repeats
Some regions show either
• High repeat density, e.g. chromosome Xp11 – a 525kb region shows 89% repeat density
• Low repeat density, e.g. HOX homeobox gene cluster (<2% repeats) (indicative of regulatory elements which have low tolerance for insertions)
High GC – gene rich ; High AT – gene poor
LINEs abundant in AT-rich regions
SINEs lower in AT-rich regions
Alu repeats in particular retained in actively transcribed GC rich regions, e.g. chromosome 19 has 5% Alus compared to Y chromosome
d) Distribution by GC content
Unusually young genome (high tolerance to gaining insertions)
Mutation rate is 2.1X higher in male germline
Possibly due to cell division rates or different repair mechanisms
e) The Y chromosome !
• Working draft published – Feb 2001
• Finished sequence – April 2003
• Annotation of genes going on
(refer: International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 21 October 2004 (doi: 10.1038/nature03001))
Other genomes sequenced
2002: Mus musculus, 36,000 genes
Sept 2003: Canis, 18,473 human orthologs
1997: 4,200 genes
1998: 19,099 genes
2002: 38,000 genes
Science (26 Sep 2003) Vol 301(5641) pp1854-1855
31 Aug 2005: Pan troglodytes, 28% identical human orthologs
References
1) Chapter 9 pp 265-268, HMG 3 by Strachan and Read
2) Chapter 10 pp 339-348, Genetics from genes to genomes by Hartwell et al (2/e)
3) Nature (2001) 409: pp 879-891