
BB30055: Genes and genomes

Genomes - Dr. MV Hejmadi (bssmvh)

3 broad areas

(A) Genomes

(B) Applications of genome projects

(C) Genome evolution

Why sequence the genome? 3 main reasons

• A description of the sequence of every gene is valuable. This includes regulatory regions, which help in understanding not only the molecular activities of the cell but also the ways in which they are controlled.

• Identify & characterise important inheritable disease genes or bacterial genes (for industrial use)

• Role of intergenic sequences, e.g. satellites, intronic regions etc

History of Human Genome Project (HGP)

1953 – DNA structure (Watson & Crick)
1972 – Recombinant DNA (Paul Berg)
1977 – DNA sequencing (Maxam, Gilbert and Sanger)
1985 – PCR technology (Kary Mullis)
1986 – Automated sequencing (Leroy Hood & Lloyd Smith)
1988 – IHGSC established (NIH, DOE); Watson leads
1990 – IHGSC scaled up; BLAST published (Lipman + Myers)
1992 – Watson quits; Venter sets up TIGR
1993 – F Collins heads IHGSC; Sanger Centre (Sulston)
1995 – cDNA microarray
1998 – Celera Genomics (J Craig Venter)
2001 – Working draft of human genome sequence published
2003 – Finished sequence announced

Human Genome Project (HGP)

Goal: Obtain the entire DNA sequence of human genome

Players:

(A) International Human Genome Sequence Consortium (IHGSC)
- public funding, free access to all, started earlier
- used mapping overlapping clones method

(B) Celera Genomics
- private funding, pay to view
- started in 1998
- used whole genome shotgun strategy

Whose genome is it anyway?

(A) International Human Genome Sequence Consortium (IHGSC)
- composite from several different people, generated from 10-20 primary samples taken from numerous anonymous donors across racial and ethnic groups

(B) Celera Genomics
- 5 different donors (one of whom was J Craig Venter himself!)

Strategies for sequencing the human genome

sequencing larger genomes

Mapping phase

Sequencing phase

Result….

~30,000 - 40,000 protein-coding genes estimated, based on known genes and predictions

                  IHGSC     Celera
definite genes    24,500    26,383
possible genes     5,000    12,000

Organisation of human genome

Nuclear genome (3.2 Gbp): 24 types of chromosomes (e.g. Y: 51 Mb; chr 1: 279 Mb)

Mitochondrial genome

General organisation of human genome

Polypeptide-coding regions

Gene organisation

Rare bicistronic transcription units, e.g. UBA52 and UBA80 transcription each generates ubiquitin and a ribosomal protein (L40 and S27a respectively)

Non polypeptide–coding: RNA encoding

Pseudogenes (ψ)

Processed pseudogenes: non-functional copies of exonic sequences of an active gene.

Thought to arise by genomic insertion of a cDNA as a result of retroposition

Contributes to overall repetitive elements (<1%)

Pseudogenes in globin gene cluster

Gene fragments or truncated genes

Gene fragments: small segments of a gene (e.g. single exon from a multiexon gene)

Truncated genes: short components of functional genes (e.g. 5’ or 3’ end)

Thought to arise due to unequal crossover or exchange

Repetitive elements

Main classes based on origin

Tandem repeats

Interspersed repeats

Segmental duplications

1) Tandem repeats

Blocks of tandem repeats at
• subtelomeres
• pericentromeres
• short arms of acrocentric chromosomes
• ribosomal gene clusters

Tandem / clustered repeats

class            Size of repeat   Repeat block   Major chromosomal location
Satellite        5-171 bp         > 100 kb       centromeric heterochromatin
minisatellite    9-64 bp          0.1–20 kb      Telomeres
microsatellites  1-13 bp          < 150 bp       Dispersed

HMG3 by Strachan and Read pp 265-268

Broadly divided into 4 types based on size

Satellites

Large arrays of repeats

Some examples
- Satellite 1, 2 & 3
- α satellite (alphoid DNA): found in all chromosomes
- β satellite


Minisatellites

Moderate sized arrays of repeats

Some examples
Hypervariable minisatellite DNA
- core of GGGCAGGAXG
- found in telomeric regions
- used in original DNA fingerprinting technique by Alec Jeffreys


Microsatellites

VNTRs - Variable Number of Tandem Repeats; SSR - Simple Sequence Repeats

1-13 bp repeats, e.g. (A)n; (AC)n

2% of genome (dinucleotides - 0.5%)

Used as genetic markers (especially for disease mapping)

Individual genotype

Microsatellite genotyping

• Design PCR primers unique to one locus in the genome
• A single pair of PCR primers will produce different sized products for each of the different length microsatellites
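The sizing logic behind microsatellite genotyping can be sketched in a few lines of Python. This is only an illustration, not a lab protocol: the 120 bp of primers plus flanking sequence, the (AC)n locus and the repeat counts are all hypothetical values chosen for the example.

```python
def pcr_product_size(flank_bp: int, repeat_unit: str, n_repeats: int) -> int:
    """Product length = primers and flanking sequence + the repeat array itself."""
    return flank_bp + len(repeat_unit) * n_repeats


def genotype(flank_bp: int, repeat_unit: str, allele_repeat_counts) -> list:
    """Sorted product sizes expected for the two alleles of one individual."""
    return sorted(
        pcr_product_size(flank_bp, repeat_unit, n) for n in allele_repeat_counts
    )


# Hypothetical (AC)n locus amplified with one primer pair
heterozygote = genotype(120, "AC", (15, 19))
homozygote = genotype(120, "AC", (17, 17))

print(heterozygote)  # [150, 158]
print(homozygote)    # [154, 154]
```

A heterozygote therefore gives two products of different size from the same primer pair, while a homozygote gives a single size.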

2) Interspersed repeats

A.k.a. Transposon-derived repeats

45% of genome

Arise mainly as a result of transposition either through a DNA or an RNA intermediate

Interspersed repeats (transposon-derived): major types

class            family            size       Copy number    % genome
LINE*            L1 (Kpn family)   ~6.4 kb    0.5 x 10^6     16.9
                 L2                           0.3 x 10^6     3.2
SINE             Alu               ~0.3 kb    1.1 x 10^6     10.6
LTR              e.g. HERV         ~1.3 kb    0.3 x 10^6     8.3
DNA transposon   mariner           ~0.25 kb   1-2 x 10^4     2.8

* Updated from HGP publications
HMG3 by Strachan & Read pp268-272
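As a quick cross-check of the table, the percentages can be totalled per class. The sketch below simply re-enters the "% genome" column from the slide; the variable and function names are made up for the example.

```python
# (class, family, % of genome) copied from the table
interspersed_repeats = [
    ("LINE", "L1", 16.9),
    ("LINE", "L2", 3.2),
    ("SINE", "Alu", 10.6),
    ("LTR", "HERV", 8.3),
    ("DNA transposon", "mariner", 2.8),
]


def percent_by_class(rows):
    """Sum the genome percentage of every family belonging to the same class."""
    totals = {}
    for repeat_class, _family, percent in rows:
        totals[repeat_class] = totals.get(repeat_class, 0.0) + percent
    return {name: round(value, 1) for name, value in totals.items()}


by_class = percent_by_class(interspersed_repeats)
total = round(sum(by_class.values()), 1)

print(by_class["LINE"])  # 20.1
print(total)             # 41.8
```

The listed families add up to 41.8%, slightly below the 45% quoted for all interspersed repeats, presumably because the table shows only the major families.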

LINEs (long interspersed elements)

• Most ancient of eukaryotic genomes
• Autonomous transposition (reverse transcriptase)
• ~6-8 kb long
• Internal polymerase II promoter and 2 ORFs
• 3 related LINE families in humans: LINE-1, LINE-2, LINE-3
• Believed to be responsible for retrotransposition of SINEs and creation of processed pseudogenes

Nature (2001) pp879-880
HMG3 by Strachan & Read pp268-272

SINEs (short interspersed elements)

• Non-autonomous (successful freeloaders! ‘borrow’ RT from other sources such as LINEs)
• ~100-300 bp long
• Internal polymerase III promoter
• No proteins
• Share 3’ ends with LINEs
• 3 related SINE families in humans: active Alu, inactive MIR and Ther2/MIR3

LINEs and SINEs have preferred insertion sites

• In this example, yellow represents the distribution of mys (a type of LINE) over a mouse genome where chromosomes are orange. There are more mys inserted in the sex (X) chromosomes.

Try the link below to do an online experiment which shows how an Alu insertion polymorphism has been used as a tool to reconstruct the human lineage

http://www.geneticorigins.org/geneticorigins/pv92/intro.html

Long Terminal Repeats (LTR)

• Repeats in the same orientation on both sides of element, e.g. ATATATNNNNNNNATATAT
• Contain sequences that serve as transcription promoters as well as terminators
• These sequences allow the element to code for an mRNA molecule that is processed and polyadenylated
• At least two genes coded within the element to supply essential activities for the retrotransposition mechanism
• The RNA contains a specific primer binding site (PBS) for initiating reverse transcription
• A hallmark of almost all mobile elements is that they form small direct repeats at the site of integration
• Autonomous or non-autonomous
• Autonomous retroposons encode gag and pol genes, which encode the protease, reverse transcriptase, RNAseH and integrase

Nature (2001) pp879-880
HMG3 by Strachan & Read pp268-272

DNA transposons (lateral transfer?)

Inverted repeats on both sides of element, e.g. ATGCNNNNNNNNNNNGCAT

Nature (2001) pp879-880
From Genes VII by Lewin
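The difference between direct terminal repeats (as in LTR elements) and inverted terminal repeats (as in DNA transposons) can be checked programmatically. This is a minimal sketch that assumes the usual single-strand convention: read 5'→3', an inverted repeat at the right-hand end is the reverse complement of the left-hand end. The function names and the toy sequences are illustrative only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string (N and other symbols are left as-is)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def has_direct_terminal_repeat(element: str, k: int) -> bool:
    """True if the first and last k bases are identical (same orientation)."""
    return element[:k].upper() == element[-k:].upper()


def has_inverted_terminal_repeat(element: str, k: int) -> bool:
    """True if the last k bases are the reverse complement of the first k."""
    return element[-k:].upper() == reverse_complement(element[:k])


ltr_like = "ATATATNNNNNNNATATAT"         # toy element with direct repeats
transposon_like = "ATGCNNNNNNNNNNNGCAT"  # toy element with inverted repeats

print(has_direct_terminal_repeat(ltr_like, 6))           # True
print(has_inverted_terminal_repeat(transposon_like, 4))  # True
print(has_direct_terminal_repeat(transposon_like, 4))    # False
```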

3) Segmental duplications

Closely related sequence blocks at different genomic loci

Transfer of 1-200kb blocks of genomic sequence

Segmental duplications can occur on homologous chromosomes (intrachromosomal) or non homologous chromosomes (interchromosomal)

Not always tandemly arranged

Relatively recent

Interchromosomal duplications: segments duplicated among non-homologous chromosomes

Intrachromosomal duplications: occur within a chromosome / arm

Nature Reviews Genetics 2, 791-800 (2001)

Segmental duplications in chromosome 22

Segmental duplications in chromosome 7

Nature Reviews Genetics 2, 791-800 (2001)

Major insights from the HGP

Nature (2001) 15th Feb Vol 409 special issue; pgs 814 & 875-914.

1) Gene size, content and distribution

2) Proteome content

3) SNP identification

4) Distribution of GC content

5) CpG islands

6) Recombination rates

7) Repeat content

1) Gene size

More genes: twice as many as Drosophila / C. elegans

Uneven gene distribution: gene-rich and gene-poor regions

More paralogs: some gene families have extended the number of paralogs, e.g. olfactory gene family has 1000 genes

More alternative transcripts: increased RNA splice variants produced, thereby expanding the primary proteins by 5 fold (e.g. neurexin genes)

Gene content….

Gene distribution

Genes within genes, e.g. NF1 gene

Overlapping genes (transcribed from 2 DNA strands) - Rare

Genes generally dispersed (~1 gene per 100kb)
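The average spacing of ~1 gene per 100 kb can be checked against the genome size and gene estimates quoted earlier in these slides. The sketch below is plain arithmetic on those two figures; the constant and function names are made up for the example.

```python
GENOME_SIZE_BP = 3_200_000_000  # nuclear genome, 3.2 Gbp (from the slides)
MEAN_SPACING_BP = 100_000       # ~1 gene per 100 kb (from the slides)

# Number of genes implied if genes were evenly spaced across the genome
expected_genes = GENOME_SIZE_BP // MEAN_SPACING_BP


def genes_per_mb(gene_count: int, region_bp: int) -> float:
    """Gene density of a region, in genes per megabase."""
    return gene_count / (region_bp / 1_000_000)


print(expected_genes)                                # 32000
print(genes_per_mb(expected_genes, GENOME_SIZE_BP))  # 10.0
```

The implied figure of about 32,000 genes falls inside the ~30,000 - 40,000 estimate, with a genome-wide average of 10 genes per Mb against which gene-rich and gene-poor regions can be compared.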

Class III complex at HLA 6p21.3

HMG3 Fig 9.8

Gene-rich regions, e.g. MHC on chromosome 6 has 60 genes with a GC content of 54%

Gene-poor regions: 82 gene deserts identified (large or unidentified genes?)

What is the functional significance of these variations?

Uneven gene distribution

2) Proteome content: proteome more complex than invertebrates

Protein domains (sections with identifiable shape/function)

Domain arrangements in humans
• largest total number of domains is 130
• largest number of domain types per protein is 9
• mostly identical arrangement of domains

(Schematic: protein X drawn as a string of repeated domains A, B and C)

Proteome more complex than invertebrates……

No huge difference in domain number in humans, BUT frequency of domain sharing is very high in human proteins (structural proteins and proteins involved in signal transduction and immune function)

However, only 3 cases where a combination of 3 domain types is shared by human & yeast proteins.

e.g. carbamoyl-phosphate synthase (involved in the first 3 steps of de novo pyrimidine biosynthesis) has 7 domain types, which occurs once in human and yeast but twice in Drosophila

3) SNPs (single nucleotide polymorphisms)

More than 1.4 million SNPs identified
One every 1.9 kb length on average
Densities vary over regions and chromosomes, e.g. HLA region has a high SNP density, reflecting maintenance of diverse haplotypes over many MYr

Nature (2001) 15th Feb Vol 409 special issue; pgs 821-823 & 928

• Sites that result from point mutations in individual base pairs
• Biallelic
• ~60,000 SNPs lie within exons and untranslated regions (85% of exons lie within 5 kb of a SNP)
• May or may not affect the ORF
• Most SNPs may be regulatory
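The SNP figures quoted on these slides can be tied together with a little arithmetic. The sketch below uses only the three numbers from the slides (1.4 million SNPs, one per 1.9 kb, ~60,000 in exons and untranslated regions); the variable names are illustrative.

```python
SNPS_IDENTIFIED = 1_400_000
MEAN_SPACING_KB = 1.9
SNPS_IN_EXONS_AND_UTRS = 60_000

# Amount of sequence implied by the count and the mean spacing (1 Gb = 1e6 kb)
implied_sequence_gb = SNPS_IDENTIFIED * MEAN_SPACING_KB / 1_000_000

# Average density and the share of SNPs lying in exons / untranslated regions
snps_per_mb = 1_000 / MEAN_SPACING_KB
exonic_percent = 100 * SNPS_IN_EXONS_AND_UTRS / SNPS_IDENTIFIED

print(f"{implied_sequence_gb:.2f} Gb")  # 2.66 Gb
print(f"{snps_per_mb:.0f} SNPs/Mb")     # 526 SNPs/Mb
print(f"{exonic_percent:.1f}%")         # 4.3%
```

On these numbers, only about 4% of the identified SNPs lie in exons or untranslated regions.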


How does one distinguish sequence errors from polymorphisms?

Sequence errors
Each piece of genome sequenced at least 10 times to reduce error rate (0.01%)

Polymorphisms
Sequence variation between individuals is 0.1%
To be defined as a polymorphism, the altered sequence must be present in a significant population
Rate of polymorphisms in diploid human genome is about 1 in 500 bp

Nature (2001) 15th Feb Vol 409 special issue; pgs 821-823 & 928
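To get a feel for the scale of these rates, they can be converted into expected counts per megabase. This is a rough sketch using only the percentages quoted above; the 1 Mb window and the helper name are arbitrary choices for the example.

```python
def expected_sites(rate_percent: float, length_bp: int) -> float:
    """Expected number of affected positions at a given percentage rate."""
    return rate_percent / 100 * length_bp


WINDOW_BP = 1_000_000  # arbitrary 1 Mb window

residual_errors = expected_sites(0.01, WINDOW_BP)        # sequencing error rate
individual_differences = expected_sites(0.1, WINDOW_BP)  # between individuals
diploid_polymorphisms = WINDOW_BP / 500                  # 1 in 500 bp

print(round(residual_errors))         # 100
print(round(individual_differences))  # 1000
print(diploid_polymorphisms)          # 2000.0
```

At the quoted rates, residual sequencing errors are expected roughly ten times less often than genuine differences between individuals.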

SNPs and disease

3) SNPs……and risk of disease

N(291)S

3) SNPs……and risk of disease

3 major alleles (APO E2, E3, and E4)

APO E2: Cys112 / Cys158
APO E3: Cys112 / Arg158
APO E4: Arg112 / Arg158

Late-onset Alzheimer's disease (LOAD): the apolipoprotein E4 haplotype is a genetic risk factor
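The relationship between the two amino acid positions and the APOE allele amounts to a small lookup table. The sketch below encodes only the three combinations listed on the slide; the function names and the example genotype are hypothetical.

```python
# (residue at position 112, residue at position 158) -> allele, as on the slide
APOE_ALLELES = {
    ("Cys", "Cys"): "E2",
    ("Cys", "Arg"): "E3",
    ("Arg", "Arg"): "E4",
}


def apoe_allele(residue_112: str, residue_158: str):
    """Return the allele name, or None for a combination not listed on the slide."""
    return APOE_ALLELES.get((residue_112, residue_158))


def count_e4(genotype) -> int:
    """Number of E4 copies (0, 1 or 2) among the two haplotypes of a genotype."""
    return sum(1 for haplotype in genotype if apoe_allele(*haplotype) == "E4")


person = [("Cys", "Arg"), ("Arg", "Arg")]  # hypothetical E3/E4 individual

print([apoe_allele(*h) for h in person])  # ['E3', 'E4']
print(count_e4(person))                   # 1
```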

3) SNPs……and pharmacogenomics

4) Distribution of GC content

Genome-wide average of 41%
Huge regional variations exist, e.g. distal 48 Mb of chromosome 1p has 47% but chromosome 13 has only 36%
Confirms cytogenetic staining with G-bands (Giemsa)
• dark G-bands: low GC content (37%)
• light G-bands: high GC content (45%)

Nature (2001) 15th Feb Vol 409 special issue; pg 876-877
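GC content is simply the percentage of G and C bases in a sequence, and regional variation shows up when it is computed window by window. A minimal sketch follows; the toy sequence and the window size of 10 are arbitrary choices for the example, whereas real analyses use much larger windows.

```python
def gc_content(seq: str) -> float:
    """Percentage of G + C among the unambiguous (A, C, G, T) bases."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100 * (seq.count("G") + seq.count("C")) / called


def windowed_gc(seq: str, window: int) -> list:
    """GC content of consecutive, non-overlapping windows of the given size."""
    return [
        gc_content(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, window)
    ]


toy = "ATATATATGC" + "GCGCGCATAT"  # AT-rich stretch followed by a GC-rich one

print(gc_content(toy))       # 40.0
print(windowed_gc(toy, 10))  # [20.0, 60.0]
```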

5) CpG islands

Significance of CpG islands
1) Non-methylated CpG islands are associated with the 5’ ends of genes
2) Aberrant methylation of CpG islands is one mechanism of inactivating tumor suppressor genes (TSGs) in neoplasia

http://www.sanger.ac.uk/HGP/cgi.shtml

CpG -> methyl-CpG (methylated at C) -> TpG (deamination)

CpG islands show no methylation

CpG islands

CpG dinucleotides greatly under-represented in human genome

• ~28,890 CpG islands in number
• Variable density, e.g. Y: 2.9/Mb but chromosomes 16, 17 & 22 have 19-22/Mb
• Average is 10.5/Mb

Nature (2001) 15th Feb Vol 409 special issue; pg 877-888
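One common way to express CpG under-representation is the observed/expected CpG ratio: the number of CpG dinucleotides found, divided by the number expected from the C and G counts alone. The formula is not given on the slides, so treat the sketch below as an assumption-laden illustration; the deamination helper is a deliberately crude model in which every CpG is methylated and converted to TpG.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G)."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


def deaminate_methylated_cpg(seq: str) -> str:
    """Crude model: every CpG is methylated and its C deaminates to T (TpG)."""
    return seq.upper().replace("CG", "TG")


island_like = "CGCGCGCG"               # toy CpG-rich sequence
decayed = deaminate_methylated_cpg("ACGTCG")

print(cpg_obs_exp(island_like))  # 2.0
print(decayed)                   # ATGTTG
print(cpg_obs_exp(decayed))      # 0.0
```

Unmethylated CpG islands escape this decay, which is consistent with their standing out against a CpG-poor genome.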

6) Recombination rates

2 main observations
• Recombination rate increases with decreasing arm length
• Recombination rate suppressed near the centromeres and increases towards the distal 20-35 Mb

7) Repeat content

a) Age distribution

b) Comparison with other genomes

c) Variation in distribution of repeats

d) Distribution by GC content

e) Y chromosome

Nature (2001) 409: pp 881-891

Repeat content…….

a) Age distribution

Most interspersed repeats predate eutherian radiation (confirms the slow rate of clearance of nonfunctional sequence from vertebrate genomes)

LINEs and SINEs have extremely long lives

2 major peaks of transposon activity

No DNA transposition in the past 50 MYr

LTR retroposons teetering on the brink of extinction

Overall decline in interspersed repeat activity in hominid lineage in the past 35-40 MYr compared to mouse genome, which shows a younger and more dynamic genome


b) Comparison with other genomes

Higher density of transposable elements in euchromatic portion of genome

Higher abundance of ancient transposons

60% of IR made up of LINE1 and Alu repeats whereas DNA transposons represent only 6%

(a few human genes appear likely to have resulted from horizontal transfer from bacteria!!)


c) Variation in distribution of repeats

Some regions show either

High repeat density, e.g. chromosome Xp11: a 525 kb region shows 89% repeat density

Low repeat density, e.g. HOX homeobox gene cluster (<2% repeats) (indicative of regulatory elements which have low tolerance for insertions)

d) Distribution by GC content

High GC: gene rich; High AT: gene poor

LINEs abundant in AT-rich regions

SINEs lower in AT-rich regions

Alu repeats in particular retained in actively transcribed GC-rich regions, e.g. chromosome 19 has 5% Alus compared to Y chromosome

e) The Y chromosome!

Unusually young genome (high tolerance to gaining insertions)

Mutation rate is 2.1X higher in male germline

Possibly due to cell division rates or different repair mechanisms

• Working draft published – Feb 2001
• Finished sequence – April 2003
• Annotation of genes ongoing (refer: International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 21 October 2004, doi: 10.1038/nature03001)

Other genomes sequenced

2002: Mus musculus, 36,000 genes
Sept 2003: Canis, 18,473 human orthologs
1997: 4,200 genes
1998: 19,099 genes
2002: 38,000 genes

Science (26 Sep 2003) Vol 301 (5641) pp 1854-1855

31 Aug 2005: Pan troglodytes, 28% identical human orthologs

References

1) Chapter 9 pp 265-268, HMG 3 by Strachan and Read

2) Chapter 10 pp 339-348, Genetics: from genes to genomes by Hartwell et al (2/e)

3) Nature (2001) 409: pp 879-891
