molecular biology and biochemistry - unisi.itmolecular biology and biochemistry 1 “courtesy of...

Molecular Biology and Biochemistry

1

“Courtesy of Prof. Monica Bianchini”

Table of contents

IntroductionThe genetic materialGene structure and information content

2

Introduction − 1

The main feature of living organisms is their ability of saving, using, and transmitting information Bioinformatics aims at:

determining which information is biologically relevantdeciphering how it is used in order to control the internal chemistry of living organisms

3

Introduction − 2

Bioinformatics is a discipline that lies at the crossroad of Biology, Computer Science and new technologies It is characterized by the application of mathematical, statistical, and computational tools for biological, biochemical and biophysical data processing The main topic in Bioinformatics is the analysis of DNA but, with the wide diffusion of more economic and effective techniques to acquire protein data, also proteins have become largely available for bioinformatics analyses

4

Introduction − 3

In the Bioinformatics field, there are three main subtopics:

Development and implementation of tools to store, manage, and analyze biological data

Creation of biobanks, database management, executable models (languages)

Information extraction via data analysis and interpretation

Data miningDevelopment of new algorithms to enucleate/verify relationships among a huge amount of information

Statistical methods, artificial intelligence

5

Introduction − 4

Therefore, Bioinformatics can be defined as... …the set of computer science methods and tools which can be applied for processing data in the fields of Biology, Chemistry and Medicine

Bioinformatics mainly covers:Database building techniques, useful for bio-medical research, and ad hoc query toolsArtificial intelligence, statistical methods, modeling techniques, which allow the analysis of such data and the identification/automatic extraction of significant information

6

Introduction − 5

With the rapid development of sequencing technologies, a higher throughput of better quality data can now be achieved at a lower costThese data are being extensively used to decipher the mechanisms of biological systems at the most basic levelHowever, this huge and growing amount of high−quality data, and the information delivered, continue to create challenges for researchers in Bioinformatics

7

The genetic material − 1

The genetic material is the DNA (deoxyribonucleic acid)It is actually the information contained in the DNA that allows the organization of molecules in living cells and organisms, capable of regulating their internal chemical composition and their growth and reproductionIt is also the DNA that gives us the inheritance of our ancestors’ physical traits, through the transmission of genes

8

9


Genes contain the information, in the form of specific nucleotide sequences, which constitute the DNA moleculesDNA molecules use only four nucleobases, guanine, adenine, thymine and cytosine (G, A, T, C), which are attached to a phosphate group (PO4) and to a deoxyribose sugar (C5H10O4), to form a nucleotide

CytosineGuanine Adenine Thymine

PyrimidinesPurines


All the information collected in the genes depends on the order in which the four nucleotides are organized along the DNA moleculeComplicated genes may be composed by hundreds of nucleotidesThe genetic code “which describes” an organism, known as its genome, is conserved in millions/billions of nucleotides

Chemical structure of the Guanine 5’ monophosphate: each nucleotide is composed of three parts, a phosphate group, a central deoxyribose sugar, and one out of the four nucleobases

10


Nucleotide strings can be joined together to form long polynucleotide chains and, on a larger scale, a chromosomeThe union of two nucleotides occurs through the formation of a phosphodiester bond, which connects the phosphate group of a nucleotide and the deoxyribose sugar of another nucleotideEster bonds involve connections mediated by oxygen atoms

Phosphodiester bonds occur when a phosphorus atom of a phosphate group binds two molecules via two ester bonds

11

DNA nucleotides are joined together by covalent bonds between the hydroxyl groups (OH) in 5’ e 3’: the alternation of phosphate residues and pentoses constitutes the skeleton of nucleic acids 12


All living organisms form phosphodiester bonds exactly in the same wayTo all the five carbons in the deoxyribose sugar − also called a pentose − an order number (from 1’ to 5’) is assigned

13


Phosphate groups of each free nucleotide are always linked to the 5’ carbon of the sugar

Phosphate groups constitute a bridge between the 5’ carbon in the deoxyribose of a new attached nucleotide and the 3’ carbon of a pre−existing polynucleotide chainA nucleotide string always starts with a free 5’ carbon, whereas, at the other end, there is a free 3’ carbon

The orientation of the DNA molecule is crucial for deciphering the information content of the cell

14


A common theme in all biological systems, and at all levels, is the idea that the structure and the function of such systems are intimately correlatedThe discovery of J. Watson and F. Crick (1953) that the DNA, inside the cells, usually exist as a double− stranded molecule provided an invaluable clue on how DNA could act as genetic material

15

J. Watson e F. Crick (Nobel Prize for Medicine with M. Wilkinson, 1962)


Therefore, the DNA is made up of two strands, and the information contained in a single strand is redundant with respect to the information contained in the otherThe DNA can be replicated and faithfully transmitted from generations to generations by separating the two strands and using each strand as a template for the synthesis of its companion

16


During the DNA duplication, the double strand is separated by means of the helicase enzyme; each DNA strand serves as a template for the synthesis of a new strand, produced thanks to the DNA polymerase

17


(Single Strand Binding Protein)

Helicase

New strand

New strand

DNA polymerase

DNA template

More precisely: the information contained in the two DNA strands is complementary

For each G on one strand, there is a C on the complementary strand and vice versaFor each A on one strand, there is a T on the complementary strand and vice versaThe interaction between G/C and A/T is specific and stable

In the space between the two DNA strands (11Å), guanine, with its double−ring structure, is too large to couple with the double ring of an adenine or of another guanineThymine, with its single−ring structure, is too small to interact with another nucleobase with a single ring (cytosine or thymine)

18


However, the space between the two DNA strands does not constitute a barrier for the interaction between G and T, or between A and C but, instead, it is their chemical nature that makes them incompatible

Three hydrogen bonds

Two hydrogen bonds

19

The chemical interaction between the two different types of nucleobases is so stable and energetically favorable to be responsible for the coupling of the two complementary strands


Thymine

Cytosine

Adenine

Guanine

20


The two strands of the DNA molecule do not have the same direction (5’/3’)They are anti−parallel, with the 5’ termination of one strand coupled with the 3’ termination of the other, and vice versa

ExampleIf the nucleotide sequence of one strand is 5’−

GTATCC−3’, the complementary strand sequence will be 3’−CTAGG−5’ (or, by convention, 5’−GGATAC−3’)

Characteristic sequences placed in position 5’ or 3’ with respect to a given reference point (f.i., the starting point of a gene) are commonly defined upstream and downstream, respectively

21


The central dogma of molecular biology − 1

Although the specific nucleotide sequence of a DNA molecule contains fundamental information, actually the enzymes act as catalysts, continually changing and adapting the chemical environment of the cells

Catalysts are molecules that allow specific chemical reactions to proceed faster: they don’t fray, or alter, and can therefore be used repeatedly to catalyze (or, in other words, to accelerate) the same reaction

22

Instructions needed to describe the enzymatic catalysts produced by the cell are contained in genesThe process of extracting information from genes, for the construction of enzymes, is shared by all living organismsIn detail: the information preserved in the DNA is used to synthesize a transient single−stranded polynucleotide molecule, called RNA (ribonucleic acid) which, in turn, is employed for synthesizing proteins

23


The process of copying a gene into an RNA is called transcription and is realized through the enzymatic activity of an RNA polymerase

A one−to−one correspondence is established between the nucleotides G, A, U (uracil), C of the RNA and G, A, T, C of the DNA

The conversion process from the RNA nucleotide sequence to the amino acid sequence, that constitutes the protein, is called translation; it is carried out by an ensemble of proteins and RNA, called ribosomes

24


RNA vs DNA

25


Transcription from DNA to RNA via RNA polymerase

Gene structure − 1

All the cells interpret the genetic instructions in the same way, thanks to the presence of specific signals used as punctuation marks between genesThe DNA code was developed at the very beginning of the history of life on the Earth and has undergone few changes over the course of millions of yearsBoth prokaryotic organisms (bacteria) and eukaryotes (complex organisms like yeasts, plants, animals and humans) use the same alphabet of nucleotides and almost the same format and methods to store and use the genetic information

26

Structure of a prokaryotic cell (left) and of an eukaryotic animal cell (right)

27


The gene expression is the process during which the information stored in the DNA is used to construct an RNA molecule that, in turn, encodes for a protein

Significant energy waste for the cellOrganisms that express unnecessary proteins compete unfavorably for their subsistence, compared to organisms that regulate their gene expression more effectivelyRNA polymerases are responsible for the activation of the gene expression through the synthesis of RNA copies of the gene

28


The RNA polymerase must be able to:reliably distinguish the beginning of the genedetermine which genes encode for needfull proteins

The RNA polymerase cannot search for a particular nucleotide, because each nucleotide is randomly located in any part of the DNA

The prokaryotic RNA polymerase examines a DNA sequence searching a specific sequence of 13 nucleotides

1 nucleotide serves as the start site for transcription6 nucleotides are located 10 nucleotides upstream of the start site6 nucleotides are located 35 nucleotides upstream of the start site

Promoter sequence of the gene

29


Because prokaryotic genomes are long few million nucleotides, promoter sequences, that can be randomly found only once every 70 million nucleotides, allow RNA polymerase to identify the beginning of a gene in a statistically reliable way

The genome of eukaryotes is several orders of magnitude longer: the RNA polymerase must recognize longer and more complex promoter sequences in order to locate the beginning of a gene

30


In the early ‘60s, F. Jacob and J. Monod − two French biochemists − were the first to obtain experimental evidence on how cells distinguish between genes that should or should not be transcribedTheir work on the regulation of prokaryotic genes (Nobel 1965) revealed that the expression of the structural genes (coding for proteins involved in cell structure and metabolism) is controlled by specific regulatory genes

Proteins encoded by the regulatory genes are able to bind to cellular DNA only near the promoter sequence − and only if it is required by the interaction with the external environment − of structural genes whose expression they control

31


When the chemical bond established between DNA and proteins facilitates the initiation of the transcription phase, by the RNA polymerase, it occurs a positive regulation; instead, the regulation is negative when such bond prevents the transcription

The structural genes in prokaryotes are activated/ deactivated by one or two regulatory proteinsEukaryotes use an ensemble of seven (or more) proteins to realize the gene expression regulation

32


While nucleotides are the basic units used by the cell to maintain the information (DNA) and to build molecules capable of transfer it (RNA), the amino acids are the basic bricks to build up proteinsThe protein function is strongly dependent on the order in which the amino acids are translated and bonded by the ribosomes

The genetic code − 1

Chemical structure of an amino acid: the amino group, the � −carbon, and the carboxyl group are identical for all the amino acids, which differ instead with respect to the R−group (side chain)

33

However, even if in constructing DNA and RNA molecules only four nucleotides are used, for the protein synthesis, 20 amino acids are needed

34


Therefore, there cannot be a one−to−one relation between nucleotides and amino acids, for which they encode, nor can there be a correspondence between pairs of nucleotides (at most 16 different combinations) and amino acidsRibosomes must use a code based on triplets

With only three exceptions, each group of three nucleotides, a codon, in an RNA copy of the coding portion of a gene, corresponds to a specific amino acidThe three codons that do not indicate to the ribosomes to insert an amino acid are called stop codons

35


36


The translation process

37

For a limited number of bacterial genes, UGA encodes a twenty−first amino acid, selenocysteine ; a twenty−σecond amino acid, pyrrolysine, is encoded by UAG in some bacterial and eukaryotic species


Amino acids are classified into four different categories

Nonpolar R−groups, hydrophobic: glycine, alanine, valine, leucine, isoleucine, methionine, phenyla-lanine, tryptophan, prolinePolar R−groups, hydrophilic: serine, threonine, cysteine , tyrosine, asparagine, glutamineAcid (negatively charged): aspartic acid, glutamic acidBasic (positively charged): lysine, arginine, histidine

38


39


Sulfur

Positively charged

Hydrophobic

Negatively charged

Aromatic

Imino acid

Hydrophilic

Subject to O−glycosilation

Glycine

NP

C�

NP

NP

NP

C+NP NP

NP

NP

NP

P

P P

P P

P C+

C+

C�

18 out of 20 amino acids are encoded by more than one codon − six at most, for leucine, serine, and arginine: this redundancy of the genetic code is called degeneracy

A single change in a codon is usually insufficient to cause the encoding of an amino acid of a different class (especially in the third position)That is, during the replication/transcription of the DNA, mistakes may occur that do not affect the amino acid composition of the proteinThe genetic code is very robust and minimize the consequences of possible errors in the nucleotide sequence, avoiding devastating impacts on the encoded protein function 40


41


Amino acid composition and number of codifying codons (Release 2015_04 of 01-Apr-15 of UniProtKB/Swiss-Prot, containing 548,208 sequence entries, with 195,282,524 amino acids)

6

6

6

4

4 4 23 2 2 4

22 2 2

21 2

2 1

The translation is effected by the ribosomes from a start site, located on the RNA copy of a gene, and proceeds until the first stop codon is encounteredThe start codon is the triplet AUG (which also encodes for methionine), both in eukaryotes and in prokaryotesThe translation is accurate only when the ribosomes examine the codons contained inside an open reading frame (delimited by a start and a stop codon, respectively)

The alteration of the reading frame (except for a multiple of 3) changes each amino acid situated downstream with respect to the alteration itself, and usually causes a truncated, unfunctional, version of the protein


42

Most of the genes encodes for proteins composed by hundreds of amino acidsIn a randomly generated sequence, stop codons will appear approximately every 20 triplets (3 codons out of 64), whereas open reading frames, representing genes, usually are very long sequences not containing stop codons

Open Reading Frames (ORFs) are a distinctive feature of many genes in both prokaryotes and eukaryotes

43


44


An open reading frame is a portion of a DNA molecule that, when translated into amino acids, contains no stop codons; a long ORF is likely to represent a gene

Introns and exons − 1

The messenger RNA (mRNA) copies of prokaryotic genes correspond perfectly to the DNA sequences of the genome (with the exception of uracil, which is used in place of thymine), and the steps of transcription and translation are partially overlappedIn eukaryotes, the two phases of gene expression are physically separated by the nuclear membrane: the transcription occurs in the nucleus, whereas the translation starts only after that the mRNA has been transported into the cytoplasm

RNA molecules transcribed by the eukaryotic polymerase may be significantly modified before that ribosomes initiate their translation

45

Most eukaryotic genes contain a very high number of intronsExample: the gene associated with the human cystic fibrosis has 24 introns and is long more than a million base pairs, while the mRNA trans-lated by the ribosomes is made up of about a thousand nucleobases

The most striking change that affects the primary transcript of eukaryotic genes is splicing, which involves the introns cutting and the reunification of the flanking exons

46


47


The majority of eukaryotic introns follows the GT−AG rule, according to which introns normally begin with the dinucleotide GT and end with the dinucleotide AGA failure in the correct splicing of introns from the primary transcript of eukaryotic RNA can:

introduce a shift of the reading framegenerate the transcription of a premature stop codonInoperability of the translated protein

48


Anyway, pairs of nucleotides are statistically too frequent to be considered as a sufficient signal for the recognition of introns by the splicesome (the ensemble of proteins devoted to splicing)

Additional nucleotides, at the endpoints 5’ and 3’ and inside the intron, are examined, which can differ for different cells

49


Alternative splicing: increasing of the protein biodiversity caused by the modification of the splicesome and by the presence of accessory proteins responsible for the recognition of the intron/exon neighborhood

A single gene encodes for multiple proteins (for instance in different tissues)For instance, it is estimated that almost 95% of the approximately 20,000 multiexonic protein−encoding genes in the human genome undergo alternative splicing and that, on average, a given gene gives rise to four alternatively spliced variants − encoding a total of 90−100,000 proteins which differ in their sequence and therefore, in their activities

50


51


Protein isoforms

The protein function − 1

The functions of proteins are incredibly diversifiedStructural proteins, such as collagen, provide support to the bones and to the connective tissuesEnzymes act as biological catalysts, like pepsin, which regulates the metabolismProteins are also responsible for the transport of atoms and molecules through the body (hemo-globin), for signals and intercellular communications (insulin), for the absorption of photons by the visual apparatus (rhodopsin), for the activation of muscles (actin and myosin), etc.

52

The protein function − 2

53

The primary structure − 1

Following the instructions contained in the mRNA, ribosomes translate a linear polymer (chain) of amino acidsThe 20 amino acids share a similar chemical structure and differ only according to the R−group

The structural region common to all amino acids is called the main chain or the backbone, while the different R−groups constitute side chains

The linear, ordered, chain in which the amino acids are assembled in a protein constitutes its primary structure

54

The protein chain has a directionAt one end of the chain there is a free amino group (NH), while, at the other termination, there is a carboxyl group (COOH)The chain goes from the amino−terminal (which is the first to be synthesized) to the carboxy−terminal

55


Primary structure

Peptide bond Peptide bonds formation

After the translation phase, the protein collapses, i.e. it bends and shapes itself in a complex globular structure, assuming the native structure, which, however, depends on the arrangement of the amino acids in the primary chain

The native structure defines the protein functionality

56


unfolded peptide

native protein

The chemical structure of the protein backbone forces its tridimensional arrangement which is, locally, basically planar

These two links allow a circular rotation.The protein folding exclusively derives from these rotations

The only movable segments in the backbone are the bonds of the � −carbon with the nitrogen and with the carbon of the carbonyl group, respectively

57


The C value paradox − 1

In 1948, it was discovered that the DNA amount within each cell of the same organism is the sameThis quantity is called the C value, a term coined in 1950 by Henson Swift“I am afraid the letter C stood for nothing more glamorous

than ‘constant’, i.e., the amount of DNA that was characteristic of a particular genotype” (H. Swift to Michael Bennett)

It is worth noting that, while the size of the genome of a species is constant, there are strong variations among different species, but without a correlation with the complexity of the organisms

58

The lack of correlation between the complexity and the size of the genome gives rise to the C−value paradoxThe DNA amount often differs by a factor greater than one hundred among very similar speciesThe clear implication − even if difficult to prove − is that a great quantity of DNA, present in some organisms, is in excess and, apparently, does not provide a significant contribution to the complexity of the organism itself

59


60


Concluding… − 1

DNA molecules contain the information stored within the cell needed for its life and reproductionThe ordered chain of the four different nucleotides is transcribed by RNA polymerase into mRNA, which is then translated into proteins by ribosomesTwenty different amino acids are used to build proteins, and the order and the specific composition of these “building blocks” play an important role in the construction and in the structure and functional maintenance of proteins

61

Molecular biologists have a rather limited number of tools to study DNA

The restriction enzymes cut the DNA molecule when they recognize a specific string of nucleotidesGel electrophoresis allows the separation of DNA fragments according to their length and chargeBlotting and hybridization ensure the retrieval of specific DNA fragments in a mixtureThe cloning techniques allows the propagation of specific sequences, for repeated use and analysisPCR provides a rapid amplification and characterization of specific DNA fragmentsFinally, DNA sequencing determines the order of nucleotides, ensuring the characterization of the DNA molecule

62

Concluding… − 2

The reassociation kinetics revealed that the DNA content of a cell, its C value, does not always directly correspond to the information content (the complexity) of an organism

In complex organisms, there are large areas of junk (useless) DNA

63

Concluding… − 3

molecular biology and biochemistry - unisi.itmolecular biology and biochemistry 1 “courtesy of...

Documents