10/19/2015bchb524 - 2015 protein structure informatics using bio.pdb bchb524 2015 lecture 12 by...

30
10/19/2015 BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

Upload: cecil-garrett

Post on 19-Jan-2016

221 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

Protein StructureInformatics

using Bio.PDB

BCHB5242015

Lecture 12

By Edwards & Li

Page 2: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

Outline

Review Python modules Biopython Sequence modules

Biopython’s Bio.PDB Protein structure primer / PyMOL PDB file parsing PDB data navigation: SMCRA Examples

Page 3: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

Python Modules Review

Access the program environment sys, os, os.path

Specialized functions math, random

Access file-like resources as files: zipfile, gzip, urllib

Make specialized formats into “lists” and “dictionaries” csv (, XML, …)

Page 4: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

BioPython Sequence Modules

Provide “sequence” abstraction More powerful than a python string

Knows its alphabet! Basic tasks already available

Easy parsing of (many) downloadable sequence database formats FASTA, Genbank, SwissProt/UniProt, etc… Simplify access to large collections of sequence Access by iteration, get sequence and accession Other content available as lists and dictionaries. Little semantic extraction or interpretation

Page 5: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

Biopython Bio.SeqIO

Access to additional information annotations dictionary features list

Information, keys, and keywords vary with database! Semantic content extraction (still) up to you!

10/19/2015 BCHB524 - 2015

import Bio.SeqIOimport sysseqfile = open(sys.argv[1])for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):    print "\n------NEW SEQRECORD------\n"    print "seq_record.annotations\n\t",seq_record.annotations    print "seq_record.features\n\t",seq_record.features    print "seq_record.dbxrefs\n\t",seq_record.dbxrefs    print "seq_record.format('fasta')\n",seq_record.format('fasta')seqfile.close()

Page 6: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

Proteins are…

…a linear sequence of amino-acids, after transcription from DNA, and translation from mRNA.

Page 7: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

Proteins are…

…3-D molecules that interact with other (biological) molecules to carry out biological functions…

DNA Polymerase Hemoglobin

Page 8: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

Proteins are…

www.uniprot.org

Primary Secondary Tertiary Quaternary

Introduction to Protein Structure, by Carl Branden and John Tooze

Page 9: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

Proteins are Composed of 20 Amino Acids

Page 10: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

Representations of Protein 3D Structures

Ball-Stick Ribbon Surface

Ball-Stick: Atom-Bond

Helix, Sheet and Random Coil

Solvent Accessible Surface

Page 11: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

Protein Data Bank (PDB)

Repository of the 3-D conformation(s) / structure of proteins.

The result of laborious and expensive experiments using X-ray crystallography and/or nuclear magnetic resonance (NMR). (x,y,z) position of every atom of every amino-acid

Some entries contain multi-protein complexes, small-molecule ligands, docked epitopes and antibody-antigen complexes…

Page 12: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

Visualization (PyMOL)

Page 13: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

Molecular Visualization Tools PyMol: http://pymol.sourceforge.net/ Chimera: http://www.cgl.ucsf.edu/chimera/ PMV: http://www.scripps.edu/~sanner/python/ Coot: http://www.ysbl.york.ac.uk/~emsley/coot/ CCP4mg: http://www.ysbl.york.ac.uk/~lizp/molgraphics.html mmLib: http://pymmlib.sourceforge.net/ VMD: http://www.ks.uiuc.edu/Research/vmd/ MMTK: http://starship.python.net/crew/hinsen/MMTK/

http://biopython.org/

Page 14: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

Overall Layout of a Structure Object

A structure consists of models

A model consists of chains

A chain consists of residues

A residue consists of atoms

http://biopython.org/

Page 15: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

Biopython Bio.PDB

Parser for PDB format files Navigate structure and answer atom-atom

distance/angle questions.

Structure (PDB File) >> Model >> Chain >> Residue >> Atom >> (x,y,z) coordinates SMCRA representation mirrors PDB format

Page 16: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

SMCRA Data-Model

Each PDB file represents one “structure” Each structure may contain many models

In most cases there is only one model, index 0. Each polypeptide (amino-acid sequence) is a

“chain”. A single-protein structure has one chain, “A” 1HPV is a dimer and has chains “A” and “B”.

Page 17: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

SMCRA Data-Model

#eg1.pyimport Bio.PDB.PDBParserimport sys

# Use QUIET=True to avoid lots of warnings...parser = Bio.PDB.PDBParser(QUIET=True)structure = parser.get_structure("1HPV", "1HPV.pdb")model = structure[0]

# This structure is a dimer with two chainsachain = model['A']bchain = model['B']

Page 18: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

SMCRA

Chains are composed of amino-acid residues Access by iteration, or by index Residue “index” may not be sequence position

Residues are composed of atoms: Access by iteration or by atom name …except for H!

Water molecules are also represented as atoms – HOH residue name, het=“W”

Page 19: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

SMCRA Data-Model

#eg2.pyimport Bio.PDB.PDBParserimport sys

# Use QUIET=True to avoid lots of warnings...parser = Bio.PDB.PDBParser(QUIET=True)structure = parser.get_structure("1HPV", "1HPV.pdb")model = structure[0]

for chain in model:  for residue in chain:    for atom in residue:      print chain, residue, atom, atom.get_coord()

Page 20: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

Polypeptide molecules

S-G-Y-A-L

Page 21: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

SMCRA Atom names

Page 22: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

Check polypeptide backbone#eg3.pyimport Bio.PDB.PDBParserimport sys# Use QUIET=True to avoid lots of warnings...parser = Bio.PDB.PDBParser(QUIET=True)structure = parser.get_structure("1HPV", "1HPV.pdb")model = structure[0]achain = model['A']

for residue in achain:    index = residue.get_id()[1]    calpha = residue['CA']    carbon = residue['C']    nitrogen = residue['N']    oxygen = residue['O']

    print "Residue:",residue.get_resname(),index    print "N  - Ca",(nitrogen - calpha)    print "Ca - C ",(calpha - carbon)    print "C  - O ",(carbon - oxygen)    print break

Page 23: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

Check polypeptide backbone#eg4.py# As before...

for residue in achain:    index = residue.get_id()[1]    calpha = residue['CA']    carbon = residue['C']    nitrogen = residue['N']    oxygen = residue['O']

    print "Residue:",residue.get_resname(),index    print "N  - Ca",(nitrogen - calpha)    print "Ca - C ",(calpha - carbon)    print "C  - O ",(carbon - oxygen)

    if achain.has_id(index+1):        nextresidue = achain[index+1]        nextnitrogen = nextresidue['N']        print "C  - N ",(carbon - nextnitrogen)

    print

Page 24: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

Find potential disulfide bonds

The sulfur atoms of Cys amino-acids often form “di-sulfide” bonds if they are close enough – less than 8 Å.

Compare with PDB file contents: SSBOND

Bio.PDB does not provide an easy way to access the SSBOND annotations

Page 25: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

Find potential disulfide bonds#eg5.pyimport Bio.PDB.PDBParserimport sys

# Use QUIET=True to avoid lots of warnings...parser = Bio.PDB.PDBParser(QUIET=True)structure = parser.get_structure("1KCW", "1KCW.pdb")model = structure[0]achain = model['A']

cysresidues = []for residue in achain:    if residue.get_resname() == 'CYS':        cysresidues.append(residue)

for c1 in cysresidues:    c1index = c1.get_id()[1]    for c2 in cysresidues:        c2index = c2.get_id()[1]        if (c1['SG'] - c2['SG']) < 8.0:            print "possible di-sulfide bond:",            print "Cys",c1index,"-",            print "Cys",c2index,            print round(c1['SG'] - c2['SG'],2)

Page 26: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

Find contact residues in a dimer#eg6.pyimport Bio.PDB.PDBParserimport sys

# Use QUIET=True to avoid lots of warnings...parser = Bio.PDB.PDBParser(QUIET=True)structure = parser.get_structure("1HPV","1HPV.pdb")

achain = structure[0]['A']bchain = structure[0]['B']

for res1 in achain: if res1.get_id()[0][0] is 'H' or res1.get_id()[0][0] is 'W': continue    r1ca = res1['CA']    r1ind = res1.get_id()[1]    r1sym = res1.get_resname()    for res2 in bchain: if res2.get_id()[0][0] is 'H' or res2.get_id()[0][0] is 'W': continue        r2ca = res2['CA']        r2ind = res2.get_id()[1]        r2sym = res2.get_resname()        if (r1ca - r2ca) < 6.0:            print "Residues",r1sym,r1ind,"in chain A",            print "and",r2sym,r2ind,"in chain B",            print "are close to each other:",round(r1ca-r2ca,2)

Page 27: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

Find contact residues in a dimer – better version#eg7.pyimport Bio.PDB.PDBParserimport sys

# Use QUIET=True to avoid lots of warnings...parser = Bio.PDB.PDBParser(QUIET=True)structure = parser.get_structure("1HPV","1HPV.pdb")

achain = structure[0]['A']bchain = structure[0]['B']

bchainca = [ r['CA'] for r in bchain ]

neighbors = Bio.PDB.NeighborSearch(bchainca)

for res1 in achain:    r1ca = res1['CA']    r1ind = res1.get_id()[1]    r1sym = res1.get_resname()    for r2ca in neighbors.search(r1ca.get_coord(), 6.0):        res2 = r2ca.get_parent()        r2ind = res2.get_id()[1]        r2sym = res2.get_resname()        print "Residues",r1sym,r1ind,"in chain A",        print "and",r2sym,r2ind,"in chain B",        print "are close to each other:",round(r1ca-r2ca,2)

Page 28: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

Superimpose two structuresimport Bio.PDBimport Bio.PDB.PDBParserimport sys

# Use QUIET=True to avoid lots of warnings...parser = Bio.PDB.PDBParser(QUIET=True)structure1 = parser.get_structure("2WFJ","2WFJ.pdb")structure2 = parser.get_structure("2GW2","2GW2a.pdb")

ppb=Bio.PDB.PPBuilder()

# Manually figure out how the query and subject peptides correspond...# query has an extra residue at the front# subject has two extra residues at the backquery = ppb.build_peptides(structure1)[0][1:]target = ppb.build_peptides(structure2)[0][:-2]

query_atoms = [ r['CA'] for r in query ]target_atoms = [ r['CA'] for r in target ]

superimposer = Bio.PDB.Superimposer()superimposer.set_atoms(query_atoms, target_atoms)

print "Query and subject superimposed, RMS:", superimposer.rmssuperimposer.apply(structure2.get_atoms())

# Write modified structures to one fileoutfile=open("2GW2-modified.pdb", "w") io=Bio.PDB.PDBIO() io.set_structure(structure2) io.save(outfile) outfile.close() 

Page 29: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

Superimpose two chainsimport Bio.PDB

parser = Bio.PDB.PDBParser(QUIET=1)structure = parser.get_structure("1HPV","1HPV.pdb")

model = structure[0]

ppb=Bio.PDB.PPBuilder()

# Get the polypeptide chainsachain,bchain = ppb.build_peptides(model)

aatoms = [ r['CA'] for r in achain ]batoms = [ r['CA'] for r in bchain ]

superimposer = Bio.PDB.Superimposer()superimposer.set_atoms(aatoms, batoms)

print "Query and subject superimposed, RMS:", superimposer.rmssuperimposer.apply(model['B'].get_atoms())

# Write structure to fileoutfile=open("1HPV-modified.pdb", "w") io=Bio.PDB.PDBIO() io.set_structure(structure) io.save(outfile)  outfile.close() 

Page 30: 10/19/2015BCHB524 - 2015 Protein Structure Informatics using Bio.PDB BCHB524 2015 Lecture 12 By Edwards & Li

10/19/2015 BCHB524 - 2015

Exercises

Read through and try the examples from Chapter 10 of the Biopython Tutorial and the Bio.PDB FAQ.

Write a program that analyzes a PDB file (filename provided on the command-line!) to find pairs of lysine residues that might be linked if the BS3 cross-linker is used. The rigid BS3 cross-linker is approximately 11 Å long. Write two versions, one that computes the distance between

all pairs of lysine residues, and one that uses the NeighborSearch technique.