Nick Till
NRPS Code Project Summary
A. Data formatting/trimming
The data used in this project were obtained from a paper detailing a
machine-learning approach to predicting the amino acid selected by a given
A-domain in non-ribosomal peptide synthesis. The training data from that paper
were taken in .xls format. All unnatural amino acid residues were then
stripped, leaving sequence sets for 19 amino acids (each set containing
between 1 and roughly 30 sequences). The training data contained no
methionine-specific A-domains, which is why there are 19 sets rather than 20.
The resulting trimmed-down .xls file was converted to .csv format and read into
the Python script as a dictionary of lists (see .py file). All further data
manipulation and analysis were performed in Python (see .py file).
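As a rough illustration of the "dictionary of lists" format, the sketch below
reads a CSV into a dictionary keyed by amino acid. The column layout (label in
the first column, signature sequence in the second, no header row) is an
assumption made for this example; the actual .csv layout and the real readseq
implementation are in the .py file and may differ.

```python
import csv
import io
from collections import defaultdict


def readseq(handle):
    """Read (label, sequence) rows into a dict mapping label -> list of sequences.

    Assumed layout (hypothetical): column 0 is the amino-acid label,
    column 1 is the A-domain signature sequence.
    """
    seqs = defaultdict(list)
    for row in csv.reader(handle):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or malformed rows
        label = row[0].strip().lower()
        sequence = row[1].strip().upper()
        seqs[label].append(sequence)
    return dict(seqs)


# In-memory stand-in for the .csv file, using two signatures from Table 2.
example = io.StringIO("ala,DLMVLCTVA-\narg,DVWTIGAVE-\n")
data = readseq(example)
print(data)
```

With a real file, the same function could be called as
readseq(open("trimmed.csv", newline="")), where the file name is a placeholder.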
B. High Level Overview
The first part of the project was visualizing the information content (IC) of
the A-domains, to make it easier to hypothesize whether a method simpler than
the machine-learning approach reported by Rottig et al. could predict
amino-acid (AA) specificity from the signature sequence of an A-domain. The
visualization was done with IC bar charts; it and the subsequent prediction
were carried out through the following steps:
1. Read in the CSV to get the sequence data into a usable dictionary format
(readseq).
2. Construct a matrix of amino-acid frequency of occurrence at each position
for each set of A-domain sequences corresponding to one amino-acid
recognition (AAfrequency).
3. Use the frequencies of each amino acid at each position in each set of
A-domain sequences to construct a consensus sequence for each A-domain
type (consensus).
4. Calculate the entropy at each position of each consensus (Entropy).
5. Calculate the information content at each position of each consensus (IC).
6. Plot bar charts of sequence position vs. information content (plotBarChart).
7. Plot bar charts for all A-domain sequences, regardless of amino-acid
specificity, to see overall conservation.
8. Make a weighted Hamming distance function, in which the Hamming distance
scales with the information content at a given position (WeightedHamming).
9. Take an input sequence and predict its amino-acid specificity by
minimizing its weighted Hamming distance against the consensus sequences of
all amino-acid A-domain types (its own and the 18 others), and return the
amino acid with the minimum distance. Furthermore, check whether the sequence
set of the returned amino acid actually contains a sequence identical to the
input, and output match or mismatch to indicate the success of the predictor.
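The numerical steps above (2 to 5, 8 and 9) can be sketched as follows. This is
a minimal illustration, not the project's code: the function names only echo
those listed in the steps, IC is assumed to be log2(20) minus the Shannon
entropy at each position, and the weighted Hamming distance is assumed to be
the sum of a consensus's IC values at the positions where the query differs
from it. The toy labels and four-residue sequences are invented for the
example.

```python
import math
from collections import Counter

MAX_BITS = math.log2(20)  # assumption: maximum entropy of a 20-letter alphabet


def aa_frequency(seqs):
    """Per-position residue frequencies for a set of equal-length sequences."""
    n = len(seqs)
    return [{aa: c / n for aa, c in Counter(col).items()} for col in zip(*seqs)]


def consensus(freqs):
    """Most frequent residue at each position (ties go to the first seen)."""
    return "".join(max(col, key=col.get) for col in freqs)


def entropy(freqs):
    """Shannon entropy in bits at each position."""
    return [-sum(p * math.log2(p) for p in col.values()) for col in freqs]


def information_content(freqs):
    """Assumed definition: IC = log2(20) - entropy, per position."""
    return [MAX_BITS - h for h in entropy(freqs)]


def weighted_hamming(query, cons, ic):
    """Sum the IC weights at every position where query and consensus differ."""
    return sum(w for q, c, w in zip(query, cons, ic) if q != c)


def predict(query, training):
    """Return (closest label, whether query occurs verbatim in that label's set)."""
    scores = {}
    for label, seqs in training.items():
        freqs = aa_frequency(seqs)
        scores[label] = weighted_hamming(
            query, consensus(freqs), information_content(freqs)
        )
    best = min(scores, key=scores.get)
    return best, query in training[best]


# Invented toy data: two "specificities" with two short sequences each.
toy = {"toyA": ["AAAA", "AAAC"], "toyB": ["GGGG", "GGGT"]}
print(predict("AAAC", toy))  # closest consensus is toyA, and AAAC is in that set
```

Weighting by IC means a mismatch at a fully conserved position costs more than
a mismatch at a variable one, which is the behaviour step 8 describes.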
C. Discussion
With both 34-mer and 10-mer inputs, the predictor is correct a bit over half
of the time (10 of the 18 sequences listed in each of Tables 1 and 2). Given
the low information content of the A-domain sequence sets that are predicted
poorly, this makes some sense (Figures 1 and 2). Typically, A-domain
consensus sequences with high information content are predicted well
(arginine, alanine, glutamine), although there are certainly outliers to this
trend. While the serine and threonine A-domain sequences are highly
conserved, each is predicted correctly by only one of the two models (serine
by the 34-mer model, threonine by the 10-mer model). A very odd successful
prediction is that of valine. Although the only well-conserved residues in the
valine set (of which we have 28 examples, Figure 2) are those that are well
conserved across all A-domains, valine is still predicted correctly by both
the 10-mer and 34-mer predictors. More complex weighting in the Hamming
distance function might improve prediction power enough to capture serine and
threonine, and might also stop predicting valine. While it is nice that valine
is predicted well, this is likely only an anomaly and would not hold if the
predictor were applied to larger data sets.
The most obvious problem with the current predictor is that it requires
extensive preparation of the data beforehand. The 34 residues taken as the
signature sequence in half of the predictions are those lying within an 8 Å
radius ball centered on the amino-acid binding site. Thus, the ordering of the
residues in the signature does not reflect their ordering in the primary
structure. Furthermore, the assumption that these residues still lie within
the 8 Å radius ball in every A-domain relies on the conformation of all
A-domains being very similar to that of the only one for which we have a
crystal structure (Marahiel et al.). This is an even more obvious problem when
the 10-mer sequences are taken as the signatures, since a single one of these
10 residues lying outside the recognition site will unduly affect the
prediction.
The weighted Hamming function could be improved with a more detailed
look at the logos (Figures 3 and 4) of each A-domain sequence set. With an
understanding of which residues in the A-domain are crucial to recognition of a
given AA, the Hamming function could be weighted accordingly.
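One way such a re-weighting might look is sketched below. It is purely
illustrative: the function name, the key_positions argument and the boost
factor are hypothetical, and which positions actually matter for recognition
would have to come from the logos or structural data, not from this code.

```python
def boosted_hamming(query, consensus, weights, key_positions=(), boost=2.0):
    """Weighted Hamming distance with extra weight on chosen positions.

    weights       -- per-position weights (for example, information content)
    key_positions -- hypothetical 0-based indices judged crucial to recognition
    boost         -- hypothetical multiplier applied at those positions
    """
    key = set(key_positions)
    total = 0.0
    for i, (q, c, w) in enumerate(zip(query, consensus, weights)):
        if q != c:
            total += w * boost if i in key else w
    return total


# A mismatch at a boosted position counts double; elsewhere it counts once.
flat = [1.0, 1.0, 1.0, 1.0]
print(boosted_hamming("ACGT", "ACGA", flat, key_positions=[3]))
print(boosted_hamming("ACGT", "TCGT", flat, key_positions=[3]))
```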
Figure 1. Information content by position for all 19 characterized amino acid A-domains, and a reference information content for all sequences (with length 10), regardless of A-domain specificity.
Figure 2. Information content by position for all 19 characterized amino acid A-domains, and a reference information content for all sequences (length 32), regardless of A-domain specificity.
Table 1. Prediction of AA for a 34-mer sequence randomly selected from each known AA sequence set. Match indicates prediction success.
34-mer sequence                     Known AA  Predicted AA  Full Match?  34-mer Match?
YWNPFDLSVMDPVSLFCGEYNTYGPTEATVAVTG  ala       ala           Y            Y
TYATFDVSVWESTCIVGGEYNAYGPTEVAVETTI  arg       arg           Y            Y
YWASFDLTVTAAKIVAGGEVNEYGPTETTVGCCA  asn       ala           N            N
YWFSFDLGYTSPKLVLGGEINHYGPTETTIGAIA  asp       asp           Y            Y
LSLSFDHFVEQDSGDCVGEINGYGPTEVSITTHK  cys       ala           N            N
LGLAFDASVQQTDGLVGGETNVYGPTETCVDASS  gln       gln           Y            Y
LGLAFDASVKQADMIVGGDTNVYGPTECCVDAAS  glu       val           N            N
FAMTFDISALELQALVGGETNLYGPTETTIWSTF  gly       gly           Y            Y
VNTSFDGSVFDGFILFGGEIHVYGPTESTVYATY  ile       ala           N            N
LWDAFDASIWEPFLLTGGDVNNYGPTENTVVATS  leu       leu           Y            Y
YDHWFDAAWQPADTALGGEFNCYGPTETTVEAVV  lys       val           N            N
TAQAFDAAVWESALIVAGDVNAYGLTETTVCATM  phe       phe           Y            Y
LFEAFDVCYQESVSITAGEHNHYGPSETHVVSAY  pro       pro           Y            Y
RWMTFDVSVWEWHFMCSGEHNLYGPTEAAVDVTA  ser       ser           Y            Y
LHQHFDFSVWEGNQIFGGEINMYGITETTVHVTY  thr       val           N            N
LDRVFDVSMADPVMVSGGDHNEYGVTEATVVSTV  trp       val           N            N
TWRFFDGCVTSTLITFAGEANEYGPTENSVATTI  tyr       ala           N            N
LNAGFDASTFEGWLIIGGDWNGYGPTENTTFSTC  val       val           Y            Y
Table 2. Prediction of AA for a 10-mer sequence randomly selected from each known AA sequence set. Match indicates prediction success.
10-mer sequence  Known AA  Predicted AA  Full Match?  10-mer Match?
DLMVLCTVA-       ala       ala           Y            Y
DVWTIGAVE-       arg       arg           Y            Y
DLTKVGEVG-       asn       ala           N            N
DLTKVGHIG-       asp       ala           N            N
DHESDVGIT-       cys       ala           N            N
DAQDLGVVD-       gln       gln           Y            Y
DAKDIGVVD-       glu       ala           N            N
DILQLGLIW-       gly       gly           Y            Y
DGFFLGVVY-       ile       ala           N            N
DAWFLGNVV-       leu       leu           Y            Y
DAQDAGCVE-       lys       lys           Y            Y
DAWAIAAVC-       phe       phe           Y            Y
DVQVIAHVV-       pro       pro           Y            Y
DVWHMSLVD-       ser       ala           N            N
DFWNIGMVH-       thr       thr           Y            Y
DVAVVGEVV-       trp       ala           N            N
DGTLTAEVA-       tyr       ala           N            N
DAFWIGGTF-       val       val           Y            Y
[Figure: All 10-mers Logo (WebLogo 3.4). Vertical axis: bits, 0.0 to 4.0; horizontal axis: sequence position, ticks at 5 and 10.]
[Figure: All 32-mers Logo (WebLogo 3.4). Vertical axis: bits, 0.0 to 4.0; horizontal axis: sequence position, ticks every 5 positions up to 30.]