Parsing data records

Upload: egil

Post on 24-Feb-2016


Page 1: Parsing data records

Parsing data records

Page 2: Parsing data records

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN

Page 3: Parsing data records

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN

A sequence record in FASTA format
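A FASTA record is one header line starting with '>' followed by one or more sequence lines. The sketch below is not from the slides: it uses Python 3 syntax (the slides use Python 2), the helper name split_fasta_record is hypothetical, and the record is truncated to its first two sequence lines for brevity.

```python
# Hypothetical helper (not part of the slides): split one FASTA record
# into its header and its sequence.
def split_fasta_record(record):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = record.strip().splitlines()
    header = lines[0][1:]  # drop the leading '>'
    # Join the sequence lines, removing the line breaks between them.
    sequence = "".join(line.strip() for line in lines[1:])
    return header, sequence


# Truncated copy of the record shown above (first two sequence lines only).
record = """>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
"""

header, sequence = split_fasta_record(record)
print(header)
print(sequence)
```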

Page 4: Parsing data records

seq = ">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens \
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS\
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY\
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY\
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD\
AGEGEN"

for i in seq:
    print i

Page 5: Parsing data records

seq = open("SingleSeq.fasta")

for line in seq:
    print line

Page 6: Parsing data records

seq = open("SingleSeq.fasta")
seq_2 = open("SingleSeq-2.fasta", "w")    # the output file must be opened in write mode

for line in seq:
    seq_2.write(line)

seq_2.close()

Page 7: Parsing data records

Writing different things depending on a condition

Read a sequence in FASTA format and print only the header of the sequence

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN

Page 8: Parsing data records

seq = open("SingleSeq.fasta")

for line in seq:
    if line[0] == '>':
        print line

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN

Page 9: Parsing data records

Making choices: The if/elif/else statements

if <condition 1>:        # if the expression in <condition 1> is True
    <statements 1>       # execute statements 1
[elif <condition 2>:     # else, if the expression in <condition 2> is True
    <statements 2>]      # execute statements 2
[elif <condition 3>:     # etc.
    pass]
...
[else:
    <statements N>]

Page 10: Parsing data records

>>> s = "MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAE"
>>> s_len = float(len(s))
>>> G_num = s.count('G')
>>> A_num = s.count('A')
>>> freq_G = G_num/s_len
>>> freq_A = A_num/s_len
>>> print freq_G
0.116666666667
>>> print freq_A
0.216666666667

Write different things depending on a condition

Page 11: Parsing data records

>>> s = "MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAE"
>>> s_len = float(len(s))
>>> G_num = s.count('G')
>>> A_num = s.count('A')
>>> freq_G = G_num/s_len
>>> freq_A = A_num/s_len
>>> print freq_G
0.116666666667
>>> print freq_A
0.216666666667

>>> if freq_G > freq_A:
...     print "Gly is more frequent than Ala"
... elif freq_G < freq_A:
...     print "Ala is more frequent than Gly"
... else:
...     print "The frequency of Gly and Ala is the same"
...
Ala is more frequent than Gly

Write different things depending on a condition

Page 12: Parsing data records

The if/elif/else construct produces different effects compared with the use of a series of if conditions
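To make the difference concrete, here is a small sketch that is not from the slides, written in Python 3 syntax with hypothetical function names. In an if/elif/else chain only the first true branch runs; in a series of separate if statements every true condition runs.

```python
def classify_with_elif(x):
    """Only the first true branch is executed."""
    labels = []
    if x > 10:
        labels.append("greater than 10")
    elif x > 5:
        labels.append("greater than 5")
    else:
        labels.append("5 or less")
    return labels


def classify_with_ifs(x):
    """Every condition is tested independently of the others."""
    labels = []
    if x > 10:
        labels.append("greater than 10")
    if x > 5:
        labels.append("greater than 5")
    if x <= 5:
        labels.append("5 or less")
    return labels


print(classify_with_elif(12))
print(classify_with_ifs(12))
```

With x = 12 the first function collects one label, while the second collects two, because 12 satisfies both x > 10 and x > 5.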

Page 13: Parsing data records

seq = open("SingleSeq.fasta")

for line in seq:
    if line[0] != '>':
        print line

seq = open("SingleSeq.fasta")

for line in seq:
    if line[0] == '>':
        print line

Page 14: Parsing data records

seq = open("SingleSeq.fasta")

for line in seq:
    if line[0] != '>':
        print line

Comparison operators: ==  !=  >=  <=  >  <

Page 15: Parsing data records

Exercises 1, 2, and 3

1) Read a file in FASTA format and write to a new file only the header of the record.
2) Read a file in FASTA format and write to a new file only the sequence (without the header).
3) Merge 1) and 2). In other words, read a file in FASTA format and write the header to a file and the sequence to a different one.

Page 16: Parsing data records

fasta = open('SingleSeq.fasta')
header = open('header.txt', 'w')

for line in fasta:
    if line[0] == '>':
        header.write(line)

header.close()

Page 17: Parsing data records

fasta = open('SingleSeq.fasta')
seq = open('seq.txt', 'w')

for line in fasta:
    if line[0] != '>':
        seq.write(line)

seq.close()

Page 18: Parsing data records

fasta = open('SingleSeq.fasta')
header = open('header.txt', 'w')
seq = open('seq.txt', 'w')

for line in fasta:
    if line[0] == '>':
        header.write(line)
    else:
        seq.write(line)

header.close()
seq.close()

Page 19: Parsing data records

Let’s increase the difficulty just a bit…

Page 20: Parsing data records

seq_fasta = open("SingleSeq.fasta")

seq = ''

for line in seq_fasta:
    if line[0] == '>':
        header = line
    else:
        seq = seq + line.strip()    # strip() removes the newline at the end of each line

num_cys = seq.count("C")

print header, seq, num_cys

Page 21: Parsing data records

Exercise 4

4) Read a file in FASTA format. Print or write the record to a file only if the sequence is from Homo sapiens.

Page 22: Parsing data records

seq_fasta = open("SingleSeq.fasta")

seq = ''
header = ''

for line in seq_fasta:
    if line[0] == '>':
        if "Homo sapiens" in line:
            header = line
    else:
        if header:
            seq = seq + line

if header:
    print header + seq
else:
    print "The record is not from H. sapiens"

Page 23: Parsing data records

In general, you will need to analyse several sequences….

Page 24: Parsing data records

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ
>sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSW
RVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESK
VFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFS
VFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDE
EAGEGN
.........

SwissProt-Human.fasta

Page 25: Parsing data records

Read the records from a file and write them to a new file

fasta = open('SwissProt-Human.fasta')
fasta_2 = open('SwissProt-Human_2.fasta', 'w')

for line in fasta:
    fasta_2.write(line)    # the argument of write() must be a string

Page 26: Parsing data records

Strings can be concatenated

>>> print "ACTGGTA" + "ATGTAACTT"
ACTGGTAATGTAACTT

Strings can be indexed and sliced

>>> s = "ACTGGTA"
>>> s[0]
'A'
>>> s[1:3]
'CT'

String elements cannot be re-assigned

>>> s[2] = 'Z'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
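Because a string cannot be changed in place, the usual workaround is to build a new string. The following sketch is an addition to the slides (Python 3 syntax); it shows two common ways of doing so, by slicing plus concatenation and by going through a list of characters.

```python
s = "ACTGGTA"

# Option 1: slice around position 2 and concatenate the pieces.
mutated = s[:2] + "Z" + s[3:]

# Option 2: convert to a list (which is mutable), change it, join it back.
chars = list(s)
chars[2] = "Z"
mutated_via_list = "".join(chars)

print(mutated)
print(mutated_via_list)
print(s)  # the original string is unchanged
```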

Page 27: Parsing data records

Read the sequences from a file and write them to a new file

fasta = open('SwissProt-Human.fasta')
fasta_2 = open('SwissProt-Human_2.fasta', 'w')

n = 0
for line in fasta:
    n = n + 1
    l_n = str(n)    # write() needs a string, so the number is converted
    fasta_2.write(l_n + "\t" + line)

fasta_2.close()

Number the lines starting from 1
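The manual counter can also be replaced by the built-in enumerate(), which yields (number, item) pairs. This is a sketch in Python 3 syntax, not the slides' solution; it uses in-memory io.StringIO objects with made-up content in place of the real files so that it runs on its own.

```python
import io

# Made-up stand-in for the input FASTA file.
fasta = io.StringIO(">seq1 example header\nMTMDKSELVQ\nKAKLAEQAER\n")
# In-memory stand-in for the output file.
numbered = io.StringIO()

# enumerate(..., 1) starts counting from 1 instead of 0.
for n, line in enumerate(fasta, 1):
    numbered.write(str(n) + "\t" + line)

result = numbered.getvalue()
print(result)
```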

Page 28: Parsing data records

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ
>sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSW
RVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESK
VFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFS
VFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDE
EAGEGN
.........

Page 29: Parsing data records

Exercise 5

5) Download a Uniprot multiple sequence FASTA file. Write the record headers to a new file.

Page 30: Parsing data records

fasta = open('SwissProt-Human.fasta')
headers = open('headers.txt', 'w')

for line in fasta:
    if line[0] == '>':
        headers.write(line)

headers.close()

Page 31: Parsing data records

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ
>sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSW
RVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESK
VFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFS
VFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDE
EAGEGN
.........

Page 32: Parsing data records

Exercise 6

6) Read a multiple sequence FASTA file and write the sequences to a new file separated by a blank line

Page 33: Parsing data records

fasta = open('SwissProt-Human.fasta')
seqs = open('seqs.txt', 'w')

for line in fasta:
    if line[0] == '>':
        seqs.write('\n')    # a blank line replaces each header
    elif line[0] != '>':
        seqs.write(line)    # alternative: seqs.write(line.strip() + '\n')

seqs.close()

Page 34: Parsing data records

Exercise 7

7) Read a multiple sequence FASTA file and write to a new file only the records from Homo sapiens.

Page 35: Parsing data records

fasta = open('sprot_prot.fasta')
output = open('homo_sapiens.fasta', 'w')

seq = ''

for line in fasta:
    if line[0] == '>' and seq == '':
        # first header of the file
        header = line
    elif line[0] != '>':
        # sequence line: accumulate it
        seq = seq + line
    elif line[0] == '>' and seq != '':
        # a new header: the previous record is complete
        if "Homo sapiens" in header:
            output.write(header + seq)
        header = line
        seq = ''

# the last record is not followed by a header, so it is handled here
if "Homo sapiens" in header:
    output.write(header + seq)

output.close()
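The same record-by-record logic can be wrapped once in a generator and reused for every filtering exercise. The sketch below is an alternative to the slides' approach, not the authors' solution: it uses Python 3 syntax, the name read_fasta is hypothetical, and the two sample records (headers and sequences) are invented so that the example runs without any file.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file-like object."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line
            chunks = []
        elif header is not None and line:
            chunks.append(line)
    # The last record is not followed by a header, so emit it here.
    if header is not None:
        yield header, "".join(chunks)


# Invented records, used only to make the example self-contained.
sample = io.StringIO(
    ">sp|X00001|EX1_HUMAN example protein OS=Homo sapiens\n"
    "MTMDKSELVQ\n"
    "KAKLAEQAER\n"
    ">sp|X00002|EX2_MOUSE example protein OS=Mus musculus\n"
    "MDDREDLVYQ\n"
)

human = [(h, s) for h, s in read_fasta(sample) if "Homo sapiens" in h]
for h, s in human:
    print(h)
    print(s)
```

With the records available as (header, sequence) pairs, each exercise reduces to a different condition in the final filter.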

Page 36: Parsing data records

Exercise 8

8) Read FASTA records from a file and count the cysteine residues in each sequence.

Page 37: Parsing data records

fasta = open('sprot_prot.fasta')

seq = ''

for line in fasta:
    if line[0] == '>' and seq == '':
        header = line[4:10]    # the accession code, e.g. P31946
    elif line[0] != '>':
        seq = seq + line.strip()
    elif line[0] == '>' and seq != '':
        cys_num = seq.count('C')
        print header, ': ', cys_num
        header = line[4:10]
        seq = ''

# count again for the last record; otherwise the previous record's count would be printed
cys_num = seq.count('C')
print header, ': ', cys_num

Read the records from a file and count the cysteine residues in each sequence

Page 38: Parsing data records

Exercises 9, 10, and 11

9) Read a multiple sequence file in FASTA format and write to a new file only the records of the sequences starting with a methionine ('M').
10) Read a multiple sequence file in FASTA format and write to a new file only the records of the sequences having at least two tryptophan residues ('W').
11) Read a multiple sequence file in FASTA format and write to a new file only the records whose sequences start with a methionine ('M') and have at least two tryptophans ('W').

Page 39: Parsing data records

outfile = open('SwissProtHuman-Filtered.fasta', 'w')
fasta = open('SwissProtHuman.fasta', 'r')

seq = ''

for line in fasta:
    if line[0:1] == '>' and seq == '':
        header = line
    elif line[0:1] != '>':
        seq = seq + line
    elif line[0:1] == '>' and seq != '':
        TRP_num = seq.count('W')
        if seq[0] == 'M' and TRP_num > 1:
            outfile.write(header + seq)
        seq = ''
        header = line

# process the last record
TRP_num = seq.count('W')
if seq[0] == 'M' and TRP_num > 1:
    outfile.write(header + seq)

outfile.close()
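The selection rule of Exercise 11 can be isolated in a small function, which makes it easy to test on its own. This is a sketch in Python 3 syntax; the function name keep_record and the min_trp parameter are hypothetical, and the short test strings are not real protein sequences. Using startswith() instead of seq[0] also avoids an IndexError when the sequence is empty.

```python
def keep_record(seq, min_trp=2):
    """Return True if seq starts with 'M' and has at least min_trp 'W' residues."""
    seq = seq.strip()
    return seq.startswith("M") and seq.count("W") >= min_trp


print(keep_record("MWAW"))  # starts with M, two W
print(keep_record("MWA"))   # starts with M, only one W
print(keep_record("AWW"))   # two W, but does not start with M
print(keep_record(""))      # empty sequence
```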

Page 40: Parsing data records

In many cases you will need to compare data from different files

Page 41: Parsing data records

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ
>sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSW
RVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESK
VFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFS
VFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDE
EAGEGN
.........

SwissProt-Human.fasta

cancer-expressed.txt

Page 42: Parsing data records
Page 43: Parsing data records

1) Read 10 SwissProt ACs from a file
2) Store them into a data structure

cancer_file = open('cancer-expressed.txt')

cancer_list = []

for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)

print cancer_list

Page 44: Parsing data records

List data structure

A list is a mutable ordered collection of objects

L = [1, [2,3], 4.52, 'DNA']

The elements of a list can be any kind of object:
• numbers
• strings
• tuples
• lists
• dictionaries
• function calls
• etc.

L = []    The empty list

Page 45: Parsing data records
Page 46: Parsing data records

>>> L = [1, "hello", 12.1, [1, 2, "three"], "seq", (1, 2)]
>>> L[0]        # indexing
1
>>> L[3]        # indexing
[1, 2, 'three']
>>> L[3][2]     # indexing
'three'
>>> L[-1]       # negative indexing
(1, 2)
>>> L[2:4]      # slicing
[12.1, [1, 2, 'three']]
>>> L[2:]       # slicing shorthand
[12.1, [1, 2, 'three'], 'seq', (1, 2)]
>>>

Page 47: Parsing data records

The elements of a list can be changed/replaced after the list has been defined

l[i] = x
l[i:j] = t
del l[i:j]
del l[i:j:k]
l.append(x)
l.extend(x)

>>> l = [2,3,5,7,8,['a','b'],'a','b','cde']
>>> l[0] = 1
>>> l
[1, 3, 5, 7, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> l[0:3] = 'DNA'
>>> l
['D', 'N', 'A', 7, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> del l[0:5]
>>> l
[['a', 'b'], 'a', 'b', 'cde']
>>> l.append('DNA')
>>> l
[['a', 'b'], 'a', 'b', 'cde', 'DNA']
>>> l.extend('dna')
>>> l
[['a', 'b'], 'a', 'b', 'cde', 'DNA', 'd', 'n', 'a']
>>>

These operations CHANGE the list

Page 48: Parsing data records

l.count(x)
l.index(x)
l.insert(i, x)
l.pop(i)
l.remove(x)

>>> l = [1,3,5,7,8,['a','b'],'a','b','cde']
>>> l.count('a')
1
>>> l.index(8)
4
>>> l.insert(4, 80)
>>> l
[1, 3, 5, 7, 80, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> l.pop(4)
80
>>> l
[1, 3, 5, 7, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> l.pop()
'cde'
>>> l
[1, 3, 5, 7, 8, ['a', 'b'], 'a', 'b']
>>> l.remove(8)
>>> l
[1, 3, 5, 7, ['a', 'b'], 'a', 'b']

The elements of a list can be changed/replaced after the list has been defined

Page 49: Parsing data records

l.reverse()
l.sort()
sorted(l)

>>> l = [4, 3, 2, 1, 5, 6, 7, 8]
>>> l.reverse()
>>> l
[8, 7, 6, 5, 1, 2, 3, 4]
>>> new = sorted(l)
>>> new
[1, 2, 3, 4, 5, 6, 7, 8]
>>> l
[8, 7, 6, 5, 1, 2, 3, 4]
>>> l.sort()
>>> l
[1, 2, 3, 4, 5, 6, 7, 8]

The elements of a list can be changed/replaced after the list has been defined

Page 50: Parsing data records

Putting together lists and loops
range() and xrange() built-in functions

>>> range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> range(1, 11)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> range(0, 30, 5)
[0, 5, 10, 15, 20, 25]
>>> range(0, 10, 3)
[0, 3, 6, 9]
>>> range(0, -10, -1)
[0, -1, -2, -3, -4, -5, -6, -7, -8, -9]
>>> range(0)
[]
>>> range(1, 0)
[]
# the xrange() function is more commonly used in for loops than range()
>>> for i in xrange(5):
...     print i
...
0
1
2
3
4

The xrange() function generates the values on demand, i.e. it does not build the whole list in memory
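A note for readers on Python 3, which is an addition to the slides: there xrange() no longer exists and range() itself produces its values on demand, so it plays the role that xrange() plays in the Python 2 examples here. The sketch below illustrates the memory difference between a lazy range and a fully built list; the size of one million is an arbitrary choice.

```python
import sys

lazy = range(1000000)   # Python 3: values are produced on demand
as_list = list(lazy)    # forces all one million values into memory

# The range object stays small no matter how many values it describes.
print(sys.getsizeof(lazy) < sys.getsizeof(as_list))

# It can still be used directly in a for loop.
first_five = []
for i in range(5):
    first_five.append(i)
print(first_five)
```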

Page 51: Parsing data records

Exercise 12

12) Create a list containing Uniprot ACs extracted from a FASTA file. Print the list.

Page 52: Parsing data records

InputFile = open("SwissProtHuman.fasta", "r")
AC_list = []
for line in InputFile:
    if line[0] == '>':
        fields = line.split('|')    # e.g. ['>sp', 'P31946', '1433B_HUMAN ...']
        AC_list.append(fields[1])
print AC_list
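The split('|') step relies on the UniProt header layout, in which the AC is the second '|'-separated field. As a sketch (Python 3 syntax, with a hypothetical helper name extract_ac), the same step can be written with a guard for lines that do not follow that layout:

```python
def extract_ac(header):
    """Return the AC from a '>db|AC|name ...' header, or None if it does not fit."""
    fields = header.split("|")
    if header.startswith(">") and len(fields) >= 3:
        return fields[1]
    return None


header = ">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens"
print(extract_ac(header))
print(extract_ac(">AP006852"))  # no '|' separators in this header
```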

Page 53: Parsing data records

By the way…. Exercise 13

13) Read a file in FASTA format and copy to a new file the record ACs.

Page 54: Parsing data records

human_fasta = open('SwissProt-Human.fasta')
Outfile = open('SwissProt-Human-AC.txt', 'w')    # the output file must be opened in write mode

for line in human_fasta:
    if line[0] == '>':
        AC = line.split('|')[1]
        Outfile.write(AC + '\n')

Outfile.close()

Selectively extract ACs from a FASTA file

Page 55: Parsing data records

Exercise 14

14) Read the human FASTA file one record after the other. Check if the record header contains one of the 10 ACs. If YES, copy the header to a new file.

Page 56: Parsing data records

Read the human FASTA file one record after the other.
Check if the record header contains one of the 10 ACs.
If YES, copy the header to a new file.

cancer_file = open('cancer-expressed.txt')
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('cancer-expressed.fasta', 'w')
cancer_list = []
for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)

for line in human_fasta:
    if line[0] == '>':
        AC = line.split('|')[1]
        if AC in cancer_list:
            Outfile.write(line)
Outfile.close()

We are not writing the whole record but the header line only

Page 57: Parsing data records

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ
>sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSW
RVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESK
VFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFS
VFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDE
EAGEGN

SwissProt-Human.fasta

Page 58: Parsing data records

Exercise 15

15) Read a multiple sequence file in FASTA format and write to a new file only the records the Uniprot ACs of which are present in the list created in 12).

Page 59: Parsing data records


cancer_file = open('cancer-expressed.txt')
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('cancer_expressed.fasta', 'w')

cancer_list = []

for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)

for line in human_fasta:
    if line[0] == ">":
        field = line.split("|")
        AC = field[1]
        if AC in cancer_list:
            Outfile.write(line)
    else:
        # sequence line: AC still holds the AC of the current record
        if AC in cancer_list:
            Outfile.write(line)

Outfile.close()

Page 60: Parsing data records


cancer_file = open('cancer-expressed.txt')
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('cancer_expressed.fasta', 'w')

cancer_list = []
seq = ''

for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)

for line in human_fasta:
    if line[0] == '>' and seq == '':
        header = line
        AC = line.split('|')[1]
    elif line[0] != '>':
        seq = seq + line
    elif line[0] == '>' and seq != '':
        if AC in cancer_list:
            Outfile.write(header + seq)
        header = line
        AC = line.split('|')[1]
        seq = ''

# process the last record
if AC in cancer_list:
    Outfile.write(header + seq)

Outfile.close()

The same but with more control…

Page 61: Parsing data records
Page 62: Parsing data records

Extract and write to a file the gene sequence from the Candida albicans genomic DNA, chromosome 7, complete sequence (file ap006852.gbk)

Try to write it in FASTA format:

>AP006852
Ccactgtccaatggctcaacacgccaatcatcatacaatacccccaacaggaatcaccaaagtactgatgcttctcactatcaatagtttgtactttcaccacacaatagcagatgatccatctaaatccaccttcctatcgatcgtgaccacccccataaaataggtcaactccataaacacctccatcaccaacgctagactcacaacccagaacatgttaatcaaccggtgggccaa
Gtaccgttgtagctctctcgtaaacacaagaaccaacaccaaacaacatactacaactga
......

Page 63: Parsing data records

Exercise 16

16) Read a Genbank record and write to a file the nucleotide sequence in FASTA format.

Page 64: Parsing data records

InputFile = open("ap006852.gbk")
OutputFile = open("ap006852.fasta", "w")
flag = 0

for line in InputFile:
    if line[0:9] == 'ACCESSION':
        AC = line.split()[1].strip()
        OutputFile.write('>' + AC + '\n')
    if line[0:6] == 'ORIGIN':
        flag = 1    # the sequence starts on the next line
        continue
    if flag == 1:
        fields = line.split()
        if fields != []:
            seq = ''.join(fields[1:])    # fields[0] is the position number
            OutputFile.write(seq + '\n')

InputFile.close()
OutputFile.close()
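The flag-based parsing can be tried without the real ap006852.gbk file. The sketch below uses Python 3 syntax and a hypothetical function name, genbank_to_fasta; the in-memory record is a heavily abridged, invented stand-in that only keeps the ACCESSION, ORIGIN and terminating // lines. Unlike the code above, it also stops collecting at the // line.

```python
import io


def genbank_to_fasta(handle):
    """Return a FASTA string built from the ACCESSION and ORIGIN sections."""
    ac = None
    in_origin = False
    seq_lines = []
    for line in handle:
        if line.startswith("ACCESSION"):
            ac = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True          # sequence lines follow
        elif line.startswith("//"):
            in_origin = False         # end of the record
        elif in_origin:
            fields = line.split()
            if fields:
                # fields[0] is the position number; the rest are sequence blocks
                seq_lines.append("".join(fields[1:]))
    return ">" + str(ac) + "\n" + "\n".join(seq_lines) + "\n"


# Abridged, invented stand-in for a GenBank record.
sample = io.StringIO(
    "LOCUS       AP006852  40 bp  DNA\n"
    "ACCESSION   AP006852\n"
    "ORIGIN\n"
    "        1 ccactgtcca atggctcaac\n"
    "       21 acgccaatca tcatacaata\n"
    "//\n"
)

fasta_text = genbank_to_fasta(sample)
print(fasta_text)
```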

Page 65: Parsing data records

Parsing data records

• Start by visually inspecting the file you want to parse

• Identify the information you want to extract

• Identify separators to select your information using if conditions

• Use lists if you have to compare data from different files
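As a side note to the last point, a set can stand in for the list when the only operation needed is a membership test ("is this AC one of the selected ones?"). This swaps in a different data structure from the one the slides use. The sketch is in Python 3 syntax, and the two ACs placed in the in-memory stand-in for cancer-expressed.txt are an arbitrary assumption, since the real content of that file is not shown.

```python
import io

# Invented stand-in for cancer-expressed.txt: one AC per line.
cancer_file = io.StringIO("P31946\nQ04917\n")

cancer_acs = set()
for line in cancer_file:
    ac = line.strip()
    if ac:  # skip empty lines
        cancer_acs.add(ac)

headers = [
    ">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens",
    ">sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens",
    ">sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH",
]

# Keep only the headers whose AC (second '|' field) is in the set.
selected = [h for h in headers if h.split("|")[1] in cancer_acs]
for h in selected:
    print(h)
```

The "in" test reads the same for lists and sets; the difference only matters for speed when the collection of ACs is large.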

Page 66: Parsing data records

cancer_file = open('cancer-expressed.txt')

cancer_list = []
line = cancer_file.readline()
while line:    # readline() returns an empty string at the end of the file
    AC = line.strip()
    cancer_list.append(AC)
    line = cancer_file.readline()

We can use while loops to read files (but usually we won't do it)

Page 67: Parsing data records

You can repeat all exercises using ncbi_gene.fasta as input file

Page 68: Parsing data records

Summary

• Parsing sequence records in FASTA format

• Lists

• Making choices: if/elif/else

• range() and xrange()