Parsing data records

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN

A sequence record in FASTA format

seq = ">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens \
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS\
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY\
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY\
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD\
AGEGEN"

for i in seq:
    print i

seq = open("SingleSeq.fasta")

for line in seq: print line

seq = open("SingleSeq.fasta")
seq_2 = open("SingleSeq-2.fasta", "w")    # 'w' opens the file for writing

for line in seq:
    seq_2.write(line)

seq_2.close()

Writing different things depending on a condition

Read a sequence in FASTA format and print only the header of the sequence

for line in seq:
    if line[0] == '>':
        print line

Making choices: The if/elif/else statements

if <condition 1>:        # if the expression in <condition 1> is True
    <statements 1>       # execute statements 1
[elif <condition 2>:     # else, if the expression in <condition 2> is True
    <statements 2>]      # execute statements 2
[elif <condition 3>:     # etc.
    pass]
...
[else:
    <statements N>]

>>> s = "MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAE"
>>> s_len = float(len(s))
>>> G_num = s.count('G')
>>> A_num = s.count('A')
>>> freq_G = G_num/s_len
>>> freq_A = A_num/s_len
>>> print freq_G
0.116666666667
>>> print freq_A
0.216666666667

Write different things depending on a condition

>>> if freq_G > freq_A:
...     print "Gly is more frequent than Ala"
... elif freq_G < freq_A:
...     print "Ala is more frequent than Gly"
... else:
...     print "The frequency of Gly and Ala is the same"
...
Ala is more frequent than Gly

Write different things depending on a condition

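The if/elif/else construct behaves differently from a series of separate if statements: in an if/elif/else chain at most one branch is executed, whereas each if in a series is tested independently of the others.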
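A minimal sketch of this difference. The two helper functions, classify_chain and classify_series, are hypothetical names introduced only for illustration, and print() is called as a function so the sketch also runs under Python 3.

```python
def classify_chain(n):
    """if/elif/else: at most one branch is executed."""
    labels = []
    if n > 10:
        labels.append("greater than 10")
    elif n > 5:
        labels.append("greater than 5")
    else:
        labels.append("5 or less")
    return labels


def classify_series(n):
    """Series of ifs: every condition is tested independently."""
    labels = []
    if n > 10:
        labels.append("greater than 10")
    if n > 5:
        labels.append("greater than 5")
    return labels


print(classify_chain(12))   # ['greater than 10']
print(classify_series(12))  # ['greater than 10', 'greater than 5']
```

For the value 12 both conditions are true, so the chain stops at the first one while the series reports both.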

for line in seq:
    if line[0] == '>':
        print line

for line in seq:
    if line[0] != '>':
        print line

Comparison operators: == != >= <= > <

Exercises 1, 2, and 3

1) Read a file in FASTA format and write to a new file only the header of the record.
2) Read a file in FASTA format and write to a new file only the sequence (without the header).
3) Merge 1) and 2). In other words, read a file in FASTA format and write the header to a file and the sequence to a different one.

fasta = open('SingleSeq.fasta')
header = open('header.txt', 'w')

for line in fasta:
    if line[0] == '>':
        header.write(line)

header.close()

fasta = open('SingleSeq.fasta')
seq = open('seq.txt', 'w')

for line in fasta:
    if line[0] != '>':
        seq.write(line)

seq.close()

fasta = open('SingleSeq.fasta')
header = open('header.txt', 'w')
seq = open('seq.txt', 'w')

for line in fasta:
    if line[0] == '>':
        header.write(line)
    else:
        seq.write(line)

header.close()
seq.close()
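As a possible variation on the solution to exercise 3, the sketch below wraps the same loop in a function and uses the with statement, which closes the files automatically. The function name split_fasta and the shortened record written to a temporary directory are assumptions made for the demonstration, not part of the slides.

```python
import os
import tempfile


def split_fasta(fasta_path, header_path, seq_path):
    """Write the header lines and the sequence lines to two separate files."""
    # 'with' closes all three files on exit, even if an error occurs
    with open(fasta_path) as fasta, \
         open(header_path, 'w') as header, \
         open(seq_path, 'w') as seq:
        for line in fasta:
            if line.startswith('>'):
                header.write(line)
            else:
                seq.write(line)


# Demonstration on a throwaway file holding a made-up, shortened record
tmp_dir = tempfile.mkdtemp()
fasta_path = os.path.join(tmp_dir, 'SingleSeq.fasta')
with open(fasta_path, 'w') as handle:
    handle.write(">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha\n"
                 "MTMDKSELVQ\nKAKLAEQAER\n")

header_path = os.path.join(tmp_dir, 'header.txt')
seq_path = os.path.join(tmp_dir, 'seq.txt')
split_fasta(fasta_path, header_path, seq_path)

with open(header_path) as handle:
    header_text = handle.read()
with open(seq_path) as handle:
    seq_text = handle.read()
```

With this layout there is no need to remember a close() call for each file.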

Let’s increase the difficulty just a bit…

seq_fasta = open("SingleSeq.fasta")

seq = ''

for line in seq_fasta:
    if line[0] == '>':
        header = line
    else:
        seq = seq + line.strip()

num_cys = seq.count("C")

print header, seq, num_cys

Exercise 4

4) Read a file in FASTA format. Print or write the record to a file only if the sequence is from Homo sapiens.

seq_fasta = open("SingleSeq.fasta")

seq = ''
header = ''

for line in seq_fasta:
    if line[0] == '>':
        if "Homo sapiens" in line:
            header = line
    else:
        if header:
            seq = seq + line

if header:
    print header + seq
else:
    print "The record is not from H. sapiens"

In general, you will need to analyse several sequences…

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVFYYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVFYYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGEEQNKEALQDVEDENQ
>sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSWRVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESKVFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFSVFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDEEAGEGN
.........

SwissProt-Human.fasta

Read the records from a file and write them to a new file

fasta = open('SwissProt-Human.fasta')
fasta_2 = open('SwissProt-Human_2.fasta', 'w')

for line in fasta:
    fasta_2.write(line)    # the argument of write() must be a string

fasta_2.close()

Strings can be concatenated

>>> print "ACTGGTA" + "ATGTAACTT"
ACTGGTAATGTAACTT

Strings can be indexed and sliced

>>> s = "ACTGGTA"
>>> s[0]
'A'
>>> s[1:3]
'CT'

String elements cannot be re-assigned

>>> s[2] = 'Z'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
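Since a string cannot be changed in place, the usual workaround is to build a new string from slices of the old one. A small sketch (the variable name mutated is only illustrative; print() is called as a function so it also runs under Python 3):

```python
s = "ACTGGTA"

# Combine the part before position 2, the new character, and the part after it
mutated = s[:2] + 'Z' + s[3:]

print(mutated)  # ACZGGTA
print(s)        # ACTGGTA (the original string is unchanged)
```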

Read the sequences from a file and write them to a new file

fasta = open('SwissProt-Human.fasta')
fasta_2 = open('SwissProt-Human_2.fasta', 'w')

n = 0
for line in fasta:
    n = n + 1                             # number the lines starting from 1
    l_n = str(n)
    fasta_2.write(l_n + "\t" + line)

fasta_2.close()
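The manual counter can also be replaced by the built-in enumerate() function, which yields each item together with its index. In this sketch a list of made-up lines stands in for the open file (a file object yields its lines in the same way), and number_lines is a hypothetical helper name.

```python
def number_lines(lines):
    """Prefix each line with its 1-based number and a tab."""
    numbered = []
    for n, line in enumerate(lines, 1):    # the second argument starts the count at 1
        numbered.append(str(n) + "\t" + line)
    return numbered


lines = [">sp|P31946|1433B_HUMAN\n", "MTMDKSELVQ\n"]
print(number_lines(lines))
# ['1\t>sp|P31946|1433B_HUMAN\n', '2\tMTMDKSELVQ\n']
```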

Exercise 5

5) Download a Uniprot multiple sequence FASTA file. Write the record headers to a new file.

fasta = open('SwissProt-Human.fasta')
headers = open('headers.txt', 'w')

for line in fasta:
    if line[0] == '>':
        headers.write(line)

headers.close()

Exercise 6

6) Read a multiple sequence FASTA file and write the sequences to a new file separated by a blank line

fasta = open('SwissProt-Human.fasta')
seqs = open('seqs.txt', 'w')

for line in fasta:
    if line[0] == '>':
        seqs.write('\n')
    elif line[0] != '>':
        seqs.write(line)    # or: seqs.write(line.strip() + '\n')

seqs.close()

Exercise 7

7) Read a multiple sequence FASTA file and write to a new file only the records from Homo sapiens.

fasta = open('sprot_prot.fasta')
output = open('homo_sapiens.fasta', 'w')

seq = ''

for line in fasta:
    if line[0] == '>' and seq == '':
        header = line
    elif line[0] != '>':
        seq = seq + line
    elif line[0] == '>' and seq != '':
        if "Homo sapiens" in header:
            output.write(header + seq)
        header = line
        seq = ''

# the last record is still pending when the loop ends
if "Homo sapiens" in header:
    output.write(header + seq)

output.close()

Exercise 8

8) Read FASTA records from a file and count the cysteine residues in each sequence.

fasta = open('sprot_prot.fasta')

seq = ''

for line in fasta:
    if line[0] == '>' and seq == '':
        header = line[4:10]
    elif line[0] != '>':
        seq = seq + line.strip()
    elif line[0] == '>' and seq != '':
        cys_num = seq.count('C')
        print header, ': ', cys_num
        header = line[4:10]
        seq = ''

# count the cysteines of the last record before printing it
cys_num = seq.count('C')
print header, ': ', cys_num

Read the records from a file and count the cysteine residues in each sequence
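The bookkeeping needed to detect where one record ends and the next begins can be isolated in a generator function, so that the analysis code only sees complete records. This is a sketch of one possible approach, not the method used on the slides: read_fasta is a hypothetical name, the records are made up and shortened, and a list of lines stands in for the open file.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    seq = ''
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            if header is not None:     # a previous record is complete
                yield header, seq
            header = line
            seq = ''
        elif line:                     # skip blank lines
            seq = seq + line
    if header is not None:             # do not forget the last record
        yield header, seq


fasta_lines = [">sp|P31946|1433B_HUMAN\n", "MTMDKCEL\n", "VQKC\n",
               ">sp|P62258|1433E_HUMAN\n", "MDDREDLV\n"]

for header, seq in read_fasta(fasta_lines):
    ac = header.split('|')[1]
    print(ac + ': ' + str(seq.count('C')))
# P31946: 2
# P62258: 0
```

The last record is emitted after the loop, which is the step that is easiest to overlook when the parsing and the counting are written in the same loop.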

Exercises 9, 10, and 11

9) Read a multiple sequence file in FASTA format and write to a new file only the records of the sequences starting with a methionine ('M').
10) Read a multiple sequence file in FASTA format and write to a new file only the records of the sequences having at least two tryptophan residues ('W').
11) Read a multiple sequence file in FASTA format and write to a new file only the records the sequences of which start with a methionine ('M') and have at least two tryptophans ('W').

outfile = open('SwissProtHuman-Filtered.fasta','w')
fasta = open('SwissProtHuman.fasta','r')

seq = ''

for line in fasta:
    if line[0:1] == '>' and seq == '':
        header = line
    elif line[0:1] != '>':
        seq = seq + line
    elif line[0:1] == '>' and seq != '':
        TRP_num = seq.count('W')
        if seq[0] == 'M' and TRP_num > 1:
            outfile.write(header + seq)
        seq = ''
        header = line

# check the last record
TRP_num = seq.count('W')
if seq[0] == 'M' and TRP_num > 1:
    outfile.write(header + seq)
outfile.close()
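Because the same test appears twice in the solution (inside the loop and again for the last record), it may be convenient to move it into a small function. The name keep_record and the min_trp parameter are assumptions made for this sketch.

```python
def keep_record(seq, min_trp=2):
    """True if seq starts with Met and has at least min_trp Trp residues."""
    seq = seq.replace('\n', '')    # ignore the line breaks kept inside the record
    # startswith() is safe on an empty string, whereas seq[0] would raise an error
    return seq.startswith('M') and seq.count('W') >= min_trp


print(keep_record("MAWKW\nLLA\n"))   # True
print(keep_record("AWKW\n"))         # False: does not start with M
print(keep_record("MAWK\n"))         # False: only one W
```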

In many cases you will need to compare data from different files

cancer-expressed.txt

1) Read 10 SwissProt ACs from a file
2) Store them into a data structure

cancer_file = open('cancer-expressed.txt')

cancer_list = []

for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)

print cancer_list

List data structure

A list is a mutable ordered collection of objects

L = [1, [2,3], 4.52, 'DNA']

The elements of a list can be any kind of object: numbers, strings, tuples, lists, dictionaries, function calls, etc.

L = []    # the empty list

>>> L = [1, "hello", 12.1, [1, 2, "three"], "seq", (1, 2)]
>>> L[0]        # indexing
1
>>> L[3]        # indexing
[1, 2, 'three']
>>> L[3][2]     # indexing
'three'
>>> L[-1]       # negative indexing
(1, 2)
>>> L[2:4]      # slicing
[12.1, [1, 2, 'three']]
>>> L[2:]       # slicing shorthand
[12.1, [1, 2, 'three'], 'seq', (1, 2)]

The elements of a list can be changed/replaced after the list has been defined

l[i] = x
l[i:j] = t
del l[i:j]
del l[i:j:k]
l.append(x)
l.extend(x)

>>> l = [2,3,5,7,8,['a','b'],'a','b','cde']
>>> l[0] = 1
>>> l
[1, 3, 5, 7, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> l[0:3] = 'DNA'
>>> l
['D', 'N', 'A', 7, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> del l[0:5]
>>> l
[['a', 'b'], 'a', 'b', 'cde']
>>> l.append('DNA')
>>> l
[['a', 'b'], 'a', 'b', 'cde', 'DNA']
>>> l.extend('dna')
>>> l
[['a', 'b'], 'a', 'b', 'cde', 'DNA', 'd', 'n', 'a']

These operations CHANGE the list

l.count(x)
l.index(x)
l.insert(i, x)
l.pop(i)
l.remove(x)

>>> l = [1,3,5,7,8,['a','b'],'a','b','cde']
>>> l.count('a')
1
>>> l.index(8)
4
>>> l.insert(4, 80)
>>> l
[1, 3, 5, 7, 80, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> l.pop(4)
80
>>> l
[1, 3, 5, 7, 8, ['a', 'b'], 'a', 'b', 'cde']
>>> l.pop()
'cde'
>>> l
[1, 3, 5, 7, 8, ['a', 'b'], 'a', 'b']
>>> l.remove(8)
>>> l
[1, 3, 5, 7, ['a', 'b'], 'a', 'b']

l.reverse()
l.sort()
sorted(l)

>>> l = [4, 3, 2, 1, 5, 6, 7, 8]
>>> l.reverse()
>>> l
[8, 7, 6, 5, 1, 2, 3, 4]
>>> new = sorted(l)
>>> new
[1, 2, 3, 4, 5, 6, 7, 8]
>>> l
[8, 7, 6, 5, 1, 2, 3, 4]
>>> l.sort()
>>> l
[1, 2, 3, 4, 5, 6, 7, 8]
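The difference between sorted(l) and l.sort() is easy to trip over: the first returns a new list and leaves the original untouched, the second sorts in place and returns None. A short sketch using three of the ACs shown earlier (the variable names are illustrative; print() is called as a function so it also runs under Python 3):

```python
acs = ['P62258', 'Q04917', 'P31946']

ordered = sorted(acs)    # returns a new sorted list, acs is unchanged
print(ordered)           # ['P31946', 'P62258', 'Q04917']
print(acs)               # ['P62258', 'Q04917', 'P31946']

result = acs.sort()      # sorts acs in place and returns None
print(result)            # None
print(acs)               # ['P31946', 'P62258', 'Q04917']
```

Writing acs = acs.sort() would therefore replace the list with None.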

Putting together lists and loops
range() and xrange() built-in functions

>>> range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> range(1, 11)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> range(0, 30, 5)
[0, 5, 10, 15, 20, 25]
>>> range(0, 10, 3)
[0, 3, 6, 9]
>>> range(0, -10, -1)
[0, -1, -2, -3, -4, -5, -6, -7, -8, -9]
>>> range(0)
[]
>>> range(1, 0)
[]
>>> # xrange() is more commonly used in for loops than range()
>>> for i in xrange(5):
...     print i
...
0
1
2
3
4

The xrange() function generates the values one at a time, as the loop asks for them, i.e. it does not build and store the whole list in memory.
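A typical use of a stepped range in sequence work is to cut a long sequence into fixed-width lines, as in a FASTA file. The sketch below uses range() so that it runs unchanged under Python 2 and Python 3 (where range() is itself lazy and xrange() no longer exists); wrap is a hypothetical helper name.

```python
def wrap(seq, width=60):
    """Split a sequence into lines of at most width characters."""
    lines = []
    for i in range(0, len(seq), width):    # i = 0, width, 2*width, ...
        lines.append(seq[i:i + width])
    return lines


print(wrap("MTMDKSELVQKAKLAEQAERYDD", 10))
# ['MTMDKSELVQ', 'KAKLAEQAER', 'YDD']
```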

Exercise 12

12) Create a list containing Uniprot ACs extracted from a FASTA file. Print the list.

InputFile = open("SwissProtHuman.fasta","r")
AC_list = []
for line in InputFile:
    if line[0] == '>':
        fields = line.split('|')
        AC_list.append(fields[1])
print AC_list
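To see why fields[1] holds the AC, it helps to look at what split('|') returns for one of the headers shown earlier. The entry_name variable is an extra illustration, not part of the exercise.

```python
header = ">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens"

fields = header.split('|')
# fields is now:
# ['>sp', 'P31946', '1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens']

ac = fields[1]                        # the accession code
entry_name = fields[2].split()[0]     # first word of the third field

print(ac)           # P31946
print(entry_name)   # 1433B_HUMAN
```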

By the way…. Exercise 13

13) Read a file in FASTA format and copy to a new file the record ACs.

human_fasta = open('SwissProt-Human.fasta')
Outfile = open('SwissProt-Human-AC.txt', 'w')    # 'w' is needed to write to the file

for line in human_fasta:
    if line[0] == '>':
        AC = line.split('|')[1]
        Outfile.write(AC + '\n')

Outfile.close()

Selectively extract ACs from a FASTA file

Exercise 14

14) Read the human FASTA file one record after the other. Check if the record header contains one of the 10 ACs. If YES, copy the header to a new file.

cancer_file = open('cancer-expressed.txt')
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('cancer-expressed.fasta', 'w')

cancer_list = []
for line in cancer_file:
    AC = line.strip()
    cancer_list.append(AC)

for line in human_fasta:
    if line[0] == '>':
        AC = line.split('|')[1]
        if AC in cancer_list:
            Outfile.write(line)

Outfile.close()

We are not writing the whole record but the header line only
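With 10 ACs a list is perfectly adequate. If the list of ACs were much longer, a set could be used instead, since testing membership with in is generally faster on a set than on a list. The sketch below shows that variation on made-up, shortened data; headers_matching is a hypothetical name, and lists of lines stand in for the two open files.

```python
def headers_matching(fasta_lines, ac_lines):
    """Return the header lines whose AC appears among ac_lines."""
    wanted = set(line.strip() for line in ac_lines)    # fast membership tests
    selected = []
    for line in fasta_lines:
        if line.startswith('>'):
            ac = line.split('|')[1]
            if ac in wanted:
                selected.append(line)
    return selected


fasta_lines = [">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha\n",
               "MTMDKSELVQ\n",
               ">sp|P62258|1433E_HUMAN 14-3-3 protein epsilon\n",
               "MDDREDLVYQ\n"]
ac_lines = ["P62258\n", "Q04917\n"]

selected = headers_matching(fasta_lines, ac_lines)
print(selected)
# ['>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon\n']
```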

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVFYYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVFYYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGEEQNKEALQDVEDENQ
>sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSWRVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESKVFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFSVFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDEEAGEGN

Exercise 15

15) Read a multiple sequence file in FASTA format and write to a new file only the records the Uniprot ACs of which are present in the list created in 12).

cancer_file = open('cancer-expressed.txt')
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('cancer_expressed.fasta','w')

cancer_list = []
for line in cancer_file:            # fill the list of ACs first
    AC = line.strip()
    cancer_list.append(AC)

for line in human_fasta:
    if line[0] == ">":
        field = line.split("|")
        AC = field[1]
        if AC in cancer_list:
            Outfile.write(line)
    else:
        if AC in cancer_list:       # AC of the current record
            Outfile.write(line)

Outfile.close()

cancer_file = open('cancer-expressed.txt')
human_fasta = open('SwissProt-Human.fasta')
Outfile = open('cancer_expressed.fasta','w')

cancer_list = []
for line in cancer_file:            # fill the list of ACs first
    cancer_list.append(line.strip())

seq = ''

for line in human_fasta:
    if line[0] == '>' and seq == '':
        header = line
        AC = line.split('|')[1]
    elif line[0] != '>':
        seq = seq + line
    elif line[0] == '>' and seq != '':
        if AC in cancer_list:
            Outfile.write(header+seq)
        header = line
        AC = line.split('|')[1]
        seq = ''

# the last record is still pending when the loop ends
if AC in cancer_list:
    Outfile.write(header+seq)

Outfile.close()

The same but with more control…

Extract and write to a file the gene sequence from the Candida albicans genomic DNA, chromosome 7, complete sequence (file ap006852.gbk)

Try to write it in FASTA format:

>AP006852
Ccactgtccaatggctcaacacgccaatcatcatacaatacccccaacaggaatcaccaaagtactgatgcttctcactatcaatagtttgtactttcaccacacaatagcagatgatccatctaaatccaccttcctatcgatcgtgaccacccccataaaataggtcaactccataaacacctccatcaccaacgctagactcacaacccagaacatgttaatcaaccggtgggccaa
Gtaccgttgtagctctctcgtaaacacaagaaccaacaccaaacaacatactacaactga......

Exercise 16

16) Read a Genbank record and write to a file the nucleotide sequence in FASTA format.

InputFile = open("ap006852.gbk")
OutputFile = open("ap006852.fasta","w")
flag = 0

for line in InputFile:
    if line[0:9] == 'ACCESSION':
        AC = line.split()[1].strip()
        OutputFile.write('>'+AC+'\n')
    if line[0:6] == 'ORIGIN':
        flag = 1                        # the sequence starts on the next line
        continue
    if flag == 1:
        fields = line.split()
        if fields != []:
            seq = ''.join(fields[1:])   # drop the leading position number
            OutputFile.write(seq +'\n')

InputFile.close()
OutputFile.close()
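The same logic can be tried without the ap006852.gbk file by running it on a few lines held in a list. The record below is made up and heavily shortened (only the first 70 bases shown earlier are used), and genbank_to_fasta is a hypothetical name. Unlike the solution above, this sketch also stops at the // line that terminates a GenBank record.

```python
def genbank_to_fasta(lines):
    """Return FASTA-formatted lines built from the lines of a GenBank record."""
    out = []
    in_origin = False
    for line in lines:
        if line.startswith('ACCESSION'):
            out.append('>' + line.split()[1] + '\n')
        elif line.startswith('ORIGIN'):
            in_origin = True               # the sequence starts on the next line
        elif line.startswith('//'):
            in_origin = False              # end-of-record marker
        elif in_origin:
            fields = line.split()
            if fields:
                # fields[0] is the position number, the rest are blocks of bases
                out.append(''.join(fields[1:]) + '\n')
    return out


record = ["LOCUS       AP006852\n",
          "ACCESSION   AP006852\n",
          "ORIGIN\n",
          "        1 ccactgtcca atggctcaac acgccaatca tcatacaata cccccaacag gaatcaccaa\n",
          "       61 agtactgatg\n",
          "//\n"]

fasta = genbank_to_fasta(record)
print(''.join(fasta))
```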

Parsing data records

• Start by visually inspecting the file you want to parse

• Identify the information you want to extract

• Identify separators to select your information using if conditions

• Use lists if you have to compare data from different files

cancer_file = open('cancer-expressed.txt')

cancer_list = []
line = cancer_file.readline()
while line:
    AC = line.strip()
    cancer_list.append(AC)
    line = cancer_file.readline()

We can use while loops to read files (but usually we won’t do it)

You can repeat all exercises using ncbi_gene.fasta as input file

Summary

• Parsing sequence records in FASTA format

• Lists

• Making choices: if/elif/else

• range() and xrange()
