Sequence Matrix: Gene concatenation made easy

Upload: gaurav-vaidya

Post on 05-Jul-2015

DESCRIPTION

Creating large datasets by concatenating genes can be challenging. Sequence Matrix aims to make that process much easier. For more information, see http://code.google.com/p/sequencematrix/ or http://www3.interscience.wiley.com/journal/123577052/abstract

TRANSCRIPT

Page 1: Sequence Matrix: Gene concatenation made easy

Sequence Matrix

Gene concatenation made easy

Gaurav Vaidya (1), David Lohman (2), Rudolf Meier (2)

1: NeatCo Asia, Singapore. 2: Department of Biological Sciences, National University of Singapore, Singapore.

Page 2: Sequence Matrix: Gene concatenation made easy

Our goals

✤ Many powerful tools exist for concatenating sequences.

✤ But adding new sequences to an existing dataset is tedious and time-consuming.

✤ Our initial goal: simple, user-friendly program for concatenating sequences.

✤ We also added a few tools to help you look for lab contamination in your dataset.

Page 3: Sequence Matrix: Gene concatenation made easy

Sequence Matrix

✤ Written in Java.

✤ Uses Java's graphical user interface libraries.

✤ Works on different operating systems.

✤ Easy to install: download and run the batch file.

Page 4: Sequence Matrix: Gene concatenation made easy

Importing sequences

✤ You can use the sequence names as entered in the input file.

✤ Or you can ask Sequence Matrix to try to identify the species names.

Page 5: Sequence Matrix: Gene concatenation made easy

Importing sequences

✤ Sequences mode:

✤ gi|237510679|gb|AY556753.2|Daubentonia madagascariensis voucher WE94001 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence

✤ gi|237510678|gb|AY556735.2|Macaca sylvanus voucher OK96022 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence

✤ Species name

✤ Daubentonia madagascariensis

✤ Macaca sylvanus
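The species-name mode can be pictured with a short sketch. This is not Sequence Matrix's own code (the program is written in Java); it is a minimal Python illustration that assumes a GenBank-style name where the binomial follows the last `|` of the identifier block. The function name `guess_species_name` and the pattern are hypothetical.

```python
import re

# Assumed pattern: start of the name or a "|" separator, then a capitalised
# genus and a lowercase species epithet separated by one space.
_BINOMIAL = re.compile(r"(?:^|\|)\s*([A-Z][a-z]+) ([a-z][a-z-]+)\b")


def guess_species_name(sequence_name):
    """Return 'Genus species' if a binomial is found in the name, else None."""
    match = _BINOMIAL.search(sequence_name)
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"


name = ("gi|237510679|gb|AY556753.2|Daubentonia madagascariensis "
        "voucher WE94001 5.8S ribosomal RNA gene, partial sequence")
print(guess_species_name(name))
```

A name with no recognisable binomial returns `None`, so a caller could fall back to the full sequence name in that case.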

Page 6: Sequence Matrix: Gene concatenation made easy

Importing sequences

✤ A common source of error is forgetting to recode leading and trailing gaps as missing information.

✤ Sequence Matrix can automatically replace such gaps with question marks.
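As an illustration of the recoding step, here is a minimal Python sketch. It assumes `-` is the gap symbol and `?` the missing-data symbol; the function name `recode_terminal_gaps` is hypothetical and this is not the program's own implementation.

```python
def recode_terminal_gaps(sequence, gap="-", missing="?"):
    """Replace leading and trailing gaps with the missing-data symbol.

    Gaps inside the sequence are left alone, because they may be real indels.
    """
    without_leading = sequence.lstrip(gap)
    leading = len(sequence) - len(without_leading)
    core = without_leading.rstrip(gap)
    trailing = len(without_leading) - len(core)
    return missing * leading + core + missing * trailing


print(recode_terminal_gaps("--AC-GT---"))  # internal gap is kept
```

A sequence made up entirely of gaps comes back as all question marks, which matches the idea that it carries no information.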

Page 7: Sequence Matrix: Gene concatenation made easy

Importing sequences: Naming

✤ Sequences from one dataset are matched up to another dataset by sequence name.

✤ Errors in sequence naming need to be fixed.

✤ We recommend naming your files by gene name: ‘coi’, ‘cytb’, ‘28S’ and so on.
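To show why names matter, here is a minimal Python sketch of concatenation by sequence name. It assumes each gene is already aligned and held as a dictionary from taxon name to sequence; the function `concatenate` and the data layout are illustrative assumptions, not the program's internals.

```python
def concatenate(genes):
    """Join per-gene alignments into one matrix, matching rows by taxon name.

    genes maps gene name -> {taxon name -> aligned sequence}.
    Returns (matrix, charsets), where charsets holds 1-based inclusive
    (start, end) positions of each gene in the concatenated alignment.
    """
    taxa = sorted({taxon for seqs in genes.values() for taxon in seqs})
    matrix = {taxon: "" for taxon in taxa}
    charsets = {}
    position = 1
    for gene, seqs in genes.items():
        lengths = {len(seq) for seq in seqs.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to one length")
        length = lengths.pop()
        charsets[gene] = (position, position + length - 1)
        for taxon in taxa:
            # A taxon with no sequence for this gene is padded with '?'.
            matrix[taxon] += seqs.get(taxon, "?" * length)
        position += length
    return matrix, charsets


genes = {
    "coi": {"Macaca sylvanus": "ACGT", "Homo sapiens": "ACGA"},
    "28S": {"Macaca sylvanus": "GG", "Homo sapien": "GC"},  # misspelled name
}
matrix, charsets = concatenate(genes)
for taxon, sequence in matrix.items():
    print(taxon, sequence)
```

In this example the misspelled "Homo sapien" becomes its own row padded with question marks, which is the kind of naming error that has to be fixed before export.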

Page 8: Sequence Matrix: Gene concatenation made easy

Export: Taxonsets

✤ By default, we generate taxonsets on the basis of:

✤ Combined length.

✤ Number of character sets.

✤ Information for a particular gene.
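The three criteria above can be sketched in Python as follows. This is an illustration only: it assumes a taxon "has information" for a gene when its sequence contains at least one character other than `?` or `-`, and the function names are hypothetical.

```python
def has_data(sequence):
    """True if the sequence holds at least one non-missing, non-gap character."""
    return any(ch not in "?-" for ch in sequence)


def taxonsets_by_gene(genes):
    """One taxonset per gene: the taxa that have information for that gene."""
    return {gene: sorted(t for t, seq in seqs.items() if has_data(seq))
            for gene, seqs in genes.items()}


def taxa_with_min_charsets(genes, minimum):
    """Taxa that have information for at least `minimum` character sets."""
    counts = {}
    for seqs in genes.values():
        for taxon, seq in seqs.items():
            if has_data(seq):
                counts[taxon] = counts.get(taxon, 0) + 1
    return sorted(t for t, n in counts.items() if n >= minimum)


def taxa_with_min_length(genes, minimum):
    """Taxa whose combined count of non-missing characters is >= `minimum`."""
    totals = {}
    for seqs in genes.values():
        for taxon, seq in seqs.items():
            totals[taxon] = totals.get(taxon, 0) + sum(ch not in "?-" for ch in seq)
    return sorted(t for t, n in totals.items() if n >= minimum)
```

Each function returns plain lists of taxon names, which could then be written out in whatever taxonset syntax the downstream program expects.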

Page 9: Sequence Matrix: Gene concatenation made easy

Gene trees

✤ Two ways to build them:

✤ Use the taxonset of taxa having information for a particular gene to exclude other taxa.

✤ Export the entire dataset with one file per column.

Page 10: Sequence Matrix: Gene concatenation made easy

Export features

✤ You can also export the Sequence Matrix table as an Excel-readable text file.

✤ Supervisory mode.

✤ Keep track of a project as it grows.

Page 11: Sequence Matrix: Gene concatenation made easy

Character sets

✤ We can read character sets defined in Nexus CHARSET and TNT xgroup commands.

✤ These can be “split” into individual columns, or imported as a single column representing the entire file.
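Splitting on character sets can be sketched in Python. The sketch assumes simple Nexus `CHARSET name = start-end;` definitions with 1-based inclusive positions; it does not handle step notation such as `\3`, and it does not cover TNT xgroup commands. The function names are hypothetical.

```python
import re

# Matches "CHARSET <name> = <positions>;" in any letter case.
_CHARSET = re.compile(r"charset\s+(\S+)\s*=\s*([^;]+);", re.IGNORECASE)


def parse_charsets(nexus_text):
    """Return {charset name: [(start, end), ...]} with 1-based inclusive ranges."""
    charsets = {}
    for name, spec in _CHARSET.findall(nexus_text):
        ranges = []
        for token in spec.split():
            start, _, end = token.partition("-")
            ranges.append((int(start), int(end or start)))  # "7" means 7-7
        charsets[name] = ranges
    return charsets


def split_by_charsets(matrix, charsets):
    """Cut a concatenated matrix {taxon: sequence} into one column per charset."""
    return {
        name: {taxon: "".join(seq[start - 1:end] for start, end in ranges)
               for taxon, seq in matrix.items()}
        for name, ranges in charsets.items()
    }


sets_block = "BEGIN SETS;\n  CHARSET coi = 1-4;\n  CHARSET 28S = 5-6;\nEND;"
columns = split_by_charsets({"Macaca sylvanus": "ACGTGG"}, parse_charsets(sets_block))
print(columns)
```

Importing the file as a single column would simply skip the split and keep the whole matrix under the file's name.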

Page 12: Sequence Matrix: Gene concatenation made easy

Excision

✤ Individual sequences can be excised from the dataset.

✤ Excised sequences will not be exported.

✤ Sequence Matrix will warn you when excised sequences are left out of an export.

Page 13: Sequence Matrix: Gene concatenation made easy

Contamination

✤ You thought you were sequencing Gorilla gorilla

✤ but you were really sequencing Homo sapiens.

✤ We have two tools you can use:

✤ One for when Homo sapiens is in your dataset.

✤ One for when Homo sapiens is not in your dataset (experimental!).

Page 14: Sequence Matrix: Gene concatenation made easy

H. sapiens in dataset

✤ Looks for pairs of sequences whose pairwise distance is very low.

✤ Expected difference depends on gene:

✤ 28S doesn’t change very much, but

✤ COI changes very quickly.

✤ Some interpretation is required.
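The search for very similar pairs can be sketched in Python. This is an illustration under assumptions: it uses an uncorrected p-distance (the slides do not say which distance measure the program uses), skips sites where either sequence has `?`, `-` or `N`, and leaves the threshold to the user because the expected difference depends on the gene. The function names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Uncorrected p-distance over sites where both sequences have a base."""
    compared = differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "?-N" or y in "?-N":
            continue  # skip missing data, gaps and ambiguous N
        compared += 1
        differences += x != y
    return differences / compared if compared else None


def suspicious_pairs(seqs, threshold):
    """Return (taxon, taxon, distance) for pairs at or below the threshold."""
    hits = []
    for (t1, s1), (t2, s2) in combinations(sorted(seqs.items()), 2):
        d = p_distance(s1, s2)
        if d is not None and d <= threshold:
            hits.append((t1, t2, d))
    return sorted(hits, key=lambda hit: hit[2])


coi = {
    "Gorilla gorilla": "ACGTACGT",
    "Homo sapiens": "ACGTACGT",
    "Macaca sylvanus": "ACGAACTT",
}
print(suspicious_pairs(coi, threshold=0.01))
```

A threshold that flags contamination in a fast-changing gene such as COI would flag almost everything in a slowly changing gene such as 28S, which is why the hits still need interpretation.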

Page 15: Sequence Matrix: Gene concatenation made easy

H. sapiens not present

✤ Use “Pairwise Distance Mode” to look for unusual pairwise distances.

✤ Ignore one charset, then sort taxa based on their pairwise distance to a “reference taxon”.

✤ Colour sequences by their individual pairwise distances to the reference taxon.

Page 16: Sequence Matrix: Gene concatenation made easy

H. sapiens not present

✤ Colour pairwise distances on the gene in question by their pairwise distance to the reference taxon.

✤ Look for colour variation which is unusual or out of place.

✤ We would expect sequences from different species to be correlated with each other.

Page 17: Sequence Matrix: Gene concatenation made easy

Pairwise distance mode

✤ You need to vary:

✤ The gene you are studying.

✤ The reference taxon being compared against.

✤ Possibly helpful as an alert mechanism.
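One reading of Pairwise Distance Mode can be sketched in Python. The sketch is an assumption about the idea, not a description of the program: for a chosen gene and reference taxon it reports each taxon's distance on that gene next to its distance on all other genes, sorted by the latter, so that a row where the two values disagree stands out. It uses an uncorrected p-distance, and all names are hypothetical.

```python
def p_distance(a, b):
    """Uncorrected p-distance over sites where both sequences have a base."""
    compared = differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "?-N" or y in "?-N":
            continue  # skip missing data, gaps and ambiguous N
        compared += 1
        differences += x != y
    return differences / compared if compared else None


def distances_to_reference(genes, focal_gene, reference):
    """Return (taxon, focal distance, distance on all other genes) rows.

    genes maps gene name -> {taxon name -> aligned sequence}.
    Rows are sorted by the distance computed with focal_gene ignored.
    """
    taxa = {t for seqs in genes.values() for t in seqs} - {reference}
    rows = []
    for taxon in sorted(taxa):
        focal_seqs = genes[focal_gene]
        focal = None
        if taxon in focal_seqs and reference in focal_seqs:
            focal = p_distance(focal_seqs[taxon], focal_seqs[reference])
        # Concatenate every other gene that both taxa were sequenced for.
        shared = [seqs for gene, seqs in genes.items()
                  if gene != focal_gene and taxon in seqs and reference in seqs]
        rest = p_distance("".join(s[taxon] for s in shared),
                          "".join(s[reference] for s in shared))
        rows.append((taxon, focal, rest))
    # Taxa with no comparable data on the other genes go last.
    rows.sort(key=lambda row: (row[2] is None, row[2] or 0.0))
    return rows


genes = {
    "coi": {"Ref": "AAAA", "X": "AAAA", "Y": "AATT"},
    "28S": {"Ref": "CCCC", "X": "CCGG", "Y": "CCCC"},
}
for row in distances_to_reference(genes, focal_gene="coi", reference="Ref"):
    print(row)
```

In this toy data, taxon X is identical to the reference on coi but distant on the other gene, which is the kind of out-of-place value the colouring is meant to expose. Repeating the call with a different `focal_gene` or `reference` corresponds to the two things the slide says you need to vary.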

Page 18: Sequence Matrix: Gene concatenation made easy

Summary

✤ Sequence Matrix allows you to assemble and examine multigene, multitaxon datasets.

✤ Taxonsets allow you to analyse subsets of your data in downstream programs.

✤ Excising sequences gives you greater control over which sequences to analyse.

✤ You can look for contamination in two ways:

✤ Looking for very low pairwise distances across your entire dataset.

✤ Looking for unusual pairwise distances in Pairwise Distance Mode.

Page 19: Sequence Matrix: Gene concatenation made easy

Acknowledgements

✤ Rudolf Meier

✤ Zhang Guanyang

✤ Farhan Ali

✤ David Lohman

✤ Everybody at the NUS DBS Evolutionary Biology lab.

Page 20: Sequence Matrix: Gene concatenation made easy

Question time!