Sequence Matrix: gene concatenation made easy
DESCRIPTION
Creating large datasets by concatenating genes can be challenging. This tool aims to make that process much easier. For more information, see http://code.google.com/p/sequencematrix/ or http://www3.interscience.wiley.com/journal/123577052/abstract
TRANSCRIPT
Sequence Matrix
Gene concatenation made easy
Gaurav Vaidya1, David Lohman2, Rudolf Meier2
1: NeatCo Asia, Singapore.
2: Department of Biological Sciences, National University of Singapore, Singapore.
Our goals
✤ Many powerful tools exist for concatenating sequences.
✤ Adding new sequences to an existing dataset is tedious and time-consuming.
✤ Our initial goal: a simple, user-friendly program for concatenating sequences.
✤ We also added a few tools to help you look for lab contamination in your dataset.
Sequence Matrix
✤ Written in Java.
✤ Graphical user interface libraries.
✤ Works on different operating systems.
✤ Easy to install: download and run the batch file.
Importing sequences
✤ You can use the sequence names as entered in the input file.
✤ Or you can ask Sequence Matrix to try to identify the species names.
Importing sequences
✤ Sequences mode:
✤ gi|237510679|gb|AY556753.2|Daubentonia madagascariensis voucher WE94001 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence
✤ gi|237510678|gb|AY556735.2|Macaca sylvanus voucher OK96022 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence
✤ Species name
✤ Daubentonia madagascariensis
✤ Macaca sylvanus
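A minimal Python sketch of how a species name might be pulled out of a GenBank-style sequence name like the two above. This is not Sequence Matrix's actual algorithm: the helper `guess_species_name` and its regular expression are assumptions made for illustration, and they only handle descriptions that begin with a "Genus species" binomial.

```python
import re

# Hypothetical helper, not Sequence Matrix's own code.
# Assumes a header of the form gi|<number>|<db>|<accession>|<description>
# whose description starts with a "Genus species" binomial.
BINOMIAL = re.compile(r"^([A-Z][a-z]+) ([a-z]+)\b")


def guess_species_name(header):
    """Return 'Genus species' if the description starts with one, else None."""
    description = header.split("|")[-1].strip()
    match = BINOMIAL.match(description)
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"
```

Names that do not start with a recognisable binomial return `None`, in which case the sequence name from the input file would have to be used instead.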
Importing sequences
✤ A common source of error is forgetting to recode leading and trailing gaps as missing information.
✤ Sequence Matrix can automatically replace such gaps with question marks.
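The recoding step can be sketched in a few lines of Python. The function name `recode_terminal_gaps` is hypothetical, and the sketch assumes that `-` marks a gap and `?` marks missing data. Gaps inside the sequence are left untouched.

```python
def recode_terminal_gaps(sequence, gap="-", missing="?"):
    """Replace leading and trailing gap characters with the missing-data symbol."""
    without_leading = sequence.lstrip(gap)
    leading = len(sequence) - len(without_leading)
    core = without_leading.rstrip(gap)
    trailing = len(without_leading) - len(core)
    # Internal gaps stay as they are: they are real alignment gaps.
    return missing * leading + core + missing * trailing
```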
Importing sequences: Naming
✤ Sequences from one dataset are matched up to another dataset by sequence name.
✤ Errors in sequence naming need to be fixed.
✤ We recommend naming your files by gene name: ‘coi’, ‘cytb’, ‘28S’ and so on.
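To illustrate why consistent naming matters, here is a rough Python sketch of concatenation by sequence name. The function `concatenate` and the data layout are assumptions for illustration only, as is the choice to fill a taxon that lacks a gene with question marks.

```python
def concatenate(genes):
    """Concatenate aligned genes by sequence name.

    genes: dict mapping gene name -> {sequence name: aligned sequence}.
    Returns a dict mapping sequence name -> concatenated sequence.
    """
    taxa = sorted({name for seqs in genes.values() for name in seqs})
    matrix = {name: "" for name in taxa}
    for gene, seqs in genes.items():
        lengths = {len(s) for s in seqs.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to one length")
        length = lengths.pop()
        for name in taxa:
            # Assumption: a taxon without this gene is padded with '?'.
            matrix[name] += seqs.get(name, "?" * length)
    return matrix
```

A misspelt name (say `Macaca sylvanus` in one file and `Macaca sylvanis` in another) would produce two half-empty rows here instead of one complete row.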
Export: Taxonsets
✤ By default, we generate taxonsets on the basis of:
✤ Combined length.
✤ Number of character sets.
✤ Information for a particular gene.
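The three criteria above could be computed along these lines. This is a Python sketch under assumptions: the function names, the taxonset names and the two thresholds are hypothetical, and "combined length" is taken to mean the number of characters that are neither gaps nor missing data.

```python
def count_bases(sequence):
    """Number of characters that are neither gaps nor missing data."""
    return sum(1 for ch in sequence if ch not in "-?")


def build_taxonsets(genes, min_length, min_charsets):
    """genes: dict mapping gene name -> {taxon name: aligned sequence}."""
    taxa = {name for seqs in genes.values() for name in seqs}
    combined = {t: 0 for t in taxa}   # total informative characters per taxon
    charsets = {t: 0 for t in taxa}   # number of genes with data per taxon
    sets = {}
    for gene, seqs in genes.items():
        present = sorted(t for t, s in seqs.items() if count_bases(s) > 0)
        sets[f"has_{gene}"] = present
        for t in present:
            combined[t] += count_bases(seqs[t])
            charsets[t] += 1
    sets[f"length_at_least_{min_length}"] = sorted(
        t for t in taxa if combined[t] >= min_length)
    sets[f"charsets_at_least_{min_charsets}"] = sorted(
        t for t in taxa if charsets[t] >= min_charsets)
    return sets
```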
Gene trees
✤ Two ways to do them:
✤ Use the taxonset of taxa having information for a particular gene to exclude other taxa.
✤ Export the entire dataset with one file per column.
Export features
✤ You can also export the Sequence Matrix table as an Excel-readable text file.
✤ Supervisory mode.
✤ Keep track of a project as it grows.
Character sets
✤ We can read character sets defined in Nexus CHARSET and TNT xgroup commands.
✤ These can be “split” into individual columns, or imported as a single column representing the entire file.
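As a rough illustration of "splitting" by character set, the Python sketch below reads simple Nexus `CHARSET name = start-end;` commands and cuts a sequence into one piece per character set. It is deliberately limited: only single contiguous ranges are handled, TNT xgroup commands are not covered, and the function names are hypothetical.

```python
import re

# Handles only the simple form "CHARSET name = start-end;".
# Nexus positions are 1-based and inclusive.
CHARSET = re.compile(r"charset\s+(\w+)\s*=\s*(\d+)\s*-\s*(\d+)\s*;",
                     re.IGNORECASE)


def parse_charsets(nexus_text):
    """Return {charset name: (start, end)} for every simple CHARSET found."""
    return {name: (int(start), int(end))
            for name, start, end in CHARSET.findall(nexus_text)}


def split_by_charset(sequence, charsets):
    """Cut one aligned sequence into a piece per character set."""
    return {name: sequence[start - 1:end]
            for name, (start, end) in charsets.items()}
```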
Excision
✤ Individual sequences can be excised from the dataset.
✤ Excised sequences will not be exported.
✤ Sequence Matrix will warn you about that.
Contamination
✤ You thought you were sequencing Gorilla gorilla
✤ but you were really sequencing Homo sapiens.
✤ We have two tools you can use:
✤ If Homo sapiens is in your dataset.
✤ If Homo sapiens is not in your dataset (experimental!).
H. sapiens in dataset
✤ Looks for pairs of sequences whose pairwise distance is very low.
✤ Expected difference depends on gene:
✤ 28S doesn’t change very much, but
✤ COI changes very quickly.
✤ Some interpretation is required.
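A minimal Python sketch of this kind of search, assuming uncorrected p-distances (the proportion of differing sites) and a user-chosen threshold. The source does not say which distance measure Sequence Matrix uses, so both the measure and the function names are assumptions.

```python
from itertools import combinations


def p_distance(a, b, ignore="-?"):
    """Proportion of differing sites, skipping gaps and missing data."""
    compared = differences = 0
    for x, y in zip(a, b):
        if x in ignore or y in ignore:
            continue
        compared += 1
        if x.upper() != y.upper():
            differences += 1
    return differences / compared if compared else None


def suspicious_pairs(seqs, threshold):
    """Return (name1, name2, distance) for pairs at or below the threshold."""
    hits = []
    for (n1, s1), (n2, s2) in combinations(sorted(seqs.items()), 2):
        d = p_distance(s1, s2)
        if d is not None and d <= threshold:
            hits.append((n1, n2, d))
    return hits
```

Because the expected difference depends on the gene, the threshold would need to be set per gene: a value that is alarming for COI may be perfectly normal for 28S.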
H. sapiens not present
✤ Use “Pairwise Distance Mode” to look for unusual pairwise distances.
✤ Ignore one charset, then sort taxa based on their pairwise distance to a “reference taxon”.
✤ Colour sequences by their individual pairwise distances to the reference taxon.
H. sapiens not present
✤ Colour pairwise distances on the gene in question by their pairwise distance to the reference taxon.
✤ Look for colour variation which is unusual or out of place.
✤ We would expect sequences from different species to be correlated.
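The idea can be sketched in Python as a table instead of colours. For each taxon, the sketch computes the distance to the reference taxon on the gene being studied and, separately, on all other genes, then sorts by the latter. Uncorrected p-distances, the function names and the data layout are all assumptions for illustration.

```python
def count_differences(a, b, ignore="-?"):
    """Return (differing sites, compared sites), skipping gaps and missing data."""
    compared = differences = 0
    for x, y in zip(a, b):
        if x in ignore or y in ignore:
            continue
        compared += 1
        differences += x.upper() != y.upper()
    return differences, compared


def distance_table(genes, study_gene, reference):
    """genes: gene name -> {taxon: aligned sequence}.

    Returns rows of (taxon, distance on other genes, distance on study gene),
    sorted by the distance on the other genes.
    """
    taxa = sorted({t for seqs in genes.values() for t in seqs} - {reference})
    rows = []
    for taxon in taxa:
        other, study = [0, 0], [0, 0]
        for gene, seqs in genes.items():
            if taxon not in seqs or reference not in seqs:
                continue
            d, n = count_differences(seqs[taxon], seqs[reference])
            target = study if gene == study_gene else other
            target[0] += d
            target[1] += n
        rows.append((taxon,
                     other[0] / other[1] if other[1] else None,
                     study[0] / study[1] if study[1] else None))
    # Taxa with no comparable data on the other genes sort last.
    rows.sort(key=lambda r: (r[1] is None, r[1] or 0.0))
    return rows
```

A taxon that sits far from the reference on the other genes but is nearly identical to it on the study gene is the kind of out-of-place value worth a second look.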
Pairwise distance mode
✤ You need to vary:
✤ The gene you are studying.
✤ The reference taxon being compared against.
✤ Possibly helpful as an alert mechanism.
Summary
✤ Sequence Matrix allows you to assemble and examine multigene, multitaxon datasets.
✤ Taxonsets allow you to analyse subsets of your data in downstream programs.
✤ Excising sequences gives you greater control over which sequences to analyse.
✤ You can look for contamination in two ways:
✤ Looking for very low pairwise distances across your entire dataset.
✤ Looking for unusual pairwise distances in Pairwise Distance Mode.
Acknowledgements
✤ Rudolf Meier
✤ Zhang Guanyang
✤ Farhan Ali
✤ David Lohman
✤ Everybody at the NUS DBS Evolutionary Biology lab.
Question time!