sequence matrix: gene concatenation made easy

Sequence Matrix

Gene concatenation made easy

Gaurav Vaidya¹, David Lohman², Rudolf Meier²

1: NeatCo Asia, Singapore.
2: Department of Biological Sciences, National University of Singapore, Singapore.

Our goals

✤ Many powerful tools exist for concatenating sequences.

✤ Adding new sequences to an existing dataset is tedious and time-consuming.

✤ Our initial goal: a simple, user-friendly program for concatenating sequences.

✤ We also added a few tools to help you look for lab contamination in your dataset.

Sequence Matrix

✤ Written in Java.

✤ Built on Java's graphical user interface libraries.

✤ Works on different operating systems.

✤ Easy to install: download and run the batch file.

Importing sequences

✤ You can use the sequence names as entered in the input file.

✤ Or you can ask Sequence Matrix to try to identify the species names.

Importing sequences

✤ Sequence names mode:

✤ gi|237510679|gb|AY556753.2|Daubentonia madagascariensis voucher WE94001 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence

✤ gi|237510678|gb|AY556735.2|Macaca sylvanus voucher OK96022 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence

✤ Species names mode:

✤ Daubentonia madagascariensis

✤ Macaca sylvanus
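
A minimal sketch of how a species name might be pulled out of a GenBank-style header like the ones above. The helper `guess_species_name` and its rule (take the first "Genus species" pair after the last `|`) are assumptions for illustration, not Sequence Matrix's own parser.

```python
import re

# Assumption: the species name is the first capitalised word followed by a
# lowercase word, right after the last "|" in the header.
SPECIES_PATTERN = re.compile(r"([A-Z][a-z]+) ([a-z]+)")


def guess_species_name(header: str) -> str:
    """Return a guessed species name, or the full header if none is found."""
    description = header.rsplit("|", 1)[-1].strip()
    match = SPECIES_PATTERN.match(description)
    if match is None:
        return header  # fall back to the sequence name as entered
    return f"{match.group(1)} {match.group(2)}"


header = ("gi|237510679|gb|AY556753.2|Daubentonia madagascariensis "
          "voucher WE94001 5.8S ribosomal RNA gene, partial sequence")
print(guess_species_name(header))  # Daubentonia madagascariensis
```

A rule this simple will misread subspecies, hybrids and "sp." entries, which is one reason to check the guessed names by eye.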

Importing sequences

✤ A common source of error is forgetting to recode leading and trailing gaps as missing information.

✤ Sequence Matrix can automatically replace such gaps with question marks.
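
The recoding described above can be sketched as follows. The function name `recode_terminal_gaps` and the choice of `-` as the gap symbol are assumptions for this example; internal gaps are left alone.

```python
def recode_terminal_gaps(sequence: str, gap: str = "-", missing: str = "?") -> str:
    """Replace leading and trailing gap characters with the missing-data symbol."""
    without_leading = sequence.lstrip(gap)
    leading = len(sequence) - len(without_leading)
    core = without_leading.rstrip(gap)
    trailing = len(without_leading) - len(core)
    # Internal gaps inside `core` are kept: they may be real indels.
    return missing * leading + core + missing * trailing


print(recode_terminal_gaps("---ACG-T--"))  # ???ACG-T??
```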

Importing sequences: Naming

✤ Sequences from one dataset are matched up to another dataset by sequence name.

✤ Errors in sequence naming need to be fixed.

✤ We recommend naming your files by gene name: ‘coi’, ‘cytb’, ‘28S’ and so on.
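
To show why consistent names matter, here is a small sketch of concatenation by sequence name. It assumes each gene is already aligned (all sequences in a gene have the same length) and pads taxa that lack a gene with `?`. The function `concatenate` and the toy sequences are illustrative only.

```python
def concatenate(genes: dict[str, dict[str, str]]) -> dict[str, str]:
    """Join per-gene alignments by sequence name, padding absent genes with '?'."""
    # Assumption: every gene has at least one sequence, and all sequences
    # within a gene share the same aligned length.
    lengths = {gene: len(next(iter(seqs.values()))) for gene, seqs in genes.items()}
    taxa = sorted({name for seqs in genes.values() for name in seqs})
    return {
        taxon: "".join(
            seqs.get(taxon, "?" * lengths[gene]) for gene, seqs in genes.items()
        )
        for taxon in taxa
    }


# Toy data, keyed by gene name as suggested above.
genes = {
    "coi": {"Macaca sylvanus": "ACGT", "Daubentonia madagascariensis": "ACGA"},
    "28S": {"Macaca sylvanus": "GG"},
}
matrix = concatenate(genes)
print(matrix["Daubentonia madagascariensis"])  # ACGA??
```

A misspelt name (say, "Macaca sylvanis" in one file) would simply appear as a separate taxon padded with question marks, which is the kind of error that needs fixing before analysis.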

Export: Taxonsets

✤ By default, we generate taxonsets on the basis of:

✤ Combined length.

✤ Number of character sets.

✤ Information for a particular gene.
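
The three criteria above could be computed along these lines. This is a sketch only: the set names, the thresholds `min_length` and `min_charsets`, and the function `build_taxonsets` are hypothetical, and the program's own defaults may differ.

```python
def count_bases(sequence: str) -> int:
    """Count characters that are neither gaps nor missing data."""
    return sum(1 for base in sequence if base not in "-?")


def build_taxonsets(genes, min_length, min_charsets):
    """Group taxa by combined length, number of charsets, and gene presence."""
    taxa = sorted({name for seqs in genes.values() for name in seqs})
    combined = {
        t: sum(count_bases(seqs.get(t, "")) for seqs in genes.values()) for t in taxa
    }
    charsets = {t: sum(1 for seqs in genes.values() if t in seqs) for t in taxa}
    taxonsets = {
        f"length_at_least_{min_length}": [t for t in taxa if combined[t] >= min_length],
        f"charsets_at_least_{min_charsets}": [t for t in taxa if charsets[t] >= min_charsets],
    }
    # One taxonset per gene: the taxa that have information for that gene.
    for gene, seqs in genes.items():
        taxonsets[f"has_{gene}"] = sorted(seqs)
    return taxonsets


genes = {
    "coi": {"Macaca sylvanus": "ACGT", "Daubentonia madagascariensis": "AC??"},
    "28S": {"Macaca sylvanus": "GG"},
}
taxonsets = build_taxonsets(genes, min_length=5, min_charsets=2)
print(taxonsets["has_coi"])  # ['Daubentonia madagascariensis', 'Macaca sylvanus']
```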

Gene trees

✤ Two ways to do them:

✤ Use the taxonset of taxa having information for a particular gene to exclude other taxa.

✤ Export the entire dataset with one file per column.

Export features

✤ You can also export the Sequence Matrix table as an Excel-readable text file.

✤ Supervisory mode.

✤ Keep track of a project as it grows.
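
As an illustration of an Excel-readable export, the sketch below writes a taxon-by-gene table as tab-delimited text using Python's standard `csv` module. Showing the sequence length in each cell is an assumption made for this example, as is the function name `table_as_tsv`.

```python
import csv
import io


def table_as_tsv(genes: dict[str, dict[str, str]]) -> str:
    """Return a tab-delimited table: one row per taxon, one column per gene."""
    taxa = sorted({name for seqs in genes.values() for name in seqs})
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["Taxon", *genes])
    for taxon in taxa:
        # Assumption: each cell holds the sequence length (0 if absent).
        writer.writerow([taxon, *(len(seqs.get(taxon, "")) for seqs in genes.values())])
    return buffer.getvalue()


genes = {"coi": {"A": "ACGT", "B": "ACGA"}, "28S": {"A": "GG"}}
print(table_as_tsv(genes))
```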

Character sets

✤ We can read character sets defined in Nexus CHARSET and TNT xgroup commands.

✤ These can be “split” into individual columns, or imported as a single column representing the entire file.
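
A rough sketch of "splitting" a sequence by Nexus CHARSET definitions. It handles only simple contiguous ranges such as `CHARSET coi = 1-4;`, not codon-position sets or TNT xgroup commands, and the function `split_by_charsets` is a hypothetical helper.

```python
import re

# Assumption: each charset is a single contiguous, 1-based, inclusive range.
CHARSET_PATTERN = re.compile(
    r"charset\s+(\w+)\s*=\s*(\d+)\s*-\s*(\d+)\s*;", re.IGNORECASE
)


def split_by_charsets(sets_block: str, sequence: str) -> dict[str, str]:
    """Cut one concatenated sequence into per-charset columns."""
    columns = {}
    for name, start, end in CHARSET_PATTERN.findall(sets_block):
        # Nexus positions are 1-based and inclusive; Python slices are not.
        columns[name] = sequence[int(start) - 1:int(end)]
    return columns


sets_block = """
BEGIN SETS;
    CHARSET coi = 1-4;
    CHARSET cytb = 5-6;
END;
"""
print(split_by_charsets(sets_block, "ACGTGG"))  # {'coi': 'ACGT', 'cytb': 'GG'}
```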

Excision

✤ Individual sequences can be excised from the dataset.

✤ Excised sequences will not be exported.

✤ Sequence Matrix will warn you when an export leaves out excised sequences.

Contamination

✤ You thought you were sequencing Gorilla gorilla, but you were really sequencing Homo sapiens.

✤ We have two tools you can use:

✤ If Homo sapiens is in your dataset.

✤ If Homo sapiens is not in your dataset (experimental!).

H. sapiens in dataset

✤ Looks for pairs of sequences whose pairwise distance is very low.

✤ Expected difference depends on gene:

✤ 28S doesn’t change very much, but

✤ COI changes very quickly.

✤ Some interpretation is required.
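
The idea of flagging very similar pairs can be sketched with an uncorrected p-distance (the proportion of differing sites). The distance measure, the `threshold` value and the toy sequences below are assumptions for illustration; as noted above, a sensible threshold depends on the gene.

```python
from itertools import combinations


def p_distance(a: str, b: str):
    """Proportion of differing sites, skipping gaps and missing data."""
    compared = [(x, y) for x, y in zip(a, b) if x not in "-?" and y not in "-?"]
    if not compared:
        return None  # no overlapping sites to compare
    return sum(1 for x, y in compared if x != y) / len(compared)


def suspicious_pairs(sequences: dict[str, str], threshold: float):
    """Return (name, name, distance) for pairs at or below the threshold."""
    flagged = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(sorted(sequences.items()), 2):
        distance = p_distance(seq_a, seq_b)
        if distance is not None and distance <= threshold:
            flagged.append((name_a, name_b, distance))
    return flagged


# Toy aligned sequences, not real data.
coi = {
    "Gorilla gorilla": "ACGTACGTAC",
    "Homo sapiens": "ACGTACGTAC",
    "Macaca sylvanus": "ACGAACTTGC",
}
print(suspicious_pairs(coi, threshold=0.01))
# [('Gorilla gorilla', 'Homo sapiens', 0.0)]
```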

H. sapiens not present

✤ Use “Pairwise Distance Mode” to look for unusual pairwise distances.

✤ Ignore one charset, then sort taxa based on their pairwise distance to a “reference taxon”.

✤ Colour sequences by their individual pairwise distances to the reference taxon.

H. sapiens not present

✤ Colour pairwise distances on the gene in question by their pairwise distance to the reference taxon.

✤ Look for colour variation which is unusual or out of place.

✤ We would expect each taxon's distance to the reference taxon on this gene to be correlated with its distance on the other genes.

Pairwise distance mode

✤ You need to vary:

✤ The gene you are studying.

✤ The reference taxon being compared against.

✤ Possibly helpful as an alert mechanism.
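
A sketch of the comparison behind this mode, under simplifying assumptions: every taxon has every gene, distances are uncorrected p-distances, and the function `rank_against_reference` and the toy data are hypothetical. Taxa are sorted by their distance to the reference taxon on all other genes, and each is paired with its distance on the gene in question.

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites, skipping gaps and missing data."""
    compared = [(x, y) for x, y in zip(a, b) if x not in "-?" and y not in "-?"]
    if not compared:
        raise ValueError("no overlapping sites to compare")
    return sum(1 for x, y in compared if x != y) / len(compared)


def rank_against_reference(genes, reference, gene_in_question):
    """Return (taxon, distance on other genes, distance on the gene in question)."""
    other_genes = [g for g in genes if g != gene_in_question]

    def joined(taxon):
        # Concatenate every gene except the one being examined.
        return "".join(genes[g][taxon] for g in other_genes)

    focal = genes[gene_in_question]
    rows = [
        (taxon,
         p_distance(joined(taxon), joined(reference)),
         p_distance(focal[taxon], focal[reference]))
        for taxon in focal
        if taxon != reference
    ]
    return sorted(rows, key=lambda row: row[1])


# Toy data: Macaca is far from Homo on 28S but identical on coi.
genes = {
    "28S": {"Homo sapiens": "CCCCCCCCCC", "Gorilla gorilla": "CCCCCCCCCG",
            "Macaca sylvanus": "CCCCCGGGGG"},
    "coi": {"Homo sapiens": "AAAAAAAAAA", "Gorilla gorilla": "AAAAAAAATT",
            "Macaca sylvanus": "AAAAAAAAAA"},
}
for row in rank_against_reference(genes, "Homo sapiens", "coi"):
    print(row)
```

In the toy data the last row is the one that looks out of place: a taxon that is distant from the reference on the other genes yet identical to it on the gene in question. Repeating the comparison with a different gene or reference taxon changes what stands out.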

Summary

✤ Sequence Matrix allows you to assemble and examine multigene, multitaxon datasets.

✤ Taxonsets allow you to analyse subsets of your data in downstream programs.

✤ Excising sequences gives you greater control over which sequences to analyse.

✤ You can look for contamination in two ways:

✤ Looking for very low pairwise distances across your entire dataset.

✤ Looking for unusual pairwise distances in Pairwise Distance Mode.

Acknowledgements

✤ Rudolf Meier

✤ Zhang Guanyang

✤ Farhan Ali

✤ David Lohman

✤ Everybody at the NUS DBS Evolutionary Biology lab.

Question time!

