biodiversity informatics of the cyperaceae: where we stand and where we’re heading
TRANSCRIPT
Biodiversity Informatics of the Cyperaceae: Where we stand and
where we’re heading
Andrew Hipp, Marlene Hahn, Ed Baker, Vince Smith and
The Cariceae Working Group
A set of tools for Cariceae informatics
Identify gaps in our knowledge and sampling
Formulate sampling plan
New collections
DNA sequences
DNA matrices
Multiple alignments
Species tree estimates
Revised classification
A central database for specimen-level data
What tools do we need?
• An easily-updated hierarchical checklist to visualize sampling progress across labs, extractions, sequences;
• A specimen-level phylogenetics pipeline that we can use to harvest existing data from NCBI as well as generate ongoing phylogenetic snapshots;
• A way to automate mapping from specimen data, so that we can visualize (and assess our visualizations of) species distributions in geographic and ecological space; and
• A platform for collaboration: a virtual research environment to bring together researchers worldwide
I. A hierarchical checklist and sampling progress reports
In 2011
• A flat checklist exported from WCM
• A set of spreadsheets from collaborating labs inventorying their DNA and sequence collections
• A vague idea of what trips are needed
Today
• A hierarchical checklist by subgenus, section
• A synthesis of what materials and sequences collaborators have on hand, and what taxa are unsampled
• A concrete sampling plan with trips and taxa identified*
* Okay, we’re working on this one!
Taxonomy
Specimen(s)
DNA extraction(s)
Sequence(s)
Trace file(s) / contig(s)
We are aiming toward a database in which the taxonomy, specimen data, DNA extractions, raw sequencing data and DNA matrices all live together and can be curated and worked on jointly by the community.
Spring 2012: Hierarchical checklist
Specimen Record
Tissue
Extraction
DNA seq.
DNA seq.
DNA seq.
Metadata flow
A centralized workflow
• Spreadsheets imported into a single Excel file
• Names cleaned (variable)
• DNA data summary formula created for each spreadsheet (ca. 5 mins per user)
• Names matched to our Scratchpads checklist
• All files exported to CSV
• Sample sheets and SP checklist imported to R
• DNA records added to checklist as nodes that are children of their taxa
• Hierarchical checklist exported in text format, with unsampled taxa marked for searching
← Section name
← Sampled taxon with its DNA vouchers and summaries
← Unsampled taxon
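The last two steps of the centralized workflow (attaching DNA records as child nodes and exporting a text checklist with unsampled taxa marked) can be sketched roughly as follows. This is a Python illustration, not the R code used for the talk; the taxon names, voucher labels, column layout, and the `[UNSAMPLED]` marker are all invented for the example.

```python
from collections import defaultdict

# Hypothetical checklist rows: (subgenus, section, taxon). Names are placeholders.
CHECKLIST = [
    ("Subgenus A", "Section 1", "Carex sp. 1"),
    ("Subgenus A", "Section 1", "Carex sp. 2"),
    ("Subgenus B", "Section 2", "Carex sp. 3"),
]

# Hypothetical sample-sheet rows: (taxon, DNA voucher, summary of DNA data on hand).
DNA_RECORDS = [
    ("Carex sp. 1", "Smith 101", "ITS, ETS"),
    ("Carex sp. 3", "Jones 7", "matK"),
    ("Carex sp. 3", "Jones 9", "ITS"),
]

# Searchable marker for unsampled taxa (assumed; the slides do not show the real one).
UNSAMPLED_TAG = "[UNSAMPLED]"


def build_report(checklist, dna_records):
    """Return the checklist as indented text with DNA records nested under taxa."""
    by_taxon = defaultdict(list)
    for taxon, voucher, summary in dna_records:
        by_taxon[taxon].append((voucher, summary))

    lines = []
    seen_subgenus, seen_section = None, None
    for subgenus, section, taxon in sorted(checklist):
        if subgenus != seen_subgenus:
            lines.append(subgenus)
            seen_subgenus, seen_section = subgenus, None
        if section != seen_section:
            lines.append(f"  {section}")
            seen_section = section
        records = by_taxon.get(taxon, [])
        if records:
            lines.append(f"    {taxon}")
            # DNA records become children of their taxon.
            for voucher, summary in records:
                lines.append(f"      {voucher}: {summary}")
        else:
            lines.append(f"    {taxon} {UNSAMPLED_TAG}")
    return "\n".join(lines)


report = build_report(CHECKLIST, DNA_RECORDS)
print(report)
```

Searching the exported text for the marker then gives the list of taxa still to be sampled.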
Because Kew has coded geography using TDWG standards, we can export geographic hit-lists.
II. A specimen-level phylogenetic pipeline
NCBI is a morass of data.
Geneious
• Query nucleotide database (NCBI) for Organism contains: “Carex”, “Uncinia”, “Schoenoxiphium”, “Kobresia”, “Vesicarex”, or “Cymophyllus”
• Export as
– FASTA
– TAB-Delim
– XML: the only export that maintains all the information in NCBI, and necessary to obtain data that can be used to connect a sequence to a specimen
Hinchliff and Roalson. 2013. Systematic Biology 62: 205–219.
A workflow for specimen-level multigene datasets from NCBI
• Download from NCBI [we used Geneious, but any bulk download is fine]
• Parse out collector name, collector number, isolate number, geography
• Manually clean collector names (3 days for >6500 records)
• Identify specimens by unique combinations of collector name, collector number, isolate
• Toss out “accessions” having more than one scientific name
• Clean gene region names so that names are not duplicated (30 minutes for >6500 records)
• Export datasets to MUSCLE and align; export log file
• Manually check alignments and code logfile (D, RC; variable)
• Rerun MUSCLE and export RAxML batchfile
• Analyze
• Screen for non-monophyly; concatenate and continue!
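The two specimen-matching steps in this workflow (identifying specimens by collector name, collector number and isolate, then tossing out any that carry more than one scientific name) might look something like the sketch below. It is written in Python rather than the R used for the talk. The record fields and the surname heuristic are assumptions for illustration; in the actual workflow the collector names were cleaned by hand.

```python
import re
from collections import defaultdict

# Hypothetical parsed records; the field names are invented for this example.
RECORDS = [
    {"accession": "X1", "organism": "Carex sp. 1", "collector": "A. Smith",
     "number": "101", "isolate": "", "gene": "ITS"},
    {"accession": "X2", "organism": "Carex sp. 1", "collector": "Smith, A.",
     "number": "101", "isolate": "", "gene": "ETS"},
    {"accession": "X3", "organism": "Carex sp. 2", "collector": "B. Jones",
     "number": "7", "isolate": "d12", "gene": "ITS"},
    {"accession": "X4", "organism": "Carex sp. 3", "collector": "B. Jones",
     "number": "7", "isolate": "d12", "gene": "matK"},
]


def last_name(collector):
    """Crude surname guess: text before a comma, otherwise the last word."""
    if not collector.strip():
        return ""
    name = collector.split(",")[0] if "," in collector else collector.split()[-1]
    return re.sub(r"[^a-z]", "", name.lower())


def voucher_id(rec):
    """Join surname, collection number and isolate into one specimen key."""
    parts = [last_name(rec["collector"]), rec["number"].strip(), rec["isolate"].strip()]
    return "_".join(p for p in parts if p)


def group_by_voucher(records):
    """Split records into vouchers with one scientific name and everything else."""
    groups = defaultdict(list)
    for rec in records:
        groups[voucher_id(rec)].append(rec)
    kept, discarded = {}, {}
    for vid, recs in groups.items():
        names = {r["organism"] for r in recs}
        # Keep only vouchers that have an ID and exactly one scientific name.
        if vid and len(names) == 1:
            kept[vid] = recs
        else:
            discarded[vid] = recs
    return kept, discarded


kept, discarded = group_by_voucher(RECORDS)
print(sorted(kept), sorted(discarded))
```

Keeping the discarded groups rather than deleting them makes it easier to revisit vouchers whose conflicting names turn out to be synonyms.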
6692 sequence records in Cariceae
Tab-delimited metadata from NCBI / Geneious is handy, but it lacks almost all the information that could be used as voucher IDs. No way to link sequences to specimens! However, some NCBI records do contain this data. How do we access it?
NCBI Specimen Record
The FEATURES/Qualifier1 section has information that allows us to connect sequences to a specific specimen. For example, some records contain the qualifier specimen_voucher.
To get this additional information, we need to export the data as an XML file and parse it into a usable tab-delimited file.
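As a rough illustration of the XML-to-table step, the Python sketch below pulls source qualifiers out of GenBank-style XML and writes tab-delimited text. It assumes NCBI's GBSeq XML element names (`GBSeq`, `GBFeature`, `GBQualifier_name`, and so on); the Geneious export described in these slides uses different tags (such as `<qualifiers1>`), so the tag names would need adapting. The sample records and accession numbers are invented.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented sample in NCBI GBSeq XML layout; not real GenBank records.
SAMPLE_XML = """<GBSet>
  <GBSeq>
    <GBSeq_primary-accession>XX000001</GBSeq_primary-accession>
    <GBSeq_organism>Carex sp. 1</GBSeq_organism>
    <GBSeq_feature-table>
      <GBFeature>
        <GBFeature_key>source</GBFeature_key>
        <GBFeature_quals>
          <GBQualifier>
            <GBQualifier_name>specimen_voucher</GBQualifier_name>
            <GBQualifier_value>A. Smith 101 (MOR)</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>country</GBQualifier_name>
            <GBQualifier_value>USA</GBQualifier_value>
          </GBQualifier>
        </GBFeature_quals>
      </GBFeature>
    </GBSeq_feature-table>
  </GBSeq>
  <GBSeq>
    <GBSeq_primary-accession>XX000002</GBSeq_primary-accession>
    <GBSeq_organism>Carex sp. 2</GBSeq_organism>
    <GBSeq_feature-table/>
  </GBSeq>
</GBSet>"""

# Source qualifiers to keep as columns.
FIELDS = ["specimen_voucher", "isolate", "country", "lat_lon",
          "collection_date", "collected_by"]


def parse_records(xml_text):
    """Return one dict per sequence record, with empty strings for missing fields."""
    rows = []
    for seq in ET.fromstring(xml_text).iter("GBSeq"):
        row = {"accession": seq.findtext("GBSeq_primary-accession", default=""),
               "organism": seq.findtext("GBSeq_organism", default="")}
        row.update({field: "" for field in FIELDS})
        for feature in seq.iter("GBFeature"):
            if feature.findtext("GBFeature_key") != "source":
                continue
            for qual in feature.iter("GBQualifier"):
                name = qual.findtext("GBQualifier_name")
                if name in FIELDS:
                    row[name] = qual.findtext("GBQualifier_value", default="")
        rows.append(row)
    return rows


def to_tsv(rows):
    """Serialize the rows as tab-delimited text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["accession", "organism"] + FIELDS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = parse_records(SAMPLE_XML)
print(to_tsv(rows))
```

Records without a voucher qualifier still get a row, with empty columns, so they can be counted before being set aside.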
Other good information to export
We parsed the NCBI XML and embedded fields within <qualifiers1> to get voucher, DNA isolate, population variants, country, geographic coordinates, collection date, collector name, and other fields… many informative about the identity of the plants sequenced. To make clean voucher IDs, we used last name, collection number, and DNA isolate (used by some labs). For this analysis, sequences that could not be assigned to a single-species voucher were discarded.
6692 sequence records → 3004 individuals, 54 genes, 5846 sequences
ITS, ETS, matK, trnL-trnF
3,370 DNA sequences
2,196 individuals
723 spp
397 spp > 1 individual
31.7% of those spp monophyletic
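One way to screen species with more than one individual for monophyly is sketched below in Python. It assumes a rooted tree written as nested tuples and tip labels of the form `species|individual`; both conventions, and the toy tree, are invented for illustration and are not the format or method used to produce the figures above.

```python
from collections import defaultdict

# Toy rooted tree as nested tuples; tip labels are "species|individual" (assumed format).
TREE = ((("sp1|a", "sp1|b"), "sp2|a"), ("sp3|a", ("sp2|b", "sp3|b")))


def collect_clades(tree, out):
    """Add the tip set under every node to `out`; return the tip set of `tree`."""
    if isinstance(tree, str):
        tips = frozenset([tree])
    else:
        tips = frozenset().union(*(collect_clades(child, out) for child in tree))
    out.add(tips)
    return tips


def monophyly_report(tree, species_of):
    """Map each species with more than one tip to True if its tips form a clade."""
    clade_sets = set()
    all_tips = collect_clades(tree, clade_sets)
    by_species = defaultdict(set)
    for tip in all_tips:
        by_species[species_of(tip)].add(tip)
    # A species is monophyletic here if some node subtends exactly its tips.
    return {sp: frozenset(tips) in clade_sets
            for sp, tips in by_species.items() if len(tips) > 1}


result = monophyly_report(TREE, lambda tip: tip.split("|")[0])
share = sum(result.values()) / len(result)
print(result, share)
```

Because the check depends on where the tree is rooted, an unrooted tree would first need rooting or a bipartition-based test instead.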
Identify gaps in our knowledge and sampling
Formulate sampling plan
New collections
DNA sequences
DNA matrices
Multiple alignments
Species tree estimates
Revised classification
A central database for specimen-level data
III. Generating maps from specimen data
Carex macloviana D’Urv GBIF map, 2013-07-06
Mapping GBIF Data
• Generate species list to extract GBIF data (i.e., accepted names in World Checklist).
• Download GBIF data using a wrapper to dismo::gbif (R), allowing us to capture and log errors and missing data.
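The idea of a download wrapper that captures and logs errors and missing data can be sketched generically. The Python below is not the R wrapper around dismo::gbif; `fetch` is a hypothetical stand-in for whatever function performs the download, and `fake_fetch` exists only so the example runs offline.

```python
import logging

logging.basicConfig(level=logging.WARNING)
LOG = logging.getLogger("gbif_download")


def download_all(species_list, fetch, log=LOG):
    """Call fetch(name) for each species, logging failures instead of stopping."""
    results, problems = {}, {}
    for name in species_list:
        try:
            records = fetch(name)
        except Exception as err:  # keep going, but remember what went wrong
            log.warning("download failed for %s: %s", name, err)
            problems[name] = f"error: {err}"
            continue
        if not records:
            log.warning("no records returned for %s", name)
            problems[name] = "no records"
            continue
        results[name] = records
    return results, problems


def fake_fetch(name):
    """Offline stand-in for a real download call (hypothetical data)."""
    if name == "Carex sp. 2":
        raise TimeoutError("server did not respond")
    if name == "Carex sp. 3":
        return []
    return [{"species": name, "lat": "45.1234", "lon": "-93.5678"}]


results, problems = download_all(
    ["Carex sp. 1", "Carex sp. 2", "Carex sp. 3"], fake_fetch)
print(sorted(results), problems)
```

Returning the problem list alongside the data means failed species can be retried later without rerunning the whole checklist.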
Clean up downloaded GBIF data
• Flag duplicate specimen datasets
– Flags specimens within the same species that have identical coordinates.
– This should be expanded to include specimens that have identical locality descriptions.
• Flag imprecise location data
– Flags specimens in which the latitude is precise only to the degree or to a tenth of a degree.
– This threshold could be adjusted, but is tailored to the WorldClim database we are using (2.5 arc minutes).
• Create a delimited file for each species containing specimen data with flag columns (a reference file recording which data are utilized or excluded in the mapping step). This file becomes part of our analysis archive, so that we can always go back and edit or evaluate old data.
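The flagging rules above can be sketched as follows. This is a Python illustration, not the authors' clean_gbif R code; the column names and sample records are invented. For scale, 2.5 arc minutes is about 0.042 degrees, while a tenth of a degree is 6 arc minutes, so a latitude given to one decimal place or fewer is coarser than the climate grid.

```python
import csv
import io

# Invented records; coordinates are kept as text so written precision is preserved.
RECORDS = [
    {"species": "Carex sp. 1", "lat": "45.1234", "lon": "-93.5678"},
    {"species": "Carex sp. 1", "lat": "45.1234", "lon": "-93.5678"},
    {"species": "Carex sp. 1", "lat": "45.1", "lon": "-93.2"},
    {"species": "Carex sp. 2", "lat": "45", "lon": "-93"},
]


def decimal_places(value):
    """Count digits after the decimal point in a coordinate string.

    Trailing zeros are counted as written, so "45.10" reads as two places.
    """
    text = value.strip()
    return len(text.split(".")[1]) if "." in text else 0


def flag_records(records, max_imprecise_places=1):
    """Copy each record and add 'duplicate' and 'imprecise' flag columns."""
    seen = set()
    flagged = []
    for rec in records:
        key = (rec["species"], float(rec["lat"]), float(rec["lon"]))
        out = dict(rec)
        # Design choice: the first record at a location is kept, later ones flagged.
        out["duplicate"] = key in seen
        out["imprecise"] = decimal_places(rec["lat"]) <= max_imprecise_places
        seen.add(key)
        flagged.append(out)
    return flagged


def to_delimited(flagged):
    """Serialize flagged records as tab-delimited text for the analysis archive."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(flagged[0]),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(flagged)
    return buf.getvalue()


flagged = flag_records(RECORDS)
print(to_delimited(flagged))
```

Nothing is deleted at this stage; the flags only record which rows a later mapping step might skip.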
Example of a file generated from clean_gbif
Mapping "cleaned-up" dataset (Map_gbif_jpeg_imprecise)
• Maps need to be manually checked for accuracy and completeness
• We export the maps as images to a Scratchpads media gallery that can be queried or filtered by taxon
• Map reviewing is conducted in a dedicated SP2 forum
There are bugs to work out, though
Some taxa are missing data. Example: Carex humilis
• Map of 2331 specimen records from R code download
• Website individual species download – Filtered for specimens with coordinate data (= 7209 records)
– Missing records include some from France, Japan, & South Korea
Some maps will need adjustments; in the next iterations, it should be possible to automate some of this.
• A Carex alata specimen is missing a “-” in the longitude column.
• Carex lanceolata has specimens where the latitude and longitude are switched.
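A first pass at automating these checks might flag records for manual review rather than correct them. The Python sketch below is speculative: the thresholds, field names and sample coordinates are invented, and the sign check is only a heuristic, since a species can legitimately occur in both hemispheres.

```python
from collections import Counter, defaultdict

# Invented records with numeric coordinates.
RECORDS = [
    {"species": "Carex sp. 1", "lat": 44.1, "lon": -89.1},
    {"species": "Carex sp. 1", "lat": 44.2, "lon": -89.2},
    {"species": "Carex sp. 1", "lat": 44.3, "lon": -89.3},
    {"species": "Carex sp. 1", "lat": 44.4, "lon": -89.4},
    {"species": "Carex sp. 1", "lat": 44.5, "lon": 89.4},   # sign possibly missing
    {"species": "Carex sp. 2", "lat": -105.3, "lon": 39.7},  # possibly switched
]


def suspect_coordinates(records, min_records=5, max_share=0.25):
    """Return (index, reason) pairs for records worth a manual look."""
    # Tally eastern vs western longitudes per species, using valid latitudes only.
    signs = defaultdict(Counter)
    for rec in records:
        if abs(rec["lat"]) <= 90:
            signs[rec["species"]][rec["lon"] >= 0] += 1

    suspects = []
    for i, rec in enumerate(records):
        if abs(rec["lat"]) > 90:
            reason = "latitude out of range"
            if abs(rec["lon"]) <= 90:
                reason += "; latitude and longitude may be switched"
            suspects.append((i, reason))
            continue
        counts = signs[rec["species"]]
        total = sum(counts.values())
        same_side = counts[rec["lon"] >= 0]
        # Flag a record whose hemisphere is shared by few records of its species.
        if total >= min_records and same_side / total <= max_share:
            suspects.append(
                (i, "longitude sign differs from most records of this species"))
    return suspects


suspects = suspect_coordinates(RECORDS)
for index, reason in suspects:
    print(index, reason)
```

Switched coordinates where both values fall within ±90 would pass this check, so the manual map review remains necessary.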
In the end, integrating clean coordinate data with WorldClim climatic data allows us to correlate climatic niche evolution with morphological and lineage diversification*. * See Thursday talk for exciting findings in subgenus Vignea!
We’ve been writing these tools in R, for the simple reason that that’s what we know. Bits could easily be ported to PHP for integration into Scratchpads, or Python for web implementation. Code is available at:
https://mor-systematics.googlecode.com/svn/trunk/cariceae
Identify gaps in our knowledge and sampling
Formulate sampling plan
New collections
DNA sequences
DNA matrices
Multiple alignments
Species tree estimates
Revised classification
A central database for specimen-level data
ACKNOWLEDGMENTS
The Cariceae Working Group
Bil Alverson Jane Balaban John Balaban Bethany Brown Leo Bruederle
Kyong-Sook Chung Theodore Cochrane
Kenneth Dritz Marcial Escudero
Kerry Ford Bruce Ford Berit Gehrke Marlene Hahn Andrew Hipp Takuji Hoshino Pedro Jimenez Timothy Jones Jongduk Jung Sangtae Kim Jennifer Kluse Kate Lueders
Modesto Luceno Anton Reznicek Eric Roalson Paul Rothrock David Simpson Julian Starr
Wayt Thomas Gayle Tonkovich Marcia Waterway Gerould Wilhelm Karen Wilson Jin Xiao-Feng Okihito Yano Shu-ren Zhang
Elizabeth Zimmerman
Colleagues at eMonocot and Scratchpads Edward Baker
Laurence Livermore Vince Smith Odile Weber
The Conveners of this Symposium Melissa Tulig Paul Wilkin
And you!
If there is time, I’ll take questions!