Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Andrew Hipp, Marlene Hahn, Ed Baker, Vince Smith and

The Cariceae Working Group

A set of tools for Cariceae informatics



Identify gaps in our knowledge and sampling

Formulate sampling plan

New collections, DNA sequences

DNA matrices

Multiple alignments

Species tree estimates

Revised classification

A central database for specimen-level data

What tools do we need?
•  An easily-updated hierarchical checklist to visualize sampling progress across labs, extractions, sequences;
•  A specimen-level phylogenetics pipeline that we can use to harvest existing data from NCBI as well as generate ongoing phylogenetic snapshots;
•  A way to automate mapping from specimen data, so that we can visualize (and assess our visualizations of) species distributions in geographic and ecological space; and
•  A platform for collaboration: a virtual research environment to bring together researchers worldwide

I. A hierarchical checklist and sampling progress reports

In 2011
•  A flat checklist exported from WCM
•  A set of spreadsheets from collaborating labs inventorying their DNA and sequence collections
•  A vague idea of what trips are needed

Today
•  A hierarchical checklist by subgenus, section
•  A synthesis of what materials and sequences collaborators have on hand, and what taxa are unsampled
•  A concrete sampling plan with trips and taxa identified*

* Okay, we’re working on this one!

Taxonomy

Specimen(s)

DNA extraction(s)

Sequence(s)

Trace file(s) / contig(s)

We are aiming toward a database in which the taxonomy, specimen data, DNA extractions, raw sequencing data and DNA matrices all live together and can be curated and worked on jointly by the community.



Spring 2012: Hierarchical checklist


Metadata flow

Specimen Record

Tissue

Extraction

DNA seq. / DNA seq. / DNA seq.

A centralized workflow
•  Spreadsheets imported into a single Excel file
•  Names cleaned (variable)
•  DNA data summary formula created for each spreadsheet (ca. 5 mins per user)
•  Names matched to our Scratchpads checklist
•  All files exported to CSV
•  Sample sheets and SP checklist imported to R
•  DNA records added to checklist as nodes that are children to their taxa.
•  Hierarchical checklist exported in text format, with unsampled taxa marked for searching

← Section name

← Sampled taxon with its DNA vouchers and summaries

← Unsampled taxon
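The checklist step can be sketched in a few lines. This is a minimal Python illustration, not the R code used for the talk: the section names, taxon names, vouchers and the `[UNSAMPLED]` tag are all invented for the example, and the real export format may differ.

```python
from collections import defaultdict

# Invented inputs: (section, taxon) rows from the checklist and
# (taxon, voucher, regions) rows from the collaborators' sample sheets.
checklist = [
    ("Section 1", "Taxon A"),
    ("Section 1", "Taxon B"),
    ("Section 2", "Taxon C"),
]
dna_records = [
    ("Taxon A", "Voucher 101", "ITS, ETS"),
    ("Taxon A", "Voucher 22", "matK"),
    ("Taxon C", "Voucher 7", "ITS"),
]


def build_report(checklist, dna_records, unsampled_tag="[UNSAMPLED]"):
    """Return the checklist as indented text, DNA records as child lines."""
    by_taxon = defaultdict(list)
    for taxon, voucher, regions in dna_records:
        by_taxon[taxon].append((voucher, regions))

    by_section = defaultdict(list)
    for section, taxon in checklist:
        by_section[section].append(taxon)

    lines = []
    for section in sorted(by_section):
        lines.append(section)
        for taxon in sorted(by_section[section]):
            records = by_taxon.get(taxon, [])
            if not records:
                # A searchable tag marks taxa with no DNA on hand.
                lines.append(f"  {taxon} {unsampled_tag}")
                continue
            lines.append(f"  {taxon}")
            for voucher, regions in records:
                lines.append(f"    {voucher}: {regions}")
    return "\n".join(lines)


report = build_report(checklist, dna_records)
print(report)
```

Searching the exported text for the tag then gives the list of taxa still to be sampled.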

Because Kew has coded geography using TDWG standards, we can export geographic hit-lists


II. A specimen-level phylogenetic pipeline

NCBI is a morass of data.

Geneious
•  Query nucleotide database (NCBI) for Organism contains: “Carex”, “Uncinia”, “Schoenoxiphium”, “Kobresia”, “Vesicarex”, or “Cymophyllus”
•  Export as
   –  FASTA
   –  Tab-delimited
   –  XML
•  XML is the only export that maintains all information in NCBI.
•  It is necessary for obtaining data that can be used to connect a sequence to a specimen.

Hinchliff and Roalson. 2013. Systematic Biology 62: 205–219.

A workflow for specimen-level multigene datasets from NCBI

•  Download from NCBI [we used Geneious, but any bulk download is fine]
•  Parse out collector name, collector number, isolate number, geography
•  Manually clean collector names (3 days for >6500 records)
•  Identify specimens by unique combinations of collector name, collector number, isolate
•  Toss out “accessions” having more than one scientific name
•  Clean gene region names so that names are not duplicated (30 minutes for >6500 records)
•  Export datasets to MUSCLE and align; export log file
•  Manually check alignments and code logfile (D, RC; variable)
•  Rerun MUSCLE and export RAxML batchfile
•  Analyze
•  Screen for non-monophyly; concatenate and continue!
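The two specimen-matching steps (unique combinations of collector, number and isolate; tossing vouchers with more than one scientific name) might look roughly like this. It is a Python sketch rather than the R code used for the talk. The record fields and the sample records are invented, and it assumes collector names have already been cleaned by hand into an "Initials Lastname" form.

```python
import re
from collections import defaultdict


def voucher_id(collector, number, isolate=""):
    """Build a key from collector last name, collection number and isolate.

    Returns None when collector or number is missing, since such a record
    cannot be tied to a specimen.
    """
    tokens = collector.strip().split()
    number = re.sub(r"\s+", "", number)
    if not tokens or not number:
        return None
    isolate = re.sub(r"\s+", "", isolate)
    return "|".join([tokens[-1].lower(), number, isolate])


def group_by_voucher(records):
    """Split records into single-name vouchers, mixed-name vouchers, unassigned."""
    groups = defaultdict(list)
    unassigned = []
    for rec in records:
        key = voucher_id(rec["collector"], rec["number"], rec.get("isolate", ""))
        if key is None:
            unassigned.append(rec)
        else:
            groups[key].append(rec)
    kept = {k: v for k, v in groups.items()
            if len({r["organism"] for r in v}) == 1}
    mixed = {k: v for k, v in groups.items() if k not in kept}
    return kept, mixed, unassigned


# Invented records for illustration only.
records = [
    {"accession": "X1", "organism": "Carex alata", "collector": "A. Hipp", "number": "101", "isolate": ""},
    {"accession": "X2", "organism": "Carex alata", "collector": "A. Hipp", "number": "101", "isolate": ""},
    {"accession": "X3", "organism": "Carex alata", "collector": "B. Ford", "number": "7", "isolate": "i2"},
    {"accession": "X4", "organism": "Carex humilis", "collector": "B. Ford", "number": "7", "isolate": "i2"},
    {"accession": "X5", "organism": "Carex humilis", "collector": "", "number": "", "isolate": ""},
]
kept, mixed, unassigned = group_by_voucher(records)
print(sorted(kept), sorted(mixed), len(unassigned))
```

Keeping the mixed and unassigned records in separate containers, rather than silently dropping them, leaves a trail for later manual review.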

6692 sequence records in Cariceae

Tab-delimited metadata from NCBI / Geneious is handy, but it lacks almost all the information that could be used as voucher IDs. No way to link sequences to specimens! However, some NCBI records do contain this data. How do we access it?

NCBI Specimen Record

The FEATURES/Qualifier1 section has information that allows us to connect sequences to a specific specimen.

(For example, some records contain the qualifier specimen_voucher.) To get this additional information, we need to export the data as an XML file and parse it into a usable tab-delimited file.

Other good information to export

We parsed the NCBI XML and embedded fields within <qualifiers1> to get voucher, DNA isolate, population variants, country, geographic coordinates, collection date, collector name, and other fields… many informative about the identity of the plants sequenced. To make clean voucher IDs, we used last name, collection number, and DNA isolate (used by some labs). For this analysis, sequences that could not be assigned to a single-species voucher were discarded.
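A rough idea of the parsing step, as a Python sketch rather than the R code used for the talk. It assumes NCBI's own GBSeq XML (the `GBSeq_*`, `GBFeature_*` and `GBQualifier_*` elements), not the Geneious export with `<qualifiers1>` described above, whose layout differs. The sample record, its accession `XX000001` and its voucher string are invented.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Source-feature qualifiers worth keeping; extend as needed.
WANTED = ("specimen_voucher", "isolate", "country", "lat_lon",
          "collection_date", "collected_by")


def source_qualifiers(xml_text):
    """Yield one dict per GBSeq record: accession, organism, source qualifiers."""
    root = ET.fromstring(xml_text)
    for seq in root.iter("GBSeq"):
        row = {
            "accession": seq.findtext("GBSeq_primary-accession", default=""),
            "organism": seq.findtext("GBSeq_organism", default=""),
        }
        row.update({name: "" for name in WANTED})
        for feature in seq.iter("GBFeature"):
            if feature.findtext("GBFeature_key") != "source":
                continue
            for qual in feature.iter("GBQualifier"):
                name = qual.findtext("GBQualifier_name")
                if name in WANTED:
                    row[name] = qual.findtext("GBQualifier_value", default="")
        yield row


def to_tsv(rows):
    """Return the rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["accession", "organism", *WANTED],
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented record for illustration only.
SAMPLE = """<GBSet>
  <GBSeq>
    <GBSeq_primary-accession>XX000001</GBSeq_primary-accession>
    <GBSeq_organism>Carex alata</GBSeq_organism>
    <GBSeq_feature-table>
      <GBFeature>
        <GBFeature_key>source</GBFeature_key>
        <GBFeature_quals>
          <GBQualifier>
            <GBQualifier_name>specimen_voucher</GBQualifier_name>
            <GBQualifier_value>Collector 101 (HERB)</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>country</GBQualifier_name>
            <GBQualifier_value>USA</GBQualifier_value>
          </GBQualifier>
        </GBFeature_quals>
      </GBFeature>
    </GBSeq_feature-table>
  </GBSeq>
</GBSet>"""

rows = list(source_qualifiers(SAMPLE))
print(to_tsv(rows))
```

Qualifiers that a record lacks come out as empty strings, so every row has the same columns and missing voucher data is easy to spot in the table.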

6692 sequence records → 3004 individuals, 54 genes, 5846 sequences

ITS, ETS, matK, trnL-trnF
•  3,370 DNA sequences
•  2,196 individuals
•  723 spp.
•  397 spp. with > 1 individual
•  31.7% of those spp. monophyletic






III. Generating maps from specimen data

Carex macloviana D’Urv GBIF map, 2013-07-06

Mapping GBIF Data
•  Generate species list to extract GBIF data (i.e., accepted names in World Checklist).
•  Download GBIF data using a wrapper to dismo::gbif (R), allowing us to capture and log errors and missing data.
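The wrapper idea, capturing errors and empty results instead of letting one species stop the batch, can be sketched independently of any particular GBIF client. This is a Python illustration, not the R wrapper around dismo::gbif: `fetch` stands in for whatever download function is used, and `fake_fetch` and the species names are invented for the demo.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("gbif_download")


def download_all(species, fetch):
    """Call fetch(name) for each species; collect results, errors, empty hits."""
    results, errors, empty = {}, {}, []
    for name in species:
        try:
            records = fetch(name)
        except Exception as exc:
            # Log and keep going; one failure should not stop the batch.
            log.warning("download failed for %s: %s", name, exc)
            errors[name] = str(exc)
            continue
        if not records:
            log.info("no records returned for %s", name)
            empty.append(name)
            continue
        results[name] = records
    return results, errors, empty


def fake_fetch(name):
    """Stand-in for a real GBIF client; its behaviour is invented."""
    if name == "Carex sp. B":
        raise TimeoutError("simulated timeout")
    if name == "Carex sp. C":
        return []
    return [{"species": name, "lat": 1.0, "lon": 2.0}]


results, errors, empty = download_all(
    ["Carex sp. A", "Carex sp. B", "Carex sp. C"], fake_fetch
)
print(sorted(results), errors, empty)
```

The `errors` and `empty` lists give a ready-made list of species to retry or to check by hand on the GBIF website.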

Clean up downloaded GBIF data
•  Flag duplicate specimen datasets
   –  Flags specimens within the same species that have identical coordinates.
   –  This should be expanded to include specimens that have identical locality descriptions.
•  Flag imprecise location data
   –  Flags specimens in which the latitude is precise only to the degree or to a tenth of a degree.
   –  This threshold could be adjusted, but is tailored to the WorldClim database we are using (2.5 arc minutes).
•  Create a delimited file for each species containing specimen data with flagged columns (a reference file of which data are utilized or excluded in the mapping step). This file becomes part of our analysis archive, so that we can always go back and edit or evaluate old data.
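The two flags can be sketched as follows. This is a Python illustration, not the clean_gbif R code. It assumes coordinates arrive as text (so the number of decimals is still visible) and flags only the second and later copies of a coordinate pair, which is one possible reading of "duplicate". Note that a tenth of a degree is 6 arc minutes, coarser than a 2.5 arc-minute grid cell. The sample coordinates are invented.

```python
import csv
import io


def decimal_places(text):
    """Count digits after the decimal point in a coordinate string."""
    text = text.strip()
    return len(text.split(".", 1)[1]) if "." in text else 0


def flag_records(records, max_imprecise_decimals=1):
    """Return copies of the records with dup_flag and imprecise_flag added."""
    seen = set()
    flagged = []
    for rec in records:
        key = (rec["species"], rec["lat"].strip(), rec["lon"].strip())
        out = dict(rec)
        out["dup_flag"] = key in seen  # first occurrence stays unflagged
        out["imprecise_flag"] = decimal_places(rec["lat"]) <= max_imprecise_decimals
        seen.add(key)
        flagged.append(out)
    return flagged


def write_flagged(flagged, handle):
    """Write the flagged records as a tab-delimited table."""
    writer = csv.DictWriter(handle, fieldnames=list(flagged[0]),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(flagged)


# Invented coordinates for illustration only.
records = [
    {"species": "Carex humilis", "lat": "48.2134", "lon": "16.3701"},
    {"species": "Carex humilis", "lat": "48.2134", "lon": "16.3701"},
    {"species": "Carex humilis", "lat": "47.5", "lon": "19.0"},
    {"species": "Carex humilis", "lat": "46", "lon": "14"},
]
flagged = flag_records(records)
buffer = io.StringIO()
write_flagged(flagged, buffer)
print(buffer.getvalue())
```

Because nothing is deleted, the per-species file still records which rows were excluded from mapping and why.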

Example of a file generated from clean_gbif

Mapping "cleaned-up" dataset (Map_gbif_jpeg_imprecise)

•  Maps need to be manually checked for accuracy and completeness

•  We export the maps as images to a Scratchpads media gallery that can be queried or filtered by taxon

•  Map reviewing is conducted in a dedicated SP2 forum

There are bugs to work out, though

Some taxa are missing data. Example: Carex humilis

•  Map of 2331 specimen records from R code download
•  Website individual species download
   –  Filtered for specimens with coordinate data (= 7209 records)
   –  Missing records include some from France, Japan, & South Korea

Some maps will need adjustments; in the next iterations, it should be possible to automate some of this.

•  A Carex alata specimen is missing a “-” in the longitude column.
•  Carex lanceolata has specimens where the latitude and longitude are switched.
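One way such errors might be caught automatically is to test simple repairs against a rough expected range for the species. The Python sketch below is a heuristic illustration only, not part of the existing R code. The bounding box is made up, boxes that cross the antimeridian are not handled, and any suggested repair would still need a manual check.

```python
def in_box(lat, lon, box):
    """True if the point lies inside box = (south, north, west, east)."""
    south, north, west, east = box
    return south <= lat <= north and west <= lon <= east


def suggest_fix(lat, lon, box):
    """Return (label, lat, lon) for the first candidate inside box, else None."""
    if in_box(lat, lon, box):
        return ("ok", lat, lon)
    candidates = [
        ("flip_lon_sign", lat, -lon),   # missing "-" on longitude
        ("flip_lat_sign", -lat, lon),   # missing "-" on latitude
        ("swap_lat_lon", lon, lat),     # columns switched
    ]
    for label, c_lat, c_lon in candidates:
        valid = -90 <= c_lat <= 90 and -180 <= c_lon <= 180
        if valid and in_box(c_lat, c_lon, box):
            return (label, c_lat, c_lon)
    return None


# Made-up expected range (south, north, west, east) for the demo.
BOX = (24.0, 50.0, -100.0, -60.0)

for point in [(42.0, -88.0), (42.0, 88.0), (-88.0, 42.0), (10.0, 10.0)]:
    print(point, suggest_fix(*point, BOX))
```

A `None` result means no simple repair lands in the expected range, so the record is best left flagged for a person to look at.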

In the end, integrating clean coordinate data with WorldClim climatic data allows us to correlate climatic niche evolution with morphological and lineage diversification*.

* See Thursday talk for exciting findings in subgenus Vignea!

We’ve been writing these tools in R, for the simple reason that that’s what we know. Bits could easily be ported to PHP for integration into Scratchpads, or Python for web implementation. Code is available at:

https://mor-systematics.googlecode.com/svn/trunk/cariceae






ACKNOWLEDGMENTS


Bil Alverson, Jane Balaban, John Balaban, Bethany Brown, Leo Bruederle, Kyong-Sook Chung, Theodore Cochrane, Kenneth Dritz, Marcial Escudero, Kerry Ford, Bruce Ford, Berit Gehrke, Marlene Hahn, Andrew Hipp, Takuji Hoshino, Pedro Jimenez, Timothy Jones, Jongduk Jung, Sangtae Kim, Jennifer Kluse, Kate Lueders, Modesto Luceno, Anton Reznicek, Eric Roalson, Paul Rothrock, David Simpson, Julian Starr, Wayt Thomas, Gayle Tonkovich, Marcia Waterway, Gerould Wilhelm, Karen Wilson, Jin Xiao-Feng, Okihito Yano, Shu-ren Zhang, Elizabeth Zimmerman

Colleagues at eMonocot and Scratchpads: Edward Baker, Laurence Livermore, Vince Smith, Odile Weber

The Conveners of this Symposium: Melissa Tulig, Paul Wilkin

And you!

If there is time, I’ll take questions!
