Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Andrew Hipp, Marlene Hahn, Ed Baker, Vince Smith and The Cariceae Working Group

Upload: edward-baker

Post on 11-May-2015


TRANSCRIPT

Page 1: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Andrew Hipp, Marlene Hahn, Ed Baker, Vince Smith and The Cariceae Working Group

Page 2: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

A set of tools for Cariceae informatics

Andrew Hipp, Marlene Hahn, Ed Baker, Vince Smith and The Cariceae Working Group

Page 3: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading
Page 4: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading
Page 5: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading
Page 6: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading
Page 7: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Workflow diagram:
• Identify gaps in our knowledge and sampling
• Formulate sampling plan
• New collections
• DNA sequences
• DNA matrices
• Multiple alignments
• Species tree estimates
• Revised classification

A central database for specimen-level data

Page 8: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

What tools do we need?
• An easily-updated hierarchical checklist to visualize sampling progress across labs, extractions, sequences;
• A specimen-level phylogenetics pipeline that we can use to harvest existing data from NCBI as well as generate ongoing phylogenetic snapshots;
• A way to automate mapping from specimen data, so that we can visualize (and assess our visualizations of) species distributions in geographic and ecological space; and
• A platform for collaboration: a virtual research environment to bring together researchers worldwide

Page 9: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

I.  A  hierarchical  checklist  and  sampling  progress  reports  

Page 10: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

In 2011
• A flat checklist exported from WCM
• A set of spreadsheets from collaborating labs inventorying their DNA and sequence collections
• A vague idea of what trips are needed

Today
• A hierarchical checklist by subgenus, section
• A synthesis of what materials and sequences collaborators have on hand, and what taxa are unsampled
• A concrete sampling plan with trips and taxa identified*

* Okay, we’re working on this one!

Page 11: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Taxonomy

Specimen(s)

DNA extraction(s)

Sequence(s)

Trace file(s) / contig(s)

We are aiming toward a database in which the taxonomy, specimen data, DNA extractions, raw sequencing data and DNA matrices all live together and can be curated and worked on jointly by the community.
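A minimal sketch of how such linked records could be represented, written in Python rather than the R used for the project's tools. The class and field names (Taxon, Specimen, Extraction, Sequence, trace_files) are hypothetical and the example values are invented; this is not the Scratchpads schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sequence:
    region: str                                   # gene region, e.g. "ITS"
    trace_files: List[str] = field(default_factory=list)


@dataclass
class Extraction:
    extraction_id: str
    sequences: List[Sequence] = field(default_factory=list)


@dataclass
class Specimen:
    voucher: str
    extractions: List[Extraction] = field(default_factory=list)


@dataclass
class Taxon:
    name: str
    specimens: List[Specimen] = field(default_factory=list)

    def regions_sequenced(self):
        """Return the sorted gene regions sequenced for this taxon."""
        return sorted({seq.region
                       for sp in self.specimens
                       for ex in sp.extractions
                       for seq in ex.sequences})


# Invented example: one specimen, one extraction, two sequences
taxon = Taxon("Carex alata", [
    Specimen("Smith 101", [
        Extraction("EX-1", [Sequence("ITS", ["its_f.ab1"]), Sequence("ETS")]),
    ]),
])
print(taxon.regions_sequenced())
```

Nesting each level inside its parent keeps a sequence traceable back to its extraction, specimen, and taxon, which is the point of keeping all of these in one place.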

Page 12: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Diagram: Taxonomy, Specimen(s), DNA extraction(s), Sequence(s), Trace file(s) / contig(s)

Page 13: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Spring  2012:  Hierarchical  checklist  

Diagram: Taxonomy, Specimen(s), DNA extraction(s), Sequence(s), Trace file(s) / contig(s)

!  

Page 14: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Diagram: Taxonomy, Specimen(s), DNA extraction(s), Sequence(s), Trace file(s) / contig(s)

!  

Page 15: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Diagram of metadata flow: Specimen Record, Tissue, Extraction, DNA seq. (shown three times)

Page 16: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

A centralized workflow
• Spreadsheets imported into a single Excel file
• Names cleaned (variable)
• DNA data summary formula created for each spreadsheet (ca. 5 mins per user)
• Names matched to our Scratchpads checklist
• All files exported to CSV
• Sample sheets and SP checklist imported to R
• DNA records added to checklist as nodes that are children to their taxa.
• Hierarchical checklist exported in text format, with unsampled taxa marked for searching
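The last two steps can be sketched as follows. This is a Python illustration of the idea, not the R code used for the project; the function name, the input layout, the placeholder taxon names, and the `***UNSAMPLED***` marker are all assumptions made for the example.

```python
from collections import defaultdict


def build_checklist(taxa, dna_records):
    """Build an indented text checklist.

    taxa: list of (section, species) pairs
    dna_records: list of (species, voucher) pairs
    """
    vouchers = defaultdict(list)
    for species, voucher in dna_records:
        vouchers[species].append(voucher)

    sections = defaultdict(list)
    for section, species in taxa:
        sections[section].append(species)

    lines = []
    for section in sorted(sections):
        lines.append(section)
        for species in sorted(sections[section]):
            if vouchers[species]:
                lines.append(f"  {species} [{len(vouchers[species])} DNA record(s)]")
                # DNA records become child nodes of their taxon
                for voucher in sorted(vouchers[species]):
                    lines.append(f"    {voucher}")
            else:
                # Searchable marker for taxa with no material on hand
                lines.append(f"  {species} ***UNSAMPLED***")
    return "\n".join(lines)


# Placeholder names, invented for illustration
taxa = [("Section A", "Species a1"), ("Section A", "Species a2"),
        ("Section B", "Species b1")]
dna_records = [("Species a1", "Voucher 001"), ("Species b1", "Voucher 002"),
               ("Species b1", "Voucher 003")]
print(build_checklist(taxa, dna_records))
```

A fixed text marker means the exported file can be searched for gaps in any text editor.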

Page 17: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

← Section name

← Sampled taxon with its DNA vouchers and summaries

← Unsampled taxon

Page 18: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Because Kew has coded geography using TDWG standards, we can export geographic hit-lists
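One way such a hit-list could be derived, sketched in Python under stated assumptions: each taxon carries a list of region codes, and a set of already-sampled taxa is known. The function name, the taxon names, and the taxon-to-region assignments are invented; the codes stand in for TDWG region codes.

```python
from collections import defaultdict


def geographic_hit_list(distributions, sampled):
    """Map each region code to the unsampled taxa recorded there.

    distributions: dict of taxon -> list of region codes
    sampled: set of taxa that already have DNA records
    """
    hits = defaultdict(list)
    for taxon, regions in distributions.items():
        if taxon in sampled:
            continue
        for code in regions:
            hits[code].append(taxon)
    # Regions with the most unsampled taxa first; ties broken by code
    return sorted(((code, sorted(taxa)) for code, taxa in hits.items()),
                  key=lambda item: (-len(item[1]), item[0]))


# Invented distributions for illustration
distributions = {
    "Taxon A": ["ILL", "WIS"],
    "Taxon B": ["WIS"],
    "Taxon C": ["ILL"],
}
for code, taxa in geographic_hit_list(distributions, sampled={"Taxon C"}):
    print(code, taxa)
```

Ranking regions by the number of unsampled taxa is one possible way to prioritize collecting trips; other rankings would work equally well.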

Page 19: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading
Page 20: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Diagram: Taxonomy, Specimen(s), DNA extraction(s), Sequence(s), Trace file(s) / contig(s)

!  

!  

!  

?  

Page 21: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

II. A specimen-level phylogenetic pipeline

Page 22: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading
Page 23: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

NCBI is a morass of data.

Geneious
• Query nucleotide database (NCBI) for Organism contains: “Carex”, “Uncinia”, “Schoenoxiphium”, “Kobresia”, “Vesicarex”, or “Cymophyllus”
• Export as
  – FASTA
  – TAB-Delim
  – XML: the only export that maintains all information in NCBI, and necessary to obtain data that can be used to connect a sequence to a specimen.

Page 24: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Hinchliff and Roalson. 2013. Systematic Biology 62: 205–219.

Page 25: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Hinchliff and Roalson. 2013. Systematic Biology 62: 205–219.

Page 26: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

A workflow for specimen-level multigene datasets from NCBI
• Download from NCBI [we used Geneious, but any bulk download is fine]
• Parse out collector name, collector number, isolate number, geography
• Manually clean collector names (3 days for >6500 records)
• Identify specimens by unique combinations of collector name, collector number, isolate
• Toss out “accessions” having more than one scientific name
• Clean gene region names so that names are not duplicated (30 minutes for >6500 records)
• Export datasets to MUSCLE and align; export log file
• Manually check alignments and code logfile (D, RC; variable)
• Rerun MUSCLE and export RAxML batchfile
• Analyze
• Screen for non-monophyly; concatenate and continue!
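The gene-region cleaning and per-region export steps might look roughly like this in Python (the talk's tools are in R). The synonym table, the function names, and the rule of keeping the first sequence seen per specimen and region are assumptions for illustration. Alignment with MUSCLE and tree building with RAxML are not shown.

```python
import re
from collections import defaultdict

# Hypothetical synonym table: normalized key -> canonical region name
CANONICAL = {
    "its": "ITS",
    "internaltranscribedspacer": "ITS",
    "ets": "ETS",
    "matk": "matK",
    "trnltrnf": "trnL-trnF",
}


def clean_region(raw):
    """Lower-case, drop punctuation and spaces, then look up the canonical name."""
    key = re.sub(r"[^a-z0-9]", "", raw.lower())
    return CANONICAL.get(key, raw.strip())


def datasets_by_region(records):
    """Group (specimen_id, raw_region, sequence) tuples into one dataset per region."""
    datasets = defaultdict(dict)
    for specimen_id, raw_region, seq in records:
        # Keep the first sequence seen per specimen and region (an arbitrary choice)
        datasets[clean_region(raw_region)].setdefault(specimen_id, seq)
    return datasets


def to_fasta(dataset):
    """Write one region's dataset as FASTA text, ready for an aligner."""
    return "".join(f">{sid}\n{seq}\n" for sid, seq in sorted(dataset.items()))


# Invented records for illustration
records = [("s1", "its", "ACGT"), ("s1", "ITS", "AAAA"), ("s2", "trnL - trnF", "GG")]
datasets = datasets_by_region(records)
print(sorted(datasets))
print(to_fasta(datasets["ITS"]))
```

Unrecognized region names pass through unchanged, so they can be reviewed by hand and added to the table.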

Page 27: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

6692  sequence  records  in  Cariceae  

Page 28: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Tab-delimited metadata from NCBI / Geneious is handy, but it lacks almost all the information that could be used as voucher IDs. No way to link sequences to specimens!

However, some NCBI records do contain this data. How do we access it?

Page 29: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

NCBI  Specimen  Record  

The FEATURES/Qualifier1 section has information that allows us to connect sequences to a specific specimen (for example, some records contain the qualifier specimen_voucher). To get this additional information, we need to export the data as an XML file and parse it into a usable tab-delimited file.
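A minimal Python sketch of that XML-to-table step. It assumes the element names of NCBI's GBSeq XML format (GBSeq, GBQualifier_name, GBQualifier_value); the export used for the talk may name its elements differently. The sample record, its accession, and its voucher are invented.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented record laid out with GBSeq element names (an assumption)
SAMPLE = """<GBSet><GBSeq>
  <GBSeq_primary-accession>XX000001</GBSeq_primary-accession>
  <GBSeq_organism>Carex alata</GBSeq_organism>
  <GBSeq_feature-table><GBFeature>
    <GBFeature_key>source</GBFeature_key>
    <GBFeature_quals>
      <GBQualifier><GBQualifier_name>specimen_voucher</GBQualifier_name>
        <GBQualifier_value>Smith 101 (MOR)</GBQualifier_value></GBQualifier>
      <GBQualifier><GBQualifier_name>country</GBQualifier_name>
        <GBQualifier_value>USA: Illinois</GBQualifier_value></GBQualifier>
    </GBFeature_quals>
  </GBFeature></GBSeq_feature-table>
</GBSeq></GBSet>"""

FIELDS = ["accession", "organism", "specimen_voucher", "isolate", "country",
          "lat_lon", "collection_date", "collected_by"]


def parse_records(xml_text):
    """Return one dict per sequence record, with selected qualifiers as columns."""
    rows = []
    for seq in ET.fromstring(xml_text).iter("GBSeq"):
        row = dict.fromkeys(FIELDS, "")
        row["accession"] = seq.findtext("GBSeq_primary-accession", "")
        row["organism"] = seq.findtext("GBSeq_organism", "")
        for qual in seq.iter("GBQualifier"):
            name = qual.findtext("GBQualifier_name", "")
            # Keep the first value found for each wanted qualifier
            if name in row and not row[name]:
                row[name] = (qual.findtext("GBQualifier_value") or "").strip()
        rows.append(row)
    return rows


def to_tsv(rows):
    """Serialize the parsed rows as tab-delimited text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = parse_records(SAMPLE)
print(to_tsv(rows))
```

Qualifiers that a record lacks are left as empty columns, so the table shows at a glance which sequences can be tied to a voucher.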

Other good information to export

Page 30: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

We parsed the NCBI XML and embedded fields within <qualifiers1> to get voucher, DNA isolate, population variants, country, geographic coordinates, collection date, collector name, and other fields… many informative about the identity of the plants sequenced.

To make clean voucher IDs, we used last name, collection number, and DNA isolate (used by some labs). For this analysis, sequences that could not be assigned to a single-species voucher were discarded.
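The voucher-ID rule and the single-species filter could be sketched in Python as below. The ID format (upper-case last name, number, and isolate joined by underscores) and the way the last name is extracted are assumptions for illustration, not the authors' actual rules, and the records are invented.

```python
import re
from collections import defaultdict


def voucher_id(collector, number, isolate=""):
    """Join collector last name, collection number and optional DNA isolate."""
    # Assumes the first collector's last name is the final word
    # before any comma, "&" or " and "
    first = re.split(r"[,&]| and ", collector)[0].strip()
    last_name = first.split()[-1].upper() if first else "UNKNOWN"
    parts = [last_name, str(number).strip()]
    if isolate:
        parts.append(str(isolate).strip())
    return "_".join(parts)


def single_species_vouchers(records):
    """Keep vouchers linked to exactly one scientific name.

    records: iterable of (voucher_id, scientific_name, accession) tuples.
    """
    names = defaultdict(set)
    accessions = defaultdict(list)
    for vid, name, acc in records:
        names[vid].add(name)
        accessions[vid].append(acc)
    return {vid: accessions[vid] for vid in names if len(names[vid]) == 1}


# Invented records for illustration
records = [
    ("HIPP_1", "Carex a", "X1"),
    ("HIPP_1", "Carex a", "X2"),
    ("HAHN_2", "Carex a", "X3"),
    ("HAHN_2", "Carex b", "X4"),   # two names for one voucher: discarded
]
print(voucher_id("A. L. Hipp & M. Hahn", 2417, "a"))
print(single_species_vouchers(records))
```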

Page 31: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

6692 sequence records → 3004 individuals, 54 genes, 5846 sequences

Page 32: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

ITS, ETS, matK, trnL-trnF
• 3,370 DNA sequences
• 2,196 individuals
• 723 spp.
• 397 spp. with > 1 individual
• 31.7% of those spp. monophyletic

Page 33: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading
Page 34: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Workflow diagram (repeated): identify gaps in our knowledge and sampling; formulate sampling plan; new collections; DNA sequences; DNA matrices; multiple alignments; species tree estimates; revised classification; a central database for specimen-level data.

Page 35: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Workflow diagram (repeated): identify gaps in our knowledge and sampling; formulate sampling plan; new collections; DNA sequences; DNA matrices; multiple alignments; species tree estimates; revised classification; a central database for specimen-level data.

Page 36: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Workflow diagram (repeated): identify gaps in our knowledge and sampling; formulate sampling plan; new collections; DNA sequences; DNA matrices; multiple alignments; species tree estimates; revised classification; a central database for specimen-level data.

Page 37: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Workflow diagram (repeated): identify gaps in our knowledge and sampling; formulate sampling plan; new collections; DNA sequences; DNA matrices; multiple alignments; species tree estimates; revised classification; a central database for specimen-level data.

Page 38: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

III. Generating maps from specimen data

Page 39: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Carex macloviana D’Urv GBIF map, 2013-07-06

Page 40: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Mapping GBIF Data
• Generate species list to extract GBIF data (i.e. accepted names in World Checklist).
• Download GBIF data using a wrapper to dismo::gbif (R), allowing us to capture and log errors and missing data.
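The capture-and-log pattern behind such a wrapper can be sketched in Python. The authors' wrapper is written in R around dismo::gbif; here `fetch` is a hypothetical caller-supplied function standing in for whatever download client is used, and `fake_fetch` only simulates failures.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("gbif_download")


def download_all(species_list, fetch):
    """Call fetch(name) for each species, logging errors and empty results.

    fetch is a caller-supplied function (hypothetical here) that returns a
    list of occurrence records for one species name.
    """
    results, failures = {}, {}
    for name in species_list:
        try:
            records = fetch(name)
        except Exception as exc:           # keep going; record what went wrong
            log.error("%s: download failed (%s)", name, exc)
            failures[name] = str(exc)
            continue
        if not records:
            log.warning("%s: no records returned", name)
            failures[name] = "no records"
            continue
        results[name] = records
    return results, failures


def fake_fetch(name):
    """Stand-in for a real download call, used only to exercise the wrapper."""
    if name == "Species bad":
        raise ValueError("timeout")
    return [] if name == "Species none" else [{"lat": "41.8", "lon": "-88.1"}]


results, failures = download_all(
    ["Species ok", "Species bad", "Species none"], fake_fetch)
print(sorted(results), failures)
```

Returning the failures alongside the results leaves a record of which names need to be retried or checked by hand.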

 

Page 41: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Clean up downloaded GBIF data
• Flag duplicate specimen datasets
  – Flags specimens within the same species that have identical coordinates.
  – This should be expanded to include specimens that have identical locality descriptions.
• Flag imprecise location data
  – Flags specimens in which the latitude is precise only to the degree or to a tenth of a degree.
  – This threshold could be adjusted, but is tailored to the WorldClim database we are using (2.5 arc minutes).
• Create a delimited file for each species containing specimen data with flagged columns (a reference file of which data are utilized or excluded in the mapping step). This file becomes part of our analysis archive, so that we can always go back and edit or evaluate old data.
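The two flags and the per-species file can be sketched in Python as follows. This is not the authors' clean_gbif R function; the column names, the function names, and the choice to leave the first record of a duplicated set unflagged are assumptions. Coordinates are kept as strings so that the number of decimal places is not lost.

```python
import csv


def decimal_places(value):
    """Number of digits after the decimal point in a coordinate string."""
    text = str(value).strip()
    return len(text.split(".")[1]) if "." in text else 0


def flag_records(records, min_decimals=2):
    """Add 'duplicate' and 'imprecise' flags to occurrence dicts.

    records: list of dicts with 'species', 'lat' and 'lon' as strings.
    """
    seen = set()
    flagged = []
    for rec in records:
        key = (rec["species"], rec["lat"].strip(), rec["lon"].strip())
        row = dict(rec)
        # Same species and identical coordinates as an earlier record
        row["duplicate"] = key in seen
        seen.add(key)
        # Latitude given only to the degree or to a tenth of a degree
        row["imprecise"] = decimal_places(rec["lat"]) < min_decimals
        flagged.append(row)
    return flagged


def write_species_file(path, flagged):
    """Write all records, flags included, to a tab-delimited archive file."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(flagged[0].keys()),
                                delimiter="\t")
        writer.writeheader()
        writer.writerows(flagged)


# Invented records for illustration
records = [
    {"species": "Carex a", "lat": "41.8", "lon": "-88.05"},
    {"species": "Carex a", "lat": "41.8", "lon": "-88.05"},
    {"species": "Carex a", "lat": "41.8157", "lon": "-88.0690"},
]
flagged = flag_records(records)
print([(r["duplicate"], r["imprecise"]) for r in flagged])
```

Records are flagged rather than deleted, so the archived file still shows what was excluded from a map and why.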

Page 42: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Example  of  a  file  generated  from  clean_gbif  

Page 43: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Mapping "cleaned-up" dataset (Map_gbif_jpeg_imprecise)

•  Maps  need  to  be  manually  checked  for  accuracy  and  completeness  

•  We  export  the  maps  as  images  to  a  Scratchpads  media  gallery  that  can  be  queried  or  filtered  by  taxon  

•  Map  reviewing  is  conducted  in  a  dedicated  SP2  forum  

Page 44: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading
Page 45: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

There are bugs to work out, though

Some taxa are missing data. Example: Carex humilis
• Map of 2331 specimen records from R code download
• Website individual species download
  – Filtered for specimens with coordinate data (= 7209 records)
  – Missing records include some from France, Japan, & South Korea

   

Page 46: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Some maps will need adjustments: in next iterations, it should be possible to automate some of this

Carex alata specimen is missing a “-” in longitude column

Carex lanceolata has specimens where the latitude and longitude are switched.
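One possible way to automate checks for these two errors, sketched in Python. It assumes an expected bounding box is available for each species (for example, derived from its checklist distribution); the function name and the box used below are invented, and any suggested fix would still need manual review.

```python
def suggest_fix(lat, lon, bbox):
    """Suggest a correction for one coordinate pair.

    bbox: (min_lat, max_lat, min_lon, max_lon), a hypothetical expected
    range for the species.
    Returns (lat, lon, note).
    """
    def inside(la, lo):
        return bbox[0] <= la <= bbox[1] and bbox[2] <= lo <= bbox[3]

    if -90 <= lat <= 90 and inside(lat, lon):
        return lat, lon, "ok"
    # Candidate repairs, tried in order
    candidates = [
        (lat, -lon, "longitude sign flipped"),
        (lon, lat, "latitude and longitude swapped"),
    ]
    for la, lo, note in candidates:
        if -90 <= la <= 90 and -180 <= lo <= 180 and inside(la, lo):
            return la, lo, note
    return lat, lon, "outside expected range"


# Invented bounding box for illustration
bbox = (25.0, 50.0, -100.0, -65.0)
for lat, lon in [(41.8, -88.1), (41.8, 88.1), (-88.1, 41.8), (10.0, 10.0)]:
    print(suggest_fix(lat, lon, bbox))
```

A repair is only proposed when it moves the point into the expected range, so records that are simply far from the known distribution are reported rather than altered.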

Page 47: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

In the end, integrating clean coordinate data with WorldClim climatic data allows us to correlate climatic niche evolution with morphological and lineage diversification*.

* See Thursday talk for exciting findings in subgenus Vignea!

Page 48: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

We’ve been writing these tools in R, for the simple reason that that’s what we know. Bits could easily be ported to PHP for integration into Scratchpads, or Python for web implementation.

Code is available at: https://mor-systematics.googlecode.com/svn/trunk/cariceae

Page 49: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Workflow diagram (repeated): identify gaps in our knowledge and sampling; formulate sampling plan; new collections; DNA sequences; DNA matrices; multiple alignments; species tree estimates; revised classification; a central database for specimen-level data.

Page 50: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading
Page 51: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

ACKNOWLEDGMENTS

The Cariceae Working Group
Bil Alverson, Jane Balaban, John Balaban, Bethany Brown, Leo Bruederle, Kyong-Sook Chung, Theodore Cochrane, Kenneth Dritz, Marcial Escudero, Kerry Ford, Bruce Ford, Berit Gehrke, Marlene Hahn, Andrew Hipp, Takuji Hoshino, Pedro Jimenez, Timothy Jones, Jongduk Jung, Sangtae Kim, Jennifer Kluse, Kate Lueders, Modesto Luceno, Anton Reznicek, Eric Roalson, Paul Rothrock, David Simpson, Julian Starr, Wayt Thomas, Gayle Tonkovich, Marcia Waterway, Gerould Wilhelm, Karen Wilson, Jin Xiao-Feng, Okihito Yano, Shu-ren Zhang, Elizabeth Zimmerman

Colleagues at eMonocot and Scratchpads
Edward Baker, Laurence Livermore, Vince Smith, Odile Weber

The Conveners of this Symposium
Melissa Tulig, Paul Wilkin

And you!

   

Page 52: Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

If there is time, I’ll take questions!