iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF-1115210). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

Matthew Collins (iDigBio), Jorrit Poelen (independent), Alexander Thompson (iDigBio), Jennifer Hammock (EOL)

Upload: matthew-j-collins

Post on 18-Jan-2017


TRANSCRIPT

Page 1: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service


Matthew Collins (iDigBio), Jorrit Poelen (independent), Alexander Thompson (iDigBio), Jennifer Hammock (EOL)

Page 2: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

What We’re Interested In

Computation with biodiversity data
• Research at scale
• Lowering barriers to access
• Reproducibility

Matthew Collins, Technical Operations Manager - iDigBio

Jorrit Poelen, Independent

Alexander Thompson, Software Products Lead - iDigBio

Jennifer Hammock, Marine Theme Coordinator - EOL

Page 3: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

Quick Review of Ways That We Work With Datasets

Focus here is on using large aggregated datasets to answer research questions

Page 4: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

Working With Datasets - Web Portals

Good: searching, visualizing location, browsing
Less good: data characterization, modeling, analysis, graphing

Page 5: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

Working With Data - Purpose-Built Applications

Good: low barrier to entry, expert-built, documentation, peers
Less good: limited scope, limited ability to change

Page 6: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

Working With Data - APIs & Libraries

Good: direct access to data, some simple analysis
Less good: programming barrier, performance limits

Page 7: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

Working With Data - Download & Code

Good: ultimate flexibility, combine & merge
Less good: data management barrier, you’re the sysadmin

Page 8: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

Working With Data - GUODA

Global Unified Open Data Access
(If SPNHC can be Spinach, GUODA can be Gouda)

An informal collaboration between technologists from organizations like EOL, ePANDDA, and iDigBio, as well as independent biodiversity informaticists. We share data use cases, best practices, infrastructure, code, and ideas around the science that can be done by analyzing large open-access biodiversity datasets.

Page 9: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

Working With Data - GUODA Continued

Goals
• Have technologists discuss the technical challenges and solution approaches in the biodiversity informatics domain
• Provide an on-ramp for those who might not think of themselves as “technologists”
• Fast parallel computation infrastructure and practices (currently using Apache Spark)
• Local copies of entire datasets already formatted, ready for computation at scale on provided infrastructure
• Hosting for services that rely on the above

Page 10: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

What Questions Does GUODA Make Approachable?

Can we create structured data from the unstructured text in iDigBio records?

GUODA provides a platform to quickly start working on this problem.

1. No data download
2. Jupyter Notebooks
3. Parallel processing of entire dataset

Page 11: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

Data Characterization

Looking at the Darwin Core terms fieldNotes, occurrenceRemarks, and eventRemarks to see how many characters each field contains

Page 12: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

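The Code to Produce That Figure

import numpy
import seaborn as sns
import pyspark.sql.functions as sql

# sqlContext is supplied by the notebook's Spark session.
idbdf = sqlContext.read.parquet("../data/idigbio/occurrence.txt.parquet")
# Assumed step, not shown on the slide: register the DataFrame so SQL can refer to it as idbtable.
idbdf.registerTempTable("idbtable")

# The three notes fields are also selected under short aliases so their lengths can be computed below.
notes = sqlContext.sql("""
    SELECT
        `http://portal.idigbio.org/terms/uuid` AS uuid,
        `http://rs.tdwg.org/dwc/terms/fieldNotes` AS fieldNotes,
        `http://rs.tdwg.org/dwc/terms/occurrenceRemarks` AS occurrenceRemarks,
        `http://rs.tdwg.org/dwc/terms/eventRemarks` AS eventRemarks,
        TRIM(CONCAT(`http://rs.tdwg.org/dwc/terms/occurrenceRemarks`, ' ',
                    `http://rs.tdwg.org/dwc/terms/eventRemarks`, ' ',
                    `http://rs.tdwg.org/dwc/terms/fieldNotes`)) AS document
    FROM idbtable
    WHERE `http://rs.tdwg.org/dwc/terms/fieldNotes` != ''
       OR `http://rs.tdwg.org/dwc/terms/occurrenceRemarks` != ''
       OR `http://rs.tdwg.org/dwc/terms/eventRemarks` != ''
""")

# Character counts for the combined document and for each source field.
notes = notes.withColumn('document_len', sql.length(notes['document']))
notes = notes.withColumn('fieldNotes_len', sql.length(notes['fieldNotes']))
notes = notes.withColumn('eventRemarks_len', sql.length(notes['eventRemarks']))
notes = notes.withColumn('occurrenceRemarks_len', sql.length(notes['occurrenceRemarks']))

# sub_set is not defined on the slide; assumed here to be the columns the plots use.
sub_set = ['document_len', 'fieldNotes_len', 'eventRemarks_len', 'occurrenceRemarks_len']
notes_pd = notes[sub_set].toPandas()

# Overlaid distributions of log10(length); zero-length fields are excluded.
sns.distplot(notes_pd['document_len'].dropna().apply(numpy.log10))
sns.distplot(notes_pd['fieldNotes_len'].dropna()[notes_pd['fieldNotes_len'] > 0].apply(numpy.log10))
sns.distplot(notes_pd['occurrenceRemarks_len'].dropna()[notes_pd['occurrenceRemarks_len'] > 0].apply(numpy.log10))
ax = sns.distplot(notes_pd['eventRemarks_len'].dropna()[notes_pd['eventRemarks_len'] > 0].apply(numpy.log10))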
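To see what the figure measures without a Spark cluster, the sketch below applies the same idea to a few in-memory records: it computes the character count of each notes field and of the combined document, then bins the lengths by the integer part of their log10. The sample records and the helper names (field_lengths, log10_histogram) are hypothetical and are not taken from the iDigBio data or the original notebook.

```python
import math
from collections import Counter

# Same field order as the CONCAT in the SQL on this slide.
FIELDS = ("occurrenceRemarks", "eventRemarks", "fieldNotes")

# Hypothetical sample records; real iDigBio records carry many more fields.
records = [
    {"occurrenceRemarks": "collected at light", "eventRemarks": "", "fieldNotes": ""},
    {"occurrenceRemarks": "", "eventRemarks": "flight intercept trap", "fieldNotes": "see notebook"},
    {"occurrenceRemarks": "", "eventRemarks": "", "fieldNotes": ""},
]

def field_lengths(record):
    """Return character counts per field plus the combined 'document' length."""
    document = " ".join(record.get(f, "") for f in FIELDS).strip()
    lengths = {f: len(record.get(f, "")) for f in FIELDS}
    lengths["document"] = len(document)
    return lengths

def log10_histogram(values):
    """Count positive values by the integer part of their log10 (1-9 -> 0, 10-99 -> 1, ...)."""
    return Counter(int(math.floor(math.log10(v))) for v in values if v > 0)

lengths = [field_lengths(r) for r in records]
doc_hist = log10_histogram(l["document"] for l in lengths)
print(doc_hist)
```

Zero-length values are skipped before taking the logarithm, which mirrors the `> 0` filters applied to the per-field plots.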

Page 13: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

The Interface to Write The Code

Notebooks

“Literate Programming”

Comments, code, and outputs all together in a readable document that describes what is being done

Page 14: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

GUODA Notebook Architecture

A look at interacting with the GUODA data service through Jupyter Notebooks

Page 15: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

GUODA Data Service At Scale

Python NLTK parsing and part-of-speech tagging of notes fields with noun-phrase assembly.

Example phrases:
• Intercept trap
• Forest litters
• Field notes
• Field notebook
• Fogging fungus covered log
• Tropical forest
• Flight intercept trap
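Noun-phrase assembly can be illustrated without a trained model. The sketch below takes tokens that already carry Penn Treebank part-of-speech tags and groups each unbroken run of adjectives and nouns that ends in a noun into one phrase. This simple rule is only a stand-in: the chunker used in the talk is trained from data, and the function name assemble_noun_phrases and the hand-tagged sample are hypothetical.

```python
def assemble_noun_phrases(tagged_tokens):
    """Group runs of adjective/noun tokens into noun phrases.

    tagged_tokens: list of (word, tag) pairs with Penn Treebank tags.
    A run is kept only if it contains a noun; trailing adjectives are dropped.
    """
    phrases, run = [], []

    def flush():
        # Trim trailing adjectives so that the phrase ends in a noun.
        while run and not run[-1][1].startswith("NN"):
            run.pop()
        if run:
            phrases.append(" ".join(word for word, _ in run))
        run.clear()

    for word, tag in tagged_tokens:
        if tag.startswith("NN") or tag.startswith("JJ"):
            run.append((word, tag))
        else:
            flush()  # any other tag ends the current run
    flush()
    return phrases

# Hand-tagged example sentence (tags assigned by hand for illustration).
tagged = [
    ("Collected", "VBN"), ("in", "IN"), ("flight", "NN"), ("intercept", "NN"),
    ("trap", "NN"), ("in", "IN"), ("tropical", "JJ"), ("forest", "NN"), (".", "."),
]
noun_phrases = assemble_noun_phrases(tagged)
print(noun_phrases)
```

In practice the tags would come from a part-of-speech tagger rather than being written by hand, and a trained chunker can capture phrase patterns that a fixed rule like this one misses.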

Page 16: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

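The Code - 6 minutes for 3.2M Records

import pyspark.sql.functions as sql
import pyspark.sql.types as types

# c, p, and t are the chunker, part-of-speech tagger, and tokenizer objects
# created earlier in the notebook (not shown on the slide).
c.train(c.load_training_data("../data/chunker_training_50_fixed.json"))

def pipeline(s):
    # tokenize -> part-of-speech tag -> chunk tag -> assemble phrases
    return c.assemble(c.tag(p.tag(t.tokenize(s))))

# The UDF returns a list of string-to-string maps, one per chunk.
pipeline_udf = sql.udf(pipeline, types.ArrayType(
    types.MapType(types.StringType(), types.StringType())))

# One row per chunk, keep noun phrases (NP) only, lowercase, then count each phrase.
phrases = notes\
    .withColumn("phrases", pipeline_udf(notes["document"]))\
    .select(sql.explode(sql.col("phrases")).alias("text"))\
    .filter(sql.col("text")["tag"] == "NP")\
    .select(sql.lower(sql.col("text")["phrase"]).alias("phrase"))\
    .groupBy(sql.col("phrase"))\
    .count()

phrases.write.parquet('../data/idigbio_phrases.parquet')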
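The DataFrame steps on this slide (explode, keep only NP chunks, lowercase, group, count) amount to a word-count style aggregation. As a rough single-machine analogue, the sketch below runs the same steps over a small list using collections.Counter. The sample chunks and the function name count_noun_phrases are hypothetical; only the 'tag' and 'phrase' keys mirror the slide's code.

```python
from collections import Counter

def count_noun_phrases(chunks_per_record):
    """Count lowercased noun phrases across all records.

    chunks_per_record: one list per record, each holding dicts with
    'tag' and 'phrase' keys.
    """
    counts = Counter()
    for chunks in chunks_per_record:      # one entry per record
        for chunk in chunks:              # "explode": visit each chunk separately
            if chunk["tag"] == "NP":      # keep noun phrases only
                counts[chunk["phrase"].lower()] += 1
    return counts

# Hypothetical chunker output for three records; the last has no chunks.
sample = [
    [{"tag": "NP", "phrase": "Flight intercept trap"}, {"tag": "VP", "phrase": "collected"}],
    [{"tag": "NP", "phrase": "flight intercept trap"}, {"tag": "NP", "phrase": "Tropical forest"}],
    [],
]
phrase_counts = count_noun_phrases(sample)
print(phrase_counts.most_common())
```

The Spark version performs the same counting, but distributes the per-record work and the grouping across the cluster, which is what makes millions of records tractable in minutes.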

Page 17: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

What Else is GUODA Besides Notebooks?

Remember “collaboration” and “infrastructure” to lower barriers

• Twice monthly Google Hangouts
• Hadoop HDFS data store with datasets: GBIF, iDigBio, BHL, TraitBank so far
• Apache Spark cluster for computation
• Backs Effechecka http://effechecka.org/
• Backs Fresh Data https://github.com/gimmefreshdata/
• ePANDDA (we’re sharing ideas)
• iDigBio data quality workflows

Page 18: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

Why is GUODA Important?

Perform research at a faster pace by “outsourcing” some of the harder parts

Collect entire large datasets together in one place for cross-dataset exploration without data management barrier

Provides a foundation, both community and infrastructure, upon which to build purpose-built applications and APIs bigger and faster than before

Page 19: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service

How You Can Fit With GUODA

• Make your data available

• Data standards to make it relatable to other datasets

• Making data available doesn’t end with handoff to the aggregator - where is your data used?

• Support workforce development

• Support next-wave things like ePANDDA

• Collaborate with GUODA when starting your own research

Page 20: Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service


www.idigbio.org

facebook.com/iDigBio

twitter.com/iDigBio

vimeo.com/idigbio

idigbio.org/rss-feed.xml

webcal://www.idigbio.org/events-calendar/export.ics

Thank you!

http://guoda.bio