why data science matters and what we can do with it


Upload: marshall-x-ma

Post on 19-Aug-2014


DESCRIPTION

Presentation at the Deep Carbon Observatory Summer School 2014, Big Sky, MT, USA.

TRANSCRIPT

Page 1: Why data science matters and what we can do with it

deepcarbon.net

Xiaogang (Marshall) Ma and DCO-Data Science Team

Tetherless World Constellation, Rensselaer Polytechnic Institute

Why Data Science Matters and What We Can Do with It

Page 2: Why data science matters and what we can do with it

Outline

• Data Management and Publication
• Interoperability of Data
• Provenance of Research
• Era of Science 2.0

Page 3: Why data science matters and what we can do with it

Data Management and Publication

Page 4: Why data science matters and what we can do with it

Why Manage and Publish Data

• Meet grant requirements
  – Many funding agencies now require researchers to formally state how they will manage and preserve datasets generated from a research project.

… …

Page 5: Why data science matters and what we can do with it

• Increase your research efficiency
  – Have you ever had a hard time understanding the data that you or your colleagues have collected?

Data work

Page 6: Why data science matters and what we can do with it

Image courtesy of British Geological Survey

Nice, now I have my DATA well managed, and next…

Page 7: Why data science matters and what we can do with it

• Increase the visibility of your research
  – Making your data available to other researchers through widely searched repositories can increase your prominence and demonstrate continued use of the data and relevance of your research.

• Facilitate new discoveries
  – Enabling other researchers to use your data reinforces open scientific inquiry and can lead to new and unanticipated discoveries. It also prevents duplication of effort, because others can use your data rather than gather the same data themselves.

Page 8: Why data science matters and what we can do with it

Data Management Plan: What and How

• What is a Data Management Plan?
  – A data management plan is a formal document that outlines what you will do with your data during your research and after you complete it.

• What is involved in developing one?
  – Developing a data management plan can be time-consuming, tedious, and daunting, but it is a very important step in ensuring that your research data is safe and sound for the present and future.
  – With the right process and framework it need not take long, and it can pay off enormously in the long run.

Page 9: Why data science matters and what we can do with it

• Topics in a data management plan include
  – Introduction and context
  – Data types, formats, standards and capture methods
  – Short-term storage and data management
  – Deposit and long-term preservation
  – Data sharing, access and re-use
  – Resourcing
  – Adherence and review

Page 10: Why data science matters and what we can do with it

• Resources and tools that help create DMPs:
  – DCC Data Management Plans: http://www.dcc.ac.uk/resources/data-management-plans
  – MIT Data Management and Publishing: http://libraries.mit.edu/data-management/
  – NSF Data Management Plan Requirements: http://www.nsf.gov/eng/general/dmp.jsp
  – DMPTool: https://dmptool.org
  – IEDA Data Management Plan Tool: http://www.iedadata.org/compliance/plan
  – DCC DMPOnline: https://dmponline.dcc.ac.uk

Page 11: Why data science matters and what we can do with it

Image from WWW

Page 12: Why data science matters and what we can do with it

Data Publication & Citation

• Data as first class products of research
  – NSF bio-sketches can include data publications

Image from j4h.net

Page 13: Why data science matters and what we can do with it

• Ways of data publication
  – Data as supplemental material of a paper
  – Standalone data
  – Data paper: data + descriptive ‘data paper’

(Strasser, 2014)

Examples:
• Standalone data journals: Nature Scientific Data, Geoscience Data Journal, Ecological Archives
• Journals that publish data papers: GigaScience, F1000 Research, Internet Archaeology

Page 14: Why data science matters and what we can do with it

What does a DCO data publication look like?

Page 15: Why data science matters and what we can do with it

• Data Citation Index
  – Indexes the world's leading data repositories
  – Records for the datasets are connected to related peer-reviewed literature indexed in the Web of Science™
  – Allows researchers to efficiently access data across subjects and regions

Page 16: Why data science matters and what we can do with it

Interoperability of Data

Page 17: Why data science matters and what we can do with it

A good example

• OneGeology

• Web-accessible geologic map data worldwide (scale ~1:1 million)

• Stimulate a rapid increase in interoperability (i.e. disseminate GeoSciML and vocabularies further and faster)

• 120 participating countries (July 2014)

Page 18: Why data science matters and what we can do with it

http://portal.onegeology.org

Page 19: Why data science matters and what we can do with it

Wyoming

Colorado

More challenges are still to be addressed

http://mrdata.usgs.gov/

Page 20: Why data science matters and what we can do with it

CGI Geoscience Terminology Workgroup

• Construct a collection of vocabularies for populating information interchange documents and enabling interoperability
• Provide labels for concepts, scope to various communities defined by language, science domain, or application domain

A list of recently finished vocabularies

• Earth Resource Form
• Environmental Impact Value
• Exploration Activity Type
• Exploration Result
• UNFC Value
• Earth Resource Expression
• Earth Resource Shape
• Enduse Potential
• Mineral Occurrence Type
• Mining Activity Type
• Processing Activity Type
• Mining Waste Type Value
• Commodity Code
• Mineral Deposit Group
• Mineral Deposit Type
• Product Value

Page 21: Why data science matters and what we can do with it

Another major effort...

And there is a vocabulary created by the CGI Geoscience Terminology Workgroup!

Page 22: Why data science matters and what we can do with it

Golden Spike information portal

http://geotime.tw.rpi.edu/

Golden spike - Global Boundary Stratotype Section and Point (GSSP)

Page 23: Why data science matters and what we can do with it

(Haq, 2007)

Still, challenges …

Page 24: Why data science matters and what we can do with it

(Ma et al., 2011)

Interoperable: “Data should be discoverable, accessible, decodable, understandable and usable, and data sharing should be legal and ethical for all participants.”

Page 25: Why data science matters and what we can do with it

• Interoperability does not mean that all data should be mediated or standardized.

• However, it is important that data archives are accompanied by detailed documentation, clarifying data provenance, data model, vocabularies used, etc.

(Ma et al., 2011)
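
One way to act on the point above is to check that each archive record carries the documentation the slide names: data provenance, data model, and vocabularies used. The following is only a minimal sketch under stated assumptions. The field names, the `missing_documentation` helper, and the example record are all hypothetical and do not come from the talk or from any particular repository's schema.

```python
# Documentation items named on the slide; the field names are hypothetical.
REQUIRED_DOCS = ("provenance", "data_model", "vocabularies")


def missing_documentation(record):
    """Return the required documentation fields that are absent or empty."""
    return [field for field in REQUIRED_DOCS if not record.get(field)]


# A made-up archive record, for illustration only.
archive_record = {
    "title": "Example geologic map archive",
    "provenance": "Compiled from field surveys (hypothetical)",
    "data_model": "GeoSciML",
    "vocabularies": [],  # left empty on purpose to show a gap
}

missing = missing_documentation(archive_record)
print(missing)
```

Here the check reports the empty `vocabularies` field. The data itself is not forced into a standard form; only its accompanying documentation is checked.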

Page 26: Why data science matters and what we can do with it

Provenance of Research

Page 27: Why data science matters and what we can do with it

Provenance capture

• Documenting provenance
  – Linking a range of observations and model outputs, research activities, people and organizations involved in the production of scientific findings with the supporting data sets and methods used to generate them.

Well-curated provenance information makes scientific workflows transparent and improves the credibility and trustworthiness of their outputs. It also facilitates informed and rational policy and decision-making.

Image from nature.com

(Ma et al., 2014)

Page 28: Why data science matters and what we can do with it

“Figure 1.2: Sea Level Rise: Past, Present, and Future” in the draft Third National Climate Assessment report of the USA (NCA3)

What is the provenance of this figure?

Page 29: Why data science matters and what we can do with it

• Detailed caption of that figure:
  – Estimated, observed and possible amounts of global sea level rise from 1800 to 2100. Proxy estimates (Kemp et al. 2012) (for example, based on sediment records) are shown in red (pink band shows uncertainty), tide gauge data in blue (Church and White 2011a), and satellite observations are shown in green (Nerem et al. 2010). The future scenarios range from 0.66 feet to 6.6 feet in 2100 (Parris et al. 2012). Higher or lower amounts of sea level rise are considered implausible, as represented by the gray shading. The orange line at right shows the currently projected range of sea level rise of 1 to 4 feet by 2100, which falls within the larger risk-based scenario range. The large projected range reflects uncertainty about how glaciers and ice sheets will react to the warming ocean, the warming atmosphere, and changing winds and currents. As seen in the observations, there are year-to-year variations in the trend. (Figure source: Josh Willis, NASA Jet Propulsion Laboratory)

As a case study, let’s trace the provenance of this paper.

Page 30: Why data science matters and what we can do with it

Provenance tracing of NASA contributions to Figure 1.2 in draft NCA3

Here only the details of Topex-Poseidon mission are shown

Here only the details of one paper (i.e., “paper/103”) cited by that figure are shown

(a) Instances of calibration, model and software underpinning “paper/103”

(b) Instances of sensor, instrument and platform underpinning that paper
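
The tracing shown on this slide can be thought of as walking a graph of "was derived from" links, starting at the figure and moving upstream. The sketch below illustrates that idea only; it is not the data model used at data.globalchange.gov. Apart from "paper/103", which appears on the slide, every identifier is a hypothetical placeholder, as are the `DERIVED_FROM` table and the `trace_provenance` helper.

```python
from collections import deque

# Hypothetical provenance links: each key was derived from the listed items.
# Only "paper/103" comes from the slide; all other identifiers are made up.
DERIVED_FROM = {
    "figure/example": ["paper/103"],
    "paper/103": [
        "calibration/example",
        "model/example",
        "software/example",
        "sensor/example",
        "instrument/example",
        "platform/example",
    ],
}


def trace_provenance(start, links):
    """Return everything upstream of `start`, breadth-first, without repeats."""
    seen = []
    queue = deque(links.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.append(node)
        queue.extend(links.get(node, []))
    return seen


upstream = trace_provenance("figure/example", DERIVED_FROM)
for item in upstream:
    print(item)
```

Starting from the figure, the walk reaches the paper first and then the calibration, model, software, sensor, instrument and platform entries that underpin it.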

Page 31: Why data science matters and what we can do with it
Page 32: Why data science matters and what we can do with it
Page 33: Why data science matters and what we can do with it

Page 34: Why data science matters and what we can do with it

http://data.globalchange.gov

Page 35: Why data science matters and what we can do with it

Era of Science 2.0

Practice

Page 36: Why data science matters and what we can do with it

• Science 2.0
  – New practices of scientists who post raw experimental results, nascent theories, claims of discovery and draft papers on the Web for others to see and comment on.
  – Proponents say these “open access” practices make scientific progress more collaborative and therefore more productive.
  – Critics say scientists who put preliminary findings online risk having others copy or exploit the work to gain credit or even patents.

(Waldrop, 2008)

Page 37: Why data science matters and what we can do with it

Page 38: Why data science matters and what we can do with it

• Social scholarship: Reconsidering scholarly practices in the age of social media
  – Polled 1,600 US and Canadian faculty members
  – Found that 15% use Twitter, 28% use YouTube and 39% use Facebook for scholarly activity

(Greenhow and Gleason, 2014)

Using social media more often would help scientists to disseminate their results, debate findings and engage a wider audience

Researchers must learn to create a robust online presence

Social-media metrics to be added to the tenure process

Page 39: Why data science matters and what we can do with it

• Altmetrics
  – A very broad group of metrics, capturing various parts of the impact a paper or other work can have.

(Lin and Fenner, 2013)

The ImpactStory Altmetrics Classifications

Page 40: Why data science matters and what we can do with it

• altmetric.com – already a product used by NPG, Springer, etc.

This Altmetric score means that the article is:
• in the 99th percentile (ranked 181st) of the 81,582 tracked articles of a similar age in all journals
• in the 93rd percentile (ranked 69th) of the 992 tracked articles of a similar age in Nature

http://www.nature.com/nature/journal/v497/n7449/nature12127/metrics
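
The two percentile figures above can be checked from the ranks and totals given on the slide. The slide does not say how altmetric.com computes percentiles, so the formula below is an assumption: the share of tracked articles ranked below this one, rounded down to a whole number. The `percentile_from_rank` helper is hypothetical.

```python
def percentile_from_rank(rank, total):
    """Whole-number percentile: share of tracked articles ranked below this one."""
    return (100 * (total - rank)) // total


# Ranks and totals as quoted on the slide.
all_journals = percentile_from_rank(181, 81_582)
same_journal = percentile_from_rank(69, 992)
print(all_journals, same_journal)
```

Under this assumed definition the results are 99 and 93, matching the slide.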

Page 41: Why data science matters and what we can do with it

Summary

make data count

Page 42: Why data science matters and what we can do with it

[email protected]

Thank you!