dat 5 minute lightning talk
Dat: version and share your data
Karissa McKelvey
Software Developer and Project Manager and Science Evangelist and Designer (I wear a lot of hats)
U.S. Open Data
@karissamck
karissa $ ~
dat is a non-profit
Reproducible Research
“A rule of thumb … is that half of published research cannot be replicated”
How do we replicate research today?
How do we collaborate on data analysis today?
How do we collaborate today?
????????
How do we replicate research today?
me@home $ dat push
me@campus $ dat pull
you@work $ dat clone
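One way to picture the push, pull, and clone steps above is as copies of an append-only change log that catch up with each other. The sketch below is a toy model in Python, not dat's actual replication protocol: the `Replica` class, its method names, the `hub` stand-in for a shared server, and the sample changes are all hypothetical, and it assumes histories never diverge.

```python
class Replica:
    """Toy copy of a dataset: an append-only list of changes (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.log = []  # changes in commit order, oldest first

    def commit(self, change):
        """Record a new local change."""
        self.log.append(change)

    def pull(self, remote):
        """Fetch changes the remote has and we lack; return how many arrived."""
        common = min(len(self.log), len(remote.log))
        if remote.log[:common] != self.log[:common]:
            raise ValueError("histories diverged; this sketch does not merge")
        missing = remote.log[len(self.log):]  # empty if the remote is behind us
        self.log.extend(missing)
        return len(missing)

    def push(self, remote):
        """Send our new changes to the remote (a pull seen from the other side)."""
        return remote.pull(self)

    @classmethod
    def clone(cls, name, remote):
        """Create a fresh replica holding the remote's full history."""
        replica = cls(name)
        replica.pull(remote)
        return replica


hub = Replica("hub")  # hypothetical shared server
home = Replica("me@home")
home.commit("add rows cities.csv")
home.commit("add files city_model.gz")

pushed = home.push(hub)                 # me@home pushes
campus = Replica("me@campus")
pulled = campus.pull(hub)               # me@campus pulls
work = Replica.clone("you@work", hub)   # you@work clones
```

After these calls all three replicas hold the same two changes, and a second pull from the hub transfers nothing because only missing changes are copied.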
dat workflow
• import
• version
• publish
• replicate
[Diagram: you and your data (.csv files and other data), followed by the import step]
$ dat init
$ dat add dataset cities
$ dat add rows cities cities.csv
$ dat add files cities city_model.gz
import
$ dat listen
$ dat clone
Versioning
$ dat add files cities us_cities_viz.png
This will override us_cities_viz.png at c2342. OK?
$ dat cities add rows updated_data.csv
This will update 3,434,245 rows. OK?
$ dat push
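The confirmation prompts above suggest a two-step pattern: first count what a change would touch, then apply it while keeping the old versions. The following is a minimal Python sketch of that pattern, not dat's storage format or API. The `VersionedTable` class, its methods, the five-character version id, and the sample city figures are all made up for illustration.

```python
import csv
import hashlib
import io
import json


class VersionedTable:
    """Toy table that keeps every version of every row (hypothetical model)."""

    def __init__(self, key_field):
        self.key_field = key_field
        self.history = {}  # key -> list of row dicts, oldest first

    def preview(self, rows):
        """Count rows that would be added or updated, without applying them."""
        count = 0
        for row in rows:
            versions = self.history.get(row[self.key_field])
            if not versions or versions[-1] != row:
                count += 1
        return count

    def add_rows(self, rows):
        """Apply new or changed rows; return a short id derived from the changes."""
        changed = []
        for row in rows:
            versions = self.history.setdefault(row[self.key_field], [])
            if not versions or versions[-1] != row:
                versions.append(dict(row))  # old versions stay in the list
                changed.append(dict(row))
        payload = json.dumps(changed, sort_keys=True).encode("utf-8")
        return hashlib.sha1(payload).hexdigest()[:5]

    def latest(self, key):
        """Return the most recent version of the row with this key."""
        return self.history[key][-1]


def read_rows(text):
    """Parse CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


# Invented sample data, standing in for cities.csv and updated_data.csv.
first = read_rows("city,population\nOakland,400000\nBloomington,80000\n")
update = read_rows(
    "city,population\nOakland,420000\nBloomington,80000\nBerkeley,120000\n"
)

table = VersionedTable("city")
v1 = table.add_rows(first)
pending = table.preview(update)  # Oakland changed and Berkeley is new
v2 = table.add_rows(update)
```

Here `preview` plays the role of the "This will update ... rows. OK?" question: it reports two affected rows before anything is written, and the earlier Oakland row remains available in `table.history` after the update.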
http://my-data.bids.edu
publish
[Diagram: .csv files and other data]
http://my-data.indiana.edu
[Diagram: .csv, .png, .R, and .py files]
INTEROPERABILITY in Python and R
ECOSYSTEM
Datscript
• Goal: manipulate datasets with scripting
• Supported keywords: run, pipe, map, reduce, fork, keyword
• Bash-like
• Platform-independent
• Uses node.js streams (fast!)
Datscript: pipeline example
[Figure. Top: Datscript “pipe” command. Bottom: equivalent command in bash]
Datscript: example commands
background - executes command, but doesn’t wait for it to finish
map - pipes first argument into rest of arguments
run - a serial command (executes and finishes command)
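The difference between run, background, and pipe can be illustrated outside Datscript. The sketch below uses Python's standard `subprocess` module to mimic the three behaviours described above. It is not Datscript syntax or its implementation: the helper names `run`, `background`, and `pipe` and the two small child commands are assumptions chosen for the demo.

```python
import subprocess
import sys

PY = sys.executable  # reuse the current interpreter so the demo runs on any platform


def run(args):
    """Serial: start the command and wait until it finishes."""
    done = subprocess.run(args, capture_output=True, text=True, check=True)
    return done.stdout


def background(args):
    """Start the command and return immediately without waiting for it."""
    return subprocess.Popen(args, stdout=subprocess.PIPE, text=True)


def pipe(first, second):
    """Feed the stdout of `first` into the stdin of `second`, like `first | second`."""
    producer = subprocess.Popen(first, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(
        second, stdin=producer.stdout, stdout=subprocess.PIPE, text=True
    )
    producer.stdout.close()  # the consumer now holds the only read end of the pipe
    output, _ = consumer.communicate()
    producer.wait()
    return output


# Two tiny child commands used only for this demo.
emit = [PY, "-c", "print('oakland'); print('berkeley')"]
upper = [PY, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]

serial_output = run(emit)                 # returns only after the command exits
job = background(emit)                    # returns at once with a process handle
background_output, _ = job.communicate()  # collect the result later
piped_output = pipe(emit, upper)          # first command's output feeds the second
```

The `run` helper returns only once the child has exited, while `background` hands back a process handle straight away so the caller decides when, or whether, to wait.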
Karissa McKelvey - @karissamck
Melanie Cebula - @melaniecebula
http://dat-data.com