Why Should You Trust My Data?
building data infrastructure that accommodates networks of trust
Matt Zumwalt
datjawn.com | databindery.com
@flyingzumwalt
code{4}lib 2016
http://datjawn.com
http://databindery.com
-
I'm interested in trust.
-
I'm interested in trust.
particularly trust & trustworthiness
when people exchange data
-
there's a rhythm to the computing world
centralization decentralization
client-server peer-to-peer
-
mainframes
personal computers
server farms
[internet of everything]
the cloud
the PC revolution
computers
the diamond age
-
remember mainframes?
-
image credit wikipedia
https://en.wikipedia.org/wiki/UNIVAC#/media/File:UnivacII.jpg
-
the www
-
host data
reference each other
-
but data
-
image credit Torkild Retvedt
https://www.flickr.com/photos/torkildr/3462606643
-
$$
$$
$$
$
-
By 2019, the data created by IoE devices alone will be 49 times higher than all the traffic that moved through datacenters in 2014.
it won't scale.
Reference: Cisco Global Cloud Index
http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html
-
Worldwide Storage Capacity in 2012: 2.5 zettabytes
Total Data Center Traffic in 2016: 10.4 zettabytes per year
Anticipated data created by Internet of Everything (IoE) devices in 2019:
507.5 zettabytes per year
References: NetApp, Cisco Global Cloud Index, Gigaom, Washington Post
http://siliconangle.com/blog/2012/05/21/when-will-the-world-reach-8-zetabytes-of-stored-data-infographic/
http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html
https://gigaom.com/2012/05/30/heres-what-our-web-addiction-looks-like-in-2016/
https://www.washingtonpost.com/blogs/ezra-klein/post/how-big-can-the-internet-get/2012/05/30/gJQAu9OH2U_blog.html
-
distributed data web
"You can't propose that something be a universal space and at the same time keep control of it." - Tim Berners-Lee
http://webfoundation.org/about/vision/history-of-the-web/
-
this relies on trust
-
elements of trustworthiness
authority & reputation
integrity & provenance
synergy or compatibility
consistency
etc.
-
we've got this
Organisms have been solving these problems for eons
Humans for millennia
Librarians for centuries
Software developers for decades
-
git for (tabular) data
transparency & reproducibility
http://datjawn.com builds from the work of http://dat-data.com
Tabular: rows & columns (i.e. spreadsheets, CSV, SQL DBs)
-
history has branches
-
initial commit
a set of changes
commit those changes and describe them
Who made the changes? Why did they make them?
When did they commit them?
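
A minimal sketch of what such a commit could look like for tabular data. It is not how dat or dat jawn store commits; the `make_commit` helper, the field names, and the sample rows are all assumptions made for illustration. Each commit records who, why, and when alongside the rows, and its id is a SHA-256 hash of all of that, so the id changes if any part of the record changes.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_commit(rows, author, message, parent=None, when=None):
    """Build a commit record (hypothetical layout) for a small table.

    The id is the SHA-256 of the record itself, so it covers the rows,
    the author, the message, the timestamp, and the parent commit id.
    """
    when = when or datetime.now(timezone.utc).isoformat()
    body = {
        "parent": parent,    # id of the previous commit, None for the first
        "author": author,    # who made the changes
        "message": message,  # why they made them
        "when": when,        # when they committed them
        "rows": rows,        # the tabular data at this version
    }
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return {"id": hashlib.sha256(encoded).hexdigest(), **body}


first = make_commit(
    [["title", "year"], ["Moby-Dick", 1851]],
    author="alice",
    message="initial commit",
    when="2016-03-08T09:00:00+00:00",
)
second = make_commit(
    [["title", "year"], ["Moby-Dick", 1851], ["Walden", 1854]],
    author="bob",
    message="add Walden",
    parent=first["id"],
    when="2016-03-08T10:00:00+00:00",
)
```

Because `second` names `first` as its parent, following the `parent` ids walks the history backwards one commit at a time.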
-
more changes
commit those changes
-
different changes committed to a different branch
-
other changes on another branch
-
merge two branches
-
get a specific version
prove it's identical
know who made it
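
One way to sketch these three steps, assuming versions are stored under the hash of their content. The `store` dictionary, `commit_id`, and `verify` are hypothetical names, not part of any real tool. Note that a hash only proves the content is unchanged; the `author` field is a claim, and a real system would need something like a digital signature to prove who made it.

```python
import hashlib
import json


def commit_id(body):
    """Hash a version record; the same content always yields the same id."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def verify(store, wanted_id):
    """Get a specific version and check that it still hashes to its id."""
    body = store[wanted_id]
    if commit_id(body) != wanted_id:
        raise ValueError("content does not match its id")
    return body


body = {
    "author": "alice",
    "message": "initial commit",
    "rows": [["title"], ["Walden"]],
}
version = commit_id(body)
store = {version: body}

got = verify(store, version)  # get a specific version, prove it's identical
claimed_author = got["author"]  # who it says made it (a claim, not a proof)

# A copy whose rows were altered after the fact fails the check.
tampered = {version: {**body, "rows": [["title"], ["Walden Two"]]}}
try:
    verify(tampered, version)
    detected = False
except ValueError:
    detected = True
```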
-
Files are data. They have histories.
Metadata are data. They have histories too.
Whatever the data, the same patterns apply.
-
How does this get replicated?
-
client-server approach
-
peer-to-peer approach
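
A rough sketch of why peer-to-peer replication can work without trusting the peer, assuming every block of data is named by the hash of its bytes. This is an illustration only, not the replication protocol that dat uses; `content_id`, `pull`, and the sample peers are hypothetical.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Name a block of data by the SHA-256 hash of its bytes."""
    return hashlib.sha256(data).hexdigest()


def pull(local, remote):
    """Copy the blocks we lack from another peer.

    A block is kept only if its bytes hash to the id the peer claims,
    so it does not matter which peer the bytes came from.
    Returns the ids that were rejected.
    """
    rejected = []
    for block_id, data in remote.items():
        if block_id in local:
            continue  # we already hold this block
        if content_id(data) == block_id:
            local[block_id] = data
        else:
            rejected.append(block_id)
    return rejected


a_block = b"title,year\nWalden,1854\n"
b_block = b"title,year\nMoby-Dick,1851\n"

peer_a = {content_id(a_block): a_block}
# peer_b holds one honest block and one block filed under a false id.
peer_b = {content_id(b_block): b_block, "deadbeef": b"forged"}

rejected = pull(peer_a, peer_b)
```

After the pull, `peer_a` holds both honest blocks and the forged one is listed in `rejected`.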
-
the tide has already shifted
-
Stop building server-side applications.
Assume that data are anywhere and/or everywhere.
Assume that your software will be run in many places.
Erase your distinctions between server and client.
Let data grow branches - build trees (i.e. Merkle DAGs)
Stop thinking of data as singular.
Stop thinking of datasets as monolithic.
Embrace redundancy & replication.
Understand that trustworthiness and authority are dynamic.
Broaden your sense of now.
Appreciate provenance.
there are no servers
there is only the web
-
Meet the dat jawn team on Wednesday
Matt Zumwalt
datjawn.com | databindery.com
@flyingzumwalt
code{4}lib 2016
http://datjawn.com
http://databindery.com