Post on 15-Apr-2017


Doing data science with Clojure

@sbelak simon@goopti.com

Design constraints

The analytics chasm

• < 2min: ideal. Almost real-time, can be done during brainstorming without disrupting flow

• < 20min: squeeze in somewhere in the day

• fail

• project: roadmap ahoy!

Think in distributions, not numbers
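
A minimal sketch of what this can look like in plain Clojure: report a handful of quantiles alongside the mean instead of a single number. The helper names (`quantile`, `distribution-summary`) and the sample data are made up for illustration and are not taken from the talk or from any library.

```clojure
(defn quantile
  "Nearest-rank quantile of a non-empty collection of numbers, q in [0, 1]."
  [q xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        ;; nearest-rank index, clamped to the valid range
        idx    (-> (* q n) Math/ceil long dec (max 0) (min (dec n)))]
    (nth sorted idx)))

(defn distribution-summary
  "Summarises a non-empty collection of numbers as a small map."
  [xs]
  {:n      (count xs)
   :mean   (double (/ (reduce + xs) (count xs)))
   :min    (apply min xs)
   :p25    (quantile 0.25 xs)
   :median (quantile 0.5 xs)
   :p75    (quantile 0.75 xs)
   :max    (apply max xs)})

;; Hypothetical order values: one outlier drags the mean far from the median.
(def order-values [12 15 14 13 16 15 14 250])

(distribution-summary order-values)
;; mean is 43.625, median is 14
```

The mean alone suggests a typical order of about 44, while the quartiles show that almost every order sits between 13 and 15.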

No throwaways

Sharing results

• Have one canonical version that is always current.

• Concentrate discussion in one place and make it searchable and persistent.

• Include methodology (=code).

The environment

REPL vs. notebook

REPL vs. notebook+

(hacked) gorilla-repl.org + auto-refresh + hypothes.is

• #alderaan #sales #growth: discoverability

• Code hidden, but can be expanded

• Questions, comments, & annotations

• Shareable

• Periodically re-run to keep it fresh

Wishlist/TODO

• Better editor (shaunlebron.github.io/parinfer/ ?)

• Embedded REPL

• Better exception reporting

• Browsable data structures

(tried and miserably failed: org-babel)

The tools

Data frame

• Data tends to be heterogeneous

• Clojure excels in structure manipulation/encoding

github.com/sbelak/huri

• No data structures, just functions over collections

• Composable (even DSLs — no macros!)

• Reasonably fast (transducers <3)

• Do-what-I-mean (auto-sort, liberal with inputs, …)

• Minimal buy-in

• Support reaching into nested structures everywhere
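
The design points above can be sketched with nothing but clojure.core. This is not Huri's API: `column-fn`, `keep-where` and `sum-by` are hypothetical helpers defined here, and the data is invented. The sketch only shows the shape of the idea: a plain vector of maps, column keys that may be paths into nested maps, and a transducer doing the aggregation in one pass.

```clojure
;; Hypothetical data: a plain vector of maps, with nested values.
(def rides
  [{:id 1 :route {:from "LJU" :to "VCE"} :price 29.0 :passengers 2}
   {:id 2 :route {:from "LJU" :to "TRS"} :price 15.0 :passengers 1}
   {:id 3 :route {:from "ZAG" :to "VCE"} :price 35.0 :passengers 3}
   {:id 4 :route {:from "LJU" :to "VCE"} :price 31.0 :passengers 1}])

(defn column-fn
  "Returns a function that reads a column from a row. The column key is either
  a plain key or a path (vector of keys) reaching into nested maps."
  [k]
  (if (vector? k)
    #(get-in % k)
    #(get % k)))

(defn keep-where
  "Keeps rows whose column k satisfies pred. Rows come last, so it threads
  with ->> and works with partial."
  [k pred rows]
  (filterv (comp pred (column-fn k)) rows))

(defn sum-by
  "Sums column value-k per distinct value of column group-k in a single
  transducer pass."
  [group-k value-k rows]
  (transduce
    (map (juxt (column-fn group-k) (column-fn value-k)))
    (completing (fn [acc [g v]] (update acc g (fnil + 0) v)))
    {}
    rows))

(->> rides
     (keep-where [:route :to] #{"VCE"})
     (sum-by [:route :from] :price))
;; => {"LJU" 60.0, "ZAG" 35.0}
```

Because the rows stay ordinary maps in an ordinary vector, any other Clojure function or library can consume the result without conversion.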

• composable data structure based DSLs

• vanilla vector of maps: interoperability

Composability is key to quick iterating

• Provide curried versions where possible

• ->> and partial friendly

• encode computation in structure (comp, some-fn, every-pred, data structure based DSLs, …)

• consistent API
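
A small illustration of these points using only clojure.core (`comp`, `complement`, `some-fn`, `every-pred`, `partial`). The data and the names (`big?`, `italian?`, `kept?`, `interesting?`, `only-interesting`, `total-revenue`) are made up for this sketch.

```clojure
;; Hypothetical data.
(def orders
  [{:customer "a" :total 120 :country "SI" :refunded? false}
   {:customer "b" :total 15  :country "IT" :refunded? false}
   {:customer "c" :total 300 :country "AT" :refunded? true}
   {:customer "d" :total 80  :country "IT" :refunded? false}])

;; Predicates assembled from keywords, sets and combinators
;; instead of hand-written functions.
(def big?     (comp #(>= % 100) :total))
(def italian? (comp #{"IT"} :country))
(def kept?    (complement :refunded?))

;; "not refunded AND (big OR Italian)", encoded as structure.
(def interesting? (every-pred kept? (some-fn big? italian?)))

;; partial yields curried steps that slot straight into ->>.
(def only-interesting (partial filter interesting?))
(def total-revenue    (partial transduce (map :total) +))

(->> orders only-interesting total-revenue)
;; => 215
```

Each step is a value that can be named, reused and recombined, which is what keeps iteration quick: changing the question usually means swapping one small piece.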

Catching errors early ⇒ more context ⇒ easier debugging ⇒ faster iterating
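
One way to read this in code, as a sketch rather than anything shown in the talk: validate inputs where the data enters a function and attach the offending row to the exception, instead of letting a bad value surface several steps later. `mean-of` is a hypothetical helper.

```clojure
(defn mean-of
  "Mean of column k (a keyword) over a non-empty collection of rows.
  Fails immediately, with the offending row attached, when a value is not a
  number."
  [k rows]
  {:pre [(seq rows)]}
  (when-let [bad (first (remove (comp number? k) rows))]
    (throw (ex-info "Non-numeric value in column"
                    {:column k :row bad})))
  (/ (reduce + (map k rows)) (count rows)))

(mean-of :x [{:x 1} {:x 3}])
;; => 2

;; (mean-of :x [{:x 1} {:y 1}]) throws an ExceptionInfo whose ex-data is
;; {:column :x, :row {:y 1}}
```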

<3 Bret Victor

Q: What about machine learning?

A: farm it out to sklearn

huri.plot

• DSL on top of ggplot2 (via gg4clj)

• Targets Gorilla REPL

• Follows the rest of Huri’s design philosophy

• bar chart, scatter plot, line chart, box & violin plot, heatmap, histogram

Wishlist/TODO

• (even) better structure manipulation (via Specter?)

• Interactive plots

• More transducer-compatible (online) math functions

• Optimizing ->> (rewrite code on the fly to do more with transducer composition)
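
To make the "transducer-compatible (online) math functions" item concrete, here is a minimal sketch of a running mean written as a reducing function, so it can be used directly with `transduce`. `mean-rf` is a hypothetical name and this is not code from Huri.

```clojure
(defn mean-rf
  "Reducing function computing an online (single-pass) mean.
  0-arity: initial state [count mean]
  2-arity: fold one value into the state
  1-arity: extract the result (nil for empty input)"
  ([] [0 0.0])
  ([[n mean]] (when (pos? n) mean))
  ([[n mean] x]
   (let [n' (inc n)]
     ;; incremental update: no need to keep the values or a running sum
     [n' (+ mean (/ (- x mean) n'))])))

(transduce (comp (map :price) (filter some?))
           mean-rf
           [{:price 10} {:price nil} {:price 20} {:price 30}])
;; => 20.0
```

Since the statistic is just a reducing function, the same transformation pipeline (`map`, `filter`, ...) can feed it without building intermediate collections.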

Projects worth keeping an eye on

github.com/thi-ng/geom

github.com/yieldbot/vizard

zeppelin-project.org

github.com/aphyr/tesser

github.com/nathanmarz/specter

Questions

@sbelak

github.com/sbelak/huri
