Demystifying data science with an introduction to machine learning

Post on 15-Jan-2015

Category:

Internet

DESCRIPTION

Demystifying data science is the slide deck that accompanies @brightsparc's presentation to SEEK.

TRANSCRIPT

Demystifying Data Science

with an Intro to Machine Learning

Data science is everywhere

Sexiest job of the 21st century*

McKinsey Global Institute report estimates that by 2018, “the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions”

Source: Harvard Business Review, Oct 2012

 

So what is Data Science?

Source: Hilary Mason, ex-chief data scientist at bit.ly

Who are these unicorns?

A bit about me

@brightsparc

I thought it was all about stats?

It’s a broader skillset

Source: http://blogs.wsj.com/cio/2014/02/14/it-takes-teams-to-solve-the-data-scientist-shortage/

Data science pipeline

Source: http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext

Where does Kaggle fit in?

   

Degree breakdown in top 100

Areas of study

What’s the deal with big data?

Apache Hadoop Ecosystem

It’s like MapReduce, you know
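For readers who have not met MapReduce, the idea can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce phases on a single machine, not Hadoop code; the function names and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = word_count(["big data", "Big deal"])
```

In a real cluster the map and reduce calls would run on many machines in parallel, and the framework would handle the shuffle step between them.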

So what about machine learning?

A pioneer in machine learning, who created a checkers program that played against itself

“Give machines the ability to learn without explicitly programming them.” Arthur L. Samuel (1959)

Types of algorithms

Some examples

Machine learning process

Build a model

Underfit

Overfit

Linear Regression: solve for values of θ in the hypothesis function hθ(x)

Gradient descent algorithm

Minimize the cost function, which is ½ of the average squared error of the prediction vs. the training data.
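The linear regression slides can be made concrete with a small sketch. It assumes a single input feature, so that hθ(x) = θ0 + θ1·x, and uses made-up training data; the learning rate `alpha`, the iteration count, and all function names are illustrative choices, not values from the talk.

```python
def hypothesis(theta, x):
    """h_theta(x) = theta0 + theta1 * x for a single feature x."""
    return theta[0] + theta[1] * x


def cost(theta, xs, ys):
    """Half of the average squared prediction error on the training data."""
    m = len(xs)
    return sum((hypothesis(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Repeatedly step theta against the gradient of the cost function."""
    theta = [0.0, 0.0]
    m = len(xs)
    for _ in range(iterations):
        errors = [hypothesis(theta, x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                               # d(cost)/d(theta0)
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m    # d(cost)/d(theta1)
        # Update both parameters simultaneously.
        theta = [theta[0] - alpha * grad0, theta[1] - alpha * grad1]
    return theta


# Made-up training data generated from y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
theta = gradient_descent(xs, ys)
```

On this toy data the fitted parameters approach θ0 = 1 and θ1 = 2. If `alpha` is set too large the updates can diverge, which is why the learning rate usually needs some tuning.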

Demo: House prices

Cross validation – split the data into training and test sets
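A minimal sketch of the splitting step, using only the Python standard library. The helper names, the 30% test fraction, and the fixed seed are assumptions for illustration; libraries such as scikit-learn ship their own, more complete versions.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows and hold out a fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) so each row is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


train, test = train_test_split(range(10))
splits = list(k_fold_indices(10, 5))
```

The model is fitted on `train` only and scored on `test`, so the score reflects data the model has not seen. With k folds, that fit-and-score cycle is repeated k times and the scores are averaged.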

Supervised learning model

Recommender systems

Collaborative filtering – predict ratings for similar items given other users’ behavior

Collaborative filtering method

Source: http://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf
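One common form of collaborative filtering is user-based: a missing rating is predicted as a similarity-weighted average of the ratings other users gave that item. The sketch below is an assumption-laden illustration, not the recommenderlab method; the similarity measure (1 / (1 + Manhattan distance) over shared items), the function names, and the ratings are all hypothetical.

```python
def similarity(a, b):
    """Hypothetical similarity: 1 / (1 + Manhattan distance over shared items)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    distance = sum(abs(a[item] - b[item]) for item in shared)
    return 1.0 / (1.0 + distance)


def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        weight = similarity(ratings[user], other_ratings)
        weighted_sum += weight * other_ratings[item]
        weight_total += weight
    # No usable neighbours means no prediction can be made.
    return weighted_sum / weight_total if weight_total else None


# Made-up ratings on a 1-5 scale.
ratings = {
    "ann": {"a": 5, "b": 3},
    "bob": {"a": 5, "b": 3, "c": 4},
    "cat": {"a": 1, "b": 5, "c": 2},
}
prediction = predict_rating(ratings, "ann", "c")
```

Here "bob" rates the shared items exactly as "ann" does, so his rating of "c" dominates the prediction, while "cat" contributes only a small weight.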

Similar users based on distance

Manhattan distance

Euclidean distance
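Both distances are short to write down. The sketch assumes each user is represented as a list of ratings for the same items in the same order; the sample ratings are made up.

```python
import math


def manhattan(a, b):
    """Sum of absolute differences between corresponding ratings."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two users' made-up ratings for the same two items.
user_a = [1, 5]
user_b = [4, 1]
d_manhattan = manhattan(user_a, user_b)
d_euclidean = euclidean(user_a, user_b)
```

A smaller distance means more similar users. Euclidean distance penalises one large disagreement more heavily than several small ones, whereas Manhattan distance treats them the same.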

Demo: Music recommender system

Pearson Correlation Coefficient
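The Pearson correlation coefficient measures how closely two users' ratings move together, regardless of whether one user rates consistently higher than the other. A minimal sketch, assuming two equal-length lists of ratings for the same items; returning 0.0 when one user's ratings never vary is a convention chosen here, not part of the definition.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for constant ratings; 0.0 is an assumed fallback.
        return 0.0
    return covariance / math.sqrt(var_x * var_y)


agree = pearson([1, 2, 3], [2, 4, 6])
disagree = pearson([1, 2, 3], [6, 4, 2])
```

Values near 1 indicate users whose tastes agree, values near -1 indicate opposite tastes, and values near 0 indicate no linear relationship.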

Visualization frameworks

Tableau

D3.js

Processing

Raphaël.js

What about online experimentation?

What will the future look like?

• Online collaboration

• Open Data

Next gen distributed computing

100x faster in memory, and 10x faster even when running on disk.

Deep learning, a new frontier?

Geoffrey Hinton @Google

How can I get started?

• MOOCs
– Coursera Machine Learning (Andrew Ng - Stanford)
– Learning from Data (Abu-Mostafa - Caltech)

• Other references
– Collective Intelligence
– Mining of Massive Datasets
– Open-Source Data Science Masters

• Frameworks
– Python – scikit-learn
– Java – WEKA and Cascading

Questions
