Transcript
Page 1: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Integrating Non-Reactive Legacy Code - The Case of R

Marek Kolodziej, Machine Learning Engineer

SF Scala Meetup, Sep. 10, 2014

Page 2: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Reactive Recap

Page 3: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Reactive Recap

Event-driven
- Asynchronous
- Non-blocking
- Optimized around Amdahl’s Law

Scalable
- Location transparency (up and out)
- Factor in unreliable network

Resilient
- Failure isolation (bulkhead pattern, etc.)
- Clean service and failure handling separation (supervision)

Responsive
- Minimize latency
- Deal with bursty traffic
- Gracefully handle congestion (backpressure/active pull by subscriber)
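For reference, Amdahl's Law bounds the speedup from parallelizing a job: if a fraction p of the work can be spread over n workers and the rest stays serial, the speedup is at most 1 / ((1 - p) + p / n). The following is a minimal Python sketch of that formula; the function name is made up for illustration, and Python is used only to keep the example short (the system described in this talk is Scala/Akka).

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when `parallel_fraction` of the work
    is spread over `workers` and the remainder stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # With 90% parallel work the speedup can never exceed 1 / 0.1 = 10x,
    # no matter how many workers are added.
    for n in (1, 4, 16, 1024):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

The serial fraction dominates quickly, which is why blocking, single-threaded pieces of a system deserve attention first.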

Page 5: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

The tough reality - Not everything’s under your control

Not everything’s an actor
- Legacy Java/Scala code
- Third-party libraries

Blocking calls
- Database queries
- Calls to services
- Non-threaded runtimes (R)

Long-running jobs
- Resource clean-up in case a network partition occurs way before the time-out is reached
- Timeouts vs. heartbeats

Not all failures are within the JVM
- Can we revive them from within the JVM?

Page 6: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Alpine’s R Operator


Page 7: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Alpine’s R Operator - The cases for and against R

For
- 5,000+ statistical and machine learning libraries
- “[Numeric] gold standard” implementations
- Operator would allow arbitrary processing in a “canned” application
- Data scientists already know the language
- Support for client’s existing code base (100s of scripts)
- Very rapid prototyping
- Focus on science instead of coding

Against
- Slow runtime (even with JIT)
- Memory hogging (by-copy semantics)
- Very slow garbage collection
- Single-threaded runtime (even worse than Python and Ruby)
- Native libraries written by people without much CS/engineering background (segfaults, etc.)
- Buggy libraries (infinite loops, etc.)
- Runtime crashes
- Terrible handling of big datasets

Page 8: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Challenges

Licensing issues
- R is GPL
- Rserve is (L)GPL
- Shipped software (GPL SaaS loophole doesn’t apply)

Distributed computing
- Need a cluster of R workers (multi-user, multi-operator concurrency given a single-threaded R runtime)
- REST is good for data but pretty bad for control (some structure would be nice)
- Sessions or backpressure

Fault tolerance
- R runtime failures
- Network partitions (R session clean-up)

Page 9: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Solutions

Licensing issues
- Akka is Apache 2.0
- Rserve is (L)GPL
- Can open-source the R-Java server bridge
- Communication to Alpine backend via (open-source) message case classes

Distributed computing
- Akka’s location transparency is ideal for distributing work
- Cluster API would have been preferred but Alpine uses Akka 2.2.3 due to Spark dependency
- Structure and semantics due to message case classes
- Rx streams would have been nice for backpressure, but we have an old Akka version (so sessions)

Fault tolerance
- Rserve forks R processes. Exception handling of the Connection object lets you restart processes.
- Akka’s heartbeat allows session clean-up in case of network failure before time-out (important if time-out is ~1 day).
- Event bus lets you observe failure to connect to remote actor system.
- No need for exactly-once semantics (the user can re-run the flow), but you have to know that the failure occurred.
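To make the heartbeat point concrete, here is a minimal Python sketch of why heartbeats allow clean-up long before a job time-out of about a day would fire. This is not Akka's failure detector or its API; `SessionMonitor`, its method names, and the 30-second threshold are all hypothetical, chosen only to illustrate the idea.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SessionMonitor:
    """Tracks the last heartbeat per session (hypothetical helper, not Akka's API)."""
    heartbeat_timeout: float                      # seconds of silence before a client is presumed gone
    clock: Callable[[], float] = time.monotonic   # injectable so the sketch can be tested
    _last_seen: Dict[str, float] = field(default_factory=dict, init=False)

    def heartbeat(self, session_id: str) -> None:
        """Record that the client owning `session_id` is still reachable."""
        self._last_seen[session_id] = self.clock()

    def reap_dead_sessions(self) -> List[str]:
        """Remove and return sessions whose client has gone silent."""
        now = self.clock()
        dead = [sid for sid, seen in self._last_seen.items()
                if now - seen > self.heartbeat_timeout]
        for sid in dead:
            # A real system would release the R session's resources here.
            del self._last_seen[sid]
        return dead


if __name__ == "__main__":
    fake_now = [0.0]
    monitor = SessionMonitor(heartbeat_timeout=30.0, clock=lambda: fake_now[0])
    monitor.heartbeat("job-a")
    monitor.heartbeat("job-b")
    fake_now[0] = 20.0
    monitor.heartbeat("job-a")       # job-a's client is still alive
    fake_now[0] = 45.0               # job-b's client has been silent for 45 s
    print(monitor.reap_dead_sessions())
```

The job time-out still exists as a last resort; the heartbeat only shortens the window in which an orphaned R session holds on to memory.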

Page 10: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Solutions

Sessions
- Arguably the ugliest part of the solution (can be replaced with alternatives)
- Worker actors blocked for long periods (hours).
- Large data blocks are sent to the Akka R server (~128 MB).
- No backpressure via Rx streams since it’s Akka 2.2.3.
- Custom router - refuses requests if all workers are busy.
- Client needs to respond to request refusal by awaiting a free worker message (reactive but inelegant).
- Better solution - use reactive streams (we need to upgrade Akka)
- Improvement: use Akka for control but REST for data movement
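The refuse-then-notify protocol can be sketched as follows. This is a plain Python illustration of the idea, not the Akka router from the talk; `RefusingRouter`, its methods, and the callback-based "free worker" notification are assumptions made for the example.

```python
from typing import Callable, List, Optional, Set


class RefusingRouter:
    """Hands out idle workers and refuses requests when none is free (hypothetical sketch)."""

    def __init__(self, worker_ids: List[str]) -> None:
        self._idle: List[str] = list(worker_ids)
        self._busy: Set[str] = set()
        # Callbacks of clients that were refused and are awaiting a free worker.
        self._waiting: List[Callable[[], None]] = []

    def request_worker(self, on_worker_free: Callable[[], None]) -> Optional[str]:
        """Return a worker id, or None (a refusal) after remembering whom to notify."""
        if not self._idle:
            self._waiting.append(on_worker_free)
            return None
        worker = self._idle.pop()
        self._busy.add(worker)
        return worker

    def release_worker(self, worker: str) -> None:
        """Mark a worker idle and tell the longest-waiting client to retry."""
        self._busy.remove(worker)
        self._idle.append(worker)
        if self._waiting:
            notify = self._waiting.pop(0)
            notify()


if __name__ == "__main__":
    router = RefusingRouter(["r-worker-1"])
    first = router.request_worker(lambda: None)
    notified: List[str] = []
    refused = router.request_worker(lambda: notified.append("retry"))
    print(first, refused)
    router.release_worker(first)
    print(notified, router.request_worker(lambda: None))
```

The client carries the burden of retrying after the notification, which is the inelegance noted above; a demand-driven stream would let the router signal capacity instead.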

Page 25: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Future Improvements

- Data movement via REST
- Replacement of sessions via reactive streams (Akka upgrade!)
- Kamon test drive for distributed actors (released ~2 weeks ago)

Page 26: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Conclusions

- Akka makes even non-reactive distributed programming easier and more reliable
- If you can, use the latest Akka version because a lot of the earlier pain can be avoided:
    - clustering
    - persistence
    - reactive streams
- Large data movement via Akka is probably not an ideal use of the framework:
    - use REST (including Spray, Play, etc.) and HTTP chunking
    - move the data directly using Netty, etc.
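The idea behind chunked transfer can be illustrated without any HTTP machinery: instead of shipping a block of roughly 128 MB as one message, the receiver pulls bounded pieces at its own pace. The Python generator below is only a sketch of that idea; `iter_chunks` is a made-up helper, not part of Spray, Play, or Netty.

```python
from typing import Iterator


def iter_chunks(payload: bytes, chunk_size: int) -> Iterator[bytes]:
    """Lazily yield `payload` in pieces of at most `chunk_size` bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    view = memoryview(payload)          # slice without copying the whole payload up front
    for start in range(0, len(view), chunk_size):
        yield bytes(view[start:start + chunk_size])


if __name__ == "__main__":
    data = bytes(range(10))
    for chunk in iter_chunks(data, 4):
        # The consumer decides when to ask for the next chunk.
        print(len(chunk))
```

Because the generator is lazy, only one chunk is materialized at a time, and a slow consumer naturally slows the producer.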

Page 27: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Thank You!

Page 28: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Miscellaneous

- Alpine is hiring
    - machine learning engineers (Scala/Java)
    - data scientists (R/Python)
    - front-end developers (Ruby on Rails)
- ScalaCourses.com is looking for reviewers:
    - Scala (beginner/intermediate)
    - Akka
    - Play
    - Java Interop.
    - contact Michael Slinn: [email protected]

