Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Integrating Non-Reactive Legacy Code - The Case of R Marek Kolodziej Machine Learning Engineer SF Scala Meetup, Sep. 10, 2014

Upload: alpinedatalabs

Post on 29-Nov-2014

Category: Engineering


DESCRIPTION

Reactive programming is a phenomenal idea, but it's not always achievable "all the way down" in practice. In the real world, one rarely writes entire platforms from scratch and even then, one often needs to integrate with third-party applications that are blocking, stateful, and seem to violate nearly every reactive principle. In my talk, I will explain how Akka is still ideally suited to handle the integration of such systems into both reactive and non-reactive JVM code. To illustrate the above claims, I will talk about Alpine Data Labs' JVM-R integration. Calls to the R language runtime to perform a data science computation are blocking given the constraints of R itself. Sessions have to be maintained since many messages have to be sent per R session (populating the R heap with DTOs, sending the script to be executed, etc.), and each actor can hold a TCP connection to a single R runtime. R is very prone to failure, be it due to poor memory management, dynamically typed, buggy user code, segmentation faults in native R packages, etc. I will show how Akka can handle all of these problems in a graceful manner to help integrate a faulty, non-engineering grade technology like R into a JVM enterprise application.

TRANSCRIPT

Page 1: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Integrating Non-Reactive Legacy Code - The Case of R

Marek Kolodziej, Machine Learning Engineer

SF Scala Meetup, Sep. 10, 2014

Page 2: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Reactive Recap

Page 3: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Reactive Recap

Event-driven
- Asynchronous
- Non-blocking
- Optimized around Amdahl’s Law

Scalable
- Location transparency (up and out)
- Factor in unreliable network

Resilient
- Failure isolation (bulkhead pattern, etc.)
- Clean service and failure handling separation (supervision)

Responsive
- Minimize latency
- Deal with bursty traffic
- Gracefully handle congestion (backpressure/active pull by subscriber)

Page 4: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator
Page 5: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

The tough reality: Not everything’s under your control

Not everything’s an actor
- Legacy Java/Scala code
- Third-party libraries

Blocking calls
- Database queries
- Calls to services
- Non-threaded runtimes (R)

Long-running jobs
- Resource clean-up in case a network partition occurs way before the time-out is reached
- Timeouts vs. heartbeats

Not all failures are within the JVM
- Can we revive them from within the JVM?

Page 6: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Alpine’s R Operator

Page 7: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Alpine’s R Operator: The cases for and against R

For
- 5,000+ statistical and machine learning libraries
- “[Numeric] gold standard” implementations
- Operator would allow arbitrary processing in a “canned” application
- Data scientists already know the language
- Support for client’s existing code base (100s of scripts)
- Very rapid prototyping
- Focus on science instead of coding

Against
- Slow runtime (even with JIT)
- Memory hogging (by-copy semantics)
- Very slow garbage collection
- Single-threaded runtime (even worse than Python and Ruby)
- Native libraries written by people without much CS/engineering background (segfaults, etc.)
- Buggy libraries (infinite loops, etc.)
- Runtime crashes
- Terrible handling of big datasets

Page 8: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Challenges

Licensing Issues
- R is GPL
- Rserve is (L)GPL
- Shipped software (GPL SaaS loophole doesn’t apply)

Distributed computing
- Need a cluster of R workers (multi-user, multi-operator concurrency given a single-threaded R runtime)
- REST is good for data but pretty bad for control (some structure would be nice)
- Sessions or backpressure

Fault tolerance
- R runtime failures
- Network partitions (R session clean-up)

Page 9: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Solutions

Licensing Issues
- Akka is Apache 2.0
- Rserve is (L)GPL
- Can open-source the R-Java server bridge
- Communication to Alpine backend via (open-source) message case classes

Distributed computing
- Akka’s location transparency is ideal for distributing work
- Cluster API would have been preferred but Alpine uses Akka 2.2.3 due to Spark dependency
- Structure and semantics due to message case classes
- Rx streams would have been nice for backpressure, but we have an old Akka version (so sessions)

Fault tolerance
- Rserve forks R processes. Exception handling of the Connection object lets you restart processes.
- Akka’s heartbeat allows session clean-up in case of network failure before time-out (important if time-out is ~1 day).
- Event bus lets you observe failure to connect to remote actor system.
- No need for exactly-once semantics (the user can re-run the flow), but you have to know that the failure occurred.
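The heartbeat point in the fault-tolerance list can be illustrated outside Akka. What follows is a minimal plain-Java sketch, not the code from the talk and not Akka's own failure detector. The names `SessionMonitor`, `heartbeat`, and `reapDeadSessions`, and the idea of passing the clock in as a parameter, are assumptions made for illustration. It shows why tracking the last heartbeat lets the server release an R session shortly after its client disappears, even when the job time-out is about a day.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: remembers when each R session's client was last heard
// from, so dead clients are detected long before a day-long job time-out.
class SessionMonitor {
    private final long maxSilenceMillis;                  // allowed gap between heartbeats
    private final Map<String, Long> lastSeen = new HashMap<>();

    SessionMonitor(long maxSilenceMillis) {
        this.maxSilenceMillis = maxSilenceMillis;
    }

    // Called when a session opens and on every heartbeat from its client.
    void heartbeat(String sessionId, long nowMillis) {
        lastSeen.put(sessionId, nowMillis);
    }

    // Removes and returns every session whose client has been silent too long.
    // A real server would then close the R connection held by each of them.
    List<String> reapDeadSessions(long nowMillis) {
        List<String> dead = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastSeen.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > maxSilenceMillis) {
                dead.add(entry.getKey());
                it.remove();
            }
        }
        return dead;
    }

    boolean isTracked(String sessionId) {
        return lastSeen.containsKey(sessionId);
    }
}
```

With an assumed silence threshold of 30 seconds, a session whose client stops sending heartbeats is reaped on the next sweep, while a session that keeps reporting in stays alive for as long as its job runs. The time-out remains useful as an upper bound on job length; the heartbeat only answers whether the other side is still there.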

Page 10: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Solutions

Sessions
- Arguably the ugliest part of the solution (can be replaced with alternatives)
- Worker actors blocked for long periods (hours).
- Large data blocks are sent to the Akka R server (~128 MB).
- No backpressure via Rx streams since it’s Akka 2.2.3.
- Custom router - refuses requests if all workers are busy.
- Client needs to respond to request refusal by awaiting a free worker message (reactive but inelegant).
- Better solution - use reactive streams (we need to upgrade Akka)
- Improvement: use Akka for control but REST for data movement
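The refuse-then-notify protocol described on the Sessions slide can be sketched in plain Java. This is an illustrative approximation under stated assumptions, not the Akka router from the talk: the class `BusyRefusingRouter` and its `request` and `release` methods are hypothetical names, and workers are reduced to busy flags. It shows the two halves of the protocol: a request is refused outright when every worker is busy, and the refused client is told when a worker becomes free so it can ask again.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

// Hypothetical sketch of a router that refuses work instead of queueing it.
class BusyRefusingRouter {
    private final boolean[] busy;                          // one flag per R worker
    private final Deque<String> waitingClients = new ArrayDeque<>();

    BusyRefusingRouter(int workerCount) {
        this.busy = new boolean[workerCount];
    }

    // Returns the index of a newly reserved worker, or empty if all are busy.
    // A refused client is remembered so it can be told when a worker frees up.
    synchronized Optional<Integer> request(String clientId) {
        for (int i = 0; i < busy.length; i++) {
            if (!busy[i]) {
                busy[i] = true;
                return Optional.of(i);
            }
        }
        if (!waitingClients.contains(clientId)) {
            waitingClients.addLast(clientId);
        }
        return Optional.empty();
    }

    // Marks the worker free and returns the client (if any) that should
    // receive the "worker free" message; that client must call request() again.
    synchronized Optional<String> release(int workerIndex) {
        busy[workerIndex] = false;
        return Optional.ofNullable(waitingClients.pollFirst());
    }
}
```

Refusing rather than buffering keeps large payloads (the slide mentions blocks of about 128 MB) out of the router's memory while workers are blocked for hours. The cost is the extra round trip on the client side, which is the inelegance the slide points out.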

Pages 11-24: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator
Page 25: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Future Improvements

- Data movement via REST
- Replacement of sessions via reactive streams (Akka upgrade!)
- Kamon test drive for distributed actors (released ~2 weeks ago)

Page 26: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Conclusions

- Akka makes even non-reactive distributed programming easier and more reliable
- If you can, use the latest Akka version because a lot of the earlier pain can be avoided:
    - clustering
    - persistence
    - reactive streams
- Large data movement via Akka is probably not an ideal use of the framework:
    - use REST (including Spray, Play, etc.) and HTTP chunking
    - move the data directly using Netty, etc.

Page 27: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Thank You!

Page 28: Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Miscellaneous

- Alpine is hiring
    - machine learning engineers (Scala/Java)
    - data scientists (R/Python)
    - front end developers (Ruby on Rails)
- ScalaCourses.com is looking for reviewers:
    - Scala (beginner/intermediate)
    - Akka
    - Play
    - Java Interop.
    - contact Michael Slinn: [email protected]