How to Integrate Python into a Scala Stack

Scala and Python: Integrating scikit-learn into a Scala Stack to build realtime predictive models. Dan Chiao, VP Engineering

Upload: fliptop

Post on 27-Jan-2015


TRANSCRIPT

Page 1: How to integrate python into a scala stack

Scala and Python
Integrating scikit-learn into a Scala Stack to build realtime predictive models

Dan Chiao
VP Engineering

Page 2: How to integrate python into a scala stack

Why it was necessary

We pivoted

Page 3: How to integrate python into a scala stack

The original product

• Social data append
  – PeopleGraph: match email addresses to public demographics and social profiles
  – BrandGraph: match company URLs to public firmographics and social profiles
• Requirements
  – Integrate a large (and expanding) number of web data sources (REST, SOAP, flat files)
  – Realtime processing of large volumes of contacts (60 queries/s)

Page 4: How to integrate python into a scala stack

The original technology stack

• Scala
  – Best of both worlds
    • Concise functional syntax
    • Java libraries and deployment architecture
    • Scala-specific libraries (Dispatch, Lift Web Framework)
• Twitter (soon to be Apache) Storm
  – Streaming intake and normalization of large amounts of data
• MongoDB
  – Expanding data sources = constantly updating schema
  – Most sophisticated query syntax of NoSQL options
• AWS and Azure
  – Well, duh

Page 5: How to integrate python into a scala stack

The new product

• Moving up the application stack
  – Focus on the most compelling single-use case for our data
  – Fliptop SpendScore
    • Predictive analytics for sales and marketing teams
    • “Machine learning for Salesforce”

Page 6: How to integrate python into a scala stack

The updated technology stack

• Still need to wrangle large amounts of data, so no changes there

• New requirement: fast, scalable machine learning

Page 7: How to integrate python into a scala stack

Why not Scala (Java) native?

• The options
  – Apache Mahout
    • Only skeleton implementations for most sophisticated machine learning techniques (e.g. Random Forest, Adaboost)
    • Customer-specific models: don’t need Big Data
  – Weka: GPL-licensed
  – Scala-native libraries: too early to use in production

Page 8: How to integrate python into a scala stack

Why Python?

• scikit-learn
  – Mature: around since 2006
  – Actively developed: last stable release Aug 2013
  – Sophisticated: Random Forest and Adaboost classifiers show comparable performance to R

• Why not R? Not really production grade.

Page 9: How to integrate python into a scala stack

Requirements

• APIs to exploit Python’s modeling power
  – Train, predict, model info query, etc.
• Scalability
  – On demand Python serving nodes
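The API surface named above (train, predict, model info query) could look roughly like the sketch below. This is only an illustration: the class name `ModelService`, its method signatures, and the per-customer dictionary are hypothetical, and a tiny nearest-centroid classifier stands in for a scikit-learn estimator so that the sketch runs with the standard library alone.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


class ModelService:
    """Hypothetical per-customer model API: train, predict, model info.

    A nearest-centroid classifier is a stand-in for a scikit-learn
    estimator; it is not the model described in the slides.
    """

    def __init__(self) -> None:
        # customer_id -> {label: centroid vector}
        self._models: Dict[str, Dict[str, List[float]]] = {}

    def train(self, customer_id: str, rows: Sequence[Sequence[float]],
              labels: Sequence[str]) -> None:
        """Fit one model per customer by averaging the rows of each label."""
        sums: Dict[str, List[float]] = {}
        counts: Dict[str, int] = defaultdict(int)
        for row, label in zip(rows, labels):
            if label not in sums:
                sums[label] = [0.0] * len(row)
            sums[label] = [s + x for s, x in zip(sums[label], row)]
            counts[label] += 1
        self._models[customer_id] = {
            label: [s / counts[label] for s in total]
            for label, total in sums.items()
        }

    def predict(self, customer_id: str, row: Sequence[float]) -> str:
        """Return the label whose centroid is closest to the given row."""
        centroids = self._models[customer_id]

        def squared_distance(label: str) -> float:
            return sum((a - b) ** 2 for a, b in zip(row, centroids[label]))

        return min(centroids, key=squared_distance)

    def info(self, customer_id: str) -> dict:
        """Answer a model info query: known labels and feature count."""
        centroids = self._models[customer_id]
        n_features = len(next(iter(centroids.values())))
        return {"labels": sorted(centroids), "n_features": n_features}
```

Keeping one small model per customer matches the earlier point that customer-specific models do not need Big Data tooling; each serving node would hold or load such models on demand.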

Page 10: How to integrate python into a scala stack

Tools for Scala-Python Integration

• Reimplementation of Python
  – Jython (JPython)
• Communication through JNI
  – Jepp
• Communication through IPC
  – Apache Thrift
• Communication through REST API calls
  – Bottle

Page 11: How to integrate python into a scala stack

Jython

• Re-Implementation of Python in Java

• Can import and use any Java class.

• Includes almost all of the modules in the standard Python distribution
  – Except some of the modules implemented originally in C.
• Compiles to Java bytecode
  – Either on demand or statically.


Page 12: How to integrate python into a scala stack

Jython

[Diagram: Scala code and Python code (via Jython) run together inside a single JVM]

Page 13: How to integrate python into a scala stack

Jython

• Lacks support for lots of extensions for scientific computing
  – Numpy, Scipy, etc.
• JyNI (Jython Native Interface) to the rescue?
  – Specifically designed to support CPython extensions like Numpy, Scipy
  – Still in alpha


Page 14: How to integrate python into a scala stack

Communication through JNI

• Jepp (Java Embedded Python)
  – Embeds CPython in Java
  – Runs Python code in CPython
  – Leverages both JNI and Python/C for integration

Page 15: How to integrate python into a scala stack

Jepp

[Diagram: Scala code in the JVM calls Python code running in a CPython interpreter, bridged by JNI and Jepp]

Page 16: How to integrate python into a scala stack

Jepp

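TestJepp.scala

    import jep.Jep

    object TestJepp extends App {
      // Start an embedded CPython interpreter and load the Python helpers.
      val jep = new Jep()
      jep.runScript("python_util.py")

      // Jepp passes arguments as Java objects, so box the Scala Ints.
      val a = (2).asInstanceOf[AnyRef]
      val b = (3).asInstanceOf[AnyRef]

      // Call the Python function by name and unbox the result.
      val sumByPython = jep.invoke("python_add", a, b)
      println(sumByPython.asInstanceOf[Int])
    }

python_util.py

    def python_add(a, b):
        return a + b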

Page 17: How to integrate python into a scala stack

Communication through IPC

• Apache Thrift
  – Developed & open-sourced by Facebook
  – More community support than Protobuf, Avro
  – IDL-based (Interface Definition Language)
  – Generates server/client code in specified languages
  – Takes care of protocol and transport layer details
  – Comes with generators for Java, Python, C++, etc.
    • No Scala generator
    • Scrooge (Twitter) to the rescue!


Page 18: How to integrate python into a scala stack

Thrift – IDL

TestThrift.thrift

    namespace java python_service_test
    namespace py python_service_test

    service PythonAddService {
      i32 pythonAdd(1: i32 a, 2: i32 b),
    }

Generate the Java and Python code:

    $ thrift --gen java --gen py TestThrift.thrift

Page 19: How to integrate python into a scala stack

Thrift – Python Server

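PythonAddService.py (generated)

    class Iface:
        def pythonAdd(self, a, b):
            pass

PythonAddServer.py

    # Assumes the generated gen-py directory is on sys.path.
    from thrift.protocol import TBinaryProtocol
    from thrift.server import TServer
    from thrift.transport import TSocket, TTransport

    from python_service_test import PythonAddService


    class ExampleHandler(PythonAddService.Iface):
        def pythonAdd(self, a, b):
            return a + b


    handler = ExampleHandler()
    # The processor comes from the generated service module.
    processor = PythonAddService.Processor(handler)
    # Pass the port by keyword; the first positional argument may be the host.
    transport = TSocket.TServerSocket(port=9090)
    tfactory = TTransport.TBufferedTransportFactory()
    pfactory = TBinaryProtocol.TBinaryProtocolFactory()

    server = TServer.TThreadedServer(processor, transport, tfactory, pfactory)
    server.serve()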

Page 20: How to integrate python into a scala stack

Thrift – Scala Client

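PythonAddClient.scala

    import org.apache.thrift.protocol.{TBinaryProtocol, TProtocol}
    import org.apache.thrift.transport.{TSocket, TTransport}
    import python_service_test.PythonAddService

    object PythonAddClient extends App {
      val transport: TTransport = new TSocket("localhost", 9090)
      val protocol: TProtocol = new TBinaryProtocol(transport)
      val client = new PythonAddService.Client(protocol)

      transport.open()
      // The method name matches the IDL definition: pythonAdd.
      val sumByPython = client.pythonAdd(3, 5)
      println("3 + 5 = " + sumByPython)
      transport.close()
    }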

Page 21: How to integrate python into a scala stack

Thrift

[Diagram: Scala code in the JVM talks over Thrift to multiple Python interpreters, each running the Python code behind its own Thrift server]

Auto Balancing, Built-in Encryption

Page 22: How to integrate python into a scala stack

REST API Architecture

[Diagram: Scala code in the JVM calls several Bottle servers, each wrapping the Python code]

Auto Balancer? Encoding?
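A minimal sketch of one such REST serving node follows. To keep it free of third-party dependencies, the standard library's `http.server` is swapped in for Bottle, so this is not the Bottle code from the talk. The `/add` route, the JSON wire format, and the names `handle_add`, `AddHandler`, and `make_server` are all assumptions made for illustration. The encoding question from the slide is answered here with plain JSON.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def python_add(a, b):
    # Same toy function as in the Jepp and Thrift slides.
    return a + b


def handle_add(raw_body):
    """Decode a JSON request body, call Python, encode a JSON reply.

    Returns (status_code, response_bytes). Hypothetical wire format:
    request {"a": 3, "b": 5}, response {"sum": 8}.
    """
    try:
        args = json.loads(raw_body)
        result = python_add(args["a"], args["b"])
    except (ValueError, KeyError, TypeError):
        return 400, json.dumps({"error": "bad request"}).encode("utf-8")
    return 200, json.dumps({"sum": result}).encode("utf-8")


class AddHandler(BaseHTTPRequestHandler):
    """Routes POST /add to handle_add; everything else is a 404."""

    def do_POST(self):
        if self.path != "/add":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_add(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port):
    """Build (but do not start) one serving node."""
    return HTTPServer(("0.0.0.0", port), AddHandler)
```

Calling `make_server(8080).serve_forever()` would start one node; the Scala side then only needs an HTTP client and a JSON library, which is the "no dependency" argument for REST.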

Page 23: How to integrate python into a scala stack

Thrift v.s. REST

                      Thrift   REST   Does it matter?
Load Balancer           ✔             No (AWS & Azure)
Encode/Decode           ✔             No (We’re already doing it)
Low Learning Curve               ✔    Yes
No Dependency                    ✔    Yes

Winner: REST

Page 24: How to integrate python into a scala stack

Fliptop’s Architecture

[Diagram: Scala code in the JVM sends requests through a load balancer to multiple Bottle servers, each wrapping the Python code]

5 Python servers, ~5,000 requests/sec
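In the architecture above the load balancer comes from the cloud platform, so nothing like the following had to be written. As a toy illustration of what it does, here is a round-robin picker over five nodes. The node addresses and the class name `RoundRobinBalancer` are hypothetical, and the even split assumes the ~5,000 requests/sec figure is the aggregate across all nodes, which the slide does not state.

```python
from collections import Counter
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backends in a fixed rotation, one per request."""

    def __init__(self, backends):
        self._backends = list(backends)
        if not self._backends:
            raise ValueError("at least one backend is required")
        self._rotation = cycle(self._backends)

    def pick(self):
        """Return the backend that should serve the next request."""
        return next(self._rotation)


# Hypothetical addresses for the five Python serving nodes.
BACKENDS = ["python-node-%d:8080" % i for i in range(1, 6)]


def simulate(requests, backends=BACKENDS):
    """Count how many of `requests` each backend would receive."""
    balancer = RoundRobinBalancer(backends)
    return Counter(balancer.pick() for _ in range(requests))
```

Under that assumption, `simulate(5000)` assigns an equal share of one second's traffic to each node, and adding a node only means appending an address to the list, which is what makes on-demand Python serving nodes cheap to scale.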

Page 25: How to integrate python into a scala stack

Summary

• Jython
  • (✓) Tight integration with Scala/Java
  • (✗) Lacks support for C extensions (JyNI might help in the future)
• Jepp
  • (✓) Access high quality Python extensions with CPython speed
  • (✗) Two runtime environments
• Thrift, REST
  • (✓) Language-independent development
  • (✗) Bigger communication overhead


Page 26: How to integrate python into a scala stack

Questions?

Ask this guy

Page 27: How to integrate python into a scala stack

Thank You
