2016-05-24 IBM Fast Data Meetup

Scala: Lingua Franca of Fast Data
Jamie Allen, Sr. Director of Global Solutions Architects

Uploaded by shinolajla on 16-Apr-2017

TRANSCRIPT




Agenda
- Why Scala?
- Who is doing this?
- What is Fast Data?
- Architecting for Fast Data

Tradeoffs
- Cloud portability versus native control
- Application correctness versus speed of development
- Modularity versus global namespace
- Concise syntax versus boilerplate
- Multi-threaded simplicity via abstractions versus low-level control

Scala is the local optimum
- REPL
- Type safety
- Modularity
- Concise syntax
- Multi-threaded simplicity
- Data-centric semantics
- Managed runtime for cloud portability
- Ecosystem


The JVM is a primary reason for Scala's success

Why not Java?
- No REPL or notebook
- Not a data-centric language, particularly in its collections semantics
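For contrast, the kind of data-centric collections pipeline meant here is concise in Scala. A minimal, standard-library-only sketch (the word-count example is illustrative, not from the talk):

```scala
object WordCount {
  // Normalize, split, and count word occurrences in one collections pipeline.
  def count(text: String): Map[String, Int] =
    text.toLowerCase
      .split("\\W+")
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, hits) => word -> hits.length }

  def main(args: Array[String]): Unit =
    println(count("Fast data needs fast tools"))
}
```

The equivalent in pre-Java-8 Java requires explicit loops and mutable maps; even with streams, the Scala version stays closer to how you describe the data transformation.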

Why not Python?
- Data-centric language with all of the wonderful collections semantics we want
- No type safety
- No modularity

Why not Go or C++?
- Weak type safety
- Collections are too elemental
- Native execution is a non-starter, so Go is the only option
- Garbage collection is not generational

Scala just so happened to fit well in this space:
- Performance
- Correctness
- Conciseness
Scala will evolve. Other languages will come in time.

Scala is NOT the end of the road

Who is doing this?


One Caveat: Apache Beam and TensorFlow


Why Scala?

"At the time we started, I really wanted a PL that supports a language-integrated interface (where people write functions inline, etc.). However, I also wanted to be on the JVM in order to easily interact with the Hadoop filesystem and data formats for that. Scala was the only somewhat popular JVM language that offered this kind of functional syntax and was also statically typed (letting us have some control over performance), so we chose that. Today there might be an argument to make the first version of the API in Java with Java 8, but we also benefitted from other aspects of Scala in Spark, like type inference, pattern matching, actor libraries, etc."
(Matei Zaharia, creator of Spark)

What is Fast Data?

A bit of history: Hadoop


Hadoop strengths
- Lowest capital expenditure for big data
- Excellent for ingesting and integrating diverse datasets
- Flexible
- Classic analytics (aggregations and data warehousing)
- Machine learning

Hadoop weaknesses
- Complex administration
- YARN requires a dedicated cluster
- MapReduce foibles:
  - Poor performance
  - Imperative programming model
  - No stream processing support

Fast Data with Spark

Spark
- 100x faster as a replacement for Hadoop MapReduce
- Uses far fewer machines and resources
- Incredible support from the community and enterprise

Spark use cases
- Primarily anomaly detection:
  - Risk management
  - Fraud detection
  - Odds recalculation
  - Spam filters
- Updating search engine results quickly

Type safety
- Spark had it with RDDs
- They removed it with the DataFrames API
- Brought it back with Datasets, but not as comprehensively as with RDDs
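The distinction can be sketched without a Spark cluster. This is a standard-library analogy, not Spark code: a DataFrame-style row is stringly typed (errors surface at runtime), while a Dataset/RDD-style row is a case class (field typos and type mismatches fail at compile time). All names here are illustrative:

```scala
object TypeSafetyDemo {
  // "DataFrame-style": an untyped row; a misspelled column name or a
  // wrong-type read compiles fine and only fails at runtime.
  val untypedRow: Map[String, Any] = Map("name" -> "spark", "stars" -> 35000)

  // "Dataset/RDD-style": a typed record; the compiler rejects bad field
  // names and bad types before the job ever runs.
  final case class Project(name: String, stars: Int)
  val typedRow: Project = Project("spark", 35000)

  def main(args: Array[String]): Unit = {
    // The untyped row needs a runtime cast; the typed one does not.
    val stars = untypedRow("stars").asInstanceOf[Int]
    println(stars + typedRow.stars)
  }
}
```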

Why not Flink?
Flink has much better stream handling for low-latency systems than Spark currently offers:
- Event timing
- Watermarks
- Triggers
- Exactly-once semantics (assuming connections hold)
- Pipeline portability via Apache Beam integration


Architecting for Fast Data

This isn't enough

Old and busted

"Traditional application architectures and platforms are obsolete." (Gartner)

How do we avoid messing this up?

We want isolation
- At the API
- In our source
- For our data
(Image: Wikipedia, Creative Commons, created by DFoerster)

We want realistic data management
- Use CQRS and Event Sourcing, not CRUD
- Transactions, especially distributed ones, will not work
- Consistency is an anti-pattern at scale
- Distributed locks and shared data will limit you
- Data fabrics break all of these conventions
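A minimal sketch of the Event Sourcing idea (the names are illustrative, not from any particular library): instead of updating state in place as CRUD does, append immutable events to a log and derive the current state by folding over them, so reads can be projected separately as CQRS suggests:

```scala
object EventSourcingDemo {
  // Illustrative events for an account; in Event Sourcing these are
  // appended to a log and never mutated or deleted.
  sealed trait Event
  final case class Deposited(amount: Int) extends Event
  final case class Withdrawn(amount: Int) extends Event

  // Current state is a pure fold over the event history, so it can be
  // rebuilt at any time, or projected differently for the read side.
  def balance(events: Seq[Event]): Int =
    events.foldLeft(0) {
      case (acc, Deposited(a)) => acc + a
      case (acc, Withdrawn(a)) => acc - a
    }

  def main(args: Array[String]): Unit =
    println(balance(Seq(Deposited(100), Withdrawn(30), Deposited(5)))) // 75
}
```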

"Think in terms of compensation, not prevention." (Kevin Webber, Lightbend)

We want ACID v2
- Associativity, not Atomicity
- Commutativity, not Consistency
- Idempotence, not Isolation
- Distribution, not Durability
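These three algebraic properties are what make merging replica state safe without coordination. A standard-library-only sketch (this is the essence of a G-Set CRDT; the names are illustrative): set union is associative, commutative, and idempotent, so replicas can combine updates in any grouping, in any order, and any number of times, and still converge:

```scala
object AcidV2Demo {
  // Merge two replica states. Set union is associative, commutative,
  // and idempotent, so no locks, ordering, or dedup are required.
  def merge(a: Set[String], b: Set[String]): Set[String] = a union b

  def main(args: Array[String]): Unit = {
    val r1 = Set("evt-1", "evt-2")
    val r2 = Set("evt-2", "evt-3")
    val r3 = Set("evt-3")

    // Any grouping (associative), any order (commutative)...
    assert(merge(merge(r1, r2), r3) == merge(r1, merge(r3, r2)))
    // ...and a redelivered update is harmless (idempotent).
    assert(merge(r1, r1) == r1)
    println(merge(merge(r1, r2), r3))
  }
}
```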

(Image: Wikipedia, Creative Commons, created by Weston.pace)

New hotness

One version of the emerging Fast Data Architecture. For today's talk, I won't go through it in detail, but this reflects some industry trends among open-source tools (popularity of Spark Streaming, Kafka, and Cassandra), plus our view of the Typesafe Reactive Platform as the glue that integrates it all, lets you implement the rest of the microservices you need, and provides the low-latency streaming through Akka Streams that Spark doesn't provide. The last slide has a link to a white paper I wrote that goes through this diagram in detail, along with several example use cases.

It's very common to combine Spark Streaming, Kafka, and Cassandra (or another distributed NoSQL database). In fact, this troika has become a de facto standard component of modern, stream-oriented architectures.

While Spark, Kafka, and Cassandra are de facto standard core components, you also need a platform for resource management and other needs, plus glue to tie everything together. Hence, the SMACK stack:
- S: Spark (and Scala?)
- M: Mesos
- A: Akka
- C: Cassandra
- K: Kafka

Discuss all the components in the context of data flow. First, you will have streaming data from many sources: web requests and other external Internet traffic, and services communicating with you. If they follow the Reactive Streams standard, then they can be ingested by Akka Streams. Finally, lots of internal data, like logs and files FTPed into your environment, will be ingested.

Reactive Streams follow a standard for an out-of-band, resilient backpressure mechanism that provides flow control. It only describes the behavior of a single stream with 1+ producers and 1+ consumers, but these streams are composable; a graph of reactive streams effectively provides end-to-end flow control.
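The backpressure idea can be sketched with the standard library alone (this is illustrative, not the Reactive Streams API itself): the consumer signals demand, and the producer emits only as many elements as have been requested, so a slow consumer is never overwhelmed:

```scala
object BackpressureDemo {
  // A producer that honors demand: it emits at most n elements per pull,
  // mimicking Reactive Streams' request(n) flow control signal.
  final class Producer {
    private var next = 1
    def request(n: Int): Seq[Int] = {
      val batch = (next until next + n).toList
      next += n
      batch
    }
  }

  def main(args: Array[String]): Unit = {
    val producer = new Producer
    // The consumer asks for a few elements at a time; the producer
    // never sends more than was demanded.
    val batch1 = producer.request(3) // List(1, 2, 3)
    val batch2 = producer.request(2) // List(4, 5)
    println(batch1 ++ batch2)        // List(1, 2, 3, 4, 5)
  }
}
```

In a real reactive stream this demand signal propagates upstream through the whole graph, which is what gives the end-to-end flow control described above.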

Use the Lightbend Reactive Platform as glue for the other large components, implemented as microservices. Web requests (e.g., REST) can be handled by Play, which you can also use to implement user interfaces. Reactive Streams-compliant sources can be ingested by Akka Streams, which supports a low-latency, per-event processing model.

Use Kafka to ingest ephemeral stream data (which can't be replayed if lost downstream). Kafka has enormous scalability and durability. It's a great place to capture your streams, which can then be processed with Akka Streams or with Spark Streaming.

One use of Kafka is to solve the problem of N*M direct links between producers and consumers. This is hard to manage, and it couples services too directly, which is fragile when a given service needs to be scaled up through replication or replacement, and sometimes in the protocol that both ends need to speak.

So Kafka can function as a central hub, yet it's distributed and scalable, so it isn't a bottleneck or a single point of failure.
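The N*M point can be sketched with a toy in-memory hub (illustrative only; real Kafka is a distributed, durable, partitioned log): producers publish to named topics and consumers read from them, so each side only knows the hub, turning N*M direct links into N+M:

```scala
import scala.collection.mutable

object HubDemo {
  // A toy topic hub: producers append, consumers read, and neither side
  // knows about the other. Kafka plays this role durably and at scale.
  final class Hub {
    private val topics = mutable.Map.empty[String, mutable.Buffer[String]]
    def publish(topic: String, msg: String): Unit =
      topics.getOrElseUpdate(topic, mutable.Buffer.empty[String]) += msg
    def consume(topic: String): Seq[String] =
      topics.getOrElse(topic, mutable.Buffer.empty[String]).toSeq
  }

  def main(args: Array[String]): Unit = {
    val hub = new Hub
    hub.publish("clicks", "user-1:/home")
    hub.publish("clicks", "user-2:/cart")
    println(hub.consume("clicks"))
  }
}
```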

For sophisticated stream processing, where richer analytics (like online machine learning) is required and higher latencies can be tolerated, use Spark Streaming, which implements a mini-batch model, where data is processed in chunks defined by time windows as small as 1 second.
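The mini-batch model can be sketched with the standard library (illustrative only; Spark Streaming does this continuously across a cluster): assign each timestamped event to a fixed time window and process each window's events as one batch:

```scala
object MiniBatchDemo {
  final case class Event(timestampMs: Long, value: Int)

  // Assign each event to a fixed-size window (e.g., 1000 ms) and
  // aggregate per window, mimicking Spark Streaming's chunked model.
  def batchSums(events: Seq[Event], windowMs: Long): Map[Long, Int] =
    events
      .groupBy(e => e.timestampMs / windowMs)
      .map { case (window, evts) => window -> evts.map(_.value).sum }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(100, 1), Event(900, 2), Event(1200, 3))
    println(batchSums(events, windowMs = 1000))
  }
}
```

Shrinking the window lowers latency but raises scheduling overhead, which is why the per-event model of Akka Streams (or Flink) fits the truly low-latency cases better.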

Processing results can be written back to Kafka for downstream consumption. For permanent storage, write to a database like Cassandra when fast record-level CRUD is required, or write to a distributed file system when cheaper storage is desired and table-scan access patterns are most important.

This architecture is agnostic to the platform. We like Mesos, the next-generation, general-purpose infrastructure for cluster resource and application management, but it works with YARN, too. You can deploy on premises (e.g., on bare-metal hardware) or in a cloud environment.

Learning Spark
Go to http://bigdatauniversity.com, built by IBM