qcon rio 2015 - stock predictions with spark-geode-zeppelin.key

© 2015 Pivotal Software, Inc. All rights reserved.

Building a Stock Prediction System with Machine Learning using Geode, Spring XD and Spark MLlib

William Markito (@william_markito)

Fred Melo (@fredmelo_br)


It's all about DATA

Data Sources
Look for patterns
Prediction

Machine Learning Model (e.g. Linear Regression)
• Inputs: price(x), medium avg (x), relative strength (x)
• Output: medium avg (x+1)
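The prediction idea on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the model used in the talk: "medium avg" is read here as a simple moving average, "relative strength" as the common RSI formula over the same window, and the fit uses ordinary least squares via the normal equations. All function names are hypothetical.

```python
def average_at(prices, x, window):
    """Moving average of the `window` prices ending at index x (assumed reading of 'medium avg')."""
    return sum(prices[x - window + 1:x + 1]) / window


def relative_strength_at(prices, x, window):
    """RSI over the last `window` price changes ending at index x (assumed formula)."""
    changes = [prices[i] - prices[i - 1] for i in range(x - window + 1, x + 1)]
    gains = sum(c for c in changes if c > 0)
    losses = sum(-c for c in changes if c < 0)
    if losses == 0:
        return 100.0
    return 100.0 - 100.0 / (1.0 + gains / losses)


def training_rows(prices, window):
    """Features [price(x), avg(x), strength(x)] paired with the target avg(x+1)."""
    features, targets = [], []
    for x in range(window, len(prices) - 1):
        features.append([prices[x],
                         average_at(prices, x, window),
                         relative_strength_at(prices, x, window)])
        targets.append(average_at(prices, x + 1, window))
    return features, targets


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_linear(features, targets):
    """Least squares via the normal equations; returns [intercept, w1, w2, ...]."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, feature_row):
    """Apply the fitted weights to one feature row."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], feature_row))
```

With a long enough price history, `fit_linear(*training_rows(prices, window))` yields weights that `predict` can apply to the latest feature row. On short or nearly collinear data the normal equations can be ill-conditioned, so a library implementation such as Spark MLlib's linear regression is the more realistic choice.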


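SpringXD
• Extensible
• Open-Source
• Fault-Tolerant
• Horizontally Scalable
• Cloud-Native

Stream operations: Transform, Sink, Enrich, Filter, Split

[Architecture diagram: Real data and Simulator sources; Indicators, Predict, Machine Learning and Dashboard components; numbered steps 1 to 3; regions /Stocks, /TechIndicators and /Predictions]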


Apache Geode (incubating)

Introduction


A distributed, memory-based data management platform for data-oriented apps that need:
• High performance, scalability, resiliency and continuous availability
• Fast access to critical data sets
• Location-aware distributed data processing
• Event-driven data architecture


Concepts
• Cache
  • In-memory storage and management for your data
  • Configurable through XML, Spring, Java API or CLI
  • Collection of Regions

[Diagram: a JVM hosting a Cache that contains several Regions]


Concepts
• Region
  • Distributed java.util.Map on steroids (Key/Value)
  • Consistent API regardless of where or how data is stored
  • Observable (reactive)
  • Highly available, redundant across cache members

[Diagram: a Region inside a Cache inside a JVM, accessed as a java.util.Map]

Key   Value
K01   May
K02   Tim
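The "observable map" idea can be illustrated with a small single-process sketch. It is not the Geode API: the class `ObservableRegion`, its methods and the event names are hypothetical, and the real Region is distributed across members. The sketch only shows a key/value store that notifies registered listeners on every change.

```python
class ObservableRegion:
    """Toy key/value region that notifies listeners about changes (hypothetical API)."""

    def __init__(self, name):
        self.name = name
        self._data = {}
        self._listeners = []

    def add_listener(self, listener):
        # listener(event, key, old_value, new_value)
        self._listeners.append(listener)

    def _notify(self, event, key, old, new):
        for listener in self._listeners:
            listener(event, key, old, new)

    def put(self, key, value):
        existed = key in self._data
        old = self._data.get(key)
        self._data[key] = value
        self._notify("update" if existed else "create", key, old, value)

    def get(self, key):
        return self._data.get(key)

    def remove(self, key):
        if key in self._data:
            old = self._data.pop(key)
            self._notify("destroy", key, old, None)


# Usage: record every event emitted by the region.
events = []
people = ObservableRegion("/People")
people.add_listener(lambda event, key, old, new: events.append((event, key, old, new)))
people.put("K01", "May")
people.put("K02", "Tim")
people.put("K02", "Tom")
people.remove("K01")
```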


Concepts
• Region
  • Local, Replicated or Partitioned
  • In-memory or persistent
  • Redundant
  • LRU
  • Overflow

[Diagram: two JVMs, each hosting a Cache with a Region (java.util.Map) holding K01 = May and K02 = Tim]

Region shortcuts:
• LOCAL, LOCAL_HEAP_LRU, LOCAL_OVERFLOW, LOCAL_PERSISTENT, LOCAL_PERSISTENT_OVERFLOW
• PARTITION, PARTITION_HEAP_LRU, PARTITION_OVERFLOW, PARTITION_PERSISTENT, PARTITION_PERSISTENT_OVERFLOW, PARTITION_PROXY, PARTITION_PROXY_REDUNDANT, PARTITION_REDUNDANT, PARTITION_REDUNDANT_HEAP_LRU, PARTITION_REDUNDANT_OVERFLOW, PARTITION_REDUNDANT_PERSISTENT, PARTITION_REDUNDANT_PERSISTENT_OVERFLOW
• REPLICATE, REPLICATE_HEAP_LRU, REPLICATE_OVERFLOW, REPLICATE_PERSISTENT, REPLICATE_PERSISTENT_OVERFLOW, REPLICATE_PROXY
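To make "partitioned" and "redundant" concrete, here is a minimal single-process sketch of a partitioned region that keeps one redundant copy of each entry. The placement rule (CRC32 hash to a bucket, then consecutive members) is an assumption chosen for simplicity and is not how Geode assigns buckets; the class and parameter names are hypothetical.

```python
import zlib


class PartitionedRegion:
    """Toy partitioned region: each key lives on a primary plus redundant members."""

    def __init__(self, members, redundant_copies=1, buckets=8):
        if redundant_copies >= len(members):
            raise ValueError("need more members than redundant copies")
        self.members = list(members)
        self.redundant_copies = redundant_copies
        self.buckets = buckets
        self.stores = {member: {} for member in self.members}

    def _bucket(self, key):
        # Deterministic hash so placement is stable across runs.
        return zlib.crc32(str(key).encode("utf-8")) % self.buckets

    def owners(self, key):
        """Primary first, then the members holding redundant copies."""
        start = self._bucket(key) % len(self.members)
        return [self.members[(start + i) % len(self.members)]
                for i in range(self.redundant_copies + 1)]

    def put(self, key, value):
        for member in self.owners(key):
            self.stores[member][key] = value

    def get(self, key, unavailable=()):
        """Read from the first owner that is not in `unavailable`."""
        for member in self.owners(key):
            if member not in unavailable and key in self.stores[member]:
                return self.stores[member][key]
        return None


# Usage: three members, one redundant copy per entry.
region = PartitionedRegion(["server1", "server2", "server3"], redundant_copies=1)
region.put("K01", "May")
region.put("K02", "Tim")
primary = region.owners("K01")[0]
still_readable = region.get("K01", unavailable={primary})
```

A replicated region would instead write every entry to every member, and a local region would keep it on a single member only.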


Concepts
• Member
  • A process that has a connection to the system
  • A process that has created a cache
  • Embeddable within your application

[Diagram: Client, Locator, Server]


Concepts
• Client cache
  • A process connected to the Geode server(s)
  • Can have a local copy of the data
  • Can be notified about events on the servers

[Diagram: an Application with a Client Cache, connected to a GemFire Server hosting several Regions]


Concepts
• Listeners
  • CacheWriter / CacheListener
  • AsyncEventListener (queue / batch)
  • Parallel or Serial
  • Conflation
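The queue, batch and conflation bullets can be illustrated with a toy event queue. This is a simplified sketch and not Geode's implementation: the class name is hypothetical, delivery is triggered by an explicit `flush()` call rather than a background dispatcher, and conflation is modelled as overwriting the value of a queued event for the same key.

```python
class AsyncEventQueue:
    """Toy write-behind queue that delivers events to a listener in batches."""

    def __init__(self, listener, batch_size=3, conflate=False):
        self.listener = listener      # called with a list of (key, value) events
        self.batch_size = batch_size
        self.conflate = conflate
        self._queue = []

    def offer(self, key, value):
        if self.conflate:
            # Conflation: keep only the latest value per key while it is queued.
            for i, (queued_key, _) in enumerate(self._queue):
                if queued_key == key:
                    self._queue[i] = (key, value)
                    return
        self._queue.append((key, value))

    def flush(self):
        """Deliver everything currently queued, batch_size events at a time."""
        while self._queue:
            batch = self._queue[:self.batch_size]
            del self._queue[:self.batch_size]
            self.listener(batch)


# Usage: five price updates, three of them for the same key.
delivered = []
queue = AsyncEventQueue(delivered.append, batch_size=2, conflate=True)
for key, value in [("AAPL", 1), ("GOOG", 5), ("AAPL", 2), ("MSFT", 3), ("AAPL", 4)]:
    queue.offer(key, value)
queue.flush()
```

With conflation enabled the listener sees only the most recent AAPL value, which reduces downstream work when intermediate values do not matter.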



• Currently under incubation at the Apache Software Foundation
• Contributions and contributors are welcome
  • Code and patches
  • Bugs, feature requests
  • Documentation and content
  • Any form of feedback


• Code
  • New features
  • Bug fixes (patches)
  • Writing tests
• Documentation
  • Wiki
  • Web site
  • User guides
• Community
  • Join our mailing lists (ask or answer)
  • Become a speaker
  • Find and report bugs
  • Testing a release candidate or beta



• JIRA - https://issues.apache.org/jira/browse/GEODE
• GitHub - https://github.com/apache/incubator-geode
• Mailing lists:
  • Development - [email protected]
  • Users - [email protected]
• Wiki - http://cwiki.apache.org/confluence/display/GEODE
• StackOverflow - http://stackoverflow.com/questions/tagged/geode+or+gemfire




Spring XD

Introduction


Concepts

Runs as a distributed application or as a single node


Concepts
• A stream is composed from modules. Each module is deployed to a container and its channels are bound to the transport.
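The idea of composing a stream from modules can be sketched with Python generators. This is not Spring XD code or its DSL: the helper names are hypothetical, everything runs in one process, and the generator hand-off stands in for the channels that a real deployment would bind to a message transport.

```python
def compose(*modules):
    """Chain modules so the output of each one feeds the next."""
    def stream(messages):
        for module in modules:
            messages = module(messages)
        return messages
    return stream


def filter_module(predicate):
    """Processor that drops messages failing the predicate."""
    def module(messages):
        return (m for m in messages if predicate(m))
    return module


def transform_module(fn):
    """Processor that maps each message to a new one."""
    def module(messages):
        return (fn(m) for m in messages)
    return module


def sink_module(target):
    """Sink that appends each message to `target` and passes it on."""
    def module(messages):
        for m in messages:
            target.append(m)
            yield m
    return module


# Usage: drop invalid ticks, enrich the rest, store them in a list.
stored = []
stream = compose(
    filter_module(lambda tick: tick["price"] > 0),
    transform_module(lambda tick: dict(tick, cents=round(tick["price"] * 100))),
    sink_module(stored),
)
ticks = [
    {"symbol": "AAPL", "price": 100.0},
    {"symbol": "AAPL", "price": -1.0},
    {"symbol": "GOOG", "price": 500.0},
]
results = list(stream(ticks))
```

Because every module only sees an iterable of messages, modules can be reordered or replaced without touching the others, which is the property the stream abstraction is meant to provide.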


Apache Zeppelin (incubating)

Introduction


Concepts
• Web based REPL
• Iterative & Exploratory
• Support for Data Ingestion


Concepts
• Multi interpreters
  • Markdown
  • Shell
  • Spark
  • Geode
  • Python…


Concepts
• Sharing through URLs without Reports


Apache Spark

Introduction


Concepts
• RDD
• Dataframe
• Driver
• Worker

"An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes."


"A dataframe is a distributed collection of rows organized into named columns. An abstraction for selecting, filtering and plotting structured data (pandas), previously known as SchemaRDD."
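The two definitions quoted above can be illustrated with a toy model in plain Python. It does not use Spark: `ToyRDD` and `ToyDataFrame` are hypothetical classes, all partitions live in one process, and the only point is to show an immutable, partitioned collection and a thin layer of named columns on top of it (echoing the "previously known as SchemaRDD" remark).

```python
from math import ceil


class ToyRDD:
    """Immutable collection split into partitions (all in one process here)."""

    def __init__(self, partitions):
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def parallelize(cls, items, num_partitions):
        items = list(items)
        size = max(1, ceil(len(items) / num_partitions))
        return cls(items[i * size:(i + 1) * size] for i in range(num_partitions))

    @property
    def num_partitions(self):
        return len(self._partitions)

    def map(self, fn):
        # Transformations return a new ToyRDD and leave this one untouched.
        return ToyRDD(tuple(fn(x) for x in p) for p in self._partitions)

    def filter(self, predicate):
        return ToyRDD(tuple(x for x in p if predicate(x)) for p in self._partitions)

    def collect(self):
        return [x for p in self._partitions for x in p]


class ToyDataFrame:
    """Rows organized into named columns, stored as a ToyRDD of tuples."""

    def __init__(self, columns, rdd):
        self.columns = tuple(columns)
        self.rdd = rdd

    def select(self, *names):
        idx = [self.columns.index(n) for n in names]
        return ToyDataFrame(names, self.rdd.map(lambda row: tuple(row[i] for i in idx)))

    def where(self, column, predicate):
        i = self.columns.index(column)
        return ToyDataFrame(self.columns, self.rdd.filter(lambda row: predicate(row[i])))

    def collect(self):
        return [dict(zip(self.columns, row)) for row in self.rdd.collect()]


# Usage: filter by a named column, then project another one.
rows = ToyRDD.parallelize([("AAPL", 100.0), ("GOOG", 500.0), ("MSFT", 40.0)], 2)
frame = ToyDataFrame(("symbol", "price"), rows)
expensive = frame.where("price", lambda p: p > 50).select("symbol").collect()
```

In Spark itself each partition can be processed by a different worker, and the driver coordinates the work; the sketch keeps everything local to stay runnable without a cluster.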




Summary

• Integration
  • Spark, JDBC, Geode
  • HDFS, Twitter, File, Mail…
• Data pipeline orchestration
  • Intuitive DSL
  • Streaming & Analytics
  • Distributed and scalable

• Web based REPL
  • Multiple Interpreters: Apache Spark, Markdown, Flink, Python, Geode…
  • Iterative & Exploratory

• Fast data processing
  • Columnar queries
  • RDDs
  • Machine Learning
  • Analytics & Streaming

• Fast data store and processing
  • In-memory & Persistent
  • Highly Consistent
  • Transaction processing
  • Thousands of concurrent clients


Source Code

http://pivotal-open-source-hub.github.io/StockInference-Spark/
