qcon rio 2015 - stock predictions with spark-geode-zeppelin.key
TRANSCRIPT
© 2015 Pivotal Software, Inc. All rights reserved.
Building a Stock Prediction System with Machine Learning using Geode, Spring XD and Spark MLlib
William Markito (@william_markito)
Fred Melo (@fredmelo_br)
It's all about DATA
Data Sources
Look for patterns
Prediction
Inputs: price(x), medium avg (x), relative strength (x)
Machine Learning Model (e.g. Linear Regression)
Output: medium avg (x+1)
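As a rough illustration of the Prediction slide above, the sketch below fits a one-input least-squares line that maps the average at step x to the average at step x+1. It is a simplification under stated assumptions: "medium avg" is read as a simple moving average, the price and relative strength inputs are left out for brevity, and the function names and price series are hypothetical. The talk itself uses Spark MLlib, not hand-written code like this.

```python
# Illustrative only: plain-Python stand-in for the model named on the slide.
# Assumption: "medium avg" is read as a simple moving average over `window` prices.

def moving_average(prices, window):
    """Simple moving average for every position that has a full window."""
    return [sum(prices[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(prices))]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one input)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical price series, not data from the talk.
prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
avgs = moving_average(prices, window=3)

# Training pairs: average at step x -> average at step x+1.
slope, intercept = fit_line(avgs[:-1], avgs[1:])
predicted_next = slope * avgs[-1] + intercept
print(predicted_next)
```

Adding the other two inputs would turn this into a multi-variable regression, which is the kind of model a library such as MLlib is meant to handle.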
© Copyright 2014 Pivotal. All rights reserved.
Spring XD
Extensible, Open-Source, Fault-Tolerant, Horizontally Scalable, Cloud-Native
Transform, Sink, Enrich, Filter, Split
[Architecture diagram, steps 1 to 3: Real data, Simulator, Machine Learning, Indicators, Predict, Dashboard; /Stocks, /TechIndicators, /Predictions]
Introduction
A distributed, memory-based data management platform for data-oriented apps that need:
• High performance, scalability, resiliency and continuous availability
• Fast access to critical data sets
• Location-aware distributed data processing
• Event-driven data architecture
Concepts
• Cache
  • In-memory storage and management for your data
  • Configurable through XML, Spring, Java API or CLI
  • Collection of Regions
[Diagram: JVM, Cache, three Regions]
Concepts
• Region
  • Distributed java.util.Map on steroids (key/value)
  • Consistent API regardless of where or how data is stored
  • Observable (reactive)
  • Highly available, redundant on cache member(s)
[Diagram: JVM, Cache, Region shown as a java.util.Map]
Key Value
K01 May
K02 Tim
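The "map that is observable" idea from the Region slide can be sketched in a few lines. This is not the Geode API: ObservableRegion, add_listener and the callback signature are hypothetical names chosen for illustration, and the sketch runs in a single process with no distribution or redundancy. The first two entries reuse the keys shown on the slide; the third value is made up.

```python
# Illustrative only: a dict-like "region" that notifies listeners on every put().
# Not the Geode API; all names here are hypothetical.

class ObservableRegion:
    def __init__(self, name):
        self.name = name
        self._data = {}
        self._listeners = []

    def add_listener(self, callback):
        """Register a callback invoked as callback(key, old_value, new_value)."""
        self._listeners.append(callback)

    def put(self, key, value):
        old = self._data.get(key)
        self._data[key] = value
        for callback in self._listeners:
            callback(key, old, value)

    def get(self, key):
        return self._data.get(key)

events = []
region = ObservableRegion("example")
region.add_listener(lambda key, old, new: events.append((key, old, new)))

region.put("K01", "May")
region.put("K02", "Tim")
region.put("K01", "Ana")  # made-up update to show the old value being reported
print(events)
```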
Concepts
• Region
  • Local, Replicated or Partitioned
  • In-memory or persistent
  • Redundant
  • LRU
  • Overflow
[Diagram: two JVMs, each with a Cache holding a Region (java.util.Map) with the same entries]
Key Value
K01 May
K02 Tim
LOCAL, LOCAL_HEAP_LRU, LOCAL_OVERFLOW, LOCAL_PERSISTENT, LOCAL_PERSISTENT_OVERFLOW
PARTITION, PARTITION_HEAP_LRU, PARTITION_OVERFLOW, PARTITION_PERSISTENT, PARTITION_PERSISTENT_OVERFLOW, PARTITION_PROXY, PARTITION_PROXY_REDUNDANT, PARTITION_REDUNDANT, PARTITION_REDUNDANT_HEAP_LRU, PARTITION_REDUNDANT_OVERFLOW, PARTITION_REDUNDANT_PERSISTENT, PARTITION_REDUNDANT_PERSISTENT_OVERFLOW
REPLICATE, REPLICATE_HEAP_LRU, REPLICATE_OVERFLOW, REPLICATE_PERSISTENT, REPLICATE_PERSISTENT_OVERFLOW, REPLICATE_PROXY
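Two of the ideas behind these region types, partitioning and LRU eviction, can be sketched in plain Python. Everything here is an assumption made for illustration: LruStore and member_for are hypothetical names, the CRC32-modulo routing is a stand-in and not how Geode assigns keys to members, and the sketch simply drops the evicted entry where an overflow region would move it to disk.

```python
# Illustrative only: hash-based partitioning plus LRU eviction per member.
# Not the Geode implementation; names and routing scheme are hypothetical.
from collections import OrderedDict
import zlib

class LruStore:
    """Keeps at most `capacity` entries, evicting the least recently used one."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)          # mark as most recently used
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # drop the least recently used

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)          # a read also counts as a use
        return self.entries[key]

def member_for(key, member_count):
    """Pick the member that owns `key` using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % member_count

# Three hypothetical members, each holding at most two entries.
members = [LruStore(capacity=2) for _ in range(3)]
for key, value in [("K01", "May"), ("K02", "Tim")]:
    members[member_for(key, len(members))].put(key, value)

owner = members[member_for("K01", len(members))]
print(owner.get("K01"))
```

A replicated region would instead write every entry to every member, trading memory for local reads.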
Concepts
• Member
  • A process that has a connection to the system
  • A process that has created a cache
  • Embeddable within your application
[Diagram: Client, Locator, Server]
Concepts
• Client cache
  • A process connected to the Geode server(s)
  • Can have a local copy of the data
  • Can be notified about events on the servers
[Diagram: Application, Client Cache, GemFire Server, Regions]
Concepts
• Listeners
  • CacheWriter / CacheListener
  • AsyncEventListener (queue / batch)
  • Parallel or Serial
  • Conflation
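Batching and conflation in an asynchronous event queue can be illustrated with a small function. This is a sketch of the general idea only: drain_batches is a hypothetical name, the ticker symbols and prices are made up, and keeping a conflated key at its first-seen position is an arbitrary choice here, not a statement about how Geode orders its queue.

```python
# Illustrative only: batch queued (key, value) events, optionally conflating them.
# Not the Geode API; names and sample data are hypothetical.

def drain_batches(events, batch_size, conflate):
    """Group events into batches; with conflation keep only the latest value per key."""
    if conflate:
        latest = {}
        for key, value in events:
            latest[key] = value            # a newer value overwrites the older one
        events = list(latest.items())
    return [events[i:i + batch_size] for i in range(0, len(events), batch_size)]

queued = [("AAA", 100), ("BBB", 500), ("AAA", 101), ("AAA", 102)]

print(drain_batches(queued, batch_size=2, conflate=False))
print(drain_batches(queued, batch_size=2, conflate=True))
```

With conflation the listener sees one batch of two events instead of two batches of two, at the cost of never observing the intermediate values.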
Apache Geode (incubating)
• Currently under incubation at the Apache Software Foundation
• Contributions and contributors are welcome
• Code and patches
• Bugs, feature requests
• Documentation and content
• Any form of feedback
Apache Geode (incubating)
• Code
  • New features
  • Bug fixes (patches)
  • Writing tests
• Documentation
  • Wiki
  • Web site
  • User guides
• Community
  • Join our mailing lists (ask or answer)
  • Become a speaker
  • Find and report bugs
  • Testing a release candidate or beta
Apache Geode (incubating)
• JIRA - https://issues.apache.org/jira/browse/GEODE
• GitHub - https://github.com/apache/incubator-geode
• Mailing lists:
  • Development - [email protected]
  • Users - [email protected]
• Wiki - cwiki.apache.org/confluence/display/GEODE
• StackOverflow - http://stackoverflow.com/questions/tagged/geode+or+gemfire
Concepts
Runs as a distributed application or as a single node
Concepts
• A stream is composed of modules. Each module is deployed to a container and its channels are bound to the transport.
Concepts
• Web-based REPL
• Iterative & Exploratory
• Support for Data Ingestion
Concepts
• Multi interpreters
  • Markdown
  • Shell
  • Spark
  • Geode
  • Python…
Concepts
• RDD
• DataFrame
• Driver
• Worker
"An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes."
Concepts
• RDD
• DataFrame
• Driver
• Worker
"A dataframe is a distributed collection of rows organized into named columns. An abstraction for selecting, filtering and plotting structured data (pandas), previously known as SchemaRDD."
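The two definitions quoted on these slides can be mimicked without Spark to make the vocabulary concrete. The sketch below is plain Python, not the Spark API: parallelize, filter_rows, select and collect are hypothetical helpers named after the ideas they imitate, the rows are made-up data, and everything runs in one process, whereas Spark would ship each partition to a worker and gather results on the driver.

```python
# Illustrative only: partitioned rows with named columns, filtered and projected.
# Not the Spark API; helper names and sample rows are hypothetical.

def parallelize(rows, num_partitions):
    """Split rows into at most num_partitions partitions (tuples)."""
    rows = list(rows)
    size = max(1, -(-len(rows) // num_partitions))  # ceiling division
    return tuple(tuple(rows[i:i + size]) for i in range(0, len(rows), size))

def filter_rows(partitions, predicate):
    """Filter each partition independently and return new partitions."""
    return tuple(tuple(r for r in part if predicate(r)) for part in partitions)

def select(partitions, *columns):
    """Keep only the named columns of every row."""
    return tuple(tuple({c: r[c] for c in columns} for r in part)
                 for part in partitions)

def collect(partitions):
    """Gather all rows back into one list, as a driver would."""
    return [r for part in partitions for r in part]

rows = [
    {"symbol": "AAA", "price": 10.0},
    {"symbol": "BBB", "price": 25.0},
    {"symbol": "AAA", "price": 11.0},
    {"symbol": "BBB", "price": 26.0},
]
partitions = parallelize(rows, num_partitions=2)
only_aaa = filter_rows(partitions, lambda r: r["symbol"] == "AAA")
result = collect(select(only_aaa, "price"))
print(result)
```

Each helper returns new partitions instead of changing its input, which mirrors the "immutable distributed collection" wording in the RDD quote.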
Summary
• Integration
  • Spark, JDBC, Geode
  • HDFS, Twitter, File, Mail…
• Data pipeline orchestration
• Intuitive DSL
• Streaming & Analytics
• Distributed and scalable
• Web-based REPL
• Multiple Interpreters
  • Apache Spark
  • Markdown
  • Flink
  • Python
  • Geode…
• Iterative & Exploratory
Summary
• Fast data processing
• Columnar queries
• RDDs
• Machine Learning
• Analytics & Streaming
• Fast data store and processing
• In-memory & Persistent
• Highly Consistent
• Transaction processing
• Thousands of concurrent clients
Source Code: http://pivotal-open-source-hub.github.io/StockInference-Spark/