512.231.6000 - 512.231.6010 fax - www.pervasive.com
Big Data & KNIME
Michael Hoskins, CTO Pervasive Software
KNIME User Conf, Zurich, 1 February 2012
Big Data and the Digital Data Revolution
• "Every two days we create as much information as we did from the dawn of civilization until 2003" – Eric Schmidt, Google, 2010
How Big? Surging to Exabytes
Data Inflation
• Where is all this Big Data coming from?
The Internet is a Driver
The Real Culprit: an Internet of Things (aka Machine-Generated Data)
• What to do with all this Big Data?
Analyze it!
Using Machine Learning Techniques
• Association rule learning
• Classification
• Cluster analysis
• Crowdsourcing
• Data fusion and data integration
• Data mining
• Ensemble learning
• Genetic algorithms
• Natural language processing (NLP)
• Neural networks
• Network analysis
• Optimization
• Pattern recognition
• Predictive modeling
• Regression
• Sentiment analysis
• Signal processing
• Spatial analysis
• Statistics
• Supervised learning
• Simulation
• Time series analysis
• Unsupervised learning
• Visualization
To Predict the Future
• What does Big Data mean to you and KNIME?
Big Data means a new Data Science
• What is Pervasive doing about this?
Introducing Pervasive DataRush™
DataRush is a parallel dataflow platform that eliminates performance bottlenecks in your data-intensive applications.
• Scalable: Performance scales dynamically with increased core and server counts, with no change to the code.
• High Throughput: Patented parallel dataflow technology enables fast, deep analysis of large data sets with no limit on input data size.
• Cost Efficient: Fully exploits commodity multicore servers, saving significant capital and energy costs through efficient node utilization.
• Easy to Implement: DataRush takes care of complex parallel processing issues at design time: it hides threading complexity, avoids deadlocks, and runs on any platform, including Hadoop.
• Extensible: DataRush is a component-based platform with an open API, so you can easily extend it for your own needs.
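To make the "parallel dataflow" idea above concrete, here is a minimal sketch in plain Python. It is not the DataRush API: the names `source`, `transform`, `sink` and `run_pipeline` are hypothetical and chosen only for illustration. Each operator runs in its own thread and passes records downstream through a bounded queue, so stages overlap in time and the user-supplied function never touches threads directly.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the record stream


def source(records, out_q):
    """Push every record downstream, then signal end of stream."""
    for record in records:
        out_q.put(record)
    out_q.put(_DONE)


def transform(fn, in_q, out_q):
    """Apply fn to each record as it arrives and forward the result."""
    while True:
        item = in_q.get()
        if item is _DONE:
            out_q.put(_DONE)  # propagate end of stream
            return
        out_q.put(fn(item))


def sink(in_q, results):
    """Collect records until the end-of-stream sentinel arrives."""
    while True:
        item = in_q.get()
        if item is _DONE:
            return
        results.append(item)


def run_pipeline(records, fn):
    """Wire source -> transform -> sink and run the stages concurrently."""
    # Bounded queues give back-pressure: a fast producer blocks instead
    # of buffering the whole input in memory.
    q1 = queue.Queue(maxsize=64)
    q2 = queue.Queue(maxsize=64)
    results = []
    stages = [
        threading.Thread(target=source, args=(records, q1)),
        threading.Thread(target=transform, args=(fn, q1, q2)),
        threading.Thread(target=sink, args=(q2, results)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return results
```

A call such as `run_pipeline(range(5), lambda x: x * x)` streams five records through the three stages. Note that CPython threads will not speed up CPU-bound work because of the global interpreter lock; the sketch shows only the structure (operators connected by queues), not the performance characteristics claimed on the slide.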
Pervasive DataRush Plug-in for KNIME
[Diagram labels: DataRush Plug-Ins; Drag and Drop; High performance nodes; DataRush for KNIME; Predictive Analytics]
Genomic Analysis: Align and Assemble
Scalable Predictive Analytics
Demo of Big Data in DataRush for KNIME
• KNIME with distributed (next-gen v6) DataRush, reading more than 120 million historical airline flight records at scale from native HDFS on our test Hadoop cluster, then performing a Linear Regression and Visualization.
Runtime = 47 seconds!
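The slide does not say how the regression is distributed. One common way to fit a linear regression over data too large for one machine is to compute small summary statistics per data chunk and merge them. The sketch below assumes that approach for a single predictor; it is not taken from DataRush or KNIME, and the names `chunk_stats`, `merge_stats` and `fit_line` are hypothetical.

```python
from functools import reduce


def chunk_stats(chunk):
    """Summarise one chunk of (x, y) pairs as (n, sum_x, sum_y, sum_xx, sum_xy)."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in chunk:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def merge_stats(a, b):
    """Combine two summaries; addition is associative, so merge order is free."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(chunks):
    """Ordinary least squares for y = slope * x + intercept over all chunks."""
    # Each chunk_stats call is independent, so in a real system the map
    # step could run on different cores or cluster nodes.
    n, sx, sy, sxx, sxy = reduce(merge_stats, map(chunk_stats, chunks))
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("x values have zero variance; slope is undefined")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept
```

For example, `fit_line([[(0, 1), (1, 3)], [(2, 5), (3, 7)]])` treats the two inner lists as chunks that could live on different nodes. Because only five numbers per chunk travel between workers, the amount of data exchanged stays constant no matter how many records each chunk holds.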