clearstorydata.com
Using Spark and Shark for Fast Cycle Analysis on Diverse Data
12.2.13
Vaibhav Nivargi
Analysis in the New Data Landscape
New use cases seen in all industries.
• Live situational analysis that requires fast cycles across internal data and external data sources
• Multi-source analysis whose insights refresh as the source data evolves
• Large-scale analysis of structured and unstructured data combined in integrated insights
Example: Interactive Multi-source Analysis
More data and more people change the analysis.
• Facebook: Shares, Likes, Comments
• News Coverage: Online, Print, Television
• Twitter: Followers, Tweets, Retweets
• Donations: New Members, Donations
• Website Traffic: Traffic, Referrals, Content
• Corporate Sponsors: Corporate Engagement, New Inquiries
Data Intelligence: Interactive analysis on diverse internal & external data
Today’s Need is Speed, Scale & Ad Hoc Flexibility
With more sources, more data and more people.
Why Spark and Shark?
• RDDs
  – Low latency & scale
  – Iterative and interactive computation
• Lineage and fault tolerance
  – Able to re-derive data
• Expressive power of Scala and SQL
  – Operations beyond aggregations, joins, and statistical operators
  – Advanced: ML, data mining, segmentation, approximate queries, graphs …
• Support for structured and semi-structured data
• BDAS Stack & AMPLab
  – Tachyon, MLBase, BlinkDB, GraphX …
• Community and adoption
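The "lineage and fault tolerance" point can be illustrated with a toy model. The sketch below is plain Python, not Spark's API: the class name LineageDataset and its methods are hypothetical and exist only for this example. Each dataset remembers its parent and the transformation that produced it, so a lost in-memory copy can be re-derived instead of restored from a replica.

```python
from typing import Callable, List, Optional


class LineageDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API).

    It records how it was derived (parent + transform), not just its data.
    """

    def __init__(self,
                 source: Optional[Callable[[], List[int]]] = None,
                 parent: Optional["LineageDataset"] = None,
                 transform: Optional[Callable[[List[int]], List[int]]] = None):
        self.source = source          # how to load root data
        self.parent = parent          # upstream dataset, if derived
        self.transform = transform    # how to derive rows from the parent
        self._cached: Optional[List[int]] = None

    def map(self, fn):
        return LineageDataset(parent=self,
                              transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return LineageDataset(parent=self,
                              transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self) -> List[int]:
        # Recompute from lineage whenever the in-memory copy is missing.
        if self._cached is None:
            if self.parent is None:
                self._cached = list(self.source())
            else:
                self._cached = self.transform(self.parent.collect())
        return list(self._cached)

    def evict(self) -> None:
        # Simulate losing the in-memory copy (for example, a failed node).
        self._cached = None


base = LineageDataset(source=lambda: [1, 2, 3, 4, 5, 6])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()
evens_squared.evict()              # the cached result is "lost"
second = evens_squared.collect()   # re-derived by replaying the lineage
```

The design point is that only the recipe has to survive a failure; the data itself is disposable.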
The ClearStory Solution
Diagram columns: Data Sources, ClearStory Platform, ClearStory Application
• Data Inference & Profiling
• Harmonization
• Visualization
• Collaboration
• In-Memory Data Units
Where do Spark & Shark fit?
Data sources: Public, Premium, Web, RDBMS, Hadoop, Files
• ClearStory API
• User Application
• Data Access, Inference and Lineage
• Data Source API
• Spark Cluster + ClearStory IP
• Harmonization Engine and Blended Data Processing
How we leverage Spark & Shark
• User intent captured and translated to custom API
• Harmonization-as-a-Service
• Manages Spark and Shark query execution
• Read cached data from HDFS
• RESTful
• Merges datasets (RDDs) on the fly – on user request
• Support conversion of user actions to backend queries
• Query optimizations
• Performance optimizations
• Mixed-mode execution (sql2rdd & spark native)
• Caching
• Pre-computation
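The idea behind merging datasets on request and "mixed-mode execution" can be sketched by analogy. The example below uses Python's built-in sqlite3 module rather than Shark or Spark, so it shows the pattern only, not ClearStory's implementation: SQL handles the relational blend of two sources, and native code continues from the query result. The table names and sample values are made up for illustration.

```python
import sqlite3

# In-memory database standing in for two already-loaded sources.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE donations (day TEXT, amount REAL)")
conn.execute("CREATE TABLE tweets (day TEXT, retweets INTEGER)")
conn.executemany("INSERT INTO donations VALUES (?, ?)",
                 [("mon", 100.0), ("tue", 250.0), ("wed", 60.0)])
conn.executemany("INSERT INTO tweets VALUES (?, ?)",
                 [("mon", 10), ("tue", 40), ("wed", 5)])

# Step 1 (SQL mode): blend the two sources on a shared key.
rows = conn.execute(
    "SELECT d.day, d.amount, t.retweets "
    "FROM donations d JOIN tweets t ON d.day = t.day "
    "ORDER BY d.day"
).fetchall()


# Step 2 (native mode): keep processing the query result in ordinary code.
def amount_per_retweet(row):
    day, amount, retweets = row
    return day, amount / retweets


ratios = dict(amount_per_retweet(r) for r in rows)
best_day = max(ratios, key=ratios.get)
conn.close()
```

Splitting the work this way keeps joins and aggregations declarative while leaving room for operations that are awkward to express in SQL.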
How we leverage Spark & Shark
• Query results returned to the application for scalable visualization and ClearStory-specific viz techniques
• RDDs cached/un-cached and materialized at strategic points based on usage patterns and signals
• Data updates automatically processed as source data changes
• ClearStory’s own deployment, packaging, and integrated monitoring for operations at scale
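Caching "based on usage patterns and signals" could take many forms, and the slides do not say which signals are used. As one minimal sketch, assume the only signal is how often a result has been requested: a result is kept in memory once it has been asked for a threshold number of times. The class name UsageBasedCache and the min_hits parameter are hypothetical.

```python
from collections import Counter


class UsageBasedCache:
    """Keep a result in memory only after it has been requested often enough.

    Hypothetical policy for illustration; the request count is the only signal.
    """

    def __init__(self, compute, min_hits=2):
        self.compute = compute      # function: key -> result
        self.min_hits = min_hits    # requests needed before caching
        self.hits = Counter()       # usage signal per key
        self.store = {}             # materialized results
        self.computations = 0       # how often compute() actually ran

    def get(self, key):
        self.hits[key] += 1
        if key in self.store:
            return self.store[key]
        self.computations += 1
        value = self.compute(key)
        if self.hits[key] >= self.min_hits:
            self.store[key] = value
        return value

    def uncache(self, key):
        # Drop a materialized result, for example when source data changes.
        self.store.pop(key, None)


cache = UsageBasedCache(lambda name: name.upper(), min_hits=2)
results = [cache.get("donations") for _ in range(3)]
```

With min_hits=2, the first two requests both compute the result and the second one stores it, so the third request is served from memory.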
Spark Developments – What We Like
• Query cancellation, progress indication (0.8.1 and beyond)
• More performance breakthroughs
• Workload Management
• BlinkDB
• MLBase
• Tachyon
• GraphX
We’re Hiring!
• Working with the community, giving back
• Lots of exciting new developments
• This is like the early days of Hadoop – massive momentum gathering
The First Spark Summit!
More Meet-ups!