hbasecon 2012 | getting real about interactive big data management with lily & hbase - ngdata
DESCRIPTION
HBase brings interactivity to Hadoop, and allows users to collect, manage and process data in real-time. Lily wraps HBase and Solr in a comprehensive Big Data platform, with HBase-native secondary indexing complementing ad-hoc structured search. Through spare write-cycles during read operations, Lily transforms HBase into a scalable data management engine providing interactive analytics, profile harvesting and real-time recommendations. This talk highlights the architecture of Lily, how it completes HBase, and explains some of its implementation use cases.
TRANSCRIPT
WWW.NGDATA.COM
Making Sense of Data
Lily goes shopping – real-time recommendations with HBase
HBaseCon, May 2012
Steven Noels – VP Product – @stevenn
• HBase-backed data repository, with batteries included
• Data model:
• high-level data model on top of HBase’s byte[]’s
• schema
• versioning (schema and data)
• links, variants
• Java & REST APIs
• Indexing:
• through configuration, not implementation
• incremental and batch index maintenance
• RowLog: distributed, durable queue for secondary actions
• Open Source: www.lilyproject.org (Apache License)
Lily Core 2’ recap
HBase
Lily
Solr et al.
RowLog
client app
• BigTable model
• sparseness
• atomic row updates aka consistency
• auto-partitioning
• Apache license
• A great community led by a Saint :-)
Why HBase?
Portfolio Overview
Schema and Data Management
Total Data Aggregation
Real-time Index and Retrieval
Security and Enterprise Connectors
Profile Development
Context and Activity Tracking
Social Stream Ingestion
Real-time AI Recommendations
Industry algorithms and rules
Trend Analytics
Pattern Detection
open source
commercial availability
Some of the larger Lily deployments
• media
• aggregation, database publishing and online archives
• finance
• real-time identity fraud detection
• retail banking
• contextualized (time+loc+person) mobile coupons
• retail
• e-commerce platform: product catalog, consumer data store, real-time indexing
Lily (=HBase) In Use
Collaborative Filtering?
Recommend items similar to a user’s highly-preferred items
Collaborative Filtering is … Matrixes
Sean likes “Scarface” a lot
Robin likes “Scarface” somewhat
Grant likes “The Notebook” not at all
…
(123,654,5.0)
(789,654,3.0)
(345,876,1.0)
…
(345,654,4.5)
…
(Magic)
Grant may like “Scarface” quite a bit …
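The "(Magic)" step can be made concrete with a small sketch. The slides do not say which algorithm is used here, so the code below assumes a plain item-based approach: the estimate for a (user, item) pair is the average of the user's known preferences, weighted by cosine similarity between items. The extra triples and the function names `item_vectors`, `cosine` and `estimate` are hypothetical, added only so the example has enough data to run.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical (user_id, item_id, preference) triples, same shape as on the slide.
PREFS = [
    (123, 654, 5.0), (789, 654, 3.0), (345, 876, 1.0),
    (123, 876, 1.0), (789, 876, 2.0), (345, 111, 4.0), (123, 111, 4.5),
]

def item_vectors(prefs):
    """Group the triples into item_id -> {user_id: preference}."""
    vectors = defaultdict(dict)
    for user, item, value in prefs:
        vectors[item][user] = value
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse {user_id: preference} vectors."""
    dot = sum(value * b[user] for user, value in a.items() if user in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def estimate(prefs, user, item):
    """Similarity-weighted average of the user's known preferences, or None."""
    vectors = item_vectors(prefs)
    if item not in vectors:
        return None
    weighted, total = 0.0, 0.0
    for other, ratings in vectors.items():
        if other == item or user not in ratings:
            continue  # only items this user has already expressed a preference for
        sim = cosine(vectors[item], ratings)
        if sim > 0:
            weighted += sim * ratings[user]
            total += sim
    return weighted / total if total else None
```

Calling `estimate(PREFS, 345, 654)` yields a value between the lowest and highest preference user 345 has expressed; it returns `None` when the user or the item is unknown.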
Personalized offers
Contextualized recommendations
Item Activity Profile
creditcard statements
shops & merchants
product families
offers/coupons
Lily Core Repository
Fitting Recommendations into the Lily Architecture
indexes
activity store
rowlog
LILY CRUD API
data, activity, profile scoring
co-occurrence lookup matrix
read/write demultiplexer
LILY recommender engine
Steven Noels
[email protected]
www.ngdata.com
telephone: +32 9 33 88 220
Gent (Belgium)
Makers of
ALS
k-means
propensity
custom ...
algorithm support
data store
profile store
Lily/HBase Secondary Indexes
• Transaction-based preferencing
• Pluggable preference strategies, using Lily-based data (HBase&Solr) for decision making
• e.g. credit card statement = transactions between users and product families
• Preference weighting
• Ingest: REST API, bulk support
• Real-time updating of the recommendation model
• Profile Store
• Profile activities can be preferenced
• Support for Profile behavior analysis
Preferencing aka Feeding the Matrix
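As a rough illustration of transaction-based preferencing with weighting and real-time model updates, the sketch below turns individual activities into accumulated preferences and updates a co-occurrence matrix incrementally on every write. It is an in-memory stand-in, not Lily's API: the class `PreferenceModel`, its `record` method and the activity weights are all assumptions made for the example.

```python
from collections import defaultdict

class PreferenceModel:
    """Hypothetical in-memory model: weighted preferences plus co-occurrence counts."""

    def __init__(self, weights):
        self.weights = weights                    # activity type -> preference weight
        self.prefs = defaultdict(float)           # (user, family) -> accumulated preference
        self.families_by_user = defaultdict(set)  # user -> product families seen so far
        self.cooccurrence = defaultdict(int)      # (family_a, family_b) -> shared users

    def record(self, user, family, activity):
        """Apply one transaction; activities without a weight are ignored."""
        weight = self.weights.get(activity, 0.0)
        if weight == 0.0:
            return
        self.prefs[(user, family)] += weight
        seen = self.families_by_user[user]
        if family not in seen:
            # First time this user touches the family: update the matrix in place
            # instead of recomputing it in a batch job.
            for other in seen:
                self.cooccurrence[(family, other)] += 1
                self.cooccurrence[(other, family)] += 1
            seen.add(family)
```

A credit card statement would then be replayed as a series of calls such as `model.record("u1", "coffee", "purchase")`, one per transaction between a user and a product family.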
• Recommender
• Pluggable recommender strategies, using Lily-based data (HBase&Solr) for decision making
• Multi-model support: user-item & item-user recommendations
• Estimation of both preferenced and non-preferenced items
• Geolocation-based recommendations
• Re-scoring
• REST API
• (Planned)
• Support for Classifications (scenario - Recommend me all (possible) coffee drinkers)
• Matrix / recommendation indexing
Making recommendations
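Geolocation-based re-scoring could look roughly like the sketch below: candidate recommendations are dropped when they are too far from the user, and the remaining scores are damped by distance. The slides do not describe the actual strategy, so the damping formula, the parameters `max_km` and `half_km`, and the function names are assumptions for illustration only.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

def rescore(candidates, here, max_km=100.0, half_km=10.0):
    """Re-rank (item, score, location) candidates by proximity to `here`.

    Candidates farther away than max_km are dropped; a candidate at half_km
    keeps half of its original score (hypothetical damping rule).
    """
    rescored = []
    for item, score, location in candidates:
        distance = haversine_km(here, location)
        if distance > max_km:
            continue
        rescored.append((item, score * half_km / (half_km + distance)))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

With this rule a nearby offer with a slightly lower base score can outrank a distant one, which is the kind of contextualization (time, location, person) the mobile coupon use case calls for.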
• Secondary indexes (= Lily Core!)
• indexes are defined through configuration
• single or multi-field indexes
• range queries and prefix queries
• asc or desc sorted results
• can read huge, sorted lists
• synchronously updated: index updates are applied by rowlog secondary actions
• online building of new indexes (no table locks)
• MapReduce integration
• SolrCloud integration
• Index shards and configuration managed through ZooKeeper
Other upcoming Lily Features
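The query features listed above (multi-field indexes, range and prefix queries, ascending or descending results) all follow from keeping index entries in sorted key order. The sketch below shows the idea with an in-memory sorted list; it is an analogy only, not Lily's implementation, and the class `SortedIndex` with its `put`, `range` and `prefix` methods is hypothetical.

```python
import bisect

class SortedIndex:
    """Hypothetical in-memory index: entries kept sorted by (field values, row id)."""

    def __init__(self):
        self._entries = []  # sorted list of (field_values_tuple, row_id)

    def put(self, values, row_id):
        """Add an entry for a row; `values` holds one value per indexed field."""
        bisect.insort(self._entries, (tuple(values), row_id))

    def range(self, low, high, descending=False):
        """Row ids whose key satisfies low <= key < high, in key order."""
        # A 1-tuple (key,) sorts just before every (key, row_id) entry with that key.
        start = bisect.bisect_left(self._entries, (tuple(low),))
        end = bisect.bisect_left(self._entries, (tuple(high),))
        rows = [row for _, row in self._entries[start:end]]
        return rows[::-1] if descending else rows

    def prefix(self, leading, descending=False):
        """Row ids whose key starts with the given leading field values."""
        leading = tuple(leading)
        start = bisect.bisect_left(self._entries, (leading,))
        rows = []
        for key, row in self._entries[start:]:
            if key[:len(leading)] != leading:
                break  # matching keys are contiguous in sorted order
            rows.append(row)
        return rows[::-1] if descending else rows
```

For an index on (product family, price), `idx.prefix(("shoes",))` returns all shoe rows ordered by price, and `descending=True` reverses that order without a separate sort step.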
Making Sense of Data
Questions? Thank you!