Engineering Patterns for Implementing Data Science Models on Big Data Platforms
TRANSCRIPT
Hisham Arafat, Digital Transformation Lead Consultant, Solutions Architect, Technology Strategist & Researcher
Riyadh, KSA – 31 January 2017
http://www.visualcapitalist.com/what-happens-internet-minute-2016/
Big Data…Practical Definition!
• Big Data is the challenge, not the solution
• Big Data technologies address that challenge
• Practically:
  • Massive streams
  • Unstructured
  • Complex processing
Let’s Have a Use Case…Social Marketing
Social Marketing…Looks Simple!
Ingest Social Feeds
Build Corpus Metrics
Design Text Mining Model
Deploy All to a Big Data Platform
Application for Marketing Users
What are people saying about our new brand “LemaTea”?
It’s NOT as Easy as It Looks!
Not Only Building an Appropriate Model, but Also Designing a Solution…
Engineering Factors
• Interfacing with sources: REST APIs, source HTML,… (text is assumed)
• Parsing to extract: queries, regular expressions,…
• Crawling frequency: every 1 minute, every 1 hour, on event,…
• Document structure: post, post + comments, #, Reach, Retweets,…
• Metadata: time, date, source, tags, authoritativeness,…
• Transformations: canonicalization, weights, tokenization,…
- Size: average size of 2 KB / doc
- Initial load: 1.5B docs
- Frequency: every 5 minutes
- Throughput: 2 KB × 60,000 docs = 120 MB / load
- Grows per day ~ 34 GB
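The sizing above is simple back-of-envelope arithmetic; a minimal sketch, assuming the slide's figures (2 KB/doc, 60,000 docs arriving every 5 minutes) and decimal units:

```python
# Back-of-envelope ingestion sizing for the social-feed use case.
# The input figures come from the slide; everything else is arithmetic.
DOC_KB = 2
DOCS_PER_LOAD = 60_000
LOADS_PER_DAY = 24 * 60 // 5        # one load every 5 minutes -> 288 loads/day

load_mb = DOC_KB * DOCS_PER_LOAD / 1000       # 120 MB per load
daily_gb = load_mb * LOADS_PER_DAY / 1000     # ~34.6 GB growth per day
docs_per_day = DOCS_PER_LOAD * LOADS_PER_DAY  # ~17.3M new docs per day

print(load_mb, round(daily_gb, 1), docs_per_day)
```

The slide rounds the daily growth down to ~34 GB; the ~17M docs/day figure reappears in the corpus-sizing slides that follow.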
Engineering Factors
• Input format: text, encoded text,…
• Document representation: bag of words, ontology,…
• Corpus structures: indexes, reverse indexes,…
• Corpus metrics: doc frequency, inverse doc frequency,…
• Preprocessing: annotation, tagging,…
• Files structure: tables, text files, files-day,…
- No of docs: 1.5B + 17M / day
- Processing window: 60K docs per 3 mins
- Processing rate: 20K docs per min
- Final doc size = 2 KB × 5 ~ 10 KB
- Scan rate: 20K × 10 KB / min ~ 200 MB / min
- Many overheads need to be added
Engineering Factors
• Dimensionality reduction: stemming, lemmatization, noisy words…
• Type of applications: search/retrieval, sentiment analysis,…
• Modeling methods: classifiers, topic modeling, relevance,…
• Model efficiency: confusion matrix, precision, recall,…
• Overheads: intermediate processing, pre-aggregation,…
• Files structure: tables, text files, files-day,…
- No of docs: 1.5B + 17M / day
- Search for “LemaTea sweet taste”
- No of tf to calculate ~ 1.5B × 3 ~ 4.5B
- No of idf to calculate ~ 1.5B
- Total calculations for 1 search ~ 6B
- Consider daily growth
Engineering Factors
• Files structure: tables, text files, files-day,…
• Files formats: HDFS, Parquet, Avro,…
• Platform technology: Hadoop/YARN, Spark, Greenplum, Flink,…
• Model deployment: Java/Scala, Mahout, MLlib, MADlib, PL/R, FlinkML,…
• Data ingestion: Spring XD, Flume, Sqoop, G. Data Flow, Kafka/Streaming,…
• Ingestion pattern: real-time, micro batches,…
- Overall storage
- Processing capacity per node
- No of nodes
- Tables → Hive, HBase, Greenplum
- Individual files → Spark, Flink
- Files-day → Hadoop HDFS
Engineering Factors
• Workload: no of requests, request size,…
• Application performance: response time, concurrent requests,…
• Applications interfacing: REST APIs, native, messaging,…
• Application implementation: integration, model scoring,…
• Security model: application level, platform level,…
- For 3 search terms ~ 6B calculations
- For 5 search terms ~ 9B calculations
- For 10 concurrent requests ~ 75B
- Resource queuing / prioritization
- Search options like date range
- Access control model
Engineering Factors
Ongoing Process…Growing Requirements
What if?
• New sources are included
• Wider parsing criteria
• Advanced modeling: POS, word co-occurrence, co-referencing, named entity, relationship extraction,…
• Better response time is needed
• More frequent ingestion
[Diagram: a dynamic platform cycle – Ingestion, Corpus Processing, Model Processing, Requests Processing]
• Larger number of docs
• Increased processing requirements
• Platform expansion
• Overall architecture reconsidered
Some Building Blocks
What is a Data Science Model?
• Type & format of input data
• Data ingestion
• Transformations and feature engineering
• Modeling methods and algorithms
• Model evaluation and scoring
• Application implementation considerations
• In-Memory vs. In-Database
Key Challenges for Data Science Models
Volume → Growth → Scale-out performance
Stationary → Streams → Data flow engines
Batches → Real-time → Event processing
Structured → Unstructured → Complex formats
Insights → Responsive → Perspective / deep models
Traditional Data Management Systems
• Shared I/O
• Shared Processing
• Limited Scalability
• Service Bottlenecks
• High Cost Factor
[Diagram: a database cluster where a single database service and shared buffers sit between users and the data files, with all I/O funnelled over the network]
Abstraction of Big Data Platforms
• Parallel Processing
• Shared Nothing
• Linear Scalability
• Distributed Services
• Lower Cost Factor
[Diagram: master nodes holding metadata (with a standby) and data nodes 1…n on a network interconnect; each data node has its own I/O and stores user data / replicas, with direct access to user data]
In a Nutshell
Source: http://dataconomy.com/2014/06/understanding-big-data-ecosystem/
• Very huge.
• Overlaps.
• Overloading.
• You need to start with a use case to get your solution well engineered.
Engineered Systems
• Packaged: Hortonworks – Pivotal – Cloudera
• Appliances: EMC DCA – Dell DSSD – Dell VxRack
• Cloud offerings: Azure – AWS – IBM – Google Cloud
Engineering Patterns in Implementation
Lambda Architecture…Social Marketing
• Generic, scalable and fault-tolerant data processing architecture.
• Keeps a master immutable dataset while serving low-latency requests.
• Aims at providing linear scalability.
Source: http://lambda-architecture.net/
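The batch/speed split at the heart of the Lambda Architecture can be shown with a toy counter; the events, views, and full-recompute batch step below are a minimal sketch of the idea, not a production design:

```python
# Toy Lambda Architecture: an immutable master dataset feeds a periodically
# recomputed batch view, a speed layer covers events since the last batch
# run, and queries merge both. Events and counters are illustrative only.
from collections import Counter

master_dataset = []       # immutable, append-only log of brand mentions
batch_view = Counter()    # recomputed from the full master dataset
speed_view = Counter()    # incremental counts since the last batch run

def ingest(event):
    master_dataset.append(event)   # never mutate, only append
    speed_view[event] += 1         # low-latency incremental update

def run_batch():
    global batch_view, speed_view
    batch_view = Counter(master_dataset)  # full recompute from master data
    speed_view = Counter()                # speed layer resets after the run

def query(term):
    return batch_view[term] + speed_view[term]  # merge batch + speed views

for e in ["LemaTea", "LemaTea", "tea"]:
    ingest(e)
run_batch()
ingest("LemaTea")  # arrives after the batch run, served by the speed layer
print(query("LemaTea"))  # 3: two from the batch view + one from the speed view
```

Because the batch view is always recomputed from the immutable master dataset, bugs and late data are corrected on the next batch run while the speed layer keeps answers fresh in between.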
Social Marketing…Revisited
Ingest Social Feeds
Build Corpus Metrics
Design Text Mining Model
Deploy All to a Big Data Platform
Application for Marketing Users
What are people saying about our new brand “LemaTea”?
Lambda Architecture (cont.)
Source: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
Lambda Architecture (cont.)
Source: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
Lambda Architecture (cont.)
Source: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
Sequence Files
Apache Spark / MLlib
• In-memory distributed processing
• Scala, Python, Java and R
• Resilient Distributed Dataset (RDD)
• MLlib – machine learning algorithms
• SQL and DataFrames / Pipelines
• Streaming
• Big graph analytics
Spark Cluster Mesos HDFS/YARN
Apache Spark
• Supports different types of cluster managers
• HDFS / YARN, Mesos, Amazon S3, Stand Alone, HBase, Cassandra,…
• Interactive vs Application Mode
• Memory Optimization
Source: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-architecture.html
Apache Spark
Apache Spark MLlib
Apache Spark…The Big Picture
Source: https://www.datanami.com/2015/11/30/spark-streaming-what-is-it-and-whos-using-it/
Greenplum / MADlib
• Massively Parallel Processing
• Shared Nothing
• Table distribution
  • By key
  • By round robin
• Massively parallel data loading
• Integration with Hadoop
• Native MapReduce
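The two distribution policies above correspond to Greenplum's `DISTRIBUTED BY (key)` and `DISTRIBUTED RANDOMLY` clauses; a toy sketch of the difference, with the segment count, hash function, and sample rows all illustrative assumptions:

```python
# Sketch of by-key vs round-robin row placement across segments.
# Segment count, hash function, and rows are illustrative only.
N_SEGMENTS = 4
rows = [("LemaTea", 5), ("GreenTea", 3), ("LemaTea", 7), ("Chai", 1)]

# By key: the same key always lands on the same segment, which lets
# joins and group-bys on that key run locally on each segment.
by_key = {s: [] for s in range(N_SEGMENTS)}
for key, val in rows:
    by_key[hash(key) % N_SEGMENTS].append((key, val))

# Round robin: rows spread evenly regardless of content, which avoids
# skew when a few keys dominate the data.
round_robin = {s: [] for s in range(N_SEGMENTS)}
for i, row in enumerate(rows):
    round_robin[i % N_SEGMENTS].append(row)

# Under by-key placement, all "LemaTea" rows share exactly one segment.
lema_segments = {s for s, rws in by_key.items() for k, _ in rws if k == "LemaTea"}
print(len(lema_segments))  # 1
```

Choosing the distribution key is one of the quantifiable engineering decisions from the earlier slides: a good key co-locates join partners, a skewed one turns a single segment into the bottleneck.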
Apache MADlib
Image Processing…Unusual Way
Massively Parallel, In-Database Image Processing
Source: https://content.pivotal.io/blog/data-science-how-to-massively-parallel-in-database-image-processing-part-1
Take Aways
• Data science is not just the algorithms; it is an end-to-end solution.
• The implementation should consider engineering factors and quantify them so appropriate components can be selected.
• The Big Data technology landscape is really huge and growing – start with a solid use case to identify potential components.
• Abstracting away from any specific technology lets you weigh the pros and cons.
• Apply creativity in solution design and technology selection, case by case.
• Lambda Architecture, Spark, Spark MLlib, Spark Streaming, Spark SQL, Kafka, Hadoop / YARN, Greenplum, MADlib.
Q & A
Email: [email protected]
Twitter: hichawy
LinkedIn: https://eg.linkedin.com/in/hisham-arafat-a7a69230
Thank You