Big Data Ecosystem
Presentation for the 2014 IDC Big Data and Business Intelligence forum in Sofia, Bulgaria, 2014-09-18
Ivo Vachkov
Xi Group Ltd.
Big Data ???
Definition
The 3Vs:
Volume
Velocity
Variety
Added later:
Veracity
Variability
Complexity
Processing Paradigms
Batch Processing
Large volumes
Lower volatility
Incremental updates
Real-time Processing
Smaller volumes
Higher volatility
Possible full regeneration
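A minimal sketch of the two paradigms above, assuming a page-view counter as the workload. The event shape, the function names and the window size of two are hypothetical and chosen only for illustration: the batch view is updated incrementally as new batches arrive, while the real-time view is small enough to be regenerated in full on every event.

```python
from collections import Counter, deque

def update_batch_view(view, new_batch):
    """Batch side: fold a new batch into the existing aggregate (incremental update)."""
    view.update(e["page"] for e in new_batch)
    return view

def regenerate_realtime_view(window):
    """Real-time side: the window is small, so the view is rebuilt from scratch."""
    return Counter(e["page"] for e in window)

# Batch: large, slowly changing aggregate that grows batch by batch.
batch_view = Counter()
update_batch_view(batch_view, [{"page": "/home"}, {"page": "/cart"}])
update_batch_view(batch_view, [{"page": "/home"}])

# Real-time: only the most recent events matter (hypothetical window of 2).
window = deque(maxlen=2)
for event in [{"page": "/home"}, {"page": "/cart"}, {"page": "/cart"}]:
    window.append(event)
    realtime_view = regenerate_realtime_view(window)
```

The trade-off is cost versus freshness: the batch view never rereads old data, while the real-time view can afford to because its input stays small.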
The Data Path
From Collection …
… to Processing …
… to Query:
Consumption
Visualization
[Predictive] Analysis
Monitoring / Validation
ETL, anyone?!
Data Path / Collection
Multiple sources (RDBMS, logs, activity streams, message queues, time series, etc.)
Multiple types (structured, unstructured, free text, bags of words, raw, normalized, etc.)
Collection starts with raw data and produces digital artifacts suitable for machine processing.
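As an illustration of turning raw data into a machine-processable artifact, the sketch below parses one web-server log line into a structured record. The log line, the regular expression and the `collect` helper are assumptions made for this example, not part of the original presentation.

```python
import json
import re

# Hypothetical raw log line in a common web-server style.
RAW = '203.0.113.7 - - [18/Sep/2014:10:15:32 +0300] "GET /index.html HTTP/1.1" 200 5120'

LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

def collect(line):
    """Turn one raw line into a structured record, or None if it does not parse."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])  # numeric fields become integers
    record["size"] = int(record["size"])
    return record

record = collect(RAW)
artifact = json.dumps(record, sort_keys=True)  # serialized form ready for storage
```

Lines that do not parse return `None` instead of raising, so a collector built this way can count and set aside bad input rather than stop.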
Data Path / Collection
Wide variety of components and technologies:
Flat files and binary formats (CSV, Avro, etc.) on a typical file system
Cluster-specific file systems
RDBMS/SQL, NoSQL, NewSQL, MPP DBs, Graph Databases, Document Databases
Column Stores
Key-Value Stores
Time Series Stores
Streaming and transformation engines
Data Path / Processing
Different processing paradigms:
Batch Processing
Real-time Processing
Multiple expected outcomes:
Data
Action
Different destinations:
Data stores
Data-driven Control Planes
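The two expected outcomes can be sketched as one processing step that always returns data and sometimes returns an action. Everything here is hypothetical: the event shape, the 50% error-rate threshold and the action name `alert_on_call` are placeholders for whatever a real control plane would accept.

```python
def process(window):
    """Return a data outcome (an aggregate) and, when warranted, an action outcome."""
    errors = sum(1 for event in window if event["status"] >= 500)
    error_rate = errors / len(window) if window else 0.0

    # Data outcome: destined for a data store.
    data = {"requests": len(window), "error_rate": error_rate}

    # Action outcome: destined for a data-driven control plane (hypothetical name).
    action = "alert_on_call" if error_rate > 0.5 else None
    return data, action

window = [{"status": 200}, {"status": 503}, {"status": 500}]
data, action = process(window)
```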
Data Path / Processing
Smaller number of technologies:
Map / Reduce (Hadoop, CouchDB, MongoDB, Riak)
Cluster Computing (PVM, MPI, LAM, OpenMP, etc.)
HPC / Supercomputing
Data parallelism is the key!
Data locality is important!
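A minimal word-count sketch of the Map / Reduce idea, written in plain Python rather than against any of the products named above. The partitions, function names and the thread pool are assumptions for illustration; the threads only stand in for cluster nodes, each working on the partition it holds.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(partition):
    """Emit (word, 1) pairs for one partition of the input."""
    return [(word.lower(), 1) for line in partition for word in line.split()]

def shuffle(mapped):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}

# Each inner list plays the role of a data block stored on one node.
partitions = [
    ["big data", "data path"],
    ["data locality"],
]

# Data parallelism: the same map function runs on every partition independently.
with ThreadPoolExecutor() as pool:
    mapped = list(pool.map(map_phase, partitions))

word_counts = reduce_phase(shuffle(mapped))
```

Because the map function never looks outside its own partition, the work can be moved to wherever each partition is stored, which is the point of data locality.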
Data Path / Processing
The importance of M/R
Self-hosted solutions:
Apache Hadoop
Cloudera, Hortonworks, etc.
Cloud-based solutions:
AWS EMR (+Data Pipeline, +Kinesis, +S3, +DynamoDB)
Joyent Manta
… many others …
Data Path / Query
Processing creates digital artifacts
Extremely high variety of technologies, components, services to deal with those artifacts:
SQL interfaces on top of NoSQL stores
NoSQL to NoSQL
NoSQL to RDBMS
Output to 3rd party API services
Output to proprietary interfaces
… a lot more …
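One of the paths above, NoSQL to RDBMS, can be sketched with nothing more than the Python standard library. The artifacts below are hypothetical documents of the kind a processing job might emit, and an in-memory SQLite database stands in for the relational store.

```python
import sqlite3

# Hypothetical artifacts, shaped like documents from a key-value or document store.
artifacts = [
    {"page": "/home", "day": "2014-09-17", "hits": 120},
    {"page": "/cart", "day": "2014-09-17", "hits": 30},
    {"page": "/home", "day": "2014-09-18", "hits": 150},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_hits (page TEXT, day TEXT, hits INTEGER)")

# Named placeholders let each document be inserted as-is.
conn.executemany(
    "INSERT INTO page_hits (page, day, hits) VALUES (:page, :day, :hits)",
    artifacts,
)

# Once loaded, the artifacts can be queried with ordinary SQL.
rows = conn.execute(
    "SELECT page, SUM(hits) FROM page_hits GROUP BY page ORDER BY page"
).fetchall()
conn.close()
```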
Data Path / Query
“Query-friendly” stores:
Classical RDBMS, NewSQL
Big Table & Column Stores
Key-Value Stores
Search-oriented services
Visualization:
3rd party services
Tableau
HTML5 / JavaScript Dashboards
Programming languages / Visualization libraries
Data Path / Query
Analysis
Reports
Trends / Predictions
Real-time analytics
Data-driven Control Plane
Classical Business Intelligence
Machine Learning (Mahout)
Data Science (usually a fancy term for Statistics)
Big Data & Monitoring
Infrastructure Monitoring
Well understood
Many products
Full-Stack Application Monitoring
Technical challenges
No “one size fits all” solutions
Data Quality Monitoring
Emerging technologies
Home-grown solutions
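A home-grown data quality check can start very small. The sketch below counts missing values per required field and flags fields above a threshold; the record layout, the `quality_report` helper and the 10% default threshold are assumptions for illustration only.

```python
def quality_report(records, required, max_null_ratio=0.1):
    """Count missing values per required field and flag fields above the threshold."""
    total = len(records)
    report = {}
    for field in required:
        # A value counts as missing when the key is absent, None, or an empty string.
        missing = sum(1 for r in records if r.get(field) in (None, ""))
        ratio = missing / total if total else 0.0
        report[field] = {
            "missing": missing,
            "ratio": ratio,
            "ok": ratio <= max_null_ratio,
        }
    return report

records = [
    {"user": "a", "country": "BG"},
    {"user": "b", "country": ""},
    {"user": "c", "country": None},
    {"user": "d", "country": "DE"},
]
report = quality_report(records, required=["user", "country"])
```

In this sample the `country` field fails the check because half of its values are missing, while `user` passes.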
Big Data & Monitoring
Infrastructure Monitoring
Application Monitoring
Data Quality Monitoring
… a bag of acronyms …
Flume, Scribe, Chukwa, Sqoop, MapReduce, YARN, HDFS, HBase, Pig Latin, Hive, HAWQ, Impala, Presto, Phoenix, Spire, Drill, Storm, Samza, Malhar, Cassandra, Redis, Voldemort, Accumulo, Oozie, Azkaban, Lipstick, Hue, OpenTSDB, Mahout, Giraph, Lily, ZooKeeper, Datameer, Tableau, Pentaho, Sumo Logic, MongoDB, CouchDB, Riak, Pregel, Lucene, Solr, Elasticsearch, Neo4j, OrientDB, Memcached, FoundationDB, …
AWS: Data Pipeline, EMR, Kinesis, DynamoDB, S3, Redshift, ElastiCache, SQS, SWF
Joyent: Manta
Piece of advice …
Collect relevant data!
Collecting data for data’s sake only costs money …
Use the processing technology that best matches your business case!
Hadoop is pointless if your clients only want fast geospatial searches …
Consume wisely!
Knowing that 100% of X is Y means nothing when there is only one X …
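The last point is easy to show with arithmetic: a percentage is only as meaningful as the sample behind it. In the sketch below the `share` helper and the minimum sample size of 30 are arbitrary assumptions, used only to show the number being reported together with its sample size.

```python
def share(matching, total, min_sample=30):
    """Return the percentage and a flag saying whether the sample is large enough."""
    if total == 0:
        return None, False
    return 100.0 * matching / total, total >= min_sample

# "100% of X is Y" with a single X: the number is correct but tells us nothing.
pct_single, reliable_single = share(1, 1)

# The same kind of figure backed by a larger sample.
pct_many, reliable_many = share(45, 50)
```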
Conclusion
Q & A