Big Data Ecosystem

Ivo Vachkov

Xi Group Ltd.

Big Data ???

Definition

The 3Vs:

Volume

Velocity

Variety

Added later:

Veracity

Variability

Complexity

Processing Paradigms

Batch Processing

Large volumes

Lower volatility

Incremental updates

Real-time Processing

Smaller volumes

Higher volatility

Possible full regeneration
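The contrast between the two paradigms can be illustrated with a minimal Python sketch. It is not taken from the slides: `batch_update` and `RealTimeView` are hypothetical names, and event counting stands in for whatever aggregate a real system would maintain. The batch side folds each large chunk into existing totals (incremental update), while the real-time side keeps only a small recent window and rebuilds its view from scratch on every event (full regeneration).

```python
from collections import Counter, deque


def batch_update(totals, new_batch):
    """Batch style: fold a large, stable chunk of events into existing totals."""
    totals.update(Counter(new_batch))
    return totals


class RealTimeView:
    """Real-time style: keep a small recent window and rebuild the view from it."""

    def __init__(self, window_size):
        # deque(maxlen=...) silently drops the oldest event when full
        self.window = deque(maxlen=window_size)

    def observe(self, event):
        self.window.append(event)
        # The window is small, so regenerating the whole view is cheap
        return Counter(self.window)


# Batch: two runs, each merged into the long-lived totals
totals = Counter()
batch_update(totals, ["login", "view", "view"])
batch_update(totals, ["view", "logout"])

# Real-time: the view only ever reflects the last three events
view = RealTimeView(window_size=3)
for event in ["login", "view", "view", "logout"]:
    latest = view.observe(event)
```

The batch totals grow monotonically, whereas the real-time view forgets anything older than its window, which is one way to read "higher volatility".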

The Data Path

From Collection …

… to Processing …

… to Query:

Consumption

Visualization

[Predictive] Analysis

Monitoring / Validation

ETL, anyone?!
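For readers who have not met ETL before, the collection, processing, and query stages map onto extract, transform, and load. The following is a small sketch under stated assumptions: the CSV content, the `purchases` table, and the function names are all invented for illustration, and an in-memory SQLite database stands in for a real data store.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; a real pipeline would read files, logs or queues
RAW_CSV = """customer,country,amount
alice,BG,10.50
bob,bg,4.25
carol,DE,7.00
"""


def extract(text):
    """Collection: turn raw text into dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Processing: normalize country codes and convert amounts to numbers."""
    return [(r["customer"], r["country"].upper(), float(r["amount"])) for r in rows]


def load(conn, records):
    """Load the cleaned records into a query-friendly store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases "
        "(customer TEXT, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

# Query: total amount per country
totals = dict(
    conn.execute("SELECT country, SUM(amount) FROM purchases GROUP BY country")
)
```

Note that the transform step is what makes the query meaningful: without normalizing `bg` to `BG`, the same country would appear as two groups.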

Data Path / Collection

Multiple sources (RDBMS, logs, activity streams, message queues, time series, etc.)

Multiple types (structured, unstructured, free text, bags of words, raw, normalized, etc.)

Collection starts with raw data and produces digital artifacts suitable for machine processing.
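As one illustration of turning raw data into a machine-processable artifact, the sketch below parses a web server access log line into a structured record and serializes it as JSON. The log line, the regular expression, and the `parse_line` helper are assumptions made for this example and are not part of the original material.

```python
import json
import re

# Pattern for a Common Log Format style line (an assumed input format)
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a structured record, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no body was sent; store it as 0
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


raw = '203.0.113.7 - - [10/Oct/2014:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
record = parse_line(raw)

# The JSON string is the "digital artifact" handed to later stages
artifact = json.dumps(record, sort_keys=True)
```

Returning `None` for unparseable lines, rather than raising, lets a collector count and set aside bad input without stopping the stream.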

Data Path / Collection

Wide variety of components and technologies:

Flat files and binary formats (CSV, Avro, etc.) on a typical file system

Cluster-specific file systems

RDBMS/SQL, NoSQL, NewSQL, MPP DBs, Graph Databases, Document Databases

Column Stores

Key-Value Stores

Time Series Stores

Streaming and transformation engines

Data Path / Processing

Different processing paradigms:

Batch Processing

Real-time Processing

Multiple expected outcomes:

Data

Action

Different destinations:

Data stores

Data-driven Control Planes


Smaller number of technologies:

Map / Reduce (Hadoop, CouchDB, MongoDB, Riak)

Cluster Computing (PVM, MPI, LAM, OpenMP, etc.)

HPC / Supercomputing

Data parallelism is the key!

Data locality is important!
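The Map / Reduce model named above can be sketched in a few lines of plain Python. This is a single-process toy, not how Hadoop or any listed product is implemented; the function names are hypothetical. It shows why data parallelism matters: each `map_phase` call touches only its own document, so the calls could run on different nodes, ideally the nodes that already hold the data.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document is mapped independently; this loop is the parallel part
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big clusters", "data locality data"])
```

In a real cluster the shuffle step is where data moves over the network, which is why keeping the map work local to the data pays off.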


The importance of M/R

Self-hosted solutions:

Apache Hadoop

Cloudera, Hortonworks, etc.

Cloud-based solutions:

AWS EMR (+Data Pipeline, +Kinesis, +S3, +DynamoDB)

Joyent Manta

… many others …

Data Path / Query

Processing produces digital artifacts

Extremely high variety of technologies, components, and services to deal with those artifacts:

SQL interfaces on top of NoSQL stores

NoSQL to NoSQL

NoSQL to RDBMS

Output to 3rd party API services

Output to proprietary interfaces

… a lot more …

Data Path / Query

“Query-friendly” stores:

Classical RDBMS, NewSQL

Big Table & Column Stores

Key-Value Stores

Search-oriented services

Visualization:

3rd party services

Tableau

HTML5 / JavaScript Dashboards

Programming languages / Visualization libraries

Data Path / Query

Analysis

Reports

Trends / Predictions

Real-time analytics

Data-driven Control Plane

Classical Business Intelligence

Machine Learning (Mahout)

Data Science (usually a fancy term for Statistics)

Big Data & Monitoring

Infrastructure Monitoring

Well understood

Many products

Full-Stack Application Monitoring

Technical challenges

No “one size fits all” solutions

Data Quality Monitoring

Emerging technologies

Home-grown solutions
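A home-grown data quality check often starts as something like the sketch below: count missing required fields and values outside an expected range. The field names, thresholds, and the `quality_report` function are invented for this example; real checks depend entirely on the data set.

```python
def quality_report(rows, required, ranges):
    """Count missing required fields and out-of-range numeric values."""
    missing = {
        field: sum(1 for row in rows if row.get(field) in (None, ""))
        for field in required
    }
    out_of_range = {
        field: sum(
            1
            for row in rows
            if row.get(field) is not None and not (low <= row[field] <= high)
        )
        for field, (low, high) in ranges.items()
    }
    return {"rows": len(rows), "missing": missing, "out_of_range": out_of_range}


# Hypothetical sensor readings with three deliberate defects
rows = [
    {"sensor": "a", "temp_c": 21.5},
    {"sensor": "", "temp_c": 19.0},     # missing sensor id
    {"sensor": "b", "temp_c": 480.0},   # implausible temperature
    {"sensor": "c", "temp_c": None},    # missing reading
]

report = quality_report(
    rows,
    required=["sensor", "temp_c"],
    ranges={"temp_c": (-50.0, 60.0)},
)
```

Feeding the counts from such a report into the same alerting used for infrastructure monitoring is one simple way to treat data quality as a monitored signal.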


Infrastructure Monitoring


Application Monitoring


Data Quality Monitoring

… a bag of acronyms …

Flume, Scribe, Chukwa, Sqoop, MapReduce, YARN, HDFS, HBase, Pig Latin, Hive, HAWQ, Impala, Presto, Phoenix, Spire, Drill, Storm, Samza, Malhar, Cassandra, Redis, Voldemort, Accumulo, Oozie, Azkaban, Lipstick, Hue, OpenTSDB, Mahout, Giraph, Lily, ZooKeeper, Datameer, Tableau, Pentaho, Sumo Logic, MongoDB, CouchDB, Riak, Pregel, Lucene, Solr, Elasticsearch, Neo4j, OrientDB, Memcached, FoundationDB, …

AWS: Data Pipeline, EMR, Kinesis, DynamoDB, S3, Redshift, ElastiCache, SQS, SWF

Joyent: Manta

A piece of advice …

Collect relevant data!

Collecting data for data’s sake only costs money …

Use the processing technology that best matches your business case!

Hadoop is pointless if your clients only want fast geospatial searches …

Consume wisely!

Knowing that 100% of X is Y means nothing when there is only one X …
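The last point can be made quantitative with a confidence interval for a proportion. The sketch below uses the Wilson score interval, which is my choice of method rather than anything the slides prescribe; `wilson_interval` is a hypothetical helper and `z=1.96` assumes a roughly 95% confidence level.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion observed as successes out of n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges
    return max(0.0, center - margin), min(1.0, center + margin)


# "100% of X is Y" observed on a single X ...
low_one, high_one = wilson_interval(1, 1)

# ... versus the same 100% observed on a thousand X
low_many, high_many = wilson_interval(1000, 1000)
```

With a single observation the lower bound sits far below 100%, while with a thousand observations it stays close to it, so the same headline percentage carries very different weight.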

Conclusion

Q & A
