big data ecosystem

Upload: ivo-vachkov

Post on 01-Jul-2015

DESCRIPTION

Presentation for the 2014 IDC Big Data and Business Intelligence forum in Sofia, Bulgaria, 2014-09-18

TRANSCRIPT

Page 1: Big Data Ecosystem

Big Data Ecosystem

Ivo Vachkov

Xi Group Ltd.

Page 2: Big Data Ecosystem

Big Data ???

Definition

The 3Vs:

Volume

Velocity

Variety

Added later:

Veracity

Variability

Complexity

Page 3: Big Data Ecosystem

Processing Paradigms

Batch Processing

Large volumes

Lower volatility

Incremental updates

Real-time Processing

Smaller volumes

Higher volatility

Possible full regeneration

Page 4: Big Data Ecosystem

The Data Path

From Collection …

… to Processing …

… to Query:

Consumption

Visualization

[Predictive] Analysis

Monitoring / Validation

ETL, anyone?!

Page 5: Big Data Ecosystem

The Data Path

Page 6: Big Data Ecosystem

Data Path / Collection

Multiple sources (RDBMS, logs, activity streams, message queues, time series, etc.)

Multiple types (structured, unstructured, free text, bags of words, raw, normalized, etc.)

Collection starts with raw data and produces digital artifacts suitable for machine processing.
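
As a minimal sketch of that collection step, the example below turns one raw web-server log line into a structured record. The log format, the field names and the `collect` function are assumptions made for illustration; they are not taken from the presentation.

```python
import json
import re
from datetime import datetime, timezone

# Assumed raw input: an access-log line in the common log format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def collect(raw_line):
    """Turn one raw log line into a structured record, or None if it does not parse."""
    match = LOG_PATTERN.match(raw_line)
    if match is None:
        return None
    # Normalize the timestamp to UTC so records from different sources compare cleanly.
    ts = datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return {
        "host": match["host"],
        "timestamp": ts.astimezone(timezone.utc).isoformat(),
        "method": match["method"],
        "path": match["path"],
        "status": int(match["status"]),
        "bytes": 0 if match["size"] == "-" else int(match["size"]),
    }

# Hypothetical sample line.
raw = '192.0.2.10 - - [18/Sep/2014:10:15:32 +0300] "GET /index.html HTTP/1.1" 200 5120'
record = collect(raw)
print(json.dumps(record, sort_keys=True))
```

Lines that do not match return None instead of raising, so a collector built this way can count and set aside unparseable input rather than stop.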

Page 7: Big Data Ecosystem

Data Path / Collection

Wide variety of components and technologies:

Flat files in text or binary formats (CSV, Avro, etc.) on a typical file system

Cluster-specific file systems

RDBMS/SQL, NoSQL, NewSQL, MPP DBs, Graph Databases, Document Databases

Column Stores

Key-Value Stores

Time Series Stores

Streaming and transformation engines

Page 8: Big Data Ecosystem

Data Path / Processing

Different processing paradigms:

Batch Processing

Real-time Processing

Multiple expected outcomes:

Data

Action

Different destinations:

Data stores

Data-driven Control Planes

Page 9: Big Data Ecosystem

Data Path / Processing

Smaller number of technologies:

Map / Reduce (Hadoop, CouchDB, MongoDB, Riak)

Cluster Computing (PVM, MPI, LAM, OpenMP, etc.)

HPC / Supercomputing

Data parallelism is the key!

Data locality is important!
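
To illustrate why data parallelism matters, here is a minimal single-process sketch of the Map / Reduce idea as a word count. The partitions and function names are assumptions for illustration; a real framework such as Hadoop would run each map call on the node that already holds the partition.

```python
from collections import Counter
from functools import reduce

def map_phase(partition):
    """Count words within one partition; needs no data from other partitions."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts

def reduce_phase(left, right):
    """Merge two partial results; the merge is associative, so order does not matter."""
    return left + right

# Hypothetical input, already split into partitions (for example, one per node).
partitions = [
    ["big data ecosystem", "data path"],
    ["data locality", "big cluster"],
]

# Each map call is independent, so these could run in parallel on separate machines.
partials = [map_phase(p) for p in partitions]
totals = reduce(reduce_phase, partials, Counter())
print(totals["data"], totals["big"])
```

Because every map call touches only its own partition, the work scales by adding partitions and workers; only the much smaller partial counts have to move across the network.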

Page 10: Big Data Ecosystem

Data Path / Processing

The importance of M/R

Self-hosted solutions:

Apache Hadoop

Cloudera, Hortonworks, etc.

Cloud-based solutions:

AWS EMR (+Data Pipeline, +Kinesis, +S3, +DynamoDB)

Joyent Manta

… many others …

Page 11: Big Data Ecosystem

Data Path / Query

Processing creates digital artifacts

Extremely high variety of technologies, components, and services to deal with those artifacts:

SQL interfaces on top of NoSQL stores

NoSQL to NoSQL

NoSQL to RDBMS

Output to 3rd party API services

Output to proprietary interfaces

… a lot more …

Page 12: Big Data Ecosystem

Data Path / Query

“Query-friendly” stores:

Classical RDBMS, NewSQL

Big Table & Column Stores

Key-Value Stores

Search-oriented services

Visualization:

3rd party services

Tableau

HTML5 / JavaScript Dashboards

Programming languages / Visualization libraries

Page 13: Big Data Ecosystem

Data Path / Query

Analysis

Reports

Trends / Predictions

Real-time analytics

Data-driven Control Plane

Classical Business Intelligence

Machine Learning (Mahout)

Data Science (usually a fancy term for Statistics)

Page 14: Big Data Ecosystem

Big Data & Monitoring

Infrastructure Monitoring

Well understood

Many products

Full-Stack Application Monitoring

Technical challenges

No “one size fits all” solutions

Data Quality Monitoring

Emerging technologies

Home-grown solutions
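
A home-grown data quality check can start very small. The sketch below counts missing required fields and duplicate records in a batch; the record layout, field names and the `quality_report` function are hypothetical and only meant to show the shape of such a check.

```python
def quality_report(records, required_fields):
    """Summarize missing required fields and duplicate records in a batch.

    Assumes each record is a dict with hashable values.
    """
    missing = {field: 0 for field in required_fields}
    seen = set()
    duplicates = 0
    for rec in records:
        for field in required_fields:
            # Treat absent, None and empty-string values as missing.
            if rec.get(field) in (None, ""):
                missing[field] += 1
        # Two records are duplicates when all required fields are equal.
        key = tuple(rec.get(field) for field in required_fields)
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {"total": len(records), "missing": missing, "duplicates": duplicates}

# Hypothetical batch of collected records.
batch = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 1, "email": "a@example.com"},
    {"id": 3},
]
report = quality_report(batch, ["id", "email"])
print(report)
```

A report like this can be emitted per batch and fed into the same alerting used for infrastructure monitoring, for example by raising an alarm when the share of missing values crosses a chosen threshold.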

Page 15: Big Data Ecosystem

Big Data & Monitoring

Infrastructure Monitoring

Page 16: Big Data Ecosystem

Big Data & Monitoring

Application Monitoring

Page 17: Big Data Ecosystem

Big Data & Monitoring

Data Quality Monitoring

Page 18: Big Data Ecosystem

… a bag of acronyms …

Flume, Scribe, Chukwa, Sqoop, MapReduce, YARN, HDFS, HBase, Pig Latin, Hive, HAWQ, Impala, Presto, Phoenix, Spire, Drill, Storm, Samza, Malhar, Cassandra, Redis, Voldemort, Accumulo, Oozie, Azkaban, Lipstick, Hue, OpenTSDB, Mahout, Giraph, Lily, ZooKeeper, Datameer, Tableau, Pentaho, Sumo Logic, MongoDB, CouchDB, Riak, Pregel, Lucene, Solr, Elasticsearch, Neo4j, OrientDB, Memcached, FoundationDB, …

AWS: Data Pipeline, EMR, Kinesis, DynamoDB, S3, Redshift, ElastiCache, SQS, SWF

Joyent: Manta

Page 19: Big Data Ecosystem

Piece of advice …

Collect relevant data!

Collecting data for data’s sake only costs money …

Use the processing technology that best matches your business case!

Hadoop is pointless if your clients only want fast geospatial searches …

Consume wisely!

Knowing that 100% of X is Y means nothing when there is only one X …
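
One way to make the last point concrete is to attach a confidence interval to an observed percentage. The sketch below uses the Wilson score interval, which is my choice for illustration and not something the presentation prescribes; the function name and the 95% level (z = 1.96) are likewise assumptions.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate confidence interval for a proportion (Wilson score interval)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, centre - half), min(1.0, centre + half)

# "100% of X is Y" observed on one X versus on a thousand Xs.
low_one, high_one = wilson_interval(1, 1)
low_many, high_many = wilson_interval(1000, 1000)
print(f"n=1:    [{low_one:.2f}, {high_one:.2f}]")
print(f"n=1000: [{low_many:.3f}, {high_many:.3f}]")
```

With a single observation the lower bound is only about 0.21, so the "100%" is compatible with a true rate anywhere from roughly one in five upward; with a thousand observations the lower bound rises above 0.99.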

Page 20: Big Data Ecosystem

Conclusion

Q & A