Integrating Big Data Technologies

INTEGRATING BIG DATA Dataversity Webinar Feb 7 2012 ©2012 Sixth Sense Advisors, Inc. All Rights Reserved 1

Upload: dataversity

Post on 20-Aug-2015


TRANSCRIPT

Page 1: Integrating Big Data Technologies

INTEGRATING BIG DATA Dataversity Webinar Feb 7 2012

©2012 Sixth Sense Advisors, Inc. All Rights Reserved 1

Page 2: Integrating Big Data Technologies

State of Data Today


Page 3: Integrating Big Data Technologies

A Growing Trend

Requirement | Expectations | Reality
Speed | Speed of the Internet | Speed = Infra + Arch + Design
Accessibility | Accessibility of a Smartphone | BI Tool licenses & security
Usability | IPAD - Mobility | Web Enabled BI Tool
Availability | Google Search | Data & Report Metadata
Delivery | Speed of questions | Methodology & Signoff
Data | Access to everything | Structured Data
Scalability | Cloud (Amazon) | Existing Infrastructure
Cost | Cell phone or Free WIFI | Millions

Expectations for BI are changing w/o anyone telling us

Page 4: Integrating Big Data Technologies

The Wisdom of Crowds

Page 5: Integrating Big Data Technologies

Data Deluge = Business Insights

Page 6: Integrating Big Data Technologies

BIG Data

Structured: ERP, CRM, SCM

Unstructured: Content Management Systems, Email, Call Center, Documents, Contracts

Current, New

Page 7: Integrating Big Data Technologies

What’s so Big about Big Data

Velocity Volume Variety

Complexity Ambiguity


Page 8: Integrating Big Data Technologies

So you are about to start the Big Data Project


Tools

instructions

Data

Output

Page 9: Integrating Big Data Technologies

The Normal Way Results In …

Image Source: Web

Page 10: Integrating Big Data Technologies

Why Big Data Can Fail on the RDBMS

New Data Types

New volume

New analytics

New workload

New metadata

Current Data Management Platform (RDBMS + ETL + BI)

•  Poor performance
•  Failed programs

Scalability; Sharding; ACID

Page 11: Integrating Big Data Technologies

BIG Data

•  Workload Demands
   •  Process dynamic data content
   •  Process unstructured data
   •  Systems that can scale up and scale out with high volume data
   •  Perform complex operations within reasonable response time

•  Infrastructure Requirements
   •  Scalable platform
   •  Database independence
   •  Fault tolerant architectures
   •  Low cost of acquisition and storage
   •  Supported by standard toolsets

Page 12: Integrating Big Data Technologies

Hadoop

Design Goals
✓  System Shall Manage and Heal Itself
✓  Performance Shall Scale Linearly
✓  Compute Shall Move to Data
✓  Simple Core, Modular and Extensible

Page 13: Integrating Big Data Technologies

Hadoop Differentiators

Schema-on-Write: RDBMS

•  Schema must be created before data is loaded.
•  An explicit load operation has to take place which transforms the data to the internal structure of the database.
•  New columns must be added explicitly before data for such columns can be loaded into the database.
•  Read is fast.
•  Standards/Governance.

Schema-on-Read: Hadoop

•  Data is simply copied to the file store; no special transformation is needed.
•  A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns.
•  New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse them.
•  Load is fast.
•  Evolving Schemas/Agility.
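The schema-on-read idea can be illustrated with a minimal Python sketch. This is not Hive's SerDe interface; `RAW_STORE`, `load`, `make_deserializer` and `read` are hypothetical names invented for the illustration. Raw lines are stored untouched, and a column mapping is applied only when the data is read, so a field that started arriving early becomes visible once the reader's mapping is updated.

```python
import csv
import io

# Raw lines are stored exactly as they arrived; nothing is validated at load time.
RAW_STORE = []

def load(line):
    """Schema-on-read load: append the raw line with no transformation."""
    RAW_STORE.append(line)

def make_deserializer(columns):
    """Return a toy deserializer that maps comma-separated fields to column names."""
    def deserialize(line):
        fields = next(csv.reader(io.StringIO(line)))
        # Short rows are padded with None; extra fields are ignored by zip().
        padded = fields + [None] * (len(columns) - len(fields))
        return dict(zip(columns, padded))
    return deserialize

def read(deserializer, wanted):
    """Apply the deserializer at read time and project the wanted columns."""
    return [{c: row.get(c) for c in wanted} for row in map(deserializer, RAW_STORE)]

load("1,alice")
load("2,bob,NY")  # a third field starts flowing before any reader knows about it

v1 = make_deserializer(["id", "name"])          # original reader-side schema
v2 = make_deserializer(["id", "name", "city"])  # updated reader-side schema

rows_v1 = read(v1, ["id", "name"])
rows_v2 = read(v2, ["id", "city"])  # "city" appears retroactively for row 2
```

Nothing in the stored data changed between the two reads; only the reader's mapping did. The trade-off is that every read pays the parsing cost that a schema-on-write system pays once at load time.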

Page 14: Integrating Big Data Technologies

Hadoop Known Limitations

•  Write-once model
•  A namespace with an extremely large number of files exceeds Namenode's capacity to maintain
•  Cannot be mounted by existing OS
   •  Getting data in and out is tedious
   •  Virtual File System can solve problem
•  HDFS does not implement / support
   •  User quotas
   •  Access permissions
   •  Hard or soft links
   •  Data balancing schemes
•  No periodic checkpoints
•  Namenode is single point of failure
   •  Automatic restart and failover to another machine not yet supported

Page 15: Integrating Big Data Technologies

Hadoop Tips

•  Hadoop is useful
   •  When you must process lots of unstructured data
   •  When running batch jobs is acceptable
   •  When you have access to lots of cheap hardware

•  Hadoop is not useful
   •  For intense calculations with little or no data
   •  When your data is not self-contained
   •  When you need interactive results

•  Implementation
   •  Think big, start small
   •  Build on agile cycles
   •  Focus on the data, as you will always develop schema on write.

•  Available Optimizations
   •  Input to Maps
   •  Map only jobs
   •  Combiner
   •  Compression
   •  Speculation
   •  Fault Tolerance
   •  Buffer Size
   •  Parallelism (threads)
   •  Partitioner
   •  Reporter
   •  DistributedCache
   •  Task child environment settings
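Of the optimizations listed, the combiner is the easiest to show in a few lines. The sketch below simulates a word count in plain Python rather than using the Hadoop API; all function names are made up for the illustration. It counts how many records cross the shuffle with and without map-side combining.

```python
from collections import Counter, defaultdict

def map_phase(split):
    """Emit (word, 1) for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]

def combine(pairs):
    """Map-side combiner: pre-aggregate counts before they are shuffled."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def shuffle(all_pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in all_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values for each key."""
    return {key: sum(values) for key, values in groups.items()}

def run(splits, use_combiner):
    """Return the final counts and the number of records sent to the shuffle."""
    shuffled = []
    for split in splits:
        pairs = map_phase(split)
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)
    return reduce_phase(shuffle(shuffled)), len(shuffled)

splits = [["big data big data", "big insights"], ["data deluge", "big data"]]
plain_counts, plain_records = run(splits, use_combiner=False)
combined_counts, combined_records = run(splits, use_combiner=True)
```

Both runs produce the same counts, but the combined run shuffles 6 records instead of 10. A combiner is only safe when the reduce function is associative and commutative, as summation is here.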

Page 16: Integrating Big Data Technologies

Hadoop Tips

•  Performance Tuning
   •  Increase the memory/buffer allocated to the tasks
   •  Increase the number of tasks that can be run in parallel
   •  Increase the number of threads that serve the map outputs
   •  Disable unnecessary logging
   •  Turn on speculation
   •  Run reducers in one wave as they tend to get expensive
   •  Tune the usage of DistributedCache; it can increase efficiency

•  Troubleshooting
   •  Are your partitions uniform?
   •  Can you combine records at the map side?
   •  Are maps reading off a DFS block worth of data?
   •  Are you running a single reduce wave (unless the data size per reducer is too big)?
   •  Have you tried compressing intermediate data & final data?
   •  Are there buffer size issues?
   •  Do you see unexplained "long tails"?
   •  Are your CPU cores busy?
   •  Is at least one system resource being loaded?
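The first troubleshooting question, whether partitions are uniform, can be checked offline on a sample of keys. The sketch below is an assumption-laden illustration: it uses a CRC32 hash modulo the reducer count as a stand-in partitioner (Hadoop's default partitioner hashes the key object instead), and `skew` is a hypothetical helper.

```python
import zlib
from collections import Counter

def partition(key, num_reducers):
    """Stand-in hash partitioner: a stable hash of the key modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_sizes(keys, num_reducers):
    """Count how many records each reducer would receive."""
    sizes = Counter(partition(k, num_reducers) for k in keys)
    return [sizes.get(r, 0) for r in range(num_reducers)]

def skew(sizes):
    """Ratio of the largest partition to the mean; 1.0 means perfectly uniform."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0

# One hot key dominates the sample, so one reducer receives most of the records.
sample_keys = ["hot"] * 90 + [f"user{i}" for i in range(10)]
sizes = partition_sizes(sample_keys, num_reducers=4)
ratio = skew(sizes)
```

A ratio well above 1.0 points at a hot key, which is also a common cause of the "long tails" mentioned in the list: the whole job waits for the one reducer that holds it.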

Page 17: Integrating Big Data Technologies

NoSQL

•  Stands for Not Only SQL
•  Based on CAP Theorem
•  Usually do not require a fixed table schema nor do they use the concept of joins
•  All NoSQL offerings relax one or more of the ACID properties
•  NoSQL databases come in a variety of flavors
   •  XML (myXMLDB, Tamino, Sedna)
   •  Wide Column (Cassandra, HBase, Bigtable)
   •  Key/Value (Redis, Memcached with Berkeley DB)
   •  Graph (neo4j, InfoGrid)
   •  Document store (CouchDB, MongoDB)

Page 18: Integrating Big Data Technologies

NoSQL Footprint

Figure: NoSQL stores plotted by Size and Complexity.

Categories: Key Value, Big Table, Doc Database, Graph

Examples shown: Amazon Dynamo, Google Bigtable, Cassandra, Lotus Notes, HBase, Voldemort, Graph Theory

Page 19: Integrating Big Data Technologies

NoSQL

•  Best Practices
   •  Design for data collection
   •  Plan the data store
   •  Organize by type and semantics
   •  Partition for performance
   •  Access and Query is run time dependent
   •  Horizontal scaling
   •  Memory Caching

•  Access and Query
   •  RESTful interfaces (HTTP as an access API)
   •  Query languages other than SQL
      •  SPARQL - Query language for the Semantic Web
      •  Gremlin - the graph traversal language
      •  Sones Graph Query Language
   •  Data Manipulation / Query API
      •  The Google BigTable DataStore API
      •  The Neo4j Traversal API
   •  Serialization Formats
      •  JSON
      •  Thrift
      •  ProtoBuffers
      •  RDF
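Three of the practices above (partitioning, memory caching, and a serialization format) fit together in a small sketch. `ShardedStore` is a hypothetical toy class, not the API of any product named in these slides; it hash-partitions keys across in-memory shards, stores values as JSON text, and keeps a small read cache.

```python
import json
import zlib

class ShardedStore:
    """Toy key-value store: hash-partitioned shards plus a small read cache."""

    def __init__(self, num_shards, cache_size=2):
        self.shards = [{} for _ in range(num_shards)]
        self.cache = {}  # key -> decoded value, kept in insertion order
        self.cache_size = cache_size
        self.cache_hits = 0

    def _shard_for(self, key):
        """Pick the shard by a stable hash of the key (horizontal partitioning)."""
        return self.shards[zlib.crc32(key.encode("utf-8")) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = json.dumps(value)  # values are stored as JSON text
        self.cache.pop(key, None)                       # drop any stale cached copy

    def get(self, key):
        if key in self.cache:
            self.cache_hits += 1
            return self.cache[key]
        raw = self._shard_for(key).get(key)
        if raw is None:
            return None
        value = json.loads(raw)
        if len(self.cache) >= self.cache_size:
            self.cache.pop(next(iter(self.cache)))      # evict the oldest cached entry
        self.cache[key] = value
        return value

store = ShardedStore(num_shards=3)
store.put("user:1", {"name": "alice", "tags": ["bi", "hadoop"]})
first = store.get("user:1")   # read from the shard, then cached
second = store.get("user:1")  # served from the cache
```

Adding capacity in this model means adding shards, which is the horizontal scaling the slide refers to. A real store also has to rebalance keys when the shard count changes, which this sketch ignores.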

Page 20: Integrating Big Data Technologies

Forest Rim Technology – Textual ETL Engine (TETLE) – is an integration tool for turning text into a structure of data that can be analyzed by standard analytical tools

Textual ETL Engine

•  Textual ETL Engine provides a robust user interface to define rules (or patterns / keywords) to process unstructured or semi-structured data.
•  The rules engine encapsulates all the complexity and lets the user define simple phrases and keywords
•  Easy to implement and easy to realize ROI

•  Advantages
   •  Simple to use
   •  No MR or Coding required for text analysis and mining
   •  Extensible by Taxonomy integration
   •  Works on standard and new databases
   •  Produces a highly columnar key-value store, ready for metadata integration

•  Disadvantages
   •  Not integrated with Hadoop as a rules interface
   •  Currently uses Sqoop for metadata interchange with Hadoop or NoSQL interfaces
   •  Current GA does not handle distributed processing outside Windows platform
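The general idea of turning text into key-value rows from user-defined phrases can be sketched as follows. This is not the Textual ETL Engine or its rule format, which these slides do not show; `RULES` and `extract` are hypothetical and only illustrate keyword rules producing rows that a standard analytical tool could load.

```python
import re

# Hypothetical rules: each category (the key) lists the phrases to look for.
RULES = {
    "product": ["laptop", "phone"],
    "sentiment": ["not happy", "love"],
}

def extract(doc_id, text, rules=RULES):
    """Return (doc_id, key, value, offset) rows for every rule phrase found."""
    rows = []
    for key, phrases in rules.items():
        for phrase in phrases:
            # Whole-word, case-insensitive match of the literal phrase.
            pattern = r"\b" + re.escape(phrase) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                rows.append((doc_id, key, phrase, match.start()))
    # Order rows by where the phrase occurs in the document.
    return sorted(rows, key=lambda row: row[3])

rows = extract("email-1", "I love my new phone but I am not happy with the laptop.")
```

Each row carries the document id, so the output can be joined back to metadata about the source document.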

Page 21: Integrating Big Data Technologies

Integration

•  All RDBMS vendors today are supporting Hadoop or NoSQL as an integration or extension
   •  Oracle Exalytics / Big Data Appliance
   •  Teradata Aster Appliance
   •  EMC Greenplum Appliance
   •  IBM BigInsights
   •  Microsoft Windows Azure Integration

•  There are multiple providers of Hadoop distribution
   •  Cloudera
   •  Hortonworks
   •  Zettaset

•  Adapters from vendors to interface with Cloudera or Hortonworks distributions of Hadoop are available today. There are integration efforts to release Hadoop as an integral engine across the RDBMS vendor platforms

Page 22: Integrating Big Data Technologies

Conceptual Solution Architecture

Diagram components: Metadata; Taxonomy; MDM; OLTP; Data Warehouse; Data Marts; Big Data DW; Textual ETL; ETL, ELT, CDC; Reporting, Analytics, Search, OLAP; Text Mining, Content Analytics, Knowledge Analytics; BIG Data (Content, Email, Docs); MR / Ruby / Java (Hadoop); And / Or

Page 23: Integrating Big Data Technologies

Integration Tips

•  The key to the castle in integrating Big Data is metadata
•  Whatever the tool, technology and technique, if you do not know your metadata, your integration will fail
•  Semantic technologies and architectures will be the way to process and integrate the Big Data, much akin to Web 2.0 models
•  Data quality for Big Data is a very questionable goal. To get some semblance of quality, taxonomies and ontologies can be of help
•  3rd party data providers also provide keywords, trending tags and scores; these can provide a lot of integration support
•  Writing business rules for Big Data can be very cumbersome and not all programs can be written in MapReduce
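One way a taxonomy can help with quality is by mapping free-form tags onto canonical terms and flagging whatever it cannot place. The sketch below assumes a tiny made-up taxonomy; `TAXONOMY`, `LOOKUP` and `normalize` are hypothetical names, and a real taxonomy would be far larger and hierarchical.

```python
# Hypothetical taxonomy: canonical term -> known synonyms and spellings.
TAXONOMY = {
    "mobile phone": {"cell phone", "smartphone", "mobile"},
    "laptop": {"notebook", "ultrabook"},
}

# Reverse index built once: every synonym (and the term itself) -> canonical term.
LOOKUP = {syn: term for term, syns in TAXONOMY.items() for syn in syns | {term}}

def normalize(tags):
    """Split raw tags into canonical terms and tags the taxonomy cannot place."""
    mapped, unmapped = [], []
    for tag in tags:
        key = tag.strip().lower()
        if key in LOOKUP:
            mapped.append(LOOKUP[key])
        else:
            unmapped.append(tag)
    return mapped, unmapped

mapped, unmapped = normalize(["Smartphone", " notebook", "tablet", "laptop"])
```

The unmapped list is as useful as the mapped one: it shows where the taxonomy has gaps, which is where integration against warehouse dimensions is most likely to break.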

Page 24: Integrating Big Data Technologies

Which Tool

Application Hadoop NoSQL Textual ETL
Machine Learning x x
Sentiments x x x
Text Processing x x x
Image Processing x x
Video Analytics x x
Log Parsing x x x
Collaborative Filtering x x x
Context Search x
Email & Content x

Page 25: Integrating Big Data Technologies

Success Stories

•  Machine learning & Recommendation Engines – Amazon, Orbitz
•  CRM - Consumer Analytics, Metrics, Social Network Analytics, Churn, Sentiment, Influencer, Proximity
•  Finance – Fraud, Compliance
•  Telco – CDR, Fraud
•  Healthcare – Provider / Patient analytics, fraud, proactive care
•  Lifesciences – clinical analytics, physician outreach
•  Pharma – Pharmacovigilance, clinical trials
•  Insurance – fraud, geo-spatial
•  Manufacturing – warranty analytics, supplier quality metrics

Page 26: Integrating Big Data Technologies

Data Science


Data Analytics Content Customer Product Behaviors Optimization

Big Data Processing & ETL

APPLIED SCIENCE

User Interest Prediction inventory prediction

Machine learning Pattern Mining

Advanced Regression Analysis

Business Intelligence Advanced Analytics

Art & Science

Page 27: Integrating Big Data Technologies

Challenges

•  Resources Availability
•  MR is hard to implement
•  Speech to text
   •  Conversation context is often missing
   •  Quality of recording
   •  Accent issues
•  Visual data tagging
   •  Images
   •  Text embedded within images
•  Metadata is not available
•  Data is not trusted
•  Content management platform capabilities
•  Ontologies Ambiguity
•  Taxonomy Integration

Page 28: Integrating Big Data Technologies

Contact •  Krish Krishnan [email protected]

Twitter: @datagenius
