Big Data Use Cases InSemble Inc. http://www.insemble.com

Upload: insemble

Post on 17-Jul-2015


Agenda

1. What is Big Data?
2. Relevance to your Enterprise
3. Hadoop Ecosystem & Business Use Cases
4. Technical Use Cases and Demo
5. Q and A with Cloudera

Big Data Definitions

• Wikipedia defines it as “data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage and process data within a tolerable elapsed time”
• Gartner defines it as data with the following characteristics
– High Velocity
– High Variety
– High Volume
• Another definition is “Big Data is a large volume, unstructured data which cannot be handled by traditional database management systems”

Why a game changer

• Schema on Read
– Interpreting data at processing time
– Keys and values are not intrinsic properties of the data but are chosen by the person analyzing the data
• Move code to data
– With traditional systems, we bring data to code and I/O becomes a bottleneck
– With distributed systems, we have to deal with our own checkpointing/recovery
• More data beats better algorithms
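A minimal Python sketch of the schema-on-read idea, assuming a handful of made-up JSON event lines (the field names `user`, `channel`, `amount` and the helper `total_by` are hypothetical, not from the deck). The raw lines are stored untouched; the key and value are chosen only when the data is read:

```python
import json
from collections import Counter

# Hypothetical raw events, stored exactly as they arrived (no schema at write time).
RAW_LINES = [
    '{"user": "alice", "channel": "web", "amount": 30}',
    '{"user": "bob", "channel": "mobile", "amount": 12}',
    '{"user": "alice", "channel": "mobile", "amount": 8}',
]


def total_by(raw_lines, key_field, value_field):
    """Parse each raw line at read time and sum value_field per key_field."""
    totals = Counter()
    for line in raw_lines:
        record = json.loads(line)  # interpretation happens here, not at load
        totals[record[key_field]] += record[value_field]
    return dict(totals)


# The same raw data answers two different questions,
# because the analyst picks the key at processing time.
by_user = total_by(RAW_LINES, "user", "amount")
by_channel = total_by(RAW_LINES, "channel", "amount")
print(by_user)
print(by_channel)
```

Neither grouping had to be anticipated when the events were written, which is the point of the "access pattern not as relevant" argument.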

Enterprise Relevance

• Missed Opportunities
– Channels
– Data that is analyzed
• Constraint was high cost
– Storage
– Processing
• Future-proof your business
– Schema on Read
– Access pattern not as relevant
– Not just future-proofing your architecture

Hadoop Ecosystem

Source: Apache Hadoop Documentation

Hadoop 2 with YARN

Source: Hadoop In Practice by Alex Holmes

Big Data Journey

(Figure: five stages plotted with axes "Journey Over Time" and "Business Value"; effects range from GOOD to GREAT.)

• Aware of Benefits: Scout for Opportunities
– Evaluate business benefits
– Understand ecosystem
– Identify platform
• Execute: Pilot project
– Consolidated analytics
– ETL time constraints
• Expand: Multiple use cases
– Ad hoc data exploration
– Batch, interactive, real-time use cases
– Predictive analytics, machine learning
• Managed: Governance model
– Security standards defined
– Governance standards defined
– Integrated with the enterprise
• Optimized: Core competency
– Real-time insight from all channels
– IT is key differentiator for your business
– Perfect alignment of business and IT

Insurance Domain – Case Study

Source: Cloudera (Three-Customer-Case-Studies_Industry-Brief.pdf)

SOLUTION
• Cloudera Enterprise
• Apache Hive/Impala
• Sqoop
• Coexist with enterprise warehouses & mainframe

REQUIREMENTS
• Customized plans based on multiple data points
– Lifestyle, health patterns, habits, preferences
• Find correlations from digitizing massive amounts of data
– Traffic patterns, demographics, weather
• Run analytics on multiple states simultaneously

BENEFITS

• Run descriptive models across historical data from all states

• Customized products catered to individual behaviors and risks

• Differentiated Marketing Offers

Common Use Cases

1. Detail Records, Time Constraints
2. Consolidated View, 360 degree View
3. Recommendation Engines, Insurance Underwriting
4. Sentiment Analysis, Fraud Detection
5. Personalized Marketing, Products

Securing Hadoop Data

Source: http://www.voltage.com

General Thoughts

• Technology in hyper growth phase
• Complex
• Tools/Productivity/Monitoring products evolving
• Pilot Project
• Incremental Journey

Technical Use Case: Managing Hadoop Cluster

• Ambari vs Cloudera Manager
• Both provision, manage and monitor a Hadoop cluster
• Ambari
– Open source
– Based on existing open source projects such as Puppet, Ganglia and Nagios
• Cloudera Manager
– Proprietary tool but more mature
– As a management tool, do we really need OSS?
– Rolling upgrades and management of multiple clusters

Technical Use Case: Choose SQL Engine on Hadoop

Performance Benchmark

source: http://blog.cloudera.com

Benchmark for multiple users

source: http://blog.cloudera.com

Other considerations

• Insert, update, and delete with full ACID support
– Available since Hive 0.14: https://issues.apache.org/jira/browse/HIVE-5317
• Support for nested data structures
• Fault tolerance
• Work with certain file formats (Avro, LZO compression)
• Integrate SQL on Hadoop with other big data use cases

Demo - Hadoop cluster in AWS

• Total of 6 EC2 machines, type t2.medium
• RHEL 6.5, 3.75G memory, 10G hard drive
• 5-node Hadoop cluster
• Public data set downloaded from https://data.cityofchicago.org

Demo

• Chicago crime data from 2009 to present
• 2 million plus records
• Dangerous communities in Chicago (Hive vs Hive on Tez vs Impala)
• Use Tableau to connect to Hadoop cluster
– Crime counts based on crime type
– Homicide count by year
– Dangerous community
– Homicide map
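The demo aggregations can be sketched in plain Python, as a rough stand-in for the Hive/Impala queries, which the deck does not show. The column names (`Primary Type`, `Year`, `Community Area`) are assumed to match the public Chicago data set, and the five sample rows are invented purely for illustration:

```python
import csv
import io
from collections import Counter

# Invented sample rows; column names are an assumption about the public CSV export.
SAMPLE_CSV = """Primary Type,Year,Community Area
THEFT,2013,25
HOMICIDE,2013,25
BATTERY,2014,43
HOMICIDE,2014,67
THEFT,2014,25
"""


def load_rows(text):
    """Read CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def crime_counts_by_type(rows):
    """Crime counts based on crime type."""
    return Counter(row["Primary Type"] for row in rows)


def homicide_count_by_year(rows):
    """Homicide count by year."""
    return Counter(
        int(row["Year"]) for row in rows if row["Primary Type"] == "HOMICIDE"
    )


def incidents_by_community(rows):
    """Incident count per community area, a simple proxy for 'dangerous community'."""
    return Counter(row["Community Area"] for row in rows)


rows = load_rows(SAMPLE_CSV)
print(crime_counts_by_type(rows))
print(homicide_count_by_year(rows))
print(incidents_by_community(rows).most_common(1))
```

On the cluster each of these would be a single GROUP BY query; the sketch only shows the shape of the three aggregations behind the Tableau views.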

Questions?

Vijay Mandava: [email protected]
Lan Jiang: [email protected] / @Lan_Jiang