Data Analytics with NOSQL Mukundan Agaram Chris Weiss


Post on 20-Mar-2017




Some initial thoughts about data...

Continual issues with large scale web apps
– Data growth + query response time
  ● Data growth => performance degradation
  ● Explosion of big data “analytics” use cases
– Increase in unstructured data
  ● More interconnectivity, more formats, lack of structure...
  ● Document oriented data (XML/JSON) are difficult to manage and search
– Distributed server configurations
  ● Large systems, more distribution and HA

Cloud services have aggravated these issues


Agenda for the night

● What is NOSQL?
● Varieties of NOSQL
● Key Industry Use Cases
● Applications for Data Analytics
● Landscape
● Demos/Walkthroughs
● Closing Discussions


What is NOSQL?

● “...mechanism for storage and retrieval of data that is modeled in means other than tabular relations used in relational databases.” (Wikipedia)
● Non SQL or Non-relational
● Not Only SQL
● Technically around since the late 1960s...
  – E.g. IDMS, IMS, MUMPS, Caché, BerkeleyDB


What is NOSQL?

● Drivers for modern day NOSQL
  – Web 2.0
  – Big Data
  – Facebook, Google, Amazon, Expedia etc.
  – Horizontal scaling to clusters of computers
    ● Achilles heel for RDBMS
  – Cost
  – Provide
    ● HA
    ● Partitioning of data across nodes (a.k.a. sharding)
    ● Speed
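Horizontal scaling through sharding can be sketched in a few lines. The example below is a minimal illustration, assuming simple hash-based placement; the function names `shard_for` and `distribute` and the key format are hypothetical, and real products use their own schemes (often consistent hashing or range partitioning).

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (hypothetical helper)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer so the result is stable across runs.
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(keys, num_shards: int) -> dict:
    """Group keys by the shard that would own them."""
    shards = {index: [] for index in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


# Illustrative keys only; each key always lands on the same shard.
placement = distribute([f"user:{n}" for n in range(1000)], num_shards=4)
```

Plain modulo placement reshuffles most keys when the shard count changes, which is one reason production systems tend to prefer consistent hashing.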


NOSQL - Drawbacks and Barriers

● Compromise on consistency (CAP Theorem)
● Custom query languages vs. SQL
● Lack of standardized interfaces
● Existing investments in RDBMS
● Most lack true ACID transactions.
  – Use an “eventually” consistent model
  – Data is replicated with a conflict resolution algorithm
  – Methods for conflict resolution and distribution vary significantly
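One common conflict resolution strategy is "last write wins". The sketch below is an assumption-laden illustration, not the method of any particular product: the `Versioned` record, the timestamp field and the node-id tie-breaker are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # time of the write as seen by the writing node
    node_id: str      # tie-breaker when two timestamps are equal


def resolve_lww(a: Versioned, b: Versioned) -> Versioned:
    """Last write wins: keep the version with the larger (timestamp, node_id)."""
    return max(a, b, key=lambda v: (v.timestamp, v.node_id))


def merge_replicas(left: dict, right: dict) -> dict:
    """Merge two replica states key by key using last-write-wins."""
    merged = dict(left)
    for key, candidate in right.items():
        merged[key] = resolve_lww(merged[key], candidate) if key in merged else candidate
    return merged


# Two replicas that diverged; merging in either order gives the same state.
replica_a = {"cart": Versioned("book", 1.0, "a")}
replica_b = {"cart": Versioned("book,pen", 2.0, "b"), "wish": Versioned("lamp", 1.5, "b")}
converged = merge_replicas(replica_a, replica_b)
```

Last-write-wins is simple but silently discards the older write, which is why some systems use vector clocks or application-level merges instead.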


CAP Theorem

● a.k.a. Brewer's theorem
● Impossible for a distributed computer system to simultaneously provide all three of
  – Consistency
    ● All nodes see the same data at the same time
  – Availability
    ● Every request receives a response
  – Partition Tolerance
    ● The system keeps operating when network failures split it into partitions
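The trade-off can be made concrete with a toy model. The sketch below assumes a hypothetical two-replica store (`TwoNodeStore`, not a real library) whose link can be cut: in "CP" mode it refuses writes during a partition, in "AP" mode it accepts them and lets the replicas diverge.

```python
class TwoNodeStore:
    """Toy store with two replicas, 'a' and 'b', and a link that can be cut."""

    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = {"a": {}, "b": {}}
        self.partitioned = False

    def write(self, node: str, key, value) -> bool:
        other = "b" if node == "a" else "a"
        if not self.partitioned:
            # Healthy network: replicate synchronously to both nodes.
            self.replicas[node][key] = value
            self.replicas[other][key] = value
            return True
        if self.mode == "CP":
            # Give up availability: reject the write to stay consistent.
            return False
        # AP: stay available, accept locally, and let the replicas diverge.
        self.replicas[node][key] = value
        return True

    def read(self, node: str, key):
        return self.replicas[node].get(key)


cp, ap = TwoNodeStore("CP"), TwoNodeStore("AP")
for store in (cp, ap):
    store.write("a", "x", 1)
    store.partitioned = True  # simulate a network failure
cp_accepted = cp.write("a", "x", 2)
ap_accepted = ap.write("a", "x", 2)
```

After the partition the CP store still returns the same value from both nodes, while the AP store answers every request but node "b" serves stale data until the replicas are reconciled.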


CAP alignment for NOSQL

Source: http://blog.nahurst.com/visual-guide-to-nosql-systems


NOSQL direction

The landscape is morphing...
● Current NOSQL industry focus
  – Address large distributed systems, in reaction to the CAP theorem
● The newer breed of NOSQL addresses important aspects such as ACID
● There is a new buzz word …
  – NewSQL


Database Evolution


NOSQL Model Classification

Key Value Stores & Caches: Data is represented as a collection of (K,V) pairs. In-memory, persistent or eventually persistent.

Document Databases: Data is stored in JSON document structures.

RDF, OWL & Triple Stores: Meaningful way to connect information. Can run inference over triples (S,P,O). Can be represented graphically. Queried with SPARQL.

Wide Column Databases: Extensible record set. Stores data tables as sections of columns. Great for EDW.

Graph Databases: Stores data as a graph G(V,E). Great for correlation analysis, recommendation engines and fraud detection.

Multi-model Databases: Combination of two or more of the above.
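To make the classification concrete, here is one fact ("voter 42 lives in Ingham county") laid out in each model using plain Python structures. This is only a sketch: the keys, field names and helper functions are hypothetical and do not reflect the storage format of any specific product.

```python
# All identifiers and values below are hypothetical illustration data.

# Key-value store: opaque values looked up by key.
key_value = {"voter:42:name": "Ada Example", "voter:42:county": "Ingham"}

# Document database: one nested JSON-like document per entity.
document = {"_id": 42, "name": "Ada Example", "address": {"county": "Ingham"}}

# Triple store: (subject, predicate, object) statements.
triples = [
    ("voter:42", "hasName", "Ada Example"),
    ("voter:42", "livesIn", "county:Ingham"),
]

# Wide column store: row key -> column family -> column -> value.
wide_column = {"42": {"profile": {"name": "Ada Example"}, "geo": {"county": "Ingham"}}}

# Graph database: vertices plus labelled edges, G(V,E).
vertices = {"voter:42": {"name": "Ada Example"}, "county:Ingham": {"name": "Ingham"}}
edges = [("voter:42", "LIVES_IN", "county:Ingham")]


def county_from_triples(subject):
    """Follow the livesIn predicate from a subject."""
    for s, p, o in triples:
        if s == subject and p == "livesIn":
            return o.split(":", 1)[1]
    return None


def county_from_graph(vertex):
    """Follow the LIVES_IN edge and read the target vertex's name."""
    for src, label, dst in edges:
        if src == vertex and label == "LIVES_IN":
            return vertices[dst]["name"]
    return None


counties = [
    key_value["voter:42:county"],
    document["address"]["county"],
    county_from_triples("voter:42"),
    wide_column["42"]["geo"]["county"],
    county_from_graph("voter:42"),
]
```

Every lookup returns the same answer; what differs is how naturally each layout supports nested data, relationships and column-oriented scans.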


NOSQL Models

● Key-Value
  – Cache (EHCache, BigMemory, Coherence, Memcached)
  – Store (Redis, Riak, Aerospike, Oracle NoSQL)
● Document (MongoDB, CouchDB, Amazon DynamoDB)
● Wide Column (Cassandra, HBase, Vertica)
● Graph (Neo4j, Titan, Giraph)
● Multi-model (OrientDB, ArangoDB, Sqrrl)


Source: www.db-engines.com


Consider NOSQL for...

● Enabling “big data” and “web” scale
  – Massive distribution through horizontal scaling
● Performant queries (alternatives to RDBMS)
  – Denormalization and large horizontal scalability
● Massive write volumes (Facebook, Twitter)
● Fast and dynamic access to key data
● Flexible schemas and data types
● Data/Schema Migration
● Developer centric environments


Consider NOSQL for...

● Diverse data organization options
  – Hierarchical correlation
  – Graph correlation
  – Semantic relationships
  – Set based analytics
● Caching in end usage format
● Data Archival
● Big Data Analytics
  – Cumulative metrics and insights
  – Correlation


Where RDBMS/SQL is better...

● OLTP
● Data Integrity
● SQL centricity
● Complex relationships
  – With the exception of graph NOSQL
● Maturity, stability and standardization


Use Cases

● Log management (unstructured data)
● Data synchronization (online vs. offline sources)
  – Shopping cart, Field sales/services, PoS, Gaming, Transportation/telemetry
● User profile management
● Customer 360 degree view
● Fraud detection
● Medical/Healthcare diagnosis
● Data Archival
● Recommendation Engines


Applications for Data Analytics

● Complements (part of) Hadoop and Big Data
● Acts as the persistence infrastructure for larger machine learning use cases
  – Predictive Analytics
  – Fraud/Anomaly/Outlier Detection
  – Recommendation engines
● Provides a backdrop for interesting data visualization initiatives
  – Integrate with visualization packages such as Tableau


Interesting links

● Redis in Practice: Who's online?
  www.lukemelia.com/blog/archives/2010/01/17/redis-in-practice-whos-online/
● Inventory list of NOSQL systems
  www.nosql-database.org
● Database Engine ranking and analytics
  www.db-engines.com
● Visual guide to NOSQL systems
  www.blog.nahurst.com/visual-guide-to-nosql-systems


Case Studies / Demos

● Retail fraud detection
  – Neo4j
  – Contrasting with OrientDB
  – TinkerPop/Gremlin/Blueprints
● 360 degree single view of voter information
  – MongoDB
● Schema on read
  – Hadoop
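The graph approach to fraud detection can be illustrated without a graph database. The sketch below is a plain Python stand-in, not the Neo4j or Gremlin demo from the talk: accounts and the identifiers they use (cards, phones) are vertices, and a breadth-first traversal finds accounts linked through shared identifiers. All vertex names and the `acct:` prefix convention are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(edge_list):
    """Build an undirected adjacency map from (vertex, vertex) pairs."""
    graph = defaultdict(set)
    for a, b in edge_list:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def linked_accounts(graph, start):
    """Return every account reachable from start via shared identifiers (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return {node for node in seen if node.startswith("acct:")}


# Hypothetical data: accounts 1 to 3 are chained through a card and a phone.
usage = [
    ("acct:1", "card:A"),
    ("acct:2", "card:A"),
    ("acct:2", "phone:X"),
    ("acct:3", "phone:X"),
    ("acct:4", "card:B"),
]
ring = linked_accounts(build_graph(usage), "acct:1")
```

In a relational schema the same question needs a self-join per hop; a graph store expresses it as a single traversal, which is the appeal for this use case.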


Gremlin Blueprints Architecture

Neo4j OrientDB TitanGraph ArangoDB


Qualified Voter – Use Case

● Tracks registration information for all voters in Michigan
● Uses a tabular geography model
● Highly normalized schema
  – Data partitioned into subsets
    ● Enable local application instances and row level security
● Expensive queries when doing reporting
● Expensive queries for performing “single view” of voter
● Several tables with tens of millions of records


Voter Schema


Find the first 100 voters in Ingham county with status and school district

    SELECT V.VOTER_IDENTIFICATION_NUMBER, V.FIRST_NAME, V.LAST_NAME, G.CODE AS GENDER,
           IDS.NAME AS ID_STATUS, UST.NAME AS UOCAVA_STATUS,
           VA.ADDRESS_LINE_ONE, VA.CITY, VA.ZIP_CODE,
           DIS.NAME AS SCHOOL_DISTRICT
    FROM VOTER V, VOTER_ADDRESS VA, GENDER G,
         IDENTIFICATION_STATUS IDS, UOCAVA_STATUS UST, VOTER_STATUS_TYPE VST,
         STREET_RANGE SI, DISTINCT_POLITICAL_AREA DPA, DISTINCT_POLITICAL_AREA_DIS DPAD,
         DISTRICT DIS, DISTRICT_TYPE DT, COUNTY CO
    WHERE V.ID = VA.VOTER_ID
      AND V.GENDER_ID = G.ID
      AND V.IDENTIFICATION_STATUS_ID = IDS.ID
      AND V.UOCAVA_STATUS_ID = UST.ID
      AND V.VOTER_STATUS_TYPE_ID = VST.ID AND VST.NAME = 'Active'
      AND VA.STREET_RANGE_ID = SI.ID
      AND SI.DISTINCT_POLITICAL_AREA_ID = DPA.ID
      AND VA.IS_ACTIVE = 'Y'
      AND DPA.COUNTY_ID = CO.ID AND CO.NAME = 'Ingham'
      AND DPA.ID = DPAD.DISTINCT_POLITICAL_AREA_ID
      AND DPAD.DISTRICT_ID = DIS.ID
      AND DIS.DISTRICT_TYPE_ID = DT.ID AND DT.NAME = 'School'
      AND ROWNUM <= 100;
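For contrast with the twelve-table join above, a document store would typically hold a denormalized "single view" per voter. The sketch below uses plain Python dictionaries in place of a document database; the document layout, field names and sample voters are all hypothetical and are not taken from the real voter system or from the MongoDB demo.

```python
# Hypothetical denormalized voter documents; every field name and value is
# invented for illustration only.
voters = [
    {
        "voter_id": "V001", "first_name": "Ada", "last_name": "Example",
        "gender": "F", "status": "Active",
        "address": {"line_one": "1 Main St", "city": "Lansing",
                    "zip": "48933", "county": "Ingham", "is_active": True},
        "districts": {"School": "Example School District"},
    },
    {
        "voter_id": "V002", "first_name": "Bob", "last_name": "Sample",
        "gender": "M", "status": "Active",
        "address": {"line_one": "2 Oak Ave", "city": "Detroit",
                    "zip": "48201", "county": "Wayne", "is_active": True},
        "districts": {"School": "Sample School District"},
    },
    {
        "voter_id": "V003", "first_name": "Cy", "last_name": "Placeholder",
        "gender": "M", "status": "Inactive",
        "address": {"line_one": "3 Elm Rd", "city": "Lansing",
                    "zip": "48933", "county": "Ingham", "is_active": True},
        "districts": {"School": "Example School District"},
    },
]


def first_voters_in_county(docs, county, limit=100):
    """Return up to `limit` active voters in a county with their school district."""
    results = []
    for doc in docs:
        if len(results) >= limit:
            break
        address = doc["address"]
        if (doc["status"] == "Active" and address["is_active"]
                and address["county"] == county and "School" in doc["districts"]):
            results.append({
                "voter_id": doc["voter_id"],
                "name": f'{doc["first_name"]} {doc["last_name"]}',
                "city": address["city"],
                "school_district": doc["districts"]["School"],
            })
    return results


ingham = first_voters_in_county(voters, "Ingham")
```

Each document is read once and already contains the status, address and district, so no joins are needed; the price is duplicated reference data that must be kept in sync when it changes.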


Expensive in terms of IO

● Multiple objects read
● Two stage IO:
  – Read index
  – Read entire table row
● Selected and WHERE clause columns assembled and then filtered
● Resources for larger volume query would be high: memory, CPU, fast disk


Parting conclusions

● NOSQL is a mixed bag of fruit
● This space is growing
● There are hundreds of products
● Best value is realized from identifying the correct use case
  – Functional requirements
  – Non-functional requirements


Finally you can use NOSQL for...


Thank You!!

Questions?