Big Data, NoSQL . . . So What? Iran Hutchinson

Post on 20-Dec-2015


Big Data, NoSQL . . . So What?

Iran Hutchinson

#iranic

Me

• I work for InterSystems, which:
– Drives the http://globalsdb.org NoSQL project
– Has 20+ years of NoSQL production deployments
– Has 20+ years of Big Data production deployments
– Built a ~250 million Euro business on the above
• Email: [email protected]
• Twitter: #iranic

Big Data is …

• Important data in varying formats and volumes that is being generated across all areas affecting your business that is generally not centrally correlated or managed.

• Examples include:
– Word Files, PowerPoint, PDFs
– Emails, Instant Messaging, Texts
– Blogs and Social Media
– Automated data from machine activities
– Stream data from financial stock markets

Some Big Data Numbers

• Source: McKinsey Global Institute
• 5 Billion mobile phones used in 2010
• 30 Billion pieces of info shared on Facebook each month
• 40% projected growth in global data generated per year
• 235 Terabytes collected by US Library of Congress by 04/11
– 15 out of 17 sectors in US have more data stored per company than this.

Some Big Data Numbers …

• Source: McKinsey Global Institute
• $300 Billion in potential value in US Healthcare system
• €250 Billion in Europe’s public sector administration
• $600 Billion in annual consumer surplus using location data
• 60% potential increase in retail operating margins
• 140,000 – 190,000 analytical talent positions in US
• 1.5 Million data-savvy managers needed in US

Case Study: Credit Suisse

• Key Challenges:
– Revamp order routing architecture
– Revamp order management architecture
– Serve current demand and scale to new levels
– Address downtime challenges

Case Study: Credit Suisse …

• Big Data in the form of volumes of transactions
• Leveraged Caché’s:
– In-memory architecture for performance
– On-disk resiliency for availability
– Distributed architecture for data coherency
• Can easily process 1,000,000,000 transactions
– During business hours

Case Study: European Space Agency (ESA)

• Key Challenges
– Make the largest, most precise 3-D map of our Galaxy
– Monitor 1,000,000,000 stars over 5 years, precisely charting position, movement, and brightness
– Along the way discover hundreds of thousands of new celestial objects

Case Study: ESA Continued …

• Challenge Calculation:
– Capture data for 1 Billion Celestial Objects
– 1,000,000,000 objects × 100 observations per object × 600 bytes per observation = 60,000,000,000,000 bytes (60 TB)
• Solution: Caché/XEP, delivering 100,000+ sustained inserts per second per server, stored as real objects with SQL access
• http://www.intersystems.com/cache/whitepapers/pdf/Charting_the_Galaxy.pdf
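The challenge calculation can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the input figures come from the slide, while the variable names and the single-server load time are illustrative assumptions (one observation is treated as one insert, at the quoted lower bound of 100,000 inserts per second).

```python
# Figures taken from the slide.
OBJECTS = 1_000_000_000        # celestial objects
OBS_PER_OBJECT = 100           # observations per object
BYTES_PER_OBS = 600            # bytes per observation
INSERTS_PER_SEC = 100_000      # sustained inserts per second per server (lower bound)

total_observations = OBJECTS * OBS_PER_OBJECT
total_bytes = total_observations * BYTES_PER_OBS
total_terabytes = total_bytes / 10**12   # decimal terabytes

# Assumption (not from the slide): one observation == one insert, on one server.
seconds_single_server = total_observations / INSERTS_PER_SEC
days_single_server = seconds_single_server / 86_400

print(f"{total_observations:,} observations")
print(f"{total_bytes:,} bytes = {total_terabytes:.0f} TB")
print(f"~{days_single_server:.1f} days of sustained inserts on one server")
```

Under these assumptions, raw insert throughput is not the limiting factor over a 5-year mission; the volume that has to be kept and queried is.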


Enabling Technology

• Focus on Caché
• A quick look at the architecture

[Diagram: three layers]
Paradigm: Relational | Object | Key-Value | Graph | Array | ?
Global
Language: COS | Java | C++ | ?

Enabling Technology …

• Java + C database kernel run in same process

[Diagram: Java SE components: Java I/O and NIO, Java Native Interface, Collections Framework, Concurrency]

Enabling Technology …

• ECP, Distributed Computing

[Diagram: one Data Server with App Server 1, App Server 2, App Server 3, App Server 4, App Server ?]

Enabling Technology …

• Multiple, simultaneous data to disk writers

[Diagram: Caché Buffer, Journalers, and Hard Disk, with three parallel Global / Journal / Disk I/O writers]

Who is this Guy?

• Edgar Frank “Ted” Codd
• Known for 12 Rules (0 ~ 12) for Relational Data Systems

NoSQL … Breaking the Rules

• Rule 1: The information Rule
– All information is represented in one and only one way, namely by values in column positions within rows of tables
• Rule 12: The no subversion Rule
– If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert the system, i.e. relational security or integrity constraints.

Why NoSQL?

• No to ACID transactions
• No to the impedance mismatch with SQL
• Dealing with Big Data and Web Scale
• High prices from RDBMS vendors
• Use commodity hardware
• Flexible data models
• It’s a cool movement ….

Is NoSQL a new Concept?

• No
• Remember MUMPS?
– SET ^Car("Door","Color")="BLUE"
• Remember Multi-value/PICK?
– MATWRITE array.variable ON file.variable,id. ….
• Ever heard of the NoSQL RDB?
– Carlo Strozzi
– http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page

CAP Theorem

• Consistency
– A consistent service operates fully or not at all: every node sees the same data.
• Availability
– The service is available to operate fully or not.
• Partition Tolerance
– Data is managed on multiple nodes, and the service keeps working when those nodes cannot reach each other. With 1 node there is 1 partition, so it either works or it does not when it comes to processing data.
• Significant because you can get only 2 of these 3 at the same time …
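The "2 of 3" trade-off can be made concrete with a toy model. The sketch below is an illustration only, not how any particular product behaves: `TwoNodeStore`, `Replica`, and the "CP" / "AP" mode names are hypothetical. During a partition, a CP store refuses writes (giving up availability), while an AP store accepts them locally (giving up consistency).

```python
class Replica:
    """One node holding its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class TwoNodeStore:
    """Toy two-replica store; mode is "CP" or "AP" (hypothetical names)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False  # True when a and b cannot reach each other

    def write(self, replica, key, value):
        other = self.b if replica is self.a else self.a
        if not self.partitioned:
            # Healthy network: update both replicas so they stay consistent.
            replica.data[key] = value
            other.data[key] = value
            return True
        if self.mode == "CP":
            # Keep consistency: refuse the write, giving up availability.
            return False
        # AP: stay available and accept the write locally; replicas now diverge.
        replica.data[key] = value
        return True

    def read(self, replica, key):
        return replica.data.get(key)


cp = TwoNodeStore("CP")
cp.write(cp.a, "color", "BLUE")
cp.partitioned = True
cp_accepted = cp.write(cp.a, "color", "RED")   # rejected during the partition

ap = TwoNodeStore("AP")
ap.write(ap.a, "color", "BLUE")
ap.partitioned = True
ap_accepted = ap.write(ap.a, "color", "RED")   # accepted on replica a only

print(cp_accepted, cp.read(cp.a, "color"), cp.read(cp.b, "color"))
print(ap_accepted, ap.read(ap.a, "color"), ap.read(ap.b, "color"))
```

In the AP case the two replicas disagree until the partition heals and some reconciliation step runs; that step is deliberately left out of the sketch.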


CAP Theorem …

• Arguments and links
– http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
– http://ksat.me/a-plain-english-introduction-to-cap-theorem/
– http://voltdb.com/company/blog/clarifications-cap-theorem-and-data-related-errors

CAP Theorem …: Consistency

[Diagram: databases DB1, DB2, DB3, DB4, DB5, DB6, DB7]

CAP Theorem …: Consistency

[Diagram: Hub with Spoke DB1, Spoke DB2, Spoke DB3, Spoke DB4]

CAP Theorem …: Consistency

[Diagram: databases DB1, DB2, DB3]

Distributed computing

• Fallacies (Peter Deutsch)
– The network is reliable
– Latency is zero
– Bandwidth is infinite
– The network is secure
– Topology doesn’t change
– There is one administrator
– Transport cost is zero
– The network is homogeneous
• Remember JINI? (See Apache River project)

NoSQL: Which Model to Use?

[Diagram: Data at the center, with four models: Key-Value, Document, Column, Graph]
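To compare the four models, it can help to write the same fact in each shape. The sketch below uses plain Python structures as stand-ins; the car record, the key names, and the `MADE_BY` edge label are made up for illustration and do not come from any specific product.

```python
import json

# Key-value: an opaque value looked up by a single key.
key_value = {"car:1": '{"make": "Ford", "color": "BLUE"}'}

# Document: a nested, self-describing structure stored under an id.
document = {"car:1": {"make": "Ford", "doors": [{"side": "left", "color": "BLUE"}]}}

# Column (wide-column): row key -> column family -> column -> value.
column = {"car:1": {"body": {"color": "BLUE"}, "maker": {"name": "Ford"}}}

# Graph: nodes plus labeled edges between them.
nodes = {"car:1": {"type": "car"}, "Ford": {"type": "maker"}}
edges = [("car:1", "MADE_BY", "Ford")]

# The same question ("who makes car:1?") asked of each model.
make_kv = json.loads(key_value["car:1"])["make"]   # the application must parse the value
make_doc = document["car:1"]["make"]               # the store understands the structure
make_col = column["car:1"]["maker"]["name"]        # addressed by family and column
make_graph = next(dst for src, label, dst in edges
                  if src == "car:1" and label == "MADE_BY")  # follow an edge

print(make_kv, make_doc, make_col, make_graph)
```

The answer is the same in every case; what changes is how much of the structure the store itself understands, which is what drives the model choice.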


NoSQL: Which project?

• http://nosql-database.org/ lists 122 today.
• Depends on your model selection.
• Most likely choose a well-known project.
• Don’t forget about shared risk!

NoSQL: Querying

• Some solutions have no querying
• When available, query languages differ
• Lack of general ad hoc querying – “no” SQL
• Have you heard of UnQL?
– http://www.unqlspec.org/display/UnQL/Home
• NOTE: Toad for Cloud

NoSQL: How to Succeed?

• Know your application
• Don’t forget the past lessons
• Consider a hybrid approach
• Fight the desire to Roll-Your-Own-DB
• Start small but significant

NoSQL: Hybrid Approach 1

• Two Systems
• NoSQL System
• SQL/RDBMS

[Diagram: NoSQL and SQL/RDBMS systems linked by a Data Mapper / Translator]
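A data mapper between the two systems might look roughly like the sketch below. It is a minimal illustration under stated assumptions: a Python dict stands in for the NoSQL system, an in-memory SQLite database stands in for the SQL/RDBMS, and `kv_to_row`, `row_to_kv`, and the `car` table are hypothetical names.

```python
import sqlite3


def kv_to_row(key, doc):
    """Flatten a key-value record into an (id, make, color) row."""
    return (key, doc.get("make"), doc.get("color"))


def row_to_kv(row):
    """Rebuild the key-value record from an (id, make, color) row."""
    key, make, color = row
    return key, {"make": make, "color": color}


# Stand-in NoSQL system (assumption: a simple key -> document store).
nosql_store = {"car:1": {"make": "Ford", "color": "BLUE"}}

# Stand-in SQL/RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE car (id TEXT PRIMARY KEY, make TEXT, color TEXT)")

# Mapper direction 1: NoSQL -> SQL.
for key, doc in nosql_store.items():
    conn.execute(
        "INSERT INTO car (id, make, color) VALUES (?, ?, ?)",
        kv_to_row(key, doc),
    )
conn.commit()

# The SQL side now supports ad hoc queries over the same data.
blue_cars = conn.execute(
    "SELECT id, make, color FROM car WHERE color = ?", ("BLUE",)
).fetchall()

# Mapper direction 2: SQL -> NoSQL.
round_trip = dict(row_to_kv(row) for row in blue_cars)
print(blue_cars, round_trip)
```

The cost of this approach is visible even at this size: every new field has to be added to the table and to both mapper functions, and the two copies must be kept in sync.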


NoSQL: Hybrid Approach 2

• One system does both NoSQL and SQL

[Diagram: Data at the center, with models: Relational, Key-Value, Document, Column, Graph, ?]

GlobalsDB.org Project

• Name comes from the underlying data structure
– Multi-dimensional array
– Basis for commercial Caché data system
• Free for development and production deployment
• NoSQL DB with Java and Node.js APIs
• Code base is same as commercial product
• APIs are open sourced or being open sourced
• Database kernel is not open source

A “Global” Definition

• A Global is a persistent, sparse, multi-dimensional array, which consists of one or more storage elements or "nodes". Each node is identified by a node reference (which is, essentially, its logical address)
– simple == "some data"
– complex["subscript-1", "subscript-2"] == "some data"
• Example
– product[item,type,os,processor] == quantity
– product["computer","laptop","Mac","i7"] == 3
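The shape of a global can be imitated in a few lines of Python. This is only an in-memory stand-in to show the data structure: `SparseGlobal` and its methods are hypothetical names, not the GlobalsDB API, and nothing here is persistent.

```python
class SparseGlobal:
    """In-memory stand-in for a global: nodes keyed by subscript tuples."""

    def __init__(self, name):
        self.name = name
        self._nodes = {}  # only nodes that were set take up space (sparse)

    def set(self, value, *subscripts):
        """Store a value at the node reference given by the subscripts."""
        self._nodes[subscripts] = value

    def get(self, *subscripts, default=None):
        """Return the value at a node reference, or default if it was never set."""
        return self._nodes.get(subscripts, default)

    def children(self, *prefix):
        """Sorted distinct next-level subscripts below a node reference."""
        n = len(prefix)
        return sorted({key[n] for key in self._nodes
                       if len(key) > n and key[:n] == prefix})


product = SparseGlobal("product")
product.set(3, "computer", "laptop", "Mac", "i7")
product.set(5, "computer", "desktop", "Linux", "i5")

quantity = product.get("computer", "laptop", "Mac", "i7")
kinds = product.children("computer")
missing = product.get("computer", "tablet")

print(quantity, kinds, missing)
```

Subscripts act as a path through the array, so a "simple" global is just the case with no subscripts and the same structure can play the role of a key-value store or a nested document.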


GlobalsDB Architecture

• Current Architecture

[Diagram: three layers]
Paradigm: Array | Document | Key-Value | ?
Global
Language: JavaScript | Java | ?

GlobalsDB, NoSQL, Big Data

• http://nosql.mypopescu.com/
• http://highscalability.com/
• http://nosqltapes.com/
• http://globalsdb.wordpress.com