
Page 1: ITI016En-The evolution of databases (II)

The evolution of database technology (II)

Huibert Aalbers

Senior Certified Executive IT Architect

Page 2: ITI016En-The evolution of databases (II)

IT Insight podcast

• This podcast belongs to the IT Insight series

• You can subscribe to the podcast through iTunes.

• Additional material, such as presentations in PDF format or white papers mentioned in the podcast, can be downloaded from the IT Insight section of my site at http://www.huibert-aalbers.com

• You can send questions or suggestions regarding this podcast to my personal email, [email protected]

Page 3: ITI016En-The evolution of databases (II)

A brave new world

• With Web 2.0 came the need for a new set of tools that could handle an explosive growth of data

• Data willingly shared by the users

• Data collected about users and customers, sometimes without their knowledge

• Sensors, IoT, etc.

• Big Data requires a new kind of data repository

Page 4: ITI016En-The evolution of databases (II)

Do I need a different solution?

There are basically two ways to determine that you require a new type of database solution instead of a traditional relational database:

• The architect designs a new system from the ground up using a Big Data solution because they know it will be required

• The team has tried every strategy to scale the existing relational database and it is still not enough (a sketch of two of these strategies follows the list below)

• Upgrading the hardware / use of SSDs / Networking, etc.

• Query optimization

• Using a data caching scheme

• Partitioning the data

• Building new indices

• Denormalizing the data

• Using stored procedures, etc.
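
As a rough illustration of two of these strategies (building new indices and denormalizing the data), here is a minimal sketch using SQLite in Python; the tables and column names are made up for the example.

```python
# Minimal sketch of two relational scaling strategies: adding an index
# and denormalizing into a read-optimized table. SQLite is used purely
# for brevity; table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized schema: orders reference customers by id.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

# Strategy: build a new index so lookups by customer no longer scan the table.
cur.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# Strategy: denormalize by copying the customer name into a reporting table,
# trading storage and write complexity for cheaper reads.
cur.execute("""
    CREATE TABLE orders_report AS
    SELECT o.id, c.name AS customer_name, o.total
    FROM orders o JOIN customers c ON c.id = o.customer_id
""")
conn.commit()
```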

Page 5: ITI016En-The evolution of databases (II)

In order to solve the issue, we have to give up something

• What can we give up?

• ACID properties

• Data normalization

• Transaction support

Page 6: ITI016En-The evolution of databases (II)

NoSQL Repositories

From my point of view, the name “NoSQL” is not the right one to describe non-relational databases

• The success of NoSQL databases is not due to the fact that developers don’t like SQL. It is due to the following reasons:

• They scale linearly

• They are more flexible (schema-less)

• They are easier to manage at extremely high volumes of data

I think it is better to call them distributed non-relational databases

Page 7: ITI016En-The evolution of databases (II)

Key-Value pair databases

These data stores are also known as distributed hash tables

• Pros

• Extremely quick; a well-understood computer science problem

• Scale almost linearly

• Cons

• Performing complex queries against the values can be slow and cumbersome

• Stores in which the product also keeps a timestamp on the data for versioning are a particular case of key-value pair databases
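
To make the concept concrete, here is a minimal in-memory sketch of a key-value store that also keeps a timestamp per version, as described above; it illustrates the idea only, not how any particular product implements it.

```python
# Minimal in-memory sketch of a key-value store with timestamped
# versions. Real products distribute the hash table across many nodes.
import time

class TinyKVStore:
    def __init__(self):
        self._data = {}          # key -> list of (timestamp, value)

    def put(self, key, value):
        self._data.setdefault(key, []).append((time.time(), value))

    def get(self, key):
        versions = self._data.get(key)
        return versions[-1][1] if versions else None   # latest version

    def history(self, key):
        return list(self._data.get(key, []))           # all kept versions

store = TinyKVStore()
store.put("user:42", {"name": "Ada"})
store.put("user:42", {"name": "Ada Lovelace"})
print(store.get("user:42"))            # latest value
print(len(store.history("user:42")))   # 2 versions kept
```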

Page 8: ITI016En-The evolution of databases (II)

Document-based databases

This is a large category of data stores that let you work with data stored in a particular document format. Popular document formats used to store data include:

• XML
• JSON
• YAML

In this kind of data store, documents are identified by a unique key, which allows for quick retrieval of the information.

Although conceptually all data stores in this category are relatively similar, there are still important differences from one product to another:

• Query methods (SQL-like, MapReduce, etc.)
• Replication
• Data consistency

Document-based databases are schema-less
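
As a small illustration of key-based storage and retrieval in a document database, the sketch below uses pymongo against a MongoDB instance assumed to be running locally; the database, collection, and document contents are made up.

```python
# Minimal sketch of storing and retrieving a JSON document by its key,
# using pymongo against a local MongoDB instance (assumed to be running
# on the default port); database and collection names are made up.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
articles = client["demo"]["articles"]

# Documents are schema-less: each one can carry different fields.
articles.insert_one({"_id": "iti016",
                     "title": "The evolution of databases (II)",
                     "tags": ["nosql", "big data"]})

# Retrieval by the unique key is a simple, fast lookup.
doc = articles.find_one({"_id": "iti016"})
print(doc["title"])
```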

Page 9: ITI016En-The evolution of databases (II)

MongoDB vs CouchDB

• MongoDB

• Very high volumes of somewhat mutable data

• Dynamic flexible queries, somewhat similar to SQL

• Very quick queries

• CouchDB

• Very high volumes of mostly immutable data

• Pre-defined queries, based on MapReduce, implemented in JavaScript

• Master-Master replication

• Neither MongoDB nor CouchDB natively works with XML data; both work with JSON documents
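
To illustrate the kind of dynamic, SQL-like query MongoDB supports, here is a short pymongo sketch; the "orders" collection and its fields are hypothetical.

```python
# Sketch of the kind of dynamic, ad-hoc query MongoDB supports,
# roughly comparable to a SQL WHERE / ORDER BY / LIMIT; the "orders"
# collection and its fields are made up for the example.
from pymongo import MongoClient, DESCENDING

orders = MongoClient("mongodb://localhost:27017")["demo"]["orders"]

# Find recent large orders for one customer, newest first.
cursor = (orders.find({"customer_id": 42, "total": {"$gt": 100}})
                .sort("created_at", DESCENDING)
                .limit(10))
for order in cursor:
    print(order["total"])
```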

Page 10: ITI016En-The evolution of databases (II)

Document-based databases

• Among the many document-based databases, MongoDB is currently the most popular, closely followed by CouchDB

• The MongoDB API is currently supported by both DB2 and Informix

• That means it is now very easy to migrate from MongoDB to either of those databases and to store both structured data and JSON documents in a single repository

Page 11: ITI016En-The evolution of databases (II)

Hosted Document databases

• Both MongoDB and CouchDB are popular databases, which explains why there are many options to use both hosted and managed versions of these products

• Cloudant is a fully managed version of BigCouch, which is in turn a high-availability, fault-tolerant version of CouchDB

• Migrating from CouchDB or BigCouch to Cloudant is totally transparent

• Both MongoDB and CouchDB scale very well by implementing sharding, which makes them very well suited for born-on-cloud applications
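
Sharding itself is conceptually simple: a key is hashed to choose the node that stores the corresponding data. The sketch below shows the idea in plain Python; it is not the actual sharding algorithm of MongoDB or CouchDB.

```python
# Conceptual sketch of sharding: a document's key is hashed to pick
# which shard (node) stores it, so data and load spread across nodes.
# Illustration of the idea only, not any product's implementation.
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(key: str) -> str:
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

for key in ("user:1", "user:2", "user:3", "user:4"):
    print(key, "->", shard_for(key))
```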

Page 12: ITI016En-The evolution of databases (II)

Graph databases

Social networks have become one of the most representative applications of what is known as Web 2.0

• Storing and processing social graphs in relational databases is both complex and inefficient

Unlike relational databases, this new kind of data store focuses more on relationships than on data. For social network projects, this results in:

• Increased performance

• Simpler and more natural development
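
A tiny example helps show why traversals are more natural in a graph model: a friends-of-friends query is a simple walk over an adjacency list, whereas in SQL it typically requires self-joins. The sketch below is plain Python with made-up names, not a specific graph database API.

```python
# Tiny in-memory social graph stored as an adjacency list, with a
# friends-of-friends query expressed as a simple traversal.
graph = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave":  {"bob", "carol"},
}

def friends_of_friends(person):
    direct = graph.get(person, set())
    result = set()
    for friend in direct:
        result |= graph.get(friend, set())
    # Exclude the person and their direct friends from the result.
    return result - direct - {person}

print(friends_of_friends("alice"))   # {'dave'}
```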

Page 13: ITI016En-The evolution of databases (II)

Hadoop

Hadoop is a framework designed to process tasks that can be parallelized over extremely high volumes of data distributed across a large number of server nodes in a cluster. It has four main components:

• Hadoop Common

• Hadoop Distributed File System (HDFS)

• Designed primarily to handle extremely high volumes of immutable data

• Loading and deleting data is efficient; updating it is not (a sketch of this write-once pattern follows below)

• Hadoop YARN

• Hadoop MapReduce

Managing a complete Hadoop system is currently not for the faint of heart
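
As a small illustration of the write-once pattern HDFS favors, the sketch below drives the standard hdfs command-line client from Python; it assumes a configured Hadoop client on the PATH, and all paths and file names are made up.

```python
# Sketch of the write-once pattern HDFS favors: append new files
# instead of updating existing data in place. Assumes a configured
# Hadoop client on the PATH; all paths are hypothetical.
import subprocess

def hdfs(*args):
    subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-mkdir", "-p", "/data/events")
# Efficient: load today's batch as a brand-new file.
hdfs("-put", "events-2015-07-15.log", "/data/events/")
# Also efficient: delete a whole obsolete file.
hdfs("-rm", "/data/events/events-2014-07-15.log")
# There is no cheap "update a record" operation; corrections are
# usually written as new files and reconciled at query time.
```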

Page 14: ITI016En-The evolution of databases (II)

MapReduce

MapReduce is the data processing algorithm that sits at the very core of Hadoop

For each query, developers need to implement the following two functions:

• Map: In this phase, the overall problem is divided into smaller tasks (which can be broken down further) that can be distributed to run on different server nodes

• Reduce: In this second phase, the master node combines the answers received from the different nodes and processes them to produce a reply to the query
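
The classic word-count example shows the division of labor between the two phases. The sketch below runs in a single Python process purely for illustration; Hadoop would execute many Map tasks in parallel and shuffle their output to the Reduce tasks.

```python
# Single-process illustration of the Map and Reduce phases using the
# classic word-count example.
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) pairs for every word in one chunk of input.
    return [(word, 1) for word in document.split()]

def reduce_phase(word, counts):
    # Combine all partial counts for one word into the final answer.
    return word, sum(counts)

documents = ["big data big ideas", "big clusters of data"]

# Shuffle step: group the mapped pairs by key (word).
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

results = dict(reduce_phase(w, c) for w, c in grouped.items())
print(results)   # {'big': 3, 'data': 2, 'ideas': 1, 'clusters': 1, 'of': 1}
```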

Page 15: ITI016En-The evolution of databases (II)

MapReduce

Hadoop allows you to store any kind of data:

• Structured

• Unstructured

When using Hadoop to store structured data in a data-warehouse-like environment, it is possible to use languages that automatically generate the code for the Map and Reduce functions:

• Apache Pig (Pig Latin)

• Apache Hive (HiveQL, similar to SQL)

• IBM Big SQL

Page 16: ITI016En-The evolution of databases (II)

Analyzing streams of data

Sometimes the amount of stored data is so large that it simply becomes impossible to perform real-time analysis

• In those cases, the best alternative is to analyze the stream of data before it is stored in the database

• The main idea is that the data is kept outside the database (generally in RAM) during a certain window of time, in order to detect a combination of events within a short period (a sketch follows the list below)

• Fraud detection

• Digital marketing
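
As a toy illustration of this windowing idea, the sketch below keeps recent events in memory and flags a suspicious burst of transactions for one card; the rule and the thresholds are invented for the example.

```python
# Sketch of window-based stream analysis: events are kept in RAM only
# for a fixed time window, and a simple rule flags a suspicious burst
# of activity (a toy fraud-detection rule; thresholds are made up).
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_EVENTS_PER_WINDOW = 3

recent = defaultdict(deque)   # card id -> timestamps inside the window

def on_event(card_id, timestamp):
    window = recent[card_id]
    window.append(timestamp)
    # Drop events that have fallen out of the time window.
    while window and timestamp - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) > MAX_EVENTS_PER_WINDOW:
        print(f"possible fraud: {len(window)} events for {card_id} in {WINDOW_SECONDS}s")

for t in (0, 10, 20, 30, 40):          # five transactions in 40 seconds
    on_event("card-1234", t)
```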

Page 17: ITI016En-The evolution of databases (II)

Polyglot Persistence

When working with applications that require extreme scaling, there is no single solution that fits all challenges. It is likely that, after careful analysis of the problem, more than one data store will be required to obtain the best performance.

• This is known as “Polyglot Persistence”
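
As a rough sketch of what polyglot persistence can look like in application code, the example below routes each kind of data to a different, purpose-chosen store; the classes are placeholders standing in for real key-value, document, and graph databases.

```python
# Conceptual sketch of polyglot persistence: one application-level
# facade routes each kind of data to the store best suited to it.
# The three store classes are placeholders, not real products.
class KeyValueStore:                     # e.g. sessions, counters
    def __init__(self): self.data = {}
    def put(self, k, v): self.data[k] = v

class DocumentStore:                     # e.g. user profiles as JSON
    def __init__(self): self.docs = {}
    def save(self, key, doc): self.docs[key] = doc

class GraphStore:                        # e.g. the social graph
    def __init__(self): self.edges = set()
    def relate(self, a, b): self.edges.add((a, b))

class Persistence:
    def __init__(self):
        self.sessions = KeyValueStore()
        self.profiles = DocumentStore()
        self.social = GraphStore()

    def new_session(self, sid, user): self.sessions.put(sid, user)
    def save_profile(self, user, doc): self.profiles.save(user, doc)
    def add_friendship(self, a, b): self.social.relate(a, b)

p = Persistence()
p.new_session("s-1", "alice")
p.save_profile("alice", {"name": "Alice", "plan": "free"})
p.add_friendship("alice", "bob")
```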

Page 18: ITI016En-The evolution of databases (II)

Contact information

On Twitter: @huibert (English), @huibert2 (Spanish)

Web site: http://www.huibert-aalbers.com

Blog: http://www.huibert-aalbers.com/blog