Big Data Everywhere: Making Big Data Results Easy, Economic & Fast

Companies that can harness big data will trample big data incompetents, writes The Economist. The McKinsey Global Institute's report on Big Data states that the value of Big Data to the US health care system could be $300 billion, that a 60% increase in retailer operating margins is possible and that location data could provide $600 billion of value to consumers globally.

Big Data is everywhere and it represents a huge opportunity to those who can use it effectively. But how do you know whether you have a Big Data problem? And if you do, how do you solve it?

The relational database has been at the heart of many IT systems for more than two decades. In recent years, this has been augmented by data warehousing technology that can support business analytics without impacting the core online transaction processing (OLTP) functions of the main database.

The world is changing. Location data from mobile devices, global web-scale applications like social media, machine-generated data and the data exhaust from a multitude of new systems present new opportunities to derive value and improve efficiency, but also present significant challenges in terms of volume, velocity and variety of data. Traditional relational databases are often a poor fit for managing these new Big Data requirements because of cost, performance or both.

The analyst firm Forrester defines Big Data in terms of four key attributes - Volume, Velocity, Variety and Variability. This is a catchy definition but the reality is that any one of the four Vs can be a challenge for existing data management systems, giving you a Big Data problem. This white paper provides an introduction to some of the new technology that can help.

Scaling out versus scaling up

For more than two decades, relational databases have been the standard way to manage large datasets. The technology is well understood, highly standardised and works well for a wide range of use cases. But as data volumes grow and workloads change, alternative approaches become attractive. The traditional relational database runs on a single computer. As data sizes and transaction rates increase, the means of growing the capacity of the system is to attach more storage and use a faster computer. This works up to the point that no faster computer or high speed storage is available, or if available, is too expensive to be considered.

Using many smaller computers rather than one large one is an attractive idea. The sweet spot in terms of price-performance is smaller computer systems based on commodity components. When these computers can use local storage rather than a costly storage area network (SAN) or other shared storage, further costs get eliminated along the way. This has been known for years; the problem has been how to take advantage of these lower cost computers for database applications.

New generation database technologies make it possible to apply many low cost servers to the task of running a database, distributing the load across the servers in a scale-out approach that can collectively provide the resources needed even for very demanding Big Data applications. The ability to run on low cost hardware and scale in this way gives another advantage: since public cloud infrastructures are built on this kind of architecture, scale-out databases are also well suited to running in the cloud.

NOSQL databases, together with distributed processing frameworks such as Hadoop, are aimed at making the exploitation of scale-out architecture possible. By reducing the cost of dealing with the data explosion, such technologies change the economics of Big Data, allowing you to drive value and transform your business without sacrificing the performance your applications demand.

For additional information please email [email protected] or call us at the numbers below.

US office: 181 Fremont St, San Francisco CA. +1-866-487-2650 UK office: Wenlock Building, 50-52 Wharf Rd, London N1 7EU. +44-203-1760143

NoSQL & NOSQL

NoSQL (and NOSQL) are frequently used terms. Unfortunately neither is very meaningful. NoSQL is a recently invented term used to describe those databases which do not have SQL as their query language. But document databases, graph databases, object databases, tuple and triple databases, key-value stores and many others all meet this definition while being wildly different in their capabilities. NoSQL is merely a category that excludes traditional relational databases.

To add to the confusion, some of the new NoSQL databases have recently had SQL or SQL-like features added, leading to NoSQL being adjusted by some to read NOSQL, meaning Not Only SQL. This does not help much either.

In reality, NoSQL or NOSQL databases make a number of design compromises which go beyond SQL. Typically they allow some of the guarantees around atomicity, consistency and isolation to be relaxed, disallow some kinds of complex transactions and do not support join operations. This sounds drastic to those brought up to think of these features as fundamental to database design but there is a trade-off here. NOSQL databases can provide outstanding performance and high availability for a wide variety of emerging use cases which are outlined below. More importantly, they can do so at low cost.

Databases such as Acunu Reflex are built for this new world. Low cost commodity servers or the cloud are assumed to be the target for deployment. That means that distribution is a given, both to ensure the ability to scale out and to provide continued operation even in the presence of the occasional inevitable hardware failure, whether of a server component such as a disk or network interface, or even of an entire data center.

The death of the relational database?

Any claims that the relational database is dead are greatly exaggerated. Relational databases are not going away. Expect to continue using them for applications where they work well: those that require the sophisticated support that relational databases provide for complex transactions, for example.

But there are plenty of use cases where the relational database is overkill, where scaling to meet performance or throughput challenges is a problem and where new technology can provide a better solution at lower cost.

What use cases make sense?

Time series data, like telemetry from smart metering projects, IT infrastructure monitoring and price data in financial markets

Time series data is common but using a relational database to store it can be a poor choice. In many cases, new time-series data simply gets added at the end of existing data. There are no complex transactions needed to combine the new data with the existing data, so the complexity of a relational database simply imposes performance overheads and cost on the resulting system.

Well-designed NOSQL systems will handle large volumes of time series data easily and without the performance degradation over time that can hamper the ability of a relational database to handle the largest time series datasets.
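The append-only access pattern described above can be illustrated with a minimal Python sketch. It is not how Acunu Reflex or Cassandra stores data; the class name TimeSeriesStore, its methods and the assumption that readings arrive in time order are all hypothetical and chosen only for illustration. The point is that a write is a plain append and a time-range read is a binary search, with no transaction or join involved.

```python
from bisect import bisect_left
from collections import defaultdict


class TimeSeriesStore:
    """Toy in-memory store: one time-ordered list of (timestamp, value) per series."""

    def __init__(self):
        self._series = defaultdict(list)

    def append(self, series_id, timestamp, value):
        points = self._series[series_id]
        # Assumption for this sketch: readings arrive in time order,
        # so every write is a cheap append to the end of the list.
        if points and timestamp < points[-1][0]:
            raise ValueError("out-of-order timestamp")
        points.append((timestamp, value))

    def range(self, series_id, start, end):
        """Return the points with start <= timestamp < end."""
        points = self._series[series_id]
        # (start,) sorts before any (start, value), so bisect_left finds
        # the first point whose timestamp is >= start (and likewise for end).
        lo = bisect_left(points, (start,))
        hi = bisect_left(points, (end,))
        return points[lo:hi]
```

A smart meter, for example, would call append once per reading, and a billing report would call range once per meter and billing period.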

Write intensive workloads with large working sets and low latency requirements, such as games, online advertising and real-time operational analytics

Relational databases are optimized for workloads where reading the data occurs more often than writing or modifying it. Where you have the need to absorb huge amounts of data which will be read relatively infrequently, look for a system that supports fast writes.

Many NOSQL databases are optimised for write-intensive workloads, providing the ability to deal with data rates that would require high-end hardware in a traditional solution.

High availability - particularly where you do not have control over the hardware infrastructure and do not want to have to trust to a cloud provider's SLA (or lack of one)

Building high availability into a relational database is hard. Common solutions include using shared storage and multiple replica or standby servers that can take over when the active server fails and special software to manage the failover process. Not only is this expensive, it often requires a high level of control over the hardware infrastructure, a fairly sophisticated operations capability and continual testing. Worse, in cloud deployments you may have no control over the kind of hardware that is provided to you, so some of the techniques used to make a relational database highly available become impossible.

The answer is a NOSQL system that allows simple, automatic replication of data across multiple servers with no single point of failure. Systems such as Acunu Reflex allow the deployment of clusters that span multiple data centers, so that tough business continuity challenges become much easier.
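One common way to replicate data across servers with no single point of failure is to place the servers on a hash ring and store each key on several consecutive servers. The Python sketch below is a simplified illustration of that general idea, not Acunu Reflex's or Cassandra's actual implementation; the Ring class, its method names and the node names used with it are hypothetical.

```python
import hashlib
from bisect import bisect_right


def _token(key: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring: every node is equal, each key lives on several nodes."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # One ring position per node, sorted by token.
        self._ring = sorted((_token(n), n) for n in nodes)

    def replicas(self, key):
        """Return the distinct nodes that hold a copy of this key."""
        tokens = [t for t, _ in self._ring]
        # First node clockwise from the key's token, wrapping at the end.
        start = bisect_right(tokens, _token(key)) % len(self._ring)
        count = min(self.replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]

    def readable(self, key, down):
        """True if at least one replica of the key is on a node that is up."""
        return any(node not in down for node in self.replicas(key))
```

With a replication factor of three, a key stays readable when any one or two of its replica nodes fail. Because every node plays the same role, there is no master whose loss would stop the cluster.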

Global deployments where you need to be able to put the data close to your customers

Whether it is providing low latency access to web customers to improve conversion rates or capturing activity across a global user base, NOSQL technologies provide automated data distribution, replication, etc. in a manner that is generally much less expensive than when using a complex relational architecture. NOSQL databases have been designed to facilitate a high degree of automated data protection while leveraging commodity hardware. This is much less expensive and easier to manage than the commonly manual approach of spreading data out across numerous relational database servers.

Scale from a handful to tens or even hundreds of servers at low incremental cost

It is possible to scale a relational database across many servers by sharding, that is, splitting the data so that in a customer database, for example, customers whose names begin with A sit on one server, B on the next and so on. The problem with this approach is that splitting the data and load evenly between servers is hard (you probably don't have many customers whose names begin with X) and that many operations that would be simple if the data were all in one place (like joins) become impossible.
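The uneven load that results from splitting customers by first letter can be seen in a small Python sketch. It compares first-letter placement with hash-based placement, a widely used alternative that this paper does not itself describe; the function names and the 26-shard layout are assumptions made for illustration.

```python
import hashlib
from collections import Counter


def shard_by_first_letter(name: str) -> int:
    """One shard per letter: shard 0 holds A, shard 25 holds Z.

    Assumes the name starts with an ASCII letter.
    """
    return ord(name[0].upper()) - ord("A")


def shard_by_hash(name: str, shards: int = 26) -> int:
    """Place a name by hashing it, so placement ignores spelling."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shards


def load(names, shard_fn):
    """Count how many names land on each shard."""
    return Counter(shard_fn(n) for n in names)
```

Running load over a real customer list with each function shows the difference: first-letter placement piles common initials onto a few servers, while hashing spreads names roughly evenly. Hashing does not restore joins across servers; it only addresses the balance problem.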

Unstructured or semi-structured data

Changing the schema of a relational database once it contains lots of data is a costly and time-consuming operation. But with agile approaches to software development, the ability to adjust the database quickly to reflect changes in business requirements is becoming increasingly important. New database technologies offer this flexibility and in some cases can also handle unstructured data, that is, data whose format has not been anticipated in advance by the database administrator.

What to use, when?

Given the wide range of NOSQL (or NoSQL) databases available, how should one choose? Some of the technologies lend themselves to specific specialised use cases. If you are looking to store a social graph, for example, it is worth looking at a graph database.

But for the use cases described in this paper, several of the mainstream NOSQL databases are a good fit. Acunu Reflex incorporates Apache Cassandra, which has a particularly strong set of features combined with a significant user base. Cassandra works well for time series data and for applications that have write-heavy workloads. It supports deployments ranging in size from a handful to hundreds of nodes, either on hardware or in the cloud, and provides great features for high availability: it allows clusters which span multiple geographically distributed data centers or cloud availability zones while having no single point of failure.

While Acunu Reflex is able to scale to large deployments, it works well in smaller ones too. All the nodes in an Acunu Reflex cluster are the same; there's no master and slave, no central controlling node and no read-only replicas. This simplicity makes starting small an easy task and helps in sizing for growth as well. Because Acunu incorporates Cassandra it supports CQL, a SQL-like query language that provides familiarity to a generation of developers brought up on SQL.

Hadoop, another Apache project, has become almost synonymous with Big Data in the eyes of many. Unlike Cassandra, it is not really a database. Instead, it is a distributed processing framework based around technology originally developed by Google. Where there is a requirement to store and process huge volumes of unstructured data, Hadoop can be a great fit. But Hadoop's big strength is batch analytics. Both Cassandra and Hadoop can deliver huge cost and ROI benefits over legacy technologies but support different use cases and workloads. Big Data commonly doesn't lend itself to a single all-purpose solution.

What's different? How to get started

In some cases there are existing applications where current database technology is struggling to keep up with the workload. More commonly there is an existing or new dataset that matches one or more of the use cases described here and which lends itself to a Big Data solution. In both cases, delivering business benefit without big fixed costs for high-end hardware or recurrent costs for enterprise software licensing is also a significant driver.

Choosing the technology to use requires some research and is naturally influenced by the skills and size of the team, along with the appetite for risk. None of the new technologies are as mature as the relational database, so the surrounding ecosystem of tools is sometimes limited and skills are in great demand and often hard to find.

Acunu can help with technology selection and understanding the use cases that can best benefit from the new generation of Big Data tools. Acunu also provides tailored training based around your needs to help get your team up to speed. To get the best out of these new technologies means adopting design practices which are strikingly different from those that have been widely used in the past. Picking one example, normalization has been a guideline to the design of relational databases in order to eliminate redundant data, but when the same principle is used with some of the new data stores, much of the performance benefit can be lost. Understanding the trade-offs is critical to project success.
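The normalization trade-off mentioned above can be made concrete with a small Python sketch. It uses plain in-memory structures rather than any real database, and every name in it (customers, orders, orders_by_customer, record_order and the two read functions) is hypothetical. The normalised layout stores the customer name once and combines data at read time; the denormalised layout copies the name into every order row at write time so that a read is a single lookup by key.

```python
from collections import defaultdict

customers = {1: "Ada", 2: "Grace"}      # customer_id -> name, stored once
orders = []                             # normalised: rows hold customer_id only
orders_by_customer = defaultdict(list)  # denormalised: name copied into each row


def record_order(order_id, customer_id, total):
    """Write the order in both layouts so the reads can be compared."""
    orders.append({"order_id": order_id, "customer_id": customer_id, "total": total})
    orders_by_customer[customer_id].append(
        {"order_id": order_id, "customer_name": customers[customer_id], "total": total}
    )


def read_normalised(customer_id):
    """Join at read time: look up the name, then scan the orders."""
    name = customers[customer_id]
    return [
        {"order_id": o["order_id"], "customer_name": name, "total": o["total"]}
        for o in orders
        if o["customer_id"] == customer_id
    ]


def read_denormalised(customer_id):
    """Single lookup by key; the redundant name is already in each row."""
    return list(orders_by_customer[customer_id])
```

Both reads return the same rows. The denormalised layout trades extra storage and more work at write time (and when a customer's name changes) for a read that needs no join, which matters in stores that do not support join operations.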

Acunu also offers a product that incorporates a fully tested and supported version of Apache Cassandra, web-based cluster management and an optimised operating system image in a single software package that can streamline deployment and reduce support effort. For organisations that need the benefits that cutting edge technology can bring to their Big Data opportunities but do not want to become experts in the numerous open source projects out there, Acunu Reflex provides an easily accessible route to Big Data results.

