big data analytics || nosql data management for big data

Post on 09-Apr-2017

    CHAPTER 9 NoSQL Data Management for Big Data

    9.1 WHAT IS NOSQL?

    In Chapter 6, we discussed different kinds of high-performance appliances from both the architectural perspective and from a data organization perspective. To continue the discussion of application development, it is valuable to continue to review the different means for managing and organizing massive data volumes to meet business needs, and in this chapter, we will look at some alternative methods for data management.

    In any environment that already depends on a relational database model and/or a data warehousing approach to data management, it would be unwise to ignore support for these traditional data organizations for data that is to be implemented in a big data environment. Fortunately, most hardware and software appliances support standard approaches to standard, SQL-based relational database management systems (RDBMSs). Software appliances often bundle their execution engines with the RDBMS and utilities for creating the database structures and for bulk data loading.

    However, the availability of a high-performance, elastic distributed data environment enables creative algorithms to exploit variant modes of data management in different ways. In fact, some algorithms will not be able to consume data in traditional RDBMS systems and will be acutely dependent on alternative means for data management. Many of these alternate data management frameworks are bundled under the term "NoSQL databases." The term "NoSQL" may convey two different connotations: one implying that the data management system is not an SQL-compliant one, while the more accepted implication is that the term means "Not only SQL," suggesting environments that combine traditional SQL (or SQL-like query languages) with alternative means of querying and access.

    Big Data Analytics. DOI: http://dx.doi.org/10.1016/B978-0-12-417319-4.00009-0 © 2013 Elsevier Inc. All rights reserved.

    9.2 SCHEMA-LESS MODELS: INCREASING FLEXIBILITY FOR DATA MANIPULATION

    NoSQL data systems hold out the promise of greater flexibility in database management while reducing the dependence on more formal database administration. NoSQL databases have more relaxed modeling constraints, which may benefit both the application developer and the end-user analysts when their interactive analyses are not throttled by the need to cast each query in terms of a relational table-based environment.

    Different NoSQL frameworks are optimized for different types of analyses. For example, some are implemented as key-value stores, which nicely align to certain big data programming models, while another emerging model is a graph database, in which a graph abstraction is implemented to embed both semantics and connectivity within its structure. In fact, the general concepts for NoSQL include schema-less modeling in which the semantics of the data are embedded within a flexible connectivity and storage model; this provides for automatic distribution of data and elasticity with respect to the use of computing, storage, and network bandwidth in ways that don't force specific binding of data to be persistently stored in particular physical locations. NoSQL databases also provide for integrated data caching that helps reduce data access latency and speed performance.

    The loosening of the relational structure is intended to allow different models to be adapted to specific types of analyses. The technologies are evolving and maturing. And because of the relaxed approach to modeling and management that does not enforce shoehorning data into strictly defined relational structures, the models themselves do not necessarily impose any validity rules; this potentially introduces risks associated with ungoverned data management activities such as inadvertent inconsistent data replication, reinterpretation of semantics, and currency and timeliness issues.

    9.3 KEY-VALUE STORES

    A relatively simple type of NoSQL data store is a key-value store, a schema-less model in which values (or sets of values, or even more complex entity objects) are associated with distinct character strings called keys. Programmers may see similarity with the data structure known as a hash table. Other alternative NoSQL data stores are variations on the key-value theme, which lends a degree of credibility to the model.

    As an example, consider the data subset represented in Table 9.1. The key is the name of the automobile make, while the value is a list of names of models associated with that automobile make.

    As can be inferred from the example, the key-value store does not impose any constraints about data typing or data structure: the value associated with the key is the value, and it is up to the consuming business applications to assert expectations about the data values and their semantics and interpretation. This demonstrates the schema-less property of the model.

    The core operations performed on a key-value store include:

    - Get(key), which returns the value associated with the provided key.
    - Put(key, value), which associates the value with the key.
    - Multi-get(key1, key2, ..., keyN), which returns the list of values associated with the list of keys.
    - Delete(key), which removes the entry for the key from the data store.
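
    The four operations listed above can be sketched with an in-memory dictionary standing in for the data store. This is only an illustration under stated assumptions: the class name KeyValueStore and its method names are hypothetical and do not correspond to the API of any particular NoSQL product, and the sample values are taken from Table 9.1.

```python
from typing import Any, Dict, Iterable, List, Optional


class KeyValueStore:
    """In-memory stand-in for a key-value store (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def get(self, key: str) -> Optional[Any]:
        # Return the value for the exact key, or None if the key is absent.
        return self._data.get(key)

    def put(self, key: str, value: Any) -> None:
        # Associate the value with the key, replacing any earlier value.
        self._data[key] = value

    def multi_get(self, keys: Iterable[str]) -> List[Optional[Any]]:
        # Return the values for the given keys, in the same order as the keys.
        return [self._data.get(k) for k in keys]

    def delete(self, key: str) -> None:
        # Remove the entry for the key; deleting a missing key is a no-op.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("Buick", ["Enclave", "LaCrosse", "Lucerne", "Regal"])
store.put("BMW", ["1-Series", "3-Series", "5-Series"])

buick_models = store.get("Buick")
both = store.multi_get(["BMW", "Buick"])
store.delete("BMW")
```

    Note that the store never inspects the values it holds; a list of model names is stored the same way any other object would be, which is the schema-less property described earlier.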

    One critical characteristic of a key-value store is uniqueness of the key: to find the values you are looking for, you must use the exact key. In this data management approach, if you want to associate multiple values with a single key, you need to consider the representations of the objects and how they are associated with the key. For example, you may want to associate a list of attributes with a single key, which may suggest that the value stored with the key is yet another key-value store object itself.
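
    As a small sketch of that last point, the value stored under a key can itself be a key-value mapping. The attribute names used here ("models", "model_count") are invented for illustration and are not part of the original example.

```python
from typing import Any, Dict

# Top-level store: key -> value, where each value is itself a key-value mapping.
store: Dict[str, Dict[str, Any]] = {}

# Hypothetical attributes for one make; the attribute names are assumptions.
store["Buick"] = {
    "models": ["Enclave", "LaCrosse", "Lucerne", "Regal"],
    "model_count": 4,
}

# Lookups need the exact key at each level.
models = store["Buick"]["models"]
missing = store.get("buick")  # different letter case, so there is no match
```

    The outer store still sees only one opaque value per key; it is the application that knows the value has inner keys and how to interpret them.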

    Table 9.1 Example Data Represented in a Key-Value Store

    Key        Value
    ...        ...
    BMW        {1-Series, 3-Series, 5-Series, 5-Series GT, 7-Series, X3, X5, X6, Z4}
    Buick      {Enclave, LaCrosse, Lucerne, Regal}
    Cadillac   {CTS, DTS, Escalade, Escalade ESV, Escalade EXT, SRX, STS}
    ...        ...

    Key-value stores are essentially very long, and presumably thin, tables (in that there are not many columns associated with each row). The table's rows can be sorted by the key value to simplify finding the key during a query. Alternatively, the keys can be hashed using a hash function that maps the key to a particular location (sometimes called a bucket) in the table. Additional supporting data structures and algorithms (such as bit vectors and Bloom filters) can even be used to determine whether the key exists in the data set at all. The representation can grow indefinitely, which makes it good for storing large amounts of data that can be accessed relatively quickly, as well as environments requiring incremental appends of data. Examples include capturing system transaction logs, managing profile data about individuals, or maintaining access counts for millions of unique web page URLs.
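
    To make the hashing idea concrete, the following sketch maps each key to a bucket and uses a small Bloom filter (a bit vector set by several hash functions) to rule out absent keys before scanning a bucket. Everything here is an assumption made for illustration: the class names, the bucket count, the filter size, and the use of SHA-256 as the hash function are not prescribed by the text.

```python
import hashlib
from typing import Any, List, Optional, Tuple


def stable_hash(key: str, seed: int = 0) -> int:
    # Deterministic hash; Python's built-in hash() of strings is salted per process.
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class BloomFilter:
    """Bit-vector membership test: false positives possible, false negatives not."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def add(self, key: str) -> None:
        for seed in range(self.num_hashes):
            self.bits[stable_hash(key, seed) % self.num_bits] = True

    def might_contain(self, key: str) -> bool:
        return all(
            self.bits[stable_hash(key, seed) % self.num_bits]
            for seed in range(self.num_hashes)
        )


class BucketedTable:
    """Key-value table whose keys are hashed to one of a fixed number of buckets."""

    def __init__(self, num_buckets: int = 8) -> None:
        self.buckets: List[List[Tuple[str, Any]]] = [[] for _ in range(num_buckets)]
        self.filter = BloomFilter()

    def _bucket(self, key: str) -> List[Tuple[str, Any]]:
        return self.buckets[stable_hash(key) % len(self.buckets)]

    def put(self, key: str, value: Any) -> None:
        bucket = self._bucket(key)
        for i, (existing, _) in enumerate(bucket):
            if existing == key:
                bucket[i] = (key, value)  # overwrite, so keys stay unique
                return
        bucket.append((key, value))
        self.filter.add(key)

    def get(self, key: str) -> Optional[Any]:
        if not self.filter.might_contain(key):
            return None  # definitely absent, so the bucket scan is skipped
        for existing, value in self._bucket(key):
            if existing == key:
                return value
        return None


table = BucketedTable()
table.put("/index.html", 3)  # e.g., an access count for a web page URL
table.put("/index.html", 4)
table.put("/about.html", 1)
```

    The filter can answer "possibly present" for a key that was never stored, which is why get still scans the bucket afterwards; it never answers "absent" for a key that was stored.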

    The simplicity of the representation allows massive amounts of indexed data values to be appended to the same key-value table, which can then be sharded, or distributed across the storage nodes. Under the right conditions, the table is distributed in a way that is aligned with the way the keys are organized, so that the hashing function that is used to determine where any specific key exists in the table can also be used to determine which node holds that key's bucket (i.e., the portion of the table holding that key).
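
    A minimal sketch of this alignment, assuming a simple modulo placement scheme: the same hash that locates a key also selects the node that holds it. The class name ShardedStore, the node names, and the choice of SHA-256 are hypothetical, and each "node" is just a dictionary in one process rather than a real storage server.

```python
import hashlib
from typing import Any, Dict, List, Optional


def stable_hash(key: str) -> int:
    # Deterministic hash so every client computes the same placement for a key.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ShardedStore:
    """Key-value table split across nodes by hashing the key (illustrative only)."""

    def __init__(self, node_names: List[str]) -> None:
        self.node_names = list(node_names)
        # Each node holds its own portion of the table.
        self.nodes: Dict[str, Dict[str, Any]] = {name: {} for name in self.node_names}

    def node_for(self, key: str) -> str:
        # The hash of the key determines which node owns the key.
        return self.node_names[stable_hash(key) % len(self.node_names)]

    def put(self, key: str, value: Any) -> None:
        self.nodes[self.node_for(key)][key] = value

    def get(self, key: str) -> Optional[Any]:
        return self.nodes[self.node_for(key)].get(key)


cluster = ShardedStore(["node-0", "node-1", "node-2"])
for make in ["BMW", "Buick", "Cadillac"]:
    cluster.put(make, f"models for {make}")

owner = cluster.node_for("Buick")
```

    One known limitation of plain modulo placement is that changing the number of nodes changes the owner of most keys, so this sketch should not be read as a complete distribution strategy.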

    While key-value pairs are very useful for both storing the results of analytical algorithms (such as phrase counts among massive numbers of documents) and for producing those results for reports, the model does pose some potential drawbacks. One is that the model will not inherently provide any kind of traditional database capabilities (such as atomicity of transactions, or consistency when multiple transactions are executed simultaneously); those capabilities must be provided by the application itself. Another is that as the model grows, maintaining unique values as keys may become more difficult, requiring the introduction of some complexity in generating character strings that will remain unique among a myriad of keys.
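
    Both drawbacks can be illustrated with a short sketch in which the application, not the store, supplies the missing guarantees. Here a lock serializes a read-modify-write update to a phrase count, and a random UUID suffix keeps generated keys distinct. The function names and the key format are assumptions, and a single-process lock is only a stand-in: it does not address coordination across distributed nodes.

```python
import threading
import uuid
from typing import Dict

store: Dict[str, int] = {}
store_lock = threading.Lock()


def increment(key: str, amount: int = 1) -> None:
    # Reading, adding, and writing back is not atomic by itself, so the
    # application serializes the whole update with a lock.
    with store_lock:
        store[key] = store.get(key, 0) + amount


def make_unique_key(prefix: str) -> str:
    # Append a random UUID so keys sharing a prefix do not collide in practice.
    return f"{prefix}:{uuid.uuid4().hex}"


def worker() -> None:
    for _ in range(1000):
        increment("phrase:big data")


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = store["phrase:big data"]
key_a = make_unique_key("profile")
key_b = make_unique_key("profile")
```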

    9.4 DOCUMENT STORES

    A document store is similar to a key-value store in that stored objects are associated with (and therefore accessed via) character string keys. The difference is that the values being stored, which are referred to as "documents," provide some structure and encoding of the managed data. There are different common encodings, including XML (Extensible Markup Language), JSON (JavaScript Object Notation), BSON (which is a binary encoding of JSON objects), or other means of serializing data (i.e., packaging up, and potentially linearizing, the data values associated with a data record or object).
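
    Serialization in this sense can be sketched with Python's standard json module: a record is packaged into a linear string that can be stored as a document and later decoded back into a structured object. The field names "make" and "models" are assumptions chosen for illustration; the values are a subset of the Cadillac row in Table 9.1.

```python
import json

# A structured record held in memory (field names are hypothetical).
record = {"make": "Cadillac", "models": ["CTS", "DTS", "SRX"]}

# Serialize: package the record into a single linear string.
encoded = json.dumps(record, sort_keys=True)

# Deserialize: recover the structured record from the stored string.
decoded = json.loads(encoded)
```

    Because the field names travel with the values inside the encoded string, a consumer can recover the structure without consulting a separately defined schema.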

    As an example, in Figure 9.1 we have some examples of documents stored in association with the names of specific retail locations. Note that while the three examples all represent locations, the representative models differ. The document representation embeds the model so that the meanings of the document values can be inferred by the application.
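
    Since Figure 9.1 is not reproduced in this transcript, the following sketch uses invented documents to show the same idea: three documents that all describe locations but carry different fields, with the application inspecting each document to discover its model. The store names and all field names and values other than MallLocation and "Westfield Wheaton" are hypothetical.

```python
from typing import Any, Dict, List

# Documents keyed by retail location name; each document has its own structure.
documents: Dict[str, Dict[str, Any]] = {
    "Store A": {"StoreName": "Store A", "Street": "1 Example Road", "City": "Exampleville"},
    "Store B": {"StoreName": "Store B", "MallLocation": "Westfield Wheaton"},
    "Store C": {"StoreName": "Store C", "Latitude": 0.0, "Longitude": 0.0},
}

# The application infers each document's model from the field names it embeds.
fields_by_key: Dict[str, List[str]] = {
    key: sorted(doc) for key, doc in documents.items()
}

# Documents that describe their location by mall rather than by address or coordinates.
mall_stores = [key for key, doc in documents.items() if "MallLocation" in doc]
```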

    One of the differences between a key-value store and a document store is that while the former requires the use of a key to retrieve data, the latter often provides a means (either through a programming API or using a query language) for querying the data based on the contents. Because the approaches used for encoding the documents embed the object metadata, one can use methods for querying by example. For instance, using the example in Figure 9.1, one could execute a FIND(MallLocation: "Westfield Wheaton") that would pull out all documents associated with the Reta