Big Data Analytics || NoSQL Data Management for Big Data


Post on 09-Apr-2017


    CHAPTER 9: NoSQL Data Management for Big Data

    9.1 WHAT IS NOSQL?

    In Chapter 6, we discussed different kinds of high-performance appliances from both the architectural perspective and from a data organization perspective. To continue the discussion of application development, it is valuable to continue to review the different means for managing and organizing massive data volumes to meet business needs, and in this chapter, we will look at some alternative methods for data management.

    In any environment that already depends on a relational database model and/or a data warehousing approach to data management, it would be unwise to ignore support for these traditional data organizations for data that is to be implemented in a big data environment. Fortunately, most hardware and software appliances support standard approaches to standard, SQL-based relational database management systems (RDBMSs). Software appliances often bundle their execution engines with the RDBMS and utilities for creating the database structures and for bulk data loading.

    However, the availability of a high-performance, elastic distributed data environment enables creative algorithms to exploit variant modes of data management in different ways. In fact, some algorithms will not be able to consume data in traditional RDBMSs and will be acutely dependent on alternative means for data management. Many of these alternate data management frameworks are bundled under the term "NoSQL databases." The term "NoSQL" may convey two different connotations: one implies that the data management system is not an SQL-compliant one, while the more accepted implication is that the term means "Not only SQL," suggesting environments that combine traditional SQL (or SQL-like query languages) with alternative means of querying and access.

    Big Data Analytics. DOI: http://dx.doi.org/10.1016/B978-0-12-417319-4.00009-0 © 2013 Elsevier Inc. All rights reserved.

    9.2 SCHEMA-LESS MODELS: INCREASING FLEXIBILITY FOR DATA MANIPULATION

    NoSQL data systems hold out the promise of greater flexibility in database management while reducing the dependence on more formal database administration. NoSQL databases have more relaxed modeling constraints, which may benefit both the application developer and the end-user analysts when their interactive analyses are not throttled by the need to cast each query in terms of a relational table-based environment.

    Different NoSQL frameworks are optimized for different types of analyses. For example, some are implemented as key-value stores, which nicely align to certain big data programming models, while another emerging model is a graph database, in which a graph abstraction is implemented to embed both semantics and connectivity within its structure. In fact, the general concepts for NoSQL include schema-less modeling in which the semantics of the data are embedded within a flexible connectivity and storage model; this provides for automatic distribution of data and elasticity with respect to the use of computing, storage, and network bandwidth in ways that don't force specific binding of data to be persistently stored in particular physical locations. NoSQL databases also provide for integrated data caching that helps reduce data access latency and speed performance.

    The loosening of the relational structure is intended to allow different models to be adapted to specific types of analyses. The technologies are evolving and maturing. And because of the relaxed approach to modeling and management that does not enforce shoehorning data into strictly defined relational structures, the models themselves do not necessarily impose any validity rules; this potentially introduces risks associated with ungoverned data management activities such as inadvertent inconsistent data replication, reinterpretation of semantics, and currency and timeliness issues.

    9.3 KEY-VALUE STORES

    A relatively simple type of NoSQL data store is a key-value store, a schema-less model in which values (or sets of values, or even more complex entity objects) are associated with distinct character strings called keys. Programmers may see similarity with the data structure known as a hash table. Other alternative NoSQL data stores are variations on the key-value theme, which lends a degree of credibility to the model.

    As an example, consider the data subset represented in Table 9.1. The key is the name of the automobile make, while the value is a list of names of models associated with that automobile make.

    As can be inferred from the example, the key-value store does not impose any constraints about data typing or data structure: the value associated with the key is the value, and it is up to the consuming business applications to assert expectations about the data values and their semantics and interpretation. This demonstrates the schema-less property of the model.

    The core operations performed on a key-value store include:

    - Get(key), which returns the value associated with the provided key.
    - Put(key, value), which associates the value with the key.
    - Multi-get(key1, key2, ..., keyN), which returns the list of values associated with the list of keys.
    - Delete(key), which removes the entry for the key from the data store.
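    The four core operations can be sketched in a few lines of Python. This is a minimal in-memory illustration built on a dict, not the API of any particular product; the class name KeyValueStore and its method names are assumptions chosen to mirror the operations listed above.

```python
from typing import Any, Dict, Iterable, List, Optional


class KeyValueStore:
    """Toy in-memory key-value store backed by a Python dict (a hash table)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def get(self, key: str) -> Optional[Any]:
        # Return the value for the exact key, or None when the key is absent.
        return self._data.get(key)

    def put(self, key: str, value: Any) -> None:
        # Associate the value with the key, replacing any earlier value.
        self._data[key] = value

    def multi_get(self, keys: Iterable[str]) -> List[Optional[Any]]:
        # Return the values in the same order as the requested keys.
        return [self._data.get(k) for k in keys]

    def delete(self, key: str) -> None:
        # Remove the entry; deleting a missing key is a no-op in this sketch.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("Buick", ["Enclave", "LaCrosse", "Lucerne", "Regal"])
store.put("Cadillac", ["CTS", "DTS", "Escalade", "Escalade ESV",
                       "Escalade EXT", "SRX", "STS"])
```

    Note that the store never inspects the values: a list, a string, or a nested object is stored the same way, which is the schema-less property in practice.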

    One critical characteristic of a key-value store is uniqueness of the key: to find the values you are looking for, you must use the exact key. In this data management approach, if you want to associate multiple values with a single key, you need to consider the representations of the objects and how they are associated with the key. For example, you may want to associate a list of attributes with a single key, which may suggest that the value stored with the key is yet another key-value store object itself.
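    Both points, exact-key lookup and a value that is itself a key-value object, can be illustrated with nested Python dicts. The attribute names "models" and "model_count" below are hypothetical and chosen only for this sketch.

```python
# Values can themselves be key-value objects (here, nested dicts).
# The attribute names "models" and "model_count" are illustrative only.
makes = {
    "Buick": {
        "models": ["Enclave", "LaCrosse", "Lucerne", "Regal"],
        "model_count": 4,
    },
}

# Lookup requires the exact key: "Buick" is found, "buick" is not.
exact = makes.get("Buick")
wrong_case = makes.get("buick")

# Reaching an attribute means a second lookup inside the nested value.
first_model = makes["Buick"]["models"][0]
```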

    Table 9.1 Example Data Represented in a Key-Value Store

    Key         Value
    . . .
    BMW         {1-Series, 3-Series, 5-Series, 5-Series GT, 7-Series, X3, X5, X6, Z4}
    Buick       {Enclave, LaCrosse, Lucerne, Regal}
    Cadillac    {CTS, DTS, Escalade, Escalade ESV, Escalade EXT, SRX, STS}
    . . .

    Key-value stores are essentially very long, and presumably thin tables (in that there are not many columns associated with each row). The table's rows can be sorted by the key value to simplify finding the key during a query. Alternatively, the keys can be hashed using a hash function that maps the key to a particular location (sometimes called a bucket) in the table. Additional supporting data structures and algorithms (such as bit vectors and bloom filters) can be used to even determine whether the key exists in the data set at all. The representation can grow indefinitely, which makes it good for storing large amounts of data that can be accessed relatively quickly, as well as environments requiring incremental appends of data. Examples include capturing system transaction logs, managing profile data about individuals, or maintaining access counts for millions of unique web page URLs.
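    As a rough sketch of the two ideas in this paragraph, the Python below maps keys to buckets with a hash function and uses a small bloom filter (a bit vector plus several hash functions) to test whether a key might exist. The function and class names, the bit-vector size, and the use of SHA-256 as the hash are all assumptions made for illustration.

```python
import hashlib
from typing import List


def stable_hash(key: str, seed: int = 0) -> int:
    # hashlib gives the same result across runs, unlike built-in hash() for str.
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def bucket_for(key: str, num_buckets: int) -> int:
    # Map a key to one of num_buckets locations in the table.
    return stable_hash(key) % num_buckets


class BloomFilter:
    """Bit-vector membership test: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits: List[bool] = [False] * num_bits

    def _positions(self, key: str) -> List[int]:
        # One bit position per seeded hash function.
        return [stable_hash(key, seed) % self.num_bits
                for seed in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
for make in ("BMW", "Buick", "Cadillac"):
    bloom.add(make)
```

    A store would typically consult the filter first and go to the table only when might_contain returns True, which avoids a lookup for most keys that were never stored.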

    The simplicity of the representation allows massive amounts of indexed data values to be appended to the same key-value table, which can then be sharded, or distributed across the storage nodes. Under the right conditions, the table is distributed in a way that is aligned with the way the keys are organized, so that the hashing function that is used to determine where any specific key exists in the table can also be used to determine which node holds that key's bucket (i.e., the portion of the table holding that key).
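    The routing idea can be sketched as follows: the same hash that locates a key also selects the node that holds it. Here the "nodes" are plain dicts in one process, and the names node_for_key and ShardedStore are hypothetical; a real deployment would place each shard on a separate machine.

```python
import hashlib
from typing import Any, Dict, List


def node_for_key(key: str, num_nodes: int) -> int:
    # Stable hash so every client routes a given key to the same node.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


class ShardedStore:
    """Key-value table split across several in-memory 'nodes' (plain dicts)."""

    def __init__(self, num_nodes: int) -> None:
        self.nodes: List[Dict[str, Any]] = [{} for _ in range(num_nodes)]

    def put(self, key: str, value: Any) -> None:
        self.nodes[node_for_key(key, len(self.nodes))][key] = value

    def get(self, key: str) -> Any:
        # Only the one node chosen by the hash is consulted.
        return self.nodes[node_for_key(key, len(self.nodes))].get(key)


cluster = ShardedStore(num_nodes=4)
for url in ("/home", "/products", "/about", "/contact"):
    cluster.put(url, 0)
```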

    While key-value pairs are very useful for both storing the results of analytical algorithms (such as phrase counts among massive numbers of documents) and for producing those results for reports, the model does pose some potential drawbacks. One is that the model will not inherently provide any kind of traditional database capabilities (such as atomicity of transactions, or consistency when multiple transactions are executed simultaneously); those capabilities must be provided by the application itself. Another is that as the model grows, maintaining unique values as keys may become more difficult, requiring the introduction of some complexity in generating character strings that will remain unique among a myriad of keys.
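    To make the first drawback concrete, the sketch below shows an application supplying its own atomicity for a read-modify-write update of an access counter. It assumes a single process with a dict standing in for the store and a threading.Lock as the guard; a distributed store would need a distributed equivalent, which is outside the scope of this illustration.

```python
import threading
from typing import Dict

access_counts: Dict[str, int] = {}   # stand-in for a key-value store
counts_lock = threading.Lock()       # the application supplies the atomicity


def record_access(url: str) -> None:
    # The read-modify-write must happen as one unit; the store does not guarantee it.
    with counts_lock:
        access_counts[url] = access_counts.get(url, 0) + 1


def worker(url: str, hits: int) -> None:
    for _ in range(hits):
        record_access(url)


# Four concurrent writers updating the same key.
threads = [threading.Thread(target=worker, args=("/home", 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

    Without the lock, two writers could read the same old count and each write back the same incremented value, silently losing an update.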

    9.4 DOCUMENT STORES

    A document store is similar to a key-value store in that stored objects are associated with (and therefore accessed via) character string keys. The difference is that the values being stored, which are referred to as documents, provide some structure and encoding of the managed data. There are different common encodings, including XML (Extensible Markup Language), JSON (JavaScript Object Notation), BSON (which is a binary encoding of JSON objects), or other means of serializing data (i.e., packaging up and potentially linearizing the data values associated with a data record or object).

    As an example, in Figure 9.1 we have some examples of documents stored in association with the names of specific retail locations. Note that while the three examples all represent locations, the representative models differ. The document representation embeds the model so that the meanings of the document values can be inferred by the application.

    One of the differences between a key-value store and a document store is that while the former requires the use of a key to retrieve data, the latter often provides a means (either through a programming API or using a query language) for querying the data based on the contents. Because the approaches used for encoding the documents embed the object metadata, one can use methods for querying by example. For instance, using the example in Figure 9.1, one could execute a FIND(MallLocation: "Westfield Wheaton") that would pull out all documents associated with the Retail Stores in that particular shopping mall.
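    Querying by example can be sketched in Python by treating each document as a dict and matching on its contents rather than its key. The documents below restate the retail store examples; the find function and the "Location" field name are assumptions made for this sketch and do not correspond to any specific product's query language.

```python
from typing import Any, Dict, List

# Documents keyed by store name; the nested "Location" field name is an assumption.
documents: Dict[str, Dict[str, Any]] = {
    "Retail Store #34": {
        "Location": {"Street": "1203 O ST", "City": "Lincoln",
                     "State": "NE", "ZIP": "68508"}
    },
    "Retail Store #65": {
        "Location": {"MallLocation": "Westfield Wheaton",
                     "City": "Wheaton", "State": "IL"}
    },
    "Retail Store #102": {
        "Location": {"Latitude": 40.748328, "Longitude": -73.985560}
    },
}


def find(docs: Dict[str, Dict[str, Any]], **criteria: Any) -> List[str]:
    # Return the keys of documents whose location matches every criterion.
    # Documents lacking a requested field simply do not match.
    matches = []
    for key, doc in docs.items():
        location = doc.get("Location", {})
        if all(location.get(field) == value for field, value in criteria.items()):
            matches.append(key)
    return matches


wheaton_stores = find(documents, MallLocation="Westfield Wheaton")
```

    Because each document carries its own field names, the query works even though the three documents do not share a common structure.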

    9.5 TABULAR STORES

    Tabular, or table-based, stores are largely descended from Google's original Bigtable design [1] to manage structured data. The HBase model described in Chapter 7 is an example of a Hadoop-related NoSQL data management system that evolved from Bigtable.

    The Bigtable NoSQL model allows sparse data to be stored in a three-dimensional table that is indexed by a row key (that is used in a fashion that is similar to the key-value and document stores), a column key that indicates the specific attribute for which a data value is stored, and a timestamp that may refer to the time at which the row's column value was stored.
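    The three-dimensional indexing can be sketched as a nested mapping from row key to column key to a list of timestamped values. This is an in-memory illustration only: the function names, the "family:qualifier" column naming, and the integer timestamps are assumptions, and the sketch does not reflect how Bigtable or HBase store data physically.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple

# table[row_key][column_key] -> list of (timestamp, value), kept sorted by timestamp
Table = Dict[str, Dict[str, List[Tuple[int, Any]]]]
table: Table = defaultdict(lambda: defaultdict(list))


def put_cell(row: str, column: str, timestamp: int, value: Any) -> None:
    # Each write adds a new version instead of overwriting the old one.
    versions = table[row][column]
    versions.append((timestamp, value))
    versions.sort(key=lambda tv: tv[0])


def get_cell(row: str, column: str, as_of: Optional[int] = None) -> Optional[Any]:
    # Use .get so that reads never create empty rows or columns (sparse storage).
    versions = table.get(row, {}).get(column, [])
    candidates = [v for ts, v in versions if as_of is None or ts <= as_of]
    return candidates[-1] if candidates else None


# A web page URL as the row key, with two versions of its content.
put_cell("www.example.com", "contents:html", 1, "<html>v1</html>")
put_cell("www.example.com", "contents:html", 2, "<html>v2</html>")
put_cell("www.example.com", "anchor:www.other.example", 1, "link text")
```

    Calling get_cell without as_of returns the latest version, while passing a timestamp returns the value as it stood at that time; cells that were never written take up no space.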

    As an example, various attributes of a web page can be associated with the web page's URL: the HTML content of the page, URLs of other web pages that link to this web page, and the author of the content. Columns in a Bigtable model are grouped together as families, and the timestamps enable management of multiple versions of an object. The timestamp can be used to maintain history: each time the content changes, new column affiliations can be created with the timestamp of when the content was downloaded.

    [1] Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, et al. Bigtable: a distributed storage system for structured data. Accessed via http://research.google.com/archive/bigtable.html (last accessed 08-08-13).

    9.6 OBJECT DATA STORES

    In some ways, object data stores and object databases seem to bridge the worlds of schema-less data management and the traditional relational models. On the one hand, approaches to object databases can be similar to document stores except that the document stores explicitly serialize the object so the data values are stored as strings, while object databases maintain the object structures as they are bound to object-oriented programming languages such as C++, Objective-C, Java, and Smalltalk. On the other hand, object database management systems are more likely to provide traditional ACID (atomicity, consistency, isolation, and durability) compliance, characteristics that are bound to database reliability. Object databases are not relational databases and are not queried using SQL.
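    The serialization step that distinguishes a document store can be illustrated in Python. The sketch below only shows the document-store side of the contrast (an object flattened to a JSON string and rebuilt afterward); it does not model an object database, and the Store class and its fields are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Store:
    name: str
    city: str
    state: str


live_object = Store(name="Retail Store #34", city="Lincoln", state="NE")

# Document-store style: the object is serialized to a string before storage...
serialized = json.dumps(asdict(live_object))

# ...and must be parsed and rebuilt before the application can use it
# as an object again.
restored = Store(**json.loads(serialized))
```

    An object database, by contrast, would persist live_object in a form that stays bound to the class definition, so the application would not perform this round trip itself.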

    9.7 GRAPH DATABASES

    {StoreName:"Retail Store #34",
      {Street:"1203 O ST", City:"Lincoln", State:"NE", ZIP:"68508"}
    }

    {StoreName:"Retail Store #65",
      {MallLocation:"Westfield Wheaton", City:"Wheaton", State:"IL"}
    }

    {StoreName:"Retail Store #102",
      {Latitude:40.748328, Longitude:-73.985560}
    }

    Figure 9.1 Example of document store.

    Graph databases provide a model of representing individual entities and numerous kinds of relationships that connect those entities. More precisely, they employ the graph abstraction for representing connectivity, consisting of a collection of vertices (which are also referred to as nodes or points) that represent the modeled entities, connected by edges (which are also referred to as links, connections, or relationships) that capture the way that two entities are related. Graph analytics performed on graph data stores are somewhat different than more frequently used querying and reporting, and we cover graph databases in much greater detail in Chapter 10.
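    The vertex-and-edge abstraction used by graph databases can be sketched as a set of vertices plus labeled, directed edges. The relationship labels ("located_in", "sells") and the entity "Widgets" are hypothetical values invented for this illustration, as are the function names.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Vertices are entity names; each edge carries a label naming the relationship.
vertices: Set[str] = set()
edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)


def add_edge(source: str, label: str, target: str) -> None:
    # Adding an edge registers both endpoints as vertices.
    vertices.update((source, target))
    edges[source].append((label, target))


def neighbors(vertex: str, label: str) -> List[str]:
    # Follow only the edges of the requested relationship type.
    return [target for edge_label, target in edges.get(vertex, [])
            if edge_label == label]


add_edge("Retail Store #65", "located_in", "Westfield Wheaton")
add_edge("Westfield Wheaton", "located_in", "Wheaton, IL")
add_edge("Retail Store #65", "sells", "Widgets")
```

    Queries on such a structure follow relationships from entity to entity rather than looking up a single key, which is why graph analytics differ from the key-based access patterns described earlier.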

    9.8 CONSIDERATIONS

    The decision to use a NoSQL data store instead of a relational model must be aligned with the data consumers' expectations for compliance with relational models. As should be apparent, many NoSQL data management environments are engineered for two key criteria:

    1. fast accessibility, whether that means inserting data into the model or pulling it out via some query or access method;

    2. scalability for volume, so as to support the accumulation and management of massive amounts of data.

    The different approaches are amenable to extensibility, scalability, and distribution, and these characteristics blend nicely with programming models (like MapReduce) with straightforward creation and execution of many parallel processing threads. Distributing a tabular data store or a key-value store allows many queries/accesses to be performed simultaneously, especially when the hashing of the keys maps to different data storage nodes. Employing different data allocation strategies will allow the tables to grow indefinitely without requiring significant rebalancing.

    In other words, these data organizations are designed for high-performance computing for reporting and analysis. However, most NoSQL environments are not generally designed for transaction processing, and it would require some whittling down of the list of vendors to find those that support ACID transactions.

    9.9 THOUGHT EXERCISES

    When considering NoSQL data stores, here are some questions and exercises to ponder:

    At what point do you decide that a traditional RDBMS implemented on a standard server environment is insufficient to satisfy the business needs of an analytical application?

    For the most valuable big data opportunity in your organization, describe how the data would be mapped to a key-value store model, a document store model, and a tabular model.

    Would you consider using...