Architecting for the Cloud: Storage and Miscellaneous Topics


Posted on 27-Aug-2014






This is day 4 of the Architecting for the Cloud course. It covers storage solutions and a collection of miscellaneous topics.


<ul><li> Matthew Bass 2013 Architecting for the Cloud, Len and Matt Bass: Storage in the Cloud </li>
<li> Outline: This section will focus on storage in the cloud. We will first look at relational databases, then at what solutions emerged for the cloud, storage options for NoSQL databases, and the architecture of typical NoSQL databases. </li>
<li> History: The relational data model was created in the late 1960s. In the 1980s relational databases became commercially successful, replacing hierarchical and network databases. Relational databases continue to be the dominant database model today. </li>
<li> Relational Databases: The relational model is a mathematical model for describing the structure of data. We will not go into this model. Let's take a quick review of the first and second normal forms, however. </li>
<li> Example: Imagine you sell car parts. You have warehouses, part inventories, and orders. What's the problem? Table columns: Warehouse | Warehouse Address | Part </li>
<li> What Happens Here? Warehouse 1 | 123 Main Street | Transmission, Steering wheel, Brake pads, ... What about here? 
Warehouse 1 | 123 Main Street | Transmission; Warehouse 1 | 123 Main Street | Steering wheel; Warehouse 1 | 123 Main Street | Brake Pads </li>
<li> The Solution: Warehouse Table (Warehouse ID, Warehouse Address); Parts Table (Part ID, Part Description); Relations Table (Warehouse ID, Part ID) </li>
<li> This Works: We have a standard language for querying the data (SQL). We can now extract data in a very flexible way. We can read, write, update, and delete data pretty efficiently. Joins add some overhead. </li>
<li> Moreover We Have RDBMS: We have robust software systems that manage the data. These systems provide many advanced features, including behavior, concurrency control, transactions, referential integrity, and optimization. </li>
<li> Behavior: DBMSs provide mechanisms for building in behavior, such as stored procedures and PL/SQL. This allows you to simplify the application logic. </li>
<li> Concurrency Control: DBMSs support access by multiple users. They will lock tables during updates to ensure that writes are complete prior to reads. They will manage multiple updates to ensure integrity and consistency of data. </li>
<li> Transactions: Transactions are supported. This ensures that updates either happen completely or not at all. Often an atomic update is a set of updates to individual records across multiple tables. If only some of these updates happen, the integrity of the overall database is compromised. </li>
<li> Referential Integrity: Ensures that references from one table refer to a valid entry in another table. </li>
<li> Optimization: Database systems will perform a variety of actions to optimize based on usage patterns. They will create indexes, create virtual tables, and cache values. </li>
<li> Impedance Mismatch: There is, however, a mismatch. We need to translate between the relational structure and the organizational needs. Think 
about the reports needed for the warehouse: purchase orders, history of orders for a customer, parts inventory per warehouse. This means we will need lots of joins. This isn't too much of an issue until we scale. </li>
<li> Speaking of Scaling: Do relational databases scale? </li>
<li> Internet Scale Is Difficult: We can shard the data, that is, split the data across the machines. This is very difficult to do efficiently. It makes joins more costly, and remember that joins are common. This also has a practical limit. At some point you will need to replicate the data. The database becomes slow. </li>
<li> Change is Needed: For this reason internet-scale applications moved to distributed file systems. Google was the first; many others followed. This allowed the data to be partitioned across nodes more efficiently. We'll talk about this in a minute. </li>
<li> Outline: This section will focus on storage in the cloud. We will first look at relational databases, then at what solutions emerged for the cloud, storage options for NoSQL databases, and the architecture of typical NoSQL databases. </li>
<li> Needs: Let's explore the needs in a bit more detail. The file system needed to be fault-tolerant, handle large files, accommodate extremely large data sets, accommodate many concurrent clients, and be flexible enough to handle multiple kinds of applications. </li>
<li> Fault-Tolerance: Due to the scale of the systems, they were deployed on hundreds or thousands of servers. This meant that at any given time some of these nodes would not be operational. Problems from application bugs, operating system bugs, human error, hardware failures, and networks are common. </li>
<li> Large Files/Large Data Sets: It's common for files in these systems to be multiple GBs. Each file could have millions of objects, e.g. 
many individual web pages. The data sets grow quickly. The data sets can be multiple terabytes or petabytes. </li>
<li> Many Concurrent Clients: The system needs to efficiently handle multiple clients. These clients could be reading or writing. </li>
<li> Multiple Applications: Additionally, the system needs to be flexible enough to handle multiple applications. Applications have a variety of needs: long streaming reads, throughput-oriented operations, and low-latency reads. </li>
<li> Addressing Needs: A number of things were done to address the needs. One primary decision was the de-normalization of the data; we'll talk about this more in the next slides. Other decisions (we'll talk about these in a bit) include block size, replication strategy, data consistency checks, and the API and capability of the system. </li>
<li> De-Normalizing Data: Remember what was difficult with relational models? Joins across nodes are expensive, as is synchronization for replicated data. If the data is de-normalized it can be localized. Data that will likely be accessed together can be collocated. In other words, store it as you will use it. </li>
<li> Example: Imagine a purchase order. Typically this would contain customer information, product information, and pricing. </li>
<li> Relational Purchase Order: The data would be split across multiple tables, such as Customer, Product Catalog, and Inventory. If the data set is large enough, the data would be distributed. </li>
<li> De-Normalized Purchase Order: In a file system without a relational model, the data doesn't need to be split up. The purchase order data would be co-located. If the data set was very large, purchase orders would still be co-located. Different purchase orders could be distributed. A single purchase order, however, would not be. </li>
<li> Relational vs NoSQL: Relational model: Customers, Product Catalog, Inventory. NoSQL: 
Orders 1 - 100, Orders 101 - 200, Orders 201 - 300 </li>
<li> What Does This Mean? Data has no explicit structure (not entirely true, but we'll talk about this). Data is largely treated as a blob. This has several implications: you can change the nature of the data as needed; you can collocate the data as desired; the application now has an increased burden. </li>
<li> Back to Purchase Order: Key (PO Number) | Value (PO): 1 | Contents of PO1; 2 | Contents of PO2; 3 | Contents of PO3; 4 | Contents of PO4 </li>
<li> Retrieving Data: To retrieve the purchase order data you provide the reference key. The file system routes you to the appropriate node (more later). The single node returns the entire purchase order. This can happen quickly regardless of how many purchase orders you have. Do you see any potential issues? </li>
<li> Data Locality: First, being able to retrieve the data quickly depends on the location of the data. If the data is distributed, it's difficult to retrieve quickly. Imagine you want to get the number of times a customer ordered product X (more on this later). While there is not an explicit structure, there is an implicit structure. Design of this structure is important. </li>
<li> Data Processing: As the file system treats the data as unstructured, it's not able to preprocess the data. Getting an ordered list, for example, has to be done in the application. The validity of the data needs to be checked by the application. </li>
<li> Updating Data: What happens if you want to change the data? 
Imagine trying to update the customer's address. Updates tend to be difficult. In this environment you tend not to update data. Instead you will append the new data. You can establish rules for the lifetime of the data. </li>
<li> Other Issues: Things like data integrity are not managed by the file system. You don't (typically) have full support for transactions. There is no notion of referential integrity. There is support for some concurrent access, but with built-in assumptions. Consistency is not typically guaranteed (more later). </li>
<li> A New Tool in Your Toolbox: You've been given a new kind of hammer. Remember that not everything is a nail. In other words, these kinds of data stores are good for some things and not others. Today there are many different flavors of these data stores, both in terms of structures and features. </li>
<li> Multiple Data Structures: Today many options exist: key value stores, document centric data stores, and column databases. We've also started to see old models reemerge, e.g. hierarchical data stores. </li>
<li> Key Value Databases: Basically you have a key that maps to some value. This value is just a blob. The database doesn't care about the content or structure of this value. The operations are quite simple, e.g. 
Read (gets the value given a key), Insert (inserts a key/value pair), Remove (removes the value associated with a given key). </li>
<li> Key Value Databases II: There is no real schema. Basically you query a key and get the value. This can be useful when accessing things like user sessions, shopping carts, ... Concurrency: concurrency only makes sense at the level of a single key. You can have either optimistic writes or eventual consistency (we'll talk about this more later). Replication: can be handled by the client or the data store (more about this later). </li>
<li> Uses: Very fast reads. Scales well. Good for quick access of data without complex querying needs. The classic example is session management. Not good for situations where data integrity is critical, or for data with complex querying needs. </li>
<li> Document Centric Databases: Stores a document, e.g. ID: 123, Customer: 8790, Line Items: [{product id: 2, quantity: 2}, {product id: 34, quantity: 1}] </li>
<li> Document Centric: No schema. You can query the data store. It can return all or part of the document. You typically query the store by using the id (or key). As with key value, discussing concurrency only makes sense at the level of a single document. </li>
<li> Advantages: A document centric data store is similar in many ways to a key/value data store. It does, however, allow for more complex queries. For example, you can query using a non-primary key. </li>
<li> Column Databases: Row key maps to column families. Row key 1234: Profile family (Name: Matt; Billing Address: 123 Main st; Phone: 412 770-4145); Orders family (Order Data, Order Data, Order Data) </li>
<li> Column Databases - Rows: Rows are grouped together to form units of load balancing. Row keys are ordered and grouped together by locality. In this example consecutive rows would be from the same domain (CNN). Concurrency makes sense at the level of a row. Key: com.cnn.www | Contents: Html page </li>
<li> Column Databases - Columns: Columns are grouped into column families...</li></ul>
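The key/value ideas in the slides above (read, insert, and remove by key; values treated as opaque blobs; a purchase order kept whole on a single node; queries such as "how many times did a customer order product X" left to the application) can be sketched in a few lines of Python. This is only an illustrative toy, not how any particular data store works: the class name TinyKeyValueStore, the hash-modulo routing to nodes, and the JSON encoding of purchase orders are all assumptions that the slides do not specify.

```python
import hashlib
import json


class TinyKeyValueStore:
    """Hypothetical in-memory key/value store split across a few 'nodes'."""

    def __init__(self, node_count=3):
        # Each node is just a dict here; a real store would use separate machines.
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        # Assumed routing scheme: hash the key, then take it modulo the node count.
        digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def insert(self, key, value):
        # The store never inspects the value; it is an opaque blob.
        self._node_for(key)[key] = value

    def read(self, key):
        # Returns the whole blob from a single node, or None if the key is absent.
        return self._node_for(key).get(key)

    def remove(self, key):
        self._node_for(key).pop(key, None)

    def all_values(self):
        # Visits every node; this is the expensive path.
        for node in self.nodes:
            yield from node.values()


def count_product_orders(store, customer_id, product_id):
    """Application-side query: scan and decode every blob, since the store cannot."""
    total = 0
    for blob in store.all_values():
        order = json.loads(blob)
        if order["customer"] == customer_id:
            total += sum(
                1 for item in order["line_items"] if item["product_id"] == product_id
            )
    return total


store = TinyKeyValueStore()
store.insert(1, json.dumps({"customer": 8790, "line_items": [
    {"product_id": 2, "quantity": 2}, {"product_id": 34, "quantity": 1}]}))
store.insert(2, json.dumps({"customer": 8790, "line_items": [
    {"product_id": 2, "quantity": 1}]}))
store.insert(3, json.dumps({"customer": 1234, "line_items": [
    {"product_id": 2, "quantity": 5}]}))

po1 = json.loads(store.read(1))  # one key, one node, the entire purchase order
orders_with_product_2 = count_product_orders(store, 8790, 2)  # touches every node
store.remove(3)
```

Reading a purchase order by key touches exactly one node, while the per-customer count has to visit every node and decode every value. That contrast is the trade-off the slides describe when they say the application carries an increased burden.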