[mas 500] data basics

MAS.500 - Software Module - Rahul Bhargava Data Management 2014.11.21

Upload: rahulbot

Post on 13-May-2015

Category:

Education


TRANSCRIPT

Page 1: [Mas 500] Data Basics

MAS.500 - Software Module - Rahul Bhargava

Data Management

2014.11.21

Page 2: [Mas 500] Data Basics

Topics

❖ Regular Expressions (online quickstart)

❖ Databases
  ❖ History
  ❖ Relational modeling
  ❖ SQL (MySQL quickstart)
  ❖ Keys/Indexes
  ❖ NoSQL (CouchDB quickstart)

❖ Behind the Scenes with Ed Platt

❖ Homework

Page 3: [Mas 500] Data Basics

Regular Expressions

Page 4: [Mas 500] Data Basics

Regular Expressions (RegEx/grep)

❖ Match a string of text by defining a pattern

❖ Useful for cleaning up or identifying data

❖ “Find” demo on http://regexpal.com

❖ “Find/Replace” demo with http://www.sugarscript.com/findandreplace/index.php

❖ Interested? Interactive tutorial on http://regexone.com
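
The “Find” and “Find/Replace” demos can be approximated offline with Python's built-in `re` module. This is a minimal sketch, not the demo from the session: the sample text and the date pattern are made up for illustration.

```python
import re

# Hypothetical messy input: the same kind of date written with three separators.
text = "Posted 2014-11-21, updated 2015/05/13, archived 2015.06.01"

# Pattern: four digits, a separator, two digits, a separator, two digits.
# The parentheses capture year, month, and day so they can be reused below.
date_pattern = re.compile(r"(\d{4})[-/.](\d{2})[-/.](\d{2})")

# "Find": list every substring that matches the pattern.
found = [match.group(0) for match in date_pattern.finditer(text)]
print(found)

# "Find/Replace": rewrite every match so it uses dashes (data cleanup).
cleaned = date_pattern.sub(r"\1-\2-\3", text)
print(cleaned)
```

The same pattern serves both purposes: identifying data (`finditer`) and cleaning it up (`sub`).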

Page 5: [Mas 500] Data Basics

Databases

Page 6: [Mas 500] Data Basics

Database History

❖ List-based
  ❖ Follow a link from one record to another (linked list)

❖ File-system data stores
  ❖ Based on file-naming conventions, limited by file I/O speeds

❖ Generic data storage and management
  ❖ Relational modeling, or entities and relationships (ER)

Page 7: [Mas 500] Data Basics

Relational Modeling: In English

❖ A Group has many People
  ❖ A Person belongs to one Group

❖ A Group has many Projects
  ❖ A Project belongs to one Group

❖ A Person has many Projects
  ❖ A Project has many People

Page 8: [Mas 500] Data Basics

Relational Modeling: Diagram

[Diagram: Group (1) to Person (many); Group (1) to Project (many); Person (many) to Project (many)]

Page 9: [Mas 500] Data Basics

Relational Modeling: Tables

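Group: id, name, url

Person: id, name, password, group_id

Project: id, name, url

Membership: person_id, project_id

[Diagram: Group (1) to Person (many); Group (1) to Project (many); Person (many) to Project (many), linked through Membership]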
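
A Membership table is how the many-to-many link between Person and Project gets stored: one row per (person, project) pair. Here is a small sketch using Python's built-in `sqlite3` module. The sample names and rows are invented, and only the columns needed for the join are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE project (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE membership (person_id INTEGER, project_id INTEGER);
INSERT INTO person VALUES (1, 'Ada'), (2, 'Bob');
INSERT INTO project VALUES (10, 'Atlas'), (20, 'Beacon');
-- Ada works on both projects; Bob works on one.
INSERT INTO membership VALUES (1, 10), (1, 20), (2, 20);
""")

def projects_for(person_name):
    """A Person has many Projects: walk person -> membership -> project."""
    rows = conn.execute(
        "SELECT project.name FROM project "
        "JOIN membership ON membership.project_id = project.id "
        "JOIN person ON person.id = membership.person_id "
        "WHERE person.name = ? ORDER BY project.name",
        (person_name,),
    ).fetchall()
    return [name for (name,) in rows]

def people_on(project_name):
    """A Project has many People: walk project -> membership -> person."""
    rows = conn.execute(
        "SELECT person.name FROM person "
        "JOIN membership ON membership.person_id = person.id "
        "JOIN project ON project.id = membership.project_id "
        "WHERE project.name = ? ORDER BY person.name",
        (project_name,),
    ).fetchall()
    return [name for (name,) in rows]

print(projects_for("Ada"))
print(people_on("Beacon"))
```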

Page 10: [Mas 500] Data Basics

Relational Modeling: Keys

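Group: id (key), name, url

Person: id (key), name, password, group_id (foreign key)

Project: id (key), name, url

Membership: person_id (foreign key), project_id (foreign key)

[Diagram: Group (1) to Person (many); Group (1) to Project (many); Person (many) to Project (many), linked through Membership]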
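
One way to see what keys buy you is to declare them and let the database reject a bad row. The sketch below uses SQLite through Python's `sqlite3` module and follows the columns listed on the slide. The table name `lab_group` is an assumption (`GROUP` is a reserved word in SQL), and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE lab_group (
    id   INTEGER PRIMARY KEY,                   -- key
    name TEXT,
    url  TEXT
);
CREATE TABLE person (
    id       INTEGER PRIMARY KEY,               -- key
    name     TEXT,
    password TEXT,
    group_id INTEGER REFERENCES lab_group(id)   -- foreign key
);
CREATE TABLE project (
    id   INTEGER PRIMARY KEY,                   -- key
    name TEXT,
    url  TEXT
);
CREATE TABLE membership (
    person_id  INTEGER REFERENCES person(id),   -- foreign key
    project_id INTEGER REFERENCES project(id)   -- foreign key
);
""")

conn.execute("INSERT INTO lab_group (id, name) VALUES (1, 'Example Group')")
conn.execute("INSERT INTO person (id, name, group_id) VALUES (1, 'Ada', 1)")

# Group 99 does not exist, so the foreign key should block this row.
try:
    conn.execute("INSERT INTO person (id, name, group_id) VALUES (2, 'Bob', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

person_count = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
print(rejected, person_count)
```

The slide's Project table lists no `group_id` column, so the sketch leaves it out as well.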

Page 11: [Mas 500] Data Basics

Structured Query Language (SQL)

❖ Works in lots of database servers
  ❖ SQLite, MySQL, PostgreSQL, MS SQL Server

❖ Standard way to:
  ❖ Find subsets of data based on criteria
  ❖ Merge data in separate tables
  ❖ Compute aggregate info

❖ Assumptions
  ❖ Don’t duplicate data (“data normalization”)
  ❖ Various parts of your data relate to each other
  ❖ Your metadata/schema (tables/columns) doesn’t change often

❖ Many frameworks will generate SQL for you
  ❖ Ask about Database Abstraction Layers
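
The three “standard way to” items map onto `WHERE`, `JOIN`, and `GROUP BY`. A minimal sketch, run against SQLite from Python; the table names (`lab_group`, `person`) and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lab_group (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, group_id INTEGER);
INSERT INTO lab_group VALUES (1, 'Group A'), (2, 'Group B');
INSERT INTO person VALUES (1, 'Ada', 1), (2, 'Bob', 1), (3, 'Cy', 2);
""")

# 1. Find a subset of data based on criteria (WHERE).
subset = conn.execute(
    "SELECT name FROM person WHERE group_id = ? ORDER BY name", (1,)
).fetchall()

# 2. Merge data in separate tables (JOIN).
merged = conn.execute(
    "SELECT person.name, lab_group.name FROM person "
    "JOIN lab_group ON person.group_id = lab_group.id "
    "ORDER BY person.id"
).fetchall()

# 3. Compute aggregate info (GROUP BY with COUNT).
counts = conn.execute(
    "SELECT lab_group.name, COUNT(*) FROM person "
    "JOIN lab_group ON person.group_id = lab_group.id "
    "GROUP BY lab_group.id ORDER BY lab_group.id"
).fetchall()

print(subset)
print(merged)
print(counts)
```

Note that each group name is stored once and referenced by `group_id`, which is the “don’t duplicate data” assumption in practice.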

Page 12: [Mas 500] Data Basics

NoSQL

❖ Sometimes your data isn’t relational and the metadata changes often

❖ Queuing, document storage, logging, real-time, low-latency, concurrency

❖ Read this write-up for more:
  ❖ http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis

Page 13: [Mas 500] Data Basics

Tangent: JavaScript Object Notation (JSON)

❖ A human-readable data exchange format
  ❖ CSV, XML, YAML are some others

❖ Example:

❖ http://media.mongodb.org/zips.json

❖ http://mongohub.todayclose.com (for Mac)
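
To make “data exchange format” concrete, here is a small round trip with Python's built-in `json` module. The record is hypothetical: it is shaped like a zip-code entry, but the values are made up.

```python
import json

# Hypothetical record; nested lists and numbers survive the trip as-is.
record = {"city": "EXAMPLEVILLE", "state": "MA", "pop": 12345, "loc": [-71.1, 42.4]}

# Serialize: Python dict -> JSON text that any language can read.
text = json.dumps(record, sort_keys=True)
print(text)

# Parse: JSON text -> Python dict again.
restored = json.loads(text)
print(restored["city"], restored["loc"][0])
```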

Page 14: [Mas 500] Data Basics

❖ sudo mkdir -p /data/db

Page 15: [Mas 500] Data Basics

MongoDB: Intro

❖ Demo:

❖ Command Line

❖ MongoHub

Page 16: [Mas 500] Data Basics

Indexes

❖ An index tracks keys
  ❖ Convention: have an “id” column with an index on it

❖ Why all these indexes?
  ❖ Multiple ways to get at rows quickly

❖ Creating indexes is tricky
  ❖ Many frameworks include query logging to help you find slow queries that might need optimizing

❖ Query optimization is a bit of an art
  ❖ Use the “Explain” command
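
The effect of an index shows up in the query plan. The sketch below swaps in SQLite's `EXPLAIN QUERY PLAN` (via Python's `sqlite3`) as a stand-in for the “Explain” command; other servers have their own variants with different output. The table, index name, and rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO person (name) VALUES (?)",
    [("person-%d" % i,) for i in range(1000)],
)

def plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)

query = "SELECT id FROM person WHERE name = 'person-500'"

before = plan(query)   # no index on name: expect a full table scan
conn.execute("CREATE INDEX idx_person_name ON person (name)")
after = plan(query)    # index available: expect a search that uses it

print(before)
print(after)
```

The exact wording of the plan varies between SQLite versions, so read it for the scan-versus-index distinction rather than the literal text.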

Page 17: [Mas 500] Data Basics

Map-Reduce Instead of SQL

❖ Used to query large datasets

❖ Example: Count words in a document

❖ Map: select the data you need to operate on
  ❖ “Emit” one record for each word in a document, keyed by the word

❖ Reduce: combine the mapped data
  ❖ Sum up the uses of each word, “emitting” one record for each total
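
The word-count example can be sketched in plain Python, with no database involved. This is a single-process illustration of the map and reduce steps, not how any particular server runs them; the function names and sample documents are made up.

```python
from collections import defaultdict

def map_words(document):
    """Map: emit one (word, 1) record per word, keyed by the word."""
    for word in document.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce: sum the uses of one word and emit a single total record."""
    return word, sum(counts)

def word_count(documents):
    # Group the emitted records by key, which a real system does between steps.
    grouped = defaultdict(list)
    for document in documents:
        for word, count in map_words(document):
            grouped[word].append(count)
    # Run the reducer once per key.
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())

totals = word_count(["the cat sat", "The cat ran"])
print(totals)
```

Because each map call looks at one document and each reduce call looks at one key, the work can be split across many machines, which is why the approach suits large datasets.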

Page 18: [Mas 500] Data Basics

Picking Data Storage Strategies

❖ If you just need to dump data and pull it out by some id, use a no-sql solution (MongoDB is simple)

❖ flexible, easy to start with

❖ If you are modeling an app, a relational database is usually the right answer (MySQL/PostgreSQL are standard)
  ❖ Database modeling is REALLY important to get right at the start of your project, because it is a pain to change later

❖ Names matter – choose your table names carefully

❖ PS: we can try stuff out on Amazon’s cloud services for free

Page 19: [Mas 500] Data Basics

Homework

❖ see course outline