[mas 500] data basics
MAS.500 - Software Module - Rahul Bhargava
Data Management
2014.11.21
Topics
❖ Regular Expressions (online quickstart)
❖ Databases
  ❖ History
  ❖ Relational modeling
  ❖ SQL (MySQL quickstart)
  ❖ Keys/Indexes
  ❖ NoSQL (CouchDB quickstart)
❖ Behind the Scenes with Ed Platt
❖ Homework
Regular Expressions
Regular Expressions (RegEx/grep)
❖ Match a string of text by defining a pattern
❖ Useful for cleaning up or identifying data
❖ “Find” demo on http://regexpal.com
❖ “Find/Replace” demo with http://www.sugarscript.com/findandreplace/index.php
❖ Interested? Interactive tutorial on http://regexone.com
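A minimal sketch of the "find" and "find/replace" ideas above, using Python's standard `re` module rather than the web demos. The sample text and the phone-number pattern are invented for illustration and are not taken from the lecture.

```python
import re

# Hypothetical messy data: phone numbers written in inconsistent styles.
text = "Call 617-555-0101 or 617.555.0102, not 555-0103."

# Pattern: three digits, a separator (- or .), three digits,
# a separator, then four digits. Parentheses capture each group.
phone = re.compile(r"(\d{3})[-.](\d{3})[-.](\d{4})")

# "Find": collect every substring that matches the pattern.
found = [m.group(0) for m in phone.finditer(text)]
print(found)

# "Find/Replace": rewrite each match into one consistent format,
# reusing the captured groups as \1, \2, \3.
cleaned = phone.sub(r"(\1) \2-\3", text)
print(cleaned)
```

The short number at the end is left alone because it does not fit the full pattern, which is the kind of case worth checking when cleaning data.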
Databases
Database History
❖ List-based
  ❖ Follow link from one record to another (linked list)
❖ File-system data stores
  ❖ Based on file-naming convention, limited by file I/O speeds
❖ Generic data storage and management
  ❖ Relational modeling, or entities and relationships (ER)
Relational Modeling: In English
❖ A Group has many People
  ❖ A Person belongs to one Group
❖ A Group has many Projects
  ❖ A Project belongs to one Group
❖ A Person has many Projects
  ❖ A Project has many People
Relational Modeling: Diagram
[Diagram: Group 1-to-many Person; Group 1-to-many Project; Person many-to-many Project]
Relational Modeling: Tables
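Group: id, name, url
Person: id, name, password, group_id
Project: id, name, url
Membership: person_id, project_id
[Diagram: Group 1-to-many Person; Group 1-to-many Project; Person many-to-many Project, through Membership]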
Relational Modeling: Keys
Group: id (key), name, url
Person: id (key), name, password, group_id (foreign key)
Project: id (key), name, url
Membership: person_id (foreign key), project_id (foreign key)
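The tables and keys above can be sketched as a schema. This is one possible rendering, assuming SQLite through Python's standard `sqlite3` module. The column types, the composite key on Membership, and the sample rows are assumptions, not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# "group" is a reserved word in SQL, so the table name must be quoted.
conn.executescript("""
CREATE TABLE "group" (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    url  TEXT
);
CREATE TABLE person (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    password TEXT,
    group_id INTEGER REFERENCES "group"(id)
);
CREATE TABLE project (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    url  TEXT
);
-- Join table for the many-to-many link between people and projects.
CREATE TABLE membership (
    person_id  INTEGER REFERENCES person(id),
    project_id INTEGER REFERENCES project(id),
    PRIMARY KEY (person_id, project_id)
);
""")

# Hypothetical sample rows.
conn.execute('INSERT INTO "group" (id, name) VALUES (1, ?)', ("Example Group",))
conn.execute("INSERT INTO person (id, name, group_id) VALUES (1, ?, 1)",
             ("Example Person",))

# A person pointing at a group that does not exist violates the foreign key.
try:
    conn.execute("INSERT INTO person (id, name, group_id) VALUES (2, ?, 99)",
                 ("Orphan",))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

Having to quote "group" everywhere is a small illustration of why table names deserve care.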
Structured Query Language (SQL)
❖ Works in lots of database servers
  ❖ SQLite, MySQL, PostgreSQL, MS SQL Server
❖ Standard way to:
  ❖ Find subsets of data based on criteria
  ❖ Merge data in separate tables
  ❖ Compute aggregate info
❖ Assumptions
  ❖ Don’t duplicate data (“data normalization”)
  ❖ Various parts of your data relate to each other
  ❖ Your metadata/schema (tables/columns) doesn’t change often
❖ Many frameworks will generate SQL for you
  ❖ Ask about Database Abstraction Layers
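The three standard uses listed above (subset, merge, aggregate) might look like this in practice. This is a sketch using SQLite through Python's `sqlite3` module. The two-table layout follows the Group/Person model, but the rows and names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "group" is a reserved word in SQL, so the table name is quoted.
conn.executescript("""
CREATE TABLE "group" (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, group_id INTEGER);
INSERT INTO "group" VALUES (1, 'Alpha'), (2, 'Beta');
INSERT INTO person VALUES (1, 'Ana', 1), (2, 'Ben', 1), (3, 'Cy', 2);
""")

# Find a subset of data based on criteria (WHERE).
subset = conn.execute(
    "SELECT name FROM person WHERE group_id = 1 ORDER BY name"
).fetchall()

# Merge data held in separate tables (JOIN on the foreign key).
merged = conn.execute("""
    SELECT person.name, g.name
    FROM person JOIN "group" AS g ON person.group_id = g.id
    ORDER BY person.name
""").fetchall()

# Compute aggregate info (COUNT with GROUP BY).
counts = conn.execute("""
    SELECT g.name, COUNT(*)
    FROM person JOIN "group" AS g ON person.group_id = g.id
    GROUP BY g.name ORDER BY g.name
""").fetchall()

print(subset)
print(merged)
print(counts)
```

Each group name is stored once and referenced by id, which is the "don't duplicate data" assumption at work.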
NoSQL
❖ Sometimes your data isn’t relational and the metadata changes often
❖ Queuing, document storage, logging, real-time, low-latency, concurrency
❖ Read this write-up for more:
  ❖ http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
Tangent: JavaScript Object Notation (JSON)
❖ A human-readable data exchange format
  ❖ CSV, XML, YAML are some others
❖ Example:
❖ http://media.mongodb.org/zips.json
❖ http://mongohub.todayclose.com (for Mac)
❖ sudo mkdir -p /data/db
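To make the JSON format concrete, here is a small sketch using Python's standard `json` module. The record is hypothetical. Its field names echo the Project table from earlier and are not taken from the linked zips.json file.

```python
import json

# A hypothetical record: nested lists and key/value pairs are all allowed.
record = {
    "name": "Example Project",
    "url": "http://example.org",
    "members": ["Ana", "Ben"],
}

# Serialize to human-readable JSON text.
text = json.dumps(record, indent=2)
print(text)

# Parse the text back into Python data structures.
restored = json.loads(text)
print(restored["members"])
```

The round trip returns the same structure, which is what makes JSON convenient for exchanging data between programs.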
MongoDB: Intro
❖ Demo:
❖ Command Line
❖ MongoHub
Indexes
❖ An index tracks keys
  ❖ Convention: have an “id” column with an index on it
❖ Why all these indexes?
  ❖ Multiple ways to get at rows quickly
❖ Creating indexes is tricky
  ❖ Many frameworks include query logging to help you find slow queries that might need optimizing
❖ Query optimization is a bit of an art
  ❖ Use the “Explain” command
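A rough sketch of how an index changes the way a query runs, assuming SQLite, where the "Explain" command is spelled `EXPLAIN QUERY PLAN`. Other servers use different syntax and output. The table contents and the index name `idx_person_name` are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO person (name) VALUES (?)",
    [(f"person{i}",) for i in range(1000)],
)

def plan(sql):
    """Return SQLite's query plan for `sql` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable detail.
    return " | ".join(row[-1] for row in rows)

query = "SELECT * FROM person WHERE name = 'person42'"

# Without an index on name, SQLite has to scan the whole table.
plan_before = plan(query)

# Add an index on the column used in the WHERE clause.
conn.execute("CREATE INDEX idx_person_name ON person (name)")

# The plan should now mention the new index.
plan_after = plan(query)

print(plan_before)
print(plan_after)
```

The exact wording of the plan varies between SQLite versions, so treat the printed text as a hint, not a stable format.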
Map-Reduce Instead of SQL
❖ Used to query large datasets
❖ Example: Count words in a document
❖ Map: select the data you need to operate on
  ❖ “emit” one record for each word in a document, keyed by the word
❖ Reduce: combine the mapped data
  ❖ Sum up the uses of each word, “emitting” one record for each total
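The word-count example can be sketched in plain Python. This is a single-process illustration of the map and reduce steps, not how a real map-reduce system distributes the work. The function names and the sample document are hypothetical.

```python
from collections import defaultdict

def map_words(document):
    """Map: emit one (word, 1) record per word, keyed by the word."""
    for word in document.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce: sum the uses of one word and emit a single total record."""
    return word, sum(counts)

document = "the cat saw the dog the end"

# Between the two steps, records are grouped by key.
grouped = defaultdict(list)
for word, one in map_words(document):
    grouped[word].append(one)

# Run the reduce step once per key.
totals = dict(reduce_counts(word, counts) for word, counts in grouped.items())
print(totals)
```

Because each word is reduced independently, the reduce calls could run on different machines, which is why the pattern suits large datasets.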
Picking Data Storage Strategies
❖ If you just need to dump data and pull it out by some id, use a NoSQL solution (MongoDB is simple)
❖ flexible, easy to start with
❖ If you are modeling an app, a relational database is usually the right answer (MySQL/PostgreSQL are standard)
  ❖ Database modeling is REALLY important to get right at the start of your project, because it is a pain to change later
❖ Names matter – choose your table names carefully
❖ PS: we can try stuff out on Amazon’s cloud services for free