Introduction to Data Science – INFO 480 – Drexel University’s iSchool Sean P. Goggins, PhD April 2, 2013 Week Three

Upload: len

Post on 24-Feb-2016

TRANSCRIPT

Page 1: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Introduction to Data Science – INFO 480 – Drexel University’s iSchool

Sean P. Goggins, PhD

April 2, 2013

Week Three

Page 2: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

What is Data Science?

• Storytelling

• Database Theory – How you organize your data has a big influence on what you can do with it.

• Agile Manifesto – Key thing is iterative development; it’s a technology value system.

• Spiral Dynamics – What we view as fact and what we desire emerges from the data presented to us.

Credit: http://www.datascientists.net/what-is-data-science

Page 3: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Database Theory

• Relational Algebra & Set Theory: Thinking in relations helps you to connect disparate data. What is the connecting field? What is the cardinality?

• Set Theory: Helps you think about summarizing data. What time period? Weeks? Months? By person? By group? By geography?
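The questions on this slide can be made concrete with a small sketch. The example below is not from the course materials: the `people` and `posts` records, their field names, and the dates are all made up for illustration. It shows a connecting field (`person_id`), a one-to-many cardinality check, and two summaries (by group and by week) using only the Python standard library.

```python
from collections import Counter
from datetime import date

# Hypothetical sample data (an assumption, not the course data set).
# One record per person; many post records may point at the same person.
people = {
    "p1": {"name": "Ana", "group": "A"},
    "p2": {"name": "Ben", "group": "B"},
}
posts = [
    {"person_id": "p1", "day": date(2013, 3, 4)},
    {"person_id": "p1", "day": date(2013, 3, 12)},
    {"person_id": "p2", "day": date(2013, 3, 5)},
    {"person_id": "p2", "day": date(2013, 4, 1)},
]

# Cardinality: how many posts hang off each person (one-to-many)?
posts_per_person = Counter(post["person_id"] for post in posts)

# Join on the connecting field, then summarize by group and by ISO week.
by_group = Counter()
by_week = Counter()
for post in posts:
    person = people[post["person_id"]]  # the join: look up the related record
    by_group[person["group"]] += 1
    iso_year, iso_week, _ = post["day"].isocalendar()
    by_week[(iso_year, iso_week)] += 1

print(posts_per_person)
print(by_group)
print(by_week)
```

Swapping the grouping key (week, month, person, group, geography) changes the story the summary tells, which is the point the slide is making.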

Page 4: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Agile Manifesto

• Individuals and interactions over processes and tools

• Working software over comprehensive documentation

• Customer collaboration over contract negotiation

• Responding to change over following a plan

http://www.agilemanifesto.org

Page 5: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Spiral Dynamics

New research unveiled at this year’s AERA conference documents a disturbing trend among the nation’s secondary schools: Between 2001 and 2012, high school graduation rates regularly spiked in late May and early June, ballooning from near zero to a staggering average of 78 percent.

Page 6: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

What you’ll need for this course

• Interest in learning data analysis tools: R, Python

• Curiosity

• A laptop to bring to class (see me if this is a problem)

• Persistence

• A GitHub account

• Willingness to do weekly homework and participate in online iteration of data products you and your course mates develop

• A Dropbox account will be helpful

Page 7: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Discuss Homework

Analysis questions. Write up a short essay, with tables or graphs if needed, to describe how you would:

• Build a network using the scripts from week 1 against the mention connections in this sample data? Against the Reply-To connections?

• What transformations are required?

• How would you filter the data?

Use the actual data to ground your thinking. Feel free to actually write or modify the R code samples from the first two weeks to experiment. Some of you will be more comfortable doing this; some will be more comfortable addressing the question conceptually. This is OK.
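One way to think through the homework questions is to sketch the transformation from tweets to edges. The course samples are in R; the sketch below uses Python instead and is only an illustration. The `tweets` records and their field names (`user`, `text`, `in_reply_to`) are assumptions, not the format of the course data.

```python
import re
from collections import Counter

# Hypothetical tweets; field names are assumptions for illustration only.
tweets = [
    {"user": "alice", "text": "Great talk @bob and @carol!", "in_reply_to": None},
    {"user": "bob", "text": "@alice thanks", "in_reply_to": "alice"},
    {"user": "carol", "text": "No mentions here", "in_reply_to": None},
    {"user": "alice", "text": "@bob see you next week", "in_reply_to": "bob"},
]

MENTION = re.compile(r"@(\w+)")


def mention_edges(rows):
    """Transformation: each tweet becomes zero or more (source, target) edges."""
    edges = Counter()
    for row in rows:
        for target in MENTION.findall(row["text"]):
            target = target.lower()
            if target != row["user"]:  # filter: drop self-mentions
                edges[(row["user"], target)] += 1
    return edges


def reply_edges(rows):
    """Each reply becomes one (replier, original author) edge."""
    edges = Counter()
    for row in rows:
        if row["in_reply_to"]:  # filter: keep only tweets that are replies
            edges[(row["user"], row["in_reply_to"])] += 1
    return edges


mentions = mention_edges(tweets)
replies = reply_edges(tweets)
print(mentions)
print(replies)
```

The edge counts act as weights, so either dictionary can be written out as a weighted edge list for a network package.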

Page 8: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Using GitHub for Software Sharing

• Creating a GitHub Account

• Creating a GitHub Project

• Using the GitHub Desktop client

• Committing & Syncing

• The Pull Request

Page 9: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Anarchist in the Library

Chapter Four – The Music Industry

• Take five minutes to prepare a one-minute summary of how you use file sharing and distribution to access music and videos

• I will mute the class capture when you share your stories

Page 10: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

A personal story of Music

Page 11: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Who is Zooey Deschanel?

Page 12: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

To Spotify!

Page 13: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Multiple Variable Distributions

Page 14: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Examples in Python

Page 15: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Two Variables

• Where are the data points located, and how far do they spread? What are typical, as well as minimal and maximal, values?

• How are the points distributed? Are they spread out evenly or do they cluster in certain areas?

• How many points are there? Is this a large data set or a relatively small one?
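The three questions above (location and spread, distribution, and count) can be answered numerically before plotting. This is a minimal sketch using Python's `statistics` module; the `x` and `y` values are made up and are not the health expenditure data shown in the slides.

```python
import statistics

# Made-up illustrative values (an assumption, not the slide data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def describe(values):
    """Count, extremes, typical value, and spread for one variable."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


summary = {"x": describe(x), "y": describe(y)}
for name, stats in summary.items():
    print(name, stats)
```

A summary like this does not show clustering; that still needs a scatter plot or a histogram.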

Health Expenditures

Page 16: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Plotting 2 variables

• Health Expenditures vs Life Expectancy

• Health Expenditures vs Doctor Visits

Page 17: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Interpolation

Page 18: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Spline

Page 19: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Polynomial Interpolation
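As a small illustration of polynomial interpolation, the sketch below evaluates the Lagrange form of the unique polynomial through a set of points. The function name `lagrange_interpolate` and the sample points are assumptions chosen for this example; the slides do not show which method or data the class used.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate at x the polynomial that passes through all (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                # Basis factor: equals 1 at xi and 0 at every other node.
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Three samples of x**2 + x + 1 (made-up data for illustration).
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 7.0]
value = lagrange_interpolate(xs, ys, 1.5)
print(value)
```

With many points a single high-degree polynomial tends to oscillate between the nodes, which is one reason splines and local methods such as LOESS are often preferred.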

Page 20: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Draft Lottery LOESS Curve

Page 21: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

LOESS in R

Page 22: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

LOESS & Well Water Testing

Credit to “Davids Statistics”

False Positives & False Negatives

• This is a modern statistical method that is useful when the relationship between x & y is unknown and complicated

• “Locally weighted polynomial regression”

• Basic regression

• Localized regression

• Localized subsets of data

• q – smoothing parameter

• Bigger q = more smoothing

• Assumption: the relationship can be well approximated by a simple model in a small neighborhood

• Models
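The idea in the bullets above can be sketched in a few lines. The slides use R's LOESS; the function below is a simplified Python illustration only, assuming a local linear fit (degree 1), tricube weights, and no robustness iterations. The name `loess` and the sample data are chosen for this example, and `q` is taken to be the fraction of points in each neighborhood.

```python
def loess(xs, ys, q=0.5):
    """Smooth ys with a weighted straight-line fit around each x.

    q is the fraction of points in each local neighborhood (0 < q <= 1);
    a bigger q means a wider neighborhood and more smoothing.
    """
    n = len(xs)
    k = max(2, min(n, int(round(q * n))))  # neighborhood size in points
    smoothed = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest point.
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1.0
        # Tricube weights: 1 at x0, falling to 0 at distance h and beyond.
        w = [(1.0 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        # Weighted least-squares line evaluated at x0.
        smoothed.append(my + slope * (x0 - mx))
    return smoothed


# Made-up data lying exactly on a line, so the smoother should return it unchanged.
line_x = [float(i) for i in range(10)]
line_y = [2.0 * x + 1.0 for x in line_x]
fitted = loess(line_x, line_y, q=0.6)
print(fitted)
```

Trying several values of `q` on noisy data is a quick way to see the trade-off between following the data closely and smoothing it.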

Page 23: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

The Data

Page 24: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Simple LOESS

Page 25: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Detects & Non-Detects

Page 26: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

First Curve

Page 27: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Second Curve

Page 28: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Smoothing Curve

ggplot2 library

Page 29: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Detects & Non-Detects with Separate Curves & Confidence Intervals

Page 30: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Activity

• Download the GitHub Project

• Get the Python code to run inside of Canopy (Week 4 Folder): Draft Data

• R Code: Twitter Data – @j_tsar

• Looking at various types of interpolation: How might interpolated data help tell a story?

• Now, what other data sets do you want to get?

Page 31: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Underpants Gnomes

With much discourtesy from the US TV Program “South Park”

Motivation

Page 32: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Underpants Gnomes

Motivation

Page 33: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Addressing the Underpants Gnome Postulate

Page 34: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Methodological Approach: Group Informatics Described

• Discussion Post: Read, Response

• Classification: Open Coding, Axial Coding

• Identification of Coordination Events: Time proximity, Topical proximity

• Aggregation of Posts by Topic

• Weighted Network Analysis of Interactions: Weight connections based on time distance, grouped by topic and informed by analysis of time distance between posts.

• Identify Key Information Brokers
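The weighting step in this approach can be sketched as follows. Everything specific here is an assumption made for illustration: the `posts` tuples, the 12-hour window, the weight formula `1 / (1 + hours)`, and the use of total edge weight per person as a rough stand-in for identifying information brokers. The slide does not specify any of these choices.

```python
from collections import defaultdict
from datetime import datetime
from itertools import combinations

# Hypothetical coded posts: (author, topic, timestamp).
posts = [
    ("ana", "t1", datetime(2013, 4, 1, 9, 0)),
    ("ben", "t1", datetime(2013, 4, 1, 9, 30)),
    ("cy", "t1", datetime(2013, 4, 1, 15, 0)),
    ("ben", "t2", datetime(2013, 4, 2, 10, 0)),
    ("dee", "t2", datetime(2013, 4, 2, 10, 15)),
]


def weighted_edges(rows, window_hours=12.0):
    """Connect authors who post on the same topic; closer in time = heavier."""
    by_topic = defaultdict(list)
    for author, topic, when in rows:
        by_topic[topic].append((author, when))
    edges = defaultdict(float)
    for topic_posts in by_topic.values():
        for (a, ta), (b, tb) in combinations(topic_posts, 2):
            hours = abs((tb - ta).total_seconds()) / 3600.0
            if a != b and hours <= window_hours:
                # Assumed weight: decays as the time distance grows.
                edges[tuple(sorted((a, b)))] += 1.0 / (1.0 + hours)
    return dict(edges)


def broker_scores(edges):
    """Sum of edge weights per person (a simple proxy, not betweenness)."""
    scores = defaultdict(float)
    for (a, b), weight in edges.items():
        scores[a] += weight
        scores[b] += weight
    return dict(scores)


edges = weighted_edges(posts)
scores = broker_scores(edges)
top_broker = max(scores, key=scores.get)
print(edges)
print(top_broker)
```

A fuller analysis would likely use a centrality measure such as betweenness from a network library, and would choose the time window from the observed distribution of gaps between posts.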

Page 35: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Week Five – Assignment Two

Software Sharing #1 (Share scripts produced in week 3 using an open source software configuration management tool).

• Students will refine and then share their scripts with other students.

• Included in the assignment is a 500-word explanation of how their script could be improved, optimized and adapted to other data of a similar type.

• The “read me” file distributed with the script will explain to another user how to apply the script to the data distributed in assignment one. This will include specific, technical specifications.

Page 36: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Upcoming Readings

Week 5: Data Presentation Tools (We’re a week behind on assignments)

• Software Sharing #1 (Share scripts produced in week 3 using an open source software configuration management tool).

• Readings and Assignments Due: Data Visualization Example Presentation; Part Three of “Data Analysis with Open Source Tools”.