Introduction to Data Science – INFO 480 – Drexel University’s iSchool Sean P. Goggins, PhD April 2, 2013 Week One

Page 1: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Introduction to Data Science – INFO 480 – Drexel University’s iSchool

Sean P. Goggins, PhD

April 2, 2013

Week One

Page 2: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

What is “Data Science”? Data scientists are storytellers.

Variables?

Time, Number of participants

Time, Number of participants, Number of new participants, Number of sustained participants
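The variables listed above can be derived from a plain log of who participated when. The following is a minimal sketch, not part of the original slides: the function name `participation_by_period`, the sample events, and the definition of "sustained" (present in both the current period and the one immediately before it) are all illustrative assumptions.

```python
from collections import defaultdict


def participation_by_period(events):
    """Summarize (period, person) pairs into one row per time period.

    Assumed definitions (not from the slides):
      new       = people never seen in any earlier period
      sustained = people also present in the immediately preceding period
    """
    people_by_period = defaultdict(set)
    for period, person in events:
        people_by_period[period].add(person)

    rows = []
    seen_before = set()   # everyone seen in any earlier period
    previous = set()      # everyone seen in the preceding period
    for period in sorted(people_by_period):
        current = people_by_period[period]
        rows.append({
            "period": period,
            "participants": len(current),
            "new": len(current - seen_before),
            "sustained": len(current & previous),
        })
        seen_before |= current
        previous = current
    return rows


# Made-up events, for illustration only.
events = [(1, "ann"), (1, "bob"), (2, "bob"), (2, "cy"),
          (3, "ann"), (3, "cy"), (3, "dee")]
rows = participation_by_period(events)
for row in rows:
    print(row)
```

Going from two variables to four changes the story the same data can tell: a flat participant count can hide a steady churn of new people replacing old ones.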

Page 3: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Story Telling

Page 4: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Story Telling

Page 5: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Story Telling

Page 6: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

What is Data Science?

Storytelling

Database Theory – How you organize your data has a big influence on what you can do with it.

Agile Manifesto – Key thing is iterative development; it’s a technology value system.

Spiral Dynamics – What we view as fact and what we desire emerges from the data presented to us.

Credit: http://www.datascientists.net/what-is-data-science

Page 7: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Database Theory

Relational Algebra & Set Theory

Thinking in relations helps you to connect disparate data. What is the connecting field? What is the cardinality?

Set Theory

Helps you think about summarizing data. What time period? Weeks? Months? By person? By group? By geography?
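The questions on this slide (connecting field, cardinality, summarizing by period or group) can be sketched with two small tables. This is an illustrative sketch only: the tables `people` and `posts`, their column names, and the helper functions `cardinality` and `inner_join` are hypothetical and do not come from the course materials.

```python
from collections import Counter, defaultdict

# Hypothetical tables. "person_id" is the connecting field.
people = [
    {"person_id": 1, "group": "A"},
    {"person_id": 2, "group": "A"},
    {"person_id": 3, "group": "B"},
]
posts = [
    {"post_id": 10, "person_id": 1, "week": 1},
    {"post_id": 11, "person_id": 1, "week": 1},
    {"post_id": 12, "person_id": 2, "week": 2},
    {"post_id": 13, "person_id": 3, "week": 2},
]


def cardinality(rows, key):
    """Return "one" if each key value appears at most once, else "many"."""
    counts = Counter(row[key] for row in rows)
    return "one" if all(n == 1 for n in counts.values()) else "many"


def inner_join(left, right, key):
    """Pair every left row with every right row sharing the same key value."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# people-to-posts is one-to-many on person_id.
relationship = (cardinality(people, "person_id"), cardinality(posts, "person_id"))

# Summarize: number of posts by group and by week.
joined = inner_join(people, posts, "person_id")
posts_by_group_week = Counter((row["group"], row["week"]) for row in joined)
print(relationship, dict(posts_by_group_week))
```

Knowing the relationship is one-to-many tells you the join will repeat each person's attributes once per post, which is what makes the group-by-week count valid.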

Page 8: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Agile Manifesto

Individuals and interactions over processes and tools

Working software over comprehensive documentation

Customer collaboration over contract negotiation

Responding to change over following a plan

http://www.agilemanifesto.org

Page 9: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Spiral Dynamics

New research unveiled at this year’s AERA conference documents a disturbing trend among the nation’s secondary schools: Between 2001 and 2012, high school graduation rates regularly spiked in late May and early June, ballooning from near zero to a staggering average of 78 percent.

Page 10: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

What you’ll need for this course

Interest in learning data analysis tools: R, Python

Curiosity

A laptop to bring to class (see me if this is a problem)

Persistence

A Github account

Willingness to do weekly homework and participate in online iteration of data products you and your course mates develop

A Dropbox account will be helpful

Page 11: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

An Outline of The Class

Two Books

Janert, P. K. (2011). Data Analysis with Open Source Tools. New York: O’Reilly.

Vaidhyanathan, S. (2005). The Anarchist in the Library: How the Clash Between Freedom and Control is Hacking the Real World and Crashing the System. New York: Basic Books.

Ten Weeks

Page 12: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Schedule

Page 13: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Activity One

Academic Integrity Form at the Back of The Syllabus

15 Minutes to Work on Tool Configuration

R Download

Python Download & Config: http://seangoggins.net/info480

Github Account and Client: http://mac.github.com http://windows.github.com

Page 14: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Basic R Instructions

Page 15: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Installation

Download and install R

Run the “R-Libraries.R” script from the root of the Github project directory

Page 16: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Basics

Page 17: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Google Trends – 2005 – Present

Produce a trend graph for Google search phrases.

Identify four search phrases. Describe in two sentences what makes these search phrases a coherent comparison. What do they have in common? How do they provide a useful contrast?

BEFORE you run the search, write down what you expect the trends to look like. Spiky, trending upward, trending downward?

Examine the resulting changes. What did you find? What are some theories that might explain the similarities or differences you observed?
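One way to compare a written-down expectation against the resulting series is to label the series mechanically. The sketch below is an assumption-laden illustration, not a method from the course: it takes "trending" to mean the sign of a least-squares slope, takes "spiky" to mean a peak more than `spike_ratio` times the median, and uses made-up numbers rather than real Google Trends data.

```python
from statistics import mean, median


def slope(values):
    """Least-squares slope of values against their index 0..n-1."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    numerator = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(values))
    denominator = sum((i - x_bar) ** 2 for i in range(n))
    return numerator / denominator


def describe(values, spike_ratio=3.0):
    """Label a series using the (assumed) slope and peak-to-median rules."""
    s = slope(values)
    if s > 0:
        labels = ["trending upward"]
    elif s < 0:
        labels = ["trending downward"]
    else:
        labels = ["flat"]
    if max(values) > spike_ratio * median(values):
        labels.append("spiky")
    return labels


# Made-up weekly interest scores, for illustration only.
print(describe([1, 2, 3, 4]))
print(describe([10, 10, 90, 10, 10]))
print(describe([5, 4, 3, 2, 1]))
```

The thresholds are arbitrary; the point of the exercise is the comparison between what you predicted and what the labels (or your eyes) report.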

Page 18: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Network Analysis

Page 19: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Telling Stories: A Visualization of Purely Qualitative Data

You CAN do that without Quantitative Data

You can do it with Qualitative Data

And A LOT of Quantitative Data REQUIRES Qualitative Analysis

Page 20: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Idealized Distributed Team

Page 21: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Actual Models Found

Page 22: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

The Different ICT Roles

Page 23: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Organizational Evolution

Page 24: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Network Analysis – Github Activities

Page 25: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Underpants Gnomes

With much discourtesy from the US TV Program “South Park”

Motivation

Page 26: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Underpants Gnomes

Motivation

Page 27: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Addressing The Underpants Gnome Postulate

Page 28: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Group Informatics Described

Methodological Approach

Discussion Post
•Read
•Response

Classification
•Open Coding
•Axial Coding

Identification of Coordination Events
•Time proximity
•Topical proximity

Aggregation of Posts by Topic

Weighted Network Analysis of Interactions

Weight connections based on time distance, grouped by topic and informed by analysis of time distance between posts.

Identify key information brokers
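The weighting step of this approach can be sketched as follows. This is a hedged illustration rather than the method used in the research: the function name `weighted_edges`, the `1 / (1 + gap)` weighting formula, the 24-hour cutoff, and the sample posts are all assumptions introduced here. It covers only the weighting of connections, not the coding or broker-identification steps.

```python
from collections import defaultdict


def weighted_edges(posts, max_gap_hours=24.0):
    """Build weighted, directed responder -> previous-author edges.

    posts: list of (topic, author, hour) tuples.
    Posts are grouped by topic; within a topic each post is linked to the
    post just before it. Closer posts get a larger weight (assumed formula:
    1 / (1 + gap in hours)); gaps above max_gap_hours are ignored.
    """
    by_topic = defaultdict(list)
    for topic, author, hour in posts:
        by_topic[topic].append((hour, author))

    weights = defaultdict(float)
    for topic_posts in by_topic.values():
        topic_posts.sort()  # chronological order within the topic
        for (t_prev, a_prev), (t_next, a_next) in zip(topic_posts, topic_posts[1:]):
            gap = t_next - t_prev
            if a_prev != a_next and gap <= max_gap_hours:
                weights[(a_next, a_prev)] += 1.0 / (1.0 + gap)
    return dict(weights)


# Made-up posts, for illustration only.
posts = [
    ("bug-1", "ann", 0.0),
    ("bug-1", "bob", 1.0),
    ("bug-1", "ann", 4.0),
    ("bug-2", "cy", 0.0),
    ("bug-2", "ann", 30.0),  # too far apart to count as a response
]
edges = weighted_edges(posts)
print(edges)
```

Grouping by topic before measuring time distance keeps two unrelated posts that happen to be close in time from being counted as an interaction.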

Page 29: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Data: Github

Page 30: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Github Network Activity One

Page 31: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool
Page 32: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool
Page 33: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Actual R – Code

Work through setup

Scripts are ready to run

Talk through them and walk around to help

Page 34: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Further Analysis Tools

Eight Mylyn Releases (Temporal Analysis)

R Packages Used: TNET, iGraph, Statnet

Page 35: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Weighted Network: TNET

Page 36: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

The Dense Graph (Work)

Developers create a dense graph. Not a complete graph, but dense.

Work

Page 37: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

A Sparser Graph (Talk)

Commenters create a sparse graph.

Talk
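"Dense" and "sparse" can be made concrete with graph density: the share of possible connections that actually exist. The sketch below is illustrative only; the function name `density` and both edge lists are made up and are not the Mylyn data.

```python
def density(nodes, edges):
    """Undirected density: connected pairs divided by all possible pairs."""
    n = len(nodes)
    if n < 2:
        return 0.0
    # Ignore self-loops and treat (a, b) and (b, a) as the same connection.
    unique = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    return len(unique) / (n * (n - 1) / 2)


nodes = ["a", "b", "c", "d"]

# Made-up "work" network: dense, but not complete (c-d is missing).
work_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")]

# Made-up "talk" network: sparse.
talk_edges = [("a", "b"), ("c", "d")]

print(density(nodes, work_edges), density(nodes, talk_edges))
```

A complete graph has density 1.0, so a value close to but below 1.0 matches the "not a complete graph, but dense" description.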

Page 38: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Release One (2.0) Analysis

Figure (iGraph, Release 1): Code network (Work) and Discussion network (Talk)

Page 39: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Statnet for Discussion

Figure (StatNET, Release 1): Talk network. Red = Bug Commenter; Blue = Bug Opener

Page 40: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Release One: Work & Talk

Page 41: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Release 1 (2.0) iGraph & Statnet

Figure (iGraph and StatNET, Release 1): Talk network, annotated with Clusters and In Degree & Out Degree. Red = Bug Commenter; Blue = Bug Opener
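In degree and out degree are simple counts over a directed edge list. The sketch below is an illustration under assumptions: the function name `degrees` is hypothetical, edges are assumed to point from a commenter to the person who opened the bug, and the participant labels are made up rather than taken from the figures.

```python
from collections import Counter


def degrees(edges):
    """Return {node: (in_degree, out_degree)} for directed (source, target) edges."""
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    nodes = set(out_degree) | set(in_degree)
    # Counter returns 0 for nodes that never appear on one side.
    return {node: (in_degree[node], out_degree[node]) for node in nodes}


# Made-up commenter -> bug-opener edges, for illustration only.
edges = [("p1", "p2"), ("p3", "p2"), ("p2", "p1"), ("p3", "p1")]
result = degrees(edges)
print(result)
```

In this reading, a high in degree marks someone whose bugs attract comments, and a high out degree marks someone who comments on other people's bugs.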

Page 42: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Release One (2.0): Filtered

Figure (Release 1): filtered Code network (Work) and Discussion network (Talk). Red = Bug Commenter; Blue = Bug Opener

304, 373, 399 & 143 form the strongest connections in both networks

Google Summer Coder

Page 43: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Release One (2.0): Filtered

Figure (Release 1): filtered Code network (Work) and Discussion network (Talk). Red = Bug Commenter; Blue = Bug Opener

304, 373, 399 & 143 form the strongest connections in both networks

Google Summer Coder

457, 391 & 159 – Comment & Open

Page 44: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Compare Over Time: First & Last Release

Page 45: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Release 1 (2.0) Compared to Release 8 (3.3)

Figure (StatNET & ordinary plotting): Talk network, Release 1 and Release 8

304, 399, 143, 159, 173, 373

399, 118, 304, 159, 391, 416

Page 46: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Release 1 (2.0) Compared to Release 8 (3.3)

Figure (iGraph): Work network, Release 1 and Release 8

304, 373, 399 & 143

Two disconnected graphs in Release 8

143 & 304 disengaged or missing entirely
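Whether a network has split into disconnected graphs can be checked by finding its connected components. The following is a minimal sketch with a hypothetical function name (`connected_components`) and made-up node labels; it does not reproduce the Release 8 data.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Return a list of node sets, one per connected component (undirected)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first walk from an unvisited node collects one component.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


# Made-up network that has split in two, plus one disengaged participant.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
components = connected_components(nodes, edges)
print(components)
```

A participant who appears in the node list but in no edge shows up as a component of size one, which is one way to distinguish "disengaged" from "missing entirely".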

Page 47: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Release Eight: Work & Talk

Page 48: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Release 8 (3.3): Filtered

Figure (Release 8): filtered Code network (Work) and Discussion network (Talk). Red = Bug Commenter; Blue = Bug Opener

Nobody is “Just Blue”

Page 49: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Release 8 (3.3): Filtered

Figure (Release 8): filtered Code network (Work) and Discussion network (Talk). Red = Bug Commenter; Blue = Bug Opener

Notice 416 in Talk & Second Coder Graph

Page 50: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Release 8 (3.3) iGraph & Statnet

Figure (iGraph and StatNET, Release 8): Talk network, annotated with Clusters, In Degree & Out Degree, and Blue Cluster. Red = Bug Commenter; Blue = Bug Opener

399, 118 & 159 are significant, but play with different clusters of other people.

Page 51: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Releases One – Eight: High Level Views Over Time

Page 52: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Discussion, Releases 1 – 8

Where there is no color, there are multiple, incomplete graphs.

Page 53: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Code, Releases 1 – 8

Figure (iGraph): Code networks across Releases 1 – 8

Structure evolves

Key groups evolve

One possible explanation: A few central people who slowly but observably begin to engage other contributors in an open source software development project.

Page 54: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Github: Longitudinal

Page 55: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Github Statnet Summaries

Page 56: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Built with This R Code

Page 57: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Longitudinal Network Libraries

Page 58: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

GGPLOT2 Code

library(ggplot2)

# In degree minus out degree per person over time; line size reflects betweenness.
pc <- ggplot(gMerged, aes(timeperiodNum, inDegree - outDegree, group = person, size = betweenness))

# theme(), element_text() and element_rect() replace the older opts(), theme_text() and theme_rect().
pc <- pc + geom_line(aes(colour = person)) +
  guides(colour = guide_legend(nrow = 7)) +
  theme(legend.text = element_text(colour = "blue", size = 20),
        legend.title = element_text(size = 24),
        legend.position = "top",
        axis.text.x = element_text(size = 20),
        axis.text.y = element_text(size = 20),
        axis.title.x = element_text(size = 20),
        axis.title.y = element_text(size = 20, angle = 90),
        panel.background = element_rect(fill = "white", colour = NA)) +
  ylab("In Degree MINUS Out Degree") +
  xlab("Time Period Number")

Page 59: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Github Longitudinal

Page 60: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool


Page 61: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

GITHUB: Types of Work

Page 62: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Code & Data for Types of Work

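library(ggplot2)

# Bars per time period, stacked by person, one panel per work-type code.
gg <- ggplot(pullerListTyped, aes(timeperiodNum, fill = person)) +
  geom_bar() +
  facet_grid(codeCode ~ .)

ggsave("fix-feature-pullers.png", dpi = 600, plot = gg, width = 12, height = 12)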
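The bar chart above is a picture of a three-way count: rows per work-type code, per time period, per person. The Python sketch below tallies the same counts without plotting. The column names `timeperiodNum`, `person`, and `codeCode` come from the R code; the row values (including the "fix" and "feature" codes, guessed from the output file name) are made up for illustration.

```python
from collections import Counter

# Made-up rows shaped like pullerListTyped, for illustration only.
rows = [
    {"timeperiodNum": 1, "person": "p1", "codeCode": "fix"},
    {"timeperiodNum": 1, "person": "p1", "codeCode": "fix"},
    {"timeperiodNum": 1, "person": "p2", "codeCode": "feature"},
    {"timeperiodNum": 2, "person": "p2", "codeCode": "fix"},
]

# One count per (facet, x position, fill colour) combination.
counts = Counter(
    (row["codeCode"], row["timeperiodNum"], row["person"]) for row in rows
)

# Bar height within a facet is the total over people for that time period.
bar_heights = Counter((row["codeCode"], row["timeperiodNum"]) for row in rows)

for key in sorted(counts):
    print(key, counts[key])
```

Each key in `counts` corresponds to one coloured segment of one bar in one panel of the plot.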

Page 63: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Practice Activities

Our First Python Program

Network Analysis in R

https://github.com/The-Art-of-Big-Social-Data/info480

The Week 1 folder contains a Python script that can be opened in IPython, following the instructions at http://www.seangoggins.net/info480

The Github Networks folder contains two activities

Page 64: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Upcoming

No lecture next week

Three readings for next week

I’ll set up an online discussion forum using GitHub to focus a discussion. Participation is required. I don’t count posts, I don’t count words; I evaluate thoughtfulness.

Selecting a Data Set – By this Friday I will post data sets with analysis questions. Your task will be to think through how the data is organized, and describe *in words* how that data needs to be reshaped to answer the analysis questions.

Page 65: Introduction to Data Science – INFO 480 – Drexel University’s  iSchool

Week Three – Assignment One

Prepared Data #1: Transform an instructor-supplied data set into an analyzable form using open source tools.

The instructor supplies a public data set for analysis, along with a set of questions or insights that are obtainable from the data set.

Included in the assignment is a 500-word explanation of how the preparation of the data alters the original data and a description of how the questions can be answered using the data (reshaping).

Transform the data into the “shape” required for analysis using whatever tools you are familiar with. For Assignment 2 you will have to share a script used for the transformation, so if you use a tool instead of a language the first time, you will need to write a script to do the transformation for Assignment 2.
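As one example of what "reshaping" can mean, the sketch below pivots a long table (one row per person per week) into a wide table (one row per person, one column per week). It is an illustration only: the function name `long_to_wide`, the column names, and the sample rows are hypothetical, and the assignment data may call for a different transformation.

```python
from collections import defaultdict


def long_to_wide(rows, id_key, column_key, value_key, fill=0):
    """Pivot long rows into one dict per id, with one key per column value."""
    columns = sorted({row[column_key] for row in rows})
    table = defaultdict(dict)
    for row in rows:
        table[row[id_key]][row[column_key]] = row[value_key]
    return [
        {id_key: ident, **{col: cells.get(col, fill) for col in columns}}
        for ident, cells in sorted(table.items())
    ]


# Made-up long-format data, for illustration only.
long_rows = [
    {"person": "ann", "week": "w1", "posts": 3},
    {"person": "ann", "week": "w2", "posts": 1},
    {"person": "bob", "week": "w2", "posts": 4},
]
wide_rows = long_to_wide(long_rows, "person", "week", "posts")
for row in wide_rows:
    print(row)
```

Note how the reshape alters the data: the missing combination (bob, w1) becomes an explicit 0, which is exactly the kind of change the written explanation is meant to document.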