Introduction to Data Science – INFO 480 – Drexel University’s iSchool
Sean P. Goggins, PhD
April 2, 2013
Week One
What is “Data Science”?
Data scientists are storytellers.
Variables?
Time, Number of participants
Time, Number of participants, Number of new participants, Number of sustained participants
Story Telling
What is Data Science?
Storytelling
Database Theory – How you organize your data has a big influence on what you can do with it.
Agile Manifesto – The key thing is iterative development; it’s a technology value system.
Spiral Dynamics – What we view as fact and what we desire emerges from the data presented to us.
Credit: http://www.datascientists.net/what-is-data-science
Database Theory
Relational Algebra & Set Theory
Thinking in relations helps you to connect disparate data. What is the connecting field? What is the cardinality?
Set theory helps you think about summarizing data. What time period? Weeks? Months? By person? By group? By geography?
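The two questions above (what is the connecting field, and how do you summarize) can be sketched in a few lines of Python. This is only an illustration: the people, the posts and the field name person_id are made up for the example, and the cardinality assumed here is one person to many posts.

```python
from collections import defaultdict
from datetime import date

# Hypothetical data: people and their posts, linked by the connecting field person_id.
people = {1: "ana", 2: "ben"}  # person_id -> name
posts = [
    {"person_id": 1, "day": date(2013, 4, 1)},
    {"person_id": 1, "day": date(2013, 4, 9)},
    {"person_id": 2, "day": date(2013, 4, 2)},
]

def posts_per_person_per_week(people, posts):
    """Join posts to people on person_id, then count posts by (name, ISO week)."""
    counts = defaultdict(int)
    for post in posts:
        name = people[post["person_id"]]      # the join: one person, many posts
        week = post["day"].isocalendar()[1]   # the summary period: ISO week number
        counts[(name, week)] += 1
    return dict(counts)

summary = posts_per_person_per_week(people, posts)
print(summary)
```

Swapping the week for a month, a group or a geography changes only the key used for counting; the join stays the same.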
Agile Manifesto
Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
http://www.agilemanifesto.org
Spiral Dynamics
New research unveiled at this year’s AERA conference documents a disturbing trend among the nation’s secondary schools: Between 2001 and 2012, high school graduation rates regularly spiked in late May and early June, ballooning from near zero to a staggering average of 78 percent.
What you’ll need for this course
Interest in learning data analysis tools: R, Python
Curiosity
A laptop to bring to class (see me if this is a problem)
Persistence
A GitHub account
Willingness to do weekly homework and participate in online iteration of data products you and your course mates develop
A Dropbox account will be helpful
An Outline of The Class
Two Books
Janert, PK. (2011) Data Analysis with Open Source Tools. New York: O’Reilly.
Vaidhyanathan, S. (2005) The Anarchist in the Library: How the Clash Between Freedom and Control is Hacking the Real World and Crashing the System. New York: Basic Books.
Ten Weeks
Schedule
Activity One
Academic Integrity Form at the back of the syllabus
15 minutes to work on tool configuration
R download
Python download & config: http://seangoggins.net/info480
GitHub account and client: http://mac.github.com http://windows.github.com
Basic R Instructions
Installation
Download and install R
Run the “R-Libraries.R” script from the root of the GitHub project directory
Basics
Google Trends – 2005 – Present
Produce a trend graph for Google search phrases.
Identify four search phrases. Describe in two sentences what makes these search phrases a coherent comparison. What do they have in common? How do they provide a useful contrast?
BEFORE you run the search, write down what you expect the trends to look like. Spiky, trending upward, trending downward?
Examine the resulting changes.
What did you find? What are some theories that might explain the similarities or differences you observed?
Network Analysis
Telling Stories: A Visualization of Purely Qualitative Data
You CAN do that without quantitative data
You can do it with qualitative data
And A LOT of quantitative data REQUIRES qualitative analysis
Idealized Distributed Team
Actual Models Found
The Different ICT Roles
Organizational Evolution
Network Analysis – Github Activities
Underpants Gnomes
With much discourtesy from the US TV Program “South Park”
Motivation
Addressing the Underpants Gnome Postulate
Methodological Approach
Discussion: Post, Read, Response
Classification: Open Coding, Axial Coding
Identification of Coordination Events: Time proximity, Topical proximity
Aggregation of Posts by Topic
Weighted Network Analysis of Interactions
Weight connections based on time distance, grouped by topic and informed by analysis of time distance between posts.
Identify Key Information Brokers
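A rough Python sketch of the weighting step, to make the idea concrete. It is not the method used in the study: the posts are invented, and the weight 1 / (1 + hours apart) is an assumed decay chosen only so that quick replies count for more than slow ones. Finding information brokers would be a further step on top of these edges and is not shown.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical posts: (author, topic, timestamp).
posts = [
    ("ana", "bug-12", datetime(2013, 4, 1, 9, 0)),
    ("ben", "bug-12", datetime(2013, 4, 1, 10, 0)),
    ("cy",  "bug-12", datetime(2013, 4, 3, 10, 0)),
    ("ana", "docs",   datetime(2013, 4, 2, 9, 0)),
    ("cy",  "docs",   datetime(2013, 4, 2, 9, 30)),
]

def weighted_edges(posts):
    """Link each post to the previous post in the same topic.

    The weight 1 / (1 + hours apart) is an assumed decay, so replies that
    follow quickly count for more than replies that arrive days later.
    """
    by_topic = defaultdict(list)
    for author, topic, when in posts:
        by_topic[topic].append((when, author))  # group posts by topic

    edges = defaultdict(float)
    for thread in by_topic.values():
        thread.sort()  # order each topic's posts by time
        for (t_prev, a_prev), (t_next, a_next) in zip(thread, thread[1:]):
            if a_prev == a_next:
                continue  # skip someone following up on their own post
            hours = (t_next - t_prev).total_seconds() / 3600
            edges[(a_next, a_prev)] += 1 / (1 + hours)  # responder -> earlier poster
    return dict(edges)

edges = weighted_edges(posts)
```

Each key in edges is a (responder, earlier poster) pair, so the result is a directed, weighted edge list that a network package could read.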
Group Informatics Described
Data: Github
Github Network Activity One
Actual R – Code
Work through setup
Scripts are ready to run
Talk through them and walk around to help
Further Analysis Tools
Eight Mylyn Releases (Temporal Analysis)
R Packages Used: TNET, iGraph, Statnet
Weighted Network: TNET
The Dense Graph (Work)
Developers create a dense graph. Not a complete graph, but dense.
A Sparser Graph (Talk)
Commenters create a sparse graph.
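"Dense but not complete" can be put into a number. A minimal sketch in Python, assuming density means the share of possible ties that are actually present; the two four-person graphs are made up and are not the Mylyn data.

```python
def density(nodes, edges, directed=True):
    """Share of possible ties that are present (1.0 means a complete graph).

    For undirected graphs this assumes each tie is listed once.
    """
    n = len(nodes)
    if n < 2:
        return 0.0
    possible = n * (n - 1) if directed else n * (n - 1) // 2
    return len(set(edges)) / possible

# Hypothetical undirected examples with four people each.
people = ["a", "b", "c", "d"]
work = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")]  # 5 of 6 ties
talk = [("a", "b"), ("c", "d")]                                      # 2 of 6 ties

work_density = density(people, work, directed=False)
talk_density = density(people, talk, directed=False)
```

In this toy case the work graph is dense without being complete (one tie is missing), and the talk graph is sparse.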
Release One (2.0) Analysis
Figure: Release 1 networks drawn with iGraph. Panels: Code (Work) and Discussion (Talk).
Statnet for Discussion
Figure: Release 1 discussion (Talk) network drawn with Statnet. Red = bug commenter; blue = bug opener.
Release One: Work & Talk
Release 1 (2.0): iGraph & Statnet
Figure: Release 1 Talk network in iGraph and Statnet views. Labels: Clusters; In Degree & Out Degree. Red = bug commenter; blue = bug opener.
Release One (2.0): Filtered
Figure: Release 1 filtered networks. Panels: Code (Work) and Discussion (Talk). Red = bug commenter; blue = bug opener. Labeled node: Google Summer Coder.
304, 373, 399 & 143 form the strongest connections in both networks.
457, 391 & 159 – comment & open
Compare Over Time: First & Last Release
Release 1 (2.0) Compared to Release 8 (3.3): Talk
Figure: Talk networks for Release 1 and Release 8, drawn with Statnet and ordinary plotting.
Release 1: 304, 399, 143, 159, 173, 373
Release 8: 399, 118, 304, 159, 391, 416
Release 1 (2.0) Compared to Release 8 (3.3): Work
Figure: Work networks for Release 1 and Release 8, drawn with iGraph.
Annotations: two disconnected graphs in Release 8; 304, 373, 399 & 143; 143 & 304 disengaged or missing entirely.
Release Eight: Work & Talk
Release 8 (3.3): Filtered
Figure: Release 8 filtered networks. Panels: Code (Work) and Discussion (Talk). Red = bug commenter; blue = bug opener.
Nobody is “just blue.”
Notice 416 in Talk and in the second coder graph.
Release 8 (3.3): iGraph & Statnet
Figure: Release 8 Talk network in iGraph and Statnet views. Labels: Clusters; In Degree & Out Degree; Blue Cluster. Red = bug commenter; blue = bug opener.
399, 118 & 159 are significant, but play with different clusters of other people.
Releases One – Eight: High Level Views Over Time
Discussion, Releases 1 – 8
Where there is no color, there are multiple, incomplete graphs.
Code, Releases 1 – 8
One possible explanation: a few central people who slowly but observably begin to engage other contributors in an open source software development project.
Structure evolves. Key groups evolve.
Figure: Discussion and Code networks for Releases 1 – 8, drawn with iGraph.
Github: Longitudinal
Github Statnet Summaries
Built with This R Code
Longitudinal Network Libraries
GGPLOT2 Code
library(ggplot2)
# One line per person: in degree minus out degree per time period, line size scaled by betweenness
pc <- ggplot(gMerged, aes(timeperiodNum, inDegree - outDegree, group = person, size = betweenness))
# theme() with element_text() and element_rect() sets the legend, axis and panel styling
pc <- pc + geom_line(aes(colour = person)) +
  guides(colour = guide_legend(nrow = 7)) +
  theme(legend.text = element_text(colour = "blue", size = 20),
        legend.title = element_text(size = 24),
        legend.position = "top",
        axis.text.x = element_text(size = 20),
        axis.text.y = element_text(size = 20),
        axis.title.x = element_text(size = 20),
        axis.title.y = element_text(size = 20, angle = 90),
        panel.background = element_rect(fill = "white", colour = NA)) +
  ylab("In Degree MINUS Out Degree") +
  xlab("Time Period Number")
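The plot's y value, in degree minus out degree per person per time period, has to be computed before it can be drawn. A small Python sketch of that calculation, assuming the input is a list of directed ties; the ties and the names p1 to p4 are invented, and this is not how gMerged was actually built.

```python
from collections import defaultdict

# Hypothetical directed ties: (time_period, source, target).
ties = [
    (1, "p1", "p3"),
    (1, "p2", "p3"),
    (1, "p3", "p4"),
    (2, "p3", "p1"),
    (2, "p4", "p1"),
]

def net_degree_by_period(ties):
    """Return {(period, person): in_degree - out_degree}."""
    net = defaultdict(int)
    for period, source, target in ties:
        net[(period, source)] -= 1  # an outgoing tie lowers the score
        net[(period, target)] += 1  # an incoming tie raises it
    return dict(net)

net = net_degree_by_period(ties)
```

A positive score means a person receives more ties than they send in that period; a negative score means the reverse.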
Github Longitudinal
GITHUB: Types of Work
Code & Data for Types of Work
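# Count rows per time period, filled by person, with one facet row per codeCode value
gg <- ggplot(pullerListTyped, aes(timeperiodNum, fill = person)) + geom_bar() + facet_grid(codeCode ~ .)
# Save the plot as a 12 x 12 inch PNG at 600 dpi
ggsave("fix-feature-pullers.png", dpi = 600, plot = gg, width = 12, height = 12)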
Practice Activities
Our First Python Program
Network Analysis in R
https://github.com/The-Art-of-Big-Social-Data/info480
Week 1 folder contains a Python script that can be opened in IPython, following the instructions at http://www.seangoggins.net/info480
Github Networks folder contains two activities
Upcoming
No lecture next week
Three readings for next week
I’ll set up an online discussion forum using GitHub to focus a discussion. Participation is required. I don’t count posts, I don’t count words; I evaluate thoughtfulness.
Selecting a data set – By this Friday I will post data sets with analysis questions. Your task will be to think through how the data is organized, and describe *in words* how that data needs to be reshaped to answer the analysis questions.
Week Three – Assignment One
Prepared Data #1 (Transform an instructor-supplied data set into an analyzable form using open source tools)
Instructor supplies a public data set for analysis, along with a set of questions or insights that are obtainable from the data set.
Included in the assignment is a 500-word explanation of how the preparation of the data alters the original data and a description of how the questions can be answered using the data. (reshaping)
Transform the data using whatever tools you are familiar with so that it is in the “shape” required for analysis. (For Assignment 2 you will have to share a script used for the transformation, so if you use a tool instead of a language the first time, you will need to write a script to do the transformation for Assignment 2.)
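As an illustration of what "reshaping" can mean, here is a minimal Python sketch that turns a wide table (one column per week) into a long one (one row per person and week). The data, the column names and the posts field are all hypothetical; the data sets for the assignment will look different.

```python
import csv
import io

# Hypothetical "wide" data: one row per person, one column per week.
wide_csv = """person,week1,week2,week3
ana,3,0,5
ben,1,2,0
"""

def wide_to_long(text, id_field="person"):
    """Turn one-column-per-week rows into one row per (person, week) pair."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        for column, value in record.items():
            if column == id_field:
                continue  # the id column is carried along, not reshaped
            rows.append({id_field: record[id_field], "week": column, "posts": int(value)})
    return rows

long_rows = wide_to_long(wide_csv)
```

The long form is usually the easier shape for grouping and plotting by time period, which is why a short script like this is worth keeping alongside the written explanation.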