Getting started with R when analysing GitHub commits
Getting started with R when analysing GitHub events
Barbara Fusinska
barbarafusinska.com
About me
• Programmer
• Math enthusiast
• Sweet tooth
@BasiaFusinska
https://github.com/BasiaFusinska/RTalk
Agenda
• R ecosystem
• R basics
• Analysing GitHub events
• Data sources
• Code… a lot of code
Why R?
• Ross Ihaka & Robert Gentleman
• Name:
  • First letter of names
  • Play on the name of S
• S-PLUS – commercial alternative
• Open source
• No. 1 for statistical computing
R Environment
• R project
• console environment
• http://www.r-project.org/
• IDE
• Any editor
• RStudio: http://www.rstudio.com/products/rstudio/download/
RStudio
Editor
Console
Environment variables
Plots
Files
Help
Packages
R Basics
Basics - Types
> myChar <- "a"
> myChar
[1] "a"
> typeof(myChar)
[1] "character"

> myNum <- 10
> myNum
[1] 10
> typeof(myNum)
[1] "double"

> # Dynamic
> myNum <- "some text"
> typeof(myNum)
[1] "character"
Vectors
> myVector <- c("a", "b", "c")
> myVector
[1] "a" "b" "c"
> typeof(myVector)
[1] "character"

myVector <- 1:10
myVector <- double(0)
myVector <- c(2, 5:10, 20)
myVector <- letters[1:5]
myVector[5]
Lists
> myList <- list("a", "b", "c")
> myList
[[1]]
[1] "a"

[[2]]
[1] "b"

[[3]]
[1] "c"

> typeof(myList)
[1] "list"
Named elements
> myVector <- c(a="a", b="b", c="c")
> myVector
  a   b   c
"a" "b" "c"

> myList <- list(a="a", b="b", c="c")
> myList
$a
[1] "a"

$b
[1] "b"

$c
[1] "c"
Accessing element
> myVector[1]
  a
"a"
> myVector[[1]]
[1] "a"
> myVector['a']
  a
"a"
> myVector[['a']]
[1] "a"

> myList[1]
$a
[1] "a"

> myList[[1]]
[1] "a"
> myList['a']
$a
[1] "a"

> myList[['a']]
[1] "a"
> myList$a
[1] "a"
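The difference between single and double brackets can be checked directly. A minimal sketch, reusing myList from the slide; the names sub and elem are only illustrative.

```r
myList <- list(a = "a", b = "b", c = "c")

# Single brackets keep the container: the result is still a list
sub <- myList["a"]
typeof(sub)     # "list"

# Double brackets (and $) extract the element itself
elem <- myList[["a"]]
typeof(elem)    # "character"
identical(elem, myList$a)   # TRUE
```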
Data frames
> dataFrame <- data.frame(col1=c(1,2,3), col2=c(4,5,6))
> dataFrame
  col1 col2
1    1    4
2    2    5
3    3    6
> typeof(dataFrame)
[1] "list"

Summary
> summary(dataFrame)
      col1          col2
 Min.   :1.0   Min.   :4.0
 1st Qu.:1.5   1st Qu.:4.5
 Median :2.0   Median :5.0
 Mean   :2.0   Mean   :5.0
 3rd Qu.:2.5   3rd Qu.:5.5
 Max.   :3.0   Max.   :6.0

Summary statistics
mean(dataFrame$col1)
max(dataFrame$col1)
min(dataFrame$col1)
sum(dataFrame$col1)
median(dataFrame$col1)
quantile(dataFrame$col1)
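The same statistics can be computed for every column at once. A small sketch on the dataFrame from the slides; columnMeans and quartiles are illustrative names.

```r
dataFrame <- data.frame(col1 = c(1, 2, 3), col2 = c(4, 5, 6))

# A data frame is a list of columns, so sapply visits each column
columnMeans <- sapply(dataFrame, mean)
columnMeans     # col1 = 2, col2 = 5

# quantile() takes the probabilities to compute
quartiles <- quantile(dataFrame$col1, probs = c(0.25, 0.75))
quartiles       # 25% = 1.5, 75% = 2.5
```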
Filtering vectors and lists
> a <- 1:10
> a[a > 4]
[1]  5  6  7  8  9 10

> select <- function(x) { x > 4 }
> a[select(a)]
[1]  5  6  7  8  9 10

> Filter(select, a)
[1]  5  6  7  8  9 10
Filtering data frames
dataFrame <- data.frame(
  age=c(20, 15, 31, 45, 17),
  gender=c('F', 'F', 'M', 'M', 'F'),
  smoker=c(TRUE, TRUE, FALSE, TRUE, FALSE))

> dataFrame
  age gender smoker
1  20      F   TRUE
2  15      F   TRUE
3  31      M  FALSE
4  45      M   TRUE
5  17      F  FALSE
Filtering by rows
> dataFrame$age[ dataFrame$gender == 'F']
[1] 20 15 17

> dataFrame[2:4, ]
  age gender smoker
2  15      F   TRUE
3  31      M  FALSE
4  45      M   TRUE

> dataFrame[ dataFrame$age < 30, ]
  age gender smoker
1  20      F   TRUE
2  15      F   TRUE
5  17      F  FALSE

> dataFrame[ dataFrame$gender == 'M', ]
  age gender smoker
3  31      M  FALSE
4  45      M   TRUE
Filtering by columns
> dataFrame[, 3]
[1]  TRUE  TRUE FALSE  TRUE FALSE

> dataFrame[, c(1,3)]
  age smoker
1  20   TRUE
2  15   TRUE
3  31  FALSE
4  45   TRUE
5  17  FALSE

> dataFrame[, c(3,2)]
  smoker gender
1   TRUE      F
2   TRUE      F
3  FALSE      M
4   TRUE      M
5  FALSE      F

> dataFrame[, c('age', 'smoker')]
  age smoker
1  20   TRUE
2  15   TRUE
3  31  FALSE
4  45   TRUE
5  17  FALSE
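Row and column filters can be combined in a single indexing step. A sketch on the same dataFrame; youngSmokers is an illustrative name, and subset() is shown only as an alternative spelling of the same selection.

```r
dataFrame <- data.frame(
  age = c(20, 15, 31, 45, 17),
  gender = c('F', 'F', 'M', 'M', 'F'),
  smoker = c(TRUE, TRUE, FALSE, TRUE, FALSE))

# Rows: smokers under 30; columns: age and gender
youngSmokers <- dataFrame[dataFrame$age < 30 & dataFrame$smoker,
                          c('age', 'gender')]

# The same selection written with subset()
youngSmokers2 <- subset(dataFrame, age < 30 & smoker,
                        select = c(age, gender))

youngSmokers$age   # 20 15
```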
Goal: Language distribution
https://www.githubarchive.org/
Google BigQuery
Language information
• Only pull request events carry language information
• Data source: one hour of events from 01.01.2015, 3 PM
• ~11k events
• ~500 pull requests
Gender bias?
• 4,037,953 GitHub user profiles
• 1,426,121 identified (35.3%)
http://arstechnica.com/information-technology/2016/02/data-analysis-of-github-contributions-reveals-unexpected-gender-bias/
        Open      Closed
Women   8,216     111,011
Men     150,248   2,181,517
Reading data from files - csv
> sizes <- read.csv(sizesFile)
> sizes
   category length width
1         B   20.0   3.0
2         A   23.0   3.6
3         B   75.0  18.0
4         B   44.0  10.0
5         C    2.5   6.0
6         B    7.2  27.0
7         A   45.8  34.0
8         C   12.0   2.0
9         A    5.0  13.0
10        A   68.0  14.5

Reading data from files - lines
> lines <- readLines(sizesFile)
> lines
 [1] "category,length,width" "B,20,3"
 [3] "A,23,3.6"              "B,75,18"
 [5] "B,44,10"               "C,2.5,6"
 [7] "B,7.2,27"              "A,45.8,34"
 [9] "C,12,2"                "A,5,13"
[11] "A,68,14.5"

Writing data to csv file
write.csv(sizes, file=outputFile)
write.csv(sizes, file=outputFile, row.names = FALSE)
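What row.names = FALSE changes is easiest to see in a round trip. A minimal sketch with a shortened, hand-typed sizes table and a temporary file standing in for outputFile; withRowNames and roundTrip are illustrative names.

```r
sizes <- data.frame(
  category = c("B", "A", "C"),
  length = c(20, 23, 2.5),
  width = c(3, 3.6, 6))

outputFile <- tempfile(fileext = ".csv")

# Default row.names = TRUE: an extra, unnamed first column is written
write.csv(sizes, file = outputFile)
withRowNames <- read.csv(outputFile)
names(withRowNames)   # "X" "category" "length" "width"

# row.names = FALSE: only the data columns are written
write.csv(sizes, file = outputFile, row.names = FALSE)
roundTrip <- read.csv(outputFile)
names(roundTrip)      # "category" "length" "width"
```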
Applying operation across elements
> myVector <- c(1, 4, 9, 16, 25)

> sapply(myVector, sqrt)
[1] 1 2 3 4 5

> lapply(myVector, sqrt)
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4

[[5]]
[1] 5
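Two related spellings, shown here only as a side note to the slide: sqrt is already vectorised, and vapply is a stricter variant of sapply that declares the expected result type. The names roots and checked are illustrative.

```r
myVector <- c(1, 4, 9, 16, 25)

# sqrt works on whole vectors, so no apply function is needed for it
roots <- sqrt(myVector)

# vapply behaves like sapply but checks that every result
# is a single number, as declared by numeric(1)
checked <- vapply(myVector, sqrt, numeric(1))

roots     # 1 2 3 4 5
```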
Read GitHub Archive events
library("rjson")

readEvents <- function(file, eventNames) {
  lines <- readLines(file)
  jsonEvents <- lapply(lines, fromJSON)
  specificEvents <- Filter(
    function(e) { e$type %in% eventNames },
    jsonEvents)

  return(specificEvents)
}
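The filtering step of readEvents can be tried without any file or JSON parsing. A sketch that uses hand-built lists as stand-ins for parsed events; the event contents are made up for illustration.

```r
# Hand-built stand-ins for events parsed from JSON (contents are made up)
jsonEvents <- list(
  list(type = "PushEvent", id = "1"),
  list(type = "PullRequestEvent", id = "2"),
  list(type = "WatchEvent", id = "3"),
  list(type = "PullRequestEvent", id = "4"))

eventNames <- c("PullRequestEvent", "WatchEvent")

# Same predicate as in readEvents: keep events whose type is in eventNames
specificEvents <- Filter(
  function(e) { e$type %in% eventNames },
  jsonEvents)

length(specificEvents)                         # 3
keptIds <- sapply(specificEvents, function(e) e$id)
keptIds                                        # "2" "3" "4"
```

Because %in% is used, eventNames can hold one name or several.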
Missing data
# Missing values
> a <- c(1,2,NA,3,4,5)
> a
[1]  1  2 NA  3  4  5

# Checking if missing data
> is.na(a)
[1] FALSE FALSE  TRUE FALSE FALSE FALSE
> anyNA(a)
[1] TRUE

# Setting missing values
> is.na(a) <- c(2,4)
> a
[1]  1 NA NA NA  4  5

# Setting null values
> a <- NULL
> is.null(a)
[1] TRUE
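Missing values propagate through most summary functions unless they are removed. A short sketch on the vector from the slide; complete is an illustrative name.

```r
a <- c(1, 2, NA, 3, 4, 5)

mean(a)                  # NA, because one value is missing
mean(a, na.rm = TRUE)    # 3

# Dropping the missing values explicitly
complete <- a[!is.na(a)]
complete                 # 1 2 3 4 5
```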
Read pull requests
pullRequestEvents <- readEvents(fileName, "PullRequestEvent")

select <- function(x) {
  id <- x$payload$pull_request$base$repo$id
  language <- x$payload$pull_request$base$repo$language

  if (!is.null(language)) {
    c(ID=id, Language=language)
  } else {
    c(ID=id, Language="")
  }
}

pullRequests <- sapply(pullRequestEvents, select)
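It helps to know what shape sapply returns here. A sketch with made-up events that only mimic the nesting of the pull request payload; the helper event() is hypothetical and exists just to build test input, and select is a condensed version of the function above.

```r
# Hypothetical helper: builds a made-up event with the payload nesting
event <- function(id, language) {
  list(payload = list(pull_request = list(base = list(
    repo = list(id = id, language = language)))))
}
pullRequestEvents <- list(event(1, "R"), event(2, NULL), event(3, "Python"))

select <- function(x) {
  repo <- x$payload$pull_request$base$repo
  language <- repo$language
  if (is.null(language)) language <- ""   # missing language becomes ""
  c(ID = repo$id, Language = language)
}

# Each call returns a named vector of length 2, so sapply
# simplifies the results to a matrix with one column per event
pullRequests <- sapply(pullRequestEvents, select)
dim(pullRequests)            # 2 3
rownames(pullRequests)       # "ID" "Language"
pullRequests["Language", ]   # "R" "" "Python"
```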
Some solutions
for(x in pullRequests) {
  # version 1
  dataFrame <- rbind(dataFrame, x)

  # version 2
  idColumn <- c(idColumn, x["ID",])
  languageColumn <- c(languageColumn, x["Language",])
}

# version 2
dataFrame <- data.frame(
  id=idColumn,
  language=languageColumn)
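Another option, not shown on the slides, is to bind all results in a single call instead of growing objects inside a loop. A sketch assuming the per-event results are available as a list of named vectors; results and bound are illustrative names and the values are made up.

```r
# Made-up per-event results, each shaped like the output of select
results <- list(
  c(ID = "1", Language = "R"),
  c(ID = "2", Language = ""),
  c(ID = "3", Language = "Python"))

# rbind receives all vectors at once and stacks them as rows
bound <- do.call(rbind, results)

dataFrame <- data.frame(
  id = bound[, "ID"],
  language = bound[, "Language"])

nrow(dataFrame)   # 3
```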
Prepare data
reposLanguages <- data.frame(
  id=pullRequests["ID",],
  language=pullRequests["Language",])

head(reposLanguages)
summary(reposLanguages)
A quick look at the data
> head(reposLanguages)
        id   language
1  3542607        C++
2 10391073     Python
3 28668460     Python
4 28608107       Ruby
5  5452699 JavaScript
6 19777872         C#

> summary(reposLanguages)
        id            language
 28648149: 12   Ruby      : 66
 28688863:  8   PHP       : 55
 20413356:  5   Python    : 53
 28668553:  5             : 51
 10160141:  4   JavaScript: 47
 206084  :  4   C++       : 30
 (Other) :436   (Other)   :172
Duplicated data
> myData <- c(1,2,3,4,3,2,5,6)
> duplicated(myData)
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
> anyDuplicated(myData)
[1] 5

> unique(myData)
[1] 1 2 3 4 5 6
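On a data frame, duplicated and unique compare whole rows, so the same id can survive if another column differs. A sketch with made-up repository data; repos and uniqueRepos are illustrative names.

```r
repos <- data.frame(
  id = c(1, 2, 1, 3, 2),
  language = c("R", "Python", "R", "Ruby", "C"))

# Only row 3 repeats an earlier row in full;
# row 5 shares its id with row 2 but has a different language
duplicated(repos)     # FALSE FALSE TRUE FALSE FALSE

# unique() keeps the first occurrence of every distinct row
uniqueRepos <- unique(repos)
nrow(uniqueRepos)     # 4
```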
Unique repositories data
> reposLanguages <- unique(reposLanguages)
> summary(reposLanguages)
        id            language
 25994257:  2   Python    : 36
 28528325:  2   JavaScript: 35
 10126031:  1   Ruby      : 35
 10160141:  1   PHP       : 34
 10344201:  1             : 27
 10391073:  1   Java      : 22
 (Other) :297   (Other)   :116
Distribution tables
> collection <- c('A','C','B','C','B','C')

> oneWayTable <- table(collection)

> oneWayTable
collection
A B C
1 2 3

> attributes(oneWayTable)
$dim
[1] 3

$dimnames
$dimnames$collection
[1] "A" "B" "C"
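A table can be read by name and turned into relative frequencies. A small sketch on the same collection; shares is an illustrative name.

```r
collection <- c('A', 'C', 'B', 'C', 'B', 'C')
oneWayTable <- table(collection)

# Counts can be looked up by name
oneWayTable[["C"]]    # 3

# prop.table converts counts to shares that sum to 1
shares <- prop.table(oneWayTable)
shares[["C"]]         # 0.5
```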
Language distribution
> languages <- table(reposLanguages$language)
> head(languages)

             ActionScript     Bluespec            C
          27            1            1            9
          C#          C++
          11           20

> languages <- sort(languages, decreasing=TRUE)
> head(languages)

      Python   JavaScript         Ruby          PHP
          36           35           35           34
                     Java
          27           22
Recognised languages
reposLanguages <- reposLanguages[reposLanguages$language != "",]

languages <- table(reposLanguages$language)
languages <- sort(languages, decreasing=TRUE)
Language names
> languagesNames <- names(languages)
> languagesNames
 [1] "Python"       "JavaScript"   "Ruby"
 [4] "PHP"          "Java"         "C++"
 [7] "CSS"          "C#"           "C"
[10] "Go"           "Shell"        "CoffeeScript"
[13] "Objective-C"  "Puppet"       "Scala"
[16] "Lua"          "Rust"         "Clojure"
[19] "Emacs Lisp"   "Haskell"      "Julia"
[22] "Makefile"     "Perl"         "VimL"
[25] "ActionScript" "Bluespec"     "DM"
[28] "Elixir"       "F#"           "Haxe"
[31] "Matlab"       "Swift"        "TeX"
[34] ""
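The empty name at position 34 is what one would expect if the language column is a factor (the default for strings in data.frame before R 4.0): filtering rows keeps unused factor levels, and table still lists them with a count of 0. A sketch with a made-up factor; recognised is an illustrative name, and droplevels is one possible way to remove the unused level.

```r
language <- factor(c("R", "", "Python", "R", ""))

# Filtering removes the values but keeps "" as a level
recognised <- language[language != ""]
names(table(recognised))               # "" "Python" "R"

# droplevels removes levels that no longer occur
names(table(droplevels(recognised)))   # "Python" "R"
```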
Plotting
languages2Display <- languages[languages > 5]
barplot(languages2Display)
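With many bars the language names can overlap on the axis. A sketch on a made-up languages table that rotates the labels with las = 2 and adds an axis title; the counts, the threshold of 1 and the label text are assumptions for illustration, and pdf(NULL) only keeps the example from writing a file.

```r
# Made-up counts standing in for the real languages table
languages <- sort(
  table(c("R", "R", "R", "Python", "Python", "Ruby")),
  decreasing = TRUE)

languages2Display <- languages[languages > 1]

pdf(NULL)   # draw to a device that discards its output
# las = 2 draws axis labels perpendicular to the axis
mids <- barplot(languages2Display, las = 2, ylab = "Repositories")
invisible(dev.off())

names(languages2Display)   # "R" "Python"
```

barplot returns the bar midpoints (mids here), which can be used to place additional labels.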
Summary
• GitHub Archive
• Introduction to R
• Data types
• Filtering
• I/O
• Applying operations
• Missing values & duplicates
• Binding data
• Distribution tables
• Plotting (barplot)
Thank you
[email protected]
@BasiaFusinska
barbarafusinska.com
https://github.com/BasiaFusinska/RTalk
Questions?