Exploring Language Communities on GitHub
Antigoni M. Founta
Introduction
This study explores underlying patterns and detects communities among the
programming languages used by GitHub users, via network analysis.
Two graphs are derived from the whole dataset and two are location-specific,
in order to study both the general audience of GitHub and the trends in
some sample locations.
Goal: Understand how languages are grouped in practice, based on the way
developers use them, and discover trends either worldwide or in specific
locations.
Nodes → Languages
Edges → Language co-occurrence in User Profiles (based on the user repositories)
GitHub
● GitHub is a web-based Git repository hosting service
● It offers distributed revision control and source code
management (SCM)
● It is the largest host of source code in the world! [1]
Why GitHub?
“The introduction of social features in a code hosting site has drawn
particular attention from researchers while the integrated social
features, and the availability of metadata through an accessible api
have made GitHub very attractive for software engineering
researchers” [3]
Top Image Source: https://goo.gl/CWBMqb
Bottom Image Source: https://github.com/logos
Pros
● Developers will get a hint of which languages are used jointly, and thus perhaps serve the same purpose.
● Language creators will get a hint of what their audience prefers and trusts.
● Language communities might actually be another way to explore developer communities.
Challenges
● Programming language categorization ambiguity
● GitHub bias towards web development
● Locations and users follow a power-law distribution: there are numerous developers from a few locations (such as California, London, etc.) and a significant number of locations with few users
Fundamentals
Dataset Features
➔ ID, Username, Location, Followers, Public Repos, Languages & Bytes of code
Network Structure
➔ Nodes: Languages
◆ Attribute: Total Bytes of Code
➔ Edges: Pairs of Languages that co-occurred in at least one user profile
◆ Weight: Number of users that use both languages
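The network structure above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the code behind the slides; the function name `build_language_graph` and the sample profiles are made up for the example.

```python
from collections import Counter
from itertools import combinations


def build_language_graph(profiles):
    """Build node and edge tables from user profiles.

    profiles: iterable of dicts mapping language -> bytes of code for one user.
    Returns (node_bytes, edge_users):
      node_bytes[lang]   = total bytes of code (node attribute)
      edge_users[(a, b)] = number of users using both a and b (edge weight)
    """
    node_bytes = Counter()
    edge_users = Counter()
    for langs in profiles:
        for lang, n_bytes in langs.items():
            node_bytes[lang] += n_bytes
        # Sort so every pair is stored under one canonical key (undirected edge).
        for a, b in combinations(sorted(langs), 2):
            edge_users[(a, b)] += 1
    return node_bytes, edge_users


# Hypothetical sample data, for illustration only.
profiles = [
    {"JavaScript": 1200, "CSS": 300},
    {"JavaScript": 800, "CSS": 100, "Python": 500},
    {"C": 9000},
]
nodes, edges = build_language_graph(profiles)
top_pair = edges.most_common(1)  # pair of languages shared by the most users
```

A user with a single language adds bytes to that node but no edges, so such a language can appear in the graph as an isolated node.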
Data Challenges
➔ Only public repositories are accessible (users mainly work on private ones!)
➔ Languages are added by the user (empty, not real, not written in the same way)
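One way to handle empty or inconsistently written language labels is a small normalization step before building the graph. The sketch below is an assumption about how this could be done, not the cleaning actually applied in the study; the `ALIASES` table is hypothetical, and labels that are "not real" would still need a separate whitelist.

```python
# Hypothetical alias table: lower-cased spelling -> canonical language name.
ALIASES = {
    "js": "JavaScript",
    "javascript": "JavaScript",
    "cpp": "C++",
    "c++": "C++",
}


def normalize_language(label):
    """Return a canonical language name, or None for empty labels."""
    if label is None:
        return None
    cleaned = label.strip()
    if not cleaned:
        return None
    # Unknown labels are kept as written (only surrounding whitespace is removed).
    return ALIASES.get(cleaned.lower(), cleaned)


raw_labels = [" JS ", "javascript", "", "Rust", None]
clean_labels = [name for name in map(normalize_language, raw_labels) if name]
```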
PyGithub [2]
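A rough sketch of how the dataset features could be gathered for one user with PyGithub follows. It is an illustration under assumptions, not the collection script used for the slides: `collect_profile` is a made-up helper, and it only relies on the PyGithub calls `get_repos()` and `get_languages()` plus the user attributes `id`, `login`, `location`, `followers` and `public_repos`.

```python
from collections import Counter


def collect_profile(user):
    """Summarize one GitHub user as a row of the dataset.

    `user` is expected to behave like a PyGithub NamedUser: it exposes
    get_repos(), and each repository exposes get_languages(), which
    returns a dict of language -> bytes of code.
    """
    languages = Counter()
    for repo in user.get_repos():  # public repositories only
        for lang, n_bytes in repo.get_languages().items():
            languages[lang] += n_bytes
    return {
        "id": user.id,
        "username": user.login,
        "location": user.location,
        "followers": user.followers,
        "public_repos": user.public_repos,
        "languages": dict(languages),
    }


# Usage with a real client (requires network access and an API token;
# the token string and the login are placeholders):
#   from github import Github
#   client = Github("<personal-access-token>")
#   row = collect_profile(client.get_user("<login>"))
```

Each repository costs an extra API request for its languages, so rate limits are likely to be the practical bottleneck when collecting many users.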
Final Datasets
❏ 4,000 users since GitHub's founding + 150,000 from 2012
❏ Filter: Keep only users with a location!
❏ Final: 2,300 users since GitHub's founding + 37,000 from 2012
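The location filter, and the per-location user counts needed to inspect the distribution, might look like the sketch below. The record layout (a dict with a `location` key) and the sample rows are assumptions made for the example.

```python
from collections import Counter


def users_with_location(users):
    """Keep only users whose location field is present and non-blank."""
    return [u for u in users if (u.get("location") or "").strip()]


def users_per_location(users):
    """Count users per location, ignoring users without one."""
    return Counter(u["location"].strip() for u in users_with_location(users))


# Hypothetical sample rows, for illustration only.
sample = [
    {"username": "a", "location": "California"},
    {"username": "b", "location": None},
    {"username": "c", "location": "  "},
    {"username": "d", "location": "Greece"},
    {"username": "e", "location": "California "},
]
kept = users_with_location(sample)
counts = users_per_location(sample)
```

Locations are free text, so in practice different spellings of the same place would have to be merged before comparing counts.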
Descriptive Statistics
Data Distribution
Methodology
Create graph (as described):
● Filters: Degree Range
● Layout: Force Atlas 2
● Node size: “Bytes of Code” Range
● Label size: Degree Range
Compute Modularity & get communities:
● Sometimes using edge weights, sometimes not
Visualize pairs of languages and the number of developers that use both
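To make the modularity score concrete, the sketch below evaluates the standard modularity Q of a given partition, with or without edge weights. It does not search for communities (that is left to the graph tool), and it is a hand-written illustration rather than the implementation used for the slides; the function name and the toy graph are assumptions.

```python
from collections import defaultdict


def modularity(edges, community_of, use_weights=True):
    """Modularity Q of a partition of an undirected graph without self-loops.

    edges:        dict {(u, v): weight}, one entry per undirected edge
    community_of: dict node -> community label
    Q = sum over communities c of  W_c / m - (S_c / (2 m))^2
    where m is the total edge weight, W_c the weight of edges inside c,
    and S_c the summed (weighted) degree of the nodes in c.
    """
    total = 0.0
    inside = defaultdict(float)
    strength = defaultdict(float)
    for (u, v), w in edges.items():
        w = float(w) if use_weights else 1.0
        total += w
        strength[community_of[u]] += w
        strength[community_of[v]] += w
        if community_of[u] == community_of[v]:
            inside[community_of[u]] += w
    if total == 0:
        return 0.0
    return sum(
        inside[c] / total - (strength[c] / (2 * total)) ** 2 for c in strength
    )


# Toy graph: two disconnected pairs, each placed in its own community.
toy_edges = {("CSS", "JavaScript"): 3, ("C", "C++"): 1}
toy_partition = {"CSS": "web", "JavaScript": "web", "C": "desktop", "C++": "desktop"}
q_weighted = modularity(toy_edges, toy_partition)
q_unweighted = modularity(toy_edges, toy_partition, use_weights=False)
```

On this toy graph the weighted and unweighted scores differ, which illustrates why the choice of using edge weights or not can change the reported value.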
Results: All Data - All Languages
User-based Language Graph
Language Co-occurrences on User Profiles &
Top Languages based on Bytes of Code written
Results: All Data - Top Languages
User-based Language Graph
Language Co-occurrences
on User Profiles
#Top languages had minor differences, and thus are not reported
Results: California - Top 3 Languages
User-based Language Graph
Language Co-occurrences on User Profiles &
Top Languages based on Bytes of Code written
Results: Greece - Top 3 Languages
User-based Language Graph
Language Co-occurrences on User Profiles &
Top Languages based on Bytes of Code written
Repo-based Language Graph
Communities (modularity: 0.23)
Blue: Web-oriented
Pink: Desktop-oriented
Yellow: Other
Conclusions
Language-Oriented
➔ “Web-oriented” is the most robust category of languages used on GitHub
➔ “JavaScript - CSS” is the leading pair of languages, always outnumbering all other pairs
➔ Even though JavaScript almost always dominates the pairs of languages, C is always the
most used one in terms of Bytes of Code [perhaps C users are not language-extroverts…]
Scheme-Oriented
➔ With a user-based scheme we can understand the general preferences of developers and the
patterns between languages. [difficult when dataset is big!]
➔ With a repo-based scheme we can understand hidden (or at least not widely known)
patterns of languages that are used for the same purposes.
➔ General purpose: repo-based scheme
Location purpose: user-based scheme
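The difference between the two schemes can be sketched as follows. The slides do not spell out how the repo-based graph is built, so this assumes that a repo-based edge means two languages appearing in the same repository, while a user-based edge means two languages appearing anywhere in the same user's profile. All names and data are illustrative.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(groups):
    """Count, for each language pair, the number of groups containing both."""
    counts = Counter()
    for langs in groups:
        for pair in combinations(sorted(set(langs)), 2):
            counts[pair] += 1
    return counts


# Hypothetical data: user -> list of repositories -> languages in each one.
users = {
    "alice": [["JavaScript", "CSS"], ["Python"]],
    "bob": [["C", "Makefile"]],
}

# Repo-based (assumed definition): one group per repository.
repo_edges = cooccurrence(repo for repos in users.values() for repo in repos)

# User-based: one group per user, pooling all of that user's repositories.
user_edges = cooccurrence(
    [lang for repo in repos for lang in repo] for repos in users.values()
)
```

Under this assumption the user-based graph links Python to CSS and JavaScript because one person uses all three, while the repo-based graph keeps them apart because they never share a repository.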
Future Work
● More Data!
● More Locations and Comparisons
● Language Graphs based on Top/Most influential Users [using followers or stars]
● Association Rules on Languages for community detection
● User Graph to detect user communities per Location (e.g. web developers, game
developers) and compare with Language Graph of Location
References
1. GitHub on Wikipedia: https://en.wikipedia.org/wiki/GitHub
2. PyGithub Library: https://github.com/PyGithub/PyGithub
3. Kalliamvakou, Eirini, et al. "The promises and perils of mining GitHub." Proceedings of the 11th Working Conference on Mining Software Repositories. ACM, 2014.
4. Thung, Ferdian, et al. "Network structure of social coding in GitHub." 2013 17th European Conference on Software Maintenance and Reengineering (CSMR). IEEE, 2013.
5. Takhteyev, Yuri, and Andrew Hilts. "Investigating the geography of open source software through GitHub." (2010).
6. Figueira Filho, Fernando, et al. "A study on the geographical distribution of Brazil’s prestigious software developers." Journal of Internet Services and Applications 6.1 (2015): 1.
Image Source: http://wifflegif.com/tags/58347-octocat-gifs
Thank you for your attention! Any questions?
Image Source: https://octodex.github.com/images/heisencat.png