
Mapping the landscape of deep learning models use in the wild

Xing Yu (u6034476)

A report submitted for the course COMP4560 Individual Computing Project
Supervised by: Dr. Ben Swift

The Australian National University

October 2019

© Xing Yu (u6034476) 2019

Except where otherwise indicated, this report is my own original work.

Xing Yu (u6034476)
23 October 2019

This report is dedicated to my supervisor, my parents, and the people who support me.

For your kindness and devotion, and for your endless help and care; your selflessness will always be remembered.

Acknowledgments

• Foremost, I would like to express my greatest gratitude to my supervisor, Dr. Ben Swift. Over the years you have witnessed almost every step of my growth, from an ignorant second year to final year, from Beijing to Canberra. Without your support, forgiveness and encouragement I could not have made progress and grown.

You are just the way I remember you from my first semester of second year: I can remember when you sang Taylor Swift's song, and it is still vivid in my memory.

• I would like to extend my thanks to Prof. Weifa Liang, who offered me the permission code to enrol in this course and explained the study contract to me in detail.

• I shall also extend my thanks to ANU CECS: thank you for believing in me and carrying me on the waves to lands I had never seen. Without you I could not have seen snow in Montreal, blossoms in Singapore, forests in Beijing and, last but not least, the spectacular views in Canberra; every place leaves so many extraordinary memories and new dreams for me.

• I also give thanks to every person I met during my four-year undergraduate life, and to those sleepless, hard times; they shaped me into the person I am today. Every memory at ANU is the most precious commodity in my life.


Abstract

Deep learning, as a subfield of machine learning, has rapidly become a popular research area. However, little empirical work has previously been done to analyze deep learning model usage in public. GitHub is one of the largest web-based source code hosting communities, and it may be the best place to measure the popularity of deep learning models directly.

In this project, a tool called STAMPER is proposed and developed to aid researchers in the deep learning field in studying past trends in GitHub. All of the visualizations display repository information from GitHub at a high level. Our tool shows the evolution of deep learning models over time; in particular, we study the impact of some external features on deep learning models' popularity. We end with a summary of the current state of the art in deep learning model repository analysis and a crucial discussion of challenges and directions for future research.

Keywords: Software Engineering, Deep Learning, Popularity, Data Visualization


List of Abbreviations

• ML: Machine Learning

• DL: Deep Learning

• CNN: Convolutional Neural Network

• LSTM: Long Short-Term Memory

• NLP: Natural Language Processing

• Bert: Bidirectional Encoder Representations from Transformers

• NCF: Neural Collaborative Filtering

• ResNet: Residual Network

• Wide & Deep: Wide and Deep Learning


Contents

Acknowledgments vii

Abstract ix

List of Abbreviations xi

1 Introduction 1
1.1 Trace Deep Learning use through GitHub 1
1.2 Contribution 2
1.3 Report Outline 2

2 Background and Related Work 3
2.1 Background 3

2.1.1 Deep learning 3
2.1.1.1 TensorFlow 4
2.1.1.2 PyTorch 4

2.1.2 Deep learning models 5
2.1.3 Summarized Timeline 7

2.2 Public Code Repositories 8
2.2.1 Web-based hosting service 8
2.2.2 Measuring Popularity From GitHub 8
2.2.3 Extracting Messy Data in the Wild 9
2.2.4 Visualizing data in Repositories 9

2.3 Summary 10

3 STAMPER Design and Implementation 11
3.1 Overview 11
3.2 Data Collection 12
3.3 Repository Search 13
3.4 Data Selection 14

Example 15
3.5 Construct the Visualizations 16
3.6 Summary 18

4 STAMPER in Action 19
4.1 Popularity of Deep Learning Models in GitHub 19

4.1.1 Popularity Feature Selection 19
4.1.2 Past and Current Status: A Full Integration 23


4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models 26

4.1.4 RQ2: How does popularity vary per model? 29
4.1.5 RQ3: Does the popularity of models relate to other features? 30

4.2 Contribution of Deep Learning Models in GitHub 34
4.2.1 Collaborative Contribution 34
4.2.2 RQ1: After forking, do developers change the codebase? 36

4.3 Maintenance of Deep Learning Models in GitHub 39
4.3.1 RQ1: How long has it been in existence? 39
4.3.2 RQ2: Do old models have more issues compared to new models? 41
4.3.3 RQ3: Are they well maintained? 42

4.4 Summary 42

5 Discussion And Future Work 45
5.1 Discussion 45

5.1.1 Data in the wild: Limitation and Improvement 45
5.1.2 Extensibility and Open-Source Software 45

5.2 Future Work 46
5.2.1 Social Network Analysis in GitHub 46
5.2.2 Trend Detection using Commits' Timestamps 46

6 Conclusion 47

7 Appendix 49
7.1 Appendix 1: Project Description 49

7.1.1 Project Title 49
7.1.2 Supervisors 49
7.1.3 Project Description 49
7.1.4 Learning Objectives 49

7.2 Appendix 2: Study Contract 49
7.3 Appendix 3: Artefact Description 52

7.3.1 Code Files Submitted 52
7.3.2 Program Testing 52
7.3.3 Experiment 52

Hardware 52
Software 52
Other 53
Datasets 53

7.4 Appendix 4: README 54

List of Figures

2.1 git2net [Gote et al. 2019] 10

3.1 Overview of STAMPER 11
3.2 Data Selection 14
3.3 Store in Local Disk 14
3.4 Overall: Construct the Visualizations 16
3.5 Examine Uniqueness after Forking 18

4.2 Star Sort Menu [Git a] 20
4.1 Repository Watching [Git b] 20
4.3 Popularity Metric 21
4.4 Repositories with Forks 24
4.5 Repositories without Forks 24
4.6 Repository Trend in GitHub For Each Model 25
4.7 Creation Time vs Stars 26
4.8 Number of Forks Related to Repositories in Deep Learning Model Development 28
4.9 Star vs Contributors 30
4.10 Star vs Development Time 31
4.11 Star vs Open Issues 31
4.12 Star vs Entropy Value 32
4.13 Collaboration Entropy 35
4.14 Percentage of Forked Repositories Unique From Origin (Boxplot) 36
4.15 Repository Uniqueness Distribution (%) 37
4.16 Repository Change Statistic 38
4.17 Development Time Boxplot 40
4.18 Development Time vs Number of Open Issues 41
4.19 Open Issues vs Number of Repositories 43


List of Tables

2.1 Deep Learning History 4
2.2 Timeline 7

3.1 Repositories Related to TensorFlow 17

4.1 Popularity metric for repositories 21
4.2 Stars Comparison 29
4.3 Forks Comparison 29
4.4 Percentage of one-contributor development for DL-related repositories 32
4.5 Sample Contributions to One Repository 34
4.6 Repository Development Time Statistics 40
4.7 Repository Open Issue Statistics 41
4.8 Descriptive statistics on percentage of Wiki Existence 42


Chapter 1

Introduction

1.1 Trace Deep Learning use through GitHub

GitHub is one of the largest web-based hosting communities in the world, containing a rich source of data that facilitates different software engineering projects. This massive database allows developers and researchers to publish their work.

GitHub can also be used for popularity measurement. Developers can construct their social networks on GitHub in multiple ways, such as joining an organisation or starring a repository they are interested in. These features make repositories on GitHub easily accessible, and make it an excellent place to conduct empirical studies.

The deep learning research community is continually evolving. Not limited to academic research, deep learning is allowing businesses to use data to teach computers how to learn. Therefore, a number of machine learning and deep learning startups have come into existence worldwide.

The development of deep learning models sometimes suffers from software engineering problems. Studies on the quality of deep-learning-related projects are sparse, and few researchers focus on usage outside academia. With the expanding range of deep learning applications and their deepening degree of use, we would like to test whether developers keep up with the latest deep learning upsurge in academia.

A tool to extract the metadata of historical repositories, with integrated visualisation, based on the vast corpus of GitHub repositories is not currently available. To fill this gap, we present our tool, STAMPER. It is implemented in Python and supports GitHub repository metadata extraction based on repository keyword search and code segment search using the GitHub API. It can also analyse the modification of forked repositories from the original repository and capture the repository difference.

We study the historical trend of deep learning models and frameworks from repositories in GitHub. We further demonstrate how repository metadata can be used to observe the deep learning trend. Our project provides a novel method to study deep learning frameworks and models from a historical perspective; in the meanwhile, our work opens a new avenue for empirical study in deep learning.


1.2 Contribution

• We introduce STAMPER, a Python tool that can be used to extract metadata characterizing historical open-source projects from GitHub, based on researchers' interests.

• Utilizing STAMPER in a case study on analysing the usage of deep learning models in TensorFlow, we further demonstrate how we extract a rich set of features and establish connections between those features and popularity.

• We create and mine a new dataset for further research use.

1.3 Report Outline

The rest of this report is structured as follows. Chapter 2 provides the timeline for deep learning frameworks and deep learning models. In that chapter, some background knowledge is presented, and previous work on software mining tools and GitHub visualizations is recorded as well.

Chapter 3 provides an overview of our work and introduces our proposed methodology to extract the metadata from the GitHub repository.

Chapter 4 presents a case study in which we use our tool to extract deep-learning-related repositories from GitHub and trace the landscape of those hot deep learning models. The visualizations generated using our tool can aid researchers in understanding past and current trends easily.

Referring back to the critical questions posed, we conclude our project and highlight the next steps in Chapter 5.

Chapter 2

Background and Related Work

Software repository mining enables researchers to study the historical trends of software engineering practice effectively. The use of repository mining is based on the use of web hosting services. Multiple approaches exist to conduct such studies; in the first section we will introduce some background knowledge on web-based hosting services. Then we will introduce some popular deep learning frameworks in Section 2.1.1. Finally, we will detail some previous works in Section 2.2, which conduct software repository mining and identify the main patterns in exploring repository popularity.

2.1 Background

2.1.1 Deep learning

Different from traditional machine learning, deep learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction [LeCun et al. 2015]. Methods in deep learning can dramatically improve humans' lives in multiple aspects, from image classification and speech recognition to machine translation, and even autonomous cars.

The usage of deep learning has grown tremendously in the past few years with the introduction and improvement of hardware availability (GPUs), big data, and cloud providers such as Amazon Web Services (AWS). Not limited to this, large companies create their own research teams to develop deep learning algorithms and integrate them into frameworks such as PyTorch and TensorFlow. In the meanwhile, they share their datasets and model training tutorials, helping startup companies to build state-of-the-art products with minimal effort and time.

Though deep learning has become a hot topic in recent years, the history behind deep learning has been evolving since the 1950s, as summarised in Table 2.1 [Subramanian 2018].

Previously, developers and researchers required expertise in C++ or CUDA to implement deep learning models and algorithms. However, thanks to large technology companies and organisations, people with knowledge of a scripting language (Python) can now also build their own deep learning algorithms.


Technique                      Year
Neural network                 1943
Backpropagation                1960s
Convolutional Neural Network   1979
Recurrent neural network       1980
Long Short-Term Memory         1997

Table 2.1: Deep Learning History

In Sections 2.1.1.1 and 2.1.1.2 we will discuss some state-of-the-art deep learning frameworks used today, both in academia and industry.

2.1.1.1 TensorFlow

TensorFlow is a machine learning system that operates at large scale [Abadi et al. 2016]. Initially developed by the Google Brain team under the name DistBelief, it had its first release in November 2015. TensorFlow then released its official 1.0.0 version on 11 February 2017, introducing the ability to run on multiple CPUs and GPUs.

Currently, Google uses this framework in numerous ways to improve its search engine, translation, recommendation system, and image recognition and captioning. The TensorFlow library contains different APIs to build deep learning models at scale, such as ResNet, CNN, or LSTM. The architecture of this framework gives developers the flexibility to experiment with new optimisations and Google's training algorithms. It currently supports numerous applications from front end to mobile; this flexibility and scalability make it one of the most popular deep learning frameworks in the world.

2.1.1.2 PyTorch

Based on the Torch library, PyTorch can be seen as a Python front end to the Torch engine, providing the capability to define mathematical functions and compute gradients. Similar to TensorFlow, PyTorch has good GPU and CPU support and is primarily developed by Facebook's artificial intelligence research group. The official initial release of PyTorch was in October 2016; its dynamic computation capability gives it greater flexibility in building complex architectures.

However, unlike TensorFlow, which can be seamlessly integrated into real industrial applications, PyTorch was primarily developed by researchers and scientists, and is not easily used or recommended for production in certain scenarios.


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and pick up the speed of project development to survive in this keen competition. Winning trust from the public with high-quality service is thus required.

Initially, we would have liked to conduct our research on the latest model stores, such as AWS SageMaker, Azure Machine Learning service, Wolfram Neural Net Repository, and ONNX. However, the algorithms and datasets used there are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand with the public. To avoid those problems and gain a more in-depth insight into usage in society, we choose a framework with models of greater transparency and substantial usage: TensorFlow.

API Referenced

Since this project involves a range of deep learning models, we begin by developing an understanding of how to build and train neural networks. The construction of networks may rely on the layers API provided by Keras and TensorFlow. In the meanwhile, TensorFlow recently introduced estimator APIs to simplify the procedures of training, evaluation, prediction, and export.

Convolutional Neural Network (CNN)
Convolutional Neural Network is one of the most established algorithms among all the deep learning models, and one of the most dominant algorithms in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons in a grid pattern. Typically, a CNN consists of three types of layers: fully connected layers, convolution layers, and pooling layers. Convolution and pooling layers conduct the feature extraction, and the fully connected layer maps the extracted features into the final output. Layers are interconnected, and thus the extracted features are transferred layer by layer.

Long Short-Term Memory (LSTM)
Different from traditional neural networks, which cannot memorise previous data, Long Short-Term Memory (LSTM), a special kind of recurrent neural network, provides researchers with an effective way to perform persistent data tasks and is capable of learning long-term dependencies [Hochreiter and Schmidhuber 1997]. (One popular variant, the Gated Recurrent Unit, combines the forget and input gates into a single update gate.) LSTM is capable of learning dependencies from historical data and making predictions from the information remembered previously. Inside the LSTM, instead of using a single linear layer, there is a small network which performs its function independently.

(Footnote: TensorFlow official models are chosen in our project. The TensorFlow official models contain a collection of deep learning models built with their high-level APIs; see https://github.com/tensorflow/models/tree/master/official.)

LSTM is one of the most common uses of recurrent neural networks. This model is generally used with sequential data and can solve language modelling problems such as NLP concepts (word embeddings, encoders).

Residual Network (ResNet)
One of the problems deep learning models face is that, as the number of layers increases past a certain point, accuracy stops improving. This problem motivated modern network architectures such as the residual network (ResNet) and Inception, which address it with residual connections.

ResNet normally solves the problem described above by fitting a residual mapping via a shortcut connection. Each ResNet block contains a series of layers and a shortcut connection component.
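The shortcut-connection idea can be sketched in a few lines: a residual block computes the stacked layers' output F(x) and adds the input x back before the activation. The following is a minimal illustration on plain Python lists (the function names are ours, not the report's code):

```python
# Illustrative sketch of a residual (shortcut) connection.
# `inner` stands in for the block's stacked layers F(x).

def relu(v):
    return [max(0.0, x) for x in v]

def residual_block(x, inner):
    fx = inner(x)                          # F(x): output of the block's layers
    out = [a + b for a, b in zip(fx, x)]   # shortcut: add the input back
    return relu(out)

# Example with simple scaling layers, so the result is easy to check by hand.
y = residual_block([1.0, -2.0, 3.0], inner=lambda v: [0.5 * a for a in v])
print(y)  # [1.5, 0.0, 4.5]
```

Because the block only has to learn the residual F(x) = H(x) − x rather than the full mapping H(x), very deep stacks of such blocks remain trainable.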

Bidirectional Encoder Representations from Transformers (Bert)
Bert is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al. 2018].

It was first released in google-research/bert on GitHub on 1 November 2018, with the ability to handle a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications; it aims to predict the relationship between sentences by analysing the whole sentence holistically [Devlin et al. 2018].

Attention Is All You Need (Transformer)
Most problems in deep learning can be seen as a form of sequence-to-sequence mapping, and can be solved using a common type of architecture: the encoder-decoder architecture.

The encoder and decoder both consist of stacks of identical layers, and each layer consists of two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network [Vaswani et al. 2017].

Neural Collaborative Filtering (NCF)
Neural Collaborative Filtering (NCF) is a neural network architecture which utilises the non-linearity of neural networks to build recommendation systems [He et al. 2017]. It demonstrates that matrix factorisation can be interpreted as a special case of neural collaborative filtering. To add additional non-linearity, this model introduces a multi-layer perceptron module alongside the generalised matrix factorisation (GMF) layer.

Wide and Deep Learning
Since linear models are not great at generalising across unique features, deep models are introduced to solve this problem. Deep models can use embedding vectors for every query, and this technique can then generalise by coupling the items and queries. To overcome over-generalisation, the Google research team introduced the Wide & Deep Learning model, which jointly trains wide linear models and deep neural networks to combine the benefits of memorisation and generalisation for recommender systems [Cheng et al. 2016].

2.1.3 Summarized Timeline

Model Name     Raised Time
CNN            1980s
LSTM           1997
ResNet         2015
Wide & Deep    2016
NCF            2017
Transformer    2017
Bert           2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest open-source distributed version control system (DVCS) hosts [Gousios et al. 2014]. This distributed version control system enables contributors to submit a set of changes and integrate them into the main development branch. The use of Git is based on pragmatic needs; its advantages combine version control with collaborative development.

GitHub can also lead to insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project. Thus, the number of stars can reveal popularity from a software development research perspective; it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al. [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth using time series metadata derived from 2,279 accessible GitHub repositories. In the meanwhile, they find that slow growth is more common in overpopulated application domains and for old repositories. Moreover, they conclude that the most common domains on GitHub are web libraries and frameworks, and non-web libraries and frameworks.

Building on their work, we would like to examine whether deep learning framework repositories follow a similar popularity trend pattern. At the same time, we would also like to study whether a relationship exists between three factors: forks, stars, and watchers.

Predicting the Popularity of GitHub Repositories
In the same year, Borges et al. [2016a] published another paper on predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regressions to predict the number of stars of GitHub repositories, so that project owners can see how their projects are performing in the open-source community.

Their studies examine the correlation between the popularity of a repository and the number of stars it owns. In the meanwhile, this study reports a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent
GHTorrent [Gousios and Spinellis 2012] is designed to conduct independent repository mining, distributing its data through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all the repositories, and the resulting dataset can be used towards answering multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving raw data based on the results returned from the REST API. However, their tool does not have the ability to visualise the metadata or offer trend analysis at a high level.

MetricMiner
A similar tool is MetricMiner [Sokol et al. 2013]. It is a web application that supports researchers in mining software repositories, doing data extraction and statistical inference from the collected data. This tool automatically clones the repository, processes the metadata, and stores the data in the cloud, giving it good scalability and fast computational speed in query answering, without users installing any software on their localhost.

GitcProc
GitcProc [Casalnuovo et al. 2017] is a tool that uses regular expressions to extract the changed lines within a repository, to facilitate answering project evolution questions. GitcProc can retrieve and summarise global statistics of project metrics, including commits, commit dates, and contributors. This tool can measure how many changes have taken place in Java projects and is also able to locate the changed files.

RepoVis
RepoVis [Feiner and Andrews 2018] is a new tool which provides visual overviews of software maintained in Git repositories. RepoVis is a client-server web application providing a full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its searchable functionality to GitHub, combined with a code-based search. All the visualizations are written out in SVG format.

2.2.4 Visualizing data in Repositories

Chronos
CHRONOS [Servant and Jones 2013] is a software tool that enables the visualisation of historical changes inside software source code. This tool implements a zoomable user interface as a visualisation of the actual code, supporting developers from a high-level view (pattern recognition) to a low-level view (reverse engineering), using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise the complete history of changes, including the revisions that modified them. Inspired by this tool, our project uses this visualisation method to track the historical change of the popularity


Figure 2.1: git2net [Gote et al. 2019]

trend related to the keyword specified by users in GitHub.

GEVOL
Collberg et al. [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph drawing technique, to deduce a better understanding of a program from its development history, displaying all the visualisations with a temporal graph visualizer.

This system aids in the discovery of the structure of the system and provides the user with a new way to discover the evolution of a program by visualising changes to the system. It extracts information about Java programs stored within a CVS version control system into three types of graphs: inheritance, control-flow, and call graphs.

git2net
git2net [Gote et al. 2019] is a software tool that facilitates the extraction of co-editing networks in Git repositories. Similar to GEVOL, it uses text mining techniques to analyse the history of modifications within files. Furthermore, the authors address the importance of studying social networks in GitHub and give the reader a broader view of the application of graph-based data analysis and modelling. Their tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter we detailed the web-based hosting service we selected (GitHub) and presented the concept of deep learning, with two popular frameworks, together with state-of-the-art neural network models. In the next chapter we will elaborate on how we design and implement STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter we will outline our design and implementation for data extraction, and then we will detail the metrics we use to estimate the trend of deep learning framework and model usage in GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

[Figure 3.1 outlines the STAMPER pipeline: (1) Data Collection of repository metadata for model-name keywords via the GitHub project search API; (2) Repository Search; and (3) an optional Data Selection step via the GitHub code search API. Results are stored locally and passed on to data visualisation.]

Figure 3.1: Overview of STAMPER


Data Collection
We first collect all the repository metrics through the GitHub API. This step allows us to extract the history of all the repositories related to the keyword and record the metadata for each repository (repository-based search).

Repository Search
As each repository contains multiple contributors and forks, we subsequently extract and analyze whether changes were made or not, based on the size information, and calculate a collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the forked information to create visual representations.
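One plausible formulation of this collaborative factor is the Shannon entropy of each contributor's share of the repository's contributions. The report does not give the exact formula, so the function below is an illustrative sketch, not STAMPER's actual code:

```python
import math

# Sketch: collaborative factor as Shannon entropy over contributors' shares
# of a repository's contributions. A single dominant author yields low
# entropy; evenly split work maximises it.
def collaboration_entropy(contributions):
    total = sum(contributions)
    shares = [c / total for c in contributions if c > 0]
    return -sum(p * math.log2(p) for p in shares) + 0.0  # +0.0 avoids -0.0

print(collaboration_entropy([40]))      # single author: 0.0
print(collaboration_entropy([20, 20]))  # two equal authors: 1.0
```

Under this formulation, repositories developed by one contributor (a case the report later quantifies in Table 4.4) would score exactly zero.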

Data Selection
We implemented a selector that allows the user to exclude specific repositories not related to the desired set. The selector summarizes the frequency counts for keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis
Since each forked repository may involve re-development and modification, modification analysis of forked repositories is included in our project. For all changed repositories, the size difference is analyzed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to deeply analyze and manipulate the data and even run statistical tests on the data set. To better understand those metrics, we divided them into multiple categories. For the attributes that are not primary data from the GitHub API, we explain them in the data expansion part and label them as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication, the user can make up to 5,000 requests per hour; otherwise, the rate limits only allow up to 60 requests per hour [Git d].
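An authenticated call differs from an anonymous one only in the Authorization header. The sketch below shows the header construction for GitHub API v3; the token and the example query are placeholders, and the report does not show STAMPER's actual request code:

```python
# Sketch of an authenticated GitHub API v3 request header, as used to lift
# the rate limit from 60 to 5,000 requests per hour. The token is a placeholder.
def github_headers(oauth_token=None):
    headers = {"Accept": "application/vnd.github.v3+json"}
    if oauth_token:  # anonymous requests omit this and get the 60/hour limit
        headers["Authorization"] = f"token {oauth_token}"
    return headers

# e.g. requests.get("https://api.github.com/search/repositories",
#                   params={"q": "tensorflow"}, headers=github_headers("<TOKEN>"))
print(github_headers("<TOKEN>"))
```

The remaining quota for the current hour can be checked against the X-RateLimit-Remaining response header, which is useful when crawling thousands of repositories.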


Type          Meta-data
Contributor   contribution (int) [Data Expansion]; login (user name, String); type (user/organization, String); contributors_url
Repository    created_at; description; full_name; language; size
Popularity    fork (Boolean); forks (int); forks_url; stargazers_count; watchers_count; unique_repos [Data Expansion]
Owner         id; login (username); type
Maintenance   has_issues (Boolean); has_wiki (Boolean); open_issues (int); pushed_at; updated_at; score

Table 3.2: Repository metadata collected through the GitHub API

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deepcrawling operations to collect raw data

• Contribution
One repository generally involves multiple developers conducting software development. The project owner is not necessarily the person who contributes the largest amount of code, and the amounts contributed by different developers are potentially unequal. As a result, we track that information by utilizing the GitHub API and record the number of contributions each developer made to each repository.

• Unique_repos
Popular repositories may have numerous forked repositories owned by other users, and the reasons behind forking vary. Our research explores whether forkers conduct subsequent development based on the original codebase. By comparing the size of each forked repository (Fi) with the original repository (O), we obtain all the forked repositories with a change of size (c):

Fi + c = O (3.1)
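Rearranging Equation 3.1 gives c = O − Fi, so any fork with a non-zero c has diverged in size from the original. A sketch of this check over repository metadata follows; the size field mirrors the GitHub API's `size` attribute, but the record layout and function name are our assumptions:

```python
# Sketch: flag forks whose size differs from the original repository,
# i.e. forks with a non-zero size change c = O - Fi (Equation 3.1).
def changed_forks(original_size, forks):
    changed = {}
    for fork in forks:                    # each fork: {"full_name": ..., "size": ...}
        c = original_size - fork["size"]  # size change relative to the original
        if c != 0:
            changed[fork["full_name"]] = c
    return changed

forks = [{"full_name": "alice/model", "size": 1200},
         {"full_name": "bob/model", "size": 1180}]
print(changed_forks(1200, forks))  # {'bob/model': 20}
```

Note that size is only a proxy: a fork whose edits happen to leave the size unchanged would be missed, which is one reason the report's later uniqueness analysis treats these counts as estimates.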

3.4 Data Selection

[Figure 3.2 shows the data selection flow: an entity (model) and its API keywords are searched within each repository to produce usage statistics.]

Figure 3.2: Data Selection

[Figure 3.3 shows how results are stored on local disk: unfiltered data, together with forked-repository timestamps, is filtered by model-related keywords (e.g. Bert, ResNet, CNN) and grouped by model via model.py.]

Figure 3.3: Store in Local Disk


Figure 3.2 represents our method to search API usage in DL model-related repositories. GitHub provides a REST API to examine whether a repository contains the targeted API or not.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. Moreover, the appearance frequency of the user-specified API is embedded directly within the returned result and can be identified in our program, corresponding to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. The approach also allows users to build, from a high-level perspective, a picture of API usage across GitHub repositories.

Meanwhile, we also provide multiple options for selecting APIs, stored in model_keyword.py.
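As a sketch of this selection step, GitHub's code-search endpoint returns a total_count for a keyword restricted to a single repository. This is a minimal illustration, not the actual STAMPER code; the endpoint and the total_count field are part of the documented search API, and a personal access token is assumed for realistic rate limits:

```python
import json
import urllib.parse
import urllib.request

def search_code_url(keyword, full_name):
    # GET /search/code?q=<keyword>+repo:<owner>/<repo>
    q = urllib.parse.quote(f"{keyword} repo:{full_name}")
    return f"https://api.github.com/search/code?q={q}"

def count_api_appearances(keyword, full_name, token=None):
    """Return GitHub's total_count of files in `full_name` matching `keyword`."""
    req = urllib.request.Request(search_code_url(keyword, full_name))
    if token:
        req.add_header("Authorization", f"token {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["total_count"]
```

Each (repository, keyword) count can then be written out alongside the repository's full name, matching the JSON output described above.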

Table 3.1 shows descriptive statistics on the number of models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras Applications library provides users the ability to load deep learning models and instantiate a model with default weights. Take ResNet as an example:

• With pre-defined models
A user may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, and this model could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they could all be used as sample keywords in STAMPER, such as:

keras.applications.resnet.ResNet50
keras.applications.resnet.ResNet101
keras.applications.resnet.ResNet152
keras.applications.resnet_v2.ResNet50V2
keras.applications.resnet_v2.ResNet101V2
keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to explore the components inside the ResNet class further, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, the data selection heuristic varies: deep learning users and experts can define their searches according to their own interests and preferences.

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from the three perspectives is illustrated in Figure 3.4. Meanwhile, Chapter 5 gives an example using the metadata of our collected deep learning model repositories.

[Figure 3.4: Overall Construction of the Visualizations (entities 1…n pass through a functional mapping into popularity-related, contribution-related, and maintenance-related visualisations)]

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped by week (with forks)

• Repository creation time vs. stars
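The weekly grouping used in the third view above can be sketched with pandas. The created_at column name follows the GitHub API; this is an illustrative sketch, not STAMPER's own code:

```python
import pandas as pd

def weekly_creations(df):
    """Accumulate repository creations into weekly bins
    from a 'created_at' column of ISO timestamps."""
    ts = pd.to_datetime(df["created_at"])
    return ts.dt.to_period("W").value_counts().sort_index()

# Illustrative data: three creation dates spanning two calendar weeks
demo = pd.DataFrame({"created_at": ["2019-01-01", "2019-01-02", "2019-01-10"]})
weekly = weekly_creations(demo)
```

A cumulative sum of the resulting series (weekly.cumsum()) gives the accumulated curves plotted in the popularity figures.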

Contribution

To further exploit the forking information, STAMPER also supports the comparison between an original repository and its forked repositories. The work could be further extended by visiting each forked repository's URL and tracing the commits.

As shown in Figure 3.5, an entity (E) searched in GitHub may have multiple related repositories (Ri), each with corresponding forked repositories (Fi). Among the forked repositories, we denote a changed forked repository by Ci.

To examine whether changes exist in forked repositories, and how this differs between entities, we calculate the difference using the equation


Keyword                     Total of Repository (including Forks) Collected    Total of Original Repository Collected
ResNet tensorflow           6129                                               339
Bert tensorflow             13734                                              106
CNN tensorflow              39765                                              1000
LSTM tensorflow             19572                                              1000
Transformer tensorflow      7188                                               145
Wide and deep tensorflow    324                                                39

Table 3.1: Repositories Related to Tensorflow


below. Uniqueness percentage distributions are composed of all the percentages pi, each corresponding to its original repository Ri:

pi = Σ Ci / Σ Fi    (3.2)
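Equation 3.2 can be read as the fraction of forks that differ from their origin. The following is a minimal sketch under that reading, treating each changed fork as contributing Ci = 1; the function and parameter names are ours, not STAMPER's:

```python
def uniqueness_percentage(fork_sizes, original_size):
    """p_i = (number of changed forks, sum of C_i) / (number of forks, sum of F_i).
    A fork counts as changed when its size differs from the original's."""
    forks = list(fork_sizes)
    if not forks:
        return 0.0
    changed = sum(1 for size in forks if size != original_size)
    return changed / len(forks)
```

Computing this per original repository Ri, and collecting the values across an entity, yields the distributions plotted later as boxplots and histograms.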

[Figure 3.5: Examine Uniqueness after Forking (an entity E maps to repositories 1…n, each with forked repositories 1…n marked as changed: Y/N)]

• Percentage of forked repositories unique from origin (boxplots)

• Uniqueness percentage distribution for each entity (histograms)

• Entropy distribution for each entity (histograms)

Maintenance

• Development time boxplot for each entity

• Open issues distribution for each entity

3.6 Summary

In this chapter we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analyzing software trends in the GitHub community. We also introduced and analyzed two novel features of GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving, and are built, trained, and deployed by researchers. Our tool is available for analyzing such changes. We collected the historical information stored in GitHub and extracted the metadata of each repository using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development has become a battlefield without the smoke of gunpowder. Researchers, companies, and developers are all trying to dominate the conversation in deep learning. A variety of models exist, yet there is no common bridge connecting those ideas together. Historical data in GitHub is obscure and hard to find, and as a result many developers, especially experienced ones, remain in their original zone. With our study, we hope to shed some light on deep learning and highlight a few suggestions for the public.

This section aims to answer questions related to both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, owing to the few studies on popularity in GitHub, there is no standardized feature for measuring it. We analyze some potential features of each repository and hypothesize that popularity is strongly related to the number of stars a repository owns.

This decision will be justified in the following section with more GitHub background.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching. However, watching does not imply being a collaborator [Git b]: a watcher could watch a repository to receive notifications for the new pull requests or issues that are created. Watcher counts can indicate how much interest the GitHub community gives to a repository.

[Figure 4.2: Star Sort Menu [Git a]]

[Figure 4.1: Repository Watching [Git b]]

• Stars
Starring a repository makes it easy for a user to keep track of a repository they are interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and GitHub accordingly has a ranking system based on the number of stars a repository has [Git c].

• Forks
Forks are created when a user would like to make a copy of a repository. The user can fork a repository to suggest changes, or to use it as the basis for a new project.

Based on the data we collected in the data collection stage, we extract the watchers_count, stargazers_count, and forks attributes and test whether these attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.
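Extracting the three candidate attributes from the collected metadata can be sketched as follows. The field names stargazers_count, forks_count, and watchers_count are GitHub API attributes; the file layout (a JSON list of repository objects) is an assumption about STAMPER's dump format:

```python
import json
import pandas as pd

POPULARITY_COLS = ["stargazers_count", "forks_count", "watchers_count"]

def popularity_frame(path):
    """Load a JSON metadata dump (a list of repository objects) and
    keep the candidate popularity attributes for correlation testing."""
    with open(path) as f:
        repos = json.load(f)
    return pd.DataFrame(repos)[POPULARITY_COLS]
```

The resulting frame feeds directly into the scatter plot and the Spearman tests below.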


[Figure 4.3: Popularity Metric]

star     forks_count    watchers_count    model name
17940    4661           17940             Bert
12405    3637           12405             Bert
5263     1056           5263              Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks, and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, and thus we should not assume normality and instead consider a rank-based measure.


Spearman Correlation Coefficient

Definition: the Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman, 2011]. This statistical method allows us to test for a rank-order relationship between two numerical or ordinal variables associated by a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: the variables (star, fork, and watcher) do not have a relationship with each other.

• H1: there is a relationship between those three variables.

Result

• Star vs Fork

    from scipy.stats import spearmanr
    coef1, p1 = spearmanr(star, forks)
    print(coef1, p1)
    >> 0.8752903811064278 0.0

• Star vs Watcher

    coef2, p2 = spearmanr(star, watchers)
    print(coef2, p2)
    >> 1.0 0.0

• Fork vs Watcher

    coef3, p3 = spearmanr(forks, watchers)
    print(coef3, p3)
    >> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables on the testing dataset.

Set α = 0.05: p1, p2, and p3 are all less than α; meanwhile, the calculation above also shows strong positive correlations, with values of coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875 respectively.

This means that the likelihood that the testing data are uncorrelated is very small (95% confidence), and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of this report, we therefore take the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and long short-term memory network (LSTM), shown in the figures below, look like two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Alongside these longer-established models, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as already described in the background section.

The model development community has recently seen the release of multiple powerful frameworks that are treated as baselines for building models. However, for many new models, such as the Wide and Deep and NCF models, usage has not grown in abundance.


[Figure 4.4: Repositories with Forks (accumulated number of repositories created over 2015–2019, including forks, for bert, cnn, lstm, ncf, resnet, transformer, and wide deep, all with tensorflow)]

[Figure 4.5: Repositories without Forks (accumulated number of repositories created over 2015–2019, excluding forks, for bert, cnn, lstm, ncf, resnet, transformer, and wide deep, all with tensorflow)]


[Figure 4.6: Repository Trend in GitHub For Each Model (per-model repository counts, October 2015 to October 2019)]


[Figure 4.7: Creation Time vs Stars (number of stars against repository creation time, October 2015 to July 2019, per model)]

A fork is another copy of a repository; a forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories created including forks and the total number created excluding forks. We find that most of the repositories related to deep learning models are not original, which indicates that a considerable number of developers remain at the studying stage.

At the same time, we use this dataset to answer several research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, unlike the previous summarizing method, we study how the popularity of deep learning models changes over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the highest number of repositories created. Let us examine this using the data. In 2017 the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued to rise to an even higher level, which persists until now.

What accounts for this tremendous usage difference? CNN and LSTM currently have two of the most important and significant communities in the deep learning field. These networks are essential in both computer vision and NLP, and they constitute an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly in the recent two years. Differing from earlier structures such as CNN, both modify the original structure and significantly improve results in computer vision and translation tasks.

Rising star: Bert

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends depicted in the graphs suggest that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with how long the model has existed, but our data tell a different story.

Published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e., stars) and creation time; past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide and Deep model is similar: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the previous data also confirm that there is no significant rise in its use.

[Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (per-model histograms of forks_count, binned 0–1000)]


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name     Mean     STD       Min   25%   50%   75%    Max
Bert           49.865   219.63    0     1     8     43     17940
CNN            10.684   611.97    2     3     8     32     13882
LSTM           4.882    214.22    0     1     2     13     2703
NCF            7.7      129.91    1     2     3     11.5   227
ResNet         4.688    221.43    0     0     1     8      2980
Transformer    18.679   1155.87   0     0     4     21     12408
Wide and Deep  1.623    36.80     0     0     1     8      146

Table 4.2: Stars Comparison

Model          Mean        STD         Min   25%   50%   75%    Max
Bert           12.8214953  58.5926617  0.0   0.0   1.0   16.5   4661.0
CNN            4.0710      25.2713617  0.0   1.0   4.0   14.0   6274.0
LSTM           1.7793      7.1956709   0.0   0.0   1.0   5.0    968.0
NCF            3.4333333   5.8603185   0.0   0.5   1.0   5.15   102.0
ResNet         1.7442478   9.3754994   0.0   0.0   0.0   3.0    1442.0
Transformer    5.3518797   33.6103826  0.0   0.0   1.0   6.0    3637.0
Wide Deep      0.7282051   1.6364192   0.0   0.0   0.0   2.5    71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (49.865), Transformer (18.679), and CNN (10.684). The top-3 models whose repositories have the highest average number of forks are Bert (12.82), Transformer (5.35), and NCF (3.43).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (1.623), ResNet (4.688), and LSTM (4.882). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (0.73), LSTM (1.78), and ResNet (1.74).

Kruskal-Wallis Test: the Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: the 7 models' distributions are the same.

• H1: the 7 models' distributions are different.

    from scipy.stats import kruskal
    stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(), dfLstm['star'].tolist(),
                      dfNcf['star'].tolist(), dfResnet['star'].tolist(),
                      dfTransformer['star'].tolist(), dfWideDeep['star'].tolist())
    print(stat, p)
    >> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the previous popularity exploration based on the number of repositories existing in GitHub. This unexpected result can be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or BERT model may require a large amount of time and effort, yet developers still show their interest in those novel deep learning models by starring and forking them.

[Figure 4.9: Star vs Contributors (stargazers_count against number_of_contributors, per model)]

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.


[Figure 4.10: Star vs Development Time (stargazers_count against develop_duration, per model)]

[Figure 4.11: Star vs Open Issues (stargazers_count against open_issues, per model)]

[Figure 4.12: Star vs Entropy Value (stargazers_count against entropy, per model)]

Number of Contributors
From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Among all the repositories, the top-3 models with the most stars per contributor are CNN (16.875 stars/contributor), Transformer (15.51 stars/contributor), and Bert (15.50 stars/contributor).

Model          Percentage of One-Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL related repositories
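The one-contributor percentage reported in Table 4.4 can be computed from the collected per-repository contribution lists. This is a sketch with hypothetical function and parameter names:

```python
def one_contributor_percentage(contribution_lists):
    """Percentage of repositories whose contributor list has exactly one
    entry, e.g. [[174, 36, 4], [10]] -> 50.0 (one of two repos is solo)."""
    repos = list(contribution_lists)
    solo = sum(1 for counts in repos if len(counts) == 1)
    return 100.0 * solo / len(repos)
```

Running this per model over the contribution records gathered in the data collection stage would reproduce the table's column.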


Development Time
From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value ≈ 0). This suggests that the longer a model is developed, the more stars it has (i.e., the model becomes more popular). The two models with the longest development durations are LSTM and CNN.

Open Issues
From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we may hypothesize that the more popular a repository becomes, the more issues it has. We further investigate this correlation in the following section.

Entropy
From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work of developing a deep learning model is divided evenly. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development work is not distributed evenly.

We can confirm this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether contributions are even or not.

Entropy
In particular, we compute the entropy H of each repository, defined as

pi = ci / Σi ci    (4.1)

H = −Σi pi log2(pi)    (4.2)

where i indexes the i-th contributor, ci denotes the i-th contributor's contribution, and Σi ci is the total contribution to one repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example:

The contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214    (4.3)

p1 = 174/214,   p2 = 36/214,   p3 = 4/214    (4.4)

H(repository) = −(174/214 log2(174/214) + 36/214 log2(36/214) + 4/214 log2(4/214)) ≈ 0.7826    (4.5)

name           contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample Contributions to One Repository
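Equations 4.1 and 4.2 translate directly into code; for the contribution counts in Table 4.5 this evaluates to roughly 0.7826 (a minimal sketch, with function names of our choosing):

```python
import math

def contribution_entropy(contributions):
    """Entropy H = -sum_i p_i * log2(p_i) of per-developer contribution
    counts, with p_i = c_i / sum_i c_i (equations 4.1 and 4.2)."""
    total = sum(contributions)
    probs = (c / total for c in contributions)
    return -sum(p * math.log2(p) for p in probs if p > 0)

# dragen1860/TensorFlow-2x-Tutorials, Table 4.5: 174 + 36 + 4 = 214
H = contribution_entropy([174, 36, 4])   # ≈ 0.7826
```

A single-contributor repository yields H = 0, and a perfectly even two-person split yields H = 1, matching the interpretation used below.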

The resulting distribution of entropy over all repositories can be used to determine whether a repository is developed unevenly: the lower the entropy, the higher the phase separation, which corresponds to more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy values for all models. From those figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


[Figure 4.13: Collaboration Entropy (per-model histograms of entropy, binned 0.00–3.00)]


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with the metadata of their forked repositories.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original. We observe that Bert has a high proportion of unique forked repositories among the models.

[Figure 4.14: Percentage of Forked Repositories Unique From Origin (boxplot of unique_percent, 0–100, per model)]

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15. Most of the model repositories are not changed after forking. Looking in more detail, we can see at a glance that not only are changes rarely made after forking, but also most changed


[Figure 4.15: Repository Uniqueness Distribution (%) (per-model histograms of uniqueness percentage, binned 0.00–1.00)]

[Figure 4.16: Repository Change Statistic (per-model histograms of repository size change, binned -2500 to 2500)]


repositories differ in size from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them less engaging. Second, the model itself may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section we survey the software maintenance problems of these deep learning related repositories. The overall purpose is to explore three sets of factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
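Equation 4.6 can be sketched using the ISO-8601 timestamps that the GitHub API attaches to created_at and updated_at (a minimal sketch; the function name is ours):

```python
from datetime import datetime

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"   # timestamp format used by the GitHub API

def age_in_days(created_at, updated_at):
    """Equation 4.6: repository age as T(updated_at) - T(created_at),
    truncated to whole days."""
    delta = (datetime.strptime(updated_at, ISO_FORMAT)
             - datetime.strptime(created_at, ISO_FORMAT))
    return delta.days
```

Applying this per repository yields the development-time distributions summarized in Table 4.6.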

Figure 4.17 and Table 4.6 show how development time varies per model. The median development times are as follows: Bert (110 days), Transformer (142 days), Wide and Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal-Wallis test, the distributions of development days differ by model (p-value ≤ 0.05). Therefore, we hypothesize that many of these earlier models started using the open-source web community immediately after their first release.


Model        Max of days  Q3 of days  Median of days  Q1 of days  Min of days
Bert         779          229         110             32          0
Transformer  1254         321         142             11          0
Wide deep    1107         575         117             0.5         0
ResNet       1360         456.5       120             1.5         0
NCF          1120         476         216             8           0
LSTM         1812         621.25      315.5           47.25       0
CNN          1385         699.25      483             270.25      0

Table 4.6: Repository Development Time Statistics

[Figure 4.17 shows per-model boxplots of development time in days (0–2000) for bert tensorflow, cnn tensorflow, lstm tensorflow, ncf tensorflow, resnet tensorflow, transformer tensorflow and wide deep tensorflow.]

Figure 4.17: Development Time Boxplot


[Figure 4.18 shows, for each model, a scatter plot of open_issues (0–2000) against develop_duration (0–1100 days).]

Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As suggested visually by the figure and by a Spearman correlation test, there is a weak correlation between those two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, given the high cost of maintaining them, may have more users and more issues related to them.
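A Spearman correlation is simply a Pearson correlation computed on ranks; a routine such as `scipy.stats.spearmanr` is what one would normally call, and the stdlib sketch below (assuming no tied values, for brevity) shows the underlying computation.

```python
from statistics import mean

def spearman_rho(xs, ys):
    """Spearman rank correlation coefficient (assumes no ties for brevity)."""
    def rank(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for pos, i in enumerate(order, start=1):
            r[i] = float(pos)
        return r

    rx, ry = rank(xs), rank(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    varx = sum((a - mx) ** 2 for a in rx)
    vary = sum((b - my) ** 2 for b in ry)
    return cov / (varx * vary) ** 0.5

# Any monotone relationship gives rho = 1, regardless of scale
print(spearman_rho([1, 2, 5, 9], [10, 40, 50, 900]))  # 1.0
```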

Specifically, as depicted in Table 4.7, the three models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model         Mean    Std      25%   50%   75%   Min   Max
Bert          8.299   50.55    0     0     1     0     504
CNN           3.414   35.456   0     0     1     0     1077
LSTM          1.292   4.915    0     0     1     0     69
ResNet        1.791   11.164   0     0     0     0     186
Transformer   1.857   8.608    0     0     1     0     95
Wide Deep     0.231   0.742    0     0     0     0     4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository   Percentage of repositories having a wiki (%)
Bert                       97.17
CNN                        98.498
LSTM                       98.799
NCF                        98.864
ResNet                     98.817
Transformer                96.97
Wide deep                  100

Table 4.8: Descriptive statistics on the percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data we collected, we can see that deep learning related repositories are well documented (i.e., most have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.
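The percentages in Table 4.8 can be derived from the `has_wiki` flag present in each repository's GitHub metadata; a minimal sketch of the per-model aggregation (the field name is as returned by the GitHub API; the sample records are hypothetical):

```python
def wiki_percentage(repos):
    """Percentage of repositories whose metadata has has_wiki == True."""
    if not repos:
        return 0.0
    with_wiki = sum(1 for r in repos if r.get("has_wiki"))
    return 100.0 * with_wiki / len(repos)

repos = [{"has_wiki": True}, {"has_wiki": True}, {"has_wiki": False}, {"has_wiki": True}]
print(round(wiki_percentage(repos), 2))  # 75.0
```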

4.4 Summary

In this section we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories, using the data collected with STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.


[Figure 4.19 shows, for each model (bert, cnn, lstm, ncf, resnet, transformer and wide deep tensorflow), a histogram of the count of repositories per binned number of open issues (0–100), with counts up to 800.]

Figure 4.19: Open Issues vs Number of Repository


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the wild: Limitation and Improvement

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas on identifying models using different strategies. We developed heuristics for in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

A sampling problem exists at the same time. The models we chose cannot represent all the new models in the wild; this is an open research question which needs further investigation in the future (for example, users may use the prototxt format to publish their models). In our project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, since a search cannot exceed the 1,000 originally-created-repositories boundary. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples might give a more precise outcome.
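The "different sorting strategies" amount to issuing the same search once per sort/order combination, so that each 1,000-result window of the GitHub v3 Search API covers a different slice of the population. A sketch of building those query parameter sets (the endpoint URL and result cap are per the GitHub v3 Search API; the keyword is an example):

```python
from itertools import product

SEARCH_URL = "https://api.github.com/search/repositories"
RESULT_CAP = 1000  # the Search API returns at most 1,000 results per query

def query_variants(keyword):
    """One parameter set per sort/order combination for the same keyword."""
    sorts, orders = ["stars", "forks", "updated"], ["asc", "desc"]
    return [
        {"q": keyword, "sort": s, "order": o, "per_page": 100}
        for s, o in product(sorts, orders)
    ]

variants = query_variants("lstm tensorflow")
print(len(variants), "windows of at most", RESULT_CAP, "results")  # 6 windows of at most 1000 results
```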

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to know how deep learning is involved in our real life, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to define their own heuristics for data selection: experts could easily change the API search in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software, or becoming a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models through the number of related repositories existing in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., k-means) to high-resolution time series data from commits.
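As a first step toward such clustering, commit timestamps can be binned into a monthly count series that k-means (e.g., `sklearn.cluster.KMeans`) could then group across repositories; a stdlib sketch of the binning step, with hypothetical timestamps:

```python
from collections import Counter
from datetime import datetime

def monthly_commit_series(timestamps):
    """Bin ISO-8601 commit timestamps into a month -> commit-count series."""
    months = [
        datetime.strptime(t, "%Y-%m-%dT%H:%M:%SZ").strftime("%Y-%m")
        for t in timestamps
    ]
    return dict(sorted(Counter(months).items()))

commits = [
    "2019-01-03T10:00:00Z", "2019-01-20T18:30:00Z",
    "2019-02-11T09:00:00Z", "2019-04-01T00:00:00Z",
]
print(monthly_commit_series(commits))  # {'2019-01': 2, '2019-02': 1, '2019-04': 1}
```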

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our new approach used the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub deep learning related repositories, and identified factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to help users gain a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields, and will serve the needs of people working at the intersection of social media analysis, data visualization and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what's been tried in the past, based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated this program is my own original work

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh
1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Softwares

• PyCharm: PyCharm 2019.1.3 (Professional Edition)


Build #PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
macOS 10.14.6

• Anaconda
  – jupyter-notebook 6.0.0

Other

• Python 3.7.4
  – pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
• Git

Datasets

asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json

by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json

forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

- Before You Begin
- Prerequisites
- Install
- Running
- Test
- High Level Description of all Modules & Datasets
- Authors
- License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view/edit/execute our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keeps the Mac awake (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- Git authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip, then install the requirements specified in requirements.txt:

    pip3 install --upgrade pip
    pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run `python3 model_searcher.py` to get keyword-related repositories' metadata from GitHub in the `output` folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions `call_api` and `call_api_dir`.

Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample case: in `main()`, change `keywords` in terms of interest. The resulting JSON file will be `output/bert.JSON`.

The sorting method can be customized in the functions `get_total_pages` and `request_ith_page`: `sort` can be `updated` or `stars`; `order` can be `asc` or `desc`.

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the forks' timestamps in `forked_timestamp`.

3. Data Selection (Optional)

Run `python3 repository_filter.py` to get your code-related repositories with statistics in the `filtered_repo` folder. Run `python3 filtered_repo.py` to filter your data.

Note: your keywords can be customized in `model_keyword.py`. We store all the previous experiment data in `tensorflow_model_filtering` and `pytorch_model_filtering`.

4. Data Visualization

- Popularity: run `python3 visualizations/popularity.py` and get your graphs in `visualizations/graphs/popularity`.
- Maintenance: run `python3 visualizations/maintenance.py` and get your graphs in `visualizations/graphs/maintenance`.
- Contribution: run `python3 visualizations/contribution.py` and get your graphs in `visualizations/graphs/contribution`.
- Multi Correlations: run `python3 visualizations/multi_variable.py` and get your graphs in `visualizations/graphs/multi_variable`.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module will record all the unreachable links and write them into the file unreachable_urls.txt.

Usage: change the elements in `keywords` and run `python3 test.py`. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module `Model.py`, define your own entity lists (e.g. `tensorflow_models`). In the constructor `Model` we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation

Once you have the data from the previous steps (1-2), you can construct a model by calling the constructor `Model`, e.g. `bert = Model("bert tensorflow", "desc_by_star")`, with the model name and the repository metadata subfolder as parameters. Then you can call this object with its related data easily (`from Model import bert` and use `bert` as you go along).

Customize Keywords

In the module `model_keyword.py`, import your instantiation (`lstm`) and call `add_keywords`, e.g.:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

Altair is used to draw elegant graphs.

Experiment Datasets Collected

1. After Data Collection

    output/
        asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
        asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
        by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
        desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
        desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
        pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

    forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv


3. After Data Selection (Optional)

    filtered_repo/
        bert.json
        pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
        tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs


    graphs/
        contribution/: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
        maintenance/: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
        multi_variable/: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
        popularity/: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only).

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and van Deursen, A., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)


• I also give thanks for every person I met during my 4-year undergraduate life, and for those sleepless hard times; they shaped me into the person I am today. Every memory at ANU is the most precious commodity in my life.


Abstract

Deep learning, as a subfield of machine learning, has rapidly become a popular research area. However, little empirical work has previously been done to analyze deep learning model usage in public. GitHub is one of the largest web-based source code hosting communities, and it could be the best place to measure the popularity of deep learning models directly.

In this project, a tool called STAMPER is proposed and developed to aid researchers in the deep learning field to study past trends in GitHub. All of the visualizations display repository information from GitHub at a high level. Our tool shows the evolution of deep learning models over time; in particular, we study the impact of some external features on deep learning models' popularity. We end with a summary of the current state of the art in deep learning model repository analysis and a critical discussion of challenges and directions for future research.

Keywords: Software Engineering, Deep Learning, Popularity, Data Visualization


List of Abbreviations

• ML: Machine Learning

• DL: Deep Learning

• CNN: Convolutional Neural Network

• LSTM: Long Short-Term Memory

• NLP: Natural Language Processing

• Bert: Bidirectional Encoder Representations from Transformers

• NCF: Neural Collaborative Filtering

• ResNet: Residual Network

• Wide & Deep: Wide and Deep Learning


Contents

Acknowledgments vii

Abstract ix

List of Abbreviations xi

1 Introduction 111 Trace Deep Learning use through GitHub 112 Contribution 213 Report Outline 2

2 Background and Related Work 321 Background 3

211 Deep learning 32111 TensorFlow 42112 PyTorch 4

212 Deep learning models 5213 Summarized Timeline 7

22 Public Code Repositories 8221 Web-based hosting service 8222 Measuring Popularity From GitHub 8223 Extracting Messy Data in the Wild 9224 Visualizing data in Repositories 9

23 Summary 10

3 STAMPER Design and Implementation 1131 Overview 1132 Data Collection 1233 Repository Search 1334 Data Selection 14

Example 1535 Construct the Visualizations 1636 Summary 18

4 STAMPER in Action 1941 Popularity of Deep Learning Models in GitHub 19

    4.1.1 Popularity Feature Selection
    4.1.2 Past and Current Status: A Full Integration


    4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

    4.1.4 RQ2: How does popularity vary per model?
    4.1.5 RQ3: Does the popularity of models relate to other features?

  4.2 Contribution of Deep Learning Models in GitHub
    4.2.1 Collaborative Contribution
    4.2.2 RQ1: After forking, do developers change the codebase?

  4.3 Maintenance of Deep Learning Models in GitHub
    4.3.1 RQ1: How long has it been in existence?
    4.3.2 RQ2: Do old models have more issues compared to new models?
    4.3.3 RQ3: Are they well maintained?

  4.4 Summary

5 Discussion and Future Work
  5.1 Discussion

    5.1.1 Data in the wild: Limitation and Improvement
    5.1.2 Extensibility and Open-Source Software

  5.2 Future Work
    5.2.1 Social Network Analysis in GitHub
    5.2.2 Trend Detection using Commit Timestamps

6 Conclusion

7 Appendix
  7.1 Appendix 1: Project Description

    7.1.1 Project Title
    7.1.2 Supervisors
    7.1.3 Project Description
    7.1.4 Learning Objectives

  7.2 Appendix 2: Study Contract
  7.3 Appendix 3: Artefact Description

    7.3.1 Code Files Submitted
    7.3.2 Program Testing
    7.3.3 Experiment

      Hardware
      Softwares
      Other
      Datasets

  7.4 Appendix 4: README

List of Figures

2.1 git2net [Gote et al. 2019]

3.1 Overview of STAMPER
3.2 Data Selection
3.3 Store in Local Disk
3.4 Overall: Construct the Visualizations
3.5 Examine Uniqueness after Forking

4.1 Repository Watching [Git b]
4.2 Star Sort Menu [Git a]
4.3 Popularity Metric
4.4 Repositories with Forks
4.5 Repositories without Forks
4.6 Repository Trend in GitHub For Each Model
4.7 Creation Time vs Stars
4.8 Number of Forks Related to Repositories in Deep Learning Model Development
4.9 Star vs Contributors
4.10 Star vs Development Time
4.11 Star vs Open Issues
4.12 Star vs Entropy Value
4.13 Collaboration Entropy
4.14 Percentage of Forked Repositories Unique From Origin (Boxplot)
4.15 Repository Uniqueness Distribution ()
4.16 Repository Change Statistic
4.17 Development Time Boxplot
4.18 Development Time vs Number of Open Issues
4.19 Open Issues vs Number of Repository


List of Tables

2.1 Deep Learning History
2.2 Timeline

3.1 Repositories Related to Tensorflow

4.1 Popularity metric for repositories
4.2 Stars Comparison
4.3 Forks Comparison
4.4 Percentage of one-contributor development for DL related repositories
4.5 Sample Contributions to One Repository
4.6 Repository Development Time Stat
4.7 Repository Open Issue Statistics
4.8 Descriptive statistics on percentage of Wiki Existence


Chapter 1

Introduction

1.1 Trace Deep Learning use through GitHub

GitHub is one of the largest web-based hosting communities in the world, containing a rich source of data that facilitates different software engineering projects. This massive database allows developers and researchers to publish their work.

GitHub can also be used for popularity measurement. Developers can construct their social networks in GitHub in multiple ways, such as joining an organisation or starring a repository they are interested in. These features make repositories in GitHub easily accessible and make it an excellent place to conduct empirical studies.

The deep learning research community is continually evolving. Not limited to academic research, deep learning is allowing businesses to use data to teach computers how to learn. As a result, a number of machine learning and deep learning startups have come into existence worldwide.

The development of deep learning models also faces software engineering problems. Studies of the quality of deep-learning-related projects are sparse, and few researchers focus on usage outside academia. With the expanding range and deepening degree of deep learning use, we would like to test whether developers keep up with the latest deep learning upsurge in academia.

A tool to extract the metadata of historical repositories, integrated with visualisation and based on the vast corpus of GitHub repositories, is not currently available. To fill this gap, we present our tool, STAMPER. It is implemented in Python and supports GitHub repository metadata extraction based on repository keyword search and code segment search using the GitHub API. It can also analyse the modification of forked repositories relative to the original repository and capture the repository difference.

We study the historical trend of deep learning models and frameworks through repositories in GitHub. We further demonstrate how repository metadata can be used to observe the deep learning trend. Our project provides a novel method to study deep learning frameworks and models from a historical perspective, and in the meantime our work opens a new avenue for empirical study in deep learning.


1.2 Contribution

• We introduce STAMPER, a Python tool that can be used to extract the metadata characterizing historical open source projects from GitHub based on researchers' interests.

• We utilize STAMPER in a case study analysing the usage of deep learning models in TensorFlow. We further demonstrate how we extract a rich set of features and establish the connections between those features and popularity.

• We create and mine a new dataset for further research use.

1.3 Report Outline

The rest of this report is structured as follows. Chapter 2 provides the timeline for deep learning frameworks and deep learning models. In that chapter, some background knowledge is presented, and previous works on software mining tools, GitHub, and visualization are reviewed as well.

Chapter 3 provides an overview of our work and introduces our proposed methodology to extract the metadata from GitHub repositories.

Chapter 4 presents a case study in which we use our tool to extract deep learning related repositories from GitHub and trace the landscape of popular deep learning models. The visualizations generated using our tool can aid researchers in understanding past and current trends easily.

Referring back to the critical questions posed, we conclude our project and highlight the next steps in Chapter 5.

Chapter 2

Background and Related Work

Software repository mining enables researchers to study historical trends in software engineering practice effectively. The use of repository mining is based on the use of web hosting services, and multiple approaches exist to conduct such studies. In the first section, we introduce some background knowledge on web-based hosting services. Then we introduce some popular deep learning frameworks in Section 2.1.1. Finally, we detail some previous works in Section 2.2, which conduct software repository mining and identify the main patterns in exploring repository popularity.

2.1 Background

2.1.1 Deep learning

Different from traditional machine learning, deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [LeCun et al. 2015]. Methods in deep learning can dramatically improve human life in multiple aspects, from image classification and speech recognition to machine translation and even autonomous cars.

The usage of deep learning has grown tremendously in the past few years with the introduction and improvement of hardware availability (GPUs), big data, and cloud providers such as Amazon Web Services (AWS). Beyond this, large companies have created research teams to develop their own deep learning algorithms and integrate them into frameworks such as PyTorch and TensorFlow. In the meantime, they share their datasets and model training tutorials, helping startup companies build state-of-the-art products with minimal effort and time.

Though deep learning has become a hot topic in recent years, its history has been evolving since the 1950s, as summarised in Table 2.1 [Subramanian 2018].

Previously, developers and researchers required expertise in C++ or CUDA to implement deep learning models and algorithms. However, thanks to large technology companies and organisations, people with knowledge of a scripting language (Python) can now also build their own deep learning algorithms.


Technique                        Year
Neural network                   1943
Backpropagation                  1960s
Convolutional Neural Network     1979
Recurrent neural network         1980
Long Short-Term Memory           1997

Table 2.1: Deep Learning History

In Sections 2.1.1.1 and 2.1.1.2 we discuss some state-of-the-art deep learning frameworks used today, both in academia and industry.

2.1.1.1 TensorFlow

TensorFlow is a machine learning system that operates at large scale [Abadi et al. 2016]. Before its initial release by the Google Brain team back in November 2015, it was developed under the name DistBelief. TensorFlow then released its official 1.0.0 version on 11 February 2017, introducing the ability to run on multiple CPUs and GPUs.

Currently, Google uses this framework in numerous ways to improve its search engine, translation, recommendation system, and image recognition and captioning. The TensorFlow library contains different APIs to build deep learning models at scale, such as ResNet, CNN, or LSTM. The architecture of this framework gives developers the flexibility to experiment with new optimisations and Google's training algorithms. It currently supports numerous applications, from front end to mobile; this flexibility and scalability make it one of the most popular deep learning frameworks in the world.

2.1.1.2 PyTorch

Based on the Torch library, PyTorch can be seen as a Python front end to the Torch engine, which provides the capability to define mathematical functions and compute gradients. Similar to TensorFlow, PyTorch has good GPU and CPU support and is primarily developed by Facebook's artificial intelligence research group. The official initial release of PyTorch was in October 2016; its dynamic computation capability allows greater flexibility in building complex architectures.

However, unlike TensorFlow, which can seamlessly integrate into real industrial applications, PyTorch was primarily developed by researchers and scientists, and in certain scenarios it is not easy or recommended to use for production.


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and speed up project development to survive in keen competition. Winning trust from the public with high-quality service is thus required.

Initially, we would have liked to conduct our research on the latest model stores, such as AWS SageMaker, the Azure machine learning service, the Wolfram Neural Net Repository, and ONNX. However, the algorithms and datasets used there are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand with the public. To avoid those problems and gain deeper insight into usage in society, we chose a framework with models of greater transparency and substantial usage: TensorFlow.

API Referenced

Since this project involves a range of deep learning models, we began by developing an understanding of how to build and train neural networks. The construction of networks may rely on the layers API provided by Keras and TensorFlow. Meanwhile, TensorFlow recently introduced the Estimator API to simplify the procedures of training, evaluation, prediction, and export.

Convolutional Neural Network (CNN)

Convolutional Neural Network is one of the most established algorithms among all the deep learning models and one of the most dominant algorithms in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons arranged in a grid pattern. Typically, a CNN consists of three types of layers: fully connected layers, convolution layers, and pooling layers. Convolution and pooling layers conduct the feature extraction, and the fully connected layer maps the extracted features to the final output. Layers are interconnected, and thus the extracted features are transferred layer by layer.

Long short-term memory (LSTM)

Different from traditional neural networks, which cannot memorise previous data, a special kind of recurrent neural network, Long Short-Term Memory (LSTM), provides researchers with an effective way to perform persistent data tasks and is capable of learning long-term dependencies [Hochreiter and Schmidhuber 1997]. Specifically, LSTM combines the forget and input gates into an update gate. It is capable of learning dependencies from historical data and making predictions from the information remembered previously. Inside the LSTM, instead of using the

TensorFlow official models are chosen in our project. The TensorFlow official models contain a collection of deep learning models built with TensorFlow's high-level APIs (https://github.com/tensorflow/models/tree/master/official).


linear layer, there is a small network inside the LSTM which performs the function independently.

LSTM is one of the most common uses of recurrent neural networks. This model is generally used with sequential data and can solve language modelling problems such as NLP concepts (word embeddings, encoders).

Residual Network (ResNet)

One of the problems deep learning models face is that, as the number of layers increases past a certain point, accuracy stops improving. This problem motivated modern network architectures, such as the residual network (ResNet) and Inception, which address it with residual connections.

ResNet normally solves the problem described above by fitting a residual mapping through an added shortcut connection. Each ResNet block contains a series of layers and a shortcut connection component.

Bidirectional Encoder Representations from Transformers (Bert)

Bert is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al. 2018].

It was first released in google-research/bert on GitHub on 1 November 2018, supporting a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications; it aims to predict the relationships between sentences by analysing the whole sentence holistically [Devlin et al. 2018].

Attention is all you need (Transformer)

Most problems in deep learning can be seen as a form of sequence-to-sequence mapping and can be solved using a common type of architecture, the encoder-decoder architecture.

The encoder and decoder both consist of identical layers, and each layer consists of two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network [Vaswani et al. 2017].

Neural Collaborative Filtering (NCF)

Neural Collaborative Filtering (NCF) is a neural network architecture which utilises the non-linearity of the neural network to build recommendation systems [He et al. 2017]. It demonstrates that matrix factorisation can be interpreted as a special case of neural collaborative filtering. To add additional non-linearity, this model introduces a multi-layer perceptron module alongside the generalised matrix factorisation (GMF) layer.

Wide and Deep Learning

Since linear models are not great at generalising across unique features, deep models were introduced to solve this problem. Deep models can use embedding vectors for every query, and this technique can then generalise by coupling the items and queries.

To overcome over-generalisation, the Google research team introduced the Wide & Deep Learning model, which jointly trains wide linear models and deep neural networks to combine the benefits of memorisation and generalisation for recommender systems [Cheng et al. 2016].

2.1.3 Summarized Timeline

Model Name      Definition Raised Time
CNN             1980s
LSTM            1997
ResNet          2015
Wide & Deep     2016
NCF             2017
Transformer     2017
Bert            2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest hosts of open-source projects built on the distributed version control system (DVCS) Git [Gousios et al. 2014]. This distributed version control system enables contributors to submit a set of changes and integrate them into the main development branch. The use of Git is based on pragmatic needs, combining the advantages of version control and collaborative development.

GitHub can also lead to insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project. Thus, the number of stars can reveal popularity from a software development research perspective; it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al. [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth using time series metadata derived from 2279 accessible GitHub repositories. They also found slow growth to be more common for overpopulated application domains and for old repositories. Moreover, they conclude that among the most common domains on GitHub are web libraries and frameworks and non-web libraries and frameworks.

Building on their work, we would like to examine whether deep learning framework repositories follow a similar popularity trend pattern. At the same time, we would also like to study whether a relationship exists among three factors: forks, stars, and watchers.

Predicting the Popularity of GitHub Repositories

In the same year, Borges et al. [2016a] published another paper on predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regressions to predict the number of stars of GitHub repositories; as a result, project owners can see how their projects are performing in the open-source community.

Their studies examine the correlation between the popularity of a repository and the number of stars it owns. The study also reports a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent

GHTorrent [Gousios and Spinellis 2012] is designed to enable independent repository mining through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all the repositories, and the resulting dataset can be used to answer multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving the raw data based on the results returned from the REST API. However, their tool does not have the ability to visualise the metadata or offer trend analysis at a high level.

MetricMiner

A similar tool is MetricMiner [Sokol et al. 2013]. It is a web application that supports researchers in mining software repositories, performing data extraction and statistical inference on the data collected. This tool automatically clones the repository, processes the metadata, and stores data in the cloud, giving it good scalability and fast query answering without users installing any software on their localhost.

GitcProc

GitcProc [Casalnuovo et al. 2017] is a tool that uses regular expressions to extract the changed lines within a repository to facilitate answering project evolution questions. GitcProc can retrieve and summarise global statistics of project metrics, including commits, commit dates, and contributors. This tool can measure how many changes have taken place in Java projects and is also able to locate the changed files.

RepoVis

RepoVis [Feiner and Andrews 2018] is a new tool which provides visual overviews for software maintained in Git repositories. RepoVis is a client-server web application that provides a full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its searchable functionality to GitHub and associates it with a code-based search. All the visualizations are written out in SVG format.

2.2.4 Visualizing data in Repositories

Chronos

CHRONOS [Servant and Jones 2013] is a software tool that enables visualisation of historical changes inside software source code. This tool implements a zoomable user interface as a visualisation of the actual code, which supports developers from a high-level view (pattern recognition) to a low-level view (reverse engineering) using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise the complete history of changes, including the revisions that modified them. Inspired by this tool, our project uses this visualisation method to track the historical change of the popular


Figure 2.1: git2net [Gote et al. 2019]

trend related to the keyword specified by users in GitHub

GEVOL

Collberg et al. [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph drawing technique to deduce a better understanding of a program from its development history, displaying all the visualisations in a temporal graph visualizer.

This system aids in the discovery of the structure of a system and provides the user with a new way to observe the evolution of a program by visualising changes to the system. It extracts information about Java programs stored within a CVS version control system into three types of graphs: inheritance, control-flow, and call graphs.

git2net

git2net [Gote et al. 2019] is a software tool that facilitates the extraction of co-editing networks in git repositories. Similar to GEVOL, it uses text mining techniques to analyse the history of modifications within files. Moreover, it addresses the importance of studying social networks in GitHub and gives the reader a broader view of the application of graph-based data analysis and modelling. Their tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter, we described the web-based hosting service we selected for study (GitHub) and presented the concept of deep learning, two popular frameworks, and state-of-the-art neural network models. In the next chapter, we elaborate on how we designed and implemented STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter, we outline our design and implementation for data extraction, and then we detail the metrics we use to estimate trends in deep learning framework and model usage on GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

Figure 3.1: Overview of STAMPER (pipeline: 1. Data Collection → 2. Repository Search → 3. Data Selection [optional] → Data Visualisation)


Data Collection

We first collect all the repository metrics through the GitHub API. This step allows us to extract the history of all the repositories related to the keyword and record the metadata for each repository (repository-based search).

Repository Search

As each repository contains multiple contributors and forks, we subsequently extract and analyze whether changes were made, based on the size information, and calculate a collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the forked information to create visual representations.
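The report does not spell out STAMPER's exact entropy formula, but a natural reading is the Shannon entropy of per-developer contribution counts; the sketch below is an assumption along those lines (the function name `collaboration_entropy` is ours):

```python
import math

def collaboration_entropy(contributions):
    """Shannon entropy (in bits) of per-developer contribution counts.

    0.0 means a single developer did all the work; log2(n) means n
    developers contributed equally. (Illustrative sketch only; not
    STAMPER's actual implementation.)
    """
    total = sum(contributions)
    probs = [c / total for c in contributions if c > 0]
    return -sum(p * math.log2(p) for p in probs)
```

For example, a single-author repository (`[100]`) scores 0.0, while two equal contributors (`[50, 50]`) score 1.0.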

Data Selection

We implemented a selector allowing users to exclude specific repositories not related to the desired set. The selector summarizes frequency counts for user-entered keywords and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis

Since each forked repository may involve re-development and modification, analysis of the modifications in forked repositories is included in our project. For all changed repositories, the size difference is analyzed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to deeply analyze and manipulate the data and even run statistical tests on the dataset. To better understand those metrics, we divided them into multiple categories. For attributes that are not primary data from the GitHub API, we explain them in the data expansion part and label them as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering an OAuth2 token at the start of the program. After authentication, the user can make up to 5000 requests per hour; otherwise, the rate limit only allows up to 60 requests per hour [Git d].
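A minimal sketch of how a client might attach the token and check its remaining quota follows; the helper names are ours, but the `Authorization: token …` header scheme and the `/rate_limit` endpoint are part of the GitHub v3 REST API:

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"

def github_headers(token=None):
    """Request headers for the GitHub v3 REST API; supplying an OAuth2
    token lifts the rate limit from 60 to 5000 requests per hour."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    return headers

def remaining_requests(token=None):
    """Query the remaining core-API quota (this endpoint does not
    itself count against the rate limit)."""
    req = urllib.request.Request(API_ROOT + "/rate_limit",
                                 headers=github_headers(token))
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["resources"]["core"]["remaining"]
```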


Type          Meta-data
Contributor   contribution: int [Data Expansion]
              login (user name): String
              type (user/organization): String
              contributors_url
Repository    created_at
              description
              full_name
              language
              size
Popularity    fork: Boolean
              forks: int
              forks_url
              stargazers_count
              watchers_count
              unique_repos [Data Expansion]
Owner         id
              login (username)
              type
Maintenance   has_issues: Boolean
              has_wiki: Boolean
              open_issues: int
              pushed_at
              updated_at
              score

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deep crawling operations to collect raw data.

• Contribution
One repository generally involves multiple developers conducting software development. The project owner is not necessarily the person who contributes the most code, and the amounts of contribution made by different developers are potentially unequal. As a result, we further track this information through the GitHub API and record the number of contributions each developer made to each repository.

• Unique_repos
Popular repositories may have numerous related forked repositories owned by other users. The reasons behind forking behaviour may vary. Our research explores whether these users conduct subsequent development based on the original codebase. By comparing the size of the forked


repository (F_i) and the original repository (O), we obtain all the forked repositories with a change of size (c):

F_i + c = O    (3.1)
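Given repository sizes from the collected metadata, changed forks can be picked out by solving Equation 3.1 for c; a minimal sketch (helper name ours, not STAMPER's actual code):

```python
def changed_forks(original_size, fork_sizes):
    """Return (fork_index, c) pairs where c = O - F_i is non-zero.

    A non-zero c indicates the fork's size diverged from the original
    repository; note that size is only a proxy for actual code change.
    """
    return [(i, original_size - size)
            for i, size in enumerate(fork_sizes)
            if size != original_size]
```

For instance, `changed_forks(100, [100, 90, 120])` yields `[(1, 10), (2, -20)]`: the first fork is unchanged, the second lost 10 size units, and the third grew by 20.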

3.4 Data Selection

Figure 3.2: Data Selection

Figure 3.3: Store in Local Disk


Figure 3.2 illustrates our method to search API usage in DL model-related repositories. GitHub provides a REST API to examine whether a repository contains a targeted API or not.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. Moreover, the appearance frequency of the user-specified API is embedded directly within the returned result and can be matched in our program to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. This approach also allows users to build up knowledge of API usage in GitHub repositories from a high-level perspective.

In the meantime, we also provide multiple options for selecting APIs, which are stored in model_keyword.py.
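The per-repository keyword counting and JSON output described above can be sketched as follows (function names and file layout are our illustration, not STAMPER's actual code):

```python
import json

def keyword_frequencies(repo_sources, keywords):
    """Map each repository full name to occurrence counts of each
    keyword in its fetched source text."""
    return {name: {kw: text.count(kw) for kw in keywords}
            for name, text in repo_sources.items()}

def write_stats(stats, path="api_usage.json"):
    """Persist the frequency statistics to local disk as JSON."""
    with open(path, "w") as fh:
        json.dump(stats, fh, indent=2)
```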

Table 3.1 shows descriptive statistics on the number of models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: The Keras applications library provides users the ability to load deep learning models and instantiate a model with default weights. Take ResNet as an example:

• With pre-defined models
Users may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, and this model could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they all could be used as great sample keywords in STAMPER, such as:

1. keras.applications.resnet.ResNet50
2. keras.applications.resnet.ResNet101
3. keras.applications.resnet.ResNet152
4. keras.applications.resnet_v2.ResNet50V2
5. keras.applications.resnet_v2.ResNet101V2
6. keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to further explore the components inside the ResNet class, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, the data selection heuristic varies. Deep learning

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py


users and experts can define their searches according to their interests and preferences.
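One heuristic for the self-defined case might scan source files for class definitions whose names mention ResNet; the regular expression below is an illustrative assumption, not STAMPER's actual heuristic:

```python
import re

# Matches e.g. "class ResNet50(tf.keras.Model):" and captures the class name.
CLASS_DEF = re.compile(r"^\s*class\s+(\w*ResNet\w*)\s*[(:]", re.MULTILINE)

def self_defined_resnets(source):
    """Names of classes in the source whose identifier mentions ResNet."""
    return CLASS_DEF.findall(source)
```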

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from the three perspectives is illustrated in Figure 3.4. Meanwhile, Chapter 5 gives an example of our collected repository metadata for deep learning models.

Figure 3.4: Overall: Construct the Visualizations

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped by week (with forks)

• Repository Creation Time vs Stars

Contribution

To additionally exploit the forking information, STAMPER supports comparison between the original repository and its forked repositories. The work could be further extended by visiting the forked repository URL and tracing the commits.

As shown in Figure 3.5, the entity (E) we search for in GitHub may have multiple related repositories (R_i) with their corresponding forked repositories (F_i). Among the forked repositories, we call a changed forked repository C_i.

To examine whether changes exist in forked repositories, and the difference between multiple entities, we calculate the difference using the equation given


Keyword                     Total of Repository (including Forks) Collected    Total of Original Repository Collected
ResNet tensorflow           6129                                               339
Bert tensorflow             13734                                              106
CNN tensorflow              39765                                              1000
LSTM tensorflow             19572                                              1000
Transformer tensorflow      7188                                               145
Wide and deep tensorflow    324                                                39

Table 3.1: Repositories Related to Tensorflow


below. Uniqueness percentage distributions are composed of all the percentages p_i corresponding to each original repository R_i:

p_i = ∑ C_i / ∑ F_i    (3.2)
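Equation 3.2 translates directly into code: for one original repository, p_i is the fraction of its forks whose size differs from it. A small sketch, with names of our own choosing:

```python
def uniqueness_percentage(fork_sizes, original_size):
    """p_i: fraction of an origin's forks whose size differs from it,
    i.e. the changed-fork count over the fork count (Equation 3.2)."""
    if not fork_sizes:
        return 0.0
    changed = sum(1 for size in fork_sizes if size != original_size)
    return changed / len(fork_sizes)
```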

Figure 3.5: Examine Uniqueness after Forking

• Percentage of forked repositories unique from origin (boxplots)

• Uniqueness percentage distribution for each entity (histograms)

• Entropy distribution for each entity (histograms)

Maintenance

• Development time boxplot for each entity

• Open issues distribution for each entity

3.6 Summary

In this chapter we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. We also introduced and analysed two novel features of GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field: deep learning models are continually evolving and are built, trained, and deployed by researchers. Our tool is designed to analyse such changes. We collected the historical information stored in GitHub and extracted repository metadata using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computational speed and power, model development has become a highly competitive battlefield: researchers, companies, and developers all compete for influence in deep learning. A variety of models exist, but there is no common bridge connecting these ideas. Historical data in GitHub is opaque and hard to gather, and as a result many developers, especially experienced ones, remain in their comfort zone. With our study, we hope to shed some light on deep learning usage and highlight a few suggestions for the public.

This section aims to answer questions about both model usage on GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, because there are few studies on popularity in the GitHub ecosystem, there is no standardized feature for measuring it. We analyse some potential features of each repository and hypothesise that popularity is strongly related to the number of stars a repository owns.

This decision is justified in the following section, with more background on GitHub.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching; watching does not make a user a collaborator [Git b]. A watcher could watch



Figure 4.2: Star Sort Menu [Git a]

a repository to receive notifications for new pull requests or issues. The watcher count indicates how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

• Stars
Starring a repository makes it easy for a user to keep track of a repository they are interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). The star count is another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks
Forks are created when a user would like to make a copy of a repository. The user can fork a repository to suggest changes, or use it as the basis for a new project.

Based on the data gathered in the collection stage, we extract the watchers_count, stargazers_count, and forks attributes and test whether these attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.
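The attribute extraction described above can be sketched as follows. The field names stargazers_count, watchers_count, and forks_count are real GitHub REST API fields, while the two sample records are hypothetical repositories reusing the Bert values from Table 4.1.

```python
# Hedged sketch: pulling the three popularity attributes out of repository
# metadata records shaped like GitHub REST API responses. The repository
# names are made up; the numbers reuse values from Table 4.1.
repos = [
    {"full_name": "example/bert-1", "stargazers_count": 17940,
     "watchers_count": 17940, "forks_count": 4661},
    {"full_name": "example/bert-2", "stargazers_count": 12405,
     "watchers_count": 12405, "forks_count": 3637},
]

stars = [r["stargazers_count"] for r in repos]
watchers = [r["watchers_count"] for r in repos]
forks = [r["forks_count"] for r in repos]
print(stars, watchers, forks)
```

Note that in the repository-level API response, watchers_count mirrors stargazers_count, which is consistent with the identical star and watcher columns in Table 4.1.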


Figure 4.3: Popularity Metric

star   forks_count  watchers_count  model name
17940  4661         17940           Bert
12405  3637         12405           Bert
5263   1056         5263            Bert

Table 4.1: Popularity metrics for repositories

The scatter plot suggests a definite positive correlation between the number of stars, number of forks, and number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we consider a rank-based (non-parametric) correlation measure rather than assuming normality.


Spearman Correlation Coefficient

Definition
The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman, 2011]. This method allows us to test for a rank-order relationship between two numerical or ordinal variables associated by a monotonic (increasing or decreasing) function.

Hypothesis Testing

• H0: The variables (star, fork, and watcher) have no relationship with each other.

• H1: There is a relationship between the three variables.

Result

bull Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

bull Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

bull Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient for each pair of variables in the testing dataset.

Set α = 0.05. The p-values p1, p2, and p3 are all less than α, and the coefficients indicate strong positive correlations, with coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875 respectively.

This means it is very unlikely (at 95% confidence) that the variables are uncorrelated, and thus we can reject the hypothesis that these variables are uncorrelated.

In the rest of this report, we treat the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, convolutional neural networks (CNN) and long short-term memory networks (LSTM), shown in the figures below, are two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Apart from these models with a longer history, BERT and ResNet are two rising stars in the model competition: they arrive with significant improvements in architecture design and performance, as described in the background section.

The model development community has recently seen the release of multiple powerful frameworks that are treated as baselines for building models. However, for many new models, such as the Wide and Deep and NCF models, usage has not grown in abundance.


Figure 4.4: Repositories with Forks (accumulated number of repositories created per model, including forks, 2015-2019)

Figure 4.5: Repositories without Forks (accumulated number of original repositories created per model, 2015-2019)

Figure 4.6: Repository Trend in GitHub For Each Model (per-model repository creation counts over time, October 2015 - October 2019)

Figure 4.7: Creation Time vs Stars (number of stars against repository creation time, per model)

A fork is another copy of a repository; a forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories created including forks and the total number created excluding forks. Most repositories related to deep learning models are therefore not original, which indicates that a considerable number of developers remain in a learning stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, different from the previous summarizing method, we study the popularity of deep learning models as it changes over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the highest number of repositories created. Let us examine this using the data: in 2017 the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued to rise to an even higher level.

What accounts for this tremendous usage difference? CNN and LSTM currently have some of the largest and most significant communities in the deep learning field; these networks are essential in both computer vision and NLP, and they form an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly in the recent two years. Differing from earlier structures such as CNN, both modify original architectures and significantly improve results in computer vision and translation tasks.

Rising star: BERT

However, no model comes with perfection; LSTM itself can be extended to many variants, and BERT is one of those.

The current trends depicted in the graphs suggest that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with when the model came into existence, but our data tells a different story.

Although its paper was published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e., stars) and creation time; past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific model could flatten out or reverse itself.

It is similar for the Wide and Deep model: although Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, previous data also confirms that there is no significant rise in its usage.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (fork-count distribution histograms per model)


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see the following.

Model Name     Mean    STD      Min  25%  50%  75%  Max
Bert           498.65  2196.3   0    1    8    43   17940
CNN            106.84  611.97   2    3    8    32   13882
LSTM           48.82   214.22   0    1    2    13   2703
NCF            77      129.91   1    2    3    115  227
ResNet         46.88   221.43   0    0    1    8    2980
Transformer    186.79  1155.87  0    0    4    21   12408
Wide and Deep  16.23   36.80    0    0    1    8    146

Table 4.2: Stars Comparison

Model Name     Mean        STD         Min  25%  50%  75%   Max
Bert           128.214953  585.926617  0.0  0.0  1.0  16.5  4661.0
CNN            40.710      252.713617  0.0  1.0  4.0  14.0  6274.0
LSTM           17.793      71.956709   0.0  0.0  1.0  5.0   968.0
NCF            34.333333   58.603185   0.0  0.5  1.0  51.5  102.0
ResNet         17.442478   93.754994   0.0  0.0  0.0  3.0   1442.0
Transformer    53.518797   336.103826  0.0  0.0  1.0  6.0   3637.0
Wide and Deep  7.282051    16.364192   0.0  0.0  0.0  2.5   71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52), and NCF (34.33).

The three models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88), and LSTM (48.82). The three models whose repositories have the lowest average number of forks are Wide and Deep (7.28), ResNet (17.44), and LSTM (17.79).

Kruskal-Wallis Test: the Kruskal-Wallis test is used in this project to compare more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The seven models' distributions are the same.

• H1: The seven models' distributions are different.

from scipy.stats import kruskal
stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(), dfLstm['star'].tolist(),
                  dfNcf['star'].tolist(), dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                  dfWideDeep['star'].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories existing in GitHub. This unexpected result can be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or BERT model may require a large amount of time and effort, while developers still show their interest in these novel deep learning models by starring and forking.

Figure 4.9: Star vs Contributors (stargazers_count against number_of_contributors, per model)

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.

Figure 4.10: Star vs Development Time (stargazers_count against develop_duration, per model)

Figure 4.11: Star vs Open Issues (stargazers_count against open_issues, per model)

Figure 4.12: Star vs Entropy Value (stargazers_count against entropy, per model)

Number of Contributors
From Figure 4.9 and a Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Across all repositories, the top-3 models with the most stars per contributor are CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor), and Bert (155.0 stars/contributor).

Model          Percentage of One-Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL related repositories


Development Time
From Figure 4.10 and a Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value = 0). This suggests that the longer a model has been developed, the more stars it will have (i.e., the model becomes more popular). The top-2 models with the longest development durations are LSTM and CNN.

Open Issues
From Figure 4.11 and a Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesise that the more popular a repository becomes, the more issues it will have. We further investigate this correlation in the following section.

Entropy
From Figure 4.12 and a Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of a new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution for a repository during the data collection stage. As mentioned, the goal is to check whether the work of developing a deep learning model is divided evenly. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means development is not distributed evenly.

We can verify this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether contributions are even or not.

Entropy
In particular, we compute the entropy H of each repository, defined as:

p_i = c_i / Σ_i c_i        (4.1)

H = −Σ_i p_i log2(p_i)        (4.2)

Here i indexes the contributors, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution to the repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example:

The contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214        (4.3)

p_1 = 174/214,  p_2 = 36/214,  p_3 = 4/214        (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.7826        (4.5)

name           contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample Contributions to One Repository
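The worked example above can be reproduced with a few lines of Python. This is a minimal sketch of Equations (4.1)-(4.2), applied here to the contribution counts from Table 4.5.

```python
import math

# Sketch of Equations (4.1)-(4.2): collaboration entropy of a repository
# from its per-contributor contribution counts.
def contribution_entropy(contributions):
    total = sum(contributions)
    probs = (c / total for c in contributions)
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Contribution counts from Table 4.5: dragen1860, ash3n, kelvinkoh0308
print(round(contribution_entropy([174, 36, 4]), 4))
```

A single-contributor repository yields an entropy of 0, the most uneven case, while k equal contributors yield log2(k), the most even.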

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly: the lower the entropy, the higher the separation, i.e., the more unevenly the work is distributed.

Figure 4.13 shows the distribution of entropy values for all models. From these figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.

Figure 4.13: Collaboration Entropy (entropy-value distribution histograms per model)


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repositories' metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has the highest proportion of unique forked repositories out of the six models.

Figure 4.14: Percentage of Forked Repositories Unique From Origin (boxplot of uniqueness percentage per model)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. Looking more closely, we can see at a glance that not only are changes rarely made after forking, but also most changed

Figure 4.15: Repository Uniqueness Distribution (%) (histograms of uniqueness percentage per model)

Figure 4.16: Repository Change Statistic (histograms of repository size change after forking, binned in bytes, per model)


repository size differences from the original repository fall within 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesise two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them less used. Second, a model may only be valid for specific types of data, making it less robust, less generalized, and less suited to developers' needs.

We conclude that the development sizes of forked repositories are quite imbalanced, and a large number of forked projects show no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section we survey software maintenance problems in these deep learning related repositories. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)        (4.6)
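Equation (4.6) can be computed directly from the ISO-8601 timestamps the GitHub API returns for created_at and updated_at. This is a minimal sketch; the timestamps in the example are hypothetical, chosen so the result matches a 110-day span.

```python
from datetime import datetime, timezone

# Sketch of Equation (4.6): repository age in days from GitHub's ISO-8601
# created_at / updated_at timestamps. The example timestamps are made up.
def repo_age_days(created_at, updated_at):
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    created = datetime.strptime(created_at, fmt).replace(tzinfo=timezone.utc)
    updated = datetime.strptime(updated_at, fmt).replace(tzinfo=timezone.utc)
    return (updated - created).total_seconds() / 86400

print(repo_age_days("2018-11-01T00:00:00Z", "2019-02-19T00:00:00Z"))  # 110.0
```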

Figure 4.17 and Table 4.6 show how development time varies per model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide and Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). We therefore hypothesise that many of the earlier models started using the open-source web community immediately after their first release.


Model          Max days  Q3      Median  Q1      Min days
Bert           779       229     110     32      0
Transformer    1254      321     142     11      0
Wide and Deep  1107      575     117     0.5     0
ResNet         1360      456.5   120     15      0
NCF            1120      476     216     8       0
LSTM           1812      621.25  315.5   47.25   0
CNN            1385      699.25  483     270.25  0

Table 4.6: Repository Development Time Statistics

Figure 4.17: Development Time Boxplot (development days per model)


Figure 4.18: Development Time vs Number of Open Issues (develop_duration against open_issues, per model)

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and a Spearman correlation test, there is a weak correlation between the two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, with their higher maintenance cost, may have more users and therefore more issues.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.30), CNN (3.41), and Transformer (1.86). Moreover, from Figure 4.19, we can see that most repositories have fewer than 200 open issues.

Model          Mean   Std     25%  50%  75%  Min  Max
Bert           8.299  50.55   0    0    1    0    504
CNN            3.414  35.456  0    0    1    0    1077
LSTM           1.292  4.915   0    0    1    0    69
ResNet         1.791  11.164  0    0    0    0    186
Transformer    1.857  8.608   0    0    1    0    95
Wide and Deep  0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository  Percentage of repositories having a Wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
ResNet                    98.817
Transformer               96.97
Wide and Deep             100

Table 4.8: Descriptive statistics on the percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data we collected, deep learning related repositories are well documented (i.e., most have a wiki page). Figure 4.19 shows the histogram of open issues. All these samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this chapter we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software engineering aspects (popularity, contribution, and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation of whether developers change the original code base after forking.

Figure 4.19: Open Issues vs Number of Repository (open-issue distribution histograms per model)


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmark deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for in-depth analysis of model construction and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation in the future. For example, users may use the prototxt format to publish their models, while in our project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, since a single search cannot exceed the 1000-result boundary. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this strategy still cannot capture all the repositories in GitHub. Other, more stratified samples might yield a more precise outcome.
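The sorting workaround can be sketched against the GitHub Search API: the same query is replayed under every sort/order combination and the results are deduplicated by repository name. This is an illustrative sketch, not STAMPER's actual implementation; the function names (`ordering_params`, `fetch_page`, `search_all_orderings`) are invented for this example, and a personal access token is assumed for a higher request rate.

```python
import json
import urllib.parse
import urllib.request

API = "https://api.github.com/search/repositories"

def ordering_params(query, per_page=100, max_pages=10):
    """Yield one request-parameter dict per Search API page.

    GitHub's Search API caps any single query at 1000 results
    (10 pages x 100 items), so the same query is replayed under
    every sort/order combination to widen the sample.
    """
    for sort in ("stars", "updated"):
        for order in ("asc", "desc"):
            for page in range(1, max_pages + 1):
                yield {"q": query, "sort": sort, "order": order,
                       "per_page": per_page, "page": page}

def fetch_page(params, token=None):
    """Fetch one page of search results (needs network access; a token helps)."""
    req = urllib.request.Request(API + "?" + urllib.parse.urlencode(params))
    if token:
        req.add_header("Authorization", "token " + token)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("items", [])

def search_all_orderings(query, token=None):
    """Run every ordering and deduplicate repositories by full name."""
    seen = {}
    for params in ordering_params(query):
        for repo in fetch_page(params, token):
            seen[repo["full_name"]] = repo
    return seen
```

Even with all four orderings, at most 4000 (overlapping) results per query are reachable, which is why the text above calls the sample incomplete.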

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to learn how deep learning is involved in our real life, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly: different models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers and users to define their own heuristics for data selection; experts could easily change the API search in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software, or by becoming a plugin for GitHub, allowing researchers and developers to easily access past trends.
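The customisation hook described here follows the Model/add_keywords pattern shown in the README (Appendix 4). The class below is a simplified stand-in written to illustrate how a researcher would register their own API keywords; it is not the actual STAMPER source.

```python
class Model:
    """Simplified stand-in for STAMPER's Model entity (illustrative only)."""

    def __init__(self, name, subfolder):
        self.name = name            # e.g. "lstm tensorflow"
        self.subfolder = subfolder  # metadata subfolder, e.g. "desc_by_star"
        self.keywords = []          # API strings that identify the model in code

    def add_keywords(self, keywords):
        """Register the code-level API calls used to detect this model."""
        self.keywords.extend(keywords)

# A researcher studying LSTM usage registers the TensorFlow cell APIs;
# studying a different model means simply supplying a different API list.
lstm = Model("lstm tensorflow", "desc_by_star")
lstm.add_keywords(["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"])
```

Swapping in a new heuristic is then a matter of editing the keyword list, which is what makes the tool extensible to models beyond the ones studied here.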

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commitments Timestamp

In this project we investigated and examined the popularity of deep learning models through the number of repositories existing in GitHub. It is very likely that the commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g. k-means) to high-resolution time series data from commits.
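As a minimal illustration of the idea, the toy one-dimensional k-means below clusters hypothetical weekly commit counts into activity levels; a real analysis would run e.g. scikit-learn's KMeans over the high-resolution commit time series collected by the tool. Both the data and the function are invented for this sketch.

```python
import random

def kmeans_1d(values, k, iters=50, seed=0):
    """Tiny one-dimensional k-means over scalar observations
    (e.g. weekly commit counts); returns sorted cluster centres."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)          # simple random initialisation
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # Recompute each centre; keep the old centre if a cluster empties.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Hypothetical weekly commit counts for one repository: a quiet period
# followed by a burst of activity (e.g. around a model release).
weekly_commits = [2, 3, 1, 2, 4, 3, 40, 52, 47, 45, 5, 4]
activity_levels = kmeans_1d(weekly_commits, k=2)  # one low and one high centre
```

The low/high centres separate "baseline maintenance" weeks from "release burst" weeks, which is the kind of structure a trend detector over commit timestamps would look for.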

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis in GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub deep learning related repositories, and identified factors affecting those domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect that our tool and the resulting corpus will be of considerable interest to researchers in different fields, and will serve the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015); Processor: 2.2 GHz Intel Core i7; Memory: 16 GB 1600 MHz DDR3; Graphics: Intel Iris Pro 1536 MB

Softwares

• PyCharm

PyCharm 2019.1.3 (Professional Edition)


Build #PY-191.7479.30, built on May 30, 2019; Licensed to ANU / Xing Yu; JRE: 11.0.2+9-b159.60 x86_64; JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.; macOS 10.14.6

• Anaconda

- jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0
  - numpy==1.14.0
  - statistics==1.0.3.5
  - ratelimit==2.2.1
  - requests
  - altair
  - matplotlib==2.2.2
  - selenium
- Git

Datasets

asc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_by_star: bert tensorflow.json, lstm tensorflow.json, wide deep tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, cnn tensorflow.json, ncf tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json

by_update_time: bert tensorflow.json, lstm tensorflow.json, resnet tensorflow.json, wide deep tensorflow.json, cnn tensorflow.json, ncf tensorflow.json, transformer tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json

forked_timestamp: bert tensorflow.csv, lstm tensorflow.csv, resnet tensorflow.csv, wide deep tensorflow.csv, cnn tensorflow.csv, ncf tensorflow.csv, transformer tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents: Before You Begin | Prerequisites | Install | Running | Test | High Level Description of all Modules & Datasets | Authors | License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view, edit and execute our code. Amphetamine on the Mac App Store keeps the Mac awake with this useful App (otherwise it will disconnect from the internet).

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:
- Git (https://git-scm.com/downloads) and a GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used, listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repository metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir. Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars, and order can be asc or desc.

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder, then run python3 filtered_repo.py to filter your data. Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run python3 visualizations/popularity.py; the graphs are written to visualizations/graphs/popularity
- Maintenance: run python3 visualizations/maintenance.py; the graphs are written to visualizations/graphs/maintenance
- Contribution: run python3 visualizations/contribution.py; the graphs are written to visualizations/graphs/contribution
- Multi correlations: run python3 visualizations/multi_variable.py; the graphs are written to visualizations/graphs/multi_variable

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them into the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), whose parameters are the model name and the repository metadata subfolder. Then you can use this object with its related data easily (from Model import bert, and use bert as you go along).

Customize keywords: in the module model_keyword.py, import your instantiation (e.g. lstm) and call add_keywords, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection (output/):
- asc_by_star: cnn tensorflow.json, lstm tensorflow.json
- asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search (forked_timestamp/):
- bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection, optional (filtered_repo/):
- bert.json
- pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
- tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs (graphs/):
- contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
- maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
- multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
- popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu
Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265-283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of github repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of github repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334-344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying github commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396-399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7-10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77-ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1-11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433-444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345-355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: Github's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12-21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173-182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735-1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1-4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142-146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on page 6)


• I also give thanks for every person I met during my 4-year undergraduate life, and for those sleepless, hard times; they shaped me into the person I am today. Every memory in ANU is the most precious commodity in my life.


Abstract

Deep learning, as a subfield of machine learning, has rapidly become a popular research area. However, there is little prior empirical work analyzing deep learning model usage in public. GitHub is one of the largest web-based source code hosting communities, and it could be the best place to measure the popularity of deep learning models directly.

In this project, a tool called STAMPER is proposed and developed to aid researchers in the deep learning field in studying past trends in GitHub. All of the visualizations display the repository information in GitHub at a high level. Our tool shows the evolution of deep learning models over time; in particular, we study the impact of some external features on deep learning models' popularity. We end with a summary of the current state of the art in deep learning model repository analysis, and a discussion of challenges and directions for future research.

Key words: Software Engineering, Deep Learning, Popularity, Data Visualization


List of Abbreviations

• ML: Machine Learning

• DL: Deep Learning

• CNN: Convolutional Neural Network

• LSTM: Long Short-Term Memory

• NLP: Natural Language Processing

• Bert: Bidirectional Encoder Representations from Transformers

• NCF: Neural Collaborative Filtering

• ResNet: Residual Network

• Wide & Deep: Wide and Deep Learning


Contents

Acknowledgments vii

Abstract ix

List of Abbreviations xi

1 Introduction 1
  1.1 Trace Deep Learning use through GitHub 1
  1.2 Contribution 2
  1.3 Report Outline 2

2 Background and Related Work 3
  2.1 Background 3
    2.1.1 Deep learning 3
      2.1.1.1 TensorFlow 4
      2.1.1.2 PyTorch 4
    2.1.2 Deep learning models 5
    2.1.3 Summarized Timeline 7
  2.2 Public Code Repositories 8
    2.2.1 Web-based hosting service 8
    2.2.2 Measuring Popularity From GitHub 8
    2.2.3 Extracting Messy Data in the Wild 9
    2.2.4 Visualizing data in Repositories 9
  2.3 Summary 10

3 STAMPER Design and Implementation 11
  3.1 Overview 11
  3.2 Data Collection 12
  3.3 Repository Search 13
  3.4 Data Selection 14
    Example 15
  3.5 Construct the Visualizations 16
  3.6 Summary 18

4 STAMPER in Action 19
  4.1 Popularity of Deep Learning Models in GitHub 19
    4.1.1 Popularity Feature Selection 19
    4.1.2 Past and Current Status: A Full Integration 23
    4.1.3 RQ1: How has the popularity of model changed over time? A closer look at the deep learning models 26
    4.1.4 RQ2: How popularity varies per model 29
    4.1.5 RQ3: Does the popularity of models relate to other features? 30
  4.2 Contribution of Deep Learning Models in GitHub 34
    4.2.1 Collaborative Contribution 34
    4.2.2 RQ1: After forking, do developers change the codebase? 36
  4.3 Maintenance of Deep Learning Models in GitHub 39
    4.3.1 RQ1: How long has it been in existence? 39
    4.3.2 RQ2: Do old models have more issues compared to new models? 41
    4.3.3 RQ3: Are they well maintained? 42
  4.4 Summary 42

5 Discussion And Future Work 45
  5.1 Discussion 45
    5.1.1 Data in the wild: Limitation and Improvement 45
    5.1.2 Extensibility and Open-Source Software 45
  5.2 Future Work 46
    5.2.1 Social Network Analysis in GitHub 46
    5.2.2 Trend Detection using Commitments Timestamp 46

6 Conclusion 47

7 Appendix 49
  7.1 Appendix 1: Project Description 49
    7.1.1 Project Title 49
    7.1.2 Supervisors 49
    7.1.3 Project Description 49
    7.1.4 Learning Objectives 49
  7.2 Appendix 2: Study Contract 49
  7.3 Appendix 3: Artefact Description 52
    7.3.1 Code Files Submitted 52
    7.3.2 Program Testing 52
    7.3.3 Experiment 52
      Hardware 52
      Softwares 52
      Other 53
      Datasets 53
  7.4 Appendix 4: README 54

List of Figures

2.1 git2net [Gote et al., 2019] 10

3.1 Overview of STAMPER 11
3.2 Data Selection 14
3.3 Store in Local Disk 14
3.4 Overall: Construct the Visualizations 16
3.5 Examine Uniqueness after Forking 18

4.1 Repository Watching [Git b] 20
4.2 Star Sort Menu [Git a] 20
4.3 Popularity Metric 21
4.4 Repositories with Forks 24
4.5 Repositories without Forks 24
4.6 Repository Trend in GitHub For Each Model 25
4.7 Creation Time vs Stars 26
4.8 Number of Forks Related to Repositories in Deep Learning Model Development 28
4.9 Star vs Contributors 30
4.10 Star vs Development Time 31
4.11 Star vs Open Issues 31
4.12 Star vs Entropy Value 32
4.13 Collaboration Entropy 35
4.14 Percentage of Forked Repositories Unique From Origin (Boxplot) 36
4.15 Repository Uniqueness Distribution (%) 37
4.16 Repository Change Statistic 38
4.17 Development Time Boxplot 40
4.18 Development Time vs Number of Open Issues 41
4.19 Open Issues vs Number of Repositories 43


List of Tables

2.1 Deep Learning History 4
2.2 Timeline 7

3.1 Repositories Related to Tensorflow 17

4.1 Popularity metric for repositories 21
4.2 Stars Comparison 29
4.3 Forks Comparison 29
4.4 Percentage of one contributor development for DL related repositories 32
4.5 Sample Contributions to One Repository 34
4.6 Repository Development Time Stat 40
4.7 Repository Open Issue Statistics 41
4.8 Descriptive statistics on percentage of Wiki Existence 42


Chapter 1

Introduction

1.1 Trace Deep Learning use through GitHub

GitHub is one of the largest web-based hosting communities in the world, containing a rich source of data that facilitates different software engineering projects. This massive database allows developers and researchers to publish their work.

GitHub can also be used for popularity measurement. Developers can construct their social networks in GitHub in multiple ways, such as joining an organisation or starring a repository they are interested in. Those features make repositories in GitHub easily accessible, and the best place to conduct empirical studies.

The deep learning research community is continually evolving. Not limited to academic research, deep learning is allowing businesses to use data to teach computers how to learn; therefore, a number of machine learning and deep learning startups have come into existence worldwide.

The development of deep learning models sometimes has software engineering problems. Studies of the quality of deep learning related projects are sparse, and few researchers focus on usage outside academia. With the expansion of the usable range of deep learning and the deepening of its use, we would like to test whether developers keep up with the latest deep learning upsurge in academia.

A tool to extract the metadata of historical repositories, with integrated visualisation, based on the vast corpus of GitHub repositories, is not currently available. To fill this gap we present our tool, STAMPER. It is implemented in Python and supports GitHub repository metadata extraction based on repository keyword search and code segment search using the GitHub API. It can also analyse the modification of forked repositories from the original repository and capture the repository difference.

We study the historical trend of deep learning models and frameworks from repositories in GitHub. We further demonstrate how repository metadata can be used to see the deep learning trend. Our project provides a novel method to study deep learning frameworks and models from a historical perspective, and in the meanwhile our work creates a new angle for empirical studies of deep learning.



1.2 Contribution

• We introduce STAMPER, a Python tool that can be used to extract the metadata characterising historical open source projects from GitHub based on researchers' interests.

• Utilizing STAMPER in a case study, we analyse the usage of deep learning models in TensorFlow. We further demonstrate how we extract a rich set of features and establish the connections between those features and popularity.

• We create and mine a new dataset for further research use.

1.3 Report Outline

The rest of this report is structured as follows. Chapter 2 provides the timeline for deep learning frameworks and deep learning models. Some background knowledge is presented in that chapter, and previous work on software mining tools and GitHub visualizations is covered there as well.

Chapter 3 provides an overview of our work and introduces our proposed methodology to extract the metadata from GitHub repositories.

Chapter 4 presents a case study in which we use our tool to extract deep-learning-related repositories from GitHub and trace the landscape of popular deep learning models. The visualizations generated using our tool can aid researchers in understanding past and current trends easily.

Referring back to the critical questions posed, we conclude our project and highlight the next steps in Chapter 5.

Chapter 2

Background and Related Work

Software repository mining lets researchers study historical trends in software engineering practice effectively. Repository mining builds on the use of web hosting services, and there exist multiple approaches to conducting such studies. In the first section we introduce some background knowledge on web-based hosting services. Then we introduce some popular deep learning frameworks in Section 2.1.1. Finally, we detail some previous works in Section 2.2 which conduct software repository mining and identify the main patterns in exploring repository popularity.

2.1 Background

2.1.1 Deep learning

Different from traditional machine learning, deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [LeCun et al. 2015]. Methods in deep learning can dramatically improve human life in multiple aspects, from image classification and speech recognition to machine translation and even autonomous cars.

The usage of deep learning has grown tremendously in the past few years with the introduction and improvement of hardware availability (GPUs), big data, and cloud providers such as Amazon Web Services (AWS). Not limited to this, large companies create research teams to develop their deep learning algorithms and integrate them into frameworks such as PyTorch and TensorFlow. In the meanwhile, they share their datasets and model training tutorials, helping startup companies to build state-of-the-art products with minimal effort and time.

Though deep learning has become a hot topic in recent years, the history behind deep learning has been evolving since the 1950s; it is summarised in Table 2.1 [Subramanian 2018].

Previously, developers and researchers required expertise in C++ or CUDA to implement deep learning models and algorithms. However, thanks to large technology companies and organisations, people with knowledge of a scripting language (Python) can also build their deep learning algorithms.



Technique | Year
Neural network | 1943
Backpropagation | 1960s
Convolutional Neural Network | 1979
Recurrent neural network | 1980
Long Short-Term Memory | 1997

Table 2.1: Deep Learning History

In Section 2.1.1.1 and Section 2.1.1.2 we will talk about some state-of-the-art deep learning frameworks used today, both in academia and industry.

2.1.1.1 TensorFlow

TensorFlow is a machine learning system that operates at large scale [Abadi et al. 2016]. It was developed by the Google Brain team under the name DistBelief before its initial release in November 2015. TensorFlow released its official 1.0.0 version on 11 February 2017, with the introduction of the ability to run on multiple CPUs and GPUs.

Currently, Google uses this framework in numerous ways to improve its search engine, translation, recommendation system, and image recognition and captioning. The TensorFlow library contains different APIs to build deep learning models at scale, such as ResNet, CNN, or LSTM. The architecture of this framework gives developers the flexibility to experiment with new optimisations and Google's training algorithms. It currently supports numerous applications from front end to mobile; this flexibility and scalability make it one of the most popular deep learning frameworks in the world.

2.1.1.2 PyTorch

Based on the Torch library, PyTorch can be seen as a Python front end to the Torch engine, which provides the capability to define mathematical functions and compute gradients. Similar to TensorFlow, PyTorch has good GPU and CPU support and is primarily developed by Facebook's artificial intelligence research group. The official initial release of PyTorch was in October 2016; its dynamic computation capability allows great flexibility in building complex architectures.

However, unlike TensorFlow, which can seamlessly integrate into real industrial applications, PyTorch was primarily developed by researchers and scientists, and it is not easily used or recommended for production in certain scenarios.


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and pick up the speed of project development to survive in this keen competition. Winning trust from the public with high-quality service is thus required.

Initially, we would like to have conducted our research on the latest model stores, such as AWS SageMaker, Azure Machine Learning service, Wolfram Neural Net Repository, and ONNX. However, the algorithms and datasets used there are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand over the public. To avoid those problems and gain a more in-depth insight into usage in society, we chose a framework with models of greater transparency and substantial usage: TensorFlow.

API Referenced

Since this project involves a range of deep learning models, we began by developing an understanding of how to build and train neural networks. The construction of networks may rely on the layers APIs provided by Keras and TensorFlow. In the meanwhile, TensorFlow recently introduced the Estimator API to simplify the procedure of training, evaluation, prediction, and export.

Convolutional Neural Network (CNN)

Convolutional Neural Network is one of the most established algorithms among all the deep learning models and one of the most dominant algorithms in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons in a grid pattern. Typically, a CNN consists of three types of layers: fully connected layers, convolution layers, and pooling layers. Convolution and pooling layers conduct the feature extraction, and the fully connected layer maps the extracted features to the final output. Layers are interconnected, and thus the extracted features are transferred layer by layer.
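To make the feature-extraction role of the convolution layer concrete, here is a minimal pure-Python sketch of a single "valid" convolution (stride 1, no padding). Real frameworks vectorise this heavily, but the sliding-window arithmetic is the same; the example image and kernel are our own illustration.

```python
def conv2d_valid(image, kernel):
    """'Valid' 2D convolution: slide the kernel over the image and take a
    weighted sum at each position (no padding, stride 1). Implemented as
    cross-correlation, as most deep learning frameworks do."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            acc = sum(image[r + i][c + j] * kernel[i][j]
                      for i in range(kh) for j in range(kw))
            row.append(acc)
        out.append(row)
    return out

# A vertical-edge detector applied to a tiny 4x4 image: the response
# peaks where the bright left half meets the dark right half.
image = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [1, 1, 0, 0],
         [1, 1, 0, 0]]
kernel = [[1, -1],
          [1, -1]]
print(conv2d_valid(image, kernel))  # -> [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

Stacking many such kernels, interleaved with pooling, is what lets a CNN build up features layer by layer.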

Long short-term memory (LSTM)

Different from traditional neural networks, which cannot memorise previous data, a special kind of recurrent neural network, Long Short-Term Memory (LSTM), provides researchers with an effective way to perform tasks on persistent data and is capable of learning long-term dependencies [Hochreiter and Schmidhuber 1997]. (A popular variant, the Gated Recurrent Unit, combines the forget and input gates into a single update gate.) An LSTM is capable of learning dependencies from historical data and making predictions from the information remembered previously. Inside an LSTM, instead of using the

TensorFlow official models are chosen in our project. The TensorFlow official models repository contains a collection of deep learning models built with TensorFlow's high-level APIs (https://github.com/tensorflow/models/tree/master/official).


linear layer, there is a small network inside the LSTM which performs the function independently.

LSTM is one of the most common forms of recurrent neural network. This model is generally used with sequential data and can solve language modelling problems involving NLP concepts (word embeddings, encoders).
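For reference, the gating mechanism sketched above is usually written as follows (the standard formulation, with σ the logistic sigmoid and ⊙ the elementwise product):

```latex
\begin{align*}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\
h_t &= o_t \odot \tanh(c_t)
\end{align*}
```

The cell state c_t is what carries long-term information forward; the gates decide what to keep, add, and expose at each step.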

Residual Network (ResNet)

One of the problems that deep learning models face is that, as the number of layers increases past a certain point, accuracy stops improving. This problem motivated modern network architectures such as the residual network (ResNet) and Inception, which address it through residual connections.

ResNet solves the problem described above by fitting a residual mapping via a shortcut connection. Each ResNet block contains a series of layers and a shortcut connection component.

Bidirectional Encoder Representations from Transformers (Bert)

Bert is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al. 2018].

It was first released in google-research/bert on GitHub on 1 November 2018, with the ability to handle a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications; it aims to predict the relationship between sentences by analysing the whole sentence holistically [Devlin et al. 2018].

Attention is all you need (Transformer)

Most problems in deep learning can be seen as a form of sequence-to-sequence mapping and can be solved using a common type of architecture, the Encoder-Decoder architecture.

The encoder and decoder both consist of identical layers, and each layer consists of 2 sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network [Vaswani et al. 2017].

Neural Collaborative Filtering (NCF)

Neural Collaborative Filtering (NCF) is a neural network architecture which utilises the non-linearity of neural networks to build recommendation systems [He et al. 2017]. It demonstrates that matrix factorisation can be expressed as a special case of neural collaborative filtering. To add additional non-linearity, this model introduces a multi-layer perceptron module alongside the generalised matrix factorisation (GMF) layer.

Wide and Deep Learning

Since linear models are not great at generalising across unique features, deep models were introduced to solve this problem. Deep models can use embedding vectors for every query, and this technique can then generalise by


coupling the items and queries. To overcome over-generalisation, the Google research team introduced Wide & Deep Learning models: jointly trained wide linear models and deep neural networks, combining the benefits of memorisation and generalisation for recommender systems [Cheng et al. 2016].

2.1.3 Summarized Timeline

Model Name | Definition Raised Time
CNN | 1980s
LSTM | 1997
ResNet | 2015
Wide & Deep | 2016
NCF | 2017
Transformer | 2017
Bert | 2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest open-source distributed version control systems (DVCS) [Gousios et al. 2014]. This distributed version control system enables contributors to submit a set of changes and integrate them in the main development branch. The use of Git is based on pragmatic needs, combining the advantages of version control with collaborative development.

GitHub can lead to insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project. Thus, the number of stars can reveal popularity from a software development research perspective; it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al. [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth using time series metadata derived from 2,279 accessible GitHub repositories. In the meanwhile, they found slow growth is more common in overpopulated application domains and for old repositories. Moreover, they conclude that the most common domains on GitHub are web libraries and frameworks and non-web libraries and frameworks.

Building on their work, we would like to examine whether deep learning framework repositories follow a similar popularity trend pattern. At the same time, we would also like to study whether a relationship exists between three factors: forks, stars, and watchers.

Predicting the Popularity of GitHub Repositories

In the same year, Borges et al. [2016a] published another paper about predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regressions to predict the number of stars of GitHub repositories; as a result, project owners can see how their projects are performing in the open-source community.

Their studies examine the correlation between the popularity of a repository and the number of stars it owns. In the meanwhile, this study reports that their prediction has a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent

GHTorrent [Gousios and Spinellis 2012] is designed to conduct independent repository mining through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all the repositories, and the resulting dataset can be used towards answering multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving the raw data based on the results returned from the REST API. However, their tool does not have the ability to visualise the metadata or offer trend analysis at a high level.

MetricMiner

A similar tool is MetricMiner [Sokol et al. 2013]. It is a web application that supports researchers in mining software repositories, doing data extraction and statistical inference on the collected data. This tool automatically clones the repository, processes the metadata, and stores the data in the cloud, which gives it good scalability and fast computational speed in query answering, without users installing any software on their localhost.

GitcProc

GitcProc [Casalnuovo et al. 2017] is a tool that uses regular expressions to extract the changed lines within a repository to facilitate answering project evolution questions. GitcProc can retrieve and summarise global statistics of project metrics, including commits, commit dates, and contributors. This tool can measure how many changes have taken place in Java projects and is also able to locate the changed files.

RepoVis

RepoVis [Feiner and Andrews 2018] is a new tool which provides visual overviews for software maintained in Git repositories. RepoVis is a client-server web application providing a full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its searchable functionality to GitHub and associates it with a code-based search. All the visualizations are written out in SVG format.

2.2.4 Visualizing data in Repositories

Chronos

CHRONOS [Servant and Jones 2013] is a software tool that enables the visualisation of historical change inside software source code. This tool implements a zoomable user interface as a visualisation of the actual code, which supports developers from a high-level view (pattern recognition) to a low-level view (reverse engineering) using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise the complete history of change, including the revisions that modified them. Inspired by this tool, our project uses this visualisation method to track the historical change of popularity trends related to keywords specified by users in GitHub.

Figure 2.1: git2net [Gote et al. 2019]

GEVOL

Collberg et al. [2003] implemented a tool called GEVOL that can visualise the evolution of software using a novel graph drawing technique to deduce a better understanding of a program from its development history, displaying all the visualisations using a temporal graph visualizer.

This system aids in the discovery of the structure of the system and provides the user with a new way to discover the evolution of the program by visualising the change of the system. It extracts information about Java programs stored within a CVS version control system into three types of graphs: inheritance, control-flow, and call-graphs.

git2net

git2net [Gote et al. 2019] is a software tool that facilitates the extraction of the co-editing network in Git repositories. Similar to GEVOL, it uses text mining techniques to analyse the history of modifications within files. However, it addresses the importance of studying the social network in GitHub and gives the reader a broader view of the application of graph-based data analysis and modelling. Their tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter we detailed the web-based hosting service we selected (GitHub) and presented the concept of deep learning with two popular frameworks, together with state-of-the-art neural network models. In the next chapter we will elaborate on how we designed and implemented STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter we outline our design and implementation for data extraction, and then we detail the metrics we use to estimate the trend of deep learning framework and model usage in GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

Figure 3.1: Overview of STAMPER


Data Collection

We first collect all the repository metrics through the GitHub API. This step allows us to extract the history of all the repositories related to the keyword and record the metadata for each repository (repository-based search).

Repository Search

As each repository contains multiple contributors and forks, we subsequently extract and analyse whether changes were made, based on the size information, and calculate a collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the forked repositories' information to create visual representations.

Data Selection

We have implemented a selector allowing the exclusion of specific repositories not related to the desired ones. The selector summarises the frequency counts for keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis

Since each forked repository may involve re-development and modification, modification analysis of forked repositories is included in our project. For all changed repositories, the size difference is analysed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to deeply analyse and manipulate the data and even run statistical tests on the dataset. To better understand those metrics, we divided them into multiple categories. For attributes that are not primary data from the GitHub API, we explain them in the data expansion part and label them as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication, the user can make up to 5,000 requests per hour; otherwise, the rate limits only allow up to 60 requests per hour [Git d].
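A minimal sketch of such authenticated access (our own illustrative helper, not STAMPER's code): GitHub accepts the OAuth2 token via the Authorization header, and the token string below is a placeholder.

```python
import urllib.request

API_ROOT = "https://api.github.com"

def authed_request(path, token=None):
    """Build a GitHub API request. With an OAuth2 token supplied, the
    rate limit rises from 60 to 5,000 requests per hour."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        # GitHub accepts "token <OAuth2-token>" in the Authorization header.
        headers["Authorization"] = f"token {token}"
    return urllib.request.Request(f"{API_ROOT}{path}", headers=headers)

# Placeholder token for illustration only
req = authed_request("/rate_limit", token="<YOUR-40-CHAR-TOKEN>")
print(req.full_url)  # -> https://api.github.com/rate_limit
```

Calling `urllib.request.urlopen(req)` would then perform the request against the `/rate_limit` endpoint, which reports the remaining quota.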


Type | Meta-data
Contributor | contribution (int) [Data Expansion]; login (user name, String); type (user/organization, String); contributors_url
Repository | created_at; description; full_name; language; size
Popularity | fork (Boolean); forks (int); forks_url; stargazers_count; watchers_count; unique_repos [Data Expansion]
Owner | id; login (username); type
Maintenance | has_issues (Boolean); has_wiki (Boolean); open_issues (int); pushed_at; updated_at; score

Table 3.2: Repository metadata collected through the GitHub API

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deep crawling operations to collect raw data.

• Contribution

One repository generally involves multiple developers conducting software development. The project owner is not necessarily the person who contributes the largest amount of code, and the amounts of contribution made by different developers are potentially not the same. As a result, we further track this information by utilizing the GitHub API and record the number of contributions each developer made for each repository.

• Unique_repos

Popular repositories may have numerous forked repositories owned by other users, and the reasons behind the forking behaviour vary. Our research would like to explore whether users conduct subsequent development based on the original codebase. By comparing the size of each forked repository (F_i) and the original repository (O), we obtain all the forked repositories with a change of size (c):

F_i + c = O    (3.1)
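Equation 3.1 amounts to a per-fork size comparison. The sketch below illustrates this check with hypothetical fork names and sizes (GitHub's `size` attribute is reported in kilobytes); it is our own illustration, not STAMPER's implementation.

```python
def changed_forks(original_size, fork_sizes):
    """Partition forks by whether their size differs from the original
    repository, i.e. whether c = O - F_i is non-zero (Equation 3.1)."""
    changed, unchanged = [], []
    for name, size in fork_sizes.items():
        c = original_size - size
        (changed if c != 0 else unchanged).append(name)
    return changed, unchanged

# Hypothetical sizes in kilobytes
original = 2048
forks = {"alice/model": 2048, "bob/model": 2310, "carol/model": 1990}
print(changed_forks(original, forks))
# -> (['bob/model', 'carol/model'], ['alice/model'])
```

A non-zero c only indicates that some modification happened; identical sizes can still hide changes, which is why this metric is treated as a proxy rather than proof.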

3.4 Data Selection

Figure 3.2: Data Selection

Figure 3.3: Store in Local Disk


Figure 3.2 represents our method to search API usage in DL-model-related repositories. GitHub provides the REST API to examine whether a repository contains the targeted API or not.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. Moreover, the appearance frequency of the user-specified API is embedded directly within the returned result and can be identified in our program, corresponding to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. This approach also allows users to build up knowledge of API usage in GitHub repositories from a high-level perspective.

In the meanwhile, we also provide multiple options for selecting APIs, which are stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the number of models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: The Keras applications library provides users the ability to load deep learning models and instantiate a model with default weights. Take ResNet as an example:

• With pre-defined models

Users may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, and this model can be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they can all be used as sample keywords in STAMPER, such as:

1. keras.applications.resnet.ResNet50
2. keras.applications.resnet.ResNet101
3. keras.applications.resnet.ResNet152
4. keras.applications.resnet_v2.ResNet50V2
5. keras.applications.resnet_v2.ResNet101V2
6. keras.applications.resnet_v2.ResNet152V2

• With self-defined models

TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to explore the components inside the ResNet class further, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, data selection heuristics vary. Deep learning

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py


users and experts can define their searches according to their interests and preferences.
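As an illustration of this keyword-driven selection, the sketch below counts how often each model's patterns appear in a source file. The patterns and the dictionary entries are hypothetical examples of what model_keyword.py might hold, not STAMPER's actual list.

```python
import re

MODEL_KEYWORDS = {  # hypothetical entries, in the spirit of model_keyword.py
    "ResNet": [r"from keras\.applications\.resnet50 import ResNet50",
               r"\bResNet\d+\b"],
    "Bert": [r"\bBertModel\b", r"import bert\b"],
}

def keyword_counts(source, keyword_map=MODEL_KEYWORDS):
    """Count how often each model's keyword patterns appear in a file."""
    return {model: sum(len(re.findall(p, source)) for p in patterns)
            for model, patterns in keyword_map.items()}

code = "from keras.applications.resnet50 import ResNet50\nnet = ResNet50()\n"
print(keyword_counts(code))  # -> {'ResNet': 3, 'Bert': 0}
```

Summing such counts across a repository's files gives the per-repository API appearance frequency described above.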

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from the three perspectives is illustrated in Figure 3.4. In the meanwhile, Chapter 4 gives an example using our collected repository metadata for deep learning models.

Figure 3.4: Overall Construction of the Visualizations

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped by week (with forks)

• Repository Creation Time vs Stars

Contribution

To additionally use the forking information, STAMPER supports the comparison between the original repository and its forked repositories. The work could be further extended by visiting the forked repository URL and tracing the commits.

As shown in Figure 3.5, the entity (E) we search in GitHub may have multiple related repositories (R_i) with their corresponding forked repositories (F_i). Among the forked repositories, we call a changed forked repository C_i.

To examine whether there exist changes in forked repositories, and the differences between multiple entities, we calculate the difference using the equation down


Keyword | Total of Repositories (including Forks) Collected | Total of Original Repositories Collected
ResNet tensorflow | 6129 | 339
Bert tensorflow | 13734 | 106
CNN tensorflow | 39765 | 1000
LSTM tensorflow | 19572 | 1000
Transformer tensorflow | 7188 | 145
Wide and deep tensorflow | 324 | 39

Table 3.1: Repositories Related to Tensorflow


below. Uniqueness percentage distributions are composed of all the percentages p_i corresponding to their original repository R_i:

p_i = (Σ C_i) / (Σ F_i)    (3.2)
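Equation 3.2 reduces to a simple ratio per original repository. A sketch with hypothetical change flags (our own illustration, not STAMPER's code):

```python
def uniqueness_percentage(fork_changed_flags):
    """Equation 3.2: the fraction of a repository's forks whose content
    diverged from the original, p_i = sum(C_i) / sum(F_i)."""
    total = len(fork_changed_flags)
    if total == 0:
        return 0.0  # a repository with no forks contributes nothing
    return sum(1 for changed in fork_changed_flags if changed) / total

# Hypothetical flags for the forks of one original repository
flags = [True, False, True, True]   # 3 of 4 forks diverged
print(uniqueness_percentage(flags))  # -> 0.75
```

Collecting p_i over all original repositories of an entity yields the uniqueness percentage distribution plotted in the histograms below.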

Figure 3.5: Examine Uniqueness after Forking

• Percentage of Forked Repositories Unique from Origin (Boxplots)

• Uniqueness percentage distribution for Each Entity (Histograms)

• Entropy Distribution for Each Entity (Histograms)

Maintenance

• Development Time Boxplot For Each Entity

• Open Issues Distribution For Each Entity

3.6 Summary

In this chapter we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. In the meanwhile, we introduced and analysed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving and are built, trained, and deployed by researchers. Our tool is available for analysing such changes. We collected the historical information stored in GitHub and extracted the metadata in the repositories by using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development is a battlefield without the smoke of gunpowder. Researchers, companies, and developers are all trying to dominate the conversation in deep learning. There exists a variety of models to work with, but there is no common bridge connecting those ideas. Historical data in GitHub are obscure and hard to find, and as a result many developers, especially experienced ones, remain in their comfort zone. With our study, we hope to shed some light on deep learning trends and highlight a few suggestions for the public.

This section aims to answer some questions related to both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, due to the few studies about popularity in the GitHub ecosystem, there is no standardised feature to measure popularity. We analyse some potential features of each repository and make the hypothesis that popularity is strongly related to the stars each repository owns.

This decision will be justified in the following section with more background on GitHub.

• Watchers

Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching; however, watching does not grant collaborator status [Git b]. A watcher can watch


Figure 4.2: Star Sort Menu [Git a]

a repository to receive notifications for new pull requests or issues that are created. Watchers can indicate how much interest the GitHub community gives to the repository.

Figure 4.1: Repository Watching [Git b]

• Stars

Starring a repository makes it easy for a user to keep track of a repository they are interested in. The starred repositories appear on the user's own host domain (https://[hostname]/stars). Stars are another metric to measure popularity within the GitHub community, and thus GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks

Forks are created when a user would like to make a copy of a repository. The user can fork a repository and suggest changes, or use it as a basis for a new project.

Based on the data we collected in the data collection stage, we extract the watchers_count, stargazers_count, and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarise 86,712 repositories (including their forked repositories) with their related popularity metrics, and we draw a scatter plot as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star | forks_count | watchers_count | model name
17940 | 4661 | 17940 | Bert
12405 | 3637 | 12405 | Bert
5263 | 1056 | 5263 | Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, number of forks, and number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we should not assume normality and instead consider a rank-based measure of correlation.


Spearman Correlation Coefficient

Definition: The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman 2011]. This method allows us to test for a rank-order relationship between two numerical or ordinal variables associated by a monotonic (increasing or decreasing) function.

Hypothesis Testing

• H0: The variables (star, fork, and watcher) do not have a monotonic relationship with each other.

• H1: There is a monotonic relationship between the three variables.

Result

bull Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

bull Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

bull Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Set α = 0.05. Since p1, p2, and p3 are all less than α, and the calculation above yields strong positive correlations of coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875 respectively, the three metrics move together.

This means it is very unlikely (at 95% confidence) that the testing data are uncorrelated, and thus we can reject the hypothesis that these variables are uncorrelated. (The perfect star-watcher coefficient of 1.0 is expected: the GitHub API's watchers_count field mirrors stargazers_count.)

In the rest of this report, we take the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and long short-term memory network (LSTM), shown in the figures below, appear to be the two most trending models. Rising from 2017, CNN and LSTM account for the greatest number of repositories in both creation and forks. Apart from models with a longer history, BERT and ResNet are two rising stars in the model competition; they arrive with significant improvements in architecture design and performance, as described in the background section.

The model development community has recently seen the release of multiple powerful frameworks, which are treated as baselines for building models. However, for many new models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.


Figure 4.4: Repositories with Forks — number of repositories created, including forks (accumulated per model, 2015–2019)

Figure 4.5: Repositories without Forks — number of repositories created (accumulated per model, 2015–2019)


Figure 4.6: Repository Trend in GitHub For Each Model


Figure 4.7: Creation Time vs Stars

A fork is another copy of a repository; the forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5 we can see, surprisingly, that there is a considerable difference between the total number of repositories created including forks and the total number created without forks: most repositories related to deep learning models are not original. This indicates that a considerable number of developers remain in a learning stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, different from the previous summarizing method, we study how the popularity of deep learning models changes over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the largest number of repositories created. The data bears this out: in 2017 the number of created repositories gradually increased for CNN, and in 2018–2019 the creation trend continued to rise to a higher level, where it remains.

What accounts for this tremendous usage difference? CNN and LSTM currently have two of the most important and significant communities in the deep learning field; these networks are essential in both computer vision and NLP, where they hold an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer
As the title indicates, ResNet and Transformer usage has improved significantly in the past two years. Unlike earlier structures such as CNN and LSTM, both of them modify an original architecture and significantly improve results in computer vision and translation tasks.

Rising Star: Bert

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends depicted in the graphs lead to the conclusion that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with how long the model has existed, but our data tells a different story.

Although its paper was published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no simple relationship between popularity (i.e., stars) and creation time; past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide and Deep model is similar: although Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the data confirms that there has been no significant rise in its use.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (fork distribution histograms per model)


4.1.4 RQ2: How popularity varies per model

Figure 4.8 shows the distribution of the number of forks for each model, and Tables 4.2 and 4.3 present summary statistics.

Model Name     Mean    STD      Min  25%  50%  75%   Max
Bert           49.865  219.63   0    1    8    43    17940
CNN            10.684  611.97   2    3    8    32    13882
LSTM           4.882   214.22   0    1    2    13    2703
NCF            7.7     129.91   1    2    3    11.5  227
ResNet         4.688   221.43   0    0    1    8     2980
Transformer    18.679  1155.87  0    0    4    21    12408
Wide and Deep  1.623   36.80    0    0    1    8     146

Table 4.2: Stars Comparison

Model Name     Mean    STD     Min  25%  50%  75%   Max
Bert           12.821  58.593  0.0  0.0  1.0  16.5  4661.0
CNN            4.071   25.271  0.0  1.0  4.0  14.0  6274.0
LSTM           1.779   7.196   0.0  0.0  1.0  5.0   968.0
NCF            3.433   5.860   0.0  0.5  1.0  5.15  102.0
ResNet         1.744   9.375   0.0  0.0  0.0  3.0   1442.0
Transformer    5.352   33.610  0.0  0.0  1.0  6.0   3637.0
Wide and Deep  0.728   1.636   0.0  0.0  0.0  2.5   71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (49.865), Transformer (18.679), and CNN (10.684). The top-3 models whose repositories have the highest average number of forks are Bert (12.821), Transformer (5.352), and NCF (3.433).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (1.623), ResNet (4.688), and LSTM (4.882). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (0.728), LSTM (1.779), and ResNet (1.744).

Kruskal–Wallis Test: the Kruskal–Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The 7 models' star distributions are the same.

• H1: The 7 models' star distributions are not all the same.

from scipy.stats import kruskal
stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                  dfNcf["star"].tolist(), dfResnet["star"].tolist(),
                  dfTransformer["star"].tolist(), dfWideDeep["star"].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we reject the null hypothesis and conclude that there is significant evidence that the seven models' star distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or BERT model requires a large amount of time and effort, yet developers still show their interest in these novel deep learning models by starring and forking them.

Figure 4.9: Star vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.


Figure 4.10: Star vs Development Time

Figure 4.11: Star vs Open Issues

Figure 4.12: Star vs Entropy Value

Number of Contributors
From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Across all repositories, the top-3 repositories with the most stars per contributor come from the models CNN (1687.5 stars/contributor), Transformer (1551 stars/contributor), and Bert (1550 stars/contributor).

Model          Percentage of One-Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time
From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value ≈ 0). This suggests that the longer a model has been developed, the more stars it has (i.e., the model becomes more popular). The two repositories with the longest development durations belong to the models LSTM and CNN.

Open Issues
From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesize that the more popular a repository becomes, the more issues it will have. We further investigate this correlation in the following section.

Entropy
From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work of developing a deep learning model is divided evenly. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means development is not distributed evenly.

We can confirm this using Table 4.4: most deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether contributions are evenly distributed.

Entropy: In particular, we compute the entropy H of each repository, defined as

pi = ci / Σi ci    (4.1)

H = − Σi pi log2(pi)    (4.2)

where i indexes the contributors, ci is the i-th contributor's contribution, and Σi ci is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example, the contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214    (4.3)

p1 = 174/214, p2 = 36/214, p3 = 4/214    (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.7826    (4.5)

name           contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample Contributions to One Repository
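Equations 4.1–4.5 can be sketched in a few lines of Python; repo_entropy is an illustrative helper of ours, not a STAMPER function:

```python
import math

def repo_entropy(contributions):
    """Collaboration entropy H = -sum(p_i * log2(p_i)) over a repository's contributors."""
    total = sum(contributions)
    return -sum((c / total) * math.log2(c / total) for c in contributions if c > 0)

# Contributions from Table 4.5: dragen1860, ash3n, kelvinkoh0308.
print(round(repo_entropy([174, 36, 4]), 4))  # → 0.7826
```

Entropy is 0 when a single contributor did all the work and log2(n) when n contributors contributed equally, so values near zero flag one-person projects.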

The resulting distribution of entropy over all repositories can be used to determine whether a repository is developed unevenly: the lower the entropy, the higher the separation, and the more unevenly the work is distributed.

Figure 4.13 shows the distribution of entropy values for all models. From these figures we can see that most repositories have an entropy value of around zero, which means deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


Figure 4.13: Collaboration Entropy (entropy distribution per model)


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repositories' metadata.
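STAMPER's retrieval step is not reproduced here, but the underlying call is the public GitHub REST API v3 forks endpoint; list_forks below is our own minimal sketch (pagination stops at the first empty page, and a personal access token is optional):

```python
import requests

def list_forks(owner, repo, token=None, per_page=100):
    """Yield metadata for every fork of owner/repo via GET /repos/{owner}/{repo}/forks."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    page = 1
    while True:
        resp = requests.get(
            f"https://api.github.com/repos/{owner}/{repo}/forks",
            headers=headers,
            params={"per_page": per_page, "page": page, "sort": "oldest"},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return
        yield from batch
        page += 1

if __name__ == "__main__":
    for fork in list_forks("google-research", "bert"):
        print(fork["full_name"], fork["created_at"])
```

Each returned fork object carries the same metadata fields as a normal repository (stars, forks, timestamps), which is what makes the uniqueness comparison below possible.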

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared with the original. We observe that Bert has a high proportion of unique forked repositories among the models.

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared with the original repository. Our objective was a summarized view that shows what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15. Most of the model repositories are not changed after forking. Looking more closely, we can see at a glance not only that changes are rarely made after forking, but also that for most changed


Figure 4.15: Repository Uniqueness Distribution (%)

Figure 4.16: Repository Change Statistics (repository changed histograms)


repositories the size difference from the original repository is within 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them less approachable. Second, the model itself may only be valid for specific types of data, making it less robust and generalized, and less suited to developers' needs.

We conclude that development across forked repositories is quite imbalanced, and a large number of forked projects show no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey software maintenance problems in these deep learning related repositories. The overall purpose is to explore three sets of factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation and last-update times, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
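Equation 4.6 can be computed directly from the created_at and updated_at timestamps the GitHub API returns; the dates in the example are made up:

```python
from datetime import datetime

def repo_age_days(created_at, updated_at):
    """Repository age in days: T(updated_at) - T(created_at)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # GitHub's timestamp format, e.g. '2018-10-11T03:12:08Z'
    return (datetime.strptime(updated_at, fmt) - datetime.strptime(created_at, fmt)).days

print(repo_age_days("2018-10-11T03:12:08Z", "2019-10-01T00:00:00Z"))  # → 354
```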

Figure 4.17 and Table 4.6 show how development time varies by model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal–Wallis test, the distributions of development days differ across models (p-value ≤ 0.05). We therefore hypothesize that many of the earlier models began drawing on the open-source web community immediately after their first release.


Model        Max (days)  Q3 (days)  Median (days)  Q1 (days)  Min (days)
Bert         779         229        110            32         0
Transformer  1254        321        142            11         0
Wide deep    1107        575        117            0.5        0
ResNet       1360        456.5      120            15         0
NCF          1120        476        216            8          0
LSTM         1812        621.25     315.5          47.25      0
CNN          1385        699.25     483            270.25     0

Table 4.6: Repository Development Time Statistics

Figure 4.17: Development Time Boxplot


Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and a Spearman correlation test, there is a moderate correlation between the two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, despite the high cost of maintaining them, may have more users and hence more issues.

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414), and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository  Percentage of repositories having a Wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
ResNet                    98.817
Transformer               96.97
Wide deep                 100

Table 4.8: Descriptive statistics on percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. These samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this chapter, we first used STAMPER to collect the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects (popularity, contribution, and maintenance) of deep learning repositories using the collected data.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported maintenance metrics for deep learning repositories and conducted an in-depth investigation of whether developers change the original codebase after forking.


Figure 4.19: Open Issues vs Number of Repositories (distribution of open issues per model)


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmarks of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for in-depth analysis of model construction and high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways models are constructed in the real world.

There is also a sampling problem. The models we chose cannot represent all new models in the wild; this is an open research question that needs further investigation (for example, users may publish their models in prototxt format). In this project we focused only on deep learning models constructed using Python. The findings may also reflect sampling limitations: the present experiment uses a limited number of repositories on GitHub, which cannot exceed the 1,000-result boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub, but this still cannot capture every repository; more stratified samples might yield a more precise outcome.

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to understand how deep learning is involved in real life, an idea that is novel compared with previous work.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly, with different models released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so the program can provide a broader picture of deep learning model usage in the world. Our program also allows a developer or user to supply their own heuristics for data selection; experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software, or becoming a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address classification or regression over GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models via the number of repositories existing in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., k-means) to high-resolution time series data from commits.
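As a sketch of that direction, the clustering step could group repositories by the shape of their weekly commit-count series; the tiny k-means below is our own illustrative code, and the series are invented:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means over equal-length numeric vectors (illustrative, not optimized)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[nearest].append(p)
        centers = [[sum(col) / len(c) for col in zip(*c)] if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Invented weekly commit counts: steady, dormant, and growing repositories.
series = [
    [12, 14, 13, 12], [11, 13, 12, 14],   # steady
    [0, 1, 0, 1], [1, 0, 1, 0],           # dormant
    [2, 8, 20, 40], [1, 9, 22, 38],       # growing
]
centers, clusters = kmeans(series, k=3)
print(sorted(len(c) for c in clusters))
```

A production version would likely normalize each series and use an off-the-shelf implementation such as scikit-learn's KMeans instead.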

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis in GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance, and contribution of deep learning related GitHub repositories and identified factors affecting these dimensions. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the tool's ability to give users deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization, and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models and datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train, and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in past research and prototyping. Have the model stores kept their models up to date? What are the differences between those model stores?

By scraping, analyzing, and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py
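For context, the "collaboration entropy" that entropy_calculation.py computes can plausibly be sketched as Shannon entropy over each contributor's share of a repository's commits (0 bits means single-author work; higher values mean more evenly shared work). The function name and input shape here are illustrative reconstructions, not the tool's exact code:

```python
# Illustrative reconstruction of a collaboration-entropy metric:
# Shannon entropy of the per-contributor commit-share distribution.
import math

def collaboration_entropy(commit_counts):
    """commit_counts: commits per contributor, e.g. [120, 30, 10]."""
    total = sum(commit_counts)
    shares = [c / total for c in commit_counts if c > 0]
    return -sum(p * math.log2(p) for p in shares)
```

For example, a repository with one contributor scores 0, while two equal contributors score 1 bit.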

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

PyCharm 2019.1.3 (Professional Edition)


Build #PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
macOS 10.14.6

• Anaconda

- jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0
  - numpy==1.14.0
  - statistics==1.0.3.5
  - ratelimit==2.2.1
  - requests
  - altair
  - matplotlib==2.2.2
  - selenium
- Git

Datasets

asc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_by_star: bert tensorflow.json, lstm tensorflow.json, wide deep tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, cnn tensorflow.json, ncf tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json

by_update_time: bert tensorflow.json, lstm tensorflow.json, resnet tensorflow.json, wide deep tensorflow.json, cnn tensorflow.json, ncf tensorflow.json, transformer tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json

forked_timestamp: bert tensorflow.csv, lstm tensorflow.csv, resnet tensorflow.csv, wide deep tensorflow.csv, cnn tensorflow.csv, ncf tensorflow.csv, transformer tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

Before You Begin
Prerequisites
Install
Running
Test
High Level Description of all Modules & Datasets
Authors
License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view, edit, and execute our code:

- PyCharm
- Anaconda
- Amphetamine (on the Mac App Store): keeps the Mac awake (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the follow

ing prerequisites on your development m

achine

Git - httpsgit-scm

comdow

nloadsG

it Authentication tokenPython 37 w

ith pip

Jupyter Notebook 600

All external libraries used listed in requirementstxt

Install

Install some dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt.

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repositories' metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars, and order can be asc or desc.
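These sort and order options correspond to GitHub's public repository Search API parameters. The following sketch shows the shape of such a request; the function name and defaults are illustrative, not STAMPER's exact call_api code:

```python
# Sketch of one page of a GitHub repository search, as used in the
# data-collection step. Endpoint and parameters are GitHub's Search API.
GITHUB_SEARCH = "https://api.github.com/search/repositories"

def build_search(keyword, sort="stars", order="desc", page=1, token=None):
    """Return (url, params, headers) for one page of repository search."""
    assert sort in ("stars", "updated", "forks")
    assert order in ("asc", "desc")
    params = {"q": keyword, "sort": sort, "order": order,
              "page": page, "per_page": 100}
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:  # authenticated requests get a higher rate limit
        headers["Authorization"] = "token " + token
    return GITHUB_SEARCH, params, headers

url, params, headers = build_search("lstm tensorflow",
                                    sort="updated", order="asc")
# requests.get(url, params=params, headers=headers) would fetch the page.
```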

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the fork timestamps into forked_timestamp.
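Fork timestamps come from the created_at field that GitHub's forks endpoint (/repos/{owner}/{repo}/forks) returns for each fork. A hedged sketch of the parsing step, with illustrative names rather than the tool's exact code:

```python
# Sketch: turn the decoded JSON list from GitHub's forks endpoint into
# sorted datetimes, ready to be written out as a timestamp CSV.
from datetime import datetime

def fork_timestamps(fork_objects):
    """fork_objects: list of fork dicts, each with a created_at field."""
    stamps = [datetime.strptime(f["created_at"], "%Y-%m-%dT%H:%M:%SZ")
              for f in fork_objects]
    return sorted(stamps)

forks = [{"created_at": "2019-03-01T10:00:00Z"},
         {"created_at": "2018-11-20T08:30:00Z"}]
ts = fork_timestamps(forks)
```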

3. Data Selection (Optional)

pip3 install --upgrade pip
pip3 install -r requirements.txt

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder. Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity.

Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance.

Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution.

Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them into the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.
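The logic of this testing unit can be sketched as follows. The reachability check is injected as a callable so the sketch runs without the network (in the real tool it would be an HTTP request, e.g. via requests); all names here are illustrative rather than test.py's actual code:

```python
# Sketch: probe each repository URL and record the unreachable ones,
# one per line, in an unreachable_urls.txt-style file.
import os
import tempfile

def record_unreachable(urls, is_reachable, outfile):
    """Write every URL failing is_reachable() to outfile; return them."""
    dead = [u for u in urls if not is_reachable(u)]
    with open(outfile, "w") as f:
        for u in dead:
            f.write(u + "\n")
    return dead

# Demo with a fake checker: anything under /alive/ counts as reachable.
out = os.path.join(tempfile.gettempdir(), "unreachable_urls.txt")
urls = ["https://github.com/alive/repo-a",
        "https://github.com/gone/repo-b"]
dead = record_unreachable(urls, lambda u: "/alive/" in u, out)
```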

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data, and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with parameters: model name and repository metadata subfolder. Then you can call this object with its related data easily (from Model import bert and use bert as you go along).

Customize keywords: in the module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection

output/
  asc_by_star/
    cnn tensorflow.json
    lstm tensorflow.json
  asc_general/
    bert.json
    cnn.json
    lstm.json
    ncf.json
    resnet.json
    transformer.json
    wide deep.json
  by_update_time/
    bert tensorflow.json
    cnn tensorflow.json
    lstm tensorflow.json
    ncf tensorflow.json
    resnet tensorflow.json
    transformer tensorflow.json
    wide deep tensorflow.json
  desc_by_star/
    bert tensorflow.json
    cnn tensorflow.json
    lstm tensorflow.json
    ncf tensorflow.json
    resnet tensorflow.json
    transformer tensorflow.json
    wide deep tensorflow.json
  desc_general/
    bert.json
    cnn.json
    lstm.json
    ncf.json
    resnet.json
    transformer.json
    wide deep.json
  pytorch_models/
    AlexNet.json
    DCGAN.json
    Densenet.json
    FCN-ResNet101.json
    GoogleNet.json
    HarDNet.json
    Inception_v3.json
    MobileNet v2.json
    PGAN.json
    ResNet.json
    ResNet101.json
    ResNext WSL.json
    ResNext.json
    RoBERTa.json
    SSD.json
    ShuffleNet v2.json
    SqueezeNet.json
    Tacotron 2.json
    Transformer.json
    U-Net pytorch.json
    U-Net.json
    WaveGlow.json
    Wide ResNet.json
    fairseq.json
    vgg_nets.json

2. After Repository Search

forked_timestamp/
  bert tensorflow.csv
  cnn tensorflow.csv
  lstm tensorflow.csv
  ncf tensorflow.csv
  resnet tensorflow.csv
  transformer tensorflow.csv
  wide deep tensorflow.csv


3. After Data Selection (Optional)

filtered_repo/
  bert.json
  pytorch_model_filtering/
    Densenet.json
    FCN-ResNet101.json
    GoogleNet.json
    MobileNet v2.json
    ResNet101.json
    ResNext.json
    ShuffleNet v2.json
    SqueezeNet.json
    Tacotron 2.json
    Wide ResNet.json
    vgg_nets.json
  tensorflow_model_filtering/
    bert.json
    lstm.json
    ncf.json
    resnet.json
    transformer.json
    wide deep.json

Generated Graphs

graphs/
  contribution/
    change_to_pdf.bash
    entropy_distribution.svg
    entropy_dots.svg
    lines_changed_boxs.svg
    lines_changed_hists.svg
    unique_percentage_distribution.svg
    uniqueness_chart.svg
  maintenance/
    devTime_boxplot.svg
    issues_distribution.svg
    wiki_yn.svg
  multi_variable/
    dev_t_to_open_issues.svg
    multi_correlation.svg
    star_to_contributors.svg
    star_to_dev_t.svg
    star_to_entropy.svg
    star_to_open_issues.svg
  popularity/
    accumulated_popularity.svg
    creation_repository_trend_total.svg
    creation_with_fork_timeline.svg
    fork_distribution.svg
    popularity_dot.svg
    popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only).

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19, and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265-283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334-344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396-399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7-10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77-ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1-11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433-444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345-355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12-21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173-182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735-1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1-4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142-146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on page 6)


This report is dedicated to my supervisor, my parents, and the people who support me,

for your kindness and devotion, and for your endless help and care; your selflessness will always be remembered.

Acknowledgments

• Foremost, I would like to show my greatest gratitude to my supervisor, Dr Ben Swift. Over the years, you have witnessed almost every step of my growth: from an ignorant second year to final year, from Beijing to Canberra. Without your support, forgiveness, and encouragement, I could not have made progress and grown. You are just the way I remember from my first semester of second year; I can remember when you sang Taylor Swift's song, and it is still vivid in my memory.

• I would like to extend my thanks to Prof. Weifa Liang, who offered me the permission code to enrol in this course and explained the study contract to me in detail.

• I shall extend my thanks to ANU CECS: thank you for believing in me and carrying me on the waves to lands I had never seen. Without you, I could not have seen snow in Montreal, blossoms in Singapore, forest in Beijing, and, last but not least, the spectacular views in Canberra - every place leaves so many extraordinary memories and new dreams for me.

• I also give thanks for every person I met during my four-year undergraduate life, and for those sleepless hard times; they shaped me into the person I am today. Every memory at ANU is the most precious commodity in my life.


Abstract

Deep learning, as a subfield of machine learning, has rapidly become a popular research area. However, there is little prior empirical work analyzing deep learning model usage in public repositories. GitHub is one of the largest web-based source code hosting communities, and it could be the best place to measure the popularity of deep learning models directly.

In this project, a tool called STAMPER is proposed and developed to aid researchers in the deep learning field in studying past trends on GitHub. All of the visualizations display the repository information in GitHub at a high level. Our tool shows the evolution of deep learning models over time; in particular, we study the impact of some external features on deep learning models' popularity. We end with a summary of the current state of the art in deep-learning-model repository analysis and a crucial discussion of challenges and directions for future research.

Key words: Software Engineering, Deep Learning, Popularity, Data Visualization


List of Abbreviations

• ML: Machine Learning

• DL: Deep Learning

• CNN: Convolutional Neural Network

• LSTM: Long Short-Term Memory

• NLP: Natural Language Processing

• Bert: Bidirectional Encoder Representations from Transformers

• NCF: Neural Collaborative Filtering

• ResNet: Residual Network

• Wide & Deep: Wide and Deep Learning


Contents

Acknowledgments
Abstract
List of Abbreviations

1 Introduction
  1.1 Trace Deep Learning use through GitHub
  1.2 Contribution
  1.3 Report Outline

2 Background and Related Work
  2.1 Background
    2.1.1 Deep learning
      2.1.1.1 TensorFlow
      2.1.1.2 PyTorch
    2.1.2 Deep learning models
    2.1.3 Summarized Timeline
  2.2 Public Code Repositories
    2.2.1 Web-based hosting service
    2.2.2 Measuring Popularity From GitHub
    2.2.3 Extracting Messy Data in the Wild
    2.2.4 Visualizing data in Repositories
  2.3 Summary

3 STAMPER Design and Implementation
  3.1 Overview
  3.2 Data Collection
  3.3 Repository Search
  3.4 Data Selection
      Example
  3.5 Construct the Visualizations
  3.6 Summary

4 STAMPER in Action
  4.1 Popularity of Deep Learning Models in GitHub
    4.1.1 Popularity Feature Selection
    4.1.2 Past and Current Status: A Full Integration
    4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    4.1.4 RQ2: How does popularity vary per model?
    4.1.5 RQ3: Does the popularity of models relate to other features?
  4.2 Contribution of Deep Learning Models in GitHub
    4.2.1 Collaborative Contribution
    4.2.2 RQ1: After forking, do developers change the codebase?
  4.3 Maintenance of Deep Learning Models in GitHub
    4.3.1 RQ1: How long has it been in existence?
    4.3.2 RQ2: Do old models have more issues compared to new models?
    4.3.3 RQ3: Are they well maintained?
  4.4 Summary

5 Discussion And Future Work
  5.1 Discussion
    5.1.1 Data in the wild: Limitation and Improvement
    5.1.2 Extensibility and Open-Source Software
  5.2 Future Work
    5.2.1 Social Network Analysis in GitHub
    5.2.2 Trend Detection using Commitments Timestamp

6 Conclusion

7 Appendix
  7.1 Appendix 1: Project Description
    7.1.1 Project Title
    7.1.2 Supervisors
    7.1.3 Project Description
    7.1.4 Learning Objectives
  7.2 Appendix 2: Study Contract
  7.3 Appendix 3: Artefact Description
    7.3.1 Code Files Submitted
    7.3.2 Program Testing
    7.3.3 Experiment
  7.4 Appendix 4: README

List of Figures

2.1 git2net [Gote et al., 2019]
3.1 Overview of STAMPER
3.2 Data Selection
3.3 Store in Local Disk
3.4 Overall: Construct the Visualizations
3.5 Examine Uniqueness after Forking
4.1 Repository Watching [Git b]
4.2 Star Sort Menu [Git a]
4.3 Popularity Metric
4.4 Repositories with Forks
4.5 Repositories without Forks
4.6 Repository Trend in GitHub For Each Model
4.7 Creation Time vs Stars
4.8 Number of Forks Related to Repositories in Deep Learning Model Development
4.9 Star vs Contributors
4.10 Star vs Development Time
4.11 Star vs Open Issues
4.12 Star vs Entropy Value
4.13 Collaboration Entropy
4.14 Percentage of Forked Repositories Unique From Origin (Boxplot)
4.15 Repository Uniqueness Distribution (%)
4.16 Repository Change Statistic
4.17 Development Time Boxplot
4.18 Development Time vs Number of Open Issues
4.19 Open Issues vs Number of Repository

List of Tables

2.1 Deep Learning History
2.2 Timeline
3.1 Repositories Related to Tensorflow
4.1 Popularity metric for repositories
4.2 Stars Comparison
4.3 Forks Comparison
4.4 Percentage of one-contributor development for DL related repositories
4.5 Sample Contributions to One Repository
4.6 Repository Development Time Stat
4.7 Repository Open Issue Statistics
4.8 Descriptive statistics on percentage of Wiki Existence

Chapter 1

Introduction

1.1 Trace Deep Learning use through GitHub

GitHub is one of the largest web-based hosting communities in the world, containing a rich source of data that facilitates different software engineering projects. This massive database allows developers and researchers to publish their work.

GitHub can also be used for popularity measurement. Developers can construct their social networks in GitHub in multiple ways, such as joining an organisation or starring a repository they are interested in. These features make repositories on GitHub easily accessible and the best place to conduct empirical studies.

The deep learning research community is continually evolving. Not limited to academic research, deep learning is allowing businesses to use data to teach computers how to learn. Therefore, a number of machine learning and deep learning startups have come into existence worldwide.

The development of deep learning models sometimes has software engineering problems. Studies of the quality of deep learning related projects are sparse; few researchers focus on usage outside academia. With the expanding range and deepening degree of deep learning use, we would like to test whether developers catch up with the latest deep learning upsurge in academia.

A tool to extract the metadata of historical repositories, with integrated visualisation, based on the vast corpus of GitHub repositories, is not currently available. To fill this gap, we present our tool STAMPER. It is implemented in Python and supports GitHub repository metadata extraction based on repository keyword search and code segment search using the GitHub API. It can also analyse the modification of forked repositories from the original repository and capture the repository difference.

We study the historical trend of deep learning models and frameworks from repositories in GitHub. We further demonstrate how repository metadata can be used to see the deep learning trend. Our project provides a novel method to study deep learning frameworks and models from a historical aspect; in the meanwhile, our work creates a new aspect of empirical study in deep learning.


1.2 Contribution

• We introduce STAMPER, a Python tool that can be used to extract the metadata characterizing historical open source projects on GitHub, based on researchers' interests.

• We utilize STAMPER in a case study analysing the usage of deep learning models in TensorFlow. We further demonstrate how we extract a rich set of features and establish the connections between those features and popularity.

• We create and mine a new dataset for further research use.

1.3 Report Outline

The rest of this report is structured as follows. Chapter 2 provides the timeline for deep learning frameworks and deep learning models; some background knowledge is presented, and previous works on software mining tools and GitHub visualizations are recorded in that chapter as well.

Chapter 3 provides an overview of our work and introduces our proposed methodology to extract the metadata from the GitHub repository.

Chapter 4 presents a case study in which we use our tool to extract deep learning related repositories from GitHub and trace the landscape of those hot deep learning models. The visualizations generated using our tool can aid researchers in understanding past and current trends easily.

Referring back to the critical questions posed, we conclude our project and highlight the next steps in Chapter 5.

Chapter 2

Background and Related Work

Software repository mining lets researchers study the historical trend of software engineering practice effectively. The use of repository mining is based on web hosting services. There exist multiple approaches to conduct these studies. In the first section, we will introduce some background knowledge on web-based hosting services. Then we will introduce some popular deep learning frameworks in Section 2.1.1. Finally, we will detail some previous works in Section 2.2, which conduct software repository mining and identify the main patterns in exploring repository popularity.

2.1 Background

2.1.1 Deep learning

Different from traditional machine learning, deep learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction [LeCun et al., 2015]. Methods in deep learning can dramatically improve human life in multiple aspects, from image classification and speech recognition to machine translation, and even to autonomous cars.

The usage of deep learning has grown tremendously in the past few years with the introduction and improvement of hardware availability (GPUs), big data, and cloud providers such as Amazon Web Services (AWS). Not limited to this, large companies create their own research teams to develop deep learning algorithms and integrate them into frameworks such as PyTorch and TensorFlow. In the meanwhile, they share their datasets and model training tutorials, helping startup companies build state-of-the-art products with minimal effort and time.

Though deep learning has become a hot topic in recent years, the history behind it has been evolving since the 1950s, as summarised in Table 2.1 [Subramanian, 2018].

Previously, developers and researchers required expertise in C++ or CUDA to implement deep learning models and algorithms. However, thanks to large technology companies and organisations, people with knowledge of a scripting language (Python) can now also build their own deep learning algorithms.


Technique                      Year
Neural network                 1943
Backpropagation                1960s
Convolutional Neural Network   1979
Recurrent neural network       1980
Long Short-Term Memory         1997

Table 2.1: Deep Learning History

In Sections 2.1.1.1 and 2.1.1.2, we will talk about some state-of-the-art deep learning frameworks used today, both in academia and industry.

2.1.1.1 TensorFlow

TensorFlow is a machine learning system that operates at large scale [Abadi et al., 2016]. Initially released by the Google Brain team in November 2015, it was developed under the name DistBelief. TensorFlow then released its official 1.0.0 version on the 11th of February 2017, introducing the ability to run on multiple CPUs and GPUs.

Currently, Google uses this framework in numerous ways to improve its search engine, translation process, recommendation system, and image recognition and captioning. The TensorFlow library contains different APIs to build deep learning models at scale, such as ResNet, CNN, or LSTM. The architecture of this framework gives developers the flexibility to experiment with new optimisations and Google's training algorithms. It currently supports numerous applications from front end to mobile; the flexibility and scalability make it one of the most popular deep learning frameworks in the world.

2.1.1.2 PyTorch

Based on the Torch library, PyTorch can be seen as a Python front end to the Torch engine, which provides the capability to define mathematical functions and compute gradients. Similar to TensorFlow, PyTorch has good GPU and CPU support and is primarily developed by Facebook's artificial intelligence research group. The official initial release of PyTorch was in October 2016; its dynamic computation graph gives it great flexibility in building complex architectures.

However, unlike TensorFlow, which can seamlessly integrate into real industrial applications, PyTorch was primarily developed for researchers and scientists, and is not easily used or recommended for production in certain scenarios.


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and pick up the speed of project development to survive in this keen competition. Winning the public's trust with a high-quality service is thus required.

Initially, we would like to conduct our research on the latest model stores, such as AWS SageMaker, the Azure Machine Learning service, the Wolfram Neural Net Repository, and ONNX. However, the algorithms and datasets used there are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand over the public. To avoid those problems and gain a more in-depth insight into usage in society, we choose a framework with models of greater transparency and substantial usage: TensorFlow.

APIs Referenced

Since this project involves a range of deep learning models, we begin by developing an understanding of how to build and train neural networks. The construction of networks may rely on the layers APIs provided by Keras and TensorFlow. In the meanwhile, TensorFlow recently introduced estimator APIs to simplify the procedures of training, evaluation, prediction, and export.

Convolutional Neural Network (CNN)

Convolutional Neural Network is one of the most established algorithms among all the deep learning models and one of the most dominant algorithms in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons arranged in a grid pattern. Typically, a CNN consists of three types of layers: fully connected layers, convolution layers, and pooling layers. Convolution and pooling layers conduct the feature extraction, and the fully connected layer maps the extracted features into the final output. Layers are interconnected, and thus the extracted features are transferred layer by layer.
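The convolution-based feature extraction described above can be sketched in plain Python (an illustrative toy assuming a single-channel 2D input and stride 1; real CNNs use optimised library kernels):

```python
def conv2d(image, kernel):
    """'Valid' 2D convolution in the deep-learning convention
    (i.e. cross-correlation): slide the kernel over the image and
    sum the elementwise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# A 2x2 kernel over a 3x3 image yields a 2x2 feature map.
feature_map = conv2d(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0], [0, -1]],
)
# feature_map -> [[-4, -4], [-4, -4]]
```

In a real CNN many such kernels are learned per layer, and pooling then downsamples the resulting feature maps.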

Long short-term memory (LSTM)

Different from traditional neural networks, which cannot memorise previous data, a special kind of recurrent neural network, Long Short-Term Memory (LSTM), provides researchers with an effective way to perform persistent-data tasks and is capable of learning long-term dependencies [Hochreiter and Schmidhuber, 1997]. (Some LSTM variants combine the forget and input gates into a single update gate.) It is capable of learning dependencies from historical data and making predictions from previously remembered information. Inside an LSTM, instead of a single linear layer, there is a small network which performs each gating function independently.

(The TensorFlow official models, a collection of deep learning models built with TensorFlow's high-level APIs (https://github.com/tensorflow/models/tree/master/official), are the models chosen in our project.)

LSTM is one of the most common uses of recurrent neural networks. This model is generally used with sequence-related data and can solve language modelling problems, such as NLP concepts (word embedding, encoders).

Residual Network (ResNet)

One of the problems that deep learning models face is that, as the number of layers increases past a certain point, accuracy stops improving. This problem motivated modern network architectures, such as the residual network (ResNet) and Inception, which address it through residual connections.

ResNet normally solves the problem described above by fitting a residual mapping via a shortcut connection. Each ResNet block contains a series of layers and a shortcut connection component.
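Written out, the residual block described above computes the following (the standard ResNet formulation, with F the stacked layers and x the shortcut input):

```latex
% Instead of learning a desired mapping H(x) directly, the block
% fits the residual F(x) = H(x) - x and adds the shortcut back:
y = \mathcal{F}(x, \{W_i\}) + x
```

Because the shortcut passes x through unchanged, gradients can flow around the stacked layers, which is what lets accuracy keep improving as depth grows.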

Bidirectional Encoder Representations from Transformers (Bert)

Bert is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al., 2018].

It was first released in google-research/bert on GitHub on 1 November 2018, with the ability to handle a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications; it aims to predict the relationship between sentences by analysing whole sentences holistically [Devlin et al., 2018].

Attention is all you need (Transformer)

Most problems in deep learning can be seen as a form of sequence-to-sequence mapping and can be solved using a common type of architecture: the Encoder-Decoder architecture.

The encoder and decoder both consist of identical layers, and each layer consists of two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network (FFN) [Vaswani et al., 2017].
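The multi-head self-attention sub-layer is built from scaled dot-product attention, which in Vaswani et al.'s notation (queries Q, keys K, values V, key dimension d_k) is:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```

The division by the square root of d_k keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.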

Neural Collaborative Filtering (NCF)

Neural Collaborative Filtering (NCF) is a neural network architecture which utilises the non-linearity of neural networks to build recommendation systems [He et al., 2017]. It demonstrates that matrix factorisation can be expressed as a special case of neural collaborative filtering. To add additional non-linearity, the model introduces a multi-layer perceptron module alongside the generalised matrix factorisation (GMF) layer.

Wide and Deep Learning

Since linear models are not great at generalising across unique features, deep models are introduced to solve this problem. Deep models can use embedding vectors for every query, and this technique can then generalise by coupling items and queries.

To overcome over-generalisation, the Google research team introduced the Wide & Deep Learning model, which jointly trains wide linear models and deep neural networks to combine the benefits of memorisation and generalisation for recommender systems [Cheng et al., 2016].

2.1.3 Summarized Timeline

Model Name     Definition Raised Time
CNN            1980s
LSTM           1997
ResNet         2015
Wide & Deep    2016
NCF            2017
Transformer    2017
Bert           2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest hosts for open-source distributed version control (DVCS) [Gousios et al., 2014]. A distributed version control system enables contributors to submit a set of changes and integrate them into the main development branch. The use of Git is based on pragmatic needs, combining the advantages of version control and collaborative development.

GitHub can also lead to insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project. Thus, the number of stars can reveal popularity from a software development research perspective, and it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al. [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth using time series metadata derived from 2,279 accessible GitHub repositories. In the meanwhile, they found slow growth is more common in overpopulated application domains and for old repositories. Moreover, they conclude that the most common domains on GitHub include web libraries and frameworks and non-web libraries and frameworks.

Building on their work, we would like to examine whether deep learning framework repositories follow a similar popularity trend pattern. At the same time, we would also like to study whether there exists a relationship between three factors: forks, stars, and watchers.

Predicting the Popularity of GitHub Repositories

In the same year, Borges et al. [2016a] published another paper about predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regressions to predict the number of stars of GitHub repositories; as a result, project owners can see how their projects are performing in the open-source community.

Their studies examine the correlation between the popularity of a repository and the number of stars it owns. In the meanwhile, the study reports a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent

GHTorrent [Gousios and Spinellis, 2012] is designed to conduct independent repository mining through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all repositories, and the resulting dataset can be used to answer multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving raw data based on the results returned from the REST API. However, their tool does not have the ability to visualise the metadata or offer trend analysis at a high level.

MetricMiner

A similar tool is MetricMiner [Sokol et al., 2013]. It is a web application that supports researchers in mining software repositories, doing data extraction and statistical inference on the data collected. The tool automatically clones the repository, processes the metadata, and stores the data in the cloud, which gives it good scalability and fast query answering without users installing any software on their localhost.

GitcProc

GitcProc [Casalnuovo et al., 2017] is a tool that uses regular expressions to extract the changed lines within a repository, to facilitate answering project evolution questions. GitcProc can retrieve and summarise global statistics of project metrics, including commits, commit dates, and contributors. It can measure how many changes have taken place in Java projects and is also able to locate the changed files.

RepoVis

RepoVis [Feiner and Andrews, 2018] is a new tool which provides visual overviews for software maintained in Git repositories. RepoVis is a client-server web application providing a full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its searchable functionality to GitHub, associated with a code-based search. All the visualizations are written out in SVG format.

2.2.4 Visualizing data in Repositories

Chronos

CHRONOS [Servant and Jones, 2013] is a software tool that enables the visualisation of historical change inside software source code. The tool implements a zoomable user interface as a visualisation of the actual code, which supports developers from a high-level view (pattern recognition) to a low-level view (reverse engineering) using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise the complete history of change, including the revisions that modified them. Inspired by this tool, our project uses this visualisation method to track the historical change of the popularity

trend related to the keyword specified by users in GitHub.

Figure 2.1: git2net [Gote et al., 2019]

GEVOL

Collberg et al. [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph drawing technique, to deduce a better understanding of a program from its development history, displaying all the visualisations using a temporal graph visualizer.

The system aids in the discovery of the structure of the software and provides the user with a new way to discover the evolution of a program by visualising the change of the system. It extracts information about Java programs stored within a CVS version control system into three types of graphs: inheritance, control-flow, and call-graphs.

git2net

git2net [Gote et al., 2019] is a software tool that facilitates the extraction of co-editing networks from git repositories. Similar to GEVOL, it uses text mining techniques to analyse the history of modifications within files. However, it addresses the importance of studying the social network in GitHub and gives the reader a broader view of the application of graph-based data analysis and modelling. Their tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter, we detailed the web-based hosting service we selected (GitHub) and presented the concept of deep learning with two popular frameworks, together with state-of-the-art neural network models. In the next chapter, we will elaborate on how we design and implement STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter, we outline our design and implementation for data extraction, and then detail the metrics we use to estimate the trend of deep learning framework and model usage in GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

Figure 3.1: Overview of STAMPER (1. data collection of repository metadata for each model-name keyword via the Git project search API; 2. repository search; 3. optional data selection via the Git code search API; results are stored locally for data visualisation)


Data Collection: We first collect all repository metrics through the GitHub API. This step allows us to extract the history of all repositories related to the keyword and record the metadata for each repository (repository-based search).

Repository Search: As each repository contains multiple contributors and forks, we subsequently extract and analyse whether changes were made, based on the size information, and calculate a collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the fork information to create visual representations.

Data Selection: We have implemented a selector allowing the exclusion of repositories not related to the desired ones. The selector summarizes the frequency counts for keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis: Since each forked repository may be related to re-development and modification, modification of forked repositories is included in our project. For all changed repositories, the size difference is analysed to examine whether lines were added or removed compared to the original repository.
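The collaborative factor (entropy) mentioned in the Repository Search step can be read as Shannon entropy over per-contributor shares. The sketch below is an illustrative interpretation, not STAMPER's actual code:

```python
import math

def contribution_entropy(contributions):
    """Shannon entropy (in bits) of per-contributor commit shares.

    0 bits means a single developer wrote everything; higher values
    mean contributions are spread more evenly across developers.
    """
    total = sum(contributions)
    if total == 0:
        return 0.0
    shares = [c / total for c in contributions if c > 0]
    return -sum(p * math.log2(p) for p in shares)

# Two equal contributors give 1 bit; a solo project gives 0 bits.
print(contribution_entropy([50, 50]))  # 1.0
```

Under this reading, a high-entropy repository is genuinely collaborative, while a low-entropy one is effectively single-author.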

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to deeply analyse and manipulate the data, and even run statistical tests on the data set. To better understand those metrics, we divided them into multiple categories. For the attributes that are not primary data from the GitHub API, we explain them in the data expansion part and label them as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication, the user can make up to 5,000 requests per hour; otherwise, the rate limits only allow up to 60 requests per hour [Git d].
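A minimal sketch of such an authenticated request using only the standard library (the function name and query are illustrative, not STAMPER's actual code):

```python
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"

def build_search_request(keyword, token=None):
    """Build a GitHub repository-search request; supplying an OAuth2
    token in the Authorization header raises the rate limit from 60
    to 5,000 requests per hour."""
    query = urllib.parse.quote(keyword)
    url = f"{API_ROOT}/search/repositories?q={query}"
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    return urllib.request.Request(url, headers=headers)

# urllib.request.urlopen(build_search_request("resnet tensorflow", token))
# would then return the JSON search results.
```

Keeping the token in a header (rather than the URL) avoids leaking it in logs and matches GitHub's token-based authentication scheme.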

Type          Meta-data

Contributor   contribution: int [Data Expansion]
              login (user name): String
              type (user/organization): String
              contributors_url

Repository    created_at
              description
              full_name
              language
              size

Popularity    fork: Boolean
              forks: int
              forks_url
              stargazers_count
              watchers_count
              unique_repos [Data Expansion]

Owner         id
              login (username)
              type

Maintenance   has_issues: Boolean
              has_wiki: Boolean
              open_issues: int
              pushed_at
              updated_at
              score

Table 3.2: Repository metadata collected

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deep crawling operations to collect raw data.

• Contribution: One repository generally involves multiple developers conducting software development. The project owner is not necessarily the person who contributes the greatest amount of code, and the amounts of contribution made by different developers are potentially not the same. As a result, we further track that information by utilizing the GitHub API and record the number of contributions each developer made to each repository.

• Unique_repos: Popular repositories may have numerous related forked repositories owned by other users, and the reasons behind the forking behaviour vary. Our research would like to explore whether users conduct subsequent development based on the original codebase. By comparing the size of each forked repository (Fi) and the original repository (O), we obtain all the forked repositories with a change of size (c):

    Fi + c = O    (3.1)

3.4 Data Selection

Figure 3.2: Data Selection (model-related API keywords are searched within each entity's repositories and summarized as statistics)

Figure 3.3: Store in Local Disk (unfiltered forked-repository and timestamp data are filtered by model-related keywords such as Bert, ResNet, and CNN, and grouped per model)


Figure 3.2 represents our method to search API usage in DL model-related repositories. GitHub provides a REST API to examine whether a repository contains a targeted API or not.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. Moreover, the appearance frequency of the user-specified API is embedded directly within the returned result and can be identified in our program, corresponding to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. This approach also allows users to build knowledge of API usage in GitHub-related repositories from a high-level perspective.

In the meanwhile, we also provide multiple options for selecting APIs, stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the TensorFlow-related models and the number of repositories we retrieved from GitHub.

Example: the Keras applications library provides the ability to load deep learning models and instantiate a model with default weights. Using ResNet as an example:

• With pre-defined models: a user may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, and this model could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they can all be used as sample keywords in STAMPER, such as:

1. keras.applications.resnet.ResNet50
2. keras.applications.resnet.ResNet101
3. keras.applications.resnet.ResNet152
4. keras.applications.resnet_v2.ResNet50V2
5. keras.applications.resnet_v2.ResNet101V2
6. keras.applications.resnet_v2.ResNet152V2

• With self-defined models: TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to further explore the components inside the ResNet class, they may first examine whether it is self-defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, the data selection heuristic varies: deep learning

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py


users and experts can define their searches according to their own interests and preferences.
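The selection step can be sketched as a simple keyword-frequency count over a repository's Python files (a hypothetical helper, not STAMPER's actual implementation; the keyword strings follow the examples above):

```python
from pathlib import Path

# Example model-related API keywords, as they might appear in
# model_keyword.py.
MODEL_KEYWORDS = [
    "from keras.applications.resnet50 import ResNet50",
    "keras.applications.resnet.ResNet50",
]

def count_keywords(repo_dir, keywords=tuple(MODEL_KEYWORDS)):
    """Count how often each keyword appears across *.py files
    in a locally checked-out repository."""
    counts = {k: 0 for k in keywords}
    for path in Path(repo_dir).rglob("*.py"):
        text = path.read_text(errors="ignore")
        for k in keywords:
            counts[k] += text.count(k)
    return counts
```

The resulting per-keyword counts mirror the frequency file that the selector writes to local disk.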

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from the three perspectives is illustrated in Figure 3.4. In the meanwhile, Chapter 5 gives an example using our collected repository metadata on deep learning models.

Figure 3.4: Overall Construct the Visualizations (each entity is functionally mapped to contribution-related, popularity-related, and maintenance-related visualisations)

Popularity

• Total number of repositories with forks (line)
• Total number of repositories without forks (line)
• Number of creations over time, grouped in weeks (with forks)
• Repository creation time vs. stars

Contribution

To additionally exploit the forking information, STAMPER finally supports the comparison between original repositories and forked repositories. The work could be further extended by visiting each forked repository's URL and tracing the commits.

As shown in Figure 3.5, the entity (E) we search in GitHub may have multiple related repositories (Ri) with their corresponding forked repositories (Fi). Among the forked repositories, we denote a changed forked repository by Ci.

To examine whether there exists change in forked repositories, and the difference between multiple entities, we calculate the difference using Equation 3.2.


Keyword                     Total of Repository (including Forks) Collected    Total of Original Repository Collected
ResNet tensorflow           6129                                               339
Bert tensorflow             13734                                              106
CNN tensorflow              39765                                              1000
LSTM tensorflow             19572                                              1000
Transformer tensorflow      7188                                               145
Wide and deep tensorflow    324                                                39

Table 3.1: Repositories Related to Tensorflow


Uniqueness percentage distributions are composed of all the percentages pi, each corresponding to its original repository Ri:

    pi = (Σ Ci) / (Σ Fi)    (3.2)
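Equation 3.2 can be sketched per original repository as follows (illustrative only; "changed" is approximated, as in the report, by a fork whose size differs from the original):

```python
def uniqueness_percentage(original_size, fork_sizes):
    """Fraction of forks whose size differs from the original
    repository, i.e. changed forks (Ci) over all forks (Fi)."""
    if not fork_sizes:
        return 0.0
    changed = sum(1 for size in fork_sizes if size != original_size)
    return changed / len(fork_sizes)

# Two of four forks changed size relative to the original:
print(uniqueness_percentage(100, [100, 100, 120, 95]))  # 0.5
```

A percentage near 0 means the forks are untouched copies; a percentage near 1 means most forks saw subsequent development.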

Figure 3.5: Examine Uniqueness after Forking (an entity E maps to repositories 1..n; each repository's forked repositories 1..n are marked as changed or not, e.g. Y/N/Y/Y)

• Percentage of forked repositories unique from origin (boxplots)
• Uniqueness percentage distribution for each entity (histograms)
• Entropy distribution for each entity (histograms)

Maintenance

• Development time boxplot for each entity
• Open issues distribution for each entity

3.6 Summary

In this chapter, we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. In the meanwhile, we introduced and analysed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving, being built, trained, and deployed by researchers. Our tool is available for analysing such changes. We collected the historical information stored in GitHub and extracted the repository metadata using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development is a highly competitive field. Researchers, companies, and developers are all trying to dominate the conversation in deep learning. There exists a variety of models to think in, but no common bridge connecting those ideas together. Historical data in GitHub is obscure and hard to find, and as a result many developers, especially experienced ones, remain in their original comfort zone. With our study, we hope to shed some light on deep learning and highlight a few suggestions for the public.

This section aims to answer some questions related to both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, due to the few studies about popularity in the GitHub ecosystem, there is no standardized feature to measure popularity. We analyze some potential features of each repository and make the hypothesis that popularity is strongly related to the stars each repository owns.

This decision will be justified in the following section with more background on GitHub.

• Watchers: Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in the repositories they are watching; however, watchers are not necessarily collaborators [Git b]. A watcher could watch


a repository to receive notifications for the new pull requests or issues that are created. Watchers can indicate how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

Figure 4.2: Star Sort Menu [Git a]

• Stars: Starring a repository makes it easy for users to keep track of repositories they are interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and thus GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks: Forks are created when a user would like to make a copy of an original repository. The user can fork a repository and suggest changes, or use it as a basis for a new project.

Based on the data we collected in the data collection stage, we extract the watchers_count, stargazers_count, and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star     forks_count   watchers_count   model name
17940    4661          17940            Bert
12405    3637          12405            Bert
5263     1056          5263             Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, number of forks, and number of watchers. However, there exists evidence of non-linearity for forks_count values close to zero, and thus we should not assume normality.


Spearman Correlation Coefficient

Definition: The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman, 2011]. This statistical method allows us to test for a rank-order relationship between two numerical ordinal variables associated with a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork, and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

• Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

• Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables on the testing dataset.

Setting α = 0.05, the p-values p1, p2, and p3 are all less than α; in the meanwhile, from the calculation above we can also find that there exist strong positive correlations, with values of coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875 respectively. (The coefficient of exactly 1.0 between stars and watchers is unsurprising: in the GitHub REST API, the watchers_count field mirrors stargazers_count.)

This means that the likelihood that the testing data are uncorrelated is very low (95% confidence), and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of the report, we consider the number of stars to be the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and long short-term memory network (LSTM) appear to be two of the most trending models these days. Rising from 2017, CNN and LSTM exhibit the greatest number of repositories in both creation and forks. Aside from models with a longer history, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as described in the background chapter.

The model development community recently saw the release of multiple powerful frameworks, treated as baselines for building models. However, for many new models, usage has not grown in abundance, for example the Wide and Deep and NCF models.


Figure 4.4: Repositories with Forks (accumulated number of repositories created, including forks, per model keyword, 2015–2019)

Figure 4.5: Repositories without Forks (accumulated number of repositories created per model keyword, 2015–2019)

Figure 4.6: Repository Trend in GitHub For Each Model (repository creation counts per model keyword, October 2015 – October 2019)

Figure 4.7: Creation Time vs Stars (number of stars against repository creation time per model keyword)

A fork is another copy of a repository. The forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, we can, surprisingly, see that there is a considerable difference between the total number of repositories created including forks and the total number of original repositories. We find that most of the repositories related to deep learning models are not original, which indicates that a considerable number of developers remain at the learning stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, different from the previous summarizing method, we study the popularity of deep learning models over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the largest number of repositories created. Let us examine this observation using the data.

sect41 Popularity of Deep Learning Models in GitHub 27

In 2017, the number of repositories created for CNN gradually increased, and in 2018-2019 the creation trend continued rising to a higher level, which persists to the present.

What accounts for this tremendous difference in usage? CNN and LSTM currently have among the largest and most significant communities in the deep learning field. These networks are essential in both computer vision and NLP, where they constitute an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, the usage of ResNet and Transformer has improved significantly in the last two years. Unlike earlier structures such as CNN and LSTM, both of them modify an original structure and significantly improve the results in computer vision and translation tasks, respectively.

Rising star: BERT

However, no model comes with perfection. Existing architectures can be extended into many variants, and BERT, which builds on the Transformer, is one of those.

The current trends, as depicted in the graph, lead to the conclusion that deep learning models are proliferating fast with innovative developments. There is surely still ample space to grow and improve.

Models in the frozen zone: NCF and Wide & Deep

One might believe that the popularity of a deep learning model is strongly correlated with when the model came into existence, but our data tell a different story.

NCF, whose paper was published in 2016, draws the least attention in the GitHub community. The data also show that there is no clear relationship between popularity (i.e., stars) and creation time: past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide & Deep model, also published in 2016, is similar: although Google provides full documentation and a tutorial for this model, we still take a pessimistic view of it. Moreover, the previous data confirm that there is no significant rise in the use of this model.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model, and Table 4.2 and Table 4.3 present the summary statistics.

Model Name     Mean     STD      Min   25%   50%   75%    Max
Bert           498.65   2196.3   0     1     8     43     17940
CNN            106.84   611.97   2     3     8     32     13882
LSTM           48.82    214.22   0     1     2     13     2703
NCF            77       129.91   1     2     3     11.5   227
ResNet         46.88    221.43   0     0     1     8      2980
Transformer    186.79   1155.87  0     0     4     21     12408
Wide and Deep  16.23    36.80    0     0     1     8      146

Table 4.2: Stars Comparison

Model Name     Mean    STD     Min   25%   50%   75%   Max
Bert           128.21  585.93  0.0   0.0   1.0   16.5  4661.0
CNN            40.71   252.71  0.0   1.0   4.0   14.0  6274.0
LSTM           17.79   71.96   0.0   0.0   1.0   5.0   968.0
NCF            34.33   58.60   0.0   0.5   1.0   51.5  102.0
ResNet         17.44   93.75   0.0   0.0   0.0   3.0   1442.0
Transformer    53.52   336.10  0.0   0.0   1.0   6.0   3637.0
Wide and Deep  7.28    16.36   0.0   0.0   0.0   2.5   71.0

Table 4.3: Forks Comparison
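Summary statistics of this shape are what pandas' `describe()` produces; a toy illustration (the star values below are made up, not the project's data):

```python
import pandas as pd

# Toy star counts for one model's repositories.
stars = pd.Series([0, 1, 8, 43, 500, 17940], name="star")

# describe() yields count, mean, std, min, 25%, 50%, 75%, max.
summary = stars.describe()
print(summary["mean"], summary["50%"])
```

The heavy right tail in such data (one hugely starred repository among many with few stars) is why the mean and the median differ so sharply in Tables 4.2 and 4.3.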

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52), and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88), and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), ResNet (17.44), and LSTM (17.79).

Kruskal-Wallis Test. The Kruskal-Wallis test is used in this project to compare more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen:

• H0: the 7 models' distributions are the same.

• H1: at least one of the 7 models' distributions is different.

from scipy.stats import kruskal

stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                  dfNcf["star"].tolist(), dfResnet["star"].tolist(), dfTransformer["star"].tolist(),
                  dfWideDeep["star"].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories existing in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or BERT model may require a large amount of time and effort, yet developers still show their interest in these novel deep learning models by starring and forking them.

Figure 4.9: Star vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, the development time, the number of open issues, and the entropy value, respectively.


Figure 4.10: Star vs Development Time

Figure 4.11: Star vs Open Issues

Figure 4.12: Star vs Entropy Value

Number of Contributors. From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Among all the repositories, the top-3 models with the most stars per contributor are CNN (1687.5 stars/contributor), Transformer (1551 stars/contributor), and Bert (1550 stars/contributor).
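The Spearman tests reported in this section can be reproduced along these lines (the metadata values below are made up; in the project they come from the collected repository metadata):

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical repository metadata rows.
df = pd.DataFrame({
    "stargazers_count": [5, 120, 30, 800, 2, 60],
    "contributors":     [1, 4, 2, 9, 1, 3],
})

# Spearman's rho is a rank correlation, so it is robust to the
# heavy-tailed star distributions seen above.
rho, p = spearmanr(df["stargazers_count"], df["contributors"])
print(rho, p)
```

Spearman's test is the appropriate choice here because star counts are far from normally distributed; a rank-based statistic only assumes a monotonic relationship.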

Model          Percentage of One-Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time. From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value ≈ 0). This suggests that the longer a model has been developed, the more stars it has (i.e., the model becomes more popular). The top-2 repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues. From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, this supports the assumption that the more popular a repository becomes, the more issues it will have. We further investigate this correlation in the following section.

Entropy. From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.

First, during the data collection stage we collected each contributor's contribution distribution for every repository. As mentioned, the goal is to check whether the work of developing a deep learning model is divided evenly. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development work is not distributed evenly.

We can examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers whose contributions may not be equal, we introduce an information-theoretic approach to test whether contributions are distributed evenly.

Entropy. In particular, we compute the entropy H of each repository, defined as

    p_i = c_i / Σ_i c_i                (4.1)

    H = −Σ_i p_i log2(p_i)             (4.2)

where i denotes the i-th contributor, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example:

its contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

    Total = 174 + 36 + 4 = 214                                          (4.3)

    p_1 = 174/214,  p_2 = 36/214,  p_3 = 4/214                          (4.4)

    H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214)
                      + 4/214 · log2(4/214)) ≈ 0.80133                  (4.5)

Name           Contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample Contributions to One Repository
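The computation in Equations (4.1)-(4.2) can be sketched as a small function (a minimal sketch; STAMPER's actual implementation lives in entropy_calculation.py):

```python
import math

def collaboration_entropy(contributions):
    """H = -sum_i p_i * log2(p_i) over a repository's contribution counts."""
    total = sum(contributions)
    return -sum((c / total) * math.log2(c / total) for c in contributions if c > 0)

# Contribution counts from Table 4.5: dragen1860, ash3n, kelvinkoh0308.
print(collaboration_entropy([174, 36, 4]))
```

A repository split evenly between two contributors gives the maximum two-person entropy of exactly 1 bit, while a single-contributor repository gives 0, which is why low entropy signals uneven collaboration.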

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly: the lower the entropy, the higher the phase separation, which means more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy value for all models. From those figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with the metadata of their forked repositories.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models studied.

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was to provide a summarized view of what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. A more detailed analysis shows at a glance not only that changes are rarely made after forking, but also that most changed


Figure 4.15: Repository Uniqueness Distribution (%)

Figure 4.16: Repository Change Statistics


repositories have a size difference from the original repository of only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, the new models have not been released for long, and the lack of tutorials and attention makes them less of a concern to people. Second, a model may only be valid for a specific type of data, making it less robust and generalized and therefore less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.
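A cheap way to approximate "changed after forking" is to compare each fork's repository size with its parent's, which is how the size-difference histogram in Figure 4.16 can be read; a sketch on made-up fork records (the field names here are assumptions, not the project's exact schema):

```python
# Hypothetical fork records: repository size (KB) of the fork and of its parent.
forks = [
    {"full_name": "u1/bert", "size": 1200, "parent_size": 1200},
    {"full_name": "u2/bert", "size": 1200, "parent_size": 1200},
    {"full_name": "u3/bert", "size": 1350, "parent_size": 1200},
]

# A fork whose size differs from its parent has (very likely) been modified.
changed = [f for f in forks if f["size"] != f["parent_size"]]
changed_fraction = len(changed) / len(forks)
print(changed_fraction)
```

Size is only a proxy: a fork could in principle change code while keeping the same total size, so this estimate is a lower bound on real modifications.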

4.3 Maintenance of Deep Learning Models in GitHub

In this section we survey the software maintenance problems of these deep learning related repositories. The overall purpose is to explore three sets of factors related to maintenance: the development time, the number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long have they been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation and last-update times, as depicted in the equation below:

    age = T(updated_at) − T(created_at)        (4.6)
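Equation (4.6) translates directly into code over GitHub's timestamp fields (a minimal sketch; the timestamps are the ISO-8601 strings GitHub returns, and the example values below are illustrative):

```python
from datetime import datetime

def repo_age_days(created_at, updated_at):
    """Age in days between repository creation and last update (Equation 4.6)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    delta = datetime.strptime(updated_at, fmt) - datetime.strptime(created_at, fmt)
    return delta.total_seconds() / 86400

# Illustrative timestamps spanning 110 days (the Bert median in Table 4.6).
print(repo_age_days("2018-10-31T00:00:00Z", "2019-02-18T00:00:00Z"))
```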

Figure 4.17 and Table 4.6 show how development time varies depending on the model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide & Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal-Wallis test, the distributions of development days across the models are different (p-value ≤ 0.05). Therefore, we hypothesize that for many of the earlier models, developers started using the open-source web community immediately after the first release.


Model        Max (days)  Q3      Median  Q1      Min
Bert         779         229     110     32      0
Transformer  1254        321     142     11      0
Wide & Deep  1107        575     117     0.5     0
ResNet       1360        456.5   120     15      0
NCF          1120        476     216     8       0
LSTM         1812        621.25  315.5   47.25   0
CNN          1385        699.25  483     270.25  0

Table 4.6: Repository Development Time Statistics

Figure 4.17: Development Time Boxplot


Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating the development time with the number of issues. As visually suggested by the figure and the Spearman correlation test, there is a weak correlation between those two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, owing to the high cost of maintaining them, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414), and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide & Deep  0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository  Percentage of Repositories Having a Wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
ResNet                    98.817
Transformer               96.97
Wide & Deep               100

Table 4.8: Descriptive statistics on the percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All these samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this chapter, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects (popularity, contribution, and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation of whether developers change the original codebase after forking.


Figure 4.19: Open Issues vs Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas on identifying models using different strategies, and we developed heuristics for an in-depth analysis of model construction and of high-level APIs such as estimators and layers. However, our heuristic may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation (for example, users may use the prototxt format to publish their models). In this project we focused only on deep learning models constructed using Python. The findings may also reflect sampling limitations: the present experiment uses a limited number of repositories on GitHub, since search results cannot exceed the 1000-repository boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples might yield a more precise outcome.

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to know how deep learning is involved in our real life, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and to explore other models in the wild, so that the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to implement their own heuristics for data selection: experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a plugin for GitHub, allowing researchers



and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal would be to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project, we investigated and examined the popularity of deep learning models via the number of repositories existing in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that incorporate machine learning clustering algorithms (e.g., K-Means) on high-resolution time-series data from commits.

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance, and contribution of deep-learning-related GitHub repositories and identified the factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization, and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving and changing rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train, and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing, and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

bull PyCharm

PyCharm 2019.1.3 (Professional Edition)

sect73 Appendix 3 Artefact Description 53

Build PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
macOS 10.14.6

• Anaconda

– jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0
  - numpy==1.14.0
  - statistics==1.0.3.5
  - ratelimit==2.2.1
  - requests
  - altair
  - matplotlib==2.2.2
  - selenium
- Git

Datasets

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext.json, ResNext WSL.json, RoBERTa.json, ShuffleNet v2.json, SqueezeNet.json, SSD.json, Tacotron 2.json, Transformer.json, U-Net.json, U-Net pytorch.json, vgg_nets.json, WaveGlow.json, Wide ResNet.json, fairseq.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

Before You Begin
Prerequisites
Install
Running
Test
High Level Description of all Modules & Datasets
Authors
License

STAMPER

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code:

PyCharm
Anaconda
Amphetamine on the Mac App Store: keep the Mac awake with this useful app (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

Git - https://git-scm.com/downloads
GitHub authentication token
Python 3.7 with pip
Jupyter Notebook 6.0.0
All external libraries used, listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt.

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run `python3 model_searcher.py` to get keyword-related repository metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions `call_api` and `call_api_dir`.

Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample case: in `main()`, change `keywords` in terms of interest. The resulting JSON file will then be output/bert.JSON. The sorting method can be customized in the functions `get_total_pages` and `request_ith_page`: `sort` can be `updated` or `stars`, and `order` can be `asc` or `desc`.

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

Install commands (from the Install step above):

    pip3 install --upgrade pip
    pip3 install -r requirements.txt

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder. Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity.
Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance.
Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution.
Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them into the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.
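The behaviour described above can be pictured with a small helper (the names here are illustrative, not the actual internals of test.py): given a list of URLs and a reachability checker, collect the unreachable ones for writing to unreachable_urls.txt.

```python
def find_unreachable(urls, is_reachable):
    """Return the URLs for which the checker reports failure.

    `is_reachable` is injected (e.g. a function issuing an HTTP HEAD
    request) so the logic can be exercised without network access.
    """
    return [u for u in urls if not is_reachable(u)]

def write_report(urls, is_reachable, path="unreachable_urls.txt"):
    # Record all unreachable links, one per line, as test.py does.
    bad = find_unreachable(urls, is_reachable)
    with open(path, "w") as f:
        f.write("\n".join(bad))
    return bad
```

Injecting the checker keeps the filtering logic testable independently of GitHub's availability.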

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data, and forked_time_location in three folders.

Instantiation

Once you have the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with parameters Model_name and Repository metadata subfolder. Then you can call this object with its relative data easily (from Model import bert and use bert as you go along).

Customize Keywords

In the module model_keyword.py, import your instantiation (e.g. lstm) and call add_keywords, e.g.:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection
2. Repository Search
3. (Optional) Data Selection
4. Data Visualization

Altair is used to draw elegant graphs.

Experiment Datasets Collected

Modules used in each step:

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py
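The Model handle described under "Customizing Your Own Search" can be pictured with a minimal stand-in (this class is a simplified sketch of the constructor convention Model(name, metadata_subfolder), not STAMPER's actual Model implementation):

```python
import os

class Model:
    """Sketch of a repository-metadata handle: Model(name, subfolder)."""

    def __init__(self, model_name, metadata_subfolder):
        self.model_name = model_name              # e.g. "bert tensorflow"
        self.metadata_subfolder = metadata_subfolder  # e.g. "desc_by_star"
        self.keywords = []

    def add_keywords(self, keywords):
        # Extend the code-search keywords attached to this model.
        self.keywords.extend(keywords)

    def metadata_path(self, root="output"):
        # e.g. output/desc_by_star/bert tensorflow.json
        return os.path.join(root, self.metadata_subfolder,
                            self.model_name + ".json")

bert = Model("bert tensorflow", "desc_by_star")
```

The point of the sketch is only the shape of the API: the constructor records which keyword and which metadata subfolder a model corresponds to, so later steps can locate its files.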

1. After Data Collection

output/
    asc_by_star/
        cnn tensorflow.json ... lstm tensorflow.json
    asc_general/
        bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    by_update_time/
        bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    desc_by_star/
        bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    desc_general/
        bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    pytorch_models/
        AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/
    bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv


3. After Data Selection (Optional)

filtered_repo/
    bert.json

pytorch_model_filtering/
    Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json ... vgg_nets.json

tensorflow_model_filtering/
    bert.json, lstm.json, ncf.json, resnet.json, transformer.json ... wide deep.json


4. Generated Graphs

graphs/
    contribution/
        change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
    maintenance/
        devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
    multi_variable/
        dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
    popularity/
        accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu. Stamper image from http://bestpriceforrubberstamps.com (license: free for personal use only).

Bibliography

a. GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

b. GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

c. GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

d. GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)


Acknowledgments

• Foremost, I would like to show my greatest gratitude to my supervisor, Dr Ben Swift. Over the years you have witnessed almost every step of my growth, from an ignorant second year to final year, from Beijing to Canberra; without your support, forgiveness, and encouragement I could not have made progress and grown.

You are just the way I remember from my first semester of second year: I can still remember when you sang Taylor Swift's song, and it remains vivid in my memory.

• I would like to extend my thanks to Prof. Weifa Liang, who offered me the permission code to enrol in this course and explained the study contract to me in detail.

• I shall extend my thanks to ANU CECS: thank you for believing in me and carrying me on the waves to lands I had never seen. Without you I could not have seen snow in Montreal, blossoms in Singapore, forests in Beijing, and, last but not least, the spectacular views of Canberra. Every place leaves so many extraordinary memories and new dreams for me.

• I also give thanks for every person I met during my four-year undergraduate life and the sleepless, hard times; they have shaped me into the person I am today. Every memory at ANU is the most precious commodity in my life.


Abstract

Deep learning, as a subfield of machine learning, has rapidly become a popular research area. However, little empirical work has previously been done to analyse deep learning model usage in public. GitHub is one of the largest web-based source code hosting communities, and it may be the best place to measure the popularity of deep learning models directly.

In this project, a tool called STAMPER is proposed and developed to aid researchers in the deep learning field in studying past trends on GitHub. All of the visualizations display repository information from GitHub at a high level. Our tool shows the evolution of deep learning models over time; in particular, we study the impact of some external features on deep learning models' popularity. We end with a summary of the current state of the art in deep learning model repository analysis and a critical discussion of challenges and directions for future research.

Key words: Software Engineering, Deep Learning, Popularity, Data Visualization


List of Abbreviations

• ML: Machine Learning

• DL: Deep Learning

• CNN: Convolutional Neural Network

• LSTM: Long Short-Term Memory

• NLP: Natural Language Processing

• Bert: Bidirectional Encoder Representations from Transformers

• NCF: Neural Collaborative Filtering

• ResNet: Residual Network

• Wide & Deep: Wide and Deep Learning


Contents

Acknowledgments vii

Abstract ix

List of Abbreviations xi

1 Introduction 1
1.1 Trace Deep Learning use through GitHub 1
1.2 Contribution 2
1.3 Report Outline 2

2 Background and Related Work 3
2.1 Background 3
2.1.1 Deep learning 3
2.1.1.1 TensorFlow 4
2.1.1.2 PyTorch 4
2.1.2 Deep learning models 5
2.1.3 Summarized Timeline 7
2.2 Public Code Repositories 8
2.2.1 Web-based hosting service 8
2.2.2 Measuring Popularity From GitHub 8
2.2.3 Extracting Messy Data in the Wild 9
2.2.4 Visualizing data in Repositories 9
2.3 Summary 10

3 STAMPER Design and Implementation 11
3.1 Overview 11
3.2 Data Collection 12
3.3 Repository Search 13
3.4 Data Selection 14
Example 15
3.5 Construct the Visualizations 16
3.6 Summary 18

4 STAMPER in Action 19
4.1 Popularity of Deep Learning Models in GitHub 19
4.1.1 Popularity Feature Selection 19
4.1.2 Past and Current Status: A Full Integration 23
4.1.3 RQ1: How has the popularity of model changed over time? A closer look at the deep learning models 26
4.1.4 RQ2: How popularity varies per model 29
4.1.5 RQ3: Does the popularity of models relate to other features? 30
4.2 Contribution of Deep Learning Models in GitHub 34
4.2.1 Collaborative Contribution 34
4.2.2 RQ1: After forking, do developers change the codebase? 36
4.3 Maintenance of Deep Learning Models in GitHub 39
4.3.1 RQ1: How long has it been in existence? 39
4.3.2 RQ2: Do old models have more issues compared to new models? 41
4.3.3 RQ3: Are they well maintained? 42
4.4 Summary 42

5 Discussion And Future Work 45
5.1 Discussion 45
5.1.1 Data in the wild: Limitation and Improvement 45
5.1.2 Extensibility and Open-Source Software 45
5.2 Future Work 46
5.2.1 Social Network Analysis in GitHub 46
5.2.2 Trend Detection using Commitments Timestamp 46

6 Conclusion 47

7 Appendix 49
7.1 Appendix 1: Project Description 49
7.1.1 Project Title 49
7.1.2 Supervisors 49
7.1.3 Project Description 49
7.1.4 Learning Objectives 49
7.2 Appendix 2: Study Contract 49
7.3 Appendix 3: Artefact Description 52
7.3.1 Code Files Submitted 52
7.3.2 Program Testing 52
7.3.3 Experiment 52
Hardware 52
Softwares 52
Other 53
Datasets 53
7.4 Appendix 4: README 54

List of Figures

2.1 git2net [Gote et al., 2019] 10

3.1 Overview of STAMPER 11
3.2 Data Selection 14
3.3 Store in Local Disk 14
3.4 Overall: Construct the Visualizations 16
3.5 Examine Uniqueness after Forking 18

4.1 Repository Watching [Git b] 20
4.2 Star Sort Menu [Git a] 20
4.3 Popularity Metric 21
4.4 Repositories with Forks 24
4.5 Repositories without Forks 24
4.6 Repository Trend in GitHub For Each Model 25
4.7 Creation Time vs Stars 26
4.8 Number of Forks Related to Repositories in Deep learning Model Development 28
4.9 Star vs Contributors 30
4.10 Star vs Development Time 31
4.11 Star vs Open Issues 31
4.12 Star vs Entropy Value 32
4.13 Collaboration Entropy 35
4.14 Percentage of Forked Repositories Unique From Origin (Boxplot) 36
4.15 Repository Uniqueness Distribution () 37
4.16 Repository Change Statistic 38
4.17 Development Time Boxplot 40
4.18 Development Time vs Number of Open Issues 41
4.19 Open Issues vs Number of Repository 43


List of Tables

2.1 Deep Learning History 4
2.2 Timeline 7

3.1 Repositories Related to Tensorflow 17

4.1 Popularity metric for repositories 21
4.2 Stars Comparison 29
4.3 Forks Comparison 29
4.4 Percentage of one contributor development for DL related repositories 32
4.5 Sample Contributions to One Repository 34
4.6 Repository Development Time Stat 40
4.7 Repository Open Issue Statistics 41
4.8 Descriptive statistics on percentage of Wiki Existence 42


Chapter 1

Introduction

1.1 Trace Deep Learning use through GitHub

GitHub is one of the largest web-based hosting communities in the world and contains a rich source of data facilitating different software engineering projects. This massive database allows developers and researchers to publish their work.

GitHub can also be used for popularity measurement. Developers can construct their social networks in GitHub in multiple ways, such as joining an organisation or starring a repository they are interested in. These features make repositories in GitHub easily accessible and make it the best place to conduct empirical studies.

The deep learning research community is continually evolving. Not limited to academic research, deep learning is allowing businesses to use data to teach computers how to learn. As a result, a number of machine learning and deep learning startups have come into existence worldwide.

The development of deep learning models sometimes suffers from software engineering problems. Studies of the quality of deep-learning-related projects are sparse, and few researchers focus on usage outside academia. With the expanding range and deepening degree of deep learning use, we would like to test whether developers keep up with the latest deep learning upsurge in academia.

A tool to extract the metadata of historical repositories, with integrated visualisation, based on the vast corpus of GitHub repositories is not currently available. To fill this gap, we present our tool STAMPER. It is implemented in Python and supports GitHub repository metadata extraction based on repository keyword search and code segment search using the GitHub API. It can also analyse the modifications of forked repositories relative to the original repository and capture the repository difference.
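The keyword search just described goes through GitHub's repository search API. As a rough illustration (the function name and defaults here are ours, not STAMPER's), a query URL for one keyword can be assembled as follows; authenticated requests get a much higher rate limit than anonymous ones:

```python
from urllib.parse import urlencode

def build_search_url(keyword, sort="stars", order="desc", page=1):
    """Build a GitHub repository-search URL for one keyword.

    GitHub's search endpoint accepts `sort` (e.g. "stars", "updated")
    and `order` ("asc" or "desc"), matching the sorting options the
    tool exposes.
    """
    params = urlencode({"q": keyword, "sort": sort,
                        "order": order, "page": page})
    return "https://api.github.com/search/repositories?" + params

url = build_search_url("bert tensorflow")
```

Paging through results then just means incrementing `page` until the reported total is exhausted.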

We study the historical trend of deep learning models and frameworks from repositories in GitHub, and further demonstrate how repository metadata can be used to see the deep learning trend. Our project provides a novel method to study deep learning frameworks and models from a historical aspect; in the meanwhile, our work creates a new aspect of empirical study in deep learning.


1.2 Contribution

• We introduce STAMPER, a Python tool that can be used to extract the metadata that characterize historical open source projects from GitHub based on researchers' interest.

• We utilize STAMPER in a case study analysing the usage of deep learning models in TensorFlow. We further demonstrate how we extract a rich set of features and establish the connections between those features and popularity.

• We create and mine a new dataset for further research use.

1.3 Report Outline

The rest of this report is structured as follows. Chapter 2 provides the timeline for deep learning frameworks and deep learning models; some background knowledge is presented there, and previous works on software mining tools and GitHub visualizations are recorded in that chapter as well.

Chapter 3 provides an overview of our work and introduces our proposed methodology to extract the metadata from the GitHub repository.

Chapter 4 presents a case study in which we use our tool to extract deep learning related repositories from GitHub and trace the landscape of those hot deep learning models. The visualizations generated using our tool can aid researchers in understanding past and current trends easily.

Referring back to the critical questions posed, we conclude our project and highlight the next steps in Chapter 5.

Chapter 2

Background and Related Work

Software repository mining lets researchers study the historical trends of software engineering practice effectively. The use of repository mining is based on the use of web hosting services, and there exist multiple approaches to conducting such studies. In the first section we introduce some background knowledge on web-based hosting services. We then introduce some popular deep learning frameworks in Section 2.1.1. Finally, we detail some previous works in Section 2.2 that conduct software repository mining and identify the main patterns in exploring repository popularity.

2.1 Background

2.1.1 Deep learning

Different from traditional machine learning, deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [LeCun et al., 2015]. Methods in deep learning can dramatically improve human life in multiple aspects, from image classification and speech recognition to machine translation and even autonomous cars.

The usage of deep learning has grown tremendously in the past few years with the introduction and improvement of hardware availability (GPUs), big data, and cloud providers such as Amazon Web Services (AWS). Not limited to this, large companies create research teams to develop their own deep learning algorithms and integrate them into frameworks such as PyTorch and TensorFlow. In the meanwhile, they share their datasets and model training tutorials, helping startup companies build state-of-the-art products with minimal effort and time.

Though deep learning has become a hot topic only in recent years, the history behind deep learning has been evolving since the 1950s, as summarised in Table 2.1 [Subramanian, 2018].

Previously, developers and researchers required expertise in C++ or CUDA to implement deep learning models and algorithms. However, thanks to large technology companies and organisations, people with knowledge of a scripting language (Python) can now also build their own deep learning algorithms.


Technique                        Year
Neural network                   1943
Backpropagation                  1960s
Convolutional Neural Network     1979
Recurrent neural network         1980
Long Short-Term Memory           1997

Table 2.1: Deep Learning History

In Section 2.1.1.1 and Section 2.1.1.2 we discuss some state-of-the-art deep learning frameworks used today, both in academia and industry.

2.1.1.1 TensorFlow

TensorFlow is a machine learning system that operates at large scale [Abadi et al., 2016]. Initially developed by the Google Brain team under the name DistBelief, it saw its first public release in November 2015; TensorFlow then released its official 1.0.0 version on 11 February 2017, introducing the ability to run on multiple CPUs and GPUs.

Currently, Google uses this framework in numerous ways to improve its search engine, translation, recommendation systems, and image recognition and captioning. The TensorFlow library contains different APIs to build deep learning models at scale, such as ResNet, CNN, or LSTM. The architecture of this framework gives developers the flexibility to experiment with new optimisations and Google's training algorithms. It currently supports numerous applications from front end to mobile; this flexibility and scalability make it one of the most popular deep learning frameworks in the world.

2.1.1.2 PyTorch

Based on the Torch library, PyTorch can be seen as a Python front end to the Torch engine, which provides the capability to define mathematical functions and compute gradients. Similar to TensorFlow, PyTorch has good GPU and CPU support and is primarily developed by Facebook's artificial intelligence research group. The official initial release of PyTorch was in October 2016; its dynamic computation capability gives developers greater flexibility in building complex architectures.

However, unlike TensorFlow, which can seamlessly integrate into real industrial applications, PyTorch was primarily developed by researchers and scientists, and it is not easy or recommended to use in production in certain scenarios.


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and pick up the speed of project development to survive in this keen competition. Winning trust from the public with a service of high quality is thus required.

Initially, we would have liked to conduct our research on the latest model stores, such as AWS SageMaker, Azure Machine Learning service, Wolfram Neural Net Repository, and ONNX. However, the algorithms and datasets used there are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand with the public. To avoid those problems and gain a more in-depth insight into usage in society, we choose a framework with models of greater transparency and substantial usage: TensorFlow.

API Referenced

Since this project involves a range of deep learning models, we began by developing an understanding of how to build and train neural networks. The construction of networks may rely on the layers APIs provided by Keras and TensorFlow. In the meanwhile, TensorFlow has recently introduced estimator APIs to simplify the procedures of training, evaluation, prediction, and export.

Convolutional Neural Network (CNN)

The Convolutional Neural Network is one of the most established algorithms among all the deep learning models and one of the most dominant in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons in a grid pattern. Typically, a CNN consists of three types of layers: fully connected layers, convolution layers, and pooling layers. Convolution and pooling layers conduct the feature extraction, and the fully connected layer maps the extracted features into the final output. Layers are interconnected, and thus the extracted features are transferred layer by layer.

Long short-term memory (LSTM)

Different from traditional neural networks, which cannot memorise previous data, a special kind of recurrent neural network, Long Short-Term Memory (LSTM), provides researchers with an effective way to perform persistent-data tasks and is capable of learning long-term dependencies [Hochreiter and Schmidhuber, 1997]. Specifically, an LSTM cell uses input, forget, and output gates to control what information is written to, kept in, and read from its cell state. It is capable of learning dependencies from historical data and making predictions from the information remembered previously. Inside an LSTM, instead of using the

The TensorFlow official models are chosen in our project. The TensorFlow official models contain a collection of deep learning models together with their high-level APIs (https://github.com/tensorflow/models/tree/master/official).


linear layer, there is a small network inside the cell which performs its function independently.

LSTM is one of the most common uses of recurrent neural networks. This model is generally used with sequential data and can solve language modelling problems, such as NLP tasks (word embedding, encoders).
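The gating behaviour described above can be shown as a single LSTM cell step in NumPy (an explanatory sketch with random weights, not the tf.keras.layers.LSTMCell implementation): the input, forget, and output gates decide how the cell state c and hidden state h are updated at each time step.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W: (4n, d), U: (4n, n), b: (4n,)."""
    n = h.shape[0]
    z = W @ x + U @ h + b          # stacked pre-activations for all gates
    i = sigmoid(z[:n])             # input gate: how much new info to write
    f = sigmoid(z[n:2 * n])        # forget gate: how much old state to keep
    o = sigmoid(z[2 * n:3 * n])    # output gate: how much state to expose
    g = np.tanh(z[3 * n:])         # candidate cell values
    c_new = f * c + i * g          # blend old state with new candidates
    h_new = o * np.tanh(c_new)     # gated view of the state becomes output
    return h_new, c_new

rng = np.random.default_rng(0)
d, n = 3, 4                        # toy input and hidden sizes
W = rng.normal(size=(4 * n, d))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(rng.normal(size=d), h, c, W, U, b)
```

Running this step over a whole sequence, carrying (h, c) forward, is what lets the network retain information across long time spans.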

Residual Network (ResNet)

One of the problems deep learning models face is that once the number of layers increases past a certain point, accuracy stops improving. This problem motivated modern network architectures such as the residual network (ResNet) and Inception, which address it through residual connections.

ResNet normally solves the problem described above by fitting a residual mapping via a shortcut connection. Each ResNet block contains a series of layers and a shortcut connection component.

Bidirectional Encoder Representations from Transformers (Bert)

Bert is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al., 2018].

It was first released in google-research/bert on GitHub on the 1st of November 2018, handling a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications; it aims to predict the relationships between sentences by analysing the whole sentence holistically [Devlin et al., 2018].

Attention is all you need (Transformer)
Many problems in deep learning can be seen as a form of sequence-to-sequence mapping and can be solved using a common architecture: the encoder-decoder architecture.

The encoder and decoder each consist of a stack of identical layers, and each layer consists of two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network [Vaswani et al. 2017].

Neural Collaborative Filtering (NCF)
Neural Collaborative Filtering (NCF) is a neural network architecture which utilises the non-linearity of neural networks to build recommendation systems [He et al. 2017]. It demonstrates that matrix factorisation can be interpreted as a special case of neural collaborative filtering. To add additional non-linearity, the model introduces a multi-layer perceptron module alongside the generalised matrix factorisation (GMF) layer.

Wide and Deep Learning
Since linear models are not good at generalising across unique features, deep models are introduced to solve this problem. Deep models can learn an embedding vector for every query and are then able to generalise by coupling the items and queries.

To overcome over-generalisation, the Google research team introduced the Wide & Deep Learning model, which jointly trains wide linear models and deep neural networks to combine the benefits of memorisation and generalisation for recommender systems [Cheng et al. 2016].

2.1.3 Summarized Timeline

Model Name     Year Introduced
CNN            1980s
LSTM           1997
ResNet         2015
Wide & Deep    2016
NCF            2017
Transformer    2017
Bert           2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest hosts of open-source projects under distributed version control (DVCS) [Gousios et al. 2014]. A distributed version control system enables contributors to submit a set of changes and integrate them into the main development branch. The use of git is driven by pragmatic needs: it combines the advantages of version control and collaborative development.

GitHub can also offer insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project. Thus, the number of stars can reveal popularity from a software development research perspective, and it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al. [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth using time-series metadata derived from 2,279 accessible GitHub repositories. They also found that slow growth is more common for overpopulated application domains and for old repositories. Moreover, they conclude that the most common domains on GitHub are web libraries and frameworks and non-web libraries and frameworks.

Building on their work, we would like to examine whether deep learning framework repositories follow a similar popularity trend pattern. At the same time, we would also like to study whether a relationship exists between three factors: forks, stars and watchers.

Predicting the Popularity of GitHub Repositories
In the same year, Borges et al. [2016a] published another paper about predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regression to predict the number of stars of GitHub repositories so that project owners can see how their projects are performing in the open-source community.

Their study examines the correlation between the popularity of a repository and the number of stars it owns. The study also reports a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent
GHTorrent [Gousios and Spinellis 2012] is designed to conduct independent repository mining, distributed through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all the repositories, and the resulting dataset can be used towards answering multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving raw data from the results returned by the REST API. However, their tool does not have the ability to visualise the metadata or offer trend analysis at a high level.

MetricMiner
A similar tool is MetricMiner [Sokol et al. 2013], a web application that supports researchers in mining software repositories, doing data extraction and performing statistical inference on the collected data. This tool automatically clones the repository, processes the metadata and stores the data in the cloud, which gives it good scalability and fast query answering without users installing any software on their localhost.

GitcProc
GitcProc [Casalnuovo et al. 2017] is a tool that uses regular expressions to extract the changed lines within a repository to facilitate answering project evolution questions. GitcProc can retrieve and summarise global statistics of project metrics, including commits, commit dates and contributors. It can measure how many changes have taken place in Java projects and is also able to locate the changed files.

RepoVis
RepoVis [Feiner and Andrews 2018] is a new tool which provides visual overviews for software maintained in Git repositories. RepoVis is a client-server web application providing a full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adopts its searchable functionality for GitHub, associated with a code-based search. All the visualisations are written in SVG format.

2.2.4 Visualizing Data in Repositories

Chronos
CHRONOS [Servant and Jones 2013] is a software tool that enables the visualisation of historical changes inside software source code. The tool implements a zoomable user interface as a visualisation of the actual code, which supports developers from a high-level view (pattern recognition) down to a low-level view (reverse engineering) using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise the complete history of changes, including the revisions that modified them. Inspired by this tool, our project uses this visualisation method to track the historical change of the popular


Figure 2.1: git2net [Gote et al. 2019]

trend related to the keyword specified by users in GitHub.

GEVOL
Collberg et al. [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph drawing technique, to deduce a better understanding of a program from its development history; the visualisations are displayed using a temporal graph visualiser.

This system aids in the discovery of the structure of a system and provides the user with a new way to discover the evolution of a program by visualising the change of the system. It extracts information about Java programs stored within a CVS version control system into three types of graphs: inheritance, control-flow and call graphs.

git2net
git2net [Gote et al. 2019] is a software package that facilitates the extraction of co-editing networks from git repositories. Similar to GEVOL, it uses text mining techniques to analyse the history of modifications within files. Beyond that, the authors address the importance of studying the social network in GitHub and give the reader a broader view of the application of graph-based data analysis and modelling. Their tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter, we detailed the web-based hosting service we selected (GitHub) and presented the concept of deep learning with two popular frameworks, together with state-of-the-art neural network models. In the next chapter, we will elaborate on how we design and implement STAMPER.

Chapter 3

STAMPER: Design and Implementation

In this chapter, we outline our design and implementation for data extraction, and then we detail the metrics we use to estimate the trends of deep learning framework and model usage on GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

Figure 3.1: Overview of STAMPER (1. data collection via the Git project search API; 2. repository search over the collected keywords and model names; 3. optional data selection via the Git code search API; followed by local data visualisation)


Data Collection
We first collect all the repository metrics through the GitHub API. This step allows us to extract the history of all the repositories related to the keyword and record the metadata for each repository (repository-based search).

Repository Search
As each repository contains multiple contributors and forks, we subsequently extract and analyse whether changes were made, based on the size information, and calculate a collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the forked repositories' information to create visual representations.

Data Selection
We have implemented a selector allowing users to exclude repositories not related to the desired ones. The selector summarises the frequency counts for the keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis
Each forked repository may be related to re-development and modification, so modification of forked repositories is included in our project. For all changed repositories, the size difference is analysed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to analyse and manipulate the data in depth and even run statistical tests on the dataset. To better understand those metrics, we divided them into multiple categories. For the attributes that are not primary data from the GitHub API, we explain them in the data expansion part and label them as [Data Expansion] in Table 3.2.

To maximise the GitHub API request rate, the user is required to authenticate by entering an OAuth2 token at the start of the program. After authentication, the user can make up to 5,000 requests per hour; otherwise, the rate limit only allows up to 60 requests per hour [Git d].
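The request pattern can be sketched with the standard library. This is an illustrative sketch of an authenticated call to GitHub's repository search endpoint, not STAMPER's actual source; the token value is a placeholder.

```python
import json
import urllib.parse
import urllib.request

GITHUB_TOKEN = None  # set to a personal OAuth2 token for the 5,000 requests/hour limit

def build_request(keyword, token=None, page=1):
    """Build an (optionally authenticated) GitHub repository-search request.
    Unauthenticated clients are limited to 60 requests per hour."""
    query = urllib.parse.urlencode(
        {"q": keyword, "sort": "stars", "page": page, "per_page": 100})
    req = urllib.request.Request(
        "https://api.github.com/search/repositories?" + query,
        headers={"Accept": "application/vnd.github.v3+json"})
    if token:
        req.add_header("Authorization", "token " + token)
    return req

def search_repositories(keyword, token=None):
    """Execute the search and return the decoded JSON payload
    (total_count plus a list of repository metadata)."""
    with urllib.request.urlopen(build_request(keyword, token)) as resp:
        return json.load(resp)

# Building a request does not hit the network, so it is safe to inspect:
req = build_request("lstm tensorflow", token="abc123")
print(req.get_header("Authorization"))  # token abc123
```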


Type           Metadata

Contributor    contribution: int [Data Expansion]
               login (user name): String
               type (user / organization): String
               contributors_url

Repository     created_at
               description
               full_name
               language
               size

Popularity     fork: Boolean
               forks: int
               forks_url
               stargazers_count
               watchers_count
               unique_repos [Data Expansion]

Owner          id
               login (username)
               type

Maintenance    has_issues: Boolean
               has_wiki: Boolean
               open_issues: int
               pushed_at
               updated_at
               score

Table 3.2: Repository metadata collected from the GitHub API

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deep crawling operations to collect raw data.

• Contribution
One repository generally involves multiple developers conducting software development. The project owner is not necessarily the person who contributes the largest amount of code, and the amounts contributed by different developers potentially differ. As a result, we further track this information using the GitHub API and record the number of contributions each developer made to each repository.

• Unique_repos
Popular repositories may have numerous forked repositories owned by other users, and the reasons behind the forking behaviour vary. Our research would like to explore whether users conduct subsequent development based on the original codebase. By comparing the size of a forked repository (Fi) and the original repository (O), we obtain all the forked repositories with a change of size (c):

Fi + c = O    (3.1)
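Rearranging, c = O - Fi, so a non-zero c flags a modified fork. A minimal sketch (the sizes are hypothetical, for illustration only):

```python
def size_change(original_size, fork_sizes):
    """For each fork F_i, the change c_i satisfies F_i + c_i = O,
    i.e. c_i = O - F_i; a non-zero c_i marks a modified fork."""
    return [original_size - f for f in fork_sizes]

# Hypothetical sizes (in KB) of one original repository and three of its forks
changes = size_change(1500, [1500, 1320, 1810])
changed_forks = [c for c in changes if c != 0]
print(changes, len(changed_forks))  # [0, 180, -310] 2
```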

3.4 Data Selection

Figure 3.2: Data Selection

Figure 3.3: Store in Local Disk


Figure 3.2 represents our method to search API usage in DL model-related repositories. GitHub provides a REST API to examine whether a repository contains a targeted API.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. The appearance frequency of the user-specified API is embedded directly within the returned result and can be identified in our program, corresponding to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. This approach also allows users to construct knowledge of API usage in GitHub-related repositories from a high-level perspective.

Meanwhile, we also provide multiple options for selecting APIs, which are stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the models related to TensorFlow and the number of repositories we retrieved from GitHub.
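The per-repository aggregation step can be sketched as follows. The repository names are invented, and the item structure merely mimics the `items` list of GitHub's code-search response; this is not STAMPER's actual code.

```python
from collections import Counter

def count_api_usage(search_items):
    """Aggregate code-search results into per-repository appearance counts
    for the searched API keyword. `search_items` mimics the `items` list
    returned by GitHub's /search/code endpoint."""
    counts = Counter()
    for item in search_items:
        counts[item["repository"]["full_name"]] += 1
    return dict(counts)

# Hypothetical response items (not real data)
items = [
    {"repository": {"full_name": "alice/bert-demo"}},
    {"repository": {"full_name": "alice/bert-demo"}},
    {"repository": {"full_name": "bob/resnet-play"}},
]
print(count_api_usage(items))  # {'alice/bert-demo': 2, 'bob/resnet-play': 1}
```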

Example: the Keras applications library provides users with the ability to load deep learning models and instantiate a model with default weights. Using ResNet as an example:

• With pre-defined models
A user may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, and this model could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they could all be used as sample keywords in STAMPER, such as:

keras.applications.resnet.ResNet50
keras.applications.resnet.ResNet101
keras.applications.resnet.ResNet152
keras.applications.resnet_v2.ResNet50V2
keras.applications.resnet_v2.ResNet101V2
keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to further explore the components inside the ResNet class, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, the data selection heuristic varies: deep learning users and experts can define their searches according to their interests and preferences.

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, (iii) maintenance analysis.

The process of generating the visualisations from the three perspectives is illustrated in Figure 3.4. Chapter 5 gives an example of our collected repository metadata for deep learning models.

Figure 3.4: Overall Construction of the Visualizations (entities are functionally mapped to popularity-related, contribution-related and maintenance-related visualisations)

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped in weeks (with forks)

• Repository Creation Time vs Stars

Contribution

To further use the forking information, STAMPER supports the comparison between original repositories and their forked repositories. The work could be further extended by visiting each forked repository's URL and tracing the commit history.

As shown in Figure 3.5, the entity (E) we searched for in GitHub may have multiple related repositories (Ri) with their corresponding forked repositories (Fi). Among the forked repositories, we denote a changed forked repository by Ci.

To examine whether changes exist in forked repositories, and the difference between multiple entities, we calculate the difference using the equation down


Keyword                     Total of Repository (including Forks) Collected    Total of Original Repository Collected
ResNet tensorflow           6129                                               339
Bert tensorflow             13734                                              106
CNN tensorflow              39765                                              1000
LSTM tensorflow             19572                                              1000
Transformer tensorflow      7188                                               145
Wide and deep tensorflow    324                                                39

Table 3.1: Repositories Related to TensorFlow


below. Uniqueness percentage distributions are composed of all the percentages pi, each corresponding to its original repository Ri:

pi = Σ Ci / Σ Fi    (3.2)
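The computation behind Equation 3.2 is a simple ratio per original repository. A sketch with hypothetical fork counts (invented for the example):

```python
def uniqueness_percentage(changed_forks, total_forks):
    """p_i = (sum of changed forks C_i) / (sum of forks F_i) for one
    original repository R_i; defined as 0.0 when there are no forks."""
    return changed_forks / total_forks if total_forks else 0.0

# Hypothetical (changed forks, total forks) pairs for one entity's repositories
repos = [(3, 10), (0, 4), (5, 5)]
distribution = [uniqueness_percentage(c, f) for c, f in repos]
print(distribution)  # [0.3, 0.0, 1.0]
```

The resulting distribution is what the boxplots and histograms below summarise per entity.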

Figure 3.5: Examine Uniqueness after Forking (an entity E has related repositories; each forked repository is marked as changed or not)

• Percentage of Forked Repositories Unique from Origin (Boxplots)

• Uniqueness percentage distribution for Each Entity (Histograms)

• Entropy Distribution for Each Entity (Histograms)

Maintenance

• Development Time Boxplot For Each Entity

• Open Issues Distribution For Each Entity

3.6 Summary

In this chapter, we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories together with their forked repositories related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. We also introduced and analysed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field: deep learning models are continually evolving and being built, trained and deployed by researchers. Our tool is available for analysing such changes. We collected the historical information stored in GitHub and extracted the metadata of each repository using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development is a battlefield without the smoke of gunpowder: researchers, companies and developers are all trying to dominate the conversation in deep learning. There exists a variety of models to think in, but there is no common bridge connecting those ideas. Historical data in GitHub are obscure and hard to find, and as a result many developers, especially experienced ones, remain in their original comfort zone. With our study, we hope to shed some light on deep learning and highlight a few suggestions for the public.

This section aims to answer questions related to both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, because there are few studies about popularity in the GitHub ecosystem, there is no standardised feature to measure it. We analyse some potential features of each repository and make the hypothesis that popularity is strongly related to the stars each repository owns.

This decision is justified in the following section with more background on GitHub.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activities in a repository they are watching; watching does not make them collaborators [Git b]. A watcher can watch


Figure 4.2: Star Sort Menu [Git a]

a repository to receive notifications for new pull requests or issues that are created. Watchers can indicate how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

• Stars
Starring a repository makes it easy for a user to keep track of a repository they are interested in. The starred repository will appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks
Forks are created when a user would like to make a copy of a repository. The user can fork a repository to suggest changes or to use it as a basis for a new project.

Based on the data we gathered in the data collection stage, we extract the watchers_count, stargazers_count and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarise the 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star     forks_count    watchers_count    model name
17940    4661           17940             Bert
12405    3637           12405             Bert
5263     1056           5263              Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, number of forks and number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we should not assume normality and instead consider a rank-based measure.


Spearman Correlation Coefficient
Definition
The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman 2011]. This statistical method allows us to test for a rank-order relationship between two numerical or ordinal variables associated by a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

• Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

• Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Set α = 0.05. The p-values p1, p2 and p3 are all less than α, and from the calculation above we also find strong positive correlations, with values of coef1 = 0.875, coef2 = 1.0 and coef3 = 0.875 respectively.

This means it is very unlikely (at 95% confidence) that the variables are uncorrelated, and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of the report, we consider the number of stars to be the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and long short-term memory network (LSTM), shown in the figures below, appear to be two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from these models with longer histories, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as we described in the background section.

The model development community has recently seen the release of multiple powerful models, which are treated as baselines for building new models. However, for many new models, usage does not grow in abundance, as with the Wide and Deep and NCF models.


Figure 4.4: Repositories with Forks (accumulated number of repositories created, including forks, 2015-2019, per model)

Figure 4.5: Repositories without Forks (accumulated number of repositories created, 2015-2019, per model)


Figure 4.6: Repository Trend in GitHub For Each Model


Figure 4.7: Creation Time vs Stars

A fork is another copy of a repository. A forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, we can surprisingly see a considerable difference between the total number of repositories created with forks and the total number created without forks: most of the repositories related to deep learning models are not original. This indicates that a considerable proportion of developers remain in the studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, differently from the previous summarising method, we study the popularity of deep learning models over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average numbers of stars and of repositories created. Let us examine this thought using the data: in 2017 the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued rising to an upper level, where it remains.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have some of the most important and most significant communities in the deep learning field. These networks are essential in both computer vision and NLP, and they constitute an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer
As the title indicates, ResNet and Transformer usage has significantly improved in the last two years. Differing from earlier structures, both modify the original architecture and significantly improve results in computer vision and translation tasks.

Rising Star: Bert

However, no model comes with perfection: the Transformer itself can be extended into many variants, and BERT is one of those.

The current trends, as depicted in the graphs, support the conclusion that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with when the model came into existence, but our data tell a different story.

NCF draws the least attention in the GitHub community. This shows that there is no fixed relationship between popularity (i.e. stars) and creation time: past trends are no guarantee of future ones, and it is possible for the momentum toward increasing attention for a specific model to flatten out or reverse itself.

Similarly for the Wide and Deep model: though Google provides full documentation and a tutorial for this model, published in 2016, we still take a pessimistic view of it. Moreover, the previous data also confirm that there is no significant rise in the use of this model.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (fork distribution histograms per model)


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model, and as presented in Table 4.2 and Table 4.3 we can see the following.

Model Name       Mean      STD       Min    25%    50%    75%    Max
Bert             498.65    2196.3    0      1      8      43     17940
CNN              106.84    611.97    2      3      8      32     13882
LSTM             48.82     214.22    0      1      2      13     2703
NCF              77        129.91    1      2      3      115    227
ResNet           46.88     221.43    0      0      1      8      2980
Transformer      186.79    1155.87   0      0      4      21     12408
Wide and Deep    16.23     36.80     0      0      1      8      146

Table 4.2: Stars Comparison

Model Name       Mean         STD          Min    25%    50%    75%    Max
Bert             128.214953   585.926617   0.0    0.0    1.0    16.5   4661.0
CNN              40.710       252.713617   0.0    1.0    4.0    14.0   6274.0
LSTM             17.793       71.956709    0.0    0.0    1.0    5.0    968.0
NCF              34.333333    58.603185    0.0    0.5    1.0    51.5   102.0
ResNet           17.442478    93.754994    0.0    0.0    0.0    3.0    1442.0
Transformer      53.518797    336.103826   0.0    0.0    1.0    6.0    3637.0
Wide and Deep    7.282051     16.364192    0.0    0.0    0.0    2.5    71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52) and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79) and ResNet (17.44).

Kruskal-Wallis Test: the Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The 7 models' distributions are the same.

• H1: The 7 models' distributions are different.

from scipy.stats import kruskal

stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                  dfNcf["star"].tolist(), dfResnet["star"].tolist(), dfTransformer["star"].tolist(),
                  dfWideDeep["star"].tolist())
print(stat, p)
# >> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' star distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories existing in GitHub. This unexpected result can be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or Bert model requires a large amount of time and effort, yet developers still show their interest in these novel deep learning models by starring and forking.

[Scatter plot: number_of_contributors vs. stargazers_count, colored by model]

Figure 4.9: Star vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy value, respectively.


[Scatter plot: develop_duration vs. stargazers_count, colored by model]

Figure 4.10: Star vs Development Time

[Scatter plot: open_issues vs. stargazers_count, colored by model]

Figure 4.11: Star vs Open Issues

[Scatter plot: entropies vs. stargazers_count, colored by model]

Figure 4.12: Star vs Entropy Value

Number of Contributors

From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, p-value ≤ 0.01). Across all the repositories, the top-3 models with the most stars per contributor are CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor) and Bert (155.0 stars/contributor).
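The Spearman tests used throughout this section can be reproduced with scipy.stats.spearmanr; as a dependency-free illustration, the coefficient is simply the Pearson correlation of the ranks. A minimal pure-Python sketch (the data here is hypothetical):

```python
def rankdata(xs):
    """Assign ranks (1-based), averaging ranks over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend the block of tied values
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))  # -> 1.0 (perfectly monotonic)
```

In practice, scipy.stats.spearmanr(stars, contributors) returns both ρ and the p-value reported above.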

Model | Percentage of One-Contributor Development (%)
Bert | 74.53
CNN | 83.3
LSTM | 85.9
NCF | 100
ResNet | 90.26
Transformer | 81.20
Wide and Deep | 89.74

Table 4.4: Percentage of one-contributor development for DL related repositories


Development Time

From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value ≈ 0). This suggests that the longer a model has been under development, the more stars it has (i.e., the model becomes more popular). The top-2 models with the longest development durations are LSTM and CNN.

Open Issues

From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186, p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesize that the more popular a repository becomes, the more issues it will have. We further investigate this correlation in the following section.

Entropy

From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development work is not distributed evenly.

We can confirm this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. Here we introduce an information-theoretic approach to test whether their contributions are even or not.

Entropy. In particular, we compute the entropy H of each repository, defined as:

p_i = c_i / Σ_i c_i   (4.1)

H = −Σ_i p_i log2(p_i)   (4.2)

where i indexes the i-th contributor, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example: its contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated.

Total = 174 + 36 + 4 = 214   (4.3)

p1 = 174/214,   p2 = 36/214,   p3 = 4/214   (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.7826   (4.5)

name | contribution
dragen1860 | 174
ash3n | 36
kelvinkoh0308 | 4

Table 4.5: Sample Contributions to One Repository
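The calculation in Equations (4.3)-(4.5) can be sketched in a few lines of Python (note that recomputing with the values in Table 4.5 gives H ≈ 0.7826):

```python
from math import log2

def repo_entropy(contributions):
    """Collaboration entropy H (Equations 4.1 and 4.2) from per-contributor counts."""
    total = sum(contributions)
    return -sum((c / total) * log2(c / total) for c in contributions if c > 0)

# Table 4.5: dragen1860 = 174, ash3n = 36, kelvinkoh0308 = 4
h = repo_entropy([174, 36, 4])
print(round(h, 4))  # -> 0.7826
```

Two contributors with equal contributions give the maximum two-person entropy of 1.0, while a single-contributor repository gives 0.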

The resulting distribution of entropy across all the repositories can be used to determine whether a repository is developed unevenly. The lower the entropy, the higher the separation, which indicates more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy value for all models. From these figures we can see that most of the repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


[Histograms: count of records vs. entropy (binned), one panel per model]

Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models.

[Boxplot: unique_percent by model]

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that allows people to see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. To provide a more detailed analysis, we can see at a glance not only that changes are rarely made after forking, but also that most changed


[Histograms: count of records vs. uniqueness percentage (binned), one panel per model]

Figure 4.15: Repository Uniqueness Distribution (%)

[Histograms: count of records vs. mean lines changed (binned), one panel per model]

Figure 4.16: Repository Change Statistic


repositories differ in size from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.
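As a sketch of how such an "unchanged after forking" percentage can be derived from the metadata STAMPER collects, one can compare each fork's recorded size against the origin's (the helper and the numbers below are hypothetical illustrations, not the study's exact procedure):

```python
def pct_unchanged(fork_sizes, origin_size):
    """Share (%) of forks whose recorded repository size still equals the origin's.

    `fork_sizes` and `origin_size` stand in for the size field available in
    each repository's GitHub API metadata (hypothetical values here).
    """
    unchanged = sum(1 for s in fork_sizes if s == origin_size)
    return 100.0 * unchanged / len(fork_sizes)

# Hypothetical sizes for four forks of a 1200 KB origin repository
p = pct_unchanged([1200, 1200, 1250, 1200], 1200)
print(p)  # -> 75.0
```

Aggregating this percentage per model yields a distribution like the one plotted in the boxplot above.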

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially the ones implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long; the lack of tutorials and attention makes them less appealing. Second, the model itself may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, the problems of software maintenance in these deep learning related repositories are surveyed. The overall purpose is to explore three sets of factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report, we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)   (4.6)
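Equation (4.6) can be computed directly from the ISO-8601 timestamps the GitHub API returns for each repository; a minimal sketch (the example timestamps are hypothetical):

```python
from datetime import datetime, timezone

def repo_age_days(created_at, updated_at):
    """Repository age per Equation (4.6): T(updated_at) - T(created_at), in days.

    Timestamps use the ISO-8601 form the GitHub API returns, e.g. 2019-10-23T09:00:00Z.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    t0 = datetime.strptime(created_at, fmt).replace(tzinfo=timezone.utc)
    t1 = datetime.strptime(updated_at, fmt).replace(tzinfo=timezone.utc)
    return (t1 - t0).days

age = repo_age_days("2018-10-31T18:15:00Z", "2019-10-23T09:00:00Z")
print(age)  # -> 356
```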

Figure 4.17 and Table 4.6 show how development time varies per model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide & Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). Therefore, we hypothesize that many of the earlier models started using the open-source web community immediately after their first release.


Model | Max (days) | Q3 | Median | Q1 | Min
Bert | 779 | 229 | 110 | 32 | 0
Transformer | 1254 | 321 | 142 | 11 | 0
Wide & Deep | 1107 | 575 | 117 | 0.5 | 0
ResNet | 1360 | 456.5 | 120 | 1.5 | 0
NCF | 1120 | 476 | 216 | 8 | 0
LSTM | 1812 | 621.25 | 315.5 | 47.25 | 0
CNN | 1385 | 699.25 | 483 | 270.25 | 0

Table 4.6: Repository Development Time Statistics

[Boxplot: development days by model]

Figure 4.17: Development Time Boxplot


[Scatter plot: develop_duration vs. open_issues, colored by model]

Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of open issues. As visually suggested by the figure and the Spearman correlation test, there is a moderate positive correlation between these two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, which cost more to maintain, may have more users and therefore more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model | Mean | Std | 25% | 50% | 75% | Min | Max
Bert | 8.299 | 50.55 | 0 | 0 | 1 | 0 | 504
CNN | 3.414 | 35.456 | 0 | 0 | 1 | 0 | 1077
LSTM | 1.292 | 4.915 | 0 | 0 | 1 | 0 | 69
ResNet | 1.791 | 11.164 | 0 | 0 | 0 | 0 | 186
Transformer | 1.857 | 8.608 | 0 | 0 | 1 | 0 | 95
Wide & Deep | 0.231 | 0.742 | 0 | 0 | 0 | 0 | 4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository | Percentage of repositories having a Wiki (%)
Bert | 97.17
CNN | 98.498
LSTM | 98.799
NCF | 98.864
ResNet | 98.817
Transformer | 96.97
Wide & Deep | 100

Table 4.8: Descriptive statistics on percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common aspects of software engineering (popularity, contribution and maintenance) in deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation tests. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation of whether developers change the original code base after forking.


[Histograms: count of records vs. open_issues (binned), one panel per model]

Figure 4.19: Open Issues vs Number of Repository


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas on identifying models using different strategies, and we developed heuristics for an in-depth analysis of model construction and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

A sampling problem exists at the same time. The models we chose cannot represent all the new models in the wild. This is an open research question that needs further investigation; for example, users may publish their models in the prototxt format, whereas our project only focused on deep learning models constructed using Python. The findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, since a search cannot exceed the 1000-repository boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories in GitHub. Other, more stratified samples might produce a more precise outcome.

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to learn how deep learning is involved in our real life, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly: different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to define their own heuristics for data selection; experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a plugin for GitHub, allowing researchers


and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project, we investigated and examined the popularity of deep learning models through the number of repositories existing in GitHub. It is very likely that the commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine-learning clustering algorithms (e.g., K-Means) to high-resolution time-series data from commits.
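As a first step toward such clustering, commit timestamps can be binned into a per-month time series with the standard library; the resulting count vectors could then be fed to a clustering algorithm such as scikit-learn's KMeans (the timestamps below are hypothetical):

```python
from collections import Counter
from datetime import datetime

def monthly_commit_counts(timestamps):
    """Bin ISO-8601 commit timestamps (the format the GitHub API returns) by month."""
    months = [datetime.strptime(t, "%Y-%m-%dT%H:%M:%SZ").strftime("%Y-%m")
              for t in timestamps]
    return dict(Counter(months))

series = monthly_commit_counts([
    "2019-01-03T10:00:00Z",
    "2019-01-20T08:30:00Z",
    "2019-02-11T12:00:00Z",
])
print(series)  # -> {'2019-01': 2, '2019-02': 1}
```

One such count vector per repository, aligned on a common month axis, would form the feature matrix for the clustering step sketched above.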

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis in GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub deep learning related repositories and identified factors affecting those domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to help users gain a deeper insight into the current deep learning trend and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract

7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015); Processor: 2.2 GHz Intel Core i7; Memory: 16 GB 1600 MHz DDR3; Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm 2019.1.3 (Professional Edition); Build #PY-191.7479.30, built on May 30, 2019; licensed to ANU / Xing Yu; JRE 11.0.2+9-b159.60 x86_64; JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.; macOS 10.14.6

• Anaconda
  - jupyter-notebook 6.0.0

Other

• Python 3.7.4
• pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
• Git

Datasets

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net.json, U-Net pytorch.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents: Before You Begin · Prerequisites · Install · Running · Test · High Level Description of all Modules & Datasets · Authors · License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing, editing or executing our code:

• PyCharm
• Anaconda
• Amphetamine on the Mac App Store: keeps the Mac awake (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

• Git - https://git-scm.com/downloads
• GitHub authentication token
• Python 3.7 with pip
• Jupyter Notebook 6.0.0
• All external libraries used are listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip:

pip3 install --upgrade pip

Then install the list of requirements specified in requirements.txt:

pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project, then run python3 model_searcher.py to collect keyword-related repository metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars, and order can be asc or desc.
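For illustration, the kind of search URL model_searcher.py queries can be built like this (the helper below is a hypothetical sketch; the endpoint is GitHub's real /search/repositories API, and sending an authentication token in the Authorization header raises the rate limit):

```python
def build_search_url(keyword, sort="stars", order="desc", page=1):
    """Compose a GitHub repository-search URL.

    sort: "stars" or "updated"; order: "asc" or "desc" -- the same options
    the README describes for get_total_pages / request_ith_page.
    """
    base = "https://api.github.com/search/repositories"
    query = keyword.replace(" ", "+")
    return f"{base}?q={query}&sort={sort}&order={order}&page={page}"

url = build_search_url("bert tensorflow")
print(url)
# -> https://api.github.com/search/repositories?q=bert+tensorflow&sort=stars&order=desc&page=1
```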

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

Run python3 repository_filter.py to get the code-related repositories with statistics in the filtered_repo folder. Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

• Popularity: run python3 visualizations/popularity.py and find the graphs in visualizations/graphs/popularity
• Maintenance: run python3 visualizations/maintenance.py and find the graphs in visualizations/graphs/maintenance
• Contribution: run python3 visualizations/contribution.py and find the graphs in visualizations/graphs/contribution
• Multi Correlations: run python3 visualizations/multi_variable.py and find the graphs in visualizations/graphs/multi_variable

Test

Some GitHub repositories are not well maintained, and their links are sometimes broken and unreachable. To guarantee the best experience using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them into the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with parameters model_name and repository-metadata subfolder. Then you can use this object with its related data easily (from Model import bert and use bert as you go along).

Customize Keywords

In the module model_keyword.py, import your instantiation (e.g. lstm) and call add_keywords:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection

output/
├── asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
├── asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
├── by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
├── desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
├── desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
└── pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional)

filtered_repo/
├── bert.json
├── pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
└── tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json


Generated Graphs

graphs/
├── contribution/: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
├── maintenance/: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
├── multi_variable/: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
└── popularity/: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: Free for personal use only)

Bibliography

a. GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

b. GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

c. GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

d. GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)


Abstract

Deep learning, as a subfield of machine learning, has rapidly become a popular research area. However, little empirical work has previously been done to analyze deep learning model usage in public. GitHub is one of the largest web-based source code hosting communities, and it could be the best place to measure the popularity of deep learning models directly.

In this project, a tool called STAMPER is proposed and developed to aid researchers in the deep learning field to study past trends in GitHub. All of the visualizations display the repository information in GitHub at a high level. Our tool shows the evolution of deep learning models over time; in particular, we study the impact of some external features on deep learning models' popularity. We end up with a summary of the current state of the art in deep learning model repository analysis and a crucial discussion of challenges and directions for future research.

Key words: Software Engineering, Deep Learning, Popularity, Data Visualization


List of Abbreviations

• ML: Machine Learning

• DL: Deep Learning

• CNN: Convolutional Neural Network

• LSTM: Long Short-Term Memory

• NLP: Natural Language Processing

• Bert: Bidirectional Encoder Representations from Transformers

• NCF: Neural Collaborative Filtering

• ResNet: Residual Network

• Wide & Deep: Wide and Deep Learning


Contents

Acknowledgments vii

Abstract ix

List of Abbreviations xi

1 Introduction 1
   1.1 Trace Deep Learning use through GitHub 1
   1.2 Contribution 2
   1.3 Report Outline 2

2 Background and Related Work 3
   2.1 Background 3
      2.1.1 Deep learning 3
         2.1.1.1 TensorFlow 4
         2.1.1.2 PyTorch 4
      2.1.2 Deep learning models 5
      2.1.3 Summarized Timeline 7
   2.2 Public Code Repositories 8
      2.2.1 Web-based hosting service 8
      2.2.2 Measuring Popularity From GitHub 8
      2.2.3 Extracting Messy Data in the Wild 9
      2.2.4 Visualizing data in Repositories 9
   2.3 Summary 10

3 STAMPER Design and Implementation 11
   3.1 Overview 11
   3.2 Data Collection 12
   3.3 Repository Search 13
   3.4 Data Selection 14
      Example 15
   3.5 Construct the Visualizations 16
   3.6 Summary 18

4 STAMPER in Action 19
   4.1 Popularity of Deep Learning Models in GitHub 19
      4.1.1 Popularity Feature Selection 19
      4.1.2 Past and Current Status: A Full Integration 23
      4.1.3 RQ1: How has the popularity of model changed over time? A closer look at the deep learning models 26
      4.1.4 RQ2: How popularity varies per model 29
      4.1.5 RQ3: Does the popularity of models relate to other features? 30
   4.2 Contribution of Deep Learning Models in GitHub 34
      4.2.1 Collaborative Contribution 34
      4.2.2 RQ1: After forking, do developers change the codebase? 36
   4.3 Maintenance of Deep Learning Models in GitHub 39
      4.3.1 RQ1: How long has it been in existence? 39
      4.3.2 RQ2: Do old models have more issues compared to new models? 41
      4.3.3 RQ3: Are they well maintained? 42
   4.4 Summary 42

5 Discussion And Future Work 45
   5.1 Discussion 45
      5.1.1 Data in the wild: Limitation and Improvement 45
      5.1.2 Extensibility and Open-Source Software 45
   5.2 Future Work 46
      5.2.1 Social Network Analysis in GitHub 46
      5.2.2 Trend Detection using Commitments Timestamp 46

6 Conclusion 47

7 Appendix 49
   7.1 Appendix 1: Project Description 49
      7.1.1 Project Title 49
      7.1.2 Supervisors 49
      7.1.3 Project Description 49
      7.1.4 Learning Objectives 49
   7.2 Appendix 2: Study Contract 49
   7.3 Appendix 3: Artefact Description 52
      7.3.1 Code Files Submitted 52
      7.3.2 Program Testing 52
      7.3.3 Experiment 52
         Hardware 52
         Softwares 52
         Other 53
         Datasets 53
   7.4 Appendix 4: README 54

List of Figures

2.1 git2net [Gote et al. 2019] 10

3.1 Overview of STAMPER 11
3.2 Data Selection 14
3.3 Store in Local Disk 14
3.4 Overall: Construct the Visualizations 16
3.5 Examine Uniqueness after Forking 18

4.1 Repository Watching [Git b] 20
4.2 Star Sort Menu [Git a] 20
4.3 Popularity Metric 21
4.4 Repositories with Forks 24
4.5 Repositories without Forks 24
4.6 Repository Trend in GitHub For Each Model 25
4.7 Creation Time vs Stars 26
4.8 Number of Forks Related to Repositories in Deep learning Model Development 28
4.9 Star vs Contributors 30
4.10 Star vs Development Time 31
4.11 Star vs Open Issues 31
4.12 Star vs Entropy Value 32
4.13 Collaboration Entropy 35
4.14 Percentage of Forked Repositories Unique From Origin (Boxplot) 36
4.15 Repository Uniqueness Distribution (%) 37
4.16 Repository Change Statistic 38
4.17 Development Time Boxplot 40
4.18 Development Time vs Number of Open Issues 41
4.19 Open Issues vs Number of Repository 43


List of Tables

2.1 Deep Learning History 4
2.2 Timeline 7

3.1 Repositories Related to Tensorflow 17

4.1 Popularity metric for repositories 21
4.2 Stars Comparison 29
4.3 Forks Comparison 29
4.4 Percentage of one-contributor development for DL related repositories 32
4.5 Sample Contributions to One Repository 34
4.6 Repository Development Time Stat 40
4.7 Repository Open Issue Statistics 41
4.8 Descriptive statistics on percentage of Wiki Existence 42


Chapter 1

Introduction

1.1 Trace Deep Learning use through GitHub

GitHub is one of the largest web-based hosting communities in the world, containing a rich source of data that facilitates different software engineering projects. This massive database allows developers and researchers to publish their work.

GitHub can also be used for popularity measurement. Developers can construct their social networks in GitHub in multiple ways, such as joining an organisation or starring a repository they are interested in. These features make repositories in GitHub easily accessible, and the best place to conduct empirical studies.

The deep learning research community is continually evolving. Not limited to academic research, deep learning is allowing businesses to use data to teach computers how to learn; therefore, a number of machine learning and deep learning startups have come into existence worldwide.

The development of deep learning models sometimes has software engineering problems. Studies on the quality of deep-learning-related projects are sparse, and few researchers focus on their usage outside academia. With the expansion of the usable range of deep learning and the deepening of its use, we would like to test whether developers catch up with the latest deep learning upsurge in academia.

A tool to extract the metadata of historical repositories, with integrated visualisation, based on the vast corpus of GitHub repositories is not currently available. To fill this gap, we present our tool STAMPER. It is implemented in Python and supports GitHub repository metadata extraction based on repository keyword search and code segment search using the GitHub API. It can also analyse the modification of forked repositories from the original repository and capture the repository difference.
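The keyword-based metadata extraction can be sketched against the GitHub Search API as follows. This is an illustrative approximation, not STAMPER's actual code: the helper names, the selected metadata fields, and the page size are our assumptions.

```python
import json
import urllib.parse
import urllib.request

API = "https://api.github.com/search/repositories"

def build_search_url(keyword, sort="stars", per_page=100):
    """Build a GitHub repository-search URL for a keyword query."""
    params = urllib.parse.urlencode(
        {"q": keyword, "sort": sort, "order": "desc", "per_page": per_page})
    return f"{API}?{params}"

def extract_metadata(response_json):
    """Keep only the fields a popularity analysis needs."""
    return [
        {
            "full_name": item["full_name"],
            "stars": item["stargazers_count"],
            "forks": item["forks_count"],
            "created_at": item["created_at"],
        }
        for item in response_json.get("items", [])
    ]

def search_repositories(keyword):
    """Fetch one page of search results (unauthenticated, rate-limited)."""
    with urllib.request.urlopen(build_search_url(keyword)) as resp:
        return extract_metadata(json.load(resp))

# Example (requires network access and respects GitHub's rate limits):
# for repo in search_repositories("resnet tensorflow")[:5]:
#     print(repo["full_name"], repo["stars"])
```

Unauthenticated requests to this endpoint are heavily rate-limited (reference [d] in the bibliography), so a real crawler would add an authentication token and paginate with back-off.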

We study the historical trend of deep learning models and frameworks from repositories in GitHub, and further demonstrate how repository metadata can be used to see the deep learning trend. Our project provides a novel method to study deep learning frameworks and models from a historical aspect; in the meanwhile, our work creates a new aspect of empirical study in deep learning.


1.2 Contribution

• We introduce STAMPER, a Python tool that can be used to extract the metadata that characterize historical open source projects from GitHub, based on researchers' interests.

• We utilize STAMPER in a case study analysing the usage of deep learning models in TensorFlow. We further demonstrate how we extract a rich set of features and establish connections between those features and popularity.

• We create and mine a new dataset for further research use.

1.3 Report Outline

The rest of this report is structured as follows. Chapter 2 provides the timeline for deep learning frameworks and deep learning models; some background knowledge is presented in that chapter, and previous works on software mining tools and GitHub visualizations are reviewed there as well.

Chapter 3 provides an overview of our work and introduces our proposed methodology to extract the metadata from the GitHub repository.

Chapter 4 presents a case study in which we use our tool to extract deep learning related repositories from GitHub and trace the landscape of those hot deep learning models. The visualizations generated using our tool can aid researchers in understanding past and current trends easily.

Referring back to the critical questions posed, we conclude our project and highlight the next steps in Chapter 5.

Chapter 2

Background and Related Work

Software repository mining lets researchers study historical trends in software engineering practice effectively. The use of repository mining is based on the use of web hosting services. There exist multiple approaches to conducting such studies: in the first section we will introduce some background knowledge on web-based hosting services; then we will introduce some popular deep learning frameworks in Section 2.1.1; finally, we will detail in Section 2.2 some previous works which conduct software repository mining and identify the main patterns in exploring repository popularity.

2.1 Background

2.1.1 Deep learning

Different from traditional machine learning, deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [LeCun et al. 2015]. Methods in deep learning can dramatically improve human life in multiple aspects, from image classification and speech recognition to machine translation, and even autonomous cars.

The usage of deep learning has grown tremendously in the past few years with the introduction and improvement of hardware availability (GPUs), big data, and cloud providers such as Amazon Web Services (AWS). Not limited to this, large companies create their own research teams to develop deep learning algorithms and integrate them into frameworks such as PyTorch and TensorFlow. In the meanwhile, they share their datasets and model training tutorials, helping startup companies to build state-of-the-art products with minimal effort and time.

Though deep learning has become a hot topic only in recent years, the history behind it has been evolving since the 1950s, as summarised in Table 2.1 [Subramanian 2018].

Previously, developers and researchers required expertise in C++ or CUDA to implement deep learning models and algorithms. However, thanks to large technology companies and organisations, people with knowledge of a scripting language (Python) can also build their own deep learning algorithms.


Technique                        Year
Neural network                   1943
Backpropagation                  1960s
Convolutional Neural Network     1979
Recurrent neural network         1980
Long Short-Term Memory           1997

Table 2.1: Deep Learning History

In Section 2.1.1.1 and Section 2.1.1.2, we will talk about some state-of-the-art deep learning frameworks used today, both in academia and industry.

2.1.1.1 TensorFlow

TensorFlow is a machine learning system that operates at large scale [Abadi et al. 2016]. It started with an initial release from the Google Brain team in November 2015, developed under the name DistBelief. TensorFlow then released its official 1.0.0 version on 11 February 2017, with the introduction of the ability to run on multiple CPUs and GPUs.

Currently, Google uses this framework in numerous ways: to improve its search engine, translation process, recommendation system, and image recognition and captioning. The TensorFlow library contains different APIs to build deep learning models at scale, such as ResNet, CNN, or LSTM. The architecture of this framework gives developers the flexibility to experiment with new optimisations and Google's training algorithms. It currently supports numerous applications from front end to mobile; the flexibility and scalability make it one of the most popular deep learning frameworks in the world.

2.1.1.2 PyTorch

Based on the Torch library, PyTorch can be seen as a Python front end to the Torch engine, which provides the capability to define mathematical functions and compute gradients. Similar to TensorFlow, PyTorch has good GPU and CPU support and is primarily developed by Facebook's artificial intelligence research group. The official initial release of PyTorch was in October 2016; its dynamic computation capability gives it great flexibility in building complex architectures.

However, unlike TensorFlow, which can seamlessly integrate into real industrial applications, PyTorch was primarily developed by researchers and scientists, and it is not easy, or recommended, to use for production in specific scenarios.


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and pick up the speed of project development to survive in this keen competition. Winning trust from the public with a service of high quality is thus required.

Initially, we would have liked to conduct our research on the latest model stores, such as AWS SageMaker, the Azure machine learning service, the Wolfram Neural Net Repository, and ONNX. However, the algorithms and datasets used there are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand with the public. To avoid those problems and gain a more in-depth insight into usage in society, we chose a framework with models of greater transparency and substantial usage: TensorFlow.

API Referenced

Since this project involves a range of models in deep learning, we began by developing an understanding of how to build and train neural networks. The construction of networks may rely on the layers API provided by Keras and TensorFlow. In the meanwhile, TensorFlow recently introduced Estimator APIs to simplify the procedure of training, evaluation, prediction, and export.

Convolutional Neural Network (CNN)

The Convolutional Neural Network is one of the most established algorithms among all the deep learning models, and one of the most dominant in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons arranged in a grid pattern. Typically, a CNN consists of three types of layers: fully connected layers, convolution layers, and pooling layers. Convolution and pooling layers conduct the feature extraction, and the fully connected layer maps the extracted features onto the final output. Layers are interconnected, and thus the extracted features are transferred layer by layer.

Long short-term memory (LSTM)

Different from traditional neural networks, which cannot memorise previous data, a special kind of recurrent neural network, Long Short-Term Memory (LSTM), provides researchers with an effective way to perform persistent data tasks, and it is capable of learning long-term dependencies [Hochreiter and Schmidhuber 1997]. Specifically, LSTM combines the forget and input gates into an update gate. It is capable of learning the dependencies in historical data and making predictions from the information remembered previously. Inside LSTM, instead of using a linear layer, there is a small network which performs the function independently.

(Note: TensorFlow official models are chosen in our project. The TensorFlow official models contain a collection of deep learning models built with their high-level APIs: https://github.com/tensorflow/models/tree/master/official.)

LSTM is one of the most common uses of recurrent neural networks. This model is generally used with sequence-related data and can solve language modelling problems such as NLP concepts (word embedding, encoders).
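To make the gating concrete, the following is a from-scratch single LSTM step in plain Python. It uses the standard separate forget/input/output gates (rather than the combined update-gate variant mentioned above), and the scalar weights are toy values, not anything learned:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step for scalar input and state.

    w maps each gate name to a (w_x, w_h, bias) triple. The cell state
    c carries long-term memory; the gates decide what to forget, what
    to write, and what to expose as the hidden state h.
    """
    z = lambda gate: w[gate][0] * x + w[gate][1] * h_prev + w[gate][2]
    f = sigmoid(z("forget"))   # how much old memory to keep
    i = sigmoid(z("input"))    # how much of the candidate to write
    g = math.tanh(z("cell"))   # candidate memory content
    o = sigmoid(z("output"))   # how much memory to expose
    c = f * c_prev + i * g     # updated cell state
    h = o * math.tanh(c)       # updated hidden state
    return h, c

# Toy weights: every gate uses (w_x, w_h, bias) = (1.0, 0.5, 0.0).
W = {gate: (1.0, 0.5, 0.0) for gate in ("forget", "input", "cell", "output")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:     # a tiny input sequence
    h, c = lstm_step(x, h, c, W)
print(h, c)                    # hidden/cell state after three steps
```

Because the forget gate multiplies the previous cell state rather than overwriting it, gradients can flow across many time steps, which is what makes the long-term dependencies learnable.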

Residual Network (ResNet)

One of the problems deep learning models face is that, as the number of layers increases beyond a certain point, accuracy stops improving. This problem motivated modern network architectures such as the residual network (ResNet) and Inception, which address it with residual connections.

ResNet normally solves the problem above by fitting a residual mapping through an added shortcut connection. Each ResNet block contains a series of layers and a connection component.

Bidirectional Encoder Representations from Transformers (Bert)

Bert is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al. 2018].

It was first released as google-research/bert on GitHub on 1 November 2018, with the ability to handle a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications; it aims to predict the relationship between sentences by analysing whole sentences holistically [Devlin et al. 2018].

Attention is all you need (Transformer)

Most problems in deep learning can be seen as a form of sequence-to-sequence mapping and can be solved using a common type of architecture: the Encoder-Decoder architecture.

The encoder and decoder both consist of identical layers, and each layer consists of two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network [Vaswani et al. 2017].

Neural Collaborative Filtering (NCF)

Neural Collaborative Filtering (NCF) is a neural network architecture which utilises the non-linearity of neural networks to build recommendation systems [He et al. 2017]. It demonstrates that matrix factorisation can be treated as a special case of neural collaborative filtering. To add additional non-linearity, this model introduces a multi-layer perceptron module on top of the generalised matrix factorisation (GMF) layer.

Wide and Deep Learning

Since linear models are not good at generalising across unique features, deep models are introduced to solve this problem. Deep models can use embedding vectors for every query, and this technique can then generalise by coupling the items and queries.

To overcome over-generalisation, the Google research team introduced the Wide & Deep Learning model: jointly trained wide linear models and deep neural networks that combine the benefits of memorisation and generalisation for recommender systems [Cheng et al. 2016].

2.1.3 Summarized Timeline

Model Name      Definition Raised Time
CNN             1980s
LSTM            1997
ResNet          2015
Wide & Deep     2016
NCF             2017
Transformer     2017
Bert            2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest open-source distributed version control system (DVCS) hosts [Gousios et al. 2014]. This distributed version control system enables contributors to submit a set of changes and do the integration in the main development branch. The use of git is based on pragmatic needs, combining the advantages of version control and collaborative development.

GitHub can lead to insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project; thus, the number of stars can reveal popularity. From a software development research perspective, it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al. [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth using time series metadata derived from 2,279 accessible GitHub repositories. In the meanwhile, they found that slow growth is more common in the case of overpopulated application domains and for old repositories. Moreover, they conclude that among the most common domains on GitHub are web libraries and frameworks and non-web libraries and frameworks.

Building on their work, we would like to examine whether deep learning framework repositories follow a similar popularity trend pattern or not. At the same time, we would also like to study whether a relationship exists between three factors: forks, stars, and watchers.

Predicting the Popularity of GitHub Repositories

In the same year, Borges et al. [2016a] published another paper, about predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regressions to predict the number of stars of GitHub repositories; as a result, project owners can see how their projects are performing in the open-source community.

Their studies examine the correlation between the popularity of a repository and the number of stars it owns. In the meanwhile, this study reports that their predictions have a very strong correlation between predicted and real rankings.
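The regression idea can be illustrated with a one-variable ordinary-least-squares fit. Borges et al. actually use multiple regressors; the single feature (repository age) and the data points below are hypothetical:

```python
def fit_simple_ols(xs, ys):
    """Closed-form least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Hypothetical data: repository age in months vs. accumulated stars.
ages = [3, 6, 12, 24, 36]
stars = [40, 90, 160, 330, 480]
a, b = fit_simple_ols(ages, stars)
predicted = a + b * 18          # expected star count at 18 months
print(round(b, 2), round(predicted, 1))
```

A multiple-regression version would simply stack more columns (forks, contributors, domain, ...) into the design matrix and solve the normal equations instead of this scalar formula.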


2.2.3 Extracting Messy Data in the Wild

GHTorrent

GHTorrent [Gousios and Spinellis 2012] is designed to conduct independent repository mining through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all the repositories, and the resulting dataset can be used towards answering multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving the raw data based on the results returned from the REST API. However, their tool does not have the ability to visualise the metadata and offer trend analysis at a high level.

MetricMiner

A similar tool is called MetricMiner [Sokol et al. 2013]. It is a web application that supports researchers in mining software repositories, doing data extraction and statistical inference from the data collected. This tool automatically clones the repository, processes the metadata, and stores the data in the cloud, which gives it good scalability and fast computational speed in query answering, without researchers installing any software on their local hosts.

GitcProc

GitcProc [Casalnuovo et al. 2017] is a tool that uses regular expressions to extract the changed lines within a repository, to facilitate answering project evolution questions. GitcProc can retrieve and summarise global statistics of project metrics, including commits, commit dates, and contributors. This tool can measure how many changes have taken place in Java projects and is also able to locate the changed files.
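A drastically simplified version of that regex-driven extraction can be written in a few lines of Python. The pattern below handles only unified-diff hunk headers and is our illustration, not GitcProc's actual implementation:

```python
import re

# Matches unified-diff hunk headers such as "@@ -1,3 +1,3 @@".
HUNK = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@")

def changed_lines(diff_text):
    """Return (added, removed) source lines from a unified diff."""
    added, removed = [], []
    in_hunk = False
    for line in diff_text.splitlines():
        if HUNK.match(line):
            in_hunk = True
        elif in_hunk and line.startswith("+") and not line.startswith("+++"):
            added.append(line[1:])
        elif in_hunk and line.startswith("-") and not line.startswith("---"):
            removed.append(line[1:])
    return added, removed

diff = """\
--- a/Model.java
+++ b/Model.java
@@ -1,3 +1,3 @@
 public class Model {
-    int layers = 2;
+    int layers = 50;
 }"""
added, removed = changed_lines(diff)
print(added, removed)   # → ['    int layers = 50;'] ['    int layers = 2;']
```

A tool like GitcProc layers further patterns on top of this (to classify the changed lines, e.g. as bug-fix related) and aggregates the counts per commit.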

RepoVis

RepoVis [Feiner and Andrews 2018] is a new tool which provides visual overviews for software maintained in Git repositories. RepoVis is a client-server web application that provides a full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its searchable functionality to GitHub, associated with a code-based search. All the visualizations are written out in SVG format.

2.2.4 Visualizing Data in Repositories

Chronos
CHRONOS [Servant and Jones 2013] is a software tool that enables the visualisation of historical change inside software source code. The tool implements a zoomable user interface as a visualisation of the actual code, which supports developers from a high-level view (pattern recognition) to a low-level view (reverse engineering) using the History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise their complete history of change, including the revisions that modified them. Inspired by this tool, our project uses this visualisation method to track the historical change of popular


Figure 2.1: git2net [Gote et al. 2019]

trends related to the keywords specified by users in GitHub.

GEVOL
Collberg et al. [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph-drawing technique, to deduce a better understanding of a program from its development history, displaying all visualisations with a temporal graph visualizer.

This system aids in the discovery of the structure of a system and provides the user with a new way to observe the evolution of a program by visualising changes to the system. It extracts information about Java programs stored within a CVS version control system, and turns the metadata into three types of graphs: inheritance, control-flow and call-graphs.

git2net
git2net [Gote et al 2019] is a software tool that facilitates the extraction of co-editing networks in git repositories. Similar to GEVOL, it uses text mining techniques to analyse the history of modifications within files. However, they address the importance of studying social networks in GitHub and give the reader a broader view of the application of graph-based data analysis and modelling. Their tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter we detailed the web-based hosting service we selected (GitHub) and presented the concept of deep learning, with two popular frameworks together with state-of-the-art neural network models. In the next chapter we elaborate how we design and implement STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter we outline our design and implementation for data extraction, and then detail the metrics we use to estimate trends in deep learning framework and model usage in GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

[Diagram: (1) Data Collection via the Git Project Search API; (2) Repository Search; (3) an optional Data Selection step matching keywords (model names) via the Git Code Search API, written to local disk; followed by Data Visualisation.]

Figure 3.1: Overview of STAMPER


Data Collection
We first collect all repository metrics through the GitHub API. This step allows us to extract the history of all repositories related to a keyword and record the metadata for each repository (repository-based search).

Repository Search
As each repository contains multiple contributors and forks, we subsequently extract and analyse whether changes were made, based on size information, and calculate a collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the forked repositories' information to create visual representations.

Data Selection
We have implemented a selector that allows excluding specific repositories unrelated to the desired ones. The selector summarizes frequency counts for the keywords the user entered and writes the corresponding frequencies to local disk. The resulting file can be used to examine API usage statistics.

Data Analysis
Since each forked repository may involve re-development and modification, analysis of forked-repository modification is included in our project. For all changed repositories, the size difference is analysed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to analyse and manipulate the data in depth, and even run statistical tests on the dataset. To better understand these metrics, we divide them into multiple categories. For attributes that are not primary data from the GitHub API, we explain them in the data expansion part and label them as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication, the user can make up to 5,000 requests per hour; otherwise, the rate limit only allows up to 60 requests per hour [Git d].
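As an illustrative sketch (not STAMPER's actual implementation), an authenticated request can be built with Python's standard library; the helper name and placeholder token are assumptions:

```python
# Minimal sketch of authenticated GitHub API access (illustrative only).
# With an OAuth2 token the limit is 5,000 requests/hour; without one,
# only 60 requests/hour are allowed.
import urllib.request

API_ROOT = "https://api.github.com"

def build_request(path, token=None):
    """Build a GitHub API request, attaching the OAuth2 token if given."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = "token " + token
    return urllib.request.Request(API_ROOT + path, headers=headers)

# The remaining quota could then be checked with, for example:
#   urllib.request.urlopen(build_request("/rate_limit", token))
req = build_request("/rate_limit", token="<your-oauth2-token>")
print(req.full_url)
```
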


Type          Meta-data
Contributor   contribution: int [Data Expansion]
              login (user name): String
              type (user/organization): String
              contributors_url
Repository    created_at
              description
              full_name
              language
              size
Popularity    fork: Boolean
              forks: int
              forks_url
              stargazers_count
              watchers_count
              unique_repos [Data Expansion]
Owner         id
              login (username)
              type
Maintenance   has_issues: Boolean
              has_wiki: Boolean
              open_issues: int
              pushed_at
              updated_at
              score

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deep-crawling operations to collect raw data.

• Contribution
One repository generally involves multiple developers conducting software development. The project owner is not necessarily the person who contributes the largest amount of code, and the amounts of contribution made by different developers are potentially not the same. As a result, we further track this information through the GitHub API and record the number of contributions each developer made to each repository.

• Unique_repos
Popular repositories may have numerous forked repositories owned by other users, and the reasons behind this forking behaviour vary. Our research explores whether users conduct subsequent development based on the original codebase. By comparing the size of each forked repository (Fi) with the original repository (O), we obtain all the forked repositories together with their change in size (c):

F_i + c = O    (3.1)
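As a sketch of this comparison (the sizes below are hypothetical; the GitHub API reports repository size in KB), changed forks can be flagged wherever c is non-zero:

```python
# Illustrative check of Equation (3.1): c = O - F_i for each forked
# repository; a non-zero c marks a fork whose size differs from the original.

def size_changes(original_size, fork_sizes):
    """Return the size change c for every fork."""
    return [original_size - f for f in fork_sizes]

def changed_forks(fork_sizes, original_size):
    """Forks whose size differs from the original repository."""
    return [f for f in fork_sizes if f != original_size]

fork_sizes = [120, 120, 118, 240]           # hypothetical fork sizes (KB)
print(size_changes(120, fork_sizes))        # -> [0, 0, 2, -120]
print(len(changed_forks(fork_sizes, 120)))  # -> 2
```
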

3.4 Data Selection

[Diagram: an entity (model) and its API keywords are searched within each repository to produce statistics.]

Figure 3.2: Data Selection

[Diagram: forked repositories and timestamps flow from unfiltered to filtered data, grouped by model-related keywords (e.g. Bert, ResNet, CNN) defined in model.py.]

Figure 3.3: Store in Local Disk


Figure 3.2 represents our method of searching API usage in DL model-related repositories. GitHub provides a REST API to examine whether a repository contains the targeted API or not.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. Moreover, the appearance frequency of each user-specified API is embedded directly within the returned result and can be matched in our program to each repository's full name. The overall result is finally written to local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. This approach also allows users to build a high-level picture of API usage across GitHub-related repositories.
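A minimal sketch of this bookkeeping (the function name, file name and sample data are assumptions for illustration, not STAMPER's actual code):

```python
# Count API keyword appearances per repository full name and persist the
# result to local disk as JSON, mirroring the selection step described above.
import json

def summarize_api_usage(search_hits, out_path):
    """search_hits maps a repository full name to its matched code fragments."""
    counts = {name: len(hits) for name, hits in search_hits.items()}
    with open(out_path, "w") as f:
        json.dump(counts, f, indent=2)
    return counts

hits = {
    "alice/bert-demo": ["import bert", "bert.run()"],
    "bob/cnn-zoo": ["tf.nn.conv2d"],
}
print(summarize_api_usage(hits, "api_usage.json"))
```
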

In the meantime, we also provide multiple options for selecting APIs, stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras applications library provides users with the ability to load deep learning models and instantiate a model with default weights. Take ResNet as an example.

• With pre-defined models
A user may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, which could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility for creating deep learning models, and they could all be used as sample keywords in STAMPER, such as:

keras.applications.resnet.ResNet50
keras.applications.resnet.ResNet101
keras.applications.resnet.ResNet152
keras.applications.resnet_v2.ResNet50V2
keras.applications.resnet_v2.ResNet101V2
keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to further explore the components inside the ResNet class, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, data selection heuristics vary. Deep learning

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py


users and experts could define their searches according to their interests and preferences.
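One possible shape for model_keyword.py is sketched below; the grouping and the non-ResNet entries are assumptions for illustration, not STAMPER's actual configuration:

```python
# Hypothetical model_keyword.py: search keywords grouped per model entity.
MODEL_KEYWORDS = {
    "resnet tensorflow": [
        "from keras.applications.resnet50 import ResNet50",
        "keras.applications.resnet.ResNet50",
        "keras.applications.resnet_v2.ResNet50V2",
    ],
    "bert tensorflow": [
        "BertModel",   # assumed self-defined class name to trace
    ],
}

def keywords_for(entity):
    """Look up the search keywords registered for one model entity."""
    return MODEL_KEYWORDS.get(entity, [])

print(keywords_for("resnet tensorflow")[0])
```
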

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from these three perspectives is illustrated in Figure 3.4. Meanwhile, Chapter 5 gives an example of our collected repository metadata for deep learning models.

[Diagram: entities (Entity 1 .. Entity n) pass through functional mappings to produce contribution-related, popularity-related and maintenance-related visualisations.]

Figure 3.4: Overall Construct the Visualizations

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped by week (with forks)

• Repository creation time vs. stars

Contribution

To additionally exploit the forking information, STAMPER finally supports comparison between the original repository and its forked repositories. The work could be further extended by visiting the forked repository URL and tracing the commits.

As shown in Figure 3.5, the entity (E) we search in GitHub may have multiple related repositories (Ri) and their corresponding forked repositories (Fi). Among the forked repositories, we denote a changed forked repository by Ci.

To examine whether changes exist in forked repositories, and the difference between multiple entities, we calculate the difference using the equation


Keyword                   Total # of Repositories (including Forks) Collected  Total # of Original Repositories Collected
ResNet tensorflow         6129                                                 339
Bert tensorflow           13734                                                106
CNN tensorflow            39765                                                1000
LSTM tensorflow           19572                                                1000
Transformer tensorflow    7188                                                 145
Wide and deep tensorflow  324                                                  39

Table 3.1: Repositories Related to Tensorflow


below. Uniqueness percentage distributions are composed of all the percentages p_i, each corresponding to an original repository R_i:

p_i = (Σ C_i) / (Σ F_i)    (3.2)
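A one-line sketch of Equation (3.2), with hypothetical fork counts:

```python
# Uniqueness percentage of Equation (3.2): changed forks over all forks.
def uniqueness_percentage(num_changed, num_forks):
    """p_i = sum(C_i) / sum(F_i); defined as 0 when there are no forks."""
    return num_changed / num_forks if num_forks else 0.0

print(uniqueness_percentage(3, 4))  # e.g. 3 of 4 forks changed -> 0.75
```
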

[Diagram: an entity (E) has repositories 1..4; each forked repository 1..n is marked changed (Y/N), e.g. Y, N, Y, Y.]

Figure 3.5: Examine Uniqueness after Forking

• Percentage of forked repositories unique from origin (boxplots)

• Uniqueness percentage distribution for each entity (histograms)

• Entropy distribution for each entity (histograms)

Maintenance

• Development time boxplot for each entity

• Open issues distribution for each entity

3.6 Summary

In this chapter we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. Meanwhile, we introduced and analysed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving, being built, trained and deployed by researchers. Our tool is available for analysing such changes. We collected the historical information stored in GitHub and extracted the repository metadata using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development is a battlefield without the smoke of gunpowder. Researchers, companies and developers are all competing for a dominant voice in deep learning. A variety of models exist to think in, but there is no common bridge connecting those ideas together. Historical data in GitHub are obscure and hard to find, and as a result many developers, especially experienced ones, remain in their original comfort zone. With our study, we hope to shed some light on deep learning and highlight a few suggestions for the public.

This section aims to answer questions related to both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, given the few studies on popularity in the GitHub ecosystem, there is no standardized feature to measure popularity. We analyse some potential features of each repository and hypothesize that popularity is strongly related to the stars a repository owns.

This decision is justified in the following section, with more background on GitHub.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who ask to be notified of activity in a repository they are watching; however, watching does not imply being a collaborator [Git b]. A watcher could watch


Figure 4.2: Star Sort Menu [Git a]

a repository to receive notifications for the new pull requests or issues that are created. Watchers indicate how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

• Stars
Starring a repository makes it easy to keep track of a repository the user is interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks
Forks are created when a user would like to make a copy of a repository. The user could fork a repository to suggest changes, or to use it as a basis for a new project.

Based on the data we collected in the data collection stage, we extract the watchers_count, stargazers_count and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star   forks_count  watchers_count  model name
17940  4661         17940           Bert
12405  3637         12405           Bert
5263   1056         5263            Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, number of forks and number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, and thus we should not assume normality, motivating a rank-based correlation measure.


Spearman Correlation Coefficient
Definition
The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman 2011]. This method allows us to test for a rank-order relationship between two numerical or ordinal variables associated by a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

• Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

• Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables on the testing dataset.

Set α = 0.05. The p-values p1, p2 and p3 are all less than α, and from the calculation above we also find strong positive correlations, with values of coef1 = 0.875, coef2 = 1.0 and coef3 = 0.875 respectively.

This means that the likelihood of the test data being uncorrelated is very small (95% confidence), and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of the report, we consider the number of stars to be the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, look like two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from models with longer histories, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as already described in the background section.

The model development community recently saw the release of multiple powerful frameworks treated as baselines for building models. However, for many new models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.


[Line chart: accumulated number of repositories created (including forks) for each model, 2015-2019; counts range up to about 40,000.]

Figure 4.4: Repositories with Forks

[Line chart: accumulated number of repositories created (excluding forks) for each model, 2015-2019; counts range up to about 3,000.]

Figure 4.5: Repositories without Forks


[Small-multiple charts: repository creation counts over time (October 2015 to October 2019) for each of the seven models.]

Figure 4.6: Repository Trend in GitHub For Each Model


[Scatter plot: repository creation time vs number of stars for each model.]

Figure 4.7: Creation Time vs Stars

A fork is another copy of a repository. The forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories created including forks and the total number excluding forks. We find that most of the repositories related to deep learning models are not original. This indicates that a considerable number of developers remain in a studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, different from the previous summarizing method, we study the popularity of deep learning models over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the greatest number of repositories created. Let us examine this using the data. In

sect41 Popularity of Deep Learning Models in GitHub 27

2017, the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued rising to an even higher level, which persists till now.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have among the most important and most significant communities in the deep learning field. These networks are essential in both computer vision and NLP, and they represent an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer
As the title indicates, ResNet and Transformer usage has improved significantly in the last two years. Differing from earlier structures such as CNN, both of them modify the original structure and significantly improve results in computer vision and translation tasks.

Rising star: Bert

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends, as depicted in the graph, suggest that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with when the model came into existence, but our data tell a different story.

Published in a paper in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no strict relationship between popularity (i.e. stars) and creation time: past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

Similarly for the Wide and Deep model: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, previous data also confirm that there is no significant rise in the use of this model.

[Histograms: distribution of fork counts (binned) for repositories of each model.]

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name     Mean    STD      Min  25%  50%  75%  Max
Bert           498.65  2196.3   0    1    8    43   17940
CNN            106.84  611.97   2    3    8    32   13882
LSTM           48.82   214.22   0    1    2    13   2703
NCF            77.00   129.91   1    2    3    115  227
ResNet         46.88   221.43   0    0    1    8    2980
Transformer    186.79  1155.87  0    0    4    21   12408
Wide and Deep  16.23   36.80    0    0    1    8    146

Table 4.2: Stars Comparison

Model Name     Mean        STD         Min  25%  50%  75%   Max
Bert           128.214953  585.926617  0.0  0.0  1.0  16.5  4661.0
CNN            40.710      252.713617  0.0  1.0  4.0  14.0  6274.0
LSTM           17.793      71.956709   0.0  0.0  1.0  5.0   968.0
NCF            34.333333   58.603185   0.0  0.5  1.0  51.5  102.0
ResNet         17.442478   93.754994   0.0  0.0  0.0  3.0   1442.0
Transformer    53.518797   336.103826  0.0  0.0  1.0  6.0   3637.0
Wide and Deep  7.282051    16.364192   0.0  0.0  0.0  2.5   71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52) and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), ResNet (17.44) and LSTM (17.79).

Kruskal-Wallis Test: the Kruskal-Wallis test is used in this project to compare more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The 7 models' distributions are the same.

• H1: The 7 models' distributions are different.

from scipy.stats import kruskal
stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(), dfLstm['star'].tolist(),
                  dfNcf['star'].tolist(), dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                  dfWideDeep['star'].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the previous popularity exploration based on the number of repositories in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or Bert model requires a large amount of time and effort, but developers still show their interest in those novel deep learning models by starring and forking them.

[Scatter plot: number of contributors vs stargazers_count for each model.]

Figure 4.9: Star vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues and entropy, respectively.


[Scatter plot: development duration vs stargazers_count for each model.]

Figure 4.10: Star vs Development Time

[Scatter plot: open issues vs stargazers_count for each model.]

Figure 4.11: Star vs Open Issues

[Scatter plot: entropy vs stargazers_count for each model.]

Figure 4.12: Star vs Entropy Value

Number of Contributors
From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Across all repositories, the top-3 models with the most stars per contributor are CNN (168.75 stars/contributor), Transformer (15.51 stars/contributor) and Bert (15.50 stars/contributor).

Model          Percentage of One-Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time
From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value = 0). This suggests that the longer a model develops, the more stars it will have (i.e. the model becomes more popular). The top-2 repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues
From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can make the assumption that the more popular a repository becomes, the more issues it will have. We will further investigate this correlation in the following section.

Entropy
From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of a new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether their contributions are even.

Entropy
In particular, we compute the entropy H of each repository, defined as

p_i = c_i / (Σ_i c_i)    (4.1)

H = −Σ_i p_i · log2(p_i)    (4.2)

where i represents the i-th contributor, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for the repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example:

The contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214    (4.3)

p_1 = 174/214,  p_2 = 36/214,  p_3 = 4/214    (4.4)

H(repository) = -(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.7826    (4.5)

name           contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample contributions to one repository
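The entropy of Equations (4.1) and (4.2) for the worked example above can be sketched directly (the function name `repo_entropy` is mine, not part of STAMPER):

```python
import math

def repo_entropy(contributions):
    """Collaboration entropy H = -sum_i p_i * log2(p_i) over contributor shares."""
    total = sum(contributions.values())
    return -sum((c / total) * math.log2(c / total)
                for c in contributions.values() if c > 0)

# Contributions from Table 4.5
h = repo_entropy({"dragen1860": 174, "ash3n": 36, "kelvinkoh0308": 4})
print(round(h, 4))  # -> 0.7826
```

A repository with a single contributor yields H = 0, the lower bound discussed below.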

The resulting distribution of entropy over all repositories can be used to determine whether a repository is developed unevenly. The lower the entropy, the higher the phase separation, which indicates more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy value for all models. From those figures we can see that most repositories have an entropy value of around zero, which means that deep-learning-related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.

[Figure 4.13: Collaboration Entropy. Per-model histograms of binned entropy values (count of records vs. entropy, 0.0 to 3.0) for the bert, cnn, lstm, ncf, resnet, transformer and wide deep TensorFlow models.]


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models studied.
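The uniqueness percentage behind these figures can be sketched as follows. The fork size records here are invented for illustration; in the study they come from the repository metadata STAMPER collects:

```python
def unique_fork_percent(original_size, fork_sizes):
    """Percentage of forks whose repository size differs from the original."""
    if not fork_sizes:
        return 0.0
    changed = sum(1 for s in fork_sizes if s != original_size)
    return 100.0 * changed / len(fork_sizes)

# hypothetical sizes (KB) of an original repository and five of its forks
print(unique_fork_percent(512, [512, 512, 640, 512, 498]))  # -> 40.0
```

Aggregating this per model gives the distributions shown in the boxplot below.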

[Figure 4.14: Percentage of Forked Repositories Unique From Origin (boxplot of unique_percent, 0 to 100%, per model).]

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15. Most of the model repositories are not changed after forking. Looking in more detail, we can see at a glance not only that changes are rarely made after forking, but also that most changed

[Figure 4.15: Repository Uniqueness Distribution (%). Per-model histograms of binned uniqueness percentage (count of records vs. percentage, 0.00 to 1.00).]

[Figure 4.16: Repository Change Statistics. Per-model histograms of binned mean size change (count of records vs. means, -2500 to 2500).]


forked repositories have a size difference from the original repository of only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them of less concern to people. Second, a model may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey the software maintenance problems in these deep-learning-related repositories. The overall purpose is to explore three sets of factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) - T(created_at)    (4.6)

Figure 4.17 and Table 4.6 show how development time varies per model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). Therefore, we hypothesize that for many of the earlier models, developers started using the open-source web community immediately after the first release.
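Equation (4.6) can be computed directly from the GitHub API timestamp fields; the sample timestamps below are illustrative rather than taken from the dataset:

```python
from datetime import datetime

def repo_age_days(created_at, updated_at):
    """Age in days per Equation (4.6): T(updated_at) - T(created_at)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # GitHub API timestamp format
    delta = datetime.strptime(updated_at, fmt) - datetime.strptime(created_at, fmt)
    return delta.days

print(repo_age_days("2018-10-31T00:00:00Z", "2019-02-18T00:00:00Z"))  # -> 110
```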


Model        Max   Q3      Median  Q1      Min
Bert         779   229     110     32      0
Transformer  1254  321     142     11      0
Wide deep    1107  575     117     0.5     0
ResNet       1360  456.5   120     15      0
NCF          1120  476     216     8       0
LSTM         1812  621.25  315.5   47.25   0
CNN          1385  699.25  483     270.25  0

Table 4.6: Repository development time statistics (days)

[Figure 4.17: Development Time Boxplot. Development time in days (0 to 2000) per model.]


[Figure 4.18: Development Time vs. Number of Open Issues. Scatter plot of open_issues (0 to 2000) against develop_duration, colored by model.]

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and the Spearman correlation test, there is a moderate positive correlation between those two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, which may have more users and higher maintenance costs, tend to have more issues related to them.

Specifically, as depicted in Table 4.7, the three models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository open issue statistics
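Descriptive statistics like those in Table 4.7 can be reproduced with the Python standard library (requires Python 3.8+ for `statistics.quantiles`; the issue counts below are made up for illustration):

```python
import statistics

def issue_stats(counts):
    """Mean, std, quartiles and range of a list of open-issue counts."""
    q1, q2, q3 = statistics.quantiles(counts, n=4)  # 25th/50th/75th percentiles
    return {
        "mean": statistics.mean(counts),
        "std": statistics.stdev(counts),
        "25%": q1, "50%": q2, "75%": q3,
        "min": min(counts), "max": max(counts),
    }

# hypothetical long-tailed open-issue counts for eight repositories
print(issue_stats([0, 0, 0, 1, 1, 2, 5, 40]))
```

As in Table 4.7, a single large outlier inflates the standard deviation far above the mean, the long-tail shape discussed in RQ3.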


Model-Related Repository  Percentage of repositories having a wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
Resnet                    98.817
Transformer               96.97
Wide deep                 100

Table 4.8: Descriptive statistics on percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep-learning-related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this chapter, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original codebase after forking.

[Figure 4.19: Open Issues vs. Number of Repositories. Per-model histograms of binned open_issues (0 to 100) against count of records.]


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the wild: Limitation and Improvement

Pinpointing the landmarks of deep learning models is tricky. We had plenty of ideas on identifying models using different strategies, and we developed heuristics to analyse in depth the construction of models and high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation in the future. For example, users may use the prototxt format to publish their models, whereas in this project we only focused on deep learning models constructed using Python. Our findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, since a search cannot exceed the 1000-repository boundary. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples might produce a more precise outcome.

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to know how deep learning is involved in our real lives, an idea that was novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to develop their own heuristics for data selection: experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a GitHub plugin, allowing researchers



and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project, we investigated the popularity of deep learning models via the number of related repositories existing in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., k-means) to the high-resolution time series data from commits.

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis in GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub deep-learning-related repositories and identified factors affecting those domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and resulting corpus will be of considerable interest to researchers in different fields and will serve the needs of people working at the intersection of social media analysis, data visualization and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving and changing rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh
1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm 2019.1.3 (Professional Edition)
  Build PY-191.7479.30, built on May 30, 2019
  Licensed to ANU / Xing Yu
  JRE: 11.0.2+9-b159.60 x86_64
  JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
  macOS 10.14.6

• Anaconda
  - jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
- Git

Datasets

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

- Before You Begin
- Prerequisites
- Install
- Running
- Test
- High Level Description of all Modules & Datasets
- Authors
- License

STAMPER

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keeps the Mac awake (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run `python3 model_searcher.py` to get keyword-related repositories' metadata from GitHub into the `output` folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions `call_api` and `call_api_dir`.

Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample case: in `main()`, change `keywords` to the terms of interest. The resulting JSON file will be, e.g., `output/bert.JSON`. The sorting method can be customized in the functions `get_total_pages` and `request_ith_page`: `sort` can be `updated` or `stars`; `order` can be `asc` or `desc`.

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the fork timestamps in `forked_timestamp`.

3. Data Selection (Optional)

Run `python3 repository_filter.py` to get your code-related repositories with statistics in the `filtered_repo` folder. Run `python3 filtered_repo.py` to filter your data.

Note: your keywords can be customized in `model_keyword.py`. We store all the previous experiment data in `tensorflow_model_filtering` and `pytorch_model_filtering`.

4. Data Visualization

- Popularity: run `python3 visualizations/popularity.py` and get your graphs in `visualizations/graphs/popularity`.
- Maintenance: run `python3 visualizations/maintenance.py` and get your graphs in `visualizations/graphs/maintenance`.
- Contribution: run `python3 visualizations/contribution.py` and get your graphs in `visualizations/graphs/contribution`.
- Multi Correlations: run `python3 visualizations/multi_variable.py` and get your graphs in `visualizations/graphs/multi_variable`.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in `test.py`. This module records all the unreachable links and writes them to the file `unreachable_urls.txt`.

Usage: change the elements in `keywords`, then run `python3 test.py`. All the unreachable links will be written to `unreachable_urls.txt`.

Customizing Your Own Search

In module `Model.py`, define your own entity lists (e.g. `tensorflow_models`). In the constructor `Model`, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have data from the previous steps (1-2), you can construct a model by calling the constructor `Model`, e.g. `bert = Model("bert tensorflow", "desc_by_star")`, with parameters model name and repository metadata subfolder. Then you can use this object with its related data easily (`from Model import bert` and use `bert` as you go along).

Customize Keywords: in module `model_keyword.py`, import your instantiation (e.g. `lstm`) and call `add_keywords`, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization (Altair is used to draw elegant graphs): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

Experiment Datasets Collected

1. After Data Collection

output/
  asc_by_star/
    cnn tensorflow.json, lstm tensorflow.json
  asc_general/
    bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  by_update_time/
    bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_by_star/
    bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_general/
    bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  pytorch_models/
    AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/
  bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional)

filtered_repo/
  bert.json
  pytorch_model_filtering/
    Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
  tensorflow_model_filtering/
    bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs

graphs/
  contribution/
    change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
  maintenance/
    devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
  multi_variable/
    dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
  popularity/
    accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu. Stamper image from http://bestpriceforrubberstamps.com (License: Free for personal use only).



List of Abbreviations

• ML: Machine Learning
• DL: Deep Learning
• CNN: Convolutional Neural Network
• LSTM: Long Short-Term Memory
• NLP: Natural Language Processing
• Bert: Bidirectional Encoder Representations from Transformers
• NCF: Neural Collaborative Filtering
• ResNet: Residual Network
• Wide & Deep: Wide and Deep Learning

Contents

Acknowledgments vii

Abstract ix

List of Abbreviations xi

1 Introduction 1
  1.1 Trace Deep Learning use through GitHub 1
  1.2 Contribution 2
  1.3 Report Outline 2

2 Background and Related Work 3
  2.1 Background 3
    2.1.1 Deep learning 3
      2.1.1.1 TensorFlow 4
      2.1.1.2 PyTorch 4
    2.1.2 Deep learning models 5
    2.1.3 Summarized Timeline 7
  2.2 Public Code Repositories 8
    2.2.1 Web-based hosting service 8
    2.2.2 Measuring Popularity From GitHub 8
    2.2.3 Extracting Messy Data in the Wild 9
    2.2.4 Visualizing data in Repositories 9
  2.3 Summary 10

3 STAMPER Design and Implementation 11
  3.1 Overview 11
  3.2 Data Collection 12
  3.3 Repository Search 13
  3.4 Data Selection 14
      Example 15
  3.5 Construct the Visualizations 16
  3.6 Summary 18

4 STAMPER in Action 19
  4.1 Popularity of Deep Learning Models in GitHub 19
    4.1.1 Popularity Feature Selection 19
    4.1.2 Past and Current Status: A Full Integration 23
    4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models 26
    4.1.4 RQ2: How popularity varies per model 29
    4.1.5 RQ3: Does the popularity of models relate to other features? 30
  4.2 Contribution of Deep Learning Models in GitHub 34
    4.2.1 Collaborative Contribution 34
    4.2.2 RQ1: After forking, do developers change the codebase? 36
  4.3 Maintenance of Deep Learning Models in GitHub 39
    4.3.1 RQ1: How long has it been in existence? 39
    4.3.2 RQ2: Do old models have more issues compared to new models? 41
    4.3.3 RQ3: Are they well maintained? 42
  4.4 Summary 42

5 Discussion And Future Work 45
  5.1 Discussion 45
    5.1.1 Data in the wild: Limitation and Improvement 45
    5.1.2 Extensibility and Open-Source Software 45
  5.2 Future Work 46
    5.2.1 Social Network Analysis in GitHub 46
    5.2.2 Trend Detection using Commitments Timestamp 46

6 Conclusion 47

7 Appendix 49
  7.1 Appendix 1: Project Description 49
    7.1.1 Project Title 49
    7.1.2 Supervisors 49
    7.1.3 Project Description 49
    7.1.4 Learning Objectives 49
  7.2 Appendix 2: Study Contract 49
  7.3 Appendix 3: Artefact Description 52
    7.3.1 Code Files Submitted 52
    7.3.2 Program Testing 52
    7.3.3 Experiment 52
      Hardware 52
      Softwares 52
      Other 53
      Datasets 53
  7.4 Appendix 4: README 54

List of Figures

2.1 git2net [Gote et al. 2019] 10

3.1 Overview of STAMPER 11
3.2 Data Selection 14
3.3 Store in Local Disk 14
3.4 Overall: Construct the Visualizations 16
3.5 Examine Uniqueness after Forking 18

4.1 Repository Watching [Git b] 20
4.2 Star Sort Menu [Git a] 20
4.3 Popularity Metric 21
4.4 Repositories with Forks 24
4.5 Repositories without Forks 24
4.6 Repository Trend in GitHub For Each Model 25
4.7 Creation Time vs Stars 26
4.8 Number of Forks Related to Repositories in Deep Learning Model Development 28
4.9 Star vs Contributors 30
4.10 Star vs Development Time 31
4.11 Star vs Open Issues 31
4.12 Star vs Entropy Value 32
4.13 Collaboration Entropy 35
4.14 Percentage of Forked Repositories Unique From Origin (Boxplot) 36
4.15 Repository Uniqueness Distribution (%) 37
4.16 Repository Change Statistic 38
4.17 Development Time Boxplot 40
4.18 Development Time vs Number of Open Issues 41
4.19 Open Issues vs Number of Repository 43

List of Tables

2.1 Deep Learning History 4
2.2 Timeline 7

3.1 Repositories Related to Tensorflow 17

4.1 Popularity metric for repositories 21
4.2 Stars Comparison 29
4.3 Forks Comparison 29
4.4 Percentage of one-contributor development for DL-related repositories 32
4.5 Sample Contributions to One Repository 34
4.6 Repository Development Time Stats 40
4.7 Repository Open Issue Statistics 41
4.8 Descriptive statistics on percentage of Wiki Existence 42

Chapter 1

Introduction

1.1 Trace Deep Learning use through GitHub

GitHub is one of the largest web-based hosting communities in the world, and it contains a rich source of data facilitating different software engineering projects. This massive database allows developers and researchers to publish their work.

GitHub can also be used for popularity measurement. Developers can construct their social networks in GitHub in multiple ways, such as joining an organisation or starring a repository they are interested in. These features make repositories in GitHub easily accessible and an excellent place to conduct empirical studies.

The deep learning research community is continually evolving. Not limited to academic research, deep learning is allowing businesses to use data to teach computers how to learn. As a result, a number of machine learning and deep learning startups have come into existence worldwide.

The development of deep learning models sometimes has software engineering problems. Studies of the quality of deep-learning-related projects are sparse, and few researchers focus on usage outside academia. With the expanding range and deepening degree of deep learning use, we would like to test whether developers keep up with the latest deep learning upsurge in academia.

A tool to extract the metadata of historical repositories, with integrated visualisation, based on the vast corpus of GitHub repositories is not currently available. To fill this gap we present our tool, STAMPER. It is implemented in Python and supports GitHub repository metadata extraction based on repository keyword search and code segment search using the GitHub API. It can also analyse the modifications forked repositories make relative to the original repository and capture the repository difference.

We study the historical trend of deep learning models and frameworks from repositories in GitHub. We further demonstrate how repository metadata can be used to see the deep learning trend. Our project provides a novel method to study deep learning frameworks and models from a historical perspective, and in the meantime our work opens a new angle for empirical study of deep learning.


1.2 Contribution

• We introduce STAMPER, a Python tool that can be used to extract the metadata characterising historical open-source projects on GitHub, based on researchers' interests.

• We utilize STAMPER in a case study analysing the usage of deep learning models in TensorFlow. We further demonstrate how we extract a rich set of features and establish the connections between those features and popularity.

• We create and mine a new dataset for further research use.

1.3 Report Outline

The rest of this report is structured as follows. Chapter 2 provides the timeline for deep learning frameworks and deep learning models; some background knowledge is presented, and previous work on software mining, GitHub-related tools, and visualizations is recorded in that chapter as well.

Chapter 3 provides an overview of our work and introduces our proposed methodology to extract the metadata from the GitHub repository.

Chapter 4 presents a case study in which we use our tool to extract deep-learning-related repositories from GitHub and to trace the landscape of those hot deep learning models. The visualizations generated using our tool can aid researchers in understanding past and current trends easily.

Referring back to the critical questions posed, we conclude our project and highlight the next steps in Chapter 5.

Chapter 2

Background and Related Work

Software repository mining enables researchers to study the historical trend of software engineering practice effectively. The use of repository mining is based on the use of web hosting services. There exist multiple approaches to conducting it; in the first section we introduce some background knowledge on web-based hosting services. Then we introduce some popular deep learning frameworks in Section 2.1.1. Finally, we detail some previous works in Section 2.2 that conduct software repository mining and identify the main patterns in exploring repository popularity.

2.1 Background

2.1.1 Deep learning

Different from traditional machine learning, deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [LeCun et al 2015]. Methods in deep learning can dramatically improve human life in multiple aspects, from image classification and speech recognition to machine translation, even to autonomous cars.

The usage of deep learning has grown tremendously in the past few years with the introduction and improvement of hardware availability (GPUs), big data, and cloud providers such as Amazon Web Services (AWS). Not limited to this, large companies create their own research teams to develop their deep learning algorithms and integrate them into frameworks such as PyTorch and TensorFlow. In the meantime, they share their datasets and model training tutorials, helping startup companies to build their state-of-the-art products with minimal effort and time.

Though deep learning has become a hot topic in recent years, the history behind deep learning has been evolving since the 1950s, as summarised in Table 2.1 [Subramanian 2018].

Previously, developers and researchers required expertise in C++ or CUDA to implement deep learning models and algorithms. However, thanks to large technology companies and organisations, people with knowledge of a scripting language (Python) can now also build their own deep learning algorithms.


Technique                      Year
Neural network                 1943
Backpropagation                1960s
Convolutional Neural Network   1979
Recurrent neural network       1980
Long Short-Term Memory         1997

Table 2.1: Deep Learning History

In Section 2.1.1.1 and Section 2.1.1.2 we discuss some state-of-the-art deep learning frameworks used today, both in academia and industry.

2.1.1.1 TensorFlow

TensorFlow is a machine learning system that operates at large scale [Abadi et al 2016]. Before its initial release by the Google Brain team in November 2015, it was developed under the name DistBelief. TensorFlow then released its official 1.0.0 version on the 11th of February 2017, with the introduction of the ability to run on multiple CPUs and GPUs.

Currently, Google uses this framework in numerous ways to improve its search engine, translation process, recommendation system, and image recognition and captioning. The TensorFlow library contains different APIs to build deep learning models at scale, such as ResNet, CNN, or LSTM. The architecture of this framework gives developers the flexibility to experiment with new optimisations and Google's training algorithms. It currently supports numerous applications from front end to mobile; the flexibility and scalability make it one of the most popular deep learning frameworks in the world.

2.1.1.2 PyTorch

Based on the Torch library, PyTorch can be seen as a Python front end to the Torch engine, which provides the capability to define mathematical functions and compute gradients. Similar to TensorFlow, PyTorch has good GPU and CPU support and is primarily developed by Facebook's artificial intelligence research group. The official initial release of PyTorch was in October 2016; its dynamic computation capability gives it great flexibility in building complex architectures.

However, unlike TensorFlow, which can be seamlessly integrated into real industrial applications, PyTorch was primarily developed for researchers and scientists, and it is not easy or recommended to use in production in certain scenarios.


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and pick up the speed of project development to survive in this keen competition. Winning trust from the public with a service of high quality is thus required.

Initially, we would like to conduct our research on the latest model stores, such as AWS SageMaker, Azure Machine Learning service, the Wolfram Neural Net Repository, and ONNX. However, the algorithms and datasets used there are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand over the public. To avoid those problems and gain a deeper insight into usage in society, we choose a framework with models of greater transparency and substantial usage: TensorFlow.

API Referenced

Since this project involves a range of deep learning models, we began by developing an understanding of how to build and train neural networks. The construction of networks may rely on the layers API provided by Keras and TensorFlow. In the meantime, TensorFlow recently introduced estimator APIs to simplify the procedure of training, evaluation, prediction, and export.

Convolutional Neural Network (CNN)
Convolutional Neural Network is one of the most established algorithms among all the deep learning models and one of the most dominant algorithms in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons arranged in a grid pattern. Typically, a CNN consists of three types of layers: fully connected layers, convolution layers, and pooling layers. Convolution and pooling layers conduct the feature extraction, and the fully connected layer maps the extracted features onto the final output. Layers are interconnected, and thus the extracted features are transferred layer by layer.
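To make the layer roles concrete, here is a minimal pure-Python sketch of the two feature-extraction stages described above (a fully connected layer would then map the pooled features to the final output). The image and kernel values are illustrative assumptions, not taken from any model in this report.

```python
# Sketch: a single-channel convolution (2x2 kernel, "valid" padding, the
# cross-correlation that DL frameworks compute as "convolution") followed
# by non-overlapping 2x2 max pooling.

def conv2d(image, kernel):
    """Slide the kernel over the image and sum element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + u][j + v] * kernel[u][v]
                for u in range(kh) for v in range(kw))
            for j in range(len(image[0]) - kw + 1)
        ]
        for i in range(len(image) - kh + 1)
    ]

def max_pool(feature_map, size=2):
    """Downsample by taking the max of each non-overlapping size x size window."""
    return [
        [max(feature_map[i + u][j + v] for u in range(size) for v in range(size))
         for j in range(0, len(feature_map[0]) - size + 1, size)]
        for i in range(0, len(feature_map) - size + 1, size)
    ]

image = [[1, 2, 0, 1], [0, 1, 3, 2], [2, 1, 0, 1], [1, 0, 2, 3]]
kernel = [[1, 0], [0, -1]]         # toy kernel, chosen arbitrarily
features = conv2d(image, kernel)   # 3x3 feature map extracted by convolution
pooled = max_pool(features)        # pooled summary passed on to later layers
print(pooled)  # [[1]]
```

Stacking several such convolution/pooling pairs is what lets the extracted features be "transferred layer by layer" as described above.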

Long short-term memory (LSTM)
Different from traditional neural networks, which cannot memorise previous data, a special kind of recurrent neural network, Long Short-Term Memory (LSTM), provides researchers with an effective way to perform persistent data tasks and is capable of learning long-term dependencies [Hochreiter and Schmidhuber 1997]. Specifically, LSTM combines the forget and input gates into an update gate. It is capable of learning all the dependencies in the historical data and making predictions from the information remembered previously. Inside LSTM, instead of using the

TensorFlow official models are chosen in our project. The TensorFlow official models contain a collection of deep learning models built with TensorFlow's high-level APIs (https://github.com/tensorflow/models/tree/master/official).


linear layer, there is a small network inside the LSTM which performs its function independently.

LSTM is one of the most common uses of recurrent neural networks. This model is generally used with sequential data and can solve language modelling problems such as NLP concepts (word embeddings, encoders).

Residual Network (ResNet)
One of the problems deep learning models face is that, as the number of layers increases past a certain point, accuracy stops improving. This problem motivated modern network architectures such as the residual network (ResNet) and Inception, which address it through residual connections.

ResNet normally solves the problem described above by fitting a residual mapping, adding a shortcut connection. Each ResNet block contains a series of layers and a shortcut connection component.

Bidirectional Encoder Representations from Transformers (Bert)
Bert is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al 2018].

It was first released in google-research/bert on GitHub on the 1st of November 2018, with the ability to handle a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications; it aims to predict the relationships between sentences by analysing the whole sentence holistically [Devlin et al 2018].

Attention is all you need (Transformer)
Most problems in deep learning can be seen as a form of sequence-to-sequence mapping and can be solved using a common type of architecture, the encoder-decoder architecture.

The encoder and decoder both consist of identical layers, and each layer consists of two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network [Vaswani et al 2017].

Neural Collaborative Filtering (NCF)
Neural Collaborative Filtering (NCF) is a neural network architecture which utilises the non-linearity of the neural network to build recommendation systems [He et al 2017]. It demonstrates that matrix factorisation can be interpreted as a special case of neural collaborative filtering. To add additional non-linearity, this model introduces a multi-layer perceptron module alongside the generalised matrix factorisation (GMF) layer.

Wide and Deep Learning
Since linear models are not good at generalising across unique features, deep models were introduced to solve this problem. Deep models can use embedding vectors for every query, and this technique can then generalise by coupling the items and queries.

To overcome over-generalisation, the Google research team introduced the Wide & Deep Learning models, jointly training wide linear models and deep neural networks to combine the benefits of memorisation and generalisation for recommender systems [Cheng et al 2016].

2.1.3 Summarized Timeline

Model Name    Definition Raised Time
CNN           1980s
LSTM          1997
ResNet        2015
Wide & Deep   2016
NCF           2017
Transformer   2017
Bert          2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest open-source distributed version control systems (DVCS) [Gousios et al 2014]. This distributed version control system enables contributors to submit a set of changes and integrate them into the main development branch. The use of git is based on pragmatic needs, combining the advantages of version control and collaborative development.

GitHub can also lead to insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project; thus, the number of stars can reveal popularity. From a software development research perspective, it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth, using time series metadata derived from 2,279 accessible GitHub repositories. In the meantime, they found that slow growth is more common in overpopulated application domains and for old repositories. Moreover, they conclude that the three most common domains on GitHub include web libraries and frameworks and non-web libraries and frameworks.

Building on this work, we would like to examine whether deep learning framework repositories follow a similar popularity trend pattern. At the same time, we would also like to study whether a relationship exists between three factors: forks, stars, and watchers.

Predicting the Popularity of GitHub Repositories
In the same year, Borges et al [2016a] published another paper, on predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regressions to predict the number of stars of GitHub repositories; as a result, project owners can see how their projects are performing in the open-source community.

Their study examines the correlation between the popularity of a repository and the number of stars it owns. The study also reports a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent
GHTorrent [Gousios and Spinellis 2012] is designed to conduct independent repository mining through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all the repositories, and the resulting dataset can be used towards answering multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving the raw data based on the results returned from the REST API. However, their tool does not have the ability to visualise the metadata or offer trend analysis at a high level.

MetricMiner
A similar tool is MetricMiner [Sokol et al 2013]. It is a web application that supports researchers in mining software repositories, performing data extraction and statistical inference on the data collected. This tool automatically clones the repository, processes the metadata, and stores the data in the cloud, which gives it good scalability and fast computational speed in query answering without users installing any software on their localhost.

GitcProc
GitcProc [Casalnuovo et al 2017] is a tool that uses regular expressions to extract the changed lines within a repository, to facilitate answering project evolution questions. GitcProc can retrieve and summarise global statistics of project metrics, including commits, commit dates, and contributors. This tool can measure how many changes have taken place in Java code and is also able to locate the changed files.

RepoVis
RepoVis [Feiner and Andrews 2018] is a newer tool which provides visual overviews of software maintained in Git repositories. RepoVis is a client-server web application providing a full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its search functionality to GitHub, combined with a code-based search. All the visualizations are written out in SVG format.

2.2.4 Visualizing data in Repositories

Chronos
CHRONOS [Servant and Jones 2013] is a software tool that enables visualisation of the historical changes inside software source code. This tool implements a zoomable user interface as a visualisation of the actual code, supporting developers from a high-level view (pattern recognition) to a low-level view (reverse engineering) using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise the complete history of changes, including the revisions that modified them. Inspired by this tool, our project uses this visualisation method to track the historical trend related to the keyword specified by users in GitHub.

Figure 2.1: git2net [Gote et al 2019]

GEVOL
Collberg et al [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph drawing technique, to deduce a better understanding of a program from its development history, displaying all the visualisations with a temporal graph visualizer.

This system aids in the discovery of the structure of a system and provides the user with a new way to explore the evolution of a program by visualising changes to the system. It extracts information about Java programs stored within a CVS version control system into three types of graphs: inheritance, control-flow, and call-graphs.

git2net
git2net [Gote et al 2019] is a software tool that facilitates the extraction of co-editing networks from git repositories. Similar to GEVOL, it uses text mining techniques to analyse the history of modifications within files. In addition, it addresses the importance of studying social networks in GitHub and gives the reader a broader view of the application of graph-based data analysis and modelling. Their tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter we detailed the web-based hosting service we selected for study (GitHub) and presented the concept of deep learning, with two popular frameworks and several state-of-the-art neural network models. In the next chapter we elaborate on how we designed and implemented STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter we outline our design and implementation for data extraction, and then detail the metrics we use to estimate the trend of deep learning framework and model usage in GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

[Figure 3.1 (diagram): the STAMPER pipeline — (1) Data Collection via the GitHub Project Search API against a set of keywords, (2) Repository Search, and (3) optional Data Selection via the GitHub Code Search API using model-name keywords, followed by local storage and data visualisation.]

Figure 3.1: Overview of STAMPER


Data Collection
We first collect all the repository metrics through the GitHub API. This step allows us to extract the history of all the repositories related to the keyword and record the metadata for each repository (repository-based search).

Repository Search
As each repository contains multiple contributors and forks, we subsequently extract and analyse whether changes were made, based on the size information, and calculate the collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the fork information to create visual representations.

Data Selection
We have implemented a selector that allows excluding specific repositories unrelated to the desired repositories. The selector summarises the frequency counts for the keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis
Since each forked repository may involve re-development and modification, analysis of modifications to forked repositories is included in our project. For all changed repositories, the size difference is analysed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to deeply analyse and manipulate the data, and even run statistical tests on the dataset. To better understand these metrics, we divide them into multiple categories. For attributes that are not primary data from the GitHub API, we explain them in the data expansion part and label them as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication, the user can make up to 5,000 requests per hour; otherwise, the rate limit only allows up to 60 requests per hour [Git d].
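As a sketch of this authentication step (the endpoint, Accept header, and `token` authorization scheme follow the documented GitHub REST API; the `GITHUB_TOKEN` environment variable and the example keyword are illustrative assumptions, not STAMPER's actual code):

```python
# Sketch: authenticated keyword search against the GitHub REST API.
import json
import urllib.parse
import urllib.request

GITHUB_API = "https://api.github.com"

def build_headers(token=None):
    """Unauthenticated requests are limited to 60/hour; 5,000/hour with a token."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    return headers

def search_repositories(keyword, token=None):
    """Return repository metadata for a keyword search, sorted by stars."""
    query = urllib.parse.urlencode({"q": keyword, "sort": "stars", "order": "desc"})
    req = urllib.request.Request(
        f"{GITHUB_API}/search/repositories?{query}",
        headers=build_headers(token),
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["items"]

# Usage (requires network access and, ideally, a personal access token):
#   import os
#   repos = search_repositories("tensorflow", token=os.environ.get("GITHUB_TOKEN"))
```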


Type          Meta-data
Contributor   contribution: int [Data Expansion]
              login (user name): String
              type (user/organization): String
              contributors_url
Repository    created_at
              description
              full_name
              language
              size
Popularity    fork: Boolean
              forks: int
              forks_url
              stargazers_count
              watchers_count
              unique_repos [Data Expansion]
Owner         id
              login (username)
              type
Maintenance   has_issues: Boolean
              has_wiki: Boolean
              open_issues: int
              pushed_at
              updated_at
              score

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deep crawling operations that collect the raw data.

• Contribution
One repository generally involves multiple developers conducting software development. The project owner is not necessarily the person who contributes the most code, and the amounts contributed by different developers are potentially unequal. As a result, we further track this information using the GitHub API and record the number of contributions each developer made to each repository.

• Unique_repos
Popular repositories may have numerous forked repositories owned by other users, and the reasons behind forking behaviour vary. Our research would like to explore whether forkers conduct subsequent development based on the original codebase. By comparing the size of each forked repository (Fi) and the original repository (O), we obtain all the forked repositories with a change of size (c):

Fi + c = O    (3.1)
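The bookkeeping behind Equation (3.1) can be sketched as follows (the `size` field follows the GitHub repository metadata, where size is reported in KB; the repository names are illustrative assumptions):

```python
# Sketch of the size comparison in Equation (3.1): for each fork Fi with a
# recorded size, the change c satisfies Fi + c = O, i.e. c = O - Fi.

def size_changes(original, forks):
    """Return {fork_full_name: c} for every fork whose size differs from the original."""
    return {
        fork["full_name"]: original["size"] - fork["size"]
        for fork in forks
        if fork["size"] != original["size"]
    }

origin = {"full_name": "tensorflow/models", "size": 1000}
forks = [
    {"full_name": "alice/models", "size": 1000},  # unchanged after forking
    {"full_name": "bob/models", "size": 1250},    # grew by 250 KB after forking
]
print(size_changes(origin, forks))  # {'bob/models': -250}
```

A negative c thus indicates a fork that has grown past the original; forks with c = 0 are treated as unchanged.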

3.4 Data Selection

[Figure 3.2 (diagram): an entity (model name) together with API keywords drives a search within repositories, yielding usage statistics.]

Figure 3.2: Data Selection

[Figure 3.3 (diagram): unfiltered forked-repository data with timestamps is filtered using model-related keywords (e.g. Bert, ResNet, CNN), grouped per model, and stored on local disk.]

Figure 3.3: Store in Local Disk


Figure 3.2 represents our method of searching API usage in DL-model-related repositories. GitHub provides a REST API to examine whether a repository contains the targeted API or not.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. Moreover, the appearance frequency of the user-specified API is embedded directly within the returned result and can be matched in our program to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. It also allows users to build a high-level picture of API usage across GitHub repositories.

In the meanwhile, we also provide multiple options for selecting APIs, which are stored in model_keyword.py.
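A minimal sketch of this selection step follows; it uses GitHub's documented code-search endpoint, but the keyword, repository name, and token handling are placeholders rather than STAMPER's actual implementation:

```python
# Sketch: count occurrences of a model-related keyword in one repository
# via GitHub's code-search endpoint. Keyword/repo names are placeholders.
import json
import urllib.parse
import urllib.request

def build_code_search_url(keyword, full_name):
    """Build a GitHub code-search URL scoped to a single repository."""
    query = urllib.parse.quote(f"{keyword} repo:{full_name}")
    return f"https://api.github.com/search/code?q={query}"

def keyword_count(keyword, full_name, token):
    """Return total_count of matching files (requires an API token)."""
    req = urllib.request.Request(
        build_code_search_url(keyword, full_name),
        headers={"Authorization": f"token {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["total_count"]

print(build_code_search_url("keras.applications.resnet50", "example/repo"))
```

The `total_count` field of the search response gives the per-repository appearance count that is then written out with the repository's full name.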

Table 3.1 shows descriptive statistics on the models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras applications library provides users the ability to load deep learning models and instantiate a model with default weights. Take ResNet as an example:

• With pre-defined models: Users may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, and this model could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they could all be used as sample keywords in STAMPER, such as:

keras.applications.resnet.ResNet50
keras.applications.resnet.ResNet101
keras.applications.resnet.ResNet152
keras.applications.resnet_v2.ResNet50V2
keras.applications.resnet_v2.ResNet101V2
keras.applications.resnet_v2.ResNet152V2

• With self-defined models: TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to further explore the components inside the ResNet class, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, data selection heuristics vary. Deep learning

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py


users and experts could define their searches according to their interests and preferences.

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from these three perspectives is illustrated in Figure 3.4. Meanwhile, Chapter 5 gives an example of our collected repository metadata for deep learning models.

Figure 3.4: Overall Construction of the Visualizations (entities 1..n are functionally mapped to popularity-related, contribution-related, and maintenance-related visualisations)

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped by week (with forks)

• Repository creation time vs. stars

Contribution

To further exploit the forking information, STAMPER also supports comparison between the original repository and its forked repositories. This work could be extended by visiting each forked repository's URL and tracing the commit history.

As shown in Figure 3.5, an entity (E) we search in GitHub may have multiple related repositories (R_i) with their corresponding forked repositories (F_i). Among the forked repositories, we denote a changed forked repository by C_i.

To examine whether there exist changes in forked repositories, and to compare multiple entities, we calculate the difference using the equation below.


Keyword                     Total Repositories Collected (including Forks)   Total Original Repositories Collected
ResNet tensorflow           6129                                             339
Bert tensorflow             13734                                            106
CNN tensorflow              39765                                            1000
LSTM tensorflow             19572                                            1000
Transformer tensorflow      7188                                             145
Wide and deep tensorflow    324                                              39

Table 3.1: Repositories Related to TensorFlow


Uniqueness percentage distributions are composed of all the percentages p_i, each corresponding to an original repository R_i:

p_i = Σ C_i / Σ F_i    (3.2)
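A minimal sketch of the uniqueness ratio in Equation (3.2), assuming we already know, for a given repository, how many of its forks changed:

```python
# Sketch: uniqueness percentage (Equation 3.2) - the fraction of forked
# repositories whose size changed after forking.

def uniqueness_percentage(changed_forks, total_forks):
    """p_i = (number of changed forks C_i) / (number of forks F_i)."""
    if total_forks == 0:
        return 0.0
    return changed_forks / total_forks

# Illustrative values: 3 of 4 forks differ from the original repository.
print(uniqueness_percentage(3, 4))  # -> 0.75
```

Aggregating these per-repository ratios over all repositories of an entity yields the uniqueness distributions plotted later.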

Figure 3.5: Examine Uniqueness after Forking (an entity E maps to repositories 1..n; each repository's forked repositories are flagged as changed Y/N)

• Percentage of forked repositories unique from origin (boxplots)

• Uniqueness percentage distribution for each entity (histograms)

• Entropy distribution for each entity (histograms)

Maintenance

• Development time boxplot for each entity

• Open issues distribution for each entity

3.6 Summary

In this chapter, we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. We also introduced and analysed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field: deep learning models are continually evolving and being built, trained, and deployed by researchers. Our tool is available for analyzing such changes. We collected the historical information stored in GitHub and extracted the repository metadata using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development has become a battlefield without the smoke of gunpowder: researchers, companies and developers are all trying to dominate the conversation in deep learning. A variety of models exist to work with, but there is no common bridge connecting those ideas. Historical data in GitHub is obscure and hard to find, and as a result many developers, especially experienced ones, remain in their original comfort zone. With our study, we hope to shed some light on deep learning usage and highlight a few suggestions for the public.

This section aims to answer some questions related to both the models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, due to the few studies about popularity in GitHub, there is no standardized feature to measure it. We analyze some potential features of each repository and hypothesize that popularity is strongly related to the number of stars a repository owns.

This decision is justified in the following section with more background on GitHub.

• Watchers: Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching; however, watching does not imply being a collaborator [Git b]. A watcher could watch



Figure 4.2: Star Sort Menu [Git a]

a repository to receive notifications for new pull requests or issues as they are created. Watchers indicate how much interest the GitHub community gives to the repository.

Figure 4.1: Repository Watching [Git b]

• Stars: Starring a repository makes it easy to keep track of a repository the user is interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks: Forks are created when a user would like to make a copy of a repository. The user could fork a repository to suggest changes, or to use it as a basis for a new project.

Based on the data gathered in the collection stage, we extract the watchers_count, stargazers_count and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics and draw a scatter plot, as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star    forks_count   watchers_count   model name
17940   4661          17940            Bert
12405   3637          12405            Bert
5263    1056          5263             Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, and the data are clearly not normally distributed, so we turn to a rank-based correlation measure.


Spearman Correlation Coefficient

Definition: The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman 2011]. This method allows us to test for a rank-order relationship between two numerical or ordinal variables associated with a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

    from scipy.stats import spearmanr
    coef1, p1 = spearmanr(star, forks)
    print(coef1, p1)
    >> 0.8752903811064278 0.0

• Star vs Watcher

    coef2, p2 = spearmanr(star, watchers)
    print(coef2, p2)
    >> 1.0 0.0

• Fork vs Watcher

    coef3, p3 = spearmanr(forks, watchers)
    print(coef3, p3)
    >> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Setting α = 0.05: the p-values p1, p2, p3 are all less than α, and from the calculation above we also find strong positive correlations, with values of coef1 = 0.875, coef2 = 1.0 and coef3 = 0.875 respectively.

This means it is very unlikely (at the 95% confidence level) that the testing data are uncorrelated, and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of the report, we consider the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, appear to be two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from these models with a longer history, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as already described in the background section.

The model development community has recently seen the release of multiple powerful frameworks that are treated as baselines for building models. However, for many new models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.


Figure 4.4: Repositories with Forks (accumulated number of repositories created per model, including forks, 2015-2019)

Figure 4.5: Repositories without Forks (accumulated number of original repositories created per model, 2015-2019)


Figure 4.6: Repository Trend in GitHub For Each Model


Figure 4.7: Creation Time vs Stars

A fork is another copy of a repository. A forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories created including forks and the total number created without forks. We find that most repositories related to deep learning models are not original, which indicates that a considerable number of developers remain in a studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, in contrast to the previous summarizing method, we study the popularity of deep learning models over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the largest number of repositories created. Let us examine this using the data: in


2017 the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued to an even higher level, which persists until now.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have among the most important and significant communities in the deep learning field; these networks are essential in both computer vision and NLP, and they form an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly in the recent two years. Differing from earlier structures like CNN, both of them make modifications to the original structure and significantly improve results in computer vision and translation tasks.

Rising Star: Bert

However, no model comes with perfection: LSTM itself can be extended to many variants, and BERT is one of those.

The current trends, as depicted in the graph, suggest the conclusion that deep learning models are proliferating fast with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with the time the model came into existence, but our data tells a different story.

Published as a paper in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e., stars) and creation time: past trends are no guarantee of future ones, and it is possible that the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide and Deep model is similar: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the previous data also confirm that there is no significant rise in the use of this model.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (fork-count distribution histograms per model)


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model, and as presented in Table 4.2 and Table 4.3, we can see that:

Model Name     Mean     STD      Min   25%   50%   75%    Max
Bert           498.65   2196.3   0     1     8     43     17940
CNN            106.84   611.97   2     3     8     32     13882
LSTM           48.82    214.22   0     1     2     13     2703
NCF            77       129.91   1     2     3     115    227
ResNet         46.88    221.43   0     0     1     8      2980
Transformer    186.79   1155.87  0     0     4     21     12408
Wide and Deep  16.23    36.80    0     0     1     8      146

Table 4.2: Stars Comparison

Model Name     Mean        STD         Min   25%   50%   75%    Max
Bert           128.214953  585.926617  0.0   0.0   1.0   16.5   4661.0
CNN            40.710      252.713617  0.0   1.0   4.0   14.0   6274.0
LSTM           17.793      71.956709   0.0   0.0   1.0   5.0    968.0
NCF            34.333333   58.603185   0.0   0.5   1.0   51.5   102.0
ResNet         17.442478   93.754994   0.0   0.0   0.0   3.0    1442.0
Transformer    53.518797   336.103826  0.0   0.0   1.0   6.0    3637.0
Wide and Deep  7.282051    16.364192   0.0   0.0   0.0   2.5    71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52) and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79) and ResNet (17.44).

Kruskal-Wallis Test: The Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The 7 models' distributions are the same.

• H1: The 7 models' distributions are different.

    from scipy.stats import kruskal
    stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                      dfNcf["star"].tolist(), dfResnet["star"].tolist(), dfTransformer["star"].tolist(),
                      dfWideDeep["star"].tolist())
    print(stat, p)
    >> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the previous popularity exploration based on the number of repositories existing in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building one's own Transformer or Bert model requires a large amount of time and effort, but developers still show their interest in those novel deep learning models.

Figure 4.9: Star vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.


Figure 4.10: Star vs Development Time

Figure 4.11: Star vs Open Issues

Figure 4.12: Star vs Entropy Value

Number of Contributors: From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Across all repositories, the top-3 repositories with the most stars per contributor come from the models CNN (1687.5 stars/contributor), Transformer (1551 stars/contributor) and Bert (1550 stars/contributor).

Model          Percentage of One-Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time: From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value ≈ 0). This suggests that the longer a model is developed, the more stars it will have (i.e., the model becomes more popular). The top-2 models with the longest development durations are LSTM and CNN.

Open Issues: From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesize that the more popular a repository becomes, the more issues it will have; we further investigate this correlation in the following section.

Entropy: From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether contributions are even.

Entropy

In particular, we compute the entropy H of each repository, defined as

p_i = c_i / Σ_i c_i    (4.1)

H = −Σ_i p_i log2(p_i)    (4.2)

where i denotes the i-th contributor, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example:

Its contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214    (4.3)

p_1 = 174/214,  p_2 = 36/214,  p_3 = 4/214    (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.7826    (4.5)

name            contribution
dragen1860      174
ash3n           36
kelvinkoh0308   4

Table 4.5: Sample Contributions to One Repository
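Equations (4.1) and (4.2), applied to the contribution counts in Table 4.5, can be sketched as:

```python
# Sketch: collaboration entropy of a repository from per-contributor
# contribution counts (Equations 4.1 and 4.2).
import math

def collaboration_entropy(contributions):
    """H = -sum_i p_i * log2(p_i), where p_i = c_i / sum_i c_i."""
    total = sum(contributions)
    return -sum((c / total) * math.log2(c / total)
                for c in contributions if c > 0)

# Contribution counts from the dragen1860/TensorFlow-2x-Tutorials example.
print(round(collaboration_entropy([174, 36, 4]), 4))
```

A single-contributor repository yields an entropy of 0, while k contributors with perfectly even contributions yield the maximum entropy log2(k).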

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly: the lower the entropy, the higher the phase separation, which means more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy values for all models. From those figures, we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


Figure 4.13: Collaboration Entropy (entropy distribution histograms per model)


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with the metadata of their forked repositories.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories out of the 6 models.

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the size change compared to the original repository. Our objective was a summarized view that allows people to see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. A more detailed analysis shows at a glance not only that changes are rarely made after forking, but also that most changed


Figure 4.15: Repository Uniqueness Distribution (%)

Figure 4.16: Repository Change Statistic (histograms of size change after forking, per model)


forked repositories differ in size from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and a lack of tutorials and attention makes them of less concern to people. Second, the model itself may only be valid for a specific type of data, which makes it less robust and generalized and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey the software maintenance problems of these deep learning related repositories. The overall purpose is to explore three sets of factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report, we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
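A sketch of Equation (4.6), assuming the ISO-8601 timestamps that the GitHub API returns for created_at and updated_at (the sample timestamps below are made up):

```python
# Sketch: repository age in days from GitHub's ISO-8601 timestamps
# (Equation 4.6: age = T(updated_at) - T(created_at)).
from datetime import datetime

def repo_age_days(created_at, updated_at):
    """Both arguments use GitHub's timestamp format, e.g. 2019-01-01T00:00:00Z."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    delta = datetime.strptime(updated_at, fmt) - datetime.strptime(created_at, fmt)
    return delta.total_seconds() / 86400

print(repo_age_days("2018-10-01T12:00:00Z", "2019-01-09T12:00:00Z"))  # -> 100.0
```

The resulting day counts per repository feed the development-time statistics below.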

Figure 4.17 and Table 4.6 show how development time varies per model. The median development time is as follows: Bert (110 days), Transformer (142 days), Wide and Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distributions of development days differ by model (p-value ≤ 0.05). Therefore, we hypothesize that for many of the earlier models, developers started using the open-source web community immediately after the first release.


Model          Max of days   Q3 of days   Median of days   Q1 of days   Min of days
Bert           779           229          110              32           0
Transformer    1254          321          142              11           0
Wide and Deep  1107          575          117              0.5          0
ResNet         1360          456.5        120              1.5          0
NCF            1120          476          216              8            0
LSTM           1812          621.25       315.5            47.25        0
CNN            1385          699.25       483              270.25       0

Table 4.6: Repository Development Time Statistics

[Boxplot omitted: development time in days (0–2000) for each model: bert, cnn, lstm, ncf, resnet, transformer and wide deep (TensorFlow).]

Figure 4.17: Development Time Boxplot


[Scatter plot omitted: development duration in days (0–2000) against number of open issues (0–1100), one point per repository, coloured by model name.]

Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of open issues. As suggested visually by the figure and confirmed by a Spearman correlation test, there is a weak positive correlation between these two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, which are costlier to maintain, may have more users and more issues related to them.
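A minimal sketch of this test, using pandas' built-in rank correlation; the record keys (`develop_duration`, `open_issues`) are illustrative, not STAMPER's exact schema:

```python
# Spearman rank correlation behind Figure 4.18, via pandas.
# The record keys used here are assumed for illustration.
import pandas as pd

def dev_time_issue_correlation(repos):
    """Spearman rank correlation between development time and open issues."""
    df = pd.DataFrame(repos)
    return df["develop_duration"].corr(df["open_issues"], method="spearman")
```

A coefficient near 0.46, as found above, indicates a weak-to-moderate monotonic relationship between the two variables.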

Specifically, as depicted in Table 4.7, the three models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository  Percentage of repositories having Wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
ResNet                    98.817
Transformer               96.97
Wide deep                 100

Table 4.8: Descriptive statistics on percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. From the data collected, we can therefore see that deep-learning-related repositories are well documented (i.e., most have a wiki page). Figure 4.19 shows the histogram of open issues. All of these samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.
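The wiki statistic in Table 4.8 can be computed directly from the collected metadata; a minimal sketch, assuming the GitHub API's `has_wiki` flag is present in each record:

```python
# Computing the Table 4.8 statistic from collected metadata: the GitHub
# API exposes a `has_wiki` flag on every repository record.
def wiki_percentage(repos):
    """Percentage of repositories with a wiki enabled."""
    if not repos:
        return 0.0
    with_wiki = sum(1 for r in repos if r.get("has_wiki"))
    return 100.0 * with_wiki / len(repos)
```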

4.4 Summary

In this section, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories, using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original codebase after forking.


[Histograms omitted: count of repositories per binned number of open issues (0–100), one panel per model: bert, cnn, lstm, ncf, resnet, transformer and wide deep (TensorFlow).]

Figure 4.19: Open Issues vs Number of Repository


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitation and Improvement

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas on identifying models using different strategies. We developed heuristics for in-depth analysis of model construction and of high-level APIs, such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is a sampling problem at the same time. The models we chose cannot represent all the new models in the wild. This is an open research question which needs further investigation in the future; for example, users may use the prototxt format to publish their models, whereas in our project we only focused on deep learning models constructed using Python. Our findings may also reflect sampling limitations: the present experiment uses a limited number of repositories on GitHub, since search results cannot exceed the 1000-repository boundary on originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this strategy still cannot capture all the repositories in GitHub. Other, more stratified samples might yield a more precise outcome.

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild; this program could then provide a broader picture of deep learning model usage in the world. Our program also allows a developer or user to develop their own heuristics for data selection: experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or becoming a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project, we investigated and examined the popularity of deep learning models through the number of related repositories existing in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g. K-Means) to high-resolution time-series data from commits.
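To make the idea concrete, here is a toy sketch of clustering commit timestamps; it hand-rolls a one-dimensional K-Means rather than using a library, and all inputs are synthetic. A real analysis might instead use scikit-learn's KMeans on richer features.

```python
# Toy illustration of the clustering idea: a hand-rolled one-dimensional
# K-Means over Unix commit timestamps. All inputs are synthetic.
import numpy as np

def cluster_commit_times(timestamps, k=2, iters=100, seed=0):
    """Group commit timestamps into k activity clusters."""
    x = np.asarray(timestamps, dtype=float)
    rng = np.random.default_rng(seed)
    centers = rng.choice(x, size=k, replace=False)  # initial centers
    for _ in range(iters):
        # assign each timestamp to its nearest center
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        new = np.array([x[labels == j].mean() if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    return labels, centers
```

On two well-separated bursts of commit activity, the two recovered clusters correspond to the two bursts.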

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis in GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of deep-learning-related repositories on GitHub and identified factors affecting each of these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users a deeper insight into the current deep learning trend and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect that our tool and the resulting corpus will be of considerable interest to researchers in different fields, and that they will serve the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What are the differences between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Softwares

bull PyCharm

PyCharm 2019.1.3 (Professional Edition)
Build #PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
macOS 10.14.6

bull Anaconda

– jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0
  - numpy==1.14.0
  - statistics==1.0.3.5
  - ratelimit==2.2.1
  - requests
  - altair
  - matplotlib==2.2.2
  - selenium
- Git

Datasets

asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

Before You Begin
Prerequisites
Install
Running
Test
High Level Description of all Modules & Datasets
Authors
License

STAMPER is a python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keep the Mac awake with this useful App (otherwise it will disconnect the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used, listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip:

pip3 install --upgrade pip

Install the list of requirements specified in requirements.txt:

pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run "python3 model_searcher.py" to get keyword-related repositories' metadata from GitHub in the "output" folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir. Then run "sh JSONFormatter.sh" in your terminal to well-format your output data.

Sample Case: in main(), change "keywords" to the terms of interest; the resulting JSON file will then be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: "sort" can be "updated" or "stars", and "order" can be "asc" or "desc".
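For orientation, the search request that this collection step issues looks roughly like the sketch below, using only the standard library; the internals of model_searcher.py may differ, and the helper names here are illustrative. An authentication token raises the permitted request rate.

```python
# Rough shape of the repository search request used during data
# collection. Helper names are illustrative, not STAMPER's actual code.
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

GITHUB_SEARCH = "https://api.github.com/search/repositories"

def build_search_url(keyword, sort="stars", order="desc", page=1):
    """One page of GitHub repository search (sort: stars/updated)."""
    query = urlencode({"q": keyword, "sort": sort, "order": order,
                       "page": page, "per_page": 100})
    return f"{GITHUB_SEARCH}?{query}"

def search_repositories(keyword, token, **kwargs):
    """Fetch one page of matching repository metadata."""
    req = Request(build_search_url(keyword, **kwargs),
                  headers={"Authorization": f"token {token}",
                           "Accept": "application/vnd.github.v3+json"})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)["items"]
```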

2. Repository Search

Run "python3 forks_time_stamp_getter.py" to get all the forks' timestamps in forked_timestamp.

3. Data Selection (Optional)

Run "python3 repository_filter.py" to get your code-related repositories, with statistics, in the filtered_repo folder. Run "python3 filtered_repo.py" to filter your data. Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run "python3 visualizations/popularity.py" and get your graphs in visualizations/graphs/popularity
- Maintenance: run "python3 visualizations/maintenance.py" and get your graphs in visualizations/graphs/maintenance
- Contribution: run "python3 visualizations/contribution.py" and get your graphs in visualizations/graphs/contribution
- Multi Correlations: run "python3 visualizations/multi_variable.py" and get your graphs in visualizations/graphs/multi_variable

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module will record all the unreachable links and write them into the file unreachable_urls.txt.

Usage: change the elements in "keywords", then run "python3 test.py". All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: since you already have data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with parameters model name and repository-metadata subfolder. Then you can use this object with its related data easily (from Model import bert, and use bert as you go along).

Customize Keywords: in module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection (output/):
- asc_by_star: cnn tensorflow.json, lstm tensorflow.json
- asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search (forked_timestamp/): bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection, Optional (filtered_repo/):
- bert.json
- pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
- tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs (graphs/):
- contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
- maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
- multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
- popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu
Stamper image from http://bestpriceforrubberstamps.com (License: Free for personal use only)

Bibliography

a. GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

b. GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

c. GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

d. GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)


List of Abbreviations

• ML: Machine Learning

• DL: Deep Learning

• CNN: Convolutional Neural Network

• LSTM: Long short-term memory

• NLP: Natural Language Processing

• Bert: Bidirectional Encoder Representations from Transformers

• NCF: Neural Collaborative Filtering

• ResNet: Residual Network

• Wide & Deep: Wide and Deep Learning


Contents

Acknowledgments vii
Abstract ix
List of Abbreviations xi

1 Introduction 1
  1.1 Trace Deep Learning use through GitHub 1
  1.2 Contribution 2
  1.3 Report Outline 2

2 Background and Related Work 3
  2.1 Background 3
    2.1.1 Deep learning 3
      2.1.1.1 TensorFlow 4
      2.1.1.2 PyTorch 4
    2.1.2 Deep learning models 5
    2.1.3 Summarized Timeline 7
  2.2 Public Code Repositories 8
    2.2.1 Web-based hosting service 8
    2.2.2 Measuring Popularity From GitHub 8
    2.2.3 Extracting Messy Data in the Wild 9
    2.2.4 Visualizing data in Repositories 9
  2.3 Summary 10

3 STAMPER Design and Implementation 11
  3.1 Overview 11
  3.2 Data Collection 12
  3.3 Repository Search 13
  3.4 Data Selection 14
    Example 15
  3.5 Construct the Visualizations 16
  3.6 Summary 18

4 STAMPER in Action 19
  4.1 Popularity of Deep Learning Models in GitHub 19
    4.1.1 Popularity Feature Selection 19
    4.1.2 Past and Current Status: A Full Integration 23
    4.1.3 RQ1: How has the popularity of model changed over time? A closer look at the deep learning models 26
    4.1.4 RQ2: How popularity varies per model 29
    4.1.5 RQ3: Does the popularity of models relate to other features? 30
  4.2 Contribution of Deep Learning Models in GitHub 34
    4.2.1 Collaborative Contribution 34
    4.2.2 RQ1: After forking, do developers change the codebase? 36
  4.3 Maintenance of Deep Learning Models in GitHub 39
    4.3.1 RQ1: How long has it been in existence? 39
    4.3.2 RQ2: Do old models have more issues compared to new models? 41
    4.3.3 RQ3: Are they well maintained? 42
  4.4 Summary 42

5 Discussion And Future Work 45
  5.1 Discussion 45
    5.1.1 Data in the Wild: Limitation and Improvement 45
    5.1.2 Extensibility and Open-Source Software 45
  5.2 Future Work 46
    5.2.1 Social Network Analysis in GitHub 46
    5.2.2 Trend Detection using Commit Timestamps 46

6 Conclusion 47

7 Appendix 49
  7.1 Appendix 1: Project Description 49
    7.1.1 Project Title 49
    7.1.2 Supervisors 49
    7.1.3 Project Description 49
    7.1.4 Learning Objectives 49
  7.2 Appendix 2: Study Contract 49
  7.3 Appendix 3: Artefact Description 52
    7.3.1 Code Files Submitted 52
    7.3.2 Program Testing 52
    7.3.3 Experiment 52
      Hardware 52
      Softwares 52
      Other 53
      Datasets 53
  7.4 Appendix 4: README 54

List of Figures

2.1 git2net [Gote et al. 2019] 10

3.1 Overview of STAMPER 11
3.2 Data Selection 14
3.3 Store in Local Disk 14
3.4 Overall: Construct the Visualizations 16
3.5 Examine Uniqueness after Forking 18

4.1 Repository Watching [Git b] 20
4.2 Star Sort Menu [Git a] 20
4.3 Popularity Metric 21
4.4 Repositories with Forks 24
4.5 Repositories without Forks 24
4.6 Repository Trend in GitHub For Each Model 25
4.7 Creation Time vs Stars 26
4.8 Number of Forks Related to Repositories in Deep learning Model Development 28
4.9 Star vs Contributors 30
4.10 Star vs Development Time 31
4.11 Star vs Open Issues 31
4.12 Star vs Entropy Value 32
4.13 Collaboration Entropy 35
4.14 Percentage of Forked Repositories Unique From Origin (Boxplot) 36
4.15 Repository Uniqueness Distribution (%) 37
4.16 Repository Change Statistic 38
4.17 Development Time Boxplot 40
4.18 Development Time vs Number of Open Issues 41
4.19 Open Issues vs Number of Repository 43


List of Tables

2.1 Deep Learning History 4
2.2 Timeline 7

3.1 Repositories Related to Tensorflow 17

4.1 Popularity metric for repositories 21
4.2 Stars Comparison 29
4.3 Forks Comparison 29
4.4 Percentage of one-contributor development for DL related repositories 32
4.5 Sample Contributions to One Repository 34
4.6 Repository Development Time Statistics 40
4.7 Repository Open Issue Statistics 41
4.8 Descriptive statistics on percentage of Wiki Existence 42


Chapter 1

Introduction

1.1 Trace Deep Learning use through GitHub

GitHub is one of the largest web-based hosting communities in the world, containing a rich source of data that facilitates many different software engineering projects. This massive database allows developers and researchers to publish their work.

GitHub can also be used for popularity measurement. Developers can construct their social networks on GitHub in multiple ways, such as joining an organisation or starring a repository they are interested in. These features make repositories on GitHub easily accessible and an ideal place to conduct empirical studies.

The deep learning research community is continually evolving. Not limited to academic research, deep learning also allows businesses to use data to teach computers how to learn. Consequently, a number of machine learning and deep learning startups have come into existence worldwide.

The development of deep learning models sometimes suffers from software engineering problems. Studies of the quality of deep-learning-related projects are sparse, and few researchers focus on usage outside academia. With the expanding range and deepening degree of deep learning use, we would like to test whether developers keep up with the latest deep learning upsurge in academia.

A tool that extracts the metadata of historical repositories, with integrated visualisation, based on the vast corpus of GitHub repositories is not currently available. To fill this gap we present our tool, STAMPER. It is implemented in Python and supports GitHub repository metadata extraction based on repository keyword search and code segment search using the GitHub API. It can also analyse the modifications of forked repositories relative to the original repository and capture the repository differences.

We study the historical trend of deep learning models and frameworks from repositories on GitHub. We further demonstrate how repository metadata can be used to observe deep learning trends. Our project provides a novel method to study deep learning frameworks and models from a historical perspective and, at the same time, opens a new avenue for empirical studies of deep learning.


1.2 Contribution

• We introduce STAMPER, a Python tool that can extract the metadata characterising historical open source projects from GitHub, based on researchers' interests.

• Utilising STAMPER in a case study, we analyse the usage of deep learning models in TensorFlow. We further demonstrate how we extract a rich set of features and establish connections between those features and popularity.

• We create and mine a new dataset for further research use.

1.3 Report Outline

The rest of this report is structured as follows. Chapter 2 provides the timeline for deep learning frameworks and deep learning models. In that chapter some background knowledge is presented, and previous work on software mining tools and GitHub-related visualisations is recorded as well.

Chapter 3 provides an overview of our work and introduces our proposed methodology to extract metadata from GitHub repositories.

Chapter 4 presents a case study in which we use our tool to extract deep-learning-related repositories from GitHub and trace the landscape of popular deep learning models. The visualisations generated using our tool can aid researchers in understanding past and current trends easily.

Referring back to the critical questions posed, we conclude our project and highlight the next steps in Chapter 5.

Chapter 2

Background and Related Work

Software repository mining enables researchers to study historical trends in software engineering practice effectively. Repository mining builds on the use of web hosting services. Multiple approaches exist to conduct such studies: in the first section we introduce some background knowledge on web-based hosting services; we then introduce some popular deep learning frameworks in Section 2.1.1; finally, we detail previous works in Section 2.2 that conduct software repository mining and identify the main patterns in exploring repository popularity.

2.1 Background

2.1.1 Deep learning

Different from traditional machine learning, deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [LeCun et al. 2015]. Methods in deep learning can dramatically improve human life in multiple aspects, from image classification and speech recognition to machine translation and even autonomous cars.

The usage of deep learning has grown tremendously in the past few years with the introduction and improvement of hardware availability (GPUs), big data, and cloud providers such as Amazon Web Services (AWS). Beyond this, large companies create research teams to develop their own deep learning algorithms and integrate them into frameworks such as PyTorch and TensorFlow. They also share their datasets and model training tutorials, helping startup companies build state-of-the-art products with minimal effort and time.

Though deep learning has become a hot topic in recent years, its history has been evolving since the 1950s, as summarised in Table 2.1 [Subramanian 2018].

Previously, developers and researchers required expertise in C++ or CUDA to implement deep learning models and algorithms. However, thanks to large technology companies and organisations, people with knowledge of a scripting language (Python) can now also build their own deep learning algorithms.


Technique | Year
Neural network | 1943
Backpropagation | 1960s
Convolutional Neural Network | 1979
Recurrent neural network | 1980
Long Short-Term Memory | 1997

Table 2.1: Deep Learning History

In Section 2.1.1.1 and Section 2.1.1.2 we discuss some state-of-the-art deep learning frameworks used today, both in academia and industry.

2.1.1.1 TensorFlow

TensorFlow is a machine learning system that operates at large scale [Abadi et al. 2016]. Initially released by the Google Brain team in November 2015, it evolved from an earlier internal system developed under the name DistBelief. TensorFlow released its official 1.0.0 version on 11 February 2017, introducing the ability to run on multiple CPUs and GPUs.

Currently, Google uses this framework in numerous ways to improve its search engine, translation process, recommendation system, and image recognition and captioning. The TensorFlow library contains different APIs to build deep learning models at scale, such as ResNet, CNN, or LSTM. The architecture of this framework gives developers the flexibility to experiment with new optimisations and Google's training algorithms. It currently supports numerous applications from front end to mobile; this flexibility and scalability make it one of the most popular deep learning frameworks in the world.

2.1.1.2 PyTorch

Based on the Torch library, PyTorch can be seen as a Python front end to the Torch engine, which provides the capability to define mathematical functions and compute gradients. Similar to TensorFlow, PyTorch has good GPU and CPU support and is primarily developed by Facebook's artificial intelligence research group. The official initial release of PyTorch was in October 2016; its dynamic computation capability gives it great flexibility in building complex architectures.

However, unlike TensorFlow, which can seamlessly integrate into real industrial applications, PyTorch was primarily developed by and for researchers and scientists, and is not as easily recommended for production usage in certain scenarios.


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and speed up project development to survive in this keen competition. Winning trust from the public with high-quality service is thus required.

Initially, we would have liked to conduct our research on the latest model stores, such as AWS SageMaker, Azure Machine Learning service, Wolfram Neural Net Repository, and ONNX. However, the algorithms and datasets used there are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand over the public. To avoid those problems and gain deeper insight into usage in society, we chose a framework with models of greater transparency and substantial usage: TensorFlow.

API Referenced

Since this project involves a range of deep learning models, we began by developing an understanding of how to build and train neural networks. The construction of networks may rely on the layers APIs provided by Keras and TensorFlow. TensorFlow also recently introduced estimator APIs to simplify the procedures of training, evaluation, prediction, and export.

Convolutional Neural Network (CNN)

Convolutional Neural Network is one of the most established algorithms among all the deep learning models and one of the most dominant algorithms in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons in a grid pattern. Typically, a CNN consists of three types of layers: fully connected layers, convolution layers, and pooling layers. Convolution and pooling layers conduct feature extraction, and the fully connected layer maps the extracted features to the final output. Layers are interconnected, and thus the extracted features are transferred layer by layer.

Long short-term memory (LSTM)

Different from traditional neural networks, which cannot memorise previous data, a special kind of recurrent neural network, Long Short-Term Memory (LSTM), provides researchers with an effective way to perform tasks over persistent data and is capable of learning long-term dependencies [Hochreiter and Schmidhuber 1997]. Specifically, LSTM combines the forget and input gates into an update gate. It is capable of learning dependencies from historical data and making predictions from previously remembered information. Inside an LSTM, instead of using the

TensorFlow official models are chosen in our project. The TensorFlow official models contain a collection of deep learning models built with TensorFlow's high-level APIs (https://github.com/tensorflow/models/tree/master/official).


linear layer, there is a small network inside the LSTM which performs its function independently.

LSTM is one of the most common uses of recurrent neural networks. This model is generally used with sequential data and can solve language modelling problems, such as NLP concepts (word embedding, encoding).

Residual Network (ResNet)

One of the problems deep learning models face is that as the number of layers increases past a certain point, accuracy stops improving. This problem motivated modern network architectures such as the residual network (ResNet) and Inception, which address it using residual connections.

ResNet solves the problem described above by fitting a residual mapping via shortcut connections. Each ResNet block contains a series of layers and a shortcut connection component.

Bidirectional Encoder Representations from Transformers (Bert)

Bert is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al. 2018].

It was first released in google-research/bert on GitHub on 1 November 2018, supporting a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications; it aims to predict relationships between sentences by analysing whole sentences holistically [Devlin et al. 2018].

Attention is all you need (Transformer)

Most problems in deep learning can be seen as a form of sequence-to-sequence mapping and can be solved using a common type of architecture: the encoder-decoder architecture.

The encoder and decoder both consist of identical layers, each of which has two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network (FFN) [Vaswani et al. 2017].

Neural Collaborative Filtering (NCF)

Neural Collaborative Filtering (NCF) is a neural network architecture which utilises the non-linearity of neural networks to build recommendation systems [He et al. 2017]. It demonstrates that matrix factorisation can be expressed as a special case of neural collaborative filtering. To add additional non-linearity, this model introduces a multi-layer perceptron module on top of the generalised matrix factorisation (GMF) layer.

Wide and Deep Learning

Since linear models are not great at generalising across unique features, deep models are introduced to solve this problem. Deep models can use embedding vectors for every query, and this technique can then generalise by coupling the items and queries.

To overcome over-generalisation, the Google research team introduced Wide & Deep Learning: jointly trained wide linear models and deep neural networks that combine the benefits of memorisation and generalisation for recommender systems [Cheng et al. 2016].

2.1.3 Summarized Timeline

Model Name | Year Introduced
CNN | 1980s
LSTM | 1997
ResNet | 2015
Wide & Deep | 2016
NCF | 2017
Transformer | 2017
Bert | 2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest hosts of open-source projects built on the distributed version control system (DVCS) git [Gousios et al. 2014]. A distributed version control system enables contributors to submit a set of changes and integrate them into the main development branch. The use of git is based on pragmatic needs: its advantages combine version control with collaborative development.

GitHub can also lead to insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project. Thus, the number of stars can reveal popularity; from a software development research perspective, it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al. [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth using time series metadata derived from 2,279 accessible GitHub repositories. They also found that slow growth is more common in overpopulated application domains and for old repositories. Moreover, they conclude that the most common domains on GitHub include web libraries and frameworks and non-web libraries and frameworks.

Building on their work, we would like to examine whether deep learning framework repositories follow similar popularity growth patterns. At the same time, we would also like to study whether a relationship exists among three factors: forks, stars, and watchers.

Predicting the Popularity of GitHub Repositories

In the same year, Borges et al. [2016a] published another paper on predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regressions to predict the number of stars of GitHub repositories; as a result, project owners can see how their projects are performing in the open-source community.

Their study examines the correlation between the popularity of a repository and the number of stars it owns. They also report a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent

GHTorrent [Gousios and Spinellis 2012] is designed to conduct independent repository mining, distributing its data through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all the repositories, and the resulting dataset can be used to answer multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving raw data based on results returned from the REST API. However, their tool cannot visualise the metadata or offer trend analysis at a high level.

MetricMiner

A similar tool is MetricMiner [Sokol et al. 2013]. It is a web application that supports researchers in mining software repositories, performing data extraction and statistical inference on the collected data. The tool automatically clones the repository, processes the metadata, and stores the data in the cloud, giving it good scalability and fast query answering without users installing any software on their localhost.

GitcProc

GitcProc [Casalnuovo et al. 2017] is a tool that uses regular expressions to extract the changed lines within a repository to help answer project evolution questions. GitcProc can retrieve and summarise global project statistics, including commits, commit dates, and contributors. It can measure how many changes have taken place in Java projects and can also locate the changed files.

RepoVis

RepoVis [Feiner and Andrews 2018] is a newer tool which provides visual overviews for software maintained in Git repositories. RepoVis is a client-server web application providing full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its searchable functionality to GitHub, combined with a code-based search. All visualisations are written out in SVG format.

2.2.4 Visualizing data in Repositories

Chronos

CHRONOS [Servant and Jones 2013] is a software tool that enables visualisation of historical changes inside software source code. The tool implements a zoomable user interface as a visualisation of the actual code, supporting developers from a high-level view (pattern recognition) down to a low-level view (reverse engineering) using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise their complete history of change, including the revisions that modified them. Inspired by this tool, our project uses the visualisation method to track the historical change of the popular


Figure 2.1: git2net [Gote et al. 2019]

trend related to the keywords specified by users on GitHub.

GEVOL

Collberg et al. [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph drawing technique to deduce a better understanding of a program from its development history, displaying all visualisations using a temporal graph visualizer.

This system aids in discovering the structure of a system and provides the user with a new way to explore the evolution of a program by visualising changes to the system. It extracts information about Java programs stored within a CVS version control system and renders the metadata into three types of graphs: inheritance, control-flow, and call graphs.

git2net

git2net [Gote et al. 2019] is a software tool that facilitates the extraction of co-editing networks from git repositories. Similar to GEVOL, it uses text mining techniques to analyse the history of modifications within files. Beyond that, it addresses the importance of studying social networks in GitHub and gives the reader a broader view of the application of graph-based data analysis and modelling. Their tool shows its advantage in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter we detailed the web-based hosting service we selected (GitHub) and presented the concept of deep learning with two popular frameworks, together with state-of-the-art neural network models. In the next chapter we will elaborate on how we designed and implemented STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter we outline our design and implementation for data extraction, and then detail the metrics we use to estimate the trend of deep learning framework and model usage on GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

[Figure: pipeline of (1) Data Collection via the Git project search API, (2) Repository Search over keywords and model names, and (3) optional Data Selection via the Git code search API, feeding local storage and data visualisation.]

Figure 3.1: Overview of STAMPER


Data Collection
We first collect all the repository metrics through the GitHub API. This step allows us to extract the history of all repositories related to a keyword and record the metadata for each repository (repository-based search).

Repository Search
As each repository contains multiple contributors and forks, we subsequently extract and analyse whether changes were made, based on the size information, and calculate a collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the forked repositories' information to create visual representations.

Data Selection
We implemented a selector that can exclude specific repositories not related to the desired ones. The selector summarises the frequency counts for the keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis
Since each forked repository may involve re-development and modification, analysis of forked repositories' modifications is included in our project. For all changed repositories, the size difference is analysed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to deeply analyse and manipulate the data, and even run statistical tests on the dataset. To better understand these metrics, we divide them into multiple categories. For attributes that are not primary data from the GitHub API, we explain them in the data expansion part and label them as [Data Expansion] in Table 3.2.

To maximise the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication, the user can make up to 5,000 requests per hour; otherwise, rate limits allow only up to 60 requests per hour [Git d].
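As a sketch, token-based access to the GitHub REST API v3 might look like the following (the endpoint and headers follow GitHub's documented v3 conventions; the function names are ours, not STAMPER's actual code):

```python
import json
import urllib.parse
import urllib.request

API = "https://api.github.com"

def build_request(path, token=None):
    """Build an (optionally authenticated) GitHub API v3 request.
    With a token the rate limit is 5000 requests/hour, without it 60."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = "token " + token
    return urllib.request.Request(API + path, headers=headers)

def search_repositories(keyword, token=None):
    """Query the repository search endpoint for a keyword (network call)."""
    path = "/search/repositories?q=" + urllib.parse.quote(keyword)
    with urllib.request.urlopen(build_request(path, token)) as resp:
        return json.load(resp)
```

For example, `search_repositories("bert tensorflow", token)` would return the JSON payload whose `items` field lists matching repositories with the metadata fields of Table 3.2.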


Type | Meta-data
Contributor | contribution: int [Data Expansion]; login (user name): String; type (user / organization): String; contributors_url
Repository | created_at; description; full_name; language; size
Popularity | fork: Boolean; forks: int; forks_url; stargazers_count; watchers_count; unique_repos [Data Expansion]
Owner | id; login (username); type
Maintenance | has_issues: Boolean; has_wiki: Boolean; open_issues: int; pushed_at; updated_at; score

Table 3.2: Repository metadata collected by STAMPER

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deepcrawling operations to collect raw data

• Contribution
One repository generally has multiple developers conducting software development. The project owner is not necessarily the person who contributes the largest amount of code, and the amounts contributed by different developers are potentially unequal. As a result, we further track this information using the GitHub API and record the number of contributions each developer made to each repository.

• Unique_repos
Popular repositories may have numerous forked repositories owned by other users, and the reasons behind forking vary. Our research explores whether forks conduct subsequent development based on the original codebase. By comparing the size of each forked repository (Fi) and the original repository (O), we obtain all the forked repositories with a change of size (c):

Fi + c = O (3.1)
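A minimal sketch of this size comparison, assuming each repository's metadata has already been fetched as a dict with the `size` and `full_name` fields returned by the GitHub API (the function name is ours, not STAMPER's actual code):

```python
def changed_forks(original, forks):
    """Return (fork, size_change) pairs for forks whose size differs
    from the original repository, i.e. c = O - F_i != 0."""
    changed = []
    for fork in forks:
        c = original["size"] - fork["size"]
        if c != 0:
            changed.append((fork["full_name"], c))
    return changed

original = {"full_name": "owner/model", "size": 120}
forks = [{"full_name": "a/model", "size": 120},   # unchanged after forking
         {"full_name": "b/model", "size": 150}]   # modified after forking
print(changed_forks(original, forks))  # [('b/model', -30)]
```

Note that size alone is a heuristic: a fork whose additions and deletions cancel out would appear unchanged, which is why STAMPER treats this as a proxy rather than a full diff.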

3.4 Data Selection

[Figure: an entity (model) and API keywords drive searching in repositories, yielding usage statistics.]

Figure 3.2: Data Selection

[Figure: forked-repository timestamps split unfiltered data into filtered data, grouped by model-related keywords (Bert, ResNet, CNN, ...) defined in model_keyword.py.]

Figure 3.3: Store in Local Disk


Figure 3.2 represents our method of searching API usage in DL-model-related repositories. GitHub provides a REST API to examine whether a repository contains a targeted API or not.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. The appearance frequency of each user-specified API is embedded directly within the returned result and can be matched in our program to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. It also allows users to build a high-level picture of API usage across GitHub repositories.

We also provide multiple options for selecting APIs, stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the number of models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras applications library provides users the ability to load deep learning models and instantiate a model with default weights. Using ResNet as an example:

• With pre-defined models
Users may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, and this model could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they can all be used as sample keywords in STAMPER, such as:

1. keras.applications.resnet.ResNet50
2. keras.applications.resnet.ResNet101
3. keras.applications.resnet.ResNet152
4. keras.applications.resnet_v2.ResNet50V2
5. keras.applications.resnet_v2.ResNet101V2
6. keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to further explore the components inside the ResNet class, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, data selection heuristics vary: deep learning users and experts can define their searches according to their interests and preferences.

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py
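As an illustration, a keyword-frequency selector of this kind might count API occurrences in fetched file contents as follows (a hedged sketch: `MODEL_KEYWORDS` and the function are stand-ins for what model_keyword.py could define, not STAMPER's actual code):

```python
import json

# Illustrative keyword list in the spirit of model_keyword.py
MODEL_KEYWORDS = {
    "ResNet": ["keras.applications.resnet.ResNet50",
               "from keras.applications.resnet50 import ResNet50"],
    "Bert": ["from bert import modeling"],
}

def keyword_frequencies(file_contents):
    """Count how often each model's keywords appear across a
    repository's file contents (plain substring matching)."""
    counts = {model: 0 for model in MODEL_KEYWORDS}
    for text in file_contents:
        for model, keywords in MODEL_KEYWORDS.items():
            counts[model] += sum(text.count(k) for k in keywords)
    return counts

files = ["from keras.applications.resnet50 import ResNet50\nmodel = ResNet50()"]
print(json.dumps(keyword_frequencies(files)))  # {"ResNet": 1, "Bert": 0}
```

The resulting per-model counts are exactly the kind of per-repository frequencies that, written to disk as JSON, feed the API usage statistics described above.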

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the three kinds of visualisations is illustrated in Figure 3.4. Chapter 4 gives an example of our collected repository metadata for deep learning models.

[Figure: entities 1..n feed a functional mapping that produces contribution-related, popularity-related, and maintenance-related visualisations.]

Figure 3.4: Overall Construct the Visualizations

Popularity

• Total number of repositories with forks (line)
• Total number of repositories without forks (line)
• Number of creations over time, grouped by week (with forks)
• Repository Creation Time vs Stars

Contribution

To further use the forking information, STAMPER supports comparison between the original repository and its forked repositories. The work could be further extended by visiting each forked repository's URL and tracing the commits.

As shown in Figure 3.5, the entity (E) we search in GitHub may have multiple related repositories (Ri) with their corresponding forked repositories (Fi). Among the forked repositories, we call a changed forked repository Ci.

To examine whether changes exist in forked repositories, and the difference between multiple entities, we calculate the difference using the following equation.


Keyword | Total Repositories Collected (including Forks) | Original Repositories Collected
ResNet tensorflow | 6129 | 339
Bert tensorflow | 13734 | 106
CNN tensorflow | 39765 | 1000
LSTM tensorflow | 19572 | 1000
Transformer tensorflow | 7188 | 145
Wide and deep tensorflow | 324 | 39

Table 3.1: Repositories Related to Tensorflow


Uniqueness percentage distributions are composed of the percentages pi, each corresponding to its original repository Ri:

pi = Σ Ci / Σ Fi (3.2)
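Under the definitions above, computing pi per original repository is straightforward; a sketch, assuming each fork has already been flagged as changed or not (data shapes are ours, not STAMPER's actual structures):

```python
def uniqueness_percentage(fork_changed_flags):
    """p_i = (number of changed forks, C_i) / (total forks, F_i)
    for one original repository R_i."""
    total = len(fork_changed_flags)
    if total == 0:
        return 0.0
    return sum(1 for changed in fork_changed_flags if changed) / total

# Forks of one original repository: True = size changed after forking
print(uniqueness_percentage([True, False, True, True]))  # 0.75
```

Collecting these values for every Ri of an entity yields the uniqueness distributions shown in the boxplots and histograms below.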

[Figure: an entity (E) with repositories 1..4; each fork 1..n of a repository is marked changed (Y/N).]

Figure 3.5: Examine Uniqueness after Forking

• Percentage of Forked Repositories Unique from Origin (boxplots)
• Uniqueness percentage distribution for each entity (histograms)
• Entropy distribution for each entity (histograms)

Maintenance

• Development Time boxplot for each entity
• Open Issues distribution for each entity

3.6 Summary

In this chapter we detailed how our tool conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. We also introduced and analysed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving: built, trained, and deployed by researchers. Our tool is available for analysing such changes. We collected the historical information stored in GitHub and extracted repository metadata using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development is a battlefield without the smoke of gunpowder. Researchers, companies, and developers are all competing for a voice in deep learning. A variety of models exist, but there is no common bridge connecting those ideas. Historical data in GitHub are obscure and hard to find, and as a result many developers, especially experienced ones, remain in their comfort zone. With our study we hope to shed some light on deep learning and highlight a few suggestions for the public.

This section aims to answer questions related to both model usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, given the few studies about popularity in the GitHub ecosystem, there is no standardised feature to measure popularity. We analyse some potential features of each repository and hypothesise that popularity is strongly related to the stars a repository owns.

This decision is justified in the following section with more background on GitHub.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching; however, watching does not make them collaborators [Git b]. A watcher can watch

19

20 STAMPER in Action

Figure 42 Star Sort Menu [Git a]

a repository to receive the notifications for the new pull requests or issuesthat are created Watchers could indicate how much interest does the GitHubcommunity give to the repository

Figure 41 Repository Watching [Git b]

bull Stars: Starring a repository makes it easy for a user to keep track of repositories they are interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and GitHub provides a ranking system based on the number of stars a repository has [Git c].

bull Forks: A fork is created when a user makes their own copy of a repository. The user can fork a repository to suggest changes, or to use it as the basis for a new project.

Based on the data gathered in the collection stage, we extract the watchers_count, stargazers_count, and forks attributes and test whether these attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.
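As an illustration, the extraction step can be sketched as below. The field names stargazers_count, forks_count, and watchers_count are the actual keys in GitHub REST API repository metadata, but the extract_popularity helper and the sample records are hypothetical stand-ins for STAMPER's collected data.

```python
# Sketch: pull the three popularity attributes out of repository
# metadata dictionaries (shaped like GitHub REST API /repos responses).

def extract_popularity(repos):
    """Return (stars, forks, watchers) lists for correlation testing."""
    stars = [r["stargazers_count"] for r in repos]
    forks = [r["forks_count"] for r in repos]
    watchers = [r["watchers_count"] for r in repos]
    return stars, forks, watchers

# Hypothetical sample rows, using the values from Table 4.1.
repos = [
    {"stargazers_count": 17940, "forks_count": 4661, "watchers_count": 17940},
    {"stargazers_count": 12405, "forks_count": 3637, "watchers_count": 12405},
    {"stargazers_count": 5263, "forks_count": 1056, "watchers_count": 5263},
]
stars, forks, watchers = extract_popularity(repos)
print(stars, forks, watchers)
```

Note that the GitHub API reports watchers_count with the same value as stargazers_count, which is consistent with the perfect star-watcher correlation observed below.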


Figure 4.3: Popularity Metric

star    forks_count   watchers_count   model name
17940   4661          17940            Bert
12405   3637          12405            Bert
5263    1056          5263             Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks, and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we should not assume normality.


Spearman Correlation Coefficient

Definition. The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman, 2011]. This method allows us to test for a rank-order relationship between two numerical ordinal variables related by a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

bull H0: The variables (star, fork, and watcher) have no relationship with each other.

bull H1: There is a relationship between those three variables.

Result

bull Star vs. Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

bull Star vs. Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

bull Fork vs. Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Setting α = 0.05, the p-values p1, p2, and p3 are all less than α. From the calculation above we also find strong positive correlations, with coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875 respectively.

This means it is very unlikely (at 95% confidence) that the variables are uncorrelated, and thus we can reject the null hypothesis that they have no relationship.

In the rest of this report, we take the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, convolutional neural networks (CNN) and long short-term memory networks (LSTM), shown in the figures below, look like two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from these models with a longer history, BERT and ResNet are two rising stars in the model competition; they arrive with significant improvements in architecture design and performance, as described in the background section.

The model development community has recently seen the release of multiple powerful frameworks, which are treated as baselines for building models. However, for many new models, such as the Wide & Deep model and the NCF model, usage has not grown in abundance.


Figure 4.4: Repositories with Forks (number of repositories created, including forks, accumulated per model, 2015-2019)

Figure 4.5: Repositories without Forks (number of original repositories created, accumulated per model, 2015-2019)

Figure 4.6: Repository Trend in GitHub for Each Model

Figure 4.7: Creation Time vs. Stars

A fork is another copy of a repository; a forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories created including forks and the total number created excluding forks. Most of the repositories related to deep learning models are therefore not original, which indicates that a considerable number of developers remain at the learning stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, unlike the previous summarizing method, we study how the popularity of deep learning models changes over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the most repositories created. Let us examine this using the data: in 2017 the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued rising to a higher level, which it has sustained until now.

What accounts for this tremendous usage difference? CNN and LSTM currently have among the largest and most significant communities in the deep learning field: these networks are essential in both computer vision and NLP, where they form an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly in the last two years. Unlike earlier structures, both are modifications of original architectures, and they significantly improve results in computer vision and translation tasks.

Rising star: BERT

However, no model is perfect: existing architectures continue to be extended into many variants, and BERT is one such recent arrival.

The current trends depicted in the graph suggest that deep learning models are proliferating fast, with innovative developments, and there is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide & Deep

One might believe that the popularity of a deep learning model is strongly correlated with how long the model has existed, but our data tells a different story.

Although its paper was published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e., stars) and creation time: past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model can flatten out or reverse itself.

The Wide & Deep model, also published in 2016, is similar: although Google provides full documentation and a tutorial for it, we still take a pessimistic view of it, and the data confirms there has been no significant rise in its use.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name      Mean     STD       Min   25%   50%   75%   Max
Bert            498.65   2196.3    0     1     8     43    17940
CNN             106.84   611.97    2     3     8     32    13882
LSTM            48.82    214.22    0     1     2     13    2703
NCF             77       129.91    1     2     3     115   227
ResNet          46.88    221.43    0     0     1     8     2980
Transformer     186.79   1155.87   0     0     4     21    12408
Wide and Deep   16.23    36.80     0     0     1     8     146

Table 4.2: Stars Comparison

Model Name      Mean     STD      Min   25%   50%   75%    Max
Bert            128.21   585.93   0.0   0.0   1.0   16.5   4661.0
CNN             40.71    252.71   0.0   1.0   4.0   14.0   6274.0
LSTM            17.79    71.96    0.0   0.0   1.0   5.0    968.0
NCF             34.33    58.60    0.0   0.5   1.0   51.5   102.0
ResNet          17.44    93.75    0.0   0.0   0.0   3.0    1442.0
Transformer     53.52    336.10   0.0   0.0   1.0   6.0    3637.0
Wide and Deep   7.28     16.36    0.0   0.0   0.0   2.5    71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52), and NCF (34.33).

The three models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88), and LSTM (48.82). The three models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79), and ResNet (17.44).

Kruskal-Wallis Test. The Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

bull H0: The seven models' distributions are the same.

bull H1: The seven models' distributions are different.

from scipy.stats import kruskal
stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(),
                  dfLstm['star'].tolist(), dfNcf['star'].tolist(),
                  dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                  dfWideDeep['star'].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories existing in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building one's own Transformer or BERT model may require a large amount of time and effort, while developers still show their interest in those novel deep learning models by starring and forking them.

Figure 4.9: Star vs. Contributors (stargazers_count vs. number_of_contributors)

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy respectively.


Figure 4.10: Star vs. Development Time (stargazers_count vs. develop_duration)

Figure 4.11: Star vs. Open Issues (stargazers_count vs. open_issues)

Figure 4.12: Star vs. Entropy Value (stargazers_count vs. entropy)

Number of Contributors. From Figure 4.9 and a Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, p-value ≤ 0.01). Across all repositories, the top-3 models with the most stars per contributor are CNN (168.75 stars/contributor), Transformer (15.51 stars/contributor), and Bert (15.50 stars/contributor).

Model           Percentage of One-Contributor Development (%)
Bert            74.53
CNN             83.3
LSTM            85.9
NCF             100
ResNet          90.26
Transformer     81.20
Wide and Deep   89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time. From Figure 4.10 and a Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value ≈ 0). This suggests that the longer a model has been developed, the more stars it will have (i.e., the model becomes more popular). The two models with the longest development durations are LSTM and CNN.

Open Issues. From Figure 4.11 and a Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186, p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can assume that the more popular a repository becomes, the more issues it will have. We investigate this correlation further in the following section.

Entropy. From Figure 4.12 and a Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.

First, during the data collection stage, we collected each contributor's contribution distribution within a repository. As mentioned, the goal is to check whether the work of developing a deep learning model is divided evenly. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development work is not distributed evenly.

We can examine this using Table 4.4: most deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. Here we introduce an information-theoretic approach to test whether their contributions are even or not.

Entropy. In particular, we compute the entropy H of each repository, defined as

p_i = c_i / \sum_i c_i                    (4.1)

H = - \sum_i p_i \log_2(p_i)              (4.2)

Here i indexes the i-th contributor, c_i is the i-th contributor's contribution, and \sum_i c_i is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example:

The contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214                                        (4.3)

p_1 = 174/214,   p_2 = 36/214,   p_3 = 4/214                      (4.4)

H(repository) = -(174/214 \log_2(174/214) + 36/214 \log_2(36/214)
               + 4/214 \log_2(4/214)) ≈ 0.7826                    (4.5)

name            contribution
dragen1860      174
ash3n           36
kelvinkoh0308   4

Table 4.5: Sample Contributions to One Repository

The resulting distribution of entropy across all repositories can be used to determine whether a repository is developed unevenly: the lower the entropy, the higher the separation, meaning work is distributed more unevenly.

Figure 4.13 shows the distribution of entropy values for all models. From these figures we can see that most repositories have an entropy value of around zero, which means deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.

Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full list of users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original. We observe that Bert has a high proportion of unique forked repositories among the models studied.

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of changes relative to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. Looking in more detail, we can see at a glance not only that changes are rarely made after forking, but also that most changed


Figure 4.15: Repository Uniqueness Distribution (%)

Figure 4.16: Repository Change Statistic


repositories differ in size from the original by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them of less concern to people. Second, a model may only be valid for a specific type of data, making it less robust and generalized, and less suited to developers' needs.

We conclude that development across forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.
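The uniqueness measure above can be sketched as follows. This is a minimal illustration, assuming each fork record carries its own size and its parent's size in bytes; the percent_unique helper and the (fork_size, parent_size) pairs are hypothetical simplifications of STAMPER's fork metadata.

```python
# Sketch: fraction of forks whose repository size differs from the parent,
# used as a cheap proxy for "the fork changed the codebase".

def percent_unique(forks):
    """Percentage of forked repositories whose size differs from the original."""
    if not forks:
        return 0.0
    changed = sum(1 for fork_size, parent_size in forks
                  if fork_size != parent_size)
    return 100.0 * changed / len(forks)

# Hypothetical sample: two untouched forks, two that diverged in size.
forks = [(1024, 1024), (1024, 1024), (1100, 1024), (980, 1024)]
print(percent_unique(forks))  # 50.0
```

Size equality is only a proxy: a fork could in principle change content while keeping the same byte size, so this measure gives a lower bound on unchanged forks.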

4.3 Maintenance of Deep Learning Models in GitHub

In this section we survey software maintenance in these deep learning related repositories. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as shown in the equation below:

age = T(updated_at) - T(created_at)                    (4.6)
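Equation 4.6 can be evaluated directly from the ISO-8601 timestamps GitHub returns (created_at and updated_at are real fields in the GitHub REST API repository metadata; the sample values below are hypothetical):

```python
from datetime import datetime, timezone

def repo_age_days(created_at, updated_at):
    """Age in days per Eq. 4.6: T(updated_at) - T(created_at)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # GitHub's ISO-8601 timestamp format
    created = datetime.strptime(created_at, fmt).replace(tzinfo=timezone.utc)
    updated = datetime.strptime(updated_at, fmt).replace(tzinfo=timezone.utc)
    return (updated - created).days

print(repo_age_days("2018-11-01T00:00:00Z", "2019-02-19T00:00:00Z"))  # 110
```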

Figure 4.17 and Table 4.6 show how development time varies by model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide & Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). We therefore hypothesize that many of the earlier models started using the open-source web community immediately after their first release.


Model         Max (days)   Q3       Median   Q1       Min
Bert          779          229      110      32       0
Transformer   1254         321      142      11       0
Wide & Deep   1107         575      117      0.5      0
ResNet        1360         456.5    120      15       0
NCF           1120         476      216      8        0
LSTM          1812         621.25   315.5    47.25    0
CNN           1385         699.25   483      270.25   0

Table 4.6: Repository Development Time Statistics

Figure 4.17: Development Time Boxplot


Figure 4.18: Development Time vs. Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As suggested visually by the figure and by a Spearman correlation test, there is a weak correlation between these two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, which have more users and higher maintenance costs, may have more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.299), CNN (3.414), and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model         Mean    Std      25%   50%   75%   Min   Max
Bert          8.299   50.55    0     0     1     0     504
CNN           3.414   35.456   0     0     1     0     1077
LSTM          1.292   4.915    0     0     1     0     69
ResNet        1.791   11.164   0     0     0     0     186
Transformer   1.857   8.608    0     0     1     0     95
Wide & Deep   0.231   0.742    0     0     0     0     4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository   Percentage of Repositories Having Wiki (%)
Bert                       97.17
CNN                        98.498
LSTM                       98.799
NCF                        98.864
ResNet                     98.817
Transformer                96.97
Wide & Deep                100

Table 4.8: Descriptive statistics on percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from our collected data, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. These samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.
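The wiki statistic in Table 4.8 can be derived from repository metadata. As a minimal sketch: has_wiki is a real field in GitHub's repository metadata, though it reports whether the wiki feature is enabled rather than whether documentation was actually written; the wiki_percentage helper and the sample records are hypothetical.

```python
# Sketch: percentage of repositories whose metadata reports an enabled wiki.

def wiki_percentage(repos):
    """Percentage of repositories with the has_wiki flag set."""
    if not repos:
        return 0.0
    with_wiki = sum(1 for r in repos if r.get("has_wiki"))
    return 100.0 * with_wiki / len(repos)

# Hypothetical sample: three of four repositories have a wiki.
repos = [{"has_wiki": True}, {"has_wiki": True},
         {"has_wiki": False}, {"has_wiki": True}]
print(wiki_percentage(repos))  # 75.0
```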

4.4 Summary

In this chapter we first used STAMPER to get the metadata of repositories related to DL models from GitHub. We then investigated three common aspects of software engineering in deep learning repositories (popularity, contribution, and maintenance) using the collected data.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported maintenance metrics for deep learning repositories, and conducted an in-depth investigation of whether developers change the original codebase after forking.


Figure 4.19: Open Issues vs. Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmarks of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation in the future. For example, users may publish their models in prototxt format, while our project focused only on deep learning models constructed using Python. The findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, since search results cannot exceed the 1,000-repository boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub, but this still cannot capture all repositories in GitHub. Other, more stratified samples might yield a more precise outcome.

Nevertheless, our research project is essential in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly, with different models released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild; the program could then provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to define their own heuristics for data selection: experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software, or by serving as a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project we examined the popularity of deep learning models via the number of repositories existing in GitHub. It is very likely that commit metadata reflects popularity as well. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., K-Means) to high-resolution time series data from commits.

Chapter 6

Conclusion

This research project identified the need for a tool to conduct trend analysis in GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance, and contribution of deep learning related repositories on GitHub, and identified factors that affect each of these. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users a deeper insight into the current deep learning trend and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and resulting corpus will be of considerable interest to researchers in different fields and will serve the needs of people working at the intersection of social media analysis, data visualization, and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train, and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What are the differences between those model stores?

By scraping, analyzing, and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm 2019.1.3 (Professional Edition)
  Build PY-191.7479.30, built on May 30, 2019
  Licensed to ANU / Xing Yu
  JRE: 11.0.2+9-b159.60 x86_64
  JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
  macOS 10.14.6

• Anaconda
  - jupyter-notebook 6.0.0

Other

- Python 3.7.4
  -- pandas==0.22.0  -- numpy==1.14.0
  -- statistics==1.0.3.5  -- ratelimit==2.2.1
  -- requests  -- altair  -- matplotlib==2.2.2
  -- selenium
- Git

Datasets

asc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_by_star: bert tensorflow.json, lstm tensorflow.json, wide deep tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, cnn tensorflow.json, ncf tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json

by_update_time: bert tensorflow.json, lstm tensorflow.json, resnet tensorflow.json, wide deep tensorflow.json, cnn tensorflow.json, ncf tensorflow.json, transformer tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json

forked_timestamp: bert tensorflow.csv, lstm tensorflow.csv, resnet tensorflow.csv, wide deep tensorflow.csv, cnn tensorflow.csv, ncf tensorflow.csv, transformer tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

Before You Begin
Prerequisites
Install
Running
Test
High Level Description of all Modules & Datasets
Authors
License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view, edit, and execute our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keep the Mac awake with this useful app (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git: https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt.

Running

All the code scripts run from the root

1 Data Collection

Clone our project

Run python3 model_searcherpy

to get keyword related repositories m

etadata in GitH

ub in output

folderAn authentication key is required to get a higher request rateN

ote Request rate could be changed in function call_api

and call_api_dir

Then you need to run sh JSONFormattersh

in your terminal to w

ell-format your output data

Sample Case

In main()

change keywords

in terms of interest Then the resulting JSO

N file w

ill beoutputbertJSON

Customized sorting m

ethod in function get_total_pages

and request_ith_page

Could besort

updated

or stars

order

asc

or desc
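As a rough illustration of the kind of request this collection step issues, the sketch below builds a query against the public GitHub Search API. The endpoint and the q/sort/order/page parameters follow the GitHub REST API documentation; the function names and structure are illustrative assumptions, not the actual model_searcher.py code.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com/search/repositories"

def build_search_url(keyword, sort="stars", order="desc", page=1):
    """Build a GitHub repository-search URL for one page of results."""
    params = {"q": keyword, "sort": sort, "order": order, "page": page}
    return API_ROOT + "?" + urllib.parse.urlencode(params)

def search_repositories(keyword, token=None, **kwargs):
    """Fetch one page of search results; a token raises the rate limit."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = "token " + token
    req = urllib.request.Request(build_search_url(keyword, **kwargs), headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

print(build_search_url("bert tensorflow"))
# https://api.github.com/search/repositories?q=bert+tensorflow&sort=stars&order=desc&page=1
```

The returned JSON contains a total_count field and an items list of repository metadata records, which is the raw material the later selection and visualization steps consume.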

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

pip3 install --upgrade pip
pip3 install -r requirements.txt

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder. Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.
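The selection step keeps only repositories whose metadata matches model-specific keywords. A minimal sketch of such a heuristic follows; the function name and matching rule are illustrative assumptions, not the exact repository_filter.py logic.

```python
def filter_repositories(repos, keywords):
    """Keep repositories whose name or description mentions any keyword."""
    keywords = [k.lower() for k in keywords]
    selected = []
    for repo in repos:
        # Description may be missing (None) in GitHub API records.
        text = ((repo.get("name") or "") + " " + (repo.get("description") or "")).lower()
        if any(k in text for k in keywords):
            selected.append(repo)
    return selected

repos = [
    {"name": "bert-finetune", "description": "Fine-tuning BERT with TensorFlow"},
    {"name": "todo-app", "description": None},
]
print([r["name"] for r in filter_repositories(repos, ["bert"])])
# ['bert-finetune']
```

Experts can swap in their own heuristic here, for example matching on specific API call strings instead of free-text keywords.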

4. Data Visualization

Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity.

Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance.

Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution.

Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module will record all the unreachable links and write them into the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data, and forked_time_location in three folders.

Instantiation

Since you already got data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with parameters model_name and repository metadata subfolder. Then you can call this object with its relative data easily (from Model import bert and use bert as you go along).

Customize Keywords

In the module model_keyword.py, import your instantiation (lstm) and call add_keywords.
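A minimal sketch of how such a Model wrapper might look is given below. This is illustrative only: the real Model.py may differ in details such as folder layout and attribute names.

```python
import json
import os

class Model:
    """Illustrative wrapper binding a model name to the folder holding its
    repository-metadata JSON and to its code-search keywords."""

    def __init__(self, model_name, subfolder, root="output"):
        self.model_name = model_name
        # e.g. output/desc_by_star/bert tensorflow.json
        self.path = os.path.join(root, subfolder, model_name + ".json")
        self.keywords = []

    def add_keywords(self, keywords):
        # Keywords (e.g. API call strings) used to filter code-related repos.
        self.keywords.extend(keywords)

    def load(self):
        """Load the repository metadata collected in the earlier steps."""
        with open(self.path) as f:
            return json.load(f)

bert = Model("bert tensorflow", "desc_by_star")
bert.add_keywords(["bert"])
print(bert.path)
```

With data already on disk from steps 1-2, load() returns the list of repository records for downstream analysis.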

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Keyword customization example:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

Experiment Datasets Collected

1. After Data Collection

output
- asc_by_star: cnn tensorflow.json, lstm tensorflow.json
- asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv


3. After Data Selection (Optional)

filtered_repo
- bert.json
- pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
- tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs

graphs
- contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
- maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
- multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
- popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: Free for personal use only)

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)


Contents

Acknowledgments
Abstract
List of Abbreviations

1 Introduction
  1.1 Trace Deep Learning use through GitHub
  1.2 Contribution
  1.3 Report Outline

2 Background and Related Work
  2.1 Background
    2.1.1 Deep learning
      2.1.1.1 TensorFlow
      2.1.1.2 PyTorch
    2.1.2 Deep learning models
    2.1.3 Summarized Timeline
  2.2 Public Code Repositories
    2.2.1 Web-based hosting service
    2.2.2 Measuring Popularity From GitHub
    2.2.3 Extracting Messy Data in the Wild
    2.2.4 Visualizing data in Repositories
  2.3 Summary

3 STAMPER Design and Implementation
  3.1 Overview
  3.2 Data Collection
  3.3 Repository Search
  3.4 Data Selection
    Example
  3.5 Construct the Visualizations
  3.6 Summary

4 STAMPER in Action
  4.1 Popularity of Deep Learning Models in GitHub
    4.1.1 Popularity Feature Selection
    4.1.2 Past and Current Status: A Full Integration
    4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    4.1.4 RQ2: How does popularity vary per model?
    4.1.5 RQ3: Does the popularity of models relate to other features?
  4.2 Contribution of Deep Learning Models in GitHub
    4.2.1 Collaborative Contribution
    4.2.2 RQ1: After forking, do developers change the codebase?
  4.3 Maintenance of Deep Learning Models in GitHub
    4.3.1 RQ1: How long has it been in existence?
    4.3.2 RQ2: Do old models have more issues compared to new models?
    4.3.3 RQ3: Are they well maintained?
  4.4 Summary

5 Discussion And Future Work
  5.1 Discussion
    5.1.1 Data in the wild: Limitation and Improvement
    5.1.2 Extensibility and Open-Source Software
  5.2 Future Work
    5.2.1 Social Network Analysis in GitHub
    5.2.2 Trend Detection using Commit Timestamps

6 Conclusion

7 Appendix
  7.1 Appendix 1: Project Description
    7.1.1 Project Title
    7.1.2 Supervisors
    7.1.3 Project Description
    7.1.4 Learning Objectives
  7.2 Appendix 2: Study Contract
  7.3 Appendix 3: Artefact Description
    7.3.1 Code Files Submitted
    7.3.2 Program Testing
    7.3.3 Experiment
      Hardware
      Software
      Other
      Datasets
  7.4 Appendix 4: README

List of Figures

2.1 git2net [Gote et al., 2019]

3.1 Overview of STAMPER
3.2 Data Selection
3.3 Store in Local Disk
3.4 Overall: Construct the Visualizations
3.5 Examine Uniqueness after Forking

4.1 Repository Watching [Git b]
4.2 Star Sort Menu [Git a]
4.3 Popularity Metric
4.4 Repositories with Forks
4.5 Repositories without Forks
4.6 Repository Trend in GitHub For Each Model
4.7 Creation Time vs Stars
4.8 Number of Forks Related to Repositories in Deep Learning Model Development
4.9 Star vs Contributors
4.10 Star vs Development Time
4.11 Star vs Open Issues
4.12 Star vs Entropy Value
4.13 Collaboration Entropy
4.14 Percentage of Forked Repositories Unique From Origin (Boxplot)
4.15 Repository Uniqueness Distribution (%)
4.16 Repository Change Statistic
4.17 Development Time Boxplot
4.18 Development Time vs Number of Open Issues
4.19 Open Issues vs Number of Repositories

List of Tables

2.1 Deep Learning History
2.2 Timeline

3.1 Repositories Related to TensorFlow

4.1 Popularity metric for repositories
4.2 Stars Comparison
4.3 Forks Comparison
4.4 Percentage of one-contributor development for DL related repositories
4.5 Sample Contributions to One Repository
4.6 Repository Development Time Stat
4.7 Repository Open Issue Statistics
4.8 Descriptive statistics on percentage of Wiki Existence

Chapter 1

Introduction

1.1 Trace Deep Learning use through GitHub

GitHub is one of the largest web-based hosting communities in the world, containing a rich source of data that facilitates different software engineering projects. This massive database allows developers and researchers to publish their work.

GitHub can also be used for popularity measurement. Developers can construct their social networks in GitHub in multiple ways, such as joining an organisation or starring a repository he or she is interested in. These features make repositories in GitHub easily accessible and the best place to conduct empirical studies.

The deep learning research community is continually evolving. Not limited to academic research, deep learning is allowing business to use data to teach computers how to learn. Therefore, a number of machine learning and deep learning startups have come into existence worldwide.

The development of deep learning models sometimes has software engineering problems. Studies of the quality of deep learning related projects are sparse; few researchers focus on usage outside academia. With the expansion of the usable range of deep learning and the deepening degree of its use, we would like to test whether developers catch up with the latest deep learning upsurge in academia.

A tool to extract the metadata of historical repositories, with integrated visualisation, based on the vast corpus of GitHub repositories is not currently available. To fill this gap, we present our tool STAMPER. It is implemented in Python and supports GitHub repository metadata extraction based on repository keyword search and code segment search using the GitHub API. It can also analyse the modification of forked repositories from the original repository and capture the repository difference.

We study the historical trend of deep learning models and frameworks from repositories in GitHub. We further demonstrate how repository metadata can be used to see the deep learning trend. Our project provides a novel method to study deep learning frameworks and models from a historical aspect, and in the meanwhile our work creates a new aspect of the empirical study of deep learning.


1.2 Contribution

• We introduce STAMPER, a Python tool that can be used to extract the metadata that characterize historical open source projects from GitHub, based on researchers' interest.

• We utilize STAMPER in a case study analysing the usage of deep learning models in TensorFlow. We further demonstrate how we extract a rich set of features and establish the connections between those features and popularity.

• We create and mine a new dataset for further research use.

1.3 Report Outline

The rest of this report is structured as follows. Chapter 2 provides the timeline for deep learning frameworks and deep learning models. In that chapter, some background knowledge is presented, and previous works related to software mining tools and GitHub visualizations are recorded as well.

Chapter 3 provides an overview of our work and introduces our proposed methodology to extract the metadata from the GitHub repository.

Chapter 4 presents a case study in which we use our tool to extract deep learning related repositories from GitHub and to trace the landscape of those hot deep learning models. The visualizations generated using our tool can aid researchers in understanding the past and current trends easily.

Referring back to the critical questions posed, we conclude our project and highlight the next steps in Chapter 5.

Chapter 2

Background and Related Work

Software repository mining lets researchers study the historical trends of software engineering practice effectively. The use of repository mining is based on the use of web hosting services, and there exist multiple approaches to conducting such studies. In the first section, we introduce some background knowledge on web-based hosting services. Then we introduce some popular deep learning frameworks in Section 2.1.1. Finally, we detail some previous works in Section 2.2, which conduct software repository mining and identify the main patterns in exploring repository popularity.

2.1 Background

2.1.1 Deep learning

Different from traditional machine learning, deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [LeCun et al., 2015]. Methods in deep learning can dramatically improve human life in multiple aspects, from image classification and speech recognition to machine translation, and even to autonomous cars.

The usage of deep learning has grown tremendously in the past few years with the introduction and improvement of hardware availability (GPUs), big data, and cloud providers such as Amazon Web Services (AWS). Not limited to this, large companies create their own research teams to develop deep learning algorithms and integrate them into frameworks such as PyTorch and TensorFlow. In the meanwhile, they share their datasets and model training tutorials, helping startup companies to build state-of-the-art products with minimal effort and time.

Though deep learning has become a hot topic in recent years, the history behind deep learning has been evolving since the 1950s, as summarised in Table 2.1 [Subramanian, 2018].

Previously, developers and researchers required expertise in C++ or CUDA to implement deep learning models and algorithms. However, thanks to large technology companies and organisations, people with knowledge of a scripting language (Python) can now also build their own deep learning algorithms.


Technique | Year
Neural network | 1943
Backpropagation | 1960s
Convolutional Neural Network | 1979
Recurrent neural network | 1980
Long Short-Term Memory | 1997

Table 2.1: Deep Learning History

In Section 2.1.1.1 and Section 2.1.1.2, we will talk about some state-of-the-art deep learning frameworks used today, both in academia and industry.

2.1.1.1 TensorFlow

TensorFlow is a machine learning system that operates at large scale [Abadi et al., 2016]. Before its initial release by the Google Brain team back in November 2015, it was developed under the name DistBelief. TensorFlow then released its official 1.0.0 version on 11 February 2017, introducing the ability to run on multiple CPUs and GPUs.

Currently, Google uses this framework in numerous ways to improve its search engine, translation, recommendation system, and image recognition and captioning. The TensorFlow library contains different APIs to build deep learning models at scale, such as ResNet, CNN, or LSTM. The architecture of this framework gives developers the flexibility to experiment with new optimisations and Google's training algorithms. It currently supports numerous applications from front end to mobile; this flexibility and scalability make it one of the most popular deep learning frameworks in the world.

2.1.1.2 PyTorch

Based on the Torch library, PyTorch can be seen as a Python front end to the Torch engine, which provides the capability to define mathematical functions and compute gradients. Similar to TensorFlow, PyTorch has good GPU and CPU support and is primarily developed by Facebook's artificial intelligence research group. The official initial release of PyTorch was in October 2016. Its dynamic computation power gives it great flexibility in building complex architectures.

However, unlike TensorFlow, which can seamlessly integrate into real industrial applications, PyTorch was primarily developed by researchers and scientists, and it is not easy or recommended to use in production in specific scenarios.


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and pick up the speed of project development to survive in this keen competition. Winning trust from the public with service of high quality is thus required.

Initially, we would have liked to conduct our research on the latest model stores, such as AWS SageMaker, the Azure Machine Learning service, the Wolfram Neural Net Repository and ONNX. However, the algorithms and datasets they use are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand over the public. To avoid these problems and gain a deeper insight into usage in society, we chose a framework with models of greater transparency and substantial usage: TensorFlow.

APIs Referenced

Since this project involves a range of deep learning models, we began by developing an understanding of how to build and train neural networks. The construction of networks may rely on the layers APIs provided by Keras and TensorFlow. TensorFlow has also recently introduced the Estimator API to simplify the procedure of training, evaluation, prediction and export.
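As a framework-agnostic sketch of what the layers APIs express, the forward pass of a small two-layer network can be written in plain NumPy (the `dense` helper and the layer sizes are illustrative assumptions, not part of STAMPER or of TensorFlow's API):

```python
import numpy as np

def dense(x, W, b, activation=None):
    """A single fully connected layer: y = activation(x @ W + b)."""
    y = x @ W + b
    return np.maximum(y, 0.0) if activation == "relu" else y

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                        # a batch of 4 samples, 8 features each
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)    # hidden layer parameters
W2, b2 = rng.normal(size=(16, 2)), np.zeros(2)     # output layer parameters

hidden = dense(x, W1, b1, activation="relu")
logits = dense(hidden, W2, b2)
print(logits.shape)  # (4, 2)
```

The layers APIs mentioned above package exactly this kind of composition, plus parameter management and training loops.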

Convolutional Neural Network (CNN)

The Convolutional Neural Network is one of the most established algorithms among all deep learning models and one of the most dominant in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons arranged in a grid pattern. Typically, a CNN consists of three types of layers: convolution layers, pooling layers and fully connected layers. Convolution and pooling layers conduct feature extraction, and the fully connected layers map the extracted features into the final output. Layers are interconnected, so the extracted features are passed on layer by layer.
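To make the convolution and pooling steps concrete, a minimal single-channel NumPy sketch follows (the `conv2d` and `max_pool` helpers and the edge-detector kernel are illustrative, not taken from any library):

```python
import numpy as np

def conv2d(img, kernel):
    """Valid (no-padding) 2D convolution of one channel with one kernel."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keep the strongest response per window."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)     # toy 6x6 "image"
edge = np.array([[1.0, -1.0], [1.0, -1.0]])        # vertical edge detector
features = max_pool(conv2d(img, edge))
print(features.shape)  # (2, 2)
```

A real CNN stacks many such kernels per layer and learns their values during training.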

Long short-term memory (LSTM)

Different from traditional neural networks, which cannot memorise previous data, Long Short-Term Memory (LSTM), a special kind of recurrent neural network, provides researchers with an effective way to perform tasks on persistent data and is capable of learning long-term dependencies [Hochreiter and Schmidhuber 1997]. It can learn dependencies from historical data and make predictions from previously remembered information. (A popular variant, the Gated Recurrent Unit, combines the forget and input gates into a single update gate.) Inside an LSTM, instead of a single linear layer, there is a small network that performs these gating functions independently.

TensorFlow official models are chosen in our project. The TensorFlow official models contain a collection of deep learning models built with TensorFlow's high-level APIs (https://github.com/tensorflow/models/tree/master/official).

LSTM is one of the most commonly used recurrent neural networks. This model is generally applied to sequential data and can solve language modelling problems, such as those built on NLP concepts (word embeddings, encoders).
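The gating mechanism described above can be sketched as a single LSTM time step in NumPy (the weight layout and the `lstm_step` helper are simplifying assumptions for illustration, not a library API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [h_prev, x] to the four gate pre-activations."""
    z = np.concatenate([h_prev, x]) @ W + b
    H = h_prev.size
    f = sigmoid(z[0:H])        # forget gate: what to discard from the cell state
    i = sigmoid(z[H:2*H])      # input gate: what new information to store
    g = np.tanh(z[2*H:3*H])    # candidate cell state
    o = sigmoid(z[3*H:4*H])    # output gate
    c = f * c_prev + i * g     # updated cell state ("memory")
    h = o * np.tanh(c)         # hidden state passed to the next step
    return h, c

rng = np.random.default_rng(1)
H, X = 4, 3                                    # hidden size, input size
W = rng.normal(scale=0.1, size=(H + X, 4 * H))
b = np.zeros(4 * H)
h = c = np.zeros(H)
for x in rng.normal(size=(5, X)):              # run 5 time steps
    h, c = lstm_step(x, h, c, W, b)
print(h.shape)  # (4,)
```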

Residual Network (ResNet)

One of the problems deep learning models face is that as the number of layers increases past a certain point, accuracy stops improving. This problem motivated modern network architectures, such as the residual network (ResNet) and Inception, which address it through residual connections.

ResNet solves the problem described above by fitting a residual mapping via shortcut connections. Each ResNet block contains a series of layers and a shortcut connection component.
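The shortcut connection can be sketched in a few lines of NumPy (this two-dense-layer `residual_block` is a deliberately simplified illustration; real ResNet blocks use convolutions and batch normalisation):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = relu(F(x) + x): the layers learn a residual mapping F, and the
    identity shortcut lets the input (and gradients) bypass them."""
    out = relu(x @ W1)      # first layer of the block
    out = out @ W2          # second layer (pre-activation)
    return relu(out + x)    # shortcut connection adds the input back

rng = np.random.default_rng(2)
d = 8
x = rng.normal(size=(d,))
W1 = rng.normal(scale=0.1, size=(d, d))
W2 = rng.normal(scale=0.1, size=(d, d))
y = residual_block(x, W1, W2)
print(y.shape)  # (8,)
```

If the weights are near zero, F(x) is near zero and the block passes its input through almost unchanged, which is why very deep stacks of such blocks remain trainable.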

Bidirectional Encoder Representations from Transformers (BERT)

BERT is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al. 2018].

It was first released in google-research/bert on GitHub on 1 November 2018, with the ability to handle a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications; it aims to predict the relationship between sentences by analysing whole sentences holistically [Devlin et al. 2018].

Attention is all you need (Transformer)

Most problems in deep learning can be seen as a form of sequence-to-sequence mapping, and can be solved using a common type of architecture: the encoder-decoder architecture.

The encoder and decoder both consist of stacks of identical layers, and each layer consists of two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network [Vaswani et al. 2017].
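The core of each self-attention sub-layer is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V [Vaswani et al. 2017]; a minimal single-head NumPy sketch (the shapes chosen here are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Each query position attends to all key positions; the sqrt(d_k)
    scaling keeps the dot products in a range where softmax is well-behaved."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

rng = np.random.default_rng(3)
Q = rng.normal(size=(4, 16))   # 4 query positions, d_k = 16
K = rng.normal(size=(6, 16))   # 6 key positions
V = rng.normal(size=(6, 16))   # 6 value vectors
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 16)
```

Multi-head attention runs several such maps in parallel on learned projections of Q, K and V and concatenates the results.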

Neural Collaborative Filtering (NCF)

Neural Collaborative Filtering (NCF) is a neural network architecture that utilises the non-linearity of neural networks to build recommendation systems [He et al. 2017]. It demonstrates that matrix factorisation can be expressed as a special case of neural collaborative filtering. To add additional non-linearity, the model introduces a multilayer perceptron module alongside the generalised matrix factorisation (GMF) layer.

Wide and Deep Learning

Since linear models are not good at generalising across unique features, deep models were introduced to solve this problem. Deep models can learn an embedding vector for every query and are then able to generalise by coupling items and queries.

To overcome over-generalisation, the Google research team introduced Wide & Deep Learning: jointly trained wide linear models and deep neural networks that combine the benefits of memorisation and generalisation for recommender systems [Cheng et al. 2016].

2.1.3 Summarized Timeline

Model Name      Definition Raised Time
CNN             1980s
LSTM            1997
ResNet          2015
Wide & Deep     2016
NCF             2017
Transformer     2017
BERT            2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest hosts of projects using the distributed version control system (DVCS) Git [Gousios et al. 2014]. A distributed version control system enables contributors to submit sets of changes and integrate them into the main development branch. The use of Git is based on pragmatic needs: it combines version control with collaborative development.

GitHub can also lead to insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project; the number of stars can thus reveal popularity from a software development research perspective, and it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al. [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth, using time series metadata derived from 2,279 accessible GitHub repositories. They also found that slow growth is more common in overpopulated application domains and for old repositories. Moreover, they conclude that the most common domains on GitHub include web libraries and frameworks and non-web libraries and frameworks.

Building on this work, we would like to examine whether deep learning framework repositories follow a similar popularity trend pattern. At the same time, we would also like to study whether a relationship exists between three factors: forks, stars and watchers.

Predicting the Popularity of GitHub Repositories

In the same year, Borges et al. [2016a] published another paper on predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regression to predict the number of stars of GitHub repositories, so that project owners can see how their projects are performing in the open-source community.

Their study examines the correlation between the popularity of a repository and the number of stars it owns. The study also reports a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent

GHTorrent [Gousios and Spinellis 2012] is designed to conduct independent repository mining through a peer-to-peer BitTorrent protocol. It uses the REST API to navigate all repositories, and the resulting dataset can be used to answer multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving raw data from the results returned by the REST API. However, their tool cannot visualise the metadata or offer high-level trend analysis.

MetricMiner

A similar tool is MetricMiner [Sokol et al. 2013]. It is a web application that supports researchers in mining software repositories, performing data extraction and drawing statistical inferences from the collected data. The tool automatically clones a repository, processes the metadata and stores the data in the cloud, which gives it good scalability and fast query answering without users installing any software on their localhost.

GitcProc

GitcProc [Casalnuovo et al. 2017] is a tool that uses regular expressions to extract the changed lines within a repository, to facilitate answering project evolution questions. GitcProc can retrieve and summarise global project metrics, including the number of commits, commit dates and contributors. The tool can measure how many changes have taken place in Java projects and is also able to locate the changed files.

RepoVis

RepoVis [Feiner and Andrews 2018] is a new tool which provides visual overviews of software maintained in Git repositories. RepoVis is a client-server web application that provides full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its searchable functionality to GitHub, combined with a code-based search. All visualisations are written out in SVG format.

2.2.4 Visualizing data in Repositories

Chronos

CHRONOS [Servant and Jones 2013] is a software tool that enables the visualisation of historical changes inside software source code. The tool implements a zoomable user interface as a visualisation of the actual code, which supports developers from a high-level view (pattern recognition) down to a low-level view (reverse engineering) using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise their complete change history, including the revisions that modified them. Inspired by this tool, our project uses this visualisation method to track the historical change of popularity trends related to user-specified keywords in GitHub.

Figure 2.1: git2net [Gote et al. 2019]

GEVOL

Collberg et al. [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph drawing technique, to derive a better understanding of a program from its development history, displaying all visualisations in a temporal graph visualizer.

This system aids in discovering the structure of a system and provides the user with a new way to explore the evolution of a program by visualising changes to it. It extracts information about Java programs stored within a CVS version control system and converts the metadata into three types of graphs: inheritance, control-flow and call graphs.

git2net

git2net [Gote et al. 2019] is a software tool that facilitates the extraction of co-editing networks from git repositories. Similar to GEVOL, it uses text mining techniques to analyse the history of modifications within files. However, it addresses the importance of studying social networks in GitHub and gives the reader a broader view of the application of graph-based data analysis and modelling. The tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter, we detailed the web-based hosting service we selected (GitHub) and presented the concept of deep learning, with two popular frameworks, together with state-of-the-art neural network models. In the next chapter we elaborate how we designed and implemented STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter, we outline our design and implementation for data extraction, and then detail the metrics we use to estimate trends in deep learning framework and model usage on GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1: (1) data collection through the Git project search API, (2) repository search, and (3) an optional data selection step that matches model-name keywords through the Git code search API, after which the locally stored data is used for data visualisation.

Figure 3.1: Overview of STAMPER


Data Collection

We first collect all repository metrics through the GitHub API. This step allows us to extract the history of all repositories related to a keyword and record the metadata for each repository (repository-based search).

Repository Search

As each repository contains multiple contributors and forks, we subsequently extract and analyse whether changes were made, based on the size information, and calculate a collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the forked repositories' information to create visual representations.
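The exact entropy definition is not spelled out at this point; a plausible choice, shown purely as an assumption, is the Shannon entropy of the contribution distribution across a repository's developers:

```python
import math

def collaboration_entropy(contributions):
    """Shannon entropy of the contribution shares: higher values mean work is
    spread more evenly across contributors; 0 means a single contributor."""
    total = sum(contributions)
    probs = [c / total for c in contributions if c > 0]
    return -sum(p * math.log2(p) for p in probs)

print(collaboration_entropy([50, 50]))   # 1.0 (two equal contributors)
print(collaboration_entropy([100]))      # 0.0 (single contributor)
```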

Data Selection

We have implemented a selector that excludes repositories unrelated to the desired ones. The selector summarises frequency counts for the keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis

Since each forked repository may involve re-development and modification, analysis of modifications to forked repositories is included in our project. For all changed repositories, the size difference is analysed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to deeply analyse and manipulate the data and even run statistical tests on the dataset. To better understand these metrics, we divided them into multiple categories. Attributes that are not primary data from the GitHub API are explained in the data expansion part and labelled as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication, the user can make up to 5,000 requests per hour; otherwise, the rate limit only allows up to 60 requests per hour [Git d].
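A minimal sketch of this authentication step (the token placeholder is hypothetical; the `Authorization: token ...` header and the 60 vs 5,000 requests/hour limits follow the GitHub REST API v3 documentation):

```python
from urllib.request import Request

GITHUB_API = "https://api.github.com"

def github_request(path, token=None):
    """Build a GitHub REST API request; supplying an OAuth2 token in the
    Authorization header raises the rate limit from 60 to 5,000 requests/hour."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    return Request(f"{GITHUB_API}{path}", headers=headers)

# The /rate_limit endpoint reports how many requests remain in the current window.
req = github_request("/rate_limit", token="<YOUR_OAUTH_TOKEN>")
print(req.full_url)  # https://api.github.com/rate_limit
```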


Type          Meta-data
Contributor   contribution: int [Data Expansion]
              login (user name): String
              type (user/organization): String
              contributors_url
Repository    created_at
              description
              full_name
              language
              size
Popularity    fork: Boolean
              forks: int
              forks_url
              stargazers_count
              watchers_count
              unique_repos [Data Expansion]
Owner         id
              login (username)
              type
Maintenance   has_issues: Boolean
              has_wiki: Boolean
              open_issues: int
              pushed_at
              updated_at
              score

Table 3.2: Repository metadata collected

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deep crawling operations that collect raw data.

• Contribution: One repository generally involves multiple developers, and the project owner is not necessarily the person who contributes the most code; the amounts contributed by different developers are potentially unequal. As a result, we further track this information via the GitHub API and record the number of contributions each developer made to each repository.

• Unique_repos: Popular repositories may have numerous forked repositories owned by other users, and the reasons behind forking vary. Our research explores whether users conduct subsequent development based on the original codebase. By comparing the size of each forked repository (F_i) with the original repository (O), we obtain all forked repositories with a change of size (c):

    F_i + c = O    (3.1)
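Equation 3.1 can be applied directly to the collected size metadata; a small sketch with hypothetical sizes (in KB):

```python
def fork_size_changes(original_size, fork_sizes):
    """Solve F_i + c = O (Equation 3.1) for each fork: c = O - F_i.
    A non-zero c marks a fork whose size differs from the original."""
    return [original_size - f for f in fork_sizes]

# The first fork is untouched; the other two have changed sizes.
changes = fork_size_changes(1024, [1024, 980, 1100])
changed_forks = [c for c in changes if c != 0]
print(changes)             # [0, 44, -76]
print(len(changed_forks))  # 2
```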

3.4 Data Selection

Figure 3.2: Data Selection. Entity (model) API keywords are searched within each repository to produce statistics.

Figure 3.3: Store in Local Disk. Unfiltered data (forked repositories with timestamps) is filtered by model-related keywords (Bert, ResNet, CNN, ...) grouped in the keyword file.


Figure 3.2 represents our method of searching API usage in DL model-related repositories. GitHub provides a REST API to examine whether a repository contains a targeted API.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. The appearance frequency of each user-specified API is embedded directly in the returned result and can be identified in our program, corresponding to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. The approach also allows users to build a high-level picture of API usage across GitHub-related repositories.
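A sketch of how such a per-repository query could be built against GitHub's code search endpoint (the helper name is ours; the `q=<keyword> repo:owner/name` query syntax and the `total_count` field in the response are GitHub's):

```python
from urllib.parse import urlencode

def code_search_url(keyword, repo_full_name):
    """Build a GitHub code-search query restricted to one repository; the
    response's total_count field then proxies how often the keyword appears."""
    query = f"{keyword} repo:{repo_full_name}"
    return "https://api.github.com/search/code?" + urlencode({"q": query})

url = code_search_url("keras.applications.resnet.ResNet50", "tensorflow/models")
print(url)
```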

We also provide multiple options for selecting APIs, stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras applications library provides users the ability to load deep learning models and instantiate a model with default weights. Using ResNet as an example:

• With pre-defined models: a user may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py. This denotes that the repository owner may use a pre-trained model from the Keras library, which could then be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they all make good sample keywords in STAMPER, such as:

1. keras.applications.resnet.ResNet50
2. keras.applications.resnet.ResNet101
3. keras.applications.resnet.ResNet152
4. keras.applications.resnet_v2.ResNet50V2
5. keras.applications.resnet_v2.ResNet101V2
6. keras.applications.resnet_v2.ResNet152V2

• With self-defined models: TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to explore the components inside the ResNet class further, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, data selection heuristics vary: deep learning users and experts can define their searches according to their own interests and preferences.

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py
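The selection heuristic can be sketched as counting registered keywords against a file's source (the keyword list and helper below are hypothetical examples, not the actual contents of model_keyword.py):

```python
# Hypothetical entries a user might register in model_keyword.py.
RESNET_KEYWORDS = [
    "from keras.applications.resnet50 import ResNet50",
    "keras.applications.resnet.ResNet50",
    "keras.applications.resnet_v2.ResNet50V2",
]

def keyword_frequencies(source_code, keywords):
    """Count how often each registered keyword appears in a file's source."""
    return {k: source_code.count(k) for k in keywords}

sample = (
    "from keras.applications.resnet50 import ResNet50\n"
    "model = ResNet50(weights='imagenet')\n"
)
freqs = keyword_frequencies(sample, RESNET_KEYWORDS)
print(freqs["from keras.applications.resnet50 import ResNet50"])  # 1
```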

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualisations from these three perspectives is illustrated in Figure 3.4. Chapter 5 gives an example using our collected repository metadata for deep learning models.

Figure 3.4: Overall Construct the Visualizations. Entities are functionally mapped to contribution-related, popularity-related and maintenance-related visualisations.

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped by week (with forks)

• Repository creation time vs stars

Contribution

To complement the forking information, STAMPER also supports the comparison between an original repository and its forked repositories. The work could be further extended by visiting each forked repository's URL and tracing its commit history.

As shown in Figure 3.5, the entity (E) we search for in GitHub may have multiple related repositories (R_i), each with corresponding forked repositories (F_i). Among the forked repositories, we call a changed forked repository C_i.

To examine whether changes exist in forked repositories, and the difference between multiple entities, we calculate the difference using the equation below.


Keyword                     Total of Repository (including Forks) Collected    Total of Original Repository Collected
ResNet tensorflow           6129                                               339
Bert tensorflow             13734                                              106
CNN tensorflow              39765                                              1000
LSTM tensorflow             19572                                              1000
Transformer tensorflow      7188                                               145
Wide and deep tensorflow    324                                                39

Table 3.1: Repositories Related to Tensorflow


Uniqueness percentage distributions are composed of all the percentages p_i, each corresponding to an original repository R_i:

    p_i = (Σ C_i) / (Σ F_i)    (3.2)
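Equation 3.2 reduces to a simple ratio per original repository; a minimal sketch (the boolean-flag representation of changed forks is our assumption):

```python
def uniqueness_percentage(fork_changed_flags):
    """Equation 3.2: p_i = (number of changed forks C_i) / (total forks F_i)."""
    if not fork_changed_flags:
        return 0.0
    return sum(fork_changed_flags) / len(fork_changed_flags)

# One original repository with four forks; True marks a fork whose size changed.
flags = [True, False, True, True]
print(uniqueness_percentage(flags))  # 0.75
```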

Figure 3.5: Examine Uniqueness after Forking. An entity (E) maps to repositories (Repository 1 ... n), each with forked repositories; each fork is marked as changed (Y/N).

• Percentage of forked repositories unique from origin (boxplots)

• Uniqueness percentage distribution for each entity (histograms)

• Entropy distribution for each entity (histograms)

Maintenance

• Development time boxplot for each entity

• Open issues distribution for each entity

3.6 Summary

In this chapter, we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models; it can be used as a stand-alone tool for analysing software trends in the GitHub community. We also introduced and analysed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field, where models are continually evolving and being built, trained and deployed by researchers. Our tool is designed for analysing such changes: we collected the historical information stored in GitHub and extracted repository metadata using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development is a battlefield without the smoke of gunpowder: researchers, companies and developers are all competing for a dominant voice in deep learning. A variety of models exist, but there is no common bridge connecting these ideas, and historical data on GitHub is opaque and hard to find. As a result, many developers, especially experienced ones, remain in their comfort zone. With our study, we hope to shed some light on deep learning usage and highlight a few suggestions for the public.

This section aims to answer questions about both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, given the few studies on popularity in the GitHub ecosystem, there is no standardised feature for measuring popularity. We analyse some potential features of each repository and hypothesise that popularity is strongly related to the number of stars a repository owns.

This decision is justified in the following section, with more background on GitHub.

• Watchers: Similar to following an RSS feed, watchers are GitHub users who ask to be notified of activity in a repository they are watching; watching does not imply collaboration [Git b]. A watcher receives notifications for new pull requests or issues, so the number of watchers indicates how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

• Stars: Starring a repository makes it easy to keep track of a repository the user is interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). Stars are another metric of popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

Figure 4.2: Star Sort Menu [Git a]

• Forks: A fork is created when a user would like to make their own copy of a repository. A user can fork a repository to suggest changes or to use it as the basis for a new project.

Based on the data gathered in the collection stage, we extract the watchers_count, stargazers_count and forks attributes and test whether these attributes are correlated. In Table 4.1 we summarise 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star     forks_count   watchers_count   model name
17940    4661          17940            Bert
12405    3637          12405            Bert
5263     1056          5263             Bert

Table 4.1: Popularity metric for repositories (sample rows)

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we should not assume normality and instead use a rank-based measure.


Spearman Correlation Coefficient

Definition: The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman 2011]. This statistical method allows us to test for a rank-order relationship between two numerical or ordinal variables associated with a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork and watcher) do not have a relationship with each other.

• H1: There is a relationship between these three variables.

Result

• Star vs Fork

    from scipy.stats import spearmanr
    coef1, p1 = spearmanr(star, forks)
    print(coef1, p1)
    >> 0.8752903811064278 0.0

• Star vs Watcher

    coef2, p2 = spearmanr(star, watchers)
    print(coef2, p2)
    >> 1.0 0.0

• Fork vs Watcher

    coef3, p3 = spearmanr(forks, watchers)
    print(coef3, p3)
    >> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient for the three variable pairs in the testing dataset.

Setting α = 0.05, the p-values p1, p2 and p3 are all less than α; the coefficients also indicate strong positive correlations, with coef1 = 0.875, coef2 = 1.0 and coef3 = 0.875 respectively.

This means it is very unlikely (at the 95% confidence level) that the data are uncorrelated, and thus we can reject the null hypothesis that these variables are uncorrelated.

In the rest of this report, we therefore use the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and long short-term memory network (LSTM), shown in the figures below, appear to be two of the most trending models these days. Rising from 2017, CNN and LSTM exhibit the greatest numbers of repositories in both creations and forks. Aside from these models with longer histories, BERT and ResNet are two rising stars in the model competition: they arrive with significant improvements in architecture design and performance, as described in the background section.

The model development community has recently seen the release of multiple powerful models that are treated as baselines for building new ones. However, for many new models, usage has not grown in abundance, as with the Wide and Deep and NCF models.


Figure 4.4: Repositories with Forks. Accumulated number of repositories created (including forks) from 2015 to 2019, for each model keyword (bert, cnn, lstm, ncf, resnet, transformer, wide deep; each paired with "tensorflow").

Figure 4.5: Repositories without Forks. Accumulated number of repositories created (excluding forks) from 2015 to 2019, for each model keyword.


Figure 4.6: Repository Trend in GitHub For Each Model. Per-model repository counts from October 2015 to October 2019.


Figure 4.7: Creation Time vs Stars. Repository creation time plotted against the number of stars for each model keyword.

A fork is another copy of a repository; a forked repository can either contribute back to the original or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, we can surprisingly see a considerable difference between the total number of repositories created including forks and the total number created excluding forks: most repositories related to deep learning models are not original. This indicates that a considerable number of developers remain at the learning stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, different from the previous summarising method, we study how the popularity of deep learning models changes over time.

CNN and LSTM's dominant role

From the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the largest number of repositories created. Let us examine this using the data: in 2017 the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued rising to a higher level, which it maintains to this day.

What accounts for this tremendous usage difference? CNN and LSTM currently have two of the most important and significant communities in the deep learning field; these networks are essential in both computer vision and NLP, where they form an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly in the last two years. Differing from previous structures such as CNN, both modify the original structure and significantly improve results on computer vision and translation tasks.

Rising Star: BERT

However, no model comes with perfection. The Transformer itself can be extended to many variants, and BERT is one of those.

The current trends, as depicted in the graph, suggest that deep learning models are proliferating quickly, driven by innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with the time the model came into existence, but our data tell a different story.

Published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no simple relationship between popularity (i.e., stars) and creation time: past trends are no guarantee of future ones, and it is possible that the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide and Deep model is similar: although Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the previous data also confirm that there is no significant rise in the use of this model.

[Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development. Histograms of binned forks_count (0-1000) against count of records, one panel per model: bert, cnn, lstm, ncf, resnet, transformer and wide deep tensorflow.]


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name    | Mean   | STD     | Min | 25% | 50% | 75% | Max
Bert          | 498.65 | 2196.3  | 0   | 1   | 8   | 43  | 17940
CNN           | 106.84 | 611.97  | 2   | 3   | 8   | 32  | 13882
LSTM          | 48.82  | 214.22  | 0   | 1   | 2   | 13  | 2703
NCF           | 77     | 129.91  | 1   | 2   | 3   | 115 | 227
ResNet        | 46.88  | 221.43  | 0   | 0   | 1   | 8   | 2980
Transformer   | 186.79 | 1155.87 | 0   | 0   | 4   | 21  | 12408
Wide and Deep | 16.23  | 36.80   | 0   | 0   | 1   | 8   | 146

Table 4.2: Stars Comparison

Model Name    | Mean   | STD    | Min | 25% | 50% | 75%  | Max
Bert          | 128.21 | 585.93 | 0.0 | 0.0 | 1.0 | 16.5 | 4661.0
CNN           | 40.71  | 252.71 | 0.0 | 1.0 | 4.0 | 14.0 | 6274.0
LSTM          | 17.79  | 71.96  | 0.0 | 0.0 | 1.0 | 5.0  | 968.0
NCF           | 34.33  | 58.60  | 0.0 | 0.5 | 1.0 | 51.5 | 102.0
ResNet        | 17.44  | 93.75  | 0.0 | 0.0 | 0.0 | 3.0  | 1442.0
Transformer   | 53.52  | 336.10 | 0.0 | 0.0 | 1.0 | 6.0  | 3637.0
Wide and Deep | 7.28   | 16.36  | 0.0 | 0.0 | 0.0 | 2.5  | 71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52) and NCF (34.33).

The three models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The three models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79) and ResNet (17.44).
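Summary statistics like these (mean, standard deviation, quartiles, min, max) can be produced directly with pandas. A sketch with toy star counts standing in for a model's real star column:

```python
import pandas as pd

# Toy star counts for one model; the real tables summarize the stars of
# every repository matched for each model
stars = pd.Series([0, 1, 2, 8, 43, 17940], name="star")
summary = stars.describe()  # count, mean, std, min, 25%, 50%, 75%, max
print(summary["50%"], summary["max"])  # 5.0 17940.0
```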

Kruskal-Wallis Test. The Kruskal-Wallis test is used in this project to compare more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen:

• H0: the 7 models' distributions are the same

• H1: the 7 models' distributions are different

from scipy.stats import kruskal

stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                  dfNcf["star"].tolist(), dfResnet["star"].tolist(), dfTransformer["star"].tolist(),
                  dfWideDeep["star"].tolist())
print(stat, p)
# 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' star distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the previous popularity exploration based on the number of repositories existing in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or BERT model may require a large amount of time and effort, even though developers show their interest in those novel deep learning models.

[Figure 4.9: Star vs Contributors. Scatter plot of stargazers_count against number_of_contributors, colored per model.]

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues and entropy, respectively.
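Each of these relationships is quantified below with a Spearman rank correlation. A minimal sketch of the test with `scipy.stats.spearmanr`, using toy per-repository values rather than the real dataset:

```python
from scipy.stats import spearmanr

# Toy per-repository values; the thesis runs this on the full collected data
stars        = [498, 12, 7, 103, 0, 55, 8]
contributors = [21, 1, 1, 9, 1, 4, 2]

rho, p = spearmanr(stars, contributors)
print(rho, p)  # positive rank correlation on this toy data
```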

[Figure 4.10: Star vs Development Time. Scatter plot of stargazers_count against develop_duration, colored per model.]

[Figure 4.11: Star vs Open Issues. Scatter plot of stargazers_count against open_issues, colored per model.]

[Figure 4.12: Star vs Entropy Value. Scatter plot of stargazers_count against entropies, colored per model.]

Number of Contributors. From Figure 4.9 and a Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Among all the repositories, the top-3 repositories with the most stars per contributor come from the models CNN (168.75 stars/contributor), Transformer (15.51 stars/contributor) and Bert (15.50 stars/contributor).

Model         | Percentage of One-Contributor Development (%)
Bert          | 74.53
CNN           | 83.3
LSTM          | 85.9
NCF           | 100
ResNet        | 90.26
Transformer   | 81.20
Wide and Deep | 89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time. From Figure 4.10 and a Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value ≈ 0). This suggests that the longer a model has been developed, the more stars it will have (i.e., the model becomes more popular). The top-2 repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues. From Figure 4.11 and a Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesize that the more popular a repository becomes, the more issues it will have. We further investigate this correlation in the following section.

Entropy. From Figure 4.12 and a Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of a new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development work is not distributed evenly.

We can confirm this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. Here we introduce an information-theoretic approach to test whether their contributions are even or not.

Entropy

In particular, we compute the entropy H of each repository, defined as

p_i = c_i / Σ_i c_i    (4.1)

H = -Σ_i p_i log2(p_i)    (4.2)

where i indexes the contributors, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example:

Its contribution table is summarized in Table 4.5, and its corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214    (4.3)

p_1 = 174/214,  p_2 = 36/214,  p_3 = 4/214    (4.4)

H(repository) = -(174/214 log2(174/214) + 36/214 log2(36/214) + 4/214 log2(4/214)) ≈ 0.78264    (4.5)

Name          | Contribution
dragen1860    | 174
ash3n         | 36
kelvinkoh0308 | 4

Table 4.5: Sample Contributions to One Repository

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly. The lower the entropy, the more uneven the distribution of work.

Figure 4.13 shows the distribution of the entropy value for all models. From those figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


[Figure 4.13: Collaboration Entropy. Histograms of binned entropy values (0.00-3.00) against count of records, one panel per model.]


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has the highest proportion of unique forked repositories among the models studied.
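The uniqueness percentage can be sketched as follows: a fork counts as "unique" when its metadata differs from its parent's. The `size`/`parent_size` fields below are hypothetical stand-ins for the metadata comparison STAMPER performs:

```python
# Hypothetical fork metadata: fork size (KB) next to its parent's size
forks = [
    {"size": 312, "parent_size": 312},
    {"size": 312, "parent_size": 312},
    {"size": 480, "parent_size": 312},
    {"size": 298, "parent_size": 312},
]
# A fork is "unique" when its size differs from the original repository's
unique_percent = 100 * sum(f["size"] != f["parent_size"] for f in forks) / len(forks)
print(unique_percent)  # 50.0
```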

[Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot). Boxplots of unique_percent (0-100) per model.]

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. A more detailed analysis shows at a glance that not only are changes rarely made after forking, but also that most changed


[Figure 4.15: Repository Uniqueness Distribution (%). Histograms of binned uniqueness percentage (0.00-1.00) against count of records, one panel per model.]

[Figure 4.16: Repository Change Statistic. Histograms of binned mean size change (-2500 to 2500) against count of records, one panel per model.]


repositories differ in size from the original by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, newly released models have had less time to accumulate tutorials and attention, so people engage with them less. Second, the model itself may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey software maintenance problems in these deep learning related repositories. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long have they been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report, we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) - T(created_at)    (4.6)
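Eq. 4.6 is a simple timestamp difference over the repository metadata. A pandas sketch with hypothetical `created_at`/`updated_at` values:

```python
import pandas as pd

# Hypothetical ISO-8601 timestamps as returned in GitHub repository metadata
repos = pd.DataFrame({
    "created_at": ["2018-11-01T10:00:00Z", "2019-03-15T08:30:00Z"],
    "updated_at": ["2019-01-20T12:00:00Z", "2019-03-20T09:00:00Z"],
})
# Repository age in whole days, per Eq. 4.6
age_days = (pd.to_datetime(repos["updated_at"])
            - pd.to_datetime(repos["created_at"])).dt.days
print(age_days.tolist())  # [80, 5]
```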

Figure 4.17 and Table 4.6 show how development time varies per model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide and Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). Therefore, we hypothesize that many of the earlier models started being used in the open-source community immediately after their first release.


Model         | Max of days | Q3     | Median | Q1     | Min
Bert          | 779         | 229    | 110    | 32     | 0
Transformer   | 1254        | 321    | 142    | 11     | 0
Wide and Deep | 1107        | 575    | 117    | 0.5    | 0
ResNet        | 1360        | 456.5  | 120    | 15     | 0
NCF           | 1120        | 476    | 216    | 8      | 0
LSTM          | 1812        | 621.25 | 315.5  | 47.25  | 0
CNN           | 1385        | 699.25 | 483    | 270.25 | 0

Table 4.6: Repository Development Time Statistics

[Figure 4.17: Development Time Boxplot. Boxplots of development days (0-2000) per model.]


[Figure 4.18: Development Time vs Number of Open Issues. Scatter plot of develop_duration against open_issues, colored per model.]

4.3.2 RQ2: Do old models have more issues than new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and a Spearman correlation test, there is a moderate correlation between the two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, which are costlier to maintain, may have more users and hence more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.30), CNN (3.41) and Transformer (1.86). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model         | Mean  | Std    | 25% | 50% | 75% | Min | Max
Bert          | 8.299 | 50.55  | 0   | 0   | 1   | 0   | 504
CNN           | 3.414 | 35.456 | 0   | 0   | 1   | 0   | 1077
LSTM          | 1.292 | 4.915  | 0   | 0   | 1   | 0   | 69
ResNet        | 1.791 | 11.164 | 0   | 0   | 0   | 0   | 186
Transformer   | 1.857 | 8.608  | 0   | 0   | 1   | 0   | 95
Wide and Deep | 0.231 | 0.742  | 0   | 0   | 0   | 0   | 4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository | Percentage of Repositories Having a Wiki (%)
Bert                     | 97.17
CNN                      | 98.498
LSTM                     | 98.799
NCF                      | 98.864
ResNet                   | 98.817
Transformer              | 96.97
Wide and Deep            | 100

Table 4.8: Descriptive statistics on percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.
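The per-model wiki percentage is a groupby-mean over the boolean `has_wiki` field in the repository metadata. A sketch with hypothetical flags:

```python
import pandas as pd

# Hypothetical has_wiki flags; GitHub metadata exposes this boolean per repository
repos = pd.DataFrame({
    "model": ["bert", "bert", "bert", "cnn", "cnn"],
    "has_wiki": [True, True, False, True, True],
})
# Mean of a boolean column is the fraction True; scale to a percentage
wiki_pct = repos.groupby("model")["has_wiki"].mean() * 100
print(wiki_pct["cnn"])  # 100.0
```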

4.4 Summary

In this chapter, we first used STAMPER to collect the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified several patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation of whether developers change the original codebase after forking.


[Figure 4.19: Open Issues vs Number of Repositories. Histograms of binned open_issues (0-100) against count of records, one panel per model.]


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmarks of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for an in-depth analysis of model construction and high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

A sampling problem exists at the same time. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation in the future (for example, users may publish their models in prototxt format). In this project, we only focused on deep learning models constructed using Python. The findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, since a search cannot exceed the 1000-repository boundary. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all repositories on GitHub. Other, more stratified samples might yield a more precise outcome.

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly: different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers and users to devise their own heuristics for data selection; experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or becoming a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media such as Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project, we investigated and examined the popularity of deep learning models via the number of repositories existing in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., K-Means) to high-resolution time series data from commits.
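As a rough illustration of what such a pipeline might look like (this is a sketch of the proposed future work, not an implemented part of STAMPER), K-Means can group repositories by the shape of their commit activity. The weekly commit counts below are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical weekly commit counts for six repositories: two steady,
# two rising and two declining activity profiles
weekly_commits = np.array([
    [12, 15, 14, 13],
    [11, 12, 15, 14],
    [1, 2, 9, 20],
    [0, 3, 11, 22],
    [20, 9, 3, 1],
    [18, 10, 2, 0],
])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(weekly_commits)
print(labels)  # repositories with similar activity profiles share a label
```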

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub deep learning related repositories and identified factors affecting those areas. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report on the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to help users gain deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models and datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models and datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015); Processor: 2.2 GHz Intel Core i7; Memory: 16 GB 1600 MHz DDR3; Graphics: Intel Iris Pro 1536 MB

Software

bull PyCharm

PyCharm 2019.1.3 (Professional Edition), Build PY-191.7479.30, built on May 30, 2019; Licensed to ANU / Xing Yu; JRE 11.0.2+9-b159.60 x86_64; JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.; macOS 10.14.6

bull Anaconda

- jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
- Git

Datasets

asc_general/: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_general/: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_by_star/: bert tensorflow.json, lstm tensorflow.json, wide deep tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, cnn tensorflow.json, ncf tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json

by_update_time/: bert tensorflow.json, lstm tensorflow.json, resnet tensorflow.json, wide deep tensorflow.json, cnn tensorflow.json, ncf tensorflow.json, transformer tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json

forked_timestamp/: bert tensorflow.csv, lstm tensorflow.csv, resnet tensorflow.csv, wide deep tensorflow.csv, cnn tensorflow.csv, ncf tensorflow.csv, transformer tensorflow.csv

74 Appendix 4 README

STAM

PER Mapping the landscape of deep

learning models use in the w

ild

cccccc

STAMPER

STAMPER

Table of Contents

Before You BeginPrerequisitesInstallRunningTestH

igh Level Description of all M

odules amp D

atasetsAuthorsLicense

STAM

PER

STAMPER

is python software tool for exam

ining real-world softw

are repositories related to deep learning models to

aid researchers in understanding past and current trends

Before You Begin

Before you begin we strongly recom

mend you dow

nload PyCharm and Anaconda before view

editexecute ourcodePyCharm

Anaconda

Amphetam

ine on the Mac App Store Keep M

ac awake w

ith this useful App (Otherw

ise it will disconnect

internet)

Prerequisites

Please make sure you have installed all of the follow

ing prerequisites on your development m

achine

Git - httpsgit-scm

comdow

nloadsG

it Authentication tokenPython 37 w

ith pip

Jupyter Notebook 600

All external libraries used listed in requirementstxt

Install

Install some dependencies M

ake sure you have the latest pip

Install a list of requirements specified in r

equirementstxt

Running

All the code scripts run from the root

1 Data Collection

Clone our project

Run python3 model_searcherpy

to get keyword related repositories m

etadata in GitH

ub in output

folderAn authentication key is required to get a higher request rateN

ote Request rate could be changed in function call_api

and call_api_dir

Then you need to run sh JSONFormattersh

in your terminal to w

ell-format your output data

Sample Case

In main()

change keywords

in terms of interest Then the resulting JSO

N file w

ill beoutputbertJSON

Customized sorting m

ethod in function get_total_pages

and request_ith_page

Could besort

updated

or stars

order

asc

or desc

2 Repository Search

Run python3 forks_time_stamp_getterpy

to get all your the forks timestam

p in forked_timestamp

3 Data Selection (O

ptional)

pip3 install --upgrade pip

1

pip3 install -r requirementstxt

1

Run python3 repository_filterpy

to get your code-related repositories with statistics in

filtered_repo

folder

Run python3 filtered_repopy

to filter your data

NoteYour keyw

ords could be customized in m

odel_keywordpy

We store all the previous experim

ent data in tensorflow_model_filtering

andpytorch_model_filtering

4 Data Visualization

Popularity

Run python3 visualizationspopularitypy

and get your graphs invisualizationsgraphspopularity

Maintenance

Run python3 visualizationsmaintenancepy

and get your graphs invisualizationsgraphsmaintenance

Contribution

Run python3 visualizationscontributionpy

and get your graphs invisualizationsgraphscontribution

Multi Correlations

Run python3 visualizationsmulti_variablepy

and get your graphs invisualizationsgraphsmulti_variable

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee your best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module will record all the unreachable links and write them into the file unreachable_urls.txt.

Usage: change elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.
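The reachability check in test.py could look something like the following sketch (an illustrative guess using only the standard library, not the module's actual code):

```python
# Illustrative guess at test.py's link check, using only the stdlib.
from urllib import error, request

def is_reachable(url, timeout=5):
    """True if the URL answers with a non-error HTTP status."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (error.URLError, ValueError):
        return False

def record_unreachable(urls, out_path="unreachable_urls.txt"):
    """Write every unreachable URL to out_path, one per line."""
    bad = [u for u in urls if not is_reachable(u)]
    with open(out_path, "w") as f:
        f.write("\n".join(bad))
    return bad
```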

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models).

In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation

Since you already got data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with parameters Model_name and Repository metadata subfolder. Then you can call this object with its relative data easily (from Model import bert and use bert as you go along).

Customize Keywords

In module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection
2. Repository Search
3. (Optional) Data Selection
4. Data Visualization

Altair is used to draw elegant graphs.

Experiment Datasets Collected

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py
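The name entropy_calculation.py suggests a Shannon-entropy measure over contributors' commit shares (the report's "collaboration entropy"). A plausible reconstruction, not the project's actual code:

```python
# Reconstruction of a collaboration-entropy measure: Shannon entropy
# (in bits) of the commit distribution across a repo's contributors.
import math

def contribution_entropy(commit_counts):
    total = sum(commit_counts)
    shares = [c / total for c in commit_counts if c > 0]
    return -sum(p * math.log2(p) for p in shares)

contribution_entropy([100])             # single author: 0.0
contribution_entropy([25, 25, 25, 25])  # even four-way split: 2.0
```

A repository dominated by one contributor scores near zero; an even split across n contributors scores log2(n).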

1. After Data Collection

output/
    asc_by_star/
        cnn tensorflow.json
        lstm tensorflow.json
    asc_general/
        bert.json
        cnn.json
        lstm.json
        ncf.json
        resnet.json
        transformer.json
        wide deep.json
    by_update_time/
        bert tensorflow.json
        cnn tensorflow.json
        lstm tensorflow.json
        ncf tensorflow.json
        resnet tensorflow.json
        transformer tensorflow.json
        wide deep tensorflow.json
    desc_by_star/
        bert tensorflow.json
        cnn tensorflow.json
        lstm tensorflow.json
        ncf tensorflow.json
        resnet tensorflow.json
        transformer tensorflow.json
        wide deep tensorflow.json
    desc_general/
        bert.json
        cnn.json
        lstm.json
        ncf.json
        resnet.json
        transformer.json
        wide deep.json
    pytorch_models/
        AlexNet.json
        DCGAN.json
        Densenet.json
        FCN-ResNet101.json
        GoogleNet.json
        HarDNet.json
        Inception_v3.json
        MobileNet v2.json
        PGAN.json
        ResNet.json
        ResNet101.json
        ResNext WSL.json
        ResNext.json
        RoBERTa.json
        SSD.json
        ShuffleNet v2.json
        SqueezeNet.json
        Tacotron 2.json
        Transformer.json
        U-Net pytorch.json
        U-Net.json
        WaveGlow.json
        Wide ResNet.json
        fairseq.json
        vgg_nets.json

2. After Repository Search

forked_timestamp/
    bert tensorflow.csv
    cnn tensorflow.csv
    lstm tensorflow.csv
    ncf tensorflow.csv
    resnet tensorflow.csv
    transformer tensorflow.csv
    wide deep tensorflow.csv


Generated Graphs

3. After Data Selection (Optional)

filtered_repo/
    bert.json

pytorch_model_filtering/
    Densenet.json
    FCN-ResNet101.json
    GoogleNet.json
    MobileNet v2.json
    ResNet101.json
    ResNext.json
    ShuffleNet v2.json
    SqueezeNet.json
    Tacotron 2.json
    Wide ResNet.json
    vgg_nets.json

tensorflow_model_filtering/
    bert.json
    lstm.json
    ncf.json
    resnet.json
    transformer.json
    wide deep.json


graphs/
    contribution/
        change_to_pdf.bash
        entropy_distribution.svg
        entropy_dots.svg
        lines_changed_boxs.svg
        lines_changed_hists.svg
        unique_percentage_distribution.svg
        uniqueness_chart.svg
    maintenance/
        devTime_boxplot.svg
        issues_distribution.svg
        wiki_yn.svg
    multi_variable/
        dev_t_to_open_issues.svg
        multi_correlation.svg
        star_to_contributors.svg
        star_to_dev_t.svg
        star_to_entropy.svg
        star_to_open_issues.svg
    popularity/
        accumulated_popularity.svg
        creation_repository_trend_total.svg
        creation_with_fork_timeline.svg
        fork_distribution.svg
        popularity_dot.svg
        popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (license: free for personal use only).

Bibliography

a. GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

b. GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

c. GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

d. GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)


Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)


Contents

Acknowledgments vii

Abstract ix

List of Abbreviations xi

1 Introduction 1
1.1 Trace Deep Learning use through GitHub 1
1.2 Contribution 2
1.3 Report Outline 2

2 Background and Related Work 3
2.1 Background 3
2.1.1 Deep learning 3
2.1.1.1 TensorFlow 4
2.1.1.2 PyTorch 4
2.1.2 Deep learning models 5
2.1.3 Summarized Timeline 7
2.2 Public Code Repositories 8
2.2.1 Web-based hosting service 8
2.2.2 Measuring Popularity From GitHub 8
2.2.3 Extracting Messy Data in the Wild 9
2.2.4 Visualizing data in Repositories 9
2.3 Summary 10

3 STAMPER Design and Implementation 11
3.1 Overview 11
3.2 Data Collection 12
3.3 Repository Search 13
3.4 Data Selection 14
Example 15
3.5 Construct the Visualizations 16
3.6 Summary 18

4 STAMPER in Action 19
4.1 Popularity of Deep Learning Models in GitHub 19
4.1.1 Popularity Feature Selection 19
4.1.2 Past and Current Status: A Full Integration 23
4.1.3 RQ1: How has the popularity of model changed over time? A closer look at the deep learning models 26
4.1.4 RQ2: How popularity varies per model 29
4.1.5 RQ3: Does the popularity of models relate to other features? 30
4.2 Contribution of Deep Learning Models in GitHub 34
4.2.1 Collaborative Contribution 34
4.2.2 RQ1: After forking, do developers change the codebase? 36
4.3 Maintenance of Deep Learning Models in GitHub 39
4.3.1 RQ1: How long has it been in existence? 39
4.3.2 RQ2: Do old models have more issues compared to new models? 41
4.3.3 RQ3: Are they well maintained? 42
4.4 Summary 42

5 Discussion And Future Work 45
5.1 Discussion 45
5.1.1 Data in the wild: Limitation and Improvement 45
5.1.2 Extensibility and Open-Source Software 45
5.2 Future Work 46
5.2.1 Social Network Analysis in GitHub 46
5.2.2 Trend Detection using Commitments Timestamp 46

6 Conclusion 47

7 Appendix 49
7.1 Appendix 1: Project Description 49
7.1.1 Project Title 49
7.1.2 Supervisors 49
7.1.3 Project Description 49
7.1.4 Learning Objectives 49
7.2 Appendix 2: Study Contract 49
7.3 Appendix 3: Artefact Description 52
7.3.1 Code Files Submitted 52
7.3.2 Program Testing 52
7.3.3 Experiment 52
Hardware 52
Softwares 52
Other 53
Datasets 53
7.4 Appendix 4: README 54

List of Figures

2.1 git2net [Gote et al., 2019] 10

3.1 Overview of STAMPER 11
3.2 Data Selection 14
3.3 Store in Local Disk 14
3.4 Overall: Construct the Visualizations 16
3.5 Examine Uniqueness after Forking 18

4.1 Repository Watching [Git b] 20
4.2 Star Sort Menu [Git a] 20
4.3 Popularity Metric 21
4.4 Repositories with Forks 24
4.5 Repositories without Forks 24
4.6 Repository Trend in GitHub For Each Model 25
4.7 Creation Time vs Stars 26
4.8 Number of Forks Related to Repositories in Deep Learning Model Development 28
4.9 Star vs Contributors 30
4.10 Star vs Development Time 31
4.11 Star vs Open Issues 31
4.12 Star vs Entropy Value 32
4.13 Collaboration Entropy 35
4.14 Percentage of Forked Repositories Unique From Origin (Boxplot) 36
4.15 Repository Uniqueness Distribution (%) 37
4.16 Repository Change Statistic 38
4.17 Development Time Boxplot 40
4.18 Development Time vs Number of Open Issues 41
4.19 Open Issues vs Number of Repository 43

List of Tables

2.1 Deep Learning History 4
2.2 Timeline 7

3.1 Repositories Related to Tensorflow 17

4.1 Popularity metric for repositories 21
4.2 Stars Comparison 29
4.3 Forks Comparison 29
4.4 Percentage of one-contributor development for DL related repositories 32
4.5 Sample Contributions to One Repository 34
4.6 Repository Development Time Stat 40
4.7 Repository Open Issue Statistics 41
4.8 Descriptive statistics on percentage of Wiki Existence 42

Chapter 1

Introduction

1.1 Trace Deep Learning use through GitHub

GitHub is one of the largest web-based hosting communities in the world, containing a rich source of data that facilitates different software engineering projects. This massive database allows developers and researchers to publish their work.

GitHub can also be used for popularity measurement. Developers can construct their social networks in GitHub in multiple ways, such as joining an organisation or starring a repository they are interested in. These features make repositories in GitHub easily accessible and make it an excellent place to conduct empirical studies.

The deep learning research community is continually evolving. Not limited to academic research, deep learning is allowing businesses to use data to teach computers how to learn; as a result, a number of machine learning and deep learning startups have come into existence worldwide.

The development of deep learning models sometimes suffers from software engineering problems. Studies on the quality of deep-learning-related projects are sparse; few researchers focus on their usage outside academia. With the expanding range and deepening degree of deep learning use, we would like to test whether developers are catching up with the latest deep learning upsurge in academia.

A tool to extract the metadata of historical repositories, with integrated visualisation, based on the vast corpus of GitHub repositories is not currently available. To fill this gap we present our tool, STAMPER. It is implemented in Python and supports GitHub repository metadata extraction based on repository keyword search and code segment search using the GitHub API. It can also analyse the modification of forked repositories from the original repository and capture the repository difference.

We study the historical trend of deep learning models and frameworks from repositories in GitHub. We further demonstrate how repository metadata can be used to see the deep learning trend. Our project provides a novel method to study deep learning frameworks and models from a historical aspect, and in the meanwhile our work creates a new aspect of empirical study in deep learning.


1.2 Contribution

• We introduce STAMPER, a Python tool that can be used to extract the metadata that characterize historical open source projects from GitHub based on researchers' interest.

• We utilize STAMPER in a case study analysing the usage of deep learning models in TensorFlow. We further demonstrate how we extract a rich set of features and establish the connections between those features and popularity.

• We create and mine a new dataset for further research use.

1.3 Report Outline

The rest of this report is structured as follows. Chapter 2 provides the timeline for deep learning frameworks and deep learning models; some background knowledge is presented there, and previous work on software mining, GitHub-related tools and visualizations is recorded in that chapter as well.

Chapter 3 provides an overview of our work and introduces our proposed methodology to extract the metadata from the GitHub repository.

Chapter 4 presents a case study in which we use our tool to extract deep learning related repositories from GitHub and trace the landscape of those hot deep learning models. The visualizations generated using our tool can aid researchers in understanding past and current trends easily.

Referring back to the critical questions posed, we conclude our project and highlight the next steps in Chapter 5.

Chapter 2

Background and Related Work

Software repository mining allows researchers to study the historical trends of software engineering practice effectively. Repository mining builds on the use of web hosting services, and multiple approaches exist to conduct such studies. In the first section we introduce some background knowledge on web-based hosting services; we then introduce some popular deep learning frameworks in Section 2.1.1. Finally, we detail in Section 2.2 some previous works which conduct software repository mining and identify the main patterns in exploring repository popularity.

2.1 Background

2.1.1 Deep learning

Different from traditional machine learning, deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [LeCun et al., 2015]. Methods in deep learning can dramatically improve human life in multiple aspects, from image classification and speech recognition to machine translation and even autonomous cars.

The usage of deep learning has grown tremendously in the past few years with the introduction and improvement of hardware availability (GPUs), big data, and cloud providers such as Amazon Web Services (AWS). Not limited to this, large companies create research teams to develop their deep learning algorithms and integrate them into frameworks such as PyTorch and TensorFlow. In the meanwhile, they share their datasets and model training tutorials, helping startup companies to build state-of-the-art products with minimal effort and time.

Though deep learning has become a hot topic in recent years, the history behind it has been evolving since the 1950s, as summarised in Table 2.1 [Subramanian, 2018].

Previously, developers and researchers required expertise in C++ or CUDA to implement deep learning models and algorithms. However, thanks to large technology companies and organisations, people with knowledge of a scripting language (Python) can now also build their deep learning algorithms.


Technique                      Year
Neural network                 1943
Backpropagation                1960s
Convolutional Neural Network   1979
Recurrent neural network       1980
Long Short-Term Memory         1997

Table 2.1: Deep Learning History

In Sections 2.1.1.1 and 2.1.1.2 we discuss some state-of-the-art deep learning frameworks used today, both in academia and industry.

2.1.1.1 TensorFlow

TensorFlow is a machine learning system that operates at large scale [Abadi et al., 2016]. Initially developed by the Google Brain team under the name DistBelief, it saw its first release in November 2015. TensorFlow then released its official 1.0.0 version on 11 February 2017, introducing the ability to run on multiple CPUs and GPUs.

Currently Google uses this framework in numerous ways to improve its search engine, translation, recommendation system, and image recognition and captioning. The TensorFlow library contains different APIs to build deep learning models at scale, such as ResNet, CNN or LSTM. The architecture of this framework gives developers the flexibility to experiment with new optimisations and Google's training algorithms. It currently supports numerous applications from front end to mobile; this flexibility and scalability make it one of the most popular deep learning frameworks in the world.

2.1.1.2 PyTorch

Based on the Torch library, PyTorch can be seen as a Python front end to the Torch engine, which provides the capability to define mathematical functions and compute gradients. Similar to TensorFlow, PyTorch has good GPU and CPU support and is primarily developed by Facebook's artificial intelligence research group. The official initial release of PyTorch was in October 2016; its dynamic computation capability gives it great flexibility in building complex architectures.

However, unlike TensorFlow, which can seamlessly integrate into real industrial applications, PyTorch was primarily developed by researchers and scientists, and in certain scenarios it is not easily adopted, or recommended, for production usage.


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and pick up the speed of project development to survive in this keen competition. Winning trust from the public with a service of high quality is thus required.

Initially we would have liked to conduct our research on the latest model stores, such as AWS SageMaker, Azure Machine Learning service, Wolfram Neural Net Repository and ONNX. However, the algorithms and datasets used there are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand with the public. To avoid those problems and gain a deeper insight into usage in society, we choose the framework with models of greater transparency and substantial usage: TensorFlow.

API Referenced

Since this project involves a range of models in deep learning, we began by developing an understanding of how to build and train neural networks. The construction of networks may rely on the layers API provided by Keras and TensorFlow. In the meanwhile, TensorFlow recently introduced estimator APIs to simplify the procedure of training, evaluation, prediction and export.

Convolutional Neural Network (CNN)

Convolutional Neural Network is one of the most established algorithms among all the deep learning models and one of the most dominant algorithms in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons in a grid pattern. Typically, a CNN consists of three types of layers: fully connected layers, convolution layers and pooling layers. Convolution and pooling layers conduct the feature extraction, and the fully connected layer maps the extracted features into the final output. Layers are interconnected, and thus the extracted features are transferred layer by layer.
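To make the pooling layer's role concrete, here is a toy 2×2 max-pooling step in plain Python (illustrative only; real CNNs use framework primitives such as TensorFlow's pooling ops):

```python
# Toy 2x2 max-pooling with stride 2 over a 2-D feature map: the
# downsampling step a CNN pooling layer applies after convolution.

def max_pool_2x2(fmap):
    rows, cols = len(fmap), len(fmap[0])
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
max_pool_2x2(feature_map)  # -> [[4, 2], [2, 8]]
```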

Long short-term memory (LSTM)

Different from traditional neural networks, which cannot memorise previous data, a special kind of recurrent neural network, Long Short-Term Memory (LSTM), provides researchers with an effective way to perform persistent data tasks and is capable of learning long-term dependencies [Hochreiter and Schmidhuber, 1997]. (A closely related simplification, the gated recurrent unit, combines the forget and input gates into a single update gate; a standard LSTM keeps them separate.) LSTM is capable of learning dependencies from historical data and making predictions from the information remembered previously. Inside an LSTM, instead of a single linear layer, there is a small network which performs these gating functions independently.

LSTM is one of the most common uses of recurrent neural networks. This model is generally used with sequential data and can solve language modelling problems such as NLP concepts (word embedding, encoders).

(Note: TensorFlow official models are chosen in our project. The TensorFlow official models contain a collection of deep learning models built with TensorFlow's high-level APIs: https://github.com/tensorflow/models/tree/master/official.)

Residual Network (ResNet)

One of the problems deep learning models face is that as the number of layers increases past a certain point, accuracy stops improving. Modern network architectures such as the residual network (ResNet) and Inception address this problem through residual connections.

ResNet solves the problem by fitting a residual mapping via a shortcut connection: each ResNet block contains a series of layers plus a shortcut connection component, so the block's output is the learned residual added to its input.
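The shortcut idea can be shown in miniature: the block computes f(x) and returns f(x) + x, so representing the identity mapping is trivial (a toy sketch, not an actual ResNet layer):

```python
# A residual connection in miniature: output = f(x) + x elementwise.
def residual_block(x, f):
    return [fi + xi for fi, xi in zip(f(x), x)]

double = lambda v: [2.0 * a for a in v]
residual_block([1.0, 2.0], double)  # -> [3.0, 6.0]
```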

Bidirectional Encoder Representations from Transformers (BERT)

BERT is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al., 2018].

It was first released in google-research/bert on GitHub on 1 November 2018, with the ability to handle a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications; it aims to predict the relationship between sentences by analysing the whole sentence holistically [Devlin et al., 2018].

Attention is all you need (Transformer)

Most problems in deep learning can be seen as a form of sequence-to-sequence mapping and can be solved using a common type of architecture, the encoder-decoder architecture.

The encoder and decoder both consist of identical layers, and each layer consists of two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network [Vaswani et al., 2017].

Neural Collaborative Filtering (NCF)

Neural Collaborative Filtering (NCF) is a neural network architecture which utilises the non-linearity of neural networks to build recommendation systems [He et al., 2017]. It demonstrates that matrix factorisation can be interpreted as a special case of neural collaborative filtering. To add additional non-linearity, this model introduces a multi-layer perceptron module alongside the generalised matrix factorisation (GMF) layer.

Wide and Deep Learning

Since linear models are not good at generalising across unique features, deep models are introduced to solve this problem. Deep models can use embedding vectors for every query, and this technique can then generalise by coupling the items and queries.

To overcome over-generalisation, the Google research team introduced the Wide & Deep Learning model: jointly trained wide linear models and deep neural networks, combining the benefits of memorisation and generalisation for recommender systems [Cheng et al., 2016].

2.1.3 Summarized Timeline

Model Name     Definition Raised Time
CNN            1980s
LSTM           1997
ResNet         2015
Wide & Deep    2016
NCF            2017
Transformer    2017
BERT           2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest open-source distributed version control systems (DVCS) [Gousios et al., 2014]. This distributed version control system enables contributors to submit a set of changes and integrate them in the main development branch. The use of git is based on pragmatic needs, combining the advantages of version control with collaborative development.

GitHub can lead to insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project; thus the number of stars can reveal popularity. From a software development research perspective, it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al. [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth using time series metadata derived from 2,279 accessible GitHub repositories. They found that slow growth is more common in the case of overpopulated application domains and for old repositories. Moreover, they conclude that the three most common domains on GitHub are web libraries and frameworks, and non-web libraries and frameworks.

Building on their work, we would like to examine whether deep learning framework repositories follow a similar popularity trend pattern. At the same time, we would like to study whether a relationship exists between three factors: forks, stars and watchers.

Predicting the Popularity of GitHub Repositories

In the same year, Borges et al. [2016a] published another paper on predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regressions to predict the number of stars of GitHub repositories, so that project owners can see how their projects are performing in the open-source community.

Their study examines the correlation between a repository's popularity and the number of stars it owns. The study also reports a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent

GHTorrent [Gousios and Spinellis 2012] is designed to conduct independent repository mining through a peer-to-peer BitTorrent protocol. It uses the REST API to navigate all repositories, and the resulting dataset can be used to answer multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving raw data from the results returned by the REST API. However, their tool cannot visualise the metadata or offer high-level trend analysis.

MetricMiner

A similar tool is MetricMiner [Sokol et al. 2013], a web application that supports researchers in mining software repositories, extracting data, and drawing statistical inferences from the collected data. The tool automatically clones the repository, processes the metadata, and stores the data in the cloud, which gives it good scalability and fast query answering without users having to install any software on their localhost.

GitcProc

GitcProc [Casalnuovo et al. 2017] is a tool that uses regular expressions to extract the changed lines within a repository, facilitating answers to project-evolution questions. GitcProc can retrieve and summarise global project statistics, including the number of commits, commit dates, and contributors. It can measure how many changes have taken place in Java projects and can also locate the changed files.

RepoVis

RepoVis [Feiner and Andrews 2018] is a newer tool that provides visual overviews of software maintained in Git repositories. RepoVis is a client-server web application that provides full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its search functionality to GitHub, combined with a code-based search. All visualisations are written out in SVG format.

2.2.4 Visualizing data in Repositories

Chronos

CHRONOS [Servant and Jones 2013] is a software tool that visualises historical changes inside software source code. It implements a zoomable user interface over a visualisation of the actual code, supporting developers from a high-level view (pattern recognition) down to a low-level view (reverse engineering) using a history-slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise their complete history of change, including the revisions that modified them. Inspired by this tool, our project uses this visualisation method to track the historical trend related to the keyword specified by users in GitHub.

Figure 2.1: git2net [Gote et al. 2019]

GEVOL

Collberg et al. [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph-drawing technique, deducing a better understanding of a program from its development history and displaying all visualisations with a temporal graph visualizer.

This system aids in discovering the structure of a system and gives the user a new way to explore the evolution of a program by visualising how the system changes. It extracts information about Java programs stored in a CVS version control system into three types of graphs: inheritance, control-flow, and call graphs.

git2net

git2net [Gote et al. 2019] is a software tool that facilitates extraction of the co-editing network in git repositories. Similar to GEVOL, it uses text-mining techniques to analyse the history of modifications within files. In addition, the authors address the importance of studying the social network in GitHub and give the reader a broader view of graph-based data analysis and modelling. Their tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter we detailed the web-based hosting service we selected (GitHub) and presented the concept of deep learning, two popular frameworks, and state-of-the-art neural network models. In the next chapter we elaborate on how we design and implement STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter we outline our design and implementation for data extraction, and then detail the metrics we use to estimate trends in deep learning framework and model usage on GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

[Figure 3.1 outlines the STAMPER pipeline: (1) data collection through the Git project search API, (2) repository search, and (3) an optional data selection step through the Git code search API driven by model-name keywords (keyword 1 … keyword n), followed by local data visualisation.]

Figure 3.1: Overview of STAMPER

Data Collection
We first collect all repository metrics through the GitHub API. This step extracts the history of all repositories related to the keyword and records the metadata for each repository (repository-based search).

Repository Search
As each repository contains multiple contributors and forks, we subsequently extract and analyse whether changes were made, based on size information, and calculate the collaboration factor (entropy) for those repositories. This process requires additional crawling and processing of fork information to create visual representations.

Data Selection
We implemented a selector that excludes repositories unrelated to the desired ones. The selector summarises the frequency counts for the keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis
Since each forked repository may involve re-development and modification, our project tracks modification of forked repositories. For every changed repository, the size difference is analysed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to analyse and manipulate the data in depth and even run statistical tests on the dataset. To better understand these metrics, we divide them into multiple categories. Attributes that are not primary data from the GitHub API are explained in the data expansion part and labelled as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering an OAuth2 token at the start of the program. After authentication, the user can make up to 5,000 requests per hour; otherwise, the rate limit allows only up to 60 requests per hour [Git d].
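As an illustration of this step, the sketch below builds authenticated request headers and reads the remaining quota from a rate-limit response. The helper names are our own, not STAMPER's actual code; only the token header format and the 60 vs. 5,000 requests/hour limits come from GitHub's documented behaviour.

```python
# Hypothetical helpers for OAuth2 token authentication against the GitHub REST API.

def auth_headers(token=None):
    """Build request headers; a token lifts the limit from 60 to 5,000 requests/hour."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = "token " + token
    return headers

def remaining_quota(rate_limit_json):
    """Read the remaining core-API requests from a GET /rate_limit JSON response."""
    return rate_limit_json["resources"]["core"]["remaining"]
```

A crawler can send these headers with each request to https://api.github.com and pause once the remaining quota approaches zero.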


Type          Meta-data
Contributor   contribution: int [Data Expansion]
              login (user name): String
              type (user/organization): String
              contributors_url
Repository    created_at
              description
              full_name
              language
              size
Popularity    fork: Boolean
              forks: int
              forks_url
              stargazers_count
              watchers_count
              unique_repos [Data Expansion]
Owner         id
              login (username)
              type
Maintenance   has_issues: Boolean
              has_wiki: Boolean
              open_issues: int
              pushed_at
              updated_at
              score

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deepcrawling operations to collect raw data

• Contribution
One repository generally involves multiple developers conducting software development, and the project owner is not necessarily the person who contributes the most code. The amounts contributed by different developers are potentially unequal. We therefore track this information using the GitHub API, recording the number of contributions each developer made to each repository.

• Unique_repos
Popular repositories may have numerous forked repositories owned by other users, and the reasons behind forking vary. Our research explores whether forks conduct subsequent development based on the original codebase. By comparing the size of each forked repository (Fi) with that of the original repository (O), we obtain all forked repositories together with their change in size (c):

Fi + c = O (31)
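A minimal sketch of this size comparison, with hypothetical helper names; sizes are assumed to be the repository size field returned by the GitHub API:

```python
def size_change(original_size, forked_size):
    """Change in size c such that forked_size + c = original_size (Equation 3.1)."""
    return original_size - forked_size

def changed_forks(original_size, fork_sizes):
    """Keep only the forks whose size differs from the original (c != 0)."""
    return [f for f in fork_sizes if size_change(original_size, f) != 0]
```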

3.4 Data Selection

[Figure 3.2 shows the data selection flow: API keywords for an entity (model) are searched within each repository and usage statistics are collected. Figure 3.3 shows the unfiltered data being filtered by model-related keywords (e.g. Bert, ResNet, CNN) grouped in model_keyword.py, then stored on the local disk together with forked-repository timestamps.]

Figure 3.2: Data Selection

Figure 3.3: Store in Local Disk

Figure 3.2 represents our method for searching API usage in DL model-related repositories. GitHub provides the REST API to examine whether a repository contains the targeted API.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. The appearance frequency of each user-specified API is embedded directly in the returned result and can be identified in our program by each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. It also allows users to build a high-level picture of API usage across GitHub-related repositories.
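The frequency counting can be sketched as follows; this is a simplified stand-in with our own function name for the real selector, which works on GitHub code-search results rather than raw file texts:

```python
def api_frequency(file_texts, keyword):
    """Count how often a user-specified API keyword appears across a repository's files."""
    return sum(text.count(keyword) for text in file_texts)
```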

We also provide multiple options for selecting APIs, stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the number of models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras applications library provides users the ability to load deep learning models and instantiate a model with default weights. Take ResNet as an example:

• With pre-defined models
Users may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py. This denotes that the repository owner may use a pre-trained model from the Keras library, and this model could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they can all be used as sample keywords in STAMPER, such as:

1. keras.applications.resnet.ResNet50
2. keras.applications.resnet.ResNet101
3. keras.applications.resnet.ResNet152
4. keras.applications.resnet_v2.ResNet50V2
5. keras.applications.resnet_v2.ResNet101V2
6. keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to further explore the components inside the ResNet class, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, data selection heuristics vary: deep learning users and experts can define their searches according to their own interests and preferences.

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from these three perspectives is illustrated in Figure 3.4. Chapter 5 gives an example using our collected repository metadata for deep learning models.

[Figure 3.4 shows entities 1 … n passing through a functional mapping that produces three groups of visualisations: contribution-related, popularity-related, and maintenance-related.]

Figure 3.4: Overall Construct the Visualizations

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped in weeks (with forks)

• Repository creation time vs. stars

Contribution

To additionally exploit the forking information, STAMPER supports comparison between an original repository and its forked repositories. The work could be further extended by visiting each forked repository's URL and tracing the commits.

As shown in Figure 3.5, an entity (E) we search in GitHub may have multiple related repositories (Ri) with their corresponding forked repositories (Fi). Among the forked repositories, we call a changed forked repository Ci.

To examine whether changes exist in forked repositories, and how this differs between entities, we calculate the difference using the equation given below.


Keyword                    Total of Repository (including Forks) Collected    Total of Original Repository Collected
ResNet tensorflow          6129                                               339
Bert tensorflow            13734                                              106
CNN tensorflow             39765                                              1000
LSTM tensorflow            19572                                              1000
Transformer tensorflow     7188                                               145
Wide and deep tensorflow   324                                                39

Table 3.1: Repositories Related to Tensorflow

Uniqueness percentage distributions are composed of all the percentages pi, each corresponding to its original repository Ri:

pi = Σ Ci / Σ Fi    (3.2)
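Equation 3.2 translates directly into code (hypothetical helper name):

```python
def uniqueness_percentage(changed_forks_count, total_forks_count):
    """p_i = sum(C_i) / sum(F_i): the fraction of a repository's forks that changed."""
    if total_forks_count == 0:
        return 0.0  # a repository without forks contributes no percentage
    return changed_forks_count / total_forks_count
```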

[Figure 3.5 shows an entity (E) with its repositories 1–4; one repository's forked repositories 1–n are each marked as changed (Y/N).]

Figure 3.5: Examine Uniqueness after Forking

• Percentage of Forked Repositories Unique from Origin (Boxplots)

• Uniqueness Percentage Distribution for Each Entity (Histograms)

• Entropy Distribution for Each Entity (Histograms)

Maintenance

• Development Time Boxplot for Each Entity

• Open Issues Distribution for Each Entity

3.6 Summary

In this chapter we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates scalable extraction of the original repositories, together with their forks, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. We also introduced and analysed two novel features of GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field, where models are continually evolving and being built, trained, and deployed by researchers. Our tool is available for analysing such changes: we collected the historical information stored in GitHub and extracted the repository metadata using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development has become highly competitive: researchers, companies, and developers are all trying to establish their voice in deep learning. A variety of models exist, but there is no common bridge connecting those ideas. Historical data in GitHub are opaque and hard to find, so many developers, especially experienced ones, remain in their comfort zone. With our study we hope to shed some light on deep learning and highlight a few suggestions for the public.

This section aims to answer questions about both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, because there are few studies about popularity in the GitHub system, there is no standardized feature to measure popularity. We analyse some potential features of each repository and hypothesise that popularity is strongly related to the number of stars a repository owns.

This decision will be justified in the following section with more background on GitHub.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching; watching does not make them collaborators [Git b]. A watcher receives notifications for new pull requests or issues, so the number of watchers indicates how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

Figure 4.2: Star Sort Menu [Git a]

• Stars
Starring a repository makes it easy for users to keep track of a repository they are interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks
Forks are created when a user would like to make a copy of a repository. The user can fork a repository to suggest changes or to use it as the basis for a new project.

Based on the data gathered in the collection stage, we extract the watchers_count, stargazers_count, and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics, and we draw a scatter plot as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star     forks_count    watchers_count    model name
17940    4661           17940             Bert
12405    3637           12405             Bert
5263     1056           5263              Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks, and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we may need to account for departures from normality.


Spearman Correlation Coefficient

Definition
The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman 2011]. This statistical method allows us to test for a rank-order relationship between two numerical or ordinal variables linked by a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork, and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

• Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

• Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Set α = 0.05. Since p1, p2, and p3 are all less than α, and from the calculation above, we find a strong positive correlation, with values coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875 respectively.

This means the likelihood that the testing data are uncorrelated is very small (95% confidence), so we can reject the hypothesis that those variables are uncorrelated.

In the rest of this report, we treat the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, appear to be the two most trending models. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Among models with shorter histories, BERT and ResNet are two rising stars in the model competition; they arrive with significant improvements in architecture design and performance, as described in the background section.

The model development community has recently seen the release of multiple powerful frameworks treated as baselines for building models. However, usage of many new models, such as the Wide and Deep model and the NCF model, has not grown in abundance.


[Figure 4.4 plots the accumulated number of repositories created, including forks, from 2015 to 2019 (0–40,000) for bert, cnn, lstm, ncf, resnet, transformer, and wide deep tensorflow.]

Figure 4.4: Repositories with Forks

[Figure 4.5 plots the accumulated number of repositories created, excluding forks, from 2015 to 2019 (0–3,000) for the same seven models.]

Figure 4.5: Repositories without Forks


[Figure 4.6 shows small-multiple histograms of repository creations per model from October 2015 to October 2019 for bert, cnn, lstm, ncf, resnet, transformer, and wide deep tensorflow.]

Figure 4.6: Repository Trend in GitHub For Each Model


[Figure 4.7 scatters repository creation time (October 2015 – July 2019) against number of stars (up to ~15,000) for the seven models.]

Figure 4.7: Creation Time vs Stars

A fork is another copy of a repository; a forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories created with forks and without forks. Most repositories related to deep learning models are not original, which indicates that a considerable number of developers remain in a studying stage.

At the same time, we use this dataset to answer several research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, unlike the previous summarizing method, we study the popularity of deep learning models over time.

CNN and LSTM's dominant role

As the comparison above shows, CNN and LSTM are clearly the winners in the GitHub community, with the highest average number of stars and the most repositories created. The data bear this out: in 2017 the number of created repositories gradually increased for CNN, and in 2018–2019 the creation trend continued rising to a higher level, which persists today.

What accounts for this tremendous difference in usage? CNN and LSTM currently have among the most important and largest communities in the deep learning field; these networks are essential to both computer vision and NLP, and they hold an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, usage of ResNet and Transformer has improved significantly in the last two years. Differing from earlier structures such as CNN, both modify the original structure and significantly improve results in computer vision and translation tasks.

Rising star: BERT

However, no model comes with perfection: LSTM itself can be extended into many variants, and BERT is one of those. The current trends depicted in the graph suggest that deep learning models are proliferating fast through innovative developments, and there is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with when the model came into existence, but our data tell a different story.

Although its paper was published in 2016, NCF draws the least attention in the GitHub community. This also shows there is no relationship between popularity (i.e. stars) and creation time: past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide and Deep model is similar: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, previous data also confirm there is no significant rise in its use.

[Figure 4.8 shows fork-count distribution histograms (forks_count binned 0–1000, counts 0–800) for each of the seven models.]

Figure 4.8: Number of Forks Related to Repositories in Deep learning Model Development


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see the following.

Model Name      Mean     STD       Min   25%   50%   75%   Max
Bert            498.65   2196.3    0     1     8     43    17940
CNN             106.84   611.97    2     3     8     32    13882
LSTM            48.82    214.22    0     1     2     13    2703
NCF             77.00    129.91    1     2     3     115   227
ResNet          46.88    221.43    0     0     1     8     2980
Transformer     186.79   1155.87   0     0     4     21    12408
Wide and Deep   16.23    36.80     0     0     1     8     146

Table 4.2: Stars Comparison

Model Name      Mean     STD      Min   25%   50%   75%   Max
Bert            128.21   585.93   0.0   0.0   1.0   16.5  4661.0
CNN             40.71    252.71   0.0   1.0   4.0   14.0  6274.0
LSTM            17.79    71.96    0.0   0.0   1.0   5.0   968.0
NCF             34.33    58.60    0.0   0.5   1.0   51.5  102.0
ResNet          17.44    93.75    0.0   0.0   0.0   3.0   1442.0
Transformer     53.52    336.10   0.0   0.0   1.0   6.0   3637.0
Wide and Deep   7.28     16.36    0.0   0.0   0.0   2.5   71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52), and NCF (34.33).

The three models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88), and LSTM (48.82). The three models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79), and ResNet (17.44).

Kruskal-Wallis Test: the Kruskal-Wallis test is used in this project to compare more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The seven models' distributions are the same.

• H1: The seven models' distributions are different.

from scipy.stats import kruskal
stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                  dfNcf["star"].tolist(), dfResnet["star"].tolist(), dfTransformer["star"].tolist(),
                  dfWideDeep["star"].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories in GitHub. This unexpected result can be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or BERT model may require a large amount of time and effort, even though developers show interest in those novel deep learning models.

[Figure 4.9 scatters stargazers_count (0–18,000) against number_of_contributors (0–30) for the seven models.]

Figure 4.9: Star vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.


[Figure 4.10 scatters stargazers_count (0–18,000) against develop_duration (0–2,000) for the seven models.]

Figure 4.10: Star vs Development Time

[Figure 4.11 scatters stargazers_count (0–18,000) against open_issues (0–1,100) for the seven models.]

Figure 4.11: Star vs Open Issues

[Figure 4.12 scatters stargazers_count (0–18,000) against entropy values (0.0–2.8) for the seven models.]

Figure 4.12: Star vs Entropy Value

Number of Contributors
From Figure 4.9 and a Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Among all repositories, the top-3 repositories with the most stars per contributor come from the models CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor), and Bert (155.0 stars/contributor).

Model           Percentage of One-Contributor Development (%)
Bert            74.53
CNN             83.3
LSTM            85.9
NCF             100
ResNet          90.26
Transformer     81.20
Wide and Deep   89.74

Table 4.4: Percentage of one-contributor development for DL related repositories


Development Time
From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated to development time (ρ = 0.7341 with p-value = 0). This suggests that the longer a model has been in development, the more stars it will have (i.e. the model becomes more popular). The top-2 repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues
From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated to open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can make the assumption that the more popular a repository becomes, the more issues it will have. We will further investigate this correlation in the following section.

Entropy
From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated to entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of a new feature (the entropy value) on the popularity of GitHub repositories.
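The rank correlations reported in this section can be reproduced along the following lines. This is a minimal pure-Python sketch of Spearman's ρ (assuming no tied values), not the statistical package actually used for the tests in this report; the star and contributor values are illustrative toy data.

```python
# Sketch: Spearman rank correlation between stars and another repository
# feature, computed as the Pearson correlation of the two rank vectors.

def _ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman_rho(x, y):
    """Spearman's rho for two equal-length samples (assumes no ties)."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Toy data: stars vs. contributor counts (perfectly monotonic -> rho = 1.0)
stars = [10, 50, 200, 800, 1500]
contributors = [1, 2, 3, 10, 12]
rho = spearman_rho(stars, contributors)
```

Because Spearman's ρ depends only on ranks, it is robust to the heavy-tailed star counts seen on GitHub, which is why it is used throughout this chapter.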

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Since software development may involve multiple developers, and each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether their contributions are even or not.

Entropy
In particular, we compute the entropy H of each repository, defined as

p_i = c_i / Σ_i c_i        (4.1)

H = − Σ_i p_i log₂(p_i)    (4.2)

Here i indexes the i-th contributor, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example: its contribution table is summarized in Table 4.5, and its corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214    (4.3)

p₁ = 174/214,  p₂ = 36/214,  p₃ = 4/214    (4.4)

H(repository) = −(174/214 · log₂(174/214) + 36/214 · log₂(36/214) + 4/214 · log₂(4/214)) ≈ 0.80133    (4.5)

name           contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample Contributions to One Repository
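The calculation in Equations (4.1)–(4.2) can be sketched in a few lines of Python. This is a minimal illustration, not STAMPER's actual implementation:

```python
import math

def collaboration_entropy(contributions):
    """Entropy H of a repository's contribution distribution (Eq. 4.1-4.2)."""
    total = sum(contributions)
    probs = [c / total for c in contributions]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Contribution counts from Table 4.5
h = collaboration_entropy([174, 36, 4])
```

A single-contributor repository gives H = 0, and k contributors with perfectly even contributions give the maximum H = log₂(k).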

The resulting distribution of entropy over all the repositories can be used to determine whether a repository is developed unevenly: the lower the entropy, the higher the separation, which results in more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy value for all models. From these figures we can see that most of the repositories have an entropy value of around zero, which means that deep learning related repositories are developed either mostly by one developer or by a team with an uneven allocation of work.


[Figure: histograms of entropy value (binned 0.00–3.00), count of records per model]

Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models studied.

[Figure: boxplot of unique_percent (0–100) per model]

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was to provide a summarized view of what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15. Most of the model repositories are not changed after forking. From this more detailed analysis we can see at a glance that changes are rarely made after forking.


[Figure: histograms of uniqueness percentage (binned 0.00–1.00), count of records per model]

Figure 4.15: Repository Uniqueness Distribution (%)

[Figure: histograms of mean lines changed relative to origin (binned −2500 to 2500), count of records per model]

Figure 4.16: Repository Change Statistic


Moreover, most changed forks differ from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them of less concern to people. Second, the model itself may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.
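The "no change after forking" percentages behind Figures 4.14–4.16 can be computed along these lines. This is an illustrative sketch; the size-difference values are hypothetical, not STAMPER's actual data schema:

```python
def percent_unchanged(size_diffs):
    """Percentage of forks whose size difference from the origin is zero."""
    unchanged = sum(1 for d in size_diffs if d == 0)
    return 100.0 * unchanged / len(size_diffs)

# Hypothetical byte-size differences between forks and their origin
diffs = [0, 0, 0, 42, 0, 130]
pct = percent_unchanged(diffs)  # -> 66.66...
```

The same counting, applied per model, yields the per-model uniqueness distributions plotted above.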

4.3 Maintenance of Deep Learning Models in GitHub

In this section, the problems of software maintenance in these deep learning related repositories are surveyed. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
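Applied to the ISO-8601 timestamps that the GitHub API returns, Equation (4.6) might look like the following sketch (the timestamp values are hypothetical examples):

```python
from datetime import datetime

def repo_age_days(created_at, updated_at):
    """Repository age in days from GitHub-style ISO-8601 timestamps (Eq. 4.6)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    delta = datetime.strptime(updated_at, fmt) - datetime.strptime(created_at, fmt)
    return delta.days

age = repo_age_days("2018-10-31T00:00:00Z", "2019-02-18T00:00:00Z")  # -> 110
```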

Figure 4.17 and Table 4.6 show how development time varies depending on the model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide & Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). Therefore we hypothesize that many of the earlier models started using the open-source web community immediately after their first release.


Model        Max of days  Q3 of days  Median of days  Q1 of days  Min of days
Bert         779          229         110             32          0
Transformer  1254         321         142             11          0
Wide deep    1107         575         117             0.5         0
ResNet       1360         456.5       120             15          0
NCF          1120         476         216             8           0
LSTM         1812         621.25      315.5           47.25       0
CNN          1385         699.25      483             270.25      0

Table 4.6: Repository Development Time Statistics
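The Kruskal-Wallis test applied above compares the rank distributions of development time across models. A minimal sketch of the H statistic (without tie correction) is shown below; this is not the statistical package used in this report, and the two development-time samples are hypothetical:

```python
def kruskal_h(groups):
    """Kruskal-Wallis H statistic over several samples (no tie correction)."""
    data = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(data)
    rank_sums = [0.0] * len(groups)
    for rank, (_, gi) in enumerate(data, start=1):
        rank_sums[gi] += rank
    return 12.0 / (n * (n + 1)) * sum(
        rs ** 2 / len(g) for rs, g in zip(rank_sums, groups)) - 3 * (n + 1)

# Hypothetical development-time samples (days) for two models
h = kruskal_h([[110, 229, 32], [483, 699, 270]])
```

Larger H values indicate that at least one group's distribution differs; the p-value is then obtained from a chi-squared distribution with (number of groups − 1) degrees of freedom.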

[Figure: boxplot of development time in days (0–2000) per model]

Figure 4.17: Development Time Boxplot


[Figure: scatter plot of open_issues against develop_duration, one series per model]

Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and the Spearman correlation test, there is a weak correlation between those two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, which are costlier to maintain, may have more users and hence more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model        mean   Std     25%  50%  75%  min  max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository  Percentage of repositories having Wiki (%)
Bert         97.17
CNN          98.498
LSTM         98.799
NCF          98.864
ResNet       98.817
Transformer  96.97
Wide deep    100

Table 4.8: Descriptive statistics on percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e. they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this chapter we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original codebase after forking.


[Figure: histograms of open_issues (binned 0–100), count of records per model]

Figure 4.19: Open Issues vs Number of Repository


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitation and Improvement

Pinpointing the landmarks of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for in-depth analysis of the construction of models and high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

A sampling problem exists at the same time. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation (for example, users may publish their models in the prototxt format). In our project we only focused on deep learning models constructed using Python. The findings may also reflect sampling problems: the present experiment uses a limited number of repositories on GitHub, which cannot exceed the 1000-result boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples may produce a more precise outcome.

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real life, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to develop their own heuristics for data selection: experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a plugin for GitHub, allowing researchers



and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media such as Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models via the number of repositories existing in GitHub. It is very likely that the commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g. k-means) to high-resolution time series data from commits.

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub's deep learning related repositories and identified factors affecting those domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated this program is my own original work

7.3.1 Code Files Submitted

0. Configuration and Setup:
   __init__.py, setup.py, Model.py, model_keyword.py, test.py,
   JSONFormatter.sh, change_to_pdf.sh

1. Data Collection:
   model_searcher.py, item_filter.py

2. Repository Search:
   forks_time_stamp_getter.py

3. (Optional) Data Selection:
   repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair):
   contribution_stat.py, entropy_calculation.py,
   Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

  PyCharm 2019.1.3 (Professional Edition)
  Build PY-191.7479.30, built on May 30, 2019
  Licensed to ANU / Xing Yu
  JRE: 11.0.2+9-b159.60 x86_64
  JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
  macOS 10.14.6

• Anaconda

  – jupyter-notebook 6.0.0

Other

- Python 3.7.4
- pandas==0.22.0
- numpy==1.14.0
- statistics==1.0.3.5
- ratelimit==2.2.1
- requests
- altair
- matplotlib==2.2.2
- selenium
- Git

Datasets

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

# STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

Before You BeginPrerequisitesInstallRunningTestH

igh Level Description of all M

odules amp D

atasetsAuthorsLicense

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

## Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view/edit/execute our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keep your Mac awake with this useful app (otherwise the internet connection will drop during long runs)

## Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

## Install

Install the dependencies. Make sure you have the latest pip, then install the requirements specified in requirements.txt:

    pip3 install --upgrade pip
    pip3 install -r requirements.txt

## Running

All the code scripts run from the root.

### 1. Data Collection

Clone our project, then run `python3 model_searcher.py` to get keyword-related repository metadata from GitHub in the `output` folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions `call_api` and `call_api_dir`.

Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample case: in `main()`, change `keywords` to the terms of interest. The resulting JSON file will be `output/bert.JSON`.

Customize the sorting method in the functions `get_total_pages` and `request_ith_page`: `sort` can be `updated` or `stars`; `order` can be `asc` or `desc`.
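For reference, a search request in the style described above can be assembled as follows. This is a hedged sketch: the endpoint and parameters follow GitHub's public REST v3 search API, not STAMPER's exact code, and the keyword is a placeholder.

```python
# Sketch: build one page of a GitHub repository search, mirroring the
# sort/order options described above.

def build_search_request(keyword, sort="stars", order="desc", page=1):
    """Return (url, params) for one page of repository search results."""
    url = "https://api.github.com/search/repositories"
    params = {"q": keyword, "sort": sort, "order": order,
              "per_page": 100, "page": page}
    return url, params

# Usage (requires the `requests` library and a personal access token):
# import requests
# url, params = build_search_request("bert tensorflow")
# items = requests.get(url, params=params,
#                      headers={"Authorization": "token <TOKEN>"}).json()["items"]
```

Note that the Search API caps results at 1000 per query, which is why different sorting strategies are combined to widen coverage.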

### 2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the fork timestamps in `forked_timestamp`.

### 3. Data Selection (Optional)

Run `python3 repository_filter.py` to get your code-related repositories with statistics in the `filtered_repo` folder.

Run `python3 filtered_repo.py` to filter your data.

Note: your keywords can be customized in `model_keyword.py`. We store all the previous experiment data in `tensorflow_model_filtering` and `pytorch_model_filtering`.

### 4. Data Visualization

- Popularity: run `python3 visualizations/popularity.py` and get your graphs in `visualizations/graphs/popularity`.
- Maintenance: run `python3 visualizations/maintenance.py` and get your graphs in `visualizations/graphs/maintenance`.
- Contribution: run `python3 visualizations/contribution.py` and get your graphs in `visualizations/graphs/contribution`.
- Multi correlations: run `python3 visualizations/multi_variable.py` and get your graphs in `visualizations/graphs/multi_variable`.

## Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in `test.py`. This module records all the unreachable links and writes them into the file `unreachable_urls.txt`.

Usage: change the elements in `keywords`, then run `python3 test.py`. All the unreachable links will be written to `unreachable_urls.txt`.

## Customizing Your Own Search

In the module `Model.py`, define your own entity lists (e.g. `tensorflow_models`). In the constructor `Model`, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

### Instantiation

Once you have data from the previous steps (1-2), you can construct a model by calling the constructor `Model`, e.g. `bert = Model("bert tensorflow", "desc_by_star")`, with parameters: model name and repository-metadata subfolder. Then you can use this object with its related data easily (`from Model import bert` and use `bert` as you go along).

### Customize Keywords

In the module `model_keyword.py`, import your instantiation (e.g. `lstm`) and call `add_keywords`, e.g.:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

## High Level Description of all Modules & Datasets

1. Data Collection: `model_searcher.py`, `item_filter.py`
2. Repository Search: `forks_time_stamp_getter.py`
3. (Optional) Data Selection: `repository_filter.py`, `filtered_repo.py`
4. Data Visualization: `contribution_stat.py`, `entropy_calculation.py`, `Analysis/contribution_related.py`, `Analysis/meta_data.py` (Altair is used to draw elegant graphs)

## Experiment Datasets Collected

### 1. After Data Collection

output/
  asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
  asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

### 2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv


### Generated Graphs

### 3. After Data Selection (Optional)

filtered_repo/
  bert.json
  pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
  tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json


graphs/
  contribution/: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
  maintenance/: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
  multi_variable/: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
  popularity/: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg


## Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

## License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only).

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22 (cited on pages xv and 20).

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22 (cited on pages xv, 19 and 20).

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22 (cited on page 20).

[d] GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22 (cited on page 12).

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (cited on page 4).

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM (cited on page 8).

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE (cited on pages 8 and 19).

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. Gitcproc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM (cited on page 9).

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM (cited on page 7).

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM (cited on page 10).

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22).

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018) (cited on page 6).

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE (cited on page 9).

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press (cited on pages xv and 10).

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM (cited on page 8).

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE (cited on page 9).

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee (cited on page 6).

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780 (cited on page 5).

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436 (cited on page 3).

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE (cited on page 9).

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE (cited on page 9).

BIBLIOGRAPHY 61

Subramanian V 2018 Deep Learning with PyTorch A practical approach to buildingneural network models using PyTorch Packt Publishing Ltd (cited on page 3)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
    • Deep learning models
    • Summarized Timeline
  • Public Code Repositories
    • Web-based hosting service
    • Measuring Popularity From GitHub
    • Extracting Messy Data in the Wild
    • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of model changed over time? A closer look at the deep learning models
    • RQ2: How popularity varies per model
    • RQ3: Does the popularity of models relate to other features
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commitments Timestamp
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Softwares
      • Other
      • Datasets
  • Appendix 4: README

Contents

4.1.3 RQ1: How has the popularity of model changed over time? A closer look at the deep learning models 26
4.1.4 RQ2: How popularity varies per model 29
4.1.5 RQ3: Does the popularity of models relate to other features 30

4.2 Contribution of Deep Learning Models in GitHub 34
4.2.1 Collaborative Contribution 34
4.2.2 RQ1: After forking, do developers change the codebase? 36

4.3 Maintenance of Deep Learning Models in GitHub 39
4.3.1 RQ1: How long has it been in existence? 39
4.3.2 RQ2: Do old models have more issues compared to new models? 41
4.3.3 RQ3: Are they well maintained? 42

4.4 Summary 42

5 Discussion And Future Work 45
5.1 Discussion 45
5.1.1 Data in the wild: Limitation and Improvement 45
5.1.2 Extensibility and Open-Source Software 45
5.2 Future Work 46
5.2.1 Social Network Analysis in GitHub 46
5.2.2 Trend Detection using Commitments Timestamp 46

6 Conclusion 47

7 Appendix 49
7.1 Appendix 1: Project Description 49
7.1.1 Project Title 49
7.1.2 Supervisors 49
7.1.3 Project Description 49
7.1.4 Learning Objectives 49
7.2 Appendix 2: Study Contract 49
7.3 Appendix 3: Artefact Description 52
7.3.1 Code Files Submitted 52
7.3.2 Program Testing 52
7.3.3 Experiment 52
Hardware 52
Softwares 52
Other 53
Datasets 53
7.4 Appendix 4: README 54

List of Figures

2.1 git2net [Gote et al., 2019] 10

3.1 Overview of STAMPER 11
3.2 Data Selection 14
3.3 Store in Local Disk 14
3.4 Overall Construct the Visualizations 16
3.5 Examine Uniqueness after Forking 18

4.1 Repository Watching [Git b] 20
4.2 Star Sort Menu [Git a] 20
4.3 Popularity Metric 21
4.4 Repositories with Forks 24
4.5 Repositories without Forks 24
4.6 Repository Trend in GitHub For Each Model 25
4.7 Creation Time vs Stars 26
4.8 Number of Forks Related to Repositories in Deep Learning Model Development 28
4.9 Star vs Contributors 30
4.10 Star vs Development Time 31
4.11 Star vs Open Issues 31
4.12 Star vs Entropy Value 32
4.13 Collaboration Entropy 35
4.14 Percentage of Forked Repositories Unique From Origin (Boxplot) 36
4.15 Repository Uniqueness Distribution () 37
4.16 Repository Change Statistic 38
4.17 Development Time Boxplot 40
4.18 Development Time vs Number of Open Issues 41
4.19 Open Issues vs Number of Repository 43

List of Tables

2.1 Deep Learning History 4
2.2 Timeline 7

3.1 Repositories Related to Tensorflow 17

4.1 Popularity metric for repositories 21
4.2 Stars Comparison 29
4.3 Forks Comparison 29
4.4 Percentage of one contributor development for DL related repositories 32
4.5 Sample Contributions to One Repository 34
4.6 Repository Development Time Stat 40
4.7 Repository Open Issue Statistics 41
4.8 Descriptive statistics on percentage of Wiki Existence 42

Chapter 1

Introduction

1.1 Trace Deep Learning use through GitHub

GitHub is one of the largest web-based hosting communities in the world, and it contains a rich source of data facilitating different software engineering projects. This massive database allows developers and researchers to publish their work.

GitHub can also be used for popularity measurement. Developers can construct their social networks on GitHub in multiple ways, such as joining an organisation or starring a repository they are interested in. These features make repositories on GitHub easily accessible and make it the best place to conduct empirical studies.

The deep learning research community is continually evolving. Not limited to academic research, deep learning is allowing businesses to use data to teach computers how to learn. As a result, a number of machine learning and deep learning startups have come into existence worldwide.

The development of deep learning models sometimes suffers from software engineering problems. Studies on the quality of deep-learning-related projects are sparse, and few researchers focus on their usage outside academia. With the expanding range and deepening degree of deep learning use, we would like to test whether developers keep up with the latest deep learning upsurge in academia.

A tool to extract the metadata of historical repositories, with integrated visualisation, based on the vast corpus of GitHub repositories is not currently available. To fill this gap, we present our tool STAMPER. It is implemented in Python and supports GitHub repository metadata extraction based on repository keyword search and code segment search using the GitHub API. It can also analyse the modification of forked repositories from the original repository and capture the repository difference.

We study the historical trend of deep learning models and frameworks from repositories in GitHub. We further demonstrate how repository metadata can be used to see the deep learning trend. Our project provides a novel method to study deep learning frameworks and models from a historical aspect, and in the meanwhile our work opens a new aspect of empirical study in deep learning.



1.2 Contribution

• We introduce STAMPER, a Python tool that can be used to extract the metadata that characterize historical open source projects from GitHub, based on researchers' interest.

• We utilize STAMPER in a case study analysing the usage of deep learning models in TensorFlow. We further demonstrate how we extract a rich set of features and establish the connections between those features and popularity.

• We create and mine a new dataset for further research use.

1.3 Report Outline

The rest of this report is structured as follows. Chapter 2 provides the timeline for deep learning frameworks and deep learning models. Some background knowledge is presented in that chapter, and previous work on software mining tools and GitHub visualizations is recorded there as well.

Chapter 3 provides an overview of our work and introduces our proposed methodology to extract the metadata from the GitHub repository.

Chapter 4 presents a case study in which we use our tool to extract deep-learning-related repositories from GitHub and trace the landscape of those hot deep learning models. The visualizations generated using our tool can aid researchers in understanding past and current trends easily.

Referring back to the critical questions posed, we conclude our project and highlight the next steps in Chapter 5.

Chapter 2

Background and Related Work

Software repository mining enables researchers to study the historical trend of software engineering practice effectively. The use of repository mining is based on the use of web hosting services. There exist multiple approaches to conduct such studies; in the first section we will introduce some background knowledge on web-based hosting services. Then we will introduce some popular deep learning frameworks in Section 2.1.1. Finally, we will detail some previous works in Section 2.2, which conduct software repository mining and identify the main patterns in exploring repository popularity.

2.1 Background

2.1.1 Deep learning

Different from traditional machine learning, deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [LeCun et al., 2015]. Methods in deep learning can dramatically improve human life in multiple aspects, from image classification and speech recognition to machine translation, and even autonomous cars.

The usage of deep learning has grown tremendously in the past few years with the introduction and improvement of hardware availability (GPUs), big data, and cloud providers such as Amazon Web Services (AWS). Not limited to this, large companies create their own research teams to develop deep learning algorithms and integrate them into frameworks such as PyTorch and TensorFlow. In the meanwhile, they share their datasets and model training tutorials, helping startup companies build state-of-the-art products with minimal effort and time.

Though deep learning has become a hot topic in recent years, the history behind deep learning has been evolving since the 1950s, as summarised in Table 2.1 [Subramanian, 2018].

Previously, developers and researchers required expertise in C++ or CUDA to implement deep learning models and algorithms. However, thanks to large technology companies and organisations, people with knowledge of a scripting language (Python) can also build their own deep learning algorithms.



Technique                      Year
Neural network                 1943
Backpropagation                1960s
Convolutional Neural Network   1979
Recurrent neural network       1980
Long Short-Term Memory         1997

Table 2.1: Deep Learning History

In Section 2.1.1.1 and Section 2.1.1.2 we will talk about some state-of-the-art deep learning frameworks used today, both in academia and industry.

2.1.1.1 TensorFlow

TensorFlow is a machine learning system that operates at large scale [Abadi et al., 2016]. Initially developed by the Google Brain team under the name DistBelief, it saw its first public release in November 2015. TensorFlow then released its official 1.0.0 version on 11 February 2017, introducing the ability to run on multiple CPUs and GPUs.

Currently, Google uses this framework in numerous ways to improve its search engine, translation process, recommendation system, and image recognition and captioning. The TensorFlow library contains different APIs to build deep learning models at scale, such as ResNet, CNN, or LSTM. The architecture of this framework gives developers the flexibility to experiment with new optimisations and Google's training algorithms. It currently supports numerous applications from front end to mobile; this flexibility and scalability make it one of the most popular deep learning frameworks in the world.

2.1.1.2 PyTorch

Based on the Torch library, PyTorch can be seen as a Python front end to the Torch engine, which provides the capability to define mathematical functions and compute gradients. Similar to TensorFlow, PyTorch has good GPU and CPU support and is primarily developed by Facebook's artificial intelligence research group. The official initial release of PyTorch was in October 2016; its dynamic computation power gives it greater flexibility in building complex architectures.

However, unlike TensorFlow, which can seamlessly integrate into real industrial applications, PyTorch is primarily developed by researchers and scientists, and it is not easy or recommended to use in production in certain scenarios.


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and pick up the speed of project development to survive in this keen competition. Winning trust from the public with a service of high quality is thus required.

Initially, we would like to conduct our research on the latest model stores, such as AWS SageMaker, Azure Machine Learning service, Wolfram Neural Net Repository, and ONNX. However, the algorithms and datasets used there are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand with the public. To avoid those problems and gain a more in-depth insight into usage in society, we choose the framework with models of greater transparency and substantial usage: TensorFlow.

API Referenced

Since this project involves a range of models in deep learning, we begin by developing an understanding of how to build and train neural networks. The construction of networks may rely on layer APIs provided by Keras and TensorFlow. In the meanwhile, TensorFlow recently introduced estimator APIs to simplify the procedures of training, evaluation, prediction, and export.

Convolutional Neural Network (CNN)
Convolutional Neural Network is one of the most established algorithms among all the deep learning models and one of the most dominant algorithms in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons in a grid pattern. Typically, a CNN consists of three types of layers: fully connected layers, convolution layers, and pooling layers. Convolution and pooling layers conduct the feature extraction, and the fully connected layer maps the extracted features into the final output. Layers are interconnected, and thus the extracted features are transferred layer by layer.
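To make the feature-extraction role of convolution and pooling layers concrete, the following toy sketch (plain Python, not how real CNN libraries are implemented, and with function names of our own choosing) slides a small edge-detecting kernel over an image patch and then max-pools the resulting feature map:

```python
def conv2d_valid(image, kernel):
    """'Valid' 2D convolution (strictly, cross-correlation, as in most
    deep learning libraries): slide the kernel over the image and sum
    the element-wise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep only the strongest activation
    in each size x size block of the feature map."""
    return [[max(feature_map[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]

# A 5x5 image with a vertical edge in the middle, and a 2x2 kernel
# that responds strongly where pixel intensity jumps left-to-right.
image = [[0, 0, 1, 1, 1]] * 5
kernel = [[-1, 1], [-1, 1]]
print(max_pool(conv2d_valid(image, kernel)))  # [[2, 0], [2, 0]]
```

The pooled output keeps a strong response only in the blocks containing the edge, illustrating how successive convolution and pooling layers distil an image into compact features.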

Long short-term memory (LSTM)
Different from traditional neural networks, which cannot memorise previous data, a special kind of recurrent neural network, Long Short-Term Memory (LSTM), provides researchers with an effective way to perform persistent data tasks and is capable of learning long-term dependencies [Hochreiter and Schmidhuber, 1997]. Specifically, LSTM combines the forget and input gates into an update gate. It is capable of learning all the dependencies from the historical data and making predictions from the information remembered previously. Inside LSTM, instead of using a linear layer, there is a small network which performs the function independently.

TensorFlow official models are chosen in our project. The TensorFlow official models contain a collection of deep learning models together with their high-level APIs (https://github.com/tensorflow/models/tree/master/official).

LSTM is one of the most common uses of recurrent neural networks. This model is generally used with sequence-related data and can solve language modelling problems such as NLP concepts (word embedding, encoder).

Residual Network (ResNet)
One of the problems that deep learning models face is that, as the number of layers increases past a certain point, accuracy stops improving. This problem motivated modern network architectures such as the residual network (ResNet) and Inception, which address it through residual connections.

ResNet normally solves the problem described above by fitting a residual mapping through the addition of a shortcut connection. Each ResNet block contains a series of layers and a shortcut connection component.

Bidirectional Encoder Representations from Transformers (BERT)
BERT is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al., 2018].

It was first released in google-research/bert on GitHub on 1 November 2018, able to handle a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications; it aims to predict the relationship between sentences by analysing the whole sentence holistically [Devlin et al., 2018].

Attention is all you need (Transformer)
Most of the problems in deep learning can be seen as a form of sequence-to-sequence mapping and can be solved using a common type of architecture, the Encoder-Decoder architecture.

The encoder and decoder both consist of identical layers, and each layer consists of two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network [Vaswani et al., 2017].

Neural Collaborative Filtering (NCF)
Neural Collaborative Filtering (NCF) is a new neural network architecture which utilises the non-linearity of neural networks to build recommendation systems [He et al., 2017]. It demonstrates that matrix factorisation can be treated as a special case of neural collaborative filtering. To add additional non-linearity, this model introduces a multi-layer perceptron module on top of the generalised matrix factorisation (GMF) layer.

Wide and Deep Learning
Since linear models are not great at generalising across unique features, deep models are introduced to solve this problem. Deep models can use embedding vectors for every query, and this technique can then generalise by coupling the items and queries.

To overcome over-generalisation, the Google research team introduced the Wide & Deep Learning model, jointly training comprehensive linear models and deep neural networks to combine the benefits of memorisation and generalisation for recommender systems [Cheng et al., 2016].

2.1.3 Summarized Timeline

Model Name    Definition Raised Time
CNN           1980s
LSTM          1997
ResNet        2015
Wide & Deep   2016
NCF           2017
Transformer   2017
BERT          2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest open-source distributed version control systems (DVCS) [Gousios et al., 2014]. This distributed version control system enables contributors to submit a set of changes and integrate them into the main development branch. The use of git is based on pragmatic needs, combining the advantages of version control and collaborative development.

GitHub can also lead to insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project. Thus the number of stars can reveal popularity from a software development research perspective, and it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al. [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth using the time series metadata derived from 2,279 accessible GitHub repositories. In the meanwhile, they found that slow growth is more common in the case of overpopulated application domains and for old repositories. Moreover, they conclude that the three most common domains on GitHub include web libraries and frameworks, and non-web libraries and frameworks.

Building on their work, we would like to examine whether deep learning framework repositories follow a similar popularity trend pattern or not. At the same time, we would also like to study whether a relationship exists between three factors: forks, stars, and watchers.

Predicting the Popularity of GitHub Repositories
In the same year, Borges et al. [2016a] published another paper about predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regressions to predict the number of stars of GitHub repositories; as a result, project owners can see how their projects are performing in the open-source community.

Their studies examine the correlation between the popularity of a repository and the number of stars it owns. In the meanwhile, this study reports a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent
GHTorrent [Gousios and Spinellis, 2012] is designed to conduct independent repository mining through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all the repositories, and the resulting dataset can be used towards answering multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving the raw data based on the result returned from the REST API. However, their tool does not have the ability to visualise the metadata or offer trend analysis at a high level.

MetricMiner
A similar tool is MetricMiner [Sokol et al., 2013]. It is a web application that supports researchers in mining software repositories, doing data extraction and statistical inference from the data collected. This tool automatically clones the repository, processes the metadata, and stores the data in the cloud, which gives it good scalability and fast computational speed in query answering, without users installing any software on their localhost.

GitcProc
GitcProc [Casalnuovo et al., 2017] is a tool that uses regular expressions to extract the changed lines within a repository, to facilitate answering project evolution questions. GitcProc can retrieve and summarise global statistics of the project metrics, including commits, commit dates, and contributors. This tool can measure how many changes have taken place in Java projects and is also able to locate the changed files.

RepoVis
RepoVis [Feiner and Andrews, 2018] is a new tool which provides visual overviews for software maintained in Git repositories. RepoVis is a client-server web application providing a full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its searchable functionality to GitHub, associated with a code-based search. All the visualizations are written out in SVG format.

2.2.4 Visualizing data in Repositories

Chronos
CHRONOS [Servant and Jones, 2013] is a software tool that enables the visualisation of historical change inside software source code. This tool implements a zoomable user interface as a visualisation of the actual code, which supports developers from a high-level view (pattern recognition) to a low-level view (reverse engineering) using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise the complete history of changes, including the revisions that modified them. Inspired by this tool, our project uses visualisation to track the historical change of popularity trends related to the keyword specified by users in GitHub.

Figure 2.1: git2net [Gote et al., 2019]

GEVOL
Collberg et al. [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph drawing technique, to deduce a better understanding of a program from its development history, displaying all the visualisations with a temporal graph visualizer.

This system aids in the discovery of the structure of the system and provides the user with a new way to discover the evolution of the program by visualising the change of the system. It extracts information about Java programs stored within a CVS version control system and renders the metadata into three types of graphs: inheritance, control-flow, and call-graphs.

git2net
git2net [Gote et al., 2019] is a software tool that facilitates the extraction of co-editing networks in git repositories. Similar to GEVOL, it uses text mining techniques to analyse the history of modifications within files. However, the authors address the importance of studying social networks in GitHub and give the reader a broader view of the application of graph-based data analysis and modelling. Their tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter, we detailed the web-based hosting service we selected for our study (GitHub) and presented the concept of deep learning with two popular frameworks, together with state-of-the-art neural network models. In the next chapter, we will elaborate how we design and implement STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter, we will outline our design and implementation for data extraction, and then we will detail the metrics we use to estimate the trend of deep learning framework and model usage in GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

[Figure 3.1 (overview diagram): (1) Data Collection via the GitHub project search API driven by model-name keywords; (2) Repository Search; (3) an optional Data Selection step via the GitHub code search API; results are stored locally and fed into data visualisation.]

Figure 3.1: Overview of STAMPER



Data Collection
We first collect all the repository metrics through the GitHub API. This step allows us to extract the history of all the repositories related to the keyword and record the metadata for each repository (repository-based search).

Repository Search
As each repository contains multiple contributors and forks, we subsequently extract and analyse whether changes were made or not based on the size information, and calculate the collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the forked information to create visual representations.
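The collaborative factor mentioned above can be computed as the Shannon entropy of each developer's share of contributions; a minimal sketch (the function name is ours for illustration, not STAMPER's actual code):

```python
import math

def collaboration_entropy(contributions):
    """Shannon entropy (in bits) of the contribution shares of a
    repository's developers: 0.0 means a single developer did all the
    work; higher values mean more evenly shared collaboration."""
    total = sum(contributions)
    shares = [c / total for c in contributions if c > 0]
    return -sum(p * math.log2(p) for p in shares)

# Two developers contributing equally yield 1 bit of entropy
print(collaboration_entropy([50, 50]))  # 1.0
```

A single-developer repository, e.g. `collaboration_entropy([120])`, scores 0.0, so the metric separates one-person projects from genuinely collaborative ones.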

Data Selection
We have implemented a selector allowing users to exclude specific repositories not related to the desired set. The selector summarizes the frequency counts for the keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis
Since each forked repository may involve re-development and modification, the analysis of modifications in forked repositories is included in our project. For all changed repositories, the size difference is analyzed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to deeply analyse and manipulate the data, and even run statistical tests on the data set. To better understand those metrics, we divide them into multiple categories. Attributes that are not primary data from the GitHub API are explained in the data expansion part and labelled as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication, the user can make up to 5,000 requests per hour. Otherwise, the rate limit only allows up to 60 requests per hour [Git d].


Type          Meta-data

Contributor   contribution: int [Data Expansion]
              login (user name): String
              type (user/organization): String
              contributors_url

Repository    created_at
              description
              full_name
              language
              size

Popularity    fork: Boolean
              forks: int
              forks_url
              stargazers_count
              watchers_count
              unique_repos [Data Expansion]

Owner         id
              login (username)
              type

Maintenance   has_issues: Boolean
              has_wiki: Boolean
              open_issues: int
              pushed_at
              updated_at
              score

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deep crawling operations to collect raw data.

• Contribution
One repository generally involves multiple developers conducting software development. The project owner is not necessarily the person who contributes the greatest amount of code, and the amounts of contribution made by the developers are potentially not the same. As a result, we further track that information by utilizing the GitHub API and record the number of contributions each developer made for each repository.

• Unique_repos: Popular repositories may have numerous related forked repositories owned by other users, and the reasons behind the forking behaviour vary. Our research explores whether users conduct subsequent development based on the original codebase. By comparing the size of the forked


repository (Fi) and the original repository (O), we obtain, for each forked repository, its change of size (c):

c = Fi − O (3.1)
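As an illustration, the size comparison in Equation 3.1 can be sketched as follows (a minimal sketch; the function names are ours, and the sizes are the byte counts reported by the GitHub API):

```python
def size_changes(original_size, fork_sizes):
    # c_i = F_i - O: positive when the fork grew, negative when it shrank,
    # zero when the fork's reported size equals the original's.
    return [fork - original_size for fork in fork_sizes]

def changed_forks(original_size, fork_sizes):
    # Keep only the forks whose size differs from the original repository.
    return [c for c in size_changes(original_size, fork_sizes) if c != 0]
```

For example, an original repository of 100 bytes with forks of 100, 120 and 90 bytes yields size changes of 0, +20 and −10, of which two count as changed.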

3.4 Data Selection

Figure 3.2: Data Selection — an entity (model) and API keywords are searched in repositories to produce statistics.

Figure 3.3: Store in Local Disk — forked repositories and timestamps are filtered from unfiltered data using model-related keywords (e.g. Bert, ResNet, CNN) grouped in model_keyword.py.


Figure 3.2 represents our method for searching API usage in DL model-related repositories. GitHub provides a REST API to examine whether a repository contains the targeted API or not.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. Moreover, the appearance frequency of the user-specified API is embedded directly within the returned result and can be matched in our program to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. This approach also allows users to build a high-level picture of API usage across GitHub repositories.
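One way to implement this counting step is via GitHub's code-search endpoint, whose responses include a total_count field (a hedged sketch; the helper names and the results dictionary shape are our assumptions, not STAMPER's exact implementation):

```python
import urllib.parse

def code_search_url(keyword, repo_full_name):
    # GitHub code search: GET /search/code?q=<keyword>+repo:<owner>/<name>
    q = "{} repo:{}".format(keyword, repo_full_name)
    return ("https://api.github.com/search/code?q=" +
            urllib.parse.quote_plus(q))

def record_api_counts(search_results):
    # Map each repository full name to its number of API appearances
    # ("total_count" in the search response), ready to dump as JSON.
    return {repo: result["total_count"]
            for repo, result in search_results.items()}
```

Building the query string separately from the HTTP call keeps the counting logic testable without network access.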

Meanwhile, we also provide multiple options for selecting APIs, which are stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the number of models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras applications library provides users the ability to load deep learning models and instantiate a model with default weights. Take ResNet as an example:

• With pre-defined models: Users may add from keras.applications.resnet50 import ResNet50 as a keyword in model_keyword.py. This denotes that the repository owner may use a pre-trained model from the Keras library, and this model could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they could all be used as sample keywords in STAMPER, such as:

    keras.applications.resnet.ResNet50
    keras.applications.resnet.ResNet101
    keras.applications.resnet.ResNet152
    keras.applications.resnet_v2.ResNet50V2
    keras.applications.resnet_v2.ResNet101V2
    keras.applications.resnet_v2.ResNet152V2

• With self-defined models: TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to explore the components inside the ResNet class further, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, data selection heuristics vary. Deep learning

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py


users and experts can define their searches according to their interests and preferences.
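One possible heuristic for the self-defined case is a regular-expression scan of each repository's Python files (an illustrative sketch; the pattern and names are our assumptions, not STAMPER's exact implementation):

```python
import re

# Matches class definitions such as "class ResNet:" or
# "class ResNet50(tf.keras.Model):" at the start of a line.
SELF_DEFINED_RESNET = re.compile(r"^\s*class\s+(ResNet\w*)\s*[(:]",
                                 re.MULTILINE)

def count_self_defined(source_text):
    # Return the names of self-defined ResNet classes in one .py file.
    return SELF_DEFINED_RESNET.findall(source_text)
```

The pattern can be swapped for any other model name, which is how a user-specific search heuristic could be plugged in.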

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from these three perspectives is illustrated in Figure 3.4. Meanwhile, Chapter 5 gives an example using our collected repository metadata for deep learning models.

Figure 3.4: Overall Construction of the Visualizations — entities (Entity 1 … Entity n) are functionally mapped to contribution-related, popularity-related, and maintenance-related visualisations.

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped by week (with forks)

• Repository creation time vs. stars

Contribution

To further exploit the forking information, STAMPER also supports comparison between an original repository and its forked repositories. The work could be further extended by visiting each forked repository's URL and tracing its commits.

As shown in Figure 3.5, an entity (E) searched in GitHub may have multiple related repositories (Ri), each with corresponding forked repositories (Fi). Among the forked repositories, we denote a changed forked repository by Ci.

To examine whether changes exist in forked repositories, and how this differs between entities, we calculate the difference using the equation below.


Keyword                    Total Repositories Collected    Total Original Repositories
                           (including Forks)               Collected
ResNet tensorflow          6,129                           339
Bert tensorflow            13,734                          106
CNN tensorflow             39,765                          1,000
LSTM tensorflow            19,572                          1,000
Transformer tensorflow     7,188                           145
Wide and deep tensorflow   324                             39

Table 3.1: Repositories Related to Tensorflow


Uniqueness percentage distributions are composed of all the percentages pi, each corresponding to its original repository Ri:

pi = Σ Ci / Σ Fi (3.2)
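Equation 3.2 can be sketched as a small helper (an illustrative sketch; representing each fork as a changed/unchanged flag is our assumption):

```python
def uniqueness_percentage(fork_changed_flags):
    # p_i = (number of changed forks, sum C_i) / (number of forks, sum F_i).
    # Each element of fork_changed_flags is True if that fork was changed.
    if not fork_changed_flags:
        return 0.0
    return sum(fork_changed_flags) / len(fork_changed_flags)
```

For instance, an original repository with four forks of which three were changed has a uniqueness percentage of 0.75.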

Figure 3.5: Examine Uniqueness after Forking — an entity (E) has related repositories (Repository 1 … Repository 4), each with forked repositories (Forked Repository 1 … Forked Repository n) marked as changed (Y/N).

• Percentage of Forked Repositories Unique from Origin (Boxplots)

• Uniqueness Percentage Distribution for Each Entity (Histograms)

• Entropy Distribution for Each Entity (Histograms)

Maintenance

• Development Time Boxplot for Each Entity

• Open Issues Distribution for Each Entity

3.6 Summary

In this chapter we detail the design of our tool and how it conducts repository mining and analysis. We present a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analyzing software trends in the GitHub community. Meanwhile, we introduce and analyze two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving and are built, trained, and deployed by researchers. Our tool is available for analyzing such changes. We collected the historical information stored in GitHub and extracted the metadata of each repository using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development is an intensely competitive field. Researchers, companies, and developers are all trying to dominate the conversation in deep learning. A variety of models exist, but there is no common bridge connecting those ideas. Historical data in GitHub are obscure and hard to find, and as a result many developers, especially experienced ones, remain in their comfort zone. With our study, we hope to shed some light on deep learning and highlight a few suggestions for the public.

This section aims to answer questions related to both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, due to the few studies about popularity in the GitHub ecosystem, there is no standardized feature to measure popularity. We analyze some potential features of each repository and hypothesize that popularity is strongly related to the stars each repository owns.

This decision is justified in the following section, with more background on GitHub.

• Watchers: Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching. However, watching does not imply collaboration [Git b]. A watcher could watch



Figure 4.2: Star Sort Menu [Git a]

a repository to receive notifications for new pull requests or issues. Watchers indicate how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

• Stars: Starring a repository makes it easy for a user to keep track of a repository they are interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks: Forks are created when a user would like to make their own copy of a repository. The user can fork a repository to suggest changes, or to use it as the basis for a new project.

Based on the data gathered in the collection stage, we extract the watchers_count, stargazers_count and forks attributes and test whether those attributes are correlated. In Table 4.1, we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics and draw a scatter plot, as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star     forks_count   watchers_count   model name
17940    4661          17940            Bert
12405    3637          12405            Bert
5263     1056          5263             Bert

Table 4.1: Popularity metrics for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks, and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, and thus we should consider methods that do not assume normality.


Spearman Correlation Coefficient

Definition: The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman 2011]. This statistical method allows us to test for a rank-order relationship between two numerical or ordinal variables associated by a monotonic (increasing or decreasing) function.

Hypothesis Testing

• H0: The variables (star, fork and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

    from scipy.stats import spearmanr
    coef1, p1 = spearmanr(star, forks)
    print(coef1, p1)
    >> 0.8752903811064278 0.0

• Star vs Watcher

    coef2, p2 = spearmanr(star, watchers)
    print(coef2, p2)
    >> 1.0 0.0

• Fork vs Watcher

    coef3, p3 = spearmanr(forks, watchers)
    print(coef3, p3)
    >> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Set α = 0.05. The p-values p1, p2, p3 are all less than α; meanwhile, from the calculation above we can also see that there exists a strong positive correlation, with values of coef1 = 0.875, coef2 = 1.0, coef3 = 0.875, respectively.

This means the likelihood that the testing data are uncorrelated is very low (95% confidence), and thus we can reject the hypothesis that these variables are uncorrelated.

In the rest of this report, we consider the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, appear to be two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from these models with a longer history, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as already described in the background section.

The model development community recently saw the release of multiple powerful frameworks that are treated as baselines for building models. However, for many new models, such as the Wide and Deep model and the NCF model, usage does not grow in abundance.


Figure 4.4: Repositories with Forks — accumulated number of repositories created (with forks) per model (bert, cnn, lstm, ncf, resnet, transformer, wide deep; all tensorflow), 2015-2019; counts up to 40,000.

Figure 4.5: Repositories without Forks — accumulated number of repositories created per model, 2015-2019; counts up to 3,000.


Figure 4.6: Repository Trend in GitHub For Each Model — per-model repository counts over time (October 2015 - October 2019), one panel per model.


Figure 4.7: Creation Time vs Stars — number of stars (up to 15,000) against repository creation time (October 2015 - July 2019), per model.

A fork is another copy of a repository. A forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories created including forks and the total number created excluding forks. We find that most repositories related to deep learning models are not original. This indicates that a considerable number of developers remain in a studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, differently from the previous summarizing method, we study the popularity of deep learning models over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the highest number of repositories created. Let us examine this using the data. In


2017, the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued to rise to a higher level, which persists until now.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have among the most important and most significant communities in the deep learning field. These networks are essential in both computer vision and NLP, where they form an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer. As the title indicates, ResNet and Transformer usage has improved significantly in the recent two years. Differing from earlier structures such as CNN, both modify the original structure and significantly improve results in computer vision and translation tasks.

Rising Star Bert

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends depicted in the graph suggest that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in frozen zone NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with how long the model has existed, but our data tell a different story.

Published in a 2016 paper, NCF draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e., stars) and creation time; past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

Similarly for the Wide and Deep model: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the previous data also confirm that there is no significant rise in the use of this model.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development — fork-count distribution histograms (forks_count, binned 0-1,000) per model.


4.1.4 RQ2: How popularity varies per model

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name      Mean     STD       Min    25%    50%    75%    Max
Bert            498.65   2196.3    0      1      8      43     17940
CNN             106.84   611.97    2      3      8      32     13882
LSTM            48.82    214.22    0      1      2      13     2703
NCF             77       129.91    1      2      3      115    227
ResNet          46.88    221.43    0      0      1      8      2980
Transformer     186.79   1155.87   0      0      4      21     12408
Wide and Deep   16.23    36.80     0      0      1      8      146

Table 4.2: Stars Comparison

Model Name      Mean     STD      Min    25%    50%    75%    Max
Bert            12.82    58.59    0.0    0.0    1.0    16.5   4661.0
CNN             4.07     25.27    0.0    1.0    4.0    14.0   6274.0
LSTM            1.78     7.20     0.0    0.0    1.0    5.0    968.0
NCF             3.43     5.86     0.0    0.5    1.0    5.15   102.0
ResNet          1.74     9.38     0.0    0.0    0.0    3.0    1442.0
Transformer     5.35     33.61    0.0    0.0    1.0    6.0    3637.0
Wide and Deep   0.73     16.36    0.0    0.0    0.0    2.5    71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (12.82), Transformer (5.35) and NCF (3.43).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (0.73), ResNet (1.74) and LSTM (1.78).

Kruskal-Wallis Test: The Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The 7 models' distributions are the same.

• H1: The 7 models' distributions are different.

    from scipy.stats import kruskal
    stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(),
                      dfLstm['star'].tolist(), dfNcf['star'].tolist(),
                      dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                      dfWideDeep['star'].tolist())
    print(stat, p)
    >> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the previous popularity exploration based on the number of repositories in GitHub. This unexpected result can be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex models. Alternatively, building their own Transformer or Bert model may require a large amount of time and effort, yet developers still show their interest in these novel deep learning models.

Figure 4.9: Star vs Contributors — stargazers_count (0-18,000) against number_of_contributors (0-30), per model.

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.


Figure 4.10: Star vs Development Time — stargazers_count (0-18,000) against develop_duration (0-2,000 days), per model.

Figure 4.11: Star vs Open Issues — stargazers_count (0-18,000) against open_issues (0-1,100), per model.

Figure 4.12: Star vs Entropy Value — stargazers_count (0-18,000) against entropy (0.0-2.8), per model.

Number of Contributors: From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Among all the repositories, the top-3 models with the most stars per contributor are CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor) and Bert (155.0 stars/contributor).

Model           Percentage of One-Contributor Development (%)
Bert            74.53
CNN             83.3
LSTM            85.9
NCF             100
ResNet          90.26
Transformer     81.20
Wide and Deep   89.74

Table 4.4: Percentage of one-contributor development for DL related repositories


Development Time: From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value = 0). This suggests that the longer a model is developed, the more stars it will have (i.e., the model becomes more popular). The top-2 repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues: From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesize that the more popular a repository becomes, the more issues it will have. We further investigate this correlation in the following section.

Entropy: From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project, we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution in a repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can confirm this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether contributions are even or not. Entropy: in particular, we compute the entropy H of each repository, defined as

pi = ci / Σi ci (4.1)

H = − Σi pi log2(pi) (4.2)

where i indexes the i-th contributor, ci is the i-th contributor's contribution, and Σi ci is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example:

The contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214 (4.3)

p1 = 174/214, p2 = 36/214, p3 = 4/214 (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.783 (4.5)

name            contribution
dragen1860      174
ash3n           36
kelvinkoh0308   4

Table 4.5: Sample Contributions to One Repository
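The worked example above can be reproduced with a short stdlib-only helper (an illustrative sketch, not part of STAMPER itself):

```python
import math

def collaboration_entropy(contributions):
    # H = -sum_i p_i * log2(p_i), with p_i = c_i / sum_i c_i (Eqs. 4.1-4.2).
    total = sum(contributions)
    probs = [c / total for c in contributions if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Table 4.5: dragen1860 = 174, ash3n = 36, kelvinkoh0308 = 4
h = collaboration_entropy([174, 36, 4])
```

A repository with a single contributor yields an entropy of 0, the most uneven case; the maximum for k equal contributors is log2(k).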

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly. The lower the entropy, the higher the phase separation, which corresponds to more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy value for all models. From these figures, we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


Figure 4.13: Collaboration Entropy — entropy distribution histograms (binned, 0.00-3.00) per model.


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models.

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot) — unique_percent (0-100) per model.

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was to provide a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. Looking in more detail, we can see at a glance not only that changes are rarely made after forking, but also that most changed


Figure 4.15: Repository Uniqueness Distribution (%) — uniqueness percentage histograms (binned, 0.00-1.00) per model.

Figure 4.16: Repository Change Statistic — histograms of mean size change (binned, −2,500 to +2,500) per model.


reped repositories differ in size from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them less used. Second, the model itself may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey software maintenance problems in these deep learning related repositories. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. In this project, we also explore whether the age of a project affects software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work. Older systems tend to have more problems in software maintenance. In this report we calculate the age of each repository from the repository creation time, as depicted in the equation below:

age = T(updated_at) - T(created_at)    (4.6)
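Equation 4.6 can be sketched directly against the `created_at` and `updated_at` timestamp fields that the GitHub API returns for each repository; a minimal sketch (the field format follows the GitHub REST v3 repository schema, the sample values are illustrative):

```python
from datetime import datetime

ISO = "%Y-%m-%dT%H:%M:%S%z"

def repo_age_days(created_at: str, updated_at: str) -> float:
    """Age = T(updated_at) - T(created_at), in days, from GitHub API timestamps."""
    # GitHub returns ISO 8601 timestamps with a trailing 'Z' for UTC.
    created = datetime.strptime(created_at.replace("Z", "+0000"), ISO)
    updated = datetime.strptime(updated_at.replace("Z", "+0000"), ISO)
    return (updated - created).total_seconds() / 86400

# e.g. a repository created 2018-11-01 and last updated 2019-01-20 is 80 days old
print(repo_age_days("2018-11-01T00:00:00Z", "2019-01-20T00:00:00Z"))  # → 80.0
```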

Figure 4.17 and Table 4.6 show how development time varies depending on the model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). Therefore, we hypothesize that many of these models started being used in the open-source web community immediately after their first release.
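The Kruskal-Wallis test compares the rank distributions of development time across the model groups. A minimal pure-Python sketch of the H statistic (ignoring the tie correction that a statistics package such as `scipy.stats.kruskal` would apply; the sample values are toy data, not the study's measurements):

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction) for k independent samples.

    H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1),
    where R_i is the rank sum of group i and N the total sample size.
    """
    # Pool all observations, remembering which group each came from.
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    for rank, (_, gi) in enumerate(pooled, start=1):
        rank_sums[gi] += rank
    return (12.0 / (n * (n + 1))
            * sum(r * r / len(g) for r, g in zip(rank_sums, groups))
            - 3 * (n + 1))

# Toy development-time samples (days) for three hypothetical models
h = kruskal_wallis_h([110, 32, 229], [142, 11, 321], [483, 270, 699])
print(round(h, 3))
```

The statistic is then compared against a chi-squared distribution with k − 1 degrees of freedom to obtain the p-value.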


Model        Max     Q3      Median  Q1      Min
Bert         779     229     110     32      0
Transformer  1254    321     142     11      0
Wide deep    1107    575     117     0.5     0
ResNet       1360    456.5   120     15      0
NCF          1120    476     216     8       0
LSTM         1812    621.25  315.5   47.25   0
CNN          1385    699.25  483     270.25  0

Table 4.6: Repository Development Time Statistics (days)

[Figure: boxplot of development time (days, 0 to 2000) for each model: bert, cnn, lstm, ncf, resnet, transformer, wide deep (all TensorFlow).]

Figure 4.17: Development Time Boxplot


[Figure: scatter plot of develop_duration (days) against open_issues, coloured by model.]

Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As suggested visually by the figure and by a Spearman correlation test, there is a weak correlation between those two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models may have more users and, given the high cost of maintenance, more issues related to them.
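The Spearman test ranks both variables and then correlates the ranks. A minimal sketch under a no-ties assumption (in which case the closed form 1 − 6 Σd² / (n(n² − 1)) applies); the data below are toy values, not the study's measurements:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation, assuming no ties:
    rho = 1 - 6 * sum(d_i^2) / (n (n^2 - 1)), d_i = rank(x_i) - rank(y_i)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0] * len(vals)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy data: development time (days) vs number of open issues
print(spearman_rho([10, 50, 200, 400, 800], [0, 2, 1, 5, 30]))  # → 0.9
```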

Specifically, as depicted in Table 4.7, the top three models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository  Percentage of repositories having Wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
ResNet                    98.817
Transformer               96.97
Wide deep                 100

Table 4.8: Descriptive statistics on percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.
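The wiki statistic in Table 4.8 can be recovered from the boolean `has_wiki` field present in each repository's GitHub metadata (field name as in the GitHub REST v3 repository schema); a minimal sketch with a toy sample standing in for the collected data:

```python
def wiki_percentage(repos):
    """Percentage of repositories whose GitHub metadata reports a wiki page.

    `repos` is a list of repository metadata dicts as returned by the GitHub
    search API; each item carries a boolean `has_wiki` field.
    """
    with_wiki = sum(1 for r in repos if r.get("has_wiki"))
    return 100.0 * with_wiki / len(repos)

# Toy sample standing in for one model's collected metadata
sample = [{"has_wiki": True}, {"has_wiki": True},
          {"has_wiki": False}, {"has_wiki": True}]
print(wiki_percentage(sample))  # → 75.0
```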

4.4 Summary

In this section, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common aspects (popularity, contribution and maintenance) of software engineering in deep learning repositories, using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.


[Figure: histograms of the count of repositories per binned number of open issues (0 to 100), one panel per model (bert, cnn, lstm, ncf, resnet, transformer, wide deep; all TensorFlow).]

Figure 4.19: Open Issues vs Number of Repository


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitation and Improvement

Pinpointing the landmark uses of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies. We developed heuristics for an in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristic may not be perfect, considering the numerous ways of constructing models in the real world.

There is a sampling problem at the same time. The models we chose cannot represent all the new models in the wild. This is an open research question which needs further investigation in the future; for example, users may use the prototxt format to publish their models, while in our project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limits. The present experiment uses a limited number of repositories from GitHub, since a search cannot exceed the 1,000-results boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub. However, this strategy still cannot capture all the repositories in GitHub. It may be that other, more stratified samples would give a more precise outcome.
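The different sorting strategies mentioned above amount to re-issuing the same search once per (sort, order) combination, since each ordering exposes a different 1,000-result window of the GitHub search API. A sketch of how those query URLs could be enumerated (the query layout is illustrative, not STAMPER's exact code):

```python
from itertools import product
from urllib.parse import urlencode

def search_strategies(keyword):
    """Enumerate (sort, order) combinations of the GitHub repository search,
    partially mitigating the 1,000-results-per-search cap."""
    sorts = ["stars", "updated"]   # sort keys the GitHub search API accepts
    orders = ["asc", "desc"]
    base = "https://api.github.com/search/repositories"
    return [f"{base}?{urlencode({'q': keyword, 'sort': s, 'order': o, 'per_page': 100})}"
            for s, o in product(sorts, orders)]

for url in search_strategies("bert tensorflow"):
    print(url)
```

Repositories returned by several orderings are then de-duplicated by their full name before analysis.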

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to know how deep learning is involved in our real life, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild; this program could then provide a broader picture of deep learning model usage in the world. Our program also allows a developer or user to develop their own heuristic for data selection: experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or becoming a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models via the number of related repositories in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., k-means) to the high-resolution time series data from commits.
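As a sketch of this idea, a plain 1-D k-means (Lloyd's algorithm) over commit-day offsets can surface bursts of activity; the data below are toy values, not measurements from this study, and a library implementation such as scikit-learn's KMeans would normally be used instead:

```python
def kmeans_1d(points, k, iters=20):
    """Plain 1-D k-means (Lloyd's algorithm): a sketch of how commit timestamps
    (e.g. as day offsets) could be clustered to surface bursts of activity."""
    # Seed centroids by taking evenly spaced sorted points.
    centroids = sorted(points)[:: max(1, len(points) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[i].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Toy commit-day offsets: two bursts of activity around day ~3 and day ~100
days = [1, 2, 3, 4, 5, 98, 99, 100, 101, 102]
print(kmeans_1d(days, k=2))  # → [3.0, 100.0]
```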

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub deep learning related repositories and identified factors affecting those domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields and serve the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past, based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated this program is my own original work

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015); Processor: 2.2 GHz Intel Core i7; Memory: 16 GB 1600 MHz DDR3; Graphics: Intel Iris Pro 1536 MB

Softwares

• PyCharm 2019.1.3 (Professional Edition)


Build #PY-191.7479.30, built on May 30, 2019. Licensed to ANU / Xing Yu. JRE: 11.0.2+9-b159.60 x86_64. JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o. macOS 10.14.6.

• Anaconda

  - jupyter-notebook 6.0.0

Other

- Python 3.7.4
- pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
- Git

Datasets

asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

Before You Begin | Prerequisites | Install | Running | Test | High Level Description of all Modules & Datasets | Authors | License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view/edit/execute our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keep the Mac awake with this useful app (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used, listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt.

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repository metadata from GitHub in the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run sh JSONFormatter.sh in your terminal to well-format your output data.
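The authenticated request described above can be sketched with the Python standard library; the endpoint and headers follow the GitHub REST v3 API, and the token placeholder is an assumption for illustration (with authentication the search rate limit rises from 10 to 30 requests per minute):

```python
import urllib.request
from urllib.parse import urlencode

API = "https://api.github.com/search/repositories"

def build_search_request(keyword, token=None, sort="stars", order="desc"):
    """Build a (optionally authenticated) GitHub repository-search request."""
    url = f"{API}?{urlencode({'q': keyword, 'sort': sort, 'order': order})}"
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        # Token authentication raises the search rate limit.
        headers["Authorization"] = f"token {token}"
    return urllib.request.Request(url, headers=headers)

req = build_search_request("bert tensorflow", token="<YOUR_TOKEN>")
print(req.full_url)  # no network call is made until the request is opened
```

The request object is then passed to urllib.request.urlopen (or an equivalent HTTP client) to fetch one page of results.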

Sample Case

In main(), change keywords to the terms of interest. The resulting JSON file will be output/bert.JSON.

The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars; order can be asc or desc.

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

pip3 install --upgrade pip
pip3 install -r requirements.txt

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder. Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity
- Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance
- Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution
- Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module will record all the unreachable links and write them into the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation

Since you already have data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with parameters model_name and repository metadata subfolder. Then you can call this object with its related data easily (from Model import bert and use bert as you go along).

Customize Keywords

In module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Keyword customization example:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

Experiment Datasets Collected

1. After Data Collection

output
├── asc_by_star
│   ├── cnn tensorflow.json
│   └── lstm tensorflow.json
├── asc_general
│   ├── bert.json
│   ├── cnn.json
│   ├── lstm.json
│   ├── ncf.json
│   ├── resnet.json
│   ├── transformer.json
│   └── wide deep.json
├── by_update_time
│   ├── bert tensorflow.json
│   ├── cnn tensorflow.json
│   ├── lstm tensorflow.json
│   ├── ncf tensorflow.json
│   ├── resnet tensorflow.json
│   ├── transformer tensorflow.json
│   └── wide deep tensorflow.json
├── desc_by_star
│   ├── bert tensorflow.json
│   ├── cnn tensorflow.json
│   ├── lstm tensorflow.json
│   ├── ncf tensorflow.json
│   ├── resnet tensorflow.json
│   ├── transformer tensorflow.json
│   └── wide deep tensorflow.json
├── desc_general
│   ├── bert.json
│   ├── cnn.json
│   ├── lstm.json
│   ├── ncf.json
│   ├── resnet.json
│   ├── transformer.json
│   └── wide deep.json
└── pytorch_models
    ├── AlexNet.json
    ├── DCGAN.json
    ├── Densenet.json
    ├── FCN-ResNet101.json
    ├── GoogleNet.json
    ├── HarDNet.json
    ├── Inception_v3.json
    ├── MobileNet v2.json
    ├── PGAN.json
    ├── ResNet.json
    ├── ResNet101.json
    ├── ResNext WSL.json
    ├── ResNext.json
    ├── RoBERTa.json
    ├── SSD.json
    ├── ShuffleNet v2.json
    ├── SqueezeNet.json
    ├── Tacotron 2.json
    ├── Transformer.json
    ├── U-Net pytorch.json
    ├── U-Net.json
    ├── WaveGlow.json
    ├── Wide ResNet.json
    ├── fairseq.json
    └── vgg_nets.json

2. After Repository Search

forked_timestamp
├── bert tensorflow.csv
├── cnn tensorflow.csv
├── lstm tensorflow.csv
├── ncf tensorflow.csv
├── resnet tensorflow.csv
├── transformer tensorflow.csv
└── wide deep tensorflow.csv

Generated Graphs

3. After Data Selection (Optional)

filtered_repo
├── bert.json
├── pytorch_model_filtering
│   ├── Densenet.json
│   ├── FCN-ResNet101.json
│   ├── GoogleNet.json
│   ├── MobileNet v2.json
│   ├── ResNet101.json
│   ├── ResNext.json
│   ├── ShuffleNet v2.json
│   ├── SqueezeNet.json
│   ├── Tacotron 2.json
│   ├── Wide ResNet.json
│   └── vgg_nets.json
└── tensorflow_model_filtering
    ├── bert.json
    ├── lstm.json
    ├── ncf.json
    ├── resnet.json
    ├── transformer.json
    └── wide deep.json


graphs
├── contribution
│   ├── change_to_pdf.bash
│   ├── entropy_distribution.svg
│   ├── entropy_dots.svg
│   ├── lines_changed_boxs.svg
│   ├── lines_changed_hists.svg
│   ├── unique_percentage_distribution.svg
│   └── uniqueness_chart.svg
├── maintenance
│   ├── devTime_boxplot.svg
│   ├── issues_distribution.svg
│   └── wiki_yn.svg
├── multi_variable
│   ├── dev_t_to_open_issues.svg
│   ├── multi_correlation.svg
│   ├── star_to_contributors.svg
│   ├── star_to_dev_t.svg
│   ├── star_to_entropy.svg
│   └── star_to_open_issues.svg
└── popularity
    ├── accumulated_popularity.svg
    ├── creation_repository_trend_total.svg
    ├── creation_with_fork_timeline.svg
    ├── fork_distribution.svg
    ├── popularity_dot.svg
    └── popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)

Bibliography

[Git a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[Git b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[Git c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[Git d] GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)


Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
    • Deep learning models
    • Summarized Timeline
  • Public Code Repositories
    • Web-based hosting service
    • Measuring Popularity From GitHub
    • Extracting Messy Data in the Wild
    • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    • RQ2: How popularity varies per model
    • RQ3: Does the popularity of models relate to other features?
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the Wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection Using Commit Timestamps
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Softwares
      • Other
      • Datasets
  • Appendix 4: README

List of Figures

2.1 git2net [Gote et al., 2019]
3.1 Overview of STAMPER
3.2 Data Selection
3.3 Store in Local Disk
3.4 Overall: Construct the Visualizations
3.5 Examine Uniqueness after Forking
4.1 Repository Watching [Git b]
4.2 Star Sort Menu [Git a]
4.3 Popularity Metric
4.4 Repositories with Forks
4.5 Repositories without Forks
4.6 Repository Trend in GitHub For Each Model
4.7 Creation Time vs Stars
4.8 Number of Forks Related to Repositories in Deep Learning Model Development
4.9 Star vs Contributors
4.10 Star vs Development Time
4.11 Star vs Open Issues
4.12 Star vs Entropy Value
4.13 Collaboration Entropy
4.14 Percentage of Forked Repositories Unique From Origin (Boxplot)
4.15 Repository Uniqueness Distribution (%)
4.16 Repository Change Statistic
4.17 Development Time Boxplot
4.18 Development Time vs Number of Open Issues
4.19 Open Issues vs Number of Repository

List of Tables

2.1 Deep Learning History
2.2 Timeline
3.1 Repositories Related to TensorFlow
4.1 Popularity metric for repositories
4.2 Stars Comparison
4.3 Forks Comparison
4.4 Percentage of one-contributor development for DL related repositories
4.5 Sample Contributions to One Repository
4.6 Repository Development Time Stat
4.7 Repository Open Issue Statistics
4.8 Descriptive statistics on percentage of Wiki Existence

Chapter 1

Introduction

1.1 Trace Deep Learning Use through GitHub

GitHub is one of the largest web-based hosting communities in the world and contains a rich source of data facilitating different software engineering projects. This massive database allows developers and researchers to publish their work.

GitHub can also be used for popularity measurement. Developers can construct their social networks in GitHub in multiple ways, such as joining an organisation or starring a repository they are interested in. These features make repositories in GitHub easily accessible and the best place to conduct empirical studies.

The deep learning research community is continually evolving. Not limited to academic research, deep learning is allowing businesses to use data to teach computers how to learn. Therefore, a number of machine learning and deep learning startups have come into existence worldwide.

The development of deep learning models sometimes has software engineering problems. Work on the quality of deep learning related projects is sparse, and few researchers focus on usage outside academia. With the expansion of the usable range of deep learning and the deepening degree of its use, we would like to test whether developers catch up with the latest deep learning upsurge in academia.

A tool to extract the metadata of historical repositories, with integrated visualisation, over the vast corpus of GitHub repositories is not currently available. To fill this gap we present our tool, STAMPER. It is implemented in Python and supports GitHub repository metadata extraction based on repository keyword search and code segment search using the GitHub API. It can also analyse the modification of forked repositories relative to the original repository and capture the repository difference.

We study the historical trend of deep learning models and frameworks from repositories in GitHub. We further demonstrate how repository metadata can be used to see deep learning trends. Our project provides a novel method to study deep learning frameworks and models from a historical aspect; in the meanwhile, our work creates a new aspect of empirical study in deep learning.


1.2 Contribution

• We introduce STAMPER, a Python tool that can be used to extract the metadata characterizing historical open source projects from GitHub, based on researchers' interests.

• We utilize STAMPER in a case study analysing the usage of deep learning models in TensorFlow. We further demonstrate how we extract a rich set of features and establish the connections between those features and popularity.

• We create and mine a new dataset for further research use.

1.3 Report Outline

The rest of this report is structured as follows. Chapter 2 provides the timeline for deep learning frameworks and deep learning models. In that chapter some background knowledge is presented, and previous works on software mining, tools related to GitHub, and visualizations are reviewed as well.

Chapter 3 provides an overview of our work and introduces our proposed methodology to extract the metadata from GitHub repositories.

Chapter 4 presents a case study in which we use our tool to extract deep learning related repositories from GitHub and trace the landscape of popular deep learning models. The visualizations generated using our tool can aid researchers in understanding past and current trends easily.

Referring back to the critical questions posed, we conclude our project and highlight the next steps in Chapter 5.

Chapter 2

Background and Related Work

Software repository mining lets researchers study the historical trends of software engineering practice effectively. The use of repository mining is based on the use of web hosting services. There exist multiple approaches to conducting it; in the first section we will introduce some background knowledge on web-based hosting services. Then we will introduce some popular deep learning frameworks in Section 2.1.1. Finally we will detail some previous works in Section 2.2, which conduct software repository mining and identify the main patterns in exploring repository popularity.

2.1 Background

2.1.1 Deep learning

Different from traditional machine learning, deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [LeCun et al. 2015]. Methods in deep learning can dramatically improve human life in multiple aspects, from image classification and speech recognition to machine translation, even to autonomous cars.

The usage of deep learning has grown tremendously in the past few years with the introduction and improvement of hardware availability (GPUs), big data, and cloud providers such as Amazon Web Services (AWS). Not limited to this, large companies create their own research teams to develop deep learning algorithms and integrate them into frameworks such as PyTorch and TensorFlow. In the meanwhile, they share their datasets and model training tutorials, helping startup companies to build state-of-the-art products with minimal effort and time.

Though deep learning has become a hot topic in recent years, the history behind deep learning has been evolving since the 1950s, as summarised in Table 2.1 [Subramanian 2018].

Previously, developers and researchers required expertise in C++ or CUDA to implement deep learning models and algorithms. However, thanks to large technology companies and organisations, people with knowledge of a scripting language (Python) can now also build their own deep learning algorithms.


Technique                     Year
Neural network                1943
Backpropagation               1960s
Convolution Neural Network    1979
Recurrent neural network      1980
Long Short-Term Memory        1997

Table 2.1: Deep Learning History

In Section 2.1.1.1 and Section 2.1.1.2 we will talk about some state-of-the-art deep learning frameworks used today, both in academia and industry.

2.1.1.1 TensorFlow

TensorFlow is a machine learning system that operates at large scale [Abadi et al. 2016]. Before its initial release by the Google Brain team in November 2015, it was developed under the name DistBelief. TensorFlow released its official 1.0.0 version on the 11th of February 2017, introducing the ability to run on multiple CPUs and GPUs.

Currently Google uses this framework in numerous ways to improve its search engine, translation, recommendation systems, and image recognition and captioning. The TensorFlow library contains different APIs to build deep learning models at scale, such as ResNet, CNN or LSTM. The architecture of this framework gives developers the flexibility to experiment with new optimisations and Google's training algorithms. It currently supports numerous applications from front end to mobile; the flexibility and scalability make it one of the most popular deep learning frameworks in the world.

2.1.1.2 PyTorch

Based on the Torch library, PyTorch can be seen as a Python front end to the Torch engine, which provides the capability to define mathematical functions and compute gradients. Similar to TensorFlow, PyTorch has good GPU and CPU support and is primarily developed by Facebook's artificial intelligence research group. The official initial release of PyTorch was in October 2016; its dynamic computation capability gives it great flexibility in building complex architectures.

However, unlike TensorFlow, which can seamlessly integrate into real industrial applications, PyTorch was primarily developed by and for researchers and scientists, and is not easy or recommended for production usage in certain scenarios.


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and pick up the speed of project development to survive in this keen competition. Winning trust from the public with a service of high quality is thus required.

Initially we wanted to conduct our research on the latest model stores, such as AWS SageMaker, Azure Machine Learning service, Wolfram Neural Net Repository and ONNX. However, the algorithms and datasets used there are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand over the public. To avoid those problems and gain a more in-depth insight into usage in society, we chose a framework with models of greater transparency and substantial usage: TensorFlow.

API Referenced

Since this project involves a range of deep learning models, we began by developing an understanding of how to build and train neural networks. The construction of networks may rely on the layers API provided by Keras and TensorFlow. In the meanwhile, TensorFlow recently introduced estimator APIs to simplify the procedure of training, evaluation, prediction and export.

Convolutional Neural Network (CNN)
Convolutional Neural Network is one of the most established algorithms among all the deep learning models and one of the most dominant algorithms in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons in a grid pattern. Typically a CNN consists of three types of layers: fully connected layers, convolution layers and pooling layers. Convolution and pooling layers conduct the feature extraction, and the fully connected layer maps the extracted features into the final output. Layers are interconnected, and thus the extracted features are transferred layer by layer.

Long short-term memory (LSTM)
Different from traditional neural networks, which cannot memorise previous data, a special kind of recurrent neural network, Long Short-Term Memory (LSTM), provides researchers with an effective way to perform persistent data tasks and is capable of learning long-term dependencies [Hochreiter and Schmidhuber 1997]. Specifically, LSTM combines the forget and input gates into an update gate. It is capable of learning all the dependencies from the historical data and making predictions from the information remembered previously. Inside an LSTM, instead of using a single linear layer, there is a small network which performs the function independently.

TensorFlow official models are chosen in our project. The TensorFlow official models contain a collection of deep learning models together with their high-level APIs (https://github.com/tensorflow/models/tree/master/official).

LSTM is one of the most common uses of recurrent neural networks. This model is generally used with sequential data and can solve language modelling problems such as NLP tasks (word embedding, encoding).

Residual Network (ResNet)
One of the problems that deep learning models face is that as the number of layers increases past a certain point, accuracy stops improving. This problem motivated modern network architectures such as the residual network (ResNet) and Inception, which address it through residual connections.

ResNet normally solves the problem described above by fitting a residual mapping through an added shortcut connection. Each ResNet block contains a series of layers and a shortcut connection component.

Bidirectional Encoder Representations from Transformers (Bert)
Bert is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al. 2018].

It was first released in google-research/bert on GitHub on the 1st of November 2018, able to handle a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications; it aims to predict the relationship between sentences by analysing the whole sentence holistically [Devlin et al. 2018].

Attention is all you need (Transformer)
Most problems in deep learning can be seen as a form of sequence-to-sequence mapping and can be solved using a common type of architecture, the Encoder-Decoder architecture.

The encoder and decoder both consist of identical layers, and each layer consists of 2 sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network [Vaswani et al. 2017].

Neural Collaborative Filtering (NCF)
Neural Collaborative Filtering (NCF) is a new neural network architecture which utilises the non-linearity of neural networks to build recommendation systems [He et al. 2017]. It demonstrates that matrix factorisation can be treated as a special case of neural collaborative filtering. To add additional non-linearity, this model introduces a multi-layer perceptron module on top of the generalised matrix factorisation (GMF) layer.

Wide and Deep Learning
Since linear models are not great at generalising across unique features, deep models are introduced to solve this problem. Deep models can use embedding vectors for every query, and this technique can then generalise by coupling the items and queries.

To overcome over-generalisation, the Google research team introduced the Wide & Deep Learning model, jointly training comprehensive linear models and deep neural networks to combine the benefits of memorisation and generalisation for recommender systems [Cheng et al. 2016].

2.1.3 Summarized Timeline

Model Name      Raised Time
CNN             1980s
LSTM            1997
ResNet          2015
Wide & Deep     2016
NCF             2017
Transformer     2017
Bert            2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest hosts of open-source projects using distributed version control systems (DVCS) [Gousios et al. 2014]. A distributed version control system enables contributors to submit a set of changes and integrate them in the main development branch. The use of git is based on pragmatic needs, combining the advantages of version control and collaborative development.

GitHub can also lead to insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project. Thus the number of stars can reveal popularity; from a software development research perspective, it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al. [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth using time series metadata derived from 2279 accessible GitHub repositories. In the meanwhile, they found slow growth is more common in the case of overpopulated application domains and for old repositories. Moreover, they conclude that the three most common domains on GitHub include web libraries and frameworks and non-web libraries and frameworks.

Building on this work, we would like to examine whether deep learning framework repositories follow a similar popularity trend pattern or not. At the same time, we would also like to study whether a relationship exists between three factors: forks, stars and watchers.

Predicting the Popularity of GitHub Repositories
In the same year, Borges et al. [2016a] published another paper about predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regressions to predict the number of stars of GitHub repositories; as a result, project owners can see how their projects are performing in the open-source community.

Their studies examine the correlation between the popularity of a repository and the number of stars it owns. In the meanwhile, this study reports a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent
GHTorrent [Gousios and Spinellis 2012] is designed to conduct independent repository mining through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all the repositories, and the resulting dataset can be used towards answering multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving raw data based on the results returned from the REST API. However, their tool does not have the ability to visualise the metadata and offer trend analysis at a high level.

MetricMiner
A similar tool is called MetricMiner [Sokol et al. 2013]. It is a web application that supports researchers in mining software repositories, doing data extraction and statistical inference from the data collected. This tool automatically clones the repository, processes the metadata and stores the data into the cloud, which gives it good scalability and fast computational speed in query answering, without users installing any software on their localhost.

GitcProc
GitcProc [Casalnuovo et al. 2017] is a tool that uses regular expressions to extract the changed lines within a repository, to facilitate answering project evolution questions. GitcProc can retrieve and summarise global statistics of the project metrics, including commits, commit dates and contributors. This tool can measure how many changes have taken place in Java projects and is also able to locate the changed files.

RepoVis
RepoVis [Feiner and Andrews 2018] is a new tool which provides visual overviews for software maintained in Git repositories. RepoVis is a client-server web application providing a full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its searchable functionality to GitHub, associated with a code-based search. All the visualizations are written out in SVG format.

2.2.4 Visualizing data in Repositories

Chronos
CHRONOS [Servant and Jones 2013] is a software tool that enables the visualisation of historical change inside software source code. This tool implements a zoomable user interface as a visualisation of the actual code, which supports developers from a high-level view (pattern recognition) down to a low-level view (reverse engineering) using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise the complete history of change, including the revisions that modified them. Inspired by this tool, our project uses visualisation to track the historical change of popular trends related to the keyword specified by users in GitHub.

Figure 2.1: git2net [Gote et al. 2019]

GEVOL
Collberg et al. [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph drawing technique, to deduce a better understanding of a program from its development history; it displays all the visualisations using a temporal graph visualizer.

This system aids in the discovery of the structure of the system and provides the user with a new way to discover the evolution of a program by visualising the change of the system. It extracts information about Java programs stored within a CVS version control system into three types of graphs: inheritance, control-flow and call-graphs.

git2net
git2net [Gote et al. 2019] is a software tool that facilitates the extraction of the co-editing network in git repositories. Similar to GEVOL, it uses text mining techniques to analyse the history of modifications within files. However, it addresses the importance of studying the social network in GitHub and gives the reader a broader view on the application of graph-based data analysis and modelling. Their tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter we detailed the web-based hosting service we selected (GitHub) and presented the concept of deep learning, with two popular frameworks and state-of-the-art neural network models. In the next chapter we will elaborate how we designed and implemented STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter we will outline our design and implementation for data extraction, and then we will detail the metrics we use to estimate the trend of deep learning framework and model usage on GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

Figure 3.1: Overview of STAMPER (1. Data Collection via the GitHub project search API with model-name keywords; 2. Repository Search; 3. [Optional] Data Selection via the GitHub code search API; local Data Visualisation)


Data Collection
We first collect all the repository metrics through the GitHub API. This step allows us to extract the history of all the repositories related to the keyword and record the metadata for each repository (repository-based search).

Repository Search
As each repository contains multiple contributors and forks, we subsequently extract and analyze whether changes were made or not based on the size information, and calculate the collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the forked information to create visual representations.

Data Selection
We have implemented a selector that allows excluding specific repositories not related to the desired repositories. The selector summarizes the frequency counts for keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis
Since each forked repository may involve re-development and modification, modification of forked repositories is included in our project. For all changed repositories, the size difference is analyzed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to deeply analyse and manipulate the data and even run statistical tests on the data set. To better understand those metrics, we divided them into multiple categories. Attributes that are not primary data from the GitHub API are explained in the data expansion part and labelled as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication, the user can make up to 5000 requests per hour; otherwise, the rate limits only allow up to 60 requests per hour [Git d].
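As an illustration of this authentication step, the sketch below builds a token-authenticated GitHub API request using only the Python standard library (the endpoint and token are placeholders; this is not STAMPER's exact code):

```python
from urllib.request import Request, urlopen

GITHUB_API = "https://api.github.com"

def build_request(path, token=None):
    """Build a GitHub API request; attaching an OAuth2 token lifts the
    rate limit from 60 to 5000 requests per hour."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = "token " + token
    return Request(GITHUB_API + path, headers=headers)

# e.g. urlopen(build_request("/rate_limit", token)) reports the
# remaining request quota for the authenticated user
req = build_request("/search/repositories?q=tensorflow", token="<OAuth2 token>")
```

The same header can equally be attached with a third-party HTTP client; only the `Authorization: token <OAuth2 token>` header matters for the rate limit.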


Type          Meta-data

Contributor   contribution (int) [Data Expansion]; login (user name, String); type (user/organization, String); contributors_url

Repository    created_at; description; full_name; language; size

Popularity    fork (Boolean); forks (int); forks_url; stargazers_count; watchers_count; unique_repos [Data Expansion]

Owner         id; login (username); type

Maintenance   has_issues (Boolean); has_wiki (Boolean); open_issues (int); pushed_at; updated_at; score

Table 3.2: Repository metadata collected by STAMPER

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deepcrawling operations to collect raw data

• Contribution
One repository generally involves multiple developers conducting software development. The project owner is not necessarily the person who contributes the largest amount of code, and the amounts of contribution made by the developers are potentially not the same. As a result, we further track that information by utilizing the GitHub API and record the number of contributions each developer made for each repository.

• Unique_repos
Popular repositories may have numerous related forked repositories owned by other users. The reasons behind the forking behaviour vary; our research would like to explore whether owners conduct subsequent development based on the original codebase. By comparing the size of each forked repository (Fi) with the original repository (O), we obtain all the forked repositories with a change of size (c):

Fi + c = O    (3.1)

3.4 Data Selection

Figure 3.2: Data Selection (Entity (Model) and API Keywords → Searching in Repository → Statistics)

Figure 3.3: Store in Local Disk (forked repository timestamps; unfiltered data is filtered by model-related keywords such as Bert, ResNet, CNN and grouped by model.py)


Figure 3.2 represents our method to search API usage in DL model-related repositories. GitHub provides a REST API to examine whether a repository contains the targeted API or not.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. The appearance frequency of each user-specified API is embedded directly within the returned result and can be matched in our program to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. This approach also allows users to build up a high-level picture of API usage across GitHub repositories.
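The per-repository frequency recording can be sketched as follows (an illustrative stand-in for STAMPER's selector; the function name and sample file contents are our own):

```python
import json
from collections import Counter

def keyword_frequencies(file_contents, keywords):
    """Count appearances of each user-specified API keyword across the
    file contents matched for one repository."""
    counts = Counter({kw: 0 for kw in keywords})
    for text in file_contents:
        for kw in keywords:
            counts[kw] += text.count(kw)
    return dict(counts)

# Illustrative contents of two files from a hypothetical repository
files = [
    "from keras.applications.resnet50 import ResNet50\nmodel = ResNet50()",
    "import tensorflow as tf",
]
freq = keyword_frequencies(files, ["ResNet50", "LSTM"])
result_json = json.dumps(freq)  # the overall result is written to disk as JSON
```

Keyed by the repository's full name, such records can then be aggregated across all repositories returned by the code search.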

In the meanwhile, we also provide multiple options for selecting APIs, which are stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the number of models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras applications library provides users the ability to load deep learning models and instantiate a model with default weights. Take ResNet as an example:

• With pre-defined models
Users may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, which could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they could all be used as sample keywords in STAMPER, such as:

1. keras.applications.resnet.ResNet50
2. keras.applications.resnet.ResNet101
3. keras.applications.resnet.ResNet152
4. keras.applications.resnet_v2.ResNet50V2
5. keras.applications.resnet_v2.ResNet101V2
6. keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to explore the components inside the ResNet class further, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, data selection heuristics vary: deep learning users and experts can define their searches according to their interests and preferences.

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py
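One possible heuristic for the self-defined case (a sketch; the regular expression and helper name are our own illustration, not STAMPER's exact rule) is to count class definitions whose name mentions ResNet:

```python
import re

# Matches e.g. "class ResNet50(tf.keras.Model):" or "class MyResNetBlock(object):"
SELF_DEFINED_RESNET = re.compile(r"class\s+\w*ResNet\w*\s*\(")

def count_self_defined(source_code):
    """Count self-defined ResNet-style classes in one source file."""
    return len(SELF_DEFINED_RESNET.findall(source_code))

sample = "class ResNet50(tf.keras.Model):\n    pass\n\nclass Trainer:\n    pass\n"
n = count_self_defined(sample)  # -> 1
```

A user interested in, say, Bert would swap in a different pattern; the heuristic itself stays the same.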

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, (iii) maintenance analysis.

The process of generating the visualizations from the three perspectives is illustrated in Figure 3.4. In the meanwhile, Chapter 5 gives an example of our collected repository metadata for deep learning models.

Figure 3.4: Overall Construction of the Visualizations (entities 1..n are functionally mapped to contribution-related, popularity-related and maintenance-related visualisations)

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped in weeks (with forks)

• Repository Creation Time vs Stars

Contribution

To additionally exploit the forking information, STAMPER finally supports the comparison between the original repository and its forked repositories. The work could be further extended by visiting the forked repository URL and tracing the commits.

As shown in Figure 3.5, the entity (E) we searched in GitHub may have multiple related repositories (Ri) and their corresponding forked repositories (Fi). Among the forked repositories, we call a changed forked repository Ci.

To examine whether there exist changes in forked repositories, and the difference between multiple entities, we calculate the difference using the equation below. Uniqueness percentage distributions are composed of all the percentages pi, each corresponding to its original repository Ri:

pi = (∑ Ci) / (∑ Fi)    (3.2)

Keyword                    Total of Repository (including Forks) Collected   Total of Original Repository Collected
ResNet tensorflow          6129                                              339
Bert tensorflow            13734                                             106
CNN tensorflow             39765                                             1000
LSTM tensorflow            19572                                             1000
Transformer tensorflow     7188                                              145
Wide and deep tensorflow   324                                               39

Table 3.1: Repositories Related to Tensorflow
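Equation (3.2) can be computed directly from per-fork change flags; a minimal sketch (the function and variable names are ours, not STAMPER's):

```python
def uniqueness_percentage(changed_flags):
    """p_i = (number of changed forks, sum of C_i) / (total forks, sum of F_i),
    following Equation (3.2)."""
    if not changed_flags:
        return 0.0
    return sum(changed_flags) / len(changed_flags)

# Four forks of one original repository; True means the fork's size
# differs from the original, i.e. c != 0 in Equation (3.1)
p = uniqueness_percentage([True, False, True, True])  # -> 0.75
```

Collecting one such p value per original repository yields the uniqueness percentage distribution plotted per entity.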

Figure 3.5: Examine Uniqueness after Forking (an entity E maps to repositories R1..Rn; each forked repository F1..Fn is marked changed Y/N)

• Percentage of Forked Repositories Unique from Origin (Boxplots)

• Uniqueness Percentage Distribution for Each Entity (Histograms)

• Entropy Distribution for Each Entity (Histograms)

Maintenance

• Development Time Boxplot for Each Entity

• Open Issues Distribution for Each Entity

3.6 Summary

In this chapter we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. In the meanwhile, we introduced and analysed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving, being built, trained and deployed by researchers. Our tool is available for analyzing such changes. We collected the historical information stored in GitHub and extracted the metadata in the repositories by using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development is a battlefield without smoke of gunpowder. Researchers, companies and developers are all trying to dominate the conversation in deep learning. There exists a variety of models to consider, but no common bridge to connect those ideas together. At the same time, historical data in GitHub are obscure and hard to find, and as a result many developers, especially experienced ones, remain in their comfort zone. With our study, we hope to shed some light on deep learning and highlight a few suggestions for the public.

This section aims to answer some questions related to both model usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2500 popular repositories based on the number of stars. However, due to the few studies about popularity in the GitHub ecosystem, there is no standardized feature to measure popularity. We analyze some potential features of each repository and make the hypothesis that popularity is strongly related to the stars each repository owns.

This decision will be justified in the following section with more background on GitHub.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activities in a repository they are watching; however, watching does not imply collaboration [Git b]. A watcher can watch a repository to receive notifications for new pull requests or issues that are created. Watchers can indicate how much interest the GitHub community gives to a repository.

Figure 4.2: Star Sort Menu [Git a]

Figure 4.1: Repository Watching [Git b]

• Stars
Starring a repository makes it easy for a user to keep track of a repository they are interested in. The starred repository will appear on the user's own host domain (https://[hostname]/stars). The star count is another metric to measure popularity within the GitHub community, and thus GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks
Forks are created when a user would like to make a copy of a repository. The user can fork a repository to suggest changes or to use it as a basis for a new project.

Based on the data we collected in the data collection stage, we extract the watchers_count, stargazers_count and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarize 86712 repositories (including their forked repositories) with their related popularity metrics and draw a scatter plot, as shown in Figure 4.3.
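Extracting the three attributes from a repository's metadata record is straightforward; the sketch below reuses the first row of Table 4.1 (the field names follow the GitHub repository JSON payload):

```python
import json

# One repository record as returned by the GitHub API (values from Table 4.1)
record = json.loads(
    '{"stargazers_count": 17940, "forks": 4661, "watchers_count": 17940}'
)
star = record["stargazers_count"]
forks = record["forks"]
watchers = record["watchers_count"]
```

Repeating this over all collected records yields the three vectors used in the scatter plot and correlation tests below.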


Figure 4.3: Popularity Metric

star    forks_count    watchers_count    model name
17940   4661           17940             Bert
12405   3637           12405             Bert
5263    1056           5263              Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, number of forks, and number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, which motivates a rank-based correlation measure rather than one that assumes normality.


Spearman Correlation Coefficient

Definition. The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman, 2011]. This method allows us to test for a rank-order relationship between two numerical or ordinal variables associated by a monotonic function (an increasing or decreasing relationship).
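To make the rank-order idea concrete, here is a minimal pure-Python sketch of Spearman's ρ for tie-free data (our own illustration; the analysis itself uses scipy.stats.spearmanr, shown later): each variable is replaced by its ranks, and Pearson's formula is applied to the ranks.

```python
def ranks(xs):
    # rank 1 = smallest value; assumes no ties, for simplicity
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(xs, ys):
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n + 1) / 2  # mean of the ranks 1..n
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # same for ry: both are permutations of 1..n
    return cov / var

# Hypothetical star/fork counts with the same ordering:
print(spearman_rho([3, 10, 1, 8], [2, 9, 0, 5]))  # → 1.0 (perfectly monotonic)
```

Because only the ordering matters, ρ = 1.0 whenever one variable is any increasing function of the other, which is why it suits the non-linear fork/star relationship seen in Figure 4.3.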

Hypothesis Testing

• H0: The variables (star, fork, and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

• Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

• Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Set α = 0.05. The p-values p1, p2, and p3 are all less than α, and the calculations above also show strong positive correlations, with coefficient values coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875, respectively.

This means it is very unlikely (at 95% confidence) that the variables are uncorrelated, and thus we can reject the null hypothesis that those variables are uncorrelated.

In the rest of the report, we treat the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, appear to be two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in terms of both creation and forks. Aside from models with a longer history, BERT and ResNet are two rising stars in the model competition. They arrived with significant improvements in architecture design and performance, as we described in the background section.

The model development community recently saw the release of multiple powerful frameworks, which are treated as baselines for building models. However, for many new models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.

Figure 4.4: Repositories with Forks — accumulated number of repositories created (including forks) over 2015–2019, per model (bert, cnn, lstm, ncf, resnet, transformer, wide deep; all tensorflow).

Figure 4.5: Repositories without Forks — accumulated number of repositories created over 2015–2019, per model.

Figure 4.6: Repository Trend in GitHub For Each Model — per-model repository counts over October 2015 to October 2019.

Figure 4.7: Creation Time vs Stars — number of stars against repository creation time, per model.

A fork is another copy of a repository; the forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we see a considerable difference between the total number of repositories created including forks and the total number created excluding forks. Most of the repositories related to deep learning models are therefore not original, which indicates that a considerable number of developers remain in the studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, unlike the previous summarizing method, we study the popularity of deep learning models as it changes over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the highest number of repositories created. Let us examine this using the data: in 2017, the number of created repositories gradually increased for CNN, and in 2018–2019 the creation trend continued to rise to an even higher level, which persists until now.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have two of the most important and most significant communities in the deep learning field. These networks are essential in both computer vision and NLP, where they form an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer
As the title indicates, ResNet and Transformer usage has improved significantly over the last two years. Differing from earlier structures such as CNN, both of them modify the original architecture and significantly improve results in computer vision and translation tasks.

Rising Star: BERT

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends, as depicted in the graphs, support the conclusion that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with how long the model has existed, but our data tells a different story.

Published in a 2016 paper, NCF draws the least attention in the GitHub community. This data also shows that there is no relationship between popularity (i.e., stars) and creation time; past trends are no guarantee of future ones, and it is possible that the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide and Deep model is similar: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the previous data also confirms that there is no significant rise in the use of this model.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development — fork-count (binned) histograms, per model.


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model, and as presented in Table 4.2 and Table 4.3, we can see that:

Model Name     Mean     STD      Min   25%   50%   75%   Max
Bert           498.65   2196.3   0     1     8     43    17940
CNN            106.84   611.97   2     3     8     32    13882
LSTM           48.82    214.22   0     1     2     13    2703
NCF            77       129.91   1     2     3     115   227
ResNet         46.88    221.43   0     0     1     8     2980
Transformer    186.79   1155.87  0     0     4     21    12408
Wide and Deep  16.23    36.80    0     0     1     8     146

Table 4.2: Stars Comparison

Model Name   Mean        STD         Min  25%  50%  75%   Max
Bert         128.214953  585.926617  0.0  0.0  1.0  16.5  4661.0
CNN          40.710      252.713617  0.0  1.0  4.0  14.0  6274.0
LSTM         17.793      71.956709   0.0  0.0  1.0  5.0   968.0
NCF          34.333333   58.603185   0.0  0.5  1.0  51.5  102.0
ResNet       17.442478   93.754994   0.0  0.0  0.0  3.0   1442.0
Transformer  53.518797   336.103826  0.0  0.0  1.0  6.0   3637.0
WideDeep     7.282051    16.364192   0.0  0.0  0.0  2.5   71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52), and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88), and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), ResNet (17.44), and LSTM (17.79).

Kruskal–Wallis Test. The Kruskal–Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The 7 models' distributions are the same.

• H1: The 7 models' distributions are different.

from scipy.stats import kruskal
stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                  dfNcf["star"].tolist(), dfResnet["star"].tolist(), dfTransformer["star"].tolist(),
                  dfWideDeep["star"].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the previous popularity exploration based on the number of repositories existing in GitHub. This unexpected result can be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building one's own Transformer or BERT model requires a large amount of time and effort, yet developers still show their interest in those novel deep learning models.

Figure 4.9: Star vs Contributors — stargazers_count against number_of_contributors, per model.

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.

Figure 4.10: Star vs Development Time — stargazers_count against develop_duration, per model.

Figure 4.11: Star vs Open Issues — stargazers_count against open_issues, per model.

Figure 4.12: Star vs Entropy Value — stargazers_count against entropy, per model.

Number of Contributors. From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Among all the repositories, the top-3 models with the most stars per contributor are CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor), and Bert (155.0 stars/contributor).

Model          Percentage of One-Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL related repositories


Development Time. From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value ≈ 0). This suggests that the longer a model develops, the more stars it will have (i.e., the model becomes more popular). The top-2 repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues. From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesize that the more popular a repository becomes, the more issues it will have. We further investigate this correlation in the following section.

Entropy. From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project, we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work of developing a deep learning model is divided evenly. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, whose contributions may not be equal, we introduce an information-theoretic approach to test whether the contributions are even or not.

Entropy
In particular, we compute the entropy H of each repository, defined as

$p_i = \frac{c_i}{\sum_i c_i}$    (4.1)

$H = -\sum_i p_i \log_2(p_i)$    (4.2)

where $i$ denotes the $i$-th contributor, $c_i$ is the $i$-th contributor's contribution, and $\sum_i c_i$ is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example:

Its contributions are summarized in Table 4.5, and its corresponding entropy can then be calculated:

$\text{Total} = 174 + 36 + 4 = 214$    (4.3)

$p_1 = \frac{174}{214}, \quad p_2 = \frac{36}{214}, \quad p_3 = \frac{4}{214}$    (4.4)

$H(\text{repository}) = -\left(\frac{174}{214}\log_2\frac{174}{214} + \frac{36}{214}\log_2\frac{36}{214} + \frac{4}{214}\log_2\frac{4}{214}\right) \approx 0.7826$    (4.5)

name           contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample Contributions to One Repository
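The calculation above can be sketched in Python (a minimal illustration with our own function name, not code from STAMPER):

```python
import math

def contribution_entropy(contributions):
    """Shannon entropy (base 2) of a repository's contribution distribution."""
    total = sum(contributions)
    probs = (c / total for c in contributions)
    # Skip zero-probability terms, for which p*log2(p) is defined as 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Contribution counts from Table 4.5 (dragen1860, ash3n, kelvinkoh0308):
print(contribution_entropy([174, 36, 4]))
```

A perfectly even two-person split gives the maximum entropy log2(2) = 1 bit, while a single-contributor repository gives 0; the skew in the example above pushes its entropy well below log2(3).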

The resulting distribution of entropy over all repositories can be used to determine whether a repository is developed unevenly: the lower the entropy, the higher the phase separation, which means the work is distributed more unevenly.

Figure 4.13 shows the distribution of the entropy value for all models. From these figures, we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.

Figure 4.13: Collaboration Entropy — entropy (binned) distribution histograms, per model.

4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repositories' metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models studied.
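As an illustration of how such a statistic can be estimated from repository metadata, here is one possible heuristic (our own sketch, not necessarily STAMPER's implementation): the GitHub API reports both created_at and pushed_at for each fork, and a fork whose pushed_at never advances past its created_at has received no new commits since forking.

```python
from datetime import datetime

def parse_ts(ts):
    # GitHub API timestamps look like "2019-03-01T12:00:00Z"
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

def percent_changed(forks):
    """forks: list of dicts with 'created_at' and 'pushed_at' ISO 8601 timestamps."""
    if not forks:
        return 0.0
    changed = sum(1 for f in forks
                  if parse_ts(f["pushed_at"]) > parse_ts(f["created_at"]))
    return 100.0 * changed / len(forks)

forks = [
    {"created_at": "2019-01-01T00:00:00Z", "pushed_at": "2019-01-01T00:00:00Z"},  # untouched
    {"created_at": "2019-01-02T00:00:00Z", "pushed_at": "2019-05-01T09:30:00Z"},  # modified
]
print(percent_changed(forks))  # → 50.0
```

This timestamp heuristic is cheap (one metadata record per fork), whereas the line-level change counts in Figure 4.16 require comparing the actual contents of each fork against the original.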

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot) — unique_percent per model.

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was to provide a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. Looking more closely, we can see at a glance not only that changes are rarely made after forking, but also that most changed repositories differ in size from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

Figure 4.15: Repository Uniqueness Distribution (%) — histograms of the percentage (binned) of unique content per forked repository, per model.

Figure 4.16: Repository Change Statistic — histograms of repository change (means, binned) relative to the original, per model.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them less visible. Second, a model itself may only be valid for a specific type of data, which makes it less robust and generalized, and less suited to developers' needs.

We conclude that the development effort across forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey the software maintenance problems of these deep learning related repositories. The overall purpose is to explore three sets of factors related to maintenance: development time, number of open issues, and the wiki page of each repository. In this project, we also explore whether the age of the project affects software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report, we calculate the age of each repository from its creation time, as depicted in the equation below:

$\text{age} = T(\text{updated\_at}) - T(\text{created\_at})$    (4.6)
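This age can be computed directly from the ISO 8601 created_at/updated_at fields returned by the GitHub API; a minimal sketch (our own illustration, with hypothetical timestamps):

```python
from datetime import datetime

def repo_age_days(created_at, updated_at):
    """Age in days between two GitHub-style ISO 8601 timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    delta = datetime.strptime(updated_at, fmt) - datetime.strptime(created_at, fmt)
    return delta.days

print(repo_age_days("2018-10-31T00:00:00Z", "2019-01-09T00:00:00Z"))  # → 70
```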

Figure 4.17 and Table 4.6 show how development time varies for each model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide and Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal–Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). Therefore, we hypothesize that for many of the earlier models, developers started using the open-source web community immediately after the first release.


Model        Max of days  Q3 of days  Median of days  Q1 of days  Min of days
Bert         779          229         110             32          0
Transformer  1254         321         142             11          0
Wide deep    1107         575         117             0.5         0
ResNet       1360         456.5       120             15          0
NCF          1120         476         216             8           0
LSTM         1812         621.25      315.5           47.25       0
CNN          1385         699.25      483             270.25      0

Table 4.6: Repository Development Time Statistics

Figure 4.17: Development Time Boxplot — development time in days, per model.

Figure 4.18: Development Time vs Number of Open Issues — develop_duration against open_issues, per model.

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and the Spearman correlation test, there is a weak correlation between those two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, given the high cost of maintaining them, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.299), CNN (3.414), and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository  Percentage of repositories having a Wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
ResNet                    98.817
Transformer               96.97
Wide deep                 100

Table 4.8: Descriptive statistics on the percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data we collected, deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software engineering aspects (popularity, contribution, and maintenance) of deep learning repositories using the data collected with STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.


Figure 4.19: Open Issues vs Number of Repositories — open_issues (binned) histograms, per model.


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitation and Improvement

Pinpointing the landmarks of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for an in-depth analysis of model construction and high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation in the future. For example, users may publish their models in the prototxt format, whereas in our project we only focused on deep learning models constructed using Python. Our findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, since a search cannot exceed the 1,000-repository boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this strategy still cannot capture all the repositories on GitHub. Other, more stratified samples might yield a more precise outcome.

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real life, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to build their own heuristics for data selection; experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or becoming a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project, we investigated and examined the popularity of deep learning models via the number of repositories existing in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., K-Means) to high-resolution time series data from commits.

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance, and contribution of deep learning related GitHub repositories and identified factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields and will serve the needs of people working at the intersection of social media analysis, data visualization, and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

The ML software landscape (both models & datasets) is evolving and changing rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past, based on previous research prototyping. Have the model stores kept their models up to date? What are the differences between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015); Processor: 2.2 GHz Intel Core i7; Memory: 16 GB 1600 MHz DDR3; Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

PyCharm 2019.1.3 (Professional Edition)


Build #PY-191.7479.30, built on May 30, 2019; Licensed to ANU / Xing Yu; JRE: 11.0.2+9-b159.60 x86_64; JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.; macOS 10.14.6

• Anaconda

– jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0
  - numpy==1.14.0
  - statistics==1.0.3.5
  - ratelimit==2.2.1
  - requests
  - altair
  - matplotlib==2.2.2
  - selenium
- Git

Datasets

asc_general/: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_general/: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_by_star/: bert tensorflow.json, lstm tensorflow.json, wide deep tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, cnn tensorflow.json, ncf tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json

by_update_time/: bert tensorflow.json, lstm tensorflow.json, resnet tensorflow.json, wide deep tensorflow.json, cnn tensorflow.json, ncf tensorflow.json, transformer tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json

forked_timestamp/: bert tensorflow.csv, lstm tensorflow.csv, resnet tensorflow.csv, wide deep tensorflow.csv, cnn tensorflow.csv, ncf tensorflow.csv, transformer tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

- Before You Begin
- Prerequisites
- Install
- Running
- Test
- High Level Description of all Modules & Datasets
- Authors
- License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code:

- PyCharm
- Anaconda
- Amphetamine (on the Mac App Store): keeps the Mac awake with this useful app (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt.

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run `python3 model_searcher.py` to get keyword-related repository metadata from GitHub in the `output` folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions `call_api` and `call_api_dir`.

Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample case: in `main()`, change `keywords` to the terms of interest. The resulting JSON file will then be `output/bert.JSON`. The sorting method can be customized in the functions `get_total_pages` and `request_ith_page`: `sort` can be `updated` or `stars`, and `order` can be `asc` or `desc`.
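For orientation, the kind of request `model_searcher.py` issues can be sketched as below. This is an illustrative reconstruction rather than the submitted code: the helper name `build_search_url` and its parameters are assumptions; only the GitHub Search API endpoint and the `sort`/`order` query parameters are real.

```python
from urllib.parse import urlencode

GITHUB_SEARCH = "https://api.github.com/search/repositories"

def build_search_url(keyword, sort="stars", order="desc", page=1):
    """Build a GitHub repository-search URL for one results page.
    sort: 'stars' or 'updated'; order: 'asc' or 'desc'."""
    query = urlencode({"q": keyword, "sort": sort, "order": order,
                       "page": page, "per_page": 100})
    return f"{GITHUB_SEARCH}?{query}"

# An authenticated request would add the token as a header, e.g.
# {"Authorization": "token <YOUR_TOKEN>"}, to raise the rate limit.
url = build_search_url("bert tensorflow", sort="updated", order="asc")
```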

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the fork timestamps in `forked_timestamp`.

3. Data Selection (Optional)

Upgrade pip and install the requirements if you have not already:

    pip3 install --upgrade pip
    pip3 install -r requirements.txt

Run `python3 repository_filter.py` to get your code-related repositories with statistics in the `filtered_repo` folder. Run `python3 filtered_repo.py` to filter your data.

Note: your keywords can be customized in `model_keyword.py`. We store all the previous experiment data in `tensorflow_model_filtering` and `pytorch_model_filtering`.

4. Data Visualization

- Popularity: run `python3 visualizations/popularity.py` and get your graphs in `visualizations/graphs/popularity`.
- Maintenance: run `python3 visualizations/maintenance.py` and get your graphs in `visualizations/graphs/maintenance`.
- Contribution: run `python3 visualizations/contribution.py` and get your graphs in `visualizations/graphs/contribution`.
- Multi Correlations: run `python3 visualizations/multi_variable.py` and get your graphs in `visualizations/graphs/multi_variable`.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in `test.py`. This module records all the unreachable links and writes them to the file `unreachable_urls.txt`.

Usage: change the elements in `keywords`, then run `python3 test.py`. All the unreachable links will be written to `unreachable_urls.txt`.
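The check `test.py` performs could look roughly like the sketch below. This is a hedged reconstruction, not the submitted code: the function names and the injected `fetch` callable are assumptions, chosen so the logic can be exercised without network access.

```python
def find_unreachable(urls, fetch):
    """Return the URLs for which fetch(url) fails or is non-200.
    fetch is injected so the check can be tested offline."""
    unreachable = []
    for url in urls:
        try:
            if fetch(url) != 200:
                unreachable.append(url)
        except OSError:
            unreachable.append(url)
    return unreachable

def write_report(urls, fetch, path="unreachable_urls.txt"):
    """Run the check and persist the broken links, one per line."""
    bad = find_unreachable(urls, fetch)
    with open(path, "w") as f:
        f.write("\n".join(bad))
    return bad
```

A real `fetch` could be something like `lambda u: urllib.request.urlopen(u).status`.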

Customizing Your Own Search

In module `Model.py`, define your own entity lists (e.g. `tensorflow_models`). In the constructor `Model`, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), you can construct a model by calling the constructor `Model`, e.g. `bert = Model("bert tensorflow", "desc_by_star")` (parameters: model name and repository metadata subfolder). Then you can call this object with its relative data easily (`from Model import bert` and use `bert` as you go along).

Customize Keywords

In module `model_keyword.py`, import your instantiation (`lstm`) and call `add_keywords`, e.g.:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection

output/
- asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
- asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv


3. After Data Selection (Optional)

filtered_repo/
- bert.json
- pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
- tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs

graphs/
- contribution/: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
- maintenance/: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
- multi_variable/: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
- popularity/: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg
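Among the generated graphs, entropy_distribution.svg and star_to_entropy.svg are built from an entropy measure of how evenly commits are spread across contributors (computed in entropy_calculation.py). A minimal sketch of such a measure follows; it assumes only that the measure is Shannon entropy over per-contributor commit shares, which is our reading rather than the verbatim submitted code.

```python
import math

def contributor_entropy(commit_counts):
    """Shannon entropy (bits) of the distribution of commits over
    contributors: 0 for a single-author repository, higher when
    commits are spread evenly across many contributors."""
    total = sum(commit_counts)
    shares = [c / total for c in commit_counts if c > 0]
    return -sum(p * math.log2(p) for p in shares)

solo = contributor_entropy([100])             # one dominant contributor
even = contributor_entropy([25, 25, 25, 25])  # four equal contributors
```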


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu. Stamper image from http://bestpriceforrubberstamps.com (License: Free for personal use only).

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)



List of Tables

2.1 Deep Learning History
2.2 Timeline

3.1 Repositories Related to Tensorflow

4.1 Popularity metric for repositories
4.2 Stars Comparison
4.3 Forks Comparison
4.4 Percentage of one-contributor development for DL related repositories
4.5 Sample Contributions to One Repository
4.6 Repository Development Time Stat
4.7 Repository Open Issue Statistics
4.8 Descriptive statistics on percentage of Wiki Existence


Chapter 1

Introduction

1.1 Trace Deep Learning use through GitHub

GitHub is one of the largest web-based hosting communities in the world and contains a rich source of data facilitating different software engineering projects. This massive database allows developers and researchers to publish their work.

GitHub can also be used for popularity measurement. Developers can construct their social networks in GitHub in multiple ways, such as joining an organisation or starring a repository they are interested in. These features make repositories in GitHub easily accessible and the best place to conduct empirical studies.

The deep learning research community is continually evolving. Not limited to academic research, deep learning is allowing businesses to use data to teach computers how to learn. As a result, a number of machine learning and deep learning startups have come into existence worldwide.

The development of deep learning models sometimes runs into software engineering problems. Studies of the quality of deep-learning-related projects are sparse, and few researchers focus on usage outside academia. With the expanding range and deepening degree of deep learning use, we would like to test whether developers keep up with the latest deep learning upsurge in academia.

A tool to extract the metadata of historical repositories, with integrated visualisation, based on the vast corpus of GitHub repositories is not currently available. To fill this gap we present our tool, STAMPER. It is implemented in Python and supports GitHub repository metadata extraction based on repository keyword search and code segment search using the GitHub API. It can also analyse the modification of forked repositories relative to the original repository and capture the repository difference.

We study the historical trend of deep learning models and frameworks from repositories in GitHub, and further demonstrate how repository metadata can be used to see the deep learning trend. Our project provides a novel method to study deep learning frameworks and models from a historical aspect; in the meanwhile, our work creates a new aspect of empirical study in deep learning.


1.2 Contribution

• We introduce STAMPER, a Python tool that can be used to extract the metadata characterizing historical open source projects on GitHub, based on researchers' interests.

• We utilize STAMPER in a case study analysing the usage of deep learning models in TensorFlow, and further demonstrate how we extract a rich set of features and establish connections between those features and popularity.

• We create and mine a new dataset for further research use.

1.3 Report Outline

The rest of this report is structured as follows. Chapter 2 provides the timeline for deep learning frameworks and deep learning models; some background knowledge is presented in that chapter, and previous works on software mining tools and GitHub visualizations are recorded there as well.

Chapter 3 provides an overview of our work and introduces our proposed methodology to extract the metadata from GitHub repositories.

Chapter 4 presents a case study in which we use our tool to extract deep-learning-related repositories from GitHub and to trace the landscape of popular deep learning models. The visualizations generated using our tool can aid researchers in understanding past and current trends easily.

Referring back to the critical questions posed, we conclude our project and highlight the next steps in Chapter 5.

Chapter 2

Background and Related Work

Software repository mining lets researchers study the historical trend of software engineering practice effectively. The use of repository mining is based on the use of web hosting services. There exist multiple approaches to conducting this; in the first section we introduce some background knowledge on web-based hosting services. We then introduce some popular deep learning frameworks in Section 2.1.1. Finally, in Section 2.2 we detail some previous works which conduct software repository mining and identify the main patterns in exploring repository popularity.

2.1 Background

2.1.1 Deep learning

Different from traditional machine learning, deep learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction [LeCun et al., 2015]. Methods in deep learning can dramatically improve human life in multiple aspects, from image classification and speech recognition to machine translation, and even autonomous cars.

The usage of deep learning has grown tremendously in the past few years with the introduction and improvement of hardware availability (GPUs), big data, and cloud providers such as Amazon Web Services (AWS). Beyond this, large companies create their own research teams to develop deep learning algorithms and integrate them into frameworks such as PyTorch and TensorFlow. In the meanwhile, they share their datasets and model training tutorials, helping startup companies build state-of-the-art products with minimal effort and time.

Though deep learning has become a hot topic in recent years, the history behind deep learning has been evolving since the 1950s, as summarised in Table 2.1 [Subramanian, 2018].

Previously, developers and researchers required expertise in C++ or CUDA to implement deep learning models and algorithms. However, thanks to large technology companies and organisations, people with knowledge of a scripting language (Python) can now also build their own deep learning algorithms.


Technique | Year
Neural network | 1943
Backpropagation | 1960s
Convolutional Neural Network | 1979
Recurrent Neural Network | 1980
Long Short-Term Memory | 1997

Table 2.1: Deep Learning History

In Sections 2.1.1.1 and 2.1.1.2 we discuss some state-of-the-art deep learning frameworks used today, both in academia and industry.

2.1.1.1 TensorFlow

TensorFlow is a machine learning system that operates at large scale [Abadi et al., 2016]. It was initially released by the Google Brain team in November 2015, having been developed under the name DistBelief. TensorFlow then released its official 1.0.0 version on 11 February 2017, introducing the ability to run on multiple CPUs and GPUs.

Currently, Google uses this framework in numerous ways to improve its search engine, translation, recommendation systems, and image recognition and captioning. The TensorFlow library contains different APIs to build deep learning models at scale, such as ResNet, CNN or LSTM. The architecture of this framework gives developers the flexibility to experiment with new optimisations and Google's training algorithms. It currently supports numerous applications from front end to mobile; this flexibility and scalability make it one of the most popular deep learning frameworks in the world.

2.1.1.2 PyTorch

Based on the Torch library, PyTorch can be seen as a Python front end to the Torch engine, which provides the capability to define mathematical functions and compute gradients. Similar to TensorFlow, PyTorch has good GPU and CPU support and is primarily developed by Facebook's artificial intelligence research group. The official initial release of PyTorch was in October 2016; its dynamic computation capability gives it great flexibility in building complex architectures.

However, unlike TensorFlow, which can seamlessly integrate into real industrial applications, PyTorch is primarily developed by researchers and scientists and is not easy or recommended for production usage in certain scenarios.


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and speed up project development to survive in this keen competition. Winning trust from the public with high-quality service is thus required.

Initially, we wanted to conduct our research on the latest model stores, such as AWS SageMaker, the Azure machine learning service, the Wolfram Neural Net Repository and ONNX. However, the algorithms and datasets used there are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand over the public. To avoid these problems and gain a deeper insight into usage in society, we chose a framework with models of greater transparency and substantial usage: TensorFlow.

API Referenced

Since this project involves a range of deep learning models, we began by developing an understanding of how to build and train neural networks. The construction of networks may rely on the layers APIs provided by Keras and TensorFlow. In the meanwhile, TensorFlow recently introduced Estimator APIs to simplify the procedure of training, evaluation, prediction and export.

Convolutional Neural Network (CNN)

The Convolutional Neural Network is one of the most established algorithms among all the deep learning models and one of the most dominant algorithms in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons in a grid pattern. Typically, a CNN consists of three types of layers: fully connected layers, convolution layers and pooling layers. Convolution and pooling layers conduct the feature extraction, and the fully connected layer maps the extracted features into the final output. Layers are interconnected, and thus the extracted features are transferred layer by layer.
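To make the feature-extraction step concrete, here is a minimal sketch of a single 2-D convolution (valid padding, stride 1) in plain Python. It is illustrative only, not code from this project or from any framework API.

```python
def conv2d(image, kernel):
    """Valid 2-D convolution (really cross-correlation, as in most
    deep learning frameworks): slide the kernel over the image and
    take the elementwise-product sum at each position."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

# a vertical-edge kernel on a tiny image with an edge down the middle:
# the edge between columns 1 and 2 shows up as a nonzero response
img = [[0, 0, 1, 1]] * 4
k = [[1, -1], [1, -1]]
feature_map = conv2d(img, k)
```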

Long short-term memory (LSTM)

Different from traditional neural networks, which cannot memorise previous data, a special kind of recurrent neural network, Long Short-Term Memory (LSTM), provides researchers with an effective way to perform persistent-data tasks; it is capable of learning long-term dependencies [Hochreiter and Schmidhuber, 1997]. Specifically, LSTM combines the forget and input gates into an update gate. It is capable of learning the dependencies in historical data and making predictions from the information remembered previously. Inside the LSTM, instead of using the

TensorFlow official models are chosen for our project. The TensorFlow official models contain a collection of deep learning models built with their high-level APIs (https://github.com/tensorflow/models/tree/master/official).


linear layer, there is a small network inside the LSTM which performs the function independently.

LSTM is one of the most common uses of recurrent neural networks. This model is generally used with sequential data and can solve language modelling problems such as NLP concepts (word embeddings, encoders).
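As an illustration of what happens inside one such gated cell, below is a minimal single-step LSTM cell in plain Python, using scalar inputs and hand-set toy weights (shared across gates for brevity). It is a pedagogical sketch of the standard LSTM equations, not code from this project.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w=0.5, u=0.3, b=0.0):
    """One scalar LSTM step with shared toy weights: compute the
    forget, input and output gates, update the cell state, and
    emit the new hidden state."""
    f = sigmoid(w * x + u * h_prev + b)          # forget gate, in (0, 1)
    i = sigmoid(w * x + u * h_prev + b)          # input gate
    o = sigmoid(w * x + u * h_prev + b)          # output gate
    c_tilde = math.tanh(w * x + u * h_prev + b)  # candidate cell state
    c = f * c_prev + i * c_tilde                 # new cell state
    h = o * math.tanh(c)                         # new hidden state
    return h, c

# feed a short sequence step by step, carrying (h, c) forward
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.2]:
    h, c = lstm_step(x, h, c)
```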

Residual Network (ResNet)

One of the problems deep learning models face is that as the number of layers increases past a certain point, accuracy stops improving. This problem motivated modern network architectures such as the residual network (ResNet) and Inception, which address it through residual connections.

ResNet normally solves the problem described above by fitting a residual mapping via an added shortcut connection. Each ResNet block contains a series of layers and a shortcut connection component.
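The shortcut connection can be written in one line: the block computes y = F(x) + x, so the layers only have to learn the residual F. A minimal sketch follows; the transformation `residual_fn` here is a stand-in for a block's layer stack, not a real ResNet implementation.

```python
def residual_block(x, residual_fn):
    """Apply a transformation and add the identity shortcut:
    the block outputs F(x) + x rather than F(x)."""
    return residual_fn(x) + x

# if the best mapping is close to the identity, the layers only
# need to learn a small residual F(x) near 0
out = residual_block(3.0, lambda v: 0.1 * v)
```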

Bidirectional Encoder Representations from Transformers (BERT)

BERT is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al., 2018].

It was first released in google-research/bert on GitHub on 1 November 2018. It can serve a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications, and it aims to predict the relationships between sentences by analysing the whole sentence holistically [Devlin et al., 2018].

Attention is all you need (Transformer)

Most problems in deep learning can be seen as a form of sequence-to-sequence mapping and can be solved using a common type of architecture, the encoder-decoder architecture.

The encoder and decoder both consist of stacks of identical layers, and each layer consists of two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network (FFN) [Vaswani et al 2017].
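The core of each self-attention sub-layer is the scaled dot-product attention of Vaswani et al [2017], which can be sketched in NumPy (single head, no masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)   # row-wise softmax
    return w @ V
```

Each output position is a weighted average of the value vectors, with weights given by the similarity between its query and all keys.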

Neural Collaborative Filtering (NCF)

Neural Collaborative Filtering (NCF) is a neural network architecture that utilises the non-linearity of neural networks to build recommendation systems [He et al 2017]. It demonstrates that matrix factorisation can be expressed as a special case of neural collaborative filtering. To add additional non-linearity, the model introduces a multi-layer perceptron (MLP) module alongside the generalised matrix factorisation (GMF) layer.
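As a rough sketch (not He et al's reference implementation), the fused GMF + MLP forward pass for one user-item pair might look like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ncf_score(p_u, q_i, mlp_weights, h):
    # GMF path: element-wise product of user and item embeddings.
    gmf = p_u * q_i
    # MLP path: concatenated embeddings through ReLU layers (non-linearity).
    a = np.concatenate([p_u, q_i])
    for W in mlp_weights:
        a = np.maximum(0.0, W @ a)
    # Fuse both paths and map to an interaction probability.
    return sigmoid(h @ np.concatenate([gmf, a]))
```

The names p_u, q_i (embeddings), mlp_weights and h (output weights) are our illustration of the two-path design, not identifiers from the paper's code.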

Wide and Deep Learning

Since linear models are not good at generalising across unique features, deep models are introduced to solve this problem. Deep models use embedding vectors for every query and can then generalise by coupling items and queries.

To overcome over-generalisation, the Google research team introduced Wide & Deep Learning: jointly trained wide linear models and deep neural networks that combine the benefits of memorisation and generalisation for recommender systems [Cheng et al 2016].

2.1.3 Summarized Timeline

Model Name     Year Introduced
CNN            1980s
LSTM           1997
ResNet         2015
Wide & Deep    2016
NCF            2017
Transformer    2017
Bert           2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest hosts for open-source distributed version control (DVCS) [Gousios et al 2014]. The distributed version control system enables contributors to submit a set of changes and integrate them in the main development branch. The use of git is based on pragmatic needs, combining the advantages of version control and collaborative development.

GitHub can also yield insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project. Thus, the number of stars can reveal popularity from a software-development research perspective, and it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth, using time-series metadata derived from 2279 accessible GitHub repositories. They also found that slow growth is more common for overpopulated application domains and for old repositories. Moreover, they conclude that the most common domains on GitHub include web libraries and frameworks and non-web libraries and frameworks.

Building on this work, we examine whether deep learning framework repositories follow a similar popularity trend pattern. At the same time, we also study whether a relationship exists between three factors: forks, stars and watchers.

Predicting the Popularity of GitHub Repositories

In the same year, Borges et al [2016a] published another paper on predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regression to predict the number of stars of GitHub repositories, so that project owners can see how their projects are performing in the open-source community.

Their study examines the correlation between the popularity of a repository and the number of stars it owns. The study reports a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent

GHTorrent [Gousios and Spinellis 2012] is designed to conduct independent repository mining through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all repositories, and the resulting dataset can be used to answer multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving raw data from results returned by the REST API. However, their tool cannot visualise the metadata or offer high-level trend analysis.

MetricMiner

A similar tool is MetricMiner [Sokol et al 2013], a web application that supports researchers in mining software repositories, extracting data and drawing statistical inferences from the collected data. The tool automatically clones the repository, processes the metadata and stores the data in the cloud, giving it good scalability and fast query answering without users installing any software locally.

GitcProc

GitcProc [Casalnuovo et al 2017] is a tool that uses regular expressions to extract the changed lines within a repository, to facilitate answering project-evolution questions. GitcProc can retrieve and summarise global project statistics, including commits, commit dates and contributors. It can measure how many changes have taken place in Java projects and can also locate the changed files.

RepoVis

RepoVis [Feiner and Andrews 2018] is a newer tool which provides visual overviews of software maintained in Git repositories. RepoVis is a client-server web application providing full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adopts its searchable functionality for GitHub, combined with a code-based search. All visualisations are written out in SVG format.

2.2.4 Visualizing Data in Repositories

Chronos

CHRONOS [Servant and Jones 2013] is a software tool that enables visualisation of historical changes inside software source code. The tool implements a zoomable user interface as a visualisation of the actual code, supporting developers from a high-level view (pattern recognition) down to a low-level view (reverse engineering) using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise their complete history of change, including the revisions that modified them. Inspired by this tool, our project uses visualisation to track the historical change of the popularity


Figure 2.1: git2net [Gote et al 2019]

trend related to the keywords specified by users in GitHub.

GEVOL

Collberg et al [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph-drawing technique, to derive a better understanding of a program from its development history; all visualisations are displayed with a temporal graph visualizer.

The system aids discovery of the structure of a system and provides users a new way to observe the evolution of a program by visualising system change. It extracts information about Java programs stored within a CVS version control system into three types of graphs: inheritance, control-flow and call graphs.

git2net

git2net [Gote et al 2019] is a software tool that facilitates extraction of co-editing networks from git repositories. Similar to GEVOL, it uses text-mining techniques to analyse the history of modifications within files. In addition, it addresses the importance of studying social networks in GitHub and gives the reader a broader view of graph-based data analysis and modelling. The tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter we detailed the web-based hosting service we selected (GitHub) and presented the concept of deep learning with two popular frameworks, together with state-of-the-art neural network models. In the next chapter we elaborate how we designed and implemented STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter we outline our design and implementation for data extraction, and then detail the metrics we use to estimate the trend of deep learning framework and model usage in GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

Figure 3.1: Overview of STAMPER (keywords and model names feed the Git project search API, followed by an optional data selection step via the Git code search API, with results stored locally for data visualisation)



Data Collection
We first collect all repository metrics through the GitHub API. This step allows us to extract the history of all repositories related to the keyword and record the metadata for each repository (repository-based search).

Repository Search
Since each repository contains multiple contributors and forks, we subsequently extract and analyse whether changes were made, based on the size information, and calculate a collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the fork information to create visual representations.
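As a minimal sketch of the collaborative factor mentioned above (assuming it is the Shannon entropy of contributors' commit shares; STAMPER's exact definition may differ):

```python
import math

def contribution_entropy(contributions):
    # Shannon entropy of contributors' commit shares: 0 when a single
    # developer wrote everything, larger when work is spread evenly.
    total = sum(contributions)
    shares = [c / total for c in contributions if c > 0]
    return -sum(p * math.log2(p) for p in shares)
```

A one-person project scores 0 bits, while two equal contributors score 1 bit, so higher values indicate more collaborative repositories.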

Data Selection
We implemented a selector that allows users to exclude repositories unrelated to the desired ones. The selector summarises frequency counts for the keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis
Since each forked repository may involve re-development and modification, analysis of fork modifications is included in our project. For all changed repositories, the size difference is analysed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to analyse and manipulate the data in depth and even run statistical tests on the dataset. To better understand these metrics, we divided them into multiple categories. For attributes that are not primary data from the GitHub API, we explain them in the data-expansion part and label them as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering an OAuth2 token at the start of the program. After authentication, the user can make up to 5000 requests per hour; otherwise, the rate limit only allows up to 60 requests per hour [Git d].
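A minimal sketch of building such an authenticated request (the helper name is ours, not part of STAMPER):

```python
import urllib.request

def build_search_request(query, token=None):
    # GitHub repository search endpoint; an authenticated request is
    # allowed 5000 requests/hour, an anonymous one only 60.
    url = "https://api.github.com/search/repositories?q=" + query
    req = urllib.request.Request(url)
    if token:
        req.add_header("Authorization", "token " + token)
    return req
```

The request can then be sent with urllib.request.urlopen and the JSON body parsed to extract each repository's metadata fields.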


Type          Meta-data
Contributor   contribution: int [Data Expansion]; login (user name): String; type (user/organization): String; contributors_url
Repository    created_at; description; full_name; language; size
Popularity    fork: Boolean; forks: int; forks_url; stargazers_count; watchers_count; unique_repos [Data Expansion]
Owner         id; login (username); type
Maintenance   has_issues: Boolean; has_wiki: Boolean; open_issues: int; pushed_at; updated_at; score

Table 3.2: Repository metadata collected from the GitHub API

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deepcrawling operations to collect raw data

• Contribution
One repository generally involves multiple developers conducting software development. The project owner is not necessarily the person who contributes the most code, and the amounts contributed by different developers are potentially unequal. As a result, we further track this information using the GitHub API and record the number of contributions each developer made to each repository.

• Unique_repos
Popular repositories may have numerous forked repositories owned by other users, and the reasons behind forking vary. Our research explores whether users conduct subsequent development based on the original codebase. By comparing the size of the forked


repository (F_i) and the original repository (O), we obtain all the forked repositories with a change of size (c):

F_i + c = O    (3.1)
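Equation (3.1) rearranges to c = O - F_i, so a sketch of the per-fork size-change computation (helper names are ours) might be:

```python
def size_changes(original_size, fork_sizes):
    # From Equation (3.1), F_i + c = O, so each fork's size change is
    # c = O - F_i; a non-zero c flags a fork that diverged from origin.
    return [original_size - f for f in fork_sizes]

def changed_forks(original_size, fork_sizes):
    # Keep only the forks whose size differs from the original.
    return [f for f in fork_sizes if original_size - f != 0]
```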

3.4 Data Selection

Figure 3.2: Data Selection (entity/model API keywords drive searching in repositories, producing statistics)

Figure 3.3: Store in Local Disk (unfiltered forked-repository data, with timestamps, is filtered by model-related keywords such as Bert, ResNet and CNN, grouped via model_keyword.py)


Figure 3.2 illustrates our method for searching API usage in DL-model-related repositories. GitHub provides a REST API to examine whether a repository contains the targeted API.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. The appearance frequency of the user-specified API is embedded directly in the returned result and can be matched in our program to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API, and it allows users to build a high-level picture of API usage across GitHub repositories.
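A simplified local sketch of this counting step (the helper is illustrative; STAMPER performs it through the GitHub code search API):

```python
def api_usage_counts(files, keywords):
    # files maps path -> source text; returns total occurrences of each
    # keyword, a rough proxy for how heavily a repository uses each API.
    counts = {k: 0 for k in keywords}
    for text in files.values():
        for k in keywords:
            counts[k] += text.count(k)
    return counts
```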

We also provide multiple options for selecting APIs, which are stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the number of models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras application library provides users the ability to load deep learning models and instantiate a model with default weights. Take ResNet as an example.

• With pre-defined models
Users may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py. This denotes that the repository owner may use a pre-trained model from the Keras library, which could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they could all be used as sample keywords in STAMPER, such as:

1. keras.applications.resnet.ResNet50
2. keras.applications.resnet.ResNet101
3. keras.applications.resnet.ResNet152
4. keras.applications.resnet_v2.ResNet50V2
5. keras.applications.resnet_v2.ResNet101V2
6. keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to further explore the components inside the ResNet class, they may first examine whether the class is self-defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, the data selection heuristic varies. Deep learning
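One possible heuristic for the self-defined case, sketched with a regular expression (the pattern is our illustration, not STAMPER's exact rule):

```python
import re

# Illustrative pattern: a line defining a class whose name contains "ResNet".
RESNET_DEF = re.compile(r"^\s*class\s+(\w*ResNet\w*)\s*[(:]", re.MULTILINE)

def find_self_defined_resnets(source):
    # Return the names of ResNet-style classes defined in a source file.
    return RESNET_DEF.findall(source)
```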

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py


users and experts can define their searches according to their interests and preferences.

3.5 Constructing the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualisations from these three perspectives is illustrated in Figure 3.4. Chapter 5 gives an example using our collected repository metadata for deep learning models.

Figure 3.4: Overall Construction of the Visualizations (entities are functionally mapped to contribution-related, popularity-related and maintenance-related visualisations)

Popularity

• Total number of repositories, with forks (line chart)

• Total number of repositories, without forks (line chart)

• Number of creations over time, grouped by week (with forks)

• Repository creation time vs stars

Contribution

To further exploit the forking information, STAMPER finally supports comparison between original and forked repositories. The work could be extended by visiting each forked repository's URL and tracing its commits.

As shown in Figure 3.5, the entity (E) we search for in GitHub may have multiple related repositories (R_i) with corresponding forked repositories (F_i). Among the forked repositories, we denote a changed fork by C_i.

To examine whether changes exist in forked repositories, and the difference between multiple entities, we calculate the difference using the equation down


Keyword                     Total of Repositories (including Forks) Collected    Total of Original Repositories Collected
ResNet tensorflow           6129                                                 339
Bert tensorflow             13734                                                106
CNN tensorflow              39765                                                1000
LSTM tensorflow             19572                                                1000
Transformer tensorflow      7188                                                 145
Wide and deep tensorflow    324                                                  39

Table 3.1: Repositories Related to Tensorflow


below. The uniqueness percentage distribution is composed of all the percentages p_i, each corresponding to an original repository R_i:

p_i = (∑ C_i) / (∑ F_i)    (3.2)
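Equation (3.2) can be sketched directly; changed_flags is a hypothetical per-fork boolean list for one original repository:

```python
def uniqueness_percentage(changed_flags):
    # p_i = (changed forks C_i) / (all forks F_i) for one original
    # repository, following Equation (3.2).
    if not changed_flags:
        return 0.0
    return sum(changed_flags) / len(changed_flags)
```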

Figure 3.5: Examining Uniqueness after Forking (each repository of an entity E has forked repositories flagged as changed Y/N)

• Percentage of forked repositories unique from origin (boxplots)

• Uniqueness percentage distribution for each entity (histograms)

• Entropy distribution for each entity (histograms)

Maintenance

• Development time boxplot for each entity

• Open issues distribution for each entity

3.6 Summary

In this chapter we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates scalable extraction of original repositories, together with their forks, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. We also introduced and analysed two novel features of GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving, being built, trained and deployed by researchers. Our tool is available for analysing such changes. We collected the historical information stored in GitHub and extracted the repository metadata using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development is a battlefield without the smoke of gunpowder. Researchers, companies and developers are all trying to dominate the conversation in deep learning. There exists a variety of models to think in, but no common bridge connects those ideas. Historical data in GitHub are opaque and hard to find, and as a result many developers, especially experienced ones, remain in their original comfort zone. With our study we hope to shed some light on deep learning and highlight a few suggestions for the public.

This section aims to answer questions related to both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al [2016b] collected 2500 popular repositories based on the number of stars. However, given the scarcity of studies on popularity in GitHub, there is no standardised feature for measuring it. We analyse some potential features of each repository and hypothesise that popularity is strongly related to the number of stars a repository owns.

This decision is justified in the following section, together with more GitHub background.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching; however, watching does not make them collaborators [Git b]. A watcher could watch



Figure 4.2: Star Sort Menu [Git a]

a repository to receive notifications for newly created pull requests or issues. The watcher count indicates how much interest the GitHub community gives to the repository.

Figure 4.1: Repository Watching [Git b]

• Stars
Starring a repository makes it easy for users to keep track of repositories they are interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks
A fork is created when a user makes a copy of an original repository. The user can fork a repository to suggest changes or to use it as the basis for a new project.

Based on the data gathered in the data collection stage, we extract the watchers_count, stargazers_count and forks attributes and test whether these attributes are correlated. In Table 4.1 we summarise 86712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star     forks_count    watchers_count    model name
17940    4661           17940             Bert
12405    3637           12405             Bert
5263     1056           5263              Bert

Table 4.1: Popularity metrics for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we should not assume normality.


Spearman Correlation Coefficient

Definition
The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman 2011]. This statistical method allows us to test for a rank-order relationship between two numerical or ordinal variables associated with a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork and watcher) do not have a relationship with each other.

• H1: There is a relationship between the three variables.

Result

• Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

• Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

• Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Setting α = 0.05, the p-values p1, p2 and p3 are all less than α. From the calculation above we also find strong positive correlations, with coef1 = 0.875, coef2 = 1.0 and coef3 = 0.875 respectively.

This means that the likelihood that the testing data are uncorrelated is very low (95% confidence), and thus we can reject the hypothesis that these variables are uncorrelated.

In the rest of the report, we treat the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, appear to be two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from these models with a longer history, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as described in the background section.

The model development community has recently seen the release of multiple powerful frameworks that are treated as baselines for building models. However, for many new models, usage has not grown in abundance, for example the Wide and Deep model and the NCF model.


Figure 4.4: Repositories with Forks (accumulated number of repositories created, including forks, per model, 2015-2019)

Figure 4.5: Repositories without Forks (accumulated number of original repositories created per model, 2015-2019)


Figure 4.6: Repository Trend in GitHub for Each Model


Figure 4.7: Creation Time vs Stars

A fork is another copy of a repository; it can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5 we can, surprisingly, see a considerable difference between the total number of repositories created including forks and the total number created without forks. We find that most repositories related to deep learning models are not original, which indicates that a considerable number of developers remain in a studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, different from the previous summarising method, we study the popularity of deep learning models over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the highest number of repositories created. Examining the data, in 2017 the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued rising to a higher level, which persists until now.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have among the most important and most significant communities in the deep learning field. These networks are essential in both computer vision and NLP, where they constitute an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly in the last two years. Differing from earlier structures such as CNN, both modify the original architecture and significantly improve results in computer vision and translation tasks.

Rising star: Bert

However, no model comes with perfection. The Transformer itself can be extended into many variants, and BERT is one of those.

The current trends, as depicted in the graphs, suggest that deep learning models are proliferating fast with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

If someone believes that the popularity of a deep learning model is strongly correlated with when the model came into existence, our data tell a different story.

With its paper published in 2017, NCF draws the least attention in the GitHub community. This shows that there is no simple relationship between popularity (i.e. stars) and creation time; past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

Similarly for the Wide and Deep model: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the data above confirm there has been no significant rise in its use.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (fork distribution histograms per model)


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model, and as presented in Table 4.2 and Table 4.3 we can see the following.

Model Name      Mean      STD       Min    25%    50%    75%    Max
Bert            498.65    2196.3    0      1      8      43     17940
CNN             106.84    611.97    2      3      8      32     13882
LSTM            48.82     214.22    0      1      2      13     2703
NCF             77        129.91    1      2      3      115    227
ResNet          46.88     221.43    0      0      1      8      2980
Transformer     186.79    1155.87   0      0      4      21     12408
Wide and Deep   16.23     36.80     0      0      1      8      146

Table 4.2: Stars Comparison

Model Name      Mean         STD          Min    25%    50%    75%    Max
Bert            12.8214953   58.5926617   0.0    0.0    1.0    16.5   4661.0
CNN             4.0710       25.2713617   0.0    1.0    4.0    14.0   6274.0
LSTM            1.7793       7.1956709    0.0    0.0    1.0    5.0    968.0
NCF             34.333333    58.603185    0.0    0.5    1.0    51.5   102.0
ResNet          1.7442478    9.3754994    0.0    0.0    0.0    3.0    1442.0
Transformer     5.3518797    33.6103826   0.0    0.0    1.0    6.0    3637.0
Wide and Deep   0.7282051    1.6364192    0.0    0.0    0.0    2.5    71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are NCF (34.33), Bert (12.82) and Transformer (5.35).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (0.73), ResNet (1.74) and LSTM (1.78).

Kruskal-Wallis Test. The Kruskal-Wallis test is used in this project to compare more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen:

• H0: the seven models' star distributions are the same

• H1: at least one model's star distribution differs from the others

from scipy.stats import kruskal

stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(), dfLstm['star'].tolist(),
                  dfNcf['star'].tolist(), dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                  dfWideDeep['star'].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' star distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or Bert model may require a large amount of time and effort, so developers star and fork these novel deep learning models to show their interest rather than implement them.

Figure 4.9: Stars vs. Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figures 4.9, 4.10, 4.11, and 4.12 show scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.


Figure 4.10: Stars vs. Development Time

Figure 4.11: Stars vs. Open Issues

Figure 4.12: Stars vs. Entropy Value

Number of Contributors. From Figure 4.9 and a Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, with p-value ≤ 0.01). Across all the repositories, the top-3 models with the most stars per contributor are CNN (16.875 stars/contributor), Transformer (15.51 stars/contributor), and Bert (15.50 stars/contributor).
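Each of the correlation figures quoted in this subsection comes from the same test; a minimal sketch with scipy on invented data (the real inputs are the per-repository star and contributor counts collected by STAMPER):

```python
from scipy.stats import spearmanr

# Invented (stars, contributors) pairs standing in for the collected metadata
stars        = [3, 10, 1, 250, 40, 7, 0, 95]
contributors = [1, 2, 1, 9, 3, 1, 1, 4]

rho, p = spearmanr(stars, contributors)
# rho is a rank correlation in [-1, 1]; |rho| around 0.3 reads as "weak"
print(f"rho={rho:.4f}, p={p:.4f}")
```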

Model          One-Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL related repositories


Development Time. From Figure 4.10 and a Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, with p-value ≈ 0). This suggests that the longer a model has been in development, the more stars it accumulates (i.e., the model becomes more popular). The top-2 models with the longest development durations are LSTM and CNN.

Open Issues. From Figure 4.11 and a Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186, with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, this supports the hypothesis that the more popular a repository becomes, the more issues it will have. We investigate this correlation further in the following section.

Entropy. From Figure 4.12 and a Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, with p-value ≤ 0.01). In this project, we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work of developing a deep learning model is divided evenly. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development work is not distributed evenly.

We can corroborate this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether contributions are evenly distributed.

Entropy. In particular, we compute the entropy H of each repository, defined as

p_i = c_i / Σ_i c_i    (4.1)

H = −Σ_i p_i log2(p_i)    (4.2)

where i denotes the i-th contributor, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution to one repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example, its contributions are summarised in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214    (4.3)

p_1 = 174/214, p_2 = 36/214, p_3 = 4/214    (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.7827    (4.5)

name           contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample Contributions to One Repository
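The arithmetic can be checked directly: `scipy.stats.entropy` normalises the raw contribution counts to the p_i of Equation 4.1 and, with `base=2`, computes Equation 4.2:

```python
from scipy.stats import entropy

# Contribution counts from Table 4.5
contributions = [174, 36, 4]  # dragen1860, ash3n, kelvinkoh0308
H = entropy(contributions, base=2)  # normalises counts to p_i internally
# ≈ 0.78 bits, far below log2(3) ≈ 1.585: one contributor dominates
print(round(H, 4))
```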

The resulting distribution of entropy across all repositories can be used to determine whether a repository is developed unevenly: the lower the entropy, the higher the phase separation, which corresponds to more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy values for all models. From those figures, we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.

Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with the metadata of their forked repositories.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has the highest proportion of unique forked repositories among the models.

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed relative to the original repository. Our objective was a summarised view that shows what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. A more detailed look shows at a glance not only that changes are rarely made after forking, but also that most changed

Figure 4.15: Repository Uniqueness Distribution (%)

Figure 4.16: Repository Change Statistics


forks differ from the original repository by only 0 to 100 bytes in size, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesise two main reasons for this result. First, new models have not been around for long, and the lack of tutorials and attention leaves them less used. Second, a model may only be valid for a specific type of data, making it less robust and generalisable and therefore less suited to developers' needs.

We conclude that the development of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey software maintenance problems in these deep learning related repositories. The overall purpose is to explore three factors related to maintenance: development time, the number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report, we calculate the age of each repository from its creation and last-update times, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
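Equation 4.6 translates directly into a small helper over the `created_at` and `updated_at` fields that the GitHub API returns in ISO-8601 form; a sketch (the timestamps below are invented):

```python
from datetime import datetime

def repo_age_days(created_at: str, updated_at: str) -> float:
    """Repository age in days, per Equation 4.6.

    Both arguments use GitHub's timestamp format, e.g. '2018-10-31T05:00:00Z'.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    delta = datetime.strptime(updated_at, fmt) - datetime.strptime(created_at, fmt)
    return delta.total_seconds() / 86400  # seconds per day

print(repo_age_days("2018-10-31T05:00:00Z", "2019-02-18T05:00:00Z"))  # 110.0
```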

Figure 4.17 and Table 4.6 show how development time varies per model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide & Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal-Wallis test, the distributions of development days differ by model (p-value ≤ 0.05). Therefore, we hypothesise that for many of the earlier models, developers started using the open-source web community immediately after the first release.


Model        Max (days)  Q3      Median  Q1      Min
Bert         779         229     110     32      0
Transformer  1254        321     142     11      0
Wide deep    1107        575     117     0.5     0
ResNet       1360        456.5   120     15      0
NCF          1120        476     216     8       0
LSTM         1812        621.25  315.5   47.25   0
CNN          1385        699.25  483     270.25  0

Table 4.6: Repository Development Time Statistics

Figure 4.17: Development Time Boxplot


Figure 4.18: Development Time vs. Number of Open Issues

4.3.2 RQ2: Do old models have more issues than new models?

Figure 4.18 shows a scatter plot correlating development time with the number of open issues. As visually suggested by the figure and a Spearman correlation test, there is a moderate correlation between the two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, which may have more users and higher maintenance costs, also have more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.299), CNN (3.414), and Transformer (1.857). Moreover, from Figure 4.19, we can also see that most repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository  Percentage Having a Wiki (%)
Bert         97.17
CNN          98.498
LSTM         98.799
NCF          98.864
ResNet       98.817
Transformer  96.97
Wide deep    100

Table 4.8: Descriptive statistics on the percentage of repositories with a wiki

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., most have a wiki page). Figure 4.19 shows the histogram of open issues. All these samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section, we first used STAMPER to collect the metadata of DL-model-related repositories from GitHub. We then investigated three common software engineering aspects of deep learning repositories (popularity, contribution, and maintenance) using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation of whether developers change the original codebase after forking.


Figure 4.19: Open Issues vs. Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmarks of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for in-depth analysis of model construction and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem: the models we chose cannot represent all the new models in the wild. This is an open research question that needs further investigation; for example, users may publish their models in the prototxt format, whereas this project only considered deep learning models constructed using Python. The findings may also reflect sampling limits, in that the experiment uses a limited number of repositories: a GitHub search cannot exceed the 1000-repository boundary. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples might yield a more precise outcome.

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in real life, an idea that is novel compared with previous work.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program can provide a broader picture of deep learning model usage in the world. Our program also allows developers and users to define their own heuristics for data selection: experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software, or to a plugin for GitHub, allowing researchers



and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could look at the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project, we investigated and examined the popularity of deep learning models via the number of repositories existing in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., k-means) to the high-resolution time series data from commits.
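As an illustration of this proposed direction (not part of STAMPER), weekly commit counts per repository could be clustered with k-means; a sketch using scipy on invented data:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
# Invented weekly commit counts for six repositories over ten weeks:
# three "bursty" histories (a spike of early activity) and three "steady" ones.
bursty = rng.poisson(1, size=(3, 10))
bursty[:, :2] += 20
steady = rng.poisson(5, size=(3, 10))
series = np.vstack([bursty, steady]).astype(float)

np.random.seed(1)  # kmeans2 uses NumPy's global RNG by default
centroids, labels = kmeans2(series, 2, minit="++")
print(labels)  # repositories with similar commit rhythms share a cluster label
```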

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance, and contribution of deep learning related repositories on GitHub and identified factors affecting each. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the tool's ability to give users a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub; one avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, and to serve the needs of people working at the intersection of social media analysis, data visualization, and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models and datasets) is evolving and changing rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train, and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in past research and prototyping. Have the model stores kept their models up to date? What are the differences between those model stores?

By scraping, analyzing, and visualizing the use of various models and datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

bull Identify data sources for current trends in model amp dataset use

bull Develop visualization analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

731 Code Files Submitted

0 Configuration and Setup__init__py setuppy Modelpymodel_keywordpy testpyJSONFormattersh change_to_pdfsh

1 Data Collectionmodel_searcherpy item_filterpy

2 Repository Searchforks_time_stamp_getterpy

3 (Optional) Data Selectionrepository_filterpy filtered_repopy

4 Data Visualization Use Altaircontribution_statpyentropy_calculationpyAnalysiscontribution_relatedpyAnalysismeta_datapy

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware:
MacBook Pro (Retina, 15-inch, Mid 2015); Processor: 2.2 GHz Intel Core i7; Memory: 16 GB 1600 MHz DDR3; Graphics: Intel Iris Pro 1536 MB

Software:
- PyCharm 2019.1.3 (Professional Edition), Build #PY-191.7479.30 (built on May 30, 2019), licensed to ANU / Xing Yu; JRE 11.0.2+9-b159.60 x86_64; JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.; macOS 10.14.6
- Anaconda, with jupyter-notebook 6.0.0
- Python 3.7.4 with: pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
- Git

Datasets:
- asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
- pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json
- by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
- forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

74 Appendix 4 README

STAM

PER Mapping the landscape of deep

learning models use in the w

ild

cccccc

STAMPER

STAMPER

Table of Contents

Before You BeginPrerequisitesInstallRunningTestH

igh Level Description of all M

odules amp D

atasetsAuthorsLicense

STAM

PER

STAMPER

is python software tool for exam

ining real-world softw

are repositories related to deep learning models to

aid researchers in understanding past and current trends

Before You Begin

Before you begin we strongly recom

mend you dow

nload PyCharm and Anaconda before view

editexecute ourcodePyCharm

Anaconda

Amphetam

ine on the Mac App Store Keep M

ac awake w

ith this useful App (Otherw

ise it will disconnect

internet)

Prerequisites

Please make sure you have installed all of the follow

ing prerequisites on your development m

achine

Git - httpsgit-scm

comdow

nloadsG

it Authentication tokenPython 37 w

ith pip

Jupyter Notebook 600

All external libraries used listed in requirementstxt

Install

Install some dependencies M

ake sure you have the latest pip

Install a list of requirements specified in r

equirementstxt

Running

All the code scripts run from the root

1 Data Collection

Clone our project

Run python3 model_searcherpy

to get keyword related repositories m

etadata in GitH

ub in output

folderAn authentication key is required to get a higher request rateN

ote Request rate could be changed in function call_api

and call_api_dir

Then you need to run sh JSONFormattersh

in your terminal to w

ell-format your output data

Sample Case

In main()

change keywords

in terms of interest Then the resulting JSO

N file w

ill beoutputbertJSON

Customized sorting m

ethod in function get_total_pages

and request_ith_page

Could besort

updated

or stars

order

asc

or desc

2 Repository Search

Run python3 forks_time_stamp_getterpy

to get all your the forks timestam

p in forked_timestamp

3 Data Selection (O

ptional)

pip3 install --upgrade pip

1

pip3 install -r requirementstxt

1

Run python3 repository_filterpy

to get your code-related repositories with statistics in

filtered_repo

folder

Run python3 filtered_repopy

to filter your data

NoteYour keyw

ords could be customized in m

odel_keywordpy

We store all the previous experim

ent data in tensorflow_model_filtering

andpytorch_model_filtering

4 Data Visualization

Popularity

Run python3 visualizationspopularitypy

and get your graphs invisualizationsgraphspopularity

Maintenance

Run python3 visualizationsmaintenancepy

and get your graphs invisualizationsgraphsmaintenance

Contribution

Run python3 visualizationscontributionpy

and get your graphs invisualizationsgraphscontribution

Multi Correlations

Run python3 visualizationsmulti_variablepy

and get your graphs invisualizationsgraphsmulti_variable

Test

Some G

itHub repositories does not m

aintained well and their links som

etimes are broken and unreachable To

guarantee your best experience in using our tool we provide testing unit for G

itHub links in t

estpy

This module

will record all the unreachable links and w

rite them into file

unreachable_urlstxt

UsageChange elem

ents in keywords

run python3 testpy

All the unreachable links will w

rite tounreachable_urlstxt

Customizing Your O

wn Search

In module M

odelpy

define your own entity lists (eg t

ensorflow_models

)

In Constructor Model

we store all unfiltered_data filtered_data and forked_tim

e_location in three folders

Instantiation

Since you already got data from the previous steps (1-2) Then you can construct a m

odel by calling aconstructor M

odel

eg bert = Model(bert tensorflow desc_by_star)

parameter M

odel_name and Respository m

etadata subfolder

Then you can call this object with its relative data easily (

from Model import bert

and use bert

as you goalong)

Customize Keyw

ords

In module m

odel_keywordpy

import your instantiation (

lstm

) and call add_keywords

eg

High Level D

escription of all Modules amp

Datasets

1 Data Collection

2 Repository Search

3 (Optional) D

ata Selection

4 Data Visualization

Altair is used to draw elegant graphs

Experiment D

atasets Collected

lstm_keywords = [tfkeraslayersLSTMCell tfnnrnn_cellLSTMCell]

lstmadd_keywords(lstm_keywords)

12

model_searcherpy

item_filterpy

12

model_searcherpy

forks_time_stamp_getterpy

12

repository_filterpy

filtered_repopy

12

contribution_statpy

entropy_calculationpy

Analysiscontribution_relatedpy

Analysismeta_datapy

1234

1 After Data Collection

output

asc_by_star

cnn tensorflowjson

$

lstm tensorflowjson

asc_general

bertjson

cnnjson

lstmjson

ncfjson

resnetjson

transformerjson

$

        wide deep.json
    by_update_time
        bert tensorflow.json
        cnn tensorflow.json
        lstm tensorflow.json
        ncf tensorflow.json
        resnet tensorflow.json
        transformer tensorflow.json
        wide deep tensorflow.json
    desc_by_star
        bert tensorflow.json
        cnn tensorflow.json
        lstm tensorflow.json
        ncf tensorflow.json
        resnet tensorflow.json
        transformer tensorflow.json
        wide deep tensorflow.json
    desc_general
        bert.json
        cnn.json
        lstm.json
        ncf.json
        resnet.json
        transformer.json
        wide deep.json
pytorch_models
    AlexNet.json
    DCGAN.json
    Densenet.json
    FCN-ResNet101.json
    GoogleNet.json
    HarDNet.json
    Inception_v3.json
    MobileNet v2.json
    PGAN.json
    ResNet.json
    ResNet101.json
    ResNext WSL.json
    ResNext.json
    RoBERTa.json
    SSD.json
    ShuffleNet v2.json
    SqueezeNet.json
    Tacotron 2.json
    Transformer.json
    U-Net pytorch.json
    U-Net.json
    WaveGlow.json
    Wide ResNet.json
    fairseq.json
    vgg_nets.json

2. After Repository Search

forked_timestamp
    bert tensorflow.csv
    cnn tensorflow.csv
    lstm tensorflow.csv
    ncf tensorflow.csv
    resnet tensorflow.csv
    transformer tensorflow.csv
    wide deep tensorflow.csv

3. After Data Selection (Optional)

filtered_repo
    bert.json
    pytorch_model_filtering
        Densenet.json
        FCN-ResNet101.json
        GoogleNet.json
        MobileNet v2.json
        ResNet101.json
        ResNext.json
        ShuffleNet v2.json
        SqueezeNet.json
        Tacotron 2.json
        Wide ResNet.json
        vgg_nets.json
    tensorflow_model_filtering
        bert.json
        lstm.json
        ncf.json
        resnet.json
        transformer.json
        wide deep.json

Generated Graphs

graphs
    contribution
        change_to_pdf.bash
        entropy_distribution.svg
        entropy_dots.svg
        lines_changed_boxs.svg
        lines_changed_hists.svg
        unique_percentage_distribution.svg
        uniqueness_chart.svg
    maintenance
        devTime_boxplot.svg
        issues_distribution.svg
        wiki_yn.svg
    multi_variable
        dev_t_to_open_issues.svg
        multi_correlation.svg
        star_to_contributors.svg
        star_to_dev_t.svg
        star_to_entropy.svg
        star_to_open_issues.svg
    popularity
        accumulated_popularity.svg
        creation_repository_trend_total.svg
        creation_with_fork_timeline.svg
        fork_distribution.svg
        popularity_dot.svg
        popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr. Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)

Bibliography

Git a. Github description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

Git b. Github description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

Git c. Github description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

Git d. Github Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H., Hora, A., and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H., Hora, A., and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C., Suchak, Y., Ray, B., and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C., Kobourov, S., Nagra, J., Pitts, J., and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C., Scholtes, I., and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G., Pinzger, M., and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X., Liao, L., Zhang, H., Nie, L., Hu, X., and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y., Bengio, Y., and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z., Aniche, M. F., and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

  • Acknowledgments
  • Abstract
  • List of Abbreviations
  • Contents
  • Introduction
    • Trace Deep Learning use through GitHub
    • Contribution
    • Report Outline
  • Background and Related Work
    • Background
      • Deep learning
        • TensorFlow
        • PyTorch
      • Deep learning models
      • Summarized Timeline
    • Public Code Repositories
      • Web-based hosting service
      • Measuring Popularity From GitHub
      • Extracting Messy Data in the Wild
      • Visualizing data in Repositories
    • Summary
  • STAMPER Design and Implementation
    • Overview
    • Data Collection
    • Repository Search
    • Data Selection
      • Example
    • Construct the Visualizations
    • Summary
  • STAMPER in Action
    • Popularity of Deep Learning Models in GitHub
      • Popularity Feature Selection
      • Past and Current Status: A Full Integration
      • RQ1: How has the popularity of model changed over time? A closer look at the deep learning models
      • RQ2: How popularity varies per model
      • RQ3: Does the popularity of models relate to other features?
    • Contribution of Deep Learning Models in GitHub
      • Collaborative Contribution
      • RQ1: After forking, do developers change the codebase?
    • Maintenance of Deep Learning Models in GitHub
      • RQ1: How long has it been in existence?
      • RQ2: Do old models have more issues compared to new models?
      • RQ3: Are they well maintained?
    • Summary
  • Discussion And Future Work
    • Discussion
      • Data in the wild: Limitation and Improvement
      • Extensibility and Open-Source Software
    • Future Work
      • Social Network Analysis in GitHub
      • Trend Detection using Commitments Timestamp
  • Conclusion
  • Appendix
    • Appendix 1: Project Description
      • Project Title
      • Supervisors
      • Project Description
      • Learning Objectives
    • Appendix 2: Study Contract
    • Appendix 3: Artefact Description
      • Code Files Submitted
      • Program Testing
      • Experiment
        • Hardware
        • Softwares
        • Other
        • Datasets
    • Appendix 4: README

List of Tables

2.1  Deep Learning History . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  4
2.2  Timeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  7

3.1  Repositories Related to Tensorflow . . . . . . . . . . . . . . . . . . . . . . 17

4.1  Popularity metric for repositories . . . . . . . . . . . . . . . . . . . . . . . 21
4.2  Stars Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3  Forks Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4  Percentage of one contributor development for DL related repositories . . . . 32
4.5  Sample Contributions to One Repository . . . . . . . . . . . . . . . . . . . 34
4.6  Repository Development Time Stat . . . . . . . . . . . . . . . . . . . . . . 40
4.7  Repository Open Issue Statistics . . . . . . . . . . . . . . . . . . . . . . . . 41
4.8  Descriptive statistics on percentage of Wiki Existence . . . . . . . . . . . . 42


Chapter 1

Introduction

1.1 Trace Deep Learning use through GitHub

GitHub is one of the largest web-based hosting communities in the world, containing a rich source of data that facilitates many different software engineering projects. This massive database allows developers and researchers to publish their work.

GitHub can also be used to measure popularity. Developers construct their social networks on GitHub in multiple ways, such as joining an organisation or starring a repository they are interested in. These features make repositories on GitHub easily accessible and an excellent place to conduct empirical studies.

The deep learning research community is continually evolving. Not limited to academic research, deep learning is allowing businesses to use data to teach computers how to learn. As a result, a number of machine learning and deep learning startups have come into existence worldwide.

The development of deep learning models also raises software engineering problems. Studies on the quality of deep-learning-related projects are sparse, and few researchers focus on their usage outside academia. With the expanding range and deepening degree of deep learning use, we would like to test whether developers are catching up with the latest deep learning upsurge in academia.

A tool to extract the metadata of historical repositories, with integrated visualisation, based on the vast corpus of GitHub repositories is not currently available. To fill this gap we present our tool, STAMPER. It is implemented in Python and supports GitHub repository metadata extraction based on repository keyword search and code segment search using the GitHub API. It can also analyse the modifications made in forked repositories relative to the original repository and capture the repository difference.

We study the historical trend of deep learning models and frameworks from repositories in GitHub, and we further demonstrate how repository metadata can be used to observe the deep learning trend. Our project provides a novel method to study deep learning frameworks and models from a historical aspect; in the meanwhile, our work opens a new aspect of empirical study in deep learning.


1.2 Contribution

• We introduce STAMPER, a Python tool that can be used to extract the metadata characterising historical open source projects from GitHub, based on researchers' interests.

• We utilise STAMPER in a case study analysing the usage of deep learning models in TensorFlow. We further demonstrate how we extract a rich set of features and establish the connections between those features and popularity.

• We create and mine a new dataset for further research use.

1.3 Report Outline

The rest of this report is structured as follows. Chapter 2 provides the timeline for deep learning frameworks and deep learning models. Some background knowledge is presented in that chapter, and previous works on software mining, tools related to GitHub, and visualizations are recorded there as well.

Chapter 3 provides an overview of our work and introduces our proposed methodology to extract the metadata from the GitHub repository.

Chapter 4 presents a case study in which we use our tool to extract deep learning related repositories from GitHub and to trace the landscape of those hot deep learning models. The visualizations generated using our tool can aid researchers in understanding past and current trends easily.

Referring back to the critical questions posed, we conclude our project and highlight the next steps in Chapter 5.

Chapter 2

Background and Related Work

Software repository mining allows researchers to study the historical trends of software engineering practice effectively. The use of repository mining is based on the use of web hosting services, and there exist multiple approaches to conducting such studies. In the first section we will introduce some background knowledge on web-based hosting services. Then we will introduce some popular deep learning frameworks in Section 2.1.1. Finally, we will detail in Section 2.2 some previous works which conduct software repository mining and identify the main patterns in exploring repository popularity.

2.1 Background

2.1.1 Deep learning

Different from traditional machine learning, deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [LeCun et al., 2015]. Methods in deep learning can dramatically improve human life in multiple aspects, from image classification and speech recognition to machine translation, and even to autonomous cars.

The usage of deep learning has grown tremendously in the past few years with the introduction and improvement of hardware availability (GPUs), big data, and cloud providers such as Amazon Web Services (AWS). Not limited to this, large companies create their own research teams to develop deep learning algorithms and integrate them into frameworks such as PyTorch and TensorFlow. In the meanwhile, they share their datasets and model training tutorials, helping startup companies to build state-of-the-art products with minimal effort and time.

Though deep learning has become a hot topic in recent years, the history behind deep learning has been evolving since the 1950s, as summarised in Table 2.1 [Subramanian, 2018].

Previously, developers and researchers required expertise in C++ or CUDA to implement deep learning models and algorithms. However, thanks to large technology companies and organisations, people with knowledge of a scripting language (Python) can also build their own deep learning algorithms.


Technique                        Year
Neural network                   1943
Backpropagation                  1960s
Convolutional Neural Network     1979
Recurrent neural network         1980
Long Short-Term Memory           1997

Table 2.1: Deep Learning History

In Section 2.1.1.1 and Section 2.1.1.2 we will talk about some state-of-the-art deep learning frameworks used today, both in academia and industry.

2.1.1.1 TensorFlow

TensorFlow is a machine learning system that operates at large scale [Abadi et al., 2016]. Developed from its predecessor, DistBelief, it was initially released by the Google Brain team in November 2015. TensorFlow then released its official 1.0.0 version on the 11th of February 2017, introducing the ability to run on multiple CPUs and GPUs.

Currently, Google uses this framework in numerous ways to improve its search engine, translation, recommendation systems, and image recognition and captioning. The TensorFlow library contains different APIs to build deep learning models at scale, such as ResNet, CNN, or LSTM. The architecture of this framework gives developers the flexibility to experiment with new optimisations and Google's training algorithms. It currently supports numerous applications from front end to mobile; this flexibility and scalability make it one of the most popular deep learning frameworks in the world.

2.1.1.2 PyTorch

Based on the Torch library, PyTorch can be seen as a Python front end to the Torch engine, which provides the capability to define mathematical functions and compute gradients. Similar to TensorFlow, PyTorch has good GPU and CPU support and is primarily developed by Facebook's artificial intelligence research group. The official initial release of PyTorch was in October 2016; its dynamic computation power gives it greater flexibility in building complex architectures.

However, unlike TensorFlow, which can seamlessly integrate into real industrial applications, PyTorch was primarily developed by researchers and scientists, and in certain scenarios it is neither easy nor recommended to use in production.


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and pick up the speed of project development to survive in this keen competition. Winning trust from the public with a service of high quality is thus required.

Initially, we would have liked to conduct our research on the latest model stores, such as AWS SageMaker, Azure Machine Learning Service, Wolfram Neural Net Repository, and ONNX. However, the algorithms and datasets used there are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand with the public. To avoid those problems and gain a more in-depth insight into usage in society, we choose a framework with models of greater transparency and substantial usage: TensorFlow.

API Referenced

Since this project involves a range of models in deep learning, we begin by developing an understanding of how to build and train neural networks. The construction of networks may rely on the layers API provided by Keras and TensorFlow. In the meanwhile, TensorFlow recently introduced estimator APIs to simplify the procedures of training, evaluation, prediction, and export.

Convolutional Neural Network (CNN)

The Convolutional Neural Network is one of the most established algorithms among all the deep learning models and one of the most dominant algorithms in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons in a grid pattern. Typically, a CNN consists of three types of layers: fully connected layers, convolution layers, and pooling layers. Convolution and pooling layers conduct the feature extraction, and the fully connected layer maps the extracted features into the final output. Layers are interconnected, and thus the extracted features are transferred layer by layer.
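The feature-extraction stages just described (a convolution followed by pooling) can be made concrete with a small, framework-free sketch in plain Python; this is our own illustration of the idea, not code from TensorFlow or from STAMPER:

```python
# Framework-free sketch of a CNN's feature extraction: a 2D convolution
# followed by 2x2 max pooling. Purely illustrative; a real model would
# use TensorFlow/Keras layers instead.

def conv2d(image, kernel):
    """'Valid' convolution (strictly, cross-correlation) of a 2D image."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling, shrinking each spatial dimension."""
    return [[max(feature_map[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]

# A toy 4x4 "image" and a 2x2 diagonal kernel
image = [[1, 0, 0, 1],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [1, 0, 0, 1]]
kernel = [[1, 0],
          [0, 1]]

features = conv2d(image, kernel)  # 3x3 feature map
pooled = max_pool(features)       # pooled summary of the feature map
```

A fully connected layer would then flatten `pooled` and map it to the final output, mirroring the three layer types described above.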

Long short-term memory (LSTM)

Different from traditional neural networks, which cannot memorise previous data, Long Short-Term Memory (LSTM), a special kind of recurrent neural network, provides researchers with an effective way to perform persistent data tasks, and it is capable of learning long-term dependencies [Hochreiter and Schmidhuber, 1997]. Specifically, LSTM combines the forget and input gates into an update gate. It is capable of learning dependencies from historical data and making predictions from the information remembered previously. Inside an LSTM, instead of a single linear layer, there is a small network which performs the gating functions independently.

LSTM is one of the most common uses of recurrent neural networks. This model is generally used with sequential data and can solve language modelling problems such as NLP concepts (word embedding, encoder).

(Footnote: TensorFlow official models are chosen in our project. The TensorFlow official models contain a collection of deep learning models with their high-level APIs: https://github.com/tensorflow/models/tree/master/official)
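To make the gating concrete, here is a single hand-rolled LSTM cell step in plain Python, using the standard forget/input/output-gate formulation; the weights are arbitrary illustrative values, not trained parameters:

```python
# One step of a scalar LSTM cell (standard formulation). The "small
# network inside the LSTM" mentioned above is the set of gates below.
# Weights are made-up illustrative values, not trained parameters.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """w maps each gate name to (input weight, recurrent weight, bias)."""
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])   # forget gate
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])   # input gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev + w["g"][2]) # candidate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])   # output gate
    c = f * c_prev + i * g    # keep part of the old memory, add new content
    h = o * math.tanh(c)      # expose a gated view of the memory
    return h, c

weights = {k: (0.5, 0.1, 0.0) for k in ("f", "i", "g", "o")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:    # a short input sequence
    h, c = lstm_step(x, h, c, weights)
```

The cell state `c` is what carries long-term dependencies across steps, which is the property the paragraph above attributes to LSTM.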

Residual Network (ResNet)

One of the problems that deep learning models face is that as the number of layers increases past a certain point, accuracy stops improving. This problem motivated modern network architectures, such as the residual network (ResNet) and Inception, which solve it via residual connections.

ResNet normally solves the problem addressed above by fitting a residual mapping through the addition of a shortcut connection. Each of the ResNet blocks contains a series of layers and a shortcut connection component.
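The shortcut connection can be sketched in a few lines: the block computes a residual f(x) and adds the input back, so that when the residual is zero the block reduces to an identity mapping. This is a framework-free illustration of the idea, not ResNet's actual layers:

```python
# Sketch of a residual ("shortcut") connection: output = f(x) + x.
# The residual function here is a toy stand-in for the block's layers.

def residual_block(x, residual_fn):
    """Apply a residual function element-wise and add the shortcut."""
    fx = residual_fn(x)
    return [xi + fi for xi, fi in zip(x, fx)]

def toy_layers(x):
    # A fixed "layer" that scales and shifts, standing in for conv layers.
    return [0.1 * xi - 0.05 for xi in x]

x = [1.0, 2.0, 3.0]
y = residual_block(x, toy_layers)

# If the residual function outputs zeros, the block is an identity map,
# which is why very deep stacks of such blocks remain trainable.
identity = residual_block(x, lambda v: [0.0] * len(v))
```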

Bidirectional Encoder Representations from Transformers (BERT)

BERT is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al., 2018].

It was first released in google-research/bert on GitHub on the 1st of November 2018, with the ability to handle a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications; it aims to predict the relationship between sentences by analysing whole sentences holistically [Devlin et al., 2018].

Attention is all you need (Transformer)

Most problems in deep learning can be seen as a form of sequence-to-sequence mapping and can be solved using a common type of architecture, the Encoder-Decoder architecture.

The encoder and decoder both consist of identical layers, and each layer consists of 2 sub-layers: a multi-head self-attention mechanism and a position-wise fully connected FFN [Vaswani et al., 2017].

Neural Collaborative Filtering (NCF)

Neural Collaborative Filtering (NCF) is a neural network architecture which utilises the non-linearity of the neural network to build a recommendation system [He et al., 2017]. It demonstrates that matrix factorisation can be treated as a special case of neural collaborative filtering. To add additional non-linearity, this model introduces a multi-layer perceptron module on top of the generalised matrix factorisation (GMF) layer.
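A minimal sketch of the GMF idea underlying NCF, with made-up embedding and weight values: the element-wise product of user and item embeddings is passed through a learned weighted sum and a sigmoid, so plain matrix factorisation is recovered when the output weights are all ones:

```python
# Sketch of generalised matrix factorisation (GMF) as described above.
# Embeddings and weights are illustrative values, not learned parameters.
import math

def gmf_predict(user_emb, item_emb, out_weights, bias=0.0):
    # Element-wise interaction of the two embeddings...
    interaction = [u * v for u, v in zip(user_emb, item_emb)]
    # ...then a learned linear layer and a sigmoid non-linearity.
    score = sum(w * z for w, z in zip(out_weights, interaction)) + bias
    return 1.0 / (1.0 + math.exp(-score))  # predicted interaction probability

user = [0.3, -0.2, 0.8]
item = [0.5, 0.1, 0.4]
w = [1.0, 1.0, 1.0]  # all-ones weights: reduces to plain matrix factorisation
p = gmf_predict(user, item, w)
```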

Wide and Deep Learning

Since linear models are not great at generalising across unique features, deep models are introduced to solve this problem. Deep models can use embedding vectors for every query, and this technique can then generalise by coupling the items and queries.

To overcome over-generalisation, the Google research team introduced the Wide & Deep Learning model: jointly trained comprehensive linear models and deep neural networks, combining the benefits of memorisation and generalisation for recommender systems [Cheng et al., 2016].

2.1.3 Summarized Timeline

Model Name      Definition Raised Time
CNN             1980s
LSTM            1997
ResNet          2015
Wide & Deep     2016
NCF             2017
Transformer     2017
Bert            2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest hosts of open-source distributed version control systems (DVCS) [Gousios et al., 2014]. This distributed version control system enables contributors to submit a set of changes and integrate them into the main development branch. The use of git is based on pragmatic needs, and its advantages combine version control with collaborative development.

GitHub can also lead to insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project. Thus, the number of stars can reveal popularity; from a software development research perspective, it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al. [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth, using time series metadata derived from 2,279 accessible GitHub repositories. In the meanwhile, they found that slow growth is more common in overpopulated application domains and for old repositories. Moreover, they conclude that the most common domains on GitHub are web libraries and frameworks and non-web libraries and frameworks.

Building on their work, we would like to examine whether deep learning framework repositories follow a similar popularity trend pattern. At the same time, we would also like to study whether a relationship exists between three factors: forks, stars, and watchers.

Predicting the Popularity of GitHub Repositories

In the same year, Borges et al. [2016a] published another paper about predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regressions to predict the number of stars of GitHub repositories; as a result, project owners can see how their projects are performing in the open-source community.

Their studies examine the correlation between the popularity of a repository and the number of stars it owns. In the meanwhile, this study reports a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent

GHTorrent [Gousios and Spinellis, 2012] is designed to conduct independent repository mining through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all repositories, and the resulting dataset can be used towards answering multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving the raw data based on the results returned from the REST API. However, their tool does not have the ability to visualise the metadata or offer trend analysis at a high level.

MetricMiner

A similar tool is MetricMiner [Sokol et al., 2013]. It is a web application that supports researchers in mining software repositories, doing data extraction and statistical inference on the data collected. This tool automatically clones the repository, processes the metadata, and stores the data in the cloud, which gives it good scalability and fast computational speed in query answering, without users installing any software on their local machines.

GitcProc

GitcProc [Casalnuovo et al., 2017] is a tool that uses regular expressions to extract the changed lines within a repository, to facilitate answering project evolution questions. GitcProc can retrieve and summarise global statistics of project metrics, including commits, commit dates, and contributors. This tool can measure how many changes have taken place in Java projects and is also able to locate the changed files.

RepoVis

RepoVis [Feiner and Andrews, 2018] is a new tool which provides visual overviews for software maintained in Git repositories. RepoVis is a client-server web application providing full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its searchable functionality to GitHub, associated with a code-based search. All the visualizations are written out in SVG format.

2.2.4 Visualizing data in Repositories

Chronos

CHRONOS [Servant and Jones, 2013] is a software tool that enables the visualisation of historical change inside software source code. This tool implements a zoomable user interface as a visualisation of the actual code, which supports developers from a high-level view (pattern recognition) to a low-level view (reverse engineering), using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise their complete history of change, including the revisions that modified them. Inspired by this tool, our project uses this visualisation method to track the historical change of popularity trends related to the keyword specified by users in GitHub.

[Figure 2.1: git2net [Gote et al., 2019]]

GEVOL

Collberg et al. [2003] implemented a tool called GEVOL that can visualise the evolution of software using a novel graph drawing technique, to deduce a better understanding of a program from its development history, displaying all the visualisations using a temporal graph visualizer.

This system aids in the discovery of the structure of the system and provides the user with a new way to discover the evolution of a program by visualising the changes to the system. It extracts information about Java programs stored within a CVS version control system into three types of graphs: inheritance, control-flow, and call-graphs.

git2net

git2net [Gote et al., 2019] is a software tool that facilitates the extraction of co-editing networks from git repositories. Similar to GEVOL, it uses text mining techniques to analyse the history of modifications within files. However, they address the importance of studying the social network in GitHub and give the reader a broader view of the application of graph-based data analysis and modelling. Their tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter we detailed the web-based hosting service we selected (GitHub) and presented the concept of deep learning, with two popular frameworks and the related state-of-the-art neural network models. In the next chapter we will elaborate on how we design and implement STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter we will outline our design and implementation for data extraction, and then we will detail the metrics we use to estimate the trend of deep learning framework and model usage information in GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

[Figure 3.1: Overview of STAMPER. User-supplied keywords (Keyword 1 ... Keyword n) are fed to the GitHub Project Search API and Code Search API (1. Data Collection); the results flow through 2. Repository Search and an optional 3. Data Selection stage before local Data Visualisation.]


Data Collection

We first collect all the repository metrics through the GitHub API. This step allows us to extract the history of all the repositories related to the keyword and record the metadata for each repository (repository-based search).

Repository Search

As each repository contains multiple contributors and forks, we subsequently extract and analyse whether changes were made, based on the size information, and calculate a collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the forked information to create visual representations.
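One way to compute such a collaborative factor is the Shannon entropy of each contributor's share of the total contributions; the sketch below is our own illustration and may differ from STAMPER's exact formula:

```python
# Sketch of a collaborative factor: Shannon entropy over each
# contributor's share of the total contributions. Equal shares give the
# maximum entropy log2(n); a single dominant contributor gives ~0.
# Illustrative only; STAMPER's exact formula may differ.
import math

def contribution_entropy(contributions):
    total = sum(contributions)
    shares = [c / total for c in contributions if c > 0]
    return -sum(p * math.log2(p) for p in shares)

balanced = contribution_entropy([25, 25, 25, 25])  # 4 equal contributors
solo = contribution_entropy([100])                 # one-developer project
```

A higher value thus indicates more evenly shared development, which is the sense in which "entropy" is used above.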

Data Selection

We have implemented a selector allowing the exclusion of specific repositories not related to the desired repositories. The selector summarises the frequency counts for the keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be utilised to examine API usage statistics.
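The selector's frequency counting can be sketched as follows; the metadata field names (`full_name`, `description`) match the GitHub API, but the function itself is an illustration rather than STAMPER's actual implementation:

```python
# Sketch of the keyword selector: count how often user-supplied keywords
# appear in collected repository metadata so that unrelated repositories
# can be filtered out. Illustrative, not STAMPER's actual code.
from collections import Counter

def keyword_frequencies(repos, keywords):
    counts = Counter()
    for repo in repos:
        # description can be null in the GitHub API, hence the `or ""`
        text = " ".join([repo.get("full_name", ""),
                         repo.get("description") or ""]).lower()
        for kw in keywords:
            counts[kw] += text.count(kw.lower())
    return counts

# Hypothetical metadata records, as illustration only
repos = [
    {"full_name": "alice/bert-finetune", "description": "BERT for QA"},
    {"full_name": "bob/resnet-keras", "description": None},
]
freq = keyword_frequencies(repos, ["bert", "resnet"])
```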

Data Analysis

Since each forked repository may involve re-development and modification, modification of forked repositories is included in our project. For all changed repositories, the size difference is analysed to examine whether lines were added or removed compared to the original repository.
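The fork comparison can be sketched as a simple classification on the size field (which the GitHub API reports in kilobytes); again this is an illustration of the idea, not STAMPER's exact code:

```python
# Sketch of the fork analysis: compare each fork's size against the
# original repository's size to classify whether content was added,
# removed, or left unchanged. Illustrative only.
def classify_fork(original_size_kb, fork_size_kb):
    delta = fork_size_kb - original_size_kb
    if delta > 0:
        return "added", delta
    if delta < 0:
        return "removed", -delta
    return "unchanged", 0

# Hypothetical example: an original repo of 1024 KB and three forks
forks = [1024, 980, 1100]
results = [classify_fork(1024, f) for f in forks]
```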

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to analyse and manipulate the data, and even to run statistical tests on the data set. To better understand those metrics, we divided them into multiple categories. Attributes that are not primary data from the GitHub API are explained in the data expansion part and labelled [Data Expansion] in Table 3.2.

To maximise the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication, the user can make up to 5,000 requests per hour; otherwise, the rate limit allows only up to 60 requests per hour [Git d].
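The authentication step can be sketched as follows, using only Python's standard library; the token string and the search query are placeholders, not values from STAMPER itself:

```python
from urllib.request import Request

API_ROOT = "https://api.github.com"

def build_request(path, token, params=""):
    """Build an authenticated GitHub API request.
    With the Authorization header set, the rate limit is 5,000 requests/hour;
    without it, only 60 requests/hour are allowed."""
    url = f"{API_ROOT}{path}"
    if params:
        url += "?" + params
    return Request(url, headers={
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github.v3+json",
    })

# Example (not executed here): a repository search for a model keyword.
req = build_request("/search/repositories", "<oauth2-token>",
                    "q=bert+tensorflow&sort=stars")
```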


Type          Meta-data
Contributor   contribution: int [Data Expansion]
              login (user name): String
              type (user/organization): String
              contributors_url
Repository    created_at, description, full_name, language, size
Popularity    fork: Boolean
              forks: int, forks_url
              stargazers_count, watchers_count
              unique_repos [Data Expansion]
Owner         id, login (username), type
Maintenance   has_issues: Boolean, has_wiki: Boolean, open_issues: int
              pushed_at, updated_at, score

Table 3.2: Repository metadata collected through the GitHub API

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deep crawling operations to collect raw data.

bull Contribution: One repository generally involves multiple developers conducting software development. The project owner is not necessarily the person who contributes the largest amount of code, and the amounts contributed by different developers are potentially unequal. As a result, we further track this information through the GitHub API and record the number of contributions each developer made to each repository.

bull Unique_repos: Popular repositories may have numerous forked repositories owned by other users, and the reasons behind forking vary. Our research explores whether users conduct subsequent development based on the original codebase. By comparing the size of each forked


repository (F_i) and the original repository (O), we obtain all the forked repositories with a change of size (c):

F_i + c = O    (3.1)
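Rearranging the equation above gives c = O − F_i, so a fork counts as "changed" whenever c ≠ 0. A minimal sketch of this check, assuming sizes are the kilobyte values reported by the GitHub API:

```python
def fork_size_changes(original_size, fork_sizes):
    """Given the original repository size O and the sizes F_i of its forks
    (both in kilobytes, as reported by the GitHub API), return each fork's
    size change c = O - F_i; a non-zero c marks a changed fork."""
    return [original_size - size for size in fork_sizes]

changes = fork_size_changes(120, [120, 120, 150, 95])
changed = [c for c in changes if c != 0]  # the forks whose size differs
```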

3.4 Data Selection


Figure 3.2: Data Selection


Figure 3.3: Store in Local Disk


Figure 3.2 illustrates our method for searching API usage in DL model-related repositories. GitHub provides a REST API to examine whether a repository contains a targeted API.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. Moreover, the appearance frequency of each user-specified API is embedded directly in the returned result and can be matched in our program to each repository's full name. The overall result is finally written to local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. This approach also allows users to build a high-level picture of API usage across GitHub repositories.

Meanwhile, we also provide multiple options for selecting APIs, which are stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras applications library gives users the ability to load deep learning models and instantiate a model with default weights. Take ResNet as an example:

bull With pre-defined models: A user may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py. This denotes that the repository owner may use a pre-trained model from the Keras library, and this model could be fine-tuned in the current class.

Similarly, Keras has many classes suitable for creating deep learning models, and they all make good sample keywords for STAMPER, such as:

keras.applications.resnet.ResNet50
keras.applications.resnet.ResNet101
keras.applications.resnet.ResNet152
keras.applications.resnet_v2.ResNet50V2
keras.applications.resnet_v2.ResNet101V2
keras.applications.resnet_v2.ResNet152V2

bull With self-defined models: TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to further explore the components inside the ResNet class, they may first examine whether it is self-defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, the data selection heuristic varies. Deep learning

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py


users and experts can define their searches according to their interests and preferences.
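The selector logic described above can be sketched as follows; the keyword list and output file name are illustrative stand-ins for the contents of model_keyword.py, not the tool's actual values:

```python
import json
from collections import Counter

# Hypothetical keyword list of the kind stored in model_keyword.py.
KEYWORDS = [
    "from keras.applications.resnet50 import ResNet50",
    "keras.applications.resnet.ResNet50",
]

def count_keywords(source_text, keywords=KEYWORDS):
    """Count how often each user-specified API keyword appears in a file."""
    return Counter({kw: source_text.count(kw) for kw in keywords})

def write_frequencies(counts, path="api_frequencies.json"):
    """Persist the frequency counts to local disk in JSON format."""
    with open(path, "w") as fh:
        json.dump(dict(counts), fh, indent=2)

sample = "from keras.applications.resnet50 import ResNet50\nmodel = ResNet50()"
counts = count_keywords(sample)
```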

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from these three perspectives is illustrated in Figure 3.4. Meanwhile, Chapter 5 gives an example using our collected repository metadata for deep learning models.


Figure 3.4: Overview of Constructing the Visualizations

Popularity

bull Total number of repositories with fork (line)

bull Total number of repositories without fork (line)

bull Number of creations with change of time grouped in weeks (with fork)

bull Repository Creation Time vs Stars
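One of the popularity views above groups repository creations into weekly buckets. A minimal sketch using Python's standard library; the timestamps below are fabricated examples in GitHub's created_at format:

```python
from datetime import datetime
from collections import Counter

def weekly_creation_counts(created_at_timestamps):
    """Group repository creation timestamps (GitHub's ISO-8601 'created_at'
    strings) into (year, ISO week) buckets, as used for the weekly trend plot."""
    weeks = []
    for ts in created_at_timestamps:
        dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
        iso = dt.isocalendar()          # (ISO year, ISO week, weekday)
        weeks.append((iso[0], iso[1]))
    return Counter(weeks)

counts = weekly_creation_counts(["2018-03-05T10:00:00Z",
                                 "2018-03-07T23:59:00Z",
                                 "2018-03-12T08:30:00Z"])
```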

Contribution

To further exploit the forking information, STAMPER supports comparison between an original repository and its forked repositories. The work could be extended by visiting each forked repository's URL and tracing the commits.

As shown in Figure 3.5, an entity (E) we search for on GitHub may have multiple related repositories (R_i) with corresponding forked repositories (F_i). Among the forked repositories, we denote a changed fork by C_i.

To examine whether changes exist in forked repositories, and the difference between multiple entities, we calculate the difference using the equation


Keyword                     Total of Repository (including Forks) Collected   Total of Original Repository Collected
ResNet tensorflow           6129                                              339
Bert tensorflow             13734                                             106
CNN tensorflow              39765                                             1000
LSTM tensorflow             19572                                             1000
Transformer tensorflow      7188                                              145
Wide and deep tensorflow    324                                               39

Table 3.1: Repositories Related to TensorFlow


below. Uniqueness percentage distributions are composed of all the percentages p_i, each corresponding to its original repository R_i:

p_i = ∑ C_i / ∑ F_i    (3.2)
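A minimal sketch of computing p_i for one entity, assuming we already know which forks changed:

```python
def uniqueness_percentage(fork_changed_flags):
    """Fraction of an entity's forked repositories whose size differs from
    the original, i.e. p_i = (number of changed forks C_i) / (total forks F_i)."""
    if not fork_changed_flags:
        return 0.0
    return sum(fork_changed_flags) / len(fork_changed_flags)

p = uniqueness_percentage([True, False, True, True])  # 3 of 4 forks changed
```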


Figure 3.5: Examining Uniqueness after Forking

bull Percentage of Forked Repositories Unique from Origin (Boxplots)

bull Uniqueness percentage distribution for Each Entity (Histograms)

bull Entropy Distribution histograms for Each Entity (Histograms)

Maintenance

bull Development Time Boxplot For Each Entity

bull Open Issues Distribution For Each Entity

3.6 Summary

In this chapter, we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. Meanwhile, we introduced and analysed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving and are built, trained, and deployed by researchers. Our tool is available for analysing such changes. We collected the historical information stored in GitHub and extracted the metadata in each repository using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development has become a crowded, competitive field. Researchers, companies, and developers are all trying to establish a dominant voice in deep learning. A variety of models exist to think in, but there is no common bridge connecting those ideas. Historical data on GitHub are opaque and hard to find, and as a result many developers, especially experienced ones, remain in their comfort zone. With our study, we hope to shed some light on deep learning and highlight a few suggestions for the public.

This section aims to answer questions related to both models' usage on GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, due to the scarcity of studies on popularity in the GitHub ecosystem, there is no standardised feature for measuring popularity. We analyse some potential features of each repository and hypothesise that popularity is strongly related to the number of stars a repository owns.

This decision is justified in the following section with more background on GitHub.

bull Watchers: Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching; watching does not make them collaborators [Git b]. A watcher may watch


Figure 4.2: Star Sort Menu [Git a]

a repository to receive notifications for new pull requests or issues that are created. Watchers indicate how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

bull Stars: Starring a repository makes it easy for a user to keep track of a repository they are interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

bull Forks: A fork is created when a user makes their own copy of a repository. The user can fork a repository to suggest changes, or to use it as the basis for a new project.

Based on the data gathered in the collection stage, we extract the watchers_count, stargazers_count, and forks attributes and test whether those attributes are correlated. In Table 4.1, we summarise 86,712 repositories (including their forked repositories) with their related popularity metrics and draw a scatter plot, as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star     forks_count   watchers_count   model name
17940    4661          17940            Bert
12405    3637          12405            Bert
5263     1056          5263             Bert

Table 4.1: Popularity metrics for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks, and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we should not assume normality and instead use a rank-based measure.


Spearman Correlation Coefficient. Definition: The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman, 2011]. This statistical method allows us to test for a rank-order relationship between two numerical ordinal variables associated with a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

bull H0: The variables (star, fork, and watcher) do not have a relationship with each other.

bull H1: There is a relationship between those three variables.

Result

bull Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

bull Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

bull Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Setting α = 0.05: p1, p2, and p3 are all less than α, and from the calculation above we also find strong positive correlations, with coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875 respectively.

This means the likelihood that the testing data are uncorrelated is very small (95% confidence), and thus we can reject the hypothesis that these variables are uncorrelated.

In the rest of the report, we treat the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and long short-term memory network (LSTM), shown in the figures below, look like two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from these models with a longer history, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as described in the background section.

The model development community has recently seen the release of multiple powerful frameworks that are treated as baselines for building models. However, for many new models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.



Figure 4.4: Repositories with Forks


Figure 4.5: Repositories without Forks



Figure 4.6: Repository Trend in GitHub for Each Model



Figure 4.7: Creation Time vs Stars

A fork is another copy of a repository; a forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories created with forks and the total number created without forks. We find that most repositories related to deep learning models are not original, which indicates that a considerable number of developers remain in the studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, unlike the previous summarising method, we study the popularity of deep learning models over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the greatest number of repositories created. Let us examine this using the data: in


2017, the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued increasing to a higher level, where it remains.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have among the most important and significant communities in the deep learning field. These networks are essential in both computer vision and NLP, and they form an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer. As the title indicates, ResNet and Transformer usage has improved significantly in the last two years. Unlike earlier structures such as CNN, both modify the original structure and significantly improve results in computer vision and translation tasks.

Rising Star: Bert

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends, as depicted in the graphs, lead to the conclusion that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with the time the model came into existence, but our data tell a different story.

Published in a 2016 paper, NCF draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e., stars) and creation time; past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

Similarly for the Wide and Deep model: though Google provides full documentation and a tutorial for this model, published in 2016, we still take a pessimistic view of it. Moreover, the previous data also confirm that there is no significant rise in its use.


Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model, and as presented in Table 4.2 and Table 4.3, we can see that:

Model Name      Mean     STD       Min   25%   50%   75%   Max
Bert            498.65   2196.3    0     1     8     43    17940
CNN             106.84   611.97    2     3     8     32    13882
LSTM            48.82    214.22    0     1     2     13    2703
NCF             77       129.91    1     2     3     115   227
ResNet          46.88    221.43    0     0     1     8     2980
Transformer     186.79   1155.87   0     0     4     21    12408
Wide and Deep   16.23    36.80     0     0     1     8     146

Table 4.2: Stars Comparison

Model Name      Mean         STD          Min   25%   50%   75%   Max
Bert            128.214953   585.926617   0.0   0.0   1.0   16.5  4661.0
CNN             40.710       252.713617   0.0   1.0   4.0   14.0  6274.0
LSTM            17.793       71.956709    0.0   0.0   1.0   5.0   968.0
NCF             34.333333    58.603185    0.0   0.5   1.0   51.5  102.0
ResNet          17.442478    93.754994    0.0   0.0   0.0   3.0   1442.0
Transformer     53.518797    336.103826   0.0   0.0   1.0   6.0   3637.0
Wide and Deep   7.282051     16.364192    0.0   0.0   0.0   2.5   71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52), and NCF (34.33).

The three models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88), and LSTM (48.82). The three models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79), and ResNet (17.44).

Kruskal-Wallis Test: the Kruskal-Wallis test is used in this project to compare more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

bull H0: The 7 models' distributions are the same.

bull H1: The 7 models' distributions are different.

from scipy.stats import kruskal
stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(), dfLstm['star'].tolist(),
                  dfNcf['star'].tolist(), dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                  dfWideDeep['star'].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that the seven models' star distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories existing in GitHub. This unexpected result can be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or BERT model may require a large amount of time and effort, yet developers still show their interest in these novel deep learning models by starring and forking them.


Figure 4.9: Stars vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy value, respectively.



Figure 4.10: Stars vs Development Time


Figure 4.11: Stars vs Open Issues


Figure 4.12: Stars vs Entropy Value

Number of Contributors: From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Across all repositories, the top-3 models with the most stars per contributor are CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor), and Bert (155.0 stars/contributor).

Model           Percentage of One-Contributor Development (%)
Bert            74.53
CNN             83.3
LSTM            85.9
NCF             100
ResNet          90.26
Transformer     81.20
Wide and Deep   89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time: From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value = 0). This suggests that the longer a model has been developed, the more stars it will have (i.e., the model becomes more popular). The two repositories with the longest development duration belong to the LSTM and CNN models.

Open Issues: From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we may conjecture that the more popular a repository becomes, the more open issues it will have. We further investigate this correlation in the following section.

Entropy: From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project, we also investigate the impact of a new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of entropy values. The majority of repositories have an entropy value between 0 and 1, which means development is not distributed evenly.

We can confirm this using Table 4.4: most deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. Here we introduce an information-theoretic approach to test whether the contributions are even or not. In particular, we compute the entropy H of each repository, defined as

p_i = c_i / ∑_i c_i    (4.1)

H = −∑_i p_i log2(p_i)    (4.2)

where i indexes the i-th contributor, c_i is the i-th contributor's contribution, and ∑_i c_i is the total contribution to the repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example:

Its contributions are summarised in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214    (4.3)

p1 = 174/214,  p2 = 36/214,  p3 = 4/214    (4.4)

H(repository) = −(174/214 log2(174/214) + 36/214 log2(36/214) + 4/214 log2(4/214)) ≈ 0.7826    (4.5)

name            contribution
dragen1860      174
ash3n           36
kelvinkoh0308   4

Table 4.5: Sample Contributions to One Repository
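Recomputing the worked example is straightforward; a minimal sketch of the entropy calculation, applied to the contribution counts from the table:

```python
import math

def collaboration_entropy(contributions):
    """Entropy H = -sum_i p_i * log2(p_i) of a repository's contribution
    counts; lower values indicate a more uneven collaboration."""
    total = sum(contributions)
    probs = [c / total for c in contributions]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Contributions of dragen1860, ash3n, and kelvinkoh0308.
h = collaboration_entropy([174, 36, 4])
```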

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly: the lower the entropy, the higher the separation, meaning the work is more unevenly distributed.

Figure 4.13 shows the distribution of entropy values for all models. From these figures, we can see that most repositories have an entropy value of around zero, which means deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.



Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of users who forked the original repository, along with their repository metadata.
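Fork metadata of this kind comes from the GitHub "forks" listing endpoint (GET /repos/{owner}/{repo}/forks). A minimal sketch of extracting the fields of interest from a response page; the payload here is fabricated for illustration:

```python
def fork_size_records(forks_payload):
    """Extract (full_name, size) pairs from a page of the GitHub
    'GET /repos/{owner}/{repo}/forks' response — the metadata compared
    against the original repository."""
    return [(fork["full_name"], fork["size"]) for fork in forks_payload]

# Fabricated payload containing only the fields this sketch uses.
page = [{"full_name": "alice/TensorFlow-2x-Tutorials", "size": 120},
        {"full_name": "bob/TensorFlow-2x-Tutorials", "size": 150}]
records = fork_size_records(page)
```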

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original. We observe that Bert has a high proportion of unique forked repositories among the six models.


Figure 4.14: Percentage of Forked Repositories Unique from Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarised view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. Looking more closely, we can see at a glance not only that changes are rarely made after forking, but also that most changed



Figure 4.15: Repository Uniqueness Distribution (%)


Figure 416 Repository Change Statistic


Most changed repositories differ in size from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been available for long; the lack of tutorials and attention makes them of less concern to people. Second, the model itself may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey software maintenance problems in these deep learning related repositories. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. In this project we also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report, we calculate the age of each repository from its creation time, as depicted in the equation below.

age = T(updated_at) − T(created_at)    (4.6)

Figure 4.17 and Table 4.6 show how development time varies by model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs across models (p-value ≤ 0.05). Therefore, we hypothesize that many of the earlier models started being used by the open-source web community immediately after their first release.
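Equation 4.6 can be computed directly from the `created_at` and `updated_at` timestamps returned by the GitHub API. A minimal stdlib sketch (the timestamps below are invented for illustration):

```python
from datetime import datetime

# GitHub API timestamps (created_at / updated_at) use ISO-8601 "Z" format.
FMT = "%Y-%m-%dT%H:%M:%SZ"

def repo_age_days(created_at: str, updated_at: str) -> float:
    """Age per Equation 4.6: T(updated_at) - T(created_at), in days."""
    delta = datetime.strptime(updated_at, FMT) - datetime.strptime(created_at, FMT)
    return delta.total_seconds() / 86400

ages = [repo_age_days("2018-11-01T00:00:00Z", "2019-01-10T00:00:00Z"),
        repo_age_days("2019-05-01T12:00:00Z", "2019-05-01T12:00:00Z")]
print(ages)  # [70.0, 0.0]
```

The per-model age samples can then be passed to `scipy.stats.kruskal(bert_ages, lstm_ages, ...)`, which returns the H statistic and the p-value behind the p ≤ 0.05 decision above.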


Model         Max (days)   Q3 (days)   Median (days)   Q1 (days)   Min (days)
Bert          779          229         110             32          0
Transformer   1254         321         142             11          0
Wide deep     1107         575         117             0.5         0
ResNet        1360         456.5       120             1.5         0
NCF           1120         476         216             8           0
LSTM          1812         621.25      315.5           47.25       0
CNN           1385         699.25      483             270.25      0

Table 4.6: Repository Development Time Statistics

[Figure omitted: boxplots of development time in days (0-2000) for the bert, cnn, lstm, ncf, resnet, transformer and wide deep (TensorFlow) models.]

Figure 4.17: Development Time Boxplot


[Figure omitted: scatter plot of open_issues (0-2000) against develop_duration (0-1100), colored by model.]

Figure 4.18: Development Time vs. Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of open issues. As suggested visually by the figure and by a Spearman correlation test, there is a weak correlation between the two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, which are more costly to maintain, may have more users and more issues related to them.
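Spearman's rho is the Pearson correlation of the two rank vectors, which is why it captures the monotonic (rather than linear) association above. A small self-contained sketch (the development times and issue counts below are invented, not the thesis data; in practice `scipy.stats.spearmanr` does this in one call):

```python
def rank(xs):
    """Average ranks (1-based), handling ties as Spearman's rho requires."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend the tie group
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Illustrative pairs: development time (days) vs. number of open issues.
dev_days = [10, 200, 400, 800, 1500]
issues = [0, 1, 0, 5, 20]
print(round(spearman(dev_days, issues), 4))  # 0.8208
```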

Specifically, as depicted in Table 4.7, the three models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model         mean    Std      25%   50%   75%   min   max
Bert          8.299   50.55    0     0     1     0     504
CNN           3.414   35.456   0     0     1     0     1077
LSTM          1.292   4.915    0     0     1     0     69
ResNet        1.791   11.164   0     0     0     0     186
Transformer   1.857   8.608    0     0     1     0     95
Wide Deep     0.231   0.742    0     0     0     0     4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository   Percentage of repositories having a Wiki (%)
Bert                       97.17
CNN                        98.498
LSTM                       98.799
NCF                        98.864
ResNet                     98.817
Transformer                96.97
Wide deep                  100

Table 4.8: Descriptive statistics on percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.


[Figure omitted: per-model histograms of open_issues (binned 0-100), y-axis Count of Records (0-800), for bert, cnn, lstm, ncf, resnet, transformer and wide deep (TensorFlow).]

Figure 4.19: Open Issues vs. Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitation and Improvement

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas on identifying models using different strategies, and we developed heuristics for in-depth analysis of how models are constructed through high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.
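The API-based identification heuristic can be sketched as a keyword scan over source code. The keyword table below is hypothetical (only the LSTM keywords appear in the project's README), and real code would also need to handle aliased imports, which is one reason the heuristic is imperfect:

```python
import re

# Hypothetical keyword table in the spirit of model_keyword.py: each model is
# identified by the TensorFlow APIs typically used to construct it.
MODEL_KEYWORDS = {
    "lstm": ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"],
    "cnn": ["tf.keras.layers.Conv2D", "tf.nn.conv2d"],
}

def match_models(source: str) -> set:
    """Return the models whose construction APIs appear in a source file."""
    found = set()
    for model, keywords in MODEL_KEYWORDS.items():
        if any(re.search(re.escape(k), source) for k in keywords):
            found.add(model)
    return found

code = "cell = tf.nn.rnn_cell.LSTMCell(128)\nx = tf.nn.conv2d(x, w, 1, 'SAME')"
print(sorted(match_models(code)))  # ['cnn', 'lstm']
```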

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question which needs further investigation. For example, users may publish their models in prototxt format, while in our project we only focused on deep learning models constructed using Python. Our findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, since a search cannot return more than 1000 originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories in GitHub. Other, more stratified samples might yield a more precise outcome.
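The sorting-strategy workaround can be sketched as follows. This is an assumption-laden sketch rather than model_searcher.py itself: it queries the real GitHub Search API endpoint (which returns at most 1000 results per query, 10 pages of 100 items) under several sort/order combinations, and deduplicates the union by repository full_name; `token` stands for a personal access token.

```python
import requests

API = "https://api.github.com/search/repositories"

def search_pages(query, token, sort, order):
    """Yield result pages for one sort/order strategy, up to the 1000-item cap."""
    for page in range(1, 11):  # 10 pages x 100 items = the Search API limit
        resp = requests.get(API,
                            params={"q": query, "sort": sort, "order": order,
                                    "per_page": 100, "page": page},
                            headers={"Authorization": f"token {token}"})
        items = resp.json().get("items", [])
        if not items:
            return
        yield items

def merge_strategies(pages_per_strategy):
    """Deduplicate repositories gathered under different sorting strategies."""
    seen = {}
    for pages in pages_per_strategy:
        for items in pages:
            for item in items:
                seen[item["full_name"]] = item  # full_name is unique per repo
    return list(seen.values())
```

Each strategy (e.g. stars descending, stars ascending, recently updated) surfaces a different 1000-repository window, so the merged set is larger than any single query allows, though still not exhaustive.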

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to learn how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to devise their own heuristics for data selection: experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a plugin for GitHub, allowing researchers


and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project, we investigated and examined the popularity of deep learning models through the number of related repositories that exist in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., k-means) to high-resolution time-series data from commits.
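The commit-timestamp clustering idea could start from something like the stdlib sketch below (in practice scikit-learn's KMeans would be the natural choice). The weekly commit-count series are invented: three repositories with "bursty early" activity and three with steady activity.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal k-means on fixed-length numeric vectors (stdlib only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Recompute each center as the mean of its cluster (keep old if empty).
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# Hypothetical weekly commit counts for six repositories.
series = [(9, 8, 1, 0), (10, 7, 0, 1), (8, 9, 1, 1),
          (2, 3, 2, 3), (3, 2, 3, 2), (2, 2, 3, 3)]
centers, clusters = kmeans(series, k=2)
print([len(c) for c in clusters])
```

Clustering the resulting activity profiles would group repositories by how their popularity evolves, rather than by a single snapshot count.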

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of deep learning related repositories on GitHub and identified factors affecting each of these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to help users gain a deeper insight into the current deep learning trend and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields and serve the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving and changing rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated this program is my own original work

7.3.1 Code Files Submitted

0. Configuration and Setup:
   __init__.py, setup.py, Model.py, model_keyword.py, test.py,
   JSONFormatter.sh, change_to_pdf.sh

1. Data Collection:
   model_searcher.py, item_filter.py

2. Repository Search:
   forks_time_stamp_getter.py

3. (Optional) Data Selection:
   repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair):
   contribution_stat.py, entropy_calculation.py,
   Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

PyCharm 2019.1.3 (Professional Edition)


Build #PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
macOS 10.14.6

• Anaconda

– jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0
  - numpy==1.14.0
  - statistics==1.0.3.5
  - ratelimit==2.2.1
  - requests
  - altair
  - matplotlib==2.2.2
  - selenium
- Git

Datasets

asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net.json, U-Net pytorch.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

- Before You Begin
- Prerequisites
- Install
- Running
- Test
- High Level Description of all Modules & Datasets
- Authors
- License

STAMPER

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keeps the Mac awake (otherwise it will disconnect from the internet).

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip:

    pip3 install --upgrade pip

Install the list of requirements specified in requirements.txt:

    pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run "python3 model_searcher.py" to get keyword-related repositories' metadata from GitHub in the "output" folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run "sh JSONFormatter.sh" in your terminal to well-format your output data.

Sample case: in main(), change "keywords" to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: "sort" can be "updated" or "stars", and "order" can be "asc" or "desc".

2. Repository Search

Run "python3 forks_time_stamp_getter.py" to get all the forks' timestamps in "forked_timestamp".

3. Data Selection (Optional)

Run "python3 repository_filter.py" to get your code-related repositories with statistics in the "filtered_repo" folder. Run "python3 filtered_repo.py" to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in "tensorflow_model_filtering" and "pytorch_model_filtering".

4. Data Visualization

- Popularity: run "python3 visualizations/popularity.py" and get your graphs in visualizations/graphs/popularity.
- Maintenance: run "python3 visualizations/maintenance.py" and get your graphs in visualizations/graphs/maintenance.
- Contribution: run "python3 visualizations/contribution.py" and get your graphs in visualizations/graphs/contribution.
- Multi Correlations: run "python3 visualizations/multi_variable.py" and get your graphs in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them into the file unreachable_urls.txt.

Usage: change the elements in "keywords" and run "python3 test.py". All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g.

    bert = Model("bert tensorflow", "desc_by_star")

with parameters: model name and repository-metadata subfolder. Then you can use this object with its related data easily (from Model import bert, and use bert as you go along).

Customize keywords: in the module model_keyword.py, import your instantiation (e.g. lstm) and call add_keywords, e.g.

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs.)

Experiment Datasets Collected

1. After Data Collection: the "output" folder contains the subfolders asc_by_star, asc_general, by_update_time, desc_by_star, desc_general and pytorch_models, each holding one JSON file per model (e.g. bert.json, cnn tensorflow.json, AlexNet.json).

2. After Repository Search: the "forked_timestamp" folder contains one CSV per model (bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv).

3. After Data Selection (Optional): the "filtered_repo" folder contains bert.json plus the pytorch_model_filtering and tensorflow_model_filtering subfolders listed in Appendix 3.

Generated Graphs: the "graphs" folder contains
- contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
- maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
- multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
- popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu
Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only).

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
    • Deep learning models
    • Summarized Timeline
  • Public Code Repositories
    • Web-based hosting service
    • Measuring Popularity From GitHub
    • Extracting Messy Data in the Wild
    • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    • RQ2: How does popularity vary per model?
    • RQ3: Does the popularity of models relate to other features?
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commit Timestamps
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Software
      • Other
      • Datasets
  • Appendix 4: README


Chapter 1

Introduction

1.1 Trace Deep Learning Use through GitHub

GitHub is one of the largest web-based hosting communities in the world and contains a rich source of data facilitating different software engineering projects. This massive database allows developers and researchers to publish their work.

GitHub can also be used for popularity measurement. Developers can construct their social networks in GitHub in multiple ways, such as joining an organisation or starring a repository they are interested in. These features make repositories in GitHub easily accessible and make it an excellent place to conduct empirical studies.

The deep learning research community is continually evolving. Not limited to academic research, deep learning is allowing businesses to use data to teach computers how to learn. As a result, a number of machine learning and deep learning startups have come into existence worldwide.

The development of deep learning models sometimes suffers from software engineering problems. Studies on the quality of deep-learning-related projects are sparse, and few researchers focus on their usage outside academia. With the expanding range and deepening degree of deep learning use, we would like to test whether developers keep up with the latest deep learning upsurge in academia.

A tool to extract the metadata of historical repositories, with integrated visualisation, based on the vast corpus of GitHub repositories is not currently available. To fill this gap we present our tool, STAMPER. It is implemented in Python and supports GitHub repository metadata extraction based on repository keyword search and code segment search using the GitHub API. It can also analyse the modification of forked repositories relative to the original repository and capture the repository difference.

We study the historical trend of deep learning models and frameworks from repositories in GitHub. We further demonstrate how repository metadata can be used to observe the deep learning trend. Our project provides a novel method to study deep learning frameworks and models from a historical perspective, and in the meanwhile our work creates a new aspect of the empirical study of deep learning.


1.2 Contribution

• We introduce STAMPER, a Python tool that can be used to extract the metadata characterising historical open source projects on GitHub, based on researchers' interests.

• Utilizing STAMPER in a case study, we analyse the usage of deep learning models in TensorFlow. We further demonstrate how we extract a rich set of features and establish the connections between those features and popularity.

• We create and mine a new dataset for further research use.

1.3 Report Outline

The rest of this report is structured as follows. Chapter 2 provides the timeline for deep learning frameworks and deep learning models. Some background knowledge is presented there, and previous works related to software mining, tools related to GitHub, and visualisations are recorded in that chapter as well.

Chapter 3 provides an overview of our work and introduces our proposed methodology to extract the metadata from GitHub repositories.

Chapter 4 presents a case study in which we use our tool to extract deep-learning-related repositories from GitHub and trace the landscape of popular deep learning models. The visualisations generated using our tool can aid researchers in understanding past and current trends easily.

Referring back to the critical questions posed, we conclude our project and highlight the next steps in Chapter 5.

Chapter 2

Background and Related Work

Software repository mining lets researchers study the historical trend of software engineering practice effectively. The use of repository mining is based on the use of web hosting services, and there exist multiple approaches to conduct such studies. In the first section we introduce some background knowledge on web-based hosting services. Then we introduce some popular deep learning frameworks in Section 2.1.1. Finally, we detail some previous works in Section 2.2 which conduct software repository mining and identify the main patterns in exploring repository popularity.

2.1 Background

2.1.1 Deep learning

Different from traditional machine learning, deep learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction [LeCun et al 2015]. Methods in deep learning can dramatically improve human life in multiple aspects, from image classification and speech recognition to machine translation, and even autonomous cars.

The usage of deep learning has grown tremendously in the past few years with the introduction and improvement of hardware availability (GPUs), big data, and cloud providers such as Amazon Web Services (AWS). Not limited to this, large companies create their own research teams to develop deep learning algorithms and integrate them into frameworks such as PyTorch and TensorFlow. In the meanwhile, they share their datasets and model training tutorials, helping startup companies to build state-of-the-art products with minimal effort and time.

Though deep learning has become a hot topic in recent years, the history behind deep learning has been evolving since the 1950s, as summarised in Table 2.1 [Subramanian 2018].

Previously, developers and researchers required expertise in C++ or CUDA to implement deep learning models and algorithms. However, thanks to large technology companies and organisations, people with knowledge of a scripting language (Python) can now also build their own deep learning algorithms.


Technique                        Year
Neural network                   1943
Backpropagation                  1960s
Convolutional Neural Network     1979
Recurrent neural network         1980
Long Short-Term Memory           1997

Table 2.1: Deep Learning History

In Section 2.1.1.1 and Section 2.1.1.2 we will discuss some state-of-the-art deep learning frameworks used today, both in academia and industry.

2.1.1.1 TensorFlow

TensorFlow is a machine learning system that operates at large scale [Abadi et al 2016]. Originally developed by the Google Brain team under the name DistBelief, it had its initial public release in November 2015. TensorFlow then released its official 1.0.0 version on 11 February 2017, introducing the ability to run on multiple CPUs and GPUs.

Currently Google uses this framework in numerous ways to improve its search engine, translation, recommendation system, and image recognition and captioning. The TensorFlow library contains different APIs to build deep learning models at scale, such as ResNet, CNN, or LSTM. The architecture of this framework gives developers the flexibility to experiment with new optimisations and Google's training algorithms. It currently supports numerous applications from front end to mobile; this flexibility and scalability make it one of the most popular deep learning frameworks in the world.

2.1.1.2 PyTorch

Based on the Torch library, PyTorch can be seen as a Python front end to the Torch engine, which provides the capability to define mathematical functions and compute gradients. Similar to TensorFlow, PyTorch has good GPU and CPU support and is primarily developed by Facebook's artificial intelligence research group. The official initial release of PyTorch was in October 2016; its dynamic computation capability gives it great flexibility in building complex architectures.

However, unlike TensorFlow, which can seamlessly integrate into real industrial applications, PyTorch was primarily developed for researchers and scientists, and is not easy or recommended for production usage in certain scenarios.


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and pick up the speed of project development to survive in this keen competition. Winning the trust of the public with a service of high quality is thus required.

Initially we would have liked to conduct our research on the latest model stores, such as AWS SageMaker, Azure Machine Learning service, Wolfram Neural Net Repository, and ONNX. However, the algorithms and datasets used there are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand over the public. To avoid those problems and gain a deeper insight into usage in society, we chose a framework with models of greater transparency and substantial usage: TensorFlow.

API Referenced

Since this project involves a range of deep learning models, we began by developing an understanding of how to build and train neural networks. The construction of networks may rely on the layers API provided by Keras and TensorFlow. In the meanwhile, TensorFlow recently introduced estimator APIs to simplify the procedure of training, evaluation, prediction, and export.

Convolutional Neural Network (CNN)

The Convolutional Neural Network is one of the most established algorithms among all the deep learning models and one of the most dominant algorithms in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons in a grid pattern. Typically a CNN consists of three types of layers: fully connected layers, convolution layers, and pooling layers. Convolution and pooling layers conduct the feature extraction, and the fully connected layer maps the extracted features onto the final output. Layers are interconnected, and thus the extracted features are transferred layer by layer.
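As an illustration of the three layer types above, the sketch below implements a toy convolution, pooling, and fully connected mapping in plain NumPy. This is an expository example only, not part of STAMPER; the array sizes and weight values are arbitrary.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution: slide the kernel over the image."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling over size x size windows."""
    h, w = feature_map.shape
    out = feature_map[:h - h % size, :w - w % size]
    out = out.reshape(h // size, size, w // size, size)
    return out.max(axis=(1, 3))

# Feature extraction (convolution + pooling), then a toy fully
# connected layer maps the flattened features to the final output.
image = np.ones((4, 4))
kernel = np.ones((3, 3))
features = max_pool(conv2d(image, kernel))   # shape (1, 1)
weights = np.array([[0.5]])                  # arbitrary dense weights
output = features.flatten() @ weights
```

With an all-ones 4x4 input and 3x3 kernel, each convolution window sums to 9.0; the pooled feature is then scaled by the dense layer.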

Long short-term memory (LSTM)

Different from traditional neural networks, which cannot memorise previous data, a special kind of recurrent neural network, Long Short-Term Memory (LSTM), provides researchers with an effective way to perform persistent data tasks and is capable of learning long-term dependencies [Hochreiter and Schmidhuber 1997]. Specifically, one variant of LSTM combines the forget and input gates into an update gate. It is capable of learning the dependencies in historical data and making predictions from the information remembered previously. Inside an LSTM, instead of using the

TensorFlow official models are chosen in our project. The TensorFlow official models contain a collection of deep learning models implemented with TensorFlow's high-level APIs (https://github.com/tensorflow/models/tree/master/official).


linear layer, there is a small network inside the LSTM which performs its function independently.

LSTM is one of the most common uses of recurrent neural networks. This model is generally used with sequence-related data and can solve language modelling problems such as NLP concepts (word embedding, encoder).
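The gating mechanism can be illustrated with a single LSTM time step in NumPy. This is a sketch of the standard textbook formulation (not the merged-gate variant mentioned above, and not STAMPER code); all weights here are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [h_prev; x] to the four gate
    pre-activations (input, forget, cell candidate, output)."""
    z = W @ np.concatenate([h_prev, x]) + b
    n = h_prev.size
    i = sigmoid(z[0 * n:1 * n])    # input gate
    f = sigmoid(z[1 * n:2 * n])    # forget gate
    g = np.tanh(z[2 * n:3 * n])    # candidate cell state
    o = sigmoid(z[3 * n:4 * n])    # output gate
    c = f * c_prev + i * g         # cell state: long-term memory
    h = o * np.tanh(c)             # hidden state: short-term output
    return h, c

# Tiny example: hidden size 2, input size 3, all-zero weights.
n, m = 2, 3
W = np.zeros((4 * n, n + m))
b = np.zeros(4 * n)
h, c = lstm_step(np.ones(m), np.zeros(n), np.ones(n), W, b)
```

With zero weights every gate sits at 0.5, so half of the previous cell state is carried forward; real weights are learned from data.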

Residual Network (ResNet)

One of the problems that deep learning models face is that, as the number of layers increases past a certain point, accuracy stops improving. This problem motivated modern network architectures such as the residual network (ResNet) and Inception, which address it through residual connections.

ResNet normally solves the problem described above by fitting a residual mapping via a shortcut connection. Each ResNet block contains a series of layers and a shortcut connection component.
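A minimal sketch of a residual block, assuming two toy linear layers W1 and W2 (illustrative only; a real ResNet block uses convolutions and batch normalisation):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """y = relu(F(x) + x): the shortcut connection adds the input
    back, so the stacked layers only need to learn the residual F(x)."""
    out = relu(W1 @ x)     # first layer of the block
    out = W2 @ out         # second layer (no activation yet)
    return relu(out + x)   # shortcut connection, then activation

x = np.array([1.0, -2.0])
I = np.eye(2)
# With identity weights the block computes relu(relu(x) + x).
y = residual_block(x, I, I)
```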

Bidirectional Encoder Representations from Transformers (Bert)

Bert is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al 2018].

It was first released in google-research/bert on GitHub on 1 November 2018, supporting a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications; it aims to predict the relationship between sentences by analysing the whole sentence holistically [Devlin et al 2018].

Attention is all you need (Transformer)

Most problems in deep learning can be seen as a form of sequence-to-sequence mapping and can be solved using a common type of architecture: the Encoder-Decoder architecture.

The encoder and decoder both consist of identical layers, and each layer consists of 2 sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network [Vaswani et al 2017].

Neural Collaborative Filtering (NCF)

Neural Collaborative Filtering (NCF) is a neural network architecture which utilises the non-linearity of the neural network to build recommendation systems [He et al 2017]. It demonstrates that matrix factorisation can be interpreted as a special case of neural collaborative filtering. To add additional non-linearity, this model introduces a multi-layer perceptron module alongside the generalised matrix factorisation (GMF) layer.
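The GMF component can be sketched as follows. This is an expository toy based on the description above: the embedding values and the output weight vector h are made up, and the final sigmoid of the full NCF model is omitted.

```python
import numpy as np

def gmf_score(user_vec, item_vec, h):
    """Generalised matrix factorisation: element-wise product of the
    user and item embeddings, projected by a learned weight vector h.
    Plain matrix factorisation is the special case h = all-ones."""
    return float(h @ (user_vec * item_vec))

user = np.array([0.5, 1.0, -0.5])   # toy user embedding
item = np.array([1.0, 2.0, 2.0])    # toy item embedding
# With h = ones, GMF reduces to the ordinary dot product.
assert gmf_score(user, item, np.ones(3)) == user @ item
```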

Wide and Deep Learning

Since linear models are not great at generalising across unique features, deep models are introduced to solve this problem. Deep models can use embedding vectors for every query, and this technique can then generalise by


coupling the items and queries. To overcome over-generalisation, the Google research team introduced the Wide & Deep Learning model, which jointly trains wide linear models and deep neural networks to combine the benefits of memorisation and generalisation for recommender systems [Cheng et al 2016].

2.1.3 Summarized Timeline

Model Name      Raised Time
CNN             1980s
LSTM            1997
ResNet          2015
Wide & Deep     2016
NCF             2017
Transformer     2017
Bert            2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest hosts of open-source projects using distributed version control systems (DVCS) [Gousios et al 2014]. A distributed version control system enables contributors to submit a set of changes and integrate them in the main development branch. The use of Git is based on pragmatic needs, combining the advantages of version control and collaborative development.

GitHub can also lead to insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project. Thus the number of stars can reveal popularity; from a software development research perspective, it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth using time series metadata derived from 2279 accessible GitHub repositories. In the meanwhile, they found that slow growth is more common in the case of overpopulated application domains and for old repositories. Moreover, they conclude that the three most common domains on GitHub are web libraries and frameworks, and non-web libraries and frameworks.

Building on their work, we would like to examine whether deep learning framework repositories follow a similar popularity trend pattern or not. At the same time, we would also like to study whether a relationship exists between three factors: forks, stars, and watchers.

Predicting the Popularity of GitHub Repositories

In the same year, Borges et al [2016a] published another paper about predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regressions to predict the number of stars of GitHub repositories, and as a result project owners can see how their projects are performing in the open-source community.

Their studies examine the correlation between the popularity of a repository and the number of stars it owns. In the meanwhile, the study reports a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent

GHTorrent [Gousios and Spinellis 2012] is designed to conduct independent repository mining through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all the repositories, and the resulting dataset can be used towards answering multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving the raw data from the results returned by the REST API. However, their tool does not have the ability to visualise the metadata or offer trend analysis at a high level.

MetricMiner

A similar tool is MetricMiner [Sokol et al 2013]. It is a web application that supports researchers in mining software repositories, extracting data, and performing statistical inference on the data collected. This tool automatically clones the repository, processes the metadata, and stores the data in the cloud, which gives it good scalability and fast computational speed in query answering without users installing any software on their localhost.

GitcProc

GitcProc [Casalnuovo et al 2017] is a tool that uses regular expressions to extract the changed lines within a repository to facilitate answering project evolution questions. GitcProc can retrieve and summarise global statistics of project metrics, including the number of commits, commit dates, and contributors. This tool can measure how many changes have taken place in Java projects and is also able to locate the changed files.

RepoVis

RepoVis [Feiner and Andrews 2018] is a new tool which provides visual overviews for software maintained in Git repositories. RepoVis is a client-server web application providing a full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its searchable functionality to GitHub, associated with a code-based search. All the visualisations are written out in SVG format.

2.2.4 Visualizing data in Repositories

Chronos

CHRONOS [Servant and Jones 2013] is a software tool that enables the visualisation of historical change inside software source code. This tool implements a zoomable user interface as a visualisation of the actual code, which supports developers from a high-level view (pattern recognition) to a low-level view (reverse engineering) using the History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise the complete history of change, including the revisions that modified them. Inspired by this tool, our project uses visualisation to track the historical change of the popularity


Figure 2.1: git2net [Gote et al 2019]

trend related to the keyword specified by users in GitHub.

GEVOL

Collberg et al [2003] implemented a tool called GEVOL that can visualise the evolution of software using a novel graph drawing technique, to deduce a better understanding of a program from its development history, displaying all the visualisations using a temporal graph visualizer.

This system aids in the discovery of the structure of a system and provides the user with a new way to discover the evolution of a program by visualising the change of the system. It extracts information about Java programs stored within a CVS version control system into three types of graphs: inheritance, control-flow, and call-graphs.

git2net

git2net [Gote et al 2019] is a software tool that facilitates the extraction of the co-editing network in git repositories. Similar to GEVOL, it uses text mining techniques to analyse the history of modifications within files. However, it addresses the importance of studying the social network in GitHub and gives the reader a broader view of the application of graph-based data analysis and modelling. Their tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter we detailed the web-based hosting service we selected for study (GitHub) and presented the concept of deep learning with two popular frameworks, together with state-of-the-art neural network models. In the next chapter we will elaborate on how we designed and implemented STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter we outline our design and implementation for data extraction, and then detail the metrics we use to estimate the trend of deep learning framework and model usage on GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

[Figure 3.1: Overview of STAMPER — (1) Data Collection of keywords/model names via the Git Project Search API; (2) Repository Search; (3) optional Data Selection via the Git Code Search API; followed by local Data Visualisation]


Data Collection

We first collect all the repository metrics through the GitHub API. This step allows us to extract the history of all the repositories related to the keyword and record the metadata for each repository (repository-based search).

Repository Search

As each repository contains multiple contributors and forks, we subsequently extract and analyse whether changes were made or not based on the size information, and calculate the collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the forked information to create visual representations.

Data Selection

We have implemented a selector allowing users to exclude specific repositories not related to the desired ones. The selector summarises the frequency counts for keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis

Since each forked repository may be related to re-development and modification, the modification of forked repositories is included in our project. For all changed repositories, the size difference is analysed to examine whether lines were added or removed compared to the original repository.
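As a sketch of how such a collaborative entropy factor might be computed from per-developer contribution counts (an illustration of the idea, not necessarily STAMPER's exact formula):

```python
import math

def contribution_entropy(contributions):
    """Shannon entropy (in bits) of a repository's contribution
    distribution: 0 when one person wrote everything, higher when
    work is spread evenly across contributors."""
    total = sum(contributions)
    probs = [c / total for c in contributions if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Two developers with equal contributions: 1 bit of entropy.
print(contribution_entropy([50, 50]))   # 1.0
# One dominant contributor: entropy near 0.
print(contribution_entropy([99, 1]))
```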

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to deeply analyse and manipulate the data, and even run statistical tests on the data set. To better understand those metrics, we divide them into multiple categories. For the attributes that are not primary data from the GitHub API, we explain them in the data expansion part and label them as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication, the user can make up to 5000 requests per hour; otherwise, the rate limits only allow up to 60 requests per hour [Git d].
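A minimal sketch of this authenticated collection step, assuming only the standard library. The function names are ours, not STAMPER's; the header and endpoint follow the GitHub v3 REST API.

```python
import json
import urllib.request

API = "https://api.github.com"

def auth_headers(token=None):
    """Authenticated requests get 5000 requests/hour;
    anonymous requests are limited to 60."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    return headers

def search_url(keyword, page=1, per_page=100):
    """Repository keyword search endpoint used for data collection."""
    return (f"{API}/search/repositories?q={keyword}"
            f"&per_page={per_page}&page={page}")

def fetch(url, token=None):
    """Perform one authenticated GET and decode the JSON response."""
    req = urllib.request.Request(url, headers=auth_headers(token))
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# e.g. fetch(search_url("tensorflow ResNet"), token="<OAuth2 token>")
```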


Type           Meta-data
Contributor    contribution: int [Data Expansion]
               login (user name): String
               type (user/organization): String
               contributors_url
Repository     created_at
               description
               full_name
               language
               size
Popularity     fork: Boolean
               forks: int
               forks_url
               stargazers_count
               watchers_count
               unique_repos [Data Expansion]
Owner          id
               login (username)
               type
Maintenance    has_issues: Boolean
               has_wiki: Boolean
               open_issues: int
               pushed_at
               updated_at
               score

Table 3.2: Repository metadata collected by STAMPER

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deepcrawling operations to collect raw data

• Contribution
One repository generally consists of multiple developers conducting software development. The project owner is not necessarily the person who contributes the greatest amount of code, and the amounts of contribution made by the developers are potentially not the same. As a result, we further track that information by utilizing the GitHub API and record the number of contributions each developer made for each repository.

• Unique_repos
Popular repositories may have numerous related forked repositories owned by other users, and the reasons behind the forking behaviour vary. Our research would like to explore whether users conduct subsequent development based on the original codebase. By comparing the size of the forked


repository (Fi) and the original repository (O), we obtain all the forked repositories with a change of size (c):

Fi + c = O        (3.1)
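Equation 3.1 amounts to the following check (a toy sketch; the helper names are hypothetical, and sizes are the kilobyte values reported by the GitHub API):

```python
def size_change(original_size, fork_size):
    """c in  Fi + c = O : zero when the fork is byte-for-byte the same
    size as the original, non-zero when content was added or removed."""
    return original_size - fork_size

def changed_forks(original_size, fork_sizes):
    """Forks whose size differs from the original are treated as
    modified after forking."""
    return [f for f in fork_sizes if size_change(original_size, f) != 0]

print(changed_forks(120, [120, 95, 120, 150]))  # [95, 150]
```

Size is only a proxy: a fork could in principle change without changing size, which is why STAMPER's deeper analysis also inspects the forked repositories themselves.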

3.4 Data Selection

[Figure 3.2: Data Selection — entity (model) API keywords are searched in each repository and summarised as statistics]

[Figure 3.3: Store in Local Disk — forked repository timestamps and model-related keywords (Bert, ResNet, CNN) are filtered from the unfiltered data and grouped per model (model.py)]

sect34 Data Selection 15

Figure 3.2 represents our method to search API usage in DL-model-related repositories. GitHub provides the REST API to examine whether a repository contains the targeted API or not.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. Moreover, the appearance frequency of the user-specified API is embedded directly within the returned result and can be matched in our program to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. This approach also allows users to build up knowledge of API usage in GitHub repositories from a high-level perspective.
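The frequency-counting step can be sketched as follows (illustrative only; the helper name and keyword list are made up, and the JSON string stands in for the file STAMPER writes to disk):

```python
import json

def api_frequency(source_code, keywords):
    """Count how often each user-specified API keyword appears in a
    repository's source, as a proxy for development effort."""
    return {kw: source_code.count(kw) for kw in keywords}

code = "import tensorflow as tf\nmodel = tf.keras.Model()\n"
counts = api_frequency(code, ["tf.keras", "tf.estimator"])
payload = json.dumps(counts, indent=2)  # written to the local disk
print(payload)
```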

In the meanwhile, we also provide multiple options for selecting APIs, which are stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras applications library provides the user with the ability to load deep learning models and instantiate a model with default weights. Take ResNet as an example:

• With pre-defined models
Users may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, and this model could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they all could be used as sample keywords in STAMPER, such as:

1. keras.applications.resnet.ResNet50
2. keras.applications.resnet.ResNet101
3. keras.applications.resnet.ResNet152
4. keras.applications.resnet_v2.ResNet50V2
5. keras.applications.resnet_v2.ResNet101V2
6. keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to explore the components inside the ResNet class further, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, the data selection heuristic varies: deep learning

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py


users and experts can define their searches according to their interests and preferences.
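The two heuristics can be sketched with regular expressions (our own illustrative patterns, not STAMPER's actual ones; the function name is hypothetical):

```python
import re

# Heuristic 1: an import of a pre-trained Keras ResNet variant.
PRETRAINED = re.compile(
    r"from\s+keras\.applications\S*\s+import\s+ResNet\d+")
# Heuristic 2: a self-defined ResNet class in the repository's code.
SELF_DEFINED = re.compile(r"class\s+ResNet\w*\s*\(")

def resnet_usage(source):
    """Classify a source snippet by how it appears to use ResNet."""
    if PRETRAINED.search(source):
        return "pre-defined"
    if SELF_DEFINED.search(source):
        return "self-defined"
    return None

print(resnet_usage("from keras.applications.resnet50 import ResNet50"))
print(resnet_usage("class ResNet50(tf.keras.Model):"))
```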

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from these three perspectives is illustrated in Figure 3.4. In the meanwhile, Chapter 4 gives an example using our collected repository metadata for deep learning models.

[Figure 3.4: Overall Construct the Visualizations — entities 1..n are functionally mapped to contribution-related, popularity-related, and maintenance-related visualisations]

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped by week (with forks)

• Repository Creation Time vs Stars

Contribution

To further exploit the forking information, STAMPER supports the comparison between an original repository and its forked repositories. The work could be further extended by visiting the forked repository URL and tracing the commits.

As shown in Figure 3.5, the entity (E) we search in GitHub may have multiple related repositories (Ri) and their corresponding forked repositories (Fi). Among the forked repositories, we call a changed forked repository Ci.

To examine whether there exists change in the forked repositories, and the difference between multiple entities, we calculate the difference using the equation down


Keyword                     Total of Repository           Total of Original
                            (including Forks) Collected   Repository Collected
ResNet tensorflow           6129                          339
Bert tensorflow             13734                         106
CNN tensorflow              39765                         1000
LSTM tensorflow             19572                         1000
Transformer tensorflow      7188                          145
Wide and deep tensorflow    324                           39

Table 3.1: Repositories Related to Tensorflow


below. Uniqueness percentage distributions are composed of all the percentages pi, each corresponding to its original repository Ri:

    pi = (Σ Ci) / (Σ Fi)        (3.2)
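Equation 3.2 in code (a toy sketch; the change flags are hypothetical):

```python
def uniqueness_percentage(changed, total):
    """pi = (number of changed forks Ci) / (number of forks Fi):
    the share of forks of repository Ri that diverged from it."""
    return changed / total if total else 0.0

# Figure 3.5-style example: forks changed Y, N, Y, Y -> 75% unique.
flags = [True, False, True, True]
p = uniqueness_percentage(sum(flags), len(flags))
print(p)  # 0.75
```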

[Figure 3.5: Examine Uniqueness after Forking — an entity (E) maps to repositories 1..n, each with forked repositories 1..n marked as changed (Y/N)]

• Percentage of Forked Repositories Unique from Origin (Boxplots)

• Uniqueness Percentage Distribution for Each Entity (Histograms)

• Entropy Distribution for Each Entity (Histograms)

Maintenance

• Development Time Boxplot for Each Entity

• Open Issues Distribution for Each Entity

3.6 Summary

In this chapter we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. In the meanwhile, we introduced and analysed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving and are built, trained, and deployed by researchers. Our tool is available for analysing such changes. We collected the historical information stored in GitHub and extracted the metadata in the repositories by using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development is a battlefield without the smoke of gunpowder. Researchers, companies, and developers are all trying to dominate the conversation in deep learning. There exists a variety of models to work with, but no common bridge to connect those ideas together. Historical data in GitHub are obscure and hard to find, and as a result many developers, especially experienced ones, remain in their comfort zone. With our study we hope to shed some light on deep learning and highlight a few suggestions for the public.

This section aims to answer some questions related to both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al [2016b] collected 2500 popular repositories based on the number of stars. However, due to the few studies about popularity in the GitHub ecosystem, there is no standardised feature to measure popularity. We analyse some potential features of each repository and make the hypothesis that popularity is strongly related to the stars each repository owns.

This decision is justified in the following section with more background on GitHub.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activities in a repository they are watching; however, watching does not confer collaborator status [Git b]. A watcher can watch


Figure 4.2: Star Sort Menu [Git a]

a repository to receive notifications for new pull requests or issues that are created. Watchers can indicate how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

• Stars
Starring a repository makes it easy for users to keep track of a repository they are interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). The star is another metric measuring popularity within the GitHub community, and thus GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks
A fork is created when a user makes their own copy of a repository. The user can fork a repository to suggest changes or to use it as the basis for a new project.

Based on the data gathered in the data collection stage, we extract the watchers_count, stargazers_count and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics and draw a scatter plot, shown in Figure 4.3.
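As a sketch of this extraction step (assuming repository records in the JSON shape returned by the GitHub REST API v3; the helper name is hypothetical, not part of STAMPER), note that in that API `watchers_count` mirrors `stargazers_count`, which is consistent with the perfect star/watcher correlation reported below:

```python
def popularity_metrics(repo: dict) -> dict:
    """Extract the three popularity attributes from one GitHub repository record.

    Note: in the GitHub REST API v3, `watchers_count` holds the same value as
    `stargazers_count`, so a perfect star/watcher correlation is expected.
    """
    return {
        "stars": repo["stargazers_count"],
        "watchers": repo["watchers_count"],
        "forks": repo["forks_count"],
    }

# Illustrative record shaped like the first Bert row of Table 4.1
record = {"stargazers_count": 17940, "watchers_count": 17940, "forks_count": 4661}
print(popularity_metrics(record))  # {'stars': 17940, 'watchers': 17940, 'forks': 4661}
```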

§4.1 Popularity of Deep Learning Models in GitHub

Figure 4.3: Popularity Metric

star     forks_count   watchers_count   model name
17940    4661          17940            Bert
12405    3637          12405            Bert
5263     1056          5263             Bert

Table 4.1: Popularity metrics for repositories

The scatter plot suggests a definite positive correlation between the number of stars, number of forks and number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we should not assume normality and instead use a rank-based test.

22 STAMPER in Action

Spearman Correlation Coefficient

Definition: The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman, 2011]. This statistical method allows us to test for a rank-order relationship between two numerical or ordinal variables associated by a monotonic (increasing or decreasing) function.

Hypothesis Testing

• H0: The variables (star, fork and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

• Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

• Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Set α = 0.05. The p-values p1, p2 and p3 are all less than α, and the calculations above show strong positive correlations, with coef1 = 0.875, coef2 = 1.0 and coef3 = 0.875 respectively.

This means it is very unlikely (at 95% confidence) that the testing data are uncorrelated, and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of the report, we use the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and long short-term memory network (LSTM), shown in the figures below, appear to be two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from models with a longer history, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as already described in the background section.

The model development community has recently seen the release of multiple powerful frameworks that are treated as baselines for building models. However, for many new models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.


Figure 4.4: Repositories with Forks (number of repositories created, including forks, accumulated per model, 2015 to 2019)

Figure 4.5: Repositories without Forks (number of repositories created, accumulated per model, 2015 to 2019)


Figure 4.6: Repository Trend in GitHub for Each Model (per-model repository counts, October 2015 to October 2019)


Figure 4.7: Creation Time vs Stars

A fork is another copy of a repository. A forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories created with forks and the total number created without forks: most repositories related to deep learning models are not original. This indicates that a considerable number of developers remain in a studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing repositories currently in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, in contrast to the previous summarizing method, we study how the popularity of deep learning models changes over time.

CNN and LSTM's dominant role

As the above comparison makes clear, CNN and LSTM are the winners in the GitHub community, with the highest average number of stars and the highest number of repositories created. The data bears this out: in 2017 the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend rose to a higher level, which continues today.

What accounts for this tremendous usage difference? CNN and LSTM currently have among the largest and most significant communities in the deep learning field. These networks are essential in both computer vision and NLP, and they hold an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly in the last two years. Differing from earlier structures such as CNN, both modify the original structure and significantly improve results in computer vision and translation tasks.

Rising Star: Bert

However, no model comes with perfection; architectures keep being extended into many variants, and BERT is one of those.

The current trends depicted in the graphs support the inference that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with how long the model has existed, but our data tells a different story.

Published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no simple relationship between popularity (i.e., stars) and creation time; past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific model could flatten out or reverse itself.

The Wide and Deep model is similar: although Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the previous data confirms there has been no significant rise in its use.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (fork distribution histograms per model)


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name      Mean     STD       Min   25%   50%   75%   Max
Bert            498.65   2196.3    0     1     8     43    17940
CNN             106.84   611.97    2     3     8     32    13882
LSTM            48.82    214.22    0     1     2     13    2703
NCF             77       129.91    1     2     3     115   227
ResNet          46.88    221.43    0     0     1     8     2980
Transformer     186.79   1155.87   0     0     4     21    12408
Wide and Deep   16.23    36.80     0     0     1     8     146

Table 4.2: Stars Comparison

Model Name      Mean         STD          Min   25%   50%   75%    Max
Bert            128.214953   585.926617   0.0   0.0   1.0   16.5   4661.0
CNN             40.710       252.713617   0.0   1.0   4.0   14.0   6274.0
LSTM            17.793       71.956709    0.0   0.0   1.0   5.0    968.0
NCF             34.333333    58.603185    0.0   0.5   1.0   51.5   102.0
ResNet          17.442478    93.754994    0.0   0.0   0.0   3.0    1442.0
Transformer     53.518797    336.103826   0.0   0.0   1.0   6.0    3637.0
Wide and Deep   7.282051     16.364192    0.0   0.0   0.0   2.5    71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52) and NCF (34.33).

The three models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The three models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79) and ResNet (17.44).
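The per-model summaries in Tables 4.2 and 4.3 can be sketched with the standard library alone (a minimal sketch; `describe` is a hypothetical helper, and the three input values are a toy example sized like the NCF row):

```python
from statistics import mean, median, stdev

def describe(values):
    """Mean, sample standard deviation, min, median and max for one model's repositories."""
    return {
        "mean": round(mean(values), 2),
        "std": round(stdev(values), 2),
        "min": min(values),
        "50%": median(values),
        "max": max(values),
    }

print(describe([1, 3, 227]))  # {'mean': 77.0, 'std': 129.91, 'min': 1, '50%': 3, 'max': 227}
```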

Kruskal-Wallis Test. The Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The 7 models' distributions are the same.

• H1: The 7 models' distributions are different.

from scipy.stats import kruskal
stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                  dfNcf["star"].tolist(), dfResnet["star"].tolist(), dfTransformer["star"].tolist(),
                  dfWideDeep["star"].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories in GitHub. This unexpected result can be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or Bert model may require a large amount of time and effort, while developers still show their interest in those novel deep learning models by starring and forking them.

Figure 4.9: Star vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show scatter plots correlating the number of stars with the number of contributors, development time, number of open issues and entropy, respectively.


Figure 4.10: Star vs Development Time

Figure 4.11: Star vs Open Issues

Figure 4.12: Star vs Entropy Value

Number of Contributors
From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Across all repositories, the top-3 repositories with the most stars per contributor are from the models CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor) and Bert (155.0 stars/contributor).

Model           Percentage of One-Contributor Development (%)
Bert            74.53
CNN             83.3
LSTM            85.9
NCF             100
ResNet          90.26
Transformer     81.20
Wide and Deep   89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time
From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value = 0). This suggests that the longer a model has been developed, the more stars it will have (i.e., the model becomes more popular). The two repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues
From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesize that the more popular a repository becomes, the more issues it will have. We further investigate this correlation in the following section.

Entropy
From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of a new feature, the entropy value, on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution in a repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly in developing a deep learning model. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can confirm this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. Here we introduce an information-theoretic approach to test whether contributions are even or not.

Entropy
In particular, we compute the entropy H of each repository, defined as

$p_i = \frac{c_i}{\sum_i c_i}$    (4.1)

$H = -\sum_i p_i \log_2(p_i)$    (4.2)

where i indexes the contributors, $c_i$ is the i-th contributor's contribution, and $\sum_i c_i$ is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example:

The contribution table is summarized in Table 4.5, and the corresponding entropy can then be calculated:

$\text{Total} = 174 + 36 + 4 = 214$    (4.3)

$p_1 = \frac{174}{214},\quad p_2 = \frac{36}{214},\quad p_3 = \frac{4}{214}$    (4.4)

$H(\text{repository}) = -\left(\frac{174}{214}\log_2\frac{174}{214} + \frac{36}{214}\log_2\frac{36}{214} + \frac{4}{214}\log_2\frac{4}{214}\right) \approx 0.7826$    (4.5)
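As a cross-check, the worked example can be reproduced with a few lines of Python (a minimal sketch of Equations 4.1 and 4.2; `repo_entropy` is a hypothetical helper, not part of STAMPER):

```python
import math

def repo_entropy(contributions):
    """Collaboration entropy of one repository (Equations 4.1 and 4.2)."""
    total = sum(contributions)
    probs = [c / total for c in contributions]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Worked example: contributions 174, 36 and 4 from Table 4.5
print(round(repo_entropy([174, 36, 4]), 4))  # 0.7826
```

A single-contributor repository gives an entropy of exactly zero, matching the interpretation that low entropy means unevenly distributed work.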

name            contribution
dragen1860      174
ash3n           36
kelvinkoh0308   4

Table 4.5: Sample Contributions to One Repository

The resulting distribution of entropy across all repositories can be used to determine whether repositories are developed unevenly: the lower the entropy, the stronger the separation, meaning the work is more unevenly distributed.

Figure 4.13 shows the distribution of the entropy values for all models. From those figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


Figure 4.13: Collaboration Entropy (entropy distribution histograms per model)


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original. We observe that Bert has a high proportion of unique forked repositories among the models studied.

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. Looking more closely, we can see at a glance not only that changes are rarely made after forking, but also that most changed

Figure 4.15: Repository Uniqueness Distribution (%)

Figure 4.16: Repository Change Statistics (distribution of size change after forking, per model)


repositories differ in size from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them of less concern to people. Second, a model may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development effort in forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.
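A minimal sketch of this uniqueness check (assuming, as a simplification, that a fork counts as unchanged when the `size` field reported by the GitHub API equals the original's; the helper name and values are hypothetical):

```python
def percent_unchanged(original_size, fork_sizes):
    """Percentage of forks whose reported size equals the original repository's size."""
    if not fork_sizes:
        return 0.0
    unchanged = sum(1 for size in fork_sizes if size == original_size)
    return 100.0 * unchanged / len(fork_sizes)

# Hypothetical example: three of four forks report the same size as the original
print(percent_unchanged(1024, [1024, 1024, 2048, 1024]))  # 75.0
```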

4.3 Maintenance of Deep Learning Models in GitHub

In this section we survey software maintenance in these deep learning related repositories. The overall purpose is to explore three factors related to maintenance: development time, the number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation and last-update times, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
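A small sketch of this calculation (assuming the ISO-8601 timestamp format returned by the GitHub API; `repo_age_days` is a hypothetical helper, not part of STAMPER):

```python
from datetime import datetime

def repo_age_days(created_at, updated_at):
    """Age in days between creation and last update (Equation 4.6).

    Timestamps follow the GitHub API format, e.g. '2019-01-01T00:00:00Z'.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    delta = datetime.strptime(updated_at, fmt) - datetime.strptime(created_at, fmt)
    return delta.total_seconds() / 86400

print(repo_age_days("2019-01-01T00:00:00Z", "2019-04-21T00:00:00Z"))  # 110.0
```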

Figure 4.17 and Table 4.6 show how development time varies by model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide and Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distributions of development days differ by model (p-value ≤ 0.05). Therefore we hypothesize that many of the earlier models started being used in the open-source web community immediately after their first release.


Model         Max (days)   Q3 (days)   Median (days)   Q1 (days)   Min (days)
Bert          779          229         110             32          0
Transformer   1254         321         142             11          0
Wide and Deep 1107         575         117             0.5         0
ResNet        1360         456.5       120             1.5         0
NCF           1120         476         216             8           0
LSTM          1812         621.25      315.5           47.25       0
CNN           1385         699.25      483             270.25      0

Table 4.6: Repository Development Time Statistics

Figure 4.17: Development Time Boxplot


Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and the Spearman correlation test, there is a weak correlation between the two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, with their higher maintenance cost, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.30), CNN (3.41) and Transformer (1.86). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model         Mean    Std      25%   50%   75%   Min   Max
Bert          8.299   50.55    0     0     1     0     504
CNN           3.414   35.456   0     0     1     0     1077
LSTM          1.292   4.915    0     0     1     0     69
ResNet        1.791   11.164   0     0     0     0     186
Transformer   1.857   8.608    0     0     1     0     95
Wide and Deep 0.231   0.742    0     0     0     0     4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository   Percentage of Repositories Having a Wiki (%)
Bert                       97.17
CNN                        98.498
LSTM                       98.799
NCF                        98.864
ResNet                     98.817
Transformer                96.97
Wide and Deep              100

Table 4.8: Descriptive statistics on the percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this chapter we first used STAMPER to get the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation of whether developers change the original codebase after forking.


Figure 4.19: Open Issues vs Number of Repositories (distribution of open issues per model)


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmark deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for an in-depth analysis of model construction and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation, since, for example, users may publish their models in prototxt format, while in our project we only considered deep learning models constructed in Python. Our findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, as a search cannot exceed the 1,000-result boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub, but this still cannot capture all repositories; other, more stratified samples might give a more precise outcome.

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real lives, an idea that is novel compared with previous work.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly, with different models released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program provides a broader picture of deep learning model usage in the world. Our program also allows developers or users to define their own heuristics for data selection; experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or becoming a plugin for GitHub, allowing researchers



and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project we investigated the popularity of deep learning models through the number of repositories in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., K-Means) to the high-resolution time series data from commits.

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis in GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub deep learning related repositories and identified the factors affecting those domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the tool's ability to give users deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and resulting corpus will be of considerable interest to researchers in different fields, and will serve the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

The ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015); Processor: 2.2 GHz Intel Core i7; Memory: 16 GB 1600 MHz DDR3; Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

PyCharm 2019.1.3 (Professional Edition)


Build #PY-191.7479.30, built on May 30, 2019; Licensed to ANU / Xing Yu; JRE: 11.0.2+9-b159.60 x86_64; JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.; macOS 10.14.6

• Anaconda

  – jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0
  - numpy==1.14.0
  - statistics==1.0.3.5
  - ratelimit==2.2.1
  - requests
  - altair
  - matplotlib==2.2.2
  - selenium
- Git

Datasets

asc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_by_star: bert tensorflow.json, lstm tensorflow.json, wide deep tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, cnn tensorflow.json, ncf tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json

by_update_time: bert tensorflow.json, lstm tensorflow.json, resnet tensorflow.json, wide deep tensorflow.json, cnn tensorflow.json, ncf tensorflow.json, transformer tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json

forked_timestamp: bert tensorflow.csv, lstm tensorflow.csv, resnet tensorflow.csv, wide deep tensorflow.csv, cnn tensorflow.csv, ncf tensorflow.csv, transformer tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents: Before You Begin; Prerequisites; Install; Running; Test; High Level Description of all Modules & Datasets; Authors; License

STAMPER is a python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view/edit/execute our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keep the Mac awake with this useful app (otherwise it will disconnect the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip:

pip3 install --upgrade pip

Install the list of requirements specified in requirements.txt:

pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repository metadata from GitHub in the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars, and order can be asc or desc.

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

Run python3 repository_filter.py to get the code-related repositories with statistics in the filtered_repo folder. Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity
- Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance
- Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution
- Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them into the file unreachable_urls.txt.

Usage: change the elements in keywords and run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with parameters model name and repository metadata subfolder. Then you can use this object with its related data easily (from Model import bert and use bert as you go along).

Customize Keywords: in the module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection

output
├── asc_by_star
│   ├── cnn tensorflow.json
│   └── lstm tensorflow.json
├── asc_general
│   ├── bert.json
│   ├── cnn.json
│   ├── lstm.json
│   ├── ncf.json
│   ├── resnet.json
│   ├── transformer.json
│   └── wide deep.json
├── by_update_time
│   ├── bert tensorflow.json
│   ├── cnn tensorflow.json
│   ├── lstm tensorflow.json
│   ├── ncf tensorflow.json
│   ├── resnet tensorflow.json
│   ├── transformer tensorflow.json
│   └── wide deep tensorflow.json
├── desc_by_star
│   ├── bert tensorflow.json
│   ├── cnn tensorflow.json
│   ├── lstm tensorflow.json
│   ├── ncf tensorflow.json
│   ├── resnet tensorflow.json
│   ├── transformer tensorflow.json
│   └── wide deep tensorflow.json
├── desc_general
│   ├── bert.json
│   ├── cnn.json
│   ├── lstm.json
│   ├── ncf.json
│   ├── resnet.json
│   ├── transformer.json
│   └── wide deep.json
└── pytorch_models
    ├── AlexNet.json
    ├── DCGAN.json
    ├── Densenet.json
    ├── FCN-ResNet101.json
    ├── GoogleNet.json
    ├── HarDNet.json
    ├── Inception_v3.json
    ├── MobileNet v2.json
    ├── PGAN.json
    ├── ResNet.json
    ├── ResNet101.json
    ├── ResNext WSL.json
    ├── ResNext.json
    ├── RoBERTa.json
    ├── SSD.json
    ├── ShuffleNet v2.json
    ├── SqueezeNet.json
    ├── Tacotron 2.json
    ├── Transformer.json
    ├── U-Net pytorch.json
    ├── U-Net.json
    ├── WaveGlow.json
    ├── Wide ResNet.json
    ├── fairseq.json
    └── vgg_nets.json

2. After Repository Search

forked_timestamp
├── bert tensorflow.csv
├── cnn tensorflow.csv
├── lstm tensorflow.csv
├── ncf tensorflow.csv
├── resnet tensorflow.csv
├── transformer tensorflow.csv
└── wide deep tensorflow.csv

3. After Data Selection (Optional)

filtered_repo
├── bert.json
├── pytorch_model_filtering
│   ├── Densenet.json
│   ├── FCN-ResNet101.json
│   ├── GoogleNet.json
│   ├── MobileNet v2.json
│   ├── ResNet101.json
│   ├── ResNext.json
│   ├── ShuffleNet v2.json
│   ├── SqueezeNet.json
│   ├── Tacotron 2.json
│   ├── Wide ResNet.json
│   └── vgg_nets.json
└── tensorflow_model_filtering
    ├── bert.json
    ├── lstm.json
    ├── ncf.json
    ├── resnet.json
    ├── transformer.json
    └── wide deep.json

Generated Graphs

graphs
├── contribution
│   ├── change_to_pdf.bash
│   ├── entropy_distribution.svg
│   ├── entropy_dots.svg
│   ├── lines_changed_boxs.svg
│   ├── lines_changed_hists.svg
│   ├── unique_percentage_distribution.svg
│   └── uniqueness_chart.svg
├── maintenance
│   ├── devTime_boxplot.svg
│   ├── issues_distribution.svg
│   └── wiki_yn.svg
├── multi_variable
│   ├── dev_t_to_open_issues.svg
│   ├── multi_correlation.svg
│   ├── star_to_contributors.svg
│   ├── star_to_dev_t.svg
│   ├── star_to_entropy.svg
│   └── star_to_open_issues.svg
└── popularity
    ├── accumulated_popularity.svg
    ├── creation_repository_trend_total.svg
    ├── creation_with_fork_timeline.svg
    ├── fork_distribution.svg
    ├── popularity_dot.svg
    └── popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)


Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

  • Acknowledgments
  • Abstract
  • List of Abbreviations
  • Contents
  • Introduction
    • Trace Deep Learning use through GitHub
    • Contribution
    • Report Outline
  • Background and Related Work
    • Background
      • Deep learning
        • TensorFlow
        • PyTorch
      • Deep learning models
      • Summarized Timeline
    • Public Code Repositories
      • Web-based hosting service
      • Measuring Popularity From GitHub
      • Extracting Messy Data in the Wild
      • Visualizing data in Repositories
    • Summary
  • STAMPER Design and Implementation
    • Overview
    • Data Collection
    • Repository Search
    • Data Selection
      • Example
    • Construct the Visualizations
    • Summary
  • STAMPER in Action
    • Popularity of Deep Learning Models in GitHub
      • Popularity Feature Selection
      • Past and Current Status: A Full Integration
      • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
      • RQ2: How does popularity vary per model?
      • RQ3: Does the popularity of models relate to other features?
    • Contribution of Deep Learning Models in GitHub
      • Collaborative Contribution
      • RQ1: After forking, do developers change the codebase?
    • Maintenance of Deep Learning Models in GitHub
      • RQ1: How long has it been in existence?
      • RQ2: Do old models have more issues compared to new models?
      • RQ3: Are they well maintained?
    • Summary
  • Discussion And Future Work
    • Discussion
      • Data in the wild: Limitation and Improvement
      • Extensibility and Open-Source Software
    • Future Work
      • Social Network Analysis in GitHub
      • Trend Detection using Commit Timestamps
  • Conclusion
  • Appendix
    • Appendix 1: Project Description
      • Project Title
      • Supervisors
      • Project Description
      • Learning Objectives
    • Appendix 2: Study Contract
    • Appendix 3: Artefact Description
      • Code Files Submitted
      • Program Testing
      • Experiment
        • Hardware
        • Software
        • Other
        • Datasets
    • Appendix 4: README

Chapter 1

Introduction

11 Trace Deep Learning use through GitHub

GitHub is one of the largest web-based hosting communities in the world, containing a rich source of data facilitating different software engineering projects. This massive database allows developers and researchers to publish their work.

GitHub can also be used for popularity measurement. Developers can construct their social networks in GitHub in multiple ways, such as joining an organisation or starring a repository they are interested in. Those features make repositories in GitHub easily accessible and the best place to conduct empirical studies.

The deep learning research community is continually evolving. Not limited to academic research, deep learning is allowing businesses to use data to teach computers how to learn. Therefore a number of machine learning and deep learning startups have come into existence worldwide.

The development of deep learning models sometimes has software engineering problems. Quality studies of deep-learning-related projects are sparse, and few researchers focus on usage outside academia. With the expanding range and deepening degree of deep learning use, we would like to test whether developers keep up with the latest deep learning upsurge in academia.

A tool to extract the metadata of historical repositories, with integrated visualisation, based on the vast corpus of GitHub repositories is not currently available. To fill this gap we present our tool, STAMPER. It is implemented in Python and supports GitHub repository metadata extraction based on repository keyword search and code segment search using the GitHub API. It can also analyse the modification of forked repositories relative to the original repository and capture the repository difference.
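STAMPER's actual request logic lives in its own modules (model_searcher.py); as a rough, hypothetical sketch of the kind of repository-keyword query such a tool issues, using the public GitHub v3 search endpoint (the helper name and defaults here are ours, not the tool's):

```python
from urllib.parse import urlencode

def build_search_url(keyword, sort="stars", order="desc", page=1):
    """Construct a GitHub repository-search URL for a keyword.
    sort may be 'stars' or 'updated'; order may be 'asc' or 'desc'."""
    params = urlencode({"q": keyword, "sort": sort, "order": order, "page": page})
    return f"https://api.github.com/search/repositories?{params}"

url = build_search_url("bert tensorflow")
# Sending the request with an Authorization token header raises the
# permitted request rate, which is why an authentication key is needed
# for large-scale collection.
```

Paging through the results of such URLs, subject to the API's rate limits, is what yields the per-keyword repository metadata files described in the appendix.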

We study the historical trend of deep learning models and frameworks from repositories in GitHub. We further demonstrate how repository metadata can be used to see the deep learning trend. Our project provides a novel method to study deep learning frameworks and models from a historical aspect; in the meanwhile, our work creates a new aspect of empirical study in deep learning.


1.2 Contribution

• We introduce STAMPER, a python tool that can be used to extract the metadata characterizing historical open source projects in GitHub, based on researchers' interests.

• Utilizing STAMPER in a case study on the usage of deep learning models in TensorFlow, we further demonstrate how we extract a rich set of features and establish the connections between those features and popularity.

• We create and mine a new dataset for further research use.

1.3 Report Outline

The rest of this report is structured as follows. Chapter 2 provides the timeline for deep learning frameworks and deep learning models. Some background knowledge is presented in that chapter, and previous work on software mining, GitHub-related tools and visualizations is recorded there as well.

Chapter 3 provides an overview of our work and introduces our proposed methodology to extract metadata from GitHub repositories.

Chapter 4 presents a case study in which we use our tool to extract deep-learning-related repositories from GitHub and to trace the landscape of those hot deep learning models. The visualizations generated using our tool can aid researchers in understanding past and current trends easily.

Referring back to the critical questions posed, we conclude our project and highlight the next steps in Chapter 5.

Chapter 2

Background and Related Work

Software repository mining lets researchers study the historical trends of software engineering practice effectively. The use of repository mining is based on the use of web hosting services, and multiple approaches exist to conduct such studies. In the first section we introduce some background knowledge on web-based hosting services. Then we introduce some popular deep learning frameworks in Section 2.1.1. Finally we detail, in Section 2.2, some previous works that conduct software repository mining and identify the main patterns in exploring repository popularity.

2.1 Background

2.1.1 Deep learning

Different from traditional machine learning, deep learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction [LeCun et al 2015]. Methods in deep learning can dramatically improve human life in multiple aspects, from image classification and speech recognition to machine translation, even to autonomous cars.

The usage of deep learning has grown tremendously in the past few years with the introduction and improvement of hardware availability (GPU), big data, and cloud providers such as Amazon Web Services (AWS). Not limited to this, large companies create their own research teams to develop deep learning algorithms and integrate them into frameworks such as PyTorch and TensorFlow. In the meanwhile, they share their datasets and model training tutorials, helping startup companies to build state-of-the-art products with minimal effort and time.

Though deep learning has become a hot topic only in recent years, the history behind deep learning has been evolving since the 1950s, as summarised in Table 2.1 [Subramanian 2018].

Previously, developers and researchers required expertise in C++ or CUDA to implement deep learning models and algorithms. However, thanks to large technology companies and organisations, people with knowledge of a scripting language (Python) can now also build their own deep learning algorithms.


Technique | Year
Neural network | 1943
Backpropagation | 1960s
Convolutional Neural Network | 1979
Recurrent neural network | 1980
Long Short-Term Memory | 1997

Table 2.1: Deep Learning History

In Sections 2.1.1.1 and 2.1.1.2 we talk about some state-of-the-art deep learning frameworks used today, both in academia and industry.

2.1.1.1 TensorFlow

TensorFlow is a machine learning system that operates at large scale [Abadi et al 2016]. Before its initial release by the Google Brain team in November 2015, it was developed under the name DistBelief. TensorFlow then released its official 1.0.0 version on 11 February 2017, introducing the ability to run on multiple CPUs and GPUs.

Currently Google uses this framework in numerous ways to improve its search engine, translation, recommendation system, and image recognition and captioning. The TensorFlow library contains different APIs to build deep learning models at scale, such as ResNet, CNN or LSTM. The architecture of this framework gives developers the flexibility to experiment with new optimisations and with Google's training algorithms. It currently supports numerous applications, from front end to mobile; its flexibility and scalability make it one of the most popular deep learning frameworks in the world.

2.1.1.2 PyTorch

Based on the Torch library, PyTorch can be seen as a Python front end to the Torch engine, which provides the capability to define mathematical functions and compute gradients. Similar to TensorFlow, PyTorch has good GPU and CPU support and is primarily developed by Facebook's artificial intelligence research group. The official initial release of PyTorch was in October 2016; its dynamic computation capability gives it greater flexibility in building complex architectures.

However, unlike TensorFlow, which can integrate seamlessly into real industrial applications, PyTorch is primarily developed by researchers and scientists, and in certain scenarios it is not easy or recommended to use in production.


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and pick up the speed of project development to survive in this keen competition. Winning trust from the public with high-quality service is thus required.

Initially we intended to conduct our research on the latest model stores, such as AWS SageMaker, Azure machine learning service, the Wolfram neural net repository and ONNX. However, the algorithms and datasets used there are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand with the public. To avoid those problems and gain a more in-depth insight into usage in society, we chose a framework with models of greater transparency and substantial usage: TensorFlow.

API Referenced

Since this project involves a range of deep learning models, we began by developing an understanding of how to build and train neural networks. The construction of networks may rely on the layer APIs provided by Keras and TensorFlow. In the meanwhile, TensorFlow recently introduced estimator APIs to simplify the procedures of training, evaluation, prediction and export.

Convolutional Neural Network (CNN)
The Convolutional Neural Network is one of the most established algorithms among all the deep learning models and one of the most dominant algorithms in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons in a grid pattern. Typically, a CNN consists of three types of layers: fully connected layers, convolution layers and pooling layers. Convolution and pooling layers conduct the feature extraction, and the fully connected layer maps the extracted features into the final output. Layers are interconnected, and thus the extracted features are transferred layer by layer.
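As a toy illustration of the feature extraction described above (a sketch for intuition only, not code from this project), the core convolution operation can be written directly; here a small vertical-edge kernel responds wherever pixel intensity changes from left to right:

```python
def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as in most DL libraries):
    slide the kernel over the image and sum the elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# A dark-to-bright vertical edge sits between columns 1 and 2
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[-1, 1],
          [-1, 1]]
feature_map = conv2d(image, kernel)  # peaks exactly at the edge
```

In a real CNN the kernel weights are learned rather than hand-picked, and many such feature maps are stacked and passed through pooling and further convolution layers.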

Long short-term memory (LSTM)
Different from traditional neural networks, which cannot memorise previous data, a special kind of recurrent neural network, Long Short-Term Memory (LSTM), provides researchers with an effective way to perform persistent data tasks and is capable of learning long-term dependencies [Hochreiter and Schmidhuber 1997]. Specifically, LSTM combines the forget and input gates into an update gate. It is capable of learning dependencies from the historical data and making predictions from the information remembered previously. Inside an LSTM, instead of a single linear layer, there is a small network which performs these functions independently.

Note: TensorFlow official models are chosen in our project. The TensorFlow official models contain a collection of deep learning models built with TensorFlow's high-level APIs (https://github.com/tensorflow/models/tree/master/official).

LSTM is one of the most common uses of recurrent neural networks. This model is generally used with sequential data and can solve language modelling problems such as NLP concepts (word embeddings, encoders).

Residual Network (ResNet)
One of the problems deep learning models face is that as the number of layers increases past a certain point, accuracy stops improving. This problem motivated modern network architectures, such as the residual network (ResNet) and Inception, which address it through residual connections.

ResNet normally solves the problem described above by fitting a residual mapping via a shortcut connection. Each ResNet block contains a series of layers and a connection component.

Bidirectional Encoder Representations from Transformers (Bert)
Bert is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al 2018].

It was first released in google-research/bert on GitHub on 1 November 2018, with the ability to handle a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications; it aims to predict the relationship between sentences by analysing the whole sentence holistically [Devlin et al 2018].

Attention is all you need (Transformer)
Most problems in deep learning can be seen as a form of sequence-to-sequence mapping and can be solved using a common type of architecture, the encoder-decoder architecture.

The encoder and decoder both consist of stacks of identical layers, and each layer consists of two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network (FFN) [Vaswani et al 2017].
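The self-attention sub-layer is built on scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V [Vaswani et al 2017]. A minimal NumPy sketch (illustrative only: a single head, no masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # query/key similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)    # row-wise softmax
    return w @ V                             # weighted sum of value rows

# One query attending over two key/value pairs; it matches the first key,
# so the output leans toward the first value row.
Q = np.array([[1.0, 0.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0]])
out = scaled_dot_product_attention(Q, K, V)
```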

Neural Collaborative Filtering (NCF)
Neural Collaborative Filtering (NCF) is a neural network architecture that utilises the non-linearity of neural networks to build recommendation systems [He et al 2017]. It demonstrates that matrix factorisation can be expressed as a special case of neural collaborative filtering. To add additional non-linearity, the model introduces a multi-layer perceptron module alongside the generalised matrix factorisation (GMF) layer.
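The fusion of the two branches can be sketched as follows. This is an illustrative NumPy forward pass with random weights, not the authors' implementation: the GMF branch is the element-wise product of user and item embeddings; the MLP branch runs their concatenation through a ReLU layer; the two are concatenated before a final sigmoid prediction layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neumf_forward(u, i, w_mlp, w_out):
    """Fuse a GMF branch (element-wise product of embeddings) with a
    one-layer MLP branch over their concatenation, then apply a final
    sigmoid prediction layer. Weights here are illustrative, not trained."""
    gmf = u * i                                          # GMF branch
    mlp = np.maximum(np.concatenate([u, i]) @ w_mlp, 0)  # ReLU MLP branch
    fused = np.concatenate([gmf, mlp])                   # join both branches
    return sigmoid(fused @ w_out)                        # interaction score

rng = np.random.default_rng(0)
u, i = rng.normal(size=4), rng.normal(size=4)            # toy embeddings
score = neumf_forward(u, i, rng.normal(size=(8, 4)), rng.normal(size=8))
```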

Wide and Deep Learning
Since linear models are not good at generalising across unique features, deep models were introduced to solve this problem. Deep models can use embedding vectors for every query, a technique that generalises by coupling the items and queries.

To overcome over-generalisation, the Google research team introduced the Wide & Deep Learning models, which jointly train wide linear models and deep neural networks to combine the benefits of memorisation and generalisation for recommender systems [Cheng et al 2016].

213 Summarized Timeline

Model Name     Time Introduced
CNN            1980s
LSTM           1997
ResNet         2015
Wide & Deep    2016
NCF            2017
Transformer    2017
Bert           2018

Table 2.2: Timeline


22 Public Code Repositories

221 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest hosts of open-source projects built on the distributed version control system (DVCS) Git [Gousios et al 2014]. A distributed version control system enables contributors to submit a set of changes and integrate them in the main development branch. The use of Git is based on pragmatic needs: its advantages combine version control with collaborative development.

GitHub can also lead to insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project. Thus, the number of stars can reveal popularity from a software-development research perspective, and it may give developers and researchers valuable insights.

222 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth, using time-series metadata derived from 2279 accessible GitHub repositories. Meanwhile, they found that slow growth is more common in overpopulated application domains and for old repositories. Moreover, they conclude that the most common domains on GitHub include web libraries and frameworks and non-web libraries and frameworks.

Building on their work, we examine whether deep learning framework repositories follow a similar popularity trend pattern. At the same time, we also study whether a relationship exists between three factors: forks, stars and watchers.

Predicting the Popularity of GitHub Repositories
In the same year, Borges et al [2016a] published another paper about predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regression to predict the number of stars of GitHub repositories, so that project owners can see how their projects are performing in the open-source community.

Their study examines the correlation between the popularity of a repository and the number of stars it owns. Meanwhile, the study reports a very strong correlation between predicted and real rankings.


223 Extracting Messy Data in the Wild

GHTorrent
GHTorrent [Gousios and Spinellis 2012] is designed to conduct independent repository mining through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all repositories, and the resulting dataset can be used to answer multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving raw data based on the results returned from the REST API. However, their tool cannot visualise the metadata or offer high-level trend analysis.

MetricMiner
A similar tool is MetricMiner [Sokol et al 2013], a web application that supports researchers in mining software repositories, performing data extraction and statistical inference on the collected data. The tool automatically clones the repository, processes the metadata and stores the data in the cloud, which gives it good scalability and fast query answering without requiring researchers to install any software on their localhost.

GitcProc
GitcProc [Casalnuovo et al 2017] is a tool that uses regular expressions to extract the changed lines within a repository to facilitate answering project-evolution questions. GitcProc can retrieve and summarise global project statistics, including the number of commits, commit dates and contributors. It can measure how many changes have taken place in Java projects and is also able to locate the changed files.

RepoVis
RepoVis [Feiner and Andrews 2018] is a new tool that provides visual overviews of software maintained in Git repositories. RepoVis is a client-server web application providing a full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its searchable functionality to GitHub, combined with a code-based search. All visualisations are written out in SVG format.

224 Visualizing data in Repositories

Chronos
CHRONOS [Servant and Jones 2013] is a software tool that enables visualisation of historical change inside software source code. It implements a zoomable user interface over the actual code, supporting developers from a high-level view (pattern recognition) to a low-level view (reverse engineering) using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise the complete history of change, including the revisions that modified them. Inspired by this tool, our project uses visualisation to track the historical change of popularity trends related to user-specified keywords in GitHub.

Figure 2.1: git2net [Gote et al 2019]

GEVOL
Collberg et al [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph-drawing technique, to deduce a better understanding of a program from its development history; all visualisations are displayed with a temporal graph visualiser.

This system aids in discovering the structure of a system and gives the user a new way to observe the evolution of a program by visualising changes to the system. It extracts information about Java programs stored in a CVS version control system and renders the metadata into three types of graphs: inheritance, control-flow and call graphs.

git2net
git2net [Gote et al 2019] is a software package that facilitates the extraction of co-editing networks from git repositories. Similar to GEVOL, it uses text-mining techniques to analyse the history of modifications within files. However, the authors address the importance of studying the social network in GitHub and give the reader a broader view of graph-based data analysis and modelling. Their tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

23 Summary

In this chapter we detail the web-based hosting service we selected (GitHub) and present the concept of deep learning, two popular frameworks, and state-of-the-art neural network models. In the next chapter we elaborate how we design and implement STAMPER.

Chapter 3

STAMPER Design andImplementation

In this chapter we outline our design and implementation for data extraction, and then detail the metrics we use to estimate the trend of deep learning framework and model usage in GitHub.

31 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

Figure 3.1: Overview of STAMPER. (1) Data Collection via the GitHub project search API; (2) Repository Search; (3) [Optional] Data Selection of model-related keywords via the code search API; results are stored locally for Data Visualisation.



Data Collection
We first collect all repository metrics through the GitHub API. This step extracts the history of all repositories related to the keyword and records the metadata for each repository (repository-based search).

Repository Search
As each repository contains multiple contributors and forks, we subsequently extract and analyse whether changes were made, based on size information, and calculate a collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the fork information to create visual representations.

Data Selection
We implemented a selector that allows users to exclude repositories unrelated to the desired ones. The selector summarises frequency counts for user-entered keywords and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis
Since each forked repository may involve re-development and modification, modification of forked repositories is included in our project. For all changed repositories, the size difference is analysed to examine whether lines were added or removed compared to the original repository.

32 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to deeply analyse and manipulate the data and even run statistical tests on the data set. To better understand these metrics, we divide them into multiple categories. For attributes that are not primary data from the GitHub API, we explain them in the data expansion part and label them as [Data Expansion] in Table 3.2.

To maximise the GitHub API request rate, the user is required to authenticate by entering an OAuth2 token at the start of the program. After authentication, the user can make up to 5000 requests per hour; otherwise, rate limits allow only up to 60 requests per hour [Git d].
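Such an authenticated request can be formed with Python's standard library. The endpoint below is GitHub's public repository-search API; the token value and helper name are placeholders, not STAMPER's actual code:

```python
import json
import urllib.parse
import urllib.request

def build_search_request(keyword, token=None):
    """Build an (optionally authenticated) GitHub repository-search request.
    With an OAuth2 token the API allows up to 5000 requests per hour;
    without one, only 60."""
    url = ("https://api.github.com/search/repositories?q="
           + urllib.parse.quote(keyword))
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:                                  # placeholder token from the user
        headers["Authorization"] = "token " + token
    return urllib.request.Request(url, headers=headers)

req = build_search_request("resnet tensorflow", token="<YOUR-OAUTH2-TOKEN>")
# data = json.load(urllib.request.urlopen(req))  # network call; run when online
```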


Type          Meta-data
Contributor   contribution: int [Data Expansion]
              login (user name): String
              type (user/organization): String
              contributors_url
Repository    created_at
              description
              full_name
              language
              size
Popularity    fork: Boolean
              forks: int
              forks_url
              stargazers_count
              watchers_count
              unique_repos [Data Expansion]
Owner         id
              login (username)
              type
Maintenance   has_issues: Boolean
              has_wiki: Boolean
              open_issues: int
              pushed_at
              updated_at
              score

Table 3.2: Repository metadata collected from the GitHub API

33 Repository Search

We expanded the collection operation to use event data as the starting point for deep crawling operations to collect raw data.

• Contribution
One repository generally involves multiple developers conducting software development, and the project owner is not necessarily the person who contributes the most code; the amounts contributed by different developers are potentially unequal. As a result, we further track this information using the GitHub API and record the number of contributions each developer made to each repository.

• Unique_repos
Popular repositories may have numerous forked repositories owned by other users, and the reasons behind forking behaviour vary. Our research explores whether users conduct subsequent development based on the original codebase. By comparing the size of each forked


repository (Fi) and the original repository (O), we obtain all the forked repositories with a change of size (c):

Fi + c = O    (3.1)
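In code, this check amounts to computing c = O - Fi for every fork and keeping the forks where c is non-zero. A small illustrative sketch (the function name is ours, not STAMPER's API):

```python
def changed_forks(original_size, fork_sizes):
    """Return (fork_size, c) pairs where c = O - F_i is non-zero,
    i.e. forks whose size differs from the original repository.
    Sizes are the `size` field (in KB) returned by the GitHub API."""
    return [(f, original_size - f) for f in fork_sizes
            if original_size - f != 0]

# Toy example: a 120 KB original with three forks, one of them untouched.
changes = changed_forks(120, [120, 118, 150])  # [(118, 2), (150, -30)]
```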

34 Data Selection

Figure 3.2: Data Selection. An entity (model) and its API keywords are searched within each repository to produce usage statistics.

Figure 3.3: Store in Local Disk. Unfiltered fork data with timestamps is filtered by model-related keywords (e.g. Bert, ResNet, CNN) and grouped by model.


Figure 3.2 represents our method of searching API usage in DL-model-related repositories. GitHub provides a REST API to examine whether a repository contains a targeted API.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. The appearance frequency of the user-specified API is embedded directly in the returned result and can be identified in our program, corresponding to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. The approach also allows users to build a high-level picture of API usage in GitHub repositories.

In the meanwhile, we provide multiple options for selecting APIs, stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the number of models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras applications library provides users the ability to load deep learning models and instantiate a model with default weights. Using ResNet as an example:

• With pre-defined models
A user may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, and that this model could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they can all be used as sample keywords in STAMPER, such as:

keras.applications.resnet.ResNet50
keras.applications.resnet.ResNet101
keras.applications.resnet.ResNet152
keras.applications.resnet_v2.ResNet50V2
keras.applications.resnet_v2.ResNet101V2
keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to further explore the components inside the ResNet class, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, data selection heuristics vary: deep learning users and experts can define their searches according to their own interests and preferences.

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py
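The selection step can be sketched as a keyword-frequency count over a repository file's source text, with the counts written to disk as JSON. The keyword list and output filename below are illustrative stand-ins for model_keyword.py and STAMPER's actual output:

```python
import json

# Illustrative keyword list standing in for model_keyword.py.
MODEL_KEYWORDS = [
    "keras.applications.resnet.ResNet50",
    "from keras.applications.resnet50 import ResNet50",
]

def keyword_frequencies(source_text, keywords=MODEL_KEYWORDS):
    """Count how often each model-related keyword appears in a file's text,
    a proxy for the development effort associated with the API."""
    return {kw: source_text.count(kw) for kw in keywords}

sample = "from keras.applications.resnet50 import ResNet50\nmodel = ResNet50()\n"
freqs = keyword_frequencies(sample)
with open("frequencies.json", "w") as fh:   # illustrative output filename
    json.dump(freqs, fh, indent=2)
```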

35 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, (iii) maintenance analysis.

The process of generating the visualisations from these three perspectives is illustrated in Figure 3.4. Meanwhile, Chapter 5 gives an example of our collected repository metadata for deep learning models.

Figure 3.4: Overall construction of the visualisations. Entities are functionally mapped to popularity-, contribution- and maintenance-related visualisations.

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped by week (with forks)

• Repository creation time vs stars

Contribution

To further exploit the forking information, STAMPER supports comparison between the original repository and its forked repositories. The work could be further extended by visiting each forked repository's URL and tracing its commits.

As shown in Figure 3.5, the entity (E) we search in GitHub may have multiple related repositories (Ri) with their corresponding forked repositories (Fi). Among the forked repositories, we call a changed forked repository Ci.

To examine whether changes exist in forked repositories, and the difference between multiple entities, we calculate the difference using the following equation.


Keyword                     Total Repositories Collected    Total Original Repositories
                            (including forks)               Collected
ResNet tensorflow           6129                            339
Bert tensorflow             13734                           106
CNN tensorflow              39765                           1000
LSTM tensorflow             19572                           1000
Transformer tensorflow      7188                            145
Wide and deep tensorflow    324                             39

Table 3.1: Repositories Related to Tensorflow


Uniqueness percentage distributions are composed of all the percentages pi, each corresponding to its original repository Ri:

pi = (Σ Ci) / (Σ Fi)    (3.2)
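Equation (3.2) amounts to the fraction of an entity's forks that changed. A one-function sketch (our naming, not STAMPER's):

```python
def uniqueness_percentage(changed_flags):
    """Equation (3.2): p_i = (number of changed forks C_i) / (total forks F_i)."""
    if not changed_flags:
        return 0.0
    return sum(changed_flags) / len(changed_flags)

# Figure 3.5's example: four forks marked changed Y, N, Y, Y.
p = uniqueness_percentage([True, False, True, True])  # 0.75
```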

Figure 3.5: Examine Uniqueness after Forking. An entity (E) maps to repositories R1..Rn; each forked repository F1..Fn is marked as changed (Y/N).

bull Percentage of Forked Repositories Unique from Origin (Boxplots)

bull Uniqueness percentage distribution for Each Entity (Histograms)

bull Entropy Distribution histograms for Each Entity (Histograms)

Maintenance

bull Development Time Boxplot For Each Entity

bull Open Issues Distribution For Each Entity

36 Summary

In this chapter we detail the design of our tool and how it conducts repository mining and analysis. We present a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. In the meanwhile, we introduce and analyse two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field, where models are continually evolving and being built, trained and deployed by researchers. Our tool is designed to analyse such changes: we collected the historical information stored in GitHub and extracted each repository's metadata using STAMPER.

41 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development has become a battlefield without the smoke of gunpowder. Researchers, companies and developers are all trying to dominate the conversation in deep learning. There exists a variety of models to think in, but no common bridge connects those ideas together. Historical data in GitHub are obscure and hard to find, and as a result many developers, especially experienced ones, remain in their original comfort zone. With our study we hope to shed some light on deep learning trends and highlight a few suggestions for the public.

This section aims to answer questions about both models' usage in GitHub and the popularity of DL model development.

411 Popularity Feature Selection

Borges et al [2016b] collected 2500 popular repositories based on their number of stars. However, owing to the few studies about popularity in the GitHub ecosystem, there is no standardised feature for measuring popularity. We analyse some potential features of each repository and hypothesise that popularity is strongly related to the number of stars a repository owns.

This decision is justified in the following section with more background on GitHub.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching; watching a repository does not, however, make the watcher a collaborator [Git b]. A watcher receives notifications for new pull requests and issues, so the number of watchers indicates how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

Figure 4.2: Star Sort Menu [Git a]

• Stars
Starring a repository makes it easy for a user to keep track of a repository they are interested in. Starred repositories appear on the user's own stars page (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and GitHub provides a ranking based on the number of stars a repository has [Git c].

• Forks
Forks are created when a user makes their own copy of a repository. The user can fork a repository to suggest changes, or use it as the basis for a new project.

Based on the data we collected in the data collection stage, we extract the watchers_count, stargazers_count and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarise 86712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star     forks_count    watchers_count    model name
17940    4661           17940             Bert
12405    3637           12405             Bert
5263     1056           5263              Bert

Table 4.1: Popularity metrics for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, and thus we should not assume normality; we instead consider a rank-based correlation measure.


Spearman Correlation Coefficient
Definition: The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman 2011]. This method allows us to test for a rank-order relationship between two numerical ordinal variables associated by a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

    from scipy.stats import spearmanr
    coef1, p1 = spearmanr(star, forks)
    print(coef1, p1)
    >> 0.8752903811064278 0.0

• Star vs Watcher

    coef2, p2 = spearmanr(star, watchers)
    print(coef2, p2)
    >> 1.0 0.0

• Fork vs Watcher

    coef3, p3 = spearmanr(forks, watchers)
    print(coef3, p3)
    >> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables on the testing dataset.

Set α = 0.05. Since p1, p2 and p3 are all less than α, and the calculation above shows strong positive correlations with values of coef1 = 0.875, coef2 = 1.0 and coef3 = 0.875 respectively, the likelihood that the testing data are uncorrelated is very small (95% confidence), and thus we can reject the hypothesis that these variables are uncorrelated.

In the rest of the report we therefore treat the number of stars as the proxy for a project's popularity.


412 Past and Current Status A Full Integration

According to our survey, the convolutional neural network (CNN) and long short-term memory network (LSTM), shown in the figures below, look like two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from these models with longer histories, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as described in the background section.

The model development community recently saw the release of multiple powerful frameworks that are treated as baselines for building models. However, for many new models, such as the Wide and Deep and NCF models, usage has not grown in abundance.


Figure 4.4: Repositories with Forks. Accumulated number of repositories created per model (including forks), 2015-2019.

Figure 4.5: Repositories without Forks. Accumulated number of original repositories created per model, 2015-2019.


Figure 4.6: Repository Trend in GitHub For Each Model. Per-model repository counts over time, October 2015 to October 2019.


Figure 4.7: Creation Time vs Stars. Repository creation date against number of stars, per model, 2015-2019.

A fork is another copy of a repository; the forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories created including forks and the total number of original repositories. We find that most repositories related to deep learning models are not original, which indicates that a considerable number of developers remain in a studying stage.

At the same time, we use this dataset to answer some research questions.

413 RQ1 How has the popularity of model changed over time A closerlook at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, different from the previous summarising method, we study the popularity of deep learning models as it changes over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the largest number of repositories created. Let us examine this using the data: in 2017 the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued rising to a higher level, where it remains.

What accounts for this tremendous usage difference? CNN and LSTM currently have two of the largest and most significant communities in the deep learning field; these networks are essential in both computer vision and NLP, where they hold an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer
As the title indicates, ResNet and Transformer usage has improved significantly in the last two years. Unlike earlier structures such as the plain CNN, both conduct modifications of an original structure and significantly improve results in computer vision and translation tasks.

Rising Star Bert

However, no model comes with perfection; the Transformer itself can be extended to many variants, and BERT is one of those.

The current trends depicted in the graphs suggest that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in frozen zone NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with when the model came into existence, but our data tell a different story.

Although its paper was published in 2017, NCF draws the least attention in the GitHub community. This also shows that there is no simple relationship between popularity (i.e. stars) and creation time: past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

Similarly for the Wide and Deep model: although Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, previous data also confirm that there is no significant rise in the use of this model.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development. Binned fork-count distribution histograms, per model.


414 RQ2 How popularity varies per model

Figure 4.8 shows the distribution of the number of forks for each model, and as presented in Table 4.2 and Table 4.3 we can see that:

Model Name       Mean      STD       Min   25%   50%   75%   Max
Bert             498.65    2196.3    0     1     8     43    17940
CNN              106.84    611.97    2     3     8     32    13882
LSTM             48.82     214.22    0     1     2     13    2703
NCF              77        129.91    1     2     3     115   227
ResNet           46.88     221.43    0     0     1     8     2980
Transformer      186.79    1155.87   0     0     4     21    12408
Wide and Deep    16.23     36.80     0     0     1     8     146

Table 4.2: Stars Comparison

Model Name       Mean      STD       Min   25%   50%   75%   Max
Bert             128.21    585.93    0.0   0.0   1.0   16.5  4661.0
CNN              40.71     252.71    0.0   1.0   4.0   14.0  6274.0
LSTM             17.79     71.96     0.0   0.0   1.0   5.0   968.0
NCF              34.33     58.60     0.0   0.5   1.0   51.5  102.0
ResNet           17.44     93.75     0.0   0.0   0.0   3.0   1442.0
Transformer      53.52     336.10    0.0   0.0   1.0   6.0   3637.0
Wide and Deep    7.28      16.36     0.0   0.0   0.0   2.5   71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52) and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), ResNet (17.44) and LSTM (17.79).

Kruskal-Wallis Test: The Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The 7 models' star distributions are the same.

• H1: The 7 models' star distributions are different.

from scipy.stats import kruskal

stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(), dfLstm['star'].tolist(),
                  dfNcf['star'].tolist(), dfResnet['star'].tolist(),
                  dfTransformer['star'].tolist(), dfWideDeep['star'].tolist())
print(stat, p)
# 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' star distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building one's own Transformer or Bert model may require a large amount of time and effort, yet developers still show their interest in these novel deep learning models by starring and forking them.


Figure 4.9: Star vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.



Figure 4.10: Star vs Development Time


Figure 4.11: Star vs Open Issues


Figure 4.12: Star vs Entropy Value

Number of Contributors. From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, p-value ≤ 0.01). Across all repositories, the top-3 models with the most stars per contributor are CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor), and Bert (155.0 stars/contributor).
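The Spearman coefficients quoted in this section are straightforward to reproduce. The sketch below is a minimal, dependency-free implementation (the helper names are ours; scipy.stats.spearmanr computes the same coefficient plus a p-value): rank each variable, averaging ranks across ties, then take the Pearson correlation of the rank vectors.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        # extend j over the run of equal values
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# hypothetical star / contributor counts for a handful of repositories
rho = spearman_rho([120, 45, 300, 8, 77], [3, 1, 9, 1, 2])
```

Fed with the star and contributor counts of all collected repositories, this yields the ρ values reported here.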

Model           Percentage of One-Contributor Development (%)
Bert            74.53
CNN             83.3
LSTM            85.9
NCF             100
ResNet          90.26
Transformer     81.20
Wide and Deep   89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time. From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value ≈ 0). This suggests that the longer a model has been developed, the more stars it accumulates (i.e., the model becomes more popular). The two models with the longest development durations are LSTM and CNN.

Open Issues. From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186, p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, this supports the assumption that the more popular a repository becomes, the more issues it will have. We investigate this correlation further in the following section.

Entropy. From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development work is not distributed evenly.

We can confirm this using Table 4.4: most of the deep-learning-related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. Here we introduce an information-theoretic approach to test whether contributions are evenly distributed.

Entropy. In particular, we compute the entropy H of each repository, defined as

p_i = c_i / Σ_i c_i    (4.1)

H = −Σ_i p_i log2(p_i)    (4.2)

where i indexes the contributors, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution to the repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example: its contributions are summarized in Table 4.5, and its corresponding entropy can then be calculated.

Total = 174 + 36 + 4 = 214    (4.3)

p1 = 174/214,  p2 = 36/214,  p3 = 4/214    (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.7826    (4.5)

name            contribution
dragen1860      174
ash3n           36
kelvinkoh0308   4

Table 4.5: Sample Contributions to One Repository
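Equations 4.1 and 4.2 translate directly into a few lines of Python. The sketch below (the function name is ours) computes the collaboration entropy of a repository from its list of per-contributor contributions:

```python
import math

def collaboration_entropy(contributions):
    """Base-2 Shannon entropy of a repository's contribution distribution."""
    total = sum(contributions)
    probs = [c / total for c in contributions]           # Equation 4.1
    return -sum(p * math.log2(p) for p in probs if p)    # Equation 4.2

h = collaboration_entropy([174, 36, 4])  # Table 4.5 example: h ≈ 0.7826
```

An entropy of 0 means a single contributor did all the work, while log2(k) means k contributors contributed equally.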

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly: the lower the entropy, the more unevenly the work is distributed.

Figure 4.13 shows the distribution of the entropy value for all models. From these figures we can see that most repositories have an entropy value of around zero, which means that deep-learning-related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.



Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models studied.
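A simple way to derive such a percentage from fork metadata is to compare each fork's recorded size with the original's. This is only a sketch under our own naming (the size values stand in for the repository-size field that the GitHub API returns per repository):

```python
def unique_fork_percentage(original_size, fork_sizes):
    """Percentage of forks whose recorded size differs from the original repository."""
    if not fork_sizes:
        return 0.0
    changed = sum(1 for size in fork_sizes if size != original_size)
    return 100.0 * changed / len(fork_sizes)

# e.g. two out of four forks diverged from an original of size 100
pct = unique_fork_percentage(100, [100, 120, 100, 95])  # 50.0
```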


Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed relative to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15. Most of the forked model repositories are not changed after forking. A more detailed analysis shows at a glance not only that changes are rarely made after forking, but also that most changed



Figure 4.15: Repository Uniqueness Distribution (%)


Figure 4.16: Repository Change Statistics


repositories have a size difference from the original of only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, a newly released model may lack tutorials and attention for some time, so it receives less interest. Second, the model itself may only be valid for a specific type of data, making it less robust and general and thus less suited to developers' needs.

We conclude that the development of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey software maintenance problems in these deep-learning-related repositories. The overall purpose is to explore three factors related to maintenance: development time, the number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
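A minimal sketch of this computation from the GitHub API's ISO-8601 timestamps (the created_at and updated_at fields; the function name is ours):

```python
from datetime import datetime, timezone

def repo_age_days(created_at, updated_at):
    """Whole days between a repository's creation and its last update."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # GitHub API timestamp format, e.g. "2018-10-31T18:47:05Z"
    created = datetime.strptime(created_at, fmt).replace(tzinfo=timezone.utc)
    updated = datetime.strptime(updated_at, fmt).replace(tzinfo=timezone.utc)
    return (updated - created).days

age = repo_age_days("2018-10-31T18:47:05Z", "2019-10-01T00:00:00Z")  # 334
```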

Figure 4.17 and Table 4.6 show how development time varies per model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide & Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). We therefore hypothesize that repositories for many of the earlier models started appearing on the open-source web community soon after each model's first release.


Model         Max     Q3       Median   Q1       Min
Bert          779     229      110      32       0
Transformer   1254    321      142      11       0
Wide deep     1107    575      117      0.5      0
ResNet        1360    456.5    120      15       0
NCF           1120    476      216      8        0
LSTM          1812    621.25   315.5    47.25    0
CNN           1385    699.25   483      270.25   0

(all values in days)

Table 4.6: Repository Development Time Statistics


Figure 4.17: Development Time Boxplot



Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of open issues. As visually suggested by the figure and the Spearman correlation test, there is a weak correlation between these two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, despite their higher maintenance cost, may have more users and hence more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.299), CNN (3.414), and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model         Mean    Std      25%   50%   75%   Min   Max
Bert          8.299   50.55    0     0     1     0     504
CNN           3.414   35.456   0     0     1     0     1077
LSTM          1.292   4.915    0     0     1     0     69
ResNet        1.791   11.164   0     0     0     0     186
Transformer   1.857   8.608    0     0     1     0     95
Wide Deep     0.231   0.742    0     0     0     0     4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository   Percentage of Repositories Having a Wiki (%)
Bert          97.17
CNN           98.498
LSTM          98.799
NCF           98.864
ResNet        98.817
Transformer   96.97
Wide deep     100

Table 4.8: Descriptive statistics on the percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep-learning-related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All these samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section, we first used STAMPER to collect the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects (popularity, contribution, and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation of whether developers change the original codebase after forking.



Figure 4.19: Open Issues vs Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmarks of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for an in-depth analysis of model construction and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways models are constructed in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation in the future, since, for example, users may publish their models in prototxt format. In this project we focused only on deep learning models constructed using Python. The findings may also reflect sampling limitations: the present experiment uses a limited number of repositories on GitHub, as a search cannot exceed the 1,000-repository boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories in GitHub. Other, more stratified samples might yield a more precise outcome.

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to define their own heuristics for data selection; experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a plugin for GitHub, allowing researchers


and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project, we investigated the popularity of deep learning models via the number of repositories in GitHub. It is very likely that commit metadata reflects popularity as well. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., k-means) to high-resolution time series data from commits.

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis in GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance, and contribution of deep-learning-related GitHub repositories and identified factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report their large-scale emergence.

This demonstrates the ability of the tool to give users deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization, and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train, and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing, and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use.

• Develop visualization and analysis techniques for representing trends in their use.

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated this program is my own original work

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

PyCharm 2019.1.3 (Professional Edition)


Build #PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
macOS 10.14.6

• Anaconda

  – jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
- Git

Datasets

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

- Before You Begin
- Prerequisites
- Install
- Running
- Test
- High Level Description of all Modules & Datasets
- Authors
- License

STAMPER

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keep the Mac awake with this useful app (otherwise it will disconnect the internet).

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt.

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run `python3 model_searcher.py` to get keyword-related repository metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be `updated` or `stars`, and order can be `asc` or `desc`.

2 Repository Search

Run python3 forks_time_stamp_getterpy

to get all your the forks timestam

p in forked_timestamp

3. Data Selection (Optional)

pip3 install --upgrade pip
pip3 install -r requirements.txt

Run `python3 repository_filter.py` to get your code-related repositories with statistics in the filtered_repo folder. Run `python3 filtered_repo.py` to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run `python3 visualizations/popularity.py` and get your graphs in visualizations/graphs/popularity.
- Maintenance: run `python3 visualizations/maintenance.py` and get your graphs in visualizations/graphs/maintenance.
- Contribution: run `python3 visualizations/contribution.py` and get your graphs in visualizations/graphs/contribution.
- Multi Correlations: run `python3 visualizations/multi_variable.py` and get your graphs in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module will record all the unreachable links and write them into the file unreachable_urls.txt.

Usage: change the elements in keywords and run `python3 test.py`. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data, and forked_time_location in three folders.

Instantiation: since you already got data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with parameters model_name and repository metadata subfolder. Then you can call this object with its related data easily (from Model import bert, and use bert as you go along).

Customize Keywords: in module model_keyword.py, import your instantiation (lstm) and call add_keywords, for example with a list of API keyword strings.

High Level Description of all Modules & Datasets

1. Data Collection
2. Repository Search
3. (Optional) Data Selection
4. Data Visualization (Altair is used to draw elegant graphs)

Experiment Datasets Collected:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

1. After Data Collection

output/
  asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
  asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

Generated Graphs

3. After Data Selection (Optional)

filtered_repo/
  bert.json
  pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
  tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json


graphs/
  contribution/: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
  maintenance/: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
  multi_variable/: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
  popularity/: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg


Authors
Xing (Nicole) Yu, under the supervision of Dr. Ben Swift

License and References
MIT © Xing Yu
Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)

Bibliography

[Git a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[Git b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[Git c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[Git d] GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
    • Deep learning models
    • Summarized Timeline
  • Public Code Repositories
    • Web-based hosting service
    • Measuring Popularity From GitHub
    • Extracting Messy Data in the Wild
    • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of model changed over time? A closer look at the deep learning models
    • RQ2: How popularity varies per model?
    • RQ3: Does the popularity of models relate to other features?
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commitments Timestamp
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Softwares
      • Other
      • Datasets
  • Appendix 4: README


1.2 Contribution

• We introduce STAMPER, a Python tool that can be used to extract the metadata characterising historical open-source projects on GitHub, based on researchers' interests.

• We utilize STAMPER in a case study analysing the usage of deep learning models in TensorFlow. We further demonstrate how we extract a rich set of features and establish the connections between those features and popularity.

• We create and mine a new dataset for further research use.

1.3 Report Outline

The rest of this report is structured as follows. Chapter 2 provides a timeline for deep learning frameworks and deep learning models; it presents the necessary background knowledge and reviews previous work on software mining tools and GitHub-related visualizations.

Chapter 3 provides an overview of our work and introduces our proposed methodology for extracting metadata from GitHub repositories.

Chapter 4 presents a case study in which we use our tool to extract deep-learning-related repositories from GitHub and to trace the landscape of popular deep learning models. The visualizations generated using our tool can help researchers understand past and current trends easily.

Referring back to the critical questions posed, we conclude our project and highlight the next steps in Chapter 5.

Chapter 2

Background and Related Work

Software repository mining lets researchers study historical trends in software engineering practice effectively. Repository mining builds on the use of web-based hosting services, and multiple approaches exist for conducting it. In the first section we introduce some background knowledge on web-based hosting services. We then introduce some popular deep learning frameworks in Section 2.1.1. Finally, we detail previous work in Section 2.2 that conducts software repository mining and identifies the main patterns in exploring repository popularity.

2.1 Background

2.1.1 Deep learning

Different from traditional machine learning, deep learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction [LeCun et al., 2015]. Deep learning methods have dramatically improved human life in multiple aspects, from image classification and speech recognition to machine translation and even autonomous cars.

The usage of deep learning has grown tremendously in the past few years with improvements in hardware availability (GPUs), big data, and cloud providers such as Amazon Web Services (AWS). Beyond this, large companies have created research teams to develop their own deep learning algorithms and integrate them into frameworks such as PyTorch and TensorFlow. In the meanwhile, they share their datasets and model-training tutorials, helping startup companies build state-of-the-art products with minimal effort and time.

Though deep learning has become a hot topic in recent years, its history has been evolving since the 1950s, as summarised in Table 2.1 [Subramanian, 2018].

Previously, developers and researchers required expertise in C++ or CUDA to implement deep learning models and algorithms. However, thanks to large technology companies and organisations, people with knowledge of a scripting language (Python) can now also build their own deep learning algorithms.


Technique                       Year
Neural network                  1943
Backpropagation                 1960s
Convolutional Neural Network    1979
Recurrent neural network        1980
Long Short-Term Memory          1997

Table 2.1: Deep Learning History

In Section 2.1.1.1 and Section 2.1.1.2 we discuss some state-of-the-art deep learning frameworks used today in both academia and industry.

2.1.1.1 TensorFlow

TensorFlow is a machine learning system that operates at large scale [Abadi et al., 2016]. Developed by the Google Brain team as the successor to its earlier DistBelief system, it saw its initial release in November 2015. TensorFlow then released its official 1.0.0 version on 11 February 2017, introducing the ability to run on multiple CPUs and GPUs.

Currently, Google uses this framework in numerous ways to improve its search engine, translation, recommendation systems, and image recognition and captioning. The TensorFlow library contains different APIs to build deep learning models at scale, such as ResNet, CNN, or LSTM. The architecture of the framework gives developers the flexibility to experiment with new optimisations and Google's training algorithms. It currently supports numerous applications from front end to mobile; this flexibility and scalability make it one of the most popular deep learning frameworks in the world.

2.1.1.2 PyTorch

Based on the Torch library, PyTorch can be seen as a Python front end to the Torch engine, which provides the capability to define mathematical functions and compute their gradients. Similar to TensorFlow, PyTorch has good GPU and CPU support and is primarily developed by Facebook's artificial intelligence research group. The official initial release of PyTorch was in October 2016; its dynamic computation capability allows great flexibility in building complex architectures.

However, unlike TensorFlow, which can seamlessly integrate into real industrial applications, PyTorch was primarily developed for researchers and scientists, and is not easily or generally recommended for production usage in certain scenarios.


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and increase the speed of project development to survive in keen competition. Winning trust from the public with high-quality service is thus required.

Initially, we wanted to conduct our research on the latest model stores, such as AWS SageMaker, Azure Machine Learning service, Wolfram Neural Net Repository, and ONNX. However, the algorithms and datasets used there are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand with the public. To avoid those problems and gain a more in-depth insight into usage in society, we chose the framework with models of greater transparency and substantial usage: TensorFlow.

API Referenced

Since this project involves a range of deep learning models, we began by developing an understanding of how to build and train neural networks. The construction of networks may rely on the layers APIs provided by Keras and TensorFlow. In the meanwhile, TensorFlow recently introduced the Estimator API to simplify the procedures of training, evaluation, prediction, and export.

Convolutional Neural Network (CNN)
The Convolutional Neural Network is one of the most established algorithms among all the deep learning models and one of the most dominant in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons arranged in a grid pattern. Typically, a CNN consists of three types of layers: convolution layers, pooling layers, and fully connected layers. Convolution and pooling layers conduct feature extraction, and the fully connected layers map the extracted features to the final output. Layers are interconnected, so the extracted features are transferred layer by layer.
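As a purely illustrative sketch of these three layer types (not code from this project), the following plain-NumPy snippet runs a toy image through convolution, pooling, and a fully connected layer; the kernel, sizes, and random weights are arbitrary choices:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (no padding, stride 1): feature extraction."""
    h, w = kernel.shape
    out = np.empty((image.shape[0] - h + 1, image.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+h, j:j+w] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: downsample the feature map."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

def dense(x, weights, bias):
    """Fully connected layer: map the flattened features to the output."""
    return weights @ x.flatten() + bias

# A 6x6 toy "image" passed through conv -> pool -> fully connected
rng = np.random.default_rng(0)
image = rng.random((6, 6))
edge_kernel = np.array([[1., -1.], [1., -1.]])   # toy edge detector
features = max_pool(conv2d(image, edge_kernel))  # shape (2, 2)
logits = dense(features, rng.random((3, 4)), np.zeros(3))
print(features.shape, logits.shape)  # (2, 2) (3,)
```

Real CNNs stack many such convolution/pooling stages and learn the kernels from data; frameworks such as TensorFlow provide them as ready-made layers.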

Long Short-Term Memory (LSTM)
Different from traditional neural networks, which cannot memorise previous data, Long Short-Term Memory (LSTM), a special kind of recurrent neural network, provides researchers with an effective way to perform tasks over persistent data and is capable of learning long-term dependencies [Hochreiter and Schmidhuber, 1997]. An LSTM is capable of learning dependencies from historical data and making predictions from information remembered previously. Inside an LSTM, instead of a single linear layer, there is a small network of gates, each performing its function independently; one well-known variant combines the forget and input gates into a single update gate.

LSTM is one of the most common uses of recurrent neural networks. This model is generally used with sequential data and can solve language modelling problems and related NLP concepts (word embeddings, encoders).

(Note: TensorFlow official models are chosen in our project. The TensorFlow official models repository contains a collection of deep learning models built with TensorFlow's high-level APIs: https://github.com/tensorflow/models/tree/master/official)
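To make the gating mechanism concrete, here is one illustrative LSTM step in plain NumPy, using the standard formulation with forget, input, candidate, and output gates; the shapes and random weights below are arbitrary and not from this project's codebase:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W, U, b hold the stacked parameters of the
    forget (f), input (i), candidate (g) and output (o) gates."""
    z = W @ x + U @ h + b            # (4*n,) gate pre-activations
    n = h.size
    f = sigmoid(z[0*n:1*n])          # forget gate: what to discard from c
    i = sigmoid(z[1*n:2*n])          # input gate: what new info to store
    g = np.tanh(z[2*n:3*n])          # candidate cell state
    o = sigmoid(z[3*n:4*n])          # output gate
    c_new = f * c + i * g            # cell state carries long-term memory
    h_new = o * np.tanh(c_new)       # hidden state exposed to the next layer
    return h_new, c_new

# run a toy sequence of 5 inputs through the cell
rng = np.random.default_rng(1)
n, d = 3, 2                          # hidden size, input size
W = rng.normal(size=(4*n, d))
U = rng.normal(size=(4*n, n))
b = np.zeros(4*n)
h = c = np.zeros(n)
for x in rng.normal(size=(5, d)):
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (3,)
```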

Residual Network (ResNet)
One of the problems deep learning models face is that as the number of layers increases past a certain point, accuracy stops improving. This problem motivated modern network architectures such as the residual network (ResNet) and Inception, which address it through residual connections.

ResNet solves the problem described above by fitting a residual mapping via a shortcut connection. Each ResNet block contains a series of layers and a shortcut connection component.
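The shortcut connection can be sketched in a few lines of NumPy. This toy block (arbitrary shapes, no batch normalisation, not this project's code) only needs to learn the residual F(x) = y − x, because the input is added back on the shortcut:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = relu(F(x) + x): the shortcut connection adds the input back,
    so the stacked layers only fit the residual mapping F(x)."""
    return relu(W2 @ relu(W1 @ x) + x)   # "+ x" is the shortcut

rng = np.random.default_rng(2)
x = rng.normal(size=4)
W1, W2 = rng.normal(size=(4, 4)), np.zeros((4, 4))
# With W2 = 0 the block degenerates to the identity shortcut: y = relu(x),
# which is why very deep stacks of such blocks remain easy to optimise.
assert np.allclose(residual_block(x, W1, W2), relu(x))
```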

Bidirectional Encoder Representations from Transformers (BERT)
BERT is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al., 2018].

It was first released in google-research/bert on GitHub on 1 November 2018. It supports a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications, and aims to predict the relationship between sentences by analysing whole sentences holistically [Devlin et al., 2018].

Attention Is All You Need (Transformer)
Many problems in deep learning can be seen as a form of sequence-to-sequence mapping and can be solved using a common architecture: the encoder-decoder architecture.

The encoder and decoder both consist of stacks of identical layers, and each layer consists of two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network [Vaswani et al., 2017].
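The core of the self-attention sub-layer is scaled dot-product attention, softmax(QK^T/√d_k)V, sketched below in NumPy for illustration (single head with toy shapes, whereas the real model projects queries, keys, and values into several heads):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, stabilised by subtracting the row maximum."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

# 5 positions, dimension 8: each output row is a weighted mix of V's rows
rng = np.random.default_rng(3)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (5, 8)
```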

Neural Collaborative Filtering (NCF)
Neural Collaborative Filtering (NCF) is a neural network architecture that utilises the non-linearity of neural networks to build recommendation systems [He et al., 2017]. It demonstrates that matrix factorisation can be treated as a special case of neural collaborative filtering. To add additional non-linearity, this model introduces a multi-layer perceptron module alongside the generalised matrix factorisation (GMF) layer.

Wide and Deep Learning
Since linear models are not great at generalising across unique features, deep models were introduced to solve this problem. Deep models can use embedding vectors for every query, and this technique can then generalise by coupling items and queries.

To overcome over-generalisation, the Google research team introduced the Wide & Deep Learning model, which jointly trains wide linear models and deep neural networks to combine the benefits of memorisation and generalisation for recommender systems [Cheng et al., 2016].

2.1.3 Summarized Timeline

Model Name     Year Proposed
CNN            1980s
LSTM           1997
ResNet         2015
Wide & Deep    2016
NCF            2017
Transformer    2017
BERT           2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest hosts for open-source distributed version control systems (DVCS) [Gousios et al., 2014]. A distributed version control system enables contributors to submit sets of changes and integrate them into the main development branch. The use of git is based on pragmatic needs, combining the advantages of version control and collaborative development.

GitHub can also lead to insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project. Thus the number of stars can reveal popularity; from a software development research perspective, it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories
Borges et al. [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth, using time-series metadata derived from 2,279 accessible GitHub repositories. They found that slow growth is more common in overpopulated application domains and for old repositories. Moreover, they conclude that the most common domains on GitHub include web libraries and frameworks and non-web libraries and frameworks.

Building on their work, we examine whether deep-learning-framework repositories follow a similar popularity trend pattern. At the same time, we also study whether a relationship exists between three factors: forks, stars, and watchers.

Predicting the Popularity of GitHub Repositories
In the same year, Borges et al. [2016a] published another paper about predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regression to predict the number of stars of GitHub repositories; as a result, project owners can see how their projects are performing in the open-source community.

Their study examines the correlation between the popularity of a repository and the number of stars it owns. The study also reports a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent
GHTorrent [Gousios and Spinellis, 2012] is designed to conduct independent repository mining, distributing its data through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all the repositories, and the resulting dataset can be used to answer multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving raw data from results returned by the REST API. However, their tool cannot visualise the metadata or offer trend analysis at a high level.

MetricMiner
A similar tool is MetricMiner [Sokol et al., 2013], a web application that supports researchers in mining software repositories, doing data extraction, and running statistical inference on the collected data. The tool automatically clones the repository, processes the metadata, and stores the data in the cloud, which gives it good scalability and fast query answering without users installing any software on their localhost.

GitcProc
GitcProc [Casalnuovo et al., 2017] is a tool that uses regular expressions to extract the changed lines within a repository, to facilitate answering project evolution questions. GitcProc can retrieve and summarise global statistics of project metrics, including commits, commit dates, and contributors. The tool can measure how many changes have taken place in Java projects and is also able to locate the changed files.

RepoVis
RepoVis [Feiner and Andrews, 2018] is a new tool that provides visual overviews for software maintained in Git repositories. RepoVis is a client-server web application providing full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its searchable functionality to GitHub, associated with a code-based search. All of the visualizations are written out in SVG format.

2.2.4 Visualizing data in Repositories

Chronos
CHRONOS [Servant and Jones, 2013] is a software tool that enables the visualisation of historical change inside software source code. The tool implements a zoomable user interface as a visualisation of the actual code, supporting developers from a high-level view (pattern recognition) down to a low-level view (reverse engineering) using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise their complete history of change, including the revisions that modified them. Inspired by this tool, our project uses visualisation to track the historical trend of change related to the keyword specified by users on GitHub.

[Figure 2.1: git2net [Gote et al., 2019]]

GEVOL
Collberg et al. [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph-drawing technique, to deduce a better understanding of a program from its development history, displaying all the visualisations using a temporal graph visualizer.

This system aids in the discovery of the structure of the system and provides the user with a new way to explore the evolution of a program by visualising changes to the system. It extracts information about Java programs stored within a CVS version control system into three types of graphs: inheritance, control-flow, and call graphs.

git2net
git2net [Gote et al., 2019] is a software tool that facilitates the extraction of co-editing networks from git repositories. Similar to GEVOL, it uses text mining techniques to analyse the history of modifications within files. However, it addresses the importance of studying social networks in GitHub and gives the reader a broader view of the application of graph-based data analysis and modelling. Their tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter we detailed the web-based hosting service we selected (GitHub) and presented the concept of deep learning, with two popular frameworks and state-of-the-art neural network models. In the next chapter we elaborate on how we designed and implemented STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter we outline our design and implementation for data extraction, and then detail the metrics we use to estimate trends in deep learning framework and model usage on GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

[Figure 3.1: Overview of STAMPER — (1) Data Collection: user-supplied keywords (e.g. model names) are sent to the Git project search API; (2) Repository Search; (3) optional Data Selection via the Git code search API; results are stored locally for data visualisation.]


Data Collection
We first collect all repository metrics through the GitHub API. This step allows us to extract the history of all repositories related to the keyword and record the metadata for each repository (repository-based search).

Repository Search
As each repository contains multiple contributors and forks, we subsequently extract and analyse whether changes were made, based on the size information, and calculate a collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the forked information to create visual representations.
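The report does not spell out the entropy formula at this point, so the sketch below assumes the standard Shannon entropy over each developer's share of contributions (the function name and inputs are hypothetical):

```python
import math

def contribution_entropy(contributions):
    """Shannon entropy of developers' contribution shares: a hypothetical
    stand-in for the collaborative factor. Higher entropy means commits
    are spread more evenly across developers; 0 means a single author."""
    total = sum(contributions)
    shares = [c / total for c in contributions if c > 0]
    return -sum(p * math.log2(p) for p in shares)

# one dominant developer vs. an even four-way collaboration
print(contribution_entropy([50]))             # 0.0 bits
print(contribution_entropy([10, 10, 10, 10])) # 2.0 bits
```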

Data Selection
We implemented a selector that allows excluding specific repositories unrelated to the desired set. The selector summarises the frequency counts for the keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can then be used to examine API usage statistics.

Data Analysis
Since each forked repository may involve re-development and modification, analysis of forked-repository modifications is included in our project. For all changed repositories, the size difference is analysed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to deeply analyse and manipulate the data and even run statistical tests on the data set. To better understand these metrics, we divide them into multiple categories. For attributes that are not primary data from the GitHub API, we explain them in the data expansion part and label them as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication, the user can make up to 5000 requests per hour; otherwise, the rate limit only allows up to 60 requests per hour [Git d].
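A minimal sketch of such an authenticated request using only the Python standard library (the token string is a placeholder, and STAMPER's actual implementation may differ):

```python
import urllib.request

API = "https://api.github.com"

def github_request(path, token=None):
    """Build a GitHub REST API request; sending an OAuth2 token lifts
    the rate limit from 60 to 5000 requests per hour."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = "token " + token
    return urllib.request.Request(API + path, headers=headers)

# the /rate_limit endpoint reports the remaining request quota
req = github_request("/rate_limit", token="<OAuth2 token>")
print(req.full_url)  # https://api.github.com/rate_limit
# urllib.request.urlopen(req).read() would return the quota as JSON
```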


Type           Meta-data
-----------    ---------------------------------------
Contributor    contribution: int  [Data Expansion]
               login (user name): String
               type (user / organization): String
               contributors_url
Repository     created_at
               description
               full_name
               language
               size
Popularity     fork: Boolean
               forks: int
               forks_url
               stargazers_count
               watchers_count
               unique_repos  [Data Expansion]
Owner          id
               login (username)
               type
Maintenance    has_issues: Boolean
               has_wiki: Boolean
               open_issues: int
               pushed_at
               updated_at
               score

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deepcrawling operations to collect raw data

• Contribution
One repository generally involves multiple developers conducting software development. The project owner is not necessarily the person who contributes the most code, and the amounts contributed by different developers are potentially unequal. As a result, we further track this information via the GitHub API and record the number of contributions each developer made to each repository.

• Unique_repos
Popular repositories may have numerous forked repositories owned by other users, and the reasons behind the forking behaviour vary. Our research explores whether developers conduct subsequent development based on the original codebase. By comparing the size of each forked repository (F_i) with that of the original repository (O), we obtain all forked repositories with a change of size (c):

    F_i + c = O        (3.1)
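Rearranged, Equation 3.1 gives c = O − F_i, so the change can be computed directly from the two sizes reported by the API; the helper below is a hypothetical illustration, not STAMPER's actual code:

```python
def size_change(original_size, forked_size):
    """c satisfying F_i + c = O: positive c means the fork is smaller
    than the original, negative c means content was added to the fork."""
    return original_size - forked_size

# hypothetical repository sizes as reported by the GitHub API
assert size_change(120, 100) == 20    # fork removed content
assert size_change(120, 150) == -30   # fork added content
assert size_change(120, 120) == 0     # codebase unchanged
```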

3.4 Data Selection

[Figure 3.2: Data Selection — an entity (model) and its API keywords are searched within each repository to produce usage statistics.]

[Figure 3.3: Store in Local Disk — unfiltered data and forked-repository timestamps are filtered by groups of model-related keywords (e.g. Bert, ResNet, CNN) into filtered data.]


Figure 3.2 represents our method for searching API usage in DL-model-related repositories. GitHub provides a REST API to examine whether a repository contains the targeted API or not.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. Moreover, the appearance frequency of the user-specified API is embedded directly in the returned result and can be identified in our program, corresponding to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. This approach also allows users to build a high-level picture of API usage across GitHub repositories.

Meanwhile, we also provide multiple options for selecting APIs, stored in model_keyword.py.
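As an illustration only (not STAMPER's actual implementation), counting keyword appearances per repository and persisting the result as JSON might look like the sketch below; the repository name, file contents, and output filename are all invented:

```python
import json

# Illustrative sketch, not STAMPER's actual code: count how often each
# user-specified API keyword appears in a repository's files, keyed by the
# repository's full name, and persist the result to local disk as JSON.
MODEL_KEYWORDS = ["from keras.applications.resnet50 import ResNet50"]

def count_api_usage(full_name, file_contents, keywords=MODEL_KEYWORDS):
    counts = {kw: sum(src.count(kw) for src in file_contents) for kw in keywords}
    return {full_name: counts}

# Invented sample: one source file fetched for a hypothetical repository.
files = ["from keras.applications.resnet50 import ResNet50\nmodel = ResNet50()"]
result = count_api_usage("someuser/resnet-demo", files)
with open("api_usage.json", "w") as fh:   # hypothetical output file
    json.dump(result, fh, indent=2)
print(result)
```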

Table 3.1 shows descriptive statistics on the number of models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras applications library provides users the ability to load deep learning models and to instantiate a model with default weights. Take ResNet as an example:

• With pre-defined models: A user may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, and that this model could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility for creating deep learning models, and they can all be used as sample keywords in STAMPER, such as:

keras.applications.resnet.ResNet50
keras.applications.resnet.ResNet101
keras.applications.resnet.ResNet152
keras.applications.resnet_v2.ResNet50V2
keras.applications.resnet_v2.ResNet101V2
keras.applications.resnet_v2.ResNet152V2

• With self-defined models: TensorFlow has a sample ResNet50 model in its official repository. In the given sample class, ResNet consists of two main blocks. If users would like to further explore the components inside the ResNet class, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, the data selection heuristic varies; deep learning users and experts can define their searches according to their own interests and preferences.

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from these three perspectives is illustrated in Figure 3.4. Meanwhile, Chapter 5 gives an example based on the repository metadata we collected for deep learning models.

Figure 3.4: Overall Construction of the Visualizations (entities 1..n are functionally mapped to popularity-related, contribution-related, and maintenance-related visualisations)

Popularity

• Total number of repositories, with forks (line)

• Total number of repositories, without forks (line)

• Number of creations over time, grouped by week (with forks)

• Repository creation time vs. stars

Contribution

To further exploit the forking information, STAMPER supports comparison between an original repository and its forked repositories. The work could be further extended by visiting each forked repository's URL and tracing its commits.

As shown in Figure 3.5, an entity (E) searched in GitHub may have multiple related repositories (R_i) with their corresponding forked repositories (F_i). Among the forked repositories, we call a changed forked repository C_i.

To examine whether forked repositories change, and the difference between multiple entities, we calculate the difference using the equation below.


Keyword | Total of Repository (including Forks) Collected | Total of Original Repository Collected
ResNet tensorflow | 6129 | 339
Bert tensorflow | 13734 | 106
CNN tensorflow | 39765 | 1000
LSTM tensorflow | 19572 | 1000
Transformer tensorflow | 7188 | 145
Wide and deep tensorflow | 324 | 39

Table 3.1: Repositories Related to Tensorflow


Uniqueness percentage distributions are composed of all the percentages p_i, each corresponding to its original repository R_i:

p_i = ∑C_i / ∑F_i    (3.2)
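A minimal sketch of Equation (3.2), assuming a per-fork boolean flag (True when the fork differs from its origin); the flags below are invented sample data:

```python
# Sketch of Equation (3.2): p_i = (number of changed forks C_i) / (total forks F_i)
# for one original repository R_i. The flags below are invented sample data.
def uniqueness_percentage(fork_changed_flags):
    if not fork_changed_flags:
        return 0.0
    return sum(fork_changed_flags) / len(fork_changed_flags)

flags = [True, False, True, True]      # 3 of 4 forks changed after forking
print(uniqueness_percentage(flags))    # -> 0.75
```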

Figure 3.5: Examine Uniqueness after Forking (an entity E maps to repositories 1..n; each forked repository is flagged as changed Y/N)

• Percentage of forked repositories unique from origin (boxplots)

• Uniqueness percentage distribution for each entity (histograms)

• Entropy distribution for each entity (histograms)

Maintenance

• Development time boxplot for each entity

• Open issues distribution for each entity

3.6 Summary

In this chapter we detail the design of our tool and how it conducts repository mining and analysis. We present a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. We also introduce and analyse two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving and being built, trained, and deployed by researchers. Our tool is available for analysing such changes: we collected the historical information stored in GitHub and extracted the repository metadata using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development has become a battlefield without gunpowder smoke. Researchers, companies, and developers are all competing for a voice in deep learning. A variety of models exist, but there is no common bridge connecting those ideas. Historical data in GitHub are obscure and hard to find, and as a result many developers, especially experienced ones, remain in their original comfort zone. With our study we hope to shed some light on deep learning and highlight a few suggestions for the public.

This section aims to answer some questions related to both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, given the few studies on popularity in GitHub, there is no standardized feature for measuring it. We analyze some potential features of each repository and hypothesize that popularity is strongly related to the number of stars a repository owns.

This decision is justified in the following section, with more background on GitHub.

• Watchers: Similar to following an RSS feed, watchers are GitHub users who ask to be notified of activity in a repository they are watching. Watching, however, does not imply being a collaborator [Git b]. A watcher may watch a repository to receive notifications for new pull requests or issues that are created. Watchers indicate how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

Figure 4.2: Star Sort Menu [Git a]

• Stars: Starring a repository makes it easy for users to keep track of repositories they are interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks: A fork is created when a user makes their own copy of a repository. The user can fork a repository to suggest changes or to use it as the basis for a new project.

Based on the data gathered in the collection stage, we extract the watchers_count, stargazers_count, and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star | forks_count | watchers_count | model name
17940 | 4661 | 17940 | Bert
12405 | 3637 | 12405 | Bert
5263 | 1056 | 5263 | Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks, and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we consider a rank-based method rather than one that assumes normality.


Spearman Correlation Coefficient

Definition: The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman, 2011]. This method allows us to test for a rank-order relationship between two numerical or ordinal variables associated by a monotonic (increasing or decreasing) function.

Hypothesis Testing

• H0: The variables (star, fork, and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

• Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

• Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Set α = 0.05; p1, p2, and p3 are all less than α. From the calculation above we also find strong positive correlations, with values coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875 respectively.

This means it is very unlikely (95% confidence) that the testing data are uncorrelated, and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of this report we therefore treat the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, appear to be two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Apart from these longer-established models, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as already described in the background section.

The model development community has recently seen the release of multiple powerful frameworks, which are treated as baselines for building models. However, for many new models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.


Figure 4.4: Repositories with Forks (accumulated number of repositories created per model, 2015-2019)

Figure 4.5: Repositories without Forks (accumulated number of repositories created per model, 2015-2019)


Figure 4.6: Repository Trend in GitHub For Each Model


Figure 4.7: Creation Time vs Stars

A fork is another copy of a repository; it can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5 we can, surprisingly, see a considerable difference between the total number of repositories created including forks and the total number excluding forks. Most repositories related to deep learning models are not original, which indicates that a considerable number of developers remain in a studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, unlike the previous summarizing method, we study the popularity of deep learning models over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the highest number of repositories created. The data bear this out: in 2017 the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued rising to a higher level that persists to now.

What accounts for this tremendous usage difference? CNN and LSTM currently have two of the most important and largest communities in the deep learning field; these networks are essential in both computer vision and NLP, where they hold an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly in the last two years. Differing from previous structures like CNN, both of them modify the original structure and significantly improve results in computer vision and translation tasks.

Rising star: Bert

However, no model comes with perfection. LSTM itself can be extended into many variants, and BERT is one of those.

The current trends depicted in the graph support the inference that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with how long the model has existed, but our data tell a different story.

Although its paper was published in 2016, NCF draws the least attention in the GitHub community. This shows there is no necessary relationship between popularity (i.e., stars) and creation time; past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific model could flatten out or reverse itself.

Similarly for the Wide and Deep model: although Google provides full documentation and a tutorial for it, we still take a pessimistic view of this model, also published in 2016. Moreover, the previous data confirm there has been no significant rise in its usage.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (fork-count distribution histograms per model)


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see the following.

Model Name | Mean | STD | Min | 25% | 50% | 75% | Max
Bert | 498.65 | 2196.3 | 0 | 1 | 8 | 43 | 17940
CNN | 106.84 | 611.97 | 2 | 3 | 8 | 32 | 13882
LSTM | 48.82 | 214.22 | 0 | 1 | 2 | 13 | 2703
NCF | 77 | 129.91 | 1 | 2 | 3 | 115 | 227
ResNet | 46.88 | 221.43 | 0 | 0 | 1 | 8 | 2980
Transformer | 186.79 | 1155.87 | 0 | 0 | 4 | 21 | 12408
Wide and Deep | 16.23 | 36.80 | 0 | 0 | 1 | 8 | 146

Table 4.2: Stars Comparison

Model Name | Mean | STD | Min | 25% | 50% | 75% | Max
Bert | 12.82 | 58.59 | 0.0 | 0.0 | 1.0 | 16.5 | 4661.0
CNN | 4.07 | 25.27 | 0.0 | 1.0 | 4.0 | 14.0 | 6274.0
LSTM | 1.78 | 7.20 | 0.0 | 0.0 | 1.0 | 5.0 | 968.0
NCF | 3.43 | 5.86 | 0.0 | 0.5 | 1.0 | 51.5 | 102.0
ResNet | 1.74 | 9.38 | 0.0 | 0.0 | 0.0 | 3.0 | 1442.0
Transformer | 5.35 | 33.61 | 0.0 | 0.0 | 1.0 | 6.0 | 3637.0
Wide and Deep | 0.73 | 1.64 | 0.0 | 0.0 | 0.0 | 2.5 | 71.0

Table 4.3: Forks Comparison
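Per-model descriptive statistics of this shape can be produced with pandas' describe(); a sketch on a few invented sample rows (not the real dataset):

```python
import pandas as pd

# Sketch: per-model descriptive statistics in the shape of Tables 4.2/4.3,
# computed with pandas. The sample rows are invented, not the real dataset.
df = pd.DataFrame({
    "model": ["Bert", "Bert", "CNN", "CNN", "CNN"],
    "star":  [17940, 12405, 2, 8, 32],
})
stats = df.groupby("model")["star"].describe()
print(stats[["mean", "std", "min", "25%", "50%", "75%", "max"]])
```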

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (12.82), Transformer (5.35), and NCF (3.43).

The three models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88), and LSTM (48.82). The three models whose repositories have the lowest average number of forks are Wide and Deep (0.73), LSTM (1.78), and ResNet (1.74).

Kruskal-Wallis Test: The Kruskal-Wallis test is used in this project to compare more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The 7 models' distributions are the same.

• H1: The 7 models' distributions are different.

from scipy.stats import kruskal
stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                  dfNcf["star"].tolist(), dfResnet["star"].tolist(), dfTransformer["star"].tolist(),
                  dfWideDeep["star"].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or Bert model may require a large amount of time and effort, even though developers show interest in these novel deep learning models.

Figure 4.9: Star vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.


Figure 4.10: Star vs Development Time

Figure 4.11: Star vs Open Issues

Figure 4.12: Star vs Entropy Value

Number of Contributors: From Figure 4.9 and a Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Among all repositories, the top-3 models with the most stars per contributor are CNN (1687.5 stars/contributor), Transformer (1551 stars/contributor), and Bert (1550 stars/contributor).

Model | Percentage of One-Contributor Development (%)
Bert | 74.53
CNN | 83.3
LSTM | 85.9
NCF | 100
ResNet | 90.26
Transformer | 81.20
Wide and Deep | 89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories
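The percentages in Table 4.4 amount to the fraction of repositories whose contributor count equals one; a toy sketch with invented counts:

```python
# Sketch: percentage of repositories developed by exactly one contributor,
# as reported in Table 4.4. The contributor counts below are invented.
def one_contributor_percentage(contributor_counts):
    solo = sum(1 for n in contributor_counts if n == 1)
    return 100.0 * solo / len(contributor_counts)

print(one_contributor_percentage([1, 1, 3, 1]))  # -> 75.0
```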


Development Time: From Figure 4.10 and a Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value = 0). This suggests that the longer a model has been developed, the more stars it has (i.e., the model becomes more popular). The two models whose repositories have the longest development durations are LSTM and CNN.

Open Issues: From Figure 4.11 and a Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesize that the more popular a repository becomes, the more issues it has; we further investigate this correlation in a later section.

Entropy: From Figure 4.12 and a Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of a new feature, the entropy value, on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly in developing a deep learning model. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can confirm this using Table 4.4: most deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether contributions are even.

Entropy: In particular, we compute the entropy H of each repository, defined as

p_i = c_i / ∑_i c_i    (4.1)

H = −∑_i p_i log2(p_i)    (4.2)

where i indexes the i-th contributor, c_i is the i-th contributor's contribution, and ∑_i c_i is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example, the contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214    (4.3)

p_1 = 174/214, p_2 = 36/214, p_3 = 4/214    (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.7827    (4.5)

name | contribution
dragen1860 | 174
ash3n | 36
kelvinkoh0308 | 4

Table 4.5: Sample Contributions to One Repository
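Equations (4.1)-(4.2) applied to the contribution counts of Table 4.5 can be checked with a few lines of Python (the result, about 0.78, is well below the log2(3) ≈ 1.58 maximum for three equal contributors):

```python
from math import log2

# Entropy of a repository's contribution distribution, per Equations (4.1)-(4.2),
# evaluated on the contribution counts of Table 4.5 (174, 36, 4).
def contribution_entropy(contributions):
    total = sum(contributions)
    probs = [c / total for c in contributions]
    return -sum(p * log2(p) for p in probs if p > 0)

print(round(contribution_entropy([174, 36, 4]), 4))
```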

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly: the lower the entropy, the higher the phase separation, which corresponds to more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy value for all models. From these figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mainly by one developer or by a team with an uneven allocation of work.


Figure 4.13: Collaboration Entropy (entropy distribution histograms per model)


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repositories' metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared with the original. We observe that Bert has a high proportion of unique forked repositories among the models.

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared with the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most model repositories are not changed after forking. A more detailed look shows at a glance not only that changes are rarely made after forking, but also that most changed


Figure 4.15: Repository Uniqueness Distribution (%)

Figure 4.16: Repository Change Statistic (distribution of size change after forking, per model)


forked repositories differ from the original repository by only 0 to 100 bytes in size, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention leaves them less used. Second, a model may only be valid for specific types of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development sizes of forked repositories are quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section we survey software maintenance in these deep learning related repositories. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
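Equation (4.6) can be sketched directly from the GitHub API's ISO-8601 created_at/updated_at timestamps (the sample dates below are invented):

```python
from datetime import datetime

# Sketch of Equation (4.6): repository age in days, computed from the GitHub
# API's ISO-8601 `created_at` / `updated_at` timestamps. Sample dates invented.
def repo_age_days(created_at, updated_at):
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    delta = datetime.strptime(updated_at, fmt) - datetime.strptime(created_at, fmt)
    return delta.days

print(repo_age_days("2018-10-31T00:00:00Z", "2019-02-18T00:00:00Z"))  # -> 110
```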

Figure 4.17 and Table 4.6 show how development time varies by model. The median development time is as follows: Bert (110 days), Transformer (142 days), Wide and Deep (11.7 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). Applying the Kruskal-Wallis test, the distributions of development days differ across models (p-value ≤ 0.05). We therefore hypothesize that many users of the earlier models started using the open-source community immediately after each model's first release.


Model | Max (days) | Q3 (days) | Median (days) | Q1 (days) | Min (days)
Bert | 779 | 229 | 110 | 32 | 0
Transformer | 1254 | 321 | 142 | 11 | 0
Wide and Deep | 1107 | 57.5 | 11.7 | 0.5 | 0
ResNet | 1360 | 456.5 | 120 | 15 | 0
NCF | 1120 | 476 | 216 | 8 | 0
LSTM | 1812 | 621.25 | 315.5 | 47.25 | 0
CNN | 1385 | 699.25 | 483 | 270.25 | 0

Table 4.6: Repository Development Time Statistics

[Figure: boxplots of development time (days, 0–2000) for each model: bert, cnn, lstm, ncf, resnet, transformer and wide deep (TensorFlow).]

Figure 4.17: Development Time Boxplot


[Figure: scatter plot of development duration (days) against number of open issues, one series per model.]

Figure 4.18: Development Time vs. Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of open issues. As suggested visually by the figure and confirmed by a Spearman correlation test, there is a weak positive correlation between the two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, which carry higher maintenance costs, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.
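Spearman's rho, as used above, is simply the Pearson correlation of rank-transformed data; a self-contained sketch follows (the sample values are invented for illustration, not the study's data):

```python
def _ranks(xs):
    """1-based ranks, averaging over tie groups."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks of x and y."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# illustrative only: development time (days) vs. open-issue counts
dev_days = [110, 142, 117, 120, 216, 315, 483]
open_issues = [8, 2, 1, 3, 5, 9, 30]
print(round(spearman(dev_days, open_issues), 3))
```

On this toy data the statistic comes out moderately positive, in the same spirit as the weak correlation reported in the text.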

Model        Mean   Std      25%  50%  75%  Min  Max
Bert         8.299  50.55    0    0    1    0    504
CNN          3.414  35.456   0    0    1    0    1077
LSTM         1.292  4.915    0    0    1    0    69
ResNet       1.791  11.164   0    0    0    0    186
Transformer  1.857  8.608    0    0    1    0    95
Wide Deep    0.231  0.742    0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository   Percentage of repositories having a Wiki (%)
Bert                       97.17
CNN                        98.498
LSTM                       98.799
NCF                        98.864
ResNet                     98.817
Transformer                96.97
Wide deep                  100

Table 4.8: Descriptive statistics on percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data we collected, deep-learning-related repositories are well documented (i.e., most have a wiki page). Figure 4.19 shows the histogram of open issues. All these samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.
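The wiki statistic in Table 4.8 can be computed straight from the repository metadata: the GitHub API reports a boolean `has_wiki` field per repository, so the percentage is a simple count. A sketch with made-up records (not the collected dataset):

```python
def wiki_percentage(repos):
    """Percentage of repositories whose GitHub metadata flags a wiki page."""
    with_wiki = sum(1 for r in repos if r.get("has_wiki"))
    return 100.0 * with_wiki / len(repos)

# illustrative records only
repos = [{"has_wiki": True}] * 97 + [{"has_wiki": False}] * 3
print(wiki_percentage(repos))  # 97.0
```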

4.4 Summary

In this section, we first used STAMPER to collect the metadata of repositories related to DL models from GitHub. Then we investigated three common aspects of software engineering in deep learning repositories (popularity, contribution and maintenance) using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.


[Figure: per-model histograms of open-issue counts (binned 0–100), showing the count of repositories for bert, cnn, lstm, ncf, resnet, transformer and wide deep (TensorFlow).]

Figure 4.19: Open Issues vs. Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the wild: Limitation and Improvement

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies. We developed heuristics for in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation. For example, users may use the prototxt format to publish their models, whereas in our project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limitations: the present experiment uses a limited number of repositories on GitHub, since a search cannot exceed the 1,000-results boundary. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all repositories on GitHub. Other, more stratified samples might yield a more precise outcome.
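The sorting workaround can be sketched as follows: issue the same query under several sort/order combinations and de-duplicate the union by repository id. The `id` field mirrors the GitHub search API's JSON, but the snippet is illustrative rather than STAMPER's actual code:

```python
def merge_search_results(*result_lists):
    """Union several search-result lists (e.g. one per sort/order combination),
    keeping the first occurrence of each repository id."""
    seen, merged = set(), []
    for results in result_lists:
        for repo in results:
            if repo["id"] not in seen:
                seen.add(repo["id"])
                merged.append(repo)
    return merged

# Each query is capped at 1,000 items, but different sort orders can surface
# different repositories, so the de-duplicated union may exceed the cap.
by_stars_desc = [{"id": 1}, {"id": 2}]
by_update_asc = [{"id": 2}, {"id": 3}]
print(len(merge_search_results(by_stars_desc, by_update_asc)))  # 3
```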

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to learn how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to define their own heuristics for data selection: experts can easily change the API search terms to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a GitHub plugin, allowing researchers



and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends on GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models via the number of repositories existing on GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., k-means) to high-resolution time-series data from commits.
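As a sketch of that direction, even a minimal one-dimensional k-means (Lloyd's algorithm) can group commit activity into bursts. This is an illustrative toy, not part of STAMPER, and the timestamps below (days since first commit) are invented:

```python
import random

def kmeans_1d(points, k, iters=50, seed=0):
    """Minimal 1-D k-means over, e.g., commit timestamps expressed
    as days since the first commit."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        # assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        # move each center to the mean of its cluster (keep it if empty)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# two bursts of commit activity, around day 2 and day 100
days = [0, 1, 2, 3, 4, 98, 99, 100, 101, 102]
print(kmeans_1d(days, k=2))  # [2.0, 100.0]
```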

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of deep-learning-related GitHub repositories, and identified factors affecting each of these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the tool's ability to give users a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models and datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models and datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

PyCharm 2019.1.3 (Professional Edition)


Build #PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
macOS 10.14.6

• Anaconda

– jupyter-notebook 6.0.0

Other

- Python 3.7.4
- pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
- Git

Datasets

asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Table of Contents: Before You Begin; Prerequisites; Install; Running; Test; High Level Description of all Modules & Datasets; Authors; License

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code. Amphetamine on the Mac App Store: keep the Mac awake with this useful app (otherwise it will disconnect from the internet).

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:
- Git (https://git-scm.com/downloads)
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

    pip3 install --upgrade pip
    pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run "python3 model_searcher.py" to get keyword-related repositories' metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir. Then run "sh JSONFormatter.sh" in your terminal to well-format your output data.

Sample case: in main(), change "keywords" to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: "sort" can be "updated" or "stars", and "order" can be "asc" or "desc".

2. Repository Search

Run "python3 forks_time_stamp_getter.py" to get all the fork timestamps into forked_timestamp.

3. Data Selection (Optional)

Run "python3 repository_filter.py" to get the code-related repositories with statistics in the filtered_repo folder. Run "python3 filtered_repo.py" to filter your data.

Note: keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run "python3 visualizations/popularity.py" and find the graphs in visualizations/graphs/popularity.
- Maintenance: run "python3 visualizations/maintenance.py" and find the graphs in visualizations/graphs/maintenance.
- Contribution: run "python3 visualizations/contribution.py" and find the graphs in visualizations/graphs/contribution.
- Multi correlations: run "python3 visualizations/multi_variable.py" and find the graphs in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all unreachable links and writes them to the file unreachable_urls.txt.

Usage: change the elements in "keywords" and run "python3 test.py". All unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g.

    bert = Model("bert tensorflow", "desc_by_star")

with parameters model name and repository-metadata subfolder. Then you can use this object with its related data easily (from Model import bert, and use bert as you go along).

Customize keywords: in the module model_keyword.py, import your instantiation (e.g. lstm) and call add_keywords, e.g.

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection, the output folder contains:
- asc_by_star: cnn tensorflow.json, lstm tensorflow.json
- asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search, forked_timestamp contains: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (optional), filtered_repo contains bert.json plus:
- pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
- tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated graphs are stored under graphs:
- contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
- maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
- multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
- popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr. Ben Swift

License and References

MIT © Xing Yu
Stamper image from http://bestpriceforrubberstamps.com (License: Free for personal use only)

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)


Chapter 2

Background and Related Work

Software repository mining enables researchers to study historical trends in software engineering practice effectively. Repository mining builds on the use of web-based hosting services. Multiple approaches exist for conducting it; in the first section we introduce some background on web-based hosting services. We then introduce some popular deep learning frameworks in Section 2.1.1. Finally, we detail some previous works in Section 2.2 that conduct software repository mining and identify the main patterns in exploring repository popularity.

2.1 Background

2.1.1 Deep learning

Different from traditional machine learning, deep learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction [LeCun et al., 2015]. Deep learning methods have dramatically improved human life in multiple aspects, from image classification and speech recognition to machine translation and even autonomous cars.

The usage of deep learning has grown tremendously in the past few years with improved hardware availability (GPUs), big data, and cloud providers such as Amazon Web Services (AWS). Beyond this, large companies create research teams to develop their own deep learning algorithms and integrate them into frameworks such as PyTorch and TensorFlow. Meanwhile, they share their datasets and model-training tutorials, helping startup companies build state-of-the-art products with minimal effort and time.

Though deep learning has become a hot topic in recent years, its history has been evolving since the 1950s, as summarised in Table 2.1 [Subramanian, 2018].

Previously, developers and researchers required expertise in C++ or CUDA to implement deep learning models and algorithms. However, thanks to large technology companies and organisations, people with knowledge of a scripting language (Python) can now also build their own deep learning algorithms.



Technique                      Year
Neural network                 1943
Backpropagation                1960s
Convolutional Neural Network   1979
Recurrent neural network       1980
Long Short-Term Memory         1997

Table 2.1: Deep Learning History

In Section 2.1.1.1 and Section 2.1.1.2 we discuss some state-of-the-art deep learning frameworks used today, both in academia and industry.

2.1.1.1 TensorFlow

TensorFlow is a machine learning system that operates at large scale [Abadi et al., 2016]. Initially released by the Google Brain team in November 2015, it grew out of a system developed under the name DistBelief. TensorFlow released its official 1.0.0 version on 11 February 2017, introducing the ability to run on multiple CPUs and GPUs.

Currently, Google uses this framework in numerous ways to improve its search engine, translation, recommendation system, and image recognition and captioning. The TensorFlow library contains different APIs for building deep learning models at scale, such as ResNet, CNN or LSTM. The architecture of this framework gives developers the flexibility to experiment with new optimisations and Google's training algorithms. It currently supports numerous applications, from front end to mobile; its flexibility and scalability make it one of the most popular deep learning frameworks in the world.

2.1.1.2 PyTorch

Based on the Torch library, PyTorch can be seen as a Python front end to the Torch engine, which provides the capability to define mathematical functions and compute gradients. Similar to TensorFlow, PyTorch has good GPU and CPU support and is primarily developed by Facebook's artificial intelligence research group. The official initial release of PyTorch was in October 2016; its dynamic computation capability allows greater flexibility in building complex architectures.

However, unlike TensorFlow, which can be integrated seamlessly into real industrial applications, PyTorch was developed primarily for researchers and scientists, and is not easily used or recommended for production in certain scenarios.


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and speed up project development to survive in this keen competition. Winning public trust with high-quality services is thus required.

Initially, we would have liked to conduct our research on the latest model stores, such as AWS SageMaker, the Azure Machine Learning service, the Wolfram Neural Net Repository, and ONNX. However, the algorithms and datasets used there are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand over the public. To avoid those problems and gain a deeper insight into usage in society, we choose a framework with models of greater transparency and substantial usage: TensorFlow.

APIs Referenced

Since this project involves a range of deep learning models, we begin by developing an understanding of how to build and train neural networks. The construction of networks may rely on the layers API provided by Keras and TensorFlow. In the meantime, TensorFlow recently introduced the Estimator API to simplify the procedure of training, evaluation, prediction, and export.

Convolutional Neural Network (CNN)

The Convolutional Neural Network is one of the most established algorithms among all the deep learning models and one of the most dominant algorithms in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons arranged in a grid pattern. Typically, a CNN consists of three types of layers: fully connected layers, convolution layers, and pooling layers. Convolution and pooling layers conduct the feature extraction, and the fully connected layer maps the extracted features into the final output. Layers are interconnected, and thus the extracted features are transferred layer by layer.
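The convolution operation performed by a convolution layer can be sketched as follows. This is an illustrative toy implementation (not STAMPER code, and not how TensorFlow implements it internally): a small kernel slides over the input grid, producing one weighted sum per position.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution (no padding, stride 1): the kernel slides
    over the image, and each output cell is the sum of the element-wise
    product between the kernel and the image patch under it."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out
```

In a real CNN the kernel weights are learned, and a pooling layer would typically downsample the resulting feature map.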

Long short-term memory (LSTM)

Different from traditional neural networks, which cannot memorise previous data, a special kind of recurrent neural network, Long Short-Term Memory (LSTM), provides researchers with an effective way to perform persistent data tasks, and is capable of learning long-term dependencies [Hochreiter and Schmidhuber 1997]. Specifically, LSTM combines the forget and input gates into an update gate. It is capable of learning dependencies from historical data and making predictions from the information remembered previously. Inside an LSTM, instead of a single linear layer, there is a small network which performs each gating function independently.

TensorFlow official models are chosen in our project. The TensorFlow official models repository contains a collection of deep learning models built with TensorFlow's high-level APIs (https://github.com/tensorflow/models/tree/master/official).

LSTM is one of the most common forms of recurrent neural network. This model is generally used with sequential data and can solve language modelling problems, such as NLP tasks (word embeddings, encoders).
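The gating mechanism described above can be illustrated with a single schematic LSTM time step. This is a minimal sketch of the standard formulation, not the TensorFlow implementation; the stacked weight matrix W and the dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM time step. W stacks the weights of the four gates:
    forget (f), input (i), candidate (g), and output (o)."""
    z = W @ np.concatenate([x, h]) + b   # all four gate pre-activations at once
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    g = np.tanh(g)                       # candidate values for the cell state
    c_new = f * c + i * g                # forget part of the old state, add new
    h_new = o * np.tanh(c_new)           # filtered state exposed as the output
    return h_new, c_new
```

Each gate is itself a small learned transformation of the input and previous hidden state, which is the "small network inside the LSTM" mentioned above.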

Residual Network (ResNet)

One of the problems that deep learning models face is that, as the number of layers increases past a certain point, accuracy stops improving. This problem motivated modern network architectures such as the residual network (ResNet) and Inception, which address it through residual connections.

ResNet normally solves the problem described above by fitting a residual mapping via a shortcut connection. Each ResNet block contains a series of layers and a shortcut connection component.
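The residual-mapping idea can be sketched as follows: instead of learning a target mapping H(x) directly, the block learns a residual F(x) and outputs F(x) + x through the shortcut connection. The two-layer form of F and the weight shapes here are illustrative assumptions, not the exact ResNet block.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """y = relu(F(x) + x), with a toy residual F(x) = W2 @ relu(W1 @ x).
    The '+ x' is the shortcut connection that lets gradients and
    information skip past the block's layers."""
    residual = W2 @ relu(W1 @ x)
    return relu(residual + x)
```

If the residual weights are zero, the block reduces to the identity (followed by ReLU), which is what makes very deep stacks of such blocks trainable.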

Bidirectional Encoder Representations from Transformers (Bert)

Bert is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al. 2018].

It was first released in google-research/bert on GitHub on the 1st of November 2018, serving a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications; it aims to predict the relationship between sentences by analysing the whole sentence holistically [Devlin et al. 2018].

Attention Is All You Need (Transformer)

Most problems in deep learning can be seen as a form of sequence-to-sequence mapping and can be solved using a common type of architecture, the encoder-decoder architecture.

The encoder and decoder both consist of stacks of identical layers, and each layer consists of two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network (FFN) [Vaswani et al. 2017].
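The core of the multi-head self-attention sub-layer is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, which can be sketched in a few lines. This is a single-head illustration of the published formula, not a full Transformer layer (no projections, masking, or multiple heads).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to all keys; the output is a weighted
    average of the values, with weights given by query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V
```

Multi-head attention runs several such attention functions in parallel over learned projections of Q, K, and V, then concatenates the results.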

Neural Collaborative Filtering (NCF)

Neural Collaborative Filtering (NCF) is a neural network architecture which utilises the non-linearity of neural networks to build recommendation systems [He et al. 2017]. It demonstrates that matrix factorisation can be interpreted as a special case of neural collaborative filtering. To add additional non-linearity, this model introduces a multi-layer perceptron module alongside the generalised matrix factorisation (GMF) layer.

Wide and Deep Learning

Since linear models are not good at generalising across unique features, deep models were introduced to solve this problem. Deep models can use embedding vectors for every query, and this technique can then generalise by coupling the items and queries.

To overcome over-generalisation, the Google research team introduced the Wide & Deep Learning model, which jointly trains comprehensive linear models and deep neural networks to combine the benefits of memorisation and generalisation for recommender systems [Cheng et al. 2016].
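The joint wide-plus-deep combination can be sketched as one shared prediction: a wide linear part over sparse (cross-product) features and a deep MLP part over dense features feed a single sigmoid output. The weight shapes and single hidden layer here are illustrative assumptions, not the production architecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wide_and_deep(x_wide, x_deep, w_wide, W1, w_out, b=0.0):
    """Joint prediction: the wide part memorises feature co-occurrences,
    the deep part generalises via embeddings/hidden layers, and both
    logits are summed before one sigmoid."""
    wide_logit = x_wide @ w_wide                 # linear (wide) component
    hidden = np.maximum(W1 @ x_deep, 0.0)        # one ReLU hidden layer (deep)
    deep_logit = hidden @ w_out
    return sigmoid(wide_logit + deep_logit + b)
```

Because both parts contribute to one loss, they are trained jointly rather than as an ensemble, which is the key design point of [Cheng et al. 2016].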

2.1.3 Summarized Timeline

Model Name     Definition Raised Time
CNN            1980s
LSTM           1997
ResNet         2015
Wide & Deep    2016
NCF            2017
Transformer    2017
Bert           2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest hosts of open-source projects built on a distributed version control system (DVCS) [Gousios et al. 2014]. A distributed version control system enables contributors to submit a set of changes and integrate them into the main development branch. The use of Git is based on pragmatic needs: it combines the advantages of version control with collaborative development.

GitHub can also offer insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project. Thus, the number of stars can reveal popularity; from a software development research perspective, it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al. [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth, using time series metadata derived from 2,279 accessible GitHub repositories. They found that slow growth is more common in overpopulated application domains and for old repositories. Moreover, they conclude that the most common domains on GitHub are web libraries and frameworks and non-web libraries and frameworks.

Building on their work, we would like to examine whether deep learning framework repositories follow a similar popularity trend pattern. At the same time, we would also like to study whether there exists a relationship between three factors: forks, stars, and watchers.

Predicting the Popularity of GitHub Repositories

In the same year, Borges et al. [2016a] published another paper about predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regressions to predict the number of stars of GitHub repositories; as a result, project owners can see how their projects are performing in the open-source community.

Their study examines the correlation between the popularity of a repository and the number of stars it owns. The study also reports a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent

GHTorrent [Gousios and Spinellis 2012] is designed to conduct independent repository mining, distributing its dataset through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all the repositories, and the resulting dataset can be used towards answering multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving raw data based on results returned from the REST API. However, their tool does not have the ability to visualise the metadata or offer trend analysis at a high level.

MetricMiner

A similar tool is MetricMiner [Sokol et al. 2013]. It is a web application that supports researchers in mining software repositories, doing data extraction and statistical inference from the collected data. The tool automatically clones the repository, processes the metadata, and stores the data in the cloud, which gives it good scalability and fast query answering without users installing any software on their localhost.

GitcProc

GitcProc [Casalnuovo et al. 2017] is a tool that uses regular expressions to extract the changed lines within a repository, to facilitate answering project evolution questions. GitcProc can retrieve and summarise global statistics of project metrics, including commits, commit dates, and contributors. The tool can measure how many changes have taken place in Java projects and is also able to locate the changed files.

RepoVis

RepoVis [Feiner and Andrews 2018] is a new tool which provides visual overviews for software maintained in Git repositories. RepoVis is a client-server web application providing a full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its searchable functionality to GitHub, combined with a code-based search. All the visualisations are written out in SVG format.

2.2.4 Visualizing data in Repositories

Chronos

CHRONOS [Servant and Jones 2013] is a software tool that enables the visualisation of historical changes inside software source code. The tool implements a zoomable user interface as a visualisation of the actual code, which supports developers from a high-level view (pattern recognition) to a low-level view (reverse engineering) using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise the complete history of changes, including the revisions that modified them. Inspired by this tool, our project uses this visualisation method to track the historical change of the popularity trend related to the keyword specified by users in GitHub.

[Figure 2.1: git2net [Gote et al. 2019]]

GEVOL

Collberg et al. [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph drawing technique, to deduce a better understanding of a program from its development history, displaying all the visualisations in a temporal graph visualizer.

The system aids in the discovery of the structure of a system and provides the user with a new way to discover the evolution of a program by visualising changes to it. It extracts information about Java programs stored within a CVS version control system into three types of graphs: inheritance, control-flow, and call-graphs.

git2net

git2net [Gote et al. 2019] is a software tool that facilitates the extraction of co-editing networks from git repositories. Similar to GEVOL, it uses text mining techniques to analyse the history of modifications within files. However, it also addresses the importance of studying the social network in GitHub and gives the reader a broader view of the application of graph-based data analysis and modelling. Their tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter we detailed the web-based hosting service we selected for study (GitHub) and presented the concept of deep learning with two popular frameworks, together with state-of-the-art neural network models. In the next chapter we will elaborate on how we design and implement STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter we will outline our design and implementation for data extraction, and then detail the metrics we use to estimate the trend of deep learning framework and model usage in GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

[Figure 3.1: Overview of STAMPER. Step 1: Data Collection through the Git project search API; Step 2: Repository Search; Step 3 (optional): Data Selection through the Git code search API using model-name keywords; the selected data is stored locally for Data Visualisation.]



Data Collection

We first collect all the repository metrics through the GitHub API. This step allows us to extract the history of all the repositories related to the keyword and record the metadata for each repository (repository-based search).

Repository Search

As each repository contains multiple contributors and forks, we subsequently extract and analyse whether changes were made, based on the size information, and calculate the collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the forked repositories' information to create visual representations.

Data Selection

We have implemented a selector allowing the exclusion of specific repositories not related to the desired ones. The selector summarises the frequency counts for the keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis

Since each forked repository may be related to re-development and modification, modification of forked repositories is included in our project. For all changed repositories, the size difference is analysed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to deeply analyse and manipulate the data, and even run statistical tests on the data set. To better understand these metrics, we divided them into multiple categories. Attributes that are not primary data from the GitHub API are explained in the data expansion part and labelled as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication, the user can make up to 5,000 requests per hour; otherwise, the rate limit only allows up to 60 requests per hour [Git d].
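The token-based authentication flow above can be sketched with only the Python standard library. This is an illustrative snippet rather than STAMPER's actual code; the endpoint and header names follow the GitHub REST v3 API, and `check_rate_limit` is a hypothetical helper name.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"

def auth_headers(token=None):
    """Build request headers; supplying an OAuth2 token lifts the rate
    limit from 60 to 5,000 requests per hour."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = "token " + token
    return headers

def check_rate_limit(token=None):
    """Query the /rate_limit endpoint to see the remaining core quota.
    (Network call; run with a real token to see the 5,000/hour limit.)"""
    req = urllib.request.Request(API_ROOT + "/rate_limit",
                                 headers=auth_headers(token))
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["resources"]["core"]
```

A crawler would call `auth_headers(token)` on every request and consult `check_rate_limit` to throttle itself before the quota is exhausted.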


Type          Meta-data
Contributor   contribution: int [Data Expansion]; login (user name): String;
              type (user/organization): String; contributors_url
Repository    created_at; description; full_name; language; size
Popularity    fork: Boolean; forks: int; forks_url; stargazers_count;
              watchers_count; unique_repos [Data Expansion]
Owner         id; login (username); type
Maintenance   has_issues: Boolean; has_wiki: Boolean; open_issues: int;
              pushed_at; updated_at; score

Table 3.2: Repository metadata collected

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deep crawling operations to collect raw data.

• Contribution

One repository generally involves multiple developers conducting software development. The project owner is not necessarily the person who contributes the largest amount of code, and the amounts contributed by different developers are potentially not the same. As a result, we further track this information by utilizing the GitHub API and record the number of contributions each developer made to each repository.

• Unique_repos

Popular repositories may have numerous forked repositories owned by other users, and the reasons behind the forking behaviour vary. Our research would like to explore whether users conduct subsequent development based on the original codebase. By comparing the size of each forked repository (F_i) and the original repository (O), we obtain all the forked repositories with a change of size (c):

F_i + c = O    (3.1)

3.4 Data Selection

[Figure 3.2: Data Selection. An entity (model) and API keywords are searched within each repository to produce statistics.]

[Figure 3.3: Store in Local Disk. Unfiltered data (forked repositories with timestamps) is filtered using model-related keywords (Bert, ResNet, CNN, ...) grouped in model_keyword.py.]


Figure 3.2 represents our method of searching API usage in DL model-related repositories. GitHub provides a REST API to examine whether a repository contains the targeted API.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. Moreover, the appearance frequency of the user-specified API is embedded directly within the returned result and can be matched in our program to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. This approach also allows users to build a high-level picture of API usage across GitHub-related repositories.
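The counting step above can be sketched as follows. This is an illustrative sketch, not STAMPER's exact code: `build_code_search_query` assembles a query for GitHub's `/search/code` endpoint, and `count_api_usage` aggregates appearance counts from already-fetched results shaped like the code-search response, so no network access is needed here.

```python
def build_code_search_query(keyword, full_name):
    """Return the query string for GitHub's GET /search/code endpoint,
    restricting the keyword search to one repository."""
    return "{} repo:{}".format(keyword, full_name)

def count_api_usage(search_results):
    """Aggregate API appearance counts per repository full name.

    `search_results` is assumed to be a list of items shaped like the
    GitHub code-search response: {"repository": {"full_name": ...}}.
    """
    counts = {}
    for item in search_results:
        name = item["repository"]["full_name"]
        counts[name] = counts.get(name, 0) + 1
    return counts
```

The resulting dictionary maps each repository's full name to its frequency count, which is what gets written to disk as JSON.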

In the meantime, we also provide multiple options for selecting APIs, stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the number of models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras applications library provides users the ability to load deep learning models and instantiate a model with default weights. Using ResNet as an example:

• With pre-defined models

Users may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, and this model can be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they all make good sample keywords in STAMPER, such as:

1. keras.applications.resnet.ResNet50
2. keras.applications.resnet.ResNet101
3. keras.applications.resnet.ResNet152
4. keras.applications.resnet_v2.ResNet50V2
5. keras.applications.resnet_v2.ResNet101V2
6. keras.applications.resnet_v2.ResNet152V2

• With self-defined models

TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to explore the components inside the ResNet class further, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, data selection heuristics vary: deep learning users and experts can define their searches according to their interests and preferences.

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py
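One such self-defined-model heuristic, counting ResNet-like class definitions in a repository's Python sources, could be sketched with a regular expression. The class-name pattern here is a hypothetical illustration, not STAMPER's exact rule.

```python
import re

# Match "class <something>ResNet<something>(" or "class ...ResNet...:".
CLASS_PATTERN = re.compile(r"^\s*class\s+(\w*ResNet\w*)\s*[(:]", re.MULTILINE)

def count_self_defined_resnets(source):
    """Return the names of ResNet-like classes defined in a source string."""
    return CLASS_PATTERN.findall(source)

example = '''
class ResNet50(tf.keras.Model):
    pass

class MyResNetBlock:
    pass
'''
```

Running the detector over every `.py` file in a repository and summing the matches gives a per-repository count of self-defined ResNet classes.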

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from these three perspectives is illustrated in Figure 3.4. In the meantime, Chapter 5 gives an example of our collected repository metadata for deep learning models.

[Figure 3.4: Overall Construct the Visualizations. Entities 1 to n are functionally mapped to contribution-related, popularity-related, and maintenance-related visualisations.]

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped by week (with forks)

• Repository creation time vs stars

Contribution

To further exploit the forking information, STAMPER finally supports the comparison between the original repository and its forked repositories. The work could be further extended by visiting the forked repository URL and tracing the commits.

As shown in Figure 3.5, the entity (E) we search in GitHub may have multiple related repositories (R_i) with their corresponding forked repositories (F_i). Among the forked repositories, we call a changed forked repository C_i.

To examine whether there exist changes in forked repositories, and the differences between multiple entities, we calculate the difference using the equation below.


Keyword                     Total of Repository (including Forks) Collected    Total of Original Repository Collected
ResNet tensorflow           6129                                               339
Bert tensorflow             13734                                              106
CNN tensorflow              39765                                              1000
LSTM tensorflow             19572                                              1000
Transformer tensorflow      7188                                               145
Wide and deep tensorflow    324                                                39

Table 3.1: Repositories Related to Tensorflow


Uniqueness percentage distributions are composed of all the percentages p_i, each corresponding to its original repository R_i:

p_i = (Σ C_i) / (Σ F_i)    (3.2)
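Equation (3.2) can be computed directly from the per-fork size changes. The sketch below is illustrative (not STAMPER's code) and assumes `forks` is a list of size-change values c, one per fork of a repository, where a non-zero c marks a changed fork.

```python
def uniqueness_percentage(forks):
    """p_i = (number of changed forks) / (total number of forks).

    A fork counts as changed when its size difference c is non-zero;
    repositories with no forks get p_i = 0.0.
    """
    changed = sum(1 for c in forks if c != 0)
    return changed / len(forks) if forks else 0.0
```

Collecting `uniqueness_percentage` over every original repository of an entity yields the distribution plotted in the histograms below.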

[Figure 3.5: Examine Uniqueness after Forking. An entity (E) maps to repositories 1 to n; each forked repository is marked as changed (Y/N).]

• Percentage of Forked Repositories Unique from Origin (Boxplots)

• Uniqueness Percentage Distribution for Each Entity (Histograms)

• Entropy Distribution for Each Entity (Histograms)

Maintenance

• Development Time Boxplot for Each Entity

• Open Issues Distribution for Each Entity

3.6 Summary

In this chapter we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. In the meantime, we introduced and analysed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field: deep learning models are continually evolving, and are built, trained, and deployed by researchers. Our tool is available for analysing such changes. We collected the historical information stored in GitHub and extracted the repository metadata using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development is a battlefield without gunsmoke. Researchers, companies, and developers are all trying to dominate the conversation in deep learning. There exists a variety of models to think in, but at the same time there is no common bridge to connect those ideas together. Historical data in GitHub is obscure and hard to find, and as a result many developers, especially experienced ones, remain in their original comfort zone. With our study we hope to shed some light on deep learning and highlight a few suggestions for the public.

This section aims to answer some questions related to both model usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, due to the few studies about popularity on GitHub, there is no standardized feature to measure popularity. We analyze some potential features of each repository and make the hypothesis that popularity is strongly related to the stars each repository owns.

This decision will be justified in the following section with more background on GitHub.

• Watchers

Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching; however, watching does not imply being a collaborator [Git b]. A watcher can watch a repository to receive notifications for new pull requests or issues that are created. Watchers can indicate how much interest the GitHub community gives to a repository.

[Figure 4.1: Repository Watching [Git b]]

[Figure 4.2: Star Sort Menu [Git a]]

• Stars

Starring a repository makes it easy for users to keep track of repositories they are interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and thus GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks

Forks are created when a user would like to make a copy of a repository. The user can fork a repository to suggest changes or to use it as a basis for a new project.

Based on the data gathered in the collection stage, we extract the watchers_count, stargazers_count, and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star     forks_count    watchers_count    model name
17940    4661           17940             Bert
12405    3637           12405             Bert
5263     1056           5263              Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks, and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, and thus we should not assume normality.


Spearman Correlation Coefficient

Definition

The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman 2011]. This statistical method allows us to test for a rank-order relationship between two numerical or ordinal variables associated by a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork, and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

    from scipy.stats import spearmanr
    coef1, p1 = spearmanr(star, forks)
    print(coef1, p1)
    >> 0.8752903811064278 0.0

• Star vs Watcher

    coef2, p2 = spearmanr(star, watchers)
    print(coef2, p2)
    >> 1.0 0.0

• Fork vs Watcher

    coef3, p3 = spearmanr(forks, watchers)
    print(coef3, p3)
    >> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Set α = 0.05. The p-values p1, p2, and p3 are all less than α; in the meantime, from the calculation above we can also see strong positive correlations, with values of coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875 respectively.

This means it is very unlikely (at 95% confidence) that the testing data are uncorrelated, and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of the report, we consider the number of stars to be the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and long short-term memory network (LSTM), shown in the figures below, look like two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from models with longer histories, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as we have already described in the background section.

The model development community recently saw the release of multiple powerful models and treats them as baselines for building new ones. However, for many new models, usage has not grown in abundance, as with the Wide and Deep model and the NCF model.


[Figure 4.4: Repositories with Forks. Accumulated number of repositories created (including forks) for each model, 2015 to 2019.]

[Figure 4.5: Repositories without Forks. Accumulated number of original repositories created for each model, 2015 to 2019.]


[Figure 4.6: Repository Trend in GitHub For Each Model. Weekly repository creation counts per model, October 2015 to October 2019.]


[Figure 4.7: Creation Time vs Stars. Number of stars against repository creation time for each model.]

A fork is another copy of a repository. The forked repository may either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see there is a considerable difference between the total number of repositories created including forks and the total number of original repositories. We find that most of the repositories related to deep learning models are not original, which indicates that a considerable number of developers remain in a studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, different from the previous summarizing method, we study the popularity of deep learning models over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the highest number of repositories created. Let us examine this thought using the data: in


2017, the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued to rise to an even higher level, which persists until now.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have two of the most important and significant communities in the deep learning field. These networks are essential in both computer vision and NLP, and they hold an overwhelming majority.

Models in fast and steady state ResNet and TransformerAs the title indicates ResNet and Transformers usage significantly improve in recenttwo years Differ from the previous structure like CNN and ResNet both of themconduct modification from the original structure and significantly improve on theresults in computer vision and translation tasks

Rising star: BERT

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends depicted in the graph support the conclusion that deep learning models are proliferating quickly, with innovative developments. There is certainly ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with the time the model came into existence, but our data tell a different story.

Although its paper was published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e. stars) and creation time: past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide and Deep model is similar: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the previous data also confirmed that there is no significant rise in the use of this model.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (histograms of binned fork counts per model)


4.1.4 RQ2: How popularity varies per model

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name    | Mean   | STD     | Min | 25% | 50% | 75%  | Max
Bert          | 498.65 | 2196.3  | 0   | 1   | 8   | 43   | 17940
CNN           | 106.84 | 611.97  | 2   | 3   | 8   | 32   | 13882
LSTM          | 48.82  | 214.22  | 0   | 1   | 2   | 13   | 2703
NCF           | 77     | 129.91  | 1   | 2   | 3   | 11.5 | 227
ResNet        | 46.88  | 221.43  | 0   | 0   | 1   | 8    | 2980
Transformer   | 186.79 | 1155.87 | 0   | 0   | 4   | 21   | 12408
Wide and Deep | 16.23  | 36.80   | 0   | 0   | 1   | 8    | 146

Table 4.2: Stars Comparison

Model Name    | Mean   | STD    | Min | 25% | 50% | 75%  | Max
Bert          | 128.21 | 585.93 | 0.0 | 0.0 | 1.0 | 16.5 | 4661.0
CNN           | 40.71  | 252.71 | 0.0 | 1.0 | 4.0 | 14.0 | 6274.0
LSTM          | 17.79  | 71.96  | 0.0 | 0.0 | 1.0 | 5.0  | 968.0
NCF           | 34.33  | 58.60  | 0.0 | 0.5 | 1.0 | 51.5 | 102.0
ResNet        | 17.44  | 93.75  | 0.0 | 0.0 | 0.0 | 3.0  | 1442.0
Transformer   | 53.52  | 336.10 | 0.0 | 0.0 | 1.0 | 6.0  | 3637.0
Wide and Deep | 7.28   | 16.36  | 0.0 | 0.0 | 0.0 | 2.5  | 71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52) and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79) and ResNet (17.44).

Kruskal-Wallis test. The Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen:

bull H0: the 7 models' distributions are the same.

bull H1: the 7 models' distributions are different.

from scipy.stats import kruskal
stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(), dfLstm['star'].tolist(),
                  dfNcf['star'].tolist(), dfResnet['star'].tolist(),
                  dfTransformer['star'].tolist(), dfWideDeep['star'].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that the seven models' star distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the previous popularity exploration based on the number of repositories existing in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or BERT model requires a large amount of time and effort, but developers still show their interest in those novel deep learning models by starring and forking them.

Figure 4.9: Star vs. Contributors (scatter plot of stargazers_count against number_of_contributors per model)

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy value, respectively.


Figure 4.10: Star vs. Development Time (scatter plot of stargazers_count against develop_duration per model)

Figure 4.11: Star vs. Open Issues (scatter plot of stargazers_count against open_issues per model)

Figure 4.12: Star vs. Entropy Value (scatter plot of stargazers_count against entropy per model)

Number of contributors. From Figure 4.9 and a Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Among all the repositories, the top-3 repositories with the most stars per contributor are from the models CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor) and Bert (155.0 stars/contributor).

Model         | Percentage of One-Contributor Development (%)
Bert          | 74.53
CNN           | 83.3
LSTM          | 85.9
NCF           | 100
ResNet        | 90.26
Transformer   | 81.20
Wide and Deep | 89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development time. From Figure 4.10 and a Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value = 0). This suggests that the longer a model is developed, the more stars it will have (i.e. the model becomes more popular). The top-2 repositories with the longest development duration belong to the models LSTM and CNN.

Open issues. From Figure 4.11 and a Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can make the assumption that the more popular a repository becomes, the more issues it will have. We will further investigate this correlation in the following section.

Entropy. From Figure 4.12 and a Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of a new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly in developing a deep learning model. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can confirm this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. Here we introduce an information-theoretic approach to test whether contributions are even or not.

Entropy. In particular, we compute the entropy H of each repository, defined as

p_i = c_i / Σ_i c_i    (4.1)

H = − Σ_i p_i log2(p_i)    (4.2)

where i indexes the contributors, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for one repository.

Take the repository dragen1860/TensorFlow-2x-Tutorials as an example.

Its contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214    (4.3)

p_1 = 174/214,  p_2 = 36/214,  p_3 = 4/214    (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.7826    (4.5)

name          | contribution
dragen1860    | 174
ash3n         | 36
kelvinkoh0308 | 4

Table 4.5: Sample Contributions to One Repository
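Equations 4.1 and 4.2 translate directly into code. A minimal sketch, evaluated on the contribution counts from Table 4.5:

```python
import math

def collaboration_entropy(contributions):
    """Shannon entropy (base 2) of a repository's contribution shares."""
    total = sum(contributions)
    shares = [c / total for c in contributions]
    return -sum(p * math.log2(p) for p in shares if p > 0)

# Contribution counts from Table 4.5 (dragen1860, ash3n, kelvinkoh0308)
print(round(collaboration_entropy([174, 36, 4]), 4))
```

Two contributors splitting the work exactly evenly would give the maximum two-contributor entropy of 1.0, while a single-contributor repository gives 0.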

The resulting distribution of entropy over all repositories can be used to determine whether a repository is developed unevenly: the lower the entropy, the higher the phase separation, which means more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy values for all models. From those figures we can see that most of the repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.

Figure 4.13: Collaboration Entropy (histograms of binned entropy values per model)


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories compared with the other six models.

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15. Most of the model repositories are not changed after forking. Looking in more detail, we can see at a glance not only that changes are rarely made after forking, but also that most changed

Figure 4.15: Repository Uniqueness Distribution (%)

Figure 4.16: Repository Change Statistics (histograms of binned size change per model)


repositories have a size difference from the original repository of only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them of less concern to people. Second, the model itself may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the forked repositories' development size is quite imbalanced, with a large number of forked projects showing no change from the original repository.
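The unchanged-fork measurement can be sketched by comparing each fork's metadata against its parent. The size and pushed_at field names mirror the GitHub repository metadata; the records and the exact comparison rule below are illustrative assumptions, not the report's precise method:

```python
def unchanged_fork_percentage(parent, forks):
    """Share of forks whose size and last-push timestamp still match the parent.

    parent and each fork are metadata dicts; matching both fields is used
    here as a cheap proxy for 'no change after forking'.
    """
    if not forks:
        return 0.0
    unchanged = sum(
        1 for f in forks
        if f["size"] == parent["size"] and f["pushed_at"] == parent["pushed_at"]
    )
    return 100.0 * unchanged / len(forks)

parent = {"size": 1200, "pushed_at": "2019-06-01T00:00:00Z"}
forks = [
    {"size": 1200, "pushed_at": "2019-06-01T00:00:00Z"},  # untouched fork
    {"size": 1250, "pushed_at": "2019-07-10T12:30:00Z"},  # modified fork
]
print(unchanged_fork_percentage(parent, forks))  # → 50.0
```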

4.3 Maintenance of Deep Learning Models in GitHub

In this section, the problems of software maintenance in these deep learning related repositories are surveyed. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)

Figure 4.17 and Table 4.6 show how development time varies per model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide and Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days per model is different (p-value ≤ 0.05). Therefore, we hypothesize that many of the earlier models started being used in the open-source web community immediately after their first release.
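Equation 4.6 applied to GitHub's ISO-8601 timestamps can be sketched as follows; the timestamps below are illustrative, not taken from the dataset:

```python
from datetime import datetime, timezone

def repository_age_days(created_at, updated_at):
    """Repository age in days, per Equation 4.6, from ISO-8601 timestamps
    of the form GitHub's API returns (e.g. 2019-06-01T00:00:00Z)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    created = datetime.strptime(created_at, fmt).replace(tzinfo=timezone.utc)
    updated = datetime.strptime(updated_at, fmt).replace(tzinfo=timezone.utc)
    return (updated - created).days

# Illustrative timestamps spanning 110 days
print(repository_age_days("2018-11-01T08:00:00Z", "2019-02-19T08:00:00Z"))  # → 110
```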


Model         | Max of days | Q3 of days | Median of days | Q1 of days | Min of days
Bert          | 779         | 229        | 110            | 32         | 0
Transformer   | 1254        | 321        | 142            | 11         | 0
Wide and Deep | 1107        | 575        | 117            | 0.5        | 0
ResNet        | 1360        | 456.5      | 120            | 1.5        | 0
NCF           | 1120        | 476        | 216            | 8          | 0
LSTM          | 1812        | 621.25     | 315.5          | 47.25      | 0
CNN           | 1385        | 699.25     | 483            | 270.25     | 0

Table 4.6: Repository Development Time Statistics

Figure 4.17: Development Time Boxplot (days per model)


Figure 4.18: Development Time vs. Number of Open Issues (scatter plot of develop_duration against open_issues per model)

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and a Spearman correlation test, there is a moderate correlation between the two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, due to the high cost of maintenance, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model         | Mean  | Std    | 25% | 50% | 75% | Min | Max
Bert          | 8.299 | 50.55  | 0   | 0   | 1   | 0   | 504
CNN           | 3.414 | 35.456 | 0   | 0   | 1   | 0   | 1077
LSTM          | 1.292 | 4.915  | 0   | 0   | 1   | 0   | 69
ResNet        | 1.791 | 11.164 | 0   | 0   | 0   | 0   | 186
Transformer   | 1.857 | 8.608  | 0   | 0   | 1   | 0   | 95
Wide and Deep | 0.231 | 0.742  | 0   | 0   | 0   | 0   | 4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository | Percentage of Repositories Having a Wiki (%)
Bert                     | 97.17
CNN                      | 98.498
LSTM                     | 98.799
NCF                      | 98.864
ResNet                   | 98.817
Transformer              | 96.97
Wide and Deep            | 100

Table 4.8: Percentage of wiki existence per model

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e. they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.
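The wiki percentages in Table 4.8 can be derived from the repository metadata's has_wiki flag. A minimal sketch on illustrative records (the field name mirrors GitHub's repository metadata; the records are not from the dataset):

```python
def wiki_percentage(repos):
    """Percentage of repositories whose metadata reports a wiki."""
    with_wiki = sum(1 for r in repos if r.get("has_wiki"))
    return 100.0 * with_wiki / len(repos)

# Illustrative metadata records
repos = [
    {"has_wiki": True},
    {"has_wiki": True},
    {"has_wiki": False},
    {"has_wiki": True},
]
print(wiki_percentage(repos))  # → 75.0
```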

4.4 Summary

In this chapter we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original codebase after forking.


Figure 4.19: Open Issues vs. Number of Repositories (histograms of binned open_issues per model)


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There also exists a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question needing further investigation in the future. For example, users may use the prototxt format to publish their models, while in our project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, since a search cannot exceed the 1000-results boundary. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories in GitHub. Other, more stratified samples might give a more precise outcome.
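The multi-sort workaround can be sketched as URL generation against the GitHub search API. The 1000-result cap and the per_page=100 page size are documented API bounds; only URLs are built here, no requests are made:

```python
# GitHub's repository search returns at most 1000 results per query
# (10 pages of 100), so the same keyword is queried under several
# sort/order combinations to widen coverage.
BASE = "https://api.github.com/search/repositories"

def search_urls(keyword, per_page=100, max_results=1000):
    """Build the request URLs for one keyword across all sort strategies."""
    pages = max_results // per_page  # 10 pages = the 1000-result cap
    urls = []
    for sort in ("stars", "updated"):
        for order in ("desc", "asc"):
            for page in range(1, pages + 1):
                urls.append(f"{BASE}?q={keyword}&sort={sort}"
                            f"&order={order}&per_page={per_page}&page={page}")
    return urls

urls = search_urls("lstm+tensorflow")
print(len(urls))  # 2 sorts x 2 orders x 10 pages = 40 request URLs
```

Even combined, these 40 pages can only surface up to 4000 (overlapping) results per keyword, which is why the 1000-result boundary remains a sampling limitation.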

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to know how deep learning is involved in our real life, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly: new models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild; the program could then provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to develop their own heuristics for data selection: experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a plugin for GitHub, allowing researchers

and developers to easily access the trends of the past.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models via the number of repositories existing in GitHub. It is very likely that the commit metadata reflect popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g. k-means) to the high-resolution time series data from commits.
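As a first preprocessing step toward such clustering, commit timestamps can be bucketed into a per-month time series per repository; a stdlib-only sketch on illustrative timestamps (the clustering itself, e.g. k-means over these series, is left out):

```python
from collections import Counter
from datetime import datetime

def monthly_commit_counts(timestamps):
    """Bucket ISO-8601 commit timestamps into per-(year, month) counts,
    producing the kind of time series a clustering algorithm could group."""
    months = Counter()
    for ts in timestamps:
        dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
        months[(dt.year, dt.month)] += 1
    return dict(sorted(months.items()))

commits = [  # illustrative commit timestamps, not from the dataset
    "2019-01-03T10:00:00Z", "2019-01-21T09:30:00Z",
    "2019-02-14T16:45:00Z", "2019-04-02T08:20:00Z",
]
print(monthly_commit_counts(commits))
```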

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis in GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub's deep learning related repositories and identified factors affecting those domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users deeper insight into the current deep learning trend and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields and serve the needs of people working at the intersection of social media analysis, data visualization and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

The ML software landscape (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past, based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

bull Identify data sources for current trends in model & dataset use

bull Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated this program is my own original work

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

bull PyCharm

PyCharm 2019.1.3 (Professional Edition)
Build PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
macOS 10.14.6

bull Anaconda

ndash jupyter-notebook 6.0.0

Other

- Python 3.7.4
- pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
- Git

Datasets

asc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_by_star: bert tensorflow.json, lstm tensorflow.json, wide deep tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, cnn tensorflow.json, ncf tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json

by_update_time: bert tensorflow.json, lstm tensorflow.json, resnet tensorflow.json, wide deep tensorflow.json, cnn tensorflow.json, ncf tensorflow.json, transformer tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json

forked_timestamp: bert tensorflow.csv, lstm tensorflow.csv, resnet tensorflow.csv, wide deep tensorflow.csv, cnn tensorflow.csv, ncf tensorflow.csv, transformer tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents: Before You Begin, Prerequisites, Install, Running, Test, High Level Description of all Modules & Datasets, Authors, License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keeps the Mac awake (otherwise the internet connection may drop).

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git (https://git-scm.com/downloads)
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip:

pip3 install --upgrade pip

Install the list of requirements specified in requirements.txt:

pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repository metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir. Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest. The resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars; order can be asc or desc.

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the forks' timestamps in forked_timestamp.

3. Data Selection (Optional)


Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder. Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity.
- Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance.
- Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution.
- Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module will record all the unreachable links and write them into the file unreachable_urls.txt.

Usage: change elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models).

In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation

Since you already got data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g.

    bert = Model('bert tensorflow', 'desc_by_star')

with parameters model_name and repository metadata subfolder. Then you can call this object with its relative data easily (from Model import bert and use bert as you go along).

Customize Keywords

In module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.

    lstm_keywords = ['tf.keras.layers.LSTMCell', 'tf.nn.rnn_cell.LSTMCell']
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

Altair is used to draw elegant graphs.

Experiment Datasets Collected

1. After Data Collection

output/
    asc_by_star/
        cnn tensorflow.json
        lstm tensorflow.json
    asc_general/
        bert.json
        cnn.json
        lstm.json
        ncf.json
        resnet.json
        transformer.json
        wide deep.json
    by_update_time/
        bert tensorflow.json
        cnn tensorflow.json
        lstm tensorflow.json
        ncf tensorflow.json
        resnet tensorflow.json
        transformer tensorflow.json
        wide deep tensorflow.json
    desc_by_star/
        bert tensorflow.json
        cnn tensorflow.json
        lstm tensorflow.json
        ncf tensorflow.json
        resnet tensorflow.json
        transformer tensorflow.json
        wide deep tensorflow.json
    desc_general/
        bert.json
        cnn.json
        lstm.json
        ncf.json
        resnet.json
        transformer.json
        wide deep.json
    pytorch_models/
        AlexNet.json
        DCGAN.json
        Densenet.json
        FCN-ResNet101.json
        GoogleNet.json
        HarDNet.json
        Inception_v3.json
        MobileNet v2.json
        PGAN.json
        ResNet.json
        ResNet101.json
        ResNext WSL.json
        ResNext.json
        RoBERTa.json
        SSD.json
        ShuffleNet v2.json
        SqueezeNet.json
        Tacotron 2.json
        Transformer.json
        U-Net pytorch.json
        U-Net.json
        WaveGlow.json
        Wide ResNet.json
        fairseq.json
        vgg_nets.json

2. After Repository Search

forked_timestamp/
    bert tensorflow.csv
    cnn tensorflow.csv
    lstm tensorflow.csv
    ncf tensorflow.csv
    resnet tensorflow.csv
    transformer tensorflow.csv
    wide deep tensorflow.csv


Generated Graphs

3. After Data Selection (Optional)

filtered_repo/
    bert.json

pytorch_model_filtering/
    Densenet.json
    FCN-ResNet101.json
    GoogleNet.json
    MobileNet v2.json
    ResNet101.json
    ResNext.json
    ShuffleNet v2.json
    SqueezeNet.json
    Tacotron 2.json
    Wide ResNet.json
    vgg_nets.json

tensorflow_model_filtering/
    bert.json
    lstm.json
    ncf.json
    resnet.json
    transformer.json
    wide deep.json


graphs/
    contribution/
        change_to_pdf.bash
        entropy_distribution.svg
        entropy_dots.svg
        lines_changed_boxs.svg
        lines_changed_hists.svg
        unique_percentage_distribution.svg
        uniqueness_chart.svg
    maintenance/
        devTime_boxplot.svg
        issues_distribution.svg
        wiki_yn.svg
    multi_variable/
        dev_t_to_open_issues.svg
        multi_correlation.svg
        star_to_contributors.svg
        star_to_dev_t.svg
        star_to_entropy.svg
        star_to_open_issues.svg
    popularity/
        accumulated_popularity.svg
        creation_repository_trend_total.svg
        creation_with_fork_timeline.svg
        fork_distribution.svg
        popularity_dot.svg
        popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr. Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)

Bibliography

a. GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

b. GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

c. GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

d. GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265-283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334-344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396-399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7-10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77-ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1-11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433-444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345-355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12-21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173-182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735-1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1-4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142-146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on page 6)

  • Acknowledgments
  • Abstract
  • List of Abbreviations
  • Contents
  • Introduction
    • Trace Deep Learning use through GitHub
    • Contribution
    • Report Outline
  • Background and Related Work
    • Background
      • Deep learning
        • TensorFlow
        • PyTorch
      • Deep learning models
      • Summarized Timeline
    • Public Code Repositories
      • Web-based hosting service
      • Measuring Popularity From GitHub
      • Extracting Messy Data in the Wild
      • Visualizing data in Repositories
    • Summary
  • STAMPER Design and Implementation
    • Overview
    • Data Collection
    • Repository Search
    • Data Selection
      • Example
    • Construct the Visualizations
    • Summary
  • STAMPER in Action
    • Popularity of Deep Learning Models in GitHub
      • Popularity Feature Selection
      • Past and Current Status: A Full Integration
      • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
      • RQ2: How popularity varies per model
      • RQ3: Does the popularity of models relate to other features?
    • Contribution of Deep Learning Models in GitHub
      • Collaborative Contribution
      • RQ1: After forking, do developers change the codebase?
    • Maintenance of Deep Learning Models in GitHub
      • RQ1: How long has it been in existence?
      • RQ2: Do old models have more issues compared to new models?
      • RQ3: Are they well maintained?
    • Summary
  • Discussion And Future Work
    • Discussion
      • Data in the wild: Limitation and Improvement
      • Extensibility and Open-Source Software
    • Future Work
      • Social Network Analysis in GitHub
      • Trend Detection using Commitments Timestamp
  • Conclusion
  • Appendix
    • Appendix 1: Project Description
      • Project Title
      • Supervisors
      • Project Description
      • Learning Objectives
    • Appendix 2: Study Contract
    • Appendix 3: Artefact Description
      • Code Files Submitted
      • Program Testing
      • Experiment
        • Hardware
        • Softwares
        • Other
        • Datasets
    • Appendix 4: README

4 Background and Related Work

Technique                        Year
Neural network                   1943
Backpropagation                  1960s
Convolutional Neural Network     1979
Recurrent neural network         1980
Long Short-Term Memory           1997

Table 2.1: Deep Learning History

In Section 2.1.1.1 and Section 2.1.1.2 we will talk about some state-of-the-art deep learning frameworks used today, both in academia and industry.

2.1.1.1 TensorFlow

TensorFlow is a machine learning system that operates at large scale [Abadi et al., 2016]. Before its initial release by the Google Brain team in November 2015, it was developed under the name DistBelief. TensorFlow then released its official 1.0.0 version on the 11th of February 2017, introducing the ability to run on multiple CPUs and GPUs.

Currently, Google uses this framework in numerous ways to improve its search engine, translation, recommendation systems, and image recognition and captioning. The TensorFlow library contains different APIs to build deep learning models at scale, such as ResNet, CNN or LSTM. The architecture of this framework gives developers the flexibility to experiment with new optimisations and Google's training algorithms. It currently supports numerous applications, from front end to mobile; this flexibility and scalability make it one of the most popular deep learning frameworks in the world.

2.1.1.2 PyTorch

Based on the Torch library, PyTorch can be seen as a Python front end to the Torch engine, which provides the capability to define mathematical functions and compute gradients. Similar to TensorFlow, PyTorch has good GPU and CPU support and is primarily developed by Facebook's artificial intelligence research group. The official initial release of PyTorch was in October 2016; its dynamic computation capability gives it great flexibility in building complex architectures.

However, unlike TensorFlow, which can integrate seamlessly into real industrial applications, PyTorch was primarily developed by researchers and scientists, and is not easy or recommended to use for production in certain scenarios.


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and pick up the speed of project development to survive in this keen competition. Winning trust from the public with high-quality service is thus required.

Initially, we would have liked to conduct our research on the latest model stores, such as AWS SageMaker, the Azure Machine Learning service, the Wolfram Neural Net Repository and ONNX. However, the algorithms and datasets used there are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand with the public. To avoid those problems and gain a deeper insight into usage in society, we chose a framework with models of greater transparency and substantial usage: TensorFlow.

API Referenced

Since this project involves a range of deep learning models, we began by developing an understanding of how to build and train neural networks. The construction of networks may rely on the layers APIs provided by Keras and TensorFlow. In the meanwhile, TensorFlow recently introduced Estimator APIs to simplify the procedure of training, evaluation, prediction and export.

Convolutional Neural Network (CNN)

The Convolutional Neural Network is one of the most established algorithms among all the deep learning models and one of the most dominant algorithms in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons in a grid pattern. Typically, a CNN consists of three types of layers: fully connected layers, convolution layers and pooling layers. Convolution and pooling layers conduct the feature extraction, and the fully connected layer maps the extracted features into the final output. Layers are interconnected, and thus the extracted features are transferred layer by layer.

Long short-term memory (LSTM)

Different from traditional neural networks, which cannot memorise previous data, a special kind of recurrent neural network, Long Short-Term Memory (LSTM), provides researchers with an effective way to perform persistent data tasks and is capable of learning long-term dependencies [Hochreiter and Schmidhuber, 1997]. Specifically, LSTM combines the forget and input gates into an update gate. It is capable of learning all the dependencies from the historical data and making predictions from the information remembered previously. Inside an LSTM, instead of a linear layer, there is a small network which performs the function independently.

LSTM is one of the most common uses of recurrent neural networks. This model is generally used with sequential data and can solve language modelling problems such as NLP concepts (word embedding, encoder).

(Footnote: TensorFlow official models are chosen in our project. The TensorFlow official models contain a collection of deep learning models built with their high-level APIs: https://github.com/tensorflow/models/tree/master/official)

Residual Network (ResNet)

One of the problems that deep learning models face is that, as the number of layers increases to a certain point, accuracy stops improving. This problem motivated modern network architectures such as the residual network (ResNet) and Inception, which solve it via residual connections.

ResNet normally solves the problem above by fitting a residual mapping through an added shortcut connection. Each ResNet block contains a series of layers and a shortcut connection component.

Bidirectional Encoder Representations from Transformers (Bert)

Bert is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al., 2018].

It was first released in google-research/bert on GitHub on the 1st of November 2018, with the ability to handle a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications; it aims to predict the relationship between sentences by analysing the whole sentence holistically [Devlin et al., 2018].

Attention is all you need (Transformer)

Most problems in deep learning can be seen as a form of sequence-to-sequence mapping and can be solved using a common type of architecture, the Encoder-Decoder architecture.

The encoder and decoder both consist of identical layers, and each layer consists of 2 sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network (FFN) [Vaswani et al., 2017].

Neural Collaborative Filtering (NCF)

Neural Collaborative Filtering (NCF) is a neural network architecture which utilises the non-linearity of neural networks to build recommendation systems [He et al., 2017]. It demonstrates that matrix factorisation can be interpreted as a special case of neural collaborative filtering. To add additional non-linearity, this model introduces a multi-layer perceptron module on top of the generalised matrix factorisation (GMF) layer.

Wide and Deep Learning

Since linear models are not great at generalising across unique features, deep models are introduced to solve this problem. Deep models can use embedding vectors for every query, and this technique is then able to generalise by coupling the items and queries.

To overcome over-generalisation, the Google research team introduced the Wide & Deep Learning model: jointly trained comprehensive linear models and deep neural networks, combining the benefits of memorisation and generalisation for recommender systems [Cheng et al., 2016].

2.1.3 Summarized Timeline

Model Name      Definition Raised Time
CNN             1980s
LSTM            1997
ResNet          2015
Wide & Deep     2016
NCF             2017
Transformer     2017
Bert            2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest open-source distributed version control systems (DVCS) [Gousios et al., 2014]. This distributed version control system enables contributors to submit a set of changes and integrate them into the main development branch. The use of git is based on pragmatic needs, combining the advantages of version control with collaborative development.

GitHub can lead to insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project. Thus, the number of stars can reveal popularity; from a software development research perspective, it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al. [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth, using time series metadata derived from 2,279 accessible GitHub repositories. In the meanwhile, they found that slow growth is more common in the case of overpopulated application domains and for old repositories. Moreover, they conclude that the most common domains on GitHub are web libraries and frameworks and non-web libraries and frameworks.

Building on their work, we would like to examine whether deep learning framework repositories follow a similar popularity trend pattern or not. At the same time, we would also like to study whether a relationship exists among three factors: forks, stars and watchers.

Predicting the Popularity of GitHub Repositories

In the same year, Borges et al. [2016a] published another paper about predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regressions to predict the number of stars of GitHub repositories; as a result, project owners can see how their projects are performing in the open-source community.

Their studies examine the correlation between the popularity of a repository and the number of stars it owns. In the meanwhile, this study reports a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent

GHTorrent [Gousios and Spinellis, 2012] is designed to conduct independent repository mining through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all the repositories, and the resulting dataset can be used towards answering multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving the raw data based on the result returned from the REST API. However, their tool does not have the ability to visualise the metadata and offer trend analysis at a high level.

MetricMiner

A similar tool is MetricMiner [Sokol et al., 2013]. It is a web application that supports researchers in mining software repositories, doing data extraction and statistical inference from the data collected. This tool automatically clones the repository, processes the metadata and stores the data in the cloud, which gives it good scalability and fast computational speed in query answering, without users installing any software on their localhost.

GitcProc

GitcProc [Casalnuovo et al., 2017] is a tool that uses regular expressions to extract the changed lines within a repository, to facilitate answering project evolution questions. GitcProc can retrieve and summarise global statistics of the project metrics, including the number of commits, commit dates and contributors. This tool can measure how many changes have taken place in Java projects and is also able to locate the changed files.

RepoVis

RepoVis [Feiner and Andrews, 2018] is a new tool which provides visual overviews for software maintained in Git repositories. RepoVis is a client-server web application that provides a full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its searchable functionality to GitHub, associated with a code-based search. All the visualizations are written into SVG format.

2.2.4 Visualizing data in Repositories

Chronos

CHRONOS [Servant and Jones, 2013] is a software tool that enables the visualisation of historical change inside software source code. This tool implements a zoomable user interface as a visualisation of the actual code, which supports developers from a high-level view (pattern recognition) to a low-level view (reverse engineering) using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise the complete history of change, including the revisions that modified them. Inspired by this tool, our project uses this visualisation method to track the historical change of the popularity trend related to the keyword specified by users in GitHub.

Figure 2.1: git2net [Gote et al., 2019]

GEVOL

Collberg et al. [2003] implemented a tool called GEVOL that can visualise the evolution of software, using a novel graph drawing technique to deduce a better understanding of a program from its development history, displaying all the visualisations with a temporal graph visualizer.

This system aids in the discovery of the structure of the system and provides the user with a new way to discover the evolution of a program by visualising the change of the system. It extracts information about Java programs stored within a CVS version control system into three types of graphs: inheritance, control-flow and call-graphs.

git2net

git2net [Gote et al., 2019] is a software tool that facilitates the extraction of the co-editing network in git repositories. Similar to GEVOL, it uses text mining techniques to analyse the history of modifications within files. However, they address the importance of studying the social network in GitHub and give the reader a broader view of the application of graph-based data analysis and modelling. Their tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter, we detailed the web-based hosting service we selected for study (GitHub) and presented the concept of deep learning with two popular frameworks, together with state-of-the-art neural network models. In the next chapter, we will elaborate on how we designed and implemented STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter, we outline our design and implementation for data extraction, and then detail the metrics we use to estimate the trend of deep learning framework and model usage information in GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

[Figure 3.1: Overview of STAMPER. User keywords (keyword 1 ... keyword n, model names) feed 1. Data Collection via the Git Project Search API, followed by 2. Repository Search, an optional 3. Data Selection via the Git Code Search API, and local Data Visualisation.]


Data Collection

We first collect all the repository metrics through the GitHub API. This step allows us to extract the history of all the repositories related to the keyword and record the metadata for each repository (repository-based search).

Repository Search

As each repository contains multiple contributors and forks, we subsequently extract and analyze whether changes were made or not based on the size information, and calculate the collaborative factor (entropy) for those repositories. This process requires additionally crawling and processing the forked information to create visual representations.
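The report does not spell out the entropy formula at this point; assuming the collaborative factor is the Shannon entropy of contributors' commit shares, a minimal sketch (function name ours) looks like:

```python
import math

def contribution_entropy(contributions):
    """Shannon entropy (in bits) of one repository's contribution distribution.

    contributions: per-contributor commit counts. 0.0 means a single
    developer wrote everything; log2(n) means n developers contributed equally.
    """
    total = sum(contributions)
    shares = [c / total for c in contributions if c > 0]
    return -sum(p * math.log2(p) for p in shares)

# contribution_entropy([50, 50])  -> 1.0  (two equal contributors)
# contribution_entropy([100])     -> 0.0  (single contributor)
```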

Data Selection

We have implemented a selector allowing users to exclude specific repositories not related to the desired ones. The selector summarizes the frequency counts for the keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis

Since each forked repository may be related to re-development and modification, modification of forked repositories is included in our project. For all changed repositories, the size difference is analyzed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to deeply analyse and manipulate the data and even run statistical tests on the data set. To better understand those metrics, we divided them into multiple categories. For the attributes that are not primary data from the GitHub API, we explain them in the data expansion part and label them as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication, the user can make up to 5,000 requests per hour. Otherwise, the rate limit only allows up to 60 requests per hour [Git d].
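As a sketch of this authentication step (the helper name is ours; the endpoint and header follow GitHub's documented REST API v3 conventions):

```python
API_ROOT = "https://api.github.com"

def auth_headers(token=None):
    """Headers for GitHub REST API v3 calls; unauthenticated if token is None."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        # OAuth2 token sent in the Authorization header
        headers["Authorization"] = "token " + token
    return headers

# With the requests library one would then call, for example:
#   requests.get(API_ROOT + "/rate_limit", headers=auth_headers("<OAuth2 token>"))
# which reports a quota of 5000 requests/hour when authenticated, 60 otherwise.
```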


Type           Meta-data
Contributor    contribution: int [Data Expansion]
               login (user name): String
               type (user / organization): String
               contributors_url
Repository     created_at, description, full_name, language, size
Popularity     fork: Boolean
               forks: int
               forks_url, stargazers_count, watchers_count
               unique_repos [Data Expansion]
Owner          id, login (username), type
Maintenance    has_issues: Boolean
               has_wiki: Boolean
               open_issues: int
               pushed_at, updated_at, score
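As an illustration of how one repository object from the API maps onto these categories (the grouping function is ours; the field names are GitHub's):

```python
def categorize_repo(repo):
    """Split one repository JSON object (a dict) into the table's categories."""
    return {
        "repository": {k: repo.get(k) for k in
                       ("created_at", "description", "full_name", "language", "size")},
        "popularity": {k: repo.get(k) for k in
                       ("fork", "forks", "forks_url", "stargazers_count", "watchers_count")},
        "owner": {k: repo.get("owner", {}).get(k) for k in ("id", "login", "type")},
        "maintenance": {k: repo.get(k) for k in
                        ("has_issues", "has_wiki", "open_issues",
                         "pushed_at", "updated_at", "score")},
    }
```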

33 Repository Search

We expanded the collection operation to use event data as the starting point for deepcrawling operations to collect raw data

bull ContributionOne repository generally consists of multiple developers conduct software de-velopment The project owner not necessary the person who contributes themost amount of code Different amount of contributions made by the develop-ers are potentially not the same As a result we further track that informationby utilizing Github API and record the number of contribution each developermade for each repository

• Unique_repos
Popular repositories may have numerous forked repositories owned by other users, and the reasons behind forking vary. Our research explores whether users conduct subsequent development based on the original codebase. By comparing the size of each forked repository (F_i) with the size of the original repository (O), we obtain all the forked repositories together with their change of size (c):

F_i + c = O    (3.1)
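The size comparison of Equation (3.1) can be sketched as below. This is a minimal illustration assuming the sizes are the `size` values returned by the GitHub API; the function names are hypothetical:

```python
# Sketch of Equation (3.1): for an original repository of size O and a
# fork of size F_i, the change of size is c = O - F_i, so c == 0 means
# the fork's size is unchanged.

def size_changes(original_size, fork_sizes):
    """Return the change of size c for each forked repository."""
    return [original_size - f for f in fork_sizes]

def changed_forks(original_size, fork_sizes):
    """Keep only the forks whose size differs from the original."""
    return [f for f in fork_sizes if f != original_size]
```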

3.4 Data Selection

Figure 3.2: Data Selection (an entity/model and API keywords are searched in repositories to produce statistics)

Figure 3.3: Store in Local Disk (forked repository timestamps turn unfiltered data into filtered data, grouped by model-related keywords such as Bert, ResNet, and CNN)


Figure 3.2 represents our method for searching API usage in DL model-related repositories. GitHub provides a REST API to examine whether a repository contains a targeted API or not.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. The appearance frequency of each user-specified API is embedded directly within the returned result and can be matched in our program to each repository's full name. The overall result is finally written to the local disk in JSON format.
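The final write step might look like the following sketch; the function name and file layout are assumptions, since only the JSON-on-disk format is stated in the text:

```python
import json

# Hypothetical sketch: persist per-repository API appearance counts,
# keyed by each repository's full_name, to the local disk as JSON.

def write_results(results, path):
    """results: dict mapping repository full_name -> appearance count."""
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
```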

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. The approach also allows users to build a high-level view of API usage across GitHub repositories.

Meanwhile, we also provide multiple options for selecting APIs, stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the number of models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras applications library provides users the ability to load deep learning models and instantiate a model with default weights. Take ResNet as an example:

• With pre-defined models
Users may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, and this model could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they can all be used as sample keywords in STAMPER, such as:

keras.applications.resnet.ResNet50
keras.applications.resnet.ResNet101
keras.applications.resnet.ResNet152
keras.applications.resnet_v2.ResNet50V2
keras.applications.resnet_v2.ResNet101V2
keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. Users who would like to further explore the components inside the ResNet class may first examine whether the class is defined and trace the number of self-defined ResNet classes. Because model construction is flexible, the data selection heuristic varies: deep learning users and experts can define their searches according to their own interests and preferences.

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py
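Keyword-based selection of this kind can be sketched as a simple substring count over a repository's files. The keyword list mirrors the ResNet examples above, while `count_api_usage` is an illustrative helper, not STAMPER's actual implementation:

```python
# Count appearances of user-specified API keywords (as stored in
# model_keyword.py) across a repository's file contents.

RESNET_KEYWORDS = [
    "keras.applications.resnet.ResNet50",
    "keras.applications.resnet_v2.ResNet50V2",
]

def count_api_usage(file_contents, keywords):
    """file_contents: iterable of file texts from one repository."""
    counts = {k: 0 for k in keywords}
    for text in file_contents:
        for k in keywords:
            counts[k] += text.count(k)
    return counts
```

A repository's total across all keywords would then serve as the per-repository appearance count described above.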

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from these three perspectives is illustrated in Figure 3.4. Meanwhile, Chapter 5 gives an example of our collected repository metadata for deep learning models.

Figure 3.4: Overall Construction of the Visualizations (entities are functionally mapped to popularity-related, contribution-related, and maintenance-related visualisations)

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped by week (with forks)

• Repository Creation Time vs Stars

Contribution

To further exploit the forking information, STAMPER also supports comparison between an original repository and its forked repositories. This work could be extended by visiting each forked repository's URL and tracing its commits.

As shown in Figure 3.5, an entity (E) searched in GitHub may have multiple related repositories (R_i), each with corresponding forked repositories (F_i). Among the forked repositories, we denote a changed forked repository by C_i.

To examine whether forked repositories change, and how this differs between entities, we calculate the difference using the equation below.


Keyword                   Total Repositories Collected   Total Original Repositories
                          (including forks)              Collected
ResNet tensorflow         6129                           339
Bert tensorflow           13734                          106
CNN tensorflow            39765                          1000
LSTM tensorflow           19572                          1000
Transformer tensorflow    7188                           145
Wide and deep tensorflow  324                            39

Table 3.1: Repositories Related to TensorFlow


Uniqueness percentage distributions are composed of all the percentages p_i, each corresponding to an original repository R_i:

p_i = (Σ C_i) / (Σ F_i)    (3.2)
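Equation (3.2) amounts to the fraction of a repository's forks that changed after forking. A minimal sketch, with an assumed function name:

```python
# Sketch of Equation (3.2): p_i is the number of changed forks C_i
# divided by the total number of forks F_i of one original repository.

def uniqueness_percentage(changed_flags):
    """changed_flags: one boolean per fork, True if the fork changed."""
    if not changed_flags:
        return 0.0
    return sum(changed_flags) / len(changed_flags)
```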

Figure 3.5: Examine Uniqueness after Forking (an entity E with related repositories R_1..R_n and their forked repositories, each marked changed Y/N)

• Percentage of Forked Repositories Unique from Origin (Boxplots)

• Uniqueness percentage distribution for Each Entity (Histograms)

• Entropy Distribution for Each Entity (Histograms)

Maintenance

• Development Time Boxplot for Each Entity

• Open Issues Distribution for Each Entity

3.6 Summary

In this chapter, we detailed how our tool conducts repository mining and analysis. We presented a tool that facilitates scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. We also introduced and analysed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field, where models are continually evolving and being built, trained, and deployed by researchers. Our tool is available for analysing such changes. We collected the historical information stored in GitHub and extracted the metadata of each repository using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development is a fiercely contested field: researchers, companies, and developers are all competing for influence in deep learning. A variety of models exist to work with, but there is no common bridge connecting those ideas. Historical data on GitHub is obscure and hard to find, and as a result many developers, especially experienced ones, remain in their comfort zone. With this study, we hope to shed some light on deep learning and highlight a few suggestions for the public.

This section aims to answer questions about both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, because there are few studies about popularity in the GitHub ecosystem, there is no standardized feature for measuring it. We analyze some potential features of each repository and hypothesize that popularity is strongly related to the number of stars a repository owns.

This decision is justified in the following section with more GitHub background.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching; however, watching does not make a user a collaborator [Git b]. A watcher receives notifications for newly created pull requests and issues. Watchers indicate how much interest the GitHub community gives to a repository.

Figure 4.2: Star Sort Menu [Git a]

Figure 4.1: Repository Watching [Git b]

• Stars
Starring a repository makes it easy for users to keep track of repositories they are interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks
Forks are created when a user would like to make their own copy of a repository. The user can fork a repository to suggest changes or to use it as the basis for a new project.

Based on the data gathered in the collection stage, we extract the watchers_count, stargazers_count, and forks attributes and test whether these attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star     forks_count   watchers_count   model name
17940    4661          17940            Bert
12405    3637          12405            Bert
5263     1056          5263             Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks, and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we should not assume normality.


Spearman Correlation Coefficient

Definition: The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman, 2011]. This method allows us to test for a rank-order relationship between two numerical or ordinal variables associated by a monotonic (increasing or decreasing) function.

Hypothesis Testing

• H0: The variables (star, fork, and watcher) do not have a relationship with each other.

• H1: There is a relationship between the three variables.

Result

• Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

• Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

• Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above computes Spearman's correlation coefficient between the three variables in the testing dataset.

Setting α = 0.05: the p-values p1, p2, and p3 are all less than α, and the computed coefficients coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875 indicate strong positive correlations.

This means it is very unlikely (at 95% confidence) that the variables are uncorrelated, so we reject the null hypothesis that they have no relationship.

In the rest of this report, we take the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and long short-term memory network (LSTM), shown in the figures below, appear to be two of the most trending models. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from models with a longer history, BERT and ResNet are two rising stars in the model competition. They arrived with significant improvements in architecture design and performance, as described in the background section.

The model development community has recently seen the release of multiple powerful frameworks, which are treated as baselines for building models. However, for many new models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.


Figure 4.4: Repositories with Forks (accumulated number of repositories created per model, 2015-2019)

Figure 4.5: Repositories without Forks (accumulated number of repositories created per model, 2015-2019)

Figure 4.6: Repository Trend in GitHub For Each Model (counts of created repositories, one panel per model, October 2015 to October 2019)

Figure 4.7: Creation Time vs Stars

A fork is another copy of a repository; the forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, we can see a considerable difference between the total number of repositories created including forks and the total number created excluding forks. Most repositories related to deep learning models are therefore not original, which indicates that a considerable number of developers remain in a studying stage.

At the same time, we use this dataset to answer several research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, unlike the previous summarizing method, we study the popularity of deep learning models over time.

CNN and LSTM's dominant role

As per the comparison above, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the largest numbers of repositories created. Let us examine this using the data: in 2017, the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued to rise to a higher level, where it remains.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have two of the largest and most significant communities in the deep learning field. These networks are essential in both computer vision and NLP, where they hold an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer
As the title indicates, ResNet and Transformer usage has improved significantly in the recent two years. Differing from previous structures, both modify the original structure and significantly improve results in computer vision and translation tasks.

Rising Star: Bert

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends depicted in the graphs suggest that deep learning models are proliferating quickly, with innovative developments. There is certainly ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with how long the model has existed, but our data tell a different story.

Published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e., stars) and creation time; past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide and Deep model is similar: although Google provides full documentation and a tutorial for this model, we still take a pessimistic view of it, published in 2016. Moreover, the data above confirm that there is no significant rise in the use of this model.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (fork distribution histograms, one panel per model)


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name     Mean     STD      Min   25%   50%   75%    Max
Bert           498.65   2196.3   0     1     8     43     17940
CNN            106.84   611.97   2     3     8     32     13882
LSTM           48.82    214.22   0     1     2     13     2703
NCF            7.7      12.991   1     2     3     11.5   227
ResNet         46.88    221.43   0     0     1     8      2980
Transformer    186.79   1155.87  0     0     4     21     12408
Wide and Deep  16.23    36.80    0     0     1     8      146

Table 4.2: Stars Comparison

Model Name     Mean        STD         Min   25%   50%   75%    Max
Bert           128.214953  585.926617  0.0   0.0   1.0   16.5   4661.0
CNN            40.71       252.713617  0.0   1.0   4.0   14.0   6274.0
LSTM           17.793      71.956709   0.0   0.0   1.0   5.0    968.0
NCF            34.333333   58.603185   0.0   0.5   1.0   51.5   102.0
ResNet         17.442478   93.754994   0.0   0.0   0.0   3.0    1442.0
Transformer    53.518797   336.103826  0.0   0.0   1.0   6.0    3637.0
Wide and Deep  7.282051    16.364192   0.0   0.0   0.0   2.5    71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52), and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88), and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79), and ResNet (17.44).

Kruskal-Wallis Test: the Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The 7 models' distributions are the same.

• H1: The 7 models' distributions are different.

from scipy.stats import kruskal
stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(), dfLstm['star'].tolist(),
                  dfNcf['star'].tolist(), dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                  dfWideDeep['star'].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories in GitHub. This unexpected result can be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building one's own Transformer or BERT model requires a large amount of time and effort, yet developers still show interest in these novel deep learning models.

Figure 4.9: Star vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.

Figure 4.10: Star vs Development Time

Figure 4.11: Star vs Open Issues

Figure 4.12: Star vs Entropy Value

Number of Contributors: From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Among all the repositories, the top-3 models with the most stars per contributor are CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor), and Bert (155.0 stars/contributor).

Model          Percentage of One-Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time: From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value ≈ 0). This suggests that the longer a model has been developed, the more stars it has (i.e., the more popular it becomes). The top-2 repositories with the longest development duration belong to LSTM and CNN.

Open Issues: From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesize that the more popular a repository becomes, the more issues it will have. We investigate this correlation further in the following section.

Entropy: From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculate an entropy value per repository; Figure 4.13 shows the distribution of entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development work is not distributed evenly.

We can confirm this using Table 4.4: most deep-learning-related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether contributions are even.

Entropy
In particular, we compute the entropy H of each repository, defined as

p_i = c_i / Σ_i c_i    (4.1)

H = −Σ_i p_i log2(p_i)    (4.2)

where i indexes the contributors, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example, the contributions are summarized in Table 4.5 and the corresponding entropy can be calculated:

Total = 174 + 36 + 4 = 214    (4.3)

p_1 = 174/214,  p_2 = 36/214,  p_3 = 4/214    (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.7826    (4.5)

name           contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample Contributions to One Repository
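The entropy computation of Equations (4.1) and (4.2) can be sketched in a few lines (the function name is illustrative); for the contributions in Table 4.5 it evaluates to a value just below 0.8:

```python
import math

# Shannon entropy of a repository's contribution distribution,
# following Equations (4.1) and (4.2).

def contribution_entropy(contributions):
    """contributions: per-contributor contribution counts c_i."""
    total = sum(contributions)
    probs = [c / total for c in contributions]
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

Two contributors with equal contributions give the maximum two-person entropy of 1.0, while a single-contributor repository has entropy 0.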

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly. The lower the entropy, the higher the separation, which means the work is distributed more unevenly.

Figure 4.13 shows the distribution of entropy values for all models. From these figures, we can see that most repositories have an entropy value of around zero, which means that deep-learning-related repositories are either developed mainly by one developer or by a team with an uneven allocation of work.

Figure 4.13: Collaboration Entropy (entropy distribution histograms, one panel per model)

36 STAMPER in Action

4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original. We observe that Bert has a high proportion of unique forked repositories among the models.

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the change in repository size compared to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. At a glance, we can see that changes are rarely made at all.

Figure 4.15: Repository Uniqueness Distribution (%)

38 STAMPER in Actionna

me

bert

tens

orflo

w

0

200

400

600

800

Cou

nt o

f Rec

ords

cnn

tens

orflo

w

0

200

400

600

800

Cou

nt o

f Rec

ords

lstm

tens

orflo

w

0

200

400

600

800

Cou

nt o

f Rec

ords

ncf t

enso

rflow

0

200

400

600

800

Cou

nt o

f Rec

ords

resn

et te

nsor

flow

0

200

400

600

800

Cou

nt o

f Rec

ords

trans

form

er te

nsor

flow

0

200

400

600

800

Cou

nt o

f Rec

ords

wid

e de

ep te

nsor

flow

0

200

400

600

800

Cou

nt o

f Rec

ords

-2500-2400-2300-2200-2100-2000-1900-1800-1700-1600-1500-1400-1300-1200-1100-1000-900-800-700-600-500-400-300-200-10001002003004005006007008009001000110012001300140015001600170018001900200021002200230024002500means (binned)

bert tensorflowcnn tensorflowlstm tensorflowncf tensorflowresnet tensorflowtransformer tensorflowwide deep tensorflow

nameRepository Changed Histograms

Figure 416 Repository Change Statistic


Moreover, among the forks that did change, most differ from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them less studied. Second, a model may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey the problems of software maintenance in these deep-learning-related repositories. The overall purpose is to explore three sets of factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large number of time and work Older systemtended to have more problems in software maintenance In this report we calculatethe age of each repository from repository creation time as depicted in the equationbelow

age = T(updated_at) − T(created_at)    (4.6)
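As a concrete illustration, Equation (4.6) can be computed directly from the `created_at` and `updated_at` timestamps that the GitHub API returns for each repository; the sketch below is illustrative, not the exact STAMPER code:

```python
from datetime import datetime

def repo_age_days(created_at: str, updated_at: str) -> float:
    """Equation (4.6): age = T(updated_at) - T(created_at), in days.
    GitHub's API returns ISO-8601 timestamps such as '2018-11-01T00:00:00Z'."""
    fmt = "%Y-%m-%dT%H:%M:%S%z"
    created = datetime.strptime(created_at.replace("Z", "+0000"), fmt)
    updated = datetime.strptime(updated_at.replace("Z", "+0000"), fmt)
    return (updated - created).total_seconds() / 86400

print(repo_age_days("2018-11-01T00:00:00Z", "2019-10-23T12:00:00Z"))  # 356.5
```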

Figure 4.17 and Table 4.6 show how development time varies depending on the model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). Therefore, we hypothesize that many of the earlier models started being used by the open-source web community immediately after their first release.
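In practice the test can be run with `scipy.stats.kruskal`; the following NumPy sketch (on made-up, tie-free samples of development days, not our real data) shows what the H statistic computes:

```python
import numpy as np

def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic, assuming no tied values:
    rank the pooled sample, then H = 12/(N(N+1)) * sum_i(R_i^2 / n_i) - 3(N+1)."""
    pooled = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    n_total = len(pooled)
    ranks = np.empty(n_total)
    ranks[np.argsort(pooled)] = np.arange(1, n_total + 1)
    h, start = 0.0, 0
    for g in groups:
        rank_sum = ranks[start:start + len(g)].sum()
        h += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * h - 3 * (n_total + 1)

# Made-up development-day samples for two hypothetical models:
print(kruskal_wallis_h([110, 32, 229], [483, 270, 699]))  # ≈ 3.857
```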


Model        Max of days   Q3 of days   Median of days   Q1 of days   Min of days
Bert         779           229          110              32           0
Transformer  1254          321          142              11           0
Wide deep    1107          575          117              0.5          0
ResNet       1360          456.5        120              15           0
NCF          1120          476          216              8            0
LSTM         1812          621.25       315.5            47.25        0
CNN          1385          699.25       483              270.25       0

Table 4.6: Repository Development Time Statistics

[Figure omitted: boxplots of development time in days for each model (bert, cnn, lstm, ncf, resnet, transformer, and wide deep TensorFlow repositories).]

Figure 4.17: Development Time Boxplot


[Figure omitted: scatter plot of development duration (days) against the number of open issues, coloured by model.]

Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of open issues. As suggested visually by the figure and by a Spearman correlation test, there is a weak correlation between those two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, which may have more users and a higher maintenance cost, tend to have more issues related to them.
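The coefficient here is a Spearman rank correlation; `scipy.stats.spearmanr` computes it (with a p-value) directly, and for tie-free data it is simply the Pearson correlation of the ranks, as this small sketch on made-up numbers shows:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rho for tie-free samples: Pearson correlation of the ranks."""
    def to_ranks(a):
        a = np.asarray(a, dtype=float)
        r = np.empty(len(a))
        r[np.argsort(a)] = np.arange(1, len(a) + 1)
        return r
    return float(np.corrcoef(to_ranks(x), to_ranks(y))[0, 1])

# Made-up figures: any monotonically increasing relationship gives rho = 1.0.
dev_days = [10, 200, 450, 900, 1300]
open_issues = [0, 1, 3, 40, 500]
print(spearman_rho(dev_days, open_issues))  # 1.0
```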

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414), and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model        mean    Std      25%   50%   75%   min   max
Bert         8.299   50.55    0     0     1     0     504
CNN          3.414   35.456   0     0     1     0     1077
LSTM         1.292   4.915    0     0     1     0     69
ResNet       1.791   11.164   0     0     0     0     186
Transformer  1.857   8.608    0     0     1     0     95
Wide Deep    0.231   0.742    0     0     0     0     4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository   Percentage of repositories having Wiki (%)
Bert                       97.17
CNN                        98.498
LSTM                       98.799
NCF                        98.864
ResNet                     98.817
Transformer                96.97
Wide deep                  100

Table 4.8: Descriptive statistics on percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., most have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software engineering aspects (popularity, contribution, and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.


[Figure omitted: per-model histograms of binned open-issue counts (0-100), showing the long-tailed distribution of issues across bert, cnn, lstm, ncf, resnet, transformer, and wide deep TensorFlow repositories.]

Figure 4.19: Open Issues vs Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the wild: Limitation and Improvement

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas on identifying models using different strategies, and we developed heuristics for an in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.
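To make the idea concrete, here is a hypothetical miniature of such a heuristic: it flags a source file by matching model-specific API strings. The keyword lists below are illustrative; the real lists live in `model_keyword.py`.

```python
# Illustrative keyword lists (the project keeps its real ones in model_keyword.py).
MODEL_KEYWORDS = {
    "lstm": ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"],
    "cnn": ["tf.keras.layers.Conv2D", "tf.nn.conv2d"],
}

def matched_models(source_code: str) -> set:
    """Return the models whose construction APIs appear in a source file."""
    return {model for model, keywords in MODEL_KEYWORDS.items()
            if any(keyword in source_code for keyword in keywords)}

print(matched_models("cell = tf.nn.rnn_cell.LSTMCell(128)"))  # {'lstm'}
```

Such substring matching is exactly why the heuristic is imperfect: aliased imports or models built from raw ops will slip through.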

A sampling problem exists at the same time. The models we chose cannot represent all the new models in the wild; this is an open research question which needs further investigation in the future (for example, users may use the prototxt format to publish their models). In our project, we only focused on deep learning models constructed using Python. The findings may also reflect sampling problems: the present experiment uses a limited number of repositories on GitHub, which cannot exceed the 1000-repository boundary imposed on search results for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this strategy still cannot capture all the repositories in GitHub. It may be that other, more stratified samples would give a more precise outcome.

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to know how deep learning is involved in our real lives, an idea that was novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly: different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild; the program could then provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to develop their own heuristics for data selection: experts could easily change the API search in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or becoming a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project, we investigated and examined the popularity of deep learning models through the number of related repositories existing in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., k-means) to high-resolution time series data from commits.
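As a sketch of that direction, a minimal k-means (Lloyd's algorithm; in practice one would use `sklearn.cluster.KMeans`) can already separate made-up weekly commit-count profiles into "active" and "dormant" repositories:

```python
import numpy as np

def kmeans(points, k, iters=50):
    """Minimal Lloyd's k-means with a naive first-k initialisation."""
    pts = np.asarray(points, dtype=float)
    centroids = pts[:k].copy()
    labels = np.zeros(len(pts), dtype=int)
    for _ in range(iters):
        # Assign each point to its nearest centroid, then recompute centroids.
        dists = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = pts[labels == j].mean(axis=0)
    return centroids, labels

# Made-up weekly commit counts: three active repositories, three dormant ones.
weekly_commits = [[9, 8, 9, 10], [10, 9, 8, 9], [11, 10, 9, 8],
                  [0, 1, 0, 1], [1, 0, 1, 0], [0, 0, 1, 1]]
_, labels = kmeans(weekly_commits, k=2)
```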

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance, and contribution of GitHub deep learning related repositories and identified factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of this tool to help users gain a deeper insight into the current deep learning trend and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields and will serve the needs of people working at the intersection of social media analysis, data visualization, and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving and changing rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train, and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What are the differences between those model stores?

By scraping, analyzing, and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract

7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

PyCharm 2019.1.3 (Professional Edition)
Build PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
macOS 10.14.6

• Anaconda

– jupyter-notebook 6.0.0

Other

- Python 3.7.4
- pandas==0.22.0
- numpy==1.14.0
- statistics==1.0.3.5
- ratelimit==2.2.1
- requests
- altair
- matplotlib==2.2.2
- selenium
- Git

Datasets

asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net.json, U-Net pytorch.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Table of Contents: Before You Begin; Prerequisites; Install; Running; Test; High Level Description of all Modules & Datasets; Authors; License

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view/edit/execute our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keeps the Mac awake with this useful app (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- Git authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

    pip3 install --upgrade pip
    pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run `python3 model_searcher.py` to get keyword-related repositories' metadata from GitHub into the `output` folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions `call_api` and `call_api_dir`. Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample case: in `main()`, change `keywords` to the terms of interest; the resulting JSON file will be `output/bert.JSON`. The sorting method can be customized in the functions `get_total_pages` and `request_ith_page`: `sort` can be `updated` or `stars`, and `order` can be `asc` or `desc`.

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the forks' timestamps in `forked_timestamp`.

3. Data Selection (Optional)

Run `python3 repository_filter.py` to get your code-related repositories with statistics in the `filtered_repo` folder. Run `python3 filtered_repo.py` to filter your data.

Note: your keywords can be customized in `model_keyword.py`. We store all the previous experiment data in `tensorflow_model_filtering` and `pytorch_model_filtering`.

4. Data Visualization

- Popularity: run `python3 visualizations/popularity.py`; graphs are written to `visualizations/graphs/popularity`.
- Maintenance: run `python3 visualizations/maintenance.py`; graphs are written to `visualizations/graphs/maintenance`.
- Contribution: run `python3 visualizations/contribution.py`; graphs are written to `visualizations/graphs/contribution`.
- Multi Correlations: run `python3 visualizations/multi_variable.py`; graphs are written to `visualizations/graphs/multi_variable`.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in `test.py`. This module will record all the unreachable links and write them into the file `unreachable_urls.txt`. Usage: change the elements in `keywords` and run `python3 test.py`; all the unreachable links will be written to `unreachable_urls.txt`.

Customizing Your Own Search

In the module `Model.py`, define your own entity lists (e.g. `tensorflow_models`). In the constructor `Model`, we store all unfiltered_data, filtered_data, and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), you can construct a model by calling the constructor `Model`, e.g. `bert = Model("bert tensorflow", "desc_by_star")`, whose parameters are the model name and the repository metadata subfolder. Then you can use this object with its related data easily (`from Model import bert` and use `bert` as you go along).

Customize keywords: in the module `model_keyword.py`, import your instantiation (e.g. `lstm`) and call `add_keywords`, e.g.:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection (`output` folder):

- asc_by_star: cnn tensorflow.json, lstm tensorflow.json
- asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search:

- forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional):

- filtered_repo: bert.json
  - pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
  - tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs (`graphs` folder):

- contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
- maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
- multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
- popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu. Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only).

Bibliography

a. GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

b. GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19, and 20)

c. GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

d. GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)


Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
    • Deep learning models
    • Summarized Timeline
  • Public Code Repositories
    • Web-based hosting service
    • Measuring Popularity From GitHub
    • Extracting Messy Data in the Wild
    • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    • RQ2: How does popularity vary per model?
    • RQ3: Does the popularity of models relate to other features?
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commit Timestamps
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Software
      • Other
      • Datasets
  • Appendix 4: README
                                                                        • Appendix 4 README


2.1.2 Deep learning models

Selection of Framework and Models

Companies would like to take every advantage of deep learning models and to speed up project development in order to survive in keen competition. Winning trust from the public with high-quality service is thus required.

Initially, we would have liked to conduct our research on the latest model stores, such as AWS SageMaker, the Azure Machine Learning service, the Wolfram Neural Net Repository, and ONNX. However, the algorithms and datasets used there are not transparent, and this sense of opacity and impenetrability gives these businesses the upper hand with the public. To avoid those problems and gain a more in-depth insight into usage in society, we chose a framework with models of greater transparency and substantial usage: TensorFlow.

APIs Referenced

Since this project involves a range of deep learning models, we began by developing an understanding of how to build and train neural networks. The construction of networks may rely on the layers APIs provided by Keras and TensorFlow. In the meantime, TensorFlow recently introduced estimator APIs to simplify the procedure of training, evaluation, prediction, and export.

Convolutional Neural Network (CNN)

Convolutional Neural Network is one of the most established algorithms among all the deep learning models and one of the most dominant algorithms in computer vision. It can be thought of as a special kind of neural network with numerous identical neurons arranged in a grid pattern. Typically, a CNN consists of three types of layers: fully connected layers, convolution layers, and pooling layers. Convolution and pooling layers conduct the feature extraction, and the fully connected layer maps the extracted features into the final output. Layers are interconnected, and thus the extracted features are transferred layer by layer.
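The two feature-extraction operations can be sketched in a few lines of NumPy (an illustrative toy, not the TensorFlow implementation):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution (cross-correlation, as deep learning layers use)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling over size x size windows."""
    h, w = feature_map.shape
    return np.array([[feature_map[i:i + size, j:j + size].max()
                      for j in range(0, w - size + 1, size)]
                     for i in range(0, h - size + 1, size)])

feature_map = conv2d_valid(np.ones((4, 4)), np.ones((3, 3)))  # 2x2 map of 9.0
pooled = max_pool(feature_map)                                # 1x1 map of 9.0
```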

Long short-term memory (LSTM)

Different from traditional neural networks, which cannot memorise previous data, a special kind of recurrent neural network, Long Short-Term Memory (LSTM), provides researchers with an effective way to perform tasks on persistent data and is capable of learning long-term dependencies [Hochreiter and Schmidhuber 1997]. Specifically, LSTM combines the forget and input gates into an update gate. It is capable of learning dependencies from historical data and making predictions from information remembered previously. Inside the LSTM, instead of a linear layer, there is a small network which performs the function independently.

TensorFlow official models are chosen in our project. The TensorFlow official models contain a collection of deep learning models implemented with their high-level APIs (https://github.com/tensorflow/models/tree/master/official).

LSTM is one of the most common uses of recurrent neural networks This modelis generally used with sequential related data and can solve the language modellingproblem such as NLP concepts (word embedding encoder)
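The "small network" of gates inside an LSTM cell can be sketched as a single time step in NumPy. The gate stacking order [i, f, g, o] is our own convention for this illustration, not taken from the report.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step.

    W: (4H, D), U: (4H, H), b: (4H,); the rows stack the input (i),
    forget (f), candidate (g) and output (o) pre-activations.
    """
    H = h.shape[0]
    z = W @ x + U @ h + b
    i, f, g, o = z[:H], z[H:2 * H], z[2 * H:3 * H], z[3 * H:]
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update the cell state
    h_new = sigmoid(o) * np.tanh(c_new)               # expose the hidden state
    return h_new, c_new
```

The cell state c is what carries long-term dependencies forward; the gates decide what to forget, what to write, and what to expose at each step.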

Residual Network (ResNet)

One of the problems deep learning models face is that, as the number of layers increases past a certain point, accuracy stops improving. This problem motivated modern network architectures such as the residual network (ResNet) and Inception, which address it through residual connections.

ResNet solves the problem described above by fitting a residual mapping through added shortcut connections. Each ResNet block contains a series of layers and a shortcut connection component.
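The shortcut connection can be sketched as follows. This is a minimal dense version for illustration; real ResNet blocks use convolutions and batch normalisation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Compute relu(F(x) + x), where F(x) = W2 @ relu(W1 @ x) is the
    learned residual mapping and `+ x` is the shortcut connection."""
    return relu(W2 @ relu(W1 @ x) + x)
```

When the weights are near zero the block degenerates to (roughly) the identity, which is why very deep stacks of such blocks remain trainable.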

Bidirectional Encoder Representations from Transformers (Bert)

Bert is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al., 2018].

It was first released in google-research/bert on GitHub on the 1st of November 2018, with the ability to handle a wide range of natural language processing tasks, such as question answering and language inference, without substantial task-specific architecture modifications; it aims to predict the relationship between sentences by analysing the whole sentence holistically [Devlin et al., 2018].

Attention is all you need (Transformer)

Most problems in deep learning can be seen as a form of sequence-to-sequence mapping and can be solved using a common type of architecture: the Encoder-Decoder architecture.

Encoder and decoder both consist of stacks of identical layers, and each layer consists of two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network [Vaswani et al., 2017].
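The multi-head self-attention sub-layer is built from scaled dot-product attention, which can be sketched as follows (single head, no masking; a toy illustration of the formula in [Vaswani et al., 2017]):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # each row sums to 1
    return weights @ V
```

Multi-head attention runs several such attentions in parallel over learned projections of Q, K and V, then concatenates the results.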

Neural Collaborative Filtering (NCF)

Neural Collaborative Filtering (NCF) is a neural network architecture which utilises the non-linearity of neural networks to build recommendation systems [He et al., 2017]. It demonstrates that matrix factorisation can be interpreted as a special case of neural collaborative filtering. To add additional non-linearity, the model introduces a multi-layer perceptron module alongside the generalised matrix factorisation (GMF) layer.
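The GMF layer can be sketched as below; setting the output weights h to all-ones recovers plain matrix factorisation, which is the sense in which MF is a special case of NCF. The function name and signature are our own illustration.

```python
import numpy as np

def gmf_score(p_u, q_i, h):
    """Generalised matrix factorisation: h^T (p_u * q_i), where * is the
    element-wise product of the user and item embedding vectors."""
    return float(h @ (p_u * q_i))
```

With h = ones, gmf_score(p_u, q_i, h) equals the inner product p_u . q_i, i.e. classic matrix factorisation; NCF learns h (and an additional MLP branch) instead.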

Wide and Deep Learning

Since linear models are not good at generalising across unseen feature combinations, deep models are introduced to solve this problem. Deep models can use embedding vectors for every query and thereby generalise by coupling items and queries.

To overcome over-generalisation, the Google research team introduced Wide & Deep Learning: jointly trained wide linear models and deep neural networks that combine the benefits of memorisation and generalisation for recommender systems [Cheng et al., 2016].
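The joint prediction can be sketched as a single logit combining the wide (linear) part and the deep part, with the deep network abstracted as a function. This is a schematic of the idea, not the reference implementation.

```python
import numpy as np

def wide_deep_logit(x_wide, w_wide, x_deep, deep_net, b=0.0):
    """P(y=1|x) = sigmoid(w_wide^T x_wide + deep_net(x_deep) + b):
    the wide part memorises feature crosses, the deep part generalises."""
    z = w_wide @ x_wide + deep_net(x_deep) + b
    return 1.0 / (1.0 + np.exp(-z))
```

Both parts are trained jointly, so their gradients flow into a single loss rather than being ensembled after the fact.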

2.1.3 Summarized Timeline

Model Name      Definition Raised Time
CNN             1980s
LSTM            1997
ResNet          2015
Wide & Deep     2016
NCF             2017
Transformer     2017
Bert            2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest hosts for open-source distributed version control (DVCS) [Gousios et al., 2014]. A distributed version control system enables contributors to submit a set of changes and integrate them into the main development branch. The use of git is based on pragmatic needs, combining the advantages of version control and collaborative development.

GitHub can also offer insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project; thus the number of stars can reveal popularity from a software development research perspective, and may give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al. [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth, using time-series metadata derived from 2,279 accessible GitHub repositories. They found that slow growth is more common for overpopulated application domains and for old repositories. Moreover, they conclude that the most common domains on GitHub include web libraries and frameworks and non-web libraries and frameworks.

Building on this work, we examine whether deep learning framework repositories follow a similar popularity trend pattern. At the same time, we study whether a relationship exists between three factors: forks, stars and watchers.

Predicting the Popularity of GitHub Repositories

In the same year, Borges et al. [2016a] published another paper about predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regression to predict the number of stars of GitHub repositories; as a result, project owners can see how their projects are performing in the open-source community.

Their studies examine the correlation between a repository's popularity and the number of stars it owns. The study also reports a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent

GHTorrent [Gousios and Spinellis, 2012] is designed to conduct independent repository mining through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all repositories, and the resulting dataset can be used to answer multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving raw data based on the results returned from the REST API. However, their tool cannot visualise the metadata or offer high-level trend analysis.

MetricMiner

A similar tool is MetricMiner [Sokol et al., 2013], a web application that supports researchers in mining software repositories, extracting data and drawing statistical inferences from the data collected. The tool automatically clones the repository, processes the metadata and stores the data in the cloud, giving it good scalability and fast query answering without users installing any software on their localhost.

GitcProc

GitcProc [Casalnuovo et al., 2017] is a tool that uses regular expressions to extract the changed lines within a repository, to facilitate answering project evolution questions. GitcProc can retrieve and summarise global project statistics, including commits, commit dates and contributors. It can measure how many changes have taken place in Java projects and can also locate the changed files.

RepoVis

RepoVis [Feiner and Andrews, 2018] is a newer tool which provides visual overviews for software maintained in Git repositories. RepoVis is a client-server web application providing a full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its search functionality to GitHub together with a code-based search. All visualisations are written out in SVG format.

2.2.4 Visualizing data in Repositories

Chronos

CHRONOS [Servant and Jones, 2013] is a software tool that enables visualisation of historical change inside software source code. The tool implements a zoomable user interface as a visualisation of the actual code, supporting developers from a high-level view (pattern recognition) to a low-level view (reverse engineering) using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise the complete history of change, including the revisions that modified them. Inspired by this tool, our project uses visualisation to track the historical change of the popularity trend related to the keyword specified by users in GitHub.

[Figure 2.1: git2net [Gote et al., 2019]]

GEVOL

Collberg et al. [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph-drawing technique, to deduce a better understanding of a program from its development history; all visualisations are displayed with a temporal graph visualizer.

The system aids discovery of the structure of a system and gives the user a new way to observe the evolution of a program by visualising changes to the system. It extracts information about Java programs stored within a CVS version control system into three types of graphs: inheritance, control-flow and call graphs.

git2net

git2net [Gote et al., 2019] is a tool that facilitates the extraction of co-editing networks in git repositories. Similar to GEVOL, it uses text mining techniques to analyse the history of modifications within files. However, it stresses the importance of studying the social network in GitHub and gives the reader a broader view of the application of graph-based data analysis and modelling. The tool shows its strength in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter we detailed the web-based hosting service we selected to study (GitHub) and presented the concept of deep learning through two popular frameworks, together with state-of-the-art neural network models. In the next chapter we elaborate on how we designed and implemented STAMPER.

Chapter 3

STAMPER Design andImplementation

In this chapter we outline our design and implementation for data extraction, and then detail the metrics we use to estimate the trend of deep learning framework and model usage on GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

[Figure 3.1: Overview of STAMPER. (1) Data collection of model-name keywords via the Git project search API; (2) repository search; (3) optional data selection via the Git code search API; followed by local data visualisation.]



Data Collection
We first collect all repository metrics through the GitHub API. This step extracts the history of all repositories related to the keyword and records the metadata for each repository (repository-based search).

Repository Search
As each repository contains multiple contributors and forks, we subsequently extract and analyse whether changes have been made, based on the size information, and calculate a collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the forked information to create visual representations.

Data Selection
We implemented a selector that allows excluding specific repositories unrelated to the desired ones. The selector summarises the frequency counts for the keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis
Since each forked repository may involve re-development and modification, modification of forked repositories is included in our project. For all changed repositories, the size difference is analysed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to analyse the data in depth, manipulate it, and even run statistical tests on the dataset. To better understand the metrics, we divide them into multiple categories. For attributes that are not primary data from the GitHub API, we explain them in the data expansion part and label them as [Data Expansion] in Table 3.2.

To maximise the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication the user can make up to 5,000 requests per hour; otherwise the rate limit allows only up to 60 requests per hour [Git d].
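The authentication step can be sketched with the standard library. The endpoint and header format follow GitHub's documented REST conventions; the token value itself is a placeholder.

```python
import json
import urllib.request

API = "https://api.github.com"

def auth_headers(token=None):
    """Authenticated requests get 5000 requests/hour; anonymous ones get 60."""
    return {"Authorization": "token " + token} if token else {}

def rate_limit(token=None):
    """Query GitHub's /rate_limit endpoint for the core limit and remainder."""
    req = urllib.request.Request(API + "/rate_limit", headers=auth_headers(token))
    with urllib.request.urlopen(req) as resp:
        core = json.load(resp)["resources"]["core"]
    return core["limit"], core["remaining"]
```

Checking `rate_limit()` before a long crawl lets the tool pause instead of failing mid-extraction when the quota is exhausted.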


Type          Meta-data
Contributor   contribution: int [Data Expansion]
              login (user name): String
              type (user/organization): String
              contributors_url
Repository    created_at
              description
              full_name
              language
              size
Popularity    fork: Boolean
              forks: int
              forks_url
              stargazers_count
              watchers_count
              unique_repos [Data Expansion]
Owner         id
              login (username)
              type
Maintenance   has_issues: Boolean
              has_wiki: Boolean
              open_issues: int
              pushed_at
              updated_at
              score

Table 3.2: Repository metadata collected from the GitHub API

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deep crawling operations to collect raw data.

• Contribution
One repository generally has multiple developers conducting software development, and the project owner is not necessarily the person who contributes the most code; the amounts contributed by different developers are potentially unequal. As a result, we further track that information via the GitHub API and record the number of contributions each developer made to each repository.

• Unique_repos
Popular repositories may have numerous related forked repositories owned by other users, and the reasons behind forking vary. Our research explores whether users conduct subsequent development on the original codebase. By comparing the size of each forked repository (Fi) with the original repository (O), we obtain all forked repositories with a change of size (c):

    Fi + c = O    (3.1)
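Equation (3.1) amounts to the following check, with sizes as reported by the GitHub API's size field. These helper names are our own for illustration.

```python
def size_change(original_size, fork_size):
    """c in equation (3.1): F_i + c = O."""
    return original_size - fork_size

def changed_forks(original_size, fork_sizes):
    """Forks whose size differs from the original, i.e. forks with c != 0."""
    return [s for s in fork_sizes if size_change(original_size, s) != 0]
```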

3.4 Data Selection

[Figure 3.2: Data Selection. An entity (model) and API keywords are searched within each repository to produce usage statistics.]

[Figure 3.3: Store in Local Disk. Unfiltered data and forked-repository timestamps are filtered by model-related keywords (Bert, ResNet, CNN, ...) and grouped per model.]


Figure 3.2 represents our method of searching for API usage in DL-model-related repositories. GitHub provides a REST API to examine whether a repository contains the targeted API.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. The appearance frequency of the user-specified API is embedded directly within the returned result and can be identified in our program by each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API and allows users to build a high-level picture of API usage across GitHub repositories.
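The appearance-frequency count can be sketched as a simple substring scan. The keyword list here is shortened for illustration; in STAMPER the real list lives in model_keyword.py.

```python
import re

MODEL_KEYWORDS = [
    "keras.applications.resnet.ResNet50",                 # pre-defined model
    "from keras.applications.resnet50 import ResNet50",   # import-style keyword
]

def count_api_usage(source, keywords=MODEL_KEYWORDS):
    """Map each keyword to its number of appearances in a source string."""
    return {kw: len(re.findall(re.escape(kw), source)) for kw in keywords}
```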

Meanwhile, we also provide multiple options for selecting APIs, stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras applications library¹ provides users the ability to load deep learning models and instantiate a model with default weights. Take ResNet as an example:

• With pre-defined models
Users may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that a repository owner may use a pre-trained model from the Keras library, and this model can be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they can all be used as sample keywords in STAMPER, such as:

1. keras.applications.resnet.ResNet50
2. keras.applications.resnet.ResNet101
3. keras.applications.resnet.ResNet152
4. keras.applications.resnet_v2.ResNet50V2
5. keras.applications.resnet_v2.ResNet101V2
6. keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository². In the sample class given, ResNet consists of two main blocks. If users would like to explore the components inside the ResNet class further, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Because of the flexibility of model construction, data selection heuristics vary: deep learning users and experts can define their searches according to their interests and preferences.

¹https://github.com/keras-team/keras-applications
²https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, (iii) maintenance analysis.

The process of generating the visualisations from these three perspectives is illustrated in Figure 3.4. Meanwhile, Chapter 5 gives an example of our collected repository metadata for deep learning models.

[Figure 3.4: Overall Construct the Visualizations. Entities 1..n are mapped through functional mappings to popularity-related, contribution-related and maintenance-related visualisations.]

Popularity

• Total number of repositories, with forks (line)

• Total number of repositories, without forks (line)

• Number of creations over time, grouped by week (with forks)

• Repository creation time vs. stars

Contribution

To additionally exploit the forking information, STAMPER supports comparison between the original repository and its forked repositories. The work could be further extended by visiting each forked repository's URL and tracing the commits.

As shown in Figure 3.5, the entity (E) we search in GitHub may have multiple related repositories (Ri) with their corresponding forked repositories (Fi); among the forked repositories, we call a changed fork Ci.

To examine whether changes exist in forked repositories, and the difference between multiple entities, we calculate the difference using the equation below.


Keyword                     Total of Repository (including Forks) Collected    Total of Original Repository Collected
ResNet tensorflow           6129                                               339
Bert tensorflow             13734                                              106
CNN tensorflow              39765                                              1000
LSTM tensorflow             19572                                              1000
Transformer tensorflow      7188                                               145
Wide and deep tensorflow    324                                                39

Table 3.1: Repositories Related to Tensorflow


Uniqueness percentage distributions are composed of all the percentages pi, each corresponding to its original repository Ri:

    pi = (∑ Ci) / (∑ Fi)    (3.2)
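Per entity, equation (3.2) reduces to counting changed forks against all forks (the helper below is our own illustrative naming):

```python
def uniqueness_percentage(num_changed, num_forks):
    """p_i = (sum of C_i) / (sum of F_i), equation (3.2): the share of forks
    whose size differs from their original repository."""
    return num_changed / num_forks if num_forks else 0.0
```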

[Figure 3.5: Examine Uniqueness after Forking. An entity (E) has repositories 1..n; each forked repository 1..n is marked Changed: Y/N.]

• Percentage of Forked Repositories Unique from Origin (boxplots)

• Uniqueness percentage distribution for each entity (histograms)

• Entropy distribution for each entity (histograms)

Maintenance

• Development time boxplot for each entity

• Open issues distribution for each entity

3.6 Summary

In this chapter we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories together with their forked repositories related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. Meanwhile, we introduced and analysed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field: deep learning models are continually evolving and are built, trained and deployed by researchers. Our tool is available for analysing such changes. We collected the historical information stored in GitHub and extracted the repository metadata using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development has become a highly competitive battlefield. Researchers, companies and developers are all trying to establish a dominant voice in deep learning. There exists a variety of models to work with, but no common bridge connects those ideas. Historical data in GitHub are obscure and hard to find, and as a result many developers, especially experienced ones, remain in their original comfort zone. With this study we hope to shed some light on deep learning and highlight a few suggestions for the public.

This section aims to answer questions about both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, owing to the few studies of popularity in the GitHub ecosystem, there is no standardised feature for measuring popularity. We analyse some potential features of each repository and make the hypothesis that popularity is strongly related to the number of stars a repository owns.

This decision is justified in the following section with more background on GitHub.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching; watching does not make someone a collaborator [Git b]. A watcher can watch a repository to receive notifications for new pull requests or issues that are created. Watchers indicate how much interest the GitHub community gives to a repository.

[Figure 4.1: Repository Watching [Git b]]

• Stars
Starring a repository makes it easy for a user to keep track of a repository they are interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). The star count is another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

[Figure 4.2: Star Sort Menu [Git a]]

• Forks
Forks are created when a user makes their own copy of a repository. A user can fork a repository to suggest changes or to use it as the basis for a new project.

Based on the data gathered in the data collection stage, we extract the watchers_count, stargazers_count and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarise 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.


[Figure 4.3: Popularity Metric]

star     forks_count   watchers_count   model name
17940    4661          17940            Bert
12405    3637          12405            Bert
5263     1056          5263             Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we should not assume normality and instead use a rank-based correlation measure.


Spearman Correlation Coefficient

Definition
The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman, 2011]. This statistical method allows us to test for a rank-order relationship between two numerical or ordinal variables associated by a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

    from scipy.stats import spearmanr
    coef1, p1 = spearmanr(star, forks)
    print(coef1, p1)
    >> 0.8752903811064278 0.0

• Star vs Watcher

    coef2, p2 = spearmanr(star, watchers)
    print(coef2, p2)
    >> 1.0 0.0

• Fork vs Watcher

    coef3, p3 = spearmanr(forks, watchers)
    print(coef3, p3)
    >> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three pairs of variables in the testing dataset.

Set α = 0.05. The p-values p1, p2 and p3 are all less than α, and the calculation above also shows a strong positive correlation, with coef1 = 0.875, coef2 = 1.0 and coef3 = 0.875 respectively.

This means it is very unlikely (at 95% confidence) that the testing data are uncorrelated, and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of this report, we treat the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, look like two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Apart from these models with longer histories, BERT and ResNet are two rising stars in the model competition; they arrive with significant improvements in architecture design and performance, as already described in the background section.

The model development community has recently seen the release of multiple powerful frameworks that are treated as baselines for building models. However, for many new models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.


[Figure 4.4: Number of Repositories Created With Forks (Accumulated), per model, 2015-2019.]

[Figure 4.5: Number of Repositories Created (Accumulated), per model, 2015-2019.]


[Figure 4.6: Repository Trend in GitHub For Each Model, October 2015 to October 2019.]


[Figure 4.7: Creation Time vs Stars, per model.]

A fork is another copy of a repository; the forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5 we can, surprisingly, see a considerable difference between the total number of repositories created including forks and the total number excluding forks. Most repositories related to deep learning models are not original, which indicates that a considerable number of developers remain at a studying stage.

We also use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, different from the previous summarising method, we study the popularity of deep learning models over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the highest number of repositories created. The data bear this out: in 2017 the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued to rise to a higher level, where it remains.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have among the most important and significant communities in the deep learning field. These networks are essential in both computer vision and NLP, where they constitute an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly in the last two years. Differing from earlier structures, both modify the original architecture and significantly improve results in computer vision and translation tasks.

Rising Star: Bert

However, no model comes with perfection; LSTM itself can be extended to many variants, and BERT is one of those.

The current trends depicted in the graphs suggest that deep learning models are proliferating fast through innovative developments; there is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with when the model came into existence, but our data tell a different story.

Published in 2017, NCF draws the least attention in the GitHub community. The data also show that there is no relationship between popularity (i.e. stars) and creation time; past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

Similarly for the Wide and Deep model: although Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the previous data confirm that there has been no significant rise in the use of this model.

[Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development: fork distribution histograms for each model.]


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model, and as presented in Table 4.2 and Table 4.3 we can see the following.

Model          Mean     STD      Min   25%   50%   75%    Max
Bert           498.65   2196.3   0     1     8     43     17940
CNN            106.84   611.97   2     3     8     32     13882
LSTM           48.82    214.22   0     1     2     13     2703
NCF            77       129.91   1     2     3     11.5   227
ResNet         46.88    221.43   0     0     1     8      2980
Transformer    186.79   1155.87  0     0     4     21     12408
Wide and Deep  16.23    36.80    0     0     1     8      146

Table 4.2: Stars Comparison

Model          Mean    STD     Min   25%   50%   75%    Max
Bert           128.21  585.93  0.0   0.0   1.0   16.5   4661.0
CNN            40.71   252.71  0.0   1.0   4.0   14.0   6274.0
LSTM           17.79   71.96   0.0   0.0   1.0   5.0    968.0
NCF            34.33   58.60   0.0   0.5   1.0   51.5   102.0
ResNet         17.44   93.75   0.0   0.0   0.0   3.0    1442.0
Transformer    53.52   336.10  0.0   0.0   1.0   6.0    3637.0
Wide and Deep  7.28    16.36   0.0   0.0   0.0   2.5    71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52), and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88), and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79), and ResNet (17.44).

Kruskal-Wallis Test. The Kruskal-Wallis test is used in this project to compare more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen:

• H0: the 7 models' distributions are the same

• H1: the 7 models' distributions are different

from scipy.stats import kruskal

stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                  dfNcf["star"].tolist(), dfResnet["star"].tolist(),
                  dfTransformer["star"].tolist(), dfWideDeep["star"].tolist())
print(stat, p)
# >> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' star distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the previous popularity exploration based on the number of repositories existing on GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or Bert model requires a large amount of time and effort, yet developers still show their interest in those novel deep learning models.

[Figure 4.9 shows a scatter plot of stargazers_count against number_of_contributors, colored by model.]

Figure 4.9: Star vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.

[Figure 4.10 shows a scatter plot of stargazers_count against develop_duration, colored by model.]

Figure 4.10: Star vs Development Time

[Figure 4.11 shows a scatter plot of stargazers_count against open_issues, colored by model.]

Figure 4.11: Star vs Open Issues

[Figure 4.12 shows a scatter plot of stargazers_count against entropy, colored by model.]

Figure 4.12: Star vs Entropy Value

Number of Contributors. From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, p-value ≤ 0.01). Among all the repositories, the top-3 models with the most stars per contributor are CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor), and Bert (155.0 stars/contributor).
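The correlation claims in this section come from SciPy's Spearman test. A minimal sketch, using made-up star and contributor counts rather than our collected dataset:

```python
from scipy.stats import spearmanr

# Hypothetical per-repository metadata (illustrative only, not the collected data).
stars        = [12, 5, 300, 48, 7, 950, 23, 2]
contributors = [1,  1, 4,   2,  1, 9,   2,  1]

rho, p = spearmanr(stars, contributors)
print(rho, p)  # rho near +1 for this toy data: more contributors, more stars
```

`spearmanr` ranks both variables before correlating, so it is robust to the heavy-tailed star counts seen in GitHub data.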

Model          One-Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time. From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value ≈ 0). This suggests that the longer a model has been developed, the more stars it will have (i.e., the model becomes more popular). The top-2 models with the longest development durations are LSTM and CNN.

Open Issues. From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186, p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesize that the more popular a repository becomes, the more issues it will have. We further investigate this correlation in the following section.

Entropy. From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project we also investigate the impact of a new feature, the entropy value, on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution within a repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development work is not distributed evenly.

We can also examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether their contributions are even or not.

Entropy. In particular, we compute the entropy H of each repository, defined as:

p_i = c_i / Σ_i c_i    (4.1)

H = −Σ_i p_i log2(p_i)    (4.2)

where i denotes the i-th contributor, c_i the i-th contributor's contribution, and Σ_i c_i the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example:

The contribution table is summarized in Table 4.5, and its corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214    (4.3)

p_1 = 174/214,  p_2 = 36/214,  p_3 = 4/214    (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.7826    (4.5)

name           contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample Contributions to One Repository

The resulting distribution of entropy across all repositories can be used to determine whether a repository is developed unevenly: the lower the entropy, the higher the phase separation, which corresponds to more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy value for all models. From these figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.

[Figure 4.13 shows per-model histograms of entropy (binned, 0.00 to 3.00) against count of records.]

Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models.
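As a rough illustration of this uniqueness metric, the following helper (our own, with invented repository sizes) counts the forks whose recorded size differs from the parent's:

```python
def unique_fork_percent(parent_size, fork_sizes):
    """Percentage of forks whose recorded size differs from the parent repository."""
    if not fork_sizes:
        return 0.0
    changed = sum(1 for size in fork_sizes if size != parent_size)
    return 100.0 * changed / len(fork_sizes)

# Hypothetical parent repo of 120 KB; three of five forks untouched after forking.
print(unique_fork_percent(120, [120, 120, 125, 98, 120]))  # → 40.0
```

The actual analysis compares full repository metadata, but size deltas already separate untouched forks from modified ones.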

[Figure 4.14 shows per-model boxplots of unique_percent (0 to 100) for bert, cnn, lstm, ncf, resnet, transformer, and wide deep (TensorFlow).]

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was to provide a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the forked repositories are not changed after forking. At a glance we can see not only that changes are rarely made after forking, but also that the changes that are made tend to be small.

[Figure 4.15 shows per-model histograms of percentage (binned, 0.00 to 1.00) against count of records.]

Figure 4.15: Repository Uniqueness Distribution (%)

[Figure 4.16 shows per-model histograms of mean repository size change in bytes (binned, -2500 to 2500) against count of records.]

Figure 4.16: Repository Change Statistic

sect43 Maintenance of Deep Learning Models in GitHub 39

Most changed forks differ from the original repository by only 0 to 100 bytes of repository size, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long; the lack of tutorials and attention makes them less used. Second, a model may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey the problems of software maintenance in these deep learning related repositories. The overall purpose is to explore three sets of factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report, we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
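Using GitHub's ISO-8601 timestamp fields, Equation (4.6) can be computed as below; the field names follow the GitHub API, and the sample dates are invented:

```python
from datetime import datetime

def repo_age_days(created_at, updated_at):
    """Repository age in days: T(updated_at) - T(created_at)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # GitHub API timestamp format, e.g. 2019-02-19T00:00:00Z
    return (datetime.strptime(updated_at, fmt) - datetime.strptime(created_at, fmt)).days

print(repo_age_days("2018-11-01T00:00:00Z", "2019-02-19T00:00:00Z"))  # → 110
```

The same subtraction works for any pair of `created_at`/`updated_at` values returned in repository metadata.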

Figure 4.17 and Table 4.6 show how development time varies by model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide and Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal-Wallis test, the distributions of development days differ across models (p-value ≤ 0.05). Therefore, we hypothesize that for many of these models, users started using the open-source web community immediately after the first release.


Model          Max of days  Q3 of days  Median of days  Q1 of days  Min of days
Bert           779          229         110             32          0
Transformer    1254         321         142             11          0
Wide and Deep  1107         575         117             0.5         0
ResNet         1360         456.5       120             1.5         0
NCF            1120         476         216             8           0
LSTM           1812         621.25      315.5           47.25       0
CNN            1385         699.25      483             270.25      0

Table 4.6: Repository Development Time Statistics

[Figure 4.17 shows per-model boxplots of development days (0 to 2000) for bert, cnn, lstm, ncf, resnet, transformer, and wide deep (TensorFlow).]

Figure 4.17: Development Time Boxplot

[Figure 4.18 shows a scatter plot of open_issues against develop_duration, colored by model.]

Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and the Spearman correlation test, there is a weak-to-moderate positive correlation between the two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, which carry a higher maintenance cost, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414), and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model          Mean   Std     25%  50%  75%  Min  Max
Bert           8.299  50.55   0    0    1    0    504
CNN            3.414  35.456  0    0    1    0    1077
LSTM           1.292  4.915   0    0    1    0    69
ResNet         1.791  11.164  0    0    0    0    186
Transformer    1.857  8.608   0    0    1    0    95
Wide and Deep  0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository  Percentage of repositories having a Wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
ResNet                    98.817
Transformer               96.97
Wide and Deep             100

Table 4.8: Descriptive statistics on the percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.
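The long-tail shape is easy to see in summary statistics: under such a distribution the mean far exceeds the median. A toy sketch with invented open-issue counts, echoing the pattern in Table 4.7:

```python
import statistics

# Invented open-issue counts: mostly zeros plus one heavily-used repository.
issues = [0, 0, 0, 1, 0, 2, 0, 0, 1, 504]

print(statistics.mean(issues), statistics.median(issues))  # mean far above the median
```

This mean/median gap is why Table 4.7 reports quartiles of 0 or 1 alongside means of several issues per repository.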

4.4 Summary

In this section, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common aspects of software engineering (popularity, contribution, and maintenance) in deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original codebase after forking.

[Figure 4.19 shows per-model histograms of open_issues (binned, 0 to 100) against count of records.]

Figure 4.19: Open Issues vs Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for an in-depth analysis of model construction and high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation. For example, users may use the prototxt format to publish their models, while in this project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limitations: the present experiment uses a limited number of repositories on GitHub, since search results cannot exceed the 1000-repository boundary. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples might produce a more precise outcome.

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to learn how deep learning is involved in our real life, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly, with different models released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to implement their own heuristics for data selection: experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this project by migrating to open-source software or a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends on GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project, we investigated and examined the popularity of deep learning models via the number of repositories that exist on GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., k-means) to high-resolution time series data from commits.

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance, and contribution of deep learning related repositories on GitHub and identified factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to help users gain a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields and will serve the needs of people working at the intersection of social media analysis, data visualization, and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models and datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores lets developers learn, train, and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing, and visualizing the use of various models and datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

PyCharm 2019.1.3 (Professional Edition)
Build PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
macOS 10.14.6

• Anaconda

– jupyter-notebook 6.0.0

Other

- Python 3.7.4
  -- pandas==0.22.0 -- numpy==1.14.0
  -- statistics==1.0.3.5 -- ratelimit==2.2.1
  -- requests -- altair -- matplotlib==2.2.2
  -- selenium
- Git

Datasets

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents: Before You Begin, Prerequisites, Install, Running, Test, High Level Description of all Modules & Datasets, Authors, License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing, editing, or executing our code. Amphetamine (Mac App Store): keep the Mac awake with this useful app (otherwise the internet connection will drop during long runs).

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:
- Git (https://git-scm.com/downloads) and a GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project, then run `python3 model_searcher.py` to collect keyword-related repository metadata from GitHub into the `output` folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions `call_api` and `call_api_dir`.

Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample case: in `main()`, change `keywords` to the terms of interest. The resulting JSON file will be `output/bert.JSON`.

The sorting method can be customized in the functions `get_total_pages` and `request_ith_page`: `sort` can be `updated` or `stars`, and `order` can be `asc` or `desc`.
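Under the hood, this step pages GitHub's repository-search endpoint. A hedged sketch of the kind of URL the collection script builds (the helper name and parameter defaults are our assumptions):

```python
from urllib.parse import urlencode

def search_url(keyword, sort="stars", order="desc", page=1):
    """One page of GitHub's repository-search endpoint for a model keyword."""
    base = "https://api.github.com/search/repositories"
    query = {"q": keyword, "sort": sort, "order": order, "page": page, "per_page": 100}
    return base + "?" + urlencode(query)

print(search_url("bert tensorflow"))
```

Varying `sort` and `order` is what lets the tool work around the 1000-result cap on any single search query.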

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the fork timestamps into `forked_timestamp`.

3. Data Selection (Optional)

(If not done already: `pip3 install --upgrade pip` and `pip3 install -r requirements.txt`.)

Run `python3 repository_filter.py` to get your code-related repositories with statistics in the `filtered_repo` folder, then run `python3 filtered_repo.py` to filter your data.

Note: your keywords can be customized in `model_keyword.py`. We store all the previous experiment data in `tensorflow_model_filtering` and `pytorch_model_filtering`.

4. Data Visualization

Popularity: run `python3 visualizations/popularity.py` and get your graphs in `visualizations/graphs/popularity`.

Maintenance: run `python3 visualizations/maintenance.py` and get your graphs in `visualizations/graphs/maintenance`.

Contribution: run `python3 visualizations/contribution.py` and get your graphs in `visualizations/graphs/contribution`.

Multi Correlations: run `python3 visualizations/multi_variable.py` and get your graphs in `visualizations/graphs/multi_variable`.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in `test.py`. This module records all the unreachable links and writes them to the file `unreachable_urls.txt`.

Usage: change the elements in `keywords`, then run `python3 test.py`. All the unreachable links will be written to `unreachable_urls.txt`.

Customizing Your Own Search

In the module `Model.py`, define your own entity lists (e.g. `tensorflow_models`). In the constructor `Model`, we store all unfiltered_data, filtered_data, and forked_time_location in three folders.

Instantiation: once you have data from the previous steps (1-2), you can construct a model by calling the constructor `Model`, e.g. `bert = Model("bert tensorflow", "desc_by_star")`, with the model name and repository-metadata subfolder as parameters. Then you can use this object with its related data easily (`from Model import bert` and use `bert` as you go along).

Customize keywords: in the module `model_keyword.py`, import your instantiation (e.g. `lstm`) and call `add_keywords` with a list of API strings.

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Keyword customization example:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

1. After Data Collection

output
├── asc_by_star
│   ├── cnn tensorflow.json
│   └── lstm tensorflow.json
├── asc_general
│   ├── bert.json
│   ├── cnn.json
│   ├── lstm.json
│   ├── ncf.json
│   ├── resnet.json
│   ├── transformer.json
│   └── wide deep.json
├── by_update_time
│   ├── bert tensorflow.json
│   ├── cnn tensorflow.json
│   ├── lstm tensorflow.json
│   ├── ncf tensorflow.json
│   ├── resnet tensorflow.json
│   ├── transformer tensorflow.json
│   └── wide deep tensorflow.json
├── desc_by_star
│   ├── bert tensorflow.json
│   ├── cnn tensorflow.json
│   ├── lstm tensorflow.json
│   ├── ncf tensorflow.json
│   ├── resnet tensorflow.json
│   ├── transformer tensorflow.json
│   └── wide deep tensorflow.json
├── desc_general
│   ├── bert.json
│   ├── cnn.json
│   ├── lstm.json
│   ├── ncf.json
│   ├── resnet.json
│   ├── transformer.json
│   └── wide deep.json
└── pytorch_models
    ├── AlexNet.json
    ├── DCGAN.json
    ├── Densenet.json
    ├── FCN-ResNet101.json
    ├── GoogleNet.json
    ├── HarDNet.json
    ├── Inception_v3.json
    ├── MobileNet v2.json
    ├── PGAN.json
    ├── ResNet.json
    ├── ResNet101.json
    ├── ResNext WSL.json
    ├── ResNext.json
    ├── RoBERTa.json
    ├── SSD.json
    ├── ShuffleNet v2.json
    ├── SqueezeNet.json
    ├── Tacotron 2.json
    ├── Transformer.json
    ├── U-Net pytorch.json
    ├── U-Net.json
    ├── WaveGlow.json
    ├── Wide ResNet.json
    ├── fairseq.json
    └── vgg_nets.json

2. After Repository Search

forked_timestamp
├── bert tensorflow.csv
├── cnn tensorflow.csv
├── lstm tensorflow.csv
├── ncf tensorflow.csv
├── resnet tensorflow.csv
├── transformer tensorflow.csv
└── wide deep tensorflow.csv
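These per-model CSVs can be read back for timeline analysis with the standard library. A small sketch follows; note the column names (fork_full_name, forked_at) are an assumption, since the README does not show the file layout:

```python
import csv
import io

# A couple of rows in the assumed layout of a forked_timestamp CSV
# (hypothetical fork names and dates, for illustration only):
sample = (
    "fork_full_name,forked_at\n"
    "alice/bert,2019-01-02\n"
    "bob/bert,2019-03-04\n"
)

def read_fork_timestamps(fp):
    # Parse a forked_timestamp CSV into (fork name, timestamp) pairs.
    return [(row["fork_full_name"], row["forked_at"]) for row in csv.DictReader(fp)]

rows = read_fork_timestamps(io.StringIO(sample))
```

In practice the file object would come from `open('forked_timestamp/bert tensorflow.csv')` rather than an in-memory string.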


Generated Graphs

3. After Data Selection (Optional)

filtered_repo
├── bert.json
├── pytorch_model_filtering
│   ├── Densenet.json
│   ├── FCN-ResNet101.json
│   ├── GoogleNet.json
│   ├── MobileNet v2.json
│   ├── ResNet101.json
│   ├── ResNext.json
│   ├── ShuffleNet v2.json
│   ├── SqueezeNet.json
│   ├── Tacotron 2.json
│   ├── Wide ResNet.json
│   └── vgg_nets.json
└── tensorflow_model_filtering
    ├── bert.json
    ├── lstm.json
    ├── ncf.json
    ├── resnet.json
    ├── transformer.json
    └── wide deep.json


graphs
├── contribution
│   ├── change_to_pdf.bash
│   ├── entropy_distribution.svg
│   ├── entropy_dots.svg
│   ├── lines_changed_boxs.svg
│   ├── lines_changed_hists.svg
│   ├── unique_percentage_distribution.svg
│   └── uniqueness_chart.svg
├── maintenance
│   ├── devTime_boxplot.svg
│   ├── issues_distribution.svg
│   └── wiki_yn.svg
├── multi_variable
│   ├── dev_t_to_open_issues.svg
│   ├── multi_correlation.svg
│   ├── star_to_contributors.svg
│   ├── star_to_dev_t.svg
│   ├── star_to_entropy.svg
│   └── star_to_open_issues.svg
└── popularity
    ├── accumulated_popularity.svg
    ├── creation_repository_trend_total.svg
    ├── creation_with_fork_timeline.svg
    ├── fork_distribution.svg
    ├── popularity_dot.svg
    └── popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu
Under the Supervision of Dr. Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: Free for personal use only)

Bibliography

a. GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

b. GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

c. GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

d. GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)


Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

  • Acknowledgments
  • Abstract
  • List of Abbreviations
  • Contents
  • Introduction
    • Trace Deep Learning use through GitHub
    • Contribution
    • Report Outline
  • Background and Related Work
    • Background
      • Deep learning
        • TensorFlow
        • PyTorch
      • Deep learning models
      • Summarized Timeline
    • Public Code Repositories
      • Web-based hosting service
      • Measuring Popularity From GitHub
      • Extracting Messy Data in the Wild
      • Visualizing data in Repositories
    • Summary
  • STAMPER Design and Implementation
    • Overview
    • Data Collection
    • Repository Search
    • Data Selection
      • Example
    • Construct the Visualizations
    • Summary
  • STAMPER in Action
    • Popularity of Deep Learning Models in GitHub
      • Popularity Feature Selection
      • Past and Current Status: A Full Integration
      • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
      • RQ2: How does popularity vary per model?
      • RQ3: Does the popularity of models relate to other features?
    • Contribution of Deep Learning Models in GitHub
      • Collaborative Contribution
      • RQ1: After forking, do developers change the codebase?
    • Maintenance of Deep Learning Models in GitHub
      • RQ1: How long has it been in existence?
      • RQ2: Do old models have more issues compared to new models?
      • RQ3: Are they well maintained?
    • Summary
  • Discussion And Future Work
    • Discussion
      • Data in the wild: Limitation and Improvement
      • Extensibility and Open-Source Software
    • Future Work
      • Social Network Analysis in GitHub
      • Trend Detection using Commitment Timestamps
  • Conclusion
  • Appendix
    • Appendix 1: Project Description
      • Project Title
      • Supervisors
      • Project Description
      • Learning Objectives
    • Appendix 2: Study Contract
    • Appendix 3: Artefact Description
      • Code Files Submitted
      • Program Testing
      • Experiment
        • Hardware
        • Softwares
        • Other
        • Datasets
    • Appendix 4: README

6 Background and Related Work

linear layer, there is a small network inside the LSTM which performs the function independently.

LSTM is one of the most common forms of recurrent neural network. This model is generally used with sequence-related data and can address language-modelling problems such as NLP concepts (word embeddings, encoders).

Residual Network (ResNet)
One of the problems deep learning models face is that, as the number of layers increases past a certain point, accuracy stops improving. This problem motivated modern network architectures, such as the residual network (ResNet) and Inception, which address it through residual connections.

ResNet typically solves the problem described above by fitting a residual mapping through an added shortcut connection. Each ResNet block contains a series of layers and a shortcut connection component.

Bidirectional Encoder Representations from Transformers (Bert)
Bert is a language representation model designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers [Devlin et al., 2018].

It was first released in google-research/bert on GitHub on 1 November 2018, supporting a wide range of natural language processing tasks such as question answering and language inference without substantial task-specific architecture modifications; it aims to predict the relationship between sentences by analysing whole sentences holistically [Devlin et al., 2018].

Attention is all you need (Transformer)
Most problems in deep learning can be seen as a form of sequence-to-sequence mapping, and can be solved using a common type of architecture: the encoder-decoder architecture.

The encoder and decoder both consist of identical layers, and each layer consists of two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network [Vaswani et al., 2017].

Neural Collaborative Filtering (NCF)
Neural Collaborative Filtering (NCF) is a neural network architecture which utilises the non-linearity of neural networks to build recommendation systems [He et al., 2017]. It demonstrates that matrix factorisation can be treated as a special case of neural collaborative filtering. To add additional non-linearity, the model introduces a multi-layer perceptron module on top of the generalised matrix factorisation (GMF) layer.

Wide and Deep Learning
Since linear models are not good at generalising across unique features, deep models were introduced to solve this problem. Deep models can use embedding vectors for every query, a technique that generalises by coupling items and queries.

To overcome over-generalisation, the Google research team introduced the Wide & Deep Learning model, which jointly trains comprehensive linear models and deep neural networks to combine the benefits of memorisation and generalisation for recommender systems [Cheng et al., 2016].

2.1.3 Summarized Timeline

Model Name     Definition Raised Time
CNN            1980s
LSTM           1997
ResNet         2015
Wide & Deep    2016
NCF            2017
Transformer    2017
Bert           2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest open-source distributed version control systems (DVCS) [Gousios et al., 2014]. This distributed version control system enables contributors to submit a set of changes and integrate them in the main development branch. The use of git is based on pragmatic needs: its advantages combine version control with collaborative development.

GitHub can also offer insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project. Thus, from a software development research perspective, the number of stars can reveal popularity, and it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al. [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth, using time-series metadata derived from 2279 accessible GitHub repositories. They also found that slow growth is more common in overpopulated application domains and for old repositories. Moreover, they conclude that the most common domains on GitHub are web libraries and frameworks and non-web libraries and frameworks.

Building on their work, we would like to examine whether deep learning framework repositories follow a similar popularity trend pattern. At the same time, we would also like to study whether a relationship exists between three factors: forks, stars, and watchers.

Predicting the Popularity of GitHub Repositories

In the same year, Borges et al. [2016a] published another paper, about predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regressions to predict the number of stars of GitHub repositories, so that project owners can see how their projects are performing in the open-source community.

Their study examines the correlation between the popularity of a repository and the number of stars it owns. The study also reports a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent
GHTorrent [Gousios and Spinellis, 2012] is designed to conduct independent repository mining through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all repositories, and the resulting dataset can be used to answer multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving raw data based on results returned from the REST API. However, their tool cannot visualise the metadata or offer trend analysis at a high level.

MetricMiner
A similar tool is MetricMiner [Sokol et al., 2013]. It is a web application that supports researchers in mining software repositories, doing data extraction and statistical inference on the collected data. The tool automatically clones a repository, processes the metadata, and stores the data in the cloud, which gives it good scalability and fast query answering without requiring users to install any software locally.

GitcProc
GitcProc [Casalnuovo et al., 2017] is a tool that uses regular expressions to extract the changed lines within a repository, to facilitate answering project evolution questions. GitcProc can retrieve and summarise global statistics of project metrics, including commits, commit dates, and contributors. It can measure how many changes have taken place in Java projects and is also able to locate the changed files.

RepoVis
RepoVis [Feiner and Andrews, 2018] is a new tool which provides visual overviews of software maintained in Git repositories. RepoVis is a client-server web application providing full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its searchable functionality to GitHub, combined with a code-based search. All visualizations are written out in SVG format.

2.2.4 Visualizing data in Repositories

Chronos
CHRONOS [Servant and Jones, 2013] is a software tool that enables visualisation of historical change inside software source code. The tool implements a zoomable user interface as a visualisation of the actual code, supporting developers from a high-level view (pattern recognition) down to a low-level view (reverse engineering) using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise their complete history of change, including the revisions that modified them. Inspired by this tool, our project uses visualisation to track the historical change of popularity trends related to keywords specified by users in GitHub.

Figure 2.1: git2net [Gote et al., 2019]

GEVOL
Collberg et al. [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph-drawing technique, to deduce a better understanding of a program from its development history, displaying all visualisations with a temporal graph visualizer.

This system aids in discovering the structure of a system and provides the user with a new way to observe the evolution of a program by visualising changes to the system. It extracts information about Java programs stored within a CVS version control system into three types of graphs: inheritance, control-flow, and call graphs.

git2net
git2net [Gote et al., 2019] is a software tool that facilitates the extraction of co-editing networks from git repositories. Similar to GEVOL, it uses text-mining techniques to analyse the history of modifications within files. Beyond that, it addresses the importance of studying social networks in GitHub and gives the reader a broader view of the application of graph-based data analysis and modelling. The tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter we detailed the web-based hosting service we selected (GitHub) and presented the concept of deep learning, two popular frameworks, and state-of-the-art neural network models. In the next chapter we elaborate on how we designed and implemented STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter we outline our design and implementation for data extraction, and then detail the metrics we use to estimate trends in deep learning framework and model usage on GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1: (1) Data Collection via the Git project search API on keywords (model names), (2) Repository Search via the Git code search API, and an optional (3) Data Selection step, followed by local data visualisation.

Figure 3.1: Overview of STAMPER


Data Collection
We first collect all repository metrics through the GitHub API. This step allows us to extract the history of all repositories related to the keyword and record the metadata for each repository (repository-based search).

Repository Search
As each repository contains multiple contributors and forks, we subsequently extract and analyse whether changes were made, based on size information, and calculate a collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the fork information to create visual representations.

Data Selection
We implemented a selector that can exclude specific repositories not related to the desired ones. The selector summarises the frequency counts for user-entered keywords and writes the corresponding frequencies to local disk. The resulting file can be used to examine API usage statistics.

Data Analysis
Since each forked repository may involve re-development and modification, modification of forked repositories is included in our project. For all changed repositories, the size difference is analysed to examine whether lines were added or removed compared to the original repository.
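The collaborative factor (entropy) mentioned above can be computed from each repository's per-contributor contribution counts. The report's entropy_calculation.py is not shown here, so the Shannon-entropy sketch below is an illustration of the idea rather than the tool's exact formula:

```python
import math

def contribution_entropy(contributions):
    """Shannon entropy (in bits) of a contributor distribution.

    contributions: list of per-contributor contribution counts.
    0.0 means one contributor does all the work; higher values mean
    the work is spread more evenly across developers.
    """
    total = sum(contributions)
    probs = [c / total for c in contributions if c > 0]
    return -sum(p * math.log2(p) for p in probs)
```

For example, a repository with a single author has entropy 0.0, while two equally active contributors give 1.0 bit.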

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to deeply analyse and manipulate the data, and even run statistical tests on the dataset. To better understand these metrics, we divided them into multiple categories. Attributes that are not primary data from the GitHub API are explained in the data expansion part and labelled [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering an OAuth2 token at the start of the program. After authentication, the user can make up to 5000 requests per hour; otherwise, the rate limit allows only up to 60 requests per hour [Git d].
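The authentication step above can be sketched as follows. The Accept and Authorization header formats and the X-RateLimit-Remaining response header follow the GitHub REST API v3; the helper names are ours:

```python
def auth_headers(token=None):
    # With an OAuth2 token GitHub allows up to 5000 requests/hour;
    # unauthenticated clients get only 60.
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    return headers

def remaining_requests(response_headers):
    # GitHub reports the remaining quota in the X-RateLimit-Remaining header.
    return int(response_headers.get("X-RateLimit-Remaining", 0))
```

These headers would be passed to each search request, and remaining_requests checked before continuing a deep crawl.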


Type          Metadata
Contributor   contribution: int [Data Expansion]
              login (user name): String
              type (user / organization): String
              contributors_url
Repository    created_at
              description
              full_name
              language
              size
Popularity    fork: Boolean
              forks: int
              forks_url
              stargazers_count
              watchers_count
              unique_repos [Data Expansion]
Owner         id
              login (username)
              type
Maintenance   has_issues: Boolean
              has_wiki: Boolean
              open_issues: int
              pushed_at
              updated_at
              score

Table 3.2: Repository metadata collected
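Pulling the Table 3.2 fields out of a repository object returned by the API can be sketched as below. The field names are those in the table (and the GitHub API); the abridged sample payload uses the Bert figures quoted later in this report and is otherwise illustrative:

```python
def extract_metrics(repo):
    # Keep only a subset of the metadata categories listed in Table 3.2.
    return {
        "full_name": repo["full_name"],
        "created_at": repo["created_at"],
        "size": repo["size"],
        "forks": repo["forks"],
        "stargazers_count": repo["stargazers_count"],
        "watchers_count": repo["watchers_count"],
        "open_issues": repo["open_issues"],
        "has_wiki": repo["has_wiki"],
    }

# Abridged sample payload (star/fork counts from Table 4.1; size and
# open_issues are placeholders):
sample = {
    "full_name": "google-research/bert", "created_at": "2018-11-01",
    "size": 1024, "forks": 3637, "stargazers_count": 12405,
    "watchers_count": 12405, "open_issues": 42, "has_wiki": True,
    "language": "Python",
}
```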

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deep crawling operations to collect raw data.

• Contribution
One repository generally has multiple developers conducting software development. The project owner is not necessarily the person who contributes the most code, and the amounts contributed by different developers potentially differ. As a result, we further track this information using the GitHub API, recording the number of contributions each developer made to each repository.

• Unique_repos
Popular repositories may have numerous related forked repositories owned by other users, and the reasons behind forking vary. Our research explores whether users conduct subsequent development on the original codebase. By comparing the size of each forked repository (Fi) with the original repository (O), we obtain all forked repositories together with their change of size (c):

    Fi + c = O    (3.1)
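Equation (3.1) can be applied directly to the collected size metadata; a small sketch (function names are ours):

```python
def size_change(original_size, fork_size):
    # c in Equation (3.1): F_i + c = O, i.e. c = O - F_i.
    return original_size - fork_size

def changed_forks(original_size, fork_sizes):
    # Forks whose size differs from the original are treated as changed.
    return [f for f in fork_sizes if size_change(original_size, f) != 0]
```

For an original repository of size 100, forks of sizes 100, 90, and 110 would yield two changed forks.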

3.4 Data Selection

Figure 3.2: Data Selection — an entity (model) and API keywords are searched in repositories to produce statistics.

Figure 3.3: Store in Local Disk — forked-repository timestamps and model-related keywords (Bert, ResNet, CNN; grouped in Model.py) turn unfiltered data into filtered data.


Figure 3.2 represents our method of searching API usage in DL-model-related repositories. GitHub provides a REST API to examine whether a repository contains a targeted API.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. The appearance frequency of each user-specified API is embedded directly in the returned result and can be matched in our program to each repository's full name. The overall result is finally written to local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. It also allows users to build a high-level picture of API usage across GitHub repositories.

Meanwhile, we also provide multiple options for selecting APIs, stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: The Keras application library provides users the ability to load deep learning models and instantiate a model with default weights. Take ResNet as an example:

• With pre-defined models
A user may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that a repository owner may use a pre-trained model from the Keras library, which can be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they all make good sample keywords for STAMPER, such as:

1. keras.applications.resnet.ResNet50
2. keras.applications.resnet.ResNet101
3. keras.applications.resnet.ResNet152
4. keras.applications.resnet_v2.ResNet50V2
5. keras.applications.resnet_v2.ResNet101V2
6. keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to explore the components inside the ResNet class further, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, data selection heuristics vary: deep learning users and experts can define their searches according to their own interests and preferences.

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py
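The keyword-based selection above can be illustrated locally. STAMPER itself queries the GitHub code-search API; the sketch below only shows the counting idea, on an in-memory snippet, with hypothetical inputs:

```python
def keyword_counts(source, keywords):
    # Count how often each selection keyword appears in a piece of source code.
    return {kw: source.count(kw) for kw in keywords}

# Two of the sample keywords listed above:
resnet_keywords = [
    "keras.applications.resnet.ResNet50",
    "keras.applications.resnet.ResNet101",
]

# A hypothetical file fetched from a repository:
snippet = (
    "from keras.applications.resnet import ResNet50\n"
    "model = keras.applications.resnet.ResNet50()\n"
)

counts = keyword_counts(snippet, resnet_keywords)
```

A repository's total count across its files would then proxy the development effort associated with that API, as described above.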

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the three kinds of visualizations is illustrated in Figure 3.4. Meanwhile, Chapter 5 gives an example of our collected repository metadata for deep learning models.

Figure 3.4: Overall Construction of the Visualizations — entities are functionally mapped to contribution-related, popularity-related, and maintenance-related visualisations.

Popularity

• Total number of repositories with forks (line)
• Total number of repositories without forks (line)
• Number of creations over time, grouped in weeks (with forks)
• Repository creation time vs stars

Contribution

To further exploit the forking information, STAMPER supports comparison between an original repository and its forked repositories. The work could be further extended by visiting each forked repository's URL and tracing the commits.

As shown in Figure 3.5, an entity (E) searched in GitHub may have multiple related repositories (Ri) with their corresponding forked repositories (Fi). Among the forked repositories, we call a changed forked repository Ci.

To examine whether changes exist in forked repositories, and the difference between multiple entities, we calculate the difference using the equation below.

sect35 Construct the Visualizations 17

Keyword                     Total Repositories Collected    Total Original Repositories
                            (including forks)               Collected
ResNet tensorflow           6129                            339
Bert tensorflow             13734                           106
CNN tensorflow              39765                           1000
LSTM tensorflow             19572                           1000
Transformer tensorflow      7188                            145
Wide and deep tensorflow    324                             39

Table 3.1: Repositories Related to Tensorflow

Uniqueness percentage distributions are composed of all the percentages pi, each corresponding to its original repository Ri:

    pi = (Σ Ci) / (Σ Fi)    (3.2)
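Equation (3.2) transcribes directly into code; here each fork is represented by a boolean flag (True if its size changed relative to the origin), so the sum of flags counts the changed forks Ci and the length counts all forks Fi. The function name is ours:

```python
def uniqueness_percentage(changed_flags):
    """p_i from Equation (3.2): the share of forks that differ from the origin.

    changed_flags: one boolean per forked repository of R_i,
    True if the fork's size changed relative to the original repository.
    """
    if not changed_flags:
        return 0.0
    return sum(changed_flags) / len(changed_flags)
```

For the four forks depicted in Figure 3.5 (changed Y/N/Y/Y), this gives p = 0.75.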

Figure 3.5: Examine Uniqueness after Forking — an entity (E) maps to repositories R1…Rn; each repository's forked repositories F1…Fn are flagged as changed (Y/N).

• Percentage of forked repositories unique from origin (boxplots)
• Uniqueness percentage distribution for each entity (histograms)
• Entropy distribution for each entity (histograms)

Maintenance

• Development time boxplot for each entity
• Open issues distribution for each entity

3.6 Summary

In this chapter we detailed how our tool conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. We also introduced and analysed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving, being built, trained, and deployed by researchers. Our tool is designed for analysing such changes. We collected the historical information stored in GitHub and extracted repository metadata using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development is a battlefield without gunpowder smoke: researchers, companies, and developers all compete for a voice in deep learning. A variety of models exist, yet there is no common bridge connecting those ideas. Historical data in GitHub are obscure and hard to find, and as a result many developers, especially experienced ones, remain in their comfort zone. With our study, we hope to shed some light on deep learning use and highlight a few suggestions for the public.

This section aims to answer questions about both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2500 popular repositories based on the number of stars. However, given the few studies about popularity in the GitHub ecosystem, there is no standardized feature for measuring popularity. We analyse some potential features of each repository and hypothesise that popularity is strongly related to the number of stars a repository owns.

This decision is justified in the following section with more background on GitHub.

• Watchers: Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching; watching does not make a user a collaborator [Git b]. A watcher can watch a repository to receive notifications for new pull requests or issues that are created. The number of watchers indicates how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

Figure 4.2: Star Sort Menu [Git a]

• Stars: Starring a repository makes it easy for a user to keep track of a repository they are interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). The star count is another metric of popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks: Forks are created when a user would like to make their own copy of a repository. The user can fork a repository to suggest changes, or to use it as the basis for a new project.

Based on the data gathered in the collection stage, we extract the watchers_count, stargazers_count and forks attributes and test whether these attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.
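The extraction step can be sketched as follows (a minimal example, not STAMPER's actual code; `repos` is a hypothetical list of repository JSON objects of the shape returned by the GitHub REST API). Note that in the v3 API `watchers_count` mirrors `stargazers_count`, which is worth keeping in mind when interpreting the correlations below.

```python
import pandas as pd

# Hypothetical metadata for two repositories, with the three attributes of interest
repos = [
    {"name": "bert", "stargazers_count": 17940, "forks": 4661, "watchers_count": 17940},
    {"name": "bert-variant", "stargazers_count": 12405, "forks": 3637, "watchers_count": 12405},
]

df = pd.DataFrame(repos)
# Pairwise rank correlations between the candidate popularity metrics
print(df[["stargazers_count", "forks", "watchers_count"]].corr(method="spearman"))
```

On the real dataset the same `corr(method="spearman")` call reproduces the pairwise coefficients reported below.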


Figure 4.3: Popularity Metric

star  | forks_count | watchers_count | model name
17940 | 4661        | 17940          | Bert
12405 | 3637        | 12405          | Bert
5263  | 1056        | 5263           | Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we should not assume normality and instead consider a rank-based measure.


Spearman Correlation Coefficient

Definition: the Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman, 2011]. This statistical method allows us to test for a rank-order relationship between two numerical or ordinal variables associated with a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork and watcher) have no relationship with each other.

• H1: There is a relationship between the three variables.

Result

• Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

• Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

• Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's rank correlation coefficient between the three variables on the testing dataset.

Set α = 0.05. Since p1, p2 and p3 are all less than α, and the calculation above also shows strong positive correlations (coef1 = 0.875, coef2 = 1.0, coef3 = 0.875), the evidence against the null hypothesis is strong.

This means it is very unlikely (at 95% confidence) that the testing data are uncorrelated, and thus we can reject the hypothesis that these variables are uncorrelated.

In the rest of the report, we take the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, appear to be the two most trending models at present. Rising from 2017, CNN and LSTM have the greatest number of repositories in terms of both creation and forks. Besides these longer-established models, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as described in the background section.

The model development community has recently seen the release of multiple powerful frameworks that are treated as baselines for building models. However, for many new models, such as the Wide and Deep and NCF models, usage has not grown as abundantly.


Figure 4.4: Repositories with Forks (accumulated number of repositories created, including forks, per model, 2015–2019)

Figure 4.5: Repositories without Forks (accumulated number of repositories created, excluding forks, per model, 2015–2019)


Figure 4.6: Repository Trend in GitHub for Each Model (per-model repository counts over time, October 2015 – October 2019)


Figure 4.7: Creation Time vs. Stars (scatter of repository creation time against number of stars, per model)

A fork is another copy of a repository; the forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories created including forks and the total number created excluding forks. Most repositories related to deep learning models are therefore not original, which indicates that a considerable number of developers remain at the studying stage.

At the same time, we use this dataset to answer several research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, unlike the previous summarizing method, we study the popularity of deep learning models as it changes over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have both the highest average number of stars and the highest number of repositories created. Let us examine this using the data: in 2017 the number of created repositories gradually increased for CNN, and in 2018–2019 the creation trend continued rising to a higher level, which it has maintained until now.

What accounts for this tremendous difference in usage? CNN and LSTM currently have among the largest and most significant communities in the deep learning field. These networks are essential in both computer vision and NLP, where they account for an overwhelming majority of repositories.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly over the last two years. Unlike earlier structures such as CNN and LSTM, both modify an original structure and significantly improve results in computer vision and translation tasks.

Rising star: BERT

However, no model comes with perfection. LSTM itself can be extended into many variants, and BERT is one of those.

The current trends depicted in the graphs support the conclusion that deep learning models are proliferating quickly, with innovative developments. There is certainly ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with how long the model has existed, but our data tell a different story.

Although its paper was published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no simple relationship between popularity (i.e. stars) and creation time; past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide and Deep model, also published in 2016, is similar: although Google provides full documentation and a tutorial for it, we still take a pessimistic view of this model. Moreover, the previous data confirm that there is no significant rise in its use.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (per-model histograms of forks_count, binned)


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see the following.

Model Name    | Mean   | STD     | Min | 25% | 50% | 75% | Max
Bert          | 498.65 | 2196.3  | 0   | 1   | 8   | 43  | 17940
CNN           | 106.84 | 611.97  | 2   | 3   | 8   | 32  | 13882
LSTM          | 48.82  | 214.22  | 0   | 1   | 2   | 13  | 2703
NCF           | 77     | 129.91  | 1   | 2   | 3   | 115 | 227
ResNet        | 46.88  | 221.43  | 0   | 0   | 1   | 8   | 2980
Transformer   | 186.79 | 1155.87 | 0   | 0   | 4   | 21  | 12408
Wide and Deep | 16.23  | 36.80   | 0   | 0   | 1   | 8   | 146

Table 4.2: Stars Comparison

Model Name    | Mean   | STD    | Min | 25% | 50% | 75%  | Max
Bert          | 128.21 | 585.93 | 0.0 | 0.0 | 1.0 | 16.5 | 4661.0
CNN           | 40.71  | 252.71 | 0.0 | 1.0 | 4.0 | 14.0 | 6274.0
LSTM          | 17.79  | 71.96  | 0.0 | 0.0 | 1.0 | 5.0  | 968.0
NCF           | 34.33  | 58.60  | 0.0 | 0.5 | 1.0 | 51.5 | 102.0
ResNet        | 17.44  | 93.75  | 0.0 | 0.0 | 0.0 | 3.0  | 1442.0
Transformer   | 53.52  | 336.10 | 0.0 | 0.0 | 1.0 | 6.0  | 3637.0
Wide and Deep | 7.28   | 16.36  | 0.0 | 0.0 | 0.0 | 2.5  | 71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52) and NCF (34.33).

The three models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The three whose repositories have the lowest average number of forks are Wide and Deep (7.28), ResNet (17.44) and LSTM (17.79).

Kruskal-Wallis Test: the Kruskal-Wallis test is used in this project to compare more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: the 7 models' star distributions are the same

• H1: the 7 models' star distributions are different

from scipy.stats import kruskal
stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                  dfNcf["star"].tolist(), dfResnet["star"].tolist(), dfTransformer["star"].tolist(),
                  dfWideDeep["star"].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that the seven models' star distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories in GitHub. This unexpected result can be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building one's own Transformer or BERT model may require a large amount of time and effort, so developers express their interest in these novel deep learning models through stars and forks instead.

Figure 4.9: Star vs. Contributors (scatter of number of contributors against stargazers_count, per model)

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues and entropy, respectively.


Figure 4.10: Star vs. Development Time (scatter of develop_duration against stargazers_count, per model)

Figure 4.11: Star vs. Open Issues (scatter of open_issues against stargazers_count, per model)

Figure 4.12: Star vs. Entropy Value (scatter of entropy against stargazers_count, per model)

Number of Contributors: From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Across all repositories, the top-3 models with the most stars per contributor are CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor) and Bert (155.0 stars/contributor).

Model         | Percentage of One-Contributor Development (%)
Bert          | 74.53
CNN           | 83.3
LSTM          | 85.9
NCF           | 100
ResNet        | 90.26
Transformer   | 81.20
Wide and Deep | 89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time: From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value ≈ 0). This suggests that the longer a model has been in development, the more stars it has (i.e. the model becomes more popular). The two repositories with the longest development duration belong to the LSTM and CNN models.

Open Issues: From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesize that the more popular a repository becomes, the more issues it will have. We investigate this correlation further in the following section.

Entropy: From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of a new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work of developing a deep learning model is divided evenly. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development work is not distributed evenly.

We can confirm this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Since software development may involve multiple developers, and each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether contributions are evenly distributed.

Entropy. In particular, we compute the entropy H of each repository, defined as

p_i = c_i / Σ_i c_i    (4.1)

H = − Σ_i p_i log2(p_i)    (4.2)

where i indexes contributors, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution to the repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example:

The contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214    (4.3)

p1 = 174/214,  p2 = 36/214,  p3 = 4/214    (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.7826    (4.5)

name          | contribution
dragen1860    | 174
ash3n         | 36
kelvinkoh0308 | 4

Table 4.5: Sample Contributions to One Repository
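The worked example above can be reproduced with a few lines of Python (a minimal sketch; the contribution counts are those of Table 4.5):

```python
import math

def repo_entropy(contributions):
    """Collaboration entropy H = -sum(p_i * log2(p_i)) over contributor shares."""
    total = sum(contributions)
    return -sum((c / total) * math.log2(c / total) for c in contributions if c > 0)

# Worked example from Table 4.5 (dragen1860/TensorFlow-2.x-Tutorials)
print(round(repo_entropy([174, 36, 4]), 4))  # ≈ 0.7826
```

A perfectly even two-person split, `repo_entropy([1, 1])`, gives the maximum value of 1 bit, while a single-contributor repository gives 0.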

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly: the lower the entropy, the higher the phase separation, and thus the more unevenly the work is distributed.

Figure 4.13 shows the distribution of the entropy value for all models. From these figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


Figure 4.13: Collaboration Entropy (per-model histograms of entropy, binned from 0.0 to 3.0)


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with the metadata of their forked repositories.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared with the original. We observe that Bert has a high proportion of unique forked repositories among the models studied.

Figure 4.14: Percentage of Forked Repositories Unique From Origin (boxplot of unique_percent per model)

Figure 4.16 shows the distribution of the number of lines changed compared with the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. Looking more closely, we can see at a glance that changes are rarely made after forking.


Figure 4.15: Repository Uniqueness Distribution (%) (per-model histograms of forked-repository uniqueness percentage, binned)

Figure 4.16: Repository Change Statistic (per-model histograms of mean repository size change after forking, binned from −2500 to 2500)


Moreover, most of the forks that do change differ from the original repository in size by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention leaves them less explored. Second, a model may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section we survey software maintenance problems in these deep learning related repositories. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
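A minimal sketch of this calculation, assuming the ISO-8601 `created_at`/`updated_at` timestamps returned by the GitHub API:

```python
from datetime import datetime, timezone

def repo_age_days(created_at, updated_at):
    """Age in days between the GitHub API timestamps created_at and updated_at."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # ISO-8601 format used by the GitHub API
    created = datetime.strptime(created_at, fmt).replace(tzinfo=timezone.utc)
    updated = datetime.strptime(updated_at, fmt).replace(tzinfo=timezone.utc)
    return (updated - created).days

print(repo_age_days("2018-10-17T19:40:00Z", "2019-10-17T19:40:00Z"))  # → 365
```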

Figure 4.17 and Table 4.6 show how development time varies for each model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide and Deep (11.7 days), ResNet (12.0 days), NCF (21.6 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distributions of development days differ by model (p-value ≤ 0.05). Therefore we hypothesize that many of the earlier models began to be used in the open-source community immediately after their first release.


Model         | Max of days | Q3     | Median | Q1     | Min
Bert          | 779         | 229    | 110    | 32     | 0
Transformer   | 1254        | 321    | 142    | 11     | 0
Wide and Deep | 1107        | 57.5   | 11.7   | 0.5    | 0
ResNet        | 1360        | 456.5  | 12.0   | 1.5    | 0
NCF           | 1120        | 47.6   | 21.6   | 8      | 0
LSTM          | 1812        | 621.25 | 315.5  | 47.25  | 0
CNN           | 1385        | 699.25 | 483    | 270.25 | 0

Table 4.6: Repository Development Time Statistics

Figure 4.17: Development Time Boxplot (per-model boxplots of development days)


Figure 4.18: Development Time vs. Number of Open Issues (scatter of open_issues against develop_duration, per model)

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and the Spearman correlation test, there is a weak-to-moderate positive correlation between the two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, given the high cost of maintaining them, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model         | Mean  | Std    | 25% | 50% | 75% | Min | Max
Bert          | 8.299 | 50.55  | 0   | 0   | 1   | 0   | 504
CNN           | 3.414 | 35.456 | 0   | 0   | 1   | 0   | 1077
LSTM          | 1.292 | 4.915  | 0   | 0   | 1   | 0   | 69
ResNet        | 1.791 | 11.164 | 0   | 0   | 0   | 0   | 186
Transformer   | 1.857 | 8.608  | 0   | 0   | 1   | 0   | 95
Wide and Deep | 0.231 | 0.742  | 0   | 0   | 0   | 0   | 4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository | Percentage of Repositories Having a Wiki (%)
Bert          | 97.17
CNN           | 98.498
LSTM          | 98.799
NCF           | 98.864
ResNet        | 98.817
Transformer   | 96.97
Wide and Deep | 100

Table 4.8: Descriptive statistics on the percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data we collected, deep learning related repositories are well documented (i.e. most have a wiki page). Figure 4.19 shows the histogram of open issues. These samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this chapter we first used STAMPER to get the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the collected data.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation tests. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation of whether developers change the original code base after forking.


Figure 4.19: Open Issues vs. Number of Repositories (per-model histograms of open_issues, binned from 0 to 100)


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmark deep learning models is tricky. We had plenty of ideas for identifying models using different strategies. We developed heuristics for an in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem: the models we chose cannot represent all the new models in the wild. This is an open research question that needs further investigation; for example, users may publish their models in the prototxt format, whereas in this project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limitations, in that the experiment uses a limited number of repositories on GitHub and cannot exceed the 1,000-result boundary of GitHub search. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples might yield a more precise outcome.

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in real life, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly, with different models released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers and users to devise their own heuristics for data selection; experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software, or by becoming a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could look at classification or regression over GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models through the number of repositories existing in GitHub. It is very likely that commit metadata reflects popularity as well. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g. K-Means) to high-resolution time series data from commits.
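As an illustrative sketch of that direction (entirely hypothetical commit-count data; assumes scikit-learn is available), repositories' monthly commit trajectories could be grouped into trend clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical monthly commit counts for six repositories (rows = repos, columns = months)
series = np.array([
    [ 1,  2,  2,  3,  5,  8],   # growing
    [ 2,  3,  4,  6,  9, 14],   # growing
    [10,  8,  7,  5,  3,  2],   # declining
    [12,  9,  6,  4,  2,  1],   # declining
    [ 5,  5,  6,  5,  5,  6],   # steady
    [ 4,  5,  4,  5,  4,  5],   # steady
])

# Normalize each trajectory by its peak so clusters capture shape, not scale,
# then group into three trend clusters (growing / declining / steady)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(series / series.max(axis=1, keepdims=True))
print(labels)
```

On real commit histories, the month-by-repository matrix would come from the commit timestamps STAMPER already collects.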

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach used the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance, and contribution of deep-learning-related GitHub repositories and identified factors affecting those domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields and will serve the needs of people working at the intersection of social media analysis, data visualization, and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train, and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing, and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated this program is my own original work

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015); Processor: 2.2 GHz Intel Core i7; Memory: 16 GB 1600 MHz DDR3; Graphics: Intel Iris Pro 1536 MB

Software

bull PyCharm

PyCharm 2019.1.3 (Professional Edition)


Build #PY-191.7479.30, built on May 30, 2019; Licensed to ANU / Xing Yu; JRE: 11.0.2+9-b159.60 x86_64; JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.; macOS 10.14.6

bull Anaconda

– jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0
  - numpy==1.14.0
  - statistics==1.0.3.5
  - ratelimit==2.2.1
  - requests
  - altair
  - matplotlib==2.2.2
  - selenium
- Git

Datasets

asc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_by_star: bert tensorflow.json, lstm tensorflow.json, wide deep tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, cnn tensorflow.json, ncf tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json

by_update_time: bert tensorflow.json, lstm tensorflow.json, resnet tensorflow.json, wide deep tensorflow.json, cnn tensorflow.json, ncf tensorflow.json, transformer tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json

forked_timestamp: bert tensorflow.csv, lstm tensorflow.csv, resnet tensorflow.csv, wide deep tensorflow.csv, cnn tensorflow.csv, ncf tensorflow.csv, transformer tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents: Before You Begin; Prerequisites; Install; Running; Test; High Level Description of all Modules & Datasets; Authors; License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keeps the Mac awake with this useful app (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used, listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

    pip3 install --upgrade pip
    pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repository metadata from GitHub in the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest. The resulting JSON file will be output/bert.JSON.

The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars; order can be asc or desc.

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder. Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity
- Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance
- Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution
- Multi correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable

Test

Some GitHub repositories are not well maintained, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them into the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data, and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), whose parameters are the model name and the repository metadata subfolder. Then you can use this object with its related data easily (from Model import bert and use bert as you go along).

Customize keywords: in the module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization (Altair is used to draw elegant graphs): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

Experiment Datasets Collected

1. After Data Collection

output/
- asc_by_star: cnn tensorflow.json, lstm tensorflow.json
- asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv


3. After Data Selection (Optional)

filtered_repo/
- bert.json
- pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
- tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

4. Generated Graphs


graphs/
- contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
- maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
- multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
- popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu. Stamper image from http://bestpriceforrubberstamps.com (license: free for personal use only).

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19, and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)


coupling the items and queries.

To overcome over-generalisation, the Google research team introduced the Wide & Deep Learning model, which jointly trains comprehensive linear models and deep neural networks to combine the benefits of memorisation and generalisation for recommender systems [Cheng et al., 2016].

2.1.3 Summarized Timeline

Model Name      Definition Raised Time
CNN             1980s
LSTM            1997
ResNet          2015
Wide & Deep     2016
NCF             2017
Transformer     2017
BERT            2018

Table 2.2: Timeline


2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest open-source distributed version control systems (DVCS) [Gousios et al., 2014]. This distributed version control system enables contributors to submit a set of changes and integrate them into the main development branch. The use of git is based on pragmatic needs, combining the advantages of version control and collaborative development.

GitHub can also lead to insights into the social aspects of software development. Users of GitHub can star a repository to express their interest in a project. Thus, the number of stars can reveal popularity; from a software development research perspective, it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity From GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al. [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth using time-series metadata derived from 2,279 accessible GitHub repositories. They found that slow growth is more common in the case of overpopulated application domains and for old repositories. Moreover, they conclude that the three most common domains on GitHub include web libraries and frameworks and non-web libraries and frameworks.

Building on their work, we would like to examine whether deep learning framework repositories follow a similar popularity trend pattern. At the same time, we would also like to study whether a relationship exists between three factors: forks, stars, and watchers.
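One way to test such a relationship is a rank correlation, which makes no normality assumption about star or fork counts. The sketch below (with hypothetical per-repository counts) implements Spearman's rho in plain Python:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation: rank both samples (average ranks
    for ties), then compute the Pearson correlation of the ranks."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and vs[order[j + 1]] == vs[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average rank for tied values
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical per-repository counts: stars vs. forks
stars = [10, 50, 200, 800, 3000]
forks = [2, 9, 40, 150, 700]
print(spearman_rho(stars, forks))  # close to 1.0: perfectly monotonic toy data
```

A value near +1 or -1 would suggest a strong monotonic relationship between the two popularity measures.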

Predicting the Popularity of GitHub Repositories
In the same year, Borges et al. [2016a] published another paper about predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regression to predict the number of stars of GitHub repositories; as a result, project owners can see how their projects are performing in the open-source community.

Their study examines the correlation between the popularity of a repository and the number of stars it owns. The study also reports a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent
GHTorrent [Gousios and Spinellis, 2012] is designed to conduct independent repository mining through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all the repositories, and the resulting dataset can be used towards answering multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving the raw data based on the result returned from the REST API. However, their tool does not have the ability to visualise the metadata or offer trend analysis at a high level.

MetricMiner
A similar tool is MetricMiner [Sokol et al., 2013], a web application that supports researchers in mining software repositories, doing data extraction and statistical inference from the collected data. This tool automatically clones the repository, processes the metadata, and stores the data in the cloud, which gives it good scalability and fast computational speed in query answering, without installing any software on the researcher's localhost.

GitcProc
GitcProc [Casalnuovo et al., 2017] is a tool that uses regular expressions to extract the changed lines within a repository to facilitate answering project evolution questions. GitcProc can retrieve and summarise global statistics of project metrics, including commits, commit dates, and contributors. This tool can measure how many changes have taken place in Java projects and is also able to locate the changed files.

RepoVis
RepoVis [Feiner and Andrews, 2018] is a new tool which provides visual overviews for software maintained in Git repositories. RepoVis is a client-server web application that provides full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its searchable functionality to GitHub, associated with a code-based search. All the visualizations are written out in SVG format.

2.2.4 Visualizing Data in Repositories

Chronos
CHRONOS [Servant and Jones, 2013] is a software tool that enables the visualisation of historical change inside software source code. This tool implements a zoomable user interface as a visualisation of the actual code, which supports developers from a high-level view (pattern recognition) to a low-level view (reverse engineering) using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise the complete history of change, including the revisions that modified them. Inspired by this tool, our project uses this visualisation method to track the historical change of the popularity


Figure 2.1: git2net [Gote et al., 2019]

trends related to the keywords specified by users in GitHub.

GEVOL
Collberg et al. [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph drawing technique, to deduce a better understanding of a program from its development history, displaying all the visualisations using a temporal graph visualizer.

This system aids in the discovery of the structure of the system and provides the user with a new way to discover the evolution of a program by visualising the change of the system. It extracts information about Java programs stored within a CVS version control system into three types of graphs: inheritance, control-flow, and call-graphs.

git2net
git2net [Gote et al., 2019] is a software package that facilitates the extraction of co-editing networks from git repositories. Similar to GEVOL, it uses text mining techniques to analyse the history of modifications within files. However, it addresses the importance of studying the social network in GitHub and gives the reader a broader view of the application of graph-based data analysis and modelling. Their tool shows advantages in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter, we detailed the web-based hosting service we selected (GitHub) and presented the concept of deep learning with two popular frameworks, together with state-of-the-art neural network models. In the next chapter, we will elaborate on how we designed and implemented STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter, we outline our design and implementation for data extraction, and then detail the metrics we use to estimate trends in deep learning framework and model usage on GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

Figure 3.1: Overview of STAMPER

Data Collection
We first collect all the repository metrics through the GitHub API. This step allows us to extract the history of all the repositories related to a keyword and record the metadata for each repository (repository-based search).

Repository Search
As each repository contains multiple contributors and forks, we subsequently extract and analyze whether changes were made based on the size information, and calculate a collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the forked information to create visual representations.

Data Selection
We have implemented a selector allowing the user to exclude specific repositories not related to the desired ones. The selector summarizes the frequency counts for the keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis
Since each forked repository may involve re-development and modification, modification of forked repositories is included in our project. For all changed repositories, the size difference is analyzed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to deeply analyse and manipulate the data, and even run statistical tests on the dataset. To better understand these metrics, we divided them into multiple categories. For the attributes that are not primary data from the GitHub API, we explain them in the data expansion part and label them as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication, the user can make up to 5,000 requests per hour; otherwise, the rate limit only allows up to 60 requests per hour [Git, d].
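A minimal sketch of this authentication step, assuming GitHub's standard token scheme for the Authorization header (the keyword and token values are placeholders, and search_request is an illustrative helper, not STAMPER's actual function):

```python
from urllib.parse import urlencode
from urllib.request import Request

GITHUB_API = "https://api.github.com"

def search_request(keyword, token, sort="stars", order="desc"):
    """Build an authenticated GitHub repository-search request.

    With the Authorization header set, GitHub raises the rate limit
    from 60 to 5000 requests per hour.
    """
    query = urlencode({"q": keyword, "sort": sort, "order": order})
    url = f"{GITHUB_API}/search/repositories?{query}"
    headers = {"Authorization": f"token {token}",
               "Accept": "application/vnd.github.v3+json"}
    return Request(url, headers=headers)

req = search_request("bert tensorflow", token="<YOUR-OAUTH2-TOKEN>")
print(req.full_url)
```

The sort and order parameters mirror the sorting strategies (stars/updated, asc/desc) used in the data collection step.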


Type         Meta-data
Contributor  contribution: int [Data Expansion]; login (user name): String; type (user/organization): String; contributors_url
Repository   created_at; description; full_name; language; size
Popularity   fork: Boolean; forks: int; forks_url; stargazers_count; watchers_count; unique_repos [Data Expansion]
Owner        id; login (username); type
Maintenance  has_issues: Boolean; has_wiki: Boolean; open_issues: int; pushed_at; updated_at; score

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deep crawling operations to collect raw data.

• Contribution
One repository generally involves multiple developers conducting software development. The project owner is not necessarily the person who contributes the largest amount of code, and the amounts contributed by different developers are potentially not the same. As a result, we further track this information by utilizing the GitHub API and record the number of contributions each developer made to each repository.

• Unique_repos
Popular repositories may have numerous forked repositories owned by other users, and the reasons behind the forking behaviour vary. Our research would like to explore whether users conduct subsequent development based on the original codebase. By comparing the size of a forked

14 STAMPER Design and Implementation

repository(Fi) and original repository (O) we obtain all the forked repositorieswith the change of size (c)

Fi + c = O (31)
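Both expansion steps reduce to simple transformations of the API's JSON; a minimal sketch (the helper names are illustrative, not STAMPER's real code, and the sample logins are those from Table 4.5):

```python
def contribution_table(contributors):
    """Map each contributor's login to their contribution count.
    `contributors` mirrors the JSON list returned by a repository's contributors_url."""
    return {c["login"]: c["contributions"] for c in contributors}

def size_changes(original_size, forked_sizes):
    """Per equation (3.1), F_i + c = O, so each fork's size change is c = O - F_i."""
    return [original_size - f for f in forked_sizes]

contribs = contribution_table([{"login": "dragen1860", "contributions": 174},
                               {"login": "ash3n", "contributions": 36}])
changes = size_changes(1024, [1024, 980, 1100])  # unchanged, shrunk, and grown forks
```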

34 Data Selection

Figure 3.2: Data Selection (an entity/model and its API keywords are searched within repositories to produce statistics)

Figure 3.3: Store in Local Disk (unfiltered data are filtered by forked-repository timestamps and model-related keywords such as Bert, ResNet, and CNN, then grouped per model)


Figure 3.2 represents our method for searching API usage in DL-model-related repositories. GitHub provides a REST API to examine whether or not a repository contains a targeted API.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. The appearance frequency of the user-specified API is embedded directly in the returned result and can be matched in our program to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. The approach also lets users build a high-level picture of API usage across GitHub-related repositories.
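One plausible way to obtain such a count is GitHub's code-search endpoint, whose JSON response includes a total_count field (an assumption about STAMPER's internals; only the URL construction is exercised here):

```python
import urllib.parse

def build_code_search_url(keyword, repo_full_name):
    """Build a GitHub code-search query restricted to a single repository."""
    query = urllib.parse.quote(keyword + " repo:" + repo_full_name)
    return "https://api.github.com/search/code?q=" + query

# GET this URL (authenticated) and read "total_count" from the JSON response.
url = build_code_search_url("ResNet50", "keras-team/keras-applications")
```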

In the meanwhile, we provide multiple options for selecting APIs, stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the number of models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras applications library provides users the ability to load deep learning models and instantiate a model with default weights. Take ResNet as an example:

• With pre-defined models
A user may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, and this model could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility for creating deep learning models, and they could all serve as sample keywords in STAMPER, such as:

keras.applications.resnet.ResNet50
keras.applications.resnet.ResNet101
keras.applications.resnet.ResNet152
keras.applications.resnet_v2.ResNet50V2
keras.applications.resnet_v2.ResNet101V2
keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. Users who would like to explore the components inside the ResNet class further may first examine whether it is defined and trace the number of self-defined ResNet classes. Because of the flexibility of model construction, the data selection heuristic varies: deep learning users and experts can define their searches according to their own interests and preferences.

(See https://github.com/keras-team/keras-applications and https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py)

35 Construct the Visualizations

Given the database of metadata collected from GitHub by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from these three perspectives is illustrated in Figure 3.4. Chapter 5 gives an example using the repository metadata we collected for deep learning models.

Figure 3.4: Overall Construction of the Visualizations (entities 1 to n pass through functional mappings to produce popularity-, contribution-, and maintenance-related visualisations)

Popularity

• Total number of repositories, with forks (line)

• Total number of repositories, without forks (line)

• Number of creations over time, grouped by week (with forks)

• Repository creation time vs. stars

Contribution

To further exploit the forking information, STAMPER supports comparison between an original repository and its forked repositories. The work could be extended by visiting each forked repository's URL and tracing its commits.

As shown in Figure 3.5, an entity (E) searched in GitHub may have multiple related repositories (R_i), each with corresponding forked repositories (F_i). Among the forked repositories, we call a changed forked repository C_i.

To examine whether forked repositories change, and how this differs between entities, we calculate the difference using the equation below.


Keyword                    Total of Repositories (including Forks) Collected   Total of Original Repositories Collected
ResNet tensorflow          6,129                                               339
Bert tensorflow            13,734                                              106
CNN tensorflow             39,765                                              1,000
LSTM tensorflow            19,572                                              1,000
Transformer tensorflow     7,188                                               145
Wide and deep tensorflow   324                                                 39

Table 3.1: Repositories Related to Tensorflow


The uniqueness percentage distribution is composed of all the percentages p_i, each corresponding to an original repository R_i.

p_i = Σ C_i / Σ F_i        (3.2)
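Per original repository, equation (3.2) is a ratio of changed forks to all forks; a minimal sketch (a hypothetical helper, assuming one boolean "changed" flag per fork):

```python
def uniqueness_percentage(changed_flags):
    """p_i = (changed forks, sum C_i) / (all forks, sum F_i), per equation (3.2)."""
    if not changed_flags:
        return 0.0
    return sum(changed_flags) / len(changed_flags)

# Forks of one original repository, marked changed Y/N/Y/Y as in Figure 3.5.
p = uniqueness_percentage([True, False, True, True])
```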

Figure 3.5: Examine Uniqueness after Forking (an entity E with repositories 1 to 4; each forked repository 1 to n is marked changed Y/N)

• Percentage of forked repositories unique from origin (boxplots)

• Uniqueness percentage distribution for each entity (histograms)

• Entropy distribution for each entity (histograms)

Maintenance

• Development time boxplot for each entity

• Open issues distribution for each entity

36 Summary

In this chapter we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. We also introduced and analysed two novel features of GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field: deep learning models are continually evolving and being built, trained, and deployed by researchers. Our tool is designed for analysing such changes. We collected the historical information stored in GitHub and extracted repository metadata using STAMPER.

41 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development has become fiercely competitive: researchers, companies, and developers all try to establish a dominant voice in deep learning. A variety of models exist, yet there is no common bridge connecting those ideas. Historical data in GitHub are obscure and hard to find, and as a result many developers, especially experienced ones, remain in their original comfort zone. With this study we hope to shed some light on deep learning in the wild and highlight a few suggestions for the public.

This section aims to answer questions about both models' usage in GitHub and the popularity of DL model development.

411 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, given the few studies on popularity in the GitHub ecosystem, there is no standardized feature for measuring it. We analyse some potential features of each repository and hypothesize that popularity is strongly related to the number of stars a repository owns.

This decision is justified in the following section with more background on GitHub.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who ask to be notified of activity in a repository they are watching; watching does not make a user a collaborator [Git b]. A watcher can watch a repository to receive notifications for newly created pull requests or issues, so watchers indicate how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

Figure 4.2: Star Sort Menu [Git a]

• Stars
Starring a repository makes it easy for a user to keep track of a repository they are interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). Stars are another metric of popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks
A fork is created when a user wants their own copy of a repository. The user can fork a repository to suggest changes, or use it as the basis of a new project.

Based on the data gathered in the collection stage, we extract the watchers_count, stargazers_count, and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics and draw a scatter plot, shown in Figure 4.3.


Figure 4.3: Popularity Metric

star     forks_count   watchers_count   model name
17,940   4,661         17,940           Bert
12,405   3,637         12,405           Bert
5,263    1,056         5,263            Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks, and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we should not rely on an assumption of normality and instead use a rank-based correlation measure.


Spearman Correlation Coefficient

Definition: the Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman, 2011]. This method tests for a rank-order relationship between two numerical or ordinal variables associated with a monotonic (increasing or decreasing) function.

Hypothesis Testing

• H0: The variables (star, fork, and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs. Fork

  from scipy.stats import spearmanr
  coef1, p1 = spearmanr(star, forks)
  print(coef1, p1)
  # >> 0.8752903811064278 0.0

• Star vs. Watcher

  coef2, p2 = spearmanr(star, watchers)
  print(coef2, p2)
  # >> 1.0 0.0

• Fork vs. Watcher

  coef3, p3 = spearmanr(forks, watchers)
  print(coef3, p3)
  # >> 0.8752903811064278 0.0

Running the code above calculates the Spearman correlation coefficient between the three variables in the testing dataset.

Setting α = 0.05, the p-values p1, p2, and p3 are all less than α; from the calculation above we also find strong positive correlations, with coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875, respectively.

This means it is very unlikely (at 95% confidence) that the variables are uncorrelated, so we reject the hypothesis that they are uncorrelated.

In the rest of the report we therefore treat the number of stars as the proxy for a project's popularity.


412 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, appear to be the two most trending models. Rising from 2017, CNN and LSTM have the greatest numbers of repositories in both creations and forks. Among the models with a shorter history, BERT and ResNet are two rising stars of the model competition: they arrived with significant improvements in architecture design and performance, as described in the background section.

The model development community has recently seen the release of multiple powerful frameworks that are treated as baselines for building models. However, for many new models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.


Figure 4.4: Repositories with Forks (accumulated number of repositories created over 2015-2019 for each model keyword: bert, cnn, lstm, ncf, resnet, transformer, and wide deep, each with tensorflow)

Figure 4.5: Repositories without Forks (accumulated number of repositories created over 2015-2019 for each model keyword)


Figure 4.6: Repository Trend in GitHub for Each Model (repository counts over time, October 2015 to October 2019, one panel per model keyword)


Figure 4.7: Creation Time vs. Stars (number of stars against repository creation time, one row per model keyword)

A fork is another copy of a repository; a forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5 we can, surprisingly, see a considerable difference between the total number of repositories created including forks and the total number created excluding forks. Most repositories related to deep learning models are not original, which indicates that a considerable number of developers remain at the learning stage.

At the same time, we use this dataset to answer several research questions.

413 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories currently in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, in contrast to the previous summarizing method, we study the popularity of deep learning models over time.

CNN and LSTM's dominant role

As the above comparison shows, CNN and LSTM are the winners in the GitHub community, with the highest average number of stars and the largest number of repositories created. The data bears this out: in 2017 the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued rising to a higher level, which persists until now.

What accounts for this tremendous difference in usage? CNN and LSTM currently have among the most important and significant communities in the deep learning field; these networks are essential in both computer vision and NLP, where they constitute an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer
As the title indicates, usage of ResNet and Transformer has improved significantly in the last two years. Differing from previous structures such as CNN, both modify the original structures and significantly improve results in computer vision and translation tasks.

Rising star: BERT

However, no model comes with perfection: LSTM itself can be extended to many variants, and BERT is one of those.

The current trends depicted in the graphs suggest that deep learning models are proliferating quickly with innovative developments; there is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with when it came into existence, but our data tells a different story.

Published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e., stars) and creation time; past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The situation is similar for the Wide and Deep model: although Google provides full documentation and a tutorial for it, we still take a pessimistic view of this model published in 2016. Moreover, the data confirms that there has been no significant rise in its usage.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (fork-count distribution histograms, binned from 0 to 1,000, one panel per model keyword)


414 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model, and as presented in Table 4.2 and Table 4.3 we can see the following.

Model Name      Mean     STD        Min   25%   50%   75%   Max
Bert            498.65   2,196.3    0     1     8     43    17,940
CNN             106.84   611.97     2     3     8     32    13,882
LSTM            48.82    214.22     0     1     2     13    2,703
NCF             77       129.91     1     2     3     115   227
ResNet          46.88    221.43     0     0     1     8     2,980
Transformer     186.79   1,155.87   0     0     4     21    12,408
Wide and Deep   16.23    36.80      0     0     1     8     146

Table 4.2: Stars Comparison

Model Name      Mean     STD      Min   25%   50%   75%    Max
Bert            128.21   585.93   0.0   0.0   1.0   16.5   4,661.0
CNN             40.71    252.71   0.0   1.0   4.0   14.0   6,274.0
LSTM            17.79    71.96    0.0   0.0   1.0   5.0    968.0
NCF             34.33    58.60    0.0   0.5   1.0   51.5   102.0
ResNet          17.44    93.75    0.0   0.0   0.0   3.0    1,442.0
Transformer     53.52    336.10   0.0   0.0   1.0   6.0    3,637.0
Wide and Deep   7.28     16.36    0.0   0.0   0.0   2.5    71.0

Table 4.3: Forks Comparison
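The rows in Tables 4.2 and 4.3 are standard summary statistics; a sketch with the standard library (the report itself more likely used pandas' describe(); the sample star counts below are illustrative):

```python
import statistics

def summarize(values):
    """Mean, standard deviation, and quartile summary as used in Tables 4.2 and 4.3."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"mean": statistics.mean(values), "std": statistics.stdev(values),
            "min": min(values), "25%": q1, "50%": median, "75%": q3,
            "max": max(values)}

stats = summarize([0, 1, 8, 43, 146])  # a tiny, made-up sample of star counts
```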

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52), and NCF (34.33).

The three models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88), and LSTM (48.82). The three models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79), and ResNet (17.44).

Kruskal-Wallis Test: the Kruskal-Wallis test is used in this project to compare more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The seven models' distributions are the same.

• H1: The seven models' distributions are different.

  from scipy.stats import kruskal
  stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(), dfLstm['star'].tolist(),
                    dfNcf['star'].tolist(), dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                    dfWideDeep['star'].tolist())
  print(stat, p)
  # >> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories in GitHub. This unexpected result can be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or BERT model requires a large amount of time and effort, yet developers still show their interest in those novel deep learning models by starring and forking them.

Figure 4.9: Star vs. Contributors (stargazers_count against number_of_contributors, per model keyword)

415 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show scatter plots correlating the number of stars with the number of contributors, the development time, the number of open issues, and the entropy value, respectively.


Figure 4.10: Star vs. Development Time (stargazers_count against develop_duration, per model keyword)

Figure 4.11: Star vs. Open Issues (stargazers_count against open_issues, per model keyword)

Figure 4.12: Star vs. Entropy Value (stargazers_count against entropy, per model keyword)

Number of Contributors
From Figure 4.9 and a Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Across all repositories, the top-3 repositories with the most stars per contributor come from the models CNN (1,687.5 stars/contributor), Transformer (1,551 stars/contributor), and Bert (1,550 stars/contributor).

Model           Percentage of One-Contributor Development (%)
Bert            74.53
CNN             83.3
LSTM            85.9
NCF             100
ResNet          90.26
Transformer     81.20
Wide and Deep   89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time
From Figure 4.10 and a Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value = 0). This suggests that the longer a model has been developed, the more stars it has (i.e., the model becomes more popular). The top-2 models by development duration are LSTM and CNN.

Open Issues
From Figure 4.11 and a Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesize that the more popular a repository becomes, the more issues it will have; we investigate this correlation further in the following section.

Entropy
From Figure 4.12 and a Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of a new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work of developing a deep learning model is divided evenly. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means development is not distributed evenly.

We can confirm this using Table 4.4: most deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


42 Contribution of Deep Learning Models in GitHub

421 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. Here we introduce an information-theoretic approach to test whether their contributions are even or not.

Entropy
In particular, we compute the entropy H of each repository, defined as

p_i = c_i / Σ_i c_i        (4.1)

H = − Σ_i p_i log2(p_i)        (4.2)

where i indexes the contributors, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution to the repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example, its contributions are summarized in Table 4.5, and the corresponding entropy can be calculated as follows.

Total = 174 + 36 + 4 = 214        (4.3)

p_1 = 174/214,   p_2 = 36/214,   p_3 = 4/214        (4.4)

H(repository) = −(174/214 log2(174/214) + 36/214 log2(36/214) + 4/214 log2(4/214)) ≈ 0.7826        (4.5)

name            contribution
dragen1860      174
ash3n           36
kelvinkoh0308   4

Table 4.5: Sample Contributions to One Repository
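The computation in equations (4.1)-(4.5) can be sketched directly (a minimal illustration, not STAMPER's code; the contribution counts are those from Table 4.5):

```python
import math

def collaboration_entropy(contributions):
    """Shannon entropy (base 2) of a repository's contribution distribution,
    per equations (4.1) and (4.2)."""
    total = sum(contributions)
    return -sum((c / total) * math.log2(c / total) for c in contributions if c > 0)

# dragen1860/TensorFlow-2x-Tutorials, contributions from Table 4.5
h = collaboration_entropy([174, 36, 4])
print(round(h, 4))
```

A perfectly even two-person split gives the maximum two-contributor entropy of 1.0, while a single-contributor repository gives 0.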

The resulting distribution of entropy across all repositories can be used to determine whether repositories are developed unevenly: the lower the entropy, the higher the phase separation, meaning the work is distributed more unevenly.

Figure 4.13 shows the distribution of the entropy value for all models. Most repositories have an entropy value of around zero, which means that deep learning related repositories are developed either mostly by one developer or by a team with an uneven allocation of work.


Figure 4.13: Collaboration Entropy (entropy distribution histograms, binned from 0.00 to 3.00, one panel per model keyword)


422 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original. We observe that Bert has a high proportion of unique forked repositories among the models.

Figure 4.14: Percentage of Forked Repositories Unique From Origin (boxplots of unique_percent, 0-100%, per model keyword)

Figure 4.16 shows the distribution of the amount changed compared to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. A more detailed analysis shows at a glance not only that changes are rarely made after forking, but also that most changed


Figure 4.15: Repository Uniqueness Distribution (uniqueness percentage histograms, binned from 0.00 to 1.00, one panel per model keyword)

Figure 4.16: Repository Change Statistic (histograms of mean repository size change, binned from -2,500 to 2,500 bytes, one panel per model keyword)


repositories differ in size from the original by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them of less concern to people. Second, a model may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that forked-repository development size is quite imbalanced, with a large number of forked projects showing no change from the original repository.

43 Maintenance of Deep Learning Models in GitHub

In this section we survey the software maintenance problems of these deep learning related repositories. The overall purpose is to explore three sets of factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

431 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
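In code, this age computation can be sketched as follows (a minimal illustration using the `created_at` and `updated_at` fields returned by the GitHub API; `repo_age_days` is our name for the helper, not part of STAMPER):

```python
from datetime import datetime

def repo_age_days(created_at: str, updated_at: str) -> float:
    """Repository age in days, from GitHub's ISO-8601 timestamps
    (e.g. "2019-10-01T00:00:00Z")."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    created = datetime.strptime(created_at, fmt)
    updated = datetime.strptime(updated_at, fmt)
    return (updated - created).total_seconds() / 86400
```
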

Figure 4.17 and Table 4.6 show how development time varies by model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). A Kruskal-Wallis test shows that the distribution of development days differs across models (p-value ≤ 0.05). We therefore hypothesize that many of these models began attracting open-source community activity immediately after their first release.
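The Kruskal-Wallis H statistic behind this test can be sketched in pure Python (a tie-free toy version for illustration only; a real analysis would call `scipy.stats.kruskal`, which also applies a tie correction and computes the p-value):

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for k independent samples.
    Toy version: assumes no tied values across the pooled samples."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # 1-based ranks
    sum_term = 0.0
    for g in groups:
        r = sum(rank[v] for v in g)  # rank sum of this group
        sum_term += r * r / len(g)
    return 12.0 / (n * (n + 1)) * sum_term - 3 * (n + 1)
```
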


Model        Max of days   Q3 of days   Median of days   Q1 of days   Min of days
Bert         779           229          110              32           0
Transformer  1254          321          142              11           0
Wide deep    1107          575          117              0.5          0
ResNet       1360          456.5        120              15           0
NCF          1120          476          216              8            0
LSTM         1812          621.25       315.5            47.25        0
CNN          1385          699.25       483              270.25       0

Table 4.6: Repository Development Time Statistics

[Figure: boxplots of development time (days, 0–2000) per model, for the bert, cnn, lstm, ncf, resnet, transformer and wide deep TensorFlow models.]

Figure 4.17: Development Time Boxplot


[Figure: scatter plot of develop_duration (days) against open_issues per repository, coloured by model.]

Figure 4.18: Development Time vs. Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of open issues. As suggested visually and confirmed by a Spearman correlation test, there is a moderate positive correlation between the two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models may have accumulated more users and hence more open issues, which increases their maintenance cost.
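The Spearman statistic reported here can be illustrated with the no-ties rank formula ρ = 1 − 6Σd²/(n(n²−1)) (a toy sketch; `spearman_rho` is an illustrative name, and a real analysis would use `scipy.stats.spearmanr`, which also handles ties):

```python
def spearman_rho(x, y):
    """Spearman rank correlation via 1 - 6*sum(d^2)/(n*(n^2-1)).
    Toy version: assumes no tied values within x or within y."""
    n = len(x)
    rx = {v: i + 1 for i, v in enumerate(sorted(x))}  # value -> rank in x
    ry = {v: i + 1 for i, v in enumerate(sorted(y))}  # value -> rank in y
    d2 = sum((rx[a] - ry[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))
```
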

Specifically, as depicted in Table 4.7, the three models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model        Mean    Std      25%   50%   75%   Min   Max
Bert         8.299   50.55    0     0     1     0     504
CNN          3.414   35.456   0     0     1     0     1077
LSTM         1.292   4.915    0     0     1     0     69
ResNet       1.791   11.164   0     0     0     0     186
Transformer  1.857   8.608    0     0     1     0     95
Wide Deep    0.231   0.742    0     0     0     0     4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository   Percentage of repositories having a wiki (%)
Bert                       97.17
CNN                        98.498
LSTM                       98.799
NCF                        98.864
ResNet                     98.817
Transformer                96.97
Wide deep                  100

Table 4.8: Descriptive statistics on percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep-learning-related repositories are well documented (i.e., most have a wiki page). Figure 4.19 shows the histogram of open issues. All these samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section, we first used STAMPER to collect the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects of deep learning repositories (popularity, contribution and maintenance) using the data collected with STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original codebase after forking.


[Figure: per-model histograms of open issues (binned), with "Count of Records" on the y-axis, for the bert, cnn, lstm, ncf, resnet, transformer and wide deep TensorFlow models; most repositories fall in the lowest bins.]

Figure 4.19: Open Issues vs. Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas on identifying models using different strategies, and we developed heuristics for in-depth analysis of model construction and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

A sampling problem also exists. The models we chose cannot represent all new models in the wild; this is an open research question that needs further investigation (for example, users may publish their models in prototxt format). In this project, we focused only on deep learning models constructed using Python. The findings may also reflect sampling limitations: the present experiment uses a limited number of repositories on GitHub, since a search query cannot return more than 1000 original repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all repositories on GitHub. Other, more stratified samples might yield a more precise outcome.
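The multi-strategy workaround described above can be sketched as follows (a hypothetical illustration: `search_urls` and `merge_by_id` are our names, not STAMPER functions; the GitHub search endpoint returns at most 1000 results per query, so the same query is issued under several sort/order strategies and the results merged by repository id):

```python
from urllib.parse import urlencode

API = "https://api.github.com/search/repositories"

def search_urls(query, strategies, per_page=100, pages=10):
    """One request URL per page per (sort, order) strategy.
    Each strategy can surface up to 1000 results, so running
    several strategies and merging widens coverage."""
    urls = []
    for sort, order in strategies:
        for page in range(1, pages + 1):
            params = {"q": query, "sort": sort, "order": order,
                      "per_page": per_page, "page": page}
            urls.append(API + "?" + urlencode(params))
    return urls

def merge_by_id(result_sets):
    """De-duplicate repository records gathered under different strategies."""
    seen, merged = set(), []
    for items in result_sets:
        for repo in items:
            if repo["id"] not in seen:
                seen.add(repo["id"])
                merged.append(repo)
    return merged
```
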

Nevertheless, our research project is essential in that it provides an intuitive way for researchers and developers to see how deep learning is involved in real life, an idea that is novel compared with previous work.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers and users to implement their own heuristics for data selection: experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this work by migrating to open-source software or becoming a plugin for GitHub, allowing researchers



and developers to access past trends easily.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project, we investigated the popularity of deep learning models in relation to the number of repositories existing in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., K-Means) to high-resolution time series data from commits.
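As a sketch of the kind of clustering this future work envisions (a plain 1-D k-means over, say, weekly commit counts; `kmeans_1d` is an illustrative toy, not existing project code, and a real analysis would likely use `sklearn.cluster.KMeans`):

```python
import random

def kmeans_1d(values, k, iters=100, seed=0):
    """Plain 1-D k-means (Lloyd's algorithm) over scalar observations,
    e.g. weekly commit counts; returns the sorted centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)  # initialise from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            # assign each point to its nearest centroid
            i = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[i].append(v)
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # converged
            break
        centroids = new
    return sorted(centroids)
```
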

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of deep-learning-related GitHub repositories and identified factors affecting each of these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the tool's ability to give users deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and resulting corpus to be of considerable interest to researchers in different fields, serving people working at the intersection of social media analysis, data visualization and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What are the differences between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
- Processor: 2.2 GHz Intel Core i7
- Memory: 16 GB 1600 MHz DDR3
- Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

PyCharm 2019.1.3 (Professional Edition)


Build #PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
macOS 10.14.6

• Anaconda

– jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0
  - numpy==1.14.0
  - statistics==1.0.3.5
  - ratelimit==2.2.1
  - requests
  - altair
  - matplotlib==2.2.2
  - selenium
- Git

Datasets

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

Before You BeginPrerequisitesInstallRunningTestH

igh Level Description of all M

odules amp D

atasetsAuthorsLicense

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view, edit and execute our code:

- PyCharm
- Anaconda
- Amphetamine (Mac App Store): keep the Mac awake with this useful app (otherwise the internet connection will drop)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- A GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0

All external libraries used are listed in requirements.txt.

Install

Install the dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running

All the code scripts run from the root

1. Data Collection

Clone our project, then run python3 model_searcher.py to get keyword-related repositories' metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON.

The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars; order can be asc or desc.

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

Run python3 repository_filter.py to get the code-related repositories with statistics in the filtered_repo folder, then run python3 filtered_repo.py to filter your data.

Note: keywords can be customized in model_keyword.py. We store all previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run python3 visualizations/popularity.py; graphs are written to visualizations/graphs/popularity
- Maintenance: run python3 visualizations/maintenance.py; graphs are written to visualizations/graphs/maintenance
- Contribution: run python3 visualizations/contribution.py; graphs are written to visualizations/graphs/contribution
- Multi Correlations: run python3 visualizations/multi_variable.py; graphs are written to visualizations/graphs/multi_variable

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience using our tool, we provide a testing unit for GitHub links in test.py. This module records all unreachable links and writes them to the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), whose parameters are the model name and the repository-metadata subfolder. Then you can use this object with its related data easily (from Model import bert and use bert as you go along).

Customize Keywords

In the module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection
2. Repository Search
3. (Optional) Data Selection
4. Data Visualization (Altair is used to draw elegant graphs)
Experiment Datasets Collected

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

1. After Data Collection

output/
  asc_by_star/
    cnn tensorflow.json, lstm tensorflow.json
  asc_general/
    bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  by_update_time/
    bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_by_star/
    bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_general/
    bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  pytorch_models/
    AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/
  bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional)

filtered_repo/
  bert.json
  pytorch_model_filtering/
    Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
  tensorflow_model_filtering/
    bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs

graphs/
  contribution/
    change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
  maintenance/
    devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
  multi_variable/
    dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
  popularity/
    accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (license: free for personal use only)

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning



for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)


Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)



2.2 Public Code Repositories

2.2.1 Web-based hosting service

With almost 9 million users and 17 million repositories, GitHub is one of the largest open-source distributed version control system (DVCS) hosting services [Gousios et al., 2014]. A DVCS enables contributors to submit a set of changes and integrate them into the main development branch. The use of Git is based on pragmatic needs: it combines the advantages of version control with collaborative development.

GitHub can also yield insights into the social aspects of software development. Users can star a repository to express interest in a project; thus, the number of stars can reveal popularity. From a software development research perspective, it may also give developers and researchers valuable insights.

2.2.2 Measuring Popularity from GitHub

Understanding the Factors that Impact the Popularity of GitHub Repositories

Borges et al. [2016b] investigate the popularity of GitHub repositories and identify four main patterns of popularity growth using time series metadata derived from 2,279 accessible GitHub repositories. They found that slow growth is more common for overpopulated application domains and for old repositories. Moreover, they conclude that the most common domains on GitHub include web libraries and frameworks and non-web libraries and frameworks.

Building on this work, we examine whether deep learning framework repositories follow a similar popularity trend pattern. At the same time, we also study whether a relationship exists between three factors: forks, stars and watchers.

Predicting the Popularity of GitHub Repositories

In the same year, Borges et al. [2016a] published another paper, about predicting the popularity of GitHub repositories using machine learning techniques. They use multiple linear regression to predict the number of stars of GitHub repositories, so that project owners can see how their projects are performing in the open-source community.

Their study measures a repository's popularity by the number of stars it has received, and reports a very strong correlation between predicted and real rankings.


2.2.3 Extracting Messy Data in the Wild

GHTorrent

GHTorrent [Gousios and Spinellis, 2012] is designed to support independent repository mining through a peer-to-peer BitTorrent protocol. It uses the REST API to navigate all repositories, and the resulting dataset can be used to answer multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving raw data from the results returned by the REST API. However, their tool cannot visualise the metadata or offer high-level trend analysis.

MetricMiner
A similar tool is MetricMiner [Sokol et al., 2013], a web application that supports researchers in mining software repositories, performing data extraction and statistical inference on the collected data. The tool automatically clones the repository, processes the metadata and stores the data in the cloud, which gives it good scalability and fast computational speed in query answering, without requiring researchers to install any software on their localhost.

GitcProc
GitcProc [Casalnuovo et al., 2017] is a tool that uses regular expressions to extract the changed lines within a repository, to facilitate answering project-evolution questions. GitcProc can retrieve and summarise global statistics of project metrics, including the number of commits, commit dates and contributors. It can measure how many changes have taken place in Java projects and is also able to locate the changed files.

RepoVis
RepoVis [Feiner and Andrews, 2018] is a newer tool which provides visual overviews for software maintained in Git repositories. RepoVis is a client-server web application that provides a full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adopts its searchable functionality for GitHub, associated with a code-based search. All the visualizations are written out in SVG format.

2.2.4 Visualizing Data in Repositories

Chronos
CHRONOS [Servant and Jones, 2013] is a software tool that enables visualisation of historical change inside software source code. The tool implements a zoomable user interface as a visualisation of the actual code, which supports developers from a high-level view (pattern recognition) to a low-level view (reverse engineering) using the History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise the complete history of change, including the revisions that modified them. Inspired by this tool, our project uses this visualisation method to track the historical change of the popularity trend related to the keyword specified by users in GitHub.

Figure 2.1: git2net [Gote et al., 2019]

GEVOL
Collberg et al. [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph-drawing technique, to deduce a better understanding of a program from its development history; all the visualisations are displayed using a temporal graph visualizer.

This system aids in the discovery of the structure of a system and provides the user with a new way to discover the evolution of the program by visualising changes to the system. It extracts information about Java programs stored within a CVS version control system, and organises the metadata into three types of graphs: inheritance, control-flow and call graphs.

git2net
git2net [Gote et al., 2019] is a software tool that facilitates the extraction of co-editing networks from git repositories. Similar to GEVOL, it uses text-mining techniques to analyse the history of modifications within files. However, it addresses the importance of studying the social network in GitHub, and gives the reader a broader view of the application of graph-based data analysis and modelling. The tool shows its advantage in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter we detailed the web-based hosting service we selected for study (GitHub) and presented the concept of deep learning, two popular frameworks, and state-of-the-art neural network models. In the next chapter we will elaborate on how we design and implement STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter we outline our design and implementation for data extraction, and then detail the metrics we use to estimate the trend of deep learning framework and model usage in GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

Figure 3.1: Overview of STAMPER (1. Data Collection, 2. Repository Search, 3. Data Selection [optional], followed by Data Visualisation, using the Git Project Search API and Code Search API)


Data Collection
We first collect all the repository metrics through the GitHub API. This step allows us to extract the history of all the repositories related to the keyword and record the metadata for each repository (repository-based search).

Repository Search
As each repository contains multiple contributors and forks, we subsequently extract and analyze whether changes are made, based on the size information, and calculate a collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the forked repositories' information to create visual representations.

Data Selection
We implemented a selector that allows excluding specific repositories unrelated to the desired ones. The selector summarizes the frequency counts for the keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can then be used to examine API usage statistics.

Data Analysis
Since each forked repository may involve re-development and modification, analysis of forked-repository modification is included in our project. For all changed repositories, the size difference is analyzed to examine whether lines are added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to deeply analyze and manipulate the data, and even run statistical tests on the dataset. To better understand these metrics, we divide them into multiple categories. For attributes that are not primary data from the GitHub API, we explain them in the data-expansion part and label them as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication, the user can make up to 5,000 requests per hour; otherwise, the rate limit only allows up to 60 requests per hour [Git d].
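As a concrete sketch of this authentication step: the header format (`Authorization: token <TOKEN>`) and the `/rate_limit` endpoint are part of the GitHub REST API, while the helper names below are ours, not STAMPER's actual code.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"

def make_headers(token=None):
    """Build request headers; supplying an OAuth2 token lifts the
    rate limit from 60 to 5,000 requests per hour."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = "token " + token
    return headers

def remaining_quota(token=None):
    """Ask the /rate_limit endpoint how many core requests remain."""
    req = urllib.request.Request(API_ROOT + "/rate_limit",
                                 headers=make_headers(token))
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["resources"]["core"]["remaining"]
```

Calling `remaining_quota()` without a token reports the unauthenticated allowance; STAMPER itself prompts for the token at program start.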


Type          Meta-data
Contributor   contribution: int [Data Expansion]
              login (user name): String
              type (user / organization): String
              contributors_url
Repository    created_at, description, full_name, language, size
Popularity    fork: Boolean, forks: int, forks_url,
              stargazers_count, watchers_count,
              unique_repos [Data Expansion]
Owner         id, login (username), type
Maintenance   has_issues: Boolean, has_wiki: Boolean, open_issues: int,
              pushed_at, updated_at, score

Table 3.2: Repository metadata collected through the GitHub API

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deep crawling operations to collect raw data.

• Contribution
One repository generally involves multiple developers conducting software development. The project owner is not necessarily the person who contributes the largest amount of code, and the amounts contributed by different developers are potentially not the same. As a result, we further track this information using the GitHub API and record the number of contributions each developer made to each repository.

• Unique_repos
Popular repositories may have numerous forked repositories owned by other users, and the reasons behind the forking behaviour vary. Our research explores whether users conduct subsequent development based on the original codebase. By comparing the size of a forked repository (Fi) with that of the original repository (O), we obtain all the forked repositories with a change of size (c):

    Fi + c = O    (3.1)
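In code, Equation (3.1) is a simple size comparison. The sketch below assumes repository sizes are the integer `size` values (in KB) reported by the GitHub API; the function names and sample sizes are hypothetical.

```python
def size_changes(original_size, fork_sizes):
    """For each fork F_i, solve Equation (3.1) for c: c = O - F_i.
    A non-zero c means the fork's size differs from the original O."""
    return [original_size - f for f in fork_sizes]

def changed_fork_indices(original_size, fork_sizes):
    """Indices of forks whose size differs from the original."""
    return [i for i, c in enumerate(size_changes(original_size, fork_sizes))
            if c != 0]

# hypothetical sizes (KB): one original repository and three forks
print(size_changes(120, [120, 95, 180]))          # [0, 25, -60]
print(changed_fork_indices(120, [120, 95, 180]))  # [1, 2]
```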

3.4 Data Selection

Figure 3.2: Data Selection

Figure 3.3: Store in Local Disk


Figure 3.2 represents our method for searching API usage in DL-model-related repositories. GitHub provides a REST API to examine whether a repository contains the targeted API or not.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. Moreover, the appearance frequency of the user-specified API is embedded directly within the returned result and can be identified in our program, corresponding to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. This approach also allows users to construct knowledge of API usage in GitHub-related repositories from a high-level perspective.
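The selection step can be sketched as follows, assuming the code-search results have already been retrieved. The `repository.full_name` nesting follows the GitHub code-search response format; the helper names and the sample items are hypothetical.

```python
import json
from collections import Counter

def tally_api_usage(search_items):
    """Count matching files per repository full name: each search item
    represents one file in which the user-specified API keyword appears."""
    return dict(Counter(item["repository"]["full_name"]
                        for item in search_items))

def write_stats(counts, path):
    """Write the keyword-frequency table to the local disk as JSON."""
    with open(path, "w") as f:
        json.dump(counts, f, indent=2)

# hypothetical search items for one keyword
items = [
    {"repository": {"full_name": "alice/resnet-demo"}},
    {"repository": {"full_name": "alice/resnet-demo"}},
    {"repository": {"full_name": "bob/vision-models"}},
]
print(tally_api_usage(items))  # {'alice/resnet-demo': 2, 'bob/vision-models': 1}
```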

We also provide multiple options for selecting APIs, stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the number of models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras applications library provides users with the ability to load deep learning models and instantiate a model with default weights. Take ResNet as an example:

• With pre-defined models
Users may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, and this model could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility for creating deep learning models, and they can all be used as sample keywords in STAMPER, such as:

1. keras.applications.resnet.ResNet50
2. keras.applications.resnet.ResNet101
3. keras.applications.resnet.ResNet152
4. keras.applications.resnet_v2.ResNet50V2
5. keras.applications.resnet_v2.ResNet101V2
6. keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to explore the components inside the ResNet class further, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, the data selection heuristic varies: deep learning users and experts can define their searches according to their interests and preferences.

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis and (iii) maintenance analysis.

The process of generating the visualizations from these three perspectives is illustrated in Figure 3.4. Chapter 5 gives an example of our collected repository metadata for deep learning models.

Figure 3.4: Overall Construction of the Visualizations (entities are functionally mapped to contribution-related, popularity-related and maintenance-related visualisations)

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped by week (with forks)

• Repository creation time vs. stars

Contribution

To further exploit the forking information, STAMPER supports comparison between the original repository and its forked repositories. The work could be further extended by visiting each forked repository's URL and tracing the commits.

As shown in Figure 3.5, the entity (E) we search for in GitHub may have multiple related repositories (Ri), each with corresponding forked repositories (Fi). Among the forked repositories, we denote a changed forked repository by Ci.

To examine whether changes exist in forked repositories, and the difference between multiple entities, we calculate the difference using the equation below.


Keyword                     Total Repositories (including Forks)    Total Original Repositories
ResNet tensorflow           6129                                    339
Bert tensorflow             13734                                   106
CNN tensorflow              39765                                   1000
LSTM tensorflow             19572                                   1000
Transformer tensorflow      7188                                    145
Wide and deep tensorflow    324                                     39

Table 3.1: Repositories Related to TensorFlow


Uniqueness percentage distributions are composed of all the percentages pi, each corresponding to an original repository Ri:

    pi = Σ Ci / Σ Fi    (3.2)
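Equation (3.2) can be sketched directly, assuming each fork of a repository is reduced to a boolean changed flag (a hypothetical representation for illustration, not STAMPER's actual data structure):

```python
def uniqueness_percentage(changed_flags):
    """Equation (3.2): p_i = (# changed forks C_i) / (# forks F_i)
    for one original repository R_i; changed_flags holds one boolean
    per fork, True when the fork's size differs from the original."""
    if not changed_flags:
        return 0.0
    return sum(changed_flags) / len(changed_flags)

# the Y/N/Y/Y example of Figure 3.5, as flags
print(uniqueness_percentage([True, False, True, True]))  # 0.75
```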

Figure 3.5: Examining Uniqueness after Forking

• Percentage of Forked Repositories Unique from Origin (Boxplots)

• Uniqueness Percentage Distribution for Each Entity (Histograms)

• Entropy Distribution for Each Entity (Histograms)

Maintenance

• Development Time Boxplot for Each Entity

• Open Issues Distribution for Each Entity

3.6 Summary

In this chapter we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analyzing software trends in the GitHub community. We also introduced and analyzed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving, being built, trained and deployed by researchers. Our tool is available for analyzing such changes. We collected the historical information stored in GitHub and extracted the repository metadata using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development has become a battlefield without gunpowder smoke: researchers, companies and developers are all competing for the dominant voice in deep learning. A variety of models exist to think in, but there is no common bridge connecting those ideas. Historical data in GitHub are obscure and hard to find, and as a result many developers, especially experienced ones, remain in their original comfort zone. With our study we hope to shed some light on deep learning and highlight a few suggestions for the public.

This section aims to answer questions related to both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, due to the few studies on popularity in GitHub, there is no standardized feature for measuring popularity. We analyze some potential features of each repository and hypothesize that popularity is strongly related to the number of stars a repository owns.

This decision is justified in the following section with more GitHub background.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching; however, watching does not imply being a collaborator [Git b]. A watcher may watch a repository to receive notifications for new pull requests or issues that are created. Watchers indicate how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

Figure 4.2: Star Sort Menu [Git a]

• Stars
Starring a repository makes it easy for users to keep track of repositories they are interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks
Forks are created when a user would like to make a copy of a repository. The user can fork a repository to suggest changes, or to use it as the basis for a new project.

Based on the data gathered in the data collection stage, we extract the watchers_count, stargazers_count and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star     forks_count    watchers_count    model name
17940    4661           17940             Bert
12405    3637           12405             Bert
5263     1056           5263              Bert

Table 4.1: Popularity metrics for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we should consider a method that does not assume normality.


Spearman Correlation Coefficient

Definition
The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman, 2011]. This statistical method allows us to test for a rank-order relationship between two numerical or ordinal variables associated by a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork and watcher) do not have a relationship with each other.

• H1: There is a relationship between the three variables.

Result

• Star vs. Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

• Star vs. Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

• Fork vs. Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Set α = 0.05; p1, p2 and p3 are all less than α. From the calculation above we also find a strong positive correlation, with values coef1 = 0.875, coef2 = 1.0 and coef3 = 0.875 respectively.

This means it is very unlikely that the testing data are uncorrelated (95% confidence), and thus we can reject the hypothesis that these variables are uncorrelated.

In the rest of the report, we consider the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, are two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from these models with a longer history, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as described in the background section.

The model development community has recently seen the release of multiple powerful frameworks that are treated as baselines for building models. However, the usage of many new models, such as the Wide and Deep model and the NCF model, has not grown in abundance.


Figure 4.4: Repositories with Forks (number of repositories created with forks, accumulated, 2015-2019, per model)

Figure 4.5: Repositories without Forks (number of repositories created, accumulated, 2015-2019, per model)


Figure 4.6: Repository Trend in GitHub for Each Model


Figure 4.7: Creation Time vs. Stars

A fork is another copy of a repository; the forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5 we can, surprisingly, see a considerable difference between the total number of repositories created including forks and the total number of original repositories created. Most of the repositories related to deep learning models are not original, which indicates that a considerable number of developers remain in a learning stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, unlike the previous summarizing method, we study the popularity of deep learning models over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the highest number of repositories created. The data bear this out: in 2017 the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued rising to a higher level, where it remains.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have among the most important and significant communities in the deep learning field. These networks are essential in both computer vision and NLP, and constitute an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer
As the title indicates, ResNet and Transformer usage has improved significantly in the last two years. Unlike earlier structures such as CNN, both modify the original structure and significantly improve results in computer vision and translation tasks.

Rising Star: BERT

However, no model comes with perfection. LSTM itself can be extended into many variants, and BERT is one of those.

The current trends depicted in the graph lead to the conclusion that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with the time the model came into existence, but our data tell a different story.

Though its paper was published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no fixed relationship between popularity (i.e., stars) and creation time; past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide and Deep model is similar: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the previous data also confirm that there is no significant rise in its use.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (fork-count distribution histograms, per model)


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name      Mean      STD       Min    25%    50%    75%    Max
Bert            498.65    2196.3    0      1      8      43     17940
CNN             106.84    611.97    2      3      8      32     13882
LSTM            48.82     214.22    0      1      2      13     2703
NCF             77.00     129.91    1      2      3      115    227
ResNet          46.88     221.43    0      0      1      8      2980
Transformer     186.79    1155.87   0      0      4      21     12408
Wide and Deep   16.23     36.80     0      0      1      8      146

Table 4.2: Stars Comparison

Model Name      Mean      STD       Min    25%    50%    75%    Max
Bert            128.21    585.93    0.0    0.0    1.0    16.5   4661.0
CNN             40.71     252.71    0.0    1.0    4.0    14.0   6274.0
LSTM            17.79     71.96     0.0    0.0    1.0    5.0    968.0
NCF             34.33     58.60     0.0    0.5    1.0    51.5   102.0
ResNet          17.44     93.75     0.0    0.0    0.0    3.0    1442.0
Transformer     53.52     336.10    0.0    0.0    1.0    6.0    3637.0
Wide and Deep   7.28      16.36     0.0    0.0    0.0    2.5    71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52) and NCF (34.33).

The three models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The three models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79) and ResNet (17.44).

Kruskal-Wallis Test: the Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The 7 models' star distributions are the same.

• H1: The 7 models' star distributions are different.

from scipy.stats import kruskal
stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                  dfNcf["star"].tolist(), dfResnet["star"].tolist(), dfTransformer["star"].tolist(),
                  dfWideDeep["star"].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories in GitHub. This unexpected result can be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building one's own Transformer or Bert model requires a large amount of time and effort, yet developers still show their interest in these novel deep learning models.

Figure 4.9: Stars vs. Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues and entropy, respectively.


Figure 4.10: Stars vs. Development Time

Figure 4.11: Stars vs. Open Issues

Figure 4.12: Stars vs. Entropy Value

Number of Contributors
From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Across all repositories, the top-3 repositories with the most stars per contributor are from the models CNN (1687.5 stars per contributor), Transformer (1551) and Bert (1550).

Model           Percentage of One-Contributor Development (%)
Bert            74.53
CNN             83.3
LSTM            85.9
NCF             100
ResNet          90.26
Transformer     81.20
Wide and Deep   89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time
From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value = 0). This suggests that the longer a model develops, the more stars it will have (i.e., the model becomes more popular). The top-2 repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues
From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can make the assumption that the more popular a repository becomes, the more issues it will have. We further investigate this correlation in the following section.

Entropy
From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of a new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution in a repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can examine this using Table 4.4: most of the deep learning related repositories (more than 70%) are developed by a single contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. We therefore introduce an information-theoretic approach to test whether contributions are evenly distributed.

Entropy

In particular, we compute the entropy H of each repository, defined as

    p_i = c_i / Σ_j c_j                  (4.1)

    H = − Σ_i p_i log₂(p_i)              (4.2)

where i indexes the contributors, c_i is the i-th contributor's contribution, and Σ_j c_j is the total contribution to the repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example:

Its contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

    Total = 174 + 36 + 4 = 214                                   (4.3)

    p₁ = 174/214,  p₂ = 36/214,  p₃ = 4/214                      (4.4)

    H(repository) = −(174/214 log₂(174/214) + 36/214 log₂(36/214)
                      + 4/214 log₂(4/214)) ≈ 0.7826              (4.5)

name            contribution
dragen1860      174
ash3n           36
kelvinkoh0308   4

Table 4.5: Sample contributions to one repository

The resulting distribution of entropy over all repositories can be used to determine whether a repository is developed unevenly: the lower the entropy, the more concentrated the contributions, and hence the more unevenly the work is distributed.
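Equations 4.1–4.2 and the worked example translate directly into a few lines of Python; this is an illustrative sketch, not STAMPER's actual implementation:

```python
import math

def contribution_entropy(contributions):
    """Shannon entropy (bits) of one repository's per-contributor
    contribution counts (Equations 4.1 and 4.2)."""
    total = sum(contributions)
    return -sum((c / total) * math.log2(c / total)
                for c in contributions if c > 0)

# worked example, Table 4.5 (dragen1860/TensorFlow-2x-Tutorials):
contribution_entropy([174, 36, 4])   # ≈ 0.783 bits
# one dominant contributor -> entropy near 0;
# an even three-way split -> log2(3) ≈ 1.585 bits
contribution_entropy([50, 50, 50])   # ≈ 1.585 bits
```

The maximum possible entropy grows with the number of contributors (log₂ n for n equal contributors), which is why values near zero indicate single-developer dominance.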

Figure 4.13 shows the distribution of the entropy value for all models. From these figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


[Figure 4.13: Collaboration Entropy — histograms of binned entropy values (0.0–3.0) per model: bert, cnn, lstm, ncf, resnet, transformer, wide deep (TensorFlow).]


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models studied.

[Figure 4.14: Percentage of Forked Repositories Unique From Origin — boxplot of unique_percent (0–100%) per model.]

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15. Most of the model repositories are not changed after forking. A more detailed look shows at a glance not only that changes are rarely made after forking, but also that most changes alter the


[Figure 4.15: Repository Uniqueness Distribution (%) — histograms of binned uniqueness percentage (0.00–1.00) per model.]

[Figure 4.16: Repository Change Statistic — histograms of binned size changes (−2500 to 2500) per model.]


repository size by only 0 to 100 bytes relative to the original, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them of less concern to people. Second, a model may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that development across forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.
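The headline numbers behind Figures 4.15–4.16 reduce to a simple summary over the per-fork size diffs collected earlier; a minimal sketch (the list-of-diffs input format is illustrative, not STAMPER's actual schema):

```python
def unchanged_fraction(size_diffs):
    """Fraction of forked repositories whose size difference from the
    origin repository is exactly zero, i.e. forks with no changes."""
    if not size_diffs:
        raise ValueError("no forks to summarise")
    return sum(1 for d in size_diffs if d == 0) / len(size_diffs)

# five hypothetical forks of one repository, three left untouched:
unchanged_fraction([0, 0, 120, 0, -300])  # 0.6
```

Applied per model, this fraction is what the "uniqueness" percentages in this section report (a unique fork being one with a non-zero diff).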

4.3 Maintenance of Deep Learning Models in GitHub

In this section we survey software maintenance problems in these deep learning related repositories. The overall purpose is to explore three sets of factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

    age = T(updated_at) − T(created_at)              (4.6)
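Equation 4.6 maps directly onto the ISO-8601 created_at / updated_at timestamps in the repository metadata returned by the GitHub API; a minimal sketch (the example timestamps are hypothetical):

```python
from datetime import datetime

def repo_age_days(created_at, updated_at):
    """Repository age in whole days, per Equation 4.6, from the
    ISO-8601 timestamps in GitHub repository metadata."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    t_created = datetime.strptime(created_at, fmt)
    t_updated = datetime.strptime(updated_at, fmt)
    return (t_updated - t_created).days

# hypothetical repository metadata:
repo_age_days("2018-10-17T03:00:00Z", "2019-01-04T03:00:00Z")  # 79
```

Note this measures time between creation and last update, so a long-dormant repository stops "aging" once updates cease.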

Figure 4.17 and Table 4.6 show how development time varies by model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). A Kruskal-Wallis test shows that the distribution of development days differs across models (p-value ≤ 0.05). We therefore hypothesize that many of the earlier models began using the open-source web community immediately after their first release.

40 STAMPER in Action

Model         Max     Q3       Median   Q1       Min
Bert          779     229      110      32       0
Transformer   1254    321      142      11       0
Wide deep     1107    575      117      0.5      0
ResNet        1360    456.5    120      15       0
NCF           1120    476      216      8        0
LSTM          1812    621.25   315.5    47.25    0
CNN           1385    699.25   483      270.25   0

Table 4.6: Repository development time statistics (days)

[Figure 4.17: Development Time Boxplot — boxplot of development time in days (0–2000) per model.]


[Figure 4.18: Development Time vs. Number of Open Issues — scatter plot of development duration against open issues, per model.]

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As the figure and a Spearman correlation test suggest, there is a moderate positive correlation between these two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, which are costlier to maintain, may have more users and therefore more issues.

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model         Mean    Std      25%   50%   75%   Min   Max
Bert          8.299   50.55    0     0     1     0     504
CNN           3.414   35.456   0     0     1     0     1077
LSTM          1.292   4.915    0     0     1     0     69
ResNet        1.791   11.164   0     0     0     0     186
Transformer   1.857   8.608    0     0     1     0     95
Wide Deep     0.231   0.742    0     0     0     0     4

Table 4.7: Repository open-issue statistics


Model-Related Repository   Repositories having a wiki (%)
Bert                       97.17
CNN                        98.498
LSTM                       98.799
NCF                        98.864
ResNet                     98.817
Transformer                96.97
Wide deep                  100

Table 4.8: Percentage of repositories with a wiki

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%; from the data collected, deep learning related repositories are therefore well documented (i.e. they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All these samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this chapter we first used STAMPER to get the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the data collected with STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported maintenance metrics for deep learning repositories and conducted an in-depth investigation of whether developers change the original codebase after forking.


[Figure 4.19: Open Issues vs. Number of Repositories — histograms of binned open-issue counts (0–100) per model.]


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas about identifying models using different strategies, and we developed heuristics for an in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question which needs further investigation in the future (for example, users may publish their models in prototxt format). In this project we focused only on deep learning models constructed using Python. Our findings may also reflect sampling limits: the present experiment uses a limited number of repositories, since a GitHub search cannot return more than 1000 original repositories. We tried to overcome this issue using the different sorting strategies GitHub provides, but this still cannot capture all the repositories on GitHub. Other, more stratified samples might yield a more precise outcome.

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real lives, an idea that is novel compared with previous work.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so the program could provide a broader picture of deep learning model usage in the world. Our program also lets developers and users devise their own heuristics for data selection: experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or becoming a plugin for GitHub, allowing researchers



and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends on GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models via the number of repositories on GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g. k-means) to high-resolution time series data from commits.

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub's deep learning related repositories and identified factors affecting each of these dimensions. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the tool's ability to help users gain a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models and datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores lets developers learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing, and visualizing the use of various models and datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model and dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated this program is my own original work

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh
1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm 2019.1.3 (Professional Edition)
  Build PY-191.7479.30, built on May 30, 2019; licensed to ANU / Xing Yu
  JRE: 11.0.2+9-b159.60 x86_64; JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
  macOS 10.14.6
• Anaconda
  – jupyter-notebook 6.0.0
• Other
  – Python 3.7.4
  – pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, matplotlib==2.2.2, requests, altair, selenium
  – Git

Datasets

asc_generalbertjson lstmjson resnetjson wide deepjsoncnnjson ncfjson transformerjson

desc_generalbertjson lstmjson resnetjson wide deepjsoncnnjson ncfjson transformerjson

desc_by_starbert tensorflowjson lstm tensorflowjson wide deep tensorflowjsonresnet tensorflowjson transformer tensorflowjsoncnn tensorflowjson ncf tensorflowjson

asc_by_starcnn tensorflowjson lstm tensorflowjson

pytorch_modelsAlexNetjson HarDNetjson ResNet101jsonShuffleNet v2json U-NetjsonDCGANjson Inception_v3jsonResNext WSLjson SqueezeNetjson WaveGlowjsonDensenetjson MobileNet v2json ResNextjson

54 Appendix

Wide ResNetjson Tacotron 2jsonFCN-ResNet101json PGANjson RoBERTajsonTransformerjson fairseqjsonGoogleNetjson ResNetjsonvgg_netsjson SSDjson U-Net pytorchjson

by_update_timebert tensorflowjson lstm tensorflowjsonresnet tensorflowjson wide deep tensorflowjsoncnn tensorflowjson ncf tensorflowjsontransformer tensorflowjson

filtered_repotensorflow_model_filteringbertjson lstmjsonncfjson resnetjsontransformerjson wide deepjson

filtered_repopytorch_model_filteringDensenetjson GoogleNetjsonResNet101json ShuffleNet v2jsonTacotron 2json vgg_netsjsonFCN-ResNet101json MobileNet v2jsonResNextjson SqueezeNetjson Wide ResNetjson

forked_timestampbert tensorflowcsv lstm tensorflowcsvresnet tensorflowcsv wide deep tensorflowcsvcnn tensorflowcsv ncf tensorflowcsvtransformer tensorflowcsv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

• Before You Begin
• Prerequisites
• Install
• Running
• Test
• High Level Description of all Modules & Datasets
• Authors
• License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view, edit and execute our code:

• PyCharm
• Anaconda
• Amphetamine (Mac App Store): keeps the Mac awake (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

• Git (https://git-scm.com/downloads) and a GitHub authentication token
• Python 3.7 with pip
• Jupyter Notebook 6.0.0
• All external libraries used, listed in requirements.txt

Install

Install the dependencies: make sure you have the latest pip, then install the list of requirements specified in requirements.txt.

Running

All the code scripts run from the root.

1. Data Collection

Clone our project, then run `python3 model_searcher.py` to fetch keyword-related repository metadata from GitHub into the `output` folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions `call_api` and `call_api_dir`.

Then run `sh JSONFormatter.sh` in your terminal to well-format the output data.

Sample case: in `main()`, change `keywords` to the terms of interest; the resulting JSON file will be `output/bert.JSON`. The sorting method can be customized in the functions `get_total_pages` and `request_ith_page`: `sort` can be `updated` or `stars`, and `order` can be `asc` or `desc`.

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to collect all the fork timestamps into `forked_timestamp`.

3. Data Selection (Optional)

If you have not installed the dependencies yet:

    pip3 install --upgrade pip
    pip3 install -r requirements.txt

Run `python3 repository_filter.py` to get the code-related repositories with statistics in the `filtered_repo` folder, then run `python3 filtered_repo.py` to filter your data.

Note: keywords can be customized in `model_keyword.py`. We store all previous experiment data in `tensorflow_model_filtering` and `pytorch_model_filtering`.

4. Data Visualization

• Popularity: run `python3 visualizations/popularity.py`; graphs are written to `visualizations/graphs/popularity`.
• Maintenance: run `python3 visualizations/maintenance.py`; graphs are written to `visualizations/graphs/maintenance`.
• Contribution: run `python3 visualizations/contribution.py`; graphs are written to `visualizations/graphs/contribution`.
• Multi Correlations: run `python3 visualizations/multi_variable.py`; graphs are written to `visualizations/graphs/multi_variable`.

Test

Some GitHub repositories are not well maintained, and their links are sometimes broken and unreachable. To guarantee the best experience using our tool, we provide a testing unit for GitHub links in `test.py`. This module records all unreachable links and writes them into the file `unreachable_urls.txt`.

Usage: change the elements in `keywords`, then run `python3 test.py`. All unreachable links will be written to `unreachable_urls.txt`.

Customizing Your Own Search

In the module `Model.py`, define your own entity lists (e.g. `tensorflow_models`). The constructor `Model` stores all unfiltered data, filtered data and forked-time locations in three folders.

Instantiation: once you have the data from the previous steps (1–2), construct a model by calling the constructor, e.g. `bert = Model("bert tensorflow", "desc_by_star")`, whose parameters are the model name and the repository-metadata subfolder. You can then use the object with its related data easily (`from Model import bert` and use `bert` as you go along).

Customize keywords: in the module `model_keyword.py`, import your instantiation (e.g. `lstm`) and call `add_keywords`.

For example:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw the graphs)

Experiment Datasets Collected:

1. After Data Collection (`output/`):

    output
    ├── asc_by_star: cnn tensorflow.json, lstm tensorflow.json
    ├── asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    ├── by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    ├── desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    ├── desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    └── pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search (`forked_timestamp/`): bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv


3. After Data Selection, Optional (`filtered_repo/`):

    filtered_repo
    ├── bert.json
    ├── pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
    └── tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs (`graphs/`):

    graphs
    ├── contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
    ├── maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
    ├── multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
    └── popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu. Stamper image from http://bestpriceforrubberstamps.com (license: free for personal use only).

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and van Deursen, A., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9(8), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521(7553), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
      • Deep learning models
      • Summarized Timeline
    • Public Code Repositories
      • Web-based hosting service
      • Measuring Popularity From GitHub
      • Extracting Messy Data in the Wild
      • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of model changed over time? A closer look at the deep learning models
    • RQ2: How popularity varies per model
    • RQ3: Does the popularity of models relate to other features?
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commitments Timestamp
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Softwares
      • Other
      • Datasets
  • Appendix 4: README


2.2.3 Extracting Messy Data in the Wild

GHTorrent
GHTorrent [Gousios and Spinellis 2012] is designed to conduct independent repository mining through the peer-to-peer BitTorrent protocol. It uses the REST API to navigate all the repositories, and the resulting dataset can be used to answer multiple empirical research questions in a scalable manner.

Our project extends their idea of retrieving raw data based on the results returned from the REST API. However, their tool cannot visualise the metadata or offer high-level trend analysis.

MetricMiner
A similar tool is MetricMiner [Sokol et al. 2013], a web application that supports researchers in mining software repositories, extracting data, and drawing statistical inferences from the data collected. The tool automatically clones a repository, processes the metadata, and stores the data in the cloud, which gives it good scalability and fast query answering without users installing any software on their localhost.

GitcProc
GitcProc [Casalnuovo et al. 2017] is a tool that uses regular expressions to extract the changed lines within a repository, to facilitate answering project-evolution questions. GitcProc can retrieve and summarise global project metrics, including the number of commits, commit dates, and contributors. It can measure how many changes have taken place in Java projects and is also able to locate the changed files.

RepoVis
RepoVis [Feiner and Andrews 2018] is a recent tool that provides visual overviews for software maintained in Git repositories. RepoVis is a client-server web application that provides full-text search for terms of interest within a software project.

Inspired by RepoVis, our project adapts its searchable functionality to GitHub, combined with a code-based search. All the visualizations are written out in SVG format.

2.2.4 Visualizing data in Repositories

Chronos
CHRONOS [Servant and Jones 2013] is a software tool that enables visualisation of historical changes to software source code. It implements a zoomable user interface over the actual code, supporting developers from a high-level view (pattern recognition) down to a low-level view (reverse engineering) using a History Slicing approach.

Specifically, for selected lines of code, CHRONOS can visualise the complete history of change, including the revisions that modified them. Inspired by this tool, our project uses this visualisation method to track the historical change of the popular


Figure 2.1: git2net [Gote et al. 2019]

trend related to the keywords specified by users in GitHub.

GEVOL
Collberg et al. [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph-drawing technique, to deduce a better understanding of a program from its development history; all visualisations are displayed using a temporal graph visualizer.

This system aids in the discovery of the structure of the system and provides the user with a new way to observe the evolution of a program by visualising changes to the system. It extracts information about Java programs stored within a CVS version control system into three types of graphs: inheritance, control-flow, and call-graphs.

git2net
git2net [Gote et al. 2019] is a software package that facilitates the extraction of co-editing networks from git repositories. Similar to GEVOL, it uses text-mining techniques to analyse the history of modifications within files. Moreover, it addresses the importance of studying social networks in GitHub and gives the reader a broader view of the application of graph-based data analysis and modelling. The tool constructs a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter we detailed the web-based hosting service we selected (GitHub) and presented the concept of deep learning with two popular frameworks, together with state-of-the-art neural network models. In the next chapter we elaborate on how we design and implement STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter we outline our design and implementation for data extraction, and then detail the metrics we use to estimate trends in deep learning framework and model usage on GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

[Figure omitted: STAMPER pipeline. (1) Data Collection gathers repositories matching keywords (model names, keyword 1 … keyword n) through the Git Project Search API; (2) Repository Search processes the collected repositories; (3) an optional Data Selection step queries the Git Code Search API; results are stored locally and visualised.]

Figure 3.1: Overview of STAMPER


Data Collection
We first collect all repository metrics through the GitHub API; this step allows us to extract the history of all repositories related to the keyword and record the metadata for each repository (repository-based search).

Repository Search
As each repository contains multiple contributors and forks, we subsequently extract and analyze whether changes were made, based on the size information, and calculate a collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the forked repositories' information to create visual representations.
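The collaborative factor mentioned above is an entropy over developers' contributions; the report does not spell out the exact formula, so the sketch below assumes Shannon entropy over per-developer contribution shares (the function name is ours):

```python
import math

def contribution_entropy(contributions):
    """Shannon entropy (in bits) of a repository's contribution distribution.

    `contributions` holds the number of contributions per developer.
    A single-author repository has entropy 0; the more evenly the work is
    spread across developers, the higher the entropy.
    """
    total = sum(contributions)
    shares = [c / total for c in contributions if c > 0]
    return -sum(p * math.log2(p) for p in shares)

# Two developers contributing equally yield 1 bit of entropy.
print(contribution_entropy([50, 50]))  # 1.0
```

A repository dominated by one contributor therefore scores near zero, while an evenly shared project scores higher, which is what makes the value usable as a collaboration measure.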

Data Selection
We implemented a selector that allows excluding specific repositories unrelated to the desired ones. The selector summarizes the frequency counts for the keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis
Since each forked repository may involve re-development and modification, analysis of modifications to forked repositories is included in our project. For all changed repositories, the size difference is analyzed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to deeply analyze and manipulate the data and even run statistical tests on the data set. To better understand these metrics, we divided them into multiple categories. For attributes that are not primary data from the GitHub API, we explain them in the data-expansion part and label them as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering an OAuth2 token at the start of the program. After authentication, the user can make up to 5,000 requests per hour; otherwise, the rate limit allows only up to 60 requests per hour [Git d].
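A minimal sketch of this authentication step (the header format follows GitHub's documented token scheme; the function names are ours, not STAMPER's):

```python
def auth_headers(token=None):
    """Build GitHub REST API request headers.

    Supplying an OAuth2 token lifts the rate limit from 60 to
    5,000 requests per hour.
    """
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = "token " + token
    return headers

def hourly_request_limit(token=None):
    """Requests per hour permitted for the given credentials."""
    return 5000 if token else 60

print(hourly_request_limit())          # 60 (unauthenticated)
print(hourly_request_limit("x" * 40))  # 5000 (authenticated)
```

The headers would then be attached to every API request the crawler makes.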


Type          Meta-data
-----------   ----------------------------------------
Contributor   contribution: int [Data Expansion]
              login (user name): String
              type (user / organization): String
              contributors_url
Repository    created_at
              description
              full_name
              language
              size
Popularity    fork: Boolean
              forks: int
              forks_url
              stargazers_count
              watchers_count
              unique_repos [Data Expansion]
Owner         id
              login (username)
              type
Maintenance   has_issues: Boolean
              has_wiki: Boolean
              open_issues: int
              pushed_at
              updated_at
              score

Table 3.2: Repository metadata collected through the GitHub API

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deep crawling operations to collect raw data.

• Contribution
One repository generally involves multiple developers conducting software development. The project owner is not necessarily the person who contributes the largest amount of code, and the contributions made by different developers are potentially unequal. As a result, we further track this information by utilizing the GitHub API and record the number of contributions each developer made to each repository.

• Unique_repos
Popular repositories may have numerous related forked repositories owned by other users, and the reasons behind the forking behaviour vary. Our research explores whether developers conduct subsequent development based on the original codebase. By comparing the size of the forked


repository (Fi) and the original repository (O), we obtain all the forked repositories with a change of size (c):

    Fi + c = O    (3.1)
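Equation 3.1 amounts to computing a size delta c = O − Fi for every fork; a sketch under that reading (function names are ours):

```python
def fork_size_deltas(original_size, fork_sizes):
    """Return c for each fork F_i such that F_i + c = O (Equation 3.1)."""
    return [original_size - fork for fork in fork_sizes]

def changed_forks(fork_sizes, original_size):
    """Forks whose recorded size differs from the original repository's."""
    return [fork for fork in fork_sizes if fork != original_size]

# A fork the same size as the original (1200) has delta 0; the others changed.
print(fork_size_deltas(1200, [1200, 1350, 980]))  # [0, -150, 220]
```

A non-zero delta is the signal used later to mark a fork as "changed" from its origin.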

3.4 Data Selection

[Figure omitted: an entity (model) and its API keywords are searched within repositories to produce statistics.]

Figure 3.2: Data Selection

[Figure omitted: unfiltered forked-repository data with timestamps is filtered by model-related keywords (e.g. Bert, ResNet, CNN), grouped, and written to the local disk.]

Figure 3.3: Store in Local Disk


Figure 3.2 represents our method for searching API usage in DL model-related repositories. GitHub provides a REST API to examine whether a repository contains the targeted API or not.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. Moreover, the appearance frequency of each user-specified API is embedded directly within the returned result and can be matched in our program to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. This approach also allows users to build a high-level picture of API usage across GitHub repositories.
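As an illustration of this counting step (the repository names and keyword below are hypothetical, and the code-search matches are assumed to have been fetched already):

```python
import json
from collections import Counter

def summarize_api_usage(search_hits):
    """Count keyword appearances per repository full name.

    `search_hits` is one (repo_full_name, keyword) pair per code-search
    match; the result maps each repository to its match frequency.
    """
    return dict(Counter(repo for repo, _keyword in search_hits))

hits = [("alice/bert-demo", "ResNet50"),
        ("alice/bert-demo", "ResNet50"),
        ("bob/vision", "ResNet50")]
summary = summarize_api_usage(hits)
print(json.dumps(summary))  # this JSON is what would be written to local disk
```

Keying the counts by the repository full name is what lets the frequencies later be joined back to the repository metadata collected in the first stage.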

Meanwhile, we also provide multiple options for selecting APIs, stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: The Keras applications library provides users the ability to load deep learning models and instantiate a model with default weights. Take ResNet as an example.

• With pre-defined models
Users may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, which can be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they can all be used as sample keywords in STAMPER, such as:

    keras.applications.resnet.ResNet50
    keras.applications.resnet.ResNet101
    keras.applications.resnet.ResNet152
    keras.applications.resnet_v2.ResNet50V2
    keras.applications.resnet_v2.ResNet101V2
    keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to further explore the components inside the ResNet class, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, the data selection heuristic varies. Deep learning

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py


users and experts can define their searches according to their interests and preferences.
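One plausible heuristic for the self-defined case (our illustration, not STAMPER's exact rule) is to scan source text for class definitions whose name starts with the model name:

```python
import re

def count_self_defined(source, model_name="ResNet"):
    """Count class definitions whose name begins with `model_name`."""
    pattern = re.compile(r"class\s+" + re.escape(model_name) + r"\w*\s*[(:]")
    return len(pattern.findall(source))

sample = """
class ResNet50(tf.keras.Model):
    pass

class ResNetBlock:
    pass
"""
print(count_self_defined(sample))  # 2
```

Users with different interests could swap in their own patterns, which is exactly the flexibility the data selection heuristic is meant to allow.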

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from these three perspectives is illustrated in Figure 3.4. Meanwhile, Chapter 5 gives an example of our collected repository metadata for deep learning models.

[Figure omitted: entities 1 … n are functionally mapped to popularity-related, contribution-related, and maintenance-related visualisations.]

Figure 3.4: Overall Construction of the Visualizations

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped by week (with forks)

• Repository creation time vs. stars

Contribution

To further exploit the forking information, STAMPER supports comparison between the original repository and its forked repositories. The work could be further extended by visiting each forked repository's URL and tracing the commits.

As shown in Figure 3.5, the entity (E) we search for in GitHub may have multiple related repositories (Ri) with corresponding forked repositories (Fi). Among the forked repositories, we call a changed forked repository Ci.

To examine whether changes exist in forked repositories, and how this differs between entities, we calculate the difference using the equation below (Equation 3.2).


Keyword                     Total Repositories Collected    Total Original Repositories
                            (including Forks)               Collected
ResNet tensorflow           6129                            339
Bert tensorflow             13734                           106
CNN tensorflow              39765                           1000
LSTM tensorflow             19572                           1000
Transformer tensorflow      7188                            145
Wide and deep tensorflow    324                             39

Table 3.1: Repositories Related to Tensorflow


Uniqueness percentage distributions are composed of all the percentages pi, each corresponding to its original repository Ri:

    pi = (∑ Ci) / (∑ Fi)    (3.2)
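Equation 3.2 gives, per original repository, the share of its forks that changed; a minimal sketch under that reading (the function name is ours):

```python
def uniqueness_percentage(changed_flags):
    """p_i = (number of changed forks C_i) / (number of forks F_i)."""
    if not changed_flags:
        return 0.0
    return sum(changed_flags) / len(changed_flags)

# Three of this repository's four forks diverged from the original.
print(uniqueness_percentage([True, False, True, True]))  # 0.75
```

Collecting one such p_i per original repository yields the uniqueness distribution plotted in the histograms below.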

[Figure omitted: an entity (E) with related repositories 1–4; their forked repositories 1, 2, 3, …, n are marked changed Y/N/Y/Y.]

Figure 3.5: Examine Uniqueness after Forking

• Percentage of forked repositories unique from origin (boxplots)

• Uniqueness percentage distribution for each entity (histograms)

• Entropy distribution for each entity (histograms)

Maintenance

• Development time boxplot for each entity

• Open issues distribution for each entity

3.6 Summary

In this chapter we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of the original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analyzing software trends in the GitHub community. Meanwhile, we introduced and analyzed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving and are built, trained, and deployed by researchers. Our tool is available for analyzing such changes: we collected the historical information stored in GitHub and extracted the metadata of each repository using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development is a battlefield without the smoke of gunpowder. Researchers, companies, and developers are all competing for a dominant voice in deep learning. A variety of models exist to think in, but at the same time there is no common bridge connecting those ideas. Historical data in GitHub is obscure and hard to find, and as a result many developers, especially experienced ones, remain in their original zone. With our study we hope to shed some light on deep learning use and highlight a few suggestions for the public.

This section aims to answer some questions related to both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, given the few studies about popularity in the GitHub ecosystem, there is no standardized feature to measure it. We analyze some potential features of each repository and hypothesize that popularity is strongly related to the stars each repository owns.

This decision is justified in the following section with more background on GitHub.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching; watching, however, does not imply being a collaborator [Git b]. A watcher can watch



Figure 4.2: Star Sort Menu [Git a]

a repository to receive notifications for new pull requests or issues that are created. Watchers indicate how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

• Stars
Starring a repository makes it easy for users to keep track of repositories they find interesting. Starred repositories appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks
Forks are created when a user would like to make a copy of a repository. The user can fork a repository and suggest changes, or use it as the basis for a new project.

Based on the data gathered in the data collection stage, we extract the watchers_count, stargazers_count, and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star     forks_count    watchers_count    model name
17940    4661           17940             Bert
12405    3637           12405             Bert
5263     1056           5263              Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks, and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we should not assume normality.


Spearman Correlation Coefficient

Definition. The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman 2011]. This statistical method allows us to test for a rank-order relationship between two numerical or ordinal variables associated by a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork, and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

    from scipy.stats import spearmanr
    coef1, p1 = spearmanr(star, forks)
    print(coef1, p1)
    >> 0.8752903811064278 0.0

• Star vs Watcher

    coef2, p2 = spearmanr(star, watchers)
    print(coef2, p2)
    >> 1.0 0.0

• Fork vs Watcher

    coef3, p3 = spearmanr(forks, watchers)
    print(coef3, p3)
    >> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Set α = 0.05. The p-values p1, p2, and p3 are all less than α; meanwhile, from the calculation above we also find a strong positive correlation, with values of coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875, respectively.

This means it is very unlikely (at the 95% confidence level) that the data are uncorrelated, and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of the report, we treat the number of stars as a proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and long short-term memory network (LSTM), shown in the figures below, appear to be two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from these longer-established models, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as described in the background section.

The model development community recently saw the release of multiple powerful frameworks that are treated as baselines for building models. However, for many new models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.


[Figure omitted: accumulated weekly counts of repositories created (including forks) for each model keyword, 2015–2019.]

Figure 4.4: Repositories with Forks

[Figure omitted: accumulated weekly counts of repositories created (excluding forks) for each model keyword, 2015–2019.]

Figure 4.5: Repositories without Forks


[Figure omitted: per-model counts of repository creation over time, October 2015 to October 2019.]

Figure 4.6: Repository Trend in GitHub For Each Model


[Figure omitted: scatter plot of repository creation time against number of stars for each model keyword.]

Figure 4.7: Creation Time vs Stars

A fork is another copy of a repository. The forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories created including forks and the total number created without forks. We find that most repositories related to deep learning models are not original, which indicates that a considerable number of developers remain in a studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, in contrast to the previous summarizing method, we study the popularity of deep learning models as it changes over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the highest number of repositories created. Let us examine this observation using the data: in


2017 the number of created repositories gradually increased for CNN, and in 2018–2019 the creation trend continued rising to a higher level, where it remains.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have two of the most important and largest communities in the deep learning field; these networks are essential in both computer vision and NLP, where they are the overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly in the last two years. Differing from earlier structures, both modify the original structure and significantly improve results in computer vision and translation tasks.

Rising star: BERT

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends, as depicted in the graph, suggest that deep learning models are proliferating fast with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with how long the model has existed, but our data tells a different story.

Published in 2016, NCF draws the least attention in the GitHub community. This shows that there is no simple relationship between popularity (i.e., stars) and creation time; past trends are no guarantee of future ones, and the momentum of increasing attention for a specific deep learning model could flatten out or reverse itself.

Similarly for the Wide and Deep model: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of it. Moreover, the previous data confirms there has been no significant rise in the use of this model, published in 2016.

[Figure omitted: histograms of the binned fork-count distribution for each of the seven model keywords.]

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development


4.1.4 RQ2: How popularity varies per model

Figure 4.8 shows the distribution of the number of forks for each model, and as presented in Table 4.2 and Table 4.3 we can see that:

Model Name     Mean     STD       Min   25%   50%   75%    Max
Bert           498.65   2196.3    0     1     8     43     17940
CNN            106.84   611.97    2     3     8     32     13882
LSTM           48.82    214.22    0     1     2     13     2703
NCF            77.0     129.91    1     2     3     11.5   227
ResNet         46.88    221.43    0     0     1     8      2980
Transformer    186.79   1155.87   0     0     4     21     12408
Wide and Deep  16.23    36.80     0     0     1     8      146

Table 4.2: Stars Comparison

Model Name     Mean         STD          Min   25%   50%   75%    Max
Bert           128.214953   585.926617   0.0   0.0   1.0   16.5   4661.0
CNN            40.710       252.713617   0.0   1.0   4.0   14.0   6274.0
LSTM           17.793       71.956709    0.0   0.0   1.0   5.0    968.0
NCF            34.333333    58.603185    0.0   0.5   1.0   51.5   102.0
ResNet         17.442478    93.754994    0.0   0.0   0.0   3.0    1442.0
Transformer    53.518797    336.103826   0.0   0.0   1.0   6.0    3637.0
Wide and Deep  7.282051     16.364192    0.0   0.0   0.0   2.5    71.0

Table 4.3: Forks Comparison

The top-3 model whose repositories has the highest average number of stars areBert (49865) Transformer (18679) and CNN (10684) The top-3 model whose reposi-tories has the highest average number of forks are Bert(12821) Transformer(5352)and NCF (3433)

The three models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The three models whose repositories have the lowest average number of forks are Wide and Deep (7.28), ResNet (17.44) and LSTM (17.79).

Kruskal-Wallis test. The Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen:

• H0: the 7 models' distributions are the same

• H1: the 7 models' distributions are different

from scipy.stats import kruskal

stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(), dfLstm['star'].tolist(),
                  dfNcf['star'].tolist(), dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                  dfWideDeep['star'].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories existing in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or Bert model may require a large amount of time and effort, yet developers still show their interest in these novel deep learning models by starring and forking them.


Figure 4.9: Stars vs. Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy value, respectively.



Figure 4.10: Stars vs. Development Time


Figure 4.11: Stars vs. Open Issues


Figure 4.12: Stars vs. Entropy Value

Number of contributors. From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, p-value ≤ 0.01). Across all the repositories, the top-3 repositories with the most stars per contributor are from the models CNN (1687.5 stars/contributor), Transformer (1551 stars/contributor) and Bert (1550 stars/contributor).
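The Spearman tests reported here follow the usual scipy pattern; a self-contained sketch with made-up star and contributor counts (illustrative values, not the study's data):

```python
from scipy.stats import spearmanr

# Hypothetical per-repository star and contributor counts (illustrative only)
stars        = [0, 1, 8, 43, 500, 17940]
contributors = [1, 1, 2, 3, 9, 30]

# rho is the Spearman rank-correlation coefficient, p its significance
rho, p = spearmanr(stars, contributors)
print(rho, p)
```

For this monotone toy data rho is close to 1; on the real repository metadata the same call gives the ρ and p-values quoted in this section.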

Model         | Percentage of One-Contributor Development (%)
Bert          | 74.53
CNN           | 83.3
LSTM          | 85.9
NCF           | 100
ResNet        | 90.26
Transformer   | 81.20
Wide and Deep | 89.74

Table 4.4: Percentage of one-contributor development for DL related repositories
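A figure like those in Table 4.4 can be derived from per-repository contributor counts; a small sketch (the helper function is ours, not part of STAMPER):

```python
def one_contributor_pct(contributor_counts):
    """Share (%) of repositories developed by exactly one contributor."""
    solo = sum(1 for n in contributor_counts if n == 1)
    return 100.0 * solo / len(contributor_counts)

# Five hypothetical repositories, three of them single-developer
print(one_contributor_pct([1, 1, 1, 2, 5]))  # 60.0
```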


Development time. From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value = 0). This suggests that the longer a model has been developed, the more stars it will have (i.e., the model becomes more popular). The two models with the longest development durations are LSTM and CNN.

Open issues. From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186, p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, this supports the assumption that the more popular a repository becomes, the more open issues it will have. We further investigate this correlation in the following section.

Entropy. From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly in developing a deep learning model. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development work is not distributed evenly.

We can examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether contributions are evenly distributed.

Entropy. In particular, we compute the entropy H of each repository, defined as:

p_i = c_i / ∑_i c_i        (4.1)

H = −∑_i p_i log₂(p_i)        (4.2)

where i indexes the i-th contributor, c_i is the i-th contributor's contribution, and ∑_i c_i is the total contribution for one repository.

Take the repository dragen1860/TensorFlow-2x-Tutorials as an example.

Its contribution table is summarized in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214        (4.3)

p₁ = 174/214,  p₂ = 36/214,  p₃ = 4/214        (4.4)

H(repository) = −(174/214 · log₂(174/214) + 36/214 · log₂(36/214) + 4/214 · log₂(4/214)) ≈ 0.7826        (4.5)

Name          | Contribution
dragen1860    | 174
ash3n         | 36
kelvinkoh0308 | 4

Table 4.5: Sample Contributions to One Repository
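As a sketch, Equations 4.1–4.2 and the worked example above can be checked with a few lines of Python (the function name is ours, not part of STAMPER):

```python
import math

def repo_entropy(contributions):
    """Collaboration entropy: H = -sum_i p_i * log2(p_i), with p_i = c_i / sum(c)."""
    total = sum(contributions)
    return -sum((c / total) * math.log2(c / total) for c in contributions if c > 0)

# Contribution counts from Table 4.5 (dragen1860: 174, ash3n: 36, kelvinkoh0308: 4)
print(round(repo_entropy([174, 36, 4]), 4))  # 0.7826
```

A single-contributor repository gives H = 0, the fully even k-contributor case gives H = log₂(k), so low entropy flags uneven work division.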

The resulting distribution of entropy over all the repositories can be used to determine whether a repository is developed unevenly: the lower the entropy, the more unevenly the work is distributed.

Figure 4.13 shows the distribution of the entropy value for all models. From these figures we can see that most of the repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.



Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models.


Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15. Most of the model repositories are not changed after forking. In more detail, we can see at a glance not only that changes are rarely made after forking, but also that most changed



Figure 4.15: Repository Uniqueness Distribution (%)


Figure 4.16: Repository Change Statistics


repositories differ from the original repository by only 0 to 100 bytes in size, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long; the lack of tutorials and attention makes them of less concern to people. Second, the model itself may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey software maintenance problems in these deep learning related repositories. The overall purpose is to explore three sets of factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)        (4.6)
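Equation 4.6 amounts to a timestamp subtraction; a minimal sketch using GitHub's ISO-8601 timestamp format (the dates are illustrative only):

```python
from datetime import datetime

def repo_age_days(created_at, updated_at):
    """Repository age in days: T(updated_at) - T(created_at), per Eq. 4.6."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # GitHub API timestamp format, e.g. '2019-01-01T00:00:00Z'
    return (datetime.strptime(updated_at, fmt) - datetime.strptime(created_at, fmt)).days

print(repo_age_days("2018-11-01T12:00:00Z", "2019-02-19T12:00:00Z"))  # 110
```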

Figure 4.17 and Table 4.6 show how development time varies by model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide and Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). Therefore, we hypothesize that many of the earlier models started being used in the open-source web community immediately after their first release.


Model         | Max of days | Q3 of days | Median of days | Q1 of days | Min of days
Bert          | 779         | 229        | 110            | 32         | 0
Transformer   | 1254        | 321        | 142            | 11         | 0
Wide and Deep | 1107        | 575        | 117            | 0.5        | 0
ResNet        | 1360        | 456.5      | 120            | 1.5        | 0
NCF           | 1120        | 476        | 216            | 8          | 0
LSTM          | 1812        | 621.25     | 315.5          | 47.25      | 0
CNN           | 1385        | 699.25     | 483            | 270.25     | 0

Table 4.6: Repository Development Time Statistics


Figure 4.17: Development Time Boxplot



Figure 4.18: Development Time vs. Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows the scatter plot correlating development time with the number of open issues. As visually suggested by the figure and the Spearman correlation test, there is a weak correlation between those two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, owing to their higher maintenance cost, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.30), CNN (3.41) and Transformer (1.86). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model         | Mean | Std    | 25% | 50% | 75% | Min | Max
Bert          | 8.30 | 50.55  | 0   | 0   | 1   | 0   | 504
CNN           | 3.41 | 35.46  | 0   | 0   | 1   | 0   | 1077
LSTM          | 1.29 | 4.92   | 0   | 0   | 1   | 0   | 69
ResNet        | 1.79 | 11.16  | 0   | 0   | 0   | 0   | 186
Transformer   | 1.86 | 8.61   | 0   | 0   | 1   | 0   | 95
Wide and Deep | 0.23 | 0.74   | 0   | 0   | 0   | 0   | 4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository | Percentage of Repositories Having a Wiki (%)
Bert                     | 97.17
CNN                      | 98.498
LSTM                     | 98.799
NCF                      | 98.864
ResNet                   | 98.817
Transformer              | 96.97
Wide and Deep            | 100

Table 4.8: Descriptive statistics on percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this chapter, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.



Figure 4.19: Open Issues vs. Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for in-depth analysis of model construction and high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation. For example, users may publish their models in prototxt format, while in our project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, which cannot exceed the 1000-results boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories in GitHub. Other, more stratified samples might yield a more precise outcome.

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to learn how deep learning is involved in our real life, an idea that was novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to develop their own heuristics for data selection: experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a GitHub plugin, allowing researchers


and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project we investigated the popularity of deep learning models via the number of repositories existing in GitHub. It is very likely that commit metadata reflects popularity as well. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., K-Means) to high-resolution time series data from commits.

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub deep learning related repositories and identified factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the tool's ability to give users a deeper insight into the current deep learning trend and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what's been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

PyCharm 2019.1.3 (Professional Edition)
Build PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
macOS 10.14.6

• Anaconda

  - jupyter-notebook 6.0.0

Other

- Python 3.7.4
- pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
- Git

Datasets

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, vgg_nets.json, Wide ResNet.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents: Before You Begin · Prerequisites · Install · Running · Test · High Level Description of all Modules & Datasets · Authors · License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view/edit/execute our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keep your Mac awake with this useful app (otherwise the internet connection will drop)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git (https://git-scm.com/downloads) and a GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project, then run python3 model_searcher.py to fetch keyword-related repository metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest. The resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be "updated" or "stars"; order can be "asc" or "desc".

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the fork timestamps into forked_timestamp.

3. Data Selection (Optional)


Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder, then run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run python3 visualizations/popularity.py; graphs are written to visualizations/graphs/popularity
- Maintenance: run python3 visualizations/maintenance.py; graphs in visualizations/graphs/maintenance
- Contribution: run python3 visualizations/contribution.py; graphs in visualizations/graphs/contribution
- Multi Correlations: run python3 visualizations/multi_variable.py; graphs in visualizations/graphs/multi_variable

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them into the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star") (parameters: model name and repository metadata subfolder). Then you can use this object with its related data easily (from Model import bert, and use bert as you go along).

Customize keywords: in module model_keyword.py, import your instantiation (e.g. lstm) and call add_keywords, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection

output/
├── asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
├── asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
├── by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
├── desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
├── desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
└── pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv


3. After Data Selection (Optional)

filtered_repo/
├── bert.json
├── pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
└── tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json


graphs

contribution

change_to_pdfbash

entropy_distributionsvg

entropy_dotssvg

lines_changed_boxssvg

lines_changed_histssvg

unique_percentage_distributionsvg

uniqueness_chartsvg

maintenance

devTime_boxplotsvg

issues_distributionsvg

wiki_ynsvg

multi_variable

dev_t_to_open_issuessvg

multi_correlationsvg

star_to_contributorssvg

star_to_dev_tsvg

star_to_entropysvg

$

star_to_open_issuessvg

$

popularity

accumulated_popularitysvg

creation_repository_trend_totalsvg

creation_with_fork_timelinesvg

fork_distributionsvg

popularity_dotsvg

$

popularity_measurement_correlationsvg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)

Bibliography

a. GitHub description: https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

b. GitHub description: https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

c. GitHub description: https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

d. GitHub Search API description: https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265-283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334-344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396-399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7-10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77-ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1-11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433-444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and van Deursen, A., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345-355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12-21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173-182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735-1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1-4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142-146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A Practical Approach to Building Neural Network Models Using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
      • Deep learning models
      • Summarized Timeline
    • Public Code Repositories
      • Web-based hosting service
      • Measuring Popularity From GitHub
      • Extracting Messy Data in the Wild
      • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    • RQ2: How popularity varies per model
    • RQ3: Does the popularity of models relate to other features?
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commitments Timestamp
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Softwares
      • Other
      • Datasets
  • Appendix 4: README


Figure 2.1: git2net [Gote et al., 2019]

trend related to the keyword specified by users in GitHub

GEVOL

Collberg et al. [2003] implemented a tool called GEVOL that visualises the evolution of software using a novel graph-drawing technique, to deduce a better understanding of a program from its development history; all the visualisations are displayed using a temporal graph visualizer.

This system aids in the discovery of the structure of the system and gives the user a new way to explore a program's evolution by visualising how the system changes. It extracts information about Java programs stored within a CVS version control system and organises the metadata into three types of graphs: inheritance, control-flow, and call graphs.

git2net

git2net [Gote et al., 2019] is a tool that facilitates the extraction of co-editing networks from git repositories. Similar to GEVOL, it uses text-mining techniques to analyse the history of modifications within files. The authors also address the importance of studying social networks in GitHub, and give the reader a broader view of the application of graph-based data analysis and modelling. Their tool shows its advantage in constructing a directed, weighted network linking co-editing developers, as depicted in Figure 2.1.

2.3 Summary

In this chapter we detailed the web-based hosting service we selected (GitHub) and presented the concept of deep learning with two popular frameworks, together with state-of-the-art neural network models. In the next chapter we will elaborate how we designed and implemented STAMPER.

Chapter 3

STAMPER Design and Implementation

In this chapter we outline our design and implementation for data extraction, and then detail the metrics we use to estimate trends in deep learning framework and model usage in GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

[Diagram: 1. Data Collection (Git Project Search API, keyword and model-name queries) -> 2. Repository Search (Git Code Search API) -> 3. Data Selection (optional) -> Data Visualisation (local)]

Figure 3.1: Overview of STAMPER


Data Collection
We first collect all the repository metrics through the GitHub API. This step allows us to extract the history of all repositories related to a keyword and record the metadata for each repository (repository-based search).

Repository Search
As each repository contains multiple contributors and forks, we subsequently extract and analyse whether changes were made (based on the size information) and calculate a collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the forked repositories' information to create visual representations.
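The collaborative factor (entropy) is not given a formula at this point in the report; a plausible reading is the Shannon entropy of per-contributor commit shares. The sketch below works under that assumption (the function name is hypothetical):

```python
import math

def contribution_entropy(contributions):
    """Shannon entropy of per-contributor commit shares for one repository.

    0.0 means a single author did everything; higher values mean the work
    is spread more evenly across contributors. Illustrative sketch only:
    the report's exact entropy definition may differ.
    """
    total = sum(contributions)
    if total == 0:
        return 0.0
    shares = [c / total for c in contributions if c > 0]
    return -sum(p * math.log2(p) for p in shares)
```

For example, a repository with one dominant author (`[98, 1, 1]`) scores close to 0, while an even three-way split (`[10, 10, 10]`) scores `log2(3)`.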

Data Selection
We implemented a selector that allows excluding repositories not related to the desired ones. The selector summarizes the frequency counts for the keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis
Since each forked repository may involve re-development and modification, modification of forked repositories is included in our project. For every changed repository, the size difference is analysed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to analyse and manipulate the data in depth, and even run statistical tests on the data set. To better understand these metrics, we divided them into multiple categories. Attributes that are not primary data from the GitHub API are explained in the data-expansion part and labelled [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication, the user can make up to 5,000 requests per hour; otherwise the rate limit only allows up to 60 requests per hour [Git d].
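The authentication step above amounts to attaching the OAuth2 token to each API request. A minimal sketch, assuming the token is supplied directly or through a `GITHUB_TOKEN` environment variable (both the helper name and the variable are assumptions, not part of STAMPER's documented interface):

```python
import os
import urllib.request

def authed_request(url, token=None):
    """Build a GitHub API request carrying an OAuth2 token header, lifting
    the rate limit from 60 to 5,000 requests per hour once authenticated."""
    token = token or os.environ.get("GITHUB_TOKEN")
    req = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github.v3+json"})
    if token:
        # GitHub accepts "token <OAuth2 token>" in the Authorization header.
        req.add_header("Authorization", "token " + token)
    return req

# The remaining quota can be checked before crawling by opening
# https://api.github.com/rate_limit with urllib.request.urlopen(req).
```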


Type          Meta-data
Contributor   contribution: int [Data Expansion]
              login (user name): String
              type (user / organization): String
              contributors_url
Repository    created_at
              description
              full_name
              language
              size
Popularity    fork: Boolean
              forks: int
              forks_url
              stargazers_count
              watchers_count
              unique_repos [Data Expansion]
Owner         id
              login (username)
              type
Maintenance   has_issues: Boolean
              has_wiki: Boolean
              open_issues: int
              pushed_at
              updated_at
              score

Table 3.2: Collected repository metadata

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deepcrawling operations to collect raw data

• Contribution
One repository generally involves multiple developers conducting software development, and the project owner is not necessarily the person who contributes the most code: the amounts of contribution made by different developers are potentially unequal. As a result, we further track this information using the GitHub API and record the number of contributions each developer made to each repository.

• Unique_repos
Popular repositories may have numerous related forked repositories owned by other users, and the reasons behind the forking behaviour vary. Our research explores whether users conduct subsequent development based on the original codebase. By comparing the size of the forked


repository (F_i) and of the original repository (O), we obtain, for every forked repository, its change in size (c):

    F_i + c = O    (3.1)
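Equation 3.1 can be applied directly to the collected metadata. A minimal sketch (the helper name is hypothetical; sizes are GitHub's `size` field, reported in KB):

```python
def fork_size_changes(original_size, fork_sizes):
    """Apply Equation 3.1 (F_i + c = O): for each forked repository size F_i,
    c = O - F_i is its change in size relative to the original repository O.
    A non-zero c suggests the fork diverged from the original codebase."""
    return [original_size - f for f in fork_sizes]

# A fork whose size still equals the original's gives c == 0 (unchanged):
# fork_size_changes(1000, [1000, 980, 1150]) -> [0, 20, -150]
```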

3.4 Data Selection

[Diagram: entity (model) API keywords are searched in each repository and summarized into statistics]

Figure 3.2: Data Selection

[Diagram: unfiltered data is filtered using model-related keywords (Bert, ResNet, CNN) and forked-repository timestamps, then grouped by model (Model.py)]

Figure 3.3: Store in Local Disk


Figure 3.2 represents our method of searching API usage in DL-model-related repositories. GitHub provides a REST API to examine whether a repository contains the targeted API or not.

Based on the repositories obtained in the data-collection phase, we conduct a secondary data selection. The appearance frequency of the user-specified API is embedded directly within the returned result and can be identified in our program, corresponding to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API, and allows users to build a high-level picture of API usage across GitHub-related repositories.
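Counting API appearances per repository can be done with GitHub's code-search endpoint, whose JSON response carries a `total_count` field. The sketch below only builds the query URL; the function name is hypothetical and the exact query STAMPER issues may differ:

```python
import urllib.parse

API_ROOT = "https://api.github.com/search/code"  # GitHub code-search endpoint

def code_search_url(keyword, repo_full_name):
    """Build a code-search query that counts occurrences of an API keyword
    inside one repository; the response's `total_count` proxies API usage."""
    query = '"{}" repo:{}'.format(keyword, repo_full_name)
    return API_ROOT + "?q=" + urllib.parse.quote(query, safe="")
```

For example, `code_search_url("ResNet50", "tensorflow/models")` yields a URL whose response reports how often `ResNet50` appears in that repository.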

Meanwhile, we also provide multiple options for selecting APIs, stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the number of models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras applications library lets users load deep learning models and instantiate a model with default weights. Take ResNet as an example.

• With pre-defined models
Users may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, and that this model could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they could all be used as sample keywords in STAMPER, such as:

keras.applications.resnet.ResNet50
keras.applications.resnet.ResNet101
keras.applications.resnet.ResNet152
keras.applications.resnet_v2.ResNet50V2
keras.applications.resnet_v2.ResNet101V2
keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to further explore the components inside the ResNet class, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Because model construction is flexible, the data-selection heuristic varies: deep learning

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py


users and experts can define their searches according to their interests and preferences.
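Tracing a self-defined class, as described above, can be approximated with a simple heuristic. The following sketch is a hypothetical, regex-based helper for a single Python source file; it is one possible selection heuristic, not STAMPER's exact implementation:

```python
import re

def count_class_usage(source, class_name="ResNet"):
    """Count definitions and call-site uses of a (possibly self-defined)
    model class in one source file. Rough heuristic for illustration:
    it ignores comments, strings, and cross-file imports."""
    defined = len(re.findall(r"class\s+" + re.escape(class_name) + r"\b", source))
    used = len(re.findall(r"\b" + re.escape(class_name) + r"\s*\(", source))
    return defined, used
```

For instance, a file containing `class ResNet: ...` and one `ResNet()` instantiation yields one definition and one use.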

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from these three perspectives is illustrated in Figure 3.4. Chapter 5 gives an example of our collected repository metadata for deep learning models.

[Diagram: entities 1..n are functionally mapped to contribution-related, popularity-related, and maintenance-related visualisations]

Figure 3.4: Overall Construct the Visualizations

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped in weeks (with forks)

• Repository Creation Time vs Stars

Contribution

To make additional use of the forking information, STAMPER supports the comparison between the original repository and its forked repositories. The work could be further extended by visiting each forked repository's URL and tracing the commits.

As shown in Figure 3.5, the entity (E) we searched in GitHub may have multiple related repositories (R_i) and their corresponding forked repositories (F_i). Among the forked repositories, we call a changed forked repository C_i.

To examine whether changes exist in forked repositories, and how this differs between entities, we calculate the difference using the equation down


Keyword                     Total of Repositories (including Forks) Collected    Total of Original Repositories Collected
ResNet tensorflow           6129                                                 339
Bert tensorflow             13734                                                106
CNN tensorflow              39765                                                1000
LSTM tensorflow             19572                                                1000
Transformer tensorflow      7188                                                 145
Wide and deep tensorflow    324                                                  39

Table 3.1: Repositories Related to Tensorflow


below. Uniqueness percentage distributions are composed of all the percentages p_i, each corresponding to its original repository R_i:

    p_i = Σ C_i / Σ F_i    (3.2)
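Equation 3.2 reduces to a ratio of counts once we know, for an entity, how many forks exist and how many of them changed. A small sketch under that assumption (the function name is hypothetical):

```python
def uniqueness_percentage(changed_forks, all_forks):
    """p = sum(C_i) / sum(F_i): the share of forks whose size differs from
    the original repository (Equation 3.2). Inputs are per-repository
    counts of changed forks and of all forks; illustrative sketch only."""
    total = sum(all_forks)
    return sum(changed_forks) / total if total else 0.0

# Two repositories with 2 forks each, of which 1 and 2 changed:
# uniqueness_percentage([1, 2], [2, 2]) -> 0.75
```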

[Diagram: an entity (E) has repositories 1..n, each with forked repositories 1..n; each fork is marked as changed Y/N]

Figure 3.5: Examine Uniqueness after Forking

• Percentage of forked repositories unique from origin (boxplots)

• Uniqueness percentage distribution for each entity (histograms)

• Entropy distribution for each entity (histograms)

Maintenance

• Development time boxplot for each entity

• Open issues distribution for each entity

3.6 Summary

In this chapter we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, along with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. We also introduced and analysed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field: deep learning models are continually evolving and are built, trained, and deployed by researchers. Our tool is available for analysing such changes. We collected the historical information stored in GitHub and extracted the repository metadata using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development has become a battlefield without gunsmoke: researchers, companies, and developers are all competing for a dominant voice in deep learning. A variety of models exist, but there is no common bridge connecting those ideas. Historical data in GitHub is obscure and hard to find, and as a result many developers, especially experienced ones, remain in their comfort zone. With our study we hope to shed some light on deep learning use and highlight a few suggestions for the public.

This section aims to answer some questions related to both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, because there are few studies about popularity in the GitHub system, there is no standardized feature for measuring popularity. We analyse some potential features of each repository and hypothesise that popularity is strongly related to the number of stars a repository owns.

This decision will be justified in the following section, with more background on GitHub.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who ask to be notified of activity in a repository they are watching; however, watching does not make one a collaborator [Git b]. A watcher could watch


Figure 4.2: Star Sort Menu [Git a]

a repository to receive notifications for new pull requests or issues that are created. Watchers can indicate how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

• Stars
Starring a repository makes it easy to keep track of a repository the user is interested in. The starred repository appears on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks
Forks are created when a user would like to make their own copy of a repository. The user can fork a repository to suggest changes, or use it as a basis for a new project.

Based on the data gathered in the collection stage, we extract the watchers_count, stargazers_count, and forks attributes and test whether these attributes are correlated. In Table 4.1 we summarize 86712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star     forks_count    watchers_count    model name
17940    4661           17940             Bert
12405    3637           12405             Bert
5263     1056           5263              Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks, and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we should not assume normality and instead consider a rank-based measure.


Spearman Correlation Coefficient

Definition: The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman, 2011]. This method allows us to test for a rank-order relationship between two numerical or ordinal variables associated with a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork, and watcher) have no relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

• Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

• Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Setting α = 0.05, the p-values p1, p2, and p3 are all less than α; moreover, from the calculation above we find strong positive correlations, with coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875 respectively.

This means it is very unlikely (at 95% confidence) that the test data are uncorrelated, and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of the report, we treat the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and long short-term memory network (LSTM), shown in the figures below, look like two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Among models with a shorter history, BERT and ResNet are two rising stars in the model competition: they arrived with significant improvements in architecture design and performance, as already described in the background section.

The model development community recently saw the release of multiple powerful frameworks that are treated as baselines for building models. However, for many new models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.


[Chart: Number of Repositories Created With Forks (Accumulated), 2015-2019, per model]

Figure 4.4: Repositories with Forks

[Chart: Number of Repositories Created (Accumulated), 2015-2019, per model]

Figure 4.5: Repositories without Forks


[Chart: per-model repository counts over time, October 2015 to October 2019]

Figure 4.6: Repository Trend in GitHub For Each Model


[Chart: Created time vs Stars, per model]

Figure 4.7: Creation Time vs Stars

A fork is another copy of a repository; a forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories created including forks and the total number created excluding forks. Most of the repositories related to deep learning models are not original, which indicates that a considerable number of developers remain in a studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, different from the previous summarizing method, we study the popularity of deep learning models over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the largest number of repositories created. Let us examine this view using the data: in


2017, the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued rising to a higher level, which persists until now.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have among the most important and largest communities in the deep learning field. These networks are essential in both computer vision and NLP, where they hold an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly in the last two years. Differing from earlier structures such as CNN, both modify the original structure and significantly improve results in computer vision and translation tasks.

Rising star: BERT

However, no model comes with perfection: LSTM itself can be extended into many variants, and BERT is one of those.

The current trends, as depicted in the graph, support the inference that deep learning models are proliferating fast, with innovative developments. There surely remains ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with the time the model came into existence, but our data tell a different story.

NCF, whose paper was published in 2016, draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e. stars) and creation time; past trends are no guarantee of future ones, and it is possible that the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide and Deep model, also published in 2016, is similar: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of it. Moreover, the previous data also confirm that there is no significant rise in the use of this model.

28 STAMPER in Action

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name      Mean     STD       Min   25%   50%   75%    Max
Bert            498.65   2196.3    0     1     8     43     17940
CNN             106.84   611.97    2     3     8     32     13882
LSTM            48.82    214.22    0     1     2     13     2703
NCF             7.7      129.91    1     2     3     11.5   227
ResNet          46.88    221.43    0     0     1     8      2980
Transformer     186.79   1155.87   0     0     4     21     12408
Wide and Deep   16.23    36.80     0     0     1     8      146

Table 4.2: Stars Comparison

Model Name      Mean     STD      Min   25%   50%   75%    Max
Bert            128.21   585.93   0.0   0.0   1.0   16.5   4661.0
CNN             40.71    252.71   0.0   1.0   4.0   14.0   6274.0
LSTM            17.79    71.96    0.0   0.0   1.0   5.0    968.0
NCF             34.33    58.60    0.0   0.5   1.0   51.5   102.0
ResNet          17.44    93.75    0.0   0.0   0.0   3.0    1442.0
Transformer     53.52    336.10   0.0   0.0   1.0   6.0    3637.0
Wide and Deep   7.28     16.36    0.0   0.0   0.0   2.5    71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52) and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), ResNet (17.44) and LSTM (17.79).

Kruskal-Wallis test: the Kruskal-Wallis test is used in this project to compare more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen:

• H0: the 7 models' distributions are the same

• H1: the 7 models' distributions are different

from scipy.stats import kruskal

stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                  dfNcf["star"].tolist(), dfResnet["star"].tolist(), dfTransformer["star"].tolist(),
                  dfWideDeep["star"].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories existing in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or BERT model requires a large amount of time and effort, but developers still show their interest in those novel deep learning models.

Figure 4.9: Star vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show scatter plots correlating the number of stars with the number of contributors, development time, number of open issues and entropy value, respectively.


Figure 4.10: Star vs Development Time

Figure 4.11: Star vs Open Issues

Figure 4.12: Star vs Entropy Value

Number of Contributors

From Figure 4.9 and a Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, p-value ≤ 0.01). Across all the repositories, the top-3 repositories with the most stars per contributor come from the models CNN (16875 stars/contributor), Transformer (1551 stars/contributor) and Bert (1550 stars/contributor).
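The Spearman tests reported in this section can be reproduced with scipy.stats; the sketch below uses made-up star and contributor counts (not the study's data) purely to illustrate the call:

```python
from scipy.stats import spearmanr

# Hypothetical per-repository metadata: star counts and contributor counts.
stars = [5, 12, 7, 40, 3, 18, 25, 9]
contributors = [1, 2, 1, 5, 1, 3, 4, 2]

# Spearman's rho is rank-based, so it captures monotone (not just linear) association.
rho, p_value = spearmanr(stars, contributors)
print(f"rho = {rho:.4f}, p-value = {p_value:.4g}")
```

Because the test works on ranks, it is robust to the heavy-tailed star distributions seen throughout this chapter.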

Model           Percentage of One-Contributor Development (%)
Bert            74.53
CNN             83.3
LSTM            85.9
NCF             100
ResNet          90.26
Transformer     81.20
Wide and Deep   89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time

From Figure 4.10 and a Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value ≈ 0). This suggests that the longer a model is developed, the more stars it will have (i.e. the model becomes more popular). The top-2 repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues

From Figure 4.11 and a Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186, p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesize that the more popular a repository becomes, the more issues it will have. We further investigate this correlation in the following section.

Entropy

From Figure 4.12 and a Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project we also investigate the impact of a new feature, the entropy value, on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can confirm this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. Here we introduce an information-theoretic approach to test whether contributions are even or not.

Entropy

In particular, we compute the entropy H of each repository, defined as

p_i = c_i / Σ_i c_i        (4.1)

H = −Σ_i p_i log2(p_i)        (4.2)

where i represents the i-th contributor, c_i the i-th contributor's contribution, and Σ_i c_i the total contribution for one repository.
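Equations 4.1 and 4.2 translate directly into a few lines of Python (a sketch; the function name is ours, not part of STAMPER):

```python
import math

def contribution_entropy(contributions):
    """Entropy H of a repository's contribution distribution (Eqs. 4.1-4.2)."""
    total = sum(contributions)                    # sum_i c_i
    probs = [c / total for c in contributions]    # p_i = c_i / sum_i c_i
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Worked example of this section: contributions 174, 36 and 4 give H of about 0.78,
# while a perfectly even two-person split gives the two-contributor maximum, 1.0.
print(contribution_entropy([174, 36, 4]))
print(contribution_entropy([100, 100]))
```

A single-contributor repository has H = 0, and H grows as work is spread more evenly across more contributors.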

Take the repository dragen1860/TensorFlow-2x-Tutorials as an example.

The contribution table is summarized in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214        (4.3)

p1 = 174/214,  p2 = 36/214,  p3 = 4/214        (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.78        (4.5)

name            contribution
dragen1860      174
ash3n           36
kelvinkoh0308   4

Table 4.5: Sample Contributions to One Repository

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly: the lower the entropy, the more uneven the distribution of work.

Figure 4.13 shows the distribution of entropy values for all models. From these figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.
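The kind of request involved can be sketched with the public GitHub REST API (this standalone example is ours, not STAMPER's actual code; the repository slug at the bottom is only illustrative):

```python
import requests

API = "https://api.github.com"

def forks_url(owner, repo):
    """Endpoint listing the forks of owner/repo."""
    return f"{API}/repos/{owner}/{repo}/forks"

def list_forks(owner, repo, token=None):
    """Return (full_name, created_at) for every fork, following pagination."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:  # unauthenticated requests are limited to 60 per hour
        headers["Authorization"] = f"token {token}"
    forks, page = [], 1
    while True:
        resp = requests.get(forks_url(owner, repo), headers=headers,
                            params={"per_page": 100, "page": page})
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return forks
        forks.extend((f["full_name"], f["created_at"]) for f in batch)
        page += 1

# e.g. list_forks("dragen1860", "TensorFlow-2x-Tutorials")
```

The created_at timestamps returned here are the same fork timestamps the forked_timestamp CSV files in the appendix are built from.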

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models studied.

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking.


Figure 4.15: Repository Uniqueness Distribution (%)

Figure 4.16: Repository Change Statistics


Moreover, when changes are made, most forked repositories differ in size from the original by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them of less concern to people. Second, the model itself may only be valid for a specific type of data, making it less robust and generalized, and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, and a large number of forked projects show no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section we survey software maintenance in these deep learning related repositories. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)        (4.6)
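GitHub's API reports created_at and updated_at as ISO-8601 UTC strings, so Equation 4.6 reduces to a timestamp subtraction (a sketch; the timestamp values are illustrative):

```python
from datetime import datetime

GITHUB_FMT = "%Y-%m-%dT%H:%M:%SZ"  # e.g. "2019-10-23T09:15:00Z"

def repo_age_days(created_at, updated_at):
    """Repository age in whole days: T(updated_at) - T(created_at), per Eq. 4.6."""
    delta = (datetime.strptime(updated_at, GITHUB_FMT)
             - datetime.strptime(created_at, GITHUB_FMT))
    return delta.days

print(repo_age_days("2018-10-31T18:42:25Z", "2019-10-23T09:15:00Z"))
```

Note that this measures time between creation and last update, so a long-abandoned repository stops accumulating "age" under this definition.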

Figure 4.17 and Table 4.6 show how development time varies depending on the model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide and Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days by model is different (p-value ≤ 0.05). Therefore, we hypothesize that many of the earlier models started using the open-source web community immediately after their first release.


Model           Max of days   Q3 of days   Median of days   Q1 of days   Min of days
Bert            779           229          110              32           0
Transformer     1254          321          142              11           0
Wide and Deep   1107          575          117              0.5          0
ResNet          1360          456.5        120              1.5          0
NCF             1120          476          216              8            0
LSTM            1812          621.25       315.5            47.25        0
CNN             1385          699.25       483              270.25       0

Table 4.6: Repository Development Time Statistics

Figure 4.17: Development Time Boxplot


Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and a Spearman correlation test, there is a moderate correlation between the two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, due to the high cost of maintenance, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model           Mean    Std      25%   50%   75%   Min   Max
Bert            8.299   50.55    0     0     1     0     504
CNN             3.414   35.456   0     0     1     0     1077
LSTM            1.292   4.915    0     0     1     0     69
ResNet          1.791   11.164   0     0     0     0     186
Transformer     1.857   8.608    0     0     1     0     95
Wide and Deep   0.231   0.742    0     0     0     0     4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository   Percentage of Repositories Having a Wiki (%)
Bert                       97.17
CNN                        98.498
LSTM                       98.799
NCF                        98.864
ResNet                     98.817
Transformer                96.97
Wide and Deep              100

Table 4.8: Descriptive statistics on percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e. they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation of whether developers change the original codebase after forking.


Figure 4.19: Open Issues vs Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvement

Pinpointing the landmarks of deep learning models is tricky. We had plenty of ideas on identifying models using different strategies, and we developed heuristics for in-depth analysis of the construction of models and high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question which needs further investigation in the future (for example, users may use the prototxt format to publish their models). In this project we focused only on deep learning models constructed using Python. The findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, since search results cannot exceed the 1000-repository boundary. We tried to overcome this issue using the different sorting strategies provided by GitHub, but this still cannot capture all the repositories on GitHub. Other, more stratified samples might produce a more precise outcome.

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real lives, an idea that is novel compared with previous work.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to develop their own heuristics for data selection; experts can easily change the API searching in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or becoming a plugin for GitHub, allowing researchers



and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models through the number of repositories existing in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g. k-means) to high-resolution time-series data from commits.
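As a minimal sketch of that direction (the weekly commit counts below are invented for illustration, clustered with scikit-learn's k-means; this is our assumption of how such features could be built, not a result of this project):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical weekly commit counts for six repositories (8 weeks each):
# the first three show rising activity, the last three decaying activity.
series = np.array([
    [1, 2, 3, 5, 8, 12, 15, 20],
    [0, 1, 3, 4, 9, 11, 16, 19],
    [2, 2, 4, 6, 7, 13, 14, 21],
    [20, 15, 11, 8, 5, 3, 2, 1],
    [18, 16, 10, 9, 4, 2, 1, 0],
    [21, 14, 12, 7, 6, 2, 2, 1],
])

# k-means groups repositories with similar activity profiles together,
# which is the kind of trend cluster this future work would look for.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(series)
print(labels)
```

On real commit histories, the series would first need alignment and normalization, since repositories differ in age and overall volume.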

Chapter 6

Conclusion

This research project identified the need for a tool to conduct trend analysis on GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub deep learning related repositories and identified factors affecting these dimensions. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to help users gain deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields and serve the needs of people working at the intersection of social media analysis, data visualization and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models and datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past, based on previous research and prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models and datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated this program is my own original work

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

PyCharm 2019.1.3 (Professional Edition)


Build PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
macOS 10.14.6

• Anaconda

– jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0
  - numpy==1.14.0
  - statistics==1.0.3.5
  - ratelimit==2.2.1
  - requests
  - altair
  - matplotlib==2.2.2
  - selenium
- Git

Datasets

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net.json, U-Net pytorch.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAM

PER Mapping the landscape of deep

learning models use in the w

ild

cccccc

STAMPER

STAMPER

Table of Contents

Before You BeginPrerequisitesInstallRunningTestH

igh Level Description of all M

odules amp D

atasetsAuthorsLicense

STAM

PER

STAMPER

is python software tool for exam

ining real-world softw

are repositories related to deep learning models to

aid researchers in understanding past and current trends

Before You Begin

Before you begin we strongly recom

mend you dow

nload PyCharm and Anaconda before view

editexecute ourcodePyCharm

Anaconda

Amphetam

ine on the Mac App Store Keep M

ac awake w

ith this useful App (Otherw

ise it will disconnect

internet)

Prerequisites

Please make sure you have installed all of the follow

ing prerequisites on your development m

achine

Git - httpsgit-scm

comdow

nloadsG

it Authentication tokenPython 37 w

ith pip

Jupyter Notebook 600

All external libraries used listed in requirementstxt

Install

Install some dependencies M

ake sure you have the latest pip

Install a list of requirements specified in r

equirementstxt

Running

All the code scripts run from the root

1 Data Collection

Clone our project

Run python3 model_searcherpy

to get keyword related repositories m

etadata in GitH

ub in output

folderAn authentication key is required to get a higher request rateN

ote Request rate could be changed in function call_api

and call_api_dir

Then you need to run sh JSONFormattersh

in your terminal to w

ell-format your output data

Sample Case

In main()

change keywords

in terms of interest Then the resulting JSO

N file w

ill beoutputbertJSON

Customized sorting m

ethod in function get_total_pages

and request_ith_page

Could besort

updated

or stars

order

asc

or desc

2 Repository Search

Run python3 forks_time_stamp_getterpy

to get all your the forks timestam

p in forked_timestamp

3 Data Selection (O

ptional)

pip3 install --upgrade pip

1

pip3 install -r requirementstxt

1

Run python3 repository_filterpy

to get your code-related repositories with statistics in

filtered_repo

folder

Run python3 filtered_repopy

to filter your data

NoteYour keyw

ords could be customized in m

odel_keywordpy

We store all the previous experim

ent data in tensorflow_model_filtering

andpytorch_model_filtering

4 Data Visualization

Popularity

Run python3 visualizationspopularitypy

and get your graphs invisualizationsgraphspopularity

Maintenance

Run python3 visualizationsmaintenancepy

and get your graphs invisualizationsgraphsmaintenance

Contribution

Run python3 visualizationscontributionpy

and get your graphs invisualizationsgraphscontribution

Multi Correlations

Run python3 visualizationsmulti_variablepy

and get your graphs invisualizationsgraphsmulti_variable

Test

Some G

itHub repositories does not m

aintained well and their links som

etimes are broken and unreachable To

guarantee your best experience in using our tool we provide testing unit for G

itHub links in t

estpy

This module

will record all the unreachable links and w

rite them into file

unreachable_urlstxt

UsageChange elem

ents in keywords

run python3 testpy

All the unreachable links will w

rite tounreachable_urlstxt

Customizing Your O

wn Search

In module M

odelpy

define your own entity lists (eg t

ensorflow_models

)

In Constructor Model

we store all unfiltered_data filtered_data and forked_tim

e_location in three folders

Instantiation

Since you already got data from the previous steps (1-2) Then you can construct a m

odel by calling aconstructor M

odel

eg bert = Model(bert tensorflow desc_by_star)

parameter M

odel_name and Respository m

etadata subfolder

Then you can call this object with its relative data easily (

from Model import bert

and use bert

as you goalong)

Customize Keywords

In module `model_keyword.py`, import your instantiation (e.g. `lstm`) and call `add_keywords`, e.g.:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: `model_searcher.py`, `item_filter.py`
2. Repository Search: `model_searcher.py`, `forks_time_stamp_getter.py`
3. (Optional) Data Selection: `repository_filter.py`, `filtered_repo.py`
4. Data Visualization: `contribution_stat.py`, `entropy_calculation.py`, `Analysis/contribution_related.py`, `Analysis/meta_data.py` (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection

    output/
        asc_by_star/
            cnn tensorflow.json
            lstm tensorflow.json
        asc_general/
            bert.json
            cnn.json
            lstm.json
            ncf.json
            resnet.json
            transformer.json
            wide deep.json
        by_update_time/
            bert tensorflow.json
            cnn tensorflow.json
            lstm tensorflow.json
            ncf tensorflow.json
            resnet tensorflow.json
            transformer tensorflow.json
            wide deep tensorflow.json
        desc_by_star/
            bert tensorflow.json
            cnn tensorflow.json
            lstm tensorflow.json
            ncf tensorflow.json
            resnet tensorflow.json
            transformer tensorflow.json
            wide deep tensorflow.json
        desc_general/
            bert.json
            cnn.json
            lstm.json
            ncf.json
            resnet.json
            transformer.json
            wide deep.json
        pytorch_models/
            AlexNet.json
            DCGAN.json
            Densenet.json
            FCN-ResNet101.json
            GoogleNet.json
            HarDNet.json
            Inception_v3.json
            MobileNet v2.json
            PGAN.json
            ResNet.json
            ResNet101.json
            ResNext WSL.json
            ResNext.json
            RoBERTa.json
            SSD.json
            ShuffleNet v2.json
            SqueezeNet.json
            Tacotron 2.json
            Transformer.json
            U-Net pytorch.json
            U-Net.json
            WaveGlow.json
            Wide ResNet.json
            fairseq.json
            vgg_nets.json

2. After Repository Search

    forked_timestamp/
        bert tensorflow.csv
        cnn tensorflow.csv
        lstm tensorflow.csv
        ncf tensorflow.csv
        resnet tensorflow.csv
        transformer tensorflow.csv
        wide deep tensorflow.csv


3. After Data Selection (Optional)

    filtered_repo/
        bert.json

    pytorch_model_filtering/
        Densenet.json
        FCN-ResNet101.json
        GoogleNet.json
        MobileNet v2.json
        ResNet101.json
        ResNext.json
        ShuffleNet v2.json
        SqueezeNet.json
        Tacotron 2.json
        Wide ResNet.json
        vgg_nets.json

    tensorflow_model_filtering/
        bert.json
        lstm.json
        ncf.json
        resnet.json
        transformer.json
        wide deep.json


Generated Graphs

    graphs/
        contribution/
            change_to_pdf.bash
            entropy_distribution.svg
            entropy_dots.svg
            lines_changed_boxs.svg
            lines_changed_hists.svg
            unique_percentage_distribution.svg
            uniqueness_chart.svg
        maintenance/
            devTime_boxplot.svg
            issues_distribution.svg
            wiki_yn.svg
        multi_variable/
            dev_t_to_open_issues.svg
            multi_correlation.svg
            star_to_contributors.svg
            star_to_dev_t.svg
            star_to_entropy.svg
            star_to_open_issues.svg
        popularity/
            accumulated_popularity.svg
            creation_repository_trend_total.svg
            creation_with_fork_timeline.svg
            fork_distribution.svg
            popularity_dot.svg
            popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr. Ben Swift.

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only).

Bibliography

[Git a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[Git b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19, and 20)

[Git c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[Git d] GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
    • Trace Deep Learning use through GitHub
    • Contribution
    • Report Outline
• Background and Related Work
    • Background
        • Deep learning
            • TensorFlow
            • PyTorch
            • Deep learning models
            • Summarized Timeline
        • Public Code Repositories
            • Web-based hosting service
            • Measuring Popularity From GitHub
            • Extracting Messy Data in the Wild
            • Visualizing data in Repositories
    • Summary
• STAMPER Design and Implementation
    • Overview
    • Data Collection
    • Repository Search
    • Data Selection
        • Example
    • Construct the Visualizations
    • Summary
• STAMPER in Action
    • Popularity of Deep Learning Models in GitHub
        • Popularity Feature Selection
        • Past and Current Status: A Full Integration
        • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
        • RQ2: How popularity varies per model
        • RQ3: Does the popularity of models relate to other features?
    • Contribution of Deep Learning Models in GitHub
        • Collaborative Contribution
        • RQ1: After forking, do developers change the codebase?
    • Maintenance of Deep Learning Models in GitHub
        • RQ1: How long has it been in existence?
        • RQ2: Do old models have more issues compared to new models?
        • RQ3: Are they well maintained?
    • Summary
• Discussion And Future Work
    • Discussion
        • Data in the wild: Limitation and Improvement
        • Extensibility and Open-Source Software
    • Future Work
        • Social Network Analysis in GitHub
        • Trend Detection using Commitment Timestamps
• Conclusion
• Appendix
    • Appendix 1: Project Description
        • Project Title
        • Supervisors
        • Project Description
        • Learning Objectives
    • Appendix 2: Study Contract
    • Appendix 3: Artefact Description
        • Code Files Submitted
        • Program Testing
        • Experiment
            • Hardware
            • Softwares
            • Other
            • Datasets
    • Appendix 4: README

Chapter 3

STAMPER Design and Implementation

In this chapter we outline our design and implementation for data extraction, and then detail the metrics we use to estimate the trend of deep learning framework and model usage in GitHub.

3.1 Overview

Our project was designed to evolve through the following steps, as shown in Figure 3.1.

Figure 3.1: Overview of STAMPER — 1. Data Collection (Git Project Search API, per model-name keyword) → 2. Repository Search → [Optional] 3. Data Selection (Git Code Search API) → Data Visualisation (local).

Data Collection. We first collect all the repository metrics through the GitHub API; this step allows us to extract the history of all the repositories related to the keyword and record the metadata for each repository (repository-based search).

Repository Search. As each repository contains multiple contributors and forks, we subsequently extract and analyze whether changes were made, based on the size information, and calculate the collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the forked information to create visual representations.

Data Selection. We have implemented a selector that allows users to exclude specific repositories not related to the desired ones. The selector summarizes the frequency counts for the keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis. Each forked repository may be related to re-development and modification, so modification of forked repositories is included in our project. For all changed repositories, the size difference is analyzed to examine whether lines were added or removed compared to the original repository.
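The collaborative factor (entropy) mentioned above can be illustrated with a short sketch: Shannon entropy over each contributor's share of a repository's contributions. The helper below is illustrative only, not the project's actual `entropy_calculation.py` code:

```python
import math

def contribution_entropy(contributions):
    """Shannon entropy (in bits) of per-contributor contribution shares.

    A single-author repository scores 0.0; n equally active contributors
    score log2(n), so higher values indicate more collaborative development.
    """
    total = sum(contributions)
    shares = [c / total for c in contributions if c > 0]
    entropy = -sum(p * math.log2(p) for p in shares)
    return entropy + 0.0  # normalise -0.0 to 0.0

print(contribution_entropy([120]))     # one author -> 0.0
print(contribution_entropy([50, 50]))  # two equal authors -> 1.0
```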

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the local host, allowing users to analyze and manipulate the data in depth, and even run statistical tests on the data set. To better understand these metrics, we divide them into multiple categories. For attributes that are not primary data from the GitHub API, we explain them in the data-expansion part and label them as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication, the user can make up to 5,000 requests per hour; otherwise, the rate limit only allows up to 60 requests per hour [Git d].
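For illustration, the authenticated request this paragraph describes might be sketched as follows. This is a simplified sketch using the standard library, not STAMPER's exact request code; the `/rate_limit` endpoint is GitHub's own way of reporting the current limit:

```python
import json
from urllib import request

def auth_headers(token=None):
    """Headers for a GitHub v3 API request.

    With an OAuth2 token the rate limit is 5,000 requests/hour;
    without one, only 60 requests/hour.
    """
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    return headers

def github_get(path, token=None):
    """GET https://api.github.com<path> and decode the JSON body."""
    req = request.Request(f"https://api.github.com{path}",
                          headers=auth_headers(token))
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# e.g. github_get("/rate_limit", token)["resources"]["core"]["limit"]
```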


Table 3.2: Repository metadata collected (attributes marked [Data Expansion] are derived rather than primary GitHub API data)

Type          Meta-data
Contributor   contribution (int) [Data Expansion]; login (user name, String); type (user/organization, String); contributors_url
Repository    created_at; description; full_name; language; size
Popularity    fork (Boolean); forks (int); forks_url; stargazers_count; watchers_count; unique_repos [Data Expansion]
Owner         id; login (username); type
Maintenance   has_issues (Boolean); has_wiki (Boolean); open_issues (int); pushed_at; updated_at; score

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deep crawling operations to collect raw data.

• Contribution
A repository generally involves multiple developers conducting software development, and the project owner is not necessarily the person who contributes the most code; the amounts of contribution made by the developers are potentially unequal. As a result, we further track this information through the GitHub API and record the number of contributions each developer made for each repository.

• Unique_repos
Popular repositories may have numerous related forked repositories owned by other users, and the reasons behind the forking behaviour vary. Our research explores whether users conduct subsequent development based on the original codebase. By comparing the size of each forked repository (F_i) with the original repository (O), we obtain all the forked repositories with a change of size (c):

    F_i + c = O        (3.1)
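In code, the check implied by Equation 3.1 might look like the following sketch (the helper names are illustrative; STAMPER's actual comparison works on the `size` field collected earlier):

```python
def fork_changes(original_size, fork_sizes):
    """Return the size change c = O - F_i for every fork, per Equation 3.1."""
    return [original_size - f for f in fork_sizes]

def changed_forks(original_size, fork_sizes):
    """Keep only the forks whose size differs from the original repository
    (a non-zero c marks a fork whose codebase diverged)."""
    changes = fork_changes(original_size, fork_sizes)
    return [f for f, c in zip(fork_sizes, changes) if c != 0]

print(changed_forks(100, [100, 90, 120, 100]))  # -> [90, 120]
```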

3.4 Data Selection

Figure 3.2: Data Selection — an entity (model) plus API keywords is searched in each repository, producing statistics.

Figure 3.3: Store in Local Disk — forked-repository timestamps, unfiltered data, and filtered data, grouped by model-related keywords (e.g. Bert, ResNet, CNN) in Model.py.


Figure 3.2 represents our method of searching for API usage in DL model-related repositories. GitHub provides a REST API to examine whether a repository contains the targeted API.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. The appearance frequency of the user-specified API is embedded directly within the returned result and can be identified in our program, corresponding to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. This approach also allows users to build a high-level picture of API usage in GitHub-related repositories.

Meanwhile, we also provide multiple options for selecting APIs, stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: The Keras application library provides users the ability to load deep learning models and instantiate a model with default weights. Take ResNet as an example:

• With pre-defined models
Users may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, and this model could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they all could be used as sample keywords in STAMPER, such as:

    keras.applications.resnet.ResNet50
    keras.applications.resnet.ResNet101
    keras.applications.resnet.ResNet152
    keras.applications.resnet_v2.ResNet50V2
    keras.applications.resnet_v2.ResNet101V2
    keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository. In the given sample class, ResNet consists of two main blocks. If users would like to explore the components inside the ResNet class further, they may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, data selection heuristics vary; deep learning users and experts can define their searches according to their interests and preferences.

(See https://github.com/keras-team/keras-applications and https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py)

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from the three perspectives is illustrated in Figure 3.4. Meanwhile, Chapter 5 gives an example of our collected repository metadata for deep learning models.

Figure 3.4: Overall Construction of the Visualizations — each entity is functionally mapped to popularity-related, contribution-related, and maintenance-related visualisations.

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped in weeks (with forks)

• Repository creation time vs. stars

Contribution

To further exploit the forking information, STAMPER supports the comparison between the original repository and its forked repositories. This work could be further extended by visiting the forked repository URL and tracing the commits.

As shown in Figure 3.5, the entity (E) we searched for in GitHub may have multiple related repositories (R_i) with their corresponding forked repositories (F_i). Inside the forked repositories, we denote a changed forked repository by C_i.

To examine whether changes exist in forked repositories, and the difference between multiple entities, we calculate the difference using the equation below.

Table 3.1: Repositories Related to TensorFlow

Keyword                     Total Repositories Collected (incl. forks)   Total Original Repositories Collected
ResNet tensorflow           6,129                                        339
Bert tensorflow             13,734                                       106
CNN tensorflow              39,765                                       1,000
LSTM tensorflow             19,572                                       1,000
Transformer tensorflow      7,188                                        145
Wide and deep tensorflow    324                                          39

Uniqueness percentage distributions are composed of all the percentages p_i, each corresponding to its original repository R_i:

    p_i = (Σ C_i) / (Σ F_i)        (3.2)
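Equation 3.2 translates directly into a few lines of code (a sketch only; the function and argument names are illustrative):

```python
def uniqueness_percentage(changed_per_repo, forks_per_repo):
    """p_i = sum(C_i) / sum(F_i): the share of an entity's forks whose
    size differs from their original repository."""
    total_forks = sum(forks_per_repo)
    if total_forks == 0:
        return 0.0  # an entity with no forks has no measurable uniqueness
    return sum(changed_per_repo) / total_forks

print(uniqueness_percentage([2, 1], [4, 2]))  # -> 0.5
```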

Figure 3.5: Examine Uniqueness after Forking — an entity (E) has related repositories (Repository 1..n); each forked repository is marked as changed (Y/N).

• Percentage of forked repositories unique from origin (boxplots)

• Uniqueness percentage distribution for each entity (histograms)

• Entropy distribution for each entity (histograms)

Maintenance

• Development time boxplot for each entity

• Open issues distribution for each entity

3.6 Summary

In this chapter we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analyzing software trends in the GitHub community. We also introduced and analyzed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving, being built, trained, and deployed by researchers. Our tool is available for analyzing such changes. We collected the historical information stored in GitHub and extracted repository metadata using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development has become a battlefield without the smoke of gunpowder: researchers, companies, and developers all compete for a voice in deep learning. A variety of models exists, but there is no common bridge connecting those ideas. Historical data in GitHub is obscure and hard to find, and as a result many developers, especially experienced ones, remain in their comfort zone. With our study, we hope to shed some light on deep learning use and highlight a few suggestions for the public.

This section aims to answer some questions related to models' usage in GitHub, as well as the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, because few studies address popularity in the GitHub ecosystem, there is no standardized feature for measuring it. We analyze some potential features of each repository and hypothesize that popularity is strongly related to the stars each repository owns.

This decision is justified in the following section with more background on GitHub.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who ask to be notified of activity in a repository they are watching; watching, however, does not make someone a collaborator [Git b]. A watcher can watch a repository to receive notifications for newly created pull requests or issues. Watchers indicate how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

• Stars
Starring a repository makes it easy for users to keep track of repositories they are interested in. A starred repository appears on the user's own host domain (https://[hostname]/stars). Stars are another metric of popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

Figure 4.2: Star Sort Menu [Git a]

• Forks
Forks are created when a user would like to make their own copy of a repository. The user can fork a repository to suggest changes, or use it as the basis for a new project.

Based on the data collected in the data collection stage, we extract the watchers_count, stargazers_count, and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.


Figure 4.3: Popularity Metric

Table 4.1: Popularity metrics for repositories (sample)

star     forks_count   watchers_count   model name
17940    4661          17940            Bert
12405    3637          12405            Bert
5263     1056          5263             Bert

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks, and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we should not assume normality and instead consider a rank-based test.


Spearman Correlation Coefficient

Definition: The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman, 2011]. This statistical method allows us to test for a rank-order relationship between two numerical ordinal variables associated with a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork, and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs. Fork

    from scipy.stats import spearmanr
    coef1, p1 = spearmanr(star, forks)
    print(coef1, p1)
    >> 0.8752903811064278 0.0

• Star vs. Watcher

    coef2, p2 = spearmanr(star, watchers)
    print(coef2, p2)
    >> 1.0 0.0

• Fork vs. Watcher

    coef3, p3 = spearmanr(forks, watchers)
    print(coef3, p3)
    >> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Set α = 0.05. Since p1, p2, and p3 are all less than α, and the calculation shows strong positive correlations with coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875 respectively, it is very unlikely (at 95% confidence) that the testing data are uncorrelated, and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of the report, we treat the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, appear to be two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from models with a longer history, BERT and ResNet are two rising stars in the model competition; they arrive with significant improvements in architecture design and performance, as described in the background section.

The model development community recently saw the release of multiple powerful frameworks treated as baselines for building models. However, the usage of many new models, such as the Wide and Deep model and the NCF model, has not grown in abundance.

Figure 4.4: Number of Repositories Created With Forks (Accumulated), 2015–2019.

Figure 4.5: Number of Repositories Created (Accumulated), 2015–2019.

Figure 4.6: Repository Trend in GitHub For Each Model.

Figure 4.7: Created Time vs. Stars.

A fork is another copy of a repository; the forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories created including forks and the total number of original repositories. Most of the repositories related to deep learning models are not original, which indicates that a considerable number of developers remain in a learning stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, unlike the previous summarizing method, we study the popularity of deep learning models as it changes over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the highest number of repositories created. Let us examine this thought using the data. In

§4.1 Popularity of Deep Learning Models in GitHub 27

In 2017, the number of created repositories gradually increased for CNN, and in 2018–2019 the creation trend continued to rise to a higher level, which persists to the present.

What accounts for this tremendous usage difference? CNN and LSTM currently have two of the most important and most active communities in the deep learning field. These networks are essential in both computer vision and NLP, and they form an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly in the last two years. Differing from earlier structures such as CNN, both of them modify the original structure and significantly improve results in computer vision and translation tasks.

Rising Star: BERT

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends, as depicted in the graph, support the conclusion that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with the time the model came into existence, but our data tell a different story.

Published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e., stars) and creation time; past trends are no guarantee of future ones, and it is possible that the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide and Deep model is similar: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the previous data also confirm that there is no significant rise in the use of this model.


Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name      Mean     STD      Min   25%   50%   75%   Max
Bert            498.65   2196.3   0     1     8     43    17940
CNN             106.84   611.97   2     3     8     32    13882
LSTM            48.82    214.22   0     1     2     13    2703
NCF             77       129.91   1     2     3     115   227
ResNet          46.88    221.43   0     0     1     8     2980
Transformer     186.79   1155.87  0     0     4     21    12408
Wide and Deep   16.23    36.80    0     0     1     8     146

Table 4.2: Stars Comparison

Model Name      Mean    STD     Min   25%   50%   75%   Max
Bert            128.21  585.93  0.0   0.0   1.0   16.5  4661.0
CNN             40.71   252.71  0.0   1.0   4.0   14.0  6274.0
LSTM            17.79   71.96   0.0   0.0   1.0   5.0   968.0
NCF             34.33   58.60   0.0   0.5   1.0   51.5  102.0
ResNet          17.44   93.75   0.0   0.0   0.0   3.0   1442.0
Transformer     53.52   336.10  0.0   0.0   1.0   6.0   3637.0
Wide and Deep   7.28    16.36   0.0   0.0   0.0   2.5   71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52), and NCF (34.33).

The three models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88), and LSTM (48.82). The three models whose repositories have the lowest average number of forks are Wide and Deep (7.28), ResNet (17.44), and LSTM (17.79).

Kruskal–Wallis Test. The Kruskal–Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen:

• H0: the 7 models' distributions are the same

• H1: the 7 models' distributions are different

from scipy.stats import kruskal
stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(), dfLstm['star'].tolist(),
                  dfNcf['star'].tolist(), dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                  dfWideDeep['star'].tolist())
print(stat, p)
# >> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the previous popularity exploration based on the number of repositories in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or BERT model requires a large amount of time and effort, yet developers still show their interest in those novel deep learning models.


Figure 4.9: Stars vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.



Figure 4.10: Stars vs Development Time


Figure 4.11: Stars vs Open Issues


Figure 4.12: Stars vs Entropy Value

Number of Contributors
From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Among all the repositories, the top-3 repositories with the most stars per contributor are from the models CNN (1687.5 stars/contributor), Transformer (155.1 stars/contributor), and Bert (155.0 stars/contributor).

Model           Percentage of One-Contributor Development (%)
Bert            74.53
CNN             83.3
LSTM            85.9
NCF             100
ResNet          90.26
Transformer     81.20
Wide and Deep   89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time
From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value = 0). This suggests that the longer a model develops, the more stars it will have (i.e., the model becomes more popular). The top-2 repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues
From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can make the assumption that the more popular a repository becomes, the more issues it will have. We will further investigate this correlation in the following section.

Entropy
From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project, we also investigate the impact of a new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution in a repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly in developing a deep learning model. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.
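The Spearman tests reported in this section can be reproduced with scipy; the following is a minimal sketch, where the star and contributor counts are made-up illustrative values rather than the collected dataset:

```python
# Sketch of the Spearman rank-correlation tests used in this section.
# The star and contributor counts below are illustrative, not the real data.
from scipy.stats import spearmanr

stars        = [12, 0, 340, 5, 87, 1, 2203, 54]
contributors = [1, 1, 9, 2, 4, 1, 31, 3]

rho, p = spearmanr(stars, contributors)
print(round(rho, 4), p)  # rho approaches 1 for strongly monotone data
```

Because Spearman works on ranks, it is robust to the heavy-tailed star counts seen in Tables 4.2 and 4.3, which is why it is preferred over Pearson correlation here.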


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether contributions are even or not.

Entropy
In particular, we compute the entropy H of each repository, defined as

p_i = c_i / Σ_i c_i    (4.1)

H = − Σ_i p_i log2(p_i)    (4.2)

where i denotes the i-th contributor, c_i the i-th contributor's contribution, and Σ_i c_i the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example:

The contribution table is summarized in Table 4.5, and its corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214    (4.3)

p1 = 174/214,  p2 = 36/214,  p3 = 4/214    (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.80133    (4.5)

name            contribution
dragen1860      174
ash3n           36
kelvinkoh0308   4

Table 4.5: Sample Contributions to One Repository
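The entropy of Equations 4.1–4.2 can be sketched in a few lines of Python; the contribution counts below are those of Table 4.5 (the exact numeric result depends on the counts used):

```python
import math

def repo_entropy(contributions):
    """Shannon entropy (base 2) of a repository's contribution distribution."""
    total = sum(contributions)
    probs = [c / total for c in contributions]            # Equation 4.1
    return -sum(p * math.log2(p) for p in probs if p > 0) # Equation 4.2

# Contribution counts from Table 4.5 (dragen1860, ash3n, kelvinkoh0308)
h = repo_entropy([174, 36, 4])
print(round(h, 5))
```

A single-contributor repository gives entropy 0, the most uneven case, which matches the peak near zero in Figure 4.13.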

The resulting distribution of entropy for all the repositories can be used to determine whether a repository is developed unevenly: the lower the entropy, the higher the phase separation, which results in more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy value for all models. From those figures, we can see that most of the repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.



Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.
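A retrieval step of this kind can be sketched against the public GitHub REST API; the endpoint below is the standard v3 forks listing, while the function names and token handling are illustrative rather than STAMPER's actual implementation:

```python
import json
import urllib.request

def forks_url(owner, repo):
    # GitHub REST v3 endpoint listing the forks of a repository
    return f"https://api.github.com/repos/{owner}/{repo}/forks?per_page=100"

def list_forks(owner, repo, token=None):
    """Return (full_name, created_at) for each fork of owner/repo."""
    req = urllib.request.Request(forks_url(owner, repo))
    if token:  # an authentication token raises the request-rate limit
        req.add_header("Authorization", f"token {token}")
    with urllib.request.urlopen(req) as resp:
        return [(f["full_name"], f["created_at"]) for f in json.load(resp)]

print(forks_url("google-research", "bert"))
```

In practice the results are paginated, so a full crawl follows the `Link` response header until no further page remains.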

Figure 4.14 and Figure 4.15 highlight the percentage of unique forked repositories compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models.


Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was to provide a summarized view that allows people to see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. To provide a more detailed analysis, we can see at a glance not only that changes are rarely made after forking, but also that most changes are small.



Figure 4.15: Repository Uniqueness Distribution (%)


Figure 4.16: Repository Change Statistics


Most changed repositories differ in size from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long; the lack of tutorials and attention makes them of less concern to people. Second, the model itself may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the forked repositories' development size is quite imbalanced, with a large number of forked projects having no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, the problems of software maintenance in these deep learning related repositories are surveyed. The overall purpose is to explore three sets of factors related to maintenance: development time, number of open issues, and the wiki page for each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report, we calculate the age of each repository from the repository creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
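Using the ISO-8601 timestamps returned by the GitHub API, this is a simple subtraction; the following is a sketch with made-up timestamps:

```python
from datetime import datetime

GITHUB_TS = "%Y-%m-%dT%H:%M:%SZ"  # ISO-8601 format used by the GitHub API

def repo_age_days(created_at, updated_at):
    """age = T(updated_at) - T(created_at), in whole days."""
    delta = datetime.strptime(updated_at, GITHUB_TS) - datetime.strptime(created_at, GITHUB_TS)
    return delta.days

print(repo_age_days("2018-10-31T00:00:00Z", "2019-02-18T00:00:00Z"))  # 110
```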

Figure 4.17 and Table 4.6 show how development time varies depending on the model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide and Deep (11.7 days), ResNet (12.0 days), NCF (21.6 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal–Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). Therefore, we hypothesize that many of the earlier models started using the open-source web community immediately after their first release.


Model           Max of days   Q3 of days   Median of days   Q1 of days   Min of days
Bert            779           229          110              32           0
Transformer     1254          321          142              11           0
Wide and Deep   1107          57.5         11.7             0.5          0
ResNet          1360          456.5        12.0             1.5          0
NCF             1120          47.6         21.6             8            0
LSTM            1812          621.25       315.5            47.25        0
CNN             1385          699.25       483              270.25       0

Table 4.6: Repository Development Time Statistics


Figure 4.17: Development Time Boxplot



Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows the scatter plot correlating development time with the number of issues. As visually suggested by the figure and the Spearman correlation test, there is a weak correlation between those two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, due to the high cost of maintenance, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.299), CNN (3.414), and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model           Mean    Std     25%   50%   75%   Min   Max
Bert            8.299   50.55   0     0     1     0     504
CNN             3.414   35.456  0     0     1     0     1077
LSTM            1.292   4.915   0     0     1     0     69
ResNet          1.791   11.164  0     0     0     0     186
Transformer     1.857   8.608   0     0     1     0     95
Wide and Deep   0.231   0.742   0     0     0     0     4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository   Percentage of repositories having a Wiki (%)
Bert            97.17
CNN             98.498
LSTM            98.799
NCF             98.864
ResNet          98.817
Transformer     96.97
Wide and Deep   100

Table 4.8: Descriptive statistics on the percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this chapter, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common aspects of software engineering in deep learning repositories (popularity, contribution, and maintenance) using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.



Figure 4.19: Open Issues vs Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas on the identification of models using different strategies. We developed heuristics for an in-depth analysis of the construction of models and high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild. This is an open research question which needs further investigation in the future; for example, users may use the prototxt format to publish their models, while in this project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limitations: the present experiment uses a limited number of repositories on GitHub, which cannot exceed the 1000-repository boundary per search. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories in GitHub. It may be that other, more stratified samples would yield a more precise outcome.

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to know how deep learning is involved in our real life, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild; the program could then provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to develop their own heuristics for data selection: experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software, or becoming a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project, we investigated and examined the popularity of deep learning models via the number of repositories in GitHub. It is very likely that the commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that incorporate machine learning clustering algorithms (e.g., K-Means) on high-resolution time series data from commits.
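As a sketch of that direction, K-Means can separate repositories by the shape of their commit-count series; the weekly counts below are invented for illustration only:

```python
# Illustrative only: cluster repositories by weekly commit-count series.
import numpy as np
from sklearn.cluster import KMeans

weekly_commits = np.array([
    [5, 9, 14, 20, 18, 25],   # activity growing
    [30, 22, 15, 9, 4, 1],    # activity declining
    [6, 10, 15, 22, 20, 27],  # growing
    [28, 20, 13, 8, 5, 2],    # declining
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(weekly_commits)
print(labels)
```

Growing and declining repositories fall into separate clusters here; a real analysis would first normalize each series so that clusters reflect trend shape rather than absolute activity.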

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis in GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance, and contribution of GitHub deep learning related repositories and identified factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields and serve the needs of people working at the intersection of social media analysis, data visualization, and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

The ML software landscape (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train, and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing, and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015); Processor: 2.2 GHz Intel Core i7; Memory: 16 GB 1600 MHz DDR3; Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm 2019.1.3 (Professional Edition); Build PY-191.7479.30, built on May 30, 2019; Licensed to ANU / Xing Yu; JRE 11.0.2+9-b159.60 x86_64; JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.; macOS 10.14.6

• Anaconda: jupyter-notebook 6.0.0

Other:
- Python 3.7.4
- pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
- Git

Datasets

asc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_by_star: bert tensorflow.json, lstm tensorflow.json, wide deep tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, cnn tensorflow.json, ncf tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json

by_update_time: bert tensorflow.json, lstm tensorflow.json, resnet tensorflow.json, wide deep tensorflow.json, cnn tensorflow.json, ncf tensorflow.json, transformer tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json

forked_timestamp: bert tensorflow.csv, lstm tensorflow.csv, resnet tensorflow.csv, wide deep tensorflow.csv, cnn tensorflow.csv, ncf tensorflow.csv, transformer tensorflow.csv

7.4 Appendix 4: README

STAM

PER Mapping the landscape of deep

learning models use in the w

ild

cccccc

STAMPER

STAMPER

Table of Contents

Before You BeginPrerequisitesInstallRunningTestH

igh Level Description of all M

odules amp D

atasetsAuthorsLicense

STAM

PER

STAMPER

is python software tool for exam

ining real-world softw

are repositories related to deep learning models to

aid researchers in understanding past and current trends

Before You Begin

Before you begin we strongly recom

mend you dow

nload PyCharm and Anaconda before view

editexecute ourcodePyCharm

Anaconda

Amphetam

ine on the Mac App Store Keep M

ac awake w

ith this useful App (Otherw

ise it will disconnect

internet)

Prerequisites

Please make sure you have installed all of the follow

ing prerequisites on your development m

achine

Git - httpsgit-scm

comdow

nloadsG

it Authentication tokenPython 37 w

ith pip

Jupyter Notebook 600

All external libraries used listed in requirementstxt

Install

Install some dependencies M

ake sure you have the latest pip

Install a list of requirements specified in r

equirementstxt

Running

All the code scripts run from the root

1 Data Collection

Clone our project

Run python3 model_searcherpy

to get keyword related repositories m

etadata in GitH

ub in output

folderAn authentication key is required to get a higher request rateN

ote Request rate could be changed in function call_api

and call_api_dir

Then you need to run sh JSONFormattersh

in your terminal to w

ell-format your output data

Sample Case

In main()

change keywords

in terms of interest Then the resulting JSO

N file w

ill beoutputbertJSON

Customized sorting m

ethod in function get_total_pages

and request_ith_page

Could besort

updated

or stars

order

asc

or desc

2 Repository Search

Run python3 forks_time_stamp_getterpy

to get all your the forks timestam

p in forked_timestamp

3 Data Selection (O

ptional)

pip3 install --upgrade pip

1

pip3 install -r requirementstxt

1

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder.

Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity.

Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance.

Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution.

Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them into the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data, and forked_time_location in three folders.

Instantiation

Once you have the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), whose parameters are the model name and the repository metadata subfolder. Then you can use this object with its related data easily (from Model import bert and use bert as you go along).

Customize Keywords

In module model_keyword.py, import your instantiation (e.g. lstm) and call add_keywords, e.g.:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. Data Selection (Optional): repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw the graphs)

Experiment Datasets Collected

1. After Data Collection

    output/
      asc_by_star/
        cnn tensorflow.json
        lstm tensorflow.json
      asc_general/
        bert.json
        cnn.json
        lstm.json
        ncf.json
        resnet.json
        transformer.json
        wide deep.json
      by_update_time/
        bert tensorflow.json
        cnn tensorflow.json
        lstm tensorflow.json
        ncf tensorflow.json
        resnet tensorflow.json
        transformer tensorflow.json
        wide deep tensorflow.json
      desc_by_star/
        bert tensorflow.json
        cnn tensorflow.json
        lstm tensorflow.json
        ncf tensorflow.json
        resnet tensorflow.json
        transformer tensorflow.json
        wide deep tensorflow.json
      desc_general/
        bert.json
        cnn.json
        lstm.json
        ncf.json
        resnet.json
        transformer.json
        wide deep.json
      pytorch_models/
        AlexNet.json
        DCGAN.json
        Densenet.json
        FCN-ResNet101.json
        GoogleNet.json
        HarDNet.json
        Inception_v3.json
        MobileNet v2.json
        PGAN.json
        ResNet.json
        ResNet101.json
        ResNext WSL.json
        ResNext.json
        RoBERTa.json
        SSD.json
        ShuffleNet v2.json
        SqueezeNet.json
        Tacotron 2.json
        Transformer.json
        U-Net pytorch.json
        U-Net.json
        WaveGlow.json
        Wide ResNet.json
        fairseq.json
        vgg_nets.json

2. After Repository Search

    forked_timestamp/
      bert tensorflow.csv
      cnn tensorflow.csv
      lstm tensorflow.csv
      ncf tensorflow.csv
      resnet tensorflow.csv
      transformer tensorflow.csv
      wide deep tensorflow.csv


3. After Data Selection (Optional)

    filtered_repo/
      bert.json
    pytorch_model_filtering/
      Densenet.json
      FCN-ResNet101.json
      GoogleNet.json
      MobileNet v2.json
      ResNet101.json
      ResNext.json
      ShuffleNet v2.json
      SqueezeNet.json
      Tacotron 2.json
      Wide ResNet.json
      vgg_nets.json
    tensorflow_model_filtering/
      bert.json
      lstm.json
      ncf.json
      resnet.json
      transformer.json
      wide deep.json

Generated Graphs

    graphs/
      contribution/
        change_to_pdf.bash
        entropy_distribution.svg
        entropy_dots.svg
        lines_changed_boxs.svg
        lines_changed_hists.svg
        unique_percentage_distribution.svg
        uniqueness_chart.svg
      maintenance/
        devTime_boxplot.svg
        issues_distribution.svg
        wiki_yn.svg
      multi_variable/
        dev_t_to_open_issues.svg
        multi_correlation.svg
        star_to_contributors.svg
        star_to_dev_t.svg
        star_to_entropy.svg
        star_to_open_issues.svg
      popularity/
        accumulated_popularity.svg
        creation_repository_trend_total.svg
        creation_with_fork_timeline.svg
        fork_distribution.svg
        popularity_dot.svg
        popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr. Ben Swift.

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (license: free for personal use only).

Bibliography

a. GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

b. GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

c. GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

d. GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
    • Deep learning models
    • Summarized Timeline
  • Public Code Repositories
    • Web-based hosting service
    • Measuring Popularity From GitHub
    • Extracting Messy Data in the Wild
    • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    • RQ2: How popularity varies per model
    • RQ3: Does the popularity of models relate to other features?
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commitments Timestamp
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Softwares
      • Other
      • Datasets
  • Appendix 4: README

STAMPER Design and Implementation

Data Collection
We first collect all the repository metrics through the GitHub API. This step allows us to extract the history of all the repositories related to a keyword and record the metadata for each repository (repository-based search).

Repository Search
As each repository contains multiple contributors and forks, we subsequently extract and analyze whether changes were made, based on the size information, and calculate the collaborative factor (entropy) for those repositories. This process requires additional crawling and processing of the fork information to create visual representations.

Data Selection
We implemented a selector that allows excluding specific repositories unrelated to the desired ones. The selector summarizes the frequency counts for the keywords the user entered and writes the corresponding frequencies to the local disk. The resulting file can be used to examine API usage statistics.

Data Analysis
Since each forked repository may involve re-development and modification, the modification of forked repositories is included in our project. For all changed repositories, the size difference is analyzed to examine whether lines were added or removed compared to the original repository.

3.2 Data Collection

Our tool automatically collects historical repository metadata and stores it on the localhost, allowing users to analyze the data in depth, manipulate it, and even run statistical tests on the data set. To better understand those metrics, we divided them into multiple categories. For the attributes that are not primary data from the GitHub API, we explain them in the data-expansion part and label them as [Data Expansion] in Table 3.2.

To maximize the GitHub API request rate, the user is required to authenticate by entering their OAuth2 token at the start of the program. After authentication, the user can make up to 5000 requests per hour; otherwise, the rate limit only allows up to 60 requests per hour [Git d].
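Before a long crawl it can help to check the remaining quota. A minimal sketch against GitHub's rate-limit endpoint (the helper names are ours, not STAMPER's):

```python
import json
import urllib.request

# Sketch: check the remaining GitHub API quota before crawling.
# Authenticated clients are allowed 5000 requests/hour, anonymous ones only 60.
def parse_rate_limit(payload):
    core = payload["resources"]["core"]
    return core["remaining"], core["limit"]

def remaining_requests(token=None):
    headers = {"Authorization": f"token {token}"} if token else {}
    req = urllib.request.Request("https://api.github.com/rate_limit", headers=headers)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return parse_rate_limit(json.load(resp))
```

A crawler can call remaining_requests(token) periodically and sleep until the window resets when the remaining count approaches zero.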


Type | Meta-data
Contributor | contribution: int [Data Expansion]; login (user name): String; type (user/organization): String; contributors_url
Repository | created_at; description; full_name; language; size
Popularity | fork: Boolean; forks: int; forks_url; stargazers_count; watchers_count; unique_repos [Data Expansion]
Owner | id; login (username); type
Maintenance | has_issues: Boolean; has_wiki: Boolean; open_issues: int; pushed_at; updated_at; score

Table 3.2: Repository metadata collected, grouped by category

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deep crawling operations to collect raw data.

• Contribution
One repository generally involves multiple developers conducting software development. The project owner is not necessarily the person who contributes the largest amount of code, and the amounts contributed by different developers are potentially unequal. As a result, we further track that information through the GitHub API and record the number of contributions each developer made to each repository.

• Unique_repos
Popular repositories may have numerous related forked repositories owned by other users, and the reasons behind the forking behaviour vary. Our research explores whether developers conduct subsequent development based on the original codebase.


By comparing the size of each forked repository (F_i) with that of the original repository (O), we obtain all the forked repositories together with their change of size (c):

F_i + c = O    (3.1)
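A minimal sketch of this size comparison (the helper name and data layout are illustrative; the GitHub API reports repository size in kilobytes):

```python
# Sketch: flag forked repositories whose size differs from the original,
# i.e. those with a non-zero change-of-size c in Equation (3.1).
def changed_forks(original_size, fork_sizes):
    """Return (changed, unchanged) lists of (fork index, c) pairs."""
    changed, unchanged = [], []
    for i, f_size in enumerate(fork_sizes):
        c = original_size - f_size  # c such that F_i + c = O
        (changed if c != 0 else unchanged).append((i, c))
    return changed, unchanged

# Example: an original repository of size 120 KB with three forks
changed, unchanged = changed_forks(120, [120, 95, 150])
```

Forks in the unchanged list still have exactly the original size, suggesting no subsequent development; the changed list feeds the uniqueness analysis below.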

3.4 Data Selection

Figure 3.2: Data Selection — entity (model) API keywords are searched in repositories to produce statistics.

Figure 3.3: Store in Local Disk — forked-repository timestamps, unfiltered data, and filtered data are grouped (Model.py) under model-related keywords such as Bert, ResNet, and CNN.


Figure 3.2 represents our method of searching for API usage in DL-model-related repositories. GitHub provides a REST API for examining whether a repository contains the targeted API or not.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. Moreover, the appearance frequency of the user-specified API is embedded directly within the returned result and can be identified in our program by each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. The approach also allows users to build a high-level picture of API usage in GitHub repositories.

Meanwhile, we provide multiple options for selecting APIs, stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the models related to TensorFlow and the number of repositories we retrieved from GitHub.
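The appearance-counting step can be sketched as a simple scan over a repository's source text (illustrative only; STAMPER obtains these counts from the GitHub REST API result rather than scanning files locally):

```python
# Illustrative sketch: count appearances of user-specified API keywords
# across a repository's source files, mirroring the frequency statistic
# that the GitHub search result embeds for each repository.
def count_api_appearances(file_contents, keywords):
    """file_contents: iterable of source-file strings; returns {keyword: count}."""
    counts = {kw: 0 for kw in keywords}
    for text in file_contents:
        for kw in keywords:
            counts[kw] += text.count(kw)
    return counts

files = ["from keras.applications.resnet50 import ResNet50\nmodel = ResNet50()"]
counts = count_api_appearances(files, ["ResNet50"])  # two matches in this file
```

Repositories whose counts are all zero are the ones the selector excludes as unrelated.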

Example: The Keras application library (https://github.com/keras-team/keras-applications) lets users load deep learning models and instantiate a model with default weights. Take ResNet as an example:

• With pre-defined models
A user may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, which could then be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility for creating deep learning models, and all of them could be used as sample keywords in STAMPER, such as:

    keras.applications.resnet.ResNet50
    keras.applications.resnet.ResNet101
    keras.applications.resnet.ResNet152
    keras.applications.resnet_v2.ResNet50V2
    keras.applications.resnet_v2.ResNet101V2
    keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository (https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py). In the sample class given, ResNet consists of two main blocks. Users who would like to explore the components inside the ResNet class further may first examine whether it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, data selection heuristics vary: deep learning users and experts can define their searches according to their interests and preferences.

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from these three perspectives is illustrated in Figure 3.4. Meanwhile, the next chapter gives an example of our collected repository metadata for deep learning models.

Figure 3.4: Overall Construction of the Visualizations — each entity (1..n) is functionally mapped to contribution-related, popularity-related, and maintenance-related visualisations.

Popularity

• Total number of repositories with forks (line)
• Total number of repositories without forks (line)
• Number of creations over time, grouped in weeks (with forks)
• Repository creation time vs. stars

Contribution

To further exploit the forking information, STAMPER supports the comparison between an original repository and its forked repositories. The work could be extended by visiting the forked repositories' URLs and tracing the commits.

As shown in Figure 3.5, an entity (E) we search in GitHub may have multiple related repositories (R_i), each with corresponding forked repositories (F_i). Among the forked repositories, we call a changed forked repository C_i.

To examine whether changes exist in forked repositories, and the differences between multiple entities, we calculate the difference using the equation below.


Keyword | Total Repositories Collected (including forks) | Total Original Repositories Collected
ResNet tensorflow | 6129 | 339
Bert tensorflow | 13734 | 106
CNN tensorflow | 39765 | 1000
LSTM tensorflow | 19572 | 1000
Transformer tensorflow | 7188 | 145
Wide and deep tensorflow | 324 | 39

Table 3.1: Repositories Related to Tensorflow

Uniqueness percentage distributions are composed of all the percentages p_i, each corresponding to an original repository R_i:

p_i = (∑ C_i) / (∑ F_i)    (3.2)
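Equation (3.2) can be sketched over one entity's repositories as follows (the data layout is illustrative, not STAMPER's internal representation):

```python
# Sketch: uniqueness percentage p_i for each original repository R_i,
# i.e. the fraction of its forks whose codebase changed (Equation 3.2).
def uniqueness_percentages(forks_changed_per_repo):
    """forks_changed_per_repo: one list of changed? flags per original repository."""
    percentages = []
    for fork_flags in forks_changed_per_repo:
        if fork_flags:  # skip repositories with no forks
            percentages.append(sum(fork_flags) / len(fork_flags))
    return percentages

# Three original repositories: 4 forks (3 changed), 2 forks (none changed), no forks
p = uniqueness_percentages([[True, False, True, True], [False, False], []])
```

The resulting list of p_i values is exactly what the boxplots and histograms listed below summarize per entity.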

Figure 3.5: Examine Uniqueness after Forking — an entity (E) maps to repositories 1–4; each forked repository (1..n) is flagged as changed (Y/N).

• Percentage of forked repositories unique from origin (boxplots)
• Uniqueness percentage distribution for each entity (histograms)
• Entropy distribution for each entity (histograms)
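The entropy named above can be computed from each repository's per-contributor contribution counts. A sketch using Shannon entropy (STAMPER's entropy_calculation.py may normalize differently):

```python
import math

# Sketch: Shannon entropy of a repository's contribution distribution.
# A single-author repository scores 0; evenly shared work scores highest.
def contribution_entropy(contributions):
    total = sum(contributions)
    entropy = 0.0
    for c in contributions:
        if c > 0:
            p = c / total
            entropy -= p * math.log2(p)
    return entropy

even = contribution_entropy([10, 10, 10, 10])  # 2.0 bits: four equal contributors
solo = contribution_entropy([42])              # 0.0 bits: a single contributor
```

Higher entropy thus indicates a more collaborative repository, which is why its distribution is plotted per entity.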

Maintenance

• Development time boxplot for each entity
• Open issues distribution for each entity

3.6 Summary

In this chapter we detailed how our tool conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analyzing software trends in the GitHub community. We also introduced and analyzed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving and are built, trained, and deployed by researchers. Our tool is designed to analyze such changes: we collected the historical information stored in GitHub and extracted repository metadata using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development is a battlefield without the smoke of gunpowder. Researchers, companies, and developers are all trying to dominate the conversation in deep learning. A variety of models exists, yet there is no common bridge connecting those ideas. Historical data in GitHub are obscure and hard to find, and as a result many developers, especially experienced ones, remain in their original comfort zone. With our study, we hope to shed some light on deep learning and highlight a few suggestions for the public.

This section aims to answer questions related to both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on their number of stars. However, because there are few studies about popularity in the GitHub ecosystem, there is no standardized feature for measuring it. We analyze some potential features of each repository and hypothesize that popularity is strongly related to the number of stars a repository owns.

This decision is justified in the following section with more background on GitHub.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of changing activity in a repository they are watching; however, watching does not imply being a collaborator [Git b]. A watcher may watch a repository to receive notifications for newly created pull requests or issues. Watchers can thus indicate how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

Figure 4.2: Star Sort Menu [Git a]

• Stars
Starring a repository makes it easy for users to keep track of repositories they are interested in. A starred repository appears on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks
Forks are created when a user would like to make their own copy of a repository. A user can fork a repository to suggest changes or to use it as the basis for a new project.

Based on the data gathered in the data collection stage, we extract the watchers_count, stargazers_count, and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star | forks_count | watchers_count | model name
17940 | 4661 | 17940 | Bert
12405 | 3637 | 12405 | Bert
5263 | 1056 | 5263 | Bert

Table 4.1: Popularity metrics for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks, and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we should not assume normality and instead consider a non-parametric test.


Spearman Correlation Coefficient

Definition
The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman 2011]. This method allows us to test for a rank-order relationship between two numerical or ordinal variables associated with a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: the variables (star, fork, and watcher) do not have a relationship with each other.

• H1: there is a relationship between those three variables.

Result

• Star vs. Fork

    from scipy.stats import spearmanr
    coef1, p1 = spearmanr(star, forks)
    print(coef1, p1)
    # >> 0.8752903811064278 0.0

• Star vs. Watcher

    coef2, p2 = spearmanr(star, watchers)
    print(coef2, p2)
    # >> 1.0 0.0

• Fork vs. Watcher

    coef3, p3 = spearmanr(forks, watchers)
    print(coef3, p3)
    # >> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables on the testing dataset.

Set α = 0.05. The p-values p1, p2, and p3 are all less than α, and the calculation above also shows strong positive correlations, with values of coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875 respectively.

This means that the likelihood that the testing data are uncorrelated is very small (95% confidence), and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of this report, we consider the number of stars to be the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, look like two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from these models with a longer history, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as already described in the background section.

The model development community recently saw the release of multiple powerful frameworks treated as baselines for building models. However, for many new models, such as the Wide-and-Deep model and the NCF model, usage did not grow in abundance.


Figure 4.4: Repositories with Forks — number of repositories created with forks (accumulated), 2015–2019, per model.

Figure 4.5: Repositories without Forks — number of repositories created (accumulated), 2015–2019, per model.


Figure 4.6: Repository Trend in GitHub for Each Model — repositories created over time (October 2015 – October 2019), one panel per model.


Figure 4.7: Creation Time vs. Stars — each repository's creation date plotted against its number of stars, per model (October 2015 – October 2019).

A fork is another copy of a repository. A forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories created including forks and the total number created excluding forks. We find that most repositories related to deep learning models are not original, which indicates that a considerable number of developers remain at a studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, unlike the previous summarizing method, we study how the popularity of deep learning models changes over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the largest number of repositories created. Let us examine this using the data: in


2017, the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend rose to a higher level that continues now.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have among the most important and significant communities in the deep learning field. These networks are essential in both computer vision and NLP, and they form an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly in the recent two years. Differing from previous structures, both modify the original structure and significantly improve results in computer vision and translation tasks.

Rising star: Bert

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends depicted in the graph support the inference that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with the time the model came into existence, but our data tells a different story.

Published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e. stars) and creation time; past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide and Deep model, also published in 2016, is similar: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of it. Moreover, previous data also confirmed that there is no significant rise in the use of this model.

[Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development. Fork-count distribution histograms, one panel per model.]


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name     Mean    STD      Min  25%  50%  75%  Max
Bert           498.65  2196.3   0    1    8    43   17940
CNN            106.84  611.97   2    3    8    32   13882
LSTM           48.82   214.22   0    1    2    13   2703
NCF            77      129.91   1    2    3    115  227
ResNet         46.88   221.43   0    0    1    8    2980
Transformer    186.79  1155.87  0    0    4    21   12408
Wide and Deep  16.23   36.80    0    0    1    8    146

Table 4.2: Stars Comparison

Model Name     Mean        STD         Min  25%  50%  75%   Max
Bert           128.214953  585.926617  0.0  0.0  1.0  16.5  4661.0
CNN            40.710      252.713617  0.0  1.0  4.0  14.0  6274.0
LSTM           17.793      71.956709   0.0  0.0  1.0  5.0   968.0
NCF            34.333333   58.603185   0.0  0.5  1.0  51.5  102.0
ResNet         17.442478   93.754994   0.0  0.0  0.0  3.0   1442.0
Transformer    53.518797   336.103826  0.0  0.0  1.0  6.0   3637.0
Wide and Deep  7.282051    16.364192   0.0  0.0  0.0  2.5   71.0

Table 4.3: Forks Comparison
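Descriptive statistics of this kind can be reproduced with Python's standard library alone. The sketch below is illustrative only: the model names come from this study, but the star counts are made-up toy values, not the collected data.

```python
import statistics

# Toy star counts per model -- placeholders, not the STAMPER-collected data.
stars = {
    "bert": [17940, 43, 8, 1, 0],
    "ncf": [227, 115, 3, 2, 1],
}

for model, counts in sorted(stars.items()):
    mean = statistics.mean(counts)
    std = statistics.stdev(counts)  # sample standard deviation
    print(model, round(mean, 2), round(std, 2), min(counts), max(counts))
```

In the real analysis the same summary comes from the per-model DataFrames built from STAMPER's JSON output.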

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52) and NCF (34.33).

The three models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The three models whose repositories have the lowest average number of forks are Wide and Deep (7.28), ResNet (17.44) and LSTM (17.79).

Kruskal-Wallis Test. The Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen:

• H0: the 7 models' distributions are the same.

• H1: the 7 models' distributions are different.

from scipy.stats import kruskal

stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(),
                  dfLstm["star"].tolist(), dfNcf["star"].tolist(),
                  dfResnet["star"].tolist(), dfTransformer["star"].tolist(),
                  dfWideDeep["star"].tolist())
print(stat, p)
# >> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' star distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity comparison based on the number of repositories in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or BERT model may require a large amount of time and effort, yet developers still show their interest in these novel deep learning models by starring and forking them.

[Figure 4.9: Star vs Contributors. Scatter of number of contributors against stargazer count, colored by model.]

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.


[Figure 4.10: Star vs Development Time. Scatter of development duration against stargazer count, colored by model.]

[Figure 4.11: Star vs Open Issues. Scatter of open-issue count against stargazer count, colored by model.]

[Figure 4.12: Star vs Entropy Value. Scatter of collaboration entropy against stargazer count, colored by model.]

Number of Contributors. From Figure 4.9 and a Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, p-value ≤ 0.01). Across all repositories, the top-3 models with the most stars per contributor are CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor) and Bert (155.0 stars/contributor).
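The Spearman tests reported in this section are rank correlations; in practice `scipy.stats.spearmanr` is the usual tool, but the computation itself is simple. The following self-contained sketch runs on toy star/contributor counts (all values hypothetical):

```python
def rankdata(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank-correlation coefficient (Pearson correlation of the ranks)."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy star and contributor counts (hypothetical values).
print(spearman([5, 40, 120, 300], [1, 1, 3, 4]))  # close to 1: strongly monotonic
```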

Model          Percentage of One-Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time. From Figure 4.10 and a Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value ≈ 0). This suggests that the longer a model has been developed, the more stars it accrues (i.e. the model becomes more popular). The two repositories with the longest development duration belong to the LSTM and CNN models.

Open Issues. From Figure 4.11 and a Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186, p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesize that the more popular a repository becomes, the more open issues it will have. We further investigate this correlation in the following section.

Entropy. From Figure 4.12 and a Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project we also investigate the impact of a new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development work is not distributed evenly.

We can confirm this using Table 4.4: most deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Since software development may involve multiple developers, and each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether contributions are evenly distributed.

Entropy. In particular, we compute the entropy H of each repository, defined as

p_i = c_i / Σ_i c_i    (4.1)

H = -Σ_i p_i log2(p_i)    (4.2)

Here i indexes the contributors, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution to the repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example:

The contribution table is summarized in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214    (4.3)

p1 = 174/214,  p2 = 36/214,  p3 = 4/214    (4.4)

H(repository) = -(174/214 log2(174/214) + 36/214 log2(36/214) + 4/214 log2(4/214)) ≈ 0.7826    (4.5)

name           contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample Contributions to One Repository
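The entropy of Equations (4.1)-(4.2) takes only a few lines to compute; this is a minimal sketch (the function name is illustrative, not STAMPER's actual API):

```python
import math

def collaboration_entropy(contributions):
    """Entropy (base 2) of a repository's contribution distribution."""
    total = sum(contributions)
    probs = [c / total for c in contributions]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Entropy of the Table 4.5 distribution (dragen1860, ash3n, kelvinkoh0308).
print(collaboration_entropy([174, 36, 4]))
```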

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly: the lower the entropy, the more unevenly the work is distributed.

Figure 4.13 shows the distribution of the entropy values for all models. From these figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


[Figure 4.13: Collaboration Entropy. Entropy distribution histograms, one panel per model.]


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repositories' metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models.
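As an illustration of the uniqueness measure, a fork can be flagged as "unique" when its content differs from the original. The sketch below uses hypothetical repository sizes as a crude proxy; the real analysis compares the repository metadata collected by STAMPER.

```python
def unique_fork_percentage(original_size, fork_sizes):
    """Percentage of forks whose size differs from the original repository."""
    if not fork_sizes:
        return 0.0
    changed = sum(1 for size in fork_sizes if size != original_size)
    return 100.0 * changed / len(fork_sizes)

# Hypothetical sizes (KB): the original repo and five forks, two of them changed.
print(unique_fork_percentage(132, [132, 132, 140, 131, 132]))  # 40.0
```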

[Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot). One box per model, 0-100%.]

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. Looking in more detail, we can see at a glance not only that changes are rarely made after forking, but also that most changed


[Figure 4.15: Repository Uniqueness Distribution (%). Histograms of the changed-content percentage, one panel per model.]

[Figure 4.16: Repository Change Statistic. Histograms of repository size change after forking, one panel per model.]


repositories differ in size from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them less used. Second, a model may only be valid for specific types of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section we survey software maintenance in these deep learning related repositories. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work; older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation and last-update times, as depicted in the equation below:

age = T(updated_at) - T(created_at)    (4.6)
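With the ISO-8601 timestamps returned by the GitHub API, this age is a simple datetime subtraction. The timestamps below are hypothetical examples, not values from the dataset:

```python
from datetime import datetime

created_at = "2018-10-31T18:30:24Z"  # hypothetical GitHub API timestamps
updated_at = "2019-02-18T09:12:00Z"

fmt = "%Y-%m-%dT%H:%M:%SZ"
age = datetime.strptime(updated_at, fmt) - datetime.strptime(created_at, fmt)
print(age.days)  # 109
```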

Figure 4.17 and Table 4.6 show how development time varies per model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). Therefore we hypothesize that many of these earlier models started being used in the open-source web community immediately after their first release.


Model        Max of days  Q3 of days  Median of days  Q1 of days  Min of days
Bert         779          229         110             32          0
Transformer  1254         321         142             11          0
Wide deep    1107         575         117             0.5         0
ResNet       1360         456.5       120             15          0
NCF          1120         476         216             8           0
LSTM         1812         621.25      315.5           47.25       0
CNN          1385         699.25      483             270.25      0

Table 4.6: Repository Development Time Statistics

[Figure 4.17: Development Time Boxplot. Development duration in days, one box per model.]


[Figure 4.18: Development Time vs Number of Open Issues. Scatter, colored by model.]

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of open issues. As visually suggested by the figure and a Spearman correlation test, there is a weak correlation between the two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, which cost more to maintain, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository  Percentage of Repositories Having a Wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
ResNet                    98.817
Transformer               96.97
Wide deep                 100

Table 4.8: Descriptive statistics on percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e. they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. These samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software-engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation of whether developers change the original codebase after forking.


[Figure 4.19: Open Issues vs Number of Repository. Histograms of open-issue counts, one panel per model.]


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies. We developed heuristics for an in-depth analysis of model construction and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation in the future (for example, users may publish their models in prototxt format). In this project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limitations: the present experiment uses a limited number of repositories, since GitHub search results cannot exceed the 1000-repository boundary per query. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all repositories in GitHub. Other, more stratified samples might yield a more precise outcome.
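One common workaround for GitHub's 1000-result search cap is to partition a query by creation-date windows and issue one search per window; the sketch below assumes each window returns fewer than 1000 repositories.

```python
from datetime import date, timedelta

def date_windows(start, end, days=30):
    """Yield GitHub search 'created:' qualifiers splitting [start, end] into windows."""
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=days), end)
        yield f"created:{cur.isoformat()}..{nxt.isoformat()}"
        cur = nxt + timedelta(days=1)

print(list(date_windows(date(2019, 1, 1), date(2019, 3, 1))))
```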

Nevertheless, our research project is essential in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real lives, an idea that is novel compared with previous work.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly, with different models released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to devise their own heuristics for data selection; experts could easily change the API search in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a plugin for GitHub, allowing researchers



and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models via the number of repositories existing in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine-learning clustering algorithms (e.g. K-Means) to high-resolution time-series data from commits.

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis in GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of deep-learning-related GitHub repositories and identified factors affecting each. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the tool's ability to help users gain a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving and changing rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past, based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

bull Identify data sources for current trends in model amp dataset use

bull Develop visualization analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated this program is my own original work

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

bull PyCharm

PyCharm 2019.1.3 (Professional Edition)

sect73 Appendix 3 Artefact Description 53

Build #PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
macOS 10.14.6

bull Anaconda

- jupyter-notebook 6.0.0

Other

- Python 3.7.4
- pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
- Git

Datasets

asc_general/: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_general/: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_by_star/: bert tensorflow.json, lstm tensorflow.json, wide deep tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, cnn tensorflow.json, ncf tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json

by_update_time/: bert tensorflow.json, lstm tensorflow.json, resnet tensorflow.json, wide deep tensorflow.json, cnn tensorflow.json, ncf tensorflow.json, transformer tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json

forked_timestamp/: bert tensorflow.csv, lstm tensorflow.csv, resnet tensorflow.csv, wide deep tensorflow.csv, cnn tensorflow.csv, ncf tensorflow.csv, transformer tensorflow.csv

7.4 Appendix 4: README

STAM

PER Mapping the landscape of deep

learning models use in the w

ild

cccccc

STAMPER

STAMPER

Table of Contents

Before You BeginPrerequisitesInstallRunningTestH

igh Level Description of all M

odules amp D

atasetsAuthorsLicense

STAM

PER

STAMPER

is python software tool for exam

ining real-world softw

are repositories related to deep learning models to

aid researchers in understanding past and current trends

Before You Begin

Before you begin we strongly recom

mend you dow

nload PyCharm and Anaconda before view

editexecute ourcodePyCharm

Anaconda

Amphetam

ine on the Mac App Store Keep M

ac awake w

ith this useful App (Otherw

ise it will disconnect

internet)

Prerequisites

Please make sure you have installed all of the follow

ing prerequisites on your development m

achine

Git - httpsgit-scm

comdow

nloadsG

it Authentication tokenPython 37 w

ith pip

Jupyter Notebook 600

All external libraries used listed in requirementstxt

Install

Install some dependencies M

ake sure you have the latest pip

Install a list of requirements specified in r

equirementstxt

Running

All the code scripts run from the root

1 Data Collection

Clone our project

Run python3 model_searcherpy

to get keyword related repositories m

etadata in GitH

ub in output

folderAn authentication key is required to get a higher request rateN

ote Request rate could be changed in function call_api

and call_api_dir

Then you need to run sh JSONFormattersh

in your terminal to w

ell-format your output data

Sample Case

In main(), change keywords to the terms of interest. The resulting JSON file will be output/bert.JSON.

The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars, and order can be asc or desc.

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

pip3 install --upgrade pip
pip3 install -r requirements.txt

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder.

Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity.

Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance.

Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution.

Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them into the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data, and forked_time_location in three folders.

Instantiation

Once you have the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with parameters model_name and the repository-metadata subfolder. Then you can use this object with its related data easily (from Model import bert and use bert as you go along).

Customize Keywords

In module model_keyword.py, import your instantiation (lstm) and call add_keywords on a list of keyword strings.

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

Altair is used to draw elegant graphs.

Keyword customization example (for Customize Keywords above):

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

Experiment Datasets Collected

1. After Data Collection

output/
    asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
    asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv


Generated Graphs

3. After Data Selection (Optional)

filtered_repo/
    pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
    tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json


graphs/
    contribution/: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
    maintenance/: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
    multi_variable/: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
    popularity/: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (license: free for personal use only).

Bibliography

Git, a. GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

Git, b. GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

Git, c. GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

Git, d. GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
    • Deep learning models
    • Summarized Timeline
  • Public Code Repositories
    • Web-based hosting service
    • Measuring Popularity From GitHub
    • Extracting Messy Data in the Wild
    • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    • RQ2: How popularity varies per model
    • RQ3: Does the popularity of models relate to other features?
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commitments Timestamp
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Softwares
      • Other
      • Datasets
  • Appendix 4: README


Type          Meta-data
Contributor   contribution: int [Data Expansion]; login (user name): String; type (user organization): String; contributors_url
Repository    created_at; description; full_name; language; size
Popularity    fork: Boolean; forks: int; forks_url; stargazers_count; watchers_count; unique_repos [Data Expansion]
Owner         id; login (username); type
Maintenance   has_issues: Boolean; has_wiki: Boolean; open_issues: int; pushed_at; updated_at; score
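The fields above map directly onto the JSON records returned by the GitHub API. As a minimal illustration (the record below is a toy stand-in, not real collected data), the popularity-related fields can be pulled out of one record like this:

```python
import json

# A toy repository record carrying the popularity fields from the table
# (the values are illustrative, not real collected data).
record_json = '''
{
  "full_name": "example/bert-demo",
  "created_at": "2018-11-01T00:00:00Z",
  "language": "Python",
  "fork": false,
  "forks": 42,
  "stargazers_count": 170,
  "watchers_count": 170,
  "has_issues": true,
  "has_wiki": false,
  "open_issues": 3
}
'''

def popularity_fields(record):
    """Pick out the popularity-related metadata fields."""
    return {key: record[key] for key in
            ("fork", "forks", "stargazers_count", "watchers_count")}

record = json.loads(record_json)
print(popularity_fields(record))
# {'fork': False, 'forks': 42, 'stargazers_count': 170, 'watchers_count': 170}
```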

3.3 Repository Search

We expanded the collection operation to use event data as the starting point for deep crawling operations to collect raw data.

• Contribution
One repository generally involves multiple developers conducting software development, and the project owner is not necessarily the person who contributes the most code. The amounts of contribution made by the developers are potentially not the same. As a result, we further track that information by utilizing the GitHub API and record the number of contributions each developer made for each repository.

• Unique_repos
Popular repositories may have numerous forked repositories owned by other users, and the reasons behind the forking behaviour vary. Our research explores whether these users conduct subsequent development based on the original codebase. By comparing the size of each forked repository (F_i) with the original repository (O), we obtain all the forked repositories with the change of size (c):

    F_i + c = O    (3.1)

3.4 Data Selection

[Figure 3.2: Data Selection: an entity (model) and API keywords are searched in repositories to produce statistics]

[Figure 3.3: Store in Local Disk: the forked repository timestamps, unfiltered data, and filtered data are grouped (Model.py) under model-related keywords such as Bert, ResNet, and CNN]


Figure 3.2 represents our method to search API usage in DL model-related repositories. GitHub provides a REST API to examine whether a repository contains the targeted API or not.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. The appearance frequency of the user-specified API is embedded directly within the returned result and can be identified in our program by each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. This approach also allows users to construct knowledge of API usage in GitHub repositories from a high-level perspective.

In the meanwhile, we also provide multiple options for selecting APIs, stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the number of models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras application library provides users the ability to load deep learning models and instantiate a model with default weights. Using ResNet as an example:

• With pre-defined models
Users may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, which could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they could all be used as sample keywords in STAMPER, such as:

keras.applications.resnet.ResNet50
keras.applications.resnet.ResNet101
keras.applications.resnet.ResNet152
keras.applications.resnet_v2.ResNet50V2
keras.applications.resnet_v2.ResNet101V2
keras.applications.resnet_v2.ResNet152V2

• With self-defined models
TensorFlow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to explore the components inside the ResNet class further, they may first examine where it is defined and trace the number of self-defined ResNet classes. Due to the flexibility of model construction, the data selection heuristic varies: deep learning users and experts can define their searches according to their interests and preferences.

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py
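The selection step amounts to counting how often the user-specified API keywords appear per repository; below is a small illustrative sketch of that counting over in-memory source strings (the helper is ours for illustration; STAMPER itself obtains these frequencies through the GitHub search API):

```python
def api_frequency(file_contents, keywords):
    """Count how often each user-specified API keyword appears across a
    repository's source files (file_contents: list of source strings)."""
    counts = {k: 0 for k in keywords}
    for source in file_contents:
        for k in keywords:
            counts[k] += source.count(k)
    return counts

# Two toy source files standing in for a repository's code.
files = [
    "from keras.applications.resnet50 import ResNet50\nmodel = ResNet50()",
    "import tensorflow as tf\n# no keras here",
]
print(api_frequency(files, ["keras.applications.resnet50", "ResNet50"]))
# {'keras.applications.resnet50': 1, 'ResNet50': 2}
```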

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from these three perspectives is illustrated in Figure 3.4. In the meanwhile, Chapter 5 gives an example of our collected repository metadata for deep learning models.

[Figure 3.4: Overall Construct the Visualizations: entities (Entity 1 ... Entity n) are functionally mapped to popularity-related, contribution-related, and maintenance-related visualisations]

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped in weeks (with forks)

• Repository Creation Time vs Stars
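The creation-trend plots above rest on grouping repository created_at timestamps into weekly buckets and accumulating the counts; a minimal sketch of that grouping, assuming ISO-format timestamps as returned by the GitHub API:

```python
from collections import Counter
from datetime import datetime
from itertools import accumulate

def weekly_cumulative(created_at_timestamps):
    """Accumulated repository count per ISO (year, week), chronologically."""
    per_week = Counter()
    for stamp in created_at_timestamps:
        iso = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").isocalendar()
        per_week[(iso[0], iso[1])] += 1   # bucket by (ISO year, ISO week)
    weeks = sorted(per_week)
    return list(zip(weeks, accumulate(per_week[w] for w in weeks)))

timestamps = [
    "2019-01-01T10:00:00Z",  # ISO week 1 of 2019
    "2019-01-03T12:00:00Z",  # ISO week 1 of 2019
    "2019-01-09T09:00:00Z",  # ISO week 2 of 2019
]
print(weekly_cumulative(timestamps))
# [((2019, 1), 2), ((2019, 2), 3)]
```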

Contribution

To further exploit the forking information, STAMPER also supports comparison between an original repository and its forked repositories. This work could be further extended by visiting each forked repository URL and tracing the commits.

As shown in Figure 3.5, the entity (E) we search in GitHub may have multiple related repositories (R_i) with their corresponding forked repositories (F_i). Among the forked repositories, we denote a changed forked repository by C_i.

To examine whether there exist changes in the forked repositories, and the difference between multiple entities, we calculate the difference using the equation below.


Keyword                    Total Repositories Collected (including forks)    Original Repositories Collected
ResNet tensorflow          6,129                                             339
Bert tensorflow            13,734                                            106
CNN tensorflow             39,765                                            1,000
LSTM tensorflow            19,572                                            1,000
Transformer tensorflow     7,188                                             145
Wide and deep tensorflow   324                                               39

Table 3.1: Repositories Related to Tensorflow

Uniqueness percentage distributions are composed of all the percentages p_i, each corresponding to its original repository R_i:

    p_i = (Σ C_i) / (Σ F_i)    (3.2)

[Figure 3.5: Examine Uniqueness after Forking: an entity (E) has related repositories (Repository 1 ... Repository 4), each with forked repositories (Forked Repository 1 ... n) flagged as changed (Y/N)]

• Percentage of Forked Repositories Unique from Origin (Boxplots)

• Uniqueness percentage distribution for Each Entity (Histograms)

• Entropy Distribution for Each Entity (Histograms)

Maintenance

• Development Time Boxplot For Each Entity

• Open Issues Distribution For Each Entity

3.6 Summary

In this chapter, we detail the design of our tool and how it conducts repository mining and analysis. We present a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. In the meanwhile, we introduce and analyse two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving, being built, trained, and deployed by researchers. Our tool is available for analysing such changes: we collected the historical information stored in GitHub and extracted the metadata in each repository using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development has become a battlefield without the smoke of gunpowder. Researchers, companies, and developers are all trying to dominate the conversation in deep learning. A variety of models exist, but there is no common bridge connecting those ideas. Historical data in GitHub is obscure and hard to find, and as a result many developers, especially experienced ones, remain in their original comfort zone. With our study, we hope to shed some light on deep learning and highlight a few suggestions for the public.

This section aims to answer some questions related to both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, due to the few studies about popularity in the GitHub ecosystem, there is no standardized feature for measuring popularity. We analyze some potential features of each repository and hypothesize that popularity is strongly related to the stars a repository owns.

This decision is justified in the following section with more background on GitHub.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching. However, watching does not imply being a collaborator [Git b]. A watcher receives notifications for newly created pull requests or issues, so watchers indicate how much interest the GitHub community gives to a repository.

[Figure 4.1: Repository Watching [Git b]]

[Figure 4.2: Star Sort Menu [Git a]]

• Stars
Starring a repository makes it easy to keep track of a repository the user is interested in. The starred repository will appear on the user's own host domain (https://[hostname]/stars). The star count is another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks
Forks are created when a user would like to make their own copy of a repository. The user can fork a repository to suggest changes or to use it as the basis for a new project.

Based on the data we collected in the data collection stage, we extract the watchers_count, stargazers_count, and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics and draw a scatter plot, as shown in Figure 4.3.


[Figure 4.3: Popularity Metric]

star     forks_count    watchers_count    model name
17940    4661           17940             Bert
12405    3637           12405             Bert
5263     1056           5263              Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks, and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, and thus we should not assume normality.
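To make the columns of Table 4.1 concrete, here is a minimal sketch of assembling the star, fork, and watcher vectors used in the correlation tests (the records below are toy stand-ins shaped like the GitHub API output, reusing the Bert values from Table 4.1):

```python
# Toy metadata records mimicking the collected JSON (values from Table 4.1).
repos = [
    {"name": "repo-a", "stargazers_count": 17940, "forks": 4661, "watchers_count": 17940},
    {"name": "repo-b", "stargazers_count": 12405, "forks": 3637, "watchers_count": 12405},
    {"name": "repo-c", "stargazers_count": 5263,  "forks": 1056, "watchers_count": 5263},
]

# Column vectors for the correlation tests.
star     = [r["stargazers_count"] for r in repos]
forks    = [r["forks"] for r in repos]
watchers = [r["watchers_count"] for r in repos]

print(star, forks, watchers)
```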


Spearman Correlation Coefficient

Definition. The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman 2011]. This statistical method allows us to test for a rank-order relationship between two numerical ordinal variables associated with a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork, and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

    from scipy.stats import spearmanr
    coef1, p1 = spearmanr(star, forks)
    print(coef1, p1)
    >> 0.8752903811064278 0.0

• Star vs Watcher

    coef2, p2 = spearmanr(star, watchers)
    print(coef2, p2)
    >> 1.0 0.0

• Fork vs Watcher

    coef3, p3 = spearmanr(forks, watchers)
    print(coef3, p3)
    >> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Set α = 0.05. The p-values p1, p2, and p3 are all less than α, and the calculation above also shows a strong positive correlation, with values coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875 respectively.

This means it is very unlikely (at 95% confidence) that the data are uncorrelated, and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of the report, we consider the number of stars as the proxy for a project's popularity.
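Spearman's result can also be sanity-checked without SciPy: for n observations with no tied ranks, ρ = 1 - 6Σd²/(n(n²-1)), where d is the per-observation rank difference. A minimal pure-Python sketch (it assumes no ties, which does not hold for the full star/fork dataset, so it is for illustration only):

```python
def ranks(values):
    """1-based rank of each value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    """Spearman's rho via rho = 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 0.8
print(spearman_rho([1, 2, 3], [10, 20, 30]))           # 1.0 (perfectly monotonic)
```

With ties, as in the real data, scipy.stats.spearmanr averages tied ranks, which this simplified formula does not handle.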


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, look like two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from models with a longer history, BERT and ResNet are two rising stars in the model competition; they arrive with significant improvements in architecture design and performance, as already described in the background section.

The model development community recently saw the release of multiple powerful frameworks that are treated as baselines for building models. However, for many newer models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.


[Figure 4.4: Repositories with Forks: number of repositories created with forks (accumulated), per model (bert, cnn, lstm, ncf, resnet, transformer, wide deep tensorflow), 2015-2019]

[Figure 4.5: Repositories without Forks: number of repositories created (accumulated), per model, 2015-2019]


[Figure 4.6: Repository Trend in GitHub For Each Model: per-model repository counts, October 2015 to October 2019]


[Figure 4.7: Creation Time vs Stars: number of stars against repository creation time, per model, October 2015 to July 2019]

A fork is another copy of a repository. The forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories created including forks and the total number of original repositories. We find that most of the repositories related to deep learning models are not original, which indicates that a considerable number of developers remain in a studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, different from the previous summarizing method, we study the popularity of deep learning models with the change of time.

CNN and LSTM's dominant role

As the above comparison makes clear, CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the largest number of repositories created. Let us examine this using the data: in 2017 the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued increasing to an upper level, which persists till now.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have two of the most important and most significant communities in the deep learning field. These networks are essential in both computer vision and NLP, and they form an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, the usage of ResNet and Transformer has improved significantly in the recent two years. Differing from previous structures like CNN, both of them modify the original structure and significantly improve results in computer vision and translation tasks.

Rising star: BERT

However, no model comes with perfection. LSTM itself can be extended into many variants, and BERT is one of those.

The current trends, as depicted in the graph, suggest that deep learning models are proliferating rapidly alongside innovative developments. There is still ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with the time the model came into existence, but our data tell a different story.

NCF, whose paper was published in 2016, draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e., stars) and creation time: past trends are no guarantee of future ones, and it is possible that the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide and Deep model is similar: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the previous data also confirmed that there is no significant rise in the use of this model.

28 STAMPER in Action

[Figure panels omitted: histograms of forks_count (binned) per model — bert, cnn, lstm, ncf, resnet, transformer and wide deep TensorFlow repositories.]

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development

sect41 Popularity of Deep Learning Models in GitHub 29

4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see the following.

Model Name      Mean     STD      Min   25%   50%   75%    Max
Bert            498.65   2196.3   0     1     8     43     17940
CNN             106.84   611.97   2     3     8     32     13882
LSTM            48.82    214.22   0     1     2     13     2703
NCF             77       129.91   1     2     3     115    227
ResNet          46.88    221.43   0     0     1     8      2980
Transformer     186.79   1155.87  0     0     4     21     12408
Wide and Deep   16.23    36.80    0     0     1     8      146

Table 4.2: Stars Comparison

Model Name      Mean        STD         Min   25%   50%   75%    Max
Bert            128.214953  585.926617  0.0   0.0   1.0   16.5   4661.0
CNN             40.710      252.713617  0.0   1.0   4.0   14.0   6274.0
LSTM            17.793      71.956709   0.0   0.0   1.0   5.0    968.0
NCF             34.333333   58.603185   0.0   0.5   1.0   51.5   102.0
ResNet          17.442478   93.754994   0.0   0.0   0.0   3.0    1442.0
Transformer     53.518797   336.103826  0.0   0.0   1.0   6.0    3637.0
Wide and Deep   7.282051    16.364192   0.0   0.0   0.0   2.5    71.0

Table 4.3: Forks Comparison
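Summary statistics of this kind (mean, standard deviation, quartiles) can be reproduced with Python's standard statistics module; a minimal sketch on hypothetical star counts, not the study's data:

```python
import statistics

# Hypothetical star counts for one model's repositories (illustrative only).
stars = [0, 1, 8, 43, 520, 2, 17]

summary = {
    "mean": statistics.mean(stars),
    "std": statistics.stdev(stars),  # sample standard deviation
    "min": min(stars),
    "median": statistics.median(stars),
    "max": max(stars),
}
# 25th/50th/75th percentiles, as in Tables 4.2 and 4.3.
quartiles = statistics.quantiles(stars, n=4)

print(summary)
print(quartiles)  # → [1.0, 8.0, 43.0]
```

In practice these figures come from per-model repository metadata; pandas' `DataFrame.describe()` produces the same mean/std/quartile layout in one call.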

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52) and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79) and ResNet (17.44).

Kruskal-Wallis Test. The Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen:

• H0: the 7 models' distributions are the same

• H1: the 7 models' distributions are different

    from scipy.stats import kruskal

    stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(),
                      dfLstm['star'].tolist(), dfNcf['star'].tolist(),
                      dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                      dfWideDeep['star'].tolist())
    print(stat, p)
    # >> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity analysis based on the number of repositories in GitHub. This unexpected result can be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or BERT model requires a large amount of time and effort, yet developers still show their interest in these novel deep learning models by starring and forking them.

[Scatter plot omitted: stargazers_count vs number_of_contributors per model.]

Figure 4.9: Star vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.

sect41 Popularity of Deep Learning Models in GitHub 31

[Scatter plot omitted: stargazers_count vs develop_duration per model.]

Figure 4.10: Star vs Development Time

[Scatter plot omitted: stargazers_count vs open_issues per model.]

Figure 4.11: Star vs Open Issues

[Scatter plot omitted: stargazers_count vs entropy per model.]

Figure 4.12: Star vs Entropy Value

Number of Contributors
From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, p-value ≤ 0.01). Among all the repositories, the top-3 models with the most stars per contributor are CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor) and Bert (155.0 stars/contributor).
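Spearman's ρ reported throughout this section is just Pearson correlation applied to ranks; a minimal pure-Python sketch with hypothetical star and contributor counts (the study itself can use scipy.stats.spearmanr):

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    sorted_vals = sorted(values)
    return [sorted_vals.index(v) + (sorted_vals.count(v) + 1) / 2 for v in values]

def spearman_rho(xs, ys):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-repository values (not the collected dataset).
stars        = [3, 10, 25, 40, 120, 700, 15, 2]
contributors = [1, 1, 2, 3, 5, 9, 2, 1]
print(round(spearman_rho(stars, contributors), 4))  # → 0.9698
```

Because only ranks matter, the statistic is robust to the heavy-tailed star counts seen in these data.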

Model           Percentage of One-Contributor Development (%)
Bert            74.53
CNN             83.3
LSTM            85.9
NCF             100
ResNet          90.26
Transformer     81.20
Wide and Deep   89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time
From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value ≈ 0). This suggests that the longer a model has been developed, the more stars it will have (i.e., the model becomes more popular). The top-2 models with the longest development durations are LSTM and CNN.

Open Issues
From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186, p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can make the assumption that the more popular a repository becomes, the more issues it will have. We further investigate this correlation in the following section.

Entropy
From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development work is not distributed evenly.

We can also examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. Here we introduce an information-theoretic approach to test whether the contributions are even or not.

Entropy
In particular, we compute the entropy H of each repository, defined as

    p_i = c_i / Σ_i c_i            (4.1)

    H = − Σ_i p_i log2(p_i)        (4.2)

where i represents the i-th contributor, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example, the contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

    Total = 174 + 36 + 4 = 214                                         (4.3)

    p1 = 174/214,  p2 = 36/214,  p3 = 4/214                            (4.4)

    H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214)
                      + 4/214 · log2(4/214)) ≈ 0.7827                  (4.5)

    name            contribution
    dragen1860      174
    ash3n           36
    kelvinkoh0308   4

Table 4.5: Sample Contributions to One Repository
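Equations (4.1) and (4.2) translate directly into a few lines of Python; a minimal sketch using hypothetical contribution counts (not the repository above):

```python
import math

def collaboration_entropy(contributions):
    """H = -sum(p_i * log2(p_i)), where p_i = c_i / sum(c)."""
    total = sum(contributions)
    shares = [c / total for c in contributions]
    return -sum(p * math.log2(p) for p in shares)

# Hypothetical three-contributor repository.
print(round(collaboration_entropy([120, 30, 6]), 4))  # → 0.9294
```

A single-contributor repository has entropy 0, and an even two-person split gives the two-contributor maximum of 1 bit.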

The resulting distribution of entropy over all the repositories can be used to determine whether repositories are developed unevenly: the lower the entropy, the more concentrated the contributions, and hence the more unevenly the work is distributed.

Figure 4.13 shows the distribution of the entropy values for all models. From these figures we can see that most of the repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


[Figure panels omitted: histograms of entropy (binned) per model — bert, cnn, lstm, ncf, resnet, transformer and wide deep TensorFlow repositories.]

Figure 4.13: Collaboration Entropy

4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original. We observe that Bert has a high proportion of unique forked repositories among the models.

[Box plot omitted: unique_percent per model — bert, cnn, lstm, ncf, resnet, transformer and wide deep TensorFlow repositories.]

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was to provide a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. Looking more closely, we can see at a glance not only that changes are rarely made after forking, but also that most changed

[Figure panels omitted: histograms of uniqueness percentage (binned) per model.]

Figure 4.15: Repository Uniqueness Distribution (%)

[Figure panels omitted: histograms of repository change means (binned, −2500 to 2500) per model.]

Figure 4.16: Repository Change Statistic


repositories differ in size from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them of less concern to people. Second, the model itself may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that forked-repository development is quite imbalanced, with a large number of forked projects showing no change from the original repository.
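The change statistic behind Figure 4.16 can be approximated from repository metadata alone; a minimal sketch comparing a fork's recorded size (the KB-denominated size field the GitHub API reports) with its parent's, over hypothetical records — the study's exact measure may differ:

```python
# Hypothetical metadata for an original repository and its forks.
origin = {"full_name": "someone/dl-model", "size": 2048}
forks = [
    {"full_name": "alice/dl-model", "size": 2048},  # untouched fork
    {"full_name": "bob/dl-model",   "size": 2120},  # added code
    {"full_name": "carol/dl-model", "size": 2048},  # untouched fork
]

# Size delta of each fork relative to its parent; zero means no change.
size_deltas = [f["size"] - origin["size"] for f in forks]
unchanged = sum(d == 0 for d in size_deltas)

print(size_deltas)  # → [0, 72, 0]
print(f"{100 * unchanged / len(forks):.1f}% of forks unchanged")  # → 66.7%
```

A size delta of zero is only a proxy: a fork could in principle change content without changing its recorded size.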

4.3 Maintenance of Deep Learning Models in GitHub

In this section, the software maintenance problems of these deep learning related repositories are surveyed. The overall purpose is to explore three sets of factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

    age = T(updated_at) − T(created_at)        (4.6)

Figure 4.17 and Table 4.6 show how development time varies depending on the model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide and Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days by model is different (p-value ≤ 0.05). Therefore, we hypothesize that many of the earlier models started being used in the open-source community immediately after their first release.


Model          Max (days)   Q3       Median   Q1       Min
Bert           779          229      110      32       0
Transformer    1254         321      142      11       0
Wide and Deep  1107         575      117      0.5      0
ResNet         1360         456.5    120      1.5      0
NCF            1120         476      216      8        0
LSTM           1812         621.25   315.5    47.25    0
CNN            1385         699.25   483      270.25   0

Table 4.6: Repository Development Time Statistics

[Box plot omitted: development days per model — bert, cnn, lstm, ncf, resnet, transformer and wide deep TensorFlow repositories.]

Figure 4.17: Development Time Boxplot

sect43 Maintenance of Deep Learning Models in GitHub 41

[Scatter plot omitted: develop_duration vs open_issues per model.]

Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows the scatter plot correlating development time with the number of issues. As visually suggested by the figure and the Spearman correlation test, there is a weak correlation between those two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, which have higher maintenance costs, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model          Mean    Std     25%   50%   75%   Min   Max
Bert           8.299   50.55   0     0     1     0     504
CNN            3.414   35.456  0     0     1     0     1077
LSTM           1.292   4.915   0     0     1     0     69
ResNet         1.791   11.164  0     0     0     0     186
Transformer    1.857   8.608   0     0     1     0     95
Wide and Deep  0.231   0.742   0     0     0     0     4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository   Percentage of Repositories Having a Wiki (%)
Bert                       97.17
CNN                        98.498
LSTM                       98.799
NCF                        98.864
ResNet                     98.817
Transformer                96.97
Wide and Deep              100

Table 4.8: Descriptive statistics on percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.
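The wiki percentages in Table 4.8 can be derived from the has_wiki flag in each repository's metadata record; a minimal sketch over hypothetical records:

```python
# Hypothetical repository metadata, mirroring the GitHub API's has_wiki flag.
repos = [
    {"name": "a", "has_wiki": True},
    {"name": "b", "has_wiki": True},
    {"name": "c", "has_wiki": False},
    {"name": "d", "has_wiki": True},
]

# Fraction of repositories whose wiki feature is enabled, as a percentage.
wiki_pct = 100 * sum(r["has_wiki"] for r in repos) / len(repos)
print(f"{wiki_pct:.2f}% of repositories have a wiki enabled")  # → 75.00%
```

Note that has_wiki only indicates the wiki feature is enabled, not that the wiki actually contains pages.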

4.4 Summary

In this chapter, we first used STAMPER to collect the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the collected data.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation of whether developers change the original code base after forking.


[Figure panels omitted: histograms of open_issues (binned) per model — bert, cnn, lstm, ncf, resnet, transformer and wide deep TensorFlow repositories.]

Figure 4.19: Open Issues vs Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmark deep learning models is tricky. We had plenty of ideas on identifying models using different strategies, and we developed heuristics for an in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There also exists a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question which needs further investigation in the future (for example, users may use the prototxt format to publish their models). In our project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, since a search cannot exceed the 1000-repository boundary. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples might yield a more precise outcome.

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to know how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly: different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild; this program could then provide a broader picture of deep learning model usage in the world. Our program also allows a developer or user to define their own heuristics for data selection: experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software, or becoming a plugin for GitHub, allowing researchers


and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models via the number of repositories in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply clustering algorithms (e.g., K-Means) to high-resolution time-series data from commits.

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis in GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub deep learning related repositories, and identified factors affecting those domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to help users gain a deeper insight into the current deep learning trend and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, and to serve the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated this program is my own original work

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm
  PyCharm 2019.1.3 (Professional Edition)
  Build PY-191.7479.30, built on May 30, 2019
  Licensed to ANU / Xing Yu
  JRE: 11.0.2+9-b159.60 x86_64
  JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
  macOS 10.14.6

• Anaconda
  - jupyter-notebook 6.0.0

Other

- Python 3.7.4
- pandas==0.22.0
- numpy==1.14.0
- statistics==1.0.3.5
- ratelimit==2.2.1
- requests
- altair
- matplotlib==2.2.2
- selenium
- Git

Datasets

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, fairseq.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext.json, ResNext WSL.json, RoBERTa.json, ShuffleNet v2.json, SqueezeNet.json, SSD.json, Tacotron 2.json, Transformer.json, U-Net.json, U-Net pytorch.json, vgg_nets.json, WaveGlow.json, Wide ResNet.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, vgg_nets.json, Wide ResNet.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

74 Appendix 4 README

STAM

PER Mapping the landscape of deep

learning models use in the w

ild

cccccc

STAMPER

STAMPER

Table of Contents

Before You BeginPrerequisitesInstallRunningTestH

igh Level Description of all M

odules amp D

atasetsAuthorsLicense

STAM

PER

STAMPER

is python software tool for exam

ining real-world softw

are repositories related to deep learning models to

aid researchers in understanding past and current trends

Before You Begin

Before you begin we strongly recom

mend you dow

nload PyCharm and Anaconda before view

editexecute ourcodePyCharm

Anaconda

Amphetam

ine on the Mac App Store Keep M

ac awake w

ith this useful App (Otherw

ise it will disconnect

internet)

Prerequisites

Please make sure you have installed all of the follow

ing prerequisites on your development m

achine

Git - httpsgit-scm

comdow

nloadsG

it Authentication tokenPython 37 w

ith pip

Jupyter Notebook 600

All external libraries used listed in requirementstxt

Install

Install some dependencies M

ake sure you have the latest pip

Install a list of requirements specified in r

equirementstxt

Running

All the code scripts run from the root

1 Data Collection

Clone our project

Run python3 model_searcherpy

to get keyword related repositories m

etadata in GitH

ub in output

folderAn authentication key is required to get a higher request rateN

ote Request rate could be changed in function call_api

and call_api_dir

Then you need to run sh JSONFormattersh

in your terminal to w

ell-format your output data

Sample Case

In main()

change keywords

in terms of interest Then the resulting JSO

N file w

ill beoutputbertJSON

Customized sorting m

ethod in function get_total_pages

and request_ith_page

Could besort

updated

or stars

order

asc

or desc

2 Repository Search

Run python3 forks_time_stamp_getterpy

to get all your the forks timestam

p in forked_timestamp

3 Data Selection (O

ptional)

pip3 install --upgrade pip

1

pip3 install -r requirementstxt

1

Run python3 repository_filterpy

to get your code-related repositories with statistics in

filtered_repo

folder

Run python3 filtered_repopy

to filter your data

NoteYour keyw

ords could be customized in m

odel_keywordpy

We store all the previous experim

ent data in tensorflow_model_filtering

andpytorch_model_filtering

4 Data Visualization

Popularity

Run python3 visualizationspopularitypy

and get your graphs invisualizationsgraphspopularity

Maintenance

Run python3 visualizationsmaintenancepy

and get your graphs invisualizationsgraphsmaintenance

Contribution

Run python3 visualizationscontributionpy

and get your graphs invisualizationsgraphscontribution

Multi Correlations

Run python3 visualizationsmulti_variablepy

and get your graphs invisualizationsgraphsmulti_variable

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module will record all the unreachable links and write them into the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.
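The link-checking logic can be sketched as below. The checker function is injected so the partitioning logic can be exercised without network access; the names and structure here are assumptions for illustration, not the actual contents of test.py:

```python
# Hedged sketch of a URL reachability check: probe each repository URL and
# collect the unreachable ones into a report file.
from urllib.request import urlopen
from urllib.error import URLError

def is_reachable(url, timeout=5):
    """Default checker: any HTTP status below 400 counts as reachable."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (URLError, ValueError):
        return False

def find_unreachable(urls, checker=is_reachable):
    """Return the subset of urls the checker reports as unreachable."""
    return [u for u in urls if not checker(u)]

def write_report(urls, path="unreachable_urls.txt", checker=is_reachable):
    """Write unreachable links, one per line, and return them."""
    dead = find_unreachable(urls, checker)
    with open(path, "w") as f:
        f.write("\n".join(dead))
    return dead
```

Injecting a stub checker (e.g., one that consults a fixed allow-list) makes the partitioning deterministic in unit tests.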

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g., tensorflow_models).

In the constructor Model, we store all unfiltered_data, filtered_data, and forked_time_location in three folders.

Instantiation

Since you already got the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with parameters model_name and the repository-metadata subfolder. Then you can call this object with its relative data easily (from Model import bert and use bert as you go along).

Customize Keywords

In the module model_keyword.py, import your instantiation (e.g., lstm) and call add_keywords, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection

output/
    asc_by_star/
        cnn tensorflow.json
        lstm tensorflow.json
    asc_general/
        bert.json
        cnn.json
        lstm.json
        ncf.json
        resnet.json
        transformer.json
        wide deep.json
    by_update_time/
        bert tensorflow.json
        cnn tensorflow.json
        lstm tensorflow.json
        ncf tensorflow.json
        resnet tensorflow.json
        transformer tensorflow.json
        wide deep tensorflow.json
    desc_by_star/
        bert tensorflow.json
        cnn tensorflow.json
        lstm tensorflow.json
        ncf tensorflow.json
        resnet tensorflow.json
        transformer tensorflow.json
        wide deep tensorflow.json
    desc_general/
        bert.json
        cnn.json
        lstm.json
        ncf.json
        resnet.json
        transformer.json
        wide deep.json
    pytorch_models/
        AlexNet.json
        DCGAN.json
        Densenet.json
        FCN-ResNet101.json
        GoogleNet.json
        HarDNet.json
        Inception_v3.json
        MobileNet v2.json
        PGAN.json
        ResNet.json
        ResNet101.json
        ResNext WSL.json
        ResNext.json
        RoBERTa.json
        SSD.json
        ShuffleNet v2.json
        SqueezeNet.json
        Tacotron 2.json
        Transformer.json
        U-Net pytorch.json
        U-Net.json
        WaveGlow.json
        Wide ResNet.json
        fairseq.json
        vgg_nets.json

2. After Repository Search

forked_timestamp/
    bert tensorflow.csv
    cnn tensorflow.csv
    lstm tensorflow.csv
    ncf tensorflow.csv
    resnet tensorflow.csv
    transformer tensorflow.csv
    wide deep tensorflow.csv


Generated Graphs

3. After Data Selection (Optional)

filtered_repo/
    bert.json
    pytorch_model_filtering/
        Densenet.json
        FCN-ResNet101.json
        GoogleNet.json
        MobileNet v2.json
        ResNet101.json
        ResNext.json
        ShuffleNet v2.json
        SqueezeNet.json
        Tacotron 2.json
        Wide ResNet.json
        vgg_nets.json
    tensorflow_model_filtering/
        bert.json
        lstm.json
        ncf.json
        resnet.json
        transformer.json
        wide deep.json


graphs/
    contribution/
        change_to_pdf.bash
        entropy_distribution.svg
        entropy_dots.svg
        lines_changed_boxs.svg
        lines_changed_hists.svg
        unique_percentage_distribution.svg
        uniqueness_chart.svg
    maintenance/
        devTime_boxplot.svg
        issues_distribution.svg
        wiki_yn.svg
    multi_variable/
        dev_t_to_open_issues.svg
        multi_correlation.svg
        star_to_contributors.svg
        star_to_dev_t.svg
        star_to_entropy.svg
        star_to_open_issues.svg
    popularity/
        accumulated_popularity.svg
        creation_repository_trend_total.svg
        creation_with_fork_timeline.svg
        fork_distribution.svg
        popularity_dot.svg
        popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr. Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: Free for personal use only)

Bibliography

[Git a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[Git b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19, and 20)

[Git c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[Git d] GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265-283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334-344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396-399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7-10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77-ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1-11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433-444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345-355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12-21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173-182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735-1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1-4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142-146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
      • Deep learning models
      • Summarized Timeline
    • Public Code Repositories
      • Web-based hosting service
      • Measuring Popularity From GitHub
      • Extracting Messy Data in the Wild
      • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    • RQ2: How popularity varies per model
    • RQ3: Does the popularity of models relate to other features?
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commitments Timestamp
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Softwares
      • Other
      • Datasets
  • Appendix 4: README

14 STAMPER Design and Implementation

repository (Fi) and original repository (O), we obtain all the forked repositories with the change of size (c):

Fi + c = O    (3.1)

3.4 Data Selection

Figure 3.2: Data Selection (entity/model API keywords are searched in each repository to produce statistics)

Figure 3.3: Store in Local Disk (forked repository timestamps, unfiltered data, and filtered data are grouped by model-related keywords such as Bert, ResNet, and CNN in Model.py)


Figure 3.2 represents our method for searching API usage in DL-model-related repositories. GitHub provides a REST API to examine whether a repository contains the targeted API or not.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. Moreover, the appearance frequency of the user-specified API is embedded directly within the returned result and can be matched in our program to each repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API appearances inside the repository. This measure proxies the development effort associated with the API. This approach also allows users to build up a high-level picture of API usage across GitHub repositories.

Meanwhile, we also provide multiple options for selecting APIs, which are stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the number of models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras application library (https://github.com/keras-team/keras-applications) provides users the ability to load deep learning models and instantiate a model with default weights. Using ResNet as an example:

• With pre-defined models:
A user may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, and this model could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they could all be used as sample keywords in STAMPER, such as:

keras.applications.resnet.ResNet50
keras.applications.resnet.ResNet101
keras.applications.resnet.ResNet152
keras.applications.resnet_v2.ResNet50V2
keras.applications.resnet_v2.ResNet101V2
keras.applications.resnet_v2.ResNet152V2

• With self-defined models:
TensorFlow has a sample ResNet50 model in its official repository (cf. https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py). In the sample class given, ResNet consists of two main blocks. If users would like to explore the components inside the ResNet class further, they may first examine whether it is defined and trace the number of ResNet self-defined classes. Due to the flexibility of model construction, the data selection heuristic varies. Deep learning


users and experts can define their own searches according to their interests and preferences.

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis; (ii) contribution analysis; (iii) maintenance analysis.

The process of generating the visualizations from these three perspectives is illustrated in Figure 3.4. Meanwhile, Chapter 5 gives an example of our collected repository metadata for deep learning models.

Figure 3.4: Overall Construction of the Visualizations (entities 1..n are functionally mapped to contribution-related, popularity-related, and maintenance-related visualisations)

Popularity

• Total number of repositories with forks (line)
• Total number of repositories without forks (line)
• Number of creations over time, grouped in weeks (with forks)
• Repository creation time vs. stars

Contribution

To additionally exploit the forking information, STAMPER finally supports the comparison between the original repository and its forked repositories. The work could be further extended by visiting each forked repository's URL and tracing the commits.

As shown in Figure 3.5, the entity (E) we searched in GitHub may have multiple related repositories (Ri) and their corresponding forked repositories (Fi). Among the forked repositories, we denote a changed forked repository by Ci.

To examine whether there exist changes in forked repositories, and the difference between multiple entities, we calculate the difference using the equation


Keyword | Total Repositories (including Forks) Collected | Total Original Repositories Collected
ResNet tensorflow | 6129 | 339
Bert tensorflow | 13734 | 106
CNN tensorflow | 39765 | 1000
LSTM tensorflow | 19572 | 1000
Transformer tensorflow | 7188 | 145
Wide and deep tensorflow | 324 | 39

Table 3.1: Repositories Related to Tensorflow


below. Uniqueness percentage distributions are composed of all the percentages pi, each corresponding to its original repository Ri:

pi = (Σ Ci) / (Σ Fi)    (3.2)
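Concretely, Equation (3.2) treats the uniqueness percentage of one original repository as the fraction of its forks whose codebase changed. A minimal worked sketch (the data shape is an assumption for illustration):

```python
# Worked sketch of Equation (3.2): given one boolean per fork of a
# repository R_i (True if that fork changed the codebase), the uniqueness
# percentage p_i is sum(C_i) / sum(F_i).
def uniqueness_percentage(forks_changed):
    """forks_changed: list of booleans, one per fork of a repository."""
    if not forks_changed:
        return 0.0
    return sum(forks_changed) / len(forks_changed)

# e.g. four forks marked changed? Y, N, Y, Y, as in Figure 3.5:
p = uniqueness_percentage([True, False, True, True])  # -> 0.75
```

The distribution of these p values over all repositories of an entity is what the uniqueness histograms and boxplots below summarize.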

Figure 3.5: Examine Uniqueness after Forking (an entity E maps to repositories 1-4; each forked repository 1..n is marked as changed or not: Y, N, Y, Y)

• Percentage of forked repositories unique from origin (boxplots)
• Uniqueness percentage distribution for each entity (histograms)
• Entropy distribution for each entity (histograms)
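The entropy histograms summarize how evenly work is spread across a repository's contributors. A hedged sketch of the Shannon entropy such a plot could be based on (the actual entropy_calculation.py may compute it differently):

```python
# Illustrative sketch: Shannon entropy (in bits) of the per-contributor
# commit distribution. One dominant contributor gives low entropy; evenly
# spread contributions give high entropy.
import math

def contribution_entropy(commit_counts):
    """commit_counts: number of commits per contributor."""
    total = sum(commit_counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in commit_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)
```

For example, a repository with a single author scores 0 bits, while one whose commits are split evenly among four contributors scores 2 bits.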

Maintenance

• Development time boxplot for each entity
• Open issues distribution for each entity

3.6 Summary

In this chapter, we detail the design of our tool and how it conducts repository mining and analysis. We present a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analyzing software trends in the GitHub community. We also introduce and analyze two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving, and are built, trained, and deployed by researchers. Our tool is available for analyzing such changes. We collected the historical information stored in GitHub and extracted the metadata in each repository using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development is a crowded and competitive field. Researchers, companies, and developers are all trying to dominate the conversation in deep learning. There exists a variety of models to think in, but no common bridge connects those ideas together. Historical data in GitHub is obscure and hard to find, and as a result many developers, especially experienced ones, remain in their original comfort zone. With our study, we hope to shed some light on deep learning use and highlight a few suggestions for the public.

This section aims to answer some questions related to models' usage in GitHub as well as the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, because there are few studies about popularity on GitHub, there is no standardized feature to measure it. We analyze some potential features of each repository and hypothesize that popularity is strongly related to the number of stars a repository owns.

This decision will be justified in the following section with more background on GitHub.

• Watchers: Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching. However, watching does not imply collaboration [Git b]. A watcher could watch


Figure 4.2: Star Sort Menu [Git a]

a repository to receive notifications for new pull requests or issues as they are created. Watchers can indicate how much interest the GitHub community gives to the repository.

Figure 4.1: Repository Watching [Git b]

• Stars: Starring a repository makes it easy for a user to keep track of a repository they are interested in. The starred repository will appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks: Forks are created when a user would like to make a copy of an original repository. The user can fork a repository and suggest changes, or use it as a basis for a new project.

Based on the data we collected in the data collection stage, we extract the watchers_count, stargazers_count, and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star | forks_count | watchers_count | model name
17940 | 4661 | 17940 | Bert
12405 | 3637 | 12405 | Bert
5263 | 1056 | 5263 | Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks, and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, and thus we should not assume normality.


Spearman Correlation Coefficient

Definition: The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman, 2011]. This method allows us to test for a rank-order relationship between two numerical or ordinal variables associated by a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork, and watcher) do not have a relationship with each other.
• H1: There is a relationship between those three variables.

Result

• Star vs. Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

• Star vs. Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

• Fork vs. Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables on the testing dataset.

Set α = 0.05. Since p1, p2, and p3 are all less than α, and the calculation above also shows a strong positive correlation, with values coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875 respectively, it is very unlikely (at 95% confidence) that the testing data are uncorrelated, and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of the report, we consider the number of stars to be the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, look like two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from models with a longer history, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as already described in the background section.

The model development community recently saw the release of multiple powerful frameworks that are treated as baselines for building models. However, for many new models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.


Figure 4.4: Repositories with Forks (number of repositories created with forks, accumulated, per model, 2015-2019)

Figure 4.5: Repositories without Forks (number of repositories created, accumulated, per model, 2015-2019)


Figure 4.6: Repository Trend in GitHub for Each Model (repository counts per model, October 2015 to October 2019)


Figure 4.7: Creation Time vs. Stars (number of stars against repository creation time, per model, 2015-2019)

A fork is another copy of a repository. A forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories created including forks and the total number created excluding forks. We find that most repositories related to deep learning models are not original. This indicates that a considerable number of developers remain at the studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, differently from the previous summarizing method, we study the popularity of deep learning models over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the highest number of repositories created. Let us examine this thought using the data. In 2017, the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued to increase to a higher level, which persists until now.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have two of the most important and most significant communities in the deep learning field. These networks are essential in both computer vision and NLP, and they form an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly in the recent two years. Differing from previous structures like CNN, both of them modify the original structure and significantly improve results in computer vision and translation tasks.

Rising star: BERT

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends, as depicted in the graph, suggest that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with how long the model has existed, but our data tells a different story.

With its paper published in 2016, NCF draws the least attention in the GitHub community. This data also shows that there is no relationship between popularity (i.e., stars) and creation time; past trends are no guarantee of future ones, and it is possible that the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

Similarly for the Wide and Deep model: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the previous data also confirm that there is no significant rise in the use of this model.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (fork-distribution histograms: count of records per binned forks_count, for each model)


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name | Mean | STD | Min | 25% | 50% | 75% | Max
Bert | 498.65 | 2196.3 | 0 | 1 | 8 | 43 | 17940
CNN | 106.84 | 611.97 | 2 | 3 | 8 | 32 | 13882
LSTM | 48.82 | 214.22 | 0 | 1 | 2 | 13 | 2703
NCF | 77 | 129.91 | 1 | 2 | 3 | 115 | 227
ResNet | 46.88 | 221.43 | 0 | 0 | 1 | 8 | 2980
Transformer | 186.79 | 1155.87 | 0 | 0 | 4 | 21 | 12408
Wide and Deep | 16.23 | 36.80 | 0 | 0 | 1 | 8 | 146

Table 4.2: Stars Comparison

Model Name      Mean     STD      Min   25%   50%   75%    Max
Bert            128.21   585.93   0.0   0.0   1.0   16.5   4661.0
CNN             40.71    252.71   0.0   1.0   4.0   14.0   6274.0
LSTM            17.79    71.96    0.0   0.0   1.0   5.0    968.0
NCF             34.33    58.60    0.0   0.5   1.0   51.5   102.0
ResNet          17.44    93.75    0.0   0.0   0.0   3.0    1442.0
Transformer     53.52    336.10   0.0   0.0   1.0   6.0    3637.0
Wide and Deep   7.28     16.36    0.0   0.0   0.0   2.5    71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52), and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88), and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), ResNet (17.44), and LSTM (17.79).

Kruskal-Wallis Test. The Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen:

• H0: the 7 models' distributions are the same

• H1: the 7 models' distributions are different

    from scipy.stats import kruskal
    stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(),
                      dfLstm['star'].tolist(), dfNcf['star'].tolist(),
                      dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                      dfWideDeep['star'].tolist())
    print(stat, p)
    >> 327.161878375634 1.2287000508128928e-67
Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories existing in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or Bert model requires a large amount of time and effort, so developers express their interest in these novel deep learning models by starring and forking instead of re-implementing them.

Figure 4.9: Stars vs. Contributors (scatter plot of stargazers_count against number_of_contributors for each model)

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.


Figure 4.10: Stars vs. Development Time (scatter plot of stargazers_count against develop_duration for each model)

Figure 4.11: Stars vs. Open Issues (scatter plot of stargazers_count against open_issues for each model)

Figure 4.12: Stars vs. Entropy Value (scatter plot of stargazers_count against entropy for each model)

Number of Contributors. From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Across all the repositories, the top-3 models with the most stars per contributor are CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor), and Bert (155.0 stars/contributor).
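The correlation figures in this section come from SciPy's Spearman rank test. A minimal sketch of the computation, on synthetic stand-in data (the real inputs are the per-repository star and contributor counts collected by STAMPER):

```python
from scipy.stats import spearmanr

# Synthetic stand-ins for the per-repository metadata (illustrative only)
stars = [10, 250, 3, 47, 890, 5, 120]
contributors = [1, 12, 1, 3, 30, 2, 8]

# spearmanr ranks both samples and correlates the ranks, so it captures
# monotonic (not just linear) relationships between the two features
rho, p_value = spearmanr(stars, contributors)
print(rho, p_value)  # rho is close to 1 for this strongly monotonic sample
```

The same call, fed with development time, open-issue counts, or entropy values instead of contributor counts, yields the other coefficients reported below.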

Model           Percentage of One-Contributor Development (%)
Bert            74.53
CNN             83.3
LSTM            85.9
NCF             100
ResNet          90.26
Transformer     81.20
Wide and Deep   89.74

Table 4.4: Percentage of one-contributor development for DL related repositories


Development Time. From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value ≈ 0). This suggests that the longer a model has been in development, the more stars it will have (i.e., the model becomes more popular). The top-2 models with the longest development duration are LSTM and CNN.

Open Issues. From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can make the assumption that the more popular a repository becomes, the more issues it will have. We will further investigate this correlation in the following section.

Entropy. From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project, we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development work is not distributed evenly.

We can examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether the contributions are even or not.

Entropy. In particular, we compute the entropy H of each repository, defined as

    p_i = c_i / Σ_i c_i    (4.1)

    H = −Σ_i p_i log2(p_i)    (4.2)

where i indexes the contributors, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example, the contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

    Total = 174 + 36 + 4 = 214    (4.3)

    p1 = 174/214,  p2 = 36/214,  p3 = 4/214    (4.4)

    H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.7826    (4.5)

    name            contribution
    dragen1860      174
    ash3n           36
    kelvinkoh0308   4

Table 4.5: Sample Contributions to One Repository
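The worked example can be checked with a few lines of Python. This is an illustrative re-computation of Equations 4.1 and 4.2 for the Table 4.5 contributions, not the project's entropy_calculation.py itself:

```python
import math

def repo_entropy(contributions):
    """Collaboration entropy of a repository from per-contributor
    contribution counts (Equations 4.1 and 4.2)."""
    total = sum(contributions)
    probs = [c / total for c in contributions]  # p_i = c_i / sum(c_i)
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Contributions from Table 4.5 (dragen1860, ash3n, kelvinkoh0308)
h = repo_entropy([174, 36, 4])
print(round(h, 4))  # ≈ 0.7826 for these counts
```

A single-contributor repository gives H = 0, and two contributors with equal contributions give H = 1, the maximum for two contributors.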

The resulting distribution of entropy over all the repositories can be used to determine whether a repository is developed unevenly: the lower the entropy, the higher the separation, which corresponds to more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy values for all models. From these figures, we can see that most of the repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mainly by one developer or by a team with an uneven allocation of work.


Figure 4.13: Collaboration Entropy (histograms of binned entropy values for each model)


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has the highest proportion of unique forked repositories among the models studied.
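One common heuristic for deciding whether a fork was ever changed uses the `created_at` and `pushed_at` fields that GitHub returns for every repository: a fork that has been pushed to after its creation has been modified by the forking user. The function below is an illustrative sketch under that assumption, not STAMPER's actual implementation:

```python
from datetime import datetime

ISO = "%Y-%m-%dT%H:%M:%SZ"  # timestamp format used by the GitHub API

def unique_fork_percentage(forks):
    """Percentage of forks pushed to after they were created, i.e. forks
    whose codebase was changed at least once after forking."""
    if not forks:
        return 0.0
    changed = sum(
        1 for f in forks
        if datetime.strptime(f["pushed_at"], ISO) > datetime.strptime(f["created_at"], ISO)
    )
    return 100.0 * changed / len(forks)

forks = [
    {"created_at": "2019-01-01T00:00:00Z", "pushed_at": "2018-12-20T00:00:00Z"},  # untouched
    {"created_at": "2019-01-01T00:00:00Z", "pushed_at": "2019-03-01T00:00:00Z"},  # changed
]
print(unique_fork_percentage(forks))  # 50.0
```

Note that a freshly created fork inherits the parent's last push time, so its `pushed_at` is at or before its own `created_at` until the forking user pushes something.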

Figure 4.14: Percentage of Forked Repositories Unique From Origin (boxplots of unique_percent per model)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was to provide a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. Looking in more detail, we can see at a glance not only that changes are rarely made after forking, but also that most changed


Figure 4.15: Repository Uniqueness Distribution (%) (histograms of the binned uniqueness percentage for each model)

Figure 4.16: Repository Change Statistic (histograms of binned mean lines changed for each model)


repositories differ in size from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them of less concern to people. Second, the model itself may only be valid for a specific type of data, making it less robust and generalized, and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, the problems of software maintenance in these deep learning related repositories are surveyed. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. In this project, we also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report, we calculate the age of each repository from the repository creation time, as depicted in the equation below:

    age = T(updated_at) − T(created_at)    (4.6)
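Using the ISO-8601 `created_at`/`updated_at` timestamps that the GitHub API returns, this age computation can be sketched as follows (a minimal illustration, not STAMPER's actual code):

```python
from datetime import datetime, timezone

def repo_age_days(created_at, updated_at):
    """Age of a repository in whole days, from GitHub's ISO-8601 timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    created = datetime.strptime(created_at, fmt).replace(tzinfo=timezone.utc)
    updated = datetime.strptime(updated_at, fmt).replace(tzinfo=timezone.utc)
    return (updated - created).days

print(repo_age_days("2018-10-17T08:30:00Z", "2019-01-05T12:00:00Z"))  # 80
```

Hypothetical timestamps are used above; in the project these values come from each repository's metadata record.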

Figure 4.17 and Table 4.6 show how development time varies depending on the model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide and Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). Therefore, we hypothesize that many of these models were adopted by the open-source web community immediately after their first release.


Model         Max of days   Q3 of days   Median of days   Q1 of days   Min of days
Bert          779           229          110              32           0
Transformer   1254          321          142              11           0
Wide deep     1107          575          117              0.5          0
ResNet        1360          456.5        120              1.5          0
NCF           1120          476          216              8            0
LSTM          1812          621.25       315.5            47.25        0
CNN           1385          699.25       483              270.25       0

Table 4.6: Repository Development Time Statistics

Figure 4.17: Development Time Boxplot (days per model)


Figure 4.18: Development Time vs. Number of Open Issues (scatter plot of develop_duration against open_issues for each model)

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and the Spearman correlation test, there is a moderate correlation between those two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, which have accrued a high maintenance cost, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.299), CNN (3.414), and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model        Mean    Std      25%   50%   75%   Min   Max
Bert         8.299   50.55    0     0     1     0     504
CNN          3.414   35.456   0     0     1     0     1077
LSTM         1.292   4.915    0     0     1     0     69
ResNet       1.791   11.164   0     0     0     0     186
Transformer  1.857   8.608    0     0     1     0     95
Wide Deep    0.231   0.742    0     0     0     0     4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository   Percentage of repositories having a Wiki (%)
Bert                       97.17
CNN                        98.498
LSTM                       98.799
NCF                        98.864
ResNet                     98.817
Transformer                96.97
Wide deep                  100

Table 4.8: Descriptive statistics on the percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software engineering aspects (popularity, contribution, and maintenance) of deep learning repositories using the data collected by STAMPER.
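The metadata-collection step relies on GitHub's repository-search API. A minimal sketch of the request that STAMPER-style collection issues (the helper name is illustrative, not the project's actual code; GitHub caps search results at 1000 items per query, which motivates the multiple sorting strategies mentioned later):

```python
def build_search_request(keyword, sort="stars", order="desc", page=1):
    """Build the URL and query parameters for one page of GitHub's
    repository-search endpoint. per_page=100 is the API's maximum."""
    params = {"q": keyword, "sort": sort, "order": order,
              "per_page": 100, "page": page}
    return "https://api.github.com/search/repositories", params

url, params = build_search_request("bert tensorflow")
```

Issuing the request with an authentication token (e.g. via `requests.get(url, params=params, headers=...)`) raises the permitted request rate, which is why the tool asks for a token.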

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.


Figure 4.19: Open Issues vs. Number of Repositories (histograms of binned open_issues for each model)


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmarks of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for an in-depth analysis of model construction and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation. For example, users may publish their models in the prototxt format, whereas in our project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, which cannot exceed the 1000-result boundary per search. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this strategy still cannot capture all the repositories on GitHub. It may be that other, more stratified samples would produce a more precise outcome.

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to learn how deep learning is involved in our real life, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild; the program could then provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to develop their own heuristics for data selection: experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or becoming a plugin for GitHub, allowing researchers



and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal would be to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project, we investigated and examined the popularity of deep learning models in terms of the number of repositories existing in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., K-Means) to high-resolution time series data from commits.

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance, and contribution of GitHub deep learning related repositories and identified factors affecting each of these areas. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report on the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to help users gain a deeper insight into the current deep learning trend and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization, and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train, and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing, and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

  PyCharm 2019.1.3 (Professional Edition)
  Build PY-191.7479.30, built on May 30, 2019
  Licensed to ANU / Xing Yu
  JRE: 11.0.2+9-b159.60 x86_64
  JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
  macOS 10.14.6

• Anaconda

  – jupyter-notebook 6.0.0

Other

- Python 3.7.4
- pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
- Git

Datasets

- asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
- pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json
- by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
- forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

74 Appendix 4 README

STAM

PER Mapping the landscape of deep

learning models use in the w

ild

cccccc

STAMPER

STAMPER

Table of Contents

Before You BeginPrerequisitesInstallRunningTestH

igh Level Description of all M

odules amp D

atasetsAuthorsLicense

STAM

PER

STAMPER

is python software tool for exam

ining real-world softw

are repositories related to deep learning models to

aid researchers in understanding past and current trends

Before You Begin

Before you begin we strongly recom

mend you dow

nload PyCharm and Anaconda before view

editexecute ourcodePyCharm

Anaconda

Amphetam

ine on the Mac App Store Keep M

ac awake w

ith this useful App (Otherw

ise it will disconnect

internet)

Prerequisites

Please make sure you have installed all of the follow

ing prerequisites on your development m

achine

Git - httpsgit-scm

comdow

nloadsG

it Authentication tokenPython 37 w

ith pip

Jupyter Notebook 600

All external libraries used listed in requirementstxt

Install

Install some dependencies M

ake sure you have the latest pip

Install a list of requirements specified in r

equirementstxt

Running

All the code scripts run from the root

1 Data Collection

Clone our project

Run python3 model_searcherpy

to get keyword related repositories m

etadata in GitH

ub in output

folderAn authentication key is required to get a higher request rateN

ote Request rate could be changed in function call_api

and call_api_dir

Then you need to run sh JSONFormattersh

in your terminal to w

ell-format your output data

Sample Case

In main()

change keywords

in terms of interest Then the resulting JSO

N file w

ill beoutputbertJSON

Customized sorting m

ethod in function get_total_pages

and request_ith_page

Could besort

updated

or stars

order

asc

or desc

2 Repository Search

Run python3 forks_time_stamp_getterpy

to get all your the forks timestam

p in forked_timestamp

3 Data Selection (O

ptional)

pip3 install --upgrade pip

1

pip3 install -r requirementstxt

1

Run python3 repository_filterpy

to get your code-related repositories with statistics in

filtered_repo

folder

Run python3 filtered_repopy

to filter your data

NoteYour keyw

ords could be customized in m

odel_keywordpy

We store all the previous experim

ent data in tensorflow_model_filtering

andpytorch_model_filtering

4 Data Visualization

Popularity

Run python3 visualizationspopularitypy

and get your graphs invisualizationsgraphspopularity

Maintenance

Run python3 visualizationsmaintenancepy

and get your graphs invisualizationsgraphsmaintenance

Contribution

Run python3 visualizationscontributionpy

and get your graphs invisualizationsgraphscontribution

Multi Correlations

Run python3 visualizationsmulti_variablepy

and get your graphs invisualizationsgraphsmulti_variable

Test

Some G

itHub repositories does not m

aintained well and their links som

etimes are broken and unreachable To

guarantee your best experience in using our tool we provide testing unit for G

itHub links in t

estpy

This module

will record all the unreachable links and w

rite them into file

unreachable_urlstxt

UsageChange elem

ents in keywords

run python3 testpy

All the unreachable links will w

rite tounreachable_urlstxt

Customizing Your O

wn Search

In module M

odelpy

define your own entity lists (eg t

ensorflow_models

)

In Constructor Model

we store all unfiltered_data filtered_data and forked_tim

e_location in three folders

Instantiation

Since you already got data from the previous steps (1-2) Then you can construct a m

odel by calling aconstructor M

odel

eg bert = Model(bert tensorflow desc_by_star)

parameter M

odel_name and Respository m

etadata subfolder

Then you can call this object with its relative data easily (

from Model import bert

and use bert

as you goalong)

Customize Keyw

ords

In module m

odel_keywordpy

import your instantiation (

lstm

) and call add_keywords

eg

High Level D

escription of all Modules amp

Datasets

1 Data Collection

2 Repository Search

3 (Optional) D

ata Selection

4 Data Visualization

Altair is used to draw elegant graphs

Experiment D

atasets Collected

lstm_keywords = [tfkeraslayersLSTMCell tfnnrnn_cellLSTMCell]

lstmadd_keywords(lstm_keywords)

12

model_searcherpy

item_filterpy

12

model_searcherpy

forks_time_stamp_getterpy

12

repository_filterpy

filtered_repopy

12

contribution_statpy

entropy_calculationpy

Analysiscontribution_relatedpy

Analysismeta_datapy

1234

1 After Data Collection

output

asc_by_star

cnn tensorflowjson

$

lstm tensorflowjson

asc_general

bertjson

cnnjson

lstmjson

ncfjson

resnetjson

transformerjson

$

wide deepjson

by_update_time

123456789

10

11

12

13

14

bert tensorflowjson

cnn tensorflowjson

lstm tensorflowjson

ncf tensorflowjson

resnet tensorflowjson

transformer tensorflowjson

$

wide deep tensorflowjson

desc_by_star

bert tensorflowjson

cnn tensorflowjson

lstm tensorflowjson

ncf tensorflowjson

resnet tensorflowjson

transformer tensorflowjson

$

wide deep tensorflowjson

desc_general

bertjson

cnnjson

lstmjson

ncfjson

resnetjson

transformerjson

$

wide deepjson

$

pytorch_models

AlexNetjson

DCGANjson

Densenetjson

FCN-ResNet101json

GoogleNetjson

HarDNetjson

Inception_v3json

MobileNet v2json

PGANjson

ResNetjson

ResNet101json

ResNext WSLjson

ResNextjson

RoBERTajson

SSDjson

ShuffleNet v2json

SqueezeNetjson

Tacotron 2json

Transformerjson

U-Net pytorchjson

U-Netjson

WaveGlowjson

Wide ResNetjson

fairseqjson

$

vgg_netsjson

2. After Repository Search

forked_timestamp/
├── bert tensorflow.csv
├── cnn tensorflow.csv
├── lstm tensorflow.csv
├── ncf tensorflow.csv
├── resnet tensorflow.csv
├── transformer tensorflow.csv
└── wide deep tensorflow.csv


3. After Data Selection (Optional)

filtered_repo/
├── bert.json
├── pytorch_model_filtering/
│   ├── Densenet.json
│   ├── FCN-ResNet101.json
│   ├── GoogleNet.json
│   ├── MobileNet v2.json
│   ├── ResNet101.json
│   ├── ResNext.json
│   ├── ShuffleNet v2.json
│   ├── SqueezeNet.json
│   ├── Tacotron 2.json
│   ├── Wide ResNet.json
│   └── vgg_nets.json
└── tensorflow_model_filtering/
    ├── bert.json
    ├── lstm.json
    ├── ncf.json
    ├── resnet.json
    ├── transformer.json
    └── wide deep.json


Generated Graphs

graphs/
├── contribution/
│   ├── change_to_pdf.bash
│   ├── entropy_distribution.svg
│   ├── entropy_dots.svg
│   ├── lines_changed_boxs.svg
│   ├── lines_changed_hists.svg
│   ├── unique_percentage_distribution.svg
│   └── uniqueness_chart.svg
├── maintenance/
│   ├── devTime_boxplot.svg
│   ├── issues_distribution.svg
│   └── wiki_yn.svg
├── multi_variable/
│   ├── dev_t_to_open_issues.svg
│   ├── multi_correlation.svg
│   ├── star_to_contributors.svg
│   ├── star_to_dev_t.svg
│   ├── star_to_entropy.svg
│   └── star_to_open_issues.svg
└── popularity/
    ├── accumulated_popularity.svg
    ├── creation_repository_trend_total.svg
    ├── creation_with_fork_timeline.svg
    ├── fork_distribution.svg
    ├── popularity_dot.svg
    └── popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)

Bibliography

a. GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

b. GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

c. GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

d. GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
    • Deep learning models
    • Summarized Timeline
  • Public Code Repositories
    • Web-based hosting service
    • Measuring Popularity From GitHub
    • Extracting Messy Data in the Wild
    • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    • RQ2: How popularity varies per model
    • RQ3: Does the popularity of models relate to other features
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commitments Timestamp
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Softwares
      • Other
      • Datasets
  • Appendix 4: README

§3.4 Data Selection

Figure 3.2 represents our method of searching API usage in DL model-related repositories. GitHub provides a REST API to examine whether a repository contains a targeted API or not.

Based on the repositories obtained in the data collection phase, we conduct a secondary data selection. Moreover, the appearance frequency of each user-specified API is embedded directly within the returned result and can be matched in our program to the corresponding repository's full name. The overall result is finally written to the local disk in JSON format.

For each repository extracted in the first stage, we record the number of API occurrences inside the repository. This measure proxies the development effort associated with the API. This approach also allows users to build a high-level picture of API usage across GitHub repositories.
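This counting step could be sketched against the public GitHub v3 code-search endpoint (the query format, repository names, keyword and output path below are illustrative assumptions; STAMPER's actual request code may differ):

```python
# Count occurrences of a user-specified API in each repository via the
# GitHub code-search REST API, then write the result to disk as JSON.
import json
import time
import urllib.parse
import urllib.request


def build_query(full_name, keyword):
    """Code-search query scoping one keyword to one repository."""
    return f'"{keyword}" repo:{full_name}'


def count_api_usage(full_name, keyword, token=None):
    """Number of files in `full_name` that match `keyword`."""
    params = urllib.parse.urlencode({"q": build_query(full_name, keyword)})
    req = urllib.request.Request(
        "https://api.github.com/search/code?" + params,
        headers={"Accept": "application/vnd.github.v3+json"})
    if token:
        req.add_header("Authorization", "token " + token)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["total_count"]


def select_repositories(full_names, keyword, out_path, token=None):
    # Map each repository's full name to its API appearance frequency.
    result = {}
    for name in full_names:
        result[name] = count_api_usage(name, keyword, token)
        time.sleep(2)  # the search API is heavily rate-limited [Git d]
    with open(out_path, "w") as f:
        json.dump(result, f, indent=2)
    return result
```

The pause between requests reflects the search API's strict rate limit [Git d]; an authenticated token raises that limit.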

Meanwhile, we also provide multiple options for selecting APIs, which are stored in model_keyword.py.

Table 3.1 shows descriptive statistics on the number of models related to TensorFlow and the number of repositories we retrieved from GitHub.

Example: the Keras applications library provides users the ability to load deep learning models and instantiate a model with default weights. Take ResNet as an example:

• With pre-defined models: Users may add from keras.applications.resnet50 import ResNet50 to the keywords defined in model_keyword.py; this denotes that the repository owner may use a pre-trained model from the Keras library, and that this model could be fine-tuned in the current class.

Similarly, Keras has many instances proving its feasibility in creating deep learning models, and they could all be used as sample keywords in STAMPER, such as:

1. keras.applications.resnet.ResNet50
2. keras.applications.resnet.ResNet101
3. keras.applications.resnet.ResNet152
4. keras.applications.resnet_v2.ResNet50V2
5. keras.applications.resnet_v2.ResNet101V2
6. keras.applications.resnet_v2.ResNet152V2

• With self-defined models: Tensorflow has a sample ResNet50 model in its official repository. In the sample class given, ResNet consists of two main blocks. If users would like to explore the components inside the ResNet class further, they may first examine whether the class is self-defined and trace the number of occurrences of the self-defined ResNet class. Due to the flexibility of model construction, data selection heuristics vary.

https://github.com/keras-team/keras-applications
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py


Deep learning users and experts could define their searches according to their interests and preferences.

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from these three perspectives is illustrated in Figure 3.4. Meanwhile, Chapter 4 gives an example of our collected repository metadata for deep learning models.

[Diagram: entities (Entity 1 ... Entity n) pass through functional mappings to produce popularity-related, contribution-related and maintenance-related visualisations]

Figure 3.4: Overall Construct the Visualizations

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped by week (with forks)

• Repository Creation Time vs Stars
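The "grouped in weeks" creation counts could be derived from the collected created_at timestamps roughly as follows (the records are illustrative; STAMPER's actual aggregation code may differ):

```python
import pandas as pd

# Illustrative repository metadata: one row per (original or forked) repository.
records = [
    {"name": "bert tensorflow", "created_at": "2018-11-01T10:00:00Z"},
    {"name": "bert tensorflow", "created_at": "2018-11-03T12:30:00Z"},
    {"name": "cnn tensorflow",  "created_at": "2017-05-20T08:15:00Z"},
]
df = pd.DataFrame(records)
df["created_at"] = pd.to_datetime(df["created_at"])

# Creations per week per model, then the accumulated total used for the plot.
weekly = (df.groupby(["name", pd.Grouper(key="created_at", freq="W")])
            .size()
            .rename("counts")
            .reset_index())
weekly["accumulated"] = weekly.groupby("name")["counts"].cumsum()
print(weekly)
```

The accumulated column is what the "(Accumulated)" timelines in Chapter 4 plot per model.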

Contribution

To further exploit the forking information, STAMPER supports the comparison between the original repository and its forked repositories. The work could be further extended by visiting each forked repository's URL and tracing its commits.

As shown in Figure 3.5, an entity (E) we search in GitHub may have multiple related repositories (Ri) and their corresponding forked repositories (Fi). Among the forked repositories, we denote a changed forked repository by Ci.

To examine whether there exist changes in forked repositories, and the difference between multiple entities, we calculate the difference using the equation below.


Keyword                     Total of Repository (including Forks) Collected    Total of Original Repository Collected
ResNet tensorflow           6129                                               339
Bert tensorflow             13734                                              106
CNN tensorflow              39765                                              1000
LSTM tensorflow             19572                                              1000
Transformer tensorflow      7188                                               145
Wide and deep tensorflow    324                                                39

Table 3.1: Repositories Related to Tensorflow


Uniqueness percentage distributions are composed of all the percentages pi, each corresponding to its original repository Ri.

pi = (∑ Ci) / (∑ Fi)        (3.2)

[Diagram: an entity (E) links to repositories R1 ... Rn; each repository links to forked repositories F1 ... Fn, each marked Changed: Y/N]

Figure 3.5: Examine Uniqueness after Forking

• Percentage of Forked Repositories Unique from Origin (Boxplots)

• Uniqueness percentage distribution for Each Entity (Histograms)

• Entropy Distribution histograms for Each Entity (Histograms)
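Equation (3.2) and the entropy distribution above could be computed with a sketch like the following (this is our assumed reading of the uniqueness and entropy measures; function names and inputs are illustrative):

```python
import math


def uniqueness_percentage(forks_changed):
    """p_i = (number of changed forks C_i) / (total number of forks F_i)."""
    if not forks_changed:
        return 0.0
    return sum(1 for changed in forks_changed if changed) / len(forks_changed)


def contribution_entropy(commit_counts):
    """Shannon entropy of the per-contributor commit distribution."""
    total = sum(commit_counts)
    entropy = 0.0
    for count in commit_counts:
        if count > 0:
            share = count / total
            entropy -= share * math.log2(share)
    return entropy


# Four forks of one original repository, one of which diverged from the origin.
print(uniqueness_percentage([True, False, False, False]))  # 0.25
# One dominant contributor gives low entropy; an even split gives log2(n).
print(contribution_entropy([98, 1, 1]))
print(contribution_entropy([30, 30, 30]))
```

Higher entropy indicates more evenly shared contribution, which is why single-contributor repositories cluster near zero in the entropy histograms.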

Maintenance

• Development Time Boxplot For Each Entity

• Open Issues Distribution For Each Entity

3.6 Summary

In this chapter, we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It could be used as a stand-alone tool for analysing software trends in the GitHub community. In the meanwhile, we introduced and analysed two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving, and are built, trained and deployed by researchers. Our tool is available for analysing such changes. We collected the historical information stored in GitHub and extracted the metadata in each repository by using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development is a fiercely contested battlefield. Researchers, companies, and developers are all competing for influence in deep learning. A variety of models exist, but there is no common bridge connecting those ideas. Historical data in GitHub are obscure and hard to find, and as a result many developers, especially experienced ones, remain in their original comfort zone. With our study, we hope to shed some light on deep learning use and highlight a few suggestions for the public.

This section aims to answer some questions related to both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, due to the few studies about popularity in the GitHub ecosystem, there is no standardized feature for measuring popularity. We analyze some potential features of each repository and hypothesize that popularity is strongly related to the number of stars a repository owns.

This decision will be justified in the following section with more background on GitHub.

• Watchers: Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching. However, watching does not imply being a collaborator [Git b].



Figure 4.2: Star Sort Menu [Git a]

A watcher could watch a repository to receive notifications for new pull requests or issues that are created. Watchers could indicate how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

• Stars: Starring a repository makes it easy to keep track of repositories a user is interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and thus GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks: Forks are created when a user would like to make their own copy of a repository. The user could fork a repository to suggest changes, or to use it as a basis for a new project.

Based on the data collected in the data collection stage, we extract the watchers_count, stargazers_count and forks attributes and test whether those attributes are correlated. In Table 4.1, we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.


Figure 4.3: Popularity Metric

star     forks_count    watchers_count    model name
17940    4661           17940             Bert
12405    3637           12405             Bert
5263     1056           5263              Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, number of forks, and number of watchers. However, there exists evidence of non-linearity for forks_count values close to zero, and thus we should not assume normality.


Spearman Correlation Coefficient

Definition: The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman 2011]. This statistical method allows us to test for a rank-order relationship between two numerical ordinal variables associated with a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

• Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

• Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficients between the three variables in the testing dataset.

Set α = 0.05. The p-values p1, p2 and p3 are all less than α; in the meanwhile, from the calculation above we also find strong positive correlations, with values of coef1 = 0.875, coef2 = 1.0 and coef3 = 0.875, respectively.

This means it is very unlikely (at the 95% confidence level) that the testing data are uncorrelated, and thus we can reject the null hypothesis that those variables are uncorrelated.

In the rest of the report, we consider the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and long short-term memory network (LSTM), shown in the figures below, look like two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from models with a longer history, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as we already described in the background section.

The model development community recently saw the release of multiple powerful frameworks, which are treated as baselines for building models. However, for many new models, usage has not grown in abundance, for example the Wide and Deep and NCF models.


[Chart: accumulated number of repositories created, including forks (0–40,000), 2015–2019, for bert/cnn/lstm/ncf/resnet/transformer/wide deep tensorflow]

Figure 4.4: Repositories with Forks

[Chart: accumulated number of repositories created, excluding forks (0–3,000), 2015–2019, per model]

Figure 4.5: Repositories without Forks


[Chart: repository counts over time, October 2015 – October 2019, one panel per model]

Figure 4.6: Repository Trend in GitHub For Each Model


[Chart: repository creation time vs number of stars (0–15,000+), per model]

Figure 4.7: Creation Time vs Stars

A fork is another copy of a repository. The forked repository could either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see there is a considerable difference between the total number of repositories created including forks and the total number created excluding forks. We can find that most of the repositories related to deep learning models are not original. This indicates that a considerable number of developers remain at the learning stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, different from the previous summarizing method, we study the popularity of deep learning models with the change of time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and number of repositories created. Let us examine this thought by using the data.


In 2017, the number of created repositories gradually increased for CNN, and in 2018–2019 the creation trend continued to climb to a higher level, lasting until now.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have among the most important and most significant communities in the deep learning field. These networks are essential in both computer vision and NLP, and they constitute an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage significantly improved in the recent two years. Differing from earlier structures like CNN, both of them modify the original structure and significantly improve the results in computer vision and translation tasks.

Rising Star: Bert

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends, as depicted in the graph, suggest that deep learning models are proliferating fast with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with how long the model has existed, but our data tell a different story.

Although its paper was published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e. stars) and creation time; past trends are no guarantee of future ones, and it is possible that the momentum toward increasing attention for a specific model in deep learning could flatten out or reverse itself.

It is similar for the Wide and Deep model: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the previous data also confirm that there is no significant rise in the use of this model.

[Chart: forks_count (binned) distribution histograms, one panel per model]

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name       Mean      STD       Min    25%    50%    75%    Max
Bert             498.65    2196.3    0      1      8      43     17940
CNN              106.84    611.97    2      3      8      32     13882
LSTM             48.82     214.22    0      1      2      13     2703
NCF              77        129.91    1      2      3      115    227
ResNet           46.88     221.43    0      0      1      8      2980
Transformer      186.79    1155.87   0      0      4      21     12408
Wide and Deep    16.23     36.80     0      0      1      8      146

Table 4.2: Stars Comparison

Model Name       Mean         STD          Min    25%    50%    75%     Max
Bert             128.214953   585.926617   0.0    0.0    1.0    16.5    4661.0
CNN              40.710       252.713617   0.0    1.0    4.0    14.0    6274.0
LSTM             17.793       71.956709    0.0    0.0    1.0    5.0     968.0
NCF              34.333333    58.603185    0.0    0.5    1.0    51.5    102.0
ResNet           17.442478    93.754994    0.0    0.0    0.0    3.0     1442.0
Transformer      53.518797    336.103826   0.0    0.0    1.0    6.0     3637.0
Wide and Deep    7.282051     16.364192    0.0    0.0    0.0    2.5     71.0

Table 4.3: Forks Comparison
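The descriptive statistics in Tables 4.2 and 4.3 follow the shape of pandas' describe() output; a minimal sketch on illustrative star counts for a model with three repositories:

```python
import pandas as pd

# Illustrative star counts for one model's repositories.
stars = pd.Series([1, 3, 227], name="star")

# count, mean, std, min, 25%, 50%, 75%, max -- the columns of Tables 4.2/4.3.
desc = stars.describe()
print(desc)
```

Note how a single heavily-starred repository inflates the mean and standard deviation far above the median, which is exactly the pattern visible in the tables.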

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52) and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79) and ResNet (17.44).

Kruskal-Wallis Test: the Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The 7 models' distributions are the same.

• H1: The 7 models' distributions are different.

from scipy.stats import kruskal
stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                  dfNcf["star"].tolist(), dfResnet["star"].tolist(), dfTransformer["star"].tolist(),
                  dfWideDeep["star"].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the previous popularity exploration based on the number of repositories existing in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or Bert model requires a large amount of time and effort, but developers still show their interest in those novel deep learning models.

[Chart: number_of_contributors (0–30) vs stargazers_count (0–18,000), per model]

Figure 4.9: Star vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy value, respectively.


[Figure: scatter plot of stargazers_count against develop_duration, one colour per model]

Figure 4.10: Star vs Development Time

[Figure: scatter plot of stargazers_count against open_issues, one colour per model]

Figure 4.11: Star vs Open Issues

[Figure: scatter plot of stargazers_count against entropy, one colour per model]

Figure 4.12: Star vs Entropy Value

Number of Contributors

From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, p-value ≤ 0.01). Across all the repositories, the top-3 repositories with the most stars per contributor are from the models CNN (16875 stars/contributor), Transformer (1551 stars/contributor) and Bert (1550 stars/contributor).
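Spearman's ρ, used throughout this section, is the Pearson correlation computed on rank-transformed data; `scipy.stats.spearmanr` returns the same value together with a p-value. A minimal pure-Python sketch of the statistic (function names are ours):

```python
def average_ranks(xs):
    """1-based ranks; tied values receive the average rank of their group."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over the group of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Because the statistic only depends on ranks, a monotone but non-linear star/contributor relationship still scores highly, which is why the rank-based test is preferred over Pearson for these skewed distributions.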

Model          Percentage of One Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL related repositories

sect41 Popularity of Deep Learning Models in GitHub 33

Development Time

From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value ≈ 0). This suggests that the longer a model has been developed, the more stars it will have (i.e. the model becomes more popular). The two models with the longest development durations are LSTM and CNN.

Open Issues

From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186, p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, this supports the assumption that the more popular a repository becomes, the more issues it will have. We investigate this correlation further in the following section.

Entropy

From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. Here we introduce an information-theoretic approach to test whether their contributions are even or not.

Entropy

In particular, we compute the entropy H of each repository, defined as

p_i = c_i / Σ_i c_i    (4.1)

H = − Σ_i p_i log2(p_i)    (4.2)

where i indexes the i-th contributor, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example:

The contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214    (4.3)

p1 = 174/214,  p2 = 36/214,  p3 = 4/214    (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.7827    (4.5)

name            contribution
dragen1860      174
ash3n           36
kelvinkoh0308   4

Table 4.5: Sample Contributions to One Repository
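The calculation in Equations 4.1–4.5 can be reproduced with a short Python sketch (the function name is ours):

```python
import math

def collaboration_entropy(contributions):
    """Shannon entropy (base 2) of a repository's per-contributor contribution counts."""
    total = sum(contributions)
    return -sum((c / total) * math.log2(c / total) for c in contributions if c > 0)

# Contributions from Table 4.5: dragen1860 (174), ash3n (36), kelvinkoh0308 (4)
h = collaboration_entropy([174, 36, 4])
```

A perfectly even two-person split yields the maximum log2(2) = 1 bit, while a single contributor yields 0, which is why values near zero indicate one-developer projects.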

The resulting distribution of entropy over all the repositories can be used to determine whether a repository is developed unevenly: the lower the entropy, the more unevenly the work is distributed.

Figure 4.13 shows the distribution of the entropy value for all models. From those figures we can see that most of the repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


[Figure: histograms of binned entropy values (count of records), one panel per model]

Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repositories' metadata.
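Such a retrieval can be sketched against the GitHub REST v3 forks endpoint (a minimal illustration, not STAMPER's actual code; `forks_url` and `fork_history` are our helper names):

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def forks_url(owner, repo, page, per_page=100):
    """GitHub REST v3 endpoint for paging through a repository's forks."""
    return (f"https://api.github.com/repos/{owner}/{repo}/forks?"
            + urlencode({"page": page, "per_page": per_page}))

def fork_history(owner, repo, token=None):
    """Yield (full_name, created_at) for each fork; a token raises the rate limit."""
    headers = {"Authorization": f"token {token}"} if token else {}
    page = 1
    while True:
        req = Request(forks_url(owner, repo, page), headers=headers)
        batch = json.load(urlopen(req))
        if not batch:          # empty page: no more forks
            return
        for fork in batch:
            yield fork["full_name"], fork["created_at"]
        page += 1
```

The `created_at` timestamps of the forks are what allow the fork timeline analyses later in this chapter.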

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories compared with the other six models.
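The per-model percentages behind Figures 4.14 and 4.15 reduce to a simple computation (our sketch; the uniqueness criterion used here, any size difference from the original, is illustrative):

```python
def uniqueness_percent(fork_sizes, original_size):
    """Percentage of forks whose repository size differs from the original."""
    changed = sum(1 for size in fork_sizes if size != original_size)
    return 100.0 * changed / len(fork_sizes)

# Five forks, two of which diverged from the 512 KB original
print(uniqueness_percent([512, 512, 640, 980, 512], 512))  # → 40.0
```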

[Figure: boxplot of unique_percent per model]

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view of what percentage of developers interested in deep learning go on to develop a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. Looking in more detail, we can see at a glance that not only are changes rarely made after forking, but also most changed


[Figure: histograms of binned uniqueness percentage (count of records), one panel per model]

Figure 4.15: Repository Uniqueness Distribution (%)

[Figure: histograms of binned mean lines changed (count of records), one panel per model]

Figure 4.16: Repository Change Statistic


repositories have a size difference from the original repository of only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found a high percentage of deep learning repositories are not changed after forking, especially the ones implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them less engaged with. Second, a model may only be valid for a specific type of data, making it less robust and generalized, and so less suited to developers' needs.

We conclude that development across forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey software maintenance in these deep learning related repositories. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
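With GitHub's ISO-8601 timestamps, Equation 4.6 is a one-liner (a sketch; the field names follow the GitHub API):

```python
from datetime import datetime

ISO = "%Y-%m-%dT%H:%M:%SZ"

def repo_age_days(created_at, updated_at):
    """Age in days between the created_at and updated_at API fields."""
    return (datetime.strptime(updated_at, ISO) - datetime.strptime(created_at, ISO)).days

print(repo_age_days("2018-10-31T00:00:00Z", "2019-10-23T00:00:00Z"))  # → 357
```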

Figure 4.17 and Table 4.6 show how development time varies by model. The median development time is as follows: Bert (110 days), Transformer (142 days), Wide deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). We therefore hypothesize that many of the earlier models started appearing in the open-source community immediately after their first release.


Model        Max of days  Q3 of days  Median of days  Q1 of days  Min of days
Bert         779          229         110             32          0
Transformer  1254         321         142             11          0
Wide deep    1107         575         117             0.5         0
ResNet       1360         456.5       120             15          0
NCF          1120         476         216             8           0
LSTM         1812         621.25      315.5           47.25       0
CNN          1385         699.25      483             270.25      0

Table 4.6: Repository Development Time Statistics
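The five-number summaries in Table 4.6 can be recomputed from each model's list of development durations with the standard library (a sketch, Python 3.8+; `statistics.quantiles` uses the default "exclusive" method, so results may differ slightly from other quartile conventions):

```python
import statistics

def five_number_summary(days):
    """min, Q1, median, Q3, max of a list of development durations in days."""
    q1, median, q3 = statistics.quantiles(days, n=4)
    return {"min": min(days), "Q1": q1, "median": median, "Q3": q3, "max": max(days)}

print(five_number_summary([0, 32, 110, 229, 779]))
```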

[Figure: boxplot of development time in days per model]

Figure 4.17: Development Time Boxplot


[Figure: scatter plot of develop_duration against open_issues, one colour per model]

Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and the Spearman correlation test, there is a weak correlation between those two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, despite the high cost of maintenance, may have more users and hence more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository   Percentage of repositories having a Wiki (%)
Bert          97.17
CNN           98.498
LSTM          98.799
NCF           98.864
ResNet        98.817
Transformer   96.97
Wide deep     100

Table 4.8: Descriptive statistics on the percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e. they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this chapter, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation of whether developers change the original code base after forking.


[Figure: histograms of binned open issue counts (count of records), one panel per model]

Figure 4.19: Open Issues vs Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for in-depth analysis of how models are constructed, including high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation in the future (for example, users may publish their models in prototxt format). In this project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limitations: the present experiment uses a limited number of repositories on GitHub, which cannot exceed the 1000 originally-created-repositories boundary per query. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories in GitHub. Other, more stratified samples might yield a more precise outcome.

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to learn how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to develop their own heuristics for data selection: experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a GitHub plugin, allowing researchers



and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models via the number of repositories that exist in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g. K-Means) to high-resolution time-series data from commits.

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis in GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub deep learning related repositories and identified factors affecting those domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to help users gain a deeper insight into the current deep learning trend and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and resulting corpus will be of considerable interest to researchers in different fields and will serve the needs of people working at the intersection of social media analysis, data visualization and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores lets developers learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what's been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated this program is my own original work

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm 2019.1.3 (Professional Edition)
  Build PY-191.7479.30, built on May 30, 2019
  Licensed to ANU / Xing Yu
  JRE 11.0.2+9-b159.60 x86_64, JVM OpenJDK 64-Bit Server VM by JetBrains s.r.o.
  macOS 10.14.6

• Anaconda
  – jupyter-notebook 6.0.0

Other

• Python 3.7.4
  – pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
• Git

Datasets

asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net.json, U-Net pytorch.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents: Before You Begin · Prerequisites · Install · Running · Test · High Level Description of all Modules & Datasets · Authors · License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view/edit/execute our code. Amphetamine on the Mac App Store keeps the Mac awake, a useful app (otherwise it will disconnect from the internet).

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

• Git - https://git-scm.com/downloads
• GitHub authentication token
• Python 3.7 with pip
• Jupyter Notebook 6.0.0
• All external libraries used, listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt.

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run `python3 model_searcher.py` to get keyword-related repositories' metadata from GitHub into the `output` folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions `call_api` and `call_api_dir`.

Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample Case: in `main()`, change `keywords` to the terms of interest; the resulting JSON file will be `output/bert.JSON`. Customize the sorting method in the functions `get_total_pages` and `request_ith_page`: `sort` can be `updated` or `stars`; `order` can be `asc` or `desc`.

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the forks' timestamps in `forked_timestamp`.

3. Data Selection (Optional)

First make sure the dependencies are installed:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Run `python3 repository_filter.py` to get your code-related repositories with statistics in the `filtered_repo` folder. Run `python3 filtered_repo.py` to filter your data.

Note: your keywords can be customized in `model_keyword.py`. We store all the previous experiment data in `tensorflow_model_filtering` and `pytorch_model_filtering`.

4. Data Visualization

Popularity: run `python3 visualizations/popularity.py` and get your graphs in `visualizations/graphs/popularity`.
Maintenance: run `python3 visualizations/maintenance.py` and get your graphs in `visualizations/graphs/maintenance`.
Contribution: run `python3 visualizations/contribution.py` and get your graphs in `visualizations/graphs/contribution`.
Multi Correlations: run `python3 visualizations/multi_variable.py` and get your graphs in `visualizations/graphs/multi_variable`.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience when using our tool, we provide a testing unit for GitHub links in `test.py`. This module records all the unreachable links and writes them into the file `unreachable_urls.txt`.

Usage: change the elements in `keywords`, then run `python3 test.py`. All the unreachable links will be written to `unreachable_urls.txt`.

Customizing Your Own Search

In module `Model.py`, define your own entity lists (e.g. `tensorflow_models`). In the constructor `Model`, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: since you already got data from the previous steps (1-2), you can construct a model by calling the constructor `Model`, e.g. `bert = Model("bert tensorflow", "desc_by_star")`, with parameters model name and repository metadata subfolder. Then you can use this object with its relative data easily (`from Model import bert` and use `bert` as you go along).

Customize Keywords

In module `model_keyword.py`, import your instantiation (`lstm`) and call `add_keywords`, e.g.:

lstm_keywords = ['tf.keras.layers.LSTMCell', 'tf.nn.rnn_cell.LSTMCell']
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection

output/
  asc_by_star: cnn tensorflow.json, lstm tensorflow.json
  asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional)

filtered_repo/
  bert.json
  pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
  tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs

graphs/
  contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
  maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
  multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
  popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu. Stamper image from http://bestpriceforrubberstamps.com (license: free for personal use only).

Bibliography

a. GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

b. GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

c. GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

d. GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi M., Barham P., Chen J., Chen Z., Davis A., Dean J., Devin M., Ghemawat S., Irving G., Isard M., Kudlur M., Levenberg J., Monga R., Moore S., Murray D. G., Steiner B., Tucker P., Vasudevan V., Warden P., Wicke M., Yu Y., and Zheng X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (cited on page 4)

Borges H., Hora A., and Valente M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges H Hora A and Valente M T 2016b Understanding the factors thatimpact the popularity of github repositories In 2016 IEEE International Conferenceon Software Maintenance and Evolution (ICSME) 334ndash344 IEEE (cited on pages 8and 19)

Casalnuovo C Suchak Y Ray B and Rubio-Gonzaacutelez C 2017 Gitcproc Atool for processing and classifying github commits In Proceedings of the 26th ACMSIGSOFT International Symposium on Software Testing and Analysis 396ndash399 ACM(cited on page 9)

Cheng H-T Koc L Harmsen J Shaked T Chandra T Aradhye H Ander-son G Corrado G Chai W Ispir M et al 2016 Wide amp deep learning

59

60 BIBLIOGRAPHY

for recommender systems In Proceedings of the 1st workshop on deep learning forrecommender systems 7ndash10 ACM (cited on page 7)

Collberg C Kobourov S Nagra J Pitts J and Wampler K 2003 A systemfor graph-based visualization of the evolution of software In Proceedings of the 2003ACM symposium on Software visualization 77ndashff ACM (cited on page 10)

Corder G W and Foreman D I 2011 Nonparametric statistics for non-statisticians (cited on page 22)

Devlin J Chang M-W Lee K and Toutanova K 2018 Bert Pre-trainingof deep bidirectional transformers for language understanding arXiv preprintarXiv181004805 (2018) (cited on page 6)

Feiner J and Andrews K 2018 Repovis Visual overviews and full-text searchin software repositories In 2018 IEEE Working Conference on Software Visualization(VISSOFT) 1ndash11 IEEE (cited on page 9)

Gote C Scholtes I and Schweitzer F 2019 git2net mining time-stamped co-editing networks from large git repositories In Proceedings of the 16th InternationalConference on Mining Software Repositories 433ndash444 IEEE Press (cited on pages xvand 10)

Gousios G Pinzger M and Deursen A v 2014 An exploratory study of thepull-based software development model In Proceedings of the 36th InternationalConference on Software Engineering 345ndash355 ACM (cited on page 8)

Gousios G and Spinellis D 2012 Ghtorrent Githubrsquos data from a firehoseIn 2012 9th IEEE Working Conference on Mining Software Repositories (MSR) 12ndash21IEEE (cited on page 9)

He X Liao L Zhang H Nie L Hu X and Chua T-S 2017 Neural collabo-rative filtering In Proceedings of the 26th international conference on world wide web173ndash182 International World Wide Web Conferences Steering Committee (citedon page 6)

Hochreiter S and Schmidhuber J 1997 Long short-term memory Neural com-putation 9 8 (1997) 1735ndash1780 (cited on page 5)

LeCun Y Bengio Y and Hinton G 2015 Deep learning nature 521 7553 (2015)436 (cited on page 3)

Servant F and Jones J A 2013 Chronos Visualizing slices of source-code historyIn 2013 First IEEE Working Conference on Software Visualization (VISSOFT) 1ndash4 IEEE(cited on page 9)

Sokol F Z Aniche M F and Gerosa M A 2013 Metricminer Supportingresearchers in mining software repositories In 2013 IEEE 13th International WorkingConference on Source Code Analysis and Manipulation (SCAM) 142ndash146 IEEE (citedon page 9)

BIBLIOGRAPHY 61

Subramanian V 2018 Deep Learning with PyTorch A practical approach to buildingneural network models using PyTorch Packt Publishing Ltd (cited on page 3)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
    • Deep learning models
    • Summarized Timeline
  • Public Code Repositories
    • Web-based hosting service
    • Measuring Popularity From GitHub
    • Extracting Messy Data in the Wild
    • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    • RQ2: How does popularity vary per model?
    • RQ3: Does the popularity of models relate to other features?
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commitments Timestamp
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Softwares
      • Other
      • Datasets
  • Appendix 4: README


users and experts can define their searches according to their interests and preferences.

3.5 Construct the Visualizations

Given the database of metadata collected from GitHub and generated by the approach described above, STAMPER provides procedures to generate three different types of analysis: (i) popularity analysis, (ii) contribution analysis, and (iii) maintenance analysis.

The process of generating the visualizations from these three perspectives is illustrated in Figure 3.4. Meanwhile, Chapter 5 gives an example of the repository metadata we collected for deep learning models.

Figure 3.4: Overall Construct the Visualizations (entities 1…n are combined through functional mappings into popularity-related, contribution-related, and maintenance-related visualisations)

Popularity

• Total number of repositories with forks (line)

• Total number of repositories without forks (line)

• Number of creations over time, grouped by week (with forks)

• Repository Creation Time vs Stars

Contribution

To further exploit the forking information, STAMPER also supports comparison between an original repository and its forked repositories. This work could be extended by visiting each forked repository's URL and tracing its commit history.
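As an illustrative sketch of that extension (not STAMPER's actual implementation), one could fetch each fork's metadata from the GitHub REST v3 "list forks" endpoint (GET /repos/{owner}/{repo}/forks) and flag forks whose last push postdates their creation. The records below are fabricated stand-ins for an API response:

```python
# Hypothetical sketch of tracing a repository's forks. Instead of calling the
# network, we parse a pre-fetched sample of the GitHub "list forks" response.

sample_forks_response = [
    {"full_name": "alice/models-fork", "created_at": "2019-07-01T10:00:00Z",
     "pushed_at": "2019-08-01T10:00:00Z"},
    {"full_name": "bob/models-fork", "created_at": "2019-06-01T10:00:00Z",
     "pushed_at": "2019-06-01T10:00:00Z"},
]

def possibly_changed(fork):
    """A fork whose last push is later than its creation has likely been
    modified after forking (a heuristic, not a guarantee).
    ISO-8601 timestamps of equal length compare correctly as strings."""
    return fork["pushed_at"] > fork["created_at"]

changed = [f["full_name"] for f in sample_forks_response if possibly_changed(f)]
print(changed)  # ['alice/models-fork']
```

A real implementation would page through the endpoint and respect the API's rate limits.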

As shown in Figure 3.5, an entity (E) searched in GitHub may have multiple related repositories (R_i), each with corresponding forked repositories (F_i). Among the forked repositories, we denote a changed fork by C_i.

To examine whether forked repositories change, and the difference between multiple entities, we calculate the difference using the equation given below.


Keyword                     Total Repositories Collected (including Forks)   Total Original Repositories Collected
ResNet tensorflow           6129                                             339
Bert tensorflow             13734                                            106
CNN tensorflow              39765                                            1000
LSTM tensorflow             19572                                            1000
Transformer tensorflow      7188                                             145
Wide and deep tensorflow    324                                              39

Table 3.1: Repositories Related to Tensorflow


Uniqueness percentage distributions are composed of all the percentages p_i, each corresponding to its original repository R_i:

p_i = Σ C_i / Σ F_i        (3.2)
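Equation (3.2) can be sketched in a few lines, assuming each fork carries a boolean "changed" flag (the flags below are illustrative):

```python
# Minimal sketch of equation (3.2): the uniqueness percentage for one original
# repository is the number of changed forks C_i over the total number of forks F_i.

def uniqueness_percentage(fork_changed_flags):
    """fork_changed_flags: list of booleans, one per forked repository."""
    if not fork_changed_flags:
        return 0.0  # a repository with no forks has no uniqueness to measure
    return sum(fork_changed_flags) / len(fork_changed_flags)

# e.g. 3 of 4 forks diverged from the original codebase
print(uniqueness_percentage([True, True, True, False]))  # 0.75
```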

Figure 3.5: Examine Uniqueness after Forking (an entity E maps to repositories R_1…R_n; each forked repository is flagged Changed Y/N)

• Percentage of Forked Repositories Unique from Origin (Boxplots)

• Uniqueness percentage distribution for Each Entity (Histograms)

• Entropy Distribution for Each Entity (Histograms)

Maintenance

• Development Time Boxplot For Each Entity

• Open Issues Distribution For Each Entity

3.6 Summary

In this chapter we detail the design of our tool and how it conducts repository mining and analysis. We present a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analysing software trends in the GitHub community. We also introduce and analyse two novel features related to GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving: they are built, trained, and deployed by researchers. Our tool is available for analyzing such changes. We collected the historical information stored in GitHub and extracted repository metadata using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development is a fiercely contested field. Researchers, companies, and developers are all competing for influence in deep learning. There is a variety of models to choose from, but no common bridge connecting these ideas. Historical data in GitHub is opaque and hard to find, and as a result many developers, especially experienced ones, remain in their comfort zone. With this study we hope to shed some light on deep learning use and highlight a few suggestions for the public.

This section aims to answer questions related to both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, because there are few studies about popularity in the GitHub ecosystem, there is no standardized feature for measuring it. We analyze some potential features of each repository and hypothesize that popularity is strongly related to the number of stars a repository has.

This decision is justified in the following section with more background on GitHub.

• Watchers: Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching; watching, however, does not make them collaborators [Git b]. A watcher could watch



Figure 4.2: Star Sort Menu [Git a]

a repository to receive notifications for new pull requests or issues as they are created. The watcher count indicates how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

• Stars: Starring a repository makes it easy for a user to keep track of a repository they are interested in. The starred repository will appear on the user's own host domain (https://[hostname]/stars). The star count is another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks: Forks are created when a user would like to make a copy of an original repository. The user can fork a repository to suggest changes, or to use it as the basis for a new project.

Based on the data we collected in the data collection stage, we extract the watchers_count, stargazers_count, and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.
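As a sketch of this extraction step (the record below is made up; the field names follow the GitHub repository API), pulling the three popularity attributes out of one repository's metadata might look like:

```python
# Extract the three popularity attributes from one repository's metadata dict,
# as returned by the GitHub repository API. The values here are illustrative.

repo_metadata = {
    "full_name": "google-research/bert",
    "stargazers_count": 17940,
    "watchers_count": 17940,
    "forks": 4661,
}

def popularity_attributes(repo):
    """Return the (stars, watchers, forks) triple used in the correlation test."""
    return (repo["stargazers_count"],
            repo["watchers_count"],
            repo["forks"])

print(popularity_attributes(repo_metadata))  # (17940, 17940, 4661)
```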


Figure 4.3: Popularity Metric

star     forks_count    watchers_count    model name
17940    4661           17940             Bert
12405    3637           12405             Bert
5263     1056           5263              Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks, and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, and thus we should not assume normality.


Spearman Correlation Coefficient

Definition: The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman, 2011]. This method allows us to test for a rank-order relationship between two numerical or ordinal variables associated with a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork, and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

• Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

• Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables on the testing dataset.

Set α = 0.05. The p-values p1, p2, and p3 are all less than α, and the coefficients show a strong positive correlation, with coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875 respectively.

This means it is very unlikely (at 95% confidence) that these variables are uncorrelated, and thus we can reject the hypothesis that they are uncorrelated.

In the rest of the report, we treat the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, look like the two most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Apart from these models with a longer history, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as already described in the background section.

The model development community has recently seen the release of multiple powerful frameworks that are treated as baselines for building models. However, for many new models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.


Figure 4.4: Repositories with Forks (accumulated number of repositories created per model, including forks, 2015–2019)

Figure 4.5: Repositories without Forks (accumulated number of repositories created per model, 2015–2019)


Figure 4.6: Repository Trend in GitHub For Each Model (per-model repository counts over time, October 2015 – October 2019)


Figure 4.7: Creation Time vs Stars (scatter of repository creation time against number of stars, per model)

A fork is another copy of a repository. The forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, we can see a considerable difference between the total number of repositories created including forks and the total number created without forks: most of the repositories related to deep learning models are not original. This indicates that a considerable number of developers remain in a learning stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories that currently exist in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, unlike the previous summarizing method, we study how the popularity of deep learning models changes over time.

CNN and LSTM's dominant role

As the comparison above makes clear, CNN and LSTM are the winners in the GitHub community, with the highest average number of stars and the highest number of repositories created. Let us examine this using the data: in


2017, the number of created repositories gradually increased for CNN, and in 2018–2019 the creation trend continued to climb to a higher level, which persists today.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have two of the largest and most significant communities in the deep learning field. These networks are essential in both computer vision and NLP, where they form an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly in the last two years. Unlike earlier structures such as CNN, both modify the original structure and significantly improve results in computer vision and translation tasks.

Rising Star: Bert

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends, as depicted in the graph, suggest that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with when it came into existence, but our data tell a different story.

Published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e., stars) and creation time: past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide and Deep model is similar: though Google provides full documentation and a tutorial for it, we still take a pessimistic view of this model published in 2016. Moreover, the previous data also confirm that there is no significant rise in its use.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (fork-count distribution histograms per model)


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name      Mean     STD       Min   25%   50%   75%   Max
Bert            498.65   2196.3    0     1     8     43    17940
CNN             106.84   611.97    2     3     8     32    13882
LSTM            48.82    214.22    0     1     2     13    2703
NCF             77       129.91    1     2     3     115   227
ResNet          46.88    221.43    0     0     1     8     2980
Transformer     186.79   1155.87   0     0     4     21    12408
Wide and Deep   16.23    36.80     0     0     1     8     146

Table 4.2: Stars Comparison

Model Name    Mean         STD          Min   25%   50%   75%    Max
Bert          128.214953   585.926617   0.0   0.0   1.0   16.5   4661.0
CNN           40.710       252.713617   0.0   1.0   4.0   14.0   6274.0
LSTM          17.793       71.956709    0.0   0.0   1.0   5.0    968.0
NCF           34.333333    58.603185    0.0   0.5   1.0   51.5   102.0
ResNet        17.442478    93.754994    0.0   0.0   0.0   3.0    1442.0
Transformer   53.518797    336.103826   0.0   0.0   1.0   6.0    3637.0
WideDeep      7.282051     16.364192    0.0   0.0   0.0   2.5    71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52), and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88), and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79), and ResNet (17.44).
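Summary tables like these are typically produced with pandas' describe(); a dependency-free sketch of the same statistics, on toy star counts, is:

```python
# Compute describe()-style summary statistics (mean, std, min, quartiles, max)
# with only the standard library. The star counts below are toy values.
import statistics

def summarize(values):
    # quantiles(n=4) returns the three quartile cut points (exclusive method)
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "25%": q1, "50%": median, "75%": q3,
        "max": max(values),
    }

stars = [0, 1, 2, 8, 13, 43, 2703]
print(summarize(stars))
```

Note that statistics.quantiles defaults to the "exclusive" method, which can differ slightly from pandas' default quartile interpolation on small samples.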

The Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The 7 models' distributions are the same.

• H1: The 7 models' distributions are different.

from scipy.stats import kruskal
stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(), dfLstm['star'].tolist(),
                  dfNcf['star'].tolist(), dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                  dfWideDeep['star'].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories in GitHub. This unexpected result can be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or BERT model requires a large amount of time and effort, yet developers still show their interest in these novel deep learning models.

Figure 4.9: Star vs Contributors (scatter of stargazers_count against number_of_contributors, per model)

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show scatter plots correlating the number of stars with the number of contributors, the development time, the number of open issues, and the entropy value, respectively.


Figure 4.10: Star vs Development Time (scatter of stargazers_count against develop_duration, per model)

Figure 4.11: Star vs Open Issues (scatter of stargazers_count against open_issues, per model)

Figure 4.12: Star vs Entropy Value (scatter of stargazers_count against entropy, per model)

Number of Contributors
From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Among all the repositories, the top-3 models with the most stars per contributor are CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor), and Bert (155.0 stars/contributor).

Model           Percentage of One-Contributor Development (%)
Bert            74.53
CNN             83.3
LSTM            85.9
NCF             100
ResNet          90.26
Transformer     81.20
Wide and Deep   89.74

Table 4.4: Percentage of one-contributor development for DL related repositories


Development Time
From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value = 0). This suggests that the longer a model develops, the more stars it will have (i.e., the model becomes more popular). The top-2 repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues
From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can make the assumption that the more popular a repository becomes, the more issues it will have. We will further investigate this correlation in the following section.

Entropy
From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project, we also investigate the impact of a new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can confirm this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. Here we introduce an information-theoretic approach to test whether contributions are even or not.

Entropy
In particular, we compute the entropy H of each repository, defined as

p_i = c_i / Σ_i c_i        (4.1)

H = −Σ_i p_i · log2(p_i)        (4.2)

where i denotes the i-th contributor, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for one repository.
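Equations (4.1) and (4.2) can be sketched in a few lines (the contribution counts below are illustrative, not from our dataset):

```python
# Normalize each contributor's commit count into p_i (equation 4.1), then
# compute the Shannon entropy H of the distribution (equation 4.2).
import math

def contribution_entropy(contributions):
    total = sum(contributions)
    probs = [c / total for c in contributions]          # equation (4.1)
    return -sum(p * math.log2(p) for p in probs if p)   # equation (4.2)

# one dominant contributor -> low entropy
print(contribution_entropy([90, 8, 2]))  # ≈ 0.54
```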

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example: its contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated.

Total = 174 + 36 + 4 = 214        (4.3)

p_1 = 174/214,  p_2 = 36/214,  p_3 = 4/214        (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.80133        (4.5)

name            contribution
dragen1860      174
ash3n           36
kelvinkoh0308   4

Table 4.5: Sample Contributions to One Repository

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly. The lower the entropy, the higher the phase separation, which means the work is distributed more unevenly.

Figure 4.13 shows the distribution of the entropy values for all models. From these figures, we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


Figure 4.13: Collaboration Entropy (entropy distribution histograms per model)


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original. We observe that, among the models studied, Bert has a high proportion of unique forked repositories.

Figure 4.14: Percentage of forked repositories unique from origin (boxplot of unique_percent, per model).

Figure 4.16 shows the distribution of the number of lines changed relative to the original repository. Our objective was a summarized view of what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15. Most of the model repositories are not changed after forking. A more detailed analysis shows at a glance not only that changes are rarely made after forking, but also that most changed

Figure 4.15: Repository uniqueness distribution (%) — binned percentage histograms, per model.

Figure 4.16: Repository change statistic — histograms of mean lines changed (binned), per model.


forks differ in repository size from the original by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them of less concern to people. Second, a model may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, and that a large number of forked projects show no change from the original repository.
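The per-fork comparison behind this conclusion can be sketched as follows, using repository size values as recorded in the collected metadata; the function name and the byte-granularity size field are illustrative assumptions, not STAMPER's actual code:

```python
def fork_change_summary(original_size, fork_sizes):
    """Classify forks by how far their recorded size diverges from the
    original repository -- a proxy for whether the codebase was changed."""
    diffs = [f - original_size for f in fork_sizes]
    unchanged = sum(1 for d in diffs if d == 0)
    small = sum(1 for d in diffs if 0 < abs(d) <= 100)  # the 0-100 byte band
    return {
        "unchanged_pct": 100.0 * unchanged / len(diffs),
        "small_change_pct": 100.0 * small / len(diffs),
    }

# Hypothetical example: a 5000-byte original with four forks
print(fork_change_summary(5000, [5000, 5000, 5060, 7500]))
# {'unchanged_pct': 50.0, 'small_change_pct': 25.0}
```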

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey software maintenance in these deep-learning-related repositories. The overall purpose is to explore three factors related to maintenance: development time, the number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report, we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)

Figure 4.17 and Table 4.6 show how development time varies by model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). Therefore, we hypothesize that many of these models began to be used in the open-source web community immediately after their first release.
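Equation 4.6 translates directly into code; a minimal sketch using the ISO-8601 created_at / updated_at timestamps that GitHub reports in repository metadata (the sample dates below are hypothetical):

```python
from datetime import datetime

def repository_age_days(created_at, updated_at):
    """Development time (Eq. 4.6): T(updated_at) - T(created_at), in days."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # GitHub's timestamp format
    delta = datetime.strptime(updated_at, fmt) - datetime.strptime(created_at, fmt)
    return delta.days

print(repository_age_days("2018-10-17T00:00:00Z", "2019-02-04T00:00:00Z"))  # 110
```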


Model        Max of days  Q3 of days  Median of days  Q1 of days  Min of days
Bert         779          229         110             32          0
Transformer  1254         321         142             11          0
Wide deep    1107         575         117             0.5         0
ResNet       1360         456.5       120             15          0
NCF          1120         476         216             8           0
LSTM         1812         621.25      315.5           47.25       0
CNN          1385         699.25      483             270.25      0

Table 4.6: Repository development time statistics

Figure 4.17: Development time boxplot (days, per model).

Figure 4.18: Development time vs. number of open issues (scatter of develop_duration against open_issues, coloured by model).

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As suggested visually by the figure and confirmed by a Spearman correlation test, there is a weak correlation between the two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, given the high cost of maintenance, may have more users and more issues related to them.
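The Spearman coefficient reported above is simply the Pearson correlation of the rank-transformed data; a self-contained sketch is given below (in practice one would use a statistics library such as scipy):

```python
def _ranks(xs):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Any perfectly monotone relationship gives a coefficient of 1.0
print(spearman([1, 5, 30, 200], [2, 3, 10, 11]))  # ≈ 1.0
```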

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414), and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model        mean   Std     25%  50%  75%  min  max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository open issue statistics


Model-Related Repository   Percentage of repositories having a Wiki (%)
Bert                       97.17
CNN                        98.498
LSTM                       98.799
NCF                        98.864
ResNet                     98.817
Transformer                96.97
Wide deep                  100

Table 4.8: Descriptive statistics on percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep-learning-related repositories are well documented (i.e., most have a wiki page). Figure 4.19 shows the histogram of open issues. All of these samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then, using the collected data, we investigated three common software engineering aspects of deep learning repositories: popularity, contribution, and maintenance.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original codebase after forking.

Figure 4.19: Distribution of open issues — binned histograms of open_issues (count of records), per model.


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitation and Improvement

Pinpointing the landmark deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for in-depth analysis of model construction and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation in the future — for example, users may use the prototxt format to publish their models. In our project, we only focused on deep learning models constructed using Python. The findings may also reflect sampling problems: the present experiment uses a limited number of repositories on GitHub, as search results cannot exceed the 1000-repository boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples might yield a more precise outcome.

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly: different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to devise their own heuristics for data selection; experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a plugin for GitHub, allowing researchers



and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project, we investigated and examined the popularity of deep learning models via the number of repositories existing in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., k-means) to high-resolution time series data from commits.

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance, and contribution of deep-learning-related GitHub repositories and identified factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub; one avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields, serving people working at the intersection of social media analysis, data visualization, and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train, and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing, and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

bull Identify data sources for current trends in model amp dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

PyCharm 2019.1.3 (Professional Edition)
Build #PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
macOS 10.14.6

• Anaconda

– jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
- Git

Datasets

asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

- Before You Begin
- Prerequisites
- Install
- Running
- Test
- High Level Description of all Modules & Datasets
- Authors
- License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keeps the Mac awake with this useful app (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git (https://git-scm.com/downloads)
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

    pip3 install --upgrade pip
    pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run `python3 model_searcher.py` to get keyword-related repository metadata from GitHub in the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions `call_api` and `call_api_dir`. Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample case: in `main()`, change `keywords` in terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions `get_total_pages` and `request_ith_page`: `sort` can be `updated` or `stars`; `order` can be `asc` or `desc`.

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

Run `python3 repository_filter.py` to get your code-related repositories with statistics in the filtered_repo folder. Run `python3 filtered_repo.py` to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run `python3 visualizations/popularity.py`; graphs are written to visualizations/graphs/popularity.
- Maintenance: run `python3 visualizations/maintenance.py`; graphs are written to visualizations/graphs/maintenance.
- Contribution: run `python3 visualizations/contribution.py`; graphs are written to visualizations/graphs/contribution.
- Multi correlations: run `python3 visualizations/multi_variable.py`; graphs are written to visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not well maintained, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them to the file unreachable_urls.txt.

Usage: change the elements in `keywords`, then run `python3 test.py`. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. `tensorflow_models`). In the constructor `Model`, we store all unfiltered_data, filtered_data, and forked_time_location in three folders.

Instantiation: once you have data from the previous steps (1-2), you can construct a model by calling the constructor `Model`, e.g. `bert = Model("bert tensorflow", "desc_by_star")`, whose parameters are the model name and the repository-metadata subfolder. Then you can use this object with its related data easily (`from Model import bert` and use `bert` as you go along).

Customize keywords: in the module model_keyword.py, import your instantiation (e.g. `lstm`) and call `add_keywords`, e.g.:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection:

    output/
    ├── asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
    ├── asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    ├── by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    ├── desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    ├── desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    └── pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search:

    forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv


3. After Data Selection (Optional):

    filtered_repo/
    ├── bert.json
    ├── pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
    └── tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

4. Generated Graphs:

    graphs/
    ├── contribution/: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
    ├── maintenance/: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
    ├── multi_variable/: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
    └── popularity/: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu. Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only).

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19, and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

                                                                        • Appendix 4 README


Keyword                     Total of Repositories (including Forks) Collected    Total of Original Repositories Collected
ResNet tensorflow           6129                                                 339
Bert tensorflow             13734                                                106
CNN tensorflow              39765                                                1000
LSTM tensorflow             19572                                                1000
Transformer tensorflow      7188                                                 145
Wide and deep tensorflow    324                                                  39

Table 3.1: Repositories Related to Tensorflow


Uniqueness percentage distributions are composed of all the percentages p_i, each corresponding to its original repository R_i:

p_i = ΣC_i / ΣF_i    (3.2)

where C_i counts the forked repositories that changed the codebase and F_i counts all forked repositories of R_i.
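As a minimal sketch of Equation (3.2), the uniqueness percentage can be computed from one boolean per fork marking whether that fork changed the codebase. The helper name and encoding are illustrative, not STAMPER's actual code:

```python
# Illustrative sketch of Equation (3.2): uniqueness percentage p_i for one
# original repository, where C_i counts forks that changed the codebase
# and F_i counts all forks.
def uniqueness_percentage(changed_flags):
    """changed_flags: one boolean per fork, True if the fork differs from origin."""
    if not changed_flags:  # a repository with no forks
        return 0.0
    return sum(changed_flags) / len(changed_flags)

# The Y/N/Y/Y example of Figure 3.5: three of four forks changed.
print(uniqueness_percentage([True, False, True, True]))  # 0.75
```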

[Figure 3.5: Examine Uniqueness after Forking. An entity (E) groups original repositories (Repository 1-4); each forked repository (Forked Repository 1, 2, 3, ..., n) is marked as Changed (Y) or unchanged (N) relative to its origin.]

• Percentage of Forked Repositories Unique from Origin (Boxplots)

• Uniqueness Percentage Distribution for Each Entity (Histograms)

• Entropy Distribution for Each Entity (Histograms)

Maintenance

• Development Time Boxplot for Each Entity

• Open Issues Distribution for Each Entity

3.6 Summary

In this chapter, we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analyzing software trends in the GitHub community. We also introduced and analyzed two novel features of GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving and are built, trained, and deployed by researchers. Our tool can analyze such changes. We collected the historical information stored in GitHub and extracted repository metadata using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development has become a battlefield without gunsmoke. Researchers, companies, and developers are all competing for influence in deep learning. A variety of models exist, but there is no common bridge connecting those ideas. Historical data in GitHub are obscure and hard to find, and as a result many developers, especially experienced ones, remain in their original comfort zone. With our study, we hope to shed some light on deep learning use and highlight a few suggestions for the public.

This section aims to answer questions related to both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2500 popular repositories based on the number of stars. However, because there are few studies of popularity on GitHub, there is no standardized feature for measuring it. We analyze some potential features of each repository and hypothesize that popularity is strongly related to the number of stars a repository owns.

This decision will be justified in the following section with more background on GitHub.

• Watchers: Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching; however, watching does not imply being a collaborator [Git b]. A watcher could watch



Figure 4.2: Star Sort Menu [Git a]

a repository to receive notifications for new pull requests or issues that are created. Watchers can indicate how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

• Stars: Starring a repository makes it easy for a user to keep track of a repository they are interested in. Starred repositories appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks: A fork is created when a user would like to make their own copy of a repository. The user can fork a repository to suggest changes or to use it as the basis for a new project.

Based on the data gathered in the data collection stage, we extract the watchers_count, stargazers_count, and forks attributes and test whether these attributes are correlated. In Table 4.1 we summarize 86712 repositories (including their forked repositories) with their related popularity metrics and draw a scatter plot, as shown in Figure 4.3.
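As a concrete illustration, these three attributes can be read straight from a repository record of the GitHub REST API v3 (the function names and repository stub below are illustrative, not STAMPER's actual code). Note that this API reports watchers_count equal to stargazers_count, which is consistent with the perfect star/watcher correlation measured below:

```python
import json
import urllib.request

POPULARITY_KEYS = ("stargazers_count", "watchers_count", "forks_count")

def extract_popularity(repo_record):
    """Keep only the popularity-related attributes of a repository record."""
    return {key: repo_record[key] for key in POPULARITY_KEYS}

def fetch_popularity(full_name):
    """Fetch one repository's metadata from the GitHub REST API v3."""
    url = "https://api.github.com/repos/" + full_name
    with urllib.request.urlopen(url) as response:
        return extract_popularity(json.load(response))

# Offline example using a stub of an API response (values from Table 4.1):
sample = {"stargazers_count": 17940, "watchers_count": 17940,
          "forks_count": 4661, "name": "bert"}
print(extract_popularity(sample))
```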


Figure 4.3: Popularity Metric

star     forks_count    watchers_count    model name
17940    4661           17940             Bert
12405    3637           12405             Bert
5263     1056           5263              Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks, and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we should not assume normality and instead use a rank-based measure.


Spearman Correlation Coefficient

Definition: The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman, 2011]. This method allows us to test for a rank-order relationship between two numerical ordinal variables associated by a monotonic (increasing or decreasing) function.

Hypothesis Testing

• H0: The variables (star, fork, and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

• Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

• Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Set α = 0.05. The p-values p1, p2, and p3 are all less than α, and the calculations above show strong positive correlations, with values of coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875 respectively.

This means that the likelihood that the testing data are uncorrelated is very small (95% confidence), and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of the report, we treat the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, look like the two most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creation and forks. Aside from models with a longer history, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as already described in the background section.

The model development community recently saw the release of multiple powerful frameworks treated as baselines for building models. However, for many new models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.


[Figure 4.4: Repositories with Forks (accumulated number of repositories created, including forks, per model, 2015-2019)]

[Figure 4.5: Repositories without Forks (accumulated number of original repositories created per model, 2015-2019)]


[Figure 4.6: Repository Trend in GitHub For Each Model (per-model repository counts over time, 2015-2019)]


[Figure 4.7: Creation Time vs Stars (number of stars against repository creation time, per model)]

A fork is another copy of a repository. A forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories created including forks and the total number created excluding forks. Most of the repositories related to deep learning models are not original, which indicates that a considerable number of developers remain in a studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, unlike the previous summarizing method, we study how the popularity of deep learning models changes over time.

CNN and LSTM's dominant role

As per the comparison above, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the highest number of repositories created. Let us examine this using the data. In


2017, the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued rising to a higher level, where it remains.

What accounts for this tremendous usage difference? CNN and LSTM currently have two of the most important and significant communities in the deep learning field. These networks are essential in both computer vision and NLP, and they hold an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer. As the title indicates, ResNet and Transformer usage has improved significantly in the recent two years. Differing from earlier structures like CNN, both modify the original structure and significantly improve results in computer vision and translation tasks.

Rising Star: BERT

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends, as depicted in the graph, suggest that deep learning models are proliferating fast with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with how long the model has existed, but our data tell a different story.

Published in 2016, NCF draws the least attention in the GitHub community. This shows that there is no necessary relationship between popularity (i.e., stars) and creation time; past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific model could flatten out or reverse itself.

Similarly for the Wide and Deep model: although Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the previous data also confirm that there is no significant rise in its use.

[Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (fork-count distribution histograms per model)]


4.1.4 RQ2: How popularity varies per model

Figure 4.8 shows the distribution of the number of forks for each model, and as presented in Table 4.2 and Table 4.3 we can see that:

Model Name      Mean     STD       Min    25%    50%    75%    Max
Bert            498.65   2196.3    0      1      8      43     17940
CNN             106.84   611.97    2      3      8      32     13882
LSTM            48.82    214.22    0      1      2      13     2703
NCF             77       129.91    1      2      3      115    227
ResNet          46.88    221.43    0      0      1      8      2980
Transformer     186.79   1155.87   0      0      4      21     12408
Wide and Deep   16.23    36.80     0      0      1      8      146

Table 4.2: Stars Comparison

Model Name      Mean         STD          Min    25%    50%    75%     Max
Bert            128.214953   585.926617   0.0    0.0    1.0    16.5    4661.0
CNN             40.71        252.713617   0.0    1.0    4.0    14.0    6274.0
LSTM            17.793       71.956709    0.0    0.0    1.0    5.0     968.0
NCF             34.333333    58.603185    0.0    0.5    1.0    51.5    102.0
ResNet          17.442478    93.754994    0.0    0.0    0.0    3.0     1442.0
Transformer     53.518797    336.103826   0.0    0.0    1.0    6.0     3637.0
Wide and Deep   7.282051     16.364192    0.0    0.0    0.0    2.5     71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52), and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88), and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79), and ResNet (17.44).

Kruskal-Wallis Test: the Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The 7 models' distributions are the same.

• H1: The 7 models' distributions are different.

from scipy.stats import kruskal
stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(), dfLstm['star'].tolist(),
                  dfNcf['star'].tolist(), dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                  dfWideDeep['star'].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the previous popularity exploration based on the number of repositories in GitHub. This unexpected result can be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or Bert model requires a large amount of time and effort, yet developers still show their interest in these novel deep learning models.

[Figure 4.9: Star vs Contributors (stargazers_count against number_of_contributors, per model)]

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy value, respectively.


[Figure 4.10: Star vs Development Time (stargazers_count against develop_duration, per model)]

[Figure 4.11: Star vs Open Issues (stargazers_count against open_issues, per model)]

[Figure 4.12: Star vs Entropy Value (stargazers_count against entropy, per model)]

Number of Contributors: From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, p-value ≤ 0.01). Across all the repositories, the top-3 repositories with the most stars per contributor belong to the models CNN (168.75 stars/contributor), Transformer (15.51 stars/contributor), and Bert (15.50 stars/contributor).

Model           Percentage of One-Contributor Development (%)
Bert            74.53
CNN             83.3
LSTM            85.9
NCF             100
ResNet          90.26
Transformer     81.20
Wide and Deep   89.74

Table 4.4: Percentage of one-contributor development for DL related repositories


Development Time: From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value ≈ 0). This suggests that the longer a model develops, the more stars it will have (i.e., the model becomes more popular). The top-2 repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues: From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186, p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesize that the more popular a repository becomes, the more issues it will have; we investigate this correlation further in the following section.

Entropy: From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project we also investigate the impact of a new feature, the entropy value, on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work of developing a deep learning model is divided evenly. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can confirm this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. Here we introduce an information-theoretic approach to test whether their contributions are even.

Entropy: In particular, we compute the entropy H of each repository, defined as

p_i = c_i / Σ_i c_i    (4.1)

H = −Σ_i p_i log2(p_i)    (4.2)

where i indexes the i-th contributor, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example, the contribution table is summarized in Table 4.5, and its corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214    (4.3)

p_1 = 174/214,  p_2 = 36/214,  p_3 = 4/214    (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.7826    (4.5)

name            contribution
dragen1860      174
ash3n           36
kelvinkoh0308   4

Table 4.5: Sample Contributions to One Repository
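The calculation above can be checked with a short script (a sketch, assuming contributions are the per-contributor counts of Table 4.5):

```python
import math

def collaboration_entropy(contributions):
    """Entropy H of Equation (4.2) over per-contributor contribution counts."""
    total = sum(contributions)
    probabilities = (c / total for c in contributions)
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Table 4.5: dragen1860 (174), ash3n (36), kelvinkoh0308 (4)
print(round(collaboration_entropy([174, 36, 4]), 4))  # 0.7826
```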

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly. The lower the entropy, the higher the phase separation, which reflects more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy values for all models. From those figures, we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


[Figure 4.13: Collaboration Entropy (entropy distribution histograms per model)]


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models studied.

[Figure 4.14: Percentage of Forked Repositories Unique From Origin (boxplot of unique_percent per model)]

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15. Most of the model repositories are not changed after forking. A more detailed analysis shows at a glance that not only are changes rarely made after forking, but also most changed

[Figure 4.15: Repository Uniqueness Distribution (%) (histograms of uniqueness percentage per model)]

[Figure 4.16: Repository Change Statistic (histograms of mean size change, in bytes, of forks relative to their original repository, per model)]


repositories differ in size from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them less used. Second, the model itself may only be valid for specific types of data, which makes it less robust, less generalized, and less suited to developers' needs.

We conclude that the development effort across forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey software maintenance problems in these deep learning related repositories. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report, we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
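A minimal sketch of Equation (4.6), assuming the ISO-8601 timestamp format that GitHub uses for created_at and updated_at (the example timestamps below are illustrative, not real data):

```python
from datetime import datetime

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # format GitHub uses, e.g. 2018-10-17T00:00:00Z

def repo_age_days(created_at, updated_at):
    """age = T(updated_at) - T(created_at), in whole days (Equation 4.6)."""
    delta = datetime.strptime(updated_at, ISO_FORMAT) - datetime.strptime(created_at, ISO_FORMAT)
    return delta.days

print(repo_age_days("2018-10-17T00:00:00Z", "2019-02-04T00:00:00Z"))  # 110
```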

Figure 4.17 and Table 4.6 show how development time varies per model. The median development time is as follows: Bert (110 days), Transformer (142 days), Wide and Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs across models (p-value ≤ 0.05). Therefore, we hypothesize that for many of the earlier models, developers started using the open-source community immediately after the first release.


Model           Max     Q3       Median   Q1       Min
Bert            779     229      110      32       0
Transformer     1254    321      142      11       0
Wide and Deep   1107    575      117      0.5      0
ResNet          1360    456.5    120      1.5      0
NCF             1120    476      216      8        0
LSTM            1812    621.25   315.5    47.25    0
CNN             1385    699.25   483      270.25   0

Table 4.6: Repository Development Time Statistics (days)

[Figure 4.17: Development Time Boxplot (development days per model)]


[Figure 4.18: Development Time vs Number of Open Issues (develop_duration against open_issues, per model)]

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and the Spearman correlation test, there is a weak correlation between these two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, given their higher maintenance cost, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.299), CNN (3.414), and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model         Mean    Std      25%   50%   75%   Min   Max
Bert          8.299   50.55    0     0     1     0     504
CNN           3.414   35.456   0     0     1     0     1077
LSTM          1.292   4.915    0     0     1     0     69
ResNet        1.791   11.164   0     0     0     0     186
Transformer   1.857   8.608    0     0     1     0     95
Wide Deep     0.231   0.742    0     0     0     0     4

Table 4.7: Repository open-issue statistics


Model-related repository   Percentage having a wiki (%)
Bert                       97.17
CNN                        98.498
LSTM                       98.799
NCF                        98.864
ResNet                     98.817
Transformer                96.97
Wide deep                  100

Table 4.8: Percentage of repositories with a wiki

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data we collected, deep-learning-related repositories are well documented (i.e. most have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.
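The wiki statistic in Table 4.8 can be computed directly from the collected metadata. A minimal sketch, assuming the GitHub API's has_wiki field (the records below are made up):

```python
# Share of repositories whose metadata reports a wiki (GitHub API field: has_wiki).
repos = [
    {"name": "repo-a", "has_wiki": True},
    {"name": "repo-b", "has_wiki": True},
    {"name": "repo-c", "has_wiki": False},
    {"name": "repo-d", "has_wiki": True},
]

pct_with_wiki = 100 * sum(r["has_wiki"] for r in repos) / len(repos)
print(f"{pct_with_wiki:.2f}% of repositories have a wiki")
```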

4.4 Summary

In this section, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories, using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original codebase after forking.


Figure 4.19: Open Issues vs Number of Repository (per-model histograms of binned open-issue counts, 0 to 100, against the count of records for the bert, cnn, lstm, ncf, resnet, transformer and wide deep TensorFlow repositories)


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitation and Improvement

Pinpointing the landmarks of deep learning models is tricky. We had plenty of ideas on identifying models using different strategies, and we developed heuristics for in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question which needs further investigation in the future (for example, users may use the prototxt format to publish their models, whereas our project only covers deep learning models constructed using Python). The findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, since a search cannot exceed the boundary of 1000 originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub, but this still cannot capture all the repositories on GitHub. It may be that other, more stratified samples would give a more precise outcome.

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to learn how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program could provide a broader picture of deep learning model usage in the world. Our program also allows a developer or user to define their own heuristics for data selection: experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this work by migrating to open-source software, or becoming a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models through the number of repositories existing on GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g. k-means) to high-resolution time series data from commits.
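As a sketch of this direction (not part of STAMPER), k-means from scikit-learn could group repositories by the shape of their commit activity; the weekly commit-count series below are synthetic.

```python
# Cluster synthetic weekly commit-count series into "cooling" vs "heating" groups.
import numpy as np
from sklearn.cluster import KMeans

weekly_commits = np.array([
    [9, 8, 7, 5, 3, 1],  # activity cooling down
    [1, 2, 4, 6, 8, 9],  # activity heating up
    [8, 9, 8, 6, 2, 1],  # cooling down
    [0, 1, 3, 7, 9, 8],  # heating up
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(weekly_commits)
print(labels)
```

In practice the series would come from the commit timestamps STAMPER collects, binned per week and normalized before clustering.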

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of deep-learning-related repositories on GitHub and identified factors that affect these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields, and will serve the needs of people working at the intersection of social media analysis, data visualization and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models and datasets) is evolving and changing rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past, based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models and datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization / analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015); Processor: 2.2 GHz Intel Core i7; Memory: 16 GB 1600 MHz DDR3; Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

PyCharm 2019.1.3 (Professional Edition)


Build #PY-191.7479.30, built on May 30, 2019; Licensed to ANU / Xing Yu; JRE: 11.0.2+9-b159.60 x86_64; JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.; macOS 10.14.6

• Anaconda

– jupyter-notebook 6.0.0

Other

- Python 3.7.4
  -- pandas==0.22.0  -- numpy==1.14.0
  -- statistics==1.0.3.5  -- ratelimit==2.2.1
  -- requests  -- altair  -- matplotlib==2.2.2
  -- selenium
- Git

Datasets

asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

- Before You Begin
- Prerequisites
- Install
- Running
- Test
- High Level Description of all Modules & Datasets
- Authors
- License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view, edit and execute our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keep your Mac awake with this useful app (otherwise the internet connection will drop)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

    pip3 install --upgrade pip
    pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repository metadata from GitHub in the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars, order can be asc or desc.

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder. Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity
- Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance
- Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution
- Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them into the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with the model name and repository metadata subfolder as parameters. Then you can use this object with its related data easily (from Model import bert and use bert as you go along).

Customize keywords: in module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection:

    output
    ├── asc_by_star: cnn tensorflow.json, lstm tensorflow.json
    ├── asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    ├── by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    ├── desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    ├── desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    └── pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search:

    forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional):

    filtered_repo
    ├── bert.json
    ├── pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
    └── tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs:

    graphs
    ├── contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
    ├── maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
    ├── multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
    └── popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: Free for personal use only)

Bibliography

Git, a. GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

Git, b. GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

Git, c. GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

Git, d. GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
    • Deep learning models
    • Summarized Timeline
  • Public Code Repositories
    • Web-based hosting service
    • Measuring Popularity From GitHub
    • Extracting Messy Data in the Wild
    • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    • RQ2: How does popularity vary per model?
    • RQ3: Does the popularity of models relate to other features?
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the Wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commit Timestamps
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Software
      • Other
      • Datasets
  • Appendix 4: README


below. Uniqueness percentage distributions are composed of all the percentages p_i, each corresponding to its original repository R_i:

    p_i = Σ C_i / Σ F_i    (3.2)
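Reading Equation (3.2) as the changed-fork count C_i over the total fork count F_i (an interpretation based on Figure 3.5, not spelled out in the text), the statistic reduces to a one-line computation. A minimal sketch, with made-up fork flags:

```python
# Uniqueness percentage p_i: fraction of a repository's forks whose codebase
# diverged from the origin (interpretation of Equation 3.2).
def uniqueness_percentage(fork_changed_flags):
    """fork_changed_flags: list of booleans, True if the fork changed the codebase."""
    if not fork_changed_flags:
        return 0.0
    return sum(fork_changed_flags) / len(fork_changed_flags)

repo_forks = [True, False, True, True]  # 3 of 4 forks diverged
print(uniqueness_percentage(repo_forks))
```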

Figure 3.5: Examine Uniqueness after Forking (an entity E contains repositories R_1 to R_4; each forked repository of a repository is marked Changed Y/N)

• Percentage of Forked Repositories Unique from Origin (Boxplots)

• Uniqueness Percentage Distribution for Each Entity (Histograms)

• Entropy Distribution for Each Entity (Histograms)

Maintenance

• Development Time Boxplot for Each Entity

• Open Issues Distribution for Each Entity

3.6 Summary

In this chapter we detailed the design of our tool and how it conducts repository mining and analysis. We presented a tool that facilitates the scalable extraction of original repositories, together with their forked repositories, related to deep learning models. It can be used as a stand-alone tool for analyzing software trends in the GitHub community. We also introduced and analyzed two novel features of GitHub repositories.

Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving as they are built, trained and deployed by researchers. Our tool is available for analyzing such changes: we collected the historical information stored in GitHub and extracted the repository metadata using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development is a battlefield without gunsmoke. Researchers, companies and developers are trying to dominate the conversation in deep learning. A variety of models exist, but at the same time there is no common bridge to connect those ideas together. Historical data in GitHub is obscure and hard to find, and as a result many developers, especially experienced developers, remain in their original zone. With our study, we hope to shed some light on deep learning and highlight a few suggestions for the public.

This section aims to answer questions related to both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, given the few studies about popularity in the GitHub ecosystem, there is no standardized feature for measuring popularity. We analyze some potential features of each repository and hypothesize that popularity is strongly related to the number of stars a repository owns.

This decision is justified in the following section with more background on GitHub.

• Watchers
  Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of activity in a repository they are watching; watching, however, does not mean collaborating [Git b]. A watcher can watch a repository to receive notifications for the new pull requests or issues that are created, so watchers indicate how much interest the GitHub community gives to the repository.

Figure 4.1: Repository Watching [Git b]

Figure 4.2: Star Sort Menu [Git a]

• Stars
  Starring a repository makes it easy for a user to keep track of a repository they are interested in. A starred repository appears on the user's own host domain (https://[hostname]/stars). Stars are another metric of popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks
  Forks are created when a user makes their own copy of a repository. A user can fork a repository to suggest changes, or to use it as the basis for a new project.

Based on the data we collected in the data collection stage, we extract the watchers_count, stargazers_count and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.
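A hedged sketch of this extraction step, assuming repository records shaped like GitHub's REST API responses (the field names follow the API; the values are made up):

```python
# Pull the three popularity attributes out of collected repository metadata.
import pandas as pd

records = [
    {"full_name": "a/bert-fork", "stargazers_count": 17940,
     "forks_count": 4661, "watchers_count": 17940},
    {"full_name": "b/bert-fork", "stargazers_count": 12405,
     "forks_count": 3637, "watchers_count": 12405},
]

df = pd.DataFrame(records)
popularity = df[["stargazers_count", "forks_count", "watchers_count"]]
print(popularity)
```

The resulting columns are what the Spearman tests below operate on.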


Figure 4.3: Popularity Metric

star     forks_count   watchers_count   model name
17940    4661          17940            Bert
12405    3637          12405            Bert
5263     1056          5263             Bert

Table 4.1: Popularity metrics for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we should not assume normality and instead consider a rank-based measure.


Spearman Correlation Coefficient

Definition:
The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman 2011]. This statistical method allows us to test for a rank-order relationship between two numerical or ordinal variables associated by a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

• Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

• Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between each pair of the three variables in the testing dataset.

Setting α = 0.05, the p-values p1, p2 and p3 are all less than α. From the calculation above we can also see that there is a strong positive correlation in each case, with values of coef1 = 0.875, coef2 = 1.0 and coef3 = 0.875 respectively.

This means it is very unlikely (at 95% confidence) that the testing data are uncorrelated, and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of this report, we take the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, appear to be two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in terms of both creations and forks. Aside from these models with a longer history, BERT and ResNet are two rising stars in the model competition. They arrived with significant improvements in architecture design and performance, as described in the background section.

The model development community has recently seen the release of multiple powerful frameworks, which are treated as baselines for building models. However, for many new models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.


[Chart: Number of Repositories Created With Forks (Accumulated), 2015-2019, per model]

Figure 4.4: Repositories with Forks

[Chart: Number of Repositories Created (Accumulated), 2015-2019, per model]

Figure 4.5: Repositories without Forks


[Chart: per-model repository counts over time, October 2015 - October 2019]

Figure 4.6: Repository Trend in GitHub For Each Model


[Scatter plot: created_at vs Number of Stars, per model]

Figure 4.7: Creation Time vs Stars

A fork is another copy of a repository. A forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories created with forks counted and the total number created without forks. Most of the repositories related to deep learning models are not original, which indicates that a considerable number of developers remain at the learning stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, unlike the previous summarizing method, we study how the popularity of deep learning models changes over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the largest number of repositories created. Let us examine this using the data: in 2017 the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued rising to an even higher level, which persists to now.

What accounts for this tremendous difference in usage? CNN and LSTM currently have among the most important and significant communities in the deep learning field. These networks are essential in both computer vision and NLP, where they form an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer
As the title indicates, ResNet and Transformer usage has improved significantly in the last two years. Unlike earlier structures such as the plain CNN, both modify the original structure and significantly improve results in computer vision and translation tasks.

Rising Star: Bert

However, no model comes with perfection. LSTM itself can be extended into many variants, and BERT is one of those.

The current trends depicted in the graph suggest the conclusion that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with the time the model came into existence, but our data tells a different story.

Published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e. stars) and creation time: past trends are no guarantee of future ones, and it is possible that the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

Similarly for the Wide and Deep model: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the previous data also confirms that there is no significant rise in the use of this model.

[Histograms: forks_count (binned, 0-1000) per model]

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name     Mean    STD      Min  25%  50%  75%  Max
Bert           498.65  2196.3   0    1    8    43   17940
CNN            106.84  611.97   2    3    8    32   13882
LSTM           48.82   214.22   0    1    2    13   2703
NCF            77      129.91   1    2    3    115  227
ResNet         46.88   221.43   0    0    1    8    2980
Transformer    186.79  1155.87  0    0    4    21   12408
Wide and Deep  16.23   36.80    0    0    1    8    146

Table 4.2: Stars Comparison

Model Name     Mean        STD         Min  25%  50%  75%   Max
Bert           128.214953  585.926617  0.0  0.0  1.0  16.5  4661.0
CNN            40.710      252.713617  0.0  1.0  4.0  14.0  6274.0
LSTM           17.793      71.956709   0.0  0.0  1.0  5.0   968.0
NCF            34.333333   58.603185   0.0  0.5  1.0  51.5  102.0
ResNet         17.442478   93.754994   0.0  0.0  0.0  3.0   1442.0
Transformer    53.518797   336.103826  0.0  0.0  1.0  6.0   3637.0
Wide and Deep  7.282051    16.364192   0.0  0.0  0.0  2.5   71.0

Table 4.3: Forks Comparison
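Summary rows of this shape can be reproduced with the standard library alone. The helper below is a sketch (the thesis pipeline presumably used pandas' `describe()`, which yields the same seven figures, up to the quantile interpolation method):

```python
import statistics

def describe(values):
    """One summary row in the shape of Tables 4.2 and 4.3."""
    values = sorted(values)
    # 'inclusive' quantiles interpolate between data points, like numpy/pandas defaults
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"mean": statistics.mean(values), "std": statistics.stdev(values),
            "min": values[0], "25%": q1, "50%": q2, "75%": q3, "max": values[-1]}

print(describe([0, 1, 8, 43, 17940]))  # hypothetical star counts for five repositories
```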

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52) and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79) and ResNet (17.44).

Kruskal-Wallis Test. The Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The 7 models' distributions are the same.

• H1: The 7 models' distributions are different.

from scipy.stats import kruskal
stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(), dfLstm['star'].tolist(),
                  dfNcf['star'].tolist(), dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                  dfWideDeep['star'].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05 we can reject the null hypothesis, and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the previous popularity exploration based on the number of repositories existing in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or BERT model may require a large amount of time and effort, while developers still show their interest in those novel deep learning models by starring and forking them.

[Scatter plot: number_of_contributors vs stargazers_count, per model]

Figure 4.9: Star vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues and entropy, respectively.


[Scatter plot: develop_duration vs stargazers_count, per model]

Figure 4.10: Star vs Development Time

[Scatter plot: open_issues vs stargazers_count, per model]

Figure 4.11: Star vs Open Issues

[Scatter plot: entropy vs stargazers_count, per model]

Figure 4.12: Star vs Entropy Value

Number of Contributors
From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated to the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Among all the repositories, the top-3 repositories with the most stars per contributor are from the models CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor) and Bert (155.0 stars/contributor).

Model          Percentage of One-Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time
From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated to development time (ρ = 0.7341 with p-value = 0). This suggests that the longer a model is developed, the more stars it will have (i.e. the model becomes more popular). The top-2 repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues
From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated to open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can make the assumption that the more popular a repository becomes, the more issues it will have. We will further investigate this correlation in the following section.

Entropy
From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated to entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of a new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can confirm this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. Here we introduce an information-theoretic approach to test whether the contributions are even or not.

Entropy
In particular, we compute the entropy H of each repository, defined as

p_i = \frac{c_i}{\sum_i c_i} \qquad (4.1)

H = -\sum_i p_i \log_2(p_i) \qquad (4.2)

where i indexes the contributors, c_i is the i-th contributor's contribution, and \sum_i c_i is the total contribution for the repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example, the contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

\text{Total} = 174 + 36 + 4 = 214 \qquad (4.3)

p_1 = \frac{174}{214}, \quad p_2 = \frac{36}{214}, \quad p_3 = \frac{4}{214} \qquad (4.4)

H(\text{repository}) = -\left( \frac{174}{214}\log_2\frac{174}{214} + \frac{36}{214}\log_2\frac{36}{214} + \frac{4}{214}\log_2\frac{4}{214} \right) \approx 0.78264 \qquad (4.5)

name           contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample Contributions to One Repository
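Applying Equations 4.1 and 4.2 to the contribution counts in Table 4.5 takes only a few lines; the helper below is a sketch of that calculation:

```python
import math

def repo_entropy(contributions):
    """Collaboration entropy H = -sum_i p_i * log2(p_i) (Equations 4.1-4.2)."""
    total = sum(contributions)
    shares = [c / total for c in contributions]
    return -sum(p * math.log2(p) for p in shares)

# Contribution counts from Table 4.5: dragen1860, ash3n, kelvinkoh0308
print(repo_entropy([174, 36, 4]))
```

A single-contributor repository gives H = 0, while a perfectly even two-person split gives H = 1.0.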

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly. The lower the entropy, the higher the phase separation, which means more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy value for all models. From those figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer, or by a team with an uneven allocation of work.


[Histograms: entropy (binned, 0.0-3.0) per model]

Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models.
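The uniqueness check can be sketched as follows. The `size` field (in KB) is a real GitHub repository metadata field, while `parent_size` is a hypothetical field assumed to have been filled in by an upstream step pairing each fork with its original repository:

```python
# Sketch: flag forks whose repository size differs from the original.
forks = [
    {"full_name": "alice/bert-fork", "size": 120, "parent_size": 120},
    {"full_name": "bob/bert-fork",   "size": 185, "parent_size": 120},
    {"full_name": "carol/bert-fork", "size": 120, "parent_size": 120},
]

changed = [f for f in forks if f["size"] != f["parent_size"]]
unique_percent = 100 * len(changed) / len(forks)
print(f"{unique_percent:.1f}% of forks differ from the original")
```

Size is only a coarse signal: a fork could change content without changing size, so a line-level diff (as in Figure 4.16) is the stronger check.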

[Boxplot: unique_percent (0-100) per model]

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was to provide a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. Looking in more detail, we can see at a glance that changes are rarely made after forking.


[Histograms: percentage (binned, 0.00-1.00) per model]

Figure 4.15: Repository Uniqueness Distribution (%)

[Histograms: means (binned, -2500 to 2500) per model]

Figure 4.16: Repository Change Statistic


Moreover, for most changed forks, the repository size differs from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them less used. Second, the model itself may only be valid for a specific type of data, making it less robust and generalized, and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, the problems of software maintenance in these deep learning related repositories are surveyed. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from the repository creation time, as depicted in the equation below:

age = T(updated_at) - T(created_at) \qquad (4.6)
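Equation 4.6 maps directly onto the ISO-8601 timestamps that the GitHub API returns. A minimal sketch (the example timestamps below are hypothetical):

```python
from datetime import datetime, timezone

def repo_age_days(created_at, updated_at):
    """Repository age in whole days, per Equation 4.6."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # GitHub API timestamp format
    t0 = datetime.strptime(created_at, fmt).replace(tzinfo=timezone.utc)
    t1 = datetime.strptime(updated_at, fmt).replace(tzinfo=timezone.utc)
    return (t1 - t0).days

print(repo_age_days("2018-10-17T00:00:00Z", "2019-01-04T00:00:00Z"))  # -> 79
```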

Figure 4.17 and Table 4.6 show how development time varies depending on the model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide and Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days by model is different (p-value ≤ 0.05). Therefore we hypothesize that for many of these models, developers start using them in the open-source web community immediately after the first release.


Model          Max (days)  Q3 (days)  Median (days)  Q1 (days)  Min (days)
Bert           779         229        110            32         0
Transformer    1254        321        142            11         0
Wide and Deep  1107        575        117            0.5        0
ResNet         1360        456.5      120            15         0
NCF            1120        476        216            8          0
LSTM           1812        621.25     315.5          47.25      0
CNN            1385        699.25     483            270.25     0

Table 4.6: Repository Development Time Statistics

[Boxplot: development time in days (0-2000) per model]

Figure 4.17: Development Time Boxplot


[Scatter plot: open_issues vs develop_duration, per model]

Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and the Spearman correlation test, there is a weak correlation between those two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, due to the high cost of maintenance, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model          Mean   Std     25%  50%  75%  Min  Max
Bert           8.299  50.55   0    0    1    0    504
CNN            3.414  35.456  0    0    1    0    1077
LSTM           1.292  4.915   0    0    1    0    69
ResNet         1.791  11.164  0    0    0    0    186
Transformer    1.857  8.608   0    0    1    0    95
Wide and Deep  0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository  Percentage of Repositories Having Wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
ResNet                    98.817
Transformer               96.97
Wide and Deep             100

Table 4.8: Descriptive statistics on percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e. they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.
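A sketch of how the wiki percentage can be computed from the boolean `has_wiki` field in the repository metadata (repository names here are hypothetical; note that `has_wiki` records whether the wiki feature is enabled, not whether pages were actually written):

```python
# Sketch: share of repositories with the wiki feature enabled.
repos = [
    {"name": "repo-a", "has_wiki": True},
    {"name": "repo-b", "has_wiki": True},
    {"name": "repo-c", "has_wiki": False},
    {"name": "repo-d", "has_wiki": True},
]

wiki_percent = 100 * sum(r["has_wiki"] for r in repos) / len(repos)
print(f"{wiki_percent:.2f}% of repositories have a wiki")
```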

4.4 Summary

In this chapter, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original codebase after forking.


[Histograms: open_issues (binned, 0-100) per model]

Figure 4.19: Open Issues vs Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmark deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question which needs further investigation. For example, users may use the prototxt format to publish their models, while in our project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, since search results cannot exceed the 1000-repository boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all repositories in GitHub. Other, more stratified samples might produce a more precise outcome.

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real life, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly, with different models released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and to explore other models in the wild, so the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to develop their own heuristics for data selection: experts could easily change the API search in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software, or becoming a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models through the number of repositories existing in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g. k-means) to high-resolution time series data from commits.

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis in GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub deep learning related repositories, and identified factors affecting those domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to help users gain a deeper insight into the current deep learning trend and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, and to serve the needs of people working at the intersection of social media analysis, data visualization and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models and datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models and datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model and dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

bull PyCharm

PyCharm 2019.1.3 (Professional Edition)


Build PY-191.7479.30, built on May 30, 2019. Licensed to ANU / Xing Yu. JRE: 11.0.2+9-b159.60 x86_64. JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o. macOS 10.14.6.

bull Anaconda

– jupyter-notebook 6.0.0

Other

- Python 3.7.4
  -- pandas==0.22.0
  -- numpy==1.14.0
  -- statistics==1.0.3.5
  -- ratelimit==2.2.1
  -- requests
  -- altair
  -- matplotlib==2.2.2
  -- selenium
- Git

Datasets

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

- Before You Begin
- Prerequisites
- Install
- Running
- Test
- High Level Description of all Modules & Datasets
- Authors
- License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing, editing or executing our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keeps the Mac awake (otherwise the internet connection will drop during long runs).

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run `python3 model_searcher.py` to get keyword-related repository metadata from GitHub in the `output` folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions `call_api` and `call_api_dir`.

Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample case: in `main()`, change `keywords` to the terms of interest. The resulting JSON file will be `output/bert.JSON`. The sorting method can be customized in the functions `get_total_pages` and `request_ith_page`: `sort` can be `updated` or `stars`, and `order` can be `asc` or `desc`.
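To make the `sort`/`order` options above concrete, here is a sketch of the kind of GitHub Search API query that the collection step issues. `build_search_query` is a hypothetical helper, not part of STAMPER itself; the parameter names follow the public GitHub Search API.

```python
# Hypothetical helper illustrating the search query described above;
# sort may be "stars" or "updated", order may be "asc" or "desc".
def build_search_query(keyword, page=1, sort="stars", order="desc", per_page=100):
    base = "https://api.github.com/search/repositories"
    q = keyword.replace(" ", "+")  # e.g. "bert tensorflow" -> "bert+tensorflow"
    return f"{base}?q={q}&sort={sort}&order={order}&page={page}&per_page={per_page}"

print(build_search_query("bert tensorflow"))
# → https://api.github.com/search/repositories?q=bert+tensorflow&sort=stars&order=desc&page=1&per_page=100
```

In the real tool the request is sent with an `Authorization: token ...` header, which is what raises the permitted request rate.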

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the fork timestamps in `forked_timestamp`.

3. Data Selection (Optional)

Run `python3 repository_filter.py` to get your code-related repositories with statistics in the `filtered_repo` folder. Run `python3 filtered_repo.py` to filter your data.

Note: your keywords can be customized in `model_keyword.py`. We store all the previous experiment data in `tensorflow_model_filtering` and `pytorch_model_filtering`.

4. Data Visualization

- Popularity: run `python3 visualizations/popularity.py` and get your graphs in `visualizations/graphs/popularity`.
- Maintenance: run `python3 visualizations/maintenance.py` and get your graphs in `visualizations/graphs/maintenance`.
- Contribution: run `python3 visualizations/contribution.py` and get your graphs in `visualizations/graphs/contribution`.
- Multi Correlations: run `python3 visualizations/multi_variable.py` and get your graphs in `visualizations/graphs/multi_variable`.

Test

Some GitHub repositories are not well maintained, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in `test.py`. This module records all the unreachable links and writes them into the file `unreachable_urls.txt`.

Usage: change the elements in `keywords` and run `python3 test.py`. All the unreachable links will be written to `unreachable_urls.txt`.

Customizing Your Own Search

In the module `Model.py`, define your own entity lists (e.g. `tensorflow_models`). In the constructor `Model`, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation

Since you already have data from the previous steps (1-2), you can construct a model by calling the constructor `Model`, e.g. `bert = Model("bert tensorflow", "desc_by_star")` (parameters: model name and repository metadata subfolder). Then you can call this object with its relative data easily (`from Model import bert` and use `bert` as you go along).

Customize Keywords

In the module `model_keyword.py`, import your instantiation (`lstm`) and call `add_keywords`, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)
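To make the instantiation pattern concrete, here is a minimal sketch of what the `Model` constructor might look like. This is a hypothetical reconstruction, not the actual `Model.py`; in particular the path layout is an assumption based on the folder names above.

```python
import os

# Hypothetical sketch of the Model constructor: a model instance is backed
# by one metadata JSON file inside the chosen subfolder (e.g. desc_by_star).
class Model:
    def __init__(self, model_name, subfolder):
        self.model_name = model_name
        # metadata collected in step 1 is assumed to live at
        # <subfolder>/<model name>.json
        self.metadata_path = os.path.join(subfolder, model_name + ".json")

bert = Model("bert tensorflow", "desc_by_star")
print(bert.metadata_path)  # → desc_by_star/bert tensorflow.json
```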

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection

output/
- asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
- asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv


3. After Data Selection (Optional)

filtered_repo/
- bert.json
- pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
- tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json


Generated Graphs

graphs/
- contribution/: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
- maintenance/: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
- multi_variable/: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
- popularity/: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)

Bibliography

[Git a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[Git b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[Git c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[Git d] GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265-283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334-344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396-399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7-10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77-ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1-11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433-444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345-355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12-21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173-182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735-1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1-4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142-146. IEEE. (cited on page 9)


Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on page 6)


Chapter 4

STAMPER in Action

Real-world software trends change over time, especially in the deep learning field. Deep learning models are continually evolving, and are built, trained and deployed by researchers. Our tool is available for analyzing such changes. We collected the historical information stored in GitHub and extracted the metadata in each repository by using STAMPER.

4.1 Popularity of Deep Learning Models in GitHub

In 2019, with the rapid improvement of computation speed and power, model development has become a fiercely contested field. Researchers, companies and developers are all trying to establish a dominant voice in deep learning. A variety of models exist, but there is no common bridge connecting those ideas. Historical data in GitHub is scattered and hard to find, and as a result many developers, especially experienced ones, remain in their original comfort zone. With our study, we hope to shed some light on deep learning trends and highlight a few suggestions for the public.

This section aims to answer some questions related to both models' usage in GitHub and the popularity of DL model development.

4.1.1 Popularity Feature Selection

Borges et al. [2016b] collected 2,500 popular repositories based on the number of stars. However, given the few existing studies on popularity in the GitHub ecosystem, there is no standardized feature to measure popularity. We analyze some potential features of each repository and make the hypothesis that popularity is strongly related to the stars a repository owns.

This decision will be justified in the following section, with more background on GitHub.

• Watchers
Similar to following an RSS feed, watchers are GitHub users who have asked to be notified of the activities of a repository they are watching. However, watching does not imply being a collaborator [Git b]. A watcher may watch a repository to receive notifications for new pull requests or issues. The number of watchers indicates how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

Figure 4.2: Star Sort Menu [Git a]

• Stars
Starring a repository makes it easy for a user to keep track of a repository they are interested in. The starred repository will appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks
Forks are created when a user would like to make a copy of an original repository. The user can fork a repository to suggest changes, or use it as the basis for a new project.

Based on the data we collected in the data collection stage, we extract the watchers_count, stargazers_count and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics, and draw a scatter plot as shown in Figure 4.3.
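The extraction itself is straightforward once the metadata JSON is on disk. A minimal sketch, assuming the field names of the GitHub REST API repository schema (the sample record here is illustrative, not real collected data):

```python
import json

# Illustrative single-record sample in the shape of GitHub repository metadata.
sample = '[{"name": "bert", "stargazers_count": 17940, "watchers_count": 17940, "forks_count": 4661}]'

repos = json.loads(sample)
# Pull out the three candidate popularity features per repository.
rows = [(r["stargazers_count"], r["forks_count"], r["watchers_count"]) for r in repos]
print(rows)  # → [(17940, 4661, 17940)]
```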


Figure 4.3: Popularity Metric

star    forks_count   watchers_count   model name
17940   4661          17940            Bert
12405   3637          12405            Bert
5263    1056          5263             Bert

Table 4.1: Popularity metrics for repositories

The scatter plot suggests a definite positive correlation between the number of stars, the number of forks and the number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, so we should not assume normality and instead consider a rank-based measure.


Spearman Correlation Coefficient

Definition. The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman 2011]. This statistical method allows us to test for a rank-order relationship between two numerical ordinal variables associated with a monotonic function (an increasing or decreasing relationship).
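Concretely, Spearman's coefficient is just Pearson's correlation computed on the ranks of the data, which is why it captures any monotone (not only linear) relationship. A minimal pure-Python sketch (ties not handled; illustrative data only):

```python
def ranks(xs):
    # 1-based ranks, assuming no tied values
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank + 1
    return r

def spearman(xs, ys):
    # Pearson correlation applied to the rank vectors
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# A perfectly monotone but non-linear relationship still yields rho = 1.0
print(round(spearman([10, 20, 35, 50], [1, 4, 9, 100]), 10))  # → 1.0
```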

Hypothesis Testing

• H0: The variables (star, fork and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

bull Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

bull Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

bull Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Setting α = 0.05: the p-values p1, p2 and p3 are all less than α, and from the calculation above we also find a strong positive correlation, with values of coef1 = 0.875, coef2 = 1.0 and coef3 = 0.875 respectively.

This means it is very unlikely (at 95% confidence) that the data are uncorrelated, and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of the report, we consider the number of stars to be the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, are two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from these models with a longer history, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as already described in the background section.

The model development community has recently seen the release of multiple powerful frameworks, which developers treat as baselines for building models. However, for some newer models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.


Figure 4.4: Repositories with Forks (accumulated number of repositories created, including forks, 2015-2019, for each model keyword)

Figure 4.5: Repositories without Forks (accumulated number of repositories created, 2015-2019, for each model keyword)


Figure 4.6: Repository Trend in GitHub For Each Model (per-model repository counts over time, October 2015 to October 2019)


Figure 4.7: Creation Time vs Stars (repository creation time against number of stars, per model)

A fork is another copy of a repository. A forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories including forks and the total number of original repositories created. We find that most of the repositories related to deep learning models are not original. This indicates that a considerable number of developers remain in a learning stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, in contrast to the previous summarizing method, we study the popularity of deep learning models as it changes over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the highest number of repositories created. The data bears this out: in 2017 the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued rising to an upper level, which has held until now.

What accounts for this tremendous usage difference? CNN and LSTM currently have two of the largest and most significant communities in the deep learning field. These networks are essential in both computer vision and NLP, and they hold an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer
As the title indicates, ResNet and Transformer usage has improved significantly in the last two years. Differing from earlier structures like CNN, both of them modify the original structure and significantly improve on the results in computer vision and translation tasks.

Rising star: BERT

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends, as depicted in the graph, suggest that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with when the model came into existence, but our data tells a different story.

Although its paper was published in 2016, NCF draws the least attention in the GitHub community. This data also shows that there is no relationship between popularity (i.e. stars) and creation time; past trends are no guarantee of future ones, and it is possible that the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

Similarly for the Wide and Deep model: though Google provides full documentation and a tutorial for it, we still take a pessimistic view of this model published in 2016. Moreover, the previous data also confirms that there is no significant rise in the use of this model.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (fork distribution histograms, binned forks_count, per model)


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name      Mean     STD       Min   25%   50%   75%   Max
Bert            498.65   219.63    0     1     8     43    17940
CNN             106.84   611.97    2     3     8     32    13882
LSTM            48.82    214.22    0     1     2     13    2703
NCF             77       129.91    1     2     3     115   227
ResNet          46.88    221.43    0     0     1     8     2980
Transformer     186.79   1155.87   0     0     4     21    12408
Wide and Deep   16.23    36.80     0     0     1     8     146

Table 4.2: Stars Comparison

Model Name Mean STD Min 25 50 75 maxBert 128214953 585926617 00 00 10 165 46610

CNN 40710 252713617 00 10 40 140 62740LSTM 17793 71956709 00 00 10 50 9680NCF 34333333 58603185 00 05 10 515 1020

ResNet 17442478 93754994 00 00 00 30 14420Transformer 53518797 336103826 00 00 10 60 36370WideDeep 7282051 16364192 00 00 00 25 710

Table 43 Forks Comparison
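Descriptive statistics of this kind can be reproduced with pandas; a minimal sketch on toy data (the DataFrame and its column names here are invented for illustration, not STAMPER's actual code):

```python
import pandas as pd

# Toy data: one row per repository, with its model name and star count.
df = pd.DataFrame({
    "model": ["bert", "bert", "bert", "cnn", "cnn"],
    "stars": [0, 8, 17940, 3, 32],
})

# groupby + describe yields the mean/std/quartile columns shown in the tables.
stats = df.groupby("model")["stars"].describe()
print(stats.loc["cnn", "mean"])   # 17.5
print(stats.loc["bert", "50%"])   # 8.0
```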

The three models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The three models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52), and NCF (34.33).

The three models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88), and LSTM (48.82). The three models whose repositories have the lowest average number of forks are Wide and Deep (7.28), ResNet (17.44), and LSTM (17.79).

Kruskal-Wallis Test. The Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen:

• H0: the seven models' star distributions are the same.

• H1: at least one model's star distribution differs from the others.

from scipy.stats import kruskal

stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                  dfNcf["star"].tolist(), dfResnet["star"].tolist(),
                  dfTransformer["star"].tolist(), dfWideDeep["star"].tolist())
print(stat, p)
# >> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we reject the null hypothesis and conclude that there is significant evidence that the seven models' star distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or Bert model requires a large amount of time and effort, but developers still show their interest in these novel deep learning models by starring and forking them.


Figure 4.9: Stars vs. Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy value, respectively.



Figure 4.10: Stars vs. Development Time


Figure 4.11: Stars vs. Open Issues


Figure 4.12: Stars vs. Entropy Value

Number of Contributors. From Figure 4.9 and a Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, p-value ≤ 0.01). Across all the repositories, the top-3 models by stars per contributor are CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor), and Bert (155.0 stars/contributor).

Model          Percentage of One-Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time. From Figure 4.10 and a Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value ≈ 0). This suggests that the longer a model is developed, the more stars it will have (i.e., the model becomes more popular). The two models with the longest development durations are LSTM and CNN.

Open Issues. From Figure 4.11 and a Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186, p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, this supports the assumption that the more popular a repository becomes, the more issues it will have. We further investigate this correlation in the following section.

Entropy. From Figure 4.12 and a Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.
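The Spearman tests above can be reproduced with scipy; a minimal sketch on invented data (the star and contributor counts below are illustrative only):

```python
from scipy.stats import spearmanr

# Invented star and contributor counts for six repositories.
stars        = [3, 10, 25, 80, 300, 1200]
contributors = [1, 2, 3, 5, 7, 20]

# Spearman's rho is rank-based, so exact magnitudes do not matter,
# only the ordering; here the two rankings agree perfectly.
rho, p = spearmanr(stars, contributors)
print(rho)  # 1.0
```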

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development work is not distributed evenly.

We can examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. Here we introduce an information-theoretic approach to test whether the contributions are even or not.

Entropy. In particular, we compute the entropy H of each repository, defined as:

p_i = c_i / Σ_i c_i    (4.1)

H = −Σ_i p_i log2(p_i)    (4.2)

where i indexes the i-th contributor, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example: its contribution table is summarized in Table 4.5, and its corresponding entropy can then be calculated.

Total = 174 + 36 + 4 = 214    (4.3)

p1 = 174/214,  p2 = 36/214,  p3 = 4/214    (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.7826    (4.5)

name           contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample Contributions to One Repository
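The calculation for Table 4.5 can be checked directly; a minimal sketch of Equations 4.1 and 4.2 in Python:

```python
import math

def collaboration_entropy(contributions):
    """Base-2 entropy of a repository's per-contributor contribution counts."""
    total = sum(contributions)
    return -sum((c / total) * math.log2(c / total) for c in contributions if c > 0)

# Contribution counts from Table 4.5: dragen1860, ash3n, kelvinkoh0308.
print(round(collaboration_entropy([174, 36, 4]), 4))  # 0.7826
```

A single-contributor repository gives entropy 0, and an even two-way split gives entropy 1, which is why low entropy indicates uneven work distribution.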

The resulting distribution of entropy over all the repositories can be used to determine whether a repository is developed unevenly: the lower the entropy, the higher the separation, and hence the more unevenly the work is distributed.

Figure 4.13 shows the distribution of the entropy value for all models. From these figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are developed either mostly by one developer or by a team with an uneven allocation of work.



Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repositories' metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models studied.


Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that shows what percentage of developers interested in deep learning are developing a new project.
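The uniqueness percentages in Figures 4.14 and 4.15 could be computed along these lines; this is a sketch only (STAMPER's actual comparison logic may differ, and the sizes below are invented — GitHub reports repository size in kilobytes):

```python
def percent_unique(origin_size, fork_sizes):
    """Share of forks whose reported size differs from the origin repository's."""
    changed = sum(1 for size in fork_sizes if size != origin_size)
    return 100 * changed / len(fork_sizes)

# Five hypothetical forks; two differ in size from the origin.
print(percent_unique(512, [512, 512, 530, 512, 498]))  # 40.0
```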

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. A more detailed analysis shows at a glance not only that changes are rarely made after forking, but also that the changes that are made tend to be small.



Figure 4.15: Repository Uniqueness Distribution (%)


Figure 4.16: Repository Change Statistics


Most changed repositories differ in size from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them of less concern to people. Second, a model may only be valid for specific types of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, and a large number of forked projects show no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, the problems of software maintenance in these deep learning related repositories are surveyed. The overall purpose of this section is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. In this project we also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation and last-update times, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
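This age computation is straightforward from the created_at and updated_at timestamps that the GitHub API returns; a minimal sketch:

```python
from datetime import datetime

ISO = "%Y-%m-%dT%H:%M:%SZ"  # the timestamp format GitHub uses for created_at / updated_at

def repository_age_days(created_at, updated_at):
    """Age in whole days between a repository's creation and its last update."""
    return (datetime.strptime(updated_at, ISO) - datetime.strptime(created_at, ISO)).days

print(repository_age_days("2018-11-01T00:00:00Z", "2019-01-01T00:00:00Z"))  # 61
```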

Figure 4.17 and Table 4.6 show how development time varies per model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide and Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). Therefore, we hypothesize that for many of the earlier models, developers started using the open-source web community immediately after the first release.


Model          Max (days)  Q3      Median  Q1      Min
Bert           779         229     110     32      0
Transformer    1254        321     142     11      0
Wide and Deep  1107        575     117     0.5     0
ResNet         1360        456.5   120     15      0
NCF            1120        476     216     8       0
LSTM           1812        621.25  315.5   47.25   0
CNN            1385        699.25  483     270.25  0

Table 4.6: Repository Development Time Statistics


Figure 4.17: Development Time Boxplot



Figure 4.18: Development Time vs. Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and a Spearman correlation test, there is a weak correlation between the two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, which are costly to maintain, may have more users and hence more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.299), CNN (3.414), and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model          Mean   STD     25%  50%  75%  Min  Max
Bert           8.299  50.55   0    0    1    0    504
CNN            3.414  35.456  0    0    1    0    1077
LSTM           1.292  4.915   0    0    1    0    69
ResNet         1.791  11.164  0    0    0    0    186
Transformer    1.857  8.608   0    0    1    0    95
Wide and Deep  0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository  Percentage of Repositories Having a Wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
ResNet                    98.817
Transformer               96.97
Wide and Deep             100

Table 4.8: Descriptive statistics on percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All these samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software engineering aspects (popularity, contribution, and maintenance) of deep learning repositories using the collected data.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original codebase after forking.



Figure 4.19: Open Issues vs. Number of Repositories


Chapter 5

Discussion and Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for an in-depth analysis of the construction of models and high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation in the future (for example, users may use the prototxt format to publish their models). In our project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limitations: the present experiment uses a limited number of repositories on GitHub, since a search cannot exceed the 1,000-repository boundary. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples might yield a more precise outcome.
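For concreteness, the sort-strategy workaround can be sketched as URL enumeration against GitHub's search endpoint (the query and page counts here are illustrative; STAMPER's real request logic also handles authentication and rate limits):

```python
BASE = "https://api.github.com/search/repositories"

def search_urls(query, sort, order, pages=10, per_page=100):
    """Page URLs for one sort strategy; 10 pages x 100 items is GitHub's 1,000-result cap."""
    return [f"{BASE}?q={query}&sort={sort}&order={order}&per_page={per_page}&page={p}"
            for p in range(1, pages + 1)]

# Combining ascending and descending star order widens coverage beyond a single cap.
urls = search_urls("bert+tensorflow", "stars", "desc") + search_urls("bert+tensorflow", "stars", "asc")
print(len(urls))  # 20
```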

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to build their own heuristics for data selection; experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a GitHub plugin, allowing researchers


and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media such as Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models via the number of repositories in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine-learning clustering algorithms (e.g., k-means) to high-resolution time-series data from commits.
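As a sketch of this idea, using scikit-learn's KMeans on synthetic weekly commit counts (the data is invented for illustration and is not part of STAMPER):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic weekly commit counts: three repositories with a burst of early
# activity that dies off, and three with steady week-to-week activity.
series = np.array([
    [40, 35, 2, 1, 0, 0],
    [50, 30, 5, 0, 1, 0],
    [45, 38, 1, 2, 0, 1],
    [8, 9, 10, 8, 9, 10],
    [7, 8, 8, 9, 7, 8],
    [9, 10, 9, 8, 10, 9],
])

# k-means separates the two activity patterns into two clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(series)
print(labels[:3], labels[3:])  # the first three repos share one label, the last three the other
```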

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance, and contribution of GitHub deep learning related repositories, and identified factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields, and will serve the needs of people working at the intersection of social media analysis, data visualization, and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr. Ben Swift

7.1.3 Project Description

ML software (both models and datasets) is evolving and changing rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train, and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing, and visualizing the use of various models and datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model and dataset use.

• Develop visualization and analysis techniques for representing trends in their use.

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

PyCharm 2019.1.3 (Professional Edition)


Build #PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
macOS 10.14.6

• Anaconda: jupyter-notebook 6.0.0

Other

• Python 3.7.4, with pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
• Git

Datasets

asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net.json, U-Net pytorch.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

Before You Begin
Prerequisites
Install
Running
Test
High Level Description of all Modules & Datasets
Authors
License

STAMPER

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keeps the Mac awake (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used, listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt.

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run `python3 model_searcher.py` to get keyword-related repositories' metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample Case: in main(), change keywords in terms of interest. The resulting JSON file will be output/bert.JSON.

The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be `updated` or `stars`; order can be `asc` or `desc`.

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

pip3 install --upgrade pip
pip3 install -r requirements.txt

Run `python3 repository_filter.py` to get your code-related repositories, with statistics, in the filtered_repo folder. Run `python3 filtered_repo.py` to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

Popularity: run `python3 visualizations/popularity.py` and get your graphs in visualizations/graphs/popularity.

Maintenance: run `python3 visualizations/maintenance.py` and get your graphs in visualizations/graphs/maintenance.

Contribution: run `python3 visualizations/contribution.py` and get your graphs in visualizations/graphs/contribution.

Multi Correlations: run `python3 visualizations/multi_variable.py` and get your graphs in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not well maintained, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module will record all the unreachable links and write them to the file unreachable_urls.txt.

Usage: change the elements in keywords and run `python3 test.py`. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models).

In the constructor Model, we store all unfiltered_data, filtered_data, and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with parameters model name and repository-metadata subfolder. Then you can use this object with its relative data easily (from Model import bert and use bert as you go along).

Customize Keywords: in the module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: model_searcher.py, forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1 After Data Collection

output

asc_by_star

cnn tensorflowjson

$

lstm tensorflowjson

asc_general

bertjson

cnnjson

lstmjson

ncfjson

resnetjson

transformerjson

$

wide deepjson

by_update_time

123456789

10

11

12

13

14

bert tensorflowjson

cnn tensorflowjson

lstm tensorflowjson

ncf tensorflowjson

resnet tensorflowjson

transformer tensorflowjson

$

wide deep tensorflowjson

desc_by_star

bert tensorflowjson

cnn tensorflowjson

lstm tensorflowjson

ncf tensorflowjson

resnet tensorflowjson

transformer tensorflowjson

$

wide deep tensorflowjson

desc_general

bertjson

cnnjson

lstmjson

ncfjson

resnetjson

transformerjson

$

wide deepjson

$

pytorch_models

AlexNetjson

DCGANjson

Densenetjson

FCN-ResNet101json

GoogleNetjson

HarDNetjson

Inception_v3json

MobileNet v2json

PGANjson

ResNetjson

ResNet101json

ResNext WSLjson

ResNextjson

RoBERTajson

SSDjson

ShuffleNet v2json

SqueezeNetjson

Tacotron 2json

Transformerjson

U-Net pytorchjson

U-Netjson

WaveGlowjson

Wide ResNetjson

fairseqjson

$

vgg_netsjson

2. After Repository Search

forked_timestamp/
  bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv,
  resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

Generated G

raphs

3. After Data Selection (Optional)

filtered_repo/
    bert.json
    pytorch_model_filtering/
        Densenet.json
        FCN-ResNet101.json
        GoogleNet.json
        MobileNet v2.json
        ResNet101.json
        ResNext.json
        ShuffleNet v2.json
        SqueezeNet.json
        Tacotron 2.json
        Wide ResNet.json
        vgg_nets.json
    tensorflow_model_filtering/
        bert.json
        lstm.json
        ncf.json
        resnet.json
        transformer.json
        wide deep.json


graphs/
    contribution/
        change_to_pdf.bash
        entropy_distribution.svg
        entropy_dots.svg
        lines_changed_boxs.svg
        lines_changed_hists.svg
        unique_percentage_distribution.svg
        uniqueness_chart.svg
    maintenance/
        devTime_boxplot.svg
        issues_distribution.svg
        wiki_yn.svg
    multi_variable/
        dev_t_to_open_issues.svg
        multi_correlation.svg
        star_to_contributors.svg
        star_to_dev_t.svg
        star_to_entropy.svg
        star_to_open_issues.svg
    popularity/
        accumulated_popularity.svg
        creation_repository_trend_total.svg
        creation_with_fork_timeline.svg
        fork_distribution.svg
        popularity_dot.svg
        popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu
Stamper image from http://bestpriceforrubberstamps.com (License: Free for personal use only)

Bibliography

[Git a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[Git b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[Git c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[Git d] GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)


20 STAMPER in Action

Figure 4.2: Star Sort Menu [Git a]

a repository to receive notifications for the new pull requests or issues that are created. Watchers indicate how much interest the GitHub community gives to a repository.

Figure 4.1: Repository Watching [Git b]

• Stars: Starring a repository makes it easy to keep track of a repository the user is interested in. The starred repository will appear on the user's own host domain (https://[hostname]/stars). Stars are another metric for measuring popularity within the GitHub community, and thus GitHub has a ranking system based on the number of stars a repository has [Git c].

• Forks: Forks are created when the user would like to make a copy of an original repository. The user could fork a repository to suggest changes, or use it as a basis for a new project.

Based on the data we collected in the data collection stage, we extract the watchers_count, stargazers_count and forks attributes and test whether those attributes are correlated. In Table 4.1 we summarize 86,712 repositories (including their forked repositories) with their related popularity metrics and draw a scatter plot, as shown in Figure 4.3.
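A sketch of how those three attributes can be pulled from one collected repository record. The field names follow GitHub's REST API v3 repository object; the example record is illustrative. Note that the REST API reports star counts in `watchers_count` as well, which is consistent with the perfect star–watcher correlation found below:

```python
def popularity_metrics(repo):
    """Extract the three popularity attributes used in Table 4.1
    from one GitHub repository record (REST API v3 field names)."""
    return {
        "star": repo["stargazers_count"],
        "forks_count": repo["forks_count"],
        "watchers_count": repo["watchers_count"],
    }


# Illustrative record matching the first row of Table 4.1
record = {"stargazers_count": 17940, "forks_count": 4661, "watchers_count": 17940}
print(popularity_metrics(record))
```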

§4.1 Popularity of Deep Learning Models in GitHub

Figure 4.3: Popularity Metric

star    forks_count    watchers_count    model name
17940   4661           17940             Bert
12405   3637           12405             Bert
5263    1056           5263              Bert

Table 4.1: Popularity metric for repositories

The scatter plot suggests a definite positive correlation between the number of stars, number of forks and number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, and thus we should consider a correlation measure that does not assume normality.


Spearman Correlation Coefficient

Definition: The Spearman rank-order correlation is a statistical procedure designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman, 2011]. This statistical method allows us to test for a rank-order relationship between two numerical ordinal variables associated with a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

• Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

• Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables on the testing dataset.

Set α = 0.05. Since p1, p2 and p3 are all less than α, and since the calculation above also shows a strong positive correlation, with values of coef1 = 0.875, coef2 = 1.0 and coef3 = 0.875 respectively,

this means it is very unlikely that the testing data are uncorrelated (95% confidence), and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of the report, we treat the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and long short-term memory network (LSTM), shown in the figures below, look like two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creation and forks. Aside from these models with a longer history, BERT and ResNet are two of the rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as already described in the background section.

The model development community recently saw the release of multiple powerful frameworks that are treated as baselines for building models. However, for many new models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.


Figure 4.4: Repositories with Forks — accumulated number of repositories created (including forks) per model, 2015–2019.

Figure 4.5: Repositories without Forks — accumulated number of repositories created per model, 2015–2019.


Figure 4.6: Repository Trend in GitHub For Each Model — per-model repository counts over time, October 2015 to October 2019.


Figure 4.7: Creation Time vs Stars — scatter of repository creation time against number of stars, per model.

A fork is another copy of a repository. The forked repository could either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see there is a considerable difference between the total number of repositories created including forks and the total number created without forks. We find that most of the repositories related to deep learning models are not original. This indicates that a considerable number of developers remain in a studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, different from the previous summarizing method, we study the popularity of deep learning models as it changes over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the highest number of repositories created. Let us examine this thought using the data: in


2017 the number of created repositories gradually increased for CNN, and in 2018–2019 the creation trend continued to rise to a higher level, where it remains.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have some of the most important and significant communities in the deep learning field. These networks are essential in both computer vision and NLP, and they form an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer. As the title indicates, ResNet and Transformer usage has improved significantly in the recent two years. Unlike earlier structures such as CNN, both of them modify the original structure and significantly improve results in computer vision and translation tasks.

Rising Star: BERT

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends, as depicted in the graph, support the inference that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with the time the model came into existence, but our data tell a different story.

Although its paper was published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e. stars) and creation time; past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide and Deep model is similar: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the previous data also confirm that there is no significant rise in the use of this model.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development — per-model histograms of fork counts.


4.1.4 RQ2: How popularity varies per model

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name      Mean     STD      Min   25%   50%   75%    Max
Bert            498.65   2196.3   0     1     8     43     17940
CNN             106.84   611.97   2     3     8     32     13882
LSTM            48.82    214.22   0     1     2     13     2703
NCF             77.00    129.91   1     2     3     11.5   227
ResNet          46.88    221.43   0     0     1     8      2980
Transformer     186.79   1155.87  0     0     4     21     12408
Wide and Deep   16.23    36.80    0     0     1     8      146

Table 4.2: Stars Comparison

Model Name      Mean        STD         Min   25%   50%   75%    Max
Bert            128.214953  585.926617  0.0   0.0   1.0   16.5   4661.0
CNN             40.710      252.713617  0.0   1.0   4.0   14.0   6274.0
LSTM            17.793      71.956709   0.0   0.0   1.0   5.0    968.0
NCF             34.333333   58.603185   0.0   0.5   1.0   51.5   102.0
ResNet          17.442478   93.754994   0.0   0.0   0.0   3.0    1442.0
Transformer     53.518797   336.103826  0.0   0.0   1.0   6.0    3637.0
Wide and Deep   7.282051    16.364192   0.0   0.0   0.0   2.5    71.0

Table 4.3: Forks Comparison
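Per-model summaries of this kind (mean, standard deviation, quartiles) are straightforward to reproduce from the raw star or fork counts. A minimal pure-Python sketch, with illustrative values rather than the thesis data:

```python
import statistics


def summarize(values):
    """Mean / sample STD / min / median / max, the kind of row
    reported in Tables 4.2 and 4.3 for one model."""
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "50%": statistics.median(values),
        "max": max(values),
    }


# Illustrative star counts for one model (not the thesis data):
# a few small repositories plus one heavily starred outlier
stars = [0, 1, 8, 43, 17940]
print(summarize(stars))
```

The large gap between the median (8) and the mean (3598.4) in this toy example mirrors the heavy-tailed distributions in the tables above, where a few very popular repositories dominate each model's mean.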

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52) and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), ResNet (17.44) and LSTM (17.79).

Kruskal–Wallis Test: the Kruskal–Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The 7 models' distributions are the same.

• H1: The 7 models' distributions are different.

from scipy.stats import kruskal
stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                  dfNcf["star"].tolist(), dfResnet["star"].tolist(), dfTransformer["star"].tolist(),
                  dfWideDeep["star"].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the previous popularity exploration based on the number of repositories existing in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or Bert model requires a large amount of time and effort, yet developers still show their interest in those novel deep learning models.

Figure 4.9: Star vs Contributors — scatter of the number of contributors against the number of stars, per model.

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.


Figure 4.10: Star vs Development Time — scatter of development duration against the number of stars, per model.

Figure 4.11: Star vs Open Issues — scatter of the number of open issues against the number of stars, per model.

Figure 4.12: Star vs Entropy Value — scatter of the entropy value against the number of stars, per model.

Number of Contributors: From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Among all the repositories, the top-3 repositories with the most stars per contributor are from the models CNN (16875 stars/contributor), Transformer (1551 stars/contributor) and Bert (1550 stars/contributor).

Model           Percentage of One-Contributor Development (%)
Bert            74.53
CNN             83.3
LSTM            85.9
NCF             100
ResNet          90.26
Transformer     81.20
Wide and Deep   89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time: From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value = 0). This suggests that the longer a model develops, the more stars it will have (i.e. the model becomes more popular). The top-2 repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues: From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesize that the more popular a repository becomes, the more issues it will have. We further investigate this correlation in the following section.

Entropy: From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of a new feature, the entropy value, on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly in developing a deep learning model. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. Here we introduce an information-theoretic approach to test whether their contributions are even or not.

Entropy: In particular, we compute the entropy H of each repository, defined as

p_i = c_i / Σ_i c_i                                  (4.1)

H = − Σ_i p_i log₂(p_i)                              (4.2)

where i represents the i-th contributor, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example, the contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214                           (4.3)

p_1 = 174/214,  p_2 = 36/214,  p_3 = 4/214           (4.4)

H(repository) = −(174/214 · log₂(174/214) + 36/214 · log₂(36/214) + 4/214 · log₂(4/214)) ≈ 0.7826    (4.5)

name            contribution
dragen1860      174
ash3n           36
kelvinkoh0308   4

Table 4.5: Sample Contributions to One Repository
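The entropy defined by Equations 4.1 and 4.2 can be checked with a few lines of Python; for the contribution counts in Table 4.5 it evaluates to ≈ 0.7826:

```python
import math


def collaboration_entropy(contributions):
    """H = -sum_i p_i * log2(p_i), where p_i is contributor i's share
    of the repository's total contributions (Equations 4.1-4.2)."""
    total = sum(contributions)
    return -sum((c / total) * math.log2(c / total) for c in contributions if c > 0)


# Contribution counts from Table 4.5: dragen1860, ash3n, kelvinkoh0308
h = collaboration_entropy([174, 36, 4])
print(round(h, 4))  # 0.7826
```

A single-contributor repository gives H = 0, the minimum, while k contributors with perfectly even contributions give the maximum H = log₂(k); low entropy therefore indicates unevenly distributed work.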

The resulting distribution of entropy over all the repositories can be used to determine whether repositories are developed unevenly. The lower the entropy, the higher the phase separation, which indicates more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy value for all models. From those figures we can see that most of the repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


Figure 4.13: Collaboration Entropy — per-model histograms of repository entropy values.


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models.

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot).

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was to provide a summarized view that shows what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15. Most of the model repositories are not changed after forking. Looking more closely, we can see at a glance that not only are changes rarely made after forking, but most of the changed repositories also differ only slightly from the original.
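A sketch of how a fork can be compared against its parent using repository metadata. The `size` (in KB) and `pushed_at` fields are from GitHub's REST API repository object; treating identical size and push date as "no change" is a simplifying heuristic for illustration, not STAMPER's exact definition of uniqueness:

```python
def fork_change_stats(fork, parent):
    """Summarize how a fork differs from its parent repository,
    using GitHub REST API metadata fields (`size` in KB, `pushed_at`)."""
    size_diff = fork["size"] - parent["size"]
    changed = size_diff != 0 or fork["pushed_at"] != parent["pushed_at"]
    return {"size_diff": size_diff, "changed": changed}


parent = {"size": 1200, "pushed_at": "2019-05-01T00:00:00Z"}
fork = {"size": 1200, "pushed_at": "2019-05-01T00:00:00Z"}
print(fork_change_stats(fork, parent))
```

Aggregating `size_diff` over all forks of a model yields the kind of change histogram shown in Figure 4.16, and the fraction of forks with `changed == True` corresponds to the uniqueness percentages in Figures 4.14 and 4.15.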


Figure 4.15: Repository Uniqueness Distribution (%) — per-model histograms of the percentage of forked repositories unique from the origin.

[Figure 4.16 chart: per-model histograms of mean lines changed (binned from -2500 to 2500) against count of records.]

Figure 4.16: Repository Change Statistics


repositories have a size difference from the original repository of between 0 and 100 bytes, as depicted in Figure 4.16.
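The uniqueness measure behind Figures 4.14 and 4.15 reduces to a simple ratio. A minimal sketch (the per-fork line counts are hypothetical stand-ins for the diff statistics collected from the forks):

```python
# Hypothetical per-fork statistics: total lines changed relative to the
# original repository (0 means the fork is identical to upstream).
lines_changed = [0, 0, 0, 12, 0, 340, 0, 5]

# A fork is "unique" if it differs at all from the original codebase.
unchanged = sum(1 for n in lines_changed if n == 0)
unique_percent = 100 * (len(lines_changed) - unchanged) / len(lines_changed)
print(f"{unique_percent:.1f}% of forks differ from the original")  # 37.5%
```

The same ratio computed per model yields the distribution shown in the boxplot.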

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them of less concern to people. Second, a model may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the forked repositories' development size is quite imbalanced, and a large number of forked projects show no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, the problems of software maintenance in these deep learning related repositories were surveyed. The overall purpose of this section was to explore three factors related to maintenance: development time, number of open issues, and the wiki page for each repository. In this project we also explore whether the age of the project affects software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from the repository creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at) (4.6)
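The age computation above can be sketched directly from the ISO-8601 timestamps returned by the GitHub API (the field names `created_at` and `updated_at` come from the API; the sample values below are hypothetical):

```python
from datetime import datetime

def repo_age_days(created_at: str, updated_at: str) -> float:
    """Age of a repository in days: T(updated_at) - T(created_at).

    Both arguments are ISO-8601 timestamps as returned by the GitHub API,
    e.g. "2018-10-17T22:57:34Z".
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    created = datetime.strptime(created_at, fmt)
    updated = datetime.strptime(updated_at, fmt)
    return (updated - created).total_seconds() / 86400

# Hypothetical repository metadata
print(repo_age_days("2018-10-17T00:00:00Z", "2019-02-04T00:00:00Z"))  # 110.0
```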

Figure 4.17 and Table 4.6 show how development time varies depending on the model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). Therefore, we hypothesize that for many of the earlier models, developers started using the open-source web community immediately after the first release.
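The Kruskal-Wallis test reported above can be reproduced with SciPy's `kruskal`; the per-model development-time samples below are hypothetical stand-ins for the collected data:

```python
from scipy.stats import kruskal

# Hypothetical development times (in days) for three models; in the study,
# each sample holds the ages of all repositories matching one model keyword.
bert_days = [110, 32, 229, 0, 779]
lstm_days = [315, 47, 621, 1812, 200]
cnn_days = [483, 270, 699, 1385, 5]

# H0: all samples are drawn from the same distribution.
stat, p = kruskal(bert_days, lstm_days, cnn_days)
print(f"H = {stat:.3f}, p = {p:.3f}")
if p <= 0.05:
    print("Reject H0: development time differs across models")
```

A p-value at or below the chosen significance level (0.05 in the report) rejects the hypothesis that all models share one development-time distribution.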


Model        Max of days  Q3 of days  Median of days  Q1 of days  Min of days
Bert         779          229         110             32          0
Transformer  1254         321         142             11          0
Wide deep    1107         575         117             0.5         0
ResNet       1360         456.5       120             15          0
NCF          1120         476         216             8           0
LSTM         1812         621.25      315.5           47.25       0
CNN          1385         699.25      483             270.25      0

Table 4.6: Repository Development Time Statistics

[Figure 4.17 chart: boxplot of development time in days (0-2000) for each model.]

Figure 4.17: Development Time Boxplot


[Figure 4.18 chart: scatter plot of open_issues (0-2000) against develop_duration (0-1100 days), colored by model.]

Figure 4.18: Development Time vs. Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As suggested visually by the figure and by a Spearman correlation test, there is a weak positive correlation between the two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, due to the high cost of maintenance, may have more users and more issues related to them.
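A sketch of this correlation test, using hypothetical `develop_duration` and `open_issues` values per repository in place of the collected metadata:

```python
from scipy.stats import spearmanr

# Hypothetical (development days, open issues) pairs per repository; in the
# study these are the develop_duration and open_issues fields of the metadata.
develop_duration = [10, 120, 300, 450, 700, 900, 1100]
open_issues = [0, 1, 0, 3, 8, 5, 20]

coef, p = spearmanr(develop_duration, open_issues)
print(coef, p)  # a positive coefficient: older repos tend to have more issues
```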

Specifically, as depicted in Table 4.7, the three models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository  Percentage of repositories having a Wiki (%)
Bert         97.17
CNN          98.498
LSTM         98.799
NCF          98.864
ResNet       98.817
Transformer  96.97
Wide deep    100

Table 4.8: Descriptive statistics on the percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data we collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.
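The wiki statistic can be computed straight from the collected repository metadata: the GitHub API exposes a boolean `has_wiki` field per repository. A minimal sketch with hypothetical records:

```python
# Each record is a repository's metadata as returned by the GitHub API;
# only the boolean "has_wiki" field is needed for this statistic.
repos = [
    {"name": "bert-fork-1", "has_wiki": True},
    {"name": "bert-fork-2", "has_wiki": True},
    {"name": "bert-fork-3", "has_wiki": False},
]

wiki_percent = 100 * sum(r["has_wiki"] for r in repos) / len(repos)
print(f"{wiki_percent:.2f}% of repositories have a wiki")  # 66.67%
```

Computed per model over all collected repositories, this yields the percentages in Table 4.8.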

4.4 Summary

In this section, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.


[Figure 4.19 chart: per-model histograms of open_issues (binned 0-100) against count of records.]

Figure 4.19: Open Issues vs. Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitation and Improvement

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for an in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation in the future (for example, users may use the prototxt format to publish their models). In our project we only focused on deep learning models constructed using Python. Our findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, since search results cannot exceed the 1000 originally created repositories boundary. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories in GitHub. Other, more stratified samples might produce a more precise outcome.

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to know how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly, with different models released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild; this program could then provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to apply their own heuristics in data selection: experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or becoming a plugin for GitHub, allowing researchers



and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models via the number of repositories that exist in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., k-means) to high-resolution time series data from commits.
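Such a clustering could be sketched with scikit-learn's KMeans (a hypothetical illustration, not part of STAMPER): each repository is represented by its weekly commit counts, and k-means groups repositories with similar activity profiles.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical weekly commit counts for six repositories (4 weeks each):
# rows = repositories, columns = commits per week.
activity = np.array([
    [50, 40, 35, 30],   # busy, cooling down
    [45, 42, 38, 33],
    [2, 1, 0, 0],       # nearly dormant
    [1, 0, 1, 0],
    [5, 20, 40, 60],    # ramping up
    [3, 15, 45, 55],
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(activity)
print(kmeans.labels_)  # repositories with similar commit trends share a label
```

Real commit timestamps would first be binned into such per-repository time series before clustering.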

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis in GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub deep learning related repositories, and identified the factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to help users gain a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields and serve the needs of people working at the intersection of social media analysis, data visualization and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving and changing rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

bull PyCharm

PyCharm 2019.1.3 (Professional Edition)


Build #PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
macOS 10.14.6

bull Anaconda

– jupyter-notebook 6.0.0

Other

- Python 3.7.4
  -- pandas==0.22.0  -- numpy==1.14.0
  -- statistics==1.0.3.5  -- ratelimit==2.2.1
  -- requests  -- altair  -- matplotlib==2.2.2
  -- selenium
- Git

Datasets

asc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_by_star: bert tensorflow.json, lstm tensorflow.json, wide deep tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, cnn tensorflow.json, ncf tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json

by_update_time: bert tensorflow.json, lstm tensorflow.json, resnet tensorflow.json, wide deep tensorflow.json, cnn tensorflow.json, ncf tensorflow.json, transformer tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json

forked_timestamp: bert tensorflow.csv, lstm tensorflow.csv, resnet tensorflow.csv, wide deep tensorflow.csv, cnn tensorflow.csv, ncf tensorflow.csv, transformer tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Table of Contents

- Before You Begin
- Prerequisites
- Install
- Running
- Test
- High Level Description of all Modules & Datasets
- Authors
- License

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing, editing or executing our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keeps the Mac awake with this useful app (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git (https://git-scm.com/downloads) and a GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used, listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

    pip3 install --upgrade pip
    pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run `python3 model_searcher.py` to get keyword-related repository metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample case: in main(), change `keywords` to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: `sort` can be `updated` or `stars`, and `order` can be `asc` or `desc`.

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

Run `python3 repository_filter.py` to get your code-related repositories with statistics in the filtered_repo folder. Run `python3 filtered_repo.py` to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run `python3 visualizations/popularity.py` and get your graphs in visualizations/graphs/popularity
- Maintenance: run `python3 visualizations/maintenance.py` and get your graphs in visualizations/graphs/maintenance
- Contribution: run `python3 visualizations/contribution.py` and get your graphs in visualizations/graphs/contribution
- Multi Correlations: run `python3 visualizations/multi_variable.py` and get your graphs in visualizations/graphs/multi_variable

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them to the file unreachable_urls.txt.

Usage: change the elements in `keywords` and run `python3 test.py`. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. `bert = Model("bert tensorflow", "desc_by_star")`, with parameters: model name and repository metadata subfolder. Then you can use this object with its related data easily (`from Model import bert` and use `bert` as you go along).

Customize keywords: in the module model_keyword.py, import your instantiation (e.g. `lstm`) and call add_keywords, e.g.:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization (Altair is used to draw the graphs): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

Experiment Datasets Collected

1. After Data Collection, the output folder contains:

- asc_by_star: cnn tensorflow.json, lstm tensorflow.json
- asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search:

- forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional):

- filtered_repo: bert.json
- filtered_repo/pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
- filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs:

- graphs/contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
- graphs/maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
- graphs/multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
- graphs/popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu. Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only).

Bibliography

a. GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

b. GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

c. GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

d. GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
    • Deep learning models
    • Summarized Timeline
  • Public Code Repositories
    • Web-based hosting service
    • Measuring Popularity From GitHub
    • Extracting Messy Data in the Wild
    • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    • RQ2: How does popularity vary per model?
    • RQ3: Does the popularity of models relate to other features?
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commit Timestamps
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Software
      • Other
      • Datasets
  • Appendix 4: README

4.1 Popularity of Deep Learning Models in GitHub

Figure 4.3: Popularity Metric

star   forks_count  watchers_count  model name
17940  4661         17940           Bert
12405  3637         12405           Bert
5263   1056         5263            Bert

Table 4.1: Popularity metrics for repositories

The scatter plot suggests a definite positive correlation between the number of stars, number of forks and number of watchers. However, there is evidence of non-linearity for forks_count values close to zero, and thus we should consider tests that do not assume normality.


Spearman Correlation Coefficient

Definition

The Spearman rank-order correlation is a statistical procedure that is designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman 2011]. This statistical method allows us to test for a rank-order relationship between two numerical ordinal variables associated with a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork, and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

from scipy.stats import spearmanr
coef1, p1 = spearmanr(star, forks)
print(coef1, p1)
>> 0.8752903811064278 0.0

• Star vs Watcher

coef2, p2 = spearmanr(star, watchers)
print(coef2, p2)
>> 1.0 0.0

• Fork vs Watcher

coef3, p3 = spearmanr(forks, watchers)
print(coef3, p3)
>> 0.8752903811064278 0.0

Running the code above calculates the Spearman's correlation coefficient between the three variables in the testing dataset.

We set α = 0.05; p1, p2, and p3 are all less than α, and the calculation above also shows strong positive correlations, with values of coef1 = 0.875, coef2 = 1.0, and coef3 = 0.875 respectively.

This means that it is very unlikely (at 95% confidence) that the testing data are uncorrelated, and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of this report, we treat the number of stars as the proxy for a project's popularity.


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, appear to be two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creations and forks. Aside from these models with a longer history, BERT and ResNet are two rising stars in the model competition. They arrive with significant improvements in architecture design and performance, as already described in the background section.

The model development community has recently seen the release of multiple powerful models that are treated as baselines for building new ones. However, for many new models, such as the Wide and Deep model and the NCF model, usage has not grown in abundance.


[Bar chart: Number of Repositories Created With Forks (Accumulated), per model, 2015-2019.]

Figure 4.4: Repositories with Forks

[Bar chart: Number of Repositories Created (Accumulated), per model, 2015-2019.]

Figure 4.5: Repositories without Forks


[Small-multiple histograms: repository counts over time (October 2015 - October 2019) for each model.]

Figure 4.6: Repository Trend in GitHub for Each Model


[Scatter plot: creation time vs number of stars, coloured by model.]

Figure 4.7: Creation Time vs Stars

A fork is another copy of a repository. The forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of created repositories when forks are counted and the total number of originally created repositories. Most of the repositories related to deep learning models are not original, which indicates that a considerable number of developers remain in a studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories that currently exist in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, differently from the previous summarising method, we study the popularity of deep learning models as it changes over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the largest number of repositories created. Let us examine this observation using the data. In 2017, the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued rising to a higher level, which persists until now.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have two of the largest and most significant communities in the deep learning field. These networks are essential in both computer vision and NLP, where they hold an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer. As the title indicates, ResNet and Transformer usage has improved significantly in the recent two years. Differing from earlier structures such as CNN, both conduct modifications of the original structure and significantly improve results in computer vision and translation tasks.

Rising star: BERT

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends depicted in the graph support the inference that deep learning models are proliferating fast, with innovative developments. Currently, there is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with how long the model has existed, but our data tells a different story.

Published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e., stars) and creation time; past trends are no guarantee of future ones, and it is possible that the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide and Deep model, also published in 2016, is similar: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of it. Moreover, the previous data also confirms that there is no significant rise in the use of this model.

[Histograms: distribution of forks_count (binned) per model.]

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name      Mean     STD      Min   25%   50%   75%   Max
Bert            498.65   2196.3   0     1     8     43    17940
CNN             106.84   611.97   2     3     8     32    13882
LSTM            48.82    214.22   0     1     2     13    2703
NCF             77       129.91   1     2     3     115   227
ResNet          46.88    221.43   0     0     1     8     2980
Transformer     186.79   1155.87  0     0     4     21    12408
Wide and Deep   16.23    36.80    0     0     1     8     146

Table 4.2: Stars Comparison

Model Name      Mean     STD      Min   25%   50%   75%   Max
Bert            128.21   585.93   0.0   0.0   1.0   16.5  4661.0
CNN             40.71    252.71   0.0   1.0   4.0   14.0  6274.0
LSTM            17.79    71.96    0.0   0.0   1.0   5.0   968.0
NCF             34.33    58.60    0.0   0.5   1.0   51.5  102.0
ResNet          17.44    93.75    0.0   0.0   0.0   3.0   1442.0
Transformer     53.52    336.10   0.0   0.0   1.0   6.0   3637.0
Wide and Deep   7.28     16.36    0.0   0.0   0.0   2.5   71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52), and NCF (34.33).

The three models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88), and LSTM (48.82). The three models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79), and ResNet (17.44).
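Per-model summary statistics like those in Tables 4.2 and 4.3 can be reproduced with pandas. The sketch below uses a small hypothetical DataFrame (the `model` and `star` column names are illustrative, not the actual collected dataset):

```python
import pandas as pd

# Hypothetical sample of the collected metadata: one row per repository,
# with the model keyword it matched and its star count.
df = pd.DataFrame({
    "model": ["bert", "bert", "cnn", "cnn", "cnn"],
    "star":  [17940, 8, 13882, 8, 3],
})

# Per-model mean, std, quartiles, min, and max, as reported in Table 4.2.
stats = df.groupby("model")["star"].describe()
print(stats[["mean", "std", "min", "25%", "50%", "75%", "max"]])
```

Running `describe()` on the real dataset, grouped by model keyword, yields exactly the columns shown in the tables.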

Kruskal-Wallis Test: the Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The 7 models' distributions are the same.

• H1: The 7 models' distributions are different.

from scipy.stats import kruskal
stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(), dfLstm['star'].tolist(),
                  dfNcf['star'].tolist(), dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                  dfWideDeep['star'].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories existing in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or BERT model may require a large amount of time and effort, while developers still show their interest in those novel deep learning models by starring and forking them.

[Scatter plot: number_of_contributors vs stargazers_count, coloured by model.]

Figure 4.9: Star vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.


[Scatter plot: develop_duration vs stargazers_count, coloured by model.]

Figure 4.10: Star vs Development Time

[Scatter plot: open_issues vs stargazers_count, coloured by model.]

Figure 4.11: Star vs Open Issues

[Scatter plot: entropy vs stargazers_count, coloured by model.]

Figure 4.12: Star vs Entropy Value

Number of Contributors: from Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, p-value ≤ 0.01). Across all the repositories, the top-3 models with the most stars per contributor are CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor), and Bert (155.0 stars/contributor).

Model           Percentage of One-Contributor Development (%)
Bert            74.53
CNN             83.3
LSTM            85.9
NCF             100
ResNet          90.26
Transformer     81.20
Wide and Deep   89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time: from Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value = 0). This suggests that the longer a model develops, the more stars it will have (i.e., the model becomes more popular). The two models with the longest development durations are LSTM and CNN.

Open Issues: from Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186, p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesise that the more popular a repository becomes, the more issues it will have. We further investigate this correlation in a following section.

Entropy: from Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project, we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution in a repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.
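The share reported in Table 4.4 boils down to counting single-contributor repositories. A minimal sketch, assuming a hypothetical list holding the number of contributors for each repository of a model:

```python
def one_contributor_percentage(contributor_counts):
    """Percentage of repositories developed by a single contributor (cf. Table 4.4)."""
    solo = sum(1 for n in contributor_counts if n == 1)
    return 100.0 * solo / len(contributor_counts)

# Hypothetical data: 3 of 4 repositories have exactly one contributor.
print(one_contributor_percentage([1, 1, 3, 1]))  # 75.0
```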


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether their contributions are even or not.

Entropy: in particular, we compute the entropy H of each repository, defined as

    p_i = c_i / Σ_i c_i    (4.1)

    H = −Σ_i p_i log2(p_i)    (4.2)

where i indexes the contributors, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example, its contributions are summarised in Table 4.5, and the corresponding entropy can then be calculated:

    Total = 174 + 36 + 4 = 214    (4.3)

    p1 = 174/214,  p2 = 36/214,  p3 = 4/214    (4.4)

    H(repository) = −(174/214 log2(174/214) + 36/214 log2(36/214) + 4/214 log2(4/214)) ≈ 0.7826    (4.5)

name            contribution
dragen1860      174
ash3n           36
kelvinkoh0308   4

Table 4.5: Sample Contributions to One Repository
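Equations 4.1 and 4.2 can be sketched directly in Python; applying the function to the contributions in Table 4.5 reproduces the worked example above:

```python
import math

def collaboration_entropy(contributions):
    """Entropy H of a repository's contribution distribution (Equations 4.1-4.2)."""
    total = sum(contributions)
    probs = [c / total for c in contributions]  # Equation 4.1
    return -sum(p * math.log2(p) for p in probs if p > 0)  # Equation 4.2

# Contributions from Table 4.5: dragen1860 (174), ash3n (36), kelvinkoh0308 (4).
print(round(collaboration_entropy([174, 36, 4]), 4))
```

A single-contributor repository yields H = 0, the minimum, while an even n-way split yields log2(n), the maximum; lower values therefore indicate more uneven work distribution.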

The resulting distribution of entropy over all the repositories can be used to determine whether repositories are developed unevenly. The lower the entropy, the higher the separation, which corresponds to more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy value for all models. From these figures, we can see that most of the repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


[Histograms: entropy (binned) distribution per model.]

Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repositories' metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models studied.
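One minimal way to estimate such uniqueness from metadata alone is to compare each fork's reported size against the original's. The helper below is a hypothetical sketch of that idea, not STAMPER's actual implementation:

```python
def unique_fork_percentage(original_size, fork_sizes):
    """Percent of forks whose reported size differs from the original's,
    used here as a rough proxy for 'the fork was changed'."""
    if not fork_sizes:
        return 0.0
    changed = sum(1 for size in fork_sizes if size != original_size)
    return 100.0 * changed / len(fork_sizes)

# Hypothetical sizes (KB): two forks untouched, two modified.
print(unique_fork_percentage(1024, [1024, 1024, 1300, 980]))  # 50.0
```

Size equality is only a proxy (a fork can change while keeping the same reported size), but it is cheap to compute from repository metadata alone.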

[Boxplot: percentage of forked repositories unique from origin, per model.]

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was to provide a summarised view of what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. A more detailed analysis shows at a glance not only that changes are rarely made after forking, but also that most changed


[Histograms: uniqueness percentage (binned) per model.]

Figure 4.15: Repository Uniqueness Distribution (%)

[Histograms: repository change in size (binned means) per model.]

Figure 4.16: Repository Change Statistics


repositories differ from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesise two main reasons for this result. First, new models have not been released for long; the lack of tutorials and attention makes them of less concern to people. Second, a model may only be valid for a specific type of data, making it less robust and generalisable, and less suited to developers' needs.

We conclude that the development of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey the problems of software maintenance in these deep learning related repositories. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report, we calculate the age of each repository from its creation and last-update times, as depicted in the equation below:

    age = T(updated_at) − T(created_at)    (4.6)

Figure 4.17 and Table 4.6 show how development time varies depending on the model. The median number of development days varies as follows: Bert (110 days), Transformer (142 days), Wide and Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). Therefore, we hypothesise that many of the earlier models started being used in the open-source community immediately after their first release.


Model        Max of days   Q3 of days   Median of days   Q1 of days   Min of days
Bert         779           229          110              32           0
Transformer  1254          321          142              11           0
Wide deep    1107          575          117              0.5          0
ResNet       1360          456.5        120              1.5          0
NCF          1120          476          216              8            0
LSTM         1812          621.25       315.5            47.25        0
CNN          1385          699.25       483              270.25       0

Table 4.6: Repository Development Time Statistics

[Boxplot: development time (days) per model.]

Figure 4.17: Development Time Boxplot


[Scatter plot: open_issues vs develop_duration, coloured by model.]

Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and the Spearman correlation test, there is a moderate correlation between the two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, which may have more users and a higher cost of maintenance, tend to have more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414), and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model        Mean    Std     25%   50%   75%   Min   Max
Bert         8.299   50.55   0     0     1     0     504
CNN          3.414   35.456  0     0     1     0     1077
LSTM         1.292   4.915   0     0     1     0     69
ResNet       1.791   11.164  0     0     0     0     186
Transformer  1.857   8.608   0     0     1     0     95
Wide Deep    0.231   0.742   0     0     0     0     4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository   Percentage of repositories having a Wiki (%)
Bert                       97.17
CNN                        98.498
LSTM                       98.799
NCF                        98.864
ResNet                     98.817
Transformer                96.97
Wide deep                  100

Table 4.8: Descriptive statistics on percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects (popularity, contribution, and maintenance) of deep learning repositories using the collected data.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.


[Histograms: open_issues (binned) distribution per model.]

Figure 4.19: Open Issues vs Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmarks of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for an in-depth analysis of the construction of models and high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation in the future (for example, users may publish their models in the prototxt format). In this project, we only focused on deep learning models constructed using Python. The findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, since search results cannot exceed the 1000-repository boundary. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples might yield a more precise outcome.

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to know how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors to this project and explore other models in the wild, so the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to develop their own heuristics for data selection; experts can easily change the API searching in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a plugin for GitHub, allowing researchers



and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project, we investigated and examined the popularity of deep learning models via the number of repositories existing in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., K-Means) to high-resolution time series data from commits.

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance, and contribution of GitHub's deep learning related repositories, and identified factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report on the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization, and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models and datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train, and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use.

• Develop visualization/analysis techniques for representing trends in their use.

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated this program is my own original work

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

PyCharm 2019.1.3 (Professional Edition)
Build #PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
macOS 10.14.6

• Anaconda

  – jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0
  - numpy==1.14.0
  - statistics==1.0.3.5
  - ratelimit==2.2.1
  - requests
  - altair
  - matplotlib==2.2.2
  - selenium
- Git

Datasets

asc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_by_star: bert tensorflow.json, lstm tensorflow.json, wide deep tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, cnn tensorflow.json, ncf tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json

by_update_time: bert tensorflow.json, lstm tensorflow.json, resnet tensorflow.json, wide deep tensorflow.json, cnn tensorflow.json, ncf tensorflow.json, transformer tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json

forked_timestamp: bert tensorflow.csv, lstm tensorflow.csv, resnet tensorflow.csv, wide deep tensorflow.csv, cnn tensorflow.csv, ncf tensorflow.csv, transformer tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

- Before You Begin
- Prerequisites
- Install
- Running
- Test
- High Level Description of all Modules & Datasets
- Authors
- License

STAMPER is a python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code.
Amphetamine on the Mac App Store: keep the Mac awake with this useful app (otherwise it will disconnect the internet).

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- Git authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip:

    pip3 install --upgrade pip

Install the list of requirements specified in requirements.txt:

    pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run `python3 model_searcher.py` to get keyword-related repositories' metadata from GitHub in the `output` folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions `call_api` and `call_api_dir`.
Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.
Sample case: in `main()`, change `keywords` to the terms of interest. The resulting JSON file will be `output/bert.JSON`.
The sorting method can be customized in the functions `get_total_pages` and `request_ith_page`: `sort` can be `updated` or `stars`, and `order` can be `asc` or `desc`.

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the fork timestamps in `forked_timestamp`.

3. Data Selection (Optional)

Run `python3 repository_filter.py` to get the code-related repositories with statistics in the `filtered_repo` folder.
Run `python3 filtered_repo.py` to filter your data.
Note: keywords can be customized in `model_keyword.py`. We store all the previous experiment data in `tensorflow_model_filtering` and `pytorch_model_filtering`.

4. Data Visualization

Popularity: run `python3 visualizations/popularity.py` and get the graphs in `visualizations/graphs/popularity`.
Maintenance: run `python3 visualizations/maintenance.py` and get the graphs in `visualizations/graphs/maintenance`.
Contribution: run `python3 visualizations/contribution.py` and get the graphs in `visualizations/graphs/contribution`.
Multi Correlations: run `python3 visualizations/multi_variable.py` and get the graphs in `visualizations/graphs/multi_variable`.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in `test.py`. This module will record all the unreachable links and write them into the file `unreachable_urls.txt`.
Usage: change the elements in `keywords` and run `python3 test.py`. All the unreachable links will be written to `unreachable_urls.txt`.

Customizing Your Own Search

In the module `Model.py`, define your own entity lists (e.g. `tensorflow_models`). In the constructor `Model`, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), you can construct a model by calling the constructor `Model`, e.g. `bert = Model("bert tensorflow", "desc_by_star")` (parameters: model name and repository-metadata subfolder). Then you can call this object with its relative data easily (`from Model import bert` and use `bert` as you go along).

Customize Keywords: in the module `model_keyword.py`, import your instantiation (`lstm`) and call `add_keywords`, e.g.:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection: `output/`

- asc_by_star: cnn tensorflow.json, lstm tensorflow.json
- asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search: `forked_timestamp/`

- bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional): `filtered_repo/`

- bert.json
- pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
- tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs: `graphs/`

- contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
- maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
- multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
- popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu
Stamper image from http://bestpriceforrubberstamps.com (License: Free for personal use only)
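The Data Collection step described in this README queries the GitHub search API with a model keyword and a sort order. A minimal sketch of the query construction is shown below; the function name and the `per_page` value are illustrative assumptions, not the exact code in model_searcher.py:

```python
from urllib.parse import urlencode

def build_search_url(keyword, sort="stars", order="desc", page=1):
    """Build a GitHub repository-search URL for one keyword.

    sort can be "stars" or "updated"; order can be "asc" or "desc",
    mirroring the options described above.
    """
    base = "https://api.github.com/search/repositories"
    query = urlencode({"q": keyword, "sort": sort, "order": order,
                       "page": page, "per_page": 100})
    return f"{base}?{query}"

print(build_search_url("lstm tensorflow", sort="updated", order="asc"))
```

Sending the request with an `Authorization: token <key>` header raises the rate limit, as the README notes.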

Bibliography

a. GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

b. GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

c. GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

d. GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)



Spearman Correlation Coefficient

Definition
The Spearman rank-order correlation is a statistical procedure that is designed to measure the relationship between two variables on an ordinal scale of measurement [Corder and Foreman 2011]. This statistical method allows us to test for a rank-order relationship between two numerical ordinal variables associated with a monotonic function (an increasing or decreasing relationship).

Hypothesis Testing

• H0: The variables (star, fork and watcher) do not have a relationship with each other.

• H1: There is a relationship between those three variables.

Result

• Star vs Fork

    from scipy.stats import spearmanr
    coef1, p1 = spearmanr(star, forks)
    print(coef1, p1)
    >> 0.8752903811064278 0.0

• Star vs Watcher

    coef2, p2 = spearmanr(star, watchers)
    print(coef2, p2)
    >> 1.0 0.0

• Fork vs Watcher

    coef3, p3 = spearmanr(forks, watchers)
    print(coef3, p3)
    >> 0.8752903811064278 0.0

Running the code above calculates Spearman's correlation coefficient between the three variables in the testing dataset.

Set α = 0.05. Since p1, p2 and p3 are all less than α, and since from the calculation above we also find a strong positive correlation, with values of coef1 = 0.875, coef2 = 1.0 and coef3 = 0.875 respectively:

This means that it is very unlikely that the testing data are uncorrelated (95% confidence), and thus we can reject the hypothesis that those variables are uncorrelated.

In the rest of the report we consider the number of stars as the proxy for a project's popularity.
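Spearman's ρ is, by definition, the Pearson correlation computed on the ranks of the data. The following sketch, with illustrative star/fork counts rather than the thesis dataset, checks the scipy result against that rank-based definition:

```python
from scipy.stats import pearsonr, rankdata, spearmanr

# Illustrative star/fork counts for six repositories (not the real data).
stars = [10, 250, 3, 4000, 120, 36]
forks = [2, 40, 1, 900, 8, 25]

coef, p = spearmanr(stars, forks)

# Spearman's rho equals Pearson's r computed on the ranks of the data.
rho_from_ranks, _ = pearsonr(rankdata(stars), rankdata(forks))

print(coef, rho_from_ranks)
```

Two of the six repositories swap rank order between the two variables, so ρ comes out slightly below 1; a perfectly monotonic relationship would give ρ = 1.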


4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), in the figures below, look like two of the most trending models these days. Rising from 2017, CNN and LSTM have the greatest number of repositories in both creation and forks. Aside from models with a longer history, BERT and ResNet are two of the rising stars in the model competition. They arrive now with significant improvements in architecture design and performance, as we already described in the background section.

The model development community recently saw the release of multiple powerful frameworks, which are treated as baselines for building models. However, for many new models, such as the Wide and Deep model and the NCF model, the usage does not grow in abundance.


[Figure 4.4: Repositories with Forks (accumulated number of repositories created, including forks, per model keyword, 2015–2019)]

[Figure 4.5: Repositories without Forks (accumulated number of repositories created, per model keyword, 2015–2019)]


[Figure 4.6: Repository Trend in GitHub For Each Model (per-keyword repository counts, October 2015 – October 2019)]


[Figure 4.7: Creation Time vs Stars (number of stars against repository creation time, per model keyword)]

A fork is another copy of a repository. The forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see there is a considerable difference between the total number of repositories created when forks are counted and the total number of original repositories. We find that most of the repositories related to deep learning models are not original. This indicates that a considerable number of developers remain in a studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, different from the previous summarizing method, we study the popularity of deep learning models as it changes over time.
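The accumulated creation curves in the figures above can be reproduced by sorting repositories by creation date and taking a running count. A pandas sketch with toy timestamps (not the collected metadata) follows:

```python
import pandas as pd

# Toy creation timestamps for one model keyword (illustrative only).
created_at = pd.to_datetime([
    "2017-03-01", "2017-06-12", "2018-01-15", "2018-02-02", "2019-05-30"])

# One row per repository, sorted by creation time, then a running total
# gives the accumulated number of repositories at each creation date.
trend = pd.Series(1, index=created_at).sort_index().cumsum()
print(trend)
```

Plotting `trend` against its index yields a curve of the same shape as the accumulated-repository figures.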

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the highest number of repositories created. Let us examine this observation using the data. In 2017 the number of created repositories gradually increased for CNN, and in 2018–2019 the creation trend continued to increase to an upper level, which persists until now.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have two of the most important and most significant communities in the deep learning field. These networks are essential in both computer vision and NLP, where they form an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly in the last two years. Differing from previous structures like CNN, both of them modify the original structure and significantly improve on the results in computer vision and translation tasks.

Rising Star: BERT

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends, as depicted in the graph, lead to the conclusion and inference that deep learning models are proliferating fast, with innovative developments. Currently there is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated to how long the model has existed, but our data tell a different story.

With its paper published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e. stars) and creation time: past trends are no guarantee of future ones, and it is possible that the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

Similarly for the Wide and Deep model: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the previous data also confirm that there is no significant rise in the use of this model.

[Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (fork-count distribution histograms, per model keyword)]


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

| Model Name    | Mean   | STD     | Min | 25% | 50% | 75% | Max   |
|---------------|--------|---------|-----|-----|-----|-----|-------|
| Bert          | 498.65 | 2196.3  | 0   | 1   | 8   | 43  | 17940 |
| CNN           | 106.84 | 611.97  | 2   | 3   | 8   | 32  | 13882 |
| LSTM          | 48.82  | 214.22  | 0   | 1   | 2   | 13  | 2703  |
| NCF           | 77     | 129.91  | 1   | 2   | 3   | 115 | 227   |
| ResNet        | 46.88  | 221.43  | 0   | 0   | 1   | 8   | 2980  |
| Transformer   | 186.79 | 1155.87 | 0   | 0   | 4   | 21  | 12408 |
| Wide and Deep | 16.23  | 36.80   | 0   | 0   | 1   | 8   | 146   |

Table 4.2: Stars Comparison

| Model Name    | Mean   | STD    | Min | 25% | 50% | 75%  | Max    |
|---------------|--------|--------|-----|-----|-----|------|--------|
| Bert          | 128.21 | 585.93 | 0.0 | 0.0 | 1.0 | 16.5 | 4661.0 |
| CNN           | 40.71  | 252.71 | 0.0 | 1.0 | 4.0 | 14.0 | 6274.0 |
| LSTM          | 17.79  | 71.96  | 0.0 | 0.0 | 1.0 | 5.0  | 968.0  |
| NCF           | 34.33  | 58.60  | 0.0 | 0.5 | 1.0 | 51.5 | 102.0  |
| ResNet        | 17.44  | 93.75  | 0.0 | 0.0 | 0.0 | 3.0  | 1442.0 |
| Transformer   | 53.52  | 336.10 | 0.0 | 0.0 | 1.0 | 6.0  | 3637.0 |
| Wide and Deep | 7.28   | 16.36  | 0.0 | 0.0 | 0.0 | 2.5  | 71.0   |

Table 4.3: Forks Comparison
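Per-model summary tables of this shape can be produced directly with pandas' `describe`. A sketch on toy per-repository star counts (not the real dataset) follows:

```python
import pandas as pd

# Toy per-repository star counts for two model keywords (illustrative only).
df = pd.DataFrame({
    "name": ["bert", "bert", "bert", "cnn", "cnn", "cnn"],
    "star": [0, 8, 100, 2, 8, 32],
})

# One row per model: count, mean, std, min, quartiles (25%/50%/75%), max.
summary = df.groupby("name")["star"].describe()
print(summary[["mean", "50%", "max"]])
```

Running the same call on the collected repository metadata, once per metric (stars, forks), yields the columns of Tables 4.2 and 4.3.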

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52) and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79) and ResNet (17.44).

Kruskal-Wallis Test: the Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: The 7 models' distributions are the same.

• H1: The 7 models' distributions are different.

    from scipy.stats import kruskal
    stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(), dfLstm['star'].tolist(),
                      dfNcf['star'].tolist(), dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                      dfWideDeep['star'].tolist())
    print(stat, p)
    >> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding is different from the previous popularity exploration based on the number of repositories existing in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex models. Alternatively, building their own Transformer or BERT model requires a large amount of time and effort, but developers still show their interest in those novel deep learning models.

[Figure 4.9: Star vs Contributors (stargazers_count against number_of_contributors, per model keyword)]

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy value respectively.


[Figure 4.10: Star vs Development Time (stargazers_count against develop_duration, per model keyword)]

[Figure 4.11: Star vs Open Issues (stargazers_count against open_issues, per model keyword)]

Figure 4.12: Star vs Entropy Value

Number of Contributors
From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, p-value ≤ 0.01). Across all the repositories, the top-3 repositories by stars per contributor are from the models CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor) and Bert (155.0 stars/contributor).

Model           Percentage of One-Contributor Development (%)
Bert            74.53
CNN             83.3
LSTM            85.9
NCF             100
ResNet          90.26
Transformer     81.20
Wide and Deep   89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories
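The percentages in Table 4.4 amount to counting repositories whose contributor list has length one. A minimal sketch (a hypothetical helper over per-repository contributor counts, not STAMPER's actual code):

```python
def one_contributor_percentage(contributor_counts):
    """Percentage of repositories developed by exactly one contributor."""
    ones = sum(1 for n in contributor_counts if n == 1)
    return 100.0 * ones / len(contributor_counts)

# Toy data: 3 of 4 repositories have a single contributor
print(one_contributor_percentage([1, 1, 5, 1]))  # 75.0
```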


Development Time
From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value ≈ 0). This suggests that the longer a model has been in development, the more stars it will have (i.e. the model becomes more popular). The two repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues
From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186, p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, this supports the assumption that the more popular a repository becomes, the more issues it will have. We investigate this correlation further in the following section.

Entropy
From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project we also investigate the impact of a new feature (the entropy value) on the popularity of GitHub repositories.
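The Spearman coefficients quoted in these paragraphs can be reproduced without a statistics library by ranking both variables and taking the Pearson correlation of the ranks. A sketch on toy data (not the study's):

```python
def _ranks(values):
    """Average 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank-transformed data."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Toy data: stars vs. open issues for five hypothetical repositories
print(round(spearman_rho([10, 50, 200, 500, 1200], [0, 1, 3, 2, 40]), 4))  # 0.9
```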

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development work is not distributed evenly.

We can examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. Here we introduce an information-theoretic approach to test whether contributions are even or not.
Entropy
In particular, we compute the entropy H of each repository, defined as

p_i = c_i / ∑_i c_i    (4.1)

H = − ∑_i p_i log2(p_i)    (4.2)

where i denotes the i-th contributor, c_i is the i-th contributor's contribution, and ∑_i c_i is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example:

Its contribution table is summarized in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214    (4.3)

p_1 = 174/214,  p_2 = 36/214,  p_3 = 4/214    (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.78263    (4.5)

name            contribution
dragen1860      174
ash3n           36
kelvinkoh0308   4

Table 4.5: Sample Contributions to One Repository
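Equation (4.2) applied to the contributions in Table 4.5 can be sketched in Python (a minimal illustration, not STAMPER's actual code):

```python
import math

def collaboration_entropy(contributions):
    """Shannon entropy (base 2) of a repository's contribution shares."""
    total = sum(contributions)
    return -sum((c / total) * math.log2(c / total) for c in contributions if c > 0)

# Table 4.5: dragen1860 = 174, ash3n = 36, kelvinkoh0308 = 4
print(round(collaboration_entropy([174, 36, 4]), 5))  # 0.78263
```

A perfectly even two-person split would give entropy 1.0; a single contributor gives 0.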

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly: the lower the entropy, the more concentrated the contributions, which means the work is distributed more unevenly.

Figure 4.13 shows the distribution of the entropy value for all models. From these figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.

Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models.

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15. Most of the model repositories are not changed after forking. A more detailed analysis shows at a glance not only that changes are rarely made after forking, but also that most changed

Figure 4.15: Repository Uniqueness Distribution (%)

Figure 4.16: Repository Change Statistics


repositories differ from the original by only 0 to 100 bytes in repository size, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them attract less engagement. Second, a model may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that development across forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section we survey software maintenance problems in these deep learning related repositories. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page for each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
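Equation (4.6), applied to the ISO-8601 timestamps the GitHub API returns, can be sketched as follows (illustrative timestamps, not from the dataset):

```python
from datetime import datetime

ISO_FMT = "%Y-%m-%dT%H:%M:%SZ"  # timestamp format used by the GitHub REST API

def repo_age_days(created_at: str, updated_at: str) -> float:
    """age = T(updated_at) - T(created_at), in days (equation 4.6)."""
    delta = datetime.strptime(updated_at, ISO_FMT) - datetime.strptime(created_at, ISO_FMT)
    return delta.total_seconds() / 86400

print(repo_age_days("2018-10-17T00:00:00Z", "2019-01-04T00:00:00Z"))  # 79.0
```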

Figure 4.17 and Table 4.6 show how development time varies by model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide & Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). We therefore hypothesize that many users of the earlier models started using the open-source web community immediately after the first release.


Model         Max of days   Q3 of days   Median of days   Q1 of days   Min of days
Bert          779           229          110              32           0
Transformer   1254          321          142              11           0
Wide & Deep   1107          575          117              0.5          0
ResNet        1360          456.5        120              15           0
NCF           1120          476          216              8            0
LSTM          1812          621.25       315.5            47.25        0
CNN           1385          699.25       483              270.25       0

Table 4.6: Repository Development Time Statistics

Figure 4.17: Development Time Boxplot

Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows the scatter plot correlating development time with the number of open issues. As visually suggested by the figure and the Spearman correlation test, there is a weak correlation between the two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, which carry a higher maintenance cost, may have more users and therefore more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model         Mean    Std      25%   50%   75%   Min   Max
Bert          8.299   50.55    0     0     1     0     504
CNN           3.414   35.456   0     0     1     0     1077
LSTM          1.292   4.915    0     0     1     0     69
ResNet        1.791   11.164   0     0     0     0     186
Transformer   1.857   8.608    0     0     1     0     95
Wide & Deep   0.231   0.742    0     0     0     0     4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository   Percentage of repositories having a Wiki (%)
Bert                       97.17
CNN                        98.498
LSTM                       98.799
NCF                        98.864
ResNet                     98.817
Transformer                96.97
Wide & Deep                100

Table 4.8: Descriptive statistics on the percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e. they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All these samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this chapter we first used STAMPER to collect the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the collected data.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported maintenance metrics for deep learning repositories and conducted an in-depth investigation of whether developers change the original codebase after forking.

Figure 4.19: Open Issues vs Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvement

Pinpointing the landmarks of deep learning models is tricky. We had plenty of ideas on identifying models using different strategies, and we developed heuristics for in-depth analysis of model construction and high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation in the future (for example, users may use the prototxt format to publish their models). In this project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limits: the present experiment uses a limited number of repositories from GitHub, which cannot exceed the boundary of 1000 results per query. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples might yield a more precise outcome.
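GitHub's Search API caps each query at 1000 results, so any paging logic has to honour that cap. A minimal sketch (hypothetical helper; 100 results per page is the API's maximum):

```python
import math

def pages_needed(total_count, per_page=100, cap=1000):
    """Number of search-result pages to fetch, honouring GitHub's 1000-result cap."""
    return math.ceil(min(total_count, cap) / per_page)

print(pages_needed(4325))  # 10 (capped: only the first 1000 of 4325 are reachable)
print(pages_needed(350))   # 4
```

This cap is exactly why re-querying with different `sort`/`order` combinations, as described above, recovers only some of the remaining repositories.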

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to know how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly: different models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to define their own heuristics for data selection; experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a plugin for GitHub, allowing researchers


and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models via the number of repositories existing in GitHub. It is very likely that commit metadata reflects popularity as well. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g. k-means) to the high-resolution time series data from commits.

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub deep learning related repositories and identified factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report on the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models and datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What are the differences between those model stores?

By scraping, analyzing and visualizing the use of various models and datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated this program is my own original work

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware:

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software:

• PyCharm 2019.1.3 (Professional Edition)


Build #PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
macOS 10.14.6

• Anaconda
  – jupyter-notebook 6.0.0

Other:

- Python 3.7.4
  -- pandas==0.22.0 -- numpy==1.14.0
  -- statistics==1.0.3.5 -- ratelimit==2.2.1
  -- requests -- altair -- matplotlib==2.2.2
  -- selenium
- Git

Datasets:

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Table of Contents

Before You Begin, Prerequisites, Install, Running, Test, High Level Description of all Modules & Datasets, Authors, License

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code:
- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keeps the Mac awake (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:
- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip:

    pip3 install --upgrade pip

Install the list of requirements specified in requirements.txt:

    pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project, then run `python3 model_searcher.py` to get keyword-related repository metadata from GitHub into the `output` folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions `call_api` and `call_api_dir`. Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample case: in `main()`, change `keywords` to the terms of interest; the resulting JSON file will be `output/bert.JSON`. The sorting method can be customized in the functions `get_total_pages` and `request_ith_page`: `sort` can be `updated` or `stars`, and `order` can be `asc` or `desc`.

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the fork timestamps into `forked_timestamp`.

3. Data Selection (Optional)

Run `python3 repository_filter.py` to get the code-related repositories with statistics into the `filtered_repo` folder. Run `python3 filtered_repo.py` to filter your data.

Note: your keywords can be customized in `model_keyword.py`. We store all the previous experiment data in `tensorflow_model_filtering` and `pytorch_model_filtering`.

4. Data Visualization

- Popularity: run `python3 visualizations/popularity.py`; graphs are written to `visualizations/graphs/popularity`.
- Maintenance: run `python3 visualizations/maintenance.py`; graphs are written to `visualizations/graphs/maintenance`.
- Contribution: run `python3 visualizations/contribution.py`; graphs are written to `visualizations/graphs/contribution`.
- Multi correlations: run `python3 visualizations/multi_variable.py`; graphs are written to `visualizations/graphs/multi_variable`.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in `test.py`. This module records all the unreachable links and writes them into the file `unreachable_urls.txt`. Usage: change the elements in `keywords` and run `python3 test.py`; all the unreachable links will be written to `unreachable_urls.txt`.

Customizing Your Own Search

In the module `Model.py`, define your own entity lists (e.g. `tensorflow_models`). In the constructor `Model`, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have data from the previous steps (1-2), you can construct a model by calling the constructor `Model`, e.g. `bert = Model("bert tensorflow", "desc_by_star")`, with parameters model name and repository metadata subfolder. Then you can use this object with its related data easily (`from Model import bert` and use `bert` as you go along).

Customize keywords: in the module `model_keyword.py`, import your instantiation (e.g. `lstm`) and call `add_keywords`, e.g.:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw the graphs)

Experiment Datasets Collected

1. After Data Collection, the `output` folder contains:
   - asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
   - asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
   - by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
   - desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
   - desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
   - pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search, the `forked_timestamp` folder contains: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (optional), the `filtered_repo` folder contains: bert.json, plus
   - pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
   - tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

4. Generated graphs, under `graphs/`:
   - contribution/: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
   - maintenance/: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
   - multi_variable/: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
   - popularity/: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu. Stamper image from http://bestpriceforrubberstamps.com (License: Free for personal use only).


• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
      • Deep learning models
      • Summarized Timeline
    • Public Code Repositories
      • Web-based hosting service
      • Measuring Popularity From GitHub
      • Extracting Messy Data in the Wild
      • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    • RQ2: How popularity varies per model
    • RQ3: Does the popularity of models relate to other features?
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commitments Timestamp
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Softwares
      • Other
      • Datasets
  • Appendix 4: README

§4.1 Popularity of Deep Learning Models in GitHub

4.1.2 Past and Current Status: A Full Integration

According to our survey, the convolutional neural network (CNN) and the long short-term memory network (LSTM), shown in the figures below, appear to be the two most trending models. Rising from 2017, CNN and LSTM account for the greatest number of repositories, in both creations and forks. Aside from these models with a longer history, BERT and ResNet are two rising stars in the model competition: they arrived with significant improvements in architecture design and performance, as described in the background section.

The model development community has recently seen the release of several powerful frameworks, which developers treat as baselines for building models. However, for many new models, such as the Wide & Deep model and the NCF model, usage has not grown in abundance.


[Figure: "Number of Repositories Created With Forks (Accumulated)" per model (bert, cnn, lstm, ncf, resnet, transformer, wide & deep), 2015-2019; counts up to 40,000.]

Figure 4.4: Repositories with Forks

[Figure: "Number of Repositories Created (Accumulated)" per model, 2015-2019; counts up to 3,000.]

Figure 4.5: Repositories without Forks


[Figure: per-model repository counts over time, October 2015 to October 2019, one panel per model.]

Figure 4.6: Repository Trend in GitHub For Each Model


[Figure: "Created time vs Stars" scatter plot per model; stars up to 15,000, October 2015 to July 2019.]

Figure 4.7: Creation Time vs Stars

A fork is another copy of a repository. The forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories when forks are counted and the total number of originally created repositories. Most repositories related to deep learning models are therefore not original, which indicates that a considerable number of developers remain in a learning stage.
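The per-model counts above come from GitHub repository metadata. As a rough illustration only (this is not STAMPER's actual code; the query string is made up, though the endpoint, `forks_count` field, and 1000-result cap are GitHub v3 API behaviour), metadata can be fetched and split by fork status like this:

```python
import json
import urllib.parse
import urllib.request

def search_repos(query, token=None, per_page=100):
    """Fetch repository metadata from GitHub's v3 search API.

    Note: the search API is rate-limited and caps any query at 1000
    results, which is the sampling limit discussed in Chapter 5.
    """
    params = urllib.parse.urlencode(
        {"q": query, "sort": "stars", "per_page": per_page})
    req = urllib.request.Request(
        "https://api.github.com/search/repositories?" + params,
        headers={"Accept": "application/vnd.github.v3+json"})
    if token:
        req.add_header("Authorization", "token " + token)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["items"]

def split_by_forks(repos):
    """Separate repositories that have been forked from those that have not."""
    forked = [r for r in repos if r.get("forks_count", 0) > 0]
    unforked = [r for r in repos if r.get("forks_count", 0) == 0]
    return forked, unforked
```

For example, `split_by_forks(search_repos("cnn tensorflow"))` would return the two groups for one (hypothetical) model query.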

At the same time, we use this dataset to answer several research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories that currently exist in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, different from the previous summarizing method, we study how the popularity of deep learning models changes over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the largest number of repositories created. Let us examine this using the data: in 2017 the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued rising to an even higher level, which persists until now.

What accounts for this tremendous usage difference? CNN and LSTM currently have two of the largest and most significant communities in the deep learning field; these networks are essential in both computer vision and NLP, where they hold an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly in the last two years. Unlike earlier structures such as CNN, both conduct modifications of the original architecture and significantly improve results in computer vision and translation tasks.

Rising star: BERT

However, no model comes with perfection: LSTM itself can be extended into many variants, and BERT is one of those.

The current trends depicted in the graphs support the inference that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide & Deep

One might believe that the popularity of a deep learning model is strongly correlated with the time the model came into existence, but our data tell a different story.

Although its paper was published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no simple relationship between popularity (i.e., stars) and creation time: past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The situation is similar for the Wide & Deep model: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the previous data also confirm that there is no significant rise in its use.

[Figure: "Fork Distribution Histograms"; per-model histograms of forks_count, binned 0-1000.]

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development


4.1.4 RQ2: How popularity varies per model

Figure 4.8 shows the distribution of the number of forks for each model, and as presented in Table 4.2 and Table 4.3 we can see the following.

Model Name      Mean     STD      Min   25%   50%   75%    Max
Bert            498.65   2196.3   0     1     8     43     17940
CNN             106.84   611.97   2     3     8     32     13882
LSTM            48.82    214.22   0     1     2     13     2703
NCF             77       129.91   1     2     3     115    227
ResNet          46.88    221.43   0     0     1     8      2980
Transformer     186.79   1155.87  0     0     4     21     12408
Wide and Deep   16.23    36.80    0     0     1     8      146

Table 4.2: Stars Comparison

Model Name      Mean     STD      Min   25%   50%   75%    Max
Bert            128.21   585.93   0.0   0.0   1.0   16.5   4661.0
CNN             40.71    252.71   0.0   1.0   4.0   14.0   6274.0
LSTM            17.79    71.96    0.0   0.0   1.0   5.0    968.0
NCF             34.33    58.60    0.0   0.5   1.0   51.5   102.0
ResNet          17.44    93.75    0.0   0.0   0.0   3.0    1442.0
Transformer     53.52    336.10   0.0   0.0   1.0   6.0    3637.0
Wide and Deep   7.28     16.36    0.0   0.0   0.0   2.5    71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52), and NCF (34.33).

The three models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88), and LSTM (48.82). The three models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79), and ResNet (17.44).
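Per-model summary statistics of this kind can be computed directly from the collected metadata; the Kruskal-Wallis snippet later in this section already holds the star counts in per-model data frames (dfBert, dfCnn, and so on). A minimal sketch with made-up numbers, not the study's data:

```python
import pandas as pd

# Illustrative star counts for two models; in the study, each row would be
# one repository from STAMPER's collected metadata.
stars = pd.DataFrame({
    "name": ["bert"] * 4 + ["ncf"] * 4,
    "star": [0, 8, 43, 17940, 1, 3, 115, 227],
})

# Same columns as the stars/forks comparison tables:
# mean, std, min, quartiles and max per model.
summary = stars.groupby("name")["star"].describe()
print(summary[["mean", "std", "min", "25%", "50%", "75%", "max"]])
```

`describe()` is a convenient one-call way to get every column of the comparison tables at once.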

Kruskal-Wallis test. The Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen.

• H0: the 7 models' distributions are the same.

• H1: the 7 models' distributions are different.

    from scipy.stats import kruskal
    stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(),
                      dfLstm['star'].tolist(), dfNcf['star'].tolist(),
                      dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                      dfWideDeep['star'].tolist())
    print(stat, p)
    >> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we reject the null hypothesis and conclude that there is significant evidence that the seven models' star distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories existing in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building one's own Transformer or Bert model requires a large amount of time and effort, yet developers still show their interest in these novel deep learning models by starring and forking them.

[Figure: scatter plot of stargazers_count (up to 18,000) vs. number_of_contributors (0-30) per model.]

Figure 4.9: Star vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.


[Figure: scatter plot of stargazers_count (up to 18,000) vs. develop_duration (up to 2000 days) per model.]

Figure 4.10: Star vs Development Time

[Figure: scatter plot of stargazers_count (up to 18,000) vs. open_issues (up to 1100) per model.]

Figure 4.11: Star vs Open Issues

[Figure: scatter plot of stargazers_count (up to 18,000) vs. entropy (0.0-2.8) per model.]

Figure 4.12: Star vs Entropy Value

Number of contributors. From Figure 4.9 and a Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Across all the repositories, the top-3 models with the most stars per contributor are CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor), and Bert (155.0 stars/contributor).

Model           Percentage of One-Contributor Development (%)
Bert            74.53
CNN             83.3
LSTM            85.9
NCF             100
ResNet          90.26
Transformer     81.20
Wide and Deep   89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development time. From Figure 4.10 and a Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value ≈ 0). This suggests that the longer a model is developed, the more stars it will have (i.e., the model becomes more popular). The two models with the longest development durations are LSTM and CNN.

Open issues. From Figure 4.11 and a Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, this supports the assumption that the more popular a repository becomes, the more issues it will have. We further investigate this correlation in the following section.
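The correlation coefficients quoted in this section can be computed with SciPy's `spearmanr`. A minimal sketch with made-up values (each point standing in for one repository, not the study's data):

```python
from scipy.stats import spearmanr

# Hypothetical per-repository feature values.
stars       = [5, 12, 40, 7, 100, 3, 55, 18]
open_issues = [0, 1, 6, 0, 14, 0, 9, 2]

# Spearman's rho is rank-based, so it captures monotone relationships
# without assuming normality, which suits these long-tailed counts.
rho, p = spearmanr(stars, open_issues)
print(f"rho = {rho:.4f}, p = {p:.4g}")
```

The same call, with a different second feature, yields each of the star-vs-feature tests reported above.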

Entropy. From Figure 4.12 and a Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of a new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development effort is not distributed evenly.

We can confirm this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether their contributions are even or not.

Entropy. In particular, we compute the entropy H of each repository, defined as

    p_i = c_i / Σ_i c_i        (4.1)

    H = − Σ_i p_i log2(p_i)        (4.2)

where i indexes the contributors, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution to the repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example, its contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

    Total = 174 + 36 + 4 = 214        (4.3)

    p_1 = 174/214,  p_2 = 36/214,  p_3 = 4/214        (4.4)

    H(repository) = −(174/214 log2(174/214) + 36/214 log2(36/214) + 4/214 log2(4/214)) ≈ 0.783        (4.5)

name            contribution
dragen1860      174
ash3n           36
kelvinkoh0308   4

Table 4.5: Sample Contributions to One Repository
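The calculation above can be written as a short Python function (a sketch of Equations 4.1 and 4.2; STAMPER's actual implementation may differ):

```python
import math

def collaboration_entropy(contributions):
    """Base-2 Shannon entropy of a repository's contribution distribution.

    An entropy of 0 means a single contributor did all the work; higher
    values mean the work is spread more evenly across contributors.
    """
    total = sum(contributions)
    return -sum((c / total) * math.log2(c / total)
                for c in contributions if c > 0)

# The worked example above: contributions of 174, 36 and 4.
h = collaboration_entropy([174, 36, 4])
```

The maximum possible value for n contributors is log2(n), reached only when all contributions are equal.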

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly: the lower the entropy, the higher the separation, and thus the more unevenly the work is distributed.

Figure 4.13 shows the distribution of the entropy value for all models. From these figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


[Figure: "Entropy Distribution"; per-model histograms of entropy, binned 0.00-3.00.]

Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repositories' metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original. We observe that Bert has a high proportion of unique forked repositories among the models.

[Figure: per-model boxplots of unique_percent, 0-100.]

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. A more detailed analysis shows at a glance not only that changes are rarely made after forking, but also that most changed


[Figure: "Uniqueness (%)"; per-model histograms of uniqueness percentage, binned 0.00-1.00.]

Figure 4.15: Repository Uniqueness Distribution (%)

[Figure: "Repository Changed Histograms"; per-model histograms of mean repository change, binned −2500 to +2500.]

Figure 4.16: Repository Change Statistic


repositories differ from the original repository in size by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them less used. Second, a model may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the forked repositories' development size is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey the software maintenance problems of these deep learning related repositories. The overall purpose is to explore three sets of factors related to maintenance: development time, the number of open issues, and the wiki page of each repository. In this project we also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

    age = T(updated_at) − T(created_at)        (4.6)

Figure 4.17 and Table 4.6 show how development time varies by model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide & Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). Therefore, we hypothesize that repositories for many of the earlier models started appearing in the open-source web community immediately after the first release.
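Equation 4.6 translates directly to code using the GitHub API's ISO-8601 timestamp format. A small sketch (the `created_at`/`updated_at` field names are the API's; the dates are made up):

```python
from datetime import datetime, timezone

def repo_age_days(created_at, updated_at):
    """Repository age per Equation 4.6, in whole days.

    Both arguments use the GitHub API's timestamp format,
    e.g. "2018-10-17T09:30:00Z".
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    created = datetime.strptime(created_at, fmt).replace(tzinfo=timezone.utc)
    updated = datetime.strptime(updated_at, fmt).replace(tzinfo=timezone.utc)
    return (updated - created).days

# Hypothetical repository created mid-October 2018, last updated in January 2019.
age = repo_age_days("2018-10-17T09:30:00Z", "2019-01-04T12:00:00Z")
```

Applying this to every repository of a model yields the per-model distributions summarized in Table 4.6.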


Model         Max days   Q3 days   Median days   Q1 days   Min days
Bert          779        229       110           32        0
Transformer   1254       321       142           11        0
Wide & Deep   1107       575       117           0.5       0
ResNet        1360       456.5     120           15        0
NCF           1120       476       216           8         0
LSTM          1812       621.25    315.5         47.25     0
CNN           1385       699.25    483           270.25    0

Table 4.6: Repository Development Time Statistics

[Figure: per-model boxplots of development time in days, 0-2000.]

Figure 4.17: Development Time Boxplot


[Figure: scatter plot of develop_duration vs. open_issues per model.]

Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and a Spearman correlation test, there is a weak correlation between the two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, given the high cost of maintenance, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.299), CNN (3.414), and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model         Mean    Std      25%   50%   75%   Min   Max
Bert          8.299   50.55    0     0     1     0     504
CNN           3.414   35.456   0     0     1     0     1077
LSTM          1.292   4.915    0     0     1     0     69
ResNet        1.791   11.164   0     0     0     0     186
Transformer   1.857   8.608    0     0     1     0     95
Wide & Deep   0.231   0.742    0     0     0     0     4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository   Percentage of Repositories Having a Wiki (%)
Bert                       97.17
CNN                        98.498
LSTM                       98.799
NCF                        98.864
ResNet                     98.817
Transformer                96.97
Wide & Deep                100

Table 4.8: Descriptive statistics on percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., most have a wiki page). Figure 4.19 shows the histogram of open issues. All these samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.
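The percentages in Table 4.8 can be derived from the `has_wiki` flag that GitHub's repository metadata exposes. A minimal sketch (the three inline repositories are hypothetical):

```python
def wiki_percentage(repos):
    """Percentage of repositories whose `has_wiki` metadata flag is set."""
    if not repos:
        return 0.0
    return 100.0 * sum(1 for r in repos if r.get("has_wiki")) / len(repos)

# Three hypothetical repositories, two of which have a wiki enabled.
pct = wiki_percentage([{"has_wiki": True},
                       {"has_wiki": True},
                       {"has_wiki": False}])
```

Running this over each model's repository list gives one row of Table 4.8 at a time.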

4.4 Summary

In this chapter, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects of deep learning repositories (popularity, contribution, and maintenance) using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.


[Figure: "Distribution of Issues"; per-model histograms of open_issues, binned 0-100.]

Figure 4.19: Open Issues vs Number of Repository


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas on identifying models using different strategies, and we developed heuristics for an in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.
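A heuristic of this kind can be sketched as simple keyword matching against model-construction APIs. The function and keyword lists below are illustrative, not the exact STAMPER implementation (only the LSTM keywords appear in the project's README):

```python
# Minimal keyword-matching heuristic: flag which model-construction APIs
# appear in a Python source snippet. Keyword lists are illustrative.
MODEL_KEYWORDS = {
    "lstm": ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"],
    "cnn": ["tf.keras.layers.Conv2D", "tf.nn.conv2d"],
}

def detect_models(source: str) -> set:
    """Return the set of model names whose APIs occur in `source`."""
    return {model for model, keywords in MODEL_KEYWORDS.items()
            if any(kw in source for kw in keywords)}

snippet = "cell = tf.nn.rnn_cell.LSTMCell(128)\nx = tf.keras.layers.Conv2D(32, 3)(x)"
print(detect_models(snippet))
```

Such substring matching is cheap but imperfect: it misses aliased imports and models built from lower-level primitives, which is exactly the limitation discussed above.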

A sampling problem exists at the same time. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation in the future (for example, users may use the prototxt format to publish their models). In our project we focused only on deep learning models constructed using Python. The findings may also reflect sampling problems: the present experiment uses a limited number of repositories from GitHub, since a single search query cannot exceed the 1000-results boundary. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. It may be that other, more stratified samples would yield a more precise outcome.
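The sorting-strategy workaround can be sketched as follows: issue the same query under several sort/order combinations, each capped at 1,000 results by the GitHub Search API, and merge the result sets. The helper below only plans the query parameters (the function name is illustrative); a real run would add HTTP requests and de-duplication by repository id:

```python
# GitHub's Search API caps each query at 1,000 results (100 per page).
# One mitigation is to repeat the query under several sort/order
# combinations and merge the results, reaching more of the population.
PER_PAGE = 100          # maximum page size allowed by the API
MAX_RESULTS = 1000      # hard cap per query

def search_plans(query: str):
    """Yield one parameter dict per page for each sort/order variant."""
    for sort in ("stars", "updated"):
        for order in ("asc", "desc"):
            for page in range(1, MAX_RESULTS // PER_PAGE + 1):
                yield {"q": query, "sort": sort, "order": order,
                       "per_page": PER_PAGE, "page": page}

plans = list(search_plans("bert tensorflow"))
print(len(plans))  # 4 variants x 10 pages of planned requests
```

Merging the four variants widens coverage but, as noted above, still cannot guarantee capturing every matching repository.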

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly: different models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild; this program could then provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to define their own heuristics for data selection: experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this work by migrating to open-source software or a plugin for GitHub, allowing researchers



and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models through the number of repositories that exist on GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., K-Means) to high-resolution time series data from commits.
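Such clustering of commit-activity time series might look like the following sketch, where each repository's commit timestamps are binned into an activity histogram and clustered with scikit-learn's KMeans (all data illustrative, not from the collected corpus):

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative commit timestamps (days since repository creation) for six
# hypothetical repositories: three active early, three active late.
commit_days = [
    [1, 2, 3, 4, 5],
    [2, 3, 3, 4, 6],
    [1, 1, 2, 5, 7],
    [90, 91, 95, 97, 99],
    [88, 92, 96, 98, 99],
    [85, 90, 94, 97, 98],
]

# Bin each repository's activity into a fixed-length histogram so that
# repositories become comparable feature vectors.
features = np.array([np.histogram(d, bins=4, range=(0, 100))[0]
                     for d in commit_days])

# Two clusters should separate early-burst from late-burst repositories.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)
```

Binning first keeps the feature dimension fixed regardless of how many commits a repository has, which is what makes a centroid-based method like K-Means applicable.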

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of deep-learning-related GitHub repositories and identified factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields, and will serve the needs of people working at the intersection of social media analysis, data visualization and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what's been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015); Processor: 2.2 GHz Intel Core i7; Memory: 16 GB 1600 MHz DDR3; Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

PyCharm 2019.1.3 (Professional Edition)


Build #PY-191.7479.30, built on May 30, 2019; Licensed to ANU / Xing Yu; JRE: 11.0.2+9-b159.60 x86_64; JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.; macOS 10.14.6

• Anaconda

– jupyter-notebook 6.0.0

Other

- Python 3.7.4
- pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
- Git

Datasets

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents: Before You Begin · Prerequisites · Install · Running · Test · High Level Description of all Modules & Datasets · Authors · License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing, editing or executing our code:
- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keep the Mac awake with this useful app (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:
- Git (https://git-scm.com/downloads) and a GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used, listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

    pip3 install --upgrade pip
    pip3 install -r requirements.txt

Running

All code scripts are run from the project root.

1. Data Collection

Clone our project, then run `python3 model_searcher.py` to fetch keyword-related repository metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars, and order can be asc or desc.

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

Run `python3 repository_filter.py` to get your code-related repositories with statistics in the filtered_repo folder. Run `python3 filtered_repo.py` to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run `python3 visualizations/popularity.py`; the graphs appear in visualizations/graphs/popularity
- Maintenance: run `python3 visualizations/maintenance.py`; the graphs appear in visualizations/graphs/maintenance
- Contribution: run `python3 visualizations/contribution.py`; the graphs appear in visualizations/graphs/contribution
- Multi Correlations: run `python3 visualizations/multi_variable.py`; the graphs appear in visualizations/graphs/multi_variable

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience using our tool, we provide a testing unit for GitHub links in test.py. This module records all unreachable links and writes them into the file unreachable_urls.txt.

Usage: change the elements in keywords, then run `python3 test.py`. All unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), construct a model by calling the constructor Model, e.g. `bert = Model("bert tensorflow", "desc_by_star")` (parameters: model name and repository metadata subfolder). Then you can use this object with its related data easily (`from Model import bert` and use bert as you go along).

Customize Keywords: in the module model_keyword.py, import your instantiation (e.g. lstm) and call add_keywords, e.g.:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw the graphs)

Experiment Datasets Collected

1. After Data Collection

output/
├── asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
├── asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
├── by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
├── desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
├── desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
└── pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv


3. After Data Selection (Optional)

filtered_repo/
├── bert.json
├── pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
└── tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json


4. Generated Graphs

graphs/
├── contribution/: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
├── maintenance/: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
├── multi_variable/: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
└── popularity/: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu. Stamper image from http://bestpriceforrubberstamps.com (license: free for personal use only).

Bibliography

a. GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

b. GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

c. GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

d. GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)



[Figure: accumulated counts (0-40,000) of repositories created with forks, 2015-2019, per model (bert, cnn, lstm, ncf, resnet, transformer, wide deep tensorflow)]

Figure 4.4: Repositories with Forks

[Figure: accumulated counts (0-3,000) of repositories created (without forks), 2015-2019, per model]

Figure 4.5: Repositories without Forks


[Figure: per-model repository creation timelines, October 2015 - October 2019, with counts on the vertical axis]

Figure 4.6: Repository Trend in GitHub for Each Model


[Figure: scatter of repository creation time vs. number of stars (0-15,000+), per model]

Figure 4.7: Creation Time vs. Stars

A fork is another copy of a repository. A forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see there is a considerable difference between the total number of repositories created including forks and the total number of original (non-forked) repositories. We find that most of the repositories related to deep learning models are not original, which indicates that a considerable number of developers remain at the learning stage.

At the same time, we used this dataset to answer several research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories that currently exist in the GitHub community, as shown in Figure 4.6 and Figure 4.7.

In this section, in contrast to the previous summarizing method, we study the popularity of deep learning models over time.
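The accumulated creation trends plotted in this chapter can be derived from repository creation timestamps with pandas. A minimal sketch with illustrative dates (not the collected data):

```python
import pandas as pd

# Illustrative repository creation dates for one model; the running
# total over time is what the accumulated-popularity figures plot.
created_at = pd.to_datetime([
    "2016-03-01", "2016-07-15", "2017-02-10", "2017-02-20",
    "2018-05-05", "2018-06-30", "2019-01-12",
])
s = pd.Series(1, index=created_at)

# Yearly creation counts, then the cumulative sum (accumulated trend).
yearly = s.resample("YS").sum()
accumulated = yearly.cumsum()
print(accumulated)
```

Resampling to a coarser frequency (e.g. monthly) before taking the cumulative sum gives the smoother curves seen in the figures.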

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and the highest number of repositories created. Let us examine this using the data: in


2017, the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued to rise to a higher level, where it remains now.

What accounts for this tremendous difference in usage? CNN and LSTM currently have among the most important and significant communities in the deep learning field; these networks are essential in both computer vision and NLP, where they form an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly in the recent two years. Unlike earlier structures such as CNN, both modify an original architecture and significantly improve results in computer vision and translation tasks.

Rising star: Bert

However, no model comes with perfection. LSTM itself can be extended into many variants, and BERT is one of those.

The current trends, as depicted in the graph, lead to the conclusion that deep learning models are proliferating fast, with innovative developments, and there is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with the time the model came into existence, but our data tell a different story.

Although its paper was published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no necessary relationship between popularity (i.e., stars) and creation time; past trends are no guarantee of future ones, and it is possible that the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide and Deep model, also published in 2016, is similar: although Google provides full documentation and a tutorial for this model, we still take a pessimistic view of it. Moreover, the previous data also confirmed that there is no significant rise in the use of this model.

[Figure: per-model histograms of forks_count (binned 0-1000), with count of records on the vertical axis]

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name     Mean    STD      Min  25%  50%  75%  Max
Bert           498.65  2196.3   0    1    8    43   17940
CNN            106.84  611.97   2    3    8    32   13882
LSTM           48.82   214.22   0    1    2    13   2703
NCF            77.00   129.91   1    2    3    115  227
ResNet         46.88   221.43   0    0    1    8    2980
Transformer    186.79  1155.87  0    0    4    21   12408
Wide and Deep  16.23   36.80    0    0    1    8    146

Table 4.2: Stars Comparison

Model Name     Mean    STD     Min  25%  50%  75%   Max
Bert           128.21  585.93  0.0  0.0  1.0  16.5  4661.0
CNN            40.71   252.71  0.0  1.0  4.0  14.0  6274.0
LSTM           17.79   71.96   0.0  0.0  1.0  5.0   968.0
NCF            34.33   58.60   0.0  0.5  1.0  51.5  102.0
ResNet         17.44   93.75   0.0  0.0  0.0  3.0   1442.0
Transformer    53.52   336.10  0.0  0.0  1.0  6.0   3637.0
Wide and Deep  7.28    16.36   0.0  0.0  0.0  2.5   71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52), and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88), and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79), and ResNet (17.44).

Kruskal-Wallis Test. The Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen:

bull H0: the 7 models' distributions are the same

bull H1: the 7 models' distributions are different

from scipy.stats import kruskal
stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(), dfLstm['star'].tolist(),
                  dfNcf['star'].tolist(), dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                  dfWideDeep['star'].tolist())
print(stat, p)
# >> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories existing in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex models. Alternatively, building their own Transformer or Bert model may require a large amount of time and effort, yet developers still show their interest in these novel deep learning models.


Figure 4.9: Stars vs. Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.



Figure 4.10: Stars vs. Development Time


Figure 4.11: Stars vs. Open Issues


Figure 4.12: Stars vs. Entropy Value

Number of Contributors. From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, p-value ≤ 0.01). Across all the repositories, the top-3 repositories with the most stars per contributor are from the models CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor), and Bert (155.0 stars/contributor).
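The correlations in this section are Spearman tests. As a rough illustration of how ρ is obtained (a stdlib-only sketch for exposition; the project itself would more likely call scipy.stats.spearmanr), the data are first converted to ranks with ties averaged, then a Pearson correlation is computed on the ranks:

```python
import math

def _ranks(xs):
    """1-based ranks, averaging ties, as Spearman's test requires."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average rank for a tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks.
    Assumes each input has non-zero rank variance."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

# Hypothetical stars and contributor counts for five repositories
print(spearman_rho([12, 5, 830, 44, 2], [1, 1, 14, 3, 1]))
```

Because ρ works on ranks, it captures any monotone association, not only linear ones, which suits heavily skewed star counts.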

Model          Percentage of One-Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time. From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value = 0). This suggests that the longer a model has been developed, the more stars it has (i.e., the model becomes more popular). The top-2 repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues. From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186, p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can make the assumption that the more popular a repository becomes, the more issues it will have. We further investigate this correlation in the following section.

Entropy. From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly in developing a deep learning model. After collecting all the contributions, we calculate the entropy value per repository, and Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.
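The percentages in Table 4.4 follow directly from per-repository contributor counts; a minimal sketch (the counts list below is hypothetical):

```python
def one_contributor_pct(contributor_counts):
    """Share of repositories (%) developed by exactly one contributor,
    as reported per model in Table 4.4."""
    solo = sum(1 for c in contributor_counts if c == 1)
    return 100 * solo / len(contributor_counts)

# Hypothetical contributor counts for four repositories of one model
print(one_contributor_pct([1, 1, 2, 1]))  # → 75.0
```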


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. Here we introduce an information-theoretic approach to test whether their contributions are even or not.

Entropy. In particular, we compute the entropy H of each repository, defined as

\[ p_i = \frac{c_i}{\sum_i c_i} \tag{4.1} \]

\[ H = -\sum_i p_i \log_2(p_i) \tag{4.2} \]

Here i indexes the contributors, c_i is the i-th contributor's contribution, and \(\sum_i c_i\) is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example:

The contribution table is summarized in Table 4.5, and its corresponding entropy can then be calculated:

\[ \text{Total} = 174 + 36 + 4 = 214 \tag{4.3} \]

\[ p_1 = \frac{174}{214}, \quad p_2 = \frac{36}{214}, \quad p_3 = \frac{4}{214} \tag{4.4} \]

\[ H(\text{repository}) = -\left(\frac{174}{214}\log_2\frac{174}{214} + \frac{36}{214}\log_2\frac{36}{214} + \frac{4}{214}\log_2\frac{4}{214}\right) \approx 0.80133 \tag{4.5} \]

Name           Contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample Contributions to One Repository
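The entropy computation of Eqs. (4.1)-(4.2) can be sketched in a few lines of Python (a minimal illustration of the formula, not STAMPER's own entropy_calculation.py):

```python
import math

def repo_entropy(contributions):
    """Collaboration entropy H = -sum(p_i * log2(p_i)) over a repository's
    per-contributor contribution counts (Eqs. 4.1-4.2)."""
    total = sum(contributions)
    return -sum((c / total) * math.log2(c / total)
                for c in contributions if c > 0)

# Contribution counts from Table 4.5 (dragen1860/TensorFlow-2x-Tutorials)
print(repo_entropy([174, 36, 4]))
```

Two equal contributors give H = 1 bit; a single contributor gives H = 0, which is why a spike near zero in Figure 4.13 signals solo development.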

The resulting distribution of entropy over all the repositories can be used to determine whether a repository is developed unevenly: the lower the entropy, the stronger the separation, indicating more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy values for all models. From these figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models studied.
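A minimal sketch of how the share of changed forks could be computed from repository sizes (an assumption for illustration; STAMPER's actual uniqueness criterion may use richer metadata):

```python
def unique_fork_pct(fork_sizes, original_size):
    """Share of forks (%) whose reported size differs from the original
    repository: a simple proxy for 'changed after forking'."""
    changed = sum(1 for s in fork_sizes if s != original_size)
    return 100 * changed / len(fork_sizes)

# Hypothetical sizes (KB) for three forks of a repository of size 100
print(unique_fork_pct([100, 100, 140], 100))
```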


Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was to provide a summarized view of what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15. Most of the model repositories are not changed after forking. Looking more closely, we can see at a glance not only that changes are rarely made after forking, but also that most changed


Figure 4.15: Repository Uniqueness Distribution (%)


Figure 4.16: Repository Change Statistics


repositories' size differences from the original repository fall within 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them less approachable. Second, the model itself may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, and a large number of forked projects show no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey software maintenance in these deep learning related repositories. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

\[ \text{age} = T(\text{updated\_at}) - T(\text{created\_at}) \tag{4.6} \]
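Using the ISO-8601 timestamps the GitHub API returns in its created_at and updated_at fields, Eq. (4.6) can be computed as follows (a minimal sketch):

```python
from datetime import datetime

def repo_age_days(created_at, updated_at):
    """Repository age (Eq. 4.6) in days, from GitHub's ISO-8601 timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # timestamp format used by the GitHub REST API
    return (datetime.strptime(updated_at, fmt)
            - datetime.strptime(created_at, fmt)).days

print(repo_age_days("2018-10-31T00:00:00Z", "2019-10-01T00:00:00Z"))  # → 335
```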

Figure 4.17 and Table 4.6 show how development time varies per model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide and Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal-Wallis test, the distributions of development days differ across models (p-value ≤ 0.05). Therefore, we hypothesize that many of the earlier models started being used in the open-source community immediately after their first release.


Model          Max of days  Q3 of days  Median of days  Q1 of days  Min of days
Bert           779          229         110             32          0
Transformer    1254         321         142             11          0
Wide and Deep  1107         575         117             0.5         0
ResNet         1360         456.5       120             1.5         0
NCF            1120         476         216             8           0
LSTM           1812         621.25      315.5           47.25       0
CNN            1385         699.25      483             270.25      0

Table 4.6: Repository Development Time Statistics


Figure 4.17: Development Time Boxplot


Figure 4.18: Development Time vs. Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and the Spearman correlation test, there is a weak correlation between these two variables (coefficient = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, given the high cost of maintenance, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.299), CNN (3.414), and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository  Percentage of Repositories Having a Wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
ResNet                    98.817
Transformer               96.97
Wide and Deep             100

Table 4.8: Descriptive statistics on the percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All these samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section, we first used STAMPER to collect the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects (popularity, contribution, and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation of whether developers change the original codebase after forking.


Figure 4.19: Open Issues vs. Number of Repositories


Chapter 5

Discussion and Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmarks of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for an in-depth analysis of model construction and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways models are constructed in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild. This is an open research question that needs further investigation in the future; for example, users may use the prototxt format to publish their models, while in our project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, since search results cannot exceed the 1,000-repository boundary. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples might yield a more precise outcome.
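The sorting strategies mentioned map onto the sort and order parameters of GitHub's repository search endpoint; a sketch of building such a request URL (illustrative only, and independent of STAMPER's actual request code):

```python
from urllib.parse import urlencode

def build_search_url(query, sort="stars", order="desc", page=1):
    """URL for one page of GitHub's repository search endpoint.
    The API serves at most 1,000 results per query, regardless of paging,
    which is why varying sort/order helps widen coverage."""
    params = {"q": query, "sort": sort, "order": order,
              "page": page, "per_page": 100}
    return "https://api.github.com/search/repositories?" + urlencode(params)

print(build_search_url("bert tensorflow"))
print(build_search_url("cnn tensorflow", sort="updated", order="asc"))
```

Issuing the same query under different sort/order combinations yields partially overlapping 1,000-result windows, which is the workaround described above.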

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real lives, an idea that is novel compared with previous work.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so the program can provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to develop their own heuristics for data selection: experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a plugin for GitHub, allowing researchers


and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media such as Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project, we investigated and examined the popularity of deep learning models through the number of repositories existing on GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., k-means) to high-resolution time-series data from commits.

Chapter 6

Conclusion

This research project identified the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance, and contribution of GitHub's deep learning related repositories and identified factors affecting these dimensions. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to help users gain deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving people who work at the intersection of social media analysis, data visualization, and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models and datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train, and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing, and visualizing the use of various models and datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

bull Identify data sources for current trends in model amp dataset use

bull Develop visualization and analysis techniques for representing trends in their use

Keywords machine learning TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015); Processor: 2.2 GHz Intel Core i7; Memory: 16 GB 1600 MHz DDR3; Graphics: Intel Iris Pro 1536 MB

Software

bull PyCharm 2019.1.3 (Professional Edition); Build PY-191.7479.30, built on May 30, 2019; Licensed to ANU / Xing Yu; JRE 11.0.2+9-b159.60 x86_64; JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.; macOS 10.14.6

bull Anaconda

ndash jupyter-notebook 6.0.0

Other

- Python 3.7.4
- pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
- Git

Datasets

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

- Before You Begin
- Prerequisites
- Install
- Running
- Test
- High Level Description of all Modules & Datasets
- Authors
- License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view, edit, and execute our code:

- PyCharm
- Anaconda
- Amphetamine (Mac App Store): keeps the Mac awake (otherwise the internet connection will be dropped)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads, plus a GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repository metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir. Then run sh JSONFormatter.sh in your terminal to well-format the output data.

Sample Case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars, and order can be asc or desc.

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the fork timestamps into forked_timestamp.

3. Data Selection (Optional)

Run python3 repository_filter.py to get the code-related repositories, with statistics, in the filtered_repo folder. Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run python3 visualizations/popularity.py; graphs appear in visualizations/graphs/popularity.
- Maintenance: run python3 visualizations/maintenance.py; graphs appear in visualizations/graphs/maintenance.
- Contribution: run python3 visualizations/contribution.py; graphs appear in visualizations/graphs/contribution.
- Multi Correlations: run python3 visualizations/multi_variable.py; graphs appear in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not well maintained, and their links are sometimes broken and unreachable. To guarantee the best experience with our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them to the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data, and forked_time_location in three folders.

Instantiation: since you already have data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with the model name and repository metadata subfolder as parameters. Then you can use this object with its related data easily (from Model import bert, and use bert as you go along).

Customize Keywords: in module model_keyword.py, import your instantiation (e.g. lstm) and call add_keywords, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw the graphs)

Experiment Datasets Collected

1. After Data Collection

output/
- asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
- asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv


Generated G

raphs

3. After Data Selection (Optional)

filtered_repo
  bert.json

pytorch_model_filtering
  Densenet.json
  FCN-ResNet101.json
  GoogleNet.json
  MobileNet v2.json
  ResNet101.json
  ResNext.json
  ShuffleNet v2.json
  SqueezeNet.json
  Tacotron 2.json
  Wide ResNet.json
  vgg_nets.json

tensorflow_model_filtering
  bert.json
  lstm.json
  ncf.json
  resnet.json
  transformer.json
  wide deep.json


graphs
  contribution
    change_to_pdf.bash
    entropy_distribution.svg
    entropy_dots.svg
    lines_changed_boxs.svg
    lines_changed_hists.svg
    unique_percentage_distribution.svg
    uniqueness_chart.svg
  maintenance
    devTime_boxplot.svg
    issues_distribution.svg
    wiki_yn.svg
  multi_variable
    dev_t_to_open_issues.svg
    multi_correlation.svg
    star_to_contributors.svg
    star_to_dev_t.svg
    star_to_entropy.svg
    star_to_open_issues.svg
  popularity
    accumulated_popularity.svg
    creation_repository_trend_total.svg
    creation_with_fork_timeline.svg
    fork_distribution.svg
    popularity_dot.svg
    popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu
Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)

Bibliography

a. GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

b. GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19, and 20)

c. GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

d. GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and van Deursen, A., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
    • Deep learning models
    • Summarized Timeline
  • Public Code Repositories
    • Web-based hosting service
    • Measuring Popularity From GitHub
    • Extracting Messy Data in the Wild
    • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    • RQ2: How popularity varies per model
    • RQ3: Does the popularity of models relate to other features?
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commit Timestamps
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Softwares
      • Other
      • Datasets
  • Appendix 4: README

sect41 Popularity of Deep Learning Models in GitHub 25

Figure 4.6: Repository Trend in GitHub For Each Model (per-model repository counts over time, October 2015 to October 2019)

26 STAMPER in Action

Figure 4.7: Creation Time vs Stars (created_at vs number of stars, per model)

A fork is another copy of a repository. The forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of created repositories and the total number of repositories with forks. Most of the repositories related to deep learning models are not original, which indicates that a considerable number of developers remain in a learning stage.

At the same time, we use this dataset to answer several research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing the repositories that currently exist in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, unlike the previous summarizing method, we study how the popularity of deep learning models changes over time.

CNN and LSTM's dominant role

As the comparison above makes clear, CNN and LSTM are the winners in the GitHub community: they have the highest average number of stars and the highest number of repositories created. The data bears this out: in 2017 the number of created repositories gradually increased for CNN, and in 2018–2019 the creation trend continued to rise to a higher level, which it has sustained until now.

What accounts for this tremendous difference in usage? CNN and LSTM currently have among the largest and most significant communities in the deep learning field. These networks are essential in both computer vision and NLP, where they form an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer
As the title indicates, the usage of ResNet and Transformer has improved significantly in the last two years. Unlike earlier structures such as CNN, both of them modify the original structure and significantly improve results in computer vision and translation tasks.

Rising Star: Bert

However, no model comes with perfection. LSTM itself can be extended to many variants, and BERT is one of those.

The current trends depicted in the graph support the conclusion that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

One might believe that the popularity of a deep learning model is strongly correlated with the time the model came into existence, but our data tells a different story.

Published in 2016, NCF draws the least attention in the GitHub community. This shows that there is no relationship between popularity (i.e., stars) and creation time. Past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific model could flatten out or reverse itself.

The Wide and Deep model is similar: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, the previous data also confirmed that there is no significant rise in the use of this model.

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (fork-count distribution histograms, per model)


4.1.4 RQ2: How popularity varies per model

Figure 4.8 shows the distribution of the number of forks for each model, and Table 4.2 and Table 4.3 present the corresponding summary statistics.

Model Name     Mean    STD      Min  25%  50%  75%  Max
Bert           498.65  2196.3   0    1    8    43   17940
CNN            106.84  611.97   2    3    8    32   13882
LSTM           48.82   214.22   0    1    2    13   2703
NCF            77      129.91   1    2    3    115  227
ResNet         46.88   221.43   0    0    1    8    2980
Transformer    186.79  1155.87  0    0    4    21   12408
Wide and Deep  16.23   36.80    0    0    1    8    146

Table 4.2: Stars Comparison

Model Name     Mean        STD         Min  25%  50%  75%   Max
Bert           128.214953  585.926617  0.0  0.0  1.0  16.5  4661.0
CNN            40.710      252.713617  0.0  1.0  4.0  14.0  6274.0
LSTM           17.793      71.956709   0.0  0.0  1.0  5.0   968.0
NCF            34.333333   58.603185   0.0  0.5  1.0  51.5  102.0
ResNet         17.442478   93.754994   0.0  0.0  0.0  3.0   1442.0
Transformer    53.518797   336.103826  0.0  0.0  1.0  6.0   3637.0
Wide and Deep  7.282051    16.364192   0.0  0.0  0.0  2.5   71.0

Table 4.3: Forks Comparison
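Summary statistics of this shape can be produced directly with pandas' `describe()`; the following is a minimal sketch on toy data (the values below are illustrative stand-ins, not the study's dataset):

```python
import pandas as pd

# Toy stand-in for the collected repository metadata, not the real dataset.
df = pd.DataFrame({
    "model": ["bert", "bert", "cnn", "cnn", "cnn"],
    "star":  [0, 17940, 2, 8, 32],
})

# Per-model mean/std/quartiles, as reported in Tables 4.2 and 4.3.
stats = df.groupby("model")["star"].describe()
print(stats[["mean", "50%", "max"]])
```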

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52), and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88), and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), ResNet (17.44), and LSTM (17.79).

Kruskal-Wallis Test. The Kruskal-Wallis test is used in this project to compare more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen:

• H0: the 7 models' distributions are the same

• H1: the 7 models' distributions are different

from scipy.stats import kruskal

stat, p = kruskal(dfBert['star'].tolist(), dfCnn['star'].tolist(), dfLstm['star'].tolist(),
                  dfNcf['star'].tolist(), dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
                  dfWideDeep['star'].tolist())
print(stat, p)
# 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' star distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories in GitHub. This unexpected result can be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or Bert model requires a large amount of time and effort, yet developers still show interest in those novel deep learning models.

Figure 4.9: Star vs Contributors (stargazers_count vs number_of_contributors, per model)

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.


Figure 4.10: Star vs Development Time (stargazers_count vs develop_duration, per model)

Figure 4.11: Star vs Open Issues (stargazers_count vs open_issues, per model)

Figure 4.12: Star vs Entropy Value (stargazers_count vs entropy, per model)

Number of Contributors
From Figure 4.9 and a Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Across all the repositories, the top-3 repositories with the most stars per contributor are from the models CNN (168.75 stars/contributor), Transformer (15.51 stars/contributor), and Bert (15.50 stars/contributor).
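The ρ values reported here are Spearman rank correlations; a minimal sketch with `scipy.stats.spearmanr` on made-up numbers (the star and contributor counts below are hypothetical, not the study's data):

```python
from scipy.stats import spearmanr

# Hypothetical star and contributor counts for six repositories.
stars        = [5, 20, 80, 300, 1200, 9000]
contributors = [1, 2, 3, 5, 9, 30]

rho, p_value = spearmanr(stars, contributors)
print(rho)  # 1.0 -- this toy data is perfectly monotonic
```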

Model          Percentage of One-Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time
From Figure 4.10 and a Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value ≈ 0). This suggests that the longer a model has been developed, the more stars it will have (i.e., the model becomes more popular). The top-2 repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues
From Figure 4.11 and a Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can make the assumption that the more popular a repository becomes, the more issues it will have. We further investigate this correlation in the following section.

Entropy
From Figure 4.12 and a Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of a new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly in developing a deep learning model. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Software development may involve multiple developers, and each developer's contribution may not be the same. Here we introduce an information-theoretic approach to test whether their contributions are even or not.

Entropy
In particular, we compute the entropy H of each repository, defined as

p_i = c_i / Σ_i c_i    (4.1)

H = −Σ_i p_i log2(p_i)    (4.2)

where i denotes the i-th contributor, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example:

Its contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214    (4.3)

p_1 = 174/214,  p_2 = 36/214,  p_3 = 4/214    (4.4)

H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.80133    (4.5)

name           contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample Contributions to One Repository
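The entropy computation above can be sketched in a few lines of Python. This is a direct transcription of Equations 4.1 and 4.2, not STAMPER's exact code:

```python
import math

def contribution_entropy(contributions):
    """Shannon entropy (base 2) of per-contributor contribution shares."""
    total = sum(contributions)
    shares = [c / total for c in contributions]
    return -sum(p * math.log2(p) for p in shares if p > 0)

# Contribution counts from Table 4.5: dragen1860, ash3n, kelvinkoh0308.
h = contribution_entropy([174, 36, 4])
```

A single-contributor repository gives H = 0, and two contributors with equal contributions give H = 1.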

The resulting distribution of entropy across all the repositories can be used to determine whether a repository is developed unevenly. The lower the entropy, the higher the phase separation, which means the work is distributed more unevenly.

Figure 4.13 shows the distribution of the entropy values for all models. From these figures we can see that most of the repositories have an entropy value of around zero, which means that deep learning related repositories are developed either mostly by one developer or by a team with an uneven allocation of work.


Figure 4.13: Collaboration Entropy (entropy distribution histograms, per model)


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories out of the 6 models.

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was to provide a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15. Most of the model repositories are not changed after forking. A more detailed analysis shows at a glance not only that changes are rarely made after forking, but also that the changes that are made are small.


Figure 4.15: Repository Uniqueness Distribution (%) (uniqueness-percentage histograms, per model)

Figure 4.16: Repository Change Statistic (size-change histograms, per model)


The repository size difference between most forks and the original repository falls within 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially the ones implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them draw less engagement. Second, the model itself may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.
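The "unchanged after forking" share can be estimated from repository size metadata; a small sketch on toy numbers, assuming size equality as a cheap proxy for "no change" (the pairs below are illustrative, not the collected data):

```python
# (fork_size, parent_size) pairs in bytes -- illustrative values only.
pairs = [(1024, 1024), (1024, 1024), (1130, 1024), (980, 1024), (1024, 1024)]

unchanged = sum(1 for fork, parent in pairs if fork == parent)
share = 100 * unchanged / len(pairs)
print(f"{share:.0f}% of forks show no size change")  # 60% here
```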

4.3 Maintenance of Deep Learning Models in GitHub

In this section, the problems of software maintenance in these deep learning related repositories are surveyed. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. In this project we also explore whether the age of a project affects software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from the repository creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
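Equation 4.6 can be computed directly from GitHub's ISO-8601 timestamps; a minimal sketch with hypothetical values (the real values come from the repository metadata returned by the API):

```python
from datetime import datetime

# Hypothetical GitHub timestamps, illustrative only.
created_at = "2018-10-31T18:33:03Z"
updated_at = "2019-09-22T07:12:44Z"

fmt = "%Y-%m-%dT%H:%M:%SZ"
age = datetime.strptime(updated_at, fmt) - datetime.strptime(created_at, fmt)
print(age.days)  # repository age in whole days
```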

Figure 4.17 and Table 4.6 show how development time varies per model. The median development time is as follows: Bert (110 days), Transformer (142 days), Wide and Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal-Wallis test, the distributions of development days across models are different (p-value ≤ 0.05). Therefore, we hypothesize that many of the earlier models started being used in the open-source web community immediately after their first release.


Model          Max of days  Q3 of days  Median of days  Q1 of days  Min of days
Bert           779          229         110             32          0
Transformer    1254         321         142             11          0
Wide and Deep  1107         575         117             0.5         0
ResNet         1360         456.5       120             1.5         0
NCF            1120         476         216             8           0
LSTM           1812         621.25      315.5           47.25       0
CNN            1385         699.25      483             270.25      0

Table 4.6: Repository Development Time Statistics

Figure 4.17: Development Time Boxplot (days, per model)


Figure 4.18: Development Time vs Number of Open Issues (develop_duration vs open_issues, per model)

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and a Spearman correlation test, there is a weak correlation between those two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, due to their high maintenance cost, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.299), CNN (3.414), and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository   Percentage of repositories having a Wiki (%)
Bert                       97.17
CNN                        98.498
LSTM                       98.799
NCF                        98.864
ResNet                     98.817
Transformer                96.97
Wide deep                  100

Table 4.8: Descriptive statistics on percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data we collected, we can see that deep-learning-related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All these samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.
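GitHub's repository metadata includes a boolean has_wiki field, so percentages like those in Table 4.8 can be derived from collected records along these lines. This is a sketch over fabricated records; everything except the has_wiki field name is illustrative:

```python
# Percentage of repositories with a wiki, per model (illustrative records).
repos = [
    {"model": "bert", "has_wiki": True},
    {"model": "bert", "has_wiki": True},
    {"model": "bert", "has_wiki": False},
    {"model": "ncf",  "has_wiki": True},
]

def wiki_percentage(repos):
    totals, with_wiki = {}, {}
    for r in repos:
        m = r["model"]
        totals[m] = totals.get(m, 0) + 1
        if r["has_wiki"]:
            with_wiki[m] = with_wiki.get(m, 0) + 1
    return {m: 100.0 * with_wiki.get(m, 0) / n for m, n in totals.items()}

print(wiki_percentage(repos))  # bert: 2 of 3 repositories have a wiki; ncf: 1 of 1
```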

4.4 Summary

In this section we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then, using the data collected by STAMPER, we investigated three common software-engineering aspects of deep learning repositories: popularity, contribution and maintenance.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.


[Figure appears here: per-model histograms of binned open-issue counts.]

Figure 4.19: Open Issues vs Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitation and Improvement

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas on how to identify models using different strategies, and we developed heuristics for in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem: the models we chose cannot represent all the new models in the wild. This is an open research question which needs further investigation; for example, users may publish their models in the prototxt format, whereas in this project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limitations, in that the present experiment uses a limited number of repositories from GitHub: a search cannot exceed the 1000-repository boundary for originally created repositories. We tried to overcome this issue by using the different sorting strategies provided by GitHub; however, this strategy still cannot capture all the repositories in GitHub. Other, more stratified samples might produce a more precise outcome.
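The sorting workaround mentioned above (re-running one query under several sort/order combinations and merging the result pages) can be sketched as follows. The Search API endpoint is real, but the surrounding structure is an illustrative assumption; the live request is left as a comment so the de-duplication logic stands alone:

```python
# Sketch: widen coverage of GitHub's 1000-result search cap by running the
# same query under different sort/order combinations, then de-duplicating
# the merged result pages by repository id.
SEARCH_URL = "https://api.github.com/search/repositories"

def build_queries(keyword):
    combos = [("stars", "desc"), ("stars", "asc"),
              ("updated", "desc"), ("updated", "asc")]
    return [{"q": keyword, "sort": s, "order": o, "per_page": 100}
            for s, o in combos]

def merge_unique(pages):
    seen, merged = set(), []
    for page in pages:
        for repo in page:
            if repo["id"] not in seen:
                seen.add(repo["id"])
                merged.append(repo)
    return merged

# Live use would look like this (auth token and rate limiting omitted):
#   import requests
#   pages = [requests.get(SEARCH_URL, params=p).json()["items"]
#            for p in build_queries("lstm tensorflow")]
#   repos = merge_unique(pages)

pages = [[{"id": 1, "name": "repoA"}, {"id": 2, "name": "repoB"}],
         [{"id": 2, "name": "repoB"}, {"id": 3, "name": "repoC"}]]
print([r["name"] for r in merge_unique(pages)])  # ['repoA', 'repoB', 'repoC']
```

Each sort/order combination returns a different 1000-result window, so their union recovers more repositories than any single query, though still not necessarily all of them.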

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to learn how deep learning is woven into our real life, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly: different models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to build their own heuristics for data selection; experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this report by migrating to open-source software, or becoming a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models through the number of repositories existing in GitHub. It is very likely that the commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., k-means) to high-resolution time-series data from commits.
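As an illustration of the proposed direction, the sketch below clusters repositories by their weekly commit-count series with a tiny hand-rolled k-means (Euclidean distance, fixed iteration count); a real analysis would more likely use scikit-learn's KMeans on commit timestamps binned per week. All series shown are fabricated:

```python
import random

def assign(s, centers):
    # Index of the nearest center by squared Euclidean distance.
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(s, centers[i])))

def kmeans(series, k, iters=20, seed=0):
    # Tiny k-means over equal-length weekly commit-count vectors.
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(series, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for s in series:
            clusters[assign(s, centers)].append(s)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old center if a cluster empties out
                centers[i] = [sum(col) / len(cl) for col in zip(*cl)]
    return [assign(s, centers) for s in series]

# Fabricated weekly commit counts: two "active" repos, two "dormant" ones.
weekly = [[9, 8, 10, 12], [11, 10, 9, 13], [0, 1, 0, 0], [1, 0, 0, 1]]
labels = kmeans(weekly, k=2)
print(labels)  # active repositories share one label, dormant ones the other
```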

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our new approach used the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of deep-learning-related GitHub repositories, and identified factors affecting each of these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models and datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past, based on previous research prototyping. Have the model stores kept their models up to date? What are the differences between those model stores?

By scraping, analyzing and visualizing the use of various models and datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract

7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

PyCharm 2019.1.3 (Professional Edition)
Build #PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
macOS 10.14.6

• Anaconda

  - jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0
  - numpy==1.14.0
  - statistics==1.0.3.5
  - ratelimit==2.2.1
  - requests
  - altair
  - matplotlib==2.2.2
  - selenium
- Git

Datasets

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents: Before You Begin, Prerequisites, Install, Running, Test, High Level Description of all Modules & Datasets, Authors, License.

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing, editing or executing our code:
- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keeps the Mac awake with this useful app (otherwise it will disconnect from the internet).

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:
- Git (https://git-scm.com/downloads) and a GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used, listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip:

    pip3 install --upgrade pip
    pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project, then run `python3 model_searcher.py` to get keyword-related repository metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir. Then run `sh JSONFormatter.sh` in your terminal to well-format the output data.

Sample case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be `updated` or `stars`, and order can be `asc` or `desc`.

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the forks' timestamps into forked_timestamp.

3. Data Selection (Optional)

Run `python3 repository_filter.py` to get the code-related repositories with statistics into the filtered_repo folder, then run `python3 filtered_repo.py` to filter the data. Note: keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run `python3 visualizations/popularity.py`; graphs are written to visualizations/graphs/popularity.
- Maintenance: run `python3 visualizations/maintenance.py`; graphs are written to visualizations/graphs/maintenance.
- Contribution: run `python3 visualizations/contribution.py`; graphs are written to visualizations/graphs/contribution.
- Multi correlations: run `python3 visualizations/multi_variable.py`; graphs are written to visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them into the file unreachable_urls.txt.

Usage: change the elements in keywords and run `python3 test.py`. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). The constructor Model stores all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. `bert = Model("bert tensorflow", "desc_by_star")`, whose parameters are the model name and the repository metadata subfolder. Then you can use this object with its relative data easily (`from Model import bert` and use `bert` as you go along).

Customize keywords: in the module model_keyword.py, import your instantiation (e.g. `lstm`) and call add_keywords, e.g.:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection, output/ contains:
- asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
- asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search, forked_timestamp/ contains: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional), filtered_repo/ contains:
- bert.json
- pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
- tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs, graphs/ contains:
- contribution/: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
- maintenance/: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
- multi_variable/: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
- popularity/: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu
Stamper image from http://bestpriceforrubberstamps.com (License: Free for personal use only)

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and van Deursen, A., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521(7553):436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A Practical Approach to Building Neural Network Models Using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)



[Figure appears here: number of stars against repository creation time, October 2015 to July 2019, one series per model.]

Figure 4.7: Creation Time vs Stars

A fork is another copy of a repository. The forked repository can either contribute back to the original repository or use the original code in a derivative way. From Figure 4.4 and Figure 4.5, surprisingly, we can see a considerable difference between the total number of repositories including forks and the total number excluding forks. We find that most of the repositories related to deep learning models are not original. This indicates that a considerable number of developers remain in a studying stage.

At the same time, we use this dataset to answer some research questions.

4.1.3 RQ1: How has the popularity of models changed over time? A closer look at the deep learning models

The goal is to provide an initial view of the popularity of deep learning models by comparing repositories currently existing in the GitHub community, as shown in Figure 4.7 and Figure 4.6.

In this section, differently from the previous summarizing method, we study how the popularity of deep learning models changes over time.

CNN and LSTM's dominant role

As per the above comparison, it is clear that CNN and LSTM are the winners in the GitHub community, as they have the highest average number of stars and of repositories created. Let us examine this using the data: in 2017 the number of created repositories gradually increased for CNN, and in 2018-2019 the creation trend continued increasing to an even higher level, which has persisted until now.

What accounts for this tremendous usage difference? Currently, CNN and LSTM have among the most important and significant communities in the deep learning field. These networks are essential in both computer vision and NLP, and they form an overwhelming majority.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, ResNet and Transformer usage has improved significantly in the recent two years. Differing from previous structures like CNN, both of them modify the original structure and significantly improve the results in computer vision and translation tasks.

Rising star: Bert

However, no model comes with perfection: models continue to be extended into many variants, and BERT is one of those.

The current trends, as depicted in the graph, support the conclusion that deep learning models are proliferating fast, with innovative developments. There is surely ample space to grow and improve.

Models in the frozen zone: NCF and Wide and Deep

If one believes that the popularity of a deep learning model is strongly correlated with the time the model came into existence, our data tell a different story.

With its paper published in 2016, NCF draws the least attention in the GitHub community. This data also shows that there is no relationship between popularity (i.e., stars) and creation time; past trends are no guarantee of future ones, and it is possible that the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

Similarly for the Wide and Deep model: though Google provides full documentation and a tutorial for this model, we still take a pessimistic view of this model published in 2016. Moreover, previous data also confirm that there is no significant rise in the use of this model.

[Figure appears here: per-model histograms of binned fork counts.]

Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development


4.1.4 RQ2: How does popularity vary per model?

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name      Mean     STD       Min   25%   50%   75%   Max
Bert            498.65   2196.3    0     1     8     43    17940
CNN             106.84   611.97    2     3     8     32    13882
LSTM            48.82    214.22    0     1     2     13    2703
NCF             77       129.91    1     2     3     115   227
ResNet          46.88    221.43    0     0     1     8     2980
Transformer     186.79   1155.87   0     0     4     21    12408
Wide and Deep   16.23    36.80     0     0     1     8     146

Table 4.2: Stars Comparison

Model Name      Mean     STD      Min   25%   50%   75%    Max
Bert            128.21   585.93   0.0   0.0   1.0   16.5   4661.0
CNN             40.71    252.71   0.0   1.0   4.0   14.0   6274.0
LSTM            17.79    71.96    0.0   0.0   1.0   5.0    968.0
NCF             34.33    58.60    0.0   0.5   1.0   51.5   102.0
ResNet          17.44    93.75    0.0   0.0   0.0   3.0    1442.0
Transformer     53.52    336.10   0.0   0.0   1.0   6.0    3637.0
Wide Deep       7.28     16.36    0.0   0.0   0.0   2.5    71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52) and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79) and ResNet (17.44).

Kruskal-Wallis Test. The Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen:

• H0: the 7 models' star distributions are the same.

• H1: the 7 models' star distributions are different.

from scipy.stats import kruskal

stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                  dfNcf["star"].tolist(), dfResnet["star"].tolist(), dfTransformer["star"].tolist(),
                  dfWideDeep["star"].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that the seven models' star distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories in GitHub. This unexpected result can be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or Bert model requires a large amount of time and effort, yet developers still show their interest in these novel deep learning models by starring and forking them.

[Figure 4.9: scatter plot of stargazers_count against number_of_contributors, coloured by model (bert, cnn, lstm, ncf, resnet, transformer, wide deep tensorflow).]

Figure 4.9: Stars vs. Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.


[Figure 4.10: scatter plot of stargazers_count against develop_duration, coloured by model.]

Figure 4.10: Stars vs. Development Time

[Figure 4.11: scatter plot of stargazers_count against open_issues, coloured by model.]

Figure 4.11: Stars vs. Open Issues

[Figure 4.12: scatter plot of stargazers_count against entropy, coloured by model.]

Figure 4.12: Stars vs. Entropy Value

Number of Contributors. From Figure 4.9 and a Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, p-value ≤ 0.01). Across all the repositories, the top-3 repositories with the most stars per contributor come from the models CNN (1687.5 stars/contributor), Transformer (1551 stars/contributor) and Bert (1550 stars/contributor).
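The Spearman correlation tests quoted throughout this section can be reproduced with scipy.stats.spearmanr; a minimal sketch on synthetic data (the star and contributor counts below are illustrative, not the collected dataset):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Synthetic repositories: star counts loosely increasing with contributor count
contributors = rng.poisson(3, size=200) + 1
stars = contributors * 10 + rng.normal(0, 30, size=200)

rho, p = spearmanr(stars, contributors)
print(f"rho={rho:.4f}, p-value={p:.3g}")
```

In the thesis setting, stars and contributors would be the stargazers_count and contributor counts collected per repository by STAMPER.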

Model         | Percentage of One-Contributor Development (%)
Bert          | 74.53
CNN           | 83.3
LSTM          | 85.9
NCF           | 100
ResNet        | 90.26
Transformer   | 81.20
Wide and Deep | 89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time. From Figure 4.10 and a Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value ≈ 0). This suggests that the longer a model has been developed, the more stars it has (i.e., the model becomes more popular). The top-2 repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues. From Figure 4.11 and a Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186, p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, this supports the assumption that the more popular a repository becomes, the more issues it will have. We investigate this correlation further in the following section.

Entropy. From Figure 4.12 and a Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development work is not distributed evenly.

Table 4.4 confirms this: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Since software development may involve multiple developers, and each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether their contributions are even or not.

Entropy. In particular, we compute the entropy H of each repository, defined as

    p_i = c_i / Σ_i c_i    (4.1)

    H = − Σ_i p_i log₂(p_i)    (4.2)

where i indexes the contributors, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for the repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example, the contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

    Total = 174 + 36 + 4 = 214    (4.3)

    p₁ = 174/214,  p₂ = 36/214,  p₃ = 4/214    (4.4)

    H(repository) = −(174/214 · log₂(174/214) + 36/214 · log₂(36/214) + 4/214 · log₂(4/214)) ≈ 0.7826    (4.5)

Name          | Contribution
dragen1860    | 174
ash3n         | 36
kelvinkoh0308 | 4

Table 4.5: Sample Contributions to One Repository
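Equations (4.1)-(4.2) and the worked example above can be sketched as a short helper (the function name is ours, not part of STAMPER):

```python
import math

def collaboration_entropy(contributions):
    """Entropy H of per-contributor contribution counts (Eqs. 4.1-4.2)."""
    total = sum(contributions)
    probs = [c / total for c in contributions]             # Eq. 4.1
    return -sum(p * math.log2(p) for p in probs if p > 0)  # Eq. 4.2

# Worked example from Table 4.5: contributions of 174, 36 and 4
print(collaboration_entropy([174, 36, 4]))
```

An entropy near zero indicates one dominant contributor, while a perfectly even k-way split gives log₂(k).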

The resulting distribution of entropy over all the repositories can be used to determine whether repositories are developed unevenly: the lower the entropy, the more unevenly the work is distributed.

Figure 4.13 shows the distribution of the entropy values for all models. From these figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


[Figure 4.13: per-model histograms of entropy (binned 0.00-3.00), with "Count of Records" on the vertical axis, for each of the seven models.]

Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models studied.
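Measuring uniqueness requires deciding when a fork counts as "changed". One cheap proxy, sketched here purely as an illustration, is to treat a fork whose pushed_at timestamp never advances past its created_at as untouched; the field names come from the GitHub repository metadata, but the helper functions are our own sketch, not STAMPER's actual implementation (the thesis also compares repository size differences in bytes):

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%SZ"  # GitHub API timestamp format

def fork_is_unchanged(fork: dict) -> bool:
    """True if the fork shows no pushes after the moment it was created."""
    created = datetime.strptime(fork["created_at"], FMT)
    pushed = datetime.strptime(fork["pushed_at"], FMT)
    return pushed <= created  # a fresh fork inherits the parent's last push

def unique_percent(forks: list) -> float:
    """Percentage of forks that differ from their origin."""
    if not forks:
        return 0.0
    changed = sum(not fork_is_unchanged(f) for f in forks)
    return 100.0 * changed / len(forks)
```

Applied per model, unique_percent yields the quantity plotted in Figure 4.14.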

[Figure 4.14: boxplot of unique_percent per model.]

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that lets people see at a glance what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15. Most of the model repositories are not changed after forking. In more detail, we can see at a glance that not only are changes rarely made after forking, but also that the changes that are made are small.


[Figure 4.15: per-model histograms of the uniqueness percentage (binned 0.00-1.00), with "Count of Records" on the vertical axis.]

Figure 4.15: Repository Uniqueness Distribution (%)

[Figure 4.16: per-model histograms of mean lines changed (binned -2500 to 2500), with "Count of Records" on the vertical axis.]

Figure 4.16: Repository Change Statistic


Most changed repositories differ from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing newer models. We hypothesize two main reasons for this result. First, the newer models have not been released for long, and the lack of tutorials and attention makes them less engaged with. Second, a model may only be valid for specific types of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section we survey software maintenance in these deep learning related repositories. The overall purpose is to explore three factors related to maintenance: development time, the number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

    age = T(updated_at) − T(created_at)    (4.6)
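Equation (4.6) amounts to a timestamp subtraction over the created_at and updated_at fields returned by the GitHub API; a minimal sketch (the function name is ours):

```python
from datetime import datetime

def repo_age_days(created_at: str, updated_at: str) -> float:
    """Age of a repository in days, per Eq. (4.6), from ISO-8601 timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    created = datetime.strptime(created_at, fmt)
    updated = datetime.strptime(updated_at, fmt)
    return (updated - created).total_seconds() / 86400  # seconds per day
```

Applied to each repository's metadata, this yields the development-time values summarised in Table 4.6.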

Figure 4.17 and Table 4.6 show how development time varies per model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide & Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). We therefore hypothesize that many of the earlier models started being used in the open-source web community immediately after their first release.


Model       | Max (days) | Q3 (days) | Median (days) | Q1 (days) | Min (days)
Bert        | 779        | 229       | 110           | 32        | 0
Transformer | 1254       | 321       | 142           | 11        | 0
Wide & Deep | 1107       | 575       | 117           | 0.5       | 0
ResNet      | 1360       | 456.5     | 120           | 1.5       | 0
NCF         | 1120       | 476       | 216           | 8         | 0
LSTM        | 1812       | 621.25    | 315.5         | 47.25     | 0
CNN         | 1385       | 699.25    | 483           | 270.25    | 0

Table 4.6: Repository Development Time Statistics

[Figure 4.17: boxplot of development time (days) per model.]

Figure 4.17: Development Time Boxplot


[Figure 4.18: scatter plot of develop_duration against open_issues, coloured by model.]

Figure 4.18: Development Time vs. Number of Open Issues

4.3.2 RQ2: Do old models have more issues than new models?

Figure 4.18 shows a scatter plot correlating development time with the number of open issues. As visually suggested by the figure and a Spearman correlation test, there is a moderate correlation between the two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, which are costlier to maintain, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model       | Mean  | Std    | 25% | 50% | 75% | Min | Max
Bert        | 8.299 | 50.55  | 0   | 0   | 1   | 0   | 504
CNN         | 3.414 | 35.456 | 0   | 0   | 1   | 0   | 1077
LSTM        | 1.292 | 4.915  | 0   | 0   | 1   | 0   | 69
ResNet      | 1.791 | 11.164 | 0   | 0   | 0   | 0   | 186
Transformer | 1.857 | 8.608  | 0   | 0   | 1   | 0   | 95
Wide & Deep | 0.231 | 0.742  | 0   | 0   | 0   | 0   | 4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository | Percentage of repositories having a Wiki (%)
Bert        | 97.17
CNN         | 98.498
LSTM        | 98.799
NCF         | 98.864
ResNet      | 98.817
Transformer | 96.97
Wide & Deep | 100

Table 4.8: Descriptive statistics on the percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All these samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section we first used STAMPER to get the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the collected data.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation of whether developers change the original codebase after forking.

[Figure 4.19: per-model histograms of open_issues (binned 0-100), with "Count of Records" on the vertical axis.]

Figure 4.19: Open Issues vs. Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmarks of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for in-depth analysis of model construction and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

A sampling problem exists at the same time: the models we chose cannot represent all the new models in the wild. This is an open research question which needs further investigation; for example, users may publish their models in prototxt format, whereas in our project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limits. The present experiment uses a limited number of repositories on GitHub, since a search cannot exceed the 1,000-result boundary on originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples might produce a more precise outcome.
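The 1,000-result boundary works out as follows: the GitHub Search API returns at most 100 results per page and refuses pages beyond the first 1,000 results, so a single query yields at most 10 usable pages; changing the sort order only changes which 1,000 repositories fall inside that window. A sketch of the page URLs one query can legally request (the endpoint is the real GitHub Search API; the helper function is illustrative, not STAMPER's code):

```python
from urllib.parse import urlencode

GITHUB_SEARCH = "https://api.github.com/search/repositories"

def search_page_urls(keyword: str, sort: str = "stars", order: str = "desc"):
    """All page URLs reachable for one query under the 1,000-result cap."""
    urls = []
    for page in range(1, 11):  # 10 pages x 100 results = the cap
        qs = urlencode({"q": keyword, "sort": sort, "order": order,
                        "per_page": 100, "page": page})
        urls.append(f"{GITHUB_SEARCH}?{qs}")
    return urls
```

Requesting the same keyword with ascending and then descending order, as done in this project, gives two different 1,000-repository windows, but still no guarantee of full coverage.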

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to apply their own heuristics in data selection: experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software, or becoming a plugin for GitHub, allowing researchers


and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media such as Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models via the number of repositories that exist in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., k-means) to high-resolution time series data from commits.
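As a sketch of what such clustering might look like, here is a minimal k-means over per-repository commit-count series written in plain numpy (in practice one would likely use sklearn.cluster.KMeans; the function name and toy data shape are illustrative, not part of STAMPER):

```python
import numpy as np

def kmeans(series: np.ndarray, k: int, iters: int = 100, seed: int = 0):
    """Minimal k-means: rows of `series` are e.g. weekly commit counts per repo."""
    rng = np.random.default_rng(seed)
    centers = series[rng.choice(len(series), size=k, replace=False)]
    for _ in range(iters):
        # Assign each series to its nearest centre (Euclidean distance)
        dists = np.linalg.norm(series[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centres; keep the old centre if a cluster goes empty
        new_centers = np.array([
            series[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Clusters over such series could separate, say, repositories with a burst of early activity from those with sustained development.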

Chapter 6

Conclusion

This research project identified the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repository metadata. Using this tool, we studied the popularity, maintenance and contribution of deep-learning-related GitHub repositories and identified factors affecting each. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the tool's ability to give users deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving and changing rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract

7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm 2019.1.3 (Professional Edition)
  Build PY-191.7479.30, built on May 30, 2019
  Licensed to ANU / Xing Yu
  JRE 11.0.2+9-b159.60 x86_64
  JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
  macOS 10.14.6

• Anaconda
  - jupyter-notebook 6.0.0

Other

• Python 3.7.4
  - pandas==0.22.0
  - numpy==1.14.0
  - statistics==1.0.3.5
  - ratelimit==2.2.1
  - requests
  - altair
  - matplotlib==2.2.2
  - selenium
• Git

Datasets

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net.json, U-Net pytorch.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents: Before You Begin · Prerequisites · Install · Running · Test · High Level Description of all Modules & Datasets · Authors · License

STAMPER is a python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code:

• PyCharm
• Anaconda
• Amphetamine on the Mac App Store: keeps the Mac awake with this useful app (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

• Git - https://git-scm.com/downloads, plus a GitHub authentication token
• Python 3.7 with pip
• Jupyter Notebook 6.0.0
• All external libraries used, listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip:

    pip3 install --upgrade pip

Install the list of requirements specified in requirements.txt:

    pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project, then run:

    python3 model_searcher.py

to get keyword-related repository metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run:

    sh JSONFormatter.sh

in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be "updated" or "stars"; order can be "asc" or "desc".

2. Repository Search

Run:

    python3 forks_time_stamp_getter.py

to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

Run:

    python3 repository_filter.py

to get your code-related repositories with statistics in the filtered_repo folder, then run:

    python3 filtered_repo.py

to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

Popularity: run python3 visualizations/popularity.py; graphs are written to visualizations/graphs/popularity.

Maintenance: run python3 visualizations/maintenance.py; graphs are written to visualizations/graphs/maintenance.

Contribution: run python3 visualizations/contribution.py; graphs are written to visualizations/graphs/contribution.

Multi Correlations: run python3 visualizations/multi_variable.py; graphs are written to visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them into the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation

Once you have the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g.

    bert = Model("bert tensorflow", "desc_by_star")

with parameters: model name and repository metadata subfolder. Then you can call this object with its related data easily (from Model import bert, and use bert as you go along).

Customize Keywords

In the module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw the graphs)

Experiment Datasets Collected

1. After Data Collection:

output/
  asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
  asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search:

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv


3. After Data Selection (Optional)

filtered_repo/
├── bert.json
├── pytorch_model_filtering/
│   ├── Densenet.json
│   ├── FCN-ResNet101.json
│   ├── GoogleNet.json
│   ├── MobileNet v2.json
│   ├── ResNet101.json
│   ├── ResNext.json
│   ├── ShuffleNet v2.json
│   ├── SqueezeNet.json
│   ├── Tacotron 2.json
│   ├── Wide ResNet.json
│   └── vgg_nets.json
└── tensorflow_model_filtering/
    ├── bert.json
    ├── lstm.json
    ├── ncf.json
    ├── resnet.json
    ├── transformer.json
    └── wide deep.json

Generated Graphs

graphs/
├── contribution/
│   ├── change_to_pdf.bash
│   ├── entropy_distribution.svg
│   ├── entropy_dots.svg
│   ├── lines_changed_boxs.svg
│   ├── lines_changed_hists.svg
│   ├── unique_percentage_distribution.svg
│   └── uniqueness_chart.svg
├── maintenance/
│   ├── devTime_boxplot.svg
│   ├── issues_distribution.svg
│   └── wiki_yn.svg
├── multi_variable/
│   ├── dev_t_to_open_issues.svg
│   ├── multi_correlation.svg
│   ├── star_to_contributors.svg
│   ├── star_to_dev_t.svg
│   ├── star_to_entropy.svg
│   └── star_to_open_issues.svg
└── popularity/
    ├── accumulated_popularity.svg
    ├── creation_repository_trend_total.svg
    ├── creation_with_fork_timeline.svg
    ├── fork_distribution.svg
    ├── popularity_dot.svg
    └── popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu
Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)


• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
    • Deep learning models
    • Summarized Timeline
    • Public Code Repositories
      • Web-based hosting service
      • Measuring Popularity From GitHub
      • Extracting Messy Data in the Wild
      • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    • RQ2: How does popularity vary per model?
    • RQ3: Does the popularity of models relate to other features?
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commitments Timestamp
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Softwares
      • Other
      • Datasets
  • Appendix 4: README

§4.1 Popularity of Deep Learning Models in GitHub

2017, the number of created repositories for CNN gradually increased, and in 2018-2019 the creation trend continued rising to a higher level, where it remains now.

What accounts for this tremendous difference in usage? Currently, CNN and LSTM have two of the largest and most significant communities in the deep learning field. These networks are essential in both computer vision and NLP, and they account for an overwhelming majority of the repositories.

Models in a fast and steady state: ResNet and Transformer

As the title indicates, the usage of ResNet and Transformer has grown significantly in the past two years. Unlike previous structures such as CNN, both of them modify the original structure and significantly improve the results in computer vision and translation tasks.

Rising star: BERT

However, no model comes with perfection. LSTM itself can be extended into many variants, and BERT is one of those.

The current trends, as depicted in the graph, support the conclusion that deep learning models are proliferating quickly with innovative developments. There is surely still ample space to grow and improve.

Models in the frozen zone: NCF and Wide & Deep

One might believe that the popularity of a deep learning model is strongly correlated with how long the model has existed, but our data tell a different story.

Although its paper was published in 2016, NCF draws the least attention in the GitHub community. This also shows that there is no relationship between popularity (i.e., stars) and creation time: past trends are no guarantee of future ones, and the momentum toward increasing attention for a specific deep learning model could flatten out or reverse itself.

The Wide & Deep model, also published in 2016, is similar: although Google provides full documentation and a tutorial for it, we still take a pessimistic view of the model. Moreover, the preceding data also confirm that there is no significant rise in the use of this model.

[Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development — per-model histograms of forks_count (binned) against count of records, for bert, cnn, lstm, ncf, resnet, transformer, and wide deep (TensorFlow).]


4.1.4 RQ2: How popularity varies per model

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see the following.

Model Name     Mean    STD      Min  25%  50%  75%  Max
Bert           498.65  2196.3   0    1    8    43   17940
CNN            106.84  611.97   2    3    8    32   13882
LSTM           48.82   214.22   0    1    2    13   2703
NCF            77      129.91   1    2    3    115  227
ResNet         46.88   221.43   0    0    1    8    2980
Transformer    186.79  1155.87  0    0    4    21   12408
Wide and Deep  16.23   36.80    0    0    1    8    146

Table 4.2: Stars Comparison

Model Name   Mean    STD     Min  25%  50%  75%   Max
Bert         128.21  585.93  0.0  0.0  1.0  16.5  4661.0
CNN          40.71   252.71  0.0  1.0  4.0  14.0  6274.0
LSTM         17.79   71.96   0.0  0.0  1.0  5.0   968.0
NCF          34.33   58.60   0.0  0.5  1.0  51.5  102.0
ResNet       17.44   93.75   0.0  0.0  0.0  3.0   1442.0
Transformer  53.52   336.10  0.0  0.0  1.0  6.0   3637.0
Wide & Deep  7.28    16.36   0.0  0.0  0.0  2.5   71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79), and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52), and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide & Deep (16.23), ResNet (46.88), and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide & Deep (7.28), ResNet (17.44), and LSTM (17.79).

Kruskal-Wallis Test. The Kruskal-Wallis test is used in this project to compare more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen:

• H0: the 7 models' distributions are the same

• H1: the 7 models' distributions are different

from scipy.stats import kruskal

stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                  dfNcf["star"].tolist(), dfResnet["star"].tolist(), dfTransformer["star"].tolist(),
                  dfWideDeep["star"].tolist())
print(stat, p)
# >> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' star distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity analysis based on the number of repositories existing on GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or BERT model may require a large amount of time and effort, while developers still show their interest in those novel deep learning models by starring and forking them.

[Figure 4.9: Star vs Contributors — scatter plot of number_of_contributors against stargazers_count, per model.]

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.


[Figure 4.10: Star vs Development Time — scatter plot of develop_duration against stargazers_count, per model.]

[Figure 4.11: Star vs Open Issues — scatter plot of open_issues against stargazers_count, per model.]

[Figure 4.12: Star vs Entropy Value — scatter plot of entropies against stargazers_count, per model.]

Number of Contributors. From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Across all repositories, the top-3 models with the most stars per contributor are CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor), and Bert (155.0 stars/contributor).
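The ρ values quoted in this section come from Spearman rank correlation tests. As a self-contained sketch of what that statistic computes (pure Python rather than SciPy, with ties resolved by average ranks; the usage data are invented — the thesis values come from the collected repository metadata):

```python
import statistics

def average_ranks(xs):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        # Extend j over any run of equal values (a tie group).
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.mean(rx), statistics.mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# A perfectly monotonic relationship gives rho ≈ 1.
print(spearman_rho([3, 10, 250, 40], [1, 2, 5, 3]))
```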

Model          Percentage of One-Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL related repositories


Development Time. From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value ≈ 0). This suggests that the longer a model has been developed, the more stars it will have (i.e., the model becomes more popular). The top-2 repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues. From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesize that the more popular a repository becomes, the more issues it will have. We further investigate this correlation in the following section.

Entropy. From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project we also investigate the impact of a new feature, the entropy value, on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development work is not distributed evenly.

We can also examine this using Table 4.4: most deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether the contributions are even or not.

Entropy. In particular, we compute the entropy H of each repository, defined as

\[ p_i = \frac{c_i}{\sum_i c_i} \tag{4.1} \]

\[ H = -\sum_i p_i \log_2(p_i) \tag{4.2} \]

where i denotes the i-th contributor, c_i the i-th contributor's contribution, and \(\sum_i c_i\) the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example: its contributions are summarized in Table 4.5, and its corresponding entropy can then be calculated.

\[ \text{Total} = 174 + 36 + 4 = 214 \tag{4.3} \]

\[ p_1 = \frac{174}{214}, \quad p_2 = \frac{36}{214}, \quad p_3 = \frac{4}{214} \tag{4.4} \]

\[ H(\text{repository}) = -\left(\frac{174}{214}\log_2\frac{174}{214} + \frac{36}{214}\log_2\frac{36}{214} + \frac{4}{214}\log_2\frac{4}{214}\right) \approx 0.7826 \tag{4.5} \]

name           contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample Contributions to One Repository

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly. The lower the entropy, the higher the separation, which indicates more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy values for all models. From these figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by a single developer or by a team with an uneven allocation of work.
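The entropy computation is small enough to sketch directly. This is a straightforward implementation of Equations (4.1)-(4.2); the contribution counts are the Table 4.5 example:

```python
import math

def contribution_entropy(contributions):
    """Entropy H of a repository's per-contributor contribution counts."""
    total = sum(contributions)
    return -sum((c / total) * math.log2(c / total) for c in contributions if c > 0)

# dragen1860/TensorFlow-2x-Tutorials example: contributions of 174, 36, and 4.
print(round(contribution_entropy([174, 36, 4]), 4))  # ≈ 0.7826

# An evenly split two-person repository has entropy 1.0;
# a single-developer repository has entropy 0.
print(contribution_entropy([50, 50]), contribution_entropy([100]))
```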

[Figure 4.13: Collaboration Entropy — per-model histograms of entropy (binned) against count of records.]

4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their forked repositories' metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models.
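A minimal sketch of how a fork's "uniqueness" can be decided from repository metadata is shown below. The `size` and `pushed_at` fields are real GitHub REST API fields, but the comparison rule is an assumed simplification of STAMPER's heuristic, and the sample values are invented:

```python
def fork_is_unique(parent, fork):
    """A fork counts as unique if its size differs from the parent's
    or it has been pushed to after the parent's last push."""
    return fork["size"] != parent["size"] or fork["pushed_at"] > parent["pushed_at"]

def unique_percent(parent, forks):
    """Percentage of forks that differ from the original repository."""
    if not forks:
        return 0.0
    return 100.0 * sum(fork_is_unique(parent, f) for f in forks) / len(forks)

# ISO-8601 timestamps in the same format compare correctly as strings.
parent = {"size": 1200, "pushed_at": "2019-06-01T00:00:00Z"}
forks = [
    {"size": 1200, "pushed_at": "2019-05-30T00:00:00Z"},  # untouched copy
    {"size": 1450, "pushed_at": "2019-07-12T00:00:00Z"},  # modified after forking
]
print(unique_percent(parent, forks))  # -> 50.0
```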

[Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot) — unique_percent per model.]

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was to provide a summarized view that shows what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. A more detailed analysis shows at a glance not only that changes are rarely made after forking, but also that most changed

[Figure 4.15: Repository Uniqueness Distribution (%) — per-model histograms of percentage (binned) against count of records.]

[Figure 4.16: Repository Change Statistic — per-model histograms of mean lines changed (binned) against count of records.]


forks differ from the original repository in size by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them of less concern to people. Second, the model itself may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the amount of development across forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section we survey software maintenance problems in these deep learning related repositories. The overall purpose of this section is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

\[ \text{age} = T(\text{updated\_at}) - T(\text{created\_at}) \tag{4.6} \]

Figure 4.17 and Table 4.6 show how development time varies depending on the model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide & Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal-Wallis test, the distributions of development days differ across models (p-value ≤ 0.05). Therefore, we hypothesize that many of these earlier models started being used on the open-source web community immediately after their first release.
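Equation (4.6) can be evaluated directly from the ISO-8601 timestamps that GitHub returns. Here, created_at and updated_at are real fields of the GitHub repository API, while the sample values are invented:

```python
from datetime import datetime

GITHUB_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def repo_age_days(created_at, updated_at):
    """age = T(updated_at) - T(created_at), in whole days (Eq. 4.6)."""
    created = datetime.strptime(created_at, GITHUB_TIME_FORMAT)
    updated = datetime.strptime(updated_at, GITHUB_TIME_FORMAT)
    return (updated - created).days

print(repo_age_days("2018-10-31T00:00:00Z", "2019-02-18T00:00:00Z"))  # -> 110
```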


Model        Max (days)  Q3 (days)  Median (days)  Q1 (days)  Min (days)
Bert         779         229        110            32         0
Transformer  1254        321        142            11         0
Wide & Deep  1107        575        117            0.5        0
ResNet       1360        456.5      120            1.5        0
NCF          1120        476        216            8          0
LSTM         1812        621.25     315.5          47.25      0
CNN          1385        699.25     483            270.25     0

Table 4.6: Repository Development Time Statistics

[Figure 4.17: Development Time Boxplot — distribution of development days per model.]


[Figure 4.18: Development Time vs Number of Open Issues — scatter plot of open_issues against develop_duration, per model.]

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and the Spearman correlation test, there is a weak correlation between those two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, which cost more to maintain, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest average number of open issues are Bert (8.30), CNN (3.41), and Transformer (1.86). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide & Deep  0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository  Percentage of Repositories Having a Wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
ResNet                    98.817
Transformer               96.97
Wide & Deep               100

Table 4.8: Descriptive statistics on percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data we collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this chapter, we first used STAMPER to collect the metadata of repositories related to DL models from GitHub. We then investigated three common aspects of software engineering (popularity, contribution, and maintenance) in deep learning repositories, using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation tests. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation of whether developers change the original code base after forking.


[Figure 4.19: Open Issues vs Number of Repositories — per-model histograms of open_issues (binned) against count of records.]


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the wild: Limitation and Improvement

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas on the identification of models using different strategies. We developed heuristics for in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.
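One such heuristic can be sketched as a keyword scan over a repository's Python sources. This is an illustrative sketch only, not the exact implementation: detect_model is a hypothetical helper, and the keyword list mirrors the LSTM API keywords used elsewhere in this report.

```python
# Sketch: flag a source file as "model-constructing" if it mentions any
# of the high-level API calls associated with a given model.
LSTM_KEYWORDS = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]

def detect_model(source_code, keywords=LSTM_KEYWORDS):
    """Return True if the source text contains any model-construction API."""
    return any(k in source_code for k in keywords)

code = "cell = tf.nn.rnn_cell.LSTMCell(128)"
print(detect_model(code))  # True
```

A real heuristic would also have to cope with aliased imports (e.g. `from tensorflow.nn import rnn_cell`), which is one reason simple keyword matching cannot be perfect.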

A sampling problem exists at the same time. The models we chose cannot represent all the new models in the wild; this is an open research question which needs further investigation in the future (for example, users may use the prototxt format to publish their models). In our project, we only focused on deep learning models constructed using Python. The findings may also reflect sampling problems: the present experiment uses a limited number of repositories from GitHub, since a single search cannot exceed the 1000-repository boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this strategy still cannot capture all the repositories in GitHub. Other, more stratified samples might produce a more precise outcome.
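The sorting workaround can be sketched as issuing one search per sort/order combination and later merging results by repository id. A sketch under assumptions: q, sort, order and per_page are the documented GitHub Search API query parameters, but no network call is made here and the merge step is omitted.

```python
from urllib.parse import urlencode

BASE = "https://api.github.com/search/repositories"

def search_urls(query):
    """Build one search URL per sort/order strategy to partially work
    around the 1000-result cap of a single GitHub search."""
    urls = []
    for sort in ("stars", "forks", "updated"):
        for order in ("asc", "desc"):
            params = urlencode({"q": query, "sort": sort,
                                "order": order, "per_page": 100})
            urls.append(f"{BASE}?{params}")
    return urls

urls = search_urls("lstm tensorflow")
print(len(urls))  # 6
```

Even the union of these six result sets is not guaranteed to cover every matching repository, which is the limitation noted above.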

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real life, an idea that was novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly: different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild; this program could then provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to develop their own heuristics for data selection: experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software, or becoming a plugin for GitHub, allowing researchers



and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project, we investigated and examined the popularity of deep learning models via the number of repositories that exist in GitHub. It is very likely that the commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine-learning clustering algorithms (e.g., K-Means) to high-resolution time-series data from commits.
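The idea can be sketched with a tiny one-dimensional K-Means over weekly commit counts. This is an illustrative, stdlib-only sketch: a real implementation would more likely use scikit-learn's KMeans on commit-timestamp series, and the sample series below is invented.

```python
import random

def kmeans_1d(values, k=2, iters=50, seed=0):
    """Minimal 1-D K-Means: cluster scalar activity levels into k groups."""
    random.seed(seed)
    centers = random.sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            # assign each value to its nearest center
            i = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[i].append(v)
        # recompute centers; keep the old center if a cluster went empty
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Invented weekly commit counts: a quiet phase followed by a burst of activity.
weekly_commits = [1, 2, 0, 3, 2, 1, 40, 55, 48, 60, 52]
print(kmeans_1d(weekly_commits))
```

The two resulting centers separate the low-activity weeks from the high-activity burst, which is the kind of trend change such clustering would surface in commit histories.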

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of deep-learning-related GitHub repositories and identified factors affecting those domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields and will serve the needs of people working at the intersection of social media analysis, data visualization and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past, based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

bull PyCharm

PyCharm 2019.1.3 (Professional Edition)
Build #PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
macOS 10.14.6

bull Anaconda

– jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0
  - numpy==1.14.0
  - statistics==1.0.3.5
  - ratelimit==2.2.1
  - requests
  - altair
  - matplotlib==2.2.2
  - selenium
- Git

Datasets

asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

Before You Begin
Prerequisites
Install
Running
Test
High Level Description of all Modules & Datasets
Authors
License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view, edit and execute our code:

- PyCharm
- Anaconda
- Amphetamine (Mac App Store): keeps the Mac awake, which is useful because otherwise it will disconnect from the internet

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used, listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running

All the code scripts run from the root

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repositories' metadata from GitHub in the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample Case

In main(), change keywords in terms of interest. The resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort could be updated or stars, and order could be asc or desc.

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the forks' timestamps in forked_timestamp.

3. Data Selection (Optional)

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder. Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run python3 visualizations/popularity.py; graphs are saved in visualizations/graphs/popularity
- Maintenance: run python3 visualizations/maintenance.py; graphs are saved in visualizations/graphs/maintenance
- Contribution: run python3 visualizations/contribution.py; graphs are saved in visualizations/graphs/contribution
- Multi Correlations: run python3 visualizations/multi_variable.py; graphs are saved in visualizations/graphs/multi_variable

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module will record all the unreachable links and write them into the file unreachable_urls.txt.

Usage: change elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation

Since you already have data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with parameters model name and repository-metadata subfolder. Then you can call this object with its relative data easily (from Model import bert, and use bert as you go along).

Customize Keywords

In module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection

output/
- asc_by_star: cnn tensorflow.json, lstm tensorflow.json
- asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv


Generated Graphs

3. After Data Selection (Optional)

filtered_repo/
- bert.json
- pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
- tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json


graphs/
- contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
- maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
- multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
- popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)


Figure 4.8: Number of Forks Related to Repositories in Deep Learning Model Development (binned fork-count histograms for the bert, cnn, lstm, ncf, resnet, transformer and wide deep TensorFlow repositories)


4.1.4 RQ2: How popularity varies per model

Figure 4.8 shows the distribution of the number of forks for each model. As presented in Table 4.2 and Table 4.3, we can see that:

Model Name     Mean    STD      Min  25%  50%  75%  Max
Bert           498.65  2196.3   0    1    8    43   17940
CNN            106.84  611.97   2    3    8    32   13882
LSTM           48.82   214.22   0    1    2    13   2703
NCF            77.00   129.91   1    2    3    115  227
ResNet         46.88   221.43   0    0    1    8    2980
Transformer    186.79  1155.87  0    0    4    21   12408
Wide and Deep  16.23   36.80    0    0    1    8    146

Table 4.2: Stars Comparison

Model Name     Mean    STD     Min  25%  50%  75%   Max
Bert           128.21  585.93  0.0  0.0  1.0  16.5  4661.0
CNN            40.71   252.71  0.0  1.0  4.0  14.0  6274.0
LSTM           17.79   71.96   0.0  0.0  1.0  5.0   968.0
NCF            34.33   58.60   0.0  0.5  1.0  51.5  102.0
ResNet         17.44   93.75   0.0  0.0  0.0  3.0   1442.0
Transformer    53.52   336.10  0.0  0.0  1.0  6.0   3637.0
Wide and Deep  7.28    16.36   0.0  0.0  0.0  2.5   71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52) and NCF (34.33).

The 3 models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The 3 models whose repositories have the lowest average number of forks are Wide and Deep (7.28), LSTM (17.79) and ResNet (17.44).
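Descriptive statistics like those in Tables 4.2 and 4.3 reduce to mean, standard deviation, minimum, quartiles and maximum over per-repository counts. A minimal stdlib sketch (the project's analysis presumably used pandas' describe(); the star counts below are invented):

```python
import statistics

def describe(xs):
    """Mean, sample std, min, quartiles and max for a list of counts."""
    xs = sorted(xs)
    q = statistics.quantiles(xs, n=4, method="inclusive")  # 25%, 50%, 75%
    return {
        "mean": statistics.mean(xs),
        "std": statistics.stdev(xs),
        "min": xs[0],
        "25%": q[0], "50%": q[1], "75%": q[2],
        "max": xs[-1],
    }

stars = [0, 1, 8, 43, 2, 5, 17]  # invented per-repository star counts
summary = describe(stars)
print(summary["min"], summary["max"])  # 0 43
```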

Kruskal-Wallis Test. The Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen:

• H0: the 7 models' distributions are the same

• H1: the 7 models' distributions are different

from scipy.stats import kruskal
stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                  dfNcf["star"].tolist(), dfResnet["star"].tolist(),
                  dfTransformer["star"].tolist(), dfWideDeep["star"].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' distributions are not all the same.

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the previous popularity exploration based on the number of repositories in GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex models. Alternatively, building their own Transformer or Bert model requires a large amount of time and effort, but developers still show interest in those novel deep learning models.

Figure 4.9: Star vs. Contributors (scatter plot of stargazers_count against number_of_contributors for the seven models)

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show scatter plots correlating the number of stars with the number of contributors, development time, number of open issues and entropy, respectively.


Figure 4.10: Star vs. Development Time (scatter plot of stargazers_count against develop_duration for the seven models)

Figure 4.11: Star vs. Open Issues (scatter plot of stargazers_count against open_issues for the seven models)

Figure 4.12: Star vs. Entropy Value (scatter plot of stargazers_count against entropies for the seven models)

Number of Contributors. From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303 with p-value ≤ 0.01). Among all the repositories, the top-3 repositories with the most stars per contributor are from the models CNN (1687.5 stars/contributor), Transformer (1551 stars/contributor) and Bert (1550 stars/contributor).

Model          One-Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time. From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341 with p-value ≈ 0). This suggests that the longer a model develops, the more stars it will have (i.e., the model becomes more popular). The top-2 repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues. From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186 with p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can make the assumption that the more popular a repository becomes, the more issues it will have; we further investigate this correlation in the following section.

Entropy. From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206 with p-value ≤ 0.01). In this project, we also investigated the impact of this new feature (the entropy value) on the popularity of GitHub repositories.
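The Spearman tests above rank both variables and then correlate the ranks; in the project this is scipy's spearmanr. The stdlib sketch below handles the no-ties case only via the classic rank-difference formula, and the data points are invented:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation (no tie handling) via
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Invented example: stars vs. development time, perfectly monotone -> rho = 1.0
print(spearman_rho([1, 5, 30, 200], [10, 40, 90, 700]))  # 1.0
```

Because it works on ranks, the statistic is insensitive to the heavy right tails seen in star and fork counts, which is why it suits this data better than Pearson correlation.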

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can examine this using Table 44 most of the deep learning related reposito-ries (great than 70) are developed by one contributor not a team


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Since software development may involve multiple developers, and each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether the contributions are even or not.

Entropy
In particular, we compute the entropy H of each repository, defined as

\( p_i = \frac{c_i}{\sum_i c_i} \)   (4.1)

\( H = -\sum_i p_i \log_2(p_i) \)   (4.2)

where i denotes the i-th contributor, \( c_i \) is the i-th contributor's contribution, and \( \sum_i c_i \) is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example, its contribution table is summarised in Table 4.5, and the corresponding entropy can then be calculated:

Total = 174 + 36 + 4 = 214   (4.3)

\( p_1 = \frac{174}{214}, \quad p_2 = \frac{36}{214}, \quad p_3 = \frac{4}{214} \)   (4.4)

\( H(\text{repository}) = -\left( \frac{174}{214}\log_2\frac{174}{214} + \frac{36}{214}\log_2\frac{36}{214} + \frac{4}{214}\log_2\frac{4}{214} \right) \approx 0.7826 \)   (4.5)

name           contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample contributions to one repository

The resulting distribution of entropy over all repositories can be used to determine whether a repository is developed unevenly: the lower the entropy, the higher the phase separation, which corresponds to more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy values for all models. From these figures we can see that most repositories have an entropy value of around zero, which means that deep-learning-related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.
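As a minimal sketch (not the thesis code), the entropy of Equations 4.1-4.2 can be computed directly; the example below recomputes the contribution counts from Table 4.5.

```python
import math

def contribution_entropy(contributions):
    """Shannon entropy (Eqs. 4.1-4.2) of per-contributor contribution counts."""
    total = sum(contributions)
    probs = [c / total for c in contributions]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Worked example from Table 4.5 (contributions 174, 36 and 4)
print(round(contribution_entropy([174, 36, 4]), 4))  # → 0.7826
```

A single-contributor repository yields an entropy of exactly 0, and k equally contributing developers yield log2(k), which is why values near zero indicate one-developer or highly uneven projects.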

[Figure 4.13: Collaboration Entropy. Histograms of entropy values (binned 0.0-3.0) for the bert, cnn, lstm, ncf, resnet, transformer and wide deep TensorFlow repositories.]


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figures 4.14 and 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories out of the 6 models.
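A simplified, hypothetical sketch of this uniqueness check: each fork is represented here only by GitHub's `size` field (which the API reports in kilobytes), and a fork counts as changed when its size differs from its parent's. The numbers are invented.

```python
# Toy uniqueness check: compare each fork's reported size with the
# parent repository's size. Real STAMPER data would carry more fields.
parent = {"size": 4200}
forks = [{"size": 4200}, {"size": 4200}, {"size": 4315}, {"size": 4199}]

changed = [f for f in forks if f["size"] != parent["size"]]
percent_unique = 100 * len(changed) / len(forks)
print(f"{percent_unique:.1f}% of forks differ from the parent")  # 50.0% ...
```

Size equality is only a coarse proxy: a fork could change content while keeping the same reported size, so a real analysis would also compare line-change statistics, as Figure 4.16 does.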

[Figure 4.14: Percentage of Forked Repositories Unique From Origin (boxplot of unique_percent, 0-100, per model).]

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarised view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. Looking more closely, we can see at a glance not only that changes are rarely made after forking, but also that most changed

[Figure 4.15: Repository Uniqueness Distribution (%). Histograms of uniqueness percentage (binned 0.00-1.00) per model.]

[Figure 4.16: Repository Change Statistic. Histograms of mean lines changed relative to origin (binned -2500 to +2500) per model.]


forks differ in repository size from the original by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesise two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them less used. Second, a model may only be valid for a specific type of data, making it less robust and generalised and less suited to developers' needs.

We conclude that the development of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section we survey software-maintenance problems in these deep-learning-related repositories. The overall purpose is to explore three sets of factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) - T(created_at)   (4.6)
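Equation 4.6 maps directly onto the `created_at` and `updated_at` ISO-8601 timestamps that the GitHub API returns for a repository; a minimal sketch:

```python
from datetime import datetime, timezone

def repo_age_days(created_at, updated_at):
    """Repository age per Eq. 4.6, from GitHub's ISO-8601 timestamp strings."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    t_created = datetime.strptime(created_at, fmt).replace(tzinfo=timezone.utc)
    t_updated = datetime.strptime(updated_at, fmt).replace(tzinfo=timezone.utc)
    return (t_updated - t_created).days

print(repo_age_days("2018-10-17T00:00:00Z", "2019-01-04T00:00:00Z"))  # 79
```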

Figure 4.17 and Table 4.6 show how development time varies by model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). We therefore hypothesise that for many of the earlier models, developers started using the open-source web community immediately after the first release.
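The Kruskal-Wallis comparison can be run with SciPy; the day counts below are invented toy samples, not the collected data, and serve only to show the shape of the test.

```python
# Sketch of the Kruskal-Wallis H-test across per-model development
# times (toy data; real inputs would be the per-repository day counts).
from scipy.stats import kruskal

bert_days = [110, 32, 229, 0, 779]
lstm_days = [315, 47, 621, 1812, 200]
cnn_days = [483, 270, 699, 1385, 100]

stat, p_value = kruskal(bert_days, lstm_days, cnn_days)
print(f"H = {stat:.3f}, p-value = {p_value:.4f}")
```

A p-value at or below 0.05 would, as in the thesis, reject the hypothesis that all models share the same development-time distribution.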


Model        Max (days)  Q3      Median  Q1      Min
Bert         779         229     110     32      0
Transformer  1254        321     142     11      0
Wide deep    1107        575     117     0.5     0
ResNet       1360        456.5   120     15      0
NCF          1120        476     216     8       0
LSTM         1812        621.25  315.5   47.25   0
CNN          1385        699.25  483     270.25  0

Table 4.6: Repository development-time statistics

[Figure 4.17: Development Time Boxplot. Days of development (0-2000) per model.]

[Figure 4.18: Development Time vs Number of Open Issues. Scatter plot of open_issues against develop_duration per model.]

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and the Spearman correlation test, there is a weak correlation between those two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, perhaps because of the high cost of maintenance, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository open-issue statistics


Model-related repository  Percentage of repositories having a wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
ResNet                    98.817
Transformer               96.97
Wide deep                 100

Table 4.8: Descriptive statistics on the percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data we collected, deep-learning-related repositories are well documented (i.e. they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All of these samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this chapter we first used STAMPER to get the metadata of repositories related to DL models from GitHub. We then investigated three common software-engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the data collected with STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation tests. We also reported maintenance metrics for deep learning repositories and conducted an in-depth investigation of whether developers change the original codebase after forking.

[Figure 4.19: Open Issues vs Number of Repositories. Histograms of open-issue counts (binned 0-100) per model.]


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmark deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

A sampling problem exists at the same time: the models we chose cannot represent all the new models in the wild. This is an open research question which needs further investigation; for example, users may use the prototxt format to publish their models, whereas in this project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limits. The present experiment uses a limited number of repositories on GitHub, since a search cannot exceed the 1000-repository boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples might yield a more precise outcome.

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to know how deep learning is involved in our real lives, an idea that is novel compared with previous work.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly, with different models released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program can provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to build their own heuristics for data selection: experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a GitHub plugin, allowing researchers


and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models via the number of repositories existing in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g. K-Means) to high-resolution time-series data from commits.
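One possible shape of this future work, sketched with scikit-learn on invented weekly commit counts (the series and cluster count are assumptions, not results from the thesis):

```python
# Cluster repositories by their weekly commit-count time series using
# K-Means, to separate bursty from steady activity patterns. Toy data.
import numpy as np
from sklearn.cluster import KMeans

weekly_commits = np.array([
    [40, 35, 30, 5, 2, 1],   # bursty early activity
    [38, 30, 25, 4, 1, 0],   # bursty early activity
    [5, 6, 5, 6, 5, 6],      # steady low activity
    [4, 5, 6, 5, 4, 5],      # steady low activity
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(weekly_commits)
print(labels)  # the two bursty repos share one cluster id, the two steady ones the other
```

A real pipeline would first bin each repository's commit timestamps into fixed-width windows before clustering.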

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis in GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub deep-learning-related repositories and identified factors that affect those domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users deeper insight into the current deep learning trend and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models and datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models and datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model and dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated this program is my own original work

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh
1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm 2019.1.3 (Professional Edition); Build PY-191.7479.30, built on May 30, 2019; licensed to ANU / Xing Yu; JRE 11.0.2+9-b159.60 x86_64; JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.; macOS 10.14.6

• Anaconda
  - jupyter-notebook 6.0.0

Other

• Python 3.7.4 with pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
• Git

Datasets

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net.json, U-Net pytorch.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents: Before You Begin · Prerequisites · Install · Running · Test · High Level Description of all Modules & Datasets · Authors · License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing, editing or executing our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keep the Mac awake with this useful app (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt.

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repository metadata from GitHub in the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars, and order can be asc or desc.

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

pip3 install --upgrade pip
pip3 install -r requirements.txt

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder. Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity
- Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance
- Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution
- Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them to the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with the parameters Model_name and repository-metadata subfolder. Then you can use this object with its related data easily (from Model import bert, and use bert as you go along).

Customize Keywords

In the module model_keyword.py, import your instantiation (e.g. lstm) and call add_keywords, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection

output/
  asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
  asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional)

filtered_repo/
  bert.json
  pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
  tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs

graphs/
  contribution/: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
  maintenance/: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
  multi_variable/: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
  popularity/: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu. Stamper image from http://bestpriceforrubberstamps.com (license: free for personal use only).

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265-283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334-344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396-399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7-10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77-ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1-11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433-444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345-355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12-21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173-182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735-1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1-4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142-146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on page 6)


§4.1 Popularity of Deep Learning Models in GitHub

4.1.4 RQ2: How popularity varies per model

Figure 4.8 shows the distribution of the number of forks for each model, and Table 4.2 and Table 4.3 summarize the star and fork statistics.

Model Name     Mean    STD      Min  25%  50%  75%  Max
Bert           498.65  2196.3   0    1    8    43   17940
CNN            106.84  611.97   2    3    8    32   13882
LSTM           48.82   214.22   0    1    2    13   2703
NCF            77.00   129.91   1    2    3    115  227
ResNet         46.88   221.43   0    0    1    8    2980
Transformer    186.79  1155.87  0    0    4    21   12408
Wide and Deep  16.23   36.80    0    0    1    8    146

Table 4.2: Stars Comparison

Model Name     Mean    STD     Min  25%  50%  75%   Max
Bert           128.21  585.93  0.0  0.0  1.0  16.5  4661.0
CNN            40.71   252.71  0.0  1.0  4.0  14.0  6274.0
LSTM           17.79   71.96   0.0  0.0  1.0  5.0   968.0
NCF            34.33   58.60   0.0  0.5  1.0  51.5  102.0
ResNet         17.44   93.75   0.0  0.0  0.0  3.0   1442.0
Transformer    53.52   336.10  0.0  0.0  1.0  6.0   3637.0
Wide and Deep  7.28    16.36   0.0  0.0  0.0  2.5   71.0

Table 4.3: Forks Comparison

The top-3 models whose repositories have the highest average number of stars are Bert (498.65), Transformer (186.79) and CNN (106.84). The top-3 models whose repositories have the highest average number of forks are Bert (128.21), Transformer (53.52) and NCF (34.33).

The three models whose repositories have the lowest average number of stars are Wide and Deep (16.23), ResNet (46.88) and LSTM (48.82). The three models whose repositories have the lowest average number of forks are Wide and Deep (7.28), ResNet (17.44) and LSTM (17.79).

Kruskal-Wallis Test. The Kruskal-Wallis test is used in this project for comparing more than two samples based on ranks. The hypotheses are given below, and a 5% level of significance is chosen:

• H0: the 7 models' distributions are the same

• H1: the 7 models' distributions are different

from scipy.stats import kruskal

stat, p = kruskal(dfBert["star"].tolist(), dfCnn["star"].tolist(), dfLstm["star"].tolist(),
                  dfNcf["star"].tolist(), dfResnet["star"].tolist(), dfTransformer["star"].tolist(),
                  dfWideDeep["star"].tolist())
print(stat, p)
# >> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' star distributions are not all the same.

Summary

The two models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity analysis based on the number of repositories on GitHub. This unexpected result could be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or Bert model requires a large amount of time and effort, yet developers still show their interest in these novel deep learning models by starring and forking them.

[Scatter plot of stargazers_count against number_of_contributors, one series per model.]

Figure 4.9: Star vs Contributors

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11 and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.


[Scatter plot of stargazers_count against develop_duration, one series per model.]

Figure 4.10: Star vs Development Time

[Scatter plot of stargazers_count against open_issues, one series per model.]

Figure 4.11: Star vs Open Issues

[Scatter plot of stargazers_count against entropy, one series per model.]

Figure 4.12: Star vs Entropy Value

Number of Contributors
From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, p-value ≤ 0.01). Across all the repositories, the top-3 repositories with the most stars per contributor come from the models CNN (1687.5 stars/contributor), Transformer (1551 stars/contributor) and Bert (1550 stars/contributor).
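The Spearman rank correlation used throughout this section can be sketched without the full dataset. Below is a stdlib-only illustration (in practice `scipy.stats.spearmanr` computes the same ρ plus a p-value); the `stars` and `contributors` lists are toy values, not the collected data.

```python
def _ranks(xs):
    # Assign 1-based ranks, averaging ranks over ties.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    # Spearman's rho = Pearson correlation computed on the ranks.
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy values standing in for per-repository stars and contributor counts.
stars = [10, 50, 3, 400, 25, 7, 120, 0]
contributors = [1, 4, 1, 9, 2, 1, 6, 1]
print(round(spearman_rho(stars, contributors), 4))
```

Because the statistic only depends on ranks, it is robust to the heavy-tailed star counts seen in these tables.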

Model          Percentage of One-Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time
From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value ≈ 0). This suggests that the longer a model has been in development, the more stars it has (i.e. the model becomes more popular). The two models with the longest development durations are LSTM and CNN.

Open Issues
From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186, p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. This suggests that the more popular a repository becomes, the more open issues it accumulates. We investigate this correlation further in the following section.

Entropy
From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project we also investigate the impact of a new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work of developing a deep learning model is divided evenly. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development effort is not distributed evenly.

We can examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether contributions are evenly distributed.

Entropy
In particular, we compute the entropy H of each repository, defined as

    p_i = c_i / Σ_i c_i    (4.1)

    H = − Σ_i p_i log2(p_i)    (4.2)

where i denotes the i-th contributor, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2x-Tutorials as an example:

The contribution table is summarized in Table 4.5, and its corresponding entropy can then be calculated:

    Total = 174 + 36 + 4 = 214    (4.3)

    p1 = 174/214,  p2 = 36/214,  p3 = 4/214    (4.4)

    H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.783    (4.5)

name           contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample Contributions to One Repository

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly: the lower the entropy, the more unevenly the work is distributed.

Figure 4.13 shows the distribution of the entropy values for all models. From these figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


[Histograms: count of repositories per binned entropy value, one panel per model.]

Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.
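As an illustration (not STAMPER's exact implementation), fork metadata can be paged through via the GitHub REST API's `GET /repos/{owner}/{repo}/forks` endpoint; `list_forks` below is a hypothetical helper:

```python
API = "https://api.github.com"

def fork_list_url(owner, repo, page=1, per_page=100):
    # The forks endpoint supports sort=newest|oldest|stargazers.
    return (f"{API}/repos/{owner}/{repo}/forks"
            f"?sort=newest&per_page={per_page}&page={page}")

def list_forks(owner, repo, token=None):
    import requests  # imported lazily so the URL helper stays dependency-free
    headers = {"Authorization": f"token {token}"} if token else {}
    forks, page = [], 1
    while True:
        resp = requests.get(fork_list_url(owner, repo, page), headers=headers)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:  # an empty page means we have collected every fork
            return forks
        # Keep only the metadata fields this analysis needs.
        forks += [{"full_name": f["full_name"], "created_at": f["created_at"]}
                  for f in batch]
        page += 1
```

Passing an authentication token raises the API rate limit, which matters when walking thousands of forks.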

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models studied.

[Boxplot: percentage of unique forked repositories per model.]

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.
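The uniqueness measure behind these figures can be sketched as follows. The `forks` records are toy data (not the collected dataset), comparing each fork's size field from the repository metadata against its parent's:

```python
# Toy sketch: share of forks whose size differs from the parent
# repository, i.e. forks that changed the codebase after forking.
forks = [
    {"size": 120, "parent_size": 120},
    {"size": 120, "parent_size": 120},
    {"size": 354, "parent_size": 120},
    {"size": 98,  "parent_size": 120},
]

changed = [f for f in forks if f["size"] != f["parent_size"]]
unique_percent = 100 * len(changed) / len(forks)
print(unique_percent)  # → 50.0
```

Computing this per model, over all of a model's forks, yields the distributions plotted in Figure 4.14 and Figure 4.15.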

An example of this view is depicted in Figure 4.15. Most of the model repositories are not changed after forking. To provide a more detailed analysis, we can see at a glance not only that changes are rarely made after forking, but also that most changed


[Histograms: count of forked repositories per binned uniqueness percentage, one panel per model.]

Figure 4.15: Repository Uniqueness Distribution (%)

[Histograms: count of forked repositories per binned mean change relative to the original repository, one panel per model.]

Figure 4.16: Repository Change Statistic


repositories differ from the original repository by only 0 to 100 bytes in size, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them less used. Second, a model may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that development across forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section we survey software maintenance in these deep learning related repositories. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the presence of a wiki page for each repository. We also explore whether the age of a project affects its maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
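Equation 4.6 can be computed directly from the ISO-8601 `created_at`/`updated_at` timestamps in a repository's metadata; a minimal sketch:

```python
from datetime import datetime

def repo_age_days(created_at, updated_at):
    # GitHub returns timestamps such as "2019-10-23T00:00:00Z".
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    created = datetime.strptime(created_at, fmt)
    updated = datetime.strptime(updated_at, fmt)
    return (updated - created).days

print(repo_age_days("2018-10-31T00:00:00Z", "2019-10-23T00:00:00Z"))  # → 357
```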

Figure 4.17 and Table 4.6 show how development time varies by model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide and Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs across models (p-value ≤ 0.05). Therefore we hypothesize that developers started building on many of the earlier models in the open-source community immediately after their first release.


Model          Max (days)  Q3 (days)  Median (days)  Q1 (days)  Min (days)
Bert           779         229        110            32         0
Transformer    1254        321        142            11         0
Wide and Deep  1107        575        117            0.5        0
ResNet         1360        456.5      120            1.5        0
NCF            1120        476        216            8          0
LSTM           1812        621.25     315.5          47.25      0
CNN            1385        699.25     483            270.25     0

Table 4.6: Repository Development Time Statistics

[Boxplot: development time in days per model.]

Figure 4.17: Development Time Boxplot


[Scatter plot of develop_duration against open_issues, one series per model.]

Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and the Spearman correlation test, there is a moderate correlation between the two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, which may have more users and a higher maintenance cost, tend to have more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.30), CNN (3.41) and Transformer (1.86). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model          Mean   Std     25%  50%  75%  Min  Max
Bert           8.299  50.55   0    0    1    0    504
CNN            3.414  35.456  0    0    1    0    1077
LSTM           1.292  4.915   0    0    1    0    69
ResNet         1.791  11.164  0    0    0    0    186
Transformer    1.857  8.608   0    0    1    0    95
Wide and Deep  0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository  Percentage of repositories having a Wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
ResNet                    98.817
Transformer               96.97
Wide and Deep             100

Table 4.8: Descriptive statistics on the percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e. they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All these samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section we first used STAMPER to collect the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the collected data.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.


[Histograms: count of repositories per binned number of open issues, one panel per model.]

Figure 4.19: Open Issues vs Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvement

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for an in-depth analysis of model construction and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation. For example, users may publish their models in prototxt format, whereas in this project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limits: GitHub's search API returns at most 1000 results per query, so the experiment cannot exceed that boundary for originally created repositories. We tried to mitigate this using the different sorting strategies GitHub provides, but this still cannot capture all repositories on GitHub. Other, more stratified samples might produce a more precise outcome.
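The work-around described above can be sketched as follows (a hypothetical helper, not the exact STAMPER code): the same keyword is queried under several sort/order combinations, and the result sets are de-duplicated by repository `full_name`.

```python
# Merge several capped search-result sets, keeping the first occurrence
# of each repository (identified by its full_name).
def merge_result_pages(result_sets):
    seen, merged = set(), []
    for results in result_sets:
        for repo in results:
            if repo["full_name"] not in seen:
                seen.add(repo["full_name"])
                merged.append(repo)
    return merged

# Toy result sets, e.g. from sort=stars&order=desc and sort=updated&order=asc.
a = [{"full_name": "u1/bert-demo"}, {"full_name": "u2/bert-fork"}]
b = [{"full_name": "u2/bert-fork"}, {"full_name": "u3/bert-tf"}]
print(len(merge_result_pages([a, b])))  # → 3
```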

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in real life, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program can provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to apply their own heuristics in data selection: experts can easily change the API search in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or becoming a plugin for GitHub, allowing researchers



and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends on GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models through the number of repositories that exist on GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g. k-means) to high-resolution time series data from commits.
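As a first step toward such trend detection, raw commit timestamps can be bucketed into a weekly time series, the kind of input a clustering algorithm could then take; a stdlib-only sketch with toy timestamps:

```python
from collections import Counter
from datetime import datetime

# Toy ISO-8601 commit timestamps, standing in for a repository's history.
commits = ["2019-01-02T10:00:00Z", "2019-01-03T11:30:00Z", "2019-01-15T09:00:00Z"]

def weekly_counts(timestamps):
    # Count commits per ISO (year, week) bucket.
    weeks = Counter()
    for ts in timestamps:
        dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
        year, week, _ = dt.isocalendar()
        weeks[(year, week)] += 1
    return dict(weeks)

print(weekly_counts(commits))
```

The resulting per-week vectors, one per repository or model, could then be fed to k-means or a similar algorithm to group repositories with similar activity trends.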

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of deep-learning-related GitHub repositories and identified factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the tool's ability to give users a deeper insight into the current deep learning trend and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields and serve the needs of people working at the intersection of social media analysis, data visualization and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores lets developers learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use.

• Develop visualization/analysis techniques for representing trends in their use.

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated this program is my own original work

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

  PyCharm 2019.1.3 (Professional Edition)
  Build #PY-191.7479.30, built on May 30, 2019
  Licensed to ANU / Xing Yu
  JRE: 11.0.2+9-b159.60 x86_64
  JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
  macOS 10.14.6

• Anaconda

  – jupyter-notebook 6.0.0

Other

- Python 3.7.4
- pandas==0.22.0
- numpy==1.14.0
- statistics==1.0.3.5
- ratelimit==2.2.1
- requests
- altair
- matplotlib==2.2.2
- selenium
- Git

Datasets

asc_general/
  bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/
  bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/
  bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/
  cnn tensorflow.json, lstm tensorflow.json

pytorch_models/
  AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext.json, ResNext WSL.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net.json, U-Net pytorch.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time/
  bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/
  bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/
  Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp/
  bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAM

PER Mapping the landscape of deep

learning models use in the w

ild

cccccc

STAMPER

STAMPER

Table of Contents

Before You BeginPrerequisitesInstallRunningTestH

igh Level Description of all M

odules amp D

atasetsAuthorsLicense

STAM

PER

STAMPER

is python software tool for exam

ining real-world softw

are repositories related to deep learning models to

aid researchers in understanding past and current trends

Before You Begin

Before you begin we strongly recom

mend you dow

nload PyCharm and Anaconda before view

editexecute ourcodePyCharm

Anaconda

Amphetam

ine on the Mac App Store Keep M

ac awake w

ith this useful App (Otherw

ise it will disconnect

internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repository metadata from GitHub in the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample Case: in main(), change keywords in terms of interest. The resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars, and order can be asc or desc.

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the forks' timestamps in forked_timestamp.

3. Data Selection (Optional)

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder.

Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity.

Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance.

Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution.

Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee your best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module will record all the unreachable links and write them into the file unreachable_urls.txt.

Usage: change elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data, and forked_time_location in three folders.

Instantiation

Since you already got data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with parameters model name and repository-metadata subfolder. Then you can call this object with its related data easily (from Model import bert, and use bert as you go along).

Customize Keywords

In module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection

output/
├── asc_by_star: cnn tensorflow.json, lstm tensorflow.json
├── asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
├── by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
├── desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
├── desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
└── pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional)

filtered_repo/
├── bert.json
├── pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
└── tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs

graphs/
├── contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
├── maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
├── multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
└── popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr. Ben Swift.

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only).


                                                                        • Appendix 4 README

STAMPER in Action

# (continued: Kruskal-Wallis test over the seven models' star counts)
    dfNcf['star'].tolist(), dfResnet['star'].tolist(), dfTransformer['star'].tolist(),
    dfWideDeep['star'].tolist())
print(stat, p)
>> 327.161878375634 1.2287000508128928e-67

Since p < 0.05, we can reject the null hypothesis and conclude that there is significant evidence that the seven models' star distributions are not all the same.
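The Kruskal-Wallis H statistic used above can be reproduced without SciPy. The sketch below is a minimal pure-Python implementation (no tie correction) on made-up star counts; it is illustrative only, not the report's actual analysis script.

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction)."""
    # Pool all observations, rank them (1-based), and sum ranks per group.
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    for rank, (_, gi) in enumerate(pooled, start=1):
        rank_sums[gi] += rank
    # H = 12 / (N (N + 1)) * sum_i R_i^2 / n_i - 3 (N + 1)
    return 12.0 / (n * (n + 1)) * sum(
        rs * rs / len(g) for rs, g in zip(rank_sums, groups)) - 3 * (n + 1)

# Illustrative star counts for three hypothetical models:
print(kruskal_h([504, 120, 3], [1077, 5, 2], [69, 7, 1]))
```

With the seven per-model star lists collected by STAMPER, `scipy.stats.kruskal` would give the same statistic up to its tie correction.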

Summary

The top-2 models with both the greatest average number of stars and the greatest average number of forks are Bert and Transformer.

However, this finding differs from the earlier popularity exploration based on the number of repositories in GitHub. This unexpected result can be interpreted in two ways. It may be that developers prefer to start with easier models and work their way up to more complex ones. Alternatively, building their own Transformer or Bert model may require a large amount of time and effort, yet developers still show their interest in these novel deep learning models by starring and forking them.

Figure 4.9: Star vs Contributors (stargazers_count vs number_of_contributors, per model)

4.1.5 RQ3: Does the popularity of models relate to other features?

Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show the scatter plots correlating the number of stars with the number of contributors, development time, number of open issues, and entropy, respectively.

§4.1 Popularity of Deep Learning Models in GitHub

Figure 4.10: Star vs Development Time (stargazers_count vs develop_duration, per model)

Figure 4.11: Star vs Open Issues (stargazers_count vs open_issues, per model)

Figure 4.12: Star vs Entropy Value (stargazers_count vs entropy, per model)

Number of Contributors

From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, p-value ≤ 0.01). Among all the repositories, the top-3 repositories with the most stars per contributor are from the models CNN (16875 stars/contributor), Transformer (1551 stars/contributor), and Bert (1550 stars/contributor).
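In the tie-free case, the Spearman ρ values quoted throughout this section reduce to ranking both variables and applying the rank-difference formula. The stdlib-only sketch below (with made-up star and contributor counts) illustrates the computation; it is not the report's analysis code, which would also need tie handling.

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation, assuming no tied values."""
    def ranks(vs):
        # 1-based rank of each element in its list.
        order = sorted(range(len(vs)), key=vs.__getitem__)
        r = [0] * len(vs)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    # rho = 1 - 6 * sum(d_i^2) / (n (n^2 - 1))
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# A perfectly monotone stars-vs-contributors relationship gives rho = 1:
print(spearman_rho([10, 200, 3000, 18000], [1, 3, 5, 30]))  # → 1.0
```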

Model          Percentage of one-contributor development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time

From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value ≈ 0). This suggests that the longer a model develops, the more stars it will have (i.e., the model becomes more popular). The top-2 models with the longest development duration are LSTM and CNN.

Open Issues

From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186, p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesize that the more popular a repository becomes, the more issues it will have. We investigate this correlation further in the following section.

Entropy

From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project we also investigate the impact of a new feature, the entropy value, on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly in developing a deep learning model. After collecting all the contributions, we calculate the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether contributions are even or not.

Entropy

In particular, we compute the entropy H of each repository, defined as

p_i = c_i / Σ_i c_i    (4.1)

H = − Σ_i p_i log2(p_i)    (4.2)

where i represents the i-th contributor, c_i the i-th contributor's contribution, and Σ_i c_i the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example: its contribution table is summarized in Table 4.5, and the corresponding entropy can then be calculated as

Total = 174 + 36 + 4 = 214    (4.3)

p_1 = 174/214,  p_2 = 36/214,  p_3 = 4/214    (4.4)

H(repository) = −(174/214 log2(174/214) + 36/214 log2(36/214) + 4/214 log2(4/214)) ≈ 0.80133    (4.5)

Name           Contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample contributions to one repository

The resulting distribution of entropy over all the repositories can be used to determine whether repositories are developed unevenly: the lower the entropy, the more unevenly the work is distributed.

Figure 4.13 shows the distribution of the entropy value for all models. From these figures we can see that most of the repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.

Figure 4.13: Collaboration Entropy (distribution of binned entropy values, per model)


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories out of the six models.

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was to provide a summarized view that shows what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15. Most of the model repositories are not changed after forking. In a more detailed analysis, we can see at a glance not only that changes are rarely made after forking, but also that most changed


Figure 4.15: Repository Uniqueness Distribution (%)

Figure 4.16: Repository Change Statistic (distribution of binned mean lines changed, per model)

repositories differ from the original repository by only 0 to 100 bytes of repository size, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them of less concern to people. Second, a model itself may only be valid for a specific type of data, making it less robust and generalized, and less suited to developers' needs.

We conclude that the forked repositories' development size is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey software maintenance in these deep learning related repositories. The overall purpose of this section is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report, we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
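Equation (4.6) applies directly to the ISO-8601 timestamps the GitHub API returns for each repository; a minimal sketch (the example timestamps are made up):

```python
from datetime import datetime

def repo_age_days(created_at, updated_at):
    """age = T(updated_at) - T(created_at), in whole days."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # GitHub API timestamp format
    delta = datetime.strptime(updated_at, fmt) - datetime.strptime(created_at, fmt)
    return delta.days

print(repo_age_days("2018-10-31T18:34:17Z", "2019-02-18T18:34:17Z"))  # → 110
```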

Figure 4.17 and Table 4.6 show how development time varies per model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide & deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal-Wallis test, the distributions of development days differ between models (p-value ≤ 0.05). We therefore hypothesize that for many of the earlier models, developers started using the open-source web community immediately after the first release.


Model        Max (days)  Q3      Median  Q1      Min
Bert         779         229     110     32      0
Transformer  1254        321     142     11      0
Wide & deep  1107        575     117     0.5     0
ResNet       1360        456.5   120     1.5     0
NCF          1120        476     216     8       0
LSTM         1812        621.25  315.5   47.25   0
CNN          1385        699.25  483     270.25  0

Table 4.6: Repository development time statistics

Figure 4.17: Development Time Boxplot (days, per model)


Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of open issues. As visually suggested by the figure and the Spearman correlation test, there is a weak correlation between these two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, due to the high cost of maintenance, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414), and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository open issue statistics


Model        Repositories having a wiki (%)
Bert         97.17
CNN          98.498
LSTM         98.799
NCF          98.864
ResNet       98.817
Transformer  96.97
Wide deep    100

Table 4.8: Descriptive statistics on percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common aspects of software engineering (popularity, contribution, and maintenance) in deep learning repositories, using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original codebase after forking.


Figure 4.19: Open Issues vs Number of Repositories (distribution of binned open-issue counts, per model)


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitation and Improvement

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas on identifying models using different strategies, and we developed heuristics for an in-depth analysis of model construction and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.
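As an illustration of the kind of heuristic meant here, the sketch below scans a repository's Python sources for model-construction API calls. The keyword list mirrors the LSTM search keywords listed in the appendix; the function names and dictionary layout are ours, not STAMPER's exact implementation.

```python
import os

# Illustrative keyword table: API calls whose presence suggests a repository
# constructs a given model (the LSTM entries match the report's search keywords).
MODEL_KEYWORDS = {
    "lstm": ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"],
}

def models_used(source_text):
    """Return the set of model labels whose API keywords appear in the source."""
    found = set()
    for model, keywords in MODEL_KEYWORDS.items():
        if any(kw in source_text for kw in keywords):
            found.add(model)
    return found

def scan_repository(root):
    """Scan every .py file under `root` and collect the models it appears to use."""
    hits = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".py"):
                path = os.path.join(dirpath, name)
                try:
                    with open(path, encoding="utf-8", errors="ignore") as f:
                        hits |= models_used(f.read())
                except OSError:
                    continue  # unreadable file: skip it
    return hits
```

A substring match like this is exactly the kind of heuristic that can miss aliased imports or dynamically built models, which is the limitation discussed above.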

A sampling problem exists at the same time. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation in the future. For example, users may publish their models in the prototxt format, whereas our project focused only on deep learning models constructed in Python. Our findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, because a single search cannot exceed the 1,000-results boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this strategy still cannot capture all the repositories on GitHub. Other, more stratified samples might yield a more precise outcome.
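The sorting-strategy workaround can be sketched as follows (illustrative code, not STAMPER's exact implementation): each (sort, order) combination yields a separate query, each capped at 1,000 results, and the union of their pages widens, but does not complete, the sample.

```python
from itertools import product

BASE = "https://api.github.com/search/repositories"

def search_urls(keyword, per_page=100):
    """Build one search URL per (sort, order) combination.

    GitHub's Search API returns at most 1,000 results per query, so issuing
    the same keyword query under several sortings broadens coverage, even
    though the union is still not guaranteed to be exhaustive.
    """
    urls = []
    for sort, order in product(("stars", "updated"), ("asc", "desc")):
        urls.append(f"{BASE}?q={keyword}&sort={sort}&order={order}&per_page={per_page}")
    return urls
```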

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real lives, an idea that is novel compared with previous work.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly: different models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and to explore other models in the wild, so that the program can provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to devise their own heuristics for data selection; experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software, or becoming a plugin for GitHub, allowing researchers


and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends on GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models via the number of repositories existing on GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g. k-means) to high-resolution time-series data from commits.
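As a sketch of how such clustering might look, the toy k-means below (pure NumPy, illustrative only; a real study would normalise the series and choose k in a principled way) groups repositories by their weekly commit-count series.

```python
import numpy as np

def kmeans(series, k=2, iters=50, seed=0):
    """Tiny k-means over rows of `series` (one row = one repository's
    weekly commit counts). Returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(series, dtype=float)
    # initialise centroids from k distinct rows
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each series to its nearest centroid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centroids; keep the old centroid if a cluster empties
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

With k=2 this would, for example, separate steadily active repositories from near-dormant ones.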

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of deep-learning-related GitHub repositories, and identified factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to help users gain a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields, and will serve the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train and develop their own projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past, based on previous research prototyping. Have the model stores kept their models up to date? What are the differences between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract

7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh
1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

  PyCharm 2019.1.3 (Professional Edition)
  Build #PY-191.7479.30, built on May 30, 2019
  Licensed to ANU / Xing Yu
  JRE: 11.0.2+9-b159.60 x86_64
  JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
  macOS 10.14.6

• Anaconda

  – jupyter-notebook 6.0.0

Other

- Python 3.7.4
  -- pandas==0.22.0  -- numpy==1.14.0
  -- statistics==1.0.3.5  -- ratelimit==2.2.1
  -- requests  -- altair  -- matplotlib==2.2.2
  -- selenium
- Git

Datasets

asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

Before You Begin · Prerequisites · Install · Running · Test · High Level Description of all Modules & Datasets · Authors · License

STAMPER

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code: PyCharm, Anaconda.
Amphetamine on the Mac App Store: keep the Mac awake with this useful app (otherwise the internet will disconnect).

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run `python3 model_searcher.py` to get the metadata of keyword-related repositories on GitHub into the `output` folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions `call_api` and `call_api_dir`.
Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample case: in `main()`, change `keywords` to the terms of interest. The resulting JSON file will be `output/bert.JSON`. The sorting method can be customised in the functions `get_total_pages` and `request_ith_page`: `sort` can be `updated` or `stars`; `order` can be `asc` or `desc`.

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the forks' timestamps in `forked_timestamp`.

3. Data Selection (Optional)

Run `python3 repository_filter.py` to get your code-related repositories with statistics in the `filtered_repo` folder.
Run `python3 filtered_repo.py` to filter your data.
Note: your keywords can be customised in `model_keyword.py`. We store all the previous experiment data in `tensorflow_model_filtering` and `pytorch_model_filtering`.

4. Data Visualization

Popularity: run `python3 visualizations/popularity.py` and get your graphs in `visualizations/graphs/popularity`.
Maintenance: run `python3 visualizations/maintenance.py` and get your graphs in `visualizations/graphs/maintenance`.
Contribution: run `python3 visualizations/contribution.py` and get your graphs in `visualizations/graphs/contribution`.
Multi Correlations: run `python3 visualizations/multi_variable.py` and get your graphs in `visualizations/graphs/multi_variable`.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in `test.py`. This module records all the unreachable links and writes them to the file `unreachable_urls.txt`.
Usage: change the elements in `keywords`, then run `python3 test.py`. All the unreachable links will be written to `unreachable_urls.txt`.

Customizing Your Own Search

In the module `Model.py`, define your own entity lists (e.g. `tensorflow_models`). In the constructor `Model`, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have data from the previous steps (1-2), you can construct a model by calling the constructor `Model`, e.g. `bert = Model("bert tensorflow", "desc_by_star")` (parameters: model name and repository-metadata subfolder). Then you can use this object with its related data easily (`from Model import bert` and use `bert` as you go along).

Customize keywords: in the module `model_keyword.py`, import your instantiation (e.g. `lstm`) and call `add_keywords`, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection — output/
   asc_by_star: cnn tensorflow.json, lstm tensorflow.json
   asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
   by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
   desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
   desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
   pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search — forked_timestamp/
   bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional) — filtered_repo/
   bert.json
   pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
   tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs — graphs/
   contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
   maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
   multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
   popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu
Stamper image from http://bestpriceforrubberstamps.com (License: Free for personal use only)

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)


§4.1 Popularity of Deep Learning Models in GitHub

[Figure: scatter plot of stargazers_count vs develop_duration per model]

Figure 4.10: Stars vs Development Time

[Figure: scatter plot of stargazers_count vs open_issues per model]

Figure 4.11: Stars vs Open Issues

[Figure: scatter plot of stargazers_count vs entropy per model]

Figure 4.12: Stars vs Entropy Value

Number of Contributors
From Figure 4.9 and the Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, p-value ≤ 0.01). Across all the repositories, the top-3 repositories by stars per contributor come from the models CNN (168.75 stars/contributor), Transformer (155.1 stars/contributor) and Bert (155.0 stars/contributor).

Model           Percentage of one-contributor development (%)
Bert            74.53
CNN             83.3
LSTM            85.9
NCF             100
ResNet          90.26
Transformer     81.20
Wide and Deep   89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories

sect41 Popularity of Deep Learning Models in GitHub 33

Development Time
From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value ≈ 0). This suggests that the longer a model has been in development, the more stars it will have (i.e. the model becomes more popular). The top-2 repositories by development duration belong to the models LSTM and CNN.

Open Issues
From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with open issues (ρ = 0.6186, p-value ≤ 0.01). As the figure visually suggests, there is a strong positive relationship between stars and open issues. Interestingly, we can make the assumption that the more popular a repository becomes, the more issues it will have; we investigate this correlation further in the following section.

Entropy
From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project we also investigate the impact of this new feature (the entropy value) on the popularity of GitHub repositories.
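The ρ values above come from Spearman correlation tests. `scipy.stats.spearmanr` is the usual tool (it also reports the p-value); as a dependency-free sketch, Spearman's ρ is simply the Pearson correlation of the ranks:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors.
    (scipy.stats.spearmanr returns the same rho plus a p-value.)"""
    def rank(v):
        v = np.asarray(v, dtype=float)
        order = v.argsort()
        ranks = np.empty(len(v))
        ranks[order] = np.arange(1, len(v) + 1)
        # tied values receive the average of their ranks
        for val in np.unique(v):
            mask = v == val
            ranks[mask] = ranks[mask].mean()
        return ranks
    rx, ry = rank(x), rank(y)
    return float(np.corrcoef(rx, ry)[0, 1])
```

Because it works on ranks, any monotone relationship between, say, stars and development time gives ρ = 1 even when the relationship is nonlinear.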

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development is not distributed evenly.

We can examine this using Table 4.4: most of the deep-learning-related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether contributions are even or not.

Entropy
In particular, we compute the entropy $H$ of each repository, defined as

$$p_i = \frac{c_i}{\sum_i c_i} \tag{4.1}$$

$$H = -\sum_i p_i \log_2(p_i) \tag{4.2}$$

where $i$ denotes the $i$-th contributor, $c_i$ the $i$-th contributor's contribution, and $\sum_i c_i$ the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example, its contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

$$\text{Total} = 174 + 36 + 4 = 214 \tag{4.3}$$

$$p_1 = \frac{174}{214}, \quad p_2 = \frac{36}{214}, \quad p_3 = \frac{4}{214} \tag{4.4}$$

$$H(\text{repository}) = -\left(\frac{174}{214}\log_2\frac{174}{214} + \frac{36}{214}\log_2\frac{36}{214} + \frac{4}{214}\log_2\frac{4}{214}\right) \approx 0.7826 \tag{4.5}$$

name            contribution
dragen1860      174
ash3n           36
kelvinkoh0308   4

Table 4.5: Sample Contributions to One Repository
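Equation (4.2) amounts to a few lines of Python. The sketch below is a minimal illustration (the function name is ours, not the API of STAMPER's `entropy_calculation.py`):

```python
from math import log2

def contribution_entropy(contributions):
    """Entropy H (Eq. 4.2) of a repository's per-contributor commit counts.
    H near 0 means one contributor dominates; higher H means more even work."""
    total = sum(contributions)
    probs = [c / total for c in contributions]
    # 0 * log2(0) is taken as 0, so zero-probability terms are skipped
    return -sum(p * log2(p) for p in probs if p > 0)
```

For the worked example above, `contribution_entropy([174, 36, 4])` reproduces the value in Equation (4.5); a single-contributor repository gives exactly 0.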

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly: the lower the entropy, the higher the phase separation, which indicates more unevenly distributed work.

Figure 4.13 shows the distribution of the entropy values for all models. From these figures we can see that most repositories have an entropy value of around zero, which means that deep-learning-related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


[Figure: per-model histograms of entropy (binned) vs count of records]

Figure 4.13: Collaboration Entropy


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models.
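The uniqueness percentage behind these figures can be sketched as follows. This is a simplified stand-in for STAMPER's actual computation: it treats a fork as "unique" when its repository size differs from the origin's, the size-based proxy used later in this section.

```python
def unique_fork_percentage(fork_sizes, origin_size):
    """Percentage of forks whose repository size differs from the original.

    `fork_sizes` is a list of the `size` fields from the forks' GitHub
    metadata; `origin_size` is the original repository's size.
    """
    if not fork_sizes:
        return 0.0
    changed = sum(1 for s in fork_sizes if s != origin_size)
    return 100.0 * changed / len(fork_sizes)
```

Computed per model over all its forks, these percentages are what the boxplot in Figure 4.14 summarizes.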

[Figure: boxplot of unique_percent per model]

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was to provide a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. A more detailed analysis shows at a glance not only that changes are rarely made after forking, but also that most changed


[Figure: per-model histograms of uniqueness percentage (binned) vs count of records]

Figure 4.15: Repository Uniqueness Distribution (%)

[Histograms of mean change size (binned, −2500 to 2500) against count of records, one panel per model: bert tensorflow, cnn tensorflow, lstm tensorflow, ncf tensorflow, resnet tensorflow, transformer tensorflow, wide deep tensorflow]

Figure 4.16: Repository Change Statistic


forks differ from the original repository by only 0 to 100 bytes of repository size, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long; the lack of tutorials and attention leaves them less noticed by people. Second, the model itself may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that development effort across forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey software maintenance in these deep-learning-related repositories. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and effort, and older systems tend to have more maintenance problems. In this report, we calculate the age of each repository from its creation and last-update times, as depicted in the equation below:

age = T(updated_at) − T(created_at)    (4.6)
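Equation 4.6 amounts to a timestamp difference on the GitHub metadata fields. A minimal sketch (the ISO-8601 format shown is how the GitHub API serialises `created_at` and `updated_at`; the function name is ours):

```python
from datetime import datetime, timezone

def repository_age_days(created_at, updated_at):
    """Development time in days, per age = T(updated_at) - T(created_at).

    GitHub timestamps are ISO-8601 strings such as '2018-10-31T18:42:25Z'.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    created = datetime.strptime(created_at, fmt).replace(tzinfo=timezone.utc)
    updated = datetime.strptime(updated_at, fmt).replace(tzinfo=timezone.utc)
    return (updated - created).days

print(repository_age_days("2018-10-31T18:42:25Z", "2019-02-18T20:12:00Z"))  # 110
```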

Figure 4.17 and Table 4.6 show how development time varies by model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide & Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs across models (p-value ≤ 0.05). We therefore hypothesize that many of the earlier models began to be used in the open-source web community immediately after their first release.
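The Kruskal-Wallis test used here can be reproduced with SciPy. The sample values below are hypothetical stand-ins for three of the per-model development-time lists, not the study's data:

```python
from scipy import stats

# Hypothetical development-time samples (days) for three of the models;
# the real test ran over all seven groups.
bert = [110, 32, 229, 0, 779, 95]
lstm = [315, 47, 621, 1812, 180, 400]
cnn = [483, 270, 699, 1385, 120, 50]

# Non-parametric test that the groups come from the same distribution
h_stat, p_value = stats.kruskal(bert, lstm, cnn)
if p_value <= 0.05:
    print("development-time distributions differ across models")
```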


Model        Max    Q3      Median  Q1      Min
Bert         779    229     110     32      0
Transformer  1254   321     142     11      0
Wide & Deep  1107   575     117     0.5     0
ResNet       1360   456.5   120     15      0
NCF          1120   476     216     8       0
LSTM         1812   621.25  315.5   47.25   0
CNN          1385   699.25  483     270.25  0

Table 4.6: Repository Development Time Statistics (days)

[Boxplot of development time in days (0–2000) per model: bert tensorflow, cnn tensorflow, lstm tensorflow, ncf tensorflow, resnet tensorflow, transformer tensorflow, wide deep tensorflow]

Figure 4.17: Development Time Boxplot


[Scatter plot of open_issues (0–2000) against develop_duration (0–1100 days), coloured by model name]

Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of open issues. As suggested visually by the figure and by a Spearman correlation test, there is a weak correlation between the two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, despite the higher cost of maintaining them, may have more users and therefore more issues related to them.
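The Spearman test behind this figure is a one-liner in SciPy. The pairs below are hypothetical stand-ins for the (development time, open issues) data; the study reports coef = 0.4608 over the full dataset:

```python
from scipy import stats

# Hypothetical (development_days, open_issues) pairs
develop_duration = [10, 50, 120, 300, 450, 700, 900, 1100]
open_issues = [0, 0, 1, 2, 0, 5, 30, 12]

# Rank-based correlation: robust to the long-tailed issue counts
coef, p = stats.spearmanr(develop_duration, open_issues)
print(f"Spearman coef = {coef:.4f}, p = {p:.4f}")
```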

Specifically, as depicted in Table 4.7, the three models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide & Deep  0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model        Percentage of repositories having a wiki (%)
Bert         97.17
CNN          98.498
LSTM         98.799
NCF          98.864
ResNet       98.817
Transformer  96.97
Wide & Deep  100

Table 4.8: Descriptive statistics on percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected, we can see that deep-learning-related repositories are well documented (i.e., most have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.
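A minimal sketch of how such wiki percentages can be derived from the collected metadata (`has_wiki` is a real boolean field in GitHub's repository API; the sample rows here are hypothetical, not the study's data):

```python
import pandas as pd

# Hypothetical slice of the repository metadata collected per model
repos = pd.DataFrame({
    "name": ["bert tensorflow"] * 4 + ["wide deep tensorflow"] * 2,
    "has_wiki": [True, True, False, True, True, True],
})

# Mean of a boolean column is the fraction of True values
wiki_pct = repos.groupby("name")["has_wiki"].mean() * 100
print(wiki_pct)
```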

4.4 Summary

In this chapter, we first used STAMPER to collect the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects of deep learning repositories (popularity, contribution and maintenance) using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation of whether developers change the original codebase after forking.


[Histograms of open_issues (binned, 0–100) against count of records, one panel per model: bert tensorflow, cnn tensorflow, lstm tensorflow, ncf tensorflow, resnet tensorflow, transformer tensorflow, wide deep tensorflow]

Figure 4.19: Open Issues vs Number of Repository


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitation and Improvement

Pinpointing the landmarks of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for an in-depth analysis of model construction and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation (for example, users may publish their models in prototxt format). In this project we focused only on deep learning models constructed using Python. The findings may also reflect sampling limits: the present experiment uses a limited number of repositories from GitHub, whose search cannot exceed the 1000-result boundary per query. We tried to overcome this issue using the different sorting strategies provided by GitHub, but this still cannot capture every repository on GitHub. Other, more stratified samples might yield a more precise outcome.
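As a sketch of the sorting-strategy workaround (the 1000-result cap and the `sort`/`order` parameters are real properties of GitHub's search API; the helper name is an assumption of ours):

```python
import requests

API = "https://api.github.com/search/repositories"

def search_slice(query, sort="stars", order="desc", token=None):
    """Fetch one 100-result slice of a GitHub repository search.

    The search API returns at most 1000 results per query, so combining
    slices under different sort/order pairs (stars asc/desc, updated, ...)
    widens coverage without removing the cap entirely.
    """
    headers = {"Authorization": f"token {token}"} if token else {}
    r = requests.get(API, params={"q": query, "sort": sort,
                                  "order": order, "per_page": 100},
                     headers=headers)
    r.raise_for_status()
    return r.json()["items"]
```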

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to learn how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly: different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild; the program could then provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to define their own heuristics for data selection; experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a plugin for GitHub, allowing researchers


and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address classification or regression over GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project, we investigated and examined the popularity of deep learning models through the number of related repositories on GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g. k-means) to the high-resolution time series data from commits.
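A minimal sketch of what such clustering could look like, using k-means over commit timestamps to find bursts of activity (the timestamps below are synthetic, not real commit data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical commit timestamps (Unix seconds): two bursts of activity
rng = np.random.default_rng(0)
commits = np.concatenate([
    rng.normal(1.55e9, 2e5, 50),  # burst around early 2019
    rng.normal(1.57e9, 2e5, 50),  # burst a few months later
]).reshape(-1, 1)

# k-means recovers the two activity bursts as cluster centres
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(commits)
centres = sorted(c[0] for c in km.cluster_centers_)
print(centres)  # two centres near the burst means
```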

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of deep-learning-related repositories on GitHub and identified factors affecting each of these areas. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields and to serve the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

The ML software landscape (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh
1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

- MacBook Pro (Retina, 15-inch, Mid 2015)
- Processor: 2.2 GHz Intel Core i7
- Memory: 16 GB 1600 MHz DDR3
- Graphics: Intel Iris Pro 1536 MB

Software

- PyCharm 2019.1.3 (Professional Edition)
  Build #PY-191.7479.30, built on May 30, 2019
  Licensed to ANU / Xing Yu
  JRE: 11.0.2+9-b159.60 x86_64
  JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
  macOS 10.14.6
- Anaconda
  - jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
- Git

Datasets

- asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
- desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net.json, U-Net pytorch.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json
- filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
- forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents: Before You Begin; Prerequisites; Install; Running; Test; High Level Description of all Modules & Datasets; Authors; License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code:
- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keeps the Mac awake (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:
- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run `python3 model_searcher.py` to get keyword-related repository metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions `call_api` and `call_api_dir`. Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample case: in `main()`, change `keywords` to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions `get_total_pages` and `request_ith_page`: `sort` can be `updated` or `stars`, and `order` can be `asc` or `desc`.

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

Run `python3 repository_filter.py` to get your code-related repositories with statistics in the filtered_repo folder. Run `python3 filtered_repo.py` to filter your data. Note: your keywords can be customized in `model_keyword.py`. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run `python3 visualizations/popularity.py`; graphs are written to visualizations/graphs/popularity.
- Maintenance: run `python3 visualizations/maintenance.py`; graphs are written to visualizations/graphs/maintenance.
- Contribution: run `python3 visualizations/contribution.py`; graphs are written to visualizations/graphs/contribution.
- Multi Correlations: run `python3 visualizations/multi_variable.py`; graphs are written to visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not well maintained, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in `test.py`. This module records all the unreachable links and writes them into the file unreachable_urls.txt. Usage: change the elements in `keywords` and run `python3 test.py`; all unreachable links are written to unreachable_urls.txt.

Customizing Your Own Search

In the module `Model.py`, define your own entity lists (e.g. `tensorflow_models`). In the constructor `Model`, we store all unfiltered data, filtered data and forked-time locations in three folders.

Instantiation: once you have data from the previous steps (1-2), you can construct a model by calling the constructor `Model`, e.g. `bert = Model("bert tensorflow", "desc_by_star")`, with parameters model name and repository metadata subfolder. Then you can use this object with its related data easily (`from Model import bert` and use `bert` as you go along).

Customize keywords: in the module `model_keyword.py`, import your instantiation (`lstm`) and call `add_keywords`, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection — output/:
- asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
- asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search — forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional) — filtered_repo/:
- bert.json
- pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
- tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs — graphs/:
- contribution/: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
- maintenance/: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
- multi_variable/: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
- popularity/: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr. Ben Swift

License and References

MIT © Xing Yu
Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

  • Acknowledgments
  • Abstract
  • List of Abbreviations
  • Contents
  • Introduction
    • Trace Deep Learning use through GitHub
    • Contribution
    • Report Outline
      • Background and Related Work
        • Background
          • Deep learning
            • TensorFlow
            • PyTorch
              • Deep learning models
              • Summarized Timeline
                • Public Code Repositories
                  • Web-based hosting service
                  • Measuring Popularity From GitHub
                  • Extracting Messy Data in the Wild
                  • Visualizing data in Repositories
                    • Summary
                      • STAMPER Design and Implementation
                        • Overview
                        • Data Collection
                        • Repository Search
                        • Data Selection
                          • Example
                            • Construct the Visualizations
                            • Summary
                              • STAMPER in Action
                                • Popularity of Deep Learning Models in GitHub
                                  • Popularity Feature Selection
                                  • Past and Current Status A Full Integration
                                  • RQ1 How has the popularity of model changed over time A closer look at the deep learning models
                                  • RQ2 How popularity varies per model
                                  • RQ3 Does the popularity of models relate to other features
                                    • Contribution of Deep Learning Models in GitHub
                                      • Collaborative Contribution
                                      • RQ1 After forking do developers change the codebase
                                        • Maintenance of Deep Learning Models in GitHub
• RQ1: How long has it been in existence?
• RQ2: Do old models have more issues compared to new models?
• RQ3: Are they well maintained?
• Summary
• Discussion and Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commit Timestamps
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Software
      • Other
      • Datasets
  • Appendix 4: README
Figure 4.12: Star vs Entropy Value (scatter of stargazers_count against entropy, coloured by model name)

Number of Contributors. From Figure 4.9 and a Spearman correlation test, the number of stars is weakly correlated with the number of contributors (ρ = 0.3303, p-value ≤ 0.01). Across all repositories, the top-3 repositories by stars per contributor come from the models CNN (168.75 stars/contributor), Transformer (15.51 stars/contributor) and Bert (15.50 stars/contributor).

Model          Percentage of One-Contributor Development (%)
Bert           74.53
CNN            83.3
LSTM           85.9
NCF            100
ResNet         90.26
Transformer    81.20
Wide and Deep  89.74

Table 4.4: Percentage of one-contributor development for DL-related repositories


Development Time. From Figure 4.10 and a Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value = 0). This suggests that the longer a model is developed, the more stars it accumulates (i.e. the model becomes more popular). The two repositories with the longest development duration belong to the models LSTM and CNN.

Open Issues. From Figure 4.11 and a Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186, p-value ≤ 0.01). As the figure visually suggests, there is a strong positive relationship between stars and open issues. Interestingly, this supports the assumption that the more popular a repository becomes, the more issues it accumulates. We investigate this correlation further in the following section.

Entropy. From Figure 4.12 and a Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project we also investigated the impact of a new feature, the entropy value, on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data collection stage. As mentioned, the goal is to check whether the work of developing a deep learning model is divided evenly. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development effort is not distributed evenly.

We can confirm this using Table 4.4: most of the deep-learning-related repositories (greater than 70%) are developed by one contributor, not a team.
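The Spearman tests reported in this section can be reproduced with `scipy.stats.spearmanr`; as a dependency-free sketch, the same statistic is simply the Pearson correlation of the ranks. All numbers below are illustrative stand-ins, not data from the study:

```python
from statistics import mean

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Assumes no tied values, so each rank list is a permutation of 1..n."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical (stars, open_issues) pairs, one per repository
stars  = [15, 87, 120, 640, 950, 3200, 4500, 12000]
issues = [0, 2, 1, 9, 14, 60, 35, 400]
print(round(spearman_rho(stars, issues), 4))  # → 0.9524
```

A value near 1 indicates a strongly monotonic relationship, matching the interpretation used for the star/development-time and star/open-issue results above.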


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether contributions are evenly distributed.

Entropy. In particular, we compute the entropy H of each repository, defined as

    p_i = c_i / Σ_i c_i          (4.1)

    H = − Σ_i p_i log2(p_i)      (4.2)

where i indexes the contributors, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution for the repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example, the contribution table can be summarized as in Table 4.5, and its corresponding entropy can then be calculated:

    Total = 174 + 36 + 4 = 214                                   (4.3)

    p1 = 174/214,  p2 = 36/214,  p3 = 4/214                      (4.4)

    H(repository) = −(174/214 · log2(174/214) + 36/214 · log2(36/214) + 4/214 · log2(4/214)) ≈ 0.783   (4.5)

name           contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample contributions to one repository
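Equations 4.1–4.2 can be checked directly in code; a minimal sketch using the contribution counts from Table 4.5 (the sum evaluates to approximately 0.783):

```python
import math

def collaboration_entropy(contribs):
    """H = -sum(p_i * log2(p_i)), where p_i is contributor i's share
    of the repository's total contributions (Equations 4.1-4.2)."""
    total = sum(contribs)
    return -sum((c / total) * math.log2(c / total)
                for c in contribs if c > 0)

# Contribution counts from Table 4.5: dragen1860, ash3n, kelvinkoh0308
print(round(collaboration_entropy([174, 36, 4]), 3))  # ≈ 0.783
```

An entropy of 0 means a single contributor did everything; log2(k) is the maximum for k contributors sharing the work exactly evenly.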

The resulting distribution of entropy over all repositories can be used to determine whether repositories are developed unevenly: the lower the entropy, the more concentrated the contributions, and hence the more unevenly the work is distributed.

Figure 4.13 shows the distribution of the entropy values for all models. From those figures we can see that most repositories have an entropy value of around zero, which means that deep-learning-related repositories are either developed mostly by one developer or by a team with a very uneven allocation of work.


Figure 4.13: Collaboration Entropy (binned entropy distribution per model: bert, cnn, lstm, ncf, resnet, transformer, wide deep)


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original. We observe that Bert has a high proportion of unique forked repositories among the models studied.

Figure 4.14: Percentage of Forked Repositories Unique From Origin (boxplot of unique_percent per model)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. A more detailed analysis shows at a glance not only that changes are rarely made after forking, but also that most changes are small.


Figure 4.15: Repository Uniqueness Distribution (%) (binned uniqueness percentage per model)

Figure 4.16: Repository Change Statistic (histograms of mean lines changed, binned, per model)


Most changed repositories differ in size from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing newer models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them of less concern to people. Second, the model itself may be valid only for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section we survey software-maintenance problems in these deep-learning-related repositories. The overall purpose is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a great deal of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

    age = T(updated_at) − T(created_at)    (4.6)
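A sketch of how this age can be computed from the created_at/updated_at timestamps returned by GitHub's repositories API (the ISO-8601 format below is GitHub's; the dates themselves are illustrative):

```python
from datetime import datetime

GITHUB_TIME_FMT = "%Y-%m-%dT%H:%M:%SZ"  # e.g. "2018-10-01T00:00:00Z"

def repo_age_days(created_at, updated_at):
    """Repository age (Equation 4.6) in whole days."""
    t0 = datetime.strptime(created_at, GITHUB_TIME_FMT)
    t1 = datetime.strptime(updated_at, GITHUB_TIME_FMT)
    return (t1 - t0).days

print(repo_age_days("2018-10-01T00:00:00Z", "2019-10-01T00:00:00Z"))  # → 365
```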

Figure 4.17 and Table 4.6 show how development time varies by model. The median development time is: Bert (110 days), Transformer (142 days), Wide & Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). We therefore hypothesize that many of the earlier models started using the open-source web community immediately after their first release.


Model        Max     Q3      Median  Q1      Min
Bert         779     229     110     32      0
Transformer  1254    321     142     11      0
Wide & Deep  1107    575     117     0.5     0
ResNet       1360    456.5   120     1.5     0
NCF          1120    476     216     8       0
LSTM         1812    621.25  315.5   47.25   0
CNN          1385    699.25  483     270.25  0

Table 4.6: Repository development time statistics (days)
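The Kruskal-Wallis comparison applied above (in practice via `scipy.stats.kruskal`) reduces, without tie correction, to the H statistic below; the three development-time samples here are made up for illustration, not the study's data:

```python
def kruskal_h(groups):
    """Kruskal-Wallis H statistic (no tie correction): do k samples
    share a common distribution? Large H -> distributions differ."""
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    rank_sum = [0.0] * len(groups)
    for rank, (_, gi) in enumerate(pooled, start=1):
        rank_sum[gi] += rank
    return 12.0 / (n * (n + 1)) * sum(
        rs * rs / len(g) for rs, g in zip(rank_sum, groups)) - 3 * (n + 1)

# Hypothetical development times (days) for three models
bert = [10, 32, 110, 229, 400]
lstm = [47, 315, 621, 900, 1400]
cnn  = [270, 483, 699, 1000, 1385]
print(round(kruskal_h([bert, lstm, cnn]), 2))  # → 6.18
```

With k = 3 groups (2 degrees of freedom), H = 6.18 exceeds the 5.99 chi-squared critical value, so the null hypothesis of identical distributions would be rejected at p ≤ 0.05.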

Figure 4.17: Development Time Boxplot (days per model)


Figure 4.18: Development Time vs Number of Open Issues (scatter of develop_duration against open_issues per model)

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As the figure visually suggests, and as a Spearman correlation test confirms, there is a weak-to-moderate correlation between the two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, with their higher maintenance cost, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide & Deep  0.231  0.742   0    0    0    0    4

Table 4.7: Repository open-issue statistics


Model-Related Repository   Percentage having Wiki (%)
Bert                       97.17
CNN                        98.498
LSTM                       98.799
NCF                        98.864
ResNet                     98.817
Transformer                96.97
Wide & Deep                100

Table 4.8: Descriptive statistics on the percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. From the data collected, deep-learning-related repositories are therefore well documented (i.e. they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All these samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section we first used STAMPER to get the metadata of repositories related to DL models from GitHub. We then investigated three common software-engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation of whether developers change the original codebase after forking.


Figure 4.19: Open Issues vs Number of Repositories (binned open-issue counts per model)


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the wild: Limitation and Improvement

Pinpointing the landmarks of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation. For example, users may publish their models in prototxt format, whereas in this project we focused only on deep learning models constructed using Python. Our findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, which cannot exceed the 1000-result boundary on original created repositories. We tried to overcome this issue using the different sorting strategies GitHub provides, but this still cannot capture all repositories on GitHub. Other, more stratified samples might yield a more precise outcome.

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real life, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly, with different models released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program can provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to build their own heuristics for data selection: experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software, or becoming a plugin for GitHub, allowing researchers


and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; less attention has been paid to GitHub. Future work could address classification or regression over GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated the popularity of deep learning models via the number of related repositories on GitHub. It is very likely that commit metadata reflects popularity as well. In the future we could move further and develop techniques that apply machine-learning clustering algorithms (e.g. k-means) to the high-resolution time-series data from commits.

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of deep-learning-related GitHub repositories, and identified factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the tool's ability to give users deeper insight into the current deep learning trend and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores lets developers learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract

7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

  PyCharm 2019.1.3 (Professional Edition)
  Build PY-191.7479.30, built on May 30, 2019
  Licensed to ANU / Xing Yu
  JRE: 11.0.2+9-b159.60 x86_64
  JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
  macOS 10.14.6

• Anaconda

  – jupyter-notebook 6.0.0

Other

• Python 3.7.4
• pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
• Git

Datasets

asc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_by_star: bert tensorflow.json, lstm tensorflow.json, wide deep tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, cnn tensorflow.json, ncf tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json

by_update_time: bert tensorflow.json, lstm tensorflow.json, resnet tensorflow.json, wide deep tensorflow.json, cnn tensorflow.json, ncf tensorflow.json, transformer tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json

forked_timestamp: bert tensorflow.csv, lstm tensorflow.csv, resnet tensorflow.csv, wide deep tensorflow.csv, cnn tensorflow.csv, ncf tensorflow.csv, transformer tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

Before You Begin
Prerequisites
Install
Running
Test
High-Level Description of all Modules & Datasets
Authors
License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code:

PyCharm
Anaconda
Amphetamine on the Mac App Store: keep the Mac awake with this useful app (otherwise the internet connection will drop).

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

Git - https://git-scm.com/downloads
GitHub authentication token
Python 3.7 with pip
Jupyter Notebook 6.0.0
All external libraries used are listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repository metadata from GitHub in the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be "updated" or "stars", and order "asc" or "desc".
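For orientation, the data-collection step boils down to calls against the GitHub Search API. The sketch below is a hypothetical, minimal stand-in for model_searcher.py's call_api helper (the endpoint and Accept header are GitHub's documented v3 API; the function names are ours):

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://api.github.com/search/repositories"

def build_request(keyword, sort="stars", order="desc", token=None):
    """Assemble a GitHub Search API request for repositories
    matching `keyword`, sorted as requested."""
    query = urlencode({"q": keyword, "sort": sort, "order": order,
                       "per_page": 100})
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:  # authenticated requests get a much higher rate limit
        headers["Authorization"] = f"token {token}"
    return Request(f"{API}?{query}", headers=headers)

def search_repos(keyword, **kw):
    """Fetch one page of results and return the repositories' full names."""
    with urlopen(build_request(keyword, **kw), timeout=30) as resp:
        data = json.load(resp)
    return [item["full_name"] for item in data["items"]]
```

Search results are paginated and capped at 1000 per query, which is why the tool varies the sort/order parameters to widen coverage.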

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

Run python3 repository_filter.py to get your code-related repositories, with statistics, in the filtered_repo folder.

Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

Popularity: run python3 visualizations/popularity.py; graphs are saved in visualizations/graphs/popularity.

Maintenance: run python3 visualizations/maintenance.py; graphs are saved in visualizations/graphs/maintenance.

Contribution: run python3 visualizations/contribution.py; graphs are saved in visualizations/graphs/contribution.

Multi correlations: run python3 visualizations/multi_variable.py; graphs are saved in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience using our tool, we provide a testing unit for GitHub links in test.py. This module records all unreachable links and writes them to the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with the model name and repository-metadata subfolder as parameters. Then you can use this object with its related data easily (from Model import bert, and use bert as you go along).

Customize Keywords

In module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High-Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization (Altair is used to draw the graphs): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

Experiment Datasets Collected

1. After Data Collection

output/
    asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
    asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv


3. After Data Selection (Optional)

filtered_repo/
    bert.json
    pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
    tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json


Generated Graphs

graphs/
    contribution/: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
    maintenance/: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
    multi_variable/: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
    popularity/: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu
Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)

Bibliography

[a] Github description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] Github description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] Github description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] Github Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: Github's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
    • Deep learning models
    • Summarized Timeline
  • Public Code Repositories
    • Web-based hosting service
    • Measuring Popularity From GitHub
    • Extracting Messy Data in the Wild
    • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    • RQ2: How does popularity vary per model?
    • RQ3: Does the popularity of models relate to other features?
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commit Timestamps
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Softwares
      • Other
      • Datasets
  • Appendix 4: README

sect41 Popularity of Deep Learning Models in GitHub 33

Development Time. From Figure 4.10 and the Spearman correlation test, the number of stars is strongly correlated with development time (ρ = 0.7341, p-value ≈ 0). This suggests that the longer a model has been in development, the more stars it will have (i.e. the model becomes more popular). The two repositories with the longest development duration belong to the LSTM and CNN models.
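The Spearman correlation used throughout this section can be sketched in pure Python: rank both variables (ties share the average rank), then compute Pearson's correlation on the ranks. This is an illustrative sketch only (p-value computation omitted); the report's actual analysis code is not shown here.

```python
def rank(xs):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

print(spearman_rho([1, 2, 3], [10, 20, 30]))  # 1.0 (perfectly monotone)
```

A value near 1 (as for stars vs development time) indicates a strong monotone relationship even when the raw relationship is non-linear.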

Open Issues. From Figure 4.11 and the Spearman correlation test, the number of stars is strongly correlated with the number of open issues (ρ = 0.6186, p-value ≤ 0.01). As visually suggested by the figure, there is a strong positive relationship between stars and open issues. Interestingly, we can hypothesise that the more popular a repository becomes, the more issues it will have. We investigate this correlation further in the following section.

Entropy. From Figure 4.12 and the Spearman correlation test, the number of stars is weakly correlated with entropy (ρ = 0.3206, p-value ≤ 0.01). In this project we also investigate the impact of a new feature (the entropy value) on the popularity of GitHub repositories.

First, we collected each contributor's contribution distribution per repository during the data-collection stage. As mentioned, the goal is to check whether the work is divided evenly when developing a deep learning model. After collecting all the contributions, we calculated the entropy value per repository; Figure 4.13 shows the distribution of the entropy values. The majority of repositories have an entropy value between 0 and 1, which means the development work is not distributed evenly.

We can examine this using Table 4.4: most of the deep learning related repositories (greater than 70%) are developed by one contributor, not a team.


4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether their contributions are even.

Entropy. In particular, we compute the entropy H of each repository, defined as

    p_i = c_i / Σ_i c_i    (4.1)

    H = − Σ_i p_i log₂(p_i)    (4.2)

where i indexes the contributors, c_i is the i-th contributor's contribution, and Σ_i c_i is the total contribution to one repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example, its contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

    Total = 174 + 36 + 4 = 214    (4.3)

    p₁ = 174/214,  p₂ = 36/214,  p₃ = 4/214    (4.4)

    H(repository) = −(174/214 · log₂(174/214) + 36/214 · log₂(36/214) + 4/214 · log₂(4/214)) ≈ 0.7826    (4.5)

    name           contribution
    dragen1860     174
    ash3n          36
    kelvinkoh0308  4

    Table 4.5: Sample Contributions to One Repository
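The worked example can be recomputed with a few lines of Python (a minimal sketch; the contributor counts are those of Table 4.5):

```python
import math

def repo_entropy(contributions):
    """Shannon entropy (base 2) of a repository's contribution distribution."""
    total = sum(contributions)
    return -sum((c / total) * math.log2(c / total)
                for c in contributions if c > 0)

# Contributor counts from Table 4.5: dragen1860, ash3n, kelvinkoh0308
print(round(repo_entropy([174, 36, 4]), 4))  # 0.7826
```

A single-contributor repository gives an entropy of exactly 0, the "one developer does everything" case that dominates Figure 4.13.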

The resulting distribution of entropy across all repositories can be used to determine whether a repository is developed unevenly: the lower the entropy, the higher the phase separation, which corresponds to more unevenly distributed work.

Figure 4.13 shows the distribution of entropy values for all models. From those figures we can see that most repositories have an entropy value of around zero, which means that deep learning related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.

Figure 4.13: Collaboration Entropy (per-model histograms of count of records vs binned entropy value, 0.00–3.00, for the bert, cnn, lstm, ncf, resnet, transformer and wide deep TensorFlow repositories)


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models studied.
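The uniqueness percentage plotted in Figures 4.14 and 4.15 can be sketched as below. Note the heuristic is an assumption for illustration: a fork counts as "unique" when its reported size differs from the original's; the study's actual comparison may be finer-grained.

```python
def unique_fork_percentage(original_size, fork_sizes):
    """Percentage of forks whose size differs from the original repository.

    Sizes stand for the `size` field (in KB) that the GitHub API reports
    per repository; inequality is used as a simple "was changed" heuristic.
    """
    if not fork_sizes:
        return 0.0
    unique = sum(1 for size in fork_sizes if size != original_size)
    return 100.0 * unique / len(fork_sizes)

# Two of four forks differ in size from a 120 KB original
print(unique_fork_percentage(120, [120, 512, 98, 120]))  # 50.0
```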

Figure 4.14: Percentage of Forked Repositories Unique From Origin (boxplot of unique_percent, 0–100, per model)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15. Most of the model repositories are not changed after forking. Looking in more detail, we can see at a glance not only that changes are rarely made after forking, but also that most changed

Figure 4.15: Repository Uniqueness Distribution (%) (per-model histograms of count of records vs binned uniqueness percentage, 0.00–1.00)

Figure 4.16: Repository Change Statistics (per-model histograms of count of records vs binned mean lines changed, −2500 to 2500)


forks differ in repository size from the original by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long; the lack of tutorials and attention makes them of less concern to people. Second, the model itself may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section we survey software-maintenance problems in these deep learning related repositories. The overall purpose is to explore three sets of factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

    age = T(updated_at) − T(created_at)    (4.6)
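Using GitHub's ISO-8601 `created_at`/`updated_at` timestamps, equation (4.6) is a one-line subtraction (a sketch; the example timestamps are made up, not data from the study):

```python
from datetime import datetime

GITHUB_TS = "%Y-%m-%dT%H:%M:%SZ"  # format of the API's created_at / updated_at

def repo_age_days(created_at, updated_at):
    """Repository age in whole days: T(updated_at) - T(created_at)."""
    created = datetime.strptime(created_at, GITHUB_TS)
    updated = datetime.strptime(updated_at, GITHUB_TS)
    return (updated - created).days

print(repo_age_days("2018-10-31T18:42:07Z", "2019-10-23T09:00:00Z"))  # 356
```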

Figure 4.17 and Table 4.6 show how development time varies per model. The median development time is: Bert (110 days), Transformer (142 days), Wide & Deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). We therefore hypothesize that for many of the earlier models, developers started using the open-source web community immediately after the first release.


    Model        Max of days  Q3 of days  Median of days  Q1 of days  Min of days
    Bert         779          229         110             32          0
    Transformer  1254         321         142             11          0
    Wide & Deep  1107         575         117             0.5         0
    ResNet       1360         456.5       120             1.5         0
    NCF          1120         476         216             8           0
    LSTM         1812         621.25      315.5           47.25       0
    CNN          1385         699.25      483             270.25      0

    Table 4.6: Repository Development Time Statistics

Figure 4.17: Development Time Boxplot (days, 0–2000, per model)


Figure 4.18: Development Time vs Number of Open Issues (scatter plot of open_issues, 0–2000, vs develop_duration, 0–1100, per model)

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and the Spearman correlation test, there is a weak correlation between the two variables (ρ = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, given the high cost of maintaining them, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can see that most repositories have fewer than 200 open issues.

    Model        Mean   Std     25%  50%  75%  Min  Max
    Bert         8.299  50.55   0    0    1    0    504
    CNN          3.414  35.456  0    0    1    0    1077
    LSTM         1.292  4.915   0    0    1    0    69
    ResNet       1.791  11.164  0    0    0    0    186
    Transformer  1.857  8.608   0    0    1    0    95
    Wide & Deep  0.231  0.742   0    0    0    0    4

    Table 4.7: Repository Open Issue Statistics


    Model-Related Repository  Percentage of repositories having a Wiki (%)
    Bert                      97.17
    CNN                       98.498
    LSTM                      98.799
    NCF                       98.864
    ResNet                    98.817
    Transformer               96.97
    Wide & Deep               100

    Table 4.8: Descriptive statistics on percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. From the data collected, we can therefore see that deep learning related repositories are well documented (i.e. they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All these samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section we first used STAMPER to get the metadata of repositories related to DL models from GitHub. We then investigated three common software-engineering aspects (popularity, contribution and maintenance) of deep learning repositories, using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation of whether developers change the original codebase after forking.


Figure 4.19: Open Issues vs Number of Repositories (per-model histograms of count of records vs binned open_issues, 0–100)


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmarks of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for in-depth analysis of model construction and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways models are constructed in the real world.

A sampling problem also exists. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation in the future — for example, users may use the prototxt format to publish their models, whereas in this project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, since a single search cannot exceed the 1000-repository boundary. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this strategy still cannot capture all the repositories on GitHub. Other, more stratified samples might yield a more precise outcome.
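The 1000-result cap and the sorting workaround can be illustrated with a small query planner. This is a sketch with assumed constants (GitHub's Search API serves at most 1000 results per query, in pages of up to 100 items); STAMPER's real collection logic lives in model_searcher.py and is not reproduced here.

```python
import math

MAX_RESULTS = 1000  # hard cap on results per search query (assumed constant)
PER_PAGE = 100      # maximum page size (assumed constant)

def pages_to_fetch(total_count):
    """How many pages one query can usefully request, given the API cap."""
    return math.ceil(min(total_count, MAX_RESULTS) / PER_PAGE)

def query_plan(keyword, total_count):
    """Plan two passes — descending then ascending by stars — so that up to
    2000 distinct repositories can be reached for popular keywords."""
    return [
        {"q": keyword, "sort": "stars", "order": order,
         "per_page": PER_PAGE, "page": page}
        for order in ("desc", "asc")
        for page in range(1, pages_to_fetch(total_count) + 1)
    ]

print(pages_to_fetch(3500))                      # 10 (capped at 1000 results)
print(len(query_plan("lstm tensorflow", 3500)))  # 20 queries across two passes
```

For keywords with more than 2000 matching repositories, even both passes together miss the middle of the popularity distribution — the residual bias discussed above.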

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to learn how deep learning is involved in our real lives — an idea that is novel compared with previous work.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers and users to build their own heuristics for data selection: experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software, or becoming a plugin for GitHub, allowing researchers


and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on classification or regression over GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends on GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection Using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models via the number of repositories that exist on GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine-learning clustering algorithms (e.g. k-means) to high-resolution time-series data from commits.
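As a sketch of that idea, a plain one-dimensional k-means over commit timestamps (converted to, say, days since the first commit) can separate bursts of activity. The data below is made up for illustration and the implementation is a minimal sketch, not the proposed production pipeline:

```python
import random

def kmeans_1d(values, k, iters=50, seed=0):
    """Plain 1-D k-means; returns the sorted cluster centres."""
    centers = random.Random(seed).sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            # assign each value to its nearest centre
            idx = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # recompute centres as cluster means (keep old centre if empty)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two bursts of commit activity, around day 10 and day 100
commit_days = [8, 9, 10, 11, 12, 98, 99, 100, 101, 102]
print(kmeans_1d(commit_days, 2))  # [10.0, 100.0]
```

The recovered centres mark the bursts of development activity; tracked per model over time, such centres could serve as the trend signal proposed above.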

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of deep learning related GitHub repositories, and identified the factors affecting each of these areas. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the tool's ability to give users deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization and data science.



Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models and datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores lets developers learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past, based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models and datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization & analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract



7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Softwares

• PyCharm

  PyCharm 2019.1.3 (Professional Edition)
  Build #PY-191.7479.30, built on May 30, 2019
  Licensed to ANU / Xing Yu
  JRE: 11.0.2+9-b159.60 x86_64
  JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
  macOS 10.14.6

• Anaconda

  – jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0
  - numpy==1.14.0
  - statistics==1.0.3.5
  - ratelimit==2.2.1
  - requests
  - altair
  - matplotlib==2.2.2
  - selenium
- Git

Datasets

asc_generalbertjson lstmjson resnetjson wide deepjsoncnnjson ncfjson transformerjson

desc_generalbertjson lstmjson resnetjson wide deepjsoncnnjson ncfjson transformerjson

desc_by_starbert tensorflowjson lstm tensorflowjson wide deep tensorflowjsonresnet tensorflowjson transformer tensorflowjsoncnn tensorflowjson ncf tensorflowjson

asc_by_starcnn tensorflowjson lstm tensorflowjson

pytorch_modelsAlexNetjson HarDNetjson ResNet101jsonShuffleNet v2json U-NetjsonDCGANjson Inception_v3jsonResNext WSLjson SqueezeNetjson WaveGlowjsonDensenetjson MobileNet v2json ResNextjson

54 Appendix

Wide ResNetjson Tacotron 2jsonFCN-ResNet101json PGANjson RoBERTajsonTransformerjson fairseqjsonGoogleNetjson ResNetjsonvgg_netsjson SSDjson U-Net pytorchjson

by_update_timebert tensorflowjson lstm tensorflowjsonresnet tensorflowjson wide deep tensorflowjsoncnn tensorflowjson ncf tensorflowjsontransformer tensorflowjson

filtered_repotensorflow_model_filteringbertjson lstmjsonncfjson resnetjsontransformerjson wide deepjson

filtered_repopytorch_model_filteringDensenetjson GoogleNetjsonResNet101json ShuffleNet v2jsonTacotron 2json vgg_netsjsonFCN-ResNet101json MobileNet v2jsonResNextjson SqueezeNetjson Wide ResNetjson

forked_timestampbert tensorflowcsv lstm tensorflowcsvresnet tensorflowcsv wide deep tensorflowcsvcnn tensorflowcsv ncf tensorflowcsvtransformer tensorflowcsv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

- Before You Begin
- Prerequisites
- Install
- Running
- Test
- High Level Description of all Modules & Datasets
- Authors
- License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code.

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keep your Mac awake with this useful app (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the follow

ing prerequisites on your development m

achine

Git - httpsgit-scm

comdow

nloadsG

it Authentication tokenPython 37 w

ith pip

Jupyter Notebook 600

All external libraries used listed in requirementstxt

Install

Install some dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run `python3 model_searcher.py` to get keyword-related repository metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions `call_api` and `call_api_dir`.

Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample Case

In `main()`, change `keywords` to the terms of interest. The resulting JSON file will be `output/bert.JSON`.

Customize the sorting method in the functions `get_total_pages` and `request_ith_page`: `sort` can be `updated` or `stars`, and `order` can be `asc` or `desc`.

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the forks' timestamps in `forked_timestamp`.

3. Data Selection (Optional)

Run `python3 repository_filter.py` to get your code-related repositories, with statistics, in the `filtered_repo` folder.

Run `python3 filtered_repo.py` to filter your data.

Note: your keywords can be customized in `model_keyword.py`. We store all the previous experiment data in `tensorflow_model_filtering` and `pytorch_model_filtering`.

4. Data Visualization

- Popularity: run `python3 visualizations/popularity.py` and get your graphs in `visualizations/graphs/popularity`
- Maintenance: run `python3 visualizations/maintenance.py` and get your graphs in `visualizations/graphs/maintenance`
- Contribution: run `python3 visualizations/contribution.py` and get your graphs in `visualizations/graphs/contribution`
- Multi Correlations: run `python3 visualizations/multi_variable.py` and get your graphs in `visualizations/graphs/multi_variable`

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in `test.py`. This module records all the unreachable links and writes them into the file `unreachable_urls.txt`.

Usage: change the elements in `keywords`, then run `python3 test.py`. All the unreachable links will be written to `unreachable_urls.txt`.

## Customizing Your Own Search

In module `Model.py`, define your own entity lists (e.g. `tensorflow_models`).

In the constructor `Model`, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

### Instantiation

Since you already have data from the previous steps (1-2), you can construct a model by calling the constructor `Model`, e.g. `bert = Model("bert tensorflow", "desc_by_star")`, with parameters: model name and repository metadata subfolder. Then you can use this object with its related data easily (`from Model import bert` and use `bert` as you go along).

### Customize Keywords

In module `model_keyword.py`, import your instantiation (e.g. `lstm`) and call `add_keywords`, e.g.:

```python
lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)
```

## High Level Description of all Modules & Datasets

1. Data Collection: `model_searcher.py`, `item_filter.py`
2. Repository Search: `model_searcher.py`, `forks_time_stamp_getter.py`
3. (Optional) Data Selection: `repository_filter.py`, `filtered_repo.py`
4. Data Visualization: `contribution_stat.py`, `entropy_calculation.py`, `Analysis/contribution_related.py`, `Analysis/meta_data.py` (Altair is used to draw elegant graphs)

## Experiment Datasets Collected

### 1. After Data Collection

    output
    ├── asc_by_star
    │   ├── cnn tensorflow.json
    │   └── lstm tensorflow.json
    ├── asc_general
    │   ├── bert.json
    │   ├── cnn.json
    │   ├── lstm.json
    │   ├── ncf.json
    │   ├── resnet.json
    │   ├── transformer.json
    │   └── wide deep.json
    ├── by_update_time
    │   ├── bert tensorflow.json
    │   ├── cnn tensorflow.json
    │   ├── lstm tensorflow.json
    │   ├── ncf tensorflow.json
    │   ├── resnet tensorflow.json
    │   ├── transformer tensorflow.json
    │   └── wide deep tensorflow.json
    ├── desc_by_star
    │   ├── bert tensorflow.json
    │   ├── cnn tensorflow.json
    │   ├── lstm tensorflow.json
    │   ├── ncf tensorflow.json
    │   ├── resnet tensorflow.json
    │   ├── transformer tensorflow.json
    │   └── wide deep tensorflow.json
    ├── desc_general
    │   ├── bert.json
    │   ├── cnn.json
    │   ├── lstm.json
    │   ├── ncf.json
    │   ├── resnet.json
    │   ├── transformer.json
    │   └── wide deep.json
    └── pytorch_models
        ├── AlexNet.json
        ├── DCGAN.json
        ├── Densenet.json
        ├── FCN-ResNet101.json
        ├── GoogleNet.json
        ├── HarDNet.json
        ├── Inception_v3.json
        ├── MobileNet v2.json
        ├── PGAN.json
        ├── ResNet.json
        ├── ResNet101.json
        ├── ResNext WSL.json
        ├── ResNext.json
        ├── RoBERTa.json
        ├── SSD.json
        ├── ShuffleNet v2.json
        ├── SqueezeNet.json
        ├── Tacotron 2.json
        ├── Transformer.json
        ├── U-Net pytorch.json
        ├── U-Net.json
        ├── WaveGlow.json
        ├── Wide ResNet.json
        ├── fairseq.json
        └── vgg_nets.json

### 2. After Repository Search

    forked_timestamp
    ├── bert tensorflow.csv
    ├── cnn tensorflow.csv
    ├── lstm tensorflow.csv
    ├── ncf tensorflow.csv
    ├── resnet tensorflow.csv
    ├── transformer tensorflow.csv
    └── wide deep tensorflow.csv


### 3. After Data Selection (Optional)

    filtered_repo
    ├── pytorch_model_filtering
    │   ├── Densenet.json
    │   ├── FCN-ResNet101.json
    │   ├── GoogleNet.json
    │   ├── MobileNet v2.json
    │   ├── ResNet101.json
    │   ├── ResNext.json
    │   ├── ShuffleNet v2.json
    │   ├── SqueezeNet.json
    │   ├── Tacotron 2.json
    │   ├── Wide ResNet.json
    │   └── vgg_nets.json
    └── tensorflow_model_filtering
        ├── bert.json
        ├── lstm.json
        ├── ncf.json
        ├── resnet.json
        ├── transformer.json
        └── wide deep.json

### 4. Generated Graphs

    graphs
    ├── contribution
    │   ├── change_to_pdf.bash
    │   ├── entropy_distribution.svg
    │   ├── entropy_dots.svg
    │   ├── lines_changed_boxs.svg
    │   ├── lines_changed_hists.svg
    │   ├── unique_percentage_distribution.svg
    │   └── uniqueness_chart.svg
    ├── maintenance
    │   ├── devTime_boxplot.svg
    │   ├── issues_distribution.svg
    │   └── wiki_yn.svg
    ├── multi_variable
    │   ├── dev_t_to_open_issues.svg
    │   ├── multi_correlation.svg
    │   ├── star_to_contributors.svg
    │   ├── star_to_dev_t.svg
    │   ├── star_to_entropy.svg
    │   └── star_to_open_issues.svg
    └── popularity
        ├── accumulated_popularity.svg
        ├── creation_repository_trend_total.svg
        ├── creation_with_fork_timeline.svg
        ├── fork_distribution.svg
        ├── popularity_dot.svg
        └── popularity_measurement_correlation.svg


## Authors

Xing (Nicole) Yu, under the supervision of Dr. Ben Swift.

## License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: Free for personal use only).

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)


Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
      • Deep learning models
      • Summarized Timeline
    • Public Code Repositories
      • Web-based hosting service
      • Measuring Popularity From GitHub
      • Extracting Messy Data in the Wild
      • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    • RQ2: How popularity varies per model
    • RQ3: Does the popularity of models relate to other features
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence
    • RQ2: Do old models have more issues compared to new models
    • RQ3: Are they well maintained
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commitments Timestamp
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Softwares
      • Other
      • Datasets
  • Appendix 4: README

34 STAMPER in Action

4.2 Contribution of Deep Learning Models in GitHub

4.2.1 Collaborative Contribution

Considering that software development may involve multiple developers, and that each developer's contribution may not be the same, we introduce an information-theoretic approach to test whether their contributions are even or not.

Entropy. In particular, we compute the entropy $H$ of each repository, defined as

$$p_i = \frac{c_i}{\sum_i c_i} \tag{4.1}$$

$$H = -\sum_i p_i \log_2(p_i) \tag{4.2}$$

where $i$ represents the $i$-th contributor, $c_i$ the $i$-th contributor's contribution, and $\sum_i c_i$ the total contribution for one repository.

Taking the repository dragen1860/TensorFlow-2.x-Tutorials as an example, the contributions are summarized in Table 4.5, and the corresponding entropy can then be calculated:

$$\text{Total} = 174 + 36 + 4 = 214 \tag{4.3}$$

$$p_1 = \frac{174}{214}, \quad p_2 = \frac{36}{214}, \quad p_3 = \frac{4}{214} \tag{4.4}$$

$$H(\text{repository}) = -\left(\frac{174}{214}\log_2\frac{174}{214} + \frac{36}{214}\log_2\frac{36}{214} + \frac{4}{214}\log_2\frac{4}{214}\right) \approx 0.80133 \tag{4.5}$$

name           contribution
dragen1860     174
ash3n          36
kelvinkoh0308  4

Table 4.5: Sample Contributions to One Repository
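The entropy defined in Equations 4.1 and 4.2 can be computed directly from a repository's per-contributor counts; a minimal sketch, applied to the contribution counts of Table 4.5:

```python
import math

def contribution_entropy(contributions):
    """H = -sum(p_i * log2(p_i)), where p_i = c_i / sum(c_i)
    (Equations 4.1 and 4.2)."""
    total = sum(contributions)
    probs = [c / total for c in contributions]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Contribution counts from Table 4.5: dragen1860, ash3n, kelvinkoh0308.
# H is low when a single contributor dominates the repository.
H = contribution_entropy([174, 36, 4])
```

A perfectly even two-person split gives the maximum two-contributor entropy of 1 bit, while a solo-developed repository gives 0.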

The resulting distribution of entropy over all repositories can be used to determine whether a repository is developed unevenly: the lower the entropy, the stronger the separation, which means more unevenly distributed work.

Figure 4.13 shows the distribution of entropy values across all models. From these figures we can see that most repositories have an entropy value of around zero, which means that deep-learning-related repositories are either developed mostly by one developer or by a team with an uneven allocation of work.


[Figure 4.13: Collaboration Entropy. One histogram panel per model (bert, cnn, lstm, ncf, resnet, transformer, wide deep; all TensorFlow), plotting binned entropy against count of records.]


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of unique forked repositories compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the 6 models.

[Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot). One box per model showing the percentage (0 to 100) of forks that differ from the original repository.]

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was to provide a summarized view of what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15: most of the model repositories are not changed after forking. A more detailed analysis shows at a glance not only that changes are rarely made after forking, but also that most changed
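The per-model uniqueness reported in Figures 4.14 and 4.15 boils down to comparing each fork against the original; a minimal sketch using hypothetical repository sizes (STAMPER's actual comparison is based on repository metadata fields):

```python
def uniqueness_percentage(original_size, fork_sizes):
    """Percentage of forks whose repository size differs from the
    original repository (the 'unique' forks of Figures 4.14 and 4.15)."""
    changed = sum(1 for size in fork_sizes if size != original_size)
    return 100.0 * changed / len(fork_sizes)

# Hypothetical sizes in bytes: two untouched forks, two changed forks.
pct = uniqueness_percentage(1024, [1024, 1024, 2048, 900])
```

A value near zero indicates that virtually all forks are verbatim copies of the original codebase.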


[Figure 4.15: Repository Uniqueness Distribution (%). One histogram panel per model plotting binned uniqueness percentage against count of records.]

[Figure 4.16: Repository Change Statistic. One histogram panel per model plotting binned mean lines changed (from -2500 to 2500) against count of records.]


repositories differ in size from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them of less concern to people. Second, a model may only be valid for a specific type of data, making it less robust, less generalized, and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey software maintenance in these deep-learning-related repositories. The overall purpose is to explore three sets of factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report, we calculate the age of each repository from its creation time, as depicted in the equation below:

$$\text{age} = T(\text{updated\_at}) - T(\text{created\_at}) \tag{4.6}$$

Figure 4.17 and Table 4.6 show how development time varies by model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days), and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). Therefore, we hypothesize that many of these earlier models started using the open-source web community immediately after their first release.
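The Kruskal-Wallis test used here compares the development-day distributions across models by ranking all samples jointly. A minimal pure-Python sketch of the H statistic (without the tie correction that a library such as scipy.stats.kruskal applies; the day counts below are hypothetical):

```python
from collections import defaultdict

def kruskal_h(groups):
    """Kruskal-Wallis H statistic over a list of samples.
    Ties share their average rank; no tie correction is applied."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Average 1-based rank of every distinct value.
    positions = defaultdict(list)
    for idx, v in enumerate(pooled, start=1):
        positions[v].append(idx)
    avg_rank = {v: sum(ix) / len(ix) for v, ix in positions.items()}
    h = 0.0
    for g in groups:
        r = sum(avg_rank[v] for v in g)  # rank sum of this group
        h += r * r / len(g)
    return 12.0 / (n * (n + 1)) * h - 3 * (n + 1)

# Hypothetical development-day samples for two models.
bert_days = [110, 32, 229, 0, 779]
lstm_days = [315, 47, 621, 1812, 0]
H = kruskal_h([bert_days, lstm_days])
```

The resulting H is then compared against the chi-squared critical value with k-1 degrees of freedom (k = number of models) to obtain the p-value reported above.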


Model        Max of days  Q3 of days  Median of days  Q1 of days  Min of days
Bert         779          229         110             32          0
Transformer  1254         321         142             11          0
Wide deep    1107         575         117             0.5         0
ResNet       1360         456.5       120             15          0
NCF          1120         476         216             8           0
LSTM         1812         621.25      315.5           47.25       0
CNN          1385         699.25      483             270.25      0

Table 4.6: Repository Development Time Statistics

[Figure 4.17: Development Time Boxplot. One box per model showing development time in days (0 to 2000).]


[Figure 4.18: Development Time vs Number of Open Issues. Scatter plot of open issues against development duration, coloured by model.]

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As suggested visually and confirmed by a Spearman correlation test, there is a weak correlation between the two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, perhaps due to the high cost of maintenance, may have more users and more issues related to them.
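The Spearman coefficient reported above is simply the Pearson correlation computed on ranks; a self-contained sketch (in practice scipy.stats.spearmanr would be used, and the per-repository values below are hypothetical):

```python
def _ranks(xs):
    """Average 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-repository development days and open-issue counts.
dev_days = [110, 142, 483, 216, 120]
open_issues = [5, 3, 40, 12, 2]
rho = spearman(dev_days, open_issues)
```

Because it operates on ranks, the coefficient captures any monotone relationship, which suits the heavily skewed issue counts seen in Table 4.7.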

Specifically, as depicted in Table 4.7, the top 3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414), and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository  Percentage of repositories having Wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
Resnet                    98.817
Transformer               96.97
Wide deep                 100

Table 4.8: Descriptive statistics on percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data we collected, deep-learning-related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All of these samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this chapter, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. We then investigated three common aspects of software engineering (popularity, contribution, and maintenance) in deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.


[Figure 4.19: Open Issues vs Number of Repositories. One histogram panel per model plotting binned open-issue counts (0 to 100) against count of records.]


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the wild: Limitation and Improvement

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas on identifying models using different strategies, and we developed heuristics for an in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question which needs further investigation. For example, users may use the prototxt format to publish their models, whereas our project only focused on deep learning models constructed using Python. The findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, since a search cannot exceed the 1000-repository boundary. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this strategy still cannot capture all repositories on GitHub. Other, more stratified samples might yield a more precise outcome.

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to learn how deep learning is involved in our real life, an idea that is novel compared with previous work.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so the program can provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to implement their own heuristics for data selection; experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a plugin for GitHub, allowing researchers


and developers to access past trends easily.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends on GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project, we investigated and examined the popularity of deep learning models via the number of repositories existing on GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., k-means) to high-resolution time series data from commits.
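The commit-timestamp clustering suggested here can be prototyped with even a tiny one-dimensional k-means; a sketch on hypothetical timestamps (a real analysis would likely use a library such as scikit-learn's KMeans):

```python
def kmeans_1d(points, k, iters=50):
    """Tiny 1-D k-means for clustering commit timestamps
    (e.g. seconds since epoch). Assumes k >= 2; returns sorted centroids."""
    pts = sorted(points)
    # Initialise centroids spread across the data range.
    centroids = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two hypothetical bursts of commit activity.
commits = [100, 105, 110, 900, 905, 910]
centers = kmeans_1d(commits, k=2)
```

Each resulting centroid marks the centre of a burst of commit activity, which is exactly the kind of signal a trend detector over commit history would look for.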

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance, and contribution of deep-learning-related GitHub repositories and identified factors affecting those domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving people working at the intersection of social media analysis, data visualization, and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

The ML software ecosystem (both models & datasets) is evolving and changing rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train, and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing, and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh
1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015); Processor: 2.2 GHz Intel Core i7; Memory: 16 GB 1600 MHz DDR3; Graphics: Intel Iris Pro 1536 MB

Softwares

• PyCharm: PyCharm 2019.1.3 (Professional Edition); Build PY-191.7479.30, built on May 30, 2019; Licensed to ANU / Xing Yu; JRE 11.0.2+9-b159.60 x86_64; JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.; macOS 10.14.6

• Anaconda: jupyter-notebook 6.0.0

Other

Python 3.7.4; pandas==0.22.0; numpy==1.14.0; statistics==1.0.3.5; ratelimit==2.2.1; requests; altair; matplotlib==2.2.2; selenium; Git

Datasets

    asc_general:      bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    desc_general:     bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    desc_by_star:     bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    asc_by_star:      cnn tensorflow.json, lstm tensorflow.json
    by_update_time:   bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    pytorch_models:   AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json
    filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    filtered_repo/pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
    forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents: Before You Begin, Prerequisites, Install, Running, Test, High Level Description of all Modules & Datasets, Authors, License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code:

• PyCharm
• Anaconda
• Amphetamine on the Mac App Store: keeps the Mac awake (otherwise it will disconnect from the internet).

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

• Git - https://git-scm.com/downloads
• GitHub authentication token
• Python 3.7 with pip
• Jupyter Notebook 6.0.0
• All external libraries used are listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the requirements specified in requirements.txt:

    pip3 install --upgrade pip
    pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repositories' metadata from GitHub in the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir. Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample Case: in main(), change keywords in terms of interest. The resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars; order can be asc or desc.

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the forks' timestamps in forked_timestamp.

3. Data Selection (Optional)

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder. Run python3 filtered_repo.py to filter your data. Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

• Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity.
• Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance.
• Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution.
• Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well; their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them into the file unreachable_urls.txt. Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star") (parameters: model name and repository-metadata subfolder). Then you can use this object with its related data easily (from Model import bert and use bert as you go along).

Customize Keywords: in the module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs.)

Experiment Datasets Collected

1. After Data Collection, the output folder contains:

    output
    ├── asc_by_star: cnn tensorflow.json, lstm tensorflow.json
    ├── asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    ├── by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    ├── desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    ├── desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    └── pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search:

    forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional):

    filtered_repo
    ├── bert.json
    ├── pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
    └── tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs:

    graphs
    ├── contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
    ├── maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
    ├── multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
    └── popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr. Ben Swift

License and References

MIT © Xing Yu. Stamper image from http://bestpriceforrubberstamps.com (License: Free for personal use only).
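As a companion to the Data Collection step above, the shape of the underlying GitHub Search API request can be sketched as follows (the endpoint, headers and query parameters are from GitHub's public REST v3 Search API; `build_search_request` is a hypothetical helper for illustration, not a function in STAMPER):

```python
def build_search_request(keyword, token=None, sort="stars", order="desc", page=1):
    """Assemble URL, headers and query parameters for one GitHub repository-search page."""
    url = "https://api.github.com/search/repositories"
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        # An authentication token raises the allowed request rate
        # (see the GitHub Search API rate-limiting documentation).
        headers["Authorization"] = "token {}".format(token)
    params = {"q": keyword, "sort": sort, "order": order,
              "page": page, "per_page": 100}
    return url, headers, params

# e.g. the "asc by update time" sorting strategy described in the README:
url, headers, params = build_search_request("bert tensorflow", sort="updated", order="asc")
```

Passing the result to `requests.get(url, headers=headers, params=params)` would then fetch one page of repository metadata.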

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265-283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334-344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396-399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7-10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77-ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1-11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433-444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345-355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12-21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173-182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735-1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1-4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142-146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
    • Deep learning models
    • Summarized Timeline
  • Public Code Repositories
    • Web-based hosting service
    • Measuring Popularity From GitHub
    • Extracting Messy Data in the Wild
    • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    • RQ2: How does popularity vary per model?
    • RQ3: Does the popularity of models relate to other features?
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commitments Timestamp
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Softwares
      • Other
      • Datasets
  • Appendix 4: README
                                                                        • Appendix 4 README

[Figure: per-model histograms of collaboration entropy (count of records vs. entropy, binned 0.00-3.00) for bert, cnn, lstm, ncf, resnet, transformer and wide deep (all TensorFlow).]

Figure 4.13: Collaboration Entropy
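The "collaboration entropy" plotted in Figure 4.13 is, presumably, Shannon entropy over each contributor's share of the work in a repository. The function below is an illustrative sketch of that measure, not STAMPER's actual entropy_calculation.py:

```python
import math

def collaboration_entropy(contribution_counts):
    """Shannon entropy (base 2) of contributors' shares of commits/lines.

    0.0 means a single contributor does everything; higher values mean
    the work is spread more evenly across contributors.
    """
    total = sum(contribution_counts)
    shares = [c / total for c in contribution_counts if c > 0]
    return -sum(p * math.log2(p) for p in shares)

collaboration_entropy([100])      # one contributor -> 0.0
collaboration_entropy([50, 50])   # even two-way split -> 1.0
```

An entropy near 0 therefore corresponds to the single-author repositories that dominate the left of the histograms.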


4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories out of the 6 models.
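One way to flag a fork as "unique" from repository metadata alone is to compare it against its origin. The following is a heuristic sketch (the `size` field is the GitHub API's repository size in kilobytes; the helper names are hypothetical, and STAMPER's actual selection logic may differ):

```python
def is_changed_fork(fork_meta, origin_meta):
    """Treat a fork as changed if its reported size differs from the origin's."""
    return fork_meta["size"] != origin_meta["size"]

def unique_percentage(fork_metas, origin_meta):
    """Percentage of forks that differ from the original repository."""
    if not fork_metas:
        return 0.0
    changed = sum(is_changed_fork(f, origin_meta) for f in fork_metas)
    return 100.0 * changed / len(fork_metas)

# one of three forks differs from the origin -> ~33.3
unique_percentage([{"size": 3120}, {"size": 3120}, {"size": 3155}], {"size": 3120})
```

A size-based comparison is cheap but coarse: two different edits of equal size would be missed, which is why line-level change statistics (Figure 4.16) complement it.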

[Figure: boxplots of unique_percent (0-100%) per model: bert, cnn, lstm, ncf, resnet, transformer and wide deep (all TensorFlow).]

Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot)

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was to provide a summarized view that lets people see what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15. Most of the model repositories are not changed after forking. Looking in more detail, we can see at a glance not only that changes are rarely made after forking, but also that most changed


[Figure: per-model histograms of repository uniqueness (count of records vs. percentage, binned 0.00-1.00) for the seven TensorFlow models.]

Figure 4.15: Repository Uniqueness Distribution (%)

[Figure: per-model histograms of mean lines changed (count of records vs. means, binned -2500 to 2500) for the seven TensorFlow models.]

Figure 4.16: Repository Change Statistic


repositories differ from the original repository by only 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long; the lack of tutorials and attention makes them of less concern to people. Second, a model may only be valid for a specific type of data, making it less robust and generalized and therefore less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section we survey software maintenance problems in these deep-learning-related repositories. The overall purpose is to explore three sets of factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

    age = T(updated_at) − T(created_at)    (4.6)

Figure 4.17 and Table 4.6 show how development time varies depending on the model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days by model is different (p-value ≤ 0.05). Therefore, we hypothesize that many of these models were adopted by the open-source web community immediately after their first release.
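Equation (4.6) can be evaluated directly from the `created_at` and `updated_at` timestamps returned by the GitHub API. A minimal sketch (the ISO-8601 format string matches the API's timestamp form; the function name is illustrative):

```python
from datetime import datetime

ISO_8601 = "%Y-%m-%dT%H:%M:%SZ"  # GitHub API timestamp format

def repository_age_days(created_at, updated_at):
    """Repository age in whole days, per equation (4.6)."""
    delta = datetime.strptime(updated_at, ISO_8601) - datetime.strptime(created_at, ISO_8601)
    return delta.days

repository_age_days("2018-11-01T00:00:00Z", "2019-02-19T00:00:00Z")  # -> 110
```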


Model       | Max of days | Q3 of days | Median of days | Q1 of days | Min of days
Bert        | 779         | 229        | 110            | 32         | 0
Transformer | 1254        | 321        | 142            | 11         | 0
Wide deep   | 1107        | 575        | 117            | 0.5        | 0
ResNet      | 1360        | 456.5      | 120            | 1.5        | 0
NCF         | 1120        | 476        | 216            | 8          | 0
LSTM        | 1812        | 621.25     | 315.5          | 47.25      | 0
CNN         | 1385        | 699.25     | 483            | 270.25     | 0

Table 4.6: Repository Development Time Statistics

[Figure: boxplots of development time in days (0-2000) for each of the seven TensorFlow models.]

Figure 4.17: Development Time Boxplot


[Figure: scatter plot of develop_duration (0-1100 days) against open_issues (0-2000), colored by model.]

Figure 4.18: Development Time vs. Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As suggested visually by the figure and by a Spearman correlation test, there is a weak correlation between those two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, due to the high cost of maintenance, may have more users and more issues related to them.

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.
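For reference, the Spearman coefficient used in this test can, for tie-free data, be computed from rank differences as ρ = 1 − 6Σd²/(n(n²−1)). A self-contained sketch follows (real analyses, including library implementations, also handle ties and compute p-values):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation for samples without tied values."""
    n = len(xs)

    def ranks(values):
        order = sorted(range(n), key=lambda i: values[i])
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

spearman_rho([1, 2, 3, 4], [10, 20, 30, 40])  # perfectly monotone -> 1.0
```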

Model       | Mean  | Std    | 25% | 50% | 75% | Min | Max
Bert        | 8.299 | 50.55  | 0   | 0   | 1   | 0   | 504
CNN         | 3.414 | 35.456 | 0   | 0   | 1   | 0   | 1077
LSTM        | 1.292 | 4.915  | 0   | 0   | 1   | 0   | 69
ResNet      | 1.791 | 11.164 | 0   | 0   | 0   | 0   | 186
Transformer | 1.857 | 8.608  | 0   | 0   | 1   | 0   | 95
Wide Deep   | 0.231 | 0.742  | 0   | 0   | 0   | 0   | 4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository | Percentage of repositories having Wiki (%)
Bert                     | 97.17
CNN                      | 98.498
LSTM                     | 98.799
NCF                      | 98.864
Resnet                   | 98.817
Transformer              | 96.97
Wide deep                | 100

Table 4.8: Descriptive statistics on percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data we collected, deep-learning-related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.
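The wiki statistics in Table 4.8 can be derived from the `has_wiki` flag in each repository's metadata returned by the GitHub API. A sketch (the aggregation function is hypothetical; STAMPER's actual field handling may differ):

```python
def wiki_percentage(repo_metas):
    """Percentage of repositories whose GitHub metadata reports a wiki (`has_wiki`)."""
    if not repo_metas:
        return 0.0
    with_wiki = sum(bool(r.get("has_wiki")) for r in repo_metas)
    return 100.0 * with_wiki / len(repo_metas)

# two of three repositories have a wiki -> ~66.7
wiki_percentage([{"has_wiki": True}, {"has_wiki": True}, {"has_wiki": False}])
```

Note that `has_wiki` records whether the wiki feature is enabled, which is an upper bound on repositories with actual documentation pages.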

4.4 Summary

In this section we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories, using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation of whether developers change the original code base after forking.


[Figure: per-model histograms of open issues (count of records vs. open_issues, binned 0-100) for the seven TensorFlow models.]

Figure 4.19: Open Issues vs. Number of Repositories

Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the wild: Limitation and Improvement

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas for identifying models using different strategies, and we developed heuristics for in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild. This is an open research question that needs further investigation; for example, users may publish their models in the prototxt format, whereas in this project we only focused on deep learning models constructed using Python. Our findings may also reflect sampling limits: the present experiment uses a limited number of repositories from GitHub, which cannot exceed the 1000-result boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples might give a more precise outcome.

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers and users to implement their own heuristics for data selection; experts could easily change the API search in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or becoming a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on classification or regression over GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated the popularity of deep learning models via the number of repositories existing on GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., k-means) to high-resolution time series data from commits.

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis in GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of deep-learning-related GitHub repositories and identified the factors that affect each of these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields, and will serve the needs of people working at the intersection of social media analysis, data visualization and data science.

47

48 Conclusion

Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

The ML software ecosystem (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what's been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization / analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract

7.3 Appendix 3: Artefact Description

Except where otherwise indicated this program is my own original work

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015); Processor: 2.2 GHz Intel Core i7; Memory: 16 GB 1600 MHz DDR3; Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm 2019.1.3 (Professional Edition); Build PY-191.7479.30, built on May 30, 2019; Licensed to ANU / Xing Yu; JRE 11.0.2+9-b159.60 x86_64; JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o; macOS 10.14.6

• Anaconda
  - jupyter-notebook 6.0.0

Other

- Python 3.7.4, with pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
- Git

Datasets

asc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_by_star: bert tensorflow.json, lstm tensorflow.json, wide deep tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, cnn tensorflow.json, ncf tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json

by_update_time: bert tensorflow.json, lstm tensorflow.json, resnet tensorflow.json, wide deep tensorflow.json, cnn tensorflow.json, ncf tensorflow.json, transformer tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json

forked_timestamp: bert tensorflow.csv, lstm tensorflow.csv, resnet tensorflow.csv, wide deep tensorflow.csv, cnn tensorflow.csv, ncf tensorflow.csv, transformer tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents: Before You Begin, Prerequisites, Install, Running, Test, High Level Description of all Modules & Datasets, Authors, License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing, editing or executing our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keeps the Mac awake (otherwise it will disconnect from the internet).

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run `python3 model_searcher.py` to get keyword-related repository metadata from GitHub in the `output` folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions `call_api` and `call_api_dir`.

Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample case: in `main()`, change `keywords` to the terms of interest; the resulting JSON file will be `output/bert.JSON`. The sorting method can be customized in the functions `get_total_pages` and `request_ith_page`: `sort` can be `updated` or `stars`, and `order` can be `asc` or `desc`.

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the fork timestamps in `forked_timestamp`.

3. Data Selection (Optional)

Run `python3 repository_filter.py` to get your code-related repositories with statistics in the `filtered_repo` folder. Run `python3 filtered_repo.py` to filter your data. Note: your keywords can be customized in `model_keyword.py`. We store all the previous experiment data in `tensorflow_model_filtering` and `pytorch_model_filtering`.

4. Data Visualization

- Popularity: run `python3 visualizations/popularity.py`; graphs appear in `visualizations/graphs/popularity`.
- Maintenance: run `python3 visualizations/maintenance.py`; graphs appear in `visualizations/graphs/maintenance`.
- Contribution: run `python3 visualizations/contribution.py`; graphs appear in `visualizations/graphs/contribution`.
- Multi Correlations: run `python3 visualizations/multi_variable.py`; graphs appear in `visualizations/graphs/multi_variable`.

Test

Some GitHub repositories are not well maintained, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in `test.py`. This module records all the unreachable links and writes them to the file `unreachable_urls.txt`. Usage: change the elements in `keywords` and run `python3 test.py`. All the unreachable links will be written to `unreachable_urls.txt`.

Customizing Your Own Search

In module `Model.py`, define your own entity lists (e.g. `tensorflow_models`). The constructor `Model` stores all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), you can construct a model by calling the constructor `Model`, e.g. `bert = Model("bert tensorflow", "desc_by_star")`, with parameters: model name and repository metadata subfolder. Then you can use this object with its related data easily (`from Model import bert` and use `bert` as you go along).

Customize Keywords: in module `model_keyword.py`, import your instantiation (`lstm`) and call `add_keywords`, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw the graphs)
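The Data Collection step above can be sketched in a few lines. This is a minimal illustration, not STAMPER's actual implementation: the function name `build_search_url` is hypothetical; only the GitHub Search API endpoint and its `q`, `sort`, `order` and `page` parameters are real.

```python
from urllib.parse import urlencode

GITHUB_SEARCH = "https://api.github.com/search/repositories"

def build_search_url(keyword, sort="stars", order="desc", page=1, per_page=100):
    """Build a GitHub Search API URL for one page of keyword results."""
    params = urlencode({
        "q": keyword,
        "sort": sort,      # "stars" or "updated", as described above
        "order": order,    # "asc" or "desc"
        "page": page,
        "per_page": per_page,
    })
    return f"{GITHUB_SEARCH}?{params}"

# An authenticated request gets the higher rate limit mentioned above, e.g.
# urllib.request.Request(url, headers={"Authorization": "token <YOUR_TOKEN>"})
```

Paging through `get_total_pages` results then reduces to calling this builder with increasing `page` values.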

Experiment Datasets Collected

1. After Data Collection

output/
- asc_by_star: cnn tensorflow.json, lstm tensorflow.json
- asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional)

filtered_repo/
- bert.json
- pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
- tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs

graphs/
- contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
- maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
- multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
- popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only).

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265-283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334-344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396-399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7-10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77-ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1-11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433-444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345-355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12-21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173-182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735-1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1-4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142-146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on page 6)



4.2.2 RQ1: After forking, do developers change the codebase?

Given a repository, STAMPER can retrieve the full history of the users who forked the original repository, along with their repository metadata.

Figure 4.14 and Figure 4.15 highlight the percentage of forked repositories that are unique compared to the original one. We observe that Bert has a high proportion of unique forked repositories among the models studied.

[Figure 4.14: Percentage of Forked Repositories Unique From Origin (Boxplot). One box per model (bert, cnn, lstm, ncf, resnet, transformer, wide deep tensorflow); y-axis: unique_percent (0-100).]

Figure 4.16 shows the distribution of the number of lines changed compared to the original repository. Our objective was a summarized view that shows what percentage of developers interested in deep learning are developing a new project.

An example of this view is depicted in Figure 4.15. Most of the model repositories are not changed after forking. To provide a more detailed analysis, we can see at a glance not only that changes are rarely made after forking, but also that most changed


[Figure 4.15: Repository Uniqueness Distribution (%). One histogram per model (count of records vs binned uniqueness percentage, 0.00-1.00).]

[Figure 4.16: Repository Change Statistic. One histogram per model (count of records vs binned mean lines changed, -2500 to 2500).]


repository size differences from the original repository lie between 0 and 100 bytes, as depicted in Figure 4.16.

As mentioned, we found a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons to explain this result. First, new models have not been released for long, and the lack of tutorials and attention makes them less visible to people. Second, a model may only be valid for specific types of data, making it less robust and generalized and less suited to developers' needs.

We conclude that development across forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.
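The uniqueness measurement above can be sketched as follows. This is a minimal sketch, not STAMPER's actual heuristic: it assumes a fork counts as "changed" when its repository size differs from the original's, using the `size` field available in GitHub repository metadata.

```python
def uniqueness_percent(original_size, fork_sizes):
    """Percentage of forks whose repository size differs from the original.

    original_size: the `size` field of the original repository's metadata.
    fork_sizes:    list of `size` fields for each fork of that repository.
    """
    if not fork_sizes:
        return 0.0
    changed = sum(1 for s in fork_sizes if s != original_size)
    return 100.0 * changed / len(fork_sizes)

# Two of four forks differ from the original -> 50.0% unique.
pct = uniqueness_percent(100, [100, 100, 150, 90])
```

A richer heuristic would compare line-level diffs rather than repository sizes, as the lines-changed histograms in this section do.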

4.3 Maintenance of Deep Learning Models in GitHub

In this section, the problems of software maintenance in these deep learning related repositories were surveyed. The overall purpose of this section was to explore three sets of factors related to maintenance: development time, number of open issues, and the wiki page of each repository. In this project we also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a great deal of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation and last-update timestamps, as depicted in the equation below:

age = T(updated_at) - T(created_at)    (4.6)

Figure 4.17 and Table 4.6 show how development time varies depending on the model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). Therefore we hypothesize that many of these earlier models started using the open-source web community immediately after the first release.

Model        Max     Q3       Median   Q1       Min
Bert         779     229      110      32       0
Transformer  1254    321      142      11       0
Wide deep    1107    575      117      0.5      0
ResNet       1360    456.5    120      15       0
NCF          1120    476      216      8        0
LSTM         1812    621.25   315.5    47.25    0
CNN          1385    699.25   483      270.25   0

Table 4.6: Repository development time statistics (days)
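Equation (4.6) can be computed directly from GitHub's ISO-8601 `created_at` and `updated_at` timestamps. A minimal sketch of the age computation (the Kruskal-Wallis test over the resulting per-model samples would then use e.g. `scipy.stats.kruskal`):

```python
from datetime import datetime

def repo_age_days(created_at, updated_at):
    """Age in days per Equation (4.6): T(updated_at) - T(created_at).

    Timestamps use GitHub's ISO-8601 format, e.g. "2019-01-01T00:00:00Z";
    the trailing "Z" is rewritten so datetime.fromisoformat accepts it.
    """
    def parse(ts):
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return (parse(updated_at) - parse(created_at)).days

# e.g. repo_age_days("2019-01-01T00:00:00Z", "2019-04-11T00:00:00Z") -> 100
```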

[Figure 4.17: Development Time Boxplot. One box per model (bert, cnn, lstm, ncf, resnet, transformer, wide deep tensorflow); y-axis: days (0-2000).]


[Figure 4.18: Development Time vs Number of Open Issues. Scatter plot; x-axis: open_issues (0-2000); y-axis: develop_duration (0-1100), coloured by model.]

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As suggested visually by the figure and by a Spearman correlation test, there is a weak correlation between those two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, due to the high cost of maintenance, may have more users and more issues related to them.
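The Spearman test used here can be sketched without external dependencies by correlating the ranks directly (a minimal stdlib sketch; in practice `scipy.stats.spearmanr` would be used):

```python
from math import sqrt

def _ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Feeding it the per-repository (development time, open issues) pairs yields the coefficient reported above.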

Specifically, as depicted in Table 4.7, the top three models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model        Mean    Std      25%  50%  75%  Min  Max
Bert         8.299   50.55    0    0    1    0    504
CNN          3.414   35.456   0    0    1    0    1077
LSTM         1.292   4.915    0    0    1    0    69
ResNet       1.791   11.164   0    0    0    0    186
Transformer  1.857   8.608    0    0    1    0    95
Wide Deep    0.231   0.742    0    0    0    0    4

Table 4.7: Repository open issue statistics


Model-Related Repository   Percentage of repositories having a wiki (%)
Bert                       97.17
CNN                        98.498
LSTM                       98.799
NCF                        98.864
ResNet                     98.817
Transformer                96.97
Wide deep                  100

Table 4.8: Descriptive statistics on percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data we collected, we can see that deep learning related repositories are well documented (i.e. they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.


[Figure 4.19: Open Issues vs Number of Repository. One histogram per model (count of records vs binned open_issues, 0-100).]


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitation and Improvement

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas on identifying models using different strategies, and we developed heuristics for in-depth analysis of model construction and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

There also exists a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question that needs further investigation in the future (for example, users may publish their models in prototxt format). In our project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limitations: the present experiment uses a limited number of repositories on GitHub, since search results cannot exceed the 1000-repository boundary. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples might yield a more precise outcome.

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to know how deep learning is involved in our real life, an idea that was novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; different models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to develop their own heuristics for data selection: experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software or a plugin for GitHub, allowing researchers

45

46 Discussion And Future Work

and developers easily accessing the trend in the past

52 Future Work

521 Social Network Analysis in GitHub

Social media like Twitter and Youtube have been well-studied in recent years how-ever less attention have been paid to GitHub The future work could be payingattention to classification or regression of GitHub repositories using machine learn-ing deep learning techniques The ultimate goal is to predict the future trend inGitHub or even give recommendation to developers in the future

522 Trend Detection using Commitments Timestamp

In this project we investigate and examine the popularity of deep learning modelrelated to the number of repositories exists in GitHub It is very likely that thecommitment metadata reflect the popularity in the same time In the future we couldmove beyond and develop techniques that incorporate machine learning clusteringalgorithms (eg KMeans) in high resolution time series data from commitments

Chapter 6

Conclusion

This research project identifies the need for developing a tool to conduct the trendanalysis in GitHub Our new approach used current GitHub API to extract repos-itoriesrsquo metadata Used this tool we studied the popularity maintenance and con-tribution of GitHub deep learning related repositories and identified factors affectedprevious domains The key advantage of STAMPER is that it provides a simple wayto extract the historical information in GitHubOur tool utilises a large number ofrepositories related to deep learning models to report the large-scale emergence ofdeep learning models

This demonstrates the ability of this tool to make user gain a deeper insight intocurrent deep learning trend and generate a corpus for further research use Ourstudy could be used for example to discover other trend across the GitHub Oneavenue for further study would be social network analysis in GitHub

Additionally we expect our tool and resulting corpus will be of considerableinterest for researchers in different fields and serve the need for people working atthe intersection of social media analysis data visualization and data science

47

48 Conclusion

Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving and changing rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train and develop their own projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past, based on previous research prototyping. Have the model stores kept their models up to date? What are the differences between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

PyCharm 2019.1.3 (Professional Edition)
Build #PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
macOS 10.14.6

• Anaconda

– jupyter-notebook 6.0.0

Other

• Python 3.7.4
– pandas==0.22.0
– numpy==1.14.0
– statistics==1.0.3.5
– ratelimit==2.2.1
– requests
– altair
– matplotlib==2.2.2
– selenium
• Git

Datasets

asc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_by_star: bert tensorflow.json, lstm tensorflow.json, wide deep tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, cnn tensorflow.json, ncf tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json

by_update_time: bert tensorflow.json, lstm tensorflow.json, resnet tensorflow.json, wide deep tensorflow.json, cnn tensorflow.json, ncf tensorflow.json, transformer tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json

forked_timestamp: bert tensorflow.csv, lstm tensorflow.csv, resnet tensorflow.csv, wide deep tensorflow.csv, cnn tensorflow.csv, ncf tensorflow.csv, transformer tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

- Before You Begin
- Prerequisites
- Install
- Running
- Test
- High Level Description of all Modules & Datasets
- Authors
- License

STAMPER

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view/edit/execute our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keep the Mac awake with this useful app (otherwise the internet connection will be interrupted).

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git (https://git-scm.com/downloads) and a GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used, listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip:

    pip3 install --upgrade pip

Install the list of requirements specified in requirements.txt:

    pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run "python3 model_searcher.py" to get keyword-related repository metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run "sh JSONFormatter.sh" in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be "updated" or "stars", and order can be "asc" or "desc".

2. Repository Search

Run "python3 forks_time_stamp_getter.py" to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

Run "python3 repository_filter.py" to get your code-related repositories with statistics in the filtered_repo folder. Run "python3 filtered_repo.py" to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run "python3 visualizations/popularity.py" and get your graphs in visualizations/graphs/popularity.
- Maintenance: run "python3 visualizations/maintenance.py" and get your graphs in visualizations/graphs/maintenance.
- Contribution: run "python3 visualizations/contribution.py" and get your graphs in visualizations/graphs/contribution.
- Multi Correlations: run "python3 visualizations/multi_variable.py" and get your graphs in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all unreachable links and writes them into the file unreachable_urls.txt.

Usage: change the elements in keywords and run "python3 test.py". All unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). The constructor Model stores all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from steps 1-2, you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star") (parameters: model name and repository metadata subfolder). Then you can use this object with its related data easily (from Model import bert, and use bert as you go along).

Customize keywords: in the module model_keyword.py, import your instantiation (e.g. lstm) and call add_keywords, e.g.:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw the graphs)

Experiment Datasets Collected

1. After Data Collection (output/):
- asc_by_star: cnn tensorflow.json, lstm tensorflow.json
- asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search (forked_timestamp/): bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection, optional (filtered_repo/):
- bert.json
- pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
- tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs (graphs/):
- contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
- maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
- multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
- popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: Free for personal use only)
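To make the README's Model wrapper concrete, here is a hypothetical minimal sketch of the interface it describes (constructor taking a model name and a metadata subfolder, plus add_keywords); this is an illustration of the documented usage, not STAMPER's actual implementation:

```python
import json
from pathlib import Path


class Model:
    """Hypothetical sketch of the Model wrapper described in the README:
    it binds a model name to the metadata files collected for it."""

    def __init__(self, name, subfolder, root="output"):
        self.name = name  # e.g. "bert tensorflow"
        # Metadata collected in step 1 lands in output/<subfolder>/<name>.json
        self.path = Path(root) / subfolder / f"{name}.json"
        self.keywords = []

    def add_keywords(self, keywords):
        """Attach the API keywords used to identify this model in code."""
        self.keywords.extend(keywords)

    def load(self):
        """Load the collected repository metadata for this model."""
        with open(self.path) as f:
            return json.load(f)


# Instantiation as shown in the README:
bert = Model("bert tensorflow", "desc_by_star")
lstm = Model("lstm tensorflow", "desc_by_star")
lstm.add_keywords(["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"])
```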



Figure 4.15: Repository Uniqueness Distribution (%). Per-model histograms of binned uniqueness percentage (0.00 to 1.00) against count of records, for bert, cnn, lstm, ncf, resnet, transformer and wide deep tensorflow.

Figure 4.16: Repository Change Statistic. Per-model histograms of binned mean repository size change (-2500 to 2500) against count of records, for bert, cnn, lstm, ncf, resnet, transformer and wide deep tensorflow.

repository size difference from the original repository of 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them of less concern to people. Second, a model may only be valid for a specific type of data, which makes it less robust and generalized and less suited to developers' needs.

We conclude that the development size of forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.
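The size-difference measure discussed above can be read straight from repository metadata: when a single repository is fetched from the GitHub API, the response embeds its parent, and both carry a size field (GitHub reports it in kilobytes, so this is a coarse proxy for byte-level differences). A minimal sketch, with an illustrative tolerance threshold:

```python
def fork_size_delta(fork):
    """Size change of a fork relative to its parent repository.

    `fork` is GitHub repository metadata with the `parent` object embedded
    (present when a single repository is requested from the API).
    """
    return fork["size"] - fork["parent"]["size"]


def unchanged_ratio(forks, tolerance=0):
    """Fraction of forks whose size differs from the parent by <= tolerance."""
    deltas = [abs(fork_size_delta(f)) for f in forks]
    return sum(d <= tolerance for d in deltas) / len(deltas)
```

For example, unchanged_ratio(forks) over a model's forks gives the proportion of forks whose codebase size never moved away from the original.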

4.3 Maintenance of Deep Learning Models in GitHub

In this section, we survey software maintenance problems in these deep learning related repositories. The overall purpose of this section is to explore three factors related to maintenance: development time, number of open issues, and the wiki page of each repository. We also explore whether the age of a project affects its software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation time, as depicted in the equation below:

age = T(updated_at) - T(created_at)    (4.6)

Figure 4.17 and Table 4.6 show how development time varies depending on the model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days by model is different (p-value <= 0.05). Therefore we hypothesize that many of these models started being used in the open-source web community immediately after their first release.
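Equation 4.6 can be computed directly from the GitHub API's ISO-8601 timestamps; a minimal stdlib sketch (the timestamps in the usage note are invented):

```python
from datetime import datetime, timezone

# GitHub API timestamp format, e.g. "2019-01-01T00:00:00Z"
ISO = "%Y-%m-%dT%H:%M:%SZ"


def repo_age_days(created_at, updated_at):
    """age = T(updated_at) - T(created_at), expressed in days (Equation 4.6)."""
    created = datetime.strptime(created_at, ISO).replace(tzinfo=timezone.utc)
    updated = datetime.strptime(updated_at, ISO).replace(tzinfo=timezone.utc)
    return (updated - created).total_seconds() / 86400.0
```

The per-model age samples can then be compared with a Kruskal-Wallis test, e.g. scipy.stats.kruskal(bert_ages, cnn_ages, ...), as is done above.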

Model        Max (days)  Q3 (days)  Median (days)  Q1 (days)  Min (days)
Bert         779         229        110            32         0
Transformer  1254        321        142            11         0
Wide deep    1107        575        117            0.5        0
ResNet       1360        456.5      120            1.5        0
NCF          1120        476        216            8          0
LSTM         1812        621.25     315.5          47.25      0
CNN          1385        699.25     483            270.25     0

Table 4.6: Repository Development Time Statistics

Figure 4.17: Development Time Boxplot. Per-model boxplots of development time in days (0 to 2000) for bert, cnn, lstm, ncf, resnet, transformer and wide deep tensorflow.

Figure 4.18: Development Time vs Number of Open Issues. Scatter plot of open issues (0 to 2000) against development duration (0 to 1100 days), coloured by model.

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As suggested visually by the figure and confirmed by a Spearman correlation test, there is a weak correlation between the two variables (coef = 0.4608, p <= 0.01). Essentially, this result shows that older models, due to the high cost of maintenance, may have more users and more issues related to them.
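The Spearman coefficient used here is simply the Pearson correlation of the rank-transformed variables; a small pure-Python sketch of the computation (in practice scipy.stats.spearmanr would be used, which also supplies the p-value):

```python
def rankdata(xs):
    """Assign 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with xs[order[i]].
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average position of the tied run, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Because it works on ranks, the coefficient captures any monotone association, not only linear ones, which suits skewed quantities like issue counts.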

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics

Model-Related Repository  Percentage of repositories having a Wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
Resnet                    98.817
Transformer               96.97
Wide deep                 100

Table 4.8: Descriptive statistics on the percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data we collected, we can see that deep learning related repositories are well documented (i.e., they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section we first used STAMPER to get the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.

Figure 4.19: Open Issues vs Number of Repositories. Per-model histograms of binned open-issue counts (0 to 100) against count of records, for bert, cnn, lstm, ncf, resnet, transformer and wide deep tensorflow.


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitations and Improvements

Pinpointing the landmark deep learning models is tricky. We had plenty of ideas on identifying models using different strategies, and we developed heuristics for an in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.

A sampling problem exists at the same time. The models we chose cannot represent all the new models in the wild; this is an open research question which needs further investigation in the future. For example, users may use the prototxt format to publish their models, while in our project we only focused on deep learning models constructed using Python. Our findings may also reflect sampling problems. That is, the present experiment uses a limited number of repositories from GitHub, since a search cannot exceed the 1,000-result boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories in GitHub. Other, more stratified samples might produce a more precise outcome.
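The sorting-strategy workaround described above can be sketched as follows: each (sort, order) pair is capped at 1,000 results by the GitHub Search API, so querying several orderings and merging surfaces different slices of the population. A stand-alone illustration (the query, token handling and strategy list are illustrative, not STAMPER's actual code):

```python
import json
import urllib.parse
import urllib.request

API = "https://api.github.com/search/repositories"


def search_page(query, token, sort, order, page):
    """Fetch one page (up to 100 repositories) of GitHub search results."""
    params = urllib.parse.urlencode({
        "q": query, "sort": sort, "order": order,
        "per_page": 100, "page": page,
    })
    req = urllib.request.Request(
        f"{API}?{params}",
        headers={"Authorization": f"token {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("items", [])


def dedupe(result_pages):
    """Merge result pages, de-duplicating repositories by full_name."""
    seen = {}
    for page in result_pages:
        for repo in page:
            seen[repo["full_name"]] = repo
    return list(seen.values())


def search_all_strategies(query, token):
    """Union the capped result sets of several sort strategies.

    Each (sort, order) pair returns at most 10 pages x 100 items = 1,000
    results, so different orderings partially work around the cap.
    """
    pages = []
    for sort in ("stars", "updated", "forks"):
        for order in ("asc", "desc"):
            for page in range(1, 11):
                items = search_page(query, token, sort, order, page)
                if not items:
                    break
                pages.append(items)
    return dedupe(pages)
```

Even with all six orderings combined, very popular queries can exceed what the strategies jointly cover, which is exactly the residual sampling bias noted above.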

Nevertheless, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to learn how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly, with different models released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to develop their own heuristics for data selection: experts could easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software, or to a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on classification or regression over GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends on GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models through the number of related repositories on GitHub. It is very likely that commit metadata reflects popularity as well. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g. k-means) to high-resolution time-series data built from commits.
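As a hedged sketch of what such clustering might look like, the snippet below bins hypothetical commit timestamps into weekly activity vectors and groups them with a tiny pure-Python k-means (Lloyd's algorithm). All names and data here are illustrative, not existing STAMPER functionality.

```python
import random

def weekly_counts(timestamps, start, weeks, week_secs=7 * 24 * 3600):
    """Bin Unix timestamps into a fixed-length, normalised weekly count vector."""
    hist = [0.0] * weeks
    for t in timestamps:
        i = (t - start) // week_secs
        if 0 <= i < weeks:
            hist[i] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm on lists of equal-length vectors."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # assign each point to its nearest centre (squared Euclidean distance)
        for i, pt in enumerate(points):
            dists = [sum((x - y) ** 2 for x, y in zip(pt, c)) for c in centers]
            labels[i] = dists.index(min(dists))
        # move each centre to the mean of its assigned points
        for ci in range(k):
            members = [pt for pt, l in zip(points, labels) if l == ci]
            if members:
                centers[ci] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# made-up commit timestamps for three hypothetical repositories
series = {
    "repo_a": weekly_counts([0, 600_000, 1_300_000], start=0, weeks=10),
    "repo_b": weekly_counts([100_000, 700_000], start=0, weeks=10),
    "repo_c": weekly_counts([5_000_000, 5_400_000], start=0, weeks=10),
}
labels = kmeans(list(series.values()), k=2)
print(dict(zip(series, labels)))
```

Normalising each weekly histogram makes the clustering compare the shape of activity over time rather than raw commit volume.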

Chapter 6

Conclusion

This research project identified the need for a tool to conduct trend analysis on GitHub. Our new approach used the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of deep-learning-related repositories on GitHub and identified factors affecting each of these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report on the large-scale emergence of those models.

This demonstrates the tool's ability to give users deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What are the differences between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

- Identify data sources for current trends in model & dataset use
- Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware:

- MacBook Pro (Retina, 15-inch, Mid 2015)
- Processor: 2.2 GHz Intel Core i7
- Memory: 16 GB 1600 MHz DDR3
- Graphics: Intel Iris Pro 1536 MB

Software:

- PyCharm 2019.1.3 (Professional Edition)
  - Build PY-191.7479.30, built on May 30, 2019
  - Licensed to ANU / Xing Yu
  - JRE: 11.0.2+9-b159.60 x86_64
  - JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
  - macOS 10.14.6
- Anaconda
  - jupyter-notebook 6.0.0

Other:

- Python 3.7.4
  - pandas==0.22.0
  - numpy==1.14.0
  - statistics==1.0.3.5
  - ratelimit==2.2.1
  - requests
  - altair
  - matplotlib==2.2.2
  - selenium
- Git

Datasets:

- asc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json
- desc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json
- desc_by_star: bert tensorflow.json, lstm tensorflow.json, wide deep tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, cnn tensorflow.json, ncf tensorflow.json
- asc_by_star: cnn tensorflow.json, lstm tensorflow.json
- pytorch_models: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json
- by_update_time: bert tensorflow.json, lstm tensorflow.json, resnet tensorflow.json, wide deep tensorflow.json, cnn tensorflow.json, ncf tensorflow.json, transformer tensorflow.json
- filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- filtered_repo/pytorch_model_filtering: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json
- forked_timestamp: bert tensorflow.csv, lstm tensorflow.csv, resnet tensorflow.csv, wide deep tensorflow.csv, cnn tensorflow.csv, ncf tensorflow.csv, transformer tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

Before You Begin · Prerequisites · Install · Running · Test · High Level Description of all Modules & Datasets · Authors · License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing, editing or executing our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keep your Mac awake with this useful app (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repository metadata from GitHub in the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars, and order can be asc or desc.
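The kind of request model_searcher.py wraps can be sketched as follows. This is an illustration of the GitHub v3 Search API using only the standard library; the helper names are ours, not necessarily those used in the thesis code, and you must supply your own token.

```python
import json
import urllib.parse
import urllib.request

API = "https://api.github.com/search/repositories"

def build_search_url(keyword, sort="stars", order="desc", page=1):
    """URL for one page (up to 100 items) of repository search results."""
    params = {"q": keyword, "sort": sort, "order": order,
              "page": page, "per_page": 100}
    return API + "?" + urllib.parse.urlencode(params)

def search_repositories(keyword, token=None, **kwargs):
    """Fetch one page of repository metadata; a token raises the rate limit."""
    req = urllib.request.Request(build_search_url(keyword, **kwargs))
    req.add_header("Accept", "application/vnd.github.v3+json")
    if token:
        req.add_header("Authorization", f"token {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["items"]
```

Varying `sort` and `order` in this request is exactly the strategy the report describes for working around the 1000-result-per-query search boundary.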

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder. Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity
- Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance
- Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution
- Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module will record all the unreachable links and write them into the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), whose parameters are the model name and the repository-metadata subfolder. Then you can use this object with its related data easily (from Model import bert, and use bert as you go along).

Customize Keywords

In module model_keyword.py, import your instantiation (e.g. lstm) and call add_keywords, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection

output
- asc_by_star: cnn tensorflow.json, lstm tensorflow.json
- asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv


3. After Data Selection (Optional)

filtered_repo
- bert.json
- pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
- tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json


Generated Graphs

graphs
- contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
- maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
- multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
- popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)

Bibliography

a. GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars, accessed 2019-09-22. (cited on pages xv and 20)

b. GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers, accessed 2019-09-22. (cited on pages xv, 19 and 20)

c. GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories, accessed 2019-09-22. (cited on page 20)

d. GitHub Search API description. https://developer.github.com/v3/#rate-limiting, accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265-283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334-344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396-399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7-10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77-ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1-11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433-444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345-355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12-21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173-182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735-1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1-4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142-146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on page 6)


[Figure omitted: "Repository Changed Histograms" - per-model histograms of binned mean repository size change (-2500 to 2500) against count of records, with one panel each for bert, cnn, lstm, ncf, resnet, transformer and wide deep tensorflow.]

Figure 4.16: Repository Change Statistic

The repository size difference from the original repository is mostly within 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found that a high percentage of deep learning repositories are not changed after forking, especially those implementing new models. We hypothesize two main reasons for this result. First, new models have not been released for long, and the lack of tutorials and attention makes them less used. Second, a model may only be valid for a specific type of data, making it less robust and generalized and a poorer fit for developers' needs.

We conclude that development effort across forked repositories is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section we survey software maintenance problems in these deep-learning-related repositories. The overall purpose is to explore three sets of factors related to maintenance: development time, number of open issues, and the presence of a wiki page for each repository. We also explore whether the age of a project affects its maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from its creation and last-update timestamps, as depicted in the equation below:

age = T(updated_at) - T(created_at)    (4.6)
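A minimal sketch of this computation, assuming the ISO-8601 timestamp strings (created_at, updated_at) that the GitHub API returns in repository metadata; the function name is ours:

```python
from datetime import datetime

def repo_age_days(created_at, updated_at):
    """Repository age in whole days between the two GitHub API timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # GitHub API timestamp format (UTC)
    delta = datetime.strptime(updated_at, fmt) - datetime.strptime(created_at, fmt)
    return delta.days

print(repo_age_days("2018-11-01T00:00:00Z", "2019-02-19T00:00:00Z"))  # → 110
```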

Figure 4.17 and Table 4.6 show how development time varies by model. The median development time is: Bert (110 days), Transformer (142 days), Wide deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days differs by model (p-value ≤ 0.05). We therefore hypothesize that, for the earlier models, many users started using the open-source web community immediately after the first release.
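The test can be reproduced on toy data. Below is an illustrative, pure-Python computation of the Kruskal-Wallis H statistic; the duration samples are made up, and tie correction is omitted for simplicity.

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction) for several samples."""
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    ranks = {}
    for rank, (_v, gi) in enumerate(pooled, start=1):
        ranks.setdefault(gi, []).append(rank)
    # H = 12 / (N (N + 1)) * sum_i(R_i^2 / n_i) - 3 (N + 1)
    return 12 / (n * (n + 1)) * sum(
        len(r) * (sum(r) / len(r)) ** 2 for r in ranks.values()
    ) - 3 * (n + 1)

# made-up development durations (days) for three models
bert_days = [110, 32, 229, 0, 779]
lstm_days = [315, 47, 621, 1812, 250]
cnn_days = [483, 270, 699, 1385, 500]

h = kruskal_h(bert_days, lstm_days, cnn_days)
# with k = 3 groups, reject equal distributions at alpha = 0.05 when h
# exceeds the chi-squared critical value for 2 degrees of freedom (5.991)
print(h, h > 5.991)
```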

Model       | Max of days | Q3 of days | Median of days | Q1 of days | Min of days
Bert        | 779         | 229        | 110            | 32         | 0
Transformer | 1254        | 321        | 142            | 11         | 0
Wide deep   | 1107        | 575        | 117            | 0.5        | 0
ResNet      | 1360        | 456.5      | 120            | 15         | 0
NCF         | 1120        | 476        | 216            | 8          | 0
LSTM        | 1812        | 621.25     | 315.5          | 47.25      | 0
CNN         | 1385        | 699.25     | 483            | 270.25     | 0

Table 4.6: Repository Development Time Statistics

[Figure omitted: boxplot of development time in days (0-2000) for each model: bert, cnn, lstm, ncf, resnet, transformer and wide deep tensorflow.]

Figure 4.17: Development Time Boxplot

[Figure omitted: scatter plot of open_issues (0-2000) against develop_duration (0-1100 days), coloured by model name.]

Figure 4.18: Development Time vs Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of open issues. As suggested visually and confirmed by a Spearman correlation test, there is a weak correlation between the two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result suggests that older models, which carry a higher maintenance cost, may have more users and therefore more issues related to them.
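As an illustration of the test used above, the Spearman coefficient can be computed directly from ranks. The values below are made up, tie handling is omitted, and the helper names are ours; the real analysis presumably used a statistics library.

```python
def ranks(xs):
    """1-based ranks of a sequence (ties not averaged, for simplicity)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rho via 1 - 6 * sum(d_i^2) / (n (n^2 - 1)), valid without ties."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

develop_duration = [10, 200, 350, 500, 800, 1100]  # days (made up)
open_issues = [5, 0, 12, 3, 20, 8]                 # counts (made up)

print(round(spearman(develop_duration, open_issues), 4))  # → 0.4857
```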

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most repositories have fewer than 200 open issues.

Model       | Mean  | Std    | 25% | 50% | 75% | Min | Max
Bert        | 8.299 | 50.55  | 0   | 0   | 1   | 0   | 504
CNN         | 3.414 | 35.456 | 0   | 0   | 1   | 0   | 1077
LSTM        | 1.292 | 4.915  | 0   | 0   | 1   | 0   | 69
ResNet      | 1.791 | 11.164 | 0   | 0   | 0   | 0   | 186
Transformer | 1.857 | 8.608  | 0   | 0   | 1   | 0   | 95
Wide Deep   | 0.231 | 0.742  | 0   | 0   | 0   | 0   | 4

Table 4.7: Repository Open Issue Statistics

Model-Related Repository | Percentage of repositories having Wiki (%)
Bert        | 97.17
CNN         | 98.498
LSTM        | 98.799
NCF         | 98.864
Resnet      | 98.817
Transformer | 96.97
Wide deep   | 100

Table 4.8: Descriptive statistics on percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%; from the data collected, deep-learning-related repositories are therefore well documented (i.e. they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. These samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.

4.4 Summary

In this section, we first used STAMPER to collect the metadata of repositories related to DL models from GitHub. We then investigated three common software engineering aspects of deep learning repositories (popularity, contribution and maintenance) using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported maintenance metrics for deep learning repositories and conducted an in-depth investigation into whether developers change the original code base after forking.

[Figure omitted: "Distribution of Issues" - per-model histograms of binned open issues (0-100) against count of records, with one panel each for bert, cnn, lstm, ncf, resnet, transformer and wide deep tensorflow.]

Figure 4.19: Open Issues vs Number of Repository

44 STAMPER in Action

Chapter 5

Discussion And Future Work

51 Discussion

511 Data in the wild Limitation and Improvement

Pinpointing the landmark of deep learning models is tricky We had plenty of ideason the identification of models using different strategies We have developed heuris-tics to in-depth analysis of the construction of models and high-level APIs such asestimators layers However our heuristic may not be perfect considering numerousways in constructing models in the real-world

There exists sampling problem at the same time The models we choose could notrepresent all the new models in the wild This is an open research question whichneeds further investigate in the future for example users may use prototxt formatto publish their models In our project we only focused on deep learning modelsconstructed using Python Also findings may reflect sampling problems That is thepresent experiment uses a limited number of repositories on GitHub which cannotexceed 1000 original created repositories boundary We tried to overcome this issueusing different sorting strategies provided by GitHub However this strategy stillcannot capture all the repositories in GitHub It may be that other more stratifiedsamples would have a more precise outcome

However our research project is essential and necessary in that it provides anintuitive way for researchers and developers to know how deep learning involves inour real life an idea that was novel compared with previous works

512 Extensibility and Open-Source Software

The field of deep learning is changing rapidly different models release in a timescaleof months In the future it would be advantages to incorporate other contributorsto this project and explore other models in the wild this program could provide abroader picture of deep learning model usage in the world Our program also allowsdeveloper or user to develop their heuristic in data selection experts could easilychange their API searching in our program to have an in-depth understanding ofwhat has been done in the past We expect our tool and dataset move beyond bymigrating to open-source software or as a plugin for GitHub allowing researchers

45

46 Discussion And Future Work

and developers easily accessing the trend in the past

52 Future Work

521 Social Network Analysis in GitHub

Social media like Twitter and Youtube have been well-studied in recent years how-ever less attention have been paid to GitHub The future work could be payingattention to classification or regression of GitHub repositories using machine learn-ing deep learning techniques The ultimate goal is to predict the future trend inGitHub or even give recommendation to developers in the future

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models through the number of related repositories existing in GitHub. It is very likely that the commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g. k-means) to high-resolution time series data from commits.
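As a minimal sketch of this idea, the following assumes scikit-learn is available and uses synthetic monthly commit counts in place of real commit metadata; the repository series and the cluster count are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: monthly commit counts for 6 repositories over 12 months.
# Rows are repositories, columns are months; real input would come from
# commit timestamps binned per month via the GitHub API.
commit_series = np.array([
    [50, 42, 38, 30, 25, 20, 15, 12, 10, 8, 5, 3],   # cooling down
    [48, 40, 35, 28, 22, 18, 14, 10, 8, 6, 4, 2],    # cooling down
    [2, 4, 8, 12, 20, 28, 35, 44, 50, 58, 66, 70],   # heating up
    [1, 3, 6, 10, 18, 25, 33, 40, 48, 55, 60, 68],   # heating up
    [10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 12, 11],  # steady
    [9, 10, 10, 11, 9, 10, 11, 10, 9, 10, 11, 10],   # steady
])

# Normalise each series so clustering groups by *shape* (trend), not volume.
norm = commit_series / commit_series.sum(axis=1, keepdims=True)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(norm)
print(kmeans.labels_)  # repositories with similar commit trends share a label
```

Normalising each series before clustering makes the grouping sensitive to the shape of the commit trend (growing, fading, steady) rather than to raw commit volume.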

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis in GitHub. Our new approach used the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub deep learning related repositories and identified factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of this tool to help users gain a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and resulting corpus will be of considerable interest to researchers in different fields and serve the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving and changing rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers a way to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization & analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract

7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015); Processor: 2.2 GHz Intel Core i7; Memory: 16 GB 1600 MHz DDR3; Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm 2019.1.3 (Professional Edition); Build PY-191.7479.30, built on May 30, 2019; Licensed to ANU / Xing Yu; JRE 11.0.2+9-b159.60 x86_64; JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.; macOS 10.14.6

• Anaconda
  – jupyter-notebook 6.0.0

Other

• Python 3.7.4
• pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
• Git

Datasets

asc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_by_star: bert tensorflow.json, lstm tensorflow.json, wide deep tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, cnn tensorflow.json, ncf tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json

by_update_time: bert tensorflow.json, lstm tensorflow.json, resnet tensorflow.json, wide deep tensorflow.json, cnn tensorflow.json, ncf tensorflow.json, transformer tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json

forked_timestamp: bert tensorflow.csv, lstm tensorflow.csv, resnet tensorflow.csv, wide deep tensorflow.csv, cnn tensorflow.csv, ncf tensorflow.csv, transformer tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

• Before You Begin
• Prerequisites
• Install
• Running
• Test
• High Level Description of all Modules & Datasets
• Authors
• License

STAMPER

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view/edit/execute our code:

• PyCharm
• Anaconda
• Amphetamine on the Mac App Store: keep the Mac awake with this useful app (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

• Git - https://git-scm.com/downloads
• GitHub authentication token
• Python 3.7 with pip
• Jupyter Notebook 6.0.0
• All external libraries used, listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip:

    pip3 install --upgrade pip

Install the list of requirements specified in requirements.txt:

    pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run `python3 model_searcher.py` to get keyword-related repositories' metadata from GitHub in the `output` folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions `call_api` and `call_api_dir`.

Then run `sh JSONFormatter.sh` in your terminal to well-format your output data.

Sample case: in `main()`, change `keywords` to the terms of interest. The resulting JSON file will then be `output/bert.JSON`. The sorting method can be customized in the functions `get_total_pages` and `request_ith_page`: `sort` can be `updated` or `stars`, and `order` can be `asc` or `desc`.

2. Repository Search

Run `python3 forks_time_stamp_getter.py` to get all the forks' timestamps in `forked_timestamp`.

3. Data Selection (Optional)

Run `python3 repository_filter.py` to get your code-related repositories with statistics in the `filtered_repo` folder. Run `python3 filtered_repo.py` to filter your data.

Note: your keywords can be customized in `model_keyword.py`. We store all the previous experiment data in `tensorflow_model_filtering` and `pytorch_model_filtering`.

4. Data Visualization

• Popularity: run `python3 visualizations/popularity.py` and get your graphs in `visualizations/graphs/popularity`.
• Maintenance: run `python3 visualizations/maintenance.py` and get your graphs in `visualizations/graphs/maintenance`.
• Contribution: run `python3 visualizations/contribution.py` and get your graphs in `visualizations/graphs/contribution`.
• Multi Correlations: run `python3 visualizations/multi_variable.py` and get your graphs in `visualizations/graphs/multi_variable`.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in `test.py`. This module will record all the unreachable links and write them into the file `unreachable_urls.txt`.

Usage: change the elements in `keywords`, then run `python3 test.py`. All the unreachable links will be written to `unreachable_urls.txt`.

Customizing Your Own Search

In the module `Model.py`, define your own entity lists (e.g. `tensorflow_models`). In the constructor `Model`, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), you can construct a model by calling the constructor `Model`, e.g. `bert = Model('bert tensorflow', 'desc_by_star')` (parameters: model name and repository metadata subfolder). Then you can use this object with its related data easily (`from Model import bert` and use `bert` as you go along).

Customize Keywords: in the module `model_keyword.py`, import your instantiation (e.g. `lstm`) and call `add_keywords`, e.g.:

    lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection: the `output` folder contains the subfolders `asc_by_star`, `asc_general`, `by_update_time`, `desc_by_star`, `desc_general` and `pytorch_models`, each holding the per-model JSON files listed in Appendix 3.

2. After Repository Search: the `forked_timestamp` folder contains bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv and wide deep tensorflow.csv.

3. After Data Selection (Optional): the `filtered_repo` folder contains the `pytorch_model_filtering` and `tensorflow_model_filtering` subfolders with the filtered per-model JSON files listed in Appendix 3.

Generated Graphs: the `graphs` folder contains `contribution` (entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg), `maintenance` (devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg), `multi_variable` (dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg) and `popularity` (accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg).

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu. Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only).

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)



repository size difference from the original repository lies in the range of 0 to 100 bytes, as depicted in Figure 4.16.

As mentioned, we found a high percentage of deep learning repositories are not changed after forking, especially the ones implementing new models. We hypothesize two main reasons to explain this result. First, new models have not been released for long, and the lack of tutorials and attention makes them of less concern to people. Second, the model itself may only be valid for a specific type of data, making it less robust and generalized and less suited to developers' needs.

We conclude that the forked repositories' development size is quite imbalanced, with a large number of forked projects showing no change from the original repository.

4.3 Maintenance of Deep Learning Models in GitHub

In this section, the problems of software maintenance in these deep learning related repositories were surveyed. The overall purpose of this section was to explore three sets of factors related to maintenance: the development time, the number of open issues, and the wiki page of each repository. In this project we also explore whether the age of the project affects software maintenance.

4.3.1 RQ1: How long has it been in existence?

Software maintenance requires a large amount of time and work, and older systems tend to have more maintenance problems. In this report we calculate the age of each repository from the repository creation time, as depicted in the equation below:

age = T(updated_at) - T(created_at)    (4.6)
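Equation (4.6) can be computed directly from the metadata, assuming the collected records keep `created_at` and `updated_at` as ISO 8601 strings in the format the GitHub API returns:

```python
from datetime import datetime

def repo_age_days(created_at: str, updated_at: str) -> float:
    """age = T(updated_at) - T(created_at), expressed in days."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # GitHub API timestamp format
    created = datetime.strptime(created_at, fmt)
    updated = datetime.strptime(updated_at, fmt)
    return (updated - created).total_seconds() / 86400

print(repo_age_days("2018-10-01T00:00:00Z", "2019-01-19T00:00:00Z"))  # 110.0
```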

Figure 4.17 and Table 4.6 show how development time varies depending on the model. The median development time varies as follows: Bert (110 days), Transformer (142 days), Wide deep (117 days), ResNet (120 days), NCF (216 days), LSTM (315.5 days) and CNN (483 days). After applying the Kruskal-Wallis test, the distribution of development days by model is different (p-value ≤ 0.05). Therefore we hypothesize that many of the earlier models started being used by the open-source web community immediately after their first release.

40 STAMPER in Action

Model        Max of days  Q3 of days  Median of days  Q1 of days  Min of days
Bert         779          229         110             32          0
Transformer  1254         321         142             11          0
Wide deep    1107         575         117             0.5         0
ResNet       1360         456.5       120             1.5         0
NCF          1120         476         216             8           0
LSTM         1812         621.25      315.5           47.25       0
CNN          1385         699.25      483             270.25      0

Table 4.6: Repository Development Time Statistics
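The group comparison above can be sketched with `scipy.stats.kruskal`; the three sample arrays below are illustrative placeholders, not the collected per-repository development durations:

```python
from scipy import stats

# Hypothetical development-time samples (days) for three models; the real
# inputs are the per-repository durations gathered by STAMPER.
bert = [110, 32, 229, 0, 779, 95, 120]
lstm = [315, 47, 621, 1812, 400, 250, 90]
cnn  = [483, 270, 699, 1385, 520, 610, 330]

# Kruskal-Wallis H-test: do the samples come from the same distribution?
h_stat, p_value = stats.kruskal(bert, lstm, cnn)
print(h_stat, p_value)
if p_value <= 0.05:
    print("distributions of development days differ across models")
```

The test is non-parametric, so it suits the long-tailed, non-normal durations reported here better than a one-way ANOVA would.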

Figure 4.17: Development Time Boxplot

Figure 4.18: Development Time vs. Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of issues. As visually suggested by the figure and confirmed by a Spearman correlation test, there is a weak correlation between those two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, due to the high cost of maintenance, may have more users and more issues related to them.
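A sketch of this test, assuming SciPy is available; the two lists are hypothetical stand-ins for the per-repository development duration and open-issue counts:

```python
from scipy import stats

# Hypothetical per-repository values; real ones come from the collected metadata.
develop_duration = [10, 50, 120, 300, 480, 700, 1100, 1500]  # days
open_issues      = [0,  0,  1,   3,   2,   8,   15,   12]

# Spearman's rank correlation: robust to outliers and non-linear scaling,
# which suits the long-tailed issue counts better than Pearson's r.
coef, p = stats.spearmanr(develop_duration, open_issues)
print(coef, p)  # rank correlation coefficient in [-1, 1] and its p-value
```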

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics


Model-Related Repository  Percentage of repositories having Wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
ResNet                    98.817
Transformer               96.97
Wide deep                 100

Table 4.8: Descriptive statistics on percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data we collected, we can see that deep learning related repositories are well documented (i.e. they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.
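The wiki percentage can be derived directly from the collected metadata, assuming the records preserve the GitHub API's boolean `has_wiki` field; the sample list here is hypothetical:

```python
def wiki_percentage(repos):
    """Percentage of repositories whose metadata has has_wiki == True."""
    if not repos:
        return 0.0
    with_wiki = sum(1 for r in repos if r.get("has_wiki"))
    return 100.0 * with_wiki / len(repos)

# Hypothetical sample of repository metadata records.
sample = [{"has_wiki": True}, {"has_wiki": True},
          {"has_wiki": False}, {"has_wiki": True}]
print(wiki_percentage(sample))  # 75.0
```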

4.4 Summary

In this section we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.


Figure 4.19: Open Issues vs. Number of Repository (Distribution of Issues per model)


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the wild: Limitation and Improvement

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas on the identification of models using different strategies. We developed heuristics for in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.


732 Program Testing

testpy

733 Experiment

Hardware

MacBook Pro (Retina 15-inch Mid 2015)Processor 22 GHz Intel Core i7Memory 16 GB 1600 MHz DDR3Graphics Intel Iris Pro 1536 MB

Softwares

bull PyCharm

PyCharm 201913 (Professional Edition)

sect73 Appendix 3 Artefact Description 53

Build PY-191747930 built on May 30 2019Licensed to ANU Xing YuJRE 1102+9-b15960 x86_64JVM OpenJDK 64-Bit Server VM by JetBrains sromacOS 10146

bull Anaconda

ndash jupyter-notebook 600

Other

- Python 374-- pandas== 0220 -- numpy== 1140-- statistics==1035 -- ratelimit==221-- requests -- altair -- matplotlib==222-- selenium- Git

Datasets

asc_generalbertjson lstmjson resnetjson wide deepjsoncnnjson ncfjson transformerjson

desc_generalbertjson lstmjson resnetjson wide deepjsoncnnjson ncfjson transformerjson

desc_by_starbert tensorflowjson lstm tensorflowjson wide deep tensorflowjsonresnet tensorflowjson transformer tensorflowjsoncnn tensorflowjson ncf tensorflowjson

asc_by_starcnn tensorflowjson lstm tensorflowjson

pytorch_modelsAlexNetjson HarDNetjson ResNet101jsonShuffleNet v2json U-NetjsonDCGANjson Inception_v3jsonResNext WSLjson SqueezeNetjson WaveGlowjsonDensenetjson MobileNet v2json ResNextjson

54 Appendix

Wide ResNetjson Tacotron 2jsonFCN-ResNet101json PGANjson RoBERTajsonTransformerjson fairseqjsonGoogleNetjson ResNetjsonvgg_netsjson SSDjson U-Net pytorchjson

by_update_timebert tensorflowjson lstm tensorflowjsonresnet tensorflowjson wide deep tensorflowjsoncnn tensorflowjson ncf tensorflowjsontransformer tensorflowjson

filtered_repotensorflow_model_filteringbertjson lstmjsonncfjson resnetjsontransformerjson wide deepjson

filtered_repopytorch_model_filteringDensenetjson GoogleNetjsonResNet101json ShuffleNet v2jsonTacotron 2json vgg_netsjsonFCN-ResNet101json MobileNet v2jsonResNextjson SqueezeNetjson Wide ResNetjson

forked_timestampbert tensorflowcsv lstm tensorflowcsvresnet tensorflowcsv wide deep tensorflowcsvcnn tensorflowcsv ncf tensorflowcsvtransformer tensorflowcsv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents: Before You Begin · Prerequisites · Install · Running · Test · High Level Description of all Modules & Datasets · Authors · License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view, edit and execute our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keeps the Mac awake with this useful app (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip:

    pip3 install --upgrade pip

Install the list of requirements specified in requirements.txt:

    pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repository metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars, and order can be asc or desc.

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder. Run python3 filtered_repo.py to filter your data. Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity
- Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance
- Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution
- Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them to the file unreachable_urls.txt. Usage: change the elements in keywords, then run python3 test.py; all the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model('bert tensorflow', 'desc_by_star') (parameters: model name and repository-metadata subfolder). Then you can use this object with its related data easily (from Model import bert, and use bert as you go along).

Customize keywords: in the module model_keyword.py, import your instantiation (e.g. lstm) and call add_keywords:

    lstm_keywords = ['tf.keras.layers.LSTMCell', 'tf.nn.rnn_cell.LSTMCell']
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection (output/):

- asc_by_star: cnn tensorflow.json, lstm tensorflow.json
- asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
- desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
- pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search (forked_timestamp/):

- bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection, optional (filtered_repo/):

- bert.json
- pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
- tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs (graphs/):

- contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
- maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
- multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
- popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu
Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
    • Deep learning models
    • Summarized Timeline
  • Public Code Repositories
    • Web-based hosting service
    • Measuring Popularity From GitHub
    • Extracting Messy Data in the Wild
    • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    • RQ2: How does popularity vary per model?
    • RQ3: Does the popularity of models relate to other features?
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commit Timestamps
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Softwares
      • Other
      • Datasets
  • Appendix 4: README
                                                                        • Appendix 4 README

40 STAMPER in Action

Model        Max of days  Q3 of days  Median of days  Q1 of days  Min of days
Bert         779          229         110             32          0
Transformer  1254         321         142             11          0
Wide deep    1107         575         117             0.5         0
ResNet       1360         456.5       120             15          0
NCF          1120         476         216             8           0
LSTM         1812         621.25      315.5           47.25       0
CNN          1385         699.25      483             270.25      0

Table 4.6: Repository development time statistics
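The per-model development-time statistics above can be recomputed from raw repository metadata; a minimal sketch, assuming the GitHub API's standard created_at and pushed_at timestamp fields (the helper names are illustrative, not STAMPER's own):

```python
from datetime import datetime

import pandas as pd

GITHUB_TS = "%Y-%m-%dT%H:%M:%SZ"  # timestamp format used by the GitHub API


def development_days(repo):
    """Days between a repository's creation and its last push."""
    created = datetime.strptime(repo["created_at"], GITHUB_TS)
    pushed = datetime.strptime(repo["pushed_at"], GITHUB_TS)
    return (pushed - created).total_seconds() / 86400


def dev_time_stats(repos):
    """Min, quartiles and max of development time, as in Table 4.6."""
    days = pd.Series([development_days(r) for r in repos])
    return days.describe()[["min", "25%", "50%", "75%", "max"]]
```

Running dev_time_stats over each model's repository list yields one row of the table.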

Figure 4.17: Development Time Boxplot (development time in days, one box per model: bert, cnn, lstm, ncf, resnet, transformer and wide deep TensorFlow repositories)

4.3 Maintenance of Deep Learning Models in GitHub

Figure 4.18: Development Time vs Number of Open Issues (scatter plot of develop_duration against open_issues, one point per repository, coloured by model)

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of open issues. As suggested both visually and by a Spearman correlation test, there is a weak correlation between the two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, which are more costly to maintain, may have more users and more issues related to them.
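The Spearman test can be run directly on the collected metadata; a minimal sketch (the column names develop_duration and open_issues mirror Figure 4.18 and are illustrative, not necessarily STAMPER's internal names):

```python
import pandas as pd


def issue_age_correlation(repos):
    """Spearman rank correlation between development time and the
    number of open issues, over a list of repository metadata dicts."""
    df = pd.DataFrame(repos)
    return df["develop_duration"].corr(df["open_issues"], method="spearman")
```

pandas computes the coefficient by ranking both columns; the p-value quoted above additionally needs scipy.stats.spearmanr.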

Specifically, as depicted in Table 4.7, the top-3 models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.

Model        Mean   Std     25%  50%  75%  Min  Max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository open issue statistics


Model-Related Repository  Percentage of repositories having a wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
ResNet                    98.817
Transformer               96.97
Wide deep                 100

Table 4.8: Descriptive statistics on the percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data we collected, deep-learning-related repositories are well documented (i.e. they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.
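The wiki percentages in Table 4.8 come straight from the has_wiki flag that the GitHub API attaches to every repository record; a minimal sketch of that computation (the function itself is illustrative, not STAMPER's own code):

```python
def wiki_percentage(repos):
    """Percentage of repositories with the GitHub wiki enabled.

    `has_wiki` is a standard boolean field in GitHub repository
    metadata; `repos` is a list of metadata dicts for one model.
    """
    with_wiki = sum(1 for r in repos if r.get("has_wiki"))
    return 100.0 * with_wiki / len(repos)
```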

4.4 Summary

In this section, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. We then investigated three common software-engineering aspects (popularity, contribution and maintenance) of deep learning repositories using the data collected with STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from Spearman correlation tests. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.


Figure 4.19: Open Issues vs Number of Repositories (Distribution of Issues: histograms of binned open-issue counts against count of records, one panel per model)


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitation and Improvement

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas on identifying models using different strategies, and we developed heuristics for an in-depth analysis of model construction and of high-level APIs such as estimators and layers. However, our heuristics may not be perfect, given the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question which needs further investigation in the future. For example, users may publish their models in the prototxt format, while in our project we only focused on deep learning models constructed using Python. The findings may also reflect sampling limits: the present experiment uses a limited number of repositories on GitHub, since a single search cannot exceed the 1,000-result boundary on originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub, but this still cannot capture all the repositories on GitHub. Other, more stratified samples might give a more precise outcome.
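The sorting-strategy workaround can be sketched as follows: since one GitHub search query returns at most 1,000 results (10 pages of 100 items), the same query is re-issued under each sort/order combination to widen coverage. The endpoint and query parameters below are from the public GitHub Search API; the helper itself is an illustrative sketch, not STAMPER's actual collector:

```python
from itertools import product

SEARCH_API = "https://api.github.com/search/repositories"


def search_urls(keyword, per_page=100, max_pages=10):
    """All page URLs for one keyword across GitHub's sort/order options."""
    urls = []
    for sort, order in product(("stars", "updated"), ("asc", "desc")):
        for page in range(1, max_pages + 1):
            urls.append(
                f"{SEARCH_API}?q={keyword}&sort={sort}&order={order}"
                f"&per_page={per_page}&page={page}"
            )
    return urls
```

Each URL would then be fetched with an authenticated request; even so, the union of the sorted views is not guaranteed to cover every matching repository.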

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly: different models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers and users to develop their own heuristics for data selection: experts can easily change the API search in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this report by migrating to open-source software or a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models in terms of the number of repositories existing in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine-learning clustering algorithms (e.g. k-means) to high-resolution time-series data from commits.
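As a sketch of the proposed clustering step, a minimal k-means over per-repository commit-count time series (plain NumPy; in practice a library implementation such as scikit-learn's KMeans would likely be used, and the function below is illustrative only):

```python
import numpy as np


def kmeans(series, k, iters=50, seed=0):
    """Cluster commit-count time series (one row per repository) into k
    groups by iterating the standard assign/update k-means steps."""
    rng = np.random.default_rng(seed)
    X = np.asarray(series, dtype=float)
    # initialise centroids from k distinct repositories
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each series to its nearest centroid (squared distance)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # move each centroid to the mean of its assigned series
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```

Applied to weekly commit counts, such clusters would separate, say, briefly active repositories from continuously maintained ones.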

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis in GitHub. Our new approach used the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub deep-learning-related repositories and identified factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to help users gain a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving and changing rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past in previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated this program is my own original work

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Softwares

• PyCharm 2019.1.3 (Professional Edition)


Build #PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
macOS 10.14.6

• Anaconda

  – jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0
  - numpy==1.14.0
  - statistics==1.0.3.5
  - ratelimit==2.2.1
  - requests
  - altair
  - matplotlib==2.2.2
  - selenium
- Git

Datasets

asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents: Before You Begin, Prerequisites, Install, Running, Test, High Level Description of all Modules & Datasets, Authors, License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view, edit and execute our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keeps the Mac awake with this useful app (otherwise the internet connection will drop)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- Git authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used, listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repository metadata from GitHub in the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir. Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars, and order can be asc or desc.
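The page arithmetic behind get_total_pages can be sketched as follows. This is a hypothetical illustration only (the function body here is ours, not STAMPER's), assuming GitHub's default of 30 search results per page and the Search API's 1000-result cap mentioned later in the report:

```python
def get_total_pages(total_count, per_page=30, cap=1000):
    """Number of result pages the GitHub Search API will actually serve.

    The search endpoint returns at most `cap` (1000) results per query,
    so the page count is computed on the capped total.
    """
    effective = min(total_count, cap)
    return (effective + per_page - 1) // per_page  # ceiling division

print(get_total_pages(45))    # a 45-hit query spans 2 pages
print(get_total_pages(5000))  # capped at 1000 hits, i.e. 34 pages
```

This cap is why STAMPER offers the different sort/order combinations above: each ordering retrieves a different 1000-result slice of the same query.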

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder. Run python3 filtered_repo.py to filter your data. Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity.

Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance.

Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution.

Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module will record all the unreachable links and write them into the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your O

wn Search

In module M

odelpy

define your own entity lists (eg t

ensorflow_models

)

In Constructor Model

we store all unfiltered_data filtered_data and forked_tim

e_location in three folders

Instantiation

Since you already got data from the previous steps (1-2) Then you can construct a m

odel by calling aconstructor M

odel

eg bert = Model(bert tensorflow desc_by_star)

parameter M

odel_name and Respository m

etadata subfolder

Then you can call this object with its relative data easily (

from Model import bert

and use bert

as you goalong)

Customize Keyw

ords

In module m

odel_keywordpy

import your instantiation (

lstm

) and call add_keywords

eg

High Level D

escription of all Modules amp

Datasets

1 Data Collection

2 Repository Search

3 (Optional) D

ata Selection

4 Data Visualization

Altair is used to draw elegant graphs

Experiment D

atasets Collected

lstm_keywords = [tfkeraslayersLSTMCell tfnnrnn_cellLSTMCell]

lstmadd_keywords(lstm_keywords)

12

model_searcherpy

item_filterpy

12

model_searcherpy

forks_time_stamp_getterpy

12

repository_filterpy

filtered_repopy

12

contribution_statpy

entropy_calculationpy

Analysiscontribution_relatedpy

Analysismeta_datapy

1234

1. After Data Collection

output/
  asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
  asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional)

filtered_repo/
  bert.json
  pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
  tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs

graphs/
  contribution/: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
  maintenance/: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
  multi_variable/: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
  popularity/: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)


§4.3 Maintenance of Deep Learning Models in GitHub

[Figure: scatter plot of develop_duration (0–1100) against open_issues (0–2000), one series per model name: bert, cnn, lstm, ncf, resnet, transformer and wide deep tensorflow]

Figure 4.18: Development Time vs. Number of Open Issues

4.3.2 RQ2: Do old models have more issues compared to new models?

Figure 4.18 shows a scatter plot correlating development time with the number of open issues. As suggested visually by the figure, and by a Spearman correlation test, there is a weak correlation between these two variables (coef = 0.4608, p ≤ 0.01). Essentially, this result shows that older models, which carry a high maintenance cost, may have more users and therefore more issues related to them.

Specifically, as depicted in Table 4.7, the three models with the highest mean number of open issues are Bert (8.299), CNN (3.414) and Transformer (1.857). Moreover, from Figure 4.19 we can also see that most of the repositories have fewer than 200 open issues.
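The Spearman statistic reported above can be reproduced from the raw (development time, open issues) pairs. The following is a minimal pure-Python sketch of the rank-correlation computation; the thesis itself presumably relied on a statistics library rather than this code:

```python
def rank(xs):
    """Assign 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over the group of values tied with xs[order[i]]
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the two rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

For example, spearman applied to a perfectly monotone pair of series returns 1.0, and to a perfectly reversed pair returns -1.0.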

Model        mean   Std     25%  50%  75%  min  max
Bert         8.299  50.55   0    0    1    0    504
CNN          3.414  35.456  0    0    1    0    1077
LSTM         1.292  4.915   0    0    1    0    69
ResNet       1.791  11.164  0    0    0    0    186
Transformer  1.857  8.608   0    0    1    0    95
Wide Deep    0.231  0.742   0    0    0    0    4

Table 4.7: Repository Open Issue Statistics

STAMPER in Action

Model-Related Repository  Percentage of repositories having Wiki (%)
Bert                      97.17
CNN                       98.498
LSTM                      98.799
NCF                       98.864
ResNet                    98.817
Transformer               96.97
Wide Deep                 100

Table 4.8: Descriptive statistics on percentage of Wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data collected we can see that deep-learning-related repositories are well documented (i.e. they mostly have a wiki page). Figure 4.19 shows the histogram of open issues. All those samples are best modelled by long-tail distributions: most of the repositories have a much lower number of issues, with large sample variance.
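The per-model percentages in Table 4.8 can be computed directly from the collected metadata. A minimal sketch, assuming each repository record keeps GitHub's has_wiki flag as returned by the Search API (the exact field used in STAMPER's collected JSON is an assumption here):

```python
def wiki_percentage(repos):
    """Percentage of repository records whose metadata reports a wiki.

    `repos` is a list of dicts shaped like GitHub search results;
    records missing the `has_wiki` field count as having no wiki.
    """
    if not repos:
        return 0.0
    with_wiki = sum(1 for r in repos if r.get("has_wiki"))
    return 100.0 * with_wiki / len(repos)
```

Applied per keyword folder (e.g. the bert tensorflow.json records), this yields one row of Table 4.8.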

4.4 Summary

In this section we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software engineering aspects (popularity, contribution and maintenance) of deep learning repositories, using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported the maintenance metrics of deep learning repositories, and conducted an in-depth investigation to test whether developers change the original code base after forking.

[Figure: "Distribution of Issues" — histograms of open_issues (binned, 0–100) with y-axis Count of Records, one panel each for bert, cnn, lstm, ncf, resnet, transformer and wide deep tensorflow]

Figure 4.19: Open Issues vs. Number of Repositories


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the Wild: Limitation and Improvement

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas on the identification of models using different strategies, and we developed heuristics for in-depth analysis of the construction of models and of high-level APIs such as estimators and layers. However, our heuristic may not be perfect, considering the numerous ways of constructing models in the real world.

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question which needs further investigation in the future. For example, users may use the prototxt format to publish their models, while in our project we only focused on deep learning models constructed using Python. Findings may also reflect sampling limitations: the present experiment uses a limited number of repositories on GitHub, since a search cannot exceed the 1000-result boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples might produce a more precise outcome.

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to know how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly: different models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers and users to develop their own heuristics in data selection: experts could easily change the API search in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software, or becoming a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models via the number of related repositories that exist on GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could go further and develop techniques that apply machine learning clustering algorithms (e.g. K-Means) to high-resolution time-series data derived from commits.
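As a sketch of this direction (hypothetical; not part of STAMPER), commit timestamps could first be bucketed into a weekly time series before any clustering is applied:

```python
from collections import Counter
from datetime import datetime

def weekly_commit_counts(timestamps):
    """Bucket ISO-8601 commit timestamps (the format the GitHub API returns,
    e.g. "2019-10-01T12:00:00Z") into (ISO year, ISO week) counts, producing
    a time series suitable as input to a clustering algorithm."""
    weeks = Counter()
    for ts in timestamps:
        dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
        iso = dt.isocalendar()
        weeks[(iso[0], iso[1])] += 1
    return dict(weeks)
```

Aligned per-repository, these weekly vectors could then be fed to K-Means to group repositories with similar activity trends.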

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of deep-learning-related GitHub repositories, and identified factors affecting these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields, and will serve the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving and changing rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past, based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization and analysis techniques for representing trends in their use


Transformerjson

U-Net pytorchjson

U-Netjson

WaveGlowjson

Wide ResNetjson

fairseqjson

$

vgg_netsjson

2 After Repository Search

forked_timestamp

bert tensorflowcsv

cnn tensorflowcsv

lstm tensorflowcsv

ncf tensorflowcsv

resnet tensorflowcsv

transformer tensorflowcsv

$

wide deep tensorflowcsv

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

70

71

72

73

Generated G

raphs

3 After Data Selection (Optional)

filtered_repo

bertjson

pytorch_model_filtering

Densenetjson

FCN-ResNet101json

GoogleNetjson

MobileNet v2json

ResNet101json

ResNextjson

ShuffleNet v2json

SqueezeNetjson

Tacotron 2json

Wide ResNetjson

$

vgg_netsjson

$

tensorflow_model_filtering

bertjson

lstmjson

ncfjson

resnetjson

transformerjson

$

wide deepjson

74

75

76

77

78

79

80

81

82

83

84

85

86

87

88

89

90

91

92

93

94

95

96

97

98

graphs

contribution

change_to_pdfbash

entropy_distributionsvg

entropy_dotssvg

lines_changed_boxssvg

lines_changed_histssvg

unique_percentage_distributionsvg

uniqueness_chartsvg

maintenance

devTime_boxplotsvg

issues_distributionsvg

wiki_ynsvg

multi_variable

dev_t_to_open_issuessvg

multi_correlationsvg

star_to_contributorssvg

star_to_dev_tsvg

star_to_entropysvg

$

star_to_open_issuessvg

$

popularity

accumulated_popularitysvg

creation_repository_trend_totalsvg

creation_with_fork_timelinesvg

fork_distributionsvg

popularity_dotsvg

$

popularity_measurement_correlationsvg

123456789

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

Authors

Xing(Nicole) Yu w

ith

Under the Supervison of D

r Ben Swift

License and References

MIT copy

Xing Yu

Stamper Im

age from httpbestpriceforrubberstam

pscom (License Free for personal use only)

Bibliography

a Github description httpshelpgithubcomenenterprise216userarticlessaving-repositories-with-stars Accessed 2019-09-22 (cited on pages xv and 20)

b Github description httpswwwmetrics-toolkitorggithub-forks-collaborators-watchers Accessed 2019-09-22 (cited on pagesxv 19 and 20)

c Github description httpshelpgithubcomenarticleswatching-and-unwatching-repositories Accessed 2019-09-22 (cited on page20)

d Github Search API description httpsdevelopergithubcomv3rate-limitingAccessed 2019-09-22 (cited on page 12)

Abadi M Barham P Chen J Chen Z Davis A Dean J Devin MGhemawat S Irving G Isard M Kudlur M Levenberg J Monga RMoore S Murray D G Steiner B Tucker P Vasudevan V Warden PWicke M Yu Y and Zheng X 2016 Tensorflow A system for large-scalemachine learning In 12th USENIX Symposium on Operating Systems Design andImplementation (OSDI 16) 265ndash283 USENIX Association Savannah GA httpswwwusenixorgconferenceosdi16technical-sessionspresentationabadi (citedon page 4)

Borges H Hora A and Valente M T 2016a Predicting the popularity ofgithub repositories In Proceedings of the The 12th International Conference on Pre-dictive Models and Data Analytics in Software Engineering 9 ACM (cited on page8)

Borges H Hora A and Valente M T 2016b Understanding the factors thatimpact the popularity of github repositories In 2016 IEEE International Conferenceon Software Maintenance and Evolution (ICSME) 334ndash344 IEEE (cited on pages 8and 19)

Casalnuovo C Suchak Y Ray B and Rubio-Gonzaacutelez C 2017 Gitcproc Atool for processing and classifying github commits In Proceedings of the 26th ACMSIGSOFT International Symposium on Software Testing and Analysis 396ndash399 ACM(cited on page 9)

Cheng H-T Koc L Harmsen J Shaked T Chandra T Aradhye H Ander-son G Corrado G Chai W Ispir M et al 2016 Wide amp deep learning

59

60 BIBLIOGRAPHY

for recommender systems In Proceedings of the 1st workshop on deep learning forrecommender systems 7ndash10 ACM (cited on page 7)

Collberg C Kobourov S Nagra J Pitts J and Wampler K 2003 A systemfor graph-based visualization of the evolution of software In Proceedings of the 2003ACM symposium on Software visualization 77ndashff ACM (cited on page 10)

Corder G W and Foreman D I 2011 Nonparametric statistics for non-statisticians (cited on page 22)

Devlin J Chang M-W Lee K and Toutanova K 2018 Bert Pre-trainingof deep bidirectional transformers for language understanding arXiv preprintarXiv181004805 (2018) (cited on page 6)

Feiner J and Andrews K 2018 Repovis Visual overviews and full-text searchin software repositories In 2018 IEEE Working Conference on Software Visualization(VISSOFT) 1ndash11 IEEE (cited on page 9)

Gote C Scholtes I and Schweitzer F 2019 git2net mining time-stamped co-editing networks from large git repositories In Proceedings of the 16th InternationalConference on Mining Software Repositories 433ndash444 IEEE Press (cited on pages xvand 10)

Gousios G Pinzger M and Deursen A v 2014 An exploratory study of thepull-based software development model In Proceedings of the 36th InternationalConference on Software Engineering 345ndash355 ACM (cited on page 8)

Gousios G and Spinellis D 2012 Ghtorrent Githubrsquos data from a firehoseIn 2012 9th IEEE Working Conference on Mining Software Repositories (MSR) 12ndash21IEEE (cited on page 9)

He X Liao L Zhang H Nie L Hu X and Chua T-S 2017 Neural collabo-rative filtering In Proceedings of the 26th international conference on world wide web173ndash182 International World Wide Web Conferences Steering Committee (citedon page 6)

Hochreiter S and Schmidhuber J 1997 Long short-term memory Neural com-putation 9 8 (1997) 1735ndash1780 (cited on page 5)

LeCun Y Bengio Y and Hinton G 2015 Deep learning nature 521 7553 (2015)436 (cited on page 3)

Servant F and Jones J A 2013 Chronos Visualizing slices of source-code historyIn 2013 First IEEE Working Conference on Software Visualization (VISSOFT) 1ndash4 IEEE(cited on page 9)

Sokol F Z Aniche M F and Gerosa M A 2013 Metricminer Supportingresearchers in mining software repositories In 2013 IEEE 13th International WorkingConference on Source Code Analysis and Manipulation (SCAM) 142ndash146 IEEE (citedon page 9)

BIBLIOGRAPHY 61

Subramanian V 2018 Deep Learning with PyTorch A practical approach to buildingneural network models using PyTorch Packt Publishing Ltd (cited on page 3)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on page 6)


42 STAMPER in Action

Model-Related Repository    Percentage of repositories having a wiki (%)
Bert                        97.17
CNN                         98.498
LSTM                        98.799
NCF                         98.864
Resnet                      98.817
Transformer                 96.97
Wide deep                   100

Table 4.8: Descriptive statistics on the percentage of wiki existence

4.3.3 RQ3: Are they well maintained?

Table 4.8 shows descriptive statistics on the percentage of repositories that have documentation (a wiki). The percentage ranges from 96.97% to 100%. Therefore, from the data we collected, we can see that deep-learning-related repositories are well documented (i.e., most have a wiki page). Figure 4.19 shows the histogram of open issues. All of these samples are best modelled by long-tail distributions: most repositories have a much lower number of issues, with large sample variance.
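As a concrete illustration, percentages like those in Table 4.8 can be recomputed from the collected repository metadata. The sketch below is hypothetical (not the thesis's actual code) and assumes each record is a dict carrying GitHub's boolean `has_wiki` field, as returned by the repository API:

```python
def wiki_percentage(repos):
    """Percentage of repository records whose metadata has has_wiki == True.

    `repos` is a list of repository records (each a dict with a boolean
    "has_wiki" field, as in GitHub's repository metadata).
    """
    if not repos:
        return 0.0
    with_wiki = sum(1 for r in repos if r.get("has_wiki"))
    return 100.0 * with_wiki / len(repos)

# Hypothetical sample: 3 of 4 records carry a wiki page.
sample = [{"has_wiki": True}, {"has_wiki": True},
          {"has_wiki": False}, {"has_wiki": True}]
print(wiki_percentage(sample))  # 75.0
```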

4.4 Summary

In this section, we first used STAMPER to get the metadata of repositories related to DL models from GitHub. Then we investigated three common software-engineering aspects (popularity, contribution and maintenance) of deep learning repositories, using the data collected by STAMPER.

We identified some patterns of popularity in deep learning repositories, derived from the Spearman correlation test. We also reported the maintenance metrics of deep learning repositories and conducted an in-depth investigation to test whether developers change the original code base after forking.
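The Spearman correlation mentioned above can be computed without external libraries; the following is a minimal sketch (not the project's actual analysis code), taking rho as the Pearson correlation of the rank vectors, with ties assigned their average rank:

```python
def rankdata(xs):
    """1-based ranks of xs, with tied values assigned their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        # extend j over a run of equal values
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

For a perfectly monotonic pair (e.g. stars vs. forks both strictly increasing), `spearman` returns 1.0; for a strictly decreasing relationship it returns -1.0.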


Figure 4.19: Open Issues vs. Number of Repositories. [Figure: "Distribution of Issues" — per-model histograms (bert tensorflow, cnn tensorflow, lstm tensorflow, ncf tensorflow, resnet tensorflow, transformer tensorflow, wide deep tensorflow) of open_issues (binned, 0-100) against count of records.]


Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the wild: Limitation and Improvement

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas on identifying models using different strategies, and we developed heuristics for an in-depth analysis of how models are constructed, including high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.
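As a concrete illustration of such a heuristic, API-name matching over a repository's Python sources might look like the sketch below. The function `match_model_apis` and its keyword table are hypothetical stand-ins, not the project's actual code; the LSTM keyword strings follow the example given in the README appendix:

```python
# Hypothetical keyword heuristic: flag which model-construction APIs
# appear in a blob of Python source code.
MODEL_KEYWORDS = {
    "lstm": ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"],
    "estimator": ["tf.estimator.Estimator"],
}

def match_model_apis(source, keywords=MODEL_KEYWORDS):
    """Return the set of model labels whose API strings occur in `source`."""
    return {label for label, apis in keywords.items()
            if any(api in source for api in apis)}

snippet = "cell = tf.nn.rnn_cell.LSTMCell(128)"
print(match_model_apis(snippet))  # {'lstm'}
```

Plain substring matching like this is exactly the kind of imperfect heuristic discussed here: it misses aliased imports (`from tensorflow.nn import rnn_cell`) and models published in other formats.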

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question which needs further investigation in the future. For example, users may use the prototxt format to publish their models, whereas in our project we only focused on deep learning models constructed using Python. Our findings may also reflect sampling limitations: the present experiment uses a limited number of repositories on GitHub, since a search cannot exceed the 1000-result boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples might produce a more precise outcome.

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to learn how deep learning is involved in our real lives, an idea that was novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly, with different models released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild; the program could then provide a broader picture of deep learning model usage in the world. Our program also allows developers or users to develop their own heuristics for data selection: experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this, by migrating to open-source software or serving as a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project, we investigated and examined the popularity of deep learning models through the number of repositories that exist on GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., k-means) to high-resolution time-series data from commits.
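A minimal sketch of the proposed clustering step, using a plain 1-D k-means over monthly commit counts. This is an illustration under assumptions (deterministic initialisation, k >= 2, made-up commit counts), not an implementation from the project:

```python
def kmeans_1d(values, k, iters=50):
    """Plain k-means on 1-D data (e.g. monthly commit counts), k >= 2.

    Deterministic initialisation: k evenly spaced points of the sorted data.
    Returns (centroids, labels).
    """
    s = sorted(values)
    centroids = [s[i * (len(s) - 1) // (k - 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # assignment step: each value joins its nearest centroid
        labels = [min(range(k), key=lambda j: abs(v - centroids[j]))
                  for v in values]
        # update step: each centroid moves to the mean of its members
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, labels

# Hypothetical monthly commit counts: a quiet period, then a burst of activity.
commits = [2, 3, 1, 2, 40, 44, 38, 41]
centroids, labels = kmeans_1d(commits, k=2)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

On real commit histories one would cluster feature vectors (e.g. per-month counts per repository) rather than raw scalars, but the assignment/update loop is the same.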

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of deep-learning-related GitHub repositories and identified factors that affect these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users a deeper insight into the current deep learning trend and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis on GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

The ML software landscape (both models and datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores lets developers learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What are the differences between those model stores?

By scraping, analyzing and visualizing the use of various models and datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model and dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py
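The contribution analysis relies on an entropy measure over contributors (see entropy_calculation.py and the entropy_distribution graphs). A minimal sketch of Shannon entropy over contributors' commit shares is given below; the helper name and the exact definition used by the tool are assumptions for illustration:

```python
import math

def contribution_entropy(commit_counts):
    """Shannon entropy (bits) of the distribution of commits over contributors.

    0 bits means one contributor made every commit; log2(n) bits means
    commits are spread evenly across n contributors.
    """
    total = sum(commit_counts)
    probs = [c / total for c in commit_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

print(contribution_entropy([10, 0, 0]))    # 0.0  (a single active contributor)
print(contribution_entropy([5, 5, 5, 5]))  # 2.0  (uniform over 4 contributors)
```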

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Softwares

• PyCharm

PyCharm 2019.1.3 (Professional Edition)
Build PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
macOS 10.14.6

• Anaconda

– jupyter-notebook 6.0.0

Other

Python 3.7.4
pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
Git

Datasets

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

Before You Begin | Prerequisites | Install | Running | Test | High Level Description of all Modules & Datasets | Authors | License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing, editing or executing our code.
Amphetamine on the Mac App Store: keep your Mac awake with this useful app (otherwise it will disconnect from the internet).

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:
Git - https://git-scm.com/downloads
Git authentication token
Python 3.7 with pip
Jupyter Notebook 6.0.0
All external libraries used are listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt.

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repositories' metadata from GitHub in the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir. Then you need to run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample Case: in main(), change keywords in terms of interest. The resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be "updated" or "stars", and order can be "asc" or "desc".
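The sorting options matter because, as discussed in the report, a GitHub search can only reach the first 1000 results. A sketch of the pagination arithmetic behind a function like get_total_pages is shown below; the names and parameters here are illustrative assumptions, not STAMPER's actual code:

```python
import math

GITHUB_SEARCH_CAP = 1000  # the GitHub Search API returns at most 1000 results per query
PER_PAGE = 100            # maximum page size the API allows

def get_total_pages(total_count, per_page=PER_PAGE):
    """How many result pages are actually fetchable for one query."""
    reachable = min(total_count, GITHUB_SEARCH_CAP)
    return math.ceil(reachable / per_page)

def search_params(keyword, page, sort="stars", order="desc"):
    """Query parameters for one page of GET /search/repositories."""
    assert sort in ("updated", "stars") and order in ("asc", "desc")
    return {"q": keyword, "sort": sort, "order": order,
            "per_page": PER_PAGE, "page": page}

print(get_total_pages(4500))  # 10 -- only the first 1000 of 4500 hits are reachable
```

Running the same query twice, once with order="asc" and once with order="desc", is the sorting trick the report uses to reach more than 1000 repositories.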

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the forks' timestamps in forked_timestamp.

3. Data Selection (Optional)

pip3 install --upgrade pip
pip3 install -r requirements.txt

Run python3 repository_filter.py to get your code-related repositories, with statistics, in the filtered_repo folder. Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity.
Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance.
Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution.
Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module will record all the unreachable links and write them into the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all the unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: since you already got data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with parameters Model_name and repository-metadata subfolder. Then you can call this object with its relative data easily (from Model import bert, and use bert as you go along).
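A minimal sketch of what such a Model wrapper could look like. The class below is a hypothetical reconstruction for illustration (the layout root/subfolder/"model name".json is an assumption), not the actual Model.py:

```python
from pathlib import Path

class Model:
    """Bundle a model's name with the subfolder holding its repository metadata."""

    def __init__(self, model_name, metadata_subfolder, root="output"):
        self.model_name = model_name
        self.metadata_subfolder = metadata_subfolder
        # assumed layout: <root>/<subfolder>/<model name>.json
        self.metadata_path = Path(root) / metadata_subfolder / (model_name + ".json")

bert = Model("bert tensorflow", "desc_by_star")
print(bert.metadata_path.as_posix())  # output/desc_by_star/bert tensorflow.json
```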

Customize Keywords

In module model_keyword.py, import your instantiation (e.g. lstm) and call add_keywords, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py
Altair is used to draw elegant graphs.

Experiment Datasets Collected

1. After Data Collection

output/
  asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
  asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv


Generated Graphs

3. After Data Selection (Optional)

filtered_repo/
  bert.json
  pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
  tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json


graphs/
  contribution/: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
  maintenance/: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
  multi_variable/: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
  popularity/: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only).

Bibliography

a. GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

b. GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

c. GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

d. GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng H-T Koc L Harmsen J Shaked T Chandra T Aradhye H Ander-son G Corrado G Chai W Ispir M et al 2016 Wide amp deep learning

59

60 BIBLIOGRAPHY

for recommender systems In Proceedings of the 1st workshop on deep learning forrecommender systems 7ndash10 ACM (cited on page 7)

Collberg C Kobourov S Nagra J Pitts J and Wampler K 2003 A systemfor graph-based visualization of the evolution of software In Proceedings of the 2003ACM symposium on Software visualization 77ndashff ACM (cited on page 10)

Corder G W and Foreman D I 2011 Nonparametric statistics for non-statisticians (cited on page 22)

Devlin J Chang M-W Lee K and Toutanova K 2018 Bert Pre-trainingof deep bidirectional transformers for language understanding arXiv preprintarXiv181004805 (2018) (cited on page 6)

Feiner J and Andrews K 2018 Repovis Visual overviews and full-text searchin software repositories In 2018 IEEE Working Conference on Software Visualization(VISSOFT) 1ndash11 IEEE (cited on page 9)

Gote C Scholtes I and Schweitzer F 2019 git2net mining time-stamped co-editing networks from large git repositories In Proceedings of the 16th InternationalConference on Mining Software Repositories 433ndash444 IEEE Press (cited on pages xvand 10)

Gousios G Pinzger M and Deursen A v 2014 An exploratory study of thepull-based software development model In Proceedings of the 36th InternationalConference on Software Engineering 345ndash355 ACM (cited on page 8)

Gousios G and Spinellis D 2012 Ghtorrent Githubrsquos data from a firehoseIn 2012 9th IEEE Working Conference on Mining Software Repositories (MSR) 12ndash21IEEE (cited on page 9)

He X Liao L Zhang H Nie L Hu X and Chua T-S 2017 Neural collabo-rative filtering In Proceedings of the 26th international conference on world wide web173ndash182 International World Wide Web Conferences Steering Committee (citedon page 6)

Hochreiter S and Schmidhuber J 1997 Long short-term memory Neural com-putation 9 8 (1997) 1735ndash1780 (cited on page 5)

LeCun Y Bengio Y and Hinton G 2015 Deep learning nature 521 7553 (2015)436 (cited on page 3)

Servant F and Jones J A 2013 Chronos Visualizing slices of source-code historyIn 2013 First IEEE Working Conference on Software Visualization (VISSOFT) 1ndash4 IEEE(cited on page 9)

Sokol F Z Aniche M F and Gerosa M A 2013 Metricminer Supportingresearchers in mining software repositories In 2013 IEEE 13th International WorkingConference on Source Code Analysis and Manipulation (SCAM) 142ndash146 IEEE (citedon page 9)

BIBLIOGRAPHY 61

Subramanian V 2018 Deep Learning with PyTorch A practical approach to buildingneural network models using PyTorch Packt Publishing Ltd (cited on page 3)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on page 6)

  • Acknowledgments
  • Abstract
  • List of Abbreviations
  • Contents
  • Introduction
    • Trace Deep Learning use through GitHub
    • Contribution
    • Report Outline
      • Background and Related Work
        • Background
          • Deep learning
            • TensorFlow
            • PyTorch
              • Deep learning models
              • Summarized Timeline
                • Public Code Repositories
                  • Web-based hosting service
                  • Measuring Popularity From GitHub
                  • Extracting Messy Data in the Wild
                  • Visualizing data in Repositories
                    • Summary
                      • STAMPER Design and Implementation
                        • Overview
                        • Data Collection
                        • Repository Search
                        • Data Selection
                          • Example
                            • Construct the Visualizations
                            • Summary
                              • STAMPER in Action
                                • Popularity of Deep Learning Models in GitHub
                                  • Popularity Feature Selection
                                  • Past and Current Status A Full Integration
                                  • RQ1 How has the popularity of model changed over time A closer look at the deep learning models
                                  • RQ2 How popularity varies per model
                                  • RQ3 Does the popularity of models relate to other features
                                    • Contribution of Deep Learning Models in GitHub
                                      • Collaborative Contribution
                                      • RQ1 After forking do developers change the codebase
                                        • Maintenance of Deep Learning Models in GitHub
                                          • RQ1 How long has it been in existence
                                          • RQ2 Do old models have more issues compared to new models
                                          • RQ3 Are they well maintained
                                            • Summary
                                              • Discussion And Future Work
                                                • Discussion
                                                  • Data in the wild Limitation and Improvement
                                                  • Extensibility and Open-Source Software
                                                    • Future Work
                                                      • Social Network Analysis in GitHub
                                                      • Trend Detection using Commitments Timestamp
                                                          • Conclusion
                                                          • Appendix
                                                            • Appendix 1 Project Description
                                                              • Project Title
                                                              • Supervisors
                                                              • Project Description
                                                              • Learning Objectives
                                                                • Appendix 2 Study Contract
                                                                • Appendix 3 Artefact Description
                                                                  • Code Files Submitted
                                                                  • Program Testing
                                                                  • Experiment
                                                                    • Hardware
                                                                      • Softwares
                                                                      • Other
                                                                      • Datasets
                                                                        • Appendix 4 README
Page 58: Mapping the landscape of deep learning models use in the wild · Mapping the landscape of deep learning models use in the wild Xing Yu (u6034476) A report submitted for the course

[Figure 4.19: Open Issues vs. Number of Repositories. Seven histograms ("Distribution of Issues"), one per model: bert tensorflow, cnn tensorflow, lstm tensorflow, ncf tensorflow, resnet tensorflow, transformer tensorflow, and wide deep tensorflow. Each panel plots the Count of Records (0-800) against open_issues binned from 0 to 100.]

44 STAMPER in Action

Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the wild: Limitation and Improvement

Pinpointing the landmark deep learning models is tricky. We had plenty of ideas on how to identify models using different strategies, and we developed heuristics for an in-depth analysis of how models are constructed, including high-level APIs such as estimators and layers. However, our heuristics may not be perfect, considering the numerous ways of constructing models in the real world.
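Concretely, such a heuristic amounts to keyword matching against known model-construction APIs. A minimal sketch follows: the two patterns are the LSTM keywords given in our README, while the matching function itself is illustrative only, not the submitted code.

```python
# Illustrative keyword heuristic; patterns taken from the README's LSTM
# example, matching logic is our own sketch.
lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]

def uses_api(source, keywords):
    """Flag a source file that mentions any of the model-defining APIs."""
    return any(k in source for k in keywords)

snippet = "cell = tf.nn.rnn_cell.LSTMCell(num_units=128)"
print(uses_api(snippet, lstm_keywords))  # True
```

A repository is then counted towards a model when any of its Python files trips one of the keywords.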

A sampling problem exists at the same time. The models we chose cannot represent all the new models in the wild; this remains an open research question that needs further investigation. For example, users may publish their models in prototxt format, whereas our project focused only on deep learning models constructed in Python. Our findings may also reflect sampling limits: the present experiment uses a limited number of repositories, because a GitHub search cannot exceed the 1000-result boundary for originally created repositories. We tried to overcome this issue using the different sorting strategies provided by GitHub, but even so we cannot capture all the repositories on GitHub. Other, more stratified samples might yield a more precise outcome.
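Each extra (sort, order) combination gives access to up to another 1000 results for the same query, which is the workaround described above. The sketch below shows the idea; the Search API endpoint and its sort/order parameters are real, but the helper function is our own illustration, not STAMPER's code.

```python
# Sketch of widening coverage past GitHub's 1000-result search cap by
# repeating a query under every sort order STAMPER supports.
from urllib.parse import urlencode

API = "https://api.github.com/search/repositories"

def search_urls(keyword, pages=10, per_page=100):
    """Yield one request URL per (sort, order, page) combination."""
    for sort in ("stars", "updated"):
        for order in ("asc", "desc"):
            for page in range(1, pages + 1):
                yield API + "?" + urlencode({
                    "q": keyword, "sort": sort, "order": order,
                    "per_page": per_page, "page": page})

urls = list(search_urls("lstm tensorflow"))
print(len(urls))  # 40 result pages: 2 sorts x 2 orders x 10 pages
```

Even so, the four orderings overlap for small result sets, which is why this still cannot capture every matching repository.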

However, our research project is essential and necessary in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real lives, an idea that is novel compared with previous works.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; new models are released on a timescale of months. In the future, it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program can provide a broader picture of deep learning model usage in the world. Our program also allows developers and users to devise their own heuristics for data selection: experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this by migrating to open-source software, or becoming a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could address the classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.

5.2.2 Trend Detection using Commit Timestamps

In this project, we investigated and examined the popularity of deep learning models through the number of repositories that exist on GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future, we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g., k-means) to the high-resolution time-series data from commits.
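As a self-contained illustration of that idea (synthetic data and a tiny hand-rolled 1-D k-means, not part of STAMPER), clustering weekly commit counts can separate quiet and busy activity regimes in a repository's history:

```python
# Minimal 1-D k-means over weekly commit counts (illustrative only).
def kmeans_1d(xs, k=2, iters=50):
    lo, hi = min(xs), max(xs)
    # Spread initial centers evenly across the observed range.
    centers = [lo + i * (hi - lo) / (k - 1) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            # Assign each point to its nearest center.
            groups[min(range(k), key=lambda i: abs(x - centers[i]))].append(x)
        # Recompute centers as group means; keep empty groups' old centers.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

# Synthetic weekly commit counts: a quiet repository with one burst of work.
weekly = [1, 2, 1, 0, 2, 15, 18, 20, 17, 3, 2, 1]
print(kmeans_1d(weekly))  # [1.5, 17.5]
```

With real data, the input would instead be commit counts bucketed from the timestamps collected in the forked_timestamp CSVs.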

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance, and contribution of deep-learning-related GitHub repositories and identified the factors that affect each of these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub; one avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, and to serve the needs of people working at the intersection of social media analysis, data visualization, and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores lets developers learn, train, and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing, and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization/analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py
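The contribution graphs rely on entropy_calculation.py. As an illustration of the underlying idea only (this sketch is ours, not the submitted code, and it assumes the entropy is taken over per-contributor commit shares):

```python
# Illustrative sketch: Shannon entropy of a repository's per-contributor
# commit distribution. One dominant author gives low entropy; evenly
# spread contributions give high entropy.
from math import log2

def contribution_entropy(commit_counts):
    total = sum(commit_counts)
    return sum(-c / total * log2(c / total) for c in commit_counts if c > 0)

print(contribution_entropy([50, 50]))  # 1.0  (two equal contributors)
print(contribution_entropy([100]))     # 0.0  (a single contributor)
```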

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Softwares

• PyCharm 2019.1.3 (Professional Edition)
  Build PY-191.7479.30, built on May 30, 2019
  Licensed to ANU / Xing Yu
  JRE 11.0.2+9-b159.60 x86_64
  JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
  macOS 10.14.6

• Anaconda
  jupyter-notebook 6.0.0

Other

• Python 3.7.4, with pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
• Git

Datasets

• asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
• desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
• desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
• asc_by_star: cnn tensorflow.json, lstm tensorflow.json
• pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net.json, U-Net pytorch.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json
• by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
• filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
• filtered_repo/pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
• forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

- Before You Begin
- Prerequisites
- Install
- Running
- Test
- High Level Description of all Modules & Datasets
- Authors
- License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view, edit, and execute our code:

- PyCharm
- Anaconda
- Amphetamine (on the Mac App Store): keeps the Mac awake with this useful app (otherwise the machine sleeps and the internet connection drops)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git: https://git-scm.com/downloads
- Git authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All the external libraries listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project, then run python3 model_searcher.py to fetch the metadata of keyword-related repositories on GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir. Then run sh JSONFormatter.sh in your terminal to well-format the output data.

Sample case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars, and order can be asc or desc.

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

Make sure the dependencies are installed (pip3 install --upgrade pip, then pip3 install -r requirements.txt). Run python3 repository_filter.py to get the code-related repositories with statistics in the filtered_repo folder, then run python3 filtered_repo.py to filter the data.

Note: keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

- Popularity: run python3 visualizations/popularity.py; the graphs are written to visualizations/graphs/popularity.
- Maintenance: run python3 visualizations/maintenance.py; the graphs are written to visualizations/graphs/maintenance.
- Contribution: run python3 visualizations/contribution.py; the graphs are written to visualizations/graphs/contribution.
- Multi Correlations: run python3 visualizations/multi_variable.py; the graphs are written to visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not well maintained, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them to the file unreachable_urls.txt.

Usage: change the elements in keywords and run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data, and forked_time_location in three folders.

Instantiation

Since you already got data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), whose parameters are the model name and the repository-metadata subfolder. Then you can use this object with its related data easily (from Model import bert and use bert as you go along).

Customize Keywords

In module model_keyword.py, import your instantiation (lstm) and call add_keywords with a list of API keyword strings.
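To make the constructor's shape concrete, here is a minimal hypothetical sketch: the class name and the two constructor arguments come from the README above, while the path layout and the load method are our own assumptions, not STAMPER's actual implementation.

```python
# Hypothetical sketch of the Model wrapper described above.
import json
from pathlib import Path

class Model:
    def __init__(self, name, subfolder):
        self.name = name
        # Assumption: metadata from step 1 lives under output/<subfolder>/.
        self.path = Path("output") / subfolder / f"{name}.json"

    def load(self):
        """Parse the collected repository metadata for this model."""
        return json.loads(self.path.read_text())

bert = Model("bert tensorflow", "desc_by_star")
print(bert.path.as_posix())  # output/desc_by_star/bert tensorflow.json
```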

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Customize Keywords example:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

Experiment Datasets Collected

1. After Data Collection

output/
  asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
  asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
  desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
  pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional)

filtered_repo/
  bert.json
  pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
  tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs

graphs/
  contribution/: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
  maintenance/: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
  multi_variable/: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
  popularity/: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: Free for personal use only)

Bibliography

[a] Github description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] Github description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19, and 20)

[c] Github description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] Github Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

• Acknowledgments
• Abstract
• List of Abbreviations
• Contents
• Introduction
  • Trace Deep Learning use through GitHub
  • Contribution
  • Report Outline
• Background and Related Work
  • Background
    • Deep learning
      • TensorFlow
      • PyTorch
      • Deep learning models
      • Summarized Timeline
    • Public Code Repositories
      • Web-based hosting service
      • Measuring Popularity From GitHub
      • Extracting Messy Data in the Wild
      • Visualizing data in Repositories
  • Summary
• STAMPER Design and Implementation
  • Overview
  • Data Collection
  • Repository Search
  • Data Selection
    • Example
  • Construct the Visualizations
  • Summary
• STAMPER in Action
  • Popularity of Deep Learning Models in GitHub
    • Popularity Feature Selection
    • Past and Current Status: A Full Integration
    • RQ1: How has the popularity of models changed over time? A closer look at the deep learning models
    • RQ2: How does popularity vary per model?
    • RQ3: Does the popularity of models relate to other features?
  • Contribution of Deep Learning Models in GitHub
    • Collaborative Contribution
    • RQ1: After forking, do developers change the codebase?
  • Maintenance of Deep Learning Models in GitHub
    • RQ1: How long has it been in existence?
    • RQ2: Do old models have more issues compared to new models?
    • RQ3: Are they well maintained?
  • Summary
• Discussion And Future Work
  • Discussion
    • Data in the wild: Limitation and Improvement
    • Extensibility and Open-Source Software
  • Future Work
    • Social Network Analysis in GitHub
    • Trend Detection using Commit Timestamps
• Conclusion
• Appendix
  • Appendix 1: Project Description
    • Project Title
    • Supervisors
    • Project Description
    • Learning Objectives
  • Appendix 2: Study Contract
  • Appendix 3: Artefact Description
    • Code Files Submitted
    • Program Testing
    • Experiment
      • Hardware
      • Softwares
      • Other
      • Datasets
  • Appendix 4: README
Chapter 5

Discussion And Future Work

5.1 Discussion

5.1.1 Data in the wild: Limitation and Improvement

Pinpointing the landmark of deep learning models is tricky. We had plenty of ideas on how to identify models using different strategies, and we developed heuristics for in-depth analysis of how models are constructed, including high-level APIs such as estimators and layers. However, our heuristics may not be perfect, given the numerous ways of constructing models in the real world.
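The flavour of heuristic we mean can be sketched as a keyword scan over a repository's Python sources. The API list and file-walking logic below are illustrative only, not the exact rules STAMPER uses:

```python
import re
from pathlib import Path

# Illustrative high-level API signatures; a real heuristic would use a
# larger, model-specific keyword list.
MODEL_APIS = [
    "tf.keras.layers.LSTMCell",
    "tf.nn.rnn_cell.LSTMCell",
    "tf.estimator.Estimator",
]

def find_model_usage(repo_root):
    """Return {api: [files using it]} over every Python file in a repo."""
    hits = {api: [] for api in MODEL_APIS}
    for path in Path(repo_root).rglob("*.py"):
        try:
            source = path.read_text(errors="ignore")
        except OSError:
            continue  # unreadable file; skip it
        for api in MODEL_APIS:
            # escape dots so "tf.keras" does not match "tfXkeras"
            if re.search(re.escape(api), source):
                hits[api].append(str(path))
    return {api: files for api, files in hits.items() if files}
```

Even a scan like this misses models built through aliased imports or custom wrappers, which is exactly the imperfection discussed above.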

There is also a sampling problem. The models we chose cannot represent all the new models in the wild; this is an open research question which needs further investigation. For example, users may publish their models in prototxt format, while our project only considered deep learning models constructed in Python. Our findings may reflect sampling limits in another way as well: the present experiment uses a limited number of repositories on GitHub, since a search cannot exceed the 1000-result boundary for originally created repositories. We tried to overcome this issue by using the different sorting strategies GitHub provides; however, this still cannot capture all the repositories on GitHub. Other, more stratified samples might produce a more precise outcome.
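The sort-order workaround can be sketched as follows: issue the same query under several sort fields and directions (each capped at 1000 results by the GitHub Search API) and deduplicate the union by repository id. The helper below is a sketch, not STAMPER's exact implementation:

```python
from itertools import product

def search_strategies(keyword):
    """Enumerate GitHub search-parameter combinations for one keyword.

    Each (sort, order) pair retrieves at most 1000 results, so varying
    them widens, but does not remove, the sampling window.
    """
    sorts = ["stars", "forks", "updated"]
    orders = ["asc", "desc"]
    return [
        {"q": keyword, "sort": s, "order": o, "per_page": 100}
        for s, o in product(sorts, orders)
    ]

def merge_results(pages):
    """Deduplicate repositories fetched under several strategies by id."""
    seen = {}
    for page in pages:
        for repo in page:
            seen[repo["id"]] = repo
    return list(seen.values())
```

For a popular keyword the six strategies overlap heavily at the extremes of each ordering, which is why the union still undersamples the middle of the distribution.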

Nevertheless, our research project is valuable in that it provides an intuitive way for researchers and developers to see how deep learning is involved in our real lives, an idea that is novel compared with previous work.

5.1.2 Extensibility and Open-Source Software

The field of deep learning is changing rapidly; new models are released on a timescale of months. In the future it would be advantageous to incorporate other contributors into this project and explore other models in the wild, so that the program could provide a broader picture of deep learning model usage in the world. Our program also allows developers and users to build their own heuristics for data selection: experts can easily change the API search terms in our program to gain an in-depth understanding of what has been done in the past. We expect our tool and dataset to move beyond this report by migrating to open-source software, or becoming a plugin for GitHub, allowing researchers and developers to easily access past trends.

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could look at classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.
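As a toy illustration of this direction, repository metadata (stars, forks, open issues) could feed a simple classifier. The features, labels, and nearest-centroid model below are invented for illustration; a real study would need labelled repositories and a proper learner:

```python
# Toy nearest-centroid classifier over repository metadata.
# Feature tuples: (stars, forks, open issues).

def centroid(rows):
    """Component-wise mean of a list of equal-length feature tuples."""
    n = len(rows)
    return tuple(sum(r[i] for r in rows) / n for i in range(len(rows[0])))

def train(labelled):
    """labelled: {class_name: [feature tuples]} -> {class_name: centroid}."""
    return {label: centroid(rows) for label, rows in labelled.items()}

def predict(model, x):
    """Assign x to the class whose centroid is nearest (squared distance)."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda label: dist2(model[label], x))
```

Replacing the centroids with a trained regressor over time-stamped features is one way the "predict the future trend" goal could be approached.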

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models via the number of related repositories existing on GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could go further and develop techniques that apply machine learning clustering algorithms (e.g. k-means) to high-resolution time series data derived from commits.
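For instance, per-model activity measures extracted from commit timestamps (say, weekly commit counts) could be clustered to group models with similar activity levels. The hand-rolled one-dimensional k-means below is only a sketch of the idea; a real analysis would use normalised, higher-dimensional series:

```python
def kmeans_1d(values, k, iters=50):
    """Minimal 1-D k-means (k >= 2); centers start at evenly spaced quantiles."""
    srt = sorted(values)
    centers = [srt[i * (len(srt) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iters):
        # assignment step: each value joins its nearest center
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # update step: recompute centers (keep old center if cluster empties)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters
```

On weekly commit counts this would separate, say, dormant models from actively developed ones without any labels.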

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance, and contribution of deep-learning-related GitHub repositories, and identified factors affecting each of these domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool draws on a large number of repositories related to deep learning models to report the large-scale emergence of those models.

This demonstrates the ability of the tool to help users gain a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be applied, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus to be of considerable interest to researchers in different fields, serving people who work at the intersection of social media analysis, data visualization, and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving and changing rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers opportunities to learn, train, and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing, and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated this program is my own original work

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm

PyCharm 2019.1.3 (Professional Edition)
Build #PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
macOS 10.14.6

• Anaconda

  – jupyter-notebook 6.0.0

Other

- Python 3.7.4
  - pandas==0.22.0
  - numpy==1.14.0
  - statistics==1.0.3.5
  - ratelimit==2.2.1
  - requests
  - altair
  - matplotlib==2.2.2
  - selenium
- Git

Datasets

asc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_by_star: bert tensorflow.json, lstm tensorflow.json, wide deep tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, cnn tensorflow.json, ncf tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json

by_update_time: bert tensorflow.json, lstm tensorflow.json, resnet tensorflow.json, wide deep tensorflow.json, cnn tensorflow.json, ncf tensorflow.json, transformer tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json

forked_timestamp: bert tensorflow.csv, lstm tensorflow.csv, resnet tensorflow.csv, wide deep tensorflow.csv, cnn tensorflow.csv, ncf tensorflow.csv, transformer tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

Before You Begin, Prerequisites, Install, Running, Test, High Level Description of all Modules & Datasets, Authors, License

STAMPER

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view, edit, and execute our code: PyCharm, Anaconda. Amphetamine on the Mac App Store keeps the Mac awake with this useful app (otherwise it will disconnect from the internet).

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- GitHub authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used are listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repositories' metadata from GitHub in the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir. Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample case: in main(), change keywords in terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars, and order can be asc or desc.

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the forks' timestamps in forked_timestamp.

3. Data Selection (Optional)

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder. Run python3 filtered_repo.py to filter your data. Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity.

Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance.

Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution.

Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module will record all the unreachable links and write them into the file unreachable_urls.txt. Usage: change the elements in keywords and run python3 test.py; all the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data, and forked_time_location in three folders.

Instantiation: once you have data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model('bert tensorflow', 'desc_by_star'), with parameters model name and repository metadata subfolder. Then you can call this object with its relative data easily (from Model import bert, and use bert as you go along).

Customize Keywords: in the module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.:

lstm_keywords = ['tf.keras.layers.LSTMCell', 'tf.nn.rnn_cell.LSTMCell']
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection:

output/asc_by_star: cnn tensorflow.json, lstm tensorflow.json
output/asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
output/by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
output/desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
output/desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
output/pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search:

forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional):

filtered_repo: bert.json
filtered_repo/pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs:

graphs/contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
graphs/maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
graphs/multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
graphs/popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and van Deursen, A., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)


ac awake w

ith this useful App (Otherw

ise it will disconnect

internet)

Prerequisites

Please make sure you have installed all of the follow

ing prerequisites on your development m

achine

Git - httpsgit-scm

comdow

nloadsG

it Authentication tokenPython 37 w

ith pip

Jupyter Notebook 600

All external libraries used listed in requirementstxt

Install

Install some dependencies M

ake sure you have the latest pip

Install a list of requirements specified in r

equirementstxt

Running

All the code scripts run from the root

1 Data Collection

Clone our project

Run python3 model_searcherpy

to get keyword related repositories m

etadata in GitH

ub in output

folderAn authentication key is required to get a higher request rateN

ote Request rate could be changed in function call_api

and call_api_dir

Then you need to run sh JSONFormattersh

in your terminal to w

ell-format your output data

Sample Case

In main()

change keywords

in terms of interest Then the resulting JSO

N file w

ill beoutputbertJSON

Customized sorting m

ethod in function get_total_pages

and request_ith_page

Could besort

updated

or stars

order

asc

or desc

2 Repository Search

Run python3 forks_time_stamp_getterpy

to get all your the forks timestam

p in forked_timestamp

3 Data Selection (O

ptional)

pip3 install --upgrade pip

1

pip3 install -r requirementstxt

1

Run python3 repository_filterpy

to get your code-related repositories with statistics in

filtered_repo

folder

Run python3 filtered_repopy

to filter your data

NoteYour keyw

ords could be customized in m

odel_keywordpy

We store all the previous experim

ent data in tensorflow_model_filtering

andpytorch_model_filtering

4 Data Visualization

Popularity

Run python3 visualizationspopularitypy

and get your graphs invisualizationsgraphspopularity

Maintenance

Run python3 visualizationsmaintenancepy

and get your graphs invisualizationsgraphsmaintenance

Contribution

Run python3 visualizationscontributionpy

and get your graphs invisualizationsgraphscontribution

Multi Correlations

Run python3 visualizationsmulti_variablepy

and get your graphs invisualizationsgraphsmulti_variable

Test

Some G

itHub repositories does not m

aintained well and their links som

etimes are broken and unreachable To

guarantee your best experience in using our tool we provide testing unit for G

itHub links in t

estpy

This module

will record all the unreachable links and w

rite them into file

unreachable_urlstxt

UsageChange elem

ents in keywords

run python3 testpy

All the unreachable links will w

rite tounreachable_urlstxt

Customizing Your O

wn Search

In module M

odelpy

define your own entity lists (eg t

ensorflow_models

)

In Constructor Model

we store all unfiltered_data filtered_data and forked_tim

e_location in three folders

Instantiation

Since you already got data from the previous steps (1-2) Then you can construct a m

odel by calling aconstructor M

odel

eg bert = Model(bert tensorflow desc_by_star)

parameter M

odel_name and Respository m

etadata subfolder

Then you can call this object with its relative data easily (

from Model import bert

and use bert

as you goalong)

Customize Keyw

ords

In module m

odel_keywordpy

import your instantiation (

lstm

) and call add_keywords

eg

High Level D

escription of all Modules amp

Datasets

1 Data Collection

2 Repository Search

3 (Optional) D

ata Selection

4 Data Visualization

Altair is used to draw elegant graphs

Experiment D

atasets Collected

lstm_keywords = [tfkeraslayersLSTMCell tfnnrnn_cellLSTMCell]

lstmadd_keywords(lstm_keywords)

12

model_searcherpy

item_filterpy

12

model_searcherpy

forks_time_stamp_getterpy

12

repository_filterpy

filtered_repopy

12

contribution_statpy

entropy_calculationpy

Analysiscontribution_relatedpy

Analysismeta_datapy

1234

1 After Data Collection

output

asc_by_star

cnn tensorflowjson

$

lstm tensorflowjson

asc_general

bertjson

cnnjson

lstmjson

ncfjson

resnetjson

transformerjson

$

wide deepjson

by_update_time

123456789

10

11

12

13

14

bert tensorflowjson

cnn tensorflowjson

lstm tensorflowjson

ncf tensorflowjson

resnet tensorflowjson

transformer tensorflowjson

$

wide deep tensorflowjson

desc_by_star

bert tensorflowjson

cnn tensorflowjson

lstm tensorflowjson

ncf tensorflowjson

resnet tensorflowjson

transformer tensorflowjson

$

wide deep tensorflowjson

desc_general

bertjson

cnnjson

lstmjson

ncfjson

resnetjson

transformerjson

$

wide deepjson

$

pytorch_models

AlexNetjson

DCGANjson

Densenetjson

FCN-ResNet101json

GoogleNetjson

HarDNetjson

Inception_v3json

MobileNet v2json

PGANjson

ResNetjson

ResNet101json

ResNext WSLjson

ResNextjson

RoBERTajson

SSDjson

ShuffleNet v2json

SqueezeNetjson

Tacotron 2json

Transformerjson

U-Net pytorchjson

U-Netjson

WaveGlowjson

Wide ResNetjson

fairseqjson

$

vgg_netsjson

2 After Repository Search

forked_timestamp

bert tensorflowcsv

cnn tensorflowcsv

lstm tensorflowcsv

ncf tensorflowcsv

resnet tensorflowcsv

transformer tensorflowcsv

$

wide deep tensorflowcsv

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

70

71

72

73

Generated G

raphs

3 After Data Selection (Optional)

filtered_repo

bertjson

pytorch_model_filtering

Densenetjson

FCN-ResNet101json

GoogleNetjson

MobileNet v2json

ResNet101json

ResNextjson

ShuffleNet v2json

SqueezeNetjson

Tacotron 2json

Wide ResNetjson

$

vgg_netsjson

$

tensorflow_model_filtering

bertjson

lstmjson

ncfjson

resnetjson

transformerjson

$

wide deepjson

74

75

76

77

78

79

80

81

82

83

84

85

86

87

88

89

90

91

92

93

94

95

96

97

98

graphs

contribution

change_to_pdfbash

entropy_distributionsvg

entropy_dotssvg

lines_changed_boxssvg

lines_changed_histssvg

unique_percentage_distributionsvg

uniqueness_chartsvg

maintenance

devTime_boxplotsvg

issues_distributionsvg

wiki_ynsvg

multi_variable

dev_t_to_open_issuessvg

multi_correlationsvg

star_to_contributorssvg

star_to_dev_tsvg

star_to_entropysvg

$

star_to_open_issuessvg

$

popularity

accumulated_popularitysvg

creation_repository_trend_totalsvg

creation_with_fork_timelinesvg

fork_distributionsvg

popularity_dotsvg

$

popularity_measurement_correlationsvg

123456789

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

Authors

Xing(Nicole) Yu w

ith

Under the Supervison of D

r Ben Swift

License and References

MIT copy

Xing Yu

Stamper Im

age from httpbestpriceforrubberstam

pscom (License Free for personal use only)

Bibliography

a Github description httpshelpgithubcomenenterprise216userarticlessaving-repositories-with-stars Accessed 2019-09-22 (cited on pages xv and 20)

b Github description httpswwwmetrics-toolkitorggithub-forks-collaborators-watchers Accessed 2019-09-22 (cited on pagesxv 19 and 20)

c Github description httpshelpgithubcomenarticleswatching-and-unwatching-repositories Accessed 2019-09-22 (cited on page20)

d Github Search API description httpsdevelopergithubcomv3rate-limitingAccessed 2019-09-22 (cited on page 12)

Abadi M Barham P Chen J Chen Z Davis A Dean J Devin MGhemawat S Irving G Isard M Kudlur M Levenberg J Monga RMoore S Murray D G Steiner B Tucker P Vasudevan V Warden PWicke M Yu Y and Zheng X 2016 Tensorflow A system for large-scalemachine learning In 12th USENIX Symposium on Operating Systems Design andImplementation (OSDI 16) 265ndash283 USENIX Association Savannah GA httpswwwusenixorgconferenceosdi16technical-sessionspresentationabadi (citedon page 4)

Borges H Hora A and Valente M T 2016a Predicting the popularity ofgithub repositories In Proceedings of the The 12th International Conference on Pre-dictive Models and Data Analytics in Software Engineering 9 ACM (cited on page8)

Borges H Hora A and Valente M T 2016b Understanding the factors thatimpact the popularity of github repositories In 2016 IEEE International Conferenceon Software Maintenance and Evolution (ICSME) 334ndash344 IEEE (cited on pages 8and 19)

Casalnuovo C Suchak Y Ray B and Rubio-Gonzaacutelez C 2017 Gitcproc Atool for processing and classifying github commits In Proceedings of the 26th ACMSIGSOFT International Symposium on Software Testing and Analysis 396ndash399 ACM(cited on page 9)

Cheng H-T Koc L Harmsen J Shaked T Chandra T Aradhye H Ander-son G Corrado G Chai W Ispir M et al 2016 Wide amp deep learning

59

60 BIBLIOGRAPHY

for recommender systems In Proceedings of the 1st workshop on deep learning forrecommender systems 7ndash10 ACM (cited on page 7)

Collberg C Kobourov S Nagra J Pitts J and Wampler K 2003 A systemfor graph-based visualization of the evolution of software In Proceedings of the 2003ACM symposium on Software visualization 77ndashff ACM (cited on page 10)

Corder G W and Foreman D I 2011 Nonparametric statistics for non-statisticians (cited on page 22)

Devlin J Chang M-W Lee K and Toutanova K 2018 Bert Pre-trainingof deep bidirectional transformers for language understanding arXiv preprintarXiv181004805 (2018) (cited on page 6)

Feiner J and Andrews K 2018 Repovis Visual overviews and full-text searchin software repositories In 2018 IEEE Working Conference on Software Visualization(VISSOFT) 1ndash11 IEEE (cited on page 9)

Gote C Scholtes I and Schweitzer F 2019 git2net mining time-stamped co-editing networks from large git repositories In Proceedings of the 16th InternationalConference on Mining Software Repositories 433ndash444 IEEE Press (cited on pages xvand 10)

Gousios G Pinzger M and Deursen A v 2014 An exploratory study of thepull-based software development model In Proceedings of the 36th InternationalConference on Software Engineering 345ndash355 ACM (cited on page 8)

Gousios G and Spinellis D 2012 Ghtorrent Githubrsquos data from a firehoseIn 2012 9th IEEE Working Conference on Mining Software Repositories (MSR) 12ndash21IEEE (cited on page 9)

He X Liao L Zhang H Nie L Hu X and Chua T-S 2017 Neural collabo-rative filtering In Proceedings of the 26th international conference on world wide web173ndash182 International World Wide Web Conferences Steering Committee (citedon page 6)

Hochreiter S and Schmidhuber J 1997 Long short-term memory Neural com-putation 9 8 (1997) 1735ndash1780 (cited on page 5)

LeCun Y Bengio Y and Hinton G 2015 Deep learning nature 521 7553 (2015)436 (cited on page 3)

Servant F and Jones J A 2013 Chronos Visualizing slices of source-code historyIn 2013 First IEEE Working Conference on Software Visualization (VISSOFT) 1ndash4 IEEE(cited on page 9)

Sokol F Z Aniche M F and Gerosa M A 2013 Metricminer Supportingresearchers in mining software repositories In 2013 IEEE 13th International WorkingConference on Source Code Analysis and Manipulation (SCAM) 142ndash146 IEEE (citedon page 9)

BIBLIOGRAPHY 61

Subramanian V 2018 Deep Learning with PyTorch A practical approach to buildingneural network models using PyTorch Packt Publishing Ltd (cited on page 3)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on page 6)


Discussion And Future Work

and developers easily accessing the trend in the past

5.2 Future Work

5.2.1 Social Network Analysis in GitHub

Social media platforms like Twitter and YouTube have been well studied in recent years; however, less attention has been paid to GitHub. Future work could focus on classification or regression of GitHub repositories using machine learning and deep learning techniques. The ultimate goal is to predict future trends in GitHub, or even to give recommendations to developers.
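As a hedged illustration of the classification idea, the sketch below fits a logistic regression to synthetic repository features. The feature set and labels are invented for illustration only; they are not drawn from the STAMPER corpus.

```python
# Hypothetical sketch: predict whether a repository is "popular" from
# simple metadata features. Synthetic data, illustrative labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 200
# One row per repository: [forks, contributors, open_issues]
X = rng.poisson(lam=[30.0, 5.0, 12.0], size=(n, 3)).astype(float)
# Toy label: "popular" when a linear mix of forks and contributors is high.
y = ((X[:, 0] + 5 * X[:, 1]) > 55).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(f"training accuracy: {clf.score(X, y):.2f}")
```

A real study would replace the synthetic features with scraped metadata (stars, forks, issue counts, commit activity) and evaluate on held-out repositories rather than training accuracy.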

5.2.2 Trend Detection using Commit Timestamps

In this project we investigated and examined the popularity of deep learning models via the number of related repositories that exist in GitHub. It is very likely that commit metadata reflects popularity at the same time. In the future we could move beyond this and develop techniques that apply machine learning clustering algorithms (e.g. k-means) to high-resolution time-series data from commits.
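As a sketch of that direction, the following clusters synthetic weekly commit-count series with k-means. The data is illustrative only, not real commit metadata from the STAMPER dataset.

```python
# Hypothetical sketch: cluster repositories by the shape of their weekly
# commit activity to surface common popularity trajectories.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Each row is one repository's commit counts over 52 weeks.
steady = rng.poisson(5, size=(10, 52))                       # flat activity
surging = np.cumsum(rng.poisson(2, size=(10, 52)), axis=1)   # growing activity
series = np.vstack([steady, surging]).astype(float)

# Normalise each series to its own maximum so clustering compares the
# *shape* of the trend rather than raw commit volume.
series /= series.max(axis=1, keepdims=True)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(series)
print(labels)
```

With real data, the commit timestamps already collected per fork could be binned into the same kind of fixed-width series before clustering.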

Chapter 6

Conclusion

This research project identifies the need for a tool to conduct trend analysis in GitHub. Our new approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of GitHub deep-learning-related repositories and identified the factors that affect those domains. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users a deeper insight into current deep learning trends and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields, and will serve the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr. Ben Swift

7.1.3 Project Description

The ML software landscape (both models & datasets) is evolving and changing rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores offers developers ways to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done, and what has been tried in the past, based on previous research prototyping. Have the model stores kept their models up to date? What are the differences between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization & analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract

7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Software

• PyCharm 2019.1.3 (Professional Edition)
  Build #PY-191.7479.30, built on May 30, 2019
  Licensed to ANU / Xing Yu
  JRE: 11.0.2+9-b159.60 x86_64
  JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
  macOS 10.14.6

• Anaconda
  – jupyter-notebook 6.0.0

Other

• Python 3.7.4, with pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
• Git

Datasets

asc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_general: bert.json, lstm.json, resnet.json, wide deep.json, cnn.json, ncf.json, transformer.json

desc_by_star: bert tensorflow.json, lstm tensorflow.json, wide deep tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, cnn tensorflow.json, ncf tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

pytorch_models: AlexNet.json, HarDNet.json, ResNet101.json, ShuffleNet v2.json, U-Net.json, DCGAN.json, Inception_v3.json, ResNext WSL.json, SqueezeNet.json, WaveGlow.json, Densenet.json, MobileNet v2.json, ResNext.json, Wide ResNet.json, Tacotron 2.json, FCN-ResNet101.json, PGAN.json, RoBERTa.json, Transformer.json, fairseq.json, GoogleNet.json, ResNet.json, vgg_nets.json, SSD.json, U-Net pytorch.json

by_update_time: bert tensorflow.json, lstm tensorflow.json, resnet tensorflow.json, wide deep tensorflow.json, cnn tensorflow.json, ncf tensorflow.json, transformer tensorflow.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, GoogleNet.json, ResNet101.json, ShuffleNet v2.json, Tacotron 2.json, vgg_nets.json, FCN-ResNet101.json, MobileNet v2.json, ResNext.json, SqueezeNet.json, Wide ResNet.json

forked_timestamp: bert tensorflow.csv, lstm tensorflow.csv, resnet tensorflow.csv, wide deep tensorflow.csv, cnn tensorflow.csv, ncf tensorflow.csv, transformer tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

- Before You Begin
- Prerequisites
- Install
- Running
- Test
- High Level Description of all Modules & Datasets
- Authors
- License

STAMPER is a python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda before viewing/editing/executing our code:

- PyCharm
- Anaconda
- Amphetamine on the Mac App Store: keep your Mac awake with this useful app (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

- Git - https://git-scm.com/downloads
- Git authentication token
- Python 3.7 with pip
- Jupyter Notebook 6.0.0
- All external libraries used, listed in requirements.txt

Install

Install the dependencies. Make sure you have the latest pip:

pip3 install --upgrade pip

Then install the list of requirements specified in requirements.txt:

pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repository metadata from GitHub in the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest. The resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be "updated" or "stars", and order can be "asc" or "desc".
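The collection step above boils down to paginated calls against the GitHub Search API. The sketch below shows the shape of such a request using only the standard library; the function name and defaults are illustrative, not STAMPER's actual call_api internals.

```python
# Hypothetical sketch of one GitHub search request: endpoint, sort/order
# parameters, and the token header that raises the request rate limit.
from urllib.parse import urlencode
from urllib.request import Request

API_URL = "https://api.github.com/search/repositories"

def build_search_request(keyword, sort="stars", order="desc", page=1, token=None):
    """Assemble a GitHub repository-search request for one result page."""
    query = urlencode({"q": keyword, "sort": sort, "order": order, "page": page})
    req = Request(f"{API_URL}?{query}")
    req.add_header("Accept", "application/vnd.github.v3+json")
    if token:  # authenticated search requests get a much higher rate limit
        req.add_header("Authorization", f"token {token}")
    return req  # pass to urllib.request.urlopen(...) to execute

req = build_search_request("bert tensorflow", order="asc", token="YOUR_TOKEN")
print(req.full_url)
```

Executing the request with urlopen returns a JSON body whose "items" list holds the per-repository metadata that the output folder stores.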

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the forks' timestamps in forked_timestamp.

3. Data Selection (Optional)

Run python3 repository_filter.py to get your code-related repositories, with statistics, in the filtered_repo folder. Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity.
Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance.
Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution.
Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module will record all the unreachable links and write them to the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation

Once you have the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g.:

bert = Model("bert tensorflow", "desc_by_star")

with parameters: model name and repository metadata subfolder. Then you can call this object with its relative data easily (from Model import bert and use bert as you go along).

Customize Keywords

In module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected
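The interface described above can be pictured with a minimal sketch. This is a hypothetical mirror of the Model constructor and add_keywords described in the README, not STAMPER's actual Model.py; the path layout and load behavior are assumptions.

```python
# Minimal, hypothetical mirror of the described Model interface:
# constructor takes a model name and a metadata subfolder.
import json
from pathlib import Path

class Model:
    def __init__(self, name, subfolder, root="output"):
        self.name = name
        # Assumed layout: <root>/<subfolder>/<name>.json
        self.path = Path(root) / subfolder / f"{name}.json"
        self.keywords = []

    def add_keywords(self, keywords):
        """Register source-code keywords used to filter repositories."""
        self.keywords.extend(keywords)

    def load(self):
        """Read this model's scraped repository metadata, if present."""
        return json.loads(self.path.read_text()) if self.path.exists() else []

lstm = Model("lstm tensorflow", "desc_by_star")
lstm.add_keywords(["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"])
print(lstm.path, len(lstm.keywords))
```

The point of the design is that a single object ties together a model's name, its scraped metadata file, and the code keywords used for filtering.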

1. After Data Collection

output
  asc_by_star
    cnn tensorflow.json
    lstm tensorflow.json
  asc_general
    bert.json
    cnn.json
    lstm.json
    ncf.json
    resnet.json
    transformer.json
    wide deep.json
  by_update_time
    bert tensorflow.json
    cnn tensorflow.json
    lstm tensorflow.json
    ncf tensorflow.json
    resnet tensorflow.json
    transformer tensorflow.json
    wide deep tensorflow.json
  desc_by_star
    bert tensorflow.json
    cnn tensorflow.json
    lstm tensorflow.json
    ncf tensorflow.json
    resnet tensorflow.json
    transformer tensorflow.json
    wide deep tensorflow.json
  desc_general
    bert.json
    cnn.json
    lstm.json
    ncf.json
    resnet.json
    transformer.json
    wide deep.json
  pytorch_models
    AlexNet.json
    DCGAN.json
    Densenet.json
    FCN-ResNet101.json
    GoogleNet.json
    HarDNet.json
    Inception_v3.json
    MobileNet v2.json
    PGAN.json
    ResNet.json
    ResNet101.json
    ResNext WSL.json
    ResNext.json
    RoBERTa.json
    SSD.json
    ShuffleNet v2.json
    SqueezeNet.json
    Tacotron 2.json
    Transformer.json
    U-Net pytorch.json
    U-Net.json
    WaveGlow.json
    Wide ResNet.json
    fairseq.json
    vgg_nets.json

2. After Repository Search

forked_timestamp
  bert tensorflow.csv
  cnn tensorflow.csv
  lstm tensorflow.csv
  ncf tensorflow.csv
  resnet tensorflow.csv
  transformer tensorflow.csv
  wide deep tensorflow.csv


3. After Data Selection (Optional)

filtered_repo
  bert.json
  pytorch_model_filtering
    Densenet.json
    FCN-ResNet101.json
    GoogleNet.json
    MobileNet v2.json
    ResNet101.json
    ResNext.json
    ShuffleNet v2.json
    SqueezeNet.json
    Tacotron 2.json
    Wide ResNet.json
    vgg_nets.json
  tensorflow_model_filtering
    bert.json
    lstm.json
    ncf.json
    resnet.json
    transformer.json
    wide deep.json


4. Generated Graphs

graphs
  contribution
    change_to_pdf.bash
    entropy_distribution.svg
    entropy_dots.svg
    lines_changed_boxs.svg
    lines_changed_hists.svg
    unique_percentage_distribution.svg
    uniqueness_chart.svg
  maintenance
    devTime_boxplot.svg
    issues_distribution.svg
    wiki_yn.svg
  multi_variable
    dev_t_to_open_issues.svg
    multi_correlation.svg
    star_to_contributors.svg
    star_to_dev_t.svg
    star_to_entropy.svg
    star_to_open_issues.svg
  popularity
    accumulated_popularity.svg
    creation_repository_trend_total.svg
    creation_with_fork_timeline.svg
    fork_distribution.svg
    popularity_dot.svg
    popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr. Ben Swift

License and References

MIT © Xing Yu
Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)

Bibliography

a. GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

b. GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

c. GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

d. GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

Gousios G and Spinellis D 2012 Ghtorrent Githubrsquos data from a firehoseIn 2012 9th IEEE Working Conference on Mining Software Repositories (MSR) 12ndash21IEEE (cited on page 9)

He X Liao L Zhang H Nie L Hu X and Chua T-S 2017 Neural collabo-rative filtering In Proceedings of the 26th international conference on world wide web173ndash182 International World Wide Web Conferences Steering Committee (citedon page 6)

Hochreiter S and Schmidhuber J 1997 Long short-term memory Neural com-putation 9 8 (1997) 1735ndash1780 (cited on page 5)

LeCun Y Bengio Y and Hinton G 2015 Deep learning nature 521 7553 (2015)436 (cited on page 3)

Servant F and Jones J A 2013 Chronos Visualizing slices of source-code historyIn 2013 First IEEE Working Conference on Software Visualization (VISSOFT) 1ndash4 IEEE(cited on page 9)

Sokol F Z Aniche M F and Gerosa M A 2013 Metricminer Supportingresearchers in mining software repositories In 2013 IEEE 13th International WorkingConference on Source Code Analysis and Manipulation (SCAM) 142ndash146 IEEE (citedon page 9)

BIBLIOGRAPHY 61

Subramanian V 2018 Deep Learning with PyTorch A practical approach to buildingneural network models using PyTorch Packt Publishing Ltd (cited on page 3)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on page 6)

  • Acknowledgments
  • Abstract
  • List of Abbreviations
  • Contents
  • Introduction
    • Trace Deep Learning use through GitHub
    • Contribution
    • Report Outline
      • Background and Related Work
        • Background
          • Deep learning
            • TensorFlow
            • PyTorch
              • Deep learning models
              • Summarized Timeline
                • Public Code Repositories
                  • Web-based hosting service
                  • Measuring Popularity From GitHub
                  • Extracting Messy Data in the Wild
                  • Visualizing data in Repositories
                    • Summary
                      • STAMPER Design and Implementation
                        • Overview
                        • Data Collection
                        • Repository Search
                        • Data Selection
                          • Example
                            • Construct the Visualizations
                            • Summary
                              • STAMPER in Action
                                • Popularity of Deep Learning Models in GitHub
                                  • Popularity Feature Selection
                                  • Past and Current Status A Full Integration
                                  • RQ1 How has the popularity of model changed over time A closer look at the deep learning models
                                  • RQ2 How popularity varies per model
                                  • RQ3 Does the popularity of models relate to other features
                                    • Contribution of Deep Learning Models in GitHub
                                      • Collaborative Contribution
                                      • RQ1 After forking do developers change the codebase
                                        • Maintenance of Deep Learning Models in GitHub
                                          • RQ1 How long has it been in existence
                                          • RQ2 Do old models have more issues compared to new models
                                          • RQ3 Are they well maintained
                                            • Summary
                                              • Discussion And Future Work
                                                • Discussion
                                                  • Data in the wild Limitation and Improvement
                                                  • Extensibility and Open-Source Software
                                                    • Future Work
                                                      • Social Network Analysis in GitHub
                                                      • Trend Detection using Commitments Timestamp
                                                          • Conclusion
                                                          • Appendix
                                                            • Appendix 1 Project Description
                                                              • Project Title
                                                              • Supervisors
                                                              • Project Description
                                                              • Learning Objectives
                                                                • Appendix 2 Study Contract
                                                                • Appendix 3 Artefact Description
                                                                  • Code Files Submitted
                                                                  • Program Testing
                                                                  • Experiment
                                                                    • Hardware
                                                                      • Softwares
                                                                      • Other
                                                                      • Datasets
                                                                        • Appendix 4 README
Page 62: Mapping the landscape of deep learning models use in the wild · Mapping the landscape of deep learning models use in the wild Xing Yu (u6034476) A report submitted for the course

Chapter 6

Conclusion

This research project identified the need for a tool to conduct trend analysis on GitHub. Our approach uses the current GitHub API to extract repositories' metadata. Using this tool, we studied the popularity, maintenance and contribution of deep-learning-related GitHub repositories, and identified the factors that affect each of these dimensions. The key advantage of STAMPER is that it provides a simple way to extract historical information from GitHub. Our tool utilises a large number of repositories related to deep learning models to report the large-scale emergence of deep learning models.

This demonstrates the ability of the tool to give users a deeper insight into current deep learning trends, and to generate a corpus for further research use. Our study could be used, for example, to discover other trends across GitHub. One avenue for further study would be social network analysis in GitHub.

Additionally, we expect our tool and the resulting corpus will be of considerable interest to researchers in different fields, and will serve the needs of people working at the intersection of social media analysis, data visualization and data science.


Chapter 7

Appendix

7.1 Appendix 1: Project Description

7.1.1 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

7.1.2 Supervisors

Dr Ben Swift

7.1.3 Project Description

ML software (both models & datasets) is evolving rapidly, and many ML applications are widely used in people's lives. The introduction of different model stores allows developers to learn, train and develop their projects. However, the downside of this 'wild west' approach is that it is hard to know what is being done, and what has been tried in the past based on previous research prototyping. Have the model stores kept their models up to date? What is the difference between those model stores?

By scraping, analyzing and visualizing the use of various models & datasets from real-world software repositories (GitHub, AWS, etc.), this project will produce visualizations of the ML landscape to aid researchers in understanding past and current trends.

7.1.4 Learning Objectives

• Identify data sources for current trends in model & dataset use

• Develop visualization and analysis techniques for representing trends in their use

Keywords: machine learning, TensorFlow

7.2 Appendix 2: Study Contract


7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and Setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh

1. Data Collection: model_searcher.py, item_filter.py

2. Repository Search: forks_time_stamp_getter.py

3. (Optional) Data Selection: repository_filter.py, filtered_repo.py

4. Data Visualization (uses Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Intel Iris Pro 1536 MB

Softwares

• PyCharm

PyCharm 2019.1.3 (Professional Edition)


Build #PY-191.7479.30, built on May 30, 2019
Licensed to ANU / Xing Yu
JRE: 11.0.2+9-b159.60 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
macOS 10.14.6

• Anaconda

– jupyter-notebook 6.0.0

Other

- Python 3.7.4
  -- pandas==0.22.0
  -- numpy==1.14.0
  -- statistics==1.0.3.5
  -- ratelimit==2.2.1
  -- requests
  -- altair
  -- matplotlib==2.2.2
  -- selenium
- Git

Datasets

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

Before You Begin
Prerequisites
Install
Running
Test
High Level Description of all Modules & Datasets
Authors
License

STAMPER

STAMPER is a python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view/edit/execute our code:

PyCharm
Anaconda
Amphetamine on the Mac App Store: keep your Mac awake with this useful App (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

Git - https://git-scm.com/downloads
GitHub authentication token
Python 3.7 with pip
Jupyter Notebook 6.0.0
All external libraries used are listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt.

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repositories' metadata from GitHub in the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir. Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample case: in main(), change keywords in terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars; order can be asc or desc.

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the forks' timestamps in forked_timestamp.

3. Data Selection (Optional)

pip3 install --upgrade pip
pip3 install -r requirements.txt

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder. Run python3 filtered_repo.py to filter your data. Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity.

Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance.

Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution.

Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable.
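Step 1 (Data Collection) above queries the GitHub Search API with an optional authentication token and the sort/order parameters mentioned for get_total_pages and request_ith_page. As a rough illustration of how such a request is parameterized (the helper below is a sketch with assumed names and defaults, not STAMPER's actual model_searcher.py code):

```python
# Hypothetical helper illustrating one page of a GitHub repository search;
# names and defaults are assumptions, not STAMPER's real implementation.

def build_search_request(keyword, sort="stars", order="desc", page=1, token=None):
    """Return (url, params, headers) for one page of repository search."""
    url = "https://api.github.com/search/repositories"
    params = {"q": keyword, "sort": sort, "order": order, "page": page}
    # A personal access token raises the permitted request rate
    # (see the GitHub Search API rate-limiting reference in the bibliography).
    headers = {"Authorization": "token " + token} if token else {}
    return url, params, headers

# e.g. the "sort could be updated, order could be asc" configuration above:
url, params, headers = build_search_request("bert tensorflow", sort="updated", order="asc")
```

The returned triple can be passed straight to requests.get(url, params=params, headers=headers).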

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module will record all the unreachable links and write them into the file unreachable_urls.txt.

Usage: change the elements in keywords and run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In the module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model('bert tensorflow', 'desc_by_star'), where the parameters are the model name and the repository-metadata subfolder. Then you can use this object with its related data easily (from Model import bert and use bert as you go along).

Customize Keywords

In the module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.:

lstm_keywords = ['tf.keras.layers.LSTMCell', 'tf.nn.rnn_cell.LSTMCell']
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected
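To make the Model interface from "Customizing Your Own Search" and "Customize Keywords" concrete, here is a minimal self-contained sketch. It only mirrors the constructor arguments and the add_keywords call shown above; it is not the real Model.py, which also manages the unfiltered_data, filtered_data and forked_time_location folders on disk:

```python
# Minimal stand-in for STAMPER's Model class, for illustration only.
class Model:
    def __init__(self, name, metadata_subfolder):
        self.name = name                              # e.g. "bert tensorflow"
        self.metadata_subfolder = metadata_subfolder  # e.g. "desc_by_star"
        self.keywords = []                            # code-level search keywords

    def add_keywords(self, keywords):
        # Keywords consumed later by the data-selection step (model_keyword.py).
        self.keywords.extend(keywords)

# Usage mirroring the README examples:
bert = Model("bert tensorflow", "desc_by_star")
lstm = Model("lstm tensorflow", "desc_by_star")
lstm.add_keywords(["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"])
```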

1. After Data Collection

output/
    asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
    asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional)

filtered_repo/
    bert.json
    pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
    tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs

graphs/
    contribution/: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
    maintenance/: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
    multi_variable/: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
    popularity/: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg
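The popularity graphs listed above (for example creation_with_fork_timeline.svg and accumulated_popularity.svg) are built from fork timestamps like those in the forked_timestamp CSVs. A sketch of the kind of aggregation involved, assuming one fork-creation timestamp per row (the real column names in STAMPER's CSVs may differ):

```python
# Sketch of a cumulative fork timeline; column names are assumptions.
import pandas as pd

def cumulative_forks(timestamps):
    """Return a DataFrame of fork events sorted by time with a running total."""
    s = pd.to_datetime(pd.Series(timestamps))
    df = pd.DataFrame({"created_at": s.sort_values().values})
    # One row per fork; the running total gives the accumulated popularity curve.
    df["total_forks"] = range(1, len(df) + 1)
    return df

df = cumulative_forks(["2019-03-01", "2019-01-15", "2019-02-10"])
```

Plotting total_forks against created_at (with Altair, as in the visualization scripts) yields a fork-accumulation timeline.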


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

  • Acknowledgments
  • Abstract
  • List of Abbreviations
  • Contents
  • Introduction
    • Trace Deep Learning use through GitHub
    • Contribution
    • Report Outline
  • Background and Related Work
    • Background
      • Deep learning
        • TensorFlow
        • PyTorch
        • Deep learning models
        • Summarized Timeline
      • Public Code Repositories
        • Web-based hosting service
        • Measuring Popularity From GitHub
        • Extracting Messy Data in the Wild
        • Visualizing data in Repositories
    • Summary
  • STAMPER Design and Implementation
    • Overview
    • Data Collection
    • Repository Search
    • Data Selection
      • Example
    • Construct the Visualizations
    • Summary
  • STAMPER in Action
    • Popularity of Deep Learning Models in GitHub
      • Popularity Feature Selection
      • Past and Current Status: A Full Integration
      • RQ1: How has the popularity of model changed over time? A closer look at the deep learning models
      • RQ2: How popularity varies per model?
      • RQ3: Does the popularity of models relate to other features?
    • Contribution of Deep Learning Models in GitHub
      • Collaborative Contribution
      • RQ1: After forking, do developers change the codebase?
    • Maintenance of Deep Learning Models in GitHub
      • RQ1: How long has it been in existence?
      • RQ2: Do old models have more issues compared to new models?
      • RQ3: Are they well maintained?
    • Summary
  • Discussion And Future Work
    • Discussion
      • Data in the wild: Limitation and Improvement
      • Extensibility and Open-Source Software
    • Future Work
      • Social Network Analysis in GitHub
      • Trend Detection using Commitments Timestamp
  • Conclusion
  • Appendix
    • Appendix 1: Project Description
      • Project Title
      • Supervisors
      • Project Description
      • Learning Objectives
    • Appendix 2: Study Contract
    • Appendix 3: Artefact Description
      • Code Files Submitted
      • Program Testing
      • Experiment
        • Hardware
        • Softwares
        • Other
        • Datasets
    • Appendix 4: README
Page 63: Mapping the landscape of deep learning models use in the wild · Mapping the landscape of deep learning models use in the wild Xing Yu (u6034476) A report submitted for the course

48 Conclusion

Chapter 7

Appendix

71 Appendix 1 Project Description

711 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

712 Supervisors

Dr Ben Swift

713 Project Description

The ML software (both models amp datasets) is evolving rapidly changing and manyML applications are widely used in people lives The introduction of different modelstore offers developers to learn train and develop their projects However the down-side of this rsquowild westrsquo approach is that it is hard to know what is being done whatrsquosbeen tried in the past based on the previous research prototyping Have the modelstores maintained their model up to date What is the difference between thosemodel stores

By scraping analyzing visualizing the use of various models amp datasets fromreal-world software repositories (Github AWS etc) this project will produce vi-sualizations of the ML landscape to aid researchers in understanding past currenttrends

714 Learning Objectives

bull Identify data sources for current trends in model amp dataset use

bull Develop visualization analysis techniques for representing trends in their use

Keywords machine learning TensorFlow

72 Appendix 2 Study Contract

49

52 Appendix

73 Appendix 3 Artefact Description

Except where otherwise indicated this program is my own original work

731 Code Files Submitted

0 Configuration and Setup__init__py setuppy Modelpymodel_keywordpy testpyJSONFormattersh change_to_pdfsh

1 Data Collectionmodel_searcherpy item_filterpy

2 Repository Searchforks_time_stamp_getterpy

3 (Optional) Data Selectionrepository_filterpy filtered_repopy

4 Data Visualization Use Altaircontribution_statpyentropy_calculationpyAnalysiscontribution_relatedpyAnalysismeta_datapy

732 Program Testing

testpy

733 Experiment

Hardware

MacBook Pro (Retina 15-inch Mid 2015)Processor 22 GHz Intel Core i7Memory 16 GB 1600 MHz DDR3Graphics Intel Iris Pro 1536 MB

Softwares

bull PyCharm

PyCharm 201913 (Professional Edition)

sect73 Appendix 3 Artefact Description 53

Build PY-191747930 built on May 30 2019Licensed to ANU Xing YuJRE 1102+9-b15960 x86_64JVM OpenJDK 64-Bit Server VM by JetBrains sromacOS 10146

bull Anaconda

ndash jupyter-notebook 600

Other

- Python 374-- pandas== 0220 -- numpy== 1140-- statistics==1035 -- ratelimit==221-- requests -- altair -- matplotlib==222-- selenium- Git

Datasets

asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star: cnn tensorflow.json, lstm tensorflow.json

by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents

Before You Begin, Prerequisites, Install, Running, Test, High Level Description of all Modules & Datasets, Authors, License

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view, edit and execute our code:

• PyCharm
• Anaconda
• Amphetamine on the Mac App Store: keep the Mac awake with this useful app (otherwise it will disconnect from the internet)

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

• Git - https://git-scm.com/downloads
• GitHub authentication token
• Python 3.7 with pip
• Jupyter Notebook 6.0.0
• All external libraries used, listed in requirements.txt

Install

Install some dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to get keyword-related repository metadata from GitHub in the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir. Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars, and order can be asc or desc.

2. Repository Search

Run python3 forks_time_stamp_getter.py to get all the fork timestamps in forked_timestamp.

3. Data Selection (Optional)

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder. Run python3 filtered_repo.py to filter your data. Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.

4. Data Visualization

Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity.
Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance.
Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution.
Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not well maintained, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them into the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: since you already have data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with parameters model_name and repository-metadata subfolder. Then you can use this object with its related data easily (from Model import bert, and use bert as you go along).

Customize Keywords: in module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py. Altair is used to draw elegant graphs.

Experiment Datasets Collected

1. After Data Collection (output folder):

asc_by_star: cnn tensorflow.json, lstm tensorflow.json
asc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
by_update_time: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
desc_by_star: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
desc_general: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
pytorch_models: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search (forked_timestamp folder):

forked_timestamp: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection, Optional (filtered_repo folder):

filtered_repo: bert.json
filtered_repo/pytorch_model_filtering: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
filtered_repo/tensorflow_model_filtering: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs (graphs folder):

graphs/contribution: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
graphs/maintenance: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
graphs/multi_variable: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
graphs/popularity: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only)
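For readers who want to try the Model workflow from the README without the full codebase, here is a minimal stand-in. The class body is an assumption reconstructed from the usage shown above (a constructor taking a model name and a metadata subfolder, plus add_keywords), not the actual Model.py:

```python
import json
import os

class Model:
    # Minimal illustrative stand-in for the Model class described in the
    # README: loads one model's repository metadata from a subfolder of
    # `output` (if present) and accumulates search keywords.
    def __init__(self, model_name, subfolder, root="output"):
        self.model_name = model_name
        self.keywords = []
        path = os.path.join(root, subfolder, model_name + ".json")
        if os.path.exists(path):
            with open(path) as fh:
                self.items = json.load(fh)
        else:
            self.items = []  # no cached metadata collected yet

    def add_keywords(self, keywords):
        # Mirrors the README's lstm.add_keywords(lstm_keywords) usage.
        self.keywords.extend(keywords)

# Usage mirroring the README:
bert = Model("bert tensorflow", "desc_by_star")
```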

Bibliography

a Github description httpshelpgithubcomenenterprise216userarticlessaving-repositories-with-stars Accessed 2019-09-22 (cited on pages xv and 20)

b Github description httpswwwmetrics-toolkitorggithub-forks-collaborators-watchers Accessed 2019-09-22 (cited on pagesxv 19 and 20)

c Github description httpshelpgithubcomenarticleswatching-and-unwatching-repositories Accessed 2019-09-22 (cited on page20)

d Github Search API description httpsdevelopergithubcomv3rate-limitingAccessed 2019-09-22 (cited on page 12)

Abadi M Barham P Chen J Chen Z Davis A Dean J Devin MGhemawat S Irving G Isard M Kudlur M Levenberg J Monga RMoore S Murray D G Steiner B Tucker P Vasudevan V Warden PWicke M Yu Y and Zheng X 2016 Tensorflow A system for large-scalemachine learning In 12th USENIX Symposium on Operating Systems Design andImplementation (OSDI 16) 265ndash283 USENIX Association Savannah GA httpswwwusenixorgconferenceosdi16technical-sessionspresentationabadi (citedon page 4)

Borges H Hora A and Valente M T 2016a Predicting the popularity ofgithub repositories In Proceedings of the The 12th International Conference on Pre-dictive Models and Data Analytics in Software Engineering 9 ACM (cited on page8)

Borges H Hora A and Valente M T 2016b Understanding the factors thatimpact the popularity of github repositories In 2016 IEEE International Conferenceon Software Maintenance and Evolution (ICSME) 334ndash344 IEEE (cited on pages 8and 19)

Casalnuovo C Suchak Y Ray B and Rubio-Gonzaacutelez C 2017 Gitcproc Atool for processing and classifying github commits In Proceedings of the 26th ACMSIGSOFT International Symposium on Software Testing and Analysis 396ndash399 ACM(cited on page 9)

Cheng H-T Koc L Harmsen J Shaked T Chandra T Aradhye H Ander-son G Corrado G Chai W Ispir M et al 2016 Wide amp deep learning

59

60 BIBLIOGRAPHY

for recommender systems In Proceedings of the 1st workshop on deep learning forrecommender systems 7ndash10 ACM (cited on page 7)

Collberg C Kobourov S Nagra J Pitts J and Wampler K 2003 A systemfor graph-based visualization of the evolution of software In Proceedings of the 2003ACM symposium on Software visualization 77ndashff ACM (cited on page 10)

Corder G W and Foreman D I 2011 Nonparametric statistics for non-statisticians (cited on page 22)

Devlin J Chang M-W Lee K and Toutanova K 2018 Bert Pre-trainingof deep bidirectional transformers for language understanding arXiv preprintarXiv181004805 (2018) (cited on page 6)

Feiner J and Andrews K 2018 Repovis Visual overviews and full-text searchin software repositories In 2018 IEEE Working Conference on Software Visualization(VISSOFT) 1ndash11 IEEE (cited on page 9)

Gote C Scholtes I and Schweitzer F 2019 git2net mining time-stamped co-editing networks from large git repositories In Proceedings of the 16th InternationalConference on Mining Software Repositories 433ndash444 IEEE Press (cited on pages xvand 10)

Gousios G Pinzger M and Deursen A v 2014 An exploratory study of thepull-based software development model In Proceedings of the 36th InternationalConference on Software Engineering 345ndash355 ACM (cited on page 8)

Gousios G and Spinellis D 2012 Ghtorrent Githubrsquos data from a firehoseIn 2012 9th IEEE Working Conference on Mining Software Repositories (MSR) 12ndash21IEEE (cited on page 9)

He X Liao L Zhang H Nie L Hu X and Chua T-S 2017 Neural collabo-rative filtering In Proceedings of the 26th international conference on world wide web173ndash182 International World Wide Web Conferences Steering Committee (citedon page 6)

Hochreiter S and Schmidhuber J 1997 Long short-term memory Neural com-putation 9 8 (1997) 1735ndash1780 (cited on page 5)

LeCun Y Bengio Y and Hinton G 2015 Deep learning nature 521 7553 (2015)436 (cited on page 3)

Servant F and Jones J A 2013 Chronos Visualizing slices of source-code historyIn 2013 First IEEE Working Conference on Software Visualization (VISSOFT) 1ndash4 IEEE(cited on page 9)

Sokol F Z Aniche M F and Gerosa M A 2013 Metricminer Supportingresearchers in mining software repositories In 2013 IEEE 13th International WorkingConference on Source Code Analysis and Manipulation (SCAM) 142ndash146 IEEE (citedon page 9)

BIBLIOGRAPHY 61

Subramanian V 2018 Deep Learning with PyTorch A practical approach to buildingneural network models using PyTorch Packt Publishing Ltd (cited on page 3)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on page 6)

Page 64: Mapping the landscape of deep learning models use in the wild · Mapping the landscape of deep learning models use in the wild Xing Yu (u6034476) A report submitted for the course

Chapter 7

Appendix

71 Appendix 1 Project Description

711 Project Title

Mapping the landscape of Machine learning models and datasets in the wild

712 Supervisors

Dr Ben Swift

713 Project Description

The ML software (both models amp datasets) is evolving rapidly changing and manyML applications are widely used in people lives The introduction of different modelstore offers developers to learn train and develop their projects However the down-side of this rsquowild westrsquo approach is that it is hard to know what is being done whatrsquosbeen tried in the past based on the previous research prototyping Have the modelstores maintained their model up to date What is the difference between thosemodel stores

By scraping analyzing visualizing the use of various models amp datasets fromreal-world software repositories (Github AWS etc) this project will produce vi-sualizations of the ML landscape to aid researchers in understanding past currenttrends

714 Learning Objectives

bull Identify data sources for current trends in model amp dataset use

bull Develop visualization analysis techniques for representing trends in their use

Keywords machine learning TensorFlow

72 Appendix 2 Study Contract

49

52 Appendix

73 Appendix 3 Artefact Description

Except where otherwise indicated this program is my own original work

731 Code Files Submitted

0 Configuration and Setup__init__py setuppy Modelpymodel_keywordpy testpyJSONFormattersh change_to_pdfsh

1 Data Collectionmodel_searcherpy item_filterpy

2 Repository Searchforks_time_stamp_getterpy

3 (Optional) Data Selectionrepository_filterpy filtered_repopy

4 Data Visualization Use Altaircontribution_statpyentropy_calculationpyAnalysiscontribution_relatedpyAnalysismeta_datapy

732 Program Testing

testpy

733 Experiment

Hardware

MacBook Pro (Retina 15-inch Mid 2015)Processor 22 GHz Intel Core i7Memory 16 GB 1600 MHz DDR3Graphics Intel Iris Pro 1536 MB

Softwares

bull PyCharm

PyCharm 201913 (Professional Edition)

sect73 Appendix 3 Artefact Description 53

Build PY-191747930 built on May 30 2019Licensed to ANU Xing YuJRE 1102+9-b15960 x86_64JVM OpenJDK 64-Bit Server VM by JetBrains sromacOS 10146

bull Anaconda

ndash jupyter-notebook 600

Other

- Python 374-- pandas== 0220 -- numpy== 1140-- statistics==1035 -- ratelimit==221-- requests -- altair -- matplotlib==222-- selenium- Git

Datasets

asc_generalbertjson lstmjson resnetjson wide deepjsoncnnjson ncfjson transformerjson

desc_generalbertjson lstmjson resnetjson wide deepjsoncnnjson ncfjson transformerjson

desc_by_starbert tensorflowjson lstm tensorflowjson wide deep tensorflowjsonresnet tensorflowjson transformer tensorflowjsoncnn tensorflowjson ncf tensorflowjson

asc_by_starcnn tensorflowjson lstm tensorflowjson

pytorch_modelsAlexNetjson HarDNetjson ResNet101jsonShuffleNet v2json U-NetjsonDCGANjson Inception_v3jsonResNext WSLjson SqueezeNetjson WaveGlowjsonDensenetjson MobileNet v2json ResNextjson

54 Appendix

Wide ResNetjson Tacotron 2jsonFCN-ResNet101json PGANjson RoBERTajsonTransformerjson fairseqjsonGoogleNetjson ResNetjsonvgg_netsjson SSDjson U-Net pytorchjson

by_update_timebert tensorflowjson lstm tensorflowjsonresnet tensorflowjson wide deep tensorflowjsoncnn tensorflowjson ncf tensorflowjsontransformer tensorflowjson

filtered_repotensorflow_model_filteringbertjson lstmjsonncfjson resnetjsontransformerjson wide deepjson

filtered_repopytorch_model_filteringDensenetjson GoogleNetjsonResNet101json ShuffleNet v2jsonTacotron 2json vgg_netsjsonFCN-ResNet101json MobileNet v2jsonResNextjson SqueezeNetjson Wide ResNetjson

forked_timestampbert tensorflowcsv lstm tensorflowcsvresnet tensorflowcsv wide deep tensorflowcsvcnn tensorflowcsv ncf tensorflowcsvtransformer tensorflowcsv

74 Appendix 4 README

STAM

PER Mapping the landscape of deep

learning models use in the w

ild

cccccc

STAMPER

STAMPER

Table of Contents

Before You BeginPrerequisitesInstallRunningTestH

igh Level Description of all M

odules amp D

atasetsAuthorsLicense

STAM

PER

STAMPER

is python software tool for exam

ining real-world softw

are repositories related to deep learning models to

aid researchers in understanding past and current trends

Before You Begin

Before you begin we strongly recom

mend you dow

nload PyCharm and Anaconda before view

editexecute ourcodePyCharm

Anaconda

Amphetam

ine on the Mac App Store Keep M

ac awake w

ith this useful App (Otherw

ise it will disconnect

internet)

Prerequisites

Please make sure you have installed all of the follow

ing prerequisites on your development m

achine

Git - httpsgit-scm

comdow

nloadsG

it Authentication tokenPython 37 w

ith pip

Jupyter Notebook 600

All external libraries used listed in requirementstxt

Install

Install some dependencies M

ake sure you have the latest pip

Install a list of requirements specified in r

equirementstxt

Running

All the code scripts run from the root

1 Data Collection

Clone our project

Run python3 model_searcherpy

to get keyword related repositories m

etadata in GitH

ub in output

folderAn authentication key is required to get a higher request rateN

ote Request rate could be changed in function call_api

and call_api_dir

Then you need to run sh JSONFormattersh

in your terminal to w

ell-format your output data

Sample Case

In main()

change keywords

in terms of interest Then the resulting JSO

N file w

ill beoutputbertJSON

Customized sorting m

ethod in function get_total_pages

and request_ith_page

Could besort

updated

or stars

order

asc

or desc

2 Repository Search

Run python3 forks_time_stamp_getterpy

to get all your the forks timestam

p in forked_timestamp

3 Data Selection (O

ptional)

pip3 install --upgrade pip

1

pip3 install -r requirementstxt

1

Run python3 repository_filterpy

to get your code-related repositories with statistics in

filtered_repo

folder

Run python3 filtered_repopy

to filter your data

NoteYour keyw

ords could be customized in m

odel_keywordpy

We store all the previous experim

ent data in tensorflow_model_filtering

andpytorch_model_filtering

4 Data Visualization

Popularity

Run python3 visualizationspopularitypy

and get your graphs invisualizationsgraphspopularity

Maintenance

Run python3 visualizationsmaintenancepy

and get your graphs invisualizationsgraphsmaintenance

Contribution

Run python3 visualizationscontributionpy

and get your graphs invisualizationsgraphscontribution

Multi Correlations

Run python3 visualizationsmulti_variablepy

and get your graphs invisualizationsgraphsmulti_variable

Test

Some G

itHub repositories does not m

aintained well and their links som

etimes are broken and unreachable To

guarantee your best experience in using our tool we provide testing unit for G

itHub links in t

estpy

This module

will record all the unreachable links and w

rite them into file

unreachable_urlstxt

UsageChange elem

ents in keywords

run python3 testpy

All the unreachable links will w

rite tounreachable_urlstxt

Customizing Your O

wn Search

In module M

odelpy

define your own entity lists (eg t

ensorflow_models

)

In Constructor Model

we store all unfiltered_data filtered_data and forked_tim

e_location in three folders

Instantiation

Since you already got data from the previous steps (1-2) Then you can construct a m

odel by calling aconstructor M

odel

eg bert = Model(bert tensorflow desc_by_star)

parameter M

odel_name and Respository m

etadata subfolder

Then you can call this object with its relative data easily (

from Model import bert

and use bert

as you goalong)

Customize Keyw

ords

In module m

odel_keywordpy

import your instantiation (

lstm

) and call add_keywords

eg

High Level D

escription of all Modules amp

Datasets

1 Data Collection

2 Repository Search

3 (Optional) D

ata Selection

4 Data Visualization

Altair is used to draw elegant graphs

Experiment D

atasets Collected

lstm_keywords = [tfkeraslayersLSTMCell tfnnrnn_cellLSTMCell]

lstmadd_keywords(lstm_keywords)

12

model_searcherpy

item_filterpy

12

model_searcherpy

forks_time_stamp_getterpy

12

repository_filterpy

filtered_repopy

12

contribution_statpy

entropy_calculationpy

Analysiscontribution_relatedpy

Analysismeta_datapy

1234

1 After Data Collection

output

asc_by_star

cnn tensorflowjson

$

lstm tensorflowjson

asc_general

bertjson

cnnjson

lstmjson

ncfjson

resnetjson

transformerjson

$

wide deepjson

by_update_time

123456789

10

11

12

13

14

bert tensorflowjson

cnn tensorflowjson

lstm tensorflowjson

ncf tensorflowjson

resnet tensorflowjson

transformer tensorflowjson

$

wide deep tensorflowjson

desc_by_star

bert tensorflowjson

cnn tensorflowjson

lstm tensorflowjson

ncf tensorflowjson

resnet tensorflowjson

transformer tensorflowjson

$

wide deep tensorflowjson

desc_general

bertjson

cnnjson

lstmjson

ncfjson

resnetjson

transformerjson

$

wide deepjson

$

pytorch_models

AlexNetjson

DCGANjson

Densenetjson

FCN-ResNet101json

GoogleNetjson

HarDNetjson

Inception_v3json

MobileNet v2json

PGANjson

ResNetjson

ResNet101json

ResNext WSLjson

ResNextjson

RoBERTajson

SSDjson

ShuffleNet v2json

SqueezeNetjson

Tacotron 2json

Transformerjson

U-Net pytorchjson

U-Netjson

WaveGlowjson

Wide ResNetjson

fairseqjson

$

vgg_netsjson

2 After Repository Search

forked_timestamp

bert tensorflowcsv

cnn tensorflowcsv

lstm tensorflowcsv

ncf tensorflowcsv

resnet tensorflowcsv

transformer tensorflowcsv

$

wide deep tensorflowcsv

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

70

71

72

73

Generated G

raphs

3 After Data Selection (Optional)

filtered_repo

bertjson

pytorch_model_filtering

Densenetjson

FCN-ResNet101json

GoogleNetjson

MobileNet v2json

ResNet101json

ResNextjson

ShuffleNet v2json

SqueezeNetjson

Tacotron 2json

Wide ResNetjson

$

vgg_netsjson

$

tensorflow_model_filtering

bertjson

lstmjson

ncfjson

resnetjson

transformerjson

$

wide deepjson

74

75

76

77

78

79

80

81

82

83

84

85

86

87

88

89

90

91

92

93

94

95

96

97

98

graphs

contribution

change_to_pdfbash

entropy_distributionsvg

entropy_dotssvg

lines_changed_boxssvg

lines_changed_histssvg

unique_percentage_distributionsvg

uniqueness_chartsvg

maintenance

devTime_boxplotsvg

issues_distributionsvg

wiki_ynsvg

multi_variable

dev_t_to_open_issuessvg

multi_correlationsvg

star_to_contributorssvg

star_to_dev_tsvg

star_to_entropysvg

$

star_to_open_issuessvg

$

popularity

accumulated_popularitysvg

creation_repository_trend_totalsvg

creation_with_fork_timelinesvg

fork_distributionsvg

popularity_dotsvg

$

popularity_measurement_correlationsvg

123456789

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

Authors

Xing(Nicole) Yu w

ith

Under the Supervison of D

r Ben Swift

License and References

MIT copy

Xing Yu

Stamper Im

age from httpbestpriceforrubberstam

pscom (License Free for personal use only)

Bibliography

a Github description httpshelpgithubcomenenterprise216userarticlessaving-repositories-with-stars Accessed 2019-09-22 (cited on pages xv and 20)

b Github description httpswwwmetrics-toolkitorggithub-forks-collaborators-watchers Accessed 2019-09-22 (cited on pagesxv 19 and 20)

c Github description httpshelpgithubcomenarticleswatching-and-unwatching-repositories Accessed 2019-09-22 (cited on page20)

d Github Search API description httpsdevelopergithubcomv3rate-limitingAccessed 2019-09-22 (cited on page 12)

Abadi M Barham P Chen J Chen Z Davis A Dean J Devin MGhemawat S Irving G Isard M Kudlur M Levenberg J Monga RMoore S Murray D G Steiner B Tucker P Vasudevan V Warden PWicke M Yu Y and Zheng X 2016 Tensorflow A system for large-scalemachine learning In 12th USENIX Symposium on Operating Systems Design andImplementation (OSDI 16) 265ndash283 USENIX Association Savannah GA httpswwwusenixorgconferenceosdi16technical-sessionspresentationabadi (citedon page 4)

Borges H Hora A and Valente M T 2016a Predicting the popularity ofgithub repositories In Proceedings of the The 12th International Conference on Pre-dictive Models and Data Analytics in Software Engineering 9 ACM (cited on page8)

Borges H Hora A and Valente M T 2016b Understanding the factors thatimpact the popularity of github repositories In 2016 IEEE International Conferenceon Software Maintenance and Evolution (ICSME) 334ndash344 IEEE (cited on pages 8and 19)



7.3 Appendix 3: Artefact Description

Except where otherwise indicated, this program is my own original work.

7.3.1 Code Files Submitted

0. Configuration and setup: __init__.py, setup.py, Model.py, model_keyword.py, test.py, JSONFormatter.sh, change_to_pdf.sh
1. Data collection: model_searcher.py, item_filter.py
2. Repository search: forks_time_stamp_getter.py
3. (Optional) Data selection: repository_filter.py, filtered_repo.py
4. Data visualization (using Altair): contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py

7.3.2 Program Testing

test.py

7.3.3 Experiment

Hardware

MacBook Pro (Retina, 15-inch, Mid 2015); Processor: 2.2 GHz Intel Core i7; Memory: 16 GB 1600 MHz DDR3; Graphics: Intel Iris Pro 1536 MB

Softwares

• PyCharm 2019.1.3 (Professional Edition), Build PY-191.7479.30, built on May 30, 2019; licensed to ANU / Xing Yu; JRE 11.0.2+9-b159.60 x86_64; JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.; macOS 10.14.6
• Anaconda: jupyter-notebook 6.0.0

Other

• Python 3.7.4, with pandas==0.22.0, numpy==1.14.0, statistics==1.0.3.5, ratelimit==2.2.1, requests, altair, matplotlib==2.2.2, selenium
• Git

Datasets

asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

asc_by_star/: cnn tensorflow.json, lstm tensorflow.json

pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net.json, U-Net pytorch.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json

filtered_repo/tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

filtered_repo/pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

7.4 Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

STAMPER is a Python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Table of Contents: Before You Begin; Prerequisites; Install; Running; Test; High Level Description of all Modules & Datasets; Authors; License.

Before You Begin

Before you begin, we strongly recommend you download PyCharm and Anaconda to view, edit and execute our code. Amphetamine (on the Mac App Store) keeps the Mac awake with this useful app (otherwise it will disconnect from the internet).

Prerequisites

Please make sure you have installed all of the following prerequisites on your development machine:

• Git (https://git-scm.com/downloads)
• a GitHub authentication token
• Python 3.7 with pip
• Jupyter Notebook 6.0.0
• all external libraries listed in requirements.txt

Install

Make sure you have the latest pip, then install the requirements specified in requirements.txt:

    pip3 install --upgrade pip
    pip3 install -r requirements.txt

Running

All the code scripts run from the root.

1. Data Collection

Clone our project. Run python3 model_searcher.py to collect keyword-related repository metadata from GitHub into the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir.

Then run sh JSONFormatter.sh in your terminal to well-format the output data.

Sample case: in main(), change keywords to the terms of interest; the resulting JSON file will be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars, and order can be asc or desc.

2. Repository Search

Run python3 forks_time_stamp_getter.py to collect all the fork timestamps into forked_timestamp.

3. Data Selection (Optional)

Run python3 repository_filter.py to get code-related repositories with statistics in the filtered_repo folder, then run python3 filtered_repo.py to filter the data. Note: keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.
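The collection step reduces to authenticated, paginated calls to GitHub's public repository-search endpoint. A minimal sketch of that idea follows; the helper names and structure are ours for illustration, not the submitted model_searcher.py code:

```python
import json
import urllib.parse
import urllib.request

API = "https://api.github.com/search/repositories"

def build_search_url(keyword, sort="stars", order="desc", page=1, per_page=100):
    # sort/order mirror the options above: `stars`/`updated`, `asc`/`desc`.
    query = urllib.parse.urlencode({"q": keyword, "sort": sort, "order": order,
                                    "page": page, "per_page": per_page})
    return API + "?" + query

def search_page(keyword, token, **kwargs):
    # An Authorization header raises GitHub's unauthenticated rate limit.
    req = urllib.request.Request(build_search_url(keyword, **kwargs),
                                 headers={"Authorization": "token " + token})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["items"]
```

Each page returns up to 100 repository metadata records; iterating `page` until the result count is exhausted yields the per-model JSON files stored under output/.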

4. Data Visualization

• Popularity: run python3 visualizations/popularity.py; graphs are written to visualizations/graphs/popularity.
• Maintenance: run python3 visualizations/maintenance.py; graphs are written to visualizations/graphs/maintenance.
• Contribution: run python3 visualizations/contribution.py; graphs are written to visualizations/graphs/contribution.
• Multi correlations: run python3 visualizations/multi_variable.py; graphs are written to visualizations/graphs/multi_variable.
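The popularity graphs are derived from the per-model fork timestamps gathered in step 2. As an illustration only — the column name `timestamp` is an assumed schema, not necessarily what the real CSVs use — monthly fork counts can be accumulated with pandas:

```python
import pandas as pd

def monthly_fork_counts(csv_path):
    """Count forks per month from a forked_timestamp CSV.

    Assumes a `timestamp` column of fork-creation dates (hypothetical
    schema for illustration).
    """
    df = pd.read_csv(csv_path, parse_dates=["timestamp"])
    return (df.set_index("timestamp")
              .resample("MS")   # bucket by month start
              .size()
              .rename("forks"))
```

A cumulative sum of this series gives the kind of accumulated-popularity timeline plotted above.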

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee the best experience using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them into the file unreachable_urls.txt. Usage: change the elements in keywords, then run python3 test.py; all the unreachable links will be written to unreachable_urls.txt.
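The link-checking idea behind test.py can be sketched as follows (a hypothetical re-implementation, not the submitted code; the fetch callable is injectable so the logic can be exercised without a network):

```python
import urllib.request

def find_unreachable(urls, fetch=None):
    """Return the subset of `urls` that cannot be fetched."""
    if fetch is None:
        def fetch(url):
            # HEAD is enough to probe reachability without downloading pages.
            urllib.request.urlopen(
                urllib.request.Request(url, method="HEAD"), timeout=10)
    unreachable = []
    for url in urls:
        try:
            fetch(url)
        except OSError:  # URLError/HTTPError and socket timeouts all subclass OSError
            unreachable.append(url)
    return unreachable

def write_report(urls, path="unreachable_urls.txt", fetch=None):
    # Mirror the described behaviour: record broken links in a file.
    bad = find_unreachable(urls, fetch)
    with open(path, "w") as fh:
        fh.write("\n".join(bad))
    return bad
```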

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models). In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation: once you have the data from the previous steps (1–2), you can construct a model by calling the constructor Model, e.g. bert = Model('bert tensorflow', 'desc_by_star'), with parameters model_name and the repository-metadata subfolder. Then you can call this object with its related data easily (from Model import bert, and use bert as you go along).

Customize keywords: in module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.:

    lstm_keywords = ['tf.keras.layers.LSTMCell', 'tf.nn.rnn_cell.LSTMCell']
    lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py. Altair is used to draw elegant graphs.
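The Model/add_keywords workflow described above can be sketched as a minimal class (a hypothetical re-implementation for illustration, not the submitted Model.py):

```python
import json
from pathlib import Path

class Model:
    """Hypothetical sketch of the Model entity: one deep learning model,
    tied to its repository-metadata file and code-level search keywords."""

    def __init__(self, model_name, metadata_subfolder, root="output"):
        self.model_name = model_name
        # e.g. output/desc_by_star/bert tensorflow.json
        self.metadata_path = Path(root, metadata_subfolder, model_name + ".json")
        self.keywords = []  # code-level search terms for the filtering step

    def add_keywords(self, keywords):
        self.keywords.extend(keywords)

    def load_metadata(self):
        with open(self.metadata_path) as fh:
            return json.load(fh)

# Instantiation, mirroring the README's example
bert = Model("bert tensorflow", "desc_by_star")
lstm = Model("lstm tensorflow", "desc_by_star")
lstm.add_keywords(["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"])
```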

Experiment Datasets Collected

1. After Data Collection:

output/
    asc_by_star/: cnn tensorflow.json, lstm tensorflow.json
    asc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    by_update_time/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    desc_by_star/: bert tensorflow.json, cnn tensorflow.json, lstm tensorflow.json, ncf tensorflow.json, resnet tensorflow.json, transformer tensorflow.json, wide deep tensorflow.json
    desc_general/: bert.json, cnn.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json
    pytorch_models/: AlexNet.json, DCGAN.json, Densenet.json, FCN-ResNet101.json, GoogleNet.json, HarDNet.json, Inception_v3.json, MobileNet v2.json, PGAN.json, ResNet.json, ResNet101.json, ResNext WSL.json, ResNext.json, RoBERTa.json, SSD.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Transformer.json, U-Net pytorch.json, U-Net.json, WaveGlow.json, Wide ResNet.json, fairseq.json, vgg_nets.json

2. After Repository Search:

forked_timestamp/: bert tensorflow.csv, cnn tensorflow.csv, lstm tensorflow.csv, ncf tensorflow.csv, resnet tensorflow.csv, transformer tensorflow.csv, wide deep tensorflow.csv

3. After Data Selection (Optional):

filtered_repo/
    bert.json
    pytorch_model_filtering/: Densenet.json, FCN-ResNet101.json, GoogleNet.json, MobileNet v2.json, ResNet101.json, ResNext.json, ShuffleNet v2.json, SqueezeNet.json, Tacotron 2.json, Wide ResNet.json, vgg_nets.json
    tensorflow_model_filtering/: bert.json, lstm.json, ncf.json, resnet.json, transformer.json, wide deep.json

Generated Graphs:

graphs/
    contribution/: change_to_pdf.bash, entropy_distribution.svg, entropy_dots.svg, lines_changed_boxs.svg, lines_changed_hists.svg, unique_percentage_distribution.svg, uniqueness_chart.svg
    maintenance/: devTime_boxplot.svg, issues_distribution.svg, wiki_yn.svg
    multi_variable/: dev_t_to_open_issues.svg, multi_correlation.svg, star_to_contributors.svg, star_to_dev_t.svg, star_to_entropy.svg, star_to_open_issues.svg
    popularity/: accumulated_popularity.svg, creation_repository_trend_total.svg, creation_with_fork_timeline.svg, fork_distribution.svg, popularity_dot.svg, popularity_measurement_correlation.svg

Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu. Stamper image from http://bestpriceforrubberstamps.com (license: free for personal use only).

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)


Abadi M Barham P Chen J Chen Z Davis A Dean J Devin MGhemawat S Irving G Isard M Kudlur M Levenberg J Monga RMoore S Murray D G Steiner B Tucker P Vasudevan V Warden PWicke M Yu Y and Zheng X 2016 Tensorflow A system for large-scalemachine learning In 12th USENIX Symposium on Operating Systems Design andImplementation (OSDI 16) 265ndash283 USENIX Association Savannah GA httpswwwusenixorgconferenceosdi16technical-sessionspresentationabadi (citedon page 4)

Borges H Hora A and Valente M T 2016a Predicting the popularity ofgithub repositories In Proceedings of the The 12th International Conference on Pre-dictive Models and Data Analytics in Software Engineering 9 ACM (cited on page8)

Borges H Hora A and Valente M T 2016b Understanding the factors thatimpact the popularity of github repositories In 2016 IEEE International Conferenceon Software Maintenance and Evolution (ICSME) 334ndash344 IEEE (cited on pages 8and 19)

Casalnuovo C Suchak Y Ray B and Rubio-Gonzaacutelez C 2017 Gitcproc Atool for processing and classifying github commits In Proceedings of the 26th ACMSIGSOFT International Symposium on Software Testing and Analysis 396ndash399 ACM(cited on page 9)

Cheng H-T Koc L Harmsen J Shaked T Chandra T Aradhye H Ander-son G Corrado G Chai W Ispir M et al 2016 Wide amp deep learning

59

60 BIBLIOGRAPHY

for recommender systems In Proceedings of the 1st workshop on deep learning forrecommender systems 7ndash10 ACM (cited on page 7)

Collberg C Kobourov S Nagra J Pitts J and Wampler K 2003 A systemfor graph-based visualization of the evolution of software In Proceedings of the 2003ACM symposium on Software visualization 77ndashff ACM (cited on page 10)

Corder G W and Foreman D I 2011 Nonparametric statistics for non-statisticians (cited on page 22)

Devlin J Chang M-W Lee K and Toutanova K 2018 Bert Pre-trainingof deep bidirectional transformers for language understanding arXiv preprintarXiv181004805 (2018) (cited on page 6)

Feiner J and Andrews K 2018 Repovis Visual overviews and full-text searchin software repositories In 2018 IEEE Working Conference on Software Visualization(VISSOFT) 1ndash11 IEEE (cited on page 9)

Gote C Scholtes I and Schweitzer F 2019 git2net mining time-stamped co-editing networks from large git repositories In Proceedings of the 16th InternationalConference on Mining Software Repositories 433ndash444 IEEE Press (cited on pages xvand 10)

Gousios G Pinzger M and Deursen A v 2014 An exploratory study of thepull-based software development model In Proceedings of the 36th InternationalConference on Software Engineering 345ndash355 ACM (cited on page 8)

Gousios G and Spinellis D 2012 Ghtorrent Githubrsquos data from a firehoseIn 2012 9th IEEE Working Conference on Mining Software Repositories (MSR) 12ndash21IEEE (cited on page 9)

He X Liao L Zhang H Nie L Hu X and Chua T-S 2017 Neural collabo-rative filtering In Proceedings of the 26th international conference on world wide web173ndash182 International World Wide Web Conferences Steering Committee (citedon page 6)

Hochreiter S and Schmidhuber J 1997 Long short-term memory Neural com-putation 9 8 (1997) 1735ndash1780 (cited on page 5)

LeCun Y Bengio Y and Hinton G 2015 Deep learning nature 521 7553 (2015)436 (cited on page 3)

Servant F and Jones J A 2013 Chronos Visualizing slices of source-code historyIn 2013 First IEEE Working Conference on Software Visualization (VISSOFT) 1ndash4 IEEE(cited on page 9)

Sokol F Z Aniche M F and Gerosa M A 2013 Metricminer Supportingresearchers in mining software repositories In 2013 IEEE 13th International WorkingConference on Source Code Analysis and Manipulation (SCAM) 142ndash146 IEEE (citedon page 9)

BIBLIOGRAPHY 61

Subramanian V 2018 Deep Learning with PyTorch A practical approach to buildingneural network models using PyTorch Packt Publishing Ltd (cited on page 3)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on page 6)

  • Acknowledgments
  • Abstract
  • List of Abbreviations
  • Contents
  • Introduction
    • Trace Deep Learning use through GitHub
    • Contribution
    • Report Outline
      • Background and Related Work
        • Background
          • Deep learning
            • TensorFlow
            • PyTorch
              • Deep learning models
              • Summarized Timeline
                • Public Code Repositories
                  • Web-based hosting service
                  • Measuring Popularity From GitHub
                  • Extracting Messy Data in the Wild
                  • Visualizing data in Repositories
                    • Summary
                      • STAMPER Design and Implementation
                        • Overview
                        • Data Collection
                        • Repository Search
                        • Data Selection
                          • Example
                            • Construct the Visualizations
                            • Summary
                              • STAMPER in Action
                                • Popularity of Deep Learning Models in GitHub
                                  • Popularity Feature Selection
                                  • Past and Current Status A Full Integration
                                  • RQ1 How has the popularity of model changed over time A closer look at the deep learning models
                                  • RQ2 How popularity varies per model
                                  • RQ3 Does the popularity of models relate to other features
                                    • Contribution of Deep Learning Models in GitHub
                                      • Collaborative Contribution
                                      • RQ1 After forking do developers change the codebase
                                        • Maintenance of Deep Learning Models in GitHub
                                          • RQ1 How long has it been in existence
                                          • RQ2 Do old models have more issues compared to new models
                                          • RQ3 Are they well maintained
                                            • Summary
                                              • Discussion And Future Work
                                                • Discussion
                                                  • Data in the wild Limitation and Improvement
                                                  • Extensibility and Open-Source Software
                                                    • Future Work
                                                      • Social Network Analysis in GitHub
                                                      • Trend Detection using Commitments Timestamp
                                                          • Conclusion
                                                          • Appendix
                                                            • Appendix 1 Project Description
                                                              • Project Title
                                                              • Supervisors
                                                              • Project Description
                                                              • Learning Objectives
                                                                • Appendix 2 Study Contract
                                                                • Appendix 3 Artefact Description
                                                                  • Code Files Submitted
                                                                  • Program Testing
                                                                  • Experiment
                                                                    • Hardware
                                                                      • Softwares
                                                                      • Other
                                                                      • Datasets
                                                                        • Appendix 4 README
Page 67: Mapping the landscape of deep learning models use in the wild · Mapping the landscape of deep learning models use in the wild Xing Yu (u6034476) A report submitted for the course

54 Appendix

Wide ResNetjson Tacotron 2jsonFCN-ResNet101json PGANjson RoBERTajsonTransformerjson fairseqjsonGoogleNetjson ResNetjsonvgg_netsjson SSDjson U-Net pytorchjson

by_update_timebert tensorflowjson lstm tensorflowjsonresnet tensorflowjson wide deep tensorflowjsoncnn tensorflowjson ncf tensorflowjsontransformer tensorflowjson

filtered_repotensorflow_model_filteringbertjson lstmjsonncfjson resnetjsontransformerjson wide deepjson

filtered_repopytorch_model_filteringDensenetjson GoogleNetjsonResNet101json ShuffleNet v2jsonTacotron 2json vgg_netsjsonFCN-ResNet101json MobileNet v2jsonResNextjson SqueezeNetjson Wide ResNetjson

forked_timestampbert tensorflowcsv lstm tensorflowcsvresnet tensorflowcsv wide deep tensorflowcsvcnn tensorflowcsv ncf tensorflowcsvtransformer tensorflowcsv

Appendix 4: README

STAMPER: Mapping the landscape of deep learning models use in the wild

Table of Contents
Before You Begin · Prerequisites · Install · Running · Test · High Level Description of all Modules & Datasets · Authors · License

STAMPER is a python software tool for examining real-world software repositories related to deep learning models, to aid researchers in understanding past and current trends.

Before You Begin
Before you begin, we strongly recommend you download PyCharm and Anaconda to view, edit and execute our code.
Amphetamine (on the Mac App Store): keep your Mac awake with this useful app (otherwise it will disconnect from the internet).

Prerequisites
Please make sure you have installed all of the following prerequisites on your development machine:
Git - https://git-scm.com/downloads
Git authentication token
Python 3.7 with pip
Jupyter Notebook 6.0.0
All external libraries used, listed in requirements.txt

Install
Install some dependencies. Make sure you have the latest pip, then install the list of requirements specified in requirements.txt:

pip3 install --upgrade pip
pip3 install -r requirements.txt

Running
All the code scripts run from the root.

1. Data Collection
Clone our project. Run python3 model_searcher.py to get keyword-related repository metadata from GitHub in the output folder. An authentication key is required to get a higher request rate. Note: the request rate can be changed in the functions call_api and call_api_dir. Then run sh JSONFormatter.sh in your terminal to well-format your output data.

Sample Case: in main(), change keywords to the terms of interest; the resulting JSON file will then be output/bert.JSON. The sorting method can be customized in the functions get_total_pages and request_ith_page: sort can be updated or stars, and order can be asc or desc.

2. Repository Search
Run python3 forks_time_stamp_getter.py to get all the forks' timestamps in forked_timestamp.

3. Data Selection (Optional)
Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder. Run python3 filtered_repo.py to filter your data.
Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.
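Under the hood, step 1 queries the GitHub Search API with exactly the sort/order options listed above. A minimal sketch of the request it builds (build_search_url is an illustrative helper, not the tool's actual function; the endpoint and parameter names come from the GitHub Search API):

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"

def build_search_url(keyword, sort="stars", order="desc", page=1, per_page=100):
    # sort: "stars" or "updated"; order: "asc" or "desc" -- the same
    # options exposed by get_total_pages and request_ith_page
    params = {"q": keyword, "sort": sort, "order": order,
              "page": page, "per_page": per_page}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"

url = build_search_url("bert tensorflow")
```

Sending the request with an "Authorization: token <your-token>" header is what the "authentication key" above refers to: authenticated search requests get a substantially higher rate limit than anonymous ones.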

4. Data Visualization
Popularity: run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity.
Maintenance: run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance.
Contribution: run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution.
Multi Correlations: run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable.

Test
Some GitHub repositories are not well maintained, and their links are sometimes broken and unreachable. To guarantee the best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module records all the unreachable links and writes them into the file unreachable_urls.txt.
Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.
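The check test.py performs can be sketched roughly as follows (is_reachable and record_unreachable are illustrative names, not the module's actual API; the real script iterates over the keywords list):

```python
import urllib.request
import urllib.error

def is_reachable(url, timeout=5):
    # A HEAD request keeps the check cheap; any error or timeout
    # counts as unreachable.
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout):
            return True
    except (urllib.error.URLError, ValueError):
        return False

def record_unreachable(urls, check=is_reachable, path="unreachable_urls.txt"):
    # Collect the links that fail the check and persist them to a file.
    bad = [u for u in urls if not check(u)]
    with open(path, "w") as f:
        f.write("\n".join(bad))
    return bad
```

The checker is passed in as a parameter so the filtering and file-writing logic can be exercised without touching the network.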

Customizing Your Own Search
In module Model.py, define your own entity lists (e.g. tensorflow_models).
In the constructor Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation
Since you already have data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with the parameters model_name and the repository metadata subfolder. Then you can call this object with its relative data easily (from Model import bert, and use bert as you go along).
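Assuming the folder layout described above, the constructor can be pictured like this (a sketch only: the attribute names follow the README's description, but the real Model.py may differ):

```python
import json
from pathlib import Path

class Model:
    """One deep learning model plus the folders its data was saved under."""

    def __init__(self, model_name, subfolder):
        self.model_name = model_name
        # the three locations the README describes
        self.unfiltered_data = Path("output") / subfolder / f"{model_name}.json"
        self.filtered_data = Path("filtered_repo") / f"{model_name}.json"
        self.forked_time_location = Path("forked_timestamp") / f"{model_name}.csv"

    def load_metadata(self):
        # repository metadata as collected by the data-collection step
        return json.loads(self.unfiltered_data.read_text())

bert = Model("bert tensorflow", "desc_by_star")
```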

Customize Keywords
In module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.:

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets
1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected

1. After Data Collection

output
├── asc_by_star
│   ├── cnn tensorflow.json
│   └── lstm tensorflow.json
├── asc_general
│   ├── bert.json
│   ├── cnn.json
│   ├── lstm.json
│   ├── ncf.json
│   ├── resnet.json
│   ├── transformer.json
│   └── wide deep.json
├── by_update_time
│   ├── bert tensorflow.json
│   ├── cnn tensorflow.json
│   ├── lstm tensorflow.json
│   ├── ncf tensorflow.json
│   ├── resnet tensorflow.json
│   ├── transformer tensorflow.json
│   └── wide deep tensorflow.json
├── desc_by_star
│   ├── bert tensorflow.json
│   ├── cnn tensorflow.json
│   ├── lstm tensorflow.json
│   ├── ncf tensorflow.json
│   ├── resnet tensorflow.json
│   ├── transformer tensorflow.json
│   └── wide deep tensorflow.json
├── desc_general
│   ├── bert.json
│   ├── cnn.json
│   ├── lstm.json
│   ├── ncf.json
│   ├── resnet.json
│   ├── transformer.json
│   └── wide deep.json
└── pytorch_models
    ├── AlexNet.json
    ├── DCGAN.json
    ├── Densenet.json
    ├── FCN-ResNet101.json
    ├── GoogleNet.json
    ├── HarDNet.json
    ├── Inception_v3.json
    ├── MobileNet v2.json
    ├── PGAN.json
    ├── ResNet.json
    ├── ResNet101.json
    ├── ResNext WSL.json
    ├── ResNext.json
    ├── RoBERTa.json
    ├── SSD.json
    ├── ShuffleNet v2.json
    ├── SqueezeNet.json
    ├── Tacotron 2.json
    ├── Transformer.json
    ├── U-Net pytorch.json
    ├── U-Net.json
    ├── WaveGlow.json
    ├── Wide ResNet.json
    ├── fairseq.json
    └── vgg_nets.json

2. After Repository Search

forked_timestamp
├── bert tensorflow.csv
├── cnn tensorflow.csv
├── lstm tensorflow.csv
├── ncf tensorflow.csv
├── resnet tensorflow.csv
├── transformer tensorflow.csv
└── wide deep tensorflow.csv
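Each .json file above holds GitHub Search API results for one model; the popularity fields used later (stars, forks) live on each item under the API's standard names. A quick illustrative way to pull them out (the sample record is made up):

```python
import json

def popularity_stats(search_result):
    # "items", "full_name", "stargazers_count" and "forks_count" are
    # standard GitHub Search API field names
    return [(repo["full_name"], repo["stargazers_count"], repo["forks_count"])
            for repo in search_result["items"]]

sample = json.loads("""{"items": [
    {"full_name": "google-research/bert",
     "stargazers_count": 25000, "forks_count": 7000}]}""")
stats = popularity_stats(sample)
```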


3. After Data Selection (Optional)

filtered_repo
├── bert.json
├── pytorch_model_filtering
│   ├── Densenet.json
│   ├── FCN-ResNet101.json
│   ├── GoogleNet.json
│   ├── MobileNet v2.json
│   ├── ResNet101.json
│   ├── ResNext.json
│   ├── ShuffleNet v2.json
│   ├── SqueezeNet.json
│   ├── Tacotron 2.json
│   ├── Wide ResNet.json
│   └── vgg_nets.json
└── tensorflow_model_filtering
    ├── bert.json
    ├── lstm.json
    ├── ncf.json
    ├── resnet.json
    ├── transformer.json
    └── wide deep.json

4. Generated Graphs

graphs
├── contribution
│   ├── change_to_pdf.bash
│   ├── entropy_distribution.svg
│   ├── entropy_dots.svg
│   ├── lines_changed_boxs.svg
│   ├── lines_changed_hists.svg
│   ├── unique_percentage_distribution.svg
│   └── uniqueness_chart.svg
├── maintenance
│   ├── devTime_boxplot.svg
│   ├── issues_distribution.svg
│   └── wiki_yn.svg
├── multi_variable
│   ├── dev_t_to_open_issues.svg
│   ├── multi_correlation.svg
│   ├── star_to_contributors.svg
│   ├── star_to_dev_t.svg
│   ├── star_to_entropy.svg
│   └── star_to_open_issues.svg
└── popularity
    ├── accumulated_popularity.svg
    ├── creation_repository_trend_total.svg
    ├── creation_with_fork_timeline.svg
    ├── fork_distribution.svg
    ├── popularity_dot.svg
    └── popularity_measurement_correlation.svg
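The entropy_*.svg graphs above come from entropy_calculation.py, which measures how evenly a repository's commits are spread across contributors. A sketch of that computation as Shannon entropy over contributor commit shares (an assumption about the implementation, not a copy of it):

```python
import math

def contribution_entropy(commit_counts):
    # Shannon entropy (in bits) of the contributor commit distribution:
    # 0 when one author dominates, log2(n) when n authors contribute equally.
    total = sum(commit_counts)
    shares = [c / total for c in commit_counts if c > 0]
    return -sum(p * math.log2(p) for p in shares)
```

Higher values indicate more collaborative repositories, which is what the entropy_distribution and entropy_dots charts compare across models.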


Authors
Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References
MIT © Xing Yu
Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only).

Bibliography

[a] GitHub documentation: https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (Cited on pages xv and 20.)

[b] GitHub forks, collaborators, watchers: https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (Cited on pages xv, 19 and 20.)

[c] GitHub documentation: https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (Cited on page 20.)

[d] GitHub Search API rate limiting: https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (Cited on page 12.)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (Cited on page 4.)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (Cited on page 8.)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (Cited on pages 8 and 19.)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (Cited on page 9.)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (Cited on page 7.)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (Cited on page 10.)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (Cited on page 22.)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (Cited on page 6.)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (Cited on page 9.)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (Cited on pages xv and 10.)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (Cited on page 8.)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (Cited on page 9.)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (Cited on page 6.)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (Cited on page 5.)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (Cited on page 3.)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (Cited on page 9.)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (Cited on page 9.)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (Cited on page 3.)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (Cited on page 6.)

  • Acknowledgments
  • Abstract
  • List of Abbreviations
  • Contents
  • Introduction
    • Trace Deep Learning use through GitHub
    • Contribution
    • Report Outline
  • Background and Related Work
    • Background
      • Deep learning
        • TensorFlow
        • PyTorch
      • Deep learning models
      • Summarized Timeline
    • Public Code Repositories
      • Web-based hosting service
      • Measuring Popularity From GitHub
      • Extracting Messy Data in the Wild
      • Visualizing data in Repositories
    • Summary
  • STAMPER Design and Implementation
    • Overview
    • Data Collection
    • Repository Search
    • Data Selection
      • Example
    • Construct the Visualizations
    • Summary
  • STAMPER in Action
    • Popularity of Deep Learning Models in GitHub
      • Popularity Feature Selection
      • Past and Current Status: A Full Integration
      • RQ1: How has the popularity of model changed over time? A closer look at the deep learning models
      • RQ2: How popularity varies per model?
      • RQ3: Does the popularity of models relate to other features?
    • Contribution of Deep Learning Models in GitHub
      • Collaborative Contribution
      • RQ1: After forking, do developers change the codebase?
    • Maintenance of Deep Learning Models in GitHub
      • RQ1: How long has it been in existence?
      • RQ2: Do old models have more issues compared to new models?
      • RQ3: Are they well maintained?
    • Summary
  • Discussion And Future Work
    • Discussion
      • Data in the wild: Limitation and Improvement
      • Extensibility and Open-Source Software
    • Future Work
      • Social Network Analysis in GitHub
      • Trend Detection using Commitments Timestamp
  • Conclusion
  • Appendix
    • Appendix 1: Project Description
      • Project Title
      • Supervisors
      • Project Description
      • Learning Objectives
    • Appendix 2: Study Contract
    • Appendix 3: Artefact Description
      • Code Files Submitted
      • Program Testing
      • Experiment
        • Hardware
        • Softwares
        • Other
        • Datasets
    • Appendix 4: README

74

75

76

77

78

79

80

81

82

83

84

85

86

87

88

89

90

91

92

93

94

95

96

97

98

graphs

contribution

change_to_pdfbash

entropy_distributionsvg

entropy_dotssvg

lines_changed_boxssvg

lines_changed_histssvg

unique_percentage_distributionsvg

uniqueness_chartsvg

maintenance

devTime_boxplotsvg

issues_distributionsvg

wiki_ynsvg

multi_variable

dev_t_to_open_issuessvg

multi_correlationsvg

star_to_contributorssvg

star_to_dev_tsvg

star_to_entropysvg

$

star_to_open_issuessvg

$

popularity

accumulated_popularitysvg

creation_repository_trend_totalsvg

creation_with_fork_timelinesvg

fork_distributionsvg

popularity_dotsvg

$

popularity_measurement_correlationsvg

123456789

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

Authors

Xing(Nicole) Yu w

ith

Under the Supervison of D

r Ben Swift

License and References

MIT copy

Xing Yu

Stamper Im

age from httpbestpriceforrubberstam

pscom (License Free for personal use only)

Bibliography

a Github description httpshelpgithubcomenenterprise216userarticlessaving-repositories-with-stars Accessed 2019-09-22 (cited on pages xv and 20)

b Github description httpswwwmetrics-toolkitorggithub-forks-collaborators-watchers Accessed 2019-09-22 (cited on pagesxv 19 and 20)

c Github description httpshelpgithubcomenarticleswatching-and-unwatching-repositories Accessed 2019-09-22 (cited on page20)

d Github Search API description httpsdevelopergithubcomv3rate-limitingAccessed 2019-09-22 (cited on page 12)

Abadi M Barham P Chen J Chen Z Davis A Dean J Devin MGhemawat S Irving G Isard M Kudlur M Levenberg J Monga RMoore S Murray D G Steiner B Tucker P Vasudevan V Warden PWicke M Yu Y and Zheng X 2016 Tensorflow A system for large-scalemachine learning In 12th USENIX Symposium on Operating Systems Design andImplementation (OSDI 16) 265ndash283 USENIX Association Savannah GA httpswwwusenixorgconferenceosdi16technical-sessionspresentationabadi (citedon page 4)

Borges H Hora A and Valente M T 2016a Predicting the popularity ofgithub repositories In Proceedings of the The 12th International Conference on Pre-dictive Models and Data Analytics in Software Engineering 9 ACM (cited on page8)

Borges H Hora A and Valente M T 2016b Understanding the factors thatimpact the popularity of github repositories In 2016 IEEE International Conferenceon Software Maintenance and Evolution (ICSME) 334ndash344 IEEE (cited on pages 8and 19)

Casalnuovo C Suchak Y Ray B and Rubio-Gonzaacutelez C 2017 Gitcproc Atool for processing and classifying github commits In Proceedings of the 26th ACMSIGSOFT International Symposium on Software Testing and Analysis 396ndash399 ACM(cited on page 9)

Cheng H-T Koc L Harmsen J Shaked T Chandra T Aradhye H Ander-son G Corrado G Chai W Ispir M et al 2016 Wide amp deep learning

59

60 BIBLIOGRAPHY

for recommender systems In Proceedings of the 1st workshop on deep learning forrecommender systems 7ndash10 ACM (cited on page 7)

Collberg C Kobourov S Nagra J Pitts J and Wampler K 2003 A systemfor graph-based visualization of the evolution of software In Proceedings of the 2003ACM symposium on Software visualization 77ndashff ACM (cited on page 10)

Corder G W and Foreman D I 2011 Nonparametric statistics for non-statisticians (cited on page 22)

Devlin J Chang M-W Lee K and Toutanova K 2018 Bert Pre-trainingof deep bidirectional transformers for language understanding arXiv preprintarXiv181004805 (2018) (cited on page 6)

Feiner J and Andrews K 2018 Repovis Visual overviews and full-text searchin software repositories In 2018 IEEE Working Conference on Software Visualization(VISSOFT) 1ndash11 IEEE (cited on page 9)

Gote C Scholtes I and Schweitzer F 2019 git2net mining time-stamped co-editing networks from large git repositories In Proceedings of the 16th InternationalConference on Mining Software Repositories 433ndash444 IEEE Press (cited on pages xvand 10)

Gousios G Pinzger M and Deursen A v 2014 An exploratory study of thepull-based software development model In Proceedings of the 36th InternationalConference on Software Engineering 345ndash355 ACM (cited on page 8)

Gousios G and Spinellis D 2012 Ghtorrent Githubrsquos data from a firehoseIn 2012 9th IEEE Working Conference on Mining Software Repositories (MSR) 12ndash21IEEE (cited on page 9)

He X Liao L Zhang H Nie L Hu X and Chua T-S 2017 Neural collabo-rative filtering In Proceedings of the 26th international conference on world wide web173ndash182 International World Wide Web Conferences Steering Committee (citedon page 6)

Hochreiter S and Schmidhuber J 1997 Long short-term memory Neural com-putation 9 8 (1997) 1735ndash1780 (cited on page 5)

LeCun Y Bengio Y and Hinton G 2015 Deep learning nature 521 7553 (2015)436 (cited on page 3)

Servant F and Jones J A 2013 Chronos Visualizing slices of source-code historyIn 2013 First IEEE Working Conference on Software Visualization (VISSOFT) 1ndash4 IEEE(cited on page 9)

Sokol F Z Aniche M F and Gerosa M A 2013 Metricminer Supportingresearchers in mining software repositories In 2013 IEEE 13th International WorkingConference on Source Code Analysis and Manipulation (SCAM) 142ndash146 IEEE (citedon page 9)

BIBLIOGRAPHY 61

Subramanian V 2018 Deep Learning with PyTorch A practical approach to buildingneural network models using PyTorch Packt Publishing Ltd (cited on page 3)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on page 6)

  • Acknowledgments
  • Abstract
  • List of Abbreviations
  • Contents
  • Introduction
    • Trace Deep Learning use through GitHub
    • Contribution
    • Report Outline
      • Background and Related Work
        • Background
          • Deep learning
            • TensorFlow
            • PyTorch
              • Deep learning models
              • Summarized Timeline
                • Public Code Repositories
                  • Web-based hosting service
                  • Measuring Popularity From GitHub
                  • Extracting Messy Data in the Wild
                  • Visualizing data in Repositories
                    • Summary
                      • STAMPER Design and Implementation
                        • Overview
                        • Data Collection
                        • Repository Search
                        • Data Selection
                          • Example
                            • Construct the Visualizations
                            • Summary
                              • STAMPER in Action
                                • Popularity of Deep Learning Models in GitHub
                                  • Popularity Feature Selection
                                  • Past and Current Status A Full Integration
                                  • RQ1 How has the popularity of model changed over time A closer look at the deep learning models
                                  • RQ2 How popularity varies per model
                                  • RQ3 Does the popularity of models relate to other features
                                    • Contribution of Deep Learning Models in GitHub
                                      • Collaborative Contribution
                                      • RQ1 After forking do developers change the codebase
                                        • Maintenance of Deep Learning Models in GitHub
                                          • RQ1 How long has it been in existence
                                          • RQ2 Do old models have more issues compared to new models
                                          • RQ3 Are they well maintained
                                            • Summary
                                              • Discussion And Future Work
                                                • Discussion
                                                  • Data in the wild Limitation and Improvement
                                                  • Extensibility and Open-Source Software
                                                    • Future Work
                                                      • Social Network Analysis in GitHub
                                                      • Trend Detection using Commitments Timestamp
                                                          • Conclusion
                                                          • Appendix
                                                            • Appendix 1 Project Description
                                                              • Project Title
                                                              • Supervisors
                                                              • Project Description
                                                              • Learning Objectives
                                                                • Appendix 2 Study Contract
                                                                • Appendix 3 Artefact Description
                                                                  • Code Files Submitted
                                                                  • Program Testing
                                                                  • Experiment
                                                                    • Hardware
                                                                      • Softwares
                                                                      • Other
                                                                      • Datasets
                                                                        • Appendix 4 README
Page 69: Mapping the landscape of deep learning models use in the wild · Mapping the landscape of deep learning models use in the wild Xing Yu (u6034476) A report submitted for the course

Run python3 repository_filter.py to get your code-related repositories with statistics in the filtered_repo folder.

Run python3 filtered_repo.py to filter your data.

Note: your keywords can be customized in model_keyword.py. We store all the previous experiment data in tensorflow_model_filtering and pytorch_model_filtering.
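The filtering step above amounts to keeping only repositories whose metadata mentions a model-specific keyword. A minimal sketch of that idea, assuming a simple record layout with "name" and "description" fields (the real filtered_repo.py may use a different schema):

```python
# Sketch: filter repository records by model-specific keywords.
# The record fields ("name", "description") are assumptions for
# illustration; they may not match STAMPER's actual JSON schema.

def filter_repos(repos, keywords):
    """Keep repositories whose name or description mentions any keyword."""
    kept = []
    for repo in repos:
        # GitHub descriptions can be null, so guard against None.
        text = " ".join([repo.get("name", ""), repo.get("description") or ""]).lower()
        if any(kw.lower() in text for kw in keywords):
            kept.append(repo)
    return kept

repos = [
    {"name": "awesome-lstm", "description": "Uses tf.nn.rnn_cell.LSTMCell"},
    {"name": "hello-world", "description": "Nothing to do with deep learning"},
]
kept = filter_repos(repos, ["tf.nn.rnn_cell.LSTMCell", "lstm"])
```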

4. Data Visualization

Popularity

Run python3 visualizations/popularity.py and get your graphs in visualizations/graphs/popularity.

Maintenance

Run python3 visualizations/maintenance.py and get your graphs in visualizations/graphs/maintenance.

Contribution

Run python3 visualizations/contribution.py and get your graphs in visualizations/graphs/contribution.

Multi Correlations

Run python3 visualizations/multi_variable.py and get your graphs in visualizations/graphs/multi_variable.

Test

Some GitHub repositories are not maintained well, and their links are sometimes broken and unreachable. To guarantee your best experience in using our tool, we provide a testing unit for GitHub links in test.py. This module will record all the unreachable links and write them into the file unreachable_urls.txt.

Usage: change the elements in keywords, then run python3 test.py. All the unreachable links will be written to unreachable_urls.txt.

Customizing Your Own Search

In module Model.py, define your own entity lists (e.g. tensorflow_models).

In the constructor of Model, we store all unfiltered_data, filtered_data and forked_time_location in three folders.

Instantiation

Since you already got the data from the previous steps (1-2), you can construct a model by calling the constructor Model, e.g. bert = Model("bert tensorflow", "desc_by_star"), with parameters model_name and the repository-metadata subfolder. Then you can call this object with its relative data easily (from Model import bert and use bert as you go along).
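A minimal sketch of what such a constructor might look like, assuming the three folder names that appear elsewhere in this README (the exact paths and attribute names in Model.py may differ):

```python
import os

class Model:
    """Bundle one model's metadata locations, in the spirit of Model.py.

    model_name -- e.g. "bert tensorflow"
    subfolder  -- repository-metadata subfolder, e.g. "desc_by_star"
    """

    def __init__(self, model_name, subfolder, root="output"):
        self.model_name = model_name
        self.subfolder = subfolder
        # The three data locations the constructor is said to track
        # (folder names assumed from this README's directory listings).
        self.unfiltered_data = os.path.join(root, subfolder, model_name + ".json")
        self.filtered_data = os.path.join("filtered_repo", model_name + ".json")
        self.forked_time_location = os.path.join("forked_timestamp", model_name + ".csv")
        self.keywords = []

    def add_keywords(self, keywords):
        """Register search keywords for this model (cf. model_keyword.py)."""
        self.keywords.extend(keywords)
        return self.keywords

bert = Model("bert tensorflow", "desc_by_star")
lstm = Model("lstm tensorflow", "desc_by_star")
lstm.add_keywords(["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"])
```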

Customize Keywords

In module model_keyword.py, import your instantiation (lstm) and call add_keywords, e.g.

lstm_keywords = ["tf.keras.layers.LSTMCell", "tf.nn.rnn_cell.LSTMCell"]
lstm.add_keywords(lstm_keywords)

High Level Description of all Modules & Datasets

1. Data Collection: model_searcher.py, item_filter.py
2. Repository Search: model_searcher.py, forks_time_stamp_getter.py
3. (Optional) Data Selection: repository_filter.py, filtered_repo.py
4. Data Visualization: contribution_stat.py, entropy_calculation.py, Analysis/contribution_related.py, Analysis/meta_data.py (Altair is used to draw elegant graphs)

Experiment Datasets Collected
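The module list mentions entropy_calculation.py. The report's exact metric is not reproduced in this README, but a common formulation is the Shannon entropy of per-contributor commit shares; the sketch below assumes that formulation:

```python
import math

def commit_entropy(commit_counts):
    """Shannon entropy (bits) of a contributor commit distribution.

    A single-author repository scores 0; k equally active
    contributors score log2(k). (Assumed formulation, not
    necessarily entropy_calculation.py's exact metric.)
    """
    total = sum(commit_counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in commit_counts:
        if c > 0:
            p = c / total
            entropy -= p * math.log2(p)
    return entropy
```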

1. After Data Collection

output
    asc_by_star
        cnn tensorflow.json
        lstm tensorflow.json
    asc_general
        bert.json
        cnn.json
        lstm.json
        ncf.json
        resnet.json
        transformer.json
        wide deep.json
    by_update_time
        bert tensorflow.json
        cnn tensorflow.json
        lstm tensorflow.json
        ncf tensorflow.json
        resnet tensorflow.json
        transformer tensorflow.json
        wide deep tensorflow.json
    desc_by_star
        bert tensorflow.json
        cnn tensorflow.json
        lstm tensorflow.json
        ncf tensorflow.json
        resnet tensorflow.json
        transformer tensorflow.json
        wide deep tensorflow.json
    desc_general
        bert.json
        cnn.json
        lstm.json
        ncf.json
        resnet.json
        transformer.json
        wide deep.json
    pytorch_models
        AlexNet.json
        DCGAN.json
        Densenet.json
        FCN-ResNet101.json
        GoogleNet.json
        HarDNet.json
        Inception_v3.json
        MobileNet v2.json
        PGAN.json
        ResNet.json
        ResNet101.json
        ResNext WSL.json
        ResNext.json
        RoBERTa.json
        SSD.json
        ShuffleNet v2.json
        SqueezeNet.json
        Tacotron 2.json
        Transformer.json
        U-Net pytorch.json
        U-Net.json
        WaveGlow.json
        Wide ResNet.json
        fairseq.json
        vgg_nets.json

2. After Repository Search

forked_timestamp
    bert tensorflow.csv
    cnn tensorflow.csv
    lstm tensorflow.csv
    ncf tensorflow.csv
    resnet tensorflow.csv
    transformer tensorflow.csv
    wide deep tensorflow.csv
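Each CSV above holds fork timestamps for one model; aggregating them into a monthly trend takes only the standard library. The single-column layout assumed below may differ from the real files:

```python
import csv
import io
from collections import Counter

def forks_per_month(csv_text):
    """Count forks per YYYY-MM from a CSV with one ISO timestamp per row.

    The one-timestamp-per-row layout is an assumption about the
    forked_timestamp CSVs, made for illustration.
    """
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if row:  # skip blank lines
            counts[row[0][:7]] += 1  # "2019-03-14T09:00:00Z" -> "2019-03"
    return counts

sample = "2019-03-14T09:00:00Z\n2019-03-20T12:30:00Z\n2019-04-01T08:00:00Z\n"
monthly = forks_per_month(sample)
```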


3. After Data Selection (Optional)

filtered_repo
    bert.json
    pytorch_model_filtering
        Densenet.json
        FCN-ResNet101.json
        GoogleNet.json
        MobileNet v2.json
        ResNet101.json
        ResNext.json
        ShuffleNet v2.json
        SqueezeNet.json
        Tacotron 2.json
        Wide ResNet.json
        vgg_nets.json
    tensorflow_model_filtering
        bert.json
        lstm.json
        ncf.json
        resnet.json
        transformer.json
        wide deep.json

Generated Graphs

graphs
    contribution
        change_to_pdf.bash
        entropy_distribution.svg
        entropy_dots.svg
        lines_changed_boxs.svg
        lines_changed_hists.svg
        unique_percentage_distribution.svg
        uniqueness_chart.svg
    maintenance
        devTime_boxplot.svg
        issues_distribution.svg
        wiki_yn.svg
    multi_variable
        dev_t_to_open_issues.svg
        multi_correlation.svg
        star_to_contributors.svg
        star_to_dev_t.svg
        star_to_entropy.svg
        star_to_open_issues.svg
    popularity
        accumulated_popularity.svg
        creation_repository_trend_total.svg
        creation_with_fork_timeline.svg
        fork_distribution.svg
        popularity_dot.svg
        popularity_measurement_correlation.svg


Authors

Xing (Nicole) Yu, under the supervision of Dr Ben Swift.

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: free for personal use only).

Bibliography

[a] GitHub description. https://help.github.com/en/enterprise/2.16/user/articles/saving-repositories-with-stars. Accessed 2019-09-22. (cited on pages xv and 20)

[b] GitHub description. https://www.metrics-toolkit.org/github-forks-collaborators-watchers. Accessed 2019-09-22. (cited on pages xv, 19 and 20)

[c] GitHub description. https://help.github.com/en/articles/watching-and-unwatching-repositories. Accessed 2019-09-22. (cited on page 20)

[d] GitHub Search API description. https://developer.github.com/v3/#rate-limiting. Accessed 2019-09-22. (cited on page 12)

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X., 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. USENIX Association, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. (cited on page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric statistics for non-statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

  • Acknowledgments
  • Abstract
  • List of Abbreviations
  • Contents
  • Introduction
    • Trace Deep Learning use through GitHub
    • Contribution
    • Report Outline
  • Background and Related Work
    • Background
      • Deep learning
        • TensorFlow
        • PyTorch
      • Deep learning models
      • Summarized Timeline
    • Public Code Repositories
      • Web-based hosting service
      • Measuring Popularity From GitHub
      • Extracting Messy Data in the Wild
      • Visualizing data in Repositories
    • Summary
  • STAMPER Design and Implementation
    • Overview
    • Data Collection
    • Repository Search
    • Data Selection
      • Example
    • Construct the Visualizations
    • Summary
  • STAMPER in Action
    • Popularity of Deep Learning Models in GitHub
      • Popularity Feature Selection
      • Past and Current Status: A Full Integration
      • RQ1: How has the popularity of model changed over time? A closer look at the deep learning models
      • RQ2: How popularity varies per model?
      • RQ3: Does the popularity of models relate to other features?
    • Contribution of Deep Learning Models in GitHub
      • Collaborative Contribution
      • RQ1: After forking, do developers change the codebase?
    • Maintenance of Deep Learning Models in GitHub
      • RQ1: How long has it been in existence?
      • RQ2: Do old models have more issues compared to new models?
      • RQ3: Are they well maintained?
    • Summary
  • Discussion And Future Work
    • Discussion
      • Data in the wild: Limitation and Improvement
      • Extensibility and Open-Source Software
    • Future Work
      • Social Network Analysis in GitHub
      • Trend Detection using Commitments Timestamp
  • Conclusion
  • Appendix
    • Appendix 1: Project Description
      • Project Title
      • Supervisors
      • Project Description
      • Learning Objectives
    • Appendix 2: Study Contract
    • Appendix 3: Artefact Description
      • Code Files Submitted
      • Program Testing
      • Experiment
        • Hardware
        • Softwares
        • Other
        • Datasets
    • Appendix 4: README
Page 70: Mapping the landscape of deep learning models use in the wild · Mapping the landscape of deep learning models use in the wild Xing Yu (u6034476) A report submitted for the course

bert tensorflowjson

cnn tensorflowjson

lstm tensorflowjson

ncf tensorflowjson

resnet tensorflowjson

transformer tensorflowjson

$

wide deep tensorflowjson

desc_by_star

bert tensorflowjson

cnn tensorflowjson

lstm tensorflowjson

ncf tensorflowjson

resnet tensorflowjson

transformer tensorflowjson

$

wide deep tensorflowjson

desc_general

bertjson

cnnjson

lstmjson

ncfjson

resnetjson

transformerjson

$

wide deepjson

$

pytorch_models

AlexNetjson

DCGANjson

Densenetjson

FCN-ResNet101json

GoogleNetjson

HarDNetjson

Inception_v3json

MobileNet v2json

PGANjson

ResNetjson

ResNet101json

ResNext WSLjson

ResNextjson

RoBERTajson

SSDjson

ShuffleNet v2json

SqueezeNetjson

Tacotron 2json

Transformerjson

U-Net pytorchjson

U-Netjson

WaveGlowjson

Wide ResNetjson

fairseqjson

$

vgg_netsjson

2 After Repository Search

forked_timestamp

bert tensorflowcsv

cnn tensorflowcsv

lstm tensorflowcsv

ncf tensorflowcsv

resnet tensorflowcsv

transformer tensorflowcsv

$

wide deep tensorflowcsv

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

70

71

72

73

Generated G

raphs

3 After Data Selection (Optional)

filtered_repo

bertjson

pytorch_model_filtering

Densenetjson

FCN-ResNet101json

GoogleNetjson

MobileNet v2json

ResNet101json

ResNextjson

ShuffleNet v2json

SqueezeNetjson

Tacotron 2json

Wide ResNetjson

$

vgg_netsjson

$

tensorflow_model_filtering

bertjson

lstmjson

ncfjson

resnetjson

transformerjson

$

wide deepjson

74

75

76

77

78

79

80

81

82

83

84

85

86

87

88

89

90

91

92

93

94

95

96

97

98

graphs

contribution

change_to_pdfbash

entropy_distributionsvg

entropy_dotssvg

lines_changed_boxssvg

lines_changed_histssvg

unique_percentage_distributionsvg

uniqueness_chartsvg

maintenance

devTime_boxplotsvg

issues_distributionsvg

wiki_ynsvg

multi_variable

dev_t_to_open_issuessvg

multi_correlationsvg

star_to_contributorssvg

star_to_dev_tsvg

star_to_entropysvg

$

star_to_open_issuessvg

$

popularity

accumulated_popularitysvg

creation_repository_trend_totalsvg

creation_with_fork_timelinesvg

fork_distributionsvg

popularity_dotsvg

$

popularity_measurement_correlationsvg

123456789

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

Authors

Xing(Nicole) Yu w

ith

Under the Supervison of D

r Ben Swift

License and References

MIT copy

Xing Yu

Stamper Im

age from httpbestpriceforrubberstam

pscom (License Free for personal use only)

Bibliography

a Github description httpshelpgithubcomenenterprise216userarticlessaving-repositories-with-stars Accessed 2019-09-22 (cited on pages xv and 20)

b Github description httpswwwmetrics-toolkitorggithub-forks-collaborators-watchers Accessed 2019-09-22 (cited on pagesxv 19 and 20)

c Github description httpshelpgithubcomenarticleswatching-and-unwatching-repositories Accessed 2019-09-22 (cited on page20)

d Github Search API description httpsdevelopergithubcomv3rate-limitingAccessed 2019-09-22 (cited on page 12)

Abadi M Barham P Chen J Chen Z Davis A Dean J Devin MGhemawat S Irving G Isard M Kudlur M Levenberg J Monga RMoore S Murray D G Steiner B Tucker P Vasudevan V Warden PWicke M Yu Y and Zheng X 2016 Tensorflow A system for large-scalemachine learning In 12th USENIX Symposium on Operating Systems Design andImplementation (OSDI 16) 265ndash283 USENIX Association Savannah GA httpswwwusenixorgconferenceosdi16technical-sessionspresentationabadi (citedon page 4)

Borges, H.; Hora, A.; and Valente, M. T., 2016a. Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, 9. ACM. (cited on page 8)

Borges, H.; Hora, A.; and Valente, M. T., 2016b. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 334–344. IEEE. (cited on pages 8 and 19)

Casalnuovo, C.; Suchak, Y.; Ray, B.; and Rubio-González, C., 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 396–399. ACM. (cited on page 9)

Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al., 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM. (cited on page 7)

Collberg, C.; Kobourov, S.; Nagra, J.; Pitts, J.; and Wampler, K., 2003. A system for graph-based visualization of the evolution of software. In Proceedings of the 2003 ACM Symposium on Software Visualization, 77–ff. ACM. (cited on page 10)

Corder, G. W. and Foreman, D. I., 2011. Nonparametric Statistics for Non-Statisticians. (cited on page 22)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on page 6)

Feiner, J. and Andrews, K., 2018. RepoVis: Visual overviews and full-text search in software repositories. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), 1–11. IEEE. (cited on page 9)

Gote, C.; Scholtes, I.; and Schweitzer, F., 2019. git2net: Mining time-stamped co-editing networks from large git repositories. In Proceedings of the 16th International Conference on Mining Software Repositories, 433–444. IEEE Press. (cited on pages xv and 10)

Gousios, G.; Pinzger, M.; and Deursen, A. v., 2014. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, 345–355. ACM. (cited on page 8)

Gousios, G. and Spinellis, D., 2012. GHTorrent: GitHub's data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21. IEEE. (cited on page 9)

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S., 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee. (cited on page 6)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 8 (1997), 1735–1780. (cited on page 5)

LeCun, Y.; Bengio, Y.; and Hinton, G., 2015. Deep learning. Nature, 521, 7553 (2015), 436. (cited on page 3)

Servant, F. and Jones, J. A., 2013. Chronos: Visualizing slices of source-code history. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), 1–4. IEEE. (cited on page 9)

Sokol, F. Z.; Aniche, M. F.; and Gerosa, M. A., 2013. MetricMiner: Supporting researchers in mining software repositories. In 2013 IEEE 13th International Working Conference on Source Code Analysis and Manipulation (SCAM), 142–146. IEEE. (cited on page 9)

Subramanian, V., 2018. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd. (cited on page 3)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on page 6)

  • Acknowledgments
  • Abstract
  • List of Abbreviations
  • Contents
  • Introduction
    • Trace Deep Learning use through GitHub
    • Contribution
    • Report Outline
  • Background and Related Work
    • Background
      • Deep learning
        • TensorFlow
        • PyTorch
        • Deep learning models
        • Summarized Timeline
      • Public Code Repositories
        • Web-based hosting service
        • Measuring Popularity From GitHub
        • Extracting Messy Data in the Wild
        • Visualizing data in Repositories
    • Summary
  • STAMPER Design and Implementation
    • Overview
    • Data Collection
    • Repository Search
    • Data Selection
      • Example
    • Construct the Visualizations
    • Summary
  • STAMPER in Action
    • Popularity of Deep Learning Models in GitHub
      • Popularity Feature Selection
      • Past and Current Status: A Full Integration
      • RQ1: How has the popularity of model changed over time? A closer look at the deep learning models
      • RQ2: How popularity varies per model
      • RQ3: Does the popularity of models relate to other features?
    • Contribution of Deep Learning Models in GitHub
      • Collaborative Contribution
      • RQ1: After forking, do developers change the codebase?
    • Maintenance of Deep Learning Models in GitHub
      • RQ1: How long has it been in existence?
      • RQ2: Do old models have more issues compared to new models?
      • RQ3: Are they well maintained?
    • Summary
  • Discussion And Future Work
    • Discussion
      • Data in the wild: Limitation and Improvement
      • Extensibility and Open-Source Software
    • Future Work
      • Social Network Analysis in GitHub
      • Trend Detection using Commitments Timestamp
  • Conclusion
  • Appendix
    • Appendix 1: Project Description
      • Project Title
      • Supervisors
      • Project Description
      • Learning Objectives
    • Appendix 2: Study Contract
    • Appendix 3: Artefact Description
      • Code Files Submitted
      • Program Testing
      • Experiment
        • Hardware
        • Softwares
        • Other
        • Datasets
    • Appendix 4: README

Authors

Xing (Nicole) Yu

Under the Supervision of Dr Ben Swift

License and References

MIT © Xing Yu

Stamper image from http://bestpriceforrubberstamps.com (License: Free for personal use only)

                                                    • Future Work
                                                      • Social Network Analysis in GitHub
                                                      • Trend Detection using Commitments Timestamp
                                                          • Conclusion
                                                          • Appendix
                                                            • Appendix 1 Project Description
                                                              • Project Title
                                                              • Supervisors
                                                              • Project Description
                                                              • Learning Objectives
                                                                • Appendix 2 Study Contract
                                                                • Appendix 3 Artefact Description
                                                                  • Code Files Submitted
                                                                  • Program Testing
                                                                  • Experiment
                                                                    • Hardware
                                                                      • Softwares
                                                                      • Other
                                                                      • Datasets
                                                                        • Appendix 4 README