Bachelor Degree Project

Comparison of Open Source License Scanning Tools

Author: Hailing Zhang
Supervisor: Morgan Ericsson, Lu Wang
Semester: VT 2020
Subject: Computer Science



  • Abstract We aim to determine the features of four popular FOSS scanning tools, FOSSology, FOSSA, FOSSID (SCAS), and Black Duck, thereby providing references for users to choose a proper tool for performing open-source license compliance in their projects. The sanity tests first verify the license detection function by using the above tools to scan the same project; we use the number of found licenses and the scanned size as metrics of accuracy. We then generate testing samples in different programming languages and sizes to further compare scanning efficiency. The experiment data demonstrate that each tool fits different user requirements, so this project can serve as a user guide. Keywords: Software licenses, FOSS scanning tool, accuracy, efficiency

  • Preface We would like to thank Morgan Ericsson for his guidance and advice during the writing of this thesis. We also want to thank Lu Wang for the research topic, and Björn Kihlblom, Mats Fröjdh, and Wei Cao for their feedback. We would not have been able to finish this degree project without the resources provided by Ericsson.

  • Contents

    1 Introduction
      1.1 Related work
      1.2 Problem formulation
      1.3 Motivation
      1.4 Objectives
      1.5 Scope
      1.6 Target group
      1.7 Outline

    2 Background
      2.1 Software licenses
        2.1.1 Free and Open Source Software
        2.1.2 Software license compliance
      2.2 Tools introduction
        2.2.1 FOSSology
        2.2.2 FOSSA
        2.2.3 FOSSID
        2.2.4 Black Duck

    3 Method
      3.1 Method selection
      3.2 Reliability and Validity

    4 Implementation
      4.1 Experiment design
        4.1.1 Sanity test design
        4.1.2 Advanced test design
      4.2 Experiment preparation
      4.3 Experiment execution
      4.4 Experiment results

    5 Results
      5.1 Sanity test results
      5.2 Advanced test results
        5.2.1 Results of advanced test A
        5.2.2 Results of advanced test B

    6 Analysis
      6.1 FOSSology
      6.2 FOSSA
      6.3 FOSSID
      6.4 Black Duck

    7 Discussion

    8 Conclusion
      8.1 Future work

    References

  • 1 Introduction Technical superiority induces companies to use free and open-source software (FOSS) in almost all products [1], since FOSS components usually get ample support from the open-source community. Quicker technology iteration at lower cost promotes the spread of emerging technologies and fosters innovation [26].

    On the other hand, license compatibility problems and copyright obligations also give rise to legal controversy [8]. Since reused code might carry contractual license terms and conditions that oblige the licensee to use the source code only under preconditions, unintentional ramifications could jeopardize corporate intellectual property and cause subsequent obstructions to development. In this context, commercial companies such as Black Duck and FOSSID came to market. They assist organizations in identifying licenses and discovering reused snippets. The availability of scanning tools mitigates the legal risk, especially when developers modify, redistribute, or create derivative works based on FOSS [20].

    1.1 Related work Researchers have made many efforts toward implementing new scanning tools and analyzing the underlying legal theories. Still, few published papers discuss the performance differences among scanning tools. Since the field is shaped by business competition, most scanning projects with published source code are under copyleft licenses, while the commercial tools are protected by nondisclosure agreements and copyright, making analysis of their algorithms impossible because the source code cannot be inspected. Research comparing scanning tools is therefore scarce and focuses on the open-source licensing projects.

    The diploma thesis "Software Licensing Analysis Tool" by Tomáš Radej [20] inspired the design of our controlled experiments. The author compared License Check and Licorice by performing detection on a random sample of packages taken from the Fedora operating system's repository. Kapitsaki, Tselikas, and Foukarakis contributed a visualization of license compatibility and an integrated framework to support license conflict detection in their article "An insight into license tools for open source software systems" [14]. It investigated software licensing, giving a critical and comparative overview of existing assistive approaches and tools, and demonstrated the role of different methods in license use decisions. This thesis thus attempts to choose tools with varying working principles for its experiments. The accuracy of the license risk reported by each tool is determined based on FOSS license categories. OSI and FSF documents list compatibility relationships among licenses [10] [19], which lay the theoretical foundation of this project, especially for designing testing samples. The importance of license compliance is emphasized on every tool's website and in its user guide [4] [5] [8], which motivated this thesis and provides references for the design of experiments in Chapter 2.

    1.2 Problem formulation

    This thesis aims to determine the capabilities and characteristics of FOSS scanning tools on the market. Since it is a challenge for an organization to know which scanning tool to use in its development work, we try to determine FOSS scanning tools' performance through controlled experiments. By analyzing the scanning results, we record each tool's computational efficiency and accuracy as a reference for choosing suitable FOSS scanning tools for future projects at Ericsson.

    1.3 Motivation In an era defined by software, the FOSS components included in emerging products are universal and increasing [12]. FOSS scanning tools have thus gained attention from commercial companies, each declaring that it has the most comprehensive knowledge base of open-source components, vulnerabilities, and license information [8]. This project aims to provide experiment data as a reference for open-source compliance in product development, so that enterprises or individuals can save the time and expense of testing the various commercial scanning tools themselves. A proper tool can ensure the company's intellectual property rights are not unintentionally exposed while contributing to FOSS and FOSS forums. The usage of scanning tools also assures legal fulfillment of the company's obligations under open-source licenses without limiting the company's ability to commercialize and retain product proprietorship. Besides, protecting copyright can help open-source software flourish by encouraging users to respect authors' requirements. After all, making better software is what open source is all about. This thesis attempts to help users of FOSS components legitimately develop and publish their products, thus improving the software industry by popularizing the concept of software license compliance.

    1.4 Objectives The objectives of this thesis are listed below.

  • O1

    Compare capabilities of FOSSology, FOSSA, FOSSID(SCAS), and Black Duck by using them to apply license detection on the same project.

    O2

    Compare the scanning time of FOSSology and Black Duck in projects with different sizes and programming languages.

    1.5 Scope The scope of the thesis project is limited: we will only test the scanning tools mentioned earlier. Because the tools are under non-free licenses, the analysis of scanning results will not involve their source code or the algorithms that caused the performance differences. For a similar reason, the description of test objects will include only the programming language, lines of code, and the instructions for the open-source components. We designed the experiments to observe the performance of the candidate tools under different programming languages rather than different code statements.

    We discussed the license definitions in Chapter 1.1. From the practical point of view, FOSS scanning aims to find the licenses and code that may jeopardize product security, rather than recognizing only the FOSS licenses approved by both OSI and FSF. This project therefore does not limit the scanning scope to licenses approved by both FSF and OSI, but covers popular licenses of each category approved by either OSI or FSF. Besides, vendors tend to emphasize that their tools can integrate into continuous integration and delivery pipelines, but this function is not discussed in this thesis, because it does not affect scanning performance and the testing samples are not integrated with any parent project. This project is in the computer science area, and the author does not have a legal background, so this project does not give legal advice. Although some tools have functions beyond FOSS license detection, such as vulnerability identification, risk evaluation, and dependency version confirmation, this project does not discuss these aspects. Such functions belong to another kind of scanning tool that finds security vulnerabilities such as cross-site scripting, SQL injection, and insecure server configuration.

    1.6 Target group Companies across all industries are racing to use, participate in, and contribute to open-source projects for the various advantages they offer, from leveraging external engineering resources to accelerating time to market and enabling faster innovation [25]. Open source is key to accelerating innovation, productivity, quality, and growth in any technology company. It represents a competitive advantage when used correctly, but rapid evolution and proliferation often cause enterprises to struggle with due diligence and identification of open-source components in a codebase. The experimental results may attract corporations that want to achieve maximum open-source adoption effortlessly and securely. Companies could use this project as a reference for choosing a proper scanning tool, mitigating potential risks and security vulnerabilities by satisfying the discovered license obligations and avoiding costly litigation and intellectual property losses [2]. On the other hand, since the competition among scanning tools is intensifying, the companies developing them could also consider this project a suggestion for business improvement: to meet customers' diverse requirements, the comparison among scanning results could advise the next development direction.

    Beyond the business value, this project also attempts to regulate the usage of FOSS components. The free-software movement and the open-source software movement are online social movements behind FOSS's widespread production and adoption [27]. Open source license compliance (OSLC) is the process of ensuring that an organization satisfies the licensing requirements of the open-source software it uses, whether for its internal use or as a product (or part of one) that it develops and redistributes [18]. By promoting license compliance among individuals and companies, this project supports the global non-profit organizations that champion software freedom in society through education, collaboration, and infrastructure. Simultaneously, the project fosters real innovation and creativity in software development by respecting developers' wishes for how their FOSS components are used. After all, the various communities participating in development are vital to FOSS systems' security advantage over proprietary systems.

    1.7 Outline The rest of this thesis is structured as follows. Chapter 1 gives an overview of this project; we briefly introduce the objects of study and our reasons for studying them. Chapter 2 states the relevant legal definitions of licenses and the necessary information about the chosen FOSS scanning tools. Chapter 3 explains the experiment design in detail: we arrange several controlled experiments according to the problem formulation, covering the instructions for the four scanning tools, the description of the projects used as testing samples, the group arrangement of the experiment objects, and the expected results of each execution. Chapter 4 records the implementation of the tests, including the testing environment and the process of using each tool to scan the projects. The generated scanning results are summarized and compared in Chapter 5 with tables and diagrams. As the main content of this project, the comparison among tools is discussed in multiple ways; we present the accuracy and performance of each tool with the experiment results. Suggestions derived from the experiment data are stated in Chapter 6, where we recommend suitable usage scenarios according to each tool's features. After that, Chapter 7 discusses whether the analysis and data from this project are sufficient evidence for recommending a scanning tool. Chapter 8 concludes this project and explores further improvements and possibilities to improve the user experience.

  • 2 Background This chapter introduces the theoretical background of FOSS licenses and scanning tools, which provides explanations and credibility for the experiments’ design.

    2.1 Software licenses A software license is a legal instrument governing the use or redistribution of software; it regulates what users can and cannot do with the software and any obligations upon them [11] [21]. Based on legal, ethical, or commercial concerns, the author can choose among open-source licenses, proprietary licenses, or even multi-licensing. Downloading open source amounts to entering a legal agreement on behalf of the company, so the company must handle licensing problems carefully to protect its intellectual property [24]. The formal licensing practice is to add a license file in the root directory of the product [21]. However, in practice, license information is often mentioned in statements at various other locations, which greatly increases the difficulty of license detection.

    2.1.1 Free and Open Source Software FOSS stands for Free and Open Source Software; it is also known as FLOSS (free, libre, open-source) or OSS (open-source software) [22]. FOSS is openly available in source code form and can be used and distributed free of charge. This project performs license detection with a focus on FOSS licenses, so the terms Free and Open are explained here according to the definitions from the license foundations. According to the Open Source Initiative (OSI), "open" means anyone can freely access, use, modify, and share the source code for any purpose [19]. The software's rapid evolution is possible because open source places fewer restrictions than free software on use or distribution by any organization or user. The Free Software Foundation (FSF) indicates that a program is "free" software if the program's users have the four essential freedoms [10]:

    ● Freedom 0: The freedom to run the program as the user wishes, for any purpose.

    ● Freedom 1: The freedom to study how the program works and change it, so it does the computing as the user wishes. Access to the source code is a precondition for this.

    ● Freedom 2: The freedom to redistribute copies so the user can help others.

    ● Freedom 3: The freedom to distribute copies of the programmer's modified versions to others. By doing this, the programmer can give the whole community a chance to benefit from the changes. Access to the source code is a precondition for this.

    The source code of free software must be available to ensure the four essential freedoms, provided the user has complied thus far with the conditions of the free license covering the software. Therefore, "free software" is a matter of liberty, not price [10]. A free program must be available for commercial use, commercial development, and commercial distribution, and regardless of whether the user paid for copies or obtained them at no charge, the user keeps the freedoms to copy, change, and even sell them.

    The term "open source" software is used by some people to mean the same concept as free software; in their official statements, both OSI and FSF note that the counterpart adopts a philosophy based on a similar definition [10] [18]. The difference is that "free" focuses on the liberties of licensees using the software [22], while "open source" emphasizes an unrestricted development methodology driven by the community. In the strict definition, a license is a FOSS license if and only if both FSF and OSI approve it; for example, the Reciprocal Public License is an open but non-free license, because it requires notification of the original developer and publication of any modified version that an organization uses, even privately. However, people commonly use the combined term FOSS in contrast to proprietary software, which is under restrictive copyright licensing and usually hides the source code from users.

    2.1.2 Software license compliance There are three types of standard FOSS licenses [6]: permissive, strong copyleft, and weak copyleft. If the usage permissions regulated by two licenses are contradictory, the licenses are called incompatible. We state the different usage permissions in Table 2.1.

    Usage    Strong copyleft   Weak copyleft   Permissive
    Link     N                 Y               Y
    Change   N                 N               Y

    Table 2.1: Software License Category and Usage Permission

    The value Y/N indicates whether the license allows derivative works to become proprietary software. For example, suppose commercial software is released under GPLv2 but includes a plugin under the Apache 2.0 license. In that case, it could cause license incompatibility, because the usage permissions regulated by these two licenses conflict. The cause of incompatibility is usually a conflict in the semantics of the license terms. According to clause 3 of the Apache v2.0 text: "If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this license for that Work shall terminate as of the date such litigation is filed." However, clause 7 of GPLv2 treats patents in a different way than the Apache v2.0 license: "If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this license, they do not excuse you from the conditions of this license. If you cannot distribute to satisfy simultaneously your obligations under this license and any other pertinent obligations, then as a consequence you may not distribute the Program at all."

    Thus Apache 2.0 software can be included in GPLv3 projects, but not vice versa. We show the compatibility relationships in Figure 2.1. A red dashed line means incompatibility between two licenses. An arrow pointing from license A to license B implies that A and B are compatible and that the primary license depends on B. For example, there is a one-way chain MIT - BSD - Apache - MPLv2.0 - LGPLv3 - GPLv3, which means that any two or more of these licenses are compatible, and the endpoint, GPLv3, decides the main license.

    Figure 2.1: Compatibility Relationship among FOSS licenses

    FOSS licenses are usually not compatible with commercial or proprietary licenses [15]. The permissive licenses are compatible with each other. However, a product's primary license still depends on the strong copyleft license, since the stricter licenses are downward compatible with the permissive ones. Copyleft implies a stronger set of restrictions in the license than the terms "free software" or "open source" imply [17]. Enterprises are usually unwilling to make their source code public after investing plenty of time and money, so the infectiousness of strong copyleft licenses, of which the GPL is the typical example, could cause severe insecurity. When a product uses FOSS components with different licenses, the product owner must consider two possible license compatibility problems: whether the primary license allows adding elements with different licenses, and whether a FOSS component's license terms conflict with the main license terms regarding the same rights. To verify each tool's detection abilities, the testing samples are generated based on Figure 2.1 and include designed incompatibility problems.
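    The one-way compatibility chain and the "main license" rule described above can be sketched as a small directed graph. This is an illustrative model only: the edge list covers just the licenses named in the text and Figure 2.1, not a complete or legally authoritative compatibility matrix, and the GPLv2/Apache-2.0 pair stands in for the incompatibility discussed earlier.

```python
# Edge A -> B means "A-licensed code may be combined into a work whose
# primary license is B" (the arrows of Figure 2.1). Illustrative only.
COMPATIBLE_WITH = {
    "MIT":        {"BSD", "Apache-2.0", "MPL-2.0", "LGPL-3.0", "GPL-3.0"},
    "BSD":        {"Apache-2.0", "MPL-2.0", "LGPL-3.0", "GPL-3.0"},
    "Apache-2.0": {"MPL-2.0", "LGPL-3.0", "GPL-3.0"},
    "MPL-2.0":    {"LGPL-3.0", "GPL-3.0"},
    "LGPL-3.0":   {"GPL-3.0"},
    "GPL-3.0":    set(),
    # GPLv2 and Apache-2.0 are mutually incompatible: no edge either way.
    "GPL-2.0":    set(),
}

def main_license(found_licenses):
    """Return the primary license: the one license that every other found
    license is compatible with, or None if the combination conflicts."""
    for candidate in found_licenses:
        others = [lic for lic in found_licenses if lic != candidate]
        if all(candidate in COMPATIBLE_WITH.get(lic, set()) for lic in others):
            return candidate
    return None

print(main_license({"MIT", "Apache-2.0", "GPL-3.0"}))  # -> GPL-3.0
print(main_license({"MIT", "BSD"}))                    # -> BSD
print(main_license({"Apache-2.0", "GPL-2.0"}))         # -> None (conflict)
```

    The endpoint of the chain "wins" because the stricter copyleft licenses are downward compatible with the permissive ones, exactly as the paragraph above states.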

    2.2 Tools introduction FOSS scanning is the process of searching source code for FOSS components that might cause insecurity problems for a product release. There are numerous FOSS scanning tools on the market. As litigation cases and disputes caused by FOSS license compliance problems emerge, enterprises' interest in scanning tools grows [3]. Users expect a proper tool to take responsibility for protecting their intellectual property: primarily, to ensure that products comply with third-party commercial software terms and FOSS license statements [17], while also promoting the development of open-source software.

    All the chosen FOSS scanning tools claim to perform license detection on mainstream programming languages, including the languages selected in this project: Python, Golang, and C++. A comparison of the supported programming languages is deferred until future work requires verifying new languages. Docker images are available for all of these tools except SCAS. All of them have a web API that can be accessed on Windows, macOS, and Linux. A summary of the functions of the chosen tools is given in Table 2.2:

    Tool         Conflict(1)  Auto(2)  Legal(3)  Cost(4)  CI/CD(5)  Base(6)
    FOSSology    N            N        N         Y        Y(7)      360
    FOSSA        N            N        Y         Y        Y         69(9)
    SCAS         N            Y        N         N        Y         Unknown(8)
    Black Duck   Y            Y        Y         N        Y         2645

    Table 2.2: General Functions of the Chosen FOSS Scanning Tools

    1. Conflict: the ability to detect license conflicts without default license management rules.

    2. Auto: the ability to automatically verify the matched snippets without manual review.

    3. Legal: the ability to give legal tips related to the found licenses.

    4. Cost: whether the tool provides a free trial.

    5. CI/CD: whether the tool can integrate with Jenkins.

    6. Base: the number of licenses included in the knowledge base according to official documents.

    7. This function is still under test [8].

    8. No statement was found about the number of licenses included in FOSSID, but its vendor claims "623 billion source code snippets."

    9. 69 is the default knowledge base; FOSSA's license information is dynamically loaded [4].

    Table 2.2 shows the functions of each tool according to the vendors' statements. To state our reasons for choosing them as test objects, we describe separately below the functional features that attracted us to each tool.

    2.2.1 FOSSology FOSSology (http://fossology.org) is an open-source license compliance software system and toolkit [13]. As a toolkit, it lets the user run license, copyright, and export control scans from the command line. As a system, it provides a database and a web UI supporting a compliance workflow. The user can generate an SPDX file or a README with the copyright notices from the scanned software. FOSSology uses multiple scanners on the text of the uploaded source code; two leading scanners, Nomos and Monk, are essential for license detection. Monk only reports the presence of known licenses, whereas Nomos can identify the "style" of a license if it has similarities with a known license type, enabling it to recognize new or unknown licenses. Nomos uses keywords to identify license-relevant statements and then identifies the appropriate licenses through a hierarchical structure of regular expressions. Monk performs text-based searches, using the Jaccard index as a text similarity metric with a size-based weighting to rank different matches. Since FOSSology provides its source code with interpretative comments, we chose it as a tutorial; by using this tool, we expect to understand the basic principles of FOSS scanning.
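    As an illustration of Monk-style matching, the sketch below scores a snippet against reference license texts with the Jaccard index over word sets. The tokenization, the tiny truncated reference texts, and the threshold are simplifying assumptions for illustration, not Monk's actual implementation.

```python
import re

def tokens(text):
    """Lowercase word tokens as a set (a simplifying assumption)."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Jaccard index: |intersection| / |union| of the two token sets."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Truncated stand-ins for full reference license texts.
KNOWN_LICENSES = {
    "MIT": ("Permission is hereby granted, free of charge, to any person "
            "obtaining a copy of this software..."),
    "BSD-2-Clause": ("Redistribution and use in source and binary forms, "
                     "with or without modification, are permitted..."),
}

def best_match(snippet, threshold=0.3):
    """Return the best-scoring known license, or None below the threshold."""
    scored = {name: jaccard(snippet, ref) for name, ref in KNOWN_LICENSES.items()}
    name, score = max(scored.items(), key=lambda kv: kv[1])
    return (name, round(score, 2)) if score >= threshold else (None, round(score, 2))

print(best_match("Permission is hereby granted free of charge to any person"))
# -> ('MIT', 0.67)
```

    A real scanner would additionally weight matches by size, as Monk does, so that a long near-verbatim match outranks a short coincidental one.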

    2.2.2 FOSSA FOSSA (https://fossa.io/), deployed in Ericsson's Kubernetes Engine Service, exposes a REST API for each analyzed project and allows users to access the API with a unique API token. Using the FOSSA platform requires allowing inbound connections from the internal user account and from Jenkins. Dynamic analysis lets FOSSA know which dependencies are pulled into builds, and static analysis supplements the results with metadata on how the dependencies are included. Thus, instead of trying to guess the build system's behavior, FOSSA runs locally, using the build tools to determine the list of exact dependencies used by the binary file. By default, FOSSA enables daily or hourly scans on the default branch and notifies the user with email reports if it finds any issue. JFrog Xray takes the primary responsibility for performing the binary scan of every deliverable built at Ericsson. Because most of FOSSA's language integrations support authentication through private JFrog Artifactory registries [4], this tool was chosen to explore the possibility of internal cooperation between Xray and FOSSA.
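    Accessing such a token-protected REST API might look like the sketch below. The base URL, endpoint path, and token value are hypothetical placeholders (consult the vendor's API documentation for the real routes); only the bearer-token header pattern is the point.

```python
import urllib.request

API_BASE = "https://fossa.example.internal/api"  # placeholder host, not real
API_TOKEN = "your-unique-api-token"              # token issued per user account

def build_project_request(project_id):
    """Build an authenticated GET request for one analyzed project.
    The /projects/{id} path is a hypothetical example route."""
    req = urllib.request.Request(f"{API_BASE}/projects/{project_id}")
    req.add_header("Authorization", f"Bearer {API_TOKEN}")
    req.add_header("Accept", "application/json")
    return req

req = build_project_request("my-service")
print(req.full_url)                    # https://fossa.example.internal/api/projects/my-service
print(req.get_header("Authorization")) # Bearer your-unique-api-token
```

    Sending the request (e.g. with `urllib.request.urlopen(req)`) would then return the project's analysis data as JSON, assuming the inbound connection is allowed as described above.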

    2.2.3 FOSSID Software Composition Analysis Services (SCAS) is a platform built on FOSSID's technical kernel. Through Ericsson's internal portal it performs audits of source code to detect additional 3PP dependencies and potential uses of FOSS components; the service is provided through a web interface. Where files are not identifiable or only partially matched, manual review is used to determine whether they are open source. Whitelisting is the action of creating a decision rule that places a given FOSS alert into a target category. FOSSID is an essential sponsor of Software Heritage, an organization that aims to collect publicly available source code from various software projects, so SCAS's knowledge base might share the knowledge base of this initiative. SCAS takes the primary responsibility for scanning source code at Ericsson, so this project attempts to find room for improvement by comparing it with other popular tools.

    2.2.4 Black Duck Black Duck Software (https://www.blackducksoftware.com/) is a software composition analysis tool that provides license compliance and associated security risk management by scanning open-source software. It has three main complementary products. The commercial Hub service scans code to identify all embedded open-source components and automatically searches for known vulnerabilities to remediate; it can send alerts when it finds new vulnerabilities in code. Protex is a commercial, fee-based license compliance management tool from Black Duck that integrates with existing tools to scan, identify, and inventory open-source software automatically, while also enforcing license compliance and corporate policy requirements. Black Duck CoPilot can connect to GitHub repositories and provide the user with security risk information for the dependencies in the user's repository, such as associated vulnerabilities and recommendations for adjacent vulnerability-free versions when the components currently in use have security issues. The Synopsys company also provides customized products that integrate with pipelines or specific scanning tools; for example, Black Duck Detect is a FOSS scanning solution hosted by Black Duck, used by Azuki Systems (acquired by Ericsson) and integrated with BUSS. However, this combined product is now deprecated, and Azuki will move to Black Duck Protex. Because Black Duck is a classical FOSS scanning tool with full functionality and fifteen years of experience, it is considered the standard of comparison in this project.

  • 3 Method

    3.1 Method selection To determine the capabilities and characteristics of the chosen FOSS scanning tools, this project uses controlled experiments to compare the scanned size, the number of found licenses, and the scanning time. The experiment data should reflect the knowledge base and scanning efficiency of each tool, so that users can choose a suitable tool according to its performance on different project sizes and programming languages.

    By using different tools to scan the same project in the same testing environment, a controlled experiment can show which tool performs best with respect to a specific independent variable. Regarding alternative methods: the personal opinions collected by a survey, whether through questionnaires or interviews, are too subjective to determine the performance of a FOSS scanning tool; for example, programming skill can also affect the user experience, and we could not find enough participants with experience of multiple chosen tools and projects to draw a general conclusion. Besides, the project will not create any new artifact, so design science is not applicable. Although a case study is suitable for the detailed examination of one subject, this project analyzes four tools. The controlled experiment was therefore chosen to minimize the effects of variables other than each tool's capability.

    The first set of controlled experiments, the sanity test, shows each tool's capability to find licenses in the same project. We expect to determine which tool can find the most licenses with the highest accuracy. The efficiency of the knowledge base and the algorithms is reflected by the scanned size and the number of found licenses, because the knowledge base decides which license types are identifiable, while the algorithm regulates which files are scanned.

    The second set of controlled experiments, the advanced test, measures scanning efficiency. To meet Ericsson's requirements, each tool scans six repositories. Due to time limitations, we chose only FOSSology and Black Duck for the advanced test. The main reason for this decision is that they presented the most different features in the sanity test; besides, comparing them also shows the differences between binary scanning and text similarity. These two tools directly report a summary of the licenses with the corresponding third-party components, as well as the number and size of the scanned files, which reduces mistakes caused by human factors. By executing the advanced test, we expect to learn how changes in programming language and project size affect the scanning time.
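    A minimal harness for the dependent variable, scanning time, could look as follows. The workload below is a placeholder standing in for the actual scan invocation of the tool under test; repeating runs and averaging reduces noise from the executing environment, which is the controlled variable.

```python
import statistics
import time

def timed(fn, runs=3):
    """Run fn several times and return the mean wall-clock seconds,
    measured with a monotonic high-resolution clock."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)

# Placeholder workload standing in for a real scan invocation, e.g.
# subprocess.run(["scanner-cli", "scan", "sample-project/"]).
mean_s = timed(lambda: sum(i * i for i in range(100_000)))
print(f"mean scan time: {mean_s:.4f} s")
```

    For web-based tools whose scans run server-side, the measured span would instead cover the upload-to-report interval reported by the tool itself.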

    Table 3.3 shows the variables of two sets of controlled experiments. In the controlled experiments, we want to observe the dependent variables’ changes caused by the independent variable. We discuss the experiment design and execution in Chapter 4, and we answer the above questions in Chapter 5 as the conclusion of controlled experiments.

    Test             Independent variable              Dependent variable             Controlled variables
    Sanity test      The tool currently used           1. Scanned size                1. Executing environment
                     (FOSSology, FOSSA, FOSSID         2. Number of found licenses    2. Testing sample (scan the
                     (SCAS), and Black Duck)                                             same project)
    Advanced test A  Project size                      Scanning time                  1. Executing environment
    Advanced test B  Programming language              Scanning time                  2. Testing objectives (use
                                                                                         the same tool)

    Table 3.3: Variables of the controlled experiments

3.2 Reliability and validity

The previous chapters explained the terms and definitions related to FOSS scanning. To demonstrate construct validity, Chapter 7 compares the experiment results and analysis from this project with outcomes from peer-reviewed research papers. The version iteration of FOSS scanning tools affects validity, because performance changes with algorithm updates and extensions of the knowledge base. To extend the validity, this project gives updating suggestions in Chapter 8.

The tools generate the scanning results automatically, without any human intervention. If others replicate the experiment steps with strict control of the testing samples and objectives, it is possible to get the same results as in Chapter 4.4. The related works have already scientifically evaluated the metrics for tool comparison, the categories of FOSS scanning approaches, and the licenses used in this experimental design.

4 Implementation

4.1 Experiment design

This section explains the experiment design for determining the capabilities and characteristics of the chosen FOSS scanning tools. The sanity test aims to verify the scanned size and found licenses for FOSSology, FOSSA, FOSSID (SCAS), and Black Duck. The advanced test compares the scanning efficiency of FOSSology and Black Duck. The following subsections discuss the testing samples and metrics used in each experiment.

4.1.1 Sanity test design

The primary expectation of users is that a tool can find license information in given test samples. To confirm this function, this project observes the four popular tools' scanning results after using them to scan the same project.

Ericsson provides a repository as the testing sample in the sanity test. In this project, we name it FWK because it is a framework. Although license compliance management may already have been completed for this repository, the tools still have work to do, because the internal whitelist is not imported into their user-defined libraries. PHP makes up the largest proportion of this project's programming languages at 49.9%, followed by Roff at 21.5% and C++ at 20.2%; the project also includes 3.1% Shell, 1.8% Java, and 1.6% Go. Other features of the chosen project do not affect the FOSS scanning tools' function, so no further details are given.

We use the number of recognized licenses as a metric since it is definite proof of performance. At a minimum, the knowledge base of a qualified tool is supposed to cover each category's most widely employed licenses, as approved by the OSI and/or the FSF. The primary expectation of these tools is to find license information for included FOSS components. If a tool can also give suggestions for risk avoidance or relevant information about found licenses, this is considered a bonus. We list the metrics for measuring a tool's quality in Table 4.4.

Metric | Definition                               | Value
Number | The number of licenses found by the tool | Number
Size   | The total scanned size                   | Number
File   | The total number of scanned files        | Number

Table 4.4: Metrics in the Sanity Test

4.1.2 Advanced test design

For a detailed comparison, this project chooses qualified tools from the sanity test to conduct two controlled experiments. The first compares the tools' performance when scanning testing samples of different sizes, and the second with different programming languages. In each experiment, the independent variable is the project size or programming language, and the dependent variable is the tool's performance.

The advanced test has two sets of testing samples. The first set has three different sizes, and the second has different programming languages: Go, Python, and C++. These samples include multiple dependencies or libraries with different FOSS licenses. Ericsson provides some of them, and we cloned the others from GitHub on 3rd May. Table 4.5 gives information on each component. We used the components in Table 4.5 to generate the first set of testing samples, which differ in size, as Table 4.6 shows.

Number | License      | Link/Designation                      | Language | Size/MB
1      | MIT          | https://github.com/keon/algorithms    | Python   | 1.19
2      | Unknown      | RCSMW SVNFM                           | Python   | 34.57
3      | PSF          | https://github.com/python/cpython     | Python   | 324.55
4      | LGPL-3.0     | https://github.com/nicolargo/glances  | Python   | 28
5      | Unknown      | DIA TRANSPORT PROTOCOL                | Go       | 3.7
6      | Apache-2.0   | https://github.com/derailed/k9s       | Go       | 59.81
7      | Unknown      | RCSEE RCS RHAI                        | C++      | 15.3
8      | Dual license | https://github.com/mbasso/asm-dom.git | C++      | 47.08

Table 4.5: Testing Sample Components in Advanced Test

Name | Component | Size/MB | Language
A1   | 1         | 1.19    | Python
A2   | 2         | 34.57   | Python
A3   | 3         | 324.55  | Python

Table 4.6: First Set of Testing Samples in Advanced Test

There are four kinds of scans in Ericsson's workflow. Firstly, the baseline scan of source code is usually performed once per month before a project release, depending on the repository capacity and license agreements. Secondly, the commit delta scan of the source code is performed when new code is committed, for continuous detection of not-yet-evaluated FOSS components; developers can also scan their code before pushing it to review. Delta scans are also performed on source code changes between new builds of the deliverable. The last kind is the binary scan, performed at every deliverable build. The design of the testing samples in Tables 4.6 and 4.7 aims to simulate the possible scanning sizes in the above scenarios. For example, a commit could be around 1 MB, and a repository on the master branch may have hundreds of MB. Thus, we plan to test the three Python projects with increasing sizes in the first experiment. For the second set of tests, we generate testing samples with similar sizes and different programming languages, as in Table 4.7. Because there should only be one variable in a controlled experiment, we cloned four more projects from GitHub to adjust the testing samples to similar sizes. Ericsson's repositories mainly use these three programming languages, so the advanced test chose them to evaluate the tools' performance.

Name | Component | Size/MB | Language
B1   | 5+6       | 63.51   | Go
B2   | 7+8       | 62.38   | C++
B3   | 2+4       | 62.57   | Python

Table 4.7: Second Set of Testing Samples in Advanced Test

The advanced test evaluates the effect of project size on a tool's efficiency. The dependent variable in each experiment is time, which refers to the execution time the tool requires to scan the test sample. Most vendors now work on a continuous integration model that makes their software development life cycle faster and far more dynamic. With this agile model, the developing team needs to move quickly, removing legal risks and vulnerabilities as they build. The scanning tool's efficiency directly affects product delivery, so the total execution time is considered a vital metric.

    4.2 Experiment preparation

We defined the experiment objectives in Chapter 3.1. We needed to confirm the accessibility of all chosen tools before executing the experiments. FOSSology is free to use; Ericsson provides FOSSA, Black Duck, and SCAS. We state the testing environment and software versions used during the implementation in Table 4.8. The computer configuration in Table 4.8 represents the standard equipment for Ericsson's employees, which gives the scanning results universal reference value. The default operating system at Ericsson is Windows. However, because SCAS has neither a Docker image nor a Linux version, and FOSSology only supports Linux distributions, we only unify the controlled experiments' hardware configuration, not the operating system.

Name             | Version
Processor        | Intel Core i7-8650U 1.90GHz
Memory           | 32 GB
Operating system | Windows 10 Enterprise 64bit
FOSSology        | 3.7
FOSSA            | 1.0
FOSSID (SCAS)    | 1.0
Black Duck       | 6.2.1

Table 4.8: Testing environment information

4.3 Experiment execution

In this section, we briefly explain the process of using the four tools.

4.3.1 FOSSology

FOSSology can be installed from source on GitHub, using Docker, or with Vagrant and VirtualBox. It also provides a test instance for quickly trying the functionality; however, the instance is reset every night, and the standard login data is public. For security reasons, we use the Docker image instead of this public test instance.

The user installs FOSSology by running the command docker run -p 8081:80 fossology/fossology after preparing the Docker environment. After a successful build, the user can access the service at http://localhost:8081/repo/. The user can then upload a single file or a directory from a remote web or FTP server to FOSSology. The file or directory to upload must be accessible via a URL and must not require human interaction, such as login credentials. When uploading a single file from the local computer, the FOSSology server imposes a maximum upload file size of 700 MB. In this project, we uploaded the testing samples to FOSSology as compressed folders. The user can use the Jobs page to view and manage the progress, status, and estimated completion time of the uploaded tasks. The Browse page shows the scanning results, which we discuss in the next chapter.
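Preparing such an upload can be scripted. The sketch below is a minimal, hypothetical example (the directory names and helper are placeholders; only the 700 MB limit comes from FOSSology itself): it compresses a testing sample into a zip archive and verifies the result fits under the server's upload limit before any manual upload.

```python
import os
import shutil
import tempfile

MAX_UPLOAD_BYTES = 700 * 1024 * 1024  # FOSSology's default single-file upload limit (700 MB)

def prepare_upload(sample_dir: str, out_dir: str) -> str:
    """Zip a testing sample and verify it fits under the upload limit."""
    archive = shutil.make_archive(os.path.join(out_dir, "sample"), "zip", sample_dir)
    size = os.path.getsize(archive)
    if size > MAX_UPLOAD_BYTES:
        raise ValueError(f"archive is {size} bytes, over the 700 MB limit")
    return archive

# Example: pack a tiny throwaway directory.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    with open(os.path.join(src, "LICENSE"), "w") as f:
        f.write("MIT License\n")
    path = prepare_upload(src, dst)
    print(os.path.basename(path))  # sample.zip
```

A sample that fails the size check would have to be uploaded via a URL instead, as described above.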

4.3.2 FOSSA

FOSSA provides two methods for the user to import a project. After registering and logging in on the homepage, the user can connect to cloud-based version control providers, such as GitHub.com, Bitbucket.com, or GitLab.com. If the user wants to upload testing samples from the local environment, payment for a business trial is required. We chose to install the dependency analysis plugin to analyze the code in the local environment; this method is recommended by the FOSSA developing team for accuracy and security.

The fossa configuration file with multiple modules can be created manually with the primary command fossa init. The command fossa analyze -o is used to verify that the pre-configuration for license detection succeeds. If it runs without error, the user can navigate to the target code directory and perform personalized configurations or analyses with the given commands. The user can view the uploaded project and scan results on the web interface, but this is optional. The user can also generate offline documentation for license notices and third-party attributions. FOSSA adds a unique API token as a prefix to the command for every operation to verify the user's authorization to access the project.
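The command sequence above can be wrapped in a small script. This is a sketch under stated assumptions: the commands fossa init and fossa analyze -o come from the workflow just described, the FOSSA_API_KEY environment variable is assumed to be how the CLI receives the token, and the token value is a placeholder.

```python
import os
import shutil
import subprocess

def fossa_commands(api_token: str, output_only: bool = True):
    """Build the environment and command lines for a local FOSSA run.
    The token value is a placeholder, never a real credential."""
    env = dict(os.environ, FOSSA_API_KEY=api_token)
    analyze = ["fossa", "analyze"]
    if output_only:
        analyze.append("-o")  # print results locally instead of uploading
    return env, [["fossa", "init"], analyze]

env, cmds = fossa_commands("hypothetical-token")
for cmd in cmds:
    if shutil.which("fossa"):  # run only when the CLI is actually installed
        subprocess.run(cmd, env=env, check=True)
    else:
        print(" ".join(cmd))
```

Dropping the -o flag would upload the analysis instead of printing it, matching the optional web-interface review described above.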

4.3.3 SCAS

SCAS requires Ericsson's verified account to log in to the web interface. The user uploads a zip folder to begin a new scanning request. Since testing samples should not be merged into any master repository, the user views the result on the Individual Results page by selecting the request ID. The user can also check the progress, folder size, and creation date of all requests on the Requests page before a request completes. In the practical workflow, the Aggregate Results page shows the scanning results. The responsible department needs to manually review the exported scanning results and mark components with suspicious snippets as one of the following categories: unauthorized FOSS component, pending approval, or whitelist. The whitelisting function is vital in SCAS; it decides how to deal with matched files.

4.3.4 Black Duck

Black Duck provides a temporary POC server, and Ericsson employees can access the web API with an internal account; thus, the following resources might be unavailable without an internal computer. This is the most straightforward and quick way to test its function. Synopsys Detect manages all scanning on this platform. After login, the user can download and install Synopsys Detect as a desktop application. The first configuration step is connecting to the server by providing the generated API token and the Black Duck server URL. The user can then choose the scanning type: source directory scan, binary or executable file, or Docker image name or .tar distribution. Other settings, such as creating the project description or choosing which workspace pulls the external dependencies, are optional, so we will not extend the specifications here. Upon completing the scanning task, the desktop application provides a link for the user to view the results. All uploaded testing samples using the same API token are listed under one project category on the Dashboard page. The user can click each task within the owned project to view details.

4.4 Experiment results

In the sanity test, we used the four FOSS scanning tools to scan the testing sample FWK. The initial testing results also inform the advanced experiment: we choose the two tools with the widest gap between their performances for further comparison.

4.4.1 FOSSology

When the scanning request completes, FOSSology shows a summary of the found licenses on the Browse page.

    Figure 4.2: Sanity testing result from FOSSology

    Figure 4.3: Scanner Decision Conflict in FOSSology

As Figure 4.2 shows, it found 19 licenses in FWK. To get the final scanning report, the user needs to review every file manually, because the system cannot automatically decide the license when the scanners draw different conclusions for the same data, as in Figure 4.3. In the first row of Figure 4.3, neither scanner could provide enough evidence to automatically confirm the scanning result, because the external code released under an MIT license only matches part of this file. The second row illustrates another kind of decision conflict that needs manual review: Nomos suspects the file includes the SIL Open Font License or the Creative Commons Attribution 3.0 license, while the other scanner, Monk, concludes it uses the MIT license. In this situation, the user is supposed to access the manual review page, as in Figure 4.4. Taking MIT as an example, the user should view the matched file by clicking the license name. The text on the left provides the basis for the scanner's judgment, while the user can correct or confirm the decision for this matched file on the right side. After reviewing all marked files, it is possible to document the licensing situation in SPDX files as a scanning report.

Figure 4.4: Mandatory Manual Review in FOSSology

As Figure 4.2 shows, 1512 files were found with the MIT license in total, of which 882 are unique by file hash. That means the user needs to check for false negatives in the testing results of 882 files. For a big project with thousands of matched files, the manual review becomes a time- and effort-consuming task. The license searching function may therefore be provided to mitigate this workload, as Figure 4.5 shows. By entering or choosing a license name, FOSSology shows all matched files.
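The distinction between total and unique matches comes down to grouping files by a content hash. The sketch below illustrates the idea with SHA-1 (the exact hash FOSSology uses internally is not specified here, and the file contents are made up):

```python
import hashlib

def unique_by_hash(files: dict) -> dict:
    """Group file paths by the SHA-1 digest of their contents, collapsing
    byte-identical duplicates into one unique entry per digest."""
    groups: dict = {}
    for path, content in files.items():
        digest = hashlib.sha1(content).hexdigest()
        groups.setdefault(digest, []).append(path)
    return groups

# Three matched files, two of which are byte-identical copies.
files = {
    "vendor/a/LICENSE": b"MIT License text",
    "vendor/b/LICENSE": b"MIT License text",
    "src/NOTICE":       b"different text",
}
groups = unique_by_hash(files)
print(len(files), len(groups))  # 3 2
```

Reviewing one file per group is what reduces the 1512 matches to 882 files needing manual inspection.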

Figure 4.5: License Searching Function in FOSSology

Since FOSS licenses have different priorities during open-source compliance, the user can first look for strong copyleft or copyright licenses in the matched list. Besides, suppose the enterprise plans to negotiate with a specific software owner to change some terms, or to use multiple licenses for better compatibility. In that case, this function helps quickly collect all components released under one kind of license.

4.4.2 FOSSA

FOSSA automatically analyzes the connected branches when updates are triggered and sends an email to inform the user about newly found issues in the updated repository. The user can then access the homepage with the registered account and see the scanning summary, as Figure 4.6 shows.

Figure 4.6: Scanning Result in FOSSA

For strong copyleft licenses, such as GPL v3.0 only, FOSSA underlines the potential risk in red. This kind of license may require the licensee to disclose the source code under a compatible license unless the components are distributed and run as entirely separate processes and packages, so FOSSA reminds the user to pay extra attention to them. The user can click a license name to manually review the matched files for a specific license. The scanning result lists the found licenses by source, including the project's own code and dependencies at different depths; payment is required to increase the scan depth beyond five layers. After reviewing the scanning result, the user can choose to upload or generate local reports. Multiple formats are available for exporting the report, such as HTML (Figure 4.7), PDF, or even customized reports with editable text.

    Figure 4.7: Scanning Report from FOSSA

The user can click a license name to get extra information, and the page layout is clear and informative. Furthermore, customized configuration is available for nearly all functions, such as which channel to use for notifications, scheduling builds, and controlling how FOSSA parses, builds, and displays the source code. For example, the user can set Project Scopes to limit the displayed dependencies. The configuration file can also be chosen for Gradle-based projects via the Gradle build file. These options increase the applicable scenarios of FOSSA.

4.4.3 SCAS

After uploading the testing sample, the user can view the scanning progress on the Requests page of SCAS's web interface, then check the result by clicking the button beside the completed status. As Figure 4.8 shows, SCAS gives the number of files that include matched components: 2310 in FWK.

For an aggregated project, it is possible to export the scanning result as an Excel file, including scan date, request ID, match rate, file name, and path. However, if the user wants to know the involved license types, the only way is to click every matched file and record them manually. We compiled the statistics on the number of licenses recognized by SCAS after reviewing the 2310 files in the A2 area. In this sanity test, SCAS found 18 kinds of licenses in 2310 matched files. SCAS provides very comprehensive information and a detailed category for each found license. For example, it distinguishes different versions of the GPL, AGPL, and LGPL licenses: GPL v2.0, GPL 2.0+, and GPL 2.0 or later are marked by SCAS as different versions of the GPL license. As in the Excel scanning report, SCAS only provides basic information for the unmatched files in the A1 area. The comparison between matched components and external evidence, as in the A3 area, is shown after the user clicks a matched file in the A1 area.

    Figure 4.8: Scanning Result in SCAS

4.4.4 Black Duck

The Black Duck desktop application provides a link to the web interface for each completed task in the local scan history, along with the time taken to perform the scan and the total scan size. The user can browse the scanning results after logging in with the account mentioned in Chapter 2.2. Black Duck directly gives details of the used third-party components instead of listing files by included licenses as FOSSology does. It also provides the risk level of found third-party components and licenses, so the user can manage the matched files with proper priority. For example, Black Duck provides a link to the open-source hub, the released license, and a brief description of Bootstrap, noting that it is a toolkit from Twitter designed to kick-start the development of web apps and sites. Besides, the user can look into the details of each component by clicking its name in the list. Suppose a component is released under a license that is incompatible with the scanned project's primary license. In that case, the developing team is supposed to replace the marked files for legal security reasons. As Figure 4.9 shows, Black Duck finds 12 matched components in FWK.

    Figure 4.9: Scanning Result in Black Duck Detect

Users click each component to view more details; Black Duck uses the component to explain the reasons for marking files. Figure 4.10 gives information about Ubuntu Touch, a mobile version of the Ubuntu operating system. The user can see the link to the developing team, the several included licenses, and more. Suppose the matched files need to be removed from the project. In that case, the developing team can refer to the relevant information provided by Black Duck to find alternatives with the same function.

    Figure 4.10: Matched Component Information provided by Black Duck Detect

5 Results

In this section, we compare and analyze the experiment results. Based on the screenshots in Chapter 4.4, Chapter 5.1 describes the capabilities of all chosen tools to find license information in the given test samples. Chapter 5.2 discusses the scanning efficiency of Black Duck and FOSSology.

5.1 Sanity test results

We summarize the sanity test results in Table 5.9, using the metrics defined in Table 4.4, Chapter 4.1.1.

Tools      | Number | Size/MB | File
FOSSology  | 19     | 49.53   | 5118
FOSSA      | 15     | None¹   | None¹
SCAS       | 18     | 156.53  | 5075
Black Duck | 12     | 147.3   | 5089

¹ None: FOSSA did not provide these data

Table 5.9: Sanity Test Result

Take FOSSology as an example: it found 19 licenses after scanning 5118 files of FWK, with a total size of 49.53 MB. Generally, the number of found licenses reflects the knowledge base's coverage, and the scanned size reflects the scanning depth. Chart 5.1 gives a more intuitive comparison of the number of found licenses and the scanned size for all chosen tools.

    Chart 5.1: The number of found licenses and scanned sizes of sanity test

We did not include FOSSA's scanned size in Chart 5.1, since it does not directly provide these data. As Chart 5.1 shows, SCAS scanned the largest amount of data. FOSSology scanned the fewest files but found the most licenses, since the tool is based on text comparison; the data for Black Duck lie in the middle. The build process and specific modules are mandatory for FOSSA, FOSSology uses text-based scanning, SCAS mainly compares snippets in the source code with external files, and Black Duck provides both source code and binary scanning services. Because no existing tool can guarantee a zero false positive rate, the number of found licenses does not by itself represent accuracy. The algorithms and knowledge bases may be the main reasons why different licenses are found in the same project.

5.2 Advanced test results

Because of time limitations, we chose FOSSology and Black Duck for the advanced test. The advanced test does not include SCAS because collecting the licenses it finds requires manual statistics. Besides, where using a commercial tool's complete functionality requires payment or additional preconditions, such as FOSSA, which defines different scanning depths according to price, we do not include that kind of product in further comparison, since a partial-function analysis does not conform to this project's aim of providing references for business projects.

The advanced test uses FOSSology and Black Duck to scan the testing samples in Tables 4.6 and 4.7 from Chapter 4.1.2. We will not repeat how to apply license detection and observe scanning results in FOSSology and Black Duck, because we already discussed this in Chapter 4.3. This test aims to verify whether project size and programming language affect the FOSS scanning tools.

5.2.1 Results of advanced test A

Black Duck records time in seconds, down to milliseconds. Hence, the recorded time from FOSSology is also converted to the same unit by dividing the number of scanned items by the average number of items scanned per second, rounded to three decimal places. Table 5.10 records the scanning time required by both FOSS scanning tools.

Tools      | A1     | A2       | A3
FOSSology  | 40.999 | 1231.002 | 2024.996
Black Duck | 47.109 | 62.463   | 90.345

Table 5.10: Scanning Time Table of Advanced Test A
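The unit conversion applied to FOSSology's figures can be written as a one-line helper; the item count and rate in the example are illustrative, not the measured values behind Table 5.10.

```python
def fossology_seconds(scanned_items: int, items_per_second: float) -> float:
    """Convert FOSSology's progress counters (items and average rate)
    into seconds, rounded to three decimal places as in the tables."""
    return round(scanned_items / items_per_second, 3)

# Illustrative numbers only: 5000 items at an average of 122 items/s.
print(fossology_seconds(5000, 122))  # 40.984
```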

Chart 5.2: Scanning Time Chart of Advanced Test A

The scanning time required by both tools monotonically increases with the size of the testing sample. We generated Chart 5.2 to reflect the differences, with the project sizes of the testing samples on the X-axis and the scanning time on the Y-axis. From A1 to A3, the project size increases from around 1 MB to over 300 MB. Because the range of consumed time is so large that it makes testing sample A1 nearly invisible, we proportionally truncated the histogram to show the differences. As Chart 5.2 shows, the curve of Black Duck is relatively flat, undulating only slightly with the changes in project size. On the other side, the time consumed by FOSSology increased rapidly from A1 to A2, and from A2 to A3 it roughly doubled while the size of the testing sample increased around ten times.

5.2.2 Results of advanced test B

The second advanced test compares the scanning time FOSSology and Black Duck require to scan projects in different programming languages.

FOSSology takes the most time to analyze software heritage: the estimated completion time given by the system is at least ten hours, so we do not take the time consumed by this function into account in the comparison. The programming language of B1 is Go, of B2 is C++, and of B3 is Python. Table 5.11 records the scanning time required for each project.

Tools      | B1      | B2      | B3
FOSSology  | 243.994 | 104.999 | 1249.999
Black Duck | 42.594  | 53.257  | 77.754

Table 5.11: Scanning Time Table of Advanced Test B

The scanning time required might be affected by the tools' implementation languages in this test; FOSSology is known to be implemented in PHP and C. The most time-consuming scanning option was Nomos, a scanner that uses short phrases (regular expressions) and heuristics to identify licenses; ignoring this function would be unreasonable, since the FOSSology developing team recommends including all scanners for accuracy. The analysis of software heritage is another time-consuming option in FOSSology, but we did not consider it in this project's efficiency comparison, because FOSSology always gives the scanning results before finishing the heritage status. This function of FOSSology uses SHA256 to calculate hash values to determine whether a component belongs to the code preserved by Software Heritage. Chart 5.3 shows the changes in scanning time with different programming languages.
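Nomos's short-phrase approach can be illustrated with a toy scanner. The regular expressions below are simplified stand-ins for Nomos's real rule set, kept only to show the mechanism:

```python
import re

# Simplified stand-ins for Nomos-style short-phrase heuristics.
LICENSE_PATTERNS = {
    "MIT": re.compile(r"Permission is hereby granted, free of charge", re.I),
    "Apache-2.0": re.compile(r"Apache License,?\s+Version 2\.0", re.I),
    "GPL-3.0": re.compile(r"GNU General Public License.*version 3", re.I | re.S),
}

def detect_licenses(text: str) -> list:
    """Return the names of all license patterns matching the text."""
    return [name for name, pat in LICENSE_PATTERNS.items() if pat.search(text)]

sample = "Licensed under the Apache License, Version 2.0 (the 'License')"
print(detect_licenses(sample))  # ['Apache-2.0']
```

Matching on such short phrases is fast but inherently heuristic, which is one reason FOSSology recommends combining several scanners.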

    Chart 5.3: Scanning Time Chart of Advanced Test B

Generally, Black Duck is quicker than FOSSology under the known conditions, as Chart 5.3 shows. Both tools required the shortest time to scan the C++ project and the longest for the Python project; the gap is smallest for the C++ project and broadest for the Python project. Programming languages influence the scanning speed of Black Duck less than that of FOSSology. FOSSology's reaction to Python projects seems slowest, which might also be part of the reason for the expanding difference in the first set of experiments. However, FOSSology's scanners spent similar time on projects in different languages: Monk spent around 27.07 s on Go, 21.6 s on C++, and 22.5 s on Python; Nomos spent around 7.9 s on Go, 6.4 s on C++, and 7.3 s on Python. Since these two scanners carry the primary responsibility for license detection, the network environment, the build, or the deployment process might cause the differences in performance.

6 Analysis

In this section, we give recommendations for choosing FOSS scanning tools according to their features, from the perspective of user experience, security concerns, and integration possibilities.

6.1 FOSSology

FOSSology is suitable for subsequent development or for small projects that need cost control. It is a concise scanning tool: the whole scanning process is transparent, from the encryption algorithm to the report template, and the customized options allow users to control the scanning time and types. We consider FOSSology a tutorial tool for FOSS beginners because, as an open-source project, it provides detailed evidence and explanations of scanning results by marking the matched texts. Furthermore, using FOSSology can be considered support for the development of the open-source community. Emerging companies can use it to confirm included FOSS licenses, thus protecting their intellectual property. Even as free software, it shows remarkable performance compared with the commercial tools, as the experiment results in Tables 4.8, 5.9, and 5.10 show. We think the license searching function shown in Figure 4.5 is an advantageous feature of FOSSology; as mentioned in the sanity test results, the other three tools, especially SCAS, could learn from this function in later version updates.

The drawback is that its functionality is relatively limited and lacks automatic confirmation. FOSSology does not present alerts for vulnerability risks. Although the web interface is understandable for broad user groups, even individuals without any programming background, the reviews require superior knowledge of software licensing terms and related laws. As Charts 5.1 and 5.2 above show, for a big project it would take hours to scan and days to review each file. In the newly released versions (after 3.4.0), FOSSology has added a REST API that allows integration of FOSSology into CI/CD environments. However, as the notes in Table 2.2 mention, the Jenkins integration is still in beta, which is behind the other three tools. Although many companies and organizations, such as Siemens, HP, and ARM, are willing to support FOSSology's efforts to extend its functionality, some commercial enterprises may still have security and efficiency concerns before actually putting FOSSology on the product line. Since the only authorization required to view or download the source code is a user account, the transparent scanning process also provides an opportunity to be exploited to a hacker's advantage.

6.2 FOSSA

The comprehensive possibility of customized configuration at every step is the advantage of FOSSA. It has the most user-friendly scanning report: as Figure 4.7 shows, the FOSS scanning report is more readable than plain Excel formats, and it also assists the user in determining the priority of each found license. The built-in email function does not even need the user to configure it as a post-trigger action; it automatically sends found issues after each periodical scan. Figure 4.6 shows that FOSSA marks strong copyleft licenses in red and gives a brief statement about the licensee's regulated rights and obligations for each matched component. This feature helps the user decide on further schemes for the component.

FOSSA's target user is the enterprise team, so payment is required for the best performance. It offers different functions based on the budget: for non-profit educational institutions, free trials, and paid teams. Because one specific version cannot represent FOSSA's complete functionality, the advanced test in this project did not include it. However, the various services make FOSSA meet the requirements of more users. It requires the user to have the programming skills to execute scripts that fetch and install the latest releases for the corresponding operating system. If the development environment is not supported, the user needs to manually check out the archive uploader, which allows direct license scanning of source code files. Different commands are necessary for each step, from environment configuration to generating the license notices with each CI build.

    Every operation in FOSSA requires an API token. As a FOSS scanning tool currently used at Ericsson, FOSSA's security level is relatively reliable. FOSSA also lists the supported languages and CI/CD environments in its documentation [4]. It is available with popular services, such as Jenkins, JFrog Artifactory, GitLab, and CircleCI, and programming languages such as Java and JavaScript, but not Erlang.
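As a concrete illustration, a token-driven CI step of this kind might look like the following configuration fragment. This is a hypothetical GitLab CI job, not taken from the FOSSA documentation: the job name and image are assumptions, `FOSSA_API_KEY` is the environment variable the CLI reads its token from, and the install URL and `fossa analyze`/`fossa test` commands should be verified against the current FOSSA CLI documentation [4].

```yaml
# Hypothetical CI job (an assumption, not from the official docs).
license-scan:
  image: ubuntu:20.04
  variables:
    # FOSSA_API_KEY is supplied as a masked CI/CD variable.
    FOSSA_API_KEY: $FOSSA_API_KEY
  script:
    # Fetch and install the latest FOSSA CLI release for this OS.
    - curl -sSL https://raw.githubusercontent.com/fossas/fossa-cli/master/install.sh | bash
    # Upload dependency analysis results to the FOSSA service.
    - fossa analyze
    # Fail the pipeline if unresolved license issues are reported.
    - fossa test
```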

    6.3 SCAS It gives the most detailed categorization of found licenses. For example, it can distinguish GPL-v1.0, GPL-v2.0, GPL-v2.0-only, and GPL-v2.0-or-later. Since a GPL license is only compatible with higher GPL versions, this function is important for the one-direction compatibility chain discussed in Chapter 2. The whitelisting function is another key feature of SCAS, regulating whether a component may properly be included in the current project.
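The one-direction compatibility chain can be sketched in a few lines of code. This is an illustrative model only, not SCAS's implementation: the license identifiers and the `compatible` helper are assumptions made for the example, capturing the rule that an "-or-later" component may flow upward along the GPL chain but never downward.

```python
# Illustrative sketch (not SCAS code) of the one-direction GPL chain.
GPL_CHAIN = ["GPL-1.0", "GPL-2.0", "GPL-3.0"]

def compatible(component: str, project: str) -> bool:
    """True if a component's license may be combined into the project license."""
    or_later = component.endswith("-or-later")
    base = component.replace("-or-later", "").replace("-only", "")
    if base == project:
        return True
    # Only an "-or-later" grant may be relicensed upward along the chain.
    return or_later and GPL_CHAIN.index(base) < GPL_CHAIN.index(project)
```

For example, `compatible("GPL-2.0-or-later", "GPL-3.0")` holds, while `compatible("GPL-2.0-only", "GPL-3.0")` does not, which is exactly why distinguishing "-only" from "-or-later" matters.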

    However, there is still room for improvement in presenting the scanning results, because SCAS currently only exports whitelists and aggregated results as Excel files. The user needs to click each matched file to view the found licenses. Also, the scanning result cannot be exported if a developer wants to check his code through individual results without aggregation. Due to the page design limitations, visual fatigue builds up after staring at the SCAS web interface to find or review one item among the matched files. Especially for projects containing thousands of files, it might be more reasonable to show the information from the A2 and A3 areas in Figure 4.8 on a new page for a clearer view. SCAS could improve the user experience and accuracy by fixing the bug in the filtering function, and could provide more precise scanning results by adding a search function. Further analysis could not be extended due to the lack of data support. Since SCAS is the new alternative tool for Ericsson's legacy source code scanning, the related documents are being updated with the process for using it in the actual working flow.

    6.4 Black Duck As a classic FOSS scanning tool, Black Duck's functionality is the most comprehensive of the four. It provides tutorial videos, documents, and other demo resources for paid teams. Still, a new user cannot find any information except the conceptual introduction about the importance of FOSS scanning and the owner company, Synopsys. On the other side, the confidential character of Black Duck may be purposely designed to eliminate customers' concerns about source code disclosure. Standard FOSS scanning can be described as the tool using different algorithms to compare the code from the user's uploads against a knowledge base. Although some vendors provide a customized knowledge base that receives periodic updates instead of linking to the external internet, most users prefer to connect directly to the central knowledge base to track term changes from updated FOSS license versions or recently released libraries. In this case, some users may worry about the potential disclosure caused by uploading source code to cloud services. From a theoretical view, this potential insecurity is related to account management rather than the FOSS scanning process.
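The comparison against a knowledge base described above can be sketched in miniature. This is a toy illustration only: real vendor algorithms (snippet matching, winnowing, etc.) are proprietary and far more sophisticated, and the two functions below are assumptions made for the example.

```python
# Toy sketch of knowledge-base matching: hash normalized source lines and
# report what fraction of a known component appears in an upload.
import hashlib

def fingerprints(text: str) -> set:
    """Hash each normalized, non-empty line of a source file."""
    return {
        hashlib.sha1(line.strip().lower().encode()).hexdigest()
        for line in text.splitlines() if line.strip()
    }

def match_ratio(upload: str, known_component: str) -> float:
    """Fraction of the known component's fingerprints found in the upload."""
    known = fingerprints(known_component)
    return len(known & fingerprints(upload)) / len(known) if known else 0.0
```

A ratio of 1.0 would indicate a full copy of the known component, while intermediate values correspond to the partial matches that scanners report for review.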

    Because Black Duck only provides accounts to paid users, this strategy decreases the possibility of using a free trial or a one-off email address to access the common knowledge base. Similar to SCAS, the scanning results from Black Duck are also limited to Excel formats; FOSSA might be a proper reference for this feature. The user of Black Duck Detect gets a better view by browsing the web interface through the link provided for each scan. As the test results in Tables 5.10 and 5.11 show, the performance is hardly affected by project sizes or programming languages. Thus, a development team with a sufficient budget that already knows about this product seems to be the target customer of Black Duck.

    At the end of this section, we recall the questions from Chapter 3.1. For the sanity test question, the experiment results in Table 5.9 illustrate that FOSSology found the most licenses; however, the accuracy of the other three commercial tools is more reliable. The tools' vendors' official statements tend to avoid mentioning undeveloped functions and use different definitions. After using the four tools to scan the same project to verify their license detection function, we compared the scanned sizes and the numbers of found licenses. Since false positives always exist, the number of found licenses is not proportional to the accuracy or size of the knowledge base. The only open-source tool, FOSSology, serves as a counterexample: it reported the most found licenses with the smallest scanned size, and its scanning results contain many "Unclassified licenses." The relatively narrow coverage of its knowledge base might be one reason FOSSology found the most licenses. As Chapter 2.2 mentioned, it treats all files containing keywords such as copyright and license as potential open-source components. So FOSSology found the most suspect licenses but showed plain performance in other fields. This test proved that the number of found licenses is not an absolute metric of a FOSS scanning tool's performance.

    From the advanced test, we can conclude that Black Duck has a more stable performance in adapting to changes in programming languages and project sizes. It provides the most exhaustive scanning but no product information before payment. FOSSology is slower than Black Duck in all testing cases. As an open-source project founded by the Linux Foundation, FOSSology gives the most detailed explanation of the scanning process. However, the scanning time required by FOSSology fluctuates rapidly with changes in scanning size and programming language. Thus, we concluded that the budget and the scanning size could be the main factors in choosing a proper FOSS scanning tool.

    7 Discussion Ericsson's current scanning strategy uses as many FOSS scanning tools as possible to perform multiple scans on each commit, every deliverable build, and each repository. The license negotiation for Synopsys's Black Duck is ongoing, so JFrog Xray takes responsibility for binary scanning for now. For source code scanning, SCAS is the new solution planned to replace the legacy BAZAAR. Both of them are Ericsson's internal platforms with the kernel of the algorithm provided by FOSSID. From Table 5.9, the sanity test result reflects that the scanned files and sizes differ when using the four tools to scan the same project. Because the scanning depth and data collection methods differ for one testing object, this deviation may cause infringement due to a false negative. Besides, each team can choose its preferred FOSS scanning tool, such as FOSSA. According to the official tutorial documents, the tool selection is determined by the form of the submitted code. The diverse options could cause problems for teams that do not know about all the available FOSS scanning tools. So this project provides general recommendations for all users who intend to use one of the FOSS scanning tools to perform license detection. We list them below:

    ● We recommend FOSSology for non-profit projects or open-source supporters. Although the required time increases rapidly with the scanning size, it is an impressive tool for learning because it even has detailed explanations for each class in its code [9].

    ● FOSSA is suitable for deliverable build scans or small projects up to a maximum size of 100 MB. In Ericsson's architecture, it has been deployed in the Kubernetes Engine service to guard the entrance for incoming public repositories. The payment is flexible; thus, the user could choose a plan that balances price and function if the possibly included components are predictable.

    ● SCAS represents FOSSID; it stands out in license categorization but also has space for improvement. It is an example of a customized product developed for a user's specific requirements.

    ● Black Duck is a professional tool with the most comprehensive functionality among these four tools. The evaluation of Black Duck is in progress at Ericsson. Its performance is hardly affected by external factors; thus, it is the best choice for a team with a sufficient project budget.

    The scanning results reflect the capabilities and characteristics of FOSSology, SCAS, Black Duck Detect, and FOSSA. Users can choose the optimal FOSS management tool according to project requirements, budgets, and usability. From the data collected, it is safe to say this project achieved the research objective, because it provides knowledge that users can refer to when choosing a suitable FOSS scanning tool.

    The tools' performances recorded as testing results in Chapter 4 are the foundations for these recommendations. The four scanning tools have different advantages that fit various user requirements. For example, Black Duck does not support TypeScript or Java with Ant as a package manager, while FOSSA cannot integrate with many CI/CD tools, such as TeamCity, Bamboo, and Team Foundation Server [4]. The detailed statements about the deployment environment and user experience have already been discussed in Chapter 6. Although this project aims to indicate suitable usage scenarios for FOSS scanning tools, which differs from the related works that focus on algorithm analysis and self-development, the experiments' results are still consistent with the previous studies. Kapitsaki, Tselikas, and Foukarakis's research [14] drew the same conclusions about the features of FOSSology and Black Duck, as Table 2.2 in this project shows. Since the comparison methods in this project are inspired by German and Di Penta's published paper [7], as well as Tomáš's thesis [20], the reasoning process also followed these reviewed studies. Besides, the definitions and compatibility chain from the FSF and OSI [18] [22] confirmed the license information found by the tools. The similarity among the tools' functions reveals the importance of comparing their features. The previous works provide new solutions for FOSS scanning, but this thesis gives a guideline to the plentiful FOSS scanning tool market.

    8 Conclusion This project used two sets of controlled experiments to determine the capabilities and characteristics of four FOSS scanning tools. The first set of controlled experiments is the sanity test, which compared the scanned sizes and found licenses among all the chosen tools. The experiment results reflect the coverage of the knowledge base and the accuracy of the algorithm. We also summarized the user experience, security concerns, and integration possibilities of each tool from the experiment process. The second set of controlled experiments compared the scanning times of FOSSology and Black Duck. A positive correlation exists between the required scanning time and the project size. For projects of the same size, both FOSS scanning tools need the longest time to scan the Python project and the shortest time for the C++ project, while the Golang project consumes an intermediate amount of time. Black Duck has a more stable performance than FOSSology, and its efficiency only changes slightly with the testing samples.

    The purpose is to provide knowledge for potential users to choose a suitable tool for performing open-source compliance in product development, so that enterprises or individuals can save the time and expense of testing the various commercial scanning tools.

    8.1 Future work This project does not investigate the possible reasons for the different FOSS scanning tools' performances or the potential security problems, due to knowledge limitations. As their core competency, the vendors tightly protect their algorithms. For now, the knowledge base is the only disclosed factor that could affect the performance. It could still be an exciting topic to test the algorithms by designing testing samples; for example, replacing keywords such as "release" or "copyright (c)" would cause more partial matches in similarity scanners. The knowledge related to cryptography belongs to network security, which is beyond our scope of study, so we could not provide actual experiment data to prove our opinion in Chapter 6. Since source code disclosure is the primary concern of most users, future work could put more effort into studying exploits in scanning algorithms or account protection.
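The keyword-replacement idea above can be demonstrated in miniature. The snippet is a toy illustration using Python's standard `difflib` similarity ratio as a stand-in for a scanner's matcher; the notice text is invented for the example.

```python
# Toy demonstration: rewriting a license-notice keyword lowers the
# similarity score a matcher sees, turning a full match into a partial one.
from difflib import SequenceMatcher

original = "Copyright (c) 2020 Example Corp. Released under the MIT License."
modified = original.replace("Copyright (c)", "(C) legal notice,")

full = SequenceMatcher(None, original, original).ratio()     # identical text
partial = SequenceMatcher(None, original, modified).ratio()  # edited keyword
assert full == 1.0 and partial < full
```

Designing testing samples around such targeted edits would probe how robust each tool's matcher is to superficial rewording.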

    Also, we could extend the testing samples and testing objectives. Except for Black Duck, the chosen tools offer no options for scanning types. Suppose we included more tools, such as JFrog Xray, Snyk, WhiteSource, GitLab, and others [23]. In that case, we could divide the comparison into more detailed categories; for example, source code scanning and binary scanning could be compared separately. In that case, the controlled variables would be stricter in the experiments. Furthermore, integration into the CI/CD flow is a spotlight in advertisements for the popular FOSS scanning tools [17]. The testing samples in this project were not integrated with any pipeline, so the statements related to the integration function in the tools' introduction in Chapter 2 also lack experimental support. Future work could execute the sanity test in an actual workflow to verify Table 2.2 in detail. Since the documents may lag behind version updates, broader test coverage will increase their validity. Although most vendors claim that their tools are available for any programming language and have the most significant knowledge base, it would be more accurate to add more programming languages and licenses to the testing samples.

    References

    [1] Boehm, M., 2019. The emergence of governance norms in volunteer driven open source communities. Journal of Open Law, Technology and Society, 11(1), pp.3-39.

    [2] Feller, J., Fitzgerald, B. and Hissam, S., 2005. Perspectives on Free and Open Source Software. Cambridge, Mass: MIT, pp.79-81.

    [3] Fogel, K., 2005. Producing Open Source Software. Sebastopol: O'Reilly Media, Inc., pp.233-240.

    [4] FOSSA - Guides & Documentation. 2020. Supported Languages. [online] Available at: [Accessed 17 May 2020].

    [5] Fossology.github.io. 2020. FOSSology: Overview of FOSSology. [online] Available at: [Accessed 13 May 2020].

    [6] Gangadharan, G., D'Andrea, V., De Paoli, S. and Weiss, M., 2009. Managing license compliance in free and open source software development. Information Systems Frontiers, 14(2), pp.143-154.

    [7] German, D. and Di Penta, M., 2012. A Method for Open Source License Compliance of Java Applications. IEEE Software, 29(3), pp.58-63.

    [8] Ghosh, A., Nashaat, M. and Miller, J., 2019. The current state of software license renewals in the I.T. industry. Information and Software Technology, 108, pp.139-152.

    [9] GitHub. 2020. Fossology/Fossology. [online] Available at: [Accessed 17 May 2020].

    [10] Gnu.org. 2020. Various licenses And Comments Abo