Identifying third-party software with ScanCode
Identifying open source software with ScanCode
May 2016
Open Source for Open Source
▷ Introduction to ScanCode
○ Toolkit
○ App
▷ Demo
▷ More Details
▷ About nexB
Benefits of an open source scanner
As a developer:
▷ I get normalized data for comprehensive origin and license
▷ I can find the license immediately when I evaluate a library
▷ I can identify and resolve license issues before a release
▷ I can identify issues for each commit
▷ I can communicate clearly with legal and business teams about the license and origin of third-party code
You can use the Apache-licensed ScanCode Toolkit now!
Participate by contributing code, license rules, bugs or suggestions.
What does ScanCode Toolkit do?
It scans source and binary code to find:
▷ License notices, texts and “mentions”
▷ Copyright notices
▷ Package-level information (RPM, NuGet, NPM, JAR, etc.)
▷ Other provenance clues (author, email, etc.)
▷ File-level information (type, name, checksums, etc.)
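The file-level part of this list (type, name, checksums) can be illustrated with a minimal Python sketch. It uses only the standard library; the function name `file_info` and the field names are assumptions made for illustration and are not ScanCode's actual output schema.

```python
import hashlib
import mimetypes
from pathlib import Path


def file_info(path):
    """Collect basic file-level facts: name, guessed type, size and checksums.

    Hypothetical helper for illustration; not part of ScanCode.
    """
    p = Path(path)
    data = p.read_bytes()
    # Guess the type from the file name only; a real scanner would also
    # inspect the file content.
    guessed_type, _ = mimetypes.guess_type(p.name)
    return {
        "name": p.name,
        "type": guessed_type or "unknown",
        "size": len(data),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
```

Checksums like these make it possible to tell whether two files in different codebases are byte-for-byte identical.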
ScanCode Results are provided as:
▷ JSON file
▷ Dynamic HTML
▷ Static HTML table usable in a spreadsheet
▷ AND
▷ ... the new ScanCode App
▷ ... next, in the ScanCode.io server
ScanCode Toolkit Demo
Available on GitHub
▷ Get the code: https://github.com/nexB/scancode-toolkit/
▷ Read more: https://github.com/nexB/scancode-toolkit/wiki
▷ Report an issue or idea: https://github.com/nexB/scancode-toolkit/issues
▷ Commercial support and services available from nexB: ScanCode starter pack http://www.nexb.com/
ScanCode Licensing
Component | License | Notes
Software | Apache 2.0 | With an acknowledgement in the scan output.
Reference Data | CC0 1.0 | Public Domain
Third Party Components | L/GPL, MIT, BSD, Apache | Various Licenses
ScanCode Toolkit Roadmap
▷ New scans for software packages (RPM, NPM, Gems, Java Jars, Debian, NuGet, Python, etc.)
▷ Approximate license detection
▷ SPDX license expressions
▷ Speed improvements
▷ See https://github.com/nexB/scancode-toolkit/wiki/Roadmap
ScanCode App
What we’ve been working on!
ScanCode App
Motivation:
▷ Analyze ScanCode results
▷ Document your conclusion about the provenance and license for a software component
▷ Save conclusions
▷ Share results
ScanCode Conclusions
Document component-level conclusions such as:
▷ Component Name
▷ Component Version
▷ Component Owner
▷ Concluded License
▷ Concluded Copyright
Preview of ScanCode App
Summary of Features
▷ View results in tree or tabular view
▷ Add conclusion data at any node of the existing codebase hierarchy
▷ Save Components and conclusions to a JSON file
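As a rough illustration of saving component conclusions to JSON, here is a minimal Python sketch. The record fields mirror the conclusion items named in this deck (name, version, owner, concluded license, concluded copyright); the class name, the `path` field and the JSON layout are assumptions, not the ScanCode App's actual file format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Conclusion:
    """One component-level conclusion attached to a node of the codebase tree.

    Hypothetical record layout for illustration only.
    """
    path: str                 # node in the scanned codebase hierarchy
    component_name: str
    component_version: str
    component_owner: str
    concluded_license: str
    concluded_copyright: str


def dump_conclusions(conclusions):
    """Serialize a list of conclusions to a JSON string."""
    return json.dumps([asdict(c) for c in conclusions], indent=2)


def load_conclusions(text):
    """Rebuild conclusions from a JSON string written by dump_conclusions."""
    return [Conclusion(**item) for item in json.loads(text)]
```

Keeping the conclusion next to the tree path it applies to is what lets a reviewer record a decision at any level of the hierarchy.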
Thanks!
Any questions?
Credits
Special thanks to all the people who made and released these awesome free resources:
▷ Presentation template by SlidesCarnival
▷ Photographs by Unsplash
▷ And all the software authors who made ScanCode possible
About nexB Inc.
We offer:
▷ DejaCode™ - Open Data Platform for Managing Open Source - http://www.dejacode.com/
▷ Open Source Scanning & Tracking Tools - https://github.com/nexB
▷ Open Source Software Expert Audit Services - http://www.nexb.com/services.html
ScanCode Details
▷ ScanCode by the numbers
▷ What is scanning?
▷ How does ScanCode work?
ScanCode by the numbers
▷ Over 6,000 tests
▷ Over 500 large software products scanned
▷ Over 3,000 licenses, notices and samples
ScanCode Toolkit - Technology
▷ Written primarily in Python
○ also JavaScript, Ruby, Java and C/C++
▷ Tested on Linux, OS X and Windows
▷ Command line tool or library
▷ Simple HTML browser-app (any modern browser) - runs locally
ScanCode App - Technology
▷ Based on Electron and written primarily in JavaScript
▷ D3.js used for data visualizations
What is Scanning?
Detect and discover “evidence” of origin and license in code (source or binary files)
▷ Copyright notice
▷ License notice and/or license text
▷ Software package manifests
▷ Email, URL, author or other names
▷ Other origin and license clues found in the code
Scanning is not Matching
Matching looks for similarities between your code and an index (digital fingerprints) of OSS code
▷ If your code is similar it “may” share a similar origin
▷ Matching may be applied at multiple levels
○ Package
○ File or snippet
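To make the idea of matching against an index of fingerprints concrete, here is a minimal snippet-level sketch in Python. It hashes sliding windows of normalized lines and reports how much of a candidate file also appears in an indexed file. This is a generic fingerprinting technique chosen for illustration; the function names and the window size are assumptions, and it is not how any particular matching product works.

```python
import hashlib


def fingerprints(text, window=3):
    """Hash every run of `window` consecutive non-blank lines.

    Whitespace inside each line is normalized so that re-indented
    copies still produce the same fingerprints.
    """
    lines = [" ".join(line.split()) for line in text.splitlines() if line.strip()]
    return {
        hashlib.sha1("\n".join(lines[i:i + window]).encode("utf-8")).hexdigest()
        for i in range(max(len(lines) - window + 1, 0))
    }


def similarity(candidate, indexed, window=3):
    """Fraction of the candidate's fingerprints that also occur in the indexed text."""
    cand = fingerprints(candidate, window)
    if not cand:
        return 0.0
    return len(cand & fingerprints(indexed, window)) / len(cand)
```

A non-zero score only says the two texts share some lines; as the slide notes, similar code "may" share a similar origin, so a human still has to review the hit.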
Scanning plus Matching
▷ Scanning will identify origin and license in most cases, but
○ Does not detect copying of snippets, or
○ Intentional stripping of notices, etc.
▷ Matching can identify code that was copied and/or stripped, but
○ Typically produces MANY false positives and requires extensive review
○ Especially for the most commonly used OSS projects
How does ScanCode work? (1)
▷ Each file is categorized based on its type
▷ Archives and compressed files are fully extracted
▷ The text of each file is collected (source and binaries)
▷ Each file's text is then "scanned"
▷ Results are formatted and returned as a JSON file
▷ You can view the results in a browser, or
▷ Use the JSON file as you want
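The steps above can be sketched as a tiny pipeline in Python. This is a simplified illustration under stated assumptions, not ScanCode's implementation: the type check is a crude NUL-byte test, archive extraction is left out entirely, the "scan" is a placeholder keyword filter, and all function names and JSON keys are hypothetical.

```python
import json
import re
from pathlib import Path

# Runs of at least four printable ASCII bytes, similar to the Unix `strings` tool.
TEXT_RUN = re.compile(rb"[\t\x20-\x7e]{4,}")


def categorize(data):
    """Crude type check: a NUL byte near the start suggests a binary file."""
    return "binary" if b"\x00" in data[:1024] else "text"


def collect_text(data):
    """Collect readable text runs from source or binary content."""
    return [run.decode("ascii") for run in TEXT_RUN.findall(data)]


def scan_text(lines):
    """Placeholder scanner: keep lines that mention a license or copyright."""
    return [line for line in lines if re.search(r"licen[sc]e|copyright", line, re.I)]


def scan_tree(root):
    """Walk a directory, scan each file and return the results as JSON text."""
    results = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        data = path.read_bytes()
        results.append({
            "path": path.relative_to(root).as_posix(),
            "type": categorize(data),
            "clues": scan_text(collect_text(data)),
        })
    return json.dumps({"files": results}, indent=2)
```

Because text is collected from binaries as well as sources, a license string embedded in a compiled file can still show up in the results.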
How does ScanCode work? (2)
▷ For licenses, the techniques are similar to DNA analysis with multi-pattern matching
▷ Licenses are found exactly or approximately based on a set of thousands of license texts, notices and examples
▷ For copyrights, a syntax and grammar analyzer captures the many forms of copyright statements
▷ Emails, URLs, authors, person names and other data are captured using similar pattern matching techniques
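A toy version of these two ideas, approximate license matching against reference texts and pattern-based copyright capture, might look like the Python sketch below. It is deliberately much simpler than what the slides describe: it compares token sequences with the standard library's `difflib` instead of multi-pattern matching, and it uses one regular expression instead of a grammar. The reference snippets, the threshold and all names are assumptions for illustration, not ScanCode's rules or API.

```python
import difflib
import re

# Illustrative reference snippets only; not real ScanCode license rules.
REFERENCE_TEXTS = {
    "mit": "Permission is hereby granted, free of charge, to any person "
           "obtaining a copy of this software",
    "apache-2.0": 'Licensed under the Apache License, Version 2.0 (the "License")',
}

# Simplified pattern: "Copyright", optional "(c)" or "©", year or year range, holder.
COPYRIGHT = re.compile(
    r"copyright\s+(?:\(c\)\s+|©\s+)?(?P<years>\d{4}(?:-\d{4})?)\s+(?P<holder>[^\n]+)",
    re.IGNORECASE,
)


def tokens(text):
    """Lowercase word tokens, ignoring punctuation and layout."""
    return re.findall(r"[a-z0-9]+", text.lower())


def detect_license(text, threshold=0.8):
    """Return (license_key, score) for the closest reference text.

    The key is None when no reference reaches the threshold.
    """
    query = tokens(text)
    best_key, best_score = None, 0.0
    for key, reference in REFERENCE_TEXTS.items():
        matcher = difflib.SequenceMatcher(None, tokens(reference), query, autojunk=False)
        score = matcher.ratio()
        if score > best_score:
            best_key, best_score = key, score
    return (best_key, best_score) if best_score >= threshold else (None, best_score)


def find_copyrights(text):
    """Return (years, holder) pairs for each simple copyright statement found."""
    return [(m.group("years"), m.group("holder").strip()) for m in COPYRIGHT.finditer(text)]
```

Comparing tokens rather than raw characters is what lets a reworded or re-wrapped notice still score close to its reference text.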
Alternatives and complements
▷ Open source such as:
○ Fossology (C, PHP): regex-based
○ Ninka (Perl): regex- and sentence-based
○ OSLC (Java, unmaintained)
▷ Commercial
▷ Complementary:
○ AboutCode: document origin side-by-side with code, collect inventory, generate attribution doc
○ TraceCode (not yet released): traces the source-to-binary transformation to find (static) linking and which subset of the source code is used (by dynamically tracing a build or through static analysis)