Identifying open source third-party software with ScanCode Toolkit
Open Source for Open Source
Agenda
▷ Introduction to ScanCode Toolkit
▷ Demo
▷ ScanCode Details
▷ About nexB
Benefits of an open source scanner
As a developer:
▷ I get normalized and comprehensive license and origin data
▷ I can find the license immediately when I evaluate a library
▷ I can identify and resolve license issues before a release
▷ I can identify issues for each commit
▷ I can communicate clearly with legal and business about license
and origin of third-party code
You can use the Apache-licensed ScanCode Toolkit now!
Participate by contributing code, license rules, bugs, suggestions.
What does ScanCode Toolkit do?
It scans source and binary code to find:
▷ License notices, texts and “mentions”
▷ Copyright notices
▷ Package-level information (RPM, nuget, NPM, Jar, etc.)
▷ Other provenance clues (author, email, etc.)
▷ File-level information (type, name, checksums, etc.)
ScanCode Results are provided as:
▷ JSON file
▷ Dynamic HTML
▷ Static HTML table usable in a spreadsheet
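As an illustration of working with the JSON file, here is a minimal Python sketch that counts detected licenses and lists files with no license found. The field names (`files`, `path`, `licenses`, `key`) and the sample data are assumptions made for this example; the real ScanCode output carries more fields and its layout depends on the version used.

```python
import json
from collections import Counter

# Hypothetical, simplified scan output used only for this example.
SAMPLE_SCAN = """
{
  "files": [
    {"path": "lib/a.c", "licenses": [{"key": "mit"}]},
    {"path": "lib/b.c", "licenses": [{"key": "mit"}, {"key": "gpl-2.0"}]},
    {"path": "README", "licenses": []}
  ]
}
"""


def summarize_licenses(scan_text):
    """Count how many files mention each license key."""
    scan = json.loads(scan_text)
    counts = Counter()
    for entry in scan.get("files", []):
        # A set ensures a license reported twice in one file counts once.
        keys = {lic["key"] for lic in entry.get("licenses", [])}
        counts.update(keys)
    return counts


def files_without_license(scan_text):
    """List paths of files where no license was detected."""
    scan = json.loads(scan_text)
    return [e["path"] for e in scan.get("files", []) if not e.get("licenses")]
```

A report like this is one way to spot unlicensed files before a release.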
Demo Time
Available on GitHub
▷ Get the code: https://github.com/nexB/scancode-toolkit/
▷ Read more: https://github.com/nexB/scancode-toolkit/wiki
▷ Report an issue or idea: https://github.com/nexB/scancode-toolkit/issues
▷ Commercial support: http://www.nexb.com/
ScanCode Toolkit Licensing
Component                License                    Notes
Software                 Apache 2.0                 With an acknowledgement in the scan output
Reference Data           CC0 1.0                    Public Domain
Third Party Components   L/GPL, MIT, BSD, Apache    Various Licenses
ScanCode Toolkit Roadmap
▷ nexB is migrating features from our proprietary scanning tools to ScanCode incrementally over the next year (2016)
▷ Roadmap at https://github.com/nexB/scancode-toolkit/wiki/Roadmap
Thanks!
Any questions?
Credits
Special thanks to all the people who made and released these awesome free resources:
▷ Presentation template by SlidesCarnival
▷ Photographs by Unsplash
▷ And all the software authors that made ScanCode possible
Additional Details
▷ ScanCode by the numbers
▷ What is scanning?
▷ How does ScanCode work?
▷ About nexB
ScanCode by the numbers
▷ Over 6,000 tests
▷ 500+ large software products scanned
▷ Over 3,000 licenses, notices and samples
ScanCode - Technology
▷ Written primarily in Python
○ also JavaScript, Ruby, Java and C/C++
▷ Tested on Linux, OS X and Windows
▷ Command line tool or library
▷ Simple HTML browser-app (any modern browser) - runs locally
What is Scanning?
Detect and discover “evidence” of origin and license in code (source or binary files)
▷ Copyright notice
▷ License notice and/or license text
▷ Software package manifests
▷ Email, URL, author or other names
▷ Other origin and license clues found in the code
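To make the idea of "evidence" concrete, here is a small Python sketch that pulls copyright lines, emails and URLs out of a source comment. The regular expressions are deliberately simple assumptions for illustration; they are not the patterns ScanCode uses and will miss many real-world forms.

```python
import re

# Simplified, illustrative patterns (not ScanCode's own).
COPYRIGHT = re.compile(r"(?i)copyright\s+(?:\(c\)\s+)?\d{4}[^\n]*")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL = re.compile(r"https?://[^\s\"'<>)]+")

SAMPLE = """/*
 * Copyright (c) 2015 Example Corp.
 * Contact: dev@example.com
 * See http://example.com/license for details.
 */
"""


def find_evidence(text):
    """Return the origin clues found in `text`, grouped by kind."""
    return {
        "copyrights": COPYRIGHT.findall(text),
        "emails": EMAIL.findall(text),
        "urls": URL.findall(text),
    }
```

Calling `find_evidence(SAMPLE)` returns one entry of each kind for the sample header.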
Scanning is not Matching
Matching looks for similarities between your code and an index (digital fingerprints) of OSS code
▷ If your code is similar it “may” share a similar origin
▷ Matching may be applied at multiple levels
○ Package
○ File or snippet
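For contrast with scanning, the following Python sketch shows one simple way matching could work at the snippet level: hash every run of three lines from known OSS code into an index, then look up the same hashes in your own code. The window size, the normalization and the `libfoo` origin name are assumptions for this example; commercial matchers use far more elaborate fingerprints.

```python
import hashlib


def fingerprint(text):
    """SHA-1 of whitespace-normalized text, used as a digital fingerprint."""
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def windows(text, size=3):
    """Yield every run of `size` consecutive non-blank lines."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    for start in range(len(lines) - size + 1):
        yield "\n".join(lines[start:start + size])


def build_index(known_sources, size=3):
    """Map snippet fingerprints to the origin they were taken from."""
    index = {}
    for origin, text in known_sources.items():
        for snippet in windows(text, size):
            index.setdefault(fingerprint(snippet), origin)
    return index


def match(index, text, size=3):
    """Return the origins sharing at least one snippet with `text`."""
    found = set()
    for snippet in windows(text, size):
        origin = index.get(fingerprint(snippet))
        if origin is not None:
            found.add(origin)
    return found
```

A hit only says the code is similar to something in the index, which is why matching results need review.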
Scanning plus Matching
▷ Scanning will identify origin and license in most cases, but
○ Does not detect copying of snippets, or
○ Intentional stripping of notices, etc.
▷ Matching can identify code that was copied and/or stripped, but
○ Typically produces MANY false positives and requires extensive review
○ Especially for the most commonly used OSS projects
How does ScanCode work? (1)
▷ Each file is categorized based on its type
▷ Archives and compressed files are fully extracted
▷ The text of each file is collected (source and binaries)
▷ Each file's text is then "scanned"
▷ Results are formatted and returned as a JSON file
▷ You can view the results in a browser, or
▷ Use the JSON file as you want
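The steps above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not ScanCode's implementation: file types are guessed from names with `mimetypes`, text is collected from binaries the way the Unix `strings` tool does, the "scanner" is a hypothetical stand-in that keeps lines mentioning a license, and archive extraction is left out.

```python
import json
import mimetypes
import os
import re
import tempfile

# Runs of 4+ printable ASCII bytes, similar to the Unix `strings` tool.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def collect_text(path):
    """Collect readable text runs from any file, source or binary."""
    with open(path, "rb") as handle:
        data = handle.read()
    return [run.decode("ascii") for run in PRINTABLE_RUN.findall(data)]


def scan_lines(lines):
    """Hypothetical stand-in scanner: keep lines that mention a license."""
    return [line.strip() for line in lines if "license" in line.lower()]


def scan_tree(root):
    """Categorize, collect text, scan, and return the results as JSON text."""
    files = []
    for folder, _dirs, names in os.walk(root):
        for name in sorted(names):
            path = os.path.join(folder, name)
            mime, _encoding = mimetypes.guess_type(path)
            files.append({
                "path": os.path.relpath(path, root),
                "type": mime or "unknown",
                "license_mentions": scan_lines(collect_text(path)),
            })
    return json.dumps({"files": files}, indent=2)


def demo():
    """Scan a throwaway directory holding one binary-like file."""
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "tool.bin"), "wb") as handle:
            handle.write(b"\x00\x01Licensed under the MIT License\x00\xff\xfe")
        return json.loads(scan_tree(root))
```

The `demo()` helper shows that a license notice embedded in binary data is still picked up once the text has been collected.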
How does ScanCode work? (2)
▷ For licenses, the techniques are similar to DNA analysis with multi-pattern matching
▷ Licenses are found exactly or approximately based on a set of thousands of license texts, notices and examples
▷ For copyrights, a syntax and grammar analyzer captures the many forms of copyright statements
▷ Emails, URLs, authors, person names and other data are captured using similar pattern matching techniques
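To give a feel for exact versus approximate license detection, here is a small Python sketch that compares a text against a tiny rule set using token sequence similarity from the standard `difflib` module. This is a swapped-in, simplified technique chosen for illustration, not ScanCode's matching algorithm; the two rules, the tokenizer and the 0.8 threshold are assumptions.

```python
import difflib
import re

# Tiny stand-in rule set; the real tool relies on thousands of texts.
LICENSE_RULES = {
    "mit": "Permission is hereby granted, free of charge, to any person "
           "obtaining a copy of this software",
    "apache-2.0": "Licensed under the Apache License, Version 2.0",
}


def tokens(text):
    """Lowercase the text and split it into word and version tokens."""
    return re.findall(r"[a-z0-9.]+", text.lower())


def detect_license(text, threshold=0.8):
    """Return (license key or None, best similarity score) for `text`."""
    query = tokens(text)
    best_key, best_score = None, 0.0
    for key, rule in LICENSE_RULES.items():
        # ratio() is 1.0 for identical token sequences, lower otherwise.
        score = difflib.SequenceMatcher(None, tokens(rule), query).ratio()
        if score > best_score:
            best_key, best_score = key, score
    if best_score >= threshold:
        return best_key, best_score
    return None, best_score
```

An exact notice scores 1.0, while a notice with a few extra or changed words can still clear the threshold and be reported as an approximate match.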
Alternatives and complements
▷ Open source such as:
○ FOSSology (C, PHP): regex-based
○ Ninka (Perl): regex and sentence-based
○ OSLC (Java, unmaintained)
▷ Commercial such as ...
▷ Complementary:
○ AboutCode: document origin side-by-side with code, collect inventory, generate attribution doc
○ TraceCode (not yet released): trace the source-to-binary transformation to find (static) linking and which subset of the source code is used, either by dynamically tracing a build or by static analysis
About nexB Inc.
We offer:
▷ DejaCode™ - Open Data Platform for Managing Open Source - http://www.dejacode.com/
▷ Open Source Scanning & Attribution Generation Tools - https://github.com/nexB
▷ Open Source Software Expert Audit Services - http://www.nexb.com/services.html