identifying third party software with scancode

Identifying open source software with ScanCode

May 2016

Open Source for Open Source

▷ Introduction to ScanCode

○ Toolkit

○ App

▷ Demo

▷ More Details

▷ About nexB

Benefits of an open source scanner

As a developer:

▷ I get comprehensive, normalized origin and license data

▷ I can find the license immediately when I evaluate a library

▷ I can identify and resolve license issues before a release

▷ I can identify issues for each commit

▷ I can communicate clearly with legal and business about the license and origin of third-party code

You can use the Apache-licensed ScanCode Toolkit now!

Participate by contributing code, license rules, bugs or suggestions.

What does ScanCode Toolkit do?

It scans source and binary code to find:

▷ License notices, texts and “mentions”

▷ Copyright notices

▷ Package-level information (RPM, nuget, NPM, Jar, etc.)

▷ Other provenance clues (author, email, etc.)

▷ File-level information (type, name, checksums, etc.)
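The file-level information in the last bullet can be illustrated with a minimal sketch. This is not ScanCode's own code: the function name `file_info` and the returned field names are assumptions chosen for illustration, using only the Python standard library.

```python
import hashlib
import mimetypes
from pathlib import Path


def file_info(path):
    """Collect basic file-level facts: name, size, guessed type, checksums.

    Hypothetical helper for illustration; field names are assumptions.
    """
    p = Path(path)
    data = p.read_bytes()
    # Guess the type from the file name only; None if the extension is unknown.
    mime, _ = mimetypes.guess_type(p.name)
    return {
        "name": p.name,
        "size": len(data),
        "mime_type": mime,
        "sha1": hashlib.sha1(data).hexdigest(),
        "md5": hashlib.md5(data).hexdigest(),
    }
```

A real scanner would typically inspect file content to determine the type rather than rely on the extension.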

ScanCode Results are provided as:

▷ JSON file

▷ Dynamic HTML

▷ Static HTML table usable in a spreadsheet

▷ AND

▷ ... the new ScanCode App

▷ ... next, in the ScanCode.io server
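Because the results are available as JSON, they can be post-processed with a few lines of code. The sketch below works on an inline sample; the field names (`files`, `path`, `licenses`, `key`, `copyrights`) are assumptions made for illustration, and the actual layout depends on the ScanCode version you run.

```python
import json

# Hypothetical scan output. The real layout depends on the ScanCode version.
SAMPLE = """
{"files": [
  {"path": "lib/a.c",
   "licenses": [{"key": "mit"}],
   "copyrights": ["Copyright (c) 2016 Example"]},
  {"path": "lib/b.c",
   "licenses": [{"key": "mit"}, {"key": "gpl-2.0"}],
   "copyrights": []},
  {"path": "README", "licenses": [], "copyrights": []}
]}
"""


def licenses_by_file(scan):
    """Map each scanned path to the sorted, de-duplicated license keys found."""
    return {
        entry["path"]: sorted({lic["key"] for lic in entry.get("licenses", [])})
        for entry in scan.get("files", [])
    }


scan = json.loads(SAMPLE)
summary = licenses_by_file(scan)
for path, keys in summary.items():
    print(path, keys or "no license detected")
```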


ScanCode Toolkit Demo

Available on GitHub

▷ Get the code: https://github.com/nexB/scancode-toolkit/

▷ Read more: https://github.com/nexB/scancode-toolkit/wiki

▷ Report an issue or idea: https://github.com/nexB/scancode-toolkit/issues

▷ Commercial support and services available from nexB: ScanCode starter pack, http://www.nexb.com/

ScanCode Licensing

▷ Software: Apache 2.0, with an acknowledgement in the scan output

▷ Reference Data: CC0 1.0 (Public Domain)

▷ Third Party Components: L/GPL, MIT, BSD, Apache (various licenses)

ScanCode Toolkit Roadmap

▷ New scans for software packages (RPM, NPM, Gems, Java Jars, Debian, Nuget, Python, etc.)

▷ Approximate license detection

▷ SPDX license expressions

▷ Speed improvements

▷ See https://github.com/nexB/scancode-toolkit/wiki/Roadmap

ScanCode App

What we’ve been working on!

ScanCode App

Motivation:

▷ Analyze ScanCode results

▷ Document your conclusion about the provenance and license for a software component

▷ Save conclusions

▷ Share results

ScanCode Conclusions

Document component-level conclusions such as:

▷ Component Name

▷ Component Version

▷ Component Owner

▷ Concluded License

▷ Concluded Copyright

Preview of ScanCode App

Summary of Features

▷ View results in tree or tabular view

▷ Add conclusion data at any node of the existing codebase hierarchy

▷ Save Components and conclusions to a JSON file
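As a rough illustration of what saving component-level conclusions to a JSON file could look like, here is a minimal sketch. The `Conclusion` class, its attribute names, and the file layout are assumptions for illustration only; they are not the ScanCode App's actual format (the App itself is written in JavaScript).

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Conclusion:
    """Hypothetical record for one component-level conclusion."""
    path: str       # node of the codebase hierarchy the conclusion is attached to
    name: str       # Component Name
    version: str    # Component Version
    owner: str      # Component Owner
    license: str    # Concluded License
    copyright: str  # Concluded Copyright


def save_conclusions(conclusions, filename):
    """Write a list of conclusions to a JSON file."""
    with open(filename, "w", encoding="utf-8") as out:
        json.dump([asdict(c) for c in conclusions], out, indent=2)


def load_conclusions(filename):
    """Read conclusions back from a JSON file written by save_conclusions."""
    with open(filename, encoding="utf-8") as src:
        return [Conclusion(**item) for item in json.load(src)]
```

Keeping the path alongside each record is one way to attach a conclusion to any node of the codebase tree.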

Thanks!

Any questions?

Credits

Special thanks to all the people who made and released these awesome free resources:

▷ Presentation template by SlidesCarnival (http://www.slidescarnival.com/)

▷ Photographs by Unsplash (http://unsplash.com/)

▷ And all the software authors who made ScanCode possible

About nexB Inc.

We offer:

▷ DejaCode™ - Open Data Platform for Managing Open Source - http://www.dejacode.com/

▷ Open Source Scanning & Tracking Tools - https://github.com/nexB

▷ Open Source Software Expert Audit Services - http://www.nexb.com/services.html

ScanCode Details

▷ ScanCode by the numbers

▷ What is scanning?

▷ How does ScanCode work?

ScanCode by the numbers

▷ Over 6,000 tests

▷ Over 500 large software products scanned

▷ Over 3,000 licenses, notices and samples

ScanCode Toolkit - Technology

▷ Written primarily in Python

○ also JavaScript, Ruby, Java and C/C++

▷ Tested on Linux, OS X and Windows

▷ Command line tool or library

▷ Simple HTML browser-app (any modern browser) - runs locally

ScanCode App - Technology

▷ Based on Electron and written primarily in JavaScript

▷ D3.js used for data visualizations

What is Scanning?

Detect and discover “evidence” of origin and license in code (source or binary files)

▷ Copyright notice

▷ License notice and/or license text

▷ Software package manifests

▷ Email, URL, author or other names

▷ Other origin and license clues found in the code

Scanning is not Matching

Matching looks for similarities between your code and an index (digital fingerprints) of OSS code

▷ If your code is similar it “may” share a similar origin

▷ Matching may be applied at multiple levels

○ Package

○ File or snippet
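To make the contrast with scanning concrete, here is a minimal sketch of file-level matching against an index of fingerprints. It only handles exact matches using a SHA-1 checksum as the fingerprint; the function names and the tiny index are assumptions for illustration, and real matching tools typically also use fuzzier fingerprints to catch modified files and snippets.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Use a SHA-1 checksum as a simple, exact file-level fingerprint."""
    return hashlib.sha1(content).hexdigest()


def build_index(known_files):
    """Build {fingerprint: origin} from a {origin: content} mapping of known OSS files."""
    return {fingerprint(content): origin for origin, content in known_files.items()}


def match(content: bytes, index):
    """Return the origin of an identical known file, or None if there is no match."""
    return index.get(fingerprint(content))


# Hypothetical index of known open source files.
index = build_index({
    "example-lib 1.0: util.c": b"int add(int a, int b) { return a + b; }\n",
})

# An identical copy matches; any change to the bytes defeats an exact fingerprint.
print(match(b"int add(int a, int b) { return a + b; }\n", index))
print(match(b"int add(int x, int y) { return x + y; }\n", index))
```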

Scanning plus Matching

▷ Scanning will identify origin and license in most cases, but

○ Does not detect copying of snippets, or

○ Intentional stripping of notices, etc.

▷ Matching can identify code that was copied and/or stripped, but

○ Typically produces MANY false positives and requires extensive review

○ Especially for the most commonly used OSS projects

How does ScanCode work? (1)

▷ Each file is categorized based on its type

▷ Archives and compressed files are fully extracted

▷ The text of each file is collected (source and binaries)

▷ Each file's text is then "scanned"

▷ Results are formatted and returned as a JSON file

▷ You can view the results in a browser, or

▷ Use the JSON file as you want
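The steps above can be sketched as a small pipeline. This is a simplified illustration and not ScanCode's implementation: archive extraction is left out, the text/binary test is a crude null-byte check, and `scan_text` is a placeholder that only flags lines mentioning "copyright". All function names and JSON field names here are assumptions.

```python
import json
import re


def is_binary(data: bytes) -> bool:
    """Crude categorization: treat anything containing a null byte as binary."""
    return b"\x00" in data


def collect_text(data: bytes) -> str:
    """Collect text from a source file, or printable strings from a binary."""
    if not is_binary(data):
        return data.decode("utf-8", errors="replace")
    # For binaries, keep runs of 4+ printable ASCII characters.
    runs = re.findall(rb"[\x20-\x7e]{4,}", data)
    return "\n".join(run.decode("ascii") for run in runs)


def scan_text(text: str) -> dict:
    """Placeholder scanner: report lines that mention 'copyright'."""
    lines = [line.strip() for line in text.splitlines()]
    return {"copyright_lines": [l for l in lines if "copyright" in l.lower()]}


def scan_files(files: dict) -> str:
    """Run categorize -> collect text -> scan on {path: bytes}; return JSON."""
    results = []
    for path, data in sorted(files.items()):
        entry = {"path": path, "type": "binary" if is_binary(data) else "text"}
        entry.update(scan_text(collect_text(data)))
        results.append(entry)
    return json.dumps({"results": results}, indent=2)
```

Collecting printable strings from binaries is what lets the same text scanners run on both source and compiled files.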

How does ScanCode work? (2)

▷ For licenses, the techniques are similar to DNA analysis with multi-pattern matching

▷ Licenses are found exactly or approximately based on a set of thousands of license texts, notices and examples

▷ For copyrights, a syntax and grammar analyzer captures the many forms of copyright statements

▷ Emails, URLs, authors, person names and other data are captured using similar pattern matching techniques
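The idea of finding licenses "exactly or approximately" against a set of reference texts can be illustrated with a toy example. This sketch swaps in Python's `difflib.SequenceMatcher` over word tokens as a stand-in for ScanCode's own matching; the two-entry `RULES` table, the `detect_license` function and the 0.8 threshold are assumptions for illustration only. It covers the license case, not the grammar-based copyright detection.

```python
import difflib
import re

# Tiny stand-in for the thousands of reference texts, notices and examples.
RULES = {
    "mit": "permission is hereby granted free of charge to any person "
           "obtaining a copy of this software",
    "apache-2.0": "licensed under the apache license version 2.0",
}


def tokens(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def detect_license(text, threshold=0.8):
    """Return (license_key, score) for the best rule, or (None, score) below threshold."""
    query = tokens(text)
    best_key, best_score = None, 0.0
    for key, rule in RULES.items():
        # ratio() is 1.0 for identical token sequences and lower for partial overlap.
        score = difflib.SequenceMatcher(None, tokens(rule), query).ratio()
        if score > best_score:
            best_key, best_score = key, score
    if best_score >= threshold:
        return best_key, best_score
    return None, best_score


# Exact notice, a notice with a spelling variant, and unrelated text.
print(detect_license("Licensed under the Apache License, Version 2.0"))
print(detect_license("Licensed under the Apache Licence, Version 2.0"))
print(detect_license("hello world"))
```

Comparing token sequences rather than raw characters makes the match insensitive to punctuation, case and line wrapping, which is where notices most often differ.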

Alternatives and complements

▷ Open source such as:

○ FOSSology (C, PHP): regex-based

○ Ninka (Perl): regex and sentence-based

○ OSLC (Java, unmaintained)

▷ Commercial

▷ Complementary:

○ AboutCode: document origin side-by-side with code, collect inventory, generate attribution doc

○ TraceCode (not yet released): trace the source-to-binary transformation to find (static) linking and which subset of the source code is used, either by dynamically tracing a build or by static analysis
