carol v. alexandru, sebastiano panichella, sebastian...

Carol V. Alexandru, Sebastiano Panichella, Sebastian Proksch, Harald C. GallSoftware Evolution and Architecture Lab

University of Zurich, Switzerland{alexandru,panichella,proksch,gall}@ifi.uzh.ch

26.03.2018

59th CREST Open WorkshopCentre for Research on Evolution, Search and Testing

University College London, London, United Kingdom

The Problem Domain

• Static analysis (e.g. #Attr., McCabe, coupling...)

The Problem Domain

v0.7.0 v1.0.0 v1.3.0 v2.0.0 v3.0.0 v3.3.0 v3.5.0

The Problem Domain

v0.7.0 v1.0.0 v1.3.0 v2.0.0 v3.0.0 v3.3.0 v3.5.0

• Many revisions, fine-grained historical data

A Typical Analysis Process

select project

checkout

select project

select revision

checkout

Purpose-built,language specific tool

select project

select revision

applytool

storeanalysis

results

checkout

select project

select revision

applytool

storeanalysis

results

more revisions?

checkout

select project

select revision

applytool

storeanalysis

results

more revisions?

more projects?

Redundancies all over...

Redundancies in historical code analysis

Impact on Code Study Tools

Few files change

Only small parts of a file change

Across RevisionsImpact on Code

Study Tools

Few files change

Study Tools

Repeated analysis of "known" code

Few files change

Changes may not even affect results

Study Tools

Storing redundant results

Few files change

Across Languages

Each language has their own toolchain

Yet they share many metrics

Study Tools

Few files change

Across Languages

Study Tools

Re-implementing identical analyses

Generalizability is expensive

Few files change

Across Languages

Study Tools

Re-implementing identical analyses

Generalizability is expensive

#1: Avoid Checkouts

Avoid checkouts

checkout

read write

Avoid checkouts

checkout

read write

analyze

Avoid checkouts

checkout

For every file: 2 read ops + 1 write opCheckout includes irrelevant filesNeed 1 CWD for every revision to be analyzed in parallel

read write

analyze

Avoid checkouts

clone analyze

Avoid checkouts

clone analyze

Only read relevant files in a single read opNo write opsNo overhead for parallization

Avoid checkouts

clone analyze

Analysis Tool

File Abstraction Layer

Avoid checkouts

clone analyze

Analysis Tool

E.g. for the JDK Compiler:

class JavaSourceFromCharrArray(name: String, val code: CharBuffer)extends SimpleJavaFileObject(URI.create("string:///" + name), Kind.SOURCE) {override def getCharContent(): CharSequence = code

Avoid checkouts

clone analyze

Analysis Tool

E.g. for the JDK Compiler:

class JavaSourceFromCharrArray(name: String, val code: CharBuffer)extends SimpleJavaFileObject(URI.create("string:///" + name), Kind.SOURCE) {override def getCharContent(): CharSequence = code

#2: Use a multi-revision representationof your sources

Merge ASTs

rev. 1

rev. 2

rev. 3

rev. 4

rev. 1

rev. 2

rev. 3

rev. 4

rev. 1

Merge ASTs

rev. 1

rev. 2

rev. 3

rev. 4

rev. 2

Merge ASTs

rev. 1

rev. 2

rev. 3

rev. 4

rev. 3

Merge ASTs

rev. 1

rev. 2

rev. 3

rev. 4

Merge ASTs

rev. 1

rev. 2

rev. 3

rev. 4

Merge ASTs

rev. 1

rev. 2

rev. 3

rev. 4

rev. range [1-4]

rev. range [1-2]

Merge ASTs

rev. 1

rev. 2

rev. 3

rev. 4AspectJ (~440k LOC):

1 commit: 2.2M nodesAll >7000 commits: 6.5M nodes

Merge ASTs

rev. 1

rev. 2

rev. 3

Merge ASTs

rev. 1

rev. 2

rev. 3

#3: Store AST nodes only ifthey're needed for analysis

public class Demo {

public void run() {

for (int i = 1; i< 100; i++) {

if (i % 3 == 0 || i % 5 == 0) {

System.out.println(i)

What's the complexity (1+#forks) and name for each method and class?

140 AST nodes(using ANTLR)

public class Demo {

public void run() {

for (int i = 1; i< 100; i++) {

if (i % 3 == 0 || i % 5 == 0) {

parseCompilationUnit

TypeDeclaration

Modifiers

public

Members

Method

Modifiers

public

Parameters ReturnType

PrimitiveType

Statements

public class Demo {

public void run() {

for (int i = 1; i< 100; i++) {

if (i % 3 == 0 || i % 5 == 0) {

parse filtered parse

TypeDeclaration

Method

DemoForStatement

IfStatement

ConditionalExpression

public class Demo {

public void run() {

for (int i = 1; i< 100; i++) {

if (i % 3 == 0 || i % 5 == 0) {

public class Demo {

public void run() {

for (int i = 1; i< 100; i++) {

if (i % 3 == 0 || i % 5 == 0) {

What's the complexity (1+#forks) and name for eachmethod and class?

TypeDeclaration

Method

DemoForStatement

IfStatement

public class Demo {

public void run() {

for (int i = 1; i< 100; i++) {

if (i % 3 == 0 || i % 5 == 0) {

What's the complexity (1+#forks) and name for eachmethod and class?

TypeDeclaration

Method

DemoForStatement

IfStatement

#4: Use non-duplicative data structuresto store your results

rev. 1

rev. 2

rev. 3

rev. 4

rev. 1

rev. 2

rev. 3

rev. 4

rev. 1

rev. 2

rev. 3

rev. 4

InnerClass

rev. 1

rev. 2

rev. 3

rev. 4 [1-1]

InnerClass

LISA also does:#5: Parallel Parsing#6: Asynchronous graph computation#7: Generic graph computationsapplying to ASTs from compatiblelanguages

A light-weight view onmulti-language analysis

Typical solutions

• Toolchains / Frameworks

• Integrate language-specific tooling

• Lots of engineering required

• Meta-models

• Translate language code to some common

representation

• Significant overhead / rigid models

Structure matters most

• Complexity?

• # of Functions / Attributes etc.

• Coupling between Classes

• Call graphs

if (true) {

if (true) { }

} # CYCLO: 3

if (true) { }

# CYCLO: 4

Relative structure is similar

public class Demo {

public void run() {

for (int i = 1; i< 100; i++) {

if (i % 3 == 0 || i % 5 == 0) {

class Demo:

def run():

for i in range(1, 100):

if i % 3 == 0 or i % 5 == 0:

print(i)

public class Demo {

public void run() {

for (int i = 1; i< 100; i++) {

if (i % 3 == 0 || i % 5 == 0) {

class Demo:

def run():

if i % 3 == 0 or i % 5 == 0:

print(i)

public class Demo {

public void run() {

for (int i = 1; i< 100; i++) {

if (i % 3 == 0 || i % 5 == 0) {

class Demo:

def run():

if i % 3 == 0 or i % 5 == 0:

print(i)

function

Entities are different

public class Demo {

public void run() {

for (int i = 1; i< 100; i++) {

if (i % 3 == 0 || i % 5 == 0) {

class Demo:

def run():

if i % 3 == 0 or i % 5 == 0:

print(i)

function

CLASS_DECL

METHOD_DECL

FOR_STATIF_STAT

METHOD_INVOK

classdef

funcdefforstat

ifelsestatatomexpr

Can filter irrelevant nodes

public class Demo {

public void run() {

for (int i = 1; i< 100; i++) {

if (i % 3 == 0 || i % 5 == 0) {

class Demo:

def run():

if i % 3 == 0 or i % 5 == 0:

print(i)

function

Can filter irrelevant nodes

public class Demo {

public void run() {

for (int i = 1; i< 100; i++) {

if (i % 3 == 0 || i % 5 == 0) {

class Demo:

def run():

if i % 3 == 0 or i % 5 == 0:

print(i)

function

How LISA handlesMultiple languages

Analysis formulation

• Signal/Collect (like “Google Pregel” for Scala)

• Graph vertices send information packets

(signals) and do something when receiving

(collecting) signals.

function

functionfunc-count signal

signal

function

collect

func-count signal

fc = fc+1+1 = 2

functionfunc-count signal

signal

function

collect

func-count signal

fc = fc+1 = 3

• Analyses use generic lables for entities

Light-weight entity mappings

Lables used by Analyses

Lables used by the parser or

grammar

Parsing

• AST is kept as the parser supplies it

• ANTLRv4 integration

• Filtering can be enabled

• Only AST nodes that correspond to a label

used in an analysis are kept

• Reduces graph size by a factor of 10 or more

Adding new languages

1. Integrate a parser (or generate one)

• Graph interface allows adding nodes/edges

2. Write a node mapping

3. Re-use existing analyses on new ASTs

To Summarize...

The LISA Analysis Process

select project

parallel parseinto mergedgraph

select project

Language Mappings(Grammar to Analysis)

determines which ASTnodes are loaded

ANTLRv4Grammar used by

Generates Parser

select project

storeanalysis

results

Analysis formulated as Graph Computation

Language Mappings(Grammar to Analysis) used by

Async. compute

determines whichdata is persisted

Generates Parser

runson graph

select project

storeanalysis

results

more projects?

Analysis formulated as Graph Computation

Language Mappings(Grammar to Analysis) used by

Async. compute

determines whichdata is persisted

Generates Parser

runson graph

How well does it work, then?

Marginal cost for +1 revision

41.670

4.6330.525 0.109 0.082 0.071 0.052 0.041 0.032 0.033

1 10 100 1000 2000 3000 4000 5000 6000 7000

Average Parsing+Computation time per Revision when analyzing n revisions of AspectJ (10 common metrics)

# of revisions

Overall Performance Stats

Language Java C# JavScript Python

#Projects 1000 1000 1000 1000

#Revisions 2 Million 1.4 Million 1.5 Million 2.3 Million

#Files (parsed!) 10 Billion 3 Billion 380 Thousand 1.2 Billion

#Lines (parsed!) 1.6 Trillion 0.6 Trillion 43 Billion 0.3 Trillion

Total Runtime (RT)¹ 2d 5h 3d 23h 2d 4h

Median RT¹ 8.35s 40.5s 14.4s 24.5s

Tot. Avg. RT per Rev.²

97ms 183ms 57ms 83ms

Med. Avg. RT per Rev.²

41ms 88ms 25ms 48ms

¹ Including cloning and persisting results² Excluding cloning and persisting results

What's the catch?(There are a few...)

The (not so) minor stuff

• Must implement analyses from scratch

• No help from a compiler

• Non-file-local analyses need some effort

The (not so) minor stuff

• Must implement analyses from scratch

• No help from a compiler

• Non-file-local analyses need some effort

• Moved files/methods etc. add overhead

• Uniquely identifying files/entities is hard

• (No impact on results, though)

Parser performance matters

Javascript C# Java Python

LISA is E X TREME

complexfeature-richheavyweight

simplegeneric

lightweight

Thank you for your attention

59th CREST Open Workshop, London, 26.03.2018

Read the paper: http://t.uzh.ch/Fj(upcoming EMSE publication is much more detailed)

Try the tool: http://t.uzh.ch/Fk

Get the slides: http://t.uzh.ch/NR

Contact me: alexandru@ifi.uzh.ch

carol v. alexandru, sebastiano panichella, sebastian...

Documents

sebastiano pulvirenti, faro febbraio:marzo 2015

sebastiano pulvirenti sepulvi@libero.it 1 dal c.i.p.p. model...

sebastiano cappa...

sebastiano pulvirenti, la modularità formativa

carol v. alexandru, sebastiano panichella, harald c. gall...

1 strumenti della teoria dei giochi per linformatica a.a....

sebastiano aglieco - quaderni

plaquette sbt sebastiano :

40 sebastiano del piombo cilmi

sebastiano scandura ctmm814018

san sebastiano saluta sassari

sebastiano della lena fabrizio panebianco

2010 - sebastiano a. patanè - album

ron lavi – chaitanya swamy 1 strumenti della teoria dei...

engleza - cristea george alexandru - dobraniste alexandru

andrea palladio y sebastiano serlio

130719 sebastiano panichella - who is going to mentor...

sebastiano a.g. lava

alexandru din opera alexandru refuzind apa

sebastiano pulvirenti, faro febbraio 2015