scam 2013 — agile modeling keynote

Oscar Nierstrasz, Software Composition Group, scg.unibe.ch. SCAM 2013. Agile Modeling: When can we have it?


DESCRIPTION

Keynote presentation for the 13th IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM 2013), Eindhoven, The Netherlands, 22-23 September 2013. http://www.ieee-scam.org/2013/program.html

TRANSCRIPT

Page 1: Scam 2013 — Agile Modeling Keynote

Oscar Nierstrasz, Software Composition Group

scg.unibe.ch

SCAM 2013

Agile Modeling: When can we have it?

Page 2: Scam 2013 — Agile Modeling Keynote

Directions

Agile Software Assessment

Roadmap

Agile Modeling

Page 3: Scam 2013 — Agile Modeling Keynote

Agile Software Assessment

Page 4: Scam 2013 — Agile Modeling Keynote

4

Legacy code is hard to understand

Real software systems change over time. Much development effort goes not into building new systems but into maintaining, adapting, and extending existing ones. Yet these systems are hard to understand, because they tend to grow more complex over time.

Page 5: Scam 2013 — Agile Modeling Keynote

5

The architecture

... is not in the code

Legacy code is hard to understand

Although the architecture is one of the most important artifacts of a system to understand, it is not easily recoverable from the code. This is because: (1) a system may have many architectures (e.g., layered and thin-client), (2) there are many kinds of architecture (static, run-time, build, etc.), and (3) programming languages offer no support for encoding architectural constraints aside from coarse structuring and interfaces.

Page 6: Scam 2013 — Agile Modeling Keynote

6

Developers spend more time reading than writing code

Legacy code is hard to understand

The architecture is not in the code

Especially with OO code, where the static source does not reflect run-time behaviour well, more time is spent reading than writing: to understand the code and to understand the impact of a change. IDEs do not support this reading activity well, since they focus on programming-language concepts (editing, compiling and debugging), not on architectural constraints or user features.

Page 7: Scam 2013 — Agile Modeling Keynote

Legacy code is hard to understand

The architecture is not in the code

Developers spend more time reading than writing code

7

Specialized analyses require custom tools

Analysis tools are mostly limited to programming tasks. We have built many specialized tools, for example to mine features, extract architecture, or analyze evolution, but such tools are expensive to build. We need an IDE with better tools, and one offering a better meta-tooling environment (tools to build tools quickly).

Page 8: Scam 2013 — Agile Modeling Keynote

8

Legacy code is hard to understand

The architecture is not in the code

Developers spend more time reading than writing code

Specialized analyses require custom tools

Page 9: Scam 2013 — Agile Modeling Keynote

9

Moose is a platform for software and data analysis

www.moosetechnology.org

Moose is a platform for modeling software artifacts to enable software analysis. Moose has been developed for well over a decade. It is the work of dozens of researchers, and has been the basis of numerous academic and industrial projects.

Page 10: Scam 2013 — Agile Modeling Keynote

10

[Moose architecture diagram: importers for Smalltalk, Java, C++ and COBOL, plus an MSE external parser, feed a model repository built on an extensible meta model; tools such as CodeCrawler, ConAn, Van and Hapax provide navigation, metrics, querying and grouping, all implemented in Smalltalk.]

Nierstrasz et al. The Story of Moose. ESEC/FSE 2005

The key to Moose’s flexibility is its extensible metamodel. Every tool that builds on Moose is able to extend the metamodel with concepts that it needs, such as history.
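To make the idea concrete, here is a rough sketch of such extensibility (a hypothetical Python stand-in, not the actual Fame/FAMIX API): core entities carry an open-ended property dictionary, so a tool can attach new concepts without touching the core classes.

    # Minimal sketch of an extensible metamodel (hypothetical, not the real Fame/FAMIX API).
    class Entity:
        def __init__(self):
            self.extensions = {}          # tool-defined properties live here

    class Method(Entity):
        def __init__(self, name, lines_of_code=0):
            super().__init__()
            self.name, self.lines_of_code = name, lines_of_code

    class Class(Entity):
        def __init__(self, name, methods=()):
            super().__init__()
            self.name, self.methods = name, list(methods)

    # A history-analysis tool extends the metamodel with a "versions" concept
    # without modifying the core classes above.
    parser = Class("Parser", [Method("parse", 120), Method("reset", 5)])
    parser.extensions["versions"] = ["1.0", "1.1", "2.0"]
    print(parser.name, "touched in", len(parser.extensions["versions"]), "releases")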

Page 11: Scam 2013 — Agile Modeling Keynote

11

System complexity

Lanza et al. Polymetric Views. TSE 2003

Page 12: Scam 2013 — Agile Modeling Keynote

12

System complexity, Clone evolution view, Class blueprint, Topic Correlation Matrix, Distribution Map for topics spread over classes in packages, Hierarchy Evolution view, Ownership Map

Page 13: Scam 2013 — Agile Modeling Keynote

13

[The Moose architecture diagram again: importers for Smalltalk, Java, C++ and COBOL feed the model repository through the extensible meta model, with CodeCrawler, ConAn, Van and Hapax built on top.]

But we have a huge bottleneck for new languages: the bottleneck is constructing a parser to build the model.

Page 14: Scam 2013 — Agile Modeling Keynote


14

What is Agile Software Assessment?

Page 15: Scam 2013 — Agile Modeling Keynote

15

Challenge

“What will my code change impact?”

Mircea’s vision: Large software systems are so complex that one can never be sure until integration whether certain changes can have catastrophic effects at a distance. Ideas: tracking software architecture; exploiting big software data.

Page 16: Scam 2013 — Agile Modeling Keynote

16

Build a new assessment tool in ten minutes

Challenge

Custom analyses require custom tools. Building a tool should be as easy as writing a query in SQL or a form-based interface.
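As an illustration of the kind of "ten minute tool" meant here, a hedged sketch follows (hypothetical model objects; in Moose itself this would be a short Smalltalk query over FAMIX entities):

    # Sketch of a throwaway assessment tool: flag classes that look like god
    # classes in an already-loaded model. The model objects are hypothetical.
    def god_classes(model, max_methods=40, max_loc=1500):
        for cls in model.classes:
            loc = sum(m.lines_of_code for m in cls.methods)
            if len(cls.methods) > max_methods or loc > max_loc:
                yield cls.name, len(cls.methods), loc

    # for name, n_methods, loc in god_classes(model):
    #     print(f"{name}: {n_methods} methods, {loc} lines")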

Page 17: Scam 2013 — Agile Modeling Keynote

17

Load the model in the morning, analyze it in the afternoon

Challenge

The key bottleneck to assessment is creating a suitable model for analysis. If a tool does not already exist, it can take days, weeks or months to parse source files and generate models.

Page 18: Scam 2013 — Agile Modeling Keynote

Agile Modeling

Page 19: Scam 2013 — Agile Modeling Keynote

19

Agile Modeling Lifecycle

Build a coarse model

Build a custom analysis

Refine the model

Page 20: Scam 2013 — Agile Modeling Keynote

20

Problems

Heterogeneous projects

Unknown languages

Unstructured text

Developing a parser for a new language is a big challenge, and parsers may be hard to scavenge from existing tools. Not only source code but also other sources of information, like bug reports and emails, can be invaluable for model building. Few projects today are built using a single language: often a general-purpose language is mixed with scripting languages, or even home-brewed DSLs.

Page 21: Scam 2013 — Agile Modeling Keynote

Ideas

Grammar Stealing

Hooking into an existing tool

Recycling Trees

Parsing by Example

Evolutionary Grammar Generation

[The slide shows thumbnails of thesis and paper pages illustrating these ideas; the excerpts appear in full on the following slides.]

Page 22: Scam 2013 — Agile Modeling Keynote

22

Hooking into an existing tool

Page 23: Scam 2013 — Agile Modeling Keynote

Nice if you can have it, but it just defers the problem to another platform.

23

Page 24: Scam 2013 — Agile Modeling Keynote

24

Grammar Stealing

Page 25: Scam 2013 — Agile Modeling Keynote

25

Page 26: Scam 2013 — Agile Modeling Keynote

26

...stance). Figure 2 shows the first two cases; the third is just a combination. If you start with a hard-coded grammar, you must reverse-engineer it from the handwritten code. Fortunately, the comments of such code often include BNF rules (Backus Naur Forms) indicating what the grammar comprises. Moreover, because compiler construction is well understood (there is a known reference architecture), compilers are often implemented with well-known implementation algorithms, such as a recursive descent algorithm. So, the quality of a hard-coded parser implementation is usually good, in which case you can easily recover the grammar from the code, the comments, or both. Except in one case, the Perl language [14], the quality of the code we worked with was always sufficient to recover the grammar.

If the parser is not hard-coded, it is generated (the BNF branch in Figure 2), and some BNF description of it must be in the compiler source code. So, with a simple tool that parses the BNF itself, we can parse the BNF of the language that resides in the compiler in BNF notation, and then extract it.

When the compiler source code is not accessible (we enter the Language Reference Manual diamond in Figure 2), either a reference manual exists or not. If it is available, it could be either a compiler vendor manual or an official language standard. The language is explained either by example, through general rules, or by both approaches. If a manual uses general rules, its quality is generally not good: reference manuals and language standards are full of errors. It is our experience that the myriad errors are repairable. As an aside, we once failed to recover a grammar from the manual of a proprietary language for which the compiler source code was also available (so this case is covered in the upper half of Figure 2). As you can see in the coverage diagram, we have not found low-quality language reference manuals containing general rules for cases where we did not have access to the source code. That is, to be successful, compiler vendors must provide accurate and complete documentation, even though they do not give away their compilers' source code for economic reasons. We discovered that the quality of those manuals is good enough to recover the grammar. This applies not only to compiler-vendor manuals but also to all kinds of de facto and official language standards.

Unusual languages rarely have high-quality manuals: either none exists (for example, if the language is proprietary) or the company has only a few customers. In the proprietary case, a company is using its in-house language and so has access to the source code; in the other case, outsiders can buy the code because its business value is not too high. For instance, when Wang went bankrupt, its ...

(IEEE Software, November/December 2001, p. 82)

[Figure 2. Coverage diagram for grammar stealing: a decision tree starting from "Compiler sources?", branching over "Hard-coded parser", "BNF", "Language reference manual?", "General rules", "Constructions by example" and "Quality?", with "Recover the grammar" at the leaves; annotated "One case: perl", "One case: RPG" and "No cases known".]

Still takes a couple of weeks and lots of expertise
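The mechanical core of stealing a grammar from a hard-coded parser, harvesting BNF-looking rules out of comments, can be sketched in a few lines. This is a naive illustration with an assumed file layout, not the recovery tooling described in the excerpt; real grammar recovery adds correction and completion passes on top of a step like this.

    # Sketch: harvest BNF-looking rules from the comments of a hand-coded parser.
    import re
    from pathlib import Path

    RULE = re.compile(r'([A-Za-z_][A-Za-z0-9_-]*)\s*(?:::=|->)\s*(.+)')

    def steal_rules(source_dir, extensions=(".c", ".h", ".y")):
        rules = {}
        for path in Path(source_dir).rglob("*"):
            if path.suffix not in extensions:
                continue
            for line in path.read_text(errors="ignore").splitlines():
                candidate = line.strip().lstrip("/*# ")     # peel off comment markers
                match = RULE.match(candidate)
                if match:
                    rules.setdefault(match.group(1), []).append(match.group(2))
        return rules

    # rules = steal_rules("some-compiler-src/")   # hypothetical directory
    # print(len(rules), "candidate nonterminals recovered")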

Page 27: Scam 2013 — Agile Modeling Keynote

27

Recycling Trees

...of this phase will be a model of the Ruby software system. As the meta-model is FAME compliant, also the model will be. Information about the ClassLoader, an instance responsible for loading Java classes, is covered in section 4.7.

The Fame framework automatically extracts a model from an instance of an Eclipse AST. This instance corresponds to the instance of the Ruby plugin AST representing the software system. Automation is possible due to the fact that we defined the higher level mapping. Figure 2.1 reveals the need for the higher mapping to be restored. In order to implement the next phase independently from the environment used in this phase we extracted the model into an MSE file.

Figure 2.1: The dotted lines correspond to the extraction of a (meta-)model. The other arrows between the model and the software system hierarchy show which Java tower level corresponds to which meta-model tower element.

2.3 Model Mapping by Example phase

Our previously extracted model still contains platform dependent information and thus is not a domain specific model for reverse engineering. It could be used by very specific or very generic reverse engineering tools, as it contains the concrete syntax tree of the software system only. However such tools do not exist. In the Model Mapping by Example phase we want to transform the model into a FAMIX compliant one. With such a format it will be easier to use in several software engineering tools.

The idea behind this approach relies on Parsing by Example [3]. Parsing by Example presents a semi-automatic way of mapping source code to domain ...

Daniel Langone. Recycling Trees: Mapping Eclipse ASTs to Moose Models. Bachelor's thesis, University of Bern.

Page 28: Scam 2013 — Agile Modeling Keynote

28

1. Infer AST implementation from IDE plugin
2. Extract metamodel from plugin
3. Map model elements to FAMIX (Moose)
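A rough analogue of step 3 in another setting (Python's own ast module standing in for the Eclipse/IDE AST, and a plain dictionary standing in for a FAMIX/Moose model):

    # Sketch: walk an existing AST and emit flat, FAMIX-like model entities.
    import ast

    def ast_to_model(source):
        model = {"classes": [], "methods": [], "attributes": []}
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.ClassDef):
                model["classes"].append(node.name)
                for child in node.body:
                    if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                        model["methods"].append(f"{node.name}.{child.name}")
                    elif isinstance(child, ast.Assign):
                        for target in child.targets:
                            if isinstance(target, ast.Name):
                                model["attributes"].append(f"{node.name}.{target.id}")
        return model

    print(ast_to_model("class A:\n    x = 1\n    def f(self): pass\n"))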

Page 29: Scam 2013 — Agile Modeling Keynote

29

Hard to recognize ASTs; still need to map to model elements.

AST implementations are always different and hard to generalize.

Page 30: Scam 2013 — Agile Modeling Keynote

30

Parsing by Example

Nierstrasz et al. Example-Driven Reconstruction of Software Models. CSMR 2007

Page 31: Scam 2013 — Agile Modeling Keynote

31

[Parsing by Example workflow: (1) import the source code; (2) specify examples as mappings from code snippets to model elements; (3) infer a grammar from the example mappings; (4) generate a parser; (5) parse the source code into a model; (6) export the model.]
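A toy version of the example-mapping idea (much simpler than CodeSnooper: a single hypothetical helper that generalizes one example into a regular expression rather than inferring a grammar):

    # Toy "parsing by example": generalize one user-supplied example into a
    # pattern and reuse it to harvest further model elements.
    import re

    def pattern_from_example(example_line, example_name):
        escaped = re.escape(example_line.strip())
        return re.compile(escaped.replace(re.escape(example_name), r"(\w+)"))

    source = """
    class Parser { }
    class Scanner { }
    class Token { }
    """

    # The user states: the line "class Parser { }" defines a class named Parser.
    pattern = pattern_from_example("class Parser { }", "Parser")
    classes = [m.group(1) for line in source.splitlines()
               if (m := pattern.match(line.strip()))]
    print(classes)   # ['Parser', 'Scanner', 'Token']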

Page 32: Scam 2013 — Agile Modeling Keynote

32

CodeSnooper

Page 33: Scam 2013 — Agile Modeling Keynote

33

Page 34: Scam 2013 — Agile Modeling Keynote

34

...result to that obtained with a robust parser we find that we only miss five classes:

                                 Precise Model   Our Model
    Number of Model Classes          366            361
    Number of Abstract Classes       233            233

In a second iteration we give examples of methods in abstract and concrete classes as well as interfaces. This leads to three separate grammars which cannot easily be merged since this would lead to an ambiguous grammar [12]. Instead we generate three parsers and apply them in parallel to the source files. We now obtain the following results:

                                 Precise Model   Our Model
    Number of Model Classes          366            316
    Number of Abstract Classes       233            233
    Total Number of Methods         1887           1648

In addition to the two files we could not parse earlier, we now have some problems due to (i) attributes being confused with methods, (ii) language constructs (like static) occurring in unexpected contexts, (iii) different kinds of definitions of methods. Additional examples would help to solve these problems.

In a third iteration we add examples to recognize attributes. Once again we obtain three parsers based on three sets of examples for abstract classes, concrete classes and interfaces. We obtain the following results:

                                 Precise Model   Our Model
    Number of Model Classes          366            346
    Number of Abstract Classes       233            230
    Total Number of Methods         1887           1780
    Total Number of Attributes       395            304

This process can be repeated to cover more and more of the subject language. The question of when to stop can be answered with "When the results are good enough". Good enough in this context means when we have enough information for a specific reverse engineering task. For example, a "System Complexity View" [18] is a visualization used to obtain an initial impression of a legacy software system. To generate such a view we need to parse a significant number of the classes, identify subclass relations, and establish the numbers of methods and attributes of each class. Even if we parse only 80% of the code, we can still get an initial impression of the state of the system. If on the other hand we want to display a "Class Blueprint" [17], a semantically enriched visualization of the internal structure of classes, we would need a refined grammar to extract more information. The "good enough" is thus given by the reverse engineering goals, which vary from case to case.

4.2 Ruby

As second case study we chose the language Ruby, because it is quite different from Java and it has a non-trivial grammar. We took the unit testing library distributed with Ruby version 1.8.2, released at the end of 2004. This part of the library contains 22 files written in Ruby. We do not have a precise parser for Ruby that can generate a FAMIX model (actually, to our knowledge, for Ruby there is only one precise parser, namely the Ruby interpreter itself). Instead we retrieve the reference model by inspecting the source code manually.

In Ruby there are Classes and Modules. Modules are collections of Methods and Constants. They cannot generate instances. However they can be mixed into Classes and other Modules. A Module cannot inherit from anything. Modules also have the function of Namespaces. Ruby does not support Abstract Classes [22].

For the definition of the scanner tokens for identifiers and comments we use the following regular expressions:

    <IDENTIFIER>: [a-zA-Z$] \w* (\? | \!)? ;
    <comment>:    \# [^\r\n]* <eol> ;

Using just 2 examples each of namespaces, classes, methods and attributes, we are able to parse 7 of the 22 files.

                                 Precise Model   7 files   Our Model
    Number of Namespaces                8            6          6
    Number of Model Classes            25            4          4
    Total Number of Methods           247           26         26
    Total Number of Attributes        136            9          9

Amongst the files we could not parse, there are 4 large files containing GUI code. If we ignore these files, we are able to detect about 25% of the target elements.

There are two main reasons that so few files can be successfully parsed:

1. The comment character # occurs frequently in strings and regular expressions, causing our simple-minded scanner to fail. A better scanner would fix this problem. With some simple preprocessing (removing any hash character that occurs inside a string and removing all comments) we can improve recall to 65-85%.

2. Ruby offers a very rich syntax for control constructs, allowing the same keywords to occur in many different positions and contexts. One would need many more examples to recognize these constructs.


JBoss case

Ruby case

Problems:
• Ambiguity
• False positives
• False negatives
• Embedded languages

Markus Kobel. Parsing by Example. MSc thesis, University of Bern, April 2005.

One major failure was to use SmaCC, which is LALR with a separate scanner. Better to use a scannerless parser and PEGs, Earley parsing, or GLR parsing.
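For flavour, a minimal scannerless PEG combinator core (a sketch only; PetitParser and similar frameworks add packrat memoization on (parser, position) to keep backtracking cheap, which is omitted here):

    # Minimal scannerless PEG combinators: a parser is a function from
    # (text, position) to (value, new_position), or None on failure.
    def lit(s):
        def parse(text, i):
            return (s, i + len(s)) if text.startswith(s, i) else None
        return parse

    def seq(*parsers):
        def parse(text, i):
            values = []
            for p in parsers:
                result = p(text, i)
                if result is None:
                    return None
                value, i = result
                values.append(value)
            return values, i
        return parse

    def alt(*parsers):                      # PEG ordered choice: first match wins
        def parse(text, i):
            for p in parsers:
                result = p(text, i)
                if result is not None:
                    return result
            return None
        return parse

    def many(parser):                       # greedy repetition, never backtracks into it
        def parse(text, i):
            values = []
            while (result := parser(text, i)) is not None:
                value, i = result
                values.append(value)
            return values, i
        return parse

    digit = alt(*[lit(c) for c in "0123456789"])
    number = seq(digit, many(digit))
    print(number("2013 keynote", 0))        # (['2', ['0', '1', '3']], 4)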

Page 35: Scam 2013 — Agile Modeling Keynote

35

Evolutionary Grammar Generation

Since biological evolution starts from an existing population of species, we need to bootstrap an initial population before we can begin evolving it. This initial population is generally a number of random individuals. These initial individuals usually don't perform well, although some will already be a tad better than others. That is exactly what we need to get evolution going.

The final part is reproduction, i.e. to generate a new generation from the surviving previous generation. For that purpose an evolutionary algorithm usually uses two types of genetic operators: point mutation and crossover (we will refer to point mutations as mutations, although crossover is technically also a mutation). Mutations change an individual in a random location to alter it slightly, thus generating new information. Crossover [1], however, takes at least two individuals and cuts out part of one of them, to put it in the other individual(s). By only moving around information, crossover does not introduce new information. Be aware that every modification of an individual has to result in a new individual that is valid. Validity is very dependent on the search space: it generally means that the fitness function as well as the genetic operators should be applicable to a valid individual. A schematic view is shown in fig. 3.1.

[Figure 3.1: Principles of an Evolutionary Algorithm. The loop: generate a new random population; select the most fit individuals; if not yet fit enough, generate a new population with the genetic operators (mutation, crossover).]

There are alternatives to rejecting a certain number of badly performing individuals per generation. To compute the new generation, one can generate new individuals from all individuals of the old generation. This would not result in an improvement since the selection is completely random. Hence the parent individuals are selected ...

[1] Crossover in biology is the process of two parental chromosomes exchanging parts of genes in the meiosis (cell division for reproduction cells).

Sandro De Zanet. Grammar Generation with Genetic Programming — Evolutionary Grammar Generation. MSc thesis, University of Bern, July 2009.

Page 36: Scam 2013 — Agile Modeling Keynote

36


Page 37: Scam 2013 — Agile Modeling Keynote

[Figure 4.4: Push a node up]

Insertion ensures diversity in graph depth. Graph depth is important for the emergence of more complex structures. The deletion, on the other hand, puts a counterweight to it. This ensures that grammars do not grow too big and that it is possible to simplify structures.

4.1.2 Crossover

Crossover is the second, less often used modification. The term is borrowed from the biological chromosomal crossover, where it describes the exchange of genes of two paired-up chromosomes (one from the mother and one from the father) in the process of the meiosis [1].

Applied to grammars, this results in copying a subgraph of one grammar into another graph. If one has two grammars p1 and p2, a new grammar is created by selecting a random node of p1 and adding it to a random node of p2. Note that the parsers will not be directly linked. Rather, the subgraph of p1 is copied independently from its parent.

The motivation for crossover is to preserve useful, more complex structures that have already evolved. By combining two modestly performing parsers one can generate a new one that combines the useful parts of the parents into a parser that performs well in the two places of the parents.

[1] Process of generating reproductive cells like sperms and eggs.

37

[Figure 4.2: Add back link node]

[Figure 4.3: Delete a node]

To ensure the evolvability of more complex parsers we need more complex mutations. After the initial population got sorted, mostly only single-character parsers were left and they couldn't mutate to parsers with more nodes.

The following mutations add the possibility to insert nodes between the current parser and the root parser:

deletion: The selected parser first moves all its children to the parent parser, thus replacing itself by its children (fig. 4.4).

insertion: The selected parser is replaced by a composite parser (sequence or choice). The selected parser is then added to the new parser. This results in the insertion of a new parser in between the selected parser and its parent. If the selected parser is the root, the new parser becomes the new root (fig. 4.5).

PEG mutation and crossover

To transform PEGs to evolvable structures, some adaptions have to be made. In the definition by Bryan Ford [7], sequence and choice operators only have two child parsers. Longer sequences (or choice statements) can be built by nesting them. To make profitable changes more probable and to keep the number of nodes down (which also increases the execution speed) they have been generalized to an arbitrary length.

First we will look at the modifications that are possible for grammars and are commonly used in evolutionary algorithms: mutation and crossover.

4.1.1 Mutation

A mutation of a grammar is the modification of one of its randomly chosen nodes. Every parser node is different and therefore it has to change itself in a different way according to its type. So for instance the character parser will mutate the character it parses. Similarly a range parser alters its range.

There are though some more common mutations that affect the structure of the grammar and are not dependent on the node. They only affect non-primitive and non-unary parser nodes:

add child: A new randomly generated parser will be inserted in the list of children (fig. 4.1).

[Figure 4.1: Add a node]

add link: Works similarly to adding a child, although in this case the new parser is not randomly generated but selected from one of the nodes of the already existing PEG. This results in a link to this parser (like a CFG rule, fig. 4.2).

remove child: A randomly selected child will be removed. No effect if there is only one child. Remark that we don't allow composite parsers with no children, since they don't constitute a valid grammar (fig. 4.3).

The mutation has no effect on unary or primitive parsers like the character parser.


[Figure 4.5: Insert a node]

4.1.3 Fitness function

The fitness function is the most important part of the evolutionary algorithm, by indirectly defining the envisioned goal. It determines the quality of PEGs, which is a representation of the distance to the solution. Without an elaborate fitness function, the search will head in the wrong direction. There is also the risk of finding non-intended solutions or of being trapped in local extrema.

Normally the fitness function is defined as a number for which higher stands for a better rating. We decided to define the fitness function the other way around: worse PEGs have a higher fitness number; a fitness of 0 defines the best achievable solution. This makes sense because we use the size metric of a PEG, which is worse the bigger a PEG gets. Furthermore, we have found a possible solution if all the characters of the source code have been parsed. Hence we measure the number of characters left to parse, which is better the smaller it is.

For our problem we first implemented the most straightforward solution. We let each PEG parse every source code file. The number of characters that it cannot parse is added to the final fitness. There is an additional penalty for not being able to parse a file at all (the parser fails on the first character). The reason behind this is that we wanted a stronger differentiation between the grammars that could at least parse one character and the completely useless grammars. In this way a better grammar is a grammar with a lower fitness, a grammar with fitness zero being one that can fully parse every source code file. We don't allow negative fitness (the reason is explained later).
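The loop itself is the standard evolutionary scheme described above. A compact sketch follows with a deliberately toy representation (an individual is just the set of characters it accepts, not a PEG node graph) and the fitness defined as in the excerpt: the number of corpus characters an individual fails to parse, 0 being best.

    # Toy sketch of the evolutionary loop over "grammars".
    import random
    random.seed(0)

    CORPUS = "cat: tom\ncat: felix\n"
    ALPHABET = sorted(set(CORPUS)) + list("qxz")

    def fitness(individual):
        return sum(1 for ch in CORPUS if ch not in individual)

    def mutate(individual):
        child = set(individual)
        child ^= {random.choice(ALPHABET)}          # flip one character in or out
        return child or {random.choice(ALPHABET)}

    def crossover(a, b):
        return {c for c in a | b if random.random() < 0.5} or set(a)

    population = [{random.choice(ALPHABET)} for _ in range(20)]
    for generation in range(200):
        population.sort(key=fitness)                # select the most fit
        if fitness(population[0]) == 0:
            break                                   # fit enough
        parents = population[:10]
        offspring = [mutate(random.choice(parents)) if random.random() < 0.8
                     else crossover(*random.sample(parents, 2))
                     for _ in range(10)]
        population = parents + offspring

    best = min(population, key=fitness)
    print(generation, fitness(best), "".join(sorted(best)))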

Page 38: Scam 2013 — Agile Modeling Keynote

38

Desired grammar:
([a-z] (‘_‘ | [0-9] | [a-z])*)
Found grammar:
(([a-z] ({‘\n‘ | ‘_‘ | [0-9]})*))*

Desired grammar:
0 -> (‘c‘ ‘a‘ ‘t‘ ‘:‘ ‘ ‘ ([a-z])+ 1 -> {2 -> (‘\n‘ 0) | e})
Found grammar:
(0 -> (‘c‘ ‘a‘ ‘t‘ ‘:‘ ‘ ‘ 2 -> (([a-z])+ ‘\n‘)))+

Desired grammar:
0 -> (‘+‘ | ‘-‘ | ‘<‘ | ‘>‘ | ‘,‘ | ‘.‘ | 1 -> (‘[‘ 2 -> (0)* ‘]‘))
Found grammar:
(0 -> {‘<‘ | ‘]‘ | ‘.‘ | ‘,‘ | ‘>‘ | ‘-‘ | ‘[‘ | ‘+‘})*

Slow and expensive. Modest results for complex languages.

Page 39: Scam 2013 — Agile Modeling Keynote

Directions

Page 40: Scam 2013 — Agile Modeling Keynote

Incrementally refine island grammars

40

Page 41: Scam 2013 — Agile Modeling Keynote

41

Progress

Global Islands are useful

Scoped Islands are useful

Verification of a scope is expensive

Tokenization and memoization help :-)

Recursive Islands and Scoped Repeating Islands lead to Left-Recursion Problems

1: If we search for a method island “globally” we don’t know to which class it belongs.
2: At each step we have to verify whether we have found the end of the scope; that verification is also expensive!
3: If you have an island containing the same island (a class with inner classes) you get indefinite recursive descent. The same holds for repeating “scoped” islands (a class followed by another class in the same file, something like class*): the end of the “class” scope is another “class”, and the first thing we have to do is check the end of scope, which leads to parsing the same “class”, which has to check its end of scope, leading to parsing the same “class” again, and so on.
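A hedged sketch of the island idea (regular expressions stand in for real island parsers, and attaching each method to the most recently seen class is a cheap substitute for proper scoped islands, which is exactly the scoping problem noted in point 1):

    # Sketch of islands in a sea of water: pick out class and method headers
    # from Java-like text and ignore everything else.
    import re

    CLASS_ISLAND = re.compile(r'\bclass\s+(\w+)')
    METHOD_ISLAND = re.compile(r'\b(?:public|private|protected)\s+\w+\s+(\w+)\s*\(')

    def harvest_islands(source):
        model, current_class = {}, None
        for line in source.splitlines():
            if (m := CLASS_ISLAND.search(line)):
                current_class = m.group(1)
                model[current_class] = []
            elif (m := METHOD_ISLAND.search(line)) and current_class:
                model[current_class].append(m.group(1))
        return model

    example = """
    public class Parser {
        // water: the islands simply skip over this
        public Token nextToken() { return scanner.next(); }
        private void error(String message) { }
    }
    """
    print(harvest_islands(example))   # {'Parser': ['nextToken', 'error']}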

Page 42: Scam 2013 — Agile Modeling Keynote

Composing parsers from parts

42

Page 43: Scam 2013 — Agile Modeling Keynote

Ideas

Exploit similarities between languages (adapt and compose)

• similar syntax, similar constructs
• combine “reusable” islands?
• adapt with genetic algorithms?
• combine with parsing by example?

43

Page 44: Scam 2013 — Agile Modeling Keynote

Automatic structure recognition

44

Page 45: Scam 2013 — Agile Modeling Keynote

Exploit indentation as a proxy for structure

Ideas

catch, throw, try, bool, class, enum, explicit, export, friend, inline, mutable, namespace, operator, private, protected, public, template, typename, using, virtual, volatile, wchar_t, and, and_eq, bitand, bitor, compl, const_cast, delete, dynamic_cast, false, new, not, not_eq, or, or_eq, reinterpret_cast, static_cast, this, true, typeid, xor, xor_eq

Heuristics to automatically detect keywords

• don’t parse; just recognize structure?
• combine with parsing by example?
• combine with grammar evolution?
• combine with parser composition?
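One way to make the first two bullets concrete, under the assumptions that indentation tracks nesting and that keywords tend to be the most frequent line-initial words (both are heuristics, not facts about every language):

    # Sketch: recover nesting purely from indentation, and guess keywords as the
    # most frequent line-initial words. No parsing involved.
    from collections import Counter
    import re

    def indentation_outline(source):
        """Yield (depth, stripped line), deriving depth from leading whitespace."""
        stack = []
        for line in source.splitlines():
            if not line.strip():
                continue
            width = len(line) - len(line.lstrip())
            while stack and stack[-1] >= width:
                stack.pop()
            stack.append(width)
            yield len(stack) - 1, line.strip()

    def candidate_keywords(source, top=5):
        words = []
        for line in source.splitlines():
            match = re.match(r'\s*([A-Za-z_]\w*)', line)
            if match:
                words.append(match.group(1))
        return Counter(words).most_common(top)

    sample = "class A:\n    def f(self):\n        if True:\n            return 1\n    def g(self):\n        return 2\n"
    for depth, text in indentation_outline(sample):
        print("  " * depth + text)
    print(candidate_keywords(sample))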

45

Page 46: Scam 2013 — Agile Modeling Keynote

Conclusions

Model construction is an obstacle to agile software assessment

You don’t need precise parsers to start analysis

Are there effective shortcuts to building a parser/importer by hand?