2012 28th IEEE International Conference on Software Maintenance (ICSM), Trento, Italy
Dead Code Elimination for Web Systems Written in PHP: Lessons Learned from an Industry Case
Hidde Boomsma
Hostnet B.V.
Department of Software Engineering
De Ruyterkade 6, 1013 AA Amsterdam, The Netherlands
Email: [email protected]
Hans-Gerhard Gross
Delft University of Technology
Software Engineering Research Group
Mekelweg 4, 2628 CD Delft, The Netherlands
Email: [email protected]
Abstract—Web systems undergo constant evolution. This makes them prone to accumulating dead code. In turn, dead code is commonly understood to inhibit software evolution. The only way out of this vicious circle is the careful analysis of the web system, identifying unused features, and eliminating them. However, modern web systems are often built with server-side scripting languages such as PHP. Their inherent dynamic features render traditional static dead code identification approaches useless.
We describe the technical issues involved in detecting dead PHP code, and propose an identification and removal approach based on dynamic analysis. Further, we describe the examination of our approach in an industry-scale web system, and discuss our lessons learned.
I. INTRODUCTION
Web systems undergo constant evolution [1], typically im-
posed by changing user requirements, improvement of func-
tions, removal of faults, or introduction of new technologies. A
considerable share of the maintenance effort must therefore be
spent on identifying features that have become disused and should be
removed from the system. Such disused features are commonly
referred to as dead code [2]. Removing dead code from a
system makes understanding and maintaining its future
versions easier. However, in many organizations, engineers are
often reluctant to remove dead code, because of potentially un-
known dependencies with existing features. Proper dead code
identification and elimination strategies help reduce system
size and complexity, improve system understandability [3],
[4], retard software ageing [5], and, consequently, alleviate
maintenance.
Web systems are increasingly being developed with domain
specific languages such as Python, Ruby, Perl, or PHP. In par-
ticular, PHP is very popular as server-side scripting language,
according to W3Tech’s Web Technology Surveys.1
Dead code elimination is well known in compiler optimiza-
tion, e.g. [6], and was applied in software maintenance, e.g.
[7], [8]. There is also a PHP static dead code detector tool
available [9], but it fails on large PHP code bases. To the
best of our knowledge, we found neither in the
1 www.w3techs.com
literature nor in the open source community, useful approaches
or tools aimed at dead code detection for extensive PHP code
bases. The issues in dead code elimination arise from the
language’s inherent dynamic features, such as runtime module
inclusion, dynamic and weak typing, "duck-typed objects",
implicit object and array creation, runtime aliasing, reflection,
and closures. These make dead code identification for PHP
more difficult than in other languages [10], [11].
In practice, appropriate dead code identification for PHP
should be performed dynamically. Therefore, we develop and
evaluate a dynamic analysis approach and tools which help
engineers to identify dead code in web systems written in
PHP. We address the following research questions:
• Which data must be retrieved from a system, how can
this be done, and what is the overhead incurred?
• How should the data be presented, and how can unused
code be declared dead?
Our contributions are a dynamic analysis approach for
identifying and eliminating disused PHP modules, and two
tools, a web application and an Eclipse2 plugin. Both tools can
be used to visualize the data retrieved from dynamic analysis,
in order to support engineers in eliminating unused code. The
approach and tools have been developed and tested by means
of a real web system deployed by Hostnet,3 the company
responsible for hosting and marketing the ".nl" domain in the
Netherlands. The source code for the analysis techniques and
the tools can be downloaded.4
This article is structured as follows: Sections II, III and IV
discuss dynamic analysis techniques for detecting dead code,
show how this can be visualized, and eliminated, respectively.
Sect. V describes the examination of our approach with lessons
learned, Sect. VI lists related work, and Sect. VII concludes
the paper.
II. DEAD CODE IDENTIFICATION
A. Required Runtime Information
Dead code obstructs system maintenance [12], in particular,
if it is cluttering up the production system, which is typical in
2 www.eclipse.org
3 www.hostnet.nl
4 https://github.com/hostnet
978-1-4673-2312-3/12/$31.00 © 2012 IEEE
PHP applications. The first step towards detecting unused code
is determining the most suitable granularity for the analysis.
This depends on the predicted effort to remove dead code and
the assumed improvement in terms of better maintenance. We
choose the PHP file as the smallest unit of granularity, since file
usage is easy to measure, and files are easily removed.
Because it is impossible to measure unused files dynam-
ically, we analyze used files and subtract them from all files
in the system. This results in the "potentially dead" files,
and turns our dead file identification problem into a coverage
analysis problem. Hence, dynamic analysis is used to deter-
mine file coverage and frequency [13]. Coverage is the minimal
requirement; frequency, in addition, determines when a PHP
file was last used.
Files under development will be wrongly identified as dead
files, because they are never executed. This can be avoided if,
in addition to coverage, also the access time stamp from the
version control system (VCS) is considered.
For each file in the production system, we store (1) first
time used, (2) number of times used, since first time, (3) last
time used, and (4) last time changed in the repository of the
VCS. By subtracting the accessed files from all files in the
system, after some system execution time, we can identify the
potentially dead files.
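The bookkeeping above can be sketched in a few lines of PHP; the directory walk and the shape of the usage records are our assumptions for illustration, not the exact Hostnet implementation.

```php
<?php
// Sketch: potentially dead files = all project files minus the files the
// data store has seen in use. Helper names and record layout are assumed.

function allProjectFiles(string $root): array
{
    $files = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile() && $file->getExtension() === 'php') {
            $files[] = $file->getPathname();
        }
    }
    return $files;
}

// $usage maps path => [firstUsed, timesUsed, lastUsed] as logged per request.
function potentiallyDeadFiles(array $allFiles, array $usage): array
{
    return array_values(array_diff($allFiles, array_keys($usage)));
}
```

Files that never appear as a key in the usage store after a sufficiently long observation period are the candidates for removal.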
B. Dynamic Information Extraction
All files in a system can be identified through the PHP
function get_included_files(). Each file
results in a key-value-entry in a data store whose value is
extended with a new time stamp whenever the file is loaded.
Dynamically loaded and used files are identified through an
autoloader function built into most PHP application develop-
ment frameworks such as Symfony5 which is used by Hostnet.
The autoloader can be configured to execute additional logging
code whenever a new feature of a web application is accessed
by a user, and therefore, loaded into the server. This amounts to
100% of the used files, and can be subtracted from the collection
of all files in order to yield the number and/or percentage of
unused files.
A final step for information extraction consists in gathering
all dynamically created log information in the data store for
further analysis. Owing to its dynamic nature, PHP provides
an elegant solution: code is automatically appended to
any PHP script executed by the web server. This is com-
parable with an Aspect [14], and can be realized through
adding an auto_append_file directive in the web server's
.htaccess file. That way, the last action performed in any
PHP script, after it has been loaded by the web server, will be
gathering the log information which is local to that particular
script, and writing it to the data store. We refer to the PHP
documentation6, and to our publicly available code examples7
for implementation details.
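Under these assumptions (paths, table layout, and credentials below are placeholders, not Hostnet's actual configuration), the wiring could look as follows: the .htaccess file gains the line `php_value auto_append_file "/var/www/shared/log_usage.php"`, and the appended script then flushes the per-request file list to the data store.

```php
<?php
// log_usage.php -- executed after every page request via the
// auto_append_file directive. By this point the page has been rendered
// and get_included_files() returns the complete set of loaded files,
// so the data-store round trip does not delay the user-visible response.
// Connection details and the file_usage schema are placeholders.

$db = new PDO('mysql:host=localhost;dbname=coverage', 'logger', 'secret');
$now = time();

$stmt = $db->prepare(
    'INSERT INTO file_usage (path, first_used, times_used, last_used)
       VALUES (?, ?, 1, ?)
     ON DUPLICATE KEY UPDATE times_used = times_used + 1, last_used = ?'
);

foreach (get_included_files() as $path) {
    $stmt->execute([$path, $now, $now, $now]);
}
```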
5 www.symfony-project.org
6 http://php.net/docs.php
7 https://github.com/hostnet
Fig. 1. Example tree map with colors indicating used vs. unused PHP files
III. VISUALIZATION OF THE EXTRACTED DATA
Our main goal is improving the maintainability of web applications
written in PHP. Maintenance is carried out by engineers who
have to understand the system, and decide which files they
should remove. Understanding can be facilitated if it is based
on good visualizations. Two kinds of visualizations were found
to be useful by the software engineers at Hostnet, and they
resulted in the construction of two tool prototypes described
below.
A. Tree Map Visualizer Web Application
The main tool is based on a tree map visualization following
[15]. It displays the extracted information about used vs.
unused files in the most complete and accurate form. An
example from Hostnet is depicted in Fig. 1. The screen is
subdivided into three sections.
The top section shows a number of boxes in various sizes
and different colors, indicating the directory structure of the
overall project. Every box represents a sub-directory contain-
ing other directories or PHP files. The box size corresponds
to the number of contained unused files. The color denotes
the ratio of used to unused files in a box. Shades of
green indicate a low percentage of unused files, and shades of
red indicate a high percentage of unused files. Size and color
combined indicate absolute and relative numbers of unused
files in the directory structure, e.g. a big reddish rectangle
suggests that its associated directory branch contains many
unused files. Two distinct green and red values denote fully
used and fully unused files/directories (left-hand side green
and right-hand side red on the color bar in Fig. 1). Clicking
a box navigates into the directory structure for inspection of
sub-directories and files, showing more detailed information,
and more definite shades of green and red.
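The color coding can be derived directly from the per-directory counts; the linear green-to-red interpolation below is one plausible choice for such a visualization, not necessarily the one the tool uses.

```php
<?php
// Map a directory's fraction of unused files to a hex color:
// 0.0 (fully used) => green, 1.0 (fully unused) => red.
function unusedFractionToColor(float $fraction): string
{
    $fraction = max(0.0, min(1.0, $fraction));
    $red   = (int) round(255 * $fraction);
    $green = (int) round(255 * (1.0 - $fraction));
    return sprintf('#%02x%02x00', $red, $green);
}

// A directory with 3 unused files out of 12 gets a greenish shade.
echo unusedFractionToColor(3 / 12), "\n";
```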
The table section on the left hand side of Fig. 1 shows more
detailed information about the directories and files. Besides the
name of the directory, it shows the percentage of unused files,
the total number of unused files, and the overall number of
files, plus the access date of a directory/file according to the
Fig. 2. Aurora sub-system: used files over time
Fig. 3. Example Eclipse dead file decorator
VCS, and the date of the first execution of a file. These values
are derived from the four values stored for each file.
The graph on the right hand side indicates the overall
number of files being activated over time. A more detailed
picture is shown in Fig. 2 for the biggest sub-system used in
the Hostnet web application. In this particular case, the sub-
system was started in mid January, and it was still activating
unused files by mid April. This is a good indicator of how
long one has to wait before coming to a definitive conclusion
about the really used/unused files in a web system.
B. Eclipse Dead File Decorator
The second tool shows less information and is intended to
be used in everyday development. It is based on the Eclipse
file decorator plugin,8 and it also indicates, in green to red, the
percentage of used vs. unused files in the directory structure
of the project. That way, developers get a hint, for example,
that they might be attempting to edit discontinued files, when
entering a reddish directory in their Eclipse development
environment. An example from Hostnet is displayed in Fig.
3, showing the colored project directories in the file browser
on the left hand side. At Hostnet, information and colors are
auto-updated from the data store once per hour (this is an
arbitrary choice).
8http://www.eclipse.org/articles/Article-Decorators/decorators.html
Fig. 4. Process used by Hostnet for removing alleged dead files
IV. DEAD CODE ELIMINATION
Up to this point, we have only looked at how potentially
dead files in a web system may be identified based on dynamic
analysis. The fact that they are merely "potentially" dead is an
important distinction, because in dynamic web systems written
in PHP it can never be determined a priori whether a file will
not be used in the future. This is the most essential difference
compared to dead code identification and elimination in
more "traditional", monolithic and static types of systems. In
those systems, identifying a dead code section also means that
it is really and utterly dead, and may be removed. However, in
dynamic web applications, this is not the case. Here, someone
must take the active decision of declaring a "potentially dead"
file really dead, and burying, i.e. removing, it.
In order to be able to declare files dead, the graph
showing file usage over time (Fig. 2) acts as the primary source
of information, but additional domain-specific information
about when functions should be executed is also required. This
cannot be derived from analyzing the running system but must
be looked up from the specification or documentation of the
application. For example, a login script is expected to be used
every day, whereas the script for generating the yearly tax
report is only invoked once per year.
Figure 4 illustrates the dead file elimination process em-
ployed by Hostnet. It starts at the top with the dead code
identification [start]. The numbers indicate where specific
action must be taken, or specific tools are used: (1) is based
on the tree map visualizer. If a completely dead file is detected
(distinctly red), it can be located, checked, and removed fol-
lowing the removal process steps in Fig. 4 (indicated through
"[file removal]"). (2) is based on additional, domain-specific
information on how often a feature in a file/directory should
run. (3) is based on the usage of files over time graph. (4)
is based on the tabular view, on the left hand side of Fig. 1.
(5) is based on trying to actively trigger a feature in the web
application, if there are no side effects to be expected. (6) is
based on human reasoning, and coming to a decision.
V. EVALUATION
We have implemented dead code identification in Hostnet’s
web system, and evaluated its performance overhead, as well
as its capability to pinpoint potentially dead files. The web
system is comprised of six sub-systems which have been
augmented with the techniques described earlier. Table I
summarizes features of the sub-systems used. Relevant prop-
erties are the number of files contained in each sub-system’s
directory structure, its age, and its average number of requests
per day (in thousands). HFT2 and HFT3 are constantly running
batch jobs performing data provisioning services for the other
modules. They do not have, as such, page accesses.
A. Overhead
In order to determine the overhead, we have augmented
the Aurora sub-system with additional profiling code. In PHP,
this can be achieved through inbuilt prepend and append
functionality, which automatically invokes PHP files before
and after execution of a page request. These files contain the
code for end-to-end timing. Aurora is, by far, the biggest and
most used sub-system in the Hostnet suite (Table I), and it can
be regarded as the worst case concerning page access response
time overhead.
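The end-to-end timing can use the same prepend/append mechanism; a minimal sketch follows, with file names assumed and the two files shown in one block for brevity.

```php
<?php
// timing_prepend.php -- registered via auto_prepend_file:
// remember when the request started.
define('REQUEST_START', microtime(true));

// timing_append.php -- registered via auto_append_file:
// log the elapsed wall-clock time once the page is done.
register_shutdown_function(function () {
    $elapsedMs = (microtime(true) - REQUEST_START) * 1000.0;
    $uri = isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '(cli)';
    error_log(sprintf('%s took %.1f ms end-to-end', $uri, $elapsedMs));
});
```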
The total overhead is comprised of two components: (1) the
actual logging of the used PHP files, and (2) the connection
to the data store in order to save the logged data. Through the
strict separation of the two steps, i.e., the store is only accessed
once the requested page has already been rendered by the web
server and all dynamically loaded PHP files are known, we can
achieve low waiting times for the user. In Aurora, the average
additional waiting time for a page request was measured to
be below 6 ms in 95% of the cases, with an average time for
connecting to the database of 1.6 ms. This means the waiting
time per page request is not noticeable to the user, which is a
requirement of Hostnet.
B. Dead Code Identification
Earlier, the dead code analysis problem was reformulated as
a coverage analysis problem, which means waiting a long time
until features in the system are invoked. However, how long
is long enough in order to be certain?
The only useful information to determine the waiting time
comes from the graph showing the number of activated files
over time (Fig. 1). An additional source is the domain knowl-
edge. Table II shows the time we had to wait in order to be
reasonably sure that no more new files would be accessed in
the respective sub-systems of Hostnet. In addition, it shows
the fraction of the used files up to that point, and also a
value indicating the number of page views over the number of
TABLE I
SUB-SYSTEMS OF THE HOSTNET WEB APPLICATION

Name        Description                   # of Files  Age in Years  Req./Day x 1000
HFT3        New provisioning system              750             1              n/a
Shop        Web shop                             923             3               40
Aurora      CRM application                     9755             5               60
Mailbase    Legacy mail filter frontend          490             5                3
My Hostnet  Customer portal                     2422             5               55
HFT2        Old provisioning system             3518             6              n/a
TABLE II
FILES USED, AND WAITING TIMES BEFORE DEAD FILE REMOVAL

Name        Page views / files  % used files  No new files
HFT3                       n/a         60.13  after 1 month
Shop                     42.95         68.26  after 2 weeks
Aurora                    6.18         48.91  after 5 months
Mailbase                  5.72         45.31  after 5 months
My Hostnet               22.08         35.92  after 1 week
HFT2                       n/a         73.00  after 2 months
used files. The percentage of used files in the table makes
apparent how many dead files may still be lurking in the
code base, and indicates the extent to which these dead files
could be aggravating the maintenance task.
The "page views / files" column indicates how the ratio
between the number of page accesses and the number of files
correlates with the time we have to wait to be certain that all
active files have been accessed. A high number of page requests
combined with a low number of files inevitably leads to the
remaining live files being activated more quickly. This ratio
might be a good measure for the required waiting time, though
this is not validated.
C. Discussion and Lessons Learned
In general, the dead code identification and elimination ap-
proach works well. Based on the visualizations, three Hostnet
engineers were able to safely remove 2740 disused Aurora
files in one working day, by following the process summarized
in Fig. 4. However, before commencing, they had to wait a
very long time. The number of files they removed amounts
to almost 30% of Aurora’s original code inventory. This is
quite substantial, but, unfortunately, there are no numbers to
compare, since dead file removal was never done before at
the company, and performing it without any tool support was
perceived as extremely tedious by the engineers. Figure 2
suggests that only half of the files in Aurora are used after
three months, so that we can still expect many more files to
be denoted as potentially dead.
An interesting detail is the steep increase in the graph around
January/February, a time when many monthly jobs run. This
observation is difficult to make merely by looking at the graph,
which may be regarded as a flaw in the visualization. We had to
analyze these newly loaded files carefully in order to come to
this conclusion. Future versions of the visualizer tool should
indicate, based on the available domain knowledge, when an
increase in newly loaded files is to be expected.
Special attention is required for sporadic or rarely used fea-
tures, such as error handling, rarely sold products, marketing
tools, or statistics. These will remain marked as dead for a
very long time until they are eventually activated. One treatment
would be the organization of the directory structure according
to expected time of usage, for example creating folders for
yearly, monthly, fortnightly, weekly, or daily occurring files.
This way, files would be grouped according to an occurrence
aspect, which would make the domain knowledge of when
features should be used more explicit in the code base.
However, this would break functional cohesion of the project,
and it is questionable whether this would be favored.
Another important observation from our examination is
that sporadically invoked files such as exception handling
operations should be entirely left out of dead code analysis.
They should also be treated as a separate aspect in the system
and located in a dedicated directory.
The biggest inhibiting factor of our proposed approach is
the time required for dynamic analysis in order to come up
with sufficient certainty that an unused file is indeed dead.
Although we can trigger some use cases in order to increase
the coverage earlier, this can only be done if no side effects
are expected. The only additional static information addressing
this issue, currently, is the domain knowledge of when some
functions should run.
An interesting final observation, which goes against our
initial intuition, is that older applications seem no more
prone to dead file accumulation than younger applications,
even though they might have been changed more often
over time. It would be interesting to study properties that
would help determine which directories/files are more likely
to become disused.
VI. RELATED WORK
Dead code elimination is a concept originally used in com-
piler optimization [16], [17], [18], but it has also become an
issue in software maintenance [4], [12]. Srivastava [8] addresses
unreachable procedures in object-oriented code through static
analysis, work which is extended by Bacon [7] to include virtual
functions in the analysis. Dead code elimination for PHP
based on static analysis is addressed in an attempt by Biggar
to provide an ahead-of-time compiler translating PHP to C [10], [19].
However, many of the dynamic features of PHP cannot be
dealt with using this technique. Di Lucca and Di Penta [20]
describe how static analysis is only of limited use in dynamic
web applications and propose to combine it with dynamic
analysis. We have taken this idea further and focused on
dynamic analysis. However, we have also learned that static
information may improve the outcome.
VII. CONCLUSIONS AND FUTURE WORK
We have developed a dynamic dead code identification
and elimination approach for web systems written in PHP
and evaluated it in the context of a large web system of
our industrial partner. In general, the approach was found
to work well in the context of Hostnet, and engineers could
quickly eliminate a substantial number of unused files in one of
their main applications. The long runtime of the analysis was
identified as the biggest inhibiting factor for the usability of
the approach; this might be improved by incorporating
more statically available information. In the future, we will
evaluate the static analysis techniques employed by the PHP
dead code detector [9] and assess to which extent their
techniques can be adopted.
REFERENCES
[1] F. Ricca and L. Chao, "Special Section on Web Systems Evolution," International Journal on Software Tools for Technology Transfer (STTT), 11(6), pp. 419–425, Springer, Berlin/Heidelberg, 2009.
[2] Y.F. Chen, "A C++ Data Model Supporting Reachability Analysis and Dead Code Detection," IEEE Transactions on Software Engineering, 24(9), pp. 682–694, 1998.
[3] M.W. Godfrey and Q. Tu, "Evolution in Open Source Software: A Case Study," in Proceedings of the 16th International Conference on Software Maintenance, pp. 131–142, San Jose, CA, October 11–14, 2000.
[4] P. Oman, "Metrics for Assessing a Software System's Maintainability," in Proceedings of the 8th International Conference on Software Maintenance, pp. 337–344, November 9–12, 1992.
[5] D.L. Parnas, "Software Ageing," in Proceedings of the 16th International Conference on Software Engineering, pp. 279–287, Sorrento, Italy, May 16–21, 1994.
[6] J. Knoop, O. Rüthing, and B. Steffen, "Partial Dead Code Elimination," in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, ACM SIGPLAN Notices, 29(6), pp. 147–158, 1994.
[7] D.F. Bacon and P.F. Sweeney, "Fast Static Analysis of C++ Virtual Function Calls," ACM SIGPLAN Notices, 31(10), pp. 324–341, 1996.
[8] A. Srivastava, "Unreachable Procedures in Object-Oriented Programming," ACM Letters on Programming Languages and Systems, 1(4), pp. 355–364, 1992.
[9] S. Bergmann, Dead Code Detector (DCD) for PHP code, https://github.com/sebastianbergmann/phpdcd.
[10] P. Biggar, E. de Vries, and D. Gregg, "A Practical Solution for Scripting Language Compilers," in Proceedings of the 24th ACM Symposium on Applied Computing, pp. 1916–1923, Honolulu, Hawaii, March 9–12, 2009.
[11] L. Tratt, "Dynamically Typed Languages," Advances in Computers, 77, pp. 149–184, 2009.
[12] G. Scanniello, "Source Code Survival with the Kaplan Meier Estimator," in Proceedings of the 27th International Conference on Software Maintenance, pp. 524–527, Williamsburg, Virginia, September 25 – October 1, 2011.
[13] T. Ball, "The Concept of Dynamic Analysis," in Proceedings of the 7th European Software Engineering Conference, pp. 216–234, Springer, London, 1999.
[14] K. Hokamura, N. Ubayashi, S. Nakajima, and A. Iwai, "Aspect-Oriented Programming for Web Controller Layer," in Proceedings of the 15th Asia-Pacific Software Engineering Conference, pp. 529–536, Beijing, China, December 3–5, 2008.
[15] B. Shneiderman, "Tree Visualization with Tree-Maps: 2-D Space-Filling Approach," ACM Transactions on Graphics, 11(1), pp. 92–99, January 1992.
[16] S.K. Debray, W. Evans, R. Muth, and B. De Sutter, "Compiler Techniques for Code Compaction," ACM Transactions on Programming Languages and Systems (TOPLAS), 22(2), pp. 378–415, 2000.
[17] Y. Liu and S. Stoller, "Eliminating Dead Code on Recursive Data," in A. Cortesi and G. Filé (eds.), Static Analysis, Lecture Notes in Computer Science (LNCS), vol. 1694, Springer, Berlin/Heidelberg, 1999.
[18] K.V.N. Sunitha and V.V. Kumar, "A New Technique for Copy Propagation and Dead Code Elimination using Hash-based Value Numbering," in Proceedings of the 14th International Conference on Advanced Computing and Communications, pp. 601–604, Surathkal, India, December 20–23, 2006.
[19] P. Biggar, "Design and Implementation of an Ahead-of-Time Compiler for PHP," PhD thesis, Trinity College, Dublin, 2010.
[20] G.A. Di Lucca and M. Di Penta, "Integrating Static and Dynamic Analysis to Improve the Comprehension of Existing Web Applications," in Proceedings of the 7th International Symposium on Web Site Evolution, pp. 87–94, Budapest, Hungary, September 26, 2005.