2012 28th IEEE International Conference on Software Maintenance (ICSM), Trento, Italy
Dead Code Elimination for Web Systems Written in PHP: Lessons Learned from an Industry Case
Hidde Boomsma
Hostnet B.V.
Department of Software Engineering
De Ruyterkade 6, 1013 AA Amsterdam, The Netherlands
Email: [email protected]
Hans-Gerhard Gross
Delft University of Technology
Software Engineering Research Group
Mekelweg 4, 2628 CD Delft, The Netherlands
Email: [email protected]
Abstract—Web systems undergo constant evolution. This makes them prone to accumulating dead code. In turn, dead code is commonly understood to inhibit software evolution. The only way out of this vicious circle is the careful analysis of the web system, identifying unused features, and eliminating them. However, modern web systems are often built with server-side scripting languages such as PHP. Their inherent dynamic features render traditional static dead code identification approaches useless.
We describe the technical issues involved in detecting dead PHP code, and propose an identification and removal approach based on dynamic analysis. Further, we describe the examination of our approach in an industry-scale web system, and discuss our lessons learned.
I. INTRODUCTION
Web systems undergo constant evolution [1], typically im-
posed by changing user requirements, improvement of func-
tions, removal of faults, or introduction of new technologies. A
considerable share of the maintenance effort must therefore be
spent on identifying features that have become disused and should be
removed from the system. Such disused features are commonly
referred to as dead code [2]. Removing dead code from a
system makes understanding and maintaining its future
versions easier. However, in many organizations, engineers are
often reluctant to remove dead code, because of potentially un-
known dependencies with existing features. Proper dead code
identification and elimination strategies help reduce system
size and complexity, improve system understandability [3],
[4], retard software ageing [5], and, consequently, alleviate
maintenance.
Web systems are increasingly being developed with domain
specific languages such as Python, Ruby, Perl, or PHP. In par-
ticular, PHP is very popular as server-side scripting language,
according to W3Tech’s Web Technology Surveys.1
Dead code elimination is well known in compiler optimiza-
tion, e.g. [6], and was applied in software maintenance, e.g.
[7], [8]. There is also a PHP static dead code detector tool
available [9], but it fails on large PHP code bases. To the
best of our knowledge, we found neither in the
1 www.w3techs.com
literature nor in the open source community, useful approaches
or tools aimed at dead code detection for extensive PHP code
bases. The issues in dead code elimination arise from the
language’s inherent dynamic features, such as runtime module
inclusion, dynamic and weak typing, "duck-typed objects",
implicit object and array creation, runtime aliasing, reflection,
and closures. These make dead code identification for PHP
more difficult than in other languages [10], [11].
In practice, appropriate dead code identification for PHP
should be performed dynamically. Therefore, we develop and
evaluate a dynamic analysis approach and tools which help
engineers to identify dead code in web systems written in
PHP. We address the following research questions:
• Which data must be retrieved from a system, how can
this be done, and what is the overhead incurred?
• How should the data be presented, and how can unused
code be declared dead?
Our contributions are a dynamic analysis approach for
identifying and eliminating disused PHP modules, and two
tools, a web application and an Eclipse2 plugin. Both tools can
be used to visualize the data retrieved from dynamic analysis,
in order to support engineers in eliminating unused code. The
approach and tools have been developed and tested by means
of a real web system deployed by Hostnet,3 the company
responsible for hosting and marketing the ".nl" domain in the
Netherlands. The source code for the analysis techniques and
the tools can be downloaded.4
This article is structured as follows: Sections II, III and IV
discuss dynamic analysis techniques for detecting dead code,
show how this can be visualized, and eliminated, respectively.
Sect. V describes the examination of our approach with lessons
learned, Sect. VI lists related work, and Sect. VII concludes
the paper.
II. DEAD CODE IDENTIFICATION
A. Required Runtime Information
Dead code obstructs system maintenance [12], in particular,
if it is cluttering up the production system, which is typical in
2 www.eclipse.org
3 www.hostnet.nl
4 https://github.com/hostnet
978-1-4673-2312-3/12/$31.00 © 2012 IEEE
PHP applications. The first step towards detecting unused code
is determining the most suitable granularity for the analysis.
This depends on the predicted effort to remove dead code and
the assumed improvement in terms of better maintenance. We
choose the PHP file as the smallest unit of granularity, since file
usage is easy to measure, and files are easily removed.
Because it is impossible to measure unused files dynam-
ically, we analyze used files and subtract them from all files
in the system. This results in the "potentially dead" files,
and turns our dead file identification problem into a coverage
analysis problem. Hence, dynamic analysis is used to deter-
mine file coverage and frequency [13]. Coverage is the minimal
requirement; frequency, in addition, determines when a PHP
file was last used.
Files under development will be wrongly identified as dead
files, because they are never executed. This can be avoided if,
in addition to coverage, also the access time stamp from the
version control system (VCS) is considered.
For each file in the production system, we store (1) first
time used, (2) number of times used, since first time, (3) last
time used, and (4) last time changed in the repository of the
VCS. By subtracting the accessed files from all files in the
system, after some system execution time, we can identify the
potentially dead files.
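The bookkeeping above can be sketched in a few lines of PHP; the directory walk and the shape of the usage records are our assumptions for illustration, not the exact Hostnet implementation.

```php
<?php
// Sketch: potentially dead files = all project files minus the files the
// data store has seen in use. Helper names and record layout are assumed.

function allProjectFiles(string $root): array
{
    $files = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile() && $file->getExtension() === 'php') {
            $files[] = $file->getPathname();
        }
    }
    return $files;
}

// $usage maps path => [firstUsed, timesUsed, lastUsed] as logged per request.
function potentiallyDeadFiles(array $allFiles, array $usage): array
{
    return array_values(array_diff($allFiles, array_keys($usage)));
}
```

Files that never appear as a key in the usage store after a sufficiently long observation period are the candidates for removal.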
B. Dynamic Information Extraction
All files in a system can be identified through the PHP
function get_included_files(). Each file
results in a key-value-entry in a data store whose value is
extended with a new time stamp whenever the file is loaded.
Dynamically loaded and used files are identified through an
autoloader function built into most PHP application develop-
ment frameworks such as Symfony5 which is used by Hostnet.
The autoloader can be configured to execute additional logging
code whenever a new feature of a web application is accessed
by a user, and therefore, loaded into the server. This amounts to
100% of the used files, and can be subtracted from the collection
of all files in order to yield the number and/or percentage of
unused files.
A final step for information extraction consists in gathering
all dynamically created log information in the data store for
further analysis. Owing to its dynamic nature, PHP provides
an elegant solution: code is automatically appended to
any PHP script executed by the web server. This is com-
parable with an Aspect [14], and can be realized through
adding an auto_append_file directive in the web server's
.htaccess file. That way, the last action performed in any
PHP script, after it has been loaded by the web server, will be
gathering the log information which is local to that particular
script, and writing it to the data store. We refer to the PHP
documentation6, and to our publicly available code examples7
for implementation details.
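Under these assumptions (paths, table layout, and credentials below are placeholders, not Hostnet's actual configuration), the wiring could look as follows: the .htaccess file gains the line `php_value auto_append_file "/var/www/shared/log_usage.php"`, and the appended script then flushes the per-request file list to the data store.

```php
<?php
// log_usage.php -- executed after every page request via the
// auto_append_file directive. By this point the page has been rendered
// and get_included_files() returns the complete set of loaded files,
// so the data-store round trip does not delay the user-visible response.
// Connection details and the file_usage schema are placeholders.

$db = new PDO('mysql:host=localhost;dbname=coverage', 'logger', 'secret');
$now = time();

$stmt = $db->prepare(
    'INSERT INTO file_usage (path, first_used, times_used, last_used)
       VALUES (?, ?, 1, ?)
     ON DUPLICATE KEY UPDATE times_used = times_used + 1, last_used = ?'
);

foreach (get_included_files() as $path) {
    $stmt->execute([$path, $now, $now, $now]);
}
```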
5 www.symfony-project.org
6 http://php.net/docs.php
7 https://github.com/hostnet
Fig. 1. Example tree map with colors indicating used vs. unused PHP files
III. VISUALIZATION OF THE EXTRACTED DATA
Our main goal is improving the maintainability of web applications
written in PHP. Maintenance is carried out by engineers who
have to understand the system, and decide which files they
should remove. Understanding can be facilitated if it is based
on good visualizations. Two kinds of visualizations were found
to be useful by the software engineers at Hostnet, and they
resulted in the construction of two tool prototypes described
below.
A. Tree Map Visualizer Web Application
The main tool is based on a tree map visualization following
[15]. It displays the extracted information about used vs.
unused files in the most complete and accurate form. An
example from Hostnet is depicted in Fig. 1. The screen is
subdivided into three sections.
The top section shows a number of boxes in various sizes
and different colors, indicating the directory structure of the
overall project. Every box represents a sub-directory contain-
ing other directories or PHP files. The box size corresponds
to the number of contained unused files. The color denotes
the ratio of used to unused files in a box. Shades of
green indicate a low percentage of unused files, and shades of
red indicate a high percentage of unused files. Size and color
combined indicate absolute and relative numbers of unused
files in the directory structure, e.g. a big reddish rectangle
suggests that its associated directory branch contains many
unused files. Two distinct green and red values denote fully
used and fully unused files/directories (left-hand side green
and right-hand side red on the color bar in Fig. 1). Clicking
a box navigates into the directory structure for inspection of
sub-directories and files, showing more detailed information,
and more definite shades of green and red.
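The color coding can be derived directly from the per-directory counts; the linear green-to-red interpolation below is one plausible choice for such a visualization, not necessarily the one the tool uses.

```php
<?php
// Map a directory's fraction of unused files to a hex color:
// 0.0 (fully used) => green, 1.0 (fully unused) => red.
function unusedFractionToColor(float $fraction): string
{
    $fraction = max(0.0, min(1.0, $fraction));
    $red   = (int) round(255 * $fraction);
    $green = (int) round(255 * (1.0 - $fraction));
    return sprintf('#%02x%02x00', $red, $green);
}

// A directory with 3 unused files out of 12 gets a greenish shade.
echo unusedFractionToColor(3 / 12), "\n";
```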
The table section on the left hand side of Fig. 1 shows more
detailed information about the directories and files. Besides the
name of the directory, it shows the percentage of unused files,
the total number of unused files, and the overall number of
files, plus the access date of a directory/file according to the
Fig. 2. Aurora sub-system: used files over time
Fig. 3. Example Eclipse dead file decorator
VCS, and the date of the first execution of a file. These values
are derived from the four values stored for each file.
The graph on the right hand side indicates the overall
number of files being activated over time. A more detailed
picture is shown in Fig. 2 for the biggest sub-system used in
the Hostnet web application. In this particular case, the sub-
system was started in mid January, and it was still activating
unused files by mid April. This is a good indicator of how
long one has to wait before coming to a definitive conclusion
about the really used/unused files in a web system.
B. Eclipse Dead File Decorator
The second tool shows less information and is intended to
be used in everyday development. It is based on the Eclipse
file decorator plugin,8 and it also indicates, in green to red, the
percentage of used vs. unused files in the directory structure
of the project. That way, developers get a hint, for example,
that they might be attempting to edit discontinued files, when
entering a reddish directory in their Eclipse development
environment. An example from Hostnet is displayed in Fig.
3, showing the colored project directories in the file browser
on the left hand side. At Hostnet, information and colors are
auto-updated from the data store once per hour (this is an
arbitrary choice).
8http://www.eclipse.org/articles/Article-Decorators/decorators.html
Fig. 4. Process used by Hostnet for removing alleged dead files
IV. DEAD CODE ELIMINATION
Up to this point, we have only looked at how potentially
dead files in a web system may be identified based on dynamic
analysis. The fact that they are merely "potentially" dead is an
important distinction, because in dynamic web systems written
in PHP it can never be determined a priori whether a file will
not be used in the future. This is the most essential difference
compared to dead code identification and elimination in
more "traditional", monolithic and static types of systems. In
those systems, identifying a dead code section also means that
it is really and utterly dead, and may be removed. However, in
dynamic web applications, this is not the case. Here, someone
must take the active decision of declaring a "potentially dead"
file really dead, and burying, i.e. removing, it.
In order to be able to declare files dead, the graph
showing file usage over time (Fig. 2) acts as the primary source
of information, but additional domain-specific information
about when functions should be executed is also required. This
cannot be derived from analyzing the running system but must
be looked up from the specification or documentation of the
application. For example, a login script is expected to be used
every day, whereas the script for generating the yearly tax
report is only invoked once per year.
Figure 4 illustrates the dead file elimination process em-
ployed by Hostnet. It starts at the top with the dead code
identification [start]. The numbers indicate where specific
action must be taken, or specific tools are used: (1) is based
on the tree map visualizer. If a completely dead file is detected
(distinctly red), it can be located, checked, and removed fol-
lowing the removal process steps in Fig. 4 (indicated through
"[file removal]"). (2) is based on additional, domain-specific
information on how often a feature in a file/directory should
run. (3) is based on the usage of files over time graph. (4)
is based on the tabular view, on the left hand side of Fig. 1.
(5) is based on trying to actively trigger a feature in the web
application, if there are no side effects to be expected. (6) is
based on human reasoning, and coming to a decision.
V. EVALUATION
We have implemented dead code identification in Hostnet’s
web system, and evaluated its performance overhead, as well
as its capability to pinpoint potentially dead files. The web
system is comprised of six sub-systems which have been
augmented with the techniques described earlier. Table I
summarizes features of the sub-systems used. Relevant prop-
erties are the number of files contained in each sub-system’s
directory structure, its age, and its average number of requests
per day (in thousands). HFT2 and HFT3 are constantly running
batch jobs performing data provisioning services for the other
modules. They do not have, as such, page accesses.
A. Overhead
In order to determine the overhead, we have augmented
the Aurora sub-system with additional profiling code. In PHP,
this can be achieved through inbuilt prepend and append
functionality, which automatically invokes PHP files before
and after execution of a page request. These files contain the
code for end-to-end timing. Aurora is, by far, the biggest and
most used sub-system in the Hostnet suite (Table I), and it can
be regarded as the worst case concerning page access response
time overhead.
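The end-to-end timing can use the same prepend/append mechanism; a minimal sketch follows, with file names assumed and the two files shown in one block for brevity.

```php
<?php
// timing_prepend.php -- registered via auto_prepend_file:
// remember when the request started.
define('REQUEST_START', microtime(true));

// timing_append.php -- registered via auto_append_file:
// log the elapsed wall-clock time once the page is done.
register_shutdown_function(function () {
    $elapsedMs = (microtime(true) - REQUEST_START) * 1000.0;
    $uri = isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '(cli)';
    error_log(sprintf('%s took %.1f ms end-to-end', $uri, $elapsedMs));
});
```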
The total overhead is comprised of two components: (1) the
actual logging of the used PHP files, and (2) the connection
to the data store in order to save the logged data. Through the
strict separation of the two steps, i.e., the store is only accessed
once the requested page has already been rendered by the web
server and all dynamically loaded PHP files are known, we can
achieve low waiting times for the user. In Aurora, the average
additional waiting time for a page request was measured to
be below 6 ms in 95% of the cases, with an average time for
connecting to the database of 1.6 ms. This means the waiting
time per page request is not noticeable to the user, which is a
requirement of Hostnet.
B. Dead Code Identification
Earlier, the dead code analysis problem was reformulated as
a coverage analysis problem, which means waiting a long time
until features in the system are invoked. However, how long
is long enough in order to be certain?
The only useful information to determine the waiting time
comes from the graph showing the number of activated files
over time (Fig. 1). An additional source is the domain knowl-
edge. Table II shows the time we had to wait in order to be
reasonably sure that no more new files would be accessed in
the respective sub-systems of Hostnet. In addition, it shows
the fraction of the used files up to that point, and also a
value indicating the number of page views over the number of
TABLE I
SUB-SYSTEMS OF THE HOSTNET WEB APPLICATION

Name        Description                   # of Files  Age in Years  Req./Day x 1000
HFT3        New provisioning system              750             1              n/a
Shop        Web shop                             923             3               40
Aurora      CRM application                     9755             5               60
Mailbase    Legacy mail filter frontend          490             5                3
My Hostnet  Customer portal                     2422             5               55
HFT2        Old provisioning system             3518             6              n/a
TABLE II
FILES USED, AND WAITING TIMES BEFORE DEAD FILE REMOVAL

Name        Page views / files  % used files  No new files
HFT3                       n/a         60.13  after 1 month
Shop                     42.95         68.26  after 2 weeks
Aurora                    6.18         48.91  after 5 months
Mailbase                  5.72         45.31  after 5 months
My Hostnet               22.08         35.92  after 1 week
HFT2                       n/a         73.00  after 2 months
used files. The percentage of used files in the table makes
apparent how many dead files may still be lurking in the
code base, and indicates the extent to which these dead files
could be aggravating the maintenance task.
The "page views / files" column indicates how the ratio
between the number of page accesses and the number of files
correlates with the time we have to wait to be certain that all
active files have been accessed. A high number of page requests
combined with a low number of files inevitably leads to the
remaining live files being activated more quickly. This ratio
might be a good measure for the required waiting time, though
this is not validated.
C. Discussion and Lessons Learned
In general, the dead code identification and elimination ap-
proach works well. Based on the visualizations, three Hostnet
engineers were able to safely remove 2740 disused Aurora
files in one working day, by following the process summarized
in Fig. 4. However, before commencing, they had to wait a
very long time. The number of files they removed amounts
to almost 30% of Aurora’s original code inventory. This is
quite substantial, but, unfortunately, there are no numbers to
compare, since dead file removal was never done before at
the company, and performing it without any tool support was
perceived as extremely tedious by the engineers. Figure 2
suggests that only half of the files in Aurora are used after
three months, so that we can still expect many more files to
be denoted as potentially dead.
An interesting detail is the steep increase in the graph around
January/February, a time when many monthly jobs run. This
observation is difficult to make merely by looking at the graph,
which may be regarded as a flaw in the visualization. We had to
analyze these newly loaded files carefully in order to come to
this conclusion. Future versions of the visualizer tool should
indicate, based on the available domain knowledge, when an
increase in newly loaded files is to be expected.
Special attention is required for sporadic or rarely used fea-
tures, such as error handling, rarely sold products, marketing
tools, or statistics. These will remain marked as dead for a
very long time until they are eventually activated. One treatment
would be the organization of the directory structure according
to expected time of usage, for example creating folders for
yearly, monthly, fortnightly, weekly, or daily occurring files.
This way, files would be grouped according to an occurrence
aspect, which would make the domain knowledge of when
features should be used more explicit in the code base.
However, this would break functional cohesion of the project,
and it is questionable whether this would be favored.
Another important observation from our examination is
that sporadically invoked files such as exception handling
operations should be entirely left out of dead code analysis.
They should also be treated as a separate aspect in the system
and located in a dedicated directory.
The biggest inhibiting factor of our proposed approach is
the time required for dynamic analysis in order to come up
with sufficient certainty that an unused file is indeed dead.
Although we can trigger some use cases in order to increase
the coverage earlier, this can only be done if no side effects
are expected. The only additional static information addressing
this issue, currently, is the domain knowledge of when some
functions should run.
An interesting final observation, which goes against our
initial intuition, is that older applications seem no more
prone to dead file accumulation than younger applications,
even though they might have been changed more often
over time. It would be interesting to study properties that
would help determine which directories/files are more likely
to become disused.
VI. RELATED WORK
Dead code elimination is a concept originally used in com-
piler optimization [16], [17], [18], but it has also become an
issue in software maintenance [4], [12]. Srivastava [8] addresses
unreachable procedures in object-oriented code through static
analysis, work which is extended by Bacon [7] to include virtual
functions in the analysis. Dead code elimination for PHP
based on static analysis is addressed in an attempt by Biggar
to provide an ahead-of-time compiler translating PHP to C [10], [19].
However, many of the dynamic features of PHP cannot be
dealt with using this technique. Di Lucca and Di Penta [20]
describe how static analysis is only of limited use in dynamic
web applications and propose to combine it with dynamic
analysis. We have taken this idea further and focused on
dynamic analysis. However, we have also learned that static
information may improve the outcome.
VII. CONCLUSIONS AND FUTURE WORK
We have developed a dynamic dead code identification
and elimination approach for web systems written in PHP
and evaluated it in the context of a large web system of
our industrial partner. In general, the approach was found
to work well in the context of Hostnet, and engineers could
quickly eliminate a substantial number of unused files in one of
their main applications. The long runtime of the analysis was
identified as the biggest inhibiting factor for the usability of
the approach; this might be improved by incorporating
more statically available information. In the future, we will
evaluate the static analysis techniques employed by the PHP
dead code detector [9] and assess to which extent their
techniques can be adopted.
REFERENCES
[1] F. Ricca and L. Chao, "Special Section on Web Systems Evolution," International Journal on Software Tools for Technology Transfer (STTT), 11(6), pp. 419–425, Springer, Berlin/Heidelberg, 2009.
[2] Y.F. Chen, "A C++ Data Model Supporting Reachability Analysis and Dead Code Detection," IEEE Transactions on Software Engineering, 24(9), pp. 682–694, 1998.
[3] M.W. Godfrey and Q. Tu, "Evolution in Open Source Software: A Case Study," in Proceedings of the 16th International Conference on Software Maintenance, pp. 131–142, San Jose, CA, October 11–14, 2000.
[4] P. Oman, "Metrics for Assessing a Software System's Maintainability," in Proceedings of the 8th International Conference on Software Maintenance, pp. 337–344, November 9–12, 1992.
[5] D.L. Parnas, "Software Ageing," in Proceedings of the 16th International Conference on Software Engineering, pp. 279–287, Sorrento, Italy, May 16–21, 1994.
[6] J. Knoop, O. Rüthing, and B. Steffen, "Partial Dead Code Elimination," in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, ACM SIGPLAN Notices, 29(6), pp. 147–158, 1994.
[7] D.F. Bacon and P.F. Sweeney, "Fast Static Analysis of C++ Virtual Function Calls," ACM SIGPLAN Notices, 31(10), pp. 324–341, 1996.
[8] A. Srivastava, "Unreachable Procedures in Object-Oriented Programming," ACM Letters on Programming Languages and Systems, 1(4), pp. 355–364, 1992.
[9] S. Bergmann, Dead Code Detector (DCD) for PHP code, https://github.com/sebastianbergmann/phpdcd.
[10] P. Biggar, E. de Vries, and D. Gregg, "A Practical Solution for Scripting Language Compilers," in Proceedings of the 24th ACM Symposium on Applied Computing, pp. 1916–1923, Honolulu, Hawaii, March 9–12, 2009.
[11] L. Tratt, "Dynamically Typed Languages," Advances in Computers, 77, pp. 149–184, 2009.
[12] G. Scanniello, "Source Code Survival with the Kaplan Meier Estimator," in Proceedings of the 27th International Conference on Software Maintenance, pp. 524–527, Williamsburg, Virginia, September 25 – October 1, 2011.
[13] T. Ball, "The Concept of Dynamic Analysis," in Proceedings of the 7th European Software Engineering Conference, pp. 216–234, Springer, London, 1999.
[14] K. Hokamura, N. Ubayashi, S. Nakajima, and A. Iwai, "Aspect-Oriented Programming for Web Controller Layer," in Proceedings of the 15th Asia-Pacific Software Engineering Conference, pp. 529–536, Beijing, China, December 3–5, 2008.
[15] B. Shneiderman, "Tree Visualization with Tree-Maps: 2-D Space-Filling Approach," ACM Transactions on Graphics, 11(1), pp. 92–99, January 1992.
[16] S.K. Debray, W. Evans, R. Muth, and B. De Sutter, "Compiler Techniques for Code Compaction," ACM Transactions on Programming Languages and Systems (TOPLAS), 22(2), pp. 378–415, 2000.
[17] Y. Liu and S. Stoller, "Eliminating Dead Code on Recursive Data," in A. Cortesi and G. Filé (eds.), Static Analysis, Lecture Notes in Computer Science (LNCS), vol. 1694, Springer, Berlin/Heidelberg, 1999.
[18] K.V.N. Sunitha and V.V. Kumar, "A New Technique for Copy Propagation and Dead Code Elimination using Hash-based Value Numbering," in Proceedings of the 14th International Conference on Advanced Computing and Communications, pp. 601–604, Surathkal, India, December 20–23, 2006.
[19] P. Biggar, "Design and Implementation of an Ahead-of-Time Compiler for PHP," PhD thesis, Trinity College, Dublin, 2010.
[20] G.A. Di Lucca and M. Di Penta, "Integrating Static and Dynamic Analysis to Improve the Comprehension of Existing Web Applications," in Proceedings of the 7th International Symposium on Web Site Evolution, pp. 87–94, Budapest, Hungary, September 26, 2005.