USING APPLICATION PERFORMANCE MANAGEMENT FOR SECURITY 1
The Virtualization Practice
White Paper:
Using Application Performance Management for Security
Edward L. Haletky
Analyst – Virtualization and Cloud Security
The Virtualization Practice
Sponsored by New Relic
Version 1.0
August 2012
© 2012 The Virtualization Practice, LLC. All Rights Reserved. All other marks are property of their respective owners.
Abstract
At VMworld and RSA Conference last year, The Virtualization Practice, LLC, asked security
professionals whether any early warning systems are built into the virtual or cloud security tools
available today. The answer was, sadly, no. But when application performance management (APM)
tools were mentioned as an alternative, there was a spark in the conversation that often led to
the question of how one would know whether there was an application problem or a security issue.
This distinction often requires in-depth knowledge of an application, its normal processing, and
the normal paths through the code; something that is only learned over time. However, there is a
new breed of tool available that can provide important security information: where you are
spending your time (what you need to know), where your site is going, and from where you have
been reached.
Table of Contents
I. Introduction
II. Detecting Attacks with System Performance Measurements
CPU Trending
Network Trending
Memory Trending
Disk IO Trending
Putting it Together
III. Detecting Attacks with Application Performance Measurements
Response Time
Application Index (ApDex) and Throughput
Database Throughput
IV. Conclusion and Steps to Using APM as an Early Warning
V. About The Virtualization Practice
VI. About New Relic
VII. References
I. Introduction
The biggest problem security practitioners face today is finding an unknown security issue as soon
as possible after the issue occurs. We currently use several types of solutions to read application
and system log files, correlate the data, and eventually come out with a possible security
element, but for a SIEM or similar tool to pick it up, it has to be a known type of issue. It is
very difficult to determine if an unknown is a security issue because all our existing security
tools are designed to look for known patterns of events. Eventually, the unknowns will be spotted
and log filters will be updated. All in all, this could take a significant amount of time.
There is an easier way to determine if something unknown is happening within an application, and
that is to use an application performance management tool that tracks not only the time for
actions, but also from where actions start and to where actions end up. Actions can occur through
all tiers of an application from the web frontend through to the database. But this is performance
data; how can it be used for security purposes?
Performance data is sampled often, perhaps even on every query. This data will contain not only
the total time for an action, but perhaps also the actual command issued all the way through each
tier and eventually the actual database or back-end call made. Ideally, timing data would be
available for each tier. But interpreting such timing data requires that you know exactly what the
application does on a regular basis, because we need to recognize what is different or unique:
perhaps a unique code path, but also unique timing information. The timing information could end
up showing an unusual action, which in effect could be an unknown attack on a
system.
Equally important from a security perspective is knowledge of how the application is accessed,
and what it accesses further down its processing path. The reason for this is that most attacks are
trying to get to somewhere else or to specific data. An attack that is trying to go somewhere else
is a pivot attack, while other attacks are trying to access data or subvert a subsystem to gain
deeper access.
Attacks within an application will change the performance timings associated with aspects of the
application, depending on the style of the attack. The attack could slow down an application, but
could also speed it up. One example would be an SQL injection attack, which could cause a database
to query more than it should, or to time out due to defense-in-depth security implementations. Or
it could cause a subroutine to short-circuit. Another common attack is to insert malware that,
once it latches onto your application, calls back to a command and control center somewhere
else on the Internet, most likely in a foreign country.
The items we mention herein come from real-world experience: a website I maintained
was hacked, and I was able to determine when the attack occurred, the face of the attack, and
the solution very quickly thanks to a recently installed APM tool from New Relic. By investigating
the sudden increase in utilization I was able to find the problem. APM as an early warning
system for security issues works. All you need is to understand how to interpret what you are
seeing and to start the security investigation side by side with the application or system
investigation. Eventually, we either rule out a security issue, or we determine that the
performance of an application or system changed due to an attack.
II. Detecting Attacks with System Performance Measurements
As we discussed, to use APM for performance management you often need to be familiar
with an application or technology to interpret results, but to use APM for security you
often need to know the timings and normal operations of the application or system. There are two
parts to using APM for security purposes: the first is to use performance measures to look at the
system, and the second is to look at the application. When looking at the system we are interested
in the resources that many performance management tools report and that are the standard
resources of a virtual environment: CPU, Network, Memory, and Disk.
CPU Trending
We need to trend CPU to determine if there are any changes to our current CPU utilization over
time. In Figure 1, we show a flat CPU trend over roughly a 30-minute period. However, what
would you do if there was a spike in this overall flat CPU usage?
Figure 1: CPU Usage Trending
In many cases, this could be due to some other normal behavior, so the first step would be to
expand the view from 30 minutes to several days or even months to determine if there is a
well-known pattern to the behavior you have experienced. Assuming there is not, the next step
would be to check change management to determine if there was a recent change to the
application or server. If nothing changed, then we may have a security problem.
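The baseline comparison described above can be sketched in a few lines. This is a minimal illustration, not how any particular APM tool works: the function name `find_cpu_anomalies`, the 30-sample baseline, and the three-standard-deviation threshold are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def find_cpu_anomalies(samples, baseline_size=30, threshold=3.0):
    """Return the indexes of samples that stray from the baseline.

    The first `baseline_size` samples are treated as "normal" (an assumption);
    later samples are flagged when they differ from the baseline mean by more
    than `threshold` standard deviations, in either direction.
    """
    baseline = samples[:baseline_size]
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline) or 1e-9  # avoid a zero threshold on a perfectly flat baseline
    return [
        index
        for index, value in enumerate(samples[baseline_size:], start=baseline_size)
        if abs(value - mu) > threshold * sigma
    ]


# Hypothetical CPU percentages: a flat trend, then one spike and one dip.
cpu = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11] * 3 + [10, 11, 55, 10, 2]
print(find_cpu_anomalies(cpu))
```

Any index returned is only a prompt to widen the time window and check change management, as described above; it is not proof of an attack.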
Why could this be a security problem? Because most exploits will increase CPU utilization unless
they hide their processing amongst other processing. Several web attacks give attackers shell
access, and the applications they then run will use up CPU. Without a monitoring tool that
systematically looks at all CPU utilization, you may not know the attack was even made. A change
in CPU is a trigger for investigating the web application, which is why a baseline like the one
shown in Figure 1 is a very good thing to have.
Increased or even decreased CPU utilization is a trigger to dig deeper for security reasons. It is
not the only telltale, however.
Network Trending
To understand how network trending will help with attack detection, one must first understand
how attacks work. The first phase of any attack is to use the network to enumerate the protocols
being used by a server. If the server is not locked down sufficiently, the attacker will determine
what applications are being run, and from there launch an attack over the network against the
services in use. The goal of such a network attack is to pivot further attacks deeper into your
network. One other sign of an attack is an increase in network activity as the attacker
transfers their payload to the compromised system; network activity could also increase due to
denial of service attempts.
Figure 2: Network Trending Baseline
So we should pay close attention to any network trending baselines to determine if there is a
sudden increase in network activity. Figure 2 is one such baseline that shows normal behavior.
Abnormal behavior could be a sudden spike of traffic into a potentially compromised system or out
of the system, or perhaps a sudden dip in traffic. Once a site is infected, Google and web
browsers can detect well-known infections and prevent the traffic from being delivered. In that
case, overall traffic to a site could also dip or go flat, depending on what tools customers use.
Furthermore, quite a bit of modern malware ‘calls home’, and a sudden increase in outbound
traffic should trigger further research, specifically into what the outbound traffic consists of.
Some tools show this as either a list of external services or a graph of external service calls.
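As a rough sketch of the inbound and outbound checks described above, the following compares recent average traffic with a baseline average for each direction. The function names and the ratio thresholds (2.0 for a spike, 0.5 for a dip) are assumptions for illustration only; real thresholds depend on the site.

```python
from statistics import mean


def classify_network_change(baseline, recent, spike_ratio=2.0, dip_ratio=0.5):
    """Label one traffic direction as 'spike', 'dip', or 'normal'."""
    base = mean(baseline)
    if base == 0:
        return "spike" if mean(recent) > 0 else "normal"
    ratio = mean(recent) / base
    if ratio >= spike_ratio:
        return "spike"
    if ratio <= dip_ratio:
        return "dip"
    return "normal"


def review_network(baseline_in, recent_in, baseline_out, recent_out):
    """Classify inbound and outbound traffic separately."""
    return {
        "inbound": classify_network_change(baseline_in, recent_in),
        "outbound": classify_network_change(baseline_out, recent_out),
    }


# Hypothetical KB/s samples: visitors drop off while outbound traffic jumps.
print(review_network([100, 110, 90], [40, 45, 35], [50, 50, 50], [200, 180, 220]))
```

An inbound dip combined with an outbound spike is the pattern described above: browsers blocking an infected site while malware calls home.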
External Services
External service calls, such as external web calls from within an application, can also be tracked
as shown in Figure 3. If you notice an increase in external web traffic you will want to perform
further investigation. While Network Trending will show the behavior of the overall network,
determining what makes up an outgoing network change requires a deeper view into an
application.
Figure 3: Increase in External Services
We may further want to delve into a full list of external services. Figure 4 shows a list of some
external services for a given application. Such a list, tied to Figure 3, could tell us if there is
a ‘call home’ scenario in play. In one such investigation, I noticed an increase in Web External
traffic that the site should never have been making, as the site was fairly simple and
straightforward. This led to investigating a list of web sites similar to Figure 4.
Figure 4: List of External Services
If your APM tool provides a list of sites contacted by your application, periodically review this
list. Ideally the list should be expanded to include country of origin and the length of time the
site has been available. When malware calls home, it tends to call home to short-lived sites. If
you knew the country of origin you could quickly determine if you ever expected traffic to end up
within the country in question. In Figure 4, you would need to take the list and run it through
some tool that would output the country of origin for each site as well as the age of the site; in
this case the whois tool would be useful. In the case I mentioned previously, the attack was
‘calling home’ to a site that was short-lived and located in a country that I did not have coded
into the application.
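The review described above can be sketched as a simple filter over the external service list. The sketch assumes the country and registration date for each host have already been gathered (for example, by hand from whois output); the record layout, the host names, and the 90-day age cutoff are hypothetical.

```python
from datetime import date


def flag_external_services(services, expected_hosts, expected_countries,
                           today, min_age_days=90):
    """Return (host, reasons) pairs for services that deserve a closer look.

    `services` is a list of dicts with 'host', 'country', and 'registered'
    (a date) keys; this layout is an assumption for the example.
    """
    flagged = []
    for svc in services:
        reasons = []
        if svc["host"] not in expected_hosts:
            reasons.append("unexpected host")
        if svc["country"] not in expected_countries:
            reasons.append("unexpected country")
        if (today - svc["registered"]).days < min_age_days:
            reasons.append("recently registered")
        if reasons:
            flagged.append((svc["host"], reasons))
    return flagged


# Hypothetical data using reserved example domains.
services = [
    {"host": "api.example.com", "country": "US", "registered": date(2005, 3, 1)},
    {"host": "c2.example.org", "country": "ZZ", "registered": date(2012, 7, 20)},
]
print(flag_external_services(services, {"api.example.com"}, {"US"},
                             today=date(2012, 8, 7)))
```

A host that trips all three checks matches the case described above: a short-lived site in a country the application was never coded to contact.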
In some cases the external service called could be well known, or look to be well known based on
age and country of origin. In this case it will be necessary to view the data in a different
fashion as well: as a graph of contact. You may suddenly see an increase in traffic to an existing
site. If this happens, an investigation is warranted. Once more, check with change control to
determine if the application changed, or if the external service in use changed. Figure 5
shows the top 5 external services called. If the malware ‘calls home’, this may be seen visually
as an increase in traffic over a period of time. Once more, expand your view to a sufficient
time span to determine if this is expected or unexpected behavior.
Figure 5: Top 5 External Services
Furthermore, some malware is extremely sneaky and may hop from site to site based on its
command and control. If this is the case, you will want to get a full list of all sites to which
the application talked, even if it was only one connection. In general, if the malware has been
there long enough, patterns will also appear. You will want to easily spot even these one-off
services over time. The list method described above will work, but there are other methods such as
a service map.
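Spotting one-off services in a full list of contacted sites amounts to counting calls per host and keeping the rare, unknown ones. The sketch below assumes the APM tool can export a flat log of contacted host names; the function name and the `max_calls` cutoff are illustrative assumptions.

```python
from collections import Counter


def rare_external_hosts(call_log, known_hosts, max_calls=2):
    """Return sorted (host, call_count) pairs for unknown hosts contacted rarely."""
    counts = Counter(call_log)
    return sorted(
        (host, calls)
        for host, calls in counts.items()
        if calls <= max_calls and host not in known_hosts
    )


# Hypothetical export: one known busy service and two one-off contacts.
log = ["feeds.example.com"] * 40 + ["a1.example.net", "b7.example.net"]
print(rare_external_hosts(log, known_hosts={"feeds.example.com"}))
```

Run over a long enough period, the same count can also reveal the patterns mentioned above, such as a new rare host appearing every day.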
Service Map
Figure 6: Service Map
A service map introduces a new view of the application, one that tags times for all tiers of the
application as well as for all external services in use. In this example, Figure 6, we see that
there are some well-known locations that take up relatively high times, but there are “15 more
External Services” in use. The service map in Figure 6 identifies some of those well-known services
and is expanded in Figure 7. Even before we look at the “15 more External Services”, our list of
external services starts with Tinyurl, Wordpress, Feedburner, and Meandmymac.net.
Figure 7: Well-Known Services?
We know from this list of services that our application has spoken to Wordpress, Talkshoe,
Pingomatic, Ask.com, Akismet, Twitter, Wordpress, Something Unknown, Something Unknown,
Bing, Google, and two more unknown locations. This service map allows us to narrow down our
research to just the 4 unknown services, and hovering over them clearly shows the site to which
they belong.
There is a huge amount of data available within an application performance management suite,
and the goal of a security professional is to go through this data and narrow down the problem
space as quickly as possible. A useful service map, such as in Figure 6, shows a clear distinction
between what is known and what is unknown, thereby narrowing our search for malware that
‘calls home’. Even if malware does not ‘call home’, it may use your web application as a
launchpad to go elsewhere, either within your own network or to an external network. This could
be a known site or an unknown site, so we have to use network trending, lists of external services,
and service maps as triggers for possible further investigation.
The simplest malware may be easy to see via an increase in network utilization and such
service maps, but then again malware writers can be sneaky.
Memory Trending
All programs that run within a computer system use memory, so memory utilization becomes
another trigger for determining if an attack has succeeded. Unless the malware can hide itself from
the memory utilization tools within an operating system (as an attack that uses a rootkit can), it
is possible to trend memory over time and determine if something is not in sync with our existing
baseline, Figure 8. While Figure 8 only shows the top 5 consumers, this may be sufficient to
determine if malware was successfully installed.
For web applications, malware embeds itself within the web server or the application being run,
depending on the language used. Since one of the top consumers of memory should be the web
server, which it is in Figure 8, we can tell if there was a spike in web server memory utilization.
On an increase in memory utilization not tied to an existing change management request, we
could surmise that some unknown behavior is taking place.
Figure 8: Trending Memory Baseline
In Figure 8 we show that there is no real increase in memory utilization, but what we do have is a
solid baseline for what is considered normal for the application. The application owners are the
ones that will help determine normality. However, trending graphs give those who look at the
data some semblance of what is normal, without needing to know the details of the application.
In this case we are looking at a LAMP application and httpd is a major feature of this type of
application.
While memory will not tell you where the problem lies, it is one more trigger to tell you that
there is a sudden and unknown issue.
Disk IO Trending
In many cases malware wants to write something, perhaps a rootkit, or data for later transmittal
to a location foreign to the system. There are two valuable numbers to look at when you review
system disk IO performance: the I/O rate and the I/O operations per second (IOPS). Either of
these trends could inform you that there is a problem with an application.
Figure 9: Disk IO Trending
However, if the application is running nominally and either of these measurements shows different
behavior, then this is a sign that something has changed. That change could be a security-related
issue. As a trigger, it is not one you find often, as most malware uses more networking
than disk I/O. However, some malware can fool you and spike a bit of traffic to the disk as it
writes whatever payload it contains.
Putting it Together
When you look at system statistics you want to first see all the baselines, then show all those
things that are different, not just over a short period of time but over a longer period:
several days, months, or even years. We need to determine what is
abnormal behavior as quickly as possible.
Figure 10: All System Stats
As an example let us investigate Figure 10, which is a 7-day look back at the critical system
performance measurements and key processes. What we immediately see is that CPU usage is flat, as
is physical memory usage, which implies these do not show us much in the way of a trigger for any
event. Disk I/O also appears to be flat. That leaves us with the system load average and
network I/O to show possible problems.
In this case, we immediately see that there is higher than normal network I/O starting 7 days
ago and lasting for a few days. This is a big flag to continue with the research we described
earlier, starting with our Network Trending discussion. We also see a spike in load
average, which would also trigger an investigation into what was actually happening a few days
ago within the application.
Are these attack-related issues or normal activities? As a spoiler, the network I/O
presented as an attack but ended up being a backup process gone bad, while the load average
spike was related to a change management action for application upgrades.
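The triage above, lining up anomalous windows against change management, can be sketched as follows. The data layout (day offsets for windows, a list of dated change records) and the function name `triage` are assumptions for illustration; a real deployment would pull these from the APM tool and the change management system.

```python
def triage(anomalies, change_records):
    """Match each anomalous window against recorded changes.

    anomalies: {metric_name: [(start_day, end_day), ...]}
    change_records: [(day, description), ...]
    A window with no matching change is marked for investigation.
    """
    report = []
    for metric, windows in anomalies.items():
        for start, end in windows:
            explained_by = [
                description
                for day, description in change_records
                if start <= day <= end
            ]
            report.append({
                "metric": metric,
                "window": (start, end),
                "explained_by": explained_by,
                "investigate": not explained_by,
            })
    return report


# Hypothetical 7-day look back: day 1 is seven days ago.
anomalies = {"network_io": [(1, 3)], "load_average": [(5, 5)]}
changes = [(5, "application upgrade")]
for row in triage(anomalies, changes):
    print(row)
```

In this hypothetical data the load average spike is explained by the upgrade, while the network I/O window has no recorded change and stays on the list to investigate, whether it turns out to be an attack or a backup gone bad.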
III. Detecting Attacks with Application Performance Measurements
We have seen how system performance measurements could flag possible security issues and, in
some cases, how malware might not be seen from within the system. Malware is often extremely
sneaky in how it performs its activities, so we need to extend our vigilance to a view of the
application as well. APM ends up providing a very good early warning system. We have already
discussed how a networking change could lead us to investigate the application using the list of
external services as well as a service map of known and unknown services in use by the
application, but how can we use the other bits of data that come out of an APM tool?
Response Time
One of the key measurements we have within any APM tool is response time, the time it takes for
certain actions within an application to take place. In the example we are using, we can see
immediately that there is an increase in response time during the suspect times we found by
looking at the system performance measurements over the last 7 days (Figure 11). One is a spike in
response time that correlates to our increase in network traffic, while the other correlates
fairly closely to our increase in load average.
Figure 11: Response Time (7-day lookback)
While this is a PHP-based application, APM tools handle a number of different languages including
Java, Ruby, .NET, Python, and most other interpreted or managed-runtime languages. Unfortunately,
it is very hard to find APM tools that can directly become a part of C/C++ or other compiled
language applications.
We notice in the case depicted in Figure 11 that there is a massive increase in database activity.
While we know there is network activity, we now have another piece of the puzzle. Why is there
an increase in database traffic?
Application Index (ApDex) and Throughput
Most APM tools will show you a number that corresponds to the general health of an application.
Call it an application index if you will. These are generalized numbers that have a predefined
range, with higher values generally being better. These numbers use a weighted balance to
give the overall health of an application, related to throughput as well as response time,
error rates, and other useful bits of information. While generalized, they can also be triggers
that point to the health, and therefore a possible security breach, of an application. If the
numbers suddenly go down, we can assume something has adversely impacted overall performance.
Figure 12: ApDex and Throughput
In Figure 12, we have two derived numbers: the throughput, measured in requests per minute (RPM),
and the ApDex score. During our high database load, we notice that the RPM value has gone up, as
it has during the time period related to the load average spike. So there is overall more
throughput to and from the site, with ApDex showing correspondingly related changes. Because of
this we can tell that ApDex and throughput are related and that there was an increase in traffic
to the site at both of the times we noticed previously. However, we do not know if these are good
changes or bad changes. We may assume that an increase in throughput would be good, but if that is
malware talking to the site, that would be bad.
Therefore weighted application indexes (where we often do not know the formula used, because
vendor-specific formulas tend to be proprietary) become another trigger for further investigation.
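For the Apdex score specifically, the calculation is published by the Apdex Alliance: requests at or under a target time T count as satisfied, requests between T and 4T count as tolerating (at half weight), and the rest count as frustrated. Vendor-specific health indexes may weight things differently. The sketch below is a minimal version of that published formula; the default T of 0.5 seconds is an assumption, since the threshold is configurable per application.

```python
def apdex(response_times, t=0.5):
    """Apdex = (satisfied + tolerating / 2) / total samples.

    satisfied:  response time <= t
    tolerating: t < response time <= 4 * t
    Returns None when there are no samples.
    """
    if not response_times:
        return None
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical response times in seconds: 3 satisfied, 1 tolerating, 1 frustrated.
print(apdex([0.1, 0.2, 0.4, 1.0, 3.0]))
```

A sudden drop in this score means more requests are landing in the tolerating or frustrated buckets, which is the kind of change that should prompt the further investigation described above.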
Database Throughput
Let us review our case of increased database throughput shown in Figure 11. This large amount of
database traffic only shows up under response time; of all the charts available, this is the one
that shows the most evidence of a possible attack. However, the question becomes: is this normal?
We are only looking at a 7-day look back; could this be a normal weekly activity? How to proceed?
1) Expand the View to a 3-Month Look Back, per Figure 13.
Figure 13: 3 Month Response Time Look Back
Given the 3-month response time look back, we can show that the 8/05-8/07 database
activity was not something that happened every week, but we do see some activity 3 months ago
that could be suspect as well.
2) Delve Deeper
To delve deeper, we need to determine exactly what was happening during the time frame in
question, 8/05-8/07. We can do that by looking at transaction traces for the application. If
we look at an ordered list of transactions by date, we can quickly find the culprit days and entry
points into the application.
Depending on the APM tool, you may get a list that has the start time of a task as well as
a URL or other entry point, plus the time it took for the specific entry point of the application
to run. We are looking specifically for anything with a high database throughput. However,
we should also look for things that look abnormal. In this case we find repeated
calls to the same entry point (Figure 14). This in itself may not be fishy, but a further review
of all the transaction traces listed shows that this is not a common occurrence.
3) Review Code Paths
So now that we have found something abnormal, we need to delve further into what is
actually happening. The next step in our process is to investigate the code paths taken by the
application. Up until now, nothing we have done has needed a large amount of
knowledge about the application. We are using the tools available to us to determine what is
considered normal versus abnormal. With a good APM tool, those should be glaringly obvious. The
question becomes how we determine whether the issue is related to a security problem or to an
issue with the code in use.
To answer that we need to look at what exactly the code is doing; without that knowledge we
will not be able to determine if the problem is caused by a security breach, badly written
code, or normal behavior for the code executed.
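The search in step 2, for entry points that are called repeatedly and spend most of their time in the database, can be sketched as a grouping over exported transaction traces. The trace record layout (`entry_point`, `duration`, `db_time`), the function name, and both thresholds are assumptions for illustration, not any tool's export format.

```python
from collections import defaultdict


def suspicious_entry_points(traces, min_calls=3, min_db_share=0.5):
    """Return (entry_point, calls, db_share) for repeated, database-heavy entry points."""
    stats = defaultdict(lambda: {"calls": 0, "total": 0.0, "db": 0.0})
    for trace in traces:
        entry = stats[trace["entry_point"]]
        entry["calls"] += 1
        entry["total"] += trace["duration"]
        entry["db"] += trace["db_time"]

    result = []
    for entry_point, s in stats.items():
        db_share = s["db"] / s["total"] if s["total"] else 0.0
        if s["calls"] >= min_calls and db_share >= min_db_share:
            result.append((entry_point, s["calls"], round(db_share, 2)))
    # Most frequently called entry points first.
    return sorted(result, key=lambda row: row[1], reverse=True)


# Hypothetical traces (seconds): one entry point is hit repeatedly and is database-bound.
traces = (
    [{"entry_point": "/feed.php", "duration": 10.0, "db_time": 9.0}] * 4
    + [{"entry_point": "/index.php", "duration": 0.5, "db_time": 0.1}] * 2
)
print(suspicious_entry_points(traces))
```

As with the manual review, a hit here only narrows the search; deciding whether the calls are an attack still requires looking at the code path.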
Figure 14: Delve Deeper
Code Paths
To further our investigation we need to look at the code paths taken from the entry point to the
completion of the specific task. Specifically, we want to look at anything that would have an
increased usage of a database. So our investigation starts with the list of entry points and then
dives down another level to a summary of all activity. A transaction summary will show if we are
on the right track. We need to know if this task spends a lot of time within a database. Figure 15
is one such summary, and the first item on the list is a SELECT.
A SELECT is definitely a database command, and as such we have our culprit. As you can see from
the summary, the SELECT is taking the vast majority of the time compared to any other action for
the request in question. While this may be a good time to throw things over the wall to a DBA, we
can do better and attempt to find out what is actually happening within the code, which would
also help a DBA determine what is happening.
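What a transaction summary conveys is simply each segment's share of the total request time. A minimal sketch, assuming the segments and their timings have been exported as (name, seconds) pairs; the segment names below are hypothetical:

```python
def summarize_segments(segments):
    """Return (name, seconds, percent_of_total) tuples, slowest segment first."""
    total = sum(seconds for _, seconds in segments)
    if total == 0:
        return []
    ranked = sorted(segments, key=lambda segment: segment[1], reverse=True)
    return [
        (name, seconds, round(100 * seconds / total, 1))
        for name, seconds in ranked
    ]


# Hypothetical timings for one request.
summary = summarize_segments([
    ("PHP execution", 0.5),
    ("Database SELECT", 9.0),
    ("External call", 0.5),
])
for row in summary:
    print(row)
```

When one segment dominates the way the SELECT does here, that percentage is a concrete number to hand to the DBA along with the query itself.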
Perhaps this is also a time to involve a developer of the code. However, we still have
generalities to deal with. We have found our smoking gun; now to see what it does. To do that we
need to get some transaction details.
Figure 15: Transaction Summary
Transaction details will give us all sorts of useful information, specifically the actual database
call that is causing the problem, as shown in Figure 16.
Figure 16: Troublesome Database Query
We now have even more information to take to our database administrator or developer to
determine whether this issue is a security problem or something more
mundane, such as a code issue.
We have gone from a simple-to-read graph to a relatively straightforward SQL query in very
short order. But now we can go even further: we can review the transaction trace in Figure 17
for more information. But there are a few questions that need to be posed in order to go
further.
1) Is this a normal database query?
2) Is this the normal code or has the code contained in the suspect .php file been modified in
some way?
3) If the code was modified, when was it modified?
4) Is it normal to have this code running?
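Questions 2 and 3 can be approached by comparing the suspect file against a known-good copy. The sketch below assumes you have a SHA-256 hash of the file as originally deployed (for example, from your release process); the function names and result layout are hypothetical. Note that a file's modification time can be altered by an attacker, so treat it as a hint rather than proof.

```python
import hashlib
import os
from datetime import datetime, timezone


def file_fingerprint(path):
    """Return (sha256_hex, last_modified_utc) for the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    modified = datetime.fromtimestamp(os.stat(path).st_mtime, tz=timezone.utc)
    return digest.hexdigest(), modified


def check_file(path, known_good_sha256):
    """Report whether the file differs from its known-good hash, and when it last changed."""
    sha256_hex, modified = file_fingerprint(path)
    return {
        "path": path,
        "modified": sha256_hex != known_good_sha256,
        "last_modified_utc": modified,
    }
```

If the hash differs, the reported modification time gives a starting point to line up against the response time and network graphs discussed earlier.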
Figure 17: Transaction Trace
Answering these questions, however, will take intimate knowledge of the code.
In our case the database query was expected, but was expected to complete
quickly. The code should not have been called that often, and the code has not been modified. So
that adds another set of questions:
5) Is this a DoS Attack?
6) Is it a fault in some external service?
7) Was the external service hacked?
As you can see the questions keep coming.
IV. Conclusion and Steps to Using APM as an Early Warning
We have now gone through two aspects of APM, looking for triggers that will be part of any
security early warning system. We have delved into standard resources such as memory, CPU,
disk, and network as well as those specific to applications.
In our example, we noticed there was an issue, and it did not take intimate knowledge of the code
to make that determination. Ultimately, however, it may take a developer to answer some of our
questions.
So the requirements of any APM system to be used as an early warning system are:
• Does not require a developer to determine if there is an anomaly
• Should be able to tell us graphically if there is an abnormality, with the ability to delve
further over time to determine if the activity is normal over a span of time
• Can tell us directly whether network activity goes to well-known domain names or to
short-lived domain names of the kind used by attackers
• Should be tied in some fashion to change management to account for known changes to
the code base, and to restart baselines
Yet when a problem is found, how you proceed may depend entirely on the trigger for the
problem. In the case of network activity, we want to review the following:
1) Any external services used, either by name (hopefully with a way to determine if that is
normal access) or through a service map that can automatically determine the well-known external
service locations, such as Google, Ask.com, Facebook, and others.
2) A look at application response time over a period of time to determine if there is anything
application-specific that is happening.
3) A method to delve down into the application to determine a list of possible transactions that
could be the cause of the possible security issue.
4) A method, such as a transaction summary, to quickly determine if the transaction trace has
anything to do with the response time aspect under review.
5) And finally a list of the exact calls of the suspect transaction.
What is interesting is that only the last element requires intimate knowledge of the code, as
the code would have to be reviewed to determine if there is a problem, and that is only if you are
going to the code level. We could short-circuit the process at the first step and determine that
we are accessing an external service without reason and that malware is calling home.
Then the site needs to be investigated, the breach fixed, and the entry point for the breach
closed. However, by using an APM tool we have discovered very quickly that there was a problem
and can begin our investigation. A good APM tool can aid in that investigation, and what could have
taken days may now take less than an hour, depending on the tool, the application, and the length
of time the APM tool has been running. In fact, the ability to detect external service activity
allows an APM tool to become immediately useful as a security violation detection tool.
V. About The Virtualization Practice
The Virtualization Practice is the leading online resource of objective and educational analysis
focusing upon the virtualization and cloud computing industries.
Edward L. Haletky is the author of VMware vSphere(TM) and Virtual Infrastructure Security:
Securing the Virtual Environment as well as VMware ESX and ESXi in the Enterprise: Planning
Deployment of Virtualization Servers, 2nd Edition. Edward owns AstroArch Consulting, Inc.,
providing virtualization, security, and network consulting and development, and The Virtualization
Practice, where he is also an Analyst. Edward is the Moderator and Host of the Virtualization
Security Podcast as well as a guru and moderator for the VMware Communities Forums, providing
answers to security and configuration questions. Edward is working on new books on virtualization.
VI. About New Relic
New Relic, Inc. is the all-in-one web application performance management provider for the cloud
and the datacenter. Its SaaS solution combines real user monitoring, application monitoring,
server monitoring and availability monitoring in a single solution built from the ground up and
changes the way developers and operations teams manage web application performance in real
time. More than 25,000 organizations use New Relic to optimize over 55 billion metrics in
production each day. New Relic also partners with leading cloud management, platform and
hosting vendors to provide their customers with instant visibility into the performance of deployed
applications. New Relic is a private company headquartered in San Francisco, CA. New Relic is a
registered trademark of New Relic, Inc.
VII. References
Edward L. Haletky. VMware vSphere(TM) and Virtual Infrastructure Security: Securing the Virtual
Environment. Prentice Hall PTR, 1st edition (June 2009).