USING APPLICATION PERFORMANCE MANAGEMENT FOR SECURITY 1

The Virtualization Practice

White Paper:

Using Application Performance Management for Security

Edward L. Haletky

Analyst – Virtualization and Cloud Security

The Virtualization Practice

Sponsored by New Relic

Version 1.0

August 2012

© 2012 The Virtualization Practice, LLC. All Rights Reserved. All other marks are property of their respective owners.

Abstract

At VMworld and RSA Conference last year, The Virtualization Practice, LLC, inquired of security

professionals if there are any early warning systems built within the virtual or cloud security tools

available today. The answer was sadly a negative, but when application performance management

tools were mentioned as an alternative, there was a spark in the conversation that often led to

the question of how one would know whether there was an application problem or a security issue. This distinction often

requires in-depth knowledge about an application, its normal processing, and the normal paths

through the code; something that is only learned over time. However, there is a new breed of tool

available that can provide some important security information ranging from where you are

spending your time (what you need to know), where your site is going, and from where you have

been reached.

Table of Contents

I. Introduction

II. Detecting Attacks with System Performance Measurements

CPU Trending

Network Trending

Memory Trending

Disk IO Trending

Putting it Together

III. Detecting Attacks with Application Performance Measurements

Response Time

Application Index (ApDex) and Throughput

Database Throughput

IV. Conclusion and Steps to Using APM as an Early Warning

V. About The Virtualization Practice

VI. About New Relic

VII. References


I. Introduction

The biggest problem security practitioners face today is finding an unknown security issue as soon

as possible after the issue occurs. We currently use several types of solutions to read application

and system log files, correlate the data, and eventually come out with a possible security

element, but for a SIEM or tool to pick it up, it has to be a known type of issue. It is very difficult

to determine if an unknown is a security issue because all our existing security tools are designed

to look for patterns of events. Eventually, the unknowns will be spotted and log filters will be

updated. All in all, this could take a significant amount of time.

There is an easier way to determine if something unknown is happening within an application, and

that is to use an application performance management tool that tracks not only the time for

actions, but also from where actions start and to where actions end up. Actions can occur through

all tiers of an application from the web frontend through to the database. But this is performance

data; how can it be used for security purposes?

Performance data is sampled often, perhaps even on every query. This data will contain not only

the total time for an action, but perhaps the actual command issued all the way through each tier

and eventually the actual database or back end call made. It would be ideal if timing data was

available through each tier. But such timing data requires that you know exactly what the

application does on a regular basis. This implies we need to know what is different or unique,

perhaps a unique code path but also unique timing information. The timing information could end

up showing an unusual, non-normal action, which in effect could be an unknown attack to a

system.

Equally important from a security perspective is knowledge of how the application is accessed,

and what it accesses further down its processing path. The reason for this is that most attacks are

trying to get to somewhere else or to specific data. An attack that is trying to go somewhere else

is a pivot attack, while other attacks are trying to access data or subvert a subsystem to gain

deeper access.

Attacks within an application will change performance timings associated with aspects of the

application based on the style of the attack. The attack could slow down an application, but could

also speed it up. One example would be an SQL injection attack, which could cause a database to

query more than it should or time out due to defense-in-depth security implementations. Or it

could cause a subroutine to short-circuit. Another common attack is to insert malware that, once

it latches onto your application, calls back to a command and control center somewhere

else on the Internet, most likely in a foreign country.

The items we are mentioning herein are from real-world experience, as a website I maintained

was hacked, and I was able to determine when the attack occurred, the face of the attack, and

the solution very quickly due to a recently installed APM tool from New Relic. By investigating the

sudden increase in utilization I was able to successfully find the problem. APM as an early warning

system for security issues works. All you need to understand is how to interpret what you are

seeing, and to start the security investigation side by side with the application or system

investigation. Eventually, we either rule out a security issue, or we determine that, due to an attack,

the performance of an application or system changed.

II. Detecting Attacks with System Performance Measurements

As we discussed, to use APM for performance management reasons you often need to be familiar

with an application or technology to interpret results, but to use APM for security measures you

often need to know the timings and normal operations of the application or system. There are two

parts to using APM for security purposes: the first is to use performance measures to look at the

system, and the second is to look at the application. When looking at the system we are interested

in key issues regarding a system, specifically the normal resources we can find in many

performance management tools, which are the standard resources of a virtual environment: CPU,

Network, Memory, Disk.

CPU Trending

We need to trend CPU to determine if there are any changes to our current CPU utilization over

time. In Figure 1, we show a flat CPU trend over roughly a 30-minute period. However, what

would you do if there was a spike in this overall flat CPU usage?

Figure 1: CPU Usage Trending

In many cases, this could be due to some other normal behavior, so the first thing would be to

expand the view from 30 minutes to several days or even months to determine if there is a well

known pattern to the behavior you have experienced. Assuming there is not, the next element

would be to check out change management, to determine if there was a recent change to the

application or server. If nothing changed, then we may have a security problem.
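
As a minimal sketch of the "expand the view" step, the Python below checks whether a spike lines up with earlier spikes on a regular schedule, for example a weekly job. The function name, the weekly period, the one-hour tolerance, and the sample timestamps are all assumptions made for illustration; they do not come from any APM tool.

```python
from datetime import datetime, timedelta


def recurs_on_schedule(spike_time, past_spikes,
                       period=timedelta(days=7),
                       tolerance=timedelta(hours=1),
                       min_repeats=2):
    """Return True if enough earlier spikes sit a whole number of
    periods before spike_time (within the given tolerance)."""
    repeats = 0
    for past in past_spikes:
        gap = spike_time - past
        if gap <= timedelta(0):
            continue  # skip the spike itself and anything later
        cycles = round(gap / period)  # nearest whole number of periods
        if cycles >= 1 and abs(gap - cycles * period) <= tolerance:
            repeats += 1
    return repeats >= min_repeats


# Hypothetical spike history pulled from a longer look back.
spike = datetime(2012, 8, 5, 2, 0)
history = [datetime(2012, 7, 29, 2, 10), datetime(2012, 7, 22, 1, 55)]
print(recurs_on_schedule(spike, history))
```

A result of True suggests a well-known pattern; a result of False means the next stop is change management, as described above.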

Why could this be a security problem? Because most exploits will increase CPU utilization if they

do not already hide their processing amongst other processing. There are several web attacks that

will gain attackers shell access, and the applications they then run will use up CPU. Without a

monitoring tool that systematically looks at all CPU utilization, you may not know the attack was

even made. This is a trigger for the web application and why a baseline like the one shown in Figure 1

is a very good thing to have.


Increased or even decreased CPU utilization is a trigger to dig deeper for security reasons. It is not

the only telltale, however.
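
One simple way to turn a flat baseline into a trigger is to measure how far a new sample sits from the baseline in standard deviations, in either direction. The sketch below assumes CPU samples are available as plain percentages; the function name, the threshold of 3, and the sample values are illustrative assumptions, not something an APM tool prescribes.

```python
from statistics import mean, pstdev


def deviates_from_baseline(baseline, sample, threshold=3.0):
    """Return True if sample is more than `threshold` standard
    deviations above OR below the baseline mean."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return sample != mu
    return abs(sample - mu) / sigma > threshold


# Hypothetical CPU utilization samples (percent) from a flat period.
cpu_baseline = [12, 11, 13, 12, 12, 11, 13, 12]
for value in (12.5, 45, 1):
    print(value, deviates_from_baseline(cpu_baseline, value))
```

Because the check uses the absolute distance from the mean, a sudden drop is flagged just like a sudden spike.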

Network Trending

To understand how network trending will help with attack detection, one must first understand

how attacks work. The first phase of any attack is to use the network to enumerate the protocols

being used by a server. If the server is not locked down sufficiently, the attacker will determine

what applications are being run, and from there launch an attack over the network against the

services in use. The goal for such a network attack is to further pivot attacks deeper into your

network. One other trigger to an attack is an increase in network activity as the attacker

transfers their payload to the compromised system, or network activity could increase due to

denial-of-service attempts.

Figure 2: Network Trending Baseline

So we should pay close attention to any network trending baselines to determine if there is a

sudden increase in network activity. Figure 2 is one such baseline that shows normal behavior.

Abnormal behavior could be a sudden spike of traffic into a potentially compromised system or out

of the system, or perhaps a sudden dip in traffic. Once a site is infected, Google and web

browsers can detect well-known infections and prevent the traffic from being delivered. In that

case, overall traffic to a site could also dip or go flat based on what tools customers use.

Furthermore, quite a bit of modern malware ‘calls home’, and a sudden increase in outbound

traffic would trigger further research, specifically into what the outbound traffic consisted of. For

some tools this is either a list of external services or a graph of external service calls.

External Services

External service calls, such as external web calls from within an application, can also be tracked as

shown in Figure 3. If you notice an increase in external web traffic you will want to perform

further investigation. While Network Trending will show the behavior of the overall network,

determining what makes up an outgoing network change would require a deeper view into an

application.


Figure 3: Increase in External Services

We may further want to delve into a full list of external services. Figure 4 shows a list of some

external services for a given application. Such a list tied to Figure 3 could tell us if there is a ‘call

home’ scenario in play. On one such investigation, I noticed an increase in Web External traffic

that the site should never have been making, as the site was fairly simple and straightforward. This led to

investigating a list of web sites similar to Figure 4.

Figure 4: List of External Services

If your APM tool provides a list of sites contacted by your application, periodically review this list.

Ideally the list should be expanded to include country of origin and time the site has been

available. When malware calls home, it typically calls home to short-lived sites. If you

knew the country of origin you could quickly determine if you ever expected traffic to end up

within the country in question. For Figure 4, you would need to take the list and run it through some

tool that would output the country of origin for the site as well as the age of the site; for this the

whois tool would be useful. In the case I mentioned previously, the attack was ‘calling home’ to

a site that was short-lived and located in a country that I did not have coded into the application.
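
The review described above can be sketched in Python once the country and registration date of each site have been collected, for instance by reading whois output by hand. The sketch does not call whois itself; the `DomainInfo` record, the 90-day age cut-off, the host names, and the placeholder country code "ZZ" are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DomainInfo:
    host: str
    country: str   # two-letter code copied from the registration record
    created: date  # registration date copied from the registration record


def flag_suspect_domains(domains, expected_countries, today, min_age_days=90):
    """Map each suspect host to the reasons it was flagged."""
    flagged = {}
    for info in domains:
        reasons = []
        if info.country not in expected_countries:
            reasons.append("unexpected country")
        if (today - info.created).days < min_age_days:
            reasons.append("recently registered")
        if reasons:
            flagged[info.host] = reasons
    return flagged


# Hypothetical records for three external services.
observed = [
    DomainInfo("feeds.example.com", "US", date(2008, 1, 15)),
    DomainInfo("cc.example.net", "ZZ", date(2012, 7, 30)),
    DomainInfo("new.example.org", "US", date(2012, 7, 1)),
]
for host, why in flag_suspect_domains(observed, {"US"}, date(2012, 8, 7)).items():
    print(host, why)
```

A young but legitimate service will also be flagged, so the output is a list of sites to look at, not a verdict.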

In some cases the external service called could be well known, or look to be well known based on

age and country of origin. In this case it will be necessary to view the data in a different fashion

as well, as a graph of contact. You may suddenly see an increase in traffic to an existing site. If


this happens, an investigation is warranted. Once more, check with change control to

determine if the application changed, or if the external service in use also changed. Figure 5

shows the top 5 external services called. If the malware ‘calls home’ this may be seen visually

with an increase in traffic over a period of time. Once more, expand your viewing to sufficient

size to determine if this is expected or unexpected behavior.

Figure 5: Top 5 External Services

Furthermore, some malware is extremely sneaky and may hop from site to site to site based on its

command and control. If this is the case, you will want to get a full list of all sites to which the

application talked, even if it was only one connection. In general, if the malware has been there

long enough, patterns will also appear. You will want to easily spot even these one-off services

over time. The list method described above will work, but there are other methods such as a

service map.
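
Spotting one-off services from a full list of outbound calls amounts to counting calls per host and keeping the rare ones. The sketch below assumes the call log has already been exported as one host name per outbound call; the function name and the host names are hypothetical.

```python
from collections import Counter


def rarely_contacted(call_log, max_calls=1):
    """Return hosts contacted at most `max_calls` times, sorted by name."""
    counts = Counter(call_log)
    return sorted(host for host, n in counts.items() if n <= max_calls)


# Hypothetical export: one entry per outbound call made by the application.
calls = [
    "api.example.com", "api.example.com", "api.example.com",
    "feeds.example.org", "feeds.example.org",
    "a1.example.net", "b2.example.net",
]
print(rarely_contacted(calls))
```

Raising `max_calls` widens the net for malware that hops between sites but reuses each one a few times.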

Service Map

Figure 6: Service Map


A service map introduces a new view of the application, one that tags times for all tiers of the

application as well as for all external services in use. In this example, Figure 6, we see that there

are some well-known locations that take up relatively high times, but there are “15 more External

Services” in use. The service map we see in Figure 6 identifies some of those well-known services

and is expanded in Figure 7. Even before we look at the “15 more External Services”, our list of

external services starts with Tinyurl, Wordpress, Feedburner, and Meandmymac.net.

Figure 7: Well-Known Services?

We know from this list of services that our application has spoken to Wordpress, Talkshoe,

Pingomatic, Ask.com, Akismet, Twitter, Wordpress, Something Unknown, Something Unknown,

Bing, Google, and two more unknown locations. This service map allows us to narrow down our

research to just the 4 unknown services, and hovering over them clearly shows the site to which

they belong.
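
The known-versus-unknown split that a service map draws can be approximated by matching each contacted host against a list of domains you expect. This is only a sketch: the domain list and host names below are assumptions chosen for illustration, and the matching is a plain suffix check on the host name.

```python
def is_known(host, known_domains):
    """True if host is a known domain or a subdomain of one."""
    host = host.lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in known_domains)


def split_known_unknown(hosts, known_domains):
    """Partition hosts into (known, unknown), preserving input order."""
    known, unknown = [], []
    for host in hosts:
        (known if is_known(host, known_domains) else unknown).append(host)
    return known, unknown


# Assumed expected domains and a hypothetical list of contacted hosts.
expected = {"wordpress.com", "twitter.com", "google.com", "bing.com"}
contacted = ["api.twitter.com", "www.google.com",
             "google.com.example.net", "cdn.example.org"]
known_hosts, unknown_hosts = split_known_unknown(contacted, expected)
print("known:", known_hosts)
print("unknown:", unknown_hosts)
```

Matching on a leading dot keeps a look-alike such as `google.com.example.net` out of the known group.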

There is a huge amount of data available within an application performance management suite

and the goal of a security professional is to go through this data and narrow down the problem

space as quickly as possible. A useful service map, such as in Figure 6, shows a clear distinction

between what is known and what is unknown, thereby narrowing our search for malware that

‘calls home’. Even if malware does not ‘call home’ it may use your web application as a

launchpad to go elsewhere either within your own network or to an external network. This could

be a known site, or an unknown site, so we have to use networking, lists of external services, and

service maps as triggers to possible further investigation.

The simplest malware may be easy to see via an increase in network utilization and such

service maps, but then again malware writers can be sneaky.

Memory Trending

All programs that run within a computer system use memory. So memory utilization becomes

another trigger for determining if an attack has succeeded. If the malware cannot hide itself from

memory utilization tools within an operating system (as an attack that uses a rootkit can), it is

possible to trend memory over time and determine if something is not in sync with our existing

baseline (Figure 8). While Figure 8 only shows the top 5 consumers, this may be sufficient to

determine if malware was successfully installed.

For web applications, malware embeds itself within the web server or the application being run,

depending on the language used. Since one of the top consumers of memory should be the web

server, which it is in Figure 8, we can tell if there was a spike in web server memory utilization.

On an increase in memory utilization not tied to an existing change management request, we

could surmise that some unknown behavior is taking place.
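
A top-consumers list can be compared against its baseline in two ways: a process that was never in the list before, and a known process that has grown well past its usual size. The sketch below assumes memory figures in megabytes keyed by process name; the 25 percent growth limit and every value shown are made up for illustration.

```python
def memory_changes(baseline_mb, current_mb, growth_limit=0.25):
    """Return (process, finding) pairs for new or fast-growing consumers."""
    findings = []
    for name, used in sorted(current_mb.items()):
        if name not in baseline_mb:
            findings.append((name, "new top consumer"))
        elif used > baseline_mb[name] * (1 + growth_limit):
            findings.append((name, "grew beyond limit"))
    return findings


# Hypothetical top consumers (MB) for a LAMP host.
baseline = {"httpd": 400, "mysqld": 300, "php": 120}
current = {"httpd": 650, "mysqld": 310, "xyz": 90}
for name, finding in memory_changes(baseline, current):
    print(name, finding)
```

Either finding is only a trigger; a matching change management request can still explain it.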

Figure 8: Trending Memory Baseline

In Figure 8 we show that there is no real increase in memory utilization, but what we do have is a

solid baseline for what is considered normal for the application. The application owners are the

ones that will help determine normality. However, trending graphs give those who look at the

data some semblance of what is normal, without needing to know the details of the application.

In this case we are looking at a LAMP application and httpd is a major feature of this type of

application.

While memory will not tell you where the problem lies, it is one more trigger to tell you that

there is a sudden and unknown issue.

Disk IO Trending

In many cases malware wants to write something, perhaps a rootkit, or data for later transmittal

to a location foreign to the system. There are two valuable numbers to look at when you review

system disk IO performance: the I/O rate and the I/O operations per second (IOPS). Either of

these trends could inform you that there is a problem with an application.

Figure 9: Disk IO Trending

However, if the application is running nominally and either of these measurements shows different

behavior, then this is a sign that something has changed. That change could be a security-related

issue. As a trigger, it is not one you find often, as most malware actually uses more networking

than disk I/O. However, some malware can fool you and spike a bit of traffic to the disk as it writes

whatever payload it contains.

Putting it Together

When you look at system statistics you want to first see all the baselines, then show all those

things that are different, not just over a short period of time but over a longer period of time,

perhaps over several days, months, or even years. We need to determine what is

abnormal behavior as quickly as possible.

Figure 10: All System Stats

As an example let us investigate Figure 10, which is a 7-day look back of the critical system

performance issues, and key processes. What we immediately see is that CPU Usage is flat as is

physical memory usage, which implies these do not show us much in the way of a trigger for any

event. Disk I/O also appears to be flat. This leaves us with the system Load Average as well as

Network I/O to show possible problems.

In this case, we immediately see that there is a higher than normal network I/O starting 7 days

ago and lasting for a few days. This is a big flag to continue the research we mentioned

before, starting with our previous Network Services discussion. We also see a spike in load

average, which would also trigger an investigation into what was actually happening a few days

ago within the application.

Are these attack-related issues or normal activities? Actually, as a spoiler, the Network I/O

presented as an attack, but ended up being a backup process gone bad, while the load average

spike was related to a change management action related to application upgrades.
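
Checking anomalies against change management can be sketched as a simple time match: an anomaly that starts shortly after a recorded change is treated as explained, and everything else stays on the list for investigation. The six-hour window, the function name, and the timestamps below are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def unexplained_anomalies(anomalies, changes, window=timedelta(hours=6)):
    """Return anomalies that do not start within `window` after any change.

    anomalies: list of (metric_name, datetime) pairs
    changes:   list of datetimes taken from change management records
    """
    unexplained = []
    for metric, seen_at in anomalies:
        explained = any(
            timedelta(0) <= seen_at - changed_at <= window
            for changed_at in changes
        )
        if not explained:
            unexplained.append((metric, seen_at))
    return unexplained


# Hypothetical 7-day look back: two anomalies, one recorded change.
anomalies = [
    ("network_io", datetime(2012, 8, 5, 3, 0)),
    ("load_average", datetime(2012, 8, 7, 10, 0)),
]
changes = [datetime(2012, 8, 7, 9, 0)]  # e.g. an application upgrade
print(unexplained_anomalies(anomalies, changes))
```

An unexplained anomaly is not proof of an attack; as the backup example shows, it is a prompt to keep digging.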


III. Detecting Attacks with Application Performance Measurements

We have seen how system performance measurements could trigger possible security issues and in

some cases how malware could not be seen from within the system. Malware is often extremely

sneaky in how it performs its activities, so we need to increase our vigilance to a view of the

application as well. APM ends up providing a very good early warning system. We have already

discussed how a networking change could lead us to investigate the application using the list of

external services as well as a service map of known and unknown services in use by the

application, but how can we use the other bits of data that come out of an APM tool?

Response Time

One of the key tools we have within any APM tool is response time, the time it takes for certain

actions within an application to take place. In the example we are using we can see immediately

that there is an increase in response time during our suspect times we found by looking at the

system performance measurements over the last 7 days (Figure 11). One is a spike in response

time that correlates to our increase in network traffic, while the other correlates fairly closely to

our increase in load average.

Figure 11: Response Time (7-day lookback)

While this is a PHP based application, APM tools handle a number of different languages including

Java, Ruby, .NET, Python, and most other interpreted languages. Unfortunately, it is very hard to

find APM tools that can directly become a part of C/C++ or compiled language applications.

We notice in the case depicted in Figure 11 that there is a massive increase in database activity.

While we know there is network activity, we now have another piece to the puzzle. Why is there

an increase in database traffic?

Application Index (ApDex) and Throughput

Most APM tools will show you a number that corresponds to the general health of an application.

Call it an application index if you will. These are generalized numbers that have a predefined

range with generally the higher values being better. These numbers use a weighted balance to

give the overall health of an application that is related to throughput as well as response time,

error rates, and other useful bits of information. While generalized, they can also be triggers for

investigating the health, and therefore a possible security breach, of an application. If the numbers suddenly go

down, we can assume something has adversely impacted overall performance.

Figure 12: ApDex and Throughput

In Figure 12, we have two artificially generated numbers: the Throughput, measured as RPM (requests per minute), and

the Apdex score. During our high database load, we notice that the RPM value has gone up, as it has in

the time period related to load average. So there is overall more throughput to and from the site,

with ApDex showing correspondingly related changes. Because of this we can tell that ApDex and

Throughput are related and that there was an increase in traffic to the site in both the times we

noticed previously. However, we do not know if these are good changes or bad changes. We may

assume that an increase in throughput would be good, but if that is malware talking to the site,

that would be bad.

Therefore weighted application indexes (where we may not know the formula used when such

formulas are proprietary) become another trigger for further investigation.
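
Where an index follows the openly published Apdex method, the score can be reproduced from raw response times, which helps when judging whether a drop is meaningful. The sketch below assumes that method: a response at or under the target time T counts as satisfied, one between T and 4T counts as tolerating, and the score is (satisfied + tolerating / 2) / total. Whether a given tool's index is computed this way is an assumption, and the sample times and the 0.5 second target are invented.

```python
def apdex(response_times, threshold):
    """Apdex score in the range 0..1 for a target time `threshold`.

    satisfied:  time <= threshold
    tolerating: threshold < time <= 4 * threshold
    frustrated: everything slower (counts as zero)
    """
    if not response_times:
        raise ValueError("need at least one response time")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times
                     if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical response times in seconds, scored against T = 0.5 s.
times = [0.2, 0.3, 0.4, 0.6, 1.0, 2.5]
print(round(apdex(times, 0.5), 2))
```

Higher is better, so a flood of slow requests from malware pulls the score down even while throughput goes up.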

Database Throughput

Let us review our case of increased database throughput shown in Figure 11. This large amount of

database traffic only shows up under response time; of all the charts available, this is the one

that shows the most of a possible attack. However, the question becomes: is this normal? We are

only looking at a 7-day look back; could this be a normal weekly activity? How to proceed?

1) Expand the View to a 3-Month Look back per Figure 13.


Figure 13: 3 Month Response Time Look Back

Given the 3 Month Response Time look back we can show that the 8/05-8/07 database

activity was not something that happened every week, but we do show some activity 3 months ago

that could be suspect as well.

2) Delve Deeper

To delve deeper, we need to determine exactly what was happening during the time frame in

question, 8/05-8/07. We can do that by looking at transaction traces for the application. If

we look at an ordered list of transactions by date, we can quickly find the culprit days and entry

points into the application.

Depending on the APM tool you may get a nice list that has the start time of a task as well as

a URL or other entry point plus the time it took for the specific entry point of the application

to run. We are looking specifically for anything with a high database throughput. However,

we should also look for things that look abnormal. In the case of the above we find repeated

calls to the same entry point (Figure 14). This in itself may not be fishy, but a further review

of all transaction traces listed shows that this is not a common occurrence.

3) Review Code Paths

So now that we have found something abnormal, we need to further delve into what is

actually happening. The next step in our process is to investigate the code paths taken by the

application. Up until now, everything we have done does not need a large amount of

knowledge about an application. We are using the tools available to us to determine what is

considered normal vs abnormal. With a good APM tool, those should be glaringly obvious. The

question becomes how do we determine if the issue was related to a security problem or an

issue with the code in use.

To answer that we need to look at what exactly the code is doing; without that knowledge we

will not be able to determine if the problem is caused by a security breach, badly written

code, or normal behavior for the code executed.
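
The "delve deeper" step, finding entry points that are called repeatedly inside the suspect time frame, can be sketched as a count over exported transaction traces. The tuple layout, the entry point names, and the minimum of three calls are assumptions for illustration; real tools present this list in their own format.

```python
from collections import Counter
from datetime import datetime


def repeated_entry_points(traces, start, end, min_calls=3):
    """Return (entry_point, count) pairs, busiest first, for entry points
    called at least `min_calls` times between start and end.

    traces: iterable of (timestamp, entry_point, duration_seconds)
    """
    counts = Counter(entry for ts, entry, _ in traces if start <= ts <= end)
    return [(entry, n) for entry, n in counts.most_common() if n >= min_calls]


# Hypothetical traces; only the 8/05-8/07 window is of interest.
traces = [
    (datetime(2012, 8, 5, 1, 0), "/report.php", 9.1),
    (datetime(2012, 8, 5, 1, 5), "/report.php", 8.7),
    (datetime(2012, 8, 6, 2, 0), "/report.php", 9.4),
    (datetime(2012, 8, 6, 3, 0), "/index.php", 0.3),
    (datetime(2012, 8, 1, 3, 0), "/report.php", 0.4),
]
window = (datetime(2012, 8, 5), datetime(2012, 8, 7, 23, 59))
print(repeated_entry_points(traces, *window))
```

Running the same count over a quiet week gives the comparison needed to say whether the repetition is unusual.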

Figure 14: Delve Deeper

Code Paths

To further our investigation we need to look at the code paths taken from the entry point to the

completion of the specific task. Specifically we want to look at anything that would have an

increased usage of a database. So our investigation starts with the list of entry points and then

dives down another level to a summary of all activity. A transaction summary will show if we are

on the right track. We need to know if this task spends a lot of time within a database. Figure 15

is one such summary and the first item on the list is a SELECT.

A SELECT is definitely a database command, and as such we have our culprit. As you can see from

the summary, the SELECT is taking the vast majority of time compared to any other action for the

request in question. While this may be a good time to throw things over the wall to a DBA, we

can do better and attempt to find out what is actually happening within the code, which would

also help a DBA determine what is happening.
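
Reading a transaction summary comes down to asking what share of the total time each segment takes. The sketch below assumes the summary has been exported as segment names with total seconds; the segment names and timings are invented for illustration.

```python
def time_share(segments):
    """Return (segment, fraction_of_total) pairs, largest share first.

    segments: dict mapping segment name to total seconds spent in it
    """
    total = sum(segments.values())
    if total == 0:
        return []
    return sorted(
        ((name, seconds / total) for name, seconds in segments.items()),
        key=lambda item: item[1],
        reverse=True,
    )


# Hypothetical summary for one suspect request.
summary = {"Database SELECT": 8.4, "PHP execution": 1.2, "External web call": 0.4}
for name, share in time_share(summary):
    print(f"{name}: {share:.0%}")
```

A single segment holding most of the time, as the SELECT does here, is the one to hand to a DBA or developer.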

Perhaps this is also a time to involve a developer of the code. However, we still have

generalities to deal with. We have found our smoking gun, now to see what it does. To do that we

need to get some transaction details.


Figure 15: Transaction Summary

Transaction details will give us all sorts of useful information, specifically the actual database call

that is causing the problem as shown in Figure 16.

Figure 16: Troublesome Database Query

We now have even more information to go to our database administrator or developer with to

make a determination as to whether or not this issue is a security problem or something more

mundane, such as a code issue.

We have gone from a simple-to-read graph to a relatively straightforward SQL query in very

short order. But now we can go even further. We can review the transaction trace in Figure 17

for more information. But there are a couple of questions that need to be posed in order to go

further.

1) Is this a normal database query?

2) Is this the normal code or has the code contained in the suspect .php file been modified in

some way?

3) If the code was modified, when was it modified?

4) Is it normal to have this code running?
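
Questions 2 and 3 can be approached by comparing the suspect file against a hash recorded when the code was known to be good, and by reading its modification time. This is a sketch under assumptions: it presumes a known-good SHA-256 value was saved earlier (for example at deployment), and the function names are hypothetical. The modification time can be altered by an attacker, so treat it as a hint only.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of the file's current contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def check_file(path, known_good_sha256):
    """Return (was_modified, last_modified_utc) for one file.

    was_modified is True when the current hash differs from the
    known-good hash recorded earlier.
    """
    p = Path(path)
    was_modified = sha256_of(p) != known_good_sha256
    last_modified = datetime.fromtimestamp(p.stat().st_mtime, tz=timezone.utc)
    return was_modified, last_modified
```

Recording hashes for every deployed file at release time, tied to the change management record, makes this check available when it is needed.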

Figure 17: Transaction Trace

Answering these questions at this point, however, will take intimate knowledge of the code.

In our case the database query was expected, but was expected to complete

quickly. The code should not have been called that often, and the code has not been modified. So

that adds another set of questions:

5) Is this a DoS Attack?

6) Is it a fault in some external service?

7) Was the external service hacked?

As you can see the questions keep coming.

IV. Conclusion and Steps to Using APM as an Early Warning

We have now gone through two aspects of APM, looking for triggers that will be part of any

security early warning system. We have delved into standard resources such as memory, CPU,

disk, and network as well as those specific to applications.

In our example, we notice there is an issue, and it does not take intimate knowledge of the code

to make that determination; however, ultimately it may take a developer to answer some of our

questions.

So the requirements of any APM system to be used as an early warning system are:

• Does not require a developer to determine if there is an anomaly

• Should be able to tell us graphically if there is an abnormality, with the ability to delve

further over time to determine if the activity is normal over a span of time.


• Can tell us directly if some network activity is from well-known or short-lived domain

names used by attackers

• Should be tied in some fashion to change management to account for known changes to

the code base, and to restart baselines.

Yet when a problem is found, how you proceed may depend entirely on the trigger for the

problem. In the case of network activity, we want to review the following:

1) Any external services used either by name (hopefully with a way to determine if that is

normal access) or a service map that can automatically determine the well-known external

service locations, such as Google, Ask.com, Facebook, and others.

2) A look at application response time over a period of time to determine if there is anything

application specific that is happening.

3) A method to delve down into the application to determine a list of possible transactions that

could be the cause of the possible security issue.

4) A method to quickly determine if the transaction trace has anything to do with the response

time aspect under review, such as a transaction summary.

5) And finally a list of the exact calls of the suspect transaction.

What is interesting is that only the last element will require intimate knowledge of the code, as

the code would have to be reviewed to determine if there is a problem and that is only if you are

going to the code level. We could short-circuit the process at the first step and determine that

we are accessing an external service without reason and that the malware is calling home.

Then the site needs to be investigated, the breach fixed, and the entry point for the breach

closed. However, by using an APM tool we have discovered very quickly that there was a problem

and can begin our investigation. A good APM tool can aid in that investigation and what could have

taken days will now take less than an hour depending on the tool, application, and length of time

the APM tool has been running. Actually, the ability to detect external service activity allows an

APM tool to become immediately useful as a security violation detection tool.

V. About The Virtualization Practice

The Virtualization Practice is the leading online resource of objective and educational analysis

focusing upon the virtualization and cloud computing industries.

Edward L. Haletky is the author of VMware vSphere(TM) and Virtual Infrastructure Security:

Securing the Virtual Environment as well as VMware ESX and ESXi in the Enterprise: Planning

Deployment of Virtualization Servers, 2nd Edition. Edward owns AstroArch Consulting, Inc.,

providing virtualization, security, network consulting and development, and The Virtualization

Practice, where he is also an Analyst. Edward is the Moderator and Host of the Virtualization Security

Podcast as well as a guru and moderator for the VMware Communities Forums, providing answers

to security and configuration questions. Edward is working on new books on Virtualization.

VI. About New Relic

New Relic, Inc. is the all-in-one web application performance management provider for the cloud

and the datacenter. Its SaaS solution combines real user monitoring, application monitoring,

server monitoring and availability monitoring in a single solution built from the ground up and

changes the way developers and operations teams manage web application performance in real-

time. More than 25,000 organizations use New Relic to optimize over 55 billion metrics in

production each day. New Relic also partners with leading cloud management, platform and

hosting vendors to provide their customers with instant visibility into the performance of deployed

applications. New Relic is a private company headquartered in San Francisco, CA. New Relic is a

registered trademark of New Relic, Inc.

VII. References

Edward L. Haletky. VMware vSphere(TM) and Virtual Infrastructure Security: Securing the Virtual

Environment, Prentice Hall PTR; 1 edition (June, 2009).
