detecting hardware virtualization rootkits

Edgar Barbosa
COSEINC Advanced Malware Labs

SyScan’07

Speaker info
• Edgar Barbosa
• Security researcher
• Currently employed at COSEINC
• Experience with reverse engineering of the Windows kernel and the x86/x64 CPU architecture
• Published some articles at rootkit.com
• Participated in the creation of BluePill, a hardware virtualization based rootkit

Content
• Part I
    • How do hardware virtualization rootkits (HVR) work?
• Part II
    • How to detect an HVR?

Detection of virtualization rootkits

Hardware virtualization rootkits
• Intel and AMD developed virtualization extensions to the x86 architecture: VT-x and SVM.
• There are two famous hardware virtualization based rootkits:
    • Vitriol, created by Dino Dai Zovi – uses Intel VT-x
    • Bluepill, designed by Joanna Rutkowska – uses AMD SVM
• Source code not public
• We will focus on the Bluepill rootkit in this presentation, but the concepts and methods are very similar on the Intel platform.

Bluepill
• Designed by Joanna Rutkowska
• Intellectual property of COSEINC
• Uses AMD Secure Virtual Machine (SVM) extensions
• Runs in 64-bit mode
• Supports multicore systems

AMD SVM
• SVM stands for “Secure Virtual Machine”
• It’s a CPU extension to support Virtual Machine Monitors (VMM), a.k.a. hypervisors.
• 8 new instructions:
    • VMRUN
    • VMSAVE
    • VMLOAD
    • VMMCALL
    • CLGI
    • STGI
    • SKINIT
    • INVLPGA

Initialization of an SVM rootkit
• Before any SVM instruction can be used, the EFER.SVME bit must be set to 1.
• Trying to execute an SVM instruction with SVME equal to 0 results in a #UD (invalid opcode) exception.
• The rootkit allocates and initializes the VMCB structure.
• The VMCB (Virtual Machine Control Block) address must be 4KB-aligned.
• The VMCB describes a virtual machine to be executed.
• It contains:
    • Instructions or events in the guest to be intercepted
    • Control bits
    • Guest processor state (general registers, RIP, CR registers, …)
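
The two preconditions above can be written down as simple bit tests. This is only a sketch: the helper names are made up for illustration, the EFER index is the one quoted later in these slides, and the SVME bit position (bit 12) is taken from the AMD manuals. Reading or writing the real MSR needs ring 0, so the functions work on plain values.

```c
#include <stdint.h>

#define MSR_EFER   0xC0000080u     /* EFER index, as quoted in the slides */
#define EFER_SVME  (1ull << 12)    /* SVME is bit 12 of EFER (AMD manuals) */
#define VMCB_ALIGN 4096ull         /* VMCB must be 4KB-aligned */

/* Hypothetical helper: is SVM enabled in this EFER value? */
static int svme_enabled(uint64_t efer)
{
    return (efer & EFER_SVME) != 0;
}

/* Hypothetical helper: EFER value to write back in order to enable SVM. */
static uint64_t efer_with_svme(uint64_t efer)
{
    return efer | EFER_SVME;
}

/* Hypothetical helper: does this physical address satisfy the VMCB alignment rule? */
static int vmcb_pa_is_aligned(uint64_t pa)
{
    return (pa & (VMCB_ALIGN - 1)) == 0;
}
```

A ring-0 loader would read EFER with RDMSR, pass the value through `efer_with_svme`, write it back with WRMSR, and refuse any VMCB address for which `vmcb_pa_is_aligned` returns 0.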

Initialization of an SVM rootkit
• After VMCB initialization, set the VM_HSAVE_PA MSR. This is the physical address where the VMRUN instruction saves host processor state information.
• Then execute the VMRUN instruction with the RAX register equal to the physical address of the VMCB.

Initialization of an SVM rootkit

VMRUN instruction
• Available only at CPL 0
• The CPU enters a new processor mode: guest mode
• In guest mode the behavior of some instructions changes to facilitate virtualization
• Performs consistency checks on the host and guest state
• Saves the host processor state
• Loads the guest processor state configured in the VMCB
• The CPU now runs in guest mode until an intercept occurs

#VMEXIT
• When an intercept triggers, the processor performs a #VMEXIT
• On #VMEXIT the processor:
    • Disables interrupts
    • Clears all intercepts
    • Sets the host CPL to 0
    • Disables all breakpoints
    • Checks the reloaded host state for consistency
• The reason for the #VMEXIT is saved in the EXITCODE field of the VMCB structure, with additional detail in the EXITINFO fields
• The Bluepill interception handler routine is then executed

Bluepill hypervisor

Detection of virtualization rootkits

“Undetectable” rootkits
• Popek and Goldberg VMM properties:
    • Efficiency
    • Resource control
    • Equivalence
• Equivalence “implies that any program executing on a virtual machine must behave in a manner identical to the way it would have behaved when running directly on the native hardware” [1]
• SVM/VT-x rootkits are only theoretically ‘undetectable’
• However, the equivalence principle is not fully respected in the hardware virtualization extensions
• There are computer resources over which the hypervisor does not have full control:
    • TLB (partially)
    • Branch prediction
    • SMP processing

Timing attacks
• The most obvious attack against hardware virtualization rootkits is a timing attack.
• We measure the execution time of some instruction that is probably intercepted and compare the value against a trusted baseline.
• But the AMD and Intel hardware virtualization extensions support intercepting any internal source of timing:
    • RDTSC
    • RDMSR
    • I/O ports
• Hardware virtualization even supports a TSC offset value that is applied to every TSC read in the guest, so the hypervisor can subtract the time it consumed.
• This is the reason that local timing attacks fail.
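
The effect of the TSC offset can be illustrated with a few lines of arithmetic. This is a sketch under stated assumptions: the offset is modelled as a signed value that the hardware adds to the host TSC on every guest read, and `hide_exit_cost` is a hypothetical helper showing what a hypervisor could do after each intercept. The cycle counts in the usage note are illustrative only.

```c
#include <stdint.h>

/* TSC value the guest observes. Assumption: the hardware adds the signed
 * offset from the VMCB, so a negative offset subtracts cycles. */
static uint64_t guest_visible_tsc(uint64_t host_tsc, int64_t tsc_offset)
{
    return host_tsc + (uint64_t)tsc_offset;
}

/* Hypothetical hypervisor helper: after an intercept that cost
 * `exit_cycles`, lower the offset so the guest never sees that time. */
static int64_t hide_exit_cost(int64_t tsc_offset, uint64_t exit_cycles)
{
    return tsc_offset - (int64_t)exit_cycles;
}
```

If an instruction costs 50 cycles natively and the intercept adds 0x1000 cycles, the host TSC advances by 50 + 0x1000, but after `hide_exit_cost` the guest-visible delta is again 50. A detector that only reads the TSC sees nothing unusual.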

Detection methods
• Methods:
    • TLB
    • Branch prediction
    • Counter-based clock
    • #GP exceptions
• DMA-based attacks will not be discussed due to the new IOMMU unit.

TLB
• A Translation Lookaside Buffer (TLB) is a CPU cache that is used to improve the speed of virtual address translation.
• Detailed TLB information can be obtained with the CPUID instruction. It returns information such as the number of entries in each TLB, the type, and the associativity of the cache.
• Each line in the TLB stores information such as:
    • Tag, used to compare with the virtual address
    • Physical address, the result of the VA translation
    • Page attributes
• If the translation is not stored in the cache (cache miss), the system must execute the ‘table-walk’ procedure, which is expensive in clock cycles.

TLB
• The TLB has a limited number of entries.
• The contents of each line are not accessible by software.
• However, we can fill the TLB by accessing several pages.
• The idea is to fill all the TLB entries and measure the time to access these cached pages. Next we execute a privileged instruction that a hypervisor must intercept. If there is a hypervisor running on the system, it will evict some TLB entries. After executing the privileged instruction we measure the time to access the previously cached pages again. If the access takes more time, there is a hypervisor running.
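
The fill, probe, and re-measure sequence can be sketched in user-mode C. Everything here is an assumption made for illustration: `noop_probe` is a placeholder for the intercepted privileged instruction (which needs ring 0), `timespec_get` stands in for whatever time source the detector trusts, and the slowdown factor is arbitrary. A real detector would need a clock the hypervisor cannot fake.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define PAGE_BYTES 4096u

typedef void (*probe_fn)(void);

/* Placeholder for the instruction the hypervisor must intercept. */
static void noop_probe(void) { }

static int64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Touch one byte per page so each access needs its own translation. */
static int64_t time_page_touches(volatile unsigned char *buf, size_t pages)
{
    int64_t t0 = now_ns();
    for (size_t i = 0; i < pages; i++)
        buf[i * PAGE_BYTES]++;
    return now_ns() - t0;
}

/* Decision rule: did the second pass get slower by more than `factor`? */
static int slowdown_exceeds(int64_t warm_ns, int64_t after_ns, double factor)
{
    return (double)after_ns > (double)warm_ns * factor;
}

/* Returns 1 if the probe seems to have evicted translations, 0 if not,
 * -1 if memory could not be allocated. */
static int tlb_probe(size_t pages, probe_fn probe, double factor)
{
    unsigned char *mem = calloc(pages, PAGE_BYTES);
    if (mem == NULL)
        return -1;
    volatile unsigned char *buf = mem;

    time_page_touches(buf, pages);                 /* fill the TLB */
    int64_t warm = time_page_touches(buf, pages);  /* cached baseline */
    probe();                                       /* possibly intercepted */
    int64_t after = time_page_touches(buf, pages); /* re-measure */

    free(mem);
    return slowdown_exceeds(warm, after, factor);
}
```

A call such as `tlb_probe(40, noop_probe, 4.0)` uses one page per entry of a 40-entry TLB. Single measurements are noisy, so a practical detector would repeat the test many times.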

TLB
• The idea of using the TLB to detect a hypervisor was first published by Peter Ferrie [2]. However, in the second version of his paper [3], Ferrie states that the TLB method does not work on AMD-based hypervisors because they can direct the hardware not to flush the TLB when a hypervisor event occurs.
• Ferrie suggests using the CPUID instruction in the TLB method. But Bluepill doesn’t need to intercept the CPUID instruction. Another instruction can be used instead: RDMSR of EFER, which Bluepill must intercept.
• It is still possible to use the TLB method to detect Bluepill even if the hypervisor controls TLB flushes! How?

TLB
• TLB entries are tagged with ASID (Address Space Identifier) bits to distinguish different host and/or guest address spaces.
• ASID #0 is assigned to the VMM and #1..#63 to guests.
• TLB_CONTROL field:
    • The VMM can control TLB flush operations by setting the TLB_CONTROL field in the VMCB. If it is set to 1, the VMRUN instruction flushes the entire TLB (all ASIDs).
• Even with an ASID-tagged TLB, we can evict all lines in the TLB. The number of TLB entries is limited, so lines are evicted when necessary. The Opteron primary TLB has only 40 entries [4].
• The AMD optimization manual suggests avoiding TLB_CONTROL = 1 to flush the guest TLB. Instead, it is best to assign a new ASID to the guest!
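
A toy model shows why ASID tagging alone does not protect guest entries. It assumes a fully associative 40-entry TLB with round-robin replacement. The real replacement policy is not documented here, so treat this purely as an illustration; all names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define TLB_ENTRIES 40   /* Opteron primary TLB size quoted above */

struct tlb_entry { int valid; uint8_t asid; uint64_t vpn; };
struct tlb { struct tlb_entry e[TLB_ENTRIES]; unsigned next; };

static void tlb_init(struct tlb *t)
{
    memset(t, 0, sizeof *t);
}

/* Returns 1 on a hit. On a miss the translation is installed, evicting
 * whatever occupies the next slot regardless of its ASID, and 0 is returned. */
static int tlb_access(struct tlb *t, uint8_t asid, uint64_t vpn)
{
    for (unsigned i = 0; i < TLB_ENTRIES; i++)
        if (t->e[i].valid && t->e[i].asid == asid && t->e[i].vpn == vpn)
            return 1;
    t->e[t->next] = (struct tlb_entry){ 1, asid, vpn };
    t->next = (t->next + 1) % TLB_ENTRIES;
    return 0;
}

/* Touch virtual pages 0..n-1 under `asid` and count the misses. */
static unsigned tlb_touch_range(struct tlb *t, uint8_t asid, unsigned n)
{
    unsigned misses = 0;
    for (unsigned vpn = 0; vpn < n; vpn++)
        misses += !tlb_access(t, asid, vpn);
    return misses;
}
```

In this model the guest (ASID 1) fills all 40 entries and a second pass hits every time. Once a handler running under ASID 0 touches a handful of its own pages, no flush has happened, yet later guest accesses miss again because the entries were simply displaced.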

Branch prediction
• Studies have shown that the behavior of branch instructions is highly predictable [5].
• The execution trace history of a branch instruction can be used to predict its future behavior.
• If a branch is predicted to be taken and this prediction turns out to be incorrect, there is a huge performance penalty because the whole pipeline must be flushed.
• There are many branch prediction schemes. Explaining them is outside the scope of this presentation.
• There are some very good references on this subject [5].
• The branch prediction unit uses a small cache to store the history of branch instruction execution.

Branch prediction
• There is another buffer to store the target address of the branch, the BTB (Branch Target Buffer).
• How can the branch prediction unit (BPU) be used to detect hypervisor code?
• Using the prediction rules of static and dynamic predictors, we can fill the entries of the branch history tables and measure the time to execute our code. Now the detector executes a privileged instruction that will be intercepted if there is a hypervisor running. The hypervisor code will affect the branch history tables. We then execute the ‘branch test code’ again, without the privileged instruction, and measure the time. If the execution of the privileged instruction was intercepted, the measured times will be different.

Branch prediction
• The Branch Prediction Unit was successfully used to obtain a 512-bit encryption key using a Branch Prediction Analysis (BPA) attack [6]. This attack is based on some interesting features of the BPU:
    • The execution history cache is accessed using just a few low-order bits of the branch instruction address. Two different addresses can use the same history. This is called Branch Aliasing or Branch Interference.
    • The cache is shared between all threads.
• The spy thread was running simultaneously with the decryption thread. Since the two threads were using the same branch prediction cache (branch aliasing), the spy thread can determine which branches the decryption thread has taken.
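
Branch aliasing follows directly from indexing the history cache with only the low-order address bits. The sketch below assumes the simplest possible index function. Real predictors may drop alignment bits or mix in global history, and the number of index bits is a parameter precisely because it is implementation specific.

```c
#include <stdint.h>

/* History-table index taken from the low `index_bits` bits of the branch
 * address (assumes index_bits < 64). */
static uint32_t history_index(uint64_t branch_addr, unsigned index_bits)
{
    return (uint32_t)(branch_addr & ((1ull << index_bits) - 1));
}

/* Two branches alias when they map to the same history entry. */
static int branches_alias(uint64_t a, uint64_t b, unsigned index_bits)
{
    return history_index(a, index_bits) == history_index(b, index_bits);
}
```

With 12 index bits, branches at 0x401234 and 0x7f0000001234 share one history entry although they belong to unrelated code, while 0x401234 and 0x401238 do not.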

Branch prediction

Branch prediction
• It is not possible to use the Branch Aliasing effect to detect virtualization rootkits because we do not know the virtual address where the rootkit code is executed.
• To use our detection method we must know internal details of the implementation of the branch prediction unit. However, this information is not easy to find in the CPU manuals.
• We can use some very clever benchmarks developed by Milena Milenkovic, Aleksandar Milenkovic and Jeffrey Kulick [7]. These benchmarks give us very specific details of the BPU, such as the number of entries and associativity of the BTB, the bits of the branch address that are used as the set index, and the local branch history length.

Counter based detection
• Bluepill can intercept attempts to read the TSC MSR or any other clock.
• However, we can create our own clock by using a simple counter algorithm.
• Using a multicore system, we can create two threads running concurrently.
• Thread #2 can measure how long Thread #1 takes to execute some routine or instruction.
• A synchronization protocol between the threads is necessary to signal when to start and when to stop incrementing the counter.
• SMP uses shared memory, which can be used for the synchronization protocol.

Counter based detection
• How does this detection method really work?
• Basic idea:
    • Thread #1 sets the ‘start counter’ synchronization signal and waits for the reply from thread #2.
    • Thread #1 executes an instruction that Bluepill must intercept (rdmsr EFER), while thread #2 starts to increment the counter at the same time.
    • After the intercepted instruction, thread #1 sets the ‘stop counter’ synchronization signal and thread #2 stops the counting process.
    • Now we check the value of the counter.

Counter based detection

Counter based detection
• Using a simple counter code like:

    while( stopCounter == 0 )
        counter++;

The results of a ‘rdmsr EFER’ test are:
• Counter = 016h for a clean system.
• Counter = 02000h for a system running an HVR
• The counter code can be optimized using the AMD optimization manual rules.
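
The two-thread protocol can be sketched with POSIX threads and C11 atomics. This is not the original detector: `empty_probe` is a placeholder for the ring-0 `rdmsr EFER`, the signal names are made up, and the threshold factor passed to `counter_is_suspicious` is an assumption. The sketch also does not pin the threads to different cores, which a real detector would do.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

typedef void (*probe_fn)(void);

static atomic_int start_counter, counter_ready, stop_counter;
static uint64_t counter;

/* Placeholder for the instruction that the rootkit must intercept. */
static void empty_probe(void) { }

/* Thread #2: wait for the start signal, reply, then count until stopped. */
static void *counter_thread(void *arg)
{
    (void)arg;
    while (!atomic_load(&start_counter)) { }
    atomic_store(&counter_ready, 1);
    uint64_t local = 0;
    while (atomic_load(&stop_counter) == 0)
        local++;
    counter = local;   /* read by thread #1 only after pthread_join */
    return NULL;
}

/* Thread #1: run `probe` between the start and stop signals and return
 * how far thread #2 counted in the meantime (0 if the thread failed). */
static uint64_t count_during(probe_fn probe)
{
    pthread_t t2;
    atomic_store(&start_counter, 0);
    atomic_store(&counter_ready, 0);
    atomic_store(&stop_counter, 0);
    counter = 0;
    if (pthread_create(&t2, NULL, counter_thread, NULL) != 0)
        return 0;
    atomic_store(&start_counter, 1);
    while (!atomic_load(&counter_ready)) { }   /* wait for the reply */
    probe();
    atomic_store(&stop_counter, 1);
    pthread_join(t2, NULL);
    return counter;
}

/* Flag a value that is far above the clean-system baseline. */
static int counter_is_suspicious(uint64_t observed, uint64_t baseline,
                                 uint64_t factor)
{
    return observed > baseline * factor;
}
```

With the figures from the slide, `counter_is_suspicious(0x2000, 0x16, 16)` is true and `counter_is_suspicious(0x16, 0x16, 16)` is false. The absolute counter values depend on the machine and the compiler, so the baseline has to be measured on a known-clean system.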

Counter based detection
• To avoid detection, Bluepill must stop the thread #2 counter as soon as it intercepts any event.
• However, just the #VMEXIT control transfer takes around 1000h clock cycles!
• The Bluepill hypervisor on CPU#2 is in ‘sleep mode’ while the counter runs, and even if CPU#1 sends an IPI (inter-processor interrupt) to CPU#2, it will take even more time.
• The CPU#1 hypervisor doesn’t have access to the CPU#2 register context.
• It is too late to change any thread scheduling quantum value.

Counter based detection
• What if our counter code is interrupted by some external interrupt, like the clock, at the start of the counting process?
• It is good to avoid interrupts in our counter code, but not really necessary.
• We can’t guarantee that the counter code will not be interrupted.
• The methods for disabling interrupts can be intercepted by the rootkit:
    • Temporarily disabling the APIC (interceptable)
    • CLI instruction (interceptable)
    • PUSHF and POPF instructions (interceptable)
• Solution:
    • We can run the detection code several times. All we need is a weird counter value.

Counter based detection
• Is there another way for the rootkit to detect this detection method?
• Very difficult. We can implement several different synchronization routines and algorithms to make sure that the threads are running concurrently.
• There is no time for the rootkit to unload itself to avoid detection after the intercept.

BP in hibernation-mode
• One interesting idea that has been discussed is the possibility of Bluepill being able to unload itself while some attack is being executed and reload itself after the attack finishes [8].
• That’s a weird idea, because if we know that the rootkit is unloaded, we can load our own detector hypervisor and wait for any code trying to access SVM resources! Remember that Bluepill is supposed to be undetectable even if the source is published.
• However, the unload idea can be cleverly used against the next detection idea. It is interesting to present this attack to show how virtualization rootkits can use this ‘unload’ trick.

#GP detection
• EFER (Extended Feature Enable Register) is a model-specific register (MSR).
• It can be accessed with the RDMSR and WRMSR instructions.
• The MSR index of EFER is 0xC0000080.
• Before using the AMD SVM extensions, it is necessary to set the EFER.SVME bit to 1.
• Bluepill intercepts all attempts to read or write EFER.
• Is there a way to know the value of the SVME bit without being intercepted?

VMSAVE instruction
• The VMSAVE instruction stores a subset of the processor state into the VMCB specified by the physical address in the RAX register.
• This is a Secure Virtual Machine instruction.
• This instruction generates a #UD exception if SVM is not enabled.
• Pseudo code:

VMSAVE and EFER
• What happens if we execute the VMSAVE instruction with RAX containing an invalid physical address?
    If EFER.SVME = 0, the system generates a #UD exception!
    If EFER.SVME = 1, the system generates a #GP exception!
• The VMSAVE instruction microcode is able to read the real value of the EFER.SVME bit without being intercepted!
• We can use the VMSAVE instruction to detect an HVR.
• VMSAVE is not the only SVM instruction that can be used for detection. Take a look at the AMD manuals.
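
The inference the detector makes can be modelled as a small decision function. This is a sketch of the logic only: executing VMSAVE and catching the fault requires a ring-0 exception handler, which is not shown, and the function names are hypothetical. The vector numbers are the standard x86 ones (#UD = 6, #GP = 13).

```c
#include <stdint.h>

enum { VEC_UD = 6, VEC_GP = 13 };   /* x86 exception vectors */

/* Model of the fault raised by VMSAVE as described above: the SVME check
 * comes first, the address check second. Returns 0 when nothing faults. */
static int vmsave_fault_vector(int real_svme, int rax_is_valid_pa)
{
    if (!real_svme)
        return VEC_UD;
    if (!rax_is_valid_pa)
        return VEC_GP;
    return 0;
}

/* Detector side: infer the real EFER.SVME from the observed vector.
 * Returns 1, 0, or -1 when the vector says nothing about SVME. */
static int infer_svme(int observed_vector)
{
    switch (observed_vector) {
    case VEC_GP: return 1;
    case VEC_UD: return 0;
    default:     return -1;
    }
}

/* SVME is really 1 although EFER reads report 0: something hides SVM. */
static int hypervisor_suspected(int reported_svme, int observed_vector)
{
    return infer_svme(observed_vector) == 1 && reported_svme == 0;
}
```

The detector compares the SVME value it reads through the (intercepted) `rdmsr EFER` with the value implied by the exception. A mismatch means the MSR read was forged.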

Counter-attack - I
• If the rootkit sets the VMCB to intercept the VMSAVE instruction, it will not detect the attack, because the system generates an exception before executing VMSAVE.
• The rootkit can set the VMCB to intercept #GP exceptions!
• After a #GP exception intercept, the rootkit must verify whether the guest RIP is pointing to a VMSAVE instruction!
• If the VMSAVE instruction is being executed and the guest EFER.SVME = 0, it can now inject a #UD exception into the guest.

#GP attack - II
• The detection code must not allow the rootkit to detect the VMSAVE instruction at the guest RIP.
• We can use the Translation Lookaside Buffer to hide our detection code.
• Immediately before executing the VMSAVE instruction, the detector changes the PTE of the detection code to make it point to a fake page, which holds other code that generates a #GP exception.
• When the rootkit intercepts the #GP, it will not be able to look at the real code at RIP because the PTE does not point to the real detection code page. The real address of the detection code is inside the I-TLB and there is no way to access the contents of the TLB cache.

Counter-attack II
• The rootkit is not able to find the real detector page because its translation is cached in the I-TLB. But it can detect whether the page that the PTE maps at RIP is fake.
• How?
    • The rootkit writes a 0xCC (int 3) opcode at the RIP address and restarts guest execution at the same RIP.
    • If the system generates a #BP exception, the page is not fake.
    • If the system generates a #GP again, the page is fake.
• If the rootkit detects such an attack, it can’t know which exception must be injected into the guest, because the hidden code can be any instruction able to generate a #GP exception. If it injects a #UD exception it will be easily detected!
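
The 0xCC test can be simulated with two byte arrays standing in for the two pages. This is a toy model with hypothetical names: `pte_page` is what the rootkit can reach through the page tables, `itlb_page` is what the CPU still fetches through the stale I-TLB entry, and any non-0xCC byte is assumed to be an instruction that faults with #GP.

```c
#include <stddef.h>
#include <stdint.h>

enum { VEC_BP = 3, VEC_GP = 13 };   /* x86 exception vectors */

/* Two views of the code page at the guest RIP. */
struct code_view {
    uint8_t *pte_page;    /* page mapped by the (possibly rewritten) PTE */
    uint8_t *itlb_page;   /* page the CPU actually fetches from */
};

/* Model of re-executing at RIP: the fetch goes through the I-TLB.
 * 0xCC raises #BP; anything else is assumed to raise #GP again. */
static int fetch_and_fault(const struct code_view *v, size_t off)
{
    return v->itlb_page[off] == 0xCC ? VEC_BP : VEC_GP;
}

/* Rootkit side: patch 0xCC through the page tables, re-run, restore the
 * original byte. Returns 1 if the PTE-mapped page turned out to be fake. */
static int pte_page_is_fake(struct code_view *v, size_t off)
{
    uint8_t saved = v->pte_page[off];
    v->pte_page[off] = 0xCC;
    int vec = fetch_and_fault(v, off);
    v->pte_page[off] = saved;
    return vec == VEC_GP;
}
```

When both pointers refer to the same page, the patched byte is executed and #BP is raised. When the detector has swapped the PTE, the patch lands in the fake page, the CPU keeps executing the hidden code, and the repeated #GP gives the trick away.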

Counter-attack II
• What can the rootkit do now?
    • It knows that an exception must be generated.
    • It hooks the guest exception handlers.
    • Next, it unloads the hypervisor and executes the intercepted instruction again.
    • In this case, the instruction generates the correct exception, which is caught by the hooked exception handlers.
    • Now, the exception handler just needs to load the hypervisor again!
• Due to the #GP attack, every virtualization rootkit must configure the VMCB to intercept #GP exceptions.

CPU bugs
• Is it possible to use CPU bugs to detect an HVR?
    • Yes, but it is not a reliable way to detect rootkits.
• I found that executing the VMSAVE instruction together with the address-size prefix (0x67) is apparently able to freeze systems running hypervisors!
• A detector which freezes the system is not very useful outside of lab environments.

Credits
• All the cool crypto research papers using CPU microarchitecture based attacks.
• Alexander Tereshkin, for the creation of the counter-attacks against the #GP exception method to detect Bluepill.

References
[1] J. Smith and R. Nair. Virtual Machines: Versatile Platforms for Systems and Processes. Morgan Kaufmann, 2005.
[2] http://pferrie.tripod.com/papers/attacks.pdf
[3] http://pferrie.tripod.com/papers/attacks2.pdf
[4] http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
[5] J. Shen and M. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors. McGraw-Hill, 2005.
[6] O. Acıiçmez, Ç. Koç and J. Seifert. On the Power of Simple Branch Prediction Analysis. http://eprint.iacr.org/2006/351.pdf
[7] M. Milenkovic, A. Milenkovic and J. Kulick. Demystifying Intel Branch Predictors. http://www.ece.wisc.edu/~wddd/2002/final/milenkovic.pdf
[8] http://blogs.zdnet.com/Ou/?p=297

Questions?

Thank you for your time!
