Blog | Context Information Security

This first blog post provides a technical overview of the initial delivery and exploit stages of an attack involving a malicious PDF, as has been commonly observed in both targeted and opportunistic attacks over the past year. The remaining posts will cover the analysis of the malware obtained during the infection stage and the subsequent communications between the infected host and remote malicious servers.

The malicious PDF was obtained during a recent APT investigation, in which we extracted the file from an infected workstation. This initial post in the series covers the infection vector stage of a malware attack and looks at the JavaScript identified in the PDF file. Analysis then continues with the shellcode and the subsequent acquisition of malware from a remote site.

For the initial PDF analysis, we can use the PDFiD tool from Didier Stevens, which is available from the PDF Tools page listed in the References. This script reports the presence of certain PDF keywords associated with malware or exploits. The presence or absence of these keywords helps you decide whether a PDF file is potentially malicious and requires further analysis, or whether it is likely benign and requires none.
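To give a feel for what this kind of keyword triage involves, the following Python sketch counts a handful of PDF name keywords in the raw bytes of a file. It is not PDFiD itself: the keyword list and the function name count_keywords are assumptions chosen for illustration, and the sketch ignores hex-escaped names (for example /J#53), which could be used to hide a keyword.

```python
import re

# Keywords often treated as warning signs in a PDF. This list is an
# assumption for illustration, not the exact set any particular tool uses.
SUSPICIOUS_KEYWORDS = [b"/JS", b"/JavaScript", b"/OpenAction", b"/AA", b"/Launch"]


def count_keywords(data: bytes) -> dict:
    """Count each keyword where it appears as a complete PDF name."""
    counts = {}
    for keyword in SUSPICIOUS_KEYWORDS:
        # The lookahead stops /JS from also matching inside /JavaScript.
        pattern = re.escape(keyword) + rb"(?![A-Za-z0-9])"
        counts[keyword.decode("ascii")] = len(re.findall(pattern, data))
    return counts


sample = b"<< /S /JavaScript /JS (app.alert(1)) /OpenAction 2 0 R >>"
for name, count in count_keywords(sample).items():
    print(name, count)
```

A non-zero count for /JS or /JavaScript is the kind of result that would prompt the JavaScript extraction described next.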

The screenshot below provides an overview of the PDF document. The presence of JavaScript is an indicator that this file is likely to be malicious and it is therefore important to extract this JavaScript, as it is likely to contain an exploit.

There are numerous ways to extract the JavaScript, such as using the pdf-parser script to decompress streams and analyse the objects they contain, or alternatively using an online service to extract the data for you. The quickest solution is to use Jsunpack, which automatically writes the script to a file. Run from the command line, Jsunpack extracts the JavaScript from the PDF and writes it to disk in a file (filename.pdf.out). The resultant code is shown below:
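Separately from the Jsunpack output referred to above, the manual route of decompressing streams can be sketched in a few lines of Python. This is a simplified illustration and not how pdf-parser is implemented: the function name inflate_streams is hypothetical, it assumes the streams use only the FlateDecode (zlib) filter, and it finds streams with a naive regular expression rather than by parsing the PDF object structure.

```python
import re
import zlib

# Naive match for the bytes between "stream" and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(data: bytes) -> list:
    """Return the decompressed body of every zlib-compressed stream found."""
    results = []
    for match in STREAM_RE.finditer(data):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes that follow
            # the compressed data before "endstream".
            results.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            # Not zlib data (uncompressed or another filter); skipped here.
            continue
    return results


payload = b"app.alert(1);"
pdf = (b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
       + zlib.compress(payload) + b"\nendstream\nendobj\n")
print(inflate_streams(pdf))
```

Any JavaScript held in a compressed stream would then be readable in the returned byte strings.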

Looking at the file reveals obfuscated JavaScript and what appears to be shellcode. The code checks the Adobe Reader version and will exploit either CVE-2007-5659 or CVE-2009-0927. In the image, we can see that the malware author has attempted to obfuscate the calls. The previous code uses a series of eval statements and String.fromCharCode to extract values from the large array of integers and finally generate the script below, which contains the shellcode. Using a debugger such as the Rhino debugger, we can step through the code and deobfuscate it to identify how it functions. In the screenshot below, we can see that code is concatenated together to form a 'for' loop that extracts the next stage of JavaScript:
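Independently of the screenshot, the core of this kind of obfuscation can be reproduced offline. The Python sketch below pulls the longest integer array out of a script and applies the equivalent of String.fromCharCode to each value. The function names and the sample script are hypothetical, and the sketch assumes the integers map directly to character codes; the real sample may transform the values first, which would need to be replicated.

```python
import re


def extract_int_array(js_source: str) -> list:
    """Return the longest bracketed list of integers found in the script."""
    arrays = re.findall(r"\[\s*(\d+(?:\s*,\s*\d+)*)\s*\]", js_source)
    if not arrays:
        return []
    longest = max(arrays, key=len)
    return [int(number) for number in longest.split(",")]


def decode_char_codes(codes: list) -> str:
    """Equivalent of calling String.fromCharCode on each integer in turn."""
    return "".join(chr(code) for code in codes)


# Hypothetical stand-in for the obfuscated first-stage script.
script = "var a = [97,112,112,46,97,108,101,114,116,40,49,41]; eval(f(a));"
print(decode_char_codes(extract_int_array(script)))
```

Decoding statically like this avoids having to execute the eval calls at all.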

The next-stage code contains the exploit code and shellcode, as expected. It is now possible to extract the shellcode from the script and identify its function. The exploit used depends on the installed version: the code checks for Adobe Reader 7, 8, and 9 and later. The code is shown below.

Again, there are various ways of extracting and analysing the shellcode. We could insert a breakpoint in the form of “%uCCCC” (0xCC is the x86 INT 3 breakpoint opcode) at the start of the shellcode and open the PDF in Adobe Reader whilst it is running under a debugger such as OllyDbg. The debugger would hit the breakpoint at the start of the shellcode and we could watch it run in the context of Adobe Reader. Alternatively, we can take the shellcode directly from the script, convert it to hexadecimal and then use a tool such as Shellcode2Exe to create a standalone executable that may be analysed in a debugger or IDA Pro. The latter approach is used here.
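The conversion to hexadecimal can be sketched as follows. Each %uXXXX escape represents a 16-bit value that is stored little-endian in memory, so the two bytes must be swapped when the escapes are turned into raw bytes. The function name and the sample string are hypothetical, and the sketch assumes the shellcode is written entirely as %u escapes.

```python
import re
import struct


def unescape_to_bytes(encoded: str) -> bytes:
    """Convert a string of %uXXXX escapes into the bytes seen in memory."""
    out = bytearray()
    for match in re.finditer(r"%u([0-9a-fA-F]{4})", encoded):
        # "<H" packs the 16-bit value little-endian, so %u1234 -> 34 12.
        out += struct.pack("<H", int(match.group(1), 16))
    return bytes(out)


# Hypothetical fragment: a NOP pair, a breakpoint pair and one sample word.
shellcode = unescape_to_bytes("%u9090%uCCCC%u1234")
print(shellcode.hex())
```

The resulting hex string is the form that can then be handed to a converter such as Shellcode2Exe.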

After creating an executable from the shellcode, analysis can begin in an attempt to understand what happens as a result of exploitation. The shellcode is small in size and is likely to initiate a download to obtain the malware. Viewing the code in a disassembler, we can see a number of strings, DLL calls and a URL.
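A quick way to spot such strings before opening a disassembler is to scan the raw shellcode for runs of printable ASCII. The following is a minimal sketch with hypothetical function names and a made-up byte blob. It only works when the strings are stored in the clear; an encoded or encrypted payload would show nothing useful until it has decoded itself.

```python
import re


def ascii_strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of printable ASCII characters at least min_len long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.group().decode("ascii") for m in re.finditer(pattern, data)]


def find_urls(data: bytes) -> list:
    """Return anything in the data that looks like an HTTP(S) URL."""
    return [m.group().decode("ascii")
            for m in re.finditer(rb"https?://[\x21-\x7e]+", data)]


# Made-up blob standing in for extracted shellcode bytes.
blob = b"\x90\x90\xebURLMON.DLL\x00\xff\x01http://example.invalid/n.exe\x00\xcc"
print(ascii_strings(blob))
print(find_urls(blob))
```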

The shellcode may be encrypted, so it is useful to step through the code in a debugger to fully understand its capability. The shellcode entry point already tells the reverser a great deal, if they know where to look. Looking at this code, we can guess that the shellcode will attempt to find the location of some DLLs. Shellcode needs to call the Windows API, and to do so it must know the actual address of each function within the memory space of the process. The most common solution for finding the addresses of required Windows API functions relies on the LoadLibrary() and GetProcAddress() routines exported by kernel32.dll. Malware authors therefore employ a number of techniques for locating kernel32.dll on any Windows version and Service Pack. Such mechanisms include walking the SEH (Structured Exception Handling) chain and, as in this case, scanning the list of loaded modules in the PEB (Process Environment Block). The next screenshot from IDA demonstrates the standard functions used to find kernel32.dll via the PEB. Comments are also included in the code.

The code at offset 401044 above is the standard method for identifying the location of the PEB. This instruction stores the pointer to the PEB in EAX. In a 32-bit process this pointer is always available at fs:[30h], that is, at offset 0x30 in the TEB (Thread Environment Block). The next important instruction loads the pointer to the loader data structure, PEB_LDR_DATA, which is present at offset 0x0c in the PEB. Within this PEB_LDR_DATA struct, the code then uses offset 0x1c to reach InInitializationOrderModuleList.

struct PEB_LDR_DATA{
...
struct LIST_ENTRY InLoadOrderModuleList;
struct LIST_ENTRY InMemoryOrderModuleList;
struct LIST_ENTRY InInitializationOrderModuleList;
};

It is this list that identifies each loaded module and helps the shellcode locate the base address of kernel32.dll and, from there, the other API functions that may be required to infect the system. Again, the code proceeds in predictable fashion, provided the analyst knows what to expect. It loads the first list entry, which contains information about the ntdll.dll module. By following the list to the second entry and reading the value at offset 0x08, we obtain the kernel32.dll base address, as seen in the image below:
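Apart from the debugger view, the pointer chase itself can be modelled in Python to make the offsets concrete. This is a simulation only: the memory is a plain dictionary, every address and value in it is made up for illustration, and the function name second_module_base is hypothetical. The offsets are the 32-bit ones given in the text.

```python
# Offsets for a 32-bit process, as described in the text.
PEB_LDR_OFFSET = 0x0C          # PEB -> pointer to PEB_LDR_DATA
LDR_INIT_ORDER_OFFSET = 0x1C   # PEB_LDR_DATA -> InInitializationOrderModuleList
ENTRY_BASE_OFFSET = 0x08       # list entry -> module base address


def second_module_base(read_dword, peb_addr):
    """Follow PEB -> loader data -> second list entry and return its base."""
    ldr = read_dword(peb_addr + PEB_LDR_OFFSET)
    first_entry = read_dword(ldr + LDR_INIT_ORDER_OFFSET)  # ntdll.dll entry
    second_entry = read_dword(first_entry)                 # forward link
    return read_dword(second_entry + ENTRY_BASE_OFFSET)


# Made-up memory layout: all addresses and values are illustrative only.
fake_memory = {
    0x7FFDF000 + 0x0C: 0x00251EA0,  # PEB: pointer to loader data
    0x00251EA0 + 0x1C: 0x00251F58,  # loader data: first entry in the list
    0x00251F58 + 0x00: 0x00252020,  # first entry: forward link to second
    0x00251F58 + 0x08: 0x7C900000,  # first entry: base (ntdll.dll in the text)
    0x00252020 + 0x08: 0x7C800000,  # second entry: base (kernel32.dll in the text)
}

base = second_module_base(fake_memory.__getitem__, 0x7FFDF000)
print(hex(base))
```

Each read_dword call corresponds to one memory-dereferencing instruction in the shellcode, which is why the whole lookup takes only a few instructions.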

The base address is seen as 0x7C800000 and EBX now contains a pointer to this address. Now that the shellcode knows the location of kernel32.dll, it can load the DLLs and resolve the functions required to continue execution. The next stage is for the shellcode to identify the location of the Temp folder, in which to store the downloaded malware. It calls kernel32's GetTempPath, stores the returned location and appends the filename “n.exn” to the path.

The shellcode continues execution and resolves the address of urlmon.dll, which exports the function “URLDownloadToFile”. Shellcode commonly uses this function to download malware. Once the function address is resolved, the shellcode finally loads the malicious URL and downloads the malware. The final stages of execution are illustrated in the following screenshots.

This concludes part 1 of this series of posts. In part 2, we shall analyse the malware that infected the system as a result of the initial PDF exploit. This post has aimed to show how a large number of malware infections begin with a targeted PDF that leads to shellcode execution, which in turn obtains the next-stage malware.

References

OllyDbg – http://www.ollydbg.de
PDF Tools – Didier Stevens – http://blog.didierstevens.com/programs/pdf-tools
IDA Pro – http://www.hex-rays.com/idapro
Malware Domain List – http://www.malwaredomainlist.com
Rhino Debugger – http://www.mozilla.org/rhino/debugger.html
Malzilla – http://malzilla.sourceforge.net

Malware 1 - From Exploit to Infection

By Mark Nicholls
