Fuzzing Image Parsing in Windows, Part Four: More HEIF
Mandiant
Written by: Dhanesh Kizhakkinan
Continuing our discussion of image parsing vulnerabilities in the Windows HEIF codec, we take a look at analyzing a new crash, reconstructing function symbols, and the root cause analysis of the vulnerability, CVE-2022-24457. This vulnerability is present on a default install of Windows 10 and 11 and only requires browsing to a folder containing the malicious image file to trigger the vulnerability. The vulnerability is triggered when Windows attempts to automatically generate a thumbnail for the image. All vulnerabilities have been remediated by Microsoft following the disclosure by Mandiant.
The Crash - CVE-2022-24457
Figure 1: Crash details
The crash in Figure 1 is an out-of-bounds memory write within an AVX2 instruction. The crashing function is large and contains a long series of AVX2 instructions. After a quick look at the decompilation, its operations, and its internal calls to memcpy functions, we can deduce that it is an AVX2-optimized version of memcpy. An out-of-bounds write inside the memcpy function is a good crash for further analysis.
Identifying other functions
With the crash function identified as memcpy, we next try to identify other functions in the binary. For this, we go one frame up the call stack and look at the decompilation in Figure 2. We can see that this decompilation contains a call to sub_18017dd88, which appears to be a logging function that takes the name of the calling function as its second parameter.
Figure 2: Decompilation showing debug logging call
Some software ships with logging capabilities, which can be helpful for analyzing crashes and performance issues in a production environment. In this case, the logging calls help us reconstruct many function names and understand what those functions implement, which in turn makes it easier to determine the root cause of vulnerabilities. Looking at the cross references, we can see over 5,000 calls to this logging function (see Figure 3).
Figure 3: Cross references to logging function
Given the large number of logging calls, we write a script which recovers the function names from the second argument and renames the caller functions.
The IDA Script
An IDAPython script was written to automate the process. It works according to the following algorithm.
- Get list of cross references (xrefs) to the logging function
- Get unique caller functions from the list of xrefs
- Decompile each caller
- Find the logging function call and retrieve the second argument as the function name
- Rename the caller function with the retrieved function name
I decided to write a generic script which can be reused in other projects. For that, I used IDA’s decompiler API to avoid processor-specific and calling-convention-specific code. The script also makes use of our decompiler wrapper, FIDL. The script is provided in Table 1.
from idc import *
from idaapi import *
from idautils import *
import FIDL.decompiler_utils as du

# f_name: The logging function name
# indx: 0 based argument index to retrieve
def rename(f_name, indx):
    f_ea = get_name_ea_simple(f_name)
    if f_ea == BADADDR:
        print("Failed to resolve address for {}".format(f_name))
        return
    callers = set()
    # Get a set of unique callers
    for ref in XrefsTo(f_ea, True):
        if not ref.iscode:
            continue
        f = get_func(ref.frm)
        if f is None:
            continue
        # Keep the start address of the calling function
        # (separate name so the logging function address is not overwritten)
        callers.add(f.start_ea)
    for caller_ea in callers:
        current_fname = get_func_name(caller_ea)
        # Rename only if the function name starts with sub_
        if current_fname.startswith("sub_"):
            c = du.find_all_calls_to_within(f_name, caller_ea)
            try:
                # Validate the logging function and arguments
                if len(c) > 0 and len(c[0].args) > indx:
                    f_name_str = c[0].args[indx].val
                    set_name(caller_ea, "{}".format(f_name_str), SN_FORCE)
                else:
                    print("Failed in {}\n".format(current_fname))
            except Exception:
                print("Exception in {}\n".format(current_fname))

rename('sub_18017DD88', 1)
With thousands of functions renamed in IDA, it gets considerably easier to do a full root-cause analysis. Even though we use IDA for static analysis, WinDBG + Time Travel Debugging (TTD) is regularly used for most of the dynamic analysis. We port our renamed symbols into WinDBG by using an IDA plugin: FakePDB. FakePDB creates a PDB from the IDA database, which can be loaded in WinDBG to enhance our debugging/tracing capabilities. An example is shown in Figure 4.
Figure 4: Ported symbols in WinDBG
Root cause analysis of the bug
The relevant code from the function msheif_store!CHEIFStreamReader::ReadItemData is presented in Table 2.
/*
   length calculation from CHEIFItemInfoEntry::GetDataSize
   0x309 + 0xee7 => 0x11f0
   0x11f0 + 0xfffff200 => 0x1000003f0
*/
QWORD currentOffset = 0;
QWORD length = 0x1000003f0; // CHEIFItemInfoEntry::GetDataSize
status = MFCreateMemoryBuffer(length, allocBuff);
if (status < 0)
{
    // bail
}
while (1)
{
    ...
    // 0xee7 + 0x0 >= 0x1000003f0
    if (currentSize + currentOffset >= length)
    {
        // bail
    }
    // crash
    OptimizedMemcpy(currentOffset + allocBuff, srcBuff, currentSize);
    currentOffset += currentSize;
    ...
}
Astute readers will quickly point at the if condition as a possible integer overflow. In this case, however, that scenario is mitigated when the length is calculated in CHEIFItemInfoEntry::GetDataSize. The vulnerability only becomes visible when we look closely at the MFCreateMemoryBuffer function and its parameters. Figure 5 shows the function’s documentation from MSDN.
Figure 5: MFCreateMemoryBuffer function documentation
The MFCreateMemoryBuffer function accepts a 32-bit DWORD as the length parameter and returns an allocated buffer. But if we look at Table 2, we can see that the length passed to the function is a 64-bit QWORD. In such a case, the compiler implicitly truncates the QWORD to a DWORD, keeping only the low 32 bits. Here, the length 0x1000003f0 gets truncated to the much smaller 0x3f0. The allocated buffer is therefore only 0x3f0 bytes, while the copy loop still checks against the full 64-bit length, so more data is copied than the buffer can hold, causing the out-of-bounds write.
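The truncation can be modeled in a few lines of Python. This is only a sketch of the arithmetic, not the decoder's code: the helper names are hypothetical, the mask stands in for the implicit QWORD-to-DWORD conversion, and the single 0xee7-byte copy at offset 0 is taken from the comment in Table 2.

```python
DWORD_MASK = 0xFFFFFFFF  # a DWORD holds the low 32 bits


def truncate_to_dword(value):
    """Model an implicit 64-bit to 32-bit narrowing conversion."""
    return value & DWORD_MASK


def first_overflowing_copy(requested_length, chunk_sizes):
    """Replay the copy loop from Table 2 against the truncated allocation.

    Returns (offset, size) of the first copy that writes past the
    allocated buffer, or None if every copy fits or the loop bails out.
    """
    allocated = truncate_to_dword(requested_length)
    offset = 0
    for size in chunk_sizes:
        # The bounds check uses the untruncated 64-bit length
        if size + offset >= requested_length:
            return None  # the loop bails before copying
        if offset + size > allocated:
            return (offset, size)
        offset += size
    return None


length = 0x309 + 0xEE7 + 0xFFFFF200
print(hex(length))                             # 0x1000003f0
print(hex(truncate_to_dword(length)))          # 0x3f0
print(first_overflowing_copy(length, [0xEE7])) # (0, 3815)
```

The bounds check passes because it compares against 0x1000003f0, while the buffer behind it holds only 0x3f0 bytes, so the very first copy already overruns it.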
From the function name CHEIFStreamReader::ReadItemData, we guess that the vulnerability occurs while parsing the item box. Backtracing the calls further, we see that the lengths are read in the function CItemLocationAtom::ParseAtom, which points to the iloc box shown in Figure 6.
Figure 6: iloc box
Looking at the box content, we can see all three length values (0x309, 0xEE7 and 0xFFFFF200) specified in the box. Now we can look at the iloc specification to figure out the exact details for those lengths. HEIF is based on the ISO Base Media File Format (ISOBMFF), but the specification that describes the box parsing algorithms tends to be hard to obtain or paywalled.
Another approach is to look at open-source implementations of HEIF image parsers such as libheif or nokiatech-heif. Running our PoC file through their decoding routines gives us the exact details of the lengths from the iloc box, as shown in Figure 7.
Figure 7: iloc parsing in a debugger
The three lengths we see in our PoC file are called extent lengths. Microsoft’s HEIF implementation reads all the extent lengths and adds them together, then uses the resulting length to allocate memory through the API function MFCreateMemoryBuffer. Because this API takes a DWORD, the length is truncated when it is passed in and a smaller buffer is allocated, causing the out-of-bounds write.
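To make the role of the extent lengths concrete, here is a minimal Python sketch that walks an iloc payload following the ItemLocationBox layout in ISOBMFF and collects the extent lengths per item. It is not Microsoft's parser or the code from libheif. The function names are hypothetical, and the sample payload is synthetic: only the three lengths come from the PoC, while the version, field sizes, item ID, and extent offsets are assumptions chosen for illustration.

```python
import struct


def read_uint(buf, pos, size):
    """Read a big-endian unsigned integer of `size` bytes (0 bytes reads as 0)."""
    return int.from_bytes(buf[pos:pos + size], "big"), pos + size


def iloc_extent_lengths(payload):
    """Parse an iloc payload (the bytes after the 8-byte box header).

    Returns {item_ID: [extent_length, ...]}.
    """
    version = payload[0]
    pos = 4  # skip version (1 byte) and flags (3 bytes)
    offset_size, length_size = payload[pos] >> 4, payload[pos] & 0xF
    base_offset_size, index_size = payload[pos + 1] >> 4, payload[pos + 1] & 0xF
    pos += 2
    id_size = 4 if version == 2 else 2  # item_count and item_ID widen in v2
    item_count, pos = read_uint(payload, pos, id_size)
    items = {}
    for _ in range(item_count):
        item_id, pos = read_uint(payload, pos, id_size)
        if version in (1, 2):
            pos += 2              # reserved + construction_method
        pos += 2                  # data_reference_index
        pos += base_offset_size   # base_offset
        extent_count, pos = read_uint(payload, pos, 2)
        lengths = []
        for _ in range(extent_count):
            if version in (1, 2):
                pos += index_size  # extent_index, present only in v1/v2
            pos += offset_size     # extent_offset
            length, pos = read_uint(payload, pos, length_size)
            lengths.append(length)
        items[item_id] = lengths
    return items


# Synthetic version 0 payload: 4-byte offsets and lengths, one item, three extents
payload = bytes([0, 0, 0, 0, 0x44, 0x00]) + struct.pack(">HHHH", 1, 1, 0, 3)
for offset, length in [(0x100, 0x309), (0x409, 0xEE7), (0x12F0, 0xFFFFF200)]:
    payload += struct.pack(">II", offset, length)

lengths = iloc_extent_lengths(payload)[1]
print([hex(n) for n in lengths], hex(sum(lengths)))
```

Summing the three extent lengths gives 0x1000003f0, the 64-bit total that no longer fits in a DWORD.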
Patch
Microsoft patched this vulnerability in March 2022 by bailing out with an error if the total length is greater than 0xC8000000 (~3GiB).
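The effect of such a check can be sketched as follows. Only the 0xC8000000 threshold comes from the patch description above. The function name, the exception type, and the placement of the check right after the summation are assumptions made for illustration.

```python
MAX_TOTAL_LENGTH = 0xC8000000  # threshold reported for the March 2022 patch


def checked_total_length(extent_lengths):
    """Sum extent lengths and reject totals above the patched limit."""
    total = sum(extent_lengths)
    if total > MAX_TOTAL_LENGTH:
        raise ValueError("total item length {:#x} exceeds limit".format(total))
    return total


print(hex(checked_total_length([0x309, 0xEE7])))  # 0x11f0
try:
    checked_total_length([0x309, 0xEE7, 0xFFFFF200])
except ValueError as err:
    print(err)
```

Since the limit itself fits in 32 bits, any total that passes the check keeps its value when it is narrowed to a DWORD, so the allocation size and the copy loop's length agree.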
Conclusion
Part four of this blog series presents a vulnerability in Microsoft’s HEIF decoder and shows how to reconstruct symbols to do a full root-cause analysis of the vulnerability. A list of the latest reported vulnerabilities in the HEIF codec can be found in the following appendix and is referenced in the Mandiant Vulnerability Disclosures.
Appendix