Kernel Memory Disclosure in Modern Operating Systems

Below the cut is a translation of the introductory part of the paper Detecting Kernel Memory Disclosure with Emulation and Taint Tracking (Project Zero article) by Mateusz Jurczyk.

The translated part covers:

specifics of the C programming language (as they relate to the memory disclosure problem)
specifics of how the Windows and Linux kernels operate (as they relate to the memory disclosure problem)
the significance of kernel memory disclosure and its impact on OS security
existing methods and techniques for detecting and countering kernel memory disclosure

Although the document focuses on how the privileged OS kernel communicates with user applications, the essence of the problem generalizes to any data transfer between different security domains: hypervisor and guest machine, privileged system service (daemon) and GUI application, network client and server, and so on.

Introduction

One of the tasks of a modern operating system is to enforce privilege separation between user applications and the OS kernel. First, the effect of each program on the execution environment must be limited by a defined security policy; second, programs must be able to access only the information they are allowed to read. The second goal is hard to achieve given the properties of C (the main language of kernel development), which make it extremely difficult to transfer data securely between security domains.

Modern operating systems running on x86/x86-64 platforms are multi-threaded and use a client-server model: user-mode applications (clients) run independently of each other and call the OS kernel (server) whenever they need to work with a system-managed resource. The mechanism that user-mode code (ring 3) uses to invoke a predefined set of kernel functions (ring 0) is called system calls, or syscalls for short. A typical system call is shown in Figure 1:

Figure 1: System call life cycle.

It is very important to avoid unintended leakage of kernel memory contents when interacting with user-mode programs. There is a significant risk of disclosing sensitive kernel data, which can be transmitted implicitly in the output parameters of system calls that are otherwise safe.

Disclosure of privileged system memory occurs when the OS kernel returns a memory region that is larger than needed to hold the relevant information. The excess bytes often contain data that was written in a different context; because the memory was not re-initialized before reuse, the old information propagates into new data structures.

Specifics of the C programming language

In this section, we look at several aspects of the C language that matter most for the memory disclosure problem.

Undefined state of uninitialized variables

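Individual variables of simple types (such as char or int), as well as members of data structures (arrays, structures, and unions), remain in an indeterminate state until they are first initialized, whether they live on the stack or on the heap. Relevant quotes from the C11 specification (ISO/IEC 9899:201x Committee Draft N1570, April 2011):

6.7.9 Initialization
...
If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. [...]

7.22.3.4 The malloc function
...
2 The malloc function allocates space for an object whose size is specified by size and whose value is indeterminate.

7.22.3.5 The realloc function
...
2 The realloc function deallocates the old object pointed to by ptr and returns a pointer to a new object that has the size specified by size. The contents of the new object shall be the same as that of the old object prior to deallocation, up to the lesser of the new and old sizes. Any bytes in the new object beyond the size of the old object have indeterminate values.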

The part most relevant to system code concerns objects located on the stack, since OS kernels usually have their own dynamic allocation interfaces with their own semantics (not necessarily compatible with the standard C library, as described later).

As far as we know, none of the three most popular C compilers for Windows and Linux (Microsoft C/C++ Compiler, gcc, LLVM) generates code that pre-initializes stack variables left uninitialized by the programmer in Release builds (or their equivalent). There are compiler options that fill stack frames with special marker bytes (/RTCs in Microsoft Visual Studio, for example), but they are not used in Release builds for performance reasons. As a result, uninitialized stack variables inherit the old values of the corresponding memory areas.

Consider an example implementation of a fictional Windows system call that multiplies the input integer by two and returns the result (Listing 1). In the special case InputValue == 0, the OutputValue variable remains uninitialized and is copied back to the client. This bug discloses four bytes of kernel stack memory on each call.

    NTSTATUS NTAPI NtMultiplyByTwo(DWORD InputValue, LPDWORD OutputPointer) {
        DWORD OutputValue;

        if (InputValue != 0) {
            OutputValue = InputValue * 2;
        }

        *OutputPointer = OutputValue;
        return STATUS_SUCCESS;
    }

Listing 1: Memory disclosure through an uninitialized local variable.
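A minimal sketch of one possible fix for Listing 1, written as portable user-space C so that it can be compiled anywhere. The function name multiply_by_two and the integer return code are assumptions made for illustration and are not part of the original paper; the point is only that the output receives a defined value on every path.

```c
#include <stdint.h>

/* Hypothetical user-space stand-in for the handler in Listing 1.
 * The result is initialized on every path, so no stale stack bytes
 * can reach the caller. */
int multiply_by_two(uint32_t input, uint32_t *output)
{
    uint32_t result = 0;          /* defined value even when input == 0 */

    if (input != 0) {
        result = input * 2;
    }
    *output = result;
    return 0;                     /* stands in for STATUS_SUCCESS */
}
```

Calling multiply_by_two(0, &out) now stores 0 in out instead of whatever happened to be on the stack.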

Leaks through an uninitialized local variable are not very common in practice: on the one hand, modern compilers often detect and warn about such problems; on the other hand, such leaks are functional bugs that can be caught during development or testing. However, the second example (Listing 2) shows that a leak can also occur through a structure field.

In this case, the Reserved field of the structure is never explicitly used in the code, but it is still copied back to user mode and therefore also discloses four bytes of kernel memory to the calling code. The example makes it clear that initializing every field of every structure returned to the client, on every code path, is not an easy task. In many cases, forced initialization looks illogical, especially if the field plays no practical role. But the fact that an uninitialized variable (or structure field) on the stack (or on the heap) takes on the contents of data previously stored in that memory (in the context of another operation) is the core of the kernel memory disclosure problem.

    typedef struct _SYSCALL_OUTPUT {
        DWORD Sum;
        DWORD Product;
        DWORD Reserved;
    } SYSCALL_OUTPUT, *PSYSCALL_OUTPUT;

    NTSTATUS NTAPI NtArithOperations(
        DWORD InputValue,
        PSYSCALL_OUTPUT OutputPointer
    ) {
        SYSCALL_OUTPUT OutputStruct;

        OutputStruct.Sum = InputValue + 2;
        OutputStruct.Product = InputValue * 2;

        RtlCopyMemory(OutputPointer, &OutputStruct, sizeof(SYSCALL_OUTPUT));
        return STATUS_SUCCESS;
    }

Listing 2: Memory disclosure through a reserved structure field.
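One way to close the leak in Listing 2 is to clear the whole local structure before filling in the fields that matter. The sketch below is a user-space analogue under stated assumptions: the names syscall_output and arith_operations are hypothetical, and memcpy stands in for RtlCopyMemory.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sum;
    uint32_t product;
    uint32_t reserved;   /* never used by the handler logic */
} syscall_output;

/* Hypothetical user-space analogue of NtArithOperations from Listing 2.
 * `user_out` stands in for the user-mode output pointer. */
int arith_operations(uint32_t input, syscall_output *user_out)
{
    syscall_output out;

    memset(&out, 0, sizeof(out));    /* clears `reserved` as well */
    out.sum = input + 2;
    out.product = input * 2;

    memcpy(user_out, &out, sizeof(out));
    return 0;
}
```

The reserved field now always reaches the caller as zero, regardless of what the stack slot held before the call.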

Alignment of structures and padding bytes

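Initializing every field of the output structure is a good start, but it is not enough to guarantee that the low-level representation contains no uninitialized bytes. Let's go back to the C11 specification:

6.5.3.4 The sizeof and _Alignof operators
...
4 [...] When applied to an operand that has structure or union type, the result is the total number of bytes in such an object, including internal and trailing padding.

6.2.8 Alignment of objects
Complete object types have alignment requirements which place restrictions on the addresses at which objects of that type may be allocated. An alignment is an implementation-defined integer value representing the number of bytes between successive addresses at which a given object can be allocated. [...]

6.7.2.1 Structure and union specifiers
...
17 There may be unnamed padding at the end of a structure or union.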

That is, C compilers for x86(-64) architectures use natural alignment for structure fields of primitive types: each such field is aligned to N bytes, where N is the size of the field. Whole structures and unions are also aligned when they are declared in an array, so that the alignment requirements of nested fields are met. To ensure alignment, implicit padding bytes are inserted into structures where necessary. Although they cannot be accessed directly in source code, these bytes also inherit old values from the underlying memory and can carry information to user mode.

In the example in Listing 3, the SYSCALL_OUTPUT structure is returned to the calling code. It contains a 4-byte and an 8-byte field separated by 4 padding bytes, which are needed so that the address of the LargeSum field is a multiple of 8. Although both fields are correctly initialized, the padding bytes are never explicitly set, which again discloses kernel stack memory. The layout of the structure in memory is shown in Figure 2.

    typedef struct _SYSCALL_OUTPUT {
        DWORD Sum;
        QWORD LargeSum;
    } SYSCALL_OUTPUT, *PSYSCALL_OUTPUT;

    NTSTATUS NTAPI NtSmallSum(
        DWORD InputValue,
        PSYSCALL_OUTPUT OutputPointer
    ) {
        SYSCALL_OUTPUT OutputStruct;

        OutputStruct.Sum = InputValue + 2;
        OutputStruct.LargeSum = 0;

        RtlCopyMemory(OutputPointer, &OutputStruct, sizeof(SYSCALL_OUTPUT));
        return STATUS_SUCCESS;
    }

Listing 3: Memory disclosure through structure alignment.

Figure 2: In-memory representation of the structure, including alignment padding.

Leaks through padding are relatively common, since many system call output parameters are structures. The problem is particularly acute on 64-bit platforms, where pointers, size_t, and similar types grow from 4 to 8 bytes, which introduces the padding needed to align the fields of such structures.
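The amount of padding in a given structure can be checked with sizeof and offsetof. The following sketch uses a hypothetical syscall_output type that mirrors the structure from Listing 3, with fixed-width integer types in place of DWORD and QWORD; the helper names are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t sum;        /* 4 bytes */
    uint64_t large_sum;  /* 8 bytes, typically 8-byte aligned */
} syscall_output;

/* Number of bytes in the object that belong to no named field. */
size_t padding_bytes(void)
{
    return sizeof(syscall_output) - (sizeof(uint32_t) + sizeof(uint64_t));
}

/* Size of the gap between the end of `sum` and the start of `large_sum`. */
size_t gap_after_sum(void)
{
    return offsetof(syscall_output, large_sum) - sizeof(uint32_t);
}
```

On the common x86-64 ABIs both helpers return 4: the structure occupies 16 bytes although its fields account for only 12. The exact numbers are implementation-defined, which is why they are computed rather than hard-coded.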

Since padding bytes cannot be addressed in source code, you must use memset or a similar function to zero the entire memory area of a structure before initializing any of its fields and copying it to user mode, for example:

  memset(&OutputStruct, 0, sizeof(OutputStruct));
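To make the effect of that memset call visible, the sketch below clears a structure, sets its fields, and then inspects the bytes between the two fields. All names are hypothetical, and the check relies on commonly observed compiler behaviour (field stores leave padding alone) rather than on a guarantee from the C standard.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sum;
    uint64_t large_sum;
} syscall_output;

/* Hypothetical analogue of NtSmallSum: clear first, then set fields. */
void small_sum(uint32_t input, syscall_output *out)
{
    memset(out, 0, sizeof(*out));   /* covers padding as well as fields */
    out->sum = input + 2;
    out->large_sum = 0;
}

/* Returns 1 if every byte between `sum` and `large_sum` is zero. */
int padding_is_clear(const syscall_output *out)
{
    const unsigned char *bytes = (const unsigned char *)out;
    size_t i;

    for (i = sizeof(uint32_t); i < offsetof(syscall_output, large_sum); i++) {
        if (bytes[i] != 0) {
            return 0;
        }
    }
    return 1;
}
```

If the structure is pre-filled with a marker pattern such as 0xAA before small_sum runs, padding_is_clear reports whether any of that pattern survived in the gap.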

However, Robert C. Seacord, in The CERT C Coding Standard, Second Edition: 98 Rules for Developing Safe, Reliable, and Secure Systems (Addison-Wesley Professional, 2014), argues that this is not an ideal solution, because padding bytes may still be clobbered after the memset call, for example as a side effect of operations on adjacent fields. The concern is supported by the following statement in the C specification:

6.2.6 Representations of types
6.2.6.1 General
...
When a value is stored in an object of structure or union type, including in a member object, the bytes of the object representation that correspond to any padding bytes take unspecified values. [...]

In practice, however, none of the C compilers we tested read or wrote memory outside the explicitly declared fields. This view appears to be shared by the operating system developers who rely on memset.

Unions and fields of different sizes

Unions are another tricky C construct in the context of communicating with less privileged calling code. Consider how the C11 specification describes the in-memory representation of unions:

6.2.5 Types
...
A union type describes an overlapping nonempty set of member objects, each of which has an optionally specified name and possibly distinct type.

6.7.2.1 Structure and union specifiers
...
As discussed in 6.2.5, a structure is a type consisting of a sequence of members, whose storage is allocated in an ordered sequence, and a union is a type consisting of a sequence of members whose storage overlap.
...
16 The size of a union is sufficient to contain the largest of its members. The value of at most one of the members can be stored in a union object at any time.

The problem is that if a union consists of several fields of different sizes and only a smaller field is explicitly initialized, the remaining bytes allocated to accommodate the larger fields stay uninitialized. Consider the hypothetical system call handler in Listing 4, together with the memory layout of the SYSCALL_OUTPUT union shown in Figure 3.

    typedef union _SYSCALL_OUTPUT {
        DWORD Sum;
        QWORD LargeSum;
    } SYSCALL_OUTPUT, *PSYSCALL_OUTPUT;

    NTSTATUS NTAPI NtSmallSum(
        DWORD InputValue,
        PSYSCALL_OUTPUT OutputPointer
    ) {
        SYSCALL_OUTPUT OutputStruct;

        OutputStruct.Sum = InputValue + 2;

        RtlCopyMemory(OutputPointer, &OutputStruct, sizeof(SYSCALL_OUTPUT));
        return STATUS_SUCCESS;
    }

Listing 4: Memory disclosure through partial initialization of a union.

Figure 3: In-memory representation of the union, including alignment.

The total size of the SYSCALL_OUTPUT union is 8 bytes (determined by the larger LargeSum field). However, the function sets only the smaller field, leaving the last 4 bytes uninitialized, which then leaks them to the client application.

A secure implementation should write only the Sum field to the user address space rather than copy the entire object with its potentially unused memory areas. Another working fix is to call memset to zero the kernel-side copy of the union before setting any of its fields and passing it back to user mode.
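The first of these two fixes can be sketched as follows. This is a user-space illustration with hypothetical names (syscall_output, small_sum_union); a byte buffer stands in for the user-mode output pointer so that the untouched bytes can be observed.

```c
#include <stdint.h>
#include <string.h>

typedef union {
    uint32_t sum;
    uint64_t large_sum;
} syscall_output;

/* Hypothetical analogue of the handler in Listing 4. Only the member that
 * was actually set is written to the caller's buffer. */
int small_sum_union(uint32_t input, unsigned char *user_out)
{
    syscall_output out;

    out.sum = input + 2;
    memcpy(user_out, &out.sum, sizeof(out.sum));   /* 4 bytes, not sizeof(out) */
    return 0;
}
```

Because the copy length is sizeof(out.sum) rather than sizeof(out), the trailing four bytes of the caller's buffer are never written, so nothing from the kernel-side object can leak through them.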

Insecure sizeof

As the two previous subsections show, use of the sizeof operator can directly or indirectly contribute to kernel memory disclosure by causing more data to be copied than was actually initialized.

C has no facility for safely transferring data from the kernel to user space or, more generally, between any two security contexts. The language carries no runtime metadata that could indicate which bytes have been set in each data structure used to interact with the OS kernel. As a result, the responsibility lies with the programmer, who must determine which parts of each object should be passed to the calling code. Done properly, this means writing a separate secure copy function for every output structure used in system calls, which in turn bloats the code, hurts its readability, and is generally a tedious and time-consuming task.
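A per-structure copy function of the kind described above might look like the sketch below. The names syscall_output, OUTPUT_WIRE_SIZE, and copy_output_to_user are hypothetical, and the packed output layout is an assumption of this example rather than something the paper prescribes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sum;
    uint64_t large_sum;
} syscall_output;

/* Bytes that carry information: 4 + 8, with no padding in between. */
#define OUTPUT_WIRE_SIZE (sizeof(uint32_t) + sizeof(uint64_t))

/* Hypothetical per-structure copy routine: each named field is copied
 * individually into a packed buffer, so padding is never read. */
size_t copy_output_to_user(unsigned char *dst, size_t dst_len,
                           const syscall_output *src)
{
    if (dst_len < OUTPUT_WIRE_SIZE) {
        return 0;                       /* buffer too small, nothing copied */
    }
    memcpy(dst, &src->sum, sizeof(src->sum));
    memcpy(dst + sizeof(src->sum), &src->large_sum, sizeof(src->large_sum));
    return OUTPUT_WIRE_SIZE;
}
```

The trade-off is visible even in this small case: the routine has to be kept in sync with the structure definition by hand, and the client must agree on the packed layout, which is exactly the maintenance burden the paragraph above describes.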

On the other hand, it is convenient and simple to copy an entire kernel memory area with a single memcpy call and a sizeof argument, and let the client decide which parts of the output to use. This is the approach used in Windows and Linux today. When a specific information leak is detected, the OS vendor promptly ships a patch that adds a memset call. Unfortunately, this does not solve the problem in the general case.

OS specifics

Certain kernel design decisions, programming practices, and code patterns affect how prone an operating system is to memory disclosure vulnerabilities. They are covered in the following subsections.

Reuse of dynamic memory

Current dynamic memory allocators (in both user mode and kernel mode) are highly optimized, since their performance has a significant impact on the performance of the whole system. One of the most important optimizations is memory reuse: when memory is freed, it is rarely discarded completely; instead, it is kept in a list of regions ready to be handed out on the next allocation request. To save CPU cycles, memory regions are by default not cleared between being freed and being allocated again. As a result, two unrelated parts of the kernel may work with the same memory range within a short time of each other. This means that a leak of dynamic kernel memory can disclose data belonging to various OS components.
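The reuse behaviour can be reproduced with a toy fixed-size pool. This is a deliberately simplified user-space sketch with hypothetical names (pool_init, pool_alloc, pool_free); real kernel allocators are far more elaborate, but the property shown here, that a freed chunk comes back with its old contents, is the one the text describes.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE  32
#define CHUNK_COUNT 4

/* Toy fixed-size pool, a hypothetical stand-in for a kernel allocator. */
static unsigned char pool_storage[CHUNK_COUNT][CHUNK_SIZE];
static void *free_list[CHUNK_COUNT];
static size_t free_count;

/* Puts every chunk on the free list. */
void pool_init(void)
{
    size_t i;

    free_count = 0;
    for (i = 0; i < CHUNK_COUNT; i++) {
        free_list[free_count++] = pool_storage[i];
    }
}

/* Pops the most recently freed chunk; contents are left as they were. */
void *pool_alloc(void)
{
    return free_count ? free_list[--free_count] : NULL;
}

/* Pushes the chunk back without clearing it, to save cycles. */
void pool_free(void *chunk)
{
    free_list[free_count++] = chunk;
}
```

If one component allocates a chunk, writes a secret into it, and frees it, the next pool_alloc call returns the same chunk with the secret still readable.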

The following paragraphs give a brief overview of the allocators used in the Windows and Linux kernels and their most notable properties.

Windows
The key function of the Windows kernel pool manager is ExAllocatePoolWithTag, which can be called directly or through one of the available wrappers: ExAllocatePool{∅, Ex, WithQuotaTag, WithTagPriority}. None of these functions clears the contents of the returned memory, either by default or through any input flag. Instead, they all carry the following warning in their MSDN documentation:

Note Memory that ExAllocatePoolWithTag allocates is uninitialized. A kernel-mode driver must first zero this memory if it is going to make it visible to user-mode software (to avoid leaking potentially privileged contents).

The calling code can choose from six basic pool types: NonPagedPool, NonPagedPoolNx, NonPagedPoolSession, NonPagedPoolSessionNx, PagedPool, and PagedPoolSession. Each has a separate region in the virtual address space, so allocated memory areas can be reused only within the same pool type. Chunks of memory are reused very frequently, and zeroed areas are usually returned only when no suitable entry is found in the lookaside lists or when the request is so large that new memory pages are required. In other words, there is currently almost nothing to prevent pool memory disclosure in Windows, and almost every such bug can be used to leak sensitive data from different parts of the kernel.

Linux
The Linux kernel has three main interfaces for dynamic memory allocation:

kmalloc is a general-purpose function for allocating memory blocks of arbitrary size (contiguous in both virtual and physical address space), built on slab allocation.
kmem_cache_create and kmem_cache_alloc form a specialized mechanism for allocating fixed-size objects (structures, for example), also built on slab allocation.
vmalloc is a rarely used allocation function that returns regions that are not guaranteed to be contiguous in physical memory.

By themselves, these functions do not guarantee that the returned regions are free of old (potentially confidential) data, which makes kernel heap memory disclosure possible. However, there are several ways for the calling code to request zeroed memory:

kmalloc has a counterpart, kzalloc, which guarantees that the returned memory is cleared.
An additional __GFP_ZERO flag can be passed to kmalloc, kmem_cache_alloc, and some other functions to achieve the same result.
kmem_cache_create accepts a pointer to an optional constructor function that is called to pre-initialize each object before it is returned to the calling code. The constructor can be implemented as a wrapper around memset that zeroes the given memory area.
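The flag-based and wrapper-based options can be imitated in user space. The sketch below is only an analogy: ALLOC_ZERO, alloc_flags, and zalloc are hypothetical names built on malloc, loosely modelled on __GFP_ZERO and kzalloc, and they are not the kernel interfaces themselves.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag, modelled loosely on the kernel's __GFP_ZERO. */
#define ALLOC_ZERO 0x1u

/* User-space sketch: allocate `size` bytes, clearing them on request. */
void *alloc_flags(size_t size, unsigned int flags)
{
    void *p = malloc(size);

    if (p != NULL && (flags & ALLOC_ZERO)) {
        memset(p, 0, size);
    }
    return p;
}

/* Convenience wrapper in the spirit of kzalloc. */
void *zalloc(size_t size)
{
    return alloc_flags(size, ALLOC_ZERO);
}
```

Making the zeroing variant as easy to call as the plain one means the safe choice costs the caller nothing extra in code, only the cycles spent clearing the memory.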

We consider the availability of these options beneficial for kernel security: they encourage developers to make informed decisions and let them simply use the existing allocation functions instead of adding a memset call after every dynamic allocation.

Fixed-size arrays

A number of OS resources are accessed by their text names. Windows has a wide variety of named resources, for example files and directories, registry keys and values, windows, fonts, and more. For some of them, the length of the name is limited and expressed by a constant such as MAX_PATH (260) or LF_FACESIZE (32). In such cases, kernel developers often simplify the code by declaring buffers of the maximum size and copying them whole (for example, using the sizeof operator) instead of working only with the relevant part of the string. This is especially convenient when the strings are members of larger structures: such objects can be moved around in memory freely without having to manage pointers to dynamic memory.

As one would expect, large buffers are rarely used in full, and the remaining space is often not cleared. This can lead to particularly severe leaks of long contiguous areas of kernel memory. In the example in Listing 5, the system call uses the RtlGetSystemPath function to load the system path into a local buffer, and if the call succeeds, all 260 bytes are passed to the caller regardless of the actual string length.

    NTSTATUS NTAPI NtGetSystemPath(PCHAR OutputPath) {
        CHAR SystemPath[MAX_PATH];
        NTSTATUS Status;

        Status = RtlGetSystemPath(SystemPath, sizeof(SystemPath));
        if (NT_SUCCESS(Status)) {
            RtlCopyMemory(OutputPath, SystemPath, sizeof(SystemPath));
        }

        return Status;
    }

Listing 5: Memory disclosure through partial initialization of a string buffer.

The memory region copied back to user space in this example is shown in Figure 4.

Figure 4: Memory of a partially initialized string buffer.

A safe implementation should return only the requested path, not the entire buffer used to store it. This example again demonstrates how the data size reported by the sizeof operator (used as the length argument of RtlCopyMemory) can be completely wrong with respect to the actual amount of data that the kernel needs to pass to user space.
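A safe variant can be sketched in user-space C as follows. Everything here is hypothetical: get_system_path imitates RtlGetSystemPath from Listing 5 with a fixed example path, and get_system_path_safe copies only the string and its terminator.

```c
#include <stddef.h>
#include <string.h>

#define MAX_PATH_LEN 260   /* mirrors MAX_PATH from the text */

/* Hypothetical stand-in for RtlGetSystemPath: fills only part of `buf`. */
static int get_system_path(char *buf, size_t buf_len)
{
    static const char path[] = "C:\\Windows\\system32";

    if (buf_len < sizeof(path)) {
        return -1;
    }
    memcpy(buf, path, sizeof(path));    /* includes the terminator */
    return 0;
}

/* Copies only the string and its terminator to the caller, returning the
 * number of bytes written, or 0 on failure. */
size_t get_system_path_safe(char *user_out, size_t user_len)
{
    char system_path[MAX_PATH_LEN];
    size_t used;

    if (get_system_path(system_path, sizeof(system_path)) != 0) {
        return 0;
    }
    used = strlen(system_path) + 1;
    if (used > user_len) {
        return 0;
    }
    memcpy(user_out, system_path, used);  /* not sizeof(system_path) */
    return used;
}
```

The copy length is derived from the data (strlen plus one for the terminator) instead of from the buffer declaration, so the uninitialized tail of system_path never leaves the function.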

Arbitrary size of system call output

Most system calls accept a pointer to a user-mode output buffer together with the buffer size. In most cases, the size should be used only to determine whether the supplied buffer is large enough to receive the output of the system call; the full size of the supplied buffer should not be used as the amount of memory to copy. However, we do see cases where the kernel tries to fill every byte of the user's output buffer regardless of how much data actually needs to be copied. An example of this behavior is shown in Listing 6.

    NTSTATUS NTAPI NtMagicValues(LPDWORD OutputPointer, DWORD OutputLength) {
        if (OutputLength < 3 * sizeof(DWORD)) {
            return STATUS_BUFFER_TOO_SMALL;
        }

        LPDWORD KernelBuffer = Allocate(OutputLength);

        KernelBuffer[0] = 0xdeadbeef;
        KernelBuffer[1] = 0xbadc0ffe;
        KernelBuffer[2] = 0xcafed00d;

        RtlCopyMemory(OutputPointer, KernelBuffer, OutputLength);
        Free(KernelBuffer);

        return STATUS_SUCCESS;
    }

Listing 6: Memory disclosure through an output buffer of arbitrary size.

The purpose of the system call is to provide the calling code with three special 32-bit values, occupying 12 bytes in total. Checking the buffer size at the very beginning of the function is correct, but the use of the OutputLength argument should end there. Knowing that the output buffer is large enough to hold the result, the kernel could allocate 12 bytes of memory, fill them, and copy the contents back to the supplied user-mode buffer. Instead, the system call allocates a pool block (with a user-controlled length, no less) and copies the entire allocation to user space. All bytes except the first 12 are therefore uninitialized and are mistakenly disclosed to the user, as shown in Figure 5.

Figure 5: Memory of an arbitrary-size buffer.
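A corrected handler for Listing 6 can be sketched like this. It is a user-space illustration with hypothetical names (magic_values, ERR_BUFFER_TOO_SMALL); the essential change is that the caller-supplied length is used only for the size check, while the copy length is fixed by the data.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ERR_BUFFER_TOO_SMALL (-1)   /* stands in for STATUS_BUFFER_TOO_SMALL */

/* Hypothetical analogue of NtMagicValues from Listing 6. The caller's length
 * is used only for the size check; the copy length is fixed by the data. */
int magic_values(unsigned char *user_out, size_t user_len)
{
    const uint32_t values[3] = { 0xdeadbeef, 0xbadc0ffe, 0xcafed00d };

    if (user_len < sizeof(values)) {
        return ERR_BUFFER_TOO_SMALL;
    }
    memcpy(user_out, values, sizeof(values));   /* always 12 bytes */
    return 0;
}
```

Here sizeof is safe because the array is fully initialized and exactly as large as the data it carries; the caller's buffer beyond the first 12 bytes is left untouched.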

The pattern discussed in this section is especially common on Windows. Such a bug can give an attacker an extremely useful memory disclosure primitive:

, , . , : AddressSanitizer , PageHeap Special Pool . , , - . , . , , , , , . , ( ).

, API
API, Windows (Win32/User32 API). API , , , . , , , , . .

, , . , KASLR (Kernel Address Space Layout Randomization ), . : Windows, Hacking Team 2015 ( Juan Vazquez. Revisiting an Info Leak ) (derandomize) win32k.sys, . , Matt Tait' Google Project Zero ( Kernel-mode ASLR leak via uninitialized memory returned to usermode by NtGdiGetTextMetrics ) MS15-080 (CVE-2015-2433).

(/) , , (control ﬂow), : , , , , StackGuard Linux /GS Windows . , . , , .

Microsoft Windows

2015 Windows. 2015 Matt Tait win32k!NtGdiGetTextMetrics. Windows Hacking Team. , , , 0-day Windows.

2015, WanderingGlitch (HP Zero Day Initiative) ( Acknowledgments – 2015 ). Ruxcon 2016 ( ) "Leaking Windows Kernel Pointers" .

, 2017 fanxiaocao pjf IceSword Lab (Qihoo 360) "Automatically Discovering Windows Kernel Information Leak Vulnerabilities" , , 14 2017 (8 ). Bochspwn Reloaded, , . VMware (Bochs) . , Bochspwn Reloaded, .

, , 2010-2011 , win32k: "Challenge: On 32bit Windows7, explain where the upper 16bits of eax come from after a call to NtUserRegisterClassExWOW()" "Subtle information disclosure in WIN32K.SYS syscall return values" . Windows 8, 2015 Matt Tait , : Google Project Zero Bug Tracker .

( ), , 2017 - Windows -, : Joseph Bialek — "Anyone notice my change to the Windows IO Manager to generically kill a class of info disclosure? BuﬀeredIO output buﬀer is always zero'd" . , IOCTL- .

, Visual Studio 15.5 POD- , "= {0}", . , padding- () .

Linux

Windows, Linux , 2010 . , ( ) ( ) . , Windows Linux , — , .

, Linux . "Linux kernel vulnerabilities: State-of-the-art defenses and open problems" 2010 2011 28 . 2017- "Securing software systems by preventing information leaks" Lu K. 59 , 2013- 2016-. . : Rosenberg Oberheide 25 , Linux 2009-2010 , . Linux c grsecurity / PaX-hardened . Vasiliy Kulikov 25 2010-2011 , Coccinelle . , Mathias Krause 21 2013 50 .

, , Linux. — -Wuninitialized ( gcc, LLVM), . kmemcheck , Valgrind' . , . , KernelAddressSANitizer KernelMemorySANitizer . KMSAN syzkaller ( ) 19 , .

Linux. 2014 — 2016 Peir´o Coccinelle , Linux 3.12: "Detecting stack based kernel information leaks" International Joint Conference SOCO14-CISIS14-ICEUTE14, pages 321–331 (Springer, 2014) "An analysis on the impact and detection of kernel stack infoleaks" Logic Journal of the IGPL. , . 2016- Lu UniSan — , , : , . , 20% (350 1800), 19 Linux Android.

— (multi-variant program execution), , . , . , KASLR, -, . , 2006 DieHard: probabilistic memory safety for unsafe languages, 2017 — BUDDY: Securing software systems by preventing information leaks. John North "Identifying Memory Address Disclosures" 2015- . , SafeInit (Comprehensive and Practical Mitigation of Uninitialized Read Vulnerabilities) , , . , , , Linux.

CONFIG_PAGE_POISONING CONFIG_DEBUG_SLAB, -. -, . , , , Linux.

grsecurity / PaX . , PAX_MEMORY_SANITIZE , slab , ( — ). , PAX_MEMORY_STRUCTLEAK , ( ), . padding- (), 100% . , — PAX_MEMORY_STACKLEAK, . , , . (Kernel Self Protection Project) STACKLEAK .

Linux:

Secure deallocation, Chow , 2005

Chow, Jim and Pfaff, Ben and Garfinkel, Tal and Rosenblum, Mendel. Shredding Your Garbage: Reducing Data Lifetime Through Secure Deallocation. In USENIX Security Symposium, pages 22–22, 2005.

Split Kernel, Kurmus Zippel, 2014

Kurmus, Anil and Zippel, Robby. A tale of two kernels: Towards ending kernel hardening wars with split kernel. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1366–1377. ACM, 2014.

SafeInit, Milburn , 2017

Milburn, Alyssa and Bos, Herbert and Giuffrida, Cristiano. SafeInit: Comprehensive and Practical Mitigation of Uninitialized Read Vulnerabilities. In Proceedings of the 2017 Annual Network and Distributed System Security Symposium (NDSS) (San Diego, CA), 2017.

UniSan, Lu , 2016

Lu, Kangjie and Song, Chengyu and Kim, Taesoo and Lee, Wenke. UniSan: Proactive kernel memory initialization to eliminate data leakages. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 920–932. ACM, 2016.

The remaining, untranslated part of the document covers:

Bochspwn Reloaded – detection with software x86 emulation
Windows bug reproduction techniques
Alternative detection methods
Other data sinks
Future work
Other system instrumentation schemes

Source: https://habr.com/ru/post/415685/
