Towards a Hybrid Approach to Protect Against Memory Safety Vulnerabilities

Memory corruption bugs continue to plague low-level systems software generally written in unsafe programming languages. In order to detect and protect against such exploits, many pre- and post-deployment techniques exist. In this position paper, we propose and motivate the need for a hybrid approach for the protection against memory safety vulnerabilities, combining techniques that can identify the presence (and absence) of vulnerabilities pre-deployment with those that can detect and mitigate such vulnerabilities post-deployment. Our hybrid approach involves three layers: hardware runtime protection provided by capability hardware, software runtime protection provided by compiler instrumentation, and static analysis provided by bounded model checking and symbolic execution. The key aspect of the proposed hybrid approach is that the protection offered is greater than the sum of its parts: the expense of post-deployment runtime checks is reduced via information obtained during pre-deployment analysis. During pre-deployment analysis, static checking can be guided by runtime information.


I. INTRODUCTION
Memory errors in low-level systems software written in unsafe programming languages such as C or C++ represent one of the main problems in computer security [1]. In particular, in the MITRE ranking [2], the top ten vulnerabilities include four types of memory errors. Microsoft reports that around 70% of all security updates in their products address memory issues [3], and Google reports a similar number regarding bugs in the Chrome Browser [4].
Techniques to detect memory errors can be broadly classified into two categories: detecting and removing vulnerabilities before deployment [5]-[8], or detecting and mitigating them post-deployment [9]-[17]. Post-deployment techniques necessarily run as part of the executed code, i.e., at runtime. Pre-deployment techniques are more diverse, including runtime techniques designed to be used as part of testing and static techniques that directly analyze the source code. Runtime techniques are exact because they check a set of concrete behaviors defined by a set of given inputs. Conversely, static techniques aim to check all possible program behaviors but necessarily approximate this due to a lack of context and the well-known state-explosion problem (i.e., scalability limitations). Compared to static techniques, runtime techniques can be more generally applicable, but they may still introduce unacceptable overhead for post-deployment.
The result is a set of techniques with varying coverage and performance profiles (summarised in Section II). In this position paper, we present an experimental analysis (Section III) that demonstrates that these techniques (or at least some tools representing them) are complementary in the sense that no tool captures all vulnerabilities. We then propose a hybrid framework (Section IV) that aims to combine techniques but also, most interestingly, provides an opportunity for cooperation. Our goal is to combine techniques that (i) work with legacy code, (ii) do not require modification to the source code, and (iii) provide a low barrier to adoption. This goal guides our choice of memory protection techniques in this work.

II. MEMORY PROTECTION TECHNIQUES
There are two main approaches to detecting memory errors pre-deployment: runtime techniques that aim to identify potential errors with a high overhead restricting them to pre-deployment test runs, and static techniques that explore the possible behaviors of the program without executing it. The main approach to providing protection post-deployment is to check memory accesses to ensure that they are safe; this may be via compiler-level instrumentation or via hardware support with new technologies that go beyond the traditional page table-based protection (e.g., Intel MPX [18], MPK [19], or hardware capabilities [17]). Post-deployment usually provides strong assurance against vulnerabilities but discovery of vulnerabilities (or false positives) at the post-deployment stage can lead to considerable disruptions. Below we outline the main techniques for runtime and static analysis.

A. Runtime Analysis
Checking memory access at runtime requires additional work. There is a trade-off between the amount of security provided and the level of overhead required. Often, techniques with large overheads are deemed incompatible with post-deployment except in the most security-critical settings.
Runtime checks may occur in the software or hardware. In software, such checks are typically inserted by the compiler. However, how this is performed and the overhead/coverage profile varies between tools. Alternatively, checks may be supported by unique hardware mechanisms. In this work, we initially consider three runtime analysis techniques:
• AddressSanitizer (ASAN) [6]: This tool uses a combination of shadow memory and so-called red zones with poisoning to detect spatial errors and a special memory allocator that provides address quarantine to detect temporal errors (with extra checks behind options). Developers suggest that ∼2x slowdown is standard.
• SoftBoundCETS (SB) [9], [20]: This tool tracks pointers' metadata (e.g., base, bound) using shadow space inspired mechanisms (instead of fat pointers) and uses this to insert checks into LLVM IR code to detect spatial and temporal errors. Experimental results [20] report average ∼2.16x slowdown (up to 4x).
• PureCap [17]: The CHERI model implements memory access capabilities enforced by the hardware. A capability is a token giving access to a particular area of the virtual address space. In the PureCap model, each pointer of a C/C++ program is represented by a capability that carries metadata about the buffer bounds, access rights, etc. One of the advantages of PureCap is that it provides protection at the hardware level rather than intermediate levels that rely on correct implementation of compilers/machine code translation. Limitations include the need for specialised hardware and an increase in pointer sizes (∼2x) and corresponding increase in memory consumption.
A limitation of runtime techniques for pre-deployment checking is the need for concrete inputs. One method for addressing this is fuzzing [21], which attempts to find inputs that produce specific behaviors.
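To make the shadow-memory idea concrete, the following is a minimal sketch of a toy heap in which every byte has a shadow "poisoned" flag, allocations are separated by red zones, and freed blocks stay poisoned (a crude quarantine). It is an illustrative model only, not ASAN's actual implementation; the names ToyHeap, REDZONE, and MemorySafetyViolation are hypothetical, and the bump allocator that never reuses memory is a simplifying assumption.

```python
# Toy model of shadow memory with red zones and a quarantine.
# All names here are hypothetical; this is not how ASAN is implemented.

REDZONE = 4  # bytes of poisoned padding kept between allocations


class MemorySafetyViolation(Exception):
    """Raised when an access or free touches memory it must not."""


class ToyHeap:
    def __init__(self, size):
        self.data = bytearray(size)
        # Shadow state, one flag per byte: True means "poisoned".
        self.poisoned = [True] * size
        self.next_free = 0   # bump allocator: memory is never reused
        self.sizes = {}      # base address -> size of each live block

    def malloc(self, n):
        base = self.next_free + REDZONE   # leave a red zone before the block
        end = base + n
        if end + REDZONE > len(self.data):
            raise MemoryError("toy heap exhausted")
        for addr in range(base, end):
            self.poisoned[addr] = False   # unpoison only the usable bytes
        self.next_free = end
        self.sizes[base] = n
        return base

    def free(self, base):
        if base not in self.sizes:
            raise MemorySafetyViolation(f"invalid or double free of {base}")
        n = self.sizes.pop(base)
        for addr in range(base, base + n):
            self.poisoned[addr] = True    # quarantine: freed bytes stay poisoned

    def _check(self, addr):
        if not 0 <= addr < len(self.data) or self.poisoned[addr]:
            raise MemorySafetyViolation(f"invalid access at address {addr}")

    def load(self, addr):
        self._check(addr)
        return self.data[addr]

    def store(self, addr, value):
        self._check(addr)
        self.data[addr] = value
```

In this model, an access one byte past an 8-byte block lands in a red zone and is reported as a spatial error, while a load through a freed address is reported as a temporal error because the block is never unpoisoned again. A real allocator eventually recycles quarantined memory, which is one reason such checks are probabilistic rather than complete.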

B. Static Analysis
Static techniques analyze the source code itself, searching the possible set of execution traces. There are, broadly, two main approaches: breadth-first bounded model checking [22] unrolls the program, representing the reachability of a particular state by any path as a verification condition; and depth-first path-based symbolic execution [23] encodes a single path through the program as a set of symbolic constraints. Memory safety is cast as reachability of an unsafe state, and a satisfying assignment to the produced verification condition represents a counter-example, e.g., a set of inputs that leads to the error. In this work, we initially consider two static analysis tools since they achieved first place in the Cover-Error category (i.e., find a test that covers a bug) at the 3rd International Competition on Software Testing (Test-Comp 2021) [24]:
• ESBMC [25]: This is a bounded model checker utilizing Clang to transform C programs into an intermediate GOTO language. This is then symbolically executed, producing verification conditions for SMT solvers.
• FuSeBMC [26], [27]: This is a white-box fuzzer that injects labels into C programs and then uses a combination of ESBMC and a path-based symbolic execution tool called Map2check [28] to find inputs that reach those labels (while checking for vulnerabilities).
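The idea of casting memory safety as reachability of an unsafe state within a bound can be sketched as follows. This is a toy illustration under stated assumptions, not how ESBMC works: the "program" is a hypothetical loop that advances an index by an input value and then writes to a 4-element buffer, and exhaustive enumeration over a tiny input domain stands in for the SMT solver that a real bounded model checker would use. The names BUF_SIZE, violates, and find_counterexample are invented for this example.

```python
from itertools import product

BUF_SIZE = 4  # size of the buffer in the hypothetical program under analysis


def violates(inputs):
    """Simulate the toy program on one input sequence.

    Each loop iteration advances idx by one input value and then writes
    to buf[idx]. Returns True if any write falls outside the buffer.
    """
    idx = 0
    for step in inputs:
        idx += step
        if not 0 <= idx < BUF_SIZE:
            return True   # unsafe state reached: out-of-bounds write
    return False


def find_counterexample(bound, domain=(0, 1, 2)):
    """Search all input sequences for the loop unrolled `bound` times.

    Returns the first input sequence that reaches the unsafe state,
    or None if the program is safe up to this bound.
    """
    for inputs in product(domain, repeat=bound):
        if violates(inputs):
            return inputs
    return None
```

With the loop unrolled once, no input in the domain reaches the unsafe state, so `find_counterexample(1)` returns `None`; with two unrollings, `find_counterexample(2)` returns `(2, 2)`, a concrete input that drives the index to 4. A result of `None` only establishes safety up to the chosen bound, which is why bounded techniques report bugs more readily than they certify their absence.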

III. EXPERIMENTAL ANALYSIS
We perform an experimental analysis with the selected tools using benchmarks taken from the 2021 memory safety category of SV-COMP [29], which contain various open-source applications, e.g., bftpd, which is an FTP server for Unix systems. We aim to demonstrate and explore their complementary nature. We begin by highlighting existing evidence; for example, the results of the most recent SV-COMP competitions [30] show that different techniques find different errors. We split our experimental analysis between benchmarks with given inputs and those without given inputs as the appropriate tools differ.

A. Programs with No Required Input
We run all tools on the 178 memory-safety benchmarks from SV-COMP 2021, where no input is required. We set the time limit of each run to 900 seconds (the SV-COMP time limit). These benchmarks are representative of a broad cross-section of essential vulnerabilities. They vary in size and complexity but are generally small, focusing on the vital vulnerability while being indicative of real-world scenarios.
The results are in Table I. The first thing to note is that every tool detects a different set of vulnerabilities. Runtime techniques detect more than static techniques, which is unsurprising as there is only a single behavior to analyze. However, static techniques detect some vulnerabilities that runtime techniques miss. One interesting case is a potential stack-use-after-scope vulnerability that is not triggered in the program but presents a future vulnerability; it is detected by static techniques but not by runtime techniques.
Combining all three runtime tools (by taking the union of the reported bugs) produces six incorrect verdicts. The interesting cases are false negatives (existing bugs not reported), which arise because ASAN fails to detect invalid memory cleanup (SB and PureCap do not handle memory leaks), and a false positive in which SoftBoundCETS reports a nonexistent bug due to its lack of support for the C library function memcpy.
In terms of complementarity, ASAN detects nine bugs that SB and PureCap do not detect, and SB detects six that the other two tools do not. In contrast, PureCap did not detect any unique vulnerabilities (but should ultimately have a better performance profile).
In terms of performance, the current PureCap implementation used in the analysis is a prototype software model (emulated capability hardware) that does not give realistic performance numbers. Therefore, we compare the runtime overhead of ASAN and SB. The mean overhead for ASAN was 4.10x and for SB it was 4.46x, but there was significant variance: 27.91 for ASAN and 96.23 for SB. We note that the amount of overhead introduced in safe benchmarks is significantly lower (2.33x ± 0.28 for ASAN and 1.01x ± 0.04 for SB) than in the unsafe ones (7.54x ± 64.4 and 12.27x ± 226.29). This is due to the relatively short runtime of the evaluated benchmarks (0.11s ± 0.23s and 0.14s ± 0.25s) in comparison to the overhead introduced by the termination procedure after finding a vulnerability.
The static techniques demonstrated significantly more timeouts even though each program had a single path. In five cases, ESBMC produced incorrect answers: in one case, it could not detect a comparison of freed pointers, and in the remaining four, it reported a bug in safe code (it wrongly identified a dereference of a NULL pointer). FuSeBMC repeated four of these five incorrect verdicts (including the comparison of freed pointers).

B. Programs Requiring Input
We run ESBMC and FuSeBMC on the 127 unsafe benchmarks from SV-COMP 2021 where input is required, with a time limit of 900 seconds per run. We do not run Map2Check directly as it performed very poorly outside of the FuSeBMC setup. The results are in Table II. Both FuSeBMC and ESBMC returned incorrect verdicts for two benchmarks (undetected memory leaks). At the same time, ESBMC reached the timeout in 8 more cases (17 vs 9). For the unsafe verdicts, both ESBMC and FuSeBMC produced counter-examples (i.e., inputs) violating memory safety. Such inputs can be introduced into the original code (and possibly combined with the described runtime verification techniques) for further testing.

C. Vulnerability Analysis
During our experiments and exploration, we have identified vulnerabilities that cannot be detected by at least one of the selected tools. These are summarized in Table III and briefly discussed below.
a) Subobject-buffer-overflow: ASAN and SB do not track subobject bounds, so do not detect these vulnerabilities. PureCap has an additional option (requiring extra checks) that can detect subobject bounds. However, in some cases, this leads to more false positives, e.g., when performing pointer arithmetic on a pointer to a subobject [31].
b) Use-after-free: PureCap cannot detect this vulnerability as the current stable release only supports spatial safety. There is an experimental release based on CHERIvoke [32], which quarantines freed memory, but (for specific performance reasons) this does not handle use-after-free, rather the more specific use-after-reallocate vulnerability.
c) Stack-use-after-return: PureCap explicitly does not handle stack exploits, which would require complex (and expensive) revocation mechanisms. ASAN does not support this by default, although some versions (not the one we used) provide an option for additional checks.
d) Stack-use-after-scope: PureCap cannot handle these stack-based vulnerabilities. SB cannot detect this as the scoping information is not handled during its instrumentation phase at the intermediate level of the LLVM compiler.
e) Double-free: This is an example of a temporal memory safety vulnerability that the Cornucopia [33] extension of PureCap could detect, but the stable version does not.
f) Memory-leaks: SB and PureCap do not explicitly track memory and cannot detect this class of vulnerability.
g) Unions: PureCap does not support some program features. For example, due to separating pointers from other data and the larger pointer sizes, PureCap can incorrectly report buffer-overflow when unions are used.
h) Library Functions: It is worth noting that all mechanisms require access to the source code of any library functions in some way. SB and ESBMC provide mechanisms that allow the behavior of library calls to be emulated. SB, ASAN, and PureCap require external code to be compiled with the appropriate checks to provide coverage (and PureCap requires compatibility due to the different pointer sizes). ESBMC will over-approximate the behavior of library calls, but this can lead to many spurious false positives.
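The difference between object-level and subobject-level bounds discussed in item a) can be illustrated with a small model. The sketch below assumes a hypothetical C layout `struct user { char name[8]; int is_admin; }` placed at an arbitrary address; the Bounds class and all constants are invented for this example and do not correspond to any tool's metadata format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    base: int    # first valid address
    limit: int   # one past the last valid address

    def allows(self, addr, width=1):
        """True if the access [addr, addr + width) lies inside the bounds."""
        return self.base <= addr and addr + width <= self.limit


# Hypothetical layout of: struct user { char name[8]; int is_admin; };
OBJECT_BASE = 100
OBJECT = Bounds(OBJECT_BASE, OBJECT_BASE + 12)      # the whole struct
NAME_FIELD = Bounds(OBJECT_BASE, OBJECT_BASE + 8)   # only the name array

# Writing name[8] overflows the array into the first byte of is_admin.
overflow_addr = OBJECT_BASE + 8

object_level_ok = OBJECT.allows(overflow_addr)          # check against the struct
subobject_level_ok = NAME_FIELD.allows(overflow_addr)   # check against the field
```

Here `object_level_ok` is `True` and `subobject_level_ok` is `False`: a check that only knows the bounds of the enclosing allocation accepts the overflowing write, while a check narrowed to the field rejects it. The same narrowing also rejects code that deliberately steps from a field pointer back to its enclosing struct, which is the kind of pointer arithmetic that can show up as a false positive.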

D. Summary
Our experimental analysis supports the motivation that runtime and static techniques can complement each other for pre- and post-deployment protection. Interestingly, PureCap provides a subset of safety guarantees that are expected to be very cheap, suggesting a hybrid setup where PureCap handles these cheap checks while the rest are handled by in-software checks; this is what we propose next.

IV. PROPOSED HYBRID FRAMEWORK
Our proposed hybrid framework is illustrated in Fig. 1. This combines static and runtime protection mechanisms to offer protection at both pre- and post-deployment stages. Whilst combining techniques is not a new idea (e.g., [34]), our focus is on the combination across different deployment stages. The framework utilizes the LLVM toolchain for (i) the insertion of assertions, (ii) the translation of C code for static analyzers, and (iii) the compilation to PureCap ISA. Conveniently, the selected tools already use this toolchain. The goal is to provide an architecturally independent set of memory safety guarantees with a minimal performance impact. Therefore, to maximize our framework's adoption, compilation to PureCap will be optional, with runtime checks performed by compiler-inserted assertions for non-capability hardware.
By combining techniques, we aim to provide the union of protection coverage as 'cheaply' as possible, selecting the cheapest way to provide each check (noting that some methods are incompatible with some compiler optimizations). Further, we are convinced that the hybrid approach can achieve more than this. Below we outline the main directions in which the cooperation of different techniques can lead to a framework that provides greater protection than the union of its parts.

A. Isolating Libraries
As previously discussed, a problematic issue for all techniques is the interaction with external libraries. We assume various methods to compartmentalize the program and isolate the protected code from external libraries that are not subject to memory safety protection. Hardware memory capabilities [35] are one of the most efficient technologies to achieve that, providing exception-less security domain transitions and efficient cross-compartment communication through capabilities.
Many other compartmentalization abstractions can be used for platforms that do not support hardware capabilities, relying on various isolation mechanisms. These can be process-based isolation leveraging page tables [36], [37]; VM-based isolation using hardware-assisted virtualization [38], [39]; trusted execution environments [40], [41] and other ISA extensions such as Intel MPK [42]- [45]; and finally software-only solutions such as SFI [46]. These techniques offer various security/performance trade-offs and generally require a particular porting effort to manage data shared between compartments.

B. Certifying the Removal of Assertions
As well as detecting bugs, static tools can certify the absence of specific bugs in some or all of the code to achieve partial or complete certification. Here, k-induction [47], [48] can be used to prove a safety property φ for any given depth of the program's state space. The main idea is to use an iterative deepening approach and check, for each step k up to a maximum value, that φ holds in all states reachable within k iterations and that, if φ holds for k iterations, it also holds for the subsequent unfolding of the system. The main challenge of this approach lies in computing and strengthening loop invariants, which must be inductive (and not just invariant) to check the corresponding verification conditions [49]. Such certificates will be used to identify runtime checks that are no longer necessary and can be removed. We will also explore how to leverage the (cheap) assurances from PureCap in this process, i.e., explore whether (software-based) runtime checks can be removed when assuming the protection offered by PureCap.
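The iterative deepening scheme described above can be sketched on a small explicit-state system. This is an illustrative toy, not the algorithm of any particular tool: the transition relation NEXT, the initial state INIT, and the property phi are all hypothetical, and explicit enumeration of states replaces the solver queries a real k-induction engine would issue. The system is chosen so that the unsafe state 9 is unreachable from the initial state but is reachable from the unreachable states 7 and 8, which makes the inductive step fail for small k.

```python
# Hypothetical deterministic transition system over the states 0..9.
NEXT = {0: 2, 2: 4, 4: 6, 6: 0,   # cycle reachable from INIT
        1: 3, 3: 5, 5: 1,         # harmless unreachable cycle
        7: 8, 8: 9, 9: 9}         # unreachable chain ending in the bad state
INIT = 0


def phi(state):
    """Safety property: the bad state 9 is never entered."""
    return state != 9


def base_case(k):
    """phi holds on the first k states of the run starting at INIT."""
    state = INIT
    for _ in range(k):
        if not phi(state):
            return False
        state = NEXT[state]
    return True


def inductive_step(k):
    """From ANY state, k consecutive phi-states are followed by a phi-state."""
    for start in NEXT:
        state = start
        all_hold = True
        for _ in range(k):
            if not phi(state):
                all_hold = False
                break
            state = NEXT[state]
        if all_hold and not phi(state):
            return False   # found k good states followed by a bad one
    return True


def k_induction(max_k):
    """Try k = 1, 2, ... until phi is proved, refuted, or max_k is hit."""
    for k in range(1, max_k + 1):
        if not base_case(k):
            return ("violated", k)
        if inductive_step(k):
            return ("proved", k)
    return ("unknown", max_k)
```

For this system, `k_induction(5)` returns `("proved", 3)`. The step fails at k = 1 (state 8 satisfies phi but its successor does not) and at k = 2 (the path 7, 8, 9), and succeeds at k = 3 because no state precedes 7. Strengthening phi with an invariant that excludes states 7 and 8 would let the proof succeed at k = 1, which is the role played by inductive loop invariants.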

C. Safe under Assumptions
Combining the two previous ideas, and in addition to isolating unknown code, we will also explore the isolation of safe code. For example, where some code is statically shown to be safe under certain assumptions (typically at entry) or invariants, we will insert runtime checks for those assumptions or invariants. We may also be able to prove safety under additional assumptions, e.g., to replace a series of expensive runtime checks with fewer, cheaper ones. Finally, information about isolation can be used within the static analysis to modularise the checking process to (partially) address the state-explosion issue.
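As a minimal sketch of replacing a series of runtime checks with a single check on an entry assumption, consider a hypothetical copy routine. Both functions below are invented for illustration; the claim that the loop is safe whenever `0 <= n <= len(src)` and `n <= len(dst)` holds at entry is the kind of fact we assume a static analyzer would establish. Each function returns the number of bounds checks it executed so that the saving is visible.

```python
def copy_checked(dst, src, n):
    """Copy n elements with a bounds check on every access."""
    checks = 0
    for i in range(n):
        checks += 2   # one check for the read, one for the write
        if not (i < len(src) and i < len(dst)):
            raise IndexError(f"out-of-bounds access at index {i}")
        dst[i] = src[i]
    return checks


def copy_guarded(dst, src, n):
    """Copy n elements after one check of the entry assumption.

    Assumed static result: if the condition below holds at entry,
    every access in the loop is in bounds.
    """
    if not (0 <= n <= len(src) and n <= len(dst)):
        raise IndexError("entry assumption violated")
    for i in range(n):
        dst[i] = src[i]   # no per-access check needed under the assumption
    return 1
```

Copying three elements costs six checks in `copy_checked` and one in `copy_guarded`. The guarded version also fails before any write takes place, whereas the per-access version may have partially modified `dst` by the time it detects the violation.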

D. Static Analysis to Support Capability Revocation
One of the main limitations of capability-based hardware within the context of temporal memory safety is the need to revoke permissions and the overhead this requires. We propose using static analysis methods to identify when capabilities should be revoked and insert these directly into the code. For example, this should increase the number of use-after-free bugs detectable by the CHERIvoke [32] extension of PureCap.

V. CONCLUSION
This paper motivates and describes a proposed hybrid framework for memory safety protection. We analyze some techniques and tools for providing memory safety protection and identify areas in which they complement each other. We then propose a hybrid framework that aims to achieve joint coverage as cheaply as possible. Finally, we identify further research directions to take advantage of the potential cooperation of the combined techniques.

ACKNOWLEDGEMENT
This work was undertaken as part of the SCorCH: Secure Code for Capability Hardware project funded by EPSRC and Innovate UK as part of the Digital Security by Design (DSbD) challenge.