How to handle Out of Memory (OOM) in C

Overview

Out of memory conditions happen, but what can you do about it?

Nothing: abort() your application. What do the so-called memory-safe languages do? They abort(). No, seriously, that pretty much sums it up. I love short blog posts.

Oh, ok, if you really want me to back up my statement? Ugh, I guess I will.

What does malloc() actually do?

It allocates memory (heap) of the requested size for exclusive use by the application, right?

Here’s an excerpt from the malloc() manpage on Linux:

By default, Linux follows an optimistic memory allocation strategy.  This means that when malloc() returns non-NULL there is no guarantee that the memory really is available. 

Ok, so, maybe not, at least on Linux (without tweaking system tunables that aren't available at the application level). So what happens when you try to use memory that you think is available but isn't? Well, the kernel will just kill your application; this is known as the OOM Killer.
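
To make that concrete, here's a tiny demo you can try. This is only a sketch: how far it actually gets depends entirely on your RAM, swap, and vm.overcommit_memory setting, and the 64 GiB figure is just an illustrative guess (it also assumes a 64-bit build).

    /* With Linux's default optimistic overcommit, malloc() may return
     * non-NULL for memory the system can't actually back. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t len = (size_t)64 * 1024 * 1024 * 1024; /* 64 GiB */
        char  *p   = malloc(len);

        if (p == NULL) {
            fprintf(stderr, "malloc refused the request up front\n");
            return 1;
        }
        printf("malloc of 64 GiB returned non-NULL\n");

        /* Touching the pages forces the kernel to actually supply them.
         * On an overcommitted system this is where the OOM Killer may
         * strike; you may never see the final printf. */
        memset(p, 0xff, len);
        printf("touched every page\n");

        free(p);
        return 0;
    }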

So if the system can terminate your application at any moment anyway, why go to great lengths to handle the rare case where the allocator happens to tell you first? abort()!

But what, you say, if you aren’t on Linux (or a similar system with optimistic memory allocation), or if you can guarantee the system is configured to not optimistically allocate? Continue reading…

Handling malloc() failures

Everyone knows that if malloc() knows the memory isn't available, it returns NULL, right? Every C programming book tells you to always check the return value of malloc(), right? Sounds simple: just back out anything your function has done and let the caller know there's no available memory. Done.
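
For reference, the textbook pattern looks something like this (a sketch with made-up names, not from any real codebase):

    #include <stdlib.h>
    #include <string.h>

    struct widget {
        char *name;
        char *data;
    };

    /* Returns a new widget, or NULL on allocation failure. */
    struct widget *widget_create(const char *name, size_t data_len)
    {
        struct widget *w = malloc(sizeof(*w));
        if (w == NULL)
            return NULL;

        w->name = strdup(name);
        if (w->name == NULL) {
            free(w);       /* back out the first allocation */
            return NULL;
        }

        w->data = malloc(data_len);
        if (w->data == NULL) {
            free(w->name); /* back out everything done so far */
            free(w);
            return NULL;
        }

        return w;
    }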

Well, let's take a 10,000 ft view of a complex application. The system is out of memory, and the application receives some request, maybe via a network connection or direct user input. A low-level library routine hits a memory allocation failure because the system is out of memory. You're a great coder: you check the return of every function and have no void functions, so you're guaranteed to catch every memory allocation failure, and you can propagate this error all the way up to the application layer that submitted the request. Now what? In order to let the user know you're out of memory, it's likely that ... you've got to allocate memory. Whoops. So your application is now quite literally hung (or would seem to be, to an outside party), with no way to let anyone know; you just dropped the request on the floor. At least abort()ing your application would notify the end user that something was wrong.

Ok, ok, maybe I’m just a bad coder and haven’t come up with anything creative enough, like preallocating buffers at application startup just to handle this scenario.

Let me ask you this: you're 100 functions deep in your call stack and you get a memory allocation failure. How are you testing to make sure you actually unwind everything properly? Are you really writing tooling to test memory allocation failures at every possible failure point (independently!) so you can validate that your application recovers? Are you using code coverage to confirm and validate this while running under a dynamic memory analysis tool like AddressSanitizer? I'm going to guess the answer is a resounding no, or at best sometimes, which means you are much more likely to crash or introduce security vulnerabilities in these completely untested code paths. You've added probably thousands of additional branches to your application to check for something that is unlikely to occur and is ultimately unrecoverable. Is this really better?
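
For the record, the kind of tooling involved looks roughly like this: a malloc() wrapper that can be told to fail the Nth allocation, driven by a harness that loops over every allocation site. This is only a sketch, and all the names are hypothetical:

    #include <stdlib.h>

    static unsigned long alloc_count = 0;
    static unsigned long fail_at     = 0; /* 0 = never inject a failure */

    /* Called by the test harness before each run. */
    void test_fail_allocation(unsigned long n)
    {
        alloc_count = 0;
        fail_at     = n;
    }

    /* The application allocates through this instead of malloc(). */
    void *my_malloc(size_t size)
    {
        alloc_count++;
        if (fail_at != 0 && alloc_count == fail_at)
            return NULL; /* injected failure */
        return malloc(size);
    }

    /* A harness would then do something like:
     *   for (n = 1; n <= total_allocations; n++) {
     *       test_fail_allocation(n);
     *       run_request();   -- must not crash, leak, or hang
     *   }
     * ideally under AddressSanitizer with code coverage enabled. */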

Note:

You might know I work on c-ares and wonder why, then, it is designed to detect memory allocation failures. My answer to that is: it wasn't my decision 😕, I'm not the original author. If you look at the Coveralls code coverage reports, you'll see that a majority of the unhandled branches are due to untested memory allocation failures. You'll also see some that are covered, because someone at one point decided to try something clever in the test cases to simulate memory allocation failures, which at least exercises some of the conditions. I do know, however, that there are some edge cases in c-ares where a memory allocation failure will never be propagated back up to the application and might cause a request to never trigger its callback, which could stall the calling application. These have been there forever (c-ares was forked in 2004 from ares, created in 1998), and no one has opened an issue report, likely because their application terminates on memory allocation failure the next time it needs additional memory.

Acceptable alternatives to abort()

There is really only one acceptable alternative to just calling abort(), but it doesn't eliminate the need completely, and it is application-specific.

First, it assumes you are using some sort of wrapper around malloc() (and therefore likely realloc() and free() as well); if not, you'd need to create one. Most applications already do this for various reasons, such as using alternative memory allocators, tracking the application's memory consumption, or zeroing out memory when it is no longer needed, for security reasons.
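
A minimal version of such a wrapper, with abort()-on-failure as its only policy, might look like this (the xmalloc/xfree names are just a common convention, not from any particular library):

    #include <stdio.h>
    #include <stdlib.h>

    void *xmalloc(size_t size)
    {
        void *p = malloc(size);
        if (p == NULL) {
            /* stderr is unbuffered, so this message usually gets out
             * even under memory pressure. */
            fprintf(stderr, "out of memory allocating %zu bytes\n", size);
            abort();
        }
        return p;
    }

    void xfree(void *p)
    {
        free(p);
    }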

Next, it assumes your application has a cache of some sort: data it has retrieved that is not critical to its operation and can simply be re-retrieved when needed. This cache could be quite large in some applications; a web browser, for example, might hold hundreds of MB of downloaded objects in a memory cache.

At this point, the suggestion should be obvious. In your malloc() wrapper, simply detect the out-of-memory condition, trigger the application to flush some portion of its cache, then retry the memory allocation. Rinse and repeat if there are more caches to flush, in some sort of priority order. Finally, when nothing is left to flush and memory allocation still fails, abort().
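
Here's a rough sketch of what that could look like, assuming the application registers cache-flush callbacks in priority order (all names here are hypothetical):

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_FLUSHERS 8

    /* A flusher frees some cached data; returns true if it freed anything. */
    typedef bool (*cache_flush_cb)(void);

    static cache_flush_cb flushers[MAX_FLUSHERS];
    static size_t         num_flushers = 0;

    /* Register flushers at startup, cheapest-to-flush first. */
    void register_cache_flusher(cache_flush_cb cb)
    {
        if (num_flushers < MAX_FLUSHERS)
            flushers[num_flushers++] = cb;
    }

    void *app_malloc(size_t size)
    {
        void  *p = malloc(size);
        size_t i = 0;

        /* On failure, flush caches in priority order and retry. */
        while (p == NULL && i < num_flushers) {
            if (flushers[i++]())
                p = malloc(size);
        }

        if (p == NULL) {
            fprintf(stderr, "out of memory, nothing left to flush\n");
            abort();
        }
        return p;
    }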

Critical (life & death) applications

Ok, you got me. Maybe there is an edge case like this, where the budget for validation testing is 100x greater than the budget for the code implementation itself. Most of these, however, are likely to be very scope-limited devices built on microcontrollers, not running full-stack operating systems, with a very small code footprint to audit. I'd also argue that in such an environment they can probably pre-calculate how much memory will be used and ensure they can never run out in the first place. Many embedded systems may not even support the concept of dynamic memory allocation (or virtual memory) in the first place.
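
For illustration, that embedded-style approach of pre-calculating the worst case and never allocating at runtime might look like this (the sizes are made up):

    #include <stddef.h>

    #define MAX_REQUESTS 16
    #define MAX_PAYLOAD  256

    struct request {
        unsigned char payload[MAX_PAYLOAD];
        size_t        len;
        int           in_use;
    };

    /* The entire pool is static; total memory use is known at link time. */
    static struct request request_pool[MAX_REQUESTS];

    struct request *request_acquire(void)
    {
        for (size_t i = 0; i < MAX_REQUESTS; i++) {
            if (!request_pool[i].in_use) {
                request_pool[i].in_use = 1;
                return &request_pool[i];
            }
        }
        return NULL; /* pool exhausted: a bounded, testable condition */
    }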

That said, looking at the track record for medical device vulnerabilities, I’m not sure checking for out of memory should be their top concern.

Conclusion

Some of you might take offense at this, call me an idiot for suggesting it, or maybe you really are doing it perfectly and testing every condition. But how much time did you spend doing that which could have gone toward enhancing your application for some real-world benefit? What percentage of your possible productivity was wasted? At the end of the day, we programmers need to address real-world problems in real-world timeframes. Don't let perfect be the enemy of good.
