C Is Great, Just Not All of It

Overview

C has obviously been around a very long time, starting in 1972 and eventually becoming standardized by ANSI in 1989 (and then ISO in 1990).  This initial standardized version is commonly referred to as C89.  In this article, I plan on discussing only issues present in C89 and later.

The C language feature releases since C89 have been C99 and C11.  The C17/C18 standard only corrects defects in C11.

Many libraries still treat C89 as the lowest common denominator and refuse to use newer language features for fear of breaking compatibility with legacy systems.  A lot of this concern stemmed from one of the major compiler vendors (Microsoft) not providing full C99 conformance until their Visual Studio 2015 release.  Luckily, as of September 14, 2020, Microsoft has announced they are adding C11/C17 support in their next release, taking only half as long as it took them to add C99 support!

In this post, rather than discussing things missing from the language as per my previous blog post, I plan on focusing on features that should NOT have been added in the first place.

Mixed code and declarations

In C99, a “feature” was added to allow variable declarations anywhere within the body of a function, thus intermingling declarations and code.  Prior to this, all declarations were required to be at the start of a block.

Allowing this leads to very messy code and also makes it harder to track which variables are actually declared.  It can also make memory leaks more likely by association.  Presumably the argument for allowing mixed code and declarations is that people have a hard time tracking variable data types due to their code structure.  That’s a false argument; it means the coder is not properly modularizing their code.  It might also mean they’re under the misconception that declarations may only reside at the beginning of a function, rather than the reality that they can appear at the beginning of any code block, such as an if statement or loop.
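
To illustrate that last point, here is a hypothetical helper (the name and logic are made up for this example) showing that even C89 allows a declaration at the start of any block, such as a loop body:

int count_positive(const int *vals, int len)
{
    int count = 0;
    int i;

    for (i = 0; i < len; i++) {
        int v = vals[i];  /* legal in C89: declared at the start of the loop body block */

        if (v > 0)
            count++;
    }

    return count;
}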

I’m a big fan of languages enforcing proper methods and code structure so people must do things the right way.  It makes it a lot easier for others who come along later to maintain the code and understand what is going on.  A good example from another language is Python, with its forced indentation as part of the language.

Now there is one exception to this rule that I think should be allowed, and that is a declaration within a for loop initializer, such as  for (int i=0; i<len; i++) { }. A for loop is so specialized anyhow that adding such an exception shouldn’t be difficult.

VLAs

Introduced with C99, VLAs are Variable Length Arrays, meaning the size of a stack-allocated array can be determined at runtime, for example from a variable passed to a function.

float read_and_process(int n)
{
    float vals[n];  /* VLA: sized at runtime from the caller-supplied n */

    for (int i = 0; i < n; ++i)
        vals[i] = read_val();

    return process(n, vals);
}

Prior to VLAs, someone with an actual need to do this could use the alloca() function to force allocation on the stack (though I truthfully believe this is bad form as well).  The advantage of stack-allocated variables is that they are scoped to the current code block or function, and there is no chance of memory leaks since the memory is reclaimed automatically as the stack unwinds.
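
For reference, here is a rough sketch of what the alloca() approach looks like (assuming a glibc-style alloca.h; read_val() and process() are the same assumed helpers from the example above):

#include <alloca.h>  /* non-standard; provided by glibc and most Unix-like systems */

float read_and_process_alloca(int n)
{
    /* Stack allocation without a VLA; the memory is reclaimed automatically
       when the function returns.  Like a VLA, this still blows the stack if
       n is too large. */
    float *vals = alloca((size_t)n * sizeof(*vals));

    for (int i = 0; i < n; ++i)
        vals[i] = read_val();

    return process(n, vals);
}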

The real problem, however, is that you start breaking all sorts of portability guarantees when you make large allocations on the stack.  There are many embedded systems with very small stack sizes (e.g. 8K), and it is very difficult to effectively estimate stack usage.  If there is insufficient stack space for the requested size, your program will simply terminate, with no way to detect or recover from the failure.

It’s really not that hard as a programmer to simply allocate the memory dynamically and clean it up.  If that is not feasible or desirable, it’s much more sensible to determine the maximum array size you might need and always declare the array at that size.  Either way, you’re less likely to hit unforeseen runtime issues like blowing your stack.
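
Here is a heap-based sketch of the earlier example to show what I mean (read_val() and process() are the same assumed helpers as before, and the error value returned on allocation failure is arbitrary for this illustration):

#include <stdlib.h>

float read_and_process_heap(int n)
{
    float  result;
    float *vals = malloc((size_t)n * sizeof(*vals));

    if (vals == NULL)
        return -1.0f;  /* allocation failure is detectable and recoverable, unlike a blown stack */

    for (int i = 0; i < n; ++i)
        vals[i] = read_val();

    result = process(n, vals);
    free(vals);

    return result;
}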

Usage of VLAs also generates much more code, and much slower code.  There’s simply no benefit, and they cause confusion among novice programmers who haven’t yet learned the C memory model.

Luckily, as of C11, VLAs are an optional feature that compiler vendors are not required to implement, so they are not guaranteed to be available.  Unfortunately, since virtually all compilers already supported C99, they all support VLAs, and people will continue to use them.

Comma Operator

The comma operator, where do I begin?  Perhaps examples are the best way to start…

int a, b=10, c, d=11;

That makes sense and is used liberally throughout all of C, so what’s wrong with it?  Oh yeah, that’s technically considered a comma separator, not a comma operator.  I don’t like that form either, but we’ll discuss it later.

How about this?

int a = (1, 2, 3);

What does that do?  Oh, a is 3, that makes perfect sense.  And what about this?

printf("hello\n"), printf("goodbye\n");

Clearly that must make more sense than

printf("hello\n");
printf("goodbye\n");

Right?

The comma operator was originally designed to allow neat “tricks” with macros: by sequencing several expressions and yielding the value of the last one, a macro could appear to return a value and act more like a real function.  Maybe that made sense when C was first created and the systems running the code were incredibly slow, but these days function calls are incredibly cheap.  Modern CPUs can process tens or hundreds of billions of operations per second, so a few extra operations of function call overhead are mostly meaningless.  For cases where the function call overhead really does matter, C99 added true inline support to the standard.  That said, I’m not convinced the comma operator was ever actually worthwhile.
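
As an illustration of the kind of macro trick I mean (a contrived example, not something I’d recommend):

#include <stdio.h>
#include <stdlib.h>

/* The comma operator lets this macro both free the pointer and yield a value
   (the new NULL pointer), so it can be used inside a larger expression. */
#define FREE_AND_CLEAR(p) (free(p), (p) = NULL)

int main(void)
{
    char *buf = malloc(32);

    if (FREE_AND_CLEAR(buf) == NULL)  /* "works", but an inline function would be clearer */
        printf("buf is now NULL\n");

    return 0;
}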

If you look for the most common uses of the comma operator, you’re likely to come across The International Obfuscated C Code Contest.  Yes, that’s where the comma operator shows its strength: obfuscating code.

The comma operator really has a very limited use case, and there’s always going to be a clearer, more concise way to accomplish the same thing.

Having said that, I want to loop back around to the comma when used as a separator in variable declarations (other uses of comma separators, such as array initializers, make much more sense).  In the initial example, I showed multiple variables being declared with the same data type, some with initial value assignments, some without.  My issue with this is that the only problem it really solves is condensing the size of the code slightly, which reduces readability.  Variable names tend to get lost, and it encourages sprawl in the total number of variables.  It may also discourage code modularization (e.g. creating reusable helper functions) because it makes the code harder to quantify.

My personal preference is to have one variable declaration per line, aligning variable names and assignments to improve readability. E.g.:

int         a;
int         b   = 10;
int         c;
int         d   = 11;
const char *foo = NULL;
uint64_t    e   = 0x0123456789ABCDEF;

It’s much easier to see when you might need to split code into multiple functions just based on the number of variables you are tracking in one code block, leading to more readable and more auditable code.

For the love of god, define the sign of char

Most programmers stuck in the Intel world don’t even realize this is a thing.  But it’s true: the sign of char is implementation defined.  On Intel-based systems, it is usually signed by default (which in my opinion makes the most sense, since it aligns with the normal integer types).  But if you’ve ever compiled on an Arm-based system, the default is usually unsigned.  Why?  Because of legacy performance reasons that aren’t relevant at all today and probably shouldn’t have mattered anyhow, as they don’t apply to any language higher level than Assembly.  It’s completely asinine.  All other integer data types default to signed.  And yes, char really is an integer data type with a [quite] limited range.

I am a big fan of turning compiler warnings all the way up and ensuring every bit of code is warning free.  But let’s take this range check to validate that a character falls within the ASCII range:

bool is_ascii(char c)
{
  if (c >= 0 && c <= 127) 
    return true;
  return false;
}

If char is signed, you’ll get a warning stating c <= 127 is always true; if char is unsigned, you’ll get a warning stating c >= 0 is always true.  Fun.

The proper solution is to never use char directly and instead use the appropriate typedef of int8_t or uint8_t based on your needs, which of course no one does.
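
As a minimal sketch of that approach, here is the earlier check rewritten to take a uint8_t, which behaves the same no matter how the platform defines char (assuming the caller passes raw byte values):

#include <stdbool.h>
#include <stdint.h>

bool is_ascii(uint8_t c)
{
  /* uint8_t has a well-defined range of 0-255, so the check is unambiguous
     regardless of the platform's char signedness and triggers no
     always-true warnings. */
  return c <= 127;
}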

Not knowing about this can also cause all sorts of obscure, unintended bugs, simply because the original programmer didn’t know about the issue or didn’t care enough to address it.

Conclusion

Those are just a few of my complaints about C features that I believe need to be removed as the language matures.  I don’t think it’s bad to deprecate legacy features (or bad decisions) and eventually remove them from a language completely when there is no longer a strong use case.  Sure, compilers may still have modes that can be used to compile legacy code, but developers shouldn’t ever expect their code to run forever.

The computing landscape has come a long way in the last 50 years.  C often gets a bad reputation, and maybe deservedly so if it is not willing to adapt with the times.  Security should be at the forefront; coders should be encouraged to write clear and auditable code over clever tricks and micro-optimizations.

For more discussion on modernization of C, see my prior blog post, C is Great, But Needs Modernization.