The Problem with Null
Tony Hoare called it his “billion-dollar mistake.” What’s so bad about null?

The programming languages that have null are all languages with references¹. References are an important feature for programming languages to have because they allow using a large piece of data in multiple places without the cost of copying it each time. Programs also tend to use references to share state that changes. When one part of the code changes a value, everything else with a reference to that value sees the new state². The correct behavior of programs often depends on this feature.

A null value in a reference indicates the lack of any value: it is a reference that refers to nothing. This is handy for indicating that there may or may not be a value. The problem is that you often get this feature whether you want it or not. Any time you use a reference, even if the reference should always point to a value, the language will allow it to be null. In theory, it would be possible to check for null every time a reference is used, but this would be extremely tedious, and no one actually writes code like that. As a result, code in these languages is very vulnerable to null pointer bugs.
Sometimes you do want to indicate that a value may or may not be there. The value may be small enough to copy (perhaps even smaller than the size of a pointer!) and you might never intend to share changes to the value, so you don’t need the other properties of a reference. Unfortunately, languages with references often fail to provide any other convenient way to indicate that a value is optional. Using a reference every time a value is optional can be very wasteful. Some code might use a 64-bit pointer to a one-byte value just for the sake of indicating that the byte might not be set. This can incur memory management overhead, cause memory fragmentation, and increase the miss rate in the CPU cache.
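The size mismatch is easy to verify. A minimal Rust sketch (assuming only the standard `std::mem::size_of` function) shows that a heap-allocated nullable handle to a byte costs a full machine word before you even count the allocation itself:

```rust
use std::mem::size_of;

fn main() {
    // The byte itself is one byte.
    assert_eq!(size_of::<u8>(), 1);

    // A heap reference to that byte is a full machine word
    // (8 bytes on a 64-bit target), plus allocator overhead
    // and a likely cache miss on every access.
    assert_eq!(size_of::<Box<u8>>(), size_of::<usize>());
}
```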
So then, the problem with null is that it might show up where you really don’t want it, and when you do want the option of using it, it is often inefficient. It only makes sense when you both want to indicate that a value is optional AND want the other properties of a reference.

Modern³ languages are finally starting to fix this issue. In Rust, for example, the Option type is used to indicate whether a value is present or absent, and using it adds very minimal overhead. Rust also requires that references are never null. When you do actually want the properties of both references and optionality, you combine the Option type with a reference.
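A small sketch of how this plays out in Rust (the `lookup` function is a hypothetical example; the sizes are standard for a 64-bit target, and the same-size guarantee for `Option<&T>` is documented by the language):

```rust
use std::mem::size_of;

// Returning Option<u8> makes "might be absent" part of the type,
// with no reference and no heap allocation involved.
fn lookup(table: &[u8], index: usize) -> Option<u8> {
    table.get(index).copied()
}

fn main() {
    let table = [10u8, 20, 30];

    // The caller is forced to handle the absent case explicitly;
    // there is no way to accidentally dereference "nothing".
    match lookup(&table, 5) {
        Some(value) => println!("found {value}"),
        None => println!("out of range"),
    }

    // An optional byte costs one extra byte for the discriminant.
    assert_eq!(size_of::<Option<u8>>(), 2);

    // Combining Option with a reference costs nothing: since a
    // reference can never be null, the compiler uses the forbidden
    // all-zeros bit pattern to represent None.
    assert_eq!(size_of::<Option<&u8>>(), size_of::<&u8>());
}
```

The last assertion is the interesting one: you only pay for optionality when you ask for it, and when you do want both optionality and a reference, `Option<&T>` compiles down to exactly what a nullable pointer would have been.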
I should note that in SQL databases, NULL has neither of these problems. You can have references (foreign keys) without being forced to allow NULL values. When you do want to make a column optional by allowing NULL values, it doesn’t add much overhead: as I understand it, most database engines add only a single bit to the size of each row when allowing a column to be nullable. Using NULL in computations doesn’t result in an error, it just causes the computation to also result in NULL.
¹ Or pointers. I’ll use the term references here to refer to either.

² At first glance, this may appear not to apply to languages like Haskell, where values are immutable, but it actually still can. In Haskell in particular, the value may be lazily computed, and sharing the reference means that it only needs to be computed the first time one of the pieces of code holding the reference needs the value.

³ ML is getting close to half a century old, yet new languages still sometimes miss the lessons it teaches. (I’m looking at you, Golang.)