The Problem with Null
Tony Hoare called it his “billion-dollar mistake.” What’s so bad about null?
The programming languages that have null are all languages with references[1]. References are an important feature for programming languages to have because they allow using a large piece of data in multiple places without the cost of copying it each time. Programs also tend to use references to share state that changes. When one part of the code changes a value, everything else with a reference to that value sees the new state[2]. The correct behavior of programs often depends on this feature.
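To make the sharing point concrete, here is a minimal Rust sketch (Rust is used for illustration here and below, since it comes up again later). It uses `Rc` and `RefCell`, one of Rust's ways of expressing shared mutable state: two handles refer to the same value, and a change made through one is visible through the other.

```rust
use std::cell::RefCell;
use std::rc::Rc;

fn main() {
    // Two handles to the same underlying value; no copy is made.
    let shared = Rc::new(RefCell::new(vec![1, 2, 3]));
    let other_handle = Rc::clone(&shared);

    // Mutating through one handle...
    shared.borrow_mut().push(4);

    // ...is visible through the other, because both refer to the same data.
    assert_eq!(*other_handle.borrow(), vec![1, 2, 3, 4]);
}
```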
A null value in a reference is used to indicate the lack of any value - it is a reference that refers to nothing. This is handy for indicating that there may or may not be a value. The problem is that you often get this feature whether you want it or not. Any time you use a reference, even if the reference should always point to a value, the language will allow it to be null. In theory, it would be possible to check for null values any time a reference is used, but this would be extremely tedious. No one would actually write code like that. As a result, code in these languages is very vulnerable to null pointer bugs.
Sometimes you do want to indicate that a value may or may not be there. The value may be small enough to copy (perhaps even smaller than the size of a pointer!) and you might never intend to share changes to the value, so you don’t need the other properties of a reference. Unfortunately, languages with references often fail to provide any other convenient way to indicate that a value is optional. Using a reference any time a value is optional can be very wasteful. Some code might use a 64-bit pointer to a one-byte value just for the sake of indicating that the byte might not be set. This can incur memory management overhead, cause memory fragmentation, and increase the miss rate in the CPU cache.
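As a rough illustration of the size difference, here is a small Rust sketch contrasting an optional byte stored inline with the pointer-based representation described above. The exact numbers assume a typical 64-bit target.

```rust
use std::mem::size_of;

fn main() {
    // An optional byte stored inline: one byte for the value,
    // one byte for the "is it present?" discriminant.
    println!("Option<u8>:      {} bytes", size_of::<Option<u8>>());

    // The same optional byte behind a pointer: the pointer alone is
    // 8 bytes on a 64-bit target (None is represented as a null
    // pointer), before counting the heap allocation it points at.
    println!("Option<Box<u8>>: {} bytes", size_of::<Option<Box<u8>>>());
}
```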
So then, the problem with null is that it might show up where you really don’t want it, and when you do want the option of using it, it is often inefficient. It only makes sense when you both want to indicate that a value is optional AND want the other properties of a reference.
Modern[3] languages are finally starting to fix this issue. In Rust, for example, the Option type is used to indicate whether a value is present or absent. Use of this type adds minimal overhead. Rust also requires that references are never null. When you do actually want the properties of both references and optionality, you must combine the Option type with a reference.
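Here is a short sketch of that combination, with a hypothetical find_first_even function standing in for any lookup that might come up empty. The compiler forces both cases to be handled, and because Rust guarantees the null-pointer optimization for Option around a reference, the optional reference is the same size as a plain reference.

```rust
use std::mem::size_of;

// A hypothetical lookup: the caller may or may not get a reference back.
fn find_first_even(items: &[i32]) -> Option<&i32> {
    items.iter().find(|&&n| n % 2 == 0)
}

fn main() {
    let numbers = vec![1, 3, 4, 7];

    // Both cases must be handled; there is no way to "forget the null
    // check" the way there is with a nullable reference.
    match find_first_even(&numbers) {
        Some(n) => println!("found {}", n),
        None => println!("no even number"),
    }

    // The optional reference costs nothing extra: None is stored as the
    // (otherwise impossible) null pointer value.
    assert_eq!(size_of::<Option<&i32>>(), size_of::<&i32>());
}
```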
I should note that in SQL databases, NULL has neither of these problems. You can have references (foreign keys) without being forced to allow NULL values. When you do want to make a column optional by allowing NULL values, it doesn’t add much overhead. As I understand it, most database engines add only a single bit to the size of each row when allowing a column to be nullable. Using NULL in computations doesn’t result in an error; it just causes the computation to also result in NULL.
[1] Or pointers. I’ll use the term references here to refer to either.
[2] At first glance, this may appear to not apply to languages like Haskell where values are immutable, but it actually still can. In Haskell in particular, the value may be lazily computed, and sharing the reference means that it only needs to be computed the first time one of the pieces of code holding the reference needs the value.
[3] ML is getting close to half a century old, yet new languages still sometimes miss the lessons it teaches. (I’m looking at you, Golang.)