Type inference for Haskell, part 11
This is part 11 of the series on type inference for Haskell.
Right now the type system doesn’t allow the definition of a binding to refer to itself. This means that you cannot have recursive functions. That’s a big problem in a functional language.
To allow for recursive functions, the rule for let bindings needs to change slightly. Remember that the general form of a let binding is
let v1 = e1 in e2. Up till now, the rule has been that the variable v1 is not in scope in the expression e1. To allow recursion, we simply change it so that the variable is in scope in e1. Note that this means that e1 can no longer refer to a shadowed variable of the same name as v1.
That rule change means that when figuring out the type of the expression e1, we need the variable v1 to be in scope already. The variable’s type is not known yet, but we can still create an assumption it. The type will just start off as a new type variable. Once inferring the type of e1 is done, we’ll update the notion of what type v1 ends up being before we start to look at e2.
This approach isn’t perfect. It requires that the function always make the recursive call with the same type at which it was called. Normally this isn’t a problem. For example, if you define your own map function:
and that function was called like
myMap show [1,2,3], the type of
myMap in that call would be
(Int -> String) -> [Int] -> [String]. In the recursive call (the
myMap f xs), the type of
myMap is the same.
Where you run into a problem is when the recursive call has a different type. The code below uses a strange data structure definition to cause each recursive call to be at a different type. The fact that the next link is a
CrazyList [a] instead of a
CrazyList a means that each link’s data is nested one level deeper inside nested lists. This means that the recursive call (
crazyListSize rest) is a different type than the first call was.
If you try to compile the code above in Haskell, it will fail. Type inference can’t handle that particular edge case. Fortunately, there is an easy work-around: you can make it work by explicitly declaring the type. The example will compile if you uncomment the type declaration above.
Dealing with explicitly typed bindings will be a problem for the next post.
There is actually still something missing. As defined right now, we cannot handle mutually recursive definitions. Fixing that requires that a single let expression be able to declare multiple variables. The definitions of the variables can then reference any of the introduced variables. We can now write expressions like:
let evens n = collatz (n `div` 2) odds n = collatz (3 * n + 1) collatz 1 = 1 collatz n = 1 + (if n `mod` 2 == 0 then evens n else odds n) in collatz 13
Above we looked at a problem caused by the fact that the recursive call has to be the same type as the original call of that function. That problem gets even worse when mutual recursion comes into play: the types of all the calls of a function have to match the type of the original function call. This could accidentally result in types that are far more restrictive than you wanted. Look at this (admittedly silly) example:
That restriction causes the type of
identity to be
String -> String instead of
a -> a because it is used as
String -> String in the definition of
foo. This only happens because the
unused = foo "bar" bit causes the definitions to be mutually recursive.
To avoid getting overly restricted types for things we need to be careful to only treat bindings as mutually recursive when they actually are. That means that bindings need to be split up as much as possible while still keeping the mutually recursive bindings together. This requires a topological sort of the strongly connected components of the call graph of these variables. The strongly connected components are the groups of bindings that are mutually recursive. Topologically sorting them makes it so we don’t deal with a binding that references something we haven’t dealt with yet.
Our algorithm assumes this grouping and sorting has already done for us. The new structure of bindings will reflect this.
Impl is an implicitly typed (i.e. no type declaration) binding. The
Id is the name of the variable.
BindGroup, each item in the outer list is a mutually-recursive group (which might be just one binding if it isn’t involved in mutual recursion), and each item in the group is one binding.
The data structure for
Let now references the
The inference logic references a function for finding the type of an entire group of bindings:
tiBindGroup returns an assumption for each variable in
as'. It works by repeatedly running
tiImpls for each binding group. The
tiSeq function is a helper function for looping over each binding group, collecting the added assumptions along the way.
iss is the list of lists of implicitly typed bindings.
The code for
tiImpls looks more scary that it actually is. The first four lines create a new list of assumptions
as' that contain the original assumptions plus the temporary assumptions for the variables being introduced.
bs is the list of bindings in the group,
as is the original assumptions,
ts is a list of new type variables for the temporary assumptions,
scs is those type variables converted to schemes, and
is is the names of the variables introduced.
The middle portion gets the list of expressions the variables are bound to and (with the help of
tiBinding), finds the type for all of them.
sequence is a function provided by Haskell that runs a list of monadic expressions. Interestingly, we don’t use the result of that expression. The fact that
tiBinding unifies the result of each expression with the respective type variable from
ts means that our substitution gets updated with the real type.
That’s where the third part comes in. It gets that substitution and applies it to all the types in
ts to get the real types
ts'. It then figures out the type variables in the assumptions (after applying the substitution to them) and in the new bindings for the sake of finding the type variables to quantify over (
gs). That list is the set of type variables that appear in any of the types of the variables being introduced and not in any of the existing bindings. That list is then used to quantify each type in
ts' to get the type scheme for each variable introduced. The last line assembles the names and the schemes into assumptions.
tiImpls :: Infer [Impl] [Assump] tiImpls as bs = do ts <- mapM (\_ -> newTVar) bs let scs = map toScheme ts let is = map fst bs let as' = (zipWith (:>:) is scs) ++ as let exprs = map snd bs _ <- sequence $ zipWith (tiBinding as') exprs ts s <- getSubst let ts' = apply s ts let fs = tv (apply s as) let vss = map tv ts' let gs = foldr1 union vss \\ fs let scs' = map (quantify gs) ts' return $ zipWith (:>:) is scs' tiBinding :: [Assump] -> Expr -> Type -> TI () tiBinding as e t = do t' <- tiExpr as e unify t t'
Here’s the implementation of
tiSeq. It runs the inference function
ti for each element in the list, adding the resulting assumptions to the set of assumptions before moving on to the next element in the list.
It is now also convenient to make the top-level of type inference take a whole binding group instead of a single expression. That way, you can find out the types of a whole list of bindings.
And that’s it! Here’s the finished code.
Now that the type system has let polymorphism and recursive bindings, it is a fairly useful one. That said, it still has a long way to go until it matches Haskell’s. There are a few more small features left to add before moving on to the huge feature, type classes.