Type inference for Haskell, part 11

Posted on January 11, 2019

This is part 11 of the series on type inference for Haskell.

Right now the type system doesn’t allow the definition of a binding to refer to itself. This means that you cannot have recursive functions. That’s a big problem in a functional language.

To allow for recursive functions, the rule for let bindings needs to change slightly. Remember that the general form of a let binding is let v1 = e1 in e2. Up till now, the rule has been that the variable v1 is not in scope in the expression e1. To allow recursion, we simply change it so that the variable is in scope in e1. Note that this means that e1 can no longer refer to a shadowed variable of the same name as v1.

That rule change means that when figuring out the type of the expression e1, we need the variable v1 to be in scope already. The variable’s type is not known yet, but we can still create an assumption it. The type will just start off as a new type variable. Once inferring the type of e1 is done, we’ll update the notion of what type v1 ends up being before we start to look at e2.

This approach isn’t perfect. It requires that the function always make the recursive call with the same type at which it was called. Normally this isn’t a problem. For example, if you define your own map function:

myMap _ []     = []
myMap f (x:xs) = f x : myMap f xs

and that function was called like myMap show [1,2,3], the type of myMap in that call would be (Int -> String) -> [Int] -> [String]. In the recursive call (the myMap f xs), the type of myMap is the same.

Where you run into a problem is when the recursive call has a different type. The code below uses a strange data structure definition to cause each recursive call to be at a different type. The fact that the next link is a CrazyList [a] instead of a CrazyList a means that each link’s data is nested one level deeper inside nested lists. This means that the recursive call (crazyListSize rest) is a different type than the first call was.

data CrazyList a = End | List a (CrazyList [a])

-- crazyListSize :: CrazyList a -> Int
crazyListSize End           = 0
crazyListSize (List _ rest) = 1 + crazyListSize rest

If you try to compile the code above in Haskell, it will fail. Type inference can’t handle that particular edge case. Fortunately, there is an easy work-around: you can make it work by explicitly declaring the type. The example will compile if you uncomment the type declaration above.

Dealing with explicitly typed bindings will be a problem for the next post.

There is actually still something missing. As defined right now, we cannot handle mutually recursive definitions. Fixing that requires that a single let expression be able to declare multiple variables. The definitions of the variables can then reference any of the introduced variables. We can now write expressions like:

let evens n = collatz (n `div` 2)

    odds n = collatz (3 * n + 1)

    collatz 1 = 1
    collatz n = 1 + (if n `mod` 2 == 0 then evens n else odds n)
in collatz 13

Above we looked at a problem caused by the fact that the recursive call has to be the same type as the original call of that function. That problem gets even worse when mutual recursion comes into play: the types of all the calls of a function have to match the type of the original function call. This could accidentally result in types that are far more restrictive than you wanted. Look at this (admittedly silly) example:

identity x = let unused = foo "bar" in x

foo s = "foo: " ++ (identity s)

That restriction causes the type of identity to be String -> String instead of a -> a because it is used as String -> String in the definition of foo. This only happens because the unused = foo "bar" bit causes the definitions to be mutually recursive.

To avoid getting overly restricted types for things we need to be careful to only treat bindings as mutually recursive when they actually are. That means that bindings need to be split up as much as possible while still keeping the mutually recursive bindings together. This requires a topological sort of the strongly connected components of the call graph of these variables. The strongly connected components are the groups of bindings that are mutually recursive. Topologically sorting them makes it so we don’t deal with a binding that references something we haven’t dealt with yet.

Our algorithm assumes this grouping and sorting has already done for us. The new structure of bindings will reflect this.

type Impl = (Id, Expr)
type BindGroup = [[Impl]]

An Impl is an implicitly typed (i.e. no type declaration) binding. The Id is the name of the variable.

In BindGroup, each item in the outer list is a mutually-recursive group (which might be just one binding if it isn’t involved in mutual recursion), and each item in the group is one binding.

The data structure for Let now references the BindGroup structure:

data Expr
  = ...
  | Let   BindGroup Expr

The inference logic references a function for finding the type of an entire group of bindings:

tiExpr as (Let binds e)    = do
  as' <- tiBindGroup as binds
  tiExpr (as' ++ as) e

tiBindGroup returns an assumption for each variable in as'. It works by repeatedly running tiImpls for each binding group. The tiSeq function is a helper function for looping over each binding group, collecting the added assumptions along the way. iss is the list of lists of implicitly typed bindings.

tiBindGroup :: Infer BindGroup [Assump]
tiBindGroup as iss = do
  tiSeq tiImpls as iss

The code for tiImpls looks more scary that it actually is. The first four lines create a new list of assumptions as' that contain the original assumptions plus the temporary assumptions for the variables being introduced. bs is the list of bindings in the group, as is the original assumptions, ts is a list of new type variables for the temporary assumptions, scs is those type variables converted to schemes, and is is the names of the variables introduced.

The middle portion gets the list of expressions the variables are bound to and (with the help of tiBinding), finds the type for all of them. sequence is a function provided by Haskell that runs a list of monadic expressions. Interestingly, we don’t use the result of that expression. The fact that tiBinding unifies the result of each expression with the respective type variable from ts means that our substitution gets updated with the real type.

That’s where the third part comes in. It gets that substitution and applies it to all the types in ts to get the real types ts'. It then figures out the type variables in the assumptions (after applying the substitution to them) and in the new bindings for the sake of finding the type variables to quantify over (gs). That list is the set of type variables that appear in any of the types of the variables being introduced and not in any of the existing bindings. That list is then used to quantify each type in ts' to get the type scheme for each variable introduced. The last line assembles the names and the schemes into assumptions.

tiImpls :: Infer [Impl] [Assump]
tiImpls as bs = do
  ts <- mapM (\_ -> newTVar) bs
  let scs = map toScheme ts
  let is = map fst bs
  let as' = (zipWith (:>:) is scs) ++ as

  let exprs = map snd bs
  _ <- sequence $ zipWith (tiBinding as') exprs ts

  s <- getSubst
  let ts' = apply s ts
  let fs = tv (apply s as)
  let vss = map tv ts'
  let gs = foldr1 union vss \\ fs
  let scs' = map (quantify gs) ts'
  return $ zipWith (:>:) is scs'

tiBinding :: [Assump] -> Expr -> Type -> TI ()
tiBinding as e t = do
  t' <- tiExpr as e
  unify t t'

Here’s the implementation of tiSeq. It runs the inference function ti for each element in the list, adding the resulting assumptions to the set of assumptions before moving on to the next element in the list.

tiSeq :: Infer bg [Assump] -> Infer [bg] [Assump]
tiSeq _  _  []       =
  return []
tiSeq ti as (bs:bss) = do
  as'  <- ti as bs
  as'' <- tiSeq ti (as' ++ as) bss
  return $ as'' ++ as'

It is now also convenient to make the top-level of type inference take a whole binding group instead of a single expression. That way, you can find out the types of a whole list of bindings.

type Program = [BindGroup]

tiProgram :: [Assump] -> Program -> [Assump]
tiProgram as bgs = runTI $ do
  as' <- tiSeq tiBindGroup as bgs
  s <- getSubst
  return (apply s as')

And that’s it! Here’s the finished code.

Now that the type system has let polymorphism and recursive bindings, it is a fairly useful one. That said, it still has a long way to go until it matches Haskell’s. There are a few more small features left to add before moving on to the huge feature, type classes.