\chapter{Annotations in MultiRec} \label{sec:generic} In this chapter we will reimplement the annotation framework of the previous chapter in the generic programming library \emph{MultiRec} \cite{rodriguez2008mutrec}. In a sense, MultiRec is more generic than the open recursion approach we took in the previous chapter, offering more flexibility but also more complexity. We will see why in a moment. Before we look at MultiRec in particular, let us have a short introduction to generic programming in general: what is it and why would you implement the same functionality in multiple generic libraries? \section{Introduction to generic programming} The report \emph{Comparing Libraries for Generic Programming in Haskell} \cite{rodriguez08comparing} gives a very nice explanation of the term \emph{generic programming}: \begin{quote} Software development often consists of designing a datatype to which functionality is added. Some functionality is datatype specific. Other functionality is defined on almost all datatypes, and only depends on the structure of the datatype; this is called datatype-generic functionality. Examples of such functionality are comparing two values for equality, searching a value of a datatype for occurrences of a particular string or other value, editing a value, pretty-printing a value, etc. \end{quote} In Haskell 98, datatype genericity is available for a limited number of functions as |deriving| clauses following datatypes.\footnote{See the Haskell 98 Report, section 10 ``Specification of Derived Instances''.} The clause |deriving Eq| following the definition of |data BareExpr|, for example, makes available an operator |(==) :: BareExpr -> BareExpr -> Bool| for comparing two values of type |BareExpr|. To implement such `deriving' behavior for a custom type class, one would have to extend the compiler and analyze the datatype in question. Because this is often a lot of tedious work, programmers have looked for suitable abstractions that make programming this generic behavior over datatypes easier, by providing different interfaces to this low-level view of datatypes: thus generic programming was born. Because these interfaces are often less flexible, they can be made substantially simpler and easier to use. Of all these interfaces, Template Haskell \cite{sheard02template} is one of the most powerful because it aims to offer a one-to-one mapping between the syntax of a Haskell datatype and runtime values describing such a datatype. It allows for generic programming because---among many other things---it lets the programmer write a function that takes a datatype as input and produces some function declarations. This function is evaluated at compile-time, and the produced declarations are then compiled into the final program. This is efficient: there is some overhead at compile-time, but the generated code is as fast as if it had been written by hand. Template Haskell has a few disadvantages, however. Because it is so powerful, it is also reasonably verbose and even simple generic functions easily span many lines. In fact, there is so much freedom that it is possible to build type-incorrect or otherwise nonsensical expressions and declarations. Errors in the resulting expressions are caught and reported at compile-time, but because they concern generated code, it is very difficult to give accurate position information about where those errors occurred.\footnote{Something we obviously value very much in this thesis.
:-)} That the errors often concern generated names complicates matters even further. So when code generated by Template Haskell is compiled, the compiler will not fail during the parsing phase, but it might during the type-checking phase. This did not sit well with some programmers: Haskell is known for its powerful type system; was it not possible to create an approach to generic programming directly in Haskell, as a library, so that it is necessarily type-safe? And so the search for good generic programming strategies continued. Some of the resulting generic libraries are Scrap Your Boilerplate (SYB) \cite{laemmel05syb}, Uniplate \cite{Mitchell:2007:uniplate}, Extensible and Modular Generics for the Masses (EMGM) \cite{Oliveira_Hinze_Loh:2006EMGM} and the various |compos| operators \cite{bringert06pattern}. These approaches express their generic functionality in terms of type classes; generic programs then depend on the existence of type class instances for the types they are expected to work on. The programmer can write instances for these types by hand, but most of these libraries offer a way to bootstrap the generic functionality. SYB, for example, allows instances to be created by a |deriving| clause through a compiler extension. Uniplate, in turn, has an implementation in terms of SYB. So which of these libraries should you choose for your own generic program? The differences between these libraries lie mainly in how they deal with families of datatypes, fields of different types, lists (or other collections) of children, type arguments (if at all) and information about constructors such as their name and---if they are operators---their fixity. None of these libraries, however, allows the user to generically change the \emph{shape} of a datatype. Their functionality is restricted to \emph{queries} that compute some value over a tree (like a fold) or \emph{transformations} that allow rewriting tree terms. In section \ref{sec:fixedpoint} we saw how we can write a datatype in such a way that we can later decide what type the children will have by introducing an extra type argument. This added flexibility turned out to be crucial in adapting a datatype to hold position information. Although the degree to which |ExprF| exposes its structure is very useful, it is still very limited, from a generic programmer's point of view. For example, we cannot generically count the number of constructors of a datatype, or count the number of fields in a specific constructor, or generate members of a datatype, or check whether two types are equal. In the context of type-indexed datatypes there is also more structure to be exposed. Zippers, for example, cannot be expressed in terms of a base functor. To divulge more of a type's structure, we can model it using a fixed set of building blocks. All datatypes follow the same pattern: they have a number of constructors and each constructor has a number of fields. This way of modeling datatypes is called \emph{sums of products}, because of the way the number of values of a datatype adds up: a single constructor has zero or more fields. Each of these fields individually has a number of values it can take on. The constructor can be initialized with any combination of these values, so the number of possible instantiations equals the product of each field's number of values. A datatype consists of several constructors. To compute the number of values in a datatype, we take the numbers of values of each constructor and add these up. For example, a datatype with one constructor holding two |Bool| fields and one nullary constructor has $2 \times 2 + 1 = 5$ distinct values.
In this sense, a datatype is a sum (the constructors) of products (the fields of the constructors). Using sums of products is very interesting for our application because it too, just like the approach in the previous chapter, allows the use of fixed points and thus the insertion of annotations at every level. ASTs often comprise several datatypes where one datatype refers to another. Sometimes they are even mutually recursive. Until recently it was unknown how to model such a \emph{family} of datatypes using sums and products without knowing in advance how many datatypes the family would consist of. MultiRec, however, offers a solution to this problem in a type-safe way. Here by `type-safe' we mean that it is impossible for the sums-of-products model of the datatypes to create trees which cannot be written using the original datatypes (i.e. they are completely isomorphic, up to $\bot$) and that no `unsafe' functions (such as |unsafeCoerce|) are used anywhere in the implementation. This is why MultiRec is an interesting generic programming library for our problem: it allows for families of possibly mutually recursive datatypes, exposes a lot of information about the datatypes using sums of products, and uses fixpoints to boot! Let us take a look at some concrete examples. \section{Pattern functors} So far we have used arithmetic expressions as our running example. To exploit MultiRec's capability of working with families of datatypes, we will modify the example to use an extra datatype: \begin{code} data Expr = EAdd Expr Expr | EMul Expr Expr | ETup Expr Expr | EIntLit Int | ETyped Expr Type deriving (Eq, Show) data Type = TyInt | TyTup Type Type deriving (Eq, Show) \end{code} We have gotten rid of the constructors for division and subtraction and in their place have two other constructors: one for creating tuples of expressions and one for expressions with type signatures. The latter uses a separate type |Type| to describe expression types. The tuple constructors of both |Expr| and |Type| allow arbitrary nesting, e.g.~|(1, (2, 3))| and |(Int, (Int, Int))| can be modeled as |ETup (EIntLit 1) (ETup (EIntLit 2) (EIntLit 3))| and |TyTup TyInt (TyTup TyInt TyInt)|, respectively. The type signatures introduce the possibility of type errors, so that we have reason to type-check expressions and possibly report errors later when we look at MultiRec's version of error catamorphisms. Let us translate these datatypes to sums of products. We will use \texttt{multirec-0.4}, available from Hackage\footnote{\url{http://hackage.haskell.org/package/multirec-0.4}}: \begin{code} type PF_Expr = I Expr :*: I Expr :*: U :+: I Expr :*: I Expr :*: U :+: I Expr :*: I Expr :*: U :+: K Int :*: U :+: I Expr :*: I Type :*: U type PF_Type = U :+: I Type :*: I Type :*: U type PF_Tuples = PF_Expr :>: Expr :+: PF_Type :>: Type \end{code} Here the |Tuples| in |PF_Tuples| is the name we have given to our family of datatypes, comprising |Expr| and |Type|. The sums-of-products version of a family of datatypes is called the family's \emph{pattern functor}; that is what |PF| stands for. The pattern functors of the two types are written separately as |PF_Expr| and |PF_Type|. Their structure mirrors the structure of the corresponding datatypes directly. The two pattern functors are then tagged and combined in |PF_Tuples|.
In all, this type is pretty large but consists only of six distinct building blocks: \begin{itemize} \item |:+:| is used to denote choice (sums), be it between datatypes of the family or constructors of a particular datatype. \item |:*:| can be seen as the cons in a constructor's list of fields, forming the products. Just like the cons operator |:| on lists, this operator associates to the right. \item |I xi| is used to indicate recursion into type |xi| of the family in question. It always appears on the left-hand side of a |:*:|. \item |K a| (for \emph{k}onstant) is used when a field has a type that falls outside the family. In our case, constants cannot be selected and will not be annotated with position information. They always appear on the left-hand side of a |:*:|. \item |U| is the |[]| of the list of fields. \footnote{\label{foot:typelist}In the MultiRec paper, |U| is only used to describe constructors with zero fields and does not appear in pattern functors of constructors that do have fields. However, treating constructors consistently as cons lists allows for more elegant type functions later on.} \item |pf :>: ix| tags the pattern functor |pf| as representing datatype |ix|. \end{itemize} Using these building blocks, |PF_Tuples| completely mirrors the structures of the |Expr| and |Type| datatypes. The building blocks' definitions are chosen such that the pattern functor and |Either Expr Type| are isomorphic, i.e.~values of one type can be translated to corresponding values of the other type. This fact is reflected by MultiRec's |Fam| type class: \begin{code} class Fam phi where from :: phi ix -> ix -> PF phi I0 ix to :: phi ix -> PF phi I0 ix -> ix \end{code} The first function, |from|, translates from values of the original two datatypes to the pattern functor versions, while the second function, |to|, takes care of the reverse. These functions' types introduce a few new things. The first of these is their first argument, |phi ix|. In our example |phi| is the family |Tuples|: \begin{code} data Tuples :: * -> * where Expr :: Tuples Expr Type :: Tuples Type \end{code} Type |Tuples| is a generalized algebraic datatype (GADT) that enumerates all the datatypes in the family, with one constructor for each datatype. Each constructor indexes |Tuples| with the type it represents. When |from| pattern matches on one of these constructors, the polymorphic type argument |ix| is refined to a specific, concrete datatype, after which case analysis can be done on the second argument to produce a value of the pattern functor. The dual |to| also receives a |Tuples| value, after which it can pattern match on a value of the pattern functor to produce a value of one of the original types. MultiRec abstracts over specific pattern functors through a type synonym family |PF|, which appears in the type signatures of |from| and |to|: \begin{code} type family PF phi :: (* -> *) -> * -> * \end{code} Again, in our specific example, |phi| stands for |Tuples|, and we have already seen our example's pattern functor, so writing an instance of this type family is easy: \begin{code} type instance PF Tuples = PF_Tuples \end{code} Since |PF Tuples| has kind |(* -> *) -> * -> *|, it must receive two type arguments before it becomes a concrete type. From |class Fam| we learn that they are called |r| and |ix|. These arguments are shared by each functor building block, all the way down to the leaves. The latter argument |ix| indicates the original type the pattern functor represents.
For example, |PF Tuples r Type|, for some |r|, is the type of a pattern functor value representing a |Type|. Functions |from| and |to| use this to constrain their output and input functors appropriately. The |r| serves a purpose similar to that in |ExprF r|: it is used to control the values at the recursive child positions. But where the |r| in |ExprF r| had kind |*|, this |r| has kind |* -> *|: |r|'s argument indicates which type is recursed into. This argument was unnecessary in the previous chapter, because families of datatypes were not supported yet and there was only one type to recurse into. Let us take a look at the definitions of the building blocks so that we can write an |instance Fam Tuples|. \begin{code} data (f :+: g) r ix = L (f r ix) | R (g r ix) infixr 5 :+: \end{code} The two constructors reflect the choice between the |L|eft functor and the |R|ight functor. The two type arguments |r| and |ix| are simply passed down to either functor. \begin{code} data (f :*: g) r ix = (f r ix) :*: (g r ix) infixr 7 :*: \end{code} Values of the product type constructor contain one value of each side. Again, the type arguments are passed on unchanged. \begin{code} data (f :>: xi) r ix where Tag :: f r ix -> (f :>: ix) r ix infix 6 :>: \end{code} The |Tag| constructor fixes the index |ix| of the enclosed functor to be the same as the right-hand side of the |:>:| type constructor. This ensures that when creating a pattern functor value, only values of the proper branch representing a particular datatype are allowed, and---conversely---that only values of the right branch need to be pattern matched on when converting the pattern functor values back to the original datatypes. \begin{code} data K a r ix = K a \end{code} Constants discard the |r| and |ix| and only hold a simple value of type |a|. \begin{code} data I xi r ix = I (r xi) \end{code} An |I xi| in a pattern functor marks a recursive occurrence of type |xi|. Recursive positions do not need to be the same as the enclosing type, so the |ix| argument is ignored. The |r| determines what is actually done with the |xi|. There are several interesting choices for |r|. Two of them are the following datatypes: \begin{code} newtype I0 a = I0 { unI0 :: a } newtype K0 a b = K0 { unK0 :: a } \end{code} The former results in values of the original datatypes at the recursive positions, while the latter ignores the index and uses a constant type. The former is used in the |Fam| type class, so |to| and |from| effectively transform the values only one level deep (a \emph{shallow} conversion). We will see another example of |r| later on when we look at fixed points of pattern functors. There is also a building block that describes constructor metadata, such as its name or fixity, but we will not need that information and it is safe to leave it out of our pattern functor, so we omit the description of its workings here. Now that we know how the building blocks are defined, we can write the conversion functions for our |Tuples| family. First, the conversion from the original datatypes to the pattern functors: \savecolumns \begin{code} instance Fam Tuples where from Expr ex = L . Tag $ case ex of EAdd x y -> L $ I (I0 x) :*: I (I0 y) :*: U EMul x y -> R . L $ I (I0 x) :*: I (I0 y) :*: U ETup x y -> R . R . L $ I (I0 x) :*: I (I0 y) :*: U EIntLit n -> R . R . R . L $ K n :*: U ETyped e t -> R . R . R . R $ I (I0 e) :*: I (I0 t) :*: U from Type ty = R .
Tag $ case ty of TyInt -> L $ U TyTup x y -> R $ I (I0 x) :*: I (I0 y) :*: U \end{code} The outermost case analysis shows the magic that is pattern matching on GADT values: both |Expr| and |Type| are constructors of the |Tuples| GADT, and depending on which value is given to |from|, the inner case analyses can pattern match on constructors of the relevant, more specific types. The two outer case constructs wrap all of their results in |L . Tag| and |R . Tag|, respectively. This matches the pattern functor's outer layers: |PF_Expr :>: Expr :+: PF_Type :>: Type|. Type |PF_Expr| uses nested binary sums. Because these associate to the right, the constructors progressively need more |R|'s in |from|'s definition to reach their position in the pattern functor. To the right of the dollar signs is the construction of each constructor's fields. The opposite |to| is very similar to |from|, except that the pattern functor constructors are on the left-hand side and the original datatype's constructors are on the right. Unfortunately, we cannot eliminate some of the parentheses using function composition as we could in |from|. \restorecolumns \begin{code} to Expr (L (Tag ex)) = case ex of L (I (I0 x) :*: I (I0 y) :*: U) -> EAdd x y R (L (I (I0 x) :*: I (I0 y) :*: U)) -> EMul x y R (R (L (I (I0 x) :*: I (I0 y) :*: U))) -> ETup x y R (R (R (L (K n :*: U)))) -> EIntLit n R (R (R (R (I (I0 x) :*: I (I0 y) :*: U)))) -> ETyped x y to Type (R (Tag ty)) = case ty of L U -> TyInt R (I (I0 x) :*: I (I0 y) :*: U) -> TyTup x y \end{code} Finally, we need to write instances of |El| for every member of our family. This type class is used in certain instances for the building blocks when proofs (elements of the GADT |Tuples|) are needed: \begin{code} class El phi ix where proof :: phi ix instance El Tuples Expr where proof = Expr instance El Tuples Type where proof = Type \end{code} That's it! Now we have everything we need to do some serious multi-datatype generic programming. \section{Fixed points in MultiRec} \label{sec:multirecfixedpoints} As we saw, the pattern functor embedding is shallow. Because the pattern functor abstracts over the recursive positions, we can use fixed points again to achieve a deep embedding: we just need to fix the type argument |r|---which determines the shape of the children---to the fixed point itself. To find out what the fixed point datatype should look like for a higher-order functor |f| (for example, |PF Tuples|), we write down three iterations of the recursion: \begin{code} type InfiniteFix f ix = f (f (f (...))) ix \end{code} As before, we observe the regularity of the recursion in this infinite datatype to be able to express the recursion in a Haskell datatype: \begin{code} newtype HFix f ix = HIn { hout :: f (HFix f) ix } \end{code} Now a deep embedding of a value of the |Tuples| family can be expressed as |HFix (PF Tuples)|. And because we are using fixed points again, we can insert our annotations at every level. We do not need to build a custom |Ann| datatype like before: we can reuse building blocks |K| and |:*:| to couple a constant with a functor. This gives us what we need to build the MultiRec version of |AnnFix| and |AnnFix1|. We reuse these names here: it will always be clear from the context which is meant, because the two versions of the fixed points are never mixed. \begin{code} type AnnFix x phi = HFix (K x :*: PF phi) type AnnFix1 x phi = (PF phi) (AnnFix x phi) \end{code} Using |K| means the index is discarded, making the annotation type |x| independent of the index.
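To make this representation concrete, here is what a single fully annotated leaf looks like: the literal |2| from the input |"2 + 3"|, paired with the kind of |Bounds| annotation the parser developed later in this chapter attaches to it. This is only a sketch; the name |annotatedTwo| is ours and the exact margins are illustrative. \begin{code}
-- An EIntLit leaf: the right-hand component follows EIntLit's path
-- through PF_Tuples, while the K component on the left carries the annotation.
annotatedTwo :: AnnFix Bounds Tuples Expr
annotatedTwo = HIn (K bounds :*: L (Tag (R (R (R (L (K 2 :*: U)))))))
  where bounds = Bounds { leftMargin = (0, 0), rightMargin = (1, 2) }
\end{code}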
If we wanted, we could use an associated datatype \cite{chakravarty05associated01} to make the annotation type depend on the specific datatype in the family. But we will be using source locations throughout the whole tree, so |K| suffices. \section{Pattern functors are traversable} If we can translate a value to its pattern functor, we can automatically |traverse| it\footnote{Starting with GHC 6.12, datatypes may derive |Traversable| instances!}: all we need to do is build traversal functions for the building blocks. What would such a traversal function look like for the generic building blocks? Recall the traversal function described by McBride and Paterson: \begin{code} traverse :: (Traversable t, Applicative f) => (a -> f b) -> t a -> f (t b) \end{code} The first argument |a -> f b| is used on the recursive positions of a datatype constructor. The resulting actions are then sequenced using |(<*>)|, resulting in one bigger action with the new traversable datatype as result. In the higher-order version of |traverse|, the first function will also be called with the recursive positions as arguments. However, in the generic case, we do not know the type of the recursive position: it might be any type in the family |phi|. For this reason, the supplied function must be polymorphic in its index so that we can give it a child of any family type. To help the programmer who writes this function, we also pass it a witness |phi ix| so that the function can pattern match on this witness, refining |ix| to a known type. In |traverse| the first argument may also change the child type from |a| to |b|. We cannot allow this in the generic version: the index |ix| must remain the same, because a constructor cannot suddenly change the types of its children. We can, however, change the recursive type constructor |r|. All these considerations are reflected in MultiRec's version of |Traversable|, called |HFunctor|: \begin{code} class HFunctor phi f where hmapA :: Applicative a => (forall xi. phi xi -> r xi -> a (r' xi)) -> phi ix -> f r ix -> a (f r' ix) \end{code} The first argument is the action that is applied to every child position. The type guarantees that it works for all indices and that the action does not change the index of a specific child. The presence of |hmapA|'s second argument, a witness of type |phi ix|, makes it much easier to fix the type argument |phi| when calling |hmapA|. Just like in the |Data.Traversable| module, we can write monadic and pure versions of |hmapA|, making use of a (trivial) |instance Applicative I0|: \begin{code} hmapM :: (HFunctor phi f, Monad m) => (forall xi. phi xi -> r xi -> m (r' xi)) -> phi ix -> f r ix -> m (f r' ix) hmap :: HFunctor phi f => (forall xi. phi xi -> r xi -> r' xi) -> phi ix -> f r ix -> f r' ix hmapM f p x = unwrapMonad (hmapA (\p_i x -> WrapMonad (f p_i x)) p x) hmap f p x = unI0 (hmapA (\p_i x -> I0 (f p_i x)) p x) \end{code} All that remains is to show that these functions can actually be implemented on pattern functors. Again, we start with the building blocks of individual constructors: \savecolumns \begin{code} instance HFunctor phi U where hmapA _ _ U = pure U instance HFunctor phi (K x) where hmapA _ _ (K x) = pure (K x) instance El phi xi => HFunctor phi (I xi) where hmapA f _ (I x) = I <$> f proof x \end{code} The most interesting one is |I|, where |f| is used on the recursive position. That instance also explains why |phi| is part of the class header: a proof for the recursive position must be produced to be able to call |f| on the child.
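For reference, the trivial |Applicative| instance for |I0| alluded to above could look as follows. This is only a sketch: the MultiRec library may already provide an equivalent instance, in which case these declarations should be omitted. \begin{code}
-- I0 is an identity functor, so application is plain function application.
instance Functor I0 where
  fmap f (I0 x) = I0 (f x)

instance Applicative I0 where
  pure x         = I0 x
  I0 f <*> I0 x  = I0 (f x)
\end{code}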
The instances for the combinators use applicative style to traverse child functors: \restorecolumns \begin{code} instance (HFunctor phi f, HFunctor phi g) => HFunctor phi (f :+: g) where hmapA f p (L x) = L <$> hmapA f p x hmapA f p (R y) = R <$> hmapA f p y instance (HFunctor phi f, HFunctor phi g) => HFunctor phi (f :*: g) where hmapA f p (x :*: y) = (:*:) <$> hmapA f p x <*> hmapA f p y instance HFunctor phi f => HFunctor phi (f :>: ix) where hmapA f p (Tag x) = Tag <$> hmapA f p x \end{code} \section{Error catamorphisms in MultiRec} The next step is to translate the error catamorphisms we discussed in section \ref{sec:errorcata} to the pattern functors. It is reasonably easy to express an algebra in terms of pattern functors in the same way as was done with the base functors. We just need to take into account the higher-order recursion parameter |r| and the extra index |ix| of the pattern functor: \begin{code} type ErrorAlg_PF f e a = forall ix. f (K0 a) ix -> Either e a \end{code} Here we choose |K0 a| for |r|, saying that the children should always contain a value of type |a|, regardless of their index. The universal quantification says that it does not matter what the index of the ingoing value is: the result is always |Either e a|. With this algebra type we can write the pattern functor version of |errorCata|: \begin{code} errorCata :: (HFunctor phi f) => ErrorAlg_PF f e r -> phi ix -> HFix (K x :*: f) ix -> Except [(e, x)] r errorCata alg p_f (HIn (K k :*: f)) = case hmapA (\p_g g -> K0 <$> errorCata alg p_g g) p_f f of Failed xs -> Failed xs OK expr' -> case alg expr' of Left x' -> Failed [(x', k)] Right v -> OK v \end{code} Although the names of the constructors that are pattern matched on are different, the function's implementation is similar to the old |errorCata|. An extra proof must be carried around and passed to |hmapA|, but just as before the |Applicative| instance of |Except| is used to collect errors. However, writing such an algebra for this catamorphism is not as easy as previously, because the algebra cannot use the constructors of the original datatype. Instead, it has to pattern match on the chains of |L|'s and |R|'s of the pattern functor, which make the algebra a lot longer and less clear. The same problem occurs when writing normal algebras (as opposed to error algebras). The MultiRec paper \cite{rodriguez2008mutrec} solves the problem by automatically translating the pattern functor to a more convenient type in which the algebras can be expressed, using a type synonym family. This derived type cannot be applied directly to values of the pattern functor, but it is translated back to a function that is applicable to pattern functors using a type class with instances for the various functor building blocks. We will adapt this strategy to work for our error catamorphisms. We skip the discussion of normal catamorphisms; the reader is referred to the MultiRec paper for this. Instead, we jump directly to the encoding of the error catamorphisms in MultiRec. It all starts with the type synonym family that computes the more convenient type: \begin{code} type family ErrorAlg (f :: (* -> *) -> * -> *) -- pattern functor (e :: *) -- error type (a :: *) -- result type :: * -- resulting algebra type \end{code} Since this type function will be applied to pattern functor building blocks, we need to give an instance for every single building block. 
Let us start with the building blocks of individual constructors: \savecolumns \begin{code} type instance ErrorAlg U e a = Either e a type instance ErrorAlg (K b :*: f) e a = b -> ErrorAlg f e a type instance ErrorAlg (I xi :*: f) e a = a -> ErrorAlg f e a \end{code} The first case handles the base case of every constructor: for every constructor, the error algebra should end in |Either e a|. The second and third cases handle cons nodes of constant fields and recursive fields, introducing an extra argument to the algebra. For constant fields, the type of this extra argument is equal to the type |b| of the field. For recursive positions, however, we ignore the index that is recursed on and expect an argument whose type equals the result type of the algebra. So just like before, the algebra may assume that children have already been folded and that there were no errors. Again, if desirable, the algebra's result type can be made dependent on the index using an associated datatype. \restorecolumns \begin{code} type instance ErrorAlg (f :+: g) e a = (ErrorAlg f e a, ErrorAlg g e a) type instance ErrorAlg (f :>: xi) e a = ErrorAlg f e a \end{code} Constructors in a datatype and datatypes in a family are combined using |:+:|. In such cases, algebras have to be provided for both options, so the algebras are coupled together in a tuple. Tags are ignored; the recursive algebra type is used directly. With these instances, the computed types |ErrorAlg PF_Expr| and |ErrorAlg PF_Type| are equivalent to these type synonyms: \begin{code} type ExprErrorAlg e a = (a -> a -> Either e a) -- EAdd :&: (a -> a -> Either e a) -- EMul :&: (a -> a -> Either e a) -- ETup :&: (Int -> Either e a) -- EIntLit :&: (a -> a -> Either e a) -- ETyped type TypeErrorAlg e a = Either e a -- TyInt :&: (a -> a -> Either e a) -- TyTup \end{code} Here |(:&:) = (,)|, a right-associative infix notation for binary tuples, is used to make the types easier to read. As said before, these derived algebra types are very convenient to work with, but they cannot be directly applied to pattern functors. To still be able to use them, we describe how to convert such a convenient algebra to one that can work on pattern functors using a type class: \begin{code} class MkErrorAlg f where mkErrorAlg :: ErrorAlg f e a -> ErrorAlg_PF f e a \end{code} Instances of this class correspond directly to the instances of the type synonym family above: \begin{code} instance MkErrorAlg U where mkErrorAlg x U = x instance MkErrorAlg f => MkErrorAlg (K a :*: f) where mkErrorAlg alg (K x :*: f) = mkErrorAlg (alg x) f instance MkErrorAlg f => MkErrorAlg (I xi :*: f) where mkErrorAlg alg (I (K0 x) :*: f) = mkErrorAlg (alg x) f instance MkErrorAlg f => MkErrorAlg (f :>: xi) where mkErrorAlg alg (Tag f) = mkErrorAlg alg f instance (MkErrorAlg f, MkErrorAlg g) => MkErrorAlg (f :+: g) where mkErrorAlg (alg_f, _) (L x) = mkErrorAlg alg_f x mkErrorAlg (_, alg_g) (R y) = mkErrorAlg alg_g y \end{code} Here the benefit we alluded to in footnote \ref{foot:typelist} becomes apparent: because lists of field types always end in the nil type |U|, we don't have to write instances for lone occurrences of |K a| and |I xi| as they always appear on the left-hand side of a |(:*:)|. To demonstrate that this approach works in practice, we look at an algebra that infers the type of expressions of our |Tuples| language.
\begin{code} inferType :: ExprErrorAlg String Type :&: TypeErrorAlg String Type inferType = ( equal "+" & equal "*" & tup & const (Right TyInt) & equal "::" ) & ( Right TyInt & tup ) where equal op ty_1 ty_2 | ty_1 == ty_2 = Right ty_1 | otherwise = Left ("lhs and rhs of " ++ op ++ " must have equal types") tup ty_1 ty_2 = Right (TyTup ty_1 ty_2) \end{code} The algebra says that both operands of |+| and |*| must have equal types. The algebra also checks whether explicit type signatures using |::| match the types of the expressions on the left-hand sides. To test this algebra, we use a function |readExpr :: String -> AnnFix Bounds Tuples Expr| that we will define later when we discuss parsing annotated expressions with MultiRec: \begin{code} > let expr_1 = readExpr "(1, (2, 3))" > errorCata (mkErrorAlg inferType) Expr expr_1 OK (TyTup TyInt (TyTup TyInt TyInt)) > let expr_2 = readExpr "(1 :: (Int, Int), 2 + (3, 4))" > errorCata (mkErrorAlg inferType) Expr expr_2 Failed [ ("lhs and rhs of :: must have equal types", Bounds {leftMargin = (1,1), rightMargin = (16,16)}) , ("lhs and rhs of + must have equal types", Bounds {leftMargin = (17,18), rightMargin = (28,28)}) ] \end{code} \section{Constructing recursively annotated trees} In section \ref{sec:multirecfixedpoints} we translated the |AnnFix| and |AnnFix1| type synonyms from section \ref{sec:fixedpoint} to use MultiRec terms. Function |mkAnnFix| can be translated in a similar fashion: \begin{code} mkAnnFix :: x -> AnnFix1 x s ix -> AnnFix x s ix mkAnnFix x = HIn . (K x :*:) \end{code} When we developed the parser combinators, we produced |AnnFix1|s by taking one of |ExprF|'s constructors and giving it fully annotated |AnnFix| children. But producing an |AnnFix1| in the MultiRec version is not as easy: we cannot use the constructors of the original datatypes |Expr| and |Type| and give them annotated children, because the constructors' types only allow unannotated values as fields. Instead, to make an |AnnFix1| we have to use the pattern functor versions of the constructors. For example, given two annotated expressions of type |AnnFix Bounds Tuples Expr|, we can construct their sum using constructor |EAdd|'s pattern functor constructor |\x y -> L . Tag . L $ I x :*: I y :*: U|. These constructor functions are long and tedious to write, which makes it easy to make mistakes when writing them down. They have to be written for each possible constructor, and then they have to be used during parsing instead of the real constructors. It would be much nicer if we could find a way to avoid this. The way we will solve this problem in this thesis is by making use of the fact that the parsers we work with construct trees in post-order form: first the children are parsed and constructed (including position information), then the node itself. For example, when parsing the sentence |"2 + 3"|, first the |2| is parsed, then the |3| and only then their sum is constructed and returned. The parser maintains an explicit stack of fully annotated trees. The stack contains exactly those trees that will form the children of later nodes, until the very end when it will contain the final result of the parser. In the example above, the parser would first push |EIntLit 2|, then |EIntLit 3| and finally |EAdd (EIntLit 2) (EIntLit 3)|. Every time an unannotated node is pushed onto the stack, the parser supplies the annotation for the root of the node.
Because the node's children and their annotations will already be present on the stack, the parser has enough information at that moment to form a new, fully annotated tree. Let us call the action of pushing an element onto the stack |yield|. Then the construction of the annotated parse tree for |"2 + 3"| looks like this: \begin{code} do n2 <- yield Expr (Bounds { leftMargin = (0, 0), rightMargin = (1, 2) }) (EIntLit 2) n3 <- yield Expr (Bounds { leftMargin = (3, 4), rightMargin = (5, 5) }) (EIntLit 3) n5 <- yield Expr (Bounds { leftMargin = (0, 0), rightMargin = (5, 5) }) (EAdd n2 n3) return n5 \end{code} The first argument to |yield| is a witness of the |Tuples| GADT, indicating the type of the expression that is being pushed onto the stack. The second argument is the annotation that goes with the root of the tree, while the third and final argument is the tree itself. Note that there is some overlap in the information that is pushed onto the stack. First |2| and |3| are pushed individually, and then in the third |yield| they are given to the stack again as children of the |EAdd| node. If the integer literals are given twice, which instances does |yield| use? The answer is that |yield| only uses the root constructors of the elements given to it. The children of the |EAdd| node in the third call to |yield| are discarded, so yielding |EAdd undefined undefined| would have worked just as well. This once again shows that redundant information is bad, but it is the price we pay for being able to use the original datatype constructors rather than the pattern functor versions. Alternative solutions to the problem are discussed in chapter \ref{sec:future}. Let us take a closer look at the exact steps |yield| takes whenever it is called: \begin{enumerate} \item A new element is being pushed onto the stack. We have been given the accompanying witness of type |phi ix|, so we can do a shallow conversion to the pattern functor version of the constructor using |from| in type class |Fam|. The resulting value is of type |PF phi I0 ix|. \item Figure out how many children the constructor has. We can express this computation in terms of |hmapA|, traversing the constructor's pattern functor and counting the child positions. Let us call the number of children |n|. \item \label{step:pop} Pop |n| children from the stack. Every child is fully annotated (by induction; read on) and has type |AnnFix x phi xi|, for some index |xi|. \item \label{step:distribute} Distribute the |n| children over the pattern functor of the constructor being pushed. This results in a node which has fully annotated children but still lacks a top-level annotation, i.e.~a value of type |AnnFix1 x phi ix|. \item Using |mkAnnFix|, tie the annotation that was passed to |yield| to the |AnnFix1| value, producing a value of |AnnFix x phi ix|. \item Finally, push this new, fully annotated tree onto the stack. \end{enumerate} A consequence of pushing redundant information is that some of the steps outlined above might fail. Step \ref{step:pop} might find there are not enough (fewer than |n|) children on the stack if a previous parse operation neglected to |yield| its results. Step \ref{step:distribute} might find it is trying to replace unannotated children by annotated values of the wrong type. For example, |EAdd| expects two |Expr| children, but the previous yield statements might have pushed |Type|s onto the stack. Finally, at the end of the parsing, we expect exactly one tree in the stack: another thing that might not be the case. 
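As an aside, the child count needed in the second step above can be computed with the |hmapM| traversal defined earlier. The following is a minimal sketch; the name |countChildren| is ours, and the |children| helper used by |doYield| below can be implemented in the same style (it additionally collects the children themselves). \begin{code}
-- Count the recursive positions in a single layer of a pattern functor.
-- Uses execState and modify from Control.Monad.State.
countChildren :: HFunctor phi f => phi ix -> f r ix -> Int
countChildren p x = execState (hmapM (\_ r -> modify (+1) >> return r) p x) 0
\end{code}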
The good news is that the |yield|ing of values can be hidden in standard combinators such as |unit|, |chainl| and |chainr|. This means that correct use of |yield| only has to be verified for these combinators and that the final parser cannot do anything wrong if it is expressed in terms of these combinators. In fact, as we will see in a bit, the parser for the |Expr| part of the |Tuples| family is surprisingly similar to the parsers defined in section \ref{sec:annoparse}. Before we look at these parsing combinators, however, let us look at the code behind |yield|. We can separate the parsing from the |yield|ing by defining a new monad transformer for the |yield|ing process and then later stacking the parser transformer with the |yield| transformer. As is common when defining a new transformer, we capture the primitive operations in their own type class: \begin{code} class Monad m => MonadYield m where type YieldFam m :: * -> * type AnnType m :: * yield :: YieldFam m ix -> AnnType m -> ix -> m ix \end{code} Class |MonadYield| needs only one primitive operation, but it uses two associated types, expressed using a type synonym family. By putting these functions inside the class, we express the fact that instances of these type functions go hand in hand with instances of the class itself. What type do the stack elements have? The elements might be of different types within a family. In order to be able to put them in the same list, we can existentially quantify over the family indices. By storing them together with proof values of the family GADT, we can always recover the indices again later by pattern matching on the proof. We define a new datatype to this end and call it |AnyF|, because it can store an |f| indexed by any type in a family |phi|: \begin{code} data AnyF phi f where AnyF :: phi ix -> f ix -> AnyF phi f type AnyAnnFix x phi = AnyF phi (AnnFix x phi) \end{code} The first argument to the |AnyF| constructor is the proof; the second is an |f| indexed by the type indicated by the proof. In our case, we set |f| to be |AnnFix x phi| and call such an element |AnyAnnFix x phi|. Now we can write an instance of |MonadYield|: \begin{code} newtype YieldT x phi m a = YieldT (StateT [AnyAnnFix x phi] m a) deriving (Functor, Monad) instance MonadTrans (YieldT x phi) where lift = YieldT . lift instance (Monad m, HFunctor phi (PF phi), EqS phi, Fam phi) => MonadYield (YieldT x phi m) where type YieldFam (YieldT x phi m) = phi type AnnType (YieldT x phi m) = x yield = doYield \end{code} In |YieldT x phi m a|, |x| is the type of the annotation (such as |Bounds|), |phi| is the family GADT (such as |Tuples|), |m| is the underlying monad and |a| is the return value of the monad. We keep track of the stack by making |YieldT| a |newtype| wrapper around a |StateT| monad with a list of |AnyAnnFix|es as state. As any good monad transformer, a |MonadTrans| instance is supplied. The type functions |YieldFam| and |AnnType| point to the relevant type arguments. The definition of |doYield| is postponed; we will look at it in a moment. First, we need to define the |EqS| class constraint. When we distribute the annotated children over a pattern functor, we need to make sure that the children are of the right type. We will use hmapA to perform the distribution, and the transformation function passed to |hmapA|, which has type |forall xi. phi xi -> r xi -> a (r' xi)|, reflects the fact that the new children should have the right index. 
The transformation function is passed a witness of type |phi xi| and the old child of type |r xi|, and the resulting action should return a new child of type |r' xi|, i.e.~with the same index. How can we make sure that the element we removed from the stack has the right index? This is what |EqS| is for: %format ^^ = "\;" \begin{code} class EqS phi where eqS :: phi ix_1 -> phi ix_2 -> Maybe (ix_1 :=: ix_2) data (:=:) :: * -> * -> * ^^ where Refl :: ix :=: ix \end{code} Constructor |Refl| can only be constructed if the two type operands to |(:=:)| are equal. As a result, if two type variables |ix_1| and |ix_2| are in scope, pattern matching on the |Refl| constructor of type |ix_1 :=: ix_2| tells the Haskell compiler that these two type variables must be equal. This explains the return type of |eqS|: given two witnesses, if they turn out to be one and the same witness, we do not just want the result |True| (as is the case with the normal equality operator |(==)|), but we also want a proof that |ix_1 :=: ix_2|. We can use |eqS| when distributing the children: compare the witness argument to |hmapA|'s first argument to the witness that is stored in the |AnyAnnFix|. If they are equal, then we can convince the compiler that it is okay to return the annotated child. \begin{code} doYield :: (Monad m, HFunctor phi (PF phi), EqS phi, Fam phi) => phi ix -> x -> ix -> YieldT x phi m ix doYield p bounds x = YieldT $ do let pfx = from p x let n = length (children p pfx) stack <- get if length stack < n then error $ "structure mismatch: required " ++ show n ++ " accumulated children but found only " ++ show (length stack) else do let (cs, cs') = splitAt n stack let newChild = evalState (hmapM distribute p pfx) (reverse cs) put (AnyF p (HIn (K bounds :*: newChild)) : cs') return x distribute :: EqS phi => phi ix -> I0 ix -> State [AnyAnnFix x phi] (AnnFix x phi ix) distribute p_1 _ = do xs <- get case xs of [] -> error "structure mismatch: too few children" AnyF p_2 x : xs' -> case eqS p_1 p_2 of Nothing -> error "structure mismatch: incompatible child type" Just Refl -> do put xs'; return x \end{code} The code closely follows the steps outlined earlier. One thing to note is that because |doYield| uses |splitAt| to pop the children, the resulting list has to be reversed: the child that was pushed last appears at the front of the stack, while we want that child to be in the last position of the list we give to |distribute|. That |distribute| ignores the unannotated children and simply replaces them is made explicit by the underscore as its second argument. \section{Parsing expressions} In the previous section we alluded to hiding the calls to |yield| in parser combinators. Now that we have fully defined the |YieldT| monad transformer, we can build a monad stack that allows yielding while parsing and actually define these parser combinators. In section \ref{sec:annoparse} we defined a type synonym |P|; we will reuse |P| and extend it to define a type synonym |YP| that incorporates |YieldT|: \begin{code} type YP s phi m = P s (YieldT Bounds phi m) \end{code} Why should the monads be stacked in this way? Stacked like this, we can use the choice parser combinator |(<||>)| to define alternatives in the grammar. If |YieldT| were on the outside, we would only be able to use choice in the inner monad |P|, from where we cannot |yield| any values. On the other hand, now that |P| is on the outside we can express choice and still have access to |yield| from within the alternatives. 
We have to be careful that while evaluating the choices, no |yield| side effects take place. Parsec partially solves this by specifying that |p_1 <||> p_2| first tries |p_1| and does not consider |p_2| anymore if |p_1| has consumed any input. Since we only |yield| values after consuming some input, we are safe. However, Parsec also offers the |try| combinator which lifts this restriction and allows arbitrary backtracking. If |try| is used, we have no safety guarantees anymore. For that reason, we should be very careful when using |try|, or---better yet---not use it at all. Fortunately, we will not need it in the definition of our example |Tuples| parser. Let us take a look at the |unit| parser combinator we have seen in section \ref{sec:annoparse}, translated to the |YP| monad stack: \begin{code} unit :: (Fam phi, EqS phi, HFunctor phi (PF phi), Monad m) => phi a -> YP s phi m a -> YP s phi m a unit w p = do left <- getPos x <- p mkBounded w left x mkBounded :: (Fam phi, EqS phi, HFunctor phi (PF phi), Monad m) => phi a -> Range -> a -> YP s phi m a mkBounded w left x = do right <- getPos lift $ yield w (Bounds left right) x \end{code} Although the types are different and more involved, the definition of |unit| is identical to that of the original |unit|, with the exception of an extra witness parameter of type |phi a|. Only |mkBounded|'s definition has changed significantly: instead of returning the result of a call to |mkAnnFix|, it calls |yield|. The other two combinators, |chainl| and |chainr|, are also the same as before, again with the exception of the witness parameter: \begin{code} chainr :: (Fam phi, EqS phi, HFunctor phi (PF phi), Monad m, Show a) => phi a -> YP s phi m a -> YP s phi m (a -> a -> a) -> YP s phi m a chainr w px pf = fix $ \loop -> do left <- getPos x <- px option x $ do f <- pf y <- loop mkBounded w left (f x y) chainl :: (Fam phi, EqS phi, HFunctor phi (PF phi), Monad m, Show a) => phi a -> YP s phi m a -> YP s phi m (a -> a -> a) -> YP s phi m a chainl w px pf = do left <- getPos px >>= rest left where rest left = fix $ \loop x -> option x $ do f <- pf y <- px mkBounded w left (f x y) >>= loop \end{code} The actual parser for the |Tuples| family of datatypes is unsurprising, but we include it here for the sake of symmetry with the previous chapter. As before, we do not include the code for the lexer. The lexer token constructor names can be recognized by a prefix |T|.
\begin{code} type ExprParser = YP ExprToken Tuples Identity pExpr :: ExprParser Expr pExpr = do left <- getPos ex <- pAdd option ex $ do pToken TDoubleColon ty <- pType mkBounded Expr left (ETyped ex ty) pAdd :: ExprParser Expr pAdd = chainl Expr pMul (EAdd <$ pToken TPlus) pMul :: ExprParser Expr pMul = chainl Expr pFactor (EMul <$ pToken TStar) pFactor :: ExprParser Expr pFactor = pIntLit <|> pTupleVal pIntLit :: ExprParser Expr pIntLit = unit Expr $ (\(TIntLit n) -> EIntLit n) <$> satisfy isIntLit pTupleVal :: ExprParser Expr pTupleVal = pTuple Expr pExpr ETup pType :: ExprParser Type pType = pTyInt <|> pTyTuple pTyInt :: ExprParser Type pTyInt = unit Type $ TyInt <$ pToken TInt pTyTuple :: ExprParser Type pTyTuple = pTuple Type pType TyTup pTuple :: Tuples ix -> ExprParser ix -> (ix -> ix -> ix) -> ExprParser ix pTuple w pEl f = do left <- getPos pToken TLParen lhs <- pEl ty <- option lhs $ do pToken TComma rhs <- pEl mkBounded w left (f lhs rhs) pToken TRParen return ty \end{code} \section{Annotation-guided exploring} In contrast with the annotation-guided exploring in section \ref{sec:fixedpoint}, pattern functors give us enough information to define genuine zippers. An implementation of such a zipper is available in package \texttt{zipper} on Hackage\footnote{\url{http://hackage.haskell.org/package/zipper}}. We will discuss the API of the zipper, but not its implementation. For details on the implementation, the reader is referred to the MultiRec paper \cite{rodriguez2008mutrec}. The current version of the zipper (\texttt{0.3}) is intended to work on shallowly converted datatypes. In order to be able to use it for recursively annotated trees, a slight modification has been made to the library. The modified version is listed in Appendix \ref{app:zipperfix}. \begin{figure}[t] \begin{code} data FixZipper phi f ix enter :: Zipper phi f => phi ix -> HFix f ix -> FixZipper phi f ix leave :: Zipper phi f => FixZipper phi f ix -> HFix f ix type Nav = forall phi f ix. Zipper phi f => FixZipper phi f ix -> Maybe (FixZipper phi f ix) down :: Nav up :: Nav right :: Nav left :: Nav on :: (forall xi. phi xi -> HFix f xi -> a) -> FixZipper phi f ix -> a \end{code} \caption{The API of the MultiRec-based fixpoint zipper.} \label{fig:zipperapi} \end{figure} An overview of the API is listed in figure \ref{fig:zipperapi}. The names of the arguments of |FixZipper| are consistent with those in previous datatypes: |phi| is the family GADT, |f| is the pattern functor over which the zipper is computed and |ix| is the top-level index. Function |enter| takes a fixpoint and returns a zipper over that fixpoint, with the focus at the tree's root. Function |leave| leaves the zipper structure and returns the tree that was originally |enter|ed. Because a zipper value implicitly contains a witness |phi ix|, functions that \emph{accept} a zipper (such as |leave|) do not need an extra witness argument. The navigation functions |down|, |up|, |right| and |left| all have the same type, which---as before---has been called |Nav|. And just as before, navigation steps can be composed using |>=>|. Finally, |on| takes a function, applies it to the current focus and then returns the result.
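As a small example of composing navigation steps, the following |Nav| moves to a node's second child, returning |Nothing| if the focus does not have at least two children; the name |secondChild| is ours. \begin{code}
-- Move down to the first child, then one step to the right;
-- (>=>) comes from Control.Monad.
secondChild :: Nav
secondChild = down >=> right
\end{code}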
If we instantiate a |FixZipper| with |K x :*: PF phi| for |f|, it will hold selections of recursively annotated trees: \begin{code} type AnnZipper phi x = FixZipper phi (K x :*: PF phi) \end{code} Given a zipper over an annotated tree, we can extract the annotation of the current focus using |on|: \begin{code} focusAnn :: AnnZipper phi x ix -> x focusAnn = on (\_ (HIn (K x :*: _)) -> x) \end{code} The heart of the code in section \ref{sec:annoops} was the |explore| function. Most other functions were expressed in terms of |explore|. The function's generic sibling is very similar: \begin{code} explore :: Zipper phi (PF phi) => phi ix -> (x -> ExploreHints) -> (AnnFix x phi) ix -> [AnnZipper phi x ix] explore p hints = explore' hints . enter p explore' :: Zipper phi (PF phi) => (x -> ExploreHints) -> AnnZipper phi x ix -> [AnnZipper phi x ix] explore' hints root = [ z | (dirOk, zs) <- dirs, dirOk (hints x), z <- zs ] where x = focusAnn root dirs = [ (matchHere, [root]) , (exploreDown, exploreMore (down root)) , (exploreRight, exploreMore (right root)) ] exploreMore = maybe [] (explore' hints) \end{code} Its type is more complicated, and it receives a witness. The extraction of the annotation is done using |focusAnn| rather than direct pattern matching. The other functions are also very similar. Again, the type changes and some witnesses are passed, but their essences are the same. \begin{code} findLeftmostDeepest :: (Zipper phi (PF phi)) => phi ix -> (x -> Bool) -> AnnFix x phi ix -> Maybe (AnnZipper phi x ix) findLeftmostDeepest p down = listToMaybe . reverse . explore p hints where hints x | down x = ExploreHints True True False | otherwise = ExploreHints False False True selectByRange :: Zipper phi (PF phi) => phi ix -> Range -> AnnFix Bounds phi ix -> Maybe (AnnZipper phi Bounds ix) selectByRange p range@(left, _) = listToMaybe . reverse . explore p hints where hints bounds@(Bounds _ (ir, _)) = ExploreHints { matchHere = range `rangeInBounds` bounds , exploreDown = range `rangeInRange` outerRange bounds , exploreRight = left >= ir } selectByPos :: (Zipper phi (PF phi)) => phi ix -> Int -> AnnFix Bounds phi ix -> Maybe (AnnZipper phi Bounds ix) selectByPos p pos = findLeftmostDeepest p (posInRange pos . innerRange) repairBy :: (Ord dist, HFunctor phi (PF phi)) => phi ix -> (Range -> Range -> dist) -> AnnFix Bounds phi ix -> Range -> Bounds repairBy p cost tree range = head (sortOn (cost range . innerRange) (allAnnotations p tree)) repair :: HFunctor phi (PF phi) => phi ix -> AnnFix Bounds phi ix -> Range -> Bounds repair p = repairBy p distRange moveSelection :: Zipper phi (PF phi) => phi ix -> AnnFix Bounds phi ix -> Nav -> Range -> Maybe Bounds moveSelection p tree nav range = focusAnn <$> (selectByRange p range tree >>= nav) \end{code}
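Finally, here is a sketch of how these pieces fit together: parse a string into a recursively annotated tree and look up the annotation of the node selected by a cursor position. It assumes the |readExpr| parser used earlier and a |Zipper Tuples (PF Tuples)| instance; the name |selectDemo| is only illustrative. \begin{code}
-- Parse the input and return the Bounds of the leftmost, deepest node
-- whose inner range contains the given position, if any.
selectDemo :: String -> Int -> Maybe Bounds
selectDemo input pos = focusAnn <$> selectByPos Expr pos (readExpr input)
\end{code}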