\chapter{Annotated base functors} \label{sec:fixedpoint} The approach above was not satisfactory because we left the definition of |BareExpr| untouched; we only used it. Let us see what happens if we allow ourselves to change |BareExpr|'s definition. Our goal is still to add position information to every node. We can do this in two ways: either we add a new field to every constructor, or we couple each \emph{use} of |BareExpr| with |Bounds|. We choose the latter because it allows the extraction of a nice abstraction. \begin{code} type PositionalExpr = (Bounds, PositionalExpr') data PositionalExpr' = Num Int | Add PositionalExpr PositionalExpr | Sub PositionalExpr PositionalExpr | Mul PositionalExpr PositionalExpr | Div PositionalExpr PositionalExpr \end{code} Notice that |BareExpr| and |PositionalExpr'| look very much alike: they have the same structure. Only the types of their recursive positions are different. We can maximize reuse by abstracting over the parts that differ. In this case we add a type argument to the datatype, to be filled in later when we know whether we want a bounded or unbounded expression: \begin{code} data ExprF rT = Num Int | Add rT rT | Sub rT rT | Mul rT rT | Div rT rT \end{code} We can now redefine |BareExpr| in terms of |ExprF|. Remember that the type argument of |ExprF| determines the shape of its children. To reconstruct the original expression type, we want the children themselves to be expressions as well. Therefore we need to give the expression type we are defining as the type argument to |ExprF|. This leads to a recursive definition of |Expr|. We can encode such a definition by a new datatype: \begin{code} newtype Expr = Expr (ExprF Expr) \end{code} Expanding |Expr| repeatedly leads to the infinite type |ExprF (ExprF (ExprF ...))|, indicating that the obtained tree is of |ExprF|-shape at every level. |PositionalExpr| can be expressed in terms of |ExprF| in a similar way.
To insert the position information at every level, we wish to obtain the infinite type |(Bounds, ExprF (Bounds, ExprF ...))|. We need another datatype to achieve this. \begin{code} newtype PositionalExpr = PositionalExpr (Bounds, (ExprF PositionalExpr)) \end{code} The idea of abstracting over a type's children like this is not new and goes by several names, including \emph{open recursion}. The |ExprF| version of the expression datatype is often called the \emph{base functor} and we will use that name in the rest of this thesis. The suffix |F| to indicate the functor version of a datatype comes from \emph{Polytypic Programming} \cite{jeuring96polytypicprogramming}. The fact that types such as |Expr| and |PositionalExpr| use themselves as arguments to functors in their own definitions is often made explicit by expressing them in terms of the well-known datatype |Fix|. Doing so adds to the modularity and enables some generic functions, which we will take a look at in a moment. An early document discussing the |Fix| datatype is \emph{Recursive types for free!} \cite{wadler90recursive}. The |Expr| datatype is immediately expressible in terms of |Fix|; we redefine it here: \begin{code} newtype Fix fT = In { out :: fT (Fix fT) } newtype Expr = Expr { runExpr :: Fix ExprF } \end{code} Because number literals and arithmetic operators are overloaded in Haskell, we can easily construct values of the new |Expr| type if we supply an appropriate |instance Num Expr| and |instance Fractional Expr|: \begin{code} > runExpr (2 + 3 * 4) In {out = Add (In {out = Num 2}) (In {out = Mul (In {out = Num 3}) (In {out = Num 4})})} \end{code} To redefine |PositionalExpr| in terms of |ExprF| we need to go through a bit more trouble. 
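As an aside, the |Num| and |Fractional| instances alluded to above might be sketched as follows. This is a hypothetical implementation, repeating the relevant definitions for completeness; methods without a sensible meaning for expression trees, such as |abs|, |signum| and |fromRational|, are left undefined:

\begin{code}
newtype Fix fT = In { out :: fT (Fix fT) }
newtype Expr = Expr { runExpr :: Fix ExprF }

data ExprF rT = Num Int | Add rT rT | Sub rT rT | Mul rT rT | Div rT rT

instance Num Expr where
  fromInteger n    = Expr (In (Num (fromInteger n)))
  Expr x + Expr y  = Expr (In (Add x y))
  Expr x - Expr y  = Expr (In (Sub x y))
  Expr x * Expr y  = Expr (In (Mul x y))
  abs              = error "abs: not meaningful for Expr"
  signum           = error "signum: not meaningful for Expr"

instance Fractional Expr where
  Expr x / Expr y  = Expr (In (Div x y))
  fromRational     = error "fromRational: not meaningful for Expr"
\end{code}

With these instances in scope, the literal |2 + 3 * 4| builds the tree shown above rather than performing any arithmetic.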
We introduce a new datatype we can use for adding the position information at every level before we give it to |Fix|, somewhat like a tuple type that is lifted on one side: \begin{code} data Ann x f a = Ann x (f a) instance Functor f => Functor (Ann x f) where fmap f (Ann x t) = Ann x (fmap f t) newtype PositionalExpr = PositionalExpr { runPositionalExpr :: Fix (Ann Bounds ExprF) } \end{code} These last definitions of |Expr| and |PositionalExpr| are the final ones for this section and they will be used throughout the following sections. Furthermore, we introduce two type synonyms that will prove useful later on. \begin{code} type AnnFix xT fT = Fix (Ann xT fT) type AnnFix1 xT fT = fT (AnnFix xT fT) \end{code} The first is a recursive tree of shape |f| at every level, fully annotated with |xT|'s; the second has fully annotated children but still lacks an annotation at the top level. It can be made fully annotated by providing the top-level annotation: \begin{code} mkAnnFix :: x -> AnnFix1 x f -> AnnFix x f mkAnnFix x = In . Ann x \end{code} There are numerous advantages to expressing |Expr| and |PositionalExpr| this way, some of which we have already seen. The most important one is that we have solved one of the issues with the approach in the previous section: we no longer use unbounded lists for the children, so we can no longer build syntax trees that have too many or too few children and therefore do not correspond to actual expression trees. We are also no longer storing complete subtrees in separate fields, and so do not have any redundant information in our datatype. The expression parts and annotation parts are tightly interwoven, ensuring that all values of type |AnnFix Bounds ExprF| are valid. We have maximized reuse by providing a few components we can stack and compose as necessary. This allows for generic functions; a nice example of this is the following function that generically removes annotations from trees.
The only thing we require of the functors given to |AnnFix| is that they actually implement the |Functor| type class: \begin{code} unannotate :: Functor f => AnnFix x f -> Fix f unannotate (In (Ann _ tree)) = In (fmap unannotate tree) \end{code} To use this function on annotated expressions, we need |ExprF| to be an instance of |Functor|. We omit that instance here because it is trivial: there is only one implementation that adheres to the functor laws. \section{Catamorphisms over fixed points} \label{sec:cata} Now that we have added the fields for storing position information to our expression datatype it is time to adapt the producers and consumers of the new expression types. We start with the consumers, specifically the \emph{catamorphisms}. The application of fixed points in catamorphisms is well-known, but because it is so important for the rest of the story, we describe it again in this section. A more formal treatment is offered in \emph{Functional Programming with Bananas, Lenses, Envelopes and Barbed Wire} \cite{meijer91functional}. To understand what catamorphisms are, take a look at the use of the best-known one: |foldr| over lists. \begin{code} foldr :: (aT -> bT -> bT) -> bT -> [aT] -> bT \end{code} One way to understand the behavior of this function is to notice it replaces the two list constructors |(:)| and |[]| by programmer-supplied functions: the first two arguments to |foldr|. These two functions together are referred to as the \emph{algebra} for the list catamorphism. For example, |foldr (<+>) e| turns the list |x:y:z:[]| into |x <+> y <+> z <+> e|. The same can be done for any other algebraic datatype. For the |Expr| datatype, for example, we can create a function |cataExpr| that takes five arguments: one for every constructor. The function then recursively traverses any expression we give it, applying the appropriate functions to the fields of the constructors. 
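Spelled out, such a five-argument |cataExpr| might look like this (a sketch, using illustrative argument names and repeating the definitions of |Fix| and |ExprF|):

\begin{code}
newtype Fix fT = In { out :: fT (Fix fT) }

data ExprF rT = Num Int | Add rT rT | Sub rT rT | Mul rT rT | Div rT rT

cataExpr :: (Int -> a)      -- replaces Num
         -> (a -> a -> a)   -- replaces Add
         -> (a -> a -> a)   -- replaces Sub
         -> (a -> a -> a)   -- replaces Mul
         -> (a -> a -> a)   -- replaces Div
         -> Fix ExprF -> a
cataExpr num add sub mul dvd = f
  where
    f (In expr) = case expr of
      Num n    -> num n
      Add x y  -> add (f x) (f y)
      Sub x y  -> sub (f x) (f y)
      Mul x y  -> mul (f x) (f y)
      Div x y  -> dvd (f x) (f y)
\end{code}

For example, |cataExpr id (+) (-) (*) div| evaluates an expression tree to an |Int|.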
Instead of giving the five arguments separately to the function, we can group them together in a special datatype capturing expression algebras: \begin{code} data ExprAlg aT = ExprAlg { cataNum :: Int -> aT , cataAdd :: aT -> aT -> aT , cataSub :: aT -> aT -> aT , cataMul :: aT -> aT -> aT , cataDiv :: aT -> aT -> aT } \end{code} The catamorphism then becomes: \begin{code} cataExpr :: ExprAlg aT -> Fix ExprF -> aT cataExpr alg = f where f (In expr) = case expr of Num n -> cataNum alg n Add x y -> cataAdd alg (f x) (f y) Sub x y -> cataSub alg (f x) (f y) Mul x y -> cataMul alg (f x) (f y) Div x y -> cataDiv alg (f x) (f y) \end{code} Again, |ExprAlg|'s definition is reminiscent of |BareExpr|'s definition: every constructor of |BareExpr| has a corresponding field in |ExprAlg|, and the type of each such field directly corresponds to the fields of that constructor. Can we do the same trick as before and find a suitable abstraction? It turns out we have yet another use for |ExprF|: the types |ExprAlg aT| and |ExprF aT -> aT| are isomorphic. We can show this using some basic algebra rules if we view the types as polynomials. Datatypes become sums (one term per constructor) of products (one factor per constructor field) and functions |a -> b| become powers $b^a$. \begin{align*} |ExprF|(a) &= |Int| + a^2 + a^2 + a^2 + a^2\\ \\ |ExprAlg|(a) &= a^{|Int|} * (a^a)^a * (a^a)^a * (a^a)^a * (a^a)^a \\ &= a^{|Int|} * a^{a^2} * a^{a^2} * a^{a^2} * a^{a^2} \\ &= a^{{|Int|} + a^2 + a^2 + a^2 + a^2} \\ &= a^{|ExprF|(a)} \end{align*} We see that |ExprAlg aT| and |ExprF aT -> aT| are isomorphic. Let us see how using this new type changes the definition of |cataExpr|: \begin{code} cataExpr :: (ExprF a -> a) -> Fix ExprF -> a cataExpr f (In expr) = f (fmap (cataExpr f) expr) \end{code} This definition is significantly shorter than the previous one!
This is mostly because we no longer need to pattern match on |ExprF|'s constructors: this part is hidden in the function argument and the call to |fmap|. In fact, there is nothing anymore in this definition that is specific to |ExprF|. Therefore, |cataExpr|'s signature as it is above is too specific, both in name and in type. Writing the body in point-free style, the new function becomes: \begin{code} type Algebra fT aT = fT aT -> aT cata :: Functor fT => Algebra fT aT -> Fix fT -> aT cata f = f . fmap (cata f) . out \end{code} This means that for every datatype that is defined in explicit-recursion form, custom datatypes for their corresponding algebras and custom functions for their catamorphisms are no longer necessary: yet another benefit of reusing existing building blocks. As an added bonus, writing algebras in this form produces very elegant code. For example, evaluating expressions to integers (without taking division by zero into account) can be implemented as follows: \begin{code} exprEval :: Algebra ExprF Int exprEval expr = case expr of Num n -> n Add x y -> x + y Sub x y -> x - y Mul x y -> x * y Div x y -> x `div` y \end{code} \begin{code} > cata exprEval (runExpr (1 + 2 * 3)) 6 \end{code} Because |cata| takes care of the recursive positions, the algebra can assume the fields already contain the results of the evaluation. \section{Error algebras} \label{sec:errorcata} The catamorphisms above did not take the possibility of failure into account. Neither did they do anything with position information. Let us fix both of those issues. We have our evaluation algebra return an |Either String Int| to indicate failure or success. Furthermore, instead of working on |ExprF|, we have the algebra accept |Ann Bounds ExprF| (which is a |Functor| because |ExprF| is a |Functor|) so that we have position information available in case things go wrong. 
The algebra then looks like this: \begin{code} exprEval' :: Algebra (Ann Bounds ExprF) (Either (Bounds, String) Int) exprEval' (Ann z expr) = case expr of Num n -> Right n Add x y -> (+) <$> x <*> y Sub x y -> (-) <$> x <*> y Mul x y -> (*) <$> x <*> y Div x y -> do x' <- x y' <- y if y' == 0 then Left (z, "division by zero") else Right (x' `div` y') \end{code} Although the problem of redundancy and invalid shapes is gone, this algebra still suffers from the other problem the catamorphism at the end of section \ref{sec:grotetrapx} had: computations have to be written in applicative or monadic style. Also, the algebra has to pattern match on the |Ann| constructor to use or discard the position information. We can improve on this by making the possibility of failure explicit in the algebra type. We introduce a new type of algebra, called an \emph{error algebra}: \begin{code} type ErrorAlgebra fT eT aT = fT aT -> Either eT aT \end{code} The major difference between an |ErrorAlgebra fT eT aT| and an |Algebra fT (Either eT aT)| is that an |ErrorAlgebra| has an |fT aT| on the left-hand side of the function arrow instead of an |fT (Either eT aT)| and so assumes that when the catamorphisms were applied to children, \emph{they were successful} and produced |aT|'s instead of error values. This means that it is no longer necessary to use applicative style in the algebras and evaluation is once again pretty: \begin{code} exprEval :: ErrorAlgebra ExprF String Int exprEval expr = case expr of Num n -> Right n Add x y -> Right (x + y) Sub x y -> Right (x - y) Mul x y -> Right (x * y) Div x y | y == 0 -> Left "division by zero" | otherwise -> Right (x `div` y) \end{code} Whenever a node in the tree produces an error, it no longer fulfills its parent's assumption that it produces an |aT|. The catamorphism function will therefore have to propagate the error upwards in the tree, popping up as the result at the root. 
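Because an error algebra receives plain results rather than |Either| values, it can be applied directly at a single level; for example:

\begin{code}
> exprEval (Add 2 3)
Right 5
> exprEval (Div 6 0)
Left "division by zero"
\end{code}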
There are situations where several children simultaneously produce errors. Rather than arbitrarily picking one of the errors to bubble up, we |mappend| them together, introducing a |Monoid| constraint on the error type |eT|. Of course we cannot give error algebras to our generic |cata| function just like that, because |cata| expects normal algebras. Also, the applicative computations have not simply gone; they just need to be applied outside of the algebra. In \emph{Applicative programming with effects} \cite{mcbride08applicative}, McBride and Paterson show how to generically capture applicative computations over functors using the type class |Traversable|. That type class essentially provides one function: \begin{code} class Traversable t where traverse :: Applicative f => (a -> f b) -> t a -> f (t b) \end{code} Traversing |ExprF|, for example, looks like this: \begin{code} instance Traversable ExprF where traverse f expr = case expr of Num n -> pure (Num n) Add x y -> Add <$> f x <*> f y Sub x y -> Sub <$> f x <*> f y Mul x y -> Mul <$> f x <*> f y Div x y -> Div <$> f x <*> f y \end{code} By capturing |ExprF|'s traversal in a generic function, it can be reused in different circumstances, including our error algebras. By adding a |Traversable| constraint to our functors, we can convert any error algebra into a normal one, collecting errors as we go and producing an algebra we can give to |cata| again. \begin{code} cascade :: (Traversable fT, Monoid eT) => ErrorAlgebra fT eT aT -> Algebra fT (Except eT aT) cascade alg expr = case sequenceA expr of Failed xs -> Failed xs OK tree' -> case alg tree' of Left xs -> Failed xs Right res -> OK res \end{code} The |Except| datatype we use above is also described by Paterson and McBride. It is similar to |Either|, but it is designed to only be used in an applicative way so that sequencing two errors results in the sum of those errors (using |mappend|). 
The monadic |Either|, on the other hand, discards any errors other than the first. In this way, |Except| provides the collecting behavior we described a moment ago. Here is the datatype and its |Applicative| implementation: \begin{code} data Except e a = Failed e | OK a instance Monoid e => Applicative (Except e) where pure = OK OK f <*> OK x = OK (f x) OK _ <*> Failed e = Failed e Failed e <*> OK _ = Failed e Failed e1 <*> Failed e2 = Failed (e1 `mappend` e2) \end{code} Although we now have pretty algebras with error functionality, we have not addressed yet what to do with annotations a tree might have. To do something useful with the annotations, we need a new catamorphism function; one that works on |AnnFix|'s instead of normal |Fix|'s. If we have this new function take error algebras too, we can automatically couple potential errors with the annotations at the positions at which those errors arose, regaining all the functionality that the inelegant catamorphism above had: \begin{code} errorCata :: Traversable fT => ErrorAlgebra fT eT aT -> AnnFix xT fT -> Except [(eT, xT)] aT errorCata alg (In (Ann x expr)) = case traverse (errorCata alg) expr of Failed xs -> Failed xs OK expr' -> case alg expr' of Left x' -> Failed [(x', x)] Right v -> OK v \end{code} \section{Parsing annotated values} \label{sec:annoparse} Now that we have adapted the consumers of the new, annotated datatypes, it is time to adapt the \emph{producers} of the annotated datatypes to automatically fill in the annotations. There are many kinds of producers. One example is the palette in a graphical editor that creates new model objects. But the most popular producer by far is the parser that converts plain text to model objects, introducing structure. For that reason we focus only on adapting parsers. Haskell is well-known for the variety of parser libraries and tools that are available for it. We will choose one of these and show how to easily create values of the new expression type. 
Because we are developing solutions to be used in pure Haskell, we limit our options to parser \emph{libraries} only, excluding preprocessing tools such as Alex and Happy so that we can manipulate the parsers as first-class citizens in Haskell. Even then there are several options, including the Polish parsers developed at Utrecht University, Parsec and the ReadP library that ships with the standard libraries. It does not really matter which one of these we pick. Although there are some differences between the libraries, they are all very good at the part we are interested in: the actual parsing. For this thesis we have chosen to work with Parsec as it is probably the best-known of those options. We will use version \texttt{parsec-3.0.1}, available from Haskell's package repository Hackage\footnote{\url{http://hackage.haskell.org/package/parsec-3.0.1}}. From here on we assume the reader is familiar with the most important Parsec combinators. \subsection{A parser for |BareExpr|} In order to properly compare the parser for the new version of our expression datatype with that of the old version, we need to give both versions. We have given neither so far. Let us start with the old version for |BareExpr|, the version without the use of fixed points. One of the fundamental parsing operations is the parsing of a single symbol. There are various strategies for this used by different libraries. The parsers from UU, for example, allow you to supply a range of symbols to accept, specifying the lower and upper bound and using the |Ord| constraint to test membership of the range. Sometimes symbol equality is used, forcing the user to combine many smaller parsers if a range of symbols is to be accepted. Parsec uses the most flexible of all the possible approaches: a predicate of type |symbol -> Bool| that tells whether a symbol is to be accepted.
The function that accepts this predicate and produces the appropriate parser is often called |satisfy|; an early example of this can be found in \cite{hutton92higher}. This approach is the most flexible because the two other approaches from the previous paragraph can be expressed in terms of |satisfy|.\footnote{While most flexible, |satisfy| also gives the parsing library the least amount of information and precludes certain optimizations that otherwise would have been possible.} Our expression producer consists of two phases: the lexer converting characters to expression tokens and the actual parser converting these tokens to expression trees. Using two phases like this makes the second phase easier because we can discard all the whitespace and comment tokens from the input so that the parser does not have to bother with them. Our token type is as follows: \begin{code} data ExprToken = TNum Int | TPlus | TMinus | TStar | TSlash | TPOpen | TPClose | TSpace String isSpace (TSpace _) = True isSpace _ = False isNum (TNum _) = True isNum _ = False \end{code} The lexer producing these constructors is not very interesting and is unimportant for the rest of the story, so we omit it. That leaves only the parser: \begin{code} pToken = satisfy . (==) pExpr = chainl1 pTerm (Add <$ pToken TPlus <|> Sub <$ pToken TMinus) pTerm = chainl1 pFactor (Mul <$ pToken TStar <|> Div <$ pToken TSlash) pFactor = pNum <|> pToken TPOpen *> pExpr <* pToken TPClose pNum = (\(TNum n) -> Num n) <$> satisfy isNum \end{code} \subsection{Keeping track of the position during parsing} Now that we have a parser for the old |Expr|, we can build one for the new |Expr| and compare them. The new parser will have to use the constructors from |ExprF| (which have the same names as those of the old |Expr|) and insert position information at every level before the trees can be used as children of constructors higher up in the tree.
To properly annotate the values we build during parsing, the parser needs to keep track of the current position in the input. Parsec provides support for this in the form of (line, column) information, but our datatype |Bounds| requires us to keep track of ranges of whitespace around token sequences. It therefore makes sense to use a range of whitespace as our position information at any moment during the parsing, so we set |Range| as the type of the state parameter |u| in |ParsecT s u m a|, updating the range every time a token is parsed.\footnote{Type constructor |ParsecT|'s arguments |s|, |u|, |m| and |a| stand for stream type, user state, underlying monad and result type, respectively.} The easiest way to have the position information available every time we read a token is to couple each token in the input stream with its |Bounds|. Computing the proper bounds for each token needs to be done before discarding the whitespace tokens from the lexer's output, because we need the whitespace tokens to compute the margins. Therefore we combine the discarding of these tokens and the computation of the bounds in a single operation called |collapse|: \begin{code} collapse :: Symbol s => (s -> Bool) -> [s] -> [(s, Bounds)] collapse space ts = collapse' (0, symbolSize lefts) space rest where (lefts, rest) = span space ts collapse' :: Symbol s => Range -> (s -> Bool) -> [s] -> [(s, Bounds)] collapse' _ _ [] = [] collapse' left space (t:ts) = new : collapse' right space rest where (_, leftInner) = left rightInner = leftInner + symbolSize t rightOuter = rightInner + symbolSize rights right = (rightInner, rightOuter) (rights, rest) = span space ts new = (t, Bounds left right) \end{code} Most of the work is done in |collapse'|. Its first argument is its current offset in the stream, the left margin of the bounds of the next token. 
The right margins, which form the left margins in the recursive call, are computed by asking for symbol sizes: the reason for the |Symbol| type class, which we will look at in more detail in a few moments. The function |collapse| is the one that is exposed to clients of the API and does not need an offset parameter because it assumes it is at the start of the input. Apart from the input stream, the only other argument is a function that tells which of the symbols are to be discarded. By expressing this as a function |s -> Bool|, |collapse| assumes this can be decided locally by looking at the symbol in question alone. If a context-sensitive decision is needed, |collapse| could be adapted to a more complicated scheme. Rather than building |collapse| specifically for |ExprToken|s, we look at what properties we need of token types and capture these in a type class called |Symbol|: \begin{code} class Symbol s where unparse :: s -> String symbolSize :: s -> Int symbolSize = length . unparse \end{code} The first function, |unparse|, converts a symbol back to a string, exactly the way it was encountered during parsing. The second function, |symbolSize|, is the one we used above. Its default implementation may be overridden for efficiency, if necessary. The observant reader may have noticed that |symbolSize| is called on both single symbols and lists of symbols in the definition of |collapse|. This is possible because an extra instance is provided for lists: \begin{code} instance Symbol s => Symbol [s] where unparse = concatMap unparse symbolSize = sum . fmap symbolSize \end{code} To recap, the new parser will have |Range| values as user state and will consume tokens coupled with their position information.
This is reflected by a new type synonym |P|: \begin{code} type P s = ParsecT [(s, Bounds)] Range \end{code} The type constructor |ParsecT| has been given only two arguments and still expects its third and fourth argument |m| and |a|, so to fully specify |P|, it needs three arguments |s|, |m| and |a|. Every time a new token is consumed, the state needs to be updated. We can hide this in the new definition of |satisfy|. That function will be defined in terms of one of the fundamental building blocks of Parsec parsers: the function |tokenPrim|. Its type is: \begin{code} tokenPrim :: Stream s m t => (t -> String) -> (SourcePos -> t -> s -> SourcePos) -> (t -> Maybe a) -> ParsecT s u m a \end{code} The first argument is a pretty-printing function for the symbol types. We can use |unparse| from our |Symbol| class here. The second argument tells how to update the source position. We will not really use this information during parsing, but it is still used by Parsec when generating error messages. For that reason, we will update it as best as we can using the position information we maintain in the state. The third argument is the predicate passed to |satisfy| that tells which token is expected. Our implementation of |satisfy| then becomes: \begin{code} satisfy :: (Monad m, Symbol s) => (s -> Bool) -> P s m s satisfy ok = do let pos _ (_, bounds) _ = newPos "" 0 (fst (rightMargin bounds) + 1) let match x@(tok, _) | ok tok = Just x | otherwise = Nothing (tok, bounds) <- tokenPrim (unparse . fst) pos match setState (rightMargin bounds) return tok \end{code} Figuring out the current position in the stream is now a simple matter of asking for the current user state: \begin{code} getPos :: Monad m => P s m Range getPos = getState \end{code} \subsection{Building recursively annotated values} Now that we have the parsing infrastructure in place, it is time we look at how to build |AnnFix| values. 
Let us look at the simplest expression possible and the only ones that form leaves in expression trees: number literals. In |ExprF|, the constructor for number literals has type: \begin{code} Num :: Int -> ExprF rT \end{code} A number with its number field set, such as |Num 1|, has a type that can be specialized to |AnnFix1 Bounds ExprF|. Now we just need to wrap the |Bounds| around it using |mkAnnFix|. We ask for the left margin right before parsing the literal token and the right margin after: \begin{code} type ExprParser = P ExprToken Identity pNum :: ExprParser (AnnFix Bounds ExprF) pNum = unit $ (\(TNum n) -> Num n) <$> satisfy isNum unit :: Monad m => P s m (AnnFix1 Bounds f) -> P s m (AnnFix Bounds f) unit p = do left <- getPos x <- p mkBounded left x mkBounded :: Monad m => Range -> AnnFix1 Bounds f -> P s m (AnnFix Bounds f) mkBounded left x = do right <- getPos return (mkAnnFix (Bounds left right) x) \end{code} Instead of putting this in one function, we have fleshed out some operations we can reuse in a moment. Firstly we introduce a new type synonym |ExprParser| for our expression parser: it will consume |ExprToken|s and use an underlying |Identity| monad. The implementation of |pNum| is equal to the one in the old parser except for the call to |unit|. Function |unit| takes a parser that yields an |AnnFix1| and turns it into a parser that yields an |AnnFix| by wrapping the position information around it. This is a useful combinator for parsers that produce simple nodes such as number literals. Even if a parser does not produce simple nodes, it is very often the case that the call to retrieve the right margins and the call to |mkAnnFix| are right next to each other. This is reflected in |mkBounded| which, when given the left margin, asks for the right margin itself and then builds an |AnnFix|. In the old parser, the branches of the expression trees were built using |chainl1|. 
Since we were not annotating values back then, there was no distinction in type between values with and without annotation like there is now. We can adapt |chainl1| to make that distinction explicit. \begin{code} chainl1 :: Monad m => P s m (AnnFix Bounds f) -> P s m (AnnFix Bounds f -> AnnFix Bounds f -> AnnFix1 Bounds f) -> P s m (AnnFix Bounds f) \end{code} Here the first argument is again the parser for the operands and the second the parser for the binary operator. In our examples a typical operator is |Add :: rT -> rT -> ExprF rT|. If we give annotated children to |Add|, we end up with |AnnFix1|'s again, which is reflected in |chainl1|'s type. It is |chainl1|'s responsibility to insert the right position information. The full definition of |chainl1| follows: \begin{code} chainl1 px pf = do left <- getPos px >>= rest left where rest left = fix $ \loop x -> option x $ do f <- pf y <- px mkBounded left (f x y) >>= loop \end{code} Apart from the new |chainl1|, nothing else in the old parser needs to change. In fact, the code is syntactically identical to the previous implementation: \begin{code} pExpr :: ExprParser (AnnFix Bounds ExprF) pExpr = chainl1 pTerm (Add <$ pToken TPlus <|> Sub <$ pToken TMinus) pTerm :: ExprParser (AnnFix Bounds ExprF) pTerm = chainl1 pFactor (Mul <$ pToken TStar <|> Div <$ pToken TSlash) pFactor :: ExprParser (AnnFix Bounds ExprF) pFactor = pNum <|> pToken TPOpen *> pExpr <* pToken TPClose \end{code} Of course, this is only a small example and there are numerous other combinators commonly used in parsing. However, all of these can be adapted to position-saving variants as the API of building annotated values is very general.
Just for good measure, here is the position-saving version of |chainr1|: \begin{code} chainr1 :: Monad m => P s m (AnnFix Bounds f) -> P s m (AnnFix Bounds f -> AnnFix Bounds f -> AnnFix1 Bounds f) -> P s m (AnnFix Bounds f) chainr1 px pf = fix $ \loop -> do left <- getPos x <- px option x $ do f <- pf y <- loop mkBounded left (f x y) \end{code} Note the differences between |chainr1| and |chainl1|: in both cases the parsing is done in a right-recursive way because Parsec does not like left-recursive grammars. In the case of |chainl1|, however, the \emph{value} is built in a left-recursive way by making the left-hand side an argument to the loop (called |x| in the code). Another important difference is that in |chainl1| the call to |getPos| is outside the loop while in |chainr1| it is inside. \section{Exploring annotated trees} \label{sec:annoops} \subsection{Representing structural tree selections} Given an annotated subtree of type |AnnFix x f|, we can easily find the corresponding text selection: simply extract the |Bounds| value in the |Ann| constructor. To do the conversion in the other direction, we need to search the tree for a node whose bounds match the text selection. In this section we will introduce functions that do just that. But what should be the result of such a function? It could just yield the subtree that was selected, but then the context of that subtree is lost. Grote Trap solved this by returning the path from the root to the subtree, modeled as a list of child indices |[Int]|. This gives enough context, but it is untyped: such a path could apply to any tree, and you cannot be sure the path is actually valid for a certain tree until you follow it down to the selected node. The traditional, functional way of representing structural selections is the zipper structure, described by G\'erard Huet in 1997 \cite{huet1997zipper}.
A zipper datatype is derived from another datatype: what the zipper looks like exactly depends on the shape of the original datatype. For example, the zipper for lists is different from the zipper for the arithmetic expressions we have been using as examples. Conor McBride showed how to automatically find out what a datatype's zipper type looks like \cite{mcbride01thederivative}. A value of a zipper type represents the selection of one particular node in a tree. The selected node is called the zipper's \emph{focus}. Once you have such a zipper value, you can move around the tree, stepping from one node to its sibling, child or parent in $O(1)$ time. The zipper also allows the current focus to be updated in $O(1)$ time. A zipper value is a primary data source: you don't need to hang on to the original, selection-less tree because it is implicit in the zipper value. Our programs so far have been generic over the particular shape functor |f|, albeit under some class constraints to have access to certain functions. Can we also generically derive zippers from functors using this technique? We are unsure whether this is possible and what the type class constraint would look like. However, we can approximate the idea of the zipper and build one generically for functors if we give up the possibility of $O(1)$-time updates of the zipper's focus. The data structure looks like this: \begin{code} data Zipper a = Zipper { zFocus :: a , zUp :: Maybe (Zipper a) , zLeft :: Maybe (Zipper a) , zRight :: Maybe (Zipper a) , zDown :: Maybe (Zipper a) } \end{code} This datatype can be used for tree selections of any tree, not just those expressed in terms of base functors. However, in this section we will only construct and use zipper values of type |Zipper (Fix f)|, for some functor |f|. The reason that |Zipper| does not allow $O(1)$ updates is that its values contain cycles and Haskell does not allow destructive updates.
If we were to change the value of a zipper's |zFocus| field, it would no longer be a consistent value, because the other fields---|zUp|, |zLeft|, |zRight| and |zDown|---would still point to old zipper values. To make the zipper consistent, all these values would have to be rebuilt, recursively, resulting in a completely new zipper. This takes more than $O(1)$ time. The original, true zipper does not have this problem because it contains no redundant information and no cycles. Despite this deficiency, |Zipper| is still usable as a representation of structural selections: there is the focus on a particular node, and there is also context available. Let us look at how to build a zipper value from a |Fix f|. We start by giving the type signature: \begin{code} enter :: Foldable f => Fix f -> Zipper (Fix f) \end{code} There is a new class constraint on the functor type |f| called |Foldable|, available in the standard libraries in |module Data.Foldable|. In power, |Foldable| sits below |Traversable|: every |Traversable| is |Foldable| and also a |Functor|, as reflected by |Traversable|'s superclasses. The heart of |Foldable| is the |fold| function: \begin{code} fold :: (Foldable t, Monoid m) => t m -> m \end{code} This function says that every |Foldable| can be seen as a container of elements, and these elements can be visited and combined using |mappend| and |mempty|. If the list monoid |[a]| is chosen, the result is a list with all the elements in the container. This is exactly what the standard function |toList :: Foldable t => t a -> [a]| does. For our fixed point functors, |f|'s elements are its children, and therefore we can use |toList| to obtain the functor's children, which is exactly what we need when constructing zippers.
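To make this concrete, here is one possible |Foldable| instance for the |ExprF| base functor from the previous chapter, written in terms of the class method |foldMap| (with GHC's \texttt{DeriveFoldable} extension such an instance can also be derived automatically):

\begin{code}
instance Foldable ExprF where
  foldMap _ (Num _)  = mempty
  foldMap g (Add l r) = g l `mappend` g r
  foldMap g (Sub l r) = g l `mappend` g r
  foldMap g (Mul l r) = g l `mappend` g r
  foldMap g (Div l r) = g l `mappend` g r
\end{code}

With this instance, |toList (Add x y)| yields |[x, y]|, while |toList (Num 2)| yields the empty list: leaves have no children.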
The |enter| function is defined in terms of a helper function: \begin{code} enter f = fromJust (enter' Nothing Nothing [f]) enter' :: Foldable f => Maybe (Zipper (Fix f)) -> Maybe (Zipper (Fix f)) -> [Fix f] -> Maybe (Zipper (Fix f)) enter' _ _ [] = Nothing enter' up left (focus@(In f) : fs) = here where here = Just (Zipper focus up left right down) right = enter' up here fs down = enter' here Nothing (toList f) \end{code} The helper function |enter'| is given more context: its first argument is the parent (if any) of the node that will be produced, and its second argument is its left sibling (if it exists). The third and final argument is the list of the resulting node's right siblings, still to be processed. The |where|-clause builds the current focus and the recursive values and gives them names, to be passed on to the recursive function calls. Building the tree in this way ensures optimal sharing, a process that is sometimes called \emph{tying the knot}.\footnote{More examples of \emph{tying the knot} can be found at \url{http://www.haskell.org/haskellwiki/Tying_the_Knot}.} That the sharing is optimal can be demonstrated using \emph{Vacuum}, a Haskell library for visualizing the heap. Figure \ref{fig:vacuum} shows the zipper for the parse tree of \texttt{2 + 3 * 4} using the UbiGraph frontend for Vacuum.\footnote{\url{http://hackage.haskell.org/package/vacuum-ubigraph}} The fact that the tree is rendered shows that the zipper structure---even though it is cyclic---consumes finite space, because otherwise Vacuum would not produce any results. \begin{figure}[t] \centering \includegraphics[width=\textwidth]{ZipperVacuum.png} \caption{A visualization of the zipper for the parse tree of \texttt{2 + 3 * 4} using the \texttt{vacuum-ubigraph-0.1.0.3} package.} \label{fig:vacuum} \end{figure} Once the zipper structure has been created, it can be traversed using the record selectors |zDown|, |zUp|, |zLeft| and |zRight|.
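The knot-tying technique used by |enter'| can be illustrated in isolation with a much smaller example: a circular, doubly linked list in which every node references its neighbours. The datatype and names below are ours, introduced only for this illustration:

\begin{code}
data DNode a = DNode { val :: a, prev :: DNode a, next :: DNode a }

-- Two mutually referencing nodes, each defined in terms of the other.
twoCycle :: a -> a -> DNode a
twoCycle x y = a
  where a = DNode x b b
        b = DNode y a a
\end{code}

Just as with the zipper, the cyclic structure occupies finite space thanks to sharing: |val (next (twoCycle 1 2))| is |2|, and following |next| once more leads back to the node holding |1|.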
From anywhere in the zipper, we can recover the original tree again by traversing up as far as possible and then requesting the focus: \begin{code} leave :: Zipper a -> a leave z = maybe (zFocus z) leave (zUp z) \end{code} Navigating down always selects the first child. A useful helper function navigates down to the $n$th child instead. Like all other traversal functions, it might fail, so its result is wrapped in a |Maybe|: \begin{code} child :: Int -> Zipper a -> Maybe (Zipper a) child 0 = zDown child n = child (n - 1) >=> zRight \end{code} Zippers make the traversal functions discussed in the following sections easier to define and understand. \subsection{Annotation-guided exploring} Now that we know the return type of the function that converts text selections to tree selections, we can actually build such a function. So far we have built functions that abstract over the specific annotation type, rather than specialized functions for trees annotated with position information. We will do that again here. A naive implementation of the conversion would visit the entire tree, starting at the root and at each node searching recursively down and to the right until a node is found whose bounds match the query range. But the laws outlined in section \ref{sec:bounds} give us a lot of information and allow us to prune entire subtrees in some cases: \begin{itemize} \item If the left offset of the query range is strictly less than the current node's inner right offset, we know we do not have to look at the node's right siblings because law 2 says that children appear in order and their inner ranges do not overlap. \item If the query range is not contained within the current node's outer range, we know we do not have to consider the node's children anymore, by law 3.
\end{itemize} Generalizing these choices for arbitrary annotations |x|, we can encode the choices using a function type |x -> ExploreHints|, where |ExploreHints| is defined as follows: \begin{code} data ExploreHints = ExploreHints { matchHere :: Bool , exploreDown :: Bool , exploreRight :: Bool } \end{code} Although uncommon, a parse tree may be constructed in such a way that a parent and its single child have the exact same bounds. If the query range matches these bounds, which of the two nodes should then be chosen? We will choose the deepest node. But we cannot make this decision in general: if we abstract over the annotation type, we cannot make any assumptions about the domain anymore. For that reason, the general function will return the full list of matching tree selections. Our complete exploration function looks like this: \begin{code} explore :: Foldable f => (x -> ExploreHints) -> AnnFix x f -> [Zipper (AnnFix x f)] explore hints = explore' hints . enter explore' :: Foldable f => (x -> ExploreHints) -> Zipper (AnnFix x f) -> [Zipper (AnnFix x f)] explore' hints root = [ z | (dirOk, zs) <- dirs, dirOk (hints x), z <- zs ] where In (Ann x _) = zFocus root dirs = [ (matchHere, [root]) , (exploreDown, exploreMore (zDown root)) , (exploreRight, exploreMore (zRight root)) ] exploreMore = maybe [] (explore' hints) \end{code} The actual work is delegated to |explore'|, which takes a zipper as input. It is easier to work on zippers, because they allow abstraction over the navigation of the tree. The list comprehension explores the tree recursively in three relevant directions: first the current node, then down and finally to the right, but only where the hints allow it. Now we can express our positional conversion function in terms of |explore|: \begin{code} selectByRange :: Foldable f => Range -> AnnFix Bounds f -> Maybe (Zipper (AnnFix Bounds f)) selectByRange range@(left, _) = listToMaybe . reverse .
explore hints where hints bounds@(Bounds _ (ir, _)) = ExploreHints { matchHere = range `rangeInBounds` bounds , exploreDown = range `rangeInRange` outerRange bounds , exploreRight = left >= ir } \end{code} Currently |explore| yields the topmost matching node first, so |selectByRange| reverses the returned list and wraps the first result in a |Just|. Another use case is selecting a single position within the text, rather than a range. To that end we will also define a |selectByPos|, but as before we will first define a generalized version and then define |selectByPos| in terms of it. \begin{code} findLeftmostDeepest :: Foldable f => (x -> Bool) -> (AnnFix x f) -> Maybe (Zipper (AnnFix x f)) findLeftmostDeepest down = listToMaybe . reverse . explore hints where hints x | down x = ExploreHints True True False | otherwise = ExploreHints False False True \end{code} Rather than using |ExploreHints| again, a |Bool| suffices in this case: given an annotation |x|, should we go down or continue searching to the right? Such a query is easy to express in terms of |explore|. Again, |explore|'s result is reversed before wrapping the head of the list in a |Just|. Now |selectByPos| can be written as follows: \begin{code} selectByPos :: Foldable f => Int -> AnnFix Bounds f -> Maybe (Zipper (AnnFix Bounds f)) selectByPos pos = findLeftmostDeepest (posInRange pos . innerRange) \end{code} \subsection{Repairing and navigating text selections} The last two use cases we will look at are the repair of invalid text selections and navigation based on text selections. Section \ref{subsec:structures} distinguished between invalid and valid text selections: a text selection is valid with respect to a parse tree if it corresponds to a structural selection in this parse tree. But what if a text selection is invalid?
Invalid selections are not completely useless: we can make a good estimate as to what piece of text the user intended to select, based on the erroneous text selection and the list of all the text selections that \emph{would} have been valid. That is exactly what |repairBy| does: \begin{code} repairBy :: (Foldable f, Ord dist) => (Range -> Range -> dist) -> AnnFix Bounds f -> Range -> Bounds repairBy cost tree range = head (sortOn (cost range . innerRange) (validBounds tree)) sortOn :: Ord b => (a -> b) -> [a] -> [a] sortOn = sortBy . comparing validBounds :: Foldable f => AnnFix Bounds f -> [Bounds] validBounds (In (Ann b f)) = b : concatMap validBounds (toList f) \end{code} Function |repairBy| takes a cost function, a tree and a text selection. It computes all the selections that would have been valid using |validBounds| and sorts them by the cost function, which compares the query range with each candidate's inner range. For this |sortOn| is used, which sorts a list based on a property of its elements.\footnote{Luke Palmer has written a blog post on |sortOn|: \url{http://lukepalmer.wordpress.com/2009/07/01/on-the-by-functions/}} Then the first element of the resulting, sorted list is returned. Using |head| here is safe because the list is guaranteed to contain at least one element: the bounds of the root of the tree. One possible cost function is |distRange|, which takes the sum of the absolute differences of two ranges' endpoints. Function |repair| is |repairBy| specialized with this particular cost function: \begin{code} repair :: Foldable f => AnnFix Bounds f -> Range -> Bounds repair = repairBy distRange distRange :: Range -> Range -> Int distRange (l1, r1) (l2, r2) = abs (l1 - l2) + abs (r1 - r2) \end{code} Finally we would like to express navigation based on text selections. Suppose the user has selected a piece of text that corresponds neatly to a structural selection. Now the user wants the selection to expand to the direct parent of the selected node.
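For example, in the style of the earlier GHCi sessions, a query range whose endpoints are off by one and two positions respectively gives a cost of three:

\begin{code}
> distRange (3, 7) (2, 9)
3
\end{code}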
We can accomplish this by translating the text selection to a zipper, moving up in the zipper and then translating back to text selection. We can capture these actions in a function if we can express the act of moving around within a zipper in a type. This type is not hard to find: it is exactly the type of the record selectors of |Zipper|. All four selectors have the type |Zipper a -> Maybe (Zipper a)|. We can compose such functions with the Kleisli arrow composition operator |(>=>) :: Monad m => (a -> m b) -> (b -> m c) -> (a -> m c)|. From this type we can see that the type of the composition of two movements, e.g.~|zDown >=> zRight|, is also |Zipper a -> Maybe (Zipper a)|. The index of the zipper is always polymorphic in such functions. We can express this by encoding movements in a |newtype|: \begin{code} newtype Nav = Nav { nav :: forall a. Zipper a -> Maybe (Zipper a) } \end{code} Besides composition of |Nav|s, we can also think of an identity navigation: staying in the same position. This makes |Nav| a nice |Monoid|: \begin{code} instance Monoid Nav where mempty = Nav return mappend (Nav n1) (Nav n2) = Nav (n1 >=> n2) \end{code} Now that the type of navigations is known, we can write navigation based on text selections as follows: \begin{code} moveSelection :: Foldable f => AnnFix Bounds f -> Nav -> Range -> Maybe Bounds moveSelection tree (Nav nav) range = (rootAnn . zFocus) <$> (selectByRange range tree >>= nav) rootAnn :: AnnFix x f -> x rootAnn (In (Ann x _)) = x \end{code} Naively, |moveSelection| would return |Maybe Range| rather than |Maybe Bounds|, but |Bounds| contains strictly more information, and we return all the information we have after finding the newly selected node. 
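As a small example of composing movements, a navigation that moves from a node to the second child of its parent can be written as a |mconcat| of primitive steps (the name |toSecondChildOfParent| is ours):

\begin{code}
toSecondChildOfParent :: Nav
toSecondChildOfParent = mconcat [Nav zUp, Nav zDown, Nav zRight]
\end{code}

Because |mappend| composes with |(>=>)|, the movements are performed from left to right, and the whole navigation fails (yields |Nothing|) as soon as any single step does. Passing such a value to |moveSelection| moves a text selection accordingly.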
\section{Summing up} One of the disadvantages of the approach taken in this section is that we had to adapt our original |Expr| datatype to make the type of its children variable, while our original desire was to develop a solution for adding position information without having to change anything about the existing datatype. However, in return for this sacrifice we have gained many benefits. By building the datatypes we require from smaller building blocks (|ExprF|, |Ann|, |Fix|) we have gained a generic scheme for expressing morphisms, including catamorphisms and error catamorphisms. By adding certain constraints to our trees, such as |Traversable|, we can convert between normal algebras and error algebras. We can also generically discard annotations and use both normal algebras and error algebras on both normal and annotated trees. In terms of producers, the parser we built originally for |Expr| did not need many changes to work with the new annotated expressions because most of the work was hidden in the combinators. The building blocks also allowed us to generically express structural selections (zippers) and exploration functions. For trees annotated with position information, this means we are able to convert between text selections and structural selections, as well as fix invalid selections. There is one major disadvantage that was introduced by switching to the explicit recursion method: we can no longer work with families of datatypes. We have not noticed this before because our examples so far have only focused on arithmetic expressions, which do not need mutually recursive datatypes. In \emph{Data types \`{a} la carte} \cite{swierstra08datatypes}, Wouter Swierstra shows how to take the fixpoint of multiple datatypes using coproducts (lifted sums). However, in his solution there is freedom in which datatype to pick at every recursive position: freedom we do not want, because we want to specify which exact datatype to recurse into. 
In the next section we will look at generic programming techniques to solve this problem properly.