\section{From text selection to expression}

We will construct parser combinators that automatically derive a mapping from valid text selections to expression values.

\subsection{A type class for parsers}

Consider the following type class:

> class (Functor p, Functor (ParseResult p)) => Parser p where
>   type Input p        :: *
>   type ParseResult p  :: * -> *
>   satisfy             :: (Input p -> Bool) -> p (Input p)
>   runParser           :: p a -> [Input p] -> ParseResult p a

This class provides a foundation for parser types. A parser consumes a list of symbols; the |Input| function tells what the type of these symbols is. As can be seen from the header, a parser is a functor; its enclosed type is the type of the value the parser produces on a successful parse. |satisfy| is the most elementary of parser constructors: given a predicate, it builds a parser that recognizes exactly one element that matches that predicate. |runParser| deconstructs a parser given a list of input symbols. The result is wrapped in whatever functor |ParseResult| returns for the parser; this type is usually able to indicate failure, and commonly used type constructors for this are |Maybe| and |Either error|.

The |Parser| class provides no parser combinators, but such functionality can be added by requiring a parser to be also an instance of |Applicative|, |Alternative| and/or |Monad|.

Combining |Parser| with these other type classes, it is possible to write some basic combinators:

> symbol :: (Parser p, Eq (Input p)) => Input p -> p (Input p)
> symbol = satisfy . (==)
>
> token :: (Parser p, Eq (Input p), Applicative p) => [Input p] -> p [Input p]
> token = sequenceA . map symbol
> 
> choice :: Alternative f => [f a] -> f a
> choice = asum
> 
> oneOf :: (Parser p, Alternative p, Eq (Input p)) => [Input p] -> p (Input p)
> oneOf = choice . map symbol

Here |sequenceA| is from package Data.Traversable and |asum| is defined in Data.Foldable.

If a parser can be specified only in terms of functions from these type classes, they are polymorphic in the actual parser implementation:

> type ParserA  s a = forall p# (Parser p, Alternative p,  Input p ~ s) => p a
> type ParserM  s a = forall p# (Parser p, MonadPlus p,    Input p ~ s) => p a


\subsection{Basic parser combinators}

\todo{Make this section fit in the flow.}

Several parser combinator libraries for Haskell exist nowadays. Some of the better known ones are Parsec \cite{leijen01parsec} and UU.Parsing \cite{swierstra1999fec}. Because these are efficient and mature, they are also reasonably complex. Instead of using those, we will pay the price of less efficiency and use a very simple and transparent parser type so that it is easier to see what is going on exactly and to adapt the parser:

> newtype SucParser s a = SucParser (S.StateT [s] [] a)
>   deriving (Functor, Applicative, Alternative, Monad, MonadPlus, MonadState [s])

To see why this is a parser, expand the type according to |StateT|'s definition. |StateT| is the state monad transformer \cite{jones95functional}, defined in the standard module |Control.Monad.State|. The type |SucParser symbol result| is then equivalent to |[symbol] -> [(result, [symbol])]|, which is exactly the type of the parser described by Hutton \cite{hutton92higher} and later by Fokker \cite{fokker95functional}. For brevity, from now on we will substitute |s| for |symbol| and |a| for |result|.

The type is a function that takes a list of symbols as input, and produces what Wadler calls a \emph{list of successes} \cite{wadler85list}. Each success consists of a value of the result type and a list of yet to be parsed input symbols.

Hutton and Fokker's work describe several useful parser combinators that allow the user to build parsers that recognise context-free grammars. We will combine those combinators with recent work by McBride and Paterson who have generalised the `applicative style' used by these combinators, using names that are now standard in the Haskell libraries.

The sequencing parser combinator |<*>| for example has been moved into the |Applicative| type class \cite{mcbride08applicative}. This type class is available in the standard module |Control.Applicative| and looks as follows:

> class Functor f => Applicative f where
>   pure   :: a -> f a
>   (<*>)  :: f (a -> b) -> f a -> f b

McBride and Paterson have shown that every monad is an |Applicative|, and because |StateT s m| is a monad for every monad |m|, the |Applicative| instance for |StateT s m| turns out to be trivial indeed:

> instance Monad m => Applicative (StateT s m) where
>   pure   = return
>   (<*>)  = ap

The choice parser combinator |<||>| has been abstracted into the |Alternative| type class which is defined thusly:

> class Applicative f => Alternative f where
>   empty  :: f a
>   (<|>)  :: f a -> f a -> f a

In order to use |<||>| as a parser combinator we have to supply an instance for |Alternative| for our parser type as well:

> instance MonadPlus m => Alternative (StateT s m) where
>   empty                    = StateT $ const mzero
>   StateT s1 <|> StateT s2  = StateT $ \s -> s1 s `mplus` s2 s

Instead of choosing |MonadPlus m| as context, we could have chosen |Alternative m| instead, because these type classes are isomorphic. But because |Alternative (StateT s m)| requires |Applicative (StateT s m)| which in turn requires |Monad m|, we've chosen the |MonadPlus m| context because it implies the required |Monad m|.

Although there might be other correct and sensible implementations for |Monad []| and |MonadPlus []|, the default ones do exactly what we want here, namely implement the backtracking monad (i.e.~list of successes).

Because of these two instances, we get several other interesting combinators for free (here specialised to the SucParser type), including:

> (<$>)     :: (a -> b)       -> SucParser s a -> SucParser s b
> optional  :: SucParser s a  -> SucParser s (Maybe a)
> some      :: SucParser s a  -> SucParser s [a]
> many      :: SucParser s a  -> SucParser s [a]

Although we have plenty of ways to combine parsers into new ones, we can still only create parsers that don't parse anything, because the only functions that construct parsers are |pure| and |empty|, which both don't consume any input. The most elementary parser that does consume input is |satisfy|:

> sucSatisfy :: (s -> Bool) -> SucParser s s
> sucSatisfy pred = do
>   x:xs <- get
>   if pred x
>     then do put xs; return x
>     else fail "parse error"

Function |satisfy| takes a predicate that determines which symbols match and produces a parser that matches exactly one such symbol. If the parser is used on the empty string, the pattern match |x:xs <- get| will fail and the list monad's |fail| will be called, producing the empty list of successes.

Next we define the |runSucParser| function that given input runs a parser and yields the first successful parse, if any:

> runSucParser :: SucParser s a -> [s] -> Maybe a
> runSucParser p input = 
>   listToMaybe [ x | (x, []) <- runStateT p input ]

We now have all the ingredients for making |SucParser| an instance of our |Parser| type class:

> instance Parser (SucParser s) where
>   type Input        (SucParser s) = s
>   type ParseResult  (SucParser s) = Maybe
>   satisfy    = sucSatisfy
>   runParser  = runSucParser


\subsection{What are valid selections?}

Earlier we said that we would build parser combinators that automatically produce a mapping from valid text selections to values produced by the parser, but what constitutes a valid selection? It turns out that this depends entirely on the application at hand. We'll take a small arithmetic expression language as example.

It seems it is usually desirable that, for example, if the number $192374$ is part of the input, then selecting the number as a whole counts as a good selection, but selecting only, say, $923$ does not. Another example: in the expression $1 + 2 \times 3$, the $1 + 2$ part is invalid as selection because the $\times$ operator has a higher priority than the + and so the expression should be read as $1 + (2 \times 3)$.

Usually the grammar reflects these priority rules and therefore it seems natural to mark some production rules as \emph{units}, meaning that every sentence produced by such a rule counts as a valid selection. In the following grammar for arithmetic expressions, every production rule would be marked:

> expr  ::=  sum
> sum   ::=  mul
>       |    mul '+' sum
> mul   ::=  atom
>       |    atom '*' mul
> atom  ::=  number
>       |    '(' sum ')'

\subsection{Automatically building a parse tree}

One way to derive a mapping from text selection to AST nodes is to simply construct the complete parse tree as the input is parsed. The parse tree, as opposed to the AST, contains no semantic information but instead is made up of all the consumed symbols. Then, as above, some subtrees will have been produced by marked production rules, and these subtrees will be flagged to correspond to nodes in the AST. The parse tree data type looks like this, with the |Unit| constructor denoting a flagged node:

> data  ParseTree s v
>   =   Symbol  s
>   |   Branch     [ParseTree s v]
>   |   Unit    v  (ParseTree s v)

For branches we create a smart constructor:

> branch :: [ParseTree s v] -> ParseTree s v
> branch [q]  = q
> branch qs   = Branch qs

The type parameter |v| is not the same as the type parameter |a| of the previous parser, because the type |a| is a local type that changes all the time, for example when the |<$>| or |<*>| operators are used. On the other hand, |v| is constant over the whole tree, so that the type of values tied to |Unit| nodes is known at the root of the tree. This means that whenever a unit is introduced the type of the value constructed by the parser at that point must equal |v|.

The parameter |s| is the symbol type. For parsers directly processing text this will be |Char|, but for a two-phase parser this could be some token type.

We will now define a parser transformer that takes an existing parser and transforms it in such a way that a parse tree will automatically be built when the parser is run on input. Because this parser transformer builds parse trees, we call it |TreeParser|.

> newtype TreeParser p v a = TreeParser (p ([ParseTree (Input p) v], a))

Looking at the constructor, a |TreeParser| is equal to its underlying parser |p| except that it not only promises to produce a value of type |a|, but also a list of values of the |ParseTree| type we saw above. The |v| parameter is given as argument to |TreeParser|, whereas the symbol type |s| for the |ParseTree| is obtained from the |Input| type function from the |Parser| type class.

To make |TreeParser| an instance of class |Parser|, we will first need to define implementations for |satisfy| and |runParser|:

> treeSatisfy :: Parser p => (Input p -> Bool) -> TreeParser p v (Input p)
> treeSatisfy pred = TreeParser $ (\s -> ([Symbol s], s)) `fmap` satisfy pred
> 
> runTreeParser :: Parser p => TreeParser p v a -> [Input p] -> ParseResult p a
> runTreeParser (TreeParser p) xs = snd `fmap` runParser p xs

Function |treeSatisfy| simply delegates satisfy to the underlying parser, then wraps the consumed symbol in a |Symbol| constructor before putting the produced parser in the |TreeParser| constructor. |runTreeParser| discards the produced parse trees, but we will later write functions for retrieving the trees. Note that |runTreeParser|'s type signature refers to the |ParseResult| type function of the underlying parser. Now a |Functor| instance is the last thing blocking the way for a |Parser| instance:

> instance Functor p => Functor (TreeParser p v) where
>   fmap f (TreeParser p) = TreeParser (fmap (id *** f) p)
> 
> instance Parser p => Parser (TreeParser p v) where
>   type Input        (TreeParser p v) = Input p
>   type ParseResult  (TreeParser p v) = ParseResult p
>   satisfy    = treeSatisfy
>   runParser  = runTreeParser

The tree parser won't be useful until we get some parser combinators. Choice is almost trivial, so we will build that first.

> instance Alternative p => Alternative (TreeParser p v) where
>   empty = TreeParser empty
>   TreeParser p <|> TreeParser q = TreeParser (p <|> q)
> 
> instance MonadPlus p => MonadPlus (TreeParser p v) where
>   mzero = TreeParser mzero
>   TreeParser p `mplus` TreeParser q = TreeParser (p `mplus` q)

Sequencing is a bit more involved, because the two lists of parse trees have to be concatenated:

> instance Applicative p => Applicative (TreeParser p v) where
>   pure v = TreeParser $ pure ([], v)
>   TreeParser p <*> TreeParser q = TreeParser $
>     (\(ts, f) (ts', v) -> (ts ++ ts', f v)) <$> p <*> q
> 
> instance Monad p => Monad (TreeParser p v) where
>   return v = TreeParser $ return ([], v)
>   TreeParser pv >>= f = TreeParser $ do
>     (ts, v) <- pv
>     let TreeParser q = f v
>     (ts', w) <- q
>     return (ts ++ ts', w)

Running a parser now will result in the full input being stored in a flat list of |Symbol| constructors. The |unit| and |unitBy| combinators will cause interesting tree-shapes to be created:

> unit :: Functor p => TreeParser p v v -> TreeParser p v v
> unit = unitBy id
> 
> unitBy :: Functor p => (a -> v) -> TreeParser p v a -> TreeParser p v a
> unitBy f (TreeParser p) = TreeParser $ (\(ts, v) -> ([Unit (f v) (branch ts)], v)) `fmap` p

|unit| gathers the trees from a parser and moves them one level deeper. See figure \ref{fig:unit} for an illustration of this.

\begin{figure}[hbtp]
\centering
\includegraphics[width=\textwidth]{Unit}
\caption{The unit combinator moves the sentences produced by the target parser down one level.}
\label{fig:unit}
\end{figure}

The |unit| function requires the local type to equal the global unit type |v|. If this is not already the case, the user will need to map the result of the parser to |v| first. If the AST consists of mutually recursive datatypes, there will be several distinct local types that will need to be mapped to |v|. In that case, |v| is usually the sum of those types. For example, if locally the result can be an |Expression| or a |Statement|, |v| could be |Either Expression Statement|, and local parsers will need to be mapped with |Left| or |Right| constructors.

To retrieve the tree, the |getTree| combinator is provided, collecting the trees so far into one big tree using the smart constructor |branch|:

> getTree :: Functor p => TreeParser p v a -> TreeParser p v (ParseTree (Input p) v)
> getTree (TreeParser p) = TreeParser $ (\(ts, _) -> (ts, branch ts)) `fmap` p

\subsection{Deriving a mapping}

\todo{Storing symbols is not necessary for deriving |select|.}

We will now focus on deriving a mapping function from parse trees:

> type Range = (Int, Int)
> select :: ParseTree s v -> Range -> [ParseTreeLoc s v]

\todo{Show zipper |ParseTreeLoc|. Explain implementation of |select|.}

\subsection{Supporting two-phase parsers}

If the parser is run on tokens produced by a lexer, tokens will no longer be of length 1, as they are in the case of characters. Therefore |select| cannot safely assume symbol length 1 anymore. To remedy this, we introduce a type class |Symbol| and have |select| compute its ranges in terms of |symbolSize|:

> class Symbol a where
>   symbolSize      :: a -> Int

An instance for |Char| ensures that we have our original functionality back:

> instance Symbol Char where
>   symbolSize _      = 1

If a custom token type is used instead of |Char|, that type will have to implement |Symbol| as well so that |select| can compute the right ranges.

Having |symbolSize| in itself is not enough: the whole point of having a tokenizer is that unimportant tokens can be dropped from the stream before the parser starts reading the stream. But simply dropping the tokens is bad in our case, as this causes the ranges computed by |select| to be wrong. Instead of dropping them from the stream, we wrap every important token in a |Token| constructor and put the neighboring unimportant tokens in |tSpaceBefore| and |tSpaceAfter|:

> data Token s = Token
>   {  tSpaceBefore  :: [s]
>   ,  tImage        :: s
>   ,  tSpaceAfter   :: [s]
>   }

That way there can be an arbitrary number of unimportant tokens between the meaningful tokens, and |select| can be tolerant when given a range that does not exactly match a selectable unit but includes some padding on either or both sides. Figure \ref{fig:tokenlist} illustrates how the tokens are connected for the sentence |"1 + /* comment */ 2"|. Notice that the top row of tokens nicely corresponds to the meaningful tokens.

\begin{figure}[hbtp]
\centering
\includegraphics[width=.6\textwidth]{TokenList}
\caption{The top list contains the images of the important tokens. Subintervals of unimportant tokens below are shared by neighboring important tokens.}
\label{fig:tokenlist}
\end{figure}

A list of symbols can be converted to a list of |Token|s using |collapse|, with the first argument indicating whether a symbol is unimportant:

> collapse :: (s -> Bool) -> [s] -> [Token s]
> collapse space input = collapse' space xs yzs
>   where
>     (xs, yzs) = span space input
> 
> collapse' :: (s -> Bool) -> [s] -> [s] -> [Token s]
> collapse' space spaceBefore input = output where
>   output = if null input then [] else token : rest
>   token = Token
>     {  tImage        = head input
>     ,  tSpaceBefore  = spaceBefore
>     ,  tSpaceAfter   = spaceAfter
>     }
>   (spaceAfter, toProcess) = span space (tail input)
>   rest = collapse' space spaceAfter toProcess

The only case where whitespace is actually lost is when there are no non-space symbols in the input, but this is no problem as there will never be something to feed the parser with.

Now that we have information about neighboring whitespace, we need to be able to tell |select| about this, and so type class |Symbol| becomes:

> class Symbol a where
>   symbolSize       :: a -> Int
>   spaceBeforeSize  :: a -> Int
>   spaceAfterSize   :: a -> Int

Of course, if |s| is a |Symbol|, so is |Token s|:

> instance Symbol s => Symbol (Token s) where
>   symbolSize       (Token _   im  _   ) = symbolSize im
>   spaceBeforeSize  (Token sb  _   _   ) = sum (map symbolSize sb)
>   spaceAfterSize   (Token _   _   sa  ) = sum (map symbolSize sa)

Finally, it is inconvenient that the parser now has to consume language tokens wrapped in |Token| constructors. Fortunately, the parser can safely assume it is processing uncollapsed tokens, because we can lift a parser to consume collapsed tokens instead, making the process fully transparent in the design of the parser:

> lift :: TreeParser (SucParser s) v a -> TreeParser (SucParser (Token s)) v a
> lift = mapInput tImage

\todo{Come up with the right abstraction to have this work for any parser type.}

Sadly, mapping the input type like this is not possible for arbitrary parser types, but only for |TreeParser|s with an underlying |SucParser|, as the implementation heavily depends on the internals of |SucParser|:

> mapInput :: (t -> s) -> TreeParser (B.SucParser s) v a -> TreeParser (B.SucParser t) v a
> mapInput f (TreeParser p) = TreeParser (fromFunction doMapInput) where
>   doMapInput input = do
>     ((trees, value), _) <- toFunction p (map f input)
>     let (trees', rest') = overwriteManySymbols input trees
>     return ((trees', value), rest')
> 
> toFunction :: SucParser s a -> [s] -> [(a, [s])]
> toFunction (SucParser p) = runStateT p
> 
> fromFunction :: ([s] -> [(a, [s])]) -> SucParser s a
> fromFunction = SucParser . StateT

\todo{Explain |overwriteManySymbols|.}