%let report = True
%if report
\documentclass{article}
\usepackage{a4wide}
\usepackage{TRtitlepage}
%else
\documentclass[runningheads,a4paper]{llncs}
%endif
%if report
\author{{S. Doaitse Swierstra} and {Atze Dijkstra}}
%then
%else
\authorinfo{S. Doaitse Swierstra \\ Atze Dijkstra}
           {Department of Computer Science\\ Utrecht University\\ Utrecht, The Netherlands}
           {\{doaitse,atze\}@@cs.uu.nl}
%endif
%include lhs2TeX.fmt
%include polycode.fmt
\usepackage{amssymb}
\usepackage{listings}
\setcounter{tocdepth}{3}
\usepackage{graphicx}
\usepackage{comment}
\excludecomment{lhs}
\usepackage{url}
\usepackage{xspace}
\newcommand{\eps}{$\epsilon$\ }
\urldef{\mailsa}\path||{doaitse}@@swierstra.net||
\newcommand{\keywords}[1]{\par\addvspace\baselineskip
\noindent\keywordname\enspace\ignorespaces#1}
\lstset{%
  basicstyle=\ttfamily,
  breaklines=true,
  columns=fullflexible,
  escapeinside = ||,
  breakindent=0pt}
\lstnewenvironment{examplee}{}{}
\begin{document}
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
%if report
\TRtitlepage
  {Parse Your Options}                       % argument 1 <= the title
  {S.~Doaitse~Swierstra\\Atze Dijkstra\par}  % argument 2 <= authors
  {UU-CS-2013-005}                           % argument 3 <= report number
  {May 2013}
%else
\mainmatter  % start of an individual
contribution
%endif

% first the title is needed
\title{Parse Your Options}
% a short form should be given in case it is too long for the running head
%if report
%then
%else
\titlerunning{Parse Your Options}
\author{S. Doaitse Swierstra and Atze Dijkstra}
%\authorrunning{!!!!!!!!!}
\institute{Dept. of Computer Science, P.O. Box 80.089, 3508 TB Utrecht, the Netherlands\\
%\mailsc\\
\url{http://www.cs.uu.nl}}
%endif

\maketitle

\begin{abstract}
We describe the development of a couple of combinators which can be used to run applicative-style parsers in an interleaved way. In the presentation we advocate a scheme for choosing identifier names which clearly shows the types of the values involved and how they are to be composed into the desired result. We finish by describing how the combinators can be used to parse command line arguments and files containing options.
%if not report
%then
\keywords{Parser combinators, Haskell, Parallel parsing, Option processing, Permutation parsing}
%endif
\end{abstract}

%format ^^    = " "
%format <*    = "\mathbin{\text{\small\ttfamily{<*}}}"
%format *>    = "\mathbin{\text{\small\ttfamily{*>}}}"
%format <*>   = "\mathbin{\text{\small\ttfamily{<*>}}}"
%format <**>  = "\mathbin{\text{\small\ttfamily{<**>}}}"
%format <|>   = "\mathbin{\text{\small\ttfamily{<|>}}}"
%format <||>  = "\mathbin{\text{\small\ttfamily{<||>}}}"
%format <<||> = "\mathbin{\text{\small\ttfamily{<<||>}}}"
%format <+>   = "\mathbin{\text{\small\ttfamily{<+>}}}"
%format <$>   = "\mathbin{\text{\small\ttfamily{<\$>}}}"
%format <$    = "\mathbin{\text{\small\ttfamily{<\$}}}"
%format iI    = "\llfloor"
%format Ii    = "\rrfloor"
%format <++>  = "\mathbin{\text{\small\ttfamily{<++>}}}"
%format +>>   = "\mathbin{\text{\small\ttfamily{+>>}}}"

\newcommand{\uulib}{\texttt{uulib}\xspace}
\newcommand{\uuparsinglib}{\texttt{uu-parsinglib}\xspace}

%format b2g_a
%format c2g_b
%format c2g_b2a
%format g_c
%format g_b
%format g_b2a
%format f_c2b
%format f_c2a
%format f_b2a
%format f_c2b2a
%format f_c
%format f_b
%format
l_a
%format l_b
%format l_b2a
%format m_a
%format m_b
%format m_b2a

\section{Introduction}

In the original version of the \uulib\footnote{\url{http://hackage.haskell.org/package/uulib}} parsing combinator library we introduced two modules, providing combinators for parsing more complicated input structures than those described by context-free languages: one for recognising permuted structures and one for recognising merged lists. In this paper we present an \emph{interleave} combinator which generalises these two patterns. As we will see, this library can almost completely be constructed using the basic |Applicative| and |Alternative| parser interfaces as e.g. provided by the \uuparsinglib package, without having to deal with the intricate internals of the parsing process itself. We only require that it is possible to split a parser into two components: one which recognises the empty sequence, in case the original parser can do so, and one which takes care of the non-empty cases. Of course, a direct consequence of choosing a feature-rich library such as \uuparsinglib is that the parsers constructed out of it inherit all its nice properties: returning results in an online way, providing informative error messages, and adapting the input where parsing cannot proceed.

Before we introduce our new library we present some common definitions and give an example of parsing permuted and merged structures in Section \ref{sec:defs}, followed by a motivating example for our new combinators in Section \ref{sec:example}. Section \ref{sec:core} forms the core of the paper, describing the internals of the new library, whereas in Section \ref{sec:options} we develop a small library using the new combinators for processing command line options. We will see that there is not much more work left for the programmer than enumerating the possible options and the way they are to be denoted. We conclude in Section \ref{sec:conclusions}.
\section{Common definitions and background}\label{sec:defs}

We start out by repeating the definitions of some familiar classes, and give applications of the previously introduced combinators. The class |Applicative| introduces a combinator |<*>| that combines two side-effecting functions into a single function. The important observation is that the results of the two functions are combined by applying the result of the first to that of the second. In this paper such functions will be referred to as parsers, as this is the way we will instantiate the interface. The |<*>| operator runs its two parser arguments sequentially on the input: the second parser takes off with the input state in which the first one finished. The class |Alternative| introduces the companion operator |<||>|, which runs either one of its operands, but not both. Its unit element is |empty|; this is not, as its name suggests, the parser which recognises the empty string, but instead the parser which always fails. For the sake of completeness we also introduce the |Functor| class.
\begin{spec}
class Functor f where
  fmap   :: (b -> a) -> f b -> f a

class Applicative f where
  (<*>)  :: f (b -> a) -> f b -> f a
  pure   :: a -> f a

class Alternative f where
  (<|>)  :: f a -> f a -> f a
  empty  :: f a
\end{spec}
Next we introduce some helper functions that may be used to modify the result of a parser, or to throw away a result in which we are not interested:
\begin{spec}
(<$>) :: Functor f => (b -> a) -> f b -> f a
(<$>) = fmap

p <* q     = const <$> p <*> q
f <$ p     = const f <$> p
p *> q     = flip const <$> p <*> q
p `opt` a  = p <|> pure a
\end{spec}
In addition to the operator for sequential composition |<*>| we defined \cite{1030338} a permuting combinator, with a similar type (the precise definition of the type |g| does not matter here):
\begin{spec}
(<||>) :: g (b -> a) -> g b -> g a
\end{spec}
This operator also recognises both its operands, but irrespective of the order in which they occur in the input stream, i.e. we may either run its left operand and have its right operand take off where the first one finished, or the other way around. Actually it is even more expressive, in the sense that if either of its |g|-operands is again built using |<||||>| the elements of the other operand may intervene. Using this combinator we can construct parsers which recognise input in which the components are possibly permuted. Examples of such uses abound: the order of the fields in a \BibTeX\ entry is not fixed, nor is the order in which the field values of a Haskell record are specified. Consider the data type:
\begin{spec}
data Cart  =  Cart   {x    :: Float, y    :: Float}
           |  Polar  {rho  :: Float, phi  :: Float}
\end{spec}
Now one might want to be able to read both @Cart{x=1,y=2}@ as well as @Cart{y=2,x=1}@ from the input; if a Haskell compiler accepts both representations, why shouldn't we be able to read both of them from the input?
Using the |<||||>| parser this is easily achieved (|mkG| and |sepBy| are helper functions to which we will come back later):
\begin{spec}
pTwoFloats op constr s1 s2
  =  pToken constr *>
     pCurly (  (op <$> pField s1 pFloat <||> pField s2 pFloat)
               `sepBy` pSym ','
            )
pField s p  = mkG (pToken s *> pSym '=' *> p)
pCurly p    = pSym '{' *> p <* pSym '}'

pCart  =    pTwoFloats Cart   "Cart"   "x"    "y"
       <|>  pTwoFloats Polar  "Polar"  "rho"  "phi"
\end{spec}
In \cite{Guerra:2005fk} we furthermore defined a combinator |<+>| which makes it possible to construct parsers which recognise merged lists. In the parser |int_low| below, the function |listOf| converts a parser into a parser which may run interleaved with other parsers and which recognises a list of values accepted by its argument parser; the function |<+>| runs both its arguments in an interleaved mode, and |pMerged| converts its argument back to a normal parser again:
\begin{spec}
int_low :: Parser ([Int],[Char])
int_low = pMerged (listOf pInt <+> listOf pLower)
\end{spec}
The function |int_low| accepts the input string |"a1bc2"| and returns the pair |([ 1, 2 ], "abc")|; the input is split into two sub-streams, each accepted by one of the two operands of |<+>|.

On closer inspection we see that the list-merging parsers are a generalisation of the permuting parsers: just make sure that each of the merged lists has precisely length 1, and list merging degenerates to permuting. This also made us realise that the merging of lists is just a special case of {\em merging any collection of structured sequences}, which in turn raised the question whether we can define combinators which make it possible to describe this process, thus generalising the two libraries just mentioned into a single one.

\section{Parsing Log Files}\label{sec:example}

As a motivation for our new combinators we start with a simple example of their use.
It deals with unraveling a file which was created by several concurrent processes writing entries into it: each process indicates its start by writing the character |'s'| followed by its unique identity (numbers in the example); next it emits a couple of work entries, each consisting of the character |'w'| followed by its process identity and a space. The rest of the line contains log information. A process indicates its termination by closing its sequence of entries with a line consisting of the character |'c'| again followed by its process identity. So all entries generated by a single process are labeled with the same process identity, which we assume to be unique throughout the log file. Given e.g. a log file containing the following lines:
\begin{code}
s2       -- start of process 2
s1       -- start of process 1
w1 a1    -- first work entry of process 1
w2 b     -- first work entry of process 2
w1 a2    -- second work entry of process 1
c1       -- process 1 closes
s3       -- start of process 3
w3 c     -- first work entry of process 3
c2       -- process 2 closes
c3       -- process 3 closes
\end{code}
we want to produce the following table, where each process is associated with the contents of its work entries:
\begin{code}
[("2",["b"]),("1",["a1","a2"]),("3",["c"])]
\end{code}
In the code we use names starting with a |g| to bind to parsers which can run in an interleaved way (which we will refer to as \emph{grammars} from now on), whereas names starting with a |p| refer to conventional parsers which recognise a consecutive segment of input. The function |mkG| converts a conventional parser into a grammar; the result will still recognise a consecutive part of the input, but once it has done so it may pass control on to a competing parser.
We start out by defining the grammar |gProcess|, which recognises the sequence of events for a single process, using some of the basic parsers provided by common combinator libraries with an applicative interface, and subsequently use the function |gmList| to concurrently run as many of these grammars as needed. We use a monadic bind so that the process identity retrieved from the start line can be used in constructing the parsers for the subsequent lines of this process.
\begin{code}
gProcess = do  i  <- mkG pStart
               w  <- pMany . mkG $ pWork i
               _  <- mkG $ pClose i
               return (i, w)

gLog = gmList gProcess
\end{code}
Next we define the parsers recognising the various kinds of log entries. The function |pMunch| accepts the longest prefix of the input of which all characters satisfy the predicate parameter.
\begin{code}
pStart    =  pToken "s" *> pMunch (/= '\n') <* pToken "\n"
pWork i   =  pToken ('w' : i ++ " ")
             *> pMunch (/= '\n')    -- read the rest of the line
             <* pToken "\n"
pClose i  =  pToken ('c' : i ++ "\n")
\end{code}
Note that the order in which the processes appear in the resulting table is determined by the order of their first entries in the log file.

\section{Merging Parsers}\label{sec:core}
\begin{lhs}
\begin{code}
{-# LANGUAGE ExistentialQuantification, ScopedTypeVariables #-}
module Text.ParserCombinators.UU.MergeAndPermute where
import Text.ParserCombinators.UU.Core
import Debug.Trace
\end{code}
\end{lhs}

\subsection{Representing parsers for merged input structures}
As we have seen in the example, what we are looking for is the possibility to interleave parsers; one can think of this as associating a separate colour with each parser and splitting up the input into a series of coloured segments, such that the concatenation of the segments of the same colour is accepted by the correspondingly coloured parser. The question to answer is how to split the input into uniformly coloured segments, i.e. how to find the points in the input where colour switching may take place and what colour to give to the segments.
In order to be able to do so efficiently we have decided to construct our interleaving parsers out of \emph{basic} parsers, which are conventional parsers recognising a \emph{consecutive part} of the input. We rephrase our use of the word \emph{grammar} to stand for a \emph{description of} a parser which can pause at a colour switch and continue when the input switches back to its colour again. Such grammars can thus be run in a competing fashion. From this point on we will use the word |parser| to refer to a conventional parser which recognises a segment of the input.

Once a parser gets hold of the input, and has successfully started parsing, we let this parser run until completion. Once it has completed, all pending grammars (including the grammar of which the parser that just ran forms a constituent) can try to continue to parse. A consequence of this is that at any point in the input where we may switch between grammars, each of the competing grammars should present its \emph{first-parsers}, i.e. the candidates for accepting the next uniformly coloured segment. These first-parsers play a role similar to that of the {\em first} sets resulting from an $LL(1)$ grammar analysis; the only difference is that we dynamically compute the collection of first-parsers instead of statically computing the set of first symbols.

A very important issue we have to take care of is the avoidance of unwanted ambiguity, as the following example shows. Suppose we have the following permuting parser:
\begin{spec}
pMaybe s  = pToken s `opt` ""
pAmb      = (,) <$> pMaybe "A" <||> pMaybe "B"
\end{spec}
and our input consists of the string |"A"|. Then there are two different ways in which the empty string |""| can be seen as part of the input: |"" ++ "A"| and |"A" ++ ""|. Thus, when running the parser |pAmb| on the input string |"A"|, there are two possible parses, both returning the same result.
Unless our underlying parsing library somehow knows how to deal with this, we will either get an error message, a rather arbitrary choice of one of the possible parses or, when this happens more than once, an exponential number of results. Even if the library could handle ambiguity, it might not be able to discover that both results are the same. This problem has already been described in the development of the permuting combinators \cite{1030338}, and the solution we take here is based on a similar assumption as the one we made there: we assume that we are able to split a parser into a part recognising the non-empty input sequences and a part recognising the possibly empty one. The latter will be represented by the value that is to be returned as a witness in case the parser accepts the empty input (denoted as $\epsilon$ from now on). In order to make this explicit in our interface we introduce a class:
\begin{code}
class Splittable f where
  getNonPure  :: f a -> Maybe (f a)
  getPure     :: f a -> Maybe a
\end{code}
where we will assume the following equality to hold for any |Alternative| functor |f| we want to use for our basic parsers:
\begin{spec}
f == maybe empty id (getNonPure f) <|> maybe empty pure (getPure f)
\end{spec}
We are now ready to define the data type |Gram| representing {\em grammars}. It allows us to compute, at each possible splitting point in the input, the collection of first-parsers that may try to continue at that point. We start out by defining a data type |Alt| which has a constructor |Seq| that explicitly represents the splitting of the grammar into a first-parser and ``the rest of the work to be done''. Since we also want our grammars to have a monadic interface, we equip our data type |Alt| with a |Bind| alternative, which again explicitly contains the first-parser to be run:
%{
%format . = "."
\begin{code}
data Alt f a  =  forall c .  (f (c -> a))  `Seq`   (Gram f c)
              |  forall c .
(f c) `Bind` (c -> Gram f a)
\end{code}
Based on the data type |Alt| we now define the data type |Gram|. It contains two components: a value of type |[Alt f a]|, which is the alternation (choice) of all non-empty parts, jointly accepting a non-empty sequence of symbols, and a |Maybe a| value representing the value to be returned in case \eps is accepted:
\begin{code}
data Gram f a = Gram [Alt f a] (Maybe a)
\end{code}
%}
The type parameter |f| corresponds to the conventional, non-interruptible parsers which we use as building blocks for our grammars.

\subsection{Defining the various class instances for |Gram|}
We now define the instances for these newly introduced data types for the classes |Functor|, |Applicative|, |Alternative| and |Monad|.

\subsubsection{|Gram| is a |Functor|}
We start with the |Functor| instances for both the data type |Gram| and |Alt|. The code is straightforward:
\begin{code}
instance Functor f => Functor (Gram f) where
  fmap b2a (Gram l_b m_b) = Gram (map (b2a <$>) l_b) (b2a <$> m_b)

instance Functor f => Functor (Alt f) where
  fmap b2a (f_c2b  `Seq`   g_c   ) = ((b2a .) <$> f_c2b) `Seq` g_c
  fmap b2a (f_c    `Bind`  c2g_b ) = f_c `Bind` (\c -> b2a <$> c2g_b c)
\end{code}
Here we have chosen a naming convention which makes the types of the values involved explicitly visible in the program text: |b2a| holds a function value of type |b -> a|, |m_b| stands for a value of type |Maybe b|, |l_a| stands for a list of |Alt f a| values, |f_c2b| is bound to a value of type |f (c -> b)|, |g_c| is bound to a value of type |Gram f c|, and |c2g_b| to a value of type |c -> Gram f b|. Now it is easy to see that |fmap b2a (f_c `Bind` c2g_b)| should result in a value of type |Alt f a|, etc. Note that applying |fmap| keeps intact the invariant that the first-parser of each |Alt| can easily be recognised.
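To see these instances at work in isolation, the following self-contained sketch reproduces the two data types and their |Functor| instances, and adds a small observer |evalMaybe| of our own: instantiating |f| with |Maybe| degenerates a ``parser'' to a value that consumes no input, which is just enough to evaluate a grammar and observe the effect of |fmap|. The observer and the example values are our inventions, not part of the library:

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Control.Applicative ((<|>))

-- The data types and Functor instances from above, reproduced verbatim.
data Alt f a  =  forall c . (f (c -> a)) `Seq`  (Gram f c)
              |  forall c . (f c)        `Bind` (c -> Gram f a)
data Gram f a = Gram [Alt f a] (Maybe a)

instance Functor f => Functor (Gram f) where
  fmap b2a (Gram l_b m_b) = Gram (map (b2a <$>) l_b) (b2a <$> m_b)
instance Functor f => Functor (Alt f) where
  fmap b2a (f_c2b `Seq`  g_c  ) = ((b2a .) <$> f_c2b) `Seq` g_c
  fmap b2a (f_c   `Bind` c2g_b) = f_c `Bind` (\c -> b2a <$> c2g_b c)

-- A toy observer of ours: with f = Maybe, a "parser" is just an optional
-- value, so a whole Gram can be evaluated directly to its witness.
evalMaybe :: Gram Maybe a -> Maybe a
evalMaybe (Gram l_a m_a) = foldr (<|>) m_a (map evalAlt l_a)
  where  evalAlt (m_c2a `Seq`  g_c  ) = m_c2a <*> evalMaybe g_c
         evalAlt (m_c   `Bind` c2g_a) = m_c >>= evalMaybe . c2g_a
```

For example, |evalMaybe (fmap (*2) (Gram [Just (+10) `Seq` Gram [] (Just 5)] Nothing))| yields |Just 30|: |fmap| pushes the function through the |Seq| alternative without disturbing the position of the first-parser.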
We have chosen to use |<$>| as an alias for |fmap| in this code wherever possible, since this makes the chosen names even more helpful in reading the code.
%$

\subsubsection{|Gram| is an |Applicative| functor}
We want to construct our merging parsers in just the same way as we construct normal parsers, using the well-known |Applicative| and |Alternative| interfaces \cite{McB07}. Since our data types enforce the property that we can easily identify the first-parser to be used, this is where the real work takes place. The pattern we follow, however, is a common one, well known from various process algebras; the code is given in Figure \ref{fig:appl}. The first-parsers in the |Seq| and |Bind| constructs of the left-hand side operand are ``rotated'' out. The hard work is done by the function |fwdby|, which combines the remaining parts of the |Seq| and |Bind| constructs to form the new right-hand side of the |Seq| and |Bind| construct in the result. The |g_c| and |g_b| grammars of a |Seq| are run, returning a value of type |(c, b)|. We modify the value returned by the first-parser --by applying |uncurry| to it-- to accept this pair of values instead of getting passed the two arguments individually.

If the left-hand side parser can recognise the empty input, then the first-parsers of the right-hand side grammar are first-parsers of the resulting grammar as well; they can start to accept part of the input too. This explains the second part of the list of |Alt f a| values in the right-hand side of the |<*>| definition, where we use the witness value of type |b -> a| to update the result of the right-hand side parser. The definition of |pure| speaks for itself: we have no non-empty alternatives, and the parser can recognise $\epsilon$ with a witness of type |a|.
\begin{figure}
\begin{code}
instance Functor f => Applicative (Gram f) where
  pure a = Gram [] (Just a)
  Gram l_b2a m_b2a <*> ~g_b@(Gram l_b m_b)
    = Gram  (    map (`fwdby` g_b) l_b2a
            ++   [b2a <$> f_b | Just b2a <- [m_b2a], f_b <- l_b]
            )
            (m_b2a <*> m_b)

fwdby :: Functor f => Alt f (b -> a) -> Gram f b -> Alt f a
(f_c2b2a  `Seq`   g_c)      `fwdby` g_b  = (uncurry <$> f_c2b2a) `Seq` ((,) <$> g_c <*> g_b)
(f_c      `Bind`  c2g_b2a)  `fwdby` g_b  = f_c `Bind` (\c -> c2g_b2a c <*> g_b)

uncurry f (x, y) = f x y

instance Functor f => Alternative (Gram f) where
  empty = Gram [] Nothing
  Gram ps pe <|> Gram qs qe = Gram (ps ++ qs) (pe <|> qe)
\end{code}
\caption{|Gram| is a member of the |Applicative| and |Alternative| classes.\label{fig:appl}}
\end{figure}

A subtle point is that we had to add an irrefutable pattern match (|~|) to the right-hand side operand of |<*>|, since otherwise the pattern matching creates an endless loop in situations like the definition of |pMany|, when |f| is instantiated with some |Gram g|:
\begin{code}
pMany :: f a -> f [a]
pMany p = let result = (:) <$> p <*> result `opt` [] in result
\end{code}
Here |<*>| does not have access to the top-level constructor of its right-hand side, since this constructor is produced by this very call to |<*>|. Note that the call to |<*>| is evaluated lazily, so we have no problems with recursive grammars. The unrolling of these definitions is done on a by-demand basis during the actual parsing process. This technique resembles the parallel parsing strategy developed by Claessen \cite{JFP:254719}, and the code also bears close resemblance to the computation of the |firsts| set, as described by Swierstra and Duponcheel \cite{SwieDupo96}.

In case the left-hand side is a monadic construct we just use the original first-parser |f_c|, and again move the right-hand side |c2g_b2a| of the left operand to the right-hand side of the result, where it is composed with the original right-hand side using |<*>|.
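The role of the irrefutable pattern discussed above can be seen in miniature in the following toy example of ours (the names |fStrict|, |fLazy| and |gLazy| are hypothetical and unrelated to the library): a function that matches strictly on a knot-tied argument demands the very constructor it is about to produce and loops, whereas its lazy variant terminates.

```haskell
-- Toy illustration of why the irrefutable pattern (~) matters; our names.
fStrict :: Maybe Int -> Maybe Int
fStrict (Just x) = Just (x + 1)    -- forces its argument before producing Just

fLazy :: Maybe Int -> Maybe Int
fLazy ~(Just x) = Just (x + 1)     -- produces Just at once; x stays a thunk

-- A knot-tied definition, analogous to 'result' in pMany:
gLazy :: Maybe Int
gLazy = fLazy gLazy                -- fine: the Just constructor is available
-- gStrict = fStrict gStrict would loop forever: fStrict demands the
-- constructor that this very call is supposed to produce.

topIsJust :: Bool
topIsJust = case gLazy of Just _ -> True; Nothing -> False
```

In the definition of |<*>| the |~| plays exactly this role: it postpones the match on the right-hand side |Gram| constructor until that constructor is actually needed.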
\subsubsection{|Gram| is an |Alternative| functor}
The instance for |Alternative| is almost trivial: we concatenate the lists of alternatives of both operands. This leaves the question what to do if the grammar is ambiguous because both alternatives can accept \eps. We have chosen to exploit the left-biasedness of |<||>| as defined for |Maybe|, so the left value is returned. The code is again given in Figure \ref{fig:appl}.

\subsubsection{|Gram| is a |Monad|}
The next thing we want to do is to equip our grammars with a monadic interface, which we again achieve by ``rotating'' all but the first-parser to the right, so the first-parser is again presented at the top-level constructor. In the case of a |`Seq`| construct as the left argument of |>>=| we split its monadic effect into two steps: in the first step we make the first-parser of the left-hand side operand explicitly visible in the resulting |`Bind`| construct, whereas the corresponding right-hand side |g_c| of the |`Seq`| is combined with the result |c2b| of its first-parser (using |\c2b -> c2b <$> g_c|) and then passed on to a further call of |>>=| in the right-hand side of the resulting top-level |`Bind`| construct.

When the left-hand side accepts \eps we have to take some extra precautions, since in this case the first-parsers arising from the right-hand side of |>>=| compete for the input too: they too may accept input at this point.
Since we have the witness of this empty left-hand side available, we can use it to compute the |Gram f a| value returned by the right-hand side of |>>=|, and with this its first-parsers become available too and can take part in the competition:
\begin{code}
instance Functor f => Monad (Gram f) where
  return a = Gram [] (Just a)
  Gram l_b m_b >>= b2g_a =
    case m_b of
      Nothing  ->  Gram (map (`bindto` b2g_a) l_b) Nothing
      Just b   ->  let Gram l_a m_a = b2g_a b
                   in  Gram (map (`bindto` b2g_a) l_b ++ l_a) m_a

bindto :: Functor f => Alt f b -> (b -> Gram f a) -> Alt f a
(f_c2b  `Seq`   g_c)    `bindto` b2g_a  = f_c2b  `Bind` \ c2b  -> c2b <$> g_c >>= b2g_a
(f_c    `Bind`  c2g_b)  `bindto` b2g_a  = f_c    `Bind` \ c    -> c2g_b c >>= b2g_a
\end{code}

\subsubsection{Constructing elementary |Gram| values}
The last thing we have to do is to show how we can lift a parser into an equivalent |Gram| value:
\begin{code}
mkG :: (Splittable f, Functor f) => f a -> Gram f a
mkG p = Gram  (maybe  []  (\p -> [(const <$> p) `Seq` pure ()])  (getNonPure p))
              (getPure p)
\end{code}
At first sight this code looks more complicated than strictly needed; this is caused by our choice of the |Alt| data type. We could easily have added a third case |Single (f a)| to this data type, and have used this |Single| constructor here. We have chosen the current, minimalistic approach, in which this |Single| case is encoded as a parser which is followed by an \eps parser returning |()|, the result of which is subsequently discarded. This adds a small constant-time overhead to our parsers, which we think is acceptable in return for the increased simplicity of the code.

\subsection{|<||||>| and |<<||||>|}
In the previous subsections we have defined the data type |Gram f a|, have shown how to lift elementary parsers to this structure, and have defined instances for this type for the |Functor|, |Applicative|, |Alternative| and |Monad| classes.
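To convince ourselves that |mkG| does what it promises, the following self-contained sketch instantiates |f| with a toy list-of-successes parser |P| and converts the lifted grammar back to a parser with a simplified interpreter |run| (the monadic |Bind| case is omitted for brevity; the conversion function |mkP| described further on handles it as well). The parser |P|, the helpers |pSym| and |run|, and the |Splittable| instance, which assumes the toy parsers never accept \eps, are our own:

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Control.Applicative

-- Toy list-of-successes parser standing in for the library's parsers.
newtype P a = P { runP :: String -> [(a, String)] }

instance Functor P where
  fmap f (P p) = P (\s -> [ (f a, r) | (a, r) <- p s ])
instance Applicative P where
  pure a      = P (\s -> [(a, s)])
  P p <*> P q = P (\s -> [ (f a, r') | (f, r) <- p s, (a, r') <- q r ])
instance Alternative P where
  empty       = P (const [])
  P p <|> P q = P (\s -> p s ++ q s)

pSym :: Char -> P Char
pSym c = P (\s -> case s of { (x:xs) | x == c -> [(c, xs)] ; _ -> [] })

class Splittable f where
  getNonPure :: f a -> Maybe (f a)
  getPure    :: f a -> Maybe a
instance Splittable P where     -- assumption: toy parsers never accept eps
  getNonPure = Just
  getPure    = const Nothing

-- The paper's types, with the monadic Bind alternative omitted for brevity.
data Alt  f a = forall c . (f (c -> a)) `Seq` (Gram f c)
data Gram f a = Gram [Alt f a] (Maybe a)

mkG :: (Splittable f, Functor f) => f a -> Gram f a
mkG p = Gram (maybe [] (\q -> [(const <$> q) `Seq` Gram [] (Just ())])
                       (getNonPure p))
             (getPure p)

-- Simplified interpreter back to a plain parser (Seq alternatives only).
run :: Alternative f => Gram f a -> f a
run (Gram alts m_a) = foldr (<|>) (maybe empty pure m_a)
                            [ f <*> run g | f `Seq` g <- alts ]
```

Lifting a parser and interpreting the result gives back the original behaviour: |runP (run (mkG (pSym 'a'))) "ab"| yields |[('a',"b")]|, and the grammar fails on input that the original parser rejects.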
As a final step we now define the |<||||>| combinator, which describes the ``interleaved'' composition of two grammars. We will, however, express this combinator in terms of an even more primitive combinator |<<||||>|:
\begin{code}
infixl 4 <||>
infixl 4 <<||>
(<||>), (<<||>) :: Functor f => Gram f (b -> a) -> Gram f b -> Gram f a
\end{code}
Note that we have given these operators the same type as the conventional |<*>| combinator, since we very much like the applicative interface for describing how to combine the two accepted values into the result. The reason, of course, that we cannot use the |<*>| combinator itself is that it has already been used to express the more conventional sequential composition of two |Gram| values.

The idea of the |<<||||>| combinator is that it will run one of the first-parsers of its left-hand side operand on the input first, and from that point on will behave like |<||||>|, which does not have a preference for either of its operands to start accepting input. We can easily define |<||||>| in terms of |<<||||>|:
\begin{code}
g_b2a <||> g_b  =    g_b2a <<||> g_b
                <|>  flip ($) <$> g_b <<||> g_b2a
\end{code}
Here we see that the resulting parser will either run one of the first-parsers from its left-hand side operand or one of the first-parsers of its right-hand side operand. In case both grammars can accept \eps we get the same witness value twice, and in principle our grammar becomes ambiguous; the biased choice of the |Maybe| instance of |Alternative| throws away one of these two (equal) values. So all we have to do now is to define |<<||||>|.
We construct a new grammar which has as its first-parsers all the first-parsers of its left-hand side operand:
\begin{code}
(<<||>) :: Functor f => Gram f (b -> a) -> Gram f b -> Gram f a
g_b2a@(Gram l_b2a m_b2a) <<||> ~g_b@(Gram _ m_b)
  = Gram (map (`fwdby'` g_b) l_b2a) (m_b2a <*> m_b)

(f_c2b2a  `Seq`   g_c)      `fwdby'` g_b  = (uncurry <$> f_c2b2a) `Seq` ((,) <$> g_c <||> g_b)
(f_c      `Bind`  c2g_b2a)  `fwdby'` g_b  = f_c `Bind` (\ c -> c2g_b2a c <||> g_b)
\end{code}
Notice that this code is almost the same as that for the definition of |<*>|; only the occurrences of |<*>| in the right-hand sides of the |fwdby| function have been replaced by |<||||>| (in its variant |fwdby'|), thus indicating that the parsers constructed in this way should run interleaved instead of sequentially.

\subsection{Converting Grammars into Parsers}
The only thing left to do now is to show how to construct a real parser from a |Gram| structure. We will require that the parameter |f| we have carried around thus far has instances for the usual |Applicative|, |Alternative| and |Monad| interfaces, so we can use the functions available from these classes in this process. For each of the alternatives we select the first-parser from it, use |<||>| to select one of these to run, and after that either combine its result using |<*>| with the parser generated from the corresponding |Gram| value in the case of a |`Seq`|, or use it as an argument to the right-hand side operand in the case of a |`Bind`|, converting this result again into a proper parser.
\begin{code}
mkP :: (Monad f, Applicative f, Alternative f) => Gram f a -> f a
mkP (Gram l_a m_a) = foldr (<|>) (maybe empty pure m_a) (map mkP_Alt l_a)
  where  mkP_Alt (f_b2a  `Seq`   g_b   )  = f_b2a <*> mkP g_b
         mkP_Alt (f_b    `Bind`  b2g_a )  = f_b >>= (mkP . b2g_a)
\end{code}

\subsection{Inserting separators}
As we have seen in the |pCart| example it is a common case that the elements which we want to recognise, and which may occur in a permuted order, are separated by e.g. a |';'| or a |','|.
For these cases we have introduced a special version of |mkP|, which takes an additional argument telling how to parse a separator. The hard work is done by a function |insertSep|, which prefixes each parser, except the first one, in the |Gram| parameter by the parser that recognises the separator:
\begin{code}
sepBy :: (Monad f, Applicative f, Alternative f) => Gram f a -> f b -> f a
sepBy g sep = mkP (insertSep sep g)

insertSep :: (Applicative f) => f b -> Gram f a -> Gram f a
insertSep sep (Gram l_a m_a :: Gram f a) = Gram (map insertSepInAlt l_a) m_a
  where  insertSepInAlt (f_b2a  `Seq`   g_b   ) = f_b2a `Seq` prefixSepInGram g_b
         insertSepInAlt (f_b    `Bind`  b2g_a ) = f_b `Bind` (insertSep sep . b2g_a)
         prefixSepInGram (Gram l_a m_a) = Gram (map prefixSepInAlt l_a) m_a
         prefixSepInAlt :: Alt f b -> Alt f b
         prefixSepInAlt (f_b2a  `Seq`   g_b   ) = (sep *> f_b2a) `Seq` prefixSepInGram g_b
         prefixSepInAlt (f_b    `Bind`  b2g_a ) = (sep *> f_b) `Bind` (insertSep sep . b2g_a)
\end{code}
Because we are making use of polymorphic recursion we had to insert a few type annotations in the code.

\subsection{Parsing merged lists}
Although the combinators follow the common interfaces, there are a few tricky points one has to keep in mind when using them. The fact that the interleaved parsers compete for input may lead to some complications one should be aware of. We take a look at the traditional definition of |pList|, which converts a parser into a parser that recognises a list of elements recognised by its argument parser:
\begin{code}
pList p = let pmp = (:) <$> p <*> pmp `opt` [] in pmp
\end{code}
In this definition the recursive call to |pmp| only starts to play a role once the first instance of |p| has succeeded. If we however replace the |<*>| operator by a |<||||>| operator, then the recursive |pmp| can start to parse immediately too, and will spawn yet another instance of |p| which starts to compete for the input, and so on recursively; apparently, replacing sequential composition by interleaved composition has deeper implications than is directly visible from the code.
Fortunately this problem can be solved rather easily: we only start a new instance of |pmp| competing for the input once |p| has started its work and has consumed a bit of input. Hence we define:
\begin{code}
gmList p = let pmp = (:) <$> p <<||> pmp `opt` [] in pmp
\end{code}
We see here that the availability of |<<||||>| plays an essential role in this definition; in that sense it is more primitive than |<||||>|, which was expressed in terms of |<<||||>|.
\section{Applications}\label{sec:options}
In this section we give two examples of the use of the introduced merging combinators.
\subsection{Parsing options}
One of the most boring tasks in writing an application is the processing of the options passed on the command line. Although there are some packages and tools to make one's life a bit easier, a lot of work always remains: the strings that were passed have to be converted explicitly into the kind of values one is really interested in, defaults have to be provided for optional arguments, for required arguments we have to check that they were actually provided, and there are many conventions for passing the options, be it in short form as in @ls -l@, in long form as in @haddock --enable-documentation@, in a kind of key-value pair as in @process -o outputfile@ or @process -o=outputfile@, etc. We now present a small collection of combinators which, using the introduced merging parser combinators, completely takes this burden away from the programmer. We start out by assuming that in our program we want to put the recognised options in a record with named fields. Using Template Haskell we generate lenses to give us access to the individual fields.
As an example we define the following data types and example record:
\begin{code}
import Data.Lenses
import Data.Lenses.Template

data Prefers  = Clean | Haskell deriving Show
data Address  = Address { city_ :: String, room_ :: String } deriving Show
data Name     = Name { name_ :: String, prefers_ :: Prefers, ints_ :: [Int]
                     , address_ :: Address } deriving Show

$(deriveLenses ''Name)
$(deriveLenses ''Address)
$(deriveLenses ''Prefers)

defaults = Name "Doaitse" Haskell [] (Address "Utrecht" "BBL517")
\end{code}
The |deriveLenses| calls to Template Haskell generate code which gives us read and write access to the fields whose name ends in an |'_'|. What precisely is generated does not matter too much here; what is important is that the imported packages provide, amongst others, a function |alter| which can be used to update a field of type |a|, pointed at by the first parameter, in a record of type |r|, by applying the passed function of type |a -> a| to it:
\begin{code}
alter :: MonadState a m => (m () -> StateT r Identity b) -> (a -> a) -> (r -> r)
\end{code}
Now we can update the record at, say, the field |prefers| as follows:
\begin{examplee}
print ((prefers `alter` (const Clean)) defaults)
Name { name_ = "Doaitse", prefers_ = Clean, ints_ = []
     , address_ = Address {city_ = "Utrecht", room_ = "BBL517"}}
\end{examplee}
So the important thing to remember is that an expression |a `alter` f| applies the function |f| to the field pointed at by the lens |a|. The first thing we do is define a function |oG| (for |optionGrammar|), which takes a normal parser |p| recognising a single option and modifies it such that its result is applied to the field pointed at by the lens:
\begin{code}
oG p a = mkG ((a `alter`) <$> p)
\end{code}
Using this code we can define an option parser which recognises a required option that takes a single extra parameter, such as e.g. a file name.
\begin{code}
required_ a (string, p)
  =    oG (pSymbol ("-"  ++ [head string])  *> lexeme p) a
  <|>  oG (pSymbol ("--" ++ string ++ " ")  *> lexeme p) a
  <|>  oG (pSymbol ("--" ++ string ++ "=")  *> lexeme p) a

required a (string, p) = required_ a (string, const <$> p)
\end{code}
The call |inp `required` ("filename", pFileName)| constructs a grammar which recognises one occurrence of the three possible forms of passing an option: @-f inputfile@, @--filename inputfile@ and @--filename=inputfile@, and updates the field with name |inp| in the record which will hold our recognised options. The parser |pFileName| recognises the file name part of the option. Using this basic parser for a single option we can now define special versions of it. The function |option| makes the required field optional, as its name suggests. The function |flag| recognises an option which does not read an extra argument from the input, but just sets the field to the passed value:
\begin{code}
option  a string_p     = required a string_p `opt` id
flag    a (string, v)  = option a (string, pure v)
flags   a table        = foldr (<>) (pure id) (map (flag a) table)
\end{code}
At this point one may object that the code presented thus far does not really depend on the fact that we have introduced grammars, and could have been implemented using the permutation parsers which have been available for a long time (see e.g. the @options-applicative@ or the @uu-parsinglib@ packages on Hackage). We now come to the point where our somewhat more involved combinators start to pay off. In the example record we have a field which holds a list of integers; wouldn't it be nice if each of these integers could be specified by a separate item in the list of options? For this we define the functions:
\begin{code}
options   a (string, p)  = pFoldr ((.), id) (required_ a (string, (:) <$> p))
optionsl  a string_p     = (last .) <$> options a string_p
optionsf  a string_p     = (head .)
<$> options a string_p
\end{code}
Use of each of these functions makes sure that several settings, distributed over the total list of options, are recognised and combined into a value for a single field, to which they are prepended. The functions |optionsl| and |optionsf| select the last, respectively the first, element of this list. They can be used when the same option may be set several times and we want to use the last or the first setting. This may come in handy when the total list of options consists of e.g. a sequence of options as typed on the command line, concatenated with a list of options taken from some preferences file. Finally we want to be able to define the options for the |Address| field which is part of our |Name| record holding the recognised options. Using lenses this is easy: we recognise a list of options, and apply these to the field of the parent record pointed to by the passed lens:
\begin{code}
field s opts = (s `alter`) <$> opts
\end{code}
So now we are finally ready to show what the specification of the option parser for our |Name| record looks like (note that the order in which we specify the options does not matter):
\begin{code}
instance Functor f => Monoid (Gram f (r -> r)) where
  mappend p q  = (.) <$> p <||> q
  mempty       = pure id

oName =  ^^  name     `option`   ("name", pString)
     <>      ints     `options`  ("int", pNatural)
     <>      prefers  `flags`    [("clean", Clean), ("haskell", Haskell)]
     <>      address  `field`    (     city  `option`  ("city", pString)
                                 <>    room  `option`  ("room", pString))
\end{code}
By making values of type |Gram f (r -> r)| an instance of the class |Monoid|, defining |mappend| as the merge of the two parameters which composes the record-updating functions both of them return, we can use the nice |<>| notation to combine the options.
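The record-updating view underlying this |Monoid| instance can be illustrated in plain Haskell, independent of the parsing machinery: functions of type |r -> r| form a monoid under composition, available as |Endo r| from |Data.Monoid|. A minimal sketch, with a hypothetical |Opts| record and update functions that stand in for the updates our option parsers produce (none of these names come from the library):

```haskell
import Data.Monoid (Endo(..))

-- a hypothetical options record, standing in for the Name record above
data Opts = Opts { optName :: String, optInts :: [Int] } deriving (Eq, Show)

defaultOpts :: Opts
defaultOpts = Opts "Doaitse" []

-- every recognised option contributes an update of type Opts -> Opts
setName :: String -> Endo Opts
setName s = Endo (\o -> o { optName = s })

addInt :: Int -> Endo Opts
addInt i = Endo (\o -> o { optInts = i : optInts o })

-- updates are combined with (<>); appEndo applies the composite update
applyAll :: [Endo Opts] -> Opts -> Opts
applyAll us = appEndo (mconcat us)

main :: IO ()
main = print (applyAll [setName "Rinus", addInt 7, addInt 5] defaultOpts)
-- prints Opts {optName = "Rinus", optInts = [7,5]}
```

Since |Endo| composes right to left, the update supplied last is applied to the record first; |flags| above folds its table with |<>| in just this way, with |pure id| as the unit.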
We finally run our options:
\begin{examplee}
run (($ defaults) <$> mkP oName)
    "--name=Rinus --int=7 --city=Nijmegen -i 5 --clean -i3"
-- Result:
Name  { name_     = "Rinus"
      , prefers_  = Clean
      , ints_     = [7,5,3]
      , address_  = Address { city_ = "Nijmegen", room_ = "BBL517"}}
\end{examplee}
Note that the specifications for the nested |Address| field appear distributed among the other options, and that all the integers we have specified end up together in the |ints| field. In the definition of the derived combinators for specific variants of options we have chosen to use lenses, which return a function that updates an already existing structure containing the values to be collected. The advantage of this approach is that we may start with a record containing the default values, apply the result of reading a preferences file to it, and next apply on top of that the extra information parsed from the command line. Note that each of these structures can easily be parsed using the newly introduced combinators, be it using parsers described in an applicative style or using lenses, which correspond more closely to the familiar keyword-value way of specifying parameters.
\subsection{Nested options}
It is not uncommon to pass options for a linker, to be invoked in a later stage, on the command line of a compiler. One of the problems which may arise from such an option architecture is that the option specifications of otherwise rather independent programs may start to interfere: is e.g. the @--verbose@ option passed to a module installer like |cabal| meant for the installer itself, or for the |haddock| program that is used to generate the associated documentation? This problem can easily be solved by requiring these options to be surrounded by @+haddock@ and @-haddock@ markers on the command line, and specifying the command line parser as follows:
\begin{code}
pHaddock = (gmList .
mkG) (pToken "+haddock" *> mkP oHaddock <* pToken "-haddock")
\end{code}
where |oHaddock| is the grammar describing the options meant for |haddock| itself, constructed analogously to |oName| above.
\section{Conclusions}\label{sec:conclusions}
We have derived a set of very general combinators which make it possible to unravel merged input structures. The library imposes very few restrictions on the underlying parser combinators used. A distinguishing feature of our combinators is that they extend beyond the now common parsers for permuted structures. Our new combinators have both a monadic and an applicative interface. By being able to switch between the sequential and the merging view on the input we can recognise permuted structures which are embedded inside other permuted structures. In this way we may specify options for various subsequent program stages, where the option structures for these programs would otherwise conflict.

In deriving the library we found it very helpful to choose our identifiers in such a way that the types are explicitly represented in the chosen identifiers. As we have seen, this greatly helps in identifying the way we can construct the needed values out of the available values. Choosing the identifiers in the way we did greatly helped us in writing the code, and is an essential ingredient in applying the programming paradigm in which we {\em ``Let the types do the work''}. However trivial this observation may seem, our experience in programming this way helped us so much that we think it deserves to be pointed out explicitly rather than just being used in the construction of this library.
\section{Acknowledgements}
We want to thank Bastiaan Heeren and the members of the Software Technology reading club for commenting on earlier versions of this paper.
\bibliographystyle{splncs03}
\begin{flushleft}
%\bibliography{/Users/doaitse/InProgress/BibTeX/biblio}
\begin{thebibliography}{1}
\providecommand{\url}[1]{\texttt{#1}}
\providecommand{\urlprefix}{URL }

\bibitem{1030338}
Baars, A.I., L{\"o}h, A., Swierstra, S.D.: Parsing permutation phrases. Journal of Functional Programming
14(6), 635--646 (2004)

\bibitem{JFP:254719}
Claessen, K.: Parallel parsing processes. Journal of Functional Programming 14(6), 741--757 (2004)

\bibitem{Guerra:2005fk}
Guerra, R., Baars, A.I., Swierstra, S.D., Saraiva, J.: Preserving order in non-order preserving parsers. Tech. Rep. UU-CS-2005-025, Department of Information and Computing Sciences, Utrecht University (2005)

\bibitem{McB07}
McBride, C., Paterson, R.: Applicative programming with effects. Journal of Functional Programming 18(1), 1--13 (2008), \url{http://portal.acm.org/citation.cfm?id=1348940.1348941}

\bibitem{SwieDupo96}
Swierstra, S.D., Duponcheel, L.: Deterministic, error-correcting combinator parsers. In: Launchbury, J., Meijer, E., Sheard, T. (eds.) Advanced Functional Programming. LNCS-Tutorial, vol. 1129, pp. 184--207. Springer-Verlag (1996)
\end{thebibliography}
\end{flushleft}
\end{document}