\documentclass[a4paper]{artikel3} %opmaakdingen: \usepackage{a4wide,charter} \usepackage{xcolor} \usepackage[english]{babel} \usepackage{graphicx} \usepackage{wrapfig} %include lhs2TeX.fmt %include selections.fmt \title{Generic selections of subexpressions} \author{Martijn van Steenbergen} \date{\today} \begin{document} \maketitle \section{Introduction} Many applications that work directly with textual computer languages have parsers that not only build abstract syntax trees (ASTs), but also save position information of constructs in the source code to be able to offer their functionality to the user. Common examples are: \begin{itemize} \item Compilers save position information to be able to give detailed line and column information in their warning and error messages. \item Advanced editors and integrated development environments (IDEs) need the position information to be able to offer the user options based on the current text selection, for example to offer certain types of refactoring. \item Recent work on exercise assistants \cite{eqsolver} allows students to select part of an expression and apply rewrite rules to them. The assistant is also able to detect invalid selections and offer suggestions to fix them. \end{itemize} Right now every time a parser is written and positional information needs to be saved, this has to be added to the parser and AST by hand, involving writing a lot of boilerplate code. Furthermore, local changes in a document should generally leave the rest of the document unchanged, preserving comments and layout. Parsers usually discard whitespace and comments, and modifications are almost always defined on the parsed AST. In order to preserve layout when printing a modified value, parsers should save position information in the AST, but this makes modifications on the AST more complicated. I would like to develop a library or framework that helps developers by taking most of all this extra work out of their hands. \section{GroteTrap} Supervised by Johan Jeuring, Jeroen Leeuwestein and I created a library called \emph{GroteTrap} that allows developers to quickly define languages and their semantics while at the same time getting above functions for free. \cite{mvs08grotetrap} For example: > data Logic > = Var String > | Or [Logic] > | And [Logic] > | Impl Logic Logic > | Not Logic > deriving (Show, Eq) > > logicLanguage :: Language Logic > logicLanguage = language > { variable = Var > , operators = > [ Unary Not Prefix 0 "!" > , Assoc And ~ 1 "&&" > , Assoc Or ~ 2 "||" > , Binary Impl InfixR 3 "->" > ] > } With this definition we can do the following: > > let tree = readSentence logicLanguage "p && q -> r" > > unparse tree > "p && q -> r" > > evaluate logicLanguage tree > Impl (And [Var "p",Var "q"]) (Var "r") > > let selection = select [0,1] tree > > evaluate logicLanguage selection > Var "q" > > range selection > (5,6) > > rangeToSelection tree (5,6) > Just ([0,1],0) The developer gets a parser that turns strings into a universal AST for free. This AST, or parts of it, can in turn be converted into values of a custom data type (|Logic| in this example). The parser puts position information in the AST so that the AST can be used to map from text selections in the original source code (|"p && q -> r"|) to selections of the AST and vice versa. Building a datatype is not necessary. Consider the following example, where an |Int| is built right away: > arith :: Language Int > arith = language > { number = id > , operators = > [ Assoc sum 2 "+" > , Binary (-) InfixL 2 "-" > , Assoc product 1 "*" > , Binary div InfixL 1 "/" > ] > } > > evalArith :: String -> Int > evalArith = readExpression arith While GroteTrap helps with setting up small expression languages, anything bigger is out of its reach. There are two reasons for this. Firstly, the set of grammars supported is very small. GroteTrap allows you to define unary and binary operators and accepts numbers and identifiers as literals. Anything beyond that is very difficult to set up. For example, expressing the ternary operator \texttt{?:} found in programming languages such as Java and C\# in binary operators is cumbersome. \texttt{?} and \texttt{:} would both be binary operators, to be later combined into a single construct. This prevents the parser from discarding the use of \texttt{?} without an accompanying \texttt{:}, and vice versa. The grammar implemented by the parser is too liberal and consequently extra checks need to be done at a later stage. Other constructs such as a method definition \texttt{void method() \{ ... \}} would require even more complicated schemes, resulting in code that is hard to write, understand and maintain. Secondly, all nodes in the AST must be of the same type, as the unary and binary operators require functions of types |a -> a| and |a -> a -> a|, respectively. Imagine the following family of data types, an example taken from \cite{rodriguez2008mutrec}: > data Expr > = Const Int > | Add Expr Expr > | Mul Expr Expr > | EVar Var > | Let Decl Expr > data Decl > = Var := Expr > | Seq Decl Decl > type Var > = String It is impossible to have GroteTrap derive a parser for these data structures, because they are mutually recursive and hence the types of the nodes differ. For the same reason, nested data types \cite{birdmeertens98:nest} such as perfectly balanced trees are prohibited too: > data Perfect a > = Nil > | Branch (Perfect (a, a)) I want to improve on GroteTrap by lifting these restrictions as far as possible and sensible. \section{Research} \begin{figure}[hbtp] \centering \includegraphics[width=0.60\textwidth]{Pipeline} \caption{The refactoring cycle.} \label{fig:pipeline} \end{figure} Figure \ref{fig:pipeline} shows the parse-select-update-print pipeline. Each transition brings some interesting problems. I will discuss each transition in detail. \subsection{Parser combinators} GroteTrap allows only very limited grammars. To recover the full expressiveness of context-free grammars, the normal parser combinators present in libraries such as Parsec \cite{leijen01parsec} and UUlib \cite{swierstra1999fec} are needed: > token :: [s] -> Parser s [s] > (<|>) :: Parser s a -> Parser s a -> Parser s a > (<*>) :: Parser s (b -> a) -> Parser s b -> Parser s a > (<$>) :: (a -> b) -> Parser s a -> Parser s b Parallel to the normal user-defined values, these new combinators will build a generic rose-shaped parse tree with positional information. This tree can be used to map from text selections to paths in the parse tree. However, it is desirable that not every node in the parse tree be selectable. If they were, each individual symbol would be a valid selection, since each leaf in the parse tree corresponds to exactly one input symbol. To know which text selections are valid, a new combinator |unit| is introduced: > unit :: Parser s a -> Parser s a This function marks all the sentences accepted by the parser it is given as \emph{units}, which signify valid tree selections and therefore also valid text selections. Figure \ref{fig:unitparsetree} shows an example parse tree where the nodes that correspond to selectable subexpressions are marked as units. \begin{figure}[hbtp] \centering \includegraphics[width=0.30\textwidth]{ParseTree} \caption{The parse tree for the expression |"1+2*3"|. Solid borders indicate selectable unit elements.} \label{fig:unitparsetree} \end{figure} \cite{allentemporal} has studied the relations two time intervals can have. It is almost trivial to apply these relations to text selections, even though text selections work with integral units whereas time is (assumed to be) continuous. In Allen's terms, it is interesting to note that two distinct valid text selections on an expression never \emph{overlap}: either one is a complete subrange of the other (which does not count as an overlap in his work), or they are disjoint. Expressions using associative operators such as |1 + 2 + 3 + 4| always result in a left-hanging or right-hanging parse tree if the parser is defined using the combinators shown above. If the tree hangs to the left, all valid selections are either single numbers, or ranges starting with the |1|. The same goes for trees that hang to the right, but then with ranges ending with the |4|. But because |+| is an associative operator, all of these ranges are expected to be valid, and also ranges such as |2 + 3| which do not start or end at either border. Because this behavior is not expressible with the current combinators, a new combinator |chain| is introduced which allows arbitrary subranges of operands to be selected, as long as the selections are contiguous. > chain :: Parser s a -> Parser s () -> Parser s [a] The first argument to |chain| parses operands, while the second argument parses operators. Parsers are usually defined in two phases: lexing and building the parse tree. Since the parse tree with which most of the work will be done is composed of tokens, special care needs to be taken when mapping from text to tree selections. Firstly, an individual symbol (token) might correspond to more than one character in the source code. Secondly, certain tokens are usually discarded right after lexing (which justifies the use of two phases), such as whitespace and comments. In order to preserve layout, these should be kept without hindering the second phase of the parsing process. \subsection{Selecting} To represent selections within a parse tree, a zipper \cite{huet1997zipper} on the parse tree can be used. Zippers offer type-safe selections and efficient navigation and updating in tree-shaped data structures. When the value produced by the parser at a location pointed at by a zipper location needs to be retrieved, what is that value's type? This is unknown because parsers can have any local type and this type is lost at higher levels in the grammar. I will consider several solutions to this problem. The first is to introduce an extra global type parameter |g| to the |Parser| type: > type Parser s a g Function |unit| then also requires a function that lifts |a| to |g|. This still allows arbitrary local types, while guaranteeing that the value of a unit is of a type still known at higher levels. > unit :: (a -> g) -> Parser s a g -> Parser s a g > eval :: ParseTreeLoc s g -> g If all units were already of the same type, no harm is done, because |unit| can then be given the identity function. In our earlier example there were two distinct local types: |Expr| and |Decl|. In this case, the global type could be |Either Expr Decl|, and the functions given to |unit| would then be |Left| or |Right| respectively. In the general case, the user could define a wrapper data type, with one constructor for each possible local type: > data Family > = Expr Expr > | Decl Decl > | ... Another possible solution worth exploring is using zippers on the value types of the parsers. \cite{rodriguez2008mutrec} have developed a way to do generic programming on mutually recursive datatypes, including the derivation of a cross-type zipper. They too use a datatype family, implemented using an indexing type class |Ix|. Retrieving the value from a parse subtree would then look like: > eval :: Ix g ix => ParseTreeLoc s g -> ix \subsection{Updating} When part of the AST has been selected, any changes such as insertions or deletions of nodes can be performed. \cite{rodriguez2008mutrec}'s work can come in handy again here: using their library it is very easy to apply transformations on the tree, or to use the zipper again. Another option is to apply rewrite rules such as those discussed in \cite{rodriguez08generic}. \subsection{Printing} When converting the AST back to source code, as much as possible of the layout needs to be preserved, so the AST must somehow retain this information. For newly inserted nodes in the tree, new source code needs to be generated as necessary. GroteTrap derives a parser from a language definition; a related and equally interesting problem I would like to investigate is how to derive a pretty printer from a language definition. The pretty printer would not work on the |ParseTree|s, because those imply the result of a parse, which in turn requires a textual representation of the value to print: this would undermine my goal. Instead, the pretty printing works on the user-defined type, such as |Logic|. One possible solution involves requiring all semantic functions in the |Language| to be constructors of a data type. Hence, the arithmetic example shown earlier is no longer allowed. Pretty-printing a value then recursively inspects the constructors of the input value, converting them to their textual representation using the generic inspection functions from the Scrap Your Boilerplate library \cite{Lammel_PeytonJones:2003SYB} and the information from the |Language|. The next step would be to derive a pretty printer from a parser defined using parser combinators instead of a |Language|. Again, the first thing to try would be restricting the functions given to the |<$>| operator to datatype constructors. Consider: > data Logic > = T > | F > | And Logic Logic > | Or Logic Logic > | Impl Logic Logic Using |token|, |<||>|, |<*>| and |<$>| it might be possible to derive a pretty-printer if at each choice |<||>| each option is uniquely identified by a specific constructor. I would like to investigate the feasibility of this idea. \small \bibliographystyle{alpha} \bibliography{selections} \end{document}