\chapter{Introduction} \label{sec:introduction}

In the early days of computing, designing and implementing a new programming language was a rare and substantial undertaking. With the many tools that are available to language designers and programmers today, it has become much easier to build a language ``just for fun'', as witnessed by the many esoteric programming languages such as LOLCODE \cite{lindsay07lolcode}, Brainfuck \cite{mueller93brainfuck}, Piet \cite{morgan01marpiet}, and Iota and Jot \cite{barker01iotajot}. The increased ease of building a language is partly due to the numerous tools available for building parsers, either through parser combinators, such as the list-of-successes technique \cite{wadler85list, hutton98parsing}, Parsec \cite{leijen01parsec}, and those in the Utrecht toolset \cite{swierstra1999fec}, or through parser generators, such as lex and yacc \cite{levine92lex}, JavaCC \cite{viswanadha96javacc}, and ANTLR \cite{parr07antlr}. Another reason is the modularization of compilers: in the early days every new language had to implement the entire chain of processes that makes up a compiler. Nowadays it is common to split a compiler into clearly separated phases that allow reuse. Examples of this are Clojure, Groovy, and Scala, which compile to the Java platform, and C\#, F\#, and Visual Basic .NET, which compile to the .NET CLI, thereby inheriting all libraries available for those platforms. Another strategy is to compile to intermediate languages such as GRIN, LLVM, C-{}-, or even C, for which excellent compilers are already available.

The first steps of designing a new language include the definition of a semantics and of a data structure to capture programs. Then follow the parser and algorithms for evaluating programs or compiling them to another language. As a language matures, new features are added or existing ones are improved. There usually comes a point at which the quality of the error messages is improved. This is often done late in the development process, not only because it is very difficult to figure out what the user intended to do \cite{heeren05TopQuality}---which is necessary to give a good, descriptive message---but also because it requires changes that are pervasive throughout the whole compiler. Often extra state needs to be stored and passed around in order to give error messages that make sense. A good example is the location information accompanying error messages: if the user doesn't know where exactly an error occurred, it hardly matters how good or descriptive the error messages are; the user is not going to accept the compiler as a good one. However, adding support for location information requires changes to various components of the compiler:
\begin{itemize}
\item The datatypes that represent a compilation unit need to have fields added for storing the position annotations. Although this change is almost mechanical, it is not a small one: every constructor of every datatype that makes up the model needs to be extended (see the sketch following this list).
\item Producers of the now annotated model have to be changed to fill in the new fields. Parsers, for example, have to request the location information from the parse state and inject it into the model whenever a new node is constructed, distracting from the actual parsing process and requiring a distinction between annotated and unannotated nodes.
\item Consumers of the new model (functions taking the new model as input) need to be adapted as well, either to explicitly ignore the annotations if they are not needed, or to use them in the results if they are.
\end{itemize}
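To make the first of these changes concrete, here is a minimal sketch for a small, hypothetical expression language; the names and the representation of positions are purely illustrative:
\begin{code}
-- Before: the plain datatype a language implementer starts with.
data Expr = Num Int | Add Expr Expr | Div Expr Expr

-- A source position, here simply a character range in the input.
data Range = Range { from :: Int, to :: Int } deriving Eq

-- After: every constructor gains an extra field for its source range,
-- and every producer and consumer of the datatype must be updated.
data ExprPos
  =  NumPos Range Int
  |  AddPos Range ExprPos ExprPos
  |  DivPos Range ExprPos ExprPos
\end{code}
Even in this tiny example the annotated version is noticeably noisier, and the same transformation has to be repeated for every datatype in the model.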
Implementing these changes is not difficult, but it is definitely a lot of work. It is somewhat like a design pattern: a solution to a common problem that is usually not expressed as a code library or framework, but rather as a series of descriptive steps.

\section{Structural selections} \label{subsec:structures}

In this thesis the notion of a structural selection comes up often, so let us define more clearly what we mean by it. Data stored in computers is serialized: it consists of linear sequences of bits stored on physical media. Structure emerges when meaning is given to these bits, such that they can be manipulated in useful ways. A sure sign of structure is an entity that is composed of several smaller entities. Parser technology was developed to take a linear sequence of symbols (bits, bytes, characters, words), recognize substructures in it, and translate them into an in-memory model in which the structure is explicit. This makes it easy for computer programs and programmers to reason about the meaning of the input and to manipulate it. Structures and substructures are necessary for introducing variety: how else could one define a formal language with an infinite number of meaningful sentences? Given a structure with substructures, it is clear what a structural selection is: the selection of one or more particular substructures. Before the structure was unveiled by the parser, all we could do was select ranges of the input, such as a text selection in a text editor.

Parsers connect two different worlds of structure. On the one hand they are based on grammars, where the foundation for structure is formed by the production rules that make up the grammar. On the other hand, parsers usually build abstract syntax trees (ASTs), where the basis for the structure is formed by the constructors of the datatypes that the ASTs are built from. Substructures of an AST often relate directly to specific pieces of the source code. It therefore makes sense to ask: which text selection does this structural selection correspond to? The question in the other direction is just as reasonable. Real examples of this are the compilers we talked about in the first section: they work on structural ASTs, but when they want to refer to a specific selection, they need to express it as a text selection in the source code, because that is what the user fed to the compiler and that is what the user manipulates. Another example is the Outline View in the Eclipse editor (Figure \ref{fig:outlineview}). This view is usually shown next to the source code and provides a hierarchical overview of the source. Positioning the caret in the code highlights the corresponding node in the outline and vice versa: clicking on a node in the view highlights the corresponding text selection in the source. All of this functionality requires mapping between the two kinds of selections.

\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{OutlineView.png}
\caption{In Eclipse, the editor selection and Outline View selection are always in sync.}
\label{fig:outlineview}
\end{figure}

Obviously, there is much more freedom in text selections than in structural selections, and not every text selection corresponds to a selection in the AST. Therefore, we distinguish between text selections that do and don't correspond to structural selections. We will call them \emph{valid} and \emph{invalid} selections, respectively.
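As a small, purely illustrative sketch in terms of the hypothetical |Range| and |ExprPos| types from the sketch above: mapping a structural selection back to a text selection amounts to reading off the stored range, while a text selection could be considered valid when some node of the tree spans exactly the selected range.
\begin{code}
-- The text selection corresponding to a structural selection is simply
-- the range stored in the selected node.
rangeOf :: ExprPos -> Range
rangeOf (NumPos r _)    = r
rangeOf (AddPos r _ _)  = r
rangeOf (DivPos r _ _)  = r

-- In the other direction, a text selection maps to the node (if any)
-- that spans exactly the selected range; otherwise the selection is
-- invalid.
selectNode :: Range -> ExprPos -> Maybe ExprPos
selectNode sel e
  | rangeOf e == sel  = Just e
  | otherwise         = case e of
      NumPos _ _    -> Nothing
      AddPos _ x y  -> pick (selectNode sel x) (selectNode sel y)
      DivPos _ x y  -> pick (selectNode sel x) (selectNode sel y)
  where
    pick (Just n) _ = Just n
    pick Nothing  m = m
\end{code}
Writing such functions by hand for every datatype is tedious; Section \ref{sec:annoops} addresses these and other common editor use cases, such as fixing invalid selections.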
Many editors work on textual representations of models. Some, however, take a more graphical approach in which the structure is more apparent. In the EQL editor, for example, the user could drag blocks from one container to another. The EQL editor was built on the Graphical Editing Framework \cite{GEF}, which is designed for graphical, structural editors. Another very nice example of a structural editor is the equation editor that comes with Adobe FrameMaker: the user manipulates text, but in a structured way in which the tree shape of the expressions is apparent. For example, by pressing the space bar the user can select the next enclosing part of the expression, which corresponds to moving up one level in the AST.

In general-purpose programming languages, the distinction between the two structural worlds of grammar production rules and nodes in an AST is always explicit, and it is the parser's or pretty printer's job to connect the two. However, some tools specialized for term rewriting, such as TXL \cite{cordy06txl}, Stratego \cite{visser03stratego}, and Rascal \cite{klint09rascal}, adopt the radical view that \emph{the grammar is the AST}. In this view, the production rules \emph{are} the constructors of the AST, and usually some restrictions are put on the grammar, such as forbidding nested choices (denoted with a vertical bar \texttt{||} in BNF), because these introduce anonymous constructors, making it more difficult to work with parse trees later on. Whitespace usually receives special treatment in these languages as well, instead of being dealt with separately in a lexer phase. Working with languages in this way offers some important advantages: there is no need to specify the AST, parser, and pretty printer separately anymore, as these are unified into a single syntax definition. Furthermore, rewrite rules can be expressed using the syntax of the language being defined. This strategy works well in such specialized languages, but is very hard to adopt in general-purpose programming languages, because they usually don't allow customization of their syntax. There, concrete syntax has to be encoded in string literals, making it impossible to achieve strong static typing.

\section{Contributions}

This thesis is about making the mapping from text selections to structural selections, and vice versa, as easy as possible for the developer of a compiler or tool. As many as possible of the issues described in the introduction are hidden from the programmer. The solutions are developed in and for the functional, general-purpose programming language Haskell. As a teaser, here is a code snippet that shows an example use of the API presented in this thesis:
\begin{code}
exprEval expr = case expr of
  Num n    -> Right n
  Add x y  -> Right (x + y)
  Sub x y  -> Right (x - y)
  Mul x y  -> Right (x * y)
  Div x y
    | y == 0     -> Left "division by zero"
    | otherwise  -> Right (x `div` y)
\end{code}
It defines an algebra for evaluating an arithmetic expression. Using this algebra, we may call a function |compileExpr| as follows:
\begin{verbatim}
ghci> compileExpr exprEval "1 + 2/0 + 3 + 4/0"
Errors:
*  4- 7: division by zero
* 14-17: division by zero
\end{verbatim}
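Note that |exprEval| itself never mentions source positions, yet the reported errors include the exact character ranges of the offending subexpressions; supplying those positions automatically is what the machinery in the following chapters is about. Purely as an illustration (the names below are hypothetical and not the definitions used later), the algebra can be read as a function over a pattern functor whose recursive positions have already been replaced by results:
\begin{code}
-- Hypothetical pattern functor for the expression language: one
-- constructor per expression form, with the recursive positions
-- abstracted to the type parameter r.
data ExprF r
  =  NumF Int
  |  AddF r r
  |  SubF r r
  |  MulF r r
  |  DivF r r
\end{code}
Read this way, |exprEval| has a type along the lines of |ExprF Int -> Either String Int|: it receives the already evaluated subresults and either fails with a message or produces a value.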
Specifically, this thesis makes the following contributions:
\begin{itemize}
\item By using fixpoint views, a type-indexed datatype is defined that recursively inserts position information into algebraic datatypes (Chapter \ref{sec:fixedpoint}).
\item We will see how consumers of datatypes expressed as catamorphisms can be made to work automatically with both the bare datatypes and the derived, information-rich datatypes (Section \ref{sec:cata}). Furthermore, a new kind of catamorphism is introduced that makes the possibility of failure explicit, allowing automatic extraction of the relevant position information in case of failure (Section \ref{sec:errorcata}).
\item Several common editor use cases are solved, such as mapping from text selections to structural selections and back, and fixing invalid selections (Section \ref{sec:annoops}).
\item Parser combinators are introduced that help in building recursively annotated ASTs (Section \ref{sec:annoparse}).
\item The problems are solved once more using the generic programming library MultiRec, and the results are compared with the earlier solutions (Chapter \ref{sec:generic}).
\end{itemize}