\chapter{Introduction} \label{sec:introduction}

In the early days of computing, designing and implementing a new programming language was a rare and substantial undertaking. With the many tools that are available to language designers and programmers today, it has become much easier to build a language ``just for fun'', as witnessed by the many esoteric programming languages such as LOLCODE \cite{lindsay07lolcode}, Brainfuck \cite{mueller93brainfuck}, Piet \cite{morgan01marpiet}, and Iota and Jot \cite{barker01iotajot}. The increased ease of building a language is partly due to the numerous tools available for building parsers, either through parser combinators, such as the list-of-successes technique \cite{wadler85list, hutton98parsing}, Parsec \cite{leijen01parsec}, and those in the Utrecht toolset \cite{swierstra1999fec}, or through parser generators, such as lex and yacc \cite{levine92lex}, JavaCC \cite{viswanadha96javacc}, and ANTLR \cite{parr07antlr}. Another reason is the modularization of compilers: in the early days every new language had to implement the entire chain of processes that makes up a compiler. Nowadays it is common to split a compiler into clearly separated phases that allow reuse. Examples of this are Clojure, Groovy, and Scala, which compile to the Java platform, and C\#, F\#, and Visual Basic .NET, which compile to the .NET CLI, thereby inheriting all libraries available for those platforms. Another strategy is to compile to intermediate languages such as GRIN, LLVM, C-{}-, or even C, for which excellent compilers are already available.

The first steps of designing a new language include the definition of a semantics and of a data structure to capture programs. Then follow the parser and algorithms for evaluating programs or compiling them to another language. As a language matures, new features are added or existing ones are improved. There usually comes a point at which the quality of the error messages is improved. This is often done late in the development process, not only because it is very difficult to figure out what the user intended to do \cite{heeren05TopQuality}---which is necessary to give a good, descriptive message---but also because it requires changes that are pervasive throughout the whole compiler. Often extra state needs to be stored and passed around in order to give error messages that make sense. A good example is the location information accompanying error messages: if the user doesn't know where exactly an error occurred, it hardly matters how good or descriptive the error messages are; the user is not going to accept the compiler as a good one. However, adding support for location information requires changes to various components of the compiler:
\begin{itemize}
\item The datatypes that represent a compilation unit need to have fields added for storing the position annotations. Although this change is almost mechanical, it is not a small one: every constructor of every datatype that makes up the model needs to be extended (see the sketch following this list).
\item Producers of the now annotated model have to be changed to fill in the new fields. Parsers, for example, have to request the location information from the parse state and inject it into the model whenever a new node is constructed, distracting from the actual parsing process and requiring a distinction between annotated and unannotated nodes.
\item Consumers of the new model (functions taking the new model as input) need to be adapted as well, either to explicitly ignore the annotations if they are not needed, or to use them in the results if they are.
\end{itemize}
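To make the first of these changes concrete, here is a minimal sketch for a small, hypothetical expression language; the names and the representation of positions are purely illustrative:
\begin{code}
-- Before: the plain datatype a language implementer starts with.
data Expr = Num Int | Add Expr Expr | Div Expr Expr

-- A source position, here simply a character range in the input.
data Range = Range { from :: Int, to :: Int } deriving Eq

-- After: every constructor gains an extra field for its source range,
-- and every producer and consumer of the datatype must be updated.
data ExprPos
  =  NumPos Range Int
  |  AddPos Range ExprPos ExprPos
  |  DivPos Range ExprPos ExprPos
\end{code}
Even in this tiny example the annotated version is noticeably noisier, and the same transformation has to be repeated for every datatype in the model.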
Implementing these changes is not difficult, but it is definitely a lot of work. It is somewhat like a design pattern: a solution to a common problem that is usually not expressed as a code library or framework, but rather as a series of descriptive steps.

\section{Structural selections} \label{subsec:structures}

In this thesis the notion of a structural selection comes up often, so let us define more clearly what we mean by it. Data stored in computers is serialized: it consists of linear sequences of bits stored on physical media. Structure emerges when meaning is given to these bits, such that they can be manipulated in useful ways. A sure sign of structure is an entity that is composed of several smaller entities. Parser technology was developed to take a linear sequence of symbols (bits, bytes, characters, words), recognize substructures in it, and translate them into an in-memory model in which the structure is explicit. This makes it easy for computer programs and programmers to reason about the meaning of the input and to manipulate it. Structures and substructures are necessary for introducing variety: how else could one define a formal language with an infinite number of meaningful sentences? Given a structure with substructures, it is clear what a structural selection is: the selection of one or more particular substructures. Before the structure was unveiled by the parser, all we could do was select ranges of the input, such as a text selection in a text editor.

Parsers connect two different worlds of structure. On the one hand they are based on grammars, where the foundation for structure is formed by the production rules that make up the grammar. On the other hand, parsers usually build abstract syntax trees (ASTs), where the basis for the structure is formed by the constructors of the datatypes that the ASTs are built from. Substructures of an AST often relate directly to specific pieces of the source code. It therefore makes sense to ask: which text selection does this structural selection correspond to? The question in the other direction is just as reasonable. Real examples of this are the compilers we talked about in the first section: they work on structural ASTs, but when they want to refer to a specific selection, they need to express it as a text selection in the source code, because that is what the user fed to the compiler and that is what the user manipulates. Another example is the Outline View in the Eclipse editor (Figure \ref{fig:outlineview}). This view is usually shown next to the source code and provides a hierarchical overview of the source. Positioning the caret in the code highlights the corresponding node in the outline and vice versa: clicking on a node in the view highlights the corresponding text selection in the source. All of this functionality requires mapping between the two kinds of selections.

\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{OutlineView.png}
\caption{In Eclipse, the editor selection and Outline View selection are always in sync.}
\label{fig:outlineview}
\end{figure}

Obviously, there is much more freedom in text selections than in structural selections, and not every text selection corresponds to a selection in the AST. Therefore, we distinguish between text selections that do and don't correspond to structural selections. We will call them \emph{valid} and \emph{invalid} selections, respectively.
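As a small, purely illustrative sketch in terms of the hypothetical |Range| and |ExprPos| types from the sketch above: mapping a structural selection back to a text selection amounts to reading off the stored range, while a text selection could be considered valid when some node of the tree spans exactly the selected range.
\begin{code}
-- The text selection corresponding to a structural selection is simply
-- the range stored in the selected node.
rangeOf :: ExprPos -> Range
rangeOf (NumPos r _)    = r
rangeOf (AddPos r _ _)  = r
rangeOf (DivPos r _ _)  = r

-- In the other direction, a text selection maps to the node (if any)
-- that spans exactly the selected range; otherwise the selection is
-- invalid.
selectNode :: Range -> ExprPos -> Maybe ExprPos
selectNode sel e
  | rangeOf e == sel  = Just e
  | otherwise         = case e of
      NumPos _ _    -> Nothing
      AddPos _ x y  -> pick (selectNode sel x) (selectNode sel y)
      DivPos _ x y  -> pick (selectNode sel x) (selectNode sel y)
  where
    pick (Just n) _ = Just n
    pick Nothing  m = m
\end{code}
Writing such functions by hand for every datatype is tedious; Section \ref{sec:annoops} addresses these and other common editor use cases, such as fixing invalid selections.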
Many editors work on textual representations of models. Some, however, take a more graphical approach in which the structure is more apparent. In the EQL editor, for example, the user could drag blocks from one container to another. The EQL editor was built on the Graphical Editing Framework \cite{GEF}, which is designed for graphical, structural editors. Another very nice example of a structural editor is the equation editor that comes with Adobe FrameMaker: the user manipulates text, but in a structured way in which the tree shape of the expressions is apparent. For example, by pressing the space bar the user can select the next enclosing part of the expression, which corresponds to moving up one level in the AST.

In general-purpose programming languages, the distinction between the two structural worlds of grammar production rules and nodes in an AST is always explicit, and it is the parser's or pretty printer's job to connect the two. However, some tools specialized for term rewriting, such as TXL \cite{cordy06txl}, Stratego \cite{visser03stratego}, and Rascal \cite{klint09rascal}, adopt the radical view that \emph{the grammar is the AST}. In this view, the production rules \emph{are} the constructors of the AST, and usually some restrictions are put on the grammar, such as forbidding nested choices (denoted with a vertical bar \texttt{||} in BNF), because these introduce anonymous constructors, making it more difficult to work with parse trees later on. Whitespace usually receives special treatment in these languages as well, instead of being dealt with separately in a lexer phase. Working with languages in this way offers some important advantages: there is no need to specify the AST, parser, and pretty printer separately anymore, as these are unified into a single syntax definition. Furthermore, rewrite rules can be expressed using the syntax of the language being defined. This strategy works well in such specialized languages, but is very hard to adopt in general-purpose programming languages, because they usually don't allow customization of their syntax. There, concrete syntax has to be encoded in string literals, making it impossible to achieve strong static typing.

\section{Contributions}

This thesis is about making the mapping from text selections to structural selections, and vice versa, as easy as possible for the developer of a compiler or tool. As many as possible of the issues described in the introduction are hidden from the programmer. The solutions are developed in and for the functional, general-purpose programming language Haskell. As a teaser, here is a code snippet that shows an example use of the API presented in this thesis:
\begin{code}
exprEval expr = case expr of
  Num n    -> Right n
  Add x y  -> Right (x + y)
  Sub x y  -> Right (x - y)
  Mul x y  -> Right (x * y)
  Div x y
    | y == 0     -> Left "division by zero"
    | otherwise  -> Right (x `div` y)
\end{code}
It defines an algebra for evaluating an arithmetic expression. Using this algebra, we may call a function |compileExpr| as follows:
\begin{verbatim}
ghci> compileExpr exprEval "1 + 2/0 + 3 + 4/0"
Errors:
*  4- 7: division by zero
* 14-17: division by zero
\end{verbatim}
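Note that |exprEval| itself never mentions source positions, yet the reported errors include the exact character ranges of the offending subexpressions; supplying those positions automatically is what the machinery in the following chapters is about. Purely as an illustration (the names below are hypothetical and not the definitions used later), the algebra can be read as a function over a pattern functor whose recursive positions have already been replaced by results:
\begin{code}
-- Hypothetical pattern functor for the expression language: one
-- constructor per expression form, with the recursive positions
-- abstracted to the type parameter r.
data ExprF r
  =  NumF Int
  |  AddF r r
  |  SubF r r
  |  MulF r r
  |  DivF r r
\end{code}
Read this way, |exprEval| has a type along the lines of |ExprF Int -> Either String Int|: it receives the already evaluated subresults and either fails with a message or produces a value.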
Specifically, this thesis makes the following contributions:
\begin{itemize}
\item By using fixpoint views, a type-indexed datatype is defined that recursively inserts position information into algebraic datatypes (Chapter \ref{sec:fixedpoint}).
\item We will see how consumers of datatypes expressed as catamorphisms can be made to work automatically with both the bare datatypes and the derived, information-rich datatypes (Section \ref{sec:cata}). Furthermore, a new kind of catamorphism is introduced that makes the possibility of failure explicit, allowing automatic extraction of the relevant position information in case of failure (Section \ref{sec:errorcata}).
\item Several common editor use cases are solved, such as mapping from text selections to structural selections and back, and fixing invalid selections (Section \ref{sec:annoops}).
\item Parser combinators are introduced that help in building recursively annotated ASTs (Section \ref{sec:annoparse}).
\item The problems are solved once more using the generic programming library MultiRec, and the results are compared with the earlier solutions (Chapter \ref{sec:generic}).
\end{itemize}