\chapter{Representing textual selections}

\label{sec:bounds}

Text selections will be used throughout this entire thesis and play an
important role. For that reason, we need to decide on a way to represent them.
In order to represent text selections, we first need to represent a single
position within a text.

There are two representations of text position that are used often: offset
from the start of the text and line-column numbers. Both are useful in
different circumstances. Code editors usually present the programmer with line
and column numbers, because code tends to be line-oriented.

But behind the scenes programs usually work with offsets, especially if they
need to do computations on these offsets. Only at the last moment, right
before presenting the position information to the user, are the offsets
converted to line-column numbers. Offsets are easier to work with because they
do not depend on the exact characters in the input and can be expressed as a
single number instead of two.

Because offsets are easier to work with behind the scenes, we will use offsets
to represent text positions in this thesis. A text selection is then a tuple
of two offsets. We call the type of text selections |Range|:

\begin{code}
type Range = (Int, Int)
\end{code}

Offsets can be thought of as positions between two characters. They start at
zero. For any text of length |n|, there are |n + 1| valid offsets, namely
those in the closed interval |[0..n]|. For ranges |(left, right)| we always
maintain the invariant that |0 <= left| and |left <= right|.

We define some utility functions to work with offsets and ranges:

\begin{code}
posInRange :: Int -> Range -> Bool
posInRange pos (left, right) = left <= pos && pos <= right

rangeInRange :: Range -> Range -> Bool
rangeInRange (left, right) range  =   left   `posInRange` range
                                  &&  right  `posInRange` range
\end{code}

Function |posInRange| tells whether a range contains a certain position.
Ranges are closed intervals: positions may coincide with a range's end points.
Function |rangeInRange inner outer| tells whether |inner| is a subrange of
|outer|. Again, end points may coincide.

% \begin{figure}{r}{0.5\textwidth}
\begin{figure}[0.5\textwidth]
\begin{center}
\includegraphics[width=0.5\textwidth]{bounds.pdf}
\end{center}
\caption{The abstract syntax tree for the expression \texttt{2 + 3 * 5}. The
bounds for the node representing the number |3| are indicated, showing the
naming of the various ranges.}
\label{fig:bounds}
\end{figure}

To be able to map between text selections and tree selections, we need to
remember what each subtree's position in the input was. What information do we
need to remember, exactly? To translate from tree selection to text selection,
it is sufficient to store a |Range| with every node. However, if a single
range is all we have for a node, then to translate a text selection back to a
tree selection the user needs to select the \emph{exact} range for the text
selection to be recognized. It is often the case that a structure is
surrounded by some whitespace in the source text, and it seems only fair to
allow the user to select some of this whitespace in addition to the structure
itself.

For that reason, we store not two but four offsets for each subtree: the point
where the whitespace before the structure starts, the point where the
whitespace turns into actual structural text, the point where the whitespace
after the structure starts, and the point where this whitespace ends. Figure
\ref{fig:bounds} shows these four offsets for the node representing the |3| in
the arithmetic expression \texttt{2 + 3 * 5} and the terms we will use for
some meaningful combinations of these offsets. Using this information, we can
be flexible to the user and accept any text selection that starts between the
first two offsets and ends between the last two offsets.

The combination of these four offsets is stored in the datatype Bounds:

\begin{code}
data  Bounds = Bounds { leftMargin :: Range, rightMargin :: Range }
  deriving (Eq, Show)

innerRange :: Bounds -> Range
innerRange (Bounds (_, left) (right, _)) = (left, right)

outerRange :: Bounds -> Range
outerRange (Bounds (left, _) (_, right)) = (left, right)
\end{code}

The bounds store the left and right margin. They could have stored the inner
and outer range just as well, in which case |leftMargin| and |rightMargin|
would have been the projection functions rather than |innerRange| and
|outerRange|. However, because the same margin is often shared between
multiple nodes in the tree, it is more memory-efficient to store the margins
than the inner and outer ranges. To give an example: the subtree for |3 * 5|
shares its left margin with the node for |3| and its right margin with the
node for |5|.

We define a function |rangeInBounds| that tells whether a range is a valid
selection for a node that has the specified bounds: the left endpoint should
be in the left margin, and the right endpoint should be in the right margin.

\begin{code}
rangeInBounds :: Range -> Bounds -> Bool
rangeInBounds (l, r) b =
  l  `posInRange`  leftMargin   b &&
  r  `posInRange`  rightMargin  b
\end{code}

The way the parse tree is shown in relation to its source text in figure
\ref{fig:bounds} implies some laws we rely on and maintain as invariants when
using and constructing annotated parse trees. These laws are:

\begin{enumerate}
  \item A node's inner range is always enclosed by that node's outer range,
  i.e.~for every node's |bounds| we have that |innerRange bounds
  `rangeInRange` outerRange bounds|.
  \item Children appear in the same order in the source text as in the AST and
  their inner ranges do not overlap. So for each pair of adjacent siblings, we
  have that their respective bounds |bounds_1| and |bounds_2| adhere to |fst
  (rightMargin bounds_1) <= snd (leftMargin bounds_2)|.
  \item A node's bounds always enclose that node's children's bounds. In other
  words, for a parent's every child we have that |innerRange bounds_child
  `rangeInRange` innerRange bounds_parent|.
\end{enumerate}