\chapter{Representing textual selections} \label{sec:bounds} Text selections will be used throughout this entire thesis and play an important role. For that reason, we need to decide on a way to represent them. In order to represent text selections, we first need to represent a single position within a text. There are two representations of text position that are used often: offset from the start of the text and line-column numbers. Both are useful in different circumstances. Code editors usually present the programmer with line and column numbers, because code tends to be line-oriented. But behind the scenes programs usually work with offsets, especially if they need to do computations on these offsets. Only at the last moment, right before presenting the position information to the user, are the offsets converted to line-column numbers. Offsets are easier to work with because they do not depend on the exact characters in the input and can be expressed as a single number instead of two. Because offsets are easier to work with behind the scenes, we will use offsets to represent text positions in this thesis. A text selection is then a tuple of two offsets. We call the type of text selections |Range|: \begin{code} type Range = (Int, Int) \end{code} Offsets can be thought of as positions between two characters. They start at zero. For any text of length |n|, there are |n + 1| valid offsets, namely those in the closed interval |[0..n]|. For ranges |(left, right)| we always maintain the invariant that |0 <= left| and |left <= right|. We define some utility functions to work with offsets and ranges: \begin{code} posInRange :: Int -> Range -> Bool posInRange pos (left, right) = left <= pos && pos <= right rangeInRange :: Range -> Range -> Bool rangeInRange (left, right) range = left `posInRange` range && right `posInRange` range \end{code} Function |posInRange| tells whether a range contains a certain position. Ranges are closed intervals: positions may coincide with a range's end points. Function |rangeInRange inner outer| tells whether |inner| is a subrange of |outer|. Again, end points may coincide. % \begin{figure}{r}{0.5\textwidth} \begin{figure}[0.5\textwidth] \begin{center} \includegraphics[width=0.5\textwidth]{bounds.pdf} \end{center} \caption{The abstract syntax tree for the expression \texttt{2 + 3 * 5}. The bounds for the node representing the number |3| are indicated, showing the naming of the various ranges.} \label{fig:bounds} \end{figure} To be able to map between text selections and tree selections, we need to remember what each subtree's position in the input was. What information do we need to remember, exactly? To translate from tree selection to text selection, it is sufficient to store a |Range| with every node. However, if a single range is all we have for a node, then to translate a text selection back to a tree selection the user needs to select the \emph{exact} range for the text selection to be recognized. It is often the case that a structure is surrounded by some whitespace in the source text, and it seems only fair to allow the user to select some of this whitespace in addition to the structure itself. For that reason, we store not two but four offsets for each subtree: the point where the whitespace before the structure starts, the point where the whitespace turns into actual structural text, the point where the whitespace after the structure starts, and the point where this whitespace ends. Figure \ref{fig:bounds} shows these four offsets for the node representing the |3| in the arithmetic expression \texttt{2 + 3 * 5} and the terms we will use for some meaningful combinations of these offsets. Using this information, we can be flexible to the user and accept any text selection that starts between the first two offsets and ends between the last two offsets. The combination of these four offsets is stored in the datatype Bounds: \begin{code} data Bounds = Bounds { leftMargin :: Range, rightMargin :: Range } deriving (Eq, Show) innerRange :: Bounds -> Range innerRange (Bounds (_, left) (right, _)) = (left, right) outerRange :: Bounds -> Range outerRange (Bounds (left, _) (_, right)) = (left, right) \end{code} The bounds store the left and right margin. They could have stored the inner and outer range just as well, in which case |leftMargin| and |rightMargin| would have been the projection functions rather than |innerRange| and |outerRange|. However, because the same margin is often shared between multiple nodes in the tree, it is more memory-efficient to store the margins than the inner and outer ranges. To give an example: the subtree for |3 * 5| shares its left margin with the node for |3| and its right margin with the node for |5|. We define a function |rangeInBounds| that tells whether a range is a valid selection for a node that has the specified bounds: the left endpoint should be in the left margin, and the right endpoint should be in the right margin. \begin{code} rangeInBounds :: Range -> Bounds -> Bool rangeInBounds (l, r) b = l `posInRange` leftMargin b && r `posInRange` rightMargin b \end{code} The way the parse tree is shown in relation to its source text in figure \ref{fig:bounds} implies some laws we rely on and maintain as invariants when using and constructing annotated parse trees. These laws are: \begin{enumerate} \item A node's inner range is always enclosed by that node's outer range, i.e.~for every node's |bounds| we have that |innerRange bounds `rangeInRange` outerRange bounds|. \item Children appear in the same order in the source text as in the AST and their inner ranges do not overlap. So for each pair of adjacent siblings, we have that their respective bounds |bounds_1| and |bounds_2| adhere to |fst (rightMargin bounds_1) <= snd (leftMargin bounds_2)|. \item A node's bounds always enclose that node's children's bounds. In other words, for a parent's every child we have that |innerRange bounds_child `rangeInRange` innerRange bounds_parent|. \end{enumerate}