Tokenisation - Part I - From chaos to control.

λ March 9, 2021
Tags: felt, prolog

The first step in processing any computer language is to turn the meaningless jumble of bytes in the source file into a sequence of more meaningful character groups, or tokens, hence the term Tokenisation. This article is an explanation of that process and of how it has been implemented for a transpiler project using SWI Prolog as the implementation language.

But I don’t know Prolog!

You say that but… you are probably a decent developer and you have probably had exposure to functions and booleans. That will do for now. Just think of every Prolog predicate as a function that returns true or false and has a bunch of parameters. Convention dictates that the parameters are, from left to right, inputs and then outputs.

Here’s a very very VERY short summation of how to read the code but trust me this is literally the tip of the iceberg and I don’t mean lettuce:

    % By convention we run inputs on the left, outputs on the right.
    a_predicate(Arg1In, Arg2In, ResultOut) :-
        foo(Arg1In, Temp1),    % Temp1 = foo(Arg1In)
        bar(Arg2In, Temp2),    % Temp2 = bar(Arg2In)
        ResultOut = [Temp1, Temp2].

In a nutshell, a_predicate takes two input arguments and returns a result in ResultOut if and only if all three of its contained predicate calls also succeed. If foo() succeeds, then it will attempt to call bar(), and if that succeeds then it will unify the ResultOut variable with the two returned values as a list.
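
To make that concrete, here is a self-contained version you can load and run at the SWI-Prolog top level; foo/2 and bar/2 are hypothetical stand-ins that just wrap their input so you can see where each value ends up:

    % Hypothetical helpers, only here so the example runs.
    foo(X, foo_saw(X)).
    bar(X, bar_saw(X)).

    a_predicate(Arg1In, Arg2In, ResultOut) :-
        foo(Arg1In, Temp1),
        bar(Arg2In, Temp2),
        ResultOut = [Temp1, Temp2].

    % ?- a_predicate(1, 2, Result).
    % Result = [foo_saw(1), bar_saw(2)].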

Unification is one of the core concepts behind Prolog programming and, in its simplest form, you can think of it as pattern matching between things that are known and things that are unknown. If the Prolog engine determines that two expressions can be unified then that line succeeds. Unification is way more powerful than that but I don’t want to drown you in details too soon.
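
If you want a feel for it, a few throwaway queries at the SWI-Prolog top level show unification doing its pattern-matching thing:

    ?- X = hello.              % X is unbound, so it unifies: X = hello
    ?- foo(X, 2) = foo(1, Y).  % both sides match up: X = 1, Y = 2
    ?- foo(1) = bar(1).        % different functor names, so this fails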

For a better insight than mine, read here; it has a flowchart that helps.

That’s about as much as I feel you need to know to read the rest of this article. Prolog is a subtle and incredibly powerful language, and the logic paradigm is one that not many people are familiar with. Time spent learning Prolog is not time wasted. The first version of Erlang was written in Prolog!

Controlling the Chaos.

The language that I want to process uses the LISP s-expression format; if you have written Clojure code then you already know what that looks like:

    ; FELT comments begin with a semi-colon until the end of the line
    ; and /* .. */ can also be used as well. A simple function to add
    ; two numbers and return the result. Wow. Much code, many awesome.
    (defun our-function (arg1 arg2)
        (return (+ arg1 arg2)))

It is pretty easy to see how it goes; args in, do stuff, return the sum. In normal LISP the return value from a function is the value of the last executed form, but in FELT you have to be explicit as not all of the target languages have this feature. I may change that but I have to finish a version first!

If that file was called simple.felt and we examined it on the disk, this is what it would look like as a raw data file, output provided by using the od command:

    $ od -a simple.felt
    0000000    ;  sp   F   E   L   T  sp   c   o   m   m   e   n   t   s  sp
    0000020    b   e   g   i   n  sp   w   i   t   h  sp   a  sp   s   e   m
    0000040    i   -   c   o   l   o   n  sp   u   n   t   i   l  sp   t   h
    0000060    e  sp   e   n   d  sp   o   f  sp   t   h   e  sp   l   i   n
    0000100    e  nl   ;  sp   a   n   d  sp   /   *  sp   .   .  sp   *   /
    0000120   sp   c   a   n  sp   a   l   s   o  sp   b   e  sp   u   s   e
    0000140    d  sp   a   s  sp   w   e   l   l   .  sp   A  sp   s   i   m
    0000160    p   l   e  sp   f   u   n   c   t   i   o   n  sp   t   o  sp
    0000200    a   d   d  nl   ;  sp   t   w   o  sp   n   u   m   b   e   r
    0000220    s  sp   a   n   d  sp   r   e   t   u   r   n  sp   t   h   e
    0000240   sp   r   e   s   u   l   t   .  sp   W   o   w   .  sp   M   u
    0000260    c   h  sp   c   o   d   e   ,  sp   m   a   n   y  sp   a   w
    0000300    e   s   o   m   e   .  nl   (   d   e   f   u   n  sp   o   u
    0000320    r   -   f   u   n   c   t   i   o   n  sp   (   a   r   g   1
    0000340   sp   a   r   g   2   )  nl  sp  sp  sp  sp   (   r   e   t   u
    0000360    r   n  sp   (   +  sp   a   r   g   1  sp   a   r   g   2   )
    0000400    )   )  nl                                                    

So the challenge is turning the above pavement pizza of bytes into something we can actually work with. The ‘something’ is called a token. A token is just a sequence of bytes in the file that has a semantic meaning to the source language. For example, in FELT the characters () are used to enclose an s-expression, the characters [] are for lists of things and {} are for association maps / hash-maps. Strings are parsed as content between "" or '' pairs; FELT makes no distinction, but the target language may or may not attempt string interpolation according to which is used.

Sucking it up…

Given a file location, the first thing to do is to attempt to read that file:

tokenise_file(File, Tokens) :-
    tokenise_file(File, Tokens, [nocomments]).

tokenise_file(File, Tokens, Options) :-
    phrase_from_file(tokens_(TokStream), File),
    process_options(Options, TokStream, Tokens).

The first predicate takes the location of the source file as input and says that it will return a list of the tokens in the second argument, Tokens. Sounds reasonable. It does this by calling out to another predicate of the same name but of a different arity: tokenise_file/2 uses tokenise_file/3 to do the heavy lifting, passing an option flag which elides comment tokens from the final list of tokens. Mostly the transpiler doesn’t care about comments, but by using the appropriate command line switch it will dutifully echo them through if required.
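
As a hypothetical top-level session (the file name is just the simple.felt example from above), the two entry points would be used like this; the three-argument form with an empty options list keeps the comment tokens:

    % Comments stripped (the default, via the nocomments option):
    ?- tokenise_file('simple.felt', Tokens).

    % No options given, so the comment tokens are kept:
    ?- tokenise_file('simple.felt', Tokens, []).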

Grinding the bytes

The real process of reading the file and applying the tokenisation logic is done by the call to the predicate phrase_from_file/2. As in Erlang, we name a function by its name and arity, and there can be multiple arity versions of a function, sorry, a predicate.

The first argument is: tokens_(TokStream), the second argument is: File. The official documentation for phrase_from_file/2 says that the first parameter is :Grammar which is a set of DCG rules that are applied to the file until it is fully consumed, and the second parameter is +File which is the source file.
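
If you have never met phrase_from_file/2 before, here is a minimal, FELT-free sketch of the same pattern: a toy DCG that simply reads the whole file back as a list of character codes. The predicate lives in library(pio) in SWI Prolog.

    :- use_module(library(pio)).

    % Consume every code in the stream into a list.
    all_codes([C|Cs]) --> [C], !, all_codes(Cs).
    all_codes([])     --> [].

    % ?- phrase_from_file(all_codes(Codes), 'simple.felt'), length(Codes, N).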

Once the phrasing has been applied, process_options/3 transforms the list of tokens into the final output. Note that we asked for comments to be left out; this is what the nocomments option does. The entire code for that is shown here for clarity (and completeness); it’s pretty simple:

%%        process_options(+Options, +Tokens, -Out).
%
%         For each option given in the Options list, we apply the
%         handler to the current list of tokens until all options are
%         processed.

process_options([], Out, Out) :- !.

process_options([nocomments|Opts], Tokens, Out) :-
    debug(tokeniser_options, '> filter: remove all comments', []),
    exclude(filter_token([com1, comn]), Tokens, Filtered),
    !,
    debug(tokeniser_nocomments, 'BEFORE: ~p~nAFTER:  ~p~n', [Tokens, Filtered]),
    process_options(Opts, Filtered, Out).

process_options([O|Opts], Tokens, Out) :-
    debug(tokeniser, '> filter? ~w', O),
    process_options(Opts, Tokens, Out).
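
filter_token/2 isn’t shown in this post; given how exclude/3 uses it, a plausible definition (a guess for illustration, not necessarily the real one) succeeds when the token’s functor name is one of the listed names, so exclude/3 drops exactly those tokens:

    % Hypothetical sketch of filter_token/2: true if Token is one of
    % the kinds named in Names, e.g. filter_token([com1, comn], Token).
    filter_token(Names, Token) :-
        functor(Token, Name, _Arity),
        memberchk(Name, Names).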

The first clause is used when the options list is empty, that is, we have processed all the options given (or it really was empty to start with), so we unify +Tokens with -Out, making -Out contain the final list.

There is a convention that when documenting predicates one uses various prefix characters, simply:

    +Var   % Var is -expected- to be a bound variable
    -Var   % Var is -expected- to be unbound, empty, void, nothing in it yet.
    :Var   % Var is a meta-argument. Implies +, higher-order use case.

Bound is just a posh way of saying ‘has a value’ and unbound means the variable is devoid of a value, waiting to be filled in. There are other prefix characters but I won’t explain them now. They are called ‘modes’ and are fully explained here for SWI Prolog. In Mercury you -have- to specify them so the compiler can actually do its job as efficiently as possible. This comes about because in Prolog there is no return value from a predicate; the true/false is internal, so you have to pass in a variable for every value you want to track or process. This can make recursion a little noisy at times, but not usually.
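
Here is a generic (non-FELT) illustration of that style: the ‘return value’ of list_length/2 is just an extra argument that gets bound on the way back out of the recursion.

    % list_length(+List, -Len): Len is the number of elements in List.
    list_length([], 0).
    list_length([_|Tail], Len) :-
        list_length(Tail, Len0),
        Len is Len0 + 1.

    % ?- list_length([a, b, c], N).
    % N = 3.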

Show me the code already!

OK, enough of the preamble and set-up; let’s see how you turn a series of characters into a list of tokens for subsequent processing. You read this far, I thank you.

If you recall, the tokenisation process is triggered by this call:

    phrase_from_file(tokens_(TokStream), File),

and here, for you to see and marvel at its splendour, is the full (current) implementation of tokens_. There is another convention in Prolog that any predicate ending with an underscore is an ‘auxiliary’ predicate; this means it is not meant to be called from anywhere other than a predicate called tokens, but it is just a convention. It’s explained amazingly well by Markus Triska on his YouTube channel, “The Power of Prolog”, and I love his voice too! I feel smarter just watching them.

I’ve decided to post the full code here, comment block and all, as it helps to explain (to me at least) what I was thinking about at the time. Essentially, tokens_ is the recursively called auxiliary predicate that chomps its way through the input file one character at a time; it has been told to look for certain things and to ‘emit’ a token when it finds them.

%%        tokens_(-Tokens).
%
%         This consumes as many valid tokens from the input stream
%         until exhaustion. Note that the final fail leaves any
%         trailing whitespace in the stream so we have to ensure that
%         we consume all that is left or the return to phrase/2 from
%         phrase_from_file/2 also fails!
%
%         The DCG rules rely on some state being maintained in the
%         head of the queue, the position, stored as a tuple of (Pos,
%         L, C). Pos is the 0-based index from the start of the stream
%         as presented, L and C are line and column.
%
%         @see movepos/3

tokens_(Tokens) -->
    push_state((0,1,0)),
    token_collect(Tokens),
    consume_all,
    pop_state(_),
    !.

token_collect([T| Ts]) -->
    skip_ws, (
        bracket(T) |
        term(T)
    ), !,
    token_collect(Ts).

token_collect([]) --> [].
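
skip_ws//0 and consume_all//0 are referenced above but not shown in this post. As a rough idea of what they do, a much simplified sketch (which ignores the position tuple the real code keeps at the head of the stream) might look like this:

    % Simplified sketch only: eat whitespace codes, or eat everything left.
    skip_ws --> [C], { code_type(C, space) }, !, skip_ws.
    skip_ws --> [].

    consume_all --> [_], !, consume_all.
    consume_all --> [].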

Definite Clause Grammars

Most Prolog distributions come with the facility to allow a ‘grammar’ to be specified using --> syntax. It’s worth pointing out that this is syntactic sugar for difference list manipulation, but I am not going to explain that for now. All you need to be aware of is that behind the scenes in DCG execution there is a list of the unparsed input. There is also a technique of pushing state onto this list (stack, if you will) so that every DCG rule has access to read and modify the state; this saves having to add a state parameter to every single rule.
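
I won’t show the real push_state//1 and pop_state//1 here, but a minimal sketch of the idea, using DCG pushback (semicontext) notation, looks something like this:

    % Minimal sketch: push S onto the front of the unparsed input,
    % or take it back off again.
    push_state(S), [S] --> [].
    pop_state(S)       --> [S].

    % Peek at the current state without disturbing the stream:
    get_state(S), [S] --> [S].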

FELT Tokens

The output from the tokeniser will be a list of functors (data structures) that represent the contents of the source file at a higher level than bytes. Every token consists of the position of the token in the source code, followed by optional arguments if required.

    token-functor-name( (offset, line, column) [, ...body... ] )

Token Position & Content

The position is formed as a 3-tuple of (offset, line, column) and, as we can see, the push_state//1 rule starts off the first token at offset zero, line one, column zero. As we munch our way through the file this is updated, and every time we emit a token it will contain the starting position of the token and, if required, the token body. Some tokens don’t need a body as the name alone is enough to place the contents. Here are the possible tokens that can be returned:

    o_paren c_paren(Pos, Body) %  "("   ")"
    o_list  c_list(Pos, Body)  %  "["   "]"
    o_map   c_map(Pos, Body)   %  "{"   "}"
    str1(Pos, Body)            % single quoted string  using \'
    str2(Pos, Body)            % double   "      "     using \"
    token(Pos, Body)           % default textual token
    com1(Pos, Body)            % single line comment
    comn(Pos, Body)            % block comment

Chopping it all up

Looking back up the page we can see that token_collect//1 is the bad boy that does it. Really does it… it is a recursive predicate that basically skips any current whitespace and then attempts to analyse the head of the input list to see what it can find. Let’s break it down:

token_collect([T| Ts]) -->
    skip_ws, (
        bracket(T) |
        term(T)
    ), !,
    token_collect(Ts).

token_collect([]) --> [].

The SECOND clause is the TERMINATING clause; the first is the RECURSIVE case. When the input list is empty, we are done. In Prolog the COMMA is read as AND, given the nature of what a predicate is, so, reading aloud, we have: skip whitespace AND (match a bracket OR match a term), commit to that result (!, the cut operator) and then call token_collect//1 again. Also note I use | to mean OR; this would normally be the ; character but I prefer | as it reads like BNF. This is the only place I relax my use of the semi-colon in the code base to mean OR.
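
For comparison, here is the same recursive clause written with the standard ; disjunction; the behaviour is identical, it just reads less like BNF to my eye:

    token_collect([T| Ts]) -->
        skip_ws, (
            bracket(T) ;
            term(T)
        ), !,
        token_collect(Ts).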

Note that the list of tokens is constructed by using stack frames; it looks a little odd at first but it is a very powerful and common idiom in Prolog applications. On entry, the Tokens list is unbound, awaiting a value, and here we see that we have pattern matched that list by deconstructing it into a head and a tail. Again, if you have used Erlang or Elixir this will look familiar to you.

Stack frames all the way down…and back

On entry for the first time, [T|Ts] splits the list, but the list has no value yet, so T and Ts have no value either. If either bracket//1 or term//1 succeeds then T is BOUND to a value as shown earlier for one of the tokens, for example str2((0,1,0), "Cool") means a double-quoted string was found at the start of the file.

The eagle-eyed amongst you may have noticed that I have written token_collect//1 with a double forward slash, not a single one, well done! This is a convention to show that it is a DCG rule, not a normal predicate.

token_collect([T| Ts]) -->
    skip_ws, (
        bracket(T) |
        term(T)
    ), !,
    token_collect(Ts).

token_collect([]) --> [].

Having -unified- T with a token, we call token_collect//1 again with the still unbound tail of the list we saw coming in, and this goes around again looking for another bracket or term, again and again, until the clause fails because we’ve run out of data. At that point the second, terminating clause is used, and it returns to the caller because it does not call itself again. The stack frames are then unwound and, as execution ascends, each T now has a value; on return to the previous stack frame the list is grown from the head out because, if you remember, we split the list into a head and a tail, unify the head with the latest token, and then split and unify again.
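
Sketched as a (made-up) trace, the head-and-tail splits nest like this until the terminating clause supplies the final []:

    % Tokens = [T1|Ts1]       T1 bound by bracket//1 or term//1
    %   Ts1  = [T2|Ts2]       T2 bound the next time around
    %     Ts2 = [T3|Ts3]      ... and so on ...
    %       Ts3 = []          terminating clause, nothing left to match
    %
    % Unwinding the frames gives Tokens = [T1, T2, T3].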

OK, in the next part we will look at how we actually detect each individual token, a character at a time… I feel this post is a little bit long now and you may be feeling tired. If you made it this far, thank you.

LEARN PROLOG!