K.I.S.S. my AST!

λ March 21, 2021
Tags: prolog, felt, life

Perseverance is 99.9% of being a developer. What’s the other 0.1%? After careful thought, I think it’s as simple as this: ‘wanting to’, or, to use a more modern expression, ‘giving a f*’. When you are working on a long-term solo project that’s now in its ninth year of existence, when you did V1.0 but didn’t release it because you felt(!) it wasn’t good enough, when you tried to write it in C and in Haskell but kept coming back to Prolog… Sigh, one day it will all be meaningless, I hope.

I have been working on my transpiler project (FELT) for a while now, and after a ten month hiatus of sorts (furlough, cancer, redundancy, new job) I have finally managed to summon up the ginger / courage / be-arsed to dive back into the code I wrote. On the one hand, I hadn’t realised just how much code I’d written in those eight months of furlough: beautifully documented, following Prolog style guides, the works. On the other hand, I hadn’t realised how hard it would be to get back into the mindset of a stressed-out, furloughed old git of a developer’s code!!!

It’s not that the code is scruffy, it’s just that I was obviously ‘on a roll’ with it, and somehow, despite each predicate being lavishly and exquisitely documented, what isn’t jumping back out is the overall process flow. It’s fine to see the individual predicates, the notes on the stave, but what tune is being played?

This is part of the reason that I am creating not just FELT but an IDE to go with it; this IDE will not use files, nor will it use source code, and no, it won’t be another sad 4GL thing either. Let’s just say that the last language I learned (J), to the point of knowing it well enough to write a PostgreSQL interface and an SDL graphics library interface with it, opened my eyes to a few things about the way in which the brain assimilates things on the page, and also how the correct design of a language, textual or graphical, can have a huge impact on how it is perceived, understood and used.

I can recommend several readings of this paper; I’ve fully read it three times, and lost count of the number of times I’ve had to stop and go back to take on board the full enormity of what has just been stated. The paper is by Kenneth Iverson:

Notation as a Tool of Thought

Mr. Iverson being, of course, the creator of APL. About two years ago I spent about a year learning Dyalog APL, and then I found that Iverson had moved on from that and created J instead. I am so glad he did, because it made me realise how much time is wasted just typing out reams of characters to get something done. APL, and hence J, are so damned powerful it blows your mind. Think of any implementation of the core algorithm for Conway’s Game of Life and then marvel at this, the same thing in one line of code. This is Dyalog APL:

life ← {⊃1 ⍵ ∨.∧ 3 4 = +/ +⌿ ¯1 0 1 ∘.⊖ ¯1 0 1 ⌽¨ ⊂⍵}

and the same algorithm in J:

life=: (3 3 (+/ e. 3+0,4&{)@,;._3 ])@(0,0,0,.0,.])

Before you wonder: no, that’s not a UTF-8 rendering error in your browser, that’s actually what APL and J code look like, for real. And what scares me is that I can read it!

In the last few weeks, in spare time, I have slowly been able to get my head back into the code I wrote and then didn’t see for ten months. Prolog is lovely to read, and once you know how, it doesn’t take long to understand what each predicate is doing. No, the real issue is what’s going on between the predicates: the call ordering, the messaging if you will. I am not the first person to suffer this, nor will I be the last, and it impinges a little on one of my criticisms of test-driven development: you can have 100% test coverage yet the application is still crap, crashes, or doesn’t do what you expected, especially if it exists within a larger inter-function hive-mind group. TDD, to my mind, is for intra-function behavioural testing.


Abstract Syntax Trees

These do not grow in the wild yet. An abstract syntax tree (AST) is an internal data structure that contains the tokens from the input source file and holds them in such a way that the original ‘intent’ of the source is preserved. The transformation is possibly isomorphic on some level, whitespace and comments notwithstanding, in that a correctly ordered traversal of the AST would reveal the original source text. In the case of FELT, the default is to elide comments and whitespace on the grounds that only the source code matters and the final rendered code is NOT for human consumption. Despite that, I have ensured that the final code is passed through a beautification mechanism, because I know that in the early days of adoption, developers like me will be highly distrustful and want to see what’s coming out of the other end, and make sure it at least looks like something they would have produced!
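
To make ‘a correctly ordered traversal reveals the source’ concrete, here is a toy sketch in Prolog. The term shapes (emit/1, str/1, var/1) are mine, invented for this post; they are not FELT’s internal AST:

% a toy AST for (o> "Hello " name); shapes invented for illustration
ast(emit([str("Hello "), var(name)])).

% an in-order walk that re-emits source-like text
render(emit(Args)) :-
    write('(o> '),
    maplist(render_arg, Args),
    write(')'), nl.

render_arg(str(S)) :- format('~q ', [S]).
render_arg(var(V)) :- format('~w ', [V]).

% ?- ast(A), render(A).
% (o> "Hello " name )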

So… let’s work a simple example. Here’s a tiny little program in FELT that just says “hello” given the name of a carbon-based life form:

; a simple greeting program
(defun greet (name)
  (o> "Hello " name ", how's it going?"))

Note: o> is EMIT, and there are other short forms available because, quite frankly, typing can be boring, and sometimes it helps to look like your code is smarter than you are if the management are standing over your shoulder. The current short forms in FELT are hopefully reasonably self-explanatory (a couple of them are shown in action after the table):

o>                EMIT         the 'o' is 'output', the > is unix like piping to a file
:=                DEFVAR       define a variable or variables
=                 ASSIGN       yup, variable assignment, right to left.
@< @> <@ >@ @#    -arrays-     array operations
->                MCALL        instance call, what OO people live for
::                CLASS-CALL   class call, static method calls.
λ                 DEFUN        defining a function, may be public, protected or private
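
As promised, a couple of them in action; a purely illustrative snippet of mine, pieced together from the DEFVAR forms shown later in this article rather than taken from a real FELT program:

; := is DEFVAR, o> is EMIT
(:= (who "Arthur"))
(o> "Don't panic, " who)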

If you feel inclined, you could check out the FELT Instruction A-Z to see what I’m up against. And remember, that site is 99.9% FELT code, front end and back end; all the JavaScript, all the server-side PHP and all the CSS is written in the same underlying language. There is a REPL that you can fiddle with if you like, YMMV, and on the left of that page is a drop-down menu that shows some pages that make up the site itself.

There are over 80 instructions, and they all have to be parsed and syntax checked, and then a renderer for each desired output language has to be produced. All by me, all in Prolog. Until it’s all done, I will not die happy.

The tokenised output of the above snippet is:

; a simple greeting program
(λ greet (name)
  (o> "Hello " name ", how's it going?"))
?- tokenise_file('blog.lisp',X), print_term(X,[]).
[ o_paren((0,1,0)),
  token((1,1,1),"λ"),
  token((3,1,3),"greet"),
  o_paren((9,1,9)),
  token((10,1,10),"name"),
  c_paren((14,1,14)),
  o_paren((18,2,2)),
  token((19,2,3),"o>"),
  str2((22,2,6),"Hello "),
  token((31,2,15),"name"),
  str2((36,2,20),", how's it going?"),
  c_paren((55,2,39)),
  c_paren((56,2,40))
]

This list of tokens is then pumped into the AST generator, which also performs syntax checking along the way in an attempt to adhere to the ‘fail fast’ approach; in this short code sample, however, there are no such errors (more on those in future articles). Note that λ has been scanned as the reserved word DEFUN. The next step is to ‘reshape’ the token stream.

Reshaping The Streams

I had possibly over-planned the AST, but back in the day I knew that I was going to be writing a lot of code to syntax check 80-odd commands, and the thought of typing o_paren over and over made me implement a step I call reshaping. This not only reduces the amount of pure keyboard activity, it also transforms certain tokens into a different form that makes the AST source code more readable.

The most obvious way to demonstrate this technique is to reshape the above list; here is what we end up with:

[ op((0,1,0)),
  defun((1,1,1)),
  tk((3,1,3),"greet"),
  op((9,1,9)),
  tk((10,1,10),"name"),
  cp((14,1,14)),
  op((18,2,2)),
  emit((19,2,3)),
  s2((22,2,6),"Hello "),
  tk((31,2,15),"name"),
  s2((36,2,20),", how's it going?"),
  cp((55,2,39)),
  cp((56,2,40))
]

The key reshape transformations are:

  • token() ⟼ tk()
  • o_paren, c_paren et al. are reduced to op(), cp() etc.
  • single and double quoted strings ⟼ s1(), s2()
  • reserved words are promoted to first-class functors, as can be seen by the change from token((19,2,3),"o>") into the form emit((19,2,3)).

Note that token promotion to a first-class functor preserves the position parameter; we may have changed the internal form but the source code marker did not change.

The definitive reshape code

The code that does this is pretty short, actually, and makes use of unification and pattern matching in a reasonably self-explanatory way:

%%        reshape(+TokenIn, -TokenOut).
%
%         This is where we turn tokens into keywords and
%         preprocessor instructions.
%
%         The AST DCG rules assume shorter cleaner representations of
%         the raw tokeniser output and this is where we do that token
%         transformation.
%
reshape(In, Out) :- maplist(reshape_, In, Out).

% trivial remappings
reshape_(o_paren(P), op(P)).
reshape_(c_paren(P), cp(P)).
reshape_(o_list(P), ol(P)).
reshape_(c_list(P), cl(P)).
reshape_(o_map(P), om(P)).
reshape_(c_map(P), cm(P)).
reshape_(str1(P,B), s1(P,B)).
reshape_(str2(P,B), s2(P,B)).
reshape_(com1(P,B), c1(P,B)).
reshape_(comn(P,B), cn(P,B)).
reshape_(keyword(P,B), kw(P,B)).
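
Those are only the trivial remappings; the keyword-promotion clauses live elsewhere in the real code. Purely as a sketch of mine, assuming a hypothetical reserved/2 lookup table, promotion could be written like this:

% NOT the actual FELT code: a hypothetical promotion sketch.
% A reserved word becomes a first-class functor; P is untouched.
reshape_(token(P, B), Out) :-
    reserved(B, Functor),
    !,
    Out =.. [Functor, P].

% anything else stays a plain token, shortened to tk/2
reshape_(token(P, B), tk(P, B)).

% hypothetical lookup table entries
reserved("o>", emit).
reserved("λ", defun).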

P is a position, a three-tuple of (byte-offset, line, column), and B is the body of the string. The first thing to notice, again, is the auxiliary pattern: the designated predicate reshape/2 simply maps reshape_/2 over each input token. If you are familiar with functional programming style, the call to maplist may look familiar to you. In Prolog, maplist/2, maplist/3 etc. are very useful indeed. Think of it as a combination of mapping and currying: the first parameter to maplist here is just the name of a predicate (reshape_/2), but it could equally be a partially applied term like foo(test, A), and then the resulting call it makes per token would be foo(test, A, token-list-entry, Out). It automatically gathers up each returned Out term into a final list. Learn Prolog!
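
To make that concrete, a toy example of mine (nothing to do with FELT):

% maplist/3 calls add(5, X, Y) for each pairing of elements
add(N, X, Y) :- Y is X + N.

% ?- maplist(add(5), [1,2,3], Out).
% Out = [6, 7, 8].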

But why do it?

I did it for clarity of representation; Prolog is a declarative language and I wanted the code to be as readable, and as reasonable, as possible. For example, without going into details, here is the AST code that examines and consumes the token stream when looking for a list definition:

%-----------------------------------------------------------------------
%         list
%-----------------------------------------------------------------------
%         "[" [ VALUE ] "]"
%         VALUE: any valid expression

list(List) -->
    [ol(Pos)],
    list_body(Pos, Body),
    [cl(_)],
    !,
    {List =..[list, Pos, Body]}.

% reached end of list yet ?
list_body(_Pos, []), [CL] -->
    CL =<< cl,
    !.

I know that having a cleaner token form means that the above code is short and concise and still readable. There’s a lot more to the list handling code, but then it is going to be an industrial-strength application. If I hadn’t done the reshaping stage then the first part of the code would have had to be:

%-----------------------------------------------------------------------
%         list
%-----------------------------------------------------------------------
%         "[" [ VALUE ] "]"
%         VALUE: any valid expression

list(List) -->
    [o_list(Pos)],
    list_body(Pos, Body),
    [c_list(_)],
    !,
    {List =..[list, Pos, Body]}.

It might look trivial, but it has saved, and continues to save, a lot of time and effort. And just for fun, here’s the complete AST handler for the DEFVAR instruction, which shows how beautifully readable Prolog code can be.

I am planning some articles on difference lists and user-defined operators, because I wrote a few of those and they also helped me create nicer code. Here it is, the AST DEFVAR handler in all its glory:

%-----------------------------------------------------------------------
%         defvar
%-----------------------------------------------------------------------
%         DEFVAR NAME
%         DEFVAR (NAME VALUE)
%         DEFVAR (NAME :ref)
%         VALUE: any valid expression

defvar(Pos, defvar(Pos, varval(Var, RHS))) -->
    expect(tk, Pos, defvar_expected_name, Var),
    single_or_sexp(RHS),
    expect_cp(Pos),
    !,
    d('defvar: var:~p value: ~p', [Var, RHS]).

defvar(Pos, defvar(Pos, var(Var))) -->
    expect(tk, Pos, defvar_expected_name, Var),
    expect_cp(Pos),
    !,
    d('defvar: var:~p', [Var]).

I don’t expect you to understand it unless you know Prolog, but just looking at it, knowing that the comma at the end of each line is read as AND, you can kind of get a feel for what it is doing. I hope.
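
For the curious: expect//4 and expect_cp//1 are FELT plumbing not shown here. As a rough sketch of mine of what such a helper could look like, in the fail-fast spirit, though not necessarily how FELT actually does it:

% NOT FELT's implementation: one way an expect//4 might work
expect(tk, Pos, _Err, Name) --> [tk(Pos, Name)], !.
expect(tk, _, Err, _) -->
    [Tok],
    { Tok =.. [_, Pos|_],       % every token carries its position first
      throw(syntax_error(Err, Pos)) }.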

I think that’s enough for my head for now. The next few articles are going to be about something more fundamental: difference lists, and then how they give us DCGs (definite clause grammars, using -->) and term rewriting on code loading (like a macro facility) that lets syntax sugaring feel just like any other part of the language. And I haven’t even covered user-defined operators yet either.
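
As a one-clause taster of where those articles are headed (standard textbook Prolog, nothing FELT-specific):

% difference-list append: constant time, no walking of the first list
app_dl(A-B, B-C, A-C).

% ?- app_dl([1,2|T1]-T1, [3,4|T2]-T2, L-[]).
% L = [1, 2, 3, 4].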

Hot diggety, Prolog sure is a mighty fine environment to work in.