Implementation notes


1. Overview

Since the code should be self-explanatory to anyone knowledgeable
about Lisp implementation, these notes assume you know Lisp but not
interpreters.  I haven't got around to writing up a complete
discussion of everything, though.

The code for an interpreter can be pretty low on redundancy -- this is
natural because the whole reason for implementing a new language is to
avoid having to code a particular class of programs in a redundant
style in the old language.  We implement what that class of programs
has in common just once, then use it many times.  Thus an interpreter
has a different style of code, perhaps denser, than a typical
application program.


2. Data representation

Conceptually, a Lisp datum is a tagged pointer, with the tag giving
the datatype and the pointer locating the data.  We follow the common
practice of encoding the tag into the two lowest-order bits of the
pointer.  This is especially easy in awk, since arrays with
non-consecutive indices are just as efficient as dense ones (so we can
use the tagged pointer directly as an index, without having to mask
out the tag bits).  (But, by the way, mawk accesses negative indices
much more slowly than positive ones, as I found out when trying a
different encoding.)
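
As a rough illustration of using the tagged pointer directly as an
index, here is a minimal sketch in awk.  The tag value 1 for list
cells and the names a_pair, pair_ptr, make_pair() and tag_of() are
assumptions made for this example; the interpreter's own names and
tag assignments may differ.

```awk
# Sketch only: the tag assignment (1 = list cell) and all helper names
# are assumptions for illustration.
BEGIN {
    a_pair = 1              # hypothetical tag for list cells
    pair_ptr = a_pair       # next free tagged pointer carrying that tag
}

# Allocate a cell.  The tagged pointer itself is the array index, so
# the tag bits are never masked out.
function make_pair(the_car, the_cdr,    p) {
    p = pair_ptr
    pair_ptr += 4           # step over the two tag bits
    car[p] = the_car
    cdr[p] = the_cdr
    return p
}

function tag_of(x) { return x % 4 }   # valid for non-negative x

BEGIN {
    p = make_pair(8, 12)    # 8 and 12 are the integers 2 and 3 as N*4
    q = make_pair(4, p)     # 4 is the integer 1
    print p, q, tag_of(p), tag_of(q)      # 1 5 1 1
    print car[q] / 4, car[cdr[q]] / 4     # 1 2
}
```

The indices 1, 5, 9, ... are sparse, but awk's associative arrays
don't care.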

This Lisp provides three datatypes: integers, lists, and symbols.  (A
modern Lisp provides many more.)

For an integer, the tag bits are zero and the pointer bits are simply
the numeric value; thus, N is represented by N*4.  This choice of the
tag value has two advantages.  First, we can add and subtract without
fiddling with the tags.  Second, negative numbers fit right in.
(Consider what would happen if N were represented by X = 1+N*4
instead, and we tried to extract the tag as X%4, where N may be either
positive or negative: awk's % takes the sign of the dividend, so for
N = -1 we get X = -3 and X%4 = -3 rather than 1.  Because of this
problem and the above-mentioned inefficiency of negative indices, all
other datatypes are represented by positive numbers.)
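
The arithmetic can be checked with a short sketch.  The function
names make_int(), int_value() and tag_of() are made up for this
example and are not necessarily the ones the interpreter uses.

```awk
# Sketch only: make_int(), int_value() and tag_of() are hypothetical.
function make_int(n)  { return n * 4 }    # tag bits are zero
function int_value(x) { return x / 4 }
function tag_of(x)    { return x % 4 }

BEGIN {
    a = make_int(7)
    b = make_int(-2)

    # Addition and subtraction work directly on the encoded values.
    print int_value(a + b), int_value(a - b)        # 5 9

    # With tag 0, a negative integer still compares equal to tag 0.
    print (tag_of(b) == 0 ? "int" : "other")        # int

    # With the alternative encoding 1+N*4, the tag of a negative
    # number comes out wrong, because % takes the dividend's sign.
    print tag_of(1 + 7 * 4), tag_of(1 + (-2) * 4)   # 1 -3
}
```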


3. The evaluation/saved-bindings stack

The following is from an email discussion; it doesn't develop
everything from first principles but is included here in the hope
it will be helpful.

Hi.  I just took a look at awklisp, and remembered that there's more
to your question about why we need a stack -- it's a good question.
The real reason is that a stack is accessible to the garbage
collector.

We could have had apply() evaluate the arguments itself, and stash
the results into variables like arg0 and arg1 -- then the case for
ADD would look like

  if (proc == ADD) return is(a_number, arg0) + is(a_number, arg1)

The obvious problem with that approach is how to handle calls to
user-defined procedures, which could have any number of arguments.
Say we're evaluating ((lambda (x) (+ x 1)) 42).  (lambda (x) (+ x 1))
is the procedure, and 42 is the argument.

A (wrong) solution could be to evaluate each argument in turn, and
bind the corresponding parameter name (like x in this case) to the
resulting value (while saving the old value to be restored after we
return from the procedure).  This is wrong because we must not
change the variable bindings until we actually enter the procedure --
for example, with that algorithm ((lambda (x y) y) 1 x) would return
1, when it should return whatever the value of x is in the enclosing
environment.  (The eval_rands()-type sequence would be: eval the 1,
bind x to 1, eval the x -- yielding 1 which is *wrong* -- and bind
y to that, then eval the body of the lambda.)
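
The difference can be seen in a toy model, sketched here in awk.
Everything in it is hypothetical: an operand is either a numeric
literal or a variable name looked up in value[], the body is a single
variable, and call_eager() and call_deferred() are names invented for
the example.  It has no recursion and no garbage collector, so it
says nothing yet about where the operands ought to be kept.

```awk
# Sketch only: a toy model with hypothetical names throughout.
function eval_operand(e) {
    return (e ~ /^-?[0-9]+$/) ? e + 0 : value[e]
}

# Wrong: bind each parameter as soon as its operand is evaluated.
function call_eager(params, rands, np, body,    i, saved, result) {
    for (i = 1; i <= np; i++) {
        saved[i] = value[params[i]]
        value[params[i]] = eval_operand(rands[i])
    }
    result = eval_operand(body)
    for (i = np; i >= 1; i--) value[params[i]] = saved[i]
    return result
}

# Right: evaluate every operand first, and only then do the bindings.
function call_deferred(params, rands, np, body,    i, saved, temp, result) {
    for (i = 1; i <= np; i++) temp[i] = eval_operand(rands[i])
    for (i = 1; i <= np; i++) {
        saved[i] = value[params[i]]
        value[params[i]] = temp[i]
    }
    result = eval_operand(body)
    for (i = np; i >= 1; i--) value[params[i]] = saved[i]
    return result
}

BEGIN {
    value["x"] = 99                    # x in the enclosing environment
    params[1] = "x"; params[2] = "y"   # (lambda (x y) y)
    rands[1] = "1";  rands[2] = "x"    # applied to 1 and x
    print call_eager(params, rands, 2, "y")      # 1, which is wrong
    print call_deferred(params, rands, 2, "y")   # 99
}
```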

Okay, that's easily fixed -- evaluate all the operands and stash them
away somewhere until you're done, and *then* do the bindings.  So
the question is where to stash them.  How about a global array?
Like

  for (i = 0; arglist != NIL; ++i) {
    global_temp[i] = eval(car[arglist])
    arglist = cdr[arglist]
  }

followed by the equivalent of extend_env().  This will not do, because
the global array will get clobbered in recursive calls to eval().
Consider (+ 2 (* 3 4)) -- first we evaluate the arguments to the +,
like this: global_temp[0] gets 2, and then global_temp[1] gets the
eval of (* 3 4).  But in evaluating (* 3 4), global_temp[0] gets set
to 3 and global_temp[1] to 4 -- so the original assignment of 2 to
global_temp[0] is clobbered before we get a chance to use it.  By
using a stack[] instead of a global_temp[], we finesse this problem.
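
Here is the clobbering in a self-contained sketch.  This is a toy
evaluator for binary + and * over a tree stored in arrays, not the
interpreter's own eval(); the node layout and every function name in
it are assumptions made for the example.

```awk
# Sketch only: node layout and function names are hypothetical.
function leaf(v) { nodes++; kind[nodes] = "num"; num[nodes] = v; return nodes }
function node(op, a, b) {
    nodes++; kind[nodes] = op; lhs[nodes] = a; rhs[nodes] = b
    return nodes
}
function apply_op(op, a, b) { return op == "+" ? a + b : a * b }

# Operands stashed in one shared array: a recursive call overwrites
# the slots its caller is still using.
function eval_global(n) {
    if (kind[n] == "num") return num[n]
    global_temp[0] = eval_global(lhs[n])
    global_temp[1] = eval_global(rhs[n])
    return apply_op(kind[n], global_temp[0], global_temp[1])
}

# Operands pushed on a stack: each call gets fresh slots above sp.
function eval_stack(n,    v, result) {
    if (kind[n] == "num") return num[n]
    v = eval_stack(lhs[n]); stack[sp++] = v
    v = eval_stack(rhs[n]); stack[sp++] = v
    result = apply_op(kind[n], stack[sp - 2], stack[sp - 1])
    sp -= 2                 # pop both operands
    return result
}

BEGIN {
    # Build (+ 2 (* 3 4)).
    two = leaf(2); three = leaf(3); four = leaf(4)
    product = node("*", three, four)
    root = node("+", two, product)
    sp = 0
    print eval_global(root)   # 15: the 2 was overwritten by the 3
    print eval_stack(root)    # 14
}
```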

You may object that we can solve that by just making the global array
local, and that's true; lots of small local arrays may or may not be
more efficient than one big global stack, in awk -- we'd have to try
it out to see.  But the real problem I alluded to at the start of this
message is this: the garbage collector has to be able to find all the
live references to the car[] and cdr[] arrays.  If some of those
references are hidden away in local variables of recursive procedures,
we're stuck.  With the global stack, they're all right there for the
gc().
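
To make "right there for the gc()" concrete, here is a sketch of a
mark phase that treats the stack as its only set of roots.  It is not
the interpreter's gc(): the tag value 1 for list cells, the marked[]
array, the use of integer 0 as an end-of-list stand-in, and the names
is_pair(), mark() and gc_mark_roots() are all assumptions made for
the example.

```awk
# Sketch only: tag values, marked[] and all function names are
# hypothetical; 0 stands in for the end of a list in this toy.
function is_pair(x) { return x > 0 && x % 4 == 1 }

# Mark every cell reachable from x: recurse on the car, loop on the cdr.
function mark(x) {
    while (is_pair(x) && !(x in marked)) {
        marked[x] = 1
        mark(car[x])
        x = cdr[x]
    }
}

# The roots are simply the live stack slots.
function gc_mark_roots(    i) {
    for (i = 0; i < sp; i++) mark(stack[i])
}

BEGIN {
    # Three cells at tagged pointers 1, 5 and 9.
    car[1] = 8;  cdr[1] = 0
    car[5] = 4;  cdr[5] = 1      # cell 5 points at cell 1
    car[9] = 12; cdr[9] = 0      # cell 9 is referenced from nowhere
    sp = 0
    stack[sp++] = 5
    gc_mark_roots()
    live1 = (1 in marked); live5 = (5 in marked); live9 = (9 in marked)
    print live1, live5, live9    # 1 1 0
}
```

A reference to cell 9 held only in a local variable of some recursive
procedure would be invisible to gc_mark_roots(), and the cell would be
reclaimed while still in use.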

(In C we could use the local-arrays approach by threading a chain of
pointers from each one to the next; but awk doesn't have pointers.)

(You may wonder how the code gets away with having a number of local
variables holding Lisp values, then -- the answer is that in every
such case we can be sure the garbage collector can find the values
in question from some other source.  That's what this comment is
about:

  # All the interpretation routines have the precondition that their
  # arguments are protected from garbage collection.

In some cases where the values would not otherwise be guaranteed to
be available to the gc, we call protect().)

Oh, there's another reason why apply() doesn't evaluate the arguments
itself: it's called by do_apply(), which handles Lisp calls like
(apply car '((x))) -- where we *don't* want the x to get evaluated
by apply().


4. Um, what I was going to write about

more on data representation
is_foo procedures slow it down by a few percent but increase clarity
(try replacing them and other stuff with macros, time it.)

gc: overview; how to write gc-safe code using protect(); point out
	that relocating gcs introduce further complications

driver loop, macros

evaluation
globals for temp values because of recursion, space efficiency
environment -- explicit stack needed because of gc

error handling, or lack thereof
strategies for cheaply adding error recovery

I/O