ase/cmd/awk/lisp/Impl-notes

Implementation notes


1. Overview

Since the code should be self-explanatory to anyone knowledgeable
about Lisp implementation, these notes assume you know Lisp but not
interpreters.  I haven't got around to writing up a complete
discussion of everything, though.

The code for an interpreter can be pretty low on redundancy -- this is
natural because the whole reason for implementing a new language is to
avoid having to code a particular class of programs in a redundant
style in the old language.  We implement what that class of programs
has in common just once, then use it many times.  Thus an interpreter
has a different style of code, perhaps denser, than a typical
application program.


2. Data representation

Conceptually, a Lisp datum is a tagged pointer, with the tag giving
the datatype and the pointer locating the data.  We follow the common
practice of encoding the tag into the two lowest-order bits of the
pointer.  This is especially easy in awk, since arrays with
non-consecutive indices are just as efficient as dense ones (so we can
use the tagged pointer directly as an index, without having to mask
out the tag bits).  (But, by the way, mawk accesses negative indices
much more slowly than positive ones, as I found out when trying a
different encoding.)

This Lisp provides three datatypes: integers, lists, and symbols.  (A
modern Lisp provides many more.)

For an integer, the tag bits are zero and the pointer bits are simply
the numeric value; thus, N is represented by N*4.  This choice of the
tag value has two advantages.  First, we can add and subtract without
fiddling with the tags.  Second, negative numbers fit right in.
(Consider what would happen if N were represented by X = 1+N*4 instead,
and we tried to extract the tag as X%4, where X may be either positive
or negative: awk's % takes the sign of its left operand, so a negative
X yields a negative remainder rather than the tag 1.  Because of this
problem and the above-mentioned inefficiency of negative indices, all
other datatypes are represented by positive numbers.)
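
The arithmetic above is easy to check in awk itself.  What follows is
only a sketch: the helper names make_number, number_value and tag_of
are made up for illustration and are not taken from the awklisp source.

```awk
# Hypothetical helpers illustrating the N*4 encoding (names are assumptions).
function make_number(n)  { return n * 4 }
function number_value(x) { return x / 4 }
function tag_of(x)       { return x % 4 }   # awk's % keeps the sign of x

BEGIN {
    a = make_number(5); b = make_number(-7)
    # Tagged integers add directly; the sum is already a tagged integer.
    print number_value(a + b)                       # -2
    # Both signs give tag 0 (compared rather than printed, since the
    # remainder of a negative multiple of 4 may be a negative zero).
    print (tag_of(a) == 0 && tag_of(b) == 0)        # 1
    # Under a 1+N*4 encoding the "tag" of a negative number goes negative.
    print tag_of(1 + 4 * 5), tag_of(1 + 4 * (-7))   # 1 -3
}
```

The last line is the problem case: the same tag would have to be
recognized as both 1 and -3, depending on the sign of the number.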


3. The evaluation/saved-bindings stack

The following is from an email discussion; it doesn't develop 
everything from first principles but is included here in the hope
it will be helpful.

Hi.  I just took a look at awklisp, and remembered that there's more
to your question about why we need a stack -- it's a good question.
The real reason is that a stack is accessible to the garbage
collector.

We could have had apply() evaluate the arguments itself, and stash
the results into variables like arg0 and arg1 -- then the case for
ADD would look like

  if (proc == ADD) return is(a_number, arg0) + is(a_number, arg1)

The obvious problem with that approach is how to handle calls to
user-defined procedures, which could have any number of arguments.
Say we're evaluating ((lambda (x) (+ x 1)) 42).  (lambda (x) (+ x 1))
is the procedure, and 42 is the argument.  

A (wrong) solution could be to evaluate each argument in turn, and
bind the corresponding parameter name (like x in this case) to the
resulting value (while saving the old value to be restored after we
return from the procedure).  This is wrong because we must not 
change the variable bindings until we actually enter the procedure --
for example, with that algorithm ((lambda (x y) y) 1 x) would return
1, when it should return whatever the value of x is in the enclosing
environment.  (The eval_rands()-type sequence would be: eval the 1,
bind x to 1, eval the x -- yielding 1 which is *wrong* -- and bind
y to that, then eval the body of the lambda.)
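
The difference between the two binding orders can be reproduced with a
toy model.  This is a sketch under simplifying assumptions, not the
interpreter's code: env[] maps names to values, an operand is either a
number or a variable name, the body of (lambda (x y) y) is hard-wired
as "return the last parameter", and the saving and restoring of old
bindings is left out.  All function names are invented for the example.

```awk
# An operand is a variable name if it is bound in env[], else a number.
function eval_operand(e) { return (e in env) ? env[e] : e }

# Wrong: binds each parameter as soon as its operand has been evaluated.
function call_binding_eagerly(params, operands, n,    i) {
    for (i = 1; i <= n; ++i)
        env[params[i]] = eval_operand(operands[i])
    return env[params[n]]          # body: return the last parameter
}

# Right: evaluate every operand first, and only then do the bindings.
function call_binding_late(params, operands, n,    i, vals) {
    for (i = 1; i <= n; ++i) vals[i] = eval_operand(operands[i])
    for (i = 1; i <= n; ++i) env[params[i]] = vals[i]
    return env[params[n]]
}

BEGIN {
    params[1] = "x"; params[2] = "y"      # (lambda (x y) y)
    operands[1] = 1; operands[2] = "x"    # applied to 1 and x
    env["x"] = 10                         # x in the enclosing environment
    print call_binding_eagerly(params, operands, 2)   # 1
    delete env["y"]; env["x"] = 10        # reset the environment
    print call_binding_late(params, operands, 2)      # 10
}
```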

Okay, that's easily fixed -- evaluate all the operands and stash them
away somewhere until you're done, and *then* do the bindings.  So 
the question is where to stash them.  How about a global array?
Like

  for (i = 0; arglist != NIL; ++i) {
    global_temp[i] = eval(car[arglist])
    arglist = cdr[arglist]
  }

followed by the equivalent of extend_env().  This will not do, because
the global array will get clobbered in recursive calls to eval().
Consider (+ 2 (* 3 4)) -- first we evaluate the arguments to the +,
like this: global_temp[0] gets 2, and then global_temp[1] gets the
eval of (* 3 4).  But in evaluating (* 3 4), global_temp[0] gets set
to 3 and global_temp[1] to 4 -- so the original assignment of 2 to
global_temp[0] is clobbered before we get a chance to use it.  By
using a stack[] instead of a global_temp[], we finesse this problem.
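
Here is the clobbering made concrete, as a sketch rather than the
actual interpreter.  It assumes a toy representation in which cells
are named by strings like "c1", numbers are plain awk numbers, and
every operator takes exactly two operands.  The function names
(eval_global, eval_stack, arith, list3) are invented for the example.

```awk
function cons(a, d,    p) { p = "c" (++ncells); car[p] = a; cdr[p] = d; return p }
function list3(a, b, c)   { return cons(a, cons(b, cons(c, NIL))) }
function is_pair(x)       { return x ~ /^c[0-9]+$/ }
function arith(op, a, b)  { return (op == "+") ? a + b : a * b }

# Broken: operands are stashed in one array shared by every call.
function eval_global(e,    i, args) {
    if (!is_pair(e)) return e
    args = cdr[e]
    for (i = 0; args != NIL; ++i) {
        global_temp[i] = eval_global(car[args])
        args = cdr[args]
    }
    return arith(car[e], global_temp[0], global_temp[1])
}

# Working: each call pushes its operands above what is already stacked.
function eval_stack(e,    base, args, v) {
    if (!is_pair(e)) return e
    base = sp
    for (args = cdr[e]; args != NIL; args = cdr[args]) {
        v = eval_stack(car[args])
        stack[sp++] = v
    }
    v = arith(car[e], stack[base], stack[base + 1])
    sp = base                      # pop this call's operands
    return v
}

BEGIN {
    NIL = "nil"; sp = 0
    expr = list3("+", 2, list3("*", 3, 4))   # (+ 2 (* 3 4))
    print eval_global(expr)   # 15: global_temp[0] was overwritten with 3
    print eval_stack(expr)    # 14
}
```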

You may object that we can solve that by just making the global array
local, and that's true; lots of small local arrays may or may not be
more efficient than one big global stack, in awk -- we'd have to try
it out to see.  But the real problem I alluded to at the start of this
message is this: the garbage collector has to be able to find all the
live references to the car[] and cdr[] arrays.  If some of those
references are hidden away in local variables of recursive procedures,
we're stuck.  With the global stack, they're all right there for the
gc().

(In C we could use the local-arrays approach by threading a chain of
pointers from each one to the next; but awk doesn't have pointers.)

(You may wonder how the code gets away with having a number of local
variables holding lisp values, then -- the answer is that in every
such case we can be sure the garbage collector can find the values
in question from some other source.  That's what this comment is
about:

# All the interpretation routines have the precondition that their
# arguments are protected from garbage collection.

In some cases where the values would not otherwise be guaranteed to
be available to the gc, we call protect().)
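
To see why values held only in awk variables are at risk, here is one
possible mark-and-sweep pass that treats the stack as its only set of
roots.  It is a sketch under assumptions: the "c1"-style cell names
and the marked[] array are invented for the example, and this gc() is
not claimed to match how awklisp's own gc() is written.

```awk
function cons(a, d,    p) { p = "c" (++ncells); car[p] = a; cdr[p] = d; return p }
function is_pair(x)       { return x ~ /^c[0-9]+$/ }

# Mark every cell reachable from x (recursing on car, looping on cdr).
function mark(x) {
    while (is_pair(x) && !(x in marked)) {
        marked[x] = 1
        mark(car[x])
        x = cdr[x]
    }
}

# The roots are exactly the live stack slots; values held only in awk
# variables are invisible to this function.
function gc(    i, p, freed) {
    split("", marked)                      # clear the mark table
    for (i = 0; i < sp; ++i) mark(stack[i])
    for (i = 1; i <= ncells; ++i) {
        p = "c" i
        if ((p in car) && !(p in marked)) {
            delete car[p]; delete cdr[p]; ++freed
        }
    }
    return freed + 0
}

BEGIN {
    sp = 0
    kept = cons(1, cons(2, "nil"))   # two cells, reachable from the stack
    stack[sp++] = kept
    lost = cons(3, "nil")            # referenced only by an awk variable
    print gc()                       # 1: the cell named by "lost" is gone
    k = (kept in car); l = (lost in car)
    print k, l                       # 1 0
}
```

Pushing a value onto the stack before anything that might allocate is
what keeps it alive in this model; "lost" shows what happens otherwise.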

Oh, there's another reason why apply() doesn't evaluate the arguments 
itself: it's called by do_apply(), which handles lisp calls like 
(apply car '((x))) -- where we *don't* want the x to get evaluated
by apply().


4. Um, what I was going to write about

more on data representation
is_foo procedures slow it down by a few percent but increase clarity
(try replacing them and other stuff with macros, time it.)

gc: overview; how to write gc-safe code using protect(); point out
	that relocating gcs introduce further complications

driver loop, macros

evaluation
globals for temp values because of recursion, space efficiency
environment -- explicit stack needed because of gc

error handling, or lack thereof
strategies for cheaply adding error recovery

I/O

