This commit is contained in:
147
ase/cmd/awk/lisp/Impl-notes
Normal file
147
ase/cmd/awk/lisp/Impl-notes
Normal file
@ -0,0 +1,147 @@
|
||||
Implementation notes
|
||||
|
||||
|
||||
1. Overview
|
||||
|
||||
Since the code should be self-explanatory to anyone knowledgeable
|
||||
about Lisp implementation, these notes assume you know Lisp but not
|
||||
interpreters. I haven't got around to writing up a complete
|
||||
discussion of everything, though.
|
||||
|
||||
The code for an interpreter can be pretty low on redundancy -- this is
|
||||
natural because the whole reason for implementing a new language is to
|
||||
avoid having to code a particular class of programs in a redundant
|
||||
style in the old language. We implement what that class of programs
|
||||
has in common just once, then use it many times. Thus an interpreter
|
||||
has a different style of code, perhaps denser, than a typical
|
||||
application program.
|
||||
|
||||
|
||||
2. Data representation
|
||||
|
||||
Conceptually, a Lisp datum is a tagged pointer, with the tag giving
|
||||
the datatype and the pointer locating the data. We follow the common
|
||||
practice of encoding the tag into the two lowest-order bits of the
|
||||
pointer. This is especially easy in awk, since arrays with
|
||||
non-consecutive indices are just as efficient as dense ones (so we can
|
||||
use the tagged pointer directly as an index, without having to mask
|
||||
out the tag bits). (But, by the way, mawk accesses negative indices
|
||||
much more slowly than positive ones, as I found out when trying a
|
||||
different encoding.)
|
||||
|
||||
This Lisp provides three datatypes: integers, lists, and symbols. (A
|
||||
modern Lisp provides many more.)
|
||||
|
||||
For an integer, the tag bits are zero and the pointer bits are simply
|
||||
the numeric value; thus, N is represented by N*4. This choice of the
|
||||
tag value has two advantages. First, we can add and subtract without
|
||||
fiddling with the tags. Second, negative numbers fit right in.
|
||||
(Consider what would happen if N were represented by 1+N*4 instead,
|
||||
and we tried to extract the tag as N%4, where N may be either positive
|
||||
or negative. Because of this problem and the above-mentioned
|
||||
inefficiency of negative indices, all other datatypes are represented
|
||||
by positive numbers.)
|
||||
|
||||
|
||||
3. The evaluation/saved-bindings stack
|
||||
|
||||
The following is from an email discussion; it doesn't develop
|
||||
everything from first principles but is included here in the hope
|
||||
it will be helpful.
|
||||
|
||||
Hi. I just took a look at awklisp, and remembered that there's more
|
||||
to your question about why we need a stack -- it's a good question.
|
||||
The real reason is because a stack is accessible to the garbage
|
||||
collector.
|
||||
|
||||
We could have had apply() evaluate the arguments itself, and stash
|
||||
the results into variables like arg0 and arg1 -- then the case for
|
||||
ADD would look like
|
||||
|
||||
if (proc == ADD) return is(a_number, arg0) + is(a_number, arg1)
|
||||
|
||||
The obvious problem with that approach is how to handle calls to
|
||||
user-defined procedures, which could have any number of arguments.
|
||||
Say we're evaluating ((lambda (x) (+ x 1)) 42). (lambda (x) (+ x 1))
|
||||
is the procedure, and 42 is the argument.
|
||||
|
||||
A (wrong) solution could be to evaluate each argument in turn, and
|
||||
bind the corresponding parameter name (like x in this case) to the
|
||||
resulting value (while saving the old value to be restored after we
|
||||
return from the procedure). This is wrong because we must not
|
||||
change the variable bindings until we actually enter the procedure --
|
||||
for example, with that algorithm ((lambda (x y) y) 1 x) would return
|
||||
1, when it should return whatever the value of x is in the enclosing
|
||||
environment. (The eval_rands()-type sequence would be: eval the 1,
|
||||
bind x to 1, eval the x -- yielding 1 which is *wrong* -- and bind
|
||||
y to that, then eval the body of the lambda.)
|
||||
|
||||
Okay, that's easily fixed -- evaluate all the operands and stash them
|
||||
away somewhere until you're done, and *then* do the bindings. So
|
||||
the question is where to stash them. How about a global array?
|
||||
Like
|
||||
|
||||
for (i = 0; arglist != NIL; ++i) {
|
||||
global_temp[i] = eval(car[arglist])
|
||||
arglist = cdr[arglist]
|
||||
}
|
||||
|
||||
followed by the equivalent of extend_env(). This will not do, because
|
||||
the global array will get clobbered in recursive calls to eval().
|
||||
Consider (+ 2 (* 3 4)) -- first we evaluate the arguments to the +,
|
||||
like this: global_temp[0] gets 2, and then global_temp[1] gets the
|
||||
eval of (* 3 4). But in evaluating (* 3 4), global_temp[0] gets set
|
||||
to 3 and global_temp[1] to 4 -- so the original assignment of 2 to
|
||||
global_temp[0] is clobbered before we get a chance to use it. By
|
||||
using a stack[] instead of a global_temp[], we finesse this problem.
|
||||
|
||||
You may object that we can solve that by just making the global array
|
||||
local, and that's true; lots of small local arrays may or may not be
|
||||
more efficient than one big global stack, in awk -- we'd have to try
|
||||
it out to see. But the real problem I alluded to at the start of this
|
||||
message is this: the garbage collector has to be able to find all the
|
||||
live references to the car[] and cdr[] arrays. If some of those
|
||||
references are hidden away in local variables of recursive procedures,
|
||||
we're stuck. With the global stack, they're all right there for the
|
||||
gc().
|
||||
|
||||
(In C we could use the local-arrays approach by threading a chain of
|
||||
pointers from each one to the next; but awk doesn't have pointers.)
|
||||
|
||||
(You may wonder how the code gets away with having a number of local
|
||||
variables holding lisp values, then -- the answer is that in every
|
||||
such case we can be sure the garbage collector can find the values
|
||||
in question from some other source. That's what this comment is
|
||||
about:
|
||||
|
||||
# All the interpretation routines have the precondition that their
|
||||
# arguments are protected from garbage collection.
|
||||
|
||||
In some cases where the values would not otherwise be guaranteed to
|
||||
be available to the gc, we call protect().)
|
||||
|
||||
Oh, there's another reason why apply() doesn't evaluate the arguments
|
||||
itself: it's called by do_apply(), which handles lisp calls like
|
||||
(apply car '((x))) -- where we *don't* want the x to get evaluated
|
||||
by apply().
|
||||
|
||||
|
||||
4. Um, what I was going to write about
|
||||
|
||||
more on data representation
|
||||
is_foo procedures slow it down by a few percent but increase clarity
|
||||
(try replacing them and other stuff with macros, time it.)
|
||||
|
||||
gc: overview; how to write gc-safe code using protect(); point out
|
||||
that relocating gcs introduce further complications
|
||||
|
||||
driver loop, macros
|
||||
|
||||
evaluation
|
||||
globals for temp values because of recursion, space efficiency
|
||||
environment -- explicit stack needed because of gc
|
||||
|
||||
error handling, or lack thereof
|
||||
strategies for cheaply adding error recovery
|
||||
|
||||
I/O
|
Reference in New Issue
Block a user