Implementation notes


1. Overview

Since the code should be self-explanatory to anyone knowledgeable
about Lisp implementation, these notes assume you know Lisp but not
interpreters.  I haven't got around to writing up a complete
discussion of everything, though.

The code for an interpreter can be pretty low on redundancy -- this is
natural because the whole reason for implementing a new language is to
avoid having to code a particular class of programs in a redundant
style in the old language.  We implement what that class of programs
has in common just once, then use it many times.  Thus an interpreter
has a different style of code, perhaps denser, than a typical
application program.


2. Data representation

Conceptually, a Lisp datum is a tagged pointer, with the tag giving
the datatype and the pointer locating the data.  We follow the common
practice of encoding the tag into the two lowest-order bits of the
pointer.  This is especially easy in awk, since arrays with
non-consecutive indices are just as efficient as dense ones (so we can
use the tagged pointer directly as an index, without having to mask
out the tag bits).  (But, by the way, mawk accesses negative indices
much more slowly than positive ones, as I found out when trying a
different encoding.)

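Here is a minimal sketch of using a tagged pointer directly as an
array index.  It is not the interpreter's actual code: the tag value 1
for pairs and the names make_pair(), tag_of(), and next_pair are
assumptions made up for illustration.

```awk
# Sketch: tagged pointers used directly as awk array indices.
# Assumed for illustration: tag 1 marks a pair; make_pair(), tag_of(),
# and next_pair are hypothetical names.
BEGIN {
    PAIR_TAG = 1
    next_pair = PAIR_TAG            # first pair pointer: 0*4 + tag

    p = make_pair("a", "b")         # p == 1
    q = make_pair("c", p)           # q == 5

    # The tagged value indexes car[] and cdr[] as-is, with no masking.
    printf "p=%d tag=%d car=%s cdr=%s\n", p, tag_of(p), car[p], cdr[p]
    printf "q=%d tag=%d car=%s cdr=%s\n", q, tag_of(q), car[q], cdr[q]
}

# Allocate a pair; pointers advance in steps of 4, so the two
# lowest-order bits (the tag) never change.
function make_pair(the_car, the_cdr,    ptr) {
    ptr = next_pair
    next_pair += 4
    car[ptr] = the_car
    cdr[ptr] = the_cdr
    return ptr
}

# The tag is the value modulo 4.
function tag_of(datum) {
    return datum % 4
}
```

The indices in use are 1, 5, 9, and so on; the gaps between them are
harmless because awk arrays are associative.
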
This Lisp provides three datatypes: integers, lists, and symbols.  (A
modern Lisp provides many more.)

For an integer, the tag bits are zero and the pointer bits are simply
the numeric value; thus, N is represented by N*4.  This choice of the
tag value has two advantages.  First, we can add and subtract without
fiddling with the tags.  Second, negative numbers fit right in.
(Consider what would happen if N were represented by 1+N*4 instead,
and we tried to extract the tag as X%4, where the encoded value X may
be either positive or negative: awk's % gives a negative remainder
for a negative X, so the tag would not come out as 1.  Because of this
problem and the above-mentioned inefficiency of negative indices, all
other datatypes are represented by positive numbers.)

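The arithmetic can be checked with a short sketch.  The helper names
make_number() and number_value() are hypothetical, chosen only for
this example.

```awk
# Sketch: integers encoded as N*4 (tag bits zero).
# make_number() and number_value() are hypothetical helper names.
BEGIN {
    a = make_number(5)      # 20
    b = make_number(-7)     # -28

    # The sum of two encoded integers is already a valid encoding.
    sum = a + b             # -8, the encoding of -2
    printf "sum=%d value=%d is_number=%d\n",
           sum, number_value(sum), (sum % 4 == 0)

    # The rejected alternative 1+N*4 breaks for negative N, because
    # awk's % takes the sign of its left operand.
    alt = 1 + (-7) * 4      # -27
    printf "alt=%d alt%%4=%d\n", alt, alt % 4
}

function make_number(n)      { return n * 4 }
function number_value(datum) { return datum / 4 }
```

This prints "sum=-8 value=-2 is_number=1" and then "alt=-27 alt%4=-3":
with the 1+N*4 encoding, the tag test would see -3 where it expected 1.
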

3. The evaluation/saved-bindings stack

The following is from an email discussion; it doesn't develop
everything from first principles but is included here in the hope
it will be helpful.

Hi.  I just took a look at awklisp, and remembered that there's more
to your question about why we need a stack -- it's a good question.
The real reason is that a stack is accessible to the garbage
collector.

We could have had apply() evaluate the arguments itself, and stash
the results into variables like arg0 and arg1 -- then the case for
ADD would look like

  if (proc == ADD) return is(a_number, arg0) + is(a_number, arg1)

The obvious problem with that approach is how to handle calls to
user-defined procedures, which could have any number of arguments.
Say we're evaluating ((lambda (x) (+ x 1)) 42).  (lambda (x) (+ x 1))
is the procedure, and 42 is the argument.

A (wrong) solution could be to evaluate each argument in turn, and
bind the corresponding parameter name (like x in this case) to the
resulting value (while saving the old value to be restored after we
return from the procedure).  This is wrong because we must not
change the variable bindings until we actually enter the procedure --
for example, with that algorithm ((lambda (x y) y) 1 x) would return
1, when it should return whatever the value of x is in the enclosing
environment.  (The eval_rands()-type sequence would be: eval the 1,
bind x to 1, eval the x -- yielding 1 which is *wrong* -- and bind
y to that, then eval the body of the lambda.)

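The ordering bug can be reproduced with a toy model of
((lambda (x y) y) 1 x) in which the enclosing x is 99.  This is only a
sketch: value[], lookup(), call_wrong(), and call_right() are
hypothetical names, and the two-argument call is hard-wired rather
than interpreted.

```awk
# Sketch: why parameters must not be bound while operands are still
# being evaluated.  Models ((lambda (x y) y) 1 x) with an outer x of 99.
# value[], lookup(), call_wrong(), and call_right() are hypothetical.
BEGIN {
    value["x"] = 99
    print "bind-as-you-go:     " call_wrong()
    print "evaluate-then-bind: " call_right()
}

function lookup(name) { return value[name] }

# Wrong: each parameter is bound as soon as its operand is evaluated.
function call_wrong(    old_x, old_y, result) {
    old_x = value["x"]; value["x"] = 1             # eval 1, bind x
    old_y = value["y"]; value["y"] = lookup("x")   # eval x: sees the new x
    result = lookup("y")                           # body of the lambda
    value["x"] = old_x; value["y"] = old_y         # restore saved bindings
    return result
}

# Right: evaluate every operand first, then bind.
function call_right(    arg0, arg1, old_x, old_y, result) {
    arg0 = 1                                       # eval 1
    arg1 = lookup("x")                             # eval x: still the outer x
    old_x = value["x"]; value["x"] = arg0
    old_y = value["y"]; value["y"] = arg1
    result = lookup("y")
    value["x"] = old_x; value["y"] = old_y
    return result
}
```

The first line printed ends in 1 and the second in 99.
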
Okay, that's easily fixed -- evaluate all the operands and stash them
away somewhere until you're done, and *then* do the bindings.  So
the question is where to stash them.  How about a global array?
Like

  for (i = 0; arglist != NIL; ++i) {
    global_temp[i] = eval(car[arglist])
    arglist = cdr[arglist]
  }

followed by the equivalent of extend_env().  This will not do, because
the global array will get clobbered in recursive calls to eval().
Consider (+ 2 (* 3 4)) -- first we evaluate the arguments to the +,
like this: global_temp[0] gets 2, and then global_temp[1] gets the
eval of (* 3 4).  But in evaluating (* 3 4), global_temp[0] gets set
to 3 and global_temp[1] to 4 -- so the original assignment of 2 to
global_temp[0] is clobbered before we get a chance to use it.  By
using a stack[] instead of a global_temp[], we finesse this problem.

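The clobbering of (+ 2 (* 3 4)) can be watched in a small sketch.  The
expression is hand-built as a tree rather than read in as Lisp, and
op[], lhs[], rhs[], num[], push(), pop(), and both evaluators are
hypothetical names used only here.

```awk
# Sketch: a shared global_temp[] is clobbered by nested calls, while a
# stack is not.  The tree below stands for (+ 2 (* 3 4)).
BEGIN {
    num[2] = 2; num[4] = 3; num[5] = 4
    op[3] = "*"; lhs[3] = 4; rhs[3] = 5     # node 3 is (* 3 4)
    op[1] = "+"; lhs[1] = 2; rhs[1] = 3     # node 1 is (+ 2 <node 3>)
    stack_ptr = 0

    print "global_temp: " eval_global(1)
    print "stack:       " eval_stack(1)
}

function arith(operator, x, y) {
    return operator == "+" ? x + y : x * y
}

function eval_global(node) {
    if (!(node in op)) return num[node]       # a leaf: just a number
    global_temp[0] = eval_global(lhs[node])
    global_temp[1] = eval_global(rhs[node])   # may overwrite global_temp[0]
    return arith(op[node], global_temp[0], global_temp[1])
}

function push(datum) { stack[stack_ptr++] = datum }
function pop()       { return stack[--stack_ptr] }

function eval_stack(node,    x, y) {
    if (!(node in op)) return num[node]
    push(eval_stack(lhs[node]))               # nested calls push above this
    push(eval_stack(rhs[node]))
    y = pop()
    x = pop()
    return arith(op[node], x, y)
}
```

The global_temp version yields 15, because the 2 has been replaced by
3 by the time the addition happens; the stack version yields 14.
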
You may object that we can solve that by just making the global array
local, and that's true; lots of small local arrays may or may not be
more efficient than one big global stack, in awk -- we'd have to try
it out to see.  But the real problem I alluded to at the start of this
message is this: the garbage collector has to be able to find all the
live references to the car[] and cdr[] arrays.  If some of those
references are hidden away in local variables of recursive procedures,
we're stuck.  With the global stack, they're all right there for the
gc().

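To make "right there for the gc()" concrete, here is a sketch of a
mark phase whose only roots are the entries of stack[].  It is a
simplification, not the collector in the code: the pair tag 1, the
value 2 standing in for NIL, and the functions cons(), mark(), and
gc_mark() as written here are all assumptions.

```awk
# Sketch: a mark phase that treats the global stack as its root set.
# Assumed for illustration: pairs carry tag 1 and NIL is the value 2.
BEGIN {
    NIL = 2
    next_pair = 1
    stack_ptr = 0

    kept = cons(4, cons(8, NIL))    # the list (1 2); integers are N*4
    lost = cons(12, NIL)            # the list (3), not on the stack

    stack[stack_ptr++] = kept       # only kept is a root

    gc_mark()
    printf "kept marked: %d\n", (kept in marks)
    printf "lost marked: %d\n", (lost in marks)
}

function cons(the_car, the_cdr,    ptr) {
    ptr = next_pair
    next_pair += 4
    car[ptr] = the_car
    cdr[ptr] = the_cdr
    return ptr
}

function is_pair(datum) { return datum % 4 == 1 }

# Mark every pair reachable from one datum: recurse on the car,
# loop on the cdr.
function mark(datum) {
    while (is_pair(datum) && !(datum in marks)) {
        marks[datum] = 1
        mark(car[datum])
        datum = cdr[datum]
    }
}

# Every root is an entry in stack[], so one loop finds them all.
function gc_mark(    i) {
    for (i = 0; i < stack_ptr; ++i)
        mark(stack[i])
}
```

This prints "kept marked: 1" and "lost marked: 0".  A pair referenced
only from a local variable of some awk function would look just like
lost to this loop.
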
(In C we could use the local-arrays approach by threading a chain of
pointers from each one to the next; but awk doesn't have pointers.)

(You may wonder how the code gets away with having a number of local
variables holding lisp values, then -- the answer is that in every
such case we can be sure the garbage collector can find the values
in question from some other source.  That's what this comment is
about:

# All the interpretation routines have the precondition that their
# arguments are protected from garbage collection.

In some cases where the values would not otherwise be guaranteed to
be available to the gc, we call protect().)

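One plausible shape for protect() is a push onto the same stack, with
a matching pop once the value is safe again.  The bodies below are a
guess for illustration, not copied from the code, and is_root() is a
hypothetical helper that stands in for what a root scan would see.

```awk
# Sketch: protect() makes a value held in a local visible to the gc by
# pushing it onto the stack; unprotect() pops it again.
# These bodies and is_root() are assumptions for illustration.
BEGIN {
    stack_ptr = 0
    tail = 5                         # stands for some freshly built pair

    printf "before protect:  %d\n", is_root(tail)
    protect(tail)
    printf "while protected: %d\n", is_root(tail)
    unprotect()
    printf "after unprotect: %d\n", is_root(tail)
}

function protect(object) { stack[stack_ptr++] = object }
function unprotect()     { --stack_ptr }

# True if a scan over stack[0 .. stack_ptr-1] would see the object.
function is_root(object,    i) {
    for (i = 0; i < stack_ptr; ++i)
        if (stack[i] == object) return 1
    return 0
}
```

The three lines printed end in 0, 1, and 0: the value is visible to a
root scan only between the protect() and the unprotect().
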
Oh, there's another reason why apply() doesn't evaluate the arguments
itself: it's called by do_apply(), which handles lisp calls like
(apply car '((x))) -- where we *don't* want the x to get evaluated
by apply().


4. Um, what I was going to write about

more on data representation
is_foo procedures slow it down by a few percent but increase clarity
(try replacing them and other stuff with macros, time it.)

gc: overview; how to write gc-safe code using protect(); point out
        that relocating gcs introduce further complications

driver loop, macros

evaluation
globals for temp values because of recursion, space efficiency
environment -- explicit stack needed because of gc

error handling, or lack thereof
strategies for cheaply adding error recovery

I/O