Implementation notes


1. Overview

Since the code should be self-explanatory to anyone knowledgeable
about Lisp implementation, these notes assume you know Lisp but not
interpreters.  I haven't got around to writing up a complete
discussion of everything, though.

The code for an interpreter can be pretty low on redundancy -- this is
natural because the whole reason for implementing a new language is to
avoid having to code a particular class of programs in a redundant
style in the old language.  We implement what that class of programs
has in common just once, then use it many times.  Thus an interpreter
has a different style of code, perhaps denser, than a typical
application program.


2. Data representation

Conceptually, a Lisp datum is a tagged pointer, with the tag giving
the datatype and the pointer locating the data.  We follow the common
practice of encoding the tag into the two lowest-order bits of the
pointer.  This is especially easy in awk, since arrays with
non-consecutive indices are just as efficient as dense ones (so we can
use the tagged pointer directly as an index, without having to mask
out the tag bits).  (But, by the way, mawk accesses negative indices
much more slowly than positive ones, as I found out when trying a
different encoding.)
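
The encoding described above can be sketched as follows.  This is
Python rather than awk, with a dict standing in for an awk array; the
tag values and the names make_ref, tag_of and printname are invented
for the illustration and are not taken from the awklisp source.

```python
# Hypothetical tag values, chosen only for this sketch.
A_NUMBER = 0
A_PAIR = 1
A_SYMBOL = 2


def make_ref(tag, slot):
    """Pack a slot number and a two-bit tag into one tagged value."""
    return slot * 4 + tag


def tag_of(ref):
    """Recover the tag from the two lowest-order bits."""
    return ref % 4


# Sparse tables keyed by the tagged value itself, as awk arrays allow.
car = {}
cdr = {}
printname = {}

NIL = make_ref(A_SYMBOL, 0)
printname[NIL] = "nil"


def cons(head, tail, slot):
    """Store a pair in the given slot and return its tagged value."""
    ref = make_ref(A_PAIR, slot)
    car[ref] = head
    cdr[ref] = tail
    return ref


one = 1 * 4                      # the integer 1, with tag bits zero
lst = cons(one, NIL, slot=1)     # the list (1)
```

Because the tables are sparse, the pair's tagged value (5 here) is
used as the index into car and cdr as it stands; nothing is shifted or
masked on the way in.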

This Lisp provides three datatypes: integers, lists, and symbols.  (A
modern Lisp provides many more.)

For an integer, the tag bits are zero and the pointer bits are simply
the numeric value; thus, N is represented by N*4.  This choice of the
tag value has two advantages.  First, we can add and subtract without
fiddling with the tags.  Second, negative numbers fit right in.
(Consider what would happen if N were represented by X = 1+N*4
instead, and we tried to extract the tag as X%4, where N may be either
positive or negative: awk's % takes the sign of its left operand, so
a negative N would yield a negative remainder rather than the tag 1.
Because of this problem and the above-mentioned inefficiency of
negative indices, all other datatypes are represented by positive
numbers.)
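
A small sketch of the arithmetic, again in Python rather than awk.
Python's own % always returns a non-negative result for a positive
modulus, so the helper awk_mod (a name made up here) imitates awk's
remainder, which keeps the sign of the left operand.  make_number and
numeric_value are likewise hypothetical names.

```python
import math


def awk_mod(a, b):
    """Remainder with the sign of the dividend, as awk's % gives."""
    return int(math.fmod(a, b))


def make_number(n):
    return n * 4             # tag bits are zero


def numeric_value(x):
    return x // 4            # exact, since x is a multiple of 4


# Addition works directly on the tagged values.
a = make_number(7)
b = make_number(-10)
total = a + b                # same as make_number(-3)

# With tag zero the remainder is 0 for either sign.
tag_pos = awk_mod(make_number(5), 4)
tag_neg = awk_mod(make_number(-5), 4)

# With the alternative encoding 1+N*4, negative values misbehave.
alt_pos = awk_mod(1 + 5 * 4, 4)      # 21 leaves remainder 1
alt_neg = awk_mod(1 + (-5) * 4, 4)   # -19 leaves remainder -3, not 1
```

The last line is the problem the parenthetical describes: the
"tag" of a negative number would come out as -3 under that encoding.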


3. The evaluation/saved-bindings stack

The following is from an email discussion; it doesn't develop
everything from first principles but is included here in the hope
it will be helpful.

Hi.  I just took a look at awklisp, and remembered that there's more
to your question about why we need a stack -- it's a good question.
The real reason is that a stack is accessible to the garbage
collector.

We could have had apply() evaluate the arguments itself, and stash
the results into variables like arg0 and arg1 -- then the case for
ADD would look like

  if (proc == ADD) return is(a_number, arg0) + is(a_number, arg1)

The obvious problem with that approach is how to handle calls to
user-defined procedures, which could have any number of arguments.
Say we're evaluating ((lambda (x) (+ x 1)) 42).  (lambda (x) (+ x 1))
is the procedure, and 42 is the argument.

A (wrong) solution could be to evaluate each argument in turn, and
bind the corresponding parameter name (like x in this case) to the
resulting value (while saving the old value to be restored after we
return from the procedure).  This is wrong because we must not
change the variable bindings until we actually enter the procedure --
for example, with that algorithm ((lambda (x y) y) 1 x) would return
1, when it should return whatever the value of x is in the enclosing
environment.  (The eval_rands()-type sequence would be: eval the 1,
bind x to 1, eval the x -- yielding 1 which is *wrong* -- and bind
y to that, then eval the body of the lambda.)
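
The difference between the two orders can be seen in a toy model.
This is a Python sketch, not awklisp code: variables live in one
global dict called env, expressions are either integers or variable
names, and call_wrong, call_right, restore and MISSING are names
invented for the example.

```python
MISSING = object()        # marks "this name had no binding before"
env = {"x": 10}           # the enclosing environment: x is 10


def eval_expr(expr):
    """Integers evaluate to themselves; names are looked up in env."""
    return env[expr] if isinstance(expr, str) else expr


def restore(saved):
    """Put back the bindings that a call replaced."""
    for name, old in reversed(saved):
        if old is MISSING:
            del env[name]
        else:
            env[name] = old


def call_wrong(params, body, operands):
    """Bind each parameter as soon as its operand is evaluated."""
    saved = []
    for name, operand in zip(params, operands):
        value = eval_expr(operand)        # may see earlier parameters
        saved.append((name, env.get(name, MISSING)))
        env[name] = value
    result = eval_expr(body)
    restore(saved)
    return result


def call_right(params, body, operands):
    """Evaluate every operand first, and only then bind."""
    values = [eval_expr(operand) for operand in operands]
    saved = []
    for name, value in zip(params, values):
        saved.append((name, env.get(name, MISSING)))
        env[name] = value
    result = eval_expr(body)
    restore(saved)
    return result


# ((lambda (x y) y) 1 x) with x bound to 10 outside the call
wrong = call_wrong(["x", "y"], "y", [1, "x"])
right = call_right(["x", "y"], "y", [1, "x"])
```

Here wrong comes out as 1 and right as 10, which is the discrepancy
the paragraph above describes.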

Okay, that's easily fixed -- evaluate all the operands and stash them
away somewhere until you're done, and *then* do the bindings.  So
the question is where to stash them.  How about a global array?
Like

  for (i = 0; arglist != NIL; ++i) {
    global_temp[i] = eval(car[arglist])
    arglist = cdr[arglist]
  }

followed by the equivalent of extend_env().  This will not do, because
the global array will get clobbered in recursive calls to eval().
Consider (+ 2 (* 3 4)) -- first we evaluate the arguments to the +,
like this: global_temp[0] gets 2, and then global_temp[1] gets the
eval of (* 3 4).  But in evaluating (* 3 4), global_temp[0] gets set
to 3 and global_temp[1] to 4 -- so the original assignment of 2 to
global_temp[0] is clobbered before we get a chance to use it.  By
using a stack[] instead of a global_temp[], we finesse this problem.
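
The clobbering is easy to reproduce.  The following Python sketch
(not awklisp code) evaluates nested tuples such as ("+", 2, ("*", 3, 4))
once with a shared global_temp and once with a stack; the function
names and the two-operand restriction are simplifications made for the
example.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

global_temp = {}
stack = []


def eval_with_global_temp(expr):
    """Stash operands in fixed global slots: nested calls overwrite them."""
    if isinstance(expr, int):
        return expr
    op, *operands = expr
    for i, operand in enumerate(operands):
        global_temp[i] = eval_with_global_temp(operand)
    return OPS[op](global_temp[0], global_temp[1])


def eval_with_stack(expr):
    """Push operands above whatever the callers have already pushed."""
    if isinstance(expr, int):
        return expr
    op, *operands = expr
    base = len(stack)
    for operand in operands:
        stack.append(eval_with_stack(operand))
    result = OPS[op](stack[base], stack[base + 1])
    del stack[base:]          # pop this call's operands
    return result


expr = ("+", 2, ("*", 3, 4))
clobbered = eval_with_global_temp(expr)
correct = eval_with_stack(expr)
```

With the global slots the result is 15, because the 2 was overwritten
by the 3; with the stack it is 14.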

You may object that we can solve that by just making the global array
local, and that's true; lots of small local arrays may or may not be
more efficient than one big global stack, in awk -- we'd have to try
it out to see.  But the real problem I alluded to at the start of this
message is this: the garbage collector has to be able to find all the
live references to the car[] and cdr[] arrays.  If some of those
references are hidden away in local variables of recursive procedures,
we're stuck.  With the global stack, they're all right there for the
gc().

(In C we could use the local-arrays approach by threading a chain of
pointers from each one to the next; but awk doesn't have pointers.)

(You may wonder how the code gets away with having a number of local
variables holding lisp values, then -- the answer is that in every
such case we can be sure the garbage collector can find the values
in question from some other source.  That's what this comment is
about:

  # All the interpretation routines have the precondition that their
  # arguments are protected from garbage collection.

In some cases where the values would not otherwise be guaranteed to
be available to the gc, we call protect().)
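
To make the role of the stack concrete, here is a minimal
mark-and-sweep sketch in Python.  It is an illustration under several
assumptions, not the awklisp collector: the stack is the only root
(a real collector would have others), the tag value and NIL are
invented, and protect() and unprotect() are guessed to be a push and
a pop.

```python
A_PAIR = 1            # hypothetical tag for pairs
NIL = 2               # hypothetical tagged value for the empty list

car, cdr = {}, {}     # the heap: sparse tables keyed by tagged value
stack = []            # the one global stack, visible to gc()
next_slot = 1


def cons(head, tail):
    """Allocate a pair and return its tagged value."""
    global next_slot
    ref = next_slot * 4 + A_PAIR
    next_slot += 1
    car[ref], cdr[ref] = head, tail
    return ref


def protect(ref):
    stack.append(ref)


def unprotect():
    stack.pop()


def mark(ref, marked):
    """Mark every pair reachable from ref."""
    while ref % 4 == A_PAIR and ref not in marked:
        marked.add(ref)
        mark(car[ref], marked)
        ref = cdr[ref]


def gc():
    """Keep what the stack can reach; free every other pair."""
    marked = set()
    for ref in stack:
        mark(ref, marked)
    for ref in list(car):
        if ref not in marked:
            del car[ref], cdr[ref]


kept = cons(4, cons(8, NIL))     # the list (1 2), pushed on the stack
stack.append(kept)
hidden = cons(12, NIL)           # held only in a local variable
guarded = cons(16, NIL)          # local too, but protected
protect(guarded)
gc()
unprotect()
```

After the collection, kept and guarded are still in the heap, while
hidden has been freed even though a variable still refers to it.
That is the hazard of keeping a reference where gc() cannot see it.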

Oh, there's another reason why apply() doesn't evaluate the arguments
itself: it's called by do_apply(), which handles lisp calls like
(apply car '((x))) -- where we *don't* want the x to get evaluated
by apply().
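
A rough Python model of that division of labour, with lists standing
in for Lisp lists and strings for symbols.  The names apply_proc,
eval_expr and PRIMS are invented, and procedures are looked up by
name, which is a simplification; only the relationship between the
evaluator, apply and do_apply is meant to carry over.

```python
env = {"x": 42}
PRIMS = {"car": lambda lst: lst[0]}


def apply_proc(proc, args):
    """Apply proc to arguments that are already values."""
    return PRIMS[proc](*args)


def eval_expr(expr):
    """Evaluate operands here, then hand the values to apply_proc."""
    if isinstance(expr, str):
        return env[expr]
    if isinstance(expr, list):
        if expr[0] == "quote":
            return expr[1]
        return apply_proc(expr[0], [eval_expr(e) for e in expr[1:]])
    return expr


def do_apply(proc, arglist):
    """Lisp-level apply: arglist is data and is passed on untouched."""
    return apply_proc(proc, arglist)


# (car '(x)), an ordinary call
direct = eval_expr(["car", ["quote", ["x"]]])

# (apply car '((x))), where the argument list holds the one argument (x)
via_apply = do_apply("car", eval_expr(["quote", [["x"]]]))
```

Both calls yield the symbol x itself.  Had apply_proc evaluated its
arguments, the second call would have tried to evaluate (x) as an
expression instead.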


4. Um, what I was going to write about

more on data representation
is_foo procedures slow it down by a few percent but increase clarity
(try replacing them and other stuff with macros, time it.)

gc: overview; how to write gc-safe code using protect(); point out
	that relocating gcs introduce further complications

driver loop, macros

evaluation
globals for temp values because of recursion, space efficiency
environment -- explicit stack needed because of gc

error handling, or lack thereof
strategies for cheaply adding error recovery

I/O