SFR-Lisp-1

I wrote a toy Lisp interpreter in C because, as my favorite physicist once put it, "What I cannot create, I do not understand." It implements reference-counted garbage collection and a secondary GC to handle circular references. Also, in an effort to learn something about WASM/WASI, I've compiled it to WASM and wrote a little web interface for it.

I am very glad I finally got around to writing a Lisp interpreter in C. I'm relatively happy with the simplicity of the runtime system. Hopefully I'll get a chance to bolt on some other language on top of it. It would be fun, for example, to write a JavaScript or Ruby interpreter using the same object system / runtime.

I also hope to use the code as the basis for a long tutorial or small book. It would be titled something like: "How to Write a Lisp Interpreter (in C)". For now, let me share some of my favorite parts of the interpreter and the experience.

Extending Printf With Custom Format Specifiers

At some point, after I created a base set of classes, and a generic Object class, I was doing a lot of printing. When I wanted to print an object, in the context of a string, I was doing multiple calls to printf, one for the part before the object, a call to the Object's print method, and another for the part after the object. It was getting very tedious. I knew what I wanted was a Golang-style fmt.Printf("obj = %v", obj) thing. But how to get this working in C? How to extend the format modifiers supported by printf to call my Object's print method?

After some false starts, it finally occurred to me, that it is possible to do this by simply replicating the steps that I was already doing. You want to treat everything other than our new format specifier as arguments to regular printf, and somewhere in between handle our special format specifier. The real question was, how can we only send certain parts of the va_list in our special print function to printf, while slicing the incoming arguments into different parts? Thanks to the existence of vprintf this is totally possible. Let me show you the code.


void ObjectUtil_eprintf_sig(int SIGSIZE, char** sigptr, int* sigposptr, va_list* argv) {
  char* sig = *sigptr;
  if(strcmp(sig, "%v") == 0) {
    Object_print(va_arg(*argv, void*));
  }
  else {
    vprintf(*sigptr, *argv);
  }
  memset(*sigptr, 0, SIGSIZE);
  *sigposptr = 0;
}

void ObjectUtil_eprintf_buf(int BUFSIZE, char** bufptr, int* bufposptr) {
  if(*bufposptr > 0) {
    fputs(*bufptr, stdout);
  }
  memset(*bufptr, 0, BUFSIZE);
  *bufposptr = 0;
}

void ObjectUtil_eprintf(char* fmt, ...) {
  int i, j;
  int argc = 0;
  va_list argv;
  size_t fmt_len = strlen(fmt);
  char ch;
  for(i = 0; i < fmt_len; i++) {
    ch = fmt[i];
    if(ch == '%') {
      argc++;
    }
  }
  va_start(argv, fmt);
  int   BUFSIZE = fmt_len+1;
  char* buf = calloc(1, BUFSIZE);
  int   bufpos = 0;
  int   SIGSIZE = 20; 
  char* sig = calloc(1, SIGSIZE);
  int   sigpos = 0;
  char  insig = 0;
  int   argpos = 0;
  for(i = 0; i < fmt_len; i++) {
    ch = fmt[i];
    if(insig) {
      if((ch >= 'a' && ch <= 'z') || (ch >= '0' && ch <= '9')) {
        sig[sigpos++] = ch;
      }
      else {
        ObjectUtil_eprintf_sig(SIGSIZE, &sig, &sigpos, &argv);
        insig = 0;
        if(i > 0) { i--; }
      }
    }
    else
    if(ch == '%') {
      insig = 1;
      sig[0] = '%';
      sigpos = 1;
      ObjectUtil_eprintf_buf(BUFSIZE, &buf, &bufpos);
    }
    else {
      buf[bufpos++] = ch;
    }
  }
  if(insig && sigpos > 0) {
    ObjectUtil_eprintf_sig(SIGSIZE, &sig, &sigpos, &argv);
  }
  else
  if(bufpos > 0) {
    ObjectUtil_eprintf_buf(BUFSIZE, &buf, &bufpos);
  }
  free(buf);
  free(sig);
  va_end(argv);
}

Since we always need items from the va_list in the original order in which it came, we do not have to slice the va_list. We can pass it as it is. We can do this because, printf/vprintf does not really care about the size of the va_list, nor does it know. Like all variable argument functions in C, it only knows the number of variable arguments to read from the stack from some information in one of the non-variable arguments. In the case of printf/vprintf, it is the format string that tells it how many arguments to read from the stack, it counts the number of format specifiers. We can easily manipulate the format string. Which is what I have done here.

We know what format specifiers look like, and so we can extract the format specifiers from the format string and pass those normal format strings with normal format specifiers to vprintf. When we encounter our "extended format specifier", that is "%v", then we can call our Object_print method, which will call the correct print method of the implementation class based on the object type.

This was one of my favorite parts of the code base, because I didn't know if it was possible, and I certainly did not know if it was possble for me to do it.

Here is how you would use it, taken from one of the tests


//...code trimmed...
printf("Test Number + Number ...\n");
Object* res1 = Object_accept(Object_bop_add(num1, num2));
nassert(res1->rc == 1);
ObjectUtil_eprintf("%v + %v = %v\n", num1, num2, res1);

I like the ergonomics of the result. It is now, much easier to print these objects in this toy object system. Getting this into the code relatively early was really helpful.

Note: There is a memory issue in ObjectUtil_eprintf, see if you can spot it. I'm making some assumptions about how big a format specifier can be.

The Ownership Model

Let me briefly tell you about the ownership model. All objects in the object system for this thing are reference counted. Each object keeps a reference field called rc. Reference counting is simple, anyone can do it, and even non-toy serious interpreters like the CPython interpreter use this mechanism. C++ makes this sort of reference counting easy, with std::shared_pointer. But I wasn't in C++, I couldn't count on something like smart-pointers to increment the reference-count on assignment and decrement it on end-of-scope object destruction.

Smart-pointers and reference counting are really trying to solve the ownership problem. Who owns this object? Who is responsible for destroying it? I wonder what it means that ownership of something essentially translates to the right/responsibility to destroy it. Anyway, let me try not to get distracted.

Since I didn't have the magic of C++ operator overloading, and destructors being called at end-of-scope, I decided on an explicit set of ownership semantics by defining the following functions:


Object* Object_accept(Object* self)
Object* Object_reject(Object* self)
Object* Object_return(Object* self)
void    Object_assign(Object** targetptr, Object* source)

I love that the names for these line up in character count. Let me describe how they are used.

When a function receives an object Object* obj, you call Object_accept(obj) on it. This will increment the reference count and ensure that it is not in a returning state. The returning state tries to solve the 0 reference-count problem with factory functions, what to do when a function has just built a new object and is returning it to the caller? At the end of that function, there are zero stack variables referring to it, but it should not be garbage collected yet. In this case, it is the caller's responsibility to accept it or reject it with Object_accept or Object_reject.

All function calls are already a contract between the caller and the receiver (the function). In this scheme, I have extended that contract to ensure that these reference counted objects are properly destroyed.

What about Object_assign? When you are done with an object, you should assign the variable to NULL, using Object_assign(&obj, NULL);. Or if you assign another variable on the stack to some object pointer, you should use Object_assign(&obj, val).

The Garbage Collector

Reference counted garbage collection has one major drawback, it cannot destroy objects which form a reference cycle. To understand why, let's take a look at an example with two objects. Suppose two objects points at each other, forming a two object cycle.

Before
After

Suppose we have a stack variable Object* aPtr which refers to A. And A is a container-type object, like List or Hash, it refers to another container-type object B, which refers back to A. We have a cycle. Notice that only container types objects can form cycles, because they are the only types capable of referring to other objects. If we now delete aPtr by assigning it to NULL, with say Object_assign(&aPtr, NULL). We are left with a Objects A and B referring to each other, both with a reference count of 1. The count is correct, because they do refer to each other, you see two in-degree arrows, but the fact they still exist is wrong.

I decided to do something similar to what the CPython interpreter does. We can rely on regular reference counting to deal with most objects. Most objects will not form reference cycles. And we can create a separate garbage collector to find and destroy any objects that are left over. But how do we find these reference cycles?

We can use the technique of internal-reference-elimination, and it works as follows. First, separate your mental model of all objects such that we correctly differentiate between the stack and the heap. All of our objects all live on the heap, and our object pointers live on the stack.

We create a field called rc_gc which we denote in the diagrams below with a slash following the rc count. Set this field to equal the rc value. Iterate through all of the object relationships in the object system, walk the edges of the heap graph. This is not a recursive process, it's iterative, we only need to visit the immediate referred objects of the current referring object. For each referred object, we now know that something on the heap is referring to it, so decrease the rc_gc count by 1.

All objects that have an rc_gc count of zero, should now tentatively be marked as unreachable. I say tentative, because an object that now appears unreachable, might still be reachable from a reachable object. In my example below, that's Object C, it will appear unreachable after the first walk through all objects. Now, iterate all of the edges once again, for a second time: Any object that is referred to by a reachable object should be unmarked as unreachable. Lastly, all unreachable objects at this point are really unreachable, destroy them.

The panels below will try to show each of these 3 phases. I'm using another color to indicate that the object is flagged as unreachable.

That's all there is to it. I've skipped over a few details. For example, when dealing with objects in the returning state, we simply treat them as reachable. This algorithm does a good job of identifying all objects which are no longer referenced by variables on the stack.

Conclusion

This was a really fun project. I hope you have enjoyed some of the highlights.

Quick Links

  • Github.
  • Try it! (WASM/WASI).