Writing My Own Language (Part 1)

4 Jul 2024

17 minute read

About a year ago, Writing An Interpreter In Go became popular, thanks to streamers like ThePrimeagen and teej_dv. This motivated me to read and follow the book myself. I documented my journey in this GitHub repo.

After completing the book and its successor, Writing A Compiler In Go, I became hooked on the idea of creating a usable language. But what makes a language usable? Taking inspiration from this project, I decided to define a language as usable if I can use it to solve this year’s Advent of Code.

Unfortunately for me, I am a spoiled programmer. If I am going to spend a month programming in a language, I want it to be more than just an interpreter. Specifically, I also want syntax highlighting and a language server. With this in mind, I named my language AOC lang (Almost Operational Coding Language) and began my journey. For those interested, the code is available on my GitHub.

In this first part of the project, we will focus on the interpreter.

Design Decisions

If you’ve read the aforementioned books, you will know there are many decisions to make when creating a programming language. If not, just take my word for it :)

I will group these decisions into two main categories: language behavior and interpreter implementation.

Language Behavior

If you want a more complete overview of the language, see the overview example. Here I’ll try to summarize what the language can do and why I decided it should be this way.

Data Types

Most of the language features are pretty self-explanatory, so I won’t waste too much time going over them. The language has all the types you would expect: integers, strings, characters, booleans, arrays, and dictionaries. It also has built-in functions that allow us to convert between the types. For instance, int("10") parses the string to an int.

Functions

What language doesn’t have functions? In fact, in AOC lang, functions are not just functions; they are closures and first-class citizens. Functions being closures means they can capture variables from outside the function body. For instance:

outside = 10
f = fn() {
    return outside * 2
}

f() // returns 20

Functions being first-class citizens means they can be used like any other value (e.g., strings or integers). They can be inserted into an array or bound to a variable. In fact, binding to a variable is the only way to construct a named function!

I also added a lot of built-in functions. There are functions for dealing with stdin and stdout, string manipulation (trim, split, etc.), float rounding, and more. Basically, everything one needs to solve Advent of Code!

Control Flow

The language has all the control flow statements and loops you would expect. It has if/else statements, for and while loops, and break and continue statements.

Everything is an Expression

Now let’s look at more fun decisions. Almost everything in AOC lang is an expression, which means it returns a value. For instance, we can do things like this:

variable = if (condition) {
    10
} else {
    20
}

In the above example, the value stored in variable is 10 if condition is truthy¹ and 20 otherwise.

The exceptions to this are loops (what would the value be there?) and assignments. Assignments are an exception only because that seemed like a good idea at the beginning and I stuck with it. If I were to implement it again, I would probably make assignments expressions as well.

Array Unpacking

Another fun decision was how to handle functions that need to return more than one value. An obvious solution is returning an array, but that makes using the values very verbose:

res = function()
res1 = res[0]
res2 = res[1]
// ...

I decided to implement a shorthand where we can unpack arrays:

[a, b, c] = [10, 20, 30]

So if the function returns an array, we can assign it to individual variables!

Objects

Let’s take a look at how to organize object-like data. For instance, when implementing a custom data structure, I want to structure it nicely. I didn’t want to implement objects or structs, since that is a lot of additional work (and I am lazy). Instead, I decided to (ab)use dictionaries. I added syntactic sugar for accessing dictionary fields: if a dictionary field is named like an identifier, we can access it with dot notation:

// Standard way:
dict["foo"]
// Dot notation:
dict.foo

This, combined with functions being first-class closures, allows us to construct simple objects:

// Construct an object
obj = {
    "value": 0
}
obj.increment = fn() {
    obj.value = obj.value + 1
}

// Use it
print(obj.value) // 0
obj.increment()
print(obj.value) // 1

Error Handling

I didn’t want to spend too much time on error handling, so I did it the easy way. I introduced a billion-dollar mistake into the language – null. Basically, everything that should raise an exception returns null. Accessing an array index out of bounds? null. Accessing a dictionary key not present in the dictionary? null. Value of if/else without else? null. You get the idea.

Imports

Probably the most interesting design decision was what to do with imports. I knew I wanted and needed imports. Throughout December, I will inevitably write code I will want to reuse on different days. Maybe I will have to implement array sorting, sets, or a priority queue. Who knows?

I wanted to do something similar to what Zig does. In Zig, the import statement returns a struct that corresponds to the imported file. The only problem is, my language doesn’t really have structs. But it does have functions! I decided that importing a file should execute the file contents as a function and return whatever the result of the function execution is.

// adder.aoc
fn(a, b) {a + b}

// main.aoc
adder = use "adder.aoc"
print(adder(1, 2)) // 3

Interpreter Implementation

Before discussing individual decisions, let’s take an overview of what goes into interpreting a language.

First, the input gets converted from a string into a list of tokens. This is the job of a lexer. It converts an input like this:

x + 1

into a list of tokens like this:

[ Ident("x"), Plus, Integer(1) ]

The list of tokens then gets converted to an AST (Abstract Syntax Tree). This is a tree representation of the program.

Finally, the AST can be evaluated, and the result displayed on the screen.

Parser

There are a lot of parsing techniques that can be used. Initially, I toyed with the idea of using an LR(1) parser, which is a bottom-up approach. It is one of the most widely used parsing techniques, which is why I considered it. Usually, this kind of parser is not written by hand; instead, it’s generated from a formal grammar. I didn’t want to go too deep into the world of parser generators for this project, and I also didn’t want to use something that I didn’t understand, so I abandoned this idea.

In the end, I went with the Pratt parser, which is a top-down recursive descent parser. The method is also described in Writing An Interpreter In Go, so I was already familiar with it. This is the main reason I chose it. For those interested in an overview of the Pratt parser, I found this blog post to be a great resource.
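
For a concrete feel of the technique, here is a minimal sketch of the Pratt-parsing core in Rust. The Token and Expr types and the binding powers are simplified assumptions for illustration, not the actual AOC lang parser:

// A minimal Pratt-parsing core; Token, Expr, and the binding powers
// are simplified stand-ins, not the real AOC lang types.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Integer(i64),
    Plus,
    Star,
    Eof,
}

#[derive(Debug)]
enum Expr {
    Integer(i64),
    Infix(Box<Expr>, Token, Box<Expr>),
}

struct Parser {
    tokens: Vec<Token>,
    pos: usize,
}

impl Parser {
    fn peek(&self) -> Token {
        self.tokens.get(self.pos).cloned().unwrap_or(Token::Eof)
    }

    fn next(&mut self) -> Token {
        let token = self.peek();
        self.pos += 1;
        token
    }

    // Each infix operator gets a binding power; higher binds tighter.
    fn binding_power(token: &Token) -> Option<u8> {
        match token {
            Token::Plus => Some(1),
            Token::Star => Some(2),
            _ => None,
        }
    }

    // The Pratt loop: parse a prefix expression, then keep folding in
    // infix operators that bind tighter than the current minimum.
    fn parse_expr(&mut self, min_bp: u8) -> Expr {
        let mut left = match self.next() {
            Token::Integer(value) => Expr::Integer(value),
            other => panic!("unexpected token: {:?}", other),
        };
        while let Some(bp) = Self::binding_power(&self.peek()) {
            if bp <= min_bp {
                break;
            }
            let operator = self.next();
            let right = self.parse_expr(bp);
            left = Expr::Infix(Box::new(left), operator, Box::new(right));
        }
        left
    }
}

Calling parse_expr(0) on the tokens for 1 + 2 * 3 produces an AST equivalent to 1 + (2 * 3), because * has a higher binding power than +.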

Interpreter

By now you probably know what I’ll say: there are many interpreter variations to choose from. The big two are tree walking, which is described in Writing An Interpreter In Go, and bytecode, which is described in Writing A Compiler In Go.

A tree-walking interpreter recursively evaluates the nodes of the AST. It looks something like this:

eval(1 + 2)
    = eval(1) + eval(2)
    = 1 + eval(2)
    = 1 + 2
    = 3
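
In Rust, this recursion might look roughly like the following, using a hypothetical Expr type rather than the actual AOC lang AST:

// A hypothetical AST node and a recursive evaluator for it.
enum Expr {
    Integer(i64),
    Add(Box<Expr>, Box<Expr>),
}

fn eval(expr: &Expr) -> i64 {
    match expr {
        Expr::Integer(value) => *value,
        // Evaluate both operands recursively, then combine the results.
        Expr::Add(left, right) => eval(left) + eval(right),
    }
}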

As you can see, it’s a really simple approach that is easy to understand and implement. The problem is that it’s not very fast.

The alternative is a bytecode interpreter. In this case, the AST first gets compiled to bytecode, which is just a fancy word for predefined instructions that we come up with ourselves. Then the bytecode gets executed by a VM (Virtual Machine), which is just a fancy name for a program we write that can execute the bytecode we previously defined. This approach makes the interpreter a bit more complex but a lot faster. It’s also the approach I went with.
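
To make that more concrete, here is a toy stack-based VM in Rust. The instruction set is made up for this illustration and is much simpler than the real AOC lang bytecode:

// A toy bytecode and the stack machine that executes it.
enum Instruction {
    Push(i64), // push a constant onto the stack
    Add,       // pop two values, push their sum
    Mul,       // pop two values, push their product
}

fn run(bytecode: &[Instruction]) -> Option<i64> {
    let mut stack: Vec<i64> = Vec::new();
    for instruction in bytecode {
        match instruction {
            Instruction::Push(value) => stack.push(*value),
            Instruction::Add => {
                let right = stack.pop()?;
                let left = stack.pop()?;
                stack.push(left + right);
            }
            Instruction::Mul => {
                let right = stack.pop()?;
                let left = stack.pop()?;
                stack.push(left * right);
            }
        }
    }
    stack.pop()
}

fn main() {
    // 1 + 2 * 3 compiles to something like this:
    let bytecode = [
        Instruction::Push(1),
        Instruction::Push(2),
        Instruction::Push(3),
        Instruction::Mul,
        Instruction::Add,
    ];
    println!("{:?}", run(&bytecode)); // Some(7)
}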

Programming Language

The last big decision is: in which language should I write the interpreter? I wanted a language that has a powerful enough type system to represent an AST nicely. Additionally, the language in which I write the language server will be the same as the language for the interpreter, since the language server will reuse the parser, AST definitions, and more. The three languages that I am familiar with and that have such a type system are: Zig, Rust, and OCaml.

Zig is a language that I really like and hope will succeed, but as of the time of writing, the ecosystem is still very immature. This is why I skipped it for this project, but I definitely want to use it for some future project.

Deciding between Rust and OCaml was more difficult. OCaml is garbage collected, so I wouldn’t have to implement my own garbage collection. However, it is missing resizable arrays. Since AOC lang has resizable arrays, I would have to implement my own, but it’s not a big deal.

When deciding between the two, I had a feeling that I would have more fun doing it in Rust, even if I had to implement my own garbage collector. Additionally, Rust compiles to WebAssembly, so I can create a playground website similar to the Go Playground. Therefore, I went ahead and chose Rust.

Challenges

Now that we know what we are building and how we are going to build it, we can discuss the challenges I encountered along the way.

Error Location

Without interpreter errors having location information, the language is not really usable. If I am editing a file with 100 lines of code and the interpreter just reports a syntax error, I will give up. I need to know on which line the syntax error has occurred. In fact, I want to know the precise character range that is problematic.

The first issue is deciding how to represent a range of code. Luckily for me, I know that I also want to implement a language server. To make my future life easier, I can just use the same range definition as the one defined in the LSP. Basically, a range is defined as a pair of start and end positions, where a position is made up of a line index and a character offset in the line.
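
Translated into Rust, that might look roughly like this (the field names mirror the LSP definitions, not necessarily the exact ones used in AOC lang):

// A Range is a pair of Positions; a Position is a zero-based line index
// plus a character offset within that line.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Position {
    line: u32,
    character: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Range {
    start: Position,
    end: Position,
}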

To see how the character offset is measured, I looked further into the docs and stumbled upon this comment:

Character offsets count UTF-16 code units.

This is the default and must always be supported by servers.

So Rust strings are UTF-8 and I have to measure offsets in UTF-16. Great…

The solution was to inject the positions at the beginning, in the lexer. The lexer reads the input string character by character and keeps track of the current position as a line index and character offset. Getting the length of the character in UTF-16 is easy since Rust’s char has a len_utf16() method. This position is then attached to the emitted tokens.
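
Building on the Position sketch above, the bookkeeping can be as simple as this (simplified; the real lexer tracks more state):

impl Position {
    // Advance past one character, updating the line index and the
    // UTF-16 character offset as required by the LSP.
    fn advance(&mut self, ch: char) {
        if ch == '\n' {
            self.line += 1;
            self.character = 0;
        } else {
            // A char outside the Basic Multilingual Plane takes two
            // UTF-16 code units, everything else takes one.
            self.character += ch.len_utf16() as u32;
        }
    }
}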

The parser also keeps track of the ranges, and the range information is attached to each node in the AST. Finally, the compiler attaches the range information to each emitted bytecode instruction.

If at any stage of interpreting the program (lexing, parsing, compiling, or executing) we encounter an error, we know exactly where it occurred and can add the range information to the returned error type.

Strings

I foolishly decided that AOC lang strings would be Unicode strings instead of simple ASCII strings. As you can see from the previous section, strings and chars come with many problems to solve. Strings in AOC lang are represented with Rust strings, which means they are UTF-8 encoded.

I also want my strings to be indexable, which meant I first had to define a character data type. Additionally, Rust’s strings cannot be indexed by character position, since they are UTF-8 encoded: indexing by byte can land in the middle of a multi-byte character.

I knew that AOC lang would be primarily used for solving Advent of Code, and its inputs are (for now) always ASCII. So I solved the encoding problems by taking a lot of shortcuts in how AOC lang strings behave.

Parsing Comments

Every usable language has support for comments, and so does AOC lang. It supports basic single-line comments:

// this is a comment
1 + 1 // this is also a comment

Lexing the comments was easy. The interesting part was parsing them and representing them as part of the AST. My initial idea was to define a comment AST node that could be part of the ast::Program (an array of AST nodes) or part of an AST block node (also an array of AST nodes). The problem with this approach is handling comments that don’t take up the whole line. For instance, this is valid AOC code but doesn’t have an AST representation with the described approach:

[
    42,    // first element
    "foo", // second element
]

You might ask why I would even want to keep comments as part of the AST. The only reason I can’t ignore comments entirely is to future-proof the parser for the language server I’ll implement later. For instance, the hover information should include comments defined directly above the symbol the user is hovering over. The formatter also needs to know where comments are located in the code so that it can position them correctly.

So I don’t really need to keep comments as part of the AST. What I need is for the parser to emit a data structure that contains the comments and can quickly return the comment at a specific line. I went for the easy solution: this data structure ended up being an array to which comments, together with their positions, are appended during parsing. Since the input is parsed from top to bottom, the comments in the array are ordered by line index. Using bisection, we can quickly query for the comment at a specific line.
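
A sketch of that data structure, assuming a hypothetical Comment type with a line field:

// Comments are appended in parse order, so the vector is already sorted
// by line; a binary search finds the comment on a given line, if any.
struct Comment {
    line: u32,
    text: String,
}

struct Comments {
    comments: Vec<Comment>,
}

impl Comments {
    fn push(&mut self, comment: Comment) {
        self.comments.push(comment);
    }

    fn at_line(&self, line: u32) -> Option<&Comment> {
        self.comments
            .binary_search_by_key(&line, |comment| comment.line)
            .ok()
            .map(|index| &self.comments[index])
    }
}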

Garbage Collection

The last challenge to solve is memory management. AOC objects are represented as an Object enum in Rust:

enum Object {
    Null,
    Integer(i64),
    String(Rc<String>),
    // ...
}

Primitives like integers or floats are the easiest to represent; we just store a Rust integer or float. Strings are a bit more interesting. The Object has to be Clone, since we can have the same object in multiple places at once. The same object can be stored in a global variable, a local variable inside a function, and on the VM stack at the same time:

global = "foo"
function = fn(local) {
    len(local) // local is put on the VM stack
}
function(global)

Strings cannot contain another object, so we can just use Rc, which has the effect of making AOC strings pass by reference.

However, we can’t use the same Rc trick on objects that can hold other objects. Let’s take a look at this example with arrays:

cycle = []
push(cycle, cycle)

In the above example, we push the array cycle into itself. What we get is an array that contains itself. If we were to use just an Rc<Vec> as the underlying data structure, we would end up with a vector that is the owner of an Rc that is the owner of the vector itself. In other words, we would have a reference cycle. Similar cycles can be constructed using dictionaries or a combination of arrays, dictionaries, and closures.

We have to do two things:

  1. Come up with a way to explicitly free memory in Rust.
  2. Decide which objects should be freed and when.

Coming up with a way to free memory explicitly in Rust was a fun little challenge. In a language like C or Zig, we could just call malloc, remember the pointer addresses of allocated objects, and call free on those addresses when we don’t need them anymore. In Rust, memory usually gets freed when an object is dropped. This is also how most of the smart pointers in the standard library behave.

I ended up using Rc and Weak. A detailed explanation of how Rc and Weak work is available here. The basic idea is that Rc keeps count of how many instances of Rc point to the owned object. When an instance of Rc is dropped, the count is decreased. If the count reaches zero, the owned object is dropped and deallocated. Weak also points to an object, but it doesn’t contribute to the count. To dereference Weak, we first have to upgrade it to Rc. If the object was freed before we try to upgrade a Weak, we get None.
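
A tiny standalone example of that behavior:

use std::rc::{Rc, Weak};

fn main() {
    let strong: Rc<String> = Rc::new(String::from("still alive"));
    let weak: Weak<String> = Rc::downgrade(&strong);

    // While the Rc exists, upgrading the Weak succeeds.
    assert!(weak.upgrade().is_some());

    // Dropping the last Rc frees the value...
    drop(strong);

    // ...so upgrading the Weak now returns None.
    assert!(weak.upgrade().is_none());
}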

This gives us the basic idea for garbage collection. When we want to allocate a new object that should be managed (like an array or dictionary), we call the GarbageCollector’s allocate method. This method allocates a new object with Rc::new. It stores this Rc but returns a Weak. When we work with an object allocated this way, we always have to call upgrade on it to get the Rc. However, the garbage collector can decide to free the object by dropping the stored Rc. This will decrease the count to zero and free the allocated object.

The implementation itself is pretty simple. We first define two structs: one that represents a reference to an allocated object and one that represents the owner. If the owner gets dropped, the object is deallocated:

struct Ref<T> {
    value: Weak<RefCell<T>>,
    id: usize,
}

struct Owner {
    value: Rc<dyn Any>,
}

We can also define the GarbageCollector struct as:

struct GarbageCollector {
    owners: HashMap<usize, Owner>,
}

In the above definitions, the id is just a raw pointer to the allocated object as usize. GarbageCollector stores a map where the keys are ids and the values are the Owners.

The keen-eyed among you might have noticed that the value is actually wrapped in a RefCell. This allows interior mutability of the underlying Vec or HashMap, making AOC lang arrays and dictionaries pass-by-reference.

As you can imagine, the implementation of the GarbageCollector::allocate method is pretty simple:

  1. Create a new Rc with Rc::new.
  2. Get the id with Rc::as_ptr.
  3. Store the created Rc as Owner.
  4. Create and return a Ref.
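
Putting those steps together, a sketch of allocate could look like this, reusing the Ref, Owner, and GarbageCollector definitions above (the real method may differ in the details):

use std::cell::RefCell;
use std::rc::Rc;

impl GarbageCollector {
    fn allocate<T: 'static>(&mut self, value: T) -> Ref<T> {
        // 1. Create a new Rc, wrapping the value in a RefCell for
        //    interior mutability.
        let rc = Rc::new(RefCell::new(value));
        // 2. Use the raw pointer address as the object's id.
        let id = Rc::as_ptr(&rc) as usize;
        // 3. Store the Rc as the Owner, keyed by id.
        let weak = Rc::downgrade(&rc);
        self.owners.insert(id, Owner { value: rc });
        // 4. Return a Ref that holds only a Weak pointer and the id.
        Ref { value: weak, id }
    }
}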

We can now allocate objects that can be deallocated. We just have to deallocate them!

Deallocating objects is not the hard part; we just have to delete the owner from the owners hash map. The hard part is deciding which owners to delete. There are many different garbage collection approaches; I went with mark-and-sweep, which is made up of two steps:

  1. Mark: find all the objects that are still reachable and mark them as live.
  2. Sweep: deallocate every object that was not marked.

It is easier to decide which objects should be kept. Objects that should be kept are the ones that we can get to from the objects currently on the VM stack or stored as globals. We can define a method traverse which marks the objects that should be kept. We call it on all the objects currently on the VM stack and on all the globals. If the method hits an array, dictionary, or closure, it is recursively called on all the elements, values, or captured objects. After the objects that should be kept are marked, we can just deallocate everything else.
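
A sketch of the sweep step, assuming the traverse pass has already collected the ids of the reachable objects into a set (the names here are illustrative):

use std::collections::HashSet;

impl GarbageCollector {
    // Keep only the owners whose ids were marked as reachable by the
    // traverse pass; dropping the rest deallocates those objects.
    fn sweep(&mut self, reachable: &HashSet<usize>) {
        self.owners.retain(|id, _owner| reachable.contains(id));
    }
}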

The final step is to actually call the free method. But when should we call it? We could do it after each instruction gets executed by the VM, but that would be a lot of traversing work for not that much deallocation work. Instead, I decided that the free method gets called if there are more than MAX_OBJECTS allocated, where MAX_OBJECTS is an arbitrary number. As of the time of writing, that number is 1024. For solving the Advent of Code, I probably won’t need that many arrays, dictionaries, or closures allocated, so this effectively means that free is called every now and again, after enough allocations are made. If a program allocates more than MAX_OBJECTS objects, free gets called after each instruction is executed, which is also fine for now.
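
The trigger itself can be as simple as a length check; the method name here is illustrative, while MAX_OBJECTS and the 1024 limit come from the text above:

const MAX_OBJECTS: usize = 1024;

impl GarbageCollector {
    // Collect only once enough managed objects have piled up.
    fn should_collect(&self) -> bool {
        self.owners.len() > MAX_OBJECTS
    }
}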

With all of that effort, we don’t have to worry about memory leaks anymore!

Conclusion

With all this effort, the interpreter is finally implemented! There is a lot of room to optimize the parser, compiler, and virtual machine to make the interpreter faster, but I will stop here.

Next on the menu is syntax highlighting, so stay tuned for that! Again, all the code is available on my GitHub if you are interested in this sort of stuff.


  1. I defined a value to be truthy if it’s not false or null.