A Simple Lisp Parser in Python
Posted Mon, Nov 23, 2009 in:A managed to sneak in a little time recently and write a simple Lisp s-expression parser in Python.
I started out with a version that just understands parentheses, atoms, and strings, but it was just a couple of extra lines to add in support for the quote “operator”, that I went ahead and threw that in there, too.
I’ll go ahead and go through it bit-by-bit here, though I think it’s pretty straightforward.
from string import whitespace
atom_end = set('()"\'') | set(whitespace)
When any of these things are encountered while reading an atom, the atom should be ended. This means no need for whitespace betwen an atom and opening or closing parentheses or quotes.
def parse(sexp):
stack, i, length = [[]], 0, len(sexp)
while i < length:
Rather than write this recursively, I am using a list as a stack and a simple while-loop to go through the contents of the string. I start with an empty list on the stack, which will hold the entire parsed sexp when the parsing is done.
c = sexp[i]
reading = type(stack[-1])
Each character is put into c and the type of the current item being read is checked. Rather than using some variable to keep track of the current state of the parser, I just used a distinct Python type for each of the possible lisp types to be read: a list for a list, a string for a string, and a string inside a tuple for an atom. Using this system, I just have to check what type the top element in the stack is, and I’ll know what I’m currently reading.
if reading == list:
if c == '(': stack.append([])
elif c == ')':
stack[-2].append(stack.pop())
if stack[-1][0] == ('quote',): stack[-2].append(stack.pop())
elif c == '"': stack.append('')
elif c == "'": stack.append([('quote',)])
elif c in whitespace: pass
else: stack.append((c,))
When reading a list:
- If an opening parenthesis is encountered, add a new list to the stack
- If a closing parenthesis is encountered, take the list from the top of the stack and append it to the next list down in the stack. Check if the last thing read was a quote, and if so, add this list as the parameter to the quote call.
- If an opening double-quote is encountered, add an empty string to the stack.
- If a single-quote is found, add a list with a quote symbol at the beginning of it.
- If c is whitespace, ignore it.
- Otherwise, start an atom with whatever other character was found.
elif reading == str:
if c == '"':
stack[-2].append(stack.pop())
if stack[-1][0] == ('quote',): stack[-2].append(stack.pop())
elif c == '\\':
i += 1
stack[-1] += sexp[i]
else: stack[-1] += c
When reading a string:
- If a closing double-quote is found, end the string, and check for a previous quote call like above (it doesn’t really make sense to quote a string, but this isn’t really the place to check for that, is it?)
- If a backslash is found, put whatever the following character is in the string. This is a simplistic way to at least allow double-quotes to be escaped, but I don’t have anything allowing for more complex escape-strings in here yet.
- Anything else, add to the string.
elif reading == tuple:
if c in atom_end:
atom = stack.pop()
if atom[0][0].isdigit(): stack[-1].append(eval(atom[0]))
else: stack[-1].append(atom)
if stack[-1][0] == ('quote',): stack[-2].append(stack.pop())
continue
else: stack[-1] = ((stack[-1][0] + c),)
When reading an atom:
- If the end of the atom has been reached, check if the atom is a number (this code just assumes that if the first character of the atom is numeric, that it’s a number). If it is, I cheat and use eval to get its value, if it’s not, I leave the atom as a tuple-wrapped string.
- Anything else, add as a character to the atom.
i += 1
return stack.pop()
At the end here, I just take whatever’s on top of the stack (which should be the only thing on the stack if a proper s-expression was passed in), and return that.
I’m going to continue working on and improving this parser, for use in another project of mine, so I will be adding more functionality.
I was thinking of doing a s-expression to Python bytecode compiler, but someone already wrote one, and they even took the name I was going to use: sexpy.
I think the first thing I’ll be adding is the capability for the parser to yield sub-sexps so the entire thing doesn’t need to be parsed in one go before returning anything. Another thing I’ll most likely add is some checking for invalid sexps, something I left out for clarity in my initial version.
For those who are interested, here’s the complete version of sexp.py