I’ve been wanting to use recursive descent parser for a while now.
And recently I decided to implement a small project, minimising dependencies usage. You guessed it, I implemented recursive descent parser in there. Several watched videos and read articles left me without understanding of the algorithm. The explanations included either pure math theory or unrealistic examples. So here I am writing about recursive descent, assuming I didn’t mixed up anything and my implementation is correct.
Task: parse bencode.
Bencoding
Bencode has only four types (spaces are for convenience only):
- integer
i42e - string
5:hello - list
l i1e e - dictionary
d 5:hello 5:world e
Initially, I was trying to write a parser using apparent rules, that is look at the examples and write some code how I think it should be parsed. I had to use stack to temporary save values because list and dictionary can be recursive. For example, list can contain elements of any type, so you have to push list onto the stack when recursing into nested list. In my case, I also needed to save token positions. The worst part is mutating stack items, meaning the stack must contain pointers which hurts performance, that led to bugs and several hours of debugging.
Grammar
From now on, the very first thing I do when writing a parser will be EBNF. Abiding syntax rules of the notation is not that important, especially when you are not going to autogenerate parser’s code from them. I’ve replaced some terms with more widely known: *, [0-9], + regular expressions.
any = integer | string | list | dictionary
integer = i [0-9]+ e
string = [0-9]+ : .*
list = l <any>* e
dictionary = d (<string> <any>)* e
Spaces in this notation server only as visual help. Any bencode type can be represented as sum-type any. You should read | as or.
Integers must conform to the structure: literal i; then one digit or more; ending with e. I think the usage of regular exprerssion syntax in the grammar simplifies things a lot. Indeed, many people are already familiar with regexp!
A string can be empty, therefore the usage of *.
A list can contain any type zero or more times, but dictionary keys must always be string.
Grammar helps writing code
Recursive descent suggests defining a function per type in the grammar. Function’s concern is only its type, delegating work by recursively calling other functions. So, to parse any you would write
type Parser struct {
input []byte
offset int
}
func NewParser(input []byte) *Parser {
return &Parser{
input: input,
offset: 0,
}
}
func (p *Parser) ParseAny() (any, error) {
if p.offset >= len(p.input) {
return nil, io.ErrUnexpectedEOF
}
integer, err := p.parseInteger()
if err == nil {
return integer, nil
}
string, err := p.parseString()
if err == nil {
return string, nil
}
list, err := p.parseList()
if err == nil {
return list, nil
}
dictionary, err := p.parseDictionary()
if err == nil {
return dictionary, nil
}
return nil, fmt.Errorf("could not parse any value: %s", p.input[p.offset:])
}
Warning
The very first thing I see in the code is that it tries to bruteforce parse four times, and only then report an error. Let’s discuss optimizations later.
Snippet above is written according to the rule any, although achieving it by pure bruteforce way it not optimal. At this point you can see how recursive parsing automatically handles opening and closing tags. Recursive algoritmhs are elegant!
Of course now you need to parse each individual bencode type. You can even use regular expressions to parse them, but if you’ve come this far let’s switch to fully manual mode.
func (p *Parser) parseInteger() (int64, error) {
if p.input[p.offset] != 'i' {
return -1, fmt.Errorf("expected 'i' got %v", p.input[p.offset])
}
// ...
}
Checking integer tag is the very first thing this code should do, that is just follow the grammar rules. The dynamic error can be replaced with typed one to allow ParseAny differentiate between when tag does not match with current type and when something went really wrong.
func (p *Parser) parseInteger() (int64, error) {
// ...
integerStr := p.input[p.offset+1:][:closingIndex]
integer, err := strconv.ParseInt(string(integerStr), 10, 64)
if err != nil {
return -1, fmt.Errorf("parse integer: %w", err)
}
p.offset += len(integerStr) + 2
return integer, nil
}
Nothing extraordinary here. Notice the pattern p.input[p.offset+1:][:closingIndex]. It prevents duplication of p.offset+1 when slicing closingIndex, that is slicing the result of the first slicing operation. One more important, p.offset is not mutated until we absolutele know no errors can occur.
func (p *Parser) parseString() (string, error) {
marker := p.input[p.offset]
if marker < '0' || marker > '9' {
return "", fmt.Errorf("expected ascii digit but got %v", marker)
}
lenEndIndex := bytes.IndexByte(p.input[p.offset+1:], ':')
if lenEndIndex < 0 {
return "", errors.New("could not find end of a string length")
}
stringLengthStr := p.input[p.offset:][:lenEndIndex+1]
length, err := strconv.ParseInt(string(stringLengthStr), 10, 64)
if err != nil {
return "", fmt.Errorf("parse string length: %w", err)
}
str := p.input[p.offset+len(stringLengthStr)+1:][:length]
p.offset += len(stringLengthStr) + 1 + int(length)
return string(str), nil
}
Again, parsing method returns concrete type and error. String tag validation is a bit more involved, but not too complicated. As long as the first byte looks like a digit – search for a colon, which serves as delimeter between string length and its value. Be wary! Offset calculations must be done with care.
func (p *Parser) parseList() ([]any, error) {
if p.input[p.offset] != 'l' {
return nil, fmt.Errorf("expected 'l' but got %v", p.input[p.offset])
}
p.offset += 1
// ...
}
Increasing difficulty. Now we need to parse list type. Copy the tag check, but after that ParseAny won’t work without offset modification. ParseAny checks what is currently on p.offset position, that’s why you need to move one past the tag byte. Now offset points to first element of the slice or end of the slice.
func (p *Parser) parseList() ([]any, error) {
// ...
var list []any
for {
item, err := p.ParseAny()
if err != nil {
// TODO
}
list = append(list, item)
}
p.offset += 1
return list, nil
}
More code to come. Since list length is not stored anywhere, we use infinite loop. List contains elements of type any, so call p.ParseAny() in the loop. But how do I know when to stop the loop, you may ask?
func (p *Parser) parseList() ([]any, error) {
// ...
loop:
for {
item, err := p.ParseAny()
if err != nil {
switch {
case p.input[p.offset] == 'e':
break loop
default:
return nil, err
}
}
// ...
}
// ...
}
One solution to that is to check current p.offset, the end tag means parsing is done. But the better solution would be a typed error type TokenErr struct{Token byte}, then you can use it in error matching using errors.As.
Dictionary parsing is really similar: check tag; parse string key in infinite loop; parse value. The difference lies in dictionary value parsing. Key parsing can break loop, but value parsing must always return error.
Fi
any = (i <integer>) | ([0-9] <string>) | (l <list>) | (d <dictionary>)
integer = [0-9]+ e
string = [0-9]+ : .* // або [0-9]* : .*
list = <any>* e
dictionary = (<string> <any>)* e
What about these rules? Now any knows every type’s tag, so you can stop bruteforcing in ParseAny. Though you must be clever with implementation. Should any consume the first byte? If no, all methods must increase p.offset += 1 at the very beginning, but p.parseString shouldn’t.
Повний код
package recdes
import (
"bytes"
"errors"
"fmt"
"io"
"strconv"
)
type Parser struct {
input []byte
offset int
}
func NewParser(input []byte) *Parser {
return &Parser{
input: input,
offset: 0,
}
}
func (p *Parser) ParseAny() (any, error) {
if p.offset >= len(p.input) {
return nil, io.ErrUnexpectedEOF
}
integer, err := p.parseInteger()
if err == nil {
return integer, nil
}
string, err := p.parseString()
if err == nil {
return string, nil
}
list, err := p.parseList()
if err == nil {
return list, nil
}
dictionary, err := p.parseDictionary()
if err == nil {
return dictionary, nil
}
return nil, fmt.Errorf("could not parse any value: %s", p.input[p.offset:])
}
func (p *Parser) parseInteger() (int64, error) {
if p.input[p.offset] != 'i' {
return -1, fmt.Errorf("expected 'i' got %v", p.input[p.offset])
}
closingIndex := bytes.IndexByte(p.input[p.offset+1:], 'e')
if closingIndex < 0 {
return -1, errors.New("could not find end of an integer")
}
integerStr := p.input[p.offset+1:][:closingIndex]
integer, err := strconv.ParseInt(string(integerStr), 10, 64)
if err != nil {
return -1, fmt.Errorf("parse integer: %w", err)
}
p.offset += len(integerStr) + 2
return integer, nil
}
func (p *Parser) parseString() (string, error) {
marker := p.input[p.offset]
if marker < '0' || marker > '9' {
return "", fmt.Errorf("expected ascii digit but got %v", marker)
}
lenEndIndex := bytes.IndexByte(p.input[p.offset+1:], ':')
if lenEndIndex < 0 {
return "", errors.New("could not find end of a string length")
}
stringLengthStr := p.input[p.offset:][:lenEndIndex+1]
length, err := strconv.ParseInt(string(stringLengthStr), 10, 64)
if err != nil {
return "", fmt.Errorf("parse string length: %w", err)
}
str := p.input[p.offset+len(stringLengthStr)+1:][:length]
p.offset += len(stringLengthStr) + 1 + int(length)
return string(str), nil
}
func (p *Parser) parseList() ([]any, error) {
if p.input[p.offset] != 'l' {
return nil, fmt.Errorf("expected 'l' but got %v", p.input[p.offset])
}
p.offset += 1
var list []any
loop:
for {
item, err := p.ParseAny()
if err != nil {
switch {
case p.input[p.offset] == 'e':
break loop
default:
return nil, err
}
}
list = append(list, item)
}
p.offset += 1
return list, nil
}
func (p *Parser) parseDictionary() (map[string]any, error) {
if p.input[p.offset] != 'd' {
return nil, fmt.Errorf("expected 'd' but got %v", p.input[p.offset])
}
p.offset += 1
dict := make(map[string]any)
loop:
for {
key, err := p.parseString()
if err != nil {
switch {
case p.input[p.offset] == 'e':
break loop
default:
return nil, err
}
}
value, err := p.ParseAny()
if err != nil {
return nil, err
}
dict[key] = value
}
if p.input[p.offset] != 'e' {
return nil, fmt.Errorf("expected 'e' but got %v", p.input[p.offset])
}
p.offset += 1
return dict, nil
}