Pratt Parsing: Introduction and Implementation in TypeScript

Introduction

When hand-writing recursive descent parsers, one often encounters the issue of left recursion. The conventional approach is to rewrite the grammar to eliminate left recursion, but this requires introducing additional grammar rules, making the code less maintainable. Additionally, operators of different precedence and associativity pose further challenges. These reasons make the implementation of traditional recursive descent parsers more difficult.

In this article, we will introduce the Pratt Parsing algorithm proposed by Vaughan Pratt, which can address these problems.

Basic Concepts

First, let’s introduce some basic concepts in parsing.

  • Infix Operators: For example, the positive and negative signs + -. The characteristic of infix operators is that they appear between operands.

  • Prefix Operators: For example, addition, subtraction, multiplication, and division + - * /. The characteristic of prefix operators is that they appear before operands.

  • Postfix Operators: For example, factorial !. The characteristic of postfix operators is that they appear after operands.

In this article, we define:

1type InfixOp = '+' | '-' | '*' | '/' | '^'
2type PrefixOp = '-' | '+'
3type PostfixOp = '!'

AST (Abstract Syntax Tree) is a tree-like structure used to represent the syntactic structure of a program. For an expression, we define the following AST structure:

 1class ExprNode {
 2    toString(): string {
 3        throw new Error('not implemented')
 4    }
 5}
 6
 7class ValueNode extends ExprNode {
 8    constructor(public val: number) {
 9        super()
10    }
11    toString() {
12        return this.val.toString()
13    }
14}
15
16class PrefixOpNode extends ExprNode {
17    constructor(public op: PrefixOp, public rhs: ExprNode) {
18        super()
19    }
20    toString() {
21        return `(${this.op}${this.rhs})`
22    }
23}
24class InfixOpNode extends ExprNode {
25    constructor(public op: InfixOp, public lhs: ExprNode, public rhs: ExprNode) {
26        super()
27    }
28    toString() {
29        return `(${this.lhs}${this.op}${this.rhs})`
30    }
31}
32class PostfixOpNode extends ExprNode {
33    constructor(public op: PostfixOp, public lhs: ExprNode) {
34        super()
35    }
36    toString() {
37        return `(${this.lhs}${this.op})`
38    }
39}

The tokenizer is responsible for splitting the input string into a sequence of tokens. It typically provides an interface in the form of an iterator. Tokenization is relatively simple, so we won’t go into detail here. Let’s assume that we already have a stream of tokens.

In this example, we define the following token types:

1
2enum TokenType { None, Num, Add, Sub, Mul, Div, Fac, Pow, LPa, RPa, Eoi }
3
4class Token {
5    constructor(public type: TokenType = TokenType.None, public value: string = '') { }
6}
  • TokenType.None: Empty token used as a placeholder.

  • TokenType.Num: Number.

  • TokenType.Add, TokenType.Sub, TokenType.Mul, TokenType.Div, TokenType.Pow: Addition, subtraction, multiplication, division, exponentiation.

  • TokenType.Fac: Factorial.

  • TokenType.LPa, TokenType.RPa: Left and right parentheses.

  • TokenType.Eoi: End of Input, indicating the end of the input.

Defining precedence:

 1    precOf(token: string): number {
 2        const precs = {
 3            '+': 10,
 4            '-': 10,
 5            '*': 20,
 6            '/': 20,
 7            '^': 30,
 8            '!': 40,
 9        } as Record<string, number>
10
11        return precs[token] || 0
12    }

Define an iterator:

 1class Iterator<T> {
 2    private pos: number
 3    constructor(private inner: Array<T> = []) {
 4        this.pos = 0
 5    }
 6    next(): T | undefined {
 7        var peek = this.peek()
 8        this.pos++
 9        return peek
10    }
11    peek(): T | undefined {
12        return this.inner[this.pos] || undefined
13    }
14}

We will wrap a token array with an iterator.

  • The next() method returns the current token and moves the iterator pointer to the next token. This operation is also known as consume or eat.

  • The peek() method returns the current token without changing the iterator pointer.

Our token stream is defined as follows:

 1let tokens = new Iterator<Token>([
 2    // - 1 + (2 - 3) * 6 / 3 ! - 2 ^ 3 ^ 4
 3    new Token(TokenType.Sub, '-'),
 4    new Token(TokenType.Num, '1'),
 5    new Token(TokenType.Add, '+'),
 6    new Token(TokenType.LPa, '('),
 7    new Token(TokenType.Num, '2'),
 8    new Token(TokenType.Sub, '-'),
 9    new Token(TokenType.Num, '3'),
10    new Token(TokenType.RPa, ')'),
11    new Token(TokenType.Mul, '*'),
12    new Token(TokenType.Num, '6'),
13    new Token(TokenType.Div, '/'),
14    new Token(TokenType.Num, '3'),
15    new Token(TokenType.Fac, '!'),
16    new Token(TokenType.Sub, '-'),
17    new Token(TokenType.Num, '2'),
18    new Token(TokenType.Pow, '^'),
19    new Token(TokenType.Num, '3'),
20    new Token(TokenType.Pow, '^'),
21    new Token(TokenType.Num, '4'),
22    new Token(TokenType.Eoi),
23])

Pratt Parser

Pratt Parser is a top-down parser that works based on the following principles:

  1. Divide Tokens into two categories: prefix operators and infix operators. Assume that prefix operators have the highest precedence, followed by infix operators in descending order. Note: For postfix operators, we treat them as a special form of infix operators. (The specific reason will be explained later.) Therefore, we can categorize them into three classes without any problem.

  2. Each Token is associated with a parsing function, which is responsible for parsing expressions starting with that Token.

    • For example, for Num, we parse and obtain a ValueNode.

    • For example, for Add, we parse and obtain an InfixOpNode.

    Since we divide Tokens into three classes, each class has its corresponding parser, specifically:

    1. PrefixParser: Parser for prefix Tokens.

    2. InfixParser: Parser for infix Tokens.

    3. PostfixParser: Parser for postfix Tokens. (Can be seen as a special form of InfixParser)

  3. Each Token is associated with a precedence, which determines the parsing order. In the Parse function, we actually have two precedence levels: the context precedence (which may not be the precedence of the current Token) derived from the arguments of the Parse function, and the precedence of the next Token obtained through peeking.

  4. We parse the prefix expression first. Here comes the key part. If the precedence of the next Token is greater than the context precedence, we consider it as part of the infix/postfix expression and continue parsing. Otherwise, we return the prefix expression.

Let’s take an example to understand the core code below:

 1    parse(prec: number = 0): ExprNode {
 2        let token = this.tokens.next()!
 3        let prefixParser: PrefixFn = this.parsers.prefix[token.type]
 4        if (!prefixParser) {
 5            throw new Error(`Unexpected prefix token ${token.type}`)
 6        }
 7        let lhs: ExprNode = prefixParser(token)
 8        let precRight = this.precOf(this.tokens.peek()!.value)
 9
10
11        while (prec < precRight) {
12            token = this.tokens.next()!
13            let infixParser: InfixFn | PostfixFn = this.parsers.infix[token.type] || this.parsers.postfix[token.type]
14            if (!infixParser) {
15                throw new Error(`Unexpected infix or postfix token ${token.value}`)
16            }
17            lhs = infixParser(lhs, token)
18            precRight = this.precOf(this.tokens.peek()!.value)
19        }
20
21        return lhs
22    }

Initial state,

  • The current context precedence is 0, and the following token stream is + 1 - 2 * 3. Using the prefix + parser, we first get (+1). Then, since the precedence of - is the same, we directly return the LHS, which is + 1 as the AST node.

  • At this point, the context precedence is 0, and the remaining token stream is - 2 * 3. We see that the precedence of - is 10, which is higher than 0. Therefore, we consume - and call the infix - parser.

    • Now, the context precedence is 10, and the remaining token stream is 2 * 3. We first treat 2 as a prefix and call its prefix parser, which gives us a ValueNode(2).

      • Then, we peek at * and find that the precedence of * is 20, which is higher than 10. So we consume * and call the infix * parser.
        • The infix parser calls Parse.

          • Now, the context precedence is 20, and the remaining token stream is 3. We consume it as a prefix. Then, we find that the precedence of Eoi is 0, which is lower than 20. So we directly return the AST node for 3.
        • The infix parser obtains 3 as the RHS. Combining it with the LHS, which is 2 obtained as the actual argument, we get an InfixOpNode(2 * 3).

    • Then we return from the infix parser for -, and with LHS=(+1) and RHS=(2 * 3), we get an InfixOpNode((+1) - (2 * 3)).

  • Finally, since there are no more tokens, we return directly.

Our Prefix/Infix/PostfixParser definitions are as follows:

 1    public parsers = {
 2        prefix: {
 3            [TokenType.Num]: (token: Token) => {
 4                return new ValueNode(parseInt(token.value))
 5            },
 6            [TokenType.Add]: (token: Token) => {
 7                return new PrefixOpNode('+', this.parse(this.precOf('+')))
 8            },
 9            [TokenType.Sub]: (token: Token) => {
10                return new PrefixOpNode('-', this.parse(this.precOf('-')))
11            },
12            [TokenType.LPa]: (token: Token) => {
13                const expr = this.parse(0)
14                const next = this.tokens.next()!
15                if (next.type !== TokenType.RPa) {
16                    throw new Error('Expected )')
17                }
18                return expr
19            },
20        } as { [k: number]: PrefixFn },
21        infix: {
22            [TokenType.Add]: (lhs: ExprNode, token: Token) => {
23                return new InfixOpNode('+', lhs, this.parse(this.precOf('+')))
24            },
25            [TokenType.Sub]: (lhs: ExprNode, token: Token) => {
26                return new InfixOpNode('-', lhs, this.parse(this.precOf('-')))
27            },
28            [TokenType.Mul]: (lhs: ExprNode, token: Token) => {
29                return new InfixOpNode('*', lhs, this.parse(this.precOf('*')))
30            },
31            [TokenType.Div]: (lhs: ExprNode, token: Token) => {
32                return new InfixOpNode('/', lhs, this.parse(this.precOf('/')))
33            },
34            [TokenType.Pow]: (lhs: ExprNode, token: Token) => {
35                return new InfixOpNode('^', lhs, this.parse(this.precOf('^') - 1)) // Right associative
36            },
37        } as { [k: number]: InfixFn },
38        postfix: {
39            [TokenType.Fac]: (lhs: ExprNode, token: Token) => {
40                return new PostfixOpNode('!', lhs)
41            }
42        } as { [k: number]: PostfixFn }
43    }

Step-by-Step Analysis

The above example was just for quick understanding. In reality, we will encounter other problems as well:

  1. Cases with suffixes

  2. Cases with equal precedence

  3. Cases with left associativity and right associativity

So let’s provide a more comprehensive example.

1- 1 + (2 - 3) * 6 / 3 ! - 2 ^ 3 ^ 4

First, we read the Token -, which is a prefix operator. Therefore, we find its PrefixParser.

1- 1 + (2 - 3) * 6 / 3 ! - 2 ^ 3 ^ 4
2^
1[TokenType.Sub]: (token: Token) => {
2    return new PrefixOpNode('-', this.parse(this.precOf('-')))
3},

We continue parsing based on the precedence of -. That is, this.parse(this.precOf('-')).

So we read the Token 1, which is a number. Therefore, we find its PrefixParser.

1- 1 + (2 - 3) * 6 / 3 ! - 2 ^ 3 ^ 4
2  ^
1[TokenType.Num]: (token: Token) => {
2    return new ValueNode(parseInt(token.value))
3},

We get a ValueNode(1), and then continue parsing. We peek at the next Token, which is +. Since the precedence of - is higher than +, we continue parsing.

1- 1 + (2 - 3) * 6 / 3 ! - 2 ^ 3 ^ 4
2    ^

Then we encounter Token ( on the right. It has no precedence (or we can say its precedence is 0). So we don’t enter the while loop, and instead return. After returning two levels, we get PrefixOpNode('-', ValueNode(1)).

At this point, the current precedence is 0, and the current Token is -. We can peek at Token +. Since the precedence of + is higher than 0, we enter the while loop. Then we get the InfixParser for +.

1- 1 + (2 - 3) * 6 / 3 ! - 2 ^ 3 ^ 4
2^   ^
3    peek

Therefore, we continue parsing based on the precedence of +. That is, this.parse(this.precOf('+')).

At this point, the prefix we read outside the while loop is (. So we use the PrefixParser for (.

1[TokenType.LPa]: (token: Token) => {
2    const expr = this.parse(0)
3    const next = this.tokens.next()!
4    if (next.type !== TokenType.RPa) {
5        throw new Error('Expected )')
6    }
7    return expr
8},

In the (, we actually treat the expression inside the parentheses as a completely new expression. As you can see, the precedence is reset to zero (line 2 of the code above).

We will parse the internal expression and get the sub-expression InfixOpNode('-', ValueNode(2), ValueNode(3)).

Now, how is the right parenthesis handled? In the parse function, since the precedence of the right parenthesis is 0, it will be directly returned. Therefore, in the PrefixParser for LPa, we will consume the right parenthesis. And then continue parsing.

Thus, the precedence is restored to the precedence of + as the function returns. And we continue parsing.

1- 1 + (2 - 3) * 6 / 3 ! - 2 ^ 3 ^ 4
2              ^

Since the precedence of * is higher than +, we will call the InfixParser for *. And thus, we parse the right sub-expression 6 / 3!.

It is worth mentioning that the InfixParser and PostfixParser are actually equivalent. Because Postfix is just a special case of Infix. You can compare the following two sub-parsers:

1[TokenType.Mul]: (lhs: ExprNode, token: Token) => {
2    return new InfixOpNode('*', lhs, this.parse(this.precOf('*')))
3},
4[TokenType.Fac]: (lhs: ExprNode, token: Token) => {
5    return new PostfixOpNode('!', lhs)
6}

It can be observed that their implementations are exactly the same. The only difference is that the InfixParser will continue parsing the right subexpression, while the PostfixParser will not.

Lastly, let’s take a look at how associativity is handled. Since operators like + and * are left-associative, they don’t require any special handling. However, the ^ operator is right-associative. For example, in the expression 2 ^ 3 ^ 4, it should be parsed as 2 ^ (3 ^ 4). Therefore, we need to handle this special case in the InfixParser.

1[TokenType.Pow]: (lhs: ExprNode, token: Token) => {
2    return new InfixOpNode('^', lhs, this.parse(this.precOf('^') - 1)) // Right associativity
3},

Why do we subtract one from the precedence? This is done to make the left ^ operator have a higher precedence, so that the right 3 ^ 4 will be parsed as a subexpression.

In the end, we obtain the following AST:

 1InfixOpNode {
 2  op: '-',
 3  lhs: InfixOpNode {
 4    op: '+',
 5    lhs: PrefixOpNode { op: '-', rhs: ValueNode { val: 1 } },
 6    rhs: InfixOpNode {
 7      op: '/',
 8      lhs: InfixOpNode {
 9        op: '*',
10        lhs: InfixOpNode {
11          op: '-',
12          lhs: ValueNode { val: 2 },
13          rhs: ValueNode { val: 3 }
14        },
15        rhs: ValueNode { val: 6 }
16      },
17      rhs: PostfixOpNode { op: '!', lhs: ValueNode { val: 3 } }
18    }
19  },
20  rhs: InfixOpNode {
21    op: '^',
22    lhs: ValueNode { val: 2 },
23    rhs: InfixOpNode {
24      op: '^',
25      lhs: ValueNode { val: 3 },
26      rhs: ValueNode { val: 4 }
27    }
28  }
29}

The corresponding parenthesized expression is:

(((-1) + (((2 - 3) * 6) / (3!))) - (2 ^ (3 ^ 4)))

Full Code

Here is the complete code:

  1/**
  2 * Author: Zijing Zhang (Pluveto)
  3 * Date: 2023-01-11 15:28:00
  4 * Description: A Pratt Parser implemented in TypeScript.
  5 */
  6const util = require('util')
  7
  8type InfixOp = '+' | '-' | '*' | '/' | '^'
  9type PrefixOp = '-' | '+'
 10type PostfixOp = '!'
 11
 12class ExprNode {
 13    toString(): string {
 14        throw new Error('not implemented')
 15    }
 16}
 17
 18class ValueNode extends ExprNode {
 19    constructor(public val: number) {
 20        super()
 21    }
 22    toString() {
 23        return this.val.toString()
 24    }
 25}
 26
 27class PrefixOpNode extends ExprNode {
 28    constructor(public op: PrefixOp, public rhs: ExprNode) {
 29        super()
 30    }
 31    toString() {
 32        return `(${this.op}${this.rhs})`
 33    }
 34}
 35class InfixOpNode extends ExprNode {
 36    constructor(public op: InfixOp, public lhs: ExprNode, public rhs: ExprNode) {
 37        super()
 38    }
 39    toString() {
 40        return `(${this.lhs}${this.op}${this.rhs})`
 41    }
 42}
 43class PostfixOpNode extends ExprNode {
 44    constructor(public op: PostfixOp, public lhs: ExprNode) {
 45        super()
 46    }
 47    toString() {
 48        return `(${this.lhs}${this.op})`
 49    }
 50}
 51
 52class Iterator<T> {
 53    private pos: number
 54    constructor(private inner: Array<T> = []) {
 55        this.pos = 0
 56    }
 57    next(): T | undefined {
 58        var peek = this.peek()
 59        this.pos++
 60        return peek
 61    }
 62    peek(): T | undefined {
 63        return this.inner[this.pos] || undefined
 64    }
 65}
 66
 67
 68enum TokenType { None, Num, Add, Sub, Mul, Div, Fac, Pow, LPa, RPa, Eoi }
 69
 70class Token {
 71    constructor(public type: TokenType = TokenType.None, public value: string = '') { }
 72}
 73
 74type PrefixFn = (token: Token) => ExprNode
 75type InfixFn = (lhs: ExprNode, token: Token) => ExprNode
 76type PostfixFn = (lhs: ExprNode, token: Token) => ExprNode
 77
 78class Parser {
 79    constructor(public tokens: Iterator<Token>) { }
 80    public parsers = {
 81        prefix: {
 82            [TokenType.Num]: (token: Token) => {
 83                return new ValueNode(parseInt(token.value))
 84            },
 85            [TokenType.Add]: (token: Token) => {
 86                return new PrefixOpNode('+', this.parse(this.precOf('+')))
 87            },
 88            [TokenType.Sub]: (token: Token) => {
 89                return new PrefixOpNode('-', this.parse(this.precOf('-')))
 90            },
 91            [TokenType.LPa]: (token: Token) => {
 92                const expr = this.parse(0)
 93                const next = this.tokens.next()!
 94                if (next.type !== TokenType.RPa) {
 95                    throw new Error('Expected )')
 96                }
 97                return expr
 98            },
 99        } as { [k: number]: PrefixFn },
100        infix: {
101            [TokenType.Add]: (lhs: ExprNode, token: Token) => {
102                return new InfixOpNode('+', lhs, this.parse(this.precOf('+')))
103            },
104            [TokenType.Sub]: (lhs: ExprNode, token: Token) => {
105                return new InfixOpNode('-', lhs, this.parse(this.precOf('-')))
106            },
107            [TokenType.Mul]: (lhs: ExprNode, token: Token) => {
108                return new InfixOpNode('*', lhs, this.parse(this.precOf('*')))
109            },
110            [TokenType.Div]: (lhs: ExprNode, token: Token) => {
111                return new InfixOpNode('/', lhs, this.parse(this.precOf('/')))
112            },
113            [TokenType.Pow]: (lhs: ExprNode, token: Token) => {
114                return new InfixOpNode('^', lhs, this.parse(this.precOf('^') - 1)) // Right associative
115            },
116        } as { [k: number]: InfixFn },
117        postfix: {
118            [TokenType.Fac]: (lhs: ExprNode, token: Token) => {
119                return new PostfixOpNode('!', lhs)
120            }
121        } as { [k: number]: PostfixFn }
122    }
123    precOf(token: string): number {
124        const precs = {
125            '+': 10,
126            '-': 10,
127            '*': 20,
128            '/': 20,
129            '^': 30,
130            '!': 40,
131        } as Record<string, number>
132
133        return precs[token] || 0
134    }
135    parse(prec: number = 0): ExprNode {
136        let token = this.tokens.next()!
137        let prefixParser: PrefixFn = this.parsers.prefix[token.type]
138        if (!prefixParser) {
139            throw new Error(`Unexpected prefix token ${token.type}`)
140        }
141        let lhs: ExprNode = prefixParser(token)
142        let precRight = this.precOf(this.tokens.peek()!.value)
143
144
145        while (prec < precRight) {
146            token = this.tokens.next()!
147            let infixParser: InfixFn | PostfixFn = this.parsers.infix[token.type] || this.parsers.postfix[token.type]
148            if (!infixParser) {
149                throw new Error(`Unexpected infix or postfix token ${token.value}`)
150            }
151            lhs = infixParser(lhs, token)
152            precRight = this.precOf(this.tokens.peek()!.value)
153        }
154
155        return lhs
156    }
157}
158
159
160let tokens = new Iterator<Token>([
161    // - 1 + (2 - 3) * 6 / 3 ! - 2 ^ 3 ^ 4
162    new Token(TokenType.Sub, '-'),
163    new Token(TokenType.Num, '1'),
164    new Token(TokenType.Add, '+'),
165    new Token(TokenType.LPa, '('),
166    new Token(TokenType.Num, '2'),
167    new Token(TokenType.Sub, '-'),
168    new Token(TokenType.Num, '3'),
169    new Token(TokenType.RPa, ')'),
170    new Token(TokenType.Mul, '*'),
171    new Token(TokenType.Num, '6'),
172    new Token(TokenType.Div, '/'),
173    new Token(TokenType.Num, '3'),
174    new Token(TokenType.Fac, '!'),
175    new Token(TokenType.Sub, '-'),
176    new Token(TokenType.Num, '2'),
177    new Token(TokenType.Pow, '^'),
178    new Token(TokenType.Num, '3'),
179    new Token(TokenType.Pow, '^'),
180    new Token(TokenType.Num, '4'),
181    new Token(TokenType.Eoi),
182])
183
184let parser = new Parser(tokens)
185let ast = parser.parse()
186
187console.log(util.inspect(ast, { showHidden: false, depth: null, colors: true }))
188console.log("Equivalent expression: ", ast.toString());

References

https://www.youtube.com/watch?v=Nlqv6NtBXcA&t=1171s


Recent Posts

To view all articles, please click on the 'Archive' button in the top right corner