Blog

Placeholder

Python

Test inline: a=123a = 123

Test inline: ab\frac{a}{b}

Test multi-line:

θt+1θtηmt^vt^+ϵ\theta_{t+1} \gets \theta_{t} - \eta \frac{\hat{m_{t}}}{\sqrt{\hat{v_t}} + \epsilon} f(x)=x+1f(x) = x + 1

Building a Python Interpreter from Scratch

Creating a Python interpreter from scratch is a fascinating project that introduces you to the inner workings of interpreters, compilers, and language design. Though creating a full-fledged interpreter for Python’s entire language is a significant undertaking, this guide will outline the foundational concepts and steps to help you get started with a minimalist Python interpreter.

Prerequisites

Before diving into this project, it’s helpful to have a basic understanding of:

  • Programming Language Theory: Familiarity with concepts like syntax, semantics, parsing, and execution.
  • Python: Some understanding of Python will help, as we’ll be interpreting a simplified version of Python itself.
  • Data Structures: Parsing expressions requires working with data structures like stacks, queues, and trees.

Step 1: Define the Interpreter’s Scope

To avoid overwhelming complexity, define a narrow scope:

  1. Support basic arithmetic expressions (+, -, *, /).
  2. Handle simple variable assignments.
  3. Ignore advanced features like classes, functions, and imports.

This will allow you to focus on core interpreter concepts without getting bogged down by Python’s full feature set.

Step 2: Tokenization (Lexical Analysis)

Tokenization is the process of breaking down the input text into meaningful chunks, called tokens. Each token represents a part of the input, such as a number, operator, or identifier.

Here’s a basic tokenizer in Python:

import re

def tokenize(code):
    token_spec = [
        ('NUMBER', r'\d+(\.\d*)?'),    # Integer or decimal number
        ('ID', r'[A-Za-z_]\w*'),       # Identifiers
        ('OP', r'[+\-*/]'),            # Arithmetic operators
        ('ASSIGN', r'='),              # Assignment operator
        ('WHITESPACE', r'\s+'),        # Whitespace
    ]
    tok_regex = '|'.join(f'(?P<{pair[0]}>{pair[1]})' for pair in token_spec)
    for match in re.finditer(tok_regex, code):
        kind = match.lastgroup
        value = match.group()
        if kind != 'WHITESPACE':  # Skip whitespace tokens
            yield kind, value

Example:

list(tokenize("x = 10 + 2 * 3"))
# Output: [('ID', 'x'), ('ASSIGN', '='), ('NUMBER', '10'), ('OP', '+'), ('NUMBER', '2'), ('OP', '*'), ('NUMBER', '3')]

Step 3: Parsing (Syntax Analysis)

Parsing turns tokens into a structure the interpreter can understand. A common choice is to use Abstract Syntax Trees (ASTs), where each node represents an operation or value in the code.

For a simple expression like x = 10 + 2 * 3, the AST might look like:

    =
   / \
  x   +
     / \
   10   *
       / \
      2   3

To build an AST, use the Shunting Yard algorithm for parsing expressions. Here’s a simple recursive parser for arithmetic expressions:

class ASTNode:
    def __init__(self, type, value=None, left=None, right=None):
        self.type = type
        self.value = value
        self.left = left
        self.right = right

def parse_expression(tokens):
    tokens = list(tokens)
    def parse_term():
        token = tokens.pop(0)
        if token[0] == 'NUMBER':
            return ASTNode('NUMBER', value=float(token[1]))
        elif token[0] == 'ID':
            return ASTNode('ID', value=token[1])
    def parse():
        node = parse_term()
        while tokens and tokens[0][0] == 'OP':
            op = tokens.pop(0)
            right = parse_term()
            node = ASTNode('OP', value=op[1], left=node, right=right)
        return node
    return parse()

Step 4: Interpreting the AST (Execution)

The interpreter traverses the AST and evaluates it. Each node type (e.g., assignment, operation, number) has its own evaluation logic.

Here’s a simple interpreter function:

def evaluate(node, context={}):
    if node.type == 'NUMBER':
        return node.value
    elif node.type == 'ID':
        return context.get(node.value, 0)
    elif node.type == 'ASSIGN':
        context[node.left.value] = evaluate(node.right, context)
        return context[node.left.value]
    elif node.type == 'OP':
        left_val = evaluate(node.left, context)
        right_val = evaluate(node.right, context)
        if node.value == '+':
            return left_val + right_val
        elif node.value == '-':
            return left_val - right_val
        elif node.value == '*':
            return left_val * right_val
        elif node.value == '/':
            return left_val / right_val
    return None

Step 5: Putting It All Together

Here’s how you can put everything together to interpret a line of code:

def interpret(code, context={}):
    tokens = tokenize(code)
    ast = parse_expression(tokens)
    return evaluate(ast, context)

Example:

context = {}
interpret("x = 10 + 2 * 3", context)
print(context)  # Output: {'x': 16.0}

Conclusion

Building a Python interpreter from scratch is a rewarding project. By understanding tokenization, parsing, and evaluation, you gain a deeper appreciation of how programming languages work under the hood. This simple interpreter serves as a foundation; you can extend it with more complex features like functions, control flow, and error handling.

Further Reading

  • “Crafting Interpreters” by Robert Nystrom: A comprehensive guide to building interpreters.
  • Python’s AST Module: Explore Python’s own AST capabilities.
  • Compiler and Language Theory: Books like “Compilers: Principles, Techniques, and Tools” (Dragon Book) provide deeper insights into language design.