ethandl.dev

The source for my website
git clone git://git.ethandl.dev/ethandl.dev
Log | Files | Refs | Submodules

writeup.md (4794B)


      1 ---
      2 date: 2024-06-10
      3 title: "My custom regular expression parser & engine"
      4 summary: "An incredibly simple regular expression engine built with finite state automaton in Rust"
      5 description: "An incredibly simple regular expression engine built with finite state automaton in Rust"
      6 weight: 1
      7 ---
      8 # Source Code
      9 The source code is publicly available on my 
     10 [own git server](https://git.ethandl.dev/automaton/)
     11 
     12 # Why?
     13 I was inspired to try and make my own automaton engine after learning about them
     14 in a course offered by the ANU, COMP1600.
     15 
     16 Prior to COMP1600, I had already learned a bit about regular grammars and 
     17 regular expressions in a compiler construction course, but that course didn't go
     18 in much depth about how they might be practically implemented.
     19 
     20 COMP1600 showed me that it is actually incredibly simple to make a regular 
     21 expression engine that will match strings in linear time complexity with respect
     22 to the string length thanks to a concept known as deterministic finite automata 
     23 (DFAs).
     24 
     25 # What is this?
     26 If you don't know what regular expressions are, leave now and return when you 
     27 do.
     28 
     29 This project is currently a collection of implementations of finite automata, 
     30 along with a regular expression compiler.
     31 I will primarily be highlighting its ability to compile regular expressions into
     32 easily computable structures in memory known as automaton.
     33 
     34 Essentially, you give a regular expression, and it spits out an automaton that
     35 can recognise the regular language described by that expression.
     36 We can think of the regex string as a kind of "source code" for a machine that 
     37 accepts or rejects other strings; this "machine" is called an automaton.
     38 Specifically for regular expressions and languages, these automaton can be 
     39 described with a finite memory known ahead of time, and are hence called finite 
     40 state automaton (FSA).
     41 
     42 This project utilises 3 different forms of FSA in the compilation process from 
     43 regular expression to the final deterministic finite automaton (DFA). See the
     44 [compilation pipeline](#comp-pline)
     45 
     46 ## Automaton Included {#automaton-included}
     47 Currently, only finite state automaton are included, no turing automaton. (yet!)
     48 | Automaton type                      | Primary  data structure                | `struct`    | Typical use                                     |
     49 |-------------------------------------|----------------------------------------|-------------|-------------------------------------------------|
     50 | \\(\epsilon \operatorname{NFA}^*\\) | `Graph` (using the `petgraph` library) | `GraphENFA` | Programmatically Building Automaton             |
     51 | \\(\epsilon \operatorname{NFA}\\)   | `HashMap`                              | `ENFA`      | Easy to build automaton                         |
     52 | \\(\operatorname{NFA}\\)            | `HashMap`                              | `NFA`       | Less problematic to convert to DFA              |
     53 | \\(\operatorname{DFA}\\)            | `HashMap`                              | `DFA`       | Complete control over a deterministic automaton |
     54 
     55 # Regex Compilation Pipeline {#comp-pline}
     56 {{< katex >}}
     57 The typical compilation pipeline for a regular expression \\(R_L\\) defining the
     58 regular langauge \\(L\\) in the `automaton` library is the following:
     59 $$
     60 R_L
     61 \mapsto \epsilon \operatorname{NFA}^*_L
     62 \mapsto \epsilon \operatorname{NFA}_L
     63 \mapsto \operatorname{NFA}_L
     64 \mapsto \operatorname{DFA}_L
     65 $$
     66 See [the automaton included](#automaton-included) for info on these data structures.
     67 
     68 This is exactly what the `regex` binary does.
     69 
     70 The mapping \\(R_L \mapsto \epsilon\text{NFA}^\*_L\\) is not necessarily trivial, 
     71 the language of regular expressions is unfortunately not regular (it would have 
     72 been really nice to describe the language of regular expressions exactly with a
     73 regular expression). I implemented an LL(1) predictive recursive descent 
     74 parser[^1] to parse the regular expression into an AST, which is then converted
     75 into an \\(\epsilon\text{NFA}^\*_L\\) through a set of algorithms described in 
     76 the COMP1600 lectures.
     77 
     78 # Conversion
     79 
     80 Conversion between various finite state automaton is achieved through the
     81 algorithms described in the lectures, with not much care for the memory
     82 complexity of the resulting conversion. Conversions are also not perfectly
     83 reversible, converting an \\(\text{NFA}\\) to a \\(\text{DFA}\\) and then back
     84 will just yield a \\(\text{DFA}\\) in the struct of an \\(\text{NFA}\\).
     85 
     86 # Demo
     87 See the demo I have hosted here:
     88 {{< article link="/projects/dfa-regex-engine/demo/" >}}
     89 
     90 [^1]: In all honesty I'm not sure if it is really LL(1), I think it may 
     91       technically be nondeterministic given it has some notion of context. As of
     92       writing, it has been months since I actually implemented it and did any 
     93       formal theory on grammars so I'm a bit vague.