writeup.md (4794B)
1 --- 2 date: 2024-06-10 3 title: "My custom regular expression parser & engine" 4 summary: "An incredibly simple regular expression engine built with finite state automaton in Rust" 5 description: "An incredibly simple regular expression engine built with finite state automaton in Rust" 6 weight: 1 7 --- 8 # Source Code 9 The source code is publicly available on my 10 [own git server](https://git.ethandl.dev/automaton/) 11 12 # Why? 13 I was inspired to try and make my own automaton engine after learning about them 14 in a course offered by the ANU, COMP1600. 15 16 Prior to COMP1600, I had already learned a bit about regular grammars and 17 regular expressions in a compiler construction course, but that course didn't go 18 in much depth about how they might be practically implemented. 19 20 COMP1600 showed me that it is actually incredibly simple to make a regular 21 expression engine that will match strings in linear time complexity with respect 22 to the string length thanks to a concept known as deterministic finite automata 23 (DFAs). 24 25 # What is this? 26 If you don't know what regular expressions are, leave now and return when you 27 do. 28 29 This project is currently a collection of implementations of finite automata, 30 along with a regular expression compiler. 31 I will primarily be highlighting its ability to compile regular expressions into 32 easily computable structures in memory known as automaton. 33 34 Essentially, you give a regular expression, and it spits out an automaton that 35 can recognise the regular language described by that expression. 36 We can think of the regex string as a kind of "source code" for a machine that 37 accepts or rejects other strings; this "machine" is called an automaton. 38 Specifically for regular expressions and languages, these automaton can be 39 described with a finite memory known ahead of time, and are hence called finite 40 state automaton (FSA). 41 42 This project utilises 3 different forms of FSA in the compilation process from 43 regular expression to the final deterministic finite automaton (DFA). See the 44 [compilation pipeline](#comp-pline) 45 46 ## Automaton Included {#automaton-included} 47 Currently, only finite state automaton are included, no turing automaton. (yet!) 48 | Automaton type | Primary data structure | `struct` | Typical use | 49 |-------------------------------------|----------------------------------------|-------------|-------------------------------------------------| 50 | \\(\epsilon \operatorname{NFA}^*\\) | `Graph` (using the `petgraph` library) | `GraphENFA` | Programmatically Building Automaton | 51 | \\(\epsilon \operatorname{NFA}\\) | `HashMap` | `ENFA` | Easy to build automaton | 52 | \\(\operatorname{NFA}\\) | `HashMap` | `NFA` | Less problematic to convert to DFA | 53 | \\(\operatorname{DFA}\\) | `HashMap` | `DFA` | Complete control over a deterministic automaton | 54 55 # Regex Compilation Pipeline {#comp-pline} 56 {{< katex >}} 57 The typical compilation pipeline for a regular expression \\(R_L\\) defining the 58 regular langauge \\(L\\) in the `automaton` library is the following: 59 $$ 60 R_L 61 \mapsto \epsilon \operatorname{NFA}^*_L 62 \mapsto \epsilon \operatorname{NFA}_L 63 \mapsto \operatorname{NFA}_L 64 \mapsto \operatorname{DFA}_L 65 $$ 66 See [the automaton included](#automaton-included) for info on these data structures. 67 68 This is exactly what the `regex` binary does. 69 70 The mapping \\(R_L \mapsto \epsilon\text{NFA}^\*_L\\) is not necessarily trivial, 71 the language of regular expressions is unfortunately not regular (it would have 72 been really nice to describe the language of regular expressions exactly with a 73 regular expression). I implemented an LL(1) predictive recursive descent 74 parser[^1] to parse the regular expression into an AST, which is then converted 75 into an \\(\epsilon\text{NFA}^\*_L\\) through a set of algorithms described in 76 the COMP1600 lectures. 77 78 # Conversion 79 80 Conversion between various finite state automaton is achieved through the 81 algorithms described in the lectures, with not much care for the memory 82 complexity of the resulting conversion. Conversions are also not perfectly 83 reversible, converting an \\(\text{NFA}\\) to a \\(\text{DFA}\\) and then back 84 will just yield a \\(\text{DFA}\\) in the struct of an \\(\text{NFA}\\). 85 86 # Demo 87 See the demo I have hosted here: 88 {{< article link="/projects/dfa-regex-engine/demo/" >}} 89 90 [^1]: In all honesty I'm not sure if it is really LL(1), I think it may 91 technically be nondeterministic given it has some notion of context. As of 92 writing, it has been months since I actually implemented it and did any 93 formal theory on grammars so I'm a bit vague.