Metalyzer is a dependency-free C++17 compiler frontend and lexical analyzer generator. Built entirely from scratch without relying on standard regex libraries, the project translates custom .mz specification files into highly optimized, standalone C++ lexer code.
The engine implements foundational automata theory to eliminate runtime ambiguity, compiling rule sets into state-transition tables that deliver O(1) transitions per input character during tokenization.
To achieve maximum performance and predictability, Metalyzer handles all regex compilation, graph reduction, and memory layout optimization internally.
| Component | Implementation | Engineering Benefit |
|---|---|---|
| Graph Compilation | Thompson NFA + Power-Set DFA | Complete control over graph boundaries; zero external regex dependencies. |
| Conflict Resolution | Algorithmic Priority Assignment | Guarantees the highest-priority rule wins (e.g., `if` vs `[a-z]+`). |
| State Optimization | Equivalence-Class Partitioning | Strictly isolates partitions by Rule ID, minimizing footprint without destroying logic. |
| Runtime Matching | Maximal Munch Algorithm | O(1) transitions per character with greedy stream rollback. |
| Memory Footprint | 2D Transition Matrix Compression | Cache-friendly array layout; eliminates pointer-chasing during runtime tokenization. |
The engine transforms high-level regex specifications into low-level transition matrices through a strict, multi-pass graph pipeline:
```mermaid
graph LR
    UserSpec((spec.mz)) -->|Parse| Frontend[Frontend Parser]
    subgraph "Automata Engine"
        Frontend -->|Regex IR| NFA[Thompson NFA]
        NFA -->|Subset Construction| DFA[Power-Set DFA]
        DFA -->|Equivalence Partitioning| MinDFA[Minimized DFA]
        MinDFA -->|Serialization| Comp[Table Compressor]
    end
    subgraph "Code Generation"
        Comp -->|"O(1) Matrix"| Gen[C++ Generator]
        Gen -.->|Emit| Hpp[Lexer.hpp]
        Gen -.->|Emit| Cpp[Lexer.cpp]
    end
```
Metalyzer converts human-readable regex into executable state machines using three algorithmic passes:
- Thompson's Construction (NFA): Parses regular expressions via the Shunting-Yard algorithm and builds non-deterministic finite automata. Supports Kleene stars (`*`), unions (`|`), groupings (`()`), and character classes (`[]`).
- Power-Set Construction (DFA): Resolves non-determinism via subset construction. This stage implements algorithmic priority resolution: if a string matches multiple rules, the engine resolves the conflict during graph conversion rather than at runtime.
- State Minimization: Minimizes the DFA using equivalence-class partitioning while strictly preserving rule-priority boundaries.
The generated C++ code avoids dynamic backtracking structures; instead, it uses a cache-friendly 2D transition matrix.
- Mechanism: The runtime implements the greedy Maximal Munch algorithm. It consumes characters until a dead state is reached, then rolls the input stream back to the last accepting position.
- Benefit: The longest valid token is always matched, while state transitions remain O(1) per character.
Metalyzer consumes a standard 3-section specification file format (inspired by Lex/Flex) to allow seamless injection of custom C++ action code:
```lex
%{
// 1. Header Section: Injected at the top of the generated file
#include <iostream>
enum Token { ERR = -2, EOF_TOK = -1, INT = 1, IF = 2, ID = 3 };
%}
%%
// 2. Rules Section: Regex mapped to Action Blocks
[0-9]+    { return Token::INT; }
if        { return Token::IF; }
[a-z]+    { return Token::ID; }
%%
// 3. User Code Section: Injected at the bottom of the generated file
int main() {
    Lexer lexer(std::cin);
    // Tokenization loop...
}
```
- C++17 compliant compiler (GCC 9+ or Clang 10+)
- CMake 3.10 or higher
```bash
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j$(nproc)
```
Compile your lexer specifications by passing them to the generator. Currently, the build executes an end-to-end internal test suite:
```bash
./metalyzer_app
```
With the foundational lexical engine complete, the project is scheduled to expand into a complete language frontend:
- Action Code Injection: Expanding the C++ generator to write user-defined action blocks (`{ return Token::IF; }`) into the runtime `switch` statement.
- Parser Generator: Implementation of a `.my` specification parser to generate Abstract Syntax Trees (ASTs) using LALR/LR(1) lookahead tables.
- Semantic Analyzer: AST validation passes for type-checking and logical constraint verification.
- LLVM Backend Integration: A lowering phase (Codegen) to translate the validated AST into LLVM Intermediate Representation (IR), bridging the gap from custom syntax to executable machine code.