Skip to content

raj-v020/metalyzer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Metalyzer Compiler Suite

Metalyzer is a dependency-free C++17 compiler frontend and lexical analyzer generator. Built entirely from scratch without relying on standard regex libraries, the project translates custom .mz specification files into highly optimized, standalone C++ lexer code.

The engine implements foundational automata theory to eliminate runtime ambiguity, utilizing a compute-efficient pipeline to generate O(1) state-transition tables for high-performance tokenization.

Technical Highlights

To achieve maximum performance and predictability, Metalyzer handles all regex compilation, graph reduction, and memory layout optimization internally.

Component Implementation Engineering Benefit
Graph Compilation Thompson NFA + Power-Set DFA Complete control over graph boundaries; zero external regex dependencies.
Conflict Resolution Algorithmic Priority Assignment Mathematically guarantees the highest-priority rule wins (e.g., if vs [a-z]+).
State Optimization Equivalence-Class Partitioning Strictly isolates partitions by Rule ID, minimizing footprint without destroying logic.
Runtime Matching Maximal Munch Algorithm O(1) transitions per character with greedy stream rollback.
Memory Footprint 2D Transition Matrix Compression Cache-friendly array layout; eliminates pointer-chasing during runtime tokenization.

Pipeline Architecture

The engine transforms high-level regex specifications into low-level transition matrices through a strict, multi-pass graph pipeline:

graph LR
    UserSpec((spec.mz)) -->|Parse| Frontend[Frontend Parser]
    
    subgraph "Automata Engine"
    Frontend -->|Regex IR| NFA[Thompson NFA]
    NFA -->|Subset Construction| DFA[Power-Set DFA]
    DFA -->|Equivalence Partitioning| MinDFA[Minimized DFA]
    MinDFA -->|Serialization| Comp[Table Compressor]
    end
    
    subgraph "Code Generation"
    Comp -->|"O(1) Matrix"| Gen[C++ Generator]
    Gen -.->|Emit| Hpp[Lexer.hpp]
    Gen -.->|Emit| Cpp[Lexer.cpp]
    end

Loading

1. The Automata Engine: Graph Compilation

Metalyzer converts human-readable regex into executable state machines using three algorithmic passes:

  • Thompson's Construction (NFA): Parses regular expressions via the Shunting-Yard algorithm and builds Non-Deterministic Finite Automata. Supports Kleene stars (*), unions (|), groupings (()), and character classes ([]).
  • Power-Set Construction (DFA): Resolves non-determinism. This stage implements Algorithmic Priority Resolution—if a string mathematically matches multiple rules, the engine resolves the conflict during graph conversion rather than at runtime.
  • State Minimization: Minimizes the DFA using equivalence-class partitioning while strictly protecting rule priority boundaries.

2. The Runtime Lexer: Maximal Munch

The generated C++ code avoids dynamic backtracking graphs. Instead, it utilizes a highly cache-friendly 2D transition matrix.

  • Mechanism: The runtime implements the greedy Maximal Munch algorithm. It aggressively consumes characters until a dead-end is reached, then seamlessly rolls back the input stream to the last known accepting state.
  • Benefit: Ensures the longest valid token is universally matched while maintaining deterministic O(1) state transitions per character.

Specification Format (.mz)

Metalyzer consumes a standard 3-section specification file format (inspired by Lex/Flex) to allow seamless injection of custom C++ action code:

%{
// 1. Header Section: Injected at the top of the generated file
#include <iostream>
enum Token { ERR = -2, EOF_TOK = -1, INT = 1, IF = 2, ID = 3 };
%}

%%
// 2. Rules Section: Regex mapped to Action Blocks
[0-9]+    { return Token::INT; }
if        { return Token::IF; }
[a-z]+    { return Token::ID; }
%%

// 3. User Code Section: Injected at the bottom of the generated file
int main() {
    Lexer lexer(std::cin);
    // Tokenization loop...
}

Build and Run

Prerequisites

  • C++17 compliant compiler (GCC 9+ or Clang 10+)
  • CMake 3.10 or higher

Compilation

mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j$(nproc)

Running the Engine

Compile your lexer specifications by passing them to the generator. Currently, the build executes an end-to-end internal test suite:

./metalyzer_app 

Future Work

With the foundational lexical engine complete, the suite is scheduled to expand into a complete language frontend:

  • Action Code Injection: Expanding the C++ generator to dynamically write user-defined action blocks ({ return Token::IF; }) into the runtime switch statement.
  • Parser Generator: Implementation of a .my specification parser to generate Abstract Syntax Trees (ASTs) using LALR/LR(1) lookahead tables.
  • Semantic Analyzer: AST validation passes for type-checking and logical constraint verification.
  • LLVM Backend Integration: A lowering phase (Codegen) to translate the validated AST into LLVM Intermediate Representation (IR), bridging the gap from custom syntax to executable machine code.

About

A dependency-free C++17 compiler frontend and lexical analyzer generator. Compiles custom regex specifications into highly optimized, standalone C++ tokenizers.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors