Purpose: Detailed documentation of every class, method, and function in the codebase.
Generated: 2026-03-07
- src/scanner.py
- src/template_loader.py
- src/prompt_builder.py
- src/model_engine.py
- src/postprocessor.py
- src/formatter.py
- main.py
```python
@dataclass
class UniversalResource:
    file_path: str       # Absolute path to the file
    mime_type: str       # MIME type (e.g., 'image/jpeg', 'application/pdf')
    raw_data: Any        # Path reference for downstream processing
    size_bytes: int = 0  # File size in bytes
```

Purpose: Immutable container for a discovered file. The StarryEngine uses `mime_type` to route the file to the correct analyzer (`_analyze_image`, `_analyze_pdf`, or `_analyze_text`).
Design Decision: raw_data is set to the file path rather than the file contents because images and PDFs can be very large. Loading them eagerly would exhaust memory. Instead, each analyzer loads the file on demand.
```python
@dataclass
class ScanResult:
    resources: List[UniversalResource]               # All discovered files
    total_bytes: int = 0                             # Sum of all file sizes
    skipped_count: int = 0                           # Files/dirs skipped by filter
    error_count: int = 0                             # Files that failed to scan
    errors: List[str] = field(default_factory=list)  # Error messages
```

Purpose: Aggregated output from a directory scan. Provides statistics for the TUI (total bytes, file count) and error tracking for robustness. Note that the `errors` field needs `field(default_factory=list)`: a bare `[]` mutable default would raise a `ValueError` at class definition time.
Property:
- `count` → `int`: returns `len(self.resources)`.
Purpose: Initializes the MIME detection engine (python-magic) and sets up skip patterns.
Default Skip Patterns: `Instructions`, `.venv`, `venv`, `__pycache__`, `.git`, `.DS_Store`, `.idea`, `.pytest_cache`, `node_modules`, `.github`, `models`, `.env`
Parameter: skip_patterns overrides the defaults if provided.
Purpose: Returns True if any skip pattern appears anywhere in the path string.
Algorithm: Simple substring matching — `any(s in path for s in self.skip_patterns)`.
Tradeoff: Substring matching is fast but imprecise (e.g., a file named modelsummary.txt would match models). For this use case, false positives in skip logic are acceptable.
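A minimal sketch of the substring check described above; the function name is hypothetical and the pattern list is abbreviated:

```python
# Abbreviated stand-in for the documented default skip patterns.
DEFAULT_SKIP_PATTERNS = [".venv", "__pycache__", ".git", "node_modules", "models"]

def should_skip(path: str, skip_patterns=DEFAULT_SKIP_PATTERNS) -> bool:
    # Substring match: fast, but "modelsummary.txt" also matches "models".
    return any(pattern in path for pattern in skip_patterns)
```

The documented false positive is easy to reproduce: `should_skip("/repo/modelsummary.txt")` returns `True` because `"models"` is a substring of the filename.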
Purpose: Backward-compatible wrapper around scan(). Returns just the resource list.
When to use: When you only need the file list and don't care about stats/errors.
Purpose: Full DFS traversal with statistics, error tracking, and optional filtering.
Algorithm:
- Validate `root_path` is a directory
- Walk with `os.walk()` (DFS order)
- Prune: remove skip-pattern directories from `dirs[:]` in place (prevents `os.walk` from descending)
- For each file: detect MIME type, get size, create a `UniversalResource`
- Catch `OSError`/`PermissionError` per file and log to `errors`
Performance Note: Directory pruning (dirs[:] = [...]) is O(n) per directory but prevents the walker from entering massive skip directories like node_modules/, which can contain 100k+ files.
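The pruning step can be sketched as follows; the generator name and signature are hypothetical:

```python
import os

def walk_pruned(root: str, skip_patterns: list[str]):
    """Yield file paths while pruning skip-pattern directories in place."""
    for dirpath, dirs, files in os.walk(root):
        # In-place slice assignment is what stops os.walk from descending;
        # rebinding `dirs = [...]` would have no effect on the walker.
        dirs[:] = [d for d in dirs if not any(p in d for p in skip_patterns)]
        for name in files:
            yield os.path.join(dirpath, name)
```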
Parameter: apply_filter=False disables all filtering — useful for testing.
Purpose: Loads master_template.md from the specified directory (or auto-resolves from ../templates/).
Behavior:
- Reads the raw template file
- Generates a `cleaned` version (HTML comments stripped)
- Generates a `compacted` version (comments stripped + duplicate placeholders collapsed)
- If the file is missing, activates Recovery Mode with a minimal fallback template
Purpose: Strips ALL HTML comments (<!-- ... -->) and collapses 3+ consecutive newlines to 2.
Regex: re.sub(r'<!--.*?-->', '', template, flags=re.DOTALL) — the DOTALL flag ensures multi-line comments are matched.
Important: This is the foundation of the "no instruction leakage" guarantee. By stripping every HTML comment, we ensure no <!-- AI INSTRUCTION: --> markers ever reach the model.
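A minimal sketch of the clean step, combining the documented regex with the newline collapse; the function name is an assumption:

```python
import re

def clean_template(template: str) -> str:
    # Strip every HTML comment; DOTALL lets '.' cross newlines so
    # multi-line <!-- AI INSTRUCTION: ... --> blocks are matched whole.
    no_comments = re.sub(r'<!--.*?-->', '', template, flags=re.DOTALL)
    # Collapse runs of 3+ consecutive newlines down to 2.
    return re.sub(r'\n{3,}', '\n\n', no_comments)
```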
Purpose: Aggressively reduces template size for minimal token usage.
Additional Operations (beyond clean):
- Collapses consecutive `**{{PLACEHOLDER}}**` table rows into a single row
- Collapses consecutive `${{VAR}}$` rows
- Collapses consecutive `{{CODE_LINE_N}}` placeholders
Use Case: When the model's context window is limited and every token counts.
| Property | Type | Description |
|---|---|---|
| `raw` | `str` | Original, unmodified template content |
| `cleaned` | `str` | Template with HTML comments stripped |
| `compacted` | `str` | Aggressively minimized template |
| `path` | `str` | Absolute path to the template file |
| Constant | Value |
|---|---|
| `MERMAID_CLASSDEF_DEFAULT` | `classDef default fill:#1a1a1a,stroke:#bc13fe,...` |
| `MERMAID_CLASSDEF_HIGHLIGHT` | `classDef highlight fill:#2a0a3a,stroke:#00f3ff,...` |
These are the canonical source of truth for cyberpunk Mermaid styling. Used by both PromptBuilder (injected into system prompt) and MermaidFixer (auto-injected into output).
Purpose: Constructs the complete prompt: system rules + template + source input.
Structure:
[System Rules: Core Directives, Section Rules, Mermaid Rules, Output Rules]
--- MASTER TEMPLATE START ---
[Template Markdown]
--- MASTER TEMPLATE END ---
SOURCE INPUT TO SYNTHESIZE:
[Raw Content]
Parameter is_image: When True, the context label changes from "structured data" to "visual architecture", which subtly shifts the model's interpretation of the input.
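The layout above can be sketched as a single string assembly; the function name and the exact placement of the context label are assumptions:

```python
def build_prompt(system_rules: str, template: str, source: str,
                 is_image: bool = False) -> str:
    # Context label shifts the model's framing of the input, per the docs.
    context = "visual architecture" if is_image else "structured data"
    return (
        f"{system_rules}\n\n"
        "--- MASTER TEMPLATE START ---\n"
        f"{template}\n"
        "--- MASTER TEMPLATE END ---\n\n"
        f"SOURCE INPUT TO SYNTHESIZE ({context}):\n"
        f"{source}"
    )
```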
Purpose: Generates the complete set of Knowledge Architect rules as a single string.
Rule Categories:
- CORE DIRECTIVES (4 rules): Authorship, Synthesis > Summary, Formatting, Academic Tone
- SECTION-SPECIFIC RULES (9 sections): Document Record, Core Concepts, Visual Knowledge Graph, Technical Deep Dive, Annotated Glossary, Exam Preparation, Curated Study, Quick Reference, Metacognitive Calibration
- OUTPUT RULES (3 rules): Clean Markdown only, replace placeholders, generate all 10 sections
Design Decision: All rules are in one method rather than spread across multiple files. This makes it trivial to audit, modify, or extend the rule set.
Purpose: Maps any MIME type to one of 6 processing strategies.
Returns one of: 'image', 'pdf', 'office', 'structured', 'text', 'binary'
Classification Priority:
- Check if MIME is in `IMAGE_TYPES` or starts with `image/` → `'image'`
- Check if MIME is in `PDF_TYPES` → `'pdf'`
- Check if MIME is in `OFFICE_TYPES` → `'office'`
- Check if MIME is in `STRUCTURED_TYPES` → `'structured'`
- Check if MIME is in `BINARY_TYPES` or matches the binary heuristic → `'binary'`
- Default fallback → `'text'` (safest: most unknown types are readable)
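The priority chain might look like this sketch, with heavily abbreviated stand-ins for the real MIME-type constants:

```python
# Abbreviated stand-ins; the real constants cover many more types.
IMAGE_TYPES = {"image/jpeg", "image/png"}
PDF_TYPES = {"application/pdf"}
OFFICE_TYPES = {"application/vnd.openxmlformats-officedocument.wordprocessingml.document"}
STRUCTURED_TYPES = {"application/json", "text/csv"}
BINARY_TYPES = {"application/zip", "application/octet-stream"}

def looks_binary(mime: str) -> bool:
    # Heuristic from the docs: media/font prefixes plus binary-ish keywords.
    prefixes = ("audio/", "video/", "font/")
    keywords = ("octet-stream", "executable", "archive", "compressed")
    return mime.startswith(prefixes) or any(k in mime for k in keywords)

def classify(mime: str) -> str:
    # Checks run in the documented priority order; first match wins.
    if mime in IMAGE_TYPES or mime.startswith("image/"):
        return "image"
    if mime in PDF_TYPES:
        return "pdf"
    if mime in OFFICE_TYPES:
        return "office"
    if mime in STRUCTURED_TYPES:
        return "structured"
    if mime in BINARY_TYPES or looks_binary(mime):
        return "binary"
    return "text"  # safest default: most unknown types are readable
```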
Covered MIME Types:
| Category | MIME Types |
|---|---|
| Image | jpeg, png, gif, bmp, tiff, webp, svg+xml, heic, heif, x-icon |
| PDF | application/pdf |
| Office | docx, pptx, xlsx, odt, ods, odp, doc, xls, ppt |
| Structured | json, csv, xml, yaml, tab-separated-values |
| Text | plain, html, css, javascript, python, java, c, c++, go, rust, ruby, perl, shell, markdown, rst, tex, latex, diff, patch, log, config |
| Binary | octet-stream, zip, gzip, tar, 7z, rar, jar, exe, mach-binary, sharedlib, wasm, sqlite, audio/, video/, font/* |
Purpose: Heuristic for detecting likely binary MIME types not in the explicit set.
Checks: audio/, video/, font/ prefixes, and keywords like octet-stream, executable, archive, compressed.
Purpose: Reads content from any file format, gracefully handling encoding issues and size limits.
Encoding Fallback Chain: UTF-8 → Latin-1 → UTF-8 with error replacement.
Truncation: Files exceeding max_chars are truncated with a [...truncated...] marker.
Design Decision: Triple encoding fallback ensures no file crashes the pipeline. Latin-1 accepts any byte sequence (0x00–0xFF), so it never fails. The error replacement encoding is the final safety net.
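A sketch of the fallback chain, assuming the 12,000-character cap quoted elsewhere in these docs:

```python
def read_text_file(path: str, max_chars: int = 12_000) -> str:
    """Try UTF-8, then Latin-1, then UTF-8 with replacement characters."""
    for kwargs in ({"encoding": "utf-8"},
                   {"encoding": "latin-1"},
                   {"encoding": "utf-8", "errors": "replace"}):
        try:
            with open(path, **kwargs) as fh:
                text = fh.read()
            break
        except UnicodeDecodeError:
            continue
    if len(text) > max_chars:
        text = text[:max_chars] + "\n[...truncated...]"
    return text
```

In practice the third branch is unreachable because Latin-1 decodes every byte value, but it documents the intended safety net.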
Purpose: Parses JSON and pretty-prints it with 2-space indent for model readability.
Fallback: Falls back to read_text_file() on JSON decode errors.
Purpose: Reads CSV and formats rows as pipe-delimited text.
Truncation: Stops at max_rows with a truncation marker.
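A possible shape for the CSV reader; the `max_rows` default is an assumption:

```python
import csv

def read_csv_file(path: str, max_rows: int = 100) -> str:
    """Format CSV rows as pipe-delimited text, truncating at max_rows."""
    lines = []
    with open(path, newline="") as fh:
        for i, row in enumerate(csv.reader(fh)):
            if i >= max_rows:
                lines.append("[...truncated...]")
                break
            lines.append(" | ".join(row))
    return "\n".join(lines)
```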
Purpose: Extracts text from Office documents (.docx, .pptx, .xlsx) by reading their internal XML files.
Algorithm: Office documents are ZIP archives containing XML. This method:
- Opens the document as a `ZipFile`
- Finds XML files matching `document`, `slide`, `sheet`, or `content` patterns
- Strips XML tags with a regex
- Joins the extracted text
Limitations: Cannot read encrypted documents or extract formatting. For encrypted docs, returns a descriptive message instead of crashing.
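The ZIP-of-XML approach can be sketched as follows; the filename patterns and tag-stripping regex are assumptions based on the description above:

```python
import re
import zipfile

def read_office_file(path: str) -> str:
    """Extract plain text from the XML parts of an Office ZIP archive."""
    try:
        with zipfile.ZipFile(path) as zf:
            parts = []
            for name in zf.namelist():
                if name.endswith(".xml") and re.search(
                        r"document|slide|sheet|content", name):
                    xml = zf.read(name).decode("utf-8", errors="replace")
                    # Replace tags with spaces so adjacent runs stay separated.
                    parts.append(re.sub(r"<[^>]+>", " ", xml))
            return "\n".join(parts)
    except zipfile.BadZipFile:
        # Encrypted/corrupt documents: descriptive message instead of a crash.
        return f"[Could not extract text from {path}]"
```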
Purpose: Generates a metadata summary for binary files.
Output: File name, extension, size in bytes, and a prompt asking the model to generate a study guide about the file type itself.
Purpose: Loads the Gemma 3 model into Apple Silicon unified memory.
Initialization Steps:
- Call `mlx_lm.load(model_path)` → returns `(model, tokenizer)`
- Create `TemplateLoader()` → loads and processes the master template
- Store `master_template` (raw) and `_prompt_template` (cleaned)
Memory: The Gemma 3 4B model uses ~5 GB of unified memory. The 12B variant needs ~16 GB.
Purpose: Routes a UniversalResource to the correct analyzer using MimeClassifier.
Routing Table:
| Strategy | Analyzer | File Types |
|---|---|---|
| `image` | `_analyze_image()` | JPEG, PNG, GIF, BMP, TIFF, WebP, HEIC |
| `pdf` | `_analyze_pdf()` | PDF (with OCR fallback) |
| `office` | `_analyze_office()` | DOCX, PPTX, XLSX, ODT, etc. |
| `structured` | `_analyze_structured()` | JSON, CSV, XML, YAML |
| `binary` | `_analyze_binary()` | ZIP, audio, video, fonts, executables |
| `text` | `_analyze_text()` | Python, Java, C, HTML, CSS, Markdown, shell scripts, etc. |
Pipeline: PIL open → RGB convert → multimodal prompt → stream → PostProcessor
Pipeline: PyMuPDF extract → OCR fallback (if <100 chars) → prompt → stream → PostProcessor
Performance: Text capped at 12,000 chars. OCR renders first 2 pages at 150 DPI.
Pipeline: TextExtractor.read_office_file() → prompt → stream → PostProcessor
New in v2.1: Handles .docx, .pptx, .xlsx, .odt by extracting XML text from the ZIP archive.
Pipeline: TextExtractor (JSON/CSV/text fallback) → prompt → stream → PostProcessor
New in v2.1: Pretty-prints JSON, formats CSV as pipe-delimited tables.
Pipeline: TextExtractor.read_binary_preview() → prompt → stream → PostProcessor
New in v2.1: Instead of crashing on binary files, generates a metadata summary and asks the model to explain the file type.
Pipeline: TextExtractor.read_text_file() → prompt → stream → PostProcessor
Improved in v2.1: Now uses encoding fallback (UTF-8 → Latin-1 → replace) and caps content at 12,000 characters.
Purpose: Repairs common Mermaid diagram issues in LLM output.
Pipeline:
- `_replace_forbidden_types()` → converts sequenceDiagram/mindmap/classDiagram to `graph TD`
- `_inject_classdef()` → adds cyberpunk classDef lines if missing
- `_remove_inline_styles()` → strips `style NodeID fill:...` directives
- `_remove_semicolons()` → strips trailing `;` from mermaid lines
Regex Pattern for blocks: r'```mermaid\n.*?```' with re.DOTALL — matches the entire mermaid code fence.
classDef Injection Logic: Only injects if classDef default is NOT already present. Finds the diagram type line (e.g., graph TD) and inserts classDef on the next line.
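The injection logic might be sketched like this, using an abbreviated classDef string in place of the real constant:

```python
import re

# Abbreviated stand-in for MERMAID_CLASSDEF_DEFAULT.
MERMAID_CLASSDEF_DEFAULT = "classDef default fill:#1a1a1a,stroke:#bc13fe"

def inject_classdef(markdown: str) -> str:
    """Insert the classDef line into any mermaid fence that lacks one."""
    def fix_block(match: re.Match) -> str:
        block = match.group(0)
        if "classDef default" in block:
            return block  # already styled; leave untouched
        lines = block.split("\n")
        # lines[0] is ```mermaid, lines[1] is the diagram type (e.g. graph TD),
        # so the classDef goes on the line right after the type.
        lines.insert(2, MERMAID_CLASSDEF_DEFAULT)
        return "\n".join(lines)
    # DOTALL so the fence match spans the whole multi-line block.
    return re.sub(r"```mermaid\n.*?```", fix_block, markdown, flags=re.DOTALL)
```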
Purpose: Removes instruction markers that leak from the template into the output.
Leak Patterns Detected:
- `<!-- AI INSTRUCTION ... -->` (HTML comment format)
- `[[AI INSTRUCTION]] ...` (bracket format)
- `**RULES:** ...` (bold marker)
- `**DIAGRAM SELECTION:** ...` (selection marker)
- `**BLOCK SELECTION:** ...` (block marker)
- `**HARD RULES ...` (hard rules marker)
- `{{UPPERCASE_PLACEHOLDER}}` (unfilled placeholders)
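A hedged sketch of the leak-stripping pass; the exact regexes in OutputCleaner may differ:

```python
import re

def strip_leaks(text: str) -> str:
    """Remove instruction markers that leaked from the template."""
    # Multi-line HTML comments need DOTALL; the remaining patterns
    # deliberately stop at end of line (default '.' behavior).
    text = re.sub(r"<!--\s*AI INSTRUCTION.*?-->", "", text, flags=re.DOTALL)
    for pattern in (r"\[\[AI INSTRUCTION\]\].*",
                    r"\*\*RULES:\*\*.*",
                    r"\*\*DIAGRAM SELECTION:\*\*.*",
                    r"\*\*BLOCK SELECTION:\*\*.*",
                    r"\*\*HARD RULES.*",
                    r"\{\{[A-Z0-9_]+\}\}"):
        text = re.sub(pattern, "", text)
    return text
```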
Purpose: Checks that generated output meets structural requirements.
Checks Performed:
- All 10 required sections present (case-insensitive search)
- Mermaid code fence exists
- Exam questions exist (`QUESTION 01` or `QUESTION 1`)
- No leaked instruction markers
- No unfilled placeholders
Validity Criteria: Output is valid if:
- At most 2 sections are missing AND
- Mermaid diagram is present AND
- Exam questions are present
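The criteria translate directly into a boolean predicate; the standalone function below is a sketch, since the real check presumably lives inside the validator:

```python
def is_valid(sections_missing: list[str], has_mermaid: bool,
             has_exam_questions: bool) -> bool:
    # Output is accepted when at most 2 sections are missing AND
    # both the Mermaid diagram and the exam questions are present.
    return (len(sections_missing) <= 2
            and has_mermaid
            and has_exam_questions)
```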
```python
@dataclass
class ValidationResult:
    is_valid: bool
    sections_found: List[str]
    sections_missing: List[str]
    has_mermaid: bool
    has_exam_questions: bool
    has_source_archive: bool
    warnings: List[str]
```

Purpose: Orchestrates the full post-processing pipeline.
Pipeline:
- `OutputCleaner.clean()` — strip leaked instructions
- `MermaidFixer.fix()` — repair diagrams
- Whitespace normalization — collapse 3+ newlines
- `OutputValidator.validate()` — log warnings (non-blocking)
Design Decision: Validation is non-blocking — it logs warnings but does not reject output. This is intentional: a study guide missing 1-2 sections is still valuable. The warnings help with debugging and quality tracking.
Purpose: Creates the Instructions/ output directory.
Behavior: Uses os.makedirs(exist_ok=True) — idempotent, safe to call multiple times.
Purpose: Post-processes and saves a study guide.
Naming Convention: {original_name}_StudyGuide.md with spaces replaced by underscores.
Post-Processing: When post_process=True (default), runs PostProcessor.process() before writing. This is the final safety net — even if the engine produces dirty output, the saved file will be clean.
Purpose: Reads a saved guide and runs OutputValidator.validate() on it.
Use Case: Automated quality checks on previously generated guides.
Maps MIME type substrings to emoji icons. Falls back to 📦 for unknown types.
Formats byte counts as human-readable strings (B, KB, MB, GB, TB).
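A plausible sketch of the formatter; the 1024 thresholds and one-decimal rounding are assumptions:

```python
def format_bytes(n: float) -> str:
    """Render a byte count as a human-readable size string."""
    for unit in ("B", "KB", "MB", "GB"):
        if n < 1024:
            # Whole bytes need no decimal point; larger units get one.
            return f"{n} B" if unit == "B" else f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} TB"
```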
Calculates the knowledge amplification ratio and renders it as 1-5 colored stars.
Checks if a path matches any skip pattern. Used in the TUI's Phase 2 to filter resources.
Prints a phase header with consistent styling.
Purpose: The main pipeline orchestrator.
4-Phase Flow:
1. Neural Initialization: load Gemma 3, init scanner and formatter
2. Deep Scan: traverse CWD, filter, display the resource table
3. Knowledge Synthesis: process each file with live progress bars and token callbacks
4. Mission Report: display the results table and constellation footer