Purpose: Detailed documentation of every class, method, and function in the codebase.
Generated: 2026-03-07
- src/scanner.py
- src/template_loader.py
- src/prompt_builder.py
- src/model_engine.py
- src/postprocessor.py
- src/formatter.py
- main.py
```python
@dataclass
class UniversalResource:
    file_path: str       # Absolute path to the file
    mime_type: str       # MIME type (e.g., 'image/jpeg', 'application/pdf')
    raw_data: Any        # Path reference for downstream processing
    size_bytes: int = 0  # File size in bytes
```

Purpose: Immutable container for a discovered file. The StarryEngine uses `mime_type` to route the file to the correct analyzer (`_analyze_image`, `_analyze_pdf`, or `_analyze_text`).
Design Decision: raw_data is set to the file path rather than the file contents because images and PDFs can be very large. Loading them eagerly would exhaust memory. Instead, each analyzer loads the file on demand.
```python
@dataclass
class ScanResult:
    resources: List[UniversalResource]               # All discovered files
    total_bytes: int = 0                             # Sum of all file sizes
    skipped_count: int = 0                           # Files/dirs skipped by filter
    error_count: int = 0                             # Files that failed to scan
    errors: List[str] = field(default_factory=list)  # Error messages
```

Purpose: Aggregated output from a directory scan. Provides statistics for the TUI (total bytes, file count) and error tracking for robustness. Note that the `errors` field needs `field(default_factory=list)`: a bare `[]` mutable default would raise a `ValueError` at class definition time.
Property:
- `count` → `int`: returns `len(self.resources)`.
Purpose: Initializes the MIME detection engine (python-magic) and sets up skip patterns.
Default Skip Patterns: `Instructions`, `.venv`, `venv`, `__pycache__`, `.git`, `.DS_Store`, `.idea`, `.pytest_cache`, `node_modules`, `.github`, `models`, `.env`
Parameter: skip_patterns overrides the defaults if provided.
Purpose: Returns True if any skip pattern appears anywhere in the path string.
Algorithm: Simple substring matching — `any(s in path for s in self.skip_patterns)`.
Tradeoff: Substring matching is fast but imprecise (e.g., a file named modelsummary.txt would match models). For this use case, false positives in skip logic are acceptable.
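A minimal sketch of the substring check described above; the function name is hypothetical and the pattern list is abbreviated:

```python
# Abbreviated stand-in for the documented default skip patterns.
DEFAULT_SKIP_PATTERNS = [".venv", "__pycache__", ".git", "node_modules", "models"]

def should_skip(path: str, skip_patterns=DEFAULT_SKIP_PATTERNS) -> bool:
    # Substring match: fast, but "modelsummary.txt" also matches "models".
    return any(pattern in path for pattern in skip_patterns)
```

The documented false positive is easy to reproduce: `should_skip("/repo/modelsummary.txt")` returns `True` because `"models"` is a substring of the filename.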
Purpose: Backward-compatible wrapper around scan(). Returns just the resource list.
When to use: When you only need the file list and don't care about stats/errors.
Purpose: Full DFS traversal with statistics, error tracking, and optional filtering.
Algorithm:
- Validate `root_path` is a directory
- Walk with `os.walk()` (DFS order)
- Prune: remove skip-pattern directories from `dirs[:]` in place (prevents `os.walk` from descending)
- For each file: detect MIME type, get size, create a `UniversalResource`
- Catch `OSError`/`PermissionError` per file and log to `errors`
Performance Note: Directory pruning (dirs[:] = [...]) is O(n) per directory but prevents the walker from entering massive skip directories like node_modules/, which can contain 100k+ files.
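The pruning step can be sketched as follows; the generator name and signature are hypothetical:

```python
import os

def walk_pruned(root: str, skip_patterns: list[str]):
    """Yield file paths while pruning skip-pattern directories in place."""
    for dirpath, dirs, files in os.walk(root):
        # In-place slice assignment is what stops os.walk from descending;
        # rebinding `dirs = [...]` would have no effect on the walker.
        dirs[:] = [d for d in dirs if not any(p in d for p in skip_patterns)]
        for name in files:
            yield os.path.join(dirpath, name)
```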
Parameter: apply_filter=False disables all filtering — useful for testing.
Purpose: Loads master_template.md from the specified directory (or auto-resolves from ../templates/).
Behavior:
- Reads the raw template file
- Generates a `cleaned` version (HTML comments stripped)
- Generates a `compacted` version (comments stripped + duplicate placeholders collapsed)
- If the file is missing, activates Recovery Mode with a minimal fallback template
Purpose: Strips ALL HTML comments (<!-- ... -->) and collapses 3+ consecutive newlines to 2.
Regex: re.sub(r'<!--.*?-->', '', template, flags=re.DOTALL) — the DOTALL flag ensures multi-line comments are matched.
Important: This is the foundation of the "no instruction leakage" guarantee. By stripping every HTML comment, we ensure no <!-- AI INSTRUCTION: --> markers ever reach the model.
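A minimal sketch of the clean step, combining the documented regex with the newline collapse; the function name is an assumption:

```python
import re

def clean_template(template: str) -> str:
    # Strip every HTML comment; DOTALL lets '.' cross newlines so
    # multi-line <!-- AI INSTRUCTION: ... --> blocks are matched whole.
    no_comments = re.sub(r'<!--.*?-->', '', template, flags=re.DOTALL)
    # Collapse runs of 3+ consecutive newlines down to 2.
    return re.sub(r'\n{3,}', '\n\n', no_comments)
```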
Purpose: Aggressively reduces template size for minimal token usage.
Additional Operations (beyond clean):
- Collapses consecutive `**{{PLACEHOLDER}}**` table rows into a single row
- Collapses consecutive `${{VAR}}$` rows
- Collapses consecutive `{{CODE_LINE_N}}` placeholders
Use Case: When the model's context window is limited and every token counts.
| Property | Type | Description |
|---|---|---|
| `raw` | `str` | Original, unmodified template content |
| `cleaned` | `str` | Template with HTML comments stripped |
| `compacted` | `str` | Aggressively minimized template |
| `path` | `str` | Absolute path to the template file |
| Constant | Value |
|---|---|
| `MERMAID_CLASSDEF_DEFAULT` | `classDef default fill:#1a1a1a,stroke:#bc13fe,...` |
| `MERMAID_CLASSDEF_HIGHLIGHT` | `classDef highlight fill:#2a0a3a,stroke:#00f3ff,...` |
These are the canonical source of truth for cyberpunk Mermaid styling. Used by both PromptBuilder (injected into system prompt) and MermaidFixer (auto-injected into output).
Purpose: Constructs the complete prompt: system rules + template + source input.
Structure:
[System Rules: Core Directives, Section Rules, Mermaid Rules, Output Rules]
--- MASTER TEMPLATE START ---
[Template Markdown]
--- MASTER TEMPLATE END ---
SOURCE INPUT TO SYNTHESIZE:
[Raw Content]
Parameter is_image: When True, the context label changes from "structured data" to "visual architecture", which subtly shifts the model's interpretation of the input.
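The layout above can be sketched as a single string assembly; the function name and the exact placement of the context label are assumptions:

```python
def build_prompt(system_rules: str, template: str, source: str,
                 is_image: bool = False) -> str:
    # Context label shifts the model's framing of the input, per the docs.
    context = "visual architecture" if is_image else "structured data"
    return (
        f"{system_rules}\n\n"
        "--- MASTER TEMPLATE START ---\n"
        f"{template}\n"
        "--- MASTER TEMPLATE END ---\n\n"
        f"SOURCE INPUT TO SYNTHESIZE ({context}):\n"
        f"{source}"
    )
```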
Purpose: Generates the complete set of Knowledge Architect rules as a single string.
Rule Categories:
- CORE DIRECTIVES (4 rules): Authorship, Synthesis > Summary, Formatting, Academic Tone
- SECTION-SPECIFIC RULES (9 sections): Document Record, Core Concepts, Visual Knowledge Graph, Technical Deep Dive, Annotated Glossary, Exam Preparation, Curated Study, Quick Reference, Metacognitive Calibration
- OUTPUT RULES (3 rules): Clean Markdown only, replace placeholders, generate all 10 sections
Design Decision: All rules are in one method rather than spread across multiple files. This makes it trivial to audit, modify, or extend the rule set.
Purpose: Maps any MIME type to one of 6 processing strategies.
Returns one of: 'image', 'pdf', 'office', 'structured', 'text', 'binary'
Classification Priority:
- Check if MIME is in `IMAGE_TYPES` or starts with `image/` → `'image'`
- Check if MIME is in `PDF_TYPES` → `'pdf'`
- Check if MIME is in `OFFICE_TYPES` → `'office'`
- Check if MIME is in `STRUCTURED_TYPES` → `'structured'`
- Check if MIME is in `BINARY_TYPES` or matches the binary heuristic → `'binary'`
- Default fallback → `'text'` (safest: most unknown types are readable)
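The priority chain might look like this sketch, with heavily abbreviated stand-ins for the real MIME-type constants:

```python
# Abbreviated stand-ins; the real constants cover many more types.
IMAGE_TYPES = {"image/jpeg", "image/png"}
PDF_TYPES = {"application/pdf"}
OFFICE_TYPES = {"application/vnd.openxmlformats-officedocument.wordprocessingml.document"}
STRUCTURED_TYPES = {"application/json", "text/csv"}
BINARY_TYPES = {"application/zip", "application/octet-stream"}

def looks_binary(mime: str) -> bool:
    # Heuristic from the docs: media/font prefixes plus binary-ish keywords.
    prefixes = ("audio/", "video/", "font/")
    keywords = ("octet-stream", "executable", "archive", "compressed")
    return mime.startswith(prefixes) or any(k in mime for k in keywords)

def classify(mime: str) -> str:
    # Checks run in the documented priority order; first match wins.
    if mime in IMAGE_TYPES or mime.startswith("image/"):
        return "image"
    if mime in PDF_TYPES:
        return "pdf"
    if mime in OFFICE_TYPES:
        return "office"
    if mime in STRUCTURED_TYPES:
        return "structured"
    if mime in BINARY_TYPES or looks_binary(mime):
        return "binary"
    return "text"  # safest default: most unknown types are readable
```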
Covered MIME Types:
| Category | MIME Types |
|---|---|
| Image | jpeg, png, gif, bmp, tiff, webp, svg+xml, heic, heif, x-icon |
| PDF | application/pdf |
| Office | docx, pptx, xlsx, odt, ods, odp, doc, xls, ppt |
| Structured | json, csv, xml, yaml, tab-separated-values |
| Text | plain, html, css, javascript, python, java, c, c++, go, rust, ruby, perl, shell, markdown, rst, tex, latex, diff, patch, log, config |
| Binary | octet-stream, zip, gzip, tar, 7z, rar, jar, exe, mach-binary, sharedlib, wasm, sqlite, audio/, video/, font/* |
Purpose: Heuristic for detecting likely binary MIME types not in the explicit set.
Checks: audio/, video/, font/ prefixes, and keywords like octet-stream, executable, archive, compressed.
Purpose: Reads content from any file format, gracefully handling encoding issues and size limits.
Encoding Fallback Chain: UTF-8 → Latin-1 → UTF-8 with error replacement.
Truncation: Files exceeding max_chars are truncated with a [...truncated...] marker.
Design Decision: Triple encoding fallback ensures no file crashes the pipeline. Latin-1 accepts any byte sequence (0x00–0xFF), so it never fails. The error replacement encoding is the final safety net.
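A sketch of the fallback chain, assuming the 12,000-character cap quoted elsewhere in these docs:

```python
def read_text_file(path: str, max_chars: int = 12_000) -> str:
    """Try UTF-8, then Latin-1, then UTF-8 with replacement characters."""
    for kwargs in ({"encoding": "utf-8"},
                   {"encoding": "latin-1"},
                   {"encoding": "utf-8", "errors": "replace"}):
        try:
            with open(path, **kwargs) as fh:
                text = fh.read()
            break
        except UnicodeDecodeError:
            continue
    if len(text) > max_chars:
        text = text[:max_chars] + "\n[...truncated...]"
    return text
```

In practice the third branch is unreachable because Latin-1 decodes every byte value, but it documents the intended safety net.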
Purpose: Parses JSON and pretty-prints it with 2-space indent for model readability.
Fallback: Falls back to read_text_file() on JSON decode errors.
Purpose: Reads CSV and formats rows as pipe-delimited text.
Truncation: Stops at max_rows with a truncation marker.
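A possible shape for the CSV reader; the `max_rows` default is an assumption:

```python
import csv

def read_csv_file(path: str, max_rows: int = 100) -> str:
    """Format CSV rows as pipe-delimited text, truncating at max_rows."""
    lines = []
    with open(path, newline="") as fh:
        for i, row in enumerate(csv.reader(fh)):
            if i >= max_rows:
                lines.append("[...truncated...]")
                break
            lines.append(" | ".join(row))
    return "\n".join(lines)
```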
Purpose: Extracts text from Office documents (.docx, .pptx, .xlsx) by reading their internal XML files.
Algorithm: Office documents are ZIP archives containing XML. This method:
- Opens the document as a `ZipFile`
- Finds XML files matching `document`, `slide`, `sheet`, or `content` patterns
- Strips XML tags with a regex
- Joins the extracted text
Limitations: Cannot read encrypted documents or extract formatting. For encrypted docs, returns a descriptive message instead of crashing.
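The ZIP-of-XML approach can be sketched as follows; the filename patterns and tag-stripping regex are assumptions based on the description above:

```python
import re
import zipfile

def read_office_file(path: str) -> str:
    """Extract plain text from the XML parts of an Office ZIP archive."""
    try:
        with zipfile.ZipFile(path) as zf:
            parts = []
            for name in zf.namelist():
                if name.endswith(".xml") and re.search(
                        r"document|slide|sheet|content", name):
                    xml = zf.read(name).decode("utf-8", errors="replace")
                    # Replace tags with spaces so adjacent runs stay separated.
                    parts.append(re.sub(r"<[^>]+>", " ", xml))
            return "\n".join(parts)
    except zipfile.BadZipFile:
        # Encrypted/corrupt documents: descriptive message instead of a crash.
        return f"[Could not extract text from {path}]"
```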
Purpose: Generates a metadata summary for binary files.
Output: File name, extension, size in bytes, and a prompt asking the model to generate a study guide about the file type itself.
Purpose: Loads the Gemma 3 model into Apple Silicon unified memory.
Initialization Steps:
- Call `mlx_lm.load(model_path)` → returns `(model, tokenizer)`
- Create `TemplateLoader()` → loads and processes the master template
- Store `master_template` (raw) and `_prompt_template` (cleaned)
Memory: The Gemma 3 4B model uses ~5 GB of unified memory. The 12B variant needs ~16 GB.
Purpose: Routes a UniversalResource to the correct analyzer using MimeClassifier.
Routing Table:
| Strategy | Analyzer | File Types |
|---|---|---|
| `image` | `_analyze_image()` | JPEG, PNG, GIF, BMP, TIFF, WebP, HEIC |
| `pdf` | `_analyze_pdf()` | PDF (with OCR fallback) |
| `office` | `_analyze_office()` | DOCX, PPTX, XLSX, ODT, etc. |
| `structured` | `_analyze_structured()` | JSON, CSV, XML, YAML |
| `binary` | `_analyze_binary()` | ZIP, audio, video, fonts, executables |
| `text` | `_analyze_text()` | Python, Java, C, HTML, CSS, Markdown, shell scripts, etc. |
Pipeline: PIL open → RGB convert → multimodal prompt → stream → PostProcessor
Pipeline: PyMuPDF extract → OCR fallback (if <100 chars) → prompt → stream → PostProcessor
Performance: Text capped at 12,000 chars. OCR renders first 2 pages at 150 DPI.
Pipeline: TextExtractor.read_office_file() → prompt → stream → PostProcessor
New in v2.1: Handles .docx, .pptx, .xlsx, .odt by extracting XML text from the ZIP archive.
Pipeline: TextExtractor (JSON/CSV/text fallback) → prompt → stream → PostProcessor
New in v2.1: Pretty-prints JSON, formats CSV as pipe-delimited tables.
Pipeline: TextExtractor.read_binary_preview() → prompt → stream → PostProcessor
New in v2.1: Instead of crashing on binary files, generates a metadata summary and asks the model to explain the file type.
Pipeline: TextExtractor.read_text_file() → prompt → stream → PostProcessor
Improved in v2.1: Now uses encoding fallback (UTF-8 → Latin-1 → replace) and caps content at 12,000 characters.
Purpose: Repairs common Mermaid diagram issues in LLM output.
Pipeline:
- `_replace_forbidden_types()` → converts sequenceDiagram/mindmap/classDiagram to `graph TD`
- `_inject_classdef()` → adds cyberpunk classDef lines if missing
- `_remove_inline_styles()` → strips `style NodeID fill:...` directives
- `_remove_semicolons()` → strips trailing `;` from mermaid lines
Regex Pattern for blocks: r'```mermaid\n.*?```' with re.DOTALL — matches the entire mermaid code fence.
classDef Injection Logic: Only injects if classDef default is NOT already present. Finds the diagram type line (e.g., graph TD) and inserts classDef on the next line.
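The injection logic might be sketched like this, using an abbreviated classDef string in place of the real constant:

```python
import re

# Abbreviated stand-in for MERMAID_CLASSDEF_DEFAULT.
MERMAID_CLASSDEF_DEFAULT = "classDef default fill:#1a1a1a,stroke:#bc13fe"

def inject_classdef(markdown: str) -> str:
    """Insert the classDef line into any mermaid fence that lacks one."""
    def fix_block(match: re.Match) -> str:
        block = match.group(0)
        if "classDef default" in block:
            return block  # already styled; leave untouched
        lines = block.split("\n")
        # lines[0] is ```mermaid, lines[1] is the diagram type (e.g. graph TD),
        # so the classDef goes on the line right after the type.
        lines.insert(2, MERMAID_CLASSDEF_DEFAULT)
        return "\n".join(lines)
    # DOTALL so the fence match spans the whole multi-line block.
    return re.sub(r"```mermaid\n.*?```", fix_block, markdown, flags=re.DOTALL)
```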
Purpose: Removes instruction markers that leak from the template into the output.
Leak Patterns Detected:
- `<!-- AI INSTRUCTION ... -->` (HTML comment format)
- `[[AI INSTRUCTION]] ...` (bracket format)
- `**RULES:** ...` (bold marker)
- `**DIAGRAM SELECTION:** ...` (selection marker)
- `**BLOCK SELECTION:** ...` (block marker)
- `**HARD RULES ...` (hard rules marker)
- `{{UPPERCASE_PLACEHOLDER}}` (unfilled placeholders)
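A hedged sketch of the leak-stripping pass; the exact regexes in OutputCleaner may differ:

```python
import re

def strip_leaks(text: str) -> str:
    """Remove instruction markers that leaked from the template."""
    # Multi-line HTML comments need DOTALL; the remaining patterns
    # deliberately stop at end of line (default '.' behavior).
    text = re.sub(r"<!--\s*AI INSTRUCTION.*?-->", "", text, flags=re.DOTALL)
    for pattern in (r"\[\[AI INSTRUCTION\]\].*",
                    r"\*\*RULES:\*\*.*",
                    r"\*\*DIAGRAM SELECTION:\*\*.*",
                    r"\*\*BLOCK SELECTION:\*\*.*",
                    r"\*\*HARD RULES.*",
                    r"\{\{[A-Z0-9_]+\}\}"):
        text = re.sub(pattern, "", text)
    return text
```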
Purpose: Checks that generated output meets structural requirements.
Checks Performed:
- All 10 required sections present (case-insensitive search)
- Mermaid code fence exists
- Exam questions exist (`QUESTION 01` or `QUESTION 1`)
- No leaked instruction markers
- No unfilled placeholders
Validity Criteria: Output is valid if:
- At most 2 sections are missing AND
- Mermaid diagram is present AND
- Exam questions are present
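The criteria translate directly into a boolean predicate; the standalone function below is a sketch, since the real check presumably lives inside the validator:

```python
def is_valid(sections_missing: list[str], has_mermaid: bool,
             has_exam_questions: bool) -> bool:
    # Output is accepted when at most 2 sections are missing AND
    # both the Mermaid diagram and the exam questions are present.
    return (len(sections_missing) <= 2
            and has_mermaid
            and has_exam_questions)
```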
```python
@dataclass
class ValidationResult:
    is_valid: bool
    sections_found: List[str]
    sections_missing: List[str]
    has_mermaid: bool
    has_exam_questions: bool
    has_source_archive: bool
    warnings: List[str]
```

Purpose: Orchestrates the full post-processing pipeline.
Pipeline:
- `OutputCleaner.clean()` — strip leaked instructions
- `MermaidFixer.fix()` — repair diagrams
- Whitespace normalization — collapse 3+ newlines
- `OutputValidator.validate()` — log warnings (non-blocking)
Design Decision: Validation is non-blocking — it logs warnings but does not reject output. This is intentional: a study guide missing 1-2 sections is still valuable. The warnings help with debugging and quality tracking.
Purpose: Creates the Instructions/ output directory.
Behavior: Uses os.makedirs(exist_ok=True) — idempotent, safe to call multiple times.
Purpose: Post-processes and saves a study guide.
Naming Convention: {original_name}_StudyGuide.md with spaces replaced by underscores.
Post-Processing: When post_process=True (default), runs PostProcessor.process() before writing. This is the final safety net — even if the engine produces dirty output, the saved file will be clean.
Purpose: Reads a saved guide and runs OutputValidator.validate() on it.
Use Case: Automated quality checks on previously generated guides.
Maps MIME type substrings to emoji icons. Falls back to 📦 for unknown types.
Formats byte counts as human-readable strings (B, KB, MB, GB, TB).
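A plausible sketch of the formatter; the 1024 thresholds and one-decimal rounding are assumptions:

```python
def format_bytes(n: float) -> str:
    """Render a byte count as a human-readable size string."""
    for unit in ("B", "KB", "MB", "GB"):
        if n < 1024:
            # Whole bytes need no decimal point; larger units get one.
            return f"{n} B" if unit == "B" else f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} TB"
```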
Calculates the knowledge amplification ratio and renders it as 1-5 colored stars.
Checks if a path matches any skip pattern. Used in the TUI's Phase 2 to filter resources.
Prints a phase header with consistent styling.
Purpose: The main pipeline orchestrator.
4-Phase Flow:
1. Neural Initialization: load Gemma 3, init scanner and formatter
2. Deep Scan: traverse CWD, filter, display the resource table
3. Knowledge Synthesis: process each file with live progress bars and token callbacks
4. Mission Report: display the results table and constellation footer