
StarryNote v2.1 — Function Explanations

Purpose: Detailed documentation of every class, method, and function in the codebase.
Generated: 2026-03-07


Table of Contents

  • src/scanner.py — UniversalResource, ScanResult, StarryScanner
  • src/template_loader.py — TemplateLoader
  • src/prompt_builder.py — PromptBuilder
  • src/model_engine.py — MimeClassifier, TextExtractor, StarryEngine
  • src/postprocessor.py — MermaidFixer, OutputCleaner, OutputValidator, PostProcessor
  • src/formatter.py — StarryFormatter
  • main.py — TUI Pipeline
src/scanner.py — UniversalResource, ScanResult, StarryScanner

UniversalResource (dataclass)

@dataclass
class UniversalResource:
    file_path: str       # Absolute path to the file
    mime_type: str       # MIME type (e.g., 'image/jpeg', 'application/pdf')
    raw_data: Any        # Path reference for downstream processing
    size_bytes: int = 0  # File size in bytes

Purpose: Immutable container for a discovered file. The StarryEngine uses mime_type to route the file to the correct analyzer (_analyze_image, _analyze_pdf, or _analyze_text).

Design Decision: raw_data is set to the file path rather than the file contents because images and PDFs can be very large. Loading them eagerly would exhaust memory. Instead, each analyzer loads the file on demand.


ScanResult (dataclass)

@dataclass
class ScanResult:
    resources: List[UniversalResource]  # All discovered files
    total_bytes: int = 0                # Sum of all file sizes
    skipped_count: int = 0              # Files/dirs skipped by filter
    error_count: int = 0                # Files that failed to scan
    errors: List[str] = field(default_factory=list)  # Error messages (dataclasses reject mutable defaults like [])

Purpose: Aggregated output from a directory scan. Provides statistics for the TUI (total bytes, file count) and error tracking for robustness.

Property:

  • count (int): Returns len(self.resources).

StarryScanner

__init__(skip_patterns: Optional[Set[str]] = None)

Purpose: Initializes the MIME detection engine (python-magic) and sets up skip patterns.

Default Skip Patterns: Instructions, .venv, venv, __pycache__, .git, .DS_Store, .idea, .pytest_cache, node_modules, .github, models, .env

Parameter: skip_patterns overrides the defaults if provided.


should_skip(path: str) -> bool

Purpose: Returns True if any skip pattern appears anywhere in the path string.

Algorithm: Simple substring matching — any(s in path for s in self.skip_patterns).

Tradeoff: Substring matching is fast but imprecise (e.g., a file named modelsummary.txt would match models). For this use case, false positives in skip logic are acceptable.
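
A minimal sketch of this check, with the pattern set abbreviated from the defaults above, illustrating both the intended hit and the false positive:

```python
# Abbreviated pattern set; the real defaults are listed above.
SKIP_PATTERNS = {"models", ".git", "node_modules"}

def should_skip(path: str) -> bool:
    # Any pattern occurring anywhere in the path string triggers a skip.
    return any(pattern in path for pattern in SKIP_PATTERNS)
```

Here should_skip("project/models/weights.bin") and should_skip("project/modelsummary.txt") both return True; the second is the false positive described above.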


scan_directory(root_path: str) -> List[UniversalResource]

Purpose: Backward-compatible wrapper around scan(). Returns just the resource list.

When to use: When you only need the file list and don't care about stats/errors.


scan(root_path: str, apply_filter: bool = True) -> ScanResult

Purpose: Full DFS traversal with statistics, error tracking, and optional filtering.

Algorithm:

  1. Validate root_path is a directory
  2. Walk with os.walk() (DFS order)
  3. Prune: Remove skip-pattern directories from dirs[:] in-place (prevents os.walk from descending)
  4. For each file: detect MIME type, get size, create UniversalResource
  5. Catch OSError/PermissionError per file and log to errors

Performance Note: Directory pruning (dirs[:] = [...]) is O(n) per directory but prevents the walker from entering massive skip directories like node_modules/, which can contain 100k+ files.

Parameter: apply_filter=False disables all filtering — useful for testing.
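
The pruning step can be sketched as follows; the skip set here is illustrative, and the real scanner also records sizes, MIME types, and errors:

```python
import os

SKIP = {"node_modules", ".git", "__pycache__"}  # illustrative subset

def walk_pruned(root: str) -> list:
    found = []
    for dirpath, dirs, files in os.walk(root):
        # In-place slice assignment is essential: os.walk consults this
        # same list object to decide which subdirectories to descend into.
        dirs[:] = [d for d in dirs if d not in SKIP]
        for name in files:
            found.append(os.path.join(dirpath, name))
    return found
```

Note that rebinding with dirs = [...] instead of dirs[:] = [...] would silently disable the pruning, because os.walk would still hold the original list.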


src/template_loader.py — TemplateLoader

TemplateLoader

__init__(template_dir: Optional[str] = None)

Purpose: Loads master_template.md from the specified directory (or auto-resolves from ../templates/).

Behavior:

  1. Reads the raw template file
  2. Generates cleaned version (HTML comments stripped)
  3. Generates compacted version (comments stripped + duplicate placeholders collapsed)
  4. If the file is missing, activates Recovery Mode with a minimal fallback template

clean(template: str) -> str (static method)

Purpose: Strips ALL HTML comments (<!-- ... -->) and collapses 3+ consecutive newlines to 2.

Regex: re.sub(r'<!--.*?-->', '', template, flags=re.DOTALL) — the DOTALL flag ensures multi-line comments are matched.

Important: This is the foundation of the "no instruction leakage" guarantee. By stripping every HTML comment, we ensure no <!-- AI INSTRUCTION: --> markers ever reach the model.
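
A sketch of the two-step cleanup under the stated rules (the exact newline-collapse regex is an assumption):

```python
import re

def clean(template: str) -> str:
    # DOTALL lets '.' cross newlines, so multi-line comments match too.
    no_comments = re.sub(r'<!--.*?-->', '', template, flags=re.DOTALL)
    # Collapse runs of 3+ newlines down to 2 (assumed implementation).
    return re.sub(r'\n{3,}', '\n\n', no_comments)
```

A multi-line <!-- AI INSTRUCTION: ... --> block disappears entirely, and the blank lines it leaves behind are collapsed.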


make_compact(template: str) -> str (class method)

Purpose: Aggressively reduces template size for minimal token usage.

Additional Operations (beyond clean):

  1. Collapses consecutive **{{PLACEHOLDER}}** table rows into a single row
  2. Collapses consecutive ${{VAR}}$ rows
  3. Collapses consecutive {{CODE_LINE_N}} placeholders

Use Case: When the model's context window is limited and every token counts.


Properties

| Property | Type | Description |
| --- | --- | --- |
| raw | str | Original, unmodified template content |
| cleaned | str | Template with HTML comments stripped |
| compacted | str | Aggressively minimized template |
| path | str | Absolute path to the template file |

src/prompt_builder.py — PromptBuilder

PromptBuilder

Class Constants

| Constant | Value |
| --- | --- |
| MERMAID_CLASSDEF_DEFAULT | classDef default fill:#1a1a1a,stroke:#bc13fe,... |
| MERMAID_CLASSDEF_HIGHLIGHT | classDef highlight fill:#2a0a3a,stroke:#00f3ff,... |

These are the canonical source of truth for cyberpunk Mermaid styling. Used by both PromptBuilder (injected into system prompt) and MermaidFixer (auto-injected into output).


build(template: str, raw_content: str, is_image: bool = False) -> str (class method)

Purpose: Constructs the complete prompt: system rules + template + source input.

Structure:

[System Rules: Core Directives, Section Rules, Mermaid Rules, Output Rules]
--- MASTER TEMPLATE START ---
[Template Markdown]
--- MASTER TEMPLATE END ---
SOURCE INPUT TO SYNTHESIZE:
[Raw Content]

Parameter is_image: When True, the context label changes from "structured data" to "visual architecture", which subtly shifts the model's interpretation of the input.


_build_rules(context_label: str) -> str (class method, internal)

Purpose: Generates the complete set of Knowledge Architect rules as a single string.

Rule Categories:

  1. CORE DIRECTIVES (4 rules): Authorship, Synthesis > Summary, Formatting, Academic Tone
  2. SECTION-SPECIFIC RULES (9 sections): Document Record, Core Concepts, Visual Knowledge Graph, Technical Deep Dive, Annotated Glossary, Exam Preparation, Curated Study, Quick Reference, Metacognitive Calibration
  3. OUTPUT RULES (3 rules): Clean Markdown only, replace placeholders, generate all 10 sections

Design Decision: All rules are in one method rather than spread across multiple files. This makes it trivial to audit, modify, or extend the rule set.


src/model_engine.py — MimeClassifier, TextExtractor, StarryEngine

MimeClassifier

Purpose: Maps any MIME type to one of 6 processing strategies.

classify(mime_type: str) -> str (class method)

Returns one of: 'image', 'pdf', 'office', 'structured', 'text', 'binary'

Classification Priority:

  1. Check if MIME is in IMAGE_TYPES or starts with image/ → 'image'
  2. Check if MIME is in PDF_TYPES → 'pdf'
  3. Check if MIME is in OFFICE_TYPES → 'office'
  4. Check if MIME is in STRUCTURED_TYPES → 'structured'
  5. Check if MIME is in BINARY_TYPES or matches binary heuristic → 'binary'
  6. Default fallback → 'text' (safest: most unknown types are readable)
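
The priority chain can be sketched like this; the type sets below are heavily abbreviated stand-ins for the real ones:

```python
# Abbreviated stand-ins for the real MIME type sets.
IMAGE_TYPES = {"image/jpeg", "image/png"}
OFFICE_KEYWORDS = ("officedocument", "msword", "opendocument")
STRUCTURED_TYPES = {"application/json", "text/csv", "application/xml"}
BINARY_HINTS = ("octet-stream", "zip", "executable")

def classify(mime_type: str) -> str:
    if mime_type in IMAGE_TYPES or mime_type.startswith("image/"):
        return "image"
    if mime_type == "application/pdf":
        return "pdf"
    if any(key in mime_type for key in OFFICE_KEYWORDS):
        return "office"
    if mime_type in STRUCTURED_TYPES:
        return "structured"
    if mime_type.startswith(("audio/", "video/", "font/")) or any(
        hint in mime_type for hint in BINARY_HINTS
    ):
        return "binary"
    return "text"  # safest default: most unknown types are readable
```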

Covered MIME Types:

| Category | MIME Types |
| --- | --- |
| Image | jpeg, png, gif, bmp, tiff, webp, svg+xml, heic, heif, x-icon |
| PDF | application/pdf |
| Office | docx, pptx, xlsx, odt, ods, odp, doc, xls, ppt |
| Structured | json, csv, xml, yaml, tab-separated-values |
| Text | plain, html, css, javascript, python, java, c, c++, go, rust, ruby, perl, shell, markdown, rst, tex, latex, diff, patch, log, config |
| Binary | octet-stream, zip, gzip, tar, 7z, rar, jar, exe, mach-binary, sharedlib, wasm, sqlite, audio/*, video/*, font/* |

_is_binary_mime(mime_type: str) -> bool (static, internal)

Purpose: Heuristic for detecting likely binary MIME types not in the explicit set.

Checks: audio/, video/, font/ prefixes, and keywords like octet-stream, executable, archive, compressed.


TextExtractor

Purpose: Reads content from any file format, gracefully handling encoding issues and size limits.

read_text_file(file_path, max_chars=12000) -> str (static)

Encoding Fallback Chain: UTF-8 → Latin-1 → UTF-8 with error replacement.

Truncation: Files exceeding max_chars are truncated with a [...truncated...] marker.

Design Decision: Triple encoding fallback ensures no file crashes the pipeline. Latin-1 accepts any byte sequence (0x00–0xFF), so it never fails. The error replacement encoding is the final safety net.
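
A sketch of the fallback chain as described (the exact loop structure is an assumption):

```python
def read_text_file(file_path: str, max_chars: int = 12000) -> str:
    # Try each decoding strategy in order. Latin-1 maps every byte
    # 0x00-0xFF to a character, so step two cannot raise; the
    # errors='replace' variant is kept as a final safety net anyway.
    for kwargs in ({"encoding": "utf-8"},
                   {"encoding": "latin-1"},
                   {"encoding": "utf-8", "errors": "replace"}):
        try:
            with open(file_path, **kwargs) as f:
                text = f.read(max_chars + 1)
            break
        except UnicodeDecodeError:
            continue
    if len(text) > max_chars:
        text = text[:max_chars] + "\n[...truncated...]"
    return text
```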

read_json_file(file_path, max_chars=12000) -> str (static)

Purpose: Parses JSON and pretty-prints it with 2-space indent for model readability.

Fallback: Falls back to read_text_file() on JSON decode errors.

read_csv_file(file_path, max_rows=100) -> str (static)

Purpose: Reads CSV and formats rows as pipe-delimited text.

Truncation: Stops at max_rows with a truncation marker.

read_office_file(file_path, max_chars=12000) -> str (static)

Purpose: Extracts text from Office documents (.docx, .pptx, .xlsx) by reading their internal XML files.

Algorithm: Office documents are ZIP archives containing XML. This method:

  1. Opens as ZipFile
  2. Finds XML files matching document, slide, sheet, or content patterns
  3. Strips XML tags with regex
  4. Joins extracted text

Limitations: Cannot read encrypted documents or extract formatting. For encrypted docs, returns a descriptive message instead of crashing.
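
The extraction algorithm above can be sketched with the standard library alone; the member-name keywords follow the patterns listed, while the tag-stripping regex is an assumption:

```python
import re
import zipfile

def read_office_file(file_path: str, max_chars: int = 12000) -> str:
    parts = []
    with zipfile.ZipFile(file_path) as zf:
        for name in zf.namelist():
            # Keep XML members whose names look like document content.
            if name.endswith(".xml") and any(
                key in name for key in ("document", "slide", "sheet", "content")
            ):
                xml = zf.read(name).decode("utf-8", errors="replace")
                # Replace tags with spaces, then normalize whitespace.
                parts.append(re.sub(r"<[^>]+>", " ", xml))
    text = " ".join(" ".join(parts).split())
    return text[:max_chars]
```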

read_binary_preview(file_path, max_bytes=2000) -> str (static)

Purpose: Generates a metadata summary for binary files.

Output: File name, extension, size in bytes, and a prompt asking the model to generate a study guide about the file type itself.


StarryEngine

__init__(model_path: str = "google/gemma-3-4b-it")

Purpose: Loads the Gemma 3 model into Apple Silicon unified memory.

Initialization Steps:

  1. Call mlx_lm.load(model_path) → returns (model, tokenizer)
  2. Create TemplateLoader() → loads and processes the master template
  3. Store master_template (raw) and _prompt_template (cleaned)

Memory: The Gemma 3 4B model uses ~5 GB of unified memory. The 12B variant needs ~16 GB.


process_resource(resource: UniversalResource, on_token=None) -> str

Purpose: Routes a UniversalResource to the correct analyzer using MimeClassifier.

Routing Table:

| Strategy | Analyzer | File Types |
| --- | --- | --- |
| image | _analyze_image() | JPEG, PNG, GIF, BMP, TIFF, WebP, HEIC |
| pdf | _analyze_pdf() | PDF (with OCR fallback) |
| office | _analyze_office() | DOCX, PPTX, XLSX, ODT, etc. |
| structured | _analyze_structured() | JSON, CSV, XML, YAML |
| binary | _analyze_binary() | ZIP, audio, video, fonts, executables |
| text | _analyze_text() | Python, Java, C, HTML, CSS, Markdown, shell scripts, etc. |

_analyze_image(image_path, on_token=None) -> str

Pipeline: PIL open → RGB convert → multimodal prompt → stream → PostProcessor


_analyze_pdf(file_path, on_token=None) -> str

Pipeline: PyMuPDF extract → OCR fallback (if <100 chars) → prompt → stream → PostProcessor

Performance: Text capped at 12,000 chars. OCR renders first 2 pages at 150 DPI.
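
The real pipeline depends on PyMuPDF and an OCR engine; only the fallback decision and the character cap are sketched here, using the thresholds stated above:

```python
MIN_TEXT_CHARS = 100    # below this, assume a scanned (image-only) PDF
MAX_TEXT_CHARS = 12000  # cap before prompting the model

def needs_ocr(extracted_text: str) -> bool:
    # Fewer than 100 extractable characters suggests the PDF has no
    # text layer, so rendering pages for OCR is worth the cost.
    return len(extracted_text.strip()) < MIN_TEXT_CHARS

def cap_text(text: str) -> str:
    if len(text) > MAX_TEXT_CHARS:
        return text[:MAX_TEXT_CHARS] + "\n[...truncated...]"
    return text
```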


_analyze_office(file_path, on_token=None) -> str

Pipeline: TextExtractor.read_office_file() → prompt → stream → PostProcessor

New in v2.1: Handles .docx, .pptx, .xlsx, .odt by extracting XML text from the ZIP archive.


_analyze_structured(file_path, mime_type, on_token=None) -> str

Pipeline: TextExtractor (JSON/CSV/text fallback) → prompt → stream → PostProcessor

New in v2.1: Pretty-prints JSON, formats CSV as pipe-delimited tables.


_analyze_binary(file_path, on_token=None) -> str

Pipeline: TextExtractor.read_binary_preview() → prompt → stream → PostProcessor

New in v2.1: Instead of crashing on binary files, generates a metadata summary and asks the model to explain the file type.


_analyze_text(file_path, on_token=None) -> str

Pipeline: TextExtractor.read_text_file() → prompt → stream → PostProcessor

Improved in v2.1: Now uses encoding fallback (UTF-8 → Latin-1 → replace) and caps content at 12,000 characters.


src/postprocessor.py — MermaidFixer, OutputCleaner, OutputValidator, PostProcessor

MermaidFixer

Purpose: Repairs common Mermaid diagram issues in LLM output.

fix(text: str) -> str (class method)

Pipeline:

  1. _replace_forbidden_types() → sequenceDiagram/mindmap/classDiagram → graph TD
  2. _inject_classdef() → adds cyberpunk classDef lines if missing
  3. _remove_inline_styles() → strips style NodeID fill:... directives
  4. _remove_semicolons() → strips trailing ; from mermaid lines

Regex Pattern for blocks: r'```mermaid\n.*?```' with re.DOTALL — matches the entire mermaid code fence.

classDef Injection Logic: Only injects if classDef default is NOT already present. Finds the diagram type line (e.g., graph TD) and inserts classDef on the next line.
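
A sketch of that injection logic; the classDef string is abbreviated and the handled diagram-type prefixes are assumptions:

```python
# Abbreviated; the real constant carries the full cyberpunk styling.
CLASSDEF_DEFAULT = "classDef default fill:#1a1a1a,stroke:#bc13fe"

def inject_classdef(mermaid_block: str) -> str:
    # Only inject when no classDef is already present (idempotent).
    if "classDef default" in mermaid_block:
        return mermaid_block
    lines = mermaid_block.splitlines()
    for i, line in enumerate(lines):
        # Insert right after the diagram-type line, e.g. 'graph TD'.
        if line.strip().startswith(("graph ", "flowchart ")):
            lines.insert(i + 1, "    " + CLASSDEF_DEFAULT)
            break
    return "\n".join(lines)
```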


OutputCleaner

Purpose: Removes instruction markers that leak from the template into the output.

clean(text: str) -> str (class method)

Leak Patterns Detected:

  1. <!-- AI INSTRUCTION ... --> (HTML comment format)
  2. [[AI INSTRUCTION]] ... (bracket format)
  3. **RULES:** ... (bold marker)
  4. **DIAGRAM SELECTION:** ... (selection marker)
  5. **BLOCK SELECTION:** ... (block marker)
  6. **HARD RULES ... (hard rules marker)
  7. {{UPPERCASE_PLACEHOLDER}} (unfilled placeholders)
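
A sketch covering a few of these patterns (the regexes themselves are assumptions, abbreviated from the full list):

```python
import re

def clean(text: str) -> str:
    # HTML comment format, possibly spanning multiple lines.
    text = re.sub(r"<!--\s*AI INSTRUCTION.*?-->", "", text, flags=re.DOTALL)
    # Bracket and bold-marker formats, removed per line.
    text = re.sub(r"^\[\[AI INSTRUCTION\]\].*$", "", text, flags=re.MULTILINE)
    text = re.sub(r"^\*\*RULES:\*\*.*$", "", text, flags=re.MULTILINE)
    # Unfilled uppercase placeholders.
    text = re.sub(r"\{\{[A-Z0-9_]+\}\}", "", text)
    return text
```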

OutputValidator

Purpose: Checks that generated output meets structural requirements.

validate(text: str) -> ValidationResult (class method)

Checks Performed:

  1. All 10 required sections present (case-insensitive search)
  2. Mermaid code fence exists
  3. Exam questions exist (QUESTION 01 or QUESTION 1)
  4. No leaked instruction markers
  5. No unfilled placeholders

Validity Criteria: Output is valid if:

  • At most 2 sections are missing AND
  • Mermaid diagram is present AND
  • Exam questions are present
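
Expressed directly as code, the validity rule might look like:

```python
# Tolerate up to two missing sections, but a Mermaid diagram and
# exam questions are mandatory.
def is_valid(sections_missing: list, has_mermaid: bool,
             has_exam_questions: bool) -> bool:
    return (
        len(sections_missing) <= 2
        and has_mermaid
        and has_exam_questions
    )
```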

ValidationResult (dataclass)

@dataclass
class ValidationResult:
    is_valid: bool
    sections_found: List[str]
    sections_missing: List[str]
    has_mermaid: bool
    has_exam_questions: bool
    has_source_archive: bool
    warnings: List[str]

PostProcessor

Purpose: Orchestrates the full post-processing pipeline.

process(raw_output: str) -> str (class method)

Pipeline:

  1. OutputCleaner.clean() — strip leaked instructions
  2. MermaidFixer.fix() — repair diagrams
  3. Whitespace normalization — collapse 3+ newlines
  4. OutputValidator.validate() — log warnings (non-blocking)

Design Decision: Validation is non-blocking — it logs warnings but does not reject output. This is intentional: a study guide missing 1-2 sections is still valuable. The warnings help with debugging and quality tracking.


src/formatter.py — StarryFormatter

StarryFormatter

__init__(current_execution_dir: str)

Purpose: Creates the Instructions/ output directory.

Behavior: Uses os.makedirs(exist_ok=True) — idempotent, safe to call multiple times.


save_guide(original_filepath: str, content: str, post_process: bool = True) -> str

Purpose: Post-processes and saves a study guide.

Naming Convention: {original_name}_StudyGuide.md with spaces replaced by underscores.

Post-Processing: When post_process=True (default), runs PostProcessor.process() before writing. This is the final safety net — even if the engine produces dirty output, the saved file will be clean.
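
The naming convention can be sketched as:

```python
import os

def guide_name(original_filepath: str) -> str:
    # Drop the directory and extension, replace spaces, append suffix.
    stem = os.path.splitext(os.path.basename(original_filepath))[0]
    return stem.replace(" ", "_") + "_StudyGuide.md"
```

For example, guide_name("/notes/My Notes.pdf") yields "My_Notes_StudyGuide.md".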


validate_guide(file_path: str) -> ValidationResult

Purpose: Reads a saved guide and runs OutputValidator.validate() on it.

Use Case: Automated quality checks on previously generated guides.


main.py — TUI Pipeline

TUI Utility Functions

_icon(mime: str) -> str

Maps MIME type substrings to emoji icons. Falls back to 📦 for unknown types.

_sz(n: int) -> str

Formats byte counts as human-readable strings (B, KB, MB, GB, TB).

_density(input_bytes: int, output_len: int) -> str

Calculates the knowledge amplification ratio and renders it as 1-5 colored stars.

_should_skip(path: str) -> bool

Checks if a path matches any skip pattern. Used in the TUI's Phase 2 to filter resources.

_phase(n: int, title: str, glyph: str)

Prints a phase header with consistent styling.

run()

Purpose: The main pipeline orchestrator.

4-Phase Flow:

  1. Neural Initialization: Load Gemma 3, init scanner and formatter
  2. Deep Scan: Traverse CWD, filter, display resource table
  3. Knowledge Synthesis: Process each file with live progress bars and token callbacks
  4. Mission Report: Display results table and constellation footer