Ethycs/layer_context_seg

Project AURA: Adaptive Universal Reorganization Architecture

1. Core Philosophy: The "Tape-to-Graph" Transformation

At its heart, this project treats any linear document—be it a transcript, a codebase, or a research paper—as a one-dimensional "tape" of information. While easy to produce, this linear format obscures the complex, non-linear relationships between the concepts within it. Our system transforms this simple tape into a rich, multi-dimensional knowledge graph.

The process is as follows:

  1. Disassembly (Tape to Nodes): The continuous tape is first segmented into discrete, semantic "nodes." Each node represents a coherent idea, a code block, or a conversational turn. This is not an arbitrary split; it is a meaning-preserving discretization.
  2. Reconstruction (Nodes to Graph): We then establish relational edges between these nodes. These edges represent dependencies, explanations, contradictions, or temporal sequences, creating a true network of knowledge.
  3. Reassembly (Graph to Purpose-Driven Documents): The knowledge graph is not the final output. It is a powerful intermediate representation. From this single graph, we can reassemble the information into countless new "tapes," each optimized for a specific purpose (e.g., a tutorial, an executive summary, a technical reference).
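The three steps above can be sketched minimally in Python. This is an illustrative sketch only: the `Node`/`Edge` dataclasses, the blank-line segmentation, and the consecutive-link edges are stand-ins for the real semantic machinery described in the following sections.

```python
from dataclasses import dataclass


@dataclass
class Node:
    id: int
    text: str


@dataclass
class Edge:
    src: int
    dst: int
    kind: str


def disassemble(tape: str) -> list[Node]:
    """Tape -> nodes: blank-line splitting stands in for semantic segmentation."""
    parts = [p.strip() for p in tape.split("\n\n") if p.strip()]
    return [Node(i, p) for i, p in enumerate(parts)]


def reconstruct(nodes: list[Node]) -> list[Edge]:
    """Nodes -> graph: link consecutive nodes; real edges carry richer types."""
    return [Edge(a.id, b.id, "CONTINUES") for a, b in zip(nodes, nodes[1:])]


def reassemble(nodes: list[Node], order_key) -> str:
    """Graph -> a new purpose-driven tape, reordered for a given goal."""
    return "\n\n".join(n.text for n in sorted(nodes, key=order_key))
```

Different `order_key` functions yield different output "tapes" from the same graph, which is the essence of step 3.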

2. Enhanced Architecture: Edge-Aware Segmentation and Graph Enrichment

The system implements a sophisticated edge-aware segmentation approach that preserves semantic relationships during the tape-to-graph transformation.

2.1. Edge Type System

The system recognizes 10 distinct edge types that capture different semantic relationships:

| Edge Type | Strength | Preserves Unit | Importance | Pattern |
| --- | --- | --- | --- | --- |
| EXPLAINS | 0.9 | Yes | 0.8 | "because", "since", "therefore", "explains why", "this means" |
| ELABORATES | 0.7 | Yes | 0.6 | "specifically", "in detail", "furthermore", "additionally" |
| CONTRADICTS | 0.8 | Yes | 0.9 | "however", "but", "contrary to", "on the other hand" |
| IS_EXAMPLE_OF | 0.85 | Yes | 0.7 | "for example", "such as", "like", "instance of", "e.g." |
| IS_CONSEQUENCE_OF | 0.75 | Yes | 0.8 | "results in", "leads to", "causes", "consequently" |
| DEPENDS_ON | 1.0 | Yes | 1.0 | "requires", "needs", "depends on", "uses", "imports" |
| SUMMARIZES | 0.6 | No | 0.5 | "in summary", "to summarize", "overall", "in short" |
| REFERENCES | 0.4 | No | 0.4 | "see also", "refer to", "mentioned in", "as shown in" |
| CONTINUES | 0.3 | No | 0.2 | "continuing", "moreover", "also", "and", "next" |
| NO_RELATION | 0.0 | No | 0.0 | (default when no pattern matches) |
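Pattern-based inference of this kind reduces to a cue-phrase lookup. The sketch below uses a subset of the table; the `EDGE_RULES` structure and `infer_edge_type` helper are illustrative names, not the project's actual implementation.

```python
import re

# (strength, preserves_unit, importance, cue phrases) per edge type,
# mirroring a subset of the table above; patterns are illustrative.
EDGE_RULES = {
    "EXPLAINS":      (0.9,  True,  0.8, ["because", "since", "therefore", "this means"]),
    "CONTRADICTS":   (0.8,  True,  0.9, ["however", "but", "on the other hand"]),
    "IS_EXAMPLE_OF": (0.85, True,  0.7, ["for example", "such as", "e.g."]),
    "DEPENDS_ON":    (1.0,  True,  1.0, ["requires", "needs", "depends on", "imports"]),
    "CONTINUES":     (0.3,  False, 0.2, ["moreover", "also", "next"]),
}


def infer_edge_type(text: str) -> tuple[str, float]:
    """Return the first edge type whose cue phrase appears, with its strength."""
    lowered = text.lower()
    for kind, (strength, _preserve, _importance, cues) in EDGE_RULES.items():
        if any(re.search(r"\b" + re.escape(cue), lowered) for cue in cues):
            return kind, strength
    return "NO_RELATION", 0.0
```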

2.2. Graph Enrichment Pipeline

After initial segmentation, the system applies a comprehensive 7-phase enrichment pipeline:

  1. Edge Type Inference: Analyzes unlabeled edges to infer their semantic type using transitive relationships, content analysis, and structural patterns
  2. Transitive Relationship Building: Constructs transitive edges for critical relationships (e.g., if A depends on B and B depends on C, then A transitively depends on C)
  3. Semantic Clustering: Applies Louvain community detection to identify topical clusters and assigns cluster representatives
  4. Redundancy Analysis: Identifies merge candidates, redundant continuations, and duplicate examples for consolidation
  5. Global Importance Calculation: Combines PageRank, betweenness centrality, edge-type importance, and cluster importance into a unified score
  6. Narrative Flow Detection: Identifies narrative patterns like setup-payoff, problem-solution, claim-evidence, and concept-example pairs
  7. Dependency Optimization: Finds critical paths, circular dependencies, and optimizes the dependency graph structure
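Phase 2 above amounts to computing a transitive closure over the critical edge set. A minimal stdlib sketch (the `transitive_depends` helper is a hypothetical name; the real pipeline operates on typed graph edges):

```python
def transitive_depends(edges: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """Close DEPENDS_ON edges transitively: A->B and B->C imply A->C.
    Naive fixed-point iteration; adequate for the graph sizes involved."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # Chain matching pairs of edges into a new transitive edge.
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure
```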

2.3. Multi-Dimensional Scoring System

Each segment is evaluated across five dimensions:

  • Uniqueness (0.3 weight): How distinct the content is compared to other segments
  • Completeness (0.25 weight): Whether the segment forms a self-contained thought
  • Edge Importance (0.25 weight): The significance of its relationships to other segments
  • Structural Importance (0.15 weight): Its role in the graph structure (hub, bridge, etc.)
  • Semantic Density (0.05 weight): Information density and technical content
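The weighted combination is a straightforward dot product. `WEIGHTS` mirrors the list above (summing to 1.0); the `segment_score` helper name is an assumption.

```python
# Dimension weights as listed above; they sum to 1.0.
WEIGHTS = {
    "uniqueness": 0.30,
    "completeness": 0.25,
    "edge_importance": 0.25,
    "structural_importance": 0.15,
    "semantic_density": 0.05,
}


def segment_score(dims: dict[str, float]) -> float:
    """Combine per-dimension scores (each in [0, 1]) into one weighted score.
    Missing dimensions default to 0."""
    return sum(WEIGHTS[k] * dims.get(k, 0.0) for k in WEIGHTS)
```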

2.4. Self-Organizing Map (SOM) Integration

The system includes SOM capabilities for advanced spatial analysis:

  • Spatial Organization: Maps segments to a 2D grid based on semantic similarity
  • Colocation Detection: Identifies segments that map to the same SOM position as merge candidates
  • Isolation Analysis: Finds segments that are semantically isolated
  • Path Finding: Multiple strategies for document generation (scenic, hub_tour, edge_guided, direct)
  • Density Analysis: Identifies dense regions of related content
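Colocation detection is essentially a group-by over best-matching-unit positions. A sketch, assuming the SOM has already mapped each segment id to a 2D grid cell (the helper name and input shape are assumptions):

```python
from collections import defaultdict


def colocation_candidates(positions: dict[str, tuple[int, int]]) -> list[list[str]]:
    """Group segment ids that landed on the same SOM cell; any cell holding
    more than one segment yields a merge-candidate group."""
    cells: dict[tuple[int, int], list[str]] = defaultdict(list)
    for seg_id, cell in positions.items():
        cells[cell].append(seg_id)
    return [group for group in cells.values() if len(group) > 1]
```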

3. The Mathematical Justification: Percolation Theory and Optimal Chunking

A critical step in the "Tape-to-Nodes" transformation is determining the ideal size and overlap of our initial text chunks. We ground our approach in Percolation Theory.

Imagine our document's concepts as a physical medium. For information to "percolate" or flow from one end of the document to the other, the chunks (nodes) must be sufficiently connected.

  • Too little overlap (<15%): The chunks are disconnected islands of meaning. The system can understand local concepts but fails to form a "big picture" view. No "giant connected component" of knowledge emerges.
  • Too much overlap (>30%): The system becomes computationally inefficient and redundant. The connections are trivially obvious, and we lose the ability to identify meaningful, non-local relationships.

The critical threshold lies in the 15-30% overlap range. At this density, a phase transition occurs: the graph of nodes becomes globally connected, allowing insights and context to flow across the entire information space. This ensures that the meaning of a concept at the beginning of the tape can influence, and be influenced by, a concept at the end, enabling true retroactive understanding. This mathematically grounded overlap is fundamental to our chunking strategy, ensuring the resulting knowledge graph is both coherent and comprehensive.
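The phase transition can be observed directly by measuring the size of the largest connected component as edges accumulate. A stdlib BFS/DFS sketch (in the real system, edges come from semantic overlap between chunks; here they are given explicitly):

```python
def giant_component_fraction(n_nodes: int, edges: list[tuple[int, int]]) -> float:
    """Fraction of nodes in the largest connected component.
    Near the percolation threshold this fraction jumps toward 1.0."""
    adj = {i: set() for i in range(n_nodes)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen: set[int] = set()
    best = 0
    for start in range(n_nodes):
        if start in seen:
            continue
        # Iterative DFS over one component.
        stack, size = [start], 0
        seen.add(start)
        while stack:
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best / n_nodes if n_nodes else 0.0
```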

4. Current Implementation: Disassembly and Reassembly Rules

The system uses separate disassembly and reassembly rules to first break down text optimally, then reconstruct it with different organizational principles.

4.1. DISASSEMBLY RULES (Breaking Down)

Location: layered-context-graph/src/partitioning/partition_manager.py

Rule K1: Initial Disassembly Rules

Method: _apply_disassembly_rules()

  • Semantic boundaries: Split at paragraphs, major topic shifts
  • Attention clusters: Use attention patterns to find natural breaks
  • Percolation thresholds: Apply percolation theory boundaries
  • Instruction markers: Split at special tokens (QWQ_REASONING, etc.)

Rule K2: Iterative Segmentation Rules

Method: _iterative_segmentation()

  • Target: Continue segmentation until average segment length ≈ 400 characters
  • Max rounds: Up to 5 rounds of refinement
  • Convergence: Stop when segments reach optimal size or no change occurs

Rule K3: Round-Specific Splitting Criteria

Method: _split_by_round_criteria()

  • Round 1: _split_by_semantic_boundaries() - Paragraphs, sentences
  • Round 2: _split_by_syntactic_boundaries() - Clauses, phrases
  • Round 3: _split_by_instruction_markers() - Special tokens
  • Round 4+: _split_by_character_count() - Fallback fixed-size
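The round dispatch can be sketched as follows. The split patterns are rough approximations of the private helpers named above, not their actual implementations:

```python
import re


def split_round(text: str, round_no: int, target: int = 400) -> list[str]:
    """Apply progressively finer split criteria by round, echoing
    _split_by_round_criteria(); the regexes here are illustrative."""
    if round_no == 1:
        # Semantic boundaries: paragraphs.
        parts = re.split(r"\n\s*\n", text)
    elif round_no == 2:
        # Syntactic boundaries: sentence/clause-ending punctuation.
        parts = re.split(r"(?<=[.!?;])\s+", text)
    elif round_no == 3:
        # Instruction markers: special tokens such as QWQ_REASONING.
        parts = re.split(r"\bQWQ_\w+\b", text)
    else:
        # Fallback: fixed-size character windows.
        parts = [text[i:i + target] for i in range(0, len(text), target)]
    return [p.strip() for p in parts if p.strip()]
```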

Rule K4: Merging Criteria (when segments too small)

Method: _merge_segments()

  • Target: Combine small segments up to 1.2x target length
  • Preserve: Semantic coherence during merging
  • Strategy: Greedy combination with overlap management
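Rule K4's greedy combination fits in a few lines. The 1.2x cap comes from the rule above; the helper name is an assumption:

```python
def merge_small_segments(segments: list[str], target: int = 400) -> list[str]:
    """Greedily combine adjacent segments while the combined length stays
    under 1.2x the target, per Rule K4."""
    merged: list[str] = []
    for seg in segments:
        # +1 accounts for the joining space.
        if merged and len(merged[-1]) + len(seg) + 1 <= int(1.2 * target):
            merged[-1] = merged[-1] + " " + seg
        else:
            merged.append(seg)
    return merged
```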

4.2. EDGE-AWARE SEGMENTATION

Location: layered-context-graph/src/partitioning/enhanced_partition_manager.py

The enhanced system adds edge-aware capabilities:

  • Edge Detection During Segmentation: Predicts edge types between content pieces before creating nodes
  • Semantic Unit Preservation: Groups pieces that must stay together based on edge constraints
  • Pattern-Based Detection: Uses regex patterns to identify relationship indicators in text

4.3. REASSEMBLY RULES (Reconstructing)

Location: layered-context-graph/src/graph/graph_reassembler.py

Rule G1: Reconstruction Analysis

Method: _analyze_optimal_segments()

  • Input: Final optimal segments from disassembly
  • Analysis: Categorize segment types, themes, importance
  • Output: Segment analysis metadata for reconstruction

Rule G2: Reconstruction Rules (Different from Disassembly)

Method: _apply_reconstruction_rules()

  • Importance ordering: Organize by conceptual significance
  • Conceptual clustering: Group related concepts together
  • Flow optimization: Arrange for logical reading flow
  • Layered organization: Create hierarchical structure
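Importance ordering plus conceptual clustering amounts to a two-key sort followed by a group-by. A sketch with an assumed input shape of `(segment_id, cluster, importance)` tuples:

```python
from itertools import groupby


def reconstruct_order(segments: list[tuple[str, int, float]]) -> list[tuple[int, list[str]]]:
    """Order segments for reassembly: group by conceptual cluster, then sort
    each cluster by descending importance (Rule G2, sketched)."""
    ranked = sorted(segments, key=lambda s: (s[1], -s[2]))
    layers = []
    for cluster, group in groupby(ranked, key=lambda s: s[1]):
        layers.append((cluster, [seg_id for seg_id, _, _ in group]))
    return layers
```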

Rule G3: Content Generation Rules

Method: _generate_reorganized_content()

  • Layer-based output: Organize content into conceptual layers
  • Flow optimization: Ensure smooth transitions between concepts
  • Coherence preservation: Maintain semantic relationships
  • Readable formatting: Add headers, structure, navigation

4.4. PIPELINE FLOW

Input Text
    ↓
[DISASSEMBLY PHASE]
    ↓
Rule K1: Initial split by semantic/attention boundaries
    ↓
Rule K2: Iterative refinement (Rounds 1-5)
    ├─ Rule K3: Round-specific criteria
    └─ Rule K4: Merging when needed
    ↓
Optimal Segments (avg ~400 chars)
    ↓
[REASSEMBLY PHASE]  
    ↓
Rule G1: Analyze optimal segments
    ↓
Rule G2: Apply reconstruction rules
    ├─ Importance ordering
    ├─ Conceptual clustering  
    ├─ Flow optimization
    └─ Layered organization
    ↓
Rule G3: Generate reorganized content
    ↓
Final Reorganized Output

5. API-Based Implementation

The current implementation uses a robust fallback approach that relies on standard LLM API calls rather than direct attention-head access.

5.1. api_processor.py

Builds a knowledge graph from a document using only standard LLM API calls.

5.2. graph_utils.py

Merges highly similar nodes and prunes weak edges to clean the graph. It also classifies nodes as KEEP, DELETE, or TRACK based on graph metrics and content.
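The KEEP/DELETE/TRACK classification can be sketched as simple thresholding on graph metrics. The thresholds and helper signature below are assumptions for illustration, not graph_utils.py's actual values:

```python
def classify_node(degree: int, importance: float,
                  keep_at: float = 0.7, drop_at: float = 0.2) -> str:
    """Classify a node from graph metrics into KEEP / DELETE / TRACK.
    Thresholds are illustrative placeholders."""
    if importance >= keep_at or degree >= 3:
        return "KEEP"      # central or highly connected: retain
    if importance < drop_at and degree == 0:
        return "DELETE"    # unimportant and isolated: prune
    return "TRACK"         # ambiguous: keep under observation
```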

5.3. condenser.py

The main user-facing class that orchestrates the entire process of transforming a transcript into various structured, condensed outputs.

5.4. Enhanced Components

enhanced_partition_manager.py

Implements the sophisticated edge-aware segmentation with:

  • EdgeTypeRules class defining properties for each edge type
  • Multi-strategy edge inference (transitive, content-based, structural)
  • Graph enrichment pipeline with 7 phases
  • Multi-dimensional scoring system

som_enhanced_partition_manager.py

Adds Self-Organizing Map capabilities:

  • 2D spatial mapping of segments
  • Colocation detection for merge candidates
  • Multiple path-finding strategies for document generation
  • Integration with graph enrichment pipeline

6. Future Work and Aspirational Architecture

This section outlines the planned enhancements and the more advanced, attention-based architecture that the project is working towards.

6.1. Multi-Round Annotation

The system is designed to support a multi-round annotation process where a single base graph is enriched with multiple layers of analysis.

  • Base Graph: Created once from clean semantic chunks.
  • Annotation Layers: Multiple analysis rounds add metadata to the same nodes and edges.
  • Layer Types:
    • Syntactic Layer: Grammar, POS tags, linguistic structure.
    • Semantic Layer: Topics, concepts, meaning relationships.
    • Pragmatic Layer: Intent, discourse, communicative purpose.

6.2. Attention-Driven Graph Creation

A more advanced implementation will use transformer attention patterns to directly guide the graph construction process.

  • Principle: Attention mechanisms slice existing text, never recreate it.
  • Process: The transformer analyzes semantic chunks and builds the knowledge graph directly.
  • Output: A pure graph structure with original content preserved in the nodes and attention patterns as metadata.

6.3. Scaffold-Guided Reconstruction

The reassembly process can be enhanced by using the original document's structure as a scaffold.

  • Scaffold: The original document structure is used as a template.
  • Method: Graph content is mapped back to the original organization.
  • Benefits: This maintains the natural flow of the document while incorporating the insights from the graph.

6.4. Implementation Plan for Advanced Features

  • Instruction Seeder: A module to insert natural language instructions into text to guide attention heads.
  • Attention-Based Graph Builder: A module to build graphs directly from attention patterns.
  • CUDA Optimization: Add GPU acceleration for performance improvements.

7. Master Processor Usage Guide

The master_processor.py is the main entry point for the layered context graph system, providing a complete pipeline from text input to synthesized output.

7.1. Basic Usage

# Process a document with default settings
python master_processor.py --input document.txt

# Use demo content
python master_processor.py --demo simple

# Specify output directory
python master_processor.py --input document.txt --output results/

7.2. Processing Modes

The system supports three main processing modes:

Single-Pass Mode (Default)

Basic processing with QwQ attention extraction.

python master_processor.py --input document.txt --mode single-pass

Multi-Round Mode

Applies multiple annotation layers for deeper analysis.

python master_processor.py --input document.txt --mode multi-round

Language-Guided Mode

Uses natural language rules to guide processing.

python master_processor.py --input document.txt --mode language-guided --rules technical_documentation

7.3. Conversation Processing

For dialogue and transcript formats, use conversation-specific modes:

# Timeline mode - chronological order
python master_processor.py --input conversation.txt --conversation-mode timeline

# Speaker mode - organized by speaker
python master_processor.py --input conversation.txt --conversation-mode speaker

# Evolution mode - shows concept development
python master_processor.py --input conversation.txt --conversation-mode evolution

# Current state mode - final conclusions
python master_processor.py --input conversation.txt --conversation-mode current_state

# Research mode - extracts key insights, code, and roads not taken
python master_processor.py --input conversation.txt --conversation-mode research

7.4. Synthesis Options (Tape₂ Generation)

Generate purpose-specific documents from the knowledge graph:

# Executive summary
python master_processor.py --input document.txt --synthesize executive_summary

# Tutorial format
python master_processor.py --input document.txt --synthesize tutorial

# Technical reference
python master_processor.py --input document.txt --synthesize reference

# README documentation
python master_processor.py --input document.txt --synthesize readme

7.5. Complete Examples

# Process a technical document and generate a tutorial
python master_processor.py --input technical_doc.txt --mode multi-round --synthesize tutorial

# Analyze a conversation and extract research insights
python master_processor.py --input conversation.txt --conversation-mode research --synthesize executive_summary

# Use demo content with verbose output
python master_processor.py --demo technical --mode language-guided --verbose

# Full pipeline with custom output
python master_processor.py \
    --input my_document.txt \
    --mode multi-round \
    --conversation-mode evolution \
    --synthesize reference \
    --output my_results/

7.6. Command-Line Options

| Option | Description | Choices |
| --- | --- | --- |
| --input, -i | Input text file path | Any valid file path |
| --output, -o | Output directory | Any valid directory path |
| --demo | Use demo content | simple, technical, transcript, conversation |
| --mode | Processing mode | single-pass, multi-round, language-guided |
| --conversation-mode | Conversation reassembly mode | timeline, speaker, evolution, current_state, research |
| --synthesize | Generate synthesized content | executive_summary, tutorial, reference, readme |
| --rules | Predefined rule set | Available rule sets from configuration |
| --verbose, -v | Enable verbose logging | Flag (no value needed) |

7.7. Output Files

The processor generates several output files:

  1. Main results: qwq_layered_results_[timestamp].json - Complete processing results
  2. Reassembled text: qwq_layered_results_[timestamp].txt - Reorganized document
  3. Synthesized content: synthesized_[type]_[timestamp].md - If synthesis requested

7.8. GPU Support

The system automatically detects and uses GPU acceleration when available:

  • Supports NVIDIA CUDA GPUs
  • Falls back to CPU when GPU not available
  • Displays GPU information in verbose mode

7.9. Research Mode Features

When using --conversation-mode research, the system extracts:

  1. Most Definitive Ideas: Final, refined versions of concepts
  2. Implementation Code: Practical code examples with context
  3. Roads Not Taken: Alternative approaches that were rejected
  4. Unique Early Ideas: Interesting concepts that weren't fully developed
  5. Concept Evolution: How ideas changed throughout the conversation

This mode is particularly useful for:

  • Analyzing design discussions
  • Extracting actionable insights from brainstorming sessions
  • Creating technical documentation from conversations
  • Understanding decision-making processes

8. Optimizing for Technical Documents

The enhanced system includes specific optimizations for processing technical documents with embedded code:

8.1. Parameter Tuning for Code-Heavy Content

For documents with dense technical content and many code blocks:

# Recommended configuration
manager = SOMEnhancedPartitionManager(
    similarity_threshold=0.75,
    use_graph_aware=True,
    min_segment_size=50    # Much smaller than default 500
)

8.2. Code-Aware Segmentation Features

  • Code Block Preservation: Code blocks are treated as atomic units and never split
  • Context Preservation: Code explanations are kept with their associated code blocks
  • Dialog Format Support: Handles User:/Assistant: conversation formats
  • Pattern Detection: Automatically identifies code-explanation-example patterns

8.3. Expected Results for Technical Documents

For a well-structured technical document, expect:

  • 50-200 nodes (not just 3)
  • 5-10 different edge types detected
  • Multiple semantic clusters around different technical concepts
  • Rich narrative patterns (problem-solution, concept-example)
  • Meaningful importance scores distinguishing core concepts from examples

8.4. Troubleshooting Poor Segmentation

If the system produces too few segments:

  1. Reduce min_segment_size to 50-100 characters
  2. Increase MAX_LEVELS in tree construction to 5+
  3. Use more granular segmentation rules for technical content
  4. Check that the document is being loaded correctly
  5. Enable verbose logging to see segmentation decisions
