At its heart, this project treats any linear document—be it a transcript, a codebase, or a research paper—as a one-dimensional "tape" of information. While easy to produce, this linear format obscures the complex, non-linear relationships between the concepts within it. Our system transforms this simple tape into a rich, multi-dimensional knowledge graph.
The process is as follows:
- Disassembly (Tape to Nodes): The continuous tape is first segmented into discrete, semantic "nodes." Each node represents a coherent idea, a code block, or a conversational turn. This is not an arbitrary split; it is a meaning-preserving discretization.
- Reconstruction (Nodes to Graph): We then establish relational edges between these nodes. These edges represent dependencies, explanations, contradictions, or temporal sequences, creating a true network of knowledge.
- Reassembly (Graph to Purpose-Driven Documents): The knowledge graph is not the final output. It is a powerful intermediate representation. From this single graph, we can reassemble the information into countless new "tapes," each optimized for a specific purpose (e.g., a tutorial, an executive summary, a technical reference).
The system implements a sophisticated edge-aware segmentation approach that preserves semantic relationships during the tape-to-graph transformation.
It recognizes 10 distinct edge types that capture different semantic relationships:
| Edge Type | Strength | Preserves Unit | Importance | Pattern |
|---|---|---|---|---|
| EXPLAINS | 0.9 | Yes | 0.8 | "because", "since", "therefore", "explains why", "this means" |
| ELABORATES | 0.7 | Yes | 0.6 | "specifically", "in detail", "furthermore", "additionally" |
| CONTRADICTS | 0.8 | Yes | 0.9 | "however", "but", "contrary to", "on the other hand" |
| IS_EXAMPLE_OF | 0.85 | Yes | 0.7 | "for example", "such as", "like", "instance of", "e.g." |
| IS_CONSEQUENCE_OF | 0.75 | Yes | 0.8 | "results in", "leads to", "causes", "consequently" |
| DEPENDS_ON | 1.0 | Yes | 1.0 | "requires", "needs", "depends on", "uses", "imports" |
| SUMMARIZES | 0.6 | No | 0.5 | "in summary", "to summarize", "overall", "in short" |
| REFERENCES | 0.4 | No | 0.4 | "see also", "refer to", "mentioned in", "as shown in" |
| CONTINUES | 0.3 | No | 0.2 | "continuing", "moreover", "also", "and", "next" |
| NO_RELATION | 0.0 | No | 0.0 | (default when no pattern matches) |
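As an illustration, the pattern column above can drive a first-pass, cue-based classifier. The sketch below is hypothetical (the cue phrases and strengths come from the table, but `infer_edge_type` and its return shape are ours, not the system's API):

```python
import re

# Cue phrases and strengths copied from the edge-type table above.
EDGE_PATTERNS = {
    "EXPLAINS":          (0.9,  ["because", "since", "therefore", "explains why", "this means"]),
    "ELABORATES":        (0.7,  ["specifically", "in detail", "furthermore", "additionally"]),
    "CONTRADICTS":       (0.8,  ["however", "but", "contrary to", "on the other hand"]),
    "IS_EXAMPLE_OF":     (0.85, ["for example", "such as", "like", "instance of", "e.g."]),
    "IS_CONSEQUENCE_OF": (0.75, ["results in", "leads to", "causes", "consequently"]),
    "DEPENDS_ON":        (1.0,  ["requires", "needs", "depends on", "uses", "imports"]),
    "SUMMARIZES":        (0.6,  ["in summary", "to summarize", "overall", "in short"]),
    "REFERENCES":        (0.4,  ["see also", "refer to", "mentioned in", "as shown in"]),
    "CONTINUES":         (0.3,  ["continuing", "moreover", "also", "and", "next"]),
}

def infer_edge_type(segment_text: str) -> tuple[str, float]:
    """Return the strongest edge type whose cue appears in the segment."""
    best = ("NO_RELATION", 0.0)  # default when no pattern matches
    for edge_type, (strength, cues) in EDGE_PATTERNS.items():
        for cue in cues:
            if re.search(rf"(?<!\w){re.escape(cue)}(?!\w)", segment_text, re.IGNORECASE):
                if strength > best[1]:
                    best = (edge_type, strength)
    return best
```

For instance, `infer_edge_type("This works because the cache is warm")` would return `("EXPLAINS", 0.9)` under this sketch.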
After initial segmentation, the system applies a comprehensive 7-phase enrichment pipeline:
- Edge Type Inference: Analyzes unlabeled edges to infer their semantic type using transitive relationships, content analysis, and structural patterns
- Transitive Relationship Building: Constructs transitive edges for critical relationships (e.g., if A depends on B and B depends on C, then A transitively depends on C; a sketch follows this list)
- Semantic Clustering: Applies Louvain community detection to identify topical clusters and assigns cluster representatives
- Redundancy Analysis: Identifies merge candidates, redundant continuations, and duplicate examples for consolidation
- Global Importance Calculation: Combines PageRank, betweenness centrality, edge-type importance, and cluster importance into a unified score
- Narrative Flow Detection: Identifies narrative patterns like setup-payoff, problem-solution, claim-evidence, and concept-example pairs
- Dependency Optimization: Finds critical paths, circular dependencies, and optimizes the dependency graph structure
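As a sketch of phase 2 alone, assuming the graph is held in a `networkx` `DiGraph` with an `edge_type` attribute (the attribute names and the `derived` flag are illustrative):

```python
import networkx as nx

def add_transitive_dependencies(graph: nx.DiGraph) -> None:
    # Collect only the DEPENDS_ON edges and close them transitively.
    deps = nx.DiGraph([(u, v) for u, v, d in graph.edges(data=True)
                       if d.get("edge_type") == "DEPENDS_ON"])
    for u, v in nx.transitive_closure(deps).edges():
        if not graph.has_edge(u, v):
            # Mark derived edges so later phases can tell them apart.
            graph.add_edge(u, v, edge_type="DEPENDS_ON", derived=True)
```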
Each segment is evaluated across five dimensions:
- Uniqueness (0.3 weight): How distinct the content is compared to other segments
- Completeness (0.25 weight): Whether the segment forms a self-contained thought
- Edge Importance (0.25 weight): The significance of its relationships to other segments
- Structural Importance (0.15 weight): Its role in the graph structure (hub, bridge, etc.)
- Semantic Density (0.05 weight): Information density and technical content
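A minimal sketch of how these weights could combine into one score, assuming each dimension is pre-normalized to [0, 1] (the names are ours):

```python
# Weights from the list above; inputs assumed normalized to [0, 1].
WEIGHTS = {
    "uniqueness": 0.30,
    "completeness": 0.25,
    "edge_importance": 0.25,
    "structural_importance": 0.15,
    "semantic_density": 0.05,
}

def segment_score(dimensions: dict[str, float]) -> float:
    """Weighted sum over the five dimensions; missing values count as 0."""
    return sum(weight * dimensions.get(name, 0.0) for name, weight in WEIGHTS.items())
```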
The system includes Self-Organizing Map (SOM) capabilities for advanced spatial analysis:
- Spatial Organization: Maps segments to a 2D grid based on semantic similarity
- Colocation Detection: Identifies segments that map to the same SOM position as merge candidates (sketched after this list)
- Isolation Analysis: Finds segments that are semantically isolated
- Path Finding: Multiple strategies for document generation (scenic, hub_tour, edge_guided, direct)
- Density Analysis: Identifies dense regions of related content
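A minimal sketch of the spatial organization and colocation steps, assuming the third-party `minisom` package and a precomputed embedding matrix; the grid size and function name are illustrative:

```python
from collections import defaultdict

import numpy as np
from minisom import MiniSom

def find_colocated_segments(embeddings: np.ndarray, grid: int = 8) -> list[list[int]]:
    """Train a SOM on segment embeddings and group segments by winning cell."""
    som = MiniSom(grid, grid, embeddings.shape[1], sigma=1.0, learning_rate=0.5)
    som.train_random(embeddings, num_iteration=1000)
    cells = defaultdict(list)
    for idx, vec in enumerate(embeddings):
        cells[som.winner(vec)].append(idx)  # winner() returns a (row, col) cell
    # Segments that share a cell are semantically close: merge candidates.
    return [group for group in cells.values() if len(group) > 1]
```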
A critical step in the "Tape-to-Nodes" transformation is determining the ideal size and overlap of our initial text chunks. We ground our approach in Percolation Theory.
Imagine our document's concepts as a physical medium. For information to "percolate" or flow from one end of the document to the other, the chunks (nodes) must be sufficiently connected.
- Too little overlap (<15%): The chunks are disconnected islands of meaning. The system can understand local concepts but fails to form a "big picture" view. No "giant connected component" of knowledge emerges.
- Too much overlap (>30%): The system becomes computationally inefficient and redundant. The connections are trivially obvious, and we lose the ability to identify meaningful, non-local relationships.
The critical threshold lies in the 15-30% overlap range. At this density, a phase transition occurs. The graph of nodes becomes globally connected, allowing insights and context to flow across the entire information space. This ensures that the meaning of a concept at the beginning of the tape can influence and be influenced by a concept at the end, enabling true retroactive understanding. This mathematically grounded overlap is fundamental to our chunking strategy, ensuring the resulting knowledge graph is both coherent and comprehensive.
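As a concrete illustration, a chunker targeting this band could fix the overlap at 20% of the chunk size; a minimal sketch (parameter names are ours, not the system's):

```python
def chunk_with_overlap(text: str, chunk_size: int = 400, overlap_ratio: float = 0.20) -> list[str]:
    """Slice text into fixed-size chunks whose neighbors share overlap_ratio of their length."""
    stride = max(1, int(chunk_size * (1.0 - overlap_ratio)))
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the tape
    return chunks
```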
The system uses separate disassembly and reassembly rules to first break down text optimally, then reconstruct it with different organizational principles.
Location: layered-context-graph/src/partitioning/partition_manager.py
Method: _apply_disassembly_rules()
- Semantic boundaries: Split at paragraphs, major topic shifts
- Attention clusters: Use attention patterns to find natural breaks
- Percolation thresholds: Apply percolation theory boundaries
- Instruction markers: Split at special tokens (QWQ_REASONING, etc.)
Method: _iterative_segmentation()
- Target: Continue segmentation until average segment length ≈ 400 characters
- Max rounds: Up to 5 rounds of refinement
- Convergence: Stop when segments reach optimal size or no change occurs
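A minimal sketch of this loop, with a hypothetical `split_fn` standing in for the round-specific criteria listed next:

```python
def iterative_segmentation(segments: list[str], split_fn, target_len: int = 400, max_rounds: int = 5) -> list[str]:
    """Refine segments round by round until the average length reaches the target."""
    for round_num in range(1, max_rounds + 1):
        avg_len = sum(len(s) for s in segments) / max(len(segments), 1)
        if avg_len <= target_len:
            break  # converged on the target size
        refined = [piece for seg in segments for piece in split_fn(seg, round_num)]
        if refined == segments:
            break  # no change this round: stop refining
        segments = refined
    return segments
```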
Method: _split_by_round_criteria()
- Round 1: `_split_by_semantic_boundaries()` - paragraphs, sentences
- Round 2: `_split_by_syntactic_boundaries()` - clauses, phrases
- Round 3: `_split_by_instruction_markers()` - special tokens
- Round 4+: `_split_by_character_count()` - fallback fixed-size
Method: _merge_segments()
- Target: Combine small segments up to 1.2x target length
- Preserve: Semantic coherence during merging
- Strategy: Greedy combination with overlap management
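A minimal sketch of the greedy merge over plain string segments; the 1.2x cap mirrors the target above, the rest is illustrative:

```python
def merge_small_segments(segments: list[str], target_len: int = 400, cap_ratio: float = 1.2) -> list[str]:
    """Greedily absorb neighbors into a buffer until the 1.2x cap would be exceeded."""
    merged, buffer = [], ""
    for seg in segments:
        if buffer and len(buffer) + len(seg) <= target_len * cap_ratio:
            buffer += " " + seg
        else:
            if buffer:
                merged.append(buffer)
            buffer = seg
    if buffer:
        merged.append(buffer)
    return merged
```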
Location: layered-context-graph/src/partitioning/enhanced_partition_manager.py
The enhanced system adds edge-aware capabilities:
- Edge Detection During Segmentation: Predicts edge types between content pieces before creating nodes
- Semantic Unit Preservation: Groups pieces that must stay together based on edge constraints
- Pattern-Based Detection: Uses regex patterns to identify relationship indicators in text
Location: layered-context-graph/src/graph/graph_reassembler.py
Method: _analyze_optimal_segments()
- Input: Final optimal segments from disassembly
- Analysis: Categorize segment types, themes, importance
- Output: Segment analysis metadata for reconstruction
Method: _apply_reconstruction_rules()
- Importance ordering: Organize by conceptual significance
- Conceptual clustering: Group related concepts together
- Flow optimization: Arrange for logical reading flow
- Layered organization: Create hierarchical structure
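A minimal sketch of importance ordering combined with conceptual clustering, assuming each segment carries illustrative `cluster` and `importance` keys:

```python
from itertools import groupby

def order_for_reassembly(segments: list[dict]) -> list[dict]:
    """Group segments by cluster, then rank clusters and members by importance."""
    ranked = sorted(segments, key=lambda s: (s["cluster"], -s["importance"]))
    clusters = [list(group) for _, group in groupby(ranked, key=lambda s: s["cluster"])]
    # Clusters led by the most important concepts come first in the output.
    clusters.sort(key=lambda group: -max(s["importance"] for s in group))
    return [seg for group in clusters for seg in group]
```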
Method: _generate_reorganized_content()
- Layer-based output: Organize content into conceptual layers
- Flow optimization: Ensure smooth transitions between concepts
- Coherence preservation: Maintain semantic relationships
- Readable formatting: Add headers, structure, navigation
```
Input Text
    ↓
[DISASSEMBLY PHASE]
    ↓
Rule K1: Initial split by semantic/attention boundaries
    ↓
Rule K2: Iterative refinement (Rounds 1-5)
  ├─ Rule K3: Round-specific criteria
  └─ Rule K4: Merging when needed
    ↓
Optimal Segments (avg ~400 chars)
    ↓
[REASSEMBLY PHASE]
    ↓
Rule G1: Analyze optimal segments
    ↓
Rule G2: Apply reconstruction rules
  ├─ Importance ordering
  ├─ Conceptual clustering
  ├─ Flow optimization
  └─ Layered organization
    ↓
Rule G3: Generate reorganized content
    ↓
Final Reorganized Output
```
The current implementation uses a robust fallback approach that relies on standard LLM API calls rather than direct attention-head access.
Builds a knowledge graph from a document using only standard LLM API calls.
Merges highly similar nodes and prunes weak edges to clean the graph. It also classifies nodes as KEEP, DELETE, or TRACK based on graph metrics and content.
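A minimal sketch of the pruning and classification step, assuming a `networkx` graph with a `strength` attribute on edges; the threshold and degree heuristics are illustrative stand-ins for the actual graph metrics:

```python
import networkx as nx

def prune_and_classify(graph: nx.Graph, min_strength: float = 0.3) -> None:
    """Drop weak edges, then tag each node as KEEP, DELETE, or TRACK by degree."""
    weak = [(u, v) for u, v, d in graph.edges(data=True)
            if d.get("strength", 0.0) < min_strength]
    graph.remove_edges_from(weak)
    for node in graph.nodes():
        degree = graph.degree(node)
        graph.nodes[node]["status"] = ("DELETE" if degree == 0
                                       else "TRACK" if degree == 1
                                       else "KEEP")
```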
The main user-facing class that orchestrates the entire process of transforming a transcript into various structured, condensed outputs.
Implements the sophisticated edge-aware segmentation with:
- EdgeTypeRules class defining properties for each edge type
- Multi-strategy edge inference (transitive, content-based, structural)
- Graph enrichment pipeline with 7 phases
- Multi-dimensional scoring system
Adds Self-Organizing Map capabilities:
- 2D spatial mapping of segments
- Colocation detection for merge candidates
- Multiple path-finding strategies for document generation
- Integration with graph enrichment pipeline
This section outlines the planned enhancements and the more advanced, attention-based architecture that the project is working towards.
The system is designed to support a multi-round annotation process where a single base graph is enriched with multiple layers of analysis.
- Base Graph: Created once from clean semantic chunks.
- Annotation Layers: Multiple analysis rounds add metadata to the same nodes and edges.
- Layer Types:
- Syntactic Layer: Grammar, POS tags, linguistic structure.
- Semantic Layer: Topics, concepts, meaning relationships.
- Pragmatic Layer: Intent, discourse, communicative purpose.
A more advanced implementation will use transformer attention patterns to directly guide the graph construction process.
- Principle: Attention mechanisms slice existing text, never recreate it.
- Process: The transformer analyzes semantic chunks and builds the knowledge graph directly.
- Output: A pure graph structure with original content preserved in the nodes and attention patterns as metadata.
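As a sketch of what attention extraction could look like with a standard Hugging Face model (the model choice and the pooling over layers and heads are our assumptions, not the planned implementation):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("Chunk A explains chunk B.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is one (batch, heads, seq, seq) tensor per layer.
# Averaging over layers and heads yields a single token-to-token affinity
# matrix that could be stored on graph edges as metadata.
affinity = torch.stack(outputs.attentions).mean(dim=(0, 2)).squeeze(0)
```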
The reassembly process can be enhanced by using the original document's structure as a scaffold.
- Scaffold: The original document structure is used as a template.
- Method: Graph content is mapped back to the original organization.
- Benefits: This maintains the natural flow of the document while incorporating the insights from the graph.
- Instruction Seeder: A module to insert natural language instructions into text to guide attention heads.
- Attention-Based Graph Builder: A module to build graphs directly from attention patterns.
- CUDA Optimization: Add GPU acceleration for performance improvements.
The `master_processor.py` script is the main entry point for the layered context graph system, providing a complete pipeline from text input to synthesized output.
```bash
# Process a document with default settings
python master_processor.py --input document.txt

# Use demo content
python master_processor.py --demo simple

# Specify output directory
python master_processor.py --input document.txt --output results/
```

The system supports three main processing modes:

Single-pass mode applies basic processing with QwQ attention extraction:

```bash
python master_processor.py --input document.txt --mode single-pass
```

Multi-round mode applies multiple annotation layers for deeper analysis:

```bash
python master_processor.py --input document.txt --mode multi-round
```

Language-guided mode uses natural language rules to guide processing:

```bash
python master_processor.py --input document.txt --mode language-guided --rules technical_documentation
```

For dialogue and transcript formats, use conversation-specific modes:
```bash
# Timeline mode - chronological order
python master_processor.py --input conversation.txt --conversation-mode timeline

# Speaker mode - organized by speaker
python master_processor.py --input conversation.txt --conversation-mode speaker

# Evolution mode - shows concept development
python master_processor.py --input conversation.txt --conversation-mode evolution

# Current state mode - final conclusions
python master_processor.py --input conversation.txt --conversation-mode current_state

# Research mode - extracts key insights, code, and roads not taken
python master_processor.py --input conversation.txt --conversation-mode research
```

Generate purpose-specific documents from the knowledge graph:
```bash
# Executive summary
python master_processor.py --input document.txt --synthesize executive_summary

# Tutorial format
python master_processor.py --input document.txt --synthesize tutorial

# Technical reference
python master_processor.py --input document.txt --synthesize reference

# README documentation
python master_processor.py --input document.txt --synthesize readme
```

These options can be combined:

```bash
# Process a technical document and generate a tutorial
python master_processor.py --input technical_doc.txt --mode multi-round --synthesize tutorial

# Analyze a conversation and extract research insights
python master_processor.py --input conversation.txt --conversation-mode research --synthesize executive_summary

# Use demo content with verbose output
python master_processor.py --demo technical --mode language-guided --verbose

# Full pipeline with custom output
python master_processor.py \
  --input my_document.txt \
  --mode multi-round \
  --conversation-mode evolution \
  --synthesize reference \
  --output my_results/
```

| Option | Description | Choices |
|---|---|---|
| `--input`, `-i` | Input text file path | Any valid file path |
| `--output`, `-o` | Output directory | Any valid directory path |
| `--demo` | Use demo content | simple, technical, transcript, conversation |
| `--mode` | Processing mode | single-pass, multi-round, language-guided |
| `--conversation-mode` | Conversation reassembly mode | timeline, speaker, evolution, current_state, research |
| `--synthesize` | Generate synthesized content | executive_summary, tutorial, reference, readme |
| `--rules` | Predefined rule set | Available rule sets from configuration |
| `--verbose`, `-v` | Enable verbose logging | Flag (no value needed) |
The processor generates several output files:
- Main results: `qwq_layered_results_[timestamp].json` (complete processing results)
- Reassembled text: `qwq_layered_results_[timestamp].txt` (reorganized document)
- Synthesized content: `synthesized_[type]_[timestamp].md` (written if synthesis was requested)
The system automatically detects and uses GPU acceleration when available:
- Supports NVIDIA CUDA GPUs
- Falls back to CPU when GPU not available
- Displays GPU information in verbose mode
When using --conversation-mode research, the system extracts:
- Most Definitive Ideas: Final, refined versions of concepts
- Implementation Code: Practical code examples with context
- Roads Not Taken: Alternative approaches that were rejected
- Unique Early Ideas: Interesting concepts that weren't fully developed
- Concept Evolution: How ideas changed throughout the conversation
This mode is particularly useful for:
- Analyzing design discussions
- Extracting actionable insights from brainstorming sessions
- Creating technical documentation from conversations
- Understanding decision-making processes
The enhanced system includes specific optimizations for processing technical documents with embedded code:
For documents with dense technical content and many code blocks:
```python
# Recommended configuration
manager = SOMEnhancedPartitionManager(
    similarity_threshold=0.75,
    use_graph_aware=True,
    min_segment_size=50  # Much smaller than the default 500
)
```

- Code Block Preservation: Code blocks are treated as atomic units and never split (see the sketch after this list)
- Context Preservation: Code explanations are kept with their associated code blocks
- Dialog Format Support: Handles User:/Assistant: conversation formats
- Pattern Detection: Automatically identifies code-explanation-example patterns
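A minimal sketch of the code-block preservation rule, splitting text so that fenced blocks become atomic segments (the function and segment format are illustrative):

```python
import re

FENCE = re.compile(r"```.*?```", re.DOTALL)  # matches a whole fenced block

def split_preserving_code(text: str) -> list[tuple[str, str]]:
    """Split text into (kind, content) segments, keeping fenced code atomic."""
    segments, last = [], 0
    for match in FENCE.finditer(text):
        prose = text[last:match.start()].strip()
        if prose:
            segments.append(("prose", prose))
        segments.append(("code", match.group(0)))  # never split a code block
        last = match.end()
    tail = text[last:].strip()
    if tail:
        segments.append(("prose", tail))
    return segments
```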
For a well-structured technical document, expect:
- 50-200 nodes (not just 3)
- 5-10 different edge types detected
- Multiple semantic clusters around different technical concepts
- Rich narrative patterns (problem-solution, concept-example)
- Meaningful importance scores distinguishing core concepts from examples
If the system produces too few segments:
- Reduce `min_segment_size` to 50-100 characters
- Increase `MAX_LEVELS` in tree construction to 5+
- Use more granular segmentation rules for technical content
- Check that the document is being loaded correctly
- Enable verbose logging to see segmentation decisions