Cacummaro - Web PDF Ingest & Categorization Application

Cacummaro - From Latin "cacumen" (peak, summit) - reaching the peak of document organization!

A comprehensive Java application for ingesting web pages, converting them to PDF, extracting metadata, classifying documents, and integrating with Obsidian for knowledge management.

Features

  • Web to PDF: Convert any web page to PDF using Playwright headless Chrome
  • AI-Powered Classification: Machine learning + MCP integration with GPT-4, Claude, or custom AI models
  • Hybrid Classification System: Combines TF-IDF ML, rule-based patterns, and external AI for maximum accuracy
  • MCP Protocol Support: Integrate with any AI model via Model Context Protocol (future-proof architecture)
  • Interactive Graph Visualization: Obsidian-style category graph with D3.js
  • CouchDB Storage: Robust document storage with attachment support
  • Obsidian Integration: Automatic note generation with frontmatter and metadata
  • Full-text Search: Search across titles, descriptions, and metadata
  • REST API: Comprehensive API for all operations
  • Security: SSRF protection, input validation, and path traversal prevention

Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Web UI        │    │   REST API      │    │   Services      │
│                 │◄──►│                 │◄──►│                 │
│ - Graph View    │    │ - /ingest       │    │ - PDF Gen       │
│ - Category List │    │ - /documents    │    │ - AI Classifier │
│ - Document View │    │ - /categories   │    │ - Obsidian      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │                        │
                                ▼                        ▼
┌──────────────────────────────────────────┐  ┌─────────────────┐
│        Classification Pipeline           │  │  Obsidian Vault │
│  ┌────────────────────────────────────┐  │  │                 │
│  │ 1. MCP AI Models (Optional)        │  │  │ - Notes (.md)   │
│  │    • GPT-4, Claude, Custom LLMs    │  │  │ - Links         │
│  │    • JSON-RPC 2.0 Protocol         │──┼─── - Metadata      │
│  └────────────────────────────────────┘  │  └─────────────────┘
│  ┌────────────────────────────────────┐  │
│  │ 2. ML TF-IDF Classifier            │  │  ┌─────────────────┐
│  │    • PDF Text Extraction           │  │  │    CouchDB      │
│  │    • Cosine Similarity             │  │  │                 │
│  └────────────────────────────────────┘  │  │ - Documents     │
│  ┌────────────────────────────────────┐  │  │ - Categories    │
│  │ 3. Rule-Based Patterns             │  │  │ - PDF Binaries  │
│  │    • Keyword Matching              │──┼─── - ML Models     │
│  │    • Domain Detection              │  │  └─────────────────┘
│  └────────────────────────────────────┘  │
│  ┌────────────────────────────────────┐  │
│  │ 4. Merge & Deduplicate Results     │  │
│  │    • Highest Confidence Wins       │  │
│  └────────────────────────────────────┘  │
└──────────────────────────────────────────┘

Quick Start

Prerequisites

  • Java 11+ (LTS recommended)
  • Docker (for CouchDB)
  • Maven 3.6+
  • curl (for testing)

1. Clone and Start

# Linux/macOS
./start.sh

# Windows
start.bat

2. Manual Setup (Alternative)

# Start CouchDB
docker-compose up -d

# Build application
mvn clean package

# Run application
java -jar target/cacummaro-1.0-SNAPSHOT.jar

3. Verify Installation

# Health check
curl http://localhost:8082/actuator/health

# Expected response:
# {"status":"UP"}

# Access the web interface
# Open browser to http://localhost:8082/

API Usage

Web Interface

Open your browser to http://localhost:8082/:

  1. Ingest Documents: Enter any URL and click "Convert" to create a PDF snapshot
  2. View Graph: Click "Smart Classification" to see the interactive category graph
  3. Download PDFs: Click on any category node, then download PDFs from the sidebar

API Examples

Ingest a Web Page

curl -X POST http://localhost:8082/api/v1/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "options": {
      "createObsidianNote": true,
      "noteMetaTag": "data-note"
    }
  }'

Response:

{
  "id": "doc|550e8400-e29b-41d4-a716-446655440000",
  "status": "STORED",
  "pdfUrl": "/api/v1/documents/doc|550e8400-e29b-41d4-a716-446655440000/pdf"
}
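The same ingest call can be made from Java with the JDK's built-in HTTP client. This is a minimal sketch that mirrors the curl example above; the class and method names are illustrative only, and the page URL is concatenated without JSON escaping, so it assumes the URL contains no quotes or backslashes.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class IngestClientSketch {

    // Builds the POST request for /api/v1/ingest; the JSON body mirrors the curl example.
    // Assumption: pageUrl contains no characters that need JSON escaping.
    static HttpRequest buildIngestRequest(String baseUrl, String pageUrl) {
        String body = "{\"url\":\"" + pageUrl + "\","
                + "\"options\":{\"createObsidianNote\":true,\"noteMetaTag\":\"data-note\"}}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/v1/ingest"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request =
                buildIngestRequest("http://localhost:8082", "https://example.com/article");
        // Sending requires a running Cacummaro instance on port 8082.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

The request is built separately from sending so that it can be inspected or reused without a live server.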

Search Documents

curl "http://localhost:8082/api/v1/search?q=machine%20learning&size=5"

Download PDF

curl -o document.pdf \
  "http://localhost:8082/api/v1/documents/doc|550e8400-e29b-41d4-a716-446655440000/pdf"

List Categories

curl "http://localhost:8082/api/v1/categories"
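Document ids such as doc|550e8400-... contain a `|`, which java.net.URI rejects as an illegal character. When building the PDF URL from Java, the id can be percent-encoded first. The sketch below uses illustrative names and assumes the server accepts the encoded form (`%7C`) in the path; that is not confirmed by this README.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class PdfUrlSketch {

    // Percent-encodes the document id so that '|' becomes %7C before it is placed in the path.
    // URLEncoder performs form encoding, which is sufficient for ids made of letters,
    // digits, '-' and '|', but would turn a space into '+'.
    static URI pdfUri(String baseUrl, String documentId) {
        String encodedId = URLEncoder.encode(documentId, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/api/v1/documents/" + encodedId + "/pdf");
    }

    public static void main(String[] args) {
        System.out.println(
                pdfUri("http://localhost:8082", "doc|550e8400-e29b-41d4-a716-446655440000"));
        // prints http://localhost:8082/api/v1/documents/doc%7C550e8400-e29b-41d4-a716-446655440000/pdf
    }
}
```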

Configuration

Edit src/main/resources/application.yml:

cacummaro:
  couchdb:
    host: localhost
    port: 5984
    database: cacummaro_docs
    username: admin
    password: password

  obsidian:
    vault-path: ./obsidian-vault
    enabled: true

  pdf:
    timeout: 30s
    max-page-size: 10MB

  # Classification System Configuration
  classification:
    confidence-threshold: 0.7

    # Machine Learning TF-IDF Classifier
    ml:
      enabled: true
      confidence-threshold: 0.6
      model-path: ./ml-model.json
      min-document-frequency: 2
      max-features: 1000

    # MCP AI Integration (GPT-4, Claude, Custom LLMs)
    mcp:
      enabled: false  # Set to true to enable AI classification
      server-url: http://localhost:3000/mcp
      timeout-ms: 30000
      tool-name: classify_document

  security:
    allowed-domains: []
    blocked-internal-ips: true

  ingestion:
    max-concurrent: 5
    retry-attempts: 3
    retry-delay: 5s

πŸ“ Project Structure

src/main/java/org/cacummaro/
├── domain/              # Core entities (Document, Category)
├── dto/                 # API contracts (IngestRequest, etc.)
├── repository/          # Data access (CouchDB implementations)
├── service/
│   ├── pdf/            # PDF generation (Playwright)
│   ├── classification/ # Document classification
│   └── obsidian/       # Note generation
├── controller/         # REST endpoints
└── config/             # Spring configuration

Testing

# Unit tests
mvn test

# Integration tests
mvn failsafe:integration-test

# All tests
mvn verify

Security Features

  • SSRF Protection: Blocks internal/private IP ranges
  • Input Validation: Comprehensive URL and data validation
  • Path Traversal Prevention: Secure Obsidian vault operations
  • Sanitization: Clean filenames and content
  • Rate Limiting: Configurable request limits
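To illustrate the SSRF protection listed above, the following sketch shows one common way to reject internal addresses using only java.net.InetAddress. It is not taken from the Cacummaro source; the class and method names are hypothetical, and a production check would also need to pin the resolved address for the actual fetch to guard against DNS rebinding.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class SsrfCheckSketch {

    // True when the address is loopback, private (10/8, 172.16/12, 192.168/16),
    // link-local (such as 169.254.0.0/16), or the wildcard address.
    // Note: IPv6 unique local addresses (fc00::/7) are not covered by these JDK checks.
    static boolean isInternal(InetAddress address) {
        return address.isLoopbackAddress()
                || address.isSiteLocalAddress()
                || address.isLinkLocalAddress()
                || address.isAnyLocalAddress();
    }

    // Resolves the host and allows it only if none of its addresses is internal.
    static boolean isAllowedHost(String host) throws UnknownHostException {
        for (InetAddress address : InetAddress.getAllByName(host)) {
            if (isInternal(address)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws UnknownHostException {
        System.out.println(isAllowedHost("127.0.0.1"));       // prints false
        System.out.println(isAllowedHost("169.254.169.254")); // prints false
        System.out.println(isAllowedHost("8.8.8.8"));         // prints true
    }
}
```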

AI-Powered Classification System

Cacummaro uses a hybrid classification system that combines external AI models, a local TF-IDF machine learning classifier, and rule-based pattern matching.

Three-Layer Classification Architecture

1. MCP AI Models (External Intelligence)

The Future of Document Classification

Integrates with cutting-edge AI models via the Model Context Protocol (MCP):

  • GPT-4 / GPT-4 Turbo: Deep semantic understanding and contextual reasoning
  • Claude (Anthropic): Nuanced categorization with reasoning explanations
  • Local LLMs: Ollama, LLaMA, Mistral for privacy-focused deployments
  • Custom Models: Your fine-tuned domain-specific AI models

Benefits:

  • Semantic Understanding: Goes beyond keywords to understand context and meaning
  • Continuous Learning: Leverage latest AI advancements without code changes
  • Multi-language: Works across languages with advanced AI models
  • Confidence Scores: Each category comes with a confidence score and the model's reasoning
  • Plug & Play: Swap AI models by changing MCP server configuration

Configuration:

cacummaro:
  classification:
    mcp:
      enabled: true
      server-url: http://localhost:3000/mcp  # Your MCP server
      timeout-ms: 30000
      tool-name: classify_document

Example MCP Response:

{
  "categories": [
    {
      "name": "artificial-intelligence",
      "confidence": 0.95,
      "reasoning": "Article discusses neural networks, deep learning, and ML algorithms"
    }
  ],
  "model": "gpt-4"
}
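For orientation, MCP tool invocations are JSON-RPC 2.0 requests with the method `tools/call`, carrying the tool name and its arguments in `params`. The sketch below builds such a request for the `classify_document` tool configured above. The argument names `title` and `text` are assumptions for illustration; the actual schema depends on your MCP server.

```java
class McpRequestSketch {

    // Escapes backslashes, quotes, and control characters for use inside a JSON string.
    static String jsonEscape(String value) {
        StringBuilder out = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '"':  out.append("\\\""); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append('\\').append('u').append(String.format("%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // Builds a JSON-RPC 2.0 "tools/call" request.
    // The "title" and "text" argument names are hypothetical.
    static String toolsCall(int id, String toolName, String title, String text) {
        return "{\"jsonrpc\":\"2.0\",\"id\":" + id + ",\"method\":\"tools/call\","
                + "\"params\":{\"name\":\"" + jsonEscape(toolName) + "\","
                + "\"arguments\":{\"title\":\"" + jsonEscape(title) + "\","
                + "\"text\":\"" + jsonEscape(text) + "\"}}}";
    }

    public static void main(String[] args) {
        System.out.println(toolsCall(1, "classify_document",
                "Intro to neural networks", "Deep learning uses layered models."));
    }
}
```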

2. ML TF-IDF Classifier (Local Intelligence; see the Wikipedia article "tf–idf")

Content-Based Machine Learning

  • Extracts full text from PDF documents using Apache PDFBox
  • Uses TF-IDF vectorization (Term Frequency-Inverse Document Frequency)
  • Cosine similarity for category matching
  • Learns from your document corpus
  • Always available as fallback when MCP is unavailable
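The TF-IDF and cosine-similarity steps above can be sketched in plain Java as follows. This is a simplified illustration, not the project's classifier: the tokenizer, the smoothed IDF formula, and the tiny example corpus are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TfIdfSketch {

    // Lowercases the text and splits on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }

    // Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1.
    static Map<String, Double> idf(List<String> corpus) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (String document : corpus) {
            for (String term : new HashSet<>(tokenize(document))) {
                documentFrequency.merge(term, 1, Integer::sum);
            }
        }
        int n = corpus.size();
        Map<String, Double> weights = new HashMap<>();
        documentFrequency.forEach((term, df) ->
                weights.put(term, Math.log((1.0 + n) / (1.0 + df)) + 1.0));
        return weights;
    }

    // Raw term count times IDF; terms never seen in the corpus are skipped.
    static Map<String, Double> vectorize(String text, Map<String, Double> weights) {
        Map<String, Double> vector = new HashMap<>();
        for (String term : tokenize(text)) {
            Double weight = weights.get(term);
            if (weight != null) vector.merge(term, weight, Double::sum);
        }
        return vector;
    }

    // Cosine similarity of two sparse vectors; 0.0 when either vector is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        Map<String, String> examples = new LinkedHashMap<>();
        examples.put("technology", "python programming and docker containers");
        examples.put("finance", "stocks trading and portfolio investment");
        Map<String, Double> weights = idf(new ArrayList<>(examples.values()));
        Map<String, Double> query = vectorize("docker python tutorial", weights);
        examples.forEach((category, text) ->
                System.out.println(category + " -> " + cosine(query, vectorize(text, weights))));
    }
}
```

The category whose example vector has the highest cosine similarity to the query would be the match; the similarity value can serve as a confidence score.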

Training the Model:

# Train on existing categorized documents
curl -X POST "http://localhost:8082/api/v1/ml/train?maxDocuments=1000"

# Check ML status
curl http://localhost:8082/api/v1/ml/status

Benefits:

  • Fast: Local execution, no API calls
  • Private: No data leaves your infrastructure
  • Self-Learning: Improves as you categorize more documents
  • Reliable: Always available, no external dependencies

3. Rule-Based Classifier (Pattern Matching)

Traditional keyword and domain-based classification

Built-in categories with pattern matching:

  • Technology: software, programming, ai, javascript, python, react, docker
  • Finance: banking, investment, crypto, trading, stocks, portfolio
  • Science: research, biology, physics, medical, study, experiment
  • Business: startup, management, strategy, marketing, sales, leadership
  • News: breaking, politics, government, journalism, election

Add custom categories in application.yml:

cacummaro:
  classification:
    rule-based:
      category-keywords:
        data-science:
          - "pandas"
          - "numpy"
          - "data analysis"
          - "visualization"

Intelligent Classification Merge

All three classifiers work together:

  1. MCP AI runs first (if enabled) for deep semantic understanding
  2. ML TF-IDF analyzes PDF content for statistical patterns
  3. Rule-based checks for explicit keywords and domains
  4. Results are merged - highest confidence wins per category
  5. Source tracking - know which classifier found each category

Example Merged Result (the classifier field records the source of each category: an AI model via MCP, the local ML model, or the rules):

{
  "categories": [
    {
      "name": "artificial-intelligence",
      "confidence": 0.95,
      "classifier": "mcp-gpt4"
    },
    {
      "name": "technology",
      "confidence": 0.88,
      "classifier": "ml-tfidf-v1.0"
    },
    {
      "name": "programming",
      "confidence": 0.75,
      "classifier": "enhanced-rule-based-v1.0"
    }
  ]
}
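The "highest confidence wins per category" rule can be sketched as a small merge over the per-classifier result lists. The Result type and method names below are hypothetical and only illustrate the rule; they are not the project's classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MergeSketch {

    // Hypothetical result type: category name, confidence, and the classifier that produced it.
    static final class Result {
        final String name;
        final double confidence;
        final String classifier;

        Result(String name, double confidence, String classifier) {
            this.name = name;
            this.confidence = confidence;
            this.classifier = classifier;
        }
    }

    // Keeps, for each category name, the result with the highest confidence,
    // then sorts the survivors by confidence in descending order.
    static List<Result> merge(List<List<Result>> perClassifier) {
        Map<String, Result> best = new LinkedHashMap<>();
        for (List<Result> results : perClassifier) {
            for (Result r : results) {
                best.merge(r.name, r, (a, b) -> b.confidence > a.confidence ? b : a);
            }
        }
        List<Result> merged = new ArrayList<>(best.values());
        merged.sort(Comparator.comparingDouble((Result r) -> r.confidence).reversed());
        return merged;
    }

    public static void main(String[] args) {
        List<Result> merged = merge(List.of(
                List.of(new Result("artificial-intelligence", 0.95, "mcp-gpt4")),
                List.of(new Result("technology", 0.88, "ml-tfidf-v1.0")),
                List.of(new Result("technology", 0.60, "enhanced-rule-based-v1.0"))));
        for (Result r : merged) {
            System.out.println(r.name + " " + r.confidence + " " + r.classifier);
        }
    }
}
```

Because each surviving result keeps its classifier field, the source of every category remains visible after the merge.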

Obsidian Integration

Generated notes include:

---
source_url: https://example.com
title: Article Title
pdf_id: doc|uuid
fetchedAt: 2025-01-15T10:30:00Z
tags: [technology, page-snapshot]
---

# Article Title

## Description
Article description from meta tags

## Notes
Content from configurable meta tag

## Resources
[Download PDF](../pdfs/doc|uuid.pdf)

## Categories
- technology (confidence: 0.85)

## Metadata
- **Fetched:** 2025-01-15T10:30:00Z
- **PDF Size:** 1.2 MB
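Since note and PDF names derive from page titles and ids such as doc|uuid, and `|` is not a valid filename character on Windows, filenames need sanitizing before they are written to the vault. The sketch below shows one possible approach combined with a check that the resolved path stays inside the vault. The replacement rules and names are assumptions, not the project's implementation.

```java
import java.nio.file.Path;

class VaultPathSketch {

    // Replaces path separators, characters that are invalid in Windows filenames
    // (including '|'), and control characters with underscores.
    static String sanitizeFileName(String name) {
        String cleaned = name.replaceAll("[\\\\/:*?\"<>|\\p{Cntrl}]", "_").trim();
        return cleaned.isEmpty() ? "untitled" : cleaned;
    }

    // Resolves a note path inside the vault and rejects anything that escapes it.
    static Path resolveNote(Path vault, String title) {
        Path root = vault.toAbsolutePath().normalize();
        Path note = root.resolve(sanitizeFileName(title) + ".md").normalize();
        if (!note.startsWith(root)) {
            throw new IllegalArgumentException("Note path escapes the vault: " + title);
        }
        return note;
    }

    public static void main(String[] args) {
        Path vault = Path.of("obsidian-vault");
        System.out.println(resolveNote(vault, "Article Title").getFileName());    // prints Article Title.md
        System.out.println(resolveNote(vault, "../../etc/passwd").getFileName()); // prints .._.._etc_passwd.md
    }
}
```

The startsWith check is redundant once separators are replaced, but it keeps the guarantee in place if the sanitizing rules change later.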

Troubleshooting

Common Issues

  1. CouchDB Connection Failed

    docker-compose logs couchdb
    curl http://localhost:5984/
  2. PDF Generation Timeout

    cacummaro:
      pdf:
        timeout: 60s  # Increase timeout
  3. Java Version Issues

    java -version  # Must be 11+
  4. Permission Issues (Linux/macOS)

    chmod +x start.sh

Logs

Application logs are available in the console output. Key components:

  • org.cacummaro.service.pdf: PDF generation
  • org.cacummaro.service.classification: Document classification
  • org.cacummaro.repository: Database operations

🤝 Development

Adding New Classifiers

  1. Implement the Classifier interface:

    @Service
    public class MyCustomClassifier implements Classifier {
        // Implementation
    }
  2. Register it as a Spring bean (the @Service annotation above does this when the class is covered by component scanning)

  3. Configure in application.yml

Adding New PDF Generators

  1. Implement the PdfGenerator interface
  2. Replace or add as alternative implementation

Monitoring

  • Web Interface: http://localhost:8082/
  • Graph Visualization: http://localhost:8082/graph
  • Health Endpoint: http://localhost:8082/actuator/health
  • CouchDB Admin: http://localhost:5984/_utils/
  • Application Metrics: Available via Spring Actuator

Tomcat Deployment

# Build WAR file
mvn clean package -Pwar

# Deploy to Tomcat
cp target/cacummaro-1.0-SNAPSHOT.war $TOMCAT_HOME/webapps/

📄 License

MIT License - see LICENSE file for details.

🚧 Roadmap

✅ Completed

  • Web Interface with Thymeleaf
  • Interactive Graph Visualization (D3.js)
  • ML-based Classification (TF-IDF with PDF text extraction)
  • MCP Integration (AI models via Model Context Protocol)
  • Hybrid Classification (ML + AI + Rules)

🚀 In Progress

  • Enhanced MCP Features
    • Multi-model voting (query multiple AI models, consensus-based results)
    • Streaming classification results
    • Fine-tuning integration APIs
  • Classification Analytics Dashboard
    • Accuracy metrics per classifier
    • Confidence distribution charts
    • Category suggestion improvements

📋 Planned

  • Batch Processing Queue for large-scale ingestion
  • Docker Image with embedded ML models
  • Kubernetes Deployment manifests
  • Advanced Search with Elasticsearch integration
  • Multi-user Support with role-based access
  • API Authentication (OAuth2, JWT)
  • Active Learning: User feedback improves classification
  • Category recommendations based on document content

Debugging (Windows)

To run the application in debug mode (waiting for a debugger to attach on port 5005), use the following command in cmd.exe:

mvn spring-boot:run "-Dspring-boot.run.jvmArguments=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
  • The application will wait for a debugger to connect before starting.
  • You can then attach your IDE (e.g., IntelliJ, Eclipse) to localhost:5005.
  • Make sure to use double quotes as shown above when running in Windows cmd.exe.