BasicExtractor - Knowledge Graph Extraction

BasicExtractor is a production-ready tool for extracting structured knowledge graphs from text documents. It uses large language models to convert unstructured text into semantic entities and relationships, and returns them as JSON-LD (standard or minified) or as RDF-style triples.

Quick Start

from abstractcore.processing import BasicExtractor

# Initialize with default model (Ollama qwen3:4b-instruct-2507-q4_K_M)
extractor = BasicExtractor()

# Extract knowledge graph
result = extractor.extract("Google created TensorFlow in 2015. Microsoft uses TensorFlow for Azure AI.")

# Result contains entities and relationships in JSON-LD format
entities = [item for item in result['@graph'] if item.get('@id', '').startswith('e:')]
relationships = [item for item in result['@graph'] if item.get('@id', '').startswith('r:')]

Installation & Setup

# Install AbstractCore. The default Ollama path works with the core install.
pip install abstractcore

# Optional turnkey local-runtime installs:
pip install "abstractcore[all-apple]"    # Apple Silicon: HF/GGUF + MLX + features + server
pip install "abstractcore[all-gpu]"      # NVIDIA GPU: HF/GGUF + vLLM + features + server

# Default model requires Ollama (free, runs locally)
# 1. Install Ollama: https://ollama.com/
# 2. Download model: ollama pull qwen3:4b-instruct-2507-q4_K_M
# 3. Start Ollama service

# Alternative: Use cloud providers
pip install "abstractcore[remote]"

Model Performance Recommendations

Default Model: qwen3:4b-instruct-2507-q4_K_M

  • Size: ~4GB model
  • RAM: ~8GB required
  • Speed: Good balance of speed and quality
  • Setup: ollama pull qwen3:4b-instruct-2507-q4_K_M

For Optimal Performance:

  • qwen3-coder:30b: High quality for structured JSON-LD output (requires 32GB RAM)
  • gpt-oss:120b: Highest quality extraction (requires 120GB RAM)

For Production: Cloud providers (OpenAI GPT-4o-mini, Claude) offer the most reliable JSON-LD generation.

Output Formats

BasicExtractor supports three output formats optimized for different use cases:

1. JSON-LD Format (Default)

Standard W3C JSON-LD with schema.org vocabulary - ideal for semantic web applications:

result = extractor.extract("Apple acquired Siri in 2010", output_format="jsonld")

Output:

{
  "@context": {
    "s": "https://schema.org/",
    "e": "http://example.org/entity/",
    "r": "http://example.org/relation/",
    "confidence": "http://example.org/confidence"
  },
  "@graph": [
    {
      "@id": "e:apple",
      "@type": "s:Organization",
      "s:name": "Apple",
      "s:description": "Technology company",
      "confidence": 0.95
    },
    {
      "@id": "r:1",
      "@type": "s:Relationship",
      "s:name": "acquires",
      "s:about": {"@id": "e:apple"},
      "s:object": {"@id": "e:siri"},
      "confidence": 0.90
    }
  ]
}
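
A relationship node points at its endpoints by `@id`, so reading the graph usually means resolving those ids back to entity names. The following is a minimal sketch that works on a plain dict shaped like the sample above. The `e:siri` entity is added here so that both endpoints resolve, and the helper name `describe_relationships` is illustrative rather than part of AbstractCore.

```python
# Sample shaped like the JSON-LD output above; "e:siri" is added for illustration.
graph_doc = {
    "@graph": [
        {"@id": "e:apple", "@type": "s:Organization", "s:name": "Apple", "confidence": 0.95},
        {"@id": "e:siri", "@type": "s:SoftwareApplication", "s:name": "Siri", "confidence": 0.90},
        {
            "@id": "r:1",
            "@type": "s:Relationship",
            "s:name": "acquires",
            "s:about": {"@id": "e:apple"},
            "s:object": {"@id": "e:siri"},
            "confidence": 0.90,
        },
    ]
}


def describe_relationships(doc):
    """Return 'Subject predicate Object' strings, resolving ids to names."""
    nodes = doc.get("@graph", [])
    # Index entity names by @id so relationship endpoints can be looked up.
    names = {
        n["@id"]: n.get("s:name", n["@id"])
        for n in nodes
        if n.get("@id", "").startswith("e:")
    }
    lines = []
    for n in nodes:
        if not n.get("@id", "").startswith("r:"):
            continue
        subj = n.get("s:about", {}).get("@id", "")
        obj = n.get("s:object", {}).get("@id", "")
        # Fall back to the raw id when an endpoint is missing from the graph.
        lines.append(f"{names.get(subj, subj)} {n.get('s:name', '')} {names.get(obj, obj)}")
    return lines


print(describe_relationships(graph_doc))  # ['Apple acquires Siri']
```

Falling back to the raw id keeps the output usable when a model emits a relationship whose endpoint entity is missing from `@graph`.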

2. RDF Triples Format (New)

SUBJECT PREDICATE OBJECT format following semantic web standards - perfect for graph databases:

result = extractor.extract("Apple acquired Siri in 2010", output_format="triples")

Simple output:

Apple acquires Siri

Detailed output (with metadata):

{
  "format": "triples",
  "simple_triples": ["Apple acquires Siri"],
  "triples": [
    {
      "subject": "e:apple",
      "subject_name": "Apple",
      "predicate": "acquires",
      "object": "e:siri",
      "object_name": "Siri",
      "confidence": 0.90
    }
  ],
  "entities": {...},
  "statistics": {"entities_count": 2, "relationships_count": 1}
}
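
The detailed triples structure maps naturally onto an adjacency list, which is often the first step before loading a graph database. Here is a small sketch under the assumption that the result is a dict shaped like the detailed output above. The second triple and the `build_adjacency` helper are illustrative only.

```python
from collections import defaultdict

# Shaped like the detailed "triples" output above; the second triple is illustrative.
detailed = {
    "format": "triples",
    "triples": [
        {"subject": "e:apple", "subject_name": "Apple", "predicate": "acquires",
         "object": "e:siri", "object_name": "Siri", "confidence": 0.90},
        {"subject": "e:apple", "subject_name": "Apple", "predicate": "creates",
         "object": "e:iphone", "object_name": "iPhone", "confidence": 0.60},
    ],
}


def build_adjacency(result, min_confidence=0.0):
    """Map each subject name to a list of (predicate, object name) edges."""
    adjacency = defaultdict(list)
    for t in result.get("triples", []):
        # Skip edges below the requested confidence threshold.
        if t.get("confidence", 0.0) >= min_confidence:
            adjacency[t["subject_name"]].append((t["predicate"], t["object_name"]))
    return dict(adjacency)


print(build_adjacency(detailed, min_confidence=0.8))  # {'Apple': [('acquires', 'Siri')]}
```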

3. Minified JSON-LD Format (New)

Compact JSON string without indentation - optimized for storage and transport:

result = extractor.extract("Apple acquired Siri in 2010", output_format="jsonld_minified")

Output:

{
  "format": "jsonld_minified",
  "data": "{\"@context\":{\"s\":\"https://schema.org/\"},\"@graph\":[...]}",
  "entities_count": 2,
  "relationships_count": 1
}
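
In the minified format the graph arrives as a JSON string inside `data`, so it takes one `json.loads` call to get a dict back. The sketch below uses a small complete graph in place of the elided `[...]` above, and also shows how compact separators produce the same kind of string with the standard library.

```python
import json

# Shaped like the "jsonld_minified" output above, with a small complete graph
# standing in for the elided "[...]".
minified = {
    "format": "jsonld_minified",
    "data": '{"@context":{"s":"https://schema.org/"},"@graph":[{"@id":"e:apple","s:name":"Apple"}]}',
    "entities_count": 1,
    "relationships_count": 0,
}

# "data" is a JSON string, so it must be decoded before use.
graph_doc = json.loads(minified["data"])
print(graph_doc["@graph"][0]["s:name"])  # Apple

# Compact separators drop the whitespace json.dumps inserts by default.
compact = json.dumps(graph_doc, separators=(",", ":"))
pretty = json.dumps(graph_doc, indent=2)
print(len(compact) < len(pretty))  # True
```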

Python API Reference

BasicExtractor Class

class BasicExtractor:
    def __init__(
        self,
        llm: Optional[AbstractCoreInterface] = None,
        max_chunk_size: int = 8000
    )

    def extract(
        self,
        text: str,
        domain_focus: Optional[str] = None,
        entity_types: Optional[List[str]] = None,
        style: Optional[str] = None,
        length: Optional[str] = None,
        output_format: str = "jsonld"
    ) -> dict

Parameters

  • text (str): Text to extract knowledge from
  • domain_focus (str, optional): Focus area like "technology", "business", "medical"
  • entity_types (List[str], optional): Reserved for future use
  • style (str, optional): Reserved for future use
  • length (str, optional): Extraction depth
    • "brief" - 10 entities max (fast)
    • "standard" - 15 entities max (balanced)
    • "detailed" - 25 entities max (thorough)
    • "comprehensive" - 50 entities max (extensive)
  • output_format (str): Output format
    • "jsonld" - Standard JSON-LD (default)
    • "triples" - RDF SUBJECT PREDICATE OBJECT format
    • "jsonld_minified" - Compact JSON string

Custom LLM Provider

from abstractcore import create_llm
from abstractcore.processing import BasicExtractor

# RECOMMENDED: Use cloud providers for complex JSON-LD extraction
llm = create_llm("openai", model="gpt-4o-mini")  # Best for production
extractor = BasicExtractor(llm)

# OR use Anthropic Claude for high quality
llm = create_llm("anthropic", model="claude-haiku-4-5")
extractor = BasicExtractor(llm)

# LOCAL MODELS: Work well for simple extraction, may struggle with complex JSON-LD
llm = create_llm("ollama", model="qwen3-coder:30b")  # Good for code and structured tasks
extractor = BasicExtractor(llm)

# For simple fact extraction with local models, use direct prompting:
facts_prompt = """Extract facts as JSON: [{"entity": "...", "action": "...", "object": "..."}]"""
facts = llm.generate(facts_prompt)  # Works reliably with local models

Command Line Interface

The extractor CLI provides direct terminal access for knowledge graph extraction without any Python programming.

Quick CLI Usage

# Simple usage (after installing AbstractCore; add `pip install "abstractcore[media]"` for PDFs)
extractor document.pdf

# With specific format and focus
extractor report.txt --format triples --focus technology

# Extract specific entity types
extractor data.md --entity-types person,organization --output entities.jsonld

# High-quality extraction with iterations
extractor doc.txt --iterate=3 --length=detailed --verbose

Alternative Usage Methods

# Method 1: Direct command (recommended after installation)
extractor document.txt --format triples

# Method 2: Via Python module (always works)
python -m abstractcore.apps.extractor document.txt --format triples

Basic Usage

# Extract from file (default: JSON-LD)
extractor document.txt
# OR: python -m abstractcore.apps.extractor document.txt

# Specify output format
extractor document.txt --format triples
# OR: python -m abstractcore.apps.extractor document.txt --format triples

# Save to file
extractor document.txt --output knowledge_graph.jsonld
# OR: python -m abstractcore.apps.extractor document.txt --output knowledge_graph.jsonld

Advanced Options

# Domain-focused extraction
extractor tech_report.txt --focus technology --length detailed
# OR: python -m abstractcore.apps.extractor tech_report.txt --focus technology --length detailed

# Custom provider and model
extractor document.txt --provider openai --model gpt-4o-mini
# OR: python -m abstractcore.apps.extractor document.txt --provider openai --model gpt-4o-mini

# Minified output for storage
extractor document.txt --format json-ld --minified
# OR: python -m abstractcore.apps.extractor document.txt --format json-ld --minified

# Iterative refinement for quality
extractor document.txt --iterate 3 --verbose
# OR: python -m abstractcore.apps.extractor document.txt --iterate 3 --verbose

CLI Parameters

| Parameter | Options | Default | Description |
|-----------|---------|---------|-------------|
| file_path | Any text file | Required | Path to the file to extract from |
| --focus | Any text | None | Specific focus area (e.g., "technology", "business") |
| --style | structured, focused, minimal, comprehensive | structured | Extraction style |
| --length | brief, standard, detailed, comprehensive | brief | Extraction depth |
| --entity-types | Comma-separated list | All types | Entity types to focus on |
| --format | json-ld, triples, json, yaml | json-ld | Output format |
| --output | File path | Console | Output file path |
| --chunk-size | 1000-32000 | 8000 | Chunk size in characters |
| --provider | openai, anthropic, ollama, etc. | ollama | LLM provider |
| --model | Provider-specific | qwen3:4b-instruct-2507-q4_K_M | LLM model |
| --iterate | 1-5 | 1 | Number of refinement iterations |
| --minified | Flag | False | Output minified JSON |
| --verbose | Flag | False | Show detailed progress |
| --timeout | Seconds/none | unlimited | HTTP timeout for LLM requests. Use 'none' for unlimited or specify seconds (e.g., 600) |

Entity Types

Available entity types for --entity-types parameter:

  • person - People and individuals
  • organization - Companies, institutions, groups
  • location - Places, cities, countries, addresses
  • concept - Abstract concepts, ideas, theories
  • event - Occurrences, meetings, incidents
  • technology - Software, hardware, technical systems
  • product - Products, services, offerings
  • date - Temporal references, dates, times
  • other - Miscellaneous entities
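
A wrapper script may want to check a comma-separated `--entity-types` value before launching an extraction. The sketch below validates against the list above. How the CLI itself treats unknown or differently cased names is not documented here, so the lowercasing and the error behaviour are assumptions, and `parse_entity_types` is a hypothetical helper.

```python
# Entity types listed above.
KNOWN_ENTITY_TYPES = {
    "person", "organization", "location", "concept", "event",
    "technology", "product", "date", "other",
}


def parse_entity_types(raw):
    """Split a comma-separated value and reject names not in the list above."""
    # Assumption: names are compared case-insensitively and blanks are ignored.
    requested = [part.strip().lower() for part in raw.split(",") if part.strip()]
    unknown = [name for name in requested if name not in KNOWN_ENTITY_TYPES]
    if unknown:
        raise ValueError(f"Unknown entity types: {', '.join(unknown)}")
    return requested


print(parse_entity_types("person, Organization"))  # ['person', 'organization']
```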

Output Format Examples

Simple triples (no --verbose):

python -m abstractcore.apps.extractor doc.txt --format triples
# Output:
# Google creates TensorFlow
# Microsoft uses TensorFlow
# OpenAI develops GPT-4

Detailed triples (with --verbose):

python -m abstractcore.apps.extractor doc.txt --format triples --verbose
# Output: JSON with entities, relationships, confidence scores, statistics

Minified JSON-LD:

python -m abstractcore.apps.extractor doc.txt --format json-ld --minified
# Output: {"@context":{"s":"https://schema.org/"},"@graph":[...]}

Real-World Examples

Example 1: Technology Documentation

Input:

Google's TensorFlow is an open-source machine learning framework.
Microsoft Azure integrates TensorFlow for cloud AI services.
OpenAI's GPT models use transformer architecture developed by Google Research.

Command:

python -m abstractcore.apps.extractor tech.txt --focus technology --length detailed --format triples --verbose

Expected output (abbreviated, so fewer entities are listed than the statistics count):

{
  "format": "triples",
  "simple_triples": [
    "Google creates TensorFlow",
    "Microsoft integrates TensorFlow",
    "OpenAI develops GPT",
    "Google Research develops transformer architecture"
  ],
  "entities": {
    "e:google": {"name": "Google", "type": "s:Organization"},
    "e:tensorflow": {"name": "TensorFlow", "type": "s:SoftwareApplication"},
    "e:microsoft": {"name": "Microsoft", "type": "s:Organization"}
  },
  "statistics": {"entities_count": 6, "relationships_count": 4}
}

Example 2: Business Analysis

Python API:

from abstractcore.processing import BasicExtractor

extractor = BasicExtractor()

business_text = """
Amazon acquired Whole Foods for $13.7 billion in 2017.
The acquisition expanded Amazon's grocery delivery capabilities.
Jeff Bezos was CEO of Amazon during this strategic move.
"""

# Extract with business focus
result = extractor.extract(
    business_text,
    domain_focus="business",
    length="standard",
    output_format="jsonld"
)

# Access entities and relationships
entities = [item for item in result['@graph'] if item.get('@id', '').startswith('e:')]
relationships = [item for item in result['@graph'] if item.get('@id', '').startswith('r:')]

print(f"Found {len(entities)} entities and {len(relationships)} relationships")

Example 3: Research Paper Processing

Command:

# Process academic paper with comprehensive extraction
python -m abstractcore.apps.extractor research_paper.pdf \
  --focus "research" \
  --length comprehensive \
  --iterate 2 \
  --format json-ld \
  --output paper_knowledge_graph.jsonld \
  --verbose

Best Practices

1. Model Selection

For Complex JSON-LD Extraction (RECOMMENDED):

# Best quality for structured knowledge graphs
llm = create_llm("openai", model="gpt-4o-mini")  # $0.001-0.01 per request
extractor = BasicExtractor(llm)

# Alternative: High-quality Claude
llm = create_llm("anthropic", model="claude-haiku-4-5")  # Similar cost
extractor = BasicExtractor(llm)

For Simple Fact Extraction (Local):

# Works well with qwen3-coder:30b for basic structured output
llm = create_llm("ollama", model="qwen3-coder:30b")  # 18GB, free
# Use simple JSON prompts instead of complex JSON-LD

# Default option
llm = create_llm("ollama", model="qwen3:4b-instruct-2507-q4_K_M")  # 4GB, balanced

Note:

  • Cloud models: Produce high-quality complex JSON-LD with schema.org vocabulary
  • ⚠️ Local models: Good for simple facts, struggle with complex structured formats
  • Best approach: Use cloud models for production knowledge graphs, local models for simple extraction

2. Document Processing

Small Documents (<8000 chars):

  • Use length="standard" or length="detailed"
  • Single extraction pass is sufficient

Large Documents (>8000 chars):

  • Automatic chunking with overlap
  • Use --iterate=2 for better coverage
  • Consider length="brief" to avoid token limits
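
Chunking with overlap can be pictured as a sliding window over the text. The source does not state the overlap size or how chunk boundaries are chosen, so the `overlap=500` default and the fixed-width slicing below are assumptions. This is a sketch of the general technique, not of BasicExtractor's internals.

```python
def chunk_text(text, max_chunk_size=8000, overlap=500):
    """Split text into chunks of at most max_chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at
    a boundary still appears whole in one of the two chunks.
    """
    if overlap >= max_chunk_size:
        raise ValueError("overlap must be smaller than max_chunk_size")
    if len(text) <= max_chunk_size:
        return [text]
    chunks = []
    start = 0
    step = max_chunk_size - overlap  # how far the window advances each time
    while start < len(text):
        chunks.append(text[start:start + max_chunk_size])
        if start + max_chunk_size >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(20000))
parts = chunk_text(sample)
print([len(p) for p in parts])  # [8000, 8000, 5000]
```

A real splitter would usually prefer sentence or paragraph boundaries over fixed offsets, at the cost of slightly uneven chunk sizes.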

Domain-Specific Text:

  • Always use domain_focus parameter
  • Examples: "technology", "business", "medical", "legal", "academic"

3. Output Format Selection

Choose JSON-LD when:

  • Building semantic web applications
  • Need W3C standard compliance
  • Integrating with knowledge graph databases
  • Require full metadata and context

Choose Triples when:

  • Building graph databases (Neo4j, etc.)
  • Need simple SUBJECT PREDICATE OBJECT format
  • Implementing reasoning systems
  • Want human-readable relationships

Choose Minified when:

  • Storage space is limited
  • Network transmission efficiency matters
  • Building APIs with compact responses

4. Quality Optimization

For Higher Quality:

# Use iterative refinement (finds missing entities)
python -m abstractcore.apps.extractor doc.txt --iterate 3 --length detailed

# Use better models
python -m abstractcore.apps.extractor doc.txt --provider openai --model gpt-4o-mini

For Faster Processing:

# Use brief extraction with fast models
python -m abstractcore.apps.extractor doc.txt --length brief

Schema & Ontology

BasicExtractor uses standard vocabularies for maximum compatibility:

Entity Types (schema.org)

  • s:Person - People by name
  • s:Organization - Companies, institutions
  • s:SoftwareApplication - Software, frameworks, tools
  • s:Place - Locations, venues
  • s:Product - Products, services
  • s:Event - Events, meetings
  • sk:Concept - Abstract concepts, technologies

Relationship Types

  • creates - Authorship, development
  • uses - Utilization, dependency
  • supports - Support, enablement
  • partOf - Structural relationships
  • integrates - Integration, compatibility
  • provides - Service provision
  • memberOf - Organizational membership

Entity Structure

{
  "@id": "e:entity_name",
  "@type": "s:EntityType",
  "s:name": "Human readable name",
  "s:description": "Brief description",
  "confidence": 0.95
}

Relationship Structure

{
  "@id": "r:1",
  "@type": "s:Relationship",
  "s:name": "relationship_type",
  "s:about": {"@id": "e:subject_entity"},
  "s:object": {"@id": "e:object_entity"},
  "s:description": "Relationship description",
  "confidence": 0.90,
  "strength": 0.85
}
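
Because every node carries a `confidence` score, a common post-processing step is to drop low-confidence nodes and then remove relationships whose endpoints no longer exist. Below is a minimal sketch over dicts shaped like the structures above. The threshold of 0.8 and the helper name `filter_graph` are illustrative choices, not library defaults.

```python
def filter_graph(doc, min_confidence=0.8):
    """Keep confident nodes and drop relationships whose endpoints were removed."""
    kept_entities = {}
    relationships = []
    for node in doc.get("@graph", []):
        if node.get("confidence", 0.0) < min_confidence:
            continue
        node_id = node.get("@id", "")
        if node_id.startswith("e:"):
            kept_entities[node_id] = node
        elif node_id.startswith("r:"):
            relationships.append(node)
    # A relationship survives only if both of its endpoints survived.
    kept_relationships = [
        r for r in relationships
        if r.get("s:about", {}).get("@id") in kept_entities
        and r.get("s:object", {}).get("@id") in kept_entities
    ]
    return {**doc, "@graph": list(kept_entities.values()) + kept_relationships}


example = {
    "@graph": [
        {"@id": "e:a", "s:name": "A", "confidence": 0.95},
        {"@id": "e:b", "s:name": "B", "confidence": 0.50},
        {"@id": "e:c", "s:name": "C", "confidence": 0.90},
        {"@id": "r:1", "s:name": "uses", "s:about": {"@id": "e:a"},
         "s:object": {"@id": "e:b"}, "confidence": 0.90},
        {"@id": "r:2", "s:name": "uses", "s:about": {"@id": "e:a"},
         "s:object": {"@id": "e:c"}, "confidence": 0.90},
    ]
}
print([n["@id"] for n in filter_graph(example)["@graph"]])  # ['e:a', 'e:c', 'r:2']
```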

JSON Self-Correction

BasicExtractor includes automatic JSON self-correction that attempts to fix malformed LLM responses before giving up:

Automatic Recovery:

  • Extracts JSON from text with extra content
  • Fixes common formatting issues (trailing commas, quote problems)
  • Repairs truncated JSON by adding missing braces
  • Creates minimal valid structure from partial content

In Action:

⚠️  JSON parsing failed: Expecting ',' delimiter: line 3 column 45
ℹ️  JSON self-fix: Fixed common formatting issues
✅ JSON self-correction successful! Extracted 3 entities and 2 relationships

This significantly improves extraction reliability, especially with local models that may produce slightly malformed JSON.
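
To make the recovery steps above concrete, here is a standard-library sketch of the same ideas: locate the JSON inside surrounding text, strip trailing commas, and close brackets left open by truncation. It is an illustration of the technique, not BasicExtractor's actual implementation, and the function names are hypothetical. The trailing-comma regex is deliberately naive and could also touch a comma inside a string value.

```python
import json
import re

_TRAILING_COMMA = re.compile(r",\s*([}\]])")


def _close_open_brackets(text):
    """Append the quote and brackets needed to balance a truncated JSON text."""
    stack, in_string, escaped = [], False, False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    if in_string:
        text += '"'  # the text was cut off in the middle of a string
    else:
        text = text.rstrip().rstrip(",")
    return text + "".join(reversed(stack))


def repair_json(raw):
    """Best-effort parse of an LLM reply that should contain one JSON object."""
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object in response")
    text = raw[start:]  # drop any prose before the object
    without_commas = _TRAILING_COMMA.sub(r"\1", text)
    closed = _TRAILING_COMMA.sub(r"\1", _close_open_brackets(without_commas))
    decoder = json.JSONDecoder()
    for candidate in (text, without_commas, closed):
        try:
            # raw_decode parses one value and ignores any text after it.
            value, _ = decoder.raw_decode(candidate)
            return value
        except json.JSONDecodeError:
            continue
    raise ValueError("could not repair JSON")


noisy = 'Here is the graph:\n{"@graph": [{"@id": "e:apple",}],}\nDone.'
print(repair_json(noisy))  # {'@graph': [{'@id': 'e:apple'}]}
```

Trying the least invasive candidate first means well-formed responses are returned untouched, and the more aggressive repairs only run when parsing has already failed.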

Troubleshooting

Common Issues

"Failed to initialize default Ollama model"

# Install Ollama and download model
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:4b-instruct-2507-q4_K_M
ollama serve

Empty extraction results

  • Try a different model: --provider openai --model gpt-4o-mini
  • Increase extraction length: --length detailed
  • Add domain focus: --focus technology

JSON parsing errors

  • Automatic self-correction handles most cases
  • If persistent, try a more capable model
  • Check model output with --verbose flag

Large file processing slow

  • Use --length brief for fewer entities
  • Consider pre-processing to extract relevant sections

Poor entity quality

  • Use iterative refinement: --iterate=2
  • Try more capable models (GPT-4, Claude)
  • Add specific domain focus

Error Messages

"Chunk size must be at least 1000 characters"

  • Set --chunk-size to at least 1000 (allowed range: 1000-32000)
  • File might be too short for meaningful extraction

"Iterate cannot exceed 5"

  • Maximum 5 refinement iterations allowed
  • Diminishing returns beyond 3 iterations

"Provider/model required together"

  • Both --provider and --model must be specified together
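
When the CLI is driven from a script, the three conditions above can be checked up front so that a batch job fails fast. The following sketch mirrors only the documented messages. `validate_options` is a hypothetical helper, and the CLI's own validation may differ in detail.

```python
def validate_options(chunk_size=8000, iterate=1, provider=None, model=None):
    """Return a list of problems; an empty list means the options look valid."""
    problems = []
    if chunk_size < 1000:
        problems.append("Chunk size must be at least 1000 characters")
    if iterate > 5:
        problems.append("Iterate cannot exceed 5")
    # Exactly one of provider/model being set is the invalid combination.
    if (provider is None) != (model is None):
        problems.append("Provider/model required together")
    return problems


print(validate_options(chunk_size=500, provider="openai"))
```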

Timeout Configuration

The extractor supports flexible timeout configuration for different use cases:

Default Behavior (Unlimited Timeout)

# Runs as long as needed - recommended for large documents or complex models
python -m abstractcore.apps.extractor document.txt

Custom Timeout

# Set specific timeout (useful for production environments)
python -m abstractcore.apps.extractor document.txt --timeout 600  # 10 minutes
python -m abstractcore.apps.extractor document.txt --timeout 1800 # 30 minutes

# Explicit unlimited timeout
python -m abstractcore.apps.extractor document.txt --timeout none

Programmatic Usage

from abstractcore.processing import BasicExtractor

# Unlimited timeout (default)
extractor = BasicExtractor()

# Custom timeout
extractor = BasicExtractor(timeout=600)  # 10 minutes

# Explicit unlimited timeout
extractor = BasicExtractor(timeout=None)

When to Use Timeouts:

  • Production environments: Set reasonable timeouts (300-1800 seconds) to prevent hanging
  • Large documents: Use unlimited timeout for documents >50KB or complex extractions
  • Large models: Models >30B parameters may need longer processing time
  • Development: Use unlimited timeout to avoid interruptions during testing
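
The CLI accepts either a number of seconds or `none`, while the Python examples above pass a number or `None`. A small conversion helper keeps the two in step when a wrapper forwards a user-supplied value. `parse_timeout` is a hypothetical name, and rejecting non-positive values is an assumption made for this sketch.

```python
def parse_timeout(value):
    """Convert a --timeout style value to seconds, or None for unlimited."""
    if value is None or str(value).strip().lower() == "none":
        return None
    seconds = float(value)
    if seconds <= 0:
        # Assumption: zero or negative timeouts are treated as a mistake.
        raise ValueError("timeout must be a positive number of seconds")
    return seconds


print(parse_timeout("600"))   # 600.0
print(parse_timeout("none"))  # None
```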

Integration Examples

Web Application

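from flask import Flask, request, jsonify
from abstractcore.processing import BasicExtractor

app = Flask(__name__)
extractor = BasicExtractor()

@app.route('/extract', methods=['POST'])
def extract_knowledge():
    # silent=True returns None instead of raising on a missing or invalid JSON body
    payload = request.get_json(silent=True) or {}
    text = payload.get('text')
    format_type = payload.get('format', 'jsonld')

    # Reject requests without text rather than passing None to the extractor
    if not text:
        return jsonify({'error': "Missing 'text' field"}), 400

    result = extractor.extract(text, output_format=format_type)
    return jsonify(result)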

Data Pipeline

import pandas as pd
from abstractcore.processing import BasicExtractor

def process_documents(file_paths):
    extractor = BasicExtractor()
    results = []

    for path in file_paths:
        with open(path, 'r', encoding='utf-8') as f:
            text = f.read()

        kg = extractor.extract(
            text,
            length="standard",
            output_format="triples"
        )

        results.append({
            'file': path,
            'entities': len(kg.get('entities', {})),
            'triples': kg.get('simple_triples', [])
        })

    return pd.DataFrame(results)

Performance Notes

Extraction is an LLM call, so latency and cost vary by provider/model, input size, and retry behavior.

Practical guidance:

  • For large documents, extract in chunks (or pre-summarize) to reduce latency and avoid context limits.
  • Use a smaller model for higher throughput; use a larger model when extraction quality matters most.
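
If you split a document yourself and extract each piece separately, the per-chunk graphs need to be merged afterwards. The sketch below deduplicates entities by `@id` and relationships by their (subject, predicate, object) triple, keeping the higher-confidence copy. It assumes relationship ids such as `r:1` may repeat across chunks, which is why relationships are renumbered. `merge_graphs` is a hypothetical helper, not an AbstractCore function.

```python
def merge_graphs(docs):
    """Merge JSON-LD dicts, keeping the highest-confidence copy of each node."""
    context, entities, relationships = {}, {}, {}
    for doc in docs:
        context.update(doc.get("@context", {}))
        for node in doc.get("@graph", []):
            node_id = node.get("@id", "")
            if node_id.startswith("e:"):
                key, bucket = node_id, entities
            elif node_id.startswith("r:"):
                # Assumption: "r:N" ids can collide across chunks, so key on the triple.
                key = (node.get("s:about", {}).get("@id"), node.get("s:name"),
                       node.get("s:object", {}).get("@id"))
                bucket = relationships
            else:
                continue
            kept = bucket.get(key)
            if kept is None or node.get("confidence", 0.0) > kept.get("confidence", 0.0):
                bucket[key] = node
    # Give merged relationships fresh, unique ids.
    renumbered = [{**rel, "@id": f"r:{i}"}
                  for i, rel in enumerate(relationships.values(), start=1)]
    return {"@context": context, "@graph": list(entities.values()) + renumbered}


chunk_a = {"@context": {"s": "https://schema.org/"}, "@graph": [
    {"@id": "e:google", "s:name": "Google", "confidence": 0.90},
    {"@id": "e:tensorflow", "s:name": "TensorFlow", "confidence": 0.80},
    {"@id": "r:1", "s:name": "creates", "s:about": {"@id": "e:google"},
     "s:object": {"@id": "e:tensorflow"}, "confidence": 0.90},
]}
chunk_b = {"@context": {"s": "https://schema.org/"}, "@graph": [
    {"@id": "e:tensorflow", "s:name": "TensorFlow", "confidence": 0.95},
    {"@id": "e:microsoft", "s:name": "Microsoft", "confidence": 0.90},
    {"@id": "r:1", "s:name": "uses", "s:about": {"@id": "e:microsoft"},
     "s:object": {"@id": "e:tensorflow"}, "confidence": 0.85},
]}
merged = merge_graphs([chunk_a, chunk_b])
print([n["@id"] for n in merged["@graph"]])
# ['e:google', 'e:tensorflow', 'e:microsoft', 'r:1', 'r:2']
```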