BasicExtractor is a production-ready tool for extracting structured knowledge graphs from text documents. It converts unstructured text into semantic entities and relationships using large language models, outputting clean JSON-LD or RDF triple formats.
from abstractcore.processing import BasicExtractor
# Initialize with default model (Ollama qwen3:4b-instruct-2507-q4_K_M)
extractor = BasicExtractor()
# Extract knowledge graph
result = extractor.extract("Google created TensorFlow in 2015. Microsoft uses TensorFlow for Azure AI.")
# Result contains entities and relationships in JSON-LD format
entities = [item for item in result['@graph'] if item.get('@id', '').startswith('e:')]
relationships = [item for item in result['@graph'] if item.get('@id', '').startswith('r:')]

# Install AbstractCore. The default Ollama path works with the core install.
pip install abstractcore
# Optional turnkey local-runtime installs:
pip install "abstractcore[all-apple]" # Apple Silicon: HF/GGUF + MLX + features + server
pip install "abstractcore[all-gpu]" # NVIDIA GPU: HF/GGUF + vLLM + features + server
# Default model requires Ollama (free, runs locally)
# 1. Install Ollama: https://ollama.com/
# 2. Download model: ollama pull qwen3:4b-instruct-2507-q4_K_M
# 3. Start Ollama service
# Alternative: Use cloud providers
pip install "abstractcore[remote]"

Default Model: qwen3:4b-instruct-2507-q4_K_M
- Size: ~4GB model
- RAM: ~8GB required
- Speed: Good balance of speed and quality
- Setup:
ollama pull qwen3:4b-instruct-2507-q4_K_M
For Optimal Performance:
- qwen3-coder:30b: High quality for structured JSON-LD output (requires 32GB RAM)
- gpt-oss:120b: Highest quality extraction (requires 120GB RAM)
For Production: Cloud providers (OpenAI GPT-4o-mini, Claude) offer the most reliable JSON-LD generation.
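
The RAM figures above can be read as a simple lookup. The sketch below is illustrative only: `pick_local_model` and `LOCAL_MODELS` are hypothetical names, not part of AbstractCore, and the numbers are the approximate requirements quoted in this section.

```python
# Hypothetical helper (not part of AbstractCore): choose a local Ollama model
# based on the approximate RAM requirements quoted in this section.
LOCAL_MODELS = [
    # (model name, approximate RAM required in GB), largest first
    ("gpt-oss:120b", 120),
    ("qwen3-coder:30b", 32),
    ("qwen3:4b-instruct-2507-q4_K_M", 8),
]


def pick_local_model(available_ram_gb):
    """Return the largest listed model that fits in RAM, or None if none fits."""
    for name, required_gb in LOCAL_MODELS:
        if available_ram_gb >= required_gb:
            return name
    return None


print(pick_local_model(16))  # qwen3:4b-instruct-2507-q4_K_M
```

If the helper returns None, a cloud provider is probably the more practical choice.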
BasicExtractor supports three output formats optimized for different use cases:
Standard W3C JSON-LD with schema.org vocabulary - ideal for semantic web applications:
result = extractor.extract("Apple acquired Siri in 2010", output_format="jsonld")

{
"@context": {
"s": "https://schema.org/",
"e": "http://example.org/entity/",
"r": "http://example.org/relation/",
"confidence": "http://example.org/confidence"
},
"@graph": [
{
"@id": "e:apple",
"@type": "s:Organization",
"s:name": "Apple",
"s:description": "Technology company",
"confidence": 0.95
},
{
"@id": "r:1",
"@type": "s:Relationship",
"s:name": "acquires",
"s:about": {"@id": "e:apple"},
"s:object": {"@id": "e:siri"},
"confidence": 0.90
}
]
}

SUBJECT PREDICATE OBJECT format following semantic web standards - perfect for graph databases:
result = extractor.extract("Apple acquired Siri in 2010", output_format="triples")

Simple output:
Apple acquires Siri
Detailed output (with metadata):
{
"format": "triples",
"simple_triples": ["Apple acquires Siri"],
"triples": [
{
"subject": "e:apple",
"subject_name": "Apple",
"predicate": "acquires",
"object": "e:siri",
"object_name": "Siri",
"confidence": 0.90
}
],
"entities": {...},
"statistics": {"entities_count": 2, "relationships_count": 1}
}

Compact JSON string without indentation - optimized for storage and transport:
result = extractor.extract("Apple acquired Siri in 2010", output_format="jsonld_minified")

{
"format": "jsonld_minified",
"data": "{\"@context\":{\"s\":\"https://schema.org/\"},\"@graph\":[...]}",
"entities_count": 2,
"relationships_count": 1
}

class BasicExtractor:
    def __init__(
        self,
        llm: Optional[AbstractCoreInterface] = None,
        max_chunk_size: int = 8000
    )

    def extract(
        self,
        text: str,
        domain_focus: Optional[str] = None,
        entity_types: Optional[List[str]] = None,
        style: Optional[str] = None,
        length: Optional[str] = None,
        output_format: str = "jsonld"
    ) -> dict

- text (str): Text to extract knowledge from
- domain_focus (str, optional): Focus area like "technology", "business", "medical"
- entity_types (List[str], optional): Reserved for future use
- style (str, optional): Reserved for future use
- length (str, optional): Extraction depth
  - "brief" - 10 entities max (fast)
  - "standard" - 15 entities max (balanced)
  - "detailed" - 25 entities max (thorough)
  - "comprehensive" - 50 entities max (extensive)
- output_format (str): Output format
  - "jsonld" - Standard JSON-LD (default)
  - "triples" - RDF SUBJECT PREDICATE OBJECT format
  - "jsonld_minified" - Compact JSON string
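
The documented values for `length` and `output_format` can be checked before an extraction call is made. The following is a minimal sketch, not AbstractCore code: `MAX_ENTITIES`, `OUTPUT_FORMATS`, and `check_extract_options` are illustrative names, and the entity caps are simply the figures listed above.

```python
# Illustrative only: these names are not part of the AbstractCore API.
MAX_ENTITIES = {
    "brief": 10,
    "standard": 15,
    "detailed": 25,
    "comprehensive": 50,
}
OUTPUT_FORMATS = ("jsonld", "triples", "jsonld_minified")


def check_extract_options(length=None, output_format="jsonld"):
    """Validate the options and return the entity cap (None if length is unset)."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(
            f"Unknown output_format {output_format!r}; expected one of {OUTPUT_FORMATS}"
        )
    if length is None:
        return None
    if length not in MAX_ENTITIES:
        raise ValueError(
            f"Unknown length {length!r}; expected one of {tuple(MAX_ENTITIES)}"
        )
    return MAX_ENTITIES[length]


print(check_extract_options(length="detailed", output_format="triples"))  # 25
```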
from abstractcore import create_llm
from abstractcore.processing import BasicExtractor
# RECOMMENDED: Use cloud providers for complex JSON-LD extraction
llm = create_llm("openai", model="gpt-4o-mini") # Best for production
extractor = BasicExtractor(llm)
# OR use Anthropic Claude for high quality
llm = create_llm("anthropic", model="claude-haiku-4-5")
extractor = BasicExtractor(llm)
# LOCAL MODELS: Work well for simple extraction, may struggle with complex JSON-LD
llm = create_llm("ollama", model="qwen3-coder:30b") # Good for code and structured tasks
extractor = BasicExtractor(llm)
# For simple fact extraction with local models, use direct prompting:
facts_prompt = """Extract facts as JSON: [{"entity": "...", "action": "...", "object": "..."}]"""
facts = llm.generate(facts_prompt)  # Works reliably with local models

The extractor CLI provides direct terminal access for knowledge graph extraction without any Python programming.
# Simple usage (after installing AbstractCore; add `pip install "abstractcore[media]"` for PDFs)
extractor document.pdf
# With specific format and focus
extractor report.txt --format triples --focus technology
# Extract specific entity types
extractor data.md --entity-types person,organization --output entities.jsonld
# High-quality extraction with iterations
extractor doc.txt --iterate=3 --length=detailed --verbose

# Method 1: Direct command (recommended after installation)
extractor document.txt --format triples
# Method 2: Via Python module (always works)
python -m abstractcore.apps.extractor document.txt --format triples

# Extract from file (default: JSON-LD)
extractor document.txt
# OR: python -m abstractcore.apps.extractor document.txt
# Specify output format
extractor document.txt --format triples
# OR: python -m abstractcore.apps.extractor document.txt --format triples
# Save to file
extractor document.txt --output knowledge_graph.jsonld
# OR: python -m abstractcore.apps.extractor document.txt --output knowledge_graph.jsonld

# Domain-focused extraction
extractor tech_report.txt --focus technology --length detailed
# OR: python -m abstractcore.apps.extractor tech_report.txt --focus technology --length detailed
# Custom provider and model
extractor document.txt --provider openai --model gpt-4o-mini
# OR: python -m abstractcore.apps.extractor document.txt --provider openai --model gpt-4o-mini
# Minified output for storage
extractor document.txt --format json-ld --minified
# OR: python -m abstractcore.apps.extractor document.txt --format json-ld --minified
# Iterative refinement for quality
extractor document.txt --iterate 3 --verbose
# OR: python -m abstractcore.apps.extractor document.txt --iterate 3 --verbose

| Parameter | Options | Default | Description |
|---|---|---|---|
| file_path | Any text file | Required | Path to the file to extract from |
| --focus | Any text | None | Specific focus area (e.g., "technology", "business") |
| --style | structured, focused, minimal, comprehensive | structured | Extraction style |
| --length | brief, standard, detailed, comprehensive | brief | Extraction depth |
| --entity-types | Comma-separated list | All types | Entity types to focus on |
| --format | json-ld, triples, json, yaml | json-ld | Output format |
| --output | File path | Console | Output file path |
| --chunk-size | 1000-32000 | 8000 | Chunk size in characters |
| --provider | openai, anthropic, ollama, etc. | ollama | LLM provider |
| --model | Provider-specific | qwen3:4b-instruct-2507-q4_K_M | LLM model |
| --iterate | 1-5 | 1 | Number of refinement iterations |
| --minified | Flag | False | Output minified JSON |
| --verbose | Flag | False | Show detailed progress |
| --timeout | Seconds/none | unlimited | HTTP timeout for LLM requests. Use 'none' for unlimited or specify seconds (e.g., 600) |
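
When the CLI is driven from a script, it can help to assemble the argument list programmatically. The sketch below is one way to do that using only flags from the table above; `build_extractor_command` is a hypothetical helper, not something AbstractCore ships, and it covers only a subset of the parameters.

```python
import sys


def build_extractor_command(file_path, *, focus=None, length=None,
                            output_format=None, output=None, iterate=None,
                            minified=False, verbose=False):
    """Build an argv list for the extractor CLI (hypothetical helper)."""
    cmd = [sys.executable, "-m", "abstractcore.apps.extractor", str(file_path)]
    # Flags that take a value are added only when a value was supplied.
    valued_flags = (
        ("--focus", focus),
        ("--length", length),
        ("--format", output_format),
        ("--output", output),
        ("--iterate", iterate),
    )
    for flag, value in valued_flags:
        if value is not None:
            cmd += [flag, str(value)]
    # Boolean flags are appended without a value.
    if minified:
        cmd.append("--minified")
    if verbose:
        cmd.append("--verbose")
    return cmd


cmd = build_extractor_command("report.txt", focus="technology", output_format="triples")
print(cmd[1:])

# To run it (requires AbstractCore and a configured provider):
# import subprocess
# completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
# print(completed.stdout)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.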
Available entity types for --entity-types parameter:
- person - People and individuals
- organization - Companies, institutions, groups
- location - Places, cities, countries, addresses
- concept - Abstract concepts, ideas, theories
- event - Occurrences, meetings, incidents
- technology - Software, hardware, technical systems
- product - Products, services, offerings
- date - Temporal references, dates, times
- other - Miscellaneous entities
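
A comma-separated `--entity-types` value can be checked against this list before it is passed on. This is a sketch under assumptions: `parse_entity_types` is a hypothetical helper, and the whitespace trimming and lowercasing are choices made here, not documented CLI behavior.

```python
# Entity type names as listed above.
ENTITY_TYPES = (
    "person", "organization", "location", "concept", "event",
    "technology", "product", "date", "other",
)


def parse_entity_types(value):
    """Split a comma-separated string and reject names outside the documented list."""
    # Assumption: surrounding whitespace and letter case are not significant.
    names = [part.strip().lower() for part in value.split(",") if part.strip()]
    unknown = [name for name in names if name not in ENTITY_TYPES]
    if unknown:
        raise ValueError(f"Unknown entity types: {', '.join(unknown)}")
    return names


print(parse_entity_types("person, organization"))  # ['person', 'organization']
```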
Simple triples (no --verbose):
python -m abstractcore.apps.extractor doc.txt --format triples
# Output:
# Google creates TensorFlow
# Microsoft uses TensorFlow
# OpenAI develops GPT-4

Detailed triples (with --verbose):
python -m abstractcore.apps.extractor doc.txt --format triples --verbose
# Output: JSON with entities, relationships, confidence scores, statistics

Minified JSON-LD:
python -m abstractcore.apps.extractor doc.txt --format json-ld --minified
# Output: {"@context":{"s":"https://schema.org/"},"@graph":[...]}

Input:
Google's TensorFlow is an open-source machine learning framework.
Microsoft Azure integrates TensorFlow for cloud AI services.
OpenAI's GPT models use transformer architecture developed by Google Research.
Command:
python -m abstractcore.apps.extractor tech.txt --focus technology --length detailed --format triples --verbose

Expected Output:
{
"format": "triples",
"simple_triples": [
"Google creates TensorFlow",
"Microsoft integrates TensorFlow",
"OpenAI develops GPT",
"Google Research develops transformer architecture"
],
"entities": {
"e:google": {"name": "Google", "type": "s:Organization"},
"e:tensorflow": {"name": "TensorFlow", "type": "s:SoftwareApplication"},
"e:microsoft": {"name": "Microsoft", "type": "s:Organization"}
},
"statistics": {"entities_count": 6, "relationships_count": 4}
}

Python API:
from abstractcore.processing import BasicExtractor
extractor = BasicExtractor()
business_text = """
Amazon acquired Whole Foods for $13.7 billion in 2017.
The acquisition expanded Amazon's grocery delivery capabilities.
Jeff Bezos was CEO of Amazon during this strategic move.
"""
# Extract with business focus
result = extractor.extract(
    business_text,
    domain_focus="business",
    length="standard",
    output_format="jsonld"
)
# Access entities and relationships
entities = [item for item in result['@graph'] if item.get('@id', '').startswith('e:')]
relationships = [item for item in result['@graph'] if item.get('@id', '').startswith('r:')]
print(f"Found {len(entities)} entities and {len(relationships)} relationships")

Command:
# Process academic paper with comprehensive extraction
python -m abstractcore.apps.extractor research_paper.pdf \
    --focus "research" \
    --length comprehensive \
    --iterate 2 \
    --format json-ld \
    --output paper_knowledge_graph.jsonld \
    --verbose

For Complex JSON-LD Extraction (RECOMMENDED):
# Best quality for structured knowledge graphs
llm = create_llm("openai", model="gpt-4o-mini") # $0.001-0.01 per request
extractor = BasicExtractor(llm)
# Alternative: High-quality Claude
llm = create_llm("anthropic", model="claude-haiku-4-5") # Similar cost
extractor = BasicExtractor(llm)

For Simple Fact Extraction (Local):
# Works well with qwen3-coder:30b for basic structured output
llm = create_llm("ollama", model="qwen3-coder:30b") # 18GB, free
# Use simple JSON prompts instead of complex JSON-LD
# Default option
llm = create_llm("ollama", model="qwen3:4b-instruct-2507-q4_K_M")  # 4GB, balanced

Note:
- Cloud models: High quality on complex JSON-LD with schema.org vocabulary
- ⚠️ Local models: Good for simple facts, but struggle with complex structured formats
- Best approach: Use cloud models for production knowledge graphs, local models for simple extraction
Small Documents (<8000 chars):
- Use length="standard" or length="detailed"
- Single extraction pass is sufficient
Large Documents (>8000 chars):
- Automatic chunking with overlap
- Use --iterate=2 for better coverage
- Consider length="brief" to avoid token limits
Domain-Specific Text:
- Always use domain_focus parameter
- Examples: "technology", "business", "medical", "legal", "academic"
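
To make "chunking with overlap" concrete, here is a minimal character-based sketch. It is not the library's implementation: `chunk_text` is a hypothetical function, the 500-character overlap is an assumed value, and a real chunker may prefer to split on sentence or paragraph boundaries.

```python
def chunk_text(text, max_chunk_size=8000, overlap=500):
    """Split text into chunks of at most max_chunk_size characters.

    Consecutive chunks share `overlap` characters so that an entity or
    relationship straddling a boundary appears whole in at least one chunk.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be non-negative and smaller than max_chunk_size")
    if len(text) <= max_chunk_size:
        return [text]

    step = max_chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chunk_size])
        if start + max_chunk_size >= len(text):
            break  # the last chunk reached the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(20000))
chunks = chunk_text(sample)
print([len(chunk) for chunk in chunks])  # [8000, 8000, 5000]
```

A larger overlap lowers the risk of cutting a relationship in half, at the cost of processing more characters overall.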
Choose JSON-LD when:
- Building semantic web applications
- Need W3C standard compliance
- Integrating with knowledge graph databases
- Require full metadata and context
Choose Triples when:
- Building graph databases (Neo4j, etc.)
- Need simple SUBJECT PREDICATE OBJECT format
- Implementing reasoning systems
- Want human-readable relationships
Choose Minified when:
- Storage space is limited
- Network transmission efficiency matters
- Building APIs with compact responses
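
The three formats carry the same graph, which the following sketch illustrates by deriving simple triples and a minified wrapper from a JSON-LD document. The functions `to_simple_triples` and `to_minified` are hypothetical and only mirror the output shapes shown in this document; they are not how BasicExtractor produces these formats internally.

```python
import json


def to_simple_triples(doc):
    """Render each relationship as a 'Subject predicate Object' string."""
    graph = doc.get("@graph", [])
    # Map entity ids (e:...) to their human-readable names.
    names = {
        item["@id"]: item.get("s:name", item["@id"])
        for item in graph
        if item.get("@id", "").startswith("e:")
    }
    triples = []
    for item in graph:
        if not item.get("@id", "").startswith("r:"):
            continue
        subject_id = item["s:about"]["@id"]
        object_id = item["s:object"]["@id"]
        predicate = item["s:name"]
        subject = names.get(subject_id, subject_id)
        obj = names.get(object_id, object_id)
        triples.append(f"{subject} {predicate} {obj}")
    return triples


def to_minified(doc):
    """Wrap a compact JSON string in the minified shape shown earlier."""
    graph = doc.get("@graph", [])
    return {
        "format": "jsonld_minified",
        "data": json.dumps(doc, separators=(",", ":")),
        "entities_count": sum(1 for i in graph if i.get("@id", "").startswith("e:")),
        "relationships_count": sum(1 for i in graph if i.get("@id", "").startswith("r:")),
    }


doc = {
    "@context": {"s": "https://schema.org/"},
    "@graph": [
        {"@id": "e:apple", "@type": "s:Organization", "s:name": "Apple"},
        {"@id": "e:siri", "@type": "s:Organization", "s:name": "Siri"},
        {"@id": "r:1", "@type": "s:Relationship", "s:name": "acquires",
         "s:about": {"@id": "e:apple"}, "s:object": {"@id": "e:siri"}},
    ],
}
minified = to_minified(doc)
print(to_simple_triples(doc))  # ['Apple acquires Siri']
```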
For Higher Quality:
# Use iterative refinement (finds missing entities)
python -m abstractcore.apps.extractor doc.txt --iterate 3 --length detailed
# Use better models
python -m abstractcore.apps.extractor doc.txt --provider openai --model gpt-4o-mini

For Faster Processing:
# Use brief extraction with fast models
python -m abstractcore.apps.extractor doc.txt --length brief

BasicExtractor uses standard vocabularies for maximum compatibility:

Entity types:
- s:Person - People by name
- s:Organization - Companies, institutions
- s:SoftwareApplication - Software, frameworks, tools
- s:Place - Locations, venues
- s:Product - Products, services
- s:Event - Events, meetings
- sk:Concept - Abstract concepts, technologies

Relationship types:
- creates - Authorship, development
- uses - Utilization, dependency
- supports - Support, enablement
- partOf - Structural relationships
- integrates - Integration, compatibility
- provides - Service provision
- memberOf - Organizational membership
{
"@id": "e:entity_name",
"@type": "s:EntityType",
"s:name": "Human readable name",
"s:description": "Brief description",
"confidence": 0.95
}

{
"@id": "r:1",
"@type": "s:Relationship",
"s:name": "relationship_type",
"s:about": {"@id": "e:subject_entity"},
"s:object": {"@id": "e:object_entity"},
"s:description": "Relationship description",
"confidence": 0.90,
"strength": 0.85
}

BasicExtractor includes automatic JSON self-correction that attempts to fix malformed LLM responses before giving up:
Automatic Recovery:
- Extracts JSON from text with extra content
- Fixes common formatting issues (trailing commas, quote problems)
- Repairs truncated JSON by adding missing braces
- Creates minimal valid structure from partial content
In Action:
⚠️ JSON parsing failed: Expecting ',' delimiter: line 3 column 45
ℹ️ JSON self-fix: Fixed common formatting issues
✅ JSON self-correction successful! Extracted 3 entities and 2 relationships
This significantly improves extraction reliability, especially with local models that may produce slightly malformed JSON.
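
The first three recovery steps can be approximated with the standard library alone. The sketch below shows the general technique and is not BasicExtractor's actual code: `repair_json` and its helpers are hypothetical, the trailing-comma regex does not distinguish commas inside string values, and the last step listed above (building a minimal structure from partial content) is not covered.

```python
import json
import re

_DECODER = json.JSONDecoder()


def _strip_trailing_commas(text):
    """Remove commas that directly precede a closing brace or bracket."""
    return re.sub(r",\s*([}\]])", r"\1", text)


def _close_brackets(text):
    """Append the closers a truncated JSON text is missing."""
    closers = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()
    suffix = '"' if in_string else ""
    return text + suffix + "".join(reversed(closers))


def repair_json(raw):
    """Best-effort parse of a JSON object embedded in an LLM response."""
    start = raw.find("{")
    if start == -1:
        raise ValueError("No JSON object found")
    text = raw[start:]  # drop any prose before the object
    attempts = (
        text,
        _strip_trailing_commas(text),
        _strip_trailing_commas(_close_brackets(text)),
    )
    for candidate in attempts:
        try:
            # raw_decode ignores any text that follows the first JSON value.
            obj, _ = _DECODER.raw_decode(candidate)
            return obj
        except json.JSONDecodeError:
            continue
    raise ValueError("Could not repair JSON")


print(repair_json('Here is the result: {"a": 1} Hope this helps!'))  # {'a': 1}
```

Attempts run from least to most invasive, so well-formed output is returned unchanged.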
"Failed to initialize default Ollama model"
# Install Ollama and download model
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:4b-instruct-2507-q4_K_M
ollama serve

Empty extraction results
- Try a different model: --provider openai --model gpt-4o-mini
- Increase extraction length: --length detailed
- Add domain focus: --focus technology
JSON parsing errors
- Automatic self-correction handles most cases
- If persistent, try a more capable model
- Check model output with --verbose flag
Large file processing slow
- Use length=brief for fewer entities
- Consider pre-processing to extract relevant sections
Poor entity quality
- Use iterative refinement: --iterate=2
- Try more capable models (GPT-4, Claude)
- Add specific domain focus
"Chunk size must be at least 1000 characters"
- Increase --chunk-size parameter
- File might be too short for meaningful extraction
"Iterate cannot exceed 5"
- Maximum 5 refinement iterations allowed
- Diminishing returns beyond 3 iterations
"Provider/model required together"
- Both --provider and --model must be specified together
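
The last three errors come from argument validation, which can be reproduced ahead of time when the CLI is called from a script. This is a sketch only: `validate_cli_options` is a hypothetical pre-flight check that reuses the error messages quoted above, and the CLI itself may apply additional checks.

```python
def validate_cli_options(chunk_size=8000, iterate=1, provider=None, model=None):
    """Return a list of error messages; an empty list means the options look valid."""
    errors = []
    if chunk_size < 1000:
        errors.append("Chunk size must be at least 1000 characters")
    if iterate > 5:
        errors.append("Iterate cannot exceed 5")
    # Exactly one of provider/model being set is the invalid case.
    if (provider is None) != (model is None):
        errors.append("Provider/model required together")
    return errors


print(validate_cli_options(chunk_size=500, iterate=6, provider="openai"))
```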
The extractor supports flexible timeout configuration for different use cases:
# Runs as long as needed - recommended for large documents or complex models
python -m abstractcore.apps.extractor document.txt

# Set specific timeout (useful for production environments)
python -m abstractcore.apps.extractor document.txt --timeout 600 # 10 minutes
python -m abstractcore.apps.extractor document.txt --timeout 1800 # 30 minutes
# Explicit unlimited timeout
python -m abstractcore.apps.extractor document.txt --timeout none

from abstractcore.processing import BasicExtractor
# Unlimited timeout (default)
extractor = BasicExtractor()
# Custom timeout
extractor = BasicExtractor(timeout=600) # 10 minutes
# Explicit unlimited timeout
extractor = BasicExtractor(timeout=None)

When to Use Timeouts:
- Production environments: Set reasonable timeouts (300-1800 seconds) to prevent hanging
- Large documents: Use unlimited timeout for documents >50KB or complex extractions
- Large models: Models >30B parameters may need longer processing time
- Development: Use unlimited timeout to avoid interruptions during testing
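
When a timeout arrives as text, for example from an environment variable or a config file, it needs to be converted into the seconds-or-None form used in the Python examples above. The helper below is a hypothetical sketch of that conversion; rejecting zero and negative values is a choice made here, not documented behavior.

```python
def parse_timeout(value):
    """Convert a --timeout style value to seconds, or None for unlimited."""
    if value is None:
        return None
    text = str(value).strip().lower()
    if text == "none":
        return None
    seconds = float(text)  # raises ValueError for non-numeric input
    if seconds <= 0:
        raise ValueError("Timeout must be a positive number of seconds")
    return seconds


print(parse_timeout("600"))   # 600.0
print(parse_timeout("none"))  # None
```

The result could then be passed along as `BasicExtractor(timeout=parse_timeout(raw_value))`, following the constructor usage shown above.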
from flask import Flask, request, jsonify
from abstractcore.processing import BasicExtractor

app = Flask(__name__)
extractor = BasicExtractor()

@app.route('/extract', methods=['POST'])
def extract_knowledge():
    # Expect a JSON body such as {"text": "...", "format": "jsonld"}
    payload = request.get_json(silent=True) or {}
    text = payload.get('text')
    if not text:
        return jsonify({'error': "Missing 'text' field"}), 400
    format_type = payload.get('format', 'jsonld')
    result = extractor.extract(text, output_format=format_type)
    return jsonify(result)

import pandas as pd
from abstractcore.processing import BasicExtractor

def process_documents(file_paths):
    """Extract triples from each file and summarize the results in a DataFrame."""
    extractor = BasicExtractor()
    results = []
    for path in file_paths:
        with open(path, 'r', encoding='utf-8') as f:
            text = f.read()
        kg = extractor.extract(
            text,
            length="standard",
            output_format="triples"
        )
        results.append({
            'file': path,
            'entities': len(kg.get('entities', {})),
            'triples': kg.get('simple_triples', [])
        })
    return pd.DataFrame(results)

Extraction is an LLM call, so latency and cost vary by provider/model, input size, and retry behavior.
Practical guidance:
- For large documents, extract in chunks (or pre-summarize) to reduce latency and avoid context limits.
- Use a smaller model for higher throughput; use a larger model when extraction quality matters most.