This toolkit provides a robust, modular, and extensible batch entity linking pipeline for large-scale entity normalization, context disambiguation, and knowledge base linking using LLMs (Gemini) and DBpedia. It is designed for high-throughput, production-grade entity linking and annotation workflows.
- Input: List of entity-context pairs (or load from CSV/Excel/JSON)
- Batch Canonical Name Normalization: Uses Gemini LLM to normalize mentions to canonical names
- Batch Context Analysis: Uses Gemini LLM to analyze context and disambiguate entity type
- Batch DBpedia URI Lookup: Uses SPARQL to map canonical names to DBpedia URIs
- Output: Merged DataFrame/JSON with all results, including error reporting and export options
```
batch_preprocessing/
├── batch_canonical_name.py    # Batch canonical name normalization (Gemini)
├── batch_context_analysis.py  # Batch context disambiguation (Gemini)
├── batch_dbpedia_uri.py       # Batch DBpedia URI lookup (SPARQL)
├── full_batch_pipeline.py     # Orchestrates the full workflow
examples/*                     # Example scripts for the full batch preprocessing pipeline
hybrid_linking/*               # Initial hybrid-linking implementation for DBpedia
```
- Accepts a list of entity mentions
- Splits into manageable chunks for Gemini
- Sends batch prompts to Gemini for canonical name normalization
- Handles errors, missing results, and deduplication
- Returns results as DataFrame, JSON, or list of dicts
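The dedupe-chunk-merge flow above can be sketched as follows. This is a minimal illustration, not the module's actual API: `batch_normalize` and its `normalize_chunk` callback (which would wrap the real Gemini call) are hypothetical names used here to keep the example self-contained and API-key-free.

```python
from typing import Callable, Dict, List, Optional

def batch_normalize(
    mentions: List[str],
    normalize_chunk: Callable[[List[str]], Dict[str, str]],
    chunk_size: int = 50,
) -> Dict[str, Optional[str]]:
    """Deduplicate mentions, split them into chunks, call the (LLM-backed)
    normalizer per chunk, and merge per-chunk results into one mapping."""
    unique = list(dict.fromkeys(mentions))  # dedupe while preserving order
    results: Dict[str, Optional[str]] = {}
    for i in range(0, len(unique), chunk_size):
        chunk = unique[i:i + chunk_size]
        try:
            results.update(normalize_chunk(chunk))
        except Exception:
            # Record None for failed chunks so missing results can be reported
            results.update({m: None for m in chunk})
    return results
```

In the real module, `normalize_chunk` would send one batch prompt to Gemini per chunk and parse the response; here it is injected so the chunking and error-handling logic stand alone.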
- Accepts a list of dicts with 'mention' and 'context'
- Splits into manageable chunks for Gemini
- Sends batch prompts to Gemini for context analysis (entity type, confidence, keywords, description)
- Handles errors and missing results
- Returns results as DataFrame, JSON, or list of dicts
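Handling errors and missing results for a batch LLM response might look like the sketch below. The JSON shape and the helper name `parse_context_batch` are assumptions for illustration; the actual module may use a different response format.

```python
import json
from typing import Dict, List

def parse_context_batch(response_text: str, mentions: List[str]) -> List[Dict]:
    """Parse the model's JSON array and align it with the input mentions,
    filling safe defaults for any mention the model skipped or garbled."""
    try:
        parsed = {item["mention"]: item for item in json.loads(response_text)}
    except (json.JSONDecodeError, KeyError, TypeError):
        parsed = {}  # malformed response: fall back to defaults for everything
    default = {"entity_type": None, "confidence": 0.0,
               "keywords": [], "description": None}
    return [parsed.get(m, {"mention": m, **default}) for m in mentions]
```

Aligning on the input mention list guarantees one output row per input, even when the model drops entries, which keeps the later DataFrame merge well-defined.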
- Accepts a list of canonical names
- Splits into manageable chunks for the DBpedia SPARQL endpoint
- Uses a SPARQL VALUES clause to look up all URIs in a single query per batch
- Handles errors and missing results
- Returns results as DataFrame, JSON, or list of dicts
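A VALUES-based batch query could be built roughly as below. This is a sketch of the pattern, not the module's exact query: the real implementation may match labels differently or construct resource IRIs directly, and `build_values_query` is a hypothetical helper name.

```python
from typing import List

def build_values_query(names: List[str]) -> str:
    """Build one SPARQL query that resolves a whole chunk of canonical
    names to DBpedia resource URIs via a VALUES clause."""
    escaped = " ".join('"%s"@en' % n.replace('"', '\\"') for n in names)
    return (
        "SELECT ?name ?uri WHERE {\n"
        f"  VALUES ?name {{ {escaped} }}\n"
        "  ?uri rdfs:label ?name .\n"
        '  FILTER(STRSTARTS(STR(?uri), "http://dbpedia.org/resource/"))\n'
        "}"
    )
```

Batching names into one query keeps the number of round-trips to the public endpoint low, which matters because DBpedia rate-limits aggressive clients.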
- Orchestrates the full workflow:
- Canonical name normalization
- Context analysis
- DBpedia URI lookup
- Merges all results into a single DataFrame
- Provides utility functions for loading/saving input/output
- Reports errors and progress
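The merge step can be sketched with pandas left joins, so rows whose lookups failed survive with NaN columns rather than disappearing. The function name `merge_results` and the join keys are illustrative assumptions, not the orchestrator's actual signature.

```python
from typing import Dict, List

import pandas as pd

def merge_results(inputs: List[Dict], canon: List[Dict],
                  context: List[Dict], uris: List[Dict]) -> pd.DataFrame:
    """Merge the three batch outputs back onto the original rows.
    Left joins keep mentions whose normalization or lookup failed."""
    df = pd.DataFrame(inputs)  # columns: mention, context
    df = df.merge(pd.DataFrame(canon), on=["mention", "context"], how="left")
    df = df.merge(pd.DataFrame(context), on=["mention", "context"], how="left")
    df = df.merge(pd.DataFrame(uris), on="canonical_name", how="left")
    return df
```

Joining the first two steps on both `mention` and `context` matters because the same mention (e.g. "Apple") can normalize differently in different contexts.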
- Input: List of dicts (or DataFrame) with 'mention' and 'context'
- Canonical Name Normalization: Adds 'canonical_name' column
- Context Analysis: Adds 'entity_type', 'confidence', 'keywords', 'description' columns
- DBpedia URI Lookup: Adds 'dbpedia_uri' column
- Output: DataFrame/JSON/CSV/Excel with all columns merged
- Chunked Processing: All batch steps use chunking to avoid API/endpoint limits and improve reliability
- Progress & Logging: Each batch prints progress and timing for transparency
- Error Handling: All steps catch and report errors, and missing/ambiguous results are summarized
- Flexible I/O: Utility functions support loading/saving from/to CSV, Excel, and JSON
- Modularity: Each batch step is a standalone module, making it easy to swap out or extend
- Extensibility: The pipeline can be extended to support new LLMs, knowledge bases, or additional analysis steps
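The flexible I/O utilities could dispatch on file extension along these lines; `load_input` is a hypothetical name standing in for whatever loader the pipeline actually exposes.

```python
import pandas as pd

def load_input(path: str) -> pd.DataFrame:
    """Load entity-context pairs from CSV, Excel, or JSON by extension."""
    if path.endswith(".csv"):
        return pd.read_csv(path)
    if path.endswith((".xls", ".xlsx")):
        return pd.read_excel(path)  # requires openpyxl for .xlsx
    return pd.read_json(path)
```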
| mention | context | canonical_name | entity_type | confidence | keywords | description | dbpedia_uri |
|---|---|---|---|---|---|---|---|
| Apple | I work at Apple | Apple_Inc. | company | 0.95 | ["tech", ...] | Apple Inc. the company | http://dbpedia.org/resource/Apple_Inc. |
| Apple | I eat an apple every day | Apple | product | 0.90 | ["fruit", ...] | The fruit apple | http://dbpedia.org/resource/Apple |
- Add new batch modules for additional analysis (e.g., Wikidata lookup)
- Integrate with other LLMs by swapping out the Gemini-based modules
- Add CLI or API wrappers for production deployment
- Python 3.8+
- Gemini API key (for LLM steps)
- Access to the DBpedia SPARQL endpoint