This project is a text-processing pipeline that extracts detailed character information from stories. Using embeddings, a vector database, and a large language model (LLM), the system returns structured information about each character, including their relationships, role, and a summary.
- Document Processing: Loads and preprocesses .txt files from a specified directory.
- Embedding Computation: Generates vector embeddings for text chunks using MistralAI.
- Character Information Extraction: Retrieves structured details about characters using vector similarity search and LLM prompts.
- Command-Line Interface (CLI): Provides an easy-to-use interface for embedding computation and character queries.
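
The document-processing step described above (loading .txt files and splitting them into chunks) can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function names `load_text_files` and `chunk_text`, the whitespace normalisation, and the `chunk_size` and `overlap` defaults are all assumptions made for the example.

```python
from pathlib import Path


def load_text_files(directory):
    """Return {filename: text} for every .txt file directly inside `directory`."""
    docs = {}
    for path in sorted(Path(directory).glob("*.txt")):
        # Collapse runs of whitespace (assumed preprocessing, not confirmed by the project).
        docs[path.name] = " ".join(path.read_text(encoding="utf-8").split())
    return docs


def chunk_text(text, chunk_size=500, overlap=50):
    """Split `text` into character windows of `chunk_size` that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks
```

Overlapping windows are a common choice because a sentence that straddles a chunk boundary still appears whole in at least one chunk; the sizes here are placeholders to tune per dataset.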
- Python
- LangChain Framework
- MistralAI
- Chroma Vector Database
- Typer (CLI Framework)
- Pydantic (Data Validation)
git clone https://github.com/Aawegg/Story-Character-Extractor.git
cd Story-Character-Extractor
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
Create a .env file or set the following environment variables:
MISTRAL_API_KEY=your-mistral-api-key
- Compute Embeddings: Generate embeddings for all .txt files in a directory and store them in a vector database.
python main.py compute-embeddings-cli <dataset_path>
- Example:
python main.py compute-embeddings-cli ./stories
- Retrieve Character Information: Query the system for details about a specific character.
python main.py get-character-info-cli <character_name>
- Example:
python main.py get-character-info-cli Alice
Story-Character-Extractor/
├── document_processing.py # Handles loading and preprocessing text files.
├── embeddings.py # Computes and stores embeddings in a vector database.
├── extraction.py # Extracts structured character information using LLMs.
├── main.py # CLI for embedding computation and character queries.
├── requirements.txt # List of dependencies.
└── README.md # Project documentation.
- Prepare a Dataset: Place .txt files in a directory, e.g., ./stories.
- Compute Embeddings: Run the compute-embeddings-cli command to generate embeddings.
- Query Character Information: Use the get-character-info-cli command to retrieve details about characters.
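
A character query returns structured data, so it can help to parse the result into typed objects before using it. The project lists Pydantic for validation; to stay dependency-free, the sketch below swaps in standard-library dataclasses instead. The field names follow the sample output in this README, while the class names `Relation` and `CharacterInfo` and the helper `parse_character_info` are assumptions for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Relation:
    relationType: str
    summary: str


@dataclass
class CharacterInfo:
    name: str
    storyTitle: str
    summary: str
    characterType: str
    # Maps the other character's name to a Relation.
    relations: dict = field(default_factory=dict)


def parse_character_info(raw_json):
    """Parse a JSON string into CharacterInfo; raises KeyError if a required field is missing."""
    data = json.loads(raw_json)
    relations = {
        other: Relation(rel["relationType"], rel["summary"])
        for other, rel in data.get("relations", {}).items()
    }
    return CharacterInfo(
        name=data["name"],
        storyTitle=data["storyTitle"],
        summary=data["summary"],
        characterType=data["characterType"],
        relations=relations,
    )


raw = json.dumps({
    "name": "Alice",
    "storyTitle": "Adventures in Wonderland",
    "summary": "A curious and adventurous girl.",
    "relations": {"White Rabbit": {"relationType": "Friend", "summary": "A guide."}},
    "characterType": "Protagonist",
})
info = parse_character_info(raw)
print(info.relations["White Rabbit"].relationType)  # Friend
```

Failing loudly on a missing field is a deliberate choice here: LLM output is not guaranteed to match the expected shape, and an early error is easier to debug than a half-filled record.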
python main.py get-character-info-cli "Alice"
{
  "name": "Alice",
  "storyTitle": "Adventures in Wonderland",
  "summary": "A curious and adventurous girl who explores a magical world.",
  "relations": {
    "White Rabbit": {
      "relationType": "Friend",
      "summary": "A guide and companion during her journey."
    },
    "Queen of Hearts": {
      "relationType": "Antagonist",
      "summary": "The ruler of Wonderland who opposes Alice."
    }
  },
  "characterType": "Protagonist"
}
Contributions are welcome! If you have suggestions for improvements or new features, feel free to open an issue or submit a pull request.
This project is licensed under the MIT License.
For any inquiries or support, contact Aaweg Bhaladhare at aaweg.22110711@viit.ac.in.