md2chunks is a Python project designed for context-enriched markdown chunking, particularly useful for Retrieval-Augmented Generation (RAG) tasks. It processes markdown files, splits them into manageable chunks, and enriches them with context to facilitate efficient information retrieval and processing.
- Markdown Processing: Converts markdown files to structured text.
- Text Splitting: Splits text into chunks based on token count, with special handling for URLs, decimals, and abbreviations.
- Context Enrichment: Adds context to each chunk to maintain the hierarchical structure of the original document.
- Logging: Provides detailed logging for debugging and monitoring.
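As a rough illustration of the text-splitting feature, the sketch below splits text into sentences without breaking on URLs, decimals, or abbreviations, then packs sentences into chunks under a token budget. It is not the project's actual implementation: the function names, the abbreviation list, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for this example.

```python
import re

# Assumption: a short, illustrative abbreviation list; a real one would be longer.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "dr.", "mr.", "mrs.", "vs."}


def split_sentences(text: str) -> list[str]:
    """Split on '.', '!' or '?' followed by whitespace, skipping abbreviations.

    Requiring whitespace after the punctuation leaves URLs such as
    https://example.com/a.html and decimals such as 3.14 intact.
    """
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # not a real sentence boundary, keep scanning
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def pack_chunks(sentences: list[str], max_tokens: int) -> list[str]:
    """Greedily group sentences so each chunk stays within max_tokens.

    Whitespace-separated words stand in for tokens here; a single sentence
    longer than max_tokens still becomes its own (oversized) chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Pi is about 3.14. See https://example.com/docs.html for details, e.g. the FAQ. Done!"
chunks = pack_chunks(split_sentences(text), max_tokens=8)
```

Swapping the word count for a real tokenizer would only change the line that computes `n`.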
This environment is set up using UV. Board the UV train, life is easier.
1. Install UV
2. Build Virtual Environment:
uv sync
source .venv/bin/activate
Note: You can alternatively add the following alias to your .zshrc or .bashrc:
alias activate="source .venv/bin/activate"
That way, all you have to do is run:
activate
3. In src/settings.py, enter your Markdown directory path in MD_DIR_PATH and add your markdown files inside it.
4. Create a folder to store processed markdown files (so that original files remain intact) and provide that path in PROCESSED_DIR_PATH inside src/settings.py. Note: This is an intermediate file and is only useful for debugging purposes.
5. Run
python main.py
Note: main.py only assigns the chunks to a variable and then exits. You are free to extend it for your use case.
In case you want to visualise the chunks, refer to visualisation.ipynb and run the notebook instead of step 5.
6. Logs can be found inside the logs folder.
7. After use, run deactivate.
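The two path settings mentioned in the setup steps, MD_DIR_PATH and PROCESSED_DIR_PATH, could look something like the sketch below. Only the two setting names come from this README; the folder names are placeholders, and `ensure_dirs` is a hypothetical helper that is not claimed to exist in src/settings.py.

```python
from pathlib import Path

# Placeholder locations: point these at your own folders.
MD_DIR_PATH = Path("data/markdown")          # original markdown files (left untouched)
PROCESSED_DIR_PATH = Path("data/processed")  # intermediate files, useful for debugging


def ensure_dirs(*dirs: Path) -> None:
    """Create each directory if it does not exist yet (hypothetical helper)."""
    for d in dirs:
        d.mkdir(parents=True, exist_ok=True)
```

Keeping the processed folder separate from the source folder is what keeps the original files intact.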
The idea of TextNodes in src/nodes.py is inspired by LlamaIndex.
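To make the node idea concrete, here is a minimal sketch of a node that carries its text together with the heading trail above it, which is one way to keep a document's hierarchy attached to each chunk. It is written for illustration only: the `TextNode` fields, `enriched`, and `parse_nodes` are assumptions, and they mirror neither the contents of src/nodes.py nor the LlamaIndex API.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TextNode:
    """Hypothetical node: a block of text plus the headings above it."""
    text: str
    header_path: list[str] = field(default_factory=list)

    def enriched(self) -> str:
        # Prefix the text with its heading trail so the chunk stands alone.
        if not self.header_path:
            return self.text
        return " > ".join(self.header_path) + "\n" + self.text


def parse_nodes(markdown: str) -> list[TextNode]:
    """Group body text under the current stack of ATX ('#') headings.

    Simplification: '#' lines inside fenced code blocks are not special-cased.
    """
    nodes, stack, buffer = [], [], []

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            nodes.append(TextNode(body, [title for _, title in stack]))
        buffer.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()  # text collected so far belongs to the previous headings
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before pushing this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2).strip()))
        else:
            buffer.append(line)
    flush()
    return nodes


doc = "# Guide\nIntro text.\n## Install\nRun the installer.\n# FAQ\nNone yet."
nodes = parse_nodes(doc)
```

Calling `nodes[1].enriched()` on this sample yields the text of the Install section prefixed with `Guide > Install`.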
Please refer to LICENSE