GitHub - verloop/md2chunks: Context Enriched Markdown Chunking for RAG

Introduction

md2chunks is a Python project designed for context-enriched markdown chunking, particularly useful for Retrieval-Augmented Generation (RAG) tasks. It processes markdown files, splits them into manageable chunks, and enriches them with context to facilitate efficient information retrieval and processing.
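To make the idea of context enrichment concrete, here is a minimal sketch that attaches the chain of parent headings to each section of a markdown document. It is an illustration under stated assumptions, not the project's implementation: the names `Chunk`, `chunk_by_headings`, and `enriched_text` are hypothetical, and only ATX-style (`#`) headings are handled.

```python
import re
from dataclasses import dataclass

# Matches ATX headings such as "## Install". Lines starting with "#" inside
# fenced code blocks would be misread as headings; that is ignored here.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    header_path: list  # headings from the document root down to this section
    body: str

    def enriched_text(self) -> str:
        """Prefix the body with its heading path so the chunk stands alone."""
        context = " > ".join(self.header_path)
        return f"[{context}]\n{self.body}" if context else self.body


def chunk_by_headings(markdown: str) -> list:
    """Split markdown at headings and record each section's heading path."""
    chunks = []
    stack = []       # (level, title) pairs for the currently open headings
    body_lines = []  # text collected since the last heading

    def flush():
        body = "\n".join(body_lines).strip()
        if body:
            chunks.append(Chunk([title for _, title in stack], body))
        body_lines.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headings at the same or a deeper level than the new one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            body_lines.append(line)
    flush()
    return chunks


doc = "# Guide\nIntro text.\n## Install\nRun the installer.\n## Usage\nCall the tool.\n"
for chunk in chunk_by_headings(doc):
    print(chunk.enriched_text())
    print("---")
```

Carrying the heading path with each chunk means a retrieved passage such as "Run the installer." still tells the reader (or the model) which part of the document it came from.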

Features

Markdown Processing: Converts markdown files to structured text.
Text Splitting: Splits text into chunks based on token count, with special handling for URLs, decimals, and abbreviations.
Context Enrichment: Adds context to each chunk to maintain the hierarchical structure of the original document.
Logging: Provides detailed logging for debugging and monitoring.
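The text-splitting feature can be sketched roughly as follows. This is an assumption-laden illustration rather than the code in this repository: it counts whitespace-separated words as tokens, uses a small made-up abbreviation list, and the names `split_sentences` and `chunk_by_tokens` are hypothetical. The point it shows is why URLs, decimals, and abbreviations need special handling: their periods must not be mistaken for sentence ends.

```python
import re

# Illustrative abbreviation list; the project may use a different one.
ABBREVIATIONS = ("e.g.", "i.e.", "etc.", "Dr.", "Mr.", "Mrs.", "vs.")
ABBREV_RE = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(a) for a in ABBREVIATIONS) + r")"
)
# A URL ends at its last character that is not trailing punctuation.
URL_RE = re.compile(r"https?://\S*[^\s.,;:!?)]")
DECIMAL_RE = re.compile(r"(?<=\d)\.(?=\d)")
DOT = "\x00"  # placeholder standing in for a protected period


def _mask(text: str) -> str:
    """Hide periods that do not end a sentence."""
    text = URL_RE.sub(lambda m: m.group(0).replace(".", DOT), text)
    text = DECIMAL_RE.sub(DOT, text)
    # Limitation: an abbreviation that really ends a sentence is also masked.
    return ABBREV_RE.sub(lambda m: m.group(0).replace(".", DOT), text)


def split_sentences(text: str) -> list:
    """Split after ., ! or ? followed by whitespace, then restore periods."""
    parts = re.split(r"(?<=[.!?])\s+", _mask(text))
    return [p.replace(DOT, ".").strip() for p in parts if p.strip()]


def chunk_by_tokens(text: str, max_tokens: int) -> list:
    """Greedily pack whole sentences into chunks of at most max_tokens words.

    A single sentence longer than max_tokens becomes its own oversized chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Pi is roughly 3.14. See https://example.com/a.b for details, e.g. the FAQ. Done!"
for chunk in chunk_by_tokens(sample, max_tokens=8):
    print(chunk)
```

A real tokenizer would give different counts than whitespace splitting, so the `max_tokens` budget here is only an approximation.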

Setup

This environment is set up using UV. Board the UV train; life is easier.

Install UV.
Build the virtual environment: uv sync
Activate the virtual environment: source .venv/bin/activate

Note: You can alternatively add the following alias to your .zshrc or .bashrc:

alias activate="source .venv/bin/activate"

That way, all you have to do to activate the environment is run: activate

Usage

In src/settings.py, set MD_DIR_PATH to your Markdown directory path and place your markdown files inside that directory.
Create a folder to store the processed markdown files (so that the original files remain intact) and set PROCESSED_DIR_PATH in src/settings.py to that path. Note: The processed files are intermediate output and are only useful for debugging.
Run python main.py
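As a rough illustration of the two settings described in the steps above, the sketch below shows what the values might look like and how they could be checked before a run. Only the names MD_DIR_PATH and PROCESSED_DIR_PATH come from this README; the example values, the use of pathlib, and the `validate_settings` helper are assumptions and may not match the actual src/settings.py.

```python
from pathlib import Path

# Example values only; point these at your own directories.
MD_DIR_PATH = Path("md_files")
PROCESSED_DIR_PATH = Path("processed_md_files")


def validate_settings(md_dir: Path, processed_dir: Path) -> list:
    """Hypothetical helper: check the input directory, create the output
    directory if needed, and return the markdown files found."""
    if not md_dir.is_dir():
        raise FileNotFoundError(f"Markdown directory not found: {md_dir}")
    processed_dir.mkdir(parents=True, exist_ok=True)
    return sorted(md_dir.glob("*.md"))
```

Calling `validate_settings(MD_DIR_PATH, PROCESSED_DIR_PATH)` before processing would fail early on a mistyped input path instead of partway through a run.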

Note: main.py only assigns the chunks to a variable and then exits. You are free to extend it for your use case. If you want to visualise the chunks, run the visualisation.ipynb notebook instead of python main.py.
Logs can be found inside the logs folder.
After use, run deactivate.
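Since main.py leaves the chunks in a variable, one simple way to extend it is to write them to disk. The sketch below assumes the chunks are available as a list of strings, which may not match the project's actual chunk type; `save_chunks` and `load_chunks` are hypothetical helpers, and JSON Lines is just one convenient format.

```python
import json
from pathlib import Path


def save_chunks(chunks, out_path) -> int:
    """Write one JSON object per line and return the number of chunks written."""
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as fh:
        for index, chunk in enumerate(chunks):
            record = {"id": index, "text": chunk}
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(chunks)


def load_chunks(path) -> list:
    """Read the chunk texts back in their original order."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line)["text"] for line in fh if line.strip()]
```

One record per line keeps the file easy to stream into an indexing or embedding step later.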

Acknowledgements

The idea of TextNodes in src/nodes.py is inspired by LlamaIndex.

License

Please refer to LICENSE

