Skip to content

verloop/md2chunks

Repository files navigation

Introduction

md2chunks is a Python project designed for context-enriched markdown chunking, particularly useful for Retrieval-Augmented Generation (RAG) tasks. It processes markdown files, splits them into manageable chunks, and enriches them with context to facilitate efficient information retrieval and processing.

Features

  • Markdown Processing: Converts markdown files to structured text.
  • Text Splitting: Splits text into chunks based on token count, with special handling for URLs, decimals, and abbreviations.
  • Context Enrichment: Adds context to each chunk to maintain the hierarchical structure of the original document.
  • Logging: Provides detailed logging for debugging and monitoring.

Setup

This environment is setup using UV. Board the UV train, life is easier.

  1. Install UV
  2. Build Virtual Environment: uv sync
  3. source .venv/bin/activate

Note: You can alternatively add the following alias to your .zshrc or .bashrc:

alias activate="source .venv/bin/activate"

That way, all you have to do is run: 3. activate

Usage

  1. In src/settings.py enter your Markdown directory path in MD_DIR_PATH and add your markdown files inside it.
  2. Create a folder to store processed markdown files (so that original files remain intact) and provide that path in the PROCESSED_DIR_PATH inside src/settings.py Note: This is an intermediate file and is only useful for debugging purposes.
  3. Run python main.py

Note: main.py only returns the chunks to a variable and quits the program. You are free to extend it your usecase. Incase you want to visualise the chunks, refer to visualisation.ipynb. To look at the chunks run the notebook instead of step 5. 6. logs can be found inside the logs folder 7. Post use run deactivate

Acknowledgements

The idea of TextNodes in src/nodes.py is inspired from LlamaIndex

License

Please refer to LICENSE

About

Context Enriched Markdown Chunking for RAG

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •