🧩 Chunklet-py

“One library to split them all: Sentence, Code, Docs”

Warning

Quick heads up! Version 2 has some breaking changes. No worries though - check our Migration Guide for a smooth upgrade!


Why Smart Chunking? (Or: Why Not Just Split on Character Count?)

You might be wondering: "Can't I just split my text by character count or random line breaks?" Well, sure you could... but that's like trying to cut a wedding cake with a chainsaw! 🎂 Standard methods often give you:

- Mid-sentence surprises: Your carefully crafted thoughts get chopped right in the middle, losing all meaning
- Language confusion: Non-English text and code structures get treated like they're all the same
- Lost context: Each chunk forgets what came before, like a conversation where everyone has amnesia

Smart chunking keeps your content's meaning and structure intact!
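To make the "mid-sentence surprise" concrete, here is a small standard-library sketch that contrasts a fixed-width character split with a split that only cuts at sentence boundaries. It is an illustration only, not Chunklet-py's algorithm: the function names and the simple punctuation rule are assumptions made up for this example.

```python
import re


def split_by_chars(text: str, size: int) -> list[str]:
    """Naive fixed-width split: cuts wherever the character budget runs out."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_by_sentences(text: str, max_chars: int) -> list[str]:
    """Group whole sentences into chunks of at most max_chars where possible."""
    # Simplistic boundary rule (assumption): ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # budget exceeded: close the chunk
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Chunking matters. Bad splits lose meaning! Does context survive? It should."
print(split_by_chars(text, 30))      # first piece stops in the middle of a word
print(split_by_sentences(text, 45))  # every piece ends on a sentence boundary
```

A single sentence longer than `max_chars` is kept whole in this sketch rather than cut, which is the trade-off being illustrated.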

🤔 So What's Chunklet-py Anyway? (And Why Should You Care?)

Chunklet-py is your friendly neighborhood text splitter that takes all kinds of content - from plain text to PDFs to source code - and breaks them into smart, context-aware chunks. Instead of dumb splitting, we give you specialized tools:

- Sentence Splitter
- Plain Text Chunker
- Document Chunker
- Code Chunker
- Chunk Visualizer (Interactive web interface)

Each tool keeps your content's meaning and structure intact.

Perfect for prepping data for LLMs, building RAG systems, or powering AI search - Chunklet-py gives you the precision and flexibility you need across tons of formats and languages.

| Feature | Why it's awesome |
| --- | --- |
| 🚀 Blazingly Fast | Leverages efficient parallel processing to chunk large volumes of content with remarkable speed. |
| 🪶 Featherlight Footprint | Designed to be lightweight and memory-efficient, ensuring optimal performance without unnecessary overhead. |
| 🗂️ Rich Metadata for RAG | Enriches chunks with valuable, context-aware metadata (source, span, document properties, code AST details) crucial for advanced RAG and LLM applications. |
| 🔧 Infinitely Customizable | Offers extensive customization options, from pluggable token counters to custom sentence splitters and processors. |
| 🌐 Multilingual Mastery | Supports over 50 natural languages for text and document chunking with intelligent detection and language-specific algorithms. |
| 🧑‍💻 Code-Aware Intelligence | Language-agnostic code chunking that understands and preserves the structural integrity of your source code. |
| 🎯 Precision Chunking | Flexible constraint-based chunking allows you to combine limits based on sentences, tokens, sections, lines, and functions. |
| 📄 Document Format Mastery | Processes a wide array of document formats including `.pdf`, `.docx`, `.epub`, `.txt`, `.tex`, `.html`, `.hml`, `.md`, `.rst`, `.rtf`, `.odt`, `.csv`, and `.xlsx`. |
| 💻 Triple Interface: CLI, Library & Web | Use it as a command-line tool, import as a library for deep integration, or launch the interactive web visualizer for real-time chunk exploration and parameter tuning. |

And that's just the start - there's plenty more to explore!
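The ideas of combined limits, pluggable token counters, and span metadata from the table above can be sketched in a few lines of plain Python. This is a hedged illustration under stated assumptions, not Chunklet-py's API: `Chunk`, `chunk_with_limits`, `count_words`, and the sentence regex are all hypothetical names invented for this example.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Simplistic sentence rule (an assumption for this sketch): a sentence runs
# up to the next ., ! or ?, or to the end of the text.
SENTENCE_RE = re.compile(r"[^\s.!?][^.!?]*(?:[.!?]+|$)")


@dataclass
class Chunk:
    content: str
    span: tuple[int, int]  # (start, end) character offsets into the source


def count_words(text: str) -> int:
    """Stand-in token counter; any callable str -> int can replace it."""
    return len(text.split())


def chunk_with_limits(
    text: str,
    max_sentences: int,
    max_tokens: int,
    token_counter: Callable[[str], int] = count_words,
) -> list[Chunk]:
    """Close a chunk as soon as adding a sentence would break either limit."""
    chunks: list[Chunk] = []
    start = end = count = 0
    for match in SENTENCE_RE.finditer(text):
        too_many = count + 1 > max_sentences
        too_long = token_counter(text[start:match.end()]) > max_tokens
        if count and (too_many or too_long):
            chunks.append(Chunk(text[start:end], (start, end)))
            count = 0
        if count == 0:
            start = match.start()  # a new chunk begins at this sentence
        end = match.end()
        count += 1
    if count:
        chunks.append(Chunk(text[start:end], (start, end)))
    return chunks


text = "One two three. Four five. Six seven eight nine. Ten."
for chunk in chunk_with_limits(text, max_sentences=2, max_tokens=6):
    print(chunk.span, repr(chunk.content))
```

Passing a different `token_counter` (for example, one backed by a real tokenizer) changes the budget without touching the grouping logic. A single sentence that exceeds `max_tokens` on its own is emitted as an oversized chunk in this sketch.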

Note

For the full documentation experience, check out our documentation site.

📦 Installation

Ready to get Chunklet-py running? Awesome! Let's get you set up quickly and painlessly.

!!! note "Package Name Change"
    Chunklet-py was previously named `chunklet`. The old `chunklet` package is no longer maintained. When installing, make sure to use `chunklet-py` (with the hyphen) to get the latest version.

The Quick & Easy Way

The simplest way to get started is with pip:

```
# Install and check it's working
pip install chunklet-py
chunklet --version
```

That's it! You're all set to start chunking.
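If you would rather confirm the install from inside Python (say, in a notebook or a CI script), the standard library can look up the installed distribution by name. This is a minimal sketch: it only relies on `importlib.metadata` and the `chunklet-py` distribution name mentioned above, and the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str = "chunklet-py"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


print(installed_version() or "chunklet-py is not installed in this environment")
```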

Extra Features (Optional)

Want to unlock more Chunklet-py superpowers? Add these optional dependencies based on what you need:

Document Processing: For handling .pdf, .docx, .epub, and other document formats:
```
pip install "chunklet-py[document]"
```
Code Chunking: For advanced code analysis and chunking features:
```
pip install "chunklet-py[code]"
```
Visualization: For the interactive web-based chunk visualizer:
```
pip install "chunklet-py[visualization]"
```
All Extras: To install all optional dependencies:
```
pip install "chunklet-py[all]"
```

The From-Source Way

Prefer building from source? You can clone and install manually for full control:

```
git clone https://github.com/speedyk-005/chunklet-py.git
cd chunklet-py
# Quote the extras so shells like zsh don't treat the brackets as a glob
pip install ".[all]"
```

(But honestly, the pip way is usually way easier!)

Want to Help Make Chunklet-py Even Better?

That's awesome! We'd love to have you contribute. Check out our Contributing Guide first, then set up your development environment:

```
git clone https://github.com/speedyk-005/chunklet-py.git
cd chunklet-py
# For basic development (testing, linting)
pip install -e ".[dev]"
# For documentation development
pip install -e ".[docs]"
# For comprehensive development (including all optional features like document and code chunking + docs dependencies)
pip install -e ".[dev-all]"
```

These install Chunklet-py in "editable" mode so your code changes take effect immediately. The different options give you just the dependencies you need.

Go forth and code! (And remember, good developers write tests. We appreciate excellent code examples!)

🗺 Features & Roadmap

- CLI interface
- Document chunking with metadata
- Code chunking based on interest points
- Interactive chunk visualizer (web interface)
- Extended file format support:
  - ODT files
  - CSV and Excel files
- Future enhancements:
  - Additional document formats

How Chunklet-py Compares

While there are other chunking libraries available, Chunklet-py stands out for its unique combination of versatility, performance, and ease of use. Here's a quick look at how it compares to some of the alternatives:

| Library | Key Differentiator | Focus |
| --- | --- | --- |
| chunklet-py | All-in-one, lightweight, and language-agnostic with specialized algorithms. | Text, Code, Docs |
| CintraAI Code Chunker | Relies on `tree-sitter`, which can add setup complexity. | Code |
| Chonkie | A feature-rich pipeline tool with cloud/vector integrations, but uses a more basic sentence splitter and `tree-sitter` for code. | Pipelines, Integrations |
| code_chunker (JimAiMoment) | Uses basic regex and rules with limited language support. | Code |
| Semchunk | Primarily for text, using a general-purpose sentence splitter. | Text |

Chunklet-py's rule-based, language-agnostic approach to code chunking avoids the need for heavy dependencies like tree-sitter, which can sometimes introduce compatibility issues. For sentence splitting, it uses specialized libraries and algorithms for higher accuracy, rather than a one-size-fits-all approach. This makes Chunklet-py a great choice for projects that require a balance of power, flexibility, and a small footprint.
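For a feel of what "rule-based, no parser" can look like, here is a deliberately tiny sketch that starts a new chunk at lines that look like top-level definitions. It is not Chunklet-py's rule set: the keyword list, `BOUNDARY`, and `split_code` are assumptions invented for this illustration, and real code needs far more careful rules (comments, decorators, nesting).

```python
import re

# Hypothetical rule: a line at column 0 that opens a definition in a few
# common languages marks the start of a new chunk.
BOUNDARY = re.compile(r"^(?:def |class |function |func |fn )")


def split_code(source: str) -> list[str]:
    """Split source text into chunks at top-level definition lines."""
    chunks: list[str] = []
    current: list[str] = []
    for line in source.splitlines():
        # Only flush when the pending chunk holds something besides blanks.
        if BOUNDARY.match(line) and any(item.strip() for item in current):
            chunks.append("\n".join(current).rstrip())
            current = []
        current.append(line)
    if any(item.strip() for item in current):
        chunks.append("\n".join(current).rstrip())
    return chunks


sample = '''import os

def load(path):
    return open(path).read()

class Store:
    def get(self, key):
        return key
'''

for chunk in split_code(sample):
    print(chunk)
    print("---")
```

Because the rule only fires at column 0, the indented `def get` stays inside the `Store` chunk, so the method is not separated from its class.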

🙌 Contributors & Thanks

A huge thank you to the awesome people who helped shape Chunklet-py:

- @jmbernabotto: for helping mainly with the CLI by suggesting fixes, features, and design improvements.
- @arnoldfranz: for reporting the CLI Path Validation Bug (#6), which helped improve error handling.

📜 License

Check out the LICENSE file for all the details.

MIT License. Use freely, modify boldly, and credit appropriately! (We're not that legendary... yet 😉)
