AutoChunker

A personal implementation of the method from the AutoChunker paper (see below). It performs bottom-up, structure-aware chunking of HTML/text into semantic units for RAG, with noise filtering.

The pipeline at a glance: document → markdown → split into lines → LLM merge + noise filter → hierarchical tree.

Context

I’m building a scraper and needed solid chunking. The paper reports strong gains in chunk quality and retrieval, so I implemented it and am sharing it under the MIT License for anyone who wants a lightweight, easy-to-use chunker.

According to the paper, AutoChunker scores best on the Support and Wikipedia benchmarks across the reported metrics (noise, completeness, context switch, task relevance, and weighted precision). See the paper for details.

Paper

Arihant Jain, Purav Aggarwal, and Anoop Saladi. 2025. AutoChunker: Structured Text Chunking and its Evaluation. Vienna, Austria. Association for Computational Linguistics.

The method described above was published by Amazon researchers at ACL 2025. This repository is an independent implementation of that method — it is not an official Amazon release and is not affiliated with or endorsed by Amazon.

What it does

AutoChunker turns web pages or other HTML into semantic chunks suitable for retrieval and RAG: it groups related content into coherent units, strips out navigation, footers, and other page noise, and can preserve document structure. You feed it HTML and get back self-contained chunks that are easier to index and retrieve than raw or naively split text.

Compared to other chunkers, it supports every feature the paper evaluates (structure, noise elimination, context-aware retrieval, and more).

Flow

flowchart LR
  A[HTML] --> B[Markdown]
  B --> C[Split on separators]
  C --> D[TextPart IDs]
  D --> E[LLM merge and noise drop]
  E --> F[Parse merged ranges]
  F --> G[Flat chunks]

1. HTML → Markdown (html2text).
2. Split on separators into numbered paragraphs TextPart(id, content).
3. Format as <pID>content</pID> and send to the LLM with the merge prompt.
4. Parse <merged>start-end,...</merged> from the response.
5. Map ranges back to paragraph IDs and concatenate content into flat chunks.
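
The non-LLM steps above can be sketched in a few lines of Python. This is a minimal illustration and not the code in src/service.py: the function names are hypothetical, the separator is assumed to be a blank line, the tag is assumed to look like <p1>...</p1>, and paragraphs whose IDs appear in no merged range are assumed to be the ones dropped as noise.

```python
import re
from dataclasses import dataclass


@dataclass
class TextPart:
    id: int
    content: str


def split_parts(markdown: str, separator: str = "\n\n") -> list[TextPart]:
    """Split markdown on a separator and number the non-empty pieces from 1.

    The blank-line separator is an assumption; the source only says "separators".
    """
    pieces = (piece.strip() for piece in markdown.split(separator))
    return [TextPart(i, p) for i, p in enumerate((p for p in pieces if p), start=1)]


def format_parts(parts: list[TextPart]) -> str:
    """Render each part as <pID>content</pID>, assumed here to mean <p1>...</p1>."""
    return "\n".join(f"<p{p.id}>{p.content}</p{p.id}>" for p in parts)


def parse_merged(response: str) -> list[tuple[int, int]]:
    """Extract inclusive (start, end) ranges from <merged>1-3,5-5</merged>."""
    match = re.search(r"<merged>(.*?)</merged>", response, re.DOTALL)
    if match is None:
        return []
    ranges = []
    for item in match.group(1).split(","):
        item = item.strip()
        if not item:
            continue
        start, _, end = item.partition("-")
        # A bare "5" is treated as the single-paragraph range 5-5.
        ranges.append((int(start), int(end or start)))
    return ranges


def build_chunks(parts: list[TextPart], ranges: list[tuple[int, int]]) -> list[str]:
    """Concatenate the paragraphs of each range into one flat chunk."""
    by_id = {p.id: p.content for p in parts}
    chunks = []
    for start, end in ranges:
        texts = [by_id[i] for i in range(start, end + 1) if i in by_id]
        if texts:
            chunks.append("\n\n".join(texts))
    return chunks
```

With four paragraphs and an LLM reply of <merged>1-2,4-4</merged>, build_chunks(parts, parse_merged(reply)) returns two chunks and paragraph 3 is left out.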

The diagram at the top illustrates this flow. The repo also contains hierarchical_tree_recursion (markdown heading–based tree); it is not yet wired into the main pipeline.
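
For readers unfamiliar with heading-based trees, here is a rough sketch of the general idea using a stack of open sections. It is not the repository's hierarchical_tree_recursion; Node and build_heading_tree are hypothetical names, and the sketch only recognises ATX headings (# to ######) and does not skip headings inside code fences.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    level: int
    lines: list[str] = field(default_factory=list)
    children: list["Node"] = field(default_factory=list)


# One to six '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def build_heading_tree(markdown: str) -> Node:
    """Nest markdown sections under their nearest shallower heading."""
    root = Node(title="", level=0)
    stack = [root]  # path from the root to the currently open section
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            node = Node(title=match.group(2), level=level)
            # Close sections that are at the same depth or deeper.
            while stack[-1].level >= level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        elif line.strip():
            # Body text belongs to the innermost open section.
            stack[-1].lines.append(line)
    return root
```

Text that appears before the first heading ends up in the root node's lines.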

What is still missing

More LLM backends: The pipeline uses LangChain’s BaseChatModel; src/llms/openai.py implements OpenAI. Adding other providers (Anthropic, local models) is a matter of implementing a get_llm() that returns a BaseChatModel.
Vectorization: No built-in step to embed chunks for retrieval — you get text chunks only; wiring in embeddings and optional indexing would complete a full RAG-ready pipeline.
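
As an illustration of what the missing vectorization step could look like, the sketch below indexes chunks with an embedding function supplied by the caller and ranks them by cosine similarity. Nothing here exists in the repository: index_chunks, search, and the embed parameter are hypothetical, and any real embedding model would have to be wrapped as a function from text to a list of floats.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]
Embedder = Callable[[str], Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def index_chunks(chunks: Sequence[str], embed: Embedder) -> list[tuple[str, Vector]]:
    """Pair every chunk with its embedding (an in-memory index)."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def search(
    query: str,
    index: Sequence[tuple[str, Vector]],
    embed: Embedder,
    k: int = 3,
) -> list[tuple[float, str]]:
    """Return the k chunks most similar to the query as (score, chunk) pairs."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Passing the embedder in as a parameter keeps the sketch independent of any provider; a linear scan like this is only reasonable for small collections.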

Setup

Install the dependencies listed in requirements.txt. Set OPENAI_API_KEY in .env; OPENAI_MODEL is optional and defaults to gpt-4o-mini. The main code lives in src/service.py, src/prompt.py, and src/llms/openai.py.
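
To show how this configuration and the get_llm() extension point might fit together, here is a hedged sketch of a factory that returns a LangChain chat model. It is not the contents of src/llms/openai.py. The provider argument and the ANTHROPIC_MODEL variable are assumptions made for illustration, the temperature setting is a guess, and the sketch assumes the variables from .env are already present in the process environment. The langchain-openai and langchain-anthropic packages are imported lazily, so only the selected provider needs to be installed.

```python
import os


def get_llm(provider: str = "openai"):
    """Return a LangChain chat model (a BaseChatModel subclass) for a provider."""
    if provider == "openai":
        from langchain_openai import ChatOpenAI  # needs the langchain-openai package

        # ChatOpenAI picks up OPENAI_API_KEY from the environment.
        return ChatOpenAI(
            model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
            temperature=0,
        )
    if provider == "anthropic":
        from langchain_anthropic import ChatAnthropic  # needs langchain-anthropic

        # ANTHROPIC_MODEL is a hypothetical variable name; ChatAnthropic itself
        # reads ANTHROPIC_API_KEY from the environment.
        return ChatAnthropic(
            model=os.environ["ANTHROPIC_MODEL"],
            temperature=0,
        )
    raise ValueError(f"unknown provider: {provider!r}")
```

Because both classes implement the same chat-model interface, the rest of the pipeline would not need to know which provider was chosen.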

License

This project is under the MIT License. See LICENSE for the full text.

