A personal implementation of the method from the AutoChunker paper (see below). Bottom-up, structure-aware chunking of HTML/text into semantic units for RAG, with noise filtering.
The pipeline at a glance: document → markdown → split into lines → LLM merge + noise filter → hierarchical tree.
I’m building a modern scraper and needed solid chunking. The paper reports strong gains in chunk quality and retrieval, so I implemented it and am sharing it under MIT for anyone who wants a lightweight, easy-to-use chunker.
The paper's numbers back this up: it reports AutoChunker topping the Support and Wikipedia benchmarks on noise, completeness, context switch, task relevance, and weighted precision. See the paper for details.
Arihant Jain, Purav Aggarwal, and Anoop Saladi. 2025. AutoChunker: Structured Text Chunking and its Evaluation. Vienna, Austria. Association for Computational Linguistics.
The method described above was published by Amazon researchers at ACL 2025. This repository is an independent implementation of that method — it is not an official Amazon release and is not affiliated with or endorsed by Amazon.
AutoChunker turns web pages or other HTML into semantic chunks suitable for retrieval and RAG: it groups related content into coherent units, strips out navigation, footers, and other page noise, and can preserve document structure. You feed it HTML and get back self-contained chunks that are easier to index and retrieve than raw or naively split text.
Compared to other chunkers, it supports every feature the paper evaluates (structure, noise elimination, context-aware retrieval, and more).
flowchart LR
A[HTML] --> B[Markdown]
B --> C[Split on separators]
C --> D[TextPart IDs]
D --> E[LLM merge and noise drop]
E --> F[Parse merged ranges]
F --> G[Flat chunks]
- HTML → Markdown (html2text).
- Split on separators into numbered paragraphs TextPart(id, content).
- Format as <pID>content</pID>, send to LLM with the merge prompt.
- Parse <merged>start-end,...</merged> from the response.
- Map ranges back to paragraph IDs and concatenate content into flat chunks.
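The steps after the Markdown conversion can be sketched with the standard library alone. This is a minimal illustration, not the code in src/service.py: the blank-line separator, the 1-based IDs, the `<p1>…</p1>` reading of `<pID>`, and every function name here are assumptions, and the LLM call is replaced by a hard-coded response.

```python
import re
from dataclasses import dataclass


@dataclass
class TextPart:
    id: int
    content: str


def split_markdown(markdown: str) -> list[TextPart]:
    """Split on blank lines (assumed separator) and number the pieces from 1."""
    pieces = [p.strip() for p in re.split(r"\n\s*\n", markdown) if p.strip()]
    return [TextPart(id=i, content=p) for i, p in enumerate(pieces, start=1)]


def format_for_llm(parts: list[TextPart]) -> str:
    """Wrap each paragraph as <pID>content</pID>, read here as <p1>...</p1>."""
    return "\n".join(f"<p{p.id}>{p.content}</p{p.id}>" for p in parts)


def parse_merged(response: str) -> list[tuple[int, int]]:
    """Extract (start, end) ID ranges from <merged>start-end,...</merged>."""
    match = re.search(r"<merged>(.*?)</merged>", response, re.DOTALL)
    if match is None:
        return []
    ranges = []
    for item in match.group(1).split(","):
        item = item.strip()
        if not item:
            continue
        start, _, end = item.partition("-")
        # Assumption: a bare ID such as "7" means the one-paragraph range 7-7.
        ranges.append((int(start), int(end or start)))
    return ranges


def build_chunks(parts: list[TextPart], ranges: list[tuple[int, int]]) -> list[str]:
    """Concatenate the paragraphs of each range; IDs in no range are dropped as noise."""
    by_id = {p.id: p.content for p in parts}
    chunks = []
    for start, end in ranges:
        texts = [by_id[i] for i in range(start, end + 1) if i in by_id]
        if texts:
            chunks.append("\n\n".join(texts))
    return chunks


markdown = (
    "Home | About | Login\n\n"
    "# Returns\n\n"
    "Items can be returned within 30 days.\n\n"
    "Copyright 2025"
)
parts = split_markdown(markdown)
prompt_body = format_for_llm(parts)
llm_response = "<merged>2-3</merged>"  # stand-in for the real LLM reply
chunks = build_chunks(parts, parse_merged(llm_response))
print(chunks)
```

In this toy input the navigation line (ID 1) and the copyright line (ID 4) appear in no merged range, so they never reach the output. That is how the noise drop falls out of the range format in this sketch.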
The diagram at the top illustrates this flow. The repo also contains hierarchical_tree_recursion (markdown heading–based tree); it is not yet wired into the main pipeline.
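For readers curious what a heading-based tree could look like, here is a small sketch. It is not the repo's hierarchical_tree_recursion: `Node`, `build_tree`, and `outline` are hypothetical names, the tree is built with an explicit stack and only the outline walk is recursive, and `#` lines inside fenced code blocks are not handled.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    level: int
    body: list[str] = field(default_factory=list)
    children: list["Node"] = field(default_factory=list)


# ATX-style headings only: one to six '#' characters followed by whitespace.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def build_tree(markdown: str) -> Node:
    """Nest each heading under the nearest preceding heading of a lower level."""
    root = Node(title="ROOT", level=0)
    stack = [root]
    for line in markdown.splitlines():
        m = HEADING.match(line)
        if m:
            node = Node(title=m.group(2).strip(), level=len(m.group(1)))
            # Pop siblings and deeper sections; the root (level 0) is never popped.
            while stack[-1].level >= node.level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        elif line.strip():
            # Non-heading text belongs to the most recent heading.
            stack[-1].body.append(line.strip())
    return root


def outline(node: Node, depth: int = 0) -> list[str]:
    """Recursively list section titles, indented two spaces per level."""
    lines = []
    for child in node.children:
        lines.append("  " * depth + child.title)
        lines.extend(outline(child, depth + 1))
    return lines


doc = "# Guide\nIntro text.\n## Install\nRun pip.\n## Usage\nCall it.\n# FAQ\nAnswers."
tree = build_tree(doc)
print("\n".join(outline(tree)))
```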
- More LLM backends: The pipeline uses LangChain’s BaseChatModel; src/llms/openai.py implements OpenAI. Adding other providers (Anthropic, local models) is a matter of implementing a get_llm() that returns a BaseChatModel.
- Vectorization: No built-in step to embed chunks for retrieval; you get text chunks only. Wiring in embeddings and optional indexing would complete a full RAG-ready pipeline.
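As an illustration of the first item, an Anthropic backend might look like the sketch below. It assumes the langchain-anthropic package is installed and that get_llm() takes no arguments, which may not match src/llms/openai.py. The ANTHROPIC_MODEL variable and the default model name are placeholders chosen for this example.

```python
import os


def get_llm():
    """Return a LangChain chat model backed by Anthropic (hypothetical backend)."""
    # Imported lazily so this module loads even without the optional dependency.
    from langchain_anthropic import ChatAnthropic

    # ChatAnthropic subclasses BaseChatModel and reads ANTHROPIC_API_KEY
    # from the environment when no key is passed explicitly.
    return ChatAnthropic(
        model=os.environ.get("ANTHROPIC_MODEL", "claude-3-5-haiku-latest"),
        temperature=0,
    )
```

Because the pipeline depends only on the BaseChatModel interface, swapping providers this way should not require changes to the merge prompt or the range parser.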
Install dependencies from requirements.txt. Set OPENAI_API_KEY in .env; optionally set OPENAI_MODEL (default: gpt-4o-mini). Main code: src/service.py, src/prompt.py, src/llms/openai.py.
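The configuration rules above amount to one required key and one optional model name with a default. The sketch below shows that logic; `OpenAISettings` and `load_settings` are hypothetical names rather than part of the repo, and it assumes the .env values have already been loaded into the environment (for example by python-dotenv).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class OpenAISettings:
    api_key: str
    model: str


def load_settings(env=None) -> OpenAISettings:
    """Read OPENAI_API_KEY (required) and OPENAI_MODEL (optional) from a mapping."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env")
    # An unset or empty OPENAI_MODEL falls back to the documented default.
    return OpenAISettings(api_key=api_key, model=env.get("OPENAI_MODEL") or "gpt-4o-mini")


settings = load_settings({"OPENAI_API_KEY": "sk-example"})
print(settings.model)
```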
This project is under the MIT License. See LICENSE for the full text.


