AutoChunker

A personal implementation of the method from the AutoChunker paper (see below). It performs bottom-up, structure-aware chunking of HTML or text into semantic units for RAG, with noise filtering.

The pipeline at a glance (document → markdown → split into lines → LLM merge + noise filter → hierarchical tree):

Pipeline: document conversion, splitting, LLM aggregation, hierarchical tree

Context

I’m building a modern scraper and needed solid chunking. The paper reports strong gains in chunk quality and retrieval, so I implemented it and am sharing it under the MIT License for anyone who wants a lightweight, easy-to-use chunker.

The paper's numbers back this up: in its benchmarks on the Support and Wikipedia datasets, AutoChunker scores best on noise, completeness, context switch, task relevance, and weighted precision. Details are in the paper.

Benchmarks: AutoChunker vs other methods (Support & Wikipedia)

Paper

Arihant Jain, Purav Aggarwal, and Anoop Saladi. 2025. AutoChunker: Structured Text Chunking and its Evaluation. Vienna, Austria. Association for Computational Linguistics.

The method described above was published by Amazon researchers at ACL 2025. This repository is an independent implementation of that method — it is not an official Amazon release and is not affiliated with or endorsed by Amazon.

What it does

AutoChunker turns web pages or other HTML into semantic chunks suitable for retrieval and RAG: it groups related content into coherent units, strips out navigation, footers, and other page noise, and can preserve document structure. You feed it HTML and get back self-contained chunks that are easier to index and retrieve than raw or naively split text.

Compared to other chunkers, it supports every feature the paper evaluates (structure, noise elimination, context-aware retrieval, and more):

Feature comparison: AutoChunker vs Recursive, Semantic, LGMGC, LLMSemantic, LumberChunker

Flow

flowchart LR
  A[HTML] --> B[Markdown]
  B --> C[Split on separators]
  C --> D[TextPart IDs]
  D --> E[LLM merge and noise drop]
  E --> F[Parse merged ranges]
  F --> G[Flat chunks]
  1. HTML → Markdown (html2text).
  2. Split on separators into numbered paragraphs TextPart(id, content).
  3. Format as <pID>content</pID>, send to LLM with the merge prompt.
  4. Parse <merged>start-end,...</merged> from the response.
  5. Map ranges back to paragraph IDs and concatenate content into flat chunks.
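
Steps 2 to 5 can be sketched with the standard library alone. This is a minimal illustration, not the code in src/service.py: only TextPart(id, content) and the two tag formats come from the steps above. The function names, the blank-line separator, the `<p0>...</p0>` reading of `<pID>`, and the handling of a single ID without a range are all assumptions. The LLM call itself is left out; `parse_merged` takes the response text as a plain string.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TextPart:
    id: int
    content: str


def split_into_parts(markdown: str, separator: str = "\n\n") -> list[TextPart]:
    """Split markdown on a separator (assumed: blank lines) and number the parts."""
    pieces = [p.strip() for p in markdown.split(separator) if p.strip()]
    return [TextPart(id=i, content=p) for i, p in enumerate(pieces)]


def format_parts(parts: list[TextPart]) -> str:
    """Wrap each part in <pID>...</pID> tags, one part per line."""
    return "\n".join(f"<p{p.id}>{p.content}</p{p.id}>" for p in parts)


MERGED_RE = re.compile(r"<merged>(.*?)</merged>", re.DOTALL)


def parse_merged(response: str) -> list[tuple[int, int]]:
    """Extract (start, end) ID ranges from the first <merged>...</merged> block."""
    match = MERGED_RE.search(response)
    if match is None:
        return []
    ranges = []
    for token in match.group(1).split(","):
        token = token.strip()
        m = re.fullmatch(r"(\d+)\s*-\s*(\d+)", token)
        if m:
            start, end = int(m.group(1)), int(m.group(2))
        elif token.isdigit():
            # Assumption: a lone ID means a one-part chunk.
            start = end = int(token)
        else:
            continue  # skip malformed tokens instead of failing
        if start <= end:
            ranges.append((start, end))
    return ranges


def build_chunks(parts: list[TextPart], ranges: list[tuple[int, int]]) -> list[str]:
    """Concatenate the parts in each range; parts in no range are dropped as noise."""
    by_id = {p.id: p.content for p in parts}
    chunks = []
    for start, end in ranges:
        texts = [by_id[i] for i in range(start, end + 1) if i in by_id]
        if texts:
            chunks.append("\n\n".join(texts))
    return chunks
```

In this sketch, noise removal falls out of the range mapping: any part the model leaves out of `<merged>` never reaches a chunk, so no separate filtering pass is needed.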

The diagram at the top illustrates this flow. The repo also contains hierarchical_tree_recursion (markdown heading–based tree); it is not yet wired into the main pipeline.
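
To give an idea of what a markdown heading-based tree looks like, here is a small stack-based sketch. It is not the repository's hierarchical_tree_recursion; `Node` and `build_heading_tree` are hypothetical names, and the approach (ATX `#` headings only, one pass with a stack of open sections) is an assumption chosen for brevity.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# ATX headings only: one to six '#' characters followed by whitespace and a title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Node:
    title: str
    level: int
    lines: list[str] = field(default_factory=list)
    children: list[Node] = field(default_factory=list)


def build_heading_tree(markdown: str) -> Node:
    """Nest each heading under the closest preceding heading of a lower level."""
    root = Node(title="", level=0)
    stack = [root]  # open sections, outermost first
    for line in markdown.splitlines():
        m = HEADING_RE.match(line)
        if m:
            level = len(m.group(1))
            # Close sections at the same or a deeper level; root (level 0) always stays.
            while stack[-1].level >= level:
                stack.pop()
            node = Node(title=m.group(2), level=level)
            stack[-1].children.append(node)
            stack.append(node)
        elif line.strip():
            # Body text belongs to the innermost open section.
            stack[-1].lines.append(line)
    return root
```

One known limitation of this sketch: a line starting with `# ` inside a code block would be treated as a heading, so real input would need code blocks masked first.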

What is still missing

  • More LLM backends: The pipeline uses LangChain’s BaseChatModel; src/llms/openai.py implements OpenAI. Adding other providers (Anthropic, local models) is a matter of implementing a get_llm() that returns a BaseChatModel.
  • Vectorization: No built-in step to embed chunks for retrieval — you get text chunks only; wiring in embeddings and optional indexing would complete a full RAG-ready pipeline.
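
As a rough idea of what an additional backend might look like, the sketch below defines a get_llm() for Anthropic. It assumes the langchain-anthropic package is installed and that the module would live next to src/llms/openai.py; the ANTHROPIC_MODEL variable and the temperature setting are assumptions of this example, not part of the repository.

```python
from __future__ import annotations

import os
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from langchain_core.language_models.chat_models import BaseChatModel


def get_llm() -> BaseChatModel:
    """Return an Anthropic chat model configured from environment variables."""
    # ANTHROPIC_MODEL is a hypothetical variable name chosen for this sketch.
    model = os.environ.get("ANTHROPIC_MODEL", "").strip()
    if not model:
        raise RuntimeError("Set ANTHROPIC_MODEL to the model name you want to use")

    # Imported lazily so this module loads even when the optional package is missing.
    from langchain_anthropic import ChatAnthropic

    # ChatAnthropic picks up ANTHROPIC_API_KEY from the environment by default.
    return ChatAnthropic(model=model, temperature=0)
```

No default model name is hard-coded here, so the choice of model stays an explicit configuration decision.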

Setup

Install the dependencies listed in requirements.txt. Set OPENAI_API_KEY in .env; OPENAI_MODEL is optional and defaults to gpt-4o-mini. The main code lives in src/service.py, src/prompt.py, and src/llms/openai.py.
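
The configuration rule above (required key, optional model with a default) can be expressed as a small helper. This is only a sketch: `load_openai_settings` is a hypothetical name, and it reads from an already populated environment mapping rather than parsing .env itself, so how the repository actually loads .env is not shown here.

```python
import os
from typing import Mapping

DEFAULT_MODEL = "gpt-4o-mini"  # default stated in the setup notes


def load_openai_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Read OPENAI_API_KEY (required) and OPENAI_MODEL (optional) from a mapping."""
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # An empty or missing OPENAI_MODEL falls back to the default.
    model = env.get("OPENAI_MODEL", "").strip() or DEFAULT_MODEL
    return {"api_key": api_key, "model": model}
```

Passing the mapping as an argument makes the fallback easy to check without touching the real environment, for example `load_openai_settings({"OPENAI_API_KEY": "sk-test"})`.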

License

This project is under the MIT License. See LICENSE for the full text.
