Both pipelines support optional document chunking using Docling's [HybridChunker](https://docling-project.github.io/docling/examples/hybrid_chunking/). This splits converted documents into smaller, semantically meaningful chunks ideal for RAG (Retrieval-Augmented Generation) workflows.

**Chunking parameters:**
- `docling_chunk_enabled`: Set to `True` to enable chunking after conversion (default: `False`).
- `docling_chunk_max_tokens`: Maximum tokens per chunk (default: `512`). Adjust based on your embedding model's context limit.
- `docling_chunk_merge_peers`: If `True`, merge adjacent small chunks for better context (default: `True`).

**Tokenizer:** Chunking uses the `sentence-transformers/all-MiniLM-L6-v2` tokenizer for accurate token counting, ensuring chunks are sized appropriately for common embedding models.
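
To illustrate how `docling_chunk_max_tokens` and `docling_chunk_merge_peers` interact, here is a minimal sketch. It is not Docling's HybridChunker: the function names `count_tokens` and `merge_small_chunks` are hypothetical, a whitespace split stands in for the real model tokenizer, and the merge rule ignores document structure such as headings. It also does not split a chunk that is already over the budget.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Stand-in token counter (whitespace split), NOT the model tokenizer."""
    return len(text.split())


def merge_small_chunks(
    chunks: List[str],
    max_tokens: int = 512,
    merge_peers: bool = True,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Greedily merge adjacent chunks while the result fits in max_tokens."""
    if not merge_peers:
        # Merging disabled: return the chunks unchanged.
        return list(chunks)
    merged: List[str] = []
    for chunk in chunks:
        candidate = f"{merged[-1]} {chunk}" if merged else chunk
        if merged and counter(candidate) <= max_tokens:
            # The previous chunk and this one fit together: merge them.
            merged[-1] = candidate
        else:
            # Start a new chunk (first chunk, or the budget would be exceeded).
            merged.append(chunk)
    return merged
```

With a budget of 4 tokens, `merge_small_chunks(["a b", "c d", "e f g h i"], max_tokens=4)` merges the first two chunks and leaves the third on its own. A smaller `max_tokens` therefore produces more, shorter chunks.
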

**Chunked output location:**
When chunking is enabled, an additional output file is created for each converted document:

File: `kubeflow-pipelines/docling-standard/README.md` (34 additions, 2 deletions)
@@ -27,6 +27,39 @@ The following configuration options are available as KFP parameters when you _Cr
- `pdf_filenames`: List of PDF file names to process, separated by commas.
- `pdf_from_s3`: If `True`, PDF files are fetched from S3-compatible object storage rather than `pdf_base_url`. A secret must be configured as described in the [docs](../README.md).

### Chunking options

The pipeline can optionally chunk converted documents using Docling's [HybridChunker](https://docling-project.github.io/docling/examples/hybrid_chunking/):

- `docling_chunk_enabled`: If `True`, chunk converted documents into smaller pieces (default: `False`).
- `docling_chunk_max_tokens`: Maximum tokens per chunk (default: `512`).
- `docling_chunk_merge_peers`: If `True`, merge adjacent small chunks for better context (default: `True`).

Chunking uses the `sentence-transformers/all-MiniLM-L6-v2` tokenizer for accurate token counting.

**Chunked output**: When chunking is enabled, the pipeline creates a `{filename}_chunks.jsonl` file (one JSON object per line) for each converted document, in the same output directory as the converted documents. See [main docs](../README.md) for output format details.
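
Because each line of a `_chunks.jsonl` file is one JSON object, the output can be read back with the standard library alone. The sketch below assumes nothing about the fields inside each object (see the main docs for those). The helper names `chunks_path_for` and `read_chunks` are hypothetical, and `filename` is taken to be whatever base name the pipeline substitutes into `{filename}`.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def chunks_path_for(output_dir: Path, filename: str) -> Path:
    """Build the `{filename}_chunks.jsonl` path inside the output directory."""
    return output_dir / f"{filename}_chunks.jsonl"


def read_chunks(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

Reading lazily with a generator keeps memory use flat for large documents; wrap the call in `list(...)` if all chunks are needed at once.
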

## Local testing

You can test the pipeline locally using Docker before deploying to KFP.

### Prerequisites

```bash
pip install docker kfp
```

Local runs require a Docker-compatible daemon (a Docker or Podman socket).

### Run locally

```bash
cd data-processing/kubeflow-pipelines/docling-standard
python local_run.py
```

This runs `convert_pipeline_local()`, which converts PDFs and chunks the output.