GML-2049 chunker updates #29

Merged
chengbiao-jin merged 5 commits into main from GML-2049-Chunker_Updates
Mar 18, 2026

@chengbiao-jin chengbiao-jin commented Mar 18, 2026

PR Type

Enhancement, Bug fix, Tests


Description

  • Enable apiToken for TigerGraph connections

    • Skip getToken when token provided
    • Add unit tests for apiToken
  • Bound HTML/Markdown chunks recursively (4096)

    • Default fallback size and overlap support
  • Fix PDF image paths, form artifacts

    • Handle spaces; deduplicate table rows
  • Route graph stats to function calls

    • Update provider prompts for counts
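
To illustrate the recursive bounding described above, here is a minimal sketch and not the code from this PR. The names `split_bounded` and `DEFAULT_CHUNK_SIZE`, the separator order, and the choice of characters as the size unit are all assumptions made for the example. It falls back to 4096 when no size is configured, rejects an overlap that is not smaller than the chunk size, and recurses into any section that is still too large.

```python
from typing import List, Optional, Sequence

# Fallback size taken from the PR description; the unit (characters) is an assumption.
DEFAULT_CHUNK_SIZE = 4096


def split_bounded(
    text: str,
    chunk_size: Optional[int] = None,
    overlap: int = 0,
    separators: Sequence[str] = ("\n\n", "\n", " "),
) -> List[str]:
    """Hypothetical helper: split text so that no chunk exceeds the size bound."""
    # Use the fallback when the caller passes None, 0, or a negative size.
    size = chunk_size if chunk_size and chunk_size > 0 else DEFAULT_CHUNK_SIZE
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than the chunk size")

    # Small enough already: keep it, unless it is only whitespace.
    if len(text) <= size:
        return [text] if text.strip() else []

    # No separators left: cut fixed-width windows that share `overlap` characters.
    if not separators:
        step = size - overlap
        windows = []
        start = 0
        while start < len(text):
            windows.append(text[start:start + size])
            if start + size >= len(text):
                break
            start += step
        return windows

    # Split on the coarsest separator and recurse into each piece with finer ones.
    head, rest = separators[0], separators[1:]
    chunks: List[str] = []
    for piece in text.split(head):
        chunks.extend(split_bounded(piece, size, overlap, rest))
    return chunks
```

Overlap is applied only in the final fixed-width fallback, and small neighboring pieces are not merged back together; a production chunker would likely do both.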

Diagram Walkthrough

flowchart LR
  CHUNK["Chunkers updated (defaults, recursive)"]
  HTML["HTML chunker\nfallback+recursive"]
  MD["Markdown chunker\nfallback+recursive"]
  CHAR["Character chunker\n4096 fallback"]
  RECUR["Recursive chunker\n4096 fallback"]
  CONN["DB connections\napiToken support"]
  CFG["Config init\napiToken passthrough"]
  PDF["PDF extractor\nimage+markdown fixes"]
  PROMPT["Routing prompts\nGraph stats -> functions"]
  LOAD["Loader\nconfigurable batch/delay"]
  DOCKER["Compose\nTG service optional"]

  CHUNK -- "applies to" --> HTML
  CHUNK -- "applies to" --> MD
  CHUNK -- "applies to" --> CHAR
  CHUNK -- "applies to" --> RECUR
  CFG -- "used by" --> CONN
  CONN -- "unit tests" --> PROMPT
  PDF -- "clean images/markdown" --> CHUNK
  PROMPT -- "provider prompts updated" --> CHUNK
  LOAD -- "tunable throughput" --> CFG
  DOCKER -- "external TG supported" --> CONN

File Walkthrough

Relevant files
Enhancement
8 files
character_chunker.py
Default to 4096 and validate overlaps                                       
+6/-6     
html_chunker.py
Recursive split for oversized header sections                       
+31/-3   
markdown_chunker.py
Fallback size and recursive markdown splitting                     
+20/-13 
recursive_chunker.py
Default recursive chunk size set to 4096                                 
+4/-2     
config.py
Support static apiToken and conditional getToken                 
+2/-1     
connections.py
Use apiToken directly; skip getToken; async support           
+29/-1   
base_llm.py
Route graph statistics questions to function calls             
+8/-2     
supportai_ingest.py
Pass chunk size/overlap to HTML chunker                                   
+3/-1     
Bug fix
1 file
text_extractors.py
Fix image paths; clean PDF markdown artifacts                       
+99/-6   
Configuration changes
3 files
ecc_util.py
Update chunker defaults and pass new parameters                   
+5/-3     
graph_rag.py
Configurable batch size and optional upsert delay               
+7/-5     
docker-compose.yml
Comment out TigerGraph service; externalize dependency     
+12/-12 
Tests
1 file
test_connections.py
Add unit tests for apiToken connection handling                   
+117/-0 
Documentation
7 files
generate_function.txt
Clarify count queries route to Count functions                     
+1/-1     
generate_function.txt
Clarify count queries route to Count functions                     
+1/-1     
generate_function.txt
Clarify count queries route to Count functions                     
+1/-1     
generate_function.txt
Clarify count queries route to Count functions                     
+1/-1     
generate_function.txt
Clarify count queries route to Count functions                     
+1/-1     
generate_function.txt
Clarify count queries route to Count functions                     
+1/-1     
generate_function.txt
Clarify count queries route to Count functions                     
+1/-1     
Additional files
1 file
generate_function.txt +1/-1     
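
The apiToken handling listed for `config.py`, `connections.py`, and `test_connections.py` can be pictured with the sketch below. It is an illustration under stated assumptions, not the merged implementation: `open_connection`, `StubConnection`, and the config keys `hostname`, `apiToken`, and `getToken` are hypothetical, and the keyword names passed to the factory are only loosely modeled on pyTigerGraph. A stub stands in for the real connection class so the example runs without a database.

```python
from typing import Any, Callable, Dict


class StubConnection:
    """Hypothetical stand-in for a TigerGraph connection; records how it was used."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs
        self.token_requests = 0

    def getToken(self) -> str:
        # A real client would call the database here; the stub only counts calls.
        self.token_requests += 1
        return "generated-token"


def open_connection(db_config: Dict[str, Any], factory: Callable[..., Any]) -> Any:
    """Hypothetical helper: prefer a static apiToken, otherwise request a token if asked to."""
    kwargs = {
        "host": db_config.get("hostname"),
        "username": db_config.get("username"),
        "password": db_config.get("password"),
    }
    api_token = db_config.get("apiToken") or ""
    if api_token:
        # Static token supplied: pass it straight through and skip getToken.
        return factory(apiToken=api_token, **kwargs)

    conn = factory(**kwargs)
    if db_config.get("getToken"):
        # No static token, and the config asks for one to be generated.
        conn.getToken()
    return conn
```

A unit test in the same spirit would assert that `token_requests` stays at 0 whenever `apiToken` is set, and that it becomes 1 when only `getToken` is enabled.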

@chengbiao-jin chengbiao-jin merged commit 13d868a into main Mar 18, 2026
1 check failed
@chengbiao-jin chengbiao-jin deleted the GML-2049-Chunker_Updates branch March 18, 2026 22:55