Skip to content

Unstructured-IO/plugin-tablesetter

Table Setter

A Platform Plugin for "fixing up" HTML Table Chunks by projecting header information into all applicable chunks.

The project is a simple JSON-to-JSON Platform plugin. It modifies metadata.text_as_html of TableChunk elements. It injects "missing" table header information if such information can be discovered in other TableChunk elements. It uses simple heuristics to find and apply this information.

The intended use is downstream of a Chunker when the files often have large tables. It can sometimes be observed that header information is only available in the first chunk, but you may want it in every chunk from the same table.

Project initially vibe-coded from prompt in prompts/main.md

Phase 1 Layout

Phase 1 focuses on the reusable core plus a CLI, while keeping a clean seam for the future HTTP plugin.

src/tablesetter/
  domain/       # pure table parsing and heuristics
  app/          # document-level orchestration and settings
  interfaces/   # CLI now, HTTP adapter later
tests/
  unit/
  integration/
  e2e/
benchmarks/

Developer Experience

This project uses uv, orjson, lxml, pytest, and ruff.

uv sync --all-groups
make help
make check
make status
uv run ruff check .
uv run ruff format .
uv run pytest
uv run pytest --cov=tablesetter

CLI

cat foo.json | uv run tablesetter > fixedup.json
cat foo.json | uv run tablesetter --modify-text > fixedup.json
uv run tablesetter --in foo.json --out fixedup.json

HTTP Plugin

Phase 2 adds a FastAPI server that implements the Unstructured plugin v3 endpoints, including /metadata, /v3/schema, and /v3/invoke.

make serve
uv run tablesetter-http
make docker-image-ref
make docker-build
make docker-smoke
make repro-docker

/v3/invoke supports both:

  • inline base64 input and output descriptors
  • mounted file path input and output descriptors

The Docker workflow now uses a source-derived image tag plus runtime provenance tags in /metadata, so a stale local image is immediately visible.

For the Docker detached-run investigation, use the repeatable harness:

./scripts/repro_docker_detach.sh
./scripts/repro_docker_detach.sh --build

About

A Platform Plugin that "fixes up" HTML table chunks, mostly by projecting table header to all applicable chunks.

Resources

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors