GLM-OCR SDK — Agent Integration Guide

This document describes how AI agents, MCP tool servers, and other programmatic callers can integrate with the GLM-OCR Python SDK without editing YAML files or understanding the internal pipeline.


Quick Start

import glmocr

# One-liner — uses ZHIPU_API_KEY from environment / .env file
result = glmocr.parse("document.pdf")
print(result.to_dict())

Or use the class-based API for multiple calls:

from glmocr import GlmOcr

parser = GlmOcr(api_key="sk-xxx", mode="maas")
result = parser.parse("page.png")
print(result.to_json())
parser.close()            # or use:  with GlmOcr(...) as parser:

Two Operational Modes

| Mode | Value | Requires GPU? | Description |
| --- | --- | --- | --- |
| MaaS | "maas" | No | Forwards requests to Zhipu's cloud API. Recommended for agents. |
| Self-hosted | "selfhosted" | Yes | Uses a local vLLM / SGLang service with optional layout detection. |

When api_key is provided without an explicit mode, the SDK automatically defaults to MaaS mode.


Configuration Priority

The SDK resolves every setting using this priority chain (highest wins):

Constructor kwargs  >  os.environ  >  .env file  >  config.yaml  >  built-in defaults

This means an agent can override any setting without touching files.
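The priority chain can be pictured as a layered lookup in which the first layer that defines a key wins. The following is a minimal sketch of that idea, not the SDK's actual implementation; resolve_settings, the flat key names, and the sample values are all hypothetical and exist only for illustration.

```python
from collections import ChainMap

def resolve_settings(kwargs, environ, dotenv, yaml_cfg, defaults):
    """Merge five setting layers; earlier arguments take priority.

    Hypothetical helper: kwargs left as None are treated as "not provided"
    so that they do not mask lower-priority layers.
    """
    explicit = {key: value for key, value in kwargs.items() if value is not None}
    # ChainMap searches its mappings left to right and returns the first hit.
    return dict(ChainMap(explicit, environ, dotenv, yaml_cfg, defaults))

# Illustrative values only.
settings = resolve_settings(
    kwargs={"timeout": 30, "mode": None},
    environ={"mode": "maas"},
    dotenv={"mode": "selfhosted", "log_level": "DEBUG"},
    yaml_cfg={"timeout": 600},
    defaults={"timeout": 600, "log_level": "INFO", "model": "glm-ocr"},
)
print(settings)
```

Here timeout comes from the constructor kwargs, mode from os.environ, log_level from the .env layer, and model from the defaults.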


Environment Variables

All variables use the prefix GLMOCR_ except the API key, which is read from ZHIPU_API_KEY. Set them in the shell environment or in a .env file located in the current working directory or any of its parent directories.

| Variable | Maps to | Example |
| --- | --- | --- |
| GLMOCR_MODE | pipeline.maas.enabled | maas or selfhosted |
| ZHIPU_API_KEY | pipeline.maas.api_key | sk-abc123 |
| GLMOCR_API_URL | pipeline.maas.api_url | https://open.bigmodel.cn/... |
| GLMOCR_MODEL | pipeline.maas.model | glm-ocr |
| GLMOCR_TIMEOUT | pipeline.maas.request_timeout | 600 |
| GLMOCR_OCR_API_URL | pipeline.ocr_api.api_url | http://localhost:5002/v1/... |
| GLMOCR_OCR_API_KEY | pipeline.ocr_api.api_key | token-xyz |
| GLMOCR_OCR_API_HOST | pipeline.ocr_api.api_host | localhost |
| GLMOCR_OCR_API_PORT | pipeline.ocr_api.api_port | 5002 |
| GLMOCR_OCR_MODEL | pipeline.ocr_api.model | glm-ocr-model |
| GLMOCR_LOG_LEVEL | logging.level | DEBUG, INFO, WARNING, ERROR |
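To make the "Maps to" column concrete, the sketch below turns a few of the variables from the table above into a nested config dict keyed by the dotted paths. It is an illustration rather than the SDK's loader: ENV_MAP and env_to_config are hypothetical names, only a subset of variables is covered, and casting the timeout and port to int is an assumption about the expected types.

```python
import os

# Subset of the table: environment variable -> (dotted config path, caster).
# The int casts are an assumption; the SDK may handle types differently.
ENV_MAP = {
    "ZHIPU_API_KEY": ("pipeline.maas.api_key", str),
    "GLMOCR_MODEL": ("pipeline.maas.model", str),
    "GLMOCR_TIMEOUT": ("pipeline.maas.request_timeout", int),
    "GLMOCR_OCR_API_PORT": ("pipeline.ocr_api.api_port", int),
    "GLMOCR_LOG_LEVEL": ("logging.level", str),
}

def env_to_config(environ=None):
    """Build a nested dict from whichever mapped variables are present."""
    environ = os.environ if environ is None else environ
    config = {}
    for name, (path, cast) in ENV_MAP.items():
        if name not in environ:
            continue
        *parents, leaf = path.split(".")
        node = config
        for key in parents:
            # Create intermediate dicts on demand.
            node = node.setdefault(key, {})
        node[leaf] = cast(environ[name])
    return config

example = env_to_config({"ZHIPU_API_KEY": "sk-abc123", "GLMOCR_TIMEOUT": "600"})
print(example)
```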

.env File Auto-Loading

The SDK walks up from the current working directory looking for a .env file. Values from the .env file are merged with real environment variables, with real env vars always taking priority.

# .env
ZHIPU_API_KEY=sk-my-secret-key
GLMOCR_MODE=maas
GLMOCR_LOG_LEVEL=DEBUG
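The lookup described above can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the SDK's code: find_dotenv, parse_dotenv, and merge_env are hypothetical helpers, and the parser only handles plain KEY=VALUE lines and comments.

```python
from pathlib import Path
from typing import Dict, Optional

def find_dotenv(start: Optional[Path] = None, filename: str = ".env") -> Optional[Path]:
    """Return the nearest `filename` in `start` or one of its parents, else None."""
    start = (start or Path.cwd()).resolve()
    for directory in (start, *start.parents):
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None

def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def merge_env(dotenv_values: Dict[str, str], environ: Dict[str, str]) -> Dict[str, str]:
    """Real environment variables override values read from the .env file."""
    return {**dotenv_values, **environ}

file_values = parse_dotenv("# .env\nZHIPU_API_KEY=sk-my-secret-key\nGLMOCR_MODE=maas\n")
merged = merge_env(file_values, {"GLMOCR_MODE": "selfhosted"})
print(merged)
```

In the merged result, GLMOCR_MODE comes from the real environment while ZHIPU_API_KEY still comes from the file.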

Constructor Parameters

GlmOcr() and the convenience parse() function accept these keyword arguments. They map to the same settings as the environment variables but with higher priority.

| Keyword | Type | Description |
| --- | --- | --- |
| config_path | str | Path to a YAML config file (optional). |
| api_key | str | API key. Providing this without mode auto-enables MaaS. |
| api_url | str | MaaS API endpoint URL. |
| model | str | Model name. |
| mode | str | "maas" or "selfhosted". |
| timeout | int | Request timeout in seconds. |
| log_level | str | Logging level. |

parse() Return Type

The return type mirrors the input type for ergonomic usage:

# Single file → single PipelineResult
result = parser.parse("image.png")
result.save("./output")

# Multiple files → list of PipelineResult
results = parser.parse(["img1.png", "doc.pdf"])
for r in results:
    r.save("./output")

Type checkers see proper @overload signatures — no casts needed.
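For readers unfamiliar with @overload, the pattern looks roughly like the sketch below. ParserStub and PipelineResultStub are hypothetical stand-ins written for this example; the SDK's real signatures may differ in parameter names and accepted types.

```python
from typing import List, Union, overload

class PipelineResultStub:
    """Stand-in for PipelineResult; it only remembers its source path."""
    def __init__(self, source: str) -> None:
        self.source = source

class ParserStub:
    """Stand-in parser showing how the return type follows the input type."""

    @overload
    def parse(self, images: str) -> PipelineResultStub: ...
    @overload
    def parse(self, images: List[str]) -> List[PipelineResultStub]: ...

    def parse(self, images: Union[str, List[str]]):
        # A single path yields one result; a list yields a list of results.
        if isinstance(images, str):
            return PipelineResultStub(images)
        return [PipelineResultStub(path) for path in images]

stub = ParserStub()
single = stub.parse("image.png")
many = stub.parse(["img1.png", "doc.pdf"])
print(type(single).__name__, len(many))
```

With signatures of this shape, a type checker infers a single result for a str argument and a list of results for a list argument.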


Result Serialization

Every PipelineResult can be serialized without touching the file system:

to_dict()

Returns a JSON-serializable Python dict:

d = result.to_dict()
# {
#   "json_result": [[{"index": 0, "label": "text", "content": "...", "bbox_2d": [...]}]],
#   "markdown_result": "# Page title\n...",
#   "original_images": ["/abs/path/to/image.png"],
#   "usage": {"total_tokens": 1234},      # present in MaaS mode
#   "data_info": {"pages": [...]},         # present in MaaS mode
# }

to_json(**kwargs)

Returns a JSON string. Keyword arguments are forwarded to json.dumps. Defaults: ensure_ascii=False, indent=2.

json_str = result.to_json()                 # pretty-printed
json_str = result.to_json(indent=None)      # compact single line
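The way caller-supplied keyword arguments interact with the defaults can be sketched as follows. to_json_sketch is a hypothetical function that only mirrors the documented behaviour (defaults of ensure_ascii=False and indent=2, with kwargs forwarded to json.dumps); it is not the SDK's source.

```python
import json

def to_json_sketch(data, **kwargs):
    """Serialize `data`, letting caller kwargs override the documented defaults."""
    options = {"ensure_ascii": False, "indent": 2, **kwargs}
    return json.dumps(data, **options)

pretty = to_json_sketch({"markdown_result": "# Résumé"})
compact = to_json_sketch({"markdown_result": "# Résumé"}, indent=None)
print(pretty)
print(compact)
```

Because ensure_ascii is off by default, non-ASCII text such as "é" is emitted as-is rather than as a \u escape.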

save(output_dir)

Writes JSON + Markdown files (with cropped images) to disk:

result.save(output_dir="./output")

Structured Output — json_result

json_result is a list of pages, each page a list of regions:

[
  [
    {
      "index": 0,
      "label": "title",
      "content": "Annual Report 2024",
      "bbox_2d": [100, 50, 900, 120]
    },
    {
      "index": 1,
      "label": "text",
      "content": "Revenue grew 15% year-over-year...",
      "bbox_2d": [100, 140, 900, 400]
    }
  ]
]

Coordinates (bbox_2d) are normalised to a 0–1000 scale regardless of the backend (MaaS or self-hosted).

Labels include title, text, table, figure, formula, header, footer, page_number, and reference, among others.
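Agents usually need two things from json_result: the regions with a given label, and pixel coordinates for cropping or highlighting. The sketch below shows both. The helper names are hypothetical, and it assumes bbox_2d is ordered [x1, y1, x2, y2]; the guide does not state the order, so verify it against real output before relying on it.

```python
def bbox_to_pixels(bbox_2d, width, height):
    """Scale a 0-1000 normalised box to pixel coordinates.

    Assumes the order [x1, y1, x2, y2], which is not confirmed by the guide.
    """
    x1, y1, x2, y2 = bbox_2d
    return (
        round(x1 * width / 1000),
        round(y1 * height / 1000),
        round(x2 * width / 1000),
        round(y2 * height / 1000),
    )

def regions_with_label(json_result, label):
    """Yield (page_index, region) for every region carrying `label`."""
    for page_index, page in enumerate(json_result):
        for region in page:
            if region.get("label") == label:
                yield page_index, region

json_result = [[
    {"index": 0, "label": "title", "content": "Annual Report 2024",
     "bbox_2d": [100, 50, 900, 120]},
    {"index": 1, "label": "text", "content": "Revenue grew 15% year-over-year...",
     "bbox_2d": [100, 140, 900, 400]},
]]

titles = list(regions_with_label(json_result, "title"))
# Page rendered at 2000 x 3000 pixels (illustrative size).
title_box = bbox_to_pixels(titles[0][1]["bbox_2d"], width=2000, height=3000)
print(title_box)
```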


Error Handling

When a MaaS request fails, the SDK does not raise. It returns a PipelineResult with its _error attribute set, and to_dict() then includes an "error" key:

result = parser.parse("image.png")
d = result.to_dict()
if "error" in d:
    print("Parsing failed:", d["error"])
else:
    print(d["markdown_result"])
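Callers that prefer exceptions over inspecting dicts can wrap the check in a small helper. This is one possible pattern, not part of the SDK: OcrError and markdown_or_raise are hypothetical names, and the helper relies only on the "error" and "markdown_result" keys shown in this guide.

```python
class OcrError(RuntimeError):
    """Hypothetical exception raised when a result dict carries an error."""

def markdown_or_raise(result_dict):
    """Return the Markdown text, or raise OcrError if parsing failed."""
    if "error" in result_dict:
        raise OcrError(str(result_dict["error"]))
    return result_dict.get("markdown_result", "")

ok_text = markdown_or_raise({"markdown_result": "# Page title"})

try:
    markdown_or_raise({"error": "request timed out"})
    failure_message = None
except OcrError as exc:
    failure_message = str(exc)

print(ok_text, failure_message)
```

In real use the argument would be result.to_dict().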

MCP / Tool Integration Pattern

When wrapping GLM-OCR as an MCP tool:

import glmocr

def ocr_tool(image_path: str) -> str:
    """Parse a document and return structured JSON."""
    result = glmocr.parse(image_path)
    return result.to_json()

The tool only needs ZHIPU_API_KEY in the environment (or .env file). No YAML configuration is required.
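A tool server generally should not crash on bad input, so a slightly more defensive variant converts any exception that does escape (for example, from an unreadable path) into a JSON error payload. The sketch below injects the parse function so it can run without the SDK installed; make_ocr_tool, _FakeResult, and _failing_parse are hypothetical names used only for this demonstration.

```python
import json
from typing import Any, Callable

def make_ocr_tool(parse: Callable[[str], Any]) -> Callable[[str], str]:
    """Wrap a parse function so the tool always returns a JSON string."""
    def ocr_tool(image_path: str) -> str:
        try:
            return parse(image_path).to_json()
        except Exception as exc:  # broad on purpose: the tool must not raise
            payload = {"error": f"{type(exc).__name__}: {exc}"}
            return json.dumps(payload, ensure_ascii=False)
    return ocr_tool

# Stand-ins so the sketch runs without the SDK or network access.
class _FakeResult:
    def to_json(self) -> str:
        return '{"markdown_result": "ok"}'

def _failing_parse(path: str):
    raise FileNotFoundError(path)

good = make_ocr_tool(lambda path: _FakeResult())("page.png")
bad = make_ocr_tool(_failing_parse)("missing.png")
print(good)
print(bad)
```

In a real server the wrapper would be built once with make_ocr_tool(glmocr.parse) and the returned function registered as the tool.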


Programmatic Config Construction

For advanced use cases you can build config objects directly:

from glmocr.config import GlmOcrConfig

cfg = GlmOcrConfig.from_env(
    api_key="sk-xxx",
    mode="maas",
    timeout=600,
    log_level="DEBUG",
)
print(cfg.to_dict())

from_env() respects the full priority chain: kwargs > os.environ > .env > YAML > defaults.


Running Tests

python -m pytest glmocr/tests/test_unit.py -v

All tests run without network access or GPU. MaaS/Pipeline internals are mocked where needed.