---
title: Document Analysis
sidebarTitle: Document Analysis
icon: file-pdf
description: "Upload PDFs for multi-endpoint safety analysis with per-page detection, chain-of-custody hashing, and zero-retention processing"
keywords:
  - document
  - PDF
  - safety
  - analysis
  - upload
  - multipart
  - per-page
  - chain-of-custody
  - zero-retention
---

Upload a PDF document to run safety detection across every page. Tuteliq extracts text from each page, runs your chosen detection endpoints in parallel, and returns per-page results with an overall risk assessment. No document data is stored after the response is returned.

Quick start

```bash cURL
curl -X POST https://api.tuteliq.ai/api/v1/safety/document \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@report.pdf" \
  -F "endpoints=[\"unsafe\",\"coercive-control\",\"radicalisation\"]"
```

```javascript Node.js
import { readFile } from 'node:fs/promises';

// The built-in FormData (Node 18+) needs a Blob or File, not a read stream,
// so load the PDF into a Blob before appending it.
const pdf = new Blob([await readFile('report.pdf')], { type: 'application/pdf' });

const form = new FormData();
form.append('file', pdf, 'report.pdf');
form.append('endpoints', JSON.stringify(['unsafe', 'coercive-control', 'radicalisation']));

// Do not set Content-Type manually; fetch adds the multipart boundary itself.
const res = await fetch('https://api.tuteliq.ai/api/v1/safety/document', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY' },
  body: form,
});

const result = await res.json();
console.log(result.overall_severity);      // e.g. "high"
console.log(result.flagged_pages.length);  // e.g. 2
console.log(result.credits_used);          // e.g. 30
```

```python Python
import requests

# Send the PDF as the multipart "file" field; "endpoints" is a JSON array string.
with open("report.pdf", "rb") as f:
    res = requests.post(
        "https://api.tuteliq.ai/api/v1/safety/document",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": ("report.pdf", f, "application/pdf")},
        data={"endpoints": '["unsafe","coercive-control","radicalisation"]'},
    )

result = res.json()
print(result["overall_severity"])
print(result["flagged_pages"])
```

How it works

1. Send a PDF via `multipart/form-data`. The maximum is 50 MB and 100 pages.
2. Text is extracted from each page using the PDF text layer. Pages with fewer than 20 characters of extractable text are skipped.
3. A SHA-256 hash of the raw file is computed for chain-of-custody verification.
4. Each page is analyzed against your chosen detection endpoints in parallel (bounded concurrency of 3 pages at a time). Long pages are chunked before analysis.
5. Per-page results are aggregated into an overall risk score, severity level, and list of flagged pages.
6. Incidents are recorded and webhooks are triggered for flagged content.
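
The page-level flow described above can be pictured with a small local simulation. This is only an illustrative sketch, not Tuteliq's server code: `analyze_page` is a hypothetical stand-in for the real detectors, how whitespace counts toward the 20-character minimum is an assumption, and the constants simply mirror the documented values.

```python
import asyncio

MIN_CHARS = 20        # documented minimum of extractable text per page
MAX_CONCURRENCY = 3   # documented number of pages analyzed at a time


async def analyze_page(page_number: int, text: str) -> dict:
    """Hypothetical stand-in for the detection step; returns a dummy score."""
    await asyncio.sleep(0)  # yield control, as a real model or network call would
    return {"page_number": page_number, "page_risk_score": 0.0}


async def analyze_document(pages: list[str]) -> dict:
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def bounded(page_number: int, text: str) -> dict:
        async with semaphore:  # at most MAX_CONCURRENCY pages in flight
            return await analyze_page(page_number, text)

    # Pages are numbered from 1; pages with too little text are skipped.
    tasks = [
        bounded(number, text)
        for number, text in enumerate(pages, start=1)
        if len(text) >= MIN_CHARS
    ]
    page_results = await asyncio.gather(*tasks)
    return {
        "total_pages": len(pages),
        "pages_analyzed": len(page_results),
        "page_results": list(page_results),
    }


summary = asyncio.run(
    analyze_document(["short", "A page with plenty of extractable text."])
)
print(summary["total_pages"], summary["pages_analyzed"])
```

The semaphore is what keeps the work bounded: all eligible pages are scheduled at once, but only three can be inside `analyze_page` at any moment.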

Available endpoints

You can run any combination of these 8 detection endpoints against each page:

| Endpoint name | Detection type |
| --- | --- |
| `unsafe` | Harmful content across all KOSA categories |
| `bullying` | Cyberbullying and harassment |
| `grooming` | Grooming patterns |
| `social-engineering` | Social engineering tactics |
| `coercive-control` | Coercive control patterns |
| `radicalisation` | Radicalisation indicators |
| `romance-scam` | Romance scam patterns |
| `mule-recruitment` | Money mule recruitment |

Default endpoints (when `endpoints` is omitted): `unsafe`, `coercive-control`, `radicalisation`.

Request parameters

Upload your PDF as a `multipart/form-data` request. The file field must be named `file`.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `file` | file | Yes | PDF file (max 50 MB) |
| `endpoints` | string | No | JSON array of endpoint names, or comma-separated list. Defaults to `["unsafe","coercive-control","radicalisation"]`. |
| `file_id` | string | No | Your identifier for the file (echoed back in the response) |
| `external_id` | string | No | External reference ID (echoed back) |
| `customer_id` | string | No | Customer reference ID (echoed back) |
| `age_group` | string | No | `"under 10"`, `"10-12"`, `"13-15"`, `"16-17"`, or `"under 18"` |
| `language` | string | No | ISO 639-1 code. Auto-detected if omitted. |
| `platform` | string | No | Platform name for context-aware scoring |
| `support_threshold` | string | No | Minimum severity to include crisis helplines. Default: `"high"`. |
| `metadata` | string | No | JSON object with custom metadata (echoed back) |
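
Because every non-file field travels as a string, `endpoints` and `metadata` have to be JSON-encoded on the client. The helper below is a minimal sketch of one way to assemble those fields in Python; `build_form_fields` and `KNOWN_ENDPOINTS` are illustrative names, not part of any Tuteliq SDK, and the sample values are made up.

```python
import json

# The 8 endpoint names listed in this page.
KNOWN_ENDPOINTS = {
    "unsafe", "bullying", "grooming", "social-engineering",
    "coercive-control", "radicalisation", "romance-scam", "mule-recruitment",
}


def build_form_fields(endpoints=None, metadata=None, **optional):
    """Return the non-file form fields as a dict of strings.

    `endpoints` is a list of endpoint names and `metadata` is a dict.
    Extra keyword arguments (file_id, age_group, language, ...) pass through.
    """
    fields = {}
    if endpoints is not None:
        unknown = sorted(set(endpoints) - KNOWN_ENDPOINTS)
        if unknown:
            raise ValueError(f"Unknown endpoints: {unknown}")
        fields["endpoints"] = json.dumps(list(endpoints))
    if metadata is not None:
        fields["metadata"] = json.dumps(metadata)
    # Leave unset optional fields out so the documented defaults apply.
    fields.update({k: str(v) for k, v in optional.items() if v is not None})
    return fields


fields = build_form_fields(
    endpoints=["unsafe", "grooming"],
    metadata={"case": "A-17"},
    file_id="report-2024-03",
    age_group="13-15",
)
print(fields)
```

The resulting dict can be passed as `data=fields` to `requests.post`, alongside the `files={"file": ...}` argument from the quick start.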

Response

```json
{
  "file_id": "report.pdf",
  "document_hash": "sha256:a1b2c3d4e5f6...",
  "total_pages": 12,
  "pages_analyzed": 10,
  "extraction_summary": {
    "text_layer_pages": 10,
    "ocr_pages": 0,
    "failed_pages": 2,
    "average_ocr_confidence": 0
  },
  "page_results": [
    {
      "page_number": 1,
      "text_preview": "Chapter 1: Introduction to the platform...",
      "extraction_method": "text_layer",
      "results": [
        {
          "endpoint": "unsafe",
          "detected": false,
          "severity": 0,
          "confidence": 0.95,
          "risk_score": 0,
          "level": "low",
          "categories": [],
          "evidence": [],
          "recommended_action": "none",
          "rationale": "No harmful content detected."
        }
      ],
      "page_risk_score": 0,
      "page_severity": "none"
    },
    {
      "page_number": 5,
      "text_preview": "The user was told to send money...",
      "extraction_method": "text_layer",
      "results": [
        {
          "endpoint": "coercive-control",
          "detected": true,
          "severity": 0.82,
          "confidence": 0.91,
          "risk_score": 0.82,
          "level": "critical",
          "categories": [
            { "tag": "FINANCIAL_CONTROL", "label": "Financial Control", "confidence": 0.91 }
          ],
          "evidence": [
            { "text": "send money or else", "tactic": "FINANCIAL_CONTROL", "weight": 0.88 }
          ],
          "recommended_action": "flag_for_review",
          "rationale": "Financial coercion pattern detected."
        }
      ],
      "page_risk_score": 0.82,
      "page_severity": "critical"
    }
  ],
  "overall_risk_score": 0.82,
  "overall_severity": "critical",
  "detected_endpoints": ["coercive-control"],
  "flagged_pages": [
    {
      "page_number": 5,
      "risk_score": 0.82,
      "severity": "critical",
      "detected_endpoints": ["coercive-control"]
    }
  ],
  "credits_used": 30,
  "processing_time_ms": 4521,
  "language": "en",
  "language_status": "stable",
  "support": {
    "helplines": [...]
  }
}
```

Key response fields

| Field | Description |
| --- | --- |
| `document_hash` | SHA-256 hash of the uploaded PDF for chain-of-custody verification |
| `total_pages` | Total pages in the document |
| `pages_analyzed` | Pages with sufficient text that were analyzed |
| `extraction_summary` | Breakdown of text extraction results per page |
| `page_results` | Per-page detection results from each endpoint |
| `overall_risk_score` | Highest risk score across all pages (0.0–1.0) |
| `overall_severity` | `none`, `low`, `medium`, `high`, or `critical` |
| `detected_endpoints` | Unique list of endpoints that detected threats |
| `flagged_pages` | Pages with risk score >= 0.3, with their detected endpoints |
| `credits_used` | Dynamic credit cost based on pages analyzed and endpoints used |
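
The document-level fields can be recomputed from `page_results`, which is a handy sanity check when storing results. The sketch below assumes only what the table states (the overall score is the highest page score, and pages are flagged at 0.3 or above); `summarize` is an illustrative helper, and the sample input is a trimmed-down copy of the response shown in the Response section.

```python
FLAG_THRESHOLD = 0.3  # documented cut-off for flagged_pages


def summarize(result: dict) -> dict:
    """Recompute the document-level fields from page_results."""
    pages = result.get("page_results", [])
    scores = {p["page_number"]: p["page_risk_score"] for p in pages}
    flagged = sorted(n for n, s in scores.items() if s >= FLAG_THRESHOLD)
    detected = sorted({
        r["endpoint"] for p in pages for r in p["results"] if r["detected"]
    })
    return {
        "overall_risk_score": max(scores.values(), default=0.0),
        "flagged_page_numbers": flagged,
        "detected_endpoints": detected,
    }


# Only the fields the helper reads are kept from the sample response.
sample = {
    "page_results": [
        {"page_number": 1, "page_risk_score": 0,
         "results": [{"endpoint": "unsafe", "detected": False}]},
        {"page_number": 5, "page_risk_score": 0.82,
         "results": [{"endpoint": "coercive-control", "detected": True}]},
    ],
}
print(summarize(sample))
```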

Credit pricing

Document analysis uses dynamic pricing based on the actual work performed:

`credits = max(10, pages_analyzed × endpoint_count)`

| Document | Endpoints | Credits |
| --- | --- | --- |
| 1 page, 3 default endpoints | 3 | 10 (minimum) |
| 5 pages, 3 default endpoints | 3 | 15 |
| 10 pages, 1 endpoint | 1 | 10 (minimum) |
| 20 pages, 8 endpoints | 8 | 160 |
| 100 pages, 8 endpoints | 8 | 800 |

The minimum charge is 10 credits (covers extraction overhead). Each page-endpoint combination costs 1 credit.
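
To budget before uploading, the pricing formula can be written as a small function. `estimate_credits` is an illustrative name; the loop reproduces the rows of the pricing table.

```python
MIN_CREDITS = 10  # documented minimum charge per document


def estimate_credits(pages_analyzed: int, endpoint_count: int) -> int:
    """Apply credits = max(10, pages_analyzed * endpoint_count)."""
    if pages_analyzed < 0 or endpoint_count < 0:
        raise ValueError("counts must be non-negative")
    return max(MIN_CREDITS, pages_analyzed * endpoint_count)


for pages, endpoints in [(1, 3), (5, 3), (10, 1), (20, 8), (100, 8)]:
    print(pages, endpoints, estimate_credits(pages, endpoints))
```

The formula counts `pages_analyzed`, not `total_pages`, so an estimate based on the full page count is an upper bound whenever some pages are skipped for having too little text.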

Choose your endpoints carefully. Running 8 endpoints on a 100-page document costs 800 credits. For most use cases, the 3 default endpoints (`unsafe`, `coercive-control`, `radicalisation`) provide comprehensive coverage.

Chain-of-custody

Every response includes a `document_hash`, a SHA-256 hash of the exact bytes uploaded. Use this to:

- Prove which file was analyzed in compliance audits
- Verify document integrity if the same file is analyzed again
- Include in incident reports for regulatory submissions

The value has the form `sha256:a1b2c3d4e5f6789...`.
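
A local copy of the file can be checked against a returned `document_hash` along these lines. This is a sketch that assumes the value is the `sha256:` prefix followed by a hexadecimal digest, as the sample suggests; the function names are illustrative.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_document_hash(path: str, document_hash: str) -> bool:
    """Compare a local file against the document_hash from a response.

    Assumes the value is "sha256:" followed by a hex digest.
    """
    algorithm, _, expected = document_hash.partition(":")
    if algorithm != "sha256" or not expected:
        raise ValueError(f"Unexpected hash format: {document_hash!r}")
    return hmac.compare_digest(sha256_of_file(path), expected.lower())


# Usage, given `result` parsed from an API response:
# matches_document_hash("report.pdf", result["document_hash"])
```

Hashing the exact bytes means any re-save or re-export of the PDF, even one that looks identical, is expected to produce a different hash.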

Zero retention

**No document data is stored.** The PDF is processed entirely in memory, analyzed, and discarded. The response is the only output. This is the same privacy-by-design approach used across all Tuteliq endpoints.

Limits

| Limit | Value |
| --- | --- |
| Max file size | 50 MB |
| Max pages | 100 |
| Supported formats | PDF only (`application/pdf`) |
| Min text per page | 20 characters (pages below this are skipped) |
| Concurrency | 3 pages analyzed simultaneously |
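
Some of these limits can be checked locally before spending a request. The sketch below covers file size and file type only; counting pages would need a PDF library and is left out. `preflight_problems` is an illustrative helper, and treating "50 MB" as 50 × 1024 × 1024 bytes is an assumption, since the exact unit is not stated.

```python
import os

MAX_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" interpreted as 50 MiB
PDF_MAGIC = b"%PDF-"


def preflight_problems(path: str) -> list[str]:
    """Return problems found before uploading; an empty list means none found."""
    problems = []
    size = os.path.getsize(path)
    if size == 0:
        problems.append("file is empty")
    if size > MAX_BYTES:
        problems.append(f"file is {size} bytes, over the 50 MB limit")
    with open(path, "rb") as f:
        # Well-formed PDF files begin with the "%PDF-" header.
        if f.read(len(PDF_MAGIC)) != PDF_MAGIC:
            problems.append("file does not start with a PDF header")
    return problems


# Usage:
# problems = preflight_problems("report.pdf")
# if problems: print("Not uploading:", "; ".join(problems))
```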

Tier access

Document analysis is available on Indie tier and above. Starter tier does not have access to this endpoint.

Error codes

| Code | Description |
| --- | --- |
| `ANALYSIS_6010` | PDF extraction failed (corrupt or password-protected file) |
| `ANALYSIS_6011` | Document exceeds 100-page limit |
| `FILE_MISSING` | No file uploaded |
| `FILE_INVALID_TYPE` | Non-PDF file uploaded |
| `FILE_TOO_LARGE` | File exceeds 50 MB |
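
Client code can translate these codes into next steps for the caller. The sketch below is illustrative: the shape of the error response body is not described on this page, so the helper takes only the code string, and the suggested remedies are inferences from the descriptions above rather than official guidance.

```python
# Suggested client-side reactions, inferred from the error descriptions.
ERROR_GUIDANCE = {
    "ANALYSIS_6010": "Extraction failed: check that the PDF is not corrupt or password-protected.",
    "ANALYSIS_6011": "Too many pages: split the document into parts of 100 pages or fewer.",
    "FILE_MISSING": "No file received: send the PDF in a multipart field named 'file'.",
    "FILE_INVALID_TYPE": "Wrong type: only application/pdf uploads are accepted.",
    "FILE_TOO_LARGE": "File too large: reduce it below 50 MB before uploading.",
}


def explain_error(code: str) -> str:
    """Map a documented error code to a human-readable next step."""
    return ERROR_GUIDANCE.get(code, f"Unrecognized error code: {code}")


print(explain_error("FILE_TOO_LARGE"))
```

All five codes describe a problem with the request itself, so re-sending the same file unchanged is unlikely to succeed.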