superdoc-dev/docx-corpus

Building the largest open corpus of .docx files for document processing and rendering research.

Vision

Document rendering is hard. Microsoft Word has decades of edge cases, quirks, and undocumented behaviors. To build reliable document processing tools, you need to test against real-world documents - not just synthetic test cases.

docx-corpus scrapes the entire public web (via Common Crawl) to collect .docx files, creating a massive test corpus for:

  • Document parsing and rendering engines

  • Visual regression testing

  • Feature coverage analysis

  • Edge case discovery

  • Machine learning training data

How It Works

Phase 1: Index Filtering (Lambda)
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│  Common Crawl  │     │   cdx-filter   │     │  Cloudflare R2 │
│  CDX indexes   │ ──► │   (Lambda)     │ ──► │  cdx-filtered/ │
└────────────────┘     └────────────────┘     └────────────────┘
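
Each CDX index line is a SURT-sorted URL key, a timestamp, and a JSON blob containing the URL, detected MIME type, and the WARC filename/offset/length needed to fetch the capture later. A minimal Python sketch of the kind of filter the Lambda applies (field names follow the public Common Crawl CDX format; the actual implementation lives in apps/cdx-filter):

import json

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

def is_docx_record(cdx_line: str) -> dict | None:
    """Return the capture's location fields if the line points at a .docx, else None."""
    # Each line: "<surt key> <timestamp> <json fields>"
    _, _, raw = cdx_line.split(" ", 2)
    fields = json.loads(raw)
    url = fields.get("url", "")
    mime = fields.get("mime-detected") or fields.get("mime", "")
    if fields.get("status") != "200":
        return None
    if url.lower().endswith(".docx") or mime == DOCX_MIME:
        # filename/offset/length locate the record inside a WARC archive (values are strings)
        return {k: fields[k] for k in ("url", "filename", "offset", "length") if k in fields}
    return None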

Phase 2: Scrape (corpus scrape)
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│  Common Crawl  │     │                │     │    Storage     │
│  WARC archives │ ──► │  Downloads     │ ──► │  documents/    │
├────────────────┤     │  Validates     │     │  {hash}.docx   │
│  Cloudflare R2 │     │  Deduplicates  │     ├────────────────┤
│  cdx-filtered/ │ ──► │                │ ──► │   PostgreSQL   │
└────────────────┘     └────────────────┘     │  (metadata)    │
                                              └────────────────┘
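
Because each filtered CDX record carries the WARC filename plus a byte offset and length, the scraper never downloads a whole archive; a single HTTP range request against data.commoncrawl.org returns just the record it needs. A rough Python equivalent of that fetch (the real scraper is a Bun package; warcio is used here purely to illustrate the record parsing):

import io

import requests
from warcio.archiveiterator import ArchiveIterator

def fetch_docx_body(filename: str, offset: int, length: int) -> bytes:
    """Download one WARC record by byte range and return the captured HTTP body."""
    url = f"https://data.commoncrawl.org/{filename}"
    # CDX offsets/lengths arrive as strings; cast them to int before calling this.
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    resp = requests.get(url, headers=headers, timeout=45)
    resp.raise_for_status()
    # The returned range is a self-contained gzip member holding a single WARC record.
    for record in ArchiveIterator(io.BytesIO(resp.content)):
        if record.rec_type == "response":
            return record.content_stream().read()
    raise ValueError("no response record found in the requested range")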

Phase 3: Extract (corpus extract)
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│    Storage     │     │    Docling     │     │    Storage     │
│  documents/    │ ──► │   (Python)     │ ──► │  extracted/    │
│  {hash}.docx   │     │                │     │  {hash}.txt    │
└────────────────┘     │  Extracts text │     ├────────────────┤
                       │  Counts words  │ ──► │   PostgreSQL   │
                       └────────────────┘     │  (word_count)  │
                                              └────────────────┘
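
Phase 3 is essentially a thin wrapper around Docling's document converter: convert the .docx, export the text, count the words. A minimal sketch of that step, assuming Docling's DocumentConverter API (the pipeline stores the result as extracted/{hash}.txt and writes word_count to PostgreSQL; the path below is illustrative):

from docling.document_converter import DocumentConverter

def extract_text(docx_path: str) -> tuple[str, int]:
    """Convert one .docx with Docling and return (text, word_count)."""
    converter = DocumentConverter()
    result = converter.convert(docx_path)
    text = result.document.export_to_markdown()  # structured export; stored as plain text downstream
    return text, len(text.split())

text, word_count = extract_text("./corpus/documents/<hash>.docx")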

Phase 4: Embed (corpus embed)
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│    Storage     │     │ sentence-      │     │   PostgreSQL   │
│  extracted/    │ ──► │ transformers   │ ──► │   (pgvector)   │
│  {hash}.txt    │     │   (Python)     │     │  embedding     │
└────────────────┘     └────────────────┘     └────────────────┘
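
The embedder turns each extracted text into a fixed-size vector that pgvector can index and query. A sketch of what the default minilm path boils down to, assuming the standard sentence-transformers API (all-MiniLM-L6-v2 produces 384-dimensional vectors, matching vector(384) in the schema; the bge-m3 option swaps in BAAI/bge-m3 at 1024 dimensions):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

with open("./corpus/extracted/<hash>.txt") as f:
    texts = [f.read()]

embeddings = model.encode(texts, normalize_embeddings=True)
print(embeddings.shape)  # (1, 384)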

Why Common Crawl?

Common Crawl is a nonprofit that crawls the web monthly and makes the resulting archives freely available:

  • 3+ billion URLs per monthly crawl
  • Petabytes of data going back to 2008
  • Free to access - no API keys needed
  • Reproducible - archived crawls never change

This gives us access to .docx files from across the entire public web.
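
Because crawls are immutable and published under predictable paths, a run can be pinned to a crawl ID such as CC-MAIN-2025-51 and reproduced later. For example, the list of CDX index shards for a crawl is published as a gzipped paths file (a small sketch, assuming the standard Common Crawl layout):

import gzip

import requests

crawl = "CC-MAIN-2025-51"
url = f"https://data.commoncrawl.org/crawl-data/{crawl}/cc-index.paths.gz"
paths = gzip.decompress(requests.get(url, timeout=30).content).decode().splitlines()
print(f"{len(paths)} index shards, first: {paths[0]}")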

Installation

# Clone the repository
git clone https://github.com/superdoc-dev/docx-corpus.git
cd docx-corpus

# Install dependencies
bun install

Project Structure

packages/
  shared/         # Shared utilities (DB client, storage, formatting)
  scraper/        # Core scraper logic (downloads WARC, validates .docx)
  extractor/      # Text extraction using Docling (Python)
  embedder/       # Document embeddings using sentence-transformers (Python)
apps/
  cli/            # Unified CLI - corpus <command>
  cdx-filter/     # AWS Lambda - filters CDX indexes for .docx URLs
  web/            # Landing page - docxcorp.us
db/
  schema.sql      # PostgreSQL schema (with pgvector)
  migrations/     # Database migrations

Apps (entry points)

App          Purpose                        Uses
cli          corpus command                 scraper, extractor, embedder
cdx-filter   Filter CDX indexes (Lambda)    -
web          Landing page                   -

Packages (libraries)

Package      Purpose                              Runtime
shared       DB client, storage, formatting       Bun
scraper      Download and validate .docx files    Bun
extractor    Extract text (Docling)               Bun + Python
embedder     Generate embeddings                  Bun + Python

Usage

1. Run the Lambda to filter CDX indexes

First, deploy and run the Lambda function to filter Common Crawl CDX indexes for .docx files. See apps/cdx-filter/README.md for detailed setup instructions.

cd apps/cdx-filter
./invoke-all.sh CC-MAIN-2025-51

This reads the CDX files directly from Common Crawl's S3 bucket (no rate limits) and stores the filtered records as JSONL in your R2 bucket.
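
R2 speaks the S3 API, so the Lambda output can be inspected with any S3-compatible client pointed at your account endpoint. A quick way to peek at the filtered records (a sketch using boto3; the bucket name and cdx-filtered/ prefix follow the defaults in this README, and the exact key layout under that prefix may differ):

import os

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=f"https://{os.environ['CLOUDFLARE_ACCOUNT_ID']}.r2.cloudflarestorage.com",
    aws_access_key_id=os.environ["R2_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["R2_SECRET_ACCESS_KEY"],
)

resp = s3.list_objects_v2(Bucket="docx-corpus", Prefix="cdx-filtered/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])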

2. Run the scraper

# Scrape from a single crawl
bun run corpus scrape --crawl CC-MAIN-2025-51

# Scrape latest 3 crawls, 100 docs each
bun run corpus scrape --crawl 3 --batch 100

# Scrape from multiple specific crawls
bun run corpus scrape --crawl CC-MAIN-2025-51,CC-MAIN-2025-48 --batch 500

# Re-process URLs already in database
bun run corpus scrape --crawl CC-MAIN-2025-51 --force

# Check progress
bun run corpus status

3. Extract text from documents

# Extract all documents
bun run corpus extract

# Extract with batch limit
bun run corpus extract --batch 100

# Extract with custom workers
bun run corpus extract --batch 50 --workers 8

# Verbose output
bun run corpus extract --verbose

4. Generate embeddings

# Embed all extracted documents (default: minilm, 384 dims)
bun run corpus embed

# Use a different model
bun run corpus embed --model bge-m3      # 1024 dims
bun run corpus embed --model voyage-lite  # requires VOYAGE_API_KEY

# Embed with batch limit
bun run corpus embed --batch 100 --verbose

Note: Vector dimensions are model-specific. The default schema uses vector(384) for minilm. If using a different model, update the column dimension accordingly (e.g., vector(1024) for bge-m3).
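
Once embeddings are stored, pgvector's distance operators make the corpus searchable by semantic similarity. A hedged sketch of such a query (the documents table and the url/embedding column names are assumptions for illustration; see db/schema.sql for the real schema):

import os

import psycopg
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query_vec = model.encode("invoice template with tables").tolist()
vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"  # pgvector's text input format

with psycopg.connect(os.environ["DATABASE_URL"]) as conn:
    rows = conn.execute(
        # <=> is pgvector's cosine distance operator
        "SELECT url FROM documents ORDER BY embedding <=> %s::vector LIMIT 10",
        (vec_literal,),
    ).fetchall()

for (url,) in rows:
    print(url)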

Docker

Run the CLI in a container:

# Build the image
docker build -t docx-corpus .

# Run CLI commands
docker run docx-corpus --help
docker run docx-corpus scrape --help
docker run docx-corpus scrape --crawl CC-MAIN-2025-51 --batch 100

# With environment variables
docker run \
  -e DATABASE_URL=postgres://... \
  -e CLOUDFLARE_ACCOUNT_ID=xxx \
  -e R2_ACCESS_KEY_ID=xxx \
  -e R2_SECRET_ACCESS_KEY=xxx \
  docx-corpus scrape --batch 100

Storage Options

R2 credentials are required to read pre-filtered CDX records from the Lambda output.

Local document storage (default): Downloaded .docx files are saved to ./corpus/documents/

Cloud document storage (Cloudflare R2): Documents can also be uploaded to R2 alongside the CDX records:

export CLOUDFLARE_ACCOUNT_ID=xxx
export R2_ACCESS_KEY_ID=xxx
export R2_SECRET_ACCESS_KEY=xxx
bun run corpus scrape --crawl CC-MAIN-2025-51 --batch 1000

Local Development

Start PostgreSQL with pgvector locally:

docker compose up -d

# Verify
docker exec docx-corpus-postgres-1 psql -U postgres -d docx_corpus -c "\dt"

Run commands against local database:

DATABASE_URL=postgres://postgres:postgres@localhost:5432/docx_corpus \
CLOUDFLARE_ACCOUNT_ID='' \
bun run corpus status

Configuration

All configuration via environment variables (.env):

# Database (required)
DATABASE_URL=postgres://user:pass@host:5432/dbname

# Cloudflare R2 (required for cloud storage)
CLOUDFLARE_ACCOUNT_ID=
R2_ACCESS_KEY_ID=
R2_SECRET_ACCESS_KEY=
R2_BUCKET_NAME=docx-corpus

# Local storage (used when R2 not configured)
STORAGE_PATH=./corpus

# Scraping
CRAWL_ID=CC-MAIN-2025-51
CONCURRENCY=50
RATE_LIMIT_RPS=50
MAX_RPS=100
MIN_RPS=10
TIMEOUT_MS=45000
MAX_RETRIES=10

# Extractor
EXTRACT_INPUT_PREFIX=documents
EXTRACT_OUTPUT_PREFIX=extracted
EXTRACT_BATCH_SIZE=100
EXTRACT_WORKERS=4

# Embedder
EMBED_INPUT_PREFIX=extracted
EMBED_MODEL=minilm           # minilm | bge-m3 | voyage-lite
EMBED_BATCH_SIZE=100
VOYAGE_API_KEY=              # Required for voyage-lite model

Rate Limiting

  • WARC requests: Adaptive rate limiting that adjusts to server load
  • On 503/429 errors: Retries with exponential backoff + jitter, capped at 60s (see the sketch below)
  • On 403 errors: Fails immediately (indicates a 24-hour IP block from Common Crawl)
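
The retry policy amounts to capped exponential backoff with random jitter. A small sketch of the behaviour described above (constants are illustrative; the real values come from MAX_RETRIES, TIMEOUT_MS, and the RPS settings):

import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff capped at 60s, with full jitter to avoid thundering herds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def fetch_with_retries(do_fetch, max_retries: int = 10) -> int:
    for attempt in range(max_retries):
        status = do_fetch()
        if status == 403:             # Common Crawl IP block: retrying only makes it worse
            raise RuntimeError("403 from Common Crawl, back off for ~24h")
        if status not in (429, 503):  # success or a non-retryable error
            return status
        time.sleep(backoff_delay(attempt))
    raise RuntimeError("exhausted retries")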

Corpus Statistics

Metric           Description
Sources          Entire public web via Common Crawl
Deduplication    SHA-256 content hash
Validation       ZIP structure + Word XML verification
Storage          Content-addressed (hash as filename)
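
Since a .docx is just a ZIP container with Word XML parts inside, validation and deduplication both happen before anything is stored: check the container, then name the file after its SHA-256 hash so identical documents collapse to a single object. A sketch of that check (the real logic lives in packages/scraper):

import hashlib
import io
import zipfile

def validate_and_hash(data: bytes) -> str | None:
    """Return the content hash if `data` looks like a real Word document, else None."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        return None
    if "[Content_Types].xml" not in names or "word/document.xml" not in names:
        return None  # a valid ZIP, but not a Word document
    return hashlib.sha256(data).hexdigest()  # becomes documents/{hash}.docx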

Development

# Run linter
bun run lint

# Format code
bun run format

# Type check
bun run typecheck

# Run tests
bun run test

# Build
bun run build

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Takedown Requests

If you find a document in this corpus that you own and would like removed, please email help@docxcorp.us with:

  • The document hash or URL
  • Proof of ownership

We will process requests within 7 days.

License

MIT


Built by 🦋SuperDoc