Make huggingface-hub an optional dependency (extras) #1973

@clemlesne

Problem

tokenizers has a hard dependency on huggingface-hub, which is only used by Tokenizer.from_pretrained(). Users who load tokenizers from local files (Tokenizer.from_file()) pay the cost of 12 MB of unused transitive dependencies:

| Package | Size | Purpose |
| --- | --- | --- |
| huggingface-hub | 2.2 MB | Hub API client |
| hf-xet | 7.2 MB | Xet storage backend |
| fsspec | 0.7 MB | Filesystem abstraction |
| pyyaml | 0.6 MB | YAML parsing |
| requests | 0.2 MB | HTTP client |
| certifi | 0.3 MB | CA certificates |
| charset-normalizer | 0.2 MB | Encoding detection |
| idna | 0.3 MB | Domain name handling |
| tqdm | 0.2 MB | Progress bars |
| filelock | 0.1 MB | File locking |
| Total | 12 MB | |
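
Figures like these can be approximated in any environment by summing the file sizes recorded in each installed distribution's metadata. The following is a minimal sketch using only the standard library. The helper name `installed_size_mb` is hypothetical, 1 MB is taken as 1,000,000 bytes, and results will vary by platform and package version:

```python
from importlib import metadata


def installed_size_mb(dist_name):
    """Return the on-disk size of an installed distribution in MB, or None if absent."""
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None
    total = 0
    # dist.files can be None when the installer did not record a file list.
    for path in dist.files or []:
        try:
            total += path.locate().stat().st_size
        except OSError:
            # Listed in the metadata but missing on disk: skip it.
            continue
    return total / 1_000_000


if __name__ == "__main__":
    for name in ("huggingface-hub", "hf-xet", "fsspec", "pyyaml", "tqdm", "filelock"):
        size = installed_size_mb(name)
        print(name, "not installed" if size is None else f"{size:.1f} MB")
```

A distribution that is not installed yields `None` rather than an error, so the loop can be run before and after removing the hub dependency to compare.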

These packages also increase the attack surface (HTTP client, file system access, YAML parsing) in environments where only local inference is needed.

The Rust crate already makes this optional. In Cargo.toml, hub access is behind the http feature, which is not in the default feature set:

[features]
default = ["progressbar", "onig", "esaxx_fast"]
http = ["hf-hub"]  # NOT in default

The Python bindings have no equivalent switch.

Business case

We maintain a production search library that uses tokenizers for ONNX model tokenization. Our models are pre-exported at build time and bundled as compressed archives — only Tokenizer.from_file() is called at runtime, never from_pretrained().

Our Docker images run on both x86_64 and aarch64 Linux. Every MB matters for:

  • Container pull time across 10+ services
  • Cold start latency (serverless / scale-to-zero)
  • Dependency CVE surface in production

We verified that Tokenizer.from_file() and encoding work with huggingface-hub blocked from import:

import sys
sys.modules['huggingface_hub'] = None  # Block import

from tokenizers import Tokenizer
t = Tokenizer.from_file("tokenizer.json")
enc = t.encode("hello world")
# Works — ids=[101, 7592, 2088, 102]
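
The same check can be wrapped in a reusable helper for a test suite. This is a sketch built on the `sys.modules[name] = None` trick shown above. `block_import` and `import_fails` are hypothetical names, not part of tokenizers:

```python
import contextlib
import sys


@contextlib.contextmanager
def block_import(name):
    """Make `import name` raise ImportError inside the with-block, then restore state."""
    missing = object()
    previous = sys.modules.get(name, missing)
    sys.modules[name] = None  # A None entry makes the import system raise ImportError.
    try:
        yield
    finally:
        if previous is missing:
            del sys.modules[name]
        else:
            sys.modules[name] = previous


def import_fails(name):
    """Return True if importing `name` raises ImportError right now."""
    try:
        __import__(name)
    except ImportError:
        return True
    return False
```

Inside `with block_import("huggingface_hub"):`, a test could call `Tokenizer.from_file(...)` and assert that encoding still succeeds. The helper only affects import statements executed inside the block. Names that other modules bound before the block was entered are not affected.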

Proposed solution

Move huggingface-hub to an optional extra. Lazy-import it only in from_pretrained().

pyproject.toml:

[project]
dependencies = []  # Core: no hub dependency

[project.optional-dependencies]
hub = ["huggingface-hub"]

Python (from_pretrained):

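def from_pretrained(identifier, ...):
    try:
        # Import lazily so the core package has no hard hub dependency.
        from huggingface_hub import hf_hub_download
    except ImportError as exc:
        # Chain the original error so the root cause stays visible.
        raise ImportError(
            "Tokenizer.from_pretrained() requires huggingface-hub. "
            "Install with: pip install tokenizers[hub]"
        ) from exc
    ...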
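
A self-contained, runnable version of this lazy-import pattern might look like the sketch below. It is not the bindings' actual implementation. `require_optional`, the `filename` default, and the `tokenizers[hub]` extra name are assumptions based on this proposal:

```python
import importlib


def require_optional(module_name, feature, extra):
    """Import an optional dependency on demand, or raise an actionable ImportError."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires {module_name}. "
            f"Install with: pip install tokenizers[{extra}]"
        ) from exc


def from_pretrained(identifier, filename="tokenizer.json"):
    """Hypothetical wrapper: the Hub client is imported only when this is called."""
    hub = require_optional("huggingface_hub", "Tokenizer.from_pretrained()", "hub")
    # hf_hub_download returns the local path of the downloaded (or cached) file.
    return hub.hf_hub_download(repo_id=identifier, filename=filename)
```

Keeping the import inside the function means `import tokenizers` and `Tokenizer.from_file()` never touch the hub client, and the error message tells the user exactly which extra to install.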

Backward compatibility:

  • pip install tokenizers: works for from_file(), with a lighter install
  • pip install tokenizers[hub]: works for from_pretrained()
  • pip install transformers: unaffected. transformers already depends on huggingface-hub directly, so from_pretrained() continues to work without any change

This is the same pattern used by httpx (httpx[http2]), uvicorn (uvicorn[standard]), and sqlalchemy (sqlalchemy[postgresql]).
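
Downstream code that must run under both install flavors could detect the extra at runtime instead of catching the error. This is a sketch that assumes the proposed split. `hub_available` and `pick_loader` are hypothetical names, and the returned strings merely name the tokenizers method a caller would use:

```python
import importlib.util


def hub_available():
    """Return True if huggingface_hub can be imported in this environment."""
    return importlib.util.find_spec("huggingface_hub") is not None


def pick_loader(local_path=None):
    """Choose a loading strategy: prefer a local file, fall back to the Hub."""
    if local_path is not None:
        return "from_file"
    if hub_available():
        return "from_pretrained"
    raise RuntimeError(
        "No local tokenizer file given and huggingface-hub is not installed; "
        "install tokenizers[hub] or pass a local path."
    )
```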

Prior art and community context

| Reference | Relevance |
| --- | --- |
| #1845 | AWS Lambda ARM64 deployment failure caused by the huggingface-hub → hf-xet chain |
| PR #1848 | Previous proposal to make hub optional. Maintainer Narsil acknowledged "this dependency is only for a single function" but rejected as breaking. The extras approach here avoids that concern. |
| huggingface-hub #3067 | hf-xet added as hard dep in v0.31.0, adding 200+ MB. Reverted to optional in v0.31.2 after community pushback. Demonstrates the cascading risk of hard deps. |
| huggingface-hub #3073 | AWS Lambda deployment failures from hf-xet size. Same pattern: hard dep causes downstream breakage. |
| BERTopic #786 | Successfully made sentence-transformers (and its PyTorch/tokenizers chain) optional in v0.13. |
| transformers #23666 | HuggingFace's own pip install transformers[torch] extras pattern, which proves they endorse this approach. |
| Rust Cargo.toml | The tokenizers Rust crate already gates hf-hub behind an optional http feature. Python should match. |
