From f607737d898ae5109e9e002dd638a60d2c4e78a7 Mon Sep 17 00:00:00 2001
From: vincentzed <207368749+vincentzed@users.noreply.github.com>
Date: Fri, 9 Jan 2026 17:29:32 -0500
Subject: [PATCH 1/2] add some docs

Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
---
 README.md                      |  2 ++
 crates/decon-py/README.md      | 53 +++++++++++++++++++++++++---------
 crates/decon-py/pyproject.toml | 49 +++++++++++++++++++------------
 doc/python.md                  | 24 +++++++++++++--
 4 files changed, 94 insertions(+), 34 deletions(-)

diff --git a/README.md b/README.md
index d9d9361..34d38f9 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,9 @@ It uses [simple](doc/simple.md) token based sampling and counting methods, makin
 
 Decon can produce contamination reports and cleaned datasets.
 
+> [!NOTE]
 > **🐍 This fork adds Python bindings** — the core Rust functionality is unchanged. Skip to [Python Quick Start](#python) to get started, or see the [Architecture](#architecture) section to understand how bindings are structured. For the full Python API signature, see [`crates/decon-py/src/lib.rs`](crates/decon-py/src/lib.rs).
+> The goal of this fork is to simply expose the API transparently for python users
 
 ## How Decon Works
 
diff --git a/crates/decon-py/README.md b/crates/decon-py/README.md
index fbc197a..724cc29 100644
--- a/crates/decon-py/README.md
+++ b/crates/decon-py/README.md
@@ -1,34 +1,61 @@
-# decontaminate
+# Python Bindings
 
-Fast contamination detection for ML training data. Python bindings for [decon](https://github.com/vincentzed/decon).
+Python bindings for decon via [PyO3](https://pyo3.rs/).
 
 ## Installation
 
+This project is on PyPI: https://pypi.org/project/decontaminate/
+
 ```bash
 pip install decontaminate
 ```
 
-## Usage
+Or, with uv:
+
+```bash
+uv pip install decontaminate
+```
+
+There are no dependencies installed by default.
+Since it is common to load datasets from Python, we recommend using the
+`datasets` library for easy dataset management.
+
+To add it to the same environment, run `pip install datasets`.
+
+> [!IMPORTANT]
+> The PyPI package is `decontaminate`, but the import is `import decon`.
+
+## Quickstart
+
+Here is a common use case:
 
 ```python
 import decon
 
+# Run contamination detection
 config = decon.Config(
-    training_dir="/path/to/training/data",
-    evals_dir="/path/to/eval/references",
-    report_output_dir="/path/to/output",
+    training_dir="path/to/training",
+    evals_dir="path/to/evals",
+    report_output_dir="/tmp/decon-results",
 )
 report_dir = decon.detect(config)
+
+# Tokenizer (same tokenizers used internally)
+tok = decon.Tokenizer("cl100k")
+tokens = tok.encode("hello world") # [15339, 1917]
+
+# Text normalization (same as internal preprocessing)
+cleaned = decon.clean_text("Hello, World!") # "hello world"
 ```
 
-## API
+We strive to keep parity with the Rust API; if you notice any gaps or differences in behavior, please report them.
 
-The Python API is a thin PyO3 wrapper over the Rust implementation. See [`src/lib.rs`](https://github.com/vincentzed/decon/blob/main/crates/decon-py/src/lib.rs) for all `Config` parameters and available functions:
+## API Reference
 
-- `detect()`, `review()`, `compare()`, `evals()`, `server()`
-- `Tokenizer` (encode/decode with cl100k, o200k, etc.)
-- `clean_text()` (text normalization)
+The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs).
 
-## Documentation
+Please refer to these sections of `lib.rs` for the full API details:
+- **`PyConfig`** (line ~230): All `Config` parameters with defaults in the `#[pyo3(signature = ...)]` block
+- **`PyTokenizer`** (line ~740): Tokenizer with `encode()`, `decode()`, `is_space_token()`
+- **Functions** (line ~830+): `detect()`, `clean_text()`, `review()`, `compare()`, `evals()`, `server()`
 
-Full documentation: https://github.com/vincentzed/decon
+The Rust parameter names map directly to Python kwargs, so they are easy to recognize (e.g., `ngram_size` in Rust is `ngram_size=` in Python).
diff --git a/crates/decon-py/pyproject.toml b/crates/decon-py/pyproject.toml
index 2b1f98c..1202a1a 100644
--- a/crates/decon-py/pyproject.toml
+++ b/crates/decon-py/pyproject.toml
@@ -12,32 +12,45 @@ license-files = ["LICENSE"]
 requires-python = ">=3.12"
 authors = [
     { name = "Allen Institute for AI", email = "decon@allenai.org" },
+    { name = "vincentzed", email = "vincent.zhong@uwaterloo.ca" },
 ]
 maintainers = [
-    { name = "Vincent Zed" },
+    { name = "vincentzed" },
 ]
 keywords = ["machine-learning", "contamination", "detection", "llm", "evaluation", "decontamination", "benchmark", "data-quality"]
 classifiers = [
-    "Development Status :: 4 - Beta",
-    "Intended Audience :: Developers",
-    "Intended Audience :: Science/Research",
-    "Programming Language :: Python :: 3",
-    "Programming Language :: Python :: 3.12",
-    "Programming Language :: Python :: 3.13",
-    "Programming Language :: Python :: 3.14",
-    "Programming Language :: Python :: Implementation :: CPython",
-    "Programming Language :: Rust",
-    "License :: OSI Approved :: Apache Software License",
-    "Operating System :: POSIX :: Linux",
-    "Operating System :: MacOS",
-    "Operating System :: Microsoft :: Windows",
-    "Topic :: Scientific/Engineering :: Artificial Intelligence",
-    "Topic :: Software Development :: Libraries :: Python Modules",
-    "Typing :: Typed",
+    "Development Status :: 4 - Beta",
+    "Intended Audience :: Developers",
+    "Intended Audience :: Science/Research",
+
+    "License :: OSI Approved :: Apache Software License",
+
+    "Operating System :: POSIX :: Linux",
+    "Operating System :: MacOS",
+    "Operating System :: Microsoft :: Windows",
+
+    "Programming Language :: Python",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3 :: Only",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13",
+    "Programming Language :: Python :: 3.14",
+    "Programming Language :: Python :: Implementation :: CPython",
+    "Programming Language :: Rust",
+
+    "Topic :: Scientific/Engineering :: Artificial Intelligence",
+    "Topic :: Scientific/Engineering :: Information Analysis",
+    "Topic :: Text Processing",
+    "Topic :: Text Processing :: Indexing",
+    "Topic :: Software Development :: Libraries :: Python Modules",
+    "Topic :: Utilities",
+
+    "Typing :: Typed",
 ]
 
 [project.urls]
 Homepage = "https://github.com/allenai/decon"
 Repository = "https://github.com/vincentzed/decon"
 Documentation = "https://github.com/vincentzed/decon/blob/main/doc/python.md"
 Issues = "https://github.com/vincentzed/decon/issues"
diff --git a/doc/python.md b/doc/python.md
index 4a62f9c..724cc29 100644
--- a/doc/python.md
+++ b/doc/python.md
@@ -4,13 +4,29 @@ Python bindings for decon via [PyO3](https://pyo3.rs/).
 
 ## Installation
 
+This project is on PyPI: https://pypi.org/project/decontaminate/
+
 ```bash
 pip install decontaminate
 ```
 
+Or, with uv:
+
+```bash
+uv pip install decontaminate
+```
+
+There are no dependencies installed by default. Since it is common to load datasets from Python, we recommend using the
+`datasets` library for easy dataset management.
+
+To add it to the same environment, run `pip install datasets`.
+
+> [!IMPORTANT]
 > The PyPI package is `decontaminate`, but the import is `import decon`.
-## Quick Example
+## Quickstart
+
+Here is a common use case:
 
 ```python
 import decon
@@ -31,13 +47,15 @@ tokens = tok.encode("hello world") # [15339, 1917]
 cleaned = decon.clean_text("Hello, World!") # "hello world"
 ```
 
+We strive to keep parity with the Rust API; if you notice any gaps or differences in behavior, please report them.
+
 ## API Reference
 
 The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs).
 
-Key sections in `lib.rs`:
+Please refer to these sections of `lib.rs` for the full API details:
 - **`PyConfig`** (line ~230): All `Config` parameters with defaults in the `#[pyo3(signature = ...)]` block
 - **`PyTokenizer`** (line ~740): Tokenizer with `encode()`, `decode()`, `is_space_token()`
 - **Functions** (line ~830+): `detect()`, `clean_text()`, `review()`, `compare()`, `evals()`, `server()`
 
-The Rust parameter names map directly to Python kwargs (e.g., `ngram_size` in Rust = `ngram_size=` in Python).
+The Rust parameter names map directly to Python kwargs, so they are easy to recognize (e.g., `ngram_size` in Rust is `ngram_size=` in Python).

From 5631890fa90fb1122ed16f619e5dc480d8e017cb Mon Sep 17 00:00:00 2001
From: vincentzed <207368749+vincentzed@users.noreply.github.com>
Date: Fri, 9 Jan 2026 17:32:08 -0500
Subject: [PATCH 2/2] more

Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
---
 README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 34d38f9..18df9f9 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,9 @@ Decon can produce contamination reports and cleaned datasets.
 
 > [!NOTE]
 > **🐍 This fork adds Python bindings** — the core Rust functionality is unchanged. Skip to [Python Quick Start](#python) to get started, or see the [Architecture](#architecture) section to understand how bindings are structured. For the full Python API signature, see [`crates/decon-py/src/lib.rs`](crates/decon-py/src/lib.rs).
-> The goal of this fork is to simply expose the API transparently for python users
+> The goal of this fork is simply to expose the API transparently for Python users.
+> Note that the PyPI package is named `decontaminate`, while the import is `import decon`.
+> To get started, install it with `pip install decontaminate`.
 
 ## How Decon Works
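The docs added above show that `decon.clean_text("Hello, World!")` returns `"hello world"`. For readers who want intuition for that normalization without installing the package, here is a rough pure-Python sketch. This is hypothetical illustration only: the real implementation lives in the Rust crate and may differ (e.g., in Unicode or token handling).

```python
import re
import string


def clean_text_sketch(text: str) -> str:
    """Rough, hypothetical approximation of decon.clean_text():
    lowercase, strip punctuation, collapse whitespace."""
    text = text.lower()
    # Drop ASCII punctuation characters entirely.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text_sketch("Hello, World!"))  # hello world
```

This reproduces the documented example output, but for real use defer to `decon.clean_text` so training data and eval references are normalized identically to the Rust pipeline.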