4 changes: 4 additions & 0 deletions README.md
@@ -6,7 +6,11 @@ It uses [simple](doc/simple.md) token based sampling and counting methods, makin

Decon can produce contamination reports and cleaned datasets.

> [!NOTE]
> **🐍 This fork adds Python bindings** — the core Rust functionality is unchanged. Skip to [Python Quick Start](#python) to get started, or see the [Architecture](#architecture) section to understand how bindings are structured. For the full Python API signature, see [`crates/decon-py/src/lib.rs`](crates/decon-py/src/lib.rs).
> The goal of this fork is to simply expose the API transparently for python users.

@cubic-dev-ai cubic-dev-ai Bot Jan 9, 2026


P3: Capitalize 'python' to 'Python' to maintain consistency with Python naming conventions and the rest of the document.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At README.md, line 11:

<comment>Capitalize 'python' to 'Python' to maintain consistency with Python naming conventions and the rest of the document.</comment>

<file context>
@@ -6,7 +6,11 @@ It uses [simple](doc/simple.md) token based sampling and counting methods, makin
 
+> [!NOTE]
 > **🐍 This fork adds Python bindings** — the core Rust functionality is unchanged. Skip to [Python Quick Start](#python) to get started, or see the [Architecture](#architecture) section to understand how bindings are structured. For the full Python API signature, see [`crates/decon-py/src/lib.rs`](crates/decon-py/src/lib.rs).
+> The goal of this fork is to simply expose the API transparently for python users.
+> Please note, the package currently used by `decon` is `decontaminate`, import is `import decon`.
+> Use `pip install decontaminate` to install the package to get started.
</file context>
Suggested change
> The goal of this fork is to simply expose the API transparently for python users.
> The goal of this fork is to simply expose the API transparently for Python users.

> Please note, the package currently used by `decon` is `decontaminate`, import is `import decon`.
> Use `pip install decontaminate` to install the package to get started.
Comment on lines +11 to +13

medium

This is a helpful note for new users. The wording could be made clearer and more concise. The installation command is also mentioned in the 'Python Quick Start' section, so it might be redundant here. Consider consolidating this information for better readability.

Suggested change
> The goal of this fork is to simply expose the API transparently for python users.
> Please note, the package currently used by `decon` is `decontaminate`, import is `import decon`.
> Use `pip install decontaminate` to install the package to get started.
> The goal of this fork is to provide transparent Python bindings for the core Rust API.
>
> Please note that the package is installed from PyPI as `decontaminate` but imported into Python as `decon`.


## How Decon Works

53 changes: 40 additions & 13 deletions crates/decon-py/README.md
@@ -1,34 +1,61 @@
# decontaminate
# Python Bindings

Fast contamination detection for ML training data. Python bindings for [decon](https://github.com/vincentzed/decon).
Python bindings for decon via [PyO3](https://pyo3.rs/).

## Installation

This project is on PyPI: https://pypi.org/project/decontaminate/

```bash
pip install decontaminate
```

## Usage
Or,

```bash
uv pip install decontaminate
```

There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using

@cubic-dev-ai cubic-dev-ai Bot Jan 9, 2026


P3: Spelling error: 'recomend' should be 'recommend'

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/decon-py/README.md, line 19:

<comment>Spelling error: 'recomend' should be 'recommend'</comment>

<file context>
@@ -1,34 +1,61 @@
+uv pip install decontaminate
+```
+
+There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using
+`datasets` for easy management: 
+
</file context>
Suggested change
There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using
There are no dependencies installed by default. Since it is common to load dataset from python, we recommend using

`datasets` for easy management:

So in the same environment you can do `pip install datasets`.
Comment on lines +19 to +22

medium

This section has a typo ('recomend' should be 'recommend') and the phrasing is a bit conversational and split across lines. I suggest rephrasing for clarity and correcting the typo.

Suggested change
There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using
`datasets` for easy management:
So in the same environment you can do `pip install datasets`.
There are no dependencies installed by default. Since it is common to load datasets from Python, we recommend using the `datasets` library for easy management.
In the same environment, you can run:
`pip install datasets`


> [!IMPORTANT]
> The PyPI package is `decontaminate`, but the import is `import decon`.

## Quickstart

Here is a common use case:

```python
import decon

# Run contamination detection
config = decon.Config(
training_dir="/path/to/training/data",
evals_dir="/path/to/eval/references",
report_output_dir="/path/to/output",
training_dir="path/to/training",
evals_dir="path/to/evals",
report_output_dir="/tmp/decon-results",
)
report_dir = decon.detect(config)

# Tokenizer (same tokenizers used internally)
tok = decon.Tokenizer("cl100k")
tokens = tok.encode("hello world") # [15339, 1917]

# Text normalization (same as internal preprocessing)
cleaned = decon.clean_text("Hello, World!") # "hello world"
```

## API
We strive to keep parity with Rust API, if there are any issues with loss of quality, please help report it.

medium

This sentence could be phrased more formally and clearly. 'Loss of quality' is a bit vague, and 'please help report it' is informal.

Suggested change
We strive to keep parity with Rust API, if there are any issues with loss of quality, please help report it.
We strive to maintain parity with the Rust API. If you encounter any discrepancies or issues, please report them.


The Python API is a thin PyO3 wrapper over the Rust implementation. See [`src/lib.rs`](https://github.com/vincentzed/decon/blob/main/crates/decon-py/src/lib.rs) for all `Config` parameters and available functions:
## API Reference

- `detect()`, `review()`, `compare()`, `evals()`, `server()`
- `Tokenizer` (encode/decode with cl100k, o200k, etc.)
- `clean_text()` (text normalization)
The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs).

@cubic-dev-ai cubic-dev-ai Bot Jan 9, 2026


P1: Incorrect relative path: ../crates/decon-py/src/lib.rs should be src/lib.rs

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/decon-py/README.md, line 54:

<comment>Incorrect relative path: `../crates/decon-py/src/lib.rs` should be `src/lib.rs`</comment>

<file context>
@@ -1,34 +1,61 @@
-- `detect()`, `review()`, `compare()`, `evals()`, `server()`
-- `Tokenizer` (encode/decode with cl100k, o200k, etc.)
-- `clean_text()` (text normalization)
+The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs).
 
-## Documentation
</file context>
Suggested change
The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs).
The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](src/lib.rs).
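
To see why the link target matters, the relative path can be resolved against each file's location. This is a minimal sketch, assuming relative Markdown links resolve against the directory of the file that contains them. `resolve_link` is a hypothetical helper, not part of the repository.

```python
import posixpath


def resolve_link(doc_path: str, link: str) -> str:
    """Resolve a relative link against the directory containing doc_path."""
    base_dir = posixpath.dirname(doc_path)
    return posixpath.normpath(posixpath.join(base_dir, link))


# The link as written in crates/decon-py/README.md points at a non-existent path.
broken = resolve_link("crates/decon-py/README.md", "../crates/decon-py/src/lib.rs")
# The suggested link resolves to the intended file.
fixed = resolve_link("crates/decon-py/README.md", "src/lib.rs")
# The same "../crates/..." link is correct when used from doc/python.md.
from_doc = resolve_link("doc/python.md", "../crates/decon-py/src/lib.rs")

print(broken)    # crates/crates/decon-py/src/lib.rs
print(fixed)     # crates/decon-py/src/lib.rs
print(from_doc)  # crates/decon-py/src/lib.rs
```

The `../crates/...` form is valid from `doc/python.md`, so the broken link in `crates/decon-py/README.md` is presumably a copy of that text without the path being adjusted.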


## Documentation
Please refer to these sections for the full detail of API. `lib.rs`:

medium

This sentence could be slightly rephrased for better flow before the list of items from lib.rs.

Suggested change
Please refer to these sections for the full detail of API. `lib.rs`:
For full API details, please refer to the following sections in `lib.rs`:


@cubic-dev-ai cubic-dev-ai Bot Jan 9, 2026


P3: Grammar issue: 'full detail of API' should be 'full details of the API'

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/decon-py/README.md, line 56:

<comment>Grammar issue: 'full detail of API' should be 'full details of the API'</comment>

<file context>
@@ -1,34 +1,61 @@
+The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs).
 
-## Documentation
+Please refer to these sections for the full detail of API. `lib.rs`:
+- **`PyConfig`** (line ~230): All `Config` parameters with defaults in the `#[pyo3(signature = ...)]` block
+- **`PyTokenizer`** (line ~740): Tokenizer with `encode()`, `decode()`, `is_space_token()`
</file context>
Suggested change
Please refer to these sections for the full detail of API. `lib.rs`:
Please refer to these sections for the full details of the API. `lib.rs`:

- **`PyConfig`** (line ~230): All `Config` parameters with defaults in the `#[pyo3(signature = ...)]` block
- **`PyTokenizer`** (line ~740): Tokenizer with `encode()`, `decode()`, `is_space_token()`
- **Functions** (line ~830+): `detect()`, `clean_text()`, `review()`, `compare()`, `evals()`, `server()`

Full documentation: https://github.com/vincentzed/decon
The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable. (e.g., `ngram_size` in Rust = `ngram_size=` in Python).

medium

The example ngram_size= could be confusing as it's incomplete. It would be clearer to show it as a keyword argument.

Suggested change
The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable. (e.g., `ngram_size` in Rust = `ngram_size=` in Python).
The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable (e.g., `ngram_size` in Rust becomes the `ngram_size` keyword argument in Python).

49 changes: 31 additions & 18 deletions crates/decon-py/pyproject.toml
@@ -12,32 +12,45 @@ license-files = ["LICENSE"]
requires-python = ">=3.12"
authors = [
{ name = "Allen Institute for AI", email = "decon@allenai.org" },
{ name = "vincentzed", email = "vincent.zhong@uwaterloo.ca" },
]
maintainers = [
{ name = "Vincent Zed" },
{ name = "vincentzed" },
]
keywords = ["machine-learning", "contamination", "detection", "llm", "evaluation", "decontamination", "benchmark", "data-quality"]
classifiers = [
"Development Status :: 4 - Beta",
"Intended Audience :: Developers",
"Intended Audience :: Science/Research",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
"Programming Language :: Python :: 3.14",
"Programming Language :: Python :: Implementation :: CPython",
"Programming Language :: Rust",
"License :: OSI Approved :: Apache Software License",
"Operating System :: POSIX :: Linux",
"Operating System :: MacOS",
"Operating System :: Microsoft :: Windows",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
"Topic :: Software Development :: Libraries :: Python Modules",
"Typing :: Typed",
"Development Status :: 4 - Beta",
"Intended Audience :: Developers",
"Intended Audience :: Science/Research",

"License :: OSI Approved :: Apache Software License",

"Operating System :: POSIX :: Linux",
"Operating System :: MacOS",
"Operating System :: Microsoft :: Windows",

"Programming Language :: Python",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3 :: Only",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
# Keep 3.14 only if you actually CI-test it
"Programming Language :: Python :: 3.14",
"Programming Language :: Python :: Implementation :: CPython",
"Programming Language :: Rust",

"Topic :: Scientific/Engineering :: Artificial Intelligence",
"Topic :: Scientific/Engineering :: Information Analysis",
"Topic :: Text Processing",
"Topic :: Text Processing :: Indexing",
"Topic :: Software Development :: Libraries :: Python Modules",
"Topic :: Utilities",

"Typing :: Typed",
]

[project.urls]
Homepage = "https://github.com/allenai/decon"
    Homepage = "https://github.com/allenai/decon"

@cubic-dev-ai cubic-dev-ai Bot Jan 9, 2026


P2: Inconsistent indentation: Homepage has 4 leading spaces while other keys in [project.urls] have none. Remove the leading spaces for consistency.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/decon-py/pyproject.toml, line 53:

<comment>Inconsistent indentation: Homepage has 4 leading spaces while other keys in [project.urls] have none. Remove the leading spaces for consistency.</comment>

<file context>
@@ -12,32 +12,45 @@ license-files = ["LICENSE"]
 
 [project.urls]
-Homepage = "https://github.com/allenai/decon"
+    Homepage = "https://github.com/allenai/decon"
 Repository = "https://github.com/vincentzed/decon"
 Documentation = "https://github.com/vincentzed/decon/blob/main/doc/python.md"
</file context>
Suggested change
    Homepage = "https://github.com/allenai/decon"
Homepage = "https://github.com/allenai/decon"

Repository = "https://github.com/vincentzed/decon"
Documentation = "https://github.com/vincentzed/decon/blob/main/doc/python.md"
Issues = "https://github.com/vincentzed/decon/issues"
24 changes: 21 additions & 3 deletions doc/python.md
@@ -4,13 +4,29 @@ Python bindings for decon via [PyO3](https://pyo3.rs/).

## Installation

This project is on PyPI: https://pypi.org/project/decontaminate/

```bash
pip install decontaminate
```

Or,

```bash
uv pip install decontaminate
```

There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using

@cubic-dev-ai cubic-dev-ai Bot Jan 9, 2026


P2: Spelling error: 'recomend' should be 'recommend'

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At doc/python.md, line 19:

<comment>Spelling error: 'recomend' should be 'recommend'</comment>

<file context>
@@ -4,13 +4,29 @@ Python bindings for decon via [PyO3](https://pyo3.rs/).
+uv pip install decontaminate
+```
+
+There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using
+`datasets` for easy management: 
+
</file context>
Suggested change
There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using
There are no dependencies installed by default. Since it is common to load datasets from python, we recommend using

`datasets` for easy management:

So in the same environment you can do `pip install datasets`.
Comment on lines +19 to +22

medium

This section has a typo ('recomend' should be 'recommend') and the phrasing is a bit conversational and split across lines. I suggest rephrasing for clarity and correcting the typo.

Suggested change
There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using
`datasets` for easy management:
So in the same environment you can do `pip install datasets`.
There are no dependencies installed by default. Since it is common to load datasets from Python, we recommend using the `datasets` library for easy management.
In the same environment, you can run:
`pip install datasets`


> [!IMPORTANT]
> The PyPI package is `decontaminate`, but the import is `import decon`.

## Quick Example
## Quickstart

Here is a common use case:

```python
import decon
@@ -31,13 +47,15 @@ tokens = tok.encode("hello world") # [15339, 1917]
cleaned = decon.clean_text("Hello, World!") # "hello world"
```

We strive to keep parity with Rust API, if there are any issues with loss of quality, please help report it.

medium

This sentence could be phrased more formally and clearly. 'Loss of quality' is a bit vague, and 'please help report it' is informal.

Suggested change
We strive to keep parity with Rust API, if there are any issues with loss of quality, please help report it.
We strive to maintain parity with the Rust API. If you encounter any discrepancies or issues, please report them.


## API Reference

The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs).

Key sections in `lib.rs`:
Please refer to these sections for the full detail of API. `lib.rs`:

medium

This sentence could be slightly rephrased for better flow before the list of items from lib.rs.

Suggested change
Please refer to these sections for the full detail of API. `lib.rs`:
For full API details, please refer to the following sections in `lib.rs`:

- **`PyConfig`** (line ~230): All `Config` parameters with defaults in the `#[pyo3(signature = ...)]` block
- **`PyTokenizer`** (line ~740): Tokenizer with `encode()`, `decode()`, `is_space_token()`
- **Functions** (line ~830+): `detect()`, `clean_text()`, `review()`, `compare()`, `evals()`, `server()`

The Rust parameter names map directly to Python kwargs (e.g., `ngram_size` in Rust = `ngram_size=` in Python).
The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable. (e.g., `ngram_size` in Rust = `ngram_size=` in Python).

medium

The example ngram_size= could be confusing as it's incomplete. It would be clearer to show it as a keyword argument.

Suggested change
The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable. (e.g., `ngram_size` in Rust = `ngram_size=` in Python).
The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable (e.g., `ngram_size` in Rust becomes the `ngram_size` keyword argument in Python).