small docs fix. #1
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -6,7 +6,11 @@ It uses [simple](doc/simple.md) token based sampling and counting methods, makin | |||||||||||||
|
|
||||||||||||||
| Decon can produce contamination reports and cleaned datasets. | ||||||||||||||
|
|
||||||||||||||
| > [!NOTE] | ||||||||||||||
| > **🐍 This fork adds Python bindings** — the core Rust functionality is unchanged. Skip to [Python Quick Start](#python) to get started, or see the [Architecture](#architecture) section to understand how bindings are structured. For the full Python API signature, see [`crates/decon-py/src/lib.rs`](crates/decon-py/src/lib.rs). | ||||||||||||||
| > The goal of this fork is to simply expose the API transparently for python users. | ||||||||||||||
| > Please note, the package currently used by `decon` is `decontaminate`, import is `import decon`. | ||||||||||||||
| > Use `pip install decontaminate` to install the package to get started. | ||||||||||||||
|
Comment on lines +11 to +13
This is a helpful note for new users. The wording could be made clearer and more concise. The installation command is also mentioned in the 'Python Quick Start' section, so it might be redundant here. Consider consolidating this information for better readability.
Suggested change
|
||||||||||||||
|
|
||||||||||||||
| ## How Decon Works | ||||||||||||||
|
|
||||||||||||||
|
|
||||||||||||||
| Original file line number | Diff line number | Diff line change | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -1,34 +1,61 @@ | ||||||||||||||||||
| # decontaminate | ||||||||||||||||||
| # Python Bindings | ||||||||||||||||||
|
|
||||||||||||||||||
| Fast contamination detection for ML training data. Python bindings for [decon](https://github.com/vincentzed/decon). | ||||||||||||||||||
| Python bindings for decon via [PyO3](https://pyo3.rs/). | ||||||||||||||||||
|
|
||||||||||||||||||
| ## Installation | ||||||||||||||||||
|
|
||||||||||||||||||
| This project is on PyPI: https://pypi.org/project/decontaminate/ | ||||||||||||||||||
|
|
||||||||||||||||||
| ```bash | ||||||||||||||||||
| pip install decontaminate | ||||||||||||||||||
| ``` | ||||||||||||||||||
|
|
||||||||||||||||||
| ## Usage | ||||||||||||||||||
| Or, | ||||||||||||||||||
|
|
||||||||||||||||||
| ```bash | ||||||||||||||||||
| uv pip install decontaminate | ||||||||||||||||||
| ``` | ||||||||||||||||||
|
|
||||||||||||||||||
| There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using | ||||||||||||||||||
|
P3: Spelling error: 'recomend' should be 'recommend'
Suggested change
|
||||||||||||||||||
| `datasets` for easy management: | ||||||||||||||||||
|
|
||||||||||||||||||
| So in the same environment you can do `pip install datasets`. | ||||||||||||||||||
|
Comment on lines +19 to +22
This section has a typo ('recomend' should be 'recommend') and the phrasing is a bit conversational and split across lines. I suggest rephrasing for clarity and correcting the typo.
Suggested change
|
||||||||||||||||||
|
|
||||||||||||||||||
| > [!IMPORTANT] | ||||||||||||||||||
| > The PyPI package is `decontaminate`, but the import is `import decon`. | ||||||||||||||||||
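Because the install name and the import name differ, a quick check of which one is importable can help when debugging a fresh environment. This is a minimal sketch using only the standard library; the helper name `is_importable` is hypothetical, and what it reports depends on what is installed locally.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be found as a top-level module."""
    return importlib.util.find_spec(module_name) is not None


# The distribution is installed as `decontaminate`, but the module is `decon`,
# so the import name is the one worth checking.
for name in ("decon", "decontaminate"):
    status = "importable" if is_importable(name) else "not importable"
    print(f"{name}: {status}")
```

If `decon` is reported as not importable after installation, the package was likely installed into a different environment than the one running the check.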
|
|
||||||||||||||||||
| ## Quickstart | ||||||||||||||||||
|
|
||||||||||||||||||
| Here is a common use case: | ||||||||||||||||||
|
|
||||||||||||||||||
| ```python | ||||||||||||||||||
| import decon | ||||||||||||||||||
|
|
||||||||||||||||||
| # Run contamination detection | ||||||||||||||||||
| config = decon.Config( | ||||||||||||||||||
| training_dir="/path/to/training/data", | ||||||||||||||||||
| evals_dir="/path/to/eval/references", | ||||||||||||||||||
| report_output_dir="/path/to/output", | ||||||||||||||||||
| training_dir="path/to/training", | ||||||||||||||||||
| evals_dir="path/to/evals", | ||||||||||||||||||
| report_output_dir="/tmp/decon-results", | ||||||||||||||||||
| ) | ||||||||||||||||||
| report_dir = decon.detect(config) | ||||||||||||||||||
|
|
||||||||||||||||||
| # Tokenizer (same tokenizers used internally) | ||||||||||||||||||
| tok = decon.Tokenizer("cl100k") | ||||||||||||||||||
| tokens = tok.encode("hello world") # [15339, 1917] | ||||||||||||||||||
|
|
||||||||||||||||||
| # Text normalization (same as internal preprocessing) | ||||||||||||||||||
| cleaned = decon.clean_text("Hello, World!") # "hello world" | ||||||||||||||||||
| ``` | ||||||||||||||||||
|
|
||||||||||||||||||
| ## API | ||||||||||||||||||
| We strive to keep parity with Rust API, if there are any issues with loss of quality, please help report it. | ||||||||||||||||||
|
This sentence could be phrased more formally and clearly. 'Loss of quality' is a bit vague, and 'please help report it' is informal.
Suggested change
|
||||||||||||||||||
|
|
||||||||||||||||||
| The Python API is a thin PyO3 wrapper over the Rust implementation. See [`src/lib.rs`](https://github.com/vincentzed/decon/blob/main/crates/decon-py/src/lib.rs) for all `Config` parameters and available functions: | ||||||||||||||||||
| ## API Reference | ||||||||||||||||||
|
|
||||||||||||||||||
| - `detect()`, `review()`, `compare()`, `evals()`, `server()` | ||||||||||||||||||
| - `Tokenizer` (encode/decode with cl100k, o200k, etc.) | ||||||||||||||||||
| - `clean_text()` (text normalization) | ||||||||||||||||||
| The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs). | ||||||||||||||||||
|
P1: Incorrect relative path:
Suggested change
|
||||||||||||||||||
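The relative-path comment above is cut off in this view, so the following is only an illustration of how a `../` link resolves differently depending on where the containing file lives. The two file locations are assumptions (the diff as shown does not name the files), and `resolve_link` is a hypothetical helper.

```python
import posixpath


def resolve_link(source_file: str, link: str) -> str:
    """Resolve a relative Markdown link against the file that contains it."""
    base_dir = posixpath.dirname(source_file)
    return posixpath.normpath(posixpath.join(base_dir, link))


link = "../crates/decon-py/src/lib.rs"

# Assumed locations, chosen for illustration only.
for source in ("doc/python.md", "crates/decon-py/README.md"):
    print(source, "->", resolve_link(source, link))
```

Under these assumptions, the link resolves to `crates/decon-py/src/lib.rs` from `doc/python.md`, but to `crates/crates/decon-py/src/lib.rs` from a README inside `crates/decon-py/`, which would be a broken link.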
|
|
||||||||||||||||||
| ## Documentation | ||||||||||||||||||
| Please refer to these sections for the full detail of API. `lib.rs`: | ||||||||||||||||||
|
P3: Grammar issue: 'full detail of API' should be 'full details of the API'
Suggested change
|
||||||||||||||||||
| - **`PyConfig`** (line ~230): All `Config` parameters with defaults in the `#[pyo3(signature = ...)]` block | ||||||||||||||||||
| - **`PyTokenizer`** (line ~740): Tokenizer with `encode()`, `decode()`, `is_space_token()` | ||||||||||||||||||
| - **Functions** (line ~830+): `detect()`, `clean_text()`, `review()`, `compare()`, `evals()`, `server()` | ||||||||||||||||||
|
|
||||||||||||||||||
| Full documentation: https://github.com/vincentzed/decon | ||||||||||||||||||
| The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable. (e.g., `ngram_size` in Rust = `ngram_size=` in Python). | ||||||||||||||||||
|
The example
Suggested change
|
||||||||||||||||||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -12,32 +12,45 @@ license-files = ["LICENSE"] | |||||
| requires-python = ">=3.12" | ||||||
| authors = [ | ||||||
| { name = "Allen Institute for AI", email = "decon@allenai.org" }, | ||||||
| { name = "vincentzed", email = "vincent.zhong@uwaterloo.ca" }, | ||||||
| ] | ||||||
| maintainers = [ | ||||||
| { name = "Vincent Zed" }, | ||||||
| { name = "vincentzed" }, | ||||||
| ] | ||||||
| keywords = ["machine-learning", "contamination", "detection", "llm", "evaluation", "decontamination", "benchmark", "data-quality"] | ||||||
| classifiers = [ | ||||||
| "Development Status :: 4 - Beta", | ||||||
| "Intended Audience :: Developers", | ||||||
| "Intended Audience :: Science/Research", | ||||||
| "Programming Language :: Python :: 3", | ||||||
| "Programming Language :: Python :: 3.12", | ||||||
| "Programming Language :: Python :: 3.13", | ||||||
| "Programming Language :: Python :: 3.14", | ||||||
| "Programming Language :: Python :: Implementation :: CPython", | ||||||
| "Programming Language :: Rust", | ||||||
| "License :: OSI Approved :: Apache Software License", | ||||||
| "Operating System :: POSIX :: Linux", | ||||||
| "Operating System :: MacOS", | ||||||
| "Operating System :: Microsoft :: Windows", | ||||||
| "Topic :: Scientific/Engineering :: Artificial Intelligence", | ||||||
| "Topic :: Software Development :: Libraries :: Python Modules", | ||||||
| "Typing :: Typed", | ||||||
| "Development Status :: 4 - Beta", | ||||||
| "Intended Audience :: Developers", | ||||||
| "Intended Audience :: Science/Research", | ||||||
|
|
||||||
| "License :: OSI Approved :: Apache Software License", | ||||||
|
|
||||||
| "Operating System :: POSIX :: Linux", | ||||||
| "Operating System :: MacOS", | ||||||
| "Operating System :: Microsoft :: Windows", | ||||||
|
|
||||||
| "Programming Language :: Python", | ||||||
| "Programming Language :: Python :: 3", | ||||||
| "Programming Language :: Python :: 3 :: Only", | ||||||
| "Programming Language :: Python :: 3.12", | ||||||
| "Programming Language :: Python :: 3.13", | ||||||
| # Keep 3.14 only if you actually CI-test it | ||||||
| "Programming Language :: Python :: 3.14", | ||||||
| "Programming Language :: Python :: Implementation :: CPython", | ||||||
| "Programming Language :: Rust", | ||||||
|
|
||||||
| "Topic :: Scientific/Engineering :: Artificial Intelligence", | ||||||
| "Topic :: Scientific/Engineering :: Information Analysis", | ||||||
| "Topic :: Text Processing", | ||||||
| "Topic :: Text Processing :: Indexing", | ||||||
| "Topic :: Software Development :: Libraries :: Python Modules", | ||||||
| "Topic :: Utilities", | ||||||
|
|
||||||
| "Typing :: Typed", | ||||||
| ] | ||||||
|
|
||||||
| [project.urls] | ||||||
| Homepage = "https://github.com/allenai/decon" | ||||||
| Homepage = "https://github.com/allenai/decon" | ||||||
|
P2: Inconsistent indentation: Homepage has 4 leading spaces while other keys in [project.urls] have none. Remove the leading spaces for consistency.
Suggested change
|
||||||
| Repository = "https://github.com/vincentzed/decon" | ||||||
| Documentation = "https://github.com/vincentzed/decon/blob/main/doc/python.md" | ||||||
| Issues = "https://github.com/vincentzed/decon/issues" | ||||||
|
|
||||||
| Original file line number | Diff line number | Diff line change | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -4,13 +4,29 @@ Python bindings for decon via [PyO3](https://pyo3.rs/). | |||||||||||||||||
|
|
||||||||||||||||||
| ## Installation | ||||||||||||||||||
|
|
||||||||||||||||||
| This project is on PyPI: https://pypi.org/project/decontaminate/ | ||||||||||||||||||
|
|
||||||||||||||||||
| ```bash | ||||||||||||||||||
| pip install decontaminate | ||||||||||||||||||
| ``` | ||||||||||||||||||
|
|
||||||||||||||||||
| Or, | ||||||||||||||||||
|
|
||||||||||||||||||
| ```bash | ||||||||||||||||||
| uv pip install decontaminate | ||||||||||||||||||
| ``` | ||||||||||||||||||
|
|
||||||||||||||||||
| There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using | ||||||||||||||||||
|
P2: Spelling error: 'recomend' should be 'recommend'
Suggested change
|
||||||||||||||||||
| `datasets` for easy management: | ||||||||||||||||||
|
|
||||||||||||||||||
| So in the same environment you can do `pip install datasets`. | ||||||||||||||||||
|
Comment on lines +19 to +22
This section has a typo ('recomend' should be 'recommend') and the phrasing is a bit conversational and split across lines. I suggest rephrasing for clarity and correcting the typo.
Suggested change
|
||||||||||||||||||
|
|
||||||||||||||||||
| > [!IMPORTANT] | ||||||||||||||||||
| > The PyPI package is `decontaminate`, but the import is `import decon`. | ||||||||||||||||||
|
|
||||||||||||||||||
| ## Quick Example | ||||||||||||||||||
| ## Quickstart | ||||||||||||||||||
|
|
||||||||||||||||||
| Here is a common use case: | ||||||||||||||||||
|
|
||||||||||||||||||
| ```python | ||||||||||||||||||
| import decon | ||||||||||||||||||
|
|
@@ -31,13 +47,15 @@ tokens = tok.encode("hello world") # [15339, 1917] | |||||||||||||||||
| cleaned = decon.clean_text("Hello, World!") # "hello world" | ||||||||||||||||||
| ``` | ||||||||||||||||||
|
|
||||||||||||||||||
| We strive to keep parity with Rust API, if there are any issues with loss of quality, please help report it. | ||||||||||||||||||
|
This sentence could be phrased more formally and clearly. 'Loss of quality' is a bit vague, and 'please help report it' is informal.
Suggested change
|
||||||||||||||||||
|
|
||||||||||||||||||
| ## API Reference | ||||||||||||||||||
|
|
||||||||||||||||||
| The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs). | ||||||||||||||||||
|
|
||||||||||||||||||
| Key sections in `lib.rs`: | ||||||||||||||||||
| Please refer to these sections for the full detail of API. `lib.rs`: | ||||||||||||||||||
|
||||||||||||||||||
| - **`PyConfig`** (line ~230): All `Config` parameters with defaults in the `#[pyo3(signature = ...)]` block | ||||||||||||||||||
| - **`PyTokenizer`** (line ~740): Tokenizer with `encode()`, `decode()`, `is_space_token()` | ||||||||||||||||||
| - **Functions** (line ~830+): `detect()`, `clean_text()`, `review()`, `compare()`, `evals()`, `server()` | ||||||||||||||||||
|
|
||||||||||||||||||
| The Rust parameter names map directly to Python kwargs (e.g., `ngram_size` in Rust = `ngram_size=` in Python). | ||||||||||||||||||
| The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable. (e.g., `ngram_size` in Rust = `ngram_size=` in Python). | ||||||||||||||||||
|
The example
Suggested change
|
||||||||||||||||||
P3: Capitalize 'python' to 'Python' to maintain consistency with Python naming conventions and the rest of the document.