Merged
13 changes: 8 additions & 5 deletions docs/en/notes/guide/selector/selector_offline_near.md
@@ -53,10 +53,13 @@ if __name__ == "__main__":
     near = offline_near_Selector(
         candidate_path="OpenDCAI/DataFlex-selector-openhermes-10w", # split = train
         query_path="OpenDCAI/DataFlex-selector-openhermes-10w", # split = validation
-
-        # If you want to use vllm, please add "vllm:" before the model name
-        # Otherwise it automatically uses sentence-transformers
-        embed_model="vllm:Qwen/Qwen3-Embedding-0.6B",
+        # It automatically tries vLLM first, then sentence-transformers
+        embed_model="Qwen/Qwen3-Embedding-0.6B",
+        # Supported methods:
+        #   "auto": try vLLM first, then fall back to sentence-transformers
+        #   "vllm"
+        #   "sentence-transformer"
+        embed_method="auto",
         batch_size=32,
         save_indices_path="top_indices.npy",
         max_K=1000,
@@ -65,7 +68,7 @@ if __name__ == "__main__":
     near.selector()
 ```
 
-Note: model_name is used to encode the already-tokenized text into sentence embeddings (e.g., 512-dim), supporting both vLLM and sentence-transformers inference.
+Note: model_name is used to encode the already-tokenized text into sentence embeddings (e.g., 1024-dim), supporting both vLLM and sentence-transformers inference.
 
 Output: saved as an indices matrix containing the max_K closest candidates for each query.
 ---
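To make the output format described above concrete, here is a small self-contained sketch of how an (N, max_K) nearest-neighbor indices matrix like `top_indices.npy` can arise. This is illustrative only: the toy random arrays stand in for real sentence embeddings, and the actual selector uses the configured embedding model rather than this brute-force cosine search.

```python
import numpy as np

# Toy stand-ins for real sentence embeddings; shapes are illustrative only.
rng = np.random.default_rng(0)
cand = rng.normal(size=(100, 8))    # 100 candidate embeddings, dim 8
query = rng.normal(size=(5, 8))     # 5 query embeddings, dim 8

# Cosine similarity between every query and every candidate.
cand_n = cand / np.linalg.norm(cand, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query, axis=1, keepdims=True)
sim = query_n @ cand_n.T            # shape (5, 100)

# For each query, keep the indices of the max_K most similar candidates,
# nearest first -- one row per query, as in the saved indices matrix.
max_K = 10
top_indices = np.argsort(-sim, axis=1)[:, :max_K]   # shape (5, 10)
```

Each row of `top_indices` can then be used to pull the corresponding candidates out of the training set.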
16 changes: 9 additions & 7 deletions docs/en/notes/guide/selector/selector_offline_tsds.md
@@ -77,12 +77,14 @@ Modify training set, embedding model, and parameters inside
 ```python
 if __name__ == "__main__":
     tsds = offline_tsds_Selector(
-        candidate_path="OpenDCAI/DataFlex-selector-openhermes-10w", # training set
-        query_path="OpenDCAI/DataFlex-selector-openhermes-10w", # validation set
-
-        # If you want to use vllm, please add "vllm:" before the model name
-        # Otherwise it automatically uses sentence-transformers
-        embed_model="vllm:Qwen/Qwen3-Embedding-0.6B", # embedding model
+        candidate_path="OpenDCAI/DataFlex-selector-openhermes-10w",
+        query_path="OpenDCAI/DataFlex-selector-openhermes-10w",
+        embed_model="Qwen/Qwen3-Embedding-0.6B",
+        # Supported methods:
+        #   "auto": try vLLM first, then fall back to sentence-transformers
+        #   "vllm"
+        #   "sentence-transformer"
+        embed_method="auto",
         batch_size=32,
         save_probs_path="tsds_probs.npy",
         max_K=5000,
@@ -94,7 +96,7 @@ if __name__ == "__main__":
     tsds.selector()
 ```
 
-Note: model_name is used to encode the already-tokenized text into sentence embeddings (e.g., 512-dim), supporting both vLLM and sentence-transformers inference.
+Note: model_name is used to encode the already-tokenized text into sentence embeddings (e.g., 1024-dim), supporting both vLLM and sentence-transformers inference.
 
 Output: a sampling probability for each training sample.
 ---
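The per-sample probabilities that tsds saves can then drive weighted subset selection. A minimal sketch under assumed shapes follows; the toy random array merely stands in for the real values loaded from `tsds_probs.npy`, and the subset size is an arbitrary example.

```python
import numpy as np

# Stand-in for np.load("tsds_probs.npy"); real values come from the selector.
rng = np.random.default_rng(0)
probs = rng.random(1000)
probs /= probs.sum()                 # normalize into a sampling distribution

# Draw a training subset without replacement, weighted by the probabilities:
# higher-probability samples are more likely to enter the subset.
n_select = 100
chosen = rng.choice(len(probs), size=n_select, replace=False, p=probs)
```

`chosen` holds the indices of the selected training samples and can be passed to the dataset's subset/select API.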
13 changes: 8 additions & 5 deletions docs/zh/notes/guide/selector/selector_offline_near.md
@@ -48,10 +48,13 @@ if __name__ == "__main__":
     near = offline_near_Selector(
         candidate_path="OpenDCAI/DataFlex-selector-openhermes-10w", # split = train
         query_path="OpenDCAI/DataFlex-selector-openhermes-10w", # split = validation
-
-        # If you want to use vllm, please add "vllm:" before the model name
-        # Otherwise it automatically uses sentence-transformers
-        embed_model="vllm:Qwen/Qwen3-Embedding-0.6B",
+        # It automatically tries vLLM first, then sentence-transformers
+        embed_model="Qwen/Qwen3-Embedding-0.6B",
+        # Supported methods:
+        #   "auto": try vLLM first, then fall back to sentence-transformers
+        #   "vllm"
+        #   "sentence-transformer"
+        embed_method="auto",
         batch_size=32,
         save_indices_path="top_indices.npy",
         max_K=1000,
@@ -61,7 +64,7 @@ if __name__ == "__main__":
 
 ```
 
-> **Note**: `model_name` here further encodes the **tokenized** text into **sentence embeddings** (e.g., 512-dim); both vLLM and sentence-transformers inference are supported.
+> **Note**: `model_name` here further encodes the **tokenized** text into **sentence embeddings** (e.g., 1024-dim); both vLLM and sentence-transformers inference are supported.
 
 **Finally saved as an (N, max_K) index matrix of the max_K nearest training samples for each query**
 
16 changes: 9 additions & 7 deletions docs/zh/notes/guide/selector/selector_offline_tsds.md
@@ -76,12 +76,14 @@ pip install faiss-cpu vllm sentence-transformer
 ```python
 if __name__ == "__main__":
     tsds = offline_tsds_Selector(
-        candidate_path="OpenDCAI/DataFlex-selector-openhermes-10w", # training set
-        query_path="OpenDCAI/DataFlex-selector-openhermes-10w", # validation set
-
-        # If you want to use vllm, please add "vllm:" before the model name
-        # Otherwise it automatically uses sentence-transformers
-        embed_model="vllm:Qwen/Qwen3-Embedding-0.6B", # embedding model
+        candidate_path="OpenDCAI/DataFlex-selector-openhermes-10w",
+        query_path="OpenDCAI/DataFlex-selector-openhermes-10w",
+        embed_model="Qwen/Qwen3-Embedding-0.6B",
+        # Supported methods:
+        #   "auto": try vLLM first, then fall back to sentence-transformers
+        #   "vllm"
+        #   "sentence-transformer"
+        embed_method="auto",
         batch_size=32,
         save_probs_path="tsds_probs.npy",
         max_K=5000,
@@ -94,7 +96,7 @@ if __name__ == "__main__":
 
 ```
 
-> **Note**: `model_name` here further encodes the **tokenized** text into **sentence embeddings** (e.g., 512-dim); both vLLM and sentence-transformers inference are supported.
+> **Note**: `model_name` here further encodes the **tokenized** text into **sentence embeddings** (e.g., 1024-dim); both vLLM and sentence-transformers inference are supported.
 
 **Finally saved as a sampling probability for each training sample**
 