Kiji ships with a pre-trained PII detection model, but you can train your own to add new entity types, improve accuracy for your domain, or support additional languages. This chapter walks through the full customization workflow: generating training data, reviewing it in Label Studio, running the training pipeline, and loading your custom model into the app.
The customization workflow has four stages:
1. Generate Data ──► 2. Review in Label Studio ──► 3. Train with Metaflow ──► 4. Load in App
There are two ways to get training data into the pipeline.
Kiji includes tooling to generate synthetic PII samples using either the Doubleword batch API or the OpenAI API. Both produce Label Studio-compatible JSON files containing text with annotated PII entities.
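The generated files follow Label Studio's task JSON layout: the raw text under `data`, with pre-annotations under `predictions`. As a rough illustration (the text, labels, and offsets here are invented; the exact fields are defined by the generation tooling), a single sample looks like:

```json
{
  "data": {
    "text": "Contact Jane Doe at jane.doe@example.com."
  },
  "predictions": [
    {
      "result": [
        {
          "from_name": "label",
          "to_name": "text",
          "type": "labels",
          "value": { "start": 8, "end": 16, "text": "Jane Doe", "labels": ["PERSON"] }
        }
      ]
    }
  ]
}
```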
The generation prompts and entity definitions live in `model/dataset/label_utils.py`. The `LABEL_DESCRIPTIONS` dictionary is the single source of truth for all PII labels. Each entry defines the label name, color, selection probability, and generation hints. To add a new entity type or adjust how existing types are generated, edit this dictionary.
For example, to add a MEDICAL_RECORD_NUMBER entity type:
```python
"MEDICAL_RECORD_NUMBER": {
    "name": "Medical Record Number",
    "color": "#00acc1",
    "chance": 0.1,
    "hints": ["Use formats like MRN-123456 or various hospital ID formats"],
},
```

After modifying the labels, regenerate the Label Studio configuration so it includes your new entity types:
```bash
uv run python -c "from model.dataset.label_utils import LabelUtils; LabelUtils.update_labelstudio_config()"
```

The Doubleword pipeline automates the full generation workflow with batch processing, automatic polling, and resumability:
```bash
export DOUBLEWORD_API_KEY="your-api-key"

# Generate 500 samples
uv run python -m model.dataset.doubleword.pipeline \
    --command=start \
    --num_samples=500

# Enable an optional review stage for higher quality
uv run python -m model.dataset.doubleword.pipeline \
    --command=start \
    --num_samples=500 \
    --enable_review

# Check progress
uv run python -m model.dataset.doubleword.pipeline --command=status

# Resume after interruption
uv run python -m model.dataset.doubleword.pipeline --command=resume
```

Generated samples are saved to `model/dataset/data_samples/annotation_samples/`.
For fast iteration, use the OpenAI direct API:
```bash
export OPENAI_API_KEY="your-api-key"

# Generate 10 samples
uv run python -m model.dataset.openai.training_set --num_samples=10

# Use a custom OpenAI-compatible API server
export URL=http://your-server:8000/v1/chat/completions
uv run python -m model.dataset.openai.training_set --num_samples=100 --api_url=$URL

# Increase parallelism for faster generation
uv run python -m model.dataset.openai.training_set --num_samples=200 --max_workers=12
```

Generated samples are saved to `model/dataset/data_samples/annotation_samples/`.
You can bootstrap your training data from an existing HuggingFace dataset. The public dataset DataikuNLP/kiji-pii-training-data contains pre-labeled samples ready for training or further annotation.
Download and convert the dataset to local Label Studio JSON files:
```bash
# Download all splits to the default training_samples directory
uv run python model/dataset/huggingface/download_dataset_from_hf.py \
    --repo-id "DataikuNLP/kiji-pii-training-data"

# Download to a custom directory
uv run python model/dataset/huggingface/download_dataset_from_hf.py \
    --repo-id "DataikuNLP/kiji-pii-training-data" \
    --output-dir path/to/output

# Download only specific splits
uv run python model/dataset/huggingface/download_dataset_from_hf.py \
    --repo-id "DataikuNLP/kiji-pii-training-data" \
    --splits train
```

Each row is converted to an individual JSON file in the Label Studio format that the training pipeline consumes directly.
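The per-row conversion can be sketched as follows. This is illustrative only: the column names `text` and `entities` and the span fields are assumptions for the sketch, not the dataset's actual schema.

```python
def row_to_labelstudio_task(row: dict) -> dict:
    """Convert one dataset row into a Label Studio task with pre-annotations.

    Assumes hypothetical columns: `text` (the sample string) and `entities`
    (a list of {start, end, label} character spans).
    """
    results = [
        {
            "from_name": "label",
            "to_name": "text",
            "type": "labels",
            "value": {
                "start": ent["start"],
                "end": ent["end"],
                "text": row["text"][ent["start"]:ent["end"]],
                "labels": [ent["label"]],
            },
        }
        for ent in row["entities"]
    ]
    return {"data": {"text": row["text"]}, "predictions": [{"result": results}]}

row = {
    "text": "Call Alice on 555-0142.",
    "entities": [
        {"start": 5, "end": 10, "label": "PERSON"},
        {"start": 14, "end": 22, "label": "PHONE_NUMBER"},
    ],
}
task = row_to_labelstudio_task(row)
print(task["predictions"][0]["result"][0]["value"]["text"])  # Alice
```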
To expand the dataset with your own samples, generate additional data using Option 1 and merge them into the same directory. You can also upload your expanded dataset back to HuggingFace:
```bash
export HF_TOKEN=hf_xxxxx
uv run python model/dataset/huggingface/upload_dataset_to_hf.py \
    --repo-id "your-org/kiji-pii-training-data" \
    --public --create-repo
```

Regardless of how you generated your data, you should review it in Label Studio before training. This lets you correct mislabeled entities, add missing annotations, and ensure quality.
```bash
# Install Label Studio dependencies
uv sync --extra labelstudio

# Start Label Studio
uv run --extra labelstudio python -m label_studio.server start
```

Label Studio will open in your browser at http://localhost:8080.
- Click Create Project on the welcome screen.
- Give the project a descriptive name (e.g., "Kiji PII - Custom Entities").
- Skip the Data Import tab for now (we will import via script).
- On the Labeling Setup tab, click Custom template at the bottom of the template list.
- Paste the contents of `model/dataset/labelstudio/LabelingConfig.txt` into the code editor. This configuration is auto-generated from `label_utils.py` and includes all your entity types with color coding.
- Click Save.
Use the import script to load your generated samples with pre-annotations:
```bash
# Set environment variables
export LABEL_STUDIO_API_KEY='your-api-key'        # From Label Studio > Account & Settings > Personal Access Token
export LABEL_STUDIO_PROJECT_ID='your-project-id'  # From the project URL (e.g., /projects/3/)

# Import annotation samples
uv run python model/dataset/labelstudio/import_predictions.py
```

The script imports all JSON files from `model/dataset/data_samples/annotation_samples/` with their pre-annotations, so you can review and correct them rather than labeling from scratch.
- Click Label All Tasks to enter the labeling workflow.
- Review the pre-annotated entities. Click on labels to select them, then highlight text to apply them.
- For coreference relations, select a pronoun region, click the "create relation" button, then click the noun it refers to.
- Click Submit to save and move to the next task.
After reviewing, export the corrected annotations to the training samples directory:
```bash
export LABEL_STUDIO_API_KEY='your-api-key'
export LABEL_STUDIO_PROJECT_ID='your-project-id'
uv run python model/dataset/labelstudio/export_annotations.py
```

The exported samples are saved to `model/dataset/data_samples/training_samples/` in the format the training pipeline expects.
Before training, you can inspect the composition of your dataset using the analysis script. It auto-detects the file format and works with any of the sample directories:
```bash
# Analyze reviewed samples (default)
uv run python model/dataset/analyze_dataset.py

# Analyze a specific directory
uv run python model/dataset/analyze_dataset.py \
    --samples-dir model/dataset/data_samples/annotation_samples
```

The report includes language and country distributions, PII label frequencies, entity count statistics, text length metrics, coreference coverage, and the most common co-occurring label pairs. This is useful for spotting imbalances (e.g., underrepresented languages or missing entity types) before committing to a training run.
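As a minimal sketch of the label-frequency part of such a report (assuming Label Studio-style task JSON; the real script does considerably more):

```python
from collections import Counter

def label_frequencies(tasks: list[dict]) -> Counter:
    """Count PII label occurrences across Label Studio-style tasks,
    preferring human annotations over model predictions."""
    counts = Counter()
    for task in tasks:
        for ann in task.get("annotations") or task.get("predictions") or []:
            for result in ann.get("result", []):
                for label in result.get("value", {}).get("labels", []):
                    counts[label] += 1
    return counts

# Two toy tasks with invented labels
tasks = [
    {"predictions": [{"result": [
        {"value": {"labels": ["PERSON"]}},
        {"value": {"labels": ["EMAIL"]}},
    ]}]},
    {"predictions": [{"result": [{"value": {"labels": ["PERSON"]}}]}]},
]
print(label_frequencies(tasks)["PERSON"])  # 2
```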
The training pipeline is orchestrated by Metaflow and configured via model/flows/training_config.toml.
Edit model/flows/training_config.toml to adjust training parameters:
```toml
[model]
name = "microsoft/deberta-v3-small"

[training]
num_epochs = 30
batch_size = 4
learning_rate = 2e-5
warmup_steps = 500   # Gradual LR ramp-up to stabilize early training
weight_decay = 0.01  # L2 regularization strength

# Early stopping settings
early_stopping_enabled = true
early_stopping_patience = 3       # Number of epochs with no improvement before stopping
early_stopping_threshold = 0.005  # Minimum improvement (0.5%) to qualify

[paths]
training_samples_dir = "model/dataset/data_samples/training_samples"
output_dir = "model/trained"

[pipeline]
skip_export = true        # true if you already exported from Label Studio manually
skip_quantization = false # ONNX quantization
skip_signing = false      # Cryptographic model signing
```

Key settings:
- `skip_export`: Set to `true` if you already exported from Label Studio (Step 2). Set to `false` to have the pipeline export directly from Label Studio (requires `labelstudio.project_id` and `LABEL_STUDIO_API_KEY`).
- `skip_quantization`: Set to `true` to skip ONNX quantization.
- `training_samples_dir`: Points to the directory containing your reviewed training samples.
- `early_stopping_patience`: Evaluation runs once per epoch. With a patience of 3, training stops after 3 consecutive epochs with no F1 improvement.
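How patience and threshold interact can be sketched roughly as follows. This is an illustrative approximation, not the pipeline's actual implementation:

```python
def should_stop(f1_history: list[float], patience: int = 3,
                threshold: float = 0.005) -> bool:
    """Stop when the best F1 of the last `patience` epochs fails to beat
    the best F1 from earlier epochs by at least `threshold`."""
    if len(f1_history) <= patience:
        return False  # Not enough epochs to judge a plateau yet
    best_before = max(f1_history[:-patience])
    recent_best = max(f1_history[-patience:])
    return recent_best < best_before + threshold

# F1 plateaus after epoch 3: three stagnant epochs trigger the stop.
print(should_stop([0.80, 0.85, 0.90, 0.902, 0.901, 0.903]))  # True
# F1 still climbing: training continues.
print(should_stop([0.80, 0.85, 0.90, 0.93, 0.94, 0.95]))     # False
```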
```bash
# Run from the project root
./model/flows/run_training.sh

# Or run directly with uv
uv run --extra training --extra quantization --extra signing \
    python model/flows/training_pipeline.py run

# Use a custom config file
./model/flows/run_training.sh --config custom_config.toml
```

The pipeline executes the following steps:
- Export data (optional) - Pulls annotations from Label Studio via its SDK
- Preprocess data - Loads JSON samples, tokenizes text, aligns labels
- Train model - Multi-task learning for PII detection and coreference resolution
- Evaluate model - Runs inference on test cases and reports metrics
- Quantize model - Exports to ONNX and applies dynamic quantization
- Sign model - Generates a cryptographic signature for integrity verification
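The label-alignment part of preprocessing can be illustrated with a toy whitespace tokenizer (the real pipeline aligns against the model's subword tokenizer): character-span annotations become per-token BIO tags.

```python
def align_labels(text: str,
                 spans: list[tuple[int, int, str]]) -> list[tuple[str, str]]:
    """Map character-level entity spans to per-token BIO tags
    using a naive whitespace tokenizer (illustration only)."""
    tagged, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)  # character offset of this token
        end = start + len(token)
        pos = end
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:
                # First token of the span gets B-, the rest get I-
                tag = ("B-" if start == s else "I-") + label
                break
        tagged.append((token, tag))
    return tagged

print(align_labels("Contact Jane Doe today", [(8, 16, "PERSON")]))
# [('Contact', 'O'), ('Jane', 'B-PERSON'), ('Doe', 'I-PERSON'), ('today', 'O')]
```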
After training completes, you will find:
```
model/
├── trained/          # Full trained model (PyTorch)
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer.json
│   └── label_mappings.json
├── quantized/        # Quantized ONNX model (smaller, faster)
│   ├── model_quantized.onnx
│   ├── tokenizer.json
│   └── label_mappings.json
└── quantized.sig     # Cryptographic signature
```
The official Kiji models and training data are published on HuggingFace:
| Resource | HuggingFace Repo |
|---|---|
| Training dataset | DataikuNLP/kiji-pii-training-data |
| Trained model (SafeTensors) | DataikuNLP/kiji-pii-model |
| Quantized model (ONNX) | DataikuNLP/kiji-pii-model-onnx |
You can use the quantized ONNX model directly with the Kiji app by downloading it and pointing to the directory in Advanced Settings.
Share your trained model via HuggingFace Hub:
```bash
export HF_TOKEN=hf_xxxxx

# Upload the quantized ONNX model
uv run python model/dataset/huggingface/upload_model_to_hf.py \
    --variant quantized \
    --repo-id "your-org/kiji-pii-model-onnx" \
    --dataset-repo-id "your-org/kiji-pii-training-data" \
    --public --create-repo
```

Once you have a trained model, you can load it into the Kiji desktop app through the Advanced Settings modal.
Your model directory must contain these three files:
- `model_quantized.onnx` - The quantized ONNX model
- `tokenizer.json` - The tokenizer configuration
- `label_mappings.json` - The label-to-ID mappings
These are produced automatically by the training pipeline in the model/quantized/ directory.
- Open the Kiji desktop app.
- Open the hamburger menu and click Advanced Settings.
- In the Load Custom Kiji PII Model section, enter the path to your model directory or click Browse to select it.
- Click Reload Model.
- The status indicator will show Healthy if the model loaded successfully.
After loading a custom model, you may want to adjust the PII detection sensitivity. In the same Advanced Settings modal:
- Low (0.1 threshold) - Catches more potential PII but may have more false positives
- Medium (0.25 threshold) - Balanced detection (default)
- High (0.5 threshold) - More precise but may miss some PII
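The effect of the threshold amounts to a confidence filter over the model's detections. A minimal sketch (the entities and scores here are invented):

```python
def filter_entities(entities: list[dict], threshold: float) -> list[dict]:
    """Keep only detections whose confidence meets the threshold."""
    return [e for e in entities if e["score"] >= threshold]

detections = [
    {"text": "jane.doe@example.com", "label": "EMAIL", "score": 0.97},
    {"text": "Springfield", "label": "LOCATION", "score": 0.31},
    {"text": "maybe-a-name", "label": "PERSON", "score": 0.12},
]

print(len(filter_entities(detections, 0.10)))  # 3 - Low catches everything
print(len(filter_entities(detections, 0.25)))  # 2 - Medium drops the weakest hit
print(len(filter_entities(detections, 0.50)))  # 1 - High keeps only confident hits
```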
Here is a complete walkthrough of adding a new MEDICAL_RECORD_NUMBER entity type:
1. Add the label to `model/dataset/label_utils.py`:

   ```python
   "MEDICAL_RECORD_NUMBER": {
       "name": "Medical Record Number",
       "color": "#00acc1",
       "chance": 0.1,
       "hints": ["Use formats like MRN-123456 or various hospital ID formats"],
   },
   ```

2. Regenerate the Label Studio config:

   ```bash
   uv run python -c "from model.dataset.label_utils import LabelUtils; LabelUtils.update_labelstudio_config()"
   ```

3. Generate training data (using OpenAI for a quick test):

   ```bash
   export OPENAI_API_KEY="your-key"
   uv run python -m model.dataset.openai.training_set --num_samples=100
   ```

4. Review in Label Studio:

   ```bash
   uv sync --extra labelstudio
   uv run --extra labelstudio python -m label_studio.server start
   # Create a project, import data, review annotations, export
   ```

5. Train the model:

   ```bash
   ./model/flows/run_training.sh
   ```

6. Load in the app: Open Advanced Settings, browse to `model/quantized/`, and click Reload Model.
- Chapter 2: Development Guide - Set up your development environment
- Chapter 5: Advanced Topics - Model signing and security
- Dataset README - Detailed data generation reference
- Label Studio README - Full Label Studio integration guide