CLI Reference — Complete Argument Guide

This document covers every command-line option available in AnonShield. Each argument includes a plain-language description, a concrete example, and notes on when to use it.

How to run the tool:

Method Command prefix

Docker — from this repository ./docker/run.sh [OPTIONS] <file_or_folder>

Docker — downloaded from Docker Hub ./run.sh [OPTIONS] <file_or_folder> (script downloaded per DOCKERHUB_README instructions)

Local install uv run anon.py [OPTIONS] <file_or_folder>

All examples below use ./docker/run.sh. If you downloaded the script from Docker Hub, use ./run.sh instead. If you installed locally, use uv run anon.py.

Method	Command prefix
Docker — from this repository	`./docker/run.sh [OPTIONS] <file_or_folder>`
Docker — downloaded from Docker Hub	`./run.sh [OPTIONS] <file_or_folder>` (script downloaded per DOCKERHUB_README instructions)
Local install	`uv run anon.py [OPTIONS] <file_or_folder>`

Positional Argument — The File or Folder
Informational Commands
General Options
Anonymization Options
Word List
Performance and Filtering Options
Anonymization Strategy and Models
Database Options
Chunking and Batching Options
NER Data Generation Options
SLM / AI Options (Experimental)
Ollama Service Options (Experimental)
Quick Reference Table

1. Positional Argument — The File or Folder

`file_path`

What it is: The path to the file or folder you want to anonymize. This is the only required argument.

Accepted values:

A single file: report.csv, ./data/scan.json, /home/user/incident.pdf
A directory: ./reports/, /data/openvas/

When you point to a folder, the tool processes every supported file inside it, including subdirectories.

Supported file types: .txt .log .csv .xlsx .json .jsonl .xml .pdf .docx .png .jpg .gif .bmp .tiff .webp

Examples:

# Anonymize a single CSV file
./docker/run.sh ./incident_report.csv

# Anonymize an entire folder
./docker/run.sh ./openvas_reports/

# Anonymize a PDF scan report
./docker/run.sh ./nessus_scan.pdf

Note: Unsupported file types (e.g., .exe, .zip) are silently skipped when processing a folder.

2. Informational Commands

These flags print information and exit immediately — they do not process any file.

`--list-entities`

What it does: Prints all entity types detectable for the given strategy and model combination, then exits. The output varies by --anonymization-strategy and --transformer-model — different combinations produce different entity sets.

When to use: Before running on a file, to know which entity names to use with --preserve-entities.

# Default (filtered strategy, xlm-roberta model)
./docker/run.sh --list-entities

# See entities available with the presidio strategy (includes AU_ABN, US_SSN, IBAN, etc.)
./docker/run.sh --list-entities --anonymization-strategy presidio

# See entities for the cybersecurity-focused model
./docker/run.sh --list-entities --transformer-model attack-vector/SecureModernBERT-NER

Sample output (filtered + default model):

Supported entity types (strategy=filtered, model=Davlan/xlm-roberta-base-ner-hrl):
 - AUTH_TOKEN
 - CERTIFICATE
 - CERT_SERIAL
 - CPE_STRING
 - CREDIT_CARD
 - CRYPTOGRAPHIC_KEY
 - CVE_ID
 - EMAIL_ADDRESS
 - FILE_PATH
 - HASH
 - HOSTNAME
 - IP_ADDRESS
 - LOCATION
 - MAC_ADDRESS
 - OID
 - ORGANIZATION
 - PASSWORD
 - PERSON
 - PGP_BLOCK
 - PHONE_NUMBER
 - PORT
 - URL
 - USERNAME
 - UUID

Strategy differences: presidio includes all Presidio built-in recognizers (~46 entities, including AU_ABN, US_SSN, IBAN_CODE, UK_NHS, etc.). All other strategies (filtered, hybrid, standalone) use only the curated cybersecurity recognizer set shown above.

`--list-languages`

What it does: Prints all supported document languages, then exits.

./docker/run.sh --list-languages

Sample output:

Supported languages:
 - en: English
 - pt: Portuguese
 - es: Spanish
 - de: German
 - fr: French
 - ...

3. General Options

`--lang <code>`

Default: en

What it does: Tells the tool which language the document is written in. This loads the correct spaCy linguistic model and activates language-specific NER rules.

When to use: Any time your document is not in English.

# Portuguese security report
./docker/run.sh ./relatorio.pdf --lang pt

# Spanish document
./docker/run.sh ./incidente.docx --lang es

Supported codes: en, pt, es, de, fr, it, nl, pl, ru, zh, and more — run --list-languages for the full list.

`--output-dir <path>`

Default: output

What it does: Sets the folder where anonymized files are saved. The tool creates the folder automatically if it does not exist.

Output filenames always follow the pattern anon_<original_filename>.<ext>.

# Save results to a specific folder
./docker/run.sh ./reports/ --output-dir ./anonymized_results/

# Use an absolute path
./docker/run.sh ./scan.json --output-dir /data/processed/

`--overwrite`

Default: off (existing files are skipped)

What it does: Allows the tool to replace existing output files. Without this flag, if anon_report.csv already exists in the output folder, it is skipped.

./docker/run.sh ./report.csv --overwrite

Use case: Re-running the tool with different settings on the same input file.

`--no-report`

Default: off (a performance report is always written)

What it does: Disables the creation of a text performance report in the logs/ directory after each run.

./docker/run.sh ./report.csv --no-report

Use case: Automated pipelines where you do not need the extra log file.

`--log-level <LEVEL>`

Default: WARNING

What it does: Controls how much information is printed to the terminal during processing.

Level	What you see
`DEBUG`	Everything — very verbose, useful for troubleshooting
`INFO`	Progress updates (model loading, file processing)
`WARNING`	Only problems and warnings (default)
`ERROR`	Only errors
`CRITICAL`	Only fatal errors

# See progress while processing
./docker/run.sh ./report.csv --log-level INFO

# Debug a problem
./docker/run.sh ./report.csv --log-level DEBUG

4. Anonymization Options

`--preserve-entities <TYPES>`

Default: none (everything is anonymized)

What it does: A comma-separated list of entity types that should not be anonymized. Useful when some entities are not sensitive in your context (e.g., you want to keep IP addresses visible for analysis but anonymize names).

# Keep IP addresses and hostnames visible
./docker/run.sh ./scan.json --preserve-entities "IP_ADDRESS,HOSTNAME"

# Keep locations and organizations visible
./docker/run.sh ./report.txt --preserve-entities "LOCATION,ORGANIZATION"

Tip: Use --list-entities to see all valid entity type names.

`--allow-list <TERMS>`

Default: none

What it does: A comma-separated list of specific text values that should never be anonymized, regardless of what the NER model detects.

When to use: When the tool incorrectly anonymizes a word that is not sensitive in your context — for example, product names, public organization names, or known safe identifiers.

# Never anonymize these specific values
./docker/run.sh ./report.csv --allow-list "Google,Microsoft,CVSS"

# Multiple terms with spaces — quote the whole list
./docker/run.sh ./report.txt --allow-list "John Doe,127.0.0.1,internal-only"

`--slug-length <n>`

Default: 64

What it does: Controls the length of the random-looking suffix added to each anonymized entity.

When the tool replaces a name like John Smith, it generates a tag like [PERSON_a1b2c3d4...]. The number after the underscore is a hash derived from the original value and your secret key — the --slug-length controls how many characters of that hash appear.

Value	Output example	Notes
`64` (default)	`[PERSON_a1b2c3...64chars]`	Maximum uniqueness, full reversibility
`8`	`[PERSON_a1b2c3d4]`	Shorter, easier to read, still unique for most datasets
`0`	`[PERSON]`	No hash — entity type label only. No secret key required.

# Short slugs for readability
./docker/run.sh ./report.txt --slug-length 8

# Label-only (no de-anonymization possible later)
./docker/run.sh ./report.txt --slug-length 0

Important: With --slug-length 0 you do not need ANON_SECRET_KEY. With any other value, the key is required and must be kept for de-anonymization.

`--anonymization-config <path>`

Default: none

What it does: Points to a JSON file that gives the tool precise, field-level instructions for structured files (JSON, CSV, XML). Without this, the tool runs NER inference on every field — correct, but slow on large datasets.

The config supports three keys:

Key	Effect
`fields_to_exclude`	These fields are never touched
`fields_to_anonymize`	Only these fields run through NER inference
`force_anonymize`	These fields are always anonymized as a specific entity type, skipping NER entirely

Fields are specified using dot notation (e.g., asset.ipv4_addresses).

Example config file (anon_config.json):

{
  "fields_to_exclude": ["severity", "port", "protocol", "cvss_score", "age_in_days"],
  "fields_to_anonymize": ["asset.name", "output", "description"],
  "force_anonymize": {
    "asset.ipv4_addresses": { "entity_type": "IP_ADDRESS" },
    "asset.display_fqdn":   { "entity_type": "HOSTNAME" },
    "asset.display_mac_address": { "entity_type": "MAC_ADDRESS" },
    "scan.target":          { "entity_type": "HOSTNAME" }
  }
}

With this config:

severity, port, cvss_score, etc. are preserved as-is
asset.ipv4_addresses is always pseudonymized as IP_ADDRESS, no matter its format
asset.name, output, and description go through full NER analysis
Everything else is ignored

./docker/run.sh ./nessus_scan.json --anonymization-config ./anon_config.json

Performance impact: Using force_anonymize and fields_to_exclude can speed up processing by 25–134x on large structured files by eliminating NER inference for known fields.

5. Word List

`--word-list <path>`

Default: none

What it does: Points to a JSON file containing lists of known terms that must always be anonymized, even if the NER model would otherwise miss them.

This is ideal for internal terminology: internal system names, internal organization names, project codenames, or employee names that a general NER model might not recognize.

Format: A JSON object where each key is the entity type label (uppercased automatically) and each value is a list of terms. Any string key is valid — no preset mapping is required. Use the same type labels you would use in --preserve-entities or force_anonymize.

Example word list file (my_terms.json):

{
  "ORGANIZATION": ["AcmeCorp", "CSIRT-BR", "ProjectPhoenix"],
  "PERSON":       ["Jane Doe", "Carlos Souza"],
  "HOSTNAME":     ["fw-edge.internal", "siem.corp.local"],
  "IP_ADDRESS":   ["10.0.0.1", "192.168.100.254"],
  "MY_SYSTEM":    ["SIEM-Alpha", "FW-CORE-01", "PROXY-DMZ"]
}

You can use any entity type label, including custom ones (MY_SYSTEM, THREAT_ACTOR, CAMPAIGN, etc.). The label becomes the entity type recorded in the anonymization database.

./docker/run.sh ./incident_report.txt --word-list ./my_terms.json

How it works: Each term is compiled into a case-insensitive, word-boundary-aware regex pattern with a confidence score of 1.0 (maximum). This ensures these terms are always detected and anonymized, regardless of surrounding context.

Tip: Combine with --anonymization-config on structured files for full control: use force_anonymize for known fields, and --word-list for known values that appear across free-text fields.

6. Performance and Filtering Options

`--optimize`

Default: off

What it does: A single flag that enables all recommended performance settings at once. Equivalent to running with --anonymization-strategy standalone --db-mode in-memory --use-cache --min-word-length 3.

./docker/run.sh ./large_dataset/ --optimize

When to use: Processing large files or directories where speed matters more than maximum accuracy. On GPU, the standalone strategy is 4× faster than the default.

`--use-cache` / `--no-use-cache`

Default: --use-cache (caching is on)

What it does: Keeps an in-memory dictionary of already-anonymized values. If the same entity (e.g., the same IP address) appears many times in a large file, the tool looks it up in the cache instead of recomputing its pseudonym.

# Caching is on by default, so this is the same as not specifying it
./docker/run.sh ./report.json --use-cache

# Disable caching (useful for very large files where cache memory is a concern)
./docker/run.sh ./report.json --no-use-cache

`--max-cache-size <n>`

Default: 10000

What it does: The maximum number of unique entities to store in the cache. Once the cache is full, the oldest entries are evicted (LRU policy).

# Larger cache for files with many unique entities
./docker/run.sh ./large_report.csv --max-cache-size 50000

# Smaller cache to reduce memory usage
./docker/run.sh ./report.csv --max-cache-size 1000

`--min-word-length <n>`

Default: 0 (no minimum — all words are considered)

What it does: Skips any text token shorter than n characters. For example, with --min-word-length 3, the tokens "a", "to", and "ok" are never sent for NER inference. This can significantly reduce processing time on files with many short tokens.

# Skip tokens shorter than 3 characters (recommended for most datasets)
./docker/run.sh ./report.csv --min-word-length 3

Note: --optimize automatically sets --min-word-length 3.

`--skip-numeric`

Default: off

What it does: Skips strings that consist entirely of digits (e.g., "12345", "2024"). Useful when numeric identifiers in your dataset are not sensitive (e.g., vulnerability counts, port numbers that should not be anonymized).

./docker/run.sh ./report.csv --skip-numeric

Caution: Do not use this if your data contains sensitive numeric strings like credit card numbers or phone numbers in pure-digit format. Use --preserve-entities CREDIT_CARD instead for fine-grained control.

`--preserve-row-context`

Default: off

What it does: For CSV and XLSX files, the tool by default groups identical values across rows and processes each unique value only once (much faster). With --preserve-row-context, every cell value is processed in full, which is slower but ensures that the surrounding column context is preserved for each row.

./docker/run.sh ./dataset.csv --preserve-row-context

When to use: When your dataset has repeated values that have different meanings depending on the row context (rare), or when you suspect unique-value grouping is causing incorrect detections.

`--json-stream-threshold-mb <n>`

Default: 100

What it does: JSON files larger than this size (in megabytes) are processed by streaming from disk instead of loading the entire file into memory. Streaming uses more processing time but prevents out-of-memory errors on very large JSON files.

# Stream JSON files larger than 50 MB
./docker/run.sh ./large_export.json --json-stream-threshold-mb 50

# Only stream files larger than 500 MB (load smaller ones into RAM for speed)
./docker/run.sh ./export.json --json-stream-threshold-mb 500

`--regex-priority`

Default: off

What it does: Gives custom regex recognizers (e.g., the built-in cybersecurity patterns for IPs, CVEs, hashes) a score boost over the transformer model-based detections. When both a regex and a model detect the same text span, the regex result wins.

./docker/run.sh ./vuln_report.txt --regex-priority

When to use: On cybersecurity data where you want exact pattern matching (for IP addresses, CVE IDs, hashes) to take precedence over model-based detections, which can occasionally be imprecise.

`--force-large-xml`

Default: off

What it does: By default, the tool refuses to process XML files that exceed internal memory safety limits, to avoid out-of-memory crashes. This flag disables that safety check.

./docker/run.sh ./massive_nessus_export.xml --force-large-xml

Warning: Only use this on a machine with sufficient RAM. Processing a very large XML file without enough memory will crash the process.

`--disable-gc`

Default: off

What it does: Disables Python's automatic garbage collector during processing. This can boost speed for a single large file because the GC is not pausing at inopportune moments, but it increases peak memory usage since unused objects accumulate.

./docker/run.sh ./huge_file.pdf --disable-gc

When to use: Processing one very large file on a machine with ample RAM. Do not use when processing many files in a loop — memory will grow unbounded.

7. Anonymization Strategy and Models

`--anonymization-strategy <strategy>`

Default: filtered

What it does: Selects the internal engine used to detect and replace entities.

Strategy	Description	Best for
`filtered`	Presidio pipeline with a curated, optimized recognizer set	Default — best accuracy
`presidio`	Full Presidio pipeline with all recognizers enabled	Broadest detection, more false positives
`hybrid`	Presidio detection + manual text replacement (no Presidio anonymizer)	When Presidio's anonymizer causes issues
`standalone`	Loads NER models directly, bypasses Presidio entirely	Maximum GPU throughput (4× faster)
`slm`	End-to-end anonymization using a local language model (Ollama)	Experimental / research use

Performance comparison — GPU (NVIDIA RTX 5060 Ti, 551 MB JSON, 70,951 records):

Strategy	CSV (KB/s)	JSON (KB/s)
`standalone`	732	1,250
`hybrid`	248	632
`filtered` (default)	240	627
`presidio`	171	575

Accuracy comparison (67 annotated vulnerability records, SecureModernBERT model):

Strategy	Precision	Recall	F1
`filtered` (default)	91.9 %	96.7 %	94.2 %
`hybrid`	91.9 %	96.7 %	94.2 %
`standalone`	87.9 %	94.5 %	91.1 %
`presidio`	71.6 %	96.7 %	82.3 %

# Best accuracy (default — no flag needed)
./docker/run.sh ./report.csv

# Maximum GPU throughput
./docker/run.sh ./large_dataset.json --anonymization-strategy standalone

# Experimental: use a local LLM (requires Ollama)
./docker/run.sh ./report.txt --anonymization-strategy slm

Recommendation: Use filtered for accuracy. Use standalone on GPU when processing large datasets where speed is the priority.

`--transformer-model <model>`

Default: Davlan/xlm-roberta-base-ner-hrl

What it does: Selects the transformer model used for Named Entity Recognition (NER). The model is downloaded automatically on first use and cached locally.

Model	Scope	Languages	Notes
`Davlan/xlm-roberta-base-ner-hrl`	General-purpose NER	24 languages	Default
`attack-vector/SecureModernBERT-NER`	Cybersecurity-focused NER	English only	Adds `MALWARE`, `THREAT_ACTOR`, `ATTACK_PATTERN`, etc.
`dslim/bert-base-NER`	General-purpose NER, fast	English only	Smaller, faster

# Cybersecurity-focused model (recommended for security reports)
./docker/run.sh ./vuln_report.txt --transformer-model attack-vector/SecureModernBERT-NER

# Fast English model
./docker/run.sh ./report.txt --transformer-model dslim/bert-base-NER

First run: The model is downloaded from HuggingFace (~1 GB for the default, ~400 MB for SecureModernBERT) and cached in ./anon/models/ for all subsequent runs.

8. Database Options

The tool uses a SQLite database to store entity mappings (original value → pseudonym). This database is what makes de-anonymization possible later.

`--db-mode <mode>`

Default: persistent

What it does: Controls where the database lives.

Mode	Behavior
`persistent`	Saved to disk in `--db-dir`. Can be used for de-anonymization later.
`in-memory`	Lives only in RAM during the run. Faster, but mappings are lost when the tool finishes.

# Save mappings to disk (default — keep for de-anonymization)
./docker/run.sh ./report.csv --db-mode persistent

# Temporary run, no de-anonymization needed
./docker/run.sh ./report.csv --db-mode in-memory

Note: --optimize sets --db-mode in-memory automatically. Only use in-memory if you do not need to reverse the anonymization later.

`--db-dir <path>`

Default: db

What it does: The directory where the persistent SQLite database file is stored.

./docker/run.sh ./report.csv --db-dir /secure/storage/anondb/

`--db-synchronous-mode <mode>`

Default: none (SQLite default, which is NORMAL)

What it does: Sets SQLite's synchronous PRAGMA, which controls how aggressively SQLite flushes data to disk. Lower synchronous modes are faster but less safe in case of a power failure.

Mode	Speed	Safety
`EXTRA`	Slowest	Maximum
`FULL`	Slow	High
`NORMAL`	Balanced	Good (default)
`OFF`	Fastest	Risk of corruption on crash

# Maximum speed (acceptable for one-time batch runs)
./docker/run.sh ./large_dataset/ --db-synchronous-mode OFF

# Maximum safety (for production environments)
./docker/run.sh ./report.csv --db-synchronous-mode FULL

9. Chunking and Batching Options

These options control how large files are split into pieces for processing. The defaults are sensible for most use cases. Only adjust these if you are experiencing memory issues or want to fine-tune performance.

`--batch-size <n>`

Default: 1000

What it does: The number of text chunks processed in each batch by the NER pipeline.

./docker/run.sh ./report.csv --batch-size 500

# Use adaptive sizing based on file characteristics
./docker/run.sh ./report.csv --batch-size auto

`--csv-chunk-size <n>`

Default: 1000

What it does: The number of rows read from a CSV file at a time using pandas. Lower values reduce memory usage; higher values may be faster.

./docker/run.sh ./large.csv --csv-chunk-size 500

`--json-chunk-size <n>`

Default: 1000

What it does: The number of JSON array items processed per streaming batch on large JSON files.

./docker/run.sh ./export.json --json-chunk-size 500

`--ner-chunk-size <n>`

Default: 1500

What it does: Maximum character length of each text chunk sent to the NER model. Text longer than this is automatically split. Larger chunks provide more context to the model; smaller chunks reduce memory usage.

./docker/run.sh ./report.txt --ner-chunk-size 2000

`--nlp-batch-size <n>`

Default: 500

What it does: Batch size for spaCy's nlp.pipe() processing. Larger values speed up spaCy on GPU; smaller values reduce memory pressure.

./docker/run.sh ./corpus/ --nlp-batch-size 1000

`--use-datasets`

Default: off

What it does: Uses HuggingFace datasets for batch NER processing instead of standard Python iteration. This eliminates the "running pipelines sequentially on GPU" warning and improves GPU utilization. Recommended for files larger than 50 MB.

./docker/run.sh ./large_corpus/ --use-datasets

10. NER Data Generation Options

These options switch the tool from anonymization mode into NER training data generation mode. In this mode, the tool detects entities but instead of replacing them, it writes a JSONL file recording what was found — ready to use as training data for a custom NER model.

No secret key required in NER data generation mode.

`--generate-ner-data`

What it does: Activates NER data generation mode. Output is one JSONL file per input file, where each line is:

{"text": "The attacker at 192.168.1.1 used CVE-2021-44228.", "label": [[16, 29, "IP_ADDRESS"], [35, 50, "CVE_ID"]]}

./docker/run.sh ./corpus/ --generate-ner-data --output-dir ./ner_training_data/

`--ner-include-all`

Default: off (only texts with at least one detected entity are included)

What it does: Includes all texts in the NER output, even those where no entities were detected. Useful when training a model that also needs to learn "there is nothing to detect here."

./docker/run.sh ./corpus/ --generate-ner-data --ner-include-all

`--ner-aggregate-record`

Default: off

What it does: For JSON and JSONL files, instead of extracting each field separately, this option merges all fields of each record into a single text line before running NER. Useful when you want to preserve the full context of each record.

./docker/run.sh ./logs.jsonl --generate-ner-data --ner-aggregate-record

11. SLM / AI Options (Experimental)

These options use a Small Language Model (SLM) running locally via Ollama to assist with entity detection or anonymization. They are experimental and intended for research use.

Prerequisite: Ollama must be installed and running, or the tool must be allowed to manage an Ollama Docker container (--no-auto-ollama disabled).

`--slm-map-entities`

What it does: Uses the SLM to scan a file and produce a detailed report of all potential entities, including confidence scores and the model's reasoning for each detection. Does not anonymize — only maps and reports.

Output files:

<name>_entity_map.jsonl — one JSON object per entity
<name>_entity_map.csv — same data in spreadsheet format

./docker/run.sh ./report.txt --slm-map-entities --output-dir ./entity_analysis/

`--slm-detector`

What it does: Adds the SLM as an additional entity detector alongside the standard transformer NER model. The SLM results are merged with the traditional NER results.

./docker/run.sh ./report.txt --slm-detector

`--slm-detector-mode <mode>`

Default: hybrid

What it does: Controls how SLM detections are combined with traditional NER results.

Mode	Behavior
`hybrid`	Merges SLM and traditional NER results (more detections)
`exclusive`	Uses only SLM results, ignoring the transformer model

./docker/run.sh ./report.txt --slm-detector --slm-detector-mode exclusive

`--slm-prompt-version <version>`

Default: v1

What it does: Selects which prompt template version to use for SLM tasks.

./docker/run.sh ./report.txt --slm-map-entities --slm-prompt-version v2

`--slm-chunk-size <n>`

Default: 2000 (characters)

What it does: Maximum character length of each text chunk sent to the SLM mapper.

./docker/run.sh ./report.txt --slm-map-entities --slm-chunk-size 1500

`--slm-anonymizer-chunk-size <n>`

Default: 3000 (characters)

What it does: Maximum character length of each text chunk sent to the SLM anonymizer (used with --anonymization-strategy slm).

./docker/run.sh ./report.txt --anonymization-strategy slm --slm-anonymizer-chunk-size 2000

`--slm-confidence-threshold <n>`

Default: 0.7

What it does: Minimum confidence score (0.0–1.0) for an entity detected by the SLM to be accepted. Entities with lower confidence are discarded.

# Only accept very high-confidence detections
./docker/run.sh ./report.txt --slm-map-entities --slm-confidence-threshold 0.9

`--slm-context-window <n>`

Default: 200 (characters)

What it does: The number of characters before and after each detected entity to include as context in the SLM mapper output. Larger windows give more surrounding context.

./docker/run.sh ./report.txt --slm-map-entities --slm-context-window 400

`--slm-temperature <n>`

Default: 0.0

What it does: Controls the randomness of the SLM's output. 0.0 is fully deterministic (best for structured tasks). Higher values produce more varied output.

./docker/run.sh ./report.txt --slm-map-entities --slm-temperature 0.1

12. Ollama Service Options (Experimental)

These options control how the tool manages the Ollama service that runs the local SLM.

`--no-auto-ollama`

Default: off (auto-management is enabled)

What it does: Disables automatic Ollama Docker container management. By default, the tool starts an Ollama container if one is not already running. With this flag, you must start Ollama manually before running the tool.

# You started Ollama yourself; tell the tool not to touch it
./docker/run.sh ./report.txt --slm-map-entities --no-auto-ollama

`--ollama-docker-image <image>`

Default: ollama/ollama:latest

What it does: Specifies which Docker image to use when the tool starts an Ollama container automatically.

./docker/run.sh ./report.txt --slm-map-entities --ollama-docker-image ollama/ollama:0.3.0

`--ollama-container-name <name>`

Default: ollama-anon

What it does: Sets the name of the Ollama Docker container managed by the tool.

./docker/run.sh ./report.txt --slm-map-entities --ollama-container-name my-ollama

`--ollama-no-gpu`

Default: off (GPU is used if available)

What it does: Disables GPU support when starting the Ollama Docker container. Use this if your GPU is not compatible or you want the SLM to run on CPU.

./docker/run.sh ./report.txt --slm-map-entities --ollama-no-gpu

13. Quick Reference Table

Argument	Default	Description
`file_path`	—	File or folder to anonymize (required)
`--list-entities`	—	Print all supported entity types and exit
`--list-languages`	—	Print all supported languages and exit
`--lang`	`en`	Document language code
`--output-dir`	`output`	Where to save anonymized files
`--overwrite`	off	Overwrite existing output files
`--no-report`	off	Skip the performance report in `logs/`
`--log-level`	`WARNING`	Verbosity: `DEBUG` `INFO` `WARNING` `ERROR` `CRITICAL`
`--preserve-entities`	—	Comma-separated entity types to skip
`--allow-list`	—	Comma-separated terms to never anonymize
`--slug-length`	`64`	Length of the hash suffix in pseudonyms (0–64)
`--anonymization-config`	—	Path to JSON config for field-level control
`--word-list`	—	Path to JSON file of known terms to always anonymize
`--preserve-row-context`	off	Process every CSV/XLSX cell individually
`--json-stream-threshold-mb`	`100`	Stream JSON files larger than this many MB
`--optimize`	off	Enable all performance optimizations
`--use-cache` / `--no-use-cache`	on	Enable/disable in-memory result cache
`--max-cache-size`	`10000`	Maximum entries in the cache
`--min-word-length`	`0`	Skip tokens shorter than this (characters)
`--skip-numeric`	off	Skip purely numeric strings
`--regex-priority`	off	Prioritize regex over model detections
`--force-large-xml`	off	Override XML memory safety limits
`--disable-gc`	off	Disable Python garbage collection
`--anonymization-strategy`	`filtered`	Detection engine: `filtered` `presidio` `hybrid` `standalone` `slm`
`--transformer-model`	`Davlan/xlm-roberta-base-ner-hrl`	NER model to use
`--db-mode`	`persistent`	Database mode: `persistent` or `in-memory`
`--db-dir`	`db`	Directory for the database file
`--db-synchronous-mode`	—	SQLite sync PRAGMA: `OFF` `NORMAL` `FULL` `EXTRA`
`--batch-size`	`1000`	Processing batch size (or `auto`)
`--csv-chunk-size`	`1000`	Rows per CSV read chunk
`--json-chunk-size`	`1000`	Items per JSON streaming chunk
`--ner-chunk-size`	`1500`	Max characters per NER text chunk
`--nlp-batch-size`	`500`	Batch size for spaCy's pipe
`--use-datasets`	off	Use HuggingFace datasets for GPU batch processing
`--generate-ner-data`	off	Generate NER training data instead of anonymizing
`--ner-include-all`	off	Include texts with no detected entities in NER output
`--ner-aggregate-record`	off	Merge JSON/JSONL record fields into one text line
`--slm-map-entities`	off	Map entities with SLM (no anonymization)
`--slm-detector`	off	Use SLM as an additional entity detector
`--slm-detector-mode`	`hybrid`	`hybrid` or `exclusive`
`--slm-prompt-version`	`v1`	Prompt template version for SLM
`--slm-chunk-size`	`2000`	Max chars per SLM mapper chunk
`--slm-anonymizer-chunk-size`	`3000`	Max chars per SLM anonymizer chunk
`--slm-confidence-threshold`	`0.7`	Minimum SLM confidence to accept an entity
`--slm-context-window`	`200`	Context characters around each entity in SLM output
`--slm-temperature`	`0.0`	SLM output randomness (0 = deterministic)
`--no-auto-ollama`	off	Disable automatic Ollama Docker management
`--ollama-docker-image`	`ollama/ollama:latest`	Ollama Docker image
`--ollama-container-name`	`ollama-anon`	Ollama container name
`--ollama-no-gpu`	off	Disable GPU for Ollama container

FilesExpand file tree

CLI_REFERENCE.md

Latest commit

History

CLI_REFERENCE.md

File metadata and controls

CLI Reference — Complete Argument Guide

Table of Contents

1. Positional Argument — The File or Folder

file_path

2. Informational Commands

--list-entities

--list-languages

3. General Options

--lang <code>

--output-dir <path>

--overwrite

--no-report

--log-level <LEVEL>

4. Anonymization Options

--preserve-entities <TYPES>

--allow-list <TERMS>

--slug-length <n>

--anonymization-config <path>

5. Word List

--word-list <path>

6. Performance and Filtering Options

--optimize

--use-cache / --no-use-cache

--max-cache-size <n>

--min-word-length <n>

--skip-numeric

--preserve-row-context

--json-stream-threshold-mb <n>

--regex-priority

--force-large-xml

--disable-gc

7. Anonymization Strategy and Models

--anonymization-strategy <strategy>

--transformer-model <model>

8. Database Options

--db-mode <mode>

--db-dir <path>

--db-synchronous-mode <mode>

9. Chunking and Batching Options

--batch-size <n>

--csv-chunk-size <n>

--json-chunk-size <n>

--ner-chunk-size <n>

--nlp-batch-size <n>

--use-datasets

10. NER Data Generation Options

--generate-ner-data

--ner-include-all

--ner-aggregate-record

11. SLM / AI Options (Experimental)

--slm-map-entities

--slm-detector

--slm-detector-mode <mode>

--slm-prompt-version <version>

--slm-chunk-size <n>

--slm-anonymizer-chunk-size <n>

--slm-confidence-threshold <n>

--slm-context-window <n>

--slm-temperature <n>

12. Ollama Service Options (Experimental)

--no-auto-ollama

--ollama-docker-image <image>

--ollama-container-name <name>

--ollama-no-gpu

13. Quick Reference Table

`file_path`

`--list-entities`

`--list-languages`

`--lang <code>`

`--output-dir <path>`

`--overwrite`

`--no-report`

`--log-level <LEVEL>`

`--preserve-entities <TYPES>`

`--allow-list <TERMS>`

`--slug-length <n>`

`--anonymization-config <path>`

`--word-list <path>`

`--optimize`

`--use-cache` / `--no-use-cache`

`--max-cache-size <n>`

`--min-word-length <n>`

`--skip-numeric`

`--preserve-row-context`

`--json-stream-threshold-mb <n>`

`--regex-priority`

`--force-large-xml`

`--disable-gc`

`--anonymization-strategy <strategy>`

`--transformer-model <model>`

`--db-mode <mode>`

`--db-dir <path>`

`--db-synchronous-mode <mode>`

`--batch-size <n>`

`--csv-chunk-size <n>`

`--json-chunk-size <n>`

`--ner-chunk-size <n>`

`--nlp-batch-size <n>`

`--use-datasets`

`--generate-ner-data`

`--ner-include-all`

`--ner-aggregate-record`

`--slm-map-entities`

`--slm-detector`

`--slm-detector-mode <mode>`

`--slm-prompt-version <version>`

`--slm-chunk-size <n>`

`--slm-anonymizer-chunk-size <n>`

`--slm-confidence-threshold <n>`

`--slm-context-window <n>`

`--slm-temperature <n>`

`--no-auto-ollama`

`--ollama-docker-image <image>`

`--ollama-container-name <name>`

`--ollama-no-gpu`