LeJIT enforces logic formulas during autoregressive LM generation of network key-value records. It combines:
- NetNomos dataset schemas and rule artifacts
- Hugging Face causal language models
- Z3-backed stepwise constraint checking during decoding
LeJIT is designed for records such as:
- NetFlow-style flow records
- packet-header windows derived from PCAP traces
- pre-aggregated telemetry and network logs
The Python package and the CLI entry point are both named lejit.
@inproceedings{he2025lejit,
title={Just-in-Time Logic Enforcement: A new paradigm of combining statistical and symbolic reasoning for network management},
author={H{\`e}, Hongyu and Apostolaki, Maria},
booktitle={Proceedings of the 24th ACM Workshop on Hot Topics in Networks (HotNets 25)},
pages={184--192},
year={2025}
}
LeJIT trains a causal language model (LM) over a serialized representation of network records and filters every generation step against a formal logic theory.
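To make the per-step filtering concrete, here is a minimal sketch of one constrained decoding step. It is not LeJIT's implementation: a plain Python predicate stands in for the Z3-backed check, the logits are a hand-written dictionary rather than LM output, and the names allowed_tokens, constrained_step, and bytes_at_least_packets are hypothetical.

```python
import math
import random


def allowed_tokens(prefix, candidates, is_consistent):
    """Return the candidates that keep the partial record rule-consistent."""
    return [tok for tok in candidates if is_consistent(prefix + [tok])]


def constrained_step(prefix, logits, is_consistent, rng):
    """Sample one next token from `logits` after masking illegal tokens.

    `logits` maps candidate token -> unnormalized score (a stand-in for LM output).
    """
    legal = allowed_tokens(prefix, list(logits), is_consistent)
    if not legal:
        raise ValueError("no candidate token satisfies the rules")
    # Softmax restricted to the legal tokens; rng.choices normalizes the weights.
    weights = [math.exp(logits[tok]) for tok in legal]
    return rng.choices(legal, weights=weights, k=1)[0]


def bytes_at_least_packets(values):
    """Hypothetical rule over [Packets, Bytes]: Bytes >= Packets once both are set."""
    return len(values) < 2 or values[1] >= values[0]


rng = random.Random(0)
next_value = constrained_step(
    [10], {5: 2.0, 10: 0.1, 20: 0.5}, bytes_at_least_packets, rng
)
print(next_value)  # 10 or 20; 5 is masked out although it has the highest score
```

The point of the sketch is that the model's preference is only consulted among tokens the rules still allow, so an invalid value can never be emitted, whatever its score.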
The typical workflow is:
- Describe the dataset with a NetNomos dataset schema JSON file.
- Provide a rule artifact in NetNomos rule format.
- Train a LeJIT model bundle with lejit train.
- Generate new records with lejit generate, or complete record prefixes/prompts with lejit complete.
LeJIT reuses NetNomos dataset schemas. A dataset schema defines:
- where the input data comes from
- what fields exist and what their value types are
- preprocessing steps such as renaming, mapping, casting, filtering, and hex parsing
- context windows for packet-local or sequence-local reasoning
- derived variables such as interarrival statistics
LeJIT consumes NetNomos rule artifacts from rules.json. Each rule is a structured logic formula,
not a free-form string. The rule bundle can be learned with NetNomos or supplied manually in the
same JSON format.
lejit train writes a self-contained bundle directory with:
- model/: Hugging Face model weights and config
- config.json: resolved LeJIT config
- dataset_spec.json: embedded dataset schema
- rules.json: embedded rule artifact
- metadata.json: rule metadata when present
- schema.json: LeJIT field serialization schema
- vocab.json: LeJIT vocabulary
- manifest.json: summary metadata
Generation and completion load their model, schema, and embedded rules from the bundle.
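Before pointing generate or complete at a bundle, it can help to confirm the directory is complete. The following is a small sketch, not part of LeJIT: missing_bundle_entries is a hypothetical helper, and treating metadata.json as optional is an assumption based on the "when present" wording above.

```python
from pathlib import Path

# Entries named in the bundle layout above. metadata.json is left out because
# it is only written when rule metadata is present (assumed optional here).
REQUIRED_ENTRIES = (
    "model",
    "config.json",
    "dataset_spec.json",
    "rules.json",
    "schema.json",
    "vocab.json",
    "manifest.json",
)


def missing_bundle_entries(bundle_dir):
    """Return the required bundle entries that do not exist under bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in REQUIRED_ENTRIES if not (root / name).exists()]


# An empty list means every required entry was found.
print(missing_bundle_entries("artifacts/cidds-model"))
```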
- Python >=3.10
- uv for environment management and command execution
- A local checkout of NetNomos, because this project depends on the netnomos Python package via a local uv source
The current pyproject.toml expects NetNomos to live beside this repository
for dependency resolution:
<workspace>/
├── LeJIT/
└── NetNomos/
git clone https://github.com/HongyuHe/LeJIT
git clone https://github.com/HongyuHe/NetNomos
cd LeJIT
uv sync

If your NetNomos checkout is not at ../NetNomos, update [tool.uv.sources] in pyproject.toml.
uv run lejit --help
uv run pytest -q

- local datasets: data/
- local dataset schemas: data/specs/
- local rules: rules/
- preset LeJIT configs: configs/
- thin dataset wrappers: scripts/
- library code: lejit/
The shipped configs currently use local paths for all dataset assets:
- dataset_spec points into data/specs/
- input_path points into data/
- rules_path points into rules/
NetNomos is still needed at install time because the Python package is imported directly by LeJIT.
LeJIT builds models with Hugging Face's transformers.AutoModelForCausalLM in lejit/modeling.py. In practice, this means:
- supported: causal language model architectures available through AutoModelForCausalLM.from_config(...) or AutoModelForCausalLM.from_pretrained(...)
- intended scope: decoder-style next-token models
- not supported: encoder-only masked language models, translation-style seq2seq models, or models that are not exposed through AutoModelForCausalLM
Examples of compatible decoder-only checkpoints include:
- gpt2
- distilgpt2
- EleutherAI/gpt-neo-125M
- EleutherAI/gpt-j-6B
- facebook/opt-125m
- etc.
For model.mode = "config", common architecture values include:
- "gpt2"
- "gpt_neo"
- "gptj"
- "opt"
- etc.
Other decoder-only families such as LLaMA-style or Mistral-style models may also work when they
are available through the Hugging Face AutoModelForCausalLM registry, but the most direct
starting point in this repository is still a GPT-2-family or GPT-Neo/OPT-family configuration.
Important implementation detail:
- LeJIT does not use a Hugging Face tokenizer
- it builds its own vocabulary from serialized field/value tokens
- the model embedding matrix is resized to that LeJIT vocabulary
Because of that, model.mode = "config" is the most natural starting point for training.
model.mode = "pretrained" is supported when you intentionally want to initialize from an
existing causal LM architecture before resizing embeddings to the LeJIT vocabulary.
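The idea of a vocabulary built from serialized field/value tokens can be illustrated with a toy sketch. Everything specific here is an assumption for illustration: the Field=value token format, the special tokens, and the build_vocab helper are hypothetical and do not describe LeJIT's actual serialization.

```python
def build_vocab(records, field_order, specials=("<pad>", "<bos>", "<eos>")):
    """Assign an integer id to every distinct field/value token.

    Token format "Field=value" is a hypothetical choice for this sketch.
    """
    vocab = {token: index for index, token in enumerate(specials)}
    for record in records:
        for field in field_order:
            token = f"{field}={record[field]}"
            # New tokens get the next free id; known tokens keep theirs.
            vocab.setdefault(token, len(vocab))
    return vocab


records = [
    {"Proto": "TCP", "Bytes": 40},
    {"Proto": "UDP", "Bytes": 40},
]
vocab = build_vocab(records, ["Proto", "Bytes"])
print(len(vocab))  # 6: three special tokens plus Proto=TCP, Bytes=40, Proto=UDP

# With a vocabulary like this, the embedding matrix has to match len(vocab).
# In transformers, PreTrainedModel.resize_token_embeddings(len(vocab)) is one
# way to do that; whether LeJIT calls exactly this is an assumption.
```

Because the vocabulary depends on the dataset rather than on a pretrained tokenizer, pretrained embeddings carry little over, which is consistent with model.mode = "config" being the natural starting point.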
Each CLI command takes a TOML config parsed by
LeJITConfig. The top-level sections are:
[dataset] options
- dataset_spec: path to a NetNomos dataset schema JSON file
- input_path: path to the raw dataset file; overrides source.path from the dataset schema
- rules_path: path to the NetNomos rules.json artifact
- limit: optional row or packet limit for smoke tests
[model] options
- mode: "config" or "pretrained"
- architecture: model type key used when building from config, such as "gpt2"
- name_or_path: required in "pretrained" mode; optional in "config" mode when loading a base config from an existing checkpoint
- config_overrides: model config overrides such as n_positions, n_embd, n_layer, and n_head
[serialization] options
- field_order: optional explicit field order for serialization and prompting
- max_categorical_domain: optional hard stop for very large categorical domains
- force_string_fields: reserved for future schema coercion
- numeric_precision: decimal precision used when rendering real-valued fields
[training] options
- epochs
- batch_size
- learning_rate
- weight_decay
- warmup_ratio
- gradient_accumulation_steps
- max_steps
- seed
- logging_steps
- save_steps
[decoding] options
- max_new_tokens
- temperature
- top_k
- top_p
- do_sample
- backtrack_limit
- num_return_sequences
[run] options
- n_samples: default sample count for generate
- batch_size: currently reserved for higher-level orchestration
- samples_per_prompt: default fan-out for complete
- prompt_columns: reserved for future prompt presets
Show command summary
usage: lejit [-h] {train,generate,complete} ...

Train and run LeJIT.

positional arguments:
  {train,generate,complete}
    train     Train a LeJIT bundle.
    generate  Generate constrained rows.
    complete  Complete prefix prompts.
Purpose:
- load the dataset through NetNomos preprocessing
- derive the LeJIT serialization schema and vocabulary
- build and train a causal LM
- save a self-contained bundle
Options
- --config: required TOML config
- --output: required bundle directory
Example
uv run lejit train \
  --config configs/cidds/train.toml \
  --output artifacts/cidds-model

Purpose:
- load a trained bundle
- instantiate the embedded schema, vocabulary, and rule theory
- generate full records under rule enforcement
Options
- --config: required TOML config
- --model-bundle: required trained bundle
- --output: required CSV output path
- --n-samples: optional override for [run].n_samples
- --device: device string such as cpu, cuda, or mps
Behavior note:
- the bundle supplies the trained model, schema, and embedded rule artifact
- the external config is mainly useful for decoding and run defaults at generation time
Example
uv run lejit generate \
  --config configs/netflix/generate.toml \
  --model-bundle artifacts/netflix-model \
  --output out/netflix.csv \
  --n-samples 1000 \
  --device cpu

Purpose:
- load a trained bundle
- accept prompt records from CSV
- complete the remaining fields under rule enforcement
Options
- --config: required TOML config
- --model-bundle: required trained bundle
- --prompts: required CSV file
- --output: required CSV output path
- --device: device string such as cpu, cuda, or mps
- --samples-per-prompt: optional override for [run].samples_per_prompt
Important prompt contract:
Prompt prefix rules
- prompt columns must be a strict left prefix of the active serialization order
- if the schema order is [A, B, C, D], valid prompts may contain [A], [A, B], or [A, B, C]
- prompts like [B, C] are rejected
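The prefix contract can be checked up front with a few lines of Python. This is a sketch of the rule as stated above, not LeJIT's own validation code: is_valid_prompt is a hypothetical helper, and rejecting empty prompts and prompts that already cover every field is an assumption drawn from the examples.

```python
def is_valid_prompt(prompt_columns, field_order):
    """Return True if prompt_columns is a non-empty strict left prefix of field_order."""
    k = len(prompt_columns)
    if k == 0 or k >= len(field_order):
        return False  # assumed: at least one field given, at least one left to complete
    # Names and order must both match the first k fields.
    return list(prompt_columns) == list(field_order[:k])


order = ["A", "B", "C", "D"]
print(is_valid_prompt(["A", "B"], order))  # True
print(is_valid_prompt(["B", "C"], order))  # False: does not start at the left
print(is_valid_prompt(["B", "A"], order))  # False: right fields, wrong order
```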
Example
uv run lejit complete \
  --config configs/metadc/complete.toml \
  --model-bundle artifacts/metadc-model \
  --prompts prompts/metadc_prefixes.csv \
  --output out/metadc_completed.csv \
  --samples-per-prompt 4 \
  --device cpu

The repository also ships thin wrappers such as:
- scripts/generate_cidds.py
- scripts/complete_cidds.py
- scripts/generate_netflix.py
- scripts/complete_netflix.py
These wrappers only preselect a config and forward to the main CLI.
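As a rough picture of what "preselect a config and forward" means, a wrapper could look like the sketch below. This is a guess at the shape, not the contents of the shipped scripts: build_command and PRESET_CONFIG are hypothetical names, and the real wrappers may call the CLI in-process instead of building a command line.

```python
import sys

# Config path taken from the generate example above.
PRESET_CONFIG = "configs/netflix/generate.toml"


def build_command(extra_args):
    """Prepend the preset config to whatever arguments the user passed."""
    return ["uv", "run", "lejit", "generate", "--config", PRESET_CONFIG, *extra_args]


if __name__ == "__main__":
    # Printing keeps the sketch side-effect free; a real wrapper would execute
    # the command, for example with subprocess.run(command, check=True).
    print(" ".join(build_command(sys.argv[1:])))
```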
The main entry points are exported from lejit/__init__.py:
- LeJITConfig
- LeJITPipeline

from lejit import LeJITConfig
config = LeJITConfig.from_toml("configs/cidds/train.toml")

LeJITConfig.from_toml(...) validates the full TOML file with Pydantic and returns a structured object you can inspect or modify in code.
from lejit import LeJITConfig, LeJITPipeline
config = LeJITConfig.from_toml("configs/cidds/train.toml")
pipeline = LeJITPipeline.build_from_config(
    config,
    base_dir="configs/cidds",
)
bundle_dir = pipeline.train("artifacts/cidds-model")

LeJITPipeline.build_from_config(...) does the following:
- resolves config-relative paths
- loads the dataset via NetNomos
- loads and validates the rule artifact
- derives the LeJIT schema and vocabulary
- builds a Hugging Face causal LM
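The first step, resolving config-relative paths, can be sketched as follows. It assumes relative paths in the TOML file are interpreted against base_dir and absolute paths are left alone; resolve_config_path is a hypothetical helper, and POSIX-style paths are used to keep the example deterministic.

```python
import posixpath


def resolve_config_path(base_dir, value):
    """Resolve a path from the config against base_dir unless it is absolute."""
    if posixpath.isabs(value):
        return value
    # normpath collapses the "../.." segments after joining.
    return posixpath.normpath(posixpath.join(base_dir, value))


print(resolve_config_path("configs/my_dataset", "../../data/my_flows.csv"))
# data/my_flows.csv
```

Under that assumption, a config stored in configs/my_dataset/ reaches the repository-level data/ and rules/ directories with a ../../ prefix.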
pipeline.train(...) then:
- serializes each prepared record into a LeJIT token sequence
- trains the model with transformers.Trainer
- saves a bundle directory

from lejit import LeJITPipeline
pipeline = LeJITPipeline.load("artifacts/cidds-model", device="cpu")

LeJITPipeline.load(...) restores:
- the trained model
- the saved vocabulary
- the saved LeJIT schema
- the embedded dataset schema and rule artifact
frame = pipeline.generate(
    n_samples=100,
    device="cpu",
)

Return type: pandas.DataFrame
import pandas as pd
prompts = pd.read_csv("prompts.csv")
completed = pipeline.complete(
    prompts,
    samples_per_prompt=4,
    device="cpu",
)

Return type: pandas.DataFrame
complete(...) enforces the same prefix rule as the CLI: prompt columns must match the leftmost
portion of the active field order.
Put the raw data file under data/ or another location reachable by your config.
Examples:
- data/my_flows.csv
- data/my_trace.pcap
- data/my_logs.csv
Create a NetNomos dataset schema JSON file. You may:
- keep it under data/specs/
- or store it elsewhere and point dataset_spec at that path
A minimal CSV example looks like:
{
  "name": "my_dataset",
  "source": {
    "type": "csv",
    "path": "data/my_flows.csv"
  },
  "fields": [
    {"name": "SrcIp", "value_type": "categorical", "roles": ["src", "identifier"]},
    {"name": "DstIp", "value_type": "categorical", "roles": ["dst", "identifier"]},
    {"name": "Bytes", "value_type": "integer", "roles": ["size", "measurement"]}
  ]
}

Add the richer NetNomos fields you need:
- preprocessing
- context_window
- derived_variables
- explicit domain or bounds
- semantic roles
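A quick way to sanity-check a schema file before training is to load it and group the fields by value type. The sketch below uses only the keys shown in the minimal CSV example (fields, name, value_type); fields_by_value_type is a hypothetical helper, not a NetNomos or LeJIT function.

```python
import json
from collections import defaultdict

# The minimal schema from the example, inlined so the sketch runs on its own.
SCHEMA_TEXT = """
{
  "name": "my_dataset",
  "source": {"type": "csv", "path": "data/my_flows.csv"},
  "fields": [
    {"name": "SrcIp", "value_type": "categorical", "roles": ["src", "identifier"]},
    {"name": "DstIp", "value_type": "categorical", "roles": ["dst", "identifier"]},
    {"name": "Bytes", "value_type": "integer", "roles": ["size", "measurement"]}
  ]
}
"""


def fields_by_value_type(schema):
    """Group field names by their declared value_type."""
    grouped = defaultdict(list)
    for field in schema["fields"]:
        grouped[field["value_type"]].append(field["name"])
    return dict(grouped)


schema = json.loads(SCHEMA_TEXT)
print(fields_by_value_type(schema))
# {'categorical': ['SrcIp', 'DstIp'], 'integer': ['Bytes']}
```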
Create a rule directory such as rules/my_dataset/ containing:
- rules.json
- metadata.json
- optionally interpreted_rules.clj
rules.json must follow the NetNomos learned-rule format. Each entry is a structured formula with
fields like:
- rule_id
- formula
- display
- support
- source
LeJIT validates rule references against the prepared dataset fields before training starts.
Create:
- configs/my_dataset/train.toml
- configs/my_dataset/generate.toml
- configs/my_dataset/complete.toml
Minimal training config:
Show example train.toml
[dataset]
dataset_spec = "../../data/specs/my_dataset.json"
input_path = "../../data/my_flows.csv"
rules_path = "../../rules/my_dataset/rules.json"
[model]
mode = "config"
architecture = "gpt2"
[model.config_overrides]
n_positions = 512
n_ctx = 512
n_embd = 256
n_layer = 6
n_head = 8
[serialization]
numeric_precision = 6
max_categorical_domain = 50000
[training]
epochs = 3
batch_size = 16
learning_rate = 0.0005
seed = 42
[decoding]
temperature = 1.0
do_sample = true
top_p = 0.95
[run]
n_samples = 100
samples_per_prompt = 1

Pay attention to:
- field_order: controls serialization order and therefore prompt legality in complete.
- max_categorical_domain: use this to catch exploding categorical vocabularies early.
- numeric_precision: affects how real-valued fields are rendered into fixed-width token streams.
If a categorical field is too large, consider:
- preprocessing it into a coarser representation
- mapping values into semantic classes
- using a bounded integer encoding in the dataset schema
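To find out which categorical fields would trip max_categorical_domain before starting a training run, you can count distinct values per field. This is an illustrative pre-check written against plain dictionaries, not the check LeJIT performs internally; oversized_categoricals is a hypothetical helper.

```python
from collections import defaultdict


def oversized_categoricals(rows, categorical_fields, max_domain):
    """Return {field: distinct_count} for fields with more than max_domain values."""
    seen = defaultdict(set)
    for row in rows:
        for field in categorical_fields:
            seen[field].add(row[field])
    return {
        field: len(values)
        for field, values in seen.items()
        if len(values) > max_domain
    }


rows = [
    {"SrcIp": "10.0.0.1", "Proto": "TCP"},
    {"SrcIp": "10.0.0.2", "Proto": "TCP"},
    {"SrcIp": "10.0.0.3", "Proto": "UDP"},
]
print(oversized_categoricals(rows, ["SrcIp", "Proto"], max_domain=2))
# {'SrcIp': 3}
```

Fields reported here are the candidates for the coarsening options listed above.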
uv run lejit train \
  --config configs/my_dataset/train.toml \
  --output artifacts/my_dataset-model
uv run lejit generate \
  --config configs/my_dataset/generate.toml \
  --model-bundle artifacts/my_dataset-model \
  --output out/my_dataset.csv \
  --n-samples 10

If you want completion:
- inspect the active field order
- create a prompt CSV containing only the leftmost k fields
- keep the exact column order
Example:
- if the field order is [rackid, hostid, IngressBytesAgg, EgressBytesAgg]
- valid prompt CSV columns are:
  - [rackid]
  - [rackid, hostid]
  - [rackid, hostid, IngressBytesAgg]
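A prompt CSV with exactly the leftmost k fields can be produced from full records with the standard csv module. The helper below is a hypothetical sketch; the record values are made up for illustration and the field order is the one from the example above.

```python
import csv
import io


def write_prompt_csv(records, field_order, k, handle):
    """Write only the first k fields of each record, in serialization order."""
    columns = list(field_order[:k])
    writer = csv.DictWriter(
        handle,
        fieldnames=columns,
        extrasaction="ignore",  # silently drop the fields to be completed
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return columns


field_order = ["rackid", "hostid", "IngressBytesAgg", "EgressBytesAgg"]
records = [
    {"rackid": 1, "hostid": 2, "IngressBytesAgg": 300, "EgressBytesAgg": 400},
]
buffer = io.StringIO()
write_prompt_csv(records, field_order, 2, buffer)
print(buffer.getvalue())
```

Slicing field_order rather than picking column names by hand keeps the column order identical to the serialization order, which is what the prefix rule requires.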
The repository currently ships preset configs for: