25 changes: 25 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,31 @@ status quo, `2` once it is not backwards compatible. Entries are grouped under
the spec's current class. Every pull request that alters normative content adds
an entry here.

## OVOS-TRANSFORM-1 — Transformer Plugins

### 1

- Initial draft. Defines six transformer chains at six injection
points in the OVOS-PIPELINE-1 §6 utterance lifecycle, in lifecycle
order: audio (raw audio before STT, §3.1), utterance (post-STT text
normalization before intent matching, §3.2), metadata (session
enrichment after the utterance text, before the match round, §3.3),
intent (match-result adjustment after the match round, before
dispatch, §3.4), dialog (response-text transformation after a skill
emits `speak()`, before TTS, §3.5), and tts (synthesized-audio
transformation after TTS, before playback, §3.6). An orchestrator
MAY implement any subset of the six points; an unimplemented chain
is a no-op. Chains are ordered; the output of one transformer is the
input to the next. Per-session ordering and denylists are expressed via
the `<type>_transformers` / `blacklisted_<type>_transformers` session
fields (§5). Defines session mutation discipline: transformers MAY
mutate session fields they own (SESSION-1 §2.1) but MUST NOT mutate
fields owned by other specs. Defines utterance cancellation (§8) as the
only sanctioned early short-circuit of the lifecycle, preserving the
`ovos.utterance.handled` invariant. Conformance roles: Audio,
Utterance, Metadata, Intent, Dialog, and TTS Transformer, plus
Orchestrator.

## OVOS-INTENT-1 — Sentence Template Grammar

### 2
167 changes: 167 additions & 0 deletions appendix/rationale.md
@@ -380,6 +380,173 @@ the normative sections.
transformer types are natural producers of which
signals; consolidation is the consumer's decision per
SESSION-1 §3.2.7.
- **Why each injection point is the only point.**
Each of the six transformer chains exists at the *only*
lifecycle stage where its input artifact is available and
its class of mutation is possible:
- **Audio (§3.1)** — the only stage where unprocessed
audio exists. STT is information-lossy by design; it
preserves *what was said* and discards almost everything
about *how it was said*: prosody, acoustic language cues,
speaker characteristics, ambient context, sub-vocal
signals. Any concern that depends on the audio signal
itself — voice activity, acoustic language detection,
speaker identification, acoustic-event detection, noise
reduction for downstream STT accuracy — has exactly one
place to live.
- **Utterance (§3.2)** — the only stage where the user's
utterance exists as text but no semantic interpretation
has been committed to yet. Once intent matching runs,
the utterance is bound to a specific intent's
slot-and-vocabulary shape; any cross-cutting text
manipulation after that point would have to be
intent-aware. Mutations here therefore ripple uniformly
through every downstream stage and every intent engine —
normalize contractions once and every engine sees the
normalized form; translate Spanish to English once and
every English-trained engine becomes reachable.
- **Metadata (§3.3)** — the only stage where the joint
audio-plus-text signal is fully available, intent
matching has not yet committed, and the full
`Message.context` is in flight and mutable. Audio
transformers have no text and no session; utterance
transformers primarily mutate the utterance list; intent
transformers operate after match. Here a metadata
transformer can derive cross-cutting signals from the
joint audio+text material and make them available *once*
to every downstream stage, by writing wherever in
`Message.context` the consumers will look.
- **Intent (§3.4)** — the only stage that holds *both* the
resolved intent identity and the user's free-text capture
values. Before match, the intent is unknown — there's
nothing to enrich. After dispatch, the handler has
already been called — too late to add typed equivalents
or contextual fallbacks. The capture map is the universal
interface every engine produces (OVOS-INTENT-3 §7), so
enrichment here is engine-agnostic.
- **Dialog (§3.5)** — the only stage where the assistant's
response exists as *final text* — the skill has committed
to what to say but TTS has not committed to how it
sounds. Mutations here are language-aware, persona-aware,
and content-policy-aware in ways no later stage can be:
once the text is synthesized into audio, the
modifications available are audio-domain only.
- **TTS (§3.6)** — the only stage where the final response
exists as *synthesized audio bytes* — speech text has
been rendered to a waveform, but the waveform hasn't been
played yet. Audio-domain modifications belong here for
the same reason audio transformers belong pre-STT: this
is where the acoustic dimension exists and is mutable.

- **Canonical use cases, per injection point.**
- **Audio §3.1:** voice activity detection; audio language
detection (writing detected language into metadata for
downstream STT and intent stages to read); acoustic noise
reduction; format/sample-rate normalization.
- **Utterance §3.2:** text normalization (contractions,
casing, common typo correction); STT transcription
validation — dropping garbled candidates;
cancellation/stop-word detection; source-language
translation into the matching language; code-switching
cleanup.
- **Metadata §3.3:** caller/speaker identification written
to a top-level context key; mood/urgency/formality
classification from the joint signal; per-utterance
language override (combining audio-language detection with
utterance-language hint, writing the resolved language to
`session.lang`); per-utterance pipeline switch (detecting a
sensitive-query signal and swapping `session.pipeline`);
system context injection (writing entries to
`session.intent_context` for downstream pipeline plugins
and skills to read as gates, without round-tripping
through CONTEXT-1 §5 bus events).
- **Intent §3.4:** system entity injection — the canonical
use. Parse free-text capture values into typed system
entities (dates, numbers, durations, named locations,
ordinals) and add typed equivalents under
conventionally-named keys for skill handlers to consume
uniformly. This is OVOS-INTENT-1 §5.3's deferred value
typing; this chain is the agreed home for applying that
normalization globally so individual skills do not each
implement it. Also: named-entity recognition over capture
values; per-skill enrichment a deployer wants applied
without each skill re-implementing it.
- **Dialog §3.5:** translation to the user's preferred
language when it differs from the rendering language;
persona/tone rewriting; content moderation (profanity
filtering, sensitive-topic rephrasing); length
normalization for voice responses.
- **TTS §3.6:** voice effects (character voices, pitch
shifting, post-processing EQ); cross-fade or jingle
injection for branded assistants; format conversion for
downstream playback constraints.

- **Where LLMs fit, per injection point.**
- **Audio §3.1:** language identification is the typical
model-backed audio transformer; full LLMs do not run at
this stage in any practical deployment.
- **Utterance §3.2:** a natural injection point for
language models — a small local model validating STT
plausibility, a translation model producing a candidate
string in the assistant's primary language, a paraphrase
model adding alternative candidates so a downstream intent
engine has more material to match against.
- **Metadata §3.3:** a small classifier (LLM-backed or
otherwise) inferring conversational metadata from the
utterance and feeding the result into `Message.context` —
useful when several pipeline plugins or skills want to
read the same derived signal without each computing it
themselves. Also: an LLM that reads the utterance text
and decides per-utterance which `session.pipeline`
configuration to apply.
- **Intent §3.4:** the strongest fit for an LLM in the stack. A
small LLM can extract structured entities (dates,
durations, quantities) from free-text capture values and
inject the typed forms into `Match.captures` — once,
centrally — so every skill receives the same typed payload
regardless of which engine matched.
- **Dialog §3.5:** the most prominent LLM application —
response rewriting under a persona prompt. A `tone` or
`persona` directive on a dialog transformer routes the
skill's plain response through an LLM with a system
prompt, yielding the user-facing voice the assistant wants
to present. Translation models also live here for runtime
localization of skill-rendered text.
- **TTS §3.6:** not applicable in any practical sense; this
stage operates on audio bytes only.

- **Cross-cutting concerns are the architectural value.**
Transformer chains are how a voice OS layers cross-cutting
concerns — translation, normalization, entity tagging,
persona rewriting, audio filtering — onto the lifecycle
without each skill or pipeline plugin having to reinvent
them. The architectural value is *uniformity*: a
cross-cutting concern applied via a transformer chain
affects every utterance / response / artifact that flows
through that injection point, with no skill-side opt-in or
coordination required.

- **Cancellation in-spec use cases.** An utterance transformer
(§3.2) recognises a stop / cancel / never-mind cue in the
user's speech and wants the lifecycle to terminate without
reaching intent matching. A metadata or intent transformer
detects a condition under which the utterance should not be
acted on (a profanity filter rejecting unsafe input, a
sensitive-context guard halting in a parental-control mode,
a transcription-validator dropping garbage transcriptions).
A dialog or TTS transformer determines the response itself
should not be spoken (policy block, late content filter).

- **Introspection surface: no aggregate query.** There is
deliberately no "give me everything" query; that would
imply a single responder with a global view, which this
specification does not assume exists. A consumer that wants
all six types issues six queries.

- **Typical introspection consumers.** Developer tooling
surfacing the loaded set; monitoring services tracking chain
composition; integration tests asserting on chain order
under specific session policies.
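
As an illustration only, the ordered-chain, denylist, and cancellation behaviour discussed in this section could be sketched as follows. Everything here is an assumption made for the example: the `run_chain` helper, the transformer call signature, the `canceled` context flag, and the fallback to load order when the session carries no ordering are all hypothetical and are not defined by the normative text. Only the `<type>_transformers` / `blacklisted_<type>_transformers` field-name pattern is taken from the specification.

```python
from typing import Callable, Dict, List

# Hypothetical signature: a transformer receives the artifact (here, a list
# of utterance candidates) and a mutable context dict, and returns the
# possibly modified artifact.
Transformer = Callable[[List[str], dict], List[str]]


def run_chain(kind: str, artifact: List[str], context: dict,
              loaded: Dict[str, Transformer], session: dict) -> List[str]:
    """Run one chain; the output of each transformer feeds the next."""
    # Assumption: fall back to load order when the session sets no ordering.
    order = session.get(f"{kind}_transformers", list(loaded))
    denied = set(session.get(f"blacklisted_{kind}_transformers", []))
    for name in order:
        if name in denied or name not in loaded:
            continue  # denylisted or not installed: skip this step
        artifact = loaded[name](artifact, context)
        if context.get("canceled"):  # hypothetical cancellation flag
            break  # stop the chain early once cancellation is signalled
    return artifact


def expand_contractions(utterances: List[str], context: dict) -> List[str]:
    # Text normalization applied once, for every downstream engine.
    return [u.replace("what's", "what is") for u in utterances]


def cancel_on_never_mind(utterances: List[str], context: dict) -> List[str]:
    # Recognise a cancel cue and flag it; the text itself is left unchanged.
    if any("never mind" in u.lower() for u in utterances):
        context["canceled"] = True
    return utterances


def uppercase(utterances: List[str], context: dict) -> List[str]:
    return [u.upper() for u in utterances]


LOADED = {
    "expand_contractions": expand_contractions,
    "uppercase": uppercase,
    "cancel_on_never_mind": cancel_on_never_mind,
}
SESSION = {
    "utterance_transformers": ["expand_contractions", "uppercase",
                               "cancel_on_never_mind"],
    "blacklisted_utterance_transformers": ["uppercase"],
}
```

Calling `run_chain("utterance", ["what's the time"], {}, LOADED, SESSION)` skips the denylisted `uppercase` step, and a chain with no loaded transformers returns its input unchanged, which matches the "unimplemented chain is a no-op" rule. How a real orchestrator signals cancellation on the bus is governed by §8, not by this sketch.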

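The system entity injection use case named for the intent injection point (§3.4) can also be sketched in a few lines. This is a minimal sketch under stated assumptions, not the specification's interface: the `enrich_captures` function, the plain-dict shape of the capture map, and the `<name>__seconds` / `<name>__number` key convention are all hypothetical, since the text above says only that typed equivalents are added under conventionally-named keys.

```python
import re
from typing import Any, Dict

# Seconds per unit, for a deliberately tiny duration grammar.
_UNITS = {"second": 1, "minute": 60, "hour": 3600}
_DURATION = re.compile(r"^\s*([0-9]+)\s*(second|minute|hour)s?\s*$",
                       re.IGNORECASE)
_INTEGER = re.compile(r"^\s*([0-9]+)\s*$")


def enrich_captures(captures: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the capture map with typed equivalents added.

    The original free-text values are preserved; typed values are added
    under hypothetical `<name>__seconds` and `<name>__number` keys.
    """
    enriched = dict(captures)
    for name, value in captures.items():
        if not isinstance(value, str):
            continue  # only free-text captures are parsed
        duration = _DURATION.match(value)
        integer = _INTEGER.match(value)
        if duration:
            amount = int(duration.group(1))
            unit = duration.group(2).lower()
            enriched[f"{name}__seconds"] = amount * _UNITS[unit]
        elif integer:
            enriched[f"{name}__number"] = int(integer.group(1))
    return enriched
```

Because the function reads only the capture map and never asks which engine produced it, the same enrichment applies regardless of which engine matched, which is the engine-agnostic property the rationale relies on. Captures it cannot parse, such as a city name, pass through untouched.
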
### 4.8 Stop pipeline plugin (STOP-1)
