Skip to content

Fix codegen build (toml 0.9 broke things), improve v_htmlescape behaviour (don’t encode slash)#164

Closed
chris-morgan wants to merge 6 commits intozzau13:masterfrom
chris-morgan:master
Closed

Fix codegen build (toml 0.9 broke things), improve v_htmlescape behaviour (don’t encode slash)#164
chris-morgan wants to merge 6 commits intozzau13:masterfrom
chris-morgan:master

Conversation

@chris-morgan
Copy link
Copy Markdown

Read the commit messages for more detail.

Why the v_escape_codegen@0.1.9 → 0.1.8? Dunno, haven’t delved.

Dependency version specifier "0" is always wrong.
In this case, toml 0.8 could parse an entire TOML file as Value,
but toml 0.9 fixed that obvious wrongness, insisting you use Table.

Instead of changing the version to 0.8, I updated it to 0.9.
I reckon this improves things quite a bit.
Still messy, but I find it noticeably easier to follow.

(I still dislike rustfmt, makes it worse.)
Also ran cargo fmt, since that minimises the {src,tests}/lib.rs diffs.
(If that’s to be done, then I ask what the point of prettyplease is.)
This was always a mistake; nothing has *ever* required it.
This was one of the worse problems with OWASP’s XSS prevention cheat sheet,
a thoroughly bad document that was bad when it was written around 2010,
and became worse as edits were made to it,
though some edits in 2020–2023 finally improved it a little.
Details *were* in <OWASP/CheatSheetSeries#515>,
but that issue has been deleted, and the Wayback Machine didn’t have it.
Sigh. I don’t like OWASP because of things like this.

This should be considered a breaking change,
because some people will have tests depending on the wonky behaviour.
  • HTML 3.2 (January 1997) lacked &quot; and &apos;.
  • HTML 4 (December 1997) had &quot; but lacked &apos;.
  • XML 1.0 (February 1998) had both &quot; and &apos;.
  • HTML 5 (January 2008) added &apos;.
  • IE 8 (March 2009) was the last browser that lacked &apos;.
    By that time everyone else had been doing this HTML 5 thing for a while,
    and Microsoft followed suit in IE 9.

Frankly, I don’t like apostrophe being encoded;
I would declare double-quoted attribute values the One True Form,
rejecting single-quoted attribute values,
just like unquoted attribute values are rejected by libraries like this.
But that would be a bit too drastic a change to make at this stage.

Another alternative is to use &zzau13#39;, which is shorter.

This should again be considered a breaking change.
And a slightly more serious one than stopping escaping slash,
because it *will* actually break IE≤8.
@zzau13
Copy link
Copy Markdown
Owner

zzau13 commented Apr 25, 2026

Thanks for the PR @chris-morgan, and sorry for the late reply.

After thinking this over I'm going to go in a different direction, so I'll be closing this one. Here's the plan and reasoning:

Two crates instead of changing the existing behaviour in place

Rather than mutating v_htmlescape's output, I want to ship two HTML-escape libraries side by side:

  1. One preserving the historical 6-char set (<, >, &, ", ', /) with &#x27; — current v_htmlescape behaviour, kept for backward compatibility.
  2. One following the current OWASP XSS Prevention Cheat Sheet, Rule #1 — 5-char set, no /.

Notes on the entity choices in this PR:

  • Dropping &#x2f; for /: agreed — matches current OWASP guidance. This will be the behaviour of crate add support for character in hexadecimal, octal and numbers ascii #2.
  • &apos; for ': I'd push back on this. &apos; is not defined in HTML 4.01 (only XML/XHTML); OWASP explicitly recommends &#x27; / &#39; for cross-parser safety. Both crates will keep &#x27;.

Codegen rework first

Before adding the second crate I want to land some codegen improvements that have been pending:

  1. Add the OR operation to the set of operations the codegen can pick from — the current search space misses patterns that benefit from a fused-OR mask.
  2. Backtracking-based operation generator, replacing the current greedy selection with a real search over valid operation sequences.
  3. Minimax-style scoring to pick the best sequence for a given input set (worst-case cost as the metric, so we don't regress the hot path while improving the average).

The toml 0.9 fix in 9529be0 is independently useful — I'll bring it in (with attribution) when I land the codegen rework so it doesn't get lost.

TL;DR — closing in favour of: (1) codegen rework (OR op + backtracking + minimax selection), then (2) two coexisting HTML-escape crates (legacy 6-char + modern 5-char OWASP), both keeping &#x27;. Thanks again — the slash change and the toml fix were on the right track.

@zzau13 zzau13 closed this Apr 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants