diff --git a/CHANGELOG.md b/CHANGELOG.md index 00f6800..511e0d2 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,6 +6,27 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), ## [Unreleased] +## [3.2.0] + +Streaming batch parsing, severity classification for validation errors, RFC 5322 §4.4 obs-route support, and broader CFWS tolerance around addr-spec boundaries. All additions are non-breaking for v3.1 callers. + +### Added +- `Parse::parseStream(iterable, string): Generator` — lazy batch parsing that yields one typed address at a time, reducing memory footprint for large inputs (CSV rows, pipelines, etc.). Each input item may itself contain multiple separator-delimited addresses. +- `ValidationSeverity` backed enum with `Critical`, `Warning`, `Info` cases. Callers can distinguish structural parse failures (Critical) from policy violations where the address is syntactically well-formed (Warning) to accept soft failures in non-SMTP contexts. +- `ParseErrorCode::severity(): ValidationSeverity` — every error code is now classified. 13 codes are Warning (UTF-8 rejection, C0/C1 controls, empty-quoted, FQDN requirement, IP global-range, length limits, punycode conversion); all others are Critical. +- `ParsedEmailAddress::invalidSeverity(): ?ValidationSeverity` — derived from `invalidReasonCode`; returns `null` when the address is valid. +- RFC 5322 §4.4 obs-route support: `<@host1,@host2:user@host3>` source-route prefixes are recognized and stripped; the real addr-spec becomes the parsed address. The route string is captured on `ParsedEmailAddress::$obsRoute`. Gated by `ParseOptions::$allowObsRoute` (default `false`; enabled in `rfc5322()` and `rfc2822()`). +- `ParseOptions::$allowObsRoute` property and `withAllowObsRoute()` fluent builder. +- `obs_route` field on the array output of `Parse::parse()` (populated when an obs-route is consumed; `null` otherwise). 
+ +### Changed +- RFC 5322 §3.2.2 CFWS: folding whitespace is now absorbed at dot-atom boundaries and around angle-addr delimiters via look-ahead in the whitespace handler. Previously-rejected inputs like `local @domain.com`, `local@ domain.com`, `< local@domain.com >`, `<local @ domain.com>`, and multi-line folded whitespace now parse successfully. +- Parser internal: added `STATE_OBS_ROUTE` state for absorbing obs-route prefixes; added `in_angle_addr` and `obs_route` tracking fields to the internal email-address accumulator. +- `composer stan` now runs with `--memory-limit=512M` to accommodate the larger codebase. + +### Fixed +- None — no behavior regressions; only additions and tolerance expansions. + ## [3.1.0] Immutable `ParseOptions`, typed value-object output, structured error codes, and two new validation rules. All additions are non-breaking for v3.0 callers; readonly rule properties are a hard cutover for code that was mutating them directly (the factory methods and deprecated setters continue to work). diff --git a/README.md b/README.md index c17396a..a208a14 100644 --- a/README.md +++ b/README.md @@ -45,6 +45,12 @@ if ($address->invalid) { $result = Parse::getInstance()->parseMultiple('a@a.com, b@b.com'); foreach ($result->emailAddresses as $addr) { /* ... */ } + +// Streaming for large batches (v3.2+) — yields one address at a time. +foreach (Parse::getInstance()->parseStream($csvRows) as $addr) { + if ($addr->invalid) continue; + // ...
+} ``` ### Advanced Usage with ParseOptions @@ -166,6 +172,7 @@ $parser = new Parse(null, $options); | `applyNfcNormalization` | `false` | Apply NFC Unicode normalization (RFC 6532 §3.1) | | `validateDisplayNamePhrase` | `false` | Enforce RFC 5322 §3.2.5 phrase syntax on unquoted display names | | `strictIdna` | `false` | Apply full IDNA2008 conformance on U-label domains (RFC 5891/5892/5893) | +| `allowObsRoute` | `false` | Accept RFC 5322 §4.4 obs-route source-routes like `<@host1,@host2:user@host3>` | | **Length & Output** | | | | `enforceLengthLimits` | `true` | Enforce RFC 5321 length limits (64/254/63) | | `includeDomainAscii` | `false` | Include punycode `domain_ascii` in output | diff --git a/ROADMAP.md b/ROADMAP.md index 804ee76..845a398 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -38,22 +38,24 @@ Future plans by version. Items here are intent, not commitment — priority and - [x] `strictIdna: bool` — apply full IDNA2008 conformance (`IDNA_USE_STD3_RULES | IDNA_CHECK_BIDI | IDNA_CHECK_CONTEXTJ | IDNA_NONTRANSITIONAL_TO_ASCII`) per RFC 5891/5892/5893. Enabled by default in `rfc6531()`. - [x] Extended test coverage: 265 assertions (target: 250+). -## v3.2 — Streaming, Severity Levels, Obsolete Syntax +## v3.2 — Streaming, Severity Levels, Obsolete Syntax — shipped **Batch streaming:** -- [ ] `parseStream(iterable): Generator` — yield `ParsedEmailAddress` one at a time for large email lists, reducing memory footprint. +- [x] `Parse::parseStream(iterable, string): Generator` — yields one typed address at a time; each input item may itself contain multiple separator-delimited addresses. **Validation severity levels:** -- [ ] Add a `ValidationSeverity` enum (`Critical`, `Warning`, `Info`) attached to each parsed address — allows callers to accept "soft" failures while rejecting hard ones. +- [x] `ValidationSeverity` enum with `Critical`, `Warning`, `Info` cases. +- [x] `ParseErrorCode::severity()` method classifying every code (13 Warning, rest Critical). 
+- [x] `ParsedEmailAddress::invalidSeverity()` accessor returning the derived severity (or `null` when valid). **Obsolete syntax extensions (RFC 5322 §4):** -> Note: `obs-local-part` is already supported via `allowObsLocalPart` in v3.0. The items below cover the remaining obsolete forms. +> Note: `obs-local-part` was already supported via `allowObsLocalPart` in v3.0. -- [ ] `obs-route` handling for the `rfc5322()` preset. -- [ ] CFWS (comments / folding whitespace) improvements. -- [ ] `obs-angle-addr` support. -- [ ] `obs-domain-list` syntax for the `rfc2822()` preset. +- [x] `obs-route` handling — `ParseOptions::$allowObsRoute` gates acceptance of `<@host1,@host2:user@host3>` source-route prefixes; the route is captured on `ParsedEmailAddress::$obsRoute`. Enabled by default in `rfc5322()` and `rfc2822()`. +- [x] `obs-angle-addr` — implied by obs-route support (it is the outer `[CFWS] "<" obs-route addr-spec ">" [CFWS]` form). +- [x] `obs-domain-list` — the `*("," [CFWS] ["@" domain])` shape is consumed inside `STATE_OBS_ROUTE`. +- [x] CFWS (comments / folding whitespace) improvements — look-ahead in the whitespace handler now absorbs CFWS at dot-atom boundaries (`local @domain`, `local@ domain`, `local @ domain`) and around angle-addr delimiters (`< local@domain >`, `<local @ domain>`), including folded whitespace (LF + WSP). Comments in these positions were already supported in v3.0. ## v4.0 — Breaking Modernization diff --git a/UPGRADE.md b/UPGRADE.md index 9d5e28c..1c9b055 100644 --- a/UPGRADE.md +++ b/UPGRADE.md @@ -1,5 +1,44 @@ # Upgrade Guide +## v3.1 → v3.2 + +v3.2 is fully additive — no breaking changes.
Two behavior changes are worth noting for callers who depended on them: + +### Behavior Changes (Tolerance Expansions) + +**CFWS around `@` and inside `<…>` is now accepted.** The v3.1 parser rejected these inputs as "Email address contains whitespace"; v3.2 treats them as RFC 5322 §3.2.2 folding whitespace: + +```php +// All of these now parse successfully (v3.2+): +'local @domain.com' // trailing CFWS on local-part +'local@ domain.com' // leading CFWS on domain +'local @ domain.com' // both +'< local@domain.com >' // inside angle-addr +'<local @ domain.com>' // both, inside angle-addr +"local\n\t@domain.com" // folded whitespace +``` + +If your code validated that addresses are "tight" (no whitespace), re-check with the v3.2 definition — these now register as `invalid=false`. + +**Obs-route `<@host:addr>` is accepted in `rfc5322()` and `rfc2822()` presets.** Previously rejected as "Invalid character in domain"; now recognized, stripped, and the real addr-spec is exposed. The captured route is available as `$parsed->obsRoute`. Disabled in `rfc5321()` and legacy defaults — no change there. To opt out, call `->withAllowObsRoute(false)` on the preset. + +### Additions (Non-Breaking) + +- **`Parse::parseStream(iterable, string): Generator`** — lazy batch parsing. Use it for large inputs where holding every `ParsedEmailAddress` in memory is undesirable. +- **`ValidationSeverity` enum** — `Critical` / `Warning` / `Info`. Access via `$parsed->invalidSeverity()` or `$errorCode->severity()`. Use it to distinguish "unparseable" from "policy-rejected but well-formed": + ```php + if ($parsed->invalid && $parsed->invalidSeverity() === ValidationSeverity::Warning) { + // Well-formed address rejected by a configured rule (UTF-8, FQDN, IP range, length). + // Safe to accept in non-SMTP contexts if desired. + } + ``` +- **`ParsedEmailAddress::$obsRoute`** — captured obs-route prefix (e.g. `@hostA,@hostB`) when one was stripped. `null` for normal addresses.
+- **`ParseOptions::$allowObsRoute`** (readonly) + `withAllowObsRoute()` builder. + +### Minimum Requirements (Unchanged) + +PHP `^8.1`, `ext-mbstring`, `ext-intl`. + ## v3.0 → v3.1 v3.1 is additive with one hard cutover: the 15 `ParseOptions` rule properties are now `readonly`. Factory presets and the deprecated setters still work. Everything else is new and non-breaking. diff --git a/composer.json b/composer.json index c1bcd8d..27779f5 100644 --- a/composer.json +++ b/composer.json @@ -52,7 +52,7 @@ "test:coverage": "phpunit --coverage-html coverage", "cs:check": "php-cs-fixer fix --dry-run --diff", "cs:fix": "php-cs-fixer fix", - "stan": "phpstan analyse", + "stan": "phpstan analyse --memory-limit=512M", "ci": [ "@cs:check", "@stan", diff --git a/src/Parse.php b/src/Parse.php index 612ddea..3d4f6da 100644 --- a/src/Parse.php +++ b/src/Parse.php @@ -24,6 +24,14 @@ class Parse private const STATE_END_ADDRESS = 10; private const STATE_START = 11; + /** + * Absorbs the obsolete source-route prefix inside angle-addr + * (RFC 5322 §4.4 obs-route: `"<" obs-domain-list ":" addr-spec ">"`). + * Consumes characters from the leading `@` up to the `:` terminator, + * then resumes normal addr-spec parsing. + */ + private const STATE_OBS_ROUTE = 12; + /** * @var ?Parse */ @@ -224,6 +232,35 @@ public function parseMultiple(string $emails, string $encoding = 'UTF-8'): Parse return ParseResult::fromArray($this->parse($emails, true, $encoding)); } + /** + * Lazily parse a batch of email address strings, yielding one + * {@see ParsedEmailAddress} per matched address. + * + * Use this when processing large batches (e.g. a CSV of mailing-list + * addresses) where holding every parsed result in memory is undesirable. + * Each item in `$input` is parsed with multi-address separator handling, + * so a single item may contain several comma- or whitespace-separated + * addresses. 
+ * + * foreach ($parser->parseStream($csvRows) as $addr) { + * if ($addr->invalid) continue; + * $repo->upsert($addr->simpleAddress); + * } + * + * @param iterable $input Each item is an address string (optionally multi-address). + * @param string $encoding Character encoding of the input strings. + * @return \Generator + */ + public function parseStream(iterable $input, string $encoding = 'UTF-8'): \Generator + { + foreach ($input as $emails) { + $result = $this->parse((string) $emails, true, $encoding); + foreach ($result['email_addresses'] as $address) { + yield ParsedEmailAddress::fromArray($address); + } + } + } + public function parse(string $emails, bool $multiple = true, string $encoding = 'UTF-8'): array { $emailAddresses = []; @@ -318,25 +355,66 @@ public function parse(string $emails, bool $multiple = true, string $encoding = } elseif (' ' == $curChar || "\t" == $curChar || "\r" == $curChar || "\n" == $curChar) { - // Handle Whitespace - - // Look ahead for comments after the address + // RFC 5322 §3.2.2 CFWS — folding whitespace. Look ahead past the + // WSP run to find the next significant character; that character + // determines which kind of CFWS this is and whether it can be + // silently absorbed or if it marks an end-of-address / error. 
$foundComment = false; + $lookAheadChar = null; for ($j = ($i + 1); $j < $len; ++$j) { - $lookAheadChar = mb_substr($emails, $j, 1, $encoding); - if ('(' == $lookAheadChar) { + $c = mb_substr($emails, $j, 1, $encoding); + if ('(' === $c) { $foundComment = true; break; - } elseif (' ' != $lookAheadChar && - "\t" != $lookAheadChar && - "\r" != $lookAheadChar && - "\n" != $lookAheadChar) { + } + if (' ' !== $c && "\t" !== $c && "\r" !== $c && "\n" !== $c) { + $lookAheadChar = $c; + break; } } - // Check if there's a comment found ahead - if ($foundComment) { + + // CFWS absorption: whitespace is legal per RFC 5322 §3.2.3 at + // dot-atom boundaries ("[CFWS] dot-atom-text [CFWS]") and per + // §4.4 obs-angle-addr around the angle brackets. Detect the + // position from subState + lookahead rather than emitting a + // WhitespaceInAddress error. + $cfwsAbsorbed = false; + if (!$foundComment && $lookAheadChar !== null) { + if (self::STATE_LOCAL_PART === $subState) { + if ('@' === $lookAheadChar) { + // Trailing CFWS of the local-part dot-atom: "local @domain". + $cfwsAbsorbed = true; + } elseif ( + $emailAddress['in_angle_addr'] + && $emailAddress['local_part_parsed'] === '' + && $emailAddress['address_temp'] === '' + && $emailAddress['quote_temp'] === '' + ) { + // Leading CFWS inside angle-addr: "< local@domain>". + $cfwsAbsorbed = true; + } + } elseif (self::STATE_DOMAIN === $subState) { + if ($emailAddress['domain'] === '' && $emailAddress['ip'] === '') { + // Leading CFWS of the domain dot-atom: "local@ domain". + $cfwsAbsorbed = true; + } + } elseif ( + self::STATE_START === $subState + && '@' === $lookAheadChar + && $emailAddress['address_temp'] !== '' + ) { + // Top-level addr-spec with no angle-addr: "local @domain". + // The accumulated address_temp IS the local-part; absorb the + // whitespace as trailing CFWS before the `@`. + $cfwsAbsorbed = true; + } + } + + if ($cfwsAbsorbed) { + // Silently skip the whitespace character; state unchanged. 
+ } elseif ($foundComment) { if (self::STATE_DOMAIN == $subState) { $subState = self::STATE_AFTER_DOMAIN; } elseif (self::STATE_LOCAL_PART == $subState) { @@ -344,10 +422,17 @@ public function parse(string $emails, bool $multiple = true, string $encoding = $emailAddress['invalid_reason'] = 'Email address contains whitespace'; $emailAddress['invalid_reason_code'] = Err::WhitespaceInAddress; } + } elseif ( + $emailAddress['in_angle_addr'] + && self::STATE_DOMAIN == $subState + && $lookAheadChar === '>' + ) { + // Trailing CFWS inside angle-addr before `>`: "<local@domain >". + // Absorb and transition as if we saw `>` next. + $subState = self::STATE_AFTER_DOMAIN; } elseif ($this->options->getUseWhitespaceAsSeparator() && (self::STATE_DOMAIN == $subState || self::STATE_AFTER_DOMAIN == $subState)) { - // If we're already in the domain part and whitespace is a separator, - // this should be the end of the whole address + // Already past `@` and whitespace-as-separator: end address. $state = self::STATE_END_ADDRESS; break; @@ -357,7 +442,7 @@ public function parse(string $emails, bool $multiple = true, string $encoding = $emailAddress['invalid_reason'] = 'Email address contains whitespace'; $emailAddress['invalid_reason_code'] = Err::WhitespaceInAddress; } else { - // If the previous section was a quoted string, then use that for the name + // Display-name phrase: absorb into name_parsed.
$this->handleQuote($emailAddress); $emailAddress['name_parsed'] .= $curChar; } @@ -372,6 +457,7 @@ public function parse(string $emails, bool $multiple = true, string $encoding = // Here should be the start of the local part for sure everything else then is part of the name $subState = self::STATE_LOCAL_PART; $emailAddress['special_char_in_substate'] = null; + $emailAddress['in_angle_addr'] = true; $this->handleQuote($emailAddress); } } elseif ('>' == $curChar) { @@ -382,6 +468,7 @@ public function parse(string $emails, bool $multiple = true, string $encoding = $emailAddress['invalid_reason_code'] = Err::MissingDomainBeforeClosingAngle; } else { $subState = self::STATE_AFTER_DOMAIN; + $emailAddress['in_angle_addr'] = false; } } elseif ('"' == $curChar) { // If we hit a quote - change to the quote state, unless it's in the domain, in which case it's error @@ -406,6 +493,20 @@ public function parse(string $emails, bool $multiple = true, string $encoding = $emailAddress['invalid'] = true; $emailAddress['invalid_reason'] = "Invalid character found in email address local part: '{$emailAddress['special_char_in_substate']}'"; $emailAddress['invalid_reason_code'] = Err::InvalidCharacterInLocalPart; + } elseif ( + $this->options->allowObsRoute + && $emailAddress['in_angle_addr'] + && $emailAddress['obs_route'] === '' + && $emailAddress['local_part_parsed'] === '' + && $emailAddress['quote_temp'] === '' + && $emailAddress['address_temp'] === '' + ) { + // RFC 5322 §4.4 obs-route: first `@` seen inside `<...>` with no + // preceding local-part starts the source-route prefix. Consume + // the remainder until `:` via STATE_OBS_ROUTE, then resume + // addr-spec parsing with local-part reset. 
+ $state = self::STATE_OBS_ROUTE; + $emailAddress['obs_route'] = '@'; } else { $subState = self::STATE_DOMAIN; if ($emailAddress['address_temp'] && $emailAddress['quote_temp']) { @@ -600,6 +701,29 @@ public function parse(string $emails, bool $multiple = true, string $encoding = $emailAddress['ip'] .= $curChar; } + break; + case self::STATE_OBS_ROUTE: + // RFC 5322 §4.4 obs-route absorption — consume the + // `@host1,@host2:` source-route prefix inside angle-addr. + // On `:` terminator, resume normal addr-spec parsing with + // local-part state cleared. An unterminated obs-route + // (end of input or `>` before `:`) is an invalid address. + $emailAddress['original_address'] .= $curChar; + if (':' == $curChar) { + $state = self::STATE_ADDRESS; + $subState = self::STATE_LOCAL_PART; + } elseif ('>' == $curChar) { + // `<@host>` without a colon — incomplete obs-route. + $emailAddress['invalid'] = true; + $emailAddress['invalid_reason'] = 'Incomplete obs-route: missing colon before closing angle-bracket'; + $emailAddress['invalid_reason_code'] = Err::IncompleteAddress; + $emailAddress['in_angle_addr'] = false; + $state = self::STATE_ADDRESS; + $subState = self::STATE_AFTER_DOMAIN; + } else { + $emailAddress['obs_route'] .= $curChar; + } + break; case self::STATE_QUOTE: // Handle quoted strings @@ -807,6 +931,13 @@ private function buildEmailAddressArray(): array 'special_char_in_substate' => null, 'comment_temp' => '', 'comments' => [], + // True while the parser is inside angle-addr (between `<` and `>`). + // Used to gate obs-route detection per RFC 5322 §4.4. + 'in_angle_addr' => false, + // Accumulates the obs-route prefix (everything between `<` and the + // terminating `:`) when ParseOptions::$allowObsRoute is true. + // Empty string when no obs-route was seen. 
+ 'obs_route' => '', ]; } @@ -995,7 +1126,8 @@ private function addAddress( 'invalid' => $emailAddress['invalid'], 'invalid_reason' => $emailAddress['invalid_reason'], 'invalid_reason_code' => $emailAddress['invalid_reason_code'], - 'comments' => $emailAddress['comments'], ]; + 'comments' => $emailAddress['comments'], + 'obs_route' => $emailAddress['obs_route'] !== '' ? $emailAddress['obs_route'] : null, ]; // Build the proper address by hand (has comments stripped out and should have quotes in the proper places) if (!$emailAddrDef['invalid']) { diff --git a/src/ParseErrorCode.php b/src/ParseErrorCode.php index 140450f..2be6e61 100644 --- a/src/ParseErrorCode.php +++ b/src/ParseErrorCode.php @@ -182,4 +182,35 @@ enum ParseErrorCode: string /** Unquoted display name contains characters outside atext + WSP (RFC 5322 §3.2.5 phrase). */ case InvalidDisplayNamePhrase = 'invalid_display_name_phrase'; + + /** + * Classify this error by severity. + * + * Critical: the input is structurally unparseable or violates a fundamental + * RFC 5322 / 5321 syntax rule — the address is not valid in any interpretation. + * + * Warning: the address is well-formed but was rejected by a configured + * validation rule (UTF-8 gating, FQDN requirement, IP range check, length + * limits, C0/C1 control policy, empty-quoted rejection, punycode conversion). + * Callers may choose to accept Warning-level failures depending on context. 
+ */ + public function severity(): ValidationSeverity + { + return match ($this) { + self::Utf8NotAllowedInLocalPart, + self::C0ControlInLocalPart, + self::C1ControlInLocalPart, + self::C1ControlInQuotedString, + self::EmptyQuotedLocalPart, + self::FqdnRequired, + self::IpNotInGlobalRange, + self::Ipv6NotInGlobalRange, + self::LocalPartTooLong, + self::TotalLengthExceeded, + self::DomainTooLong, + self::DomainLabelTooLong, + self::PunycodeConversionFailed => ValidationSeverity::Warning, + default => ValidationSeverity::Critical, + }; + } } diff --git a/src/ParseOptions.php b/src/ParseOptions.php index 5c9014a..e454852 100644 --- a/src/ParseOptions.php +++ b/src/ParseOptions.php @@ -42,6 +42,7 @@ class ParseOptions * @param bool $includeDomainAscii Emit punycode domain in output. * @param bool $validateDisplayNamePhrase Enforce RFC 5322 §3.2.5 phrase syntax for unquoted display names (atext + WSP only). * @param bool $strictIdna Apply full IDNA2008 conformance on U-label domains (CONTEXTJ/O, Bidi rule, STD3, nontransitional mapping). + * @param bool $allowObsRoute Accept RFC 5322 §4.4 obs-route source-route prefix inside angle-addr (e.g. `<@host1,@host2:user@host3>`); the route is captured and the real addr-spec is used ("accept and discard" per spec). 
*/ public function __construct( array $bannedChars = [], @@ -64,6 +65,7 @@ public function __construct( public readonly bool $includeDomainAscii = false, public readonly bool $validateDisplayNamePhrase = false, public readonly bool $strictIdna = false, + public readonly bool $allowObsRoute = false, ) { foreach ($bannedChars as $char) { $this->bannedChars[$char] = true; @@ -155,6 +157,7 @@ public static function rfc5322(): self applyNfcNormalization: false, enforceLengthLimits: true, includeDomainAscii: false, + allowObsRoute: true, ); } @@ -182,6 +185,7 @@ public static function rfc2822(): self applyNfcNormalization: false, enforceLengthLimits: true, includeDomainAscii: false, + allowObsRoute: true, ); } @@ -295,6 +299,11 @@ public function withStrictIdna(bool $value): self return $this->cloneWith(['strictIdna' => $value]); } + public function withAllowObsRoute(bool $value): self + { + return $this->cloneWith(['allowObsRoute' => $value]); + } + /** * Build a new ParseOptions preserving every current value except those * listed in $overrides. @@ -326,6 +335,7 @@ private function cloneWith(array $overrides): self includeDomainAscii: $get('includeDomainAscii', $this->includeDomainAscii), validateDisplayNamePhrase: $get('validateDisplayNamePhrase', $this->validateDisplayNamePhrase), strictIdna: $get('strictIdna', $this->strictIdna), + allowObsRoute: $get('allowObsRoute', $this->allowObsRoute), ); } diff --git a/src/ParsedEmailAddress.php b/src/ParsedEmailAddress.php index 3a89dee..3aa9239 100644 --- a/src/ParsedEmailAddress.php +++ b/src/ParsedEmailAddress.php @@ -27,6 +27,7 @@ final class ParsedEmailAddress * @param ?string $invalidReason Human-readable failure reason; `null` if valid. * @param ?ParseErrorCode $invalidReasonCode Structured failure code; `null` if valid. * @param array $comments RFC 5322 comments extracted from the address. + * @param ?string $obsRoute RFC 5322 §4.4 obs-route prefix if one was stripped from inside angle-addr (e.g. 
`@host1,@host2`); `null` otherwise. Only populated when {@see ParseOptions::$allowObsRoute} is enabled. */ public function __construct( public readonly string $address, @@ -44,6 +45,7 @@ public function __construct( public readonly ?string $invalidReason, public readonly ?ParseErrorCode $invalidReasonCode, public readonly array $comments, + public readonly ?string $obsRoute = null, ) { } @@ -70,6 +72,25 @@ public static function fromArray(array $arr): self invalidReason: $arr['invalid_reason'], invalidReasonCode: $arr['invalid_reason_code'], comments: $arr['comments'], + obsRoute: $arr['obs_route'] ?? null, ); } + + /** + * Severity of the validation failure, derived from {@see $invalidReasonCode}. + * Returns `null` when the address is valid (no failure to classify). + * + * Callers can use this to distinguish structural failures from policy + * violations: + * + * if ($parsed->invalid && $parsed->invalidSeverity() === ValidationSeverity::Warning) { + * // Well-formed but violates a configured rule — e.g. private-range IP + * // literal, non-FQDN domain, octet length over RFC 5321 §4.5.3.1. + * // Safe to accept in non-SMTP contexts. + * } + */ + public function invalidSeverity(): ?ValidationSeverity + { + return $this->invalidReasonCode?->severity(); + } } diff --git a/src/ValidationSeverity.php b/src/ValidationSeverity.php new file mode 100644 index 0000000..21df31b --- /dev/null +++ b/src/ValidationSeverity.php @@ -0,0 +1,41 @@ +assertTrue($result->invalid); $this->assertSame(\Email\ParseErrorCode::InvalidCharInQuotedString, $result->invalidReasonCode); } + + public function testValidAddressHasNullInvalidSeverity(): void + { + $result = Parse::getInstance()->parseSingle('user@example.com'); + $this->assertFalse($result->invalid); + $this->assertNull($result->invalidSeverity()); + } + + public function testStructuralFailureIsCriticalSeverity(): void + { + // Missing '@' — structural failure, unparseable. 
+ $result = Parse::getInstance()->parseSingle('not-an-email'); + $this->assertTrue($result->invalid); + $this->assertSame(\Email\ValidationSeverity::Critical, $result->invalidSeverity()); + } + + public function testPolicyFailureIsWarningSeverity(): void + { + // FQDN requirement: single-label domain is syntactically fine but policy-rejected. + $opts = ParseOptions::rfc5321(); + $result = (new Parse(null, $opts))->parseSingle('user@localhost'); + $this->assertTrue($result->invalid); + $this->assertSame(\Email\ValidationSeverity::Warning, $result->invalidSeverity()); + + // Private-range IP literal is syntactically valid but rejected by the global-range rule. + $result = Parse::getInstance()->parseSingle('user@[192.168.0.1]'); + $this->assertTrue($result->invalid); + $this->assertSame(\Email\ValidationSeverity::Warning, $result->invalidSeverity()); + } + + public function testEveryErrorCodeHasASeverity(): void + { + // Defensive: ensure no new ParseErrorCode is added without mapping its severity. + foreach (\Email\ParseErrorCode::cases() as $code) { + $severity = $code->severity(); + $this->assertInstanceOf(\Email\ValidationSeverity::class, $severity); + } + } + + public function testParseStreamYieldsTypedObjects(): void + { + $parser = Parse::getInstance(); + $gen = $parser->parseStream(['a@a.com', 'b@b.com']); + $this->assertInstanceOf(\Generator::class, $gen); + $results = iterator_to_array($gen, false); + $this->assertCount(2, $results); + $this->assertInstanceOf(\Email\ParsedEmailAddress::class, $results[0]); + $this->assertSame('a', $results[0]->localPart); + $this->assertSame('b.com', $results[1]->domain); + } + + public function testParseStreamSplitsMultiAddressItems(): void + { + // Each input item may itself contain several comma-separated addresses; + // parseStream yields one ParsedEmailAddress per address regardless. 
+ $parser = Parse::getInstance(); + $results = iterator_to_array( + $parser->parseStream(['a@a.com, b@b.com', 'c@c.com']), + false, + ); + $this->assertCount(3, $results); + $this->assertSame(['a', 'b', 'c'], array_map(fn ($r) => $r->localPart, $results)); + } + + public function testParseStreamAcceptsGeneratorInput(): void + { + // A caller-supplied generator should be consumed lazily. + $input = (function () { + yield 'one@example.com'; + yield 'two@example.com'; + })(); + + $results = iterator_to_array(Parse::getInstance()->parseStream($input), false); + $this->assertCount(2, $results); + $this->assertSame('one', $results[0]->localPart); + $this->assertSame('two', $results[1]->localPart); + } + + public function testParseStreamEmitsInvalidEntries(): void + { + // Invalid addresses still appear in the stream — callers filter by $addr->invalid. + $results = iterator_to_array( + Parse::getInstance()->parseStream(['valid@ok.com', 'not-an-email']), + false, + ); + $this->assertCount(2, $results); + $this->assertFalse($results[0]->invalid); + $this->assertTrue($results[1]->invalid); + } + + public function testObsRouteIsAcceptedAndCapturedInRfc5322(): void + { + // RFC 5322 §4.4: obs-route prefix is recognized, captured, and discarded; + // the real addr-spec (after the colon) becomes the parsed address. + $result = (new Parse(null, ParseOptions::rfc5322())) + ->parseSingle('<@hostA:user@hostB>'); + $this->assertFalse($result->invalid); + $this->assertSame('user', $result->localPart); + $this->assertSame('hostB', $result->domain); + $this->assertSame('@hostA', $result->obsRoute); + } + + public function testObsRouteSupportsMultipleHosts(): void + { + // Multiple routed hosts joined by comma per obs-domain-list. 
+ $result = (new Parse(null, ParseOptions::rfc5322())) + ->parseSingle('<@hostA,@hostB:user@hostC>'); + $this->assertFalse($result->invalid); + $this->assertSame('user', $result->localPart); + $this->assertSame('hostC', $result->domain); + $this->assertSame('@hostA,@hostB', $result->obsRoute); + } + + public function testObsRoutePreservesDisplayName(): void + { + $result = (new Parse(null, ParseOptions::rfc5322())) + ->parseSingle('John Doe <@route.com:jdoe@example.com>'); + $this->assertFalse($result->invalid); + $this->assertSame('John Doe', $result->nameParsed); + $this->assertSame('jdoe', $result->localPart); + $this->assertSame('example.com', $result->domain); + $this->assertSame('@route.com', $result->obsRoute); + } + + public function testObsRouteInMultiAddressBatch(): void + { + // Each address in a batch parses its own obs-route independently. + $result = (new Parse(null, ParseOptions::rfc5322())) + ->parseMultiple('<@routeA:a@x.com>, <@routeB:b@y.com>'); + $this->assertTrue($result->success); + $this->assertCount(2, $result->emailAddresses); + $this->assertSame('@routeA', $result->emailAddresses[0]->obsRoute); + $this->assertSame('@routeB', $result->emailAddresses[1]->obsRoute); + } + + public function testObsRouteRejectedWhenFlagIsOff(): void + { + // Default constructor (legacy mode) has allowObsRoute=false — the colon + // inside <...> is rejected as an invalid domain character. + $result = (new Parse(null, new ParseOptions())) + ->parseSingle('<@hostA:user@hostB>'); + $this->assertTrue($result->invalid); + $this->assertNull($result->obsRoute); + + // rfc5321() also keeps obs-route off per SMTP Mailbox strictness. + $result = (new Parse(null, ParseOptions::rfc5321())) + ->parseSingle('<@hostA:user@hostB>'); + $this->assertTrue($result->invalid); + } + + public function testObsRouteIncompleteWithoutColonIsInvalid(): void + { + // `<@host>` has no colon — incomplete obs-route. 
+ $result = (new Parse(null, ParseOptions::rfc5322())) + ->parseSingle('<@host>'); + $this->assertTrue($result->invalid); + $this->assertSame(\Email\ParseErrorCode::IncompleteAddress, $result->invalidReasonCode); + } + + public function testObsRouteWithEmptyAddrSpecIsInvalid(): void + { + // `<@hostA:>` — empty addr-spec after the colon. + $result = (new Parse(null, ParseOptions::rfc5322())) + ->parseSingle('<@hostA:>'); + $this->assertTrue($result->invalid); + } + + public function testValidAddressHasNullObsRoute(): void + { + // A normal address produces obsRoute=null (not empty string). + $result = Parse::getInstance()->parseSingle('user@example.com'); + $this->assertNull($result->obsRoute); + } + + /** + * RFC 5322 §3.2.2 CFWS — folding whitespace is allowed around dot-atom + * boundaries. Each case below is a structurally valid RFC 5322 addr-spec + * that the v3.1 parser rejected as "Email address contains whitespace"; + * v3.2 absorbs the CFWS positionally via look-ahead in the WSP handler. + */ + public function testCfwsTrailingLocalPart(): void + { + // "local @domain" — trailing CFWS on local-part dot-atom. + $result = Parse::getInstance()->parseSingle('local @domain.com'); + $this->assertFalse($result->invalid); + $this->assertSame('local', $result->localPart); + $this->assertSame('domain.com', $result->domain); + } + + public function testCfwsLeadingDomain(): void + { + // "local@ domain" — leading CFWS on domain dot-atom. 
+ $result = Parse::getInstance()->parseSingle('local@ domain.com'); + $this->assertFalse($result->invalid); + $this->assertSame('local', $result->localPart); + $this->assertSame('domain.com', $result->domain); + } + + public function testCfwsAroundAtSymbol(): void + { + $result = Parse::getInstance()->parseSingle('local @ domain.com'); + $this->assertFalse($result->invalid); + $this->assertSame('local', $result->localPart); + $this->assertSame('domain.com', $result->domain); + } + + public function testCfwsInsideAngleAddr(): void + { + // Whitespace inside <> flanking the addr-spec. + $result = Parse::getInstance()->parseSingle('John Doe < local@domain.com >'); + $this->assertFalse($result->invalid); + $this->assertSame('John Doe', $result->nameParsed); + $this->assertSame('local', $result->localPart); + $this->assertSame('domain.com', $result->domain); + } + + public function testCfwsAroundAtInsideAngleAddr(): void + { + $result = Parse::getInstance()->parseSingle('<local @ domain.com>'); + $this->assertFalse($result->invalid); + $this->assertSame('local', $result->localPart); + $this->assertSame('domain.com', $result->domain); + } + + public function testCfwsFoldingWhitespace(): void + { + // Folded whitespace (LF + WSP) is still whitespace per CFWS lookahead. + $result = Parse::getInstance()->parseSingle("local\n\t@domain.com"); + $this->assertFalse($result->invalid); + $this->assertSame('local', $result->localPart); + $this->assertSame('domain.com', $result->domain); + } }
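
Reviewer note: the streaming and severity additions in this diff are meant to be combined, so a rough sketch of that pattern may help. The snippet below does not use the library. `SketchSeverity`, `sketchParseStream()` and `acceptSoftFailures()` are hypothetical stand-ins for `ValidationSeverity`, `Parse::parseStream()` and caller code, and the classification rules are deliberately crude assumptions rather than the parser's real logic. It only illustrates lazily yielding one result per row and accepting Warning-level failures while dropping Critical ones.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for Email\ValidationSeverity. The real enum's backing
// values are not visible in this diff, so these strings are assumptions.
enum SketchSeverity: string
{
    case Critical = 'critical';
    case Warning = 'warning';
    case Info = 'info';
}

/**
 * Hypothetical stand-in for Parse::parseStream(): lazily yields one
 * [address, ?SketchSeverity] pair per input row. The rules below are a toy
 * classifier, NOT the library's parser.
 */
function sketchParseStream(iterable $rows): \Generator
{
    foreach ($rows as $row) {
        $row = trim((string) $row);
        $at = strrpos($row, '@');
        if ($at === false || $at === 0 || $at === strlen($row) - 1) {
            // No usable local@domain split: treated as a structural failure.
            yield [$row, SketchSeverity::Critical];
        } elseif (!str_contains(substr($row, $at + 1), '.')) {
            // Single-label domain: treated as a policy (FQDN) failure.
            yield [$row, SketchSeverity::Warning];
        } else {
            // Treated as valid: no severity to report.
            yield [$row, null];
        }
    }
}

/** Keep valid rows plus Warning-level failures; drop Critical ones. */
function acceptSoftFailures(iterable $rows): array
{
    $accepted = [];
    foreach (sketchParseStream($rows) as [$address, $severity]) {
        if ($severity === null || $severity === SketchSeverity::Warning) {
            $accepted[] = $address;
        }
    }
    return $accepted;
}

$accepted = acceptSoftFailures(['a@a.com', 'user@localhost', 'not-an-email']);
```

With the real API the loop body would presumably test `$addr->invalid` and `$addr->invalidSeverity()` instead of the tuple used here. Because the source is a generator, only one row's result is held at a time, which is the memory property the changelog entry describes.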