This document defines the intended public surface and parser semantics for Taker. It is the compatibility guide for future releases: if behavior listed here changes, that change should be deliberate, tested, and called out in release notes.
Taker is a parser-combinator library for Java 21+. The API favors readable names
and fluent composition over dense parser-combinator terminology. Method names such
as zeroOrMore, oneOrMore, thenSkip, and skipThen are intentionally more
descriptive than traditional names like many, many1, left, or right.
The JPMS module is io.github.parseworks.taker and exports these packages:
- io.github.parseworks.taker
- io.github.parseworks.taker.parsers
- io.github.parseworks.taker.results
Classes under io.github.parseworks.taker.impl and
io.github.parseworks.taker.internal are internal implementation details. They
must not be exported from the module and should not be referenced by user code.
The following types are intended to be stable public API:
- Taker<A>: the core parser type.
- Input: immutable input cursor abstraction.
- TextInput: input extension that can report line, column, and snippets.
- Result<A>: parser result abstraction.
- Failure<A>: result subtype for failures.
- Located<A>: value wrapper with zero-based source offsets.
- ResultType: result category, currently MATCH, NO_MATCH, and PARTIAL.
- CharPredicate: character predicate helper type.
- ApplyBuilder: fluent sequence builder returned by Taker.then.
- ApplyBuilder.Func3 through ApplyBuilder.Func8: arity-specific function interfaces used by ApplyBuilder map overloads.
- io.github.parseworks.taker.parsers.*: built-in parser libraries.
- io.github.parseworks.taker.results.*: concrete result records for low-level parser authors.
Input represents a position in a character sequence.
- Input.of(CharSequence) creates a random-access input.
- Input instances are logically immutable. Calling next() or skip(int) returns a new cursor and does not mutate the original cursor.
- position() is a zero-based absolute character offset.
- current() returns the current character and is only valid when isEof() is false.
- next() advances by one character and is only valid when isEof() is false.
- skip(offset) advances by offset characters.
- hasMore() is equivalent to !isEof().
- Case-insensitive matching belongs at the parser or predicate layer. Use Chars.chrIgnoreCase, Lexical.stringIgnoreCase, Chars.oneOfIgnoreCase, or the CharPredicate.*IgnoreCase helpers instead of transforming the input cursor.
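The cursor rules above can be illustrated with a small stand-in. The Cursor record below is a hypothetical model written for this guide, not Taker's Input implementation; in particular, clamping skip at EOF is an assumption of the sketch.

```java
/** Hypothetical stand-in for an immutable input cursor; not Taker's Input type. */
record Cursor(CharSequence source, int position) {

    static Cursor of(CharSequence source) {
        return new Cursor(source, 0);
    }

    boolean isEof() {
        return position >= source.length();
    }

    boolean hasMore() {
        return !isEof();
    }

    /** Only valid when isEof() is false. */
    char current() {
        if (isEof()) throw new IllegalStateException("current() called at EOF");
        return source.charAt(position);
    }

    /** Returns a new cursor one character ahead; this cursor is left unchanged. */
    Cursor next() {
        if (isEof()) throw new IllegalStateException("next() called at EOF");
        return new Cursor(source, position + 1);
    }

    /** Returns a new cursor advanced by offset characters (clamped at EOF in this sketch). */
    Cursor skip(int offset) {
        return new Cursor(source, Math.min(source.length(), position + offset));
    }
}
```

Because every advance returns a fresh value, a caller can keep the starting cursor and retry another parser from it after a failure.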
TextInput additionally exposes line/column and snippet formatting. Error
messages may use this information when available. Line and column reporting is
user-facing and should remain one-based.
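As a rough illustration of the relationship between zero-based offsets and one-based line and column numbers, the following sketch derives both from an offset. LineColumnSketch is a hypothetical helper, and treating only '\n' as a line break is an assumption; TextInput may handle other line terminators differently.

```java
/** Hypothetical helper; not part of Taker. Only '\n' is treated as a line break here. */
final class LineColumnSketch {

    /** Returns {line, column}, both one-based, for a zero-based character offset. */
    static int[] of(CharSequence source, int offset) {
        int line = 1;
        int column = 1;
        for (int i = 0; i < offset && i < source.length(); i++) {
            if (source.charAt(i) == '\n') {
                line++;        // a newline starts the next line
                column = 1;    // columns restart at one
            } else {
                column++;
            }
        }
        return new int[] {line, column};
    }
}
```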
A parser returns a Result<A>.
- A successful parse returns matches() == true, type() == MATCH, a value(), and the next input() cursor.
- A failed parse returns matches() == false.
- Calling value() on a failure throws an exception containing the formatted error message.
- map transforms only successful values and preserves failures.
- cast is a type-level convenience for preserving failures across generic parser boundaries.
- handle(success, failure) dispatches according to matches().
- toOptional() returns the successful value or Optional.empty().
- errorOptional() returns the formatted error for failures or Optional.empty() for success.
- diagnosticsOptional() returns structured diagnostics for failures or Optional.empty() for success.
- diagnostics() returns structured diagnostics for failures and throws IllegalStateException for successful results.
Failure<A> exposes structured failure information:
- expected() is the current expectation or grammar context label.
- context() is true when expected() names grammar context from label(...) rather than a concrete expectation from expecting(...) or a primitive parser.
- cause() is an optional nested failure.
- combinedFailures() is an optional list of failures from alternative parsers.
- error() formats a user-facing diagnostic lazily.
- diagnostics() creates a lazy structured view over the existing failure.
- Literal character expectations should use escaped display forms for control characters, such as '\n' or '\t'.
ParseDiagnostics is the public, renderer-friendly failure view. It contains
the result type, failure offset, line/column when available, escaped found
input, distinct expectations, grammar contexts, and nested causes. Rendering is
lazy: callers can inspect fields directly, call render() for a message without
a source snippet, or call render(source) to include a caret snippet from the
original input. Successful parses do not allocate diagnostics.
Taker distinguishes two failure categories:
- NO_MATCH: the parser did not match. This failure allows alternatives to be tried by choice combinators.
- PARTIAL: a committed failure. This indicates that the parser had progressed far enough that alternatives should not be tried.
Failure type is about backtracking control, not whether the underlying Input
object was mutated. Inputs are immutable; a failed result may still report an
advanced error cursor.
parse(Input) and parse(CharSequence) apply a parser without requiring the
entire input to be consumed.
parseAll(Input) and parseAll(CharSequence) require a successful parser to end
at EOF. If a parser succeeds but leaves trailing input, parseAll returns a
PARTIAL failure expecting end of input.
stream(Input) and iterateParse(Input) scan through input and yield each
successful parse. If the parser fails at the current position, scanning advances
by one character and tries again. Parsers used with these methods must consume
input on success, or callers risk non-terminating iteration.
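The scanning behavior can be sketched as a simple loop. ScanSketch and its Rule interface are hypothetical stand-ins, not Taker types; a rule here returns the end offset of a match or -1. Where the real methods leave non-termination as the caller's risk, this sketch chooses to throw on a zero-width success so the hazard is visible.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of scan-and-retry iteration; not Taker's implementation. */
final class ScanSketch {

    /** Returns the end offset of a match starting at pos, or -1 when there is no match. */
    interface Rule {
        int matchEnd(CharSequence source, int pos);
    }

    /** Collects every matched slice, stepping one character forward after each failure. */
    static List<String> scan(CharSequence source, Rule rule) {
        List<String> found = new ArrayList<>();
        int pos = 0;
        while (pos < source.length()) {
            int end = rule.matchEnd(source, pos);
            if (end < 0) {
                pos++;                         // failure: advance one character and retry
            } else if (end == pos) {
                // A zero-width success would repeat forever at the same position.
                throw new IllegalStateException("parser succeeded without consuming input");
            } else {
                found.add(source.subSequence(pos, end).toString());
                pos = end;                     // success: resume after the match
            }
        }
        return found;
    }

    /** Example rule: one or more ASCII digits. */
    static int digits(CharSequence source, int pos) {
        int end = pos;
        while (end < source.length() && source.charAt(end) >= '0' && source.charAt(end) <= '9') {
            end++;
        }
        return end > pos ? end : -1;
    }
}
```

For example, ScanSketch.scan("a12b345", ScanSketch::digits) yields the two slices "12" and "345".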
Combinators.pure(value) always succeeds without consuming input.
Chars.take(predicate) matches exactly one character when the predicate succeeds.
It fails at EOF or when the predicate is false.
Chars.takeWhile(predicate) greedily consumes one or more matching characters.
It fails if no characters match.
Use .orElse("") when a zero-length match is desired.
Chars.collectChars(predicate) is an explicit alias for takeWhile(predicate).
It greedily consumes one or more matching input characters and returns the
matched text.
Prefer collectChars(predicate) or takeWhile(predicate) over
chr(predicate).collectString() when the grammar is simply accumulating
consecutive raw input characters. The scanner form avoids per-character parser
result allocation.
Chars.skipWhile(predicate) greedily consumes zero or more matching characters
and returns null.
It always succeeds and does not allocate a matched string. Use it for ignored input such as whitespace or comments when the skipped text is not needed.
Chars.countWhile(predicate) greedily consumes zero or more matching characters
and returns the number of consumed characters.
It always succeeds and does not allocate a matched string.
Chars.takeUntil(predicate) and Lexical.takeUntil(String) consume characters
until a terminator is found. The terminator is not consumed. If no terminator is
found, these parsers consume to EOF and succeed.
An empty string delimiter succeeds with the empty string and consumes no input.
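The three takeUntil outcomes (terminator found, terminator missing, empty delimiter) can be modeled with String.indexOf. TakeUntilSketch is a hypothetical helper that returns the end offset of the consumed text; it is meant to illustrate the contract, not how Taker implements it.

```java
/** Hypothetical model of the takeUntil(String) contract; not Taker's implementation. */
final class TakeUntilSketch {

    /**
     * Returns the offset where consumption stops: the start of the terminator,
     * or the end of the source when the terminator never appears.
     */
    static int takeUntil(String source, int pos, String terminator) {
        if (terminator.isEmpty()) {
            return pos;                        // empty delimiter: consume nothing
        }
        int found = source.indexOf(terminator, pos);
        return found < 0 ? source.length() : found;   // terminator itself is not consumed
    }
}
```

With the source "key=value" and terminator "=", consumption stops at offset 3, leaving the "=" for the next parser.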
parser.map(f) applies f only to a successful parser value. Failures are
propagated unchanged.
parser.located() applies parser and wraps a successful value in
Located<A>. The recorded start offset is the parser's starting
Input.position(). The recorded end offset is the successful result input's
position(). Offsets are zero-based, with start inclusive and end exclusive.
located() is opt-in and does not change input consumption. Failures are
propagated unchanged and do not allocate a Located value.
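Half-open, zero-based offsets line up with CharSequence.subSequence, which makes them convenient for slicing the original source. The Span record below is a hypothetical stand-in used only to show that convention; it is not the Located type.

```java
/** Hypothetical stand-in for a located range: start inclusive, end exclusive, zero-based. */
record Span(int start, int end) {

    /** Number of characters covered by the span. */
    int length() {
        return end - start;
    }

    /** Returns the source text covered by the span. */
    String slice(CharSequence source) {
        return source.subSequence(start, end).toString();
    }
}
```

For the source "let foo = 1", a span of start 4 and end 7 covers the three characters of "foo".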
parser.flatMap(f) applies parser; on success, it calls f with the parsed
value and applies the returned parser at the new input position. If f returns
null, parsing fails with an internal parser expectation.
a.then(b) parses a followed by b and returns an ApplyBuilder for mapping
both values. Additional .then(...) calls extend the sequence.
Sequential parsers propagate failures from the first failed component. They do
not automatically convert failures into PARTIAL.
a.thenSkip(b) parses a followed by b and returns a's value.
a.skipThen(b) parses a followed by b and returns b's value.
parser.between(open, close) parses the opening parser or character, then the
main parser, then the closing parser or character, returning the main value.
Combinators.not(parser) is zero-width negative lookahead. It succeeds without
consuming input when parser fails at the same input position, including at
EOF. It fails when parser succeeds.
Use not(parser).skipThen(Combinators.any()) when the grammar should consume
the character that was validated by negative lookahead.
a.or(b) is equivalent to a two-branch oneOf.
Combinators.oneOf(...) tries alternatives from left to right:
- the first successful alternative wins;
- a PARTIAL failure stops choice immediately and is returned;
- if all alternatives fail with NO_MATCH, only failures at the farthest reported input position are kept;
- failures tied at that farthest position are combined so the formatted diagnostic can report multiple expectations.
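The selection rules above can be sketched with a small model. ChoiceSketch, its Outcome record, and its Rule interface are hypothetical and exist only for illustration; Taker's Result and Failure types carry more information than this.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of ordered choice with farthest-failure merging; not Taker's code. */
final class ChoiceSketch {

    enum Type { MATCH, NO_MATCH, PARTIAL }

    /** Outcome type, reported input position, and the expectations attached to a failure. */
    record Outcome(Type type, int position, List<String> expected) {}

    interface Rule {
        Outcome apply(CharSequence source, int pos);
    }

    static Outcome oneOf(CharSequence source, int pos, List<Rule> alternatives) {
        if (alternatives.isEmpty()) {
            throw new IllegalArgumentException("oneOf requires at least one parser");
        }
        int farthest = -1;
        List<String> expected = new ArrayList<>();
        for (Rule rule : alternatives) {
            Outcome outcome = rule.apply(source, pos);
            if (outcome.type() != Type.NO_MATCH) {
                return outcome;                // MATCH wins; PARTIAL stops the choice
            }
            if (outcome.position() > farthest) {
                farthest = outcome.position();
                expected.clear();              // drop failures from nearer positions
            }
            if (outcome.position() == farthest) {
                expected.addAll(outcome.expected());   // combine ties at the farthest position
            }
        }
        return new Outcome(Type.NO_MATCH, farthest, List.copyOf(expected));
    }

    /** Example rule: match a literal, reporting the first mismatching offset on failure. */
    static Rule literal(String text) {
        return (source, pos) -> {
            for (int i = 0; i < text.length(); i++) {
                int at = pos + i;
                if (at >= source.length() || source.charAt(at) != text.charAt(i)) {
                    return new Outcome(Type.NO_MATCH, at, List.of("'" + text.charAt(i) + "'"));
                }
            }
            return new Outcome(Type.MATCH, pos + text.length(), List.of());
        };
    }
}
```

Against the input "fox", the alternatives "for" and "fob" both fail at offset 2, so the combined failure in this model reports both 'r' and 'b' as expectations, while an alternative such as "bar" that fails at offset 0 contributes nothing.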
oneOf requires at least one parser. Character-set forms such as
Combinators.oneOf(char...), Chars.oneOf(chars), and
Chars.oneOfIgnoreCase(chars) require at least one character.
Combinators.commit(parser) applies parser. If parser fails and reports an input
position greater than the starting position, the failure is converted to
PARTIAL. Choice combinators do not try later alternatives after a PARTIAL
failure.
Use commit when a grammar branch has become specific enough that continuing to
other alternatives would produce a worse error.
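The conversion rule for commit is small enough to sketch directly. CommitSketch and its types are hypothetical; the point is only that a NO_MATCH failure is upgraded to PARTIAL when its reported position lies beyond the starting position, and is otherwise left alone.

```java
/** Hypothetical model of the commit rule; not Taker's implementation. */
final class CommitSketch {

    enum Type { MATCH, NO_MATCH, PARTIAL }

    record Outcome(Type type, int position) {}

    interface Rule {
        Outcome apply(CharSequence source, int pos);
    }

    /** Wraps a rule so that failures reported past the start position become PARTIAL. */
    static Rule commit(Rule rule) {
        return (source, pos) -> {
            Outcome outcome = rule.apply(source, pos);
            boolean failedAfterProgress =
                    outcome.type() == Type.NO_MATCH && outcome.position() > pos;
            return failedAfterProgress
                    ? new Outcome(Type.PARTIAL, outcome.position())
                    : outcome;                 // successes and failures at the start pass through
        };
    }

    /** Example rule: match a literal, reporting the first mismatching offset on failure. */
    static Rule literal(String text) {
        return (source, pos) -> {
            for (int i = 0; i < text.length(); i++) {
                int at = pos + i;
                if (at >= source.length() || source.charAt(at) != text.charAt(i)) {
                    return new Outcome(Type.NO_MATCH, at);
                }
            }
            return new Outcome(Type.MATCH, pos + text.length());
        };
    }
}
```

In this model, commit(literal("if(")) applied to "if x" fails at offset 2 and becomes PARTIAL, whereas the same rule applied to "while" fails at offset 0 and stays NO_MATCH, so a choice could still try other branches.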
parser.expecting(label) relabels what the parser expected while preserving the
original failure as its cause. It is intended for tokens, values, and local
syntax expectations such as identifier, integer, or "]". It does not
change successful results or input consumption.
parser.label(label) adds a grammar label to a failure while preserving the
original failure as its cause. It is intended for naming larger grammar rules in
diagnostics, such as assignment, expression, or TOML table. Renderers
should present labels as context, for example while parsing assignment, not as
expected tokens. It does not change successful results or input consumption.
parser.optional() always succeeds. It returns Optional.of(value) when the
parser succeeds and Optional.empty() without consuming input when the parser
fails.
parser.orElse(value) succeeds with value without consuming input when
parser fails. It returns the original result when parser succeeds.
repeat(n) parses exactly n items.
repeat(min, max) parses between min and max items.
repeatAtLeast(n) parses at least n items.
repeatAtMost(n) parses up to n items.
Repetition fails when:
- min or max is negative;
- min > max;
- fewer than min items match;
- the repeated parser succeeds without advancing input.
Successful repetition returns an unmodifiable list.
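The failure conditions and the unmodifiable result can be sketched as a bounded loop. RepeatSketch and its Rule interface are hypothetical; the model returns Optional.empty() for every failure case, which flattens the distinctions that real failures carry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical model of repeat(min, max); not Taker's implementation. */
final class RepeatSketch {

    /** Returns the end offset of a match starting at pos, or -1 when there is no match. */
    interface Rule {
        int matchEnd(CharSequence source, int pos);
    }

    /** Returns the matched slices, or Optional.empty() when repetition fails. */
    static Optional<List<String>> repeat(CharSequence source, int pos, Rule item, int min, int max) {
        if (min < 0 || max < 0 || min > max) {
            return Optional.empty();           // invalid bounds
        }
        List<String> items = new ArrayList<>();
        while (items.size() < max) {
            int end = item.matchEnd(source, pos);
            if (end < 0) {
                break;                         // item failed: stop collecting
            }
            if (end == pos) {
                return Optional.empty();       // success without progress is rejected
            }
            items.add(source.subSequence(pos, end).toString());
            pos = end;
        }
        if (items.size() < min) {
            return Optional.empty();           // too few items
        }
        return Optional.of(List.copyOf(items)); // unmodifiable result
    }
}
```

The no-progress check is what keeps an unbounded form such as repeat(0, Integer.MAX_VALUE) from looping forever on a parser that succeeds with a zero-length match.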
zeroOrMore() is repeat(0, Integer.MAX_VALUE).
oneOrMore() is repeat(1, Integer.MAX_VALUE).
foldZeroOrMore(identity, accumulator) and
foldOneOrMore(identity, accumulator) repeat this parser and fold each value
into an accumulator instead of allocating an
intermediate list.
foldZeroOrMoreFrom(identitySupplier, accumulator) and
foldOneOrMoreFrom(identitySupplier, accumulator) are the mutable-accumulator forms.
The supplier is called once per parse to avoid sharing mutable state across
parse calls.
skipZeroOrMore() and skipOneOrMore() repeat this parser and discard parsed
values.
collectString() applies this parser one or more times and concatenates parsed
values with String.valueOf(value). It is the allocation-conscious equivalent
of collecting a list with oneOrMore() and joining it afterward.
For raw input characters, prefer the scanner-level
Chars.collectChars(predicate) / Chars.takeWhile(predicate) APIs. Use
collectString() when the repeated parser produces values that are not simply
consecutive characters from the input.
These parsers repeat until the terminator parser succeeds. The terminator is consumed when found. If the terminator appears before the minimum count, parsing fails.
Separated parsers parse values with a separator parser between them.
oneOrMoreSeparatedBy(sep) requires at least one value.
zeroOrMoreSeparatedBy(sep) returns an empty list when the first value is absent.
foldSeparatedBy(sep, identity, accumulator) and
foldZeroOrMoreSeparatedBy(sep, identity, accumulator) are allocation-conscious
alternatives that fold separated values without allocating an intermediate list.
foldSeparatedByFrom(sep, identitySupplier, accumulator) and
foldZeroOrMoreSeparatedByFrom(sep, identitySupplier, accumulator) are the
mutable-accumulator forms.
parser.onlyIf(validationParser) succeeds only when validationParser succeeds
at the same starting input position. The validation parser is lookahead; the main
parser is applied at the original input position.
parser.onlyIf(CharPredicate) checks the current character before applying the
main parser.
parser.peek(lookahead) first applies parser. If it succeeds, lookahead must
also succeed at the result input position. The returned value and input position
are from parser; the lookahead result is not consumed.
parser.recover(recovery) applies recovery at the original input position when
parser fails.
parser.recoverWith(function) calls function with the failure when parser
fails. The function is responsible for returning a result.
Taker.ref() creates an uninitialized parser reference for recursive grammars.
Calling set(parser) or set(handler) initializes the reference exactly once.
Calling set again throws an exception.
Applying an uninitialized reference throws an exception.
Recursive references detect direct infinite recursion at the same input position and return a failure instead of recursing indefinitely.
systemOut() and systemOut(label) wrap a parser with diagnostic logging to
standard output. They are debugging helpers and should not be used in library
code paths that require quiet output.
Chars contains character-level parsers and scanner fast paths.
- take(predicate) matches exactly one character when the predicate succeeds.
- chr(char) matches exactly one character.
- chrIgnoreCase(char) matches exactly one character ignoring case.
- chr(CharPredicate) matches one character satisfying the predicate.
- oneOf(chars) matches one character from the supplied character set.
- oneOfIgnoreCase(chars) matches one character from the supplied character set ignoring case.
- spaces matches one or more ASCII spaces (' ') and does not match tabs, newlines, or other whitespace.
- whitespace matches one or more characters accepted by Character.isWhitespace, including line separators.
- word matches one or more letters.
- line consumes until a newline and does not consume the newline.
Lexical contains string, regex, quoted-string, and trim parsers.
- string(str) matches str exactly. The empty string succeeds without consuming input. On failure, it reports the next expected character at the failure position using escaped literal formatting for control characters.
- stringIgnoreCase(str) matches str ignoring case. On success it returns the parser's expected string, not the source slice. Failure labels follow string(str) formatting.
- regex(pattern, flags) matches with Matcher.lookingAt() from the current input position.
- trim(parser) skips ASCII spaces around parser. It does not skip tabs, newlines, or other Unicode whitespace.
- trimSpaces(parser) is an explicit alias for trim(parser).
- trimWhitespace(parser) skips characters accepted by Character.isWhitespace around parser, including line separators. Use it only when crossing line boundaries is part of the grammar.
- lexeme(parser, ignored) repeatedly applies caller-defined ignored input before and after parser.
- escapedString(quote, escape, escapes) parses a quoted string and applies the supplied escape replacements.
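Matcher.lookingAt() anchors the match at the start of the matcher's region but does not require the whole region to match, which is the behavior a parser needs at a cursor position. The sketch below shows that with the standard java.util.regex API. LookingAtSketch is a hypothetical helper, and using region(...) to position the matcher is an assumption about one reasonable approach, not a description of Taker's internals.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper showing an anchored regex match at an offset; not Taker's code. */
final class LookingAtSketch {

    /** Returns the end offset of a match anchored at pos, or empty when nothing matches there. */
    static OptionalInt matchAt(Pattern pattern, CharSequence source, int pos) {
        Matcher matcher = pattern.matcher(source);
        matcher.region(pos, source.length());  // start matching at the cursor position
        // lookingAt() must match at the region start but may stop before the region end.
        return matcher.lookingAt() ? OptionalInt.of(matcher.end()) : OptionalInt.empty();
    }
}
```

With the pattern [0-9]+ and the source "ab123cd", a match attempted at offset 2 ends at offset 5, while a match attempted at offset 0 fails because lookingAt() does not search forward for a later occurrence.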
Numeric contains digit, integer, long, double, and hex parsers.
- numeric matches one decimal digit.
- nonZeroDigit matches one decimal digit from 1 to 9.
- sign parses +, -, or no sign, defaulting to positive.
- unsignedInteger and unsignedLong parse 0 or a non-zero digit followed by digits. Leading-zero input such as 0123 parses only the leading zero; under parseAll, the remaining digits are trailing input and the parse fails.
- integer and longValue parse optional signs.
- Integer and long overflow should fail instead of saturating or wrapping.
- longValue accepts -9223372036854775808.
- doubleValue parses Java double-compatible decimal forms with optional exponent.
- hex parses 0x or 0X followed by one or more hex digits.
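One way to satisfy both the overflow rule and the Long.MIN_VALUE case is to accumulate digits as a non-positive number with exact arithmetic. The sketch below is a hypothetical illustration of that technique using Math.multiplyExact and Math.subtractExact; it is not Taker's longValue, it checks a whole string rather than a cursor, and it deliberately ignores the leading-zero rule.

```java
import java.util.OptionalLong;

/** Hypothetical overflow-checked decimal parsing; not Taker's longValue. */
final class LongDigitsSketch {

    /** Parses an optionally signed decimal long from the whole text; empty on bad input or overflow. */
    static OptionalLong parse(String text) {
        int i = 0;
        boolean negative = false;
        if (!text.isEmpty() && (text.charAt(0) == '+' || text.charAt(0) == '-')) {
            negative = text.charAt(0) == '-';
            i = 1;
        }
        if (i == text.length()) {
            return OptionalLong.empty();       // no digits at all
        }
        long value = 0;                        // kept non-positive so Long.MIN_VALUE fits
        for (; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < '0' || c > '9') {
                return OptionalLong.empty();
            }
            try {
                value = Math.subtractExact(Math.multiplyExact(value, 10L), (long) (c - '0'));
            } catch (ArithmeticException overflow) {
                return OptionalLong.empty();   // fail instead of wrapping
            }
        }
        if (negative) {
            return OptionalLong.of(value);
        }
        if (value == Long.MIN_VALUE) {
            return OptionalLong.empty();       // magnitude has no positive long representation
        }
        return OptionalLong.of(-value);
    }
}
```

Accumulating on the negative side matters because the magnitude of Long.MIN_VALUE cannot be represented as a positive long, so a positive accumulator would overflow on otherwise valid input.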
Combinators exposes static forms of common parser operations:
- any, eof, fail, not, isNot, oneOf, sequence, between, satisfy, is, chainLeft, and chainRight.
- throwError deliberately throws and is primarily a test/debugging helper.
TokensParser is an opt-in facade for token grammars. It does not change
Input behavior. Instead, each token parser skips caller-defined ignored input
before and after a raw parser.
- TokensParser.skipping(predicate) skips zero or more characters accepted by the predicate around each token.
- TokensParser.skipping(parser) repeatedly applies a caller-defined ignored parser around each token.
- token(parser) wraps any raw parser in the ignored-input policy.
- chr, string, oneOf, and their ignore-case variants are token-aware forms of the corresponding Lexical parsers.
- keyword and keywordIgnoreCase match standalone keyword tokens and reject input where the matched keyword is followed by an identifier-part character. Mismatches report keyword "..." while preserving the underlying character failure as the cause.
- identifier() matches a Java-like ASCII identifier token using CharPredicate.identifierStart and CharPredicate.identifierPart.
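The keyword boundary rule can be sketched as a literal match followed by a one-character check. KeywordSketch is a hypothetical helper; the identifier-part set used here (ASCII letters, digits, '_' and '$') is an assumption and may not match CharPredicate.identifierPart exactly, and the ignored-input skipping that TokensParser applies around tokens is left out.

```java
/** Hypothetical model of standalone keyword matching; not Taker's TokensParser. */
final class KeywordSketch {

    /** Assumed identifier-part set: ASCII letters, digits, '_' and '$'. */
    static boolean isIdentifierPart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9') || c == '_' || c == '$';
    }

    /** Returns the end offset of keyword at pos, or -1 when it is absent or runs into an identifier. */
    static int keywordEnd(String source, int pos, String keyword) {
        if (!source.startsWith(keyword, pos)) {
            return -1;                         // literal text does not match
        }
        int end = pos + keyword.length();
        if (end < source.length() && isIdentifierPart(source.charAt(end))) {
            return -1;                         // e.g. "iffy" must not match keyword "if"
        }
        return end;
    }
}
```

Under these assumptions, keywordEnd("if (x)", 0, "if") returns 2, while keywordEnd("iffy", 0, "if") returns -1.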
Csv and IsoDates are convenience parser collections. Their documented edge
cases are covered by focused tests under src/test/java/.../parsers.
Compatibility rules and the release checklist live in release-policy.md. This document remains the source of truth for parser semantics that should be preserved by compatible releases.