Conversation
Unify extractTextFromContentWithContext, extractTextFromContentWithBounds, and extractTextWithMcidTracking into a single extractContentStream() using an ExtractionMode union (stream/bounds/structured). Extract shared logic into pushOperand() and lookupFont() helpers. Remove 4 duplicate text write helpers. Rename ParseError to ParseErrorRecord to avoid collision with parser.ParseError. Fix decompress.zig to import Object from parser.zig instead of root.zig.
📝 Walkthrough
Refactors extraction into a unified content-extraction pipeline with a new ExtractionMode; renames ParseError → ParseErrorRecord; adjusts struct-tree element allocation and traversal depth limits; and updates an import.
Sequence Diagram(s)
sequenceDiagram
participant Client as Client
participant Extractor as ExtractionEngine
participant Font as FontLookup/Cache
participant Writer as BufferWriter/NullWriter
participant Struct as StructTree
Client->>Extractor: request extract page (mode: stream/bounds/structured)
Extractor->>Font: resolve font/context for text operators
Font-->>Extractor: font decoding/metrics
Extractor->>Writer: writeTextToBuffer / writeTJArrayToBuffer (or NullWriter)
Writer-->>Extractor: buffered text / noop
alt structured extraction
Extractor->>Struct: consult MCID / structure tree
Struct-->>Extractor: MCID mapping / children
end
Extractor-->>Client: extracted text / structured output
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/root.zig`:
- Around line 976-978: The fixed-size text_buf ([4096]u8) and text_pos may
truncate long text spans; change to a growable buffer: introduce a compile-time
constant (e.g., const TEXT_BUF_INITIAL = 4096) and replace text_buf/text_pos
with an allocator-backed slice (allocating TEXT_BUF_INITIAL) that is resized
(realloc/grow) when needed before writes, updating all uses to track
length/capacity instead of a raw array; alternatively, if dynamic allocation is
undesired, expose the buffer size as a compile-time constant and document the
limitation so callers can tune it.
- Around line 927-938: The code currently constructs an ExtractionContext with
.xref_table = undefined and .object_cache = undefined which is fragile; change
ExtractionMode.stream.ctx to be an optional pointer (e.g., ?*const
ExtractionContext) and pass null instead of undefined when no context is
available in the call site in extractContentStream; then update handleDoOperator
to test for a null ctx (e.g., if (ctx == null) return) and safely unwrap ctx
only when present before accessing xref_table or object_cache; ensure all other
call sites and any pattern matches are updated to the new optional type so no
undefined pointer values are used.
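The second item's shape, an optional context with an early return in the handler, can be sketched as follows. This is a Python analogy rather than the project's Zig code: `ExtractionContext`, `handle_do_operator`, and the dictionary-backed `object_cache` are hypothetical stand-ins that only mirror the names used in the comment above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtractionContext:
    # Hypothetical stand-ins for the xref table and object cache.
    xref_table: dict = field(default_factory=dict)
    object_cache: dict = field(default_factory=dict)


def handle_do_operator(name: str, ctx: Optional[ExtractionContext]) -> Optional[str]:
    """Resolve an XObject by name, or do nothing when no context is available."""
    if ctx is None:
        # No context: return early instead of reading uninitialized fields.
        return None
    return ctx.object_cache.get(name)


ctx = ExtractionContext(object_cache={"Im0": "form xobject contents"})
with_ctx = handle_do_operator("Im0", ctx)
without_ctx = handle_do_operator("Im0", None)
```

Passing an explicit "no context" value makes the missing-context case visible at the call site, which is the property the review comment asks for in place of `undefined` fields.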
🧹 Nitpick comments (2)
src/root.zig (2)
869-905: Consider adding overflow detection for debugging.
When `count.* >= operands.len`, operands are silently dropped. While 64 operands is generous for standard PDF operators, silently ignoring overflow could mask issues with malformed PDFs or extraction bugs.
💡 Optional: Add debug assertion or logging
     .number => |n| {
         if (count.* < operands.len) {
             operands[count.*] = .{ .number = n };
             count.* += 1;
    -    }
    +    } else if (builtin.mode == .Debug) {
    +        std.debug.print("Warning: operand overflow at count {d}\n", .{count.*});
    +    }
     },
730-747: Consider simplifying the labeled block pattern.
The `nw_blk:` labeled block pattern is valid but unusual. A simpler approach might improve readability:
💡 Alternative pattern
    -    if (nw_blk: {
    -        var nw: NullWriter = .{};
    -        break :nw_blk extractContentStream(content, .{ .structured = &extractor }, &self.font_cache, page_num, arena, &nw);
    -    }) |_| {
    +    var nw: NullWriter = .{};
    +    const extract_ok = extractContentStream(content, .{ .structured = &extractor }, &self.font_cache, page_num, arena, &nw);
    +    if (extract_ok) |_| {
             // Collect text in structure tree order
             ...
         } else |_| {}
- Make ExtractionMode.stream.ctx optional (?*const ExtractionContext) so extractTextFromContent no longer passes a dummy context with undefined xref_table/object_cache fields; handleDoOperator now returns early on null
- Expose MCID text buffer size as compile-time constant MCID_TEXT_BUF_SIZE and document the truncation behaviour
- Simplify labeled nw_blk block into a named extract_ok variable
…d tests

structtree.zig had a memory-safety bug: StructChild.element pointers were taken as &elements.items[i] into an ArrayList that could reallocate on subsequent appends, leaving those pointers dangling and causing a segfault on tagged PDFs with non-trivial structure trees (e.g. PDFUA-Ref-2-08_BookChapter).

Fix: allocator.create(StructElement) for every node so addresses are stable. StructTree.elements is now []*StructElement; deinit calls allocator.destroy. Also add MAX_STRUCT_DEPTH=256 guard in collectMcidsInOrder as secondary safety.

Expand test suite 26->150: reading_order, markdown extraction, TextSpan props, Document.__len__, PageInfo.__repr__, page separators, all 7 benchmark tagged PDFs, and all 88 malformed Test_Corpus PDFs for robustness.
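The depth guard named in this commit can be illustrated with a minimal Python sketch. It is an analogy, not the Zig implementation: the `StructElement` class, the use of plain integers for MCIDs, and the exact cutoff comparison are assumptions made for illustration; only the names `MAX_STRUCT_DEPTH` and `collectMcidsInOrder` and the value 256 come from the commit message.

```python
from dataclasses import dataclass, field
from typing import List, Union

MAX_STRUCT_DEPTH = 256  # limit named in the commit message


@dataclass
class StructElement:
    # Hypothetical node: children are nested elements or integer MCIDs.
    children: List[Union["StructElement", int]] = field(default_factory=list)


def collect_mcids_in_order(elem: StructElement, out: List[int], depth: int = 0) -> None:
    """Depth-first walk in document order that stops descending past the limit."""
    if depth > MAX_STRUCT_DEPTH:
        return  # Guard against pathologically deep or cyclic trees.
    for child in elem.children:
        if isinstance(child, int):
            out.append(child)
        else:
            collect_mcids_in_order(child, out, depth + 1)


# A small tree yields its MCIDs in reading order.
shallow: List[int] = []
collect_mcids_in_order(StructElement([0, StructElement([1, 2]), 3]), shallow)

# A chain nested 300 levels deep is cut off before its only MCID is reached.
node = StructElement([99])
for _ in range(300):
    node = StructElement([node])
deep: List[int] = []
collect_mcids_in_order(node, deep)
```

The guard trades completeness for safety: content below the limit is skipped rather than risking unbounded recursion on a malformed tree.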
🧹 Nitpick comments (1)
python/tests/test_zpdf.py (1)
388-397: Consider using a tuple instead of a list for immutable class constant.
The static analysis tool flags `TAGGED_PDFS` as a mutable class attribute (RUF012). Since this list is meant to be constant, using a tuple prevents accidental mutation.
♻️ Proposed fix
    -    TAGGED_PDFS = [
    +    TAGGED_PDFS = (
             "PDFUA-Ref-2-01_Magazine-danish.pdf",
             "PDFUA-Ref-2-02_Invoice.pdf",
             "PDFUA-Ref-2-03_AcademicAbstract.pdf",
             "PDFUA-Ref-2-04_Presentation.pdf",
             "PDFUA-Ref-2-05_BookChapter-german.pdf",
             "PDFUA-Ref-2-06_Brochure.pdf",
             "PDFUA-Ref-2-08_BookChapter.pdf",
    -    ]
    +    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/tests/test_zpdf.py` around lines 388 - 397, TAGGED_PDFS is defined as a mutable list but intended as an immutable class constant; replace the list literal with a tuple literal (e.g., use parentheses instead of square brackets for TAGGED_PDFS) to prevent accidental mutation and satisfy static analysis, leaving BENCHMARK_DIR unchanged and ensuring any code that iterates over TAGGED_PDFS continues to work since tuples are iterable.
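A minimal sketch of why the tuple form is safer, assuming a hypothetical `TaggedPdfCases` class (the real test class name is not shown here) and only two of the file names from the list above:

```python
class TaggedPdfCases:
    # Hypothetical class; a tuple constant has no append/extend methods.
    TAGGED_PDFS = (
        "PDFUA-Ref-2-01_Magazine-danish.pdf",
        "PDFUA-Ref-2-08_BookChapter.pdf",
    )


def try_append(seq) -> bool:
    """Return True if an item could be appended to seq, False otherwise."""
    try:
        seq.append("extra.pdf")
    except AttributeError:
        return False
    return True


tuple_mutated = try_append(TaggedPdfCases.TAGGED_PDFS)
list_mutated = try_append(list(TaggedPdfCases.TAGGED_PDFS))
names = [name for name in TaggedPdfCases.TAGGED_PDFS]  # iteration is unchanged
```

Code that only iterates over the constant behaves the same with either type, so the change should be invisible to parametrized tests.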