feat: add --json-ensure-ascii flag to control Unicode escaping in JSON / all output #1668
Summary
This PR adds configurable Unicode handling for JSON output in the CLI. Previously, when using the -o all output format, non-ASCII characters (like 'š') were automatically escaped as Unicode escape sequences (like \u0161). This made the output different from that of, e.g., -o markdown, where the characters were not escaped. I looked into this inconsistency and found that it is caused by the default behavior of json.dumps, which escapes non-ASCII characters unless ensure_ascii=False is passed. This change allows users to control this behavior via a global configuration setting or CLI flag, enabling proper Unicode character preservation in JSON output when desired.
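The json.dumps default behind this inconsistency can be reproduced with the standard library alone. The snippet below is an illustrative sketch, not code from this PR, and the sample record is made up.

```python
import json

# Hypothetical sample record containing a non-ASCII character.
record = {"title": "š"}

# Default behavior: ensure_ascii=True escapes every non-ASCII character.
escaped = json.dumps(record)

# With ensure_ascii=False the character is written out unchanged.
preserved = json.dumps(record, ensure_ascii=False)

print(escaped)    # {"title": "\u0161"}
print(preserved)  # {"title": "š"}
```

Both strings decode to the same data, so the setting only affects how the output reads, not what it contains.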
Key Changes:
- Added JSON_ENSURE_ASCII global configuration option (default: True for backward compatibility)
- Added --json-ensure-ascii / --no-json-ensure-ascii CLI flags to override the global setting
- Updated json.dumps() calls in the CLI to respect the ensure_ascii parameter
List of files changed and why
1. crawl4ai/config.py
Why: Added the new JSON_ENSURE_ASCII setting to the USER_SETTINGS dictionary so that users can configure a global default for Unicode handling in JSON output via crwl config set JSON_ENSURE_ASCII false.
2. crawl4ai/cli.py
Why: Implemented the feature by adding CLI flags (--json-ensure-ascii / --no-json-ensure-ascii), updating all json.dumps() calls (7 locations) to use the ensure_ascii parameter, and implementing priority logic where CLI flags override the global config. This ensures consistent Unicode handling across all JSON output paths in the CLI.
How Has This Been Tested?
Manual testing and output comparison.
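The priority rule described in this PR (CLI flag first, then the global setting, then the default of True) could also be checked automatically. The following is a hypothetical sketch of such a check: resolve_ensure_ascii and render are illustrative names, not functions from crawl4ai/cli.py, and a plain dict stands in for the stored user settings.

```python
import json
from typing import Optional


def resolve_ensure_ascii(cli_flag: Optional[bool], settings: dict) -> bool:
    """Return the effective ensure_ascii value (hypothetical helper).

    cli_flag is None when neither --json-ensure-ascii nor
    --no-json-ensure-ascii was passed on the command line.
    """
    if cli_flag is not None:
        # An explicit CLI flag always wins over the global setting.
        return cli_flag
    # Fall back to the global setting, defaulting to True for
    # backward compatibility.
    return bool(settings.get("JSON_ENSURE_ASCII", True))


def render(data: dict, cli_flag: Optional[bool], settings: dict) -> str:
    """Serialize data using the resolved ensure_ascii value."""
    ensure_ascii = resolve_ensure_ascii(cli_flag, settings)
    return json.dumps(data, ensure_ascii=ensure_ascii)
```

With this shape, render({"t": "š"}, None, {}) keeps the escaped default, while passing False as the flag or setting JSON_ENSURE_ASCII to False preserves the character.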
Checklist: