Skip to content

Conversation

@microHoffman
Copy link

Summary

This PR adds configurable Unicode handling for JSON output in the CLI. Previously, when using the -o all output format, non-ASCII characters (like 'š') were automatically escaped as Unicode escape sequences (like \u0161). This made the output different than when e.g. -o markdown was specified, where the characters were not escaped. I looked into this incosistency and found out it's because of the default behavior of json.dumps.

This change allows users to control this behavior via a global configuration setting or CLI flag, enabling proper Unicode character preservation in JSON output when desired.

Key Changes:

  • Added JSON_ENSURE_ASCII global configuration option (default: True for backward compatibility)
  • Added --json-ensure-ascii/--no-json-ensure-ascii CLI flags to override the global setting
  • Updated all json.dumps() calls in the CLI to respect the ensure_ascii parameter
  • Maintains backward compatibility by defaulting to ASCII escaping (existing behavior)

List of files changed and why

1. crawl4ai/config.py

Why: Added the new JSON_ENSURE_ASCII setting to USER_SETTINGS dictionary to allow users to configure a global default for Unicode handling in JSON output via crwl config set JSON_ENSURE_ASCII false.

2. crawl4ai/cli.py

Why: Implemented the feature by adding CLI flags (--json-ensure-ascii/--no-json-ensure-ascii), updating all json.dumps() calls (7 locations) to use the ensure_ascii parameter, and implementing priority logic where CLI flags override global config. This ensures consistent Unicode handling across all JSON output paths in the CLI.

How Has This Been Tested?

Manual testing and output comparision.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@microHoffman microHoffman changed the title feat: add --json-ensure-ascii flag to control Unicode escaping in JSON output feat: add --json-ensure-ascii flag to control Unicode escaping in JSON / all output Dec 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant