A fast, fault-tolerant tool that normalizes messy JSONL files into clean, dict-only, BigQuery-friendly JSONL. Supports discard logging, SHA-256 deduplication, JSONL concatenation, and mixed-type top-level lines (dicts, lists, strings, numbers).
- **Normalize any JSONL file**
  - Accepts dicts, lists, numbers, strings, malformed lines
  - Extracts dicts from lists
  - Logs non-dict elements instead of failing
- **BigQuery-friendly output**
  Ensures one JSON object per line.
- **Robust error handling**
  - Malformed JSON → logged
  - Non-dict top-level values → logged
  - Mixed lists → dicts kept, junk discarded
- **Optional SHA-256 deduplication**
  Canonical JSON hashing removes duplicate objects across large files.
- **Zero dependencies**
  Pure standard library. Fast and lightweight.
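The normalization rules above can be sketched roughly as follows. This is an illustrative helper (`classify_line` is a hypothetical name, not part of the package's API) showing how one raw line is split into kept dicts and discarded records:

```python
import json

def classify_line(line: str, line_no: int):
    """Split one raw JSONL line into kept dicts and discarded records.

    Illustrative sketch of the rules above, not the library's internals.
    """
    kept, discarded = [], []
    try:
        value = json.loads(line)
    except json.JSONDecodeError:
        # Malformed JSON is logged, never raised
        discarded.append({"line": line_no, "value": line, "reason": "malformed JSON"})
        return kept, discarded
    if isinstance(value, dict):
        kept.append(value)                      # dicts pass through unchanged
    elif isinstance(value, list):
        for i, item in enumerate(value):        # extract dicts from lists
            if isinstance(item, dict):
                kept.append(item)
            else:
                discarded.append({"line": line_no, "index": i,
                                  "type": type(item).__name__, "value": item,
                                  "reason": "non-dict element in list"})
    else:
        # numbers, strings, booleans, null at top level are discarded
        discarded.append({"line": line_no, "type": type(value).__name__,
                          "value": value,
                          "reason": "top-level value is not dict or list"})
    return kept, discarded
```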
**json-to-jsonl**

- Batch convert a directory of classic `.json` files to `.jsonl`
- Perfect for converting legacy exports or API dumps
- Optional dedupe and custom discard logging
- Clean argparse-based CLI (`json-to-jsonl`)

**jsonl-concat**

- Combine many `normalized_*.jsonl` files into one newline-delimited JSONL
- Perfect for BigQuery (`NEWLINE_DELIMITED_JSON`)
- Optional dedupe via SHA-256
- Gentle warnings for non-standard output filenames
- Clean argparse-based CLI (`jsonl-concat`)
```bash
pip install jsonl-normalizer
```

Development install:

```bash
pip install -e .
```

Normalize a JSONL file:

```bash
jsonl-normalize input.jsonl
```

Produces:

```
normalized.jsonl   # clean dict-only output
discarded.jsonl    # log of malformed or discarded items
```

Enable deduplication:

```bash
jsonl-normalize input.jsonl --dedupe
```

Specify custom output:

```bash
jsonl-normalize input.jsonl --output clean.jsonl --discarded junk.jsonl
```

json-to-jsonl converts all `.json` files in a source directory to `.jsonl` files in an output directory.

```bash
json-to-jsonl source_dir output_dir
```

By default, if `--discarded-dir` is not provided, a `discarded_json` directory is created to hold logs of discarded items (but only if there are actual items to discard).
- Detects all `.json` files in `source_dir`
- Converts each to `output_dir/<filename>.jsonl`
- Optional SHA-256 dedupe (`--dedupe`)
- Default discarded directory `discarded_json` (optional override via `--discarded-dir`)
- Fault-tolerant: empty discarded files are never created
- Quiet mode (`--quiet`)
```bash
json-to-jsonl ./raw_jsons ./converted_jsonls
```

With deduplication and discarded logs:

```bash
json-to-jsonl ./raw_jsons ./converted_jsonls --discarded-dir ./discarded --dedupe
```

jsonl-concat concatenates multiple normalized JSONL files into a single newline-delimited JSONL file.
This is ideal when your workflow produces many files such as:
```
norm_jsonl/
    normalized_0044a4b1d5099e2a.jsonl
    normalized_007b2d5c01abc0b9.jsonl
    normalized_02231d6de9a07833.jsonl
    ...
```
Combine them into one BigQuery-friendly file:
```bash
jsonl-concat
```

Default behavior is equivalent to:

```bash
jsonl-concat norm_jsonl/ combined.jsonl --pattern "*.jsonl"
```

- Reads files matching the given pattern (default `*.jsonl`) under the given directory
- Processes files line-by-line for proper record-level deduplication
- Writes one JSON object per line
- Optional SHA-256 dedupe (`--no-dedupe` to disable)
- Quiet mode (`--quiet`)
- Gentle suffix warning when the output file is not `.jsonl`/`.ndjson`
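The line-by-line concatenation with record-level dedupe can be sketched as below. This is a minimal illustration of the behavior just described, not the package's implementation (`concat_sketch` is a hypothetical name), and it assumes the input files already contain one valid JSON object per line:

```python
import glob
import hashlib
import json
import os

def concat_sketch(source_dir: str, output_file: str,
                  pattern: str = "*.jsonl", dedupe: bool = True) -> int:
    """Concatenate matching JSONL files line-by-line; return lines written."""
    seen, written = set(), 0
    with open(output_file, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(os.path.join(source_dir, pattern))):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    if dedupe:
                        # Canonical form so key order can't defeat dedupe
                        canonical = json.dumps(json.loads(line),
                                               sort_keys=True,
                                               separators=(",", ":"))
                        key = hashlib.sha256(canonical.encode()).hexdigest()
                        if key in seen:
                            continue
                        seen.add(key)
                    out.write(line + "\n")
                    written += 1
    return written
```

Writing the output file outside the source directory (or with a non-matching name) avoids it being picked up by the glob pattern.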
Use defaults:

```bash
jsonl-concat
```

Explicit directory and output:

```bash
jsonl-concat norm_jsonl/ final.jsonl
```

Custom file pattern (e.g., if your files don't start with `normalized_` or have different extensions):

```bash
jsonl-concat norm_jsonl/ combined.jsonl --pattern "*.json"
```

Quiet mode:

```bash
jsonl-concat --quiet norm_jsonl/ combined.jsonl
```

Disable deduplication:

```bash
jsonl-concat --no-dedupe norm_jsonl/ combined.jsonl
```

If verbose and the output filename is non-standard:

```
[WARN] Output file 'out.json' does not use .jsonl or .ndjson. Continuing.
```
Input (`input.jsonl`):

```
{"a": 1, "b": 2}
[{"a": 2}, [7]]
"just a string"
```

Output (`normalized.jsonl`):

```
{"a": 1, "b": 2}
{"a": 2}
```

Discarded log (`discarded.jsonl`):

```
{"line": 2, "index": 1, "type": "list", "value": [7], "reason": "non-dict element in list"}
{"line": 3, "type": "str", "value": "just a string", "reason": "top-level value is not dict or list"}
```

```python
from pathlib import Path

from jsonl_normalizer import normalize_jsonl, convert_json_dir_to_jsonl, concat_jsonl

# 1. Normalize a single file
stats = normalize_jsonl(
    input_path=Path("input.jsonl"),
    output_path=Path("normalized.jsonl"),
    discarded_path=Path("discarded.jsonl"),
    dedupe=True,
)
print(f"Single file: {stats}")

# 2. Batch convert a directory (discarded_dir is optional)
results = convert_json_dir_to_jsonl(
    source_dir=Path("./json_inputs"),
    output_dir=Path("./jsonl_outputs"),
    dedupe=True,
)
for filename, stats in results.items():
    print(f"{filename}: {stats.written} records")

# 3. Concatenate multiple JSONL files
concat_jsonl(
    source_dir=Path("./norm_jsonl"),
    output_file=Path("combined.jsonl"),
    pattern="*.jsonl",
    dedupe=True,
)
```

Real-world JSONL is messy:
- LLMs output arrays or malformed fragments
- Excel corrupts JSON strings
- Some APIs return non-dict top-level structures
- Data lakes accumulate junk
- BigQuery requires strict dict-per-line JSONL
- ETL pipelines fail on partial corruption
jsonl-normalizer fixes these problems by:
- Normalizing structure
- Logging all junk transparently
- Keeping valid dicts only
- Providing optional dedupe mode
- Producing warehouse-ready JSONL
When `--dedupe` is enabled:

- Each object is canonicalized (sorted keys, compact JSON)
- Hashed using SHA-256
- Duplicates are skipped automatically

Example:

```
Normalized records seen: 200
Unique records written:  173
Duplicates skipped:      27
Discarded items logged:  12 → discarded.jsonl
```
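The canonicalize-then-hash step can be sketched in a few lines of standard library code (`record_fingerprint` is an illustrative name, not the package's API):

```python
import hashlib
import json

def record_fingerprint(record: dict) -> str:
    """SHA-256 of the canonical JSON form: sorted keys, compact separators.

    Two dicts that differ only in key order produce the same fingerprint,
    so they dedupe to a single record.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the fingerprint is a fixed-size hex string, the set of seen hashes stays small even when the input files are large.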
```bash
pip install -e .
pip install pytest
pytest
```

Pull requests are welcome. Please ensure:

- Tests pass
- Code follows PEP 8
- Changes remain backward compatible

MIT License. See `LICENSE` for details.