Fast, fault-tolerant tool to normalize messy JSON/JSONL into warehouse-ready, dict-only JSONL. Features: Batch directory conversion, record-level deduplication (SHA-256), smart concatenation, and detailed discard logging. Zero dependencies.

yeiichi/jsonl-normalizer

jsonl-normalizer

A fast, fault-tolerant tool that normalizes messy JSONL files into clean, dict-only, BigQuery-friendly JSONL. Supports discard logging, SHA-256 deduplication, JSONL concatenation, and mixed-type top-level lines (dicts, lists, strings, numbers).


🚀 Features

Normalization

  • Normalize any JSONL file

    • Accepts dicts, lists, numbers, strings, malformed lines
    • Extracts dicts from lists
    • Logs non-dict elements instead of failing
  • BigQuery-friendly output
    Ensures one JSON object per line.

  • Robust error handling

    • Malformed JSON → logged
    • Non-dict top-level values → logged
    • Mixed lists → dicts kept, junk discarded
  • Optional SHA-256 deduplication
    Canonical JSON hashing removes duplicate objects across large files.

  • Zero dependencies
    Pure standard library. Fast and lightweight.
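The "one JSON object per line" guarantee can be checked independently of the tool. The sketch below is not part of jsonl-normalizer; `non_dict_lines` is a hypothetical helper name, and it treats blank lines as violations, which is an assumption rather than documented behavior.

```python
import json
from pathlib import Path


def non_dict_lines(path: Path) -> list[int]:
    """Return 1-based line numbers that do not hold exactly one JSON object."""
    bad = []
    with path.open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            try:
                value = json.loads(line)
            except json.JSONDecodeError:
                # Malformed or blank lines are reported, not raised.
                bad.append(line_no)
                continue
            if not isinstance(value, dict):
                # Lists, strings, numbers, etc. are valid JSON but not dict records.
                bad.append(line_no)
    return bad
```

An empty result means every line parsed as a dict, which is the shape a normalized output file is expected to have.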

NEW (v0.2.1): Batch JSON to JSONL

  • Batch convert a directory of classic .json files to .jsonl
  • Perfect for converting legacy exports or API dumps
  • Optional dedupe and custom discard logging
  • Clean argparse-based CLI (json-to-jsonl)

NEW (v0.2.1): JSONL Concatenation

  • Combine many normalized_*.jsonl files into one newline-delimited JSONL
  • Perfect for BigQuery (NEWLINE_DELIMITED_JSON)
  • Optional dedupe via SHA-256
  • Gentle warnings for non-standard output filenames
  • Clean argparse-based CLI (jsonl-concat)

📦 Installation

pip install jsonl-normalizer

Development install:

pip install -e .

🖥️ CLI Usage

1. Normalize JSONL

Normalize a JSONL file:

jsonl-normalize input.jsonl

Produces:

normalized.jsonl   # clean dict-only output
discarded.jsonl    # log of malformed or discarded items

Enable deduplication:

jsonl-normalize input.jsonl --dedupe

Specify custom output:

jsonl-normalize input.jsonl --output clean.jsonl --discarded junk.jsonl

📂 NEW: json-to-jsonl - Batch JSON to JSONL Converter

json-to-jsonl converts all .json files in a source directory to .jsonl files in an output directory.

Usage

json-to-jsonl source_dir output_dir

If --discarded-dir is not provided, discard logs go to a discarded_json directory, which is created only when there are items to discard.

Features

  • Detects all .json files in source_dir
  • Converts each to output_dir/<filename>.jsonl
  • Optional SHA-256 dedupe (--dedupe)
  • Default discarded directory discarded_json (optional override via --discarded-dir)
  • Fault-tolerant: Empty discarded files are never created
  • Quiet mode (--quiet)

Examples

json-to-jsonl ./raw_jsons ./converted_jsonls

With deduplication and discarded logs:

json-to-jsonl ./raw_jsons ./converted_jsonls --discarded-dir ./discarded --dedupe
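To illustrate what converting a classic .json file to .jsonl involves, here is a minimal standard-library sketch. It is not the tool's implementation: `json_file_to_jsonl` and `convert_dir` are hypothetical names, the output name is assumed to be the source file's stem plus .jsonl, and non-dict items are silently skipped here, whereas the real tool logs them to the discarded directory.

```python
import json
from pathlib import Path


def json_file_to_jsonl(src: Path, dst: Path) -> int:
    """Write the dict records of one .json file to dst, one per line; return the count."""
    data = json.loads(src.read_text(encoding="utf-8"))
    # Assumption: a top-level dict is one record, a top-level list is many.
    items = data if isinstance(data, list) else [data]
    written = 0
    with dst.open("w", encoding="utf-8") as out:
        for item in items:
            if isinstance(item, dict):
                out.write(json.dumps(item, ensure_ascii=False) + "\n")
                written += 1
    return written


def convert_dir(source_dir: Path, output_dir: Path) -> dict[str, int]:
    """Convert every *.json file in source_dir; return records written per file."""
    output_dir.mkdir(parents=True, exist_ok=True)
    counts = {}
    for src in sorted(source_dir.glob("*.json")):
        counts[src.name] = json_file_to_jsonl(src, output_dir / f"{src.stem}.jsonl")
    return counts
```

For real workloads, the json-to-jsonl CLI or convert_json_dir_to_jsonl adds the dedupe and discard logging that this sketch leaves out.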

🔗 NEW: jsonl-concat - JSONL Concatenation Tool

jsonl-concat concatenates multiple normalized JSONL files into a single multi-line JSONL file.

This is ideal when your workflow produces many files such as:

norm_jsonl/
  normalized_0044a4b1d5099e2a.jsonl
  normalized_007b2d5c01abc0b9.jsonl
  normalized_02231d6de9a07833.jsonl
  ...

Combine them into one BigQuery-friendly file:

jsonl-concat

Default behavior is equivalent to:

jsonl-concat norm_jsonl/ combined.jsonl --pattern "*.jsonl"

Features

  • Reads files matching the given pattern (default *.jsonl) under the given directory
  • Processes files line-by-line for proper record-level deduplication
  • Writes one JSON object per line
  • SHA-256 dedupe, enabled by default (--no-dedupe to disable)
  • Quiet mode (--quiet)
  • Gentle suffix warning when output file is not .jsonl/.ndjson

Examples

Use defaults:

jsonl-concat

Explicit directory and output:

jsonl-concat norm_jsonl/ final.jsonl

Custom file pattern (e.g., if your files don't start with normalized_ or have different extensions):

jsonl-concat norm_jsonl/ combined.jsonl --pattern "*.json"

Quiet mode:

jsonl-concat --quiet norm_jsonl/ combined.jsonl

Disable deduplication:

jsonl-concat --no-dedupe norm_jsonl/ combined.jsonl

When not in quiet mode and the output filename is non-standard, a warning is printed:

[WARN] Output file 'out.json' does not use .jsonl or .ndjson. Continuing.
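The concatenation behavior described above (pattern matching, line-by-line reading, record-level dedupe, suffix warning) can be approximated in a few lines of standard-library Python. This is a sketch under assumptions, not the tool's code: `concat` is a hypothetical name, files are assumed to be processed in sorted order, and the warning is assumed to go to stderr.

```python
import hashlib
import json
import sys
from pathlib import Path


def concat(source_dir: Path, output_file: Path,
           pattern: str = "*.jsonl", dedupe: bool = True) -> int:
    """Concatenate matching JSONL files into output_file; return records written."""
    if output_file.suffix not in {".jsonl", ".ndjson"}:
        print(f"[WARN] Output file '{output_file.name}' does not use "
              ".jsonl or .ndjson. Continuing.", file=sys.stderr)
    # Resolve inputs first so the output file is never read as an input.
    inputs = sorted(p for p in source_dir.glob(pattern)
                    if p.resolve() != output_file.resolve())
    seen = set()
    written = 0
    with output_file.open("w", encoding="utf-8") as out:
        for path in inputs:
            with path.open(encoding="utf-8") as handle:
                for line in handle:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    record = json.loads(line)
                    if dedupe:
                        # Canonical form: sorted keys, compact separators.
                        canonical = json.dumps(record, sort_keys=True,
                                               separators=(",", ":"))
                        key = hashlib.sha256(canonical.encode("utf-8")).digest()
                        if key in seen:
                            continue
                        seen.add(key)
                    out.write(json.dumps(record, ensure_ascii=False) + "\n")
                    written += 1
    return written
```

Reading line by line keeps memory bounded by the set of digests rather than by the size of the input files.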

📄 Example (Normalization)

Input (mixed.jsonl)

{"a": 1, "b": 2}
[{"a": 2}, [7]]
"just a string"

Output: normalized.jsonl

{"a": 1, "b": 2}
{"a": 2}

Output: discarded.jsonl

{"line": 2, "index": 1, "type": "list", "value": [7], "reason": "non-dict element in list"}
{"line": 3, "type": "str", "value": "just a string", "reason": "top-level value is not dict or list"}
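The per-line decision shown in this example can be sketched as follows. `split_line` is a hypothetical helper, not the library's API. The log entries for list elements and top-level values mirror the format above, while the shape of the malformed-JSON entry (the "raw" key and reason text) is an assumption.

```python
from __future__ import annotations

import json
from typing import Any


def split_line(line_no: int, raw: str) -> tuple[list[dict], list[dict]]:
    """Return (kept_dicts, discard_log_entries) for one raw JSONL line."""
    try:
        value: Any = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Malformed JSON is logged instead of raised (entry shape is assumed).
        return [], [{"line": line_no, "raw": raw,
                     "reason": f"malformed JSON: {exc.msg}"}]

    if isinstance(value, dict):
        return [value], []

    if isinstance(value, list):
        kept, discarded = [], []
        for index, item in enumerate(value):
            if isinstance(item, dict):
                kept.append(item)
            else:
                discarded.append({
                    "line": line_no,
                    "index": index,
                    "type": type(item).__name__,
                    "value": item,
                    "reason": "non-dict element in list",
                })
        return kept, discarded

    # Strings, numbers, booleans and null at the top level are discarded.
    return [], [{
        "line": line_no,
        "type": type(value).__name__,
        "value": value,
        "reason": "top-level value is not dict or list",
    }]
```

Calling `split_line(2, '[{"a": 2}, [7]]')` keeps `{"a": 2}` and yields one discard entry for `[7]` at index 1, matching line 2 of the example.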

🧪 Library Usage

from pathlib import Path
from jsonl_normalizer import normalize_jsonl, convert_json_dir_to_jsonl, concat_jsonl

# 1. Normalize a single file
stats = normalize_jsonl(
    input_path=Path("input.jsonl"),
    output_path=Path("normalized.jsonl"),
    discarded_path=Path("discarded.jsonl"),
    dedupe=True,
)
print(f"Single file: {stats}")

# 2. Batch convert a directory (discarded_dir is optional)
results = convert_json_dir_to_jsonl(
    source_dir=Path("./json_inputs"),
    output_dir=Path("./jsonl_outputs"),
    dedupe=True,
)
for filename, stats in results.items():
    print(f"{filename}: {stats.written} records")

# 3. Concatenate multiple JSONL files
concat_jsonl(
    source_dir=Path("./norm_jsonl"),
    output_file=Path("combined.jsonl"),
    pattern="*.jsonl",
    dedupe=True,
)

❓ Why jsonl-normalizer?

Real-world JSONL is messy:

  • LLMs output arrays or malformed fragments
  • Excel corrupts JSON strings
  • Some APIs return non-dict top-level structures
  • Data lakes accumulate junk
  • BigQuery requires strict dict-per-line JSONL
  • ETL pipelines fail on partial corruption

jsonl-normalizer fixes these problems by:

  • Normalizing structure
  • Logging all junk transparently
  • Keeping valid dicts only
  • Providing optional dedupe mode
  • Producing warehouse-ready JSONL

🧹 Deduplication

When --dedupe is enabled:

  • Each object is canonicalized (sorted keys, compact JSON)
  • Hashed using SHA-256
  • Duplicates are skipped automatically

Example:

Normalized records seen: 200
Unique records written: 173
Duplicates skipped: 27
Discarded items logged: 12 → discarded.jsonl
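The canonicalize-then-hash steps can be sketched with the standard library. This illustrates the technique only and may differ in detail from the tool's internals. `record_digest` and `dedupe` are hypothetical names.

```python
import hashlib
import json
from typing import Iterable, Iterator


def record_digest(record: dict) -> str:
    """SHA-256 hex digest of a record's canonical JSON form."""
    # Sorted keys and compact separators make key order and whitespace irrelevant.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe(records: Iterable[dict]) -> Iterator[dict]:
    """Yield each distinct record once, in first-seen order."""
    seen = set()
    for record in records:
        digest = record_digest(record)
        if digest in seen:
            continue  # duplicate: skip
        seen.add(digest)
        yield record
```

Under this scheme `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` hash identically, while values that serialize differently, such as `1` and `1.0`, do not.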

🧪 Testing

pip install -e .
pip install pytest
pytest

🤝 Contributing

Pull requests are welcome. Please ensure:

  • Tests pass
  • Code follows PEP 8
  • Changes remain backward compatible

📄 License

MIT License. See LICENSE for details.
