jsonl-normalizer

A fast, fault-tolerant tool that normalizes messy JSONL files into clean, dict-only, BigQuery-friendly JSONL. Supports discard logging, SHA-256 deduplication, JSONL concatenation, and mixed-type top-level lines (dicts, lists, strings, numbers).

🚀 Features

Normalization

- Normalize any JSONL file
  - Accepts dicts, lists, numbers, strings, malformed lines
  - Extracts dicts from lists
  - Logs non-dict elements instead of failing
- BigQuery-friendly output: ensures one JSON object per line.
- Robust error handling
  - Malformed JSON → logged
  - Non-dict top-level values → logged
  - Mixed lists → dicts kept, junk discarded
- Optional SHA-256 deduplication: canonical JSON hashing removes duplicate objects across large files.
- Zero dependencies: pure standard library. Fast and lightweight.

NEW (v0.2.1): Batch JSON to JSONL

Batch convert a directory of classic .json files to .jsonl
Perfect for converting legacy exports or API dumps
Optional dedupe and custom discard logging
Clean argparse-based CLI (json-to-jsonl)

NEW (v0.2.1): JSONL Concatenation

Combine many normalized_*.jsonl files into one newline-delimited JSONL
Perfect for BigQuery (NEWLINE_DELIMITED_JSON)
Optional dedupe via SHA-256
Gentle warnings for non-standard output filenames
Clean argparse-based CLI (jsonl-concat)

📦 Installation

pip install jsonl-normalizer

Development install:

pip install -e .

🖥️ CLI Usage

1. Normalize JSONL

Normalize a JSONL file:

jsonl-normalize input.jsonl

Produces:

normalized.jsonl   # clean dict-only output
discarded.jsonl    # log of malformed or discarded items

Enable deduplication:

jsonl-normalize input.jsonl --dedupe

Specify custom output:

jsonl-normalize input.jsonl --output clean.jsonl --discarded junk.jsonl
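
To confirm that an output file really is dict-only before loading it into a warehouse, a small independent check can help. The sketch below is not part of jsonl-normalizer; the function name `find_non_object_lines` is hypothetical, and it assumes the file is UTF-8 encoded.

```python
import json


def find_non_object_lines(path):
    """Return the 1-based numbers of lines that are not a single JSON object."""
    bad_lines = []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            try:
                value = json.loads(line)
            except json.JSONDecodeError:
                # Malformed JSON (including blank lines) is flagged.
                bad_lines.append(line_no)
                continue
            if not isinstance(value, dict):
                # Valid JSON, but not an object: also flagged.
                bad_lines.append(line_no)
    return bad_lines
```

An empty list from `find_non_object_lines("clean.jsonl")` means every line parsed as a JSON object.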

📂 NEW: `json-to-jsonl` — Batch JSON to JSONL Converter

json-to-jsonl converts all .json files in a source directory to .jsonl files in an output directory.

Usage

json-to-jsonl source_dir output_dir

If --discarded-dir is not provided, logs of discarded items go to a discarded_json directory, which is created only when there are actual items to discard.

Features

Detects all .json files in source_dir
Converts each to output_dir/<filename>.jsonl
Optional SHA-256 dedupe (--dedupe)
Default discarded directory discarded_json (optional override via --discarded-dir)
No empty discard files: a log is written only when items are actually discarded
Quiet mode (--quiet)

Examples

json-to-jsonl ./raw_jsons ./converted_jsonls

With deduplication and discarded logs:

json-to-jsonl ./raw_jsons ./converted_jsonls --discarded-dir ./discarded --dedupe
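
The conversion of a single file can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the tool's actual implementation: `convert_file`, the `_discarded.jsonl` log name, and the shape of the log entries are all hypothetical. It shows the two behaviours described above: dicts become one line each, and the discard directory is created only when something is discarded.

```python
import json
from pathlib import Path


def convert_file(json_path, output_dir, discarded_dir):
    """Convert one classic .json file to .jsonl; return (kept, discarded) counts."""
    json_path = Path(json_path)
    output_dir = Path(output_dir)
    discarded_dir = Path(discarded_dir)

    data = json.loads(json_path.read_text(encoding="utf-8"))
    # A top-level list contributes one candidate per element;
    # any other top-level value is treated as a single candidate.
    items = data if isinstance(data, list) else [data]
    kept = [item for item in items if isinstance(item, dict)]
    discarded = [item for item in items if not isinstance(item, dict)]

    output_dir.mkdir(parents=True, exist_ok=True)
    out_path = output_dir / f"{json_path.stem}.jsonl"
    with out_path.open("w", encoding="utf-8") as out:
        for item in kept:
            out.write(json.dumps(item) + "\n")

    # Create the discard directory and log lazily, so clean inputs
    # never leave empty files behind.
    if discarded:
        discarded_dir.mkdir(parents=True, exist_ok=True)
        log_path = discarded_dir / f"{json_path.stem}_discarded.jsonl"  # hypothetical name
        with log_path.open("w", encoding="utf-8") as log:
            for item in discarded:
                log.write(json.dumps({"value": item, "reason": "non-dict item"}) + "\n")

    return len(kept), len(discarded)
```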

🔗 NEW: `jsonl-concat` — JSONL Concatenation Tool

jsonl-concat concatenates multiple normalized JSONL files into a single newline-delimited JSONL file.

This is ideal when your workflow produces many files such as:

norm_jsonl/
  normalized_0044a4b1d5099e2a.jsonl
  normalized_007b2d5c01abc0b9.jsonl
  normalized_02231d6de9a07833.jsonl
  ...

Combine them into one BigQuery-friendly file:

jsonl-concat

Default behavior is equivalent to:

jsonl-concat norm_jsonl/ combined.jsonl --pattern "*.jsonl"

Features

Reads files matching the given pattern (default *.jsonl) under the given directory
Processes files line-by-line for proper record-level deduplication
Writes one JSON object per line
SHA-256 dedupe, on by default (--no-dedupe to disable)
Quiet mode (--quiet)
Gentle suffix warning when output file is not .jsonl/.ndjson

Examples

Use defaults:

jsonl-concat

Explicit directory and output:

jsonl-concat norm_jsonl/ final.jsonl

Custom file pattern (e.g., if your files use a different extension):

jsonl-concat norm_jsonl/ combined.jsonl --pattern "*.json"

Quiet mode:

jsonl-concat --quiet norm_jsonl/ combined.jsonl

Disable deduplication:

jsonl-concat --no-dedupe norm_jsonl/ combined.jsonl

Unless --quiet is set, a non-standard output filename produces a warning:

[WARN] Output file 'out.json' does not use .jsonl or .ndjson. Continuing.
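
For readers who want to see the idea behind line-by-line concatenation with record-level dedupe, here is a minimal standard-library sketch. It is an illustration only: `concat_sketch` is a hypothetical name, the real `concat_jsonl` may differ, and the sketch assumes every input line is already valid JSON (as it would be after normalization).

```python
import hashlib
import json
from pathlib import Path


def concat_sketch(source_dir, output_file, pattern="*.jsonl", dedupe=True):
    """Concatenate matching files into one JSONL file; return records written."""
    source_dir = Path(source_dir)
    output_file = Path(output_file)
    seen = set()  # SHA-256 digests of records already written
    written = 0

    with output_file.open("w", encoding="utf-8") as out:
        # Sorted order keeps the result reproducible across runs.
        for path in sorted(source_dir.glob(pattern)):
            if path.resolve() == output_file.resolve():
                continue  # never read the file currently being written
            with path.open(encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    record = json.loads(line)
                    if dedupe:
                        canonical = json.dumps(
                            record, sort_keys=True, separators=(",", ":")
                        )
                        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
                        if digest in seen:
                            continue
                        seen.add(digest)
                    out.write(json.dumps(record) + "\n")
                    written += 1
    return written
```

Working on records rather than whole files means a duplicate is caught even when it appears in two different input files.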

📄 Example (Normalization)

Input (`mixed.jsonl`)

{"a": 1, "b": 2}
[{"a": 2}, [7]]
"just a string"

Output: `normalized.jsonl`

{"a": 1, "b": 2}
{"a": 2}

Output: `discarded.jsonl`

{"line": 2, "index": 1, "type": "list", "value": [7], "reason": "non-dict element in list"}
{"line": 3, "type": "str", "value": "just a string", "reason": "top-level value is not dict or list"}
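
The routing shown in this example can be sketched in a few lines of standard-library Python. This is an approximation for illustration, not the package's source: `normalize_lines` is a hypothetical name, and the shape of the log entry for malformed JSON is an assumption, since the example above only shows the two non-dict cases.

```python
import json


def normalize_lines(lines):
    """Split raw JSONL lines into kept dicts and discard-log entries."""
    kept, discarded = [], []
    for line_no, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # assumption: blank lines are skipped silently
        try:
            value = json.loads(text)
        except json.JSONDecodeError as exc:
            # Assumed entry shape for malformed lines.
            discarded.append(
                {"line": line_no, "raw": text, "reason": f"malformed JSON: {exc.msg}"}
            )
            continue

        if isinstance(value, dict):
            kept.append(value)
        elif isinstance(value, list):
            # Keep the dicts inside a list, log everything else with its index.
            for index, item in enumerate(value):
                if isinstance(item, dict):
                    kept.append(item)
                else:
                    discarded.append({
                        "line": line_no,
                        "index": index,
                        "type": type(item).__name__,
                        "value": item,
                        "reason": "non-dict element in list",
                    })
        else:
            discarded.append({
                "line": line_no,
                "type": type(value).__name__,
                "value": value,
                "reason": "top-level value is not dict or list",
            })
    return kept, discarded
```

Feeding the three lines of `mixed.jsonl` to this function yields the same two kept records and the same two log entries as shown above.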

🧪 Library Usage

from pathlib import Path
from jsonl_normalizer import normalize_jsonl, convert_json_dir_to_jsonl, concat_jsonl

# 1. Normalize a single file
stats = normalize_jsonl(
    input_path=Path("input.jsonl"),
    output_path=Path("normalized.jsonl"),
    discarded_path=Path("discarded.jsonl"),
    dedupe=True,
)
print(f"Single file: {stats}")

# 2. Batch convert a directory (discarded_dir is optional)
results = convert_json_dir_to_jsonl(
    source_dir=Path("./json_inputs"),
    output_dir=Path("./jsonl_outputs"),
    dedupe=True,
)
for filename, stats in results.items():
    print(f"{filename}: {stats.written} records")

# 3. Concatenate multiple JSONL files
concat_jsonl(
    source_dir=Path("./norm_jsonl"),
    output_file=Path("combined.jsonl"),
    pattern="*.jsonl",
    dedupe=True,
)

❓ Why jsonl-normalizer?

Real-world JSONL is messy:

LLMs output arrays or malformed fragments
Excel corrupts JSON strings
Some APIs return non-dict top-level structures
Data lakes accumulate junk
BigQuery requires strict dict-per-line JSONL
ETL pipelines fail on partial corruption

jsonl-normalizer fixes these problems by:

Normalizing structure
Logging all junk transparently
Keeping valid dicts only
Providing optional dedupe mode
Producing warehouse-ready JSONL

🧹 Deduplication

When deduplication is enabled (--dedupe, or the default behavior in jsonl-concat):

Each object is canonicalized (sorted keys, compact JSON)
Hashed using SHA-256
Duplicates are skipped automatically

Example:

Normalized records seen: 200
Unique records written: 173
Duplicates skipped: 27
Discarded items logged: 12 → discarded.jsonl
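
The canonicalize-then-hash steps can be sketched with `json` and `hashlib` alone. The helper names `canonical_hash` and `dedupe_records` are hypothetical, and the exact canonical form used by the package may differ in detail; this version assumes sorted keys and compact separators as described above.

```python
import hashlib
import json


def canonical_hash(obj):
    """Hash a JSON-serializable object independently of its key order."""
    # Sorted keys and compact separators give equal objects the same text.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe_records(records):
    """Yield each distinct record once, preserving first-seen order."""
    seen = set()
    for record in records:
        digest = canonical_hash(record)
        if digest in seen:
            continue  # duplicate: skip
        seen.add(digest)
        yield record
```

Storing only the 64-character digests, rather than the records themselves, keeps memory use modest on large files.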

🧪 Testing

pip install -e .
pip install pytest
pytest

🤝 Contributing

Pull requests are welcome. Please ensure:

Tests pass
Code follows PEP 8
Changes remain backward compatible

📄 License

MIT License. See LICENSE for details.
