Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -4,6 +4,7 @@ __pycache__/
*$py.class
*.so
.Python
.coverage.*
build/
develop-eggs/
dist/
13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,19 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]
### Added
- None

### Changed
- None

### Fixed
- None

### Removed
- None

## [0.5.0] 2025-9-18

### Added
- feat(schema): Implement syntactic sugar for type definitions in schema rules
276 changes: 81 additions & 195 deletions README.md
@@ -5,259 +5,145 @@
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code Coverage](https://img.shields.io/badge/coverage-80%25-green.svg)](https://github.com/litedatum/validatelite)

**ValidateLite: A lightweight data validation tool for engineers who need answers, fast.**
**ValidateLite: A lightweight, scenario-driven data validation tool for modern data practitioners.**

Unlike other complex **data validation tools**, ValidateLite provides two powerful, focused commands for different scenarios:
Whether you're a data scientist cleaning a messy CSV, a data engineer building robust pipelines, or a developer needing a quick check, ValidateLite provides powerful, focused commands for your use case:

* **`vlite check`**: For quick, ad-hoc data checks. Need to verify if a column is unique or not null *right now*? The `check` command gets you an answer in 30 seconds, zero config required.
* **`vlite check`**: For quick, ad-hoc data checks. Need to verify if a column is unique or not null *right now*? The `check` command gets you an answer in seconds, zero config required.

* **`vlite schema`**: For robust, repeatable **database schema validation**. It's your best defense against **schema drift**. Embed it in your CI/CD and ETL pipelines to enforce data contracts, ensuring data integrity before it becomes a problem.
* **`vlite schema`**: For robust, repeatable, and automated validation. Define your data's contract in a JSON schema and let ValidateLite verify everything from data types and ranges to complex type-conversion feasibility.

---

## Core Use Case: Automated Schema Validation
## Who is it for?

The `vlite schema` command is key to ensuring the stability of your data pipelines. It allows you to quickly verify that a database table or data file conforms to a defined structure.
### For the Data Scientist: Preparing Data for Analysis

### Scenario 1: Gate Deployments in CI/CD
You have a messy dataset (`legacy_data.csv`) where everything is a `string`. Before you can build a model, you need to clean it up and convert columns to their proper types (`integer`, `float`, `date`). How much work will it be?

Automatically check for breaking schema changes before they get deployed, preventing production issues caused by unexpected modifications.
Instead of writing complex cleaning scripts first, use `vlite schema` to **assess the feasibility of the cleanup**.

**Example Workflow (`.github/workflows/ci.yml`)**
```yaml
jobs:
validate-db-schema:
name: Validate Database Schema
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v3
**1. Define Your Target Schema (`rules.json`)**

- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.9'
Create a schema file that describes the *current* type and the *desired* type.

- name: Install ValidateLite
run: pip install validatelite

- name: Run Schema Validation
run: |
vlite schema --conn "mysql://${{ secrets.DB_USER }}:${{ secrets.DB_PASS }}@${{ secrets.DB_HOST }}/sales" \
--rules ./schemas/customers_schema.json
```

### Scenario 2: Monitor ETL/ELT Pipelines

Set up validation checkpoints at various stages of your data pipelines to guarantee data quality and avoid "garbage in, garbage out."

**Example Rule File (`customers_schema.json`)**
```json
{
"customers": {
"rules": [
{ "field": "id", "type": "integer", "required": true },
{ "field": "name", "type": "string", "required": true },
{ "field": "email", "type": "string", "required": true },
{ "field": "age", "type": "integer", "min": 18, "max": 100 },
{ "field": "gender", "enum": ["Male", "Female", "Other"] },
{ "field": "invalid_col" }
]
}
}
```

**Run Command:**
```bash
vlite schema --conn "mysql://user:pass@host:3306/sales" --rules customers_schema.json
```

### Advanced Schema Examples

**Multi-Table Validation:**
```json
{
"customers": {
"rules": [
{ "field": "id", "type": "integer", "required": true },
{ "field": "name", "type": "string", "required": true },
{ "field": "email", "type": "string", "required": true },
{ "field": "age", "type": "integer", "min": 18, "max": 100 }
],
"strict_mode": true
},
"orders": {
"rules": [
{ "field": "id", "type": "integer", "required": true },
{ "field": "customer_id", "type": "integer", "required": true },
{ "field": "total", "type": "float", "min": 0 },
{ "field": "status", "enum": ["pending", "completed", "cancelled"] }
]
}
}
```

**CSV File Validation:**
```bash
# Validate CSV file structure
vlite schema --conn "sales_data.csv" --rules csv_schema.json --output json
```

**Complex Data Types:**
```json
{
"events": {
"rules": [
{ "field": "timestamp", "type": "datetime", "required": true },
{ "field": "event_type", "enum": ["login", "logout", "purchase"] },
{ "field": "user_id", "type": "string", "required": true },
{ "field": "metadata", "type": "string" }
],
"case_insensitive": true
}
}
```

**Available Data Types:**
- `string` - Text data (VARCHAR, TEXT, CHAR)
- `integer` - Whole numbers (INT, BIGINT, SMALLINT)
- `float` - Decimal numbers (FLOAT, DOUBLE, DECIMAL)
- `boolean` - True/false values (BOOLEAN, BOOL, BIT)
- `date` - Date only (DATE)
- `datetime` - Date and time (DATETIME, TIMESTAMP)

### Enhanced Schema Validation with Metadata

ValidateLite now supports **metadata validation** for precise schema enforcement without scanning table data. This provides superior performance by validating column constraints directly from database metadata.

**Metadata Validation Features:**
- **String Length Validation**: Validate `max_length` for string columns
- **Float Precision Validation**: Validate `precision` and `scale` for decimal columns
- **Database-Agnostic**: Works across MySQL, PostgreSQL, and SQLite
- **Performance Optimized**: Uses database catalog queries, not data scans

**Enhanced Schema Examples:**

**String Metadata Validation:**
```json
{
"users": {
"legacy_users": {
"rules": [
{
"field": "username",
"field": "user_id",
"type": "string",
"max_length": 50,
"desired_type": "integer",
"required": true
},
{
"field": "email",
"field": "salary",
"type": "string",
"max_length": 255,
"desired_type": "float(10,2)",
"required": true
},
{
"field": "biography",
"field": "bio",
"type": "string",
"max_length": 1000
"desired_type": "string(500)",
"required": false
}
]
}
}
```

**Float Precision Validation:**
```json
{
"products": {
"rules": [
{
"field": "price",
"type": "float",
"precision": 10,
"scale": 2,
"required": true
},
{
"field": "weight",
"type": "float",
"precision": 8,
"scale": 3
}
]
}
}
**2. Run the Validation**

```bash
vlite schema --conn legacy_data.csv --rules rules.json
```

**Mixed Metadata Schema:**
```json
{
"orders": {
"rules": [
{ "field": "id", "type": "integer", "required": true },
{
"field": "customer_name",
"type": "string",
"max_length": 100,
"required": true
},
{
"field": "total_amount",
"type": "float",
"precision": 12,
"scale": 2,
"required": true
},
{ "field": "order_date", "type": "datetime", "required": true },
{ "field": "notes", "type": "string", "max_length": 500 }
],
"strict_mode": true
}
}
ValidateLite will generate a report telling you exactly what can and cannot be converted, saving you hours of guesswork.

```
FIELD VALIDATION RESULTS
========================

**Backward Compatibility**: Existing schema files without metadata continue to work unchanged. Metadata validation is optional and can be added incrementally to enhance validation precision.
Field: user_id
✓ Field exists (string)
✓ Not Null constraint
✗ Type Conversion Validation (string → integer): 15 incompatible records found

**Command Options:**
```bash
# Basic validation
vlite schema --conn <connection> --rules <rules_file>
Field: salary
✓ Field exists (string)
✗ Type Conversion Validation (string → float(10,2)): 8 incompatible records found

Field: bio
✓ Field exists (string)
✓ Length Constraint Validation (string → string(500)): PASSED
```
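
To make the idea of a type-conversion feasibility check concrete, here is a minimal Python sketch. It is not ValidateLite's implementation: it assumes `float(10,2)` means at most 10 digits in total with at most 2 after the decimal point (as in SQL `DECIMAL(10,2)`), and the helper names `fits_integer`, `fits_decimal`, and `count_incompatible` are hypothetical.

```python
from decimal import Decimal, InvalidOperation
from typing import Callable, Iterable


def fits_integer(value: str) -> bool:
    """Return True if the string parses as a whole number."""
    try:
        int(value.strip())
        return True
    except ValueError:
        return False


def fits_decimal(value: str, precision: int, scale: int) -> bool:
    """Return True if the string parses as a finite decimal that fits
    the assumed (precision, scale) limits without rounding."""
    try:
        number = Decimal(value.strip())
    except InvalidOperation:
        return False
    if not number.is_finite():  # reject NaN and Infinity
        return False
    _, digits, exponent = number.as_tuple()
    fractional_digits = max(0, -exponent)
    integer_digits = max(0, len(digits) + exponent)
    return fractional_digits <= scale and integer_digits <= precision - scale


def count_incompatible(values: Iterable[str], check: Callable[[str], bool]) -> int:
    """Count the values that would fail the conversion check."""
    return sum(1 for value in values if not check(value))


# Made-up sample data, for illustration only.
user_ids = ["101", "102", "abc", "4.5", ""]
salaries = ["52000.00", "12.345", "n/a", "123456789.00"]

print(count_incompatible(user_ids, fits_integer))
print(count_incompatible(salaries, lambda v: fits_decimal(v, 10, 2)))
```

The sketch counts trailing zeros as fractional digits, so `"1.500"` would be rejected for a scale of 2; a real tool may normalize such values first.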

### For the Data Engineer: Ensuring Data Integrity in CI/CD

You need to prevent breaking schema changes and bad data from ever reaching production. Embed ValidateLite into your CI/CD pipeline to act as a quality gate.

**Example Workflow (`.github/workflows/ci.yml`)**

This workflow automatically validates the database schema on every pull request.

```yaml
jobs:
validate-db-schema:
name: Validate Database Schema
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v3

# JSON output for automation
vlite schema --conn <connection> --rules <rules_file> --output json
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.9'

# Exit with error code on any failure
vlite schema --conn <connection> --rules <rules_file> --fail-on-error
- name: Install ValidateLite
run: pip install validatelite

# Verbose logging
vlite schema --conn <connection> --rules <rules_file> --verbose
- name: Run Schema Validation
run: |
vlite schema --conn "mysql://${{ secrets.DB_USER }}:${{ secrets.DB_PASS }}@${{ secrets.DB_HOST }}/sales" \
--rules ./schemas/customers_schema.json \
--fail-on-error
```
This same approach can be used to monitor data quality at every stage of your ETL/ELT pipelines, preventing "garbage in, garbage out."

---

## Quick Start: Ad-Hoc Checks with `check`

For temporary, one-off validation needs, the `check` command is your best friend.
For temporary, one-off validation needs, the `check` command is your best friend. You can run multiple rules on any supported data source (files or databases) directly from the command line.

**1. Install (if you haven't already):**
```bash
pip install validatelite
```

**2. Run a check:**
```bash
# Check for nulls in a CSV file's 'id' column
vlite check --conn "customers.csv" --table customers --rule "not_null(id)"

# Check for uniqueness in a database table's 'email' column
vlite check --conn "mysql://user:pass@host/db" --table customers --rule "unique(email)"
```bash
# Check for nulls and uniqueness in a CSV file
vlite check --conn "customers.csv" --table customers \
--rule "not_null(id)" \
--rule "unique(email)"

# Check value ranges and formats in a database table
vlite check --conn "mysql://user:pass@host/db" --table customers \
--rule "range(age, 18, 99)" \
--rule "enum(status, 'active', 'inactive')"
```
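
If you want to drive the same checks from a script, the command line can be assembled programmatically. The sketch below uses only the `--conn`, `--table`, and `--rule` flags shown in the examples above; `build_check_command` is a hypothetical helper, not part of ValidateLite, and nothing is assumed about the tool's exit codes.

```python
import shlex
from typing import List, Sequence


def build_check_command(conn: str, table: str, rules: Sequence[str]) -> List[str]:
    """Build the argv list for `vlite check`, repeating --rule per rule."""
    command = ["vlite", "check", "--conn", conn, "--table", table]
    for rule in rules:
        command += ["--rule", rule]
    return command


command = build_check_command(
    "customers.csv", "customers", ["not_null(id)", "unique(email)"]
)

# Print a shell-safe rendering; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Passing the list form to `subprocess.run` avoids shell quoting problems with rule strings that contain parentheses or quotes.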

---

## Learn More

- **[Usage Guide (USAGE.md)](docs/USAGE.md)**: Learn about all commands, arguments, and advanced features.
- **[Configuration Reference (CONFIG_REFERENCE.md)](docs/CONFIG_REFERENCE.md)**: See how to configure the tool via `toml` files.
- **[Usage Guide (docs/usage.md)](docs/usage.md)**: Learn about all commands, data sources, rule types, and advanced features like the **Desired Type** system.
- **[Configuration Reference (docs/CONFIG_REFERENCE.md)](docs/CONFIG_REFERENCE.md)**: See how to configure the tool via `toml` files.
- **[Contributing Guide (CONTRIBUTING.md)](CONTRIBUTING.md)**: We welcome contributions!

---
@@ -274,4 +160,4 @@ Follow the journey of building ValidateLite through our development blog posts:

## 📄 License

This project is licensed under the [MIT License](LICENSE).
This project is licensed under the [MIT License](LICENSE)
2 changes: 1 addition & 1 deletion cli/__init__.py
@@ -5,7 +5,7 @@
Provides a unified `vlite check` command for data quality checking.
"""

__version__ = "0.4.3"
__version__ = "0.5.0"

from .app import cli_app

2 changes: 1 addition & 1 deletion cli/app.py
@@ -68,7 +68,7 @@ def _setup_logging() -> None:


@click.group(name="vlite", invoke_without_command=True)
@click.version_option(version="0.4.3", prog_name="vlite")
@click.version_option(version="0.5.0", prog_name="vlite")
@click.pass_context
def cli_app(ctx: click.Context) -> None:
"""