Interactive Julia CLI for exploring the data.gov catalog with smart search, fuzzy matching, and multi-format export

justin4957/DataGovExplorer

DataGovExplorer

A Julia-based interactive CLI tool for exploring and downloading datasets from the data.gov catalog. Built on the CKAN API, this tool provides an intuitive command-line interface for browsing thousands of government datasets.

Features

  • Dual Operation Modes:
    • Interactive Mode: Menu-driven interface for browsing datasets
    • CLI Mode: Non-interactive commands for automation and scripting
  • Smart Search: Fuzzy matching and auto-correction for dataset discovery
  • Multiple Browse Modes:
    • Browse by organization
    • Browse by tags
    • Search by keywords
    • View recent datasets
  • Flexible Export: Export catalog metadata to CSV, JSON, Excel, or Arrow formats
  • Flexible Output Formats: Choose between table, JSON, CSV, or plain text output
  • Caching: Built-in caching to reduce API calls and improve performance
  • Rate Limiting: Automatic rate limiting and retry logic with exponential backoff
  • Error Handling: Graceful error handling with helpful suggestions
  • Automation-Ready: CLI mode runs unattended in CI/CD pipelines and batch scripts

Installation

Prerequisites

  • Julia 1.9 or later

Setup

  1. Clone this repository and change into its directory:
git clone https://github.com/justin4957/DataGovExplorer.git
cd DataGovExplorer
  2. Install dependencies from the Julia REPL:
using Pkg
Pkg.activate(".")
Pkg.instantiate()

Quick Start

Interactive Mode

Launch the interactive explorer:

julia run_explorer.jl

Or from the Julia REPL:

using DataGovExplorer
interactive_explorer()

CLI Mode (Non-Interactive)

Search for datasets directly from the command line:

# Search for datasets
julia run_explorer.jl search "climate data" --limit 20

# Export results directly
julia run_explorer.jl search "climate" --export climate.csv

# Browse by organization
julia run_explorer.jl org "Department of Commerce" --limit 50

# Browse by tag
julia run_explorer.jl tag "environment" --output json

# View recent datasets
julia run_explorer.jl recent --limit 10

# List all organizations
julia run_explorer.jl orgs --export organizations.csv

CLI Output Formats

Control output format with the --output flag:

# Table format (default)
julia run_explorer.jl search "health" --output table

# JSON format (machine-readable)
julia run_explorer.jl search "health" --output json

# CSV format (pipe to other tools)
julia run_explorer.jl search "health" --output csv

# Plain text (simple list)
julia run_explorer.jl search "health" --output plain

Disable Colors

For piping or logging, use --no-color:

julia run_explorer.jl search "climate" --no-color --output json > results.json

Programmatic Usage

using DataGovExplorer

# Create a client
client = CKANClient()

# Search for datasets
climate_data = search_packages(client, query="climate", rows=20)

# Export results
export_to_csv(climate_data, "climate_datasets.csv")

Usage Examples

Example 1: Search for Datasets

using DataGovExplorer
using DataFrames  # provides nrow for the result DataFrame

client = CKANClient()

# Search for datasets about climate
results = search_packages(client, query="climate change", rows=50)

# View results
println("Found $(nrow(results)) datasets")
println(results)

# Export to CSV
export_to_csv(results, "climate_datasets.csv")

Example 2: Browse by Organization

CLI mode:

# List all organizations
julia run_explorer.jl orgs --export orgs.csv

# Browse datasets from a specific organization
julia run_explorer.jl org "NOAA" --limit 100 --export noaa_datasets.xlsx

Programmatic:

# Get all organizations
orgs = get_organizations(client)

# Get datasets from a specific organization
noaa_data = search_packages(client, fq="organization:\"noaa-gov\"", rows=100)

# Export to Excel
export_to_xlsx(noaa_data, "noaa_datasets.xlsx")

Example 3: Browse by Tags

CLI mode:

# Browse datasets by tag
julia run_explorer.jl tag "health" --limit 50 --export health.json

# Output as JSON for processing
julia run_explorer.jl tag "covid-19" --output json

Programmatic:

# Get all available tags
tags = get_tags(client)

# Find datasets with specific tags
health_data = search_packages(client, fq="tags:\"health\"", rows=50)

# Export to JSON
export_to_json(health_data, "health_datasets.json")

Example 4: Get Dataset Details

# Get detailed metadata for a specific dataset
dataset_name = "monthly-us-air-quality-1980-2020"
metadata = get_package_metadata(client, dataset_name)

println(metadata)
export_to_csv(metadata, "dataset_details.csv")

Example 5: Multiple Format Export

CLI mode:

# Export to different formats
julia run_explorer.jl search "education" --limit 100 --export education.csv
julia run_explorer.jl search "education" --limit 100 --export education.json
julia run_explorer.jl search "education" --limit 100 --export education.xlsx
julia run_explorer.jl search "education" --limit 100 --export education.arrow

Programmatic:

# Search for datasets
results = search_packages(client, query="education", rows=100)

# Export to multiple formats
export_to_csv(results, "education.csv")
export_to_json(results, "education.json")
export_to_arrow(results, "education.arrow")  # Efficient binary format
export_to_xlsx(results, "education.xlsx")

Example 6: Automation and Batch Operations

Use CLI mode in shell scripts for automated data collection:

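#!/bin/bash
# collect_datasets.sh - Automated dataset collection
set -euo pipefail   # stop on the first failed command

mkdir -p data       # make sure the export directory exists

# Collect datasets from different categories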
julia run_explorer.jl search "climate" --export data/climate.csv
julia run_explorer.jl search "health" --export data/health.csv
julia run_explorer.jl search "education" --export data/education.csv

# Get recent datasets
julia run_explorer.jl recent --limit 100 --export data/recent.json

# Archive organizations
julia run_explorer.jl orgs --export data/organizations.csv

echo "Data collection complete!"

API Reference

Client Configuration

# Create client with default configuration
client = CKANClient()

# Create client with custom configuration
config = CKANConfig(
    base_url="https://catalog.data.gov/api/3",
    timeout=30,           # Request timeout in seconds
    rate_limit_ms=500,    # Minimum delay between requests
    max_retries=3,        # Maximum retry attempts
    page_size=100         # Results per page
)
client = CKANClient(config)

Metadata Functions

get_packages(client; limit=nothing, force_refresh=false)

Get list of all packages (datasets).

get_organizations(client; force_refresh=false)

Get list of all organizations.

get_tags(client; force_refresh=false)

Get list of all tags.

get_package_details(client, package_id::String)

Get detailed information about a specific package.

get_package_metadata(client, package_id::String)

Get formatted metadata for a package as a DataFrame.

search_packages(client; query=nothing, organization=nothing, tags=nothing, rows=100)

Search for packages with various filters.

Export Functions

export_to_csv(df, filepath)

Export DataFrame to CSV format.

export_to_json(df, filepath; pretty=false)

Export DataFrame to JSON format.

export_to_arrow(df, filepath)

Export DataFrame to Apache Arrow format (efficient binary).

export_to_xlsx(df, filepath; sheet_name="Data")

Export DataFrame to Excel format.

export_data(df, filepath; kwargs...)

Smart export based on file extension.
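As a rough illustration of what export by file extension can look like, the sketch below maps an extension to the name of one of the export functions listed above. The `EXPORTERS` table and the `exporter_for` helper are hypothetical names introduced here; the package's actual `export_data` may be implemented differently.

```julia
# Hypothetical sketch of extension-based dispatch; not the package's own code.
const EXPORTERS = Dict(
    ".csv"   => :export_to_csv,
    ".json"  => :export_to_json,
    ".arrow" => :export_to_arrow,
    ".xlsx"  => :export_to_xlsx,
)

# Return the name of the exporter that matches the file extension of `filepath`.
function exporter_for(filepath::AbstractString)
    ext = lowercase(splitext(filepath)[2])   # "data/NOAA.XLSX" gives ".xlsx"
    haskey(EXPORTERS, ext) ||
        throw(ArgumentError("unsupported export format: '$ext'"))
    return EXPORTERS[ext]
end

exporter_for("climate.csv")   # :export_to_csv
```

Matching on the lowercased extension keeps `report.CSV` and `report.csv` equivalent, and failing early on an unknown extension avoids writing a file in an unintended format.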

auto_export(df, base_name; format=:csv, output_dir="./output")

Export with auto-generated filename and timestamp.
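The README does not specify the filename pattern that `auto_export` produces. The sketch below shows one plausible way to build a timestamped path from the same arguments. The `timestamped_path` helper, its `at` keyword, and the `name_YYYYMMDD_HHMMSS.ext` pattern are assumptions made for illustration only.

```julia
using Dates

# Hypothetical helper: build "<output_dir>/<base_name>_<timestamp>.<format>".
# The timestamp layout is an assumption, not the package's documented format.
function timestamped_path(base_name::AbstractString;
                          format::Symbol=:csv,
                          output_dir::AbstractString="./output",
                          at::DateTime=now())
    stamp = Dates.format(at, "yyyymmdd_HHMMSS")
    return joinpath(output_dir, "$(base_name)_$(stamp).$(format)")
end

path = timestamped_path("climate"; format=:json)
println(path)
```

Accepting the timestamp as a keyword makes the helper deterministic in tests while still defaulting to the current time.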

export_multi_sheet_xlsx(data_dict, filepath)

Export multiple DataFrames to Excel with multiple sheets.

Project Structure

DataGovExplorer/
├── src/
│   ├── DataGovExplorer.jl      # Main module
│   ├── config.jl               # Configuration structures
│   ├── client.jl               # HTTP client with rate limiting
│   ├── metadata.jl             # Metadata retrieval functions
│   ├── exports.jl              # Export utilities
│   ├── cli.jl                  # CLI command definitions
│   ├── explorer.jl             # Interactive CLI main loop
│   └── explorer/
│       ├── display.jl          # Table formatting and colors
│       ├── input.jl            # User input validation
│       └── menu.jl             # Menu navigation logic
├── examples/
│   ├── quick_start.jl          # Basic connectivity test
│   └── basic_usage.jl          # Common usage patterns
├── Project.toml                # Package dependencies
├── run_explorer.jl             # CLI launcher (interactive & non-interactive)
└── README.md                   # This file

Architecture

The project follows a modular architecture similar to UNStatsExplorer:

  • Configuration Layer: Centralized configuration for API settings
  • Client Layer: HTTP client with caching, rate limiting, and retry logic
  • Metadata Layer: Functions for retrieving catalog information
  • Export Layer: Multi-format export utilities
  • Explorer Layer: Interactive CLI with menu navigation
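To make the Client Layer's responsibilities more concrete, here is a minimal sketch of caching combined with retries and exponential backoff, using only Julia's standard library. The names `with_retries` and `cached_fetch` and the `sleep_fn` keyword are hypothetical and are not taken from the package source. Rate limiting is left out for brevity.

```julia
# Hypothetical sketch; the package's client may be structured differently.

# Call `f`, retrying up to `max_retries` times and doubling the delay each time.
function with_retries(f; max_retries::Int=3, base_delay::Real=0.5, sleep_fn=sleep)
    attempt = 0
    while true
        try
            return f()
        catch
            attempt += 1
            attempt > max_retries && rethrow()
            sleep_fn(base_delay * 2.0^(attempt - 1))   # 0.5 s, 1 s, 2 s, ...
        end
    end
end

# Return the cached value for `key`, computing and storing it on a miss.
function cached_fetch(fetch_fn, cache::Dict{String,Any}, key::String;
                      force_refresh::Bool=false)
    if force_refresh || !haskey(cache, key)
        cache[key] = fetch_fn()
    end
    return cache[key]
end

# Usage with a stand-in for an API call that fails twice before succeeding.
cache = Dict{String,Any}()
calls = Ref(0)
flaky() = (calls[] += 1; calls[] < 3 ? error("transient failure") : "ok")

result = cached_fetch(cache, "organization_list") do
    with_retries(flaky; base_delay=0.01)
end
println(result, " after ", calls[], " calls")
```

Keeping the cache lookup outside the retry loop means a cached answer never triggers a network call, which mirrors the `force_refresh` keyword on the metadata functions.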

Key Design Principles

  1. Composability: Small, focused functions that combine into larger workflows
  2. Reusability: Core utilities work independently
  3. Descriptive Naming: Clear, self-documenting function names
  4. Type Safety: Robust handling of missing/invalid data
  5. Minimal Overhead: Efficient caching and resource usage
  6. Progressive Disclosure: Simple API with advanced options

CLI Commands Reference

search <query>

Search for datasets by keyword

  • --limit <N>: Maximum number of results (default: 50)
  • --export <file>: Export results to file
  • --output <format>: Output format (table, json, csv, plain)
  • --no-color: Disable colored output

org <organization>

Browse datasets from a specific organization

  • --limit <N>: Maximum number of results (default: 50)
  • --export <file>: Export results to file
  • --output <format>: Output format (table, json, csv, plain)
  • --no-color: Disable colored output

tag <tag_name>

Browse datasets by tag

  • --limit <N>: Maximum number of results (default: 50)
  • --export <file>: Export results to file
  • --output <format>: Output format (table, json, csv, plain)
  • --no-color: Disable colored output

recent

View recently updated datasets

  • --limit <N>: Maximum number of results (default: 20)
  • --export <file>: Export results to file
  • --output <format>: Output format (table, json, csv, plain)
  • --no-color: Disable colored output

orgs

List all organizations

  • --export <file>: Export results to file
  • --output <format>: Output format (table, json, csv, plain)
  • --no-color: Disable colored output

interactive

Launch interactive explorer mode

Dependencies

  • HTTP.jl: HTTP client for API requests
  • JSON3.jl: Fast JSON parsing
  • DataFrames.jl: Tabular data manipulation
  • CSV.jl: CSV export
  • Arrow.jl: Apache Arrow format
  • XLSX.jl: Excel export
  • JSONTables.jl: JSON export
  • PrettyTables.jl: Console table formatting
  • ProgressMeter.jl: Progress bars
  • StringDistances.jl: Fuzzy matching (Jaro-Winkler)
  • Crayons.jl: ANSI color output
  • Comonicon.jl: Command-line interface framework
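The project lists StringDistances.jl with Jaro-Winkler for fuzzy matching. To show the general idea without any dependency, the sketch below swaps in plain Levenshtein edit distance instead. The `levenshtein` and `suggest` functions and the `max_distance` threshold are hypothetical and are not the package's implementation.

```julia
# Illustration only: Levenshtein distance stands in for Jaro-Winkler here.

# Number of single-character edits needed to turn `a` into `b`.
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))
    for i in 1:length(s)
        curr = similar(prev)
        curr[1] = i
        for j in 1:length(t)
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution
        end
        prev = curr
    end
    return prev[end]
end

# Return the closest candidate within `max_distance` edits, or `nothing`.
function suggest(query::AbstractString, candidates; max_distance::Int=2)
    q = lowercase(query)
    best, best_dist = nothing, max_distance + 1
    for c in candidates
        d = levenshtein(q, lowercase(c))
        if d < best_dist
            best, best_dist = c, d
        end
    end
    return best
end

tags = ["environment", "energy", "education"]
println(suggest("enviroment", tags))
```

A distance cutoff keeps the tool from "correcting" a query to an unrelated tag when nothing is actually close.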

CKAN API

This tool uses the CKAN API (version 3) provided by data.gov. CKAN (Comprehensive Knowledge Archive Network) is an open-source data management system used by governments worldwide.

Key API endpoints used:

  • /api/3/action/package_list - List all packages
  • /api/3/action/package_search - Search packages
  • /api/3/action/package_show - Get package details
  • /api/3/action/organization_list - List organizations
  • /api/3/action/group_list - List groups
  • /api/3/action/tag_list - List tags

API Documentation: https://docs.ckan.org/en/2.11/api/index.html
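For readers who want to see what a request to one of these endpoints looks like, the sketch below builds an action URL with query parameters using only the standard library. The `percent_encode` and `action_url` helpers are hypothetical names. The `q`, `rows`, and `start` parameters are standard `package_search` parameters in the CKAN API.

```julia
# Hypothetical helpers for building CKAN action URLs; not the package's code.

# RFC 3986 unreserved characters are left as-is; everything else is escaped.
is_unreserved(c::Char) = isascii(c) && (isletter(c) || isdigit(c) || c in "-._~")

function percent_encode(s::AbstractString)
    io = IOBuffer()
    for b in codeunits(s)
        if is_unreserved(Char(b))
            write(io, b)
        else
            print(io, '%', uppercase(string(b, base=16, pad=2)))
        end
    end
    return String(take!(io))
end

# Build "<base_url>/action/<action>?key=value&..." from keyword arguments.
function action_url(base_url::AbstractString, action::AbstractString; params...)
    query = join(("$(k)=$(percent_encode(string(v)))" for (k, v) in params), "&")
    url = "$(rstrip(base_url, '/'))/action/$(action)"
    return isempty(query) ? url : "$(url)?$(query)"
end

println(action_url("https://catalog.data.gov/api/3", "package_search";
                   q="climate change", rows=20, start=0))
```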

Performance Tips

  1. Use Caching: Metadata queries are cached by default
  2. Specify Filters: Use organization, tags, and query parameters to narrow searches
  3. Arrow Format: Use Arrow format for large datasets; as a binary columnar format it is typically faster to re-import than CSV or JSON
  4. Pagination: Results are automatically paginated with progress bars
  5. Rate Limiting: Built-in rate limiting respects API constraints
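The pagination tip can be pictured as splitting one large request into a series of `start`/`rows` pairs, which are the offset and page-size parameters of CKAN's `package_search`. The `page_plan` function below is a hypothetical sketch. It trims the last page to the requested limit, which the package itself may not do.

```julia
# Hypothetical sketch: split `limit` results into pages of at most `page_size`.
function page_plan(limit::Int; page_size::Int=100)
    limit >= 0 || throw(ArgumentError("limit must be non-negative"))
    page_size > 0 || throw(ArgumentError("page_size must be positive"))
    return [(start=s, rows=min(page_size, limit - s)) for s in 0:page_size:(limit - 1)]
end

for page in page_plan(250; page_size=100)
    println("start=", page.start, " rows=", page.rows)
end
```

With `page_size=100` and a limit of 250, the plan is three requests, so the 500 ms default `rate_limit_ms` shown in the client configuration would add roughly one second of waiting between them.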

Troubleshooting

DNS/Connection Issues

If you encounter DNS errors like DNSError: catalog.data.gov, unknown node or service (EAI_NONAME):

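This is a known issue with name resolution in Julia's HTTP.jl on some systems. Note that JULIA_NO_VERIFY_HOSTS, used in Options 1 to 3, turns off TLS host verification for the listed host, so set it only on networks you trust. Solutions: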

Option 1: Use environment variable (Recommended)

# Set this environment variable before running
export JULIA_NO_VERIFY_HOSTS="catalog.data.gov"

# Then run normally
julia run_explorer.jl

Option 2: Add to your shell profile

# Add to ~/.zshrc or ~/.bashrc
export JULIA_NO_VERIFY_HOSTS="catalog.data.gov"

Option 3: Use with each command

JULIA_NO_VERIFY_HOSTS="catalog.data.gov" julia run_explorer.jl search "climate"
JULIA_NO_VERIFY_HOSTS="catalog.data.gov" julia run_explorer.jl interactive

Option 4: Check your DNS settings

  • Verify DNS is working: ping catalog.data.gov
  • Try flushing DNS cache: sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder (macOS)
  • Temporarily switch to Google DNS (8.8.8.8) in System Settings

Connection Issues

If you encounter other connection issues:

  • Check your internet connection
  • Verify the data.gov API is accessible: curl -I https://catalog.data.gov/api/3/action/organization_list
  • Try increasing the timeout in configuration
  • Check if you're behind a corporate firewall or proxy

API Errors

If you get API errors:

  • Check if the dataset name/ID is correct
  • Some datasets may have restricted access
  • Try again later if the API is under heavy load
  • Verify API status: https://catalog.data.gov/

Performance Issues

If searches are slow:

  • Reduce the rows parameter
  • Use more specific search queries
  • Clear the cache: empty!(client.cache)
  • The API may fetch more data than requested during pagination

Contributing

This project was adapted from UNStatsExplorer. Contributions are welcome!

License

[Specify your license here]

Acknowledgments

Related Projects

  • UNStatsExplorer: Julia tool for exploring UN SDG data
  • CKAN: Open-source data management system

Contact

[Your contact information]
