DataGovExplorer

A Julia-based interactive CLI tool for exploring and downloading datasets from the data.gov catalog. Built on the CKAN API, this tool provides an intuitive command-line interface for browsing thousands of government datasets.

Features

Dual Operation Modes:
- Interactive Mode: Menu-driven interface for browsing datasets
- CLI Mode: Non-interactive commands for automation and scripting
Smart Search: Fuzzy matching and auto-correction for dataset discovery
Multiple Browse Modes:
- Browse by organization
- Browse by tags
- Search by keywords
- View recent datasets
Flexible Export: Export catalog metadata to CSV, JSON, Excel, or Arrow formats
Flexible Output Formats: Choose between table, JSON, CSV, or plain text output
Caching: Built-in caching to reduce API calls and improve performance
Rate Limiting: Automatic rate limiting and retry logic with exponential backoff
Error Handling: Graceful error handling with helpful suggestions
Automation-Ready: Suited to CI/CD pipelines and batch operations
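
To illustrate the kind of matching behind the Smart Search feature, here is a minimal sketch that suggests the closest known name for a mistyped query. It is an assumption-laden illustration, not the project's code: the Dependencies section names StringDistances.jl with Jaro-Winkler, whereas this sketch swaps in plain Levenshtein edit distance so that it runs on base Julia alone. The names `levenshtein` and `suggest`, and the `max_distance` threshold, are hypothetical.

```julia
# Hypothetical sketch: suggest the closest candidate for a mistyped query.
# Uses Levenshtein edit distance (NOT the Jaro-Winkler metric the project
# lists under Dependencies) so it needs no external packages.

function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))          # distances from the empty prefix of s
    for i in eachindex(s)
        curr = similar(prev)
        curr[1] = i
        for j in eachindex(t)
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution
        end
        prev = curr
    end
    return prev[end]
end

# Return the candidate with the smallest distance, or `nothing` when even the
# best candidate is further away than `max_distance` (an assumed threshold).
function suggest(query::AbstractString, candidates; max_distance::Int = 3)
    isempty(candidates) && return nothing
    q = lowercase(query)
    dists = [levenshtein(q, lowercase(c)) for c in candidates]
    best = argmin(dists)
    return dists[best] <= max_distance ? candidates[best] : nothing
end
```

For example, `suggest("climte", ["climate", "health", "education"])` picks `"climate"`, since only one insertion separates the two strings.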

Installation

Prerequisites

Julia 1.9 or later

Setup

Clone or download this repository:

git clone https://github.com/justin4957/DataGovExplorer.git
cd DataGovExplorer

Install dependencies:

using Pkg
Pkg.activate(".")
Pkg.instantiate()

Quick Start

Interactive Mode

Launch the interactive explorer:

julia run_explorer.jl

Or from Julia REPL:

using DataGovExplorer
interactive_explorer()

CLI Mode (Non-Interactive)

Search for datasets directly from the command line:

# Search for datasets
julia run_explorer.jl search "climate data" --limit 20

# Export results directly
julia run_explorer.jl search "climate" --export climate.csv

# Browse by organization
julia run_explorer.jl org "Department of Commerce" --limit 50

# Browse by tag
julia run_explorer.jl tag "environment" --output json

# View recent datasets
julia run_explorer.jl recent --limit 10

# List all organizations
julia run_explorer.jl orgs --export organizations.csv

CLI Output Formats

Control output format with the --output flag:

# Table format (default)
julia run_explorer.jl search "health" --output table

# JSON format (machine-readable)
julia run_explorer.jl search "health" --output json

# CSV format (pipe to other tools)
julia run_explorer.jl search "health" --output csv

# Plain text (simple list)
julia run_explorer.jl search "health" --output plain

Disable Colors

For piping or logging, use --no-color:

julia run_explorer.jl search "climate" --no-color --output json > results.json

Programmatic Usage

using DataGovExplorer

# Create a client
client = CKANClient()

# Search for datasets
climate_data = search_packages(client; query="climate", rows=20)

# Export results
export_to_csv(climate_data, "climate_datasets.csv")

Usage Examples

Example 1: Search for Datasets

using DataGovExplorer
using DataFrames  # provides nrow

client = CKANClient()

# Search for datasets about climate
results = search_packages(client, query="climate change", rows=50)

# View results
println("Found $(nrow(results)) datasets")
println(results)

# Export to CSV
export_to_csv(results, "climate_datasets.csv")

Example 2: Browse by Organization

CLI mode:

# List all organizations
julia run_explorer.jl orgs --export orgs.csv

# Browse datasets from a specific organization
julia run_explorer.jl org "NOAA" --limit 100 --export noaa_datasets.xlsx

Programmatic:

# Get all organizations
orgs = get_organizations(client)

# Get datasets from a specific organization
noaa_data = search_packages(client; organization="noaa-gov", rows=100)

# Export to Excel
export_to_xlsx(noaa_data, "noaa_datasets.xlsx")

Example 3: Browse by Tags

CLI mode:

# Browse datasets by tag
julia run_explorer.jl tag "health" --limit 50 --export health.json

# Output as JSON for processing
julia run_explorer.jl tag "covid-19" --output json

Programmatic:

# Get all available tags
tags = get_tags(client)

# Find datasets with specific tags
health_data = search_packages(client; tags="health", rows=50)

# Export to JSON
export_to_json(health_data, "health_datasets.json")

Example 4: Get Dataset Details

# Get detailed metadata for a specific dataset
dataset_name = "monthly-us-air-quality-1980-2020"
metadata = get_package_metadata(client, dataset_name)

println(metadata)
export_to_csv(metadata, "dataset_details.csv")

Example 5: Multiple Format Export

CLI mode:

# Export to different formats
julia run_explorer.jl search "education" --limit 100 --export education.csv
julia run_explorer.jl search "education" --limit 100 --export education.json
julia run_explorer.jl search "education" --limit 100 --export education.xlsx
julia run_explorer.jl search "education" --limit 100 --export education.arrow

Programmatic:

# Search for datasets
results = search_packages(client; query="education", rows=100)

# Export to multiple formats
export_to_csv(results, "education.csv")
export_to_json(results, "education.json")
export_to_arrow(results, "education.arrow")  # Efficient binary format
export_to_xlsx(results, "education.xlsx")

Example 6: Automation and Batch Operations

Use CLI mode in shell scripts for automated data collection:

#!/bin/bash
# collect_datasets.sh - Automated dataset collection

# Collect datasets from different categories
julia run_explorer.jl search "climate" --export data/climate.csv
julia run_explorer.jl search "health" --export data/health.csv
julia run_explorer.jl search "education" --export data/education.csv

# Get recent datasets
julia run_explorer.jl recent --limit 100 --export data/recent.json

# Archive organizations
julia run_explorer.jl orgs --export data/organizations.csv

echo "Data collection complete!"
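
The same batch job can also be driven from Julia. The sketch below only builds the command objects, reusing the `search` command and `--export` flag shown above; `export_commands` is a hypothetical helper and the `data` output directory simply mirrors the shell script.

```julia
# Hypothetical sketch: build (but do not run) one CLI command per topic.
# `export_commands` is not part of DataGovExplorer.
function export_commands(topics; output_dir::AbstractString = "data")
    return map(topics) do topic
        path = joinpath(output_dir, "$(topic).csv")
        # Interpolated values become single arguments, so no shell quoting
        # is needed even if a topic contains spaces.
        `julia run_explorer.jl search $topic --export $path`
    end
end

cmds = export_commands(["climate", "health", "education"])

# To execute the commands one after another:
# foreach(run, cmds)
```

Building `Cmd` objects first keeps the list easy to inspect or log before anything is executed.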

API Reference

Client Configuration

# Create client with default configuration
client = CKANClient()

# Create client with custom configuration
config = CKANConfig(
    base_url="https://catalog.data.gov/api/3",
    timeout=30,           # Request timeout in seconds
    rate_limit_ms=500,    # Minimum delay between requests
    max_retries=3,        # Maximum retry attempts
    page_size=100         # Results per page
)
client = CKANClient(config)
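
How `rate_limit_ms` and `max_retries` interact is not spelled out here, so the following is only a sketch of retry with exponential backoff under assumed semantics: the delay doubles after each failed attempt, starting from a base delay. `with_retries` and `backoff_ms` are hypothetical names, not DataGovExplorer functions.

```julia
# Hypothetical sketch of retry with exponential backoff. The keyword names
# only echo the CKANConfig fields shown above; the real client may differ.

# Delay in milliseconds before retry number `attempt` (1-based):
# base, 2*base, 4*base, ...
backoff_ms(attempt::Int; base_ms::Int = 500) = base_ms * 2^(attempt - 1)

# Call `f()`. If it throws, wait and try again, allowing up to `max_retries`
# retries after the first attempt. The last error is rethrown if all fail.
function with_retries(f; max_retries::Int = 3, base_ms::Int = 500)
    for attempt in 0:max_retries
        try
            return f()
        catch
            attempt == max_retries && rethrow()
            sleep(backoff_ms(attempt + 1; base_ms = base_ms) / 1000)
        end
    end
end
```

A request could then be wrapped as `with_retries(() -> fetch_page(url); max_retries = 3)`, where `fetch_page` stands for whatever function performs the HTTP call.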

Metadata Functions

`get_packages(client; limit=nothing, force_refresh=false)`

Get list of all packages (datasets).

`get_organizations(client; force_refresh=false)`

Get list of all organizations.

`get_tags(client; force_refresh=false)`

Get list of all tags.

`get_package_details(client, package_id::String)`

Get detailed information about a specific package.

`get_package_metadata(client, package_id::String)`

Get formatted metadata for a package as a DataFrame.

`search_packages(client; query=nothing, organization=nothing, tags=nothing, rows=100)`

Search for packages with various filters.
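
As a rough illustration of how these filters could map onto a CKAN `package_search` request, the sketch below builds the request URL using the `q`, `fq`, `rows`, and `start` parameters of the CKAN API. The mapping itself is an assumption, and `package_search_url` and `percent_encode` are hypothetical helpers written in base Julia, not part of DataGovExplorer.

```julia
# Hypothetical sketch: translate search filters into a package_search URL.

const UNRESERVED = Set(UInt8.(collect(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~")))

# Percent-encode every byte that is not an RFC 3986 unreserved character.
function percent_encode(s::AbstractString)
    io = IOBuffer()
    for byte in codeunits(String(s))
        if byte in UNRESERVED
            write(io, byte)
        else
            print(io, '%', uppercase(string(byte, base = 16, pad = 2)))
        end
    end
    return String(take!(io))
end

function package_search_url(base_url::AbstractString; query = nothing,
                            organization = nothing, tags = nothing,
                            rows::Int = 100, start::Int = 0)
    params = Pair{String,String}[]
    query === nothing || push!(params, "q" => String(query))

    # Assumed mapping: organization and tag filters become Solr-style
    # filter-query clauses joined with AND.
    filters = String[]
    organization === nothing || push!(filters, "organization:\"$organization\"")
    tags === nothing || push!(filters, "tags:\"$tags\"")
    isempty(filters) || push!(params, "fq" => join(filters, " AND "))

    push!(params, "rows" => string(rows))
    push!(params, "start" => string(start))

    qs = join(("$(k)=$(percent_encode(v))" for (k, v) in params), "&")
    return "$(base_url)/action/package_search?$(qs)"
end
```

Calling `package_search_url("https://catalog.data.gov/api/3"; query = "climate change", rows = 50)` yields a URL that can be pasted into a browser or passed to `curl` to inspect the raw API response.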

Export Functions

`export_to_csv(df, filepath)`

Export DataFrame to CSV format.

`export_to_json(df, filepath; pretty=false)`

Export DataFrame to JSON format.

`export_to_arrow(df, filepath)`

Export DataFrame to Apache Arrow format (efficient binary).

`export_to_xlsx(df, filepath; sheet_name="Data")`

Export DataFrame to Excel format.

`export_data(df, filepath; kwargs...)`

Smart export based on file extension.
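
One plausible way to pick an exporter from the file extension is a lookup table, sketched below. `detect_format` and `FORMAT_BY_EXTENSION` are hypothetical names; the four extensions are the ones used in the export examples, and how `export_data` actually dispatches is not documented here.

```julia
# Hypothetical sketch of extension-based dispatch; not DataGovExplorer code.
const FORMAT_BY_EXTENSION = Dict(
    ".csv"   => :csv,
    ".json"  => :json,
    ".arrow" => :arrow,
    ".xlsx"  => :xlsx,
)

# Map a file path to a format symbol, ignoring the case of the extension.
function detect_format(filepath::AbstractString)
    ext = lowercase(splitext(filepath)[2])   # e.g. ".csv"; "" when absent
    haskey(FORMAT_BY_EXTENSION, ext) ||
        throw(ArgumentError("unsupported export extension: \"$ext\""))
    return FORMAT_BY_EXTENSION[ext]
end
```

The returned symbol could then select among `export_to_csv`, `export_to_json`, `export_to_arrow`, and `export_to_xlsx`.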

`auto_export(df, base_name; format=:csv, output_dir="./output")`

Export with auto-generated filename and timestamp.
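
The exact filename pattern is not documented here, so the sketch below only shows one way a timestamped path could be assembled with the `Dates` standard library. `timestamped_path`, the `timestamp` keyword, and the `yyyymmdd_HHMMSS` pattern are assumptions for illustration.

```julia
using Dates

# Hypothetical sketch: build "<output_dir>/<base_name>_<timestamp>.<format>".
# The timestamp pattern is an assumed convention, not the project's.
function timestamped_path(base_name::AbstractString; format::Symbol = :csv,
                          output_dir::AbstractString = "./output",
                          timestamp::DateTime = Dates.now())
    stamp = Dates.format(timestamp, dateformat"yyyymmdd_HHMMSS")
    return joinpath(output_dir, "$(base_name)_$(stamp).$(format)")
end
```

Passing `timestamp` explicitly makes the function deterministic, which is convenient when testing or when several files from one run should share the same stamp.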

`export_multi_sheet_xlsx(data_dict, filepath)`

Export multiple DataFrames to Excel with multiple sheets.

Project Structure

DataGovExplorer/
├── src/
│   ├── DataGovExplorer.jl      # Main module
│   ├── config.jl               # Configuration structures
│   ├── client.jl               # HTTP client with rate limiting
│   ├── metadata.jl             # Metadata retrieval functions
│   ├── exports.jl              # Export utilities
│   ├── cli.jl                  # CLI command definitions
│   ├── explorer.jl             # Interactive CLI main loop
│   └── explorer/
│       ├── display.jl          # Table formatting and colors
│       ├── input.jl            # User input validation
│       └── menu.jl             # Menu navigation logic
├── examples/
│   ├── quick_start.jl          # Basic connectivity test
│   └── basic_usage.jl          # Common usage patterns
├── Project.toml                # Package dependencies
├── run_explorer.jl             # CLI launcher (interactive & non-interactive)
└── README.md                   # This file

Architecture

The project follows a modular architecture similar to UNStatsExplorer:

Configuration Layer: Centralized configuration for API settings
Client Layer: HTTP client with caching, rate limiting, and retry logic
Metadata Layer: Functions for retrieving catalog information
Export Layer: Multi-format export utilities
Explorer Layer: Interactive CLI with menu navigation
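
To make the caching role of the Client Layer concrete, here is a minimal sketch of an in-memory cache keyed by request URL, with a bypass switch that echoes the `force_refresh` keyword listed in the API Reference. `SimpleCache` and `cached` are hypothetical names; the actual client's cache structure may differ.

```julia
# Hypothetical sketch of an in-memory cache keyed by request URL.
struct SimpleCache
    entries::Dict{String,Any}
end
SimpleCache() = SimpleCache(Dict{String,Any}())

# Return the cached value for `key`, or call `loader()` and store its result.
# With `force_refresh = true` the stored value is ignored and replaced.
function cached(loader, cache::SimpleCache, key::AbstractString;
                force_refresh::Bool = false)
    if !force_refresh && haskey(cache.entries, key)
        return cache.entries[key]
    end
    value = loader()
    cache.entries[key] = value
    return value
end
```

With the `do` syntax this reads naturally at the call site: `cached(cache, url) do ... end`, where the block performs the actual request.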

Key Design Principles

Composability: Small, focused functions that combine into larger workflows
Reusability: Core utilities work independently
Descriptive Naming: Clear, self-documenting function names
Type Safety: Robust handling of missing/invalid data
Minimal Overhead: Efficient caching and resource usage
Progressive Disclosure: Simple API with advanced options

CLI Commands Reference

`search <query>`

Search for datasets by keyword

--limit <N>: Maximum number of results (default: 50)
--export <file>: Export results to file
--output <format>: Output format (table, json, csv, plain)
--no-color: Disable colored output

`org <organization>`

Browse datasets from a specific organization

--limit <N>: Maximum number of results (default: 50)
--export <file>: Export results to file
--output <format>: Output format (table, json, csv, plain)
--no-color: Disable colored output

`tag <tag_name>`

Browse datasets by tag

--limit <N>: Maximum number of results (default: 50)
--export <file>: Export results to file
--output <format>: Output format (table, json, csv, plain)
--no-color: Disable colored output

`recent`

View recently updated datasets

--limit <N>: Maximum number of results (default: 20)
--export <file>: Export results to file
--output <format>: Output format (table, json, csv, plain)
--no-color: Disable colored output

`orgs`

List all organizations

--export <file>: Export results to file
--output <format>: Output format (table, json, csv, plain)
--no-color: Disable colored output

`interactive`

Launch interactive explorer mode

Dependencies

HTTP.jl: HTTP client for API requests
JSON3.jl: Fast JSON parsing
DataFrames.jl: Tabular data manipulation
CSV.jl: CSV export
Arrow.jl: Apache Arrow format
XLSX.jl: Excel export
JSONTables.jl: JSON export
PrettyTables.jl: Console table formatting
ProgressMeter.jl: Progress bars
StringDistances.jl: Fuzzy matching (Jaro-Winkler)
Crayons.jl: ANSI color output
Comonicon.jl: Command-line interface framework

CKAN API

This tool uses the CKAN API (version 3) provided by data.gov. CKAN (Comprehensive Knowledge Archive Network) is an open-source data management system used by governments worldwide.

Key API endpoints used:

/api/3/action/package_list - List all packages
/api/3/action/package_search - Search packages
/api/3/action/package_show - Get package details
/api/3/action/organization_list - List organizations
/api/3/action/group_list - List groups
/api/3/action/tag_list - List tags

API Documentation: https://docs.ckan.org/en/2.11/api/index.html

Performance Tips

Use Caching: Metadata queries are cached by default
Specify Filters: Use organization, tags, and query parameters to narrow searches
Arrow Format: Use Arrow format for large datasets (fastest for re-import)
Pagination: Results are automatically paginated with progress bars
Rate Limiting: Built-in rate limiting respects API constraints
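
The pagination tip can be sketched as a small helper that splits a request into pages using the `start` and `rows` parameters of CKAN's `package_search`. `page_ranges` is a hypothetical function, and the default `page_size` of 100 simply mirrors the configuration example.

```julia
# Hypothetical helper: split a request for `total` rows into (start, rows)
# pages of at most `page_size` rows each.
function page_ranges(total::Int; page_size::Int = 100)
    page_size > 0 || throw(ArgumentError("page_size must be positive"))
    return [(start = s, rows = min(page_size, total - s))
            for s in 0:page_size:(total - 1)]
end
```

Requesting 250 rows with the default page size gives three pages starting at offsets 0, 100, and 200, with the last page trimmed to 50 rows.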

Troubleshooting

DNS/Connection Issues

If you encounter DNS errors like DNSError: catalog.data.gov, unknown node or service (EAI_NONAME):

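This is a known issue with Julia's HTTP.jl DNS resolution on some systems. Options 1 to 3 set JULIA_NO_VERIFY_HOSTS, which turns off TLS host verification for the listed host, so keep it limited to catalog.data.gov and also rule out local DNS problems with Option 4. Solutions: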

Option 1: Use environment variable (Recommended)

# Set this environment variable before running
export JULIA_NO_VERIFY_HOSTS="catalog.data.gov"

# Then run normally
julia run_explorer.jl

Option 2: Add to your shell profile

# Add to ~/.zshrc or ~/.bashrc
export JULIA_NO_VERIFY_HOSTS="catalog.data.gov"

Option 3: Use with each command

JULIA_NO_VERIFY_HOSTS="catalog.data.gov" julia run_explorer.jl search "climate"
JULIA_NO_VERIFY_HOSTS="catalog.data.gov" julia run_explorer.jl interactive

Option 4: Check your DNS settings

Verify DNS is working: ping catalog.data.gov
Try flushing DNS cache: sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder (macOS)
Temporarily switch to Google DNS (8.8.8.8) in System Settings

Connection Issues

If you encounter other connection issues:

Check your internet connection
Verify the data.gov API is accessible: curl -I https://catalog.data.gov/api/3/action/organization_list
Try increasing the timeout in configuration
Check if you're behind a corporate firewall or proxy

API Errors

If you get API errors:

Check if the dataset name/ID is correct
Some datasets may have restricted access
Try again later if the API is under heavy load
Verify API status: https://catalog.data.gov/

Performance Issues

If searches are slow:

Reduce the rows parameter
Use more specific search queries
Clear the cache: client.cache = Dict()
The client may fetch more data than requested during pagination

Contributing

This project was adapted from UNStatsExplorer. Contributions are welcome!

License

[Specify your license here]

Acknowledgments

Based on the architecture of UNStatsExplorer
Data provided by data.gov
CKAN API by CKAN Association

Related Projects

UNStatsExplorer: Julia tool for exploring UN SDG data
CKAN: Open-source data management system

Contact

[Your contact information]

justin4957/DataGovExplorer
