A Julia-based interactive CLI tool for exploring and downloading datasets from the data.gov catalog. Built on the CKAN API, this tool provides an intuitive command-line interface for browsing thousands of government datasets.
- Dual Operation Modes:
  - Interactive Mode: Menu-driven interface for browsing datasets
  - CLI Mode: Non-interactive commands for automation and scripting
- Smart Search: Fuzzy matching and auto-correction for dataset discovery
- Multiple Browse Modes:
  - Browse by organization
  - Browse by tags
  - Search by keywords
  - View recent datasets
- Flexible Export: Export catalog metadata to CSV, JSON, Excel, or Arrow formats
- Flexible Output Formats: Choose between table, JSON, CSV, or plain text output
- Caching: Built-in caching to reduce API calls and improve performance
- Rate Limiting: Automatic rate limiting and retry logic with exponential backoff
- Error Handling: Graceful error handling with helpful suggestions
- Automation-Ready: Perfect for CI/CD pipelines and batch operations
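The retry behavior listed above can be pictured with a small sketch. This is not the tool's actual implementation: `with_retries`, `max_retries`, and `base_delay` are hypothetical names chosen for illustration, and the sketch assumes the delay simply doubles after each failed attempt.

```julia
# Hypothetical helper (not part of DataGovExplorer): call `f`, and on failure
# wait base_delay * 2^attempt seconds before trying again.
function with_retries(f; max_retries::Int = 3, base_delay::Float64 = 0.5)
    for attempt in 0:max_retries
        try
            return f()
        catch
            # Out of retries: surface the last error to the caller.
            attempt == max_retries && rethrow()
            sleep(base_delay * 2.0^attempt)
        end
    end
end

# Usage: an operation that fails twice, then succeeds on the third call.
calls = Ref(0)
result = with_retries(; max_retries = 3, base_delay = 0.01) do
    calls[] += 1
    calls[] < 3 && error("transient failure")
    "ok"
end
```

With `max_retries = 3` the function makes at most four calls in total: the initial attempt plus three retries.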
- Julia 1.9 or later
- Clone or download this repository:
cd /Users/coolbeans/Development/dev/DataGovExplorer
- Install dependencies:
using Pkg
Pkg.activate(".")
Pkg.instantiate()
Launch the interactive explorer:
julia run_explorer.jl
Or from Julia REPL:
using DataGovExplorer
interactive_explorer()
Search for datasets directly from the command line:
# Search for datasets
julia run_explorer.jl search "climate data" --limit 20
# Export results directly
julia run_explorer.jl search "climate" --export climate.csv
# Browse by organization
julia run_explorer.jl org "Department of Commerce" --limit 50
# Browse by tag
julia run_explorer.jl tag "environment" --output json
# View recent datasets
julia run_explorer.jl recent --limit 10
# List all organizations
julia run_explorer.jl orgs --export organizations.csv
Control output format with the --output flag:
# Table format (default)
julia run_explorer.jl search "health" --output table
# JSON format (machine-readable)
julia run_explorer.jl search "health" --output json
# CSV format (pipe to other tools)
julia run_explorer.jl search "health" --output csv
# Plain text (simple list)
julia run_explorer.jl search "health" --output plain
For piping or logging, use --no-color:
julia run_explorer.jl search "climate" --no-color --output json > results.json
using DataGovExplorer
# Create a client
client = CKANClient()
# Search for datasets
climate_data = search_packages(client, q="climate", rows=20)
# Export results
export_to_csv(climate_data, "climate_datasets.csv")
using DataGovExplorer
client = CKANClient()
# Search for datasets about climate
results = search_packages(client, query="climate change", rows=50)
# View results
println("Found $(nrow(results)) datasets")
println(results)
# Export to CSV
export_to_csv(results, "climate_datasets.csv")
CLI mode:
# List all organizations
julia run_explorer.jl orgs --export orgs.csv
# Browse datasets from a specific organization
julia run_explorer.jl org "NOAA" --limit 100 --export noaa_datasets.xlsx
Programmatic:
# Get all organizations
orgs = get_organizations(client)
# Get datasets from a specific organization
noaa_data = search_packages(client, fq="organization:\"noaa-gov\"", rows=100)
# Export to Excel
export_to_xlsx(noaa_data, "noaa_datasets.xlsx")
CLI mode:
# Browse datasets by tag
julia run_explorer.jl tag "health" --limit 50 --export health.json
# Output as JSON for processing
julia run_explorer.jl tag "covid-19" --output json
Programmatic:
# Get all available tags
tags = get_tags(client)
# Find datasets with specific tags
health_data = search_packages(client, fq="tags:\"health\"", rows=50)
# Export to JSON
export_to_json(health_data, "health_datasets.json")
# Get detailed metadata for a specific dataset
dataset_name = "monthly-us-air-quality-1980-2020"
metadata = get_package_metadata(client, dataset_name)
println(metadata)
export_to_csv(metadata, "dataset_details.csv")
CLI mode:
# Export to different formats
julia run_explorer.jl search "education" --limit 100 --export education.csv
julia run_explorer.jl search "education" --limit 100 --export education.json
julia run_explorer.jl search "education" --limit 100 --export education.xlsx
julia run_explorer.jl search "education" --limit 100 --export education.arrow
Programmatic:
# Search for datasets
results = search_packages(client, q="education", rows=100)
# Export to multiple formats
export_to_csv(results, "education.csv")
export_to_json(results, "education.json")
export_to_arrow(results, "education.arrow") # Efficient binary format
export_to_xlsx(results, "education.xlsx")
Use CLI mode in shell scripts for automated data collection:
#!/bin/bash
# collect_datasets.sh - Automated dataset collection
# Collect datasets from different categories
julia run_explorer.jl search "climate" --export data/climate.csv
julia run_explorer.jl search "health" --export data/health.csv
julia run_explorer.jl search "education" --export data/education.csv
# Get recent datasets
julia run_explorer.jl recent --limit 100 --export data/recent.json
# Archive organizations
julia run_explorer.jl orgs --export data/organizations.csv
echo "Data collection complete!"
# Create client with default configuration
client = CKANClient()
# Create client with custom configuration
config = CKANConfig(
    base_url="https://catalog.data.gov/api/3",
    timeout=30,          # Request timeout in seconds
    rate_limit_ms=500,   # Minimum delay between requests
    max_retries=3,       # Maximum retry attempts
    page_size=100        # Results per page
)
client = CKANClient(config)
Get list of all packages (datasets).
Get list of all organizations.
Get list of all tags.
Get detailed information about a specific package.
Get formatted metadata for a package as DataFrame.
Search for packages with various filters.
Export DataFrame to CSV format.
Export DataFrame to JSON format.
Export DataFrame to Apache Arrow format (efficient binary).
Export DataFrame to Excel format.
Smart export based on file extension.
Export with auto-generated filename and timestamp.
Export multiple DataFrames to Excel with multiple sheets.
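The extension-based export and the timestamped filename described above could work roughly as follows. This is a sketch under assumptions, not the package's code: `detect_format` and `timestamped_name` are hypothetical helpers, and the timestamp layout `yyyymmdd_HHMMSS` is an illustrative choice.

```julia
using Dates

# Hypothetical helper: pick an export format tag from the file extension.
function detect_format(path::AbstractString)
    formats = Dict(".csv" => :csv, ".json" => :json,
                   ".arrow" => :arrow, ".xlsx" => :xlsx)
    ext = lowercase(splitext(path)[2])
    haskey(formats, ext) || throw(ArgumentError("unsupported export extension: $ext"))
    return formats[ext]
end

# Hypothetical helper: build a filename such as "recent_20240102_030405.json".
function timestamped_name(prefix::AbstractString, ext::AbstractString;
                          at::DateTime = now())
    stamp = Dates.format(at, "yyyymmdd_HHMMSS")
    return string(prefix, "_", stamp, ext)
end

fmt = detect_format("noaa_datasets.xlsx")
name = timestamped_name("recent", ".json"; at = DateTime(2024, 1, 2, 3, 4, 5))
```

Lowercasing the extension means `DATA.CSV` and `data.csv` resolve to the same format, and an unknown extension fails early instead of writing a file in an unexpected format.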
DataGovExplorer/
├── src/
│   ├── DataGovExplorer.jl    # Main module
│   ├── config.jl             # Configuration structures
│   ├── client.jl             # HTTP client with rate limiting
│   ├── metadata.jl           # Metadata retrieval functions
│   ├── exports.jl            # Export utilities
│   ├── cli.jl                # CLI command definitions
│   ├── explorer.jl           # Interactive CLI main loop
│   └── explorer/
│       ├── display.jl        # Table formatting and colors
│       ├── input.jl          # User input validation
│       └── menu.jl           # Menu navigation logic
├── examples/
│   ├── quick_start.jl        # Basic connectivity test
│   └── basic_usage.jl        # Common usage patterns
├── Project.toml              # Package dependencies
├── run_explorer.jl           # CLI launcher (interactive & non-interactive)
└── README.md                 # This file
The project follows a modular architecture similar to UNStatsExplorer:
- Configuration Layer: Centralized configuration for API settings
- Client Layer: HTTP client with caching, rate limiting, and retry logic
- Metadata Layer: Functions for retrieving catalog information
- Export Layer: Multi-format export utilities
- Explorer Layer: Interactive CLI with menu navigation
- Composability: Small, focused functions that combine into larger workflows
- Reusability: Core utilities work independently
- Descriptive Naming: Clear, self-documenting function names
- Type Safety: Robust handling of missing/invalid data
- Minimal Overhead: Efficient caching and resource usage
- Progressive Disclosure: Simple API with advanced options
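One way to get a simple API with advanced options in Julia is a keyword-default struct. The sketch below is illustrative only: `ExplorerConfig` is a hypothetical stand-in rather than the package's `CKANConfig`, and the default values are assumed from the configuration example in this README.

```julia
# Hypothetical config type: every field has a default, so callers override
# only what they need.
Base.@kwdef struct ExplorerConfig
    base_url::String = "https://catalog.data.gov/api/3"
    timeout::Int = 30          # request timeout in seconds
    rate_limit_ms::Int = 500   # minimum delay between requests
    max_retries::Int = 3       # maximum retry attempts
    page_size::Int = 100       # results per page
end

default_cfg = ExplorerConfig()                                   # simple case
slow_cfg = ExplorerConfig(timeout = 60, rate_limit_ms = 1000)    # advanced case
```

Fields that are not named in the constructor call keep their defaults, so `slow_cfg.max_retries` is still `3`.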
Search for datasets by keyword
- --limit <N>: Maximum number of results (default: 50)
- --export <file>: Export results to file
- --output <format>: Output format (table, json, csv, plain)
- --no-color: Disable colored output
Browse datasets from a specific organization
- --limit <N>: Maximum number of results (default: 50)
- --export <file>: Export results to file
- --output <format>: Output format (table, json, csv, plain)
- --no-color: Disable colored output
Browse datasets by tag
- --limit <N>: Maximum number of results (default: 50)
- --export <file>: Export results to file
- --output <format>: Output format (table, json, csv, plain)
- --no-color: Disable colored output
View recently updated datasets
- --limit <N>: Maximum number of results (default: 20)
- --export <file>: Export results to file
- --output <format>: Output format (table, json, csv, plain)
- --no-color: Disable colored output
List all organizations
- --export <file>: Export results to file
- --output <format>: Output format (table, json, csv, plain)
- --no-color: Disable colored output
Launch interactive explorer mode
- HTTP.jl: HTTP client for API requests
- JSON3.jl: Fast JSON parsing
- DataFrames.jl: Tabular data manipulation
- CSV.jl: CSV export
- Arrow.jl: Apache Arrow format
- XLSX.jl: Excel export
- JSONTables.jl: JSON export
- PrettyTables.jl: Console table formatting
- ProgressMeter.jl: Progress bars
- StringDistances.jl: Fuzzy matching (Jaro-Winkler)
- Crayons.jl: ANSI color output
- Comonicon.jl: Command-line interface framework
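To illustrate the idea behind fuzzy matching and auto-correction, here is a standard-library-only sketch. It deliberately swaps in Levenshtein edit distance instead of the Jaro-Winkler measure the tool uses via StringDistances.jl, and `levenshtein` and `closest_match` are hypothetical helpers, not package functions.

```julia
# Levenshtein edit distance (insertions, deletions, substitutions),
# computed row by row to keep memory use small.
function levenshtein(a::AbstractString, b::AbstractString)
    ca, cb = collect(a), collect(b)
    prev = collect(0:length(cb))
    for i in 1:length(ca)
        curr = similar(prev)
        curr[1] = i
        for j in 1:length(cb)
            cost = ca[i] == cb[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

# Hypothetical helper: suggest the candidate closest to a mistyped query.
closest_match(query, candidates) =
    argmin(c -> levenshtein(lowercase(query), lowercase(c)), candidates)

suggestion = closest_match("enviroment", ["environment", "energy", "education"])
```

Jaro-Winkler weights shared prefixes more heavily than plain edit distance, so the two measures can rank near-ties differently.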
This tool uses the CKAN API (version 3) provided by data.gov. CKAN (Comprehensive Knowledge Archive Network) is an open-source data management system used by governments worldwide.
Key API endpoints used:
- /api/3/action/package_list - List all packages
- /api/3/action/package_search - Search packages
- /api/3/action/package_show - Get package details
- /api/3/action/organization_list - List organizations
- /api/3/action/group_list - List groups
- /api/3/action/tag_list - List tags
API Documentation: https://docs.ckan.org/en/2.11/api/index.html
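As a rough illustration of how a `package_search` request is formed, the sketch below builds the request URL by hand using only the standard library. `percent_encode` and `package_search_url` are hypothetical helpers written for this example; the tool itself sends requests through HTTP.jl, and its internals may differ.

```julia
# Percent-encode everything except RFC 3986 unreserved characters.
function percent_encode(s::AbstractString)
    io = IOBuffer()
    for b in codeunits(s)
        c = Char(b)
        if isascii(c) && (isletter(c) || isdigit(c) || c in "-_.~")
            print(io, c)
        else
            print(io, '%', uppercase(string(b, base = 16, pad = 2)))
        end
    end
    return String(take!(io))
end

# Hypothetical helper: build a package_search URL with q, rows, and optional fq.
function package_search_url(base_url::AbstractString, q::AbstractString;
                            rows::Int = 10, fq = nothing)
    params = Pair{String, String}["q" => q, "rows" => string(rows)]
    fq === nothing || push!(params, "fq" => fq)
    query = join(("$(k)=$(percent_encode(v))" for (k, v) in params), "&")
    return "$(base_url)/action/package_search?$(query)"
end

url = package_search_url("https://catalog.data.gov/api/3", "climate change"; rows = 5)
```

CKAN wraps every response in a JSON envelope with a `success` flag and a `result` field; for `package_search`, `result` contains a `count` and a `results` array.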
- Use Caching: Metadata queries are cached by default
- Specify Filters: Use organization, tags, and query parameters to narrow searches
- Arrow Format: Use Arrow format for large datasets (fastest for re-import)
- Pagination: Results are automatically paginated with progress bars
- Rate Limiting: Built-in rate limiting respects API constraints
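Pagination against CKAN's `package_search` works with the `start` and `rows` parameters. The following sketch shows how a requested total could be split into page requests; `page_requests` is a hypothetical helper, and the assumption that every page is requested at full size is made only for illustration.

```julia
# Hypothetical helper: list the (start, rows) pairs needed to cover `total`
# results when each request asks for a full page of `page_size` rows.
function page_requests(total::Int, page_size::Int)
    total >= 0 && page_size > 0 ||
        throw(ArgumentError("total must be >= 0 and page_size must be > 0"))
    return [(start = s, rows = page_size) for s in 0:page_size:(total - 1)]
end

pages = page_requests(250, 100)
rows_requested = sum(p.rows for p in pages)
```

Under this assumption, asking for 250 results with a page size of 100 issues three requests covering 300 rows, which is one way a client can end up fetching more data than was requested.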
If you encounter DNS errors like DNSError: catalog.data.gov, unknown node or service (EAI_NONAME):
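This is a known issue with DNS resolution in Julia's HTTP.jl on some systems. Note that JULIA_NO_VERIFY_HOSTS, used in Options 1 to 3, turns off TLS host verification for the listed hosts rather than changing how names are resolved, so set it only on networks you trust. Solutions: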
Option 1: Use environment variable (Recommended)
# Set this environment variable before running
export JULIA_NO_VERIFY_HOSTS="catalog.data.gov"
# Then run normally
julia run_explorer.jl
Option 2: Add to your shell profile
# Add to ~/.zshrc or ~/.bashrc
export JULIA_NO_VERIFY_HOSTS="catalog.data.gov"
Option 3: Use with each command
JULIA_NO_VERIFY_HOSTS="catalog.data.gov" julia run_explorer.jl search "climate"
JULIA_NO_VERIFY_HOSTS="catalog.data.gov" julia run_explorer.jl interactive
Option 4: Check your DNS settings
- Verify DNS is working: ping catalog.data.gov
- Try flushing DNS cache: sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder (macOS)
- Temporarily switch to Google DNS (8.8.8.8) in System Settings
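The DNS check can also be run from inside Julia, which helps confirm whether the failure comes from the Julia process or from the system as a whole. This is a small sketch using the `Sockets` standard library; `can_resolve` is a hypothetical helper, not part of the package.

```julia
using Sockets

# Hypothetical helper: return true if Julia can resolve `host`, and false if
# the lookup fails with a DNS error. Other errors are rethrown.
function can_resolve(host::AbstractString)
    try
        getaddrinfo(host)
        return true
    catch err
        err isa Sockets.DNSError || rethrow()
        return false
    end
end

# The reserved ".invalid" top-level domain should never resolve.
unreachable = can_resolve("no-such-host.invalid")
```

If `can_resolve("catalog.data.gov")` returns `false` while `ping catalog.data.gov` succeeds in the same shell, the problem is more likely specific to the Julia environment than to the network.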
If you encounter other connection issues:
- Check your internet connection
- Verify the data.gov API is accessible: curl -I https://catalog.data.gov/api/3/action/organization_list
- Try increasing the timeout in configuration
- Check if you're behind a corporate firewall or proxy
If you get API errors:
- Check if the dataset name/ID is correct
- Some datasets may have restricted access
- Try again later if the API is under heavy load
- Verify API status: https://catalog.data.gov/
If searches are slow:
- Reduce the rows parameter
- Use more specific search queries
- Clear the cache: client.cache = Dict()
- The API may fetch more data than requested during pagination
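For readers curious about what clearing the cache does, here is a minimal sketch of a dictionary-backed cache. It assumes nothing about the client's real internals: `MemoCache`, `cached`, and `clear!` are hypothetical names, and the cached values are placeholders.

```julia
# Hypothetical cache: results are stored under a string key.
struct MemoCache
    store::Dict{String, Any}
end
MemoCache() = MemoCache(Dict{String, Any}())

# Return the stored value for `key`, calling `fetch` only on a cache miss.
cached(fetch, cache::MemoCache, key::AbstractString) =
    get!(fetch, cache.store, String(key))

# Drop all stored entries so the next lookup fetches fresh data.
clear!(cache::MemoCache) = empty!(cache.store)

cache = MemoCache()
fetches = Ref(0)
load() = (fetches[] += 1; ["org-a", "org-b"])   # placeholder for an API call

first_call = cached(load, cache, "organization_list")
second_call = cached(load, cache, "organization_list")   # served from the cache
clear!(cache)
third_call = cached(load, cache, "organization_list")    # fetched again
```

Emptying the dictionary in place with `empty!` has the same effect as assigning a fresh `Dict()` and also works when the field holding the cache cannot be reassigned.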
This project was adapted from UNStatsExplorer. Contributions are welcome!
[Specify your license here]
- Based on the architecture of UNStatsExplorer
- Data provided by data.gov
- CKAN API by CKAN Association
- UNStatsExplorer: Julia tool for exploring UN SDG data
- CKAN: Open-source data management system
[Your contact information]