NASA Project Exploration & eXtraction lets you search and explore thousands of NASA's groundbreaking scientific and technological projects. It is powered by hybrid search that combines traditional keyword matching with AI-powered semantic understanding.
The core feature is a web interface for searching and exploring NASA projects. It provides:
- Smart Search: Hybrid retrieval combining keyword matching and semantic similarity
- Embedding Options: Choose between OpenAI (3072-dim) or MiniLM (384-dim) embeddings
- Multiple Search Modes: Keyword-only, enhanced keyword, semantic, or hybrid ranking
- Rich Metadata: Projects include taxonomy classification, facilities, partners, and technology readiness levels
- Responsive Design: Built with Next.js and React for fast, intuitive browsing
The search API (web/app/api/search/route.ts) is configured to easily switch between embedding providers:
// To use OpenAI: Uncomment the OpenAI section and comment out MiniLM
// To use MiniLM: Keep MiniLM active (default)
// OpenAI section (commented by default)
/*
async function getEmbedding(text: string): Promise<number[]> {
// OpenAI implementation
}
*/
// MiniLM section (active by default)
async function getEmbedding(text: string): Promise<number[]> {
// MiniLM implementation with Transformers.js
}

The interface exposes four search modes:
- Keyword: Standard Solr keyword search
- Keyword+: Enhanced keyword search with semantic boosts
- Semantic: Pure vector similarity search
- Hybrid: Application-side RRF (Reciprocal Rank Fusion) combining vector and keyword search
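The fusion step behind the Hybrid mode is small enough to sketch. Below is a minimal, illustrative Reciprocal Rank Fusion in TypeScript; the function name rrfFuse, the k constant, and the assumption that both searches return ranked lists of document IDs are choices made for this example, not the exact code in route.ts.

```typescript
// Minimal RRF sketch: fuse two ranked lists of document IDs (best-first).
// k dampens the contribution of low-ranked hits; 60 is a common default.
function rrfFuse(keywordIds: string[], vectorIds: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ids of [keywordIds, vectorIds]) {
    ids.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  // Highest fused score first; documents ranked well by both searches rise to the top.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Each document earns 1 / (k + rank + 1) per list it appears in, so a project ranked highly by both keyword and vector search outranks one that is strong in only a single list.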
# From the project root (pri-stellar folder)
docker-compose up -d

Note
The Docker setup includes both the web application and the Solr search engine. Make sure the appropriate schema is configured in docker-compose.yml and that the JSON data for the chosen embedding model (OpenAI or MiniLM) is properly loaded.
To switch between embedding models, update the solr-init service in docker-compose.yml:
- For MiniLM: Use schema-hybrid-final-MiniLM.json and output_techport_embeddings_MiniLM.json
- For OpenAI: Change to schema-hybrid-final-OpenAI.json and output_techport_embeddings_OpenAI.json
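As a purely illustrative sketch (the service name and the way these files are wired in are assumptions here, so check the real docker-compose.yml), the MiniLM variant might mount the two files like this:

```yaml
# Hypothetical excerpt -- adapt to the actual solr-init service definition.
solr-init:
  volumes:
    - ./data/schema-hybrid-final-MiniLM.json:/init/schema.json
    - ./data/output_techport_embeddings_MiniLM.json:/init/embeddings.json
    # For OpenAI, swap both file names for the *-OpenAI.json counterparts.
```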
Note
For OpenAI embeddings, ensure OPENAI_API_KEY is set in web/.env.local. MiniLM works locally without API keys.
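For example, web/.env.local needs a single line (shown here with a placeholder value):

```
OPENAI_API_KEY=your-openai-api-key
```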
Note
Before running docker-compose up -d, ensure that contacts.json and organizations.json are present in the database/extracted_data/ folder (they may need to be unzipped from an archive). These files populate the MongoDB database. Also ensure the appropriate output_techport_embeddings_*.json file is present in the data/ folder based on your chosen embedding model (MiniLM or OpenAI).
scraping → facilities merge → taxonomies merge → final document
- E (Extract) → Scrape TechPort data to JSON
- T (Transform) → Merge with facilities and clean data
- L (Load) → Merge with taxonomies to create Solr-ready JSON file
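To make the Load target concrete, the sketch below shows a plausible shape for one Solr-ready project record, derived from the metadata listed earlier (taxonomy, facilities, partners, TRL) plus the embedding vector; the actual field names in the generated JSON may differ.

```typescript
// Hypothetical shape of one Solr-ready TechPort record; real field names may differ.
interface TechportProjectDoc {
  id: string;            // TechPort project identifier
  title: string;
  description: string;
  taxonomy: string[];    // taxonomy classification labels
  facilities: string[];  // merged NASA facility names
  partners: string[];    // partner organizations
  trl: number;           // technology readiness level
  embedding: number[];   // 384-dim (MiniLM) or 3072-dim (OpenAI) vector
}
```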
- uv (recommended) or Python 3.8+
- Chrome browser (for web scraping)
- Required data files (see Data Setup below)
# Run complete pipeline (auto-setup + validation + ETL)
make all

# 1. Setup environment and install dependencies
make setup
# 2. Validate data files exist
make validate-data
# 3. Run individual pipeline steps
make extract # Step 1: Scrape TechPort data
make transform-facilities # Step 2: Merge facilities data
make transform-taxonomies # Step 3: Merge taxonomies data
# Or run all steps at once
make all

Download and place these files in the data/ folder:
- NASA_TechPort_rows.csv
- NASA_Facilities_rows.csv
- NASA_Taxonomies.xlsx
- Download: https://techport.nasa.gov/taxonomy/8817
⚠️ Important: Rename the downloaded file from XXXX-XX-XX TechPort Taxonomies Export.xlsx to NASA_Taxonomies.xlsx
- The pipeline will auto-convert Excel to CSV
If you have a datasets.zip file in the data/ folder containing all CSV files (version: October 2025):
# Place datasets.zip in data/ folder, then:
make unzip-datasets # Extract all files
make all # Run pipeline

make status # Show environment and file status
make stats # Show pipeline statistics
make validate-output # Validate final JSON output

make clean-output # Remove generated pipeline files
make clean-all # Remove everything (data + environment)

# Force reinstall requirements
make reinstall-requirements
# Restart from specific step
make restart-from-facilities # Re-run facilities + taxonomies merge
make restart-from-taxonomies # Re-run only taxonomies merge
make restart-solr # Re-run Solr transformation
make restart-database # Re-run database extraction
# Run full Solr + database extraction
make solr-pipeline # Run both Solr and database extraction sequentially
# Check all validations
make check

make help # Show all available commands and usage

If you prefer not to use the Makefile:
# Create virtual environment
uv venv
# Install dependencies
uv pip install -r requirements.txt
# Run pipeline manually
uv run scraping.py
uv run merge_facilities.py
uv run merge_taxonomies.py

# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Run pipeline manually
python scraping.py
python merge_facilities.py
python merge_taxonomies.py