This repository was archived by the owner on Jan 27, 2026. It is now read-only.

YZ1GO/M.EIC003_PRI_PRJ

NPEX

Python 3.8+ · TypeScript 5.0+ · Next.js 14+ · Apache Solr 9 · MongoDB

NPEX (NASA Project Exploration & eXtraction) lets you search and explore thousands of NASA's groundbreaking scientific and technological projects. It is powered by hybrid search, which combines traditional keyword matching with AI-powered semantic understanding.

Search Engine

The core feature is a web interface for searching and exploring NASA projects. It provides:

  • Smart Search: Hybrid retrieval combining keyword matching and semantic similarity
  • Embedding Options: Choose between OpenAI (3072-dim) and MiniLM (384-dim) embeddings
  • Multiple Search Modes: Keyword-only, enhanced keyword, semantic, or hybrid ranking
  • Rich Metadata: Projects include taxonomy classification, facilities, partners, and technology readiness levels
  • Responsive Design: Built with Next.js and React for fast, intuitive browsing

Switching Between Embeddings

The search API (web/app/api/search/route.ts) contains two getEmbedding implementations, one per provider. To switch providers, uncomment the one you want and comment out the other:

// To use OpenAI: Uncomment the OpenAI section and comment out MiniLM
// To use MiniLM: Keep MiniLM active (default)

// OpenAI section (commented by default)
/*
async function getEmbedding(text: string): Promise<number[]> {
    // OpenAI implementation
}
*/

// MiniLM section (active by default)
async function getEmbedding(text: string): Promise<number[]> {
    // MiniLM implementation with Transformers.js
}

Search Modes

  • Keyword: Standard Solr keyword search
  • Keyword+: Enhanced keyword search with semantic boosts
  • Semantic: Pure vector similarity search
  • Hybrid: Application-side RRF (Reciprocal Rank Fusion) combining vector and keyword search
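For readers unfamiliar with Reciprocal Rank Fusion, the idea behind the Hybrid mode can be sketched in a few lines. This is an illustration only: the project's own fusion runs in the TypeScript search route and may differ in details. The function name, the sample project IDs, and the constant k = 60 (a commonly used default) are assumptions, not values taken from this repository.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1. k = 60 is an assumed default.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the keyword and vector queries.
keyword_hits = ["p1", "p2", "p3"]
vector_hits = ["p3", "p1", "p4"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that appear near the top of both lists rise above documents that rank highly in only one, which is why RRF needs no score normalization between Solr's keyword scores and vector similarities.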

Running the Web Application

# From the project root (pri-stellar folder)
docker-compose up -d

Note

The Docker setup includes both the web application and Solr search engine. Make sure to configure the appropriate schema in docker-compose.yml and ensure the JSON data is properly loaded for the chosen embedding model (OpenAI or MiniLM).

To switch between embedding models, update the solr-init service in docker-compose.yml:

  • For MiniLM: Use schema-hybrid-final-MiniLM.json and output_techport_embeddings_MiniLM.json
  • For OpenAI: Change to schema-hybrid-final-OpenAI.json and output_techport_embeddings_OpenAI.json
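Because the schema file, the data file, and the vector size must all match the chosen model, it can help to keep the pairing in one place. The sketch below is a hypothetical helper, not part of the repository; the file names and dimensions come from this README, while the class and function names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class EmbeddingConfig:
    schema_file: str
    data_file: str
    dimensions: int


# File names and dimensions as listed in this README.
CONFIGS: Dict[str, EmbeddingConfig] = {
    "MiniLM": EmbeddingConfig(
        "schema-hybrid-final-MiniLM.json",
        "output_techport_embeddings_MiniLM.json",
        384,
    ),
    "OpenAI": EmbeddingConfig(
        "schema-hybrid-final-OpenAI.json",
        "output_techport_embeddings_OpenAI.json",
        3072,
    ),
}


def check_vector(model: str, vector: Sequence[float]) -> None:
    """Raise ValueError if the vector length does not match the model's dimension."""
    expected = CONFIGS[model].dimensions
    if len(vector) != expected:
        raise ValueError(
            f"{model} expects {expected}-dim vectors, got {len(vector)}"
        )


check_vector("MiniLM", [0.0] * 384)  # passes silently
```

A mismatch here (for example, querying a MiniLM-indexed core with a 3072-dim OpenAI vector) is a likely symptom of switching the embedding function without also switching the schema and data files.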

Note

For OpenAI embeddings, ensure OPENAI_API_KEY is set in web/.env.local. MiniLM works locally without API keys.

Note

Before running docker-compose up -d, ensure that contacts.json and organizations.json are present in the database/extracted_data/ folder (they may need to be unzipped from an archive). These files populate the MongoDB database. Also ensure the appropriate output_techport_embeddings_*.json file is present in the data/ folder based on your chosen embedding model (MiniLM or OpenAI).

Pipeline (ETL)

scraping → facilities merge → taxonomies merge → final document

  • E (Extract) → Scrape TechPort data to JSON
  • T (Transform) → Merge with facilities and clean data
  • L (Load) → Merge with taxonomies to create Solr-ready JSON file
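Conceptually, the two merge steps each enrich a project record with rows from a lookup table. The sketch below shows that pattern as a simple left join. It is not the code in merge_facilities.py or merge_taxonomies.py, and every field name in it (`facility_id`, `facility`, `name`) is invented for illustration.

```python
from typing import Dict, Iterable, List, Optional

Record = Dict[str, object]


def merge_by_key(
    projects: Iterable[Record],
    lookup_rows: Iterable[Record],
    key: str,
    target_field: str,
) -> List[Record]:
    """Left join: attach the matching lookup row to each project.

    Projects without a match keep target_field = None instead of being dropped.
    """
    index: Dict[object, Record] = {row[key]: row for row in lookup_rows}
    merged: List[Record] = []
    for project in projects:
        enriched = dict(project)  # copy, so the input records stay untouched
        match: Optional[Record] = index.get(project.get(key))
        enriched[target_field] = match
        merged.append(enriched)
    return merged


# Hypothetical sample data.
projects = [{"id": 1, "facility_id": "F1"}, {"id": 2, "facility_id": "F9"}]
facilities = [{"facility_id": "F1", "name": "Example Center"}]
merged = merge_by_key(projects, facilities, "facility_id", "facility")
```

Applying the same function a second time with a taxonomy table would mirror the facilities-then-taxonomies order of the pipeline.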

How to run

Prerequisites

  • uv (recommended) or Python 3.8+
  • Chrome browser (for web scraping)
  • Required data files (see Data Setup below)

Quick Start (Recommended)

# Run complete pipeline (auto-setup + validation + ETL)
make all

Step-by-Step Commands

# 1. Setup environment and install dependencies
make setup

# 2. Validate data files exist
make validate-data

# 3. Run individual pipeline steps
make extract                 # Step 1: Scrape TechPort data
make transform-facilities    # Step 2: Merge facilities data  
make transform-taxonomies    # Step 3: Merge taxonomies data

# Or run all steps at once
make all

Data Setup

Option 1: Individual Downloads

Download and place these files in the data/ folder:

  1. NASA_TechPort_rows.csv

  2. NASA_Facilities_rows.csv

  3. NASA_Taxonomies.xlsx

    • Download: https://techport.nasa.gov/taxonomy/8817
    • ⚠️ Important: Rename downloaded file from XXXX-XX-XX TechPort Taxonomies Export.xlsx to NASA_Taxonomies.xlsx
    • The pipeline will auto-convert Excel to CSV

Option 2: Bulk Download

If you have a datasets.zip file containing all the CSV files (version: October 2025), place it in the data/ folder and extract it:

# Place datasets.zip in data/ folder, then:
make unzip-datasets  # Extract all files
make all            # Run pipeline

Development Commands

Validation & Analysis

make status          # Show environment and file status
make stats           # Show pipeline statistics
make validate-output # Validate final JSON output

Cleanup

make clean-output    # Remove generated pipeline files
make clean-all      # Remove everything (data + environment)

Troubleshooting

# Force reinstall requirements
make reinstall-requirements

# Restart from specific step
make restart-from-facilities   # Re-run facilities + taxonomies merge
make restart-from-taxonomies   # Re-run only taxonomies merge
make restart-solr              # Re-run Solr transformation
make restart-database          # Re-run database extraction

# Run full Solr + database extraction
make solr-pipeline             # Run both Solr and database extraction sequentially

# Check all validations
make check

Need Help?

make help  # Show all available commands and usage

Manual Setup (Alternative)

If you prefer not to use the Makefile:

Using uv (recommended)

# Create virtual environment
uv venv

# Install dependencies  
uv pip install -r requirements.txt

# Run pipeline manually
uv run scraping.py
uv run merge_facilities.py
uv run merge_taxonomies.py

Using pip

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  

# Install dependencies
pip install -r requirements.txt

# Run pipeline manually
python scraping.py
python merge_facilities.py  
python merge_taxonomies.py

About

NPEX (NASA Project Exploration and eXtraction): A compact searchable platform and ETL pipeline for NASA TechPort data.
