PYQ Finder

A web app to scrape, search, and download MIT Manipal previous year question papers.

Local-Only Docker Setup (Recommended)

This project now runs fully locally with Docker Compose:

  • Frontend + API served behind one URL: http://localhost:8080
  • Local SQLite database persisted in a Docker volume
  • Optional local PDF caching (saved under /data/pdfs inside the backend container)
  • Selenium/Chromium included for Portal 1 and Portal 2 scraping

Prerequisites

  • Docker Desktop (or Docker Engine + Compose plugin)
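
You can verify both are available with:

docker --version
docker compose version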

1. Configure environment (optional)

cp .env.example .env

You can run without .env; defaults are provided in docker-compose.yml.

Admin access defaults:

  • Admin password is @Yush06012002! by default.
  • You should override it in .env (ADMIN_PASSWORD=...) or use ADMIN_PASSWORD_HASH for production.
  • You can also change the admin password from the Admin page; it is stored as a hash in SQLite and survives restarts.
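
For reference, a minimal .env might look like the following (variable names are the ones mentioned in this README; see .env.example for the full list, and treat the values as illustrative, not the shipped defaults):

# Strongly recommended: override the default admin password
ADMIN_PASSWORD=change-me-to-something-strong
# For production, a pre-computed hash can be set instead (format per the backend):
# ADMIN_PASSWORD_HASH=...
# Published port for the app (defaults to 8080)
PUBLIC_PORT=8080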

2. Build and run

docker compose up -d --build
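
If you later change only one side, Compose can rebuild a single service (service names backend and frontend match those used in the log commands below):

docker compose up -d --build backend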

3. Open the app

  • http://localhost:8080
  • Host/port check: docker compose ps
  • Startup URL log: docker compose logs -f frontend
  • If port 8080 is already in use on your machine, set PUBLIC_PORT in .env (for example, PUBLIC_PORT=8090) and run docker compose up -d again.
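
A quick reachability check from a terminal:

curl -I http://localhost:8080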

Stop / start behavior

  • Stop and remove the stack: docker compose down (containers and networks are removed; the data volume is kept)
  • Start it again: docker compose up -d
  • Services use restart: unless-stopped, so once created they come back automatically whenever Docker starts.
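
To pause the stack without removing containers, the standard Compose commands also work:

docker compose stop
docker compose start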

Data persistence

  • Persistent volume: pyq_data
  • Stores:
    • SQLite DB: /data/pyqfinder.db
    • Local PDF cache: /data/pdfs
  • Recreating containers does not remove data unless the volume itself is deleted.
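
One common way to back up the volume is to tar its contents from a throwaway container (a sketch using the stock alpine image; depending on your Compose project name the volume may appear with a prefix, so check docker volume ls first):

docker run --rm -v pyq_data:/data -v "$PWD":/backup alpine tar czf /backup/pyq_data-backup.tgz -C /data .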

Useful commands

docker compose logs -f backend
docker compose logs -f frontend
docker compose ps

Architecture

  • Frontend: SvelteKit static build served by Nginx
  • Backend: Flask + Gunicorn
  • Database: SQLite (local)
  • Scraping:
    • Portal 1: requests + BeautifulSoup
    • Portal 2: Selenium + Chromium
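
As a rough sketch only (not the repository's actual docker-compose.yml), the wiring looks like this: the frontend's Nginx publishes the single URL and fronts the backend, which owns the data volume:

services:
  backend:
    build: ./backend
    volumes:
      - pyq_data:/data              # SQLite DB and PDF cache live here
    restart: unless-stopped
  frontend:
    build: ./frontend
    ports:
      - "${PUBLIC_PORT:-8080}:80"   # single public URL; port 80 inside the container is an assumption
    depends_on:
      - backend
    restart: unless-stopped

volumes:
  pyq_data: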

Notes on scraping and network

  • All app infrastructure and data are local.
  • Scraping still needs outbound internet access to:
    1. https://mitmpllibportal.manipal.edu/question-papers
    2. https://libportal.manipal.edu/mit/Question%20Paper.aspx
  • Link scraping and bulk caching are separate actions:
    1. Run scrape (Portal 1, Portal 2, or both) to collect linked PDFs.
    2. Run "Download All Linked PDFs" from Admin to cache files locally.
  • Portal 2 parallel scraping:
    1. In Admin, set Portal 2 Workers / Year up to 10.
    2. Selected years are sharded so each year gets its own worker pool.
    3. Total worker cap is controlled by PORTAL2_MAX_TOTAL_WORKERS (default 300); see the example after this list.
  • The scrape status panel includes a live event log and per-phase worker activity.
  • Scrape status events are verbose (year/session/folder/worker logs); up to 3,000 lines are retained per run.
  • The backend uses a single Gunicorn worker by default so that scrape state and progress stay consistent across API calls.
  • Admin API routes are protected by cookie-based session auth, CSRF header checks, and login-attempt throttling.
  • Login lockouts now show a cooldown countdown in the Admin UI.
  • One-time duplicate cleanup for older data:
    1. Open Admin and run "Preview (Dry Run)" in "Duplicate Cleanup".
    2. If results look correct, run "Run Cleanup" to remove old duplicates and backfill dedupe keys.
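
For example, to change the Portal 2 worker cap mentioned above, set the variable in .env and recreate the stack:

echo "PORTAL2_MAX_TOTAL_WORKERS=300" >> .env
docker compose up -d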

Local (non-Docker) development

Backend

cd backend
pip install -r requirements.txt
python main.py
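
If you prefer an isolated environment, the same steps work inside a venv (standard Python tooling, not specific to this repo):

cd backend
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
python main.py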

Frontend

cd frontend
npm install
npm run dev
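
To test the production build locally (assuming the standard SvelteKit scripts are present in package.json):

npm run build
npm run preview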
