A web app to scrape, search, and download MIT Manipal previous year question papers.
This project now runs fully locally with Docker Compose:
- Frontend + API served behind one URL: `http://localhost:8080`
- Local SQLite database persisted in a Docker volume
- Optional local PDF caching (saved under `/data/pdfs` inside the backend container)
- Selenium/Chromium included for Portal 1 and Portal 2 scraping
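For orientation, a minimal sketch of the stack's shape, assuming the service and volume names used by the compose commands in this README; the shipped `docker-compose.yml` is authoritative and the build/port details here are placeholders:

```yaml
# Sketch only: service and volume names match the commands in this README,
# but build contexts and internal ports are assumptions.
services:
  backend:
    build: ./backend            # Flask + Gunicorn + Selenium/Chromium
    volumes:
      - pyq_data:/data          # SQLite DB and PDF cache
    restart: unless-stopped
  frontend:
    build: ./frontend           # SvelteKit static build served by Nginx,
                                # which also proxies API calls to the backend
    ports:
      - "${PUBLIC_PORT:-8080}:80"
    depends_on:
      - backend
    restart: unless-stopped

volumes:
  pyq_data:
```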
Requirements:
- Docker Desktop (or Docker Engine + Compose plugin)
Setup:

```bash
cp .env.example .env
```

You can run without `.env`; defaults are provided in `docker-compose.yml`.
Admin access defaults:
- Admin password is `@Yush06012002!` by default.
- You should override it in `.env` (`ADMIN_PASSWORD=...`) or use `ADMIN_PASSWORD_HASH` for production.
- You can also change the admin password from the Admin page; it is stored as a hash in SQLite and survives restarts.
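If the backend verifies `ADMIN_PASSWORD_HASH` with Werkzeug's password helpers (an assumption based on the Flask backend; check the backend code for the actual hashing scheme), a hash can be generated like this:

```python
# Assumption: the backend accepts hashes produced by
# werkzeug.security.generate_password_hash; verify against the backend code.
from werkzeug.security import generate_password_hash

print(generate_password_hash("your-new-admin-password"))
```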
Run:

```bash
docker compose up -d --build
```

Then open `http://localhost:8080`.
- Host/port check: `docker compose ps`
- Startup URL log: `docker compose logs -f frontend`
- If `8080` is occupied on your machine, set `PUBLIC_PORT` in `.env` (example: `PUBLIC_PORT=8090`) and run `docker compose up -d`; concrete steps below.
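For example, to move the stack to a free port (assumes `.env` exists and `PUBLIC_PORT` is read by `docker-compose.yml`, per the defaults noted above):

```bash
# Switch the published port to 8090 and recreate the stack.
echo "PUBLIC_PORT=8090" >> .env
docker compose up -d
curl -I http://localhost:8090   # quick reachability check
```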
Lifecycle:
- Stop the stack: `docker compose down`
- Start an existing stack: `docker compose up -d`
- Services use `restart: unless-stopped`, so they come back when Docker starts (after initial creation).
Data persistence:
- Persistent volume: `pyq_data`
- Stores:
  - SQLite DB: `/data/pyqfinder.db`
  - Local PDF cache: `/data/pdfs`
- Recreating containers does not remove data unless the volume is deleted.
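To back up the volume, the generic Docker volume-archive pattern works (not a project-provided script; depending on your Compose project name, the volume may appear as `<project>_pyq_data` in `docker volume ls`):

```bash
# Archive the SQLite DB and cached PDFs from the pyq_data volume
# into the current directory on the host.
docker run --rm -v pyq_data:/data -v "$PWD":/backup alpine \
  tar czf /backup/pyq_data-backup.tar.gz -C /data .
```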
Logs and status:

```bash
docker compose logs -f backend
docker compose logs -f frontend
docker compose ps
```

Stack:
- Frontend: SvelteKit static build served by Nginx
- Backend: Flask + Gunicorn
- Database: SQLite (local)
- Scraping:
  - Portal 1: requests + BeautifulSoup
  - Portal 2: Selenium + Chromium
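For orientation, a minimal sketch of the Portal 1 approach (requests + BeautifulSoup). The URL comes from this README, but the selector is an assumption, not the project's actual parsing logic:

```python
# Minimal sketch of requests + BeautifulSoup link scraping.
# The PDF-link selector is an assumption; the real scraper may differ.
import requests
from bs4 import BeautifulSoup

PORTAL1_URL = "https://mitmpllibportal.manipal.edu/question-papers"

resp = requests.get(PORTAL1_URL, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
pdf_links = [
    a["href"]
    for a in soup.find_all("a", href=True)
    if a["href"].lower().endswith(".pdf")
]
print(f"Found {len(pdf_links)} linked PDFs")
```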
Network:
- App infrastructure and data are local.
- Scraping still needs outbound internet access to:
  - https://mitmpllibportal.manipal.edu/question-papers
  - https://libportal.manipal.edu/mit/Question%20Paper.aspx
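Portal 2 is driven with headless Chromium. The snippet below shows only the standard Selenium setup, not the project's actual navigation of the ASPX page:

```python
# Minimal sketch of headless Chromium via Selenium for Portal 2.
# Navigation/parsing of the ASPX portal is omitted; this only shows
# the standard headless setup, not the project's actual scraper.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument("--headless=new")
opts.add_argument("--no-sandbox")           # common in containers
opts.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(options=opts)
try:
    driver.get("https://libportal.manipal.edu/mit/Question%20Paper.aspx")
    print(driver.title)
finally:
    driver.quit()
```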
Scraping and admin notes:
- Link scraping and bulk caching are separate actions:
  - Run a scrape (Portal 1, Portal 2, or both) to collect linked PDFs.
  - Run "Download All Linked PDFs" from Admin to cache files locally.
- Portal 2 parallel scraping (see the sketch after this list):
  - In Admin, set Portal 2 `Workers / Year` up to `10`.
  - Selected years are sharded so each year gets its own worker pool.
  - The total worker cap is controlled by `PORTAL2_MAX_TOTAL_WORKERS` (default `300`).
- Scrape status panel now includes a live event log and per-phase worker activity.
- Scrape status events are verbose (year/session/folder/worker logs) and retained up to 3000 lines per run.
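As a rough illustration of the per-year sharding described above (names such as `scrape_year` are hypothetical stand-ins, not the project's functions):

```python
# Illustrative sketch of per-year worker pools under a global cap.
# scrape_year() and the task submission are hypothetical placeholders.
import os
from concurrent.futures import ThreadPoolExecutor

def scrape_year(year: int, workers: int) -> None:
    # Each year gets its own pool of up to `workers` threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        ...  # submit per-session/folder tasks for this year

years = [2021, 2022, 2023, 2024]
per_year = 10                                        # Admin "Workers / Year"
cap = int(os.getenv("PORTAL2_MAX_TOTAL_WORKERS", "300"))
per_year = min(per_year, max(1, cap // len(years)))  # respect the global cap

with ThreadPoolExecutor(max_workers=len(years)) as year_pool:
    for y in years:
        year_pool.submit(scrape_year, y, per_year)
```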
Runtime and security notes:
- Backend uses a single Gunicorn worker by default so scrape state/progress is consistent across API calls (see the command sketch below).
- Admin API routes are protected by cookie-session auth, CSRF header checks, and login-attempt throttling (pattern sketched below).
- Login lockouts now show a cooldown countdown in the Admin UI.
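For reference, a single-worker Gunicorn invocation looks like this; the `main:app` module path is an assumption based on `backend/main.py`, so check the actual entrypoint:

```bash
# -w 1: one worker process, so in-memory scrape state is shared across requests.
gunicorn -w 1 -b 0.0.0.0:8000 main:app
```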
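The cookie-session + CSRF-header pattern, in generic Flask terms (illustrative only; the header name and route prefix are hypothetical, not the project's actual ones):

```python
# Generic illustration of the cookie-session + CSRF-header pattern.
# "X-CSRF-Token" and the "/api/admin" prefix are hypothetical names.
import secrets
from flask import Flask, abort, request, session

app = Flask(__name__)
app.secret_key = secrets.token_hex(32)

@app.before_request
def check_csrf():
    # Mutating admin requests must echo the session's CSRF token back in a
    # header, which a cross-site request cannot read or set.
    if request.method in ("POST", "PUT", "DELETE") and request.path.startswith("/api/admin"):
        if request.headers.get("X-CSRF-Token") != session.get("csrf_token"):
            abort(403)
```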
Maintenance:
- One-time duplicate cleanup for older data (sketched conceptually below):
  - Open Admin and run "Preview (Dry Run)" in "Duplicate Cleanup".
  - If the results look correct, run "Run Cleanup" to remove old duplicates and backfill dedupe keys.
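Conceptually, the cleanup resembles a keep-one-row dedupe in SQLite. The table and column names below are hypothetical, and on real data you should use the Admin dry run rather than raw SQL:

```python
# Hypothetical illustration only: assumes a papers(dedupe_key, ...) table.
# The real schema and cleanup logic live in the backend; prefer the Admin UI.
import sqlite3

con = sqlite3.connect("/data/pyqfinder.db")  # path inside the backend container
con.execute(
    "DELETE FROM papers WHERE rowid NOT IN "
    "(SELECT MIN(rowid) FROM papers GROUP BY dedupe_key)"
)
con.commit()
con.close()
```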
Local development (without Docker):

Backend:

```bash
cd backend
pip install -r requirements.txt
python main.py
```

Frontend:

```bash
cd frontend
npm install
npm run dev
```