4 changes: 4 additions & 0 deletions .gitignore
@@ -0,0 +1,4 @@
.ipynb_checkpoints/
__pycache__/
*.pyc
.venv/
33 changes: 33 additions & 0 deletions AGENTS.md
@@ -0,0 +1,33 @@
# AGENTS.md

This repository is a collection of standalone data-science / ML **Jupyter notebooks**
(originally authored for Google Colab). There is no server, database, or build step —
the "application" is JupyterLab plus the scientific Python stack.

Notebooks:
- `CIS545FinalProject.ipynb` — EDA + sklearn models on the US Accidents dataset (pandas, geopandas, sklearn).
- `Transformer_Exercise.ipynb` — implement GPT-2 from scratch (torch, transformers, einops).
- `WAFChallenge.ipynb` — TF-IDF + KMeans clustering on a YouTube videos CSV (sklearn).

## Cursor Cloud specific instructions

- Dependencies are installed with `pip install --user -r requirements.txt` (handled by the
startup update script). Packages land in `~/.local`.
- The Jupyter CLIs (`jupyter`, `jupyter-lab`) live in `~/.local/bin`, which is **not on PATH**
by default. Run `export PATH="$HOME/.local/bin:$PATH"` once per shell before using them, or invoke
them as modules via `python3 -m jupyterlab` / `python3 -m jupyter`.
- Run JupyterLab with:
`jupyter lab --no-browser --ip=0.0.0.0 --port=8888 --ServerApp.token=devtoken`
then open `http://localhost:8888/lab?token=devtoken`.
- Execute a notebook headlessly (good for CI-style checks / quick validation):
`jupyter nbconvert --to notebook --execute --ExecutePreprocessor.timeout=120 <file>.ipynb`
- `requirements.txt` intentionally **omits** the Colab-only pieces that cannot run outside
Colab: `google.colab`, `PyDrive`, `oauth2client`, `google_drive_downloader`, and the GitHub
research packages `easy_transformer` / `pysvelte`. Cells that import those (used only to pull
input CSVs from Google Drive, or for the interpretability visualizations) will fail locally;
this is expected. Supply the input data files locally to run the data-dependent cells.
- The notebooks were written against older library versions than the ones installed here; some legacy
imports (e.g. `from sklearn.externals.six import StringIO` in `CIS545FinalProject.ipynb`) have been
removed from current scikit-learn and fail at import time. This is a notebook code issue, not an environment problem.
- No GPU is available; PyTorch is the CPU build (its version string carries a `+cpu` suffix). The
Transformer notebook runs on CPU, but slowly.
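
Because the Colab-only imports listed above are expected to fail locally, it can help to locate the affected cells before executing a notebook. The following is a minimal sketch, not part of this repository: the helper name `find_colab_only_cells` is hypothetical, and the import names (for example `pydrive` for PyDrive) are assumed from the package names above. It only inspects the first module on each `import` / `from` line, so `import torch, pysvelte` would not be flagged.

```python
import json
import re

# Modules AGENTS.md describes as unavailable outside Colab.
# Import names are an assumption derived from the package names.
COLAB_ONLY_MODULES = (
    "google.colab",
    "pydrive",
    "oauth2client",
    "google_drive_downloader",
    "easy_transformer",
    "pysvelte",
)

# Captures the first dotted module name on an `import` or `from` line.
_IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_][\w.]*)", re.MULTILINE)


def find_colab_only_cells(notebook: dict) -> list[tuple[int, str]]:
    """Return (cell_index, module) pairs for code cells importing Colab-only modules."""
    hits = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        for match in _IMPORT_RE.finditer(source):
            module = match.group(1).lower()
            if any(module == m or module.startswith(m + ".") for m in COLAB_ONLY_MODULES):
                hits.append((index, match.group(1)))
    return hits


def scan_notebook_file(path: str) -> list[tuple[int, str]]:
    """Load an .ipynb file (plain JSON) and scan it."""
    with open(path, encoding="utf-8") as handle:
        return find_colab_only_cells(json.load(handle))
```

For example, `scan_notebook_file("WAFChallenge.ipynb")` would return the indices of the cells to skip or rewrite once the input CSV is supplied locally.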
34 changes: 34 additions & 0 deletions requirements.txt
@@ -0,0 +1,34 @@
# Python dependencies for the Jupyter notebooks in this repository.
# These notebooks were authored for Google Colab; this file lets you run the
# scientific/ML portions locally in JupyterLab. Google Colab-only pieces
# (google.colab, PyDrive, Google Drive data ingestion) are intentionally
# omitted since they only work inside Colab. See AGENTS.md for details.

# Use CPU-only PyTorch wheels (no GPU in the dev environment).
--extra-index-url https://download.pytorch.org/whl/cpu

# Jupyter
jupyterlab
notebook
ipykernel

# Data-science core (all notebooks)
numpy
pandas
matplotlib
seaborn
scikit-learn

# Geospatial / decision-tree viz (CIS545FinalProject.ipynb)
geopandas
shapely
pydotplus

# Transformer exercise (Transformer_Exercise.ipynb)
torch
transformers
datasets
einops
fancy_einsum
tqdm
plotly