I co-founded and solo-engineered UniMate in my second year of computer science. It aggregated and processed student-event data from a variety of sources, giving students one place to see information about social activities at Sydney universities. It used a distributed proxy service for web scraping and a GPT text classifier to sort event data into categories, which we served to users. The system served over 1,500 users, validating our product, before the university released a competing product. Unfortunately, they did not mention this work when we contacted them. Nonetheless, this was a very educational project: it gave me a footing in rapid engineering under competing business pressures, reinforced by the accountability of live users.
This repo is a sanitised version of the final commit of UniMate's MVP. Development finished when we wound up the business.
You may also wish to read the retrospective for more information on the engineering process and its lessons.
- Scrapes club-event data from a variety of sources using Infatica proxies.
- Parses and regularises HTML data.
- Classifies text into bins for user filtering using the OpenAI API.
- Dumps processed data to CSV and binary for partial restarts.
- Terminal frontend for ease of use.
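The dual CSV-and-binary dump can be sketched as follows. This is a minimal illustration, not the code in `Output.py`; the record fields, file names, and `out/` directory are hypothetical.

```python
import csv
import pickle
from pathlib import Path

# Hypothetical event records; the real field names in Output.py may differ.
events = [
    {"title": "Trivia Night", "club": "CompSoc", "category": "social"},
    {"title": "Careers Panel", "club": "EngSoc", "category": "careers"},
]

out_dir = Path("out")
out_dir.mkdir(exist_ok=True)

# CSV for human inspection and downstream use.
with open(out_dir / "events.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "club", "category"])
    writer.writeheader()
    writer.writerows(events)

# Binary snapshot so an interrupted run can resume without re-scraping.
with open(out_dir / "events.pkl", "wb") as f:
    pickle.dump(events, f)

# On a partial restart, load the snapshot instead of scraping again.
with open(out_dir / "events.pkl", "rb") as f:
    restored = pickle.load(f)
```

The binary snapshot preserves full Python objects for resuming, while the CSV stays readable by anything downstream.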
scraper/
├── Scraper.py — Entry point for scraping club and event pages
├── Scripts.py — Extracts data from dynamically loaded web content
└── InfaticaRequests.py — Calls the proxies
processor/
└── Tagger.py — Classifies event descriptions using the OpenAI API
utils/
├── Parser.py — Methods for parsing HTML
└── Utils.py — Date handling and URL validation
io/
├── Filesystem.py — Centralised handling of file structure
├── Input.py — Reads from saved files and dumps
└── Output.py — Writes CSV and binaries to disk
config/
├── Config.py — Prompts and constants
└── Env.py
core/
├── Interface.py — Logic for the command-line interface
└── Debug.py — Test methods
Note that this was an MVP and it was designed for speed rather than safety or reliability. Proceed at your own risk.
- Clone the repo
- Set your `infatica_api_key` in `Config.py`
- Run `Interface.py` to launch the CLI

Dependencies:
- Python 3.9+
- OpenAI API key (for `Tagger.py`)
- A valid Infatica account if testing live scraping
- Pipeline - Modules include Input, Scraper, Processor, and Output, chained in sequence
- Strategy - GPT binary classifiers called using a common interface
- Factory - Centralised filesystem manager is a path-building factory
- Controller - CLI decides pipeline construction
- Observer-like logging
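The Strategy pattern above can be sketched as a set of binary classifiers behind one interface. This is an illustration, not the code in `Tagger.py`: the class, tags, and prompts are hypothetical, and `classify()` is stubbed with a keyword check so the example runs offline, where the real version would call the OpenAI API.

```python
from dataclasses import dataclass

@dataclass
class BinaryTagger:
    """One yes/no classifier; all taggers share the classify() interface."""
    tag: str
    prompt: str  # in the real system, sent to GPT with the description

    def classify(self, description: str) -> bool:
        # Stand-in for an OpenAI call that answers yes or no.
        return self.tag in description.lower()

# The pipeline can iterate any set of strategies without knowing their internals.
TAGGERS = [
    BinaryTagger("social", "Is this a social event? Answer yes or no."),
    BinaryTagger("careers", "Is this a careers event? Answer yes or no."),
]

def tag_event(description: str) -> list[str]:
    """Collect every tag whose classifier accepts the description."""
    return [t.tag for t in TAGGERS if t.classify(description)]
```

Adding a new category is then just appending another `BinaryTagger`, with no change to the pipeline code.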
| Decision | Rationale | Cost |
|---|---|---|
| No Selenium | It didn't work well on our target sites | More fragile |
| No database | Fast iteration, simple output | Harder to scale and query |
| OpenAI tagging via GPT | Fast to implement and worked | Could get expensive |
| Single-threaded scraping | Easier to implement | Impracticably slow |
| No automatic tests | Faster new code | Slower refactoring of old code |
- Better logging and more exception handling
- Async scraping
- Unit and integration tests
- Replace CSV-based workflows with a database
- Use a minimal backend like FastAPI to handle automatic upload to the frontend
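The async-scraping item above could look roughly like this with `asyncio`: bounded-concurrency fetches replacing the single-threaded loop. The `fetch()` here is simulated; a real version would issue HTTP requests through the Infatica proxy, and the URLs and concurrency limit are illustrative.

```python
import asyncio

async def fetch(url: str, sem: asyncio.Semaphore) -> str:
    async with sem:  # cap concurrent requests, e.g. to respect proxy limits
        await asyncio.sleep(0.01)  # stands in for network latency
        return f"<html>{url}</html>"  # stands in for a real response body

async def scrape_all(urls: list[str], limit: int = 5) -> list[str]:
    sem = asyncio.Semaphore(limit)
    # Launch all fetches at once; the semaphore keeps only `limit` in flight.
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

pages = asyncio.run(
    scrape_all([f"https://example.com/club/{i}" for i in range(20)])
)
```

With network latency dominating scrape time, this alone would address the "impracticably slow" cost noted in the table above.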
Stephen Elliott
[Sydney, 2023]