This repository was archived by the owner on Mar 24, 2026. It is now read-only.

UniMate Backend (MVP)

I co-founded and solo-engineered UniMate in my second year of computer science. It aggregated and processed student event data from a variety of sources, giving students one place to see information about social activities at Sydney universities. It used a distributed proxy service for web scraping and a GPT text classifier to sort event data into categories, which we served to users. The system served over 1,500 users, validating our product, before the university released a competing product; unfortunately, they didn't mention that work when we contacted them. Nonetheless, it was a very educational project: it gave me a footing in rapid engineering under competing business pressures, reinforced by the pressure of live users.

This repo is a sanitised version of the final commit of UniMate's MVP. Development finished when we wound up the business.

You may also wish to read the retrospective for more information on the engineering process and its lessons.


What It Does

  • Scrapes data on club events from a variety of sources using Infatica proxies.
  • Parses and regularises HTML data.
  • Classifies text into bins for user filtering using the OpenAI API.
  • Dumps processed data to CSV and binary for partial restarts.
  • Provides a terminal frontend for ease of use.
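The dump-and-restart step could look something like the sketch below. This is a minimal illustration, not the repo's actual API: the function names, field names, and file layout are all hypothetical.

```python
import csv
import pickle
from pathlib import Path

# Hypothetical sketch of a CSV + binary dump for partial restarts.
# Names and fields are illustrative, not UniMate's actual code.
FIELDS = ["title", "date", "category"]

def save_checkpoint(events: list, stem: Path) -> None:
    """Write events to CSV (human-readable) and pickle (fast restart)."""
    with open(stem.with_suffix(".csv"), "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(events)
    with open(stem.with_suffix(".pkl"), "wb") as f:
        pickle.dump(events, f)

def load_checkpoint(stem: Path) -> list:
    """Restore events from the binary dump to resume a partial run."""
    with open(stem.with_suffix(".pkl"), "rb") as f:
        return pickle.load(f)
```

Writing both formats gives a human-auditable CSV alongside a binary that round-trips Python objects exactly, so an interrupted run can resume without re-scraping.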

Pipeline Architecture

scraper/
├── Scraper.py → Entry point for scraping club and event pages
├── Scripts.py → Extracts data from dynamically loaded web content
├── InfaticaRequests.py → Calls the proxies

processor/
├── Tagger.py → Classifies event descriptions using the OpenAI API

utils/
├── Parser.py → Methods for parsing HTML
├── Utils.py → Date handling and URL validation

io/
├── Filesystem.py → Centralised handling of file structure
├── Input.py → Reads from saved files and dumps
├── Output.py → Writes CSV and binaries to disk

config/
├── Config.py → Prompts and constants
├── Env.py

core/
├── Interface.py → Logic for the command-line interface
├── Debug.py → Test methods
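The flow through these modules can be pictured as a simple chained pipeline, as in the sketch below. The stage names and signature are illustrative assumptions, not the repo's actual interfaces.

```python
from typing import Callable, Iterable

# Illustrative sketch of the module chain (Input -> Scraper -> Processor -> Output).
# Each stage receives the previous stage's output; all names are hypothetical.
Stage = Callable[[list], list]

def run_pipeline(stages: Iterable[Stage], seed: list) -> list:
    """Feed each stage's output into the next, as a CLI controller might."""
    data = seed
    for stage in stages:
        data = stage(data)
    return data

# e.g. run_pipeline([scrape, parse, classify, write_out], club_urls)
```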


Setup

Note that this was an MVP and it was designed for speed rather than safety or reliability. Proceed at your own risk.

  1. Clone the repo
  2. Set your infatica_api_key in Config.py
  3. Run Interface.py to launch the CLI
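Config.py's real contents aren't shown here, but step 2 implies something like the sketch below. The constant names and prompt wording are guesses for illustration only.

```python
import os

# Hypothetical sketch of the kind of constants Config.py holds.
# Hard-coding the key is the MVP shortcut; reading the environment
# is the safer fallback shown here.
infatica_api_key = os.environ.get("INFATICA_API_KEY", "YOUR_KEY_HERE")

# Illustrative prompt template for the GPT tagger.
TAGGER_PROMPT = (
    "Does this event description fit the category '{category}'? "
    "Answer yes or no."
)
```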

Dependencies:

  • Python 3.9+
  • OpenAI API key (for Tagger.py)
  • A valid Infatica account if testing live scraping

Design Patterns

  • Pipeline - Modules include Input, Scraper, Processor, and Output, chained in sequence
  • Strategy - GPT binary classifiers called using a common interface
  • Factory - Centralised filesystem manager is a path-building factory
  • Controller - CLI decides pipeline construction
  • Observer-like logging
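The Strategy point might look like this in miniature: classifiers behind one interface, so callers never care which backend runs. The class names, prompt handling, and `tag` helper are all hypothetical.

```python
from typing import Protocol

class BinaryClassifier(Protocol):
    """Common interface every classifier strategy implements."""
    def classify(self, text: str) -> bool: ...

class KeywordClassifier:
    """Cheap local strategy: match against a keyword set."""
    def __init__(self, keywords: set):
        self.keywords = keywords

    def classify(self, text: str) -> bool:
        return any(k in text.lower() for k in self.keywords)

class GPTClassifier:
    """GPT-backed strategy (stubbed; a real version would call the OpenAI API)."""
    def __init__(self, category: str):
        self.category = category

    def classify(self, text: str) -> bool:
        raise NotImplementedError("wire up the OpenAI client here")

def tag(event: str, strategies: dict) -> list:
    """Apply every named strategy; the caller never sees which backend ran."""
    return [name for name, s in strategies.items() if s.classify(event)]
```

Swapping `KeywordClassifier` for `GPTClassifier` changes nothing downstream, which is the point of the pattern.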

📉 Tradeoffs

| Decision | Rationale | Cost |
| --- | --- | --- |
| No Selenium | It didn't work well on our target sites | More fragile scraping |
| No database | Fast iteration, simple output | Harder to scale and query |
| OpenAI tagging via GPT | Fast to implement and worked | Could get expensive |
| Single-threaded scraping | Easier to implement | Impracticably slow |
| No automated tests | Faster to write new code | Slower refactoring of old code |

🔄 Future Improvements

  • Better logging and more exception handling
  • Async scraping
  • Unit and integration tests
  • Replace CSV-based workflows with a database
  • Use a minimal backend like FastAPI to handle automatic upload to the frontend
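The async scraping item could look roughly like the sketch below: fetch many pages concurrently with a cap on in-flight requests. `fetch` is a placeholder for a real HTTP call (e.g. via an async HTTP client); nothing here is from the repo.

```python
import asyncio
from typing import Awaitable, Callable

async def scrape_all(urls: list,
                     fetch: Callable[[str], Awaitable[str]],
                     limit: int = 10) -> list:
    """Fetch all URLs concurrently, with at most `limit` requests in flight.

    The semaphore cap matters when routing through a proxy pool,
    which typically throttles concurrent connections per account.
    """
    sem = asyncio.Semaphore(limit)

    async def bounded(url: str) -> str:
        async with sem:
            return await fetch(url)

    # gather preserves input order, so results line up with urls.
    return await asyncio.gather(*(bounded(u) for u in urls))
```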

πŸ§‘β€πŸ’» Author

Stephen Elliott
Sydney, 2023

About

A production, CLI-driven event aggregator built on proxy-based scraping, GPT-based classification, and a modular pipeline design, written for an events aggregator business I co-founded in 2023.
