Meta Production Engineering Hackathon


This is the most scalable, reliable, and guaranteed-to-wake-up-the-on-call-engineer URL shortener of all time. Brought to you by 4 students from Canada: 2 from Waterloo and 2 from Concordia.

Getting Started

Prerequisites

  • Docker & Docker Compose
  • Node.js 20+ (for local frontend dev)
  • Python 3.11+ with uv package manager

Initial Setup

These setup instructions assume you remain in the directory indicated by each step.

  1. Clone this repository:
git clone <repository-url>
cd MetaHackathon # remain in this directory
  2. Create a .env file:
touch .env
  3. Start the full stack with Docker:
docker compose up --build -d
  4. Verify services are running.

To log in to Grafana, use admin as both the username and the password.


Architecture

System Architecture Diagram


Endpoints

Health Checks

  • GET /health - Service health status
  • GET /health/live - Liveness probe
  • GET /health/ready - Readiness probe

Links API

  • POST /links/create - Create a short URL
  • GET /links/<id> - Retrieve link metadata

Authentication

  • POST /auth/register - Register a new user
  • POST /auth/login - User login
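
The exact request/response schema for POST /links/create isn't documented here, but a handler behind it might validate the target URL and mint a short code roughly like this sketch (the field names url and id, and the 7-character code length, are assumptions for illustration):

```python
import json
import secrets
from urllib.parse import urlparse

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def create_link(payload: str) -> dict:
    """Validate the request body and return metadata for a new short link."""
    body = json.loads(payload)
    target = body.get("url", "")
    parsed = urlparse(target)
    # Reject anything that is not an absolute http(s) URL.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return {"error": "invalid url", "status": 400}
    # Generate a random 7-character Base62 code (see Decision Log: Encoding Mechanism).
    code = "".join(secrets.choice(ALPHABET) for _ in range(7))
    return {"id": code, "url": target, "status": 201}
```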

Deploy Guide

DigitalOcean Deploy

  1. Provision 3 DigitalOcean droplets:

    • App droplet 1 (backend + nginx)
    • App droplet 2 (backend + nginx)
    • Observability droplet (Prometheus, Grafana, Loki, Alertmanager, OTel)
  2. On each droplet:

    • Install Docker Engine + Docker Compose plugin
    • Open required firewall ports (80 for app, 3000/9090/9093/3100/4318/8889 for observability)
    • Clone this repository to ~/MetaHackathon
  3. Set up managed services in DigitalOcean:

    • Managed PostgreSQL cluster
    • Managed Redis instance
  4. Add the droplet IPs and service configuration to your deployment environment (DO_HOST_1, DO_HOST_2, DO_OBSERVABILITY, DB/Redis values).

  5. Push to main:

git checkout main
git pull
git push
  6. The CI/CD workflow will automatically:

    • Deploy app stack to DO_HOST_1 and DO_HOST_2 with docker-compose.prod.yml
    • Run database migrations (on droplet 1)
    • Deploy observability stack to DO_OBSERVABILITY with docker-compose.observability.prod.yml
    • Full deploy logic lives in .github/workflows/ci-cd.yml under the deploy job
  7. Validate after deploy:

    • App health: http://<DO_HOST_1>/health/live and http://<DO_HOST_2>/health/live
    • Prometheus targets: http://<DO_OBSERVABILITY>:9090/targets
    • Grafana: http://<DO_OBSERVABILITY>:3000

Troubleshooting

Common Issues

Services won't start

  • Check Docker is running
  • Verify .env variables are set correctly
  • Review logs: docker compose logs -f <service>

Database connection errors

  • Ensure PostgreSQL is healthy: docker compose ps db
  • Check connection string in .env

Prometheus has no data

  • Check target status at http://<DO_OBSERVABILITY>:9090/targets
  • Verify OTEL_EXPORTER_OTLP_ENDPOINT in .env points to the collector
  • Review collector logs: docker compose logs -f otel

Advanced Debugging

Debugging issues, especially at runtime, is made easier by the detailed logs we included in the application. These log files are generated locally in ./logs/app* for each instance of the server.

Each log entry includes the following fields:

  • ts: time log was recorded
  • level: level of importance
  • logger: the application instance that logged the entry
  • event: event recorded in the application such as request_completed
  • service: which service the log occurred in
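
The real formatter lives in init.py; as a minimal sketch of the shape it produces (the "backend" default for service is an assumption):

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with the fields listed above."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            # `event` and `service` come from the `extra` dict if provided.
            "event": getattr(record, "event", record.getMessage()),
            "service": getattr(record, "service", "backend"),  # assumed default
        }
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app-1")
logger.addHandler(handler)
logger.warning("request_completed", extra={"event": "request_completed", "service": "links"})
```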

Log entries can include additional fields that turn them into traces, such as:

  • endpoint: The endpoint the log occurred on
  • user_id: Which user caused the log
  • method: The HTTP method (POST, GET, etc.)
  • . . . (more defined in init.py in the JsonFormatter class)

So, if a bug occurs, a good first step is to look through the logs around the time it occurred for a WARNING or ERROR entry.
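
That first triage step can be sketched as a small filter over the JSON log lines (a simplification; in practice you would run the equivalent query in Loki):

```python
import json

def find_problem_entries(lines, levels=("WARNING", "ERROR")):
    """Return parsed log entries whose level suggests a problem."""
    hits = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines
        if entry.get("level") in levels:
            hits.append(entry)
    return hits
```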

Users can consult the log files by SSHing into the machine running the application, or by opening the Explore tab at localhost:3001 and querying logs with Loki. In the worst case, if even localhost:3001 is inaccessible, Loki ships logs to Amazon S3 for long-term storage, so they can be retrieved from the cloud.

3 Example Bugs

  1. Caching Problem
    One example problem came up during scalability testing, where too many cache misses forced us to visit the main Postgres database. Thanks to our test logs, which displayed the cache hit and miss percentages, we were able to decrease the miss rate from 30% to 15%.

  2. Latency Problem
    When stress testing our architecture, we originally had a single instance, which caused a lot of latency under heavy request load. We knew that one way to decrease latency under load would be to scale horizontally, so we set up tests tracking latency with 1, 2, 3, 4, and 5 instances; 4 instances performed best. Our metrics let us track and confirm this, allowing us to implement a robust solution.

  3. Malformed Data
    One of the problems we had was malformed data producing a lone error message. When fixing that problem we didn't yet have logging, but now we get warnings in our logs. We found the problem by adding unit tests, but having logs earlier would have helped pinpoint the issue faster.
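
The hit/miss tracking behind bug 1 can be illustrated with a cache-aside lookup that keeps counters (a sketch using a dict as a stand-in for Redis and a callable as a stand-in for the Postgres lookup):

```python
class CacheStats:
    """Cache-aside link lookup that tracks hit/miss rates, as our test logs did."""
    def __init__(self):
        self.cache = {}   # stand-in for Redis
        self.hits = 0
        self.misses = 0

    def get_link(self, code, db_lookup):
        if code in self.cache:
            self.hits += 1
            return self.cache[code]
        self.misses += 1
        value = db_lookup(code)       # fall back to the main database
        self.cache[code] = value      # populate the cache for next time
        return value

    def miss_rate(self):
        total = self.hits + self.misses
        return self.misses / total if total else 0.0
```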


Config

Environment Variables

| Variable | Example | Description |
| --- | --- | --- |
| DATABASE_NAME | hackathon_db | PostgreSQL database name |
| DATABASE_HOST | db | PostgreSQL host inside Docker network |
| DATABASE_PORT | 5432 | PostgreSQL port |
| DATABASE_USER | postgres | DB user |
| DATABASE_PASSWORD | postgres | DB password |
| REDIS_URL | redis://redis:6379 | Redis connection |
| SECRET_KEY | random_secret_key | Flask secret key |
| LOG_LEVEL | INFO | Application log verbosity |
| LOG_FILE_PATH | /app/logs/app-1.log | Per-instance app log file path |
| LOG_FILE_MAX_BYTES | 10485760 | Max size for a single rotated log file |
| LOG_FILE_BACKUP_COUNT | 5 | Number of rotated log files to retain |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://otel:4318 | OpenTelemetry collector endpoint |
| FLASK_HOST | 0.0.0.0 | Flask bind address (if running directly) |
| FLASK_PORT | 5000 | Flask port (if running directly) |
| FLASK_DEBUG | false | Flask debug mode toggle |
| SMTP_SMARTHOST | smtp.gmail.com:587 | SMTP relay for Alertmanager |
| SMTP_FROM | alerts@example.com | Alert sender email |
| SMTP_AUTH_USERNAME | smtp-user | SMTP auth username |
| SMTP_AUTH_PASSWORD | smtp-password | SMTP auth password |
| ALERT_EMAIL_TO | oncall@example.com | Alert recipient email |
| ALERT_SMS_TO | +15551234567 | Alert recipient phone (if SMS integration is configured) |
| DISCORD_WEBHOOK_URL | https://discord.com/api/webhooks/... | Discord webhook for alerts |
| S3_KEY | AKIA... | AWS access key for Loki object storage |
| SECRET_S3_KEY | *** | AWS secret key for Loki object storage |
| AWS_REGION | us-east-1 | AWS region for Loki S3 bucket |
| LOKI_S3_BUCKET | metahackathon-loki-logs | S3 bucket used by Loki for long-term log storage |
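
A sketch of how the application might assemble its connection settings from these variables; the defaults mirror the Example column, and the actual config module may differ:

```python
import os

def load_config(env=os.environ):
    """Build connection settings from the environment, with the table's example values as defaults."""
    return {
        "database_url": "postgresql://{u}:{p}@{h}:{port}/{db}".format(
            u=env.get("DATABASE_USER", "postgres"),
            p=env.get("DATABASE_PASSWORD", "postgres"),
            h=env.get("DATABASE_HOST", "db"),
            port=env.get("DATABASE_PORT", "5432"),
            db=env.get("DATABASE_NAME", "hackathon_db"),
        ),
        "redis_url": env.get("REDIS_URL", "redis://redis:6379"),
        "log_level": env.get("LOG_LEVEL", "INFO"),
    }
```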

Runbooks


Decision Log

For our decisions, we balanced what was best for the scope of the hackathon against the ability to scale further in the future.

Encoding Mechanism

We had different options for shortening the URL:

  • MD5
  • XOR hashing + Base62
  • Base62 (option chosen)

We decided on Base62 encoding because it was the most efficient of our choices. XOR hashing would have been similar in efficiency, and it would guarantee a uniquely generated URL, whereas plain Base62 does not. That makes XOR hashing + Base62 sound more advantageous; however, we would need to XOR against an 8-byte number generated by our database, requiring at minimum one more database access per request and increasing overhead. Thus, we discarded that option.
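The encoding itself can be sketched as repeated division by 62 over a digit alphabet (a minimal illustration; the alphabet ordering is an assumption, not necessarily the one used in the codebase):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)       # peel off the least significant base-62 digit
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))
```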

User verification

To verify users and optimize efficiency we had 2 options:

  • Use session management (option chosen)
  • Use JWT

JWT is more efficient because it is stateless and doesn't require a database check on every request. This is really nice, but it comes with challenges: revoking tokens on logout is hard and there are security concerns, which would require maintaining more information and increase complexity. We chose session management, caching logged-in users in Redis to avoid hitting the main database on every request.
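
The session lookup with a Redis cache in front of the database can be sketched like this (dicts stand in for Redis and Postgres; the actual session code is not shown in this README):

```python
class SessionStore:
    """Session check with a cache in front of the database."""
    def __init__(self, db_sessions):
        self.db = db_sessions   # authoritative session table (Postgres in production)
        self.cache = {}         # hot sessions (Redis in production)

    def current_user(self, session_id):
        if session_id in self.cache:
            return self.cache[session_id]    # no database hit for cached sessions
        user = self.db.get(session_id)       # fall back to the main database
        if user is not None:
            self.cache[session_id] = user
        return user

    def logout(self, session_id):
        # Unlike JWT, revocation is immediate: drop the session everywhere.
        self.cache.pop(session_id, None)
        self.db.pop(session_id, None)
```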

Duplicate data

If users try to shorten the same website multiple times, or if a website was already shortened, should we generate a new code?

Since users can modify what URL a generated code points to, we decided that yes, we would generate a new code each time; there's no point in even checking the database for existing instances, which also saves time. The endpoint simply generates a new code and allows duplication if the user causes it.

Log Files

We wanted to have persistent log storage integrated with our observability stack. So, we decided to use Loki. We could have used different technologies for log aggregation such as:

  • OpenSearch
  • Graylog
  • ELK stack

We chose Loki because it is the most natural complement to Grafana, with direct integration. To use Loki easily with Grafana, we use Promtail, which collects the logs from our folders and feeds them to Loki. Promtail does more than that, though: it allows smooth log aggregation across different containers, adding metadata such as exactly which service generated the log.
Another advantage of Loki is that, with our JSON logs, we can run queries across every single container, so we can check for errors across ALL nodes easily.

Observability and Telemetry

We decided to use Prometheus alongside OpenTelemetry because they are very lightweight and easy to integrate into our application. To collect hardware metrics such as CPU usage, I/O operations, and RAM usage, we used process-exporter and node-exporter (exposed on the /metrics endpoint), which pair well with Prometheus. We could have used other options like:

  • Grafana Alloy
  • Telegraf
  • Datadog Agent

We chose process-exporter and node-exporter because they have great support for Grafana as well as good options for dashboards.
OpenTelemetry was a good fit for tracking metrics specific to our internal application (like latency). Instead of exposing an endpoint for Prometheus to scrape, OpenTelemetry hosts its own collector, to which our server sends batches of data (so we avoid sending traces on every single request, which would add a ton of overhead). This is a good option for growing our distributed application, scaling even further, and adding pre-processing to our app traces.
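
The batching idea can be illustrated with a stripped-down exporter that buffers records and flushes them in groups; in practice the OpenTelemetry SDK's batch processor handles this, so this sketch is only for intuition:

```python
class BatchExporter:
    """Buffer telemetry records and send them in batches instead of per request."""
    def __init__(self, send, batch_size=50):
        self.send = send              # e.g. an HTTP POST to the collector endpoint
        self.batch_size = batch_size
        self.buffer = []

    def record(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(list(self.buffer))   # ship a copy of the batch
            self.buffer.clear()
```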

Databases

For caching, we used Redis because it is the industry standard for in-memory caches and is extremely lightweight. It has a very well documented API for integration with Python, so Redis was the natural choice for scalability. Other options could have been:

  • Memcached
  • Dragonfly

We used PostgreSQL because, for the scope of the application, SQL would likely outperform NoSQL thanks to its fast index lookups.

Digital Ocean Scalability

On DigitalOcean, we decided to use droplets to run our application because multiple machines let us scale our servers horizontally. Our entire design was carefully crafted with the end goal of running a large-scale distributed system.

To accommodate running multiple droplets, we used Postgres and Redis instances managed by DigitalOcean. If we hosted our own, we would have had to avoid running a new instance on each droplet, and creating backup databases with automatic promotion would have been unrealistic to manage alone. DigitalOcean's managed offering was perfectly suited to our needs: even if a database instance fails, our application can keep going while it restarts, because standby databases are always ready.

We run 2 droplets, each with 4 instances of our application. We chose 4 instances per droplet after performance testing (see 3 Example Bugs, point 2). We chose 2 droplets to distribute server load across multiple machines and to have a fail-safe if one machine ever fails.


Capacity Plan

Current Limits

  • Sustained throughput: 500+ req/s
  • Concurrent load ceiling: ~7000 concurrent users
  • Bottleneck: Database saturation and host CPU/network on current 2 droplets
  • Primary optimizations in place: Redis caching (5–60 min TTLs), Nginx keepalive + gzip, 4 Gunicorn workers per droplet

Scale Path

  1. Horizontal scaling: Add more droplets and distribute app instances before upgrading managed PostgreSQL/Redis.
  2. Sharding (if reaching ~50k+ users): Partition user/URL/event data by user_id ranges across multiple PostgreSQL instances to unlock beyond single-database limits.

License

This project is licensed under the MIT License. See LICENSE.
