Cloud-Native Lakehouse for Valorant Competitive Data
VLR Analytics implements a Medallion (Bronze–Silver–Gold) architecture on Google Cloud Platform to ingest, process, and analyze competitive Valorant statistics from VLR.GG. The system processes 500K–1M player performance records across competitive events, with a full pipeline running end-to-end in under 40 minutes — from raw scrape to analytics-ready Gold tables in BigQuery.
Insights are surfaced through interactive Looker Studio dashboards covering player performance, agent meta trends, and map-level statistics — built for analysts and coaches who need fast, reliable access to competitive data. Jump to Dashboards →
- Architecture
- Technology Stack
- Processing Strategy
- Data Architecture
- Pipeline Performance
- Orchestration Design
- Infrastructure
- Monitoring & Observability
- Analytics & Dashboards
- Repository Structure
| Component | Service |
|---|---|
| Orchestration | Cloud Composer (Airflow) |
| Scraping Compute | Cloud Run |
| Distributed Processing | Google Cloud Dataproc (Apache Spark Serverless) |
| Data Lake Storage | Google Cloud Storage |
| Data Warehouse | BigQuery |
| Visualization | Looker Studio |
| Messaging | Google Cloud Pub/Sub |
| Metadata Store | Cloud SQL (PostgreSQL) |
| Infrastructure as Code | Terraform |
Historical events are processed in bulk across all scraped events. Airflow dispatches the batch Cloud Run scraper jobs, and the Silver and Gold transformations run after ingestion completes. This ensures a complete historical dataset exists before incremental scheduling begins.
See: VLR Stats Scrapper Batch README.md
Airflow triggers the ingestion DAG every 15 days. New event metadata is queried from PostgreSQL, and only unprocessed events are dispatched to Cloud Run. Downstream Spark transformations run after completion signals via Pub/Sub.
This design supports both large-scale historical ingestion and lightweight incremental updates without reprocessing existing data.
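The incremental dispatch step can be sketched as a simple set difference between discovered events and events already marked ingested in the PostgreSQL metadata store. The function and table names below are illustrative, not the pipeline's actual identifiers:

```python
# Sketch of the incremental dispatch filter, assuming a metadata table
# like event_ingestion_status(event_id, status) in Cloud SQL (hypothetical
# names). Only events with no 'ingested' status row are sent to Cloud Run.

def pending_events(all_event_ids, ingested_ids):
    """Return event IDs that still need a Cloud Run scrape job."""
    ingested = set(ingested_ids)
    return [e for e in all_event_ids if e not in ingested]

# In the DAG, all_event_ids would come from VLR.GG event discovery, and
# ingested_ids from a query such as:
#   SELECT event_id FROM event_ingestion_status WHERE status = 'ingested';
```

Because the filter is driven by persisted status rather than run history, reruns of the DAG are naturally idempotent.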
Service: Cloud Run | Format: CSV | Storage: Cloud Storage
- Scrapes player statistics across competitive events from VLR.GG
- Each Cloud Run job completes in under 5 minutes per event
- Writes partitioned raw CSV data to Cloud Storage
- Emits a Pub/Sub completion event upon success
- Stores ingestion status metadata in PostgreSQL for idempotent reruns
Partitioning scheme:

```
gs://vlr-data-lake/bronze/
  event_id={event_id}/
    region={region}/
      map={map}/
        agent={agent}/
          snapshot_date={date}/data.csv
```
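A minimal helper showing how an object path under this scheme is assembled (the bucket name and field order mirror the layout above; the values passed in are examples):

```python
# Build a Bronze object path following the partitioning scheme above.
# Purely illustrative; the scraper's actual path-building code may differ.

def bronze_path(event_id, region, map_name, agent, snapshot_date):
    return (
        "gs://vlr-data-lake/bronze/"
        f"event_id={event_id}/region={region}/map={map_name}/"
        f"agent={agent}/snapshot_date={snapshot_date}/data.csv"
    )
```

Hive-style `key=value` directories let Spark and BigQuery prune partitions when reading only a subset of events, regions, or dates.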
Service: Dataproc (Spark) | Format: Parquet
Transformations applied to 500K–1M raw records:
- Schema enforcement and type casting
- Deduplication on composite key: `(player_id, snapshot_date, agent, map, event_id, region)`
- Percentage normalization and clutch ratio parsing
- Null handling and string normalization
Data is written as partitioned Parquet, optimized for downstream analytical reads.
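The deduplication rule keeps one row per composite key; in Spark this maps to a `dropDuplicates` on those columns. The same logic is sketched here in plain Python for illustration:

```python
# Plain-Python sketch of the Silver dedup rule. In the actual pipeline this
# runs in Spark (e.g. df.dropDuplicates([...]) over the same composite key).
DEDUP_KEY = ("player_id", "snapshot_date", "agent", "map", "event_id", "region")

def deduplicate(rows):
    """Keep the first row seen for each composite key."""
    seen = {}
    for row in rows:
        key = tuple(row[k] for k in DEDUP_KEY)
        seen.setdefault(key, row)
    return list(seen.values())
```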
Service: Dataproc (Spark) | Format: Parquet
Aggregated datasets produced:
- Player career statistics
- Agent performance distribution
- Map-level performance metrics
- Team-level win rates
- Event-level summaries
Silver and Gold Spark transformations run sequentially and complete within 15–20 minutes combined, triggered automatically via Pub/Sub after Bronze ingestion confirms success. Gold tables are then loaded into BigQuery for SQL-based analysis and dashboard consumption.
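As an illustration of what a Gold aggregation computes, here is the "player career statistics" rollup sketched in plain Python (the pipeline does this in Spark, and the field names here are examples rather than the actual Gold schema):

```python
# Illustrative career-stats aggregation: total kills and average rating per
# player across all events. Field names are assumptions for the sketch.
from collections import defaultdict

def player_career_stats(rows):
    acc = defaultdict(lambda: {"kills": 0, "rating_sum": 0.0, "n": 0})
    for r in rows:
        s = acc[r["player_id"]]
        s["kills"] += r["kills"]
        s["rating_sum"] += r["rating"]
        s["n"] += 1
    return {
        pid: {"total_kills": s["kills"], "avg_rating": s["rating_sum"] / s["n"]}
        for pid, s in acc.items()
    }
```

In Spark this is a `groupBy("player_id")` with `sum` and `avg` aggregates, written out as a partitioned Parquet Gold table.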
| Stage | Duration |
|---|---|
| Cloud Run scrape (per event) | < 2 minutes |
| Spark Silver + Gold transforms | 15–20 minutes |
| Full Airflow DAG (end-to-end) | 30–40 minutes |
| Records processed | 500K–1M rows |
| Events ingested | ~100 events |
Two Airflow DAGs coordinate the full pipeline:

**Ingestion DAG**
- Runs every 15 days
- Queries PostgreSQL metadata for pending events
- Dispatches Cloud Run jobs in parallel
- Handles retries and enforces idempotency via event status tracking

**Transformation DAG**
- Triggered via Pub/Sub on ingestion completion
- Marks the event as ingested in PostgreSQL
- Triggers the Silver Spark transformation
- Triggers the Gold aggregation after the Silver success signal
This event-driven pattern decouples ingestion from transformation, allowing each layer to scale independently.
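The handoff between layers can be sketched as a small subscriber-side handler. The message schema below (`{"event_id": ..., "status": ...}`) and job name are hypothetical, used only to show the decision the transformation DAG makes on each completion message:

```python
# Sketch of the Pub/Sub handoff between Bronze ingestion and the Silver
# transform. Message schema and job names are assumptions for illustration.
import json

def handle_completion(message_bytes):
    """Return the Spark job to trigger, or None if ingestion failed."""
    msg = json.loads(message_bytes)
    if msg.get("status") != "success":
        return None  # leave the event pending; Airflow retries the scrape
    # (The real handler would also UPDATE the event's status row in PostgreSQL.)
    return {"job": "vlr-silver-transform", "event_id": msg["event_id"]}
```

Because each stage only reacts to completion messages, a slow or failed scrape never blocks transforms for other events.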
Provisioned entirely with Terraform and fully reproducible from a single `terraform apply`:
- Cloud Storage buckets (Bronze, Silver, Gold)
- Cloud Run Jobs (scraping)
- Dataproc cluster (Spark serverless)
- Pub/Sub topics and subscriptions
- Cloud SQL instance (PostgreSQL metadata store)
- Cloud Composer environment (Airflow)
- Artifact Registry (container images)
- Cloud Logging integrated across Cloud Run, Dataproc, and Composer
- Pub/Sub-based job completion tracking between pipeline stages
- PostgreSQL metadata table tracks per-event ingestion status
- Spark job monitoring via Dataproc UI
- Parquet tables loaded from Gold layer
- Structured analytics datasets available for ad-hoc SQL queries
- Optimized partitioning for dashboard query performance
| Dashboard | Live Link | PDF Backup |
|---|---|---|
| Player Performance | Open | |
| Agent Meta | Open | |
| Map Stats | Open | |
PDF backups are available if live dashboards are unavailable due to budget limits.
Gold Layer Exploration Notebook
```
.
├── airflow/                       # DAGs & orchestration logic
├── functions/
│   ├── vlr-stats-scrapper-batch/  # Bronze layer — initial backfill (Cloud Run)
│   ├── vlr-stats-scrapper/        # Bronze layer — incremental (Cloud Run)
│   ├── vlr-silver-transform/      # Silver layer (Spark)
│   └── vlr-gold-transform/        # Gold layer (Spark)
├── terraform/                     # Infrastructure as Code
├── notebooks/                     # Exploratory analysis
└── README.md
```
