Cloud-Native Lakehouse for Valorant Competitive Data
VLR Analytics implements a Medallion (Bronze–Silver–Gold) architecture on Google Cloud Platform to ingest, process, and analyze competitive Valorant statistics from VLR.GG. The system processes 500K–1M player performance records across competitive events, with a full pipeline running end-to-end in under 40 minutes — from raw scrape to analytics-ready Gold tables in BigQuery.
Insights are surfaced through interactive Looker Studio dashboards covering player performance, agent meta trends, and map-level statistics — built for analysts and coaches who need fast, reliable access to competitive data. Jump to Dashboards →
- Architecture
- Technology Stack
- Processing Strategy
- Data Architecture
- Pipeline Performance
- Orchestration Design
- Infrastructure
- Monitoring & Observability
- Analytics & Dashboards
- Repository Structure
| Component | Service |
|---|---|
| Orchestration | Cloud Composer (Airflow) |
| Scraping Compute | Cloud Run |
| Distributed Processing | Google Cloud Dataproc (Apache Spark Serverless) |
| Data Lake Storage | Google Cloud Storage |
| Data Warehouse | BigQuery |
| Visualization | Looker Studio |
| Messaging | Google Cloud Pub/Sub |
| Metadata Store | Cloud SQL (PostgreSQL) |
| Infrastructure as Code | Terraform |
Historical events are processed in bulk across all scraped events. Airflow dispatches the batch Cloud Run scraper jobs, and the Silver and Gold transformations run after ingestion completes. This ensures a complete historical dataset exists before incremental scheduling begins.
See: VLR Stats Scrapper Batch README.md
Airflow triggers the ingestion DAG every 15 days. New event metadata is queried from PostgreSQL, and only unprocessed events are dispatched to Cloud Run. Downstream Spark transformations run after completion signals via Pub/Sub.
This design supports both large-scale historical ingestion and lightweight incremental updates without reprocessing existing data.
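The incremental dispatch step can be sketched as a simple set difference between discovered events and events already marked ingested in the PostgreSQL metadata store. The function and table names below are illustrative, not the pipeline's actual identifiers:

```python
# Sketch of the incremental dispatch filter, assuming a metadata table
# like event_ingestion_status(event_id, status) in Cloud SQL (hypothetical
# names). Only events with no 'ingested' status row are sent to Cloud Run.

def pending_events(all_event_ids, ingested_ids):
    """Return event IDs that still need a Cloud Run scrape job."""
    ingested = set(ingested_ids)
    return [e for e in all_event_ids if e not in ingested]

# In the DAG, all_event_ids would come from VLR.GG event discovery, and
# ingested_ids from a query such as:
#   SELECT event_id FROM event_ingestion_status WHERE status = 'ingested';
```

Because the filter is driven by persisted status rather than run history, reruns of the DAG are naturally idempotent.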
Service: Cloud Run | Format: CSV | Storage: Cloud Storage
- Scrapes player statistics across competitive events from VLR.GG
- Each Cloud Run job completes in under 5 minutes per event
- Writes partitioned raw CSV data to Cloud Storage
- Emits a Pub/Sub completion event upon success
- Stores ingestion status metadata in PostgreSQL for idempotent reruns
Partitioning scheme:

```
gs://vlr-data-lake/bronze/
  event_id={event_id}/
    region={region}/
      map={map}/
        agent={agent}/
          snapshot_date={date}/data.csv
```
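A minimal helper showing how an object path under this scheme is assembled (the bucket name and field order mirror the layout above; the values passed in are examples):

```python
# Build a Bronze object path following the partitioning scheme above.
# Purely illustrative; the scraper's actual path-building code may differ.

def bronze_path(event_id, region, map_name, agent, snapshot_date):
    return (
        "gs://vlr-data-lake/bronze/"
        f"event_id={event_id}/region={region}/map={map_name}/"
        f"agent={agent}/snapshot_date={snapshot_date}/data.csv"
    )
```

Hive-style `key=value` directories let Spark and BigQuery prune partitions when reading only a subset of events, regions, or dates.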
Service: Dataproc (Spark) | Format: Parquet
Transformations applied to 500K–1M raw records:
- Schema enforcement and type casting
- Deduplication on composite key: `(player_id, snapshot_date, agent, map, event_id, region)`
- Percentage normalization and clutch ratio parsing
- Null handling and string normalization
Data is written as partitioned Parquet, optimized for downstream analytical reads.
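The deduplication rule keeps one row per composite key; in Spark this maps to a `dropDuplicates` on those columns. The same logic is sketched here in plain Python for illustration:

```python
# Plain-Python sketch of the Silver dedup rule. In the actual pipeline this
# runs in Spark (e.g. df.dropDuplicates([...]) over the same composite key).
DEDUP_KEY = ("player_id", "snapshot_date", "agent", "map", "event_id", "region")

def deduplicate(rows):
    """Keep the first row seen for each composite key."""
    seen = {}
    for row in rows:
        key = tuple(row[k] for k in DEDUP_KEY)
        seen.setdefault(key, row)
    return list(seen.values())
```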
Service: Dataproc (Spark) | Format: Parquet
Aggregated datasets produced:
- Player career statistics
- Agent performance distribution
- Map-level performance metrics
- Team-level win rates
- Event-level summaries
Silver and Gold Spark transformations run sequentially and complete within 15–20 minutes combined, triggered automatically via Pub/Sub after Bronze ingestion confirms success. Gold tables are then loaded into BigQuery for SQL-based analysis and dashboard consumption.
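As an illustration of what a Gold aggregation computes, here is the "player career statistics" rollup sketched in plain Python (the pipeline does this in Spark, and the field names here are examples rather than the actual Gold schema):

```python
# Illustrative career-stats aggregation: total kills and average rating per
# player across all events. Field names are assumptions for the sketch.
from collections import defaultdict

def player_career_stats(rows):
    acc = defaultdict(lambda: {"kills": 0, "rating_sum": 0.0, "n": 0})
    for r in rows:
        s = acc[r["player_id"]]
        s["kills"] += r["kills"]
        s["rating_sum"] += r["rating"]
        s["n"] += 1
    return {
        pid: {"total_kills": s["kills"], "avg_rating": s["rating_sum"] / s["n"]}
        for pid, s in acc.items()
    }
```

In Spark this is a `groupBy("player_id")` with `sum` and `avg` aggregates, written out as a partitioned Parquet Gold table.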
| Stage | Duration |
|---|---|
| Cloud Run scrape (per event) | < 2 minutes |
| Spark Silver + Gold transforms | 15–20 minutes |
| Full Airflow DAG (end-to-end) | 30–40 minutes |
| Records processed | 500K–1M rows |
| Events ingested | ~100 events |
Two Airflow DAGs coordinate the full pipeline:

**Ingestion DAG**
- Runs every 15 days
- Queries PostgreSQL metadata for pending events
- Dispatches Cloud Run jobs in parallel
- Handles retries and enforces idempotency via event status tracking

**Transformation DAG**
- Triggered via Pub/Sub on ingestion completion
- Marks the event as ingested in PostgreSQL
- Triggers the Silver Spark transformation
- Triggers the Gold aggregation after the Silver success signal
This event-driven pattern decouples ingestion from transformation, allowing each layer to scale independently.
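The handoff between layers can be sketched as a small subscriber-side handler. The message schema below (`{"event_id": ..., "status": ...}`) and job name are hypothetical, used only to show the decision the transformation DAG makes on each completion message:

```python
# Sketch of the Pub/Sub handoff between Bronze ingestion and the Silver
# transform. Message schema and job names are assumptions for illustration.
import json

def handle_completion(message_bytes):
    """Return the Spark job to trigger, or None if ingestion failed."""
    msg = json.loads(message_bytes)
    if msg.get("status") != "success":
        return None  # leave the event pending; Airflow retries the scrape
    # (The real handler would also UPDATE the event's status row in PostgreSQL.)
    return {"job": "vlr-silver-transform", "event_id": msg["event_id"]}
```

Because each stage only reacts to completion messages, a slow or failed scrape never blocks transforms for other events.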
Provisioned entirely with Terraform and fully reproducible from a single `terraform apply`:
- Cloud Storage buckets (Bronze, Silver, Gold)
- Cloud Run Jobs (scraping)
- Dataproc cluster (Spark serverless)
- Pub/Sub topics and subscriptions
- Cloud SQL instance (PostgreSQL metadata store)
- Cloud Composer environment (Airflow)
- Artifact Registry (container images)
- Cloud Logging integrated across Cloud Run, Dataproc, and Composer
- Pub/Sub-based job completion tracking between pipeline stages
- PostgreSQL metadata table tracks per-event ingestion status
- Spark job monitoring via Dataproc UI
- Parquet tables loaded from Gold layer
- Structured analytics datasets available for ad-hoc SQL queries
- Optimized partitioning for dashboard query performance
| Dashboard | Live Link | PDF Backup |
|---|---|---|
| Player Performance | Open | |
| Agent Meta | Open | |
| Map Stats | Open | |
PDF backups are available if live dashboards are unavailable due to budget limits.
Gold Layer Exploration Notebook
```
.
├── airflow/                       # DAGs & orchestration logic
├── functions/
│   ├── vlr-stats-scrapper-batch/  # Bronze layer — initial backfill (Cloud Run)
│   ├── vlr-stats-scrapper/        # Bronze layer — incremental (Cloud Run)
│   ├── vlr-silver-transform/      # Silver layer (Spark)
│   └── vlr-gold-transform/        # Gold layer (Spark)
├── terraform/                     # Infrastructure as Code
├── notebooks/                     # Exploratory analysis
└── README.md
```
