docs/Getting_started.md
# Creating New Challenges for LeetGPU

LeetGPU challenges are low-level GPU programming tasks focused on writing custom kernels in CUDA, Mojo, or Triton, or equivalent implementations in PyTorch and TinyGrad. They evaluate both functional correctness and performance under real GPU constraints.

This guide provides instructions for creating new GPU programming challenges for LeetGPU. It covers the complete process from concept to submission.

## Challenge Structure

Each challenge follows this directory structure:

```
challenges/<difficulty>/<number>_<name>/
├── challenge.html # Problem description and examples
├── challenge.py # Reference implementation and test cases
└── starter/ # Starter templates for each framework
├── starter.cu # CUDA template
├── starter.mojo # Mojo template
├── starter.pytorch.py # PyTorch template
├── starter.tinygrad.py # TinyGrad template
└── starter.triton.py # Triton template
```
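
The `challenge.py` file supplies the reference implementation and the test-case generators. As a rough sketch of its shape (assuming the hooks referenced in the testing guide, `reference_impl`, `generate_example_test`, and so on; the actual base class lives in the LeetGPU repository), it might look like:

```python
# Hypothetical sketch only: the class name and the "Vector Scale" challenge
# are illustrative. The real file subclasses the repository's challenge base
# class and passes name/atol/rtol/num_gpus/access_tier to super().__init__,
# as in the testing guide's examples.
from typing import Any, Dict
import torch

class VectorScaleChallenge:
    def __init__(self):
        self.name = "Vector Scale"
        self.atol = 1e-05          # absolute tolerance for output comparison
        self.rtol = 1e-05          # relative tolerance for output comparison
        self.num_gpus = 1
        self.access_tier = "free"

    def reference_impl(self, input: torch.Tensor, output: torch.Tensor, N: int):
        # Ground truth used to validate user submissions
        output.copy_(input * 2.0)

    def generate_example_test(self) -> Dict[str, Any]:
        # Small, hand-checkable case matching the example in challenge.html
        input = torch.tensor([1.0, 2.0, 3.0], device="cuda")
        return {"input": input, "output": torch.empty_like(input), "N": 3}
```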

### `challenge.html` template


# [Challenge Name]

## Description

[Provide a clear, concise explanation of what the algorithm or function is supposed to do. Include input and output specifications, if necessary.]

### Mathematical Formulation

[If applicable, provide the mathematical formula using LaTeX notation]

$$
\text{[Your formula here]}
$$

## Implementation Requirements

- **No External Libraries:** Solutions must be implemented using only native features. No external libraries or frameworks are permitted.
- **Function Signature:** The solve function signature is fixed and must not be modified. Implement your solution according to the provided signature.
- **Output Variable:** Results must be written to the designated output parameter: `[output_parameter_name]`



## Examples

### Example 1
**Input:**
```
[Provide specific input values]
```

**Expected Output:**
```
[Show the corresponding output values]
```

### Example 2
**Input:**
```
[Provide different input values]
```

**Expected Output:**
```
[Show the corresponding output values]
```

## Constraints

- **Input Size:** [Specify the range of input dimensions, e.g., "1 ≤ N ≤ 1,000,000"]
- **Value Range:** [Specify the range of input values, e.g., "-1000.0 ≤ input[i] ≤ 1000.0"]
- **Memory Limits:** [If applicable, specify any memory constraints]
docs/Starter_Codes.md
# Starter Code Creation Process for LeetGPU Challenges
Starter code is a template file that provides the basic structure and function signatures for implementing GPU-accelerated algorithms in LeetGPU challenges. It gives users a runnable foundation while leaving the core algorithmic logic as their task.

## Major Components

- **Function Signatures:** Standardized `solve` function with consistent parameters across all frameworks
- **Framework-Specific Templates:** CUDA, Triton, Mojo, PyTorch, and TinyGrad implementations
- **Memory Management:** Proper device pointer handling and memory allocation patterns
- **Kernel Structure:** Basic kernel function templates with grid/block sizing
- **Error Handling:** Bounds checking and synchronization primitives


## Framework Requirements

Each framework has specific requirements:

**CUDA:**
- Kernel functions with the `__global__` qualifier (for easy problems)
- `extern "C"` solve function for framework integration
- Proper memory management and synchronization
- Grid and block size configuration


**Triton:**
- `@triton.jit` decorator for kernel compilation
- Pointer type conversions for data types
- Block size and grid calculations
- Compliance with the no-PyTorch restriction (for fair benchmarking)

**Mojo:**
- `@export` decorator for framework integration
- Proper GPU imports and memory types
- Device context management
- Function parameter types

**PyTorch/TinyGrad:**
- Tensor-based function signatures
- GPU tensor parameters
- Simple, direct implementations


## Easy Problems

### CUDA Starter Template

```cuda
#include <cuda_runtime.h>

__global__ void kernel_name(const float* input, float* output, int size) {
    // TODO: Implement kernel logic
}

// input, output are device pointers (i.e. pointers to memory on the GPU)
extern "C" void solve(const float* input, float* output, int size) {
    // Define grid and block size, rounding up so every element is covered
    int threadsPerBlock = 256;
    int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;

    kernel_name<<<blocksPerGrid, threadsPerBlock>>>(input, output, size);
    cudaDeviceSynchronize();
}
```
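
The `(size + threadsPerBlock - 1) / threadsPerBlock` expression is ceiling division: it rounds up so that every element is assigned a thread, which is also why kernel bodies should begin with a bounds check such as `if (idx >= size) return;`.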

### Triton Starter Template

```python
# The use of PyTorch in Triton programs is not allowed for the purposes of fair benchmarking.
import triton
import triton.language as tl

@triton.jit
def kernel_name(input_ptr, output_ptr, size, BLOCK_SIZE: tl.constexpr):
    input_ptr = input_ptr.to(tl.pointer_type(tl.float32))
    output_ptr = output_ptr.to(tl.pointer_type(tl.float32))

    # TODO: Implement kernel logic
    # Use tl.program_id(0) to get this program's block index
    # Use tl.arange(0, BLOCK_SIZE) to get offsets within the block

# input_ptr, output_ptr are raw device pointers
def solve(input_ptr, output_ptr, size):
    # Define grid and block size
    BLOCK_SIZE = 1024
    grid = (triton.cdiv(size, BLOCK_SIZE),)
    kernel_name[grid](input_ptr, output_ptr, size, BLOCK_SIZE)
```
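
In Triton, each program instance (indexed by `tl.program_id(0)`) handles one `BLOCK_SIZE`-wide tile of the input; a mask such as `offsets < size` on `tl.load` and `tl.store` guards the final, possibly partial tile.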

### Mojo Starter Template

```mojo
from gpu.host import DeviceContext
from gpu.id import block_dim, block_idx, thread_idx
from memory import UnsafePointer
from math import ceildiv

fn kernel_name(input: UnsafePointer[Float32], output: UnsafePointer[Float32], size: Int32):
    # TODO: Implement kernel logic
    # Use thread_idx.x to get the thread index within the block
    # Use block_idx.x to get the block index
    pass

# input, output are device pointers (i.e. pointers to memory on the GPU)
@export
def solve(input: UnsafePointer[Float32], output: UnsafePointer[Float32], size: Int32):
    # Calculate threads per block and number of blocks
    var BLOCK_SIZE: Int32 = 256
    var num_blocks = ceildiv(size, BLOCK_SIZE)
    var ctx = DeviceContext()

    ctx.enqueue_function[kernel_name](
        input, output, size,
        grid_dim = num_blocks,
        block_dim = BLOCK_SIZE
    )

    ctx.synchronize()
```

### PyTorch Starter Template

```python
import torch

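# input, output are tensors already on the GPU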
def solve(input, output, size):
# TODO: Implement solution using PyTorch operations
pass
```

### TinyGrad Starter Template

```python
import tinygrad

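# input, output are tinygrad Tensors already on the GPU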
def solve(input, output, size):
# TODO: Implement solution using TinyGrad operations
pass
```


## Medium and Hard Problems
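
For medium and hard problems, only a bare `solve` entry point is provided: kernel definitions, launch configuration, and synchronization are left entirely to the user.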

### CUDA Starter Template

```cuda
#include <cuda_runtime.h>

// input, output are device pointers (i.e. pointers to memory on the GPU)
extern "C" void solve(input, output, size) {

}
```

### Triton Starter Template

```python
# The use of PyTorch in Triton programs is not allowed for the purposes of fair benchmarking.
import triton
import triton.language as tl

# input_ptr, output_ptr are raw device pointers
def solve(input_ptr, output_ptr, size):
    # TODO: Implement the solution
    pass
```


### Mojo Starter Template

```mojo
from gpu.host import DeviceContext
from gpu.id import block_dim, block_idx, thread_idx
from memory import UnsafePointer
from math import ceildiv

# input, output are device pointers (i.e. pointers to memory on the GPU)
@export
def solve(input: UnsafePointer[Float32], output: UnsafePointer[Float32], size: Int32):
    # TODO: Implement the solution
    pass
```

### PyTorch Starter Template

```python
import torch

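# input, output are tensors already on the GPU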
def solve(input, output, size):
# TODO: Implement solution using PyTorch operations
pass
```

### TinyGrad Starter Template

```python
import tinygrad

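# input, output are tinygrad Tensors already on the GPU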
def solve(input, output, size):
# TODO: Implement solution using TinyGrad operations
pass
```
docs/TESTING_GUIDE.md
# Testing Guide for LeetGPU Challenges

This guide covers how to create test cases and validate your challenges to ensure they work correctly across all frameworks.

## Table of Contents

1. [Test Case Types](#test-case-types)
2. [Test Case Design Principles](#test-case-design-principles)
3. [Debugging Test Issues](#debugging-test-issues)

## Test Case Types

### 1. Example Test (`generate_example_test`)
- **Purpose**: Simple test case that matches the example in `challenge.html`
- **Complexity**: Low - should be easy to understand and verify manually
- **Size**: Small (typically 3-10 elements)
- **Values**: Simple, predictable values

### 2. Functional Tests (`generate_functional_test`)
- **Purpose**: Comprehensive test suite covering various scenarios
- **Complexity**: Medium - includes edge cases and typical usage
- **Size**: Varied (small to medium)
- **Values**: Diverse, including edge cases

### 3. Performance Test (`generate_performance_test`)
- **Purpose**: Large test case for performance evaluation
- **Complexity**: High - tests scalability and efficiency
- **Size**: Large (typically 1M+ elements)
- **Values**: Random or structured large datasets

## Test Case Design Principles

### 1. Coverage
- **Input ranges**: Test minimum, maximum, and typical values
- **Input sizes**: Test small, medium, and large inputs
- **Data patterns**: Test edge cases, special values, and random data
- **Error conditions**: Test boundary conditions and invalid inputs

### 2. Determinism
- **Reproducible**: Tests should produce the same results every time
- **Seeded randomness**: Use fixed seeds for random test cases
- **Clear expectations**: Expected outputs should be well-defined
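
As a sketch of seeded, reproducible test generation (the list-of-dicts return shape is an assumption, inferred from the test-case convention in the debugging examples below), a functional test generator might look like:

```python
# Hypothetical sketch; the {"input", "output", "N"} dictionary layout follows
# the debugging examples later in this guide.
from typing import Any, Dict, List
import torch

def generate_functional_test(self) -> List[Dict[str, Any]]:
    torch.manual_seed(42)  # fixed seed: identical tensors on every run
    tests = []
    for size in (1, 16, 1024):  # minimum, small, and medium input sizes
        input = torch.empty(size, device="cuda", dtype=torch.float32).uniform_(-1000.0, 1000.0)
        tests.append({"input": input, "output": torch.empty_like(input), "N": size})
    return tests
```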

### 3. Efficiency
- **Fast execution**: Tests should run quickly for development
- **Memory efficient**: Avoid unnecessarily large test cases
- **Scalable**: Performance tests should be appropriately sized

## Debugging Test Issues

### Common Issues and Solutions

#### 1. Memory Issues
```python
# Problem: CUDA out of memory
# Solution: Reduce test case sizes
def generate_performance_test(self) -> Dict[str, Any]:
# Reduce size if memory issues occur
size = 100_000 # Instead of 1_000_000
return {
"input": torch.empty(size, device="cuda", dtype=torch.float32).uniform_(-100.0, 100.0),
"output": torch.empty(size, device="cuda", dtype=torch.float32),
"N": size
}
```

#### 2. Precision Issues
```python
# Problem: Floating point precision errors
# Solution: Adjust tolerances
def __init__(self):
super().__init__(
name="Complex Algorithm",
atol=1e-03, # Increase tolerance for complex algorithms
rtol=1e-03,
num_gpus=1,
access_tier="free"
)
```
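
Assuming a `torch.allclose`-style comparison, an element passes when `|actual - expected| <= atol + rtol * |expected|`, so raising either tolerance absorbs more accumulated floating-point error.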

#### 3. Shape Mismatch Issues
```python
# Problem: Tensor shape mismatches
# Solution: Add shape validation
def reference_impl(self, input: torch.Tensor, output: torch.Tensor, N: int):
# Validate shapes
assert input.shape == (N,), f"Expected input shape ({N},), got {input.shape}"
assert output.shape == (N,), f"Expected output shape ({N},), got {output.shape}"

# Rest of implementation...
```

### Debugging Checklist

- [ ] Reference implementation produces correct results
- [ ] All test cases have required parameters
- [ ] Tensor shapes match expectations
- [ ] Data types are consistent (float32)
- [ ] Tolerances are appropriate for the algorithm
- [ ] Performance test size is reasonable
- [ ] Edge cases are covered
- [ ] Random test cases use appropriate ranges
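
Before submitting, a quick local run of the reference implementation catches most of these issues. A minimal sketch, reusing the hypothetical challenge class from the getting-started guide:

```python
# Hypothetical sanity check against the sketched VectorScaleChallenge class;
# substitute your actual challenge class and test generators.
challenge = VectorScaleChallenge()
case = challenge.generate_example_test()
challenge.reference_impl(case["input"], case["output"], case["N"])
print(case["output"])  # should match the expected output in challenge.html
```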

---

*This testing guide ensures your challenges are robust, well-tested, and ready for production use.*