diff --git a/docs/Getting_started.md b/docs/Getting_started.md
new file mode 100644
index 0000000..17c21e6
--- /dev/null
+++ b/docs/Getting_started.md
@@ -0,0 +1,78 @@
# Creating New Challenges for LeetGPU

LeetGPU challenges are low-level GPU programming tasks focused on writing custom kernels in CUDA, Triton, Mojo, PyTorch, or TinyGrad. They evaluate both functional correctness and performance under real GPU constraints.

This guide provides instructions for creating new GPU programming challenges for LeetGPU. It covers the complete process from concept to submission.

## Challenge Structure

Each challenge follows this directory structure:

```
challenges/<difficulty>/<id>_<challenge_name>/
├── challenge.html          # Problem description and examples
├── challenge.py            # Reference implementation and test cases
└── starter/                # Starter templates for each framework
    ├── starter.cu          # CUDA template
    ├── starter.mojo        # Mojo template
    ├── starter.pytorch.py  # PyTorch template
    ├── starter.tinygrad.py # TinyGrad template
    └── starter.triton.py   # Triton template
```

### challenge.html template

# [Challenge Name]

## Description

[Provide a clear, concise explanation of what the algorithm or function is supposed to do. Include input and output specifications, if necessary.]

### Mathematical Formulation

[If applicable, provide the mathematical formula using LaTeX notation]

$$
\text{[Your formula here]}
$$

## Implementation Requirements

- **No External Libraries:** Solutions must be implemented using only native features. No external libraries or frameworks are permitted.
- **Function Signature:** The solve function signature is fixed and must not be modified. Implement your solution according to the provided signature.
- **Output Variable:** Results must be written to the designated output parameter: `[output_parameter_name]`

## Examples

### Example 1
**Input:**
```
[Provide specific input values]
```

**Expected Output:**
```
[Show the corresponding output values]
```

### Example 2
**Input:**
```
[Provide different input values]
```

**Expected Output:**
```
[Show the corresponding output values]
```

## Constraints

- **Input Size:** [Specify the range of input dimensions, e.g., "1 ≤ N ≤ 1,000,000"]
- **Value Range:** [Specify the range of input values, e.g., "-1000.0 ≤ input[i] ≤ 1000.0"]
- **Memory Limits:** [If applicable, specify any memory constraints]

diff --git a/docs/Starter_Codes.md b/docs/Starter_Codes.md
new file mode 100644
index 0000000..67b40ea
--- /dev/null
+++ b/docs/Starter_Codes.md
@@ -0,0 +1,201 @@
# Starter Code Creation Process for LeetGPU Challenges

Starter code is a template file that provides the basic structure and function signatures for implementing GPU-accelerated algorithms in LeetGPU challenges. It gives users a runnable foundation while leaving the core algorithmic logic as their task.
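
To make that division of labor concrete, here is how a user might complete the PyTorch starter shown later in this guide for a hypothetical ReLU challenge (the challenge, signature, and tensor names are illustrative, not taken from a real challenge):

```python
import torch

# Hypothetical completed solution: the starter supplies the fixed `solve`
# signature, and the user fills in only the body.
def solve(input: torch.Tensor, output: torch.Tensor, N: int):
    # Results must be written into the designated output tensor in place
    output.copy_(torch.relu(input))
```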
+ +## Major Components + +- **Function Signatures:** Standardized `solve` function with consistent parameters across all frameworks +- **Framework-Specific Templates:** CUDA, Triton, Mojo, PyTorch, and TinyGrad implementations +- **Memory Management:** Proper device pointer handling and memory allocation patterns +- **Kernel Structure:** Basic kernel function templates with grid/block sizing +- **Error Handling:** Bounds checking and synchronization primitives + + +### Identify Framework Requirements + +Each framework has specific requirements: + +**CUDA:** +- Kernel functions with `__global__` qualifier(for easy problems) +- `extern "C"` solve function for framework integration +- Proper memory management and synchronization +- Grid and block size +- + + +**Triton:** +- `@triton.jit` decorator for kernel compilation +- Pointer type conversions for data types +- Block size and grid calculations +- PyTorch restriction compliance + +**Mojo:** +- `@export` decorator for framework integration +- Proper GPU imports and memory types +- Device context management +- Function parameter types + +**PyTorch/TinyGrad:** +- Tensor-based function signatures +- GPU tensor parameters +- Simple, direct implementations + + +## Easy Problems + +### CUDA Starter Template + +```cuda +#include + +__global__ void kernel_name() { +} + +// input, output are device pointers (i.e. pointers to memory on the GPU) +extern "C" void solve(input, output,size) { + + // define grid, block size + kernel_name<<>>(input, output, size); + cudaDeviceSynchronize(); +} +``` + + + + + +### Triton Starter Template + +```python +# The use of PyTorch in Triton programs is not allowed for the purposes of fair benchmarking. +import triton +import triton.language as tl + +@triton.jit +def kernel_name(input_ptr, output_ptr, input size, block size): + input_ptr = input_ptr.to(tl.pointer_type(tl.float32)) + output_ptr = output_ptr.to(tl.pointer_type(tl.float32)) + + # TODO: Implement kernel logic + # Use tl.program_id(0) to get block index + # Use tl.program_id(1) to get thread ndex within block + +# input_ptr, output_ptr are raw device pointers +def solve(input_ptr, output_ptr, input size): + # define grid, block size + kernel_name[grid](input_ptr, output_ptr, input size, block size) +``` + + + + + +### Mojo Starter Template + +```mojo +from gpu.host import DeviceContext +from gpu.id import block_dim, block_idx, thread_idx +from memory import UnsafePointer +from math import ceildiv + +fn kernel_name(input, output, size): + # TODO: Implement kernel logic + # Use thread_idx() to get thread index within block + # Use block_idx() to get block index + pass + +# input, output are device pointers (i.e. pointers to memory on the GPU) +@export +def solve(input, output, size): + #calculate threads per block + var ctx = DeviceContext() + + ctx.enqueue_function[kernel_name]( + input, output, size, + grid_dim = num_blocks, + block_dim = BLOCK_SIZE + ) + + ctx.synchronize() +``` + +### PyTorch Starter Template + +```python +import torch + +def solve(input, output, size): + # TODO: Implement solution using PyTorch operations + pass +``` + +### TinyGrad Starter Template + +```python +import tinygrad + +def solve(input, output, size): + # TODO: Implement solution using TinyGrad operations + pass +``` + + +## Medium and Hard Problems + +### CUDA Starter Template + +```cuda +#include + +// input, output are device pointers (i.e. 

## Medium and Hard Problems

### CUDA Starter Template

```cuda
#include <cuda_runtime.h>

// input, output are device pointers (i.e. pointers to memory on the GPU)
extern "C" void solve(const float* input, float* output, int size) {

}
```

### Triton Starter Template

```python
# The use of PyTorch in Triton programs is not allowed for the purposes of fair benchmarking.
import triton
import triton.language as tl

# input_ptr, output_ptr are raw device pointers
def solve(input_ptr, output_ptr, size):
    pass
```

### Mojo Starter Template

```mojo
from gpu.host import DeviceContext
from gpu.id import block_dim, block_idx, thread_idx
from memory import UnsafePointer
from math import ceildiv

@export
def solve(input: UnsafePointer[Float32], output: UnsafePointer[Float32], size: Int32):
    pass
```

### PyTorch Starter Template

```python
import torch

def solve(input, output, size):
    # TODO: Implement solution using PyTorch operations
    pass
```

### TinyGrad Starter Template

```python
import tinygrad

def solve(input, output, size):
    # TODO: Implement solution using TinyGrad operations
    pass
```

diff --git a/docs/TESTING_GUIDE.md b/docs/TESTING_GUIDE.md
new file mode 100644
index 0000000..31a5acb
--- /dev/null
+++ b/docs/TESTING_GUIDE.md
@@ -0,0 +1,111 @@
# Testing Guide for LeetGPU Challenges

This guide covers how to create test cases and validate your challenges to ensure they work correctly across all frameworks.

## Table of Contents

1. [Test Case Types](#test-case-types)
2. [Test Case Design Principles](#test-case-design-principles)
3. [Debugging Test Issues](#debugging-test-issues)

## Test Case Types

### 1. Example Test (`generate_example_test`)
- **Purpose**: Simple test case that matches the example in `challenge.html`
- **Complexity**: Low - should be easy to understand and verify manually
- **Size**: Small (typically 3-10 elements)
- **Values**: Simple, predictable values

### 2. Functional Tests (`generate_functional_test`)
- **Purpose**: Comprehensive test suite covering various scenarios
- **Complexity**: Medium - includes edge cases and typical usage
- **Size**: Varied (small to medium)
- **Values**: Diverse, including edge cases

### 3. Performance Test (`generate_performance_test`)
- **Purpose**: Large test case for performance evaluation
- **Complexity**: High - tests scalability and efficiency
- **Size**: Large (typically 1M+ elements)
- **Values**: Random or structured large datasets

## Test Case Design Principles

### 1. Coverage
- **Input ranges**: Test minimum, maximum, and typical values
- **Input sizes**: Test small, medium, and large inputs
- **Data patterns**: Test edge cases, special values, and random data
- **Error conditions**: Test boundary conditions and invalid inputs

### 2. Determinism
- **Reproducible**: Tests should produce the same results every time
- **Seeded randomness**: Use fixed seeds for random test cases (see the sketch below)
- **Clear expectations**: Expected outputs should be well-defined

### 3. Efficiency
- **Fast execution**: Tests should run quickly for development
- **Memory efficient**: Avoid unnecessarily large test cases
- **Scalable**: Performance tests should be appropriately sized
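
To make the determinism and efficiency principles concrete, here is a sketch of a seeded functional test generator. It follows the method and field conventions of the challenge template later in these docs; the sizes and value ranges are purely illustrative:

```python
import torch

def generate_functional_test(self):
    # Fixed seed keeps the "random" cases identical on every run
    torch.manual_seed(42)
    dtype = torch.float32
    test_cases = []
    # Small-to-medium sizes keep the suite fast while still covering edge cases
    for size in (1, 16, 1024):
        test_cases.append({
            "input": torch.empty(size, device="cuda", dtype=dtype).uniform_(-100.0, 100.0),
            "output": torch.empty(size, device="cuda", dtype=dtype),
            "N": size,
        })
    return test_cases
```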
## Debugging Test Issues

### Common Issues and Solutions

#### 1. Memory Issues
```python
# Problem: CUDA out of memory
# Solution: Reduce test case sizes
def generate_performance_test(self) -> Dict[str, Any]:
    # Reduce size if memory issues occur
    size = 100_000  # Instead of 1_000_000
    return {
        "input": torch.empty(size, device="cuda", dtype=torch.float32).uniform_(-100.0, 100.0),
        "output": torch.empty(size, device="cuda", dtype=torch.float32),
        "N": size
    }
```

#### 2. Precision Issues
```python
# Problem: Floating point precision errors
# Solution: Adjust tolerances
def __init__(self):
    super().__init__(
        name="Complex Algorithm",
        atol=1e-03,  # Increase tolerance for complex algorithms
        rtol=1e-03,
        num_gpus=1,
        access_tier="free"
    )
```

#### 3. Shape Mismatch Issues
```python
# Problem: Tensor shape mismatches
# Solution: Add shape validation
def reference_impl(self, input: torch.Tensor, output: torch.Tensor, N: int):
    # Validate shapes
    assert input.shape == (N,), f"Expected input shape ({N},), got {input.shape}"
    assert output.shape == (N,), f"Expected output shape ({N},), got {output.shape}"

    # Rest of implementation...
```

### Debugging Checklist

- [ ] Reference implementation produces correct results
- [ ] All test cases have required parameters
- [ ] Tensor shapes match expectations
- [ ] Data types are consistent (float32)
- [ ] Tolerances are appropriate for the algorithm
- [ ] Performance test size is reasonable
- [ ] Edge cases are covered
- [ ] Random test cases use appropriate ranges

---

*This testing guide ensures your challenges are robust, well-tested, and ready for production use.*
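
Before wiring a challenge into the harness, a quick standalone check of the reference implementation can catch most of the issues above. A minimal sketch for a hypothetical ReLU-style challenge follows; the helper name and tolerances are illustrative:

```python
import torch

def check_reference(reference_impl, atol=1e-05, rtol=1e-05):
    # Run a hypothetical ReLU-style reference_impl on small, hand-checkable
    # values and compare against torch's own operator.
    input = torch.tensor([-1.0, 0.0, 2.5], device="cuda")
    output = torch.empty_like(input)
    reference_impl(input, output, N=3)
    assert torch.allclose(output, torch.relu(input), atol=atol, rtol=rtol)
```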
diff --git a/docs/challenge_template.py b/docs/challenge_template.py
new file mode 100644
index 0000000..08f5387
--- /dev/null
+++ b/docs/challenge_template.py
@@ -0,0 +1,136 @@
import ctypes
from typing import Any, List, Dict
import torch
from core.challenge_base import ChallengeBase

class Challenge(ChallengeBase):
    def __init__(self):
        super().__init__(
            name="[CHALLENGE_NAME]",  # e.g., "ReLU", "Softmax", "Multi-Head Attention"
            atol=1e-05,  # Absolute tolerance for testing. 1e-05 is a good default.
            rtol=1e-05,  # Relative tolerance for testing. 1e-05 is a good default.
            num_gpus=1,  # Number of GPUs required.
            access_tier="free"  # Access tier
        )

    def reference_impl(self, *args, **kwargs):
        """
        Reference implementation of the algorithm/function.

        Common patterns:
        - Assert input shapes and properties (dtype, device)
        - Implement the core algorithm logic
        - Use output.copy_(result) to write results

        Example signature patterns:
        - Simple: (input: torch.Tensor, output: torch.Tensor, N: int)
        - Complex: (Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor, output: torch.Tensor, N: int, d_model: int, h: int)
        """
        # TODO: Add input assertions
        # assert input.shape == expected_shape
        # assert input.dtype == expected_dtype
        # assert input.device == expected_device

        # TODO: Implement core algorithm logic
        # result = your_algorithm_implementation()

        # TODO: Copy result to output tensor
        # output.copy_(result)
        pass

    def get_solve_signature(self) -> Dict[str, Any]:
        """
        Define the C function signature for the solver.

        Common ctypes patterns:
        - Tensor pointers: ctypes.POINTER(ctypes.c_float)
        - Integers: ctypes.c_int
        - Floats: ctypes.c_float
        """
        return {
            # TODO: Define your function signature
            # "input": ctypes.POINTER(ctypes.c_float),
            # "output": ctypes.POINTER(ctypes.c_float),
            # "N": ctypes.c_int,
            # Add other parameters as needed
        }

    def generate_example_test(self) -> Dict[str, Any]:
        """
        Generate a simple example test case.
        Usually small, hand-crafted data for basic demonstration.
        """
        dtype = torch.float32

        # TODO: Create example input tensors
        # input_tensor = torch.tensor([...], device="cuda", dtype=dtype)
        # output_tensor = torch.empty(shape, device="cuda", dtype=dtype)

        return {
            # TODO: Return test case dictionary
            # "input": input_tensor,
            # "output": output_tensor,
            # "N": size,
            # Add other parameters as needed
        }

    def generate_functional_test(self) -> List[Dict[str, Any]]:
        """
        Generate comprehensive functional test cases.

        Common test patterns:
        - Edge cases (zeros, negatives, single elements)
        - Boundary conditions
        - Various sizes
        - Random data
        - Special mathematical cases
        """
        dtype = torch.float32
        test_cases = []

        # TODO: Add basic test case
        # test_cases.append({
        #     "input": torch.tensor([...], device="cuda", dtype=dtype),
        #     "output": torch.empty(shape, device="cuda", dtype=dtype),
        #     "N": size
        # })

        # TODO: Add edge cases
        # - All zeros
        # - All negatives
        # - Single element
        # - Large values
        # - Small values
        # - Mixed positive/negative

        # TODO: Add random test cases
        # test_cases.append({
        #     "input": torch.empty(size, device="cuda", dtype=dtype).uniform_(min_val, max_val),
        #     "output": torch.empty(size, device="cuda", dtype=dtype),
        #     "N": size
        # })

        return test_cases

    def generate_performance_test(self) -> Dict[str, Any]:
        """
        Generate a large-scale performance test case.
        Usually uses large tensors with random data.
        """
        dtype = torch.float32

        # TODO: Set appropriate size for performance testing
        # Common sizes: 25000000, 500000, 1024x1024, etc.
        N = 1000000  # Adjust based on your challenge

        # TODO: Create large tensors for performance testing
        # input_tensor = torch.empty(N, device="cuda", dtype=dtype).uniform_(min_val, max_val)
        # output_tensor = torch.empty(N, device="cuda", dtype=dtype)

        return {
            # TODO: Return performance test case
            # "input": input_tensor,
            # "output": output_tensor,
            # "N": N,
            # Add other parameters as needed
        }
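
# ---------------------------------------------------------------------------
# Illustration (not part of the template): one way the TODOs above are
# typically resolved, shown for a hypothetical element-wise ReLU challenge.
# This module-level sketch reuses the imports at the top of the file; the
# function name, values, and sizes are assumptions for demonstration only.
def _example_generate_example_test() -> Dict[str, Any]:
    dtype = torch.float32
    # Small, hand-checkable values spanning negative, zero, and positive inputs
    input_tensor = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0], device="cuda", dtype=dtype)
    output_tensor = torch.empty(5, device="cuda", dtype=dtype)
    return {"input": input_tensor, "output": output_tensor, "N": 5}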