diff --git a/README.md b/README.md
index b71c458..755a75b 100644
--- a/README.md
+++ b/README.md
@@ -3,11 +3,184 @@ CUDA Stream Compaction
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Hanming Zhang
+* Tested on: Windows 10 Education, i7-6700K @ 4.00GHz 16.0GB, GTX 980 4096MB (Personal Desktop)
 
-### (TODO: Your README)
+### Project Features
+##### (All scans here are exclusive. To turn an inclusive scan result into an exclusive one, shift the result right by one element and insert the identity at the head; I implemented a kernel for this in naive.cu. Conversely, to turn an exclusive scan result into an inclusive one, shift the result left by one element and insert the sum of all previous elements at the last position.)
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+- CPU Scan
+- CPU Compaction without Scan
+- CPU Compaction with Scan
+- Naive GPU Scan
+- Work-Efficient GPU Scan
+- Work-Efficient GPU Stream Compaction
+- Thrust's Scan
+
+#### Extra Credits implemented:
+- Work-Efficient GPU Scan Optimization (the "Why is my GPU approach so slow?" optimization): all old branches in the kernels are removed, and every launched thread now does work. See **kernEffcientUpSweep** and **kernEfficientDownSweep** in **efficient.cu** for details.
+- Radix Sort: see the last part of the **main** function (**line 179 - 207**) for the test, and **RadixSort.h** and **RadixSort.cu** for details. Call **RadixSort::sort** to use it.
+- GPU Scan using shared memory & hardware optimization: see **StreamCompaction::Efficient::scanDynamicShared** and **StreamCompaction::Naive::scanDynamicShared** for details.
+
+### Project Analysis (Under x64 Release mode)
+
+- #### Total number size performance analysis
+  ###### In this SCAN analysis, all GPU block sizes are temporarily set to *64*, and the Efficient Scan method is *optimized (branches in the kernel are removed)*. *No shared memory* is used yet. Non-Power-Of-Two size = Power-Of-Two size - 3.
+
+  ![](img/1TotalNumberAnalysis.jpg)
+
+  ###### Analysis:
+  First of all, comparing POT and NPOT cases, the Naive and Thrust POT runs always consume more time based on the data we got. This is understandable, because the NPOT input simply needs fewer operations. In the Work-Efficient method, however, we pad the NPOT array with zeroes to make it a POT array so that we can use the efficient algorithm; as a result, there is no clear timing difference between POT and NPOT there.
+
+  Besides that, both the naive and the Work-Efficient methods show sudden jumps in running time, e.g. from size 512 to 1024 for Work-Efficient and from 1024 to 2048 for naive. I think this can be explained as follows: only a limited number of blocks can be processed at one time, so once that limit is exceeded there is a relatively large increase in running time.
+
+  Finally, the Thrust scan method appears pretty stable across different size conditions.
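+
+  As a quick illustration of the NPOT handling described above, here is a minimal sketch of the padding step performed inside the Work-Efficient `scan(n, odata, idata)` (it reuses the `ilog2ceil` helper, the `kernSetZero` kernel, and the `blockSize` define that appear later in this diff):
+
+  ```
+  // Round an n-element input up to the next power of two so the
+  // balanced-tree up-sweep/down-sweep can assume a full tree.
+  int dMax = ilog2ceil(n);
+  int size = 1 << dMax;                  // padded power-of-two size
+
+  int* dev_data;
+  cudaMalloc((void**)&dev_data, size * sizeof(int));
+  // zero everything first, so the padding tail contributes nothing to the sums
+  kernSetZero<<<(size + blockSize - 1) / blockSize, blockSize>>>(size, dev_data);
+  // then copy in the real n elements
+  cudaMemcpy(dev_data, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+  ```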
+
+- #### Block Size performance analysis
+  ###### In this SCAN analysis, the total number size is set to *2048*, and the Efficient Scan method is *optimized (branches in the kernel are removed)*. *No shared memory* is used yet. Non-Power-Of-Two size = Power-Of-Two size - 3.
+
+  ![](img/2BlockSizeAnalysis.jpg)
+
+  ###### Analysis:
+  Based on the data we got, there is no clear timing difference between block sizes, except that block size 64 is very slightly faster than the others. I think it is just a trade-off between more blocks of smaller size and fewer blocks of larger size, and it does not change the total time an SM needs to process them very much. Besides that, both the naive and the Work-Efficient methods use **global memory** here, so both spend a lot of their time on memory access.
+
+
+- #### Work-Efficient GPU kernel optimization analysis (before and after branches are removed in the kernel, EC 1)
+  ###### In this analysis, all GPU block sizes are set to *64* and the total number size is *2048*. Non-Power-Of-Two size = Power-Of-Two size - 3.
+
+  ![](img/3WorkEfficientAnalysis.jpg)
+
+  ###### Analysis:
+  As the data shows, after I optimized the code -- removing the if statement from the kernel and launching only threads that actually do work -- performance improves only slightly. This is understandable because **the number of threads that actually do work doesn't change at all**. Since we use global memory here and the for loop runs on the host side, the only difference is that before we launched 2048 threads 11 times (since 2^11 = 2048), most of which did no work, whereas now we launch (2^10 + 2^9 + 2^8 + ... + 2^0) threads in total, all of which do work. Note, however, that **there is not much work inside the kernel's if statement, so the stall time is short**; if the if statement did more work, the optimization would improve performance further.
+
+
+- #### Global memory VS (Shared memory + Hardware Optimization) (EC 3)
+  ###### In this SCAN analysis, all GPU block sizes are set to *64* and the total number size is *2048*. The Efficient Scan method is *optimized (branches in the kernel are removed)*. Non-Power-Of-Two size = Power-Of-Two size - 3.
+
+  ![](img/4GlobalVSShared.jpg)
+
+  ###### Analysis:
+  The result is pretty clear: the Naive method improves by about **300%** and Work-Efficient by about **450%** in performance. **K.O.**
+
+  First of all, the loop counts of the global-memory and shared-memory methods differ. For the **global** version, since 2^11 = 2048, we have to run the kernel **11 times** on the host side. For the **shared memory** version, since the block size is 64 (2^6), the loop runs 6 + 1 = **7 times** inside each kernel.
+
+  We also greatly reduce global memory accesses: the **global** version accesses global memory (1 read + 1 write) x 11 = **22 times** per element, while the **shared memory** version only needs 1 read + 1 write = **2** global accesses per kernel.
+
+  Hardware-friendly indexing also works well here, just like the difference between stride = 1, 2, 4, ... and stride = 4, 2, 1, .... In our case it is **stride = 64, 32, 16, 8, 4, 2, 1**, which means that instead of the whole block waiting on a few scattered threads to finish their tasks, most threads finish their work early and only a small number of threads are left to finish the remaining levels.
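+
+  To make the access pattern above concrete, here is a minimal sketch of the per-tile shared-memory scan: one global read and one global write per element, with the whole up-sweep/down-sweep running in shared memory. The kernel name is hypothetical and the bank-conflict padding is omitted for brevity; `kernScanDynamicShared` in **efficient.cu** is the full version:
+
+  ```
+  // One block scans one power-of-two tile of n elements in shared memory.
+  __global__ void kernBlockScanSketch(int n, int* g_odata, const int* g_idata) {
+      extern __shared__ int temp[];
+      int thid = threadIdx.x;
+      int blockOffset = 2 * blockDim.x * blockIdx.x;
+
+      temp[2 * thid]     = g_idata[blockOffset + 2 * thid];      // 1 global read
+      temp[2 * thid + 1] = g_idata[blockOffset + 2 * thid + 1];
+
+      int offset = 1;
+      for (int d = n >> 1; d > 0; d >>= 1) {   // up-sweep (reduce)
+          __syncthreads();
+          if (thid < d) {
+              int ai = offset * (2 * thid + 1) - 1;
+              int bi = offset * (2 * thid + 2) - 1;
+              temp[bi] += temp[ai];
+          }
+          offset *= 2;
+      }
+      if (thid == 0) { temp[n - 1] = 0; }      // clear the root
+      for (int d = 1; d < n; d *= 2) {         // down-sweep
+          offset >>= 1;
+          __syncthreads();
+          if (thid < d) {
+              int ai = offset * (2 * thid + 1) - 1;
+              int bi = offset * (2 * thid + 2) - 1;
+              int t = temp[ai]; temp[ai] = temp[bi]; temp[bi] += t;
+          }
+      }
+      __syncthreads();
+      g_odata[blockOffset + 2 * thid]     = temp[2 * thid];      // 1 global write
+      g_odata[blockOffset + 2 * thid + 1] = temp[2 * thid + 1];
+  }
+  ```
+
+  This would be launched with half a tile of threads per block, e.g. `kernBlockScanSketch<<<numTiles, tileSize / 2, tileSize * sizeof(int)>>>(tileSize, dev_out, dev_in)`.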
+
+- #### Scan Methods Comparison & Analysis
+  ###### In this SCAN analysis, all GPU block sizes are set to *64* and the total number size is *2048*. The Efficient Scan method is *optimized (branches in the kernel are removed)*. Non-Power-Of-Two size = Power-Of-Two size - 3.
+
+  ![](img/5DifferentScan.jpg)
+
+  ###### Analysis:
+
+  Work-Efficient is not so efficient in our case, even though I optimized it by removing the branch (if statement) from the kernel and launching only threads that do work.
+
+  Global memory access is still where the performance bottleneck is: the global-memory versions access it many times, and that is expensive. The huge improvement we get from shared memory tells the whole story.
+
+  The Naive method outperforms Work-Efficient in this case; the reason might be that the array size is still small here (only 2048). With millions of numbers, the Work-Efficient method should come out ahead.
+
+  Thrust stays very stable and always performs relatively well.
+
+
+- #### Radix Sort analysis and comparison with the built-in qsort
+  ###### In this analysis, all GPU block sizes are set to *64* and the total number size is *2048*. I use the global-memory Work-Efficient Scan, *optimized (branches in the kernel are removed)*. Non-Power-Of-Two size = Power-Of-Two size - 3.
+
+qsort | Radix Sort
+------------ | -------------
+0.0539 ms | 1.0717 ms
+
+  ###### Analysis:
+  I think the main reason Radix Sort takes so long is that I use global memory (it comes before the Shared Memory & hardware optimization part in the EC list, so I implemented it first). There are many global memory accesses, especially when generating the helper arrays. Besides that, the Work-Efficient scan I use here is actually the slowest scan of the bunch.
+
+  qsort runs pretty well and takes less than 0.1 ms to finish sorting.
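+
+  For reference, this is the split that **RadixSort::sort** performs once per bit, written as a CPU-side sketch (the helper name is hypothetical; on the GPU, `e` and `f` correspond to the arrays built by `kernGen_b_e_array` and the Work-Efficient scan):
+
+  ```
+  #include <vector>
+
+  // One LSD radix-sort pass for bit k: a stable partition by that bit.
+  // e[i] = 1 where bit k is 0 ("false"); f = exclusive scan of e.
+  void radixSplitPassSketch(int n, int k, const int* in, int* out) {
+      std::vector<int> e(n), f(n);
+      for (int i = 0; i < n; i++) e[i] = !((in[i] >> k) & 1);
+      f[0] = 0;
+      for (int i = 1; i < n; i++) f[i] = f[i - 1] + e[i - 1];
+      int totalFalses = f[n - 1] + e[n - 1];
+      for (int i = 0; i < n; i++) {
+          // falses keep their scan rank; trues go after all the falses
+          int d = e[i] ? f[i] : (i - f[i] + totalFalses);
+          out[d] = in[i];
+      }
+  }
+  ```
+
+  Running this pass for k = 0 .. maxBits - 1 (4 passes for values below 16) yields the sorted array.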
+
+- #### Output of the test program
+  ```
+  ****************
+  ** SCAN TESTS **
+  ****************
+      [ 13 35 38 24 16 12 38 34 30 23 43 19 30 ... 22 0 ]
+  ==== cpu scan, power-of-two ====
+     elapsed time: 0.002811ms    (std::chrono Measured)
+      [ 0 13 48 86 110 126 138 176 210 240 263 306 325 ... 50256 50278 ]
+  ==== cpu scan, non-power-of-two ====
+     elapsed time: 0.002811ms    (std::chrono Measured)
+      [ 0 13 48 86 110 126 138 176 210 240 263 306 325 ... 50217 50240 ]
+      passed
+  ==== naive scan, power-of-two ====
+     elapsed time: 0.061728ms    (CUDA Measured)
+      passed
+  ==== naive scan, non-power-of-two ====
+     elapsed time: 0.059904ms    (CUDA Measured)
+      passed
+  ==== naive scan(Dynamic Shared Memory), power-of-two ====
+     elapsed time: 0.01408ms    (CUDA Measured)
+      passed
+  ==== naive scan(Dynamic Shared Memory), non-power-of-two ====
+     elapsed time: 0.014048ms    (CUDA Measured)
+      passed
+  ==== work-efficient scan, power-of-two ====
+     elapsed time: 0.093376ms    (CUDA Measured)
+      passed
+  ==== work-efficient scan, non-power-of-two ====
+     elapsed time: 0.093536ms    (CUDA Measured)
+      passed
+  ==== work-efficient scan(Dynamic Shared Memory), power-of-two ====
+     elapsed time: 0.016832ms    (CUDA Measured)
+      passed
+  ==== work-efficient scan(Dynamic Shared Memory), non-power-of-two ====
+     elapsed time: 0.0168ms    (CUDA Measured)
+      passed
+  ==== thrust scan, power-of-two ====
+     elapsed time: 0.023744ms    (CUDA Measured)
+      passed
+  ==== thrust scan, non-power-of-two ====
+     elapsed time: 0.014848ms    (CUDA Measured)
+      passed
+
+  *****************************
+  ** STREAM COMPACTION TESTS **
+  *****************************
+      [ 1 1 2 0 2 0 0 2 2 3 3 3 0 ... 2 0 ]
+  ==== cpu compact without scan, power-of-two ====
+     elapsed time: 0.004344ms    (std::chrono Measured)
+      [ 1 1 2 2 2 2 3 3 3 3 2 3 3 ... 1 2 ]
+      passed
+  ==== cpu compact without scan, non-power-of-two ====
+     elapsed time: 0.004343ms    (std::chrono Measured)
+      [ 1 1 2 2 2 2 3 3 3 3 2 3 3 ... 1 1 ]
+      passed
+  ==== cpu compact with scan ====
+     elapsed time: 0.008431ms    (std::chrono Measured)
+      [ 1 1 2 2 2 2 3 3 3 3 2 3 3 ... 1 2 ]
+      passed
+  ==== work-efficient compact, power-of-two ====
+     elapsed time: 0.1088ms    (CUDA Measured)
+      passed
+  ==== work-efficient compact, non-power-of-two ====
+     elapsed time: 0.108832ms    (CUDA Measured)
+      passed
+
+  *****************************
+  **     Radix Sort TEST     **
+  *****************************
+  Randomly generated array with maximum 16.
+      [ 13 1 14 8 10 4 8 10 6 7 7 15 12 ... 14 13 ]
+  qsort Result :
+      [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 15 15 ]
+     elapsed time: 0.053908ms    (std::chrono Measured)
+  Radix sort Result :
+      [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 15 15 ]
+     elapsed time: 1.07917ms    (CUDA Measured)
+  sort result comparison between qsort and radix sort :
+      passed
+  ```
diff --git a/img/1TotalNumberAnalysis.jpg b/img/1TotalNumberAnalysis.jpg
new file mode 100644
index 0000000..33a3fe2
Binary files /dev/null and b/img/1TotalNumberAnalysis.jpg differ
diff --git a/img/2BlockSizeAnalysis.jpg b/img/2BlockSizeAnalysis.jpg
new file mode 100644
index 0000000..0e56538
Binary files /dev/null and b/img/2BlockSizeAnalysis.jpg differ
diff --git a/img/3WorkEfficientAnalysis.jpg b/img/3WorkEfficientAnalysis.jpg
new file mode 100644
index 0000000..9ad97c6
Binary files /dev/null and b/img/3WorkEfficientAnalysis.jpg differ
diff --git a/img/4GlobalVSShared.jpg b/img/4GlobalVSShared.jpg
new file mode 100644
index 0000000..652250d
Binary files /dev/null and b/img/4GlobalVSShared.jpg differ
diff --git a/img/5DifferentScan.jpg b/img/5DifferentScan.jpg
new file mode 100644
index 0000000..7abde67
Binary files /dev/null and b/img/5DifferentScan.jpg differ
diff --git a/src/main.cpp b/src/main.cpp
index 7305641..5af01c0 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -11,9 +11,10 @@
 #include
 #include
 #include
+#include
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 2048; //1 << 8; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int a[SIZE], b[SIZE], c[SIZE];
@@ -59,6 +60,23 @@ int main(int argc, char* argv[]) {
     //printArray(SIZE, c, true);
     printCmpResult(NPOT, b, c);
 
+
+    ///------------ EC : naive Dynamic Shared Memory Test --------------------
+    zeroArray(SIZE, c);
+    printDesc("naive scan(Dynamic Shared Memory), power-of-two");
+    StreamCompaction::Naive::scanDynamicShared(SIZE, c, a);
+    printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    //printArray(SIZE, c, true);
+    printCmpResult(SIZE, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("naive scan(Dynamic Shared Memory), non-power-of-two");
+    StreamCompaction::Naive::scanDynamicShared(NPOT, c, a);
+    printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    //printArray(SIZE, c, true);
+    printCmpResult(NPOT, b, c);
+    ///------------------------------------------------------------------------
+
     zeroArray(SIZE, c);
     printDesc("work-efficient scan, power-of-two");
     StreamCompaction::Efficient::scan(SIZE, c, a);
power-of-two"); + StreamCompaction::Efficient::scanDynamicShared(SIZE, c, a); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + //printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + + zeroArray(SIZE, c); + printDesc("work-efficient scan(Dynamic Share Memory), non-power-of-two"); + StreamCompaction::Efficient::scanDynamicShared(NPOT, c, a); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + //printArray(SIZE, c, true); + printCmpResult(NPOT, b, c); + ///------------------------------------------------------------------------- + + + zeroArray(SIZE, c); printDesc("thrust scan, power-of-two"); StreamCompaction::Thrust::scan(SIZE, c, a); @@ -139,5 +175,50 @@ int main(int argc, char* argv[]) { //printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); + zeroArray(SIZE, c); + printDesc("work-efficient compact(Dynammic Shared), power-of-two"); + count = StreamCompaction::Efficient::compactDynamicShared(SIZE, c, a); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + //printArray(count, c, true); + printCmpLenResult(count, expectedCount, b, c); + + zeroArray(SIZE, c); + printDesc("work-efficient compact(Dynammic Shared), non-power-of-two"); + count = StreamCompaction::Efficient::compactDynamicShared(NPOT, c, a); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + //printArray(count, c, true); + printCmpLenResult(count, expectedNPOT, b, c); + + + printf("\n"); + printf("*****************************\n"); + printf("** Radix Sort TEST **\n"); + printf("*****************************\n"); + + int numberCount = SIZE; // should smaller than SIZE (256 here) + + // These two values must be compatible + int maxValue = 16; + int maxBits = 4; + + genArray(numberCount, a, maxValue); + printf("Randomly Generate Array with maxmium 16.\n"); + printArray(numberCount, a, true); + + zeroArray(numberCount, b); + StreamCompaction::CPU::quickSort(numberCount, b, a); + printf("qosrt Result : \n"); + printArray(numberCount, b, true); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + + zeroArray(numberCount, c); + RadixSort::sort(numberCount, maxBits, c, a); + printf("Radix sort Result : \n"); + printArray(numberCount, c, true); + printElapsedTime(RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + + printf("sort result comparison between qsort and radix sort : \n"); + printCmpLenResult(numberCount, numberCount, b, c); + system("pause"); // stop Win32 console from closing on exit } diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt index cdbef77..9ea1176 100644 --- a/stream_compaction/CMakeLists.txt +++ b/stream_compaction/CMakeLists.txt @@ -9,6 +9,8 @@ set(SOURCE_FILES "efficient.cu" "thrust.h" "thrust.cu" + "RadixSort.h" + "RadixSort.cu" ) cuda_add_library(stream_compaction diff --git a/stream_compaction/RadixSort.cu b/stream_compaction/RadixSort.cu new file mode 100644 index 0000000..3fee797 --- /dev/null +++ b/stream_compaction/RadixSort.cu @@ -0,0 +1,111 @@ +#include +#include +#include "common.h" +#include "efficient.h" +#include "RadixSort.h" + + +namespace RadixSort { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; 
+    }
+
+    __global__ void kernGen_b_e_array(int N, int idxBit, int* b_array, int* e_array, const int *dev_data) {
+        int index = threadIdx.x + (blockIdx.x * blockDim.x);
+        if (index >= N) {
+            return;
+        }
+
+        int temp_result = (dev_data[index] >> idxBit) & 1;
+        b_array[index] = temp_result;
+        e_array[index] = 1 - temp_result;
+    }
+
+    template <int BLOCK_SIZE>
+    __global__ void kern_Gen_d_array_and_scatter(int N, const int totalFalses, const int* b_array, const int* f_array, int* dev_data)
+    {
+        // Allocate appropriate shared memory
+        __shared__ int tile[BLOCK_SIZE];
+        int index = threadIdx.x + (blockIdx.x * blockDim.x);
+        if (index >= N) {
+            return;
+        }
+
+        tile[threadIdx.x] = dev_data[index];
+        __syncthreads();
+
+        int t_array_value = index - f_array[index] + totalFalses;
+
+        int d_array_value = b_array[index] ? t_array_value : f_array[index];
+
+        dev_data[d_array_value] = tile[threadIdx.x];
+    }
+
+
+    void sort(int n, int numOfBits, int *odata, const int *idata) {
+
+        int* dev_data;
+        int* b_array;
+        int* e_array;
+        int* f_array;
+
+        int* host_f_array = new int[n];
+
+        dim3 blockDim(blockSize);
+        dim3 gridDim((n + blockSize - 1) / blockSize);
+
+        cudaMalloc((void**)&dev_data, n * sizeof(int));
+        checkCUDAError("cudaMalloc dev_data failed!");
+        cudaMalloc((void**)&b_array, n * sizeof(int));
+        checkCUDAError("cudaMalloc b_array failed!");
+        cudaMalloc((void**)&e_array, n * sizeof(int));
+        checkCUDAError("cudaMalloc e_array failed!");
+        cudaMalloc((void**)&f_array, n * sizeof(int));
+        checkCUDAError("cudaMalloc f_array failed!");
+        cudaDeviceSynchronize();
+
+        cudaMemcpy(dev_data, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+        checkCUDAError("RadixSort cudaMemcpy failed!");
+
+        timer().startGpuTimer();
+
+        for (int k = 0; k <= numOfBits - 1; k++) {
+            kernGen_b_e_array<<<gridDim, blockDim>>>(n, k, b_array, e_array, dev_data);
+
+            cudaMemcpy(host_f_array, e_array, sizeof(int) * n, cudaMemcpyDeviceToHost);
+
+            int totalFalses = host_f_array[n - 1];
+
+            // Get the exclusive scan result over the whole array
+            StreamCompaction::Efficient::scan(n, host_f_array, host_f_array);
+            //StreamCompaction::Efficient::scanDynamicShared(n, host_f_array, host_f_array);
+
+            totalFalses += host_f_array[n - 1];
+
+            cudaMemcpy(f_array, host_f_array, sizeof(int) * n, cudaMemcpyHostToDevice);
+
+            // Since we run the exclusive scan over the whole array at once,
+            // and we don't want each tile to run StreamCompaction::Efficient::scan individually,
+            // the value in d_array is an index into the whole data array, not just within one tile,
+            // so there is NO need to merge per-tile results here.
+            kern_Gen_d_array_and_scatter<blockSize><<<gridDim, blockDim>>>(n, totalFalses, b_array, f_array, dev_data);
+        }
+
+        timer().endGpuTimer();
+
+        cudaMemcpy(odata, dev_data, sizeof(int) * n, cudaMemcpyDeviceToHost);
+        checkCUDAError("cudaMemcpy failed!");
+
+        cudaFree(dev_data);
+        cudaFree(b_array);
+        cudaFree(e_array);
+        cudaFree(f_array);
+
+        delete[] host_f_array;
+    }
+}
\ No newline at end of file
diff --git a/stream_compaction/RadixSort.h b/stream_compaction/RadixSort.h
new file mode 100644
index 0000000..d7664d1
--- /dev/null
+++ b/stream_compaction/RadixSort.h
@@ -0,0 +1,9 @@
+#pragma once
+
+#include "common.h"
+
+namespace RadixSort {
+    StreamCompaction::Common::PerformanceTimer& timer();
+
+    void sort(int n, int numOfBits, int *odata, const int *idata);
+}
\ No newline at end of file
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index 8fc0211..b469ee8 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -24,6 +24,12 @@ namespace StreamCompaction {
          */
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
             // TODO
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+
+            bools[index] = idata[index] ? 1 : 0;
         }
 
         /**
@@ -33,6 +39,14 @@ namespace StreamCompaction {
         __global__ void kernScatter(int n, int *odata,
                 const int *idata, const int *bools, const int *indices) {
             // TODO
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+
+            if (bools[index]) {
+                odata[indices[index]] = idata[index];
+            }
         }
 
     }
}
diff --git a/stream_compaction/common.h b/stream_compaction/common.h
index 55f1b38..b8f71cd 100644
--- a/stream_compaction/common.h
+++ b/stream_compaction/common.h
@@ -1,18 +1,24 @@
 #pragma once
 
-#include
-#include
-
-#include
-#include
-#include
-#include
+#include
+#include
+
+#include
+#include
+#include
+#include
 #include
-#include
+#include
+
+
+#include
+
 
 #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
 #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
 
+#define blockSize 64
+
 /**
  * Check for CUDA errors; print and exit if there was a problem.
 */
@@ -37,96 +43,96 @@ namespace StreamCompaction {
         __global__ void kernScatter(int n, int *odata,
                 const int *idata, const int *bools, const int *indices);
 
-        /**
-        * This class is used for timing the performance
-        * Uncopyable and unmovable
-        *
-        * Adapted from WindyDarian(https://github.com/WindyDarian)
-        */
-        class PerformanceTimer
-        {
-        public:
-            PerformanceTimer()
-            {
-                cudaEventCreate(&event_start);
-                cudaEventCreate(&event_end);
-            }
-
-            ~PerformanceTimer()
-            {
-                cudaEventDestroy(event_start);
-                cudaEventDestroy(event_end);
-            }
-
-            void startCpuTimer()
-            {
-                if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); }
-                cpu_timer_started = true;
-
-                time_start_cpu = std::chrono::high_resolution_clock::now();
-            }
-
-            void endCpuTimer()
-            {
-                time_end_cpu = std::chrono::high_resolution_clock::now();
-
-                if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); }
-
-                std::chrono::duration<double, std::milli> duro = time_end_cpu - time_start_cpu;
-                prev_elapsed_time_cpu_milliseconds =
-                    static_cast<decltype(prev_elapsed_time_cpu_milliseconds)>(duro.count());
-
-                cpu_timer_started = false;
-            }
-
-            void startGpuTimer()
-            {
-                if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); }
-                gpu_timer_started = true;
-
-                cudaEventRecord(event_start);
-            }
-
-            void endGpuTimer()
-            {
-                cudaEventRecord(event_end);
-                cudaEventSynchronize(event_end);
-
-                if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); }
-
-                cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end);
-                gpu_timer_started = false;
-            }
-
-            float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015
-            {
-                return prev_elapsed_time_cpu_milliseconds;
-            }
-
-            float getGpuElapsedTimeForPreviousOperation() //noexcept
-            {
-                return prev_elapsed_time_gpu_milliseconds;
-            }
-
-            // remove copy and move functions
-            PerformanceTimer(const PerformanceTimer&) = delete;
-            PerformanceTimer(PerformanceTimer&&) = delete;
-            PerformanceTimer& operator=(const PerformanceTimer&) = delete;
-            PerformanceTimer& operator=(PerformanceTimer&&) = delete;
-
-        private:
-            cudaEvent_t event_start = nullptr;
-            cudaEvent_t event_end = nullptr;
-
-            using time_point_t = std::chrono::high_resolution_clock::time_point;
-            time_point_t time_start_cpu;
-            time_point_t time_end_cpu;
-
-            bool cpu_timer_started = false;
-            bool gpu_timer_started = false;
-
-            float prev_elapsed_time_cpu_milliseconds = 0.f;
-            float prev_elapsed_time_gpu_milliseconds = 0.f;
+        /**
+        * This class is used for timing the performance
+        * Uncopyable and unmovable
+        *
+        * Adapted from WindyDarian(https://github.com/WindyDarian)
+        */
+        class PerformanceTimer
+        {
+        public:
+            PerformanceTimer()
+            {
+                cudaEventCreate(&event_start);
+                cudaEventCreate(&event_end);
+            }
+
+            ~PerformanceTimer()
+            {
+                cudaEventDestroy(event_start);
+                cudaEventDestroy(event_end);
+            }
+
+            void startCpuTimer()
+            {
+                if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); }
+                cpu_timer_started = true;
+
+                time_start_cpu = std::chrono::high_resolution_clock::now();
+            }
+
+            void endCpuTimer()
+            {
+                time_end_cpu = std::chrono::high_resolution_clock::now();
+
+                if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); }
+
+                std::chrono::duration<double, std::milli> duro = time_end_cpu - time_start_cpu;
+                prev_elapsed_time_cpu_milliseconds =
+                    static_cast<decltype(prev_elapsed_time_cpu_milliseconds)>(duro.count());
+
+                cpu_timer_started = false;
+            }
+
+            void startGpuTimer()
+            {
+                if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); }
+                gpu_timer_started = true;
+
+                cudaEventRecord(event_start);
+            }
+
+            void endGpuTimer()
+            {
+                cudaEventRecord(event_end);
+                cudaEventSynchronize(event_end);
+
+                if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); }
+
+                cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end);
+                gpu_timer_started = false;
+            }
+
+            float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015
+            {
+                return prev_elapsed_time_cpu_milliseconds;
+            }
+
+            float getGpuElapsedTimeForPreviousOperation() //noexcept
+            {
+                return prev_elapsed_time_gpu_milliseconds;
+            }
+
+            // remove copy and move functions
+            PerformanceTimer(const PerformanceTimer&) = delete;
+            PerformanceTimer(PerformanceTimer&&) = delete;
+            PerformanceTimer& operator=(const PerformanceTimer&) = delete;
+            PerformanceTimer& operator=(PerformanceTimer&&) = delete;
+
+        private:
+            cudaEvent_t event_start = nullptr;
+            cudaEvent_t event_end = nullptr;
+
+            using time_point_t = std::chrono::high_resolution_clock::time_point;
+            time_point_t time_start_cpu;
+            time_point_t time_end_cpu;
+
+            bool cpu_timer_started = false;
+            bool gpu_timer_started = false;
+
+            float prev_elapsed_time_cpu_milliseconds = 0.f;
+            float prev_elapsed_time_gpu_milliseconds = 0.f;
         };
     }
}
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 05ce667..f05ecb0 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -1,15 +1,15 @@
 #include
 #include "cpu.h"
-#include "common.h"
+#include "common.h"
 
 namespace StreamCompaction {
     namespace CPU {
-        using StreamCompaction::Common::PerformanceTimer;
-        PerformanceTimer& timer()
-        {
-            static PerformanceTimer timer;
-            return timer;
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
         }
 
         /**
@@ -17,9 +17,21 @@ namespace StreamCompaction {
          * For performance analysis, this is supposed to be a simple for loop.
          * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
          */
+
+        void scanImplementation(int n, int *odata, const int *idata) {
+            // an EXCLUSIVE scan
+            odata[0] = 0;
+            for (int k = 1; k < n; k++) {
+                odata[k] = odata[k - 1] + idata[k - 1];
+            }
+        }
+
+
         void scan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            // TODO
+            scanImplementation(n, odata, idata);
+
             timer().endCpuTimer();
         }
 
@@ -29,10 +41,19 @@ namespace StreamCompaction {
          * @returns the number of elements remaining after compaction.
          */
         int compactWithoutScan(int n, int *odata, const int *idata) {
-            timer().startCpuTimer();
+            int count = 0;
+
+            timer().startCpuTimer();
             // TODO
+
+            for (int k = 0; k < n; k++) {
+                if (idata[k] != 0) {
+                    odata[count++] = idata[k];
+                }
+            }
+
             timer().endCpuTimer();
-            return -1;
+            return count;
         }
 
@@ -41,10 +62,56 @@ namespace StreamCompaction {
          * @returns the number of elements remaining after compaction.
          */
         int compactWithScan(int n, int *odata, const int *idata) {
-            timer().startCpuTimer();
+
+            int* tempArray = new int[n];
+            int* scanResult = new int[n];
+            int count = 0;
+
+            timer().startCpuTimer();
             // TODO
+
+            // Step 1 : compute the temp array
+            for (int k = 0; k < n; k++) {
+                tempArray[k] = idata[k] ? 1 : 0;
+            }
+
+            // Step 2 : run an exclusive scan on the temp array
+            scanImplementation(n, scanResult, tempArray);
+
+            // Step 3 : scatter
+            count = tempArray[n - 1] ? (scanResult[n - 1] + 1) : scanResult[n - 1];
+
+            for (int k = 0; k < n; k++) {
+                if (tempArray[k]) {
+                    odata[scanResult[k]] = idata[k];
+                }
+            }
+
             timer().endCpuTimer();
-            return -1;
+
+            delete[] tempArray;
+            delete[] scanResult;
+
+            return count;
+        }
+
+        int compare(const void * a, const void * b)
+        {
+            return (*(int*)a - *(int*)b);
+        }
+
+        void quickSort(int n, int *odata, const int *idata){
+
+            for (int k = 0; k < n; k++) {
+                odata[k] = idata[k];
+            }
+
+            timer().startCpuTimer();
+
+            qsort(odata, n, sizeof(int), compare);
+
+            timer().endCpuTimer();
         }
     }
}
diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h
index 236ce11..4123241 100644
--- a/stream_compaction/cpu.h
+++ b/stream_compaction/cpu.h
@@ -6,10 +6,14 @@ namespace StreamCompaction {
     namespace CPU {
         StreamCompaction::Common::PerformanceTimer& timer();
 
+        void scanImplementation(int n, int *odata, const int *idata);
+
         void scan(int n, int *odata, const int *idata);
 
         int compactWithoutScan(int n, int *odata, const int *idata);
 
         int compactWithScan(int n, int *odata, const int *idata);
+
+        void quickSort(int n, int *odata, const int *idata);
     }
}
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 36c5ef2..9a25b41 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -3,6 +3,12 @@
 #include "common.h"
 #include "efficient.h"
 
+#define NUM_BANKS 32
+#define LOG_NUM_BANKS 5
+#define CONFLICT_FREE_OFFSET(n) \
+    ((n) >> NUM_BANKS + (n) >> (2 * LOG_NUM_BANKS))
+
+
 namespace StreamCompaction {
    namespace Efficient {
        using StreamCompaction::Common::PerformanceTimer;
@@ -12,13 +18,203 @@ namespace StreamCompaction {
            return timer;
        }
 
+        __global__ void kernSetZero(int N, int* dev_data) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= N) {
+                return;
+            }
+
+            dev_data[index] = 0;
+        }
+
+        __global__ void kernEffcientUpSweep(int N, int offset, int* dev_data) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= N) {
+                return;
+            }
+
+            // ------------ old branch method ----------------
+            /*if ((index + 1) % offset == 0) {
+                dev_data[index] += dev_data[index - offset / 2];
+            }*/
+            // -----------------------------------------------
+
+            int targetIndex = (index + 1) * offset - 1;
+            dev_data[targetIndex] += dev_data[targetIndex - offset / 2];
+        }
+
+        __global__ void kernSetRootZero(int N, int* dev_data) {
+            dev_data[N - 1] = 0;
+        }
+
+        __global__ void kernEfficientDownSweep(int N, int offset, int* dev_data) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= N) {
+                return;
+            }
+
+            // ------------ old branch method ----------------
+            /*if ((index + 1) % offset == 0) {
+                int t = dev_data[index - offset / 2];
+                dev_data[index - offset / 2] = dev_data[index];
+                dev_data[index] += t;
+            }*/
+            // -----------------------------------------------
+
+            int targetIndex = (index + 1) * offset - 1;
+
+            int t = dev_data[targetIndex - offset / 2];
+            dev_data[targetIndex - offset / 2] = dev_data[targetIndex];
+            dev_data[targetIndex] += t;
+        }
+
+        __global__ void kernSetCompactCount(int N, int* dev_count, int* bools, int* indices) {
+            dev_count[0] = bools[N - 1] ? (indices[N - 1] + 1) : indices[N - 1];
+        }
+
+        /// ------------------- EC : Dynamic Shared Memory ----------------------
+        __global__ void kernScanDynamicShared(int n, int *g_odata, int *g_idata, int *OriRoot) {
+            extern __shared__ int temp[];
+
+            //int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            //if (index >= N) {
+            //    return;
+            //}
+
+            int thid = threadIdx.x;
+            // assume it's always a 1D block
+            int blockOffset = 2 * blockDim.x * blockIdx.x;
+            int offset = 1;
+
+            //temp[2 * thid] = g_idata[blockOffset + 2 * thid];
+            //temp[2 * thid + 1] = g_idata[blockOffset + 2 * thid + 1];
+            int ai = thid;
+            int bi = thid + (n / 2);
+            int bankOffsetA = CONFLICT_FREE_OFFSET(ai);
+            int bankOffsetB = CONFLICT_FREE_OFFSET(bi);
+            temp[ai + bankOffsetA] = g_idata[blockOffset + ai];
+            temp[bi + bankOffsetB] = g_idata[blockOffset + bi];
+
+
+
+            // Up-sweep
+            for (int d = n >> 1; d > 0; d >>= 1) {
+                __syncthreads();
+                if (thid < d) {
+                    //int ai = offset * (2 * thid + 1) - 1;
+                    //int bi = offset * (2 * thid + 2) - 1;
+                    int ai = offset * (2 * thid + 1) - 1;
+                    int bi = offset * (2 * thid + 2) - 1;
+                    ai += CONFLICT_FREE_OFFSET(ai);
+                    bi += CONFLICT_FREE_OFFSET(bi);
+
+
+                    temp[bi] += temp[ai];
+                }
+                offset *= 2;
+            }
+
+            __syncthreads();
+            // save the original root and set it to zero
+            if (thid == 0) {
+                OriRoot[blockIdx.x] = temp[n - 1];
+                //temp[n - 1] = 0;
+                temp[n - 1 + CONFLICT_FREE_OFFSET(n - 1)] = 0;
+            }
+
+            for (int d = 1; d < n; d *= 2) {
+                offset >>= 1;
+                __syncthreads();
+                if (thid < d) {
+                    //int ai = offset * (2 * thid + 1) - 1;
+                    //int bi = offset * (2 * thid + 2) - 1;
+                    int ai = offset * (2 * thid + 1) - 1;
+                    int bi = offset * (2 * thid + 2) - 1;
+                    ai += CONFLICT_FREE_OFFSET(ai);
+                    bi += CONFLICT_FREE_OFFSET(bi);
+
+                    int t = temp[ai];
+                    temp[ai] = temp[bi];
+                    temp[bi] += t;
+                }
+            }
+            __syncthreads();
+            //g_odata[blockOffset + 2 * thid] = temp[2 * thid];
+            //g_odata[blockOffset + 2 * thid + 1] = temp[2 * thid + 1];
+            g_odata[blockOffset + ai] = temp[ai + bankOffsetA];
+            g_odata[blockOffset + bi] = temp[bi + bankOffsetB];
+        }
+
+        __global__ void kernAddOriRoot(int N, int* OriRoot, int* dev_odata) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= N) {
+                return;
+            }
+            dev_odata[index] += OriRoot[blockIdx.x];
+        }
+
+        /// ----------------------------------------------------------------------
+
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
+
+            int dMax = ilog2ceil(n);
+            int size = (int)powf(2.0f, (float)dMax);
+
+            int* dev_data;
+
+            cudaMalloc((void**)&dev_data, size * sizeof(int));
+            checkCUDAError("cudaMalloc dev_data failed!");
+            cudaDeviceSynchronize();
+
+            dim3 blockDim(blockSize);
+            dim3 gridDim((size + blockSize - 1) / blockSize);
+
+            kernSetZero<<<gridDim, blockDim>>>(size, dev_data);
+            checkCUDAError("kernSetZero failed!");
+
+            cudaMemcpy(dev_data, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+            checkCUDAError("efficient cudaMemcpy failed!");
+
             timer().startGpuTimer();
-            // TODO
+
+            // Step 1 : Up-sweep
+            for (int d = 0; d <= dMax - 1; d++) {
+                // ------------ old branch method ----------------
+                //kernEffcientUpSweep<<<gridDim, blockDim>>>(size, (int)powf(2.0f, (float)d + 1.0f), dev_data);
+                // -----------------------------------------------
+
+                // only launch threads that actually do work
+                int temp_size = (int)powf(2.0f, (float)(dMax - d - 1));
+                kernEffcientUpSweep<<<dim3((temp_size + blockSize - 1) / blockSize), blockDim>>>(temp_size, (int)powf(2.0f, (float)d + 1.0f), dev_data);
+            }
+            checkCUDAError("kernEffcientUpSweep failed!");
+
+            // Step 2 : Down-sweep
+            kernSetRootZero<<<dim3(1), dim3(1)>>>(size, dev_data);
+            checkCUDAError("kernSetRootZero failed!");
+
+            for (int d = dMax - 1; d >= 0; d--) {
+                // ------------ old branch method ----------------
+                //kernEfficientDownSweep<<<gridDim, blockDim>>>(size, (int)powf(2.0f, (float)d + 1.0f), dev_data);
+                // -----------------------------------------------
+
+                // only launch threads that actually do work
+                int temp_size = (int)powf(2.0f, (float)(dMax - d - 1));
+                kernEfficientDownSweep<<<dim3((temp_size + blockSize - 1) / blockSize), blockDim>>>(temp_size, (int)powf(2.0f, (float)d + 1.0f), dev_data);
+            }
+            checkCUDAError("kernEfficientDownSweep failed!");
+
             timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_data, sizeof(int) * n, cudaMemcpyDeviceToHost);
+            checkCUDAError("efficient cudaMemcpy failed!");
+
+            cudaFree(dev_data);
         }
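+
+        // With the compacted launches above, sweep level d launches
+        // 2^(dMax - d - 1) threads, so each sweep performs
+        // 2^(dMax - 1) + ... + 2 + 1 = 2^dMax - 1 (about n) additions in total:
+        // O(n) work overall, versus the O(n log n) additions of the naive scan.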
 
         /**
@@ -31,10 +227,254 @@ namespace StreamCompaction {
          * @returns The number of elements remaining after compaction.
          */
         int compact(int n, int *odata, const int *idata) {
+            // compact set-up
+            int* dev_idata;
+            int* dev_odata;
+            int* bools;
+            int* indices;
+            int* dev_count;
+            int count;
+
+            dim3 blockDim(blockSize);
+            dim3 gridDim((n + blockSize - 1) / blockSize);
+
+            cudaMalloc((void**)&dev_idata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_idata failed!");
+            cudaMalloc((void**)&dev_odata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_odata failed!");
+            cudaMalloc((void**)&bools, n * sizeof(int));
+            checkCUDAError("cudaMalloc bools failed!");
+            cudaMalloc((void**)&dev_count, sizeof(int));
+            checkCUDAError("cudaMalloc dev_count failed!");
+            cudaDeviceSynchronize();
+
+            cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+            checkCUDAError("efficient compact cudaMemcpy failed!");
+
+
+
+            // scan set-up
+            int dMax = ilog2ceil(n);
+            int size = (int)powf(2.0f, (float)dMax);
+
+            dim3 scan_gridDim((size + blockSize - 1) / blockSize);
+
+            cudaMalloc((void**)&indices, size * sizeof(int));
+            checkCUDAError("cudaMalloc indices failed!");
+            cudaDeviceSynchronize();
+
+            kernSetZero<<<scan_gridDim, blockDim>>>(size, indices);
+            checkCUDAError("kernSetZero failed!");
+
             timer().startGpuTimer();
-            // TODO
+            // Step 1 : compute the bools array
+            StreamCompaction::Common::kernMapToBoolean<<<gridDim, blockDim>>>(n, bools, dev_idata);
+            checkCUDAError("kernMapToBoolean failed!");
+
+            cudaMemcpy(indices, bools, sizeof(int) * n, cudaMemcpyDeviceToDevice);
+            checkCUDAError("cudaMemcpy failed!");
+
+            // Step 2 : exclusive scan of indices
+            // Up-sweep
+            for (int d = 0; d <= dMax - 1; d++) {
+                // ------------ old branch method ----------------
+                //kernEffcientUpSweep<<<scan_gridDim, blockDim>>>(size, (int)powf(2.0f, (float)d + 1.0f), indices);
+                // -----------------------------------------------
+
+                // only launch threads that actually do work
+                int temp_size = (int)powf(2.0f, (float)(dMax - d - 1));
+                kernEffcientUpSweep<<<dim3((temp_size + blockSize - 1) / blockSize), blockDim>>>(temp_size, (int)powf(2.0f, (float)d + 1.0f), indices);
+            }
+            checkCUDAError("kernEffcientUpSweep failed!");
+
+            // Down-sweep
+            kernSetRootZero<<<dim3(1), dim3(1)>>>(size, indices);
+            checkCUDAError("kernSetRootZero failed!");
+
+            for (int d = dMax - 1; d >= 0; d--) {
+                // ------------ old branch method ----------------
+                //kernEfficientDownSweep<<<scan_gridDim, blockDim>>>(size, (int)powf(2.0f, (float)d + 1.0f), indices);
+                // -----------------------------------------------
+
+                // only launch threads that actually do work
+                int temp_size = (int)powf(2.0f, (float)(dMax - d - 1));
+                kernEfficientDownSweep<<<dim3((temp_size + blockSize - 1) / blockSize), blockDim>>>(temp_size, (int)powf(2.0f, (float)d + 1.0f), indices);
+            }
+            checkCUDAError("kernEfficientDownSweep failed!");
+
+            // Step 3 : Scatter
+            StreamCompaction::Common::kernScatter<<<gridDim, blockDim>>>(n, dev_odata, dev_idata, bools, indices);
+            checkCUDAError("kernScatter failed!");
+
+            kernSetCompactCount<<<dim3(1), dim3(1)>>>(n, dev_count, bools, indices);
+            checkCUDAError("kernSetCompactCount failed!");
+
             timer().endGpuTimer();
-            return -1;
+
+
+            cudaMemcpy(&count, dev_count, sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy failed!");
+
+            cudaMemcpy(odata, dev_odata, sizeof(int) * count, cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy failed!");
+
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
+            cudaFree(bools);
+            cudaFree(dev_count);
+            cudaFree(indices);
+
+            return count;
+        }
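+
+        // The shared-memory scan below works in two passes: every block scans one
+        // 64-element tile and writes that tile's total into ori_root, then a single
+        // block scans ori_root, and kernAddOriRoot adds the scanned totals back to
+        // every tile. A hypothetical fully general version (not implemented in this
+        // diff, as the comments below note) would keep scanning the block sums until
+        // a single block suffices, roughly:
+        //
+        //     int len = numTiles;
+        //     while (len > 1) {                 // scan of scans, level by level
+        //         launch kernScanDynamicShared on the len partial sums;
+        //         len = (len + tileSize - 1) / tileSize;
+        //     }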
+
+
+        void scanDynamicShared(int n, int *odata, const int *idata) {
+            int dMax = ilog2ceil(n);
+            int size = (int)powf(2.0f, (float)dMax);
+
+            int* dev_data;
+            int* ori_root;
+
+            int dynamicMemoBlockSize = 64;
+
+            dim3 blockDim(dynamicMemoBlockSize);
+            dim3 gridDim((size + dynamicMemoBlockSize - 1) / dynamicMemoBlockSize);
+
+
+            cudaMalloc((void**)&dev_data, sizeof(int) * size);
+            checkCUDAError("cudaMalloc dev_idata failed!");
+            cudaMalloc((void**)&ori_root, sizeof(int) * gridDim.x);
+            checkCUDAError("cudaMalloc ori_root failed!");
+            cudaDeviceSynchronize();
+
+            kernSetZero<<<gridDim, blockDim>>>(size, dev_data);
+            checkCUDAError("kernSetZero failed!");
+            kernSetZero<<<dim3((gridDim.x + dynamicMemoBlockSize - 1) / dynamicMemoBlockSize), blockDim>>>(gridDim.x, ori_root);
+            checkCUDAError("kernSetZero failed!");
+
+            cudaMemcpy(dev_data, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+            checkCUDAError("naive cudaMemcpy failed!");
+
+            int sharedMemoryPerBlockInBytes = dynamicMemoBlockSize * sizeof(int); // Compute this
+
+            timer().startGpuTimer();
+
+            //kernScanDynamicShared<<<gridDim, blockDim, sharedMemoryPerBlockInBytes>>>(size, dynamicMemoBlockSize, dev_data, dev_data, ori_root);
+            kernScanDynamicShared<<<gridDim, dim3(dynamicMemoBlockSize / 2), sharedMemoryPerBlockInBytes>>>(dynamicMemoBlockSize, dev_data, dev_data, ori_root);
+
+
+            sharedMemoryPerBlockInBytes = gridDim.x * sizeof(int);
+            // We only do a scan of the per-block sums ONE time here.
+            // Actually, it should be a while loop:
+            // this process should repeat until the number of roots we get is < blockDim.x
+            //kernScanDynamicShared<<<dim3(1), dim3(gridDim.x), sharedMemoryPerBlockInBytes>>>(gridDim.x, gridDim.x, ori_root, ori_root, ori_root);
+            kernScanDynamicShared<<<dim3(1), dim3(gridDim.x / 2), sharedMemoryPerBlockInBytes>>>(gridDim.x, ori_root, ori_root, ori_root);
+
+            kernAddOriRoot<<<gridDim, blockDim>>>(size, ori_root, dev_data);
+
+            timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_data, sizeof(int) * n, cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy failed!");
+
+            cudaFree(dev_data);
+            cudaFree(ori_root);
+        }
+
+
+        int compactDynamicShared(int n, int *odata, const int *idata) {
+            // compact set-up
+            int* dev_idata;
+            int* dev_odata;
+            int* bools;
+            int* indices;
+            int* dev_count;
+            int count;
+
+            dim3 blockDim(blockSize);
+            dim3 gridDim((n + blockSize - 1) / blockSize);
+
+            cudaMalloc((void**)&dev_idata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_idata failed!");
+            cudaMalloc((void**)&dev_odata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_odata failed!");
+            cudaMalloc((void**)&bools, n * sizeof(int));
+            checkCUDAError("cudaMalloc bools failed!");
+            cudaMalloc((void**)&dev_count, sizeof(int));
+            checkCUDAError("cudaMalloc dev_count failed!");
+            cudaDeviceSynchronize();
+
+            cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+            checkCUDAError("efficient compact cudaMemcpy failed!");
+
+
+            // scan set-up
+            int dMax = ilog2ceil(n);
+            int size = (int)powf(2.0f, (float)dMax);
+
+            int* ori_root;
+
+            dim3 scan_gridDim((size + blockSize - 1) / blockSize);
+
+            cudaMalloc((void**)&indices, size * sizeof(int));
+            checkCUDAError("cudaMalloc indices failed!");
+            cudaMalloc((void**)&ori_root, sizeof(int) * gridDim.x);
+            checkCUDAError("cudaMalloc ori_root failed!");
+            cudaDeviceSynchronize();
+
+            kernSetZero<<<scan_gridDim, blockDim>>>(size, indices);
+            checkCUDAError("kernSetZero failed!");
+            kernSetZero<<<dim3((scan_gridDim.x + gridDim.x - 1) / gridDim.x), blockDim>>>(gridDim.x, ori_root);
+            checkCUDAError("kernSetZero failed!");
+
+            int sharedMemoryPerBlockInBytes = blockDim.x * sizeof(int); // Compute this
+
+            timer().startGpuTimer();
+
+            // Step 1 : compute the bools array
+            StreamCompaction::Common::kernMapToBoolean<<<gridDim, blockDim>>>(n, bools, dev_idata);
+            checkCUDAError("kernMapToBoolean failed!");
+
+            cudaMemcpy(indices, bools, sizeof(int) * n, cudaMemcpyDeviceToDevice);
+            checkCUDAError("cudaMemcpy failed!");
+
+            // Step 2 : exclusive scan of indices
+            //kernScanDynamicShared<<<scan_gridDim, blockDim, sharedMemoryPerBlockInBytes>>>(size, blockDim.x, indices, indices, ori_root);
+            kernScanDynamicShared<<<scan_gridDim, dim3(blockDim.x / 2), sharedMemoryPerBlockInBytes>>>(blockDim.x, indices, indices, ori_root);
+
+            sharedMemoryPerBlockInBytes = gridDim.x * sizeof(int);
+            // We only do a scan of the per-block sums ONE time here.
+            // Actually, it should be a while loop:
+            // this process should repeat until the number of roots we get is < blockDim.x
+            //kernScanDynamicShared<<<dim3(1), dim3(gridDim.x), sharedMemoryPerBlockInBytes>>>(scan_gridDim.x, scan_gridDim.x, ori_root, ori_root, ori_root);
+            kernScanDynamicShared<<<dim3(1), dim3(gridDim.x / 2), sharedMemoryPerBlockInBytes>>>(scan_gridDim.x, ori_root, ori_root, ori_root);
+            kernAddOriRoot<<<scan_gridDim, blockDim>>>(size, ori_root, indices);
+
+
+            // Step 3 : Scatter
+            StreamCompaction::Common::kernScatter<<<gridDim, blockDim>>>(n, dev_odata, dev_idata, bools, indices);
+            checkCUDAError("kernScatter failed!");
+
+            kernSetCompactCount<<<dim3(1), dim3(1)>>>(n, dev_count, bools, indices);
+            checkCUDAError("kernSetCompactCount failed!");
+
+            timer().endGpuTimer();
+
+
+            cudaMemcpy(&count, dev_count, sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy failed!");
+            cudaMemcpy(odata, dev_odata, sizeof(int) * count, cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy failed!");
+
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
+            cudaFree(bools);
+            cudaFree(dev_count);
+            cudaFree(indices);
+            cudaFree(ori_root);
+
+            return count;
+        }
    }
}
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4f..42f8b95 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -8,6 +8,10 @@ namespace StreamCompaction {
 
         void scan(int n, int *odata, const int *idata);
 
+        void scanDynamicShared(int n, int *odata, const int *idata);
+
         int compact(int n, int *odata, const int *idata);
+
+        int compactDynamicShared(int n, int *odata, const int *idata);
     }
}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 9218f8e..117799e 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -3,6 +3,8 @@
 #include "common.h"
 #include "naive.h"
 
+#define identity 0
+
 namespace StreamCompaction {
     namespace Naive {
         using StreamCompaction::Common::PerformanceTimer;
@@ -11,15 +13,198 @@ namespace StreamCompaction {
             static PerformanceTimer timer;
             return timer;
         }
-        // TODO: __global__
 
+        __global__ void kernGenExclusiveScanFromInclusiveScan(int N, int* dev_odata, int* dev_idata) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= N) {
+                return;
+            }
+
+            // shift right
+            if (index == 0) {
+                dev_odata[index] = identity;
+            }
+            else {
+                dev_odata[index] = dev_idata[index - 1];
+            }
+        }
+
+        __global__ void kernNaiveParallelScan(int N, int d, int* dev_odata, int* dev_idata) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= N) {
+                return;
+            }
+
+            int offset = (int)(powf(2.0f, (float)d - 1.0f));
+
+            if (index >= offset) {
+                dev_odata[index] = dev_idata[index - offset] + dev_idata[index];
+            }
+            else {
+                dev_odata[index] = dev_idata[index];
+            }
+        }
+
+        /// ------------------- EC : Dynamic Shared Memory ----------------------
+        __global__ void kernScanDynamicShared(int N, int n, int *g_odata, int *g_idata, int *OriRoot) {
+            extern __shared__ int temp[];
+
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= N) {
+                return;
+            }
+
+            int thid = threadIdx.x;
+            int pout = 0, pin = 1;
+            // assume it's always a 1D block
+            int blockOffset = blockDim.x * blockIdx.x;
+
+            temp[pout * n + thid] = (thid > 0) ? g_idata[blockOffset + thid - 1] : 0;
+
+            __syncthreads();
+
+            for (int offset = 1; offset < n; offset *= 2) {
+                pout = 1 - pout;
+                pin = 1 - pin;
+                if (thid >= offset) {
+                    temp[pout*n + thid] = temp[pin*n + thid] + temp[pin*n + thid - offset];
+                }
+                else {
+                    temp[pout*n + thid] = temp[pin*n + thid];
+                }
+                __syncthreads();
+            }
+
+            /*if (thid == blockDim.x - 1) {
+                OriRoot[blockIdx.x] = g_idata[blockOffset + thid] + temp[pout*n + thid];
+            }*/
+            if (thid == 0) {
+                OriRoot[blockIdx.x] = g_idata[blockOffset + n - 1] + temp[pout*n + n - 1];
+            }
+
+            g_odata[blockOffset + thid] = temp[pout*n + thid];
+        }
+
+        __global__ void kernAddOriRoot(int N, int* OriRoot, int* dev_odata) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= N) {
+                return;
+            }
+            dev_odata[index] += OriRoot[blockIdx.x];
+        }
+
+        __global__ void kernSetZero(int N, int* dev_data) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= N) {
+                return;
+            }
+
+            dev_data[index] = 0;
+        }
+
+        /// ----------------------------------------------------------------------
 
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
+
+            int* dev_idata;
+            int* dev_odata;
+            int* temp;
+
+            cudaMalloc((void**)&dev_idata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_idata failed!");
+            cudaMalloc((void**)&dev_odata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_odata failed!");
+            cudaDeviceSynchronize();
+
+            cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+            cudaMemcpy(dev_odata, odata, sizeof(int) * n, cudaMemcpyHostToDevice);
+            checkCUDAError("naive cudaMemcpy failed!");
+
+            dim3 blockDim(blockSize);
+            dim3 gridDim((n + blockSize - 1) / blockSize);
+            int dMax = ilog2ceil(n);
+
+            timer().startGpuTimer();
+
+            for (int d = 1; d <= dMax; d++) {
+                // call cuda here
+                // PAY ATTENTION : this is an inclusive scan
+                kernNaiveParallelScan<<<gridDim, blockDim>>>(n, d, dev_odata, dev_idata);
+                checkCUDAError("kernNaiveParallelScan failed!");
+
+                // swap the input & output buffers
+                temp = dev_idata;
+                dev_idata = dev_odata;
+                dev_odata = temp;
+            }
+            // generate the exclusive result from the inclusive one
+            kernGenExclusiveScanFromInclusiveScan<<<gridDim, blockDim>>>(n, dev_odata, dev_idata);
+            checkCUDAError("kernGenExclusiveScanFromInclusiveScan failed!");
+
             timer().endGpuTimer();
+
+            // copy from dev_odata to host odata
+            cudaMemcpy(odata, dev_odata, sizeof(int) * n, cudaMemcpyDeviceToHost);
+            checkCUDAError("naive cudaMemcpy failed!");
+
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
         }
+
+        void scanDynamicShared(int n, int *odata, const int *idata) {
+
+            int dMax = ilog2ceil(n);
+            int size = (int)powf(2.0f, (float)dMax);
+
+            int* dev_idata;
+            int* dev_odata;
+            int* ori_root;
+
+            int dynamicMemoBlockSize = 64;
+
+            dim3 blockDim(dynamicMemoBlockSize);
+            dim3 gridDim((size + dynamicMemoBlockSize - 1) / dynamicMemoBlockSize);
+
+
+            cudaMalloc((void**)&dev_idata, sizeof(int) * size);
+            checkCUDAError("cudaMalloc dev_idata failed!");
+            cudaMalloc((void**)&dev_odata, sizeof(int) * size);
+            checkCUDAError("cudaMalloc dev_odata failed!");
+            cudaMalloc((void**)&ori_root, sizeof(int) * gridDim.x);
+            checkCUDAError("cudaMalloc ori_root failed!");
+            cudaDeviceSynchronize();
+
+            kernSetZero<<<gridDim, blockDim>>>(size, dev_idata);
+            checkCUDAError("kernSetZero failed!");
+            kernSetZero<<<dim3((gridDim.x + dynamicMemoBlockSize - 1) / dynamicMemoBlockSize), blockDim>>>(gridDim.x, ori_root);
+ checkCUDAError("kernSetZero failed!"); + + cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + checkCUDAError("naive cudaMemcpy failed!"); + + int sharedMemoryPerBlockInBytes = 2 * dynamicMemoBlockSize * sizeof(int); // Compute This + + timer().startGpuTimer(); + + kernScanDynamicShared << > > (size, dynamicMemoBlockSize, dev_odata, dev_idata, ori_root); + + // We only do scan of scan ONE time here + // Actually, it should be a while loop here + // This process should happen until root number we get < blockDim.x + sharedMemoryPerBlockInBytes = 2 * gridDim.x * sizeof(int); + kernScanDynamicShared << < dim3(1), gridDim, sharedMemoryPerBlockInBytes >> > (gridDim.x, gridDim.x, ori_root, ori_root, ori_root); + kernAddOriRoot << > > (size, ori_root, dev_odata); + + timer().endGpuTimer(); + + cudaMemcpy(odata, dev_odata, sizeof(int) * n, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy failed!"); + + cudaFree(dev_idata); + cudaFree(dev_odata); + cudaFree(ori_root); + } } } diff --git a/stream_compaction/naive.h b/stream_compaction/naive.h index 37dcb06..4f6d839 100644 --- a/stream_compaction/naive.h +++ b/stream_compaction/naive.h @@ -7,5 +7,7 @@ namespace StreamCompaction { StreamCompaction::Common::PerformanceTimer& timer(); void scan(int n, int *odata, const int *idata); + + void scanDynamicShared(int n, int *odata, const int *idata); } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 36b732d..270a525 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -14,15 +14,35 @@ namespace StreamCompaction { static PerformanceTimer timer; return timer; } + + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + + + thrust::host_vector host_idata(idata, idata + n); + //thrust::device_vector dev_idata(n); + thrust::device_vector dev_idata(host_idata); + thrust::device_vector dev_odata(n); + + //thrust::copy(host_idata.begin(), host_idata.end(), dev_idata.begin());//copy to device + + cudaDeviceSynchronize(); + timer().startGpuTimer(); // TODO use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + + thrust::exclusive_scan(dev_idata.begin(), dev_idata.end(), dev_odata.begin()); + timer().endGpuTimer(); - } + + cudaMemcpy(odata, thrust::raw_pointer_cast(&dev_odata[0]), sizeof(int) * n, cudaMemcpyDeviceToHost); + + + } } }